"Data Mining Engineer in Practice": Telecom Operator Customer Churn Early Warning


Customer churn early warning in practice: applying data analysis algorithms
4. Data analysis and preparation: meetings and discussion

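State: state / region

Account Length: how long the account has been active

Area Code: area code

Phone: phone number

Int'l Plan: whether the customer has an international plan

VMail Plan: whether the customer has a voicemail plan

VMail Message: number of voicemail messages

Day Mins: daytime call minutes

Day Calls: number of daytime calls

Day Charge: daytime charges

Eve Mins: evening call minutes

Eve Calls: number of evening calls

Eve Charge: evening charges

Night Mins: night-time call minutes

Night Calls: number of night-time calls

Night Charge: night-time charges

Intl Mins: international call minutes

Intl Calls: number of international calls

Intl Charge: international charges

CustServ Calls: number of calls made to customer service

Churn?: whether the customer churned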

4.1 Data cleaning and format conversion

import warnings
warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output

Step 1. Import the CSV with pandas. A first look at the data shows that the dataset has 3333 rows and 21 columns, and that the last column is the class label.

import pandas as pd
import numpy as np
# Load the data
churn_df = pd.read_csv('churn.csv')
col_names = churn_df.columns.tolist()  # column names as a plain list

print("Column names:")
print(col_names)

Column names:
['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan", 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?']

Step 2. Basic information and types

to_show = col_names[:6] + col_names[-6:]  # first 6 and last 6 columns

print("\nSample data:")
churn_df[to_show].head(6)

Sample data:

   State  Account Length  Area Code     Phone  Int'l Plan  VMail Plan  Night Charge  Intl Mins  Intl Calls  Intl Charge  CustServ Calls  Churn?
0     KS             128        415  382-4657          no         yes         11.01       10.0           3         2.70               1  False.
1     OH             107        415  371-7191          no         yes         11.45       13.7           3         3.70               1  False.
2     NJ             137        415  358-1921          no          no          7.32       12.2           5         3.29               0  False.
3     OH              84        408  375-9999         yes          no          8.86        6.6           7         1.78               2  False.
4     OK              75        415  330-6626         yes          no          8.41       10.1           3         2.73               3  False.
5     AL             118        510  391-8027         yes          no          9.18        6.3           6         1.70               0  False.

churn_df.info()  # dtype and non-null count for every column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
State             3333 non-null object
Account Length    3333 non-null int64
Area Code         3333 non-null int64
Phone             3333 non-null object
Int'l Plan        3333 non-null object
VMail Plan        3333 non-null object
VMail Message     3333 non-null int64
Day Mins          3333 non-null float64
Day Calls         3333 non-null int64
Day Charge        3333 non-null float64
Eve Mins          3333 non-null float64
Eve Calls         3333 non-null int64
Eve Charge        3333 non-null float64
Night Mins        3333 non-null float64
Night Calls       3333 non-null int64
Night Charge      3333 non-null float64
Intl Mins         3333 non-null float64
Intl Calls        3333 non-null int64
Intl Charge       3333 non-null float64
CustServ Calls    3333 non-null int64
Churn?            3333 non-null object
dtypes: float64(8), int64(8), object(5)
memory usage: 546.9+ KB

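churn_df.describe()
# describe() returns summary statistics for the numeric columns only.

# It reports count, mean, std, min, the 25%/50%/75% percentiles and max; NA values are excluded.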

       Account Length    Area Code  VMail Message     Day Mins    Day Calls   Day Charge     Eve Mins    Eve Calls
count     3333.000000  3333.000000    3333.000000  3333.000000  3333.000000  3333.000000  3333.000000  3333.000000
mean       101.064806   437.182418       8.099010   179.775098   100.435644    30.562307   200.980348   100.114311
std         39.822106    42.371290      13.688365    54.467389    20.069084     9.259435    50.713844    19.922625
min          1.000000   408.000000       0.000000     0.000000     0.000000     0.000000     0.000000     0.000000
25%         74.000000   408.000000       0.000000   143.700000    87.000000    24.430000   166.600000    87.000000
50%        101.000000   415.000000       0.000000   179.400000   101.000000    30.500000   201.400000   100.000000
75%        127.000000   510.000000      20.000000   216.400000   114.000000    36.790000   235.300000   114.000000
max        243.000000   510.000000      51.000000   350.800000   165.000000    59.640000   363.700000   170.000000

        Eve Charge   Night Mins  Night Calls  Night Charge    Intl Mins   Intl Calls  Intl Charge  CustServ Calls
count  3333.000000  3333.000000  3333.000000   3333.000000  3333.000000  3333.000000  3333.000000     3333.000000
mean     17.083540   200.872037   100.107711      9.039325    10.237294     4.479448     2.764581        1.562856
std       4.310668    50.573847    19.568609      2.275873     2.791840     2.461214     0.753773        1.315491
min       0.000000    23.200000    33.000000      1.040000     0.000000     0.000000     0.000000        0.000000
25%      14.160000   167.000000    87.000000      7.520000     8.500000     3.000000     2.300000        1.000000
50%      17.120000   201.200000   100.000000      9.050000    10.300000     4.000000     2.780000        1.000000
75%      20.000000   235.300000   113.000000     10.590000    12.100000     6.000000     3.270000        2.000000
max      30.910000   395.000000   175.000000     17.770000    20.000000    20.000000     5.400000        9.000000
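
One thing the summary statistics hint at is that each Charge column may be a fixed multiple of the matching Mins column. The sketch below checks this for the daytime columns using only numbers copied from the describe() output above; pairing the statistics row by row is an assumption that holds if the charge is a monotonic function of the minutes.

```python
# (Day Mins, Day Charge) pairs copied from the describe() output above.
day_stats = {
    "mean": (179.775098, 30.562307),
    "25%": (143.7, 24.43),
    "50%": (179.4, 30.50),
    "75%": (216.4, 36.79),
    "max": (350.8, 59.64),
}

# Charge per minute implied by each pair of statistics.
ratios = {name: charge / mins for name, (mins, charge) in day_stats.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

Every ratio comes out at about 0.17, so Day Charge appears to carry the same information as Day Mins. The same check can be repeated for the evening, night and international columns.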

4.2 Exploratory data analysis

Step 1. Information about individual features

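# Look at the features one at a time, starting with the class label and customer service calls
import matplotlib.pyplot as plt  # plotting
%matplotlib inline

fig = plt.figure()
fig.set(alpha=0.3)  # set the figure's alpha (transparency)
plt.subplot2grid((1,2),(0,0))  # 1x2 grid of panels; this one sits at row 0, column 0

# Bar chart of the class label
churn_df['Churn?'].value_counts().plot(kind='bar')  # count each label value and draw the counts as bars
plt.title("stat for churn")  # panel title
plt.ylabel("number")  # of the 3333 customers, 2850 were retained and 483 churned

plt.subplot2grid((1,2),(0,1))  # second panel: row 0, column 1
churn_df['CustServ Calls'].value_counts().plot(kind='bar')  # customers per number of customer service calls
plt.title("stat for cusServCalls")  # panel title
plt.ylabel("number")  # the bars sum to all 3333 customers

plt.show()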

[Figure output_13_0.png: bar charts "stat for churn" and "stat for cusServCalls"]

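import matplotlib.pyplot as plt

%matplotlib inline
fig = plt.figure()
fig.set(alpha=0.2)  # set the figure's alpha (transparency)

plt.subplot2grid((1,3),(0,0))  # 1x3 grid of panels; first panel
churn_df['Day Mins'].plot(kind='kde')  # kernel density estimate of daytime minutes
plt.xlabel("Mins")  # x axis: minutes
plt.ylabel("density")  # y axis: density
plt.title("dis for day mins")  # panel title

plt.subplot2grid((1,3),(0,1))  # second panel
churn_df['Day Calls'].plot(kind='kde')  # density of the number of daytime calls
plt.xlabel("call")  # x axis: number of calls
plt.ylabel("density")  # y axis: density
plt.title("dis for day calls")  # panel title

plt.subplot2grid((1,3),(0,2))  # third panel
churn_df['Day Charge'].plot(kind='kde')  # density of daytime charges
plt.xlabel("Charge")  # x axis: charge
plt.ylabel("density")  # y axis: density
plt.title("dis for day charge")

plt.show()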

[Figure output_14_0.png: density plots "dis for day mins", "dis for day calls" and "dis for day charge"]

Step 2. Relationship between features and the class label

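#import matplotlib.pyplot as plt

fig = plt.figure()
fig.set(alpha=0.2)  # set the figure's alpha (transparency)

# Churn counts split by whether the customer has an international plan
int_yes = churn_df['Churn?'][churn_df['Int\'l Plan'] == 'yes'].value_counts()  # yes: customers with the international plan
int_no = churn_df['Churn?'][churn_df['Int\'l Plan'] == 'no'].value_counts()  # no: customers without it

# Combine the two counts into one DataFrame so they can be stacked
df_int=pd.DataFrame({'int plan':int_yes, 'no int plan':int_no})

df_int.plot(kind='bar', stacked=True)
plt.title("statistic between int plan and churn")
plt.xlabel("int or not")
plt.ylabel("number")

plt.show()

# Approximate readings off the chart, out of 3333 customers. False. (retained): about 2700,
# of whom roughly 100 have the international plan and roughly 2600 do not.
# True. (churned): about 400, of whom roughly 100 have the plan and roughly 300 do not.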

[Figure output_16_1.png: stacked bar chart "statistic between int plan and churn"]

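# Churn counts split by the number of customer service calls
fig = plt.figure()
fig.set(alpha=0.2)  # set the figure's alpha (transparency)

cus_0 = churn_df['CustServ Calls'][churn_df['Churn?'] == 'False.'].value_counts()  # call counts of retained customers
cus_1 = churn_df['CustServ Calls'][churn_df['Churn?'] == 'True.'].value_counts()  # call counts of churned customers

df=pd.DataFrame({'churn':cus_1, 'retain':cus_0})
df.plot(kind='bar', stacked=True)
plt.title("Static between customer service call and churn")
plt.xlabel("Call service")  # x axis: number of customer service calls
plt.ylabel("Num")  # y axis: number of customers

plt.show()

# Approximate readings off the chart. 3 calls: about 400 customers, roughly 300 retained and 100 churned.
# 4 calls: about 180 customers, roughly 80 retained and 100 churned.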

[Figure output_17_1.png: stacked bar chart "Static between customer service call and churn"]
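
Stacked bars show counts, which makes groups of very different sizes hard to compare. A row-normalised crosstab gives the churn rate per group directly. The sketch below uses a small made-up frame (the rows are hypothetical; only the column names and label spelling follow churn.csv).

```python
import pandas as pd

# Hypothetical toy data: column names mirror churn.csv, the rows are invented.
toy = pd.DataFrame({
    "CustServ Calls": [0, 0, 1, 1, 1, 4, 4, 5],
    "Churn?": ["False.", "False.", "False.", "True.", "False.", "True.", "False.", "True."],
})

# normalize="index" makes every row sum to 1, so the 'True.' column is the churn rate.
rate = pd.crosstab(toy["CustServ Calls"], toy["Churn?"], normalize="index")
print(rate)
```

On the real data the same call would be `pd.crosstab(churn_df['CustServ Calls'], churn_df['Churn?'], normalize='index')`.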

4.3 Feature selection

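# Separate the class label from the features
ds_result = churn_df['Churn?']

# np.where(condition, x, y) returns x where the condition holds and y elsewhere
# (Shift+Tab in Jupyter shows the signature)

# Churned ('True.') becomes 1, retained ('False.') becomes 0
Y = np.where(ds_result == 'True.',1,0)
# One-hot encode the international plan column
dummies_int = pd.get_dummies(churn_df['Int\'l Plan'], prefix='_int\'l Plan')  # prefix: prepended to the new column names
# VMail Plan: same treatment, with the prefix 'VMail'
dummies_voice = pd.get_dummies(churn_df['VMail Plan'], prefix='VMail')

# concat: append the two sets of dummy columns to the original frame, column-wise
ds_tmp=pd.concat([churn_df, dummies_int, dummies_voice], axis=1)

# Drop state, area code, phone number, the label, and the two original plan columns
to_drop = ['State','Area Code','Phone','Churn?', 'Int\'l Plan', 'VMail Plan']
df = ds_tmp.drop(to_drop,axis=1)

print("after convert ")
df.head(5)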

after convert

   Account Length  VMail Message  Day Mins  Day Calls  Day Charge  Eve Mins  Eve Calls  Eve Charge  Night Mins  Night Calls
0             128             25     265.1        110       45.07     197.4         99       16.78       244.7           91
1             107             26     161.6        123       27.47     195.5        103       16.62       254.4          103
2             137              0     243.4        114       41.38     121.2        110       10.30       162.6          104
3              84              0     299.4         71       50.90      61.9         88        5.26       196.9           89
4              75              0     166.7        113       28.34     148.3        122       12.61       186.9          121

   Night Charge  Intl Mins  Intl Calls  Intl Charge  CustServ Calls  _int'l Plan_no  _int'l Plan_yes  VMail_no  VMail_yes
0         11.01       10.0           3         2.70               1               1                0         0          1
1         11.45       13.7           3         3.70               1               1                0         0          1
2          7.32       12.2           5         3.29               0               1                0         1          0
3          8.86        6.6           7         1.78               2               0                1         1          0
4          8.41       10.1           3         2.73               3               0                1         1          0
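
The two conversions used above can be tried on a tiny frame first. This is a minimal sketch with made-up rows; `toy`, `y_toy` and the `intl` prefix are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame using the same label spelling as churn.csv.
toy = pd.DataFrame({
    "Int'l Plan": ["no", "yes", "no"],
    "Churn?": ["False.", "True.", "False."],
})

# Note the trailing period: comparing against 'True' (no period) would match nothing.
y_toy = np.where(toy["Churn?"] == "True.", 1, 0)

# One column per category; dtype=int keeps 0/1 values instead of booleans.
dummies = pd.get_dummies(toy["Int'l Plan"], prefix="intl", dtype=int)

print(y_toy)                     # [0 1 0]
print(dummies.columns.tolist())  # ['intl_no', 'intl_yes']
```

The two dummy columns are exact complements of each other, so one of them carries no extra information. The alternative preparation in section 4.4 keeps a single column per plan for that reason.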

4.4 Feature engineering

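The features need to be scaled, because some attributes have much larger ranges than others.

For logistic regression and gradient descent, large differences in scale between attributes slow convergence considerably.

Here we scale every feature, although in practice it can be enough to scale only the few features whose ranges stand out.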

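# Standardize the features with StandardScaler
# Convert the DataFrame to a float NumPy array first; to_numpy() replaces as_matrix(), which newer pandas removed
X = df.to_numpy().astype(float)


from sklearn.preprocessing import StandardScaler  # standardization

scaler = StandardScaler()  # rescales each column to zero mean and unit variance

X = scaler.fit_transform(X)

print("Feature space holds %d observations and %d features" % X.shape)  # 3333 rows x 19 columns
print("---------------------------------")
print("Unique target labels:", np.unique(Y))  # the distinct label values
print("---------------------------------")
print(len(Y[Y==0]))  # retained customers: 2850
print("---------------------------------")
print(len(Y[Y==1]))  # churned customers: 483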

Feature space holds 3333 observations and 19 features
---------------------------------
Unique target labels: [0 1]
---------------------------------
2850
---------------------------------
483
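
To see what the scaling step does, here is a small sketch on made-up numbers (the array `X_toy` is hypothetical). StandardScaler subtracts each column's mean and divides by its standard deviation, so columns with very different ranges end up comparable.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: one column in the hundreds, one in single digits.
X_toy = np.array([[265.1, 1.0],
                  [161.6, 1.0],
                  [243.4, 0.0],
                  [299.4, 2.0]])

X_scaled = StandardScaler().fit_transform(X_toy)

# After scaling, every column has mean 0 and (population) standard deviation 1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```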

# Alternative preparation: keep each yes/no plan as a single boolean column
churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.',1,0)
to_drop = ['State','Area Code','Phone','Churn?']
churn_feat_space = churn_df.drop(to_drop,axis=1)
yes_no_cols = ["Int'l Plan","VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'  # 'yes' -> True, 'no' -> False
features = churn_feat_space.columns
# to_numpy() replaces as_matrix(), and the built-in float replaces np.float; both old names were removed
X = churn_feat_space.to_numpy().astype(float)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

print("Feature space holds %d observations and %d features" % X.shape)
print("---------------------------------")
print("Unique target labels:", np.unique(y))
print("---------------------------------")
print(X[0])  # first row after scaling
print("---------------------------------")
print(len(y[y == 0]))

Feature space holds 3333 observations and 17 features
---------------------------------
Unique target labels: [0 1]
---------------------------------
[ 0.67648946 -0.32758048  1.6170861   1.23488274  1.56676695  0.47664315
  1.56703625 -0.07060962 -0.05594035 -0.07042665  0.86674322 -0.46549436
  0.86602851 -0.08500823 -0.60119509 -0.0856905  -0.42793202]
---------------------------------
2850

4.5 Build several baseline models and try several algorithms

# Cross-validation helper: collect an out-of-fold prediction for every row
from sklearn.model_selection import KFold

def run_cv(X,y,clf_class,**kwargs):
    # Construct a kfolds object
    kf = KFold(5,shuffle=True)  # 5 folds, shuffled
    y_pred = y.copy()  # same shape as y; copy() so the true labels are not overwritten

    # Loop over the 5 folds: train on four of them, predict the held-out one
    for train_index, test_index in kf.split(X):

        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with key word arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred

# Models to compare
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.neighbors import KNeighborsClassifier as KNN

def accuracy(y_true,y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)  # a match counts as 1, a mismatch as 0; the mean is (1+0+1+0+...)/3333

print("Support vector machines:")
print("%.3f" % accuracy(y, run_cv(X,y,SVC)))
print("----------------------------")
print("LogisticRegression :")
print("%.3f" % accuracy(y, run_cv(X,y,LR)))
print("----------------------------")
print("K-nearest-neighbors:")
print("%.3f" % accuracy(y, run_cv(X,y,KNN)))

Support vector machines:
0.917
----------------------------
LogisticRegression :
0.859
----------------------------
K-nearest-neighbors:
0.894
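
These accuracies are easier to judge next to a trivial baseline. Because the classes are imbalanced, a model that always predicts "no churn" is already right most of the time. The arithmetic below uses only the class counts printed in section 4.4.

```python
# Class counts printed in section 4.4.
n_retained = 2850
n_churned = 483
n_total = n_retained + n_churned

# Accuracy of a model that always predicts "no churn".
baseline_accuracy = n_retained / n_total
print(f"{baseline_accuracy:.3f}")  # 0.855
```

Against that baseline, logistic regression's 0.859 is only marginally better, while the SVM's 0.917 is a clear gain.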

# Same comparison using cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score,KFold
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt


# Candidate models
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('LR', LogisticRegression()))
models.append(('SVM', SVC()))

# Evaluate each model in turn
results = []
names = []
scoring = 'accuracy'  # evaluation metric
for name, model in models:

    # random_state=0 fixes the shuffle so every model sees the same folds
    kfold = KFold(5,shuffle=True,random_state = 0)  # 5 folds
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)  # scoring=None would fall back to the estimator's default
    results.append(cv_results)  # keep the per-fold scores for the box plot
    names.append(name)
    # Report the mean score across folds, with the standard deviation in parentheses
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    print("------------------------------")
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

# Best of the three: SVM

KNN: 0.894088 (0.009717)
------------------------------
LR: 0.861384 (0.015574)
------------------------------
SVM: 0.919288 (0.011112)
------------------------------

[Figure output_27_1.png: box plot "Algorithm Comparison"]
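
Accuracy alone does not show how many churners a model actually catches. One possible complement, sketched here on synthetic data rather than churn.csv, is a confusion matrix built from cross-validated predictions. The dataset from `make_classification` and the names `X_demo`, `y_demo` are stand-ins chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the churn data (about 15% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
y_hat = cross_val_predict(SVC(), X_demo, y_demo, cv=cv)  # one out-of-fold prediction per row

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_demo, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("recall on the positive class:", tp / (tp + fn))
```

For a churn warning system, the recall on the positive class is the share of real churners that the model flags.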

4.6 Model parameter tuning / ensemble models

To improve on the baseline models, this part shows how to use ensemble algorithms, for example random forest.

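from sklearn.ensemble import RandomForestClassifier as RF
num_trees = 100
max_features = 3
# random_state only takes effect with shuffle=True; recent scikit-learn raises an error otherwise
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = RF(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())  # mean accuracy over the 10 folds; the exact value varies from run to run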

0.953197209185233

from sklearn.ensemble import GradientBoostingClassifier
seed = 7
num_trees = 100
# shuffle=True is needed for random_state to have any effect
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)

print(results.mean())  # mean accuracy over the 10 folds

0.9525966085846325
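
The two runs above use fixed settings. One common way to actually tune them is a grid search with cross-validation. The sketch below runs on synthetic data, and the grid values are arbitrary choices for illustration rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the scaled feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical grid: every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [20, 50], "max_features": [2, 3]}
cv = KFold(n_splits=3, shuffle=True, random_state=7)

search = GridSearchCV(
    RandomForestClassifier(random_state=7), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print(search.best_params_)           # best combination found in the grid
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

On the churn data, `X_demo` and `y_demo` would be replaced by `X` and `Y`.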

4.7 Evaluation, testing, and final report

def run_prob_cv(X, y, clf_class, **kwargs):
    kf = KFold(n_splits=5, shuffle=True)  # shuffle must be passed by keyword in recent scikit-learn
    y_prob = np.zeros((len(y),2))
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        y_prob[test_index] = clf.predict_proba(X_test)  # class probabilities per row: column 0 is P(retained), column 1 is P(churned)
    return y_prob

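import warnings
warnings.filterwarnings('ignore')


pred_prob = run_prob_cv(X, y, RF, n_estimators=10)

pred_churn = pred_prob[:,1]  # column 1: the predicted probability of churn
is_churn = y == 1

# How many customers received each predicted probability: pred_prob -> count
counts = pd.Series(pred_churn).value_counts()

# For each predicted probability, the share of those customers who really churned
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_churn[pred_churn == prob])
true_prob = pd.Series(true_prob)  # convert once, after the loop has filled the dict

counts = pd.concat([counts,true_prob], axis=1).reset_index()
counts.columns = ['pred_prob', 'count', 'true_prob']
counts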

    pred_prob  count  true_prob
0         0.0   1781   0.026390
1         0.1    674   0.031157
2         0.2    261   0.034483
3         0.3    124   0.145161
4         0.8     86   0.941860
5         0.7     74   0.878378
6         0.9     72   0.972222
7         0.4     69   0.347826
8         0.5     65   0.569231
9         0.6     64   0.750000
10        1.0     63   1.000000
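
The table can be read as a calibration check: among the customers given a predicted probability of 0.9, about 97% really churned, and among those given 0.0, under 3% did. The same table can also be built with a single groupby. The sketch below shows this on made-up arrays; `pred` and `actual` are hypothetical stand-ins for `pred_churn` and `is_churn`.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities and true labels (1 = churned).
pred = np.array([0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 1.0])
actual = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Group rows by predicted probability, then count them and average the true labels.
table = (
    pd.DataFrame({"pred_prob": pred, "is_churn": actual})
    .groupby("pred_prob")["is_churn"]
    .agg(count="size", true_prob="mean")
    .reset_index()
)
print(table)
```

Unlike the loop, this version returns the rows sorted by predicted probability, which makes the trend easier to follow.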
