Data Mining Engineer in Practice: Telecom Operator Customer Churn Early Warning


Applied Data Analysis Algorithms: Customer Churn Early Warning in Practice
4. Data Analysis and Preparation: Meeting and Discussion
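  • State: state / region
  • Account Length: how long the account has been held
  • Area Code: area code
  • Phone: phone number
  • Int'l Plan: whether the customer has an international (roaming) plan
  • VMail Plan: whether the customer has a voicemail plan
  • VMail Message: number of voicemail messages
  • Day Mins: daytime call minutes
  • Day Calls: number of daytime calls
  • Day Charge: daytime charges
  • Eve Mins: evening call minutes
  • Eve Calls: number of evening calls
  • Eve Charge: evening charges
  • Night Mins: night call minutes
  • Night Calls: number of night calls
  • Night Charge: night charges
  • Intl Mins: international call minutes
  • Intl Calls: number of international calls
  • Intl Charge: international charges
  • CustServ Calls: number of calls (complaints) made to customer service
  • Churn?: whether the customer churned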
4.1 Data Cleaning and Format Conversion
    import warnings
    warnings.filterwarnings('ignore')  # suppress warnings so the notebook output stays readable
    
  • Step 1. Import the CSV with pandas. A look at the basic information shows that the dataset has 3,333 records and 21 columns, the last of which is the class label.
    import pandas as pd
    import numpy as np

    # Load the dataset
    churn_df = pd.read_csv('churn.csv')
    col_names = churn_df.columns.tolist()  # column names as a Python list

    print("Column names:")
    print(col_names)
    
    Column names:
    ['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan", 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?']
    
  • Step 2. Basic information and data types
    to_show = col_names[:6] + col_names[-6:]  # first 6 and last 6 columns

    print("\nSample data:")
    churn_df[to_show].head(6)
    Sample data:
    

       State  Account Length  Area Code     Phone  Int'l Plan  VMail Plan  Night Charge  Intl Mins  Intl Calls  Intl Charge  CustServ Calls  Churn?
    0     KS             128        415  382-4657          no         yes         11.01       10.0           3         2.70               1  False.
    1     OH             107        415  371-7191          no         yes         11.45       13.7           3         3.70               1  False.
    2     NJ             137        415  358-1921          no          no          7.32       12.2           5         3.29               0  False.
    3     OH              84        408  375-9999         yes          no          8.86        6.6           7         1.78               2  False.
    4     OK              75        415  330-6626         yes          no          8.41       10.1           3         2.73               3  False.
    5     AL             118        510  391-8027         yes          no          9.18        6.3           6         1.70               0  False.
    churn_df.info()  # column dtypes and non-null counts
    
    
    RangeIndex: 3333 entries, 0 to 3332
    Data columns (total 21 columns):
    State             3333 non-null object
    Account Length    3333 non-null int64
    Area Code         3333 non-null int64
    Phone             3333 non-null object
    Int'l Plan        3333 non-null object
    VMail Plan        3333 non-null object
    VMail Message     3333 non-null int64
    Day Mins          3333 non-null float64
    Day Calls         3333 non-null int64
    Day Charge        3333 non-null float64
    Eve Mins          3333 non-null float64
    Eve Calls         3333 non-null int64
    Eve Charge        3333 non-null float64
    Night Mins        3333 non-null float64
    Night Calls       3333 non-null int64
    Night Charge      3333 non-null float64
    Intl Mins         3333 non-null float64
    Intl Calls        3333 non-null int64
    Intl Charge       3333 non-null float64
    CustServ Calls    3333 non-null int64
    Churn?            3333 non-null object
    dtypes: float64(8), int64(8), object(5)
    memory usage: 546.9+ KB
    
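    churn_df.describe()
    # describe() summarizes each numeric column: count, mean, standard deviation,
    # minimum, the 25% / 50% / 75% quantiles, and maximum. NA values are excluded.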
    

           Account Length    Area Code  VMail Message     Day Mins    Day Calls   Day Charge     Eve Mins    Eve Calls
    count     3333.000000  3333.000000    3333.000000  3333.000000  3333.000000  3333.000000  3333.000000  3333.000000
    mean       101.064806   437.182418       8.099010   179.775098   100.435644    30.562307   200.980348   100.114311
    std         39.822106    42.371290      13.688365    54.467389    20.069084     9.259435    50.713844    19.922625
    min          1.000000   408.000000       0.000000     0.000000     0.000000     0.000000     0.000000     0.000000
    25%         74.000000   408.000000       0.000000   143.700000    87.000000    24.430000   166.600000    87.000000
    50%        101.000000   415.000000       0.000000   179.400000   101.000000    30.500000   201.400000   100.000000
    75%        127.000000   510.000000      20.000000   216.400000   114.000000    36.790000   235.300000   114.000000
    max        243.000000   510.000000      51.000000   350.800000   165.000000    59.640000   363.700000   170.000000

            Eve Charge   Night Mins  Night Calls  Night Charge    Intl Mins   Intl Calls  Intl Charge  CustServ Calls
    count  3333.000000  3333.000000  3333.000000   3333.000000  3333.000000  3333.000000  3333.000000     3333.000000
    mean     17.083540   200.872037   100.107711      9.039325    10.237294     4.479448     2.764581        1.562856
    std       4.310668    50.573847    19.568609      2.275873     2.791840     2.461214     0.753773        1.315491
    min       0.000000    23.200000    33.000000      1.040000     0.000000     0.000000     0.000000        0.000000
    25%      14.160000   167.000000    87.000000      7.520000     8.500000     3.000000     2.300000        1.000000
    50%      17.120000   201.200000   100.000000      9.050000    10.300000     4.000000     2.780000        1.000000
    75%      20.000000   235.300000   113.000000     10.590000    12.100000     6.000000     3.270000        2.000000
    max      30.910000   395.000000   175.000000     17.770000    20.000000    20.000000     5.400000        9.000000
4.2 Exploratory Data Analysis
  • Step 1. What the features themselves tell us
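    # Plot how the churn label and the customer service calls are distributed
    import matplotlib.pyplot as plt
    %matplotlib inline

    fig = plt.figure()
    fig.set(alpha=0.3)  # set the figure's alpha (transparency)
    plt.subplot2grid((1, 2), (0, 0))  # 1x2 grid; this panel goes in row 0, column 0

    # Bar chart of the churn label
    churn_df['Churn?'].value_counts().plot(kind='bar')  # count each label value and plot the counts
    plt.title("stat for churn")
    plt.ylabel("number")  # read off the chart: of 3333 customers, roughly 2700 stayed and roughly 500 churned

    plt.subplot2grid((1, 2), (0, 1))  # second panel: row 0, column 1
    churn_df['CustServ Calls'].value_counts().plot(kind='bar')  # customers per number of service calls
    plt.title("stat for cusServCalls")
    plt.ylabel("number")  # about 1400 customers made 1 call, and so on; the bars sum to 3333

    plt.show()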
    

    [Figure (output_13_0.png): bar charts of the churn label counts and of the customer service call counts]
    import matplotlib.pyplot as plt

    %matplotlib inline
    fig = plt.figure()
    fig.set(alpha=0.2)  # set the figure's alpha (transparency)

    plt.subplot2grid((1, 3), (0, 0))  # 1x3 grid; first panel
    churn_df['Day Mins'].plot(kind='kde')  # kernel density estimate of daytime minutes
    plt.xlabel("Mins")
    plt.ylabel("density")
    plt.title("dis for day mins")

    plt.subplot2grid((1, 3), (0, 1))  # second panel
    churn_df['Day Calls'].plot(kind='kde')  # density of the number of daytime calls
    plt.xlabel("call")
    plt.ylabel("density")
    plt.title("dis for day calls")

    plt.subplot2grid((1, 3), (0, 2))  # third panel
    churn_df['Day Charge'].plot(kind='kde')  # density of the daytime charges
    plt.xlabel("Charge")
    plt.ylabel("density")
    plt.title("dis for day charge")

    plt.show()
    
    

    [Figure (output_14_0.png): density plots of Day Mins, Day Calls, and Day Charge]
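  • Step 2. How the features relate to the class label
    # import matplotlib.pyplot as plt  (already imported above)

    fig = plt.figure()
    fig.set(alpha=0.2)  # set the figure's alpha (transparency)

    # Churn label counts, split by whether the customer has an international plan
    int_yes = churn_df['Churn?'][churn_df["Int'l Plan"] == 'yes'].value_counts()  # customers with an international plan
    int_no = churn_df['Churn?'][churn_df["Int'l Plan"] == 'no'].value_counts()  # customers without one

    # Combine the two count Series into one DataFrame, one column per group
    df_int = pd.DataFrame({'int plan': int_yes, 'no int plan': int_no})

    df_int.plot(kind='bar', stacked=True)
    plt.title("statistic between int plan and churn")
    plt.xlabel("int or not")
    plt.ylabel("number")

    plt.show()

    # Reading the chart: of the 3333 customers, about 2700 did not churn (False.);
    # roughly 100 of them have an international plan and about 2600 do not.
    # About 400 churned (True.): roughly 100 with an international plan and about 300 without.
    # Conclusion: by these counts, customers with an international plan churn at a much higher rate.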
    

    [Figure (output_16_1.png): stacked bar chart of churn counts by international plan]
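    # Churn counts by number of customer service calls
    fig = plt.figure()
    fig.set(alpha=0.2)  # set the figure's alpha (transparency)

    cus_0 = churn_df['CustServ Calls'][churn_df['Churn?'] == 'False.'].value_counts()  # retained customers
    cus_1 = churn_df['CustServ Calls'][churn_df['Churn?'] == 'True.'].value_counts()  # churned customers

    df = pd.DataFrame({'churn': cus_1, 'retain': cus_0})
    df.plot(kind='bar', stacked=True)
    plt.title("Static between customer service call and churn")
    plt.xlabel("Call service")  # number of customer service calls
    plt.ylabel("Num")  # number of customers

    plt.show()

    # 3 service calls: about 400 customers, roughly 300 retained and 100 churned
    # 4 service calls: about 180 customers, roughly 80 retained and 100 churned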
    

    [Figure (output_17_1.png): stacked bar chart of churned and retained customers by number of customer service calls]
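Stacked counts make it hard to compare groups of very different size, so it can help to look at the churn rate per group as well. The following is a minimal sketch on a toy frame: `toy` and its values are hypothetical and only borrow the column names and label format of `churn_df`.

```python
import pandas as pd

# Toy stand-in for churn_df (hypothetical values, same column names and label format).
toy = pd.DataFrame({
    "CustServ Calls": [1, 1, 1, 1, 4, 4, 4, 4],
    "Churn?": ["False.", "False.", "False.", "True.",
               "True.", "True.", "False.", "True."],
})

# Row-normalised cross-tabulation: each row sums to 1,
# so the 'True.' column is the churn rate within that group.
rates = pd.crosstab(toy["CustServ Calls"], toy["Churn?"], normalize="index")
print(rates)
```

Under the assumption that the real data has the same layout, the same call with `churn_df` in place of `toy` would give the churn rate for every number of service calls.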
4.3 Feature Selection
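    # Separate the label column
    ds_result = churn_df['Churn?']

    # np.where(condition, x, y) returns x where the condition is True and y elsewhere
    # (Shift+Tab in Jupyter shows the signature).
    # Encode churned ('True.') as 1 and retained ('False.') as 0
    Y = np.where(ds_result == 'True.', 1, 0)

    # One-hot encode the two yes/no columns; prefix sets the prefix of the new column names
    dummies_int = pd.get_dummies(churn_df["Int'l Plan"], prefix="_int'l Plan")
    dummies_voice = pd.get_dummies(churn_df['VMail Plan'], prefix='VMail')

    # concat with axis=1 places the dummy columns next to the original ones
    ds_tmp = pd.concat([churn_df, dummies_int, dummies_voice], axis=1)

    # Drop state, area code, phone number, the label, and the two original yes/no columns
    to_drop = ['State', 'Area Code', 'Phone', 'Churn?', "Int'l Plan", 'VMail Plan']
    df = ds_tmp.drop(to_drop, axis=1)

    print("after convert ")
    df.head(5)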
    
    after convert 
    

       Account Length  VMail Message  Day Mins  Day Calls  Day Charge  Eve Mins  Eve Calls  Eve Charge  Night Mins  Night Calls
    0             128             25     265.1        110       45.07     197.4         99       16.78       244.7           91
    1             107             26     161.6        123       27.47     195.5        103       16.62       254.4          103
    2             137              0     243.4        114       41.38     121.2        110       10.30       162.6          104
    3              84              0     299.4         71       50.90      61.9         88        5.26       196.9           89
    4              75              0     166.7        113       28.34     148.3        122       12.61       186.9          121

       Night Charge  Intl Mins  Intl Calls  Intl Charge  CustServ Calls  _int'l Plan_no  _int'l Plan_yes  VMail_no  VMail_yes
    0         11.01       10.0           3         2.70               1               1                0         0          1
    1         11.45       13.7           3         3.70               1               1                0         0          1
    2          7.32       12.2           5         3.29               0               1                0         1          0
    3          8.86        6.6           7         1.78               2               0                1         1          0
    4          8.41       10.1           3         2.73               3               0                1         1          0
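To see what `pd.get_dummies` does in isolation, here is a small sketch on a made-up three-row column. The `plan` Series is hypothetical, and `dtype=int` is passed only so that the result prints as 0/1 like the table above on pandas versions that would otherwise return booleans.

```python
import pandas as pd

# Hypothetical three-row column in the same yes/no format as 'VMail Plan'.
plan = pd.Series(["yes", "no", "yes"], name="VMail Plan")

# One indicator column per category; the prefix is joined to the category with "_".
dummies = pd.get_dummies(plan, prefix="VMail", dtype=int)
print(dummies)
```

The two indicator columns always sum to 1, so one of each pair carries no extra information. Keeping both, as the section does, is harmless for the models tried later, but `drop_first=True` would be one way to keep a single column per plan.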
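4.4 Feature Engineering
  • The features need to be scaled, because some attributes have a far larger scale than others.
  • With logistic regression and gradient descent, large differences in scale between attributes slow convergence considerably.
  • Here every feature is scaled, although in practice the same treatment could be applied only to the features whose scale stands out.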
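    # Standardize the features with StandardScaler
    # Convert the DataFrame to a float NumPy array
    # (as_matrix() was removed in pandas 1.0 and np.float in NumPy 1.24)
    X = df.to_numpy().astype(float)

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()  # rescales each column to zero mean and unit variance

    X = scaler.fit_transform(X)

    print("Feature space holds %d observations and %d features" % X.shape)  # 3333 rows x 19 columns
    print("---------------------------------")
    print("Unique target labels:", np.unique(Y))  # the distinct label values
    print("---------------------------------")
    print(len(Y[Y == 0]))  # retained customers: 2850
    print("---------------------------------")
    print(len(Y[Y == 1]))  # churned customers: 483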
    
    Feature space holds 3333 observations and 19 features
    ---------------------------------
    Unique target labels: [0 1]
    ---------------------------------
    2850
    ---------------------------------
    483
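To make the scaling step concrete, the sketch below standardizes a tiny made-up matrix by hand and compares the result with `StandardScaler`. `X_toy` is hypothetical; the point is only that each column ends up with mean 0 and standard deviation 1, whatever its original scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: column 0 is in the hundreds, column 1 in single units.
X_toy = np.array([[100.0, 1.0],
                  [200.0, 2.0],
                  [300.0, 3.0]])

# Manual standardisation: subtract the column mean and divide by the column
# standard deviation (population std, ddof=0, which is what StandardScaler uses).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

X_scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, both columns contain the same values even though they started two orders of magnitude apart, which is the property that helps gradient-based models converge.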
    
    # Alternative preparation: encode the yes/no columns as booleans instead of dummy columns
    churn_result = churn_df['Churn?']
    y = np.where(churn_result == 'True.', 1, 0)
    to_drop = ['State', 'Area Code', 'Phone', 'Churn?']
    churn_feat_space = churn_df.drop(to_drop, axis=1)
    yes_no_cols = ["Int'l Plan", "VMail Plan"]
    churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'
    features = churn_feat_space.columns
    # to_numpy() replaces the removed as_matrix(); float replaces the removed np.float
    X = churn_feat_space.to_numpy().astype(float)
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    print("Feature space holds %d observations and %d features" % X.shape)
    print("---------------------------------")
    print("Unique target labels:", np.unique(y))
    print("---------------------------------")
    print(X[0])  # first row after scaling
    print("---------------------------------")
    print(len(y[y == 0]))
    
    Feature space holds 3333 observations and 17 features
    ---------------------------------
    Unique target labels: [0 1]
    ---------------------------------
    [ 0.67648946 -0.32758048  1.6170861   1.23488274  1.56676695  0.47664315
      1.56703625 -0.07060962 -0.05594035 -0.07042665  0.86674322 -0.46549436
      0.86602851 -0.08500823 -0.60119509 -0.0856905  -0.42793202]
    ---------------------------------
    2850
    

4.5 Building Several Baseline Models and Trying Several Algorithms
    # Cross-validation helper: returns an out-of-fold prediction for every sample
    from sklearn.model_selection import KFold

    def run_cv(X, y, clf_class, **kwargs):
        # Construct a kfolds object
        kf = KFold(n_splits=5, shuffle=True)  # 5 folds
        y_pred = y.copy()  # same shape as y; every entry is overwritten below

        # For each of the 5 folds, train on the other four and predict the held-out fold
        for train_index, test_index in kf.split(X):
            X_train, X_test = X[train_index], X[test_index]
            y_train = y[train_index]
            # Initialize a classifier with key word arguments
            clf = clf_class(**kwargs)
            clf.fit(X_train, y_train)
            y_pred[test_index] = clf.predict(X_test)
        return y_pred

    # Classifiers to compare
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression as LR
    from sklearn.neighbors import KNeighborsClassifier as KNN

    def accuracy(y_true, y_pred):
        # NumPy interprets True and False as 1. and 0.
        return np.mean(y_true == y_pred)  # share of the 3333 predictions that match the label

    print("Support vector machines:")
    print("%.3f" % accuracy(y, run_cv(X, y, SVC)))
    print("----------------------------")
    print("LogisticRegression :")
    print("%.3f" % accuracy(y, run_cv(X, y, LR)))
    print("----------------------------")
    print("K-nearest-neighbors:")
    print("%.3f" % accuracy(y, run_cv(X, y, KNN)))
    
    Support vector machines:
    0.917
    ----------------------------
    LogisticRegression :
    0.859
    ----------------------------
    K-nearest-neighbors:
    0.894
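`run_cv` can return a full prediction vector because, across the 5 folds, every sample lands in the test fold exactly once. The sketch below checks that property on a made-up dataset of 10 samples; `n_samples` and the placeholder matrix are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10                        # hypothetical tiny dataset
X_toy = np.zeros((n_samples, 1))      # placeholder features; only the row count matters
test_counts = np.zeros(n_samples, dtype=int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in kf.split(X_toy):
    # Within a fold, train and test indices never overlap.
    assert len(np.intersect1d(train_index, test_index)) == 0
    test_counts[test_index] += 1

# How many times each sample was used as test data.
print(test_counts)
```

Since each sample is predicted by a model that never saw it during training, the accuracy computed from `y_pred` is an out-of-sample estimate.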
    
    # The same comparison using cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score, KFold
    from sklearn.neighbors import KNeighborsClassifier
    import matplotlib.pyplot as plt

    # Candidate models
    models = []
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('LR', LogisticRegression()))
    models.append(('SVM', SVC()))

    # Evaluate each model
    results = []
    names = []
    scoring = 'accuracy'  # evaluation metric
    for name, model in models:
        # fixed random_state so every model is scored on the same 5 shuffled folds
        kfold = KFold(n_splits=5, shuffle=True, random_state=0)
        # scoring=None would fall back to the classifier's own score, which is also accuracy
        cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
        results.append(cv_results)  # per-fold accuracies
        names.append(name)
        # mean accuracy across the folds, with the standard deviation in parentheses
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
        print("------------------------------")
    # boxplot algorithm comparison
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(111)

    plt.boxplot(results)
    ax.set_xticklabels(names)
    plt.show()

    # Conclusion: SVM gives the best accuracy of the three
    
    KNN: 0.894088 (0.009717)
    ------------------------------
    LR: 0.861384 (0.015574)
    ------------------------------
    SVM: 0.919288 (0.011112)
    ------------------------------
    

    [Figure (output_27_1.png): box plot "Algorithm Comparison" of the per-fold accuracies for KNN, LR, and SVM]
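When reading these accuracies, it may help to keep the class balance from section 4.4 in mind (2850 retained, 483 churned). The short calculation below is a sketch, not part of the original analysis: it works out the accuracy of a hypothetical model that always predicts "not churned".

```python
# Class counts printed in section 4.4.
n_retained = 2850
n_churned = 483
n_total = n_retained + n_churned  # 3333

# Accuracy of a hypothetical model that always predicts "not churned".
baseline_accuracy = n_retained / n_total
print("%.3f" % baseline_accuracy)
```

This comes to roughly 0.855, so the logistic regression scores of about 0.86 are only slightly above that trivial baseline, while the SVM and KNN scores are clearly higher.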
4.6 Model Tuning / Ensemble Models
  • This part looks at how to improve on the baseline models with ensemble algorithms, for example a random forest.
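    from sklearn.ensemble import RandomForestClassifier as RF
    num_trees = 100
    max_features = 3
    # 10 folds without shuffling; random_state is only valid together with shuffle=True
    # and recent scikit-learn versions raise an error if it is set without it
    kfold = KFold(n_splits=10)
    model = RF(n_estimators=num_trees, max_features=max_features)
    results = cross_val_score(model, X, Y, cv=kfold)
    print(results.mean())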
    
    0.953197209185233
    
    from sklearn.ensemble import GradientBoostingClassifier
    seed = 7
    num_trees = 100
    # 10 folds without shuffling (random_state requires shuffle=True, so it is omitted here)
    kfold = KFold(n_splits=10)

    model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
    results = cross_val_score(model, X, Y, cv=kfold)

    print(results.mean())
    
    0.9525966085846325
    

4.7 Evaluation / Conclusions
    def run_prob_cv(X, y, clf_class, **kwargs):
        kf = KFold(n_splits=5, shuffle=True)
        y_prob = np.zeros((len(y), 2))
        for train_index, test_index in kf.split(X):
            X_train, X_test = X[train_index], X[test_index]
            y_train = y[train_index]
            clf = clf_class(**kwargs)
            clf.fit(X_train, y_train)
            # predicted class probabilities: column 0 = retained, column 1 = churned
            y_prob[test_index] = clf.predict_proba(X_test)
        return y_prob

    import warnings
    warnings.filterwarnings('ignore')

    pred_prob = run_prob_cv(X, y, RF, n_estimators=10)

    pred_churn = pred_prob[:, 1]  # column 1: predicted probability of churn
    is_churn = y == 1

    # Number of customers at each distinct predicted probability
    counts = pd.Series(pred_churn).value_counts()

    # Observed churn rate among the customers at each predicted probability
    true_prob = {}
    for prob in counts.index:
        true_prob[prob] = np.mean(is_churn[pred_churn == prob])
    true_prob = pd.Series(true_prob)  # convert once, after the loop

    counts = pd.concat([counts, true_prob], axis=1).reset_index()
    counts.columns = ['pred_prob', 'count', 'true_prob']
    counts
    

        pred_prob  count  true_prob
    0         0.0   1781   0.026390
    1         0.1    674   0.031157
    2         0.2    261   0.034483
    3         0.3    124   0.145161
    4         0.8     86   0.941860
    5         0.7     74   0.878378
    6         0.9     72   0.972222
    7         0.4     69   0.347826
    8         0.5     65   0.569231
    9         0.6     64   0.750000
    10        1.0     63   1.000000
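The table above is ordered by `count`, which hides the relationship between the two probability columns. As a small sketch, the values can be copied into a DataFrame and sorted by `pred_prob`; the name `by_prob` is introduced here for illustration.

```python
import pandas as pd

# Values copied from the table above.
counts = pd.DataFrame({
    "pred_prob": [0.0, 0.1, 0.2, 0.3, 0.8, 0.7, 0.9, 0.4, 0.5, 0.6, 1.0],
    "count": [1781, 674, 261, 124, 86, 74, 72, 69, 65, 64, 63],
    "true_prob": [0.026390, 0.031157, 0.034483, 0.145161, 0.941860, 0.878378,
                  0.972222, 0.347826, 0.569231, 0.750000, 1.000000],
})

# Sort by the predicted churn probability instead of by group size.
by_prob = counts.sort_values("pred_prob").reset_index(drop=True)
print(by_prob)

# True if the observed churn rate never decreases as the prediction grows.
print(by_prob["true_prob"].is_monotonic_increasing)
```

Read this way, the observed churn rate rises steadily with the predicted probability, and the groups add up to all 3333 customers. In this run the random forest's probabilities therefore rank customers by churn risk in a sensible order.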