Task 1: Data Analysis


Description: The dataset is financial data (already preprocessed, not raw) used to predict whether a loan user will become overdue. The "status" column is the outcome label: 0 means not overdue and 1 means overdue.
Requirement: split the data 70/30, i.e. a 30% test set and a 70% training set, with random seed 2018 (a split sketch follows the preprocessing code below).
 
Task 1: Explore and analyze the data. Time: 2 days
  • Analyze data types
  • Remove irrelevant features
  • Convert data types
  • Handle missing values
  • Plus any other data analysis or processing steps that come to mind or serve as useful reference
  • # -*- coding: utf-8 -*-
    """
    Created on Sun Mar 31 14:27:38 2019
    
    @author: kratos
    """
    
    # Imports
    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer
    from sklearn.impute import SimpleImputer
    
    # Load the raw data (the CSV is GBK-encoded)
    data_origin = pd.read_csv(r'D:\datamine\data.csv', encoding='gbk')
    data_origin.head()
    
    # Separate the label column from the features
    label = data_origin.status
    data = data_origin.drop(['status'], axis=1)
    
    # Inspect column dtypes and missing-value counts
    data.info()
    
    # Drop rows with fewer than 50 non-null values, then remove duplicate rows
    data_del = data.dropna(thresh=50)
    data_del.drop_duplicates(inplace=True)
    
    # Split the features into object (string/date) columns and numeric columns
    object_column = ['trade_no', 'bank_card_no', 'reg_preference_for_trad', 'source',
                     'id_name', 'latest_query_time', 'loans_latest_time']
    data_obj = data_del[object_column].copy()
    data_num = data_del.drop(object_column, axis=1)
    
    # Summary statistics for the object columns
    data_obj.describe()
    
    # Drop object columns with no predictive value (IDs and constant columns)
    data_obj.drop(['bank_card_no', 'source', 'trade_no', 'id_name'], axis=1, inplace=True)
    
    # Drop numeric columns that are IDs, an index artifact, or mostly missing
    data_num.drop(['custid', 'student_feature', 'Unnamed: 0'], axis=1, inplace=True)
    
    # Impute missing numeric values with the column mean
    imputer = SimpleImputer(strategy='mean')
    num = imputer.fit_transform(data_num)
    data_num = pd.DataFrame(num, columns=data_num.columns, index=data_num.index)
    
    # Forward-fill missing values in the object columns
    data_obj.ffill(inplace=True)
    
    # One-hot encode the categorical region column with LabelBinarizer
    encoder = LabelBinarizer()
    reg_preference_1hot = encoder.fit_transform(data_obj['reg_preference_for_trad'])
    data_obj.drop(['reg_preference_for_trad'], axis=1, inplace=True)
    
    reg_preference_df = pd.DataFrame(reg_preference_1hot, columns=encoder.classes_, index=data_obj.index)
    data_obj = pd.concat([data_obj, reg_preference_df], axis=1)
    
    # Extract month and weekday features from the date columns, then drop the originals
    data_obj['latest_query_time'] = pd.to_datetime(data_obj['latest_query_time'])
    data_obj['latest_query_time_month'] = data_obj['latest_query_time'].dt.month
    data_obj['latest_query_time_weekday'] = data_obj['latest_query_time'].dt.weekday
    
    data_obj['loans_latest_time'] = pd.to_datetime(data_obj['loans_latest_time'])
    data_obj['loans_latest_time_month'] = data_obj['loans_latest_time'].dt.month
    data_obj['loans_latest_time_weekday'] = data_obj['loans_latest_time'].dt.weekday
    
    data_obj = data_obj.drop(['latest_query_time', 'loans_latest_time'], axis=1)
    
    # Preview the processed object features
    data_obj.head()
    
    # Merge the numeric and object features back together
    data_processed = pd.concat([data_num, data_obj], axis=1)
    data_processed.head()
    
    # Append the label (aligned to the remaining rows) and save the processed dataset
    data_saved = pd.concat([data_processed, label.loc[data_processed.index]], axis=1)
    data_saved.to_csv(r'D:\datamine\data_processed.csv', index=False)
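
The required 70/30 split is not part of the preprocessing script above. A minimal sketch, assuming the processed CSV saved in the last step is reloaded and "status" is the label column, might look like this:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    # Reload the processed dataset saved above (path assumed to match)
    data_saved = pd.read_csv(r'D:\datamine\data_processed.csv')
    X = data_saved.drop(['status'], axis=1)
    y = data_saved['status']
    
    # 30% test set, 70% training set, random seed 2018 as required
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=2018)
    print(X_train.shape, X_test.shape)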