Implementing a Decision Tree in Python (Series Part 4): My Decision Tree


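What you learn on paper always feels shallow in the end; to really know a thing, you have to code it!
1 A simple way to think about algorithms
What is an algorithm? An algorithm is implicit programming. [Probably the most concise summary I have seen.]
I think an algorithm can be understood informally as stir-frying. We know where we start (ingredients, seasonings), and we know when to stop (when the taste is close enough). The process in between, however, only has a rough method; we do not know the exact amounts. For example, we know that salt is what makes the dish salty, but how much? How little is "a pinch of salt"? So we taste and adjust repeatedly until it is right. There is a further catch: if we add salt one microgram at a time, we will probably starve before the dish is good; if we dump the salt in, the whole pan is ruined.
So an algorithm is an elegant way of "adding salt": a method that suits your taste and gets there after a few tries.
For our current problem, how do we solve it with a decision tree? (Predicting the survival probability of Titanic passengers.)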
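  • 1 Target: the variable is Survived (0, 1).
  • 2 Initial state: we have the original dataset, on which gini can be computed.
  • 3 Iteration: loop over all available variables, try to find the best split point, and obtain a new branch of the decision tree.
  • 4 Termination: stop when a preset goal or limit is reached. There are quite a few limiting conditions here:
      • 1 If the current dataset has too few records (the result has no statistical weight), stop.
      • 2 If a leaf that would result from the split has too little data (the result is likewise unreliable), stop.
      • 3 If the target is already well separated (the result is already "pure"), stop.
      • 4 If the preset number of iterations is exceeded, stop.
      • 5 Other suitable termination conditions can be added as well.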
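
The four steps above can be sketched in a few lines. This is only a minimal illustration, not the implementation that follows: the toy table, the helper names `gini_impurity` and `split_gini`, and the use of a single `Age` variable are assumptions made up for the example.

```python
import pandas as pd


def gini_impurity(target_count, total_count):
    """Gini impurity of a node with a binary (0/1) target."""
    p1 = target_count / total_count
    p0 = 1 - p1
    return 1 - p1 ** 2 - p0 ** 2


def split_gini(indicator, y):
    """Sample-weighted Gini of the groups defined by a 0/1 indicator."""
    total = len(y)
    weighted = 0.0
    for _, group in y.groupby(indicator):
        weighted += len(group) / total * gini_impurity(group.sum(), len(group))
    return weighted


# Toy data, made up for illustration (not the Titanic file)
toy = pd.DataFrame({
    'Age':      [4, 8, 25, 30, 41, 52, 60, 71],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# Step 2: Gini of the original (root) dataset
root_gini = gini_impurity(toy['Survived'].sum(), len(toy))

# Step 3: try each observed age (except the smallest) as a cut point
candidates = {}
for q in sorted(toy['Age'].unique())[1:]:
    indicator = (toy['Age'] < q).astype(int)
    candidates[int(q)] = split_gini(indicator, toy['Survived'])
best_cut = min(candidates, key=candidates.get)

# Step 4: one of the stopping rules, a node that is already "pure"
is_pure = root_gini == 0

print(root_gini, best_cut, candidates[best_cut], is_pure)
```

On this toy table the root node has a Gini of 0.5, and the best single cut is `Age < 30`, with a weighted Gini of 0.2.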

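Based on this approach, I wrote a decision tree. Apart from calling a few packages (pandas, numpy, copy, and combinations for enumerating combinations), everything is implemented directly in code. So 50 lines of code were not enough to solve the problem; in practice I wrote about 800 lines.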
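
To give a feel for what `combinations` is needed for, here is a small sketch of how the candidate two-way splits of a nominal variable can be enumerated. The function name `nominal_splits` and the level names are hypothetical; only the idea of taking subsets of 1 up to n//2 levels is taken from the code in the next section.

```python
from itertools import combinations


def nominal_splits(levels):
    """List (left, right) partitions of a set of levels into two groups."""
    levels = sorted(set(levels))
    splits = []
    # Subsets with 1 .. n//2 members; the remaining levels form the other side
    for size in range(1, len(levels) // 2 + 1):
        for left in combinations(levels, size):
            right = tuple(v for v in levels if v not in left)
            splits.append((left, right))
    return splits


splits = nominal_splits(['S', 'C', 'Q'])
for left, right in splits:
    print(left, '|', right)
```

With three levels this yields three candidate splits. The count grows quickly with the number of levels, which is one reason to limit how many levels a nominal variable may have before it is allowed to compete.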
2 My decision tree code and how to call it
    import pandas as pd
    import numpy as np
    import copy
    from itertools import combinations  # enumerate combinations of levels
    
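    class iTree():
        # Node template: every node of the tree is a nested dict
        tree_template_x = {}
        # Four parts: where the node comes from, its current state, the split competition (what-if), and where it leads
        tree_template_x['from'] = {}
        tree_template_x['now'] = {}
        tree_template_x['whatif'] = {}
        tree_template_x['to'] = {}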
    
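        tree_template_x['from']['trace'] = []  # path from the root: the list of split conditions so far

        tree_template_x['now']['samples'] = None  # number of samples in the node
        tree_template_x['now']['vars_available'] = []  # variables still available for splitting
        tree_template_x['now']['dummy_var_list'] = []  # variables with a single value in this node
        tree_template_x['now']['current_layers'] = 0  # current depth
        tree_template_x['now']['is_leaf'] = 0  # whether the node is a leaf

        # Classification tree
        tree_template_x['now']['target_counts'] = None  # target (1) count
        tree_template_x['now']['non_target_counts'] = None  # non-target (0) count
        tree_template_x['now']['gini'] = None  # gini of the node
        tree_template_x['now']['prob'] = None  # target rate
        tree_template_x['now']['class'] = None  # predicted class
        # Regression tree
        tree_template_x['now']['mse'] = None  # mse of the node

        tree_template_x['whatif']['compete_win_varname'] = None  # name of the winning variable
        tree_template_x['whatif']['compete_win_vartype'] = None  # type of the winning variable: N, O, C
        tree_template_x['whatif']['compete_win_gini'] = None  # gini of the winning split
        tree_template_x['whatif']['compete_win_mse'] = None  # mse of the winning split
        tree_template_x['whatif']['except_var_list'] = []  # variables excluded from the competition

        tree_template_x['to']['left_condition'] = None  # condition leading to the left branch
        tree_template_x['to']['right_condition'] = None  # condition leading to the right branch

        tree_template_x['to']['left'] = None  # left child
        tree_template_x['to']['right'] = None  # right child

        # Utility functions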
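        '''
        Candidate binary splits are generated differently for each variable type:
        - C(Continuous) -> cut at percentiles: at most about 100 candidate cuts
        - N(Nominal)    -> ∑C(n, x) for x = 1 .. n//2 candidate subsets
        - O(Ordinal)    -> one candidate cut per level, so on the order of N cuts
        '''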
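        # Tool 1: binary cuts for a continuous variable; returns a list of Series
        @staticmethod
        def cbcut(data=None, pstart=0.1, pend=0.9):
            data = data.copy()
            # By default, scan the 10th to 90th percentiles
            # linspace generates the quantile levels (0.1, 0.11,... 0.9)
            # round() guards against float error, e.g. (0.7 - 0.1) * 100 = 59.999...
            bins = int(round((pend - pstart) * 100)) + 1
            qlist = np.linspace(pstart, pend, bins)

            # Compute the quantiles and de-duplicate them (unique keeps distinct cut points only)
            qtiles = data.quantile(qlist).unique()
            res_list = []
            for q in qtiles:
                # 1 if the value is below the cut point, else 0 (NaN also maps to 0)
                data1 = data.apply(lambda x: 1 if x < q else 0)
                res_list.append(data1)
            res_dict = {}
            res_dict['data_list'] = res_list
            res_dict['qtiles'] = qtiles
            return res_dict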
    
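        # Tool 2: binary cuts for an ordinal variable; every level after the first `start` levels is tried as a cut point
        @staticmethod
        def obcut(data=None, start=1):
            data = data.copy()
            # unique() returns values in order of appearance, so sort them first;
            # otherwise "x < q" is not a meaningful ordinal cut
            qtiles = np.sort(data.dropna().unique())[start:]
            res_list = []
            for q in qtiles:
                data1 = data.apply(lambda x: 1 if x < q else 0)
                res_list.append(data1)
            res_dict = {}
            res_dict['data_list'] = res_list
            res_dict['qtiles'] = qtiles
            return res_dict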
        # Tool 3: turn a list into a dict of keys, for use with map
        @staticmethod
        def list_key_dict(data = None ,fill_value = 1):
            return dict(zip(data, [fill_value]*len(data)))
        # Tool 4: swap keys and values (the values must be hashable and unique)
        @staticmethod
        def kv2vk(data =None ):
            new_dict = {}
            for k in data.keys():
                new_dict[data[k]] = k
            return new_dict
        # Tool 5: binary cuts for a nominal variable, by enumerating subsets of its levels
        @staticmethod
        def nbcut(data=None):
            data = data.copy()
            var_set = set(data)
            comb_num = len(var_set) // 2

            # Enumerate the subsets of var_set with 1 .. n//2 members
            comb_list = []
            not_comb_list = []
            for i in range(comb_num):
                tem_num = i+1
                comb_list += list(combinations(var_set, tem_num))

            # Turn each subset into a dict that can be used with map
            comb_sel_list = []
            for clist in comb_list:
                comb_sel_list.append(iTree.list_key_dict(data=clist))
                # The complement goes into not_comb_list
                not_comb_list.append(list(var_set - set(clist)))

            res_list = []
            for comb_sel in comb_sel_list:
                data1 = data.map(comb_sel).fillna(0)
                res_list.append(data1)
            res_dict = {}
            res_dict['data_list'] = res_list  # one 0/1 indicator Series per subset
            res_dict['comb_list'] = comb_list  # levels that go to the left branch
            res_dict['not_comb_list'] = not_comb_list  # levels that go to the right branch
            res_dict['comb_sel_list'] = comb_sel_list  # the map dict built for each subset
            return res_dict
        # Tool 6: collect the attributes of one variable
        @staticmethod
        def collect_var_attr(data=None, varname=None):
            data = data.copy()
            data1 = data.dropna().apply(str)
            # 1 Number of missing values
            missing_num = len(data) - data.notnull().sum()
            # 2 Missing rate
            missing_rate = missing_num / len(data)

            # 3 Number of distinct levels
            levs = len(data1.unique())

            # 4 Integer check
            ## 4.1 Do all values consist of digits only?
            is_integer = data1.apply(lambda x: x.isdigit()).sum() == len(data1)

            # 5 Do all values contain a decimal point?
            is_dot = data1.apply(
                lambda x: True if '.' in x else False).sum() == len(data1)

            # 6 If so, check that both sides of the point are digits
            if is_dot:
                is_dot_is_digit = data1.apply(lambda x: True if x.split(
                    '.')[0].isdigit() and x.split('.')[1].isdigit() else False).sum() == len(data1)
            else:
                is_dot_is_digit = False
            # 7 If so, values whose fractional part is always 0 still count as integers;
            #   otherwise the column is float. (The original set is_float in the
            #   opposite branch, which flagged plain strings and integers as float.)
            is_float = False
            if is_dot_is_digit:
                is_integer = data1.apply(lambda x: True if float(
                    x.split('.')[1]) == 0 else False).sum() == len(data1)
                is_float = not is_integer
            # Can every value be parsed as a number?
            try:
                data1.apply(float)
                is_num = True
                is_str = False
            except (ValueError, TypeError):
                is_num = False
                is_str = True
            # Collect the results
            res_dict = {}
            res_dict['missing_num'] = missing_num
            res_dict['missing_rate'] = missing_rate

            res_dict['levs'] = levs
            res_dict['is_all_integer'] = is_integer
            res_dict['is_dot'] = is_dot
            res_dict['is_str'] = is_str
            res_dict['is_num'] = is_num

            res_dict['is_all_dot_and_digit'] = is_dot_is_digit
            res_dict['is_all_float'] = is_float
            return {varname: res_dict}
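        # Tool 7
        # Note: the variable type (C, N, O) is only inferred roughly here
        # This function never returns O; ordinal variables have to be marked by hand
        @staticmethod
        def infer_var_type(data=None):
            if data['is_str']:
                res = 'N'
            elif data['levs'] >= 20:
                res = 'C'
            else:
                res = 'N'
            return res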
        # Tool 8: infer the variable types of a whole df
        @staticmethod
        def df_infer_var_type(df, cols=None):
            var_meta_dict = {}
            if cols is None:
                cols = list(df.columns)
            for col in cols:
                tem_var_meta_dict = iTree.collect_var_attr(df[col], varname=col)
                var_meta_dict.update(tem_var_meta_dict)
            for k in var_meta_dict.keys():
                res = iTree.infer_var_type(var_meta_dict[k])
                var_meta_dict[k]['vartype'] = res
            return var_meta_dict
    
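        # Tool 9: align x and y (both Series) in one df and drop rows with missing values
        # Returns a df holding the aligned x and y
        # x is the feature, y is the target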
        @staticmethod
        def align_xy(x = None, y = None ):
            tem_df = pd.DataFrame()
            tem_df['x'] = x.copy()
            tem_df['y'] = y.copy()
            return tem_df.dropna()
        
        # Tool 10: gini impurity
        # Gini of a node with a binary target
        @staticmethod
        def cal_gini_impurity(target_count = None, total_count = None):
            p1 = target_count / total_count
            p0 = 1 - p1
            return 1 - p1**2 - p0**2
        
        # Tool 11: gini of a split
        @staticmethod
        def get_gini(x=None, y=None):
            tem_df = iTree.align_xy(x=x, y=y)
            # Distinct values of x (x is a 0/1 split indicator, so normally two groups)
            vals = tem_df['x'].unique()
            _gini = 0
            total_recs = len(tem_df)
            for val in vals:
                # Rows that fall into this group
                tem_df1 = tem_df[tem_df['x'] == val]
                # Weight of the group
                tem_weight = len(tem_df1) / total_recs
                # Target count in the group
                tem_target_count = tem_df1['y'].sum()
                # Gini of the group
                tem_gini = iTree.cal_gini_impurity(
                    target_count=tem_target_count, total_count=len(tem_df1))
                _gini += tem_weight * tem_gini
            return _gini
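        # Tool 12: squared error of a split
        # y_i: observed value
        # c_i: mean of y in the group that y_i falls into
        @staticmethod
        def get_mse(x = None, y = None):
            tem_df = iTree.align_xy(x=x, y=y)
            # Distinct values of x (x is a 0/1 split indicator, so normally two groups)
            vals = tem_df['x'].unique()
            _mse = 0
            for val in vals:
                # Rows that fall into this group
                tem_df1 = tem_df[tem_df['x'] == val]
                # Mean of the group
                tem_df1_y_mean = tem_df1['y'].mean()
                # tem_mse = (tem_df1['y'] - tem_df1_y_mean).apply(lambda x: x**2).sum() / len(tem_df1['y']) # per-group mean, not used
                # Sum of squared errors within the group (not divided by the group size)
                tem_mse = (tem_df1['y'] - tem_df1_y_mean).apply(lambda x: x**2).sum()
                _mse += tem_mse
            return _mse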
    
        # Tool 13:
        # Scan all candidate cuts of one variable and return the minimum gini
        # x is the feature, y is the target
        @staticmethod
        def find_min_gini(x=None, y=None, varname=None, vartype=None, start=1, pstart=0.1, pend=0.9):
            # Generate the candidate cuts according to the variable type
            assert vartype in [
                'C', 'O', 'N'], 'Only Accept Vartype C(Continuous), O(Ordinal), N(Nominal)'
            if vartype == 'N':
                res_dict = iTree.nbcut(data=x)
            elif vartype == 'O':
                res_dict = iTree.obcut(data=x, start=start)
            else:
                res_dict = iTree.cbcut(data=x, pstart=pstart, pend=pend)
            # Compute the gini of every candidate cut
            tem_gini_list = []
            for i in range(len(res_dict['data_list'])):
                tem_gini = iTree.get_gini(res_dict['data_list'][i], y)
                tem_gini_list.append(tem_gini)
            # min + index give the best value and its position
            min_gini = min(tem_gini_list)
            mpos = tem_gini_list.index(min_gini)
            if vartype == 'N':
                # Nominal: membership test (in); the left branch is the chosen subset, the right branch its complement
                condition_left = res_dict['comb_list'][mpos]
                condition_right = res_dict['not_comb_list'][mpos]
            else:
                # Continuous or ordinal: compare with the cut point ( < q and >= q)
                # The condition is stored as a string and parsed again when filtering
                condition_left = '<' + str(res_dict['qtiles'][mpos])
                condition_right = '>=' + str(res_dict['qtiles'][mpos])
            new_res_dict = {}
            new_res_dict[varname] = {}
            new_res_dict[varname]['gini'] = min_gini
            new_res_dict[varname]['condition_left'] = condition_left
            new_res_dict[varname]['condition_right'] = condition_right
            return new_res_dict
    
        # Tool 14
        # Scan all candidate cuts of one variable and return the minimum mse
        # x is the feature, y is the target
        # https://blog.csdn.net/zpalyq110/article/details/79527653
        @staticmethod
        def find_min_mse(x=None, y=None, varname=None, vartype=None, start=1, pstart=0.1, pend=0.9):
            # Generate the candidate cuts according to the variable type
            assert vartype in [
                'C', 'O', 'N'], 'Only Accept Vartype C(Continuous), O(Ordinal), N(Nominal)'
            if vartype == 'N':
                res_dict = iTree.nbcut(data=x)
            elif vartype == 'O':
                res_dict = iTree.obcut(data=x, start=start)
            else:
                res_dict = iTree.cbcut(data=x, pstart=pstart, pend=pend)
            # Compute the mse of every candidate cut
            tem_mse_list = []
            for i in range(len(res_dict['data_list'])):
                tem_mse = iTree.get_mse(res_dict['data_list'][i], y)
                tem_mse_list.append(tem_mse)
            # min + index give the best value and its position
            min_mse = min(tem_mse_list)
            mpos = tem_mse_list.index(min_mse)
            if vartype == 'N':
                # Nominal: membership test (in); the left branch is the chosen subset, the right branch its complement
                condition_left = res_dict['comb_list'][mpos]
                condition_right = res_dict['not_comb_list'][mpos]
            else:
                # Continuous or ordinal: compare with the cut point ( < q and >= q)
                # The condition is stored as a string and parsed again when filtering
                condition_left = '<' + str(res_dict['qtiles'][mpos])
                condition_right = '>=' + str(res_dict['qtiles'][mpos])
            new_res_dict = {}
            new_res_dict[varname] = {}
            new_res_dict[varname]['mse'] = min_mse
            new_res_dict[varname]['condition_left'] = condition_left
            new_res_dict[varname]['condition_right'] = condition_right
            return new_res_dict
        
        # Tool 15
        # Find the min or max of one attribute across a dict of dicts
        @staticmethod
        def find_dict_minmax(some_dict = None, attrname = None, method='min'):
            klist = []
            vlist = []
            except_list = []
            for k in some_dict.keys():
                try:
                    attr = some_dict[k][attrname]
                    klist.append(k)
                    vlist.append(attr)
                except (KeyError, TypeError):
                    print('fail to find the val', k)
                    except_list.append(k)
            # Pick the extreme value and its key
            if method.strip().lower() == 'min':
                the_val = min(vlist)
                the_key = klist[vlist.index(the_val)]
            elif method.strip().lower() == 'max':
                the_val = max(vlist)
                the_key = klist[vlist.index(the_val)]
            else:
                the_val, the_key = None, None
            return the_val, the_key, except_list
    
        # Tool 16
        # Filter a DataFrame by one split condition
        # Two cases: N and non-N
        # N: the condition is a collection of levels; build a dict from it, map, and keep the matching rows
        # non-N (O, C): the condition is a string such as '<3.5' or '>=3.5'; the operator and the value are parsed from it
        @staticmethod
        def filter_df_varattr(df=None, varname=None, vartype=None, condition=None):
            assert vartype in ['C', 'O', 'N'], 'Vartype must in C/O/N '
            if vartype == 'N':  # nominal: filter with map
                tem_map_dict = dict(zip(condition, [True]*len(condition)))
                _tem_sel = df[varname].map(tem_map_dict)
                res_df = df[_tem_sel.notnull()]
                # df['_tem_sel'] = df[varname].map(tem_map_dict)
                # res_df = df[df['_tem_sel'].notnull()]
            else:
                if '<' in condition:
                    val = float(condition.replace('<', ''))
                    res_df = df[df[varname] < val]
                elif '>=' in condition:
                    val = float(condition.replace('>=', ''))
                    res_df = df[df[varname] >= val]
                else:
                    raise ValueError('condition symbol error, expected < or >=', condition)
            # del df['_tem_sel']
            return res_df
    
        # Tool 17
        # Apply a chain of filters: each element holds (varname, vartype, cut_condition) and they are applied in order
        @staticmethod
        def filter_df_varattr_chain(df=None, chain_list=None):
            tem_df = df.copy()
            for the_chain in chain_list:
                varname = the_chain['varname']
                vartype = the_chain['vartype']
                condition = the_chain['cut_condition']
                tem_df = iTree.filter_df_varattr(
                    df=tem_df, varname=varname, vartype=vartype, condition=condition)
            return tem_df
        
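        # Tool 18
        # Find dummy variables: a variable with only one distinct value in the current data cannot be split, so it is left out of the competition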
        @staticmethod
        def find_dummy_var(df, cols=None, exe_cols=None):
            if cols is None:
                cols = list(df.columns)
            if exe_cols is not None:
                exe_cols1 = [x for x in exe_cols if len(x.strip()) > 0]
                cols = list(set(cols) - set(exe_cols1))
            res_list = []
            for c in cols:
                if len(df[c].unique()) == 1:
                    res_list.append(c)
            return res_list
    
    
    
    
        # ============== Instance methods =============
        def __init__(self):
            # Training artifacts: history list | branches / rules / partitions
            self.train_history_list = []
            self.train_branch_dict = None
            self.train_rules_df = None
            self.train_partition_df = None

            # debug
            self.debug = {}
    
        # Training
        def fit(self, data=None, target_name=None, id_name=None, time_name=None, tree_type='classification',
                max_iter=1000, min_sample_to_split=100, min_sample_to_predict=10, max_depth=3, gini_thresh=0,
                mse_thresh=0, improve_ratio=0, class_thresh=0.5):
            assert all([not data.empty, target_name]), 'data and target_name must both be provided'
            self.para_dict = {}
            self.para_dict['max_iter'] = max_iter  # 1 Maximum number of iterations
            # 2 Minimum number of samples needed to split a node
            self.para_dict['min_sample_to_split'] = min_sample_to_split
            # 3 Minimum number of samples needed in a branch
            self.para_dict['min_sample_to_predict'] = min_sample_to_predict
            self.para_dict['max_depth'] = max_depth  # 4 Maximum depth of the tree
            self.para_dict['gini_thresh'] = gini_thresh  # 5 Stop splitting below this gini
            self.para_dict['mse_thresh'] = mse_thresh  # 6 Stop splitting below this mse
            self.para_dict['improve_ratio'] = improve_ratio  # 7 Required gini/mse improvement ratio
            self.para_dict['class_thresh'] = class_thresh  # 8 Probability threshold for class 1
            self.para_dict['current_iter_list'] = []  # 9 Nodes waiting to be processed
            self.para_dict['history_dict'] = {}  # 10 The whole tree, as nested dicts
            # 11 Leaf nodes (the result): every node that stops splitting is collected here
            self.para_dict['leaf_list'] = []
            self.para_dict['data'] = data  # 12 The data, pd.DataFrame
            self.para_dict['target_name'] = target_name  # 13 Name of the target variable
            self.para_dict['id_name'] = id_name  # 14 Name of the ID variable
            self.para_dict['time_name'] = time_name  # 15 Name of the time variable
            self.para_dict['var_meta'] = iTree.df_infer_var_type(self.para_dict['data'])  # 16 Metadata of every variable
            self.para_dict['tree_type'] = tree_type  # 17 Type of the tree

            # --- Start training
            # 1 Root node
            iter_cnt = 0
            tree_template, tem_df = self.first_head()
            tree_template = self.iter_body(tree_template = tree_template, tem_df = tem_df)
            self.para_dict['history_dict'] = tree_template
            while len(self.para_dict['current_iter_list']) > 0:
                iter_cnt += 1
                if (iter_cnt) >= self.para_dict['max_iter']:
                    break
                tree_template, tem_df = self.iter_head()
                _ = self.iter_body(tree_template=tree_template, tem_df=tem_df)
            self.summary()
    
        # Head of the first (root) iteration
        def first_head(self):
            tree_template = copy.deepcopy(self.tree_template_x)
            tem_df = self.para_dict['data']  # the root node works on the full data set
            target_name = self.para_dict['target_name']
            exe_cols = [target_name, self.para_dict['id_name'], self.para_dict['time_name']]
            exe_cols = [x for x in exe_cols if x]  # drop names that were not given
            tree_template['whatif']['except_var_list'] = list(set(tree_template['whatif']['except_var_list'] + exe_cols))
            tree_template['now']['samples'] = len(tem_df)
            tree_template['now']['dummy_var_list'] = iTree.find_dummy_var(tem_df, exe_cols= exe_cols)
            # Available variables = all columns, minus dummy variables, minus excluded variables
            tree_template['now']['vars_available'] = list(set(tem_df.columns) - set(tree_template['now']['dummy_var_list']) -
                                                      set(tree_template['whatif']['except_var_list']))
            return tree_template, tem_df
        # Head of every later iteration
        def iter_head(self):
            tree_template = self.para_dict['current_iter_list'].pop()
            chain_list = tree_template['from']['trace']
            tem_df = iTree.filter_df_varattr_chain(df=self.para_dict['data'], chain_list=chain_list)
            tree_template['whatif']['except_var_list'] = list(set(tree_template['whatif']['except_var_list'] + [x['varname'] for x in chain_list]))
            tree_template['now']['samples'] = len(tem_df)
            tree_template['now']['dummy_var_list'] = iTree.find_dummy_var(tem_df)
            # Available variables = all columns, minus dummy variables, minus excluded variables
            tree_template['now']['vars_available'] = list(set(tem_df.columns) -
                                                        set(tree_template['now']['dummy_var_list']) -
                                                        set(tree_template['whatif']['except_var_list']))
            return tree_template, tem_df
    
    
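        # Body of one iteration: process a single node
        def iter_body(self, tree_template = None, tem_df = None):
            target_name = self.para_dict['target_name']
            # Decide whether the node may be split: enough samples, depth not exceeded, variables still available
            is_enough_sample = tree_template['now']['samples'] >= self.para_dict['min_sample_to_split']
            is_depth_ok = tree_template['now']['current_layers'] < self.para_dict['max_depth']
            is_var_available = len(tree_template['now']['vars_available']) > 0
            # Compute the node's own metrics according to the tree type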
            if self.para_dict['tree_type'] == 'classification':
                tree_template['now']['target_counts'] = tem_df[target_name].apply(int).sum()
                tree_template['now']['non_target_counts'] = tree_template['now']['samples'] - tree_template['now']['target_counts']
                tree_template['now']['gini'] = iTree.cal_gini_impurity(tree_template['now']['target_counts'], tree_template['now']['samples'])
                tree_template['now']['prob'] = tree_template['now']['target_counts'] / tree_template['now']['samples']
                tree_template['now']['class'] = 1 if tree_template['now']['prob'] >= self.para_dict['class_thresh'] else 0
                # If the metric is already below the threshold, stop improving (gini and mse are always >= 0)
                if tree_template['now']['gini'] < self.para_dict['gini_thresh']:
                    is_kpi_need_improve = False
                else:
                    is_kpi_need_improve = True
            else:
                y_mean = tem_df[target_name].mean()
                y = tem_df[target_name]
                tree_template['now']['mse'] = (y - y_mean).apply(lambda x: x**2).mean()
                if tree_template['now']['mse'] < self.para_dict['mse_thresh']:
                    is_kpi_need_improve = False
                else:
                    is_kpi_need_improve = True
            if all([is_enough_sample, is_depth_ok, is_var_available, is_kpi_need_improve]):
                is_compete = True
            else:
                is_compete = False
            # ====
            # Competition | nominal variables with too many levels are a problem: 1. they can be engineered into fewer levels (e.g. Name -> Title); 2. otherwise they are left out, as done here
            if is_compete:
                # Leave out N variables with too many levels (more than 20 here)
                N_too_many_lev = [x for x in tree_template['now']['vars_available'] if self.para_dict['var_meta'][x]['vartype'] == 'N' and len(tem_df[x].unique()) > 20]
                tree_template['whatif']['except_var_list'] = tree_template['whatif']['except_var_list'] + N_too_many_lev
                compete_dict = {}
                cols = list(
                    set(tree_template['now']['vars_available']) - set(N_too_many_lev))
                for c in cols:
                    tem_varname = c
                    tem_vartype = self.para_dict['var_meta'][c]['vartype']
                    print(tem_varname, tem_vartype)
                    if self.para_dict['tree_type'] == 'classification':
                        tem_res_dict = iTree.find_min_gini(x=tem_df[c], y=tem_df[target_name], varname=tem_varname, vartype=tem_vartype)
                    else:
                        tem_res_dict = iTree.find_min_mse(x=tem_df[c], y=tem_df[target_name], varname=tem_varname, vartype=tem_vartype)
                    compete_dict.update(tem_res_dict)
                if self.para_dict['tree_type'] == 'classification':
                    win_gini, win_var, except_list = iTree.find_dict_minmax(some_dict=compete_dict, attrname='gini')
                    tree_template['whatif']['compete_win_varname'] = win_var
                    tree_template['whatif']['compete_win_gini'] = win_gini
                    tree_template['whatif']['except_var_list'] = tree_template['whatif']['except_var_list'] + except_list
                    # win_gini could also be compared with the node's current gini here
                else:
                    win_mse, win_var, except_list = iTree.find_dict_minmax(some_dict=compete_dict, attrname='mse')
                    tree_template['whatif']['compete_win_varname'] = win_var
                    tree_template['whatif']['compete_win_mse'] = win_mse
                    tree_template['whatif']['except_var_list'] = tree_template['whatif']['except_var_list'] + except_list
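                # Split the node with the winning variable
                # Check each candidate branch: a branch is only created if it keeps enough samples
                # Note: the two branches may not cover every row (e.g. Embarked has missing values, and nan matches neither condition)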
                win_dict = compete_dict[win_var]
                left_candidate_df = iTree.filter_df_varattr(df=tem_df, varname=win_var,
                                                    vartype=self.para_dict['var_meta'][win_var]['vartype'],
                                                    condition=win_dict['condition_left'])
                is_left_branch_ok = len(left_candidate_df) >= self.para_dict['min_sample_to_predict']
                right_candidate_df = iTree.filter_df_varattr(df=tem_df, varname=win_var,
                                                    vartype=self.para_dict['var_meta'][win_var]['vartype'],
                                                    condition=win_dict['condition_right'])
                is_right_branch_ok = len(right_candidate_df) >= self.para_dict['min_sample_to_predict']
    
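                # If neither candidate branch has enough samples, the node becomes a leaf
                # if not all([is_left_branch_ok, is_right_branch_ok]):
                #  -- this was a bug: with one branch OK and the other not, all[True, False] = False, not turns it into True, and the node was wrongly marked as a leaf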
                if not is_left_branch_ok and not is_right_branch_ok:
                    tree_template['now']['is_leaf'] = 1
                    self.para_dict['leaf_list'].append(tree_template)
    
                # Build the left branch
                if is_left_branch_ok:
                    tem_left_dict = copy.deepcopy(self.tree_template_x)

                    # Copy the parent's trace
                    left_trace = tree_template['from']['trace'].copy()
                    tem_trace_dict = {}
                    tem_trace_dict['varname'] = win_var
                    tem_trace_dict['vartype'] = self.para_dict['var_meta'][win_var]['vartype']
                    tem_trace_dict['cut_condition'] = win_dict['condition_left']
                    left_trace.append(tem_trace_dict)

                    # Give the child its own trace
                    tem_left_dict['from']['trace'] = left_trace
                    tem_left_dict['now']['current_layers'] = tree_template['now']['current_layers'] + 1

                    # Inherit the excluded variables (the child does not reconsider them)
                    tem_left_dict['whatif']['except_var_list'] = tree_template['whatif']['except_var_list']

                    # Link the child to the parent
                    tree_template['to']['left_condition'] = win_dict['condition_left']
                    tree_template['to']['left'] = tem_left_dict

                    # Queue the child for a later iteration
                    self.para_dict['current_iter_list'].append(tree_template['to']['left'])
    
                # Build the right branch
                if is_right_branch_ok:
                    tem_right_dict = copy.deepcopy(self.tree_template_x)

                    # Copy the parent's trace
                    right_trace = tree_template['from']['trace'].copy()
                    tem_trace_dict = {}
                    tem_trace_dict['varname'] = win_var
                    tem_trace_dict['vartype'] = self.para_dict['var_meta'][win_var]['vartype']
                    tem_trace_dict['cut_condition'] = win_dict['condition_right']
                    right_trace.append(tem_trace_dict)

                    # Give the child its own trace
                    tem_right_dict['from']['trace'] = right_trace
                    tem_right_dict['now']['current_layers'] = tree_template['now']['current_layers'] + 1

                    # Inherit the excluded variables (the child does not reconsider them)
                    tem_right_dict['whatif']['except_var_list'] = tree_template['whatif']['except_var_list']

                    # Link the child to the parent
                    tree_template['to']['right_condition'] = win_dict['condition_right']
                    tree_template['to']['right'] = tem_right_dict

                    # Queue the child for a later iteration
                    self.para_dict['current_iter_list'].append(tree_template['to']['right'])
            else:
                # The node cannot be split any further: mark it as a leaf
                tree_template['now']['is_leaf'] = 1
                self.para_dict['leaf_list'].append(tree_template)
            return tree_template
    
        # Summarise the result of fit
        def summary(self):
            self.train_branch_dict = {}
            self.train_branch_dict['tier1'] = {}  # leaf nodes
            self.train_branch_dict['tier2'] = {}  # parents of the leaf nodes
            leaf_list = self.para_dict['leaf_list']
            res_dict = {}
            res_dict1 = {}
            for i in range(len(leaf_list)):
                cx = 'b' + str(i)
                chain_list = leaf_list[i]['from']['trace']
                res_dict[cx] = {}
                res_dict[cx]['data'] = iTree.filter_df_varattr_chain(df=self.para_dict['data'], chain_list=chain_list)
                res_dict[cx]['trace'] = chain_list
                res_dict[cx]['proba'] = res_dict[cx]['data'][self.para_dict['target_name']].mean()
                res_dict[cx]['class'] = 1 if res_dict[cx]['proba'] >= self.para_dict['class_thresh'] else 0
                res_dict[cx]['size'] = len(res_dict[cx]['data'])
                # If the trace is longer than 1, also record the parent node (one level up)
                if len(chain_list) > 1:
                    cx1 = cx + '_a1'
                    chain_list1  = chain_list[:-1]
                    res_dict1[cx1] = {}
                    res_dict1[cx1]['data'] = iTree.filter_df_varattr_chain(df=self.para_dict['data'], chain_list=chain_list1)
                    res_dict1[cx1]['trace'] = chain_list1
                    res_dict1[cx1]['proba'] = res_dict1[cx1]['data'][self.para_dict['target_name']].mean()
                    res_dict1[cx1]['class'] = 1 if res_dict1[cx1]['proba'] >= self.para_dict['class_thresh'] else 0
                    res_dict1[cx1]['size'] = len(res_dict1[cx1]['data'])
    
            self.train_branch_dict['tier1'] = res_dict
            self.train_branch_dict['tier2'] = res_dict1
    
            # Partition table: one row per leaf (tier1) and per parent node (tier2).
            # The rows are collected in a list because DataFrame.append is not available in pandas 2.x.
            partition_rows = []
            for tier, tier_dict in (('tier1', res_dict), ('tier2', res_dict1)):
                for k in tier_dict.keys():
                    partition_rows.append({
                        'partition': k,
                        'size': tier_dict[k]['size'],
                        'proba': tier_dict[k]['proba'],
                        'class': tier_dict[k]['class'],
                        'tier': tier,
                    })
            res_df = pd.DataFrame(partition_rows, columns=['partition', 'size', 'proba', 'class', 'tier'])

            self.train_partition_df = res_df
            # Rule table: one readable rule per leaf and per parent node
            def _condition_text(trace):
                # N (nominal): "var in [a, b]"; O/C: variable name followed by the cut condition
                return '&'.join([x['varname'] + ' in ' + str(list(x['cut_condition']))
                                 if x['vartype'] == 'N' else x['varname'] + x['cut_condition']
                                 for x in trace])

            rule_rows = []
            for k in res_dict.keys():
                k_trace = res_dict[k]['trace']
                rule_rows.append({
                    'rule': k,
                    'tier': 1,
                    'condition': _condition_text(k_trace),
                    'class': res_dict[k]['class'],
                    'proba': res_dict[k]['proba'],
                    'support': res_dict[k]['size'],
                })

                if len(k_trace) > 1:
                    # Rule for the parent node: drop the last step of the path
                    k1 = k + '_a1'
                    rule_rows.append({
                        'rule': k1,
                        'tier': 2,
                        'condition': _condition_text(k_trace[:-1]),
                        # Statistics of the parent node, not of the leaf
                        'class': res_dict1[k1]['class'],
                        'proba': res_dict1[k1]['proba'],
                        'support': res_dict1[k1]['size'],
                    })
            self.train_rules_df = pd.DataFrame(
                rule_rows, columns=['rule', 'tier', 'condition', 'class', 'proba', 'support'])
    
    
        # Predict new data
        def predict(self, data=None):
            # Work on a copy so that the caller's DataFrame is left unchanged
            data = data.copy()
            # Predict in two passes: leaf nodes first, then the parent nodes of leaves
            predict_df_list1 = []  # predictions made by leaf nodes
            predict_df_list2 = []  # predictions made by parent nodes
            for b in self.train_branch_dict['tier1'].keys():
                # Statistics and path of this leaf, as filled in by summary()
                tem_proba = self.train_branch_dict['tier1'][b]['proba']
                tem_class = self.train_branch_dict['tier1'][b]['class']
                tem_size = self.train_branch_dict['tier1'][b]['size']
                tem_chain_list = self.train_branch_dict['tier1'][b]['trace']
    
                # Select the records that fall into this leaf
                tem_predict_df = iTree.filter_df_varattr_chain(df=data, chain_list=tem_chain_list).copy()
                tem_predict_df['predict_class'] = tem_class
                tem_predict_df['predict_proba'] = tem_proba
                tem_predict_df['predict_support'] = tem_size
    
                # debug: remember which leaf made the prediction
                tem_predict_df['branch'] = b
    
                predict_df_list1.append(tem_predict_df)
            # Concatenate; the results are written back to data by index
            predict_df1 = pd.concat(predict_df_list1)
            
            # self.debug['a1'] = predict_df1
            # predict_df1 = predict_df1[~predict_df1.index.duplicated()]
            print('***totle recs to predict', len(data))
            print('***predict by leaf Node', predict_df1.shape)
            add_list = ['predict_proba', 'predict_class', 'predict_support']
            keep_list = [self.para_dict['target_name']] + add_list
            for x in add_list:
                data[x] = predict_df1[x]
    
            # If some records were not covered by any leaf, fall back to the parent nodes
            is_missing_predict = data['predict_proba'].notnull().sum() < len(data)
            if is_missing_predict:
                mis_data = data[~data['predict_proba'].notnull()].copy()
                for b in self.train_branch_dict['tier2'].keys():
                    tem_proba = self.train_branch_dict['tier2'][b]['proba']
                    tem_class = self.train_branch_dict['tier2'][b]['class']
                    tem_size = self.train_branch_dict['tier2'][b]['size']
                    tem_chain_list = self.train_branch_dict['tier2'][b]['trace']
    
                    # Select the uncovered records that fall into this parent node
                    tem_predict_df = iTree.filter_df_varattr_chain(df=mis_data, chain_list=tem_chain_list).copy()
                    tem_predict_df['predict_class'] = tem_class
                    tem_predict_df['predict_proba'] = tem_proba
                    tem_predict_df['predict_support'] = tem_size
                    predict_df_list2.append(tem_predict_df)
                predict_df2 = pd.concat(predict_df_list2)
                # Several parent nodes can match the same record; keep the first occurrence
                predict_df2 = predict_df2[~predict_df2.index.duplicated()]
                print('***predict by branch Node', predict_df2.shape)
                for x in add_list:
                    mis_data[x] = predict_df2[x]
                data = pd.concat([data[data['predict_proba'].notnull()], mis_data])
            return data[keep_list].sort_index()
            # return data
    

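    There is a lot of code here, so I will go through it piece by piece in later posts. Structurally, the iTree class consists of the following parts:
  • 1 tree_template_x: the template shared by every node of the decision tree, defined as a class attribute.
  • 2 @staticmethod functions: besides computing the key metrics, these define the group of functions that search for the best cut point of a variable. Variables are divided into three types: N (Nominal), O (Ordinal), and C (Continuous).
    • N: for example, sex
    • O: for example, education level
    • C: for example, income
  • 3 fit method: the main method that fits the decision tree. Its first_head and iter_head functions correspond to the first run of the algorithm and to the body of the loop, respectively. The main difference is that the first run uses the data as it is, while each loop iteration must first filter the data to obtain the subset available to the current node.
  • 4 predict method: predicts new data using the trained parameters.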
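To make the three variable types more concrete, here is a minimal sketch of how candidate splits could be enumerated and scored with Gini impurity. It is not the code inside iTree: the function names `gini`, `candidate_splits`, and `best_split` are hypothetical, ordinal values are assumed to be coded as numbers, and only the standard library is used.

```python
from itertools import combinations


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def candidate_splits(values, vartype):
    """Yield (description, predicate) pairs for one variable.

    N: every two-way partition of the categories, each listed once.
    O / C: thresholds at the midpoints between adjacent distinct values.
    """
    distinct = sorted(set(values))
    if vartype == 'N':
        if not distinct:
            return
        first, rest = distinct[0], distinct[1:]
        # Keep the first category on the left side so that a partition
        # and its mirror image are not both generated.
        for r in range(len(rest)):
            for combo in combinations(rest, r):
                left = {first, *combo}
                yield f"in {sorted(left)}", (lambda v, s=left: v in s)
    else:
        for lo, hi in zip(distinct, distinct[1:]):
            cut = (lo + hi) / 2
            yield f"<= {cut}", (lambda v, c=cut: v <= c)


def best_split(values, labels, vartype):
    """Return (weighted_gini, description) of the best split, or None if there is none."""
    best = None
    n = len(labels)
    for desc, pred in candidate_splits(values, vartype):
        left = [y for v, y in zip(values, labels) if pred(v)]
        right = [y for v, y in zip(values, labels) if not pred(v)]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[0]:
            best = (score, desc)
    return best


# Toy data, made up for illustration only
sex = ['f', 'f', 'm', 'm', 'm', 'f']
age = [10, 20, 30, 40, 50, 60]
survived = [1, 1, 0, 0, 0, 1]
print(best_split(sex, survived, 'N'))
print(best_split(age, survived, 'C'))
```

A nominal variable with k categories has 2^(k-1) - 1 two-way partitions, which is why a combinations-based search is needed for N, while O and C only need a scan over sorted cut points.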

    Usage is as follows:
        some_tree = iTree()
        df = pd.read_csv('train.csv')  # cleaned data
        raw_df = pd.read_csv('raw_train.csv')  # raw, unprocessed data (not used below)
    
        some_tree.fit(data=df, target_name='Survived')
        res_predict_df = some_tree.predict(data=df)
        from sklearn.metrics import classification_report
        predict_eva = classification_report(res_predict_df['Survived'], res_predict_df['predict_class'])
        print(predict_eva)
    
    

    The output is as follows:
    ***totle recs to predict 891
    ***predict by leaf Node (883, 15)
    ***predict by branch Node (8, 14)
    
                  precision    recall  f1-score   support
    
               0       0.84      0.88      0.86       549
               1       0.80      0.74      0.77       342
    
       micro avg       0.83      0.83      0.83       891
       macro avg       0.82      0.81      0.82       891
    weighted avg       0.83      0.83      0.83       891
    

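    Here df is the cleaned data and raw_df is the unprocessed data. Notice that the leaf nodes predicted 883 records, while 8 records were predicted by a branch node. What does this mean? The initial parameters require every leaf to hold at least 10 records, so when one side of a split would be smaller than that, no leaf is created for it, and records falling on that side have no leaf to predict them. In that case, one option is to make no prediction at all; the other is to make an approximate prediction with the parent of the missing leaf (that is, the branch node). Also, with the same depth of 3 layers, the result should differ little from K, which is good enough. That said, the Titanic example is simple and cannot really show how well the model performs (and of course, do not expect a decision tree to be spectacular). No train/test split has been done yet either, so this is not rigorous modeling. Later I will build a benchmark and compare several algorithms side by side.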
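The leaf-then-parent fallback described above can be sketched as follows. This is a simplified illustration, not the iTree implementation: a path is represented here as a list of plain predicate functions instead of the `varname` / `vartype` / `cut_condition` dictionaries stored in the trace, and the names `matches` and `predict_one` as well as the probabilities are made up for the example.

```python
def matches(record, trace):
    """True if the record satisfies every condition on the path."""
    return all(cond(record) for cond in trace)


def predict_one(record, leaves, parents):
    """Try the leaf nodes first; fall back to parent nodes if no leaf matches."""
    for tier, nodes in (('leaf', leaves), ('branch', parents)):
        for node in nodes:
            if matches(record, node['trace']):
                return tier, node['proba']
    return None, None


def is_female(r):
    return r['sex'] == 'f'


def is_male(r):
    return r['sex'] == 'm'


def older_than_10(r):
    return r['age'] > 10


# Hypothetical tree: the "male and age <= 10" side was too small to become a leaf.
leaves = [
    {'trace': [is_female], 'proba': 0.9},
    {'trace': [is_male, older_than_10], 'proba': 0.1},
]
# Parent of the second leaf: its path is the leaf's path without the last step.
parents = [
    {'trace': [is_male], 'proba': 0.2},
]

print(predict_one({'sex': 'f', 'age': 30}, leaves, parents))
print(predict_one({'sex': 'm', 'age': 5}, leaves, parents))
```

The second record matches no leaf, so it receives the coarser estimate of the parent node. This mirrors the small group of records reported under "predict by branch Node" in the output above.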
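    Having read this far, have you mastered decision trees in 10 minutes?
    In the following posts I will break the process down step by step (variable typing, the split search, and so on) and try to explain it as clearly as I can.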