Implementing a Decision Tree in Python (Series Part 4): My Decision Tree


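What you learn on paper always feels shallow; to really understand something, you have to code it yourself!
1 A simple way to think about algorithms
What is an algorithm? An algorithm is non-explicit programming (probably the most concise summary I have come across).
I think an algorithm can be understood informally as stir-frying. We know where we start (ingredients, seasonings). We also know when we are done (when the taste is close enough). The process in between, however, follows only a rough method, without exact quantities. For example, we know the dish only becomes salty if we add salt, but how much? How little is "a pinch of salt"? We taste and adjust repeatedly until it is right. There is a further problem: if we add the salt one microgram at a time, we will probably starve before the dish is ready, and if we throw it in by the handful, the whole pan is ruined.
So an algorithm is an elegant way of "adding salt": a method that reaches the taste you want after only a few tries.
For the problem at hand, how do we solve it with a decision tree? (The task is to predict the survival probability of Titanic passengers.)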

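1 Target: the variable is Survived (0, 1).

2 Initialization: we have the original data set, on which the Gini impurity can be computed.

3 Iteration: loop over all available variables, try to find the best split point, and obtain a new branch of the decision tree.

4 Termination: stop when a preset goal or limit is reached. There are quite a few such conditions:

  1 If the current data set has too few rows (the result has no statistical meaning), stop.

  2 If a leaf produced by the split would contain too little data (the result is equally unreliable), stop.

  3 If the target is already well separated (the result is already "pure"), stop.

  4 If the preset number of iterations is exceeded, stop.

  5 Other suitable termination conditions can be added.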
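The initialization and iteration steps can be tried out on a toy example. This is only a minimal sketch, not the iTree class from this article: the function names `gini_impurity`, `weighted_gini` and `best_threshold`, and the six-row Age/Survived sample, are made up for illustration. It computes the starting Gini impurity and then searches one variable for the threshold that lowers it the most.

```python
import pandas as pd


def gini_impurity(y):
    """Gini impurity of a binary 0/1 target: 1 - p1^2 - p0^2."""
    p1 = y.mean()
    return 1 - p1 ** 2 - (1 - p1) ** 2


def weighted_gini(mask, y):
    """Size-weighted Gini of the two groups selected by a boolean mask."""
    total = len(y)
    gini = 0.0
    for part in (y[mask], y[~mask]):
        if len(part) > 0:
            gini += len(part) / total * gini_impurity(part)
    return gini


def best_threshold(x, y):
    """Try each distinct value of x as an 'x < q' cut and keep the lowest Gini."""
    best_q, best_g = None, gini_impurity(y)
    for q in sorted(x.unique())[1:]:  # skip the minimum: 'x < min' selects nothing
        g = weighted_gini(x < q, y)
        if g < best_g:
            best_q, best_g = q, g
    return best_q, best_g


# Hypothetical toy data: the young passengers survive, the older ones do not
toy = pd.DataFrame({'Age': [5, 8, 12, 30, 40, 50],
                    'Survived': [1, 1, 1, 0, 0, 0]})

start_gini = gini_impurity(toy['Survived'])          # step 2: initial impurity
q, g = best_threshold(toy['Age'], toy['Survived'])   # step 3: one iteration
print(start_gini, q, g)
```

On this sample the cut Age < 30 separates the two classes completely, so the impurity falls from 0.5 to 0 and the "already pure" termination condition would apply to both children.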

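Based on this idea, I wrote a decision tree. Apart from a few imported packages, namely pandas, numpy, copy and combinations (used to enumerate combinations), everything is implemented directly in code. That is why the problem cannot be solved in 50 lines; I actually ended up writing about 800.
2 My decision tree code and how to call it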
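The role of `combinations` is easiest to see on a nominal variable, where a binary split means choosing a subset of levels for one side and leaving the rest for the other. The sketch below is an illustration only; the helper name `binary_partitions` is hypothetical and the three levels S, C, Q are assumed to stand for a variable such as Embarked.

```python
from itertools import combinations


def binary_partitions(levels):
    """List the (left, right) splits of a collection of category levels.

    Subsets of size 1 .. n//2 go to the left; the complement goes to the right.
    """
    levels = sorted(set(levels))
    splits = []
    for size in range(1, len(levels) // 2 + 1):
        for left in combinations(levels, size):
            right = tuple(v for v in levels if v not in left)
            splits.append((left, right))
    return splits


splits = binary_partitions(['S', 'C', 'Q', 'S'])
for left, right in splits:
    print(left, '|', right)
```

With three levels there are three candidate splits. Note that when the number of levels is even, subsets of exactly half the size produce each split twice (once per side), which costs some extra computation but does not change which split wins.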

import pandas as pd
import numpy as np
import copy
from itertools import combinations  # enumerates subsets of category levels

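class iTree():
    # Node template: each tree node is a nested dict
    tree_template_x = {}
    # Structure: from (path to the node) - now (node state, with whatif for the split competition) - to (children)
    tree_template_x['from'] = {}
    tree_template_x['now'] = {}
    tree_template_x['whatif'] = {}
    tree_template_x['to'] = {}

    tree_template_x['from']['trace'] = []  # chain of split conditions from the root to this node

    tree_template_x['now']['samples'] = None  # number of rows in this node
    tree_template_x['now']['vars_available'] = []  # variables still usable for splitting
    tree_template_x['now']['dummy_var_list'] = []  # variables with a single value in this node
    tree_template_x['now']['current_layers'] = 0  # depth of this node
    tree_template_x['now']['is_leaf'] = 0  # 1 when the node is a leaf

    # Classification tree
    tree_template_x['now']['target_counts'] = None  # rows with target == 1
    tree_template_x['now']['non_target_counts'] = None  # rows with target == 0
    tree_template_x['now']['gini'] = None  # Gini impurity
    tree_template_x['now']['prob'] = None  # share of target rows
    tree_template_x['now']['class'] = None  # predicted class
    # Regression tree
    tree_template_x['now']['mse'] = None  # mean squared error

    tree_template_x['whatif']['compete_win_varname'] = None  # variable that won the split competition
    tree_template_x['whatif']['compete_win_vartype'] = None  # type of the winning variable: N, O, C
    tree_template_x['whatif']['compete_win_gini'] = None  # weighted Gini of the winning split
    tree_template_x['whatif']['compete_win_mse'] = None  # squared error of the winning split
    tree_template_x['whatif']['except_var_list'] = []  # variables excluded from the competition

    tree_template_x['to']['left_condition'] = None  # condition selecting the left branch
    tree_template_x['to']['right_condition'] = None  # condition selecting the right branch

    tree_template_x['to']['left'] = None  # left child
    tree_template_x['to']['right'] = None  # right child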
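    # Split helpers
    '''
    Every split is binary; the candidate cuts depend on the variable type:
    - C (Continuous) -> quantile thresholds in 1% steps: at most about 100 candidates
    - N (Nominal)    -> subsets of the levels: sum of C(n, x) for x = 1 .. n//2
    - O (Ordinal)    -> one threshold per distinct value: at most n candidates
    '''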
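    # Tool 1: binary cuts for a continuous variable; takes a Series
    @staticmethod
    def cbcut(data=None, pstart=0.1, pend=0.9):
        data = data.copy()
        # By default only the 10th to 90th percentiles are used
        # linspace yields the quantile levels (0.1, 0.11, ... 0.9)
        # round() protects the count against floating-point error in (pend - pstart) * 100
        bins = int(round((pend - pstart) * 100)) + 1
        qlist = np.linspace(pstart, pend, bins)

        # Quantile values used as thresholds, de-duplicated (unique keeps the order)
        qtiles = data.quantile(qlist).unique()
        res_list = []
        for q in qtiles:
            data1 = data.apply(lambda x: 1 if x < q else 0)
            res_list.append(data1)
        res_dict = {}
        res_dict['data_list'] = res_list
        res_dict['qtiles'] = qtiles
        return res_dict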

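    # Tool 2: binary cuts for an ordinal variable; every distinct value except the first `start` ones is a threshold
    @staticmethod
    def obcut(data=None, start=1):
        data = data.copy()
        # Sort first so that `start` skips the smallest values ('x < min' would select nothing)
        qtiles = np.sort(data.dropna().unique())[start:]
        res_list = []
        for q in qtiles:
            data1 = data.apply(lambda x: 1 if x < q else 0)
            res_list.append(data1)
        res_dict = {}
        res_dict['data_list'] = res_list
        res_dict['qtiles'] = qtiles
        return res_dict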
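    # Tool 3: turn a list into a dict with a constant value, for use with map
    @staticmethod
    def list_key_dict(data = None ,fill_value = 1):
        return dict(zip(data, [fill_value]*len(data)))
    # Tool 4: swap keys and values (assumes the values are unique and hashable)
    @staticmethod
    def kv2vk(data =None ):
        new_dict = {}
        for k in data.keys():
            new_dict[data[k]] = k 
        return new_dict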
    # Tool 5: nominal variable, enumerate the subset combinations
    @staticmethod
    def nbcut(data=None):
        data = data.copy()
        var_set = set(data)
        comb_num = len(var_set) // 2

        # All subsets of var_set with 1 .. comb_num elements
        comb_list = []
        not_comb_list = []
        for i in range(comb_num):
            tem_num = i+1
            comb_list += list(combinations(var_set, tem_num))

        # Turn each subset into a selection dict for map
        comb_sel_list = []
        for clist in comb_list:
            comb_sel_list.append(iTree.list_key_dict(data=clist))
            # The complement goes into not_comb_list
            not_comb_list.append(list(var_set - set(clist)))

        res_list = []
        for comb_sel in comb_sel_list:
            data1 = data.map(comb_sel).fillna(0)
            res_list.append(data1)
        res_dict = {}
        res_dict['data_list'] = res_list  # 0/1 Series, one per candidate split
        res_dict['comb_list'] = comb_list  # levels on the selected (left) side
        res_dict['not_comb_list'] = not_comb_list  # levels on the other (right) side
        res_dict['comb_sel_list'] = comb_sel_list  # selection dicts passed to map
        return res_dict
    # Tool 6: collect the attributes of a variable
    @staticmethod
    def collect_var_attr(data=None, varname=None):
        data = data.copy()
        data1 = data.dropna().apply(str)
        # 1 number of missing values
        missing_num = len(data) - data.notnull().sum()
        # 2 missing rate
        missing_rate = missing_num / len(data)

        # 3 number of distinct levels
        levs = len(data1.unique())

        # 4 string-level checks
        ## 4.1 every value consists of digits only
        is_integer = data1.apply(lambda x: x.isdigit()).sum() == len(data1)

        # 5 every value contains a decimal point
        is_dot = data1.apply(
            lambda x: True if '.' in x else False).sum() == len(data1)

        # 6 if so, check that both sides of the point are digits
        if is_dot:
            is_dot_is_digit = data1.apply(lambda x: True if x.split(
                '.')[0].isdigit() and x.split('.')[1].isdigit() else False).sum() == len(data1)
        else:
            is_dot_is_digit = False
        # 7 values of the form digits.digits count as integers when every fractional part is 0,
        #   and as floats otherwise
        is_float = False
        if is_dot_is_digit:
            is_integer = data1.apply(lambda x: True if float(
                x.split('.')[1]) == 0 else False).sum() == len(data1)
            is_float = not is_integer
        # numeric or string
        try:
            data1.apply(float)
            is_num = True
            is_str = False
        except ValueError:
            is_num = False
            is_str = True
        # collect the results
        res_dict = {}
        res_dict['missing_num'] = missing_num
        res_dict['missing_rate'] = missing_rate

        res_dict['levs'] = levs
        res_dict['is_all_integer'] = is_integer
        res_dict['is_dot'] = is_dot
        res_dict['is_str'] = is_str
        res_dict['is_num'] = is_num

        res_dict['is_all_dot_and_digit'] = is_dot_is_digit
        res_dict['is_all_float'] = is_float
        return {varname: res_dict}
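    # Tool 7
    # Note: the variable type (C, N, O) is inferred with a simple rule:
    # strings are N, numbers with 20 or more levels are C, the rest are N (O is never returned)
    @staticmethod
    def infer_var_type(data=None):
        if data['is_str']:
            res = 'N'
        else:
            if data['levs'] >=20:
                res = 'C'
            else:
                res = 'N'
        return res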
    # Tool 8: infer the variable types of a DataFrame
    @staticmethod
    def df_infer_var_type(df, cols=None):
        var_meta_dict = {}
        if cols is None:
            cols = list(df.columns)
        for col in cols:
            tem_var_meta_dict = iTree.collect_var_attr(df[col], varname=col)
            var_meta_dict.update(tem_var_meta_dict)
        for k in var_meta_dict.keys():
            res = iTree.infer_var_type(var_meta_dict[k])
            var_meta_dict[k]['vartype'] = res
        return var_meta_dict

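    # Tool 9: put x and y (both Series) into one DataFrame and drop rows with missing values
    # The columns are renamed to x and y
    # x is the feature, y is the target
    @staticmethod
    def align_xy(x = None, y = None ):
        tem_df = pd.DataFrame()
        tem_df['x'] = x.copy()
        tem_df['y'] = y.copy()
        return tem_df.dropna()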
    
    # Tool 10: Gini impurity
    # For a binary target: 1 - p1^2 - p0^2
    @staticmethod
    def cal_gini_impurity(target_count = None, total_count = None):
        p1 = target_count / total_count
        p0 = 1 - p1
        return 1 - p1**2 - p0**2
    
    # Tool 11: weighted Gini of a split
    @staticmethod
    def get_gini(x=None, y=None):
        tem_df = iTree.align_xy(x=x, y=y)
        # Group values (x is a 0/1 split indicator, so there are at most two groups)
        vals = tem_df['x'].unique()
        _gini = 0
        total_recs = len(tem_df)
        for val in vals:
            # rows of this group
            tem_df1 = tem_df[tem_df['x'] == val]
            # weight of the group
            tem_weight = len(tem_df1) / total_recs
            # number of target rows
            tem_target_count = tem_df1['y'].sum()
            # Gini of the group
            tem_gini = iTree.cal_gini_impurity(
                target_count=tem_target_count, total_count=len(tem_df1))
            _gini += tem_weight * tem_gini
        return _gini
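    # Tool 12: squared error of a split (sum of squared errors over the groups)
    # y_i: observed value
    # c_i: mean of the group the row belongs to
    @staticmethod
    def get_mse(x = None, y = None):
        tem_df = iTree.align_xy(x=x, y=y)
        # Group values (x is a 0/1 split indicator, so there are at most two groups)
        vals = tem_df['x'].unique()
        _mse = 0
        for val in vals:
            # rows of this group
            tem_df1 = tem_df[tem_df['x'] == val]
            # group mean
            tem_df1_y_mean = tem_df1['y'].mean()
            # tem_mse = (tem_df1['y'] - tem_df1_y_mean).apply(lambda x: x**2).sum() / len(tem_df1['y']) # per-group mean variant, not used
            tem_mse = (tem_df1['y'] - tem_df1_y_mean).apply(lambda x: x**2).sum() 
            _mse += tem_mse
        return _mse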

    # Tool 13:
    # For one variable, find the split with the lowest Gini
    # x is the feature, y is the target
    @staticmethod
    def find_min_gini(x=None, y=None, varname=None, vartype=None, start=1, pstart=0.1, pend=0.9):
        # Validate the variable type
        assert vartype in [
            'C', 'O', 'N'], 'Only Accept Vartype C(continuous), O(Ordinal), N(Nominal)'
        if vartype == 'N':
            res_dict = iTree.nbcut(data=x)
        elif vartype == 'O':
            res_dict = iTree.obcut(data=x, start=start)
        else:
            res_dict = iTree.cbcut(data=x, pstart=pstart, pend=pend)
        # Compute the Gini of every candidate split
        tem_gini_list = []
        for i in range(len(res_dict['data_list'])):
            tem_gini = iTree.get_gini(res_dict['data_list'][i], y)
            tem_gini_list.append(tem_gini)
        # index + min give the position of the best split
        min_gini = min(tem_gini_list)
        mpos = tem_gini_list.index(min_gini)
        if vartype == 'N':
            # Nominal: the condition is membership (in); left is the selected subset, right is its complement
            condition_left = res_dict['comb_list'][mpos]
            condition_right = res_dict['not_comb_list'][mpos]
        else:
            # Otherwise the condition is a threshold (< q or >= q)
            # It is stored as a string and parsed again when filtering
            condition_left = '<' + str(res_dict['qtiles'][mpos])
            condition_right = '>=' + str(res_dict['qtiles'][mpos])
        new_res_dict = {}
        new_res_dict[varname] = {}
        new_res_dict[varname]['gini'] = min_gini
        new_res_dict[varname]['condition_left'] = condition_left
        new_res_dict[varname]['condition_right'] = condition_right
        return new_res_dict

    # Tool 14
    # For one variable, find the split with the lowest mse
    # x is the feature, y is the target
    # https://blog.csdn.net/zpalyq110/article/details/79527653
    @staticmethod
    def find_min_mse(x=None, y=None, varname=None, vartype=None, start=1, pstart=0.1, pend=0.9):
        # Validate the variable type
        assert vartype in [
            'C', 'O', 'N'], 'Only Accept Vartype C(continuous), O(Ordinal), N(Nominal)'
        if vartype == 'N':
            res_dict = iTree.nbcut(data=x)
        elif vartype == 'O':
            res_dict = iTree.obcut(data=x, start=start)
        else:
            res_dict = iTree.cbcut(data=x, pstart=pstart, pend=pend)
        # Compute the squared error of every candidate split
        tem_mse_list = []
        for i in range(len(res_dict['data_list'])):
            tem_mse = iTree.get_mse(res_dict['data_list'][i], y)
            tem_mse_list.append(tem_mse)
        # index + min give the position of the best split
        min_mse = min(tem_mse_list)
        mpos = tem_mse_list.index(min_mse)
        if vartype == 'N':
            # Nominal: the condition is membership (in); left is the selected subset, right is its complement
            condition_left = res_dict['comb_list'][mpos]
            condition_right = res_dict['not_comb_list'][mpos]
        else:
            # Otherwise the condition is a threshold (< q or >= q)
            # It is stored as a string and parsed again when filtering
            condition_left = '<' + str(res_dict['qtiles'][mpos])
            condition_right = '>=' + str(res_dict['qtiles'][mpos])
        new_res_dict = {}
        new_res_dict[varname] = {}
        new_res_dict[varname]['mse'] = min_mse
        new_res_dict[varname]['condition_left'] = condition_left
        new_res_dict[varname]['condition_right'] = condition_right
        return new_res_dict
    
    # Tool 15
    # Find the min or max of an attribute in a dict of dicts
    @staticmethod
    def find_dict_minmax(some_dict = None, attrname = None, method='min'):
        klist = []
        vlist = []
        except_list = []
        for k in some_dict.keys():
            try:
                attr = some_dict[k][attrname]
                klist.append(k)
                vlist.append(attr)
            except (KeyError, TypeError):
                print('fail to find the val', k)
                except_list.append(k)
        # Pick the extreme value and its key
        if method.strip().lower() == 'min':
            the_val = min(vlist)
            the_key = klist[vlist.index(the_val)]
        elif method.strip().lower() == 'max':
            the_val = max(vlist)
            the_key = klist[vlist.index(the_val)]
        else:
            the_val, the_key = None, None
        return the_val, the_key, except_list

    # Tool 16
    # Filter a DataFrame by one split condition
    # Two cases: N and non-N
    # N: the condition is a collection of levels; matching rows are selected with map
    # non-N (O, C): the condition is a string such as '<q' or '>=q', parsed into an operator and a threshold
    @staticmethod
    def filter_df_varattr(df=None, varname=None, vartype=None, condition=None):
        assert vartype in ['C', 'O', 'N'], 'Vartype must in C/O/N '
        if vartype == 'N':  # nominal: select with map
            tem_map_dict = dict(zip(condition, [True]*len(condition)))
            _tem_sel = df[varname].map(tem_map_dict)
            res_df = df[_tem_sel.notnull()]
            # df['_tem_sel'] = df[varname].map(tem_map_dict)
            # res_df = df[df['_tem_sel'].notnull()]
        else:
            if '<' in condition:
                val = float(condition.replace('<', ''))
                res_df = df[df[varname] < val]
            elif '>=' in condition:
                val = float(condition.replace('>=', ''))
                res_df = df[df[varname] >= val]
            else:
                raise ValueError('condition symbol error, expected < or >=: ' + str(condition))
        # del df['_tem_sel']
        return res_df

    # Tool 17
    # Apply a chain of conditions; each item holds (varname, vartype, cut_condition) and the filters are applied in order
    @staticmethod
    def filter_df_varattr_chain(df=None, chain_list=None):
        tem_df = df.copy()
        for the_chain in chain_list:
            varname = the_chain['varname']
            vartype = the_chain['vartype']
            condition = the_chain['cut_condition']
            tem_df = iTree.filter_df_varattr(
                df=tem_df, varname=varname, vartype=vartype, condition=condition)
        return tem_df
    
    # Tool 18
    # Find the columns that hold a single value in the current data; such columns cannot be used to split
    @staticmethod
    def find_dummy_var(df, cols=None, exe_cols=None):
        if cols is None:
            cols = list(df.columns)
        if exe_cols is not None:
            exe_cols1 = [x for x in exe_cols if len(x.strip()) > 0]
            cols = list(set(cols) - set(exe_cols1))
        res_list = []
        for c in cols:
            if len(df[c].unique()) == 1:
                res_list.append(c)
        return res_list




    # ============== instance methods =============
    def __init__(self):
        # Results kept after training | history / branches / rules / partitions
        self.train_history_list = []
        self.train_branch_dict = None
        self.train_rules_df = None
        self.train_partition_df = None

        # debug
        self.debug = {}

    # Training
    def fit(self, data=None, target_name=None, id_name=None, time_name=None, tree_type='classification',
            max_iter=1000, min_sample_to_split=100, min_sample_to_predict=10, max_depth=3, gini_thresh=0,
            mse_thresh=0, improve_ratio=0, class_thresh=0.5):
        assert data is not None and not data.empty and target_name, 'data and target_name are required'
        self.para_dict = {}
        self.para_dict['max_iter'] = max_iter  # 1 maximum number of iterations
        # 2 minimum number of rows a node needs to be split
        self.para_dict['min_sample_to_split'] = min_sample_to_split
        # 3 minimum number of rows a branch needs to make predictions
        self.para_dict['min_sample_to_predict'] = min_sample_to_predict
        self.para_dict['max_depth'] = max_depth  # 4 maximum depth of the tree
        self.para_dict['gini_thresh'] = gini_thresh  # 5 a Gini below this needs no further split
        self.para_dict['mse_thresh'] = mse_thresh  # 6 an MSE below this needs no further split
        self.para_dict['improve_ratio'] = improve_ratio  # 7 required Gini/MSE improvement ratio
        self.para_dict['class_thresh'] = class_thresh  # 8 probability threshold for class 1
        self.para_dict['current_iter_list'] = []  # 9 nodes waiting to be processed
        self.para_dict['history_dict'] = {}  # 10 the tree itself, starting at the root
        # 11 leaf nodes (terminal nodes) - every leaf is appended here; summary() turns them into partitions and rules
        self.para_dict['leaf_list'] = []
        self.para_dict['data'] = data  # 12 training data, pd.DataFrame
        self.para_dict['target_name'] = target_name  # 13 name of the target column
        self.para_dict['id_name'] = id_name  # 14 name of the ID column
        self.para_dict['time_name'] = time_name  # 15 name of the time column
        self.para_dict['var_meta'] = iTree.df_infer_var_type(self.para_dict['data'])  # 16 variable attributes and inferred types
        self.para_dict['tree_type'] = tree_type  # 17 'classification', anything else is treated as regression

        # --- start iterating
        # 1 root node
        iter_cnt = 0
        tree_template, tem_df = self.first_head()
        tree_template = self.iter_body(tree_template = tree_template, tem_df = tem_df)
        self.para_dict['history_dict'] = tree_template
        while len(self.para_dict['current_iter_list']) > 0:
            iter_cnt += 1
            if (iter_cnt) >= self.para_dict['max_iter']:
                break
            tree_template, tem_df = self.iter_head()
            _ = self.iter_body(tree_template=tree_template, tem_df=tem_df)
        self.summary()

    # Head of the first (root) iteration
    def first_head(self):
        tree_template = copy.deepcopy(self.tree_template_x)
        tem_df = self.para_dict['data']  # root node: the full training data
        target_name = self.para_dict['target_name']
        exe_cols = [target_name, self.para_dict['id_name'], self.para_dict['time_name']]
        exe_cols = [x for x in exe_cols if x]  # drop names that were not given
        tree_template['whatif']['except_var_list'] = list(set(tree_template['whatif']['except_var_list'] + exe_cols))
        tree_template['now']['samples'] = len(tem_df)
        tree_template['now']['dummy_var_list'] = iTree.find_dummy_var(tem_df, exe_cols= exe_cols)
        # Available variables: all columns minus single-valued columns and excluded columns (target, id, time)
        tree_template['now']['vars_available'] = list(set(tem_df.columns) - set(tree_template['now']['dummy_var_list']) -
                                                  set(tree_template['whatif']['except_var_list']))
        return tree_template, tem_df
    # Head of every later iteration
    def iter_head(self):
        tree_template = self.para_dict['current_iter_list'].pop()
        chain_list = tree_template['from']['trace']
        tem_df = iTree.filter_df_varattr_chain(df=self.para_dict['data'], chain_list=chain_list)
        tree_template['whatif']['except_var_list'] = list(set(tree_template['whatif']['except_var_list'] + [x['varname'] for x in chain_list]))
        tree_template['now']['samples'] = len(tem_df)
        tree_template['now']['dummy_var_list'] = iTree.find_dummy_var(tem_df)
        # Available variables: all columns minus single-valued columns and excluded columns (including those already used on the path)
        tree_template['now']['vars_available'] = list(set(tem_df.columns) -
                                                    set(tree_template['now']['dummy_var_list']) -
                                                    set(tree_template['whatif']['except_var_list']))
        return tree_template, tem_df


    # Body of an iteration
    def iter_body(self, tree_template = None, tem_df = None):
        target_name = self.para_dict['target_name']
        # Can the node be split? It needs enough rows, the depth limit must not be reached, and variables must be left
        is_enough_sample = tree_template['now']['samples'] >= self.para_dict['min_sample_to_split']
        is_depth_ok = tree_template['now']['current_layers'] < self.para_dict['max_depth']
        is_var_available = len(tree_template['now']['vars_available']) > 0
        # Node statistics, and whether the impurity still needs improving
        if self.para_dict['tree_type'] == 'classification':
            tree_template['now']['target_counts'] = tem_df[target_name].apply(int).sum()
            tree_template['now']['non_target_counts'] = tree_template['now']['samples'] - tree_template['now']['target_counts']
            tree_template['now']['gini'] = iTree.cal_gini_impurity(tree_template['now']['target_counts'], tree_template['now']['samples'])
            tree_template['now']['prob'] = tree_template['now']['target_counts'] / tree_template['now']['samples']
            tree_template['now']['class'] = 1 if tree_template['now']['prob'] >= self.para_dict['class_thresh'] else 0
            # Below the threshold no further split is needed (gini and mse are >= 0, so the default threshold 0 never triggers this)
            if tree_template['now']['gini'] < self.para_dict['gini_thresh']:
                is_kpi_need_improve = False
            else:
                is_kpi_need_improve = True
        else:
            y_mean = tem_df[target_name].mean()
            y = tem_df[target_name]
            tree_template['now']['mse'] = (y - y_mean).apply(lambda x: x**2).mean()
            if tree_template['now']['mse'] < self.para_dict['mse_thresh']:
                is_kpi_need_improve = False
            else:
                is_kpi_need_improve = True
        if all([is_enough_sample, is_depth_ok, is_var_available, is_kpi_need_improve]):
            is_compete = True
        else:
            is_compete = False
        # ====
        # Note | nominal variables with many levels: 1. derive a coarser variable first, e.g. Name -> Title; 2. otherwise they are skipped below, because enumerating their subsets is too expensive
        if is_compete:
            # N variables with too many levels (more than 20) are excluded
            N_too_many_lev = [x for x in tree_template['now']['vars_available'] if self.para_dict['var_meta'][x]['vartype'] == 'N' and len(tem_df[x].unique()) > 20]
            tree_template['whatif']['except_var_list'] = tree_template['whatif']['except_var_list'] + N_too_many_lev
            compete_dict = {}
            cols = list(
                set(tree_template['now']['vars_available']) - set(N_too_many_lev))
            for c in cols:
                tem_varname = c
                tem_vartype = self.para_dict['var_meta'][c]['vartype']
                print(tem_varname, tem_vartype)
                if self.para_dict['tree_type'] == 'classification':
                    tem_res_dict = iTree.find_min_gini(x=tem_df[c], y=tem_df[target_name], varname=tem_varname, vartype=tem_vartype)
                else:
                    tem_res_dict = iTree.find_min_mse(x=tem_df[c], y=tem_df[target_name], varname=tem_varname, vartype=tem_vartype)
                compete_dict.update(tem_res_dict)
            if self.para_dict['tree_type'] == 'classification':
                win_gini, win_var, except_list = iTree.find_dict_minmax(some_dict=compete_dict, attrname='gini')
                tree_template['whatif']['compete_win_varname'] = win_var
                tree_template['whatif']['compete_win_gini'] = win_gini
                tree_template['whatif']['except_var_list'] = tree_template['whatif']['except_var_list'] + except_list
                # win_gini is not compared with the node's own gini here
            else:
                win_mse, win_var, except_list = iTree.find_dict_minmax(some_dict=compete_dict, attrname='mse')
                tree_template['whatif']['compete_win_varname'] = win_var
                tree_template['whatif']['compete_win_mse'] = win_mse
                tree_template['whatif']['except_var_list'] = tree_template['whatif']['except_var_list'] + except_list
            # Build the candidate branches from the winning split
            # A branch is kept only if it has enough rows to predict (min_sample_to_predict); otherwise that side is not grown
            # Note: rows with a missing value in the winning variable match neither condition (Embarked, for example, has a few missing values, and nan is dropped)
            win_dict = compete_dict[win_var]
            left_candidate_df = iTree.filter_df_varattr(df=tem_df, varname=win_var,
                                                vartype=self.para_dict['var_meta'][win_var]['vartype'],
                                                condition=win_dict['condition_left'])
            is_left_branch_ok = len(left_candidate_df) >= self.para_dict['min_sample_to_predict']
            right_candidate_df = iTree.filter_df_varattr(df=tem_df, varname=win_var,
                                                vartype=self.para_dict['var_meta'][win_var]['vartype'],
                                                condition=win_dict['condition_right'])
            is_right_branch_ok = len(right_candidate_df) >= self.para_dict['min_sample_to_predict']
            # If neither branch has enough rows, the node becomes a leaf
            # if not all([is_left_branch_ok, is_right_branch_ok]):
            #  -- the check above was wrong: with one good and one bad branch, all([True, False]) = False and not makes it True, so the node was marked as a leaf although one branch is still grown
            if not is_left_branch_ok and not is_right_branch_ok:
                tree_template['now']['is_leaf'] = 1
                self.para_dict['leaf_list'].append(tree_template)

            # Left branch
            if is_left_branch_ok:
                tem_left_dict = copy.deepcopy(self.tree_template_x)

                # extend the trace
                left_trace = tree_template['from']['trace'].copy()
                tem_trace_dict = {}
                tem_trace_dict['varname'] = win_var
                tem_trace_dict['vartype'] = self.para_dict['var_meta'][win_var]['vartype']
                tem_trace_dict['cut_condition'] = win_dict['condition_left']
                left_trace.append(tem_trace_dict)

                # store the trace in the child
                tem_left_dict['from']['trace'] = left_trace
                tem_left_dict['now']['current_layers'] = tree_template['now']['current_layers'] + 1

                # pass on the exclusion list (variables used on the path are not reused)
                tem_left_dict['whatif']['except_var_list'] = tree_template['whatif']['except_var_list']

                # link the child to this node
                tree_template['to']['left_condition'] = win_dict['condition_left']
                tree_template['to']['left'] = tem_left_dict

                # queue the child for processing
                self.para_dict['current_iter_list'].append(tree_template['to']['left'])

            # Right branch
            if is_right_branch_ok:
                tem_right_dict = copy.deepcopy(self.tree_template_x)

                # extend the trace
                right_trace = tree_template['from']['trace'].copy()
                tem_trace_dict = {}
                tem_trace_dict['varname'] = win_var
                tem_trace_dict['vartype'] = self.para_dict['var_meta'][win_var]['vartype']
                tem_trace_dict['cut_condition'] = win_dict['condition_right']
                right_trace.append(tem_trace_dict)

                # store the trace in the child
                tem_right_dict['from']['trace'] = right_trace
                tem_right_dict['now']['current_layers'] = tree_template['now']['current_layers'] + 1

                # pass on the exclusion list (variables used on the path are not reused)
                tem_right_dict['whatif']['except_var_list'] = tree_template['whatif']['except_var_list']

                # link the child to this node
                tree_template['to']['right_condition'] = win_dict['condition_right']
                tree_template['to']['right'] = tem_right_dict

                # queue the child for processing
                self.para_dict['current_iter_list'].append(tree_template['to']['right'])
        else:
            # No competition possible, so this node is a leaf
            tree_template['now']['is_leaf'] = 1
            self.para_dict['leaf_list'].append(tree_template)
        return tree_template

    # Summarise the result of fit
    def summary(self):
        self.train_branch_dict = {}
        self.train_branch_dict['tier1'] = {}  # leaf partitions
        self.train_branch_dict['tier2'] = {}  # parent partitions of the leaves (fallback)
        leaf_list = self.para_dict['leaf_list']
        res_dict = {}
        res_dict1 = {}
        for i in range(len(leaf_list)):
            cx = 'b' + str(i)
            chain_list = leaf_list[i]['from']['trace']
            res_dict[cx] = {}
            res_dict[cx]['data'] = iTree.filter_df_varattr_chain(df=self.para_dict['data'], chain_list=chain_list)
            res_dict[cx]['trace'] = chain_list
            res_dict[cx]['proba'] = res_dict[cx]['data'][self.para_dict['target_name']].mean()
            res_dict[cx]['class'] = 1 if res_dict[cx]['proba'] >= self.para_dict['class_thresh'] else 0
            res_dict[cx]['size'] = len(res_dict[cx]['data'])
            # If the trace has more than one step, also keep the parent partition
            if len(chain_list) > 1:
                cx1 = cx + '_a1'
                chain_list1  = chain_list[:-1]
                res_dict1[cx1] = {}
                res_dict1[cx1]['data'] = iTree.filter_df_varattr_chain(df=self.para_dict['data'], chain_list=chain_list1)
                res_dict1[cx1]['trace'] = chain_list1
                res_dict1[cx1]['proba'] = res_dict1[cx1]['data'][self.para_dict['target_name']].mean()
                res_dict1[cx1]['class'] = 1 if res_dict1[cx1]['proba'] >= self.para_dict['class_thresh'] else 0
                res_dict1[cx1]['size'] = len(res_dict1[cx1]['data'])

        self.train_branch_dict['tier1'] = res_dict
        self.train_branch_dict['tier2'] = res_dict1

        # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
        partition_rows = []

        for k in res_dict.keys():
            tem_dict = {}
            tem_dict['partition'] = k
            tem_dict['size'] = res_dict[k]['size']
            tem_dict['proba'] = res_dict[k]['proba']
            tem_dict['class'] = res_dict[k]['class']
            tem_dict['tier'] = 'tier1'
            partition_rows.append(tem_dict)
        for k in res_dict1.keys():
            tem_dict = {}
            tem_dict['partition'] = k
            tem_dict['size'] = res_dict1[k]['size']
            tem_dict['proba'] = res_dict1[k]['proba']
            tem_dict['class'] = res_dict1[k]['class']
            tem_dict['tier'] = 'tier2'
            partition_rows.append(tem_dict)

        res_df = pd.DataFrame(partition_rows, columns = ['partition','size','proba','class','tier'])
        self.train_partition_df = res_df
        # Extract the rules
        # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
        rule_rows = []

        for k in res_dict.keys():
            k_trace = res_dict[k]['trace']
            tem_dict = {}
            tem_dict['rule'] = k
            tem_dict['tier'] = 1
            tem_dict['condition'] = '&'.join([x['varname'] + ' in ' + str(list(x['cut_condition'])) if x['vartype'] == 'N' else x['varname'] +  x['cut_condition'] for x in k_trace ])
            tem_dict['class'] = res_dict[k]['class']
            tem_dict['proba'] = res_dict[k]['proba']
            tem_dict['support'] = res_dict[k]['size']
            rule_rows.append(tem_dict)

            if len(k_trace) > 1:
                # Rule of the parent partition; its statistics come from res_dict1, not from the leaf
                tem_dict = {}
                k_trace1 = k_trace[:-1]
                k1 = k + '_a1'
                tem_dict['rule'] = k1
                tem_dict['tier'] = 2
                tem_dict['condition'] = '&'.join([x['varname'] + ' in ' + str(list(x['cut_condition']))
                                            if x['vartype'] == 'N' else x['varname'] + x['cut_condition'] for x in k_trace1])
                tem_dict['proba'] = res_dict1[k1]['proba']
                tem_dict['class'] = res_dict1[k1]['class']
                tem_dict['support'] = res_dict1[k1]['size']
                rule_rows.append(tem_dict)
        rule_df = pd.DataFrame(rule_rows, columns=['rule', 'tier' ,'condition', 'class', 'proba','support'])
        self.train_rules_df = rule_df


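    # Prediction
    def predict(self, data=None):
        # Predict with leaf nodes first (branch nodes are the fallback)
        predict_df_list1 = [] # frames predicted by leaf nodes
        predict_df_list2 = [] # frames predicted by branch nodes (fallback)
        for b in self.train_branch_dict['tier1'].keys():
            # Prediction attributes stored for this leaf
            tem_proba = self.train_branch_dict['tier1'][b]['proba']
            tem_class = self.train_branch_dict['tier1'][b]['class']
            tem_size = self.train_branch_dict['tier1'][b]['size']
            tem_chain_list = self.train_branch_dict['tier1'][b]['trace']

            # Rows of data that satisfy this leaf's chain of conditions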
            tem_predict_df = iTree.filter_df_varattr_chain(df = data, chain_list= tem_chain_list)
            tem_predict_df['predict_class'] = tem_class
            tem_predict_df['predict_proba'] = tem_proba
            tem_predict_df['predict_support'] = tem_size

            # debug 
            tem_predict_df['branch'] = b

            predict_df_list1.append(tem_predict_df)
        # Concatenate; rows keep their original index
        predict_df1 = pd.concat(predict_df_list1)
        
        # self.debug['a1'] = predict_df1
        # predict_df1 = predict_df1[~predict_df1.index.duplicated()]
        print('***totle recs to predict', len(data))
        print('***predict by leaf Node', predict_df1.shape)
        add_list = ['predict_proba', 'predict_class', 'predict_support']
        keep_list= [self.para_dict['target_name']] + add_list
        for x in add_list:
            data[x] = predict_df1[x]

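        # If some rows were not matched by any leaf, fall back to branch nodes
        is_missing_predict = data['predict_proba'].isnull().any()
        if is_missing_predict:
            # .copy() so the column assignments below do not write into a view of data
            mis_data = data[data['predict_proba'].isnull()].copy()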
            for b in self.train_branch_dict['tier2'].keys():
                tem_proba = self.train_branch_dict['tier2'][b]['proba']
                tem_class = self.train_branch_dict['tier2'][b]['class']
                tem_size = self.train_branch_dict['tier2'][b]['size']
                tem_chain_list = self.train_branch_dict['tier2'][b]['trace']

                # Rows of mis_data that satisfy this branch node's chain of conditions
                tem_predict_df = iTree.filter_df_varattr_chain(df=mis_data, chain_list=tem_chain_list)
                tem_predict_df['predict_class'] = tem_class
                tem_predict_df['predict_proba'] = tem_proba
                tem_predict_df['predict_support'] = tem_size
                predict_df_list2.append(tem_predict_df)
            predict_df2 = pd.concat(predict_df_list2)
            # A row may match more than one branch node; keep only its first match
            predict_df2 = predict_df2[~predict_df2.index.duplicated()]
            print('***predict by branch Node', predict_df2.shape)
            for x in add_list:
                mis_data[x] = predict_df2[x]
            data = pd.concat([data[data['predict_proba'].notnull()],mis_data])
        return data[keep_list].sort_index()
        # return data

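There is a lot of code here, so I will unpack it piece by piece in later posts. Structurally, the iTree class has four parts.

1 tree_template_x: the template shared by every node of the decision tree, defined as a class attribute.

2 @staticmethod: besides the key metric calculations, this is a batch of functions that search for the best cut point of a variable. Variables are divided into three types here: N (Nominal), O (Ordinal) and C (Continuous).

N: for example, sex

O: for example, education level

C: for example, income

3 fit method: the main method that fits the decision tree. Inside it, first_head and iter_head are the functions for the first run and for the loop body respectively. The main difference is that the first run uses the data as given, with no filtering, while each loop iteration has to filter the data to get the dataset currently available at that node.

4 predict method: uses the trained parameters to predict new data.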
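To make the N/O/C distinction concrete, here is a minimal sketch of how candidate splits might be enumerated for each variable type and scored with Gini. It is not the code inside iTree: the function names `nominal_splits`, `threshold_splits`, `gini` and `split_gini` are hypothetical, and it assumes a 0/1 target and ordinal values that are already encoded as ordered numbers.

```python
from itertools import combinations


def nominal_splits(values):
    """N variables: every way to divide the categories into two non-empty groups."""
    cats = sorted(set(values))
    splits = []
    # Subsets up to half the size are enough; larger ones are mirror images
    for r in range(1, len(cats) // 2 + 1):
        for left in combinations(cats, r):
            right = tuple(c for c in cats if c not in left)
            # When both sides have equal size each split shows up twice; keep one
            if len(left) == len(right) and left > right:
                continue
            splits.append((left, right))
    return splits


def threshold_splits(values):
    """O and C variables: midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p * p - (1 - p) * (1 - p)


def split_gini(left_labels, right_labels):
    """Size-weighted Gini of the two sides of a split (lower is better)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini(left_labels)
            + len(right_labels) * gini(right_labels)) / n


print(nominal_splits(['S', 'C', 'Q', 'S']))
print(threshold_splits([22, 38, 26, 35]))
print(split_gini([0, 0], [1, 1]))
```

A nominal variable with k categories has 2^(k-1) - 1 distinct two-way splits, which is why the search over N variables needs `combinations` while O and C variables only need a scan over thresholds.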

Usage is as follows:

    from sklearn.metrics import classification_report

    some_tree = iTree()
    df = pd.read_csv('train.csv') # cleaned data
    raw_df = pd.read_csv('raw_train.csv') # raw, unprocessed data

    some_tree.fit(data=df, target_name='Survived')
    res_predict_df = some_tree.predict(data=df)
    predict_eva = classification_report(res_predict_df['Survived'], res_predict_df['predict_class'])
    print(predict_eva)

The output is as follows:

***totle recs to predict 891
***predict by leaf Node (883, 15)
***predict by branch Node (8, 14)

              precision    recall  f1-score   support

           0       0.84      0.88      0.86       549
           1       0.80      0.74      0.77       342

   micro avg       0.83      0.83      0.83       891
   macro avg       0.82      0.81      0.82       891
weighted avg       0.83      0.83      0.83       891
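As a quick way to read the report, the f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values printed above; `f1_score` is a local helper written for illustration, not the scikit-learn function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the report above
print(round(f1_score(0.84, 0.88), 2))  # class 0 -> 0.86
print(round(f1_score(0.80, 0.74), 2))  # class 1 -> 0.77
```

Both values match the f1-score column of the report.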

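Here df is the cleaned data and raw_df is the unprocessed data. Notice that 883 records were predicted by leaf nodes and 8 records by branch nodes. What does that mean? The initial parameters require every leaf node to hold at least 10 records, so a would-be leaf smaller than that is never split off, and there is no leaf available to predict its records. In that situation one option is to make no prediction at all; the other is to approximate with the parent of that leaf (the branch node). The tree is likewise limited to 3 layers, so the result should be roughly on par with K's, which is good enough. That said, the Titanic example is simple and cannot really show how well a model performs (and of course, do not expect a decision tree to be spectacular). There is also no train/test split yet, so this is not rigorous modeling. Later I will build a benchmark and compare several algorithms side by side.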
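The leaf-then-branch fallback can be sketched without pandas. This is only an illustration of the idea, not the predict method itself: the records are plain dicts, `step_matches`, `matches` and `predict_two_tier` are hypothetical names, and the `'<=x'` / `'>x'` string encoding of O/C conditions is an assumption. The step fields `varname`, `vartype` and `cut_condition` follow the ones used when the rule table is built.

```python
def step_matches(record, step):
    """Return True if one record satisfies one step of a trace."""
    value = record[step['varname']]
    cond = step['cut_condition']
    if step['vartype'] == 'N':
        return value in cond  # nominal: membership in a set of values
    # Assumed encoding for O/C variables: '<=x' or '>x'
    if cond.startswith('<='):
        return value <= float(cond[2:])
    return value > float(cond[1:])


def matches(record, trace):
    """A record falls into a node if it satisfies every step on the path."""
    return all(step_matches(record, step) for step in trace)


def predict_two_tier(records, leaves, branches):
    """Tier 1: predict with leaf nodes. Tier 2: fall back to branch nodes."""
    results = []
    for rec in records:
        for tier, nodes in ((1, leaves), (2, branches)):
            hit = next((n for n in nodes if matches(rec, n['trace'])), None)
            if hit is not None:
                results.append((hit['class'], tier))
                break
        else:
            results.append((None, None))  # no node matched at all
    return results


female = {'varname': 'Sex', 'vartype': 'N', 'cut_condition': {'female'}}
male = {'varname': 'Sex', 'vartype': 'N', 'cut_condition': {'male'}}
young = {'varname': 'Age', 'vartype': 'C', 'cut_condition': '<=30'}
child = {'varname': 'Age', 'vartype': 'C', 'cut_condition': '<=10'}

# Made-up toy tree: only some leaves exist, the rest were too small to split off
leaves = [{'trace': [female, young], 'class': 1},
          {'trace': [male, child], 'class': 1}]
branches = [{'trace': [male], 'class': 0},
            {'trace': [female], 'class': 1}]

records = [{'Sex': 'female', 'Age': 25},
           {'Sex': 'male', 'Age': 40},
           {'Sex': 'female', 'Age': 50}]

results = predict_two_tier(records, leaves, branches)
print(results)  # [(1, 1), (0, 2), (1, 2)]
```

The first record is handled by a leaf (tier 1); the other two have no matching leaf and are predicted by their branch node (tier 2), which mirrors the 883 / 8 breakdown in the output.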
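Having read this far, do you think you could master decision trees in 10 minutes?
In later posts I will break the process down step by step, covering variable classification, the search for split points and so on, and try to explain it as clearly as possible.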
