Implementing a Decision Tree in Python (Series Part 4) -- My Own Decision Tree
What you learn on paper always feels shallow in the end; to truly understand this, you absolutely have to code it up yourself!
1 Some quick thoughts on algorithms
What is an algorithm? -- An algorithm is implicit programming. [Probably the most concise summary I have seen.]
I think an algorithm can be understood informally as cooking a stir-fry. We know where we start (ingredients, seasonings), and we know when to stop (the taste is close enough). In between, though, we only have a rough method with no exact quantities: we know adding salt makes the dish salty, but how much is "a pinch of salt", exactly? So we taste and adjust, repeatedly. There is a catch in this trial process: if we experiment one microgram of salt at a time, we will starve before a good dish ever appears; and if we dump in a whole handful, the dish is ruined.
So an algorithm is an elegant way of "adding salt": a method that, guided by your own taste, gets there in just a few tries.
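That "elegant way of adding salt" is essentially a search with a shrinking step size. A toy sketch (the target saltiness 7.0, the tolerance, and the 0-20 bounds are all made-up numbers for illustration):

```python
def season_to_taste(target=7.0, tolerance=0.5):
    """Toy 'add salt' search: bisect between too-little and too-much."""
    low, high = 0.0, 20.0          # we only know rough bounds, not the amount
    tastings = 0
    while True:
        salt = (low + high) / 2    # try the midpoint, not 1 microgram at a time
        tastings += 1
        if abs(salt - target) <= tolerance:   # "close enough" -> stop
            return salt, tastings
        if salt < target:          # too bland: add more next time
            low = salt
        else:                      # too salty: back off
            high = salt

salt, tastings = season_to_taste()
print(tastings)  # → 3
```

Halving the interval each time is what makes the search "elegant": a handful of tastings instead of millions of microgram steps.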
Back to our current problem: how do we solve it with a decision tree? (Predicting the survival of Titanic passengers.)
1. Target: the variable Survived (0/1).
2. Initialization: we have the original dataset, on which gini can be computed.
3. Iteration: loop over all available variables, try to find the best split point, and obtain a new branch of the tree.
4. Termination: stop when a preset goal/limit is reached. There are quite a few limit conditions here:
1) the current dataset has too few rows (the result would carry no statistical meaning) -- stop;
2) a candidate leaf would hold too little data (the result is equally unreliable) -- stop;
3) the node already separates the target well (the result is already "pure") -- stop;
4) the preset number of iterations is exceeded -- stop;
5) further termination conditions can be added as appropriate.
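The four steps above can be sketched as a minimal, self-contained greedy loop. This is an illustration in the same spirit, not the actual iTree code below; the toy data and function names are mine:

```python
def gini(rows):
    """Gini impurity of a list of 0/1 labels."""
    if not rows:
        return 0.0
    p1 = sum(rows) / len(rows)
    return 1 - p1**2 - (1 - p1)**2

def grow_tree(xs, ys, depth=0, max_depth=3, min_samples=4):
    """Greedy loop: try every candidate cut, keep the lowest weighted
    gini, recurse until a stopping condition fires."""
    # Termination: too few rows, too deep, or already pure
    if len(ys) < min_samples or depth >= max_depth or gini(ys) == 0.0:
        return {'leaf': True, 'prob': sum(ys) / max(len(ys), 1)}
    best = None
    for cut in sorted(set(xs))[1:]:                  # candidate cut points
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or w < best[0]:
            best = (w, cut, left, right)
    if best is None:                                 # no usable cut -> leaf
        return {'leaf': True, 'prob': sum(ys) / len(ys)}
    _, cut, left_y, right_y = best
    left_x = [x for x in xs if x < cut]
    right_x = [x for x in xs if x >= cut]
    return {'leaf': False, 'cut': cut,
            'left': grow_tree(left_x, left_y, depth + 1, max_depth, min_samples),
            'right': grow_tree(right_x, right_y, depth + 1, max_depth, min_samples)}

# Toy data: survival depends entirely on x < 3
xs = [1, 2, 3, 4, 5, 6, 1, 2, 5, 6]
ys = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
tree = grow_tree(xs, ys)
print(tree['cut'])  # → 3
```

The real class below does the same thing, only with three variable types, a work queue instead of recursion, and several more stopping conditions.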
Following this idea, I wrote a decision tree myself. Apart from calling a few packages -- pandas, numpy, copy, and combinations (for enumerating combinations) -- everything is implemented directly in code. That is also why the problem could not be solved in 50 lines; it came to roughly 800 lines.
2 My decision-tree code and how to call it
import pandas as pd
import numpy as np
import copy
from itertools import combinations  # for enumerating subsets of nominal levels

class iTree():
    # -- node template shared by every node of the tree (class attribute) --
    # four sections: from (how we got here) - now (current state)
    #                - whatif (split competition) - to (children)
    tree_template_x = {}
    tree_template_x['from'] = {}
    tree_template_x['now'] = {}
    tree_template_x['whatif'] = {}
    tree_template_x['to'] = {}
    tree_template_x['from']['trace'] = []  # chain of filter conditions leading to this node
    tree_template_x['now']['samples'] = None  # sample count at this node
    tree_template_x['now']['vars_available'] = []  # variables still usable for splitting
    tree_template_x['now']['dummy_var_list'] = []  # constant (single-valued) variables
    tree_template_x['now']['current_layers'] = 0  # depth of this node
    tree_template_x['now']['is_leaf'] = 0  # 1 if the node ends up a leaf
    # classification metrics
    tree_template_x['now']['target_counts'] = None  # rows with target = 1
    tree_template_x['now']['non_target_counts'] = None  # rows with target = 0
    tree_template_x['now']['gini'] = None  # gini impurity at this node
    tree_template_x['now']['prob'] = None  # target rate at this node
    tree_template_x['now']['class'] = None  # predicted class at this node
    # regression metric
    tree_template_x['now']['mse'] = None  # mean squared error at this node
    tree_template_x['whatif']['compete_win_varname'] = None  # winning split variable
    tree_template_x['whatif']['compete_win_vartype'] = None  # N, O or C
    tree_template_x['whatif']['compete_win_gini'] = None  # winner's gini
    tree_template_x['whatif']['compete_win_mse'] = None  # winner's mse
    tree_template_x['whatif']['except_var_list'] = []  # variables excluded from competing (already used etc.)
    tree_template_x['to']['left_condition'] = None  # condition sending rows to the left child
    tree_template_x['to']['right_condition'] = None  # condition sending rows to the right child
    tree_template_x['to']['left'] = None  # left child node
    tree_template_x['to']['right'] = None  # right child node
    # how many candidate cuts are searched for each variable type
    '''
    - C (Continuous) -> quantile scan, at most ~100 candidate cuts
    - N (Nominal)    -> all level subsets: sum of C(n, x) for x <= n//2
    - O (Ordinal)    -> one cut per level boundary, fewer candidates than N
    '''
    # cut helper 1: continuous variable, quantile-based binarization of a Series
    @staticmethod
    def cbcut(data=None, pstart=0.1, pend=0.9):
        data = data.copy()
        # scan the 10th ~ 90th percentiles by default
        # linspace gives (0.1, 0.11, ..., 0.9)
        bins = int((pend - pstart) * 100 + 1)
        qlist = np.linspace(pstart, pend, bins)
        # drop duplicated quantile values (unique keeps one cut per value)
        qtiles = data.quantile(qlist).unique()
        res_list = []
        for q in qtiles:
            data1 = data.apply(lambda x: 1 if x < q else 0)
            res_list.append(data1)
        res_dict = {}
        res_dict['data_list'] = res_list
        res_dict['qtiles'] = qtiles
        return res_dict
    # cut helper 2: ordinal variable, one binary split per level boundary
    # (assumes the levels are already in order)
    @staticmethod
    def obcut(data=None, start=1):
        data = data.copy()
        qtiles = data.unique()[start:]
        res_list = []
        for q in qtiles:
            data1 = data.apply(lambda x: 1 if x < q else 0)
            res_list.append(data1)
        res_dict = {}
        res_dict['data_list'] = res_list
        res_dict['qtiles'] = qtiles
        return res_dict
    # helper 3: build a {level: fill_value} dict for Series.map
    @staticmethod
    def list_key_dict(data=None, fill_value=1):
        return dict(zip(data, [fill_value] * len(data)))

    # helper 4: invert a dict (swap keys and values)
    @staticmethod
    def kv2vk(data=None):
        new_dict = {}
        for k in data.keys():
            new_dict[data[k]] = k
        return new_dict
    # cut helper 5: nominal variable, enumerate level subsets as binary splits
    @staticmethod
    def nbcut(data=None):
        data = data.copy()
        var_set = set(data)
        comb_num = len(var_set) // 2
        # subsets of var_set of size 1 .. n//2 (the complement is the other side)
        comb_list = []
        not_comb_list = []
        for i in range(comb_num):
            tem_num = i + 1
            comb_list += list(combinations(var_set, tem_num))
        # turn each subset into a map dict, and record its complement
        comb_sel_list = []
        for clist in comb_list:
            comb_sel_list.append(iTree.list_key_dict(data=clist))
            not_comb_list.append(list(var_set - set(clist)))
        res_list = []
        for comb_sel in comb_sel_list:
            data1 = data.map(comb_sel).fillna(0)
            res_list.append(data1)
        res_dict = {}
        res_dict['data_list'] = res_list  # binarized columns
        res_dict['comb_list'] = comb_list  # subsets on the "left" side
        res_dict['not_comb_list'] = not_comb_list  # their complements
        res_dict['comb_sel_list'] = comb_sel_list  # map dicts used for binarization
        return res_dict
    # helper 6: collect basic attributes of a variable
    @staticmethod
    def collect_var_attr(data=None, varname=None):
        data = data.copy()
        data1 = data.dropna().apply(str)
        # 1 number of missing values
        missing_num = len(data) - data.notnull().sum()
        # 2 missing rate
        missing_rate = missing_num / len(data)
        # 3 number of distinct levels
        levs = len(data1.unique())
        # 4 are all values plain integers?
        is_integer = data1.apply(lambda x: x.isdigit()).sum() == len(data1)
        # 5 do all values contain a decimal point?
        is_dot = data1.apply(
            lambda x: True if '.' in x else False).sum() == len(data1)
        # 6 if dotted, are both sides of the dot digits?
        if is_dot:
            is_dot_is_digit = data1.apply(lambda x: True if x.split(
                '.')[0].isdigit() and x.split('.')[1].isdigit() else False).sum() == len(data1)
        else:
            is_dot_is_digit = False
        # 7 dotted numbers whose fraction part is all zero still count as integers
        is_float = False
        if is_dot_is_digit:
            is_integer = data1.apply(lambda x: True if float(
                x.split('.')[1]) == 0 else False).sum() == len(data1)
        else:
            is_float = True
        # numeric vs string: can every value be cast to float?
        try:
            data1.apply(float)
            is_num = True
            is_str = False
        except:
            is_num = False
            is_str = True
        # pack the results
        res_dict = {}
        res_dict['missing_num'] = missing_num
        res_dict['missing_rate'] = missing_rate
        res_dict['levs'] = levs
        res_dict['is_all_integer'] = is_integer
        res_dict['is_dot'] = is_dot
        res_dict['is_str'] = is_str
        res_dict['is_num'] = is_num
        res_dict['is_all_dot_and_digit'] = is_dot_is_digit
        res_dict['is_all_float'] = is_float
        return {varname: res_dict}
    # helper 7: infer the variable type (C, N, O) from the collected attributes
    # (O is never inferred automatically; mark ordinal variables by hand)
    @staticmethod
    def infer_var_type(data=None):
        if data['is_str']:
            res = 'N'
        else:
            if data['levs'] >= 20:
                res = 'C'
            else:
                res = 'N'
        return res

    # helper 8: infer variable types for a whole DataFrame
    @staticmethod
    def df_infer_var_type(df, cols=None):
        var_meta_dict = {}
        if cols is None:
            cols = list(df.columns)
        for col in cols:
            tem_var_meta_dict = iTree.collect_var_attr(df[col], varname=col)
            var_meta_dict.update(tem_var_meta_dict)
        for k in var_meta_dict.keys():
            res = iTree.infer_var_type(var_meta_dict[k])
            var_meta_dict[k]['vartype'] = res
        return var_meta_dict
    # helper 9: align x and y (both Series) into a df and drop missing rows;
    # x is a candidate split column, y is the target
    @staticmethod
    def align_xy(x=None, y=None):
        tem_df = pd.DataFrame()
        tem_df['x'] = x.copy()
        tem_df['y'] = y.copy()
        return tem_df.dropna()

    # helper 10: gini impurity of one node
    # Gini = 1 - p1^2 - p0^2
    @staticmethod
    def cal_gini_impurity(target_count=None, total_count=None):
        p1 = target_count / total_count
        p0 = 1 - p1
        return 1 - p1**2 - p0**2
    # helper 11: weighted gini of a candidate split
    @staticmethod
    def get_gini(x=None, y=None):
        tem_df = iTree.align_xy(x=x, y=y)
        # the split sides (always 0/1 after binarization)
        vals = tem_df['x'].unique()
        _gini = 0
        total_recs = len(tem_df)
        for val in vals:
            # rows falling on this side
            tem_df1 = tem_df[tem_df['x'] == val]
            # weight of this side
            tem_weight = len(tem_df1) / total_recs
            # target count on this side
            tem_target_count = tem_df1['y'].sum()
            # gini of this side
            tem_gini = iTree.cal_gini_impurity(
                target_count=tem_target_count, total_count=len(tem_df1))
            _gini += tem_weight * tem_gini
        return _gini
    # helper 12: mse of a candidate split (for regression trees)
    # y_i are the responses, c_i the side means
    @staticmethod
    def get_mse(x=None, y=None):
        tem_df = iTree.align_xy(x=x, y=y)
        # the split sides (always 0/1 after binarization)
        vals = tem_df['x'].unique()
        _mse = 0
        for val in vals:
            # rows falling on this side
            tem_df1 = tem_df[tem_df['x'] == val]
            # mean response on this side
            tem_df1_y_mean = tem_df1['y'].mean()
            # summed squared error (not divided by n, so the sides add up)
            tem_mse = (tem_df1['y'] - tem_df1_y_mean).apply(lambda x: x**2).sum()
            _mse += tem_mse
        return _mse
    # helper 13: for one variable, scan all candidate cuts and keep the
    # minimum-gini split; x is the variable, y is the target
    @staticmethod
    def find_min_gini(x=None, y=None, varname=None, vartype=None, start=1, pstart=0.1, pend=0.9):
        # generate candidate binarizations according to the variable type
        assert vartype in [
            'C', 'O', 'N'], 'Only Accept Vartype C(continuous), O(Ordinal), N(Nominal)'
        if vartype == 'N':
            res_dict = iTree.nbcut(data=x)
        elif vartype == 'O':
            res_dict = iTree.obcut(data=x, start=start)
        else:
            res_dict = iTree.cbcut(data=x, pstart=pstart, pend=pend)
        # score every candidate cut
        tem_gini_list = []
        for i in range(len(res_dict['data_list'])):
            tem_gini = iTree.get_gini(res_dict['data_list'][i], y)
            tem_gini_list.append(tem_gini)
        # position of the minimum
        min_gini = min(tem_gini_list)
        mpos = tem_gini_list.index(min_gini)
        if vartype == 'N':
            # nominal: membership conditions (in subset / in complement)
            condition_left = res_dict['comb_list'][mpos]
            condition_right = res_dict['not_comb_list'][mpos]
        else:
            # ordinal/continuous: threshold conditions (< q and >= q)
            condition_left = '<' + str(res_dict['qtiles'][mpos])
            condition_right = '>=' + str(res_dict['qtiles'][mpos])
        new_res_dict = {}
        new_res_dict[varname] = {}
        new_res_dict[varname]['gini'] = min_gini
        new_res_dict[varname]['condition_left'] = condition_left
        new_res_dict[varname]['condition_right'] = condition_right
        return new_res_dict
    # helper 14: same scan as helper 13 but minimizing mse (regression)
    # reference: https://blog.csdn.net/zpalyq110/article/details/79527653
    @staticmethod
    def find_min_mse(x=None, y=None, varname=None, vartype=None, start=1, pstart=0.1, pend=0.9):
        # generate candidate binarizations according to the variable type
        assert vartype in [
            'C', 'O', 'N'], 'Only Accept Vartype C(continuous), O(Ordinal), N(Nominal)'
        if vartype == 'N':
            res_dict = iTree.nbcut(data=x)
        elif vartype == 'O':
            res_dict = iTree.obcut(data=x, start=start)
        else:
            res_dict = iTree.cbcut(data=x, pstart=pstart, pend=pend)
        # score every candidate cut
        tem_mse_list = []
        for i in range(len(res_dict['data_list'])):
            tem_mse = iTree.get_mse(res_dict['data_list'][i], y)
            tem_mse_list.append(tem_mse)
        # position of the minimum
        min_mse = min(tem_mse_list)
        mpos = tem_mse_list.index(min_mse)
        if vartype == 'N':
            # nominal: membership conditions (in subset / in complement)
            condition_left = res_dict['comb_list'][mpos]
            condition_right = res_dict['not_comb_list'][mpos]
        else:
            # ordinal/continuous: threshold conditions (< q and >= q)
            condition_left = '<' + str(res_dict['qtiles'][mpos])
            condition_right = '>=' + str(res_dict['qtiles'][mpos])
        new_res_dict = {}
        new_res_dict[varname] = {}
        new_res_dict[varname]['mse'] = min_mse
        new_res_dict[varname]['condition_left'] = condition_left
        new_res_dict[varname]['condition_right'] = condition_right
        return new_res_dict
    # helper 15: across a dict of per-variable results, find the key whose
    # given attribute is the min/max
    @staticmethod
    def find_dict_minmax(some_dict=None, attrname=None, method='min'):
        klist = []
        vlist = []
        except_list = []
        for k in some_dict.keys():
            try:
                attr = some_dict[k][attrname]
                klist.append(k)
                vlist.append(attr)
            except KeyError:
                print('fail to find the val', k)
                except_list.append(k)
        # pick the extreme according to method
        if method.strip().lower() == 'min':
            the_val = min(vlist)
            the_key = klist[vlist.index(the_val)]
        elif method.strip().lower() == 'max':
            the_val = max(vlist)
            the_key = klist[vlist.index(the_val)]
        else:
            the_val, the_key = None, None
        return the_val, the_key, except_list
    # helper 16: filter a DataFrame by one split condition
    # N (nominal): condition is a collection of levels -> keep rows whose value maps in
    # O/C: condition is a '<q' or '>=q' string -> keep rows satisfying the threshold
    @staticmethod
    def filter_df_varattr(df=None, varname=None, vartype=None, condition=None):
        assert vartype in ['C', 'O', 'N'], 'Vartype must in C/O/N '
        if vartype == 'N':  # nominal: map-based membership filter
            tem_map_dict = dict(zip(condition, [True] * len(condition)))
            _tem_sel = df[varname].map(tem_map_dict)
            res_df = df[_tem_sel.notnull()]
        else:
            if '<' in condition:
                val = float(condition.replace('<', ''))
                res_df = df[df[varname] < val]
            elif '>=' in condition:
                val = float(condition.replace('>=', ''))
                res_df = df[df[varname] >= val]
            else:
                res_df = None
                raise ValueError('condition symbol must be < or >=, got: ' + str(condition))
        return res_df
    # helper 17: apply a chain of split conditions in sequence;
    # each chain item holds (varname, vartype, cut_condition)
    @staticmethod
    def filter_df_varattr_chain(df=None, chain_list=None):
        tem_df = df.copy()
        for the_chain in chain_list:
            varname = the_chain['varname']
            vartype = the_chain['vartype']
            condition = the_chain['cut_condition']
            tem_df = iTree.filter_df_varattr(
                df=tem_df, varname=varname, vartype=vartype, condition=condition)
        return tem_df

    # helper 18: find constant ("dummy") columns, which carry no split information
    @staticmethod
    def find_dummy_var(df, cols=None, exe_cols=None):
        if cols is None:
            cols = list(df.columns)
        if exe_cols is not None:
            exe_cols1 = [x for x in exe_cols if len(x.strip()) > 0]
            cols = list(set(cols) - set(exe_cols1))
        res_list = []
        for c in cols:
            if len(df[c].unique()) == 1:
                res_list.append(c)
        return res_list
    # ============== instance methods =============
    def __init__(self):
        # training artifacts: history | branch dict / rules / partitions
        self.train_history_list = []
        self.train_branch_dict = None
        self.train_rules_df = None
        self.train_partition_df = None
        # debug scratch space
        self.debug = {}
    # main fitting entry
    def fit(self, data=None, target_name=None, id_name=None, time_name=None, tree_type='classification',
            max_iter=1000, min_sample_to_split=100, min_sample_to_predict=10, max_depth=3, gini_thresh=0,
            mse_thresh=0, improve_ratio=0, class_thresh=0.5):
        assert all([not data.empty, target_name]), 'data and target_name are required'
        self.para_dict = {}
        self.para_dict['max_iter'] = max_iter  # 1 max number of iterations
        # 2 min samples required to attempt a split
        self.para_dict['min_sample_to_split'] = min_sample_to_split
        # 3 min samples a child must hold to be kept
        self.para_dict['min_sample_to_predict'] = min_sample_to_predict
        self.para_dict['max_depth'] = max_depth  # 4 max tree depth
        self.para_dict['gini_thresh'] = gini_thresh  # 5 stop once gini falls below this
        self.para_dict['mse_thresh'] = mse_thresh  # 6 stop once mse falls below this
        self.para_dict['improve_ratio'] = improve_ratio  # 7 min relative improvement
        self.para_dict['class_thresh'] = class_thresh  # 8 probability cutoff for class 1
        self.para_dict['current_iter_list'] = []  # 9 nodes waiting to be processed
        self.para_dict['history_dict'] = {}  # 10 the fitted tree (root template)
        # 11 finished leaf nodes are collected here
        self.para_dict['leaf_list'] = []
        self.para_dict['data'] = data  # 12 training data, pd.DataFrame
        self.para_dict['target_name'] = target_name  # 13 target column
        self.para_dict['id_name'] = id_name  # 14 ID column (excluded from splits)
        self.para_dict['time_name'] = time_name  # 15 time column (excluded from splits)
        self.para_dict['var_meta'] = iTree.df_infer_var_type(self.para_dict['data'])  # 16 variable metadata
        self.para_dict['tree_type'] = tree_type  # 17 classification or regression
        # --- main loop ---
        # process the root first, then keep popping pending nodes
        iter_cnt = 0
        tree_template, tem_df = self.first_head()
        tree_template = self.iter_body(tree_template=tree_template, tem_df=tem_df)
        self.para_dict['history_dict'] = tree_template
        while len(self.para_dict['current_iter_list']) > 0:
            iter_cnt += 1
            if (iter_cnt) >= self.para_dict['max_iter']:
                break
            tree_template, tem_df = self.iter_head()
            _ = self.iter_body(tree_template=tree_template, tem_df=tem_df)
        self.summary()
    # prepare the root node (no data filtering needed on the first pass)
    def first_head(self):
        tree_template = copy.deepcopy(self.tree_template_x)
        tem_df = self.para_dict['data']  # the root sees the full dataset
        target_name = self.para_dict['target_name']
        exe_cols = [target_name, self.para_dict['id_name'], self.para_dict['time_name']]
        exe_cols = [x for x in exe_cols if x]  # drop unset names
        tree_template['whatif']['except_var_list'] = list(set(tree_template['whatif']['except_var_list'] + exe_cols))
        tree_template['now']['samples'] = len(tem_df)
        tree_template['now']['dummy_var_list'] = iTree.find_dummy_var(tem_df, exe_cols=exe_cols)
        # available variables = all columns minus dummies minus excluded ones
        tree_template['now']['vars_available'] = list(set(tem_df.columns) - set(tree_template['now']['dummy_var_list']) -
                                                      set(tree_template['whatif']['except_var_list']))
        return tree_template, tem_df
    # prepare a pending node: filter the data down to this branch first
    def iter_head(self):
        tree_template = self.para_dict['current_iter_list'].pop()
        chain_list = tree_template['from']['trace']
        tem_df = iTree.filter_df_varattr_chain(df=self.para_dict['data'], chain_list=chain_list)
        tree_template['whatif']['except_var_list'] = list(set(tree_template['whatif']['except_var_list'] + [x['varname'] for x in chain_list]))
        tree_template['now']['samples'] = len(tem_df)
        tree_template['now']['dummy_var_list'] = iTree.find_dummy_var(tem_df)
        # available variables = all columns minus dummies minus excluded ones
        tree_template['now']['vars_available'] = list(set(tem_df.columns) -
                                                      set(tree_template['now']['dummy_var_list']) -
                                                      set(tree_template['whatif']['except_var_list']))
        return tree_template, tem_df
    # process one node: check the stopping rules, run the split competition,
    # then either spawn children or declare a leaf
    def iter_body(self, tree_template=None, tem_df=None):
        target_name = self.para_dict['target_name']
        # stopping checks: enough samples, depth budget, variables left
        is_enough_sample = tree_template['now']['samples'] >= self.para_dict['min_sample_to_split']
        is_depth_ok = tree_template['now']['current_layers'] < self.para_dict['max_depth']
        is_var_available = len(tree_template['now']['vars_available']) > 0
        # node-level stats, plus the "pure enough already" check
        if self.para_dict['tree_type'] == 'classification':
            tree_template['now']['target_counts'] = tem_df[target_name].apply(int).sum()
            tree_template['now']['non_target_counts'] = tree_template['now']['samples'] - tree_template['now']['target_counts']
            tree_template['now']['gini'] = iTree.cal_gini_impurity(tree_template['now']['target_counts'], tree_template['now']['samples'])
            tree_template['now']['prob'] = tree_template['now']['target_counts'] / tree_template['now']['samples']
            tree_template['now']['class'] = 1 if tree_template['now']['prob'] >= self.para_dict['class_thresh'] else 0
            # keep splitting only while gini is above the threshold (gini and mse are >= 0)
            if tree_template['now']['gini'] < self.para_dict['gini_thresh']:
                is_kpi_need_improve = False
            else:
                is_kpi_need_improve = True
        else:
            y_mean = tem_df[target_name].mean()
            y = tem_df[target_name]
            tree_template['now']['mse'] = (y - y_mean).apply(lambda x: x**2).mean()
            if tree_template['now']['mse'] < self.para_dict['mse_thresh']:
                is_kpi_need_improve = False
            else:
                is_kpi_need_improve = True
        if all([is_enough_sample, is_depth_ok, is_var_available, is_kpi_need_improve]):
            is_compete = True
        else:
            is_compete = False
        # ==== the split competition ====
        # notes: 1. high-cardinality nominal vars (e.g. Name) would explode the
        # subset enumeration and should be transformed first (Name -> Title);
        # 2. here they are simply excluded from this node onward
        if is_compete:
            N_too_many_lev = [x for x in tree_template['now']['vars_available'] if self.para_dict['var_meta'][x]['vartype'] == 'N' and len(tem_df[x].unique()) > 20]
            tree_template['whatif']['except_var_list'] = tree_template['whatif']['except_var_list'] + N_too_many_lev
            compete_dict = {}
            cols = list(
                set(tree_template['now']['vars_available']) - set(N_too_many_lev))
            for c in cols:
                tem_varname = c
                tem_vartype = self.para_dict['var_meta'][c]['vartype']
                print(tem_varname, tem_vartype)
                if self.para_dict['tree_type'] == 'classification':
                    tem_res_dict = iTree.find_min_gini(x=tem_df[c], y=tem_df[target_name], varname=tem_varname, vartype=tem_vartype)
                else:
                    tem_res_dict = iTree.find_min_mse(x=tem_df[c], y=tem_df[target_name], varname=tem_varname, vartype=tem_vartype)
                compete_dict.update(tem_res_dict)
            if self.para_dict['tree_type'] == 'classification':
                win_gini, win_var, except_list = iTree.find_dict_minmax(some_dict=compete_dict, attrname='gini')
                tree_template['whatif']['compete_win_varname'] = win_var
                tree_template['whatif']['compete_win_gini'] = win_gini
                tree_template['whatif']['except_var_list'] = tree_template['whatif']['except_var_list'] + except_list
            else:
                win_mse, win_var, except_list = iTree.find_dict_minmax(some_dict=compete_dict, attrname='mse')
                tree_template['whatif']['compete_win_varname'] = win_var
                tree_template['whatif']['compete_win_mse'] = win_mse
                tree_template['whatif']['except_var_list'] = tree_template['whatif']['except_var_list'] + except_list
            # materialize both candidate children and check their sizes
            # note: the two sides may not add up to the parent (e.g. Embarked
            # has nan rows that fall on neither side)
            win_dict = compete_dict[win_var]
            left_candidate_df = iTree.filter_df_varattr(df=tem_df, varname=win_var,
                                                        vartype=self.para_dict['var_meta'][win_var]['vartype'],
                                                        condition=win_dict['condition_left'])
            is_left_branch_ok = len(left_candidate_df) >= self.para_dict['min_sample_to_predict']
            right_candidate_df = iTree.filter_df_varattr(df=tem_df, varname=win_var,
                                                         vartype=self.para_dict['var_meta'][win_var]['vartype'],
                                                         condition=win_dict['condition_right'])
            is_right_branch_ok = len(right_candidate_df) >= self.para_dict['min_sample_to_predict']
            # if neither side is big enough, this node becomes a leaf
            # (originally `if not all([...])`, but that also kills the node when
            # only one side is too small, so the two flags are checked separately)
            if not is_left_branch_ok and not is_right_branch_ok:
                tree_template['now']['is_leaf'] = 1
                self.para_dict['leaf_list'].append(tree_template)
            # spawn the left child
            if is_left_branch_ok:
                tem_left_dict = copy.deepcopy(self.tree_template_x)
                # extend the trace with this cut
                left_trace = tree_template['from']['trace'].copy()
                tem_trace_dict = {}
                tem_trace_dict['varname'] = win_var
                tem_trace_dict['vartype'] = self.para_dict['var_meta'][win_var]['vartype']
                tem_trace_dict['cut_condition'] = win_dict['condition_left']
                left_trace.append(tem_trace_dict)
                # hand the trace down
                tem_left_dict['from']['trace'] = left_trace
                tem_left_dict['now']['current_layers'] = tree_template['now']['current_layers'] + 1
                # inherit the exclusion list (used vars stay excluded)
                tem_left_dict['whatif']['except_var_list'] = tree_template['whatif']['except_var_list']
                # wire the child into the parent
                tree_template['to']['left_condition'] = win_dict['condition_left']
                tree_template['to']['left'] = tem_left_dict
                # queue the child for processing
                self.para_dict['current_iter_list'].append(tree_template['to']['left'])
            # spawn the right child
            if is_right_branch_ok:
                tem_right_dict = copy.deepcopy(self.tree_template_x)
                # extend the trace with this cut
                right_trace = tree_template['from']['trace'].copy()
                tem_trace_dict = {}
                tem_trace_dict['varname'] = win_var
                tem_trace_dict['vartype'] = self.para_dict['var_meta'][win_var]['vartype']
                tem_trace_dict['cut_condition'] = win_dict['condition_right']
                right_trace.append(tem_trace_dict)
                # hand the trace down
                tem_right_dict['from']['trace'] = right_trace
                tem_right_dict['now']['current_layers'] = tree_template['now']['current_layers'] + 1
                # inherit the exclusion list (used vars stay excluded)
                tem_right_dict['whatif']['except_var_list'] = tree_template['whatif']['except_var_list']
                # wire the child into the parent
                tree_template['to']['right_condition'] = win_dict['condition_right']
                tree_template['to']['right'] = tem_right_dict
                # queue the child for processing
                self.para_dict['current_iter_list'].append(tree_template['to']['right'])
        else:
            # a stopping rule fired: this node is a leaf
            tree_template['now']['is_leaf'] = 1
            self.para_dict['leaf_list'].append(tree_template)
        return tree_template
    # organize the fitted results after fit
    def summary(self):
        self.train_branch_dict = {}
        self.train_branch_dict['tier1'] = {}  # leaf nodes
        self.train_branch_dict['tier2'] = {}  # their parent (branch) nodes, as fallback
        leaf_list = self.para_dict['leaf_list']
        res_dict = {}
        res_dict1 = {}
        for i in range(len(leaf_list)):
            cx = 'b' + str(i)
            chain_list = leaf_list[i]['from']['trace']
            res_dict[cx] = {}
            res_dict[cx]['data'] = iTree.filter_df_varattr_chain(df=self.para_dict['data'], chain_list=chain_list)
            res_dict[cx]['trace'] = chain_list
            res_dict[cx]['proba'] = res_dict[cx]['data'][self.para_dict['target_name']].mean()
            res_dict[cx]['class'] = 1 if res_dict[cx]['proba'] >= self.para_dict['class_thresh'] else 0
            res_dict[cx]['size'] = len(res_dict[cx]['data'])
            # tier2: the same leaf with its last cut removed (the parent node)
            if len(chain_list) > 1:
                cx1 = cx + '_a1'
                chain_list1 = chain_list[:-1]
                res_dict1[cx1] = {}
                res_dict1[cx1]['data'] = iTree.filter_df_varattr_chain(df=self.para_dict['data'], chain_list=chain_list1)
                res_dict1[cx1]['trace'] = chain_list1
                res_dict1[cx1]['proba'] = res_dict1[cx1]['data'][self.para_dict['target_name']].mean()
                res_dict1[cx1]['class'] = 1 if res_dict1[cx1]['proba'] >= self.para_dict['class_thresh'] else 0
                res_dict1[cx1]['size'] = len(res_dict1[cx1]['data'])
        self.train_branch_dict['tier1'] = res_dict
        self.train_branch_dict['tier2'] = res_dict1
        res_df = pd.DataFrame(columns=['partition', 'size', 'proba', 'class', 'tier'])
        for k in res_dict.keys():
            tem_dict = {}
            tem_dict['partition'] = k
            tem_dict['size'] = res_dict[k]['size']
            tem_dict['proba'] = res_dict[k]['proba']
            tem_dict['class'] = res_dict[k]['class']
            tem_dict['tier'] = 'tier1'
            res_df = res_df.append(tem_dict, ignore_index=True)  # pandas < 2.0 API
        for k in res_dict1.keys():
            tem_dict = {}
            tem_dict['partition'] = k
            tem_dict['size'] = res_dict1[k]['size']
            tem_dict['proba'] = res_dict1[k]['proba']
            tem_dict['class'] = res_dict1[k]['class']
            tem_dict['tier'] = 'tier2'
            res_df = res_df.append(tem_dict, ignore_index=True)
        self.train_partition_df = res_df
        # readable rules, one row per partition
        rule_df = pd.DataFrame(columns=['rule', 'tier', 'condition', 'class', 'proba', 'support'])
        for k in res_dict.keys():
            k_trace = res_dict[k]['trace']
            tem_dict = {}
            tem_dict['rule'] = k
            tem_dict['tier'] = 1
            tem_dict['condition'] = '&'.join([x['varname'] + ' in ' + str(list(x['cut_condition'])) if x['vartype'] == 'N' else x['varname'] + x['cut_condition'] for x in k_trace])
            tem_dict['class'] = res_dict[k]['class']
            tem_dict['proba'] = res_dict[k]['proba']
            tem_dict['support'] = res_dict[k]['size']
            rule_df = rule_df.append(tem_dict, ignore_index=True)
            if len(k_trace) > 1:
                # the matching parent-node rule
                tem_dict = {}
                k_trace1 = k_trace[:-1]
                tem_dict['rule'] = k + '_a1'
                tem_dict['tier'] = 2
                tem_dict['condition'] = '&'.join([x['varname'] + ' in ' + str(list(x['cut_condition']))
                                                  if x['vartype'] == 'N' else x['varname'] + x['cut_condition'] for x in k_trace1])
                tem_dict['proba'] = res_dict[k]['proba']
                tem_dict['class'] = res_dict[k]['class']
                tem_dict['support'] = res_dict[k]['size']
                rule_df = rule_df.append(tem_dict, ignore_index=True)
        self.train_rules_df = rule_df
    # predict on new data
    def predict(self, data=None):
        # first pass: leaf nodes; second pass: parent nodes as fallback
        predict_df_list1 = []  # predictions from leaf nodes
        predict_df_list2 = []  # fallback predictions from branch nodes
        for b in self.train_branch_dict['tier1'].keys():
            # stats learned for this leaf
            tem_proba = self.train_branch_dict['tier1'][b]['proba']
            tem_class = self.train_branch_dict['tier1'][b]['class']
            tem_size = self.train_branch_dict['tier1'][b]['size']
            tem_chain_list = self.train_branch_dict['tier1'][b]['trace']
            # rows of the new data falling into this leaf
            tem_predict_df = iTree.filter_df_varattr_chain(df=data, chain_list=tem_chain_list)
            tem_predict_df['predict_class'] = tem_class
            tem_predict_df['predict_proba'] = tem_proba
            tem_predict_df['predict_support'] = tem_size
            # keep the branch name for debugging
            tem_predict_df['branch'] = b
            predict_df_list1.append(tem_predict_df)
        # concatenate; the original index is preserved
        predict_df1 = pd.concat(predict_df_list1)
        print('***totle recs to predict', len(data))
        print('***predict by leaf Node', predict_df1.shape)
        add_list = ['predict_proba', 'predict_class', 'predict_support']
        keep_list = [self.para_dict['target_name']] + add_list
        for x in add_list:
            data[x] = predict_df1[x]
        # rows no leaf covers get a fallback prediction from the parent nodes
        is_missing_predict = data['predict_proba'].notnull().sum() < len(data)
        if is_missing_predict:
            mis_data = data[~data['predict_proba'].notnull()]
            for b in self.train_branch_dict['tier2'].keys():
                tem_proba = self.train_branch_dict['tier2'][b]['proba']
                tem_class = self.train_branch_dict['tier2'][b]['class']
                tem_size = self.train_branch_dict['tier2'][b]['size']
                tem_chain_list = self.train_branch_dict['tier2'][b]['trace']
                # rows of the uncovered data falling into this branch
                tem_predict_df = iTree.filter_df_varattr_chain(df=mis_data, chain_list=tem_chain_list)
                tem_predict_df['predict_class'] = tem_class
                tem_predict_df['predict_proba'] = tem_proba
                tem_predict_df['predict_support'] = tem_size
                predict_df_list2.append(tem_predict_df)
            predict_df2 = pd.concat(predict_df_list2)
            # a row can match several parent nodes; keep the first match only
            predict_df2 = predict_df2[~predict_df2.index.duplicated()]
            print('***predict by branch Node', predict_df2.shape)
            for x in add_list:
                mis_data[x] = predict_df2[x]
            data = pd.concat([data[data['predict_proba'].notnull()], mis_data])
        return data[keep_list].sort_index()
There is a lot of code here, so I will unpack it piece by piece later. Structurally, the iTree class has a few parts:
1. tree_template_x: the template shared by every node of the tree, kept as a class attribute.
2. The @staticmethod block: besides the key-metric calculations, it defines a pile of functions for finding a variable's best cut point. Variables are divided into three types here: N (Nominal), O (Ordinal), and C (Continuous). N: e.g. sex; O: e.g. education level; C: e.g. income.
3. The fit method: the main entry for fitting the tree. Inside it, first_head and iter_head are the counterparts for the first pass and the looped passes. The main difference is that the first pass needs no filtering, while the looped passes must filter the data down to the current node's subset.
4. The predict method: uses the trained parameters to predict on new data.
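As a rough stand-alone illustration of the N/O/C idea (this mirrors the `levs >= 20` heuristic used by `infer_var_type` in the class; note the class never infers O automatically, so ordinal variables would be marked by hand):

```python
import pandas as pd

def rough_var_type(s: pd.Series, cont_levels=20):
    """String columns -> N (nominal); numeric columns with many distinct
    levels -> C (continuous); numeric with few levels -> treated as N."""
    s = s.dropna()
    try:
        s.apply(float)           # can every value be cast to a number?
        is_num = True
    except (TypeError, ValueError):
        is_num = False
    if not is_num:
        return 'N'
    return 'C' if s.nunique() >= cont_levels else 'N'

df = pd.DataFrame({
    'Sex': ['male', 'female'] * 15,               # nominal
    'Fare': [7.25 + i * 0.5 for i in range(30)],  # continuous-looking
})
print(rough_var_type(df['Sex']), rough_var_type(df['Fare']))  # → N C
```

The 20-level cutoff is arbitrary, which is exactly why the real class also keeps the full attribute dict around so the inferred type can be overridden.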
The usage is as follows.
some_tree = iTree()
df = pd.read_csv('train.csv')          # cleaned data
raw_df = pd.read_csv('raw_train.csv')  # raw, unprocessed data
some_tree.fit(data=df, target_name='Survived')
res_predict_df = some_tree.predict(data=df)
from sklearn.metrics import classification_report
predict_eva = classification_report(res_predict_df['Survived'], res_predict_df['predict_class'])
print(predict_eva)
The run produces the following output.
***totle recs to predict 891
***predict by leaf Node (883, 15)
***predict by branch Node (8, 14)
precision recall f1-score support
0 0.84 0.88 0.86 549
1 0.80 0.74 0.77 342
micro avg 0.83 0.83 0.83 891
macro avg 0.82 0.81 0.82 891
weighted avg 0.83 0.83 0.83 891
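A quick arithmetic check on the report above: the f1-score column is just the harmonic mean of precision and recall.

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# class 0: precision 0.84, recall 0.88; class 1: precision 0.80, recall 0.74
print(round(f1(0.84, 0.88), 2), round(f1(0.80, 0.74), 2))  # → 0.86 0.77
```

Both values match the f1-score column, which is a cheap way to confirm the report is read correctly.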
Here df is the cleaned data and raw_df the unprocessed data. You may notice that leaf Nodes predicted 883 records while 8 records were predicted by branch Nodes -- what does that mean? The initial parameters require at least 10 samples per leaf, so one candidate leaf was never split off, and those rows have no leaf to fall into. In such cases, one option is to make no prediction at all; the other is to fall back to that leaf's parent node (a branch Node) for an approximate prediction. It is still a third-layer prediction, and the result should be close enough. That said, the Titanic example is simple and cannot really probe the model's quality (and don't expect a single decision tree to be amazing anyway). Also, no train/test split has been done yet, so this is not rigorous modeling. Later I will build a benchmark and compare several algorithms side by side.
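The missing train/test split could be bolted on with a few lines like these (a sketch on synthetic data so it runs stand-alone; the 80/20 ratio and the seed are arbitrary choices of mine):

```python
import numpy as np
import pandas as pd

def train_test_split_df(df, test_ratio=0.2, seed=42):
    """Shuffle the row positions and carve off a hold-out set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    n_test = int(len(df) * test_ratio)
    return df.iloc[idx[n_test:]], df.iloc[idx[:n_test]]

# Synthetic stand-in for the Titanic frame
df = pd.DataFrame({'Survived': [0, 1] * 50, 'Fare': range(100)})
train, test = train_test_split_df(df)
print(len(train), len(test))  # → 80 20
```

With this in place, you would fit on train, call predict on test, and hand that result to classification_report instead of reusing the training frame.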
Having read this far -- did you master decision trees in ten minutes?
In later posts I will break the process down step by step -- variable classification, the cut-point search, and so on -- and try to explain it as clearly as I can.