『機械学習実戦』学習ノート:意思決定ツリーの実現

6572 ワード

意思決定ツリーは極めて分かりやすいアルゴリズムであり、最もよく使われるデータマイニングアルゴリズムでもあり、意思決定ツリーは機械がデータセットに基づいて規則を創造することを許可し、実はこれが機械学習の過程である.専門家システムでは、意思決定ツリーとその変種がよく使用され、意思決定ツリーが与える結果は、現在の分野で数十年の経験を持つ専門家に匹敵することが多い.

の利点:決定ツリーの計算の複雑さは高くなく、出力結果は理解しやすく、中間値の欠落に敏感ではなく、関連しない特徴データを処理することができる.

の欠点:過剰なマッチングの問題が発生する可能性があります.

適用データ型:数値型と公称型.

この章の主なタスクは,決定ツリーを構築する方法と,決定ツリーを構築するpythonコードを記述し,再帰呼び出しを用いて分類器を構築し,最後にMattplotlibを用いて決定ツリー図を描くことである.アルゴリズム検証セクションでは、決定ツリーにコンタクトレンズの処方データを入力し、訓練された分類器から必要なレンズタイプを予測します.
以下は、決定ツリーのPython実装です.
一、まずいくつかの関数を作成してデータを前処理する.
1.エントロピー関数の計算:エントロピーは集合の無秩序度を表し、集合が無秩序であるほどエントロピーが大きくなる.

from math import log 

def entropy(sample):  
    log2 = lambda x:log(x)/log(2)   

    results = {}    
    for row in sample:    
        r = row[len(row)-1]  
        results[r] = results.get(r, 0) + 1  

    ent = 0.0  # entropy
    for r in results.keys():    
        p = float(results[r]) / len(sample)    
        ent -= p * log2(p)    
    return ent

2.データセット関数を属性と値で取得すると、pythonで記述された関数が非常に簡潔であることがわかります.この関数は、データセットsampleシーケンスからk番目の列の値vのサブセットを取得し、取得したサブセットからk番目の列を削除することを意味します.

def fetch_subdataset(sample, k, v):  
    return [d[:k] + d[k+1:] for d in sample if d[k] == v]

3.最大確率属性の相関関数を計算し、決定ツリーを構築する際、すべての決定属性を処理した後、データを一意に区別できない場合、多数決の方法で最終分類を選択します.

def MaxFeature(classList):  
    classCount = {}  
    for cla in classList:  
        classCount[cla] = classCount.get(cla, 0) + 1  
    sortedClassCount =  sorted(classCount.items(), key=lambda d: d[1], reverse = True)   
    return sortedClassCount[0][0]

二.最適データフィーチャー分割方式の関数の選択
決定ツリーを構築する前に,まず各特性の役割を評価し,どの列の値で集合を分割し,最大の情報利得を得るかを考える必要があり,以下の関数はこの機能を実現し,データセットを入力し,最適な特徴値を出力する.

def DecisionFeature(sample):  
    ent, feature = 100000000, -1  
    for i in range(len(sample[0]) - 1):  
        feat_list = [e[i] for e in sample]  
        unq_feat_list = set(feat_list)  
        ent_t = 0.0  
        for f in unq_feat_list:  
            sub_data = fetch_subdataset(sample, i, f)  
            ent_t += entropy(sub_data) * len(sub_data) / len(sample)  

        if ent_t < ent:  
            ent, feature = ent_t, i  

    return feature

三.再帰を使用した意思決定ツリーの構築

def buildTree(sample, datalabel):  
    cla = [c[-1] for c in sample]  
    if len(cla) == cla.count(cla[0]):  
        return cla[0]  
    if len(sample[0]) == 1:  
        return MaxFeature(sample)  

    feature = DecisionFeature(sample)  
    featureLabel = datalabel[feature]  
    decisionTree = {featureLabel:{}}  
    del(datalabel[feature])  

    featValue = [d[feature] for d in sample]  
    UniqueFeatValue = set(featValue)  
    for value in UniqueFeatValue:  
        subLabel = datalabel[:]  
        decisionTree[featureLabel][value] = buildTree( \
            fetch_subdataset(sample, feature, value), subLabel)  

    return decisionTree

四.意思決定ツリーの使用

def classify(decisionTree, featLabels, test):  
    label = decisionTree.keys()[0]  
    next_dict = decisionTree[label]  
    feat_index = featLabels.index(label)  
    for key in next_dict.keys():  
        if test[feat_index] == key:  
            if type(next_dict[key]).__name__ == 'dict':  
                c_label = classify(next_dict[key], featLabels, test)  
            else:  
                c_label = next_dict[key]  
    return c_label

時間が限られているため,ここまでしか解析できず,実装結果とインスタンス検証はさらに議論される必要がある.

loop recur から reduce へ

[vue]画像を動的に挿入する場合