ID 3決定ツリーのアルゴリズム原理とpython実現

14728 ワード

1.引用
意思決定ツリーは本質的に訓練データセットの訓練から分類ルールのセットであり、訓練データに完全に基づいて、得られたルールがフィッティングされやすく、これも意思決定ツリーの欠点であるが、意思決定ツリーの剪定によって、意思決定ツリーの汎化能力を高めることができる.これにより、決定ツリーの作成は、特徴選択、決定ツリーの生成、および決定ツリーの剪断の3つの部分を含むことができる.決定ツリーの応用には、分類、回帰、およびフィーチャー選択が含まれます.決定ツリーの最も古典的なアルゴリズムはID 3、C 4を含む.5及びCARTアルゴリズム、ID 3及びC 4.5アルゴリズムは似ている、C 4.5特徴選択時に選択される情報基準は情報利得比であり、ID 3は情報利得である.情報利得は、より多くの可能な値を有する特徴を選択することに偏っている(例えば、ある特徴が5つの可能な値を有し、その情報利得は2つの特徴値を有する情報利得よりも大きい).
2.主な内容

情報論に基づく特徴選択(python情報利得の計算を実現)

決定ツリーの生成

ID 3アルゴリズムのpython実装.

3.情報論に基づく特徴選択
注意:エントロピーはランダム変数の不確実性を表し、エントロピー値が大きいほどランダム変数に含まれる情報が少なくなり、変数の不確実性が大きくなります.1)シャノンが1つのデータを定義する情報は、次式(ここでは2をベースとした対数)で計算することができる.
l(xi)=−log2p(xi)
2)エントロピーは1つのデータセット情報の期待を表し、式を押して計算することができる:(この式は理解できないが、想像することができて、変数の期待する式を求めて、p(xi)は変数xiと情報l(xi)の確率で、確率は変数(情報)を乗じて変数(情報)の期待である)
H=−∑ni=1p(xi)log2p(xi)
3)特徴AのデータセットDに対する情報利得は:
g(D,A)=H(D)−H(D|A)=−∑Kk=1|Ck||D|log2|Ck||D|−∑ni=1|Di||D|∑Kk=1|Dik||Di|log2|Dik||Di|
上式では、訓練データセットをDとし、そのサンプル容量を|D|とし、すなわちサンプル個数とし、KクラスCk,k=1,2,…,K,|Ck|はCkのサンプル個数であり,特徴Aの取値に基づいてDをn個のサブセットD 1,D 2,...,Dn,|Di|はDiのサンプル数,Dik=Di⋂Ck,|Dik|はDikのサンプル数である.次の表と図に示します.
feature1(A)
feature1
feature3
labels
a1
b1
c1
y
a1
b2
c2
n
a1
b1
c2
n
a1
b1
c2
n
a2
b1
c1
y
a2
b2
c2
y
a2
b1
c1
n
4)python実装

def calcShannonEnt(dataset):#   
    numSamples = len(dataset)
    labelCounts = {}
    for allFeatureVector in dataset:
        currentLabel = allFeatureVector[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    entropy = 0.0
    for key in labelCounts:
        property = float(labelCounts[key])/numSamples
        entropy -= property * log(property,2)
    return entropy
def BestFeatToGetSubdataset(dataset):
    #      ：                      
    numFeature = len(dataset[0]) - 1 
    baseEntropy = calcShannonEnt(dataset)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeature):#i                
        #           i          
        feat_i_values = [example[i] for example in dataset]
        uniqueValues = set(feat_i_values)
        feat_i_entropy = 0.0
        for value in uniqueValues:
            subDataset = getSubDataset(dataset,i,value)
            #      pi，             
            prob_i = len(subDataset)/float(len(dataset))
            feat_i_entropy += prob_i * calcShannonEnt(subDataset)
        infoGain_i = baseEntropy - feat_i_entropy
        if (infoGain_i > bestInfoGain):
            bestInfoGain = infoGain_i
            bestFeature = i
    return bestFeature

4.決定ツリー生成
決定ツリーには、次のフローチャートが表示されます.
5.ID 3アルゴリズムpython実装コード

# -*- coding: utf-8 -*-
from math import log
import operator
import pickle
'''   ：     、    （         ，      ）   ：       、    （             ）       ：float   （      ） '''
def calcShannonEnt(dataset):
    numSamples = len(dataset)
    labelCounts = {}
    for allFeatureVector in dataset:
        currentLabel = allFeatureVector[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    entropy = 0.0
    for key in labelCounts:
        property = float(labelCounts[key])/numSamples
        entropy -= property * log(property,2)
    return entropy

'''   ：    ：          ：   、     '''     
def creatDataSet():
    dataset = [[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,0,'no']]
    labels = ['no surfacing','flippers']
    return dataset,labels

'''   ：   、               、         （  ，（     、0,1 ））   ：              （        ）   ：     '''
def getSubDataset(dataset,colIndex,value):
    subDataset = [] #        
    for rowVector in dataset:
        if rowVector[colIndex] == value:
            #          colIndex          
            subRowVector = rowVector[:colIndex]
            subRowVector.extend(rowVector[colIndex+1:])
            #                 
            subDataset.append(subRowVector)
    return subDataset

'''   ：      ：       ，           （                   ）   ：              '''
def BestFeatToGetSubdataset(dataset):
    #      ：                      
    numFeature = len(dataset[0]) - 1 
    baseEntropy = calcShannonEnt(dataset)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeature):#i                
        #           i          
        feat_i_values = [example[i] for example in dataset]
        uniqueValues = set(feat_i_values)
        feat_i_entropy = 0.0
        for value in uniqueValues:
            subDataset = getSubDataset(dataset,i,value)
            #      pi
            prob_i = len(subDataset)/float(len(dataset))
            feat_i_entropy += prob_i * calcShannonEnt(subDataset)
        infoGain_i = baseEntropy - feat_i_entropy
        if (infoGain_i > bestInfoGain):
            bestInfoGain = infoGain_i
            bestFeature = i
    return bestFeature

'''   ：             ：                ：               '''  
def mostClass(ClassList):
    classCount = {}
    for class_i in ClassList:
        if class_i not in classCount.keys():
            classCount[class_i] = 0
        classCount[class_i] += 1
    sortedClassCount = sorted(classCount.iteritems(),
    key=operator.itemgetter(1),reverse = True)        
    return sortedClassCount[0][0]

'''   ：   ，       ：     （                     ）   ：   （        ） '''    
def creatTree(dataset,labels):
    classList = [example[-1] for example in dataset]
    #     dataset         ， ，     
    if classList.count(classList[0]) == len(classList):    
        return classList[0]
    #            , ，         
    if len(dataset[0]) == 1:
        return mostClass(classList)
    #            
    bestFeat = BestFeatToGetSubdataset(dataset)
    #           
    bestFeatLabel = labels[bestFeat]
    #     
    myTree = {bestFeatLabel:{}}
    del (labels[bestFeat])
    #             
    bestFeatValues = [example[bestFeat] for example in dataset]
    uniqueBestFeatValues = set(bestFeatValues)
    for value in uniqueBestFeatValues:
        #         value              
        subDataset = getSubDataset(dataset,bestFeat,value)
        subLabels = labels[:]
        #      
        myTree[bestFeatLabel][value] = creatTree(subDataset,subLabels)
    return myTree

'''   ：         ：                     ：           '''        
def classify(inputTree,featlabels,testFeatValue):
    firstStr = inputTree.keys()[0]
    secondDict = inputTree[firstStr]
    featIndex = featlabels.index(firstStr)
    for firstStr_value in secondDict.keys():
        if testFeatValue[featIndex] == firstStr_value:
            if type(secondDict[firstStr_value]).__name__ == 'dict':
                classLabel = classify(secondDict[firstStr_value],featlabels,testFeatValue)
            else: classLabel = secondDict[firstStr_value]
    return classLabel    


'''   ：   ，         ：         ： '''
def storeTree(trainTree,filename):

    fw = open(filename,'w')
    pickle.dump(trainTree,fw)
    fw.close()
def grabTree(filename):

    fr = open(filename)
    return pickle.load(fr)


if __name__ == '__main__':
    dataset,labels = creatDataSet()
    storelabels = labels[:]#  label
    trainTree = creatTree(dataset,labels)    
    classlabel = classify(trainTree,storelabels,[0,1])
    print classlabel

実行結果:

In [1]:runfile('E:/python/ml/dtrees/trees.py', wdir='E:/python/ml/dtrees')
no
In [2]:

参考文献:統計学習方法、李航machine learning in action中国語版

word 2 vec中国語の語彙を訓練する

アルゴリズム:5ステップで再帰を解消します