KNN (k-Nearest Neighbors) Algorithm


python, machine learning, k-nearest neighbors

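General workflow of the k-nearest neighbors algorithm

Collect data: any method can be used.

Prepare data: the distance computation needs numeric values, preferably in a structured data format.

Analyze data: any method can be used.

Train the algorithm: this step does not apply to the k-nearest neighbors algorithm.

Test the algorithm: compute the error rate.

Use the algorithm: first supply sample data and structured output results, then run the k-nearest neighbors algorithm to decide which class each input belongs to, and finally apply any follow-up processing to the computed classes.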
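The "test the algorithm" step can be sketched as follows. This is only a minimal illustration under stated assumptions: `knn_predict` and `error_rate` are hypothetical helper names that do not appear in the article, and the small training and test sets are made up for the example.

```python
import numpy as np
from collections import Counter


def knn_predict(x, train_x, train_y, k):
    """Return the majority label among the k training points nearest to x."""
    # Euclidean distance from x to every training row
    dists = np.sqrt(((train_x - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = dists.argsort()[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def error_rate(test_x, test_y, train_x, train_y, k):
    """Fraction of test points whose predicted label differs from the true label."""
    errors = sum(
        knn_predict(x, train_x, train_y, k) != y
        for x, y in zip(test_x, test_y)
    )
    return errors / len(test_y)


# Illustrative data only (assumed for this sketch)
train_x = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_y = ['A', 'A', 'B', 'B']
test_x = np.array([[0.9, 0.9], [0.1, 0.2], [0.2, 0.1]])
test_y = ['A', 'B', 'A']  # the last label is deliberately mismatched

rate = error_rate(test_x, test_y, train_x, train_y, k=3)
print(rate)
```

Here one of the three test points is misclassified, so the error rate is 1/3. With real data the test set would be held out from the samples used for voting.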

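# -*- coding: utf-8 -*-

# kNN.py: a Python module for k-nearest neighbors classification
from numpy import array, tile  # NumPy array constructor and tile helper
import operator  # provides itemgetter, used when sorting the votes

# Create the sample data set; returns the points (group) and their labels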
def createDataSet():
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # dataSet is an array of shape (4, 2), so dataSet.shape[0] = 4
    # dataSetSize = 4

    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    # tile(A, reps) repeats A; tile(inX, (dataSetSize, 1)) stacks inX as dataSetSize rows
    # diffMat for inX = [0, 0]:
        # [-1. , -1.1],
        # [-1. , -1. ],
        # [ 0. ,  0. ],
        # [ 0. , -0.1]

    sqDiffMat = diffMat ** 2
    # square each element of the difference between inX and the samples
    # sqDiffMat:
        # [1. , 1.21],
        # [1. , 1.  ],
        # [0. , 0.  ],
        # [0. , 0.01]

    sqDistances = sqDiffMat.sum(axis=1)
    # axis=1 sums each row; axis=0 would sum each column
    # sqDistances for inX = [0, 0]: [2.21, 2., 0., 0.01]

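    distances = sqDistances ** 0.5
    # Euclidean distance between two points xA(xA0, xA1) and xB(xB0, xB1):
    # d = sqrt((xA0 - xB0)^2 + (xA1 - xB1)^2)
    # Example: the distance between (0, 0) and (1, 2) is
    # sqrt((1 - 0)^2 + (2 - 0)^2)
    # distances for inX = [0, 0]: [1.48660687, 1.41421356, 0., 0.1]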

    sortedDistIndicies = distances.argsort()
    # numpy.argsort(a, axis=-1, kind='quicksort', order=None) returns the indices that would sort a
    # Result for inX = [0, 0]: [2, 3, 1, 0], i.e. distances[2] < distances[3] < distances[1] < distances[0]
    # One-dimensional array:
    #   >>> x = np.array([3, 1, 2])
    #   >>> np.argsort(x)
    #   array([1, 2, 0])
    # Two-dimensional array:
    #   >>> x = np.array([[0, 3], [2, 2]])
    #   >>> x
    #   array([[0, 3], [2, 2]])

    classCount = {}
    # classCount is a dictionary mapping each label to its vote count

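    for i in range(k):
        m = sortedDistIndicies[i]
        # when i = 0, sortedDistIndicies[0] = 2

        voteIlabel = labels[m]
        # voteIlabel = 'B'

        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
        # dict.get(key, default=None)
        # key -- the key to look up in the dictionary
        # default -- the value returned when the key is not present
        # if default is omitted, None is returned for a missing key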

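    # After the loop, for inX = [0, 0] and k = 3:
    # classCount = {'A': 1, 'B': 2}

    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # itemgetter(1) sorts the (label, count) pairs by their second element, the count
    # reverse=True sorts in descending order, giving [('B', 2), ('A', 1)]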

    return sortedClassCount[0][0]

if __name__ == '__main__':
    group, labels = createDataSet()
    print(classify0([0, 0], group, labels, 3))

Output: B
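As a possible alternative to the `tile` call in `classify0`, NumPy broadcasting can subtract the query point from every row directly, and `collections.Counter` can do the vote counting. The sketch below is not the article's code: `classify_broadcast` is a hypothetical name, and it assumes the same sample data as `createDataSet`.

```python
import numpy as np
from collections import Counter


def classify_broadcast(in_x, data_set, labels, k):
    """Same voting scheme as classify0, written with broadcasting and Counter."""
    # (d,) - (n, d) broadcasts in_x across every row, so no tile() call is needed
    diff = np.asarray(in_x, dtype=float) - np.asarray(data_set, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k nearest samples
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    # most_common(1) returns [(label, count)] for the top-voted label
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
result = classify_broadcast([0, 0], group, labels, 3)
print(result)  # B
```

Broadcasting avoids materializing the repeated copy of `in_x`, which matters little for four samples but keeps the code shorter.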
