
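Principal Component Analysis (PCA) is a core topic in multivariate statistics and is widely applied in machine learning and other fields. Its main purpose is to reduce the dimensionality of high-dimensional data. PCA replaces the original n features with a smaller set of k features. Each new feature is a linear combination of the original features, chosen so that the sample variance is maximized and the k new features are, as far as possible, mutually uncorrelated.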
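To make the "uncorrelated linear combinations" claim concrete, here is a minimal sketch on synthetic data. The data, the mixing matrix, and all variable names are assumptions made up for this illustration. It projects centered data onto the top k eigenvectors of the covariance matrix and checks that the covariance of the new features is diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

# Synthetic, correlated data: 200 samples, 3 features (illustration only)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.7]])
X = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# Center the data and compute the sample covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder to descending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each new feature is a linear combination of the original features
k = 2
Z = Xc @ eigvecs[:, :k]

# Covariance of the new features: off-diagonal entries are numerically zero
cov_Z = np.cov(Z, rowvar=False)
print(np.round(cov_Z, 6))
print(eigvals[:k])
```

The variances of the new features equal the k largest eigenvalues, which is why sorting the eigenvalues in descending order keeps the directions of largest variance.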
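PCA proceeds in the following steps:
  • Organize the data into a format that is easy for the model to use.
  • Compute the mean of each feature over the samples.
  • Subtract the feature mean from every sample (centering).
  • Compute the covariance matrix.
  • Find the eigenvalues and eigenvectors of the covariance matrix.
  • Reorder the eigenvalues and eigenvectors so that the eigenvalues run from largest to smallest.
  • Compute the cumulative contribution ratio of the eigenvalues.
  • Select a subset of the eigenvectors according to a chosen threshold on the cumulative contribution ratio.
  • Transform the data (the centered data from the third step) with the selected eigenvectors.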
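The cumulative contribution ratio used in the last few steps can be sketched as below. The helper name `choose_k` and the sample eigenvalues are hypothetical and chosen only for illustration. This is one possible way to pick k, not necessarily how any particular library does it.

```python
import numpy as np

def choose_k(eigvals, rate):
    """Hypothetical helper: smallest k whose leading eigenvalues
    account for at least `rate` of the total variance."""
    # Sort eigenvalues from largest to smallest
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    # Cumulative contribution ratio of the first 1, 2, ... eigenvalues
    cum_ratio = np.cumsum(eigvals) / eigvals.sum()
    # First position where the cumulative ratio reaches `rate`
    k = int(np.searchsorted(cum_ratio, rate, side="left")) + 1
    # Guard against rounding that keeps the last ratio just below `rate`
    return min(k, len(eigvals)), cum_ratio

# Made-up eigenvalues for illustration
k, cum_ratio = choose_k([4.0, 3.0, 2.0, 1.0], rate=0.8)
print(k, cum_ratio)
```

With a threshold of 0.8, the first two eigenvalues only reach 0.7 of the total, so three components are kept.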

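  • The decomposition of the covariance matrix can be carried out either through the eigenvectors of the symmetric matrix or through a singular value decomposition (SVD). Scikit-learn also implements its PCA algorithm using SVD.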
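As a rough check that the two routes agree, the sketch below compares the eigenvalues of the covariance matrix with the singular values of the centered data matrix. It assumes the small 3×5 test matrix used in this article. The variable names are illustrative. Applying SVD directly to the centered data, rather than to the covariance matrix, is a choice made for this sketch.

```python
import numpy as np

X = np.array([[-1, -1, 0, 2, 1],
              [2, 0, 0, -1, -1],
              [2, 0, 1, 1, 0]], dtype=float)
Xc = X - X.mean(axis=0)          # center each column
n_samples = Xc.shape[0]

# Route 1: eigenvalues of the covariance matrix, sorted descending
cov = Xc.T @ Xc / (n_samples - 1)
eigvals = np.linalg.eigh(cov)[0][::-1]

# Route 2: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eig_from_svd = s**2 / (n_samples - 1)

# Projecting onto the right singular vectors equals U scaled by s
proj = Xc @ Vt[:2].T

print(eigvals[:3])
print(eig_from_svd)
```

The squared singular values divided by (number of samples − 1) reproduce the nonzero eigenvalues of the covariance matrix, so either decomposition yields the same principal directions up to sign.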
    import numpy as np
    from sklearn.decomposition import PCA
    import sys
    # Return the number of principal components to keep
    def index_lst(lst, component=0, rate=0):
        # component: number of principal components to keep
        # rate: target ratio sum(kept eigenvalues)/sum(all eigenvalues)
        # suggested range for rate: (0.8, 1)
        # when choosing by rate, the result is less than len(lst), or 0 if the rate is never reached
        if component and rate:
            raise ValueError('Component and rate must choose only one!')
        if not component and not rate:
            raise ValueError('Invalid parameter for numbers of components!')
        elif component:
            print('Choosing by component, components are %s......'%component)
            return component
        else:
            print('Choosing by rate, rate is %s ......'%rate)
            for i in range(1, len(lst)):
                if sum(lst[:i])/sum(lst) >= rate:
                    return i
            return 0

    def main():
        # test data
        mat = [[-1,-1,0,2,1],[2,0,0,-1,-1],[2,0,1,1,0]]
        # simple transform of test data
        Mat = np.array(mat, dtype='float64')
        print('Before PCA transformation, data is:\n', Mat)

        print('\nMethod 1: PCA by original algorithm:')
        p,n = np.shape(Mat) # shape of Mat
        t = np.mean(Mat, 0) # mean of each column
        # subtract the mean of each column
        Mat -= t
        # covariance matrix
        cov_Mat = np.dot(Mat.T, Mat)/(p-1)
        # eigenvalues and eigenvectors of the covariance matrix
        # (eigh returns the eigenvalues in ascending order)
        U,V = np.linalg.eigh(cov_Mat)
        # rearrange the eigenvalues and eigenvectors into descending order
        U = U[::-1]
        V = V[:, ::-1]
        # choose by component or by rate; exactly one of them must be nonzero
        Index = index_lst(U, component=2) # choose how many main factors
        if Index:
            v = V[:,:Index] # subset of the eigenvector matrix
        else: # improper rate choice may return Index=0
            print('Invalid rate choice.\nPlease adjust the rate.')
            print('Rate distribute follows:')
            print([sum(U[:i])/sum(U) for i in range(1, len(U)+1)])
            sys.exit(0)
        # data transformation
        T1 = np.dot(Mat, v)
        # print the transformed data
        print('We choose %d main factors.'%Index)
        print('After PCA transformation, data becomes:\n',T1)

        # PCA by original algorithm using SVD
        print('\nMethod 2: PCA by original algorithm using SVD:')
        # u: unitary matrix, eigenvectors in columns
        # d: list of the singular values, sorted in descending order
        u,d,v = np.linalg.svd(cov_Mat)
        Index = index_lst(d, rate=0.95) # choose how many main factors
        T2 = np.dot(Mat, u[:,:Index]) # transformed data
        print('We choose %d main factors.'%Index)
        print('After PCA transformation, data becomes:\n',T2)

        # PCA by Scikit-learn
        pca = PCA(n_components=2) # n_components can be integer or float in (0,1)
        print('\nMethod 3: PCA by Scikit-learn:')
        print('After PCA transformation, data becomes:')
        print(pca.fit_transform(mat)) # fit the model and return the transformed data

    main()

Output:

    Before PCA transformation, data is:
     [[-1. -1.  0.  2.  1.]
     [ 2.  0.  0. -1. -1.]
     [ 2.  0.  1.  1.  0.]]
    Method 1: PCA by original algorithm:
    Choosing by component, components are 2......
    We choose 2 main factors.
    After PCA transformation, data becomes:
     [[ 2.6838453  -0.36098161]
     [-2.09303664 -0.78689112]
     [-0.59080867  1.14787272]]
    Method 2: PCA by original algorithm using SVD:
    Choosing by rate, rate is 0.95 ......
    We choose 2 main factors.
    After PCA transformation, data becomes:
     [[ 2.6838453   0.36098161]
     [-2.09303664  0.78689112]
     [-0.59080867 -1.14787272]]
    Method 3: PCA by Scikit-learn:
    After PCA transformation, data becomes:
    [[ 2.6838453  -0.36098161]
     [-2.09303664 -0.78689112]
     [-0.59080867  1.14787272]]
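Note that the second column of Method 2 has the opposite sign from Methods 1 and 3. This is expected: an eigenvector is only defined up to sign, so different decompositions may return either direction. A small sketch that checks the results agree once the signs are aligned is shown below. The helper `align_signs` and its convention (make the largest-magnitude entry of each column positive) are assumptions chosen for this illustration. The arrays are copied from the output above.

```python
import numpy as np

# Transformed data copied from the Method 1 and Method 2 output above
T1 = np.array([[ 2.6838453,  -0.36098161],
               [-2.09303664, -0.78689112],
               [-0.59080867,  1.14787272]])
T2 = np.array([[ 2.6838453,   0.36098161],
               [-2.09303664,  0.78689112],
               [-0.59080867, -1.14787272]])

def align_signs(T):
    """Hypothetical helper: flip each column so that its
    largest-magnitude entry is positive."""
    rows = np.argmax(np.abs(T), axis=0)
    signs = np.sign(T[rows, np.arange(T.shape[1])])
    return T * signs

same = np.allclose(align_signs(T1), align_signs(T2))
print(same)
```

After alignment the two results coincide, so the three methods describe the same projection.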