Python: Dimensionality Reduction (PCA, Kernel PCA, SVD, Gaussian Random Projection, NMF)

The following comparison of dimensionality reduction methods is drawn from a Python data science guidebook.

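PCA: computationally expensive; suited to features that are linearly correlated.

Kernel PCA: the features may also be non-linearly correlated.

SVD: can explain the data better than PCA. Because it acts directly on the original data set, it does not turn correlated variables into a set of uncorrelated variables the way PCA does. In addition, PCA is a one-mode factor analysis method, in which the rows and columns of the matrix represent the same kind of entity, whereas SVD is a two-mode method that applies to a matrix relating two kinds of entities. In text mining, for example, the rows correspond to words and the columns to documents.

Gaussian random projection: fast; it reduces dimensionality while approximately preserving Euclidean distances. On large data sets it can run into memory problems, in which case sparse random projection can be considered instead.

NMF: common in recommender systems. Input matrix A = reduced matrix (rows) A_dash * component matrix (columns) F.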

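1. PCA: Principal Component Analysis. It is computationally expensive and only suitable when the features are linearly correlated. The steps are:

Center the data set.

Scale the data to unit standard deviation and compute its correlation matrix.

Decompose the correlation matrix into its eigenvectors and eigenvalues.

Select the top-N eigenvectors in descending order of eigenvalue.

Project the input feature matrix into the new space.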

# -*- coding: utf-8 -*-
"""
Created on Fri Mar 30 17:47:41 2018

@author: Alvin AI
"""

from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
import scipy.linalg
from sklearn.preprocessing import scale

data = load_iris()
x = data['data']
y = data['target']

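# Standardize each column to zero mean and unit variance
x_s = scale(x, with_mean=True, with_std=True, axis=0)
# Correlation matrix of the features
x_c = np.corrcoef(x_s.T)

# Eigendecomposition of the correlation matrix
eig_val, r_eig_vec = scipy.linalg.eig(x_c)
print('Eigen values \n%s' % (eig_val))
print('\n Eigen vectors \n%s' % (r_eig_vec))
# Share of variance explained = eigenvalue / number of eigenvalues (4 here)
# Keep the first two eigenvectors, which belong to the two largest eigenvalues in the output
w = r_eig_vec[:, 0:2]

# Project the standardized data; the labels y are not used
x_rd = x_s.dot(w)

plt.figure(1)
plt.scatter(x_rd[:, 0], x_rd[:, 1], c=y)
plt.xlabel('component 1')
plt.ylabel('component 2')

# Print the share of variance explained by each component
print("Component, Eigen Value, % of Variance, Cumulative %")
cum_per = 0
per_var = 0
for i, e_val in enumerate(eig_val):
    # eig returns complex eigenvalues; the imaginary parts are zero here
    per_var = round(e_val.real / len(eig_val), 3)
    cum_per += per_var
    print('%d, %0.2f, %0.2f, %0.2f' % (i + 1, e_val.real, per_var * 100, cum_per * 100))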

# Output:
Eigen values  # eigenvalues
[2.91081808+0.j 0.92122093+0.j 0.14735328+0.j 0.02060771+0.j]

 Eigen vectors  # eigenvectors
[[ 0.52237162 -0.37231836 -0.72101681  0.26199559]
 [-0.26335492 -0.92555649  0.24203288 -0.12413481]
 [ 0.58125401 -0.02109478  0.14089226 -0.80115427]
 [ 0.56561105 -0.06541577  0.6338014   0.52354627]]

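# Share of variance explained = eigenvalue / number of eigenvalues (4 here)
# 2.91/4 = 0.728, i.e. 72.8%
# The first component explains 72.80%, the second 23%, and the two together 95.8%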
Component, Eigen Value, % of Variance, Cumulative %
1, 2.91, 72.80, 72.80
2, 0.92, 23.00, 95.80
3, 0.15, 3.70, 99.50
4, 0.02, 0.50, 100.00
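
As a cross-check of the manual steps above, the same projection can be obtained with scikit-learn's `PCA` class. The following is only a sketch, assuming a reasonably recent scikit-learn; the variable name `manual_ratio` is introduced here for illustration. The exact numbers may differ slightly from the output listed above depending on the version of the iris data shipped with the library.

```python
# Sketch: compare sklearn's PCA with the eigenvalue-based variance shares.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

x = load_iris()['data']
x_s = scale(x, with_mean=True, with_std=True, axis=0)

# Fit PCA on the standardized data and keep two components
pca = PCA(n_components=2)
x_rd = pca.fit_transform(x_s)

# Eigenvalues of the correlation matrix, sorted in descending order
eig_val = np.sort(np.linalg.eigvalsh(np.corrcoef(x_s.T)))[::-1]
manual_ratio = eig_val / eig_val.sum()

print(pca.explained_variance_ratio_)  # variance share per kept component
print(manual_ratio[:2])               # same shares from the eigenvalues
```

On standardized data the two printed arrays should agree, because dividing each eigenvalue of the correlation matrix by the number of features is the same as dividing it by the sum of all eigenvalues.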

2. Kernel PCA: reduces the dimensionality of non-linear data sets. The available kernels are linear, polynomial, sigmoid, cosine, precomputed, and RBF.

from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import numpy as np
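from sklearn.decomposition import PCA  # standard PCA
from sklearn.decomposition import KernelPCA  # kernel PCA

# Generate a data set with non-linear structure
np.random.seed(10)  # fix the random seed
x,y = make_circles(n_samples=400, factor=.2, noise=0.02)  # factor: size of the inner circle relative to the outer one

plt.close('all')  # close any open figures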
plt.figure(1)
plt.title('original space')
plt.scatter(x[:,0],x[:,1],c=y)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')

# Standard PCA
pca = PCA(n_components=2)
pca.fit(x)
x_pca=pca.transform(x)

# Plot the first two principal components
plt.figure(2)
plt.title('pca')
plt.scatter(x_pca[:,0],x_pca[:,1],c=y)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')

# Plot the first principal component only; the two classes overlap and cannot be separated
class_1_index = np.where(y==0)[0]
class_2_index = np.where(y==1)[0]

plt.figure(3)
plt.title('pca-one component')
plt.scatter(x_pca[class_1_index,0],np.zeros(len(class_1_index)),color='red')
plt.scatter(x_pca[class_2_index,0],np.zeros(len(class_2_index)),color='blue')

# Kernel PCA
# Kernel PCA with the radial basis function (RBF) kernel
# gamma=10; gamma is the kernel coefficient (used by the rbf, poly and sigmoid kernels)
kpca = KernelPCA(kernel='rbf',gamma=10)
x_kpca = kpca.fit_transform(x)

plt.figure(4)
plt.title('kernel pca')
plt.scatter(x_kpca[:,0],x_kpca[:,1],c=y)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
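
The difference between the two projections can also be expressed as a number rather than a plot. The sketch below is an illustration, not part of the original listing: it asks how well a single threshold on the first component separates the two circles. The helper `stump_accuracy` is a hypothetical name, and `random_state=10` is passed to `make_circles` instead of calling `np.random.seed`, so the generated points may differ from those plotted above.

```python
# Sketch: how well does one threshold on the first component split the classes?
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.tree import DecisionTreeClassifier

x, y = make_circles(n_samples=400, factor=.2, noise=0.02, random_state=10)

# First component only, for standard PCA and for RBF kernel PCA
pc1 = PCA(n_components=1).fit_transform(x)
kpc1 = KernelPCA(n_components=1, kernel='rbf', gamma=10).fit_transform(x)


def stump_accuracy(feature, labels):
    """Training accuracy of a depth-1 decision tree (a single threshold)."""
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    return stump.fit(feature, labels).score(feature, labels)


acc_pca = stump_accuracy(pc1, y)
acc_kpca = stump_accuracy(kpc1, y)
print('PCA  first component: %.2f' % acc_pca)
print('KPCA first component: %.2f' % acc_kpca)
```

A clearly higher accuracy for the kernel PCA component is what the 'kernel pca' figure suggests visually: after the RBF mapping the inner and outer circles become separable along one axis.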

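3. Singular Value Decomposition (SVD): unlike PCA, it acts directly on the original data matrix. SVD factorizes an m*n matrix into the product of three matrices: A=U*S*V^T.

U: the left singular matrix, an m*k matrix.

V: the right singular matrix, an n*k matrix.

S: a k*k matrix whose diagonal entries are the singular values.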

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from scipy.linalg import svd

data = load_iris()
x = data['data']
y = data['target']

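# Preprocessing: center each column to zero mean
# with_std=False leaves the variances unscaled
x_s = scale(x,with_mean=True,with_std=False,axis=0)

# SVD decomposition
# full_matrices=False returns the reduced-size factors; the third return value is V^T
U,S,V = svd(x_s,full_matrices=False)

# Keep the first two columns of U as the two-dimensional representation
x_t = U[:,:2]

# Plot the data set in the reduced space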
plt.figure(1)
plt.scatter(x_t[:,0],x_t[:,1],c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()
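
The factorization A=U*S*V^T and the matrix shapes described above can be checked on a small example. This is a minimal sketch using a random toy matrix (an assumption made for illustration, not data from the article) and `numpy.linalg.svd`, which, like `scipy.linalg.svd`, returns the singular values as a 1-D array and the third factor already transposed.

```python
# Sketch: shapes, exact reconstruction and rank-k truncation of an SVD.
import numpy as np

rng = np.random.RandomState(0)
A = rng.rand(6, 4)          # toy m*n matrix with m=6, n=4
A_c = A - A.mean(axis=0)    # center the columns, as in the iris listing

# full_matrices=False gives U: m*k, S: k values, Vt: k*n, with k = min(m, n)
U, S, Vt = np.linalg.svd(A_c, full_matrices=False)
print(U.shape, S.shape, Vt.shape)

# Exact reconstruction: A_c = U * diag(S) * V^T
A_rebuilt = U @ np.diag(S) @ Vt

# Rank-2 approximation keeps only the two largest singular values
k = 2
A_rank2 = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Projecting onto the top-k right singular vectors equals U[:, :k] scaled by S[:k]
proj = A_c @ Vt[:k, :].T
print(np.allclose(proj, U[:, :k] * S[:k]))
```

The last line shows that the first k columns of U are the PCA-style projection divided by the corresponding singular values, so plotting the columns of U (as in the listing above) gives the same picture up to a rescaling of each axis.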

4. Gaussian random projection: fast; it reduces dimensionality while approximately preserving the distances between data points.

# -*- coding: utf-8 -*-
"""
Created on Mon Apr 23 21:19:54 2018

@author: Alvin AI
"""

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import euclidean_distances
from sklearn.random_projection import GaussianRandomProjection
import matplotlib.pyplot as plt

# Load the 20 newsgroups data set
# Only the sci.crypt category is used
# Other categories include "sci.med" and "sci.space"
cat = ['sci.crypt']
data = fetch_20newsgroups(categories=cat)

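# Turn the documents into a term-document matrix; use_idf=False keeps normalized term frequencies without idf weighting
vectorizer = TfidfVectorizer(use_idf=False)
vector = vectorizer.fit_transform(data.data)

# Random projection down to 1000 dimensions
gauss_proj = GaussianRandomProjection(n_components=1000)
vector_t = gauss_proj.fit_transform(vector)

# Print the shapes before and after the projection
print(vector.shape)
print(vector_t.shape)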

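# Pairwise Euclidean distances before and after the projection, and their absolute difference
org_dist = euclidean_distances(vector)
red_dist = euclidean_distances(vector_t)
diff_dist = abs(org_dist-red_dist)

# Heat map of the differences (first 100 documents only)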
plt.figure()
plt.pcolor(diff_dist[0:100,0:100])
plt.colorbar()
plt.show()
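
The overview at the top mentions sparse random projection as an alternative when the Gaussian variant runs into memory problems. The following sketch compares the two on a synthetic sparse matrix, used here as a stand-in for the tf matrix so that no download is needed; the matrix sizes and the helper `mean_rel_error` are assumptions made for illustration.

```python
# Sketch: Gaussian vs. sparse random projection on a synthetic sparse matrix.
import numpy as np
from scipy import sparse
from sklearn.metrics import euclidean_distances
from sklearn.random_projection import (GaussianRandomProjection,
                                       SparseRandomProjection)

# Synthetic stand-in for a term matrix: 200 "documents", 5000 "terms"
x = sparse.random(200, 5000, density=0.01, format='csr', random_state=0)

gauss = GaussianRandomProjection(n_components=1000, random_state=0)
sparse_rp = SparseRandomProjection(n_components=1000, random_state=0)

x_gauss = gauss.fit_transform(x)
x_sparse = sparse_rp.fit_transform(x)

# The Gaussian projection matrix is dense; the sparse one stores few non-zeros
print(gauss.components_.size, sparse_rp.components_.nnz)


def mean_rel_error(original, reduced):
    """Mean relative change of the pairwise distances after projection."""
    d0 = euclidean_distances(original)
    d1 = euclidean_distances(reduced)
    mask = d0 > 0  # skip the zero diagonal
    return np.mean(np.abs(d0[mask] - d1[mask]) / d0[mask])


err_gauss = mean_rel_error(x, x_gauss)
err_sparse = mean_rel_error(x, x_sparse)
print('Gaussian: %.3f, sparse: %.3f' % (err_gauss, err_sparse))
```

Both projections should keep the pairwise distances within a few percent here, while the sparse projection matrix holds only a small fraction of the entries of the dense Gaussian one.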

5. Non-negative Matrix Factorization (NMF): often used in recommender systems to predict entries that are missing from the original data.

# -*- coding: utf-8 -*-
"""
Created on Sat Mar 31 15:04:36 2018

@author: Alvin AI
"""

import numpy as np
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt

# User-by-movie ratings (one row per user, one column per movie)
ratings = [\
        [1,2,3,5,2,1],\
        [2,3,1,1,2,1],\
        [4,2,1,3,1,4],\
        [2,9,5,4,2,1],\
        [1,4,2,1,1,1]]

movie_dict = {1:'alvin story', 
              2:'star wars',
              3:'inception',
              4:'gunsa',
              5:'dream',
              6:'decomere'}

# Ratings as a float array (recent scikit-learn versions reject np.matrix input)
A = np.asarray(ratings, dtype=float)

max_components = 2
nmf = NMF(n_components=max_components, random_state=1)  # 2 components
# A_dash: user-by-component matrix, one row per user
A_dash = nmf.fit_transform(A)

for i in range(A_dash.shape[0]):
    print('User id = %d,  comp1 score = %0.2f, comp2 score = %0.2f'
          % (i + 1, A_dash[i][0], A_dash[i][1]))

# Factorization: A = A_dash * F

# Plot the users in component space
plt.figure(1)
plt.title('user concept mapping')
x = A_dash[:,0]
y = A_dash[:,1]
plt.scatter(x,y)
plt.xlabel('component1')
plt.ylabel('component2')

# F: component-by-movie matrix, one column per movie
F = nmf.components_
plt.figure(2)
plt.title('movie concept mapping')
x = F[0,:]
y = F[1,:]
plt.scatter(x,y)
plt.xlabel('component1')
plt.ylabel('component2')

for i in range(F[0,:].shape[0]):
    plt.annotate(movie_dict[i+1],(F[0,:][i],F[1,:][i]))  # label each point with its movie title
plt.show()    

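# Reconstruct the ratings matrix; entries that were missing (0) in the input would be filled with predicted scores
reconstructed_A = np.dot(A_dash,F)
np.set_printoptions(precision=2)  # print two decimal places
print(reconstructed_A)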
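
The listing fixes the number of components at 2. One way to judge that choice is to look at how the reconstruction error changes with the number of components. The sketch below is an illustration under stated assumptions: the `init='nndsvda'` and `max_iter=2000` settings and the names `W`, `H` and `errors` are choices made here, not taken from the original code.

```python
# Sketch: reconstruction error of NMF for different numbers of components.
import numpy as np
from sklearn.decomposition import NMF

ratings = np.array([[1, 2, 3, 5, 2, 1],
                    [2, 3, 1, 1, 2, 1],
                    [4, 2, 1, 3, 1, 4],
                    [2, 9, 5, 4, 2, 1],
                    [1, 4, 2, 1, 1, 1]], dtype=float)

errors = []
for k in range(1, 6):
    nmf = NMF(n_components=k, init='nndsvda', max_iter=2000, random_state=1)
    W = nmf.fit_transform(ratings)   # users * components (A_dash above)
    H = nmf.components_              # components * movies (F above)
    # Frobenius norm of the difference between A and A_dash * F
    errors.append(np.linalg.norm(ratings - W.dot(H)))
    print('n_components = %d, reconstruction error = %0.3f' % (k, errors[-1]))
```

More components generally give a lower error, at the price of a less compact representation, so the point where the error stops dropping sharply is a common, if informal, guide.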
