Chapter 3. A Tour of Machine Learning Classification Models - 2

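Maximum margin classification with support vector machines

Support vector machine (SVM)

A powerful and widely used learning algorithm

Can be considered an extension of the perceptron

Margin maximization
💡 Margin: the distance between the hyperplane that separates the classes (the decision boundary) and the training samples closest to that hyperplane.

Maximum margin

Decision boundaries with a large margin are preferred because they tend to have a lower generalization error

Models with a small margin are prone to overfitting

📍 Handling nonlinearly separable data with slack variables
💡 Slack variable: introduced because the linear constraints have to be relaxed for data that is not linearly separable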
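
For reference, the margin and the slack variables can be written down in the usual textbook form. The symbols $w$, $b$, $\xi^{(i)}$ and $C$ below follow standard SVM notation and are an assumption here, since the notes above do not spell out the optimization problem:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i}\xi^{(i)} \quad \text{subject to}\quad y^{(i)}\left(w^T x^{(i)} + b\right) \ge 1 - \xi^{(i)},\ \ \xi^{(i)} \ge 0$$

With labels $y^{(i)} \in \{-1, 1\}$, the width of the margin is $2 / \lVert w\rVert$, so minimizing $\lVert w\rVert$ maximizes the margin, while $C$ sets how strongly margin violations $\xi^{(i)}$ are penalized.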

from sklearn.svm import SVC

svm = SVC(kernel='linear', C=1.0, random_state=1)
svm.fit(x_train_std, y_train)

plot_decision_regions(x_combined_std, y_combined, 
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()

plt.show()

As with a regularized logistic regression model, decreasing the value of C increases the bias and lowers the variance of the model
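
The effect of C can be checked with a small experiment. The sketch below is only an illustration: it assumes the data preparation matches the earlier part of these notes (Iris petal length and width, a stratified 70/30 split, standardization), and the variable names are hypothetical. It counts the support vectors for several values of C; a smaller C tolerates more margin violations, so more samples typically end up as support vectors.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed setup: petal length and petal width of the Iris dataset
iris = load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Standardize using statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

n_support = {}
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel='linear', C=C, random_state=1)
    model.fit(X_train_std, y_train)
    n_support[C] = model.support_vectors_.shape[0]
    print(f"C={C}: {n_support[C]} support vectors, "
          f"test accuracy {model.score(X_test_std, y_test):.3f}")
```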
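💡 When using scikit-learn, the dataset is sometimes too large to fit into the computer's memory
👉 scikit-learn provides the SGDClassifier class as an alternative

It supports online learning through the partial_fit method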

from sklearn.linear_model import SGDClassifier

ppn = SGDClassifier(loss='perceptron')
lr = SGDClassifier(loss='log_loss')  # this loss was named 'log' before scikit-learn 1.1
svm = SGDClassifier(loss='hinge')
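
As a rough illustration of online learning with partial_fit, the sketch below feeds a synthetic dataset to SGDClassifier in mini-batches. The data, the batch size of 100, and the variable names are assumptions made for this example; in practice each batch would be read from disk or a stream instead of being sliced from an in-memory array.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic, linearly separable data standing in for a large dataset
rng = np.random.RandomState(1)
X = rng.randn(1000, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(loss='hinge', random_state=1)
classes = np.unique(y)  # partial_fit needs all class labels on the first call

# Update the model one mini-batch at a time
for start in range(0, X.shape[0], 100):
    batch = slice(start, start + 100)
    clf.partial_fit(X[batch], y[batch], classes=classes)

acc = clf.score(X, y)
print(f"accuracy after 10 mini-batches: {acc:.3f}")
```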

Solving nonlinear problems with a kernel SVM

📍 Kernel methods for data that is not linearly separable

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
x_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(x_xor[:, 0] > 0,
                      x_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)

plt.scatter(x_xor[y_xor == 1, 0], x_xor[y_xor == 1, 1],
           c='b', marker='x', label='1')
plt.scatter(x_xor[y_xor == -1, 0], x_xor[y_xor == -1, 1],
           c = 'r', marker='s', label='-1')
plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend(loc='best')
plt.tight_layout()

plt.show()

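The basic idea of kernel methods is to use a mapping function $\phi$ to project the data onto a higher-dimensional space

📍 Using the kernel trick to find a separating hyperplane in a high-dimensional space

The problem when the kernel trick is not used

To solve a nonlinear problem with an SVM, the training data is transformed into a higher-dimensional feature space with the mapping function

A linear SVM model is then trained to classify the data in the new feature space

The computational cost of constructing the new features is very high

When the kernel trick is used
The dot product $x^{(i)T}x^{(j)}$ is replaced with $\phi(x^{(i)})^T\phi(x^{(j)})$
・Definition of the kernel function

$$K(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T\phi(x^{(j)})$$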
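
To make the kernel trick concrete, the sketch below uses a degree-2 polynomial kernel on two-dimensional inputs. This kernel and the helper names `phi` and `poly_kernel` are chosen here purely for illustration and are not taken from the notes (the code that follows uses the RBF kernel instead). The point is that the kernel value computed in the original space equals the dot product of the explicitly mapped vectors.

```python
import numpy as np

def phi(x):
    """Explicit mapping of a 2D point into a 3D feature space."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, evaluated in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # map first, then take the dot product
implicit = poly_kernel(x, z)       # never builds the 3D vectors

print(explicit, implicit)
```

Both values agree, but the kernel version never constructs the higher-dimensional features, which is where the savings come from when the mapped space is very large.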

svm = SVC(kernel='rbf', random_state=1, gamma=0.10, C=10.0)
svm.fit(x_xor, y_xor)

plot_decision_regions(x_xor, y_xor, classifier=svm)
plt.legend(loc='upper left')
plt.tight_layout()

plt.show()

📍 Applying it to the Iris dataset

The value of $\gamma$ (the gamma parameter) changes the influence, or reach, of the support vectors

svm = SVC(kernel='rbf', random_state=1, gamma=0.2, C=1.0)
svm.fit(x_train_std, y_train)

plot_decision_regions(x_combined_std, y_combined, 
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()

plt.show()

When the value of $\gamma$ is increased

svm = SVC(kernel='rbf', random_state=1, gamma=100.0, C=1.0)
svm.fit(x_train_std, y_train)

plot_decision_regions(x_combined_std, y_combined, 
                      classifier=svm, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()

plt.show()

In short, the $\gamma$ parameter plays an important role in controlling overfitting, that is, the variance of the model.
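
One way to see this without a plot is to compare training and test accuracy for the two gamma values used above. The sketch assumes the same Iris setup as the rest of these notes (petal features, stratified 70/30 split, standardization); the variable names are hypothetical, and no particular numbers are claimed here. A large gap between training and test accuracy is the usual sign of overfitting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed setup: petal length and petal width of the Iris dataset
iris = load_iris()
X, y = iris.data[:, [2, 3]], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

results = {}
for gamma in (0.2, 100.0):
    model = SVC(kernel='rbf', random_state=1, gamma=gamma, C=1.0)
    model.fit(X_train_std, y_train)
    results[gamma] = (model.score(X_train_std, y_train),
                      model.score(X_test_std, y_test))
    print(f"gamma={gamma}: train={results[gamma][0]:.3f}, "
          f"test={results[gamma][1]:.3f}")
```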

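Decision tree learning

Decision tree classifiers are useful models when interpretability matters

The data is split at a node, and this splitting step is repeated at every child node until the leaf nodes are pure

Because the feature space is divided into rectangular regions, complex decision boundaries can be built

The deeper the decision tree, the more complex the decision boundary becomes and the more easily the model overfits, so care is needed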
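
How a candidate split is scored can be sketched in a few lines. The example below uses Gini impurity, the criterion passed to DecisionTreeClassifier in the code that follows; the helper names `gini` and `information_gain` and the toy labels are made up for illustration and are not part of scikit-learn.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """Impurity of the parent minus the weighted impurity of the children."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the classes perfectly yields pure children
gain_pure = information_gain(parent, parent[:4], parent[4:])

# A split that leaves both children as mixed as the parent gains nothing
gain_mixed = information_gain(parent,
                              np.array([0, 0, 1, 1]),
                              np.array([0, 0, 1, 1]))

print(gain_pure, gain_mixed)
```

At each node the tree picks the split with the largest gain, and it stops splitting a branch once the node is pure (impurity 0) or a limit such as max_depth is reached.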

📍 Building a decision tree

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=1)
tree.fit(x_train, y_train)
x_combined = np.vstack((x_train, x_test))
y_combined = np.hstack((y_train, y_test))

plot_decision_regions(x_combined, y_combined,
                     classifier=tree, test_idx=range(105, 150))
plt.xlabel('petal length [cm]')
plt.ylabel('petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()

plt.show()

📍 Visualizing the decision tree model with the GraphViz program

from pydotplus import graph_from_dot_data
from sklearn.tree import export_graphviz

dot_data = export_graphviz(tree, filled=True, rounded=True,
                          class_names=['Setosa', 'Versicolor', 'Virginica'],
                          feature_names=['petal length', 'petal width'],
                          out_file=None)
graph = graph_from_dot_data(dot_data)
graph.write_png('tree.png')

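Combining multiple decision trees with a random forest

Random forest

- An **ensemble** of decision trees
- Averages multiple decision trees
- Builds a robust model, which improves generalization performance and reduces the risk of overfitting

1. Draw a random bootstrap sample of size $n$ (with replacement)

2. Grow a decision tree from the bootstrap sample. At each node:
a. Randomly select $d$ features without replacement
b. Split the node using the feature that gives the best split according to an objective function such as information gain

3. Repeat steps 1 to 2 $k$ times

4. Collect the predictions of all trees and assign the class label by majority vote

A big advantage of random forests is that, although they are not as easy to interpret as decision trees, they do not require much effort for hyperparameter tuning

The more trees there are, the better the random forest classifier performs, at the price of a higher computational cost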
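
The procedure above can be sketched by hand on top of DecisionTreeClassifier. This is a simplified illustration under stated assumptions, not how RandomForestClassifier is implemented: the function names `fit_forest` and `predict_forest` are hypothetical, all four Iris features are used, and max_features="sqrt" stands in for $d$. scikit-learn's own random forest also averages the class probabilities of the trees instead of taking a hard majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, k=25, d="sqrt", seed=1):
    """Grow k trees, each on its own bootstrap sample."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    trees = []
    for _ in range(k):
        idx = rng.randint(0, n, size=n)  # step 1: bootstrap sample of size n
        tree = DecisionTreeClassifier(
            criterion="gini",
            max_features=d,              # step 2a: d random features per split
            random_state=rng.randint(2 ** 31 - 1))
        tree.fit(X[idx], y[idx])         # step 2b: best split among those features
        trees.append(tree)               # step 3: repeated k times by the loop
    return trees

def predict_forest(trees, X):
    """Step 4: majority vote over the predictions of all trees."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])

X, y = load_iris(return_X_y=True)
trees = fit_forest(X, y, k=25)
preds = predict_forest(trees, X)
acc = np.mean(preds == y)
print(f"training accuracy of the hand-rolled forest: {acc:.3f}")
```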

📍 Decision regions formed by the ensemble of trees in the random forest

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(criterion='gini', n_estimators=25, random_state=1, n_jobs=2)
forest.fit(x_train, y_train)

plot_decision_regions(x_combined, y_combined,
                     classifier=forest, test_idx=range(105, 150))
plt.xlabel('petal length [cm]')
plt.ylabel('petal width [cm]')
plt.legend(loc='upper left')
plt.tight_layout()

plt.show()

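The n_estimators parameter is set to use 25 decision trees

Gini impurity is used as the impurity measure for splitting nodes

K-nearest neighbors: a lazy learning algorithm

Instead of learning a discriminative function from the training data, it stores the training dataset in memory.

Choose the number $k$ and a distance metric

Find the $k$ nearest neighbors of the sample to be classified

Assign the class label by majority vote

Based on the chosen distance metric, the KNN algorithm finds the $k$ samples in the training dataset that are closest to the point to be classified.

The class label of the new data point is determined by a majority vote among its $k$ nearest neighbors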
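
The whole algorithm fits in a few lines of NumPy, which also shows why it is called lazy: there is no training step, only a lookup at prediction time. The function name `knn_predict` and the toy data below are assumptions made for this sketch, and ties are broken in favor of the smallest label.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5, p=2):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Minkowski distance from x_new to every stored training sample
    dists = np.sum(np.abs(X_train - x_new) ** p, axis=1) ** (1.0 / p)
    nearest = np.argsort(dists)[:k]          # indices of the k closest samples
    return np.bincount(y_train[nearest]).argmax()

# Two small clusters with integer class labels
X_train = np.array([[0, 0], [0, 1], [1, 0],
                    [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
label_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3)
print(label_a, label_b)
```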

📍 A KNN model in scikit-learn using the Euclidean distance metric

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(x_train_std, y_train)

plot_decision_regions(x_combined_std, y_combined,
                     classifier=knn, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()

plt.show()

When the parameter p is set to 1: Manhattan distance
When it is set to 2: Euclidean distance
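
With metric='minkowski', the distance between two points is $\left(\sum_k |x_k - z_k|^p\right)^{1/p}$. The short check below evaluates it for p=1 and p=2 on one pair of points; the helper name `minkowski` and the points are chosen here only for illustration.

```python
import numpy as np

def minkowski(x, z, p):
    """Minkowski distance between two points for a given p."""
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

x = np.array([0.0, 0.0])
z = np.array([3.0, 4.0])

manhattan = minkowski(x, z, p=1)  # |3| + |4|
euclidean = minkowski(x, z, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```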

Reference

More information on this topic (Chapter 3. A Tour of Machine Learning Classification Models - 2) can be found at https://velog.io/@ksj5738/3장.-사이킷런을-타고-떠나는-머신-러닝-분류-모델-투어-2

This text may be freely shared or copied, provided that the URL of this document is kept as a reference URL.

Collection and Share based on the CC Protocol
