XGBoost with breast_cancer

This post analyzes the Wisconsin breast cancer dataset with XGBoost.

Modules

import xgboost as xgb
from xgboost import plot_importance
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

Loading the data

dataset = load_breast_cancer()
x_features = dataset.data
y_label = dataset.target
df = pd.DataFrame(data=x_features, columns=dataset.feature_names)
df['target'] = y_label
df.head()

print(dataset.target_names)
print(df['target'].value_counts())

['malignant' 'benign']
1 357
0 212
Name: target, dtype: int64

Splitting the data

x_train, x_test, y_train, y_test = train_test_split(x_features, y_label, test_size=0.2, random_state=156)
print(x_train.shape, x_test.shape)

(455, 30) (114, 30)
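
As a quick check on the shapes printed above, the split sizes can be reproduced by hand. This is a minimal sketch, assuming a fractional test_size is rounded up to a whole number of samples and the training set receives the remainder; the helper name split_sizes is hypothetical and not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    gets whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# The dataset has 569 samples (357 benign + 212 malignant).
n_train, n_test = split_sizes(569, 0.2)
print(n_train, n_test)
```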

DMatrix

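The Python wrapper for XGBoost requires separate DMatrix objects for the training and test datasets. A DMatrix is usually built from a NumPy array; a pandas DataFrame can be passed through its values, and a libsvm text-format file or an XGBoost binary buffer file can also be used as input.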

dtrain = xgb.DMatrix(data=x_train, label=y_train)
dtest = xgb.DMatrix(data=x_test, label=y_test)

Building the XGBoost model

First, specify the hyperparameters used to run XGBoost.

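params = {
    'max_depth': 3,                   # maximum depth of each tree
    'eta': 0.1,                       # learning rate
    'objective': 'binary:logistic',   # binary classification, outputs probabilities
    'eval_metric': 'logloss',
}
# Early stopping is not a booster parameter; it is passed to xgb.train()
# as early_stopping_rounds.
num_rounds = 400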

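To use early stopping in XGBoost, an eval set and an eval metric must both be specified. At each boosting round, XGBoost measures the prediction error on the dataset given as the eval set, using the eval metric.

eval set: the dataset on which performance is evaluated

eval metric: the metric used to evaluate performance on the eval set. For classification, 'error' and 'logloss' are the usual choices.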
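
To make the two metrics concrete, the sketch below computes them by hand on made-up labels and probabilities. The function names are hypothetical; 'error' is taken as the fraction of samples misclassified at a 0.5 threshold and 'logloss' as the mean negative log-likelihood, which is how these metrics are commonly defined.

```python
import math


def error_rate(y_true, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    wrong = sum((p > threshold) != bool(y) for y, p in zip(y_true, probs))
    return wrong / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up example: the third sample is a positive predicted at 0.4.
y_true = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.1]
print(error_rate(y_true, probs))
print(round(log_loss(y_true, probs), 4))
```

Unlike error, logloss also penalizes correct predictions that are made with low confidence, so it keeps changing after the error rate has flattened out.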

# Label the training dataset 'train' and the evaluation dataset 'eval'
wlist = [(dtrain, 'train'), (dtest, 'eval')]
xgb_model = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_rounds,
                      early_stopping_rounds=100, evals=wlist)

The training log shows both the train logloss and the eval logloss decreasing. Training also stopped early, at round 311.
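
The early-stopping rule can be sketched in plain Python. This is a simplified illustration, not XGBoost's implementation: training halts once the eval metric has gone `patience` rounds without improving on its best value. The loss values and the function name are made up.

```python
def early_stopping_round(losses, patience):
    """Return (stop_round, best_round), both 0-based.

    stop_round is None if the patience limit is never reached.
    """
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_loss = loss
            best_round = i
        elif i - best_round >= patience:
            return i, best_round
    return None, best_round


# Made-up eval losses: the best value is at round 2, then it creeps up.
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.44]
print(early_stopping_round(losses, patience=3))
```

By the same logic, stopping at round 311 with early_stopping_rounds=100 suggests the best eval logloss was reached about 100 rounds earlier. After early stopping, the Booster exposes that round as best_iteration.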

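pred_probs = xgb_model.predict(dtest)
print('First 10 results of predict(), shown as predicted probabilities.')
print(np.round(pred_probs[:10], 3))

First 10 results of predict(), shown as predicted probabilities.
[0.934 0.003 0.91 0.094 0.993 1. 1. 0.999 0.997 0. ]

predict() returns predicted probabilities rather than class labels. The label is therefore set to 1 when the probability is greater than 0.5 and to 0 otherwise.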

pred = [1 if x > 0.5 else 0 for x in pred_probs]
print('First 10 predictions:\n', pred[:10])

First 10 predictions:
 [1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
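
As background on why predict() yields probabilities: with the binary:logistic objective, the summed tree outputs (the raw margin) are passed through a sigmoid. The sketch below uses made-up margin values and shows that thresholding the probability at 0.5 is the same as thresholding the margin at 0.

```python
import math


def sigmoid(z):
    """Map a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up raw margins, standing in for summed tree outputs.
margins = [2.65, -5.8, 0.0, -2.27, 4.96]
probs = [sigmoid(z) for z in margins]

labels_from_probs = [1 if p > 0.5 else 0 for p in probs]
labels_from_margins = [1 if z > 0 else 0 for z in margins]

print([round(p, 3) for p in probs])
print(labels_from_probs)
```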

Evaluating the predictions

from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# From chapter 3
def get_clf_eval(y_test, pred, pred_probs):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    # ROC-AUC is computed from the predicted probabilities, not the labels
    roc_auc = roc_auc_score(y_test, pred_probs)
    print('Confusion matrix')
    print(confusion)
    print('Accuracy: {:.4f}, Precision: {:.4f}, Recall: {:.4f}, '
          'F1: {:.4f}, AUC: {:.4f}'.format(accuracy, precision, recall, f1, roc_auc))

get_clf_eval(y_test, pred, pred_probs)

Confusion matrix
[[35  2]
 [ 1 76]]
Accuracy: 0.9737, Precision: 0.9744, Recall: 0.9870, F1: 0.9806, AUC: 0.9951
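
The reported scores can be re-derived from the confusion matrix alone. A short sketch, assuming the scikit-learn layout (rows are actual classes, columns are predicted classes, in class order 0 then 1) with class 1 (benign) as the positive class:

```python
# Confusion matrix from the output above: [[35, 2], [1, 76]]
tn, fp = 35, 2   # actual 0 (malignant): predicted 0, predicted 1
fn, tp = 1, 76   # actual 1 (benign):    predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
```

AUC cannot be recovered this way, because it depends on the ranking of the predicted probabilities rather than on the thresholded labels.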

plot_importance

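As with feature_importances_ in scikit-learn, XGBoost can visualize feature importance with plot_importance. By default the plot shows the F score, which here is the number of times each feature is used to split across all trees (importance_type='weight'), not the F1 metric.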

from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(figsize=(10, 12))
plot_importance(xgb_model, ax=ax)
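
Because the DMatrix was built from a plain NumPy array, the plot labels features as f0, f1, ... rather than by name. The sketch below maps such keys back to column names. The scores dictionary is a made-up stand-in for what xgb_model.get_score(importance_type='weight') returns, the list holds only the first three entries of dataset.feature_names, and the helper name is hypothetical.

```python
def rename_scores(scores, feature_names):
    """Map 'f<index>' keys to feature names, sorted by score (descending)."""
    renamed = {feature_names[int(key[1:])]: value for key, value in scores.items()}
    return sorted(renamed.items(), key=lambda item: item[1], reverse=True)


# Made-up importance scores keyed the way XGBoost names unnamed features.
scores = {"f0": 12, "f2": 30, "f1": 5}
feature_names = ["mean radius", "mean texture", "mean perimeter"]

for name, score in rename_scores(scores, feature_names):
    print(name, score)
```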

Reference

Original post (XGBoost with breast_cancer): https://velog.io/@lsmmay322/XGBoost-with-breastcancer

This text may be freely shared or copied, provided the URL of this document is kept as a reference.
