Evaluation Metrics for Classification

Evaluation metrics for classification problems

Accuracy: the share of all predictions that are correct.
$\frac{TP + TN}{Total}$

Precision: of the samples predicted positive, the share that are actually positive.
$\frac{TP}{TP + FP}$

Recall (Sensitivity): of the samples that are actually positive, the share that are predicted positive.
$\frac{TP}{TP + FN}$

F1 score: the harmonic mean of precision and recall.
$2 \cdot \frac{precision \cdot recall}{precision + recall}$

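True Positive (TP): predicted positive, and actually positive.

True Negative (TN): predicted negative, and actually negative.

False Positive (FP): predicted positive, but actually negative.

False Negative (FN): predicted negative, but actually positive.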
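As a quick check of the definitions above, here is a minimal sketch that counts TP, TN, FP and FN by hand and plugs them into the four formulas. The label lists and the helper name `confusion_counts` are made up for illustration and are not part of the original notes.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


# Hypothetical labels, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print("f1:", f1)
```

Note that precision and recall are undefined when their denominators are zero (no positive predictions, or no actual positives); this sketch does not handle that case.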

Confusion Matrix

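A table for evaluating a classification model's performance: it cross-tabulates actual classes against predicted classes.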

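from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

pipe = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1)
)
# Fit before evaluating (X_train / y_train are assumed to be the training split)
pipe.fit(X_train, y_train)

# Visualization
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
fig, ax = plt.subplots()
pcm = ConfusionMatrixDisplay.from_estimator(pipe, X_val, y_val,
                                            cmap=plt.cm.Blues,
                                            ax=ax)
plt.title(f'Confusion matrix, n = {len(y_val)}', fontsize=15)
plt.show()

pcm.confusion_matrix
# Output (rows: actual class, columns: predicted class)
array([[6165, 1515],
       [1930, 4442]])

# Check precision and recall
from sklearn.metrics import classification_report
y_pred = pipe.predict(X_val)
print(classification_report(y_val, y_pred))
# Output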
                precision    recall  f1-score   support

           0       0.76      0.80      0.78      7680
           1       0.75      0.70      0.72      6372

    accuracy                           0.75     14052
   macro avg       0.75      0.75      0.75     14052
weighted avg       0.75      0.75      0.75     14052
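The report can be reproduced from the confusion matrix alone. The sketch below takes the four counts printed above, assumes scikit-learn's layout (rows are actual classes, columns are predicted classes), and recomputes the per-class figures with the formulas from the first section.

```python
# Confusion matrix from the output above: rows = actual class, columns = predicted class
cm = [[6165, 1515],
      [1930, 4442]]
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Class 1 treated as the positive class
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 treated as the positive class: the roles of TP/TN and FP/FN swap
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f} (n={total})")
```

The rounded values match the rows of the classification report: 0.76 / 0.80 / 0.78 for class 0, 0.75 / 0.70 / 0.72 for class 1, and an accuracy of 0.75 over 14052 samples.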

Threshold

from sklearn.pipeline import make_pipeline
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

pipe = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1)
)
# Fit first (X_train / y_train are assumed to be the training split)
pipe.fit(X_train, y_train)
# Predict probabilities on the validation set
pipe.predict_proba(X_val)
# Output
array([[0.46      , 0.54      ],
       [0.85      , 0.15      ],
       [0.78      , 0.22      ],
       ...
       P(class 0)   P(class 1)

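※ With a threshold of 0.7, a sample is classified as 1 only when its probability of class 1 is 0.7 or higher.
ex) [0.46, 0.54] becomes 0 / [0.15, 0.85] becomes 1
Lowering the threshold raises recall and tends to lower precision; raising the threshold does the opposite.
※ Why does this matter?
ex) By lowering the threshold, more of the people with a high probability of not getting vaccinated can be found, at the cost of more false positives.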
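The effect of moving the threshold can be sketched in a few lines. The first list reuses the class-1 probabilities of the three rows shown in the `predict_proba` output; the second data set (`y_true`, `scores`) and both helper functions are hypothetical and only meant to show the trade-off.

```python
def predict_with_threshold(proba_class1, threshold=0.5):
    """Label a sample 1 when its class-1 probability is at or above the threshold."""
    return [int(p >= threshold) for p in proba_class1]


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1); assumes both denominators are non-zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)


# Class-1 probabilities of the first three rows of the predict_proba output
first_rows = [0.54, 0.15, 0.22]
labels_05 = predict_with_threshold(first_rows, 0.5)
labels_07 = predict_with_threshold(first_rows, 0.7)
print("threshold 0.5:", labels_05)
print("threshold 0.7:", labels_07)

# Hypothetical labels and scores, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.45, 0.2, 0.1]

for threshold in (0.75, 0.5, 0.35):
    y_pred = predict_with_threshold(scores, threshold)
    precision, recall = precision_recall(y_true, y_pred)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, recall climbs from 0.50 to 0.75 to 1.00 as the threshold drops, while precision falls from 1.00 to 0.75 to 0.67.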

ROC curve, AUC

The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) across multiple thresholds.

Recall (true positive rate): $\frac{TP}{TP + FN}$

Fall-out (false positive rate): $\frac{FP}{FP + TN}$

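To raise recall, the threshold for a positive prediction can be lowered until every sample is predicted positive. But then the false positive rate, the share of actual negatives that are predicted positive, rises as well.
The optimal threshold is the one that maximizes recall while minimizing the false positive rate, i.e. the threshold at which TPR - FPR is largest.

AUC is the area under the ROC curve.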

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# Probability of class 1 for each validation sample
y_pred_proba = pipe.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba)

roc = pd.DataFrame({
    'FPR(Fall-out)': fpr, 
    'TPR(Recall)': tpr, 
    'Threshold': thresholds
})
# Visualize the ROC curve
plt.scatter(fpr, tpr)
plt.title('ROC curve')
plt.xlabel('FPR(Fall-out)')  # x axis
plt.ylabel('TPR(Recall)');   # y axis

# The threshold where TPR - FPR is largest
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]  # optimal threshold
print('idx:', optimal_idx, ', threshold:', optimal_threshold)  # output

# AUC score
from sklearn.metrics import roc_auc_score
auc_score = roc_auc_score(y_val, y_pred_proba)
auc_score
# Output
0.82653190

# Predict on the test set
y_test_proba = pipe.predict_proba(X_test)[:, 1]
y_test_optimal = y_test_proba >= optimal_threshold  # True where the probability is at or above the threshold
# Submission form
submission = pd.DataFrame(y_test_optimal).reset_index().astype(int)
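To make the ROC and AUC computation less of a black box, here is a rough pure-Python sketch of the same idea on toy data. It is not scikit-learn's implementation: `roc_points` and `auc_trapezoid` are hypothetical helpers, the labels and scores are made up, and `roc_curve` itself may return fewer points because it drops some intermediate thresholds by default.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr, thresholds), one point per distinct score, highest threshold first."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    distinct = sorted(set(scores), reverse=True)
    fpr, tpr = [0.0], [0.0]  # starting point: nothing is predicted positive
    for th in distinct:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr, [float("inf")] + distinct


def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Hypothetical labels and scores, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)

# Same rule as np.argmax(tpr - fpr): the first index where TPR - FPR is largest
j = [t - f for t, f in zip(tpr, fpr)]
optimal_idx = j.index(max(j))
optimal_threshold = thresholds[optimal_idx]

print("AUC:", auc)
print("optimal threshold:", optimal_threshold)
```

On this toy data the AUC is 0.875, which equals the share of (positive, negative) pairs in which the positive sample gets the higher score (14 of 16), and the threshold maximizing TPR - FPR is 0.4.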

Reference

More information on this topic (Evaluation Metrics for Classification) can be found at https://velog.io/@ssulee0206/Evaluation-Metrics-for-Classification
