K-Fold Cross Validation

1 Import libraries
# import K-Fold library
from sklearn.model_selection import KFold

# evaluation score libraries
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

# NumPy is used below to stack and average the fold scores
import numpy as np
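Before wiring `KFold` into the pipeline below, it helps to see what `KFold.split` actually yields: one (train indices, validation indices) pair per fold. A minimal sketch on toy data (the 10-sample array is made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

# toy data: 10 samples, so with 5 folds each validation set holds 2 samples
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X_toy):
    print("train:", train_idx, "val:", val_idx)
```

Each index lands in exactly one validation fold, so the K validation sets partition the data.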
2 Define Functions for K-Folds
2.1 Print Function
  • A function that prints the MAE, MSE, and R2 scores.
  • Modify it to use whatever evaluation metrics you need.
  • def print_function(scores):
    
        score1, score2, score3, score4, score5, score6 = scores
        print("------ MAE ------")
        print("Train loss : %.4f" % score1)
        print("Validation loss : %.4f" % score2)
        print()
        print("------ MSE ------")
        print("Train loss : %.4f" % score3)
        print("Validation loss : %.4f" % score4)
        print()
        print("------ R2 ------")
        print("Train R2 score : %.4f" % score5)
        print("Validation R2 score : %.4f" % score6)
        print()
    2.2 Calculating CV Score
  • A function that computes the scores consumed by the print function defined above.
  • It can be used as-is.
    If you want each fold's score in addition to the average, uncomment the lines marked with *.
  • def train_and_validation(train_data, validation_data, model, metrics, print_mode):
        # unpack train / validation data
        X_train, y_train = train_data
        X_val, y_val = validation_data
        
        model.fit(X_train, y_train)
        train_pred = model.predict(X_train)
        val_pred = model.predict(X_val)
    
        score1 = metrics[0](y_train, train_pred)
        score2 = metrics[0](y_val, val_pred)
        score3 = metrics[1](y_train, train_pred)
        score4 = metrics[1](y_val, val_pred)
        score5 = metrics[2](y_train, train_pred)
        score6 = metrics[2](y_val, val_pred)
    
        scores = [score1, score2, score3, score4, score5, score6]
    	
        ### * to see each fold's score
        # if print_mode:
        #     print_function(scores)
        
        return np.array(scores)
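As a quick sanity check, the same fit → predict → score pattern can be run standalone on synthetic data. The names `X_demo` and `y_demo` below are made up for illustration, with `LinearRegression` standing in for any of the models used later:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# synthetic linear data with a little noise (made up for this sketch)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# fit on the first 80 samples, score the held-out 20
model = LinearRegression().fit(X_demo[:80], y_demo[:80])
val_pred = model.predict(X_demo[80:])
for metric in (mean_absolute_error, mean_squared_error, r2_score):
    print("%s : %.4f" % (metric.__name__, metric(y_demo[80:], val_pred)))
```

The helper above applies exactly this pattern, once per fold, and returns the six scores as an array.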
    3 K-Fold CV
    3.1 Model Selection and Setting
    # Choose K
    K = 5
    
    # K-folds
    kfcv = KFold(n_splits=K, shuffle=True, random_state=42)
    
    # evaluation score functions
    evalution = [mean_absolute_error, mean_squared_error, r2_score]
    
    # models 
    ## set up the models you want (one import per library)
    from sklearn.linear_model import LinearRegression
    from lightgbm import LGBMRegressor
    from catboost import CatBoostRegressor
    from xgboost import XGBRegressor
    
    lr = LinearRegression()           # `normalize=True` was removed in recent scikit-learn
    lgbm = LGBMRegressor()
    catb = CatBoostRegressor(silent=True)
    xgbm = XGBRegressor(verbosity=0)  # `silent=True` is deprecated in recent XGBoost
    
    ## collect the chosen models in a list.
    models = [lr, lgbm, catb, xgbm]
    
    print_mode = True
    3.2 Run K-Fold CV
  • Running this returns the scores averaged over the K folds.
    If you also want each fold's values, uncomment the lines marked with *.
  • # some models print long messages; use this to suppress them
    import warnings
    warnings.filterwarnings("ignore")
    
    for index, model in enumerate(models):
        if print_mode:
            print(f"\n====== Model {model} ======\n")
    
        # empty list to collect the fold index pairs
        folds = []
    
        # model's scores
        model_scores = []
    
        X = X_iter  # feature DataFrame (X_iter and y are assumed defined elsewhere)
        # Generate K-fold
        for train_index, val_index in kfcv.split(X, y):
            folds.append((train_index, val_index))
        
        # train and validate on each fold
        for i in range(K):
            ### * if you want each fold's values
            # if print_mode:
            #     print(f"{i+1}th fold of {K} folds.")
            
            train_index, val_index = folds[i]
    
            X_train = X.iloc[train_index, :]
            X_val = X.iloc[val_index, :]
            y_train = y[train_index]
            y_val = y[val_index]
    
            # compute the model's scores on this fold
            scores = train_and_validation((X_train, y_train), (X_val, y_val), model, evalution, print_mode)
            model_scores.append(scores)
    
        # mean of scores
        model_scores = np.array(model_scores)
        if print_mode:
            print("Average score over %d folds." % K)
            print_function(model_scores.mean(axis=0))
    
    print("Done.")
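The averaging step above relies on `np.array(model_scores)` being a matrix with one row per fold and one column per score, so `.mean(axis=0)` collapses the folds and leaves one average per metric. A minimal illustration with made-up numbers:

```python
import numpy as np

# made-up per-fold scores: 3 folds (rows) x 2 metrics (columns)
fold_scores = np.array([[0.40, 0.45],
                        [0.38, 0.50],
                        [0.42, 0.47]])

# column-wise mean: one averaged value per metric across all folds
print(fold_scores.mean(axis=0))
```

Using `axis=1` instead would average the metrics within each fold, which is not what the CV summary needs.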