Porto Seguro Exploratory Analysis and Prediction: Prepare the Model


New insights


A stacked model improves performance by building a single model that collects the strengths of several base models. Note, however, that the computational cost grows accordingly.
Code

    # Initialize the ensembling object: a very interesting point to me
    stack = Ensemble(n_splits=3,
                     stacker=log_model,
                     base_models=(lgb_model1, lgb_model2, lgb_model3, xgb_model))
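
For reference, scikit-learn ships a built-in implementation of the same idea: StackingClassifier trains the base estimators, builds meta-features from their cross-validated predictions, and fits a final estimator on top. A minimal sketch with made-up base models (this is not the Ensemble class used in this notebook, just an illustration of the pattern):

    from lightgbm import LGBMClassifier
    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression

    # Two illustrative LightGBM base models stacked with a logistic-regression
    # meta-model; cv=3 builds the meta-features from 3-fold out-of-fold predictions
    sk_stack = StackingClassifier(
        estimators=[('lgb_a', LGBMClassifier(n_estimators=200)),
                    ('lgb_b', LGBMClassifier(num_leaves=16))],
        final_estimator=LogisticRegression(),
        cv=3)
    # sk_stack.fit(X_train, y_train); sk_stack.predict_proba(X_test)[:, 1]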

Prepare the model


Ensemble class for validation and ensembling


Split the data in KFolds

  • train the models

  • ensemble the results

  • init method parameters:

  • self: the object to be initialized

  • n_splits: the number of cross-validation splits to be used

  • stacker: the model used for stacking the prediction results from the trained base models

  • base_models: the list of base models used in training

  • fit_predict performs four steps:

  • split the training data into n_splits folds;

  • run the base models

  • perform prediction using each model;

  • ensemble the results using the stacker.
  • Code

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    class Ensemble(object):
        def __init__(self, n_splits, stacker, base_models):
            self.n_splits = n_splits
            self.stacker = stacker
            self.base_models = base_models
    
        def fit_predict(self, X, y, T):
            X = np.array(X)
            y = np.array(y)
            T = np.array(T)
    
            folds = list(StratifiedKFold(n_splits=self.n_splits, shuffle=True, random_state=314).split(X,y))
    
            S_train = np.zeros((X.shape[0], len(self.base_models)))
            S_test = np.zeros((T.shape[0], len(self.base_models)))
            for i, clf in enumerate(self.base_models):
    
                S_test_i = np.zeros((T.shape[0], self.n_splits))
    
                for j, (train_idx, test_idx) in enumerate(folds):
                    X_train = X[train_idx]
                    y_train = y[train_idx]
                    X_holdout = X[test_idx]
    
    
                    print("Base model %d: fit %s model | fold %d" % (i+1, str(clf).split('(')[0], j+1))
                    clf.fit(X_train, y_train)
                    cross_score = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
                    print("cross_score[roc-auc]: %.5f [gini]: %.5f" % (cross_score.mean(), 2*cross_score.mean()-1))
                    y_pred = clf.predict_proba(X_holdout)[:,1]
    
                    S_train[test_idx, i] = y_pred
                    S_test_i[:, j] = clf.predict_proba(T)[:,1]
                S_test[:, i] = S_test_i.mean(axis=1)
    
            results = cross_val_score(self.stacker, S_train, y, cv=3, scoring='roc_auc')
            #Calculate gini factor as 2 * AUC -1
            print("Stacker score[gini]: %.5f" % (2 * results.mean() -1))
    
            self.stacker.fit(S_train, y)
            res = self.stacker.predict_proba(S_test)[:,1]
            return res
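
The class above reports the normalized Gini coefficient as 2 * AUC - 1, the metric used in the Porto Seguro competition. A quick standalone check of that relation on toy data:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Toy labels and scores, only to illustrate that Gini = 2 * AUC - 1
    y_true = np.array([0, 0, 1, 0, 1, 1])
    y_score = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.9])
    auc = roc_auc_score(y_true, y_score)
    print("AUC: %.5f  Gini: %.5f" % (auc, 2 * auc - 1))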

Tasks for the stacked model


Parameters for the base models

Three different LightGBM models and one XGBoost model are used as base models. The training data is cross-validated with 3 folds.

    #LightGBM params
    #lgb_1
    
    lgb_params1 = {}
    lgb_params1['learning_rate'] = 0.02
    lgb_params1['n_estimators'] = 650
    lgb_params1['max_bin'] = 10
    lgb_params1['subsample'] = 0.8
    lgb_params1['subsample_freq'] = 10
    lgb_params1['colsample_bytree'] = 0.8
    lgb_params1['min_child_samples'] = 500
    lgb_params1['seed'] = 314
    lgb_params1['num_threads'] = 4
    
    #lgb_2
    lgb_params2 = {}
    
    lgb_params2['n_estimators'] = 1000
    lgb_params2['learning_rate'] = 0.02
    lgb_params2['colsample_bytree'] = 0.3
    lgb_params2['subsample'] = 0.7
    lgb_params2['subsample_freq'] = 2
    lgb_params2['num_leaves'] = 16
    lgb_params2['seed'] = 314
    lgb_params2['num_threads'] = 4
    
    #lgb_3
    lgb_params3 = {}
    lgb_params3['n_estimators'] = 100
    lgb_params3['max_depth'] = 4
    lgb_params3['learning_rate'] = 0.02
    lgb_params3['seed'] = 314
    lgb_params3['num_threads'] = 4
    
    #XGBoost params
    
    xgb_params = {}
    xgb_params['objective'] = 'binary:logistic'
    xgb_params['learning_rate'] = 0.04
    xgb_params['n_estimators'] = 490
    xgb_params['max_depth'] = 4
    xgb_params['subsample'] = 0.9
    xgb_params['colsample_bytree'] = 0.9
    xgb_params['min_child_weight'] = 10
    xgb_params['n_jobs'] = 4
    #Initialize the models with parameters
    
    # 3 base models and the stacking model
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier
    from sklearn.linear_model import LogisticRegression
    # Base models
    lgb_model1 = LGBMClassifier(**lgb_params1)
    
    lgb_model2 = LGBMClassifier(**lgb_params2)
           
    lgb_model3 = LGBMClassifier(**lgb_params3)
    
    xgb_model = XGBClassifier(**xgb_params)
    
    # Stacking model
    log_model = LogisticRegression()
    # Run the predictive models: call the fit_predict method of the stack object,
    # predict the target with each base model, then ensemble the results using
    # the stacker model and output the stacked result
    
    y_prediction = stack.fit_predict(trainset, target_train, testset)
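
A typical next step (assuming the test ids are available in a variable such as id_test, which is not defined in this section) is to write the stacked probabilities to a submission file with the competition's id/target columns:

    import pandas as pd

    # id_test is assumed to hold the test-set ids (hypothetical name, not defined above)
    submission = pd.DataFrame({'id': id_test, 'target': y_prediction})
    submission.to_csv('stacked_submission.csv', index=False)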