[Coursera] How to Win a Data Science Competition - Week 4, Lecture 3


1. Ensemble Method


Combine multiple machine learning models to obtain stronger predictions.
  • Methods range from simple averaging to various weighted-averaging schemes.
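As a minimal sketch of the idea (the two prediction arrays and the weights here are illustrative, not from the course), simple and weighted averaging look like this:

import numpy as np

# hypothetical predictions from two already-trained models
preds_a = np.array([0.2, 0.8, 0.5])
preds_b = np.array([0.4, 0.6, 0.7])

simple_avg = (preds_a + preds_b) / 2          # simple averaging
weighted_avg = 0.7 * preds_a + 0.3 * preds_b  # weighted averaging (weights sum to 1)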

    2. Bagging


Means averaging slightly different versions of the same model to improve accuracy.
(1) Why bagging?
: Prediction errors come from bias (underfitting) and variance (overfitting); averaging many slightly different versions of a model reduces the variance component.
    (2) Parameters that control bagging
    : Changing the seed, Row sampling or Bootstrapping, Shuffling, Column sampling, Model-specific parameters, Number of models or bags, Parallelism
(3) Example of bagging
# train is the training data
# test is the test data
# y is the target variable
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
bags = 10
seed = 1

bagged_prediction = np.zeros(test.shape[0])

for n in range(bags):
    model.set_params(random_state=seed + n)  # update seed so each bag differs
    model.fit(train, y)                      # fit on features and target
    preds = model.predict(test)
    bagged_prediction += preds
# take average of predictions
bagged_prediction /= bags

    3. Boosting


A form of weighted averaging of models where each model is built sequentially, taking the performance of the previous models into account.
(1) Weight-based boosting
A weight is created for each sample according to a certain rule and added as one of the features (or used as a sample weight) for the next model; see the sketch after this list.
  • Learning rate
  • Number of estimators
  • Input model - can be anything that accepts weights
  • Sub boosting type : AdaBoost, LogitBoost
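As a minimal sketch of weight-based boosting (using scikit-learn's AdaBoostRegressor as a stand-in, not the course's own code; train, test, and y are the same variables as in the bagging example, and the hyperparameter values are illustrative):

from sklearn.ensemble import AdaBoostRegressor

# the two key knobs listed above: number of estimators and learning rate
ada = AdaBoostRegressor(n_estimators=100, learning_rate=0.1, random_state=1)
ada.fit(train, y)
ada_preds = ada.predict(test)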
(2) Residual-based boosting
The error is computed according to a certain rule, and the y label for the next model is re-set based on the old predictions (i.e., each new model fits the residuals); see the sketch after this list.
  • Learning rate
  • Number of estimators
  • Row sampling
  • Column sampling
  • Input model - preferably tree-based models
  • Sub boosting type : Fully gradient based, Dart
This is the approach used by mainstream implementations such as XGBoost, LightGBM, H2O's GBM, and CatBoost!
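To make the residual idea concrete, here is a minimal from-scratch sketch (not the course's code): each new tree is fit to the residuals of the current ensemble, reusing the train/test/y variables from above:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

n_estimators = 100
learning_rate = 0.1

# start from the mean and repeatedly fit a tree to the current residuals
prediction = np.full(len(y), y.mean())
test_prediction = np.full(test.shape[0], y.mean())

for _ in range(n_estimators):
    residuals = y - prediction            # the re-set y label for the next model
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(train, residuals)
    prediction += learning_rate * tree.predict(train)
    test_prediction += learning_rate * tree.predict(test)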

    4. Stacking


Means making predictions with a number of models on a hold-out set and then training a different model (the meta model) on these predictions.
One of the most popular forms of ensembling in predictive modelling competitions, and a method commonly used in the final stage.
(1) Stacking example
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

training, valid, ytraining, yvalid = train_test_split(train, y, test_size=0.5)

model1 = RandomForestRegressor()
model2 = LinearRegression()

model1.fit(training, ytraining)
model2.fit(training, ytraining)

preds1 = model1.predict(valid)
preds2 = model2.predict(valid)

test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)

# stack the hold-out predictions column-wise to form the meta features
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))

# specify meta model
meta_model = LinearRegression()
# fit meta model on stacked predictions
meta_model.fit(stacked_predictions, yvalid)
# make predictions on the stacked predictions of the test data
final_predictions = meta_model.predict(stacked_test_predictions)
(2) Things to consider
  • With time-sensitive data, respect time (see the split sketch below)
  • Diversity is as important as performance
  • Diversity may come from different algorithms or different input features
  • Performance plateaus after some number N of models
  • The meta model should normally be modest (simple)
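For the time-sensitive case, a minimal sketch of a time-respecting split (the 'date' column name is hypothetical; this replaces the random train_test_split used in the stacking example and assumes train is a DataFrame and y a Series sharing its index):

# sort by time so the meta model is validated on data strictly later
# than the data its base models were trained on
train_sorted = train.sort_values('date')
split = int(len(train_sorted) * 0.5)
training, valid = train_sorted.iloc[:split], train_sorted.iloc[split:]
ytraining, yvalid = y.loc[training.index], y.loc[valid.index]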
5. StackNet


A scalable meta-modelling methodology that utilizes stacking to combine multiple models in a neural-network-style architecture with multiple levels.
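StackNet itself is distributed as a Java library; purely as an illustration of the multi-level idea (not StackNet's actual API), scikit-learn's StackingRegressor can be nested so that the meta level is itself a stack:

from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# level 2: a small stack that serves as the meta model
level2 = StackingRegressor(
    estimators=[('ridge', Ridge()), ('rf', RandomForestRegressor())],
    final_estimator=LinearRegression(),
)

# level 1: base models whose out-of-fold predictions feed level 2
two_level_stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor()), ('lin', LinearRegression())],
    final_estimator=level2,
)
two_level_stack.fit(train, y)
final = two_level_stack.predict(test)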

    6. Tips and Tricks


    (1) 1st level tips
  • 2-3 gradient boosted trees (lightgbm, xgboost, catboost)
  • 2-3 neural nets (keras, pytorch)
  • 1-2 ExtraTrees/Random Forest
  • 1-2 linear models, SVM
  • 1-2 KNN
  • 1 factorization machine (libfm)
  • 1 SVM with nonlinear kernel if size/memory allows
(2) Subsequent level tips
1) Simpler algorithms
  • gradient boosted trees with small depth like 2-3
  • linear models with high regularization
  • extra trees
  • shallow networks
  • KNN with braycurtis distance
  • Brute forcing a search for the best linear weights based on CV (see the sketch at the end of this section)
2) Feature engineering
  • pairwise differences between meta features
  • row-wise statistics like avg or std
  • standard feature selection techniques
3) Be mindful of target leakage!
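As a minimal sketch of the CV-based weight search mentioned in 1) above (reusing preds1, preds2, and yvalid from the stacking example, and assuming a squared-error metric):

import numpy as np
from sklearn.metrics import mean_squared_error

best_score, best_w = float('inf'), None
for w in np.linspace(0, 1, 101):            # candidate weight for model 1
    blend = w * preds1 + (1 - w) * preds2   # model 2 gets the remaining weight
    score = mean_squared_error(yvalid, blend)
    if score < best_score:
        best_score, best_w = score, w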