[Coursera] How to Win a Data Science Competition - Week 4, Lecture 3


1. Ensemble Method


Combine multiple machine learning models to obtain stronger predictions.
  • Methods range from simple averaging to various weighted-averaging schemes.
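As a minimal sketch of the idea (the two prediction arrays and the weights here are illustrative, not from the course), simple and weighted averaging look like this:

import numpy as np

# hypothetical predictions from two already-trained models
preds_a = np.array([0.2, 0.8, 0.5])
preds_b = np.array([0.4, 0.6, 0.7])

simple_avg = (preds_a + preds_b) / 2          # simple averaging
weighted_avg = 0.7 * preds_a + 0.3 * preds_b  # weighted averaging (weights sum to 1)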

    2. Bagging


Means averaging slightly different versions of the same model to improve accuracy.
(1) Why bagging?
: Prediction errors come from bias (underfitting) and variance (overfitting); averaging many slightly different versions of a model reduces the variance component.
    (2) Parameters that control bagging
    : Changing the seed, Row sampling or Bootstrapping, Shuffling, Column sampling, Model-specific parameters, Number of models or bags, Parallelism
(3) Example of bagging
# train is the training data
# test is the test data
# y is the target variable
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
bags = 10
seed = 1

bagged_prediction = np.zeros(test.shape[0])

for n in range(bags):
    model.set_params(random_state=seed + n)  # update seed so each bag differs
    model.fit(train, y)                      # fit on features and target
    preds = model.predict(test)
    bagged_prediction += preds
# take average of predictions
bagged_prediction /= bags

    3. Boosting


A form of weighted averaging of models where each model is built sequentially, taking the performance of the previous models into account.
(1) Weight-based boosting
A weight is created for each sample according to a certain rule and added as one of the features (or used as a sample weight) for the next model; see the sketch after this list.
  • Learning rate
  • Number of estimators
  • Input model - can be anything that accepts weights
  • Sub boosting type : AdaBoost, LogitBoost
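As a minimal sketch of weight-based boosting (using scikit-learn's AdaBoostRegressor as a stand-in, not the course's own code; train, test, and y are the same variables as in the bagging example, and the hyperparameter values are illustrative):

from sklearn.ensemble import AdaBoostRegressor

# the two key knobs listed above: number of estimators and learning rate
ada = AdaBoostRegressor(n_estimators=100, learning_rate=0.1, random_state=1)
ada.fit(train, y)
ada_preds = ada.predict(test)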
(2) Residual-based boosting
The error is computed according to a certain rule, and the y label for the next model is re-set based on the old predictions (i.e., each new model fits the residuals); see the sketch after this list.
  • Learning rate
  • Number of estimators
  • Row sampling
  • Column sampling
  • Input model - preferably tree-based models
  • Sub boosting type : Fully gradient based, Dart
This is the approach used by mainstream implementations such as XGBoost, LightGBM, H2O's GBM, and CatBoost!
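To make the residual idea concrete, here is a minimal from-scratch sketch (not the course's code): each new tree is fit to the residuals of the current ensemble, reusing the train/test/y variables from above:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

n_estimators = 100
learning_rate = 0.1

# start from the mean and repeatedly fit a tree to the current residuals
prediction = np.full(len(y), y.mean())
test_prediction = np.full(test.shape[0], y.mean())

for _ in range(n_estimators):
    residuals = y - prediction            # the re-set y label for the next model
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(train, residuals)
    prediction += learning_rate * tree.predict(train)
    test_prediction += learning_rate * tree.predict(test)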

    4. Stacking


Means making predictions with a number of models on a hold-out set and then training a different model (the meta model) on these predictions.
One of the most popular forms of ensembling in predictive modelling competitions, and a method commonly used in the final stage.
(1) Stacking example
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

training, valid, ytraining, yvalid = train_test_split(train, y, test_size=0.5)

model1 = RandomForestRegressor()
model2 = LinearRegression()

model1.fit(training, ytraining)
model2.fit(training, ytraining)

preds1 = model1.predict(valid)
preds2 = model2.predict(valid)

test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)

# stack the hold-out predictions column-wise to form the meta features
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))

# specify meta model
meta_model = LinearRegression()
# fit meta model on stacked predictions
meta_model.fit(stacked_predictions, yvalid)
# make predictions on the stacked predictions of the test data
final_predictions = meta_model.predict(stacked_test_predictions)
(2) Things to consider
  • With time-sensitive data, respect time (see the split sketch below)
  • Diversity is as important as performance
  • Diversity may come from different algorithms or different input features
  • Performance plateaus after some number N of models
  • The meta model should normally be modest (simple)
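For the time-sensitive case, a minimal sketch of a time-respecting split (the 'date' column name is hypothetical; this replaces the random train_test_split used in the stacking example and assumes train is a DataFrame and y a Series sharing its index):

# sort by time so the meta model is validated on data strictly later
# than the data its base models were trained on
train_sorted = train.sort_values('date')
split = int(len(train_sorted) * 0.5)
training, valid = train_sorted.iloc[:split], train_sorted.iloc[split:]
ytraining, yvalid = y.loc[training.index], y.loc[valid.index]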
5. StackNet


A scalable meta-modelling methodology that utilizes stacking to combine multiple models in a neural-network-style architecture with multiple levels.
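StackNet itself is distributed as a Java library; purely as an illustration of the multi-level idea (not StackNet's actual API), scikit-learn's StackingRegressor can be nested so that the meta level is itself a stack:

from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# level 2: a small stack that serves as the meta model
level2 = StackingRegressor(
    estimators=[('ridge', Ridge()), ('rf', RandomForestRegressor())],
    final_estimator=LinearRegression(),
)

# level 1: base models whose out-of-fold predictions feed level 2
two_level_stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor()), ('lin', LinearRegression())],
    final_estimator=level2,
)
two_level_stack.fit(train, y)
final = two_level_stack.predict(test)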

    6. Tips and Tricks


    (1) 1st level tips
  • 2-3 gradient boosted trees (lightgbm, xgboost, catboost)
  • 2-3 neural nets (keras, pytorch)
  • 1-2 ExtraTrees/Random Forest
  • 1-2 linear models, SVM
  • 1-2 KNN
  • 1 factorization machine (libfm)
  • 1 SVM with nonlinear kernel if size/memory allows
(2) Subsequent level tips
1) Simpler algorithms
  • gradient boosted trees with small depth like 2-3
  • linear models with high regularization
  • extra trees
  • shallow networks
  • KNN with braycurtis distance
  • Brute forcing a search for the best linear weights based on CV (see the sketch at the end of this section)
2) Feature engineering
  • pairwise differences between meta features
  • row-wise statistics like avg or std
  • standard feature selection techniques
3) Be mindful of target leakage!
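As a minimal sketch of the CV-based weight search mentioned in 1) above (reusing preds1, preds2, and yvalid from the stacking example, and assuming a squared-error metric):

import numpy as np
from sklearn.metrics import mean_squared_error

best_score, best_w = float('inf'), None
for w in np.linspace(0, 1, 101):            # candidate weight for model 1
    blend = w * preds1 + (1 - w) * preds2   # model 2 gets the remaining weight
    score = mean_squared_error(yvalid, blend)
    if score < best_score:
        best_score, best_w = score, w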