[Review] MATLAB ~Selection of Predictors in Random Forest~
Introduction
Recently I have been preparing for my master's course in the UK, which starts on 24th September. One of the programming languages used there is MATLAB, so I have written this article to review my knowledge of MATLAB.
Select Predictors for Random Forests
This example shows how to choose the appropriate split predictor selection technique for your data set when growing a random forest of regression trees. This example also shows how to decide which predictors are most important to include in the training data.
Load and Pre-process Data
Load the carbig data set. Consider a model that predicts the fuel economy of a car given its number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and country of origin. Treat Cylinders, Model_Year, and Origin as categorical variables.
load carbig
Cylinders = categorical(Cylinders);
Model_Year = categorical(Model_Year);
Origin = categorical(cellstr(Origin));
X = table(Cylinders, Displacement, Horsepower, Weight, Acceleration, Model_Year, Origin, MPG);
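Before growing the forest, it can help to confirm the variable types and missing-value counts. The short sketch below is not part of the original example; it simply inspects the table, since the data contain missing entries and that is why surrogate splits are used later.
% Sketch: inspect variable types and count missing values per variable.
summary(X)
sum(ismissing(X))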
Determine Levels in Predictors
The standard CART (Classification and Regression Trees) algorithm tends to select split predictors containing many unique values (levels), e.g., continuous variables, over those containing few levels, e.g., categorical variables. If your data is heterogeneous, or your predictor variables vary greatly in their number of levels, then consider using the curvature or interaction tests for split-predictor selection instead of standard CART.
For each predictor, determine the number of levels in the data. One way to do this is to define an anonymous function that:
- Converts all variables to the categorical data type using categorical
- Determines all unique categories, ignoring missing values, using categories
- Counts the categories using numel
Then, apply the function to each variable using varfun.
countLevels = @(x)numel(categories(categorical(x)));
numLevels = varfun(countLevels, X(:, 1:end-1), 'OutputFormat', 'uniform');
Compare the number of levels among the predictor variables.
figure
bar(numLevels)
title('Number of Levels Among Predictors')
xlabel('Predictor variable')
ylabel('Number of levels')
h = gca;
h.XTickLabel = X.Properties.VariableNames(1:end-1);
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
The continuous variables have many more levels than the categorical variables. Because the number of levels among the predictors varies so much, using standard CART to select split predictors at each node of the trees in a random forest can yield inaccurate predictor importance estimates.
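For comparison, you could grow a baseline forest that uses the default standard CART split selection and inspect its split-gain importance estimates, which tend to be inflated for the continuous predictors. The following is only a sketch, not part of the original example; the variable names tCART, MdlCART, and impCART are illustrative.
% Sketch: baseline forest using the default standard CART split selection.
tCART = templateTree('Surrogate','on');   % 'PredictorSelection' defaults to 'allsplits' (standard CART)
MdlCART = fitrensemble(X,'MPG','Method','bag','NumLearningCycles',200,'Learners',tCART);
impCART = predictorImportance(MdlCART);   % split-gain estimates, biased toward many-level predictors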
Grow Robust Random Forest
Grow a random forest of 200 regression trees. Specify sampling all variables at each node. Specify usage of the interaction test to select split predictors. Because there are missing values in the data, specify usage of surrogate splits to increase accuracy.
t = templateTree('NumVariablesToSample','all', ...
    'PredictorSelection','interaction-curvature', 'Surrogate','on');
rng(1) % For reproducibility
Mdl = fitrensemble(X, 'MPG', 'Method', 'bag', 'NumLearningCycles', 200, 'Learners', t)
Mdl is a RegressionBaggedEnsemble model.
Estimate the model $R^2$ using out-of-bag predictions.
yHat = oobPredict(Mdl);
R2 = corr(Mdl.Y, yHat)^2
R2 = 0.8739
Mdl explains 87.39% of the variability around the mean.
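As a cross-check, a similar value can be obtained from the out-of-bag mean squared error returned by oobLoss, using $R^2 \approx 1 - \mathrm{MSE}_{oob}/\mathrm{Var}(y)$. This is a sketch, not part of the original example.
% Sketch: approximate R^2 from the out-of-bag mean squared error.
mseOOB = oobLoss(Mdl);               % out-of-bag MSE of the ensemble
R2fromLoss = 1 - mseOOB/var(Mdl.Y)   % should be close to the correlation-based R^2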
Predictor Importance Estimation
Estimate predictor importance values by permuting out-of-bag observations among the trees.
impOOB = oobPermutedPredictorImportance(Mdl)
impOOB is a 1-by-7 vector of predictor importance estimates corresponding to the predictors in Mdl.PredictorNames. The estimates are not biased toward predictors containing many levels.
figure
bar(impOOB)
title('Unbiased Predictor Importance Estimates')
xlabel('Predictor variable')
ylabel('Importance')
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
Greater importance estimates indicate more important predictors. The bar graph suggests that Model_Year is the most important predictor, followed by Weight. Model_Year has only 13 distinct levels, whereas Weight has over 300.
Compare the predictor importance estimates obtained by permuting out-of-bag observations with those obtained by summing gains in the mean squared error due to splits on each predictor. Also, obtain predictor association measures estimated by surrogate splits.
[impGain,predAssociation] = predictorImportance(Mdl);
figure;
plot(1:numel(Mdl.PredictorNames),[impOOB' impGain']);
title('Predictor Importance Estimation Comparison')
xlabel('Predictor variable');
ylabel('Importance');
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
legend('OOB permuted','MSE improvement')
grid on
The values of impGain are commensurate with those of impOOB. However, according to impGain, Model_Year and Weight do not appear to be the most important predictors.
predAssociation is a 7-by-7 matrix of predictor association measures. Rows and columns correspond to the predictors in Mdl.PredictorNames. You can infer the strength of the relationship between pairs of predictors using the elements of predAssociation. Larger values indicate more highly correlated pairs of predictors.
figure;
imagesc(predAssociation);
title('Predictor Association Estimates');
colorbar;
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
h.YTickLabel = Mdl.PredictorNames;
predAssociation(1,2)
ans = 0.683
The largest association is between Cylinders and Displacement, but the value is not high enough to indicate a strong relationship between the two predictors.
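If you prefer to locate the strongest pair programmatically instead of reading the heat map, one option is to zero the diagonal and take the maximum. This is only a sketch; the variable names A, maxAssoc, row, and col are illustrative.
% Sketch: find the largest off-diagonal entry of the association matrix.
A = predAssociation;
A(1:size(A,1)+1:end) = 0;            % ignore self-association on the diagonal
[maxAssoc, idx] = max(A(:));
[row, col] = ind2sub(size(A), idx);
fprintf('Strongest association: %s and %s (%.3f)\n', ...
    Mdl.PredictorNames{row}, Mdl.PredictorNames{col}, maxAssoc)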
Grow Random Forest Using Reduced Predictor Set
Because prediction time increases with the number of predictors in random forests, it is good practice to create a model using as few predictors as possible.
Grow a random forest of 200 regression trees using the best two predictors only.
MdlReduced = fitrensemble(X(:,{'Model_Year' 'Weight' 'MPG'}),'MPG','Method','bag',...
    'NumLearningCycles',200,'Learners',t);
Compute the $R^2$ of the reduced model.
yHatReduced = oobPredict(MdlReduced);
r2Reduced = corr(Mdl.Y,yHatReduced)^2
r2Reduced = 0.8524
The $R^2$ for the reduced model is close to the $R^2$ of the full model.
This result suggests that the reduced model is sufficient for prediction.
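Since the motivation for the reduced model is prediction speed, you can roughly confirm the benefit by timing predict on both models with timeit. This is a sketch, not part of the original example; passing the full table X works for both models because predict selects the predictor variables it needs by name.
% Sketch: compare prediction time of the full and reduced models.
tFull    = timeit(@() predict(Mdl, X));
tReduced = timeit(@() predict(MdlReduced, X));
fprintf('Full model: %.4f s, reduced model: %.4f s\n', tFull, tReduced)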
Author and Source
For more information about this article ([Review] MATLAB ~Selection of Predictors in Random Forest~), see the original post at https://qiita.com/Rowing0914/items/a0e0e0d13faa337bb6dc. Author attribution: the original author's information is included in the original URL, and copyright belongs to the original author.