Kaggle Learn: Intro to Machine Learning Notes

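Notes on the Intro to Machine Learning course on the Kaggle Learn platform. It runs through three ML lessons in sequence, each with exercises, and suits beginners studying on their own: https://www.kaggle.com/learn/intro-to-machine-learning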
import pandas as pd
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns
#dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)

We’ll use dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices in the Melbourne data is:
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

The value we want to predict is Price. The columns listed in melbourne_features are the inputs to the model. By convention, this data is called X.
X = melbourne_data[melbourne_features]
X.describe()
X.head()  # first 5 rows by default; X.head(n) shows the first n rows

The steps to building and using a model are:
  • Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
  • Fit: Capture patterns from provided data. This is the heart of modeling.
  • Predict: Just what it sounds like.
  • Evaluate: Determine how accurate the model’s predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable:
    from sklearn.tree import DecisionTreeRegressor
    #Define model. Specify a number for random_state to ensure same results each run
    melbourne_model = DecisionTreeRegressor(random_state=1)
    #Fit model
    melbourne_model.fit(X, y)
    
    print("Making predictions for the following 5 houses:")
    print(X.head())
    print("The predictions are")
    print(melbourne_model.predict(X.head()))
    

    Making predictions for the following 5 houses:
       Rooms  Bathroom  Landsize  Lattitude  Longtitude
    1      2       1.0     156.0   -37.8079    144.9934
    2      3       2.0     134.0   -37.8093    144.9944
    4      4       1.0     120.0   -37.8072    144.9941
    6      3       2.0     245.0   -37.8024    144.9993
    7      2       1.0     256.0   -37.8060    144.9954
    The predictions are
    [1035000. 1465000. 1600000. 1876000. 1636000.]

    MAE (Mean Absolute Error): a metric for checking how accurate the model is.
    from sklearn.metrics import mean_absolute_error
    predicted_home_prices = melbourne_model.predict(X)
    mean_absolute_error(y, predicted_home_prices)
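
To make the metric concrete, here is a minimal sketch that computes MAE by hand: take the absolute difference between each actual and predicted price, then average. The three prices and the helper name `mean_absolute_error_manual` are hypothetical, invented only for illustration; they are not part of the course or of scikit-learn.

```python
# Hypothetical prices, invented purely for illustration.
actual_prices = [1_000_000, 1_500_000, 800_000]
predicted_prices = [1_100_000, 1_400_000, 950_000]


def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all rows (hypothetical helper)."""
    errors = [abs(a - p) for a, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Absolute errors are 100000, 100000 and 150000, so the MAE is 350000 / 3.
mae = mean_absolute_error_manual(actual_prices, predicted_prices)
print(mae)
```

An MAE in this range would mean the predictions are off by roughly that many dollars on average, which is what the value returned by `mean_absolute_error` expresses for the full dataset.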
    

    scikit-learn's train_test_split() splits the data into a training set and a validation set (reference: https://www.cnblogs.com/bonelee/p/8036024.html)
    from sklearn.model_selection import train_test_split
    
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
    #Define model
    melbourne_model = DecisionTreeRegressor()
    #Fit model
    melbourne_model.fit(train_X, train_y)
    #get predicted prices on validation data
    val_predictions = melbourne_model.predict(val_X)
    print(mean_absolute_error(val_y, val_predictions))
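
To see what the split itself does, the following sketch runs `train_test_split` on twenty made-up row ids instead of the housing data. The names `rows` and `targets` are assumptions for illustration. With no `test_size` given, scikit-learn holds out 25% of the rows by default.

```python
from sklearn.model_selection import train_test_split

# Twenty hypothetical row ids standing in for the rows of X,
# with a made-up target per row standing in for y.
rows = list(range(20))
targets = [r * 10 for r in rows]

train_rows, val_rows, train_targets, val_targets = train_test_split(
    rows, targets, random_state=0
)

# Default split: 75% of the rows for training, 25% for validation.
print(len(train_rows), len(val_rows))

# No row lands in both sets, and each row keeps its own target.
overlap = set(train_rows) & set(val_rows)
print(overlap)
```

Because the validation rows never reach `fit`, the MAE measured on them reflects how the model handles houses it has not seen.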
    

    Fit on the training data and predict on the validation data.

    When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes’ actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).
    This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn’t divide up the houses into very distinct groups.
    At an extreme, if a tree divides houses into only 2 or 4, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.
    Here’s the takeaway: Models can suffer from either:
  • Overfitting: capturing spurious patterns that won’t recur in the future, leading to less accurate predictions, or
  • Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn’t used in model training, to measure a candidate model’s accuracy. This lets us try many candidate models and keep the best one.
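
One way to try several candidate models is to vary the tree size with the `max_leaf_nodes` parameter of `DecisionTreeRegressor` and compare training MAE against validation MAE. The sketch below makes several assumptions: it uses synthetic data rather than melb_data.csv so that it runs on its own, the helper `get_train_val_mae` is a hypothetical name, and the candidate sizes are arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption, not melb_data.csv):
# one noisy feature-target relationship with 500 rows.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) * 100 + rng.normal(0, 20, size=500)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)


def get_train_val_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with the given leaf limit; return (training MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    return train_mae, val_mae


# Candidate tree sizes, from very shallow to effectively unrestricted.
results = {}
for max_leaf_nodes in [2, 5, 50, 500]:
    results[max_leaf_nodes] = get_train_val_mae(
        max_leaf_nodes, train_X, val_X, train_y, val_y
    )
    train_mae, val_mae = results[max_leaf_nodes]
    print(f"leaves={max_leaf_nodes:<4} train MAE={train_mae:8.2f} val MAE={val_mae:8.2f}")

# Keep the candidate with the lowest validation MAE.
best_size = min(results, key=lambda n: results[n][1])
print("best max_leaf_nodes:", best_size)
```

The largest tree memorizes the training rows, so its training MAE drops to essentially zero while its validation MAE does not. That gap is the signature of overfitting. A tree with only two leaves has a high error on both sets, which is underfitting.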