Data mining: notes on a failed regression analysis


Project description: the data comes from the Alibaba Tianchi data-mining competition on predicting used-car transaction prices.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import Lasso,LassoCV


filename = r'C:\Users\liuhao\Desktop\     \   \used_car_train_20200313.csv'
train = pd.read_csv(filename, sep=' ')

# Drop price outliers above the upper Tukey fence: Q3 + 1.5 * IQR
q1, q3 = train['price'].quantile(0.25), train['price'].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
train.drop(train.index[train['price'] > upper_fence], inplace=True)

# Fit on log1p(price); predictions are mapped back with expm1 before scoring
train['price'] = np.log1p(train['price'])

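# Mean (log) price per brand, used as a numeric feature.
# Look up by brand label with map(); iloc is positional and breaks as soon as
# the brand codes are not exactly 0..n-1.
# Note: this is computed on all rows before the split, so validation targets
# leak into the feature.
bra_p = train['price'].groupby(train['brand']).mean()
train['b_p'] = train['brand'].map(bra_p)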

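# Vehicle age in months (approximated as days / 30); unparseable dates become NaT
train['used_months'] = ((pd.to_datetime(train['creatDate'], format='%Y%m%d', errors='coerce') -
                         pd.to_datetime(train['regDate'], format='%Y%m%d', errors='coerce')).dt.days)/30
# Assign back instead of calling fillna(inplace=True) on a column selection
train['used_months'] = train['used_months'].fillna(train['used_months'].mean())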

def fill_missing(df):
    # Fill missing categorical codes with the most frequent value in train
    for col in ['fuelType', 'gearbox', 'bodyType', 'model', 'brand']:
        df[col] = df[col].fillna(train[col].value_counts().index[0])
    return df

ndata = fill_missing(train)

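# '-' marks an unknown value; replace it with the most frequent category.
# Assign back instead of calling replace(inplace=True) on a column selection.
ndata['notRepairedDamage'] = ndata['notRepairedDamage'].replace(
    '-', ndata['notRepairedDamage'].value_counts().index[0])
# Cap implausibly large power values at 600
ndata['power'] = ndata['power'].clip(upper=600)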

# Drop identifiers and raw columns that are not used as features
all_features = ndata.drop(['SaleID', 'name', 'regDate', 'model', 'seller',
                           'offerType', 'creatDate', 'regionCode'], axis=1)


def data_astype(df):
    # Cast categorical codes to str so that pd.get_dummies one-hot encodes them
    # df['SaleID'] = df['SaleID'].astype(int).astype(str)
    # df['name'] = df['name'].astype(int).astype(str)
    # df['model'] = df['model'].astype(str)
    df['brand'] = df['brand'].astype(str)
    df['bodyType'] = df['bodyType'].astype(str)
    df['fuelType'] = df['fuelType'].astype(str)
    df['gearbox'] = df['gearbox'].astype(str)
    df['notRepairedDamage'] = df['notRepairedDamage'].astype(str)
    # df['regionCode'] = df['regionCode'].astype(int).astype(str)
    # df['seller'] = df['seller'].astype(int).astype(str)
    # df['offerType'] = df['offerType'].astype(int).astype(str)

    return df

all_features = data_astype(all_features)
all_features = pd.get_dummies(all_features).reset_index(drop=True)

X = all_features[all_features['price'].notnull()].drop(['price'], axis=1)
y = all_features[all_features['price'].notnull()]['price']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=15, shuffle=True)

model = LassoCV(cv=5)

# model = RidgeCV(cv=kf)

model.fit(X_train, y_train)

mae_train = mean_absolute_error(np.expm1(y_train), np.expm1(model.predict(X_train)))
mae_valid = mean_absolute_error(np.expm1(y_valid), np.expm1(model.predict(X_valid)))

print('Train MAE: {}'.format(mae_train))
print('Valid MAE: {}'.format(mae_valid))


# Print the features that Lasso kept (non-zero coefficients)
c = dict(zip(X.columns.values, model.coef_))
for k, v in c.items():
    if v != 0:
        print(k, v)

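The initial MAE was a little over 1000, and the best result I eventually got was an MAE of a little over 700 with Lasso regression. I did not upload the predictions, because the top of the leaderboard was at an MAE of a little over 300 and just making the ranking took a little over 400. I also tried several different feature combinations, but several features I expected to matter barely showed up in the result, so the model was not very interpretable. I wanted to simply call a neural network instead, but my computer's performance made me back off.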
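One possible reason for expected features not showing up, offered as an assumption rather than a diagnosis of the run above: the L1 penalty in Lasso acts on the raw coefficient values, so features on very different numeric scales are penalized unevenly, and the script fits LassoCV without scaling. The sketch below uses synthetic stand-in data (not the Tianchi set; the names `small`, `large` and `noise` are hypothetical) to compare an unscaled LassoCV with one wrapped in a StandardScaler pipeline.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): two equally informative features on
# very different numeric scales, plus one pure-noise feature.
rng = np.random.default_rng(15)
n = 2000
small = rng.normal(0.0, 0.01, n)    # tiny numeric range
large = rng.normal(0.0, 100.0, n)   # large numeric range
noise = rng.normal(0.0, 1.0, n)     # unrelated to the target
X_demo = np.column_stack([small, large, noise])
y_demo = 50.0 * small + 0.005 * large + rng.normal(0.0, 0.1, n)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=15)

# Same estimator as in the script, fitted directly on unscaled features
raw_model = LassoCV(cv=5).fit(X_tr, y_tr)

# Same estimator, but every feature is standardized first
scaled_model = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X_tr, y_tr)

mae_raw = mean_absolute_error(y_va, raw_model.predict(X_va))
mae_scaled = mean_absolute_error(y_va, scaled_model.predict(X_va))

names = ['small', 'large', 'noise']
scaled_coef = scaled_model.named_steps['lassocv'].coef_
print('raw coefficients:   ', dict(zip(names, raw_model.coef_)))
print('scaled coefficients:', dict(zip(names, scaled_coef)))
print('valid MAE raw / scaled:', mae_raw, mae_scaled)
```

In this synthetic setup the unscaled fit should drop `small` even though it carries half of the signal, while the scaled pipeline keeps both informative features. Whether scaling would have helped on the competition data is untested here.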