A solution to Titanic


sklearn kaggle

This post follows the code ideas in "Titanic Data Science Solutions" and in a write-up of a first Kaggle project on the Titanic.

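Steps

Problem statement
Importing and tidying the data
Feature engineering
    Handling Age
    Handling Fare
    Handling Embarked ('S' is the most common value)
    Encoding Sex as 0/1
    One-hot encoding Embarked (port of embarkation) and dropping the original variable
    One-hot encoding Pclass (passenger class) and dropping the original variable
    Name
        Tidying the information in the name
        One-hot encoding the title and dropping the original variable
    Family size (on board)
    Dropping Cabin (too many missing values) and the uninformative Ticket and PassengerId
    Splitting train and test
Machine learning
Computing the scores
Results
Some findings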

Problem statement

What kind of passenger was more likely to survive on the Titanic?

Importing and tidying the data

The data used here is available on Kaggle and is imported with pandas.

import pandas as pd
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

Missing values strongly affect how the data can be used, so we call .info() to see which columns have them.

train_df.info()


RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

In the train dataset, Age, Cabin, and Embarked have missing values that need to be handled.
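
As a complement to .info(), the number of missing entries per column can be counted directly. The following is a minimal sketch on a small made-up frame; `sample` and its values are hypothetical stand-ins for train.csv, used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of train.csv, only to make the sketch runnable.
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
})

# Number of missing entries per column, largest first.
missing_counts = sample.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

On the real data the same call would be `train_df.isna().sum()`.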

test_df.info()


RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

In the test dataset, Age, Fare, and Cabin have missing values that need to be handled.
We combine the two datasets so that both can be cleaned at the same time.

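# ignore_index=True gives the combined frame a unique 0..N-1 index
# instead of repeating the row labels of the two source files.
combine = pd.concat([train_df, test_df], axis = 0, ignore_index = True)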
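
Because test.csv has no Survived column, stacking the two files leaves Survived as NaN on every test row, and that is what later allows the combined frame to be split again. Here is a small sketch of that effect; `train_small` and `test_small` are hypothetical two-row stand-ins, and `ignore_index=True` is a choice made in this sketch.

```python
import pandas as pd

# Hypothetical two-row stand-ins for train.csv and test.csv.
train_small = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Age': [22.0, 38.0]})
test_small = pd.DataFrame({'PassengerId': [892, 893], 'Age': [34.5, 47.0]})

both = pd.concat([train_small, test_small], axis=0, ignore_index=True)

# Rows that came from the test file have no label, so Survived is NaN there.
is_test_row = both['Survived'].isna()
print(both)
print(is_test_row.tolist())
```

Note that Survived becomes a float column after the concatenation, since an integer column cannot hold NaN.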

Feature engineering

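import numpy as np
# Called without an argument, seed() reseeds from OS entropy, so the imputed
# values (and the scores) differ between runs; pass an integer for repeatable results.
np.random.seed()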

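Handling Age

Missing ages are filled with random draws from a normal distribution whose mean is np.mean() of the known ages and whose standard deviation is np.std() of the known ages.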

# Rows with a missing Age get one random draw each; .copy() avoids
# pandas' SettingWithCopyWarning when the column is assigned.
Age_null = combine[combine['Age'].isna()].copy()
Age_null['Age'] = np.random.normal(np.mean(combine['Age']), np.std(combine['Age']),
                                   Age_null.shape[0])
#Age_null['Age'] = Age_null['Age'].apply(round)
Age_notnull = combine[combine['Age'].notna()]
combine = pd.concat([Age_null, Age_notnull], axis = 0)
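
The same idea can be written without splitting and re-joining the frame. This is a sketch under stated assumptions, not the post's code: the `ages` values are made up, the fixed seed is only there for repeatable output, and clipping at 0 is an addition of this sketch (a normal draw can be negative, which makes no sense for an age).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatable output

# Hypothetical ages with two missing entries.
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0], name='Age')
missing = ages.isna()

# Mean and population standard deviation of the observed ages; NaN is skipped.
mu = ages.mean()
sigma = ages.std(ddof=0)

# One draw per missing entry, clipped at 0 (an addition of this sketch).
draws = rng.normal(mu, sigma, size=missing.sum()).clip(min=0)
filled = ages.copy()
filled[missing] = draws
print(filled)
```

Random imputation keeps the spread of the column, whereas filling with a single mean or median would concentrate all imputed rows at one value.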

Handling Fare

# Same approach as for Age: one normal draw per missing Fare.
Fare_null = combine[combine['Fare'].isna()].copy()
Fare_null['Fare'] = np.random.normal(np.mean(combine['Fare']), np.std(combine['Fare']),
                                     Fare_null.shape[0])
#Fare_null['Fare'] = Fare_null['Fare'].apply(round)
Fare_notnull = combine[combine['Fare'].notna()]
combine = pd.concat([Fare_null, Fare_notnull], axis = 0)

Handling Embarked: 'S' is the most common value

combine['Embarked'] = combine['Embarked'].fillna('S')
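
Rather than hard-coding 'S', the most frequent port can be looked up from the column itself. A minimal sketch on made-up values (the `embarked` series is hypothetical):

```python
import pandas as pd

# Hypothetical Embarked column with two missing entries.
embarked = pd.Series(['S', 'C', None, 'S', 'Q', 'S', None], name='Embarked')

# value_counts() ignores missing values by default;
# idxmax() returns the most frequent label.
most_common = embarked.value_counts().idxmax()
embarked_filled = embarked.fillna(most_common)
print(most_common, embarked_filled.tolist())
```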

Encoding Sex as 0/1

sex = {'male': 1, 'female': 0}
combine['Sex'] = combine['Sex'].map(sex)

One-hot encode the port of embarkation, Embarked, and drop the original variable

data_Embark = pd.get_dummies(combine['Embarked'], prefix = 'Embarked')
combine = pd.concat([data_Embark, combine], axis = 1)
combine = combine.drop('Embarked', axis = 1)
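
To make the effect of this step concrete, here is a small sketch on a made-up frame (`ports` and its values are hypothetical). Each distinct port becomes its own indicator column, and exactly one of them is set per row.

```python
import pandas as pd

# Hypothetical four-row frame with one categorical and one numeric column.
ports = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S'],
                      'Fare': [7.25, 71.28, 8.46, 8.05]})

# One indicator column per distinct value, named Embarked_<value>.
dummies = pd.get_dummies(ports['Embarked'], prefix='Embarked')

# Put the indicators in front and drop the original column.
encoded = pd.concat([dummies, ports.drop('Embarked', axis=1)], axis=1)
print(encoded.columns.tolist())
```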

One-hot encode the passenger class, Pclass, and drop the original variable

data_Pclass = pd.get_dummies(combine['Pclass'], prefix = 'Pclass')
combine = pd.concat([data_Pclass, combine], axis = 1)
combine = combine.drop('Pclass', axis = 1)

Name

Tidy up the information in the name: extract the title and merge rare or equivalent titles

# Raw string so that \. reaches the regex engine as an escaped period.
combine['NameTitle'] = combine.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
# Uncommon titles are grouped into a single 'Rare' category.
combine['NameTitle'] = combine['NameTitle'].replace(
    ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir',
     'Jonkheer', 'Dona'], 'Rare')
# French forms are mapped to their English equivalents.
combine['NameTitle'] = combine['NameTitle'].replace(['Mlle', 'Ms'], 'Miss')
combine['NameTitle'] = combine['NameTitle'].replace('Mme', 'Mrs')
combine = combine.drop('Name', axis = 1)
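
The regular expression looks for a run of letters that follows a space and ends with a period, which is where the title sits in the "Surname, Title. Given names" layout. A short sketch on a few sample names in that layout, applying only part of the replacement rules:

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
    'Aubart, Mme. Leontine Pauline',
    'Uruchurtu, Don. Manuel E',
])

# A run of letters that follows a space and is terminated by a period.
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# A subset of the grouping rules, enough for these five names.
titles = titles.replace(['Don'], 'Rare').replace('Mme', 'Mrs')
print(titles.tolist())
```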

One-hot encode the title and drop the original variable

data_NameTitle = pd.get_dummies(combine['NameTitle'], prefix = 'NameTitle')
combine = pd.concat([data_NameTitle, combine], axis = 1)
combine = combine.drop('NameTitle', axis = 1)

Family size (siblings/spouses plus parents/children on board, plus the passenger)

combine['FamilySize'] = combine['SibSp'] + combine['Parch'] + 1 
combine = combine.drop(['SibSp', 'Parch'], axis = 1)

Cabin has too many missing values and is dropped; Ticket, whose contents are messy, and PassengerId are not needed either.

combine = combine.drop(['Cabin', 'PassengerId', 'Ticket'], axis = 1)

Splitting train and test

train = combine[combine['Survived'].notna()]
test = combine[combine['Survived'].isna()].drop('Survived', axis=1)

X_train = train.drop('Survived', axis = 1)
Y_train = train['Survived']
X_test = test

Machine learning

from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

svc = SVC()
svc.fit(X_train, Y_train)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)

knn = KNeighborsClassifier(n_neighbors = 33)
knn.fit(X_train, Y_train)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)


linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
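
Each score above is computed with X_train and Y_train, the same rows the model was fitted on, so it measures training accuracy rather than performance on unseen passengers. One common way to get a held-out estimate is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for X_train and Y_train (an assumption, so that it runs without the Kaggle files); the fold count of 5 is also just a choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train (an assumption, to keep this runnable).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     flip_y=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the rows the tree was fitted on.
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean accuracy over 5 folds, each scored on rows held out from fitting.
cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

print(round(train_acc * 100, 2), round(cv_acc * 100, 2))
```

An unrestricted decision tree can memorise its training rows, which is why its training accuracy is a poor guide to how it will do on test.csv.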

Computing the scores

models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree'],
    'Score': [acc_svc, acc_knn,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_linear_svc, acc_decision_tree]})

Results

                         Model  Score
0      Support Vector Machines  88.55
1                          KNN  72.62
2                Random Forest  99.10
3                  Naive Bayes  79.91
4                   Perceptron  58.59
5  Stochastic Gradient Descent  73.51
6                   Linear SVC  82.04
7                Decision Tree  99.10
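
The comparison is easier to read when the rows are ordered by score. A small sketch, with the numbers copied from the results in this post into a separate frame (`models_demo` is a name used only here):

```python
import pandas as pd

# Scores copied from the results reported in this post.
models_demo = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Random Forest', 'Naive Bayes',
              'Perceptron', 'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree'],
    'Score': [88.55, 72.62, 99.10, 79.91, 58.59, 73.51, 82.04, 99.10],
})

# Highest score first, with a fresh 0..N-1 index.
ranked = models_demo.sort_values(by='Score', ascending=False).reset_index(drop=True)
print(ranked)
```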

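Some findings

When filling in the missing values, rounding Age and Fare to whole numbers did not raise the accuracy score.

Age and Fare can also be one-hot encoded, but as with the previous finding this did not help: the accuracy score went down.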
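
The post does not say how Age and Fare were one-hot encoded. One plausible route, shown here purely as an assumption, is to cut the continuous column into bands first and then encode the bands; the bin edges, labels, and sample ages below are all hypothetical.

```python
import pandas as pd

# Hypothetical ages.
ages_demo = pd.Series([4.0, 19.0, 33.0, 47.0, 71.0], name='Age')

# Hypothetical bin edges and labels; the post does not say which were used.
# Intervals are right-closed: (0, 16], (16, 32], and so on.
bands = pd.cut(ages_demo, bins=[0, 16, 32, 48, 64, 120],
               labels=['0-16', '16-32', '32-48', '48-64', '64+'])

# One indicator column per band, including bands with no rows.
age_onehot = pd.get_dummies(bands, prefix='Age')
print(age_onehot.columns.tolist())
```

Banding discards the ordering and spacing of the original values, which may be one reason such an encoding can score lower than the raw numeric column.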
