Titanic: Predicting Survivors: The Data Analysis Turned Out to Be Simple



Titanic: Data Analysis
Data: 12 fields. The training set has 891 records (892 lines including the header row); the test set has 418 records.
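PassengerId => passenger ID
Pclass => passenger class (1/2/3)
Name => name
Sex => sex
Age => age
SibSp => number of siblings/spouses aboard
Parch => number of parents/children aboard
Ticket => ticket number
Fare => fare
Cabin => cabin
Embarked => port of embarkation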

 
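Goal: predict whether each of the 418 passengers in the test data survived.
Result: 97.37% accuracy (407 of 418 predictions match, as scored by the script below against gendermodel.csv).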
 
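Such a high accuracy is surprising. The Python code, written by an overseas author, does not seem to use any sophisticated algorithm, yet the accuracy is this high. I cannot help admiring how such a complex problem can be solved with such a simple method.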
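Main idea: the model relies on just three fields: sex, passenger class, and fare.
The fare is divided into four brackets: 0~10, 10~20, 20~30, and 30~max (each bracket includes its lower bound and excludes its upper bound).
The survival rate is then computed for every combination of sex, passenger class, and fare bracket.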
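The bracketing and grouping steps above can be sketched as follows. This is a minimal Python 3 illustration, not the script's own code: `fare_bracket` and `survival_rates` are hypothetical helper names, and the six sample passengers are made up for the example. It assumes that every fare of 30 or more falls into the last bracket.

```python
from collections import defaultdict

FARE_BRACKET_SIZE = 10   # width of each fare bracket
NUM_BRACKETS = 4         # 0~10, 10~20, 20~30, 30~max


def fare_bracket(fare):
    """Return the bracket index (0..3); everything from 30 up lands in the last bracket."""
    return min(int(fare // FARE_BRACKET_SIZE), NUM_BRACKETS - 1)


def survival_rates(passengers):
    """Map (sex, pclass, bracket) -> fraction of that group who survived."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for sex, pclass, fare, survived in passengers:
        key = (sex, pclass, fare_bracket(fare))
        counts[key][0] += survived
        counts[key][1] += 1
    return {key: s / n for key, (s, n) in counts.items()}


# Made-up sample rows (not real passengers): (sex, pclass, fare, survived)
sample = [
    ("female", 1, 80.0, 1),
    ("female", 1, 35.5, 1),
    ("female", 3, 7.25, 0),
    ("female", 3, 8.05, 1),
    ("male", 3, 7.9, 0),
    ("male", 2, 13.0, 0),
]
rates = survival_rates(sample)
print(rates[("female", 1, 3)])  # both sample women in this group survived
print(rates[("female", 3, 0)])  # one of the two sample women survived
```

Groups with no passengers simply do not appear in the dictionary here, whereas the script further down stores a rate of 0 for them.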
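The table below consists of two matrices with 3 rows and 4 columns each. The first matrix is for women: rows are passenger classes 1 to 3 and columns are the four fare brackets, so row 1, column 1 is the survival rate of first-class women who paid a fare in the 0~10 bracket, and the other cells follow the same pattern. The second matrix is for men.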
survival_table
[[[ 0.          0.          0.83333333  0.97727273]
  [ 0.          0.91428571  0.9         1.        ]
  [ 0.59375     0.58139535  0.33333333  0.125     ]]

 [[ 0.          0.          0.4         0.38372093]
  [ 0.          0.15873016  0.16        0.21428571]
  [ 0.11153846  0.23684211  0.125       0.24      ]]]
 
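As the table above shows, the survival rate of women is far higher overall than that of men. This is attributed to the gentlemanly code of conduct in Europe at the time, which gave priority to children, women, and the elderly, and the table reflects it.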
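To make that comparison concrete, the sketch below copies the values of survival_table from this post into a NumPy array and compares the female and male matrices cell by cell. The variable names are chosen for illustration. Cells that are 0 in both matrices may simply be empty groups, since the script stores 0 when a group has no passengers.

```python
import numpy as np

# Values copied from survival_table: axis 0 = sex (0 female, 1 male),
# axis 1 = passenger class (1st, 2nd, 3rd), axis 2 = fare bracket.
survival_table = np.array([
    [[0.0,        0.0,        0.83333333, 0.97727273],
     [0.0,        0.91428571, 0.9,        1.0],
     [0.59375,    0.58139535, 0.33333333, 0.125]],
    [[0.0,        0.0,        0.4,        0.38372093],
     [0.0,        0.15873016, 0.16,       0.21428571],
     [0.11153846, 0.23684211, 0.125,      0.24]],
])

female, male = survival_table
print("cells where women have the higher rate:", int((female > male).sum()))
print("cells where men have the higher rate:  ", int((male > female).sum()))
print("tied cells:                            ", int((female == male).sum()))
```

On these numbers women have the higher rate in 8 of the 12 cells, men in 1 (third class, top fare bracket), and 3 cells are tied at 0.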
In the table below, each cell whose survival rate is at least 0.5 is marked 1, and each cell below 0.5 is marked 0.
survival_table13
[[[ 0.  0.  1.  1.]
  [ 0.  1.  1.  1.]
  [ 1.  1.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]]
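The marking step can be reproduced from the numbers in this post. The sketch below is an illustration that starts from the survival rates listed earlier; `binary_table` is a name chosen here, and the comparison `>= 0.5` follows the script further down.

```python
import numpy as np

# Survival rates copied from survival_table (sex x class x fare bracket)
survival_table = np.array([
    [[0.0,        0.0,        0.83333333, 0.97727273],
     [0.0,        0.91428571, 0.9,        1.0],
     [0.59375,    0.58139535, 0.33333333, 0.125]],
    [[0.0,        0.0,        0.4,        0.38372093],
     [0.0,        0.15873016, 0.16,       0.21428571],
     [0.11153846, 0.23684211, 0.125,      0.24]],
])

# 1 where the group's survival rate is at least 0.5, otherwise 0
binary_table = (survival_table >= 0.5).astype(int)
print(binary_table)
```

The result has a 1 in seven of the female cells and a 0 in every male cell, which matches survival_table13.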
Applying this table to the 418 records of the test set gives a final prediction accuracy of 97.37%.
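Predicting from the table is then a plain lookup. The sketch below hard-codes the 0/1 table shown above and uses a hypothetical `predict` helper, which is not a function from the script. It assumes that fares of 30 or more belong to the last bracket and that a fare is always present, whereas the script below falls back to a class-based bracket when the fare is missing.

```python
# 0/1 table from above: BINARY_TABLE[sex][pclass - 1][bracket],
# with sex 0 = female and 1 = male
BINARY_TABLE = [
    [[0, 0, 1, 1],
     [0, 1, 1, 1],
     [1, 1, 0, 0]],
    [[0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]],
]


def predict(sex, pclass, fare):
    """Return 1 (predicted to survive) or 0 for a single passenger."""
    sex_index = 0 if sex == "female" else 1
    bracket = min(int(fare // 10), 3)  # 0~10, 10~20, 20~30, 30~max
    return BINARY_TABLE[sex_index][pclass - 1][bracket]


print(predict("female", 1, 82.27))  # 1st-class woman, top bracket
print(predict("female", 3, 25.0))   # 3rd-class woman, 20~30 bracket
print(predict("male", 1, 82.27))    # every male cell in the table is 0
```

The three calls print 1, 0 and 0 respectively.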
The data required by the code can be downloaded from: https://www.kaggle.com/c/titanic/data
# coding=utf-8
""" Now that the user can read in a file this creates a model which uses the price, class and gender
Author : AstroDave
Date : 18th September 2012
Revised : 28 March 2014

"""

import csv
import linecache
import numpy as np

with open('train.csv', newline='') as train_file:    # Load in the csv file
    csv_file_object = csv.reader(train_file)
    header = next(csv_file_object)                    # Skip the first line as it is a header
    data = np.array(list(csv_file_object))           # Every remaining row, as an array of strings

# In order to analyse the price column I need to bin up that data
# here are my binning parameters, the problem we face is some of the fares are very large
# So we can either have a lot of bins with nothing in them or we can just lose some
# information by just considering that anything over 39 is simply in the last bin.
# So we add a ceiling
fare_ceiling = 40
# then modify the data in the Fare column to = 39, if it is greater or equal to the ceiling
data[data[:, 9].astype(float) >= fare_ceiling, 9] = str(fare_ceiling - 1.0)

fare_bracket_size = 10
number_of_price_brackets = fare_ceiling // fare_bracket_size
# I know there were 1st, 2nd and 3rd classes on board, but it's better practice
# to calculate this from Pclass directly: count the UNIQUE values in column index 2
number_of_classes = len(np.unique(data[:, 2]))

# This reference matrix will show the proportion of survivors as a sorted table of
# gender, class and ticket fare.
# First initialize it with all zeros
survival_table = np.zeros([2, number_of_classes, number_of_price_brackets], float)

fares = data[:, 9].astype(float)
classes = data[:, 2].astype(float)
is_female = data[:, 4] == "female"

# I can now find the stats of all the women and men on board
for i in range(number_of_classes):
    for j in range(number_of_price_brackets):
        in_cell = ((classes == i + 1)
                   & (fares >= j * fare_bracket_size)
                   & (fares < (j + 1) * fare_bracket_size))
        women_only_stats = data[is_female & in_cell, 1]
        men_only_stats = data[~is_female & in_cell, 1]
        survival_table[0, i, j] = np.mean(women_only_stats.astype(float))   # Female stats
        survival_table[1, i, j] = np.mean(men_only_stats.astype(float))     # Male stats

# The mean of an array with nothing in it (such that the denominator is 0)
# is nan (NumPy also emits a RuntimeWarning); convert these entries to 0
survival_table[np.isnan(survival_table)] = 0.
print('survival_table12')
print(survival_table)

# Now I have my proportion of survivors, simply round them such that if < 0.5
# I predict they don't survive, and if >= 0.5 they do
survival_table[survival_table < 0.5] = 0
survival_table[survival_table >= 0.5] = 1
print('survival_table13')
print(survival_table)

# Now I have my indicator I can read in the test file, write out the predictions,
# and count how many of them agree with the labels in gendermodel.csv
matches = 0
count = 1    # line number in gendermodel.csv; line 1 is its header
linecache.clearcache()

with open('test.csv', newline='') as test_file, \
        open('genderclassmodel.csv', 'w', newline='') as predictions_file:
    test_file_object = csv.reader(test_file)
    header = next(test_file_object)
    predictions_file_object = csv.writer(predictions_file)
    predictions_file_object.writerow(["PassengerId", "Survived"])

    for row in test_file_object:
        count += 1
        # First thing to do is bin up the fare
        try:
            fare = float(row[8])             # No fare recorded comes up as an empty string
        except ValueError:
            bin_fare = 3 - int(row[1])       # so bin the fare according to the class
        else:
            if fare >= fare_ceiling:         # At or above the ceiling: last bracket
                bin_fare = number_of_price_brackets - 1
            else:                            # Otherwise the bracket the fare falls into
                bin_fare = int(fare // fare_bracket_size)

        # Now I have the binned fare, passenger class, and whether female or male, we can
        # just cross ref their details with our survival table
        sex_index = 0 if row[3] == 'female' else 1
        prediction = int(survival_table[sex_index, int(row[1]) - 1, bin_fare])
        expected = int(linecache.getline('gendermodel.csv', count).split(',')[1])
        if prediction == expected:
            matches += 1
        predictions_file_object.writerow([row[0], "%d" % prediction])

proportion_survived = matches / float(count - 1)
print('people number:%d accuracy number: %d' % (count, matches))
print('Forecast accuracy of people who survived is %s' % proportion_survived)

Output of the code:
survival_table12 
[[[ 0.          0.          0.83333333  0.97727273]
  [ 0.          0.91428571  0.9         1.        ]
  [ 0.59375     0.58139535  0.33333333  0.125     ]]

 [[ 0.          0.          0.4         0.38372093]
  [ 0.          0.15873016  0.16        0.21428571]
  [ 0.11153846  0.23684211  0.125       0.24      ]]]
survival_table13 
[[[ 0.  0.  1.  1.]
  [ 0.  1.  1.  1.]
  [ 1.  1.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]]
people number:419 accuracy number: 407
Forecast accuracy  rate of people who survived is 0.973684210526
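
As a check on the last two output lines: `count` starts at 1 in the script and is incremented once per test row, so the printed 419 corresponds to 418 passengers, and the accuracy is 407 divided by 418. The variable names below are illustrative.

```python
count = 419      # value printed as "people number" (starts at 1, +1 per test row)
matches = 407    # value printed as "accuracy number"

passengers = count - 1
accuracy = matches / passengers
print(passengers)                      # 418
print("{:.2%}".format(accuracy))       # 97.37%
```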