pandas Basics


pandas represents tabular data with the DataFrame, a two-dimensional data structure.
First, load pandas and numpy:

import pandas as pd
import numpy as np

I. Creating a data table
1. Reading files
CSV and xlsx files are read with read_csv() and read_excel() respectively.

df = pd.read_csv('./data/HR.csv')
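
As a minimal sketch of what read_csv() returns, the example below parses CSV text from an in-memory buffer instead of ./data/HR.csv. The column names and values are made up for illustration. Excel files are read the same way with pd.read_excel(), which needs an Excel engine such as openpyxl to be installed.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "id,city,price\n1001,Beijing,1200\n1002,SH,\n"

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # an empty field is read as NaN, so price becomes float
```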

2. Creating a data table with pandas

df = pd.DataFrame({
    "id": [1001,1002,1003,1004,1005,1006],
    "date": pd.date_range('20130102', periods=6),
    "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
    "age": [23,44,54,32,34,32],
    "category": ['100-A','100-B','110-A','110-C','210-A','130-F'],
    "price": [1200,np.nan,2133,5433,np.nan,4432]},
     columns = ['id','date','city','category','age','price'])

Output:

     id       date         city category  age   price
0  1001 2013-01-02     Beijing     100-A   23  1200.0
1  1002 2013-01-03           SH    100-B   44     NaN
2  1003 2013-01-04   guangzhou     110-A   54  2133.0
3  1004 2013-01-05     Shenzhen    110-C   32  5433.0
4  1005 2013-01-06     shanghai    210-A   34     NaN
5  1006 2013-01-07     BEIJING     130-F   32  4432.0

II. Viewing data table information
1. Viewing the dimensions

df.shape   # (6, 6)

2. Basic information about the data table (dimensions, column names, data types, memory usage, etc.)

df.info()

3. Data type of each column

df.dtypes

Output:

id                   int64
date        datetime64[ns]
city                object
category            object
age                  int64
price              float64

4. Data type of a single column

df['date'].dtypes

5. Null values

df.isnull()
df['date'].isnull()   # check a single column
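
isnull() returns a table of booleans, which is rarely read directly. A common follow-up, sketched here on a small made-up table, is to sum the booleans to count the missing values in each column.

```python
import numpy as np
import pandas as pd

# Small illustrative table; the values are not from the article's data file.
prices = pd.DataFrame({
    "id": [1001, 1002, 1003],
    "price": [1200, np.nan, 2133],
})

mask = prices.isnull()            # same shape as prices, True where a value is missing
missing_per_column = mask.sum()   # True counts as 1, so this counts NaN per column
print(missing_per_column)
```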

6. Viewing the unique values of a column

df['date'].unique()

7. Viewing the values of the data table

df.values

df.head()  # first 5 rows by default; pass a number to change it
df.tail()  # last 5 rows by default

8. Generating new data from existing data
Example: max_time and min_time are two existing columns, and the business needs a new column gs, where gs = max_time - min_time.

df['gs'] = df['max_time'] - df['min_time']
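
The subtraction is applied element by element. A small sketch, assuming the two columns hold timestamps (the article does not say what they contain, and the values below are invented):

```python
import pandas as pd

# Hypothetical timestamps; only the column names come from the text above.
times = pd.DataFrame({
    "min_time": pd.to_datetime(["2013-01-02 08:00", "2013-01-03 09:30"]),
    "max_time": pd.to_datetime(["2013-01-02 17:00", "2013-01-03 12:00"]),
})

# Element-wise difference; for datetime columns the result is a timedelta column.
times["gs"] = times["max_time"] - times["min_time"]
print(times["gs"])
```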

9. Viewing basic statistics

df.describe()

Example output (for a DataFrame with four numeric columns a, b, c, d):

              a         b         c         d
count  4.000000  4.000000  4.000000  4.000000
mean  -0.058927 -0.474549  1.019342 -0.750464
std    0.595253  0.530539  0.753136  1.022685
min   -0.640585 -0.997408  0.160999 -1.855990
25%   -0.532082 -0.812058  0.509721 -1.489673
50%   -0.065873 -0.561149  1.077771 -0.708147
75%    0.407282 -0.223640  1.587391  0.031062
max    0.536626  0.221508  1.760826  0.270427

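10. DataFrame operations

df.head(1)['date']   # the date column of the first row

df.head(1)['date'][0]   # the value in the date column of the first row

sum(df['ability'])   # sum of an entire column

df[df['date'] == '20161111']   # rows that satisfy the condition

df[df['date'] == '20161111'].index[0]   # index label of the first row that satisfies the condition

df.index   # the row index

df.index[0]   # the first row label

df.index[-1]   # the last row label (the label only, not the row)

df.columns   # the column labels

df[0:2]   # rows 1 and 2; counting starts at 0 and the end is excluded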

III. Cleaning the data table
1. Filling null values with 0

df.fillna(value=0)

2. Filling NA values with the mean of the price column

df['price'].fillna(df['price'].mean())
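
fillna() returns a new object and leaves the original untouched, so the result has to be assigned back to be kept. A short sketch using the price values from the table above:

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({"price": [1200, np.nan, 2133, 5433, np.nan, 4432]})

filled = prices["price"].fillna(prices["price"].mean())  # returns a new Series
print(prices["price"].isnull().sum())   # the original column still has its NaN values

prices["price"] = filled                # assign back to keep the result
print(prices["price"].isnull().sum())
```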

3. Stripping whitespace from the city field

df['city']=df['city'].map(str.strip)

4. Case conversion

df['city']=df['city'].str.lower()

5. Changing the data type

df['price'].fillna(0).astype('int')   # NaN cannot be cast to int, so fill missing values first

6. Renaming a column

df.rename(columns={'category': 'category-size'})

7. Removing duplicate values

df['city'].drop_duplicates()   # by default, keeps the first occurrence
df['city'].drop_duplicates(keep='last')     # keeps the last occurrence instead

8. Replacing data

df['city'].replace('sh', 'shanghai')
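
The cleaning steps for the city column can also be chained. The sketch below uses the city values from the table in section I and shows why the order matters: the duplicates only become visible after stripping and lower-casing.

```python
import pandas as pd

city = pd.Series(['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '])

cleaned = (
    city.str.strip()                 # remove leading and trailing spaces
        .str.lower()                 # normalize case
        .replace('sh', 'shanghai')   # replaces whole values equal to 'sh', not substrings
)
print(cleaned.tolist())
print(cleaned.drop_duplicates().tolist())
```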

IV. Data preprocessing

df1=pd.DataFrame({
    "id":[1001,1002,1003,1004,1005,1006,1007,1008], 
    "gender":['male','female','male','female','male','female','male','female'],
    "pay":['Y','N','Y','Y','N','Y','N','Y',],
    "m-point":[10,12,20,40,40,40,30,20]})

1. Merging data tables

df_inner = pd.merge(df,df1,how='inner')
df_left = pd.merge(df,df1,how='left')
df_right = pd.merge(df,df1,how='right')
df_outer = pd.merge(df,df1,how='outer')
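
The four values of how differ in which keys survive the merge. A minimal sketch on two small made-up tables that share only some ids:

```python
import pandas as pd

left = pd.DataFrame({"id": [1001, 1002, 1003], "age": [23, 44, 54]})
right = pd.DataFrame({"id": [1002, 1003, 1004], "pay": ["N", "Y", "Y"]})

# With no `on` argument, merge joins on the columns both tables share (here "id").
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, how=how)
    print(how, sorted(merged["id"]))
```

inner keeps only ids present in both tables, left and right keep every id of one side, and outer keeps every id of either side, filling the gaps with NaN.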

2. Setting an index column

df_left.set_index('id')

3. Sorting by the values of a column

df_left.sort_values(by=['age'])

4. Sorting by the index

df_left.sort_index()

5. If the price column value is > 3000, the group column is high, otherwise low

df_left['group'] = np.where(df_left['price'] > 3000,'high','low')
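
One detail worth knowing: a comparison with NaN is False, so rows with a missing price end up labeled low. The sketch below shows this, plus one possible way to leave those rows unlabeled (this second step is an assumption, not part of the original recipe).

```python
import numpy as np
import pandas as pd

price = pd.Series([1200, np.nan, 5433])

# NaN > 3000 evaluates to False, so the missing price is labeled 'low'.
group = np.where(price > 3000, 'high', 'low')
print(group.tolist())

# Keep a label only where the price is known; other rows become NaN.
group_checked = pd.Series(group, index=price.index).where(price.notnull())
print(group_checked.tolist())
```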

6. Flagging rows that satisfy several conditions at once

df_left.loc[(df_left['city'] == 'beijing') & (df_left['price'] >= 4000), 'sign'] = 1

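7. Split the values of the category field on '-' and build a data table from the pieces. It uses the index of df_left and the column names category and size

split = pd.DataFrame((x.split('-') for x in df_left['category']),index=df_left.index,columns=['category','size'])

8. Merging the split data table back into the original df_left data table

df_left = pd.merge(df_left, split, right_index=True, left_index=True)   # split is the table built in step 7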
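
The same split can be sketched with the string accessor, which avoids the Python-level loop. The column names code and size below are hypothetical, chosen so that the new columns do not clash with the existing category column.

```python
import pandas as pd

items = pd.DataFrame({"category": ['100-A', '100-B', '110-A']})

# expand=True returns one column per piece, aligned on the original index.
pieces = items["category"].str.split('-', expand=True)
pieces.columns = ['code', 'size']   # hypothetical names for the two pieces

combined = items.join(pieces)       # join aligns on the index, like the index-based merge
print(combined)
```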

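V. Data extraction
Three indexers are mainly used: loc, iloc, and ix.

loc extracts by label.

iloc extracts by position.

ix extracts by label and position at the same time. It was deprecated in pandas 0.20 and removed in pandas 1.0, so current code combines loc and iloc instead.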
1. Extracting a single row by index label

df_left.loc[3]  # the row whose index label is 3

2. Extracting a range of rows by position

df_left.iloc[0:5]    # rows 0, 1, 2, 3, 4

3. Resetting the index

df_left.reset_index()

4. Setting date as the index

df_left=df_left.set_index('date')

5. Extracting all data up to 2013-01-04

df_left[:'2013-01-04']

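6. Using iloc to extract data by position range

df_left.iloc[:3,:2]  # the numbers around the colon are positions, not index labels; counting from 0, this is the first three rows and first two columns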

7. Using iloc to extract data at individual positions

df_left.iloc[[0,2,5],[4,5]] # rows 0, 2, 5 and columns 4, 5

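8. Extracting data by a mix of index labels and positions (chaining loc and iloc, since ix is no longer available)

df_left.loc[:'2013-01-03'].iloc[:, :4] # rows up to 2013-01-03, first four columns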
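
The difference between label slicing and position slicing is easy to miss. A minimal sketch on a small date-indexed table with made-up values:

```python
import pandas as pd

frame = pd.DataFrame(
    {"id": [1001, 1002, 1003, 1004], "age": [23, 44, 54, 32], "price": [1200, 900, 2133, 5433]},
    index=pd.date_range('20130102', periods=4),
)

by_label = frame.loc[:'2013-01-03']            # label slice: the end label is included
by_position = frame.iloc[:2]                   # position slice: the end position is excluded
mixed = frame.loc[:'2013-01-03'].iloc[:, :2]   # rows by label, then columns by position
print(mixed)
```

Here by_label and by_position select the same two rows, even though one slice ends at a label and the other at a position.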

9. Checking whether the city column value is beijing

df_left['city'].isin(['beijing'])

10. Checking whether the city column contains beijing or shanghai, and extracting the matching rows

df_left.loc[df_left['city'].isin(['beijing','shanghai'])]

11. Extracting the first three characters of category and building a data table from them

pd.DataFrame(df_left['category'].str[:3])

VI. Data filtering
1. Filter data by combining the three logical conditions AND, OR, and NOT with greater-than, less-than, and equality comparisons.

# AND
df_left.loc[(df_left['age'] > 25) & (df_left['city'] == 'beijing'), ['id','city','age','category','gender']]
# OR
df_left.loc[(df_left['age'] > 25) | (df_left['city'] == 'beijing'), ['id','city','age','category','gender']]
# NOT
df_left.loc[(df_left['city'] != 'beijing'), ['id','city','age','category','gender']]
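
The three kinds of condition can be sketched on a small made-up table. The ~ operator is shown as a general way to negate any boolean mask, alongside the != comparison used above.

```python
import pandas as pd

frame = pd.DataFrame({
    "city": ['beijing', 'shanghai', 'beijing'],
    "age": [23, 44, 32],
})

# Each comparison needs its own parentheses because & and | bind tighter than > and ==.
both = frame[(frame['age'] > 25) & (frame['city'] == 'beijing')]
either = frame[(frame['age'] > 25) | (frame['city'] == 'beijing')]
negated = frame[~(frame['city'] == 'beijing')]   # ~ inverts a boolean mask
print(len(both), len(either), len(negated))
```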

2. Counting the filtered data by the city column

df_left.loc[(df_left['age'] > 25) & (df_left['city'] == 'shanghai'), ['id','city','age','category','gender']].city.count()

3. Filtering with the query function

df_left.query('city == ["beijing", "shanghai"]')

4. Summing price over the filtered result

df_left.query('city == ["beijing", "shenzhen"]').price.sum()
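
A query string that compares a column to a list behaves like isin(). The sketch below checks that equivalence on made-up data, using the in spelling that query() also accepts.

```python
import pandas as pd

frame = pd.DataFrame({
    "city": ['beijing', 'shanghai', 'guangzhou', 'shenzhen'],
    "price": [1200, 800, 2133, 5433],
})

by_query = frame.query('city in ["beijing", "shenzhen"]')
by_mask = frame[frame['city'].isin(['beijing', 'shenzhen'])]   # equivalent boolean-mask form
print(by_query['price'].sum())
```

Matching is case-sensitive, so the strings in the query must use the same casing as the cleaned column.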

VII. Data summarization
The main functions are groupby and pivot_table.
1. Count summary for all columns

df_left.groupby('city').count()

2. Counting the id field by city

df_left.groupby('city')['id'].count()

3. Counting by two fields

df_left.groupby(['city','size'])['id'].count()

4. Summarizing by the city field and computing the row count, sum, and mean of price

df_left.groupby('city')['price'].agg([len, 'sum', 'mean'])
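
agg() also accepts keyword arguments that name the output columns. A minimal sketch on made-up data; the output names valid, total, and average are arbitrary. Note that count ignores NaN, whereas len in the call above counts every row.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "city": ['beijing', 'shanghai', 'beijing', 'shanghai'],
    "price": [1200, np.nan, 4432, 800],
})

summary = frame.groupby('city')['price'].agg(
    valid='count',     # number of non-null prices per city
    total='sum',       # NaN is skipped
    average='mean',    # NaN is skipped
)
print(summary)
```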

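VIII. Data statistics
Data sampling, and computing the standard deviation, covariance, and correlation coefficient
1. Simple random sampling (without replacement by default)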

df_left.sample(n=3)

2. Setting the sampling weights manually

weights = [0, 0, 0, 0, 0.5, 0.5]
df_left.sample(n=2, weights=weights)
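
With these weights only the last two rows can ever be drawn. The sketch below reproduces that on a made-up id column; random_state is added so the draw is repeatable.

```python
import pandas as pd

frame = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006]})
weights = [0, 0, 0, 0, 0.5, 0.5]   # one weight per row

# Rows with weight 0 can never be drawn.
picked = frame.sample(n=2, weights=weights, random_state=0)
print(sorted(picked['id']))
```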

3. Sampling with and without replacement

df_left.sample(n=6, replace=True)     # with replacement
df_left.sample(n=6, replace=False)    # without replacement

4. Computing the standard deviation of a column

df_left['price'].std()

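5. Computing covariance

df_inner.cov(numeric_only=True)    # covariance between all numeric columns
df_left['price'].cov(df_left['m-point'])    # covariance between two columns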

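6. Correlation analysis

# correlation between all numeric columns
df_inner.corr(numeric_only=True)
# correlation between two columns
df_left['price'].corr(df_left['m-point']) # the coefficient ranges from -1 to 1: near 1 is a positive correlation, near -1 a negative one, and 0 means no linear correlation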
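
The default (Pearson) correlation ties the three statistics in this section together: it is the covariance divided by the product of the two standard deviations. A sketch that checks this numerically on made-up values:

```python
import numpy as np
import pandas as pd

x = pd.Series([1200.0, 2133.0, 5433.0, 4432.0])
y = pd.Series([10.0, 20.0, 40.0, 40.0])

cov_xy = x.cov(y)                          # sample covariance (divides by n - 1)
r_manual = cov_xy / (x.std() * y.std())    # Pearson r = cov / (std_x * std_y)
r_pandas = x.corr(y)                       # method='pearson' is the default
print(r_manual, r_pandas)
```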

IX. Data output
The analyzed data can be written out in xlsx and csv formats.
1. Writing to Excel

df_left.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')

2. Writing to CSV

df_left.to_csv('excel_to_python.csv')
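
By default to_csv() also writes the row index as the first column. When the index carries no information, index=False leaves it out. The sketch below round-trips a small made-up table through an in-memory buffer instead of a file.

```python
import io
import pandas as pd

frame = pd.DataFrame({"id": [1001, 1002], "city": ['beijing', 'shanghai']})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)   # index=False omits the row index column
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored)
```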
