Learning the DataFrame Module in Python


This document studies and tests the pandas DataFrame module in the following Windows environment:
  • Windows 10
  • PyCharm 2018.3.5 for Windows (exe)
  • Python 3.6.8 Windows x86 executable installer

    1. Initializing a DataFrame
  • Create an empty DataFrame variable
    import pandas as pd
    import numpy as np
    data = pd.DataFrame()
    print(np.shape(data)) # (0,0)
    
  • Create a DataFrame from a dictionary
    import pandas as pd
    import numpy as np
    dict_a = {'name': ['xu', 'wang'], 'gender': ['male', 'female']}
    data = pd.DataFrame(dict_a)
    print(np.shape(data)) # (2,2)
    print(data) 
    # data = 
    # 	name  gender
    # 0	 xu		male
    # 1	 wang	female
    
  • Create a DataFrame from a numpy.array
    import pandas as pd
    import numpy as np
    mat = np.random.randn(3,4)
    df = pd.DataFrame(mat)
    df.columns = ['a','b','c','d']
    print(df)
    
  • Convert a DataFrame to a numpy.array
    import pandas as pd
    import numpy as np
    mat = np.random.randn(3,4)
    df = pd.DataFrame(mat)
    df.columns = ['a','b','c','d']
    print(df)
    n = np.array(df)
    print(n)
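
  • As a cross-check of the conversion above, the short sketch below compares np.array(df) with the .values attribute, which returns the same underlying data. The sample values are made up for illustration. Newer pandas versions (0.24 and later) also provide df.to_numpy() for the same purpose.

```python
import numpy as np
import pandas as pd

# Illustrative sample data (not from the original text).
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=['a', 'b'])

via_array = np.array(df)   # the conversion used above
via_values = df.values     # attribute that exposes the same data as an ndarray

print(np.array_equal(via_array, via_values))  # True
print(via_array.shape)                        # (2, 2)
```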
    
  • Add one column of data to a DataFrame
    import pandas as pd
    import numpy as np
    data = pd.DataFrame()
    data['ID'] = range(0,10) 
    print(np.shape(data)) # (10,1) 
    
  • Add one row of data to a DataFrame
    import pandas as pd
    import numpy as np
    df = pd.DataFrame(columns=('a', 'b', 'c'))
    df = df.append([{'a': 10.0, 'b': 'name', 'c': 10}], ignore_index=True)
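
  • DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above only runs on older versions such as those available for Python 3.6. The sketch below shows one way to add the same row with pd.concat, which works on old and new versions alike. It starts from a one-row frame rather than an empty one, and the name new_row is only illustrative.

```python
import pandas as pd

# Start from a one-row frame with the three columns used above.
df = pd.DataFrame({'a': [1.0], 'b': ['first'], 'c': [1]})

# Wrap the new row in its own single-row DataFrame ...
new_row = pd.DataFrame([{'a': 10.0, 'b': 'name', 'c': 10}])

# ... and concatenate; ignore_index=True renumbers the rows from 0.
df = pd.concat([df, new_row], ignore_index=True)
print(df.shape)  # (2, 3)
```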
    
  • Add a column to a DataFrame with the same value in every row
    import pandas as pd
    import numpy as np
    dict_a = {'name': ['xu', 'wang'], 'gender': ['male', 'female']}
    data = pd.DataFrame(dict_a)
    data['country'] = 'China' 
    print(data) 
    # data = 
    # 	name    gender	country
    # 0	 xu		male	China
    # 1	 wang	female	China
    
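  • Remove duplicate rows from a DataFrame
    import pandas as pd
    # df is assumed to be an existing DataFrame with columns 'A_ID' and 'B_ID'
    norepeat_df = df.drop_duplicates(subset=['A_ID', 'B_ID'], keep='first')
    # norepeat_df = df.drop_duplicates(subset=[1, 2], keep='first')
    # keep=False: drop every row that has a duplicate
    # keep='first': keep the first occurrence of each group of duplicates
    # keep='last': keep the last occurrence of each group of duplicates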
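
  • The snippet above assumes that df already exists. The self-contained sketch below uses made-up sample data to show how the three keep options differ. Only the column names A_ID and B_ID come from the text, and the value column is hypothetical.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 share the same (A_ID, B_ID) pair.
df = pd.DataFrame({
    'A_ID': [1, 2, 1],
    'B_ID': [10, 20, 10],
    'value': ['x', 'y', 'z'],
})

first = df.drop_duplicates(subset=['A_ID', 'B_ID'], keep='first')
last = df.drop_duplicates(subset=['A_ID', 'B_ID'], keep='last')
none = df.drop_duplicates(subset=['A_ID', 'B_ID'], keep=False)

print(first.index.tolist())  # [0, 1]
print(last.index.tolist())   # [1, 2]
print(none.index.tolist())   # [1]
```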
    

    2. Basic operations
  • Get the number of rows and columns of a DataFrame
    df.shape[0]  # number of rows
    df.shape[1]  # number of columns
    
  • Get the transpose of a DataFrame
    df.T
    
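  • Change the data precision of a DataFrame
    import numpy as np
    import pandas as pd
    a = np.array([[0.03, 0.05, 1.22], [0.04, 4.54, 3.68]])
    df = pd.DataFrame(a.T, columns=['a', 'b'])
    df = df.round(2)  # round to 2 decimal places; round() returns a new DataFrame, so assign the result
    print(df)
    #       a     b
    # 0  0.03  0.04
    # 1  0.05  4.54
    # 2  1.22  3.68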
    
  • Use the apply function of a DataFrame
    import numpy as np
    import pandas as pd
    a = np.array([[3, 1, 2], [2, 4, 3]])
    df = pd.DataFrame(a.T, columns=['a', 'b'])
    print(df)
    #    a  b
    # 0  3  2
    # 1  1  4
    # 2  2  3
    
    f = lambda x: np.mean(x)
    t1 = df.apply(f)  # apply f to each column (axis=0 is the default)
    print(t1)
    # a    2.0
    # b    3.0
    t2 = df.apply(f, axis=1)  # apply f to each row
    print(t2)
    # 0    2.5
    # 1    2.5
    # 2    2.5
    
  • Select all rows of a DataFrame in which a given column equals a given value
    df.loc[df['columnName']=='the value']
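
  • The one-liner above uses placeholder names. The sketch below applies the same boolean selection to a small made-up table and also shows how two conditions can be combined. The sample data and the variable names mask, males and both are illustrative.

```python
import pandas as pd

# Illustrative sample data.
data = pd.DataFrame({
    'name': ['xu', 'wang', 'li'],
    'gender': ['male', 'female', 'male'],
})

# The comparison yields a boolean Series; loc keeps the rows where it is True.
mask = data['gender'] == 'male'
males = data.loc[mask]
print(males['name'].tolist())  # ['xu', 'li']

# Conditions combine with & (and) and | (or); each needs its own parentheses.
both = data.loc[(data['gender'] == 'male') & (data['name'] != 'xu')]
print(both['name'].tolist())  # ['li']
```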
    
  • Strip specified characters from both ends of the values in a column
    import pandas as pd
    dict_a = {'name': ['.xu', 'wang'], 'gender': ['male', 'female.']}
    data = pd.DataFrame(dict_a)
    print(data) 
    # data = 
    # 	name    gender	
    # 0	 .xu		male	
    # 1	 wang	female.	
    data['name'] = data['name'].str.strip('.')  # remove leading and trailing '.'
    # data['name'] = data['name'].str.strip()  # with no argument, remove leading and trailing whitespace
    print(data) 
    # data = 
    # 	name    gender	
    # 0	 xu		male	
    # 1	 wang	female.	
    
  • Reset the index values
    import pandas as pd
    data = pd.DataFrame()
    data['ID'] = range(0,3) 
    # data = 
    # 	ID
    # 0	 0
    # 1	 1
    # 2  2
    data.index = range(1,len(data) + 1) 
    # data = 
    # 	ID
    # 1	 0
    # 2	 1
    # 3  2
    
  • Reorder the columns of a DataFrame
    import pandas as pd
    data = pd.DataFrame({'ID': [0, 1, 2], 'name': ['xu', 'wang', 'li']})
    print(data)
    # data = 
    # 	ID  name
    # 0	 0	xu
    # 1	 1	wang
    # 2  2	li
    data = data[['name','ID']]
    # data = 
    # 	name  ID
    # 0	 xu	   0
    # 1	 wang  1
    # 2  li    2
    
  • Get the column names of a DataFrame
    import pandas as pd
    data = pd.DataFrame({'ID': [0, 1, 2], 'name': ['xu', 'wang', 'li']})
    print(data)
    # data = 
    # 	ID  name
    # 0	 0	xu
    # 1	 1	wang
    # 2  2	li
    print(data.columns.values.tolist())
    # 	['ID', 'name']
    
  • Get the row names (index labels) of a DataFrame
    import pandas as pd
    data = pd.DataFrame({'ID': [0, 1, 2], 'name': ['xu', 'wang', 'li']})
    print(data)
    # data = 
    # 	ID  name
    # 0	 0	xu
    # 1	 1	wang
    # 2  2	li
    print(data.index.tolist())
    # 	[0, 1, 2]
    
  • Iterate over a DataFrame column by column
    import pandas as pd
    import numpy as np
    data = pd.DataFrame(np.arange(6).reshape((2, 3)))
    print(data)
    # data = 
    # 	 0	1	2
    # 0	 0	1	2
    # 1	 3	4	5
    cols = data.columns.values
    for i in range(len(cols)):
        print(data[cols[i]])  # print each column as a Series
    
    
    
    # The same loop works when the columns have string labels
    data = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=['a', 'b', 'c'])
    print(data)
    # data = 
    # 	 a	b	c
    # 0	 0	1	2
    # 1	 3	4	5
    cols = data.columns.values
    for i in range(len(cols)):
        print(data[cols[i]])
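
  • The index-based loops above can also be written with DataFrame.items(), which yields each column label together with its Series. The sketch below is one possible variant. Summing each column into a totals dictionary is only an illustration and is not part of the original example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=['a', 'b', 'c'])

# items() yields (column label, column Series) pairs in column order.
totals = {}
for name, column in data.items():
    totals[name] = int(column.sum())

print(totals)  # {'a': 3, 'b': 5, 'c': 7}
```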
    

    3. Read and write operations
  • Read a CSV file into a DataFrame
  • For the parameters of read_csv(), see pandas.read_csv in the official documentation
    import pandas as pd
    data = pd.read_csv('user.csv')
    print(data)
    
  • Write DataFrame data to a CSV file
  • For the parameters of to_csv(), see pandas.DataFrame.to_csv in the official documentation
    import pandas as pd
    data = pd.read_csv('test1.csv')
    data.to_csv("test2.csv",index=False, header=True)
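
  • The two snippets above depend on files (user.csv, test1.csv) that are not included in this document. The sketch below shows the same write-then-read round trip with an in-memory buffer, so it runs without any file on disk. The sample data and the names buffer, csv_text and restored are illustrative.

```python
import io
import pandas as pd

# Illustrative sample data.
data = pd.DataFrame({'ID': [0, 1], 'name': ['xu', 'wang']})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
data.to_csv(buffer, index=False, header=True)
csv_text = buffer.getvalue()

# Read the same text back; because index=False was used, no extra index column appears.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.columns.tolist())  # ['ID', 'name']
print(restored['ID'].tolist())    # [0, 1]
```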
    

    4. Handling abnormal data
  • Filter out all rows that contain NaN
  • For the parameters of dropna(), see pandas.DataFrame.dropna in the official documentation
    from numpy import nan as NaN
    import pandas as pd 
    data = pd.DataFrame([[1,2,3],[NaN,NaN,2],[NaN,NaN,NaN],[8,8,NaN]])
    print(data)
    # data =
    #      0    1    2
    # 0  1.0  2.0  3.0
    # 1  NaN  NaN  2.0
    # 2  NaN  NaN  NaN
    # 3  8.0  8.0  NaN
    data = data.dropna()
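    # DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
    # axis: 0 or 'index' drops rows; 1 or 'columns' drops columns
    # how: 'any' drops a row if it contains any NaN; 'all' drops it only if every value is NaN
    # thresh: an integer n; keep only rows with at least n non-NaN values
    # subset: e.g. ['name', 'gender']; look for NaN only in these columns (with axis=1, pass index labels instead)
    # inplace: if True, modify the DataFrame in place and return None
    print(data)
    # data =
    #      0    1    2
    # 0  1.0  2.0  3.0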
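
  • The parameter notes above can be tried on the same sample data. The sketch below calls dropna() with how, thresh and subset one at a time and prints the index labels of the rows that remain. np.nan is used directly here in place of the NaN alias.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 2, 3],
                     [np.nan, np.nan, 2],
                     [np.nan, np.nan, np.nan],
                     [8, 8, np.nan]])

# how='all': drop a row only when every value in it is NaN (row 2).
print(data.dropna(how='all').index.tolist())   # [0, 1, 3]

# thresh=2: keep rows that have at least 2 non-NaN values.
print(data.dropna(thresh=2).index.tolist())    # [0, 3]

# subset=[2]: only check column 2 for NaN when dropping rows.
print(data.dropna(subset=[2]).index.tolist())  # [0, 1]
```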