pandas入門--DataFrame

45374 ワード

pandasのDataFrame 2 D配列オブジェクト
DataFrameは、順序付けされた列のセットを含むテーブル行のデータ構造であり、DateFrameはSeriesからなる辞書とみなされ、インデックスが共通に使用されます.
作成方法:
pd.DataFrame({'one':[1,2,3,4], 'two':[5,4,3,2]})

pd.DataFrame({'one':pd.Series([1,2,3], index=['a','b','c']), 
'two':pd.Series([1,2,3,4],index=['a','b','c','d'])})

---           ---
df.read_csv('filename.csv')

df.to_csv()
In [1]: import pandas as pd

In [2]: pd.DataFrame({'one':[1,2,3,4], 'two':[5,4,3,2]})
Out[2]:
   one  two
0    1    5
1    2    4
a,b,c
1,2,3
2,4,6
3,6,9


In [3]: pd.DataFrame({'one':[1,2,3,4], 'two':[5,4,3,2]}, 
index=['a','b','c','d'])
Out[3]:
   one  two
a    1    5
b    2    4
c    3    3
d    4    2

In [4]: pd.DataFrame({'one':pd.Series([1,2,3], index=['a','b','c']), 
'two':pd.Series([1,2,3,4],index=['a','b','c','d'])})
Out[4]:
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

In [5]: #            

In [6]: !vi test.csv

In [8]: pd.read_csv('test.csv')
Out[8]:
   a  b  c
0  1  2  3
1  2  4  6
2  3  6  9

In [9]: df = _4

In [10]: df
Out[10]:
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

In [11]: df.to_csv('test_save.csv')

In [12]: !cat test_save.csv
,one,two
a,1.0,1
b,2.0,2
c,3.0,3
d,,4

pandasのDataFrame共通属性
  • index>>行インデックス
  • の取得
  • T>>転置
  • columns>>>カラムインデックス
  • を取得
  • values>>取得値配列
  • describe()>>>高速統計を取得
  • In [1]: import pandas as pd
    
    In [2]: df = pd.DataFrame({'one':pd.Series([1,2,3], index=['a','b','c']), 
    'two':pd.Series([1,2,3,4],index=['a','b','c','d'])})
    
    In [3]: df.index
    Out[3]: Index(['a', 'b', 'c', 'd'], dtype='object')
    
    In [4]: df.values
    Out[4]:
    array([[  1.,   1.],
           [  2.,   2.],
           [  3.,   3.],
           [ nan,   4.]])
    
    In [5]: #   Series    df。values         
    
    In [6]: df
    Out[6]:
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    
    In [7]: df.columns
    Out[7]: Index(['one', 'two'], dtype='object')
    
    In [8]: #      
    
    In [9]: df.T
    Out[9]:
           a    b    c    d
    one  1.0  2.0  3.0  NaN
    two  1.0  2.0  3.0  4.0
    
    In [10]: #        ,     
    
    In [11]: #    numpy    pandas              
    
    In [12]: #    NaN          ,                
    
    In [13]: #                 ,         ,        
        ...:    
    
    In [14]: df.describe()
    .../numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
      RuntimeWarning)
    Out[14]:
           one       two
    count  3.0  4.000000
    mean   2.0  2.500000
    std    1.0  1.290994
    min    1.0  1.000000
    25%    NaN  1.750000
    50%    NaN  2.500000
    75%    NaN  3.250000
    max    3.0  4.000000
    
    In [15]:
    

    DataFrameインデックスとスライス
    In [15]: df['one']['a']
    Out[15]: 1.0
    
    In [16]: #              ,       。
    
    In [17]: df.loc['a','one']
    Out[17]: 1.0
    
    In [18]: # loc           [  ,  ]     numpy       
    
    In [19]: df.loc['a',:]
    Out[19]:
    one    1.0
    two    1.0
    Name: a, dtype: float64
    
    In [20]: df.loc['a',]
    Out[20]:
    one    1.0
    two    1.0
    Name: a, dtype: float64
    
    In [21]: #       
    
    In [22]: df.loc[['a','c'],]
    Out[22]:
       one  two
    a  1.0    1
    c  3.0    3
    
    In [23]: #     
    
    In [24]: df.loc[['a','c'],'two']
    Out[24]:
    a    1
    c    3
    Name: two, dtype: int64
    
    In [25]: #     two     
    
    In [26]:
    

    DataFrameデータの整列と欠落
    In [26]: df
    Out[26]:
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    
    In [27]: df1 = pd.DataFrame({'two':[1,2,3,4], 'one':[4,5,6,7]},
     index=['c','d','b','a'])
    
    In [28]: df1
    Out[28]:
       one  two
    c    4    1
    d    5    2
    b    6    3
    a    7    4
    
    In [29]: df+df1
    Out[29]:
       one  two
    a  8.0    5
    b  8.0    5
    c  7.0    4
    d  NaN    6
    
    In [30]: #          
    
    In [32]: df.fillna(0)
    Out[32]:
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  0.0    4
    
    In [33]: #   NaN    0
    
    In [34]: df2 = _26
    
    In [35]: df2
    Out[35]:
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    
    In [36]: df2.dropna()
    Out[36]:
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    
    In [37]: # dropna                  
    
    In [38]: df2
    Out[38]:
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    
    In [39]: import numpy as np
    
    In [40]: df2.loc['d','two'] = np.nan
    
    In [41]: df2
    Out[41]:
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  NaN
    
    In [42]: df2.loc['c','two'] = np.nan
    
    In [43]: df2
    Out[43]:
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  NaN
    d  NaN  NaN
    
    In [44]: df2.dropna(how='all')
    Out[44]:
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  NaN
    
    In [45]: # dropna   how         all ,            NaN      。    how      ‘any‘        NaN        
    
    In [46]: df2
    Out[46]:
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  NaN
    d  NaN  NaN
    
    In [47]: df1
    Out[47]:
       one  two
    c    4    1
    d    5    2
    b    6    3
    a    7    4
    
    In [48]: df3 = _
    
    In [50]: df3
    Out[50]:
       one  two
    c    4    1
    d    5    2
    b    6    3
    a    7    4
    
    In [51]: df3.loc['c','one']=np.nan
    
    In [52]: df3
    Out[52]:
       one  two
    c  NaN    1
    d  5.0    2
    b  6.0    3
    a  7.0    4
    
    In [53]: df.dropna(axis=1)
    Out[53]:
    Empty DataFrame
    Columns: []
    Index: [a, b, c, d]
    
    In [54]: df3.dropna(axis=1)
    Out[54]:
       two
    c    1
    d    2
    b    3
    a    4
    
    In [55]: df
    Out[55]:
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  NaN
    d  NaN  NaN
    
    In [56]: # dropna
    
    In [57]: # dropna     axis     0      ,    1      
    
    In [58]: