Pythonデータ解析_Pandas03_データの整理

13517 ワード

主な内容:
  • 削除行列
  • 修正値:apply lambda
  • 記述統計
  • データの統合、補完
  • 極端値処理
  • 開始データ・ボックス:
    In [76]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
    
    In [77]: df
    Out[77]:
                       A         B         C         D
    2013-01-01 -0.411674  0.273549  0.629843  1.881497
    2013-01-02  1.240512  0.970725  0.033099  1.553420
    2013-01-03 -0.544326  0.545738 -1.325810  0.130738
    2013-01-04  1.044803 -0.117151  0.874583  2.278227
    2013-01-05 -2.194728 -2.536257  0.478644  0.057728
    2013-01-06 -1.092031  1.249952  1.598761 -0.153423
    
    In [98]: df2 = df.copy()
        ...: df2['E'] = ['one', 'one','two','three','four','three']
        ...: df2
        ...:
    Out[98]:
                       A         B         C         D      E
    2013-01-01 -0.411674  0.273549  0.629843  1.881497    one
    2013-01-02  1.240512  0.970725  0.033099  1.553420    one
    2013-01-03 -0.544326  0.545738 -1.325810  0.130738    two
    2013-01-04  1.044803 -0.117151  0.874583  2.278227  three
    2013-01-05 -2.194728 -2.536257  0.478644  0.057728   four
    2013-01-06 -1.092031  1.249952  1.598761 -0.153423  three
    

    列を削除
    新しい列columnを追加
    In [110]: df2["F"] = np.arange(3,9)   #           3,      3。
    
    In [111]: df2
    Out[111]:
                       A         B         C         D      E  F
    2013-01-01 -0.411674  0.273549  0.629843  1.881497    ONE  3
    2013-01-02  1.240512  0.970725  0.033099  1.553420    ONE  4
    2013-01-03 -0.544326  0.545738 -1.325810  0.130738    TWO  5
    2013-01-04  1.044803 -0.117151  0.874583  2.278227  THREE  6
    2013-01-05 -2.194728 -2.536257  0.478644  0.057728   FOUR  7
    2013-01-06 -1.092031  1.249952  1.598761 -0.153423  THREE  8
    

    列名の変更
    In [112]: df2.columns = list("qwerty")
    
    In [113]: df2.columns
    Out[113]: Index(['q', 'w', 'e', 'r', 't', 'y'], dtype='object')
    

    indexの設定
    pd.DataFrame.set_index(keys,drop=False, inplace=False) 
    # keys : column label or list of column labels / arrays
    #   drop=True,   index        。
    

    数値の変更
    Lambdaとapplyでデータ値を変更する
    In [101]: df2.loc[:,"E"].unique()
    Out[101]: array(['one', 'two', 'three', 'four'], dtype=object)
    
    In [105]: df2.loc[:,"E"]=df2.E.apply(lambda x: x.upper())
    
    In [106]: df2.loc[:,"E"].unique()
    Out[106]: array(['ONE', 'TWO', 'THREE', 'FOUR'], dtype=object)
    

    関数アプリケーション
    3つのケース:
  • 全表:pipe()
  • 行列アプリケーション:apply()
  • 要素レベルアプリケーション:applymap()
  • この部分は【functionアプリケーション部分】にあります.http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-discretization.
    apply()の使用
    In [34]: df.apply(np.mean)
    Out[34]:
    A   -0.274648
    B   -0.260124
    C    0.152004
    D   -0.659868
    dtype: float64
    
    In [35]: df.apply(lambda x: x.max() - x.min())  #      
    Out[35]:
    A    3.203982
    B    4.084655
    C    1.984507
    D    3.190727
    dtype: float64
    

    重複データのマージ
    #1
    a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
               index=['f', 'e', 'd', 'c', 'b', 'a'])
    b = Series(np.arange(len(a), dtype=np.float64),
               index=['f', 'e', 'd', 'c', 'b', 'a'])
    b[-1] = np.nan
    
    
    np.where(pd.isnull(a), b, a)
    
    #2      
    b[:-2].combine_first(a[2:])
    
    #3 
    df1 = DataFrame({'a': [1., np.nan, 5., np.nan],
                     'b': [np.nan, 2., np.nan, 6.],
                     'c': range(2, 18, 4)})
    df2 = DataFrame({'a': [5., 4., np.nan, 3., 7.],
                     'b': [np.nan, 3., 4., 6., 8.]})
    df1.combine_first(df2)  
    #  df1      NaN,   df2      。
    
    

    欠落値処理
    df1.dropna(how='any')       #      ,  。
    df1.fillna(value=5)         #      ,9999。
    pd.isnull(df1)              #   df1    nan    。
    
    DataFrame.fillna(value=None,    # scalar, dict, Series, or DataFrame. 【   list】
                                    # dict Series          。
                    method=None,    # {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
                                    # backfill/bfill,        ,ffill         。
                    axis=None,      # {0 or ‘index’, 1 or ‘columns’}
                    inplace=False,  #    ,True False。  False。
                    limit=None, 
                    downcast=None, 
                    **kwargs)
    
    limit : int, default None
        If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.
    downcast : dict, default is None
        a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)
    Returns:    
        filled : DataFrame
    
  • 1.96個の標準差以外の
  • グループ化後、グループ内1.96個の標準差以外の
  • 多重パケット後..
  • #---        ,      ---
    In [116]: States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL']
         ...: data = [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10]
         ...: idx = pd.date_range('1/1/2012', periods=10, freq='MS')
         ...: df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
         ...: df1['State'] = States
         ...:
    
    In [118]: data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 7, 6, 6]
         ...: idx2 = pd.date_range('1/1/2013', periods=10, freq='MS')
         ...: df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
         ...: df2['State'] = States
         ...:
    
    In [119]: df = pd.concat([df1,df2])
    
    In [120]: df
    Out[120]:
                Revenue State
    2012-01-01      1.0    NY
    2012-02-01      2.0    NY
    2012-03-01      3.0    NY
    2012-04-01      4.0    NY
    2012-05-01      5.0    FL
    2012-06-01      6.0    FL
    2012-07-01      7.0    GA
    2012-08-01      8.0    GA
    2012-09-01      9.0    FL
    2012-10-01     10.0    FL
    2013-01-01     10.0    NY
    2013-02-01     10.0    NY
    2013-03-01      9.0    NY
    2013-04-01      9.0    NY
    2013-05-01      8.0    FL
    2013-06-01      8.0    FL
    2013-07-01      7.0    GA
    2013-08-01      7.0    GA
    2013-09-01      6.0    FL
    2013-10-01      6.0    FL
    

    方法1:超全体平均1.96 std
    In [121]: newdf = df.copy()
         ...:
         ...: newdf['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
         ...: newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
         ...: newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
         ...: newdf
         ...:
    Out[121]:
                Revenue State  x-Mean  1.96*std Outlier
    2012-01-01      1.0    NY    5.75  5.200273    True
    2012-02-01      2.0    NY    4.75  5.200273   False
    2012-03-01      3.0    NY    3.75  5.200273   False
    2012-04-01      4.0    NY    2.75  5.200273   False
    2012-05-01      5.0    FL    1.75  5.200273   False
    2012-06-01      6.0    FL    0.75  5.200273   False
    2012-07-01      7.0    GA    0.25  5.200273   False
    2012-08-01      8.0    GA    1.25  5.200273   False
    2012-09-01      9.0    FL    2.25  5.200273   False
    2012-10-01     10.0    FL    3.25  5.200273   False
    2013-01-01     10.0    NY    3.25  5.200273   False
    2013-02-01     10.0    NY    3.25  5.200273   False
    2013-03-01      9.0    NY    2.25  5.200273   False
    2013-04-01      9.0    NY    2.25  5.200273   False
    2013-05-01      8.0    FL    1.25  5.200273   False
    2013-06-01      8.0    FL    1.25  5.200273   False
    2013-07-01      7.0    GA    0.25  5.200273   False
    2013-08-01      7.0    GA    0.25  5.200273   False
    2013-09-01      6.0    FL    0.75  5.200273   False
    2013-10-01      6.0    FL    0.75  5.200273   False
    

    方法2:グループ化後のスーパーグループ1.96 std
    In [122]: newdf = df.copy()
         ...:
         ...: State = newdf.groupby('State')   #        。groupby.transform
         ...:
         ...: newdf['Outlier'] = State.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
         ...: newdf['x-Mean'] = State.transform( lambda x: abs(x-x.mean()) )
         ...: newdf['1.96*std'] = State.transform( lambda x: 1.96*x.std() )
         ...: newdf
         ...:
    Out[122]:
                Revenue State Outlier  x-Mean  1.96*std
    2012-01-01      1.0    NY   False    5.00  7.554813
    2012-02-01      2.0    NY   False    4.00  7.554813
    2012-03-01      3.0    NY   False    3.00  7.554813
    2012-04-01      4.0    NY   False    2.00  7.554813
    2012-05-01      5.0    FL   False    2.25  3.434996
    2012-06-01      6.0    FL   False    1.25  3.434996
    2012-07-01      7.0    GA   False    0.25  0.980000
    2012-08-01      8.0    GA   False    0.75  0.980000
    2012-09-01      9.0    FL   False    1.75  3.434996
    2012-10-01     10.0    FL   False    2.75  3.434996
    2013-01-01     10.0    NY   False    4.00  7.554813
    2013-02-01     10.0    NY   False    4.00  7.554813
    2013-03-01      9.0    NY   False    3.00  7.554813
    2013-04-01      9.0    NY   False    3.00  7.554813
    2013-05-01      8.0    FL   False    0.75  3.434996
    2013-06-01      8.0    FL   False    0.75  3.434996
    2013-07-01      7.0    GA   False    0.25  0.980000
    2013-08-01      7.0    GA   False    0.25  0.980000
    2013-09-01      6.0    FL   False    1.25  3.434996
    2013-10-01      6.0    FL   False    1.25  3.434996
    
    #---      ---
    newdf = df.copy()
    
    State = newdf.groupby('State')
    
    def s(group):
        group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
        group['1.96*std'] = 1.96*group['Revenue'].std()  
        group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
        return group
    
    Newdf2 = State.apply(s)
    Newdf2
    

    方法3:多重グループ化後、1.96
    In [123]: newdf = df.copy()
         ...:
         ...: StateMonth = newdf.groupby(['State', lambda x: x.month])
         ...:
         ...: newdf['Outlier'] = StateMonth.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
         ...: newdf['x-Mean'] = StateMonth.transform( lambda x: abs(x-x.mean()) )
         ...: newdf['1.96*std'] = StateMonth.transform( lambda x: 1.96*x.std() )
         ...: newdf
         ...:
    Out[123]:
                Revenue State Outlier  x-Mean   1.96*std
    2012-01-01      1.0    NY   False     4.5  12.473364
    2012-02-01      2.0    NY   False     4.0  11.087434
    2012-03-01      3.0    NY   False     3.0   8.315576
    2012-04-01      4.0    NY   False     2.5   6.929646
    2012-05-01      5.0    FL   False     1.5   4.157788
    2012-06-01      6.0    FL   False     1.0   2.771859
    2012-07-01      7.0    GA   False     0.0   0.000000
    2012-08-01      8.0    GA   False     0.5   1.385929
    2012-09-01      9.0    FL   False     1.5   4.157788
    2012-10-01     10.0    FL   False     2.0   5.543717
    2013-01-01     10.0    NY   False     4.5  12.473364
    2013-02-01     10.0    NY   False     4.0  11.087434
    2013-03-01      9.0    NY   False     3.0   8.315576
    2013-04-01      9.0    NY   False     2.5   6.929646
    2013-05-01      8.0    FL   False     1.5   4.157788
    2013-06-01      8.0    FL   False     1.0   2.771859
    2013-07-01      7.0    GA   False     0.0   0.000000
    2013-08-01      7.0    GA   False     0.5   1.385929
    2013-09-01      6.0    FL   False     1.5   4.157788
    2013-10-01      6.0    FL   False     2.0   5.543717
    
    In [124]: StateMonth
    Out[124]: 
    

    方法4:四分位数.非Gauss分布/正規分布のデータについて.
    # make a copy of original df
    In [126]: newdf = df.copy()
         ...:
         ...: State = newdf.groupby('State')
         ...:
         ...: newdf['Lower'] = State['Revenue'].transform( lambda x: x.quantile(q=.25) - (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
         ...: newdf['Upper'] = State['Revenue'].transform( lambda x: x.quantile(q=.75) + (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
         ...: newdf['Outlier'] = (newdf['Revenue'] < newdf['Lower']) | (newdf['Revenue'] > newdf['Upper'])
         ...: newdf
         ...:
    Out[126]:
                Revenue State  Lower   Upper Outlier
    2012-01-01      1.0    NY -7.000  19.000   False
    2012-02-01      2.0    NY -7.000  19.000   False
    2012-03-01      3.0    NY -7.000  19.000   False
    2012-04-01      4.0    NY -7.000  19.000   False
    2012-05-01      5.0    FL  2.625  11.625   False
    2012-06-01      6.0    FL  2.625  11.625   False
    2012-07-01      7.0    GA  6.625   7.625   False
    2012-08-01      8.0    GA  6.625   7.625    True
    2012-09-01      9.0    FL  2.625  11.625   False
    2012-10-01     10.0    FL  2.625  11.625   False
    2013-01-01     10.0    NY -7.000  19.000   False
    2013-02-01     10.0    NY -7.000  19.000   False
    2013-03-01      9.0    NY -7.000  19.000   False
    2013-04-01      9.0    NY -7.000  19.000   False
    2013-05-01      8.0    FL  2.625  11.625   False
    2013-06-01      8.0    FL  2.625  11.625   False
    2013-07-01      7.0    GA  6.625   7.625   False
    2013-08-01      7.0    GA  6.625   7.625   False
    2013-09-01      6.0    FL  2.625  11.625   False
    2013-10-01      6.0    FL  2.625  11.625   False