Pythonデータ解析_Pandas03_データの整理

13517 ワード

主な内容:

削除行列

修正値:apply lambda

記述統計

データの統合、補完

極端値処理

開始データ・ボックス:

In [76]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [77]: df
Out[77]:
                   A         B         C         D
2013-01-01 -0.411674  0.273549  0.629843  1.881497
2013-01-02  1.240512  0.970725  0.033099  1.553420
2013-01-03 -0.544326  0.545738 -1.325810  0.130738
2013-01-04  1.044803 -0.117151  0.874583  2.278227
2013-01-05 -2.194728 -2.536257  0.478644  0.057728
2013-01-06 -1.092031  1.249952  1.598761 -0.153423

In [98]: df2 = df.copy()
    ...: df2['E'] = ['one', 'one','two','three','four','three']
    ...: df2
    ...:
Out[98]:
                   A         B         C         D      E
2013-01-01 -0.411674  0.273549  0.629843  1.881497    one
2013-01-02  1.240512  0.970725  0.033099  1.553420    one
2013-01-03 -0.544326  0.545738 -1.325810  0.130738    two
2013-01-04  1.044803 -0.117151  0.874583  2.278227  three
2013-01-05 -2.194728 -2.536257  0.478644  0.057728   four
2013-01-06 -1.092031  1.249952  1.598761 -0.153423  three

列を削除
新しい列columnを追加

In [110]: df2["F"] = np.arange(3,9)   #           3，      3。

In [111]: df2
Out[111]:
                   A         B         C         D      E  F
2013-01-01 -0.411674  0.273549  0.629843  1.881497    ONE  3
2013-01-02  1.240512  0.970725  0.033099  1.553420    ONE  4
2013-01-03 -0.544326  0.545738 -1.325810  0.130738    TWO  5
2013-01-04  1.044803 -0.117151  0.874583  2.278227  THREE  6
2013-01-05 -2.194728 -2.536257  0.478644  0.057728   FOUR  7
2013-01-06 -1.092031  1.249952  1.598761 -0.153423  THREE  8

列名の変更

In [112]: df2.columns = list("qwerty")

In [113]: df2.columns
Out[113]: Index(['q', 'w', 'e', 'r', 't', 'y'], dtype='object')

indexの設定

pd.DataFrame.set_index(keys,drop=False, inplace=False) 
# keys : column label or list of column labels / arrays
#   drop=True，   index        。

数値の変更
Lambdaとapplyでデータ値を変更する

In [101]: df2.loc[:,"E"].unique()
Out[101]: array(['one', 'two', 'three', 'four'], dtype=object)

In [105]: df2.loc[:,"E"]=df2.E.apply(lambda x: x.upper())

In [106]: df2.loc[:,"E"].unique()
Out[106]: array(['ONE', 'TWO', 'THREE', 'FOUR'], dtype=object)

関数アプリケーション
3つのケース:

全表:pipe()

行列アプリケーション:apply()

要素レベルアプリケーション:applymap()

この部分は【functionアプリケーション部分】にあります.http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-discretization.
apply()の使用

In [34]: df.apply(np.mean)
Out[34]:
A   -0.274648
B   -0.260124
C    0.152004
D   -0.659868
dtype: float64

In [35]: df.apply(lambda x: x.max() - x.min())  #      
Out[35]:
A    3.203982
B    4.084655
C    1.984507
D    3.190727
dtype: float64

重複データのマージ

#1
a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
           index=['f', 'e', 'd', 'c', 'b', 'a'])
b = Series(np.arange(len(a), dtype=np.float64),
           index=['f', 'e', 'd', 'c', 'b', 'a'])
b[-1] = np.nan


np.where(pd.isnull(a), b, a)

#2      
b[:-2].combine_first(a[2:])

#3 
df1 = DataFrame({'a': [1., np.nan, 5., np.nan],
                 'b': [np.nan, 2., np.nan, 6.],
                 'c': range(2, 18, 4)})
df2 = DataFrame({'a': [5., 4., np.nan, 3., 7.],
                 'b': [np.nan, 3., 4., 6., 8.]})
df1.combine_first(df2)  
#  df1      NaN，   df2      。

欠落値処理

df1.dropna(how='any')       #      ，  。
df1.fillna(value=5)         #      ，9999。
pd.isnull(df1)              #   df1    nan    。

DataFrame.fillna(value=None,    # scalar, dict, Series, or DataFrame. 【   list】
                                # dict Series          。
                method=None,    # {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
                                # backfill/bfill，        ，ffill         。
                axis=None,      # {0 or ‘index’, 1 or ‘columns’}
                inplace=False,  #    ，True False。  False。
                limit=None, 
                downcast=None, 
                **kwargs)

limit : int, default None
    If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled.
downcast : dict, default is None
    a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)
Returns:    
    filled : DataFrame

1.96個の標準差以外の

グループ化後、グループ内1.96個の標準差以外の

多重パケット後..

#---        ，      ---
In [116]: States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL']
     ...: data = [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10]
     ...: idx = pd.date_range('1/1/2012', periods=10, freq='MS')
     ...: df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
     ...: df1['State'] = States
     ...:

In [118]: data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 7, 6, 6]
     ...: idx2 = pd.date_range('1/1/2013', periods=10, freq='MS')
     ...: df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
     ...: df2['State'] = States
     ...:

In [119]: df = pd.concat([df1,df2])

In [120]: df
Out[120]:
            Revenue State
2012-01-01      1.0    NY
2012-02-01      2.0    NY
2012-03-01      3.0    NY
2012-04-01      4.0    NY
2012-05-01      5.0    FL
2012-06-01      6.0    FL
2012-07-01      7.0    GA
2012-08-01      8.0    GA
2012-09-01      9.0    FL
2012-10-01     10.0    FL
2013-01-01     10.0    NY
2013-02-01     10.0    NY
2013-03-01      9.0    NY
2013-04-01      9.0    NY
2013-05-01      8.0    FL
2013-06-01      8.0    FL
2013-07-01      7.0    GA
2013-08-01      7.0    GA
2013-09-01      6.0    FL
2013-10-01      6.0    FL

方法1:超全体平均1.96 std

In [121]: newdf = df.copy()
     ...:
     ...: newdf['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
     ...: newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
     ...: newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
     ...: newdf
     ...:
Out[121]:
            Revenue State  x-Mean  1.96*std Outlier
2012-01-01      1.0    NY    5.75  5.200273    True
2012-02-01      2.0    NY    4.75  5.200273   False
2012-03-01      3.0    NY    3.75  5.200273   False
2012-04-01      4.0    NY    2.75  5.200273   False
2012-05-01      5.0    FL    1.75  5.200273   False
2012-06-01      6.0    FL    0.75  5.200273   False
2012-07-01      7.0    GA    0.25  5.200273   False
2012-08-01      8.0    GA    1.25  5.200273   False
2012-09-01      9.0    FL    2.25  5.200273   False
2012-10-01     10.0    FL    3.25  5.200273   False
2013-01-01     10.0    NY    3.25  5.200273   False
2013-02-01     10.0    NY    3.25  5.200273   False
2013-03-01      9.0    NY    2.25  5.200273   False
2013-04-01      9.0    NY    2.25  5.200273   False
2013-05-01      8.0    FL    1.25  5.200273   False
2013-06-01      8.0    FL    1.25  5.200273   False
2013-07-01      7.0    GA    0.25  5.200273   False
2013-08-01      7.0    GA    0.25  5.200273   False
2013-09-01      6.0    FL    0.75  5.200273   False
2013-10-01      6.0    FL    0.75  5.200273   False

方法2:グループ化後のスーパーグループ1.96 std

In [122]: newdf = df.copy()
     ...:
     ...: State = newdf.groupby('State')   #        。groupby.transform
     ...:
     ...: newdf['Outlier'] = State.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
     ...: newdf['x-Mean'] = State.transform( lambda x: abs(x-x.mean()) )
     ...: newdf['1.96*std'] = State.transform( lambda x: 1.96*x.std() )
     ...: newdf
     ...:
Out[122]:
            Revenue State Outlier  x-Mean  1.96*std
2012-01-01      1.0    NY   False    5.00  7.554813
2012-02-01      2.0    NY   False    4.00  7.554813
2012-03-01      3.0    NY   False    3.00  7.554813
2012-04-01      4.0    NY   False    2.00  7.554813
2012-05-01      5.0    FL   False    2.25  3.434996
2012-06-01      6.0    FL   False    1.25  3.434996
2012-07-01      7.0    GA   False    0.25  0.980000
2012-08-01      8.0    GA   False    0.75  0.980000
2012-09-01      9.0    FL   False    1.75  3.434996
2012-10-01     10.0    FL   False    2.75  3.434996
2013-01-01     10.0    NY   False    4.00  7.554813
2013-02-01     10.0    NY   False    4.00  7.554813
2013-03-01      9.0    NY   False    3.00  7.554813
2013-04-01      9.0    NY   False    3.00  7.554813
2013-05-01      8.0    FL   False    0.75  3.434996
2013-06-01      8.0    FL   False    0.75  3.434996
2013-07-01      7.0    GA   False    0.25  0.980000
2013-08-01      7.0    GA   False    0.25  0.980000
2013-09-01      6.0    FL   False    1.25  3.434996
2013-10-01      6.0    FL   False    1.25  3.434996

#---      ---
newdf = df.copy()

State = newdf.groupby('State')

def s(group):
    group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()  
    group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
    return group

Newdf2 = State.apply(s)
Newdf2

方法3:多重グループ化後、1.96

In [123]: newdf = df.copy()
     ...:
     ...: StateMonth = newdf.groupby(['State', lambda x: x.month])
     ...:
     ...: newdf['Outlier'] = StateMonth.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
     ...: newdf['x-Mean'] = StateMonth.transform( lambda x: abs(x-x.mean()) )
     ...: newdf['1.96*std'] = StateMonth.transform( lambda x: 1.96*x.std() )
     ...: newdf
     ...:
Out[123]:
            Revenue State Outlier  x-Mean   1.96*std
2012-01-01      1.0    NY   False     4.5  12.473364
2012-02-01      2.0    NY   False     4.0  11.087434
2012-03-01      3.0    NY   False     3.0   8.315576
2012-04-01      4.0    NY   False     2.5   6.929646
2012-05-01      5.0    FL   False     1.5   4.157788
2012-06-01      6.0    FL   False     1.0   2.771859
2012-07-01      7.0    GA   False     0.0   0.000000
2012-08-01      8.0    GA   False     0.5   1.385929
2012-09-01      9.0    FL   False     1.5   4.157788
2012-10-01     10.0    FL   False     2.0   5.543717
2013-01-01     10.0    NY   False     4.5  12.473364
2013-02-01     10.0    NY   False     4.0  11.087434
2013-03-01      9.0    NY   False     3.0   8.315576
2013-04-01      9.0    NY   False     2.5   6.929646
2013-05-01      8.0    FL   False     1.5   4.157788
2013-06-01      8.0    FL   False     1.0   2.771859
2013-07-01      7.0    GA   False     0.0   0.000000
2013-08-01      7.0    GA   False     0.5   1.385929
2013-09-01      6.0    FL   False     1.5   4.157788
2013-10-01      6.0    FL   False     2.0   5.543717

In [124]: StateMonth
Out[124]:

方法4:四分位数.非Gauss分布/正規分布のデータについて.

# make a copy of original df
In [126]: newdf = df.copy()
     ...:
     ...: State = newdf.groupby('State')
     ...:
     ...: newdf['Lower'] = State['Revenue'].transform( lambda x: x.quantile(q=.25) - (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
     ...: newdf['Upper'] = State['Revenue'].transform( lambda x: x.quantile(q=.75) + (1.5*(x.quantile(q=.75)-x.quantile(q=.25))) )
     ...: newdf['Outlier'] = (newdf['Revenue'] < newdf['Lower']) | (newdf['Revenue'] > newdf['Upper'])
     ...: newdf
     ...:
Out[126]:
            Revenue State  Lower   Upper Outlier
2012-01-01      1.0    NY -7.000  19.000   False
2012-02-01      2.0    NY -7.000  19.000   False
2012-03-01      3.0    NY -7.000  19.000   False
2012-04-01      4.0    NY -7.000  19.000   False
2012-05-01      5.0    FL  2.625  11.625   False
2012-06-01      6.0    FL  2.625  11.625   False
2012-07-01      7.0    GA  6.625   7.625   False
2012-08-01      8.0    GA  6.625   7.625    True
2012-09-01      9.0    FL  2.625  11.625   False
2012-10-01     10.0    FL  2.625  11.625   False
2013-01-01     10.0    NY -7.000  19.000   False
2013-02-01     10.0    NY -7.000  19.000   False
2013-03-01      9.0    NY -7.000  19.000   False
2013-04-01      9.0    NY -7.000  19.000   False
2013-05-01      8.0    FL  2.625  11.625   False
2013-06-01      8.0    FL  2.625  11.625   False
2013-07-01      7.0    GA  6.625   7.625   False
2013-08-01      7.0    GA  6.625   7.625   False
2013-09-01      6.0    FL  2.625  11.625   False
2013-10-01      6.0    FL  2.625  11.625   False

Linux_CentOS7_lnmp環境構築プロセス記録

C#のDicをLuaのTable&&C#のListをLuaのTableにする