Alice: Data Analysis (4)

Pandas


Installing the module

----console-----
pip install pandas

----python-----
import pandas as pd
import numpy as np

DataFrame and Series


In general, a DataFrame is the whole table of data and a Series is a single column of it. A Series has an index but no column labels of its own.
a=pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
b=pd.Series([1, 2, 3, 4, 5])
print(a)
print(b)
                     Bob           Sue
Product A    I liked it.  Pretty good.
Product B  It was awful.        Bland.
0    1
1    2
2    3
3    4
4    5
dtype: int64
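
To make the relationship concrete, here is a minimal sketch that pulls one column out of the same toy data and checks that the result is a Series sharing the DataFrame's row index. The variable names `reviews` and `bob` are illustrative only.

```python
import pandas as pd

# Same toy data as in the example above.
reviews = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'])

# Selecting a single column yields a Series that keeps the row index.
bob = reviews['Bob']
print(type(bob).__name__)   # Series
print(list(bob.index))      # ['Product A', 'Product B']
print(bob.name)             # Bob
```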

Reading a file

df = pd.read_csv("test.csv")
print(df)
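
test.csv itself is not included in these notes. For experimenting without the file, a sketch like the following reads CSV text from memory through `io.StringIO`. The three rows are made-up sample data, not the actual contents of test.csv.

```python
import io
import pandas as pd

# Made-up rows standing in for test.csv (an assumption for illustration).
csv_text = """country,points,price
Italy,87,
Portugal,87,15.0
US,90,14.0
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)                   # (3, 3)
print(sample['price'].isna().sum())   # 1 -> the empty field was read as NaN
```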

Inspecting a DataFrame


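df.head(n) returns only the first n rows (5 when n is omitted). df.shape gives the (rows, columns) pair, and df.info() prints a per-column summary of non-null counts and dtypes. Because df.info() prints its report itself and returns None, wrapping it in print adds a trailing None to the output.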
df = pd.read_csv("test.csv")
print(df.head())
print(df.shape)
print(df.info())
    country                                        description  ...         variety               winery
0     Italy  Aromas include tropical fruit, broom, brimston...  ...     White Blend              Nicosia
1  Portugal  This is ripe and fruity, a wine that is smooth...  ...  Portuguese Red  Quinta dos Avidagos
2        US  Tart and snappy, the flavors of lime flesh and...  ...      Pinot Gris            Rainstorm
3        US  Pineapple rind, lemon pith and orange blossom ...  ...        Riesling           St. Julian
4        US  Much like the regular bottling from 2012, this...  ...      Pinot Noir         Sweet Cheeks

[5 rows x 13 columns]
(10000, 13)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   country                9995 non-null   object
 1   description            10000 non-null  object
 2   designation            7092 non-null   object
 3   points                 10000 non-null  int64
 4   price                  9315 non-null   float64
 5   province               9995 non-null   object
 6   region_1               8307 non-null   object
 7   region_2               3899 non-null   object
 8   taster_name            8018 non-null   object
 9   taster_twitter_handle  7663 non-null   object
 10  title                  10000 non-null  object
 11  variety                10000 non-null  object
 12  winery                 10000 non-null  object
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB
None

Indexing


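df.iloc indexes by integer position, df.loc indexes by label (row index values and column names), and plain df[] accesses columns by name. Selecting a single column returns a Series, and selecting multiple columns returns a DataFrame. Note that a loc slice includes its end label, so df.loc[:4] returns five rows, whereas [:4] and iloc[:4] return four.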
df = pd.read_csv("test.csv")
print(df['country'][:4])   
print(df.iloc[:4,1:4])    
print(df.loc[:4,'country']) 
0       Italy
1    Portugal
2          US
3          US
Name: country, dtype: object
                                         description           designation  points
0  Aromas include tropical fruit, broom, brimston...          Vulkà Bianco      87
1  This is ripe and fruity, a wine that is smooth...              Avidagos      87
2  Tart and snappy, the flavors of lime flesh and...                   NaN      87
3  Pineapple rind, lemon pith and orange blossom ...  Reserve Late Harvest      87
0       Italy
1    Portugal
2          US
3          US
4          US
Name: country, dtype: object
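
The difference in slice endpoints is easy to miss, so here is a small self-contained check. The `toy` frame is made-up data with a default integer index.

```python
import pandas as pd

# Made-up frame with a default integer index 0..5.
toy = pd.DataFrame({'country': ['Italy', 'Portugal', 'US', 'US', 'US', 'Spain'],
                    'points': [87, 87, 87, 87, 87, 87]})

by_position = toy.iloc[:4, 0]       # positions 0,1,2,3 -> end excluded
by_label = toy.loc[:4, 'country']   # labels 0..4 -> end included

print(len(by_position))   # 4
print(len(by_label))      # 5
```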

Indexing that returns a DataFrame


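Wrapping the column selector passed to loc in an extra pair of brackets returns a DataFrame instead of a Series. Selecting multiple columns requires the list form [column1, column2] in any case, so that result is always a DataFrame.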
df = pd.read_csv("test.csv")
print(df.loc[:4,['country']])
    country
0     Italy
1  Portugal
2        US
3        US
4        US
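
A quick way to confirm which form you get back is to check the type of the result. This is a sketch on made-up data, not on test.csv.

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Italy', 'US'], 'points': [87, 90]})

one_col_series = toy.loc[:, 'country']         # scalar label -> Series
one_col_frame = toy.loc[:, ['country']]        # list of one label -> DataFrame
two_cols = toy.loc[:, ['country', 'points']]   # list of labels -> DataFrame

print(type(one_col_series).__name__)   # Series
print(type(one_col_frame).__name__)    # DataFrame
print(two_cols.shape)                  # (2, 2)
```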

Replacing column names


Assigning a new list to data.columns changes the column names in place.
data=pd.read_csv('testfile.csv')
print(data.head())
print(data.columns)
data.columns=[1,2,3,4,5]
print(data.head())
print(data.columns)
  name  class  math  english  korean
0    A      1    96       90      95
1    B      1    95       66      71
2    C      2    91       89      92
3    D      1    92       83      87
4    E      2    93       84      95
Index(['name', 'class', 'math', 'english', 'korean'], dtype='object')
   1  2   3   4   5
0  A  1  96  90  95
1  B  1  95  66  71
2  C  2  91  89  92
3  D  1  92  83  87
4  E  2  93  84  95
Int64Index([1, 2, 3, 4, 5], dtype='int64')
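
One thing to watch with direct assignment is that the new list must have exactly one name per column. The sketch below uses a made-up two-column frame to show both the successful case and the error raised on a length mismatch.

```python
import pandas as pd

scores = pd.DataFrame({'name': ['A', 'B'], 'math': [96, 95]})

scores.columns = ['student', 'math_score']   # same length: names change in place
print(list(scores.columns))                  # ['student', 'math_score']

try:
    scores.columns = ['only_one']            # wrong length
except ValueError as err:
    mismatch = True
    print('ValueError:', err)
else:
    mismatch = False
```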

Renaming columns with rename


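With rename, the inplace option selects whether a new DataFrame is returned (inplace=False, the default) or the existing one is modified (inplace=True). Columns that are not listed in the mapping keep their names.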
test_data = pd.read_csv('testfile.csv')
print(test_data.head())
print(test_data.columns)
test_data.rename(columns={'name':'이름','class':'학급명','math':'수학','english':'국어'},inplace=True)
print(test_data.head())
  name  class  math  english  korean
0    A      1    96       90      95
1    B      1    95       66      71
2    C      2    91       89      92
3    D      1    92       83      87
4    E      2    93       84      95
Index(['name', 'class', 'math', 'english', 'korean'], dtype='object')
  이름  학급명  수학  국어  korean
0  A    1  96  90      95
1  B    1  95  66      71
2  C    2  91  89      92
3  D    1  92  83      87
4  E    2  93  84      95
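
The two inplace modes can be compared side by side. The following is a minimal sketch on made-up data, and the new column names are arbitrary.

```python
import pandas as pd

scores = pd.DataFrame({'name': ['A', 'B'], 'math': [96, 95]})

# Default inplace=False: a renamed copy is returned, the original is untouched.
renamed = scores.rename(columns={'math': 'math_score'})
print(list(scores.columns))    # ['name', 'math']
print(list(renamed.columns))   # ['name', 'math_score']

# inplace=True modifies the object itself and returns None.
result = scores.rename(columns={'name': 'student'}, inplace=True)
print(result)                  # None
print(list(scores.columns))    # ['student', 'math']
```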

Conditional expressions


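You can build a boolean mask with the isin method or with a conditional expression.
Multiple conditions can be combined with logical operators, for example (df.country=='Italy')&(df.points>=90).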
df = pd.read_csv("test.csv")
print(df.country=='Italy')
print(df['points'].isin([89,91]))
print(df.loc[df.country=='Italy',['country']].head(5))
print(df[df['points'].isin([89,91])].head(5))
0        True
1       False
2       False
3       False
4       False
        ...
9995    False
9996    False
9997    False
9998    False
9999    False
Name: country, Length: 10000, dtype: bool
0       False
1       False
2       False
3       False
4       False
        ...
9995     True
9996     True
9997     True
9998     True
9999     True
Name: points, Length: 10000, dtype: bool
   country
0    Italy
6    Italy
13   Italy
22   Italy
24   Italy
          country                                        description  ...                   variety        winery
125  South Africa  Etienne Le Riche is a total Cabernet specialis...  ...        Cabernet Sauvignon      Le Riche   
126        France  Mid-gold color. Pronounced and enticing aromas...  ...            Gewürztraminer  Pierre Sparr   
127        France  Attractive mid-gold color with intense aromas ...  ...               White Blend  Pierre Sparr   
128        France  Compelling minerality on the nose, Refined and...  ...               Pinot Blanc    Kuentz-Bas   
129  South Africa  A big, black bruiser of a wine that has black ...  ...  Bordeaux-style Red Blend     Camberley   

[5 rows x 13 columns]
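
Combining conditions is not shown in the code above, so here is a minimal sketch on made-up data. The column names mirror the wine dataset, but the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Italy', 'Italy', 'US', 'France'],
                    'points': [92, 87, 95, 90]})

# Each comparison needs its own parentheses because & binds tighter than == and >=.
italy_90 = toy[(toy.country == 'Italy') & (toy.points >= 90)]
print(italy_90.index.tolist())   # [0]

# | is "or" and ~ is "not".
not_us = toy[~toy.country.isin(['US'])]
print(not_us.index.tolist())     # [0, 1, 3]
```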

df[column].value_counts()


Counts how many times each value occurs and returns the counts as a Series, sorted from most to least frequent.
df = pd.read_csv("test.csv")
print(df['country'].value_counts())
print(type(df['country'].value_counts()))
US                4181
France            1588
Italy             1546
Spain              493
Portugal           468
Chile              399
Argentina          316
Austria            243
Australia          192
Germany            157
South Africa       132
New Zealand        111
Greece              32
Israel              29
Canada              20
Romania             17
Hungary             12
Bulgaria             8
Turkey               7
Uruguay              5
Mexico               5
Czech Republic       4
Croatia              4
Lebanon              4
Slovenia             4
Moldova              4
England              3
Georgia              2
Brazil               2
India                1
Cyprus               1
Serbia               1
Peru                 1
Morocco              1
Luxembourg           1
Armenia              1
Name: country, dtype: int64
<class 'pandas.core.series.Series'>
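
value_counts also takes a couple of options that are often useful. The sketch below, on a made-up Series, shows `dropna` and `normalize`.

```python
import numpy as np
import pandas as pd

countries = pd.Series(['US', 'US', 'Italy', np.nan])

counts = countries.value_counts()                 # NaN is excluded by default
with_nan = countries.value_counts(dropna=False)   # count NaN as its own entry
shares = countries.value_counts(normalize=True)   # fractions instead of counts

print(counts['US'])             # 2
print(len(with_nan))            # 3
print(round(shares['US'], 3))   # 0.667
```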

df[column].unique()


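Returns the distinct values as an array, in order of first appearance. NaN is included if the column contains it.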
df = pd.read_csv("test.csv")
print(df['country'].unique())
['Italy' 'Portugal' 'US' 'Spain' 'France' 'Germany' 'Argentina' 'Chile'
 'Australia' 'Austria' 'South Africa' 'New Zealand' 'Israel' 'Hungary'
 'Greece' 'Romania' 'Mexico' 'Canada' nan 'Turkey' 'Czech Republic'
 'Slovenia' 'Luxembourg' 'Croatia' 'Georgia' 'Uruguay' 'England' 'Lebanon'
 'Serbia' 'Brazil' 'Moldova' 'Morocco' 'Peru' 'India' 'Bulgaria' 'Cyprus'
 'Armenia']
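
If only the number of distinct values is needed, `nunique` is a related method. Note that the two treat NaN differently, as this sketch on made-up data shows.

```python
import numpy as np
import pandas as pd

countries = pd.Series(['Italy', 'US', np.nan, 'US'])

distinct = countries.unique()   # order of first appearance, NaN kept
print(len(distinct))            # 3
print(countries.nunique())      # 2 -> NaN is not counted by default
```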

df[column].mean()


Returns the mean of the specified column.
df = pd.read_csv("test.csv")
print(df['points'].mean())

Arithmetic on Series


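With the plain + operator, any index label that is not present in both Series produces NaN.
The add method accepts fill_value, which substitutes that value for the missing side before adding.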
df = pd.read_csv("test.csv")
A=pd.Series([2,4,6],index=[0,1,2])
B=pd.Series([1,3,5],index=[1,2,3])
print(A+B)
print(A.add(B,fill_value=0))
0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64

Arithmetic on DataFrames


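As with Series, the add method with fill_value simplifies handling of the NaN values that appear when row or column labels do not match.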
df = pd.read_csv("test.csv")
A=pd.DataFrame(np.random.randint(0,10,(2,2)),columns=list('AB'))
B=pd.DataFrame(np.random.randint(0,10,(3,3)),columns=list('BAC'))
print(A+B)
print(A.add(B,fill_value=0))
     A    B   C
0  2.0  1.0 NaN
1  9.0  6.0 NaN
2  NaN  NaN NaN
     A    B    C
0  2.0  1.0  3.0
1  9.0  6.0  2.0
2  4.0  8.0  4.0
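
Because the example above uses random numbers, its output changes on every run. The sketch below repeats the same idea with fixed, made-up values so the alignment by column name can be checked by hand.

```python
import pandas as pd

left = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
right = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]],
                     columns=list('BAC'))

plain = left + right                      # unmatched cells become NaN
filled = left.add(right, fill_value=0)    # missing side treated as 0

print(plain.loc[0, 'A'])          # 21.0 (1 + 20: aligned by column name, not position)
print(plain.isna().sum().sum())   # 5
print(filled.loc[2, 'C'])         # 90.0
```

Values are matched on both the row index and the column name, so column A of `left` is added to column A of `right` even though it sits in a different position.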

Aggregate functions

data={
    'A':[i+5 for i in range(3)],
    'B':[i**2 for i in range(3)]
    }
df=pd.DataFrame(data)
print(df['A'].sum())
print(df.sum())
print(df.mean())
18
A    18
B     5
dtype: int64
A    6.000000
B    1.666667
dtype: float64
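
Several aggregates can also be computed in one call with `agg`, and `axis=1` aggregates across columns instead of down them. This is a small sketch using the same values as the example above.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [5, 6, 7], 'B': [0, 1, 4]})

# One row per aggregate, one column per original column.
summary = df_small.agg(['sum', 'mean', 'max'])
print(summary)

# axis=1 sums each row across columns.
row_totals = df_small.sum(axis=1)
print(row_totals.tolist())   # [5, 7, 11]
```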

df.dropna and df.fillna


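dropna removes rows that contain NaN, and fillna replaces NaN with the value you supply. Neither works in place by default, so assign the result.
df_clean = df.dropna()   # returns a new DataFrame; df itself is unchanged
# '전화번호' is a column named "phone number"; NaN is filled with '전화번호없음' ("no phone number")
df['전화번호']=df['전화번호'].fillna('전화번호없음')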
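
The two lines above assume a DataFrame that is not shown. A self-contained version on made-up contact data (English column names are used here for illustration) might look like this:

```python
import numpy as np
import pandas as pd

contacts = pd.DataFrame({'name': ['A', 'B', 'C'],
                         'phone': ['010-1111', np.nan, '010-3333']})

dropped = contacts.dropna()          # new frame without the NaN row
print(len(dropped), len(contacts))   # 2 3

contacts['phone'] = contacts['phone'].fillna('no phone number')
print(contacts['phone'].tolist())    # ['010-1111', 'no phone number', '010-3333']
```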

df[column].map and df.apply

df = pd.read_csv("test.csv")
print(df['points'].map(lambda x:x-df['points'].mean()))
print(df.apply(lambda x:x['points']-df['points'].mean(),axis=1))
0      -1.3957
1      -1.3957
2      -1.3957
3      -1.3957
4      -1.3957
         ...
9995    0.6043
9996    0.6043
9997    0.6043
9998    2.6043
9999    2.6043
Name: points, Length: 10000, dtype: float64
0      -1.3957
1      -1.3957
2      -1.3957
3      -1.3957
4      -1.3957
         ...
9995    0.6043
9996    0.6043
9997    0.6043
9998    2.6043
9999    2.6043
Length: 10000, dtype: float64
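
Both calls above recompute `df['points'].mean()` inside the lambda for every element. A hedged alternative, sketched here on made-up data, computes the mean once. For simple arithmetic like this, plain vectorized subtraction gives the same result without a lambda at all.

```python
import pandas as pd

points = pd.DataFrame({'points': [87, 90, 93]})
mean_points = points['points'].mean()   # computed once, outside the lambda

via_map = points['points'].map(lambda x: x - mean_points)                   # element-wise on a Series
via_apply = points.apply(lambda row: row['points'] - mean_points, axis=1)   # row-wise on a DataFrame
vectorized = points['points'] - mean_points                                 # no Python-level loop

print(via_map.tolist())   # [-3.0, 0.0, 3.0]
```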

groupby


Groups the rows by the values of the specified column, so that an aggregate can be computed for each group.
df = pd.read_csv("test.csv")
print(df.groupby('country').points.count())
print(df.groupby('country').points.value_counts())
country
Argentina          316
Armenia              1
Australia          192
Austria            243
Brazil               2
Bulgaria             8
Canada              20
Chile              399
Croatia              4
Cyprus               1
Czech Republic       4
England              3
France            1588
Georgia              2
Germany            157
Greece              32
Hungary             12
India                1
Israel              29
Italy             1546
Lebanon              4
Luxembourg           1
Mexico               5
Moldova              4
Morocco              1
New Zealand        111
Peru                 1
Portugal           468
Romania             17
Serbia               1
Slovenia             4
South Africa       132
Spain              493
Turkey               7
US                4181
Uruguay              5
Name: points, dtype: int64
country    points
Argentina  87        46
           88        42
           85        37
           83        27
           84        27
                     ..
US         98         2
           99         2
Uruguay    86         2
           90         2
           88         1
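
Beyond counting, a grouped column can be summarized with several aggregates at once through `agg`. The following sketch uses a small made-up frame rather than test.csv.

```python
import pandas as pd

wines = pd.DataFrame({'country': ['US', 'US', 'Italy', 'Italy', 'Italy'],
                      'points': [90, 88, 87, 91, 89]})

# One row per country, one column per aggregate.
stats = wines.groupby('country')['points'].agg(['count', 'mean', 'max'])
print(stats.loc['Italy', 'count'])   # 3
print(stats.loc['US', 'mean'])       # 89.0
```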

Sorting


With ascending=True the sort is ascending, and with ascending=False it is descending. Passing a list to by sorts on multiple keys.
df = pd.read_csv("test.csv")
print(df.sort_values(by='country', ascending = True).head())
print(df.sort_values(by=['country','points'], ascending = False).loc[:,['country','points']].head())
        country                                        description        designation  points  ...  taster_twitter_handle                                              title             variety          winery
5786  Argentina  Leafy, spicy, dry berry aromas lead to a jacke...            Reserva      83  ...            @wineschach           Fat Gaucho 2013 Reserva Malbec (Mendoza)              Malbec      Fat Gaucho
1482  Argentina  Rather sweet and medicinal; the wine comes int...            Altosur      85  ...            @wineschach  Finca Sophenia 2006 Altosur Cabernet Sauvignon...  Cabernet Sauvignon  Finca Sophenia
1991  Argentina  Initial plum and berry aromas fall off with ai...                NaN      84  ...            @wineschach       Ricominciare 2010 Malbec-Tannat (Uco Valley)       Malbec-Tannat    Ricominciare
4367  Argentina  A smooth operator with sweet aromas of cotton ...  Cincuenta y Cinco      91  ...            @wineschach  Bodega Chacra 2009 Cincuenta y Cinco Pinot Noi...          Pinot Noir   Bodega Chacra
4369  Argentina  Clean, cedary and dynamic, with fine black-fru...            Reserva      91  ...            @wineschach            Ca' de Calle 2008 Reserva Red (Mendoza)           Red Blend    Ca' de Calle

[5 rows x 13 columns]
      country  points
6005  Uruguay      90
9133  Uruguay      90
4051  Uruguay      88
4104  Uruguay      86
6969  Uruguay      86
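
When sorting on several keys, `ascending` can also be a list with one flag per key. A minimal sketch on made-up data:

```python
import pandas as pd

wines = pd.DataFrame({'country': ['US', 'Italy', 'US', 'Italy'],
                      'points': [88, 91, 95, 87]})

# Country A->Z, and within each country the highest points first.
ordered = wines.sort_values(by=['country', 'points'], ascending=[True, False])
print(ordered.index.tolist())   # [1, 3, 2, 0]
```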

replace


Replaces a given value with another. The inplace option is available here as well.
df = pd.read_csv("test.csv")
print(df['country'].replace('US','USA',inplace=False))
0          Italy
1       Portugal
2            USA
3            USA
4            USA
          ...
9995         USA
9996         USA
9997      France
9998         USA
9999      France
Name: country, Length: 10000, dtype: object
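
replace matches whole values by default, not substrings, and it also accepts a dict to map several values at once. The sketch below uses a made-up Series to show both points.

```python
import pandas as pd

countries = pd.Series(['US', 'Italy', 'US Virgin Islands'])

# Only the exact value 'US' is replaced; the longer string is left alone.
replaced = countries.replace('US', 'USA')
print(replaced.tolist())   # ['USA', 'Italy', 'US Virgin Islands']

# A dict maps several old values to new ones in a single call.
mapped = countries.replace({'US': 'USA', 'Italy': 'IT'})
print(mapped.tolist())     # ['USA', 'IT', 'US Virgin Islands']
```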