Elice - Data Analysis (4)
Pandas
Installing the module
----console-----
pip install pandas
import pandas as pd
import numpy as np
DataFrame and Series
Generally, a DataFrame holds the whole table, while a Series is a single column of it and has no columns of its own.
a=pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']},
index=['Product A', 'Product B'])
b=pd.Series([1, 2, 3, 4, 5])
print(a)
print(b)
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.
0 1
1 2
2 3
3 4
4 5
dtype: int64
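The relationship between the two types can be checked directly. The following is a minimal sketch, assuming a hypothetical `reviews` frame modeled on the example above: selecting one column of a DataFrame yields a Series that keeps the frame's row labels.

```python
import pandas as pd

# Hypothetical frame modeled on the example above.
reviews = pd.DataFrame(
    {"Bob": ["I liked it.", "It was awful."],
     "Sue": ["Pretty good.", "Bland."]},
    index=["Product A", "Product B"],
)

# Selecting a single column returns a Series named after that column.
bob = reviews["Bob"]
print(type(reviews).__name__)  # DataFrame
print(type(bob).__name__)      # Series
print(bob.name)                # Bob

# The Series keeps the row labels of the DataFrame it came from.
print(list(bob.index))         # ['Product A', 'Product B']
```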
Reading a file
df = pd.read_csv("test.csv")
print(df)
Inspecting a DataFrame
df.head(n) outputs only the first n rows (5 when n is omitted).
df = pd.read_csv("test.csv")
print(df.head())
print(df.shape)
print(df.info())
country description ... variety winery
0 Italy Aromas include tropical fruit, broom, brimston... ... White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... ... Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... ... Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... ... Pinot Noir Sweet Cheeks
[5 rows x 13 columns]
(10000, 13)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 9995 non-null object
1 description 10000 non-null object
2 designation 7092 non-null object
3 points 10000 non-null int64
4 price 9315 non-null float64
5 province 9995 non-null object
6 region_1 8307 non-null object
7 region_2 3899 non-null object
8 taster_name 8018 non-null object
9 taster_twitter_handle 7663 non-null object
10 title 10000 non-null object
11 variety 10000 non-null object
12 winery 10000 non-null object
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB
None
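Indexing
df.iloc indexes by integer position, df.loc indexes by label (row labels and column names), and plain df[] accesses a column by name. Note that a loc slice includes its end label, whereas an iloc slice excludes its end position. (Selecting a single column returns a Series; selecting multiple columns returns a DataFrame.)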
df = pd.read_csv("test.csv")
print(df['country'][:4])
print(df.iloc[:4,1:4])
print(df.loc[:4,'country'])
0 Italy
1 Portugal
2 US
3 US
Name: country, dtype: object
description designation points
0 Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87
1 This is ripe and fruity, a wine that is smooth... Avidagos 87
2 Tart and snappy, the flavors of lime flesh and... NaN 87
3 Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87
0 Italy
1 Portugal
2 US
3 US
4 US
Name: country, dtype: object
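The output above shows four rows for iloc[:4] but five rows for loc[:4]. As a minimal sketch, assuming a hypothetical `wines` frame that stands in for test.csv, the difference in slice end handling can be reproduced like this:

```python
import pandas as pd

# Hypothetical stand-in for the first rows of test.csv.
wines = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US", "US", "Spain"],
    "points": [87, 87, 87, 87, 87, 87],
})

by_position = wines.iloc[:4]         # positions 0..3, end excluded
by_label = wines.loc[:4, "country"]  # labels 0..4, end included

print(len(by_position))  # 4
print(len(by_label))     # 5
```

With the default integer index the labels and positions coincide, which is why the two accessors look interchangeable until a slice is involved.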
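Indexing that returns a DataFrame
Wrapping the column argument of loc in an extra pair of [] returns a DataFrame instead of a Series. Selecting multiple columns requires the list form [column1, column2], so the result is then always a DataFrame.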
df = pd.read_csv("test.csv")
print(df.loc[:4,['country']])
country
0 Italy
1 Portugal
2 US
3 US
4 US
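To confirm which type comes back, the two forms can be compared side by side. This is a minimal sketch on a hypothetical `wines` frame, not the original data:

```python
import pandas as pd

# Hypothetical stand-in for test.csv.
wines = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 87],
})

as_series = wines.loc[:, "country"]   # bare column name -> Series
as_frame = wines.loc[:, ["country"]]  # list of names -> DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```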
Replacing column names
Assigning a new list to data.columns replaces the column names in place.
data=pd.read_csv('testfile.csv')
print(data.head())
print(data.columns)
data.columns=[1,2,3,4,5]
print(data.head())
print(data.columns)
name class math english korean
0 A 1 96 90 95
1 B 1 95 66 71
2 C 2 91 89 92
3 D 1 92 83 87
4 E 2 93 84 95
Index(['name', 'class', 'math', 'english', 'korean'], dtype='object')
1 2 3 4 5
0 A 1 96 90 95
1 B 1 95 66 71
2 C 2 91 89 92
3 D 1 92 83 87
4 E 2 93 84 95
Int64Index([1, 2, 3, 4, 5], dtype='int64')
Replacing column names with rename
With rename, the inplace option chooses between returning a new DataFrame (the default) and modifying the existing one in place.
test_data = pd.read_csv('testfile.csv')
print(test_data.head())
print(test_data.columns)
test_data.rename(columns={'name':'이름','class':'학급명','math':'수학','english':'국어'},inplace=True)
print(test_data.head())
name class math english korean
0 A 1 96 90 95
1 B 1 95 66 71
2 C 2 91 89 92
3 D 1 92 83 87
4 E 2 93 84 95
Index(['name', 'class', 'math', 'english', 'korean'], dtype='object')
이름 학급명 수학 국어 korean
0 A 1 96 90 95
1 B 1 95 66 71
2 C 2 91 89 92
3 D 1 92 83 87
4 E 2 93 84 95
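The effect of the inplace option can be seen by checking the original frame after each call. This is a minimal sketch; the `scores` frame and the new column names are hypothetical:

```python
import pandas as pd

# Hypothetical frame using two of the column names of testfile.csv.
scores = pd.DataFrame({"name": ["A", "B"], "math": [96, 95]})

# Default (inplace=False): a renamed copy is returned, the original is untouched.
renamed = scores.rename(columns={"math": "math_score"})
print(list(scores.columns))   # ['name', 'math']
print(list(renamed.columns))  # ['name', 'math_score']

# inplace=True: the original is modified and the call returns None.
result = scores.rename(columns={"name": "student"}, inplace=True)
print(result)                 # None
print(list(scores.columns))   # ['student', 'math']
```

Columns that are not listed in the mapping are left unchanged, which is why korean keeps its name in the output above.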
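Conditional expressions
You can build a boolean mask with the isin method or with a conditional expression.
Multiple conditions can be combined with logical operators, as in (df.country=='Italy')&(df.points>=90).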
df = pd.read_csv("test.csv")
print(df.country=='Italy')
print(df['points'].isin([89,91]))
print(df.loc[df.country=='Italy',['country']].head(5))
print(df[df['points'].isin([89,91])].head(5))
0 True
1 False
2 False
3 False
4 False
...
9995 False
9996 False
9997 False
9998 False
9999 False
Name: country, Length: 10000, dtype: bool
0 False
1 False
2 False
3 False
4 False
...
9995 True
9996 True
9997 True
9998 True
9999 True
Name: points, Length: 10000, dtype: bool
country
0 Italy
6 Italy
13 Italy
22 Italy
24 Italy
country description ... variety winery
125 South Africa Etienne Le Riche is a total Cabernet specialis... ... Cabernet Sauvignon Le Riche
126 France Mid-gold color. Pronounced and enticing aromas... ... Gewürztraminer Pierre Sparr
127 France Attractive mid-gold color with intense aromas ... ... White Blend Pierre Sparr
128 France Compelling minerality on the nose, Refined and... ... Pinot Blanc Kuentz-Bas
129 South Africa A big, black bruiser of a wine that has black ... ... Bordeaux-style Red Blend Camberley
[5 rows x 13 columns]
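The combined condition mentioned in the text is not exercised by the code above. As a minimal sketch on a hypothetical `wines` frame, it could look like this:

```python
import pandas as pd

# Hypothetical stand-in for test.csv.
wines = pd.DataFrame({
    "country": ["Italy", "Italy", "US", "France", "Italy"],
    "points": [87, 91, 92, 89, 90],
})

# Each comparison needs its own parentheses because & binds tighter than == and >=.
italian_90_plus = wines[(wines.country == "Italy") & (wines.points >= 90)]
print(italian_90_plus.index.tolist())  # [1, 4]

# | is "or" and ~ is "not"; an isin mask combines with them like any other mask.
non_italian_or_87 = wines[~(wines.country == "Italy") | wines.points.isin([87])]
print(non_italian_or_87.index.tolist())  # [0, 2, 3]
```

Python's `and` and `or` keywords do not work element-wise on a Series, so the bitwise operators are used instead.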
df[column].value_counts()
Counts how many times each value appears and returns the counts as a Series.
df = pd.read_csv("test.csv")
print(df['country'].value_counts())
print(type(df['country'].value_counts()))
US 4181
France 1588
Italy 1546
Spain 493
Portugal 468
Chile 399
Argentina 316
Austria 243
Australia 192
Germany 157
South Africa 132
New Zealand 111
Greece 32
Israel 29
Canada 20
Romania 17
Hungary 12
Bulgaria 8
Turkey 7
Uruguay 5
Mexico 5
Czech Republic 4
Croatia 4
Lebanon 4
Slovenia 4
Moldova 4
England 3
Georgia 2
Brazil 2
India 1
Cyprus 1
Serbia 1
Peru 1
Morocco 1
Luxembourg 1
Armenia 1
Name: country, dtype: int64
<class 'pandas.core.series.Series'>
df[column].unique()
Returns the distinct values as an array.
df = pd.read_csv("test.csv")
print(df['country'].unique())
['Italy' 'Portugal' 'US' 'Spain' 'France' 'Germany' 'Argentina' 'Chile'
'Australia' 'Austria' 'South Africa' 'New Zealand' 'Israel' 'Hungary'
'Greece' 'Romania' 'Mexico' 'Canada' nan 'Turkey' 'Czech Republic'
'Slovenia' 'Luxembourg' 'Croatia' 'Georgia' 'Uruguay' 'England' 'Lebanon'
'Serbia' 'Brazil' 'Moldova' 'Morocco' 'Peru' 'India' 'Bulgaria' 'Cyprus'
'Armenia']
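The unique() output above contains nan, while the value_counts() listing does not. A minimal sketch with a hypothetical `country` Series shows that value_counts skips missing values unless dropna=False is passed:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
country = pd.Series(["US", "Italy", np.nan, "US"], name="country")

counts = country.value_counts()   # NaN is excluded by default
print(counts["US"])               # 2
print(len(counts))                # 2

print(len(country.unique()))      # 3, because nan counts as a distinct value
print(country.nunique())          # 2, nunique also skips NaN by default

with_missing = country.value_counts(dropna=False)
print(with_missing.sum())         # 4
```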
df[column].mean()
Returns the mean of the specified column.
df = pd.read_csv("test.csv")
print(df['points'].mean())
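Operations between Series
With the plain + operator, any index label that is not present in both Series yields NaN.
With the add method, fill_value supplies a substitute for the missing side before the addition.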
df = pd.read_csv("test.csv")
A=pd.Series([2,4,6],index=[0,1,2])
B=pd.Series([1,3,5],index=[1,2,3])
print(A+B)
print(A.add(B,fill_value=0))
0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64
0 2.0
1 5.0
2 9.0
3 5.0
dtype: float64
Operations between DataFrames
The add method simplifies handling of the NaN values that arise when rows or columns do not line up.
df = pd.read_csv("test.csv")
A=pd.DataFrame(np.random.randint(0,10,(2,2)),columns=list('AB'))
B=pd.DataFrame(np.random.randint(0,10,(3,3)),columns=list('BAC'))
print(A+B)
print(A.add(B,fill_value=0))
A B C
0 2.0 1.0 NaN
1 9.0 6.0 NaN
2 NaN NaN NaN
A B C
0 2.0 1.0 3.0
1 9.0 6.0 2.0
2 4.0 8.0 4.0
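Because the example above uses random numbers, its output changes on every run. The following sketch uses fixed, hypothetical values so that the alignment by column name and the effect of fill_value can be checked by hand:

```python
import pandas as pd

# Fixed values instead of np.random so the result is reproducible.
A = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
B = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]], columns=list("BAC"))

plain = A + B                    # cells missing from A become NaN
total = A.add(B, fill_value=0)   # cells missing from A are treated as 0

print(plain)
print(total)
```

Columns are matched by name, not by position, so A's column A is added to B's second column. Row 2 and column C exist only in B: they are NaN in `plain` and equal to B's values in `total`.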
Aggregation functions
data={
'A':[i+5 for i in range(3)],
'B':[i**2 for i in range(3)]
}
df=pd.DataFrame(data)
print(df['A'].sum())
print(df.sum())
print(df.mean())
18
A 18
B 5
dtype: int64
A 6.000000
B 1.666667
dtype: float64
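df.dropna and df.fillna
dropna removes the rows that contain NaN, and fillna replaces NaN with a value you specify. (Neither works in place by default, so assign the result.)
df = df.dropna()  # dropna returns a new DataFrame; assign it to keep the result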
df['전화번호']=df['전화번호'].fillna('전화번호없음')
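As a minimal sketch of the not-in-place behavior, the following uses a hypothetical `contacts` frame with made-up placeholder values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the phone values are placeholders.
contacts = pd.DataFrame({
    "name": ["A", "B", "C"],
    "phone": ["000-0000", np.nan, "000-1111"],
})

dropped = contacts.dropna()          # new frame without the NaN row
print(len(contacts), len(dropped))   # 3 2

filled = contacts["phone"].fillna("no phone number")
print(filled.tolist())  # ['000-0000', 'no phone number', '000-1111']

# The original column still has its missing value.
print(contacts["phone"].isna().sum())  # 1
```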
df.map and df.apply
df = pd.read_csv("test.csv")
print(df['points'].map(lambda x:x-df['points'].mean()))
print(df.apply(lambda x:x['points']-df['points'].mean(),axis=1))
0 -1.3957
1 -1.3957
2 -1.3957
3 -1.3957
4 -1.3957
...
9995 0.6043
9996 0.6043
9997 0.6043
9998 2.6043
9999 2.6043
Name: points, Length: 10000, dtype: float64
0 -1.3957
1 -1.3957
2 -1.3957
3 -1.3957
4 -1.3957
...
9995 0.6043
9996 0.6043
9997 0.6043
9998 2.6043
9999 2.6043
Length: 10000, dtype: float64
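Both calls above produce the same values: map transforms one column element by element, while apply with axis=1 passes each row to the function. The sketch below, on a hypothetical `wines` frame, computes the mean once outside the lambda (the code above recomputes it for every element) and also shows the plain vectorized subtraction, which gives the same result:

```python
import pandas as pd

# Hypothetical stand-in for test.csv.
wines = pd.DataFrame({"points": [87, 90, 93], "price": [10.0, 20.0, 30.0]})
mean_points = wines["points"].mean()  # computed once, 90.0

via_map = wines["points"].map(lambda p: p - mean_points)                  # per element
via_apply = wines.apply(lambda row: row["points"] - mean_points, axis=1)  # per row
vectorized = wines["points"] - mean_points                                # whole column

print(via_map.tolist())     # [-3.0, 0.0, 3.0]
print(via_apply.tolist())   # [-3.0, 0.0, 3.0]
print(vectorized.tolist())  # [-3.0, 0.0, 3.0]
```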
groupby
Groups the rows by the values of a specified column, so that aggregations are computed per group.
df = pd.read_csv("test.csv")
print(df.groupby('country').points.count())
print(df.groupby('country').points.value_counts())
country
Argentina 316
Armenia 1
Australia 192
Austria 243
Brazil 2
Bulgaria 8
Canada 20
Chile 399
Croatia 4
Cyprus 1
Czech Republic 4
England 3
France 1588
Georgia 2
Germany 157
Greece 32
Hungary 12
India 1
Israel 29
Italy 1546
Lebanon 4
Luxembourg 1
Mexico 5
Moldova 4
Morocco 1
New Zealand 111
Peru 1
Portugal 468
Romania 17
Serbia 1
Slovenia 4
South Africa 132
Spain 493
Turkey 7
US 4181
Uruguay 5
Name: points, dtype: int64
country points
Argentina 87 46
88 42
85 37
83 27
84 27
..
US 98 2
99 2
Uruguay 86 2
90 2
88 1
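Beyond counting, a grouped column can be summarized with several aggregations at once. This is a minimal sketch on a hypothetical `wines` frame, using agg with a list of function names:

```python
import pandas as pd

# Hypothetical stand-in for test.csv.
wines = pd.DataFrame({
    "country": ["US", "US", "Italy", "Italy", "Italy"],
    "points": [90, 88, 87, 91, 92],
})

# One row per country, one column per aggregation.
summary = wines.groupby("country")["points"].agg(["count", "mean", "max"])
print(summary)
```

The group keys become the index of the result and are sorted by default, so Italy comes before US here.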
Sorting
If ascending is True the sort is ascending; if False, descending. Passing a list to by sorts on multiple keys.
df = pd.read_csv("test.csv")
print(df.sort_values(by='country', ascending = True).head())
print(df.sort_values(by=['country','points'], ascending = False).loc[:,['country','points']].head())
country description designation points ... taster_twitter_handle title variety winery
5786 Argentina Leafy, spicy, dry berry aromas lead to a jacke... Reserva 83 ... @wineschach Fat Gaucho 2013 Reserva Malbec (Mendoza) Malbec Fat Gaucho
1482 Argentina Rather sweet and medicinal; the wine comes int... Altosur 85 ... @wineschach Finca Sophenia 2006 Altosur Cabernet Sauvignon... Cabernet Sauvignon Finca Sophenia
1991 Argentina Initial plum and berry aromas fall off with ai... NaN 84 ... @wineschach Ricominciare 2010 Malbec-Tannat (Uco Valley) Malbec-Tannat Ricominciare
4367 Argentina A smooth operator with sweet aromas of cotton ... Cincuenta y Cinco 91 ... @wineschach Bodega Chacra 2009 Cincuenta y Cinco Pinot Noi... Pinot Noir Bodega Chacra
4369 Argentina Clean, cedary and dynamic, with fine black-fru... Reserva 91 ... @wineschach Ca' de Calle 2008 Reserva Red (Mendoza) Red Blend Ca' de Calle
[5 rows x 13 columns]
country points
6005 Uruguay 90
9133 Uruguay 90
4051 Uruguay 88
4104 Uruguay 86
6969 Uruguay 86
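The example above applies one direction to both keys. The ascending parameter also accepts a list with one entry per key, as in this sketch on a hypothetical `wines` frame:

```python
import pandas as pd

# Hypothetical stand-in for test.csv.
wines = pd.DataFrame({
    "country": ["US", "Italy", "US", "Italy"],
    "points": [88, 91, 90, 87],
})

# Countries A to Z, and within each country the highest points first.
ordered = wines.sort_values(by=["country", "points"], ascending=[True, False])
print(ordered.index.tolist())  # [1, 3, 2, 0]
```

sort_values returns a sorted copy and keeps the original row labels, which is why the index appears shuffled.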
replace
Replaces specific values with the values you specify; the inplace option is available.
df = pd.read_csv("test.csv")
print(df['country'].replace('US','USA',inplace=False))
0 Italy
1 Portugal
2 USA
3 USA
4 USA
...
9995 USA
9996 USA
9997 France
9998 USA
9999 France
Name: country, Length: 10000, dtype: object
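Several values can be replaced in one call by passing a dictionary. This is a minimal sketch; the `wines` frame and the England to UK mapping are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for test.csv.
wines = pd.DataFrame({"country": ["US", "Italy", "England"]})

# Keys are the values to find, dictionary values are their replacements.
renamed = wines["country"].replace({"US": "USA", "England": "UK"})
print(renamed.tolist())           # ['USA', 'Italy', 'UK']

# With the default inplace=False the original column is unchanged.
print(wines["country"].tolist())  # ['US', 'Italy', 'England']
```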
Reference
Source: https://velog.io/@tjwjdgus83/엘리스-데이터분석4 (the text may be freely shared or copied, provided this URL is kept as a reference).