[Data Handling] Data Cleansing::(1) missing value
Data cleansing :: Data Problems
Data quality problems
Data preprocessing issues
Missing data
Data cleansing :: Missing Values
Strategies for handling missing data
import pandas as pd
import numpy as np
# Example from https://chrisalbon.com/python/pandas_missing_data.html
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
            'age': [42, np.nan, 36, 24, 73],
            'sex': ['m', np.nan, 'f', 'm', 'f'],
            'preTestScore': [4, np.nan, np.nan, 2, 3],
            'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'sex',
                                     'preTestScore', 'postTestScore'])
df
  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali  36.0    f           NaN            NaN
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0
df.isnull().sum() / len(df)  # fraction of missing values in each column
first_name       0.2
last_name        0.2
age              0.2
sex              0.2
preTestScore     0.4
postTestScore    0.4
dtype: float64
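The per-column check above can also report counts and fractions side by side. The following is a minimal sketch on a small made-up frame; the names `scores` and `missing_summary` are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up values, not the article's data)
scores = pd.DataFrame({
    "name": ["Ann", None, "Cho", "Dev"],
    "pre": [4.0, np.nan, np.nan, 2.0],
    "post": [25.0, np.nan, 31.0, 62.0],
})

# isna() marks missing cells as True; sum() counts them and
# mean() gives the same information as a fraction of the rows
missing_summary = pd.DataFrame({
    "n_missing": scores.isna().sum(),
    "fraction_missing": scores.isna().mean(),
})
print(missing_summary)
```

Taking the mean of the boolean mask is equivalent to dividing the count by `len(df)`, as done above.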
df_no_missing = df.dropna()
df_no_missing  # dropna() removes every row that contains at least one NaN
  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0
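Dropping every row with any NaN can discard a lot of data. As a hedged sketch, `dropna` also accepts a `subset` argument so that only selected columns decide whether a row is removed; the frame `people` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): each row is missing a different field
people = pd.DataFrame({
    "name": ["Ann", "Bo", None],
    "age": [31.0, np.nan, 45.0],
    "score": [np.nan, 7.0, 9.0],
})

# Drop a row only when "name" is missing; NaN in other columns is tolerated
named_only = people.dropna(subset=["name"])
print(named_only)
```

Here only the last row is removed, while a plain `people.dropna()` would remove all three.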
df  # the original DataFrame is unchanged because dropna() returns a new object
  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali  36.0    f           NaN            NaN
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0
df_cleaned = df.dropna(how='all')
df_cleaned  # drop a row only if all of its values are missing
  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
2       Tina       Ali  36.0    f           NaN            NaN
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0
df['location'] = np.nan  # add a column that contains only NaN
df
  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           NaN            NaN       NaN
2       Tina       Ali  36.0    f           NaN            NaN       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN
df.dropna(axis=1, how='all')  # column-wise: drop columns whose values are all NaN
  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali  36.0    f           NaN            NaN
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0
df.dropna(axis=1, thresh=3)  # column-wise: keep only columns with at least 3 non-NaN values
  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali  36.0    f           NaN            NaN
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0
df.dropna(axis=0, thresh=1)  # row-wise: keep only rows with at least 1 non-NaN value
  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
2       Tina       Ali  36.0    f           NaN            NaN       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN
df.dropna(thresh=5)  # row-wise (default axis): keep only rows with at least 5 non-NaN values
  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN
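The `thresh` argument is easy to misread: it is the minimum number of non-NaN values a row (or column) needs in order to be kept, not the number of missing values that triggers a drop. The sketch below checks this on a made-up frame by comparing `dropna(thresh=2)` with an explicit mask; the names `frame`, `kept_by_thresh`, and `kept_by_mask` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption): rows hold 2, 0, and 3 non-NaN values
frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, np.nan, 9.0],
})

# Count the non-missing values in each row
non_missing_per_row = frame.notna().sum(axis=1)

# thresh=2 keeps rows that have at least 2 non-NaN values
kept_by_thresh = frame.dropna(thresh=2)

# The same selection written as an explicit boolean mask
kept_by_mask = frame[non_missing_per_row >= 2]

print(kept_by_thresh.equals(kept_by_mask))
```

Printing `non_missing_per_row` before choosing a threshold is a quick way to see how many rows a given `thresh` would remove.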
Filling in missing values
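# Mean: fill with the average of the values in the column
df["preTestScore"].mean()
3.0
# Median: the middle value when the values are sorted
df["postTestScore"].median()
62.0
# Mode: the most frequent value (every value occurs once here, so all three are returned)
df["postTestScore"].mode()
0    25.0
1    62.0
2    70.0
dtype: float64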
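Unlike `mean()` and `median()`, `mode()` returns a Series, because several values can tie for most frequent. A minimal sketch of using it as a fill value follows; taking the first entry with `.iloc[0]` is one possible convention assumed here, not something the original prescribes.

```python
import numpy as np
import pandas as pd

# Same values as the postTestScore column in the example above
post = pd.Series([25.0, np.nan, np.nan, 62.0, 70.0], name="postTestScore")

# All three observed values occur once, so mode() returns all of them
modes = post.mode()

# mode() returns its result in sorted order, so .iloc[0] is the smallest tied value
first_mode = modes.iloc[0]

# Use that single value to fill the gaps
filled = post.fillna(first_mode)
print(filled)
```

When every value ties, as it does here, the mode carries little information, and the mean or median is usually the more sensible fill value.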
Data Fill
df.fillna(0)  # replace every missing value with 0
  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       0.0
1          0         0   0.0    0           0.0            0.0       0.0
2       Tina       Ali  36.0    f           0.0            0.0       0.0
3       Jake    Milner  24.0    m           2.0           62.0       0.0
4        Amy     Cooze  73.0    f           3.0           70.0       0.0
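Filling everything with 0 also writes 0 into text columns such as `first_name` and `sex`, as the output above shows. As a hedged sketch, `fillna` accepts a dict that maps column names to fill values, so each column can get a value of a suitable type; the frame `records` and the placeholder string `"unknown"` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame (assumption) with one text and one numeric column
records = pd.DataFrame({
    "sex": ["m", np.nan, "f"],
    "preTestScore": [4.0, np.nan, 2.0],
})

# One fill value per column: a placeholder label for text,
# the column mean for the numeric score
fill_values = {
    "sex": "unknown",
    "preTestScore": records["preTestScore"].mean(),
}

# fillna returns a new DataFrame; records itself is left untouched
filled = records.fillna(value=fill_values)
print(filled)
```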
df["preTestScore"] = df["preTestScore"].fillna(df["preTestScore"].mean())
df  # missing preTestScore values are replaced with the column mean
  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           3.0            NaN       NaN
2       Tina       Ali  36.0    f           3.0            NaN       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN
df.groupby("sex")["postTestScore"].transform("mean")
0    43.5
1     NaN
2    70.0
3    43.5
4    70.0
Name: postTestScore, dtype: float64
df["postTestScore"] = df["postTestScore"].fillna(
    df.groupby("sex")["postTestScore"].transform("mean"))
df  # missing postTestScore values are replaced with the mean of the same sex group
  first_name last_name   age  sex  preTestScore  postTestScore  location
0      Jason    Miller  42.0    m           4.0           25.0       NaN
1        NaN       NaN   NaN  NaN           3.0            NaN       NaN
2       Tina       Ali  36.0    f           3.0           70.0       NaN
3       Jake    Milner  24.0    m           2.0           62.0       NaN
4        Amy     Cooze  73.0    f           3.0           70.0       NaN
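In the result above, row 1 still has a missing postTestScore: its `sex` value is NaN, so it belongs to no group and receives no group mean. One possible way to handle such rows, sketched here as an assumption rather than as the original author's method, is to fall back to the overall column mean after the group-wise fill. The frame `tests` reuses the values from the example.

```python
import numpy as np
import pandas as pd

# Same sex and postTestScore values as in the example above
tests = pd.DataFrame({
    "sex": ["m", np.nan, "f", "m", "f"],
    "postTestScore": [25.0, np.nan, np.nan, 62.0, 70.0],
})

# Per-group mean aligned to the original rows (NaN where sex is missing)
group_mean = tests.groupby("sex")["postTestScore"].transform("mean")

# Overall mean of the observed scores, used as a fallback
overall_mean = tests["postTestScore"].mean()

# Fill with the group mean first, then with the overall mean for what remains
tests["postTestScore"] = (
    tests["postTestScore"]
    .fillna(group_mean)
    .fillna(overall_mean)
)
print(tests)
```

Whether a fallback is appropriate depends on the data; dropping rows that have no group key is an equally reasonable choice.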
https://www.boostcourse.org/ai222/lecture/24076
Reference
More information on this topic ([Data Handling] Data Cleansing::(1) missing value) can be found at https://velog.io/@ssongplay/Data-Handling-Data-Cleansing-missing-value. The text may be freely shared or copied, provided that this document's URL is kept as a reference URL.
Collection and Share based on the CC Protocol