[Data Handling] Data Cleansing::(1) missing value


Data cleansing :: Data Problems


Data quality problems

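  • The maximum/minimum values differ from feature to feature, so the y value depends on scale.
  • How should ordinal or nominal values be represented?
  • How should invalid values be handled?
  • What should be done when a value is missing?
  • Should extremely large or extremely small values be kept?

Data preprocessing issues

  • Missing data (handling missing measurements)
  • Handling data labeled as categories
  • Features whose scales differ greatly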
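
Data cleansing :: Missing Values


    Strategies when data is missing

  • If a sample has missing data, drop the sample.
  • Decide on a minimum amount of missing data and drop the samples that exceed it.
  • Drop features that have almost no data.
  • Fill the empty values with the mode or the mean.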

    Data drop

    import pandas as pd
    import numpy as np
    # Example from https://chrisalbon.com/python/pandas_missing_data.html
    raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
            'age': [42, np.nan, 36, 24, 73],
            'sex': ['m', np.nan, 'f', 'm', 'f'],
            'preTestScore': [4, np.nan, np.nan, 2, 3],
            'postTestScore': [25, np.nan, np.nan, 62, 70]}
    df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
    df
       first_name  last_name   age  sex  preTestScore  postTestScore
    0       Jason     Miller  42.0    m           4.0           25.0
    1         NaN        NaN   NaN  NaN           NaN            NaN
    2        Tina        Ali  36.0    f           NaN            NaN
    3        Jake     Milner  24.0    m           2.0           62.0
    4         Amy      Cooze  73.0    f           3.0           70.0
    df.isnull().sum() / len(df)  # fraction of missing values in each column
    first_name 0.2
    last_name 0.2
    age 0.2
    sex 0.2
    preTestScore 0.4
    postTestScore 0.4
    dtype: float64
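
The same check can also be made per row, which helps when deciding which samples to drop. The following is only a minimal sketch on a small hypothetical frame (`scores` and its column names are made up for illustration and are not the `df` used in this post).

```python
import numpy as np
import pandas as pd

# Small hypothetical frame, not the df from this post
scores = pd.DataFrame({
    "name": ["Jason", np.nan, "Tina"],
    "pre": [4, np.nan, np.nan],
    "post": [25, np.nan, 70],
})

# Fraction of missing values in each column
# (the mean of a boolean mask equals sum() / len())
missing_per_column = scores.isnull().mean()

# Number of missing values in each row
missing_per_row = scores.isnull().sum(axis=1)

print(missing_per_column)
print(missing_per_row)
```

Rows with a high missing count are the natural candidates for dropping, while columns with a high missing fraction are candidates for removing the feature itself.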
    df_no_missing = df.dropna()
    df_no_missing  # dropna() drops every row that contains any missing value
       first_name  last_name   age  sex  preTestScore  postTestScore
    0       Jason     Miller  42.0    m           4.0           25.0
    3        Jake     Milner  24.0    m           2.0           62.0
    4         Amy      Cooze  73.0    f           3.0           70.0
    df
       first_name  last_name   age  sex  preTestScore  postTestScore
    0       Jason     Miller  42.0    m           4.0           25.0
    1         NaN        NaN   NaN  NaN           NaN            NaN
    2        Tina        Ali  36.0    f           NaN            NaN
    3        Jake     Milner  24.0    m           2.0           62.0
    4         Amy      Cooze  73.0    f           3.0           70.0
    df_cleaned = df.dropna(how='all') 
    df_cleaned  # how='all' drops a row only when all of its values are missing
       first_name  last_name   age  sex  preTestScore  postTestScore
    0       Jason     Miller  42.0    m           4.0           25.0
    2        Tina        Ali  36.0    f           NaN            NaN
    3        Jake     Milner  24.0    m           2.0           62.0
    4         Amy      Cooze  73.0    f           3.0           70.0
    df['location'] = np.nan  # add a column in which every value is NaN
    df
       first_name  last_name   age  sex  preTestScore  postTestScore  location
    0       Jason     Miller  42.0    m           4.0           25.0       NaN
    1         NaN        NaN   NaN  NaN           NaN            NaN       NaN
    2        Tina        Ali  36.0    f           NaN            NaN       NaN
    3        Jake     Milner  24.0    m           2.0           62.0       NaN
    4         Amy      Cooze  73.0    f           3.0           70.0       NaN
    df.dropna(axis=1, how='all')  # by column: drop columns in which every value is missing
       first_name  last_name   age  sex  preTestScore  postTestScore
    0       Jason     Miller  42.0    m           4.0           25.0
    1         NaN        NaN   NaN  NaN           NaN            NaN
    2        Tina        Ali  36.0    f           NaN            NaN
    3        Jake     Milner  24.0    m           2.0           62.0
    4         Amy      Cooze  73.0    f           3.0           70.0
    df.dropna(axis=1, thresh=3)  # by column: keep only columns with at least 3 non-NaN values
       first_name  last_name   age  sex  preTestScore  postTestScore
    0       Jason     Miller  42.0    m           4.0           25.0
    1         NaN        NaN   NaN  NaN           NaN            NaN
    2        Tina        Ali  36.0    f           NaN            NaN
    3        Jake     Milner  24.0    m           2.0           62.0
    4         Amy      Cooze  73.0    f           3.0           70.0
    df.dropna(axis=0, thresh=1)  # by row: keep only rows with at least 1 non-NaN value
       first_name  last_name   age  sex  preTestScore  postTestScore  location
    0       Jason     Miller  42.0    m           4.0           25.0       NaN
    2        Tina        Ali  36.0    f           NaN            NaN       NaN
    3        Jake     Milner  24.0    m           2.0           62.0       NaN
    4         Amy      Cooze  73.0    f           3.0           70.0       NaN
    df.dropna(thresh=5)  # by row (default axis): keep only rows with at least 5 non-NaN values
       first_name  last_name   age  sex  preTestScore  postTestScore  location
    0       Jason     Miller  42.0    m           4.0           25.0       NaN
    3        Jake     Milner  24.0    m           2.0           62.0       NaN
    4         Amy      Cooze  73.0    f           3.0           70.0       NaN
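
Note that `thresh` counts the values that are present, not the values that are missing. If it is more natural to think in terms of "allow at most N missing values per row", the limit has to be converted first. The sketch below shows one way to do that; the helper `drop_rows_by_missing` and the frame `people` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd


def drop_rows_by_missing(frame, max_missing):
    """Keep only the rows that have at most `max_missing` NaN values.

    Hypothetical helper: dropna's `thresh` is the minimum number of
    non-NaN values, so the limit is converted before calling it.
    """
    min_present = frame.shape[1] - max_missing
    return frame.dropna(axis=0, thresh=min_present)


# Small hypothetical frame with 4 columns
people = pd.DataFrame({
    "name": ["Jason", np.nan, "Tina"],
    "age": [42, np.nan, 36],
    "pre": [4, np.nan, np.nan],
    "post": [25, np.nan, np.nan],
})

# Row 0 has 0 missing values, row 1 has 4, row 2 has 2
kept = drop_rows_by_missing(people, max_missing=2)
print(kept)
```

With `max_missing=2` the rows with index 0 and 2 survive; tightening it to `max_missing=1` would also remove row 2.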

    Filling in data values

  • Fill with the mean, median, or mode (most frequent value); see https://goo.gl/i8iuL9.
    # Mean: fill with the average of the values in the column
    df["preTestScore"].mean()
    3.0
    # Median: the value in the middle when the values are sorted
    df["postTestScore"].median()
    62.0
    # Mode: the value that occurs most often (ties return every tied value)
    df["postTestScore"].mode()
    0 25.0
    1 62.0
    2 70.0
    dtype: float64
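
Because `mode()` returns a Series (several values can tie, as in the output above), it cannot be handed to `fillna` as a single fill value without picking one entry first. A minimal sketch of filling a categorical column with its mode is shown below; the Series `sex` is hypothetical sample data, and taking the first mode is an arbitrary choice made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry
sex = pd.Series(["m", np.nan, "f", "m", "f", "m"])

# mode() ignores NaN and returns a Series of the most frequent value(s);
# here the first entry is taken (an arbitrary tie-breaking choice)
most_frequent = sex.mode().iloc[0]

# fillna returns a new Series; the original is left unchanged
sex_filled = sex.fillna(most_frequent)
print(sex_filled)
```

When every value ties, as with `postTestScore` above, the mode carries little information, and the mean or median is probably the safer choice.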

    Data fill

    df.fillna(0)  # put 0 wherever data is missing (returns a new frame; df itself is unchanged)
       first_name  last_name   age  sex  preTestScore  postTestScore  location
    0       Jason     Miller  42.0    m           4.0           25.0       0.0
    1           0          0   0.0    0           0.0            0.0       0.0
    2        Tina        Ali  36.0    f           0.0            0.0       0.0
    3        Jake     Milner  24.0    m           2.0           62.0       0.0
    4         Amy      Cooze  73.0    f           3.0           70.0       0.0
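
Filling every column with the same value also writes `0` into text columns such as the names, which is rarely what is wanted. `fillna` accepts a dict that maps column names to fill values, so each column can get its own default. The following is a rough sketch; the frame `people` and the chosen fill values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame mixing a text column and a numeric column
people = pd.DataFrame({
    "first_name": ["Jason", np.nan, "Tina"],
    "preTestScore": [4, np.nan, np.nan],
})

# Hypothetical per-column fill values chosen for illustration
fill_values = {"first_name": "unknown", "preTestScore": 0}

# Returns a new frame; `people` itself is not modified
filled = people.fillna(fill_values)
print(filled)
```

Columns that are not listed in the dict are left untouched, so this form can also be used to fill only a subset of the columns.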
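    # Fill the missing preTestScore values with the column mean.
    # Assigning the result back avoids calling fillna(inplace=True) on a
    # column selection, which newer pandas versions warn about.
    df["preTestScore"] = df["preTestScore"].fillna(df["preTestScore"].mean())
    df
       first_name  last_name   age  sex  preTestScore  postTestScore  location
    0       Jason     Miller  42.0    m           4.0           25.0       NaN
    1         NaN        NaN   NaN  NaN           3.0            NaN       NaN
    2        Tina        Ali  36.0    f           3.0            NaN       NaN
    3        Jake     Milner  24.0    m           2.0           62.0       NaN
    4         Amy      Cooze  73.0    f           3.0           70.0       NaN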
    df.groupby("sex")["postTestScore"].transform("mean")
    0 43.5
    1 NaN
    2 70.0
    3 43.5
    4 70.0
    Name: postTestScore, dtype: float64
    # Fill the missing postTestScore values with the mean of each sex group.
    # Row 1 stays NaN because its sex is missing, so it belongs to no group.
    df["postTestScore"] = df["postTestScore"].fillna(
        df.groupby("sex")["postTestScore"].transform("mean"))
    df
       first_name  last_name   age  sex  preTestScore  postTestScore  location
    0       Jason     Miller  42.0    m           4.0           25.0       NaN
    1         NaN        NaN   NaN  NaN           3.0            NaN       NaN
    2        Tina        Ali  36.0    f           3.0           70.0       NaN
    3        Jake     Milner  24.0    m           2.0           62.0       NaN
    4         Amy      Cooze  73.0    f           3.0           70.0       NaN
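
A group-wise fill leaves a gap wherever the grouping key itself is missing, as row 1 shows. One possible way to handle this, sketched below, is to fall back to the overall mean after the group mean has been applied. The frame `scores` is hypothetical and the fallback rule is an assumption for illustration, not something the lecture prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has no group key, row 2 has a key but no score
scores = pd.DataFrame({
    "sex": ["m", np.nan, "f", "m", "f"],
    "post": [25, np.nan, np.nan, 62, 70],
})

# Per-group mean aligned to the original rows (NaN where the key is missing)
group_mean = scores.groupby("sex")["post"].transform("mean")

# Overall mean of the observed values, used only as a fallback
overall_mean = scores["post"].mean()

# First try the group mean, then fall back to the overall mean
scores["post"] = scores["post"].fillna(group_mean).fillna(overall_mean)
print(scores)
```

Row 2 receives the mean of its own group (70.0), while row 1 receives the overall mean of the three observed scores.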
    https://www.boostcourse.org/ai222/lecture/24076