n232_data-wrangling



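Learning objectives

Generate training data for fitting a supervised machine learning model.

Understand feature engineering methods for supervised learning and create appropriate features.

Data wrangling

Before analysis or modeling, raw data is transformed or mapped into a form that is easier to work with. This is usually the most time-consuming step of the modeling process.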

0. preview

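# Check the shape and head of every CSV file in one pass
from glob import glob

import pandas as pd
from IPython.display import display  # available in Jupyter/IPython

def preview():
    for filename in glob('*.csv'):
        df = pd.read_csv(filename)
        print(filename, df.shape)
        display(df.head())
        print('\n')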

1. Analyze and understand the data files and each feature

From jeremystan

orders (3.4m rows, 206k users):

order_id: order identifier

user_id: customer identifier

eval_set: which evaluation set this order belongs in (see SET described below)

order_number: the order sequence number for this user (1 = first, n = nth)

order_dow: the day of the week the order was placed on

order_hour_of_day: the hour of the day the order was placed on

days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)

products (50k rows):

product_id: product identifier

product_name: name of the product

aisle_id: foreign key

department_id: foreign key

aisles (134 rows):

aisle_id: aisle identifier

aisle: the name of the aisle

departments (21 rows):

department_id: department identifier

department: the name of the department

order_products__SET (30m+ rows):

order_id: foreign key

product_id: foreign key

add_to_cart_order: order in which each product was added to cart

reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the three following evaluation sets (eval_set in orders):

"prior": orders prior to that user's most recent order (~3.2m orders)

"train": training data supplied to participants (~131k orders)

"test": test data reserved for machine learning competitions (~75k orders)

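2. Analyze the relationships between the dataframes

ex) Every customer's order history (order_id, user_id, etc.) lives in orders, while prior and train hold the product information tied to each order_id (product_id, add-to-cart order, reordered flag).
test (submission) has order_id only, with no product_id.

Checking that the test and train data are separate and do not overlap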
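The relationship described above can be sanity-checked on a toy example. The sketch below uses small made-up frames (the values are illustrative and not taken from the Instacart data) that mirror the column names of orders and order_products__SET. It assumes that each product row should map to exactly one order, and uses the `validate` and `indicator` arguments of `merge` to confirm this and to find orders that have no product rows.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 10, 20],
    "eval_set": ["prior", "train", "test"],
})
order_products = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [100, 200, 100],
    "reordered": [0, 0, 1],
})

# Many product rows point to one order row: a many-to-one join on order_id.
# validate= raises an error if order_id is not unique in orders.
joined = order_products.merge(orders, on="order_id", how="left", validate="many_to_one")

# Orders without any product rows (here the "test" order) show up as left_only.
coverage = orders.merge(
    order_products[["order_id"]].drop_duplicates(),
    on="order_id", how="left", indicator=True,
)
orders_without_products = coverage.loc[coverage["_merge"] == "left_only", "order_id"].tolist()
print(orders_without_products)
```

On the real data, the same check would be expected to flag only the test orders, since those ship without product rows.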

# set1.isdisjoint(set2)
# set.isdisjoint(): True if the two sets have no elements in common
set(orders[orders['eval_set']=='test']['user_id'])\
    .isdisjoint(set(orders[orders['eval_set']=='train']['user_id']))
    
>>> True

# Each customer has exactly one sample (row count equals unique user count)
len(orders[orders['eval_set'].isin(['train','test'])]) \
,len(orders[orders['eval_set'].isin(['train','test'])]['user_id'].unique())

>>> (206209, 206209)

3. Simplify the problem into binary classification

Which products will each customer buy?
-> Will the customer buy one specific product? (binary classification)
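One way to picture this simplification is to turn each order into a single yes/no label for one chosen product. The sketch below is a minimal illustration on made-up rows. The `banana` column name follows the snippet used later in this post, and the `product_name` values are hypothetical.

```python
import pandas as pd

# Toy order lines (made-up values); product_name mirrors the products table.
train = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "product_name": ["Banana", "Milk", "Bread", "Banana", "Eggs"],
})

# Row-level flag: is this order line the product of interest?
train["banana"] = train["product_name"] == "Banana"

# Order-level binary target: True if the order contains the product at least once.
target = train.groupby("order_id")["banana"].any()
print(target.to_dict())
```

Each order now carries exactly one boolean label, which is the shape a binary classifier expects.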

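4. Design questions and answer them

Which product do customers order most frequently?

How many products did customers buy in their most recent orders?

Which customers have bought this product before?

What does the dataset of customers with a purchase history for this product look like?

Which features need to be engineered to predict whether a customer will reorder this product?
There are many possible features!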

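Average number of products the customer buys per order

Order time

Number and frequency of banana purchases

Whether other fruits are bought together with bananas

Number of days between banana repurchases

How many days ago bananas were last bought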
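A few of the features listed above can be sketched with `groupby`. The example below is only an illustration on made-up rows. It assumes the prior order lines have already been merged with orders and products, and the feature names (`avg_products_per_order`, `banana_count`, `banana_order_share`) are hypothetical labels chosen for this sketch.

```python
import pandas as pd

# Toy prior order lines, already merged with orders and products (made-up values).
prior = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 2, 2],
    "order_id":     [11, 11, 12, 13, 21, 21],
    "product_name": ["Banana", "Milk", "Banana", "Bread", "Milk", "Eggs"],
})
prior["banana"] = prior["product_name"] == "Banana"

# Average number of products per order, per user.
basket_size = prior.groupby(["user_id", "order_id"]).size()
avg_basket = basket_size.groupby(level="user_id").mean()

# Number of banana purchases, and share of orders that contain a banana, per user.
banana_count = prior.groupby("user_id")["banana"].sum()
banana_in_order = prior.groupby(["user_id", "order_id"])["banana"].any()
banana_share = banana_in_order.groupby(level="user_id").mean()

features = pd.DataFrame({
    "avg_products_per_order": avg_basket,
    "banana_count": banana_count,
    "banana_order_share": banana_share,
})
print(features)
```

The result has one row per user, so it can be joined back onto the train orders by `user_id`.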

5. Functions and methods for combining data to solve the problem

mode: the most frequent value

prior['product_id'].mode()

value_counts: unique values and their counts

top5_products = prior['product_id'].value_counts()[:5]
top5_products
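To see how `mode` and `value_counts` relate, here is a small sketch on a made-up Series (the ids are illustrative only). When there are no ties, the mode is the first index entry of `value_counts`.

```python
import pandas as pd

# Made-up product ids for illustration.
product_id = pd.Series([3, 1, 3, 2, 3, 1], name="product_id")

most_common = product_id.mode()     # a Series; holds several values when there are ties
counts = product_id.value_counts()  # sorted by count, descending

print(most_common.tolist())
print(counts.head(2).to_dict())
```

Because `mode()` returns a Series rather than a scalar, take `.iloc[0]` when a single value is needed.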

merge: combine dataframes on a common key

# Merge prior and products on product_id
prior = prior.merge(products, on='product_id')

prior = prior.merge(orders, how='left', on='order_id')

prior.groupby(['user_id','order_id']).count()
prior.groupby(['user_id','order_id']).count().reset_index().groupby('user_id').mean()
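The two merges above use different join types: `merge` defaults to `how='inner'`, while the second call asks for `how='left'`. The sketch below shows the difference on made-up rows, where product id 999 is a hypothetical id that has no match in the products table.

```python
import pandas as pd

# Made-up rows; product_id 999 has no match in products.
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 999, 100]})
products = pd.DataFrame({"product_id": [100, 200], "product_name": ["Banana", "Milk"]})

inner = prior.merge(products, on="product_id")             # default how="inner": unmatched rows are dropped
left = prior.merge(products, on="product_id", how="left")  # keeps every prior row, fills NaN where unmatched

print(len(inner), len(left))
print(left["product_name"].isna().sum())
```

Comparing row counts before and after a merge is a quick way to notice rows that were silently dropped or duplicated.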

groupby: group rows by category

# List of products per order_id
train.groupby('order_id')['product_id'].apply(list)

# any(): True if the order (order_id) contains Banana at least once
train.groupby('order_id')['banana'].any().value_counts(normalize=True)

#filtering beer_servings.mean() by continent
drinks.groupby('continent').beer_servings.mean()

#only 'Africa'
drinks[drinks.continent=='Africa'].beer_servings.mean()

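#agg: lets us specify multiple aggregation functions at once
drinks[drinks.continent=='Africa'].beer_servings.agg(['count', 'min', 'max', 'mean'])

#no column selected: mean of every numeric column
#(numeric_only=True skips non-numeric columns, which raise an error in pandas 2.x)
drinks.groupby('continent').mean(numeric_only=True)

#visualization
%matplotlib inline
drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar')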
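The filter-then-aggregate pattern above can also be applied to all groups at once by calling `agg` on the grouped column. The sketch below uses a tiny made-up `drinks` frame (the countries and numbers are invented and not taken from any real dataset) with the same column names as the snippets above.

```python
import pandas as pd

# Tiny made-up stand-in for the drinks table.
drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Africa", "Africa", "Europe", "Europe"],
    "beer_servings": [10, 30, 200, 100],
})

# One row per continent, one column per aggregation function.
summary = drinks.groupby("continent")["beer_servings"].agg(["count", "min", "max", "mean"])
print(summary)
```

Selecting the column before aggregating keeps text columns such as `country` out of the calculation.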

References

Pandas Cheat Sheet

GroupBy

Group by: split-apply-combine

Reference

More information on this topic (n232_data-wrangling) can be found at https://velog.io/@ssu_hyun/n232data-wrangling

