Credit Card User Delinquency Prediction AI Competition


AI, Dacon, competition, ML

1. Topic

Develop an algorithm that predicts the degree of credit card delinquency from credit card users' data.

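2. Background

Credit card companies compute a credit score from the personal information and data that applicants submit. They use this score to predict how likely an applicant is to default on future obligations or to fall behind on credit card payments.

Many companies in the financial industry now want to use artificial intelligence (AI) in their services. The goal here is to develop an AI algorithm that predicts a user's degree of delinquency and to uncover insights that can be proposed to the financial industry.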

3. Competition description

Predict the degree of credit card delinquency from credit card users' personal information data.
(Evaluation metric: log_loss)
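To make the metric concrete, here is a minimal sketch of multiclass log loss in plain Python. It assumes three credit classes labeled 0, 1, and 2 and uses made-up probabilities; `multiclass_log_loss` is a hypothetical helper, not part of the competition code.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip so that a confident wrong prediction does not produce log(0).
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)


# Hypothetical predictions for three users over the classes 0, 1, 2.
y_true = [2, 0, 1]
y_prob = [
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.3, 0.3, 0.4],
]
loss = multiclass_log_loss(y_true, y_prob)
print(f"log_loss: {loss:.4f}")
```

Lower is better. A model that always predicts 1/3 for each of three classes would score ln 3 (about 1.0986), so a useful model should land below that.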

4. Description of the data variables

index

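Sex: gender

Annual_income: annual income

income_type: income category ['Commercial associate', 'Working', 'State servant', 'Pensioner', 'Student']

Education: education level ['Higher education', 'Secondary / secondary special', 'Incomplete higher', 'Lower secondary', 'Academic degree']

family_type: marital status ['Married', 'Civil marriage', 'Separated', 'Single / not married', 'Widow']

house_type: housing type ['Municipal apartment', 'House / apartment', 'With parents', 'Co-op apartment', 'Rented apartment', 'Office apartment']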

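DAYS_BIRTH: date of birth, counted backward in days from the data collection date (0); -1 means the user was born the day before data collection

working_day: employment start date, counted backward in days from the data collection date (0); -1 means the user started working the day before data collection

FLAG_MOBIL: whether the user has a mobile phone

work_phone: whether the user has a work phone

phone: whether the user has a phone

email: whether the user has an email address

occyp_type: occupation type

begin_month: month the credit card was issued, counted backward in months from the data collection month (0); -1 means the card was issued the month before data collection

car_reality: whether the user owns a car and real estate [0: owns neither, 1: owns only one, 2: owns both]

credit: credit level based on the user's delinquency on credit card payments
=> the lower the value, the higher the user's creditworthiness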
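The backward-counted columns can be hard to read at a glance. As a small sketch, the helpers below turn them into positive durations. The names `days_to_years` and `months_since_issue` are hypothetical, and a year is approximated as 365 days.

```python
def days_to_years(days_birth):
    """Convert a backward day count (e.g. -12000) to whole years."""
    # Negate because the column counts backward from data collection (0).
    return (-days_birth) // 365


def months_since_issue(begin_month):
    """Convert a backward month count (e.g. -6) to months since issuance."""
    return -begin_month


age = days_to_years(-12000)
card_age = months_since_issue(-6)
print(age, card_age)
```

Derived values like these could also be added as extra features, although the post itself only rescales the raw columns.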

5. EDA and data preprocessing

Import the required libraries

import os, random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import VotingClassifier, RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from tensorflow.keras.utils import to_categorical

# Assumed value: SEED is used below, but its definition is not shown in the post.
SEED = 42

Load the data

train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Prediction_of_default_rate/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Prediction_of_default_rate/test.csv')
submission = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Prediction_of_default_rate/sample_submission.csv')

Data preprocessing

train.info()

# Handle missing values
train['occyp_type'] = train['occyp_type'].fillna('Null')
test['occyp_type'] = test['occyp_type'].fillna('Null')

# Binary encoding (female: 0, male: 1)
train['gender'] = train['gender'].replace({'F':0, 'M':1})
test['gender'] = test['gender'].replace({'F':0, 'M':1})


# Drop uninformative variables
train.drop('FLAG_MOBIL', axis=1, inplace=True)
del train['index']

test.drop('FLAG_MOBIL', axis=1, inplace=True)
del test['index']

# one-hot encoding
train = pd.get_dummies(train)
test = pd.get_dummies(test)
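Encoding train and test separately can produce different column sets when a category appears in only one of them. One possible safeguard, sketched here on toy frames, is to reindex the test columns to match the training features. The frame names and values are illustrative only.

```python
import pandas as pd

# Toy data: 'Student' appears in train but not in test.
train_demo = pd.DataFrame(
    {"income_type": ["Working", "Pensioner", "Student"], "credit": [0, 1, 2]}
)
test_demo = pd.DataFrame({"income_type": ["Working", "Pensioner"]})

train_enc = pd.get_dummies(train_demo)
test_enc = pd.get_dummies(test_demo)

# Use the training feature columns (everything except the target) as the reference.
feature_cols = train_enc.columns.drop("credit")

# Add any missing dummy columns as 0 and put the columns in the same order.
test_aligned = test_enc.reindex(columns=feature_cols, fill_value=0)
print(list(test_aligned.columns))
```

After this step the model sees the same features in the same order at training and prediction time.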

# Scale numeric features to the 0-1 range.
# Columns counted backward from 0 are divided by their minimum (the most negative value).
train['DAYS_BIRTH'] = train['DAYS_BIRTH'] / train['DAYS_BIRTH'].min()
test['DAYS_BIRTH'] = test['DAYS_BIRTH'] / test['DAYS_BIRTH'].min()
train['working_day'] = train['working_day'] / train['working_day'].min()
test['working_day'] = test['working_day'] / test['working_day'].min()
train['begin_month'] = train['begin_month'] / train['begin_month'].min()
test['begin_month'] = test['begin_month'] / test['begin_month'].min()

# Annual income is divided by its maximum.
train['Annual_income'] = train['Annual_income'] / train['Annual_income'].max()
test['Annual_income'] = test['Annual_income'] / test['Annual_income'].max()
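The scaling above divides each data set by its own minimum or maximum. A common alternative, sketched below on toy values, is to learn the divisor from the training data only and reuse it for the test data, so that the same raw value always maps to the same scaled value. The frames and numbers are illustrative assumptions.

```python
import pandas as pd

train_demo = pd.DataFrame({"DAYS_BIRTH": [-20000, -15000, -10000]})
test_demo = pd.DataFrame({"DAYS_BIRTH": [-18000, -12000]})

# Learn the divisor from the training column only.
birth_min = train_demo["DAYS_BIRTH"].min()

# Apply the same divisor to both sets.
train_scaled = train_demo["DAYS_BIRTH"] / birth_min
test_scaled = test_demo["DAYS_BIRTH"] / birth_min
print(test_scaled.tolist())
```

With this approach the test values are not guaranteed to stay inside 0-1 if the test set contains a more extreme value than the training set.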

Separate the training data and the test data

train_x = train.drop('credit', axis=1)
train_y = train[['credit']]
test_x = test
print(train_x.shape, train_y.shape, test_x.shape)

X_train, X_val, y_train, y_val = train_test_split(train_x, train_y,
                                                  stratify=train_y,
                                                  test_size=0.2,
                                                  random_state=SEED)
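The `stratify` argument keeps the class proportions of the target the same in the training and validation parts. The sketch below illustrates this on synthetic, imbalanced labels; the 10/20/70 split is made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 10/20/70 class imbalance (illustrative only).
y = np.array([0] * 10 + [1] * 20 + [2] * 70)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# Count how many samples of each class ended up in each part.
train_counts = np.bincount(y_tr)
val_counts = np.bincount(y_va)
print(train_counts, val_counts)
```

Without `stratify`, a rare class could be under-represented in the validation set by chance, which makes the validation log loss noisier.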

6. Training

Random Forest Classifier

model_RF = RandomForestClassifier(n_estimators=500, max_features=16, random_state=SEED)
model_RF.fit(X_train, y_train)
y_pred = model_RF.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), y_pred)}")
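As a side note, `log_loss` from scikit-learn also accepts integer class labels directly, so the one-hot step is optional. The sketch below compares both forms on made-up probabilities, assuming three classes labeled 0, 1, and 2.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true classes and predicted probabilities for three users.
y_true = np.array([2, 0, 1])
y_prob = np.array([
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.3, 0.3, 0.4],
])

# Integer labels; `labels` fixes the column order of y_prob.
loss_labels = log_loss(y_true, y_prob, labels=[0, 1, 2])

# One-hot labels, similar in effect to what to_categorical produces.
one_hot = np.eye(3, dtype=int)[y_true]
loss_onehot = log_loss(one_hot, y_prob)
print(loss_labels, loss_onehot)
```

Passing `labels` explicitly is helpful when a validation fold happens to miss one of the classes.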

Decision Tree Classifier

model_TREE = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=SEED)
model_TREE.fit(X_train, y_train)
y_pred = model_TREE.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), y_pred)}")

LGBM Classifier

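from lightgbm import early_stopping

model_LGBM = LGBMClassifier(n_estimators=10000, num_leaves=50, subsample=0.8, learning_rate=0.01,
                            min_child_samples=60, max_depth=20)
evals = [(X_val, y_val)]
# Stop once the validation log loss has not improved for 100 rounds.
# The callback form is used because the early_stopping_rounds and verbose
# arguments of fit() were removed in LightGBM 4.0.
model_LGBM.fit(X_train, y_train, eval_set=evals, eval_metric='logloss',
               callbacks=[early_stopping(stopping_rounds=100, verbose=False)])
pred_y = model_LGBM.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), pred_y)}")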

BaggingClassifier

model_BAG1 = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=600,
    max_samples=0.7,
    max_features=0.6, 
    bootstrap=True,
    n_jobs=-1 
)
model_BAG1.fit(X_train, y_train)
y_pred = model_BAG1.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), y_pred)}")

model_BAG2 = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=550,
    max_samples=0.7,
    max_features=0.6, 
    bootstrap=True,
    n_jobs=-1 
)
model_BAG2.fit(X_train, y_train)
y_pred = model_BAG2.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), y_pred)}")

VotingClassifier

model_VOTING = VotingClassifier(estimators=[('LGBM', model_LGBM),
                                            ('BAGClassifier1', model_BAG1),
                                            ('BAGClassifier2', model_BAG2),
                                            ('RF', model_RF),
                                            ('TREE', model_TREE)],
                                voting='soft')
model_VOTING.fit(X_train, y_train)
pred_y = model_VOTING.predict_proba(X_val)
print(f"log_loss: {log_loss(to_categorical(y_val['credit']), pred_y)}")
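With `voting='soft'` and no weights, the ensemble averages the class probabilities of its members and picks the class with the highest average. The following sketch reproduces that idea with NumPy for two hypothetical models and two users.

```python
import numpy as np

# Made-up class probabilities from two models (rows: users, columns: classes 0-2).
proba_a = np.array([[0.7, 0.2, 0.1],
                    [0.2, 0.5, 0.3]])
proba_b = np.array([[0.5, 0.3, 0.2],
                    [0.1, 0.3, 0.6]])

# Soft voting: unweighted mean of the probability matrices.
avg = np.mean([proba_a, proba_b], axis=0)

# Predicted class is the one with the highest averaged probability.
pred = avg.argmax(axis=1)
print(avg)
print(pred)
```

For the second user the two models disagree (class 1 versus class 2), and the average favors class 2 because the second model is more confident. Since the competition is scored on log loss, the averaged probabilities themselves are what gets submitted.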

7. Test

pred = model_VOTING.predict_proba(test)
# Write the class probabilities into every column after the first, by position.
submission.iloc[:, 1:] = pred
submission.to_csv('/content/drive/MyDrive/Colab Notebooks/Prediction_of_default_rate/7120501821010008.csv',index=False)
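Filling the submission file by position can be sketched as follows. The column names `index`, `0`, `1`, `2` and the probabilities are assumptions made for the example; the real layout comes from sample_submission.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical submission template: an id column followed by one column per class.
submission_demo = pd.DataFrame(
    {"index": [0, 1], "0": [0.0, 0.0], "1": [0.0, 0.0], "2": [0.0, 0.0]}
)

# Made-up predicted probabilities with shape (n_rows, n_classes).
pred_demo = np.array([[0.1, 0.2, 0.7],
                      [0.6, 0.3, 0.1]])

# The prediction matrix must match the columns being filled.
assert pred_demo.shape == submission_demo.iloc[:, 1:].shape

# iloc selects by position, so the string column labels do not matter.
submission_demo.iloc[:, 1:] = pred_demo
print(submission_demo)
```

Checking the shape first catches a mismatch between the number of predicted classes and the number of submission columns before the file is written.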

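8. Reflections

This was my first competition, so I fell short in many ways, but processing the data and applying the models was genuinely fun, and I fully intend to keep entering competitions. I will study hard and hope to get a better result next time :)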

Reference

More information on this topic (Credit Card User Delinquency Prediction AI Competition) can be found here: https://velog.io/@danbibibi/신용카드-사용자-연체-예측-AI-경진대회

