Optimizing the Decision Threshold for Campaign Response Prediction

Using Kaggle's loan campaign response dataset, this post optimizes the threshold applied to the predicted probabilities produced by a logistic regression model.
https://www.kaggle.com/dineshmk594/loan-campaign

In short, after fitting a logistic regression, we determine the predicted response probability above which a user should be targeted by the campaign.

First, prepare the dataset.

import pandas as pd

df = pd.read_csv('PL_XSELL.csv', index_col=0)
Y = df.TARGET
x = df.drop('TARGET', axis=1)

One-hot encode the non-numeric columns.

from sklearn.preprocessing import OneHotEncoder

# Name the columns from enc.categories_ so the labels follow the encoder's own column order;
# set(...) has no guaranteed order and could mislabel the columns.
enc = OneHotEncoder()
x_gender = pd.DataFrame(enc.fit_transform(x[['GENDER']]).toarray(), columns=enc.categories_[0], index=x.index)
x_occupation = pd.DataFrame(enc.fit_transform(x[['OCCUPATION']]).toarray(), columns=enc.categories_[0], index=x.index)
x_acc_type = pd.DataFrame(enc.fit_transform(x[['ACC_TYPE']]).toarray(), columns=enc.categories_[0], index=x.index)
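As a point of comparison, the same encoding step can be sketched with `pd.get_dummies`, which prefixes each new column with the source column name so that labels cannot collide. The toy frame and its values (including the `BALANCE` column) are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data; all values here are hypothetical.
toy = pd.DataFrame({
    "GENDER": ["F", "M", "O", "M"],
    "ACC_TYPE": ["SA", "CA", "SA", "SA"],
    "BALANCE": [100.0, 250.5, 0.0, 42.0],
})

# Each listed column is replaced by one indicator column per category,
# named "<column>_<category>".
encoded = pd.get_dummies(toy, columns=["GENDER", "ACC_TYPE"], dtype=float)
print(encoded.columns.tolist())
```

Whether this is preferable to `OneHotEncoder` depends on the workflow: `get_dummies` is convenient for a one-off analysis, while a fitted encoder can be reused on new data.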

Convert the account open date (ACC_OP_DATE) into the number of days elapsed up to today.

from datetime import datetime

now = datetime.now()

# Normalize '-' separators to '/', parse as month/day/year, then count the days up to now.
acc_days = x.ACC_OP_DATE.apply(lambda s: s.replace('-', '/')).apply(lambda s: (now - datetime.strptime(s, '%m/%d/%Y')).days)
x_acc_days = pd.DataFrame(acc_days)
x_acc_days.columns = ['ACC_DAYS']
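The date conversion can also be sketched in vectorized form with `pd.to_datetime`. The sample dates below are hypothetical, and the fixed reference date is an assumption made here so that the feature does not change from one run to the next (the original uses the current time).

```python
import pandas as pd

# Hypothetical sample values using both separator styles handled above.
dates = pd.Series(["01/15/2010", "12-31-2012", "03/01/2015"], name="ACC_OP_DATE")

# Assumed fixed reference date, chosen only for reproducibility.
reference = pd.Timestamp("2020-01-01")

# Normalize separators, parse as month/day/year, then take the difference in days.
parsed = pd.to_datetime(dates.str.replace("-", "/", regex=False), format="%m/%d/%Y")
acc_days_demo = (reference - parsed).dt.days.rename("ACC_DAYS")
print(acc_days_demo.tolist())
```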

Because the data already has an AGE column, drop the AGE_BKT column.
Then combine the preprocessed pieces from above.

x_ = x.drop(columns=['GENDER', 'OCCUPATION', 'ACC_TYPE', 'AGE_BKT', 'ACC_OP_DATE'])

new_x = pd.concat([x_, x_gender, x_occupation, x_acc_type, x_acc_days], axis=1)

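Standardize the features before fitting the model.
(Strictly speaking, the data should be split into a training set and a test set, but since this is a demo, the split is skipped.)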

from sklearn.preprocessing import StandardScaler
std = StandardScaler()
X_std = std.fit_transform(new_x)

Train a logistic regression model.

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(y=Y, X=X_std)
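The caveat about splitting into training and test sets could look like the following sketch. It uses synthetic data from `make_classification` as a stand-in for `new_x` and `Y`; the sample size, class balance, and `max_iter` value are assumptions for illustration, not settings from the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for new_x and Y (assumption: the real data is not loaded here).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Stratify so both folds keep roughly the same share of responders.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training fold only and reuses it on the test fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Probability of the positive class for held-out users.
test_prob = model.predict_proba(X_test)[:, 1]
print(test_prob[:5])
```

With a split like this, the threshold would be chosen on held-out probabilities rather than on the data the model was fitted to.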

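Vary the threshold and build a confusion matrix for each value.
Assuming the campaign targets every user whose predicted probability exceeds the threshold, multiply the number of targeted users by the cost per user (assumed to be 10 here) to get the total cost.
Only the true positives respond to the campaign, so multiply the number of true positives by the expected revenue per response (assumed to be 100 here) to get the total revenue.
The profit is the difference between revenue and cost.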

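import numpy as np
from sklearn.metrics import confusion_matrix

def generate_confusion_matrix(threshold, y_true, y_prob):
    # y_prob[:, 1] is the predicted probability of the positive class (responded).
    y_pred = (y_prob[:, 1] > threshold).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even when every prediction falls in one class.
    return confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1])

def calculate_profit(c_mat, gain, cost):
    sum_gain = gain * c_mat[1, 1]        # revenue from true positives
    sum_cost = cost * c_mat[:, 1].sum()  # cost of everyone targeted (predicted positive)
    return sum_gain - sum_cost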

y_prob = clf.predict_proba(X_std)
thresholds = np.linspace(0,1)
profits = []
gain = 100
cost = 10
for threshold in thresholds:
    c_mat = generate_confusion_matrix(threshold, y_true=Y, y_prob=y_prob)
    profit = calculate_profit(c_mat, gain, cost)
    profits.append(profit)
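To make the profit calculation concrete, here is a small hand-checkable sketch. The eight labels and probabilities are made up, and `profit_at` is a hypothetical helper that combines the two functions above; it is not part of the original code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and positive-class probabilities for eight users.
y_true_demo = np.array([1, 0, 1, 0, 0, 0, 0, 0])
p_pos_demo = np.array([0.9, 0.8, 0.4, 0.3, 0.05, 0.04, 0.03, 0.02])

def profit_at(threshold, gain=100, cost=10):
    """Profit when every user with probability above `threshold` is targeted."""
    y_pred = (p_pos_demo > threshold).astype(int)
    c_mat = confusion_matrix(y_true_demo, y_pred, labels=[0, 1])
    targeted = c_mat[:, 1].sum()   # predicted positives: everyone contacted
    responders = c_mat[1, 1]       # true positives: contacted users who respond
    return gain * responders - cost * targeted

for t in (0.0, 0.1, 0.5):
    print(t, profit_at(t))
```

At a threshold of 0.5, two users are targeted and one responds (100 - 20 = 80). At 0.1, four are targeted and two respond (200 - 40 = 160). At 0.0, all eight are targeted but still only two respond (200 - 80 = 120).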

Plot the results.

pd.DataFrame(profits, index=thresholds, columns=['profit']).plot()
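Besides reading the maximum off the plot, the best threshold can be picked programmatically with `np.argmax`. The sketch below uses a placeholder profit curve with made-up numbers, since the real `profits` values are not reproduced here.

```python
import numpy as np

# Placeholder curve (hypothetical values) standing in for `thresholds` and `profits`.
thresholds_demo = np.linspace(0, 1, 11)
profits_demo = np.array([120, 160, 150, 150, 80, 80, 80, 80, 90, 0, 0])

# np.argmax returns the index of the first maximum if there are ties.
best_idx = int(np.argmax(profits_demo))
best_threshold = thresholds_demo[best_idx]
print(best_threshold, profits_demo[best_idx])
```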

The optimal threshold turned out to be roughly 0.1.
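A value near 0.1 is consistent with a simple break-even argument. If the predicted probabilities are well calibrated (an assumption that is not checked here), contacting a user with response probability p yields an expected profit of gain * p - cost, which is positive exactly when p > cost / gain. The helper below is a hypothetical illustration of that ratio.

```python
def break_even_threshold(gain, cost):
    """Probability above which contacting a user has positive expected profit.

    Assumes calibrated probabilities, a fixed cost per contact,
    and a fixed gain per response.
    """
    if gain <= 0:
        raise ValueError("gain must be positive")
    return cost / gain

print(break_even_threshold(gain=100, cost=10))
```

With the values assumed in this post (gain 100, cost 10), the ratio is 10 / 100 = 0.1.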

Author And Source

Original article: https://qiita.com/takekazu/items/3fe19a8e4df7721174fe

Author attribution: the original author's information is available at the original URL. Copyright belongs to the original author.
