単語以外の分散表現をword2vecで取得しRNNのEmbedding層に組み込むまで

14196 ワード

word2vec Python 機械学習 Python テキストリンク

本記事でやること

購買履歴データから商品アイテムの分散表現をword2vecを使って取得する。
特定商品に似ているアイテムを表示する。
商品アイテムの分散表現をRNN(リカレントニューラルネットワーク)のEmbedding層に利用する。

本記事でやらないことは以下の通りなので、他の記事を参照してください。

Pythonの各ライブラリー（gensimなど）のinstall方法

対象読者

Python入門者
RNNを学び始めた人

使用言語及びフレームワーク

Python 3.6.3
keras 2.2.2

データセットの用意

今回使用するのはUCIで用意されている購買データになります。
またUCIのデータをユーザー単位で買った順に商品IDで並べたデータをこちらから拝借しましたのでgit cloneでローカル環境にダウンロードしてください。
データは以下のような形になっています。

word2vecで商品の分散表現を取得する

jupyernotebook

import numpy as np
import pandas as pd

# データの読み込み
data_path = ! echo file://$PWD/keras/purchacedata_base.csv
raw_data = pd.read_csv(data_path[0], header = 0)

max_length = 10 # 購入件数を指定
# 購入件数を限定し、Naがあるレコードを削除したデータフレームを取得
df = raw_data[raw_data.columns[0:(max_length + 2)]].dropna().astype(int)
df.head()

わかち書きをしたように、1つのカラムに商品id同士を空白で区切った状態にしたいので以下のような処理を行います。

jupyternotebook


df['items'] = df['P1'].astype(str)
for i in range(2, 11):
    df['items'] += ' '+ df['P'+str(i)].astype(str)

一旦、ユーザー単位でわかち書きのような処理を行ったカラムのみをtext形式でダウンロードします。

jupyternotebook

df["items"].to_csv("data.txt", header = False, index=False)

そして、ダウンロードしたtextを読み込み以下のような方法でword2vecで分散表現を取得します。
word2vec.Word2Vecで分散表現を取得しているのですが、引数のsizeやwindow、min_countは公式のドキュメントを参照してください。

jupyternotebook

from gensim.models import word2vec

text_path = ! echo file://$PWD/data.txt

sentences = word2vec.Text8Corpus(text_path[0])
model = word2vec.Word2Vec(sentences, size=100, window=1, min_count=1)
model.save("item.model")

似ている商品を表示

modelにはmost_similarという関数が用意しているので引数に商品のIDを入れるだけでokです。

jupyternotebook

# 商品ID = 117と似ている商品を表示
model.most_similar("117")

出力

[('2', 0.925855815410614),
 ('1', 0.9179891347885132),
 ('4', 0.9112440943717957),
 ('6', 0.9003599882125854),
 ('14', 0.8991273045539856),
 ('31', 0.8983150124549866),
 ('10', 0.8967700600624084),
 ('13', 0.895893931388855),
 ('41', 0.8949518799781799),
 ('7', 0.8913787603378296)]

商品ID=2や商品ID=1が全体の購買履歴の中で頻度が多いので、その影響が出てしまっていると思われます。
一応、ここまでで各商品の分散表現を取得出来ました。

RNNのEmbedding層に組み込む

Embedding層の重みに上記で作成した単語の分散表現を利用します。
まずは、商品IDに対応するIndexを作成します。のちにこのIndexを利用してEmbedding層の空の重みに商品ID毎の分散表現を順に入力をさせていきます。

jupyternotebook

# kerasのTokemizerを利用します
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df["items"])
# 商品idをindexに変換する
word_index = tokenizer.word_index
print(word_index)

以下のようなディクショナリー形式で{'商品ID', Index}が得られます。

出力

{'552': 652, '910': 450, '2538': 2410, '2394': 2074, '2359': 1827, '962': 1173,...}

そして、以下の処理でEmbedding層の空の重みに分散表現を順に入力させていきます。

embedding_matrixは商品数+1個の行と分散表現の取得時に設定した次元数の大きさになります。（要素はゼロで埋めておきます）
word_index.items()で単語とIndexを順に取得し、model.wv[word]で取得した分散表現をembedding_matrixに入れていきます。

jupyternotebook

embedding_matrix = np.zeros((len(word_index)+1, 100))
for word, i in word_index.items():
    embedding_matrix[i] = model.wv[word]

上記でEmbedding層の重みに分散表現を組み込むことが出来ました。

あとは、モデルを組み立てていくだけですが、model.add(Embedding())の引数weightsで上記で定義した分散表現を設定し、Embedding層の重みは訓練しないので、trainableはFalseにするので注意してください。

jupyternotebook

from keras.models import Sequential 
from keras.layers import Embedding, SimpleRNN, Dense

model = Sequential()
model.add(Embedding(len(word_index)+1,
                    100,
                    weights=[embedding_matrix],
                    trainable=False))
model.add(SimpleRNN(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=["acc"])

終わりに

時間がある時にSageMakerのBlazingTextやSparkでの分散表現の取得に試みたいと思います。

Author And Source

この問題について(単語以外の分散表現をword2vecで取得しRNNのEmbedding層に組み込むまで), 我々は、より多くの情報をここで見つけました https://qiita.com/kurakura0916/items/c8d3f2eb65917aae98a9

著者帰属：元の著者の情報は、元のURLに含まれています。著作権は原作者に属する。

Content is automatically searched and collected through network algorithms . If there is a violation . Please contact us . We will adjust (correct author information ,or delete content ) as soon as possible .