Working through the book 「15Stepで踏破自然言語処理アプリケーション開発入門」: Chapter 3, Step 09 notes, "Classifiers Using Neural Networks"


Contents

These are my personal notes from working through 15stepで踏破自然言語処理アプリケーション入門.
This time I cover Chapter 3, Step 09 and write down the points that stood out to me.

Setup

Personal Mac: macOS Mojave, version 10.14.6
Docker version: 19.03.2 for both Client and Server

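Chapter overview

Implement a multiclass classifier using the multilayer perceptron covered in the previous chapter.

softmax: activation function for multiclass classification ⇄ sigmoid (for 2-class classification)
categorical_crossentropy: loss function for multiclass classification ⇄ binary_crossentropy (for 2-class classification)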

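09.1 A multilayer perceptron as a multiclass classifier

A 2-class classifier and a multiclass classifier differ in the number of output-layer units and in how the teacher labels are given.

2-class classifier: the output layer is a single unit, which outputs the predicted class as 0 or 1
- Class-ID representation (each label is the integer class ID itself)
- 0, -> Class ID is 0
- 1, -> Class ID is 1
- 2, -> Class ID is 2
Multiclass classifier: the output layer has one unit per class; only the unit matching the class ID is 1 and all others are 0
- One-hot representation
- [1, 0, 0], -> Class ID is 0
- [0, 1, 0], -> Class ID is 1
- [0, 0, 1], -> Class ID is 2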
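
The two label representations above can be converted into each other mechanically. The following is a minimal pure-Python sketch of that conversion; the function names `to_one_hot` and `to_class_ids` are made up for this note and are not part of Keras or the book's code.

```python
def to_one_hot(class_ids, num_classes):
    """Convert integer class IDs into one-hot vectors."""
    return [[1 if i == cid else 0 for i in range(num_classes)]
            for cid in class_ids]


def to_class_ids(rows):
    """Recover each class ID as the index of the largest element in the row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


labels = [0, 1, 2]
encoded = to_one_hot(labels, num_classes=3)
print(encoded)                # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(to_class_ids(encoded))  # [0, 1, 2]
```

Because `to_class_ids` only looks for the largest element, it also works on a row of predicted probabilities such as `[0.1, 0.7, 0.2]`, which is how a multiclass output is usually turned back into a class ID.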

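Activation function for multiclass classification

softmax is the usual choice.

Each output falls between 0 and 1
The outputs of the layer it is applied to sum to 1
The gap between the larger and smaller outputs of the layer tends to widen

Passing values through softmax maps them into the range between 0 and 1. Because softmax exponentiates its inputs, when the inputs are well separated the largest value is pushed toward 1 and the rest toward 0.

Since the large value tends to be concentrated in a single unit, this is useful for multiclass classification
The classification result can be treated as a probability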
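
As a rough illustration of the softmax properties listed above, here is a small pure-Python sketch. It is not the Keras implementation; the input values are arbitrary and chosen only for demonstration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 4.0]
probs = softmax(logits)
print(probs)       # roughly [0.042, 0.114, 0.844]
print(sum(probs))  # 1.0 up to floating-point rounding

# The input ratio largest/smallest is 4, while the output ratio is e**3 (about 20).
print(probs[2] / probs[0])
```

The widening depends on the differences between inputs, not their ratios: inputs that are close together (for example 0.1 and 0.2) come out nearly equal.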

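2-class classification and multiclass classification

Just as the 0 or 1 output of a single unit is enough for 2-class classification, N-class classification is possible in theory with log2(N) units (rounded up), by combining their 0 or 1 outputs.
However, the low-order units in particular would have to learn the same 0 or 1 for many otherwise unrelated classes, which is unintuitive and reportedly does not train well.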
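
To make the binary-coding idea concrete, the sketch below encodes class IDs as bit patterns and lists which classes would share the same target on the lowest-order unit. The helper `binary_code` is a hypothetical name used only in this note.

```python
def binary_code(class_id, num_classes):
    """Encode a class ID as a list of bits, most significant bit first."""
    n_bits = max(1, (num_classes - 1).bit_length())  # ceil(log2(N)) without floats
    return [(class_id >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


num_classes = 8
codes = {cid: binary_code(cid, num_classes) for cid in range(num_classes)}
for cid, bits in codes.items():
    print(cid, bits)

# Classes for which the lowest-order unit would have to output 1.
lowest_bit_on = [cid for cid, bits in codes.items() if bits[-1] == 1]
print(lowest_bit_on)  # [1, 3, 5, 7]
```

Classes 1, 3, 5 and 7 have nothing in common except their ID being odd, yet one unit would have to fire for all of them, which is the awkwardness the paragraph above describes.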

Loss function for multiclass classification

Where 2-class classification uses binary_crossentropy, multiclass classification uses categorical_crossentropy.
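
The relationship between the two losses can be checked by hand. The sketch below computes both for a single sample in pure Python; it omits the batch averaging and clipping that Keras applies, and the probabilities are made-up values.

```python
import math


def binary_crossentropy(y_true, p):
    """Loss for one sample: y_true is 0 or 1, p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def categorical_crossentropy(one_hot, probs):
    """Loss for one sample: one_hot is the label vector, probs the predicted distribution."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs) if t > 0)


p = 0.8
# With two classes, the categorical form on [1 - p, p] matches the binary form.
print(binary_crossentropy(1, p))
print(categorical_crossentropy([0, 1], [1 - p, p]))

# With three classes only the probability assigned to the true class matters.
print(categorical_crossentropy([0, 0, 1], [0.1, 0.2, 0.7]))
```

In other words, categorical_crossentropy is the natural generalization of binary_crossentropy to an output layer with one unit per class.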

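Using a list of class IDs as teacher data

For N-class classification, the output layer needs N neurons.
The teacher labels then cannot be the class IDs themselves; they have to be given in a form that assigns 0 or 1 to each of the N neurons.

There are two ways to handle this:
Convert the labels to the one-hot representation with keras.utils.to_categorical
Set the loss function to sparse_categorical_crossentropy, which accepts the non-one-hot (class ID) representation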
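
Both options lead to the same loss value, which the following pure-Python sketch illustrates for one sample. It only mirrors the arithmetic; it does not call Keras, and the function name and numbers are assumptions for illustration.

```python
import math


def sparse_categorical_crossentropy(class_id, probs):
    """Loss for one sample when the label is given as an integer class ID."""
    return -math.log(probs[class_id])


probs = [0.1, 0.2, 0.7]  # hypothetical softmax output for 3 classes
class_id = 2

# Option 1: convert the class ID to one-hot, then sum over all units.
one_hot = [1 if i == class_id else 0 for i in range(len(probs))]
loss_one_hot = -sum(t * math.log(q) for t, q in zip(one_hot, probs))

# Option 2: index the predicted probability directly with the class ID.
loss_sparse = sparse_categorical_crossentropy(class_id, probs)

print(loss_one_hot, loss_sparse)
```

The sparse form simply skips building the one-hot vector, which saves memory when the number of classes is large.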

09.2 Applying it to the dialogue agent

Implementation examples


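Implementation pattern: Basic

# Setup
- Set the model's input dimension separately
- Set the model's output dimension separately
- At training time
  - The teacher labels must be converted to the one-hot representation
- At prediction time
  - The one-hot representation must be converted back to class IDs

# Execution
- At training time
  - Run the vectorizer's fit_transform
  - Run the classifier's fit
- At prediction time
  - Run the vectorizer's transform
  - Run the classifier's predict

Implementation pattern: Keras scikit-learn API combined with sklearn.pipeline.Pipeline

# Setup
- Set the model's input dimension separately
- Set the model's output dimension separately

# Execution
- At training time
  - Run the vectorizer's fit
  - Run the pipeline's fit
- At prediction time
  - Run the pipeline's predict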

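keras.wrappers.scikit_learn.KerasClassifier performs the equivalent of to_categorical in fit() and the equivalent of np.argmax in predict().
Using a pipeline also lets the vectorizer's and the classifier's fit() and predict() be run together, but note that the vectorizer's fit() still has to be called separately, because the input dimension must be known when the model is configured.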
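
The label handling described above can be pictured with a toy wrapper. This is only a sketch of the idea, not the KerasClassifier source: `PriorModel` and `ClassIdWrapper` are hypothetical classes invented for this note, and the "model" ignores its features entirely.

```python
class PriorModel:
    """Toy stand-in for a network: predicts the label frequencies seen in training."""

    def fit(self, X, Y_one_hot):
        n = len(Y_one_hot)
        self.priors = [sum(col) / n for col in zip(*Y_one_hot)]
        return self

    def predict_proba(self, X):
        return [list(self.priors) for _ in X]


class ClassIdWrapper:
    """Hypothetical wrapper: takes class IDs, trains on one-hot, returns class IDs."""

    def __init__(self, model, num_classes):
        self.model = model
        self.num_classes = num_classes

    def fit(self, X, class_ids):
        # Equivalent in spirit to to_categorical: class IDs -> one-hot rows.
        one_hot = [[1 if i == cid else 0 for i in range(self.num_classes)]
                   for cid in class_ids]
        self.model.fit(X, one_hot)
        return self

    def predict(self, X):
        # Equivalent in spirit to np.argmax: probabilities -> class IDs.
        return [max(range(self.num_classes), key=row.__getitem__)
                for row in self.model.predict_proba(X)]


clf = ClassIdWrapper(PriorModel(), num_classes=3)
clf.fit([[0.0], [1.0], [2.0], [3.0]], [2, 0, 2, 1])
print(clf.predict([[5.0]]))  # [2]
```

The caller only ever sees class IDs on both sides, which is what makes such a wrapper fit into a scikit-learn style pipeline.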

Additions and changes since the earlier chapter (Step06)

Output-layer activation function: sigmoid → softmax
Loss function: binary_crossentropy → categorical_crossentropy
Classifier: RandomForestClassifier → KerasClassifier

    def _build_mlp(self, input_dim, hidden_units, output_dim):
        mlp = Sequential()
        mlp.add(Dense(units=hidden_units,
                      input_dim=input_dim,
                      activation='relu'))
        mlp.add(Dense(units=output_dim, activation='softmax'))  # 1: output-layer activation function
        mlp.compile(loss='categorical_crossentropy',  # 2: loss function
                    optimizer='adam')

        return mlp

    def train(self, texts, labels):
~~

        feature_dim = len(vectorizer.get_feature_names())
        n_labels = max(labels) + 1

        # 3: classifier
        classifier = KerasClassifier(build_fn=self._build_mlp,
                                     input_dim=feature_dim,
                                     hidden_units=32,
                                     output_dim=n_labels)
~~
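
The excerpt above derives both model dimensions from the data: the input dimension from the fitted vectorizer's vocabulary and the output dimension from the labels. The sketch below reproduces that reasoning with a toy vocabulary builder in pure Python; `build_vocabulary` and the sample data are assumptions for illustration, not the book's TfidfVectorizer.

```python
def build_vocabulary(tokenized_texts):
    """Assign an index to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


texts = [["hello", "world"], ["hello", "there"], ["good", "night"]]
labels = [0, 0, 1]

vocab = build_vocabulary(texts)  # plays the role of vectorizer.fit()
feature_dim = len(vocab)         # input dimension of the network
n_labels = max(labels) + 1       # output dimension, assuming IDs run 0..N-1

print(feature_dim, n_labels)  # 5 2
```

This is why the vectorizer has to be fitted before the model is built: the vocabulary size is unknown until the training texts have been seen. Note also that `max(labels) + 1` is only correct when class IDs are consecutive integers starting at 0.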

Execution results

# Adjust the module name imported in evaluate_dialogue_agent.py as needed
from dialogue_agent_sklearn_pipeline import DialogueAgent

$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python evaluate_dialogue_agent.py
0.65957446

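Baseline implementation (Step01): 37.2%
Preprocessing added (Step02): 43.6%
Preprocessing + feature extraction changed (Step04): 58.5%
Preprocessing + feature extraction changed + classifier changed (Step06): 61.7%
Preprocessing + feature extraction changed + classifier changed (Step09): 66.0%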

Exercise

Add hidden_units and classifier__epochs as arguments of the DialogueAgent class's train method.

dialogue_agent_sklearn_pipeline.py

    def train(self, texts, labels, hidden_units=32, classifier__epochs=100):
~~
        classifier = KerasClassifier(build_fn=self._build_mlp,
                                     input_dim=feature_dim,
                                     hidden_units=hidden_units,
                                     output_dim=n_labels)

~~
        pipeline.fit(texts, labels, classifier__epochs=classifier__epochs)
~~
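
The `classifier__epochs` name follows the scikit-learn Pipeline convention of `<step name>__<parameter>`: the part before the double underscore selects the pipeline step and the rest is passed to that step's fit(). The sketch below mimics that naming rule in pure Python; `route_fit_params` is a hypothetical helper, not scikit-learn's code, and rejecting keys without `__` is a choice made for this sketch.

```python
def route_fit_params(fit_params):
    """Group 'step__param' keyword arguments by pipeline step name."""
    routed = {}
    for key, value in fit_params.items():
        step, sep, param = key.partition("__")
        if not sep:
            raise ValueError(f"expected '<step>__<param>', got {key!r}")
        routed.setdefault(step, {})[param] = value
    return routed


print(route_fit_params({"classifier__epochs": 50}))
# {'classifier': {'epochs': 50}}
```

So the step named 'classifier' (the KerasClassifier) receives `epochs=50`, while the 'vectorizer' step receives nothing extra.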

Specify hidden_units and classifier__epochs when calling the DialogueAgent class's train method.

evaluate_dialogue_agent.py

    HIDDEN_UNITS = 64
    CLASSIFIER_EPOCHS = 50

    # Training
    training_data = pd.read_csv(join(BASE_DIR, './training_data.csv'))

    dialogue_agent = DialogueAgent()
    dialogue_agent.train(training_data['text'], training_data['label'], HIDDEN_UNITS, CLASSIFIER_EPOCHS)

Execution results

Epoch 50/50
917/917 [==============================] - 0s 288us/step - loss: 0.0229

### Looked at a few other things while I was at it ###
# pprint.pprint(dialogue_agent.pipeline.steps)
[('vectorizer',
  TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<bound method DialogueAgent._tokenize of <dialogue_agent_sklearn_pipeline.DialogueAgent object at 0x7f7fc81bd128>>,
        use_idf=True, vocabulary=None)),
 ('classifier',
  <keras.wrappers.scikit_learn.KerasClassifier object at 0x7f7fa4a6a320>)]

# pprint.pprint(dialogue_agent.pipeline.steps[1][1].get_params())
{'build_fn': <bound method DialogueAgent._build_mlp of <dialogue_agent_sklearn_pipeline.DialogueAgent object at 0x7f7fc81bd128>>,
 'hidden_units': 64,
 'input_dim': 3219,
 'output_dim': 49}

# print([len(v) for v in dialogue_agent.pipeline.steps[1][1].model.layers[0].get_weights()])
[3219, 64]

# print([len(v) for v in dialogue_agent.pipeline.steps[1][1].model.layers[1].get_weights()])
[64, 49]

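This confirms that the input layer has 3219 dimensions, the hidden layer 64, and the output layer 49,
and the lengths of the weight lists for layers 0 and 1 agree with those values (for each layer, the first number is the row count of the weight matrix and the second is the length of the bias vector).
(As training progresses, these weights keep being updated.)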
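
As a cross-check on the dimensions above, the number of trainable parameters in each fully connected layer is the weight matrix size plus one bias per unit. A small worked calculation, using only the dimensions reported in the output above:

```python
def dense_param_count(input_dim, units):
    """Weights (input_dim x units) plus one bias per unit."""
    return input_dim * units + units


hidden = dense_param_count(3219, 64)  # layer 0: input -> hidden
output = dense_param_count(64, 49)    # layer 1: hidden -> output

print(hidden, output, hidden + output)  # 206080 3185 209265
```

Almost all of the parameters sit in the first layer, because the input dimension (the vocabulary size) is far larger than the hidden and output dimensions.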

Author And Source

Original article (書籍「15Stepで踏破自然言語処理アプリケーション開発入門」をやってみる - 3章Step09メモ「ニューラルネットワークによる識別器」): https://qiita.com/meritama/items/09cda6dce812be494604

Author attribution: information about the original author is available at the original URL. Copyright belongs to the original author.
