XLNet Chinese Text Classification



I. Overview of XLNet
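XLNet is another major advance in natural language processing (NLP) since BERT established the pre-training/fine-tuning paradigm. It combines the characteristics of autoregressive (AR, unidirectional) and autoencoding (AE, bidirectional) language models, uses the Transformer-XL feature extractor (a segment-level recurrence mechanism plus relative positional encoding, which allows efficient, highly parallel processing of very long text), and introduces Permutation Language Modeling (PLM).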
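PLM keeps the advantage of autoregressive (AR) language models: an AR model estimates the probability distribution of a text corpus directly, which favors text generation (NLG) tasks. It also brings in the strength of autoencoding (AE) language models: the text is factorized over shuffled orders while the positions to be predicted stay fixed, so each prediction can draw on context from both sides, which favors text understanding (NLU) tasks.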
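Written out, the PLM objective as given in the XLNet paper is the following, where Z_T is the set of all permutations of the index sequence [1, ..., T], z_t is the t-th element of a permutation z, and z_{<t} denotes its first t-1 elements:

```latex
\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```

Each sampled order is still factorized autoregressively, but across orders every token gets to condition on tokens from both its left and its right.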
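In addition, the BERT-like masking mechanism (Masked LM), which predicts hidden characters/words, takes place inside the multi-head attention of the XLNet pre-training model. This removes BERT's mismatch between pre-training input (which contains mask tokens) and fine-tuning input (which does not).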
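To make "masking inside the attention" concrete, here is a minimal sketch of how a factorization order can be turned into attention masks. It is a simplified illustration in plain Python, not the code of XLNet or keras-xlnet. The function name `permutation_masks` is hypothetical, and the content/query distinction follows the two-stream attention described in the XLNet paper.

```python
def permutation_masks(order):
    """Build attention masks for one factorization order.

    order[k] is the (0-based) position predicted at step k.
    Returns (content, query): content[i][j] / query[i][j] is True when
    position i is allowed to attend to position j.
    """
    n = len(order)
    rank = [0] * n
    for step, pos in enumerate(order):
        rank[pos] = step  # when each position appears in the order
    # Content stream: a position sees itself and everything earlier in the order.
    content = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    # Query stream: a position sees only what is strictly earlier, never itself.
    query = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return content, query


content, query = permutation_masks([2, 0, 3, 1])
for row in query:
    print(["x" if visible else "." for visible in row])
```

The input tokens are never replaced by a mask symbol. Only the visibility pattern changes, which is why the same input format works for both pre-training and fine-tuning.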
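As of 2019-08-29, Google has not released a Chinese XLNet, but the Harbin Institute of Technology (HIT) Chinese XLNet pre-trained model has been open-sourced (kudos to HIT), which came as a welcome surprise.
With a pre-trained model available, the natural next step is fine-tuning experiments of all kinds: sentence-vector embedding, classification, similarity, reading comprehension, text generation, and so on.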
xlnet-embedding: https://github.com/yongzhuo/nlp_xiaojiang/tree/master/FeatureProject/xlnet
xlnet-chinese-text-classification: https://github.com/yongzhuo/Keras-TextClassification

II. XLNet classification example
Fine-tuning XLNet is much the same as fine-tuning BERT, but there are subtle differences.
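Taking keras-xlnet as an example, when loading the pre-trained model you can set target_len (the target length, i.e. the maximum length of the current input text), the attention type ('uni' or 'bi'), and memory_len (the memory length, i.e. how far back across segments a long text can depend on earlier context, as in Transformer-XL).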
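The interplay of these three settings can be pictured as an attention mask. The sketch below is a simplified plain-Python illustration based on how the 'uni' and 'bi' attention types work in the original XLNet design. It does not call keras-xlnet, whose internals may differ, and the function name `attention_mask` is hypothetical.

```python
def attention_mask(target_len, memory_len, attention_type="bi"):
    """Return a target_len x (memory_len + target_len) visibility mask.

    Columns cover the cached memory first, then the current segment.
    """
    if attention_type not in ("uni", "bi"):
        raise ValueError("attention_type must be 'uni' or 'bi'")
    mask = []
    for i in range(target_len):
        row = [True] * memory_len  # cached memory precedes the segment: always visible
        for j in range(target_len):
            # 'bi' sees the whole segment; 'uni' sees only positions up to i
            row.append(True if attention_type == "bi" else j <= i)
        mask.append(row)
    return mask


for row in attention_mask(target_len=3, memory_len=2, attention_type="uni"):
    print("".join("x" if visible else "." for visible in row))
```

For short-text classification each sample usually fits in one segment, so target_len is what matters most, and 'bi' lets every token see the whole sentence.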
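You can also take the output of different layers, or combinations of them. The first release of the HIT Chinese XLNet has 24 transformer layers, which appear as 246 Keras layers in total: 6 of them are the input and embedding layers, and every 10 of the remaining Keras layers form one block, i.e. one transformer layer. Note that some layers output two tensors, which needs care when selecting an output.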
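The layer arithmetic above can be checked with a few lines of Python. This is a sketch that assumes the Keras layers of each block sit contiguously in `model.layers` after the 6 input/embedding layers. The constant and function names are hypothetical, and which layer inside a block carries the block's output should be confirmed with `model.summary()`.

```python
N_INPUT_LAYERS = 6     # input and embedding layers, as stated in the text
LAYERS_PER_BLOCK = 10  # Keras layers per transformer block, as stated in the text
N_BLOCKS = 24          # transformer layers in the HIT Chinese XLNet (first release)


def block_layer_range(block):
    """Return (first, last) 0-based indices in model.layers for a 1-based block."""
    if not 1 <= block <= N_BLOCKS:
        raise ValueError("block must be between 1 and %d" % N_BLOCKS)
    first = N_INPUT_LAYERS + (block - 1) * LAYERS_PER_BLOCK
    last = first + LAYERS_PER_BLOCK - 1
    return first, last


total_layers = N_INPUT_LAYERS + N_BLOCKS * LAYERS_PER_BLOCK
print(total_layers)           # 6 + 24 * 10
print(block_layer_range(1))   # first transformer block
print(block_layer_range(24))  # last transformer block
```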
A simple XLNet fine-tuning example follows. For the embedding details, see GitHub: https://github.com/yongzhuo/Keras-TextClassification/blob/master/keras_textclassification/base/embedding.py

# -*- coding: UTF-8 -*-
# !/usr/bin/python
# @time     :2019/8/28 23:06
# @author   :Mo
# @function :graph of XLNet fine-tuning for text classification
# @paper    :XLNet: Generalized Autoregressive Pretraining for Language Understanding

from __future__ import print_function, division

from keras.layers import SpatialDropout1D, Conv1D, GlobalMaxPooling1D, Dense
from keras.layers import Dropout, Reshape, Concatenate, Lambda
from keras.layers import LSTM, GRU
from keras.layers import Flatten
from keras.models import Model
from keras import backend as K
from keras import regularizers

from keras_textclassification.base.graph import graph

import numpy as np


class XlnetGraph(graph):
    def __init__(self, hyper_parameters):
        """
            Initialize the graph.
        :param hyper_parameters: json, hyper parameters of the network
        """
        super().__init__(hyper_parameters)

    def create_model(self, hyper_parameters):
        """
            Build the XLNet fine-tuning model.
        :param hyper_parameters: json, hyper parameters of the network
        :return: None, the built model is stored in self.model
        """
        super().create_model(hyper_parameters)
        embedding_output = self.word_embedding.output
        x = embedding_output
        # Alternative head: keep only the CLS position
        # x = Lambda(lambda x: x[:, 0:1, :])(embedding_output)
        # Alternative head: TextCNN over the embedding output
        # bert_output_emmbed = SpatialDropout1D(rate=self.dropout)(embedding_output)
        # concat_out = []
        # for index, filter_size in enumerate(self.filters):
        #     x = Conv1D(name='TextCNN_Conv1D_{}'.format(index),
        #                filters= self.filters_num, # int(K.int_shape(embedding_output)[-1]/self.len_max),
        #                strides=1,
        #                kernel_size=self.filters[index],
        #                padding='valid',
        #                kernel_initializer='normal',
        #                activation='relu')(bert_output_emmbed)
        #     x = GlobalMaxPooling1D(name='TextCNN_MaxPool1D_{}'.format(index))(x)
        #     concat_out.append(x)
        # x = Concatenate(axis=1)(concat_out)
        # x = Dropout(self.dropout)(x)
        # Flatten (len_max, hidden) into one feature vector per sample
        x = Flatten()(x)
        # Final classification layer (softmax)
        dense_layer = Dense(self.label, activation=self.activate_classify)(x)
        output_layers = [dense_layer]
        self.model = Model(self.word_embedding.input, output_layers)
        self.model.summary(120)
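What the active head of this graph does (Flatten followed by a Dense classification layer) can be traced by hand on toy numbers. The sketch below is plain Python, not Keras. It assumes the classification activation is softmax, as the comment in the listing suggests, and all values and helper names are made up for illustration.

```python
import math


def flatten(token_vectors):
    """Concatenate per-token vectors, shape (len_max, hidden), into one flat list."""
    return [value for token in token_vectors for value in token]


def dense_softmax(features, weights, biases):
    """One dense layer with softmax; weights[k] is the weight vector of label k."""
    logits = [
        sum(f * w for f, w in zip(features, label_weights)) + b
        for label_weights, b in zip(weights, biases)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: len_max = 2 tokens, hidden = 3, two labels.
tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
features = flatten(tokens)  # 2 * 3 = 6 features
weights = [[0.5] * 6, [-0.5] * 6]
biases = [0.0, 0.0]
probs = dense_softmax(features, weights, biases)
print(len(features), probs)
```

Because Flatten keeps every position, the Dense layer needs len_max * hidden * label weights, so its size grows with the maximum text length. The commented-out CLS and TextCNN heads in the listing are ways to shrink that input.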

I hope this is useful to you!
Please point out anything that is missing or wrong. Thank you.
