A neural-network implementation of sentence2vec based on skip-thoughts vectors


1. Paper overview
The paper describes an unsupervised learning method for a generic, distributed sentence encoder. Using the continuity of text extracted from books, it trains an encoder-decoder model that tries to reconstruct the sentences surrounding an encoded passage. Sentences that share semantic and syntactic properties are thereby mapped to similar vector representations. It then introduces a simple vocabulary-expansion method to encode words not seen during training, allowing the vocabulary to grow to a million words. After training, the vectors are extracted and evaluated with linear models on eight tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification, and four benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf, non-specialized encoder that produces highly generic sentence representations and performs well in practice.
2. Model details
Code source: https://github.com/tensorflow/models/tree/master/research/skip_thoughts
The model breaks down into three main steps: 1. build the context triples; 2. build the encoder; 3. build the decoders.
2.1 Building the context triples (skip_thoughts/data/preprocess_dataset.py)
def _process_input_file(filename, vocab, stats):
  """Processes the sentences in an input file.
  Args:
    filename: Path to a pre-tokenized input .txt file.
    vocab: A dictionary of word to id.
    stats: A Counter object for statistics.
  Returns:
    processed: A list of serialized Example protos
  """
  tf.logging.info("Processing input file: %s", filename)
  processed = []

  predecessor = None  # Predecessor sentence (list of words).
  current = None  # Current sentence (list of words).
  successor = None  # Successor sentence (list of words).

  for successor_str in tf.gfile.FastGFile(filename):
    stats.update(["sentences_seen"])
    successor = successor_str.split()

    # The first 2 sentences per file will be skipped.
    if predecessor and current and successor:
      stats.update(["sentences_considered"])

      # Note that we are going to insert <EOS> later, so we only allow
      # sentences with strictly less than max_sentence_length to pass.
      if FLAGS.max_sentence_length and (
          len(predecessor) >= FLAGS.max_sentence_length or len(current) >=
          FLAGS.max_sentence_length or len(successor) >=
          FLAGS.max_sentence_length):
        stats.update(["sentences_too_long"])
      else:
        serialized = _create_serialized_example(predecessor, current, successor,
                                                vocab)
        processed.append(serialized)
        stats.update(["sentences_output"])

    predecessor = current
    current = successor

    sentences_seen = stats["sentences_seen"]
    sentences_output = stats["sentences_output"]
    if sentences_seen and sentences_seen % 100000 == 0:
      tf.logging.info("Processed %d sentences (%d output)", sentences_seen,
                      sentences_output)
    if FLAGS.max_sentences and sentences_output >= FLAGS.max_sentences:
      break

  tf.logging.info("Completed processing file %s", filename)
  return processed

Purpose: the document is split into sentences beforehand. For example, if a document consists of [S1, S2, S3, S4], it is arranged into predecessor = [S1, S2], current = [S2, S3], successor = [S3, S4], and each sentence is converted to the corresponding ids using the vocabulary.
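As a concrete illustration, here is a minimal sketch of this sliding-window construction in plain Python (the document, vocab dictionary, and id values below are toy assumptions, not taken from the repository; each "sentence" is a single token for brevity):

# Minimal sketch of the sliding-window triple construction performed by
# _process_input_file above (toy values only).
sentences = ["S1", "S2", "S3", "S4"]            # a pre-split document
vocab = {"S1": 1, "S2": 2, "S3": 3, "S4": 4}    # toy word-to-id dictionary

triples = []
predecessor, current = None, None
for successor in sentences:
    if predecessor and current and successor:
        triples.append((vocab[predecessor], vocab[current], vocab[successor]))
    predecessor, current = current, successor

print(triples)  # [(1, 2, 3), (2, 3, 4)] -> (predecessor, current, successor)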
2.2 Building the encoder (skip_thoughts/skip_thoughts_model.py)
 def build_encoder(self):
    """Builds the sentence encoder.
    Inputs:
      self.encode_ids    # word ids of the "current" sentence from 2.1
      self.encode_emb    # word embeddings of the "current" sentence
      self.encode_mask   # 0/1 padding mask of the "current" sentence
    Outputs:
      self.thought_vectors
    Raises:
      ValueError: if config.bidirectional_encoder is True and config.encoder_dim
        is odd.
    """
    with tf.variable_scope("encoder") as scope:
      length = tf.to_int32(tf.reduce_sum(self.encode_mask, 1), name="length")

      if self.config.bidirectional_encoder:
        if self.config.encoder_dim % 2:
          raise ValueError(
              "encoder_dim must be even when using a bidirectional encoder.")
        num_units = self.config.encoder_dim // 2
        cell_fw = self._initialize_gru_cell(num_units)  # Forward encoder
        cell_bw = self._initialize_gru_cell(num_units)  # Backward encoder
        _, states = tf.nn.bidirectional_dynamic_rnn(
            cell_fw=cell_fw,
            cell_bw=cell_bw,
            inputs=self.encode_emb,
            sequence_length=length,
            dtype=tf.float32,
            scope=scope)
        thought_vectors = tf.concat(states, 1, name="thought_vectors")
      else:
        cell = self._initialize_gru_cell(self.config.encoder_dim)
        _, state = tf.nn.dynamic_rnn(
            cell=cell,
            inputs=self.encode_emb,
            sequence_length=length,
            dtype=tf.float32,
            scope=scope)
        # Use an identity operation to name the Tensor in the Graph.
        thought_vectors = tf.identity(state, name="thought_vectors")

    self.thought_vectors = thought_vectors

Purpose: encode the current sentence with a (bidirectional or unidirectional) GRU and output its sentence encoding, i.e. the thought vector.
2.3 Building the decoders (skip_thoughts/skip_thoughts_model.py)
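A rough NumPy-only illustration of how the encoder derives sequence lengths from the mask and what the thought vector is in the bidirectional case (all shapes and state values below are hypothetical stand-ins, not repository code):

import numpy as np

batch_size, padded_len, encoder_dim = 2, 5, 6

encode_mask = np.array([[1, 1, 1, 0, 0],        # sentence of length 3
                        [1, 1, 1, 1, 1]])       # sentence of length 5
length = encode_mask.sum(axis=1)                # same idea as tf.reduce_sum(mask, 1)

# dynamic_rnn returns the GRU state at step length-1 for each sentence; in the
# bidirectional case the forward and backward final states (each of size
# encoder_dim // 2, hence encoder_dim must be even) are concatenated into a
# single thought vector.
fw_state = np.random.randn(batch_size, encoder_dim // 2)   # stand-in values
bw_state = np.random.randn(batch_size, encoder_dim // 2)
thought_vectors = np.concatenate([fw_state, bw_state], axis=1)
print(length, thought_vectors.shape)            # [3 5] (2, 6)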
  def _build_decoder(self, name, embeddings, targets, mask, initial_state,
                     reuse_logits):
    """Builds a sentence decoder.
    Args:
      name: Decoder name.
      embeddings: Batch of sentences to decode; a float32 Tensor with shape
        [batch_size, padded_length, emb_dim].
      targets: Batch of target word ids; an int64 Tensor with shape
        [batch_size, padded_length].
      mask: A 0/1 Tensor with shape [batch_size, padded_length].
      initial_state: Initial state of the GRU. A float32 Tensor with shape
        [batch_size, num_gru_cells].
      reuse_logits: Whether to reuse the logits weights.
    """
    # Decoder RNN.
    cell = self._initialize_gru_cell(self.config.encoder_dim)
    with tf.variable_scope(name) as scope:
      # Add a padding word at the start of each sentence (to correspond to the
      # prediction of the first word) and remove the last word.
      decoder_input = tf.pad(
          embeddings[:, :-1, :], [[0, 0], [1, 0], [0, 0]], name="input")
      length = tf.reduce_sum(mask, 1, name="length")
      decoder_output, _ = tf.nn.dynamic_rnn(
          cell=cell,
          inputs=decoder_input,
          sequence_length=length,
          initial_state=initial_state,
          scope=scope)

    # Stack batch vertically.
    decoder_output = tf.reshape(decoder_output, [-1, self.config.encoder_dim])
    targets = tf.reshape(targets, [-1])
    weights = tf.to_float(tf.reshape(mask, [-1]))

    # Logits.
    with tf.variable_scope("logits", reuse=reuse_logits) as scope:
      logits = tf.contrib.layers.fully_connected(
          inputs=decoder_output,
          num_outputs=self.config.vocab_size,
          activation_fn=None,
          weights_initializer=self.uniform_initializer,
          scope=scope)

    losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=logits)
    batch_loss = tf.reduce_sum(losses * weights)
    tf.losses.add_loss(batch_loss)

    tf.summary.scalar("losses/" + name, batch_loss)

    self.target_cross_entropy_losses.append(losses)
    self.target_cross_entropy_loss_weights.append(weights)

  def build_decoders(self):
    """Builds the sentence decoders.
    Inputs:
      self.decode_pre_emb    # embeddings of the predecessor sentence from 2.1
      self.decode_post_emb   # embeddings of the successor sentence from 2.1
      self.decode_pre_ids
      self.decode_post_ids
      self.decode_pre_mask
      self.decode_post_mask
      self.thought_vectors   # output of the encoder built in 2.2
    Outputs:
      self.target_cross_entropy_losses
      self.target_cross_entropy_loss_weights
    """
    if self.mode != "encode":
      # Pre-sentence decoder.
      self._build_decoder("decoder_pre", self.decode_pre_emb,
                          self.decode_pre_ids, self.decode_pre_mask,
                          self.thought_vectors, False)

      # Post-sentence decoder. Logits weights are reused.
      self._build_decoder("decoder_post", self.decode_post_emb,
                          self.decode_post_ids, self.decode_post_mask,
                          self.thought_vectors, True)

Purpose: starting from the thought vectors produced by the encoder in 2.2, decode the preceding and the following sentence separately. The pre-decoder's logits are compared against decode_pre_ids to compute a cross-entropy loss, and that loss is multiplied by decode_pre_mask so that padding positions do not contribute; the post-decoder works the same way with decode_post_ids and decode_post_mask.
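The two tricks described above, shifting the decoder inputs by one step and masking the per-word losses, can be illustrated with a small NumPy sketch (toy values, not the repository code):

import numpy as np

emb_dim = 3
targets = np.array([[4, 7, 2, 0]])                   # word ids, 0 = padding
mask = np.array([[1, 1, 1, 0]], dtype=np.float32)    # decode_pre_mask analogue

embeddings = np.random.randn(1, 4, emb_dim)          # [batch, padded_len, emb_dim]
# Equivalent of tf.pad(embeddings[:, :-1, :], [[0, 0], [1, 0], [0, 0]]):
# prepend a zero "word" and drop the last word, so step t predicts word t.
decoder_input = np.concatenate(
    [np.zeros((1, 1, emb_dim)), embeddings[:, :-1, :]], axis=1)

per_word_loss = np.random.rand(1, 4)                 # stand-in for the sparse
                                                     # softmax cross-entropy
batch_loss = (per_word_loss * mask).sum()            # padding contributes 0
print(decoder_input.shape, batch_loss)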
The idea behind the algorithm is simple. I added two model variants to it, a BiLSTM encoder and self-attention (a rough sketch of the self-attention pooling idea follows below); the code is at https://github.com/jinjiajia/skip_thoughts
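As a rough idea of what self-attention pooling over BiLSTM/GRU outputs looks like (a NumPy sketch with hypothetical shapes and a single query vector; the actual implementation in the linked repository may differ):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

batch, padded_len, hidden = 2, 5, 8
outputs = np.random.randn(batch, padded_len, hidden)     # per-step encoder outputs
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]], dtype=np.float32)

w = np.random.randn(hidden)                              # attention query vector
scores = outputs @ w                                     # [batch, padded_len]
scores = np.where(mask > 0, scores, -1e9)                # ignore padding steps
alpha = softmax(scores, axis=1)                          # attention weights
sentence_vec = (alpha[..., None] * outputs).sum(axis=1)  # [batch, hidden]
print(sentence_vec.shape)                                # (2, 8)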