Natural Language Processing: Building Word Vectors with Gensim (Simple Version)
1. Import the model
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',level=logging.INFO)
2. Prepare a couple of toy sentences
raw_sentences=("the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep")
3. Tokenize
sentences=[s.split() for s in raw_sentences]
print(sentences)
[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]
4. Create the model
model=word2vec.Word2Vec(sentences,min_count=1)
2020-04-20 18:33:15,654:INFO:collecting all words and their counts
2020-04-20 18:33:15,655:INFO:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-20 18:33:15,656:INFO:collected 15 word types from a corpus of 16 raw words and 2 sentences
2020-04-20 18:33:15,657:INFO:Loading a fresh vocabulary
2020-04-20 18:33:15,658:INFO:effective_min_count=1 retains 15 unique words (100% of original 15, drops 0)
2020-04-20 18:33:15,659:INFO:effective_min_count=1 leaves 16 word corpus (100% of original 16, drops 0)
2020-04-20 18:33:15,660:INFO:deleting the raw counts dictionary of 15 items
2020-04-20 18:33:15,660:INFO:sample=0.001 downsamples 15 most-common words
2020-04-20 18:33:15,661:INFO:downsampling leaves estimated 2 word corpus (13.7% of prior 16)
2020-04-20 18:33:15,663:INFO:estimated required memory for 15 words and 100 dimensions: 19500 bytes
2020-04-20 18:33:15,664:INFO:resetting layer weights
2020-04-20 18:33:15,665:INFO:training model with 3 workers on 15 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-04-20 18:33:15,672:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,675:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,675:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,676:INFO:EPOCH - 1 : training on 16 raw words (2 effective words) took 0.0s, 542 effective words/s
2020-04-20 18:33:15,680:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,681:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,682:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,684:INFO:EPOCH - 2 : training on 16 raw words (3 effective words) took 0.0s, 639 effective words/s
2020-04-20 18:33:15,688:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,691:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,691:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,692:INFO:EPOCH - 3 : training on 16 raw words (1 effective words) took 0.0s, 263 effective words/s
2020-04-20 18:33:15,697:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,699:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,700:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,701:INFO:EPOCH - 4 : training on 16 raw words (2 effective words) took 0.0s, 402 effective words/s
2020-04-20 18:33:15,705:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,707:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,707:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,708:INFO:EPOCH - 5 : training on 16 raw words (2 effective words) took 0.0s, 486 effective words/s
2020-04-20 18:33:15,709:INFO:training on a 80 raw words (10 effective words) took 0.0s, 234 effective words/s
2020-04-20 18:33:15,710:WARNING:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
min_count:
The minimum word frequency we want to enforce depends on the size of the vocabulary. In a large corpus, for example, we usually want to ignore words that appear only once or twice; this is controlled by the min_count parameter. Reasonable values typically fall between 0 and 100.
size:
The size parameter sets the dimensionality of the word vectors (the size of the hidden layer), not the number of network layers; Word2Vec defaults to 100 dimensions. Larger values require more training data but can improve accuracy; a reasonable range is from around 10 to a few hundred.
5. Test the similarity of two words
model.similarity('dogs','go')
d:\progra~2\python\virtua~1\py37_x64\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `similarity` (Method will be removed in 4.0.0, use self.wv.similarity() instead).
"""Entry point for launching an IPython kernel.
-0.031395614