Natural Language Processing: Building Word Vectors with Gensim (Simple Version)
1. Import the model
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',level=logging.INFO)
2. Prepare a couple of toy sentences
raw_sentences=("the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep")
3. Tokenize
sentences=[s.split() for s in raw_sentences]
print(sentences)
[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]
4. Create the model
model=word2vec.Word2Vec(sentences,min_count=1)
2020-04-20 18:33:15,654:INFO:collecting all words and their counts
2020-04-20 18:33:15,655:INFO:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-20 18:33:15,656:INFO:collected 15 word types from a corpus of 16 raw words and 2 sentences
2020-04-20 18:33:15,657:INFO:Loading a fresh vocabulary
2020-04-20 18:33:15,658:INFO:effective_min_count=1 retains 15 unique words (100% of original 15, drops 0)
2020-04-20 18:33:15,659:INFO:effective_min_count=1 leaves 16 word corpus (100% of original 16, drops 0)
2020-04-20 18:33:15,660:INFO:deleting the raw counts dictionary of 15 items
2020-04-20 18:33:15,660:INFO:sample=0.001 downsamples 15 most-common words
2020-04-20 18:33:15,661:INFO:downsampling leaves estimated 2 word corpus (13.7% of prior 16)
2020-04-20 18:33:15,663:INFO:estimated required memory for 15 words and 100 dimensions: 19500 bytes
2020-04-20 18:33:15,664:INFO:resetting layer weights
2020-04-20 18:33:15,665:INFO:training model with 3 workers on 15 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2020-04-20 18:33:15,672:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,675:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,675:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,676:INFO:EPOCH - 1 : training on 16 raw words (2 effective words) took 0.0s, 542 effective words/s
2020-04-20 18:33:15,680:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,681:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,682:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,684:INFO:EPOCH - 2 : training on 16 raw words (3 effective words) took 0.0s, 639 effective words/s
2020-04-20 18:33:15,688:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,691:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,691:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,692:INFO:EPOCH - 3 : training on 16 raw words (1 effective words) took 0.0s, 263 effective words/s
2020-04-20 18:33:15,697:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,699:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,700:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,701:INFO:EPOCH - 4 : training on 16 raw words (2 effective words) took 0.0s, 402 effective words/s
2020-04-20 18:33:15,705:INFO:worker thread finished; awaiting finish of 2 more threads
2020-04-20 18:33:15,707:INFO:worker thread finished; awaiting finish of 1 more threads
2020-04-20 18:33:15,707:INFO:worker thread finished; awaiting finish of 0 more threads
2020-04-20 18:33:15,708:INFO:EPOCH - 5 : training on 16 raw words (2 effective words) took 0.0s, 486 effective words/s
2020-04-20 18:33:15,709:INFO:training on a 80 raw words (10 effective words) took 0.0s, 234 effective words/s
2020-04-20 18:33:15,710:WARNING:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
min_count:
The minimum word frequency we want to enforce depends on the size of the vocabulary. In a large corpus, for example, we usually want to ignore words that appear only once or twice; this is controlled by the min_count parameter. Reasonable values typically fall between 0 and 100.
size:
The size parameter sets the dimensionality of the word vectors (the size of the hidden layer), not the number of network layers; Word2Vec defaults to 100 dimensions. Larger values require more training data but can improve accuracy; a reasonable range is from around 10 to a few hundred.
5. Test the similarity of two words
model.similarity('dogs','go')
d:\progra~2\python\virtua~1\py37_x64\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `similarity` (Method will be removed in 4.0.0, use self.wv.similarity() instead).
"""Entry point for launching an IPython kernel.
-0.031395614