Natural Language Processing - Building Word Vectors with Gensim (Simple Version)


Contents

  • Natural Language Processing - Building Word Vectors with Gensim (Simple Version)
  • 1. Import the model
  • 2. A couple of sentences
  • 3. Tokenization
  • 4. Create the model
  • min_count:
  • size:
  • 5. Test the similarity between two words


    1. Import the model

    from gensim.models import word2vec
    import logging
    
    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',level=logging.INFO)
    

    2. A couple of sentences

    raw_sentences=("the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep")
    

    3. Tokenization

    sentences=[s.split() for s in raw_sentences]
    print(sentences)
    
    [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]
    

    4. Create the model

    model=word2vec.Word2Vec(sentences,min_count=1)
    
    2020-04-20 18:33:15,654:INFO:collecting all words and their counts
    2020-04-20 18:33:15,655:INFO:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
    2020-04-20 18:33:15,656:INFO:collected 15 word types from a corpus of 16 raw words and 2 sentences
    2020-04-20 18:33:15,657:INFO:Loading a fresh vocabulary
    2020-04-20 18:33:15,658:INFO:effective_min_count=1 retains 15 unique words (100% of original 15, drops 0)
    2020-04-20 18:33:15,659:INFO:effective_min_count=1 leaves 16 word corpus (100% of original 16, drops 0)
    2020-04-20 18:33:15,660:INFO:deleting the raw counts dictionary of 15 items
    2020-04-20 18:33:15,660:INFO:sample=0.001 downsamples 15 most-common words
    2020-04-20 18:33:15,661:INFO:downsampling leaves estimated 2 word corpus (13.7% of prior 16)
    2020-04-20 18:33:15,663:INFO:estimated required memory for 15 words and 100 dimensions: 19500 bytes
    2020-04-20 18:33:15,664:INFO:resetting layer weights
    2020-04-20 18:33:15,665:INFO:training model with 3 workers on 15 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
    2020-04-20 18:33:15,672:INFO:worker thread finished; awaiting finish of 2 more threads
    2020-04-20 18:33:15,675:INFO:worker thread finished; awaiting finish of 1 more threads
    2020-04-20 18:33:15,675:INFO:worker thread finished; awaiting finish of 0 more threads
    2020-04-20 18:33:15,676:INFO:EPOCH - 1 : training on 16 raw words (2 effective words) took 0.0s, 542 effective words/s
    2020-04-20 18:33:15,680:INFO:worker thread finished; awaiting finish of 2 more threads
    2020-04-20 18:33:15,681:INFO:worker thread finished; awaiting finish of 1 more threads
    2020-04-20 18:33:15,682:INFO:worker thread finished; awaiting finish of 0 more threads
    2020-04-20 18:33:15,684:INFO:EPOCH - 2 : training on 16 raw words (3 effective words) took 0.0s, 639 effective words/s
    2020-04-20 18:33:15,688:INFO:worker thread finished; awaiting finish of 2 more threads
    2020-04-20 18:33:15,691:INFO:worker thread finished; awaiting finish of 1 more threads
    2020-04-20 18:33:15,691:INFO:worker thread finished; awaiting finish of 0 more threads
    2020-04-20 18:33:15,692:INFO:EPOCH - 3 : training on 16 raw words (1 effective words) took 0.0s, 263 effective words/s
    2020-04-20 18:33:15,697:INFO:worker thread finished; awaiting finish of 2 more threads
    2020-04-20 18:33:15,699:INFO:worker thread finished; awaiting finish of 1 more threads
    2020-04-20 18:33:15,700:INFO:worker thread finished; awaiting finish of 0 more threads
    2020-04-20 18:33:15,701:INFO:EPOCH - 4 : training on 16 raw words (2 effective words) took 0.0s, 402 effective words/s
    2020-04-20 18:33:15,705:INFO:worker thread finished; awaiting finish of 2 more threads
    2020-04-20 18:33:15,707:INFO:worker thread finished; awaiting finish of 1 more threads
    2020-04-20 18:33:15,707:INFO:worker thread finished; awaiting finish of 0 more threads
    2020-04-20 18:33:15,708:INFO:EPOCH - 5 : training on 16 raw words (2 effective words) took 0.0s, 486 effective words/s
    2020-04-20 18:33:15,709:INFO:training on a 80 raw words (10 effective words) took 0.0s, 234 effective words/s
    2020-04-20 18:33:15,710:WARNING:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
    

    min_count:


    The frequency threshold we need depends on the size of the vocabulary: with a large corpus, for example, we usually want to ignore words that appear only once or twice. This is controlled by the min_count parameter; in practice a suitable value is usually set somewhere between 0 and 100.
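For intuition, the effect of a frequency threshold can be sketched with a plain `Counter` before handing anything to Gensim (the threshold of 2 here is just for illustration):

```python
from collections import Counter

sentences = [
    "the quick brown fox jumps over the lazy dogs".split(),
    "yoyoyo you go home now to sleep".split(),
]

counts = Counter(w for s in sentences for w in s)

# With min_count=2, only words seen at least twice survive;
# in this tiny corpus that is just "the".
kept = [w for w, c in counts.items() if c >= 2]
print(kept)  # → ['the']
```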

    size:


    The size parameter sets the dimensionality of the word vectors (not the number of network layers); Word2Vec defaults to 100 dimensions. A larger value requires more training data but can improve overall accuracy; a reasonable range is from 10 to a few hundred.

    5. Test the similarity between two words

    model.wv.similarity('dogs','go')
    
    
    -0.031395614