Improving the search experience with word2vec


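An artist may release albums under several names, which we can call aliases. When searching for an artist's albums,
can we also find the albums released under those aliases, along with other related albums?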

Agenda

  1. Download the ja-wikipedia corpus
  2. Preprocess the ja-wikipedia corpus
  3. Use gensim to generate a word2vec model
  4. Final

Download the ja-wikipedia corpus

wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2

Preprocess the ja-wikipedia corpus

WikiCorpus


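from gensim.corpora import WikiCorpus

# Written for gensim 3.x; the lemmatize argument was removed in gensim 4.
# dictionary={} skips building a vocabulary, which plain-text extraction does not need.
corpus = WikiCorpus('./jawiki-latest-pages-articles.xml.bz2', lemmatize=False, dictionary={})
# Write one article per line, tokens separated by spaces.
with open('./ja-wiki-raw', 'w', encoding='utf-8') as output:
    for text in corpus.get_texts():
        output.write(' '.join(text) + '\n')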

MeCab + mecab-ipadic-neologd

mecab -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd -Owakati ja-wiki-raw -o ja-wiki-token -b 1000000000

Use gensim to generate a word2vec model

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

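source = "./ja-wiki-token"  # MeCab output: one article per line, space-separated tokens
vector_size = 200  # dimensionality of the word vectors
min_count = 10     # ignore words that appear fewer than 10 times
window_size = 5    # context window size
workers = 3        # number of training threads

sentences = word2vec.LineSentence(source)
# size= is the gensim 3.x keyword; gensim 4 renamed it to vector_size=.
model = word2vec.Word2Vec(sentences, size=vector_size, min_count=min_count, window=window_size, workers=workers)
# Save only the word vectors, in the binary word2vec format.
model.wv.save_word2vec_format('./ja-wiki-model.vec.pt', binary=True)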

Final

  1. Compute the similarity between 澤野弘之 and nzk
  2. List the words most similar to 澤野弘之

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('./ja-wiki-model.vec.pt', binary=True)

similarity = word_vectors.similarity('澤野弘之', 'nzk')

print(similarity)

data = word_vectors.most_similar('澤野弘之')

for word in data:
    print(word)

Output:

0.7995355
('nzk', 0.7995355129241943)
('梶浦由記', 0.79180908203125)
('tielle', 0.7494333982467651)
('gemie', 0.7381956577301025)
('小林未郁', 0.7224546074867249)
('yamanaiame', 0.7218720316886902)
('和田貴史', 0.711627721786499)
('caldito', 0.7096607685089111)
('blackschleger', 0.7088098526000977)
('cyua', 0.7018069624900818)
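
The similarity score above is a cosine similarity between the two word vectors. As a rough illustration, here is a minimal sketch in plain Python. The 3-dimensional vectors are made up for the example; the real vectors from the model trained above have 200 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, not taken from the trained model.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

same_direction = cosine_similarity(vec_a, vec_b)
orthogonal = cosine_similarity(vec_a, vec_c)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, so a value near 0.8 means the two words appear in quite similar contexts.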

Finally, put the strings returned by most_similar into an Elasticsearch MultiMatch query and boost each term's score according to its similarity.
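
One possible way to build such a query is sketched below. It only constructs the request body as a Python dict, so no Elasticsearch client is needed to run it. The field names `artist` and `album`, the helper `build_alias_query`, and the `min_score` threshold are all assumptions made for illustration; they are not part of the original setup.

```python
# (word, similarity) pairs shaped like the most_similar output, shortened for the example.
similar = [("nzk", 0.7995355), ("梶浦由記", 0.7918091), ("tielle", 0.7494334)]

def build_alias_query(keyword, similar_words, fields=("artist", "album"), min_score=0.75):
    """Build a bool/should query with one multi_match clause per term.

    The original keyword gets boost 1.0; each similar word is boosted by
    its similarity score. Words scoring below min_score are dropped.
    """
    clauses = [
        {"multi_match": {"query": keyword, "fields": list(fields), "boost": 1.0}}
    ]
    for word, score in similar_words:
        if score < min_score:
            continue  # too weakly related to be worth searching for
        clauses.append(
            {"multi_match": {"query": word, "fields": list(fields),
                             "boost": round(float(score), 4)}}
        )
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}

query = build_alias_query("澤野弘之", similar)
```

The resulting dict can be sent as the body of a search request. Because the similar words mix aliases with names that are merely related, a threshold such as `min_score` (or a smaller boost) is probably worth tuning against real search results.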