Word2Vec Sentiment Analysis in Practice (Part 3): Using Distributed Word Vectors for a Supervised Learning Task

Introduction

This post builds on the previous post, Part 2, and continues the exploration with hands-on practice. Demo code and data: link

Numeric Representation of Words

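Earlier we trained a model that captures word meaning. Looking more closely, the model trained in Part 2 consists of a feature vector for each word in the vocabulary. These feature vectors are stored in a numpy array called syn0.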

# Load the model that we created in Part 2
from gensim.models import Word2Vec
model = Word2Vec.load("300features_40minwords_10context")
#type(model.syn0)
#model.syn0.shape
type(model.wv.syn0)
model.wv.syn0.shape

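[output] numpy.ndarray
[output] (16490, 300)

The shape of this numpy array, (16490, 300), gives the number of words in the vocabulary and the number of features per word, respectively. An individual word vector can be accessed directly as follows.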

model.wv["flower"]
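
To make the lookup concrete, here is a minimal sketch of what indexing a word vector amounts to: each vocabulary word owns one row of the weight matrix. The names index2word, syn0, word2index and lookup below are toy stand-ins invented for illustration, with 4 features instead of 300; this is not gensim's internal code.

```python
import numpy as np

# Hypothetical toy stand-ins for the model's vocabulary list and weight matrix.
index2word = ["flower", "movie", "bad"]
syn0 = np.array([[0.1, 0.2, 0.3, 0.4],
                 [0.5, 0.6, 0.7, 0.8],
                 [0.9, 1.0, 1.1, 1.2]], dtype="float32")

# Map each word to its row index, then look the vector up by row.
word2index = {word: i for i, word in enumerate(index2word)}

def lookup(word):
    """Return the feature vector (one row of syn0) for a vocabulary word."""
    return syn0[word2index[word]]

flower_vec = lookup("flower")
print(syn0.shape)        # (vocabulary size, number of features)
print(flower_vec.shape)  # one vector with as many entries as features
```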

From Words to Paragraphs, Attempt 1: Vector Averaging

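In the IMDB dataset the reviews vary in length, so we first need to turn the individual word vectors into a feature set of the same length for every review. Because each word is a 300-dimensional feature vector, we can use vector operations to combine the words in each review. In this example we simply average the word vectors, and we remove stop words because including them would only add noise. The code is as follows.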

import numpy as np  # Make sure that numpy is imported

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0.
    # 
    # index2word is a list that contains the names of the words in 
    # the model's vocabulary. Convert it to a set, for speed 
    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model.wv[word])
    # 
    # Divide the result by the number of words to get the average
    # (skip the division if no word was in the vocabulary)
    if nwords > 0:
        featureVec = np.divide(featureVec,nwords)
    return featureVec


def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate 
    # the average feature vector for each one and return a 2D numpy array 
    # 
    # Initialize a counter
    counter = 0
    # 
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    # 
    # Loop through the reviews
    for review in reviews:
       #
       # Print a status message every 1000th review
       if counter%1000 == 0:
           print("Review %d of %d" % (counter, len(reviews)))
       # 
       # Call the function (defined above) that makes average feature vectors
       reviewFeatureVecs[counter] = makeFeatureVec(review, model, \
           num_features)
       #
       # Increment the counter
       counter = counter + 1
    return reviewFeatureVecs
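
To see what the averaging does, here is a small self-contained sketch on made-up data. The toy_vectors dictionary and the average_vector helper are hypothetical, introduced only for illustration; a plain dict stands in for the Word2Vec model, and the vectors have 3 features instead of 300.

```python
import numpy as np

# Hypothetical 3-dimensional vectors; a real model would have 300 features.
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "movie": np.array([3.0, 2.0, 0.0], dtype="float32"),
}

def average_vector(words, vectors, num_features):
    """Average the vectors of the in-vocabulary words; zeros if none match."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros((num_features,), dtype="float32")
    return np.mean(known, axis=0)

review = ["great", "unknownword", "movie"]
avg = average_vector(review, toy_vectors, 3)
print(avg)  # element-wise mean of the "great" and "movie" vectors
```

Out-of-vocabulary words are skipped, so every review ends up with a vector of the same length no matter how many words it contains.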

Next, we compute the average vectors for the training and test sets used in Part 2.

# ****************************************************************
# Calculate average feature vectors for training and testing sets,
# using the functions we defined above. Notice that we now use stop word
# removal.
import pandas as pd

# Read data from files 
train = pd.read_csv( "./data/labeledTrainData.tsv", header=0, 
 delimiter="\t", quoting=3 )
test = pd.read_csv( "./data/testData.tsv", header=0, delimiter="\t", quoting=3 )
unlabeled_train = pd.read_csv( "./data/unlabeledTrainData.tsv", header=0, 
 delimiter="\t", quoting=3 )

# Verify the number of reviews that were read (100,000 in total)
print("Read %d labeled train reviews, %d labeled test reviews, " \
 "and %d unlabeled reviews\n" % (train["review"].size,  
 test["review"].size, unlabeled_train["review"].size ))

# Import various modules for string cleaning
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords

def review_to_wordlist( review, remove_stopwords=False ):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words.  Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review).get_text()
    #  
    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    #
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()
    #
    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if not w in stops]
    #
    # 5. Return a list of words
    return(words)
# Build average feature vectors for the training and test reviews
num_features = 300    # Word vector dimensionality

clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append( review_to_wordlist( review, \
        remove_stopwords=True ))

trainDataVecs = getAvgFeatureVecs( clean_train_reviews, model, num_features )

print("Creating average feature vecs for test reviews")
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append( review_to_wordlist( review, \
        remove_stopwords=True ))

testDataVecs = getAvgFeatureVecs( clean_test_reviews, model, num_features )

Next, we use a random forest to make predictions. The code is as follows.

# Fit a random forest to the training data, using 100 trees
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier( n_estimators = 100 )

print("Fitting a random forest to labeled training data...")
forest = forest.fit( trainDataVecs, train["sentiment"] )

# Test & extract results 
result = forest.predict( testDataVecs )

# Write the test results 
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
output.to_csv( "Word2Vec_AverageVectors.csv", index=False, quoting=3 )

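We found that this result is much better than chance, but its accuracy is a few percentage points lower than the bag-of-words model used in Part 1. Since vector averaging did not produce a striking result, a smarter approach may do better. A standard way to weight word vectors is to apply "tf-idf" weights, which measure how important a given word is within a given set of documents. One way to extract tf-idf weights in Python is scikit-learn's TfidfVectorizer, whose interface is similar to the CountVectorizer used in Part 1. However, adding the weights made no substantial difference. Since neither vector averaging nor tf-idf weighting brought much improvement, we next try clustering.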
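
For reference, the tf-idf weighting idea mentioned above can be sketched roughly as follows. This is an illustration on toy data, not the code behind the reported result: it computes a common smoothed idf by hand rather than calling TfidfVectorizer, and docs, toy_vectors, compute_idf and weighted_average are all hypothetical names.

```python
import math
import numpy as np

# Hypothetical tokenized reviews and 2-dimensional word vectors.
docs = [["good", "movie"], ["bad", "movie"], ["good", "plot"]]
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
    "plot": np.array([0.0, 2.0]),
}

def compute_idf(documents):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(documents)
    df = {}
    for doc in documents:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    return {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}

def weighted_average(words, vectors, idf):
    """idf-weighted mean of the word vectors; each occurrence counts once."""
    known = [w for w in words if w in vectors and w in idf]
    if not known:
        return None
    weights = np.array([idf[w] for w in known])
    stacked = np.array([vectors[w] for w in known])
    return (weights[:, None] * stacked).sum(axis=0) / weights.sum()

idf = compute_idf(docs)
doc_vec = weighted_average(docs[1], toy_vectors, idf)
print(idf["bad"] > idf["movie"])  # rarer words receive larger weights
print(doc_vec)
```

Because every occurrence of a word contributes its idf weight once, term frequency enters through repetition, and words that appear in fewer documents pull the review vector more strongly toward themselves.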

From Words to Paragraphs, Attempt 2: Clustering

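Word2Vec creates clusters of semantically related words, so another option is to exploit the similarity of words within a cluster. Grouping vectors in this way is called "vector quantization". To do this, we first need to find the centers of the word clusters, which can be done with a clustering algorithm such as k-means.
In k-means, one of the parameters to set is "K", the number of clusters. How should we decide how many clusters to create? Trial and error suggested that small clusters, averaging only about 5 words each, gave better results than large clusters containing many words. The clustering code is below; we use scikit-learn to run k-means.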

from sklearn.cluster import KMeans
import time

start = time.time() # Start time

# Set "k" (num_clusters) to be 1/5th of the vocabulary size, or an
# average of 5 words per cluster
word_vectors = model.wv.syn0
num_clusters = word_vectors.shape[0] // 5

# Initialize a k-means object and use it to extract centroids
kmeans_clustering = KMeans( n_clusters = num_clusters )
idx = kmeans_clustering.fit_predict( word_vectors )

# Get the end time and print how long the process took
end = time.time()
elapsed = end - start
print("Time taken for K Means clustering: ", elapsed, "seconds.")
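
To illustrate what the clustering step computes, here is a minimal hand-rolled k-means on toy 2-D points. It is a sketch of the plain Lloyd iteration, not scikit-learn's implementation (which uses smarter initialization and stopping rules); simple_kmeans and toy_vectors are hypothetical names, and the first k points are assumed as initial centroids to keep the example deterministic.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=10):
    """Plain Lloyd iterations; the first k points serve as initial centroids."""
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical 2-D "word vectors" forming two obvious groups.
toy_vectors = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.2],
                        [5.1, 4.9], [0.2, 0.1], [4.9, 5.2]])
labels, centroids = simple_kmeans(toy_vectors, k=2)
print(labels)  # [0 1 0 1 0 1]
```

The returned labels play the same role as idx above: one cluster number per vector, in the same order as the input rows.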

The cluster assigned to each word is now stored in idx, and the vocabulary of the original Word2Vec model is stored in model.wv.index2word. For convenience, we zip these into a dictionary as shown below.

# Create a Word / Index dictionary, mapping each vocabulary word to
# a cluster number                                                                                            
word_centroid_map = dict(zip( model.wv.index2word, idx ))

Let's print the first 10 clusters and see how they look.

# For the first 10 clusters
for cluster in range(0,10):
    #
    # Print the cluster number  
    print("\nCluster %d" % cluster)
    #
    # Find all of the words for that cluster number, and print them out
    words = []
    for word, centroid in word_centroid_map.items():
        if centroid == cluster:
            words.append(word)
    print(words)
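
If many clusters need to be inspected, scanning the whole dictionary once per cluster gets slow. One possible alternative, sketched here on a hypothetical toy mapping, inverts the word-to-cluster dictionary in a single pass; group_by_cluster and toy_word_centroid_map are illustrative names only.

```python
from collections import defaultdict

# Hypothetical word -> cluster mapping standing in for word_centroid_map.
toy_word_centroid_map = {"good": 1, "great": 1, "bad": 0, "awful": 0, "deer": 2}

def group_by_cluster(word_to_cluster):
    """Invert a word -> cluster mapping into cluster -> list of words."""
    clusters = defaultdict(list)
    for word, cluster in word_to_cluster.items():
        clusters[cluster].append(word)
    return clusters

cluster_words = group_by_cluster(toy_word_centroid_map)
for cluster in sorted(cluster_words):
    print("Cluster %d: %s" % (cluster, cluster_words[cluster]))
```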

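The clusters vary in quality. Some make sense: cluster 3 mostly contains names, and clusters 6-8 contain related adjectives (cluster 6 holds the sentiment adjectives I am after). Cluster 5, on the other hand, is a bit of a mystery: what do a crayfish and a deer have in common (apart from both being animals)? Cluster 0 is even worse: penthouses and suites seem to belong together, but apples and passports do not appear to fit. Cluster 2 contains ... war-related words? Our clustering algorithm probably works best on adjectives. In any case, each word now has a cluster (or "centroid") assigned, so we can define a function that converts a review into a bag of centroids. This works like a bag of words, but uses semantically related clusters instead of individual words.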

def create_bag_of_centroids( wordlist, word_centroid_map ):
    #
    # The number of clusters is equal to the highest cluster index
    # in the word / centroid map
    num_centroids = max( word_centroid_map.values() ) + 1
    #
    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros( num_centroids, dtype="float32" )
    #
    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count 
    # by one
    for word in wordlist:
        if word in word_centroid_map:
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1
    #
    # Return the "bag of centroids"
    return bag_of_centroids
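
As a quick check of the idea, here is the same counting logic applied to a made-up mapping with three clusters. The function is restated under a different name so the snippet runs on its own; toy_map and the sample review are hypothetical.

```python
import numpy as np

# Hypothetical word -> cluster mapping with three clusters (0, 1, 2).
toy_map = {"good": 0, "great": 0, "bad": 1, "movie": 2}

def bag_of_centroids(wordlist, word_centroid_map):
    """Count how many words of the review fall into each cluster."""
    num_centroids = max(word_centroid_map.values()) + 1
    bag = np.zeros(num_centroids, dtype="float32")
    for word in wordlist:
        if word in word_centroid_map:
            bag[word_centroid_map[word]] += 1
    return bag

review = ["good", "great", "movie", "unseenword", "good"]
bag = bag_of_centroids(review, toy_map)
print(bag)
```

Here "good" (twice) and "great" all land in cluster 0, "movie" in cluster 2, and the unknown word is ignored, so the review becomes a length-3 count vector.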

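The function above returns a numpy array for each review, with as many features as there are clusters. Finally, we create bags of centroids for the training and test sets, train a random forest, and extract the results.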

from sklearn.ensemble import RandomForestClassifier
# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros( (train["review"].size, num_clusters), \
    dtype="float32" )

# Transform the training set reviews into bags of centroids
counter = 0
for review in clean_train_reviews:
    train_centroids[counter] = create_bag_of_centroids( review, \
        word_centroid_map )
    counter += 1

# Repeat for test reviews 
test_centroids = np.zeros(( test["review"].size, num_clusters), \
    dtype="float32" )

counter = 0
for review in clean_test_reviews:
    test_centroids[counter] = create_bag_of_centroids( review, \
        word_centroid_map )
    counter += 1
# Fit a random forest and extract predictions 
forest = RandomForestClassifier(n_estimators = 100)

# Fitting the forest may take a few minutes
print("Fitting a random forest to labeled training data...")
forest = forest.fit(train_centroids,train["sentiment"])
result = forest.predict(test_centroids)

# Write the test results 
output = pd.DataFrame(data={"id":test["id"], "sentiment":result})
output.to_csv( "BagOfCentroids.csv", index=False, quoting=3 )

Summary

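We found that the code above gives about the same result as the bag-of-words model in Part 1. This does not mean that Word2Vec is useless; for sentiment analysis in this application, doc2vec, published by Google, simply performs better. Demo code and data: link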
