Learn NLP from Scratch (Hands-on Project: News Text Classification), Part 3: Text Classification Based on Machine Learning


Task 3: Text Classification Based on Machine Learning

Learning goals: understand the principle of TF-IDF, and use machine learning models from sklearn for text classification.

Text representation methods: methods that represent text as numbers or vectors a computer can operate on are generally called word embedding methods. They map variable-length text into a fixed-length space.
  • One-hot represents each word (or character) as a discrete vector: each token is assigned an index, and the vector is 1 at that index and 0 elsewhere. e.g.,
    Sentence 1: 我爱北京天安门 (I love Beijing Tiananmen)
    Sentence 2: 我喜欢上海 (I like Shanghai)
    First index every character appearing in the sentences: {我: 1, 爱: 2, 北: 3, 京: 4, 天: 5, 安: 6, 门: 7, 喜: 8, 欢: 9, 上: 10, 海: 11}. Each character then becomes an 11-dimensional sparse vector:
    我: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    爱: [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ...
    海: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
  • Bag of Words / Count Vectors represents each document by how many times each word/character appears:
    Sentence 1: 我爱北京天安门 -> [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
    Sentence 2: 我喜欢上海 -> [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
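    The one-hot scheme above can be sketched in a few lines of Python (a minimal illustration using the toy two-sentence corpus; the helper names are hypothetical):

    ```python
    # Minimal one-hot sketch: build a character index over a toy corpus,
    # then map each character to a sparse indicator vector.
    sentences = ["我爱北京天安门", "我喜欢上海"]

    # Index every distinct character in order of first appearance.
    vocab = {}
    for sent in sentences:
        for ch in sent:
            if ch not in vocab:
                vocab[ch] = len(vocab)

    def one_hot(ch):
        # A vector of zeros with a single 1 at the character's index.
        vec = [0] * len(vocab)
        vec[vocab[ch]] = 1
        return vec

    print(len(vocab))     # 11 distinct characters
    print(one_hot("我"))  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    print(one_hot("海"))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    ```

    Note that the vector length equals the vocabulary size, which is why one-hot representations become very sparse on a real corpus.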
    This can be implemented with sklearn's CountVectorizer:

    from sklearn.feature_extraction.text import CountVectorizer
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    print(X.toarray())                     # term-count matrix
    print(vectorizer.get_feature_names())  # all keywords in the vocabulary

    Result:
    [[0 1 1 1 0 0 1 0 1]
     [0 2 0 1 0 1 1 0 1]
     [1 0 0 1 1 0 1 1 1]
     [0 1 1 1 0 0 1 0 1]]
    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
  • N-gram is similar to Count Vectors, but adjacent tokens are joined into new tokens before counting. e.g., with N = 2:
    Sentence 1: 我爱 爱北 北京 京天 天安 安门
    Sentence 2: 我喜 喜欢 欢上 上海
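    CountVectorizer can produce N-gram counts directly through its ngram_range parameter. A small sketch (the English toy corpus here is only illustrative; get_feature_names_out assumes sklearn >= 1.0, and token_pattern is loosened so one-letter tokens like "I" are kept):

    ```python
    from sklearn.feature_extraction.text import CountVectorizer

    # Bigram sketch: ngram_range=(2, 2) counts adjacent word pairs only.
    corpus = ["I love Beijing Tiananmen", "I like Shanghai"]
    vectorizer = CountVectorizer(ngram_range=(2, 2),
                                 token_pattern=r"(?u)\b\w+\b",
                                 lowercase=False)
    X = vectorizer.fit_transform(corpus)
    print(vectorizer.get_feature_names_out())  # the bigram vocabulary
    print(X.toarray())                         # bigram counts per sentence
    ```

    Setting ngram_range=(1, 2) instead would keep both the unigrams and the bigrams in one vocabulary.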
  • TF-IDF (term frequency-inverse document frequency): a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
    TF(t) = (number of times term t appears in the current document) / (total number of terms in the current document)
    IDF(t) = log_e(total number of documents / number of documents containing term t)
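    The two formulas can be checked with a small hand-rolled computation (a toy sketch; note that sklearn's TfidfVectorizer additionally smooths the IDF and normalizes rows, so its numbers differ from this raw definition):

    ```python
    import math

    # Manual TF-IDF matching the formulas above, on a toy tokenized corpus.
    docs = [["this", "is", "a", "sample"],
            ["this", "is", "another", "example", "example"]]

    def tf(term, doc):
        # times the term appears in the document / total terms in the document
        return doc.count(term) / len(doc)

    def idf(term, docs):
        # log_e(total documents / documents containing the term)
        n_containing = sum(1 for d in docs if term in d)
        return math.log(len(docs) / n_containing)

    def tfidf(term, doc, docs):
        return tf(term, doc) * idf(term, docs)

    print(tfidf("example", docs[1], docs))  # only in doc 2 -> nonzero score
    print(tfidf("this", docs[1], docs))     # in every doc -> idf = 0 -> score 0
    ```

    A term that appears in every document gets IDF = log(1) = 0, which is exactly why TF-IDF downweights common words that pure counts overvalue.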
    This can be implemented with sklearn's TfidfVectorizer:

    from sklearn.feature_extraction.text import TfidfVectorizer
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    print(X.toarray())
    print(vectorizer.get_feature_names())

    Result:
    [[0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]
     [0.         0.6876236  0.         0.28108867 0.         0.53864762 0.28108867 0.         0.28108867]
     [0.51184851 0.         0.         0.26710379 0.51184851 0.         0.26710379 0.51184851 0.26710379]
     [0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]]
    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

    Text classification based on machine learning: compare the accuracy of the different text representations by computing the F1 score on a locally constructed validation set.
  • Count Vectors + RidgeClassifier

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import RidgeClassifier
    from sklearn.metrics import f1_score

    train_df = pd.read_csv('data/train_set.csv', sep='\t', nrows=15000)

    vectorizer = CountVectorizer(max_features=3000)
    train_test = vectorizer.fit_transform(train_df['text'])

    clf = RidgeClassifier()
    clf.fit(train_test[:10000], train_df['label'].values[:10000])
    val_pred = clf.predict(train_test[10000:])

    print("Count Vectors + RidgeClassifier: f1_score =", end=' ')
    print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))

    0.74
  • TF-IDF + RidgeClassifier

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import RidgeClassifier
    from sklearn.metrics import f1_score

    train_df = pd.read_csv('data/train_set.csv', sep='\t', nrows=15000)

    tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
    train_test = tfidf.fit_transform(train_df['text'])

    clf = RidgeClassifier()
    clf.fit(train_test[:10000], train_df['label'].values[:10000])
    val_pred = clf.predict(train_test[10000:])

    print("TF-IDF\t\t + RidgeClassifier: f1_score =", end=' ')
    print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))

    0.87
    Result:
    Count Vectors + RidgeClassifier: f1_score = 0.7406241569237678
    TF-IDF + RidgeClassifier: f1_score = 0.8721598830546126

    Homework for this chapter: try changing the parameters of TF-IDF and observe how the accuracy changes.
    tfidf = TfidfVectorizer(ngram_range=(1,1), max_features=None)
    Meaning of the parameters:
    ngram_range=(min, max): split the text into phrases of min to max tokens. For example, for "Python is useful", ngram_range=(1, 3) yields "Python", "is", "useful", "Python is", "is useful", and "Python is useful", while ngram_range=(1, 1) yields only the single words "Python", "is", and "useful".
    max_features: int - build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. In effect this sets a threshold on word frequency: e.g., if the corpus contains 100 distinct words and, after inspecting the frequencies, 20 of them occur fewer than 50 times, setting max_features=80 keeps only the remaining words.
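    The effect of ngram_range on the vocabulary can be checked directly on the "Python is useful" example (get_feature_names_out assumes sklearn >= 1.0):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Vocabulary produced by different ngram_range settings for the
    # example sentence from the text above (lowercased by default).
    text = ["Python is useful"]

    for ngram_range in [(1, 1), (1, 2), (1, 3)]:
        tfidf = TfidfVectorizer(ngram_range=ngram_range)
        tfidf.fit(text)
        print(ngram_range, list(tfidf.get_feature_names_out()))
    ```

    With (1, 1) the vocabulary has 3 unigrams; (1, 3) adds the 2 bigrams and the single trigram, for 6 features in total, which is why larger ngram_range values need a larger max_features budget.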
    Varying max_features:

    for max_features in range(1000, 5000, 500):
        tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=max_features)
        train_test = tfidf.fit_transform(train_df['text'])

        clf = RidgeClassifier()
        clf.fit(train_test[:10000], train_df['label'].values[:10000])

        val_pred = clf.predict(train_test[10000:])
        print("max_features =", max_features, ": f1_score =", end=' ')
        print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))

    Result:
    max_features = 1000 : f1_score = 0.8270776630718544
    max_features = 1500 : f1_score = 0.8422204285520029
    max_features = 2000 : f1_score = 0.8603842642428617
    max_features = 2500 : f1_score = 0.8680439682849046
    max_features = 3000 : f1_score = 0.8721598830546126
    max_features = 3500 : f1_score = 0.8690857354726925
    max_features = 4000 : f1_score = 0.8753945850878357
    max_features = 4500 : f1_score = 0.882933113362208
    The f1_score generally rises as max_features grows, and is highest at max_features = 4500 in this range.
    Next, fix max_features and vary ngram_range:
    tfidf = TfidfVectorizer(ngram_range=(1,1), max_features=3200)
    tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=3200)
    tfidf = TfidfVectorizer(ngram_range=(1,4), max_features=3200)
    Result:
    ngram_range=(1,1): f1_score =
    ngram_range=(1,2): f1_score =
    ngram_range=(1,4): f1_score =
    Also try other machine learning models to complete training and validation.
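    As one direction for the "other models" exercise, the classifier can simply be swapped, e.g. RidgeClassifier for sklearn's LinearSVC. The sketch below uses a tiny made-up corpus so it runs standalone; a real run would reuse train_df['text'] and train_df['label'] exactly as in the Ridge examples above:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    # Toy stand-in for train_df: 6 short documents, 2 classes.
    texts = ["cheap pills buy now", "meeting at noon today",
             "buy cheap watches now", "lunch meeting rescheduled",
             "cheap offer buy today", "project meeting agenda"]
    labels = [1, 0, 1, 0, 1, 0]

    # Same representation as before: TF-IDF over 1- to 3-grams.
    tfidf = TfidfVectorizer(ngram_range=(1, 3))
    X = tfidf.fit_transform(texts)

    # Train on the first 4 documents, validate on the last 2.
    clf = LinearSVC()
    clf.fit(X[:4], labels[:4])

    val_pred = clf.predict(X[4:])
    print("TF-IDF + LinearSVC: f1_score =",
          f1_score(labels[4:], val_pred, average="macro"))
    ```

    Other drop-in candidates with the same fit/predict interface include LogisticRegression and MultinomialNB; only the clf line changes.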
    Note: parts of this homework are adapted from others' write-ups; revisions will follow.