Sentiment Analysis: A Deep Dive into the Principles and Practice of snownlp


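1. Overview of snownlp
What is snownlp?
SnowNLP is a class library written in Python that makes it easy to process Chinese text. It was inspired by TextBlob. Because most natural language processing libraries today target English, the author wrote a library that is convenient for Chinese. Unlike TextBlob, it does not use NLTK: all algorithms are implemented from scratch, and the library ships with pre-trained dictionaries. Note that the library works on unicode throughout, so decode your input to unicode before use (this applies to Python 2 byte strings; a Python 3 str is already Unicode).
The above is the official description of snownlp. In short, snownlp is a Python library for Chinese natural language processing that supports the following operations: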

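Chinese word segmentation

Part-of-speech tagging

Sentiment analysis

Text classification

Conversion to pinyin

Traditional-to-Simplified Chinese conversion

Keyword extraction

Text summarization

tf, idf

Tokenization

Text similarity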

This article focuses on the sentiment analysis feature of snownlp.
2. Using the snownlp sentiment analysis module
2.1 Installing the snownlp library
snownlp can be installed as follows:

pip install snownlp

2.2 Sentiment analysis with snownlp
The following code performs sentiment analysis with snownlp:

# coding: UTF-8
# Written for Python 3, where str is already Unicode and no decode() is needed.
import sys
from snownlp import SnowNLP


def read_and_analysis(input_file, output_file):
    # Each input line is expected to hold two tab-separated fields: an id and the text.
    with open(input_file, encoding="utf-8") as f, \
            open(output_file, "w", encoding="utf-8") as fw:
        for line in f:
            fields = line.strip().split("\t")
            if len(fields) < 2:
                continue  # skip malformed lines

            s = SnowNLP(fields[1])
            # s.words: the word segmentation result
            seg_words = "".join("_" + w for w in s.words)
            # s.sentiments: the sentiment score of the text
            fw.write(fields[0] + "\t" + fields[1] + "\t" + seg_words
                     + "\t" + str(s.sentiments) + "\n")


if __name__ == "__main__":
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    read_and_analysis(input_file, output_file)

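The code above reads each line of text from the input file, runs sentiment analysis on it, and writes the result to the output file.
Note: the model shipped with the library was trained on product review data, so in practice you will often need to retrain it on data from your own domain.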
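The score written by the script above still has to be turned into a decision by downstream code. The following is a minimal sketch of one way to do that, assuming the score is the probability of the positive class. The helper name `to_label` and the cut-offs 0.4 and 0.6 are hypothetical and not part of snownlp; suitable thresholds depend on your data.

```python
def to_label(score, low=0.4, high=0.6):
    """Map a positive-class probability to a coarse label.

    `low` and `high` are illustrative cut-offs (assumptions), not snownlp defaults.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    if score >= high:
        return "pos"
    if score <= low:
        return "neg"
    return "neutral"  # too close to 0.5 to call either way


if __name__ == "__main__":
    for score in (0.95, 0.5, 0.1):
        print(score, to_label(score))
```

Leaving a neutral band around 0.5 is a design choice: it avoids forcing a label on texts the model is unsure about, which matters more when the model was trained on a different domain.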
2.3 Training a sentiment analysis model on new data
In a real project the sentiment model should be retrained on the project's own data. This breaks down into the following steps:

Prepare positive and negative samples and save them separately: positive samples in pos.txt, negative samples in neg.txt

Train a new model with snownlp

Save the new model
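The first step above can be sketched as follows, assuming the labeled data is available as (label, text) pairs and that each sample occupies one line of the output file. The function name `split_samples` and the input format are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def split_samples(labeled_texts, out_dir):
    """Write positive texts to pos.txt and negative texts to neg.txt.

    `labeled_texts` is an iterable of (label, text) pairs where label is
    "pos" or "neg"; this input format is an assumption for illustration.
    """
    pos_path = os.path.join(out_dir, "pos.txt")
    neg_path = os.path.join(out_dir, "neg.txt")
    with open(pos_path, "w", encoding="utf-8") as pos_f, \
            open(neg_path, "w", encoding="utf-8") as neg_f:
        for label, text in labeled_texts:
            # One sample per line, so flatten embedded newlines.
            line = text.replace("\n", " ").strip()
            if not line:
                continue
            if label == "pos":
                pos_f.write(line + "\n")
            elif label == "neg":
                neg_f.write(line + "\n")
    return pos_path, neg_path


if __name__ == "__main__":
    samples = [("pos", "质量很好"), ("neg", "太差了"), ("pos", "非常满意")]
    with tempfile.TemporaryDirectory() as tmp:
        pos_path, neg_path = split_samples(samples, tmp)
        with open(pos_path, encoding="utf-8") as f:
            print(f.read().splitlines())
```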
The code for retraining the sentiment model is as follows:

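# coding: UTF-8

from snownlp import sentiment

if __name__ == "__main__":
    # Retrain the classifier: negative samples first, then positive samples
    sentiment.train('./neg.txt', './pos.txt')
    # Save the newly trained model
    sentiment.save('sentiment.marshal')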

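Note: to run sentiment analysis with the newly trained model, change the path from which the code loads the model, that is, the data_path variable used as the default by load: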

data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),'sentiment.marshal')

3. Source code analysis of snownlp sentiment analysis
The module that implements sentiment analysis in snownlp lives in the sentiment folder, and its core code is in __init__.py.
The Sentiment class is as follows:

class Sentiment(object):

    def __init__(self):
        self.classifier = Bayes()  # use the Bayes classifier

    def save(self, fname, iszip=True):
        self.classifier.save(fname, iszip)  # save the trained model

    def load(self, fname=data_path, iszip=True):
        self.classifier.load(fname, iszip)  # load a saved model

    # Preprocess a document: segment it and remove stop words
    def handle(self, doc):
        words = seg.seg(doc)  # word segmentation
        words = normal.filter_stop(words)  # remove stop words
        return words  # return the remaining words

    def train(self, neg_docs, pos_docs):
        data = []
        # negative samples
        for sent in neg_docs:
            data.append([self.handle(sent), 'neg'])
        # positive samples
        for sent in pos_docs:
            data.append([self.handle(sent), 'pos'])
        # train the Bayes classifier
        self.classifier.train(data)

    def classify(self, sent):
        # 1. call this class's handle method
        # 2. call the Bayes classifier's classify method
        ret, prob = self.classifier.classify(self.handle(sent))  # predict with the Bayes classifier
        if ret == 'pos':
            return prob
        return 1-prob

As the code shows, classify and train are the two core functions: train trains the sentiment classifier and classify makes predictions. Both call handle, which does the following:

Segments the input text into words

Removes stop words
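The two preprocessing steps can be illustrated with a minimal sketch. It deliberately does not use snownlp's segmenter: `toy_segment` (whitespace splitting) and the tiny `STOP_WORDS` set are hypothetical stand-ins so that the example runs on its own.

```python
STOP_WORDS = {"the", "a", "is", "of"}  # toy stop-word list (assumption)


def toy_segment(doc):
    """Stand-in for a real word segmenter: lowercase and split on whitespace."""
    return doc.lower().split()


def handle(doc, stop_words=STOP_WORDS):
    """Segment the text, then drop stop words, mirroring the two steps above."""
    words = toy_segment(doc)
    return [w for w in words if w not in stop_words]


if __name__ == "__main__":
    print(handle("The quality of the screen is great"))
    # ['quality', 'screen', 'great']
```

Real Chinese text has no spaces between words, which is why the library needs a trained segmenter where this sketch gets away with `split()`.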

The underlying model for sentiment classification is the naive Bayes model Bayes. For background on it, see the article "Simple and Easy-to-Learn Machine Learning Algorithms: Naive Bayes". Consider a classification problem with two classes $c_1$ and $c_2$ and features $w_1, \cdots, w_n$ that are assumed to be independent of one another. The Bayes model for membership in class $c_1$ is:
$$P(c_1 \mid w_1, \cdots, w_n) = \frac{P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1)}{P(w_1, \cdots, w_n)}$$
where:
$$P(w_1, \cdots, w_n) = P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1) + P(w_1, \cdots, w_n \mid c_2) \cdot P(c_2)$$
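A small worked example makes the two formulas concrete. The probabilities below are made-up numbers for illustration only, and the helper names `naive_likelihood` and `posterior_c1` are hypothetical.

```python
def naive_likelihood(word_probs):
    """P(w1..wn | c) as a product of per-word probabilities (independence assumption)."""
    result = 1.0
    for p in word_probs:
        result *= p
    return result


def posterior_c1(likelihood_c1, prior_c1, likelihood_c2, prior_c2):
    """P(c1 | w1..wn) for a two-class problem via Bayes' rule."""
    joint_c1 = likelihood_c1 * prior_c1
    joint_c2 = likelihood_c2 * prior_c2
    evidence = joint_c1 + joint_c2  # P(w1..wn), by the law of total probability
    return joint_c1 / evidence


if __name__ == "__main__":
    # Two words; per-word probabilities under each class are invented.
    like_c1 = naive_likelihood([0.2, 0.1])    # 0.02
    like_c2 = naive_likelihood([0.1, 0.05])   # 0.005
    print(round(posterior_c1(like_c1, 0.5, like_c2, 0.5), 3))  # 0.8
```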
3.1 Training the Bayes model
Training the Bayes model essentially amounts to counting how often each feature occurs. The core code is:

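def train(self, data):
    # data: the training samples; each sample is [words, label]
    for d in data:  # data is a list
        # d[0]: the segmented words of the sample, a list
        # d[1]: the positive/negative class label
        c = d[1]
        if c not in self.d:
            self.d[c] = AddOneProb()  # one counter object per class
        for word in d[0]:  # count every word of the sample
            self.d[c].add(word, 1)
    # total count over the positive and negative classes
    self.total = sum(map(lambda x: self.d[x].getsum(), self.d.keys()))  # sum of getsum() over the classes in d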

This relies on the AddOneProb class, shown below:

class AddOneProb(BaseProb):

    def __init__(self):
        self.d = {}
        self.total = 0.0
        self.none = 1  # none defaults to 1
    # With value == 1, a key seen for the first time ends up with a count of 2
    def add(self, key, value):
        self.total += value
        # If the key does not exist yet, create it
        if not self.exists(key):
            self.d[key] = 1
            self.total += 1
        self.d[key] += value

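Notes:

The default value of none is 1.

When a key does not exist yet, both total and the corresponding d[key] are increased by 1 + value. This matters later, at prediction time.

The total of an AddOneProb instance is the sum of all counts in the positive class or in the negative class, while the total in the train function is the sum of the totals of both classes.

Once total and the count d[key] of every feature key have been collected from the training samples, training is complete.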
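To see the counts these notes describe, the counting logic can be restated as a standalone class and run on a few words. This is a sketch for experimentation, not the library's class: it drops the BaseProb parent, and the `freq` method is an assumption about how frequencies are derived (count divided by total, with `none` used for unseen keys).

```python
class AddOneProb:
    """Standalone restatement of the counting logic above."""

    def __init__(self):
        self.d = {}
        self.total = 0.0
        self.none = 1  # count assumed for words never seen in training

    def add(self, key, value):
        self.total += value
        if key not in self.d:
            # First sighting: start the count at 1 (add-one smoothing).
            self.d[key] = 1
            self.total += 1
        self.d[key] += value

    def freq(self, key):
        # Assumed behaviour: unseen words fall back to `none`, so this is never 0.
        return float(self.d.get(key, self.none)) / self.total


if __name__ == "__main__":
    pos = AddOneProb()
    for word in ["good", "good", "cheap"]:
        pos.add(word, 1)
    print(pos.d, pos.total)   # {'good': 3, 'cheap': 2} 5.0
    print(pos.freq("good"), pos.freq("unseen"))  # 0.6 0.2
```

Three observed words produce a total of 5 rather than 3, because each distinct key contributes one extra count the first time it is seen.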
3.2 Prediction with the Bayes model
Prediction uses the following formula:
$$P(c_1 \mid w_1, \cdots, w_n) = \frac{P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1)}{P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1) + P(w_1, \cdots, w_n \mid c_2) \cdot P(c_2)}$$
This can be simplified as follows:
$$
\begin{aligned}
P(c_1 \mid w_1, \cdots, w_n)
&= \frac{P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1)}{P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1) + P(w_1, \cdots, w_n \mid c_2) \cdot P(c_2)} \\
&= \frac{1}{1 + \dfrac{P(w_1, \cdots, w_n \mid c_2) \cdot P(c_2)}{P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1)}} \\
&= \frac{1}{1 + \exp\left[\log\left(\dfrac{P(w_1, \cdots, w_n \mid c_2) \cdot P(c_2)}{P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1)}\right)\right]} \\
&= \frac{1}{1 + \exp\left[\log\left(P(w_1, \cdots, w_n \mid c_2) \cdot P(c_2)\right) - \log\left(P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1)\right)\right]}
\end{aligned}
$$
Here the 1 in the denominator can be rewritten as:
$$1 = \exp\left[\log\left(P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1)\right) - \log\left(P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1)\right)\right]$$
so the denominator becomes a sum, over both classes $c_k$, of $\exp\left[\log\left(P(w_1, \cdots, w_n \mid c_k) \cdot P(c_k)\right) - \log\left(P(w_1, \cdots, w_n \mid c_1) \cdot P(c_1)\right)\right]$.
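The simplification can be checked numerically. The sketch below compares the direct ratio with the log-space form and then shows a reason to prefer the latter: with many words the raw products underflow to zero while their logarithms stay finite. The function names and all numbers are illustrative assumptions.

```python
from math import exp, log


def posterior_direct(a, b):
    """a / (a + b), where a = P(w|c1)P(c1) and b = P(w|c2)P(c2)."""
    return a / (a + b)


def posterior_log_space(log_a, log_b):
    """The same quantity computed as 1 / (1 + exp(log b - log a))."""
    return 1.0 / (1.0 + exp(log_b - log_a))


if __name__ == "__main__":
    a, b = 0.01, 0.0025
    print(posterior_direct(a, b), posterior_log_space(log(a), log(b)))

    # A product of 400 small probabilities underflows to 0.0 as a float ...
    print(0.01 ** 400)
    # ... but the sum of their logs is an ordinary number, so the
    # log-space form still yields a usable posterior.
    log_a = 400 * log(0.01)
    log_b = 400 * log(0.0101)
    print(posterior_log_space(log_a, log_b))
```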
The code that implements these steps is as follows:

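def classify(self, x):
    tmp = {}
    for k in self.d:  # the positive and negative classes
        tmp[k] = log(self.d[k].getsum()) - log(self.total)  # log of the class total minus log of the overall total
        for word in x:
            tmp[k] += log(self.d[k].freq(word))  # the frequency is smoothed, so it is never 0
    ret, prob = 0, 0
    for k in self.d:
        now = 0
        try:
            # denominator: sum over all classes of exp(tmp[otherk] - tmp[k])
            for otherk in self.d:
                now += exp(tmp[otherk]-tmp[k])
            now = 1/now
        except OverflowError:
            now = 0
        # keep the class with the highest probability
        if now > prob:
            ret, prob = k, now
    return (ret, prob)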

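In the first for loop, the initial value assigned to tmp[k] corresponds to $\log(P(c_k))$ in the formula; after the inner loop over the words has finished, tmp[k] corresponds to $\log(P(w_1, \cdots, w_n \mid c_k) \cdot P(c_k))$. The second for loop then evaluates the simplified formula for each class and returns the class with the highest probability.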
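Putting training and prediction together, the following is a compact, self-contained sketch of the same procedure that can be run on toy data. It is a simplified restatement, not the library's code: the class name `TinyBayes`, its attribute names, and the training sentences are all hypothetical.

```python
from math import exp, log


class TinyBayes:
    """Simplified naive Bayes following the train/classify logic above."""

    def __init__(self):
        self.counts = {}   # class -> {word: count}
        self.totals = {}   # class -> sum of counts in that class
        self.total = 0.0   # sum over all classes

    def train(self, data):
        for words, label in data:
            counts = self.counts.setdefault(label, {})
            self.totals.setdefault(label, 0.0)
            for word in words:
                if word not in counts:
                    counts[word] = 1          # add-one smoothing
                    self.totals[label] += 1
                counts[word] += 1
                self.totals[label] += 1
        self.total = sum(self.totals.values())

    def _freq(self, label, word):
        # Unseen words get a count of 1 so that log() is always defined.
        return self.counts[label].get(word, 1) / self.totals[label]

    def classify(self, words):
        scores = {}
        for label in self.counts:
            scores[label] = log(self.totals[label]) - log(self.total)  # log P(c)
            for word in words:
                scores[label] += log(self._freq(label, word))          # + log P(w|c)
        best, best_prob = None, 0.0
        for label in scores:
            try:
                denom = sum(exp(scores[other] - scores[label]) for other in scores)
                prob = 1.0 / denom
            except OverflowError:
                prob = 0.0
            if prob > best_prob:
                best, best_prob = label, prob
        return best, best_prob


if __name__ == "__main__":
    clf = TinyBayes()
    clf.train([
        (["good", "quality"], "pos"),
        (["very", "good"], "pos"),
        (["bad", "quality"], "neg"),
        (["very", "bad"], "neg"),
    ])
    print(clf.classify(["good"]))  # ('pos', 0.75)
    print(clf.classify(["bad"]))   # ('neg', 0.75)
```

With this toy data both classes have a total of 7, so the priors cancel and the result for "good" reduces to (3/7) / (3/7 + 1/7) = 0.75.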
References

snownlp on GitHub

snowNLP, a natural language processing library
