Accurate word segmentation: loading a custom dictionary (using pyhanlp as the example)



Contents
1. pyhanlp
1.1 Overview
1.2 Adding words to the pyhanlp dictionary
2. Segmentation comparison
tokenizer.py: the HanLP functions
cut_data.py: the main script
All code and datasets: https://github.com/455125158/NLP_basis

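1. pyhanlp
1.1 Overview
pyhanlp project page: https://github.com/hankcs/pyhanlp
pyhanlp online demo: http://hanlp.com/?sentence=%E4%B8%8B%E9%9B%A8%E5%A4%A9%E5%9C%B0%E9%9D%A2%E7%A7%AF%E6%B0%B4
Chinese word segmentation ≠ natural language processing!
Chinese word segmentation is only the first step. HanLP starts with segmentation and goes on to cover part-of-speech tagging, named entity recognition, syntactic parsing, text classification and other common tasks, all exposed through a rich API.
Unlike some crude segmentation libraries, HanLP has carefully optimized its internal data structures and I/O interfaces. It achieves millisecond-level cold starts and a processing speed of tens of millions of characters per second, while needing as little as 120 MB of memory. It works well on anything from a mobile device to a large cluster.
Unlike the commercial tools on the market, HanLP ships its training modules, so you can train models on your own corpus and swap them in for the default models to adapt to a different domain. The project homepage provides detailed documentation and models trained on several open-source corpora.
HanLP aims to combine academic accuracy with industrial efficiency, striking a balance between the two so that natural language processing can truly be used in production.
The pyhanlp package used here is a Python wrapper around HanLP's Java interface.
Common pyhanlp API calls: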

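from pyhanlp import *

print(HanLP.segment('你好，欢迎在Python中调用HanLP的API'))
for term in HanLP.segment('下雨天地面积水'):
    print('{}\t{}'.format(term.word, term.nature))  # each word and its part of speech
testCases = [
    "商品和服务",
    "结婚的和尚未结婚的确实在干扰分词啊",
    "买水果然后来世博园最后去世博会",
    "中国的首都是北京",
    "欢迎新老师生前来就餐",
    "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作",
    "随着页游兴起到现在的页游繁盛，依赖于存档进行逻辑判断的设计减少了，但这块也不能完全忽略掉。"]
for sentence in testCases:
    print(HanLP.segment(sentence))
# Keyword extraction
document = "水利部水资源司司长陈明忠9月29日在国务院新闻办举行的新闻发布会上透露，" \
           "根据刚刚完成了水资源管理制度的考核，有部分省接近了红线的指标，" \
           "有部分省超过红线的指标。对一些超过红线的地方，陈明忠表示，对一些取用水项目进行区域的限批，" \
           "严格地进行水资源论证和取水许可的批准。"
print(HanLP.extractKeyword(document, 2))
# Automatic summarization
print(HanLP.extractSummary(document, 3))
# Dependency parsing
print(HanLP.parseDependency("徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标。"))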

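1.2 Adding words to the pyhanlp dictionary

Custom dictionaries for HanLP segmentation live in the installed pyhanlp package, under "static\data\dictionary\custom". In the author's Anaconda installation on drive D: this is "...\Lib\site-packages\pyhanlp\static\data\dictionary\custom".

1.2.1. Delete "CustomDictionary.txt.bin". It is only a cache file, so deleting it is safe.
1.2.2. Add the words that must not be split to "CustomDictionary.txt". The text segmented below is a medical article, so a few medical terms are added.

Each entry has the form "word" "part of speech" "word frequency", with the fields separated by spaces.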
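
The two steps above can also be scripted. The following is a minimal sketch, not part of pyhanlp: the helper name `add_custom_words`, the sample terms, the `nz` tag and the frequency 1024 are all made up for illustration, and the code assumes the dictionary is a UTF-8 text file with one space-separated entry per line, as described above. The demo runs against a temporary directory rather than the real pyhanlp folder.

```python
import tempfile
from pathlib import Path


def add_custom_words(dict_dir, entries):
    """Append 'word pos frequency' lines to CustomDictionary.txt and delete the cache.

    dict_dir: folder holding CustomDictionary.txt (assumed layout).
    entries:  iterable of (word, pos, frequency) tuples (hypothetical sample data).
    Returns the list of lines that were actually appended.
    """
    dict_dir = Path(dict_dir)
    txt = dict_dir / "CustomDictionary.txt"
    cache = dict_dir / "CustomDictionary.txt.bin"

    existing = txt.read_text(encoding="utf-8") if txt.exists() else ""
    # The first space-separated field of every line is the word itself
    known = {line.split(" ")[0] for line in existing.splitlines() if line.strip()}

    new_lines = [f"{word} {pos} {freq}" for word, pos, freq in entries if word not in known]
    if new_lines:
        # Start on a fresh line if the file does not end with a newline
        prefix = "" if existing == "" or existing.endswith("\n") else "\n"
        with txt.open("a", encoding="utf-8") as f:
            f.write(prefix + "\n".join(new_lines) + "\n")

    # Remove the stale cache so the text file is read again on the next load
    if cache.exists():
        cache.unlink()
    return new_lines


# Demo in a throwaway directory; point dict_dir at the real folder to use it
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "CustomDictionary.txt").write_text("房颤 nz 1024\n", encoding="utf-8")
    Path(tmp, "CustomDictionary.txt.bin").write_bytes(b"cache")
    added = add_custom_words(tmp, [("房颤", "nz", 1024), ("冠状动脉粥样硬化", "nz", 1024)])
    print(added)
    print(Path(tmp, "CustomDictionary.txt").read_text(encoding="utf-8"))
```

Words that are already in the file are skipped, so the script can be rerun without duplicating entries.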

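2. Segmentation comparison

tokenizer.py: the HanLP functions

This file holds the basic pattern for segmenting text with pyhanlp. It is highly portable and can be dropped into other projects.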


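from pyhanlp import *


# POS tags to filter out. The listing never defines this set, so the values
# below are only an assumed example (punctuation, particle, preposition).
drop_pos_set = {'w', 'ude1', 'p'}

def to_string(sentence, return_generator=False):
    if return_generator:
        # Lazily yield [word, pos] lists; str() makes sure toString() is a Python str
        return (str(word_pos_item.toString()).split('/') for word_pos_item in HanLP.segment(sentence))
    else:
        # Keep only the word of each 'word/pos' item and join the words with spaces.
        # split('/') turns a str into a list: 'ssfa/fsss'.split('/') => ['ssfa', 'fsss']
        return " ".join([str(word_pos_item.toString()).split('/')[0] for word_pos_item in HanLP.segment(sentence)])

def seg_sentences(sentence, with_filter=True, return_generator=False):
    # Always request [word, pos] pairs: the space-joined str carries no POS tags to filter on
    segs = to_string(sentence, return_generator=True)
    if with_filter:
        g = [word_pos_pair[0] for word_pos_pair in segs if len(word_pos_pair)==2 and word_pos_pair[0]!=' ' and word_pos_pair[1] not in drop_pos_set]
    else:
        g = [word_pos_pair[0] for word_pos_pair in segs if len(word_pos_pair)==2 and word_pos_pair[0]!=' ']
    return iter(g) if return_generator else g

def cut_hanlp(raw_sentence, return_list=True):
    '''
    Segment one sentence with HanLP.

    :param raw_sentence: the raw input text
    :param return_list: True returns the words as one space-joined str; False returns an iterator over the words
    :return: the segmented text ("" for blank input)
    '''
    if len(raw_sentence.strip()) > 0:
        return to_string(raw_sentence) if return_list else iter(to_string(raw_sentence).split(" "))
    return ""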
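
The split-and-filter step in `seg_sentences` can be tried without starting HanLP. The sketch below runs the same logic on hand-written `word/pos` strings. The function name `filter_terms`, the tag set `DROP_POS` and the sample tokens are made up for illustration; they are not real HanLP output.

```python
# Assumed sample tags to drop: punctuation (w), the particle 的 (ude1), prepositions (p)
DROP_POS = {"w", "ude1", "p"}


def filter_terms(term_strings, drop_pos=DROP_POS, with_filter=True):
    """Keep the word part of each 'word/pos' string, optionally dropping some POS tags."""
    words = []
    for item in term_strings:
        pair = item.split("/")
        # Skip anything that is not exactly word/pos, and skip bare spaces
        if len(pair) != 2 or pair[0] == " ":
            continue
        if with_filter and pair[1] in drop_pos:
            continue
        words.append(pair[0])
    return words


# Hand-written stand-ins for what term.toString() would produce
sample = ["患者/n", "的/ude1", "血压/n", "正常/a", "。/w"]
print(filter_terms(sample))                      # ['患者', '血压', '正常']
print(filter_terms(sample, with_filter=False))   # ['患者', '的', '血压', '正常', '。']
```

A token whose word itself contains `/` splits into more than two parts and is dropped by the length check, which is also how the listing above behaves.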

cut_data.py: the main script

#-*- coding=utf8 -*-
import jieba
import re
from tokenizer import cut_hanlp

jieba.load_userdict("dict.txt")   # load the jieba user dictionary

def merge_two_list(a, b):
    c=[]
    len_a, len_b = len(a), len(b)
    minlen = min(len_a, len_b)
    for i in range(minlen):
        c.append(a[i])
        c.append(b[i])

    if len_a > len_b:
        for i in range(minlen, len_a):
            c.append(a[i])
    else:
        for i in range(minlen, len_b):
            c.append(b[i])
    return c


if __name__=="__main__":
    regex1=u'(?:[^\u4e00-\u9fa5（）*&……%￥$，,。.@! ！]){1,5}期'  # 1-5 non-Chinese characters followed by 期 (xxx期)
    regex2=r'(?:[0-9]{1,3}[.]?[0-9]{1,3})%'                    # percentages such as xx.xx%
    p1=re.compile(regex1)
    p2=re.compile(regex2)
    # The with statement closes both files even if segmentation raises
    with open("text.txt","r",encoding="utf8") as fp, open("result_cut.txt","w",encoding="utf8") as fout:
        for line in fp:
            line=line.strip()
            result1=p1.findall(line)       # all matches, returned as a list
            if result1:
                line=p1.sub("FLAG1",line)  # replace every match with the placeholder FLAG1
            result2=p2.findall(line)
            if result2:
                line=p2.sub("FLAG2",line)
            words=jieba.cut(line)     # jieba segmentation; returns a generator object
            result=" ".join(words)    # join the generator's tokens into one space-separated str

            words1=cut_hanlp(line)    # HanLP segmentation; returns a str (placeholders are left as they are)

            if "FLAG1" in result:
                result=result.split("FLAG1")
                result=merge_two_list(result,result1)  # put the protected strings back
                result="".join(result)                 # merge_two_list returns a list; join it back into a str
            if "FLAG2" in result:
                result=result.split("FLAG2")
                result=merge_two_list(result,result2)
                result="".join(result)

            fout.write("jieba :"+result+"\n")
            fout.write("hanlp:"+str(words1)+"\n")
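
The placeholder trick in cut_data.py (mask the pattern, segment, then put the original strings back) can be tried in isolation. The sketch below reuses the percentage pattern from the script but swaps jieba for a stand-in tokenizer so that it runs with the standard library only. `fake_cut` and `protect_and_cut` are hypothetical names, and `fake_cut` does not imitate jieba's real behavior: it simply keeps ASCII alphanumeric runs whole and makes every other character its own token.

```python
import re

PERCENT = re.compile(r'(?:[0-9]{1,3}[.]?[0-9]{1,3})%')  # same pattern as regex2 in cut_data.py


def fake_cut(text):
    """Stand-in tokenizer (NOT jieba): ASCII alphanumeric runs stay whole,
    every other non-space character becomes a separate token."""
    return re.findall(r'[A-Za-z0-9]+|\S', text)


def protect_and_cut(line, pattern=PERCENT, flag="FLAG2"):
    """Mask every match with a placeholder, tokenize, then restore the matches in order."""
    matches = pattern.findall(line)        # original strings, in order of appearance
    masked = pattern.sub(flag, line)       # e.g. '85.5%' -> 'FLAG2'
    joined = " ".join(fake_cut(masked))    # the placeholder survives as a single token
    pieces = joined.split(flag)            # always len(matches) + 1 pieces
    out = []
    for i, piece in enumerate(pieces):
        out.append(piece)
        if i < len(matches):
            out.append(matches[i])         # interleave, like merge_two_list
    return "".join(out)


text = "有效率为85.5%，复发率12.3%"
print(" ".join(fake_cut(text)))   # unprotected: the percentages are broken apart
print(protect_and_cut(text))      # 有 效 率 为 85.5% ， 复 发 率 12.3%
```

The approach relies on the placeholder never occurring in the input text and on the tokenizer leaving it in one piece. If either assumption fails, the split and the list of matches no longer line up.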

Final result:

The jieba and HanLP outputs are nearly identical.
