[5] Segmenting the classification corpus with jieba


From the NLP forum: http://www.threedweb.cn/thread-1295-1-1.html

Workspace path: X:\WorkSpace\text_mining, where X is the Windows drive letter. Project home directory:
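text_mining

    |-- text_corpus_small : original corpus, one subdirectory per category
    |-- text_corpust_pos : preprocessed corpus
    |-- text_corpus_segment : segmented corpus
    |-- text_corpus_wordbag : bag-of-words output
         |-- train_set.data : output of train_bags.py
         |-- word_bag.data : output of tf-idf.py
    |-- jieba_example.py : jieba usage example
    |-- corpus_segment.py : corpus segmentation
    |-- corpus_prepos.py : corpus preprocessing
    |-- train_bags.py : builds train_set.data
    |-- tf-idf.py : computes Tf-idf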

Workflow of the preprocessing phase:

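1. corpus_prepos.py preprocesses the files under text_corpus_small:
    1) processes the header and footer
    2) processes "\r\n" line breaks
    3) saves the result under text_corpust_pos, in the same layout as text_corpus_small

2. corpus_segment.py segments the files under text_corpust_pos and saves the result under text_corpus_segment

3. train_bags.py reads text_corpust_pos and writes its output to text_corpus_wordbag, file name: train_set.data

4. tf-idf.py computes Tf-idf weights from train_set.data and writes the result, file name: word_bag.data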
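Step 4 is only named in the source, so the following is a minimal sketch of one common Tf-idf weighting (term frequency multiplied by the log of the inverse document frequency), not the contents of tf-idf.py. The function name `tf_idf` and the list-of-token-lists input are assumptions made for illustration, and the sketch is written for Python 3.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists (hypothetical input format).
    weight = (count of term in doc / tokens in doc) * log(N / df(term))
    """
    n_docs = len(docs)

    # Document frequency: the number of documents each term occurs in.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

With this weighting, a term that occurs in every document gets weight 0, because log(N / N) = 0. Other variants smooth the idf term to avoid that, so the real script may produce different numbers.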

Original corpus category set: the category set is the list of subdirectories under text_corpus_small.
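Since the categories are simply the subdirectory names, they can be read straight from disk. The sketch below is written for Python 3 and uses a hypothetical helper name, `list_categories`. It skips plain files, which is an added safeguard and not something the source describes.

```python
import os


def list_categories(corpus_root):
    """Return the sorted names of the immediate subdirectories of corpus_root."""
    return sorted(
        name
        for name in os.listdir(corpus_root)
        if os.path.isdir(os.path.join(corpus_root, name))
    )
```

Calling `list_categories("text_corpus_small")` from the workspace directory would return one name per category.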

Sample file before segmentation:

[The Chinese sample text was lost in extraction; only punctuation and digits such as "2005", "1 25" and “122” remain.]

The code of corpus_segment.py is as follows:

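# -*- coding: utf-8 -*-

import sys
import os
import jieba

# Set the default encoding to utf-8 (Python 2 only)
reload(sys)
sys.setdefaultencoding('utf-8')
# Path of the corpus before segmentation
corpus_path = "text_corpus_small"+"/"
# Path where the segmented corpus is saved
seg_path = "text_corpus_segment"+"/"

# List all entries (category subdirectories) under corpus_path
dir_list = os.listdir(corpus_path)

# Walk each category directory and segment every file in it
for mydir in dir_list:
    class_path = corpus_path+mydir+"/"  # path of the category subdirectory
    file_list = os.listdir(class_path)  # all files under class_path
    for file_path in file_list:  # iterate over the files
        file_name = class_path + file_path  # full path of the input file
        file_read = open(file_name, 'rb')  # open the input file
        raw_corpus = file_read.read()  # read the unsegmented text
        seg_corpus = jieba.cut(raw_corpus)  # segment with jieba
        # Directory for the segmented output of this category
        seg_dir = seg_path+mydir+"/"
        if not os.path.exists(seg_dir):  # create it if it does not exist
            os.makedirs(seg_dir)
        file_write = open(seg_dir + file_path, 'wb')  # output file, same name as the input
        file_write.write(" ".join(seg_corpus))  # join the tokens with spaces and write them
        file_read.close()  # close the input file
        file_write.close()  # close the output file

print "Corpus segmentation finished!!!"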

corpus_path: path of the corpus before segmentation
seg_path: path where the segmented corpus is saved
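For readers on Python 3, where `reload(sys)` and `sys.setdefaultencoding` no longer exist, the same loop can be sketched as follows. This is an adaptation, not the author's script. The function name `segment_corpus` and the `tokenize` parameter are hypothetical, and the sketch assumes the corpus files are UTF-8 encoded, whereas the original reads raw bytes and leaves decoding to jieba.

```python
from pathlib import Path


def segment_corpus(corpus_root, seg_root, tokenize=None):
    """Segment every file under corpus_root/<category>/ into seg_root/<category>/.

    tokenize is any callable that maps a string to an iterable of tokens;
    it defaults to jieba.cut. Returns the number of files written.
    """
    if tokenize is None:
        import jieba  # third-party package, imported only when needed
        tokenize = jieba.cut

    corpus_root, seg_root = Path(corpus_root), Path(seg_root)
    written = 0
    for class_dir in sorted(p for p in corpus_root.iterdir() if p.is_dir()):
        out_dir = seg_root / class_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)  # create the category directory
        for src in sorted(p for p in class_dir.iterdir() if p.is_file()):
            # Assumption: input files are UTF-8 text.
            text = src.read_text(encoding="utf-8")
            # Join the tokens with single spaces, as the original script does.
            (out_dir / src.name).write_text(" ".join(tokenize(text)), encoding="utf-8")
            written += 1
    return written
```

Passing the tokenizer as an argument keeps the directory handling testable without jieba installed. For example, `segment_corpus(src, dst, tokenize=list)` splits each file into single characters.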

Output:

Building Trie..., from C:\Python27\lib\site-packages\jieba\dict.txt
loading model from cache c:\users\jackycaf\appdata\local\temp\jieba.cache
loading model cost 2.61299991608 seconds.
Trie has been built succesfully.
Corpus segmentation finished!!!

Category set after segmentation: the category set is the list of subdirectories under text_corpus_segment.

The category set is unchanged. Sample file after segmentation:

[The segmented Chinese sample text was lost in extraction. In what remains, every token, including punctuation, is separated by spaces, e.g. “ 122 ” where the unsegmented file had “122”.]
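Because corpus_segment.py writes the tokens joined by single spaces, a segmented file can be turned back into a token list with a plain whitespace split. The sketch below is written for Python 3, uses a hypothetical helper name, `count_tokens`, and shows the kind of per-document word count a bag-of-words step could start from. It is not taken from train_bags.py.

```python
from collections import Counter


def count_tokens(segmented_text):
    """Split space-joined segmented text and count each token."""
    # str.split() with no argument splits on any run of Unicode whitespace,
    # so full-width spaces (U+3000) in the text are treated as separators too.
    return Counter(segmented_text.split())
```

Note that punctuation marks are counted as tokens as well, since the segmenter emits them as separate items.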
