NLP4J [007]: Creating an Annotator That Uses kuromoji

Choosing between morphological analysis modules

By default (nlp4j-core), NLP4J uses the morphological analysis service of the Yahoo! Developer Network.

Text Analysis: Japanese Morphological Analysis - Yahoo! Developer Network
https://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html

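The Yahoo! Developer Network API is convenient because it can be called over HTTP, but it has a weakness: the number of requests is limited.
So this article creates a library that uses kuromoji, which can also run locally.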

Creating the Annotator

This time, nlp4j-kuromoji was created as a sub module of the nlp4j project.

nlp4j-kuromoji
https://github.com/oyahiroki/nlp4j/tree/master/nlp4j/nlp4j-kuromoji

The dependencies needed to use kuromoji are added to the Maven pom.xml.

<!-- https://mvnrepository.com/artifact/com.atilika.kuromoji/kuromoji -->
<dependency>
 <groupId>com.atilika.kuromoji</groupId>
 <artifactId>kuromoji</artifactId>
 <version>0.9.0</version>
 <type>pom</type>
</dependency>
<dependency>
 <groupId>com.atilika.kuromoji</groupId>
 <artifactId>kuromoji-ipadic</artifactId>
 <version>0.9.0</version>
</dependency>

Class Diagram

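The class diagram looks like this.
Both classes do essentially the same job as morphological analysis engines, so they are siblings.
Once an engine is implemented, callers no longer need to be aware of the differences, so this is probably the only time the kuromoji implementation details need attention.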

@startuml
nlp4j.DocumentAnnotator <|-- YJpMaAnnotator
nlp4j.DocumentAnnotator <|-- KuromojiAnnotator 
@enduml
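
The sibling relationship in the diagram can be sketched in plain Java. This is a minimal sketch, not NLP4J's actual API: `SimpleAnnotator`, `WhitespaceAnnotator`, and `CharacterAnnotator` are hypothetical stand-ins for `nlp4j.DocumentAnnotator`, `YJpMaAnnotator`, and `KuromojiAnnotator`, and the toy splitting logic only stands in for real morphological analysis. The point it illustrates is that calling code depends on the shared interface alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AnnotatorSiblingsSketch {

    /** Hypothetical stand-in for nlp4j.DocumentAnnotator (not the real interface). */
    interface SimpleAnnotator {
        List<String> annotate(String text);
    }

    /** Toy engine 1: splits the text on whitespace. */
    static class WhitespaceAnnotator implements SimpleAnnotator {
        @Override
        public List<String> annotate(String text) {
            String trimmed = text.trim();
            if (trimmed.isEmpty()) {
                return new ArrayList<>();
            }
            return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        }
    }

    /** Toy engine 2: emits one token per non-whitespace character. */
    static class CharacterAnnotator implements SimpleAnnotator {
        @Override
        public List<String> annotate(String text) {
            List<String> tokens = new ArrayList<>();
            for (char c : text.toCharArray()) {
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c));
                }
            }
            return tokens;
        }
    }

    /** The caller sees only the interface, so engines can be swapped freely. */
    static int countTokens(SimpleAnnotator annotator, String text) {
        return annotator.annotate(text).size();
    }

    public static void main(String[] args) {
        String text = "NLP4J wraps engines";
        System.out.println(countTokens(new WhitespaceAnnotator(), text));
        System.out.println(countTokens(new CharacterAnnotator(), text));
    }
}
```

Only the constructor call differs between the two lines in `main`; everything after it is written against the interface.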

Code

The class implements the nlp4j.DocumentAnnotator interface provided by NLP4J.
The tokens extracted by kuromoji are mapped to the keyword type that NLP4J provides.


package nlp4j.krmj.annotator;
import java.util.List;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import nlp4j.AbstractDocumentAnnotator;
import nlp4j.Document;
import nlp4j.DocumentAnnotator;
import nlp4j.impl.DefaultKeyword;

/**
 * Kuromoji Annotator
 * @author Hiroki Oya
 * @since 1.2
 */
public class KuromojiAnnotator extends AbstractDocumentAnnotator implements DocumentAnnotator {
    private static final Logger logger = LogManager.getLogger(KuromojiAnnotator.class);
    @Override
    public void annotate(Document doc) throws Exception {
        Tokenizer tokenizer = new Tokenizer(); // kuromoji tokenizer instance
        for (String target : targets) {
            Object obj = doc.getAttribute(target);
            if (!(obj instanceof String)) { // also false when obj is null
                continue;
            }
            String text = (String) obj;
            List<Token> tokens = tokenizer.tokenize(text);
            int sequence = 1;
            for (Token token : tokens) {
                logger.debug(token.getAllFeatures());
                DefaultKeyword kwd = new DefaultKeyword(); // new keyword for this token
                kwd.setLex(token.getBaseForm());
                kwd.setStr(token.getSurface());
                kwd.setReading(token.getReading());
                kwd.setBegin(token.getPosition());
                kwd.setEnd(token.getPosition() + token.getSurface().length());
                kwd.setFacet(token.getPartOfSpeechLevel1());
                kwd.setSequence(sequence);
                doc.addKeyword(kwd);
                sequence++;
            }
        }
    }
}

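Even for the same concept of "base form", kuromoji calls it baseForm while NLP4J calls it lex, so you can see that the terminology differs slightly between the two.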
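
The field mapping performed in `annotate` can be isolated as a small sketch. The classes below are hypothetical plain-Java stand-ins (`TokenInfo` for the values read from a kuromoji `Token`, `KeywordInfo` for the fields set on the NLP4J keyword); they are not the real library types. The sample values are those of the token 行き (base form 行く, reading イキ, part of speech 動詞) and are written as Unicode escapes so the file compiles regardless of source encoding.

```java
public class TokenToKeywordSketch {

    /** Hypothetical holder for the values read from a kuromoji Token. */
    static class TokenInfo {
        final String surface;            // text as it appears in the input
        final String baseForm;           // dictionary (base) form
        final String reading;
        final int position;              // start offset in the input text
        final String partOfSpeechLevel1;

        TokenInfo(String surface, String baseForm, String reading,
                  int position, String partOfSpeechLevel1) {
            this.surface = surface;
            this.baseForm = baseForm;
            this.reading = reading;
            this.position = position;
            this.partOfSpeechLevel1 = partOfSpeechLevel1;
        }
    }

    /** Hypothetical holder mirroring the keyword fields set by the annotator. */
    static class KeywordInfo {
        String lex;
        String str;
        String reading;
        int begin;
        int end;
        String facet;
        int sequence;
    }

    /** Applies the same renaming as the annotator: baseForm to lex, surface to str, and so on. */
    static KeywordInfo toKeyword(TokenInfo token, int sequence) {
        KeywordInfo kwd = new KeywordInfo();
        kwd.lex = token.baseForm;
        kwd.str = token.surface;
        kwd.reading = token.reading;
        kwd.begin = token.position;
        kwd.end = token.position + token.surface.length(); // end offset is exclusive
        kwd.facet = token.partOfSpeechLevel1;
        kwd.sequence = sequence;
        return kwd;
    }

    public static void main(String[] args) {
        // Surface, base form, reading, start offset, part of speech (see the lead-in).
        TokenInfo token = new TokenInfo(
                "\u884C\u304D", "\u884C\u304F", "\u30A4\u30AD", 5, "\u52D5\u8A5E");
        KeywordInfo kwd = toKeyword(token, 5);
        System.out.println("lex=" + kwd.lex + ", str=" + kwd.str
                + ", begin=" + kwd.begin + ", end=" + kwd.end);
    }
}
```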

Usage

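Apart from changing which Annotator class is specified, usage is the same as with the Yahoo! Developer Network.
In other words, NLP4J wraps two separate natural language processing engines, kuromoji and the Yahoo! Developer Network service, behind the same interface.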

    public void testAnnotateDocument001() throws Exception {
        // Natural-language text ("I went to school.")
        String text = "私は学校に行きました。";
        Document doc = new DefaultDocument();
        doc.putAttribute("text", text);
        KuromojiAnnotator annotator = new KuromojiAnnotator(); // change only this line to swap the module
        annotator.setProperty("target", "text");
        annotator.annotate(doc); // throws Exception
        System.err.println("Finished : annotation");
        for (Keyword kwd : doc.getKeywords()) {
            System.err.println(kwd);
        }
    }

Result

The result is shown below.
The natural language processing library could be used without being aware of its implementation.

Finished : annotation
私 [sequence=1, facet=名詞, lex=私, str=私, reading=ワタシ, count=-1, begin=0, end=1, correlation=0.0]
は [sequence=2, facet=助詞, lex=は, str=は, reading=ハ, count=-1, begin=1, end=2, correlation=0.0]
学校 [sequence=3, facet=名詞, lex=学校, str=学校, reading=ガッコウ, count=-1, begin=2, end=4, correlation=0.0]
に [sequence=4, facet=助詞, lex=に, str=に, reading=ニ, count=-1, begin=4, end=5, correlation=0.0]
行く [sequence=5, facet=動詞, lex=行く, str=行き, reading=イキ, count=-1, begin=5, end=7, correlation=0.0]
ます [sequence=6, facet=助動詞, lex=ます, str=まし, reading=マシ, count=-1, begin=7, end=9, correlation=0.0]
た [sequence=7, facet=助動詞, lex=た, str=た, reading=タ, count=-1, begin=9, end=10, correlation=0.0]
。 [sequence=8, facet=記号, lex=。, str=。, reading=。, count=-1, begin=10, end=11, correlation=0.0]
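
Because the annotator sets `end` to the token position plus the surface length, each `begin`/`end` pair in the output should index back into the original text. The following is a minimal sketch of that sanity check using the eight spans printed above; `OffsetCheckSketch` and `spanMatches` are hypothetical names, offsets are counted in Java `char` units, and the Japanese strings (the sample sentence 私は学校に行きました。 and its eight surface forms) are written as Unicode escapes so the file compiles regardless of source encoding.

```java
public class OffsetCheckSketch {

    /** Returns true when text.substring(begin, end) equals the given surface string. */
    static boolean spanMatches(String text, int begin, int end, String str) {
        if (begin < 0 || begin > end || end > text.length()) {
            return false; // offsets fall outside the text
        }
        return text.substring(begin, end).equals(str);
    }

    public static void main(String[] args) {
        // The sample sentence from the usage example (11 characters).
        String text = "\u79C1\u306F\u5B66\u6821\u306B\u884C\u304D\u307E\u3057\u305F\u3002";

        // begin/end pairs and str values copied from the printed result.
        int[][] spans = {
            {0, 1}, {1, 2}, {2, 4}, {4, 5}, {5, 7}, {7, 9}, {9, 10}, {10, 11}
        };
        String[] strs = {
            "\u79C1", "\u306F", "\u5B66\u6821", "\u306B",
            "\u884C\u304D", "\u307E\u3057", "\u305F", "\u3002"
        };

        boolean allMatch = true;
        for (int i = 0; i < spans.length; i++) {
            if (!spanMatches(text, spans[i][0], spans[i][1], strs[i])) {
                allMatch = false;
                System.out.println("Mismatch at sequence " + (i + 1));
            }
        }
        System.out.println("all spans match: " + allMatch);
    }
}
```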

Summary

With NLP4J, natural language processing in Java is easy!

Project URL

https://www.nlp4j.org/

Author And Source

Original article (NLP4J [007]: Creating an Annotator That Uses kuromoji): https://qiita.com/oyahiroki/items/ce351abed333fc7278e6
