[Elasticsearch] What the kuromoji analyzer can do, and how to configure it


I did not have a clear picture of what the kuromoji analyzer can do, so I read through the documentation and wrote down what I learned about its capabilities.

The documentation I am referring to is here:
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji.html

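What is the kuromoji analyzer?

It is a morphological analyzer for Japanese, that is, it segments Japanese text into words.
Text written in kanji, hiragana and so on is split into words by part of speech.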

As an example, let's compare how 「東京都の目黒区に行く」 is tokenized with and without the kuromoji analyzer.

The requests and outputs below were run in the Kibana Console.

Usage

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  },
  "text": ["東京都の目黒区に行く"]
}

Output without the kuromoji analyzer

{
  "tokens" : [
    {
      "token" : "東",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "京",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "都",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "の",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<HIRAGANA>",
      "position" : 3
    },
    {
      "token" : "目",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "黒",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "区",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "に",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<HIRAGANA>",
      "position" : 7
    },
    {
      "token" : "行",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "く",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<HIRAGANA>",
      "position" : 9
    }
  ]
}

→ The text is split one character at a time
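
The request that produced this output is not shown. Judging from the <IDEOGRAPHIC> and <HIRAGANA> token types, it was most likely the built-in standard tokenizer. Under that assumption, the request would look roughly like this:

```console
# Assumption: the "without kuromoji" output came from the standard tokenizer.
GET /_analyze
{
  "tokenizer": "standard",
  "text": ["東京都の目黒区に行く"]
}
```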

Output with the kuromoji analyzer

{
  "tokens" : [
    {
      "token" : "東京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "都",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "の",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "目黒",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "区",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "に",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "行く",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "word",
      "position" : 6
    }
  ]
}

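→ The text is split into meaningful units by part of speech

Result

With the kuromoji analyzer, the text was split into 「東京」「都」「の」「目黒」「区」「に」「行く」, units that make sense in Japanese.

The kuromoji analysis plugin has plenty of settings that change this output in various ways, so the rest of this post explains them.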
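
Besides the individual components covered below, the plugin also registers a ready-made analyzer named kuromoji. According to the documentation it combines kuromoji_tokenizer with the kuromoji_baseform, kuromoji_part_of_speech, cjk_width, ja_stop, kuromoji_stemmer and lowercase token filters. A minimal sketch for trying it from the Console:

```console
# Uses the plugin's built-in "kuromoji" analyzer instead of a bare tokenizer.
GET /_analyze
{
  "analyzer": "kuromoji",
  "text": ["東京都の目黒区に行く"]
}
```

Because several filters run after the tokenizer, the tokens returned here can differ from the tokenizer-only output shown above.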

kuromoji_tokenizer

Segments Japanese text into words.

Usage

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  },
  "text": ["東京都の目黒区に行く"]
}

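Settings

mode

Specifies how compound words (words formed by joining two or more words) and unknown words are handled.

normal: normal segmentation (compound words are not decompounded and unknown words are not split)
search: segmentation geared towards search; long nouns are decompounded
extended: unknown words are additionally split into unigrams (one character per token)

normal

Tokenizing 関西国際空港 in normal mode

Usage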

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "normal"
  },
  "text": ["関西国際空港", "アブラカダブラ"]
}

Output

{
  "tokens" : [
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "アブラカダブラ",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "word",
      "position" : 101
    }
  ]
}

In normal mode

The compound word (関西国際空港) is not split
The unknown word (アブラカダブラ) is not split either

search

Tokenizing 関西国際空港 in search mode

Usage

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search"
  },
  "text": ["関西国際空港", "アブラカダブラ"]
}

Output

{
  "tokens" : [
    {
      "token" : "関西",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "国際",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "空港",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "アブラカダブラ",
      "start_offset" : 7,
      "end_offset" : 14,
      "type" : "word",
      "position" : 103
    }
  ]
}

In search mode

The compound word (関西国際空港) is split
The unknown word (アブラカダブラ) is not split

extended

Usage

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "extended"
  },
  "text": ["関西国際空港", "アブラカダブラ"]
}

Output

{
  "tokens" : [
    {
      "token" : "関西",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "関西国際空港",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "国際",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "空港",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "ア",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 103
    },
    {
      "token" : "ブ",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 104
    },
    {
      "token" : "ラ",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 105
    },
    {
      "token" : "カ",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 106
    },
    {
      "token" : "ダ",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "word",
      "position" : 107
    },
    {
      "token" : "ブ",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "word",
      "position" : 108
    },
    {
      "token" : "ラ",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "word",
      "position" : 109
    }
  ]
}

In extended mode

The compound word (関西国際空港) is split
The unknown word (アブラカダブラ) is split one character at a time
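
The examples above pass the mode inline to _analyze. To apply a mode at index time, it can be declared as a custom tokenizer in the index settings. The following is only a sketch: the index name kuromoji_mode_sample, the tokenizer name kuromoji_search and the analyzer name my_analyzer are made-up names for illustration.

```console
# Hypothetical names: kuromoji_mode_sample, kuromoji_search, my_analyzer.
PUT /kuromoji_mode_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_search": {
            "type": "kuromoji_tokenizer",
            "mode": "search"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_search"
          }
        }
      }
    }
  }
}

# Run the analyzer defined above against the same sample text.
GET /kuromoji_mode_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["関西国際空港"]
}
```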

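discard_punctuation

Controls whether punctuation is discarded from the output. The default is true, so punctuation is dropped unless this is set to false.

Usage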

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "discard_punctuation": true
  },
  "text": ["明日、晴れですか。"]
}

Output with discard_punctuation set to false

{
  "tokens" : [
    {
      "token" : "明日",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "、",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "晴れ",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "です",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "か",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "。",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 5
    }
  ]
}

→ Punctuation is output

Output with discard_punctuation set to true

{
  "tokens" : [
    {
      "token" : "明日",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "晴れ",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "です",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "か",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 3
    }
  ]
}

→ Punctuation is not output
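
To keep 、 and 。 in the output, the option has to be turned off explicitly. A sketch of that request, identical to the one above except for the flag:

```console
# discard_punctuation: false keeps punctuation tokens in the output.
GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "discard_punctuation": false
  },
  "text": ["明日、晴れですか。"]
}
```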

user_dictionary

By default the MeCab-IPADIC dictionary is used. A user dictionary lets you add words to it.

The rules can also be written inline with user_dictionary_rules.

Usage

Place a file like the following at $ES_HOME/config/userdict_ja.txt:

東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "user_dictionary": "userdict_ja.txt"
  },
  "text": ["東京スカイツリー"]
}

Alternatively, the rules can be written inline without preparing a file.

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "user_dictionary_rules": ["東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"]
  },
  "text": ["東京スカイツリー"]
}

Output without user_dictionary_rules

{
  "tokens" : [
    {
      "token" : "東京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "スカイ",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ツリー",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    }
  ]
}

→ The text is split into 「東京, スカイ, ツリー」.

Output with user_dictionary_rules

{
  "tokens" : [
    {
      "token" : "東京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "スカイツリー",
      "start_offset" : 2,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    }
  ]
}

→ With user_dictionary_rules, 「スカイツリー」 became a single token
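
Each rule has four comma-separated fields: the surface text, its segmentation as space-separated tokens, the readings for those tokens, and a part-of-speech tag. To use the rules at index time rather than in a one-off _analyze call, they can go into the index settings. A sketch, in which kuromoji_userdict_sample, kuromoji_user_dict and my_analyzer are made-up names:

```console
# Hypothetical names: kuromoji_userdict_sample, kuromoji_user_dict, my_analyzer.
# Rule format: text,segmentation,readings,part-of-speech tag
PUT /kuromoji_userdict_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "user_dictionary_rules": [
              "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
```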

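kuromoji_iteration_mark (Character Filter)

Normalizes iteration marks (々, ヽ, ゝ and so on) by expanding them to the character they repeat.

Settings

normalize_kanji

Whether to normalize kanji iteration marks. Defaults to true.

normalize_kana

Whether to normalize kana iteration marks. Defaults to true.

Usage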

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  },
  "char_filter": [
    {
      "type": "kuromoji_iteration_mark"
    }
  ],
  "text": ["時々すゞめを見る"]
}

Output without kuromoji_iteration_mark

{
  "tokens" : [
    {
      "token" : "時々",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "すゞ",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "め",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "を",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "見る",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 4
    }
  ]
}

Output with kuromoji_iteration_mark

{
  "tokens" : [
    {
      "token" : "時時",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "すずめ",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "を",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "見る",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 3
    }
  ]
}

Result

With kuromoji_iteration_mark, the iteration marks were expanded:

時々 → 時時
すゞめ → すずめ
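
The two settings can be switched independently. For example, to expand only the kana iteration marks and leave 々 as it is, a request along these lines should work (a sketch; both options default to true):

```console
# normalize_kanji: false leaves 々 untouched; kana marks such as ゞ are still expanded.
GET /_analyze
{
  "tokenizer": "kuromoji_tokenizer",
  "char_filter": [
    {
      "type": "kuromoji_iteration_mark",
      "normalize_kanji": false,
      "normalize_kana": true
    }
  ],
  "text": ["時々すゞめを見る"]
}
```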

kuromoji_baseform (Token Filter)

Replaces verbs and adjectives in the text with their dictionary (base) form.

Usage

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  },
  "filter": [
    "kuromoji_baseform"
  ],
  "text": ["楽しく飲み会に参加"]
}

Output without kuromoji_baseform

{
  "tokens" : [
    {
      "token" : "楽しく",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "飲み",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "会",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "に",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "参加",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    }
  ]
}

Output with kuromoji_baseform

{
  "tokens" : [
    {
      "token" : "楽しい",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "飲む",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "会",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "に",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "参加",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    }
  ]
}

Result

With kuromoji_baseform, the inflected forms were replaced by their base forms:

楽しく → 楽しい
飲み → 飲む
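
To see more detail about each token at each stage, the _analyze API accepts an explain flag. With it, the response is broken down per tokenizer and per filter and includes extra token attributes. The exact set of attributes depends on the version, so treat this as a sketch for exploration rather than a fixed recipe:

```console
# explain: true returns a detailed, per-stage breakdown of the analysis.
GET /_analyze
{
  "tokenizer": "kuromoji_tokenizer",
  "filter": ["kuromoji_baseform"],
  "explain": true,
  "text": ["楽しく飲み会に参加"]
}
```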

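kuromoji_part_of_speech (Token Filter)

Removes tokens whose part of speech is not wanted in the output.
By default, tokens whose part-of-speech tag is listed in stoptags.txt are removed.

Settings

stoptags

An array of part-of-speech tags to remove. If it is omitted, the default stoptags.txt list is used.

Usage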

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  },
  "filter": [
    "kuromoji_part_of_speech"
  ],
  "text": ["お寿司っておいしいね"]
}

Output without kuromoji_part_of_speech

{
  "tokens" : [
    {
      "token" : "お",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "寿司",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "って",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "おいしい",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ね",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 4
    }
  ]
}

Output with kuromoji_part_of_speech

{
  "tokens" : [
    {
      "token" : "お",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "寿司",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "おいしい",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    }
  ]
}

Result

The particles 「って」 and 「ね」 are no longer output, while tokens such as the noun and the adjective remain.
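
A custom tag list can be passed through stoptags. The two tags below are the ones used in the example in the official documentation (general case particles and sentence-final particles). Whether they are the right choice for a given text is something to check by running the request:

```console
# stoptags values taken from the official documentation's example.
GET /_analyze
{
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    {
      "type": "kuromoji_part_of_speech",
      "stoptags": ["助詞-格助詞-一般", "助詞-終助詞"]
    }
  ],
  "text": ["お寿司っておいしいね"]
}
```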

kuromoji_readingform (Token Filter)

Converts tokens to their reading, in either katakana or romaji.

Settings

use_romaji

Outputs the reading in romaji instead of katakana. Defaults to false.

Usage

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  },
  "filter": [
    {
      "type": "kuromoji_readingform",
      "use_romaji": false
    }
  ],
  "text": ["お寿司上手い"]
}

Output when use_romaji is false

{
  "tokens" : [
    {
      "token" : "オ",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "スシ",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ウマイ",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    }
  ]
}

→ The reading is output in katakana

Output when use_romaji is true

{
  "tokens" : [
    {
      "token" : "o",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "sushi",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "umai",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    }
  ]
}

→ The reading is output in romaji
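
Only the request with the flag set to false is shown above. The romaji output corresponds to the same request with the flag flipped:

```console
# use_romaji: true switches the reading from katakana to romaji.
GET /_analyze
{
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    {
      "type": "kuromoji_readingform",
      "use_romaji": true
    }
  ],
  "text": ["お寿司上手い"]
}
```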

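kuromoji_stemmer (Token Filter)

Removes the trailing katakana prolonged sound mark (the 「ー」 at the end of 「プリンター」). By default this applies to words of four characters or more.

Settings

minimum_length

Katakana words shorter than this length are left unchanged, so their prolonged sound mark is kept. The default is 4.

Usage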

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  },
  "filter": [
    {
      "type": "kuromoji_stemmer"
    }
  ],
  "text": ["コーヒー"]
}

Output

{
  "tokens" : [
    {
      "token" : "コーヒ",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
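
minimum_length is set like any other filter option. In this sketch it is raised to 5. Since words shorter than minimum_length are not stemmed, the four-character コーヒー should then keep its trailing ー, while a longer katakana word ending in ー should still lose it. The value 5 and the second sample word are arbitrary choices for illustration.

```console
# minimum_length: 5 is an arbitrary value chosen for this sketch.
GET /_analyze
{
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    {
      "type": "kuromoji_stemmer",
      "minimum_length": 5
    }
  ],
  "text": ["コーヒー", "コンピューター"]
}
```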

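ja_stop (Token Filter)

Removes the words defined as stop words from the output.

■ Stop words
Words grouped together as not worth analyzing: words that carry little information (typically very common ones) and words that are irrelevant to the task.

Settings

stopwords

Specifies the words to remove.
By default, the words in the built-in _japanese_ stop word list are removed.

Usage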

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  },
  "filter": [
    {
      "type": "ja_stop"
    }
  ],
  "text": ["ここのストップは消えるなど"]
}

Output without ja_stop

{
  "tokens" : [
    {
      "token" : "ここ",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "の",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ストップ",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "は",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "消える",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "など",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 5
    }
  ]
}

Output with ja_stop

{
  "tokens" : [
    {
      "token" : "ストップ",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "消える",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 4
    }
  ]
}

Result

Words that seem to carry little information, such as 「ここ」「の」「は」「など」, are no longer output.
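
Custom stop words can be added on top of the defaults. In the sketch below, _japanese_ refers to the built-in Japanese list and ストップ is an extra word added purely as an example, so ストップ should also disappear from the output:

```console
# "_japanese_" keeps the default list; "ストップ" is an example custom stop word.
GET /_analyze
{
  "tokenizer": "kuromoji_tokenizer",
  "filter": [
    {
      "type": "ja_stop",
      "stopwords": ["_japanese_", "ストップ"]
    }
  ],
  "text": ["ここのストップは消えるなど"]
}
```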

kuromoji_number (Token Filter)

Converts kanji numerals to half-width (ASCII) digits.

Usage

GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer"
  },
  "filter": [
    {
      "type": "kuromoji_number"
    }
  ],
  "text": ["六〇百"]
}

Output without kuromoji_number

{
  "tokens" : [
    {
      "token" : "六",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "〇",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "百",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    }
  ]
}

Output with kuromoji_number

{
  "tokens" : [
    {
      "token" : "6000",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}

Result

六〇百 was output as 6000.
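
As I understand the documentation of the underlying Lucene filter, numbers that contain punctuation (a decimal point or a thousands separator) are only normalized correctly if the tokenizer has not already discarded that punctuation. A sketch that sets discard_punctuation to false for that reason; the sample text is my own choice and the output is not shown because it has not been verified:

```console
# Assumption: punctuation inside numbers must reach the filter, hence discard_punctuation: false.
GET /_analyze
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "discard_punctuation": false
  },
  "filter": ["kuromoji_number"],
  "text": ["三千二百円"]
}
```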

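Closing

That covers everything the kuromoji analyzer can do as described in this documentation.

Configuring these settings where appropriate should reduce spelling and notation variants, which in turn should improve search accuracy.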
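
As a starting point, the pieces covered in this post can be combined into one custom analyzer. This is a sketch under assumptions: the index name kuromoji_full_sample and the analyzer name my_ja_analyzer are made up, and the filter order is just one plausible choice, not a recommendation from the documentation.

```console
# Hypothetical names: kuromoji_full_sample, my_ja_analyzer.
PUT /kuromoji_full_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_ja_analyzer": {
            "type": "custom",
            "char_filter": ["kuromoji_iteration_mark"],
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform",
              "kuromoji_part_of_speech",
              "ja_stop",
              "kuromoji_number",
              "kuromoji_stemmer"
            ]
          }
        }
      }
    }
  }
}

# Check the combined result on one of the sample sentences from this post.
GET /kuromoji_full_sample/_analyze
{
  "analyzer": "my_ja_analyzer",
  "text": ["時々すゞめを見る"]
}
```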

Author And Source

Source for this article (【Elasticsearch】kuromoji analyzerで出来ることと設定の解説): https://qiita.com/hatsu/items/dacbbba02d72947df435

Attribution: information about the original author is available at the original URL. Copyright belongs to the original author.
