How does Paoding segment words based on its dictionary?


Tags: word segmentation, Chinese word segmentation, paoding, granularity, Arrays.sort()

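The previous post introduced the data structures behind Paoding's dictionary. This post looks at how Paoding segments text into words. When Paoding looks up a word, it goes through one of two dictionary types: BinaryDictionary or HashBinaryDictionary. Both structures were covered last time, so I will not repeat the details here.
A HashBinaryDictionary splits a large dictionary into smaller sub-dictionaries, and those are ultimately stored as BinaryDictionary instances. Reading HashBinaryDictionary's search method shows that it works recursively and eventually ends up in a BinaryDictionary. The code makes this clear: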


public Hit search(CharSequence input, int begin, int count) {
		SubDictionaryWrap subDic = (SubDictionaryWrap) subs.get(keyOf(input
				.charAt(hashIndex + begin)));
		if (subDic == null) {
			return Hit.UNDEFINED;
		}
		Dictionary dic = subDic.dic;
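		// Case 1: count == hashIndex + 1. The input is no longer than the hashed
		// prefix, so compare it against the sub-dictionary's first word (header)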
		if (count == hashIndex + 1) {
			Word header = dic.get(0);
			if (header.length() == hashIndex + 1) {
				if (subDic.wordIndexOffset + 1 < this.ascWords.length) {
					return new Hit(subDic.wordIndexOffset, header,
							this.ascWords[subDic.wordIndexOffset + 1]);
				} else {
					return new Hit(subDic.wordIndexOffset, header, null);
				}
			} else {
				return new Hit(Hit.UNCLOSED_INDEX, null, header);
			}
		}
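		// Case 2: count > hashIndex + 1. Delegate to the sub-dictionary, then shift
		// the hit's index back into this dictionary's numbering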
		Hit word = dic.search(input, begin, count);
		if (word.isHit()) {
			int index = subDic.wordIndexOffset + word.getIndex();
			word.setIndex(index);
			if (word.getNext() == null && index < size()) {
				word.setNext(get(index + 1));
			}
		}
		return word;
	}

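As the code shows, search is indeed called recursively through dic.search (the structure itself is also built recursively).
BinaryDictionary's search method uses binary search (why binary search is usable here is explained further on). The code is as follows: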
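The delegation idea can be illustrated with a much-simplified sketch. This is not Paoding's code: the class `FirstCharPartitionedDictionary`, its `contains` method, and the single-level split on the first character are assumptions made for illustration, and toy Latin-letter words stand in for Chinese words. The real HashBinaryDictionary can nest further levels through hashIndex and returns Hit objects rather than a boolean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a hash-partitioned dictionary.
class FirstCharPartitionedDictionary {
    // One sorted sub-array per first character (the "sub-dictionaries").
    private final Map<Character, String[]> subs = new HashMap<>();

    FirstCharPartitionedDictionary(String[] words) {
        Map<Character, List<String>> buckets = new HashMap<>();
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // nothing to hash on
            }
            buckets.computeIfAbsent(w.charAt(0), k -> new ArrayList<>()).add(w);
        }
        for (Map.Entry<Character, List<String>> e : buckets.entrySet()) {
            String[] sub = e.getValue().toArray(new String[0]);
            Arrays.sort(sub); // each sub-dictionary must be sorted for binary search
            subs.put(e.getKey(), sub);
        }
    }

    // Step 1: pick the sub-dictionary by first character.
    // Step 2: binary search inside that (much smaller) sorted array.
    boolean contains(String word) {
        if (word.isEmpty()) {
            return false;
        }
        String[] sub = subs.get(word.charAt(0));
        return sub != null && Arrays.binarySearch(sub, word) >= 0;
    }

    public static void main(String[] args) {
        FirstCharPartitionedDictionary dic = new FirstCharPartitionedDictionary(
                new String[] {"cart", "dog", "car", "card"});
        System.out.println(dic.contains("car")); // true
        System.out.println(dic.contains("ca"));  // false: only a prefix, not a word
    }
}
```

The point of the split is that each lookup only has to binary-search the words sharing its first character instead of the whole dictionary.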


public Hit search(CharSequence input, int begin, int count) {
		int left = this.start;
		int right = this.end - 1;
		int pointer = 0;
		Word word = null;
		int relation;
		// Binary search over the ascending word array
		while (left <= right) {
			pointer = (left + right) >> 1;
			word = ascWords[pointer];
			relation = compare(input, begin, count, word);
			if (relation == 0) {
				// System.out.println(new String(input,begin, count)+"***" +
				// word);
				int nextWordIndex = pointer + 1;
				if (nextWordIndex >= ascWords.length) {
					return new Hit(pointer, word, null);
				}
				else {
					return new Hit(pointer, word, ascWords[nextWordIndex]);
				}
			}
			if (relation < 0)
				right = pointer - 1;
			else
				left = pointer + 1;
		}
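		// No exact match. left is now the insertion point; if it is past the
		// end of the array, no word can start with the input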
		if (left >= ascWords.length) {
			return Hit.UNDEFINED;
		}
		// Otherwise check whether the input is a prefix of the word at the insertion point
		boolean asPrex = true;
		Word nextWord = ascWords[left];
		if (nextWord.length() < count) {
			asPrex = false;
		}
		for (int i = begin, j = 0; asPrex && j < count; i++, j++) {
			if (input.charAt(i) != nextWord.charAt(j)) {
				asPrex = false;
			}
		}
		return asPrex ? new Hit(Hit.UNCLOSED_INDEX, null, nextWord) : Hit.UNDEFINED;
	}
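
The three possible outcomes of this lookup (exact hit, prefix of a longer word, no match) can be reproduced in a small stand-alone sketch. The class `PrefixSearchSketch` and the `Status` enum are hypothetical names, not Paoding's API, and the Latin-letter words are toy data standing in for Chinese words.

```java
import java.util.Arrays;

// Hypothetical sketch of a three-way dictionary lookup; not Paoding's classes.
class PrefixSearchSketch {
    enum Status { HIT, UNCLOSED, UNDEFINED }

    // words must already be sorted ascending (for example by Arrays.sort).
    static Status search(String[] words, String input) {
        int left = 0;
        int right = words.length - 1;
        while (left <= right) {
            int mid = (left + right) >>> 1; // unsigned shift avoids int overflow
            int relation = input.compareTo(words[mid]);
            if (relation == 0) {
                return Status.HIT;
            }
            if (relation < 0) {
                right = mid - 1;
            } else {
                left = mid + 1;
            }
        }
        // No exact match: left is the index of the first word greater than input.
        // If any word starts with input, this one does, because words sharing
        // a prefix sit next to each other in sorted order.
        if (left < words.length && words[left].startsWith(input)) {
            return Status.UNCLOSED;
        }
        return Status.UNDEFINED;
    }

    public static void main(String[] args) {
        String[] words = {"cart", "car", "dog", "card"};
        Arrays.sort(words); // car, card, cart, dog
        System.out.println(search(words, "car")); // HIT
        System.out.println(search(words, "ca"));  // UNCLOSED
        System.out.println(search(words, "cow")); // UNDEFINED
    }
}
```

An UNCLOSED result is what lets a segmenter keep extending the candidate, while UNDEFINED tells it to stop early.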

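I have to admire Paoding's author here: this structure is in no way inferior to the data structure of the Chinese Academy of Sciences dictionary. The Chinese Academy of Sciences dictionary is sorted by hand, whereas Paoding's dictionary files are not sorted at all. So why can Paoding use a lookup method so similar to the Chinese Academy of Sciences one?
The main reason is the following call, made when the dictionary data is built: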

Arrays.sort(array);

The Javadoc for this method reads:
Sorts the specified array of objects into ascending order, according to the natural ordering of its elements. All elements in the array must implement the Comparable interface. Furthermore, all elements in the array must be mutually comparable (that is, e1.compareTo(e2) must not throw a ClassCastException for any elements e1 and e2 in the array). This sort is guaranteed to be stable: equal elements will not be reordered as a result of the sort. The sorting algorithm is a modified mergesort (in which the merge is omitted if the highest element in the low sublist is less than the lowest element in the high sublist). This algorithm offers guaranteed n*log(n) performance.
To settle my doubts, I also ran a corresponding test.


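import java.util.Arrays;
import java.util.HashSet;

public class TestArraySort {
	public static void main(String[] args) {
		// A HashSet keeps no particular order, like an unsorted dictionary file
		HashSet<String> set = new HashSet<String>();
		set.add("三心二意");
		set.add("五谷丰登");
		set.add("六六大顺");
		set.add("三个人");
		set.add("五个人");
		set.add("六个人");
		Object[] array = set.toArray();
		// Sort by the natural ordering of String
		Arrays.sort(array);
		for (int i = 0; i < array.length; i++) {
			System.out.println(array[i]);
		}
	}
}

The result is as follows:
三个人
三心二意
五个人
五谷丰登
六个人
六六大顺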


public class TestCharactor {
	public static void main(String[] args)  {
		// Assigning a char to an int yields its numeric (UTF-16) value
		int c1 = '三';
		int c2 = '五';
		
		System.out.println("The category of c1 is: " + c1);
		System.out.println("The category of c2 is: " + c2);
	}
}

The result is as follows:
The category of c1 is: 19977
The category of c2 is: 20116


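public class TestCharactor {
	public static void main(String[] args)  {
		// Second characters of two words that share the same first character
		int c1 = '个';
		int c2 = '心';
		
		System.out.println("The category of c1 is: " + c1);
		System.out.println("The category of c2 is: " + c2);
	}
}

The result is as follows:
The category of c1 is: 20010
The category of c2 is: 24515
This is enough to show that sort really does arrange the dictionary's words in ascending order of the numeric values of their characters, which is what lays the groundwork for binary search.
When segmenting, Paoding uses a double loop, which lets it cut out as many words as possible. The outer loop acts as a cursor that scans the text to be segmented one character at a time. The inner loop tries to form every possible word starting at the cursor and hands each word it cuts out to a dedicated class that collects words. This is the so-called fine-grained segmentation.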
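The same dependency between sorting and binary search can be shown with the standard library alone. This is a minimal sketch, not Paoding code: the class name `SortThenSearchSketch` and the Latin-letter sample words are assumptions chosen so the ordering by character value is easy to see.

```java
import java.util.Arrays;

// Hypothetical demo class; only java.util.Arrays is real API here.
class SortThenSearchSketch {
    // Returns a sorted copy so the caller's array is left untouched.
    static String[] sortedCopy(String[] words) {
        String[] copy = Arrays.copyOf(words, words.length);
        Arrays.sort(copy); // natural String ordering compares char values
        return copy;
    }

    public static void main(String[] args) {
        // Deliberately unsorted, like words read from an unordered dictionary file.
        String[] words = {"pear", "Zebra", "apple", "fig"};
        String[] sorted = sortedCopy(words);
        // 'Z' (90) is smaller than 'a' (97), so "Zebra" comes first.
        System.out.println(Arrays.toString(sorted)); // [Zebra, apple, fig, pear]
        // Binary search is only meaningful on the sorted array.
        System.out.println(Arrays.binarySearch(sorted, "fig")); // 2
    }
}
```

The ordering is purely numeric, not linguistic, which is all binary search needs as long as lookups use the same comparison.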
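The double loop can be sketched as follows. This is an illustration under assumptions, not Paoding's implementation: the class `DoubleLoopSegmenterSketch`, the `segment` method, and the plain list used as the collector are hypothetical, and a Latin-letter toy dictionary stands in for Chinese words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of fine-grained segmentation with a double loop.
class DoubleLoopSegmenterSketch {
    // sortedWords must be sorted ascending (for example by Arrays.sort).
    static List<String> segment(String text, String[] sortedWords) {
        List<String> collected = new ArrayList<>(); // stands in for the collector class
        // Outer loop: a cursor that moves over the text one character at a time.
        for (int begin = 0; begin < text.length(); begin++) {
            // Inner loop: grow the candidate from the cursor one character at a time.
            for (int end = begin + 1; end <= text.length(); end++) {
                String candidate = text.substring(begin, end);
                int pos = Arrays.binarySearch(sortedWords, candidate);
                if (pos >= 0) {
                    collected.add(candidate); // every dictionary word is emitted
                    continue;                 // a longer word may still follow
                }
                // Not a word: keep growing only if some word starts with candidate.
                int insertion = -pos - 1;
                boolean isPrefix = insertion < sortedWords.length
                        && sortedWords[insertion].startsWith(candidate);
                if (!isPrefix) {
                    break;
                }
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        String[] dictionary = {"ton", "car", "carton", "art", "on", "cart"};
        Arrays.sort(dictionary);
        // Prints [car, cart, carton, art, ton, on]
        System.out.println(segment("carton", dictionary));
    }
}
```

Overlapping words such as "car", "cart" and "carton" are all emitted, which is what makes the result fine-grained rather than a single longest match.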
