英語の語幹抽出アルゴリズム-Lovins,Porter

11164 ワード

アルゴリズム自然言語 nlp

英語の語幹抽出には様々な方法がありますが、実際には機械学習のデータ発掘など多岐にわたる内容があります.ここで主に紹介されているのは、実現しやすいいくつかのオリジナルアルゴリズムです.

Lovins(1968)

Porter(1980)

Porter 2(2000)

1.Lovins
Lovinsは一番早い実現です.
1.1.概要
アルゴリズムは以下の構成要素に関する.

ending、語の接尾辞、全部で294つあって、詳しいリストは最後の

を見ます.

condition、語の接尾辞は条件を取り除いて、すべてのendingは1つのconditionに対応して、共に29つあって、詳しいリストは最後の

を見ます.

transformation、endingの方式を変換して、全部で35個あります.詳しいリストは最後の

をご覧ください.
アルゴリズムは二つの部分に分けられます.

ペアの英語の単語は、endingリストによると、endingに従って長さから短さまでスキャンして、最初にconditionに合うending

を見つけます.

残りのstemアプリケーションtransformationにより、endingを適切な形に変換する

.
1.2.例
第一歩
英語の単語はnationnallyで、endlingリストに従って、長さから短走査まで、まず.09. ationally Bを見つけました.対応する規則はB Minimum stem length = 3です.endingを除去することを要求した後、残りの部分の長さは3 n ationallyを除いて、conditionに合わないです.
endingをスキャンし続けて、.07. ionally Aが見つかりました.対応する規則はA No restrictions on stemで、何の制約もありません.そこで最終的にionallyをendingとして選定した.
第二のステップ
英語のnationlyのstemはnalで、tranformationを探してみましたが、該当するtransformationがないので、変換せずに直接出力します.例えばもう一つの単語がsittingで、第一歩がstemになるのはsittで、第二歩はここで第一条のtransformationを適用して、最終的にsitを出力します.
1.Apppendix.A endingsリスト

.11.
alistically B   arizability A   izationally B

.10.
antialness A    arisations A    arizations A    entialness A

.09.
allically C     antaneous A     antiality A     arisation A
arization A     ationally B     ativeness A     eableness E
entations A     entiality A     entialize A     entiation A
ionalness A     istically A     itousness A     izability A
izational A

.08.
ableness A      arizable A      entation A      entially A
eousness A      ibleness A      icalness A      ionalism A
ionality A      ionalize A      iousness A      izations A
lessness A

.07.
ability A       aically A       alistic B       alities A
ariness E       aristic A       arizing A       ateness A
atingly A       ational B       atively A       ativism A
elihood E       encible A       entally A       entials A
entiate A       entness A       fulness A       ibility A
icalism A       icalist A       icality A       icalize A
ication G       icianry A       ination A       ingness A
ionally A       isation A       ishness A       istical A
iteness A       iveness A       ivistic A       ivities A
ization F       izement A       oidally A       ousness A

.06.
aceous A        acious B        action G        alness A
ancial A        ancies A        ancing B        ariser A
arized A        arizer A        atable A        ations B
atives A        eature Z        efully A        encies A
encing A        ential A        enting C        entist A
eously A        ialist A        iality A        ialize A
ically A        icance A        icians A        icists A
ifully A        ionals A        ionate D        ioning A
ionist A        iously A        istics A        izable E
lessly A        nesses A        oidism A

.05.
acies A         acity A         aging B         aical A
alist A         alism B         ality A         alize A
allic BB        anced B         ances B         antic C
arial A         aries A         arily A         arity B
arize A         aroid A         ately A         ating I
ation B         ative A         ators A         atory A
ature E         early Y         ehood A         eless A
elity A         ement A         enced A         ences A
eness E         ening E         ental A         ented C
ently A         fully A         ially A         icant A
ician A         icide A         icism A         icist A
icity A         idine I         iedly A         ihood A
inate A         iness A         ingly B         inism J
inity CC        ional A         ioned A         ished A
istic A         ities A         itous A         ively A
ivity A         izers F         izing F         oidal A
oides A         otide A         ously A

.04.
able A          ably A          ages B          ally B
ance B          ancy B          ants B          aric A
arly K          ated I          ates A          atic B
ator A          ealy Y          edly E          eful A
eity A          ence A          ency A          ened E
enly E          eous A          hood A          ials A
ians A          ible A          ibly A          ical A
ides L          iers A          iful A          ines M
ings N          ions B          ious A          isms B
ists A          itic H          ized F          izer F
less A          lily A          ness A          ogen A
ward A          wise A          ying B          yish A

.03.
acy A           age B           aic A           als BB
ant B           ars O           ary F           ata A
ate A           eal Y           ear Y           ely E
ene E           ent C           ery E           ese A
ful A           ial A           ian A           ics A
ide L           ied A           ier A           ies P
ily A           ine M           ing N           ion Q
ish C           ism B           ist A           ite AA
ity A           ium A           ive A           ize F
oid A           one R           ous A

.02.
ae A            al BB           ar X            as B
ed E            en F            es E            ia A
ic A            is A            ly B            on S
or T            um U            us V            yl R
s' A            's A

.01.
a A             e A             i A             o A
s W             y B

1.Apppendix.B conditionsリスト

A   No restrictions on stem
B   Minimum stem length = 3
C   Minimum stem length = 4
D   Minimum stem length = 5
E   Do not remove ending after e
F   Minimum stem length = 3 and do not remove ending after e
G   Minimum stem length = 3 and remove ending only after f
H   Remove ending only after t or ll
I   Do not remove ending after o or e
J   Do not remove ending after a or e
K   Minimum stem length = 3 and remove ending only after l, i or u*e
L   Do not remove ending after u, x or s, unless s follows o
M   Do not remove ending after a, c, e or m
N   Minimum stem length = 4 after s**, elsewhere = 3
O   Remove ending only after l or i
P   Do not remove ending after c
Q   Minimum stem length = 3 and do not remove ending after l or n
R   Remove ending only after n or r
S   Remove ending only after dr or t, unless t follows t
T   Remove ending only after s or t, unless t follows o
U   Remove ending only after l, m, n or r
V   Remove ending only after c
W   Do not remove ending after s or u
X   Remove ending only after l, i or u*e
Y   Remove ending only after in
Z   Do not remove ending after f
AA  Remove ending only after d, f, ph, th, l, er, or, es or t
BB  Minimum stem length = 3 and do not remove ending after met or ryst
CC  Remove ending only after l

1.Apppendix.C.transformationsリスト

1   remove one of double b, d, g, l, m, n, p, r, s, t
2   iev   ->   ief
3   uct   ->   uc
4   umpt  ->   um
5   rpt   ->   rb
6   urs   ->   ur
7   istr  ->   ister
7a  metr  ->   meter
8   olv   ->   olut
9   ul    ->   l except following a, o, i
10  bex   ->   bic
11  dex   ->   dic
12  pex   ->   pic
13  tex   ->   tic
14  ax    ->   ac
15  ex    ->   ec
16  ix    ->   ic
17  lux   ->   luc
18  uad   ->   uas
19  vad   ->   vas
20  cid   ->   cis
21  lid   ->   lis
22  erid  ->   eris
23  pand  ->   pans
24  end   ->   ens except following s
25  ond   ->   ons
26  lud   ->   lus
27  rud   ->   rus
28  her   ->   hes except following p, t
29  mit   ->   mis
30  ent   ->   ens except following m
31  ert   ->   ers
32  et    ->   es except following n
33  yt    ->   ys
34  yz    ->   ys

2.Porter
2.1.概要
母音と子音
母音子音と共通の定義は少し違っています.

母音(Vowel)-A E I O U、および子音の後のY

子音(Consontant)-A E I O Uのほかに、母音の後のY

単語のグループ化
連続的な母音は母音グループVと見なされ、連続的な子音は子音グループCと見なされるので、いずれの単語もVCインタリーブの形式として表し得る.

segmentfault -> s/e/gm/e/ntf/au/lt -> CVCVCVC
porter -> p/o/rt/e/r -> CVCVC
application -> a/ppl/i/c/a/t/io/n -> VCVCVCVC
apple -> a/ppl/e -> V/C/V

総合的には、VCグループとして表示され得る形式は、$C^m[V]$のうち、パラメータmはLovinのconditionのstem長に類似しており、その後の判断に用いられる.
ルール
Porterアルゴリズムはruleを主とし、ruleの形式は:

(condition) S1 -> S2

conditionはS 1のstemを除去する役割を果たしています.m以外にも他の特徴があります.

m-VCグループの数を表す

*-任意の文字を表し、サブストリング、v、d、oと組み合わせて

を使用します.

大文字-サブストリングを表す

v-母音文字を表す

d-2つの同じ子音を表す

o-はcvcを表しています.ここで第二のcはW、X、Y

ではありません.
S 1は語のサフィックスであり、S 2の変化後のサフィックスである.
Lovinとは違って、1つの語は複数の規則の直列処理を経て、例えばhoppingのような出力対象語(Lovinは1回限りの出力)を出力し、まずアプリケーション規則(*v*) ING ->を適用して、hoppにしてからアプリケーション規則(*d and not (*L or *S or *Z)) -> single letterを適用して、hoppからhopに変更する.
プロセス
全体のアルゴリズムは、上から下に規則を適用し、いくつかの規則が特殊であり、もしトリガされたら追加の規則がたくさんあります.したがって、規則をグループ化します.ここでのパケットは論理的に区別されます.

ドStep_1 a

ドStep_1 b(step 2 b.2 or step 2 b.3に命中した場合は、いくつかの追加作業を行う)

ドStep_1 c

ドStep_2

ドStep_3

ドStep_4

ドStep_5 a

ドStep_5 b

各Stepの詳細は付録を参照してください.
2.2.例
2.Apppendix Step 1 a

      SSES  ->   SS
      IES   ->   I
      SS    ->   SS
      S     ->

2.Apppendix Step 1 b

(m>0) EED     ->   EE
(*v*) ED      ->
(*v*) ING     ->

If the second or third of the rules in Step 1b is successful, the following is done:

      AT      ->   ATE
      BL      ->   BLE
      IZ      ->   IZE
      (*d and not (*L or *S or *Z)) -> single letter
      (m=1 and *o)  ->   E

2.Apppendix Step 1 c

(*v*) Y       ->   I

2.Apppendix Step 2

(m>0) ATIONAL ->   ATE
(m>0) TIONAL  ->   TION
(m>0) ENCI    ->   ENCE
(m>0) ANCI    ->   ANCE
(m>0) IZER    ->   IZE
(m>0) ABLI    ->   ABLE
(m>0) ALLI    ->   AL
(m>0) ENTLI   ->   ENT
(m>0) ELI     ->   E
(m>0) OUSLI   ->   OUS
(m>0) IZATION ->   IZE
(m>0) ATION   ->   ATE
(m>0) ATOR    ->   ATE
(m>0) ALISM   ->   AL
(m>0) IVENESS ->   IVE
(m>0) FULNESS ->   FUL
(m>0) OUSNESS ->   OUS
(m>0) ALITI   ->   AL
(m>0) IVITI   ->   IVE
(m>0) BILITI  ->   BLE

2.Apppendix Step 3

(m>0) ICATE   ->   IC
(m>0) ATIVE   ->
(m>0) ALIZE   ->   AL
(m>0) ICITI   ->   IC
(m>0) ICAL    ->   IC
(m>0) FUL     ->
(m>0) NESS    ->

2.Apppendix Step 4

(m>1) AL      ->
(m>1) ANCE    ->
(m>1) ENCE    ->
(m>1) ER      ->
(m>1) IC      ->
(m>1) ABLE    ->
(m>1) IBLE    ->
(m>1) ANT     ->
(m>1) EMENT   ->
(m>1) MENT    ->
(m>1) ENT     ->
(m>1 and (*S or *T)) ION   ->
(m>1) OU      ->
(m>1) ISM     ->
(m>1) ATE     ->
(m>1) ITI     ->
(m>1) OUS     ->
(m>1) IVE     ->
(m>1) IZE     ->

2.Apppendix Step 5 a

(m>1) E   ->
(m=1 and not *o) E   ->

2.Apppendix Step 5 b

(m > 1 and *d and *L)   ->   single letter

並べ替え、並べ替え、泡並べ替えテストの挿入

5つの一般的な古典的アルゴリズムのコードテンプレート