edit distance編集距離

7070 ワード

Reference:Dynamic Programming Algorithm(DPA)for Edit-Distance編集距離2つの文字列s 1,s 2の違いについては、彼らの最小編集距離を計算することによって決定することができる.編集距離とは、s 1とs 2を同じ文字列にするには、以下の操作の最小回数が必要です.1.ある文字ch 1をch 22にする.文字3を削除します.たとえば、s 1="12433"およびs 2="1233"などの文字を挿入します.s 2の間に4を挿入することによって12433がs 1と一致することができる.すなわちd(s 1,s 2)=1(1回の挿入操作を行った)編集距離の性質2つの文字列s 1+ch 1を計算し、s 2+ch 2の編集距離にはこのような性質がある.         d(s1,””) = d(“”,s1) = |s1|    d(“ch1”,”ch2”) = ch1 == ch2 ? 0 : 1;2.         d(s1+ch1,s2+ch2) = min(     d(s1,s2)+ ch1==ch2 ? 0 : 1 ,d(s1+ch1,s2),d(s1,s2+ch2)  );最初の性質は明らかだ.2つ目の性質:私たちが定義した3つの操作によって距離を編集する測定方法として使用されます.そこでch 1,ch 2に対して可能な操作は1のみである.ch 1をch 22に変える.s 1+ch 1後ch 1 d=(1+d(s 1,s 2+ch 2))3を削除する.s 1+ch 1を挿入した後にch 2 d=(1+d(s 1+ch 1,s 2))を挿入すると、2および3に対する操作は、次のように等価である.2.s 2+ch 2後にch 1 d=(1+d(s 1,s 2+ch 2))3.s 2+ch 2を削除してch 2 d=(1+d(s 1+ch 1,s 2))を削除することで、編集距離を計算する性質2を得ることができる.複雑度解析上の性質2から計算過程は,各層が現在計算されているシリアル長で表記され,2つのシリアル長がnであると仮定する構造を示していることが分かるが,この問題の複雑度は指数レベル3のn次方であり,長いシリアルに対しては時間的に耐えられない.解析:上記の構造では,(n−1,n−1),(n−1,n−2).......が複数回出現することが分かった.言い換えれば、この構造は重複するサブ問題を有する.さらに,前述の性質2が有する最適サブ構造を加える.動的計画アルゴリズムの基本要素に合致する.したがって,動的プランニングアルゴリズムを用いて複雑さを多項式レベルに低減することができる.動的計画解は,まず計算サブ問題の繰返しを避けるために,2つの補助配列を追加する.一.サブ問題の結果を保存します.M[|s 1|,|s 2|],ここでM[i,j]は、サブストリングs 1(0->i)とs 2(0->j)との編集距離2を表す.文字間の編集距離を保存します.E[|s 1|,|s 2|],そのうちE[i,j]=s[i]=s[j]?0:1三.新しい計算式は性質1からM[0,0]=0を得た.M[ s1i, 0 ] = |s1i|;M[ 0, s2j ] = |s2j|;性質2からM[i,j]=min(m[i−1,j−1]+E[i,j],m[i,j−1],m[i−1,j])を得た.複雑度は新しい計算式から分かるように,計算過程はi=1->|s 1|j=1->|s 2|M[i][j]=......従って複雑度はO(|s 1|*|s 2|)であり,彼らの長さがnであると仮定すると複雑度はO(n^2)である.
Reference: Dynamic Programming Algorithm (DPA) for Edit-Distance
The words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `spot' by the deletion of the `p', or equivalently, `spot' can be changed into `sport' by the insertion of `p'.
The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:
  • change a letter,
  • insert a letter or
  • delete a letter

  • The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:
    d('', '') = 0
    d(s, '')  = d('', s) = |s|   -- i.e. length of s
    d(s1+ch1, s2+ch2)
      = min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
             d(s1+ch1, s2) + 1,
             d(s1, s2+ch2) + 1 )
    

    The first two rules above are obviously true, so it is only necessary consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2, i.e. 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.
    The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.
    Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.
    A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:
    m[i,j] = d(s1[1..i], s2[1..j])
    
    m[0, 0] = 0
    m[i, 0] = i,  i=1..|s1|
    m[0, j] = j,  j=1..|s2|
    
    m[i,j] = min( m[i-1,j-1]
                  + if s1[i]=s2[j] then 0 else 1 fi,
                  m[i-1, j] + 1,
                  m[i, j-1] + 1 ),    i=1..|s1|, j=1..|s2|
    

    m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,]. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n2), much better than exponential!
    The words `computer' and `commuter' are very similar, and a change of just one letter, p->m will change the first word into the second. The word `sport' can be changed into `spot' by the deletion of the `p', or equivalently, `spot' can be changed into `sport' by the insertion of `p'.
    The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:
  • change a letter,
  • insert a letter or
  • delete a letter

  • The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:
    d('', '') = 0
    d(s, '')  = d('', s) = |s|   -- i.e. length of s
    d(s1+ch1, s2+ch2)
      = min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
             d(s1+ch1, s2) + 1,
             d(s1, s2+ch2) + 1 )
    

    The first two rules above are obviously true, so it is only necessary consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2, i.e. 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.
    The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.
    Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.
    A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:
    m[i,j] = d(s1[1..i], s2[1..j])
    
    m[0, 0] = 0
    m[i, 0] = i,  i=1..|s1|
    m[0, j] = j,  j=1..|s2|
    
    m[i,j] = min( m[i-1,j-1]
                  + if s1[i]=s2[j] then 0 else 1 fi,
                  m[i-1, j] + 1,
                  m[i, j-1] + 1 ),    i=1..|s1|, j=1..|s2|
    

    m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,]. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n2), much better than exponential!