Implementing ANN nearest-neighbor search in Python with the Faiss library


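Nearest-neighbor search over embeddings is one of the most important recall (candidate retrieval) methods in today's recommender systems. Models such as item2vec, matrix factorization, and two-tower DNNs produce user embeddings and item embeddings, and these embeddings can be used very flexibly:

Given a user embedding, search for nearby item embeddings to recommend items the user is likely to be interested in.

Given a user embedding, search for nearby user embeddings to recommend users the user is likely to be interested in.

Given an item embedding, search for nearby item embeddings to recommend items related to that item.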
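
All three patterns reduce to the same operation: given one query vector, find the k closest vectors in a candidate matrix. A minimal brute-force sketch in NumPy shows what that operation is. The random embeddings and the names `brute_force_topk`, `item_embeddings`, and `user_embedding` are assumptions made up for illustration. It scans every candidate, which is what stops scaling as the data grows.

```python
import numpy as np

def brute_force_topk(query, candidates, k):
    """Return (distances, row_indices) of the k candidates closest to query.

    Distances are squared Euclidean (L2) distances, sorted ascending.
    """
    diff = candidates - query                   # broadcast over all rows
    dist = np.einsum("ij,ij->i", diff, diff)    # squared L2 distance per row
    top = np.argpartition(dist, k - 1)[:k]      # k smallest, unordered
    top = top[np.argsort(dist[top])]            # order them by distance
    return dist[top], top

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(1000, 10)).astype(np.float32)  # synthetic data
user_embedding = rng.normal(size=10).astype(np.float32)           # synthetic query

dist, rows = brute_force_topk(user_embedding, item_embeddings, k=5)
print(rows, dist)
```

Swapping the candidate matrix between item and user embeddings gives the user-to-item, user-to-user, and item-to-item variants.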

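There is an engineering problem, however. Once the number of user and item embeddings reaches a certain scale, nearest-neighbor search over them becomes very slow. One option is to run the searches offline ahead of time and store the results, for example in Redis, but that approach is far from real-time. Being able to search online within a few tens of milliseconds would of course give the best results.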
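Faiss is an open-source library from the Facebook AI team for similarity search and clustering. It provides efficient similarity search and clustering for dense vectors, supports search over billion-scale vector sets, and is one of the most mature nearest-neighbor search libraries available today.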
Next, a Jupyter notebook walkthrough shows the basic flow of using faiss. The steps are:

Read the trained embedding data.

Build a faiss index and add the embeddings to be searched to it.

Take a target embedding and search the index to get a list of IDs.

Look up movie titles by ID and return the results.

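How to run fast nearest-neighbor search over already-trained embeddings is an engineering problem. Facebook's faiss library can build several kinds of embedding indexes to provide fast nearest-neighbor search for a target embedding, which makes it suitable for online use.
Installation command: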


conda install -c pytorch faiss-cpu

Some lessons from using faiss, summarized up front:
1. To use your own IDs, wrap faiss.IndexFlatL2 in faiss.IndexIDMap.
2. Embedding data must be converted to np.float32, both the embeddings being indexed and the embeddings used as queries.
3. ids must be converted to int64.
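
Tips 2 and 3 can be folded into one small conversion step. The sketch below is only an illustration: `prepare_for_faiss` is a hypothetical helper, not a faiss function, and the sample values are made up. It also reshapes a single embedding into a one-row batch, since search expects a 2-D array.

```python
import numpy as np

def prepare_for_faiss(vectors, ids=None):
    """Return (vectors, ids) in the dtypes and layout faiss expects.

    vectors -> 2-D, C-contiguous float32; ids -> 1-D int64 (or None).
    """
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    if vectors.ndim == 1:                  # a single embedding becomes a 1-row batch
        vectors = vectors.reshape(1, -1)
    if vectors.ndim != 2:
        raise ValueError("expected a 1-D or 2-D array of embeddings")
    if ids is not None:
        ids = np.ascontiguousarray(ids, dtype=np.int64)
        if ids.shape != (vectors.shape[0],):
            raise ValueError("need exactly one id per embedding")
    return vectors, ids

vecs, vec_ids = prepare_for_faiss([[0.1, 0.2], [0.3, 0.4]], ids=[10, 20])
query, _ = prepare_for_faiss([0.1, 0.2])   # float64 list -> (1, 2) float32
print(vecs.dtype, vec_ids.dtype, query.shape)
```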
1. Prepare the data


import pandas as pd
import numpy as np


df = pd.read_csv("./datas/movielens_sparkals_item_embedding.csv")
df.head()

   id                                             features
0  10  [0.58664906302493286, 0.560943232323207241, 0.15...
1  20     [0.24459632585848676, -0.928250133912415, -0...
2  30        [0.95555553178723, 0.694761805534363, 0.141...
3  40     [0.181887972011029, 0.36547207272832364, 0.696...
4  50    [0.455231272913475037, 0.4402626752853394, -0...
Build ids


ids = df["id"].values.astype(np.int64)
type(ids), ids.shape
(numpy.ndarray, (3706,))
ids.dtype
dtype('int64')
ids_size = ids.shape[0]
ids_size
3706

Build datas


import json
import numpy as np
datas = []
for x in df["features"]:
 datas.append(json.loads(x))
datas = np.array(datas).astype(np.float32)
datas.dtype
dtype('float32')
datas.shape
(3706, 10)
datas[0]
array([ 0.2586649 , 0.35605943, 0.15589039, -0.7067125 , -0.07414215,
 -0.62500805, -0.0573845 , 0.4533663 , 0.26074877, -0.60799956],
 dtype=float32)
# embedding dimension
dimension = datas.shape[1]
dimension
10
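
The parsing loop above assumes every row decodes to a list of the same length. A small variant, sketched here with a hypothetical `parse_features` helper and a few made-up strings in place of the CSV column, fails early with a clear message if that is not the case.

```python
import json
import numpy as np

def parse_features(rows):
    """Parse JSON-encoded float lists into an (n, d) float32 array."""
    parsed = [json.loads(x) for x in rows]
    dims = {len(v) for v in parsed}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding lengths: {sorted(dims)}")
    return np.asarray(parsed, dtype=np.float32)

# Made-up strings shaped like the "features" column
sample = ["[0.1, 0.2, 0.3]", "[0.4, -0.5, 0.6]"]
datas_demo = parse_features(sample)
print(datas_demo.shape, datas_demo.dtype)
```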

2. Build the index


import faiss
index = faiss.IndexFlatL2(dimension)
index2 = faiss.IndexIDMap(index)
ids.dtype
dtype('int64')
index2.add_with_ids(datas, ids)
index.ntotal
3706

4. Search for the list of nearest-neighbor IDs


df_user = pd.read_csv("./datas/movielens_sparkals_user_embedding.csv")
df_user.head()
   id                                           features
0  10       [0.5342885801819, 0.74869656280518, 0.04...
1  20      [1.39100208247, 0.537978291511536, 0.260...
2  30  [-1.1886241436004639, -0.3511677086353302, 0...
3  40   [1.08092999945831299, 1.00803538324487, 0.986...
4  50   [0.4238868059278137, 0.2989807701111, -0.6...


user_embedding = np.array(json.loads(df_user[df_user["id"] == 10]["features"].iloc[0]))
user_embedding = np.expand_dims(user_embedding, axis=0).astype(np.float32)
user_embedding
array([[ 0.59742886, 0.17486966, 0.04345559, -1.3193961 , 0.5313592 ,
 -0.6052168 , -0.19088413, 1.5307966 , 0.09310367, -2.7573566 ]],
 dtype=float32)
user_embedding.shape
(1, 10)
user_embedding.dtype
dtype('float32')
topk = 30
D, I = index2.search(user_embedding, topk) # search the IndexIDMap wrapper so that I holds the custom ids
I.shape
(1, 30)
I
array([[3380, 2900, 1953, 121, 3285, 999, 617, 747, 2351, 601, 2347,
 42, 2383, 538, 1774, 980, 2165, 3049, 2664, 367, 3289, 2866,
 2452, 547, 1072, 2055, 3660, 3343, 3390, 3590]])
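
search returns two arrays of shape (number of queries, k): D holds the distances and I the matching labels. For IndexFlatL2 the distances are squared L2 distances, and faiss fills unused label slots with -1 when fewer than k results are available. A small sketch of turning one row into (label, distance) pairs follows. The `neighbors` helper is hypothetical, and the arrays are hand-written stand-ins for real search output.

```python
import numpy as np

def neighbors(D, I, row=0):
    """Return [(label, distance), ...] for one query row, skipping -1 padding."""
    return [(int(i), float(d)) for d, i in zip(D[row], I[row]) if i != -1]

# Hand-written stand-ins for the output of a search with k = 4;
# the last distance is only a placeholder for the empty slot
D_demo = np.array([[0.5, 1.25, 2.0, 3.4e38]], dtype=np.float32)
I_demo = np.array([[3380, 2900, 1953, -1]], dtype=np.int64)

print(neighbors(D_demo, I_demo))
```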

5. Look up movie information by movie ID


target_ids = pd.Series(I[0], name="MovieID")
target_ids.head()
0 3380
1 2900
2 1953
3 121
4 3285
Name: MovieID, dtype: int64
df_movie = pd.read_csv("./datas/ml-1m/movies.dat",
  sep="::", header=None, engine="python",
  names = "MovieID::Title::Genres".split("::"))
df_movie.head()

   MovieID                               Title                        Genres
0        1                    Toy Story (1995)   Animation|Children's|Comedy
1        2                      Jumanji (1995)  Adventure|Children's|Fantasy
2        3             Grumpier Old Men (1995)                Comedy|Romance
3        4            Waiting to Exhale (1995)                  Comedy|Drama
4        5  Father of the Bride Part II (1995)                        Comedy


df_result = pd.merge(target_ids, df_movie)
df_result.head()

   MovieID                            Title              Genres
0     3380               Railroaded! (1947)           Film-Noir
1     2900             Monkey Shines (1988)       Horror|Sci-Fi
2     1953    French Connection, The (1971)  Action|Crime|Drama
3      121  Boys of St. Vincent, The (1993)               Drama
4     3285                Beach, The (2000)     Adventure|Drama
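
The merge above keeps only the movie columns. To carry the rank and distance along, and to notice ids that have no match in the movie table, one option is a left join on a small frame built from the search output. The data below is made up for illustration; the titles are placeholders and the `_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-ins for one row of search output and for the movie table
I_demo = np.array([[30, 10, 99]], dtype=np.int64)
D_demo = np.array([[0.2, 0.7, 1.1]], dtype=np.float32)
movies_demo = pd.DataFrame({
    "MovieID": [10, 20, 30],
    "Title": ["Movie A", "Movie B", "Movie C"],
})

hits = pd.DataFrame({
    "rank": np.arange(1, I_demo.shape[1] + 1),
    "MovieID": I_demo[0],
    "distance": D_demo[0],
})
# Left join keeps every hit, in rank order; unmatched ids get NaN titles
result_demo = hits.merge(movies_demo, on="MovieID", how="left")
print(result_demo)
```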
Wrapping up
That covers implementing nearest-neighbor search in Python with the Faiss library.
