Build an Embeddings index from a data source

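This article is part of a tutorial series on txtai, an AI-powered semantic search platform.
Part 1 gave a general overview of txtai, the backing technology and examples of how to use it for similarity searches. Part 2 covers an embeddings index with a larger dataset.
For real-world, large-scale use cases, data is often stored in a database (Elasticsearch, SQL, MongoDB, files, etc.). Here we'll show how to read from SQLite, build an embeddings index backed by word embeddings and run queries against the generated embeddings index.
This example covers functionality found in the paperai library. See that library for a full solution that can be used with the dataset discussed below.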

Install dependencies

Install txtai and all dependencies. Since this article builds word vectors, the similarity extras package also needs to be installed.

pip install txtai[similarity]

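Download data

This example uses the CORD-19 dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles aggregated by a coalition of leading research groups.
The following download is a SQLite database generated from a Kaggle notebook. More information on this data format can be found in the CORD-19 Analysis notebook.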

wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz
gunzip tests.gz
mv tests articles.sqlite
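
Before querying the database, it can help to confirm which tables and columns it contains. The sketch below is only an illustration: it builds a tiny in-memory stand-in whose column names are taken from the queries used later in this article (the column types are assumptions), then lists the schema using SQLite's own catalog.

```python
import sqlite3

# Stand-in database (assumption): an empty in-memory copy of the two tables queried in this article
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE articles (Id TEXT PRIMARY KEY, Title TEXT, Published TEXT, Reference TEXT);
CREATE TABLE sections (Id INTEGER PRIMARY KEY, Article TEXT, Name TEXT, Text TEXT, Tags TEXT, Labels TEXT);
""")

def describe(connection):
  """Return {table name: [column names]} for every table in the database."""
  cur = connection.cursor()
  cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
  tables = [row[0] for row in cur.fetchall()]

  schema = {}
  for table in tables:
    # PRAGMA table_info returns one row per column; index 1 holds the column name
    cur.execute(f'PRAGMA table_info("{table}")')
    schema[table] = [row[1] for row in cur.fetchall()]

  return schema

schema = describe(db)
for table, columns in schema.items():
  print(table, columns)

# Free database resources
db.close()
```

To inspect the downloaded file instead, connect to "articles.sqlite" and drop the executescript call.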

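Build word vectors

This example builds a search system backed by word embeddings. While not quite as powerful as transformer embeddings, word embeddings often provide a good tradeoff between performance and functionality for an embeddings-based search system.
For this article, we'll build our own custom embeddings for demo purposes. A number of pre-trained word embedding models are also available: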

General language models from pymagnitude

CORD-19 fastText

import os
import sqlite3
import tempfile

from txtai.pipeline import Tokenizer
from txtai.vectors import WordVectors

print("Streaming tokens to temporary file")

# Stream tokens to temp working file
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as output:
  # Save file path
  tokens = output.name

  db = sqlite3.connect("articles.sqlite")
  cur = db.cursor()
  cur.execute("SELECT Text from sections")

  for row in cur:
    output.write(" ".join(Tokenizer.tokenize(row[0])) + "\n")

  # Free database resources
  db.close()

# Build word vectors model - 300 dimensions, 3 min occurrences
WordVectors.build(tokens, 300, 3, "cord19-300d")

# Remove temporary tokens file
os.remove(tokens)

# Show files (shell command; in a notebook, prefix it with "!")
ls -l

Streaming tokens to temporary file
Building 300 dimension model
Converting vectors to magnitude format
total 78948
-rw-r--r-- 1 root root  8065024 Aug 25 01:44 articles.sqlite
-rw-r--r-- 1 root root 24145920 Jan  9 20:45 cord19-300d.magnitude
-rw-r--r-- 1 root root 48625387 Jan  9 20:45 cord19-300d.txt
drwxr-xr-x 1 root root     4096 Jan  6 18:10 sample_data
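
The "3 min occurrences" setting above means that rare tokens are left out of the model's vocabulary. The following is a minimal sketch of that idea only, assuming the setting behaves like a typical minimum-count threshold; the function name and toy data are hypothetical and this is not how the library implements it.

```python
from collections import Counter

def vocabulary(lines, mincount):
  """Return the set of tokens that occur at least mincount times across all lines."""
  counts = Counter(token for line in lines for token in line.split())
  return {token for token, count in counts.items() if count >= mincount}

# Toy stand-in for the temporary tokens file: one tokenized section per line
lines = [
  "risk factors include age",
  "age and obesity are risk factors",
  "obesity is a risk",
]

# Only "risk" appears three times, so it is the only token that survives
vocab = vocabulary(lines, 3)
print(sorted(vocab))
```

Tokens below the threshold get no vector of their own, which keeps the model smaller and avoids learning vectors from too few examples.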

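Build an embeddings index

The following steps build an embeddings index using the word vector model just created. This model builds a BM25 + fastText index. BM25 is used to build a weighted average of the word embeddings for a section. More information on this method can be found in this Medium article.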
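
To make the idea of a BM25-weighted average concrete, here is a small self-contained sketch. It is an illustration under stated assumptions, not txtai's implementation: it uses one common BM25 variant (with the usual defaults k1=1.2 and b=0.75), weights each unique token once, and works on toy two-dimensional vectors. All function names and data are hypothetical.

```python
import math
from collections import Counter

def bm25_weights(documents, k1=1.2, b=0.75):
  """Return a function that maps a tokenized document to per-token BM25 weights."""
  total = len(documents)
  avgdl = sum(len(doc) for doc in documents) / total

  # Document frequency: number of documents containing each token
  df = Counter(token for doc in documents for token in set(doc))

  def weights(doc):
    result = {}
    for token, f in Counter(doc).items():
      n = df.get(token, 0)
      # Rare tokens get a high idf, common tokens a low one
      idf = math.log(1 + (total - n + 0.5) / (n + 0.5))
      tf = (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
      result[token] = idf * tf
    return result

  return weights

def weighted_average(doc, vectors, weights):
  """Average the vectors of known tokens, weighting each unique token by its BM25 score."""
  dims = len(next(iter(vectors.values())))
  total, norm = [0.0] * dims, 0.0
  for token, weight in weights(doc).items():
    if token in vectors:
      total = [t + weight * v for t, v in zip(total, vectors[token])]
      norm += weight
  return [t / norm for t in total] if norm else total

# Toy corpus and toy word vectors (assumptions for illustration)
documents = [["covid", "risk", "factors"], ["covid", "treatment"], ["covid", "vaccine", "trial"]]
vectors = {"covid": [1.0, 0.0], "risk": [0.0, 1.0], "factors": [0.0, 1.0]}

embedding = weighted_average(documents[0], vectors, bm25_weights(documents))
print(embedding)
```

Because "covid" appears in every document, it receives a low weight, so the section vector is dominated by the rarer terms "risk" and "factors".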

import sqlite3

import regex as re

from txtai.embeddings import Embeddings
from txtai.pipeline import Tokenizer

def stream():
  # Connection to database file
  db = sqlite3.connect("articles.sqlite")
  cur = db.cursor()

  # Select tagged sentences without a NLP label. NLP labels are set for non-informative sentences.
  cur.execute("SELECT Id, Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND tags is not null")

  count = 0
  for row in cur:
    # Unpack row
    uid, name, text = row

    # Only process certain document sections
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      # Tokenize text
      tokens = Tokenizer.tokenize(text)

      document = (uid, tokens, None)

      count += 1
      if count % 1000 == 0:
        print("Streamed %d documents" % (count), end="\r")

      # Skip documents with no tokens parsed
      if tokens:
        yield document

  print("Iterated over %d total rows" % (count))

  # Free database resources
  db.close()

# BM25 + fastText vectors
embeddings = Embeddings({"path": "cord19-300d.magnitude",
                         "scoring": "bm25",
                         "pca": 3})

# Build scoring index if scoring method provided
if embeddings.config.get("scoring"):
  embeddings.score(stream())

# Build embeddings index
embeddings.index(stream())

Iterated over 21499 total rows
Iterated over 21499 total rows
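
The section filter in stream() relies on a variable-width lookbehind, which is why the code imports the third-party regex module: the standard library re module only accepts fixed-width lookbehinds. As a rough sketch of what the pattern does, the hypothetical helper below reproduces the same rule with the standard library, assuming section names are single-line strings.

```python
import re

def skipped(name):
  """Return True if a section name matches the exclusion pattern used in stream()."""
  name = name.lower()
  if re.search(r"background|introduction|reference", name):
    return True

  # "discussion" only counts when "results" does not appear earlier in the name
  index = name.find("discussion")
  return index >= 0 and "results" not in name[:index]

def keep(name):
  """Mirror the condition in stream(): unnamed sections and non-matching names are indexed."""
  return not name or not skipped(name)

for name in ["Introduction", "Discussion", "Results and Discussion", "Methods", None]:
  print(name, keep(name))
```

In other words, background, introduction, reference and standalone discussion sections are left out of the index, while a combined "Results and Discussion" section is kept.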

Query data

The following runs a query against the embeddings index for the terms "risk factors". It finds the top 5 matches and returns the corresponding documents associated with each match.

import pandas as pd

from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

results = []
for uid, score in embeddings.search("risk factors", 5):
  cur.execute("SELECT article, text FROM sections WHERE id = ?", [uid])
  uid, text = cur.fetchone()

  cur.execute("SELECT Title, Published, Reference from articles where id = ?", [uid])
  results.append(cur.fetchone() + (text,))

# Free database resources
db.close()

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])

display(HTML(df.to_html(index=False)))

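| Title | Published | Reference | Match |
| --- | --- | --- | --- |
| Management of osteoarthritis during the COVID-19 pandemic | 2020-05-21 00:00:00 | https://doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3). |
| Work-related and personal factors associated with mental well-being during the COVID-19 response: a survey of health care and other workers | 2020-06-11 00:00:00 | http://medrxiv.org/cgi/content/short/2020.06.09.20126722v1?rss=1 | Poor family supportive behaviors by supervisors were also associated with these outcomes [1.40 (1.21-1.62), 1.69 (1.48-1.92), 1.54 (1.44-1.64)]. |
| Evidence that androgen regulation of pulmonary TMPRSS2 explains sex-discordant COVID-19 outcomes | 2020-04-21 00:00:00 | https://doi.org/10.1101/2020.04.21.051201 | In addition to male sex, smoking is a risk factor for COVID-19 susceptibility and poor clinical outcomes. |
| Current status of potential therapeutic candidates for the COVID-19 crisis | 2020-04-22 00:00:00 | https://doi.org/10.1016/j.bbi.2020.04.046 | There was no difference in 28-day mortality between heparin users and non-users. |
| COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | https://doi.org/10.7150/ijbs.45134 | The three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. |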
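
Conceptually, a search like the one above compares a query vector with every indexed vector and returns the closest ones. The sketch below shows that idea as a brute-force scan, assuming cosine similarity as the measure; the names and toy vectors are hypothetical, and a production index would typically use an approximate nearest neighbor structure rather than scanning every vector.

```python
import math

def cosine(a, b):
  """Cosine similarity between two vectors, 0.0 if either has zero length."""
  dot = sum(x * y for x, y in zip(a, b))
  norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
  return dot / norm if norm else 0.0

def search(query, index, limit):
  """Return the top (uid, score) pairs, highest similarity first."""
  scores = [(uid, cosine(query, vector)) for uid, vector in index.items()]
  return sorted(scores, key=lambda item: item[1], reverse=True)[:limit]

# Toy index: section id -> section vector
index = {1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.7, 0.7]}

results = search([1.0, 0.0], index, 2)
for uid, score in results:
  print(uid, round(score, 3))
```

Each returned id plays the same role as the uid in the query loop above: it is the key used to look up the matching section and its article in the database.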

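Extracting additional columns from query results

The example above uses the embeddings index to find the top 5 best matches. In addition to this, an Extractor instance (explained further in Part 5) can be used to ask additional questions over the search results, creating a richer query response.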

from txtai.pipeline import Extractor

# Create extractor instance using qa model designed for the CORD-19 dataset
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")

db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

results = []
for uid, score in embeddings.search("risk factors", 5):
  cur.execute("SELECT article, text FROM sections WHERE id = ?", [uid])
  uid, text = cur.fetchone()

  # Get list of document text sections to use for the context
  cur.execute("SELECT Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND article = ? ORDER BY Id", [uid])
  texts = []
  for name, txt in cur.fetchall():
    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      texts.append(txt)

  cur.execute("SELECT Title, Published, Reference from articles where id = ?", [uid])
  article = cur.fetchone()

  # Use QA extractor to derive additional columns
  answers = extractor([("Risk Factors", "risk factors", "What risk factors?", False),
                       ("Locations", "hospital country", "What locations?", False)], texts)

  results.append(article + (text,) + tuple([answer[1] for answer in answers]))

# Free database resources
db.close()

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match", "Risk Factors", "Locations"])
display(HTML(df.to_html(index=False)))

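| Title | Published | Reference | Match | Risk Factors | Locations |
| --- | --- | --- | --- | --- | --- |
| Management of osteoarthritis during the COVID-19 pandemic | 2020-05-21 00:00:00 | https://doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3). | sex, obesity, genetic factors and mechanical factors | None |
| Work-related and personal factors associated with mental well-being during the COVID-19 response: a survey of health care and other workers | 2020-06-11 00:00:00 | http://medrxiv.org/cgi/content/short/2020.06.09.20126722v1?rss=1 | Poor family supportive behaviors by supervisors were also associated with these outcomes [1.40 (1.21-1.62), 1.69 (1.48-1.92), 1.54 (1.44-1.64)]. | Poor family supportive behaviors | None |
| Evidence that androgen regulation of pulmonary TMPRSS2 explains sex-discordant COVID-19 outcomes | 2020-04-21 00:00:00 | https://doi.org/10.1101/2020.04.21.051201 | In addition to male sex, smoking is a risk factor for COVID-19 susceptibility and poor clinical outcomes. | higher morbidity and mortality | None |
| Current status of potential therapeutic candidates for the COVID-19 crisis | 2020-04-22 00:00:00 | https://doi.org/10.1016/j.bbi.2020.04.046 | There was no difference in 28-day mortality between heparin users and non-users. | the strong inflammatory response induced can be beneficial or harmful | None |
| COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | https://doi.org/10.7150/ijbs.45134 | The three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. | sex (male), age (≥60), and severe pneumonia | None |
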
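In the example above, the embeddings index is used to find the top N results for a given query. On top of that, a question-answering extractor is used to derive additional columns based on a list of questions. In this case, the "Risk Factors" and "Locations" columns were pulled from the document text.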

Reference

Original article (Build an Embeddings index from a data source): https://dev.to/neuml/build-an-embeddings-index-from-a-data-source-52pf
