Extractive QA with Elasticsearch


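This article is part of a tutorial series on txtai, an AI-powered semantic search platform.
txtai is datastore agnostic: the library analyzes sets of text, wherever that text is stored. The following example shows how to add extractive question-answering on top of an Elasticsearch system.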

Install dependencies

Install txtai and Elasticsearch.

# Install txtai and elasticsearch python client
pip install txtai elasticsearch

# Download and extract elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.1

Start an instance of Elasticsearch.

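import os
import time
from subprocess import Popen, PIPE, STDOUT

# Start the server as a non-root user (uid 1); Elasticsearch refuses to run as root
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))

# Give the server time to start before connecting
time.sleep(30)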
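
A fixed 30-second wait can be too short on a slow machine and wasteful on a fast one. One alternative is to poll the HTTP port until the server answers. The sketch below is an illustration under assumptions, not part of the original notebook: `http_ok` and `wait_for_server` are hypothetical helper names, and the URL in the usage note assumes the default port 9200 that the rest of this article connects to.

```python
import time
import urllib.error
import urllib.request


def http_ok(url):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(url, timeout=120, interval=2, probe=http_ok, sleep=time.sleep):
    """Poll until probe(url) succeeds or timeout seconds elapse. Returns True on success."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling `wait_for_server("http://localhost:9200")` after `Popen` would then return as soon as the server responds, or `False` after two minutes. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.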

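Download data

This example works off a subset of the CORD-19 dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups.
The download below is a SQLite database generated from a Kaggle notebook. More information on this data format can be found in the CORD-19 Analysis notebook.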

wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz
gunzip tests.gz
mv tests articles.sqlite

Load data into Elasticsearch

The following block copies rows from SQLite to Elasticsearch.

import sqlite3

import regex as re

from elasticsearch import Elasticsearch, helpers

# Connect to ES instance
es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)

# Connection to database file
db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

# Elasticsearch bulk buffer
buffer = []
rows = 0

# Select tagged sentences without a NLP label. NLP labels are set for non-informative sentences.
cur.execute("SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null")
for row in cur:
  # Build dict of name-value pairs for fields
  article = dict(zip(("id", "article", "title", "published", "reference", "name", "text"), row))
  name = article["name"]

  # Only process certain document sections
  if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
    # Bulk action fields
    article["_id"] = article["id"]
    article["_index"] = "articles"

    # Buffer article
    buffer.append(article)

    # Increment number of articles processed
    rows += 1

    # Bulk load every 1000 records
    if rows % 1000 == 0:
      helpers.bulk(es, buffer)
      buffer = []

      print("Inserted {} articles".format(rows), end="\r")

if buffer:
  helpers.bulk(es, buffer)

print("Total articles inserted: {}".format(rows))

Total articles inserted: 21499
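
The section filter in the load loop relies on a variable-length lookbehind, which is why the third-party `regex` module is imported as `re` instead of the standard library `re`. As a reading aid, here is a standard-library sketch of what that rule appears to do for single-line section names. `is_skipped` is a hypothetical helper name and the sample section names are made up for illustration.

```python
def is_skipped(name):
    """Approximate the load filter: True when a section should NOT be indexed."""
    if not name:
        # Sections without a name are always kept
        return False

    lowered = name.lower()

    # Background, introduction and reference sections are skipped outright
    if any(keyword in lowered for keyword in ("background", "introduction", "reference")):
        return True

    # "discussion" is skipped only when "results" does not appear before it,
    # so a combined "Results and Discussion" section is still kept
    position = lowered.find("discussion")
    return position != -1 and "results" not in lowered[:position]


for sample in (None, "Methods", "Introduction", "Discussion", "Results and Discussion"):
    print(sample, "->", "skip" if is_skipped(sample) else "keep")
```

Spelling the rule out this way makes the one exception visible: a discussion section is dropped unless it is merged with the results.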

Query data

The following runs a query against Elasticsearch for the term "risk factors". It finds the top 5 matches and returns the corresponding document associated with each match.

import pandas as pd

from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

query = {
    "_source": ["article", "title", "published", "reference", "text"],
    "size": 5,
    "query": {
        "query_string": {"query": "risk factors"}
    }
}

results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
  source = result["_source"]
  results.append((source["title"], source["published"], source["reference"], source["text"]))

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])

display(HTML(df.to_html(index=False)))

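| Title | Published | Reference | Match |
| --- | --- | --- | --- |
| Management of osteoarthritis during COVID-19 pandemic | 2020-05-21 00:00:00 | https://doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3). |
| Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection | 2020-04-24 00:00:00 | http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1 | This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors. |
| Does apolipoprotein E genotype predict COVID-19 severity? | 2020-04-27 00:00:00 | https://doi.org/10.1093/qjmed/hcaa142 | Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors. |
| COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants | 2020-07-23 00:00:00 | https://www.ncbi.nlm.nih.gov/pubmed/32705587/ | BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease. |
| COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | https://doi.org/10.7150/ijbs.45134 | Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. |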
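
By default a `query_string` query combines its terms with OR, so the query above can also match sentences that contain only "risk" or only "factors". Wrapping the text in double quotes asks for the exact phrase instead. The sketch below builds both variants. `build_query` and its parameters are hypothetical names introduced for illustration, only the request body layout is taken from the query above, and the text is assumed to contain no double quotes of its own.

```python
import json


def build_query(text, size=5, phrase=False):
    """Build a request body like the one above; phrase=True wraps the text in quotes."""
    return {
        "_source": ["article", "title", "published", "reference", "text"],
        "size": size,
        "query": {
            # A quoted string is treated as a phrase by query_string
            "query_string": {"query": '"{}"'.format(text) if phrase else text}
        },
    }


print(json.dumps(build_query("risk factors", phrase=True)["query"]))
```

The result can be passed to the same call used above, for example `es.search(index="articles", body=build_query("risk factors", phrase=True))`.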

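Derive columns with Extractive QA

The next section uses Extractive QA to derive additional columns. For each article, the full text is retrieved and a series of questions is asked of the document. The answers are added as derived columns for each article.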

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Create extractor instance using qa model designed for the CORD-19 dataset
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")

document = {
    "_source": ["id", "name", "text"],
    "size": 1000,
    "query": {
        "term": {"article": None}
    },
    "sort" : ["id"]
}

def sections(article):
  rows = []

  # Build a fresh query per article so the shared template above is never mutated
  search = dict(document, query={"term": {"article": article}})

  for result in es.search(index="articles", body=search)["hits"]["hits"]:
    source = result["_source"]
    name, text = source["name"], source["text"]

    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      rows.append(text)

  return rows

results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
  source = result["_source"]

  # Use QA extractor to derive additional columns
  answers = extractor([("Risk factors", "risk factor", "What are names of risk factors?", False),
                       ("Locations", "city country state", "What are names of locations?", False)], sections(source["article"]))

  results.append((source["title"], source["published"], source["reference"], source["text"]) + tuple([answer[1] for answer in answers]))

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match", "Risk Factors", "Locations"])

display(HTML(df.to_html(index=False)))

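| Title | Published | Reference | Match | Risk Factors | Locations |
| --- | --- | --- | --- | --- | --- |
| Management of osteoarthritis during COVID-19 pandemic | 2020-05-21 00:00:00 | https://doi.org/10.1002/cpt.1910 | Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3). | comorbidities | extrapulmonary sites |
| Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection | 2020-04-24 00:00:00 | http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1 | This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors. | CVD, risk factors but no CVD, and neither CVD nor risk factors | None |
| Does apolipoprotein E genotype predict COVID-19 severity? | 2020-04-27 00:00:00 | https://doi.org/10.1093/qjmed/hcaa142 | Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors. | socioeconomic inequalities and risk factors | None |
| COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants | 2020-07-23 00:00:00 | https://www.ncbi.nlm.nih.gov/pubmed/32705587/ | BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease. | frailty and multimorbidity | cluster classification |
| COVID-19: what has been learned and to be learned about the novel coronavirus disease | 2020-03-15 00:00:00 | https://doi.org/10.7150/ijbs.45134 | Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia. | age and underlying disease are strongly correlated | cities, provinces, and countries |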
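
Reading the extractor call above, each queue entry is a tuple of four values: a column name, a search query used to find relevant sections, a question to ask of those sections, and a flag that is `False` here. The loop then reads each answer from index 1 of the returned items. The sketch below shows only that data flow. `stub_extractor` is a hypothetical stand-in that does plain keyword matching, not the real model, and the sample row and texts are made up.

```python
def stub_extractor(queue, texts):
    """Hypothetical stand-in for the real Extractor; it does no machine learning.

    For each (name, query, question, flag) entry it returns (name, answer),
    where answer is the first text containing any query term, or None.
    """
    answers = []
    for name, query, _question, _flag in queue:
        terms = query.lower().split()
        answer = next((text for text in texts if any(term in text.lower() for term in terms)), None)
        answers.append((name, answer))
    return answers


queue = [("Risk factors", "risk factor", "What are names of risk factors?", False),
         ("Locations", "city country state", "What are names of locations?", False)]

# Made-up sample data, standing in for one search hit and its article sections
base_row = ("Sample title", "2020-01-01", "https://example.org/paper", "Matched sentence")
texts = ["Age is a known risk factor.", "Methods are described elsewhere."]

answers = stub_extractor(queue, texts)

# Same row-building step as the loop above: base fields plus one answer per question
row = base_row + tuple(answer[1] for answer in answers)
print(row)
```

A question with no supporting text yields `None`, which is how the empty cells in the Locations column above arise.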

Reference

More information on this topic (Extractive QA with Elasticsearch) can be found at https://dev.to/neuml/extractive-qa-with-elasticsearch-5aij
