ChemDataExtractor (Getting Started)


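Overview

Suppose you want to mine papers and patents by automatically extracting material names, compound names, and the property values tied to them. For this kind of task, the Python library ChemDataExtractor has been gaining momentum in recent years. There are few Japanese-language guides to it, so I am leaving these notes as a memo.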


ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature
Matthew C. Swain and Jacqueline M. Cole
Journal of Chemical Information and Modeling, 2016, 56 (10), 1894-1904
DOI: 10.1021/acs.jcim.6b00207


ChemDataExtractor comes from Prof. Jacqueline M. Cole at the University of Cambridge in the UK. I see that her specialty is molecular engineering.

1. Installing the library

Method 1. Installing with conda

If you use Anaconda, run the following conda command from the command line.

 conda install -c chemdataextractor chemdataextractor 

Method 2. Installing with pip

If you install with pip, the following command is all you need.

pip install ChemDataExtractor

Last year the install sometimes got stuck on the DAWG package, one of ChemDataExtractor's dependencies, but that seems to be resolved now.
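To confirm that either install method worked, a small check like the one below can help. It is only a sketch that uses the Python standard library (Python 3.8 or later for `importlib.metadata`). The helper name `installed_version` is made up for this memo and is not part of ChemDataExtractor.

```python
from importlib import metadata, util


def installed_version(module_name, dist_name):
    """Return the installed version string, or None if the module is not importable.

    Hypothetical helper for this memo, not a ChemDataExtractor API.
    """
    # find_spec returns None when a top-level module cannot be found.
    if util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The module is importable but has no distribution metadata.
        return "unknown"


version = installed_version("chemdataextractor", "ChemDataExtractor")
if version is None:
    print("chemdataextractor is not installed in this environment")
else:
    print(f"chemdataextractor is installed (version: {version})")
```

If this reports that the package is missing right after a successful install, the usual cause is that the script runs in a different Python environment from the one conda or pip installed into.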

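2. Getting the data files

For ChemDataExtractor to work, you first need to download various data files, such as machine learning models, dictionaries, and word clusters. This is done with its own command, cde (presumably the initials of ChemDataExtractor).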

cde data download 

INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/cem_crf-1.0.pickle to /root/.local/share/ChemDataExtractor/models/cem_crf-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/cem_crf_chemdner_cemp-1.0.pickle to /root/.local/share/ChemDataExtractor/models/cem_crf_chemdner_cemp-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/cem_dict_cs-1.0.pickle to /root/.local/share/ChemDataExtractor/models/cem_dict_cs-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/cem_dict-1.0.pickle to /root/.local/share/ChemDataExtractor/models/cem_dict-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/clusters_chem1500-1.0.pickle to /root/.local/share/ChemDataExtractor/models/clusters_chem1500-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_ap_genia_nocluster-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_ap_genia_nocluster-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_ap_genia-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_ap_genia-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_ap_wsj_genia_nocluster-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_ap_wsj_genia_nocluster-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_ap_wsj_genia-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_ap_wsj_genia-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_ap_wsj_nocluster-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_ap_wsj_nocluster-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_ap_wsj-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_ap_wsj-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_crf_genia_nocluster-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_crf_genia_nocluster-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_crf_genia-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_crf_genia-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_crf_wsj_genia_nocluster-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_crf_wsj_genia_nocluster-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_crf_wsj_genia-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_crf_wsj_genia-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_crf_wsj_nocluster-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_crf_wsj_nocluster-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_crf_wsj-1.0.pickle to /root/.local/share/ChemDataExtractor/models/pos_crf_wsj-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/punkt_chem-1.0.pickle to /root/.local/share/ChemDataExtractor/models/punkt_chem-1.0.pickle
Successfully downloaded 18 new data packages (0 existing)

To check where the data files were downloaded, print the data directory with the following command.

cde data where

/root/.local/share/ChemDataExtractor
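The download log above puts every file in a models subdirectory of the data directory, so one way to double-check the download is to list the .pickle files there. The following is a minimal sketch using only the standard library. The function name `list_data_files` and the default path are assumptions for illustration, so pass in whatever `cde data where` prints on your machine.

```python
from pathlib import Path


def list_data_files(data_dir):
    """Return the sorted names of the .pickle files under <data_dir>/models.

    Hypothetical helper for this memo, not a ChemDataExtractor API.
    """
    models_dir = Path(data_dir) / "models"
    if not models_dir.is_dir():
        return []
    return sorted(p.name for p in models_dir.glob("*.pickle"))


if __name__ == "__main__":
    # Assumed location; replace it with the path printed by `cde data where`.
    data_dir = Path.home() / ".local" / "share" / "ChemDataExtractor"
    files = list_data_files(data_dir)
    print(f"{len(files)} data files found in {data_dir}")
    for name in files:
        print(name)
```

With the download shown above, the count should match the 18 packages reported on the last line of the log.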

That completes the setup.

References