Converting PDF to txt, Part 1 [pdfminer]


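Introduction

I needed to translate a 20-page PDF written in English whose text cannot be selected or copied,
so I wanted to extract the text and run it through Google Translate or a similar service.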

Goal

Extract the text from a PDF file.

What I used

This time I used pdfminer (the pdfminer.six fork).
https://github.com/pdfminer/pdfminer.six

I also referred to the following article.
https://qiita.com/mczkzk/items/894110558fb890c930b5

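Processing flow

1. Enter the PDF file name at the prompt "Please input pdf path : ".
2. Replace the extension of the input file name with .txt and create that text file.
3. Write the extracted text to it.

It is a simple procedure.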
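
Step 2 is the only part that needs any care, because a file name may contain more than one dot. A minimal sketch of that step, assuming the standard pathlib module; the helper name txt_path_for is hypothetical and not part of pdfminer:

```python
from pathlib import Path


def txt_path_for(pdf_path: str) -> Path:
    """Return the input path with its final extension replaced by .txt."""
    # with_suffix only swaps the last extension, so dots elsewhere
    # in the name (e.g. "report.v2.pdf") are left alone.
    return Path(pdf_path).with_suffix(".txt")


print(txt_path_for("report.v2.pdf"))  # report.v2.txt
```

A path without any extension simply gets .txt appended.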

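Results

Running the program on the PDF described above produced only a single arrow character. Something is off.
To check with a different file, I ran it on a PDF that I created in Word.

This time the arrow appears again, but both the English and the Japanese text are extracted correctly.
The program does not seem to be the problem.
I suspected the PDF's protection, so I tried to remove it by re-saving the file with "Print to PDF",
but again only a single arrow was output.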

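Discussion

Since pdfminer itself was confirmed to work, the problem is most likely in the PDF file.
I think the target PDF is a scanned document with coarse image quality. A scanned PDF stores each page as an image without a text layer, so there is nothing for pdfminer to extract; OCR would be needed to get at the text.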
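
A quick way to tell this case apart from a working extraction is to check whether the output contains anything besides whitespace. The sketch below rests on an assumption: that the lone "arrow" is how the editor displays the form feed character ("\f") which, as far as I know, pdfminer's TextConverter writes at the end of each page. The helper name looks_empty is hypothetical.

```python
def looks_empty(extracted: str) -> bool:
    """Return True if the text holds nothing but whitespace and form feeds."""
    # str.strip() with no arguments also removes "\f" (form feed),
    # which is assumed here to be the "arrow" seen in the output.
    return extracted.strip() == ""


print(looks_empty("\f"))              # True
print(looks_empty("Hello world\n\f"))  # False
```

If the check returns True for a PDF that visibly contains text, the pages are probably images and the file has to go through OCR instead.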

Program

Because pdfminer does most of the work, the program turned out very short.

pdf2text.py

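import os

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

input_path = input("Please input pdf path : ")
# splitext copes with paths that contain several dots (or none);
# unpacking input_path.split(".") raises ValueError in those cases.
output_path = os.path.splitext(input_path)[0] + ".txt"

manager = PDFResourceManager()

# The output is opened in binary mode because TextConverter encodes
# the text itself, using the given codec.
with open(output_path, "wb") as output_file:
    # Named pdf_file rather than input to avoid shadowing the builtin.
    with open(input_path, "rb") as pdf_file:
        with TextConverter(manager, output_file, codec="utf-8", laparams=LAParams()) as conv:
            interpreter = PDFPageInterpreter(manager, conv)
            # Feed each page through the interpreter; conv writes its text.
            for page in PDFPage.get_pages(pdf_file):
                interpreter.process_page(page)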
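
As an alternative sketch, pdfminer.six also ships a high-level helper, pdfminer.high_level.extract_text, which wraps the resource manager, converter and interpreter in one call. The function below is not the program used in this article; pdf_to_txt and its extract parameter are hypothetical names, and the extractor is injectable only so that the file handling can be tried without a real PDF.

```python
from pathlib import Path
from typing import Callable, Optional


def pdf_to_txt(pdf_path: str, extract: Optional[Callable[[str], str]] = None) -> Path:
    """Extract text from pdf_path and write it next to the PDF as .txt."""
    if extract is None:
        # Imported lazily so the helper can be exercised without pdfminer.
        from pdfminer.high_level import extract_text
        extract = extract_text
    text = extract(pdf_path)
    out_path = Path(pdf_path).with_suffix(".txt")
    # extract_text returns str, so the file is written in text mode.
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling pdf_to_txt("sample.pdf") with pdfminer.six installed would write sample.txt in the same directory; sample.pdf is a placeholder name.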

Author And Source

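More information on this topic (Converting PDF to txt, Part 1 [pdfminer]) can be found at https://qiita.com/ptxyasu/items/4180035bd0ccd789c858

Attribution: information about the original author is available at the original URL. Copyright belongs to the original author.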