Python: Converting Image Data to Text with Tesseract + PyOCR

Converting image data to text in Python turned out to be easy with Tesseract + PyOCR.

Preparation

Installing PyOCR

$ pip install pyocr

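Installing Tesseract
Note: on Windows, install it by a different method. The tessdata path below is for Tesseract 4.1.1 installed via Homebrew; adjust it to match your installed version.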

$ brew install tesseract
$ ls /usr/local/Cellar/tesseract/4.1.1/share/tessdata/
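Before moving on, it can help to confirm that the tesseract executable is actually visible on PATH, since PyOCR relies on finding it. The following is a minimal sketch using only the standard library; the helper name find_tesseract is hypothetical and not part of PyOCR or Tesseract.

```python
import shutil


def find_tesseract(name="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tesseract()
    print(path if path else "tesseract not found on PATH")
```

If this prints "tesseract not found on PATH", the installation step above probably needs another look before trying the sample.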

Getting jpn.traineddata

$ wget https://github.com/tesseract-ocr/tessdata/raw/4.1.0/jpn.traineddata
$ mv jpn.traineddata /usr/local/Cellar/tesseract/4.1.1/share/tessdata/
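As a quick sanity check, one could verify that the language file ended up in the tessdata directory. This is only a sketch: the helper name has_traineddata is hypothetical, and the directory shown is the one used in the commands above, which depends on the installed Tesseract version.

```python
from pathlib import Path


def has_traineddata(tessdata_dir, lang="jpn"):
    """Return True if <lang>.traineddata exists in the given tessdata directory."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


if __name__ == "__main__":
    # Path taken from the commands above; it varies with the installed version
    tessdata = "/usr/local/Cellar/tesseract/4.1.1/share/tessdata"
    print("jpn.traineddata present:", has_traineddata(tessdata))
```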

Sample

The sample image (images/sample.png in the code below) is analyzed with PyOCR.

from PIL import Image
import pyocr
import pyocr.builders


def main():
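    # Get the available OCR engines; the list is empty if none is installed
    tools = pyocr.get_available_tools()
    if not tools:
        raise SystemExit("No OCR tool found. Is Tesseract installed?")
    tool = tools[0]

    # Run OCR and get the result as plain text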
    builder = pyocr.builders.TextBuilder()

    with Image.open("images/sample.png") as im:
        result = tool.image_to_string(im, lang="jpn", builder=builder)
        print(result)


if __name__ == "__main__":
    main()
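If the jpn language data is missing, the OCR call tends to fail with a less obvious error. A possible variation, sketched below, checks the tool's reported languages first. The function name ocr_text is hypothetical; it relies on PyOCR's get_available_languages() and image_to_string() methods, and the PyOCR imports are kept inside main() so the helper can be tried with any object that offers those two methods.

```python
def ocr_text(tool, image, lang, builder):
    """Run OCR with `tool`, failing early if `lang` is not installed."""
    available = tool.get_available_languages()
    if lang not in available:
        raise ValueError(
            f"Language {lang!r} is not installed; available: {sorted(available)}"
        )
    return tool.image_to_string(image, lang=lang, builder=builder)


def main(path="images/sample.png"):
    # Imported here so ocr_text can be reused without PyOCR installed
    from PIL import Image
    import pyocr
    import pyocr.builders

    tools = pyocr.get_available_tools()
    if not tools:
        raise SystemExit("No OCR tool found. Is Tesseract installed?")

    with Image.open(path) as im:
        print(ocr_text(tools[0], im, "jpn", pyocr.builders.TextBuilder()))
```

Calling main() runs this against the same sample image as the original script.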

Output

ToysCreation

ト イ ズ ク リ エ イ シ ョ ン

The text was extracted from the image data with very little effort.
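In the output above, the katakana line comes back with a space between every character. If that is unwanted, one option is a small post-processing step that removes spaces sitting between two Japanese characters while leaving Latin text alone. This is a sketch and not part of PyOCR; the function name strip_japanese_spacing and the chosen character ranges (hiragana, katakana, common kanji) are assumptions.

```python
import re

# Hiragana, katakana (including the long-vowel mark) and common kanji
_JP = r"\u3040-\u30ff\u4e00-\u9fff"
_SPACE_BETWEEN_JP = re.compile(rf"(?<=[{_JP}])[ \u3000]+(?=[{_JP}])")


def strip_japanese_spacing(text):
    """Remove spaces that sit between two Japanese characters."""
    return _SPACE_BETWEEN_JP.sub("", text)


if __name__ == "__main__":
    print(strip_japanese_spacing("ト イ ズ ク リ エ イ シ ョ ン"))
    # prints: トイズクリエイション
```

Spaces between Latin words, such as in "Toys Creation", are not touched because neither neighbor falls in the Japanese ranges.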

Author and Source

Original article: https://qiita.com/morita-toyscreation/items/8b5cc889a162052fe2b1