5 Ways to Extract Text from PDF Documents in Flutter


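The Syncfusion Flutter PDF library is a file-format library that lets you add robust PDF functionality to your Flutter applications. With it, you can programmatically create PDF documents containing formatted text, images, tables, links, lists, headers, footers, bookmarks, and more. The library also lets you read and edit PDF documents without any Adobe dependencies.
PDF documents are widely used to exchange business data in the form of invoices, purchase orders, shipping notes, reports, presentations, price and product lists, HR forms, and other forms.
At some point, users may need to read the data in a PDF document and reuse it elsewhere. Doing this by hand costs extra time and money. To avoid this problem, you can use text extraction techniques. These techniques extract all of the text data, or only specific text data, from a PDF document so that it can be processed further in an automated way.
Using the Flutter PDF library, you can easily extract text from PDF documents in a Flutter application. In this blog, we will: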

Extract all the text from a PDF document.

Extract text from predefined bounds.

Extract text from a specific page.

Extract text from a range of pages.

Extract text with font and style information.

And we'll provide code examples along the way!

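Extract all the text from a PDF document

With the Syncfusion Flutter PDF library, you can extract all the text from a PDF document. The steps are as follows:

Step 1: Create a Flutter application

Follow the instructions in the Getting Started documentation to create a basic Flutter application.

Step 2: Add the Syncfusion Flutter PDF dependency

Include the Syncfusion Flutter PDF package in your project's pubspec.yaml file. Refer to the following code.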

dependencies:
  syncfusion_flutter_pdf: ^18.3.50-beta

Step 3: Get the packages

Run the following command to get the required packages.
$ flutter pub get

Step 4: Import the package

Import the PDF package into the main.dart file using the following code example.

import 'package:syncfusion_flutter_pdf/pdf.dart';

Extract all the text from the PDF file

Add a button widget to the container widget, as shown in the following code example.

@override
Widget build(BuildContext context) {
  return Scaffold(
    appBar: AppBar(
      title: Text(widget.title),
    ),
    body: Center(
      child: Column(
        mainAxisAlignment: MainAxisAlignment.center,
        children: <Widget>[
          FlatButton(
            child: Text(
              'Generate PDF',
              style: TextStyle(color: Colors.white),
            ),
            onPressed: _extractText,
            color: Colors.blue,
          )
        ],
      ),
    ),
  );
}

Include the following code in the button's click event handler, _extractText, to extract all the text from the PDF file.

Future<void> _extractText() async {
  //Load an existing PDF document.
  PdfDocument document =
      PdfDocument(inputBytes: await _readDocumentData('pdf_succinctly.pdf'));

  //Create a new instance of the PdfTextExtractor.
  PdfTextExtractor extractor = PdfTextExtractor(document);

  //Extract all the text from the document.
  String text = extractor.extractText();

  //Dispose the document.
  document.dispose();

  //Display the text.
  _showResult(text);
}

Include the following code to read the PDF document from the folder where it is saved. Here, we use the assets folder.

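import 'dart:typed_data';
import 'package:flutter/services.dart' show rootBundle;

Future<List<int>> _readDocumentData(String name) async {
  //Load the asset as raw bytes.
  final ByteData data = await rootBundle.load('assets/$name');
  return data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes);
}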
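The reader above passes offsetInBytes and lengthInBytes to asUint8List because a ByteData can be a window onto a larger buffer. The following is a minimal plain-Dart sketch of that behavior; the helper name bytesOf and the sample bytes are illustrative assumptions, not part of the Syncfusion or Flutter APIs.

```dart
import 'dart:typed_data';

// Hypothetical helper: returns only the bytes that the ByteData view covers.
List<int> bytesOf(ByteData data) =>
    data.buffer.asUint8List(data.offsetInBytes, data.lengthInBytes);

void main() {
  // Illustrative backing buffer of six bytes.
  final Uint8List backing = Uint8List.fromList(<int>[0, 1, 2, 3, 4, 5]);

  // A ByteData view that starts at byte 2 and covers 3 bytes.
  final ByteData view = ByteData.view(backing.buffer, 2, 3);

  // Respecting the offset and length yields just the viewed bytes.
  print(bytesOf(view)); // [2, 3, 4]

  // Ignoring them would expose the whole backing buffer instead.
  print(view.buffer.asUint8List().length); // 6
}
```

Also note that Flutter only bundles files declared under the assets entry of the flutter section in pubspec.yaml, so the PDF has to be listed there before rootBundle.load can find it.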

Include the following code to display the extracted text.

void _showResult(String text) {
  showDialog(
      context: context,
      builder: (BuildContext context) {
        return AlertDialog(
          title: Text('Extracted text'),
          content: Scrollbar(
            child: SingleChildScrollView(
              child: Text(text),
              physics: BouncingScrollPhysics(
                  parent: AlwaysScrollableScrollPhysics()),
            ),
          ),
          actions: [
            FlatButton(
              child: Text('Close'),
              onPressed: () {
                Navigator.of(context).pop();
              },
            )
          ],
        );
      });
}

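Running the previous code example displays the text extracted from the PDF document, as shown in the following screenshot.

Text extracted from a PDF document

Extract text from predefined bounds

You can extract text from predefined bounds in an existing PDF document. To do this, you need to specify the bounds in which the required data is located in the PDF.
The following code example shows how to extract text from the specified bounds. Here, we extract the invoice number from the PDF document.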

//Load an existing PDF document.
PdfDocument document =
    PdfDocument(inputBytes: await _readDocumentData('invoice.pdf'));

//Create a new instance of the PdfTextExtractor.
PdfTextExtractor extractor = PdfTextExtractor(document);

//Extract all the text from a particular page.
List<TextLine> result = extractor.extractTextWithLine(startPageIndex: 0);

//Predefined bound.
Rect textBounds = Rect.fromLTWH(474, 161, 50, 9);

String invoiceNumber = '';

for (int i = 0; i < result.length; i++) {
  List<TextWord> wordCollection = result[i].wordCollection;
  for (int j = 0; j < wordCollection.length; j++) {
    if (textBounds.overlaps(wordCollection[j].bounds)) {
      invoiceNumber = wordCollection[j].text;
      break;
    }
  }
  if(invoiceNumber != ''){
    break;
  }
}

//Display the text.
_showResult(invoiceNumber);

Running the above code example displays the output text shown in the following screenshot.

Text extracted from predefined bounds
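The bounds check in this section relies on Rect.overlaps from dart:ui. As a rough illustration of what that test decides, here is a plain-Dart sketch that uses a hypothetical Box class as a stand-in for Rect so it can run without Flutter. The target bounds come from the example above; the two word rectangles are made-up values.

```dart
// Hypothetical stand-in for dart:ui's Rect, so the sketch runs in plain Dart.
class Box {
  const Box.fromLTWH(this.left, this.top, this.width, this.height);

  final double left;
  final double top;
  final double width;
  final double height;

  double get right => left + width;
  double get bottom => top + height;

  // True when the two boxes share some area; boxes that only touch at an
  // edge are not treated as overlapping.
  bool overlaps(Box other) {
    if (right <= other.left || other.right <= left) return false;
    if (bottom <= other.top || other.bottom <= top) return false;
    return true;
  }
}

void main() {
  // Bounds used for the invoice number in the example above.
  const Box target = Box.fromLTWH(474, 161, 50, 9);

  // Illustrative word rectangles (assumed values, not taken from a real PDF).
  const Box wordNearTarget = Box.fromLTWH(480, 160, 40, 10);
  const Box wordElsewhere = Box.fromLTWH(50, 161, 60, 9);

  print(target.overlaps(wordNearTarget)); // true
  print(target.overlaps(wordElsewhere)); // false
}
```

Because any shared area counts as an overlap, the predefined bounds do not need to match the word's rectangle exactly. They only need to intersect it, and they should be tight enough not to intersect neighboring words.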

Extract text from a specific page

You can extract text from a specific page by passing that page's index to the extractText method.
The following code example shows how to do this.

//Load an existing PDF document.
PdfDocument document =
    PdfDocument(inputBytes: await _readDocumentData('pdf_succinctly.pdf'));

//Create a new instance of the PdfTextExtractor.
PdfTextExtractor extractor = PdfTextExtractor(document);

//Extract all the text from the first page of the PDF document.
String text = extractor.extractText(startPageIndex: 0);

//Display the text.
_showResult(text);

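Running the above code example displays the text from the first page, as shown in the following screenshot.

Text extracted from a specific page

Extract text from a range of pages

You can also extract text from a range of pages in a PDF document by providing the start and end page indexes to the extractText method. The following example shows how to extract text from a range of pages.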

//Load the existing PDF document.
PdfDocument document =
    PdfDocument(inputBytes: await _readDocumentData('pdf_succinctly.pdf'));

//Create the new instance of the PdfTextExtractor.
PdfTextExtractor extractor = PdfTextExtractor(document);

//Extract all the text from the first page to third page of the PDF document.
String text = extractor.extractText(startPageIndex: 0, endPageIndex: 2);

//Display the text.
_showResult(text);
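The page indexes in the example above are zero-based, so the first through third pages map to startPageIndex 0 and endPageIndex 2. If your application works with page numbers as a reader sees them, a small conversion step can help. The sketch below is plain Dart; PageIndexRange and toIndexRange are hypothetical names introduced for illustration and are not part of the Syncfusion API.

```dart
// Hypothetical value type holding zero-based, inclusive page indexes.
class PageIndexRange {
  const PageIndexRange(this.start, this.end);

  final int start;
  final int end;
}

// Hypothetical helper: converts 1-based page numbers to zero-based indexes
// and rejects ranges that fall outside the document.
PageIndexRange toIndexRange(int firstPage, int lastPage, int pageCount) {
  if (firstPage < 1 || lastPage < firstPage || lastPage > pageCount) {
    throw RangeError(
        'Pages $firstPage-$lastPage are not valid for a $pageCount-page document.');
  }
  return PageIndexRange(firstPage - 1, lastPage - 1);
}

void main() {
  // Pages 1 to 3 of an assumed 10-page document.
  final PageIndexRange range = toIndexRange(1, 3, 10);
  print('startPageIndex: ${range.start}, endPageIndex: ${range.end}');
  // prints "startPageIndex: 0, endPageIndex: 2"
}
```

The resulting start and end values could then be passed as the startPageIndex and endPageIndex arguments shown in the example above.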

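Extract text with font and style information

You can also extract text along with its bounds, font name, font style, and font size. The following code example shows how to extract text with these details.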

//Load an existing PDF document.
PdfDocument document =
    PdfDocument(inputBytes: await _readDocumentData('invoice.pdf'));

//Create a new instance of the PdfTextExtractor.
PdfTextExtractor extractor = PdfTextExtractor(document);

//Extract all the text lines from the first page.
List<TextLine> result = extractor.extractTextWithLine(startPageIndex: 0);

//Find the word '2058557939' and read its font details.
for (int i = 0; i < result.length; i++) {
  List<TextWord> wordCollection = result[i].wordCollection;
  for (int j = 0; j < wordCollection.length; j++) {
    if ('2058557939' == wordCollection[j].text) {
      //Get the font name.
      String fontName = wordCollection[j].fontName;
      //Get the font size.
      double fontSize = wordCollection[j].fontSize;
      //Get the font style.
      List<PdfFontStyle> fontStyle = wordCollection[j].fontStyle;
      //Get the text.
      String text = wordCollection[j].text;
      //Join the style names, dropping the 'PdfFontStyle.' prefix.
      String fontStyleText = '';
      for (int k = 0; k < fontStyle.length; k++) {
        fontStyleText += fontStyle[k].toString() + ' ';
      }
      fontStyleText = fontStyleText.replaceAll('PdfFontStyle.', '');
      _showResult(
          'Text : $text \r\n Font Name: $fontName \r\n Font Size: $fontSize \r\n Font Style: $fontStyleText');
      break;
    }
  }
}
//Dispose the document.
document.dispose();

Running the above code example produces the output shown in the following screenshot.

Extracted text information
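The example in this section turns the list of font styles into display text by concatenating toString() results and then stripping the enum prefix. The same idea can be sketched more compactly in plain Dart. FontStyleFlag below is a hypothetical stand-in for the library's PdfFontStyle enum, with illustrative values, and describeStyles is an assumed helper name.

```dart
// Hypothetical stand-in for the library's PdfFontStyle enum, so this runs in
// plain Dart. The values here are illustrative.
enum FontStyleFlag { regular, bold, italic }

// Hypothetical helper: turns [FontStyleFlag.bold, FontStyleFlag.italic]
// into 'bold italic' by keeping only the part after the enum prefix.
String describeStyles(List<FontStyleFlag> styles) {
  return styles
      .map((FontStyleFlag style) => style.toString().split('.').last)
      .join(' ');
}

void main() {
  print(describeStyles(
      <FontStyleFlag>[FontStyleFlag.bold, FontStyleFlag.italic]));
  // prints "bold italic"
}
```

Using join avoids the trailing space that manual concatenation leaves behind, which can matter when the text is compared or logged rather than only displayed.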

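GitHub sample

You can check out samples of all these extraction types in this GitHub repository.

Conclusion

In this blog post, we covered five different ways to extract text from PDF documents in a Flutter application using the Syncfusion Flutter PDF library. Take a moment to peruse the documentation, where you'll find other options and features, all with accompanying code examples.
If you have any questions about these features, please let us know in the comments section below. You can also contact us through our support forums, Direct-Trac, or feedback portal. We are happy to assist you!
If you like this article, we think you'll also like the following articles about our PDF Library: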

Create and Validate PDF Digital Signatures in C#

7 Ways to Compress PDF Files in C#, VB.NET

Reference

This article (5 Ways to Extract Text from PDF Documents in Flutter) was originally published at https://dev.to/syncfusion/5-ways-to-extract-text-from-pdf-documents-in-flutter-3k85

The text may be freely shared or copied, provided that the URL of this document is kept as a reference URL.
