Extract Tables from a PDF Document in Java


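Tables are among the most common elements in PDF documents such as electronic invoices, financial reports, and any other PDF that holds tabular data. Developers may need to extract the data from PDF tables for further analysis. This article shows how to extract data from the tables in a PDF file in seconds using Spire.PDF for Java.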

Installing Spire.Pdf.jar
If you use Maven, you can easily import Spire.Pdf.jar into your application by adding the following code to your project's pom.xml file. For non-Maven projects, download the JAR file from this link and add it manually as a dependency in your application.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>4.11.2</version>
    </dependency>
</dependencies>
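
Whether the JAR comes from Maven or is added by hand, it can help to confirm that it actually reached the classpath before running the extraction code. The following is a minimal sketch, not part of Spire.PDF: the class SpireClasspathCheck and its isOnClasspath helper are hypothetical names introduced here, and the only Spire name used is com.spire.pdf.PdfDocument, which the sample in this article imports.

```java
public class SpireClasspathCheck {

    // Returns true if the named class can be loaded from the current classpath.
    // The class is looked up without being initialized.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, SpireClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // PdfDocument is the entry-point class imported by the sample in this article.
        String className = "com.spire.pdf.PdfDocument";
        if (isOnClasspath(className)) {
            System.out.println(className + " found");
        } else {
            System.out.println(className + " NOT found; check the POM entry or the manually added JAR");
        }
    }
}
```

Running this from the same project reports whether the dependency was resolved, without opening any PDF.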

Using the code
Spire.PDF provides the PdfTableExtractor.extractTable() method for extracting tables from a specific page. The main steps for extracting tables from an entire PDF file are as follows.

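Load the sample PDF document while initializing the PdfDocument object.

Loop through the pages in the document, and get the table collection from each page using the extractTable() method.

Loop through the rows and columns of each table, and get the value of a specific cell using the PdfTable.getText(int rowIndex, int columnIndex) method.

Write the extracted data to a .txt file.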

import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractTablesFromPdf {

    public static void main(String[] args) throws IOException {

        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("C:\\Users\\Administrator\\Desktop\\Table.pdf");

        //Create a StringBuilder instance
        StringBuilder builder = new StringBuilder();

        //Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);

        //Declare variables
        PdfTable[] pdfTables = null;
        int tableNumber = 1;

        //Loop through the pages
        for (int pageIndex = 0; pageIndex < pdf.getPages().getCount(); pageIndex++) {

            //Extract tables from the current page
            pdfTables = extractor.extractTable(pageIndex);

            //If any tables are found
            if (pdfTables != null && pdfTables.length > 0) {

                //Loop through the tables in the array
                for (PdfTable table : pdfTables) {

                    builder.append("Table " + tableNumber);
                    builder.append("\r\n");

                    //Loop through the rows in the current table
                    for (int i = 0; i < table.getRowCount(); i++) {

                        //Loop through the columns in the current table
                        for (int j = 0; j < table.getColumnCount(); j++) {

                            //Extract data from the current table cell
                            String text = table.getText(i, j);

                            //Append the text to the string builder
                            builder.append(text + " ");
                        }
                        builder.append("\r\n");
                    }
                    builder.append("\r\n");
                    tableNumber += 1;
                }
            }
        }

        //Write data into a .txt document; try-with-resources closes the writer even if the write fails
        try (FileWriter fw = new FileWriter("output/ExtractTables.txt")) {
            fw.write(builder.toString());
        }
    }
}
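
One practical caveat about the last step: FileWriter does not create missing directories, so the listing above throws a FileNotFoundException if the output folder does not exist yet, and it encodes text with the platform default charset. The following is a minimal sketch of a more defensive write using only the standard library. SafeTextWriter and writeText are hypothetical names introduced here, and the sample string merely stands in for builder.toString().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SafeTextWriter {

    // Writes text to the given file as UTF-8, creating missing parent directories first.
    public static Path writeText(String file, String text) throws IOException {
        Path path = Paths.get(file);
        Path parent = path.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(path, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for builder.toString() from the listing above.
        String extracted = "Table 1\r\nName Price \r\n";
        Path written = writeText("output/ExtractTables.txt", extracted);
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```

In the main listing, the FileWriter block could be replaced by a single call such as writeText("output/ExtractTables.txt", builder.toString()).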

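Output

The program writes the text of every table it finds to output/ExtractTables.txt: a "Table N" line for each table, followed by one line per row with the cell values separated by spaces.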
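
Because the listing above separates cells with a single space, a cell whose text itself contains spaces cannot be told apart from two neighbouring cells when the file is read back for analysis. If that matters, one option is to emit CSV instead. The sketch below is an assumption about how that could look, not something Spire.PDF provides: CsvRows, escape, and toCsvLine are hypothetical helpers, and the row values are made-up stand-ins for the results of table.getText(i, j).

```java
import java.util.StringJoiner;

public class CsvRows {

    // Quotes a cell if it contains a comma, a double quote, or a line break,
    // doubling any embedded quotes (RFC 4180 style). Null becomes an empty cell.
    public static String escape(String cell) {
        String value = (cell == null) ? "" : cell;
        if (value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Joins one row of cell texts into a single CSV line (no trailing line break).
    public static String toCsvLine(String[] cells) {
        StringJoiner joiner = new StringJoiner(",");
        for (String cell : cells) {
            joiner.add(escape(cell));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Hypothetical cell values standing in for table.getText(i, j) results.
        String[] row = {"Item A", "1,200", "said \"ok\""};
        System.out.println(toCsvLine(row));
    }
}
```

In the extraction loop, the cell texts of one row would be collected into an array and appended with toCsvLine(...) in place of the space-joined text.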

Reference

For more information on this topic (Extract Tables from a PDF Document in Java), see https://dev.to/eiceblue/extract-tables-from-an-entire-pdf-document-in-java-3ce0

