PythonでPowerPointからテキストを抽出する！（テーブルにも対応）

6199 ワード

PowerPoint Python 議事録 pptx text Python テキストリンク

この記事について

会議の議事録などをメールで送る場合、スライドの文字だけ欲しい場合があったので、ちょっとコードを書いてみました。テーブルの文字を抜き出すのが若干手間でした。

準備したPowerPointサンプルファイル (sampleFile.pptx として同じフォルダに入れておく）

結果

File name:  sampleFile.pptx 

-- Page 1 --
タイトルです

サブタイトルです

-- Page 2 --
２ページ目です。
テキストボックスです♪

果物, 八百屋さんA, スーパーB, 
バナナ, 100円, 200円, 
リンゴ, 150円, 140円, 

テーブルのサンプル

コード


import pptx
from glob import glob

for fname in glob ('*.pptx'):
    print ('File name: ', fname, '\n')
    prs = pptx.Presentation(fname)

    for i, sld in enumerate(prs.slides, start=1):

        print(f'-- Page {i} --')

        for shp in sld.shapes:

            if shp.has_text_frame:
                print (shp.text)

            if shp.has_table:
                tbl = shp.table
                row_count = len(tbl.rows)
                col_count = len(tbl.columns)
                for r in range(0, row_count):                 
                    text=''
                    for c in range(0, col_count):
                        cell = tbl.cell(r,c)
                        paragraphs = cell.text_frame.paragraphs 
                        for paragraph in paragraphs:
                            for run in paragraph.runs:
                                text+=run.text
                            text+=', '
                    print (text)
            print ()

同じフォルダに入っているpptxの拡張子が入っているファイルすべてのテキストを抽出します。

　参考

Powerpoint（pptx）の表をスクレイピング
https://qiita.com/barobaro/items/a3a4a00aeda9d19e41b6

PDF・Word・PowerPoint・Excelファイルからテキスト部分を一括抽出するメソッド
https://qiita.com/barobaro/items/a3a4a00aeda9d19e41b6

Author And Source

この問題について(PythonでPowerPointからテキストを抽出する！（テーブルにも対応）), 我々は、より多くの情報をここで見つけました https://qiita.com/Kent-747/items/21d1dee3d486f38c4b3f

著者帰属：元の著者の情報は、元のURLに含まれています。著作権は原作者に属する。

Content is automatically searched and collected through network algorithms . If there is a violation . Please contact us . We will adjust (correct author information ,or delete content ) as soon as possible .