Extracting plain text from a docx file

Unzipping the `docx` file

A `docx` file is a zip archive, so running `unzip file.docx` on it directly extracts a number of files:

├── [Content_Types].xml
├── _rels
├── docProps
│   ├── app.xml
│   └── core.xml
└── word
    ├── _rels
    │   └── document.xml.rels
    ├── document.xml
    └── settings.xml

Look at the contents of `word/document.xml`: it is an ordinary, well-formed XML file.

Extracting plain text from the XML

cat word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
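To see what the two `sed` expressions do, here is a small sketch run against a made-up fragment shaped like `word/document.xml`. The `sample` string and the `strip_tags` helper are illustrative names, not part of the original post. The first expression deletes every `<...>` tag; the second deletes runs of characters outside the `[:print:]` class. Which characters count as printable depends on the locale, so in a non-UTF-8 locale non-ASCII text may be stripped as well.

```shell
# A made-up fragment shaped like word/document.xml (not taken from a real file).
sample='<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>'

# strip_tags (hypothetical helper): read XML on stdin, write bare text to stdout.
#   s/<[^>]\{1,\}>//g        removes each tag, i.e. "<", one or more non-">" characters, ">"
#   s/[^[:print:]]\{1,\}//g  removes runs of non-printable characters
strip_tags() {
  sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
}

printf '%s\n' "$sample" | strip_tags
# prints: Hello world
```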

The combined command is shown below. `unzip -p` extracts the file to a pipe (standard output) instead of writing it to disk.

unzip -p file.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
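Because every tag is simply deleted, the one-liner above runs all paragraphs together. A possible variation, offered as a sketch rather than as the original author's method, first turns each closing `</w:p>` tag into a line break and then strips the remaining tags in a second `sed` process (the second process is needed because a newline is itself non-printable and would otherwise be removed). It also decodes the three XML entities `&lt;`, `&gt;` and `&amp;`. The function name `docx_paragraphs` is hypothetical.

```shell
# docx_paragraphs (hypothetical helper): read document.xml on stdin and
# print one paragraph per line.
docx_paragraphs() {
  # Step 1: replace each paragraph end tag with a literal newline
  # (the backslash-newline form works in both GNU and BSD sed).
  sed -e 's/<\/w:p>/\
/g' |
  # Step 2: strip the remaining tags and non-printable characters line by line,
  # then decode common XML entities (&amp; last, so "&amp;lt;" is not decoded twice).
  sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g; s/&lt;/</g; s/&gt;/>/g; s/&amp;/\&/g'
}

# Demo on a made-up two-paragraph fragment.
printf '%s\n' '<w:body><w:p><w:r><w:t>First</w:t></w:r></w:p><w:p><w:r><w:t>A &amp; B</w:t></w:r></w:p></w:body>' | docx_paragraphs
```

With a real document the function would be used as `unzip -p file.docx word/document.xml | docx_paragraphs`.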

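Reference:

Author and Source

More information on this topic (extracting plain text from a docx file) can be found at https://qiita.com/shooter/items/777c17502c6df5d41b4c

Author attribution: the original author's information is given at the original URL. Copyright belongs to the original author.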
