Auto-generating a Python Notebook from HTML on the Web

Introduction

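Every now and then I end up running the examples from a GitHub README in a notebook.
When there are a lot of them, copy-pasting gets tedious, and above all I don't want to lose time to something with zero productivity!!
So I automated it with BeautifulSoup and jupytext.
Please use it to save your own precious time!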

Implementation

For a GitHub README, passing div as element and highlight-source-python as class_ picks up only the Python code blocks. For other web pages, it should work once you adjust these two parameters to match the page's markup.

import requests
from bs4 import BeautifulSoup
import jupytext


def convert_html_to_ipynb(
    url: str,
    filename: str,
    element: str = "div",
    class_: str = "highlight-source-python",
) -> None:
    """Fetch HTML, extract only its code blocks, and save them as an ipynb."""
    res = requests.get(url)
    res.raise_for_status()  # stop early on a 4xx/5xx response
    soup = BeautifulSoup(res.text, "html.parser")
    # One notebook cell per matching element, separated by "# %%" cell markers.
    py_percent_text = "\n# %%\n".join(
        [x.get_text() for x in soup.find_all(element, class_=class_)]
    )
    notebook = jupytext.reads(py_percent_text, fmt="py")
    jupytext.write(notebook, filename)


convert_html_to_ipynb("https://github.com/...", "sample.ipynb")
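
Before pointing the function at a real page, it can help to check the element and class_ values against a small piece of HTML. The sketch below does only the extraction step, offline. The HTML strings, the code-sample class name, and the helper names extract_code_blocks and to_py_percent are all made up for illustration; only the highlight-source-python class comes from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like GitHub README code blocks (assumption).
GITHUB_LIKE_HTML = (
    '<div class="highlight highlight-source-python">'
    "<pre><span>import</span> math\nprint(math.pi)</pre></div>"
    '<div class="highlight highlight-source-shell">'
    "<pre>pip install foo</pre></div>"
    '<div class="highlight highlight-source-python">'
    "<pre>x = 1</pre></div>"
)

# Hypothetical markup for some other site; "code-sample" is an invented class.
OTHER_SITE_HTML = '<pre class="code-sample">y = 2</pre><pre>not code</pre>'


def extract_code_blocks(html: str, element: str, class_: str) -> list[str]:
    """Return the text of every matching element, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [x.get_text() for x in soup.find_all(element, class_=class_)]


def to_py_percent(blocks: list[str]) -> str:
    """Join code blocks with "# %%" cell markers, as the function above does."""
    return "\n# %%\n".join(blocks)


# The shell block is skipped because it lacks the Python class.
github_blocks = extract_code_blocks(
    GITHUB_LIKE_HTML, "div", "highlight-source-python"
)
# A different page only needs different element/class_ values.
other_blocks = extract_code_blocks(OTHER_SITE_HTML, "pre", "code-sample")

print(to_py_percent(github_blocks))
```

Since class_ matches any element whose class list contains the given name, the div with both highlight and highlight-source-python is still found. If the list comes back empty for a real page, the selector is the first thing to check.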

Conclusion

I'd be happy if this code saves you even a second.

Author And Source

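The original article (Auto-generating a Python Notebook from HTML on the Web) is available at https://qiita.com/ozora/items/31a0fa3feeb93af49cd3

Attribution: information about the original author is given at the URL above. Copyright belongs to the original author.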