[Python] Getting indented HTML with Beautiful Soup 4 + requests (memo)

Notes to self.

Installing Beautiful Soup 4 and requests

pip install beautifulsoup4
pip install requests

Importing Beautiful Soup 4 and requests

from bs4 import BeautifulSoup
import requests

Getting a requests response from a URL

Assign the URL to request to the variable "url".
Note: at this stage the HTML is not yet indented.

url = "https://qiita.com"
response = requests.get(url)
print(response.text)
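
The call above assumes the request succeeds. As a minimal sketch, the fetch could be wrapped in a helper that sets a timeout and raises on HTTP error statuses. The name fetch_html, the session parameter, and the 10-second default are assumptions made for illustration and are not part of the original memo.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `fetch_html`, `session`, and the 10-second default are hypothetical
    choices for this sketch.
    """
    # Use a caller-supplied session if given; otherwise call requests directly.
    getter = session if session is not None else requests
    response = getter.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is not mistaken for the real HTML.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://qiita.com") would then return the same unindented HTML as response.text, or raise an exception if the server answers with an error status such as 403.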

Getting indented HTML with BeautifulSoup

prettify() is the method that returns the HTML as an indented string.

soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
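
To see what prettify() does without fetching a live page, here is a small sketch that runs it on a hard-coded snippet. The snippet and the variable names raw_html and pretty are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A small hard-coded snippet (an assumption for this sketch) instead of a live page.
raw_html = "<ul><li>a</li><li>b</li></ul>"

soup = BeautifulSoup(raw_html, "html.parser")
pretty = soup.prettify()

# Each tag and each text string ends up on its own line,
# indented by one space per nesting level.
for line in pretty.splitlines():
    print(line)
```

Note that prettify() returns a str, so the result can be printed, compared, or written to a file like any other text.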

Summary: getting indented HTML with Beautiful Soup 4 + requests

This version also writes the result to a file.

from bs4 import BeautifulSoup
import requests

# Get the requests response from the URL
url = "https://qiita.com"
response = requests.get(url)
print(response.text)

# Initialize BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Get the HTML with indentation
html = soup.prettify()

# Write the indented HTML to a file; the with block closes it even if write() fails
with open('prettify.html', 'w', encoding='UTF-8') as f:
    f.write(html)
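
The prettify-and-save step can also be packaged as a reusable function. This is a minimal sketch using pathlib; the helper name save_pretty_html and its parameters are hypothetical and do not come from the original memo.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def save_pretty_html(html_text, path):
    """Prettify `html_text` and write it to `path` as UTF-8 (hypothetical helper)."""
    soup = BeautifulSoup(html_text, "html.parser")
    pretty = soup.prettify()
    # Path.write_text opens, writes, and closes the file in one call.
    Path(path).write_text(pretty, encoding="utf-8")
    return pretty
```

With this helper, save_pretty_html(response.text, "prettify.html") would cover the last three steps of the script in one call.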

The output file, prettify.html, contains the HTML with each tag on its own line and nested elements indented.

Reference: Beautiful Soup 4.2.0 documentation (Japanese translation)

If you get 403 Forbidden and cannot retrieve the HTML, see:
What to do when you get 403 Forbidden (You don't have permission to access)
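
One common cause of a 403 response is that the site rejects the default User-Agent header that requests sends. Whether that applies to a particular site, and whether the linked article takes this approach, is an assumption here. As a rough sketch, a session with a custom User-Agent could be built like this; make_session and EXAMPLE_USER_AGENT are hypothetical names, and the header value is only a placeholder.

```python
import requests

# Placeholder value for this sketch; substitute a User-Agent string
# that is appropriate for your own client.
EXAMPLE_USER_AGENT = "Mozilla/5.0 (compatible; example-client/0.1)"


def make_session(user_agent=EXAMPLE_USER_AGENT):
    """Return a requests.Session that sends `user_agent` on every request."""
    session = requests.Session()
    # Session-level headers are merged into each request made through the session.
    session.headers.update({"User-Agent": user_agent})
    return session
```

The returned session's get(url) method can then be used in place of requests.get(url).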

That's all.

Author And Source

More information on this topic ([Python] Getting indented HTML with Beautiful Soup 4 + requests (memo)) can be found at https://qiita.com/shinkai_/items/eb1e092c8ecc0bf14d92
