[Personal memo] Scraping web pages with Python 3



Things to watch out for when scraping

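When checking a page's HTML, do not use right-click > "View page source".

Use the markup shown in the browser's developer tools instead, which shows the DOM the browser actually built.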

Extracting text

<dt>価格<span class="tax">（税込）</span></dt>

To extract the text from a dt tag that has a span tag nested inside it, as above (価格 means "price" and （税込） means "tax included"):

from bs4 import BeautifulSoup

source = '<dt>価格<span class="tax">（税込）</span></dt>'
soup = BeautifulSoup(source, "html.parser")
print(soup.text)  # 価格（税込）

Accessing .text returns the text of the tag, including the text inside the nested span.
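
As a rough illustration of why .text is the attribute to reach for here, the sketch below compares it with .string and get_text() on the same snippet. The variable names (dt, first_child and so on) are only for illustration and are not part of the original memo.

```python
from bs4 import BeautifulSoup

source = '<dt>価格<span class="tax">（税込）</span></dt>'
soup = BeautifulSoup(source, "html.parser")
dt = soup.find("dt")

# .string is None because <dt> has more than one child (a string and a <span>)
string_value = dt.string

# .text and get_text() both concatenate every string under the tag
text_value = dt.text
get_text_value = dt.get_text()

# The first child is the bare string that sits directly inside <dt>
first_child = str(dt.contents[0])

print(string_value, text_value, get_text_value, first_child)
```

If only the label without the nested span is needed, reading the first child as above is one possible approach, assuming the string always comes first inside the tag.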

Removing whitespace

<dt>
    価格
    <span class="tax">（税込）</span>
</dt>

When the tag contains whitespace like this:

from bs4 import BeautifulSoup

def remove_whitespace(text):
    return ''.join(text.split())

source = '''<dt>
    価格
    <span class="tax">（税込）</span>
</dt>'''
soup = BeautifulSoup(source, "html.parser")
print(remove_whitespace(soup.text))  # 価格（税込）

This returns the text with all whitespace removed.

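strip() only removes leading and trailing whitespace, not whitespace in the middle of the string, so the function uses split() to split on whitespace
and ''.join() to concatenate the pieces.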
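
To make the difference concrete, here is a small sketch comparing strip(), the split()/join() approach, and BeautifulSoup's own get_text(strip=True). The sample strings and variable names are made up for illustration. Note that joining with an empty string also removes spaces between words, which is fine for the Japanese label above but would glue English words together; joining with a single space is one possible alternative.

```python
from bs4 import BeautifulSoup

source = """<dt>
    価格
    <span class="tax">（税込）</span>
</dt>"""
soup = BeautifulSoup(source, "html.parser")

raw = soup.text
stripped = raw.strip()               # trims both ends only; inner whitespace remains
joined = "".join(raw.split())        # removes every run of whitespace
via_bs4 = soup.get_text(strip=True)  # strips each text piece before joining them

# For text with meaningful spaces, join with a single space instead
english = "  price   (tax\n included)  "
normalized = " ".join(english.split())

print(repr(stripped), joined, via_bs4, normalized)
```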

Using find in BeautifulSoup

To search for a specific class

First match only:

soup.find(class_='hoge')

All matches:

soup.find_all(class_='hoge')

To search for a specific id

First match only:

soup.find(id='hoge')

All matches:

soup.find_all(id='hoge')

To search for a specific tag

First match only:

soup.find('hoge')

All matches:

soup.find_all('hoge')

These conditions can also be combined:

soup.find('hoge', class_='fuga')
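
The calls above can be tried on a small document. The following is a minimal sketch; the HTML, the class names, and the variable names are invented for illustration. It also shows that find() returns None when nothing matches, so the result is worth checking before accessing attributes on it.

```python
from bs4 import BeautifulSoup

html = """
<ul id="menu">
  <li class="item">apple</li>
  <li class="item sale">banana</li>
  <li>cherry</li>
</ul>
<p class="item">note</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find(class_="item")          # first element with class "item"
all_items = soup.find_all(class_="item")       # every element with class "item"
menu = soup.find(id="menu")                    # element with id "menu"
li_items = soup.find_all("li", class_="item")  # tag name and class combined
missing = soup.find(class_="nope")             # None when nothing matches

print(first_item.text, len(all_items), menu.name, len(li_items), missing)
```

An element with several classes, such as the second li here, still matches a search for any one of them.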

Author And Source

More information on this topic ([Personal memo] Scraping web pages with Python 3) can be found at https://qiita.com/ayumu838/items/80239a5bd8072a6f70a5

Attribution: the original author's details are available at the original URL. Copyright belongs to the original author.
