Extracting links from a web page with Beautiful Soup



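Reference page
PythonでWebページのリンクを抽出するスクリプトを書いた ("I wrote a Python script that extracts links from a web page")
That example targets Python 2, so I rewrote it for Python 3.
I also made it avoid "HTTP Error 403: Forbidden" by sending a browser-like User-Agent header with the request.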
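
The message "HTTP Error 403: Forbidden" has the form that the standard library's urllib reports for an HTTPError. As a minimal sketch of the same workaround without requests, assuming urllib.request is used to build the request, the User-Agent can be attached as below. The helper name build_request is hypothetical, and nothing is fetched until urlopen is called.

```python
import urllib.request

# Same browser-like User-Agent string that get_url.py sends.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"


def build_request(url):
    """Build a GET request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


req = build_request("https://ekzemplaro.org")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
# To actually fetch the page (network access required):
#   html = urllib.request.urlopen(req, timeout=10).read()
```

Whether a given server accepts this User-Agent is up to that server; the header only removes the default Python client identification that some sites reject.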

Execution result

$ ./get_url.py  https://ekzemplaro.org
en/      English
ekzemplaro/      言語とデータベースの接続プログラムサンプル集
audio_books/     オーディオブック
librivox/    LibriVox の勧め
./audio/     Audio
http://www.hi-ho.ne.jp/linux     オープンソース開発
./raspberry/     Raspberry Pi
./storytelling/      ストーリーテリング
./crowdsourcing/     クラウドソーシング
https://twitter.com/ekzemplaro   私のツイッター
https://github.com/ekzemplaro/   GitHub
qiita/   Qiita
./test_dir/      テストコーナー
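
Several of the hrefs in this output are relative (en/, ./audio/). If absolute URLs are wanted, one possible sketch, assuming the page URL is used as the base, is to resolve each href with urllib.parse.urljoin. The helper name to_absolute is hypothetical and is not part of get_url.py.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url, skipping missing (None or empty) hrefs."""
    return [urljoin(base_url, href) for href in hrefs if href]


base = "https://ekzemplaro.org/"
hrefs = ["en/", "./audio/", None, "https://github.com/ekzemplaro/"]
for absolute in to_absolute(base, hrefs):
    print(absolute)
```

Hrefs that are already absolute pass through unchanged, so the same call can be applied to every link the script prints.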

get_url.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
#   get_url.py
#
#                   Aug/18/2018
#
# ------------------------------------------------------------------
import sys

import requests
from bs4 import BeautifulSoup

# The URL to scan is the first command-line argument.
url = sys.argv[1]
# Send a browser-like User-Agent so the server does not answer 403 Forbidden.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"}

try:
    rr = requests.get(url, headers=headers)
    html = rr.content
    try:
        soup = BeautifulSoup(html, "html.parser")
        # Print the href and the link text of every <a> tag.
        for aa in soup.find_all("a"):
            link = aa.get("href")
            name = aa.get_text()
            print(link, "\t", name)
    except Exception as ee:
        sys.stderr.write("*** error *** in BeautifulSoup ***\n")
        sys.stderr.write(str(ee) + "\n")
except Exception as ee:
    sys.stderr.write("*** error *** in requests.get ***\n")
    sys.stderr.write(str(ee) + "\n")
# ------------------------------------------------------------------
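
The script prints None for any <a> tag that has no href attribute. As a small variation, assuming only tags that actually carry a link are of interest, the extraction step can be written as a function that takes an HTML string and can be tried without network access. The function name extract_links and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (href, text) pairs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags where the attribute is present.
    return [(a["href"], a.get_text(strip=True)) for a in soup.find_all("a", href=True)]


sample = (
    '<p><a href="en/">English</a> '
    '<a name="top">anchor only</a> '
    '<a href="./audio/"> Audio </a></p>'
)
for href, text in extract_links(sample):
    print(href, text, sep="\t")
```

Here get_text(strip=True) also trims whitespace around the link text, which differs slightly from the plain get_text() call in get_url.py.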

Installing requests and beautifulsoup4 on Arch Linux

sudo pacman -S python-requests
sudo pacman -S python-beautifulsoup4

I confirmed that the script works with the following version.

$ python --version
Python 3.9.5

Author and source

Original article: Beautifulsoup でWebページのリンクを抽出する, https://qiita.com/ekzemplaro/items/a0dd7dd2bbdcf077a626
