Extracting pages that contain a keyword with BeautifulSoup
sample.py
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin
import time

url_nikkei = 'https://www.nikkei.com/'
url_nikkei_business = 'https://www.nikkei.com/business/'
key_word = '環境'

# Fetch the business top page and collect the article headings whose
# class matches the pattern m-miM\d{2}_title (e.g. m-miM01_title).
response = requests.get(url_nikkei_business)
soup = BeautifulSoup(response.text, "html.parser")
regex = re.compile(r'm-miM\d{2}_title')
articles = soup.find_all('h3', {'class': regex})
time.sleep(3)

# Extract the (relative) link from each heading.
url_nikkei_articles = []
for article in articles:
    url_nikkei_article = article.find('a').get('href')
    url_nikkei_articles.append(url_nikkei_article)

# Visit each article and keep the ones whose body contains the keyword.
url_list = []
title_list = []
for url in url_nikkei_articles:
    url_nikkei_article = urljoin(url_nikkei, url)  # resolve to an absolute URL
    response = requests.get(url_nikkei_article)
    soup = BeautifulSoup(response.text, "html.parser")
    regex = re.compile(r'cmn-section')
    temp = soup.find_all('div', {'class': regex})
    title = temp[0].find('span').string
    regex = re.compile(r'cmn-article_text')
    contents = temp[0].find('div', {'class': regex}).find_all('p')
    for content in contents:
        s = re.search(key_word, str(content))
        if s:
            url_list.append(url_nikkei_article)
            title_list.append(title)
            break  # one hit is enough; move on to the next article
    time.sleep(3)  # be polite: pause between requests

# Print the matched titles and their URLs.
for i, title in enumerate(title_list):
    print(i + 1, title)
for i, url in enumerate(url_list):
    print(i + 1, url)
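The three techniques sample.py relies on can be tried without any network access: matching class attributes against a compiled regex in find_all(), resolving a relative link with urljoin(), and filtering paragraphs by keyword with re.search(). The HTML below is a minimal, hypothetical stand-in for the Nikkei markup, not the site's actual structure.

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the class names used in sample.py.
html = """
<h3 class="m-miM01_title"><a href="/article/1/">環境 news</a></h3>
<h3 class="m-miM12_title"><a href="/article/2/">Other news</a></h3>
<h3 class="plain_title"><a href="/article/3/">Ignored</a></h3>
<div class="cmn-article_text">
  <p>これは環境に関する記事です。</p>
  <p>Unrelated paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# m-miM01_title, m-miM12_title, ... all match m-miM\d{2}_title;
# plain_title does not, so its heading is skipped.
regex = re.compile(r"m-miM\d{2}_title")
headings = soup.find_all("h3", {"class": regex})

# Relative hrefs become absolute article URLs.
urls = [urljoin("https://www.nikkei.com/", h.find("a").get("href"))
        for h in headings]

# A paragraph "contains" the keyword if re.search() finds it anywhere
# in the tag's string form, just as in the main script.
paragraphs = soup.find("div", {"class": "cmn-article_text"}).find_all("p")
hits = [p.get_text() for p in paragraphs if re.search("環境", str(p))]

print(urls)  # → ['https://www.nikkei.com/article/1/', 'https://www.nikkei.com/article/2/']
print(hits)  # → ['これは環境に関する記事です。']
```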
Author and source
This article (Extracting pages that contain a keyword with BeautifulSoup) was originally published at https://qiita.com/takeshikondo/items/28a094d75c501cdfd926. Author attribution: the original author's information is available at the original URL. Copyright belongs to the original author.