Extracting pages that contain a keyword with BeautifulSoup
sample.py
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin
import time

url_nikkei = 'https://www.nikkei.com/'
url_nikkei_business = 'https://www.nikkei.com/business/'
key_word = '環境'

# Fetch the business top page and collect the article headings whose
# class matches the pattern m-miM\d{2}_title (e.g. m-miM01_title).
response = requests.get(url_nikkei_business)
soup = BeautifulSoup(response.text, "html.parser")
regex = re.compile(r'm-miM\d{2}_title')
articles = soup.find_all('h3', {'class': regex})
time.sleep(3)

# Extract the (relative) link from each heading.
url_nikkei_articles = []
for article in articles:
    url_nikkei_article = article.find('a').get('href')
    url_nikkei_articles.append(url_nikkei_article)

# Visit each article and keep the ones whose body contains the keyword.
url_list = []
title_list = []
for url in url_nikkei_articles:
    url_nikkei_article = urljoin(url_nikkei, url)  # resolve to an absolute URL
    response = requests.get(url_nikkei_article)
    soup = BeautifulSoup(response.text, "html.parser")
    regex = re.compile(r'cmn-section')
    temp = soup.find_all('div', {'class': regex})
    title = temp[0].find('span').string
    regex = re.compile(r'cmn-article_text')
    contents = temp[0].find('div', {'class': regex}).find_all('p')
    for content in contents:
        s = re.search(key_word, str(content))
        if s:
            url_list.append(url_nikkei_article)
            title_list.append(title)
            break  # one hit is enough; move on to the next article
    time.sleep(3)  # be polite: pause between requests

# Print the matched titles and their URLs.
for i, title in enumerate(title_list):
    print(i + 1, title)
for i, url in enumerate(url_list):
    print(i + 1, url)
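The three techniques sample.py relies on can be tried without any network access: matching class attributes against a compiled regex in find_all(), resolving a relative link with urljoin(), and filtering paragraphs by keyword with re.search(). The HTML below is a minimal, hypothetical stand-in for the Nikkei markup, not the site's actual structure.

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the class names used in sample.py.
html = """
<h3 class="m-miM01_title"><a href="/article/1/">環境 news</a></h3>
<h3 class="m-miM12_title"><a href="/article/2/">Other news</a></h3>
<h3 class="plain_title"><a href="/article/3/">Ignored</a></h3>
<div class="cmn-article_text">
  <p>これは環境に関する記事です。</p>
  <p>Unrelated paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# m-miM01_title, m-miM12_title, ... all match m-miM\d{2}_title;
# plain_title does not, so its heading is skipped.
regex = re.compile(r"m-miM\d{2}_title")
headings = soup.find_all("h3", {"class": regex})

# Relative hrefs become absolute article URLs.
urls = [urljoin("https://www.nikkei.com/", h.find("a").get("href"))
        for h in headings]

# A paragraph "contains" the keyword if re.search() finds it anywhere
# in the tag's string form, just as in the main script.
paragraphs = soup.find("div", {"class": "cmn-article_text"}).find_all("p")
hits = [p.get_text() for p in paragraphs if re.search("環境", str(p))]

print(urls)  # → ['https://www.nikkei.com/article/1/', 'https://www.nikkei.com/article/2/']
print(hits)  # → ['これは環境に関する記事です。']
```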
Author and source
This article (Extracting pages that contain a keyword with BeautifulSoup) was originally published at https://qiita.com/takeshikondo/items/28a094d75c501cdfd926. Author attribution: the original author's information is available at the original URL. Copyright belongs to the original author.