Collecting Tech Article Titles and URLs from Multiple Sites into a Single Excel File



This is my first post.

As the title says, I wrote code that gathers the latest article titles and their URLs from well-known tech sites into a single Excel file.

Result: [screenshot of the generated Excel file]

It is nice to be able to scan all the titles at a glance in a single file.

The code

1. Create instances of the WEB_scraping class, each with four instance variables (these are the sites to crawl)

web_scraping.py

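import requests
from bs4 import BeautifulSoup


# Holds the site name, the URL, the tag to pick up, and that tag's class
class WEB_scraping:
    def __init__(self, name, url, tag, detail):
        self.name = name
        self.url = url
        self.tag = tag
        self.detail = detail

    # Fetch the page at self.url and return its parsed HTML
    def response(self):
        # A timeout keeps one unresponsive site from hanging the whole run
        html = requests.get(self.url, timeout=10).text
        return BeautifulSoup(html, 'html.parser')

    # Return every self.tag element in html whose class matches self.detail
    def find(self, html):
        return html.find_all(self.tag, class_=self.detail)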

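response is an instance method that downloads the page at self.url and returns its parsed HTML. find is an instance method that searches the given html for the configured tag and class and returns the matches.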
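
To make concrete what find returns, here is a small sketch that runs the same find_all call against an inline HTML string instead of a live page. The markup and the class name entry-title are made up for illustration; real sites need their own tag and class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
SAMPLE_HTML = """
<h3 class="entry-title"><a href="/articles/1">First article</a></h3>
<h3 class="entry-title"><a href="/articles/2">Second article</a></h3>
<h3 class="sidebar"><a href="/about">About</a></h3>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same call shape as WEB_scraping.find: tag name plus CSS class
headings = soup.find_all("h3", class_="entry-title")

# Each result is a Tag; .a is the first <a> element inside it
pairs = [(h.a.string, h.a.get("href")) for h in headings]
print(pairs)
```

Only the two entry-title headings are returned; the sidebar heading is filtered out by the class argument.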

The instances are defined in the main script.

rss_to_excel.py

import pandas as pd
from web_scraping import WEB_scraping

# Arguments: name, URL, tag that holds the title, class of that tag
item1 = WEB_scraping("gizmood", "https://www.gizmodo.jp/articles/", "h3", "p-archive-cardTitle")
item2 = WEB_scraping("gigazine", "http://gigazine.net/", "h2", None)
item3 = WEB_scraping("TechCrunch", "https://jp.techcrunch.com/", "h2", "post-title")
item4 = WEB_scraping("zdnet", "https://japan.zdnet.com/archives/", "h3", None)
item5 = WEB_scraping("hatenaIT", "http://b.hatena.ne.jp/hotentry/it", "h3", "entrylist-contents-title")

items = [item1, item2, item3, item4, item5]

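I looked up the tag and detail values with Chrome's developer tools. The article titles on gigazine and ZDNet had no class specified, so detail is left empty (None) for them.

2. Create an Excel file with title and URL columns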

rss_to_excel.py

columns = ["Title", "Url"]
excel_writer = pd.ExcelWriter('result.xlsx')

3. Loop over the items

rss_to_excel.py

for item in items:
    df = pd.DataFrame(columns=columns)
    soup = item.response()
    contents = item.find(soup)

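For each site, this creates an empty DataFrame with title and URL columns and fetches the HTML tags to extract, by calling the instance methods of WEB_scraping.

4. Put the article title and URL of each content into df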

rss_to_excel.py

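    for content in contents:
        # Skip headings that do not wrap a link
        if content.a is None:
            continue
        title = content.a.string
        # Only the gizmood and zdnet entries return hrefs without the https://... part, so prepend it
        if item.name == "gizmood":
            link = "https://www.gizmodo.jp" + content.a.get("href")
        elif item.name == "zdnet":
            link = "https://japan.zdnet.com" + content.a.get("href")
        else:
            link = content.a.get("href")

        # Add one row at the end of df (DataFrame.append was removed in pandas 2.0)
        df.loc[len(df)] = [title, link]
        print(title, link)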

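This pulls the article title and URL out of each element in contents and adds them to df as a new row.
For Gizmodo and ZDNet, the string returned from href does not include the https:// scheme and host, so they are prepended by hand. It is brute force...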
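
As an alternative to prepending the host by hand, the standard library's urllib.parse.urljoin resolves a relative href against the page URL and leaves absolute hrefs untouched. The following is a minimal sketch with made-up hrefs; absolute_link is a hypothetical helper name, and it assumes relative links should be resolved against the listing page URL, the way a browser would.

```python
from urllib.parse import urljoin


def absolute_link(page_url, href):
    """Resolve href against page_url; absolute hrefs come back unchanged."""
    return urljoin(page_url, href)


# Root-relative href (hypothetical path) resolved against a listing page
relative = absolute_link("https://japan.zdnet.com/archives/", "/article/123/")
# An href that is already absolute is returned as-is
absolute = absolute_link("http://gigazine.net/", "https://example.com/news/1")

print(relative)
print(absolute)
```

With a helper like this, the loop body could compute link from item.url and the href for every site, without checking item.name.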

5. Save to Excel

rss_to_excel.py

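    # Still inside the "for item in items" loop: one sheet per site, named after the site
    df.to_excel(excel_writer, sheet_name=item.name)
# After the loop: write the workbook to disk (ExcelWriter.save() was removed in pandas 2.0)
excel_writer.close()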
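
For reference, here is a hedged sketch of the same save step using ExcelWriter as a context manager, which writes the file when the with block exits. The rows are placeholder data rather than scraped results, write_workbook is a hypothetical helper name, and an Excel engine such as openpyxl is assumed to be installed.

```python
import os
import tempfile
import zipfile

import pandas as pd


def write_workbook(path, rows_by_site, columns=("Title", "Url")):
    """Write one sheet per site; rows_by_site maps sheet name to a list of [title, url]."""
    with pd.ExcelWriter(path) as writer:
        for name, rows in rows_by_site.items():
            # Build each DataFrame in one go from the collected rows
            df = pd.DataFrame(rows, columns=list(columns))
            df.to_excel(writer, sheet_name=name, index=False)


# Placeholder data standing in for scraped titles and links
sample = {
    "site_a": [["Article one", "https://example.com/1"]],
    "site_b": [["Article two", "https://example.com/2"],
               ["Article three", "https://example.com/3"]],
}

# Write into a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result_sample.xlsx")
    write_workbook(path, sample)
    # An .xlsx file is a zip archive; xl/workbook.xml lists the sheet names
    with zipfile.ZipFile(path) as zf:
        workbook_xml = zf.read("xl/workbook.xml").decode("utf-8")

print("site_a" in workbook_xml, "site_b" in workbook_xml)
```

Collecting rows in a list and building the DataFrame once also avoids growing the DataFrame row by row.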

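That was my first post. Because each site is an instance of one class, cases like "I want to add this site" or "I no longer need this site" are very easy to handle.

Actually, I wanted to handle step 4 elegantly with an instance method as well, but it ended up messier, so I dropped the idea. I want to write clean code...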
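
One possible way to fold step 4 into the class is sketched below. This is not the author's code: SiteScraper and its rows method are hypothetical names, the HTML is an inline sample, and relative links are resolved with urllib.parse.urljoin instead of per-site string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


class SiteScraper:
    """Hypothetical variant of WEB_scraping that also extracts [title, link] rows."""

    def __init__(self, name, url, tag, detail):
        self.name = name
        self.url = url
        self.tag = tag
        self.detail = detail

    def rows(self, soup):
        """Return [title, absolute_link] for every matching heading that wraps a link."""
        result = []
        for content in soup.find_all(self.tag, class_=self.detail):
            anchor = content.a
            if anchor is None or anchor.get("href") is None:
                continue  # heading without a usable link
            title = anchor.get_text(strip=True)
            result.append([title, urljoin(self.url, anchor.get("href"))])
        return result


# Inline sample markup; the site name, URL, and class are made up
html = """
<h3 class="card-title"><a href="/articles/1">Relative link</a></h3>
<h3 class="card-title"><a href="https://example.com/2">Absolute link</a></h3>
<h3 class="card-title">No link here</h3>
"""
site = SiteScraper("example", "https://example.com/list/", "h3", "card-title")
extracted = site.rows(BeautifulSoup(html, "html.parser"))
print(extracted)
```

With a method like this, the main loop would only need to turn the returned rows into a DataFrame and write the sheet.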

Author And Source

The original article on this topic is available at https://qiita.com/kaka__non/items/59568bd985470bf0d6ab
