Scraping Cryptocurrency Calendar, Part 1


This post scrapes Cryptocurrency Calendar (https://coinmarketcal.com/), a cryptocurrency information site.

Source code

import requests
import lxml.html

# Fetch the top page and parse the HTML into an element tree.
r = requests.get("https://coinmarketcal.com/")
html = r.text
root = lxml.html.fromstring(html)

# Each absolute XPath below points at the first event card (article[1]).
# xpath() returns a list of matches, so [0] takes the first one, and
# strip() removes leading and trailing whitespace from the text.

# Date of the event
time_of_event = root.xpath(
    "/html/body/main/div[3]/section[1]/div[2]/div[3]/article[1]/div/h5[1]/strong")
print(time_of_event[0].text.strip())
# Title of the event
title_of_event = root.xpath(
    "/html/body/main/div[3]/section[1]/div[2]/div[3]/article[1]/div/h5[2]/strong")
print(title_of_event[0].text.strip())
# Type of the event
sort_of_event = root.xpath(
    "/html/body/main/div[3]/section[1]/div[2]/div[3]/article[1]/div/h5[3]")
print(sort_of_event[0].text.strip())
# Description of the event
content_of_event = root.xpath(
    "/html/body/main/div[3]/section[1]/div[2]/div[3]/article[1]/div/div[1]/p[2]")
print(content_of_event[0].text.strip())

Output

15 March 2018
Burst (BURST)
Hard Fork
Dynamic block size and transaction fees, PoC2 protocol, partial Dymaxion code... The fork is planned to happen around block 470 000.

Explanation

I used XPath to extract elements from the HTML. Google Chrome's developer tools can copy the XPath of a selected element.

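One thing to watch out for with XPath is that positions start at 1, not at 0 as array indices do. You can copy an XPath from Google Chrome, but it is not a complete XPath and cannot be used as is, so you have to fill in the missing parts yourself.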
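
The difference between XPath's 1-based positions and Python's 0-based list indices can be checked on a small fragment. The following is only a sketch: the markup in SAMPLE is a made-up stand-in for the real page, and it uses the standard library's xml.etree.ElementTree (which supports a subset of XPath) instead of lxml so that it runs without extra packages. The find() and findall() calls are also available on lxml elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment (not the real page markup).
SAMPLE = """
<div>
  <h5><strong>15 March 2018</strong></h5>
  <h5><strong>Burst (BURST)</strong></h5>
  <h5>Hard Fork</h5>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# XPath positions start at 1: h5[1] is the first <h5> child.
first_by_xpath = root.find("./h5[1]/strong").text

# Python lists start at 0: index 0 is that same first <h5>.
headings = root.findall("./h5")
first_by_index = headings[0].find("strong").text

# The third <h5> is h5[3] in XPath but headings[2] in Python.
third_by_xpath = root.find("./h5[3]").text
third_by_index = headings[2].text

print(first_by_xpath)  # 15 March 2018
print(first_by_index)  # 15 March 2018
print(third_by_xpath)  # Hard Fork
```

Both lookups reach the same element. Only the numbering convention differs, which is easy to get wrong when moving between an XPath string and the list that a query returns.
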
The site lists dozens of events, but this code can extract only one of them. In the next part, I will refine the XPath so that every event on a page is extracted.
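
One possible way to go from a single event to all of them is to select every article element first and then apply a short relative path inside each one. The sketch below illustrates that idea and is not necessarily the approach taken in the next part. The markup, the second event, and the extract_events helper are all made up for illustration, and xml.etree.ElementTree stands in for lxml so the example runs offline.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two <article> blocks shaped like the paths in the
# source code. The second event is invented sample data.
SAMPLE = """
<section>
  <article><div>
    <h5><strong> 15 March 2018 </strong></h5>
    <h5><strong> Burst (BURST) </strong></h5>
    <h5> Hard Fork </h5>
  </div></article>
  <article><div>
    <h5><strong> 16 March 2018 </strong></h5>
    <h5><strong> Example Coin (EXC) </strong></h5>
    <h5> Release </h5>
  </div></article>
</section>
""".strip()


def extract_events(root):
    """Return one dict per <article>, using paths relative to each article."""
    events = []
    for article in root.findall(".//article"):
        events.append({
            "date": article.find("./div/h5[1]/strong").text.strip(),
            "title": article.find("./div/h5[2]/strong").text.strip(),
            "kind": article.find("./div/h5[3]").text.strip(),
        })
    return events


events = extract_events(ET.fromstring(SAMPLE))
for event in events:
    print(event["date"], event["title"], event["kind"], sep=" | ")
```

With lxml, the same pattern would call xpath() on the root to collect the article elements and then call xpath() or find() on each one with a relative path. The exact selectors depend on the real page structure, which this sample does not reproduce.
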

Reference book

Pythonによるクローラー&スクレイピング入門設計・開発から収集データの解析・運用まで

Author And Source

The original article (Scraping Cryptocurrency Calendar, Part 1) is available at https://qiita.com/Jhon_Connor/items/cf248d3bc87a60e112e2

Attribution: information about the original author is available at the original URL. Copyright belongs to the original author.