[Study Notes] Scrapy Shell for Collecting Data from Websites

Last time I studied the basics of Scrapy.
This time I'll study Scrapy Shell.

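Notes

This post is simply a record of my studies. I have no experience with web scraping, so there is a good chance I'll get some things wrong. If I do, I'd appreciate it if you let me know in the comments. The content of this post is based on the official Scrapy website.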

Outline

What Scrapy Shell is
Extracting data with Scrapy Shell

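1. What Is Scrapy Shell?

Scrapy Shell is simply a Python shell that comes with Scrapy's handy tools and objects preloaded. That really is all there is to it. If IPython is installed, the IPython shell is used instead of the default Python shell. IPython is more convenient, so that's what I use.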

What makes Scrapy Shell handy is that you can test scraping a web page without actually running a spider. Let's try it out.

2. Extracting Data with Scrapy Shell

To start Scrapy Shell, enter the following in a terminal.

$ scrapy shell <url>

This time the target is the Quotes site (quotes.toscrape.com), so the command looks like this.

$ scrapy shell 'http://quotes.toscrape.com/page/1/'

If it succeeds, you'll see output like the following.

[ ... Scrapy log here ... ]
2018-07-07 11:41:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f94eb735630>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7f94ea4bd748>
[s]   spider     <QuotesSpider 'quotes' at 0x7f94ea04c780>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:

As you can see, the output lists the handy Scrapy objects and shortcuts that are available.
Let's use these objects and shortcuts to extract the title data.

In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]

As you can see, extracting the title is easy. I use XPath for extraction, but if you find CSS easier, CSS selectors work just as well.

The official Scrapy Shell documentation has a more detailed explanation, so please refer to it.

Author And Source

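The original post ([Study Notes] Scrapy Shell for Collecting Data from Websites) can be found at https://qiita.com/suckgeun/items/d237a545af7fa118d6ad

Attribution: information about the original author is available at the original URL. Copyright belongs to the original author.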
