Python 3 Web Crawler Learning Series 08: Scrapy (Part 2)



Contents

1. Following links

2. A shortcut for creating requests

3. More examples

4. Using spider arguments

5. References

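Previous posts:
Python 3 Web Crawler Learning Series 02: Common methods for downloading and extracting web pages
Python 3 Web Crawler Learning Series 03: Download caching
Python 3 Web Crawler Learning Series 04: Concurrent downloading
Python 3 Web Crawler Learning Series 05: Fetching dynamic content
Python 3 Web Crawler Learning Series 06: Form interaction
Python 3 Web Crawler Learning Series 07: Handling CAPTCHAs
Python 3 Web Crawler Learning Series 08: Scrapy (Part 1)
In the previous post we covered basic Scrapy usage (installing Scrapy, crawling web pages, and extracting data). This post continues working through the official documentation.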

1. Following links

How do we extract links from HTML? The site being crawled is the same as in the previous post: http://quotes.toscrape.com.
The first step is to extract the link we need from the page.
Inspecting the page shows that the link to the next page sits in the following markup:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>

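Let's try extracting this in the Scrapy shell.

# A CSS selector returns the whole <a> element, but what we need is its href attribute
>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
# Use ::attr(href) to extract the attribute value
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
# The .attrib property works as well
>>> response.css('li.next a').attrib['href']
'/page/2/'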

Now let's modify the spider so that it recursively follows the link to the next page and extracts data from each one.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Extract the link to the next page
        next_page = response.css('li.next a::attr(href)').get()
        # If there is a next page, build its absolute URL and request it
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

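After extracting the data, parse() looks for the link to the next page, uses urljoin() to build a full absolute URL (since the link may be relative), and yields a new request for that page, registering itself as the callback. Following links recursively in this way ensures that every page is crawled.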
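To make the urljoin() step concrete, here is a minimal sketch using the standard library's urllib.parse.urljoin. As far as I know, response.urljoin() resolves URLs by the same rules, using the response's own URL as the base (HTML responses may also take a <base> tag into account). The variable names are only for illustration.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the start URL of the spider above).
page_url = 'http://quotes.toscrape.com/page/1/'

# A relative href, as extracted from the "next" link.
next_href = '/page/2/'
absolute_url = urljoin(page_url, next_href)
print(absolute_url)

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, 'http://quotes.toscrape.com/page/3/')
print(already_absolute)
```

This is why the spider can call urljoin() unconditionally: it does not need to check whether the extracted link is relative or absolute.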
This is Scrapy's mechanism for following links: when a callback yields a new request, Scrapy schedules it and, once it completes, calls the callback registered for that request.
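The mechanism described above can be pictured with a toy model that uses no Scrapy at all. This is only a sketch under simplifying assumptions, not how Scrapy is implemented: FAKE_SITE, the example.test URLs, and the crawl() function are all hypothetical, a request is modelled as a plain (url, callback) tuple, and everything runs synchronously.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical stand-in for a site: each URL maps to (quotes, next href).
FAKE_SITE = {
    'http://example.test/page/1/': (['quote A', 'quote B'], '/page/2/'),
    'http://example.test/page/2/': (['quote C'], '/page/3/'),
    'http://example.test/page/3/': (['quote D'], None),
}


def parse(url):
    """Callback: yield extracted items, then at most one follow-up request."""
    quotes, next_href = FAKE_SITE[url]
    for text in quotes:
        yield {'text': text}
    if next_href is not None:
        # A request is modelled as a (url, callback) tuple.
        yield (urljoin(url, next_href), parse)


def crawl(start_url, callback):
    """Toy engine: run callbacks and queue every request they yield."""
    items = []
    pending = deque([(start_url, callback)])
    while pending:
        url, cb = pending.popleft()
        for result in cb(url):
            if isinstance(result, tuple):
                pending.append(result)  # new request: schedule it
            else:
                items.append(result)  # scraped item: collect it
    return items


items = crawl('http://example.test/page/1/', parse)
print([item['text'] for item in items])
```

The callback never loops over pages itself. It only yields the next request, and the engine keeps calling it until a page has no next link.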

2. A shortcut for creating requests

A shortcut for creating requests is response.follow().

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

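Unlike scrapy.Request, response.follow() supports relative URLs directly, so there is no need to call urljoin(). Note that response.follow() only returns a Request instance; you still have to yield it.
You can also pass a selector to response.follow() instead of a string; the selector should extract the required attribute.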

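# response.css() returns a list-like object of selectors
for href in response.css('li.next a::attr(href)'):
    # href is a selector, not a string
    yield response.follow(href, callback=self.parse)

For <a> elements there is a further shortcut: response.follow() uses their href attribute automatically. So the code can be shortened to: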

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

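Note that response.follow(response.css('li.next a')) is not valid, because response.css() returns a list-like object containing selectors for all matches, not a single selector.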

3. More examples

The following code starts from http://quotes.toscrape.com/, follows the author links and the next-page links on each page, and handles the responses with the callbacks parse_author() and parse() respectively.

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        # each author page is handled by the parse_author() callback
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        # each next page is handled by the parse() callback
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

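Note that, by default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hitting the server too often because of a programming mistake. This behavior is configured with the DUPEFILTER_CLASS setting.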
See also the official example project, quotesbot.
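The duplicate filtering mentioned above matters for the author spider because several quotes on one page can link to the same author. The sketch below shows the general idea with a plain set of URLs. It is a deliberately simplified illustration: SeenUrlFilter and is_duplicate() are hypothetical names, and Scrapy's default filter compares request fingerprints rather than raw URL strings.

```python
class SeenUrlFilter:
    """Toy duplicate filter: remembers every URL it has been shown."""

    def __init__(self):
        self.seen = set()

    def is_duplicate(self, url):
        """Return True if url was seen before; otherwise record it."""
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


# Illustrative input: two quotes on a page link to the same author.
author_links = [
    'http://quotes.toscrape.com/author/Albert-Einstein/',
    'http://quotes.toscrape.com/author/J-K-Rowling/',
    'http://quotes.toscrape.com/author/Albert-Einstein/',
]

dupefilter = SeenUrlFilter()
scheduled = [url for url in author_links if not dupefilter.is_duplicate(url)]
print(scheduled)
```

Because of this filtering, the spider can yield a request for every author link it finds without tracking which authors it has already visited.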

4. Using spider arguments

Arguments passed on the command line with the -a option are, by default, forwarded to the spider's __init__ method and become attributes of the spider.

scrapy crawl quotes -o quotes-humor.json -a tag=humor

In the code, self.tag holds the value of tag given on the command line.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
            # with tag=humor, url becomes http://quotes.toscrape.com/tag/humor
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
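To see how a -a argument ends up as an attribute, and why getattr() with a default is used, here is a small sketch that runs without Scrapy. MiniSpider and start_url() are hypothetical stand-ins, not the real scrapy.Spider; the sketch assumes only that keyword arguments are copied onto the instance, as described above.

```python
class MiniSpider:
    """Simplified stand-in for scrapy.Spider (not the real class)."""

    def __init__(self, **kwargs):
        # Each -a name=value pair arrives as a keyword argument
        # and is stored as an instance attribute.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = 'http://quotes.toscrape.com/'
        # getattr with a default avoids AttributeError when no tag is given.
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        return url


# Roughly equivalent to: scrapy crawl quotes -a tag=humor
with_tag = MiniSpider(tag='humor').start_url()
# Roughly equivalent to: scrapy crawl quotes
without_tag = MiniSpider().start_url()
print(with_tag)
print(without_tag)
```

Values passed with -a arrive as strings, so an argument meant as a number or a list has to be converted inside the spider.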

5. References

[1] Scrapy official documentation
[2] Installing and basic usage of the Scrapy crawler framework (Jianshu)
