Web Crawlers (2): Pipeline


Item Pipeline
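After a Spider has scraped an Item, the Item is sent to the Item Pipeline, where several components process it one after another in a defined order.
Each item pipeline component is a Python class that implements a simple method. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and not processed any further.
Typical uses of item pipelines are: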

Cleaning HTML data

Validating scraped data (checking that items contain certain fields)

Checking for duplicates (and dropping them)

Storing the scraped item in a database
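
As a rough illustration of the first use case, the sketch below strips markup from one text field using only the standard library. The class name CleanHtmlPipeline and the field name description are assumptions made for this example, and regex-based tag stripping is a deliberately crude stand-in for a real HTML parser.

```python
import html
import re

# Crude tag matcher; good enough for a sketch, not for arbitrary HTML.
TAG_RE = re.compile(r"<[^>]+>")


class CleanHtmlPipeline:
    """Hypothetical pipeline that strips markup from a text field."""

    def process_item(self, item, spider):
        raw = item.get("description", "")
        text = TAG_RE.sub(" ", raw)                    # replace tags with spaces
        text = html.unescape(text)                     # decode entities such as &amp;
        item["description"] = " ".join(text.split())  # collapse whitespace
        return item


item = CleanHtmlPipeline().process_item(
    {"description": "<p>Fish &amp; <b>chips</b></p>"}, spider=None
)
```

The item is returned at the end so that any later pipeline component still receives it.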

Writing your own item pipeline
Each item pipeline component is a standalone Python class that must implement the following method:

process_item(self,item,spider)

This method is called for every item pipeline component. It must either return a dict or Item object containing the data, or raise a DropItem exception. Dropped items are not processed by any later pipeline component.

   
Parameters:
item (Item object or dict) - the scraped item
spider (Spider object) - the spider that scraped the item
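
A minimal sketch of this contract, under stated assumptions: the hypothetical RequireTitlePipeline returns the item when it has a non-empty title field and raises DropItem otherwise. The class and field names are invented for illustration, and a local stand-in for DropItem is used when Scrapy is not installed so that the sketch runs on its own.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class RequireTitlePipeline:
    """Hypothetical pipeline: keep items with a title, drop the rest."""

    def process_item(self, item, spider):
        if not item.get("title"):
            raise DropItem("Missing title in %r" % (item,))
        # Normalise whitespace, then hand the item to the next pipeline.
        item["title"] = " ".join(item["title"].split())
        return item


pipeline = RequireTitlePipeline()

# An item with a title is returned (possibly modified).
kept = pipeline.process_item({"title": "  Hello   world "}, spider=None)

# An item without a title is rejected by raising DropItem.
try:
    pipeline.process_item({"url": "http://example.com"}, spider=None)
    dropped = False
except DropItem:
    dropped = True
```

Here spider=None only stands in for the Spider object that Scrapy would pass.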

Additionally, a pipeline may implement the following methods:

open_spider(self,spider)

This method is called when the spider is opened.
Parameters: spider (Spider object) - the spider that was opened

close_spider(self,spider)

This method is called when the spider is closed.
Parameters: spider (Spider object) - the spider that was closed
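
The two hooks are a natural place to acquire and release a resource. The sketch below is one possible arrangement, not a definitive one: a hypothetical JsonLinesPipeline opens a file in open_spider and closes it in close_spider, and the calls Scrapy would make are simulated by hand with spider=None.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that owns a file for the lifetime of a spider."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Acquire the resource once, when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Release it once, when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


# Simulate the order of calls: open, process each item, close.
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
for item in [{"id": 1}, {"id": 2}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```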

from_crawler(cls,crawler)

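If present, this class method is called to create a pipeline instance from a Crawler, and it must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is the way for a pipeline to reach them and hook its functionality into Scrapy.
Parameters: crawler (Crawler object) - the crawler that uses this pipeline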
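
To make the factory pattern concrete, here is a small sketch that runs without Scrapy. The setting names EXPORT_DIR and EXPORT_BATCH_SIZE and the class ExportPipeline are hypothetical, and the crawler is replaced by a stand-in object whose settings attribute is a plain dict. In a real project crawler.settings is a Scrapy Settings object, which offers a comparable get(name, default) method.

```python
from types import SimpleNamespace


class ExportPipeline:
    """Hypothetical pipeline configured from settings."""

    def __init__(self, export_dir, batch_size):
        self.export_dir = export_dir
        self.batch_size = batch_size

    @classmethod
    def from_crawler(cls, crawler):
        # Read configuration from the crawler's settings, with defaults.
        return cls(
            export_dir=crawler.settings.get("EXPORT_DIR", "out"),
            batch_size=crawler.settings.get("EXPORT_BATCH_SIZE", 100),
        )


# Stand-in for a Scrapy crawler: only the `settings` attribute is modelled,
# and a plain dict plays the role of the Settings object.
fake_crawler = SimpleNamespace(settings={"EXPORT_DIR": "/tmp/export"})
pipeline = ExportPipeline.from_crawler(fake_crawler)
```

Keeping __init__ free of any crawler dependency means the pipeline can also be constructed directly in tests.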
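Item pipeline examples
Price validation and dropping items with no price
This pipeline adjusts the price attribute of items that do not include VAT (price_excludes_vat attribute) and drops items that have no price.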

from scrapy.exceptions import DropItem

class PricePipeline(object):
    # Multiplier applied to prices that exclude VAT
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            # A missing or empty price means the item is dropped
            raise DropItem("Missing price in %s" % item)

Writing items to a JSON file
The following pipeline stores all scraped items (from all spiders) in a single items.jl file, with one item per line serialized in JSON format.

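import json

class JsonWriterPipeline(object):
    def __init__(self):
        # Text mode, because json.dumps returns str
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # Flush and release the file when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item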

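Writing items to MongoDB
This example uses pymongo to write items to MongoDB. The MongoDB address and database name are specified in the Scrapy settings, and the collection name is fixed by the collection_name class attribute. The main point of this example is to show how to use the from_crawler() method and how to clean up resources properly.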

import pymongo

class MongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Build the pipeline from values in the Scrapy settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Release the connection when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces Collection.insert, which newer pymongo releases removed
        self.db[self.collection_name].insert_one(dict(item))
        return item

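Duplicates filter
A filter that looks for duplicate items and drops those that have already been processed. Assume that our items have a unique id, but the spider returns multiple items with the same id.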

from scrapy.exceptions import DropItem

class DuplicatePipeline(object):
    def __init__(self):
        # ids of every item seen so far in this run
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

Activating an Item Pipeline component
To activate an Item Pipeline component, add its class to the ITEM_PIPELINES setting.

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

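The integer value assigned to each class determines the execution order: items pass through the pipelines from the lowest value to the highest. By convention these numbers are defined in the 0-1000 range.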
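
The ordering rule can be checked with a few lines of plain Python. This sketch only illustrates the rule and does not show how Scrapy loads the classes internally; the third entry and its priority of 500 are assumptions added for the example.

```python
# Hypothetical setting value; the dotted paths are placeholders.
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 800,
    "myproject.pipelines.PricePipeline": 300,
    "myproject.pipelines.DuplicatePipeline": 500,
}

# Lower numbers run first, regardless of the order of the dict entries.
execution_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
```

With these values an item is priced first, then checked for duplicates, and written to the JSON file last, so dropped items never reach the writer.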
