Testing Scrapy by crawling Huya live-stream data (two persistence methods)

Step 1: create the project
scrapy startproject huyaPro

Step 2: enter the project directory and generate a spider
cd huyaPro

scrapy genspider huya www.xxx.com
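
genspider writes a skeleton spider to huyaPro/spiders/huya.py. Before step 4 you need to point start_urls at the page you actually want to crawl; the Huya listing URL below is only an assumed example (any page whose DOM matches the XPaths in step 4 will work):

# huyaPro/spiders/huya.py -- genspider skeleton, lightly edited
import scrapy

class HuyaSpider(scrapy.Spider):
    name = 'huya'
    # allowed_domains can be deleted or widened while testing
    start_urls = ['https://www.huya.com/g/lol']  # assumed target page; replace with yours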

Step 3: adjust the relevant options in settings.py
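At minimum this usually means a browser User-Agent, turning off robots.txt compliance, and quieting the log. A minimal sketch (the UA string is just an example):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # example UA string
ROBOTSTXT_OBEY = False   # the default True blocks most real-world crawls
LOG_LEVEL = 'ERROR'      # print errors only, keeps terminal output readable
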
Step 4: parse the data
4.1: persistence via terminal command (feed export)
    def parse(self, response):
        # every live room on the listing page is an <li> under #js-live-list
        li_list = response.xpath('//*[@id="js-live-list"]/li')
        all_data = []
        for li in li_list:
            # extract_first() returns the first match as a string (or None)
            title = li.xpath('./a[2]/text()').extract_first()
            man = li.xpath('./span/span[1]/i/text()').extract_first()
            hot = li.xpath('./span/span[2]/i[2]/text()').extract_first()
            dic = {
                'title': title,
                'man': man,
                'hot': hot,
            }
            all_data.append(dic)
        # returning a list of dicts lets the feed export serialize it
        return all_data
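
Because parse returns a list of dicts, persistence happens entirely through Scrapy's feed export: pass -o with an output file from the terminal (the file name is arbitrary; .json, .csv and .xml all work):

scrapy crawl huya -o huya.csv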

4.2: pipeline-based persistence
    # requires: from huyaPro.items import HuyaproItem at the top of huya.py
    def parse(self, response):
        li_list = response.xpath('//*[@id="js-live-list"]/li')
        for li in li_list:
            title = li.xpath('./a[2]/text()').extract_first()
            author = li.xpath('./span/span[1]/i/text()').extract_first()
            hot = li.xpath('./span/span[2]/i[2]/text()').extract_first()

            # wrap each record in an item so the pipelines can receive it
            item = HuyaproItem()
            item['title'] = title
            item['author'] = author
            item['hot'] = hot

            yield item  # hand the item over to every enabled pipeline

Step 5: declare the item fields in items.py
class HuyaproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
    hot = scrapy.Field()
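
Note that the field names declared here must exactly match the keys assigned in parse (title, author, hot); assigning to an undeclared field raises a KeyError at crawl time.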

Step 6: implement each pipeline in pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

import pymysql
from redis import Redis


class HuyaproPipeline(object):
    # writes every item to a local text file
    fp = None

    def open_spider(self, spider):
        # called once, when the spider starts
        print('i am open_spider()')
        self.fp = open('huyazhibo.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):  # item: the item yielded by the spider
        self.fp.write(item['title'] + ':' + item['author'] + ':' + item['hot'] + '\n')
        print(item['title'], 'saved!!!')
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        # called once, when the spider finishes
        self.fp.close()
        print('i am close_spider()')


class mysqlPipeLine(object):
    # inserts every item into the Spider.huya table
    conn = None
    cursor = None

    def open_spider(self, spider):
        # note: pymysql expects the charset as 'utf8', not 'utf-8'
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                                    password='123', db='Spider', charset='utf8')
        print(self.conn)

    def process_item(self, item, spider):
        # string formatting kept simple here; parameterized queries are safer
        sql = 'insert into huya values("%s","%s","%s")' % (
            item['title'], item['author'], item['hot'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()  # undo the failed insert
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


class RedisPipeLine(object):
    # pushes every item onto a Redis list
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        # recent redis-py versions reject raw Item objects, so serialize first
        self.conn.lpush('huyaList', json.dumps(dict(item)))
        return item
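
After a crawl, you can check the Redis pipeline's output from redis-cli (huyaList is the key used in lpush above):

127.0.0.1:6379> lrange huyaList 0 -1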

Step 7: register each pipeline and its priority in settings.py (a lower number runs earlier)
ITEM_PIPELINES = {
    'huyaPro.pipelines.HuyaproPipeline': 300,
    'huyaPro.pipelines.mysqlPipeLine': 301,
    'huyaPro.pipelines.RedisPipeLine': 302,
}
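
With all three pipelines registered, a single crawl run writes to the text file, MySQL, and Redis in one pass:

scrapy crawl huya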