Scrapy project (Dongguan Sunshine Network): using CrawlSpider to scrape post content, excluding images


1. Create the Scrapy project

scrapy startproject dongguan

2. Enter the project directory and create the Spider with the genspider command

scrapy genspider -t crawl sunwz "wz.sun0769.com"

3. Define the data to capture (edit items.py)

# -*- coding: utf-8 -*-
import scrapy

class DongguanItem(scrapy.Item):
    # Post number
    number = scrapy.Field()
    # Post title
    title = scrapy.Field()
    # Post content
    content = scrapy.Field()
    # Post url
    url = scrapy.Field()

4. Write the Spider that extracts the item data (in the spiders folder: sunwz.py)

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# If PyCharm flags this import as an error, see: https://blog.csdn.net/z564359805/article/details/80650843
from dongguan.items import DongguanItem

class SunwzSpider(CrawlSpider):
    name = 'sunwz'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/report?page=0']
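    # LinkExtractor() selects the links to extract; each Rule says what to do with them
    rules = (
        # Without a callback, follow defaults to True; with a callback it defaults to False
        Rule(LinkExtractor(allow=r"report\?page=\d+"), follow=True),
        Rule(LinkExtractor(allow=r"html/question/\d+/\d+\.shtml"), callback='parse_item', follow=False),
        # Some posts (around number 60510) use a different detail URL, e.g. http://d.wz.sun0769.com/index.php/question/show?id=267700, so add a rule for it
        Rule(LinkExtractor(allow=r"/question/show\?id=\d+"), callback='parse_item', follow=False),
    )
    # Runs once, when the class body is executed
    print("Processing data ...")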
    def parse_item(self, response):
        item = DongguanItem()
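        # Extract the title text; strip() removes surrounding whitespace
        item['title'] = response.xpath('//div[@class="pagecenter p3"]//strong/text()').extract()[0].strip()
        # The post number is the part of the title after the last colon
        item['number'] = item['title'].split(":")[-1].strip()
        # Try the "contentext" div first
        content = response.xpath('//div[@class="contentext"]/text()').extract()
        if len(content) == 0:
            # Nothing matched, so fall back to the "c1 text14_2" div
            content = response.xpath('//div[@class="c1 text14_2"]/text()').extract()
            # content is a list: join it into one string, strip surrounding whitespace, and drop non-breaking spaces
            item['content'] = "".join(content).strip().replace("\xa0","")
        else:
            print("Content found in the contentext div ...")
            item['content'] = "".join(content).strip().replace("\xa0","")
        # URL of the post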
        item['url'] = response.url
        yield item
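The pure-Python parts of the spider can be tried without running a crawl. The following is a minimal sketch, not part of the project files: the helper names `classify_link`, `extract_number`, and `clean_content` are made up for illustration, and it assumes LinkExtractor applies each `allow` pattern as a regular-expression search against the URL.

```python
import re

# The three patterns passed to LinkExtractor(allow=...) in the rules.
LIST_PAGE = re.compile(r"report\?page=\d+")
DETAIL_HTML = re.compile(r"html/question/\d+/\d+\.shtml")
DETAIL_SHOW = re.compile(r"/question/show\?id=\d+")


def classify_link(url):
    """Return 'list', 'detail', or None depending on which pattern matches."""
    if LIST_PAGE.search(url):
        return "list"      # followed, no callback
    if DETAIL_HTML.search(url) or DETAIL_SHOW.search(url):
        return "detail"    # handed to parse_item
    return None            # ignored by every rule


def extract_number(title):
    """Same logic as parse_item: take the text after the last colon."""
    return title.split(":")[-1].strip()


def clean_content(fragments):
    """Same logic as parse_item: join, strip, and drop non-breaking spaces."""
    return "".join(fragments).strip().replace("\xa0", "")


if __name__ == "__main__":
    print(classify_link("http://wz.sun0769.com/index.php/question/report?page=30"))
    print(classify_link("http://d.wz.sun0769.com/index.php/question/show?id=267700"))
    # The title below is an invented sample, not real site data.
    print(extract_number("Sample question title  Number:191166"))
    print(clean_content(["\xa0\xa0First line", "\xa0second line\r\n"]))
```

Note that `replace("\xa0", "")` removes non-breaking spaces without inserting anything in their place, so words separated only by such a space end up joined together.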

5. Write the pipeline that saves the results to a file (pipelines.py)

# -*- coding: utf-8 -*-
import json

# Custom encoder: subclasses json.JSONEncoder (defined in encoder.py of the json package) so that bytes values can be serialized
class MyEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, bytes):
            return str(o, encoding='utf-8')
        return json.JSONEncoder.default(self, o)

class DongguanPipeline(object):
    def __init__(self):
        self.file = open("dongguan.json", 'w', encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False, cls=MyEncoder) + '\n'
        self.file.write(text)
        return item

    def close_spider(self, spider):
        print("Processing finished, output file closed!")
        self.file.close()
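The serialization step of the pipeline can be checked on its own. The sketch below writes to an in-memory buffer instead of dongguan.json; `BytesEncoder` and `write_item` are illustrative names that are not part of the project, and the sample items are invented.

```python
import io
import json


class BytesEncoder(json.JSONEncoder):
    """Decode bytes as UTF-8; defer everything else to the base class."""

    def default(self, o):
        if isinstance(o, bytes):
            return o.decode("utf-8")
        # The base implementation raises TypeError for unsupported types.
        return super().default(o)


def write_item(stream, item):
    """Write one item as a single JSON line, keeping non-ASCII text readable."""
    stream.write(json.dumps(item, ensure_ascii=False, cls=BytesEncoder) + "\n")


buffer = io.StringIO()
write_item(buffer, {"number": "191166", "title": "东莞".encode("utf-8")})
write_item(buffer, {"number": "191167", "title": "plain string"})
lines = buffer.getvalue().splitlines()
print(lines)
```

With one JSON object per line, the output can be appended to and read back incrementally, which suits a pipeline that receives items one at a time.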

6. Configure the settings file (settings.py)

# Obey robots.txt rules; for details see: https://blog.csdn.net/z564359805/article/details/80691677
ROBOTSTXT_OBEY = False

# Override the default request headers: set the User-Agent
DEFAULT_REQUEST_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);',
  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
}

# Configure item pipelines: uncomment this block to enable the pipeline
ITEM_PIPELINES = {
   'dongguan.pipelines.DongguanPipeline': 300,
}

# Write log output to a file (file name and log level)
LOG_FILE = "dongguanlog.log"
LOG_LEVEL = "DEBUG"

7. With the configuration above in place, start the crawl by running the crawl command from the project directory:

scrapy crawl sunwz
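Once the crawl has finished, the output can be read back for a quick check. This is a minimal sketch that assumes dongguan.json contains one JSON object per line, as written by the pipeline in step 5; `load_items` is an illustrative helper, not part of the project.

```python
import json


def load_items(path="dongguan.json"):
    """Read a JSON Lines file and return its items as a list of dicts."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


if __name__ == "__main__":
    items = load_items()
    print(len(items), "items loaded")
    for item in items[:3]:
        print(item["number"], item["title"], item["url"])
```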
