Python learning: the Scrapy crawler framework


Installing Scrapy: pip3 install Scrapy (on macOS)
Building a Scrapy crawler involves four steps:
  • Create a new project
  • Define the target: decide exactly what you want to scrape
  • Write the spider: write a spider that crawls the pages
  • Store the content: design an item pipeline to store the scraped data
  • Command details
    After installation, typing scrapy in a terminal prints the version and a list of available commands:
    scrapy bench: benchmark Scrapy's performance on your machine
    scrapy fetch <url>: download and print the page at the given URL
    scrapy genspider: generate a new spider
    scrapy runspider: run a spider file
    scrapy shell: open an interactive shell for inspecting responses
    vi settings.py: open the settings file for editing
    Step-by-step details
    Creating a new project: scrapy startproject <project_name> automatically generates a set of files.
    The generated files in detail:
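As a sketch, creating the project used in this article would produce roughly the following layout (the exact file list can vary slightly between Scrapy versions):

```shell
scrapy startproject ITcast
# ITcast/
# ├── scrapy.cfg            # deployment configuration file
# └── ITcast/               # the project's Python package
#     ├── __init__.py
#     ├── items.py          # item definitions
#     ├── middlewares.py    # spider/downloader middlewares
#     ├── pipelines.py      # item pipelines
#     ├── settings.py       # project settings
#     └── spiders/          # spiders live here
#         └── __init__.py
```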
    items.py
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    class ItcastItem(scrapy.Item):   # container for one scraped record
        # define the fields for your item here like:
        # name = scrapy.Field()
        # teacher name
        name = scrapy.Field()
        # teacher title
        title = scrapy.Field()
        # teacher description
        info = scrapy.Field()
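To see why declaring fields matters: a scrapy.Item behaves like a dict, but only accepts the keys declared with Field(). The following is a stdlib-only sketch that mimics that behavior for illustration; in a real project these classes come from the scrapy package.

```python
# Stdlib-only sketch of scrapy.Item semantics: dict-like access restricted
# to declared fields. Mimics the real API for illustration only.

class Field(dict):
    """Holds per-field metadata, like scrapy.Field."""

class ItemMeta(type):
    def __new__(mcs, name, bases, attrs):
        fields = {}
        # inherit fields declared on base classes
        for base in bases:
            fields.update(getattr(base, 'fields', {}))
        # collect Field() declarations from the class body
        for key in [k for k, v in attrs.items() if isinstance(v, Field)]:
            fields[key] = attrs.pop(key)
        cls = super().__new__(mcs, name, bases, attrs)
        cls.fields = fields
        return cls

class Item(dict, metaclass=ItemMeta):
    def __setitem__(self, key, value):
        # reject keys that were not declared as fields
        if key not in self.fields:
            raise KeyError(f'{type(self).__name__} does not support field: {key}')
        super().__setitem__(key, value)

class ItcastItem(Item):
    name = Field()
    title = Field()
    info = Field()

item = ItcastItem()
item['name'] = 'Teacher Zhang'   # declared field: accepted
# item['age'] = 30 would raise KeyError, because 'age' is not declared
```

Restricting keys this way catches typos in field names at the moment of assignment rather than at export time.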
    

    settings.py
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for ITcast project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     http://doc.scrapy.org/en/latest/topics/settings.html
    #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'ITcast'
    
    SPIDER_MODULES = ['ITcast.spiders']     # where Scrapy looks for spiders
    NEWSPIDER_MODULE = 'ITcast.spiders'
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'ITcast (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules (whether the crawler respects the site's robots.txt)
    ROBOTSTXT_OBEY = True
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32    # maximum concurrent requests
    
    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3     # delay between requests, in seconds
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16    # concurrent requests per domain
    #CONCURRENT_REQUESTS_PER_IP = 16   # concurrent requests per IP
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {        # spider middlewares
    #    'ITcast.middlewares.ItcastSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {     # downloader middlewares, commonly customized
    #    'ITcast.middlewares.MyCustomDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    #EXTENSIONS = {   # extensions
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {    # item pipelines; enable to store scraped items
    #    'ITcast.pipelines.ItcastPipeline': 300,
    #}
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
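    The ItcastPipeline referenced in the commented-out ITEM_PIPELINES setting is generated empty. A minimal sketch of what it could look like, writing each item as one JSON line (the filename teachers.json and the JSON Lines format are assumptions for illustration; the open_spider/process_item/close_spider hooks are Scrapy's pipeline API):

```python
# pipelines.py -- minimal item pipeline sketch.
# Once enabled via ITEM_PIPELINES, Scrapy calls these methods automatically;
# process_item must return the item so later pipelines can see it.
import json

class ItcastPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('teachers.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the spider yields/returns
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()
```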
    

    itcast.py
    # -*- coding: utf-8 -*-
    import scrapy
    from ITcast.items import ItcastItem
    
    class ItcastSpider(scrapy.Spider):
        name = 'itcast'   # spider name; must be unique, used to launch the spider
        allowed_domains = ['itcast.cn']    # restricts crawling to these domains; optional
        start_urls = ['http://www.itcast.cn/channel/teacher.shtml']   # initial URLs; crawling starts from here
    
        def parse(self, response):    # parsing method, called with the response of each downloaded URL
            node_list = response.xpath("//div[@class='li_txt']")
            items = []
            for node in node_list:
    
                item = ItcastItem()
                # .extract() converts the xpath result (a list of selectors)
                # into a list of Unicode strings
                name = node.xpath("./h3/text()").extract()
                title = node.xpath("./h4/text()").extract()
                info = node.xpath("./p/text()").extract()
    
                item['name'] = name[0]
                item['title'] = title[0]
                item['info'] = info[0]
                items.append(item)
            return items
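With the project in place, the spider above is launched from the project root by its name attribute. The -o flag exports the items collected by parse to a file:

```shell
# run the spider defined above (name = 'itcast')
scrapy crawl itcast

# run it and export the collected items to a JSON file
scrapy crawl itcast -o teachers.json
```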