Python Learning: The Scrapy Crawler Framework
Installing Scrapy
Install it with pip3 (macOS):

pip3 install Scrapy

Building a Scrapy crawler involves four steps in total:
1. Create a new project
2. Define the target: be clear about what you want to scrape
3. Write the spider: write the spider that crawls the pages
4. Store the content: design a pipeline to store the scraped data

Command Details
After installation, typing scrapy directly in the terminal prints the version and a list of available commands:
scrapy bench       : benchmark Scrapy's performance on your machine
scrapy fetch <url> : download the page at the given URL and print it
scrapy genspider   : generate (create) a new spider
scrapy runspider   : run (start) a spider
scrapy shell       : open an interactive shell to inspect the scraping environment
vi settings.py     : open the settings file

Step Details
New Project

scrapy startproject <project_name>

This automatically generates a number of files. Details of the generated code:
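For example, after `scrapy startproject ITcast` the generated layout typically looks like this (Scrapy's default project template; exact contents vary slightly between versions):

```
ITcast/
├── scrapy.cfg            # deployment configuration
└── ITcast/
    ├── __init__.py
    ├── items.py          # item (data model) definitions
    ├── middlewares.py    # spider / downloader middlewares
    ├── pipelines.py      # item pipelines for storing scraped data
    ├── settings.py       # project settings
    └── spiders/          # directory where your spider code lives
        └── __init__.py
```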
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()   # teacher's name
    title = scrapy.Field()  # teacher's title
    info = scrapy.Field()   # teacher's profile text
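A `scrapy.Item` behaves like a dict that only accepts the fields declared on the class: assigning to an undeclared key raises `KeyError`. A rough stdlib-only sketch of that behaviour (an illustration of the idea, not Scrapy's actual implementation):

```python
class Field(dict):
    """Placeholder for per-field metadata (mirrors scrapy.Field)."""


class Item:
    fields = {}

    def __init__(self):
        self._values = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # collect class attributes that are Field instances
        cls.fields = {k: v for k, v in cls.__dict__.items()
                      if isinstance(v, Field)}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


class ItcastItem(Item):
    name = Field()
    title = Field()
    info = Field()


item = ItcastItem()
item['name'] = 'Teacher Zhang'   # OK: declared field
# item['age'] = 30               # would raise KeyError: undeclared field
```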
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for ITcast project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'ITcast'
SPIDER_MODULES = ['ITcast.spiders']  # modules where Scrapy looks for spiders
NEWSPIDER_MODULE = 'ITcast.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ITcast (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum number of concurrent requests
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3  # seconds to wait between requests to the same site
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16  # limit concurrent requests per domain
#CONCURRENT_REQUESTS_PER_IP = 16      # limit concurrent requests per IP
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'ITcast.middlewares.ItcastSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'ITcast.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {  # the number sets pipeline order (lower runs first)
# 'ITcast.pipelines.ItcastPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
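To actually store the items, you would uncomment `ITEM_PIPELINES` above and implement the referenced pipeline in pipelines.py. Below is a minimal sketch of such a pipeline that writes items out as JSON lines (a hypothetical implementation; Scrapy only requires `process_item`, while `open_spider`/`close_spider` are optional hooks):

```python
import json


class ItcastPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('teachers.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the spider yields/returns
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()
```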
itcast.py
# -*- coding: utf-8 -*-
import scrapy
from ITcast.items import ItcastItem


class ItcastSpider(scrapy.Spider):
    name = 'itcast'  # spider name; must be unique within the project
    allowed_domains = ['itcast.cn']  # domains the spider is allowed to crawl
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']  # initial URLs

    def parse(self, response):
        # called with the downloaded response for each start URL
        node_list = response.xpath("//div[@class='li_txt']")
        items = []
        for node in node_list:
            item = ItcastItem()
            # .extract() converts the xpath selection into a list of
            # Unicode strings; xpath() itself returns a list of selectors
            name = node.xpath("./h3/text()").extract()
            title = node.xpath("./h4/text()").extract()
            info = node.xpath("./p/text()").extract()
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            items.append(item)
        return items
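With the spider in place, you run it from the project directory by its `name` attribute. Scrapy's built-in feed export can also write the returned items straight to a file:

```shell
scrapy crawl itcast                  # run the spider defined above
scrapy crawl itcast -o teachers.json # run it and export the items to JSON
```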