Example: Downloading Files with the Python Crawler Framework Scrapy

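This example shows how to download files with Scrapy, the Python crawling framework. It is offered for reference; the details follow.
When we write an ordinary script, we typically obtain a file's download URL from a website, fetch it, and write the data straight to a file. All of that has to be written by hand, piece by piece, and the resulting code is rarely reusable. To avoid reinventing the wheel, Scrapy provides a ready-made way to download files that needs only a small amount of code.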
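For comparison, the manual approach described above might look roughly like the following. This is a minimal sketch, not part of the original project: the `download` helper, its `opener` parameter, and the `index.html` fallback name are all assumptions made for illustration, and only the standard library is used.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def download(url, dest_dir, opener=urlopen):
    """Fetch `url` and write the response body into `dest_dir`.

    `opener` is a hypothetical injection point so the function can be
    exercised without network access; it defaults to urllib's urlopen.
    """
    # Derive a file name from the last segment of the URL path.
    name = os.path.basename(urlparse(url).path) or "index.html"
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, name)
    # Read the whole body into memory and write it out in one go.
    with opener(url) as resp, open(target, "wb") as fh:
        fh.write(resp.read())
    return target
```

A script built this way still has to handle naming, storage layout, and repeated downloads itself, which is the kind of work Scrapy's file pipeline is meant to take over.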
mat.py


# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from weidashang.items import matplotlib
class MatSpider(scrapy.Spider):
  name = "mat"
  allowed_domains = ["matplotlib.org"]
  start_urls = ['https://matplotlib.org/examples']
  def parse(self, response):
    # Collect the links to the individual example pages from the second-level table-of-contents entries
    extractor = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l2')
    for link in extractor.extract_links(response):
      yield scrapy.Request(url=link.url, callback=self.example)
  def example(self, response):
    # The source-code link on an example page may be relative, so join it with the page's base URL to get an absolute URL
    href = response.css('a.reference.external::attr(href)').extract_first()
    url = response.urljoin(href)
    example = matplotlib()
    example['file_urls'] = [url]
    return example
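
In the `example` callback, `response.urljoin(href)` resolves the extracted href against the URL of the page being parsed. The standard library's `urllib.parse.urljoin` applies much the same resolution rules, so the effect can be sketched without Scrapy. The page URL and href values below are assumptions chosen for illustration.

```python
from urllib.parse import urljoin

# Hypothetical example page and the relative href found on it.
page_url = "https://matplotlib.org/examples/animation/animate_decay.html"
relative_href = "animate_decay.py"

# A relative href replaces the last path segment of the page URL.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)  # https://matplotlib.org/examples/animation/animate_decay.py

# An href that is already absolute is returned unchanged.
unchanged = urljoin(page_url, "https://example.com/src/demo.py")
print(unchanged)  # https://example.com/src/demo.py
```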

pipelines.py


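from os.path import basename, dirname, join
from urllib.parse import urlparse
from scrapy.pipelines.files import FilesPipeline
class MyFilePlipeline(FilesPipeline):
  def file_path(self, request, response=None, info=None, *, item=None):
    # Store each file as <parent directory>/<file name>, both taken from the URL path
    path = urlparse(request.url).path
    return join(basename(dirname(path)), basename(path))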
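
The `file_path` override above decides where each downloaded file is stored, relative to `FILES_STORE`. The sketch below applies the same path logic to a plain URL string, next to an approximation of the stock naming scheme, which names files after a SHA-1 hash of the URL under `full/`. The function names and the sample URL are assumptions for illustration, and `default_style_path` only approximates what Scrapy does internally.

```python
import hashlib
import os
from os.path import basename, dirname, join
from urllib.parse import urlparse


def custom_path(url):
    """Same logic as the file_path override: <parent dir>/<file name>."""
    path = urlparse(url).path
    return join(basename(dirname(path)), basename(path))


def default_style_path(url):
    """Approximation of the stock scheme: full/<SHA-1 of URL><extension>."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    ext = os.path.splitext(urlparse(url).path)[1]
    return "full/" + digest + ext


# Illustrative URL, assumed for this sketch.
url = "https://matplotlib.org/mpl_examples/animation/animate_decay.py"
print(custom_path(url))         # animation/animate_decay.py on POSIX systems
print(default_style_path(url))  # full/<40 hex characters>.py
```

The override trades the collision-free but opaque hash names for readable ones, so it suits sites where directory and file name together are already unique.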

settings.py


ITEM_PIPELINES = {
  'weidashang.pipelines.MyFilePlipeline': 1,
}
FILES_STORE = 'examples_src'
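
If custom file naming is not needed, the built-in pipeline can be enabled directly instead of a subclass. The fragment below is a sketch of such a `settings.py`; the optional settings are shown with what are, to my knowledge, Scrapy's documented defaults, so check them against the version in use.

```python
# settings.py (alternative sketch using the built-in pipeline)
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "examples_src"      # directory where downloaded files are stored

# Optional settings, shown here with their usual default values.
FILES_URLS_FIELD = "file_urls"    # item field holding the URLs to download
FILES_RESULT_FIELD = "files"      # item field that receives the download results
FILES_EXPIRES = 90                # days before a stored file is fetched again
```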

items.py


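from scrapy import Item, Field
class matplotlib(Item):
  file_urls = Field()  # URLs to download; read by the files pipeline
  files = Field()      # download results; filled in by the files pipeline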
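
The two fields form a contract with the pipeline: the spider fills `file_urls`, and once the downloads finish the pipeline fills `files` with one entry per file. The sketch below builds such an entry by hand to show its general shape. The `describe_download` helper and all sample values are hypothetical, and the exact set of keys may differ between Scrapy versions.

```python
import hashlib


def describe_download(url, path, body):
    """Build a dict shaped like one entry of the `files` field (hypothetical helper)."""
    return {
        "url": url,                                  # where the file came from
        "path": path,                                # location relative to FILES_STORE
        "checksum": hashlib.md5(body).hexdigest(),   # MD5 of the file contents
    }


# Sample values, assumed for illustration only.
sample_url = "https://matplotlib.org/mpl_examples/animation/animate_decay.py"
entry = describe_download(sample_url, "animation/animate_decay.py", b"print('hello')\n")
item = {"file_urls": [sample_url], "files": [entry]}
print(item)
```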

run.py


from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'mat','-o','example.json'])
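
`execute` takes the same arguments as the command line, so this script corresponds to running `scrapy crawl mat -o example.json` from the project directory. An alternative is to start the crawl through Scrapy's `CrawlerProcess` API. The sketch below assumes Scrapy 2.1 or later (for the `FEEDS` setting) and that it is run from inside the project; the `run_spider` function name is hypothetical.

```python
def run_spider(name="mat", output="example.json"):
    """Run one spider of the current project and export its items as JSON."""
    # Imported inside the function so that defining it does not require
    # Scrapy to be importable until the crawl is actually started.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    # Same effect as the `-o example.json` command-line option.
    settings.set("FEEDS", {output: {"format": "json"}})

    process = CrawlerProcess(settings)
    process.crawl(name)   # spider is looked up by its `name` attribute
    process.start()       # blocks until the crawl has finished
```

Calling `run_spider()` from a script in the project directory would then start the `mat` spider with the pipeline and settings shown earlier.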

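For more on Python, see the topic collections "Summary of Python Socket Programming Techniques", "Summary of Python Regular Expression Usage", "Python Data Structures and Algorithms Tutorial", "Summary of Python Function Usage Techniques", "Summary of Python String Manipulation Techniques", "Classic Python Tutorial from Beginner to Advanced", and "Summary of Python File and Directory Manipulation Techniques".
I hope this article is helpful for your Python programming.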
