Crawling a large image site (http://5442.com/) with Scrapy


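1. Create the project
Change into the directory where the project should live and run: scrapy startproject img
2. Create the spider file
Run cd img, then: scrapy genspider -t basic qiantu 5442.com
3. Open items.py and define a field to hold the image URL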

import scrapy


class ImgItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()

4. Analyze the site and write the spider
Extract the first-level links (the category pages) from the home page

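# These two imports go at the top of the spider file.
from scrapy.http import Request
from img.items import ImgItem


def parse(self, response):
    # Category links in the home page navigation bar.
    urldata = response.xpath("//div[@class='nav both']//a/@href").extract()
    for url in urldata:
        yield Request(url=url, callback=self.next)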
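
The hrefs in a navigation bar are not always absolute URLs on the same host. As a minimal sketch, assuming some links could be relative or point off-site, the list can be normalized before it is turned into requests. The helper name normalize_links and the sample hrefs are made up for illustration; only urllib.parse from the standard library is used.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, hrefs, allowed_domain="5442.com"):
    """Resolve hrefs against base_url and keep only on-site http(s) links.

    Hypothetical helper, not part of the original spider.
    """
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href.strip())
        parts = urlparse(absolute)
        host = parts.hostname or ""
        on_site = host == allowed_domain or host.endswith("." + allowed_domain)
        if parts.scheme in ("http", "https") and on_site and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Sample hrefs are invented for this example.
links = normalize_links(
    "http://www.5442.com/",
    ["/mingxing/", "http://www.5442.com/qiche/", "javascript:void(0)", "http://example.com/"],
)
print(links)
```

Inside parse, Scrapy's own response.urljoin(href) performs the same resolution against the response URL.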

Extract the second-level links (the paginated list pages)

 
# Numeric ID that the site uses in the list page URLs of each category.
CATEGORY_IDS = {
    "http://www.5442.com/mingxing/": 2,
    "http://www.5442.com/qiche/": 3,
    "http://www.5442.com/fengjing/": 4,
    "http://www.5442.com/youxi/": 5,
    "http://www.5442.com/katong/": 6,
    "http://www.5442.com/tushuo/": 8,
    "http://www.5442.com/mingxingtuku/": 9,
}


def next(self, response):
    thisurl = response.url
    list_id = CATEGORY_IDS.get(thisurl)
    if list_id is None:
        # Not one of the known categories; nothing to paginate.
        return
    # The page count is not shown on the site, so try pages 1 to 249.
    for j in range(1, 250):
        pageurl = thisurl + "list_" + str(list_id) + "_" + str(j) + ".html"
        yield Request(url=pageurl, callback=self.next1)


def next1(self, response):
    # Links to the individual gallery pages found on one list page.
    imglist = response.xpath("//div[@class='w650 l']//li//a/@href").extract()
    for thisurl in imglist:
        yield Request(url=thisurl, callback=self.next2)
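Example of a list page URL: http://www.5442.com/mingxing/list_2_2.html

The site does not show how many pages each category has, so the code takes roughly 250 pages per category as a working assumption. The next method builds the list page URLs from the pattern shown in the example, and the next1 method extracts the links to the pages that contain the images.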
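
The URL pattern can be checked without running the spider. The following is a small sketch that reuses the category IDs from the next method; the names LIST_IDS_BY_SEGMENT and list_page_urls are hypothetical. It keys the lookup on the path segment rather than the full URL, so a link without "www" or without the trailing slash would still match, which is an assumption about how the site's links may vary.

```python
from urllib.parse import urlparse

# IDs taken from the list page URLs in the next method above.
LIST_IDS_BY_SEGMENT = {
    "mingxing": 2, "qiche": 3, "fengjing": 4, "youxi": 5,
    "katong": 6, "tushuo": 8, "mingxingtuku": 9,
}


def list_page_urls(category_url, pages=range(1, 250)):
    """Return the list page URLs for one category, or [] if it is unknown."""
    segment = urlparse(category_url).path.strip("/")
    list_id = LIST_IDS_BY_SEGMENT.get(segment)
    if list_id is None:
        return []
    base = category_url.rstrip("/") + "/"
    return [f"{base}list_{list_id}_{page}.html" for page in pages]


urls = list_page_urls("http://www.5442.com/mingxing/", pages=range(1, 4))
print(urls[1])  # http://www.5442.com/mingxing/list_2_2.html
```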
Extract the third-level links (the final image addresses)
  
def next2(self, response):
    # The src attributes of the images in the article body.
    imgurllist = response.xpath("//div[@class='arcBody']//a/img/@src").extract()
    for imgurl in imgurllist:
        item = ImgItem()
        item['url'] = imgurl
        yield item
  
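5. Write the pipelines file (pipelines.py)

import random
import urllib.request


class ImgPipeline(object):

    def process_item(self, item, spider):
        try:
            thisurl = item["url"]
            # Save under a random numeric name; adjust the directory to your machine.
            file = "F:/peitao/image/" + str(int(random.random() * 100000)) + ".jpg"
            urllib.request.urlretrieve(thisurl, filename=file)
        except Exception as e:
            # Skip images that fail to download.
            pass
        # Always return the item so that later pipelines still receive it.
        return item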
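
A random number below 100000 will eventually repeat on a crawl of this size, and a repeated name silently overwrites an earlier image. One possible alternative, sketched here as an assumption rather than the original approach, derives the file name from a hash of the image URL so that each URL maps to one stable name. The function name image_filename and the sample URL are hypothetical.

```python
import hashlib
import os
from urllib.parse import urlparse


def image_filename(url, directory="F:/peitao/image/"):
    """Build a file path from the SHA-1 of the URL, keeping its extension."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Take the extension from the URL path; fall back to .jpg as the original does.
    ext = os.path.splitext(urlparse(url).path)[1].lower() or ".jpg"
    return directory + digest + ext


# Invented URL, used only to show the shape of the result.
path = image_filename("http://example.com/pics/a.PNG?size=large")
print(path)
```

In process_item, file = image_filename(item["url"]) would then replace the random name.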
Remember to uncomment the ITEM_PIPELINES setting in settings.py:

ITEM_PIPELINES = {
   'img.pipelines.ImgPipeline': 300,
}
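
With seven categories and up to 249 list pages each, the spider issues many requests in a short time. The fragment below is a sketch of settings.py options that slow the crawl down. The option names are standard Scrapy settings, but the values are illustrative assumptions and do not come from the original project.

```python
# settings.py (fragment); values are illustrative, tune them for your crawl.
DOWNLOAD_DELAY = 1                  # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server latency
RETRY_TIMES = 2                     # retry failed downloads a couple of times
```

The spider is then started from the project directory with scrapy crawl qiantu.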
