Python Crawler / Notes (1)


I. Preparation
1. Target URL: http://www.7724.com/new-hot.html — scrape the names and thumbnail images of the top 13 games on the page's ranking list, and download the images to the local machine.
2. Python version: 3.6
II. Code
# -*- coding: utf-8 -*-
import requests, time, re
import urllib.request                  # urlretrieve lives in urllib.request on Python 3
from lxml import html                  # only referenced in the commented-out attempt below
from lxml import etree
from bs4 import BeautifulSoup          # imported but not used in this script

def resolveXpath(xpathGet):
    # tree.xpath() returns a list of Element objects; pull out the tag name,
    # attribute mapping and text of the elements at even indices (in this
    # script the list holds a single element, so index 0 is what gets returned).
    for index in range(len(xpathGet)):
        if (index % 2) == 0:
            tagGet = xpathGet[index].tag
            attribGet = xpathGet[index].attrib
            text = xpathGet[index].text
    return tagGet, attribGet, text

ticks = time.time()
imgBase = 'D:/imgSave/'

url = 'http://www.7724.com/new-hot.html'

cookie = {}
# Cookie header copied verbatim from the browser's request headers
raw_cookie = 'Cookie: aliyungf_tc=AQAAADgU4VdCvggAnkHidOrA7UvoNOqm; PHPSESSID=d25a29dd4c960b85916b80eaa2671b34; qqes_c_t_diuu=d3cbab9bb6443f72b6167daa7413374c; referer_host=www.7724.com; uid=6291849; username=w2844398290; sign=a76ff3933f9f307e6630f13e31418c84; nickname=g_follower; headimg=http%3A%2F%2Ftvax3.sinaimg.cn%2Fcrop.0.0.996.996.180%2F0069qtr9ly8fmuf1li5mej30ro0rogmf.jpg; user_playgame_record=%2C2910; session_flag1=http%3A%2F%2Fwww.7724.com%2Fnew-hot.html'
for line in raw_cookie[len('Cookie: '):].split(';'):   # drop the "Cookie: " header name first
    key, value = line.split('=', 1)
    cookie[key.strip()] = value                        # strip the space left after each ';'
print(cookie)
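# (Alternative sketch, not part of the original run: the standard library can parse the
#  copied header as well. Assumes the same raw_cookie string as above.)
# from http.cookies import SimpleCookie
# sc = SimpleCookie()
# sc.load(raw_cookie[len('Cookie: '):])
# cookie = {key: morsel.value for key, morsel in sc.items()}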

page = requests.get(url, cookies=cookie)   # pass the cookie dict via the cookies= keyword
reqData = page.text

# print(reqData)
# tree = html.fromstring(page.txt)   # first attempt at parsing the page content
# fromstring() parses an XML string and returns the root Element; not used here.
# Error raised: 'Response' object has no attribute 'txt' (the attribute is page.text)

tree = etree.HTML(reqData)   # etree.HTML parses the string into an HTML document tree

basePath = '/html/body/div[6]/div[4]/ul'
for num in range(1, 2):   # only the first <li> while debugging; use range(1, 14) for the top 13
    imgPath = basePath + '/li' + '[' + str(num) + ']' + '/div[2]/div/a/img'
    print(imgPath)
    namePath = basePath + '/li' + '[' + str(num) + ']' + '/div[3]/p[1]/a'
    imgGet = tree.xpath(imgPath)
    nameGet = tree.xpath(namePath)

    imgSolve = resolveXpath(imgGet)
    nameSolve = resolveXpath(nameGet)
    img = imgSolve[1]
    imgUrl = img['src']
    name = nameSolve[2]
    print("  URL:", imgUrl)
    print("   :", name)

    imgName = re.findall(r'http://img.7724.com/pipaw/logo/(\d+)/(\d+)/(\d+)/(.+)', imgUrl)   # last group is the file name
    print(imgName[0][3])
    # imgLayout = re.findall('.(.+)', imgUrl)
    imgSave = imgBase + str(imgName[0][3])
    print(imgSave)
    urllib.request.urlretrieve(imgUrl, imgSave)
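# (Sketch, assuming the page keeps the same layout: to cover the full top 13 instead of
#  only the first entry, iterate over the <li> nodes and let XPath return the
#  attribute/text directly.)
# for li in tree.xpath(basePath + '/li'):
#     src = li.xpath('./div[2]/div/a/img/@src')
#     title = li.xpath('./div[3]/p[1]/a/text()')
#     if src and title:
#         print(title[0], src[0])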

Execution result:
(screenshot: console output of the run)
Local directory D:/imgSave/ (24 images had been crawled during debugging):
(screenshot: downloaded image files under D:/imgSave/)
 
III. Error Log
1. Using urlretrieve on Python 3.x raised the error: module 'urllib' has no attribute 'urlretrieve'
Solution: change the original
urllib.urlretrieve(imgUrl, imgName, imgSave)
to
urllib.request.urlretrieve(imgUrl, imgSave)
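A minimal standalone sketch of the working call (the URL and save path below are placeholder values, not taken from the page):

import urllib.request

imgUrl = 'http://img.7724.com/pipaw/logo/2018/01/01/example.png'   # placeholder image URL
imgSave = 'D:/imgSave/example.png'                                 # placeholder save path
urllib.request.urlretrieve(imgUrl, imgSave)   # downloads imgUrl and writes it to imgSave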

2. Printing the raw result of tree.xpath() only shows a list of Element objects, not their contents.
Solution: define a helper that unpacks each element
def resolveXpath(xpathGet):
    for index in range(len(xpathGet)):
        if (index % 2) == 0:
            tagGet = xpathGet[index].tag
            attribGet = xpathGet[index].attrib
            text = xpathGet[index].text
    return tagGet, attribGet, text
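Applied to the <img> element grabbed in the main script, the helper unpacks like this (the printed values are illustrative):

imgGet = tree.xpath(imgPath)              # e.g. [<Element img at 0x...>]
tag, attrib, text = resolveXpath(imgGet)
print(tag)      # 'img'
print(attrib)   # lxml attribute mapping; attrib['src'] is the image URL
print(text)     # None for an <img> tag, since it has no text content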

3. On Python 3.6, "from lxml import etree" was flagged in red by the IDE
Solution: pip install lxml==4.1.0
The import may still be shown in red afterwards, but tree = etree.HTML(reqData) runs fine.
4. Using XPath to get the imgUrl address:
Inspecting the page in Chrome shows the corresponding <img> element; the image address sits in its src attribute.

Using the statements
    imgSolve = resolveXpath(imgGet)
    img = imgSolve[1]
    imgUrl = img['src']

the imgUrl address can be pulled out.
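As an alternative (a sketch assuming the same XPath expressions as in the script), lxml can also return the attribute value or text directly, which skips the helper and the attrib lookup entirely:

imgUrl = tree.xpath(imgPath + '/@src')[0]      # the src attribute as a plain string
name = tree.xpath(namePath + '/text()')[0]     # the link text, i.e. the game name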