10 basic code examples for getting started with Python web crawlers, plus a simple complete Python crawler example


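This article mainly covers the following Python crawler topics:
How web clients and servers interact
Using the get and post functions of the requests library
The functions and attributes of the response object
Opening and saving files in Python
The code is commented, so you can run it as is.
How do you install the requests library? (If you already have Python installed, you can follow along directly; if not, install the Python environment first.)
The steps are nearly the same for Windows and Linux users.
Open cmd and enter the command below. If your Python environment lives on the C: drive, you may get a warning about insufficient permissions; in that case, just run the cmd window as administrator.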


pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests

Linux users do much the same (Ubuntu as an example): if permissions are insufficient, put sudo in front of the command.


sudo pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests
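After running the install command, a quick way to confirm that Python can actually find the package is to ask the import system for it. This is only a small sketch using the standard library; the helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# "requests" is the package installed by the pip command above.
print("requests installed:", is_installed("requests"))
```

If this prints False, the package was most likely installed for a different Python interpreter than the one you are running.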

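The basic Python crawler code examples are as follows.
1. Use Requests to crawl the Baidu page and print the page information


# First crawler example: crawl the Baidu page
import requests  # import the crawler library; without it the functions below cannot be called
response = requests.get("http://www.baidu.com")  # send a GET request and get a response object
response.encoding = response.apparent_encoding  # set the encoding to the one detected from the content
print("Status code: " + str(response.status_code))  # print the status code
print(response.text)  # print the page content that was fetched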

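2. Example of the commonly used Requests get method; further examples follow below.


# Example of the get method
import requests  # import the crawler library first; without it the functions below cannot be called
response = requests.get("http://httpbin.org/get")  # get method
print(response.status_code)  # status code
print(response.text)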

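3. Example of the commonly used Requests post method; further examples follow below.


# Example of the post method
import requests  # import the crawler library first; without it the functions below cannot be called
response = requests.post("http://httpbin.org/post")  # post method
print(response.status_code)  # status code
print(response.text)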

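4. Example of the Requests put method


# Example of the put method
import requests  # import the crawler library first; without it the functions below cannot be called
response = requests.put("http://httpbin.org/put")  # put method
print(response.status_code)  # status code
print(response.text)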

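5. Further example of the Requests get method (1)
To send multiple parameters, join them with the & symbol.


# Example of passing parameters with the get method
import requests  # import the crawler library first; without it the functions below cannot be called
response = requests.get("http://httpbin.org/get?name=hezhi&age=20")  # get with parameters in the URL
print(response.status_code)  # status code
print(response.text)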
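httpbin.org echoes the request back as JSON, so you can check that the parameters really arrived. The sketch below parses a hand-written, abridged sample of such a body (not captured output; a real response carries more fields), and the helper name echoed_args is made up for this illustration. With a live requests response, response.json() performs the same parsing step.

```python
import json

# Abridged, hand-written sample of the JSON that httpbin.org/get echoes back;
# a real response contains more header fields and an "origin" entry.
sample_text = """
{
  "args": {"age": "20", "name": "hezhi"},
  "headers": {"Host": "httpbin.org"},
  "url": "http://httpbin.org/get?name=hezhi&age=20"
}
"""


def echoed_args(body_text):
    """Parse an httpbin-style JSON body and return the echoed query parameters."""
    return json.loads(body_text)["args"]


print(echoed_args(sample_text))
```

Note that the values come back as strings, even for the number 20, because everything in a query string is text.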

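6. Further example of the Requests get method (2)
params accepts a dictionary, which lets you pass several parameters at once.


# Example of passing parameters with the get method
import requests  # import the crawler library first; without it the functions below cannot be called
data = {
    "name": "hezhi",
    "age": 20
}
response = requests.get("http://httpbin.org/get", params=data)  # get with parameters from a dictionary
print(response.status_code)  # status code
print(response.text)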
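Examples 5 and 6 end up requesting the same URL: the dictionary is simply encoded into a query string. To see that encoding without touching the network, here is a small sketch using only the standard library's urlencode; the helper name build_url is made up for this illustration, and it assumes the base URL has no query string yet.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base_url, params):
    """Append a dictionary of parameters to base_url as a query string."""
    return base_url + "?" + urlencode(params)


url = build_url("http://httpbin.org/get", {"name": "hezhi", "age": 20})
print(url)

# Going the other way: split the query string back into its parameters.
print(parse_qs(urlsplit(url).query))
```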

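7. Further example of the commonly used Requests post method; note how closely it resembles the previous one.


# Example of passing parameters with the post method
import requests  # import the crawler library first; without it the functions below cannot be called
data = {
    "name": "hezhi",
    "age": 20
}
response = requests.post("http://httpbin.org/post", params=data)  # post; params= puts the values in the URL query string
print(response.status_code)  # status code
print(response.text)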
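One detail worth knowing about the post example: in requests, params= places the values in the URL, while data= sends them form-encoded in the request body, which is the more usual shape for a POST (httpbin echoes the former under "args" and the latter under "form"). The sketch below only illustrates the two shapes with the standard library, without sending anything; the variable names in_url and in_body are made up for this illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

data = {"name": "hezhi", "age": 20}
encoded = urlencode(data)

# Shape 1: values travel in the URL query string, the body stays empty.
in_url = Request("http://httpbin.org/post?" + encoded, method="POST")

# Shape 2: values travel form-encoded in the request body, the URL stays clean.
in_body = Request("http://httpbin.org/post", data=encoded.encode("ascii"))

print(in_url.full_url, in_url.data)
print(in_body.full_url, in_body.data, in_body.get_method())
```

Nothing is sent here; the Request objects are only built and inspected.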

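8. Getting around anti-crawler mechanisms, using Zhihu as an example.


# Example of getting around an anti-crawler mechanism
import requests  # import the crawler library first; without it the functions below cannot be called
response = requests.get("http://www.zhihu.com")  # first visit to Zhihu, without setting any header information
print("First try, no header information, status code: " + str(response.status_code))  # without headers the page may not be served normally, so the status code may not be 200
# The difference that lets the request through: set the User-Agent field
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
}  # header information that makes the request look like it comes from a browser
response = requests.get("http://www.zhihu.com", headers=headers)  # get request, passing the headers parameter
print(response.status_code)  # 200 is the status code of a successful request
print(response.text)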

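9. Fetch page information and save it locally
For the save path, I created a folder named 爬虫 on the D: drive to hold the saved data.
Pay attention to the encoding setting when saving the file.


# Crawl an html page and save it
import requests
url = "http://www.baidu.com"
response = requests.get(url)
response.encoding = "utf-8"  # set the encoding of the received text
print("\nType of r: " + str(type(response)))
print("\nStatus code: " + str(response.status_code))
print("\nHeader information: " + str(response.headers))
print("\nResponse content:")
print(response.text)

# Save the file
file = open("D:\\爬虫\\baidu.html", "w", encoding="utf-8")  # open a file; w creates it if it does not exist; wb is not used because this is text, not binary data
file.write(response.text)
file.close()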
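To see why the encoding argument matters, it helps to write some non-ASCII text and read it back with the same encoding. The following is a minimal sketch that uses a temporary folder instead of the D: drive so it runs anywhere; the helper names save_text and load_text and the sample page string are made up for this illustration.

```python
import os
import tempfile


def save_text(path, text, encoding="utf-8"):
    """Write text to path; the with block closes the file even on errors."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


def load_text(path, encoding="utf-8"):
    """Read the whole file back using the same encoding it was written with."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


page = "<html><head><title>百度一下</title></head></html>"
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "baidu.html")
    save_text(target, page)
    restored = load_text(target)

print(restored == page)
```

Using with open(...) also means you cannot forget the close() call.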

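10. Crawl an image and save it locally


# Save a Baidu image to the local disk
import requests  # import the crawler library first; without it the functions below cannot be called
response = requests.get("https://www.baidu.com/img/baidu_jgylogo3.gif")  # get the image response
file = open("D:\\爬虫\\baidu_logo.gif", "wb")  # open a file; wb opens it in binary mode for writing only
file.write(response.content)  # write the raw bytes to the file
file.close()  # close the file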
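When a site blocks the request, the bytes you save may be an HTML error page rather than an image. One cheap check, sketched below, is to look at the first bytes of the content before writing it: every GIF file starts with the signature GIF87a or GIF89a. The helper name looks_like_gif is made up for this illustration, and the byte strings are tiny stand-ins for response.content.

```python
def looks_like_gif(data):
    """Return True if the bytes start with one of the two GIF signatures."""
    return data[:6] in (b"GIF87a", b"GIF89a")


# Stand-ins for response.content: a GIF header and an HTML error page.
print(looks_like_gif(b"GIF89a\x01\x00\x01\x00"))
print(looks_like_gif(b"<!DOCTYPE html><html>blocked</html>"))
```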

Below is a complete Python crawler example. It crawls the images in a Baidu Tieba thread and downloads them to the local machine.
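The main steps of a Python crawler:
Fetch the page.
Analyze what the HTML tags of the images in the page look like, and use a regular expression to extract the list of all image URLs.
Download the images to a local folder, working through the list of image URLs.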
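The second step above can be tried out on its own, without fetching anything. The sketch below runs the same regular expression used later in this article against a small hand-written HTML snippet; the snippet and its example.com URLs are made up for this illustration and are not taken from a real Tieba page.

```python
import re

# Hand-written sample; real pages are much longer.
sample_html = """
<img class="BDE_Image" src="http://example.com/pic/1.jpg" width="560">
<img src="http://example.com/static/logo.png" width="100">
<img class="BDE_Image" src="http://example.com/pic/2.jpg" width="560">
<img src="http://example.com/pic/3.jpg">
"""

# Capture the src of <img> tags whose src ends in .jpg and is followed by a width attribute.
jpg_pattern = re.compile(r'<img.+?src="(.+?\.jpg)" width')

jpg_urls = jpg_pattern.findall(sample_html)
print(jpg_urls)
```

Only the two .jpg images that carry a width attribute are captured; the .png and the image without width are skipped. Regular expressions are fragile against changes in page markup, so expect to adjust the pattern for other pages.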
1. Implementation with urllib + re


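#!/usr/bin/env python3
# coding:utf-8
# Fetch a Baidu Tieba page and download the images in it
import re
import urllib.request

# Fetch the html content of the page at the given url
def getHtmlContent(url):
    page = urllib.request.urlopen(url)
    # read() returns bytes in Python 3, so decode before running a str regex on it
    # (UTF-8 is assumed here; bytes that cannot be decoded are ignored)
    return page.read().decode('utf-8', errors='ignore')

# Extract the urls of all jpg images from the html
# In this page's html a jpg image looks like: <img ... src="XXX.jpg" width=...>
def getJPGs(html):
    # Regular expression that matches the url of a jpg image
    jpgReg = re.compile(r'<img.+?src="(.+?\.jpg)" width')  # Note: the trailing 'width' makes the match more precise
    # Find all jpg urls
    jpgs = re.findall(jpgReg, html)

    return jpgs

# Download the image at the given url and save it under the given file name
def downloadJPG(imgUrl, fileName):
    urllib.request.urlretrieve(imgUrl, fileName)

# Download images in a batch; they are saved to the current directory by default
def batchDownloadJPGs(imgUrls, path='./'):
    # Counter used to name the images
    count = 1
    for url in imgUrls:
        downloadJPG(url, ''.join([path, '{0}.jpg'.format(count)]))
        count = count + 1

# Wrapper: download the images from a Baidu Tieba page
def download(url):
    html = getHtmlContent(url)
    jpgs = getJPGs(html)
    batchDownloadJPGs(jpgs)

def main():
    url = 'http://tieba.baidu.com/p/2256306796'
    download(url)

if __name__ == '__main__':
    main()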

Run the script above. The download should finish within a few seconds, and you can then see the downloaded images in the current directory.
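The urllib script sends no custom headers, so a site with the kind of anti-crawler check shown in example 8 might reject it. As a sketch only, assuming Python 3, this is how a User-Agent header can be attached with urllib; the helper names build_request and fetch_html are made up for this illustration, and the UTF-8 decoding is an assumption about the page.

```python
import urllib.request

# Same browser-like User-Agent string as in example 8.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Fetch the page and decode it to text (UTF-8 assumed, bad bytes ignored)."""
    with urllib.request.urlopen(build_request(url)) as page:
        return page.read().decode("utf-8", errors="ignore")


# Building the request does not touch the network; only fetch_html does.
req = build_request("http://tieba.baidu.com/p/2256306796")
print(req.get_header("User-agent"))
```

urllib stores header names in capitalized form, which is why the lookup uses "User-agent".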

2. Implementation with requests + re
Below, the requests library does the downloading: the getHtmlContent and downloadJPG functions are both reimplemented with requests.


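#!/usr/bin/env python3
# coding:utf-8
# Fetch a Baidu Tieba page and download the images in it
import re
from contextlib import closing

import requests

# Fetch the html content of the page at the given url
def getHtmlContent(url):
    page = requests.get(url)
    return page.text

# Extract the urls of all jpg images from the html
# In this page's html a jpg image looks like: <img ... src="XXX.jpg" width=...>
def getJPGs(html):
    # Regular expression that matches the url of a jpg image
    jpgReg = re.compile(r'<img.+?src="(.+?\.jpg)" width')  # Note: the trailing 'width' makes the match more precise
    # Find all jpg urls
    jpgs = re.findall(jpgReg, html)

    return jpgs

# Download the image at the given url and save it under the given file name
def downloadJPG(imgUrl, fileName):
    # Download as a stream so the file is written chunk by chunk
    with closing(requests.get(imgUrl, stream=True)) as resp:
        with open(fileName, 'wb') as f:
            for chunk in resp.iter_content(128):
                f.write(chunk)

# Download images in a batch; they are saved to the current directory by default
def batchDownloadJPGs(imgUrls, path='./'):
    # Counter used to name the images
    count = 1
    for url in imgUrls:
        downloadJPG(url, ''.join([path, '{0}.jpg'.format(count)]))
        print('Finished downloading image {0}'.format(count))
        count = count + 1

# Wrapper: download the images from a Baidu Tieba page
def download(url):
    html = getHtmlContent(url)
    jpgs = getJPGs(html)
    batchDownloadJPGs(jpgs)

def main():
    url = 'http://tieba.baidu.com/p/2256306796'
    download(url)

if __name__ == '__main__':
    main()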
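Both scripts build the target file name with ''.join([path, ...]), which only works when path ends with a slash. As a sketch of one alternative, the helper below pairs each URL with a numbered file name using os.path.join and enumerate; the name build_targets and the folder name imgs are made up for this illustration.

```python
import os


def build_targets(img_urls, folder="."):
    """Pair each image url with a numbered .jpg path inside folder."""
    targets = []
    for count, url in enumerate(img_urls, start=1):
        file_name = "{0}.jpg".format(count)
        targets.append((url, os.path.join(folder, file_name)))
    return targets


urls = ["http://example.com/pic/a.jpg", "http://example.com/pic/b.jpg"]
for url, path in build_targets(urls, "imgs"):
    print(url, "->", path)
```

Before downloading into a folder such as imgs, you would create it first, for example with os.makedirs("imgs", exist_ok=True).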

The 10 basic examples and the simple complete crawler introduced above are all fundamentals, but they also cover the main operations of a Python crawler. Once you have mastered them, you have learned most of what you need.
