1. Using the Python standard library urllib [Introduction]



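This article introduces several commonly used classes and functions in urllib.
1. urllib: working with URLs
urllib is a package that collects several modules for working with URLs, and it is part of the Python standard library.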

urllib.request: opens and reads URLs

urllib.error: contains the exceptions raised by urllib.request

urllib.parse: parses URLs

urllib.robotparser: parses robots.txt files
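
Of the four modules, urllib.robotparser is the only one not covered later in this article, so here is a minimal sketch of it. The robots.txt content, the crawler name "my-crawler", and the example.com URLs are assumptions made up for illustration; the rules are parsed from memory so that nothing is fetched over the network.

```python
from urllib import robotparser

# Hypothetical robots.txt content, parsed from memory instead of downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether the rules allow the request.
allowed = parser.can_fetch("my-crawler", "https://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data.html")
print(allowed, blocked)  # True False
```

In a real crawler you would usually call set_url() and read() to load the site's actual robots.txt before checking URLs.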

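1.1. urllib.request module: opening and reading URLs
The urllib.request module defines functions and classes for opening URLs (mostly HTTP) in a variety of complex situations, such as basic authentication, digest authentication, redirections, and cookies.
1.1.1. Functions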

urllib.request.urlopen(url, ...) opens the resource specified by url, which may be either a string or a urllib.request.Request object. For HTTP and HTTPS URLs it returns an http.client.HTTPResponse object.

urllib.request.build_opener([handler, ...]) returns an OpenerDirector instance.
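
A minimal sketch of the two ways to call urlopen, once with a string and once with a Request object. It assumes a data: URL purely so the snippet runs without network access; for such a URL the returned object is a generic urllib response rather than an http.client.HTTPResponse, but read() behaves the same way.

```python
from urllib import request

# A data: URL carries its own payload, so no network access is needed.
# (Assumption for illustration only; real code would use an http/https URL.)
url = "data:text/plain;charset=utf-8,hello%20urllib"

# 1) Pass the URL as a plain string.
with request.urlopen(url) as resp:
    from_string = resp.read().decode("utf-8")

# 2) Pass an equivalent Request object.
with request.urlopen(request.Request(url)) as resp:
    from_request = resp.read().decode("utf-8")

print(from_string)                  # hello urllib
print(from_string == from_request)  # True
```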

1.1.2. Classes

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
This class is an abstraction of a URL request.

The url parameter is a string containing the URL of the target resource.

The headers parameter is a dictionary of request headers.
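
A Request object can be built and inspected without sending anything, which makes it easy to see what the parameters do. The sketch below is illustrative: the shortened User-Agent value and the kw=python form body are assumptions, not values from this article.

```python
from urllib import request

url = "https://tieba.baidu.com/f"
headers = {"User-Agent": "Mozilla/5.0 (example)"}  # shortened, illustrative value

# Without data the request method defaults to GET; with data it becomes POST.
get_req = request.Request(url, headers=headers)
post_req = request.Request(url, data=b"kw=python", headers=headers)

print(get_req.get_method())              # GET
print(post_req.get_method())             # POST
print(get_req.host)                      # tieba.baidu.com
# Header names are stored capitalized, e.g. "User-agent".
print(get_req.get_header("User-agent"))  # Mozilla/5.0 (example)
```

Note that data must be a bytes object, which is why the form body is written as a bytes literal.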

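class urllib.request.OpenerDirector
This class opens URLs through BaseHandler instances that are chained together. It manages the chain of handlers and recovery from errors. The open(url, data=None[, timeout]) method of an OpenerDirector instance opens the given url; its return value and the exceptions it raises are the same as those of urllib.request.urlopen().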
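
To make the handler chain more concrete, here is a small sketch with a custom handler. DemoHandler and the "demo" URL scheme are hypothetical names invented for this example and are not part of urllib; the point is only that an OpenerDirector dispatches each request to a handler method named after the URL scheme.

```python
import io
from email.message import Message
from urllib import request
from urllib.response import addinfourl


class DemoHandler(request.BaseHandler):
    """Hypothetical handler for a made-up "demo" scheme (not part of urllib)."""

    def demo_open(self, req):
        # The OpenerDirector calls <scheme>_open on its handlers for each request.
        body = f"handled {req.full_url}".encode("utf-8")
        return addinfourl(io.BytesIO(body), Message(), req.full_url)


# build_opener accepts handler classes or instances and adds the default handlers.
opener = request.build_opener(DemoHandler)
with opener.open("demo://example") as resp:
    text = resp.read().decode("utf-8")
print(text)  # handled demo://example
```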

class urllib.request.ProxyHandler(proxies=None)
This class sends requests through a proxy. If the proxies parameter is given, it must be a dictionary that maps protocol names to proxy addresses.

1.2. urllib.parse: parsing URLs

urllib.parse.urlencode(query, ...) converts a mapping object, or a sequence of two-element tuples whose elements may be str or bytes objects, into a percent-encoded ASCII text string.
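
A short sketch of the accepted input forms. The parameter names (kw, pn, q, tag) and their values are made up for illustration.

```python
from urllib import parse

# A mapping: non-ASCII text is encoded as UTF-8 and then percent-encoded.
from_dict = parse.urlencode({"kw": "中文", "pn": 50})
# A sequence of two-element tuples: repeated keys are allowed.
from_pairs = parse.urlencode([("q", "a b"), ("q", "c&d")])
# doseq=True expands a list value into one key=value pair per element.
multi = parse.urlencode({"tag": ["x", "y"]}, doseq=True)

print(from_dict)   # kw=%E4%B8%AD%E6%96%87&pn=50
print(from_pairs)  # q=a+b&q=c%26d
print(multi)       # tag=x&tag=y

# parse_qs reverses the encoding; every value comes back as a list of strings.
print(parse.parse_qs(from_dict))  # {'kw': ['中文'], 'pn': ['50']}
```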

1.3. Examples
The simplest example:

# -*- coding:utf-8 -*-
from urllib import request
url = "http://www.baidu.com"
# Open the url; for HTTP URLs urlopen returns an http.client.HTTPResponse object
resp = request.urlopen(url)
# read() returns the response body as bytes
r_content = resp.read()
# Decode the bytes into text
r_text = r_content.decode("utf-8")
print(r_text)
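
Real requests can fail, and urllib.request reports failures through the exceptions in urllib.error. The sketch below wraps the same steps in a helper; the function name fetch_text, the 10-second timeout, and the fallback to utf-8 are assumptions chosen for this example.

```python
from urllib import error, request


def fetch_text(url, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            # Use the charset declared by the response, falling back to utf-8.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset)
    except error.HTTPError as exc:  # the server answered with an error status
        print("HTTP error:", exc.code, exc.reason)
    except error.URLError as exc:   # DNS failure, refused connection, unknown scheme, ...
        print("URL error:", exc.reason)
    return None


# An unknown scheme raises URLError without touching the network.
print(fetch_text("nosuchscheme://example"))
```

HTTPError is a subclass of URLError, so it has to be caught first.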

Some sites use the request headers to apply anti-crawler restrictions. The following example therefore sets headers on the Request object.

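# -*- coding:utf-8 -*-

# urllib is a package of modules for working with URLs and is part of the Python standard library

from urllib import request

url = "https://tieba.baidu.com/f"
header = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36",
}
# Open the url; the argument may be a URL string or a urllib.request.Request object.
# For HTTP and HTTPS URLs the return value is an http.client.HTTPResponse object.
resp = request.urlopen(request.Request(url, headers=header))
r_content = resp.read() # read the response body; the result is bytes
r_text = r_content.decode("utf-8") # decode as utf-8


print(type(resp)) # type of resp
print(resp.version) # HTTP protocol version (11 means HTTP/1.1)
print(resp.status) # status code
print(resp.url) # URL of the resource
print(resp.headers) # response headers
print(r_text) # response body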

URLs whose request parameters contain Chinese text must be encoded with the urllib.parse module.

# -*- coding:utf-8 -*-

# urllib is a package of modules for working with URLs and is part of the Python standard library

from urllib import request
from urllib import parse
url = "https://tieba.baidu.com/f"

data = {
    "kw":"中文"  # placeholder Chinese search keyword
}

data_string = parse.urlencode(data)

new_url = url+"?"+data_string

resp = request.urlopen(new_url)

text_content = resp.read().decode("utf-8")

with open("tieba.html", "w", encoding="utf-8") as fp:
    fp.write(text_content)
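
Joining the URL and the query with "?" works when the base URL has no query string of its own. As a hedged alternative, the sketch below merges parameters into whatever query is already present; add_query is a hypothetical helper name and the ie=utf-8 parameter is made up for illustration.

```python
from urllib import parse


def add_query(url, params):
    """Return url with params merged into its existing query string."""
    parts = parse.urlsplit(url)
    # Keep the parameters that are already in the URL, then append the new ones.
    pairs = parse.parse_qsl(parts.query, keep_blank_values=True)
    pairs.extend(params.items())
    return parse.urlunsplit(parts._replace(query=parse.urlencode(pairs)))


print(add_query("https://tieba.baidu.com/f", {"kw": "中文"}))
# https://tieba.baidu.com/f?kw=%E4%B8%AD%E6%96%87
print(add_query("https://tieba.baidu.com/f?ie=utf-8", {"kw": "中文"}))
# https://tieba.baidu.com/f?ie=utf-8&kw=%E4%B8%AD%E6%96%87
```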

Setting a proxy. The following example sends the request through a proxy by combining ProxyHandler with build_opener.

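from urllib import request

url = "https://www.runoob.com/w3cnote/python-pip-install-usage.html"
header = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'
}
req = request.Request(url, headers=header)
# Create a proxy handler from a dictionary that maps protocols to proxy addresses
# (the address below is only an example and may no longer work).
# Only http:// requests use this entry; add an 'https' key to proxy https:// URLs such as the one above.
proxy_handler = request.ProxyHandler({
    'http': '118.187.58.34:53281',
})

# Build an OpenerDirector instance
opener = request.build_opener(proxy_handler, request.HTTPHandler)

# Open the url; the return value is the same as that of urllib.request.urlopen()
resp = opener.open(req)
text = resp.read().decode('utf-8')
print(text[:500])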

