Web Crawlers (3): The bs4 Library



0. Installing the Beautiful Soup library
Open cmd as administrator and enter:

pip install beautifulsoup4

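A quick test of the installation, using the demo HTML page at http://python123.io/ws/demo.html
Two ways to get the page source:
(1) Right-click the page in a browser and view its source
(2) Enter the following in IDLE (this also parses the demo page):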

>>> import requests
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> r.text
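'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'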
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())

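The result is the page's source code, printed in indented form:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">
    Basic Python
   </a>
   and
   <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>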

1. Basic elements of the Beautiful Soup library
Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".

from bs4 import BeautifulSoup
soup=BeautifulSoup("data","html.parser")
soup2=BeautifulSoup(open("D://demo.html"),"html.parser")

A BeautifulSoup object corresponds to the entire content of one HTML/XML document.
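As a minimal sketch of the two construction styles above, the following parses the same markup once from a string and once from a file object. The HTML string and the temporary file name are made up for illustration; the with-blocks are just one way to make sure the file is closed after parsing.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<html><body><p>data</p></body></html>"  # illustrative markup

# 1) Parse directly from a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2) Parse from a file object. BeautifulSoup reads the whole file while
#    constructing the object, so the file can be closed right afterwards.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.p.string)
print(soup_from_file.p.string)
```

Both objects represent the same document, so either style can be used depending on where the HTML comes from.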

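Basic elements of the BeautifulSoup class:

Tag
Description: the Tag is the most basic unit of information organization; it is opened with <> and closed with </>.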

>>> import requests
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.title
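<title>This is a python demo page</title>
>>> tag=soup.a
>>> tag
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>
>>> type(tag)
<class 'bs4.element.Tag'>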

Any tag that exists in HTML syntax can be accessed with soup.<tag>. When the HTML document contains several tags with the same name, soup.<tag> returns the first one.
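A small sketch of the "first match only" behaviour, assuming an inline HTML fragment modelled on the demo page (the second link's URL is a placeholder). It contrasts soup.a with find_all, which is one way to reach the remaining tags.

```python
from bs4 import BeautifulSoup

# Inline fragment modelled on the demo page; the second href is a placeholder.
html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    ' and '
    '<a href="http://example.com/advanced" id="link2">Advanced Python</a>'
    '.</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a             # only the first <a> in document order
all_links = soup.find_all("a")  # every <a>, as a list

print(first_link.string)
print([a.string for a in all_links])
print(soup.table)               # a tag that is absent yields None
```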
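Tag name
Description: the name of the tag; for example, the name of <p>…</p> is 'p'. Format: <tag>.name
Every <tag> has a name, obtained with <tag>.name; its type is string.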

>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'

Tag attrs (attributes)
Description: the attributes of the tag, organized as a dictionary. Format: <tag>.attrs
A <tag> can have zero or more attributes.

>>> tag=soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
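Because attrs is a plain dictionary, the usual dictionary idioms apply. The sketch below, which uses the first link of the demo page as an inline string, shows a few access patterns that may be handy when an attribute might be missing.

```python
from bs4 import BeautifulSoup

html = ('<a href="http://www.icourse163.org/course/BIT-268001" '
        'class="py1" id="link1">Basic Python</a>')
tag = BeautifulSoup(html, "html.parser").a

href = tag["href"]          # shorthand for tag.attrs["href"]
title = tag.get("title")    # missing attribute: None instead of KeyError
classes = tag.get("class")  # class is multi-valued, so it comes back as a list
has_id = tag.has_attr("id")

print(href, title, classes, has_id)
```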

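Tag NavigableString
Description: the non-attribute string inside a tag, that is, the string between <> and </>. Format: <tag>.string
The string contained in a tag is obtained with <tag>.string; it is of type NavigableString and can span multiple levels of nested tags.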

>>> soup.a
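<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>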
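The "spans multiple levels" behaviour has a limit worth seeing in code: .string only passes through a tag that has a single child. The sketch below uses an inline fragment loosely modelled on the demo page (the second paragraph's text is shortened) and shows get_text() as one alternative when a tag has several children.

```python
from bs4 import BeautifulSoup

# Inline fragment loosely modelled on the demo page; the text is shortened.
html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Learn <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, course_p = soup.find_all("p")

print(title_p.string)       # passes through the single <b> child
print(course_p.string)      # None, because <p> has several children
print(course_p.get_text())  # concatenates every nested string
```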

Tag Comment
Description: the comment part of the string inside a tag, a special Comment type
Comment is a special string type.

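>>> newsoup=BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>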
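Since .string returns the comment text without the <!-- --> markers, code that processes page text usually has to check the type explicitly. A minimal sketch of one way to do this, assuming markup with one comment and one ordinary string:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

comments, texts = [], []
for node in soup.descendants:
    # Comment is a subclass of NavigableString, so test for it first.
    if isinstance(node, Comment):
        comments.append(str(node))
    elif isinstance(node, NavigableString):
        texts.append(str(node))

print(comments)
print(texts)
```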

2. Traversing HTML content with the bs4 library

A. Downward traversal of the tag tree

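Example code for downward traversal of the tag tree:

from bs4 import BeautifulSoup
# The original markup was lost in extraction; the <p> and <b> tag names
# are assumed, chosen so that the counts match the output below.
new_soup=BeautifulSoup("<div><p>A</p>B<b>C</b></div>","html.parser")
print("children:")
for i,child in enumerate(new_soup.div.children):
    print(i+1,child)
print("descendants:")
for i,child in enumerate(new_soup.div.descendants):
    print(i+1,child)

Output:

children:
1 <p>A</p>
2 B
3 <b>C</b>
descendants:
1 <p>A</p>
2 A
3 B
4 <b>C</b>
5 C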

The BeautifulSoup type is the root node of the tag tree.

>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.head
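<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>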
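The count of 5 above includes the line-break strings between tags, not just the two <p> tags. A small sketch, assuming a simplified body with the same shape, of how child nodes might be filtered down to tags only:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified markup with the same shape as the demo page's <body>.
html = "<body>\n<p>first</p>\n<p>second</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

total = len(soup.body.contents)  # strings and tags together
child_tags = [c for c in soup.body.children if isinstance(c, Tag)]

print(total)
print([t.name for t in child_tags])
```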

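B. Upward traversal of the tag tree

Upward traversal visits every ancestor node, including the soup object itself, so that case has to be told apart from ordinary tags.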
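A minimal sketch of upward traversal with .parents, assuming a small inline document. The last ancestor yielded is the BeautifulSoup object, whose name is "[document]"; an isinstance check is one way to tell it apart from real tags.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>Basic Python</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Every ancestor of the first <a>, ending with the soup object itself.
all_names = [parent.name for parent in soup.a.parents]

# Keep only real tags by skipping the BeautifulSoup root object.
tag_names = [
    parent.name
    for parent in soup.a.parents
    if not isinstance(parent, BeautifulSoup)
]

print(all_names)
print(tag_names)
```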
C. Sibling traversal of the tag tree

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.a.next_sibling
' and '
>>> soup.a.next_sibling.next_sibling
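<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>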

Note: sibling traversal only happens among nodes that share the same parent.

Sibling traversal (all):
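A sketch of iterating over all siblings with the next_siblings and previous_siblings generators, using a shortened inline paragraph modelled on the demo page. Siblings can be plain strings as well as tags, so the loop should not assume every sibling is a Tag.

```python
from bs4 import BeautifulSoup

# Shortened inline paragraph modelled on the demo page.
html = ('<p>Courses: <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# Siblings after the first <a>, in document order.
following = [str(s) for s in first.next_siblings]
# Siblings before the first <a>, nearest first.
preceding = [str(s) for s in first.previous_siblings]

print(following)
print(preceding)
```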

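3. HTML formatting and encoding with the bs4 library
The prettify() method of the bs4 library
Method: .prettify()
It displays HTML content in a more "friendly" way by adding '\n' after each HTML tag and its content. Both tag objects and the soup object can call this method.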

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.prettify()
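'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">\n    Basic Python\n   </a>\n   and\n   <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'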
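>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">
    Basic Python
   </a>
   and
   <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>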

As shown above, print() renders the '\n' characters as actual line breaks.
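prettify() can also be called on a single tag rather than the whole soup. A minimal sketch, assuming a one-link inline fragment, that compares the compact form of a tag with its prettified form:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link1">Basic Python</a></p>', "html.parser")

compact = str(soup.a)      # the tag as written, on one line
pretty = soup.a.prettify() # the same tag with line breaks and indentation

print(repr(compact))
print(repr(pretty))        # repr() shows the added '\n' characters
print(pretty)              # print() renders them as line breaks
```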

The bs4 library converts any HTML input it reads to utf-8 encoding (an internationally common encoding format).
Python 3.x uses utf-8 as its default encoding, so parsing is not a problem.

>>> soup=BeautifulSoup("<p>中文</p>","html.parser")
>>> soup.p.string
'中文'
>>> print(soup.p.prettify())
<p>
 中文
</p>
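To see the encoding handling with input that is not already utf-8, here is a sketch that feeds GBK-encoded bytes to the parser. The sample text and the choice of GBK are assumptions for illustration; from_encoding tells Beautiful Soup how to decode the bytes instead of leaving it to guess.

```python
from bs4 import BeautifulSoup

# Bytes in a non-utf-8 encoding (GBK is chosen only as an example).
raw = "<p>中文</p>".encode("gbk")
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # decoded to an ordinary Python string
utf8_bytes = soup.encode("utf-8")  # the document serialized as utf-8 bytes

print(text)
print(utf8_bytes)
```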

4. Unit summary
