Python 3 Web Scraping Notes: The Beautiful Soup Parsing Library


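Contents

1 Overview
2 Basic usage
3 Node selectors
3.1 Selecting elements
3.2 Extracting information
3.2.1 Getting the name
3.2.2 Getting attributes
3.2.3 Getting content
3.3 Nested selection
3.4 Related selection
3.4.1 Children and descendants
3.4.2 Parents and ancestors
3.4.3 Siblings
3.4.4 Extracting information
4 Method selectors
4.1 find_all()
4.1.1 name
4.1.2 attrs
4.1.3 text
4.2 find()
4.3 Other query methods
5 CSS selectors
5.1 Nested selection
5.2 Getting attributes
5.3 Getting text
6 Summary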

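1 Overview

Beautiful Soup is a Python library for parsing HTML and XML. It uses a page's structure and attributes to locate data, so extracting elements from a web page takes a few simple statements instead of complex regular expressions.

Beautiful Soup relies on an underlying parser to do the actual parsing. These notes recommend lxml: pass 'lxml' as the second argument when creating the BeautifulSoup object: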

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)
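
If lxml may not be installed everywhere the code runs, one option is to fall back to the parser bundled with Python. The sketch below assumes a hypothetical helper named make_soup (it is not part of Beautiful Soup) and relies on the FeatureNotFound exception that bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in html.parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # lxml is not installed, so use the parser from the standard library
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p>Hello</p>')
print(soup.p.string)
```

Both parsers give the same text for this fragment; they can differ in how they repair badly broken markup.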

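2 Basic usage

Initializing BeautifulSoup automatically completes incomplete HTML, for example by adding missing closing tags.

prettify(): returns the parsed document as a string with standard indentation.

soup.title.string: the text content of the title node. soup.title on its own is the whole title node.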

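html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""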
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)


# Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

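3 Node selectors

Accessing a node by its name as an attribute selects that node element, and its string attribute then gives the text inside it. This style of selection is very fast, and it works well when the document has a clear hierarchy and the target is a single node.

3.1 Selecting elements

When several nodes share a name, this style selects only the first match and ignores the rest, as with the p node in the example below.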

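html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""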
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

# Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
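
To make the "first match only" behavior concrete, the sketch below compares attribute-style access with find_all() (covered in section 4.1). The markup is a shortened, hypothetical version of the example above, and html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.p                # attribute access: the first <p> only
all_p = soup.find_all('p')    # every <p>, in document order

print(first['class'])
print(len(all_p))
print([p.get_text() for p in all_p])
```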

3.2 Extracting information
3.2.1 Getting the name

The name attribute gives the name of a node.

print(soup.title.name)

# Output:
title

3.2.2 Getting attributes

Call attrs to get all attributes of a node as a dictionary.

print(soup.p.attrs)
print(soup.p.attrs['name'])

# Output:
{'class': ['title'], 'name': 'dromouse'}
dromouse

A shorter form omits attrs and indexes the node directly.

Note that some results are strings and others are lists of strings. The name attribute has a single value, so the result is one string. A node can have several classes, so class always comes back as a list.

print(soup.p['name'])
print(soup.p['class'])

# Output:
dromouse
['title']
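
The string-versus-list distinction is easier to see on a node that really has two classes. The following sketch uses hypothetical markup and also shows get(), which returns None for a missing attribute where indexing would raise KeyError.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title intro" name="dromouse">Hi</p>', 'html.parser')
p = soup.p

name_value = p['name']      # single-valued attribute: a plain string
class_value = p['class']    # multi-valued attribute: a list of strings
missing = p.get('id')       # get() returns None instead of raising KeyError

print(name_value, class_value, missing)
```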

3.2.3 Getting content

The string attribute gives the text contained in a node element.
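
As a minimal sketch of string, assuming a hypothetical one-line fragment: when a node has exactly one child, string follows that single chain down to the text, so both the p node and the b node below report the same string.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", 'html.parser')

# <p> has exactly one child (<b>), which holds exactly one string,
# so .string on either node resolves to that text.
p_text = soup.p.string
b_text = soup.b.string
print(p_text)
print(b_text)
```
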
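
3.3 Nested selection

Each result above is of type bs4.element.Tag, so you can keep accessing child nodes on it to select one level deeper.

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""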
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

# Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

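3.4 Related selection

Sometimes the target node cannot be selected in one step. In that case, select a node first and then use it as a reference point to reach its children, parent, siblings, and so on.

3.4.1 Children and descendants

contents: returns a list of all direct children. Grandchildren stay nested inside their child nodes and are not listed separately.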

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

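children: returns an iterator rather than a list, over the same direct children as contents. To visit every descendant, with grandchildren listed individually, use descendants, which returns a generator.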

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
# enumerate() pairs each item of an iterable (list, tuple, string, ...) with its index; it is usually used in a for loop.
for i, child in enumerate(soup.p.children):
    print(i, child)
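
The difference between contents, children, and descendants shows up as soon as a child has children of its own. This is a small sketch on hypothetical markup in which the first link wraps a span.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Once <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

direct = p.contents                 # list of direct children (text nodes and tags)
via_children = list(p.children)     # the same nodes, produced lazily
all_nodes = list(p.descendants)     # also walks into <a> and <span>

# Keep only tags, skipping text nodes
tag_names = [node.name for node in all_nodes if isinstance(node, Tag)]

print(len(direct), len(via_children), len(all_nodes))
print(tag_names)
```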

3.4.2 Parents and ancestors

parent: the direct parent of a node element. parents: a generator over all ancestors of a node element.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
# parents is a generator, so convert it to a list to see the nodes
print(list(enumerate(soup.a.parents)))
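
Printing whole ancestor nodes produces a lot of output, so a compact way to inspect the chain is to collect only the tag names. The markup below is a hypothetical fragment; the outermost ancestor is the BeautifulSoup object itself.

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

parent_name = soup.a.parent.name
# Walk outward from <a>: nearest ancestor first
ancestor_names = [node.name for node in soup.a.parents]

print(parent_name)
print(ancestor_names)
```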

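3.4.3 Siblings

next_sibling and previous_sibling give the node immediately after or before the current one; next_siblings and previous_siblings are generators over all following or preceding siblings.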

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))
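
One thing that often surprises is that a sibling can be a text node rather than a tag. The sketch below, on hypothetical markup, shows the text between two links turning up as next_sibling, and one way to keep only the tag siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, 'html.parser')
first = soup.a

raw_next = first.next_sibling    # the text node ', ' between the first two links
# Filter the generator down to tags only
next_tag_ids = [s['id'] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next), isinstance(raw_next, NavigableString))
print(next_tag_ids)
print(first.previous_sibling)    # nothing precedes the first link
```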

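3.4.4 Extracting information

When the result is a single node, call string, attrs, and similar attributes on it directly to get its text and attributes. When the result is a generator over several nodes, convert it to a list, pick out the element you want, and then call string, attrs, and so on to get that node's text and attributes.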

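html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
        </p>
"""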
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

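4 Method selectors
4.1 find_all()

find_all() takes attribute or text conditions and returns every element that matches them.

4.1.1 name

Query by node name. The call below returns a list of all ul nodes, and each element of the list is still of type bs4.element.Tag.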

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

Because each node is a Tag, queries can be nested.

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
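
The snippets in this section expect an html variable that contains ul and li nodes. As a self-contained sketch, assuming hypothetical markup with two lists, the nested query can be used to group the item texts by the id of their list.

```python
from bs4 import BeautifulSoup

html = '''
<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul id="list-2"><li class="element">Jay</li></ul>
'''
soup = BeautifulSoup(html, 'html.parser')

items_by_list = {}
for ul in soup.find_all(name='ul'):
    # Each ul is a Tag, so find_all() can be called on it again
    items_by_list[ul['id']] = [str(li.string) for li in ul.find_all(name='li')]

print(items_by_list)
```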

4.1.2 attrs

Query by attribute values.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

Common attributes such as id and class can be passed without attrs. For example, to find the node whose id is list-1, pass id as a keyword argument directly.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
# class is a reserved word in Python, so the keyword is spelled class_ with a trailing underscore
print(soup.find_all(class_='element'))
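
The keyword form does not cover every attribute: a name such as data-kind is not a valid Python identifier, so it has to go through attrs. The sketch below uses hypothetical markup (the data-kind attribute is invented for illustration) and also shows that class_ matches a node when any one of its classes matches.

```python
from bs4 import BeautifulSoup

html = '''
<ul id="list-1">
  <li class="element active" data-kind="fruit">Foo</li>
  <li class="element" data-kind="veg">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

by_class = soup.find_all(class_='element')             # matches any one of a node's classes
by_attrs = soup.find_all(attrs={'data-kind': 'veg'})   # hyphenated name: must use attrs
by_id = soup.find_all(id='list-1')

print(len(by_class), by_attrs[0].string, by_id[0].name)
```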

4.1.3 text

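The text parameter matches against node text. It accepts either a string or a compiled regular expression object. Beautiful Soup 4.4.0 and later also accept this parameter under the name string.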

import re
html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))

# Output:
['Hello, this is a link', 'Hello, this is a link, too']
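
The result above is a list of text nodes, not tags. As a sketch assuming Beautiful Soup 4.4.0 or later (where the parameter is spelled string) and hypothetical markup, combining a tag name with a text condition returns the tags whose text matches.

```python
import re

from bs4 import BeautifulSoup

html = ('<div><a>Hello, this is a link</a>'
        '<a>Hello, this is a link, too</a>'
        '<p>No match here</p></div>')
soup = BeautifulSoup(html, 'html.parser')

strings = soup.find_all(string=re.compile('link'))     # matching text nodes
tags = soup.find_all('a', string=re.compile('too'))    # <a> tags whose .string matches

print(strings)
print(tags)
```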

4.2 find()

find() returns the first matching element. The result is a single node element rather than a list, and its type is still Tag.
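
A practical consequence is that the two methods report "no match" differently, which matters before chaining further attribute access. A minimal sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first_li = soup.find(name='li')              # first match, a Tag
missing = soup.find(name='table')            # no match: find() gives None
missing_all = soup.find_all(name='table')    # no match: find_all() gives an empty list

print(first_li.string, missing, len(missing_all))
```

Checking for None before using the result of find() avoids an AttributeError on pages where the node is absent.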
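
4.3 Other query methods

find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the direct parent.

find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter returns the first following sibling.

find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter returns the first preceding sibling.

find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter returns the first matching node after it.

find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter returns the first matching node before it.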
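
These methods all start from a node that has already been selected. The sketch below exercises several of them from the middle item of a hypothetical list; the ids are invented for illustration.

```python
from bs4 import BeautifulSoup

html = ('<div id="box"><p id="intro">Intro</p>'
        '<ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>'
        '<p id="outro">Outro</p></div>')
soup = BeautifulSoup(html, 'html.parser')
b = soup.find(id='b')

parent = b.find_parent('ul')                                 # nearest matching ancestor
ancestors = [t.name for t in b.find_parents()]               # all ancestors, nearest first
after = [t['id'] for t in b.find_next_siblings('li')]        # siblings after b
before = [t['id'] for t in b.find_previous_siblings('li')]   # siblings before b
next_p = b.find_next('p')                                    # first <p> later in the document
prev_p = b.find_previous('p')                                # first <p> earlier in the document

print(parent.name, ancestors, after, before, next_p['id'], prev_p['id'])
```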

5 CSS selectors

To use CSS selectors, call the select() method and pass it the CSS selector string.

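html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''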
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
# select all li nodes inside ul nodes
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
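
select() always returns a list. When only the first match is wanted, select_one() plays the role that find() plays for find_all(). A minimal sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul id="list-2"><li class="element">Jay</li></ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

all_items = soup.select('ul li')                 # descendant combinator
second_list = soup.select('#list-2 .element')    # id first, then class
first_item = soup.select_one('li.element')       # first match only
no_match = soup.select_one('table')              # None when nothing matches

print(len(all_items), [li.get_text() for li in second_list], first_item.get_text(), no_match)
```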

5.1 Nested selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

5.2 Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

5.3 Getting text

To get text, you can use the string attribute described earlier, or alternatively the get_text() method.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)
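
The two give the same result for a node that holds a single piece of text, but they differ once a node mixes text and child tags. The sketch below uses hypothetical markup to show string returning None in that case while get_text() joins all the text.

```python
from bs4 import BeautifulSoup

html = '<ul><li>Foo</li><li>Bar <b>bold</b></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
simple, mixed = soup.select('li')

simple_string = simple.string    # one text child: the text itself
mixed_string = mixed.string      # text plus a <b> child: None
mixed_text = mixed.get_text()    # concatenates every text node below the tag

print(simple_string, mixed_string, repr(mixed_text))
```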

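6 Summary

Use the lxml parser by default, and fall back to html.parser when necessary.

Node selectors offer weak filtering but are fast.

Use find() or find_all() to query for a single result or for multiple results.

If you are familiar with CSS selectors, use the select() method.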
