Introduction to Beautiful Soup, a Python Web-Scraping Library, with Simple Usage Examples


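I. Introduction
Beautiful Soup is a flexible and convenient library for parsing web pages. It processes documents efficiently, supports several parsers, and makes it easy to extract information from a page without writing regular expressions.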
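Common Python parsers

| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Versions before Python 2.7.3 or 3.2.2 have poor fault tolerance for Chinese documents |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires the C library to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires the C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses the document the way a browser does; produces HTML5-format documents | Slow; an external Python package, though it needs no C extension |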
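
Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea; the helper name make_soup and the preference order are assumptions made for illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is not available.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<p>Unclosed paragraph")
print(parser_used, soup.p.get_text())
```

Since html.parser ships with Python, the loop should always end with a usable parser.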
II. Quick Start
Given a document, create a BeautifulSoup object:


from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')

Print the whole document:


print(soup.prettify())


<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

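Navigating the structured data:


print(soup.title)  # the <title> tag and its contents
print(soup.title.name)  # the name of the <title> tag
print(soup.title.string)  # the text inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)  # the first <p></p> tag
print(soup.p['class'])  # the class attribute of the first <p></p> tag
print(soup.a)  # the first <a></a> tag
print(soup.find_all('a'))  # all <a></a> tags
print(soup.find(id="link3"))  # the tag with id='link3'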


<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Find the links in all <a> tags:


for link in soup.find_all('a'):
  print(link.get('href'))


http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
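
Real pages often contain relative links and anchors without an href. As a hedged sketch, the loop above can be combined with urllib.parse.urljoin from the standard library; the page fragment and base_url below are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page fragment and base URL, used only for illustration.
page = '<a href="/elsie">Elsie</a> <a>no link</a> <a href="http://example.com/lacie">Lacie</a>'
base_url = "http://example.com/"

soup = BeautifulSoup(page, "html.parser")
# href=True keeps only the <a> tags that actually carry an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```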

Get all the text content:


print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
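
As the output shows, get_text() keeps the line breaks of the source. When a single clean line is wanted, the separator and strip arguments of get_text() can help. A small sketch on a shortened, made-up fragment:

```python
from bs4 import BeautifulSoup

# Shortened fragment in the spirit of the example document.
snippet = (
    "<p>Once upon a time "
    "<a href='http://example.com/elsie'>Elsie</a> and "
    "<a href='http://example.com/lacie'>Lacie</a></p>"
)
soup = BeautifulSoup(snippet, "html.parser")

# strip=True trims each text fragment; separator joins the fragments.
text = soup.get_text(separator=" ", strip=True)
print(text)
```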

Automatically completing tags and formatting the output:


html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())  # format the markup, completing any missing tags
print(soup.title.string)  # get the content of title
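
To see the tag completion in isolation, here is a minimal sketch on a deliberately broken fragment. It uses the built-in html.parser so that it runs without lxml; with lxml the result would additionally be wrapped in html and body tags.

```python
from bs4 import BeautifulSoup

broken = "<p>Hello<b>world"  # neither tag is closed
soup = BeautifulSoup(broken, "html.parser")

# Serialising the tree shows the closing tags that were filled in.
repaired = str(soup)
print(repaired)
```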

Tag selectors
Selecting elements


html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title)  # select the title tag
print(type(soup.title))  # check its type
print(soup.head)

Getting the tag name


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title.name)

Getting tag attributes


from bs4 import BeautifulSoup
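soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.attrs['name'])  # get the p tag's attributes and read the value of name
print(soup.p['name'])  # the other way: index the tag directly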
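
Two details are easy to miss here: class is a multi-valued attribute and comes back as a list, and indexing a missing attribute raises KeyError while get() returns None or a default. A short sketch on a made-up tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

name_value = p["name"]                  # plain string
class_value = p["class"]                # list, because class can hold several values
missing = p.get("id")                   # None instead of a KeyError
with_default = p.get("id", "no-id")     # explicit fallback value

print(name_value, class_value, missing, with_default)
```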

Getting tag content


print(soup.p.string)
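
Note that .string only yields a value when the tag has a single child to descend into; otherwise it is None, and get_text() is the safer choice. A commented-out link text, like the first link in the example document, comes back as a Comment object. A minimal sketch with made-up fragments:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p><b>bold</b> and plain</p>", "html.parser")
single = soup.b.string        # <b> has exactly one string child
mixed = soup.p.string         # <p> has a tag and a string, so this is None
all_text = soup.p.get_text()  # concatenates every string under <p>

link = BeautifulSoup("<a><!-- Elsie --></a>", "html.parser").a
is_comment = isinstance(link.string, Comment)

print(single, mixed, all_text, is_comment)
```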

Nested selection


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.head.title.string)

Child and descendant nodes


html = """
<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
      </a>
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
      and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
      and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.contents)  # all child nodes of the tag, returned as a list

Another way is children:


from bs4 import BeautifulSoup
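soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.children)  # an iterator over the tag's child nodes
for i, child in enumerate(soup.p.children):  # i receives the index, child receives the node
	print(i, child)

The output is the same as above, with an index added. Note that the child nodes can only be obtained by looping, because what is returned directly is just an iterator object.
Getting descendant nodes: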


from bs4 import BeautifulSoup
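soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.descendants)  # a generator over all descendant nodes of the tag
for i, child in enumerate(soup.p.descendants):  # i receives the index, child receives the node
	print(i, child)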
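
The difference between children and descendants is easiest to see by counting. The sketch below uses a compact, made-up fragment without whitespace so the counts are easy to follow; in the indented example document the whitespace between tags also shows up as string nodes.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p>one<a><span>two</span></a>three</p>", "html.parser")

children = list(soup.p.children)        # direct children only: 'one', <a>, 'three'
descendants = list(soup.p.descendants)  # every level: 'one', <a>, <span>, 'two', 'three'

# Keep only the tags, dropping the text nodes.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]
print(len(children), len(descendants), tag_names)
```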

Parent and ancestor nodes
parent


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.a.parent)  # the parent node of the first <a> tag

parents


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.parents)))  # all ancestor nodes of the first <a> tag
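
Printing every ancestor in full is noisy, since each one contains the next. A small sketch that prints only the tag names instead, on a made-up fragment; the last entry is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>Elsie</a></p></body></html>", "html.parser")

# Walk upwards from the first <a> tag to the document root.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```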

Sibling nodes


from bs4 import BeautifulSoup
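soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.next_siblings)))  # all sibling nodes after the first <a> tag
print(list(enumerate(soup.a.previous_siblings)))  # all sibling nodes before the first <a> tag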
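
Siblings include the text between tags, not just tags. The sketch below, on a made-up one-line fragment, contrasts next_siblings with find_next_siblings('a'), which filters down to tags.

```python
from bs4 import BeautifulSoup

fragment = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(fragment, "html.parser")
first = soup.a

raw_siblings = list(first.next_siblings)                   # ', ', <a>, ' and ', <a>
tag_ids = [a["id"] for a in first.find_next_siblings("a")]  # only the <a> tags
print(len(raw_siblings), tag_ids)
```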

Standard selectors
find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or content.
name


html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
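print(soup.find_all('ul'))  # find all ul tags; the result is a list
print(type(soup.find_all('ul')[0]))  # check the type of one element

The following example finds the li tags under every ul tag: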


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
  print(ul.find_all('li'))
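
Beyond name, find_all() also accepts a list of names, a limit, and recursive=False to look at direct children only. A hedged sketch on a reduced version of the panel markup:

```python
from bs4 import BeautifulSoup

markup = "<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>"
soup = BeautifulSoup(markup, "html.parser")

first_two = soup.find_all("li", limit=2)                      # stop after two matches
mixed_names = [t.name for t in soup.find_all(["h4", "li"])]   # match either tag name
direct_of_div = soup.div.find_all("li", recursive=False)      # li is not a direct child of div
direct_of_ul = soup.ul.find_all("li", recursive=False)        # but it is a direct child of ul

print(len(first_two), mixed_names, len(direct_of_div), len(direct_of_ul))
```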

attrs (attributes)
Finding elements by attribute


html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
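print(soup.find_all(attrs={'id': 'list-1'}))  # pass the attributes to match as a dictionary; the result is a list
print(soup.find_all(attrs={'name': 'elements'}))

Both queries return the same content, because the two attributes belong to the same tag.
Searching with special keyword arguments: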


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))  # id is a common attribute and can be passed directly as a keyword
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
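
One point worth illustrating: class_ matches a tag if any one of its classes equals the value, so class_='list' also finds the tag whose class is "list list-small". A minimal sketch on a reduced version of the markup:

```python
from bs4 import BeautifulSoup

markup = (
    '<ul class="list" id="list-1"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(markup, "html.parser")

by_list = [ul["id"] for ul in soup.find_all(class_="list")]         # both tags carry "list"
by_small = [ul["id"] for ul in soup.find_all(class_="list-small")]  # only the second one
# Several attributes in one dictionary must all match.
combined = [ul["id"] for ul in soup.find_all(attrs={"class": "list", "id": "list-2"})]

print(by_list, by_small, combined)
```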

text
Select by text content.


html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
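print(soup.find_all(text='Foo'))  # find content whose text is Foo; the result holds strings, not tags

So text is convenient for matching content, but less convenient for locating elements, because it returns the matching strings rather than the tags that contain them.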
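
If the enclosing tags are what is needed, one workaround is to go from each matched string to its .parent. The sketch below also uses a regular expression for partial matches. It passes the argument as string=, the name that newer Beautiful Soup releases use for text=; the markup is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

markup = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">Foo bar</li></ul>'
)
soup = BeautifulSoup(markup, "html.parser")

# Match every string that starts with "Foo".
matches = soup.find_all(string=re.compile("^Foo"))
# Step from each matched string up to the tag that contains it.
parent_names = [s.parent.name for s in matches]

print([str(s) for s in matches], parent_names)
```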
Methods
find
find is used the same way as find_all, but returns only the first result that matches.
find_parents(), find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
find_next_siblings(), find_next_sibling()
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.
find_previous_siblings(), find_previous_sibling()
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.
find_all_next(), find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.
find_all_previous(), find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.
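
A compact sketch that exercises several of these methods from one starting node. The markup is a reduced, made-up version of the list example, and the starting point is the middle li.

```python
from bs4 import BeautifulSoup

markup = '<div id="box"><ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
soup = BeautifulSoup(markup, "html.parser")
bar = soup.find_all("li")[1]  # the <li>Bar</li> element

parent_id = bar.find_parent("ul")["id"]                       # nearest matching ancestor
next_text = bar.find_next_sibling("li").get_text()            # first sibling after it
prev_text = bar.find_previous_sibling("li").get_text()        # first sibling before it
following = bar.find_next("li").get_text()                    # first match later in the document
earlier = [li.get_text() for li in bar.find_all_previous("li")]  # all matches earlier in the document

print(parent_id, next_text, prev_text, following, earlier)
```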
CSS selectors
Pass a CSS selector directly to select() to make the selection.


html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
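print(soup.select('.panel .panel-heading'))  # '.' denotes a class; a space between selectors denotes nesting
print(soup.select('ul li'))  # select the li tags under ul tags
print(soup.select('#list-2 .element'))  # '#' denotes an id: select tags with class=element under the tag whose id is "list-2"
print(type(soup.select('ul')[0]))  # check the type of one element

Nested selection, level by level: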


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
	print(ul.select('li'))
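
Two further selector features are often useful: select_one(), which returns the first match instead of a list, and the child combinator '>'. A hedged sketch on a reduced version of the markup above:

```python
from bs4 import BeautifulSoup

markup = (
    '<ul class="list" id="list-1"><li class="element">Foo</li>'
    '<li class="element">Bar</li><li class="element">Jay</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Foo</li>'
    '<li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(markup, "html.parser")

first_item = soup.select_one("#list-1 li").get_text()  # first match only
# 'ul.list-small > li' selects li tags that are direct children of that ul.
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]

print(first_item, small_items)
```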

Getting attributes


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
  print(ul['id'])  # get an attribute with [ ]
  print(ul.attrs['id'])  # or through the attrs dictionary

Getting content


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
  print(li.get_text())

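The get_text() method returns the content.
Summary
The lxml parser is recommended; use html.parser when necessary.
Tag selectors have weak filtering but are fast; find() and find_all() are recommended for matching a single result or multiple results.
If you are familiar with CSS selectors, use select().
Remember the common ways to get attributes and text values.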
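
Attribute access and get_text() combine naturally when turning markup into a data structure. The sketch below builds a dictionary from list id to item texts; the variable name items and the reduced markup are chosen for illustration only.

```python
from bs4 import BeautifulSoup

markup = (
    '<ul class="list" id="list-1"><li class="element">Foo</li>'
    '<li class="element">Bar</li><li class="element">Jay</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Foo</li>'
    '<li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(markup, "html.parser")

# Map each list's id attribute to the text of its li children.
items = {
    ul["id"]: [li.get_text(strip=True) for li in ul.select("li")]
    for ul in soup.select("ul")
}
print(items)
```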