A Detailed Guide to bs4, a Parsing Library for Web Scraping

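Table of contents

06. The bs4 parsing library for web scraping

I. Introduction
1. Overview
2. Three ways to search for data in HTML
3. Installation
4. Parsers

II. Basic usage

III. Traversing the document tree
1. Introduction
2. Traversal usage
3. Getting a tag's name
4. Getting a tag's attributes (class is returned as a list)
5. Getting a tag's content
6. Nested selection
7. Child and descendant nodes (for reference)
8. Parent and ancestor nodes (for reference)
9. Sibling nodes (for reference)
11. Summary

IV. Searching the document tree with bs4
1. Five kinds of filters: strings, regular expressions, lists, booleans, and methods
   String
   Regular expression
   List
   Boolean
   Method (for reference)
   Summary
2. Other
3. find_all
4. find
5. Other methods
6. CSS selectors
7. Modifying the document tree
8. Portability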

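I. Introduction
1. Overview

Beautiful Soup 3 is no longer being developed. The official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4.

Beautiful Soup is a Python library for extracting data from HTML and XML files.

It works out of the box with the parser in the standard library, html.parser. Additional parsers such as lxml can be installed separately.

2. Three ways to search for data in HTML

CSS selectors (general-purpose)

XPath selectors (general-purpose)

Search methods provided by the module itself (find, find_all)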

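3. Installation

# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, install lxml with one of the following:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a web browser does. Install it with one of the following:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib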

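4. Parsers
The table below lists the main parsers with their advantages and disadvantages. The official documentation recommends lxml because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, the HTML parser built into the standard library is unreliable, so lxml or html5lib must be installed.

| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; decent speed; tolerant of malformed documents | Poor tolerance of malformed documents before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; tolerant of malformed documents | Requires the C library to be installed |
| lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Very fast; the only parser that supports XML | Requires the C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance of malformed documents; parses the way a browser does; produces valid HTML5 | Slow; a separate pure-Python package (no C extension needed) |

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html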
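As a rough sketch of how the parser choice above might be handled in practice, the helper below tries a preferred parser and falls back to the built-in html.parser when the requested library is not installed. The name make_soup and the sample markup are hypothetical and only serve the illustration; FeatureNotFound is the exception Beautiful Soup raises when it cannot find the requested parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper, not part of bs4.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b></p>")
print(soup.b.string)     # soup
print(soup.p.get_text()) # Hello, soup
```

Different parsers can build slightly different trees from malformed markup, so it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.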
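II. Basic usage

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="id_p">zhangchengDSB <b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# ------------------------------- Basic usage ----------------------------
from bs4 import BeautifulSoup

# BeautifulSoup(markup=html_doc, features='html.parser')
soup = BeautifulSoup(markup=html_doc, features='lxml')

# Pretty-print: fix the indentation and show the document structure
res = soup.prettify()  # /ˈprɪtɪfaɪ/
print(res)
'''
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" id="id_p">
   zhangchengDSB
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
'''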

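III. Traversing the document tree
1. Introduction

# Traversing the tree means selecting a tag directly by its name, which is fast.
# Limitation: when several tags share the same name, only the first one is returned.

# 1. Usage
# 2. Getting a tag's name
# 3. Getting a tag's attributes
# 4. Getting a tag's content
# 5. Nested selection
# 6. Child and descendant nodes
# 7. Parent and ancestor nodes
# 8. Sibling nodes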
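2. Traversal usage

# Each step returns a bs4.element.Tag object, which can be navigated the same way as soup itself
print(soup.html.head)
print(soup.html.body.p)

3. Getting a tag's name

# A bs4.element.Tag object has a name attribute
print(soup.html.body.name)  # body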

4. Getting a tag's attributes (class is returned as a list)

print(soup.html.body.p)
print(soup.html.body.p.attrs)
print(soup.html.body.p.attrs.get('class'))
print(soup.html.body.p.attrs['id'])
print(soup.html.body.p['class'])  # class can hold several values, so a list is returned
print(soup.html.body.p['id'])  # id is returned as a plain string
'''
p   The Dormouse's story  lqz
{'class': ['title'], 'id': 'id_p'}
['title']
id_p
['title']
id_p
'''
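To make the difference between multi-valued and single-valued attributes concrete, here is a small self-contained sketch. The markup is invented for the illustration, and html.parser is used only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document with a two-valued class attribute.
soup = BeautifulSoup('<p class="title big" id="id_p">text</p>', "html.parser")
p = soup.p

print(p["class"])        # ['title', 'big']  (class is multi-valued, so a list)
print(p["id"])           # id_p              (id is a plain string)
print(p.get("missing"))  # None              (get() never raises)

# Indexing with a missing attribute raises KeyError, like a dict.
try:
    p["missing"]
except KeyError:
    print("no such attribute")
```

Using get() is the safer choice when an attribute may be absent on some of the tags being processed.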

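5. Getting a tag's content

print(soup.html.body.p)
print(soup.html.body.p.text)  # the text of the tag and all of its descendants, joined together

print(soup.html.body.p.string)  # the text if the tag contains exactly one string, otherwise None

print(list(soup.html.body.p.strings))  # a generator over every string inside the tag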

'''
p   The Dormouse's story  lqz
p   The Dormouse's story  lqz
None
['p   ', "The Dormouse's story", '  ', 'lqz']
'''
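The three accessors above are easy to confuse, so the following sketch runs them side by side on a tiny invented fragment (again parsed with html.parser to avoid extra dependencies). stripped_strings is a related bs4 generator that drops surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a paragraph with text on both sides of a <b> tag.
soup = BeautifulSoup("<p>intro <b>bold</b> end</p>", "html.parser")
p = soup.p

print(p.text)                    # intro bold end
print(p.string)                  # None, because p has more than one child
print(p.b.string)                # bold, because b contains a single string
print(list(p.strings))           # ['intro ', 'bold', ' end']
print(list(p.stripped_strings))  # ['intro', 'bold', 'end']
```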

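6. Nested selection

print(soup.p.b.string)

7. Child and descendant nodes (for reference)

print(soup.p.contents)  # all direct children of p, as a list

print(soup.p.children)  # an iterator over the direct children of p
for i, child in enumerate(soup.p.children):
    print(i, child)

print(soup.p.descendants)  # a generator over every descendant of p
for i, child in enumerate(soup.p.descendants):
    print(i, child)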
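A short sketch of the difference between direct children and descendants, using an invented nested fragment. Note that text nodes count as children too, so the example filters on bs4.element.Tag when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: <i> is nested inside <b>, which is nested inside <p>.
soup = BeautifulSoup("<p>one<b>two<i>three</i></b></p>", "html.parser")
p = soup.p

direct = p.contents                # direct children: the string 'one' and the <b> tag
everything = list(p.descendants)   # 'one', <b>, 'two', <i>, 'three'
tag_names = [node.name for node in everything if isinstance(node, Tag)]

print(len(direct))      # 2
print(len(everything))  # 5
print(tag_names)        # ['b', 'i']
```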

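8. Parent and ancestor nodes (for reference)

print(soup.b.parent)  # the parent of the b tag
print(list(soup.b.parents))  # every ancestor of the b tag: its parent, the parent's parent, and so on
print(len(list(soup.b.parents)))  # the number of ancestors of the b tag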
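Printing whole ancestor tags produces a lot of output, so a compact way to inspect the chain is to print only the names. This sketch uses an invented document and html.parser; the last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
soup = BeautifulSoup("<html><body><p><b>x</b></p></body></html>", "html.parser")

ancestor_names = [parent.name for parent in soup.b.parents]
print(soup.b.parent.name)  # p
print(ancestor_names)      # ['p', 'body', 'html', '[document]']
```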

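9. Sibling nodes (for reference)

print(soup.a.next_sibling)  # the next sibling (this may be a text node such as a comma or newline)
print(soup.a.previous_sibling)  # the previous sibling

print(list(soup.a.next_siblings))  # all following siblings => generator
print(list(soup.a.previous_siblings))  # all preceding siblings => generator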
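Because next_sibling often lands on a text node rather than a tag, the sketch below contrasts it with find_next_sibling(), which skips ahead to the next matching tag. The fragment is invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two links separated by ", ".
soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', "html.parser"
)
first = soup.a

print(repr(first.next_sibling))            # ', '   (a text node, not a tag)
print(first.find_next_sibling("a")["id"])  # link2  (the next <a> tag)
print(first.previous_sibling)              # None   (nothing precedes the first link)
```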

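11. Summary

# Summary: attributes available when traversing the tree
soup.head                 the first matching tag (a Tag object)

soup.head.name            the tag's name

soup.head.attrs           the tag's attributes as a dict: {'class': [xx, yy, ...], 'id': jj}

soup.text                 all text inside the tag and its descendants, joined together
soup.string               the text if the tag contains exactly one string, otherwise None
soup.strings              a generator over every string inside the tag

soup.get('attribute')     the value of a single attribute

soup.contents             a list of the direct children (note: text nodes such as newlines are included)
soup.children             an iterator over the direct children

soup.parent               the parent node (a Tag)
soup.parents              a generator over all ancestors

soup.next_sibling         the next sibling (note: this may be a text node such as a newline)
soup.previous_sibling     the previous sibling (note: this may be a text node such as a newline)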

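IV. Searching the document tree with bs4
1. Five kinds of filters: strings, regular expressions, lists, booleans, and methods

soup.find(): returns the first match (internally it calls find_all() with limit=1 and returns the first element, or None when nothing matches).

soup.find_all(): returns every match.

String

# String filter

from bs4 import BeautifulSoup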

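html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="id_p">p   <b>The Dormouse's story</b>  <span>lqz</span></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span>lqz</span>Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""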

soup = BeautifulSoup(html_doc, 'lxml')
res = soup.find(name='body')
# Find the a tag whose id is link1
res = soup.find(name='a', id='link1')
res = soup.find_all(name='a', class_='sister')
res = soup.find_all(name='a', href="http://example.com/elsie")
res = soup.find_all(name='a', xx='xx')

res = soup.find_all(name='a', attrs={'class': 'sister'})  # the same filters can be passed through attrs
res = soup.find_all(attrs={'id': 'link1'})
res = soup.find_all(attrs={'xx': 'xx'})

# An HTML attribute called name must go through attrs, because the name keyword already means the tag name
res = soup.find_all(name='a', attrs={'name': 'lqz'})
print(res)

Regular expression

import re
res = soup.find_all(name=re.compile('^b'))
res = soup.find_all(class_=re.compile('^s'))
res = soup.find_all(attrs={'name': 'lqz'}, id=re.compile('^l'))
print(res)

List

res = soup.find_all(name=['b'])
res = soup.find_all(id=['link1', 'link2'])
print(res)

Boolean

res = soup.find_all(class_=True)  # every tag that has a class attribute

res = soup.find_all(href=True)  # every tag that has an href attribute
print(res)

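Method (for reference)

If none of the other filters fits, you can define a function that takes an element as its only argument.

The function should return True if the current element matches and False otherwise.

# Tags that have a class attribute but no id attribute
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')


res = soup.find_all(name=has_class_but_no_id)
print(res)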
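A function can also be used as the filter for a single attribute instead of the whole tag. In that case Beautiful Soup passes the attribute value (or None when the tag lacks the attribute) to the function. The sketch below assumes a small invented set of links; points_to_lacie is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hypothetical links; the third one has no href at all.
html = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a id="link3">no href</a>'
)
soup = BeautifulSoup(html, "html.parser")


def points_to_lacie(href):
    """Attribute filter: receives the href value, or None if it is missing."""
    return href is not None and "lacie" in href


matches = soup.find_all("a", href=points_to_lacie)
ids = [a["id"] for a in matches]
print(ids)  # ['link2']
```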

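Summary

# String
soup.find('a')             # the first a tag
soup.find_all('a')         # every a tag

# Regular expression
import re
pattern = re.compile(r'^b')  # for example, tag names that start with "b"
soup.find_all(name=pattern)

# List
soup.find_all(name=['b', 'a'])

# Boolean
soup.find_all(id=True)
soup.find_all(href=True)

# Method
soup.find_all(has_class_but_no_id)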

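2. Other

# Searching and dot-style traversal can be combined
res = soup.find(name='a').span.text
res = soup.html.body.find('a')
print(res)

# limit: cap the number of results
soup.findChild()  # legacy alias of find()
res = soup.find_all(name='a', limit=1)
print(res)

# recursive: whether to search all descendants (the default); False looks only at direct children
res = soup.body.find_all(name='p', recursive=False)
res = soup.find_all(name='p', recursive=False)
res = soup.find_all(name='p', recursive=True)
print(res)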

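3. find_all

find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

import re

# 1. name: the value can be any kind of filter: a string, a regular expression, a list, a method, or True.
print(soup.find_all(name=re.compile('^t')))

# 2. keyword arguments: given as key=value, where the value can be a string, a regular expression, a list, or True.
print(soup.find_all(id=re.compile('my')))
print(soup.find_all(href=re.compile('lacie'), id=re.compile(r'\d')))  # note: to filter on class, use class_
print(soup.find_all(id=True))  # every tag that has an id attribute

# Some attribute names cannot be used as keyword arguments, such as the HTML5 data-* attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
# data_soup.find_all(data-foo="value")  # error: SyntaxError: keyword can't be an expression
# Such attributes can still be searched by passing them in the attrs dictionary of find_all():
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]

# 3. Searching by class name: the keyword is class_ (class is reserved in Python); its value can be any of the five filter types
'''
print(soup.find_all('a', class_='sister'))  # a tags whose class includes sister
print(soup.find_all('a', class_='sister ssss'))  # a tags whose class is exactly "sister ssss"; a different order does not match
print(soup.find_all(class_=re.compile('^sis')))  # every tag with a class that starts with sis
'''

# 4. attrs
print(soup.find_all('p', attrs={'class': 'story'}))

# 5. text: the value can be a string, a list, True, or a regular expression
print(soup.find_all(text='Elsie'))
print(soup.find_all('a', text='Elsie'))

# 6. limit: searching a large document tree is slow. If you do not need every result, use limit to cap the number returned. It works like the LIMIT keyword in SQL: the search stops and returns as soon as the number of results reaches the limit.
print(soup.find_all('a', limit=2))
'''
[
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
]
'''

# 7. recursive: when find_all() is called on a tag, Beautiful Soup examines all descendants of that tag. To search only the tag's direct children, pass recursive=False.
print(soup.html.find_all('a'))
'''
[
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
]
'''
print(soup.html.find_all('a', recursive=False))  # []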
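The class_ rules in point 3 are the ones that most often cause surprises, so here is a small runnable sketch of them. The two-paragraph fragment is invented for the illustration; the CSS selector at the end is shown as one order-independent alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one tag with two classes, one tag with a single class.
soup = BeautifulSoup(
    '<p class="body strikeout">x</p><p class="body">y</p>', "html.parser"
)

by_one_class = soup.find_all("p", class_="body")               # matches either class
by_other_class = soup.find_all("p", class_="strikeout")
exact_string = soup.find_all("p", class_="body strikeout")     # exact attribute string
wrong_order = soup.find_all("p", class_="strikeout body")      # order differs: no match
any_order = soup.select("p.strikeout.body")                    # CSS selector ignores order

print(len(by_one_class), len(by_other_class))  # 2 1
print(len(exact_string), len(wrong_order))     # 1 0
print(len(any_order))                          # 1
```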

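Summary

# Signature:
def find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs):
    """
    :param name: a filter on the tag name.
    :param attrs: a dictionary of filters on attribute values.
    :param recursive: if True (the default), find_all() searches all descendants of this PageElement. Otherwise only the direct children are considered.
    :param limit: stop searching after this many results. Searching a large tree is slow, so use limit when you do not need every result. It works like the LIMIT keyword in SQL.
    :kwargs: a dictionary of filters on attribute values.
    :return: a result set of PageElement objects.
    """

# Tip: calling a tag (or the soup object) is shorthand for find_all()
soup('a')
soup.find_all('a')

soup.p.find_all(text=True)
soup.p(text=True)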

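4. find

find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)

'''
find_all() returns every matching tag in the document, but sometimes only one result is wanted.
For example, a document has only one <body> tag, so searching for it with find_all() is a poor fit.
Rather than calling find_all() with limit=1, call find() directly. The next two lines are nearly equivalent:
'''
soup.find_all('title', limit=1)  # [<title>The Dormouse's story</title>]
soup.find('title')               # <title>The Dormouse's story</title>


# The only difference is that find_all() returns a list containing one element, while find() returns the element itself.
# When nothing matches, find_all() returns an empty list and find() returns None.
print(soup.find("nosuchtag"))  # None


# soup.head.title is shorthand for navigating by tag name. It works by calling find() repeatedly on the current tag:
soup.head.title  # <title>The Dormouse's story</title>
soup.find("head").find("title")  # <title>The Dormouse's story</title>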

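Summary

soup.find('a'), soup.a, soup.find_all('a')[0], and soup('a')[0] all return the first a tag.

soup.find('xxxx') returns None when nothing matches.
soup.find_all('xxxx') and soup('xxxx') return [] when nothing matches, so indexing the result with [0] raises IndexError.

soup.head.a is equivalent to soup.find('head').find('a').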
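Since find() returns None for a missing tag, chaining attribute access directly onto it can fail with AttributeError. One common way to guard against that is sketched below; the fragment and the variable names are invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with an <h1> but no <h2>.
soup = BeautifulSoup("<div><h1>Title</h1></div>", "html.parser")

heading = soup.find("h1")
missing = soup.find("h2")  # None, because there is no <h2>

# Check for None before touching the result.
title_text = heading.get_text() if heading is not None else ""
subtitle_text = missing.get_text() if missing is not None else ""

print(repr(title_text))     # 'Title'
print(repr(subtitle_text))  # ''
print(soup.find_all("h2"))  # []
```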

5. Other methods
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-parents-find-parent
6. CSS selectors

# Using CSS selectors

from bs4 import BeautifulSoup

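html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="id_p">p   <b>The Dormouse's story</b>  <span>lqz</span></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span>lqz</span>Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""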

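soup = BeautifulSoup(html_doc, 'lxml')


# Common CSS selector syntax
'''
tag      tag name
.name    class name
#id      id
div>a    a tags that are direct children of a div
div a    a tags anywhere inside a div
'''
# res=soup.select('.sister')
# res=soup.select('#link2')
# res=soup.select('p>b')
res = soup.select('p b')
print(res)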


# bs4 can also parse XML documents, for example configuration files written in XML

Summary

# Note: select() always returns a list. Each selector below is followed by roughly equivalent find/find_all calls.
soup.p.select('.sister')  is equivalent to
    soup.p(class_='sister')
    soup.p.find_all(class_='sister')

soup.select('.sister span')  is roughly equivalent to
    li = []
    for sister in soup(class_='sister'):
        if not sister.span:
            continue
        span_list = sister('span')
        for span in span_list:
            li.append(span)
    print(li)

soup.select('#link1')  is equivalent to  soup(id='link1')  and  soup.find_all(id='link1')

soup.select('#list-2 h1')[0].attrs  is equivalent to
    soup.find(id='list-2').find_all('h1')[0].attrs
    soup.find(id='list-2').find('h1').attrs   -> a dict

soup.select('#list-2 h1')[0].get_text()  is equivalent to
    soup.find(id='list-2').find('h1').text
    list(soup.find(id='list-2').find('h1').strings)   -> the same text as a list of strings
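The equivalences above can be checked with a short runnable sketch. It assumes a cut-down, invented version of the sample document and uses select_one(), which returns the first match or None instead of a list. CSS selector support in bs4 is provided by the soupsieve package, which is installed together with recent beautifulsoup4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down version of the three-sisters paragraph.
html = (
    '<p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a> '
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

children = soup.select("p.story > a.sister")   # direct children only
second = soup.select_one("#link2")             # first match, or None
nothing = soup.select_one("#nope")
hrefs = [a["href"] for a in soup.select('a[href$="lacie"]')]  # attribute suffix match

print(len(children))      # 2
print(second.get_text())  # Lacie
print(nothing)            # None
print(hrefs)              # ['http://example.com/lacie']
```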

7. Modifying the document tree
bs4 can also modify the document tree, which is useful for software configuration files written in XML. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id40
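As a minimal sketch of what such a modification might look like, the example below edits a tiny invented configuration snippet: it replaces a tag's text, adds an attribute, and removes a tag. The tag names are hypothetical, and html.parser is used here only to keep the example dependency-free; for real XML files the "xml" parser from the table in section I.4 would be the more appropriate choice.

```python
from bs4 import BeautifulSoup

# Hypothetical configuration snippet.
soup = BeautifulSoup(
    "<config><host>old.example.com</host><debug>true</debug></config>",
    "html.parser",
)

soup.find("host").string = "new.example.com"  # replace the tag's text
soup.config["version"] = "2"                  # add (or overwrite) an attribute
soup.debug.decompose()                        # remove the tag and its contents

print(soup)
# <config version="2"><host>new.example.com</host></config>
```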
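8. Portability
CSS selectors are portable across parsing libraries. find and find_all are used less often elsewhere, and some parsing libraries do not support them.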
