Beautiful Soup 4 Study Notes (5): Modifying the Document Tree

Beautiful Soup's main strength is searching the document tree, but it also makes modifying the tree easy.

Changing tag names and attributes

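>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
>>> tag = soup.b
>>> tag.name = "blockquote"
>>> tag["class"] = "verybold"
>>> tag["id"] = 1
>>> tag
<blockquote class="verybold" id="1">Extremely bold</blockquote>
>>> del tag["class"]
>>> del tag["id"]
>>> tag
<blockquote>Extremely bold</blockquote>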
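
The same steps can be written as a standalone script. This is only a minimal sketch: the "html.parser" argument and the "intro" id value are my own choices for illustration, not something the notes above prescribe.

```python
from bs4 import BeautifulSoup

# Assumption: the built-in "html.parser" is used so the output does not
# depend on which optional parsers happen to be installed.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"      # rename the tag
tag["class"] = "verybold"    # overwrite an existing attribute
tag["id"] = "intro"          # add a new attribute (hypothetical value)
html_after_edit = str(tag)

del tag["class"]             # attributes are deleted like dictionary keys
del tag["id"]
html_after_delete = str(tag)

print(html_after_edit)
print(html_after_delete)
```

After the rename, the element can no longer be reached as soup.b; it has to be looked up as soup.blockquote.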

Modifying .string

Assigning to a tag's .string attribute replaces the tag's original contents with the new content.

>>> markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(markup, "html.parser")
>>> tag = soup.a
>>> tag.string = "New link text."
>>> tag
<a href="http://example.com/">New link text.</a>

Note: if the tag contains other tags, assigning to its .string overwrites all of its contents, including those child tags.
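
To make the warning concrete, here is a small sketch that checks for the nested <i> tag before and after the assignment. The variable names are my own, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

had_child_before = tag.i is not None   # the <i> child exists at this point
tag.string = "New link text."          # replaces everything inside <a>
has_child_after = tag.i is not None    # the <i> child has been discarded

print(had_child_before, has_child_after)
print(tag)
```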

append()

Tag.append() adds content to a tag. It works just like calling .append() on a Python list:

>>> soup = BeautifulSoup("<a>Foo</a>", "html.parser")
>>> soup.a.append("Bar")
>>> soup
<a>FooBar</a>
>>> soup.a.contents
['Foo', 'Bar']

NavigableString() and .new_tag()

If you need to add text to a document, you can pass a Python string to append(), or call the NavigableString constructor:

from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b></b>", "html.parser")
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# ['Hello', ' there']

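Note: when I typed this in, I got an error; NavigableString does not seem to be available unless it is imported from bs4 first.
To create a comment or any other subclass of NavigableString, pass the class as the second argument to soup.new_string():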

>>> from bs4 import Comment
>>> new_comment = soup.new_string("Nice to see you.", Comment)
>>> tag.append(new_comment)
>>> tag
<b>Hello there<!--Nice to see you.--></b>

The best way to create a tag is to call the factory method BeautifulSoup.new_tag():

>>> soup = BeautifulSoup("<b></b>", "html.parser")
>>> original_tag = soup.b
>>> new_tag = soup.new_tag("a", href="http://www.example.com")
>>> original_tag.append(new_tag)
>>> original_tag
<b><a href="http://www.example.com"></a></b>
>>> new_tag.string = "Link text."
>>> original_tag
<b><a href="http://www.example.com">Link text.</a></b>

Note: the first argument, the tag name, is required; all other arguments are optional.
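
new_tag() and append() are often used together to build a fragment from data. The sketch below builds a small list of links; the link texts and URLs are hypothetical sample data, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul></ul>", "html.parser")
ul = soup.ul

# Hypothetical sample data: (link text, URL) pairs.
links = [
    ("Example", "http://www.example.com"),
    ("Example Net", "http://www.example.net"),
]

for text, url in links:
    li = soup.new_tag("li")           # only the tag name is required
    a = soup.new_tag("a", href=url)   # extra keyword arguments become attributes
    a.string = text
    li.append(a)
    ul.append(li)

print(ul)
```

Each element is created from the same soup object it is later appended to, which keeps the new tags associated with that document.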

insert()

Tag.insert() is similar to Tag.append(), except that the new element is not necessarily added to the end of its parent's .contents; it is inserted at whatever position you specify. It works just like .insert() on a Python list:

>>> markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(markup, "html.parser")
>>> tag = soup.a
>>> tag.insert(1, "but did not endorse ")
>>> tag
<a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
>>> tag.contents
['I linked to ', 'but did not endorse ', <i>example.com</i>]
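
Because the position is an index into .contents, index 0 places the new node first and an index equal to len(contents) places it last. A brief sketch with made-up markup, assuming "html.parser":

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one<b>two</b></p>", "html.parser")
p = soup.p

p.insert(0, "zero ")                  # becomes the first child
p.insert(len(p.contents), " three")   # same effect as append()

children = [str(child) for child in p.contents]
print(children)
print(p)
```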

insert_before() and insert_after()

The insert_before() method inserts content immediately before the current tag or text node:

>>> soup = BeautifulSoup("<b>stop</b>", "html.parser")
>>> tag = soup.new_tag("i")
>>> tag.string = "Don't"
>>> soup.b.string.insert_before(tag)
>>> soup.b
<b><i>Don't</i>stop</b>

The insert_after() method inserts content immediately after the current tag or text node:

>>> soup.b.i.insert_after(soup.new_string(" ever "))
>>> soup.b
<b><i>Don't</i> ever stop</b>
>>> soup.b.contents
[<i>Don't</i>, ' ever ', 'stop']
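
Both methods also work when the anchor is a tag rather than a text node. The following sketch adds one sibling on each side of a <b> tag; the markup and variable names are invented for illustration, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>middle</b></p>", "html.parser")
b = soup.b

before = soup.new_tag("i")
before.string = "before"

b.insert_before(before)                    # new previous sibling of <b>
b.insert_after(soup.new_string(" after"))  # new next sibling of <b>

print(soup.p)
```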

clear()

The Tag.clear() method removes the contents of the current tag:

>>> markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(markup, "html.parser")
>>> tag = soup.a
>>> tag.clear()
>>> tag
<a href="http://example.com/"></a>

extract()

The PageElement.extract() method removes the current tag or text node from the document tree and returns it:

>>> markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(markup, "html.parser")
>>> a_tag = soup.a
>>> i_tag = soup.i.extract()
>>> a_tag
<a href="http://example.com/">I linked to </a>
>>> i_tag
<i>example.com</i>
>>> print(i_tag.parent)
None

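This method actually leaves you with two document trees: one rooted at the BeautifulSoup object that parsed the original document, and one rooted at the tag that was removed and returned. You can go on to call extract() on a child of the extracted tag: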

>>> my_string = i_tag.string.extract()
>>> my_string
'example.com'
>>> print(my_string.parent)
None
>>> i_tag
<i></i>

decompose()

The Tag.decompose() method removes the current node from the document tree and destroys it completely:

>>> markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(markup, "html.parser")
>>> a_tag = soup.a
>>> soup.i.decompose()
>>> a_tag
<a href="http://example.com/">I linked to </a>
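
clear(), extract() and decompose() are easy to mix up, so here is a side-by-side sketch on the same markup. It is only an illustration; the variable names are mine and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# clear(): the tag stays in the tree, but its contents are removed.
soup = BeautifulSoup(markup, "html.parser")
soup.a.clear()
after_clear = str(soup)

# extract(): the tag leaves the tree and is returned, still usable.
soup = BeautifulSoup(markup, "html.parser")
removed = soup.i.extract()
after_extract = str(soup)
removed_html = str(removed)

# decompose(): the tag leaves the tree and is destroyed; nothing is returned.
soup = BeautifulSoup(markup, "html.parser")
result = soup.i.decompose()
after_decompose = str(soup)

print(after_clear)
print(after_extract, removed_html)
print(after_decompose, result)
```

In short, extract() is the one to use when the removed node is still needed, and decompose() when it is not.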

replace_with()

The PageElement.replace_with() method removes a tag or text node from the document tree and puts a new tag or text node in its place:

>>> markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(markup, "html.parser")
>>> a_tag = soup.a
>>> new_tag = soup.new_tag("b")
>>> new_tag.string = "example.net"
>>> a_tag.i.replace_with(new_tag)
<i>example.com</i>
>>> a_tag
<a href="http://example.com/">I linked to <b>example.net</b></a>

replace_with() returns the tag or text node that was replaced, so you can inspect it or add it back to another part of the document tree.
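
As a sketch of that last point, the returned node can be kept and re-attached somewhere else. Placing it right after the <a> tag is an arbitrary choice for illustration, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"

old_tag = a_tag.i.replace_with(new_tag)   # returns the replaced <i> tag
detached = old_tag.parent is None         # it is no longer part of the tree

a_tag.insert_after(old_tag)               # put it back, after the <a> tag
print(soup)
```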

wrap()

The PageElement.wrap() method wraps an element in the tag you specify and returns the new wrapper:

>>> soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
>>> soup.p.string.wrap(soup.new_tag("b"))
<b>I wish I was bold.</b>
>>> soup.p.wrap(soup.new_tag("div"))
<div><p><b>I wish I was bold.</b></p></div>
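
wrap() can also be applied in a loop to wrap every matching element. The sketch below gives each paragraph its own <div>; the markup is invented and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

wrappers = []
for p in soup.find_all("p"):
    # A fresh wrapper is created for every paragraph, because a single
    # tag object can only live in one place in the tree.
    wrappers.append(p.wrap(soup.new_tag("div")))

print(soup)
```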

unwrap()

The Tag.unwrap() method is the opposite of wrap(): it removes the tag itself and leaves its contents in place, which is handy for stripping markup:

>>> markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> soup = BeautifulSoup(markup, "html.parser")
>>> a_tag = soup.a
>>> a_tag.i.unwrap()
<i></i>
>>> a_tag
<a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was removed.
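
A typical use is stripping inline formatting while keeping the text. This is a rough sketch with made-up markup, assuming "html.parser"; the list passed to find_all() names the tags to strip.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>Some <b>bold</b> and <i>italic</i> text</p>", "html.parser"
)

# Remove every <b> and <i> tag but keep what is inside them.
for formatting_tag in soup.find_all(["b", "i"]):
    formatting_tag.unwrap()

print(soup.p)
```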
