Web Scraping (BeautifulSoup)

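1. What is BeautifulSoup
Beautiful Soup (BS4 for short) provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.
Beautiful Soup automatically converts input documents to Unicode and output documents to utf-8.
You do not need to think about encodings unless the document does not declare one, in which case you only need to state the original encoding.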
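
The encoding behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original notes: the sample markup and the choice of ISO-8859-1 are assumptions, and `from_encoding` is the constructor argument used here to state the original encoding.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes in ISO-8859-1 with no declared charset.
raw_bytes = "<p>café</p>".encode("iso-8859-1")

# State the original encoding explicitly, since the document does not declare one.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="iso-8859-1")

# Inside the tree, text is held as Unicode str.
paragraph_text = soup.p.string

# Writing the document back out as bytes, here explicitly as utf-8.
utf8_bytes = soup.encode("utf-8")

print(repr(paragraph_text))
print(utf8_bytes)
```

If the encoding is declared in the document or can be detected, the `from_encoding` argument can usually be left out.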

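2. The four kinds of BS4 objects
2.1 BeautifulSoup object
	Represents the whole document
2.2 Tag object
	A Tag object corresponds to a tag in the html document. Tag objects are obtained through the BeautifulSoup object,
	for example as soup.name, where name is the name of an html tag.
2.3 NavigableString object
	The text inside a tag
2.4 Comment object
	A special kind of NavigableString object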
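
A small sketch of the four object types, assuming a made-up one-line document (the markup and variable names are illustrative only):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one tag holding a text node and a comment.
html = "<p class='note'>Hello<!-- hidden remark --></p>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.p               # Tag: accessed as soup.<tag name>
text = tag.contents[0]     # NavigableString: the text inside the tag
comment = tag.contents[1]  # Comment: a special NavigableString

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Because Comment is a subclass of NavigableString, code that collects text nodes may want to check for Comment explicitly before treating a node as visible text.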

3. Parsers
3.1 What the parser does:
	Parses the markup text into a document tree
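3.2 Available parsers:
	3.2.1 Python standard library
		BeautifulSoup(markup, "html.parser")
		Built into Python
		Moderate speed
		Tolerant of malformed documents
		Poorly tolerant in versions before Python 2.7.3 or 3.2.2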
	3.2.2 lxml
		HTML parser: BeautifulSoup(markup, "lxml")
		Fast
		Tolerant of malformed documents
		Requires the C library to be installed
	3.2.3 lxml
		XML parser:
		BeautifulSoup(markup, "lxml-xml")
		BeautifulSoup(markup, "xml")
		Fast
		The only parser that supports XML
		Requires the C library to be installed
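	3.2.4 html5lib
		BeautifulSoup(markup, "html5lib")
		Best tolerance of malformed documents
		Parses the document the way a browser does
		Produces HTML5-format documents
		Slow
		No external (C) dependency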
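
To see how the parsers listed above differ, one option is to feed the same malformed fragment to each of them. The sketch below assumes that lxml and html5lib may or may not be installed and simply skips the missing ones; the fragment itself is made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical malformed fragment: unclosed tags.
markup = "<p>Unclosed paragraph<li>stray item"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # The parser library is not installed in this environment.
        results[parser] = None

for parser, tree in results.items():
    print(f"{parser:12} {tree if tree is not None else '(not installed)'}")
```

The repaired trees usually differ between parsers, which is one reason to name the parser explicitly instead of relying on whichever one happens to be installed.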

4. Using BS4
4.1 Import the modules:
	import re
	from bs4 import BeautifulSoup
4.2 Convert the html into a BS4 object:
	soup = BeautifulSoup(html, 'html.parser')
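
As a minimal sketch of this step: the `html` value can be a string or an open file-like object. The sample document and the use of `io.StringIO` as a stand-in for a real file are assumptions made for illustration.

```python
import io
from bs4 import BeautifulSoup

# Hypothetical document used only for this example.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# From a string.
soup_from_str = BeautifulSoup(html, "html.parser")

# From a file-like object (StringIO stands in for an opened file).
soup_from_file = BeautifulSoup(io.StringIO(html), "html.parser")

print(soup_from_str.title.string)
print(str(soup_from_str) == str(soup_from_file))
```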
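4.3 Common operations:
	soup.title                              Get the title tag
	soup.title.name                         Get the name of the title tag
	soup.a.attrs                            Get all attributes of the first a tag
	soup.a.attrs['href']                    Get the href attribute of the first a tag
	soup.a['href']                          Get the href attribute of the first a tag (shorthand)
	soup.a.string                           Get the text content (only when the tag has a single child;
	                                        with several children it returns None)
	soup.a.get_text()                       Get the text content (joins the text of all descendants
	                                        and returns it as one string)
	soup.a['href'] = ''                     Modify an attribute value
	soup.find_all('a')                      Find all a tags
	soup.find_all('a', class_="sister")     Find all a tags whose class is "sister"
	soup.find_all(string=re.compile(r'story\d+'))    Match text content with a regular expression
	soup.select("title")                    Select by tag name
	soup.select(".sister")                  Select by class (.class)
	soup.select("#link1")                   Select by id (#id)
	soup.select("input[type='password']")   Select by attribute (attribute selector)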
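
The operations above can be tried on a small sample page. The markup below is a made-up document (its URLs, ids, and text are assumptions chosen to match the selectors in the list), and the `string=` argument assumes a reasonably recent bs4 release.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<form><input type="password" name="pw"></form>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

title_name = soup.title.name       # name of the title tag
first_href = soup.a["href"]        # href of the first a tag
single_child = soup.a.string       # a has one child, so this is its text
many_children = soup.p.string      # p has several children, so this is None
all_text = soup.p.get_text()       # text of all descendants joined together

all_links = soup.find_all("a")
sisters = soup.find_all("a", class_="sister")
matches = soup.find_all(string=re.compile(r"Once upon"))

by_tag = soup.select("title")
by_class = soup.select(".sister")
by_id = soup.select("#link1")
by_attr = soup.select("input[type='password']")

# Modify an attribute value on the first a tag.
soup.a["href"] = ""

print(title_name, first_href, single_child, many_children)
print(len(all_links), len(sisters), len(matches))
print(len(by_tag), len(by_class), len(by_id), len(by_attr))
```

Note that `find_all` and `select` always return lists, so an empty result is an empty list rather than None.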