July Algorithm course "Python Crawlers", Lesson 3: crawler fundamentals and a simple crawler implementation


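This lesson touches on CSS, XPath, JSON, DOM and SAX, regular expressions, Selenium, and several other topics. W3School and RUNOOB.COM are good places to read up on them.

Some CSS example pages
Save each example to the corresponding .html file and open it directly in a browser to see the effect. css_background_color.html: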
<html>
<head>

<style type="text/css">

body {background-color: yellow}
h1 {background-color: #00ff00}
h2 {background-color: transparent}
p {background-color: rgb(250,0,255)}
p.no2 {background-color: gray; padding: 20px;}

</style>

</head>

<body>

<h1>Heading 1</h1>
<h2>Heading 2</h2>
<p>This is a paragraph.</p>
<p class="no2">This paragraph has padding.</p>

</body>
</html>

css_board_color.html:
<html>
<head>

<style type="text/css">
p.one
{
border-style: solid;
border-color: #0000ff
}
p.two
{
border-style: solid;
border-color: #ff0000 #0000ff
}
p.three
{
border-style: solid;
border-color: #ff0000 #00ff00 #0000ff
}
p.four
{
border-style: solid;
border-color: #ff0000 #00ff00 #0000ff rgb(250,0,255)
}
</style>

</head>

<body>

<p class="one">One-colored border!</p>

<p class="two">Two-colored border!</p>

<p class="three">Three-colored border!</p>

<p class="four">Four-colored border!</p>

<p><b>Note:</b> the "border-width" property has no effect when used on its own. Set the "border-style" property first.</p>

</body>
</html>

css_font_family.html:
<html>
<head>
<style type="text/css">
p.serif{font-family:"Times New Roman",Georgia,Serif}
p.sansserif{font-family:Arial,Verdana,Sans-serif}
</style>
</head>

<body>
<h1>CSS font-family</h1>
<p class="serif">This is a paragraph, shown in the Times New Roman font.</p>
<p class="sansserif">This is a paragraph, shown in the Arial font.</p>

</body>
</html>

css_text_decoration.html:
<html>
<head>
<style type="text/css">
h1 {text-decoration: overline}
h2 {text-decoration: line-through}
h3 {text-decoration: underline}
h4 {text-decoration:blink}
a {text-decoration: none}
</style>
</head>

<body>
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<h4>Heading 4</h4>
<p><a href="http://www.w3school.com.cn/index.html">This is a link</a></p>
</body>

</html>
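
To try pages like the ones above without creating each file by hand, a small script can write the markup to disk. The following is only a sketch: `save_page` is a hypothetical helper, the page content is a shortened stand-in for css_background_color.html, and the temporary directory is an assumption chosen so that nothing in the working directory is touched.

```python
import tempfile
from pathlib import Path

# A shortened stand-in for css_background_color.html.
PAGE = """<html>
<head>
<style type="text/css">
body {background-color: yellow}
h1 {background-color: #00ff00}
</style>
</head>
<body>
<h1>Heading 1</h1>
</body>
</html>
"""


def save_page(directory, filename, html):
    """Write html to directory/filename and return the resulting Path.

    save_page is a hypothetical helper, not part of the course code.
    """
    path = Path(directory) / filename
    path.write_text(html, encoding="utf-8")
    return path


# Write into a temporary directory so the working directory stays clean.
out_dir = tempfile.mkdtemp()
page_path = save_page(out_dir, "css_background_color.html", PAGE)
print(page_path)
```

Passing `page_path.as_uri()` to `webbrowser.open` from the standard library should then open the page in the default browser.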

JSON encoding and decoding
import json

obj = {'one': '一', 'two': '二'}
encoded = json.dumps(obj)
print(type(encoded))
print(encoded)
decoded = json.loads(encoded)
print(type(decoded))
print(decoded)

Output:
<class 'str'>
{"one": "\u4e00", "two": "\u4e8c"}
<class 'dict'>
{'one': '一', 'two': '二'}
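
The `\u4e00` and `\u4e8c` escapes in the encoded string appear because `json.dumps` escapes non-ASCII characters by default. A minimal sketch of turning that off with `ensure_ascii=False` follows; the sample dictionary is assumed from the escapes shown in the output above.

```python
import json

obj = {'one': '一', 'two': '二'}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(obj)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(obj, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same dictionary.
assert json.loads(escaped) == json.loads(readable) == obj
```

The readable form is usually the more convenient one when crawled data is written to a UTF-8 file for later inspection.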

Processing XML in Python: DOM
The following programs use book.xml, whose contents are:

<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>

from xml.dom import minidom

doc = minidom.parse('book.xml')
root = doc.documentElement
# print(dir(root))
print(root.nodeName)
books = root.getElementsByTagName('book')
print(type(books))
for book in books:
    titles = book.getElementsByTagName('title')
    print(titles[0].childNodes[0].nodeValue)

Output:
bookstore
<class 'xml.dom.minicompat.NodeList'>
Harry Potter
Learning XML
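
The lesson also lists XPath among its topics. As a sketch only, the standard library's `xml.etree.ElementTree` supports a limited XPath subset that is enough for this data. Here the bookstore document is parsed from an inline string; `BOOK_XML` is an assumption standing in for book.xml.

```python
import xml.etree.ElementTree as ET

# Inline copy of the bookstore data, standing in for book.xml.
BOOK_XML = """<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>"""

root = ET.fromstring(BOOK_XML)

# './/book/title' selects every <title> that is a child of a <book>.
titles = [t.text for t in root.findall('.//book/title')]

# "[@lang='eng']" keeps only elements whose lang attribute equals 'eng'.
eng_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

# Pair each title with its price converted to a float.
prices = {b.findtext('title'): float(b.findtext('price'))
          for b in root.findall('book')}

print(titles)
print(prices)
```

For full XPath 1.0 support a third-party library would be needed; the subset shown here covers child paths and simple attribute predicates.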

Processing XML in Python: SAX
from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        self.element = name
        print('element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        print('end element: %s' % name)

    def char_data(self, text):
        if text.strip():
            print("%s's text is %s" % (self.element, text))

handler = DefaultSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
with open('book.xml', 'r') as f:
    parser.Parse(f.read())

Output:
element: bookstore, attrs: {}
element: book, attrs: {}
element: title, attrs: {'lang': 'eng'}
title's text is Harry Potter
end element: title
element: price, attrs: {}
price's text is 29.99
end element: price
end element: book
element: book, attrs: {}
element: title, attrs: {'lang': 'eng'}
title's text is Learning XML
end element: title
element: price, attrs: {}
price's text is 39.95
end element: price
end element: book
end element: bookstore
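
Printing each event is useful for seeing the callback order, but a crawler usually wants structured records. The sketch below collects each book into a dictionary using the same expat callbacks; `BookCollector` and the inline `BOOK_XML` string are assumptions made for illustration, not part of the course code.

```python
from xml.parsers.expat import ParserCreate

# Inline copy of the bookstore data, standing in for book.xml.
BOOK_XML = """<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>"""


class BookCollector:
    """Collects each <book> element into a dict of child tag -> text."""

    def __init__(self):
        self.books = []
        self.current = None   # dict for the <book> being read, if any
        self.tag = None       # name of the child element being read

    def start_element(self, name, attrs):
        if name == 'book':
            self.current = {}
        elif self.current is not None:
            self.tag = name
            self.current[name] = ''

    def end_element(self, name):
        if name == 'book':
            self.books.append(self.current)
            self.current = None
        self.tag = None

    def char_data(self, text):
        # expat may deliver one text node in several pieces, so append.
        if self.current is not None and self.tag is not None:
            self.current[self.tag] += text


collector = BookCollector()
parser = ParserCreate()
parser.StartElementHandler = collector.start_element
parser.EndElementHandler = collector.end_element
parser.CharacterDataHandler = collector.char_data
parser.Parse(BOOK_XML, True)

print(collector.books)
```

Keeping the state on the handler object is what lets an event-driven parser build records without loading the whole tree, which is the main reason to choose SAX-style parsing over DOM for large documents.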

Python regular expressions
import re

m = re.match(r'\d{3}\-\d{3,8}', '010-12345')
# print(dir(m))
print(m.string)
print(m.pos, m.endpos)

# Grouping
print('Grouping')
m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
print(m.groups())
print(m.group(0))
print(m.group(1))
print(m.group(2))

# Splitting
print('Splitting')
p = re.compile(r'\d+')
print(type(p))  # re.Pattern on Python 3.7+
print(p.split('one1two3three3four4'))

t = '20:15:45'
m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
print(m.groups())

Output:
010-12345
0 9
Grouping
('010', '12345')
010-12345
010
12345
Splitting
<class 're.Pattern'>
['one', 'two', 'three', 'four', '']
('20', '15', '45')
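
The time pattern above enumerates every possible leading digit. A more compact sketch using character classes and named groups is shown below; `TIME_RE` and `parse_time` are hypothetical names, and the pattern is intended to accept the same strings as the original one.

```python
import re

# Hours: 0-23 with an optional leading zero; minutes and seconds: 0-59.
TIME_RE = re.compile(
    r'^(?P<hour>[01]?\d|2[0-3]):(?P<minute>[0-5]?\d):(?P<second>[0-5]?\d)$'
)


def parse_time(text):
    """Return (hour, minute, second) as ints, or None if text is not a time."""
    m = TIME_RE.match(text)
    if m is None:
        return None
    return int(m.group('hour')), int(m.group('minute')), int(m.group('second'))


print(parse_time('20:15:45'))
print(parse_time('7:05:9'))
print(parse_time('24:00:00'))
```

Named groups make the call site readable, and returning `None` for a failed match avoids the `AttributeError` that `m.groups()` raises when `re.match` finds nothing.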

Crawling data from an e-commerce website
Selenium installation references:
Selenium can be installed directly with pip. You also need to download chromedriver: https://sites.google.com/a/chromium.org/chromedriver/getting-started
For an installation tutorial, see: http://www.cnblogs.com/fnng/archive/2013/05/29/3106515.html
For usage tutorials, see "Python + Selenium automated testing", "Python crawler tools, part 5: using Selenium", and "Selenium with Python".
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Chrome()
browser.set_page_load_timeout(30)    # seconds to wait for a page load before raising an error
browser.get('http://www.17huo.com/search.html?sq=2&keyword=%E7%BE%8A%E6%AF%9B')
page_info = browser.find_element(By.CSS_SELECTOR, 'body > div.wrap > div.pagem.product_list_pager > div')
# print(page_info.text)
# The page count is the second space-separated token before the first comma.
pages = int((page_info.text.split(',')[0]).split(' ')[1])
for page in range(pages):
    if page > 2:    # only crawl the first three pages
        break
    url = 'http://www.17huo.com/?mod=search&sq=2&keyword=%E7%BE%8A%E6%AF%9B&page=' + str(page + 1)
    browser.get(url)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)   # wait, otherwise the page may not be fully loaded
    goods = browser.find_element(By.CSS_SELECTOR, 'body > div.wrap > div:nth-child(2) > div.p_main > ul').find_elements(By.TAG_NAME, 'li')
    print('Page %d has %d items' % ((page + 1), len(goods)))
    for good in goods:
        try:
            title = good.find_element(By.CSS_SELECTOR, 'a:nth-child(1) > p:nth-child(2)').text
            price = good.find_element(By.CSS_SELECTOR, 'div > a > span').text
            print(title, price)
        except Exception:
            # Fall back to the raw text when an item lacks the expected markup.
            print(good.text)

Output:
Page 1 has 24 items
2017             ¥105.00
      9829 P95   ¥95.00
     1629 P95  M ¥95.00
       16807 P95 ¥95.00
      5266 P95   ¥95.00
      6072 P75   ¥75.00
      8013 P75   ¥75.00
     8606 P95    ¥95.00
     8656 P95    ¥95.00
      6602 P95   ¥95.00
8621 P95         ¥95.00
9993 P70        ¥115.00
      55081 P75 ¥75.00
6887 P95         ¥115.00
6888 P95         ¥115.00
A01 P95          ¥95.00
A02 P95         ¥95.00
A09 P95         ¥95.00
            8007 ¥110.00
            8008 ¥110.00
            8009 ¥110.00
            8010 ¥110.00
            8011 ¥110.00
            8016 ¥110.00
Page 2 has 24 items
            8018 ¥110.00
            8019 ¥110.00
                 ¥110.00
                 ¥110.00
            8015 ¥110.00
            8001 ¥110.00
            8002 ¥110.00
            8004 ¥110.00
            8005 ¥110.00
            8006 ¥110.00
AB16P50          ¥50.00
(      )         ¥165.00
                 ¥125.00
        /  2017  ¥115.00
2016             ¥200.00
2199 P95         ¥95.00
2335 P95         ¥95.00
2616 P95         ¥95.00
2017             ¥100.00
    2017       V ¥100.00
   /             ¥90.00
2017             ¥65.00
[  ]             ¥130.00
2016            ¥155.00
Page 3 has 24 items
2016            ¥155.00
2016            ¥155.00
2016             ¥430.00
【  】       2016 ¥125.00
     【  】       ¥125.00
【  】     2016    ¥65.00
【  】     2016   ¥65.00
【  】     2016   ¥85.00
【  】     2016   ¥75.00
【    】           ¥115.00
【    】           ¥130.00
【  】            ¥150.00
  2017           ¥160.00
                 ¥150.00
      2017       ¥110.00
2017             ¥125.00
2017             ¥125.00
2017      /      ¥110.00
          V      ¥95.00
2017             ¥95.00
     /           ¥95.00
         /       ¥115.00
2017             ¥115.00
         2016    ¥148.00
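
The scraped prices are strings such as `¥105.00`. For sorting or aggregating them, a sketch like the following converts them to `Decimal`; `parse_price` and `PRICE_RE` are hypothetical names, and the sample values are taken from the output above.

```python
from decimal import Decimal
import re

# Matches a ¥ sign followed by an integer or decimal amount.
PRICE_RE = re.compile(r'¥\s*(\d+(?:\.\d+)?)')


def parse_price(text):
    """Extract the amount after the ¥ sign as a Decimal, or None if absent."""
    m = PRICE_RE.search(text)
    return Decimal(m.group(1)) if m else None


samples = ['¥105.00', '¥95.00', '¥75.00', 'no price']
prices = [parse_price(s) for s in samples]

# Drop entries where no price could be found before comparing.
valid = [p for p in prices if p is not None]

print(prices)
print(min(valid), max(valid))
```

`Decimal` is used rather than `float` so that amounts keep their two decimal places exactly when they are summed or compared.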