Crawling Sina News with Python: the New Cars section as an example


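Sina News pages use different layouts for different topics, so here I pick "New Cars" as the topic and scrape each article's title, publication time, link, body text, and author from this site: http://auto.sina.com.cn/newcar/index.d.html
The crawling code is as follows.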

#### Scrape the title, time, and link
import requests
from bs4 import BeautifulSoup
import urllib.request  # only used by the disabled variant below
import sys
import importlib

# Alternative urllib-based variant, disabled:
# importlib.reload(sys)
# key = 'film'
# url = "http://auto.sina.com.cn/newcar/index.d.html"
# data = urllib.request.urlopen(url).read().decode('utf-8')

for i in range(0, 2):
    url = "http://auto.sina.com.cn/newcar/?page=" + str(i + 1)
    res = requests.get(url)
    res.encoding = 'utf-8'  # decode the response body as utf-8
    soup = BeautifulSoup(res.text, 'html.parser')
    # select() takes a CSS selector and returns a list of all matching elements
    for new in soup.select('.s-left.fL.clearfix'):
        if len(new.select('h3')) > 0:
            # select() returns a list, so [0] takes the first match; .text is its text content
            date = new.select('.time.fL')[0].text
            title = new.select('h3')[0].text
            href = new.select('a')[0]['href']
            print(str(date) + "  " + title + "  " + href)
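
As a way to try the list-page logic without hitting the network, the loop above can be sketched as a function that parses an HTML string. The sample markup, the example.com link, and the name parse_list_items are assumptions made for illustration; only the three CSS selectors come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure the selectors expect;
# the real Sina page may differ.
SAMPLE_HTML = """
<div class="s-left fL clearfix">
  <h3><a href="http://example.com/a1.shtml">First headline</a></h3>
  <span class="time fL">2019-11-01 10:00</span>
</div>
<div class="s-left fL clearfix">
  <span class="time fL">2019-11-01 09:00</span>
</div>
"""


def parse_list_items(html_text):
    """Return one dict per list entry that has a headline, a link, and a time."""
    soup = BeautifulSoup(html_text, 'html.parser')
    items = []
    for block in soup.select('.s-left.fL.clearfix'):
        heading = block.select_one('h3')
        link = block.select_one('a')
        stamp = block.select_one('.time.fL')
        # Skip entries that lack any field instead of raising IndexError
        if heading is None or link is None or stamp is None:
            continue
        items.append({
            'date': stamp.get_text(strip=True),
            'title': heading.get_text(strip=True),
            'href': link.get('href'),
        })
    return items


for item in parse_list_items(SAMPLE_HTML):
    print(item['date'], item['title'], item['href'])
```

Using select_one and checking for None means an entry with a missing field is skipped quietly, whereas indexing with [0] would stop the whole crawl with an IndexError.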

### Taking a single article as an example, scrape its content and author
import requests
from bs4 import BeautifulSoup
info = requests.get('http://auto.sina.com.cn/newcar/x/2019-11-01/detail-iicezzrr6503390.shtml')
info.encoding = 'utf-8'
html = BeautifulSoup(info.text, 'html.parser')
main_title = html.select('.main-title')[0].text  # article title
date1 = html.select('.date')[0].text  # publication time
print(date1 + "  " + main_title)
print("______________________________________________________________________________________")
article = []
for v in html.select('.article p'):
    article.append(v.text.strip())  # strip surrounding whitespace and collect each paragraph
article_text = '\n'.join(article)  # join the paragraphs into one string, one per line
print(article_text)
# drop the label in front of the full-width colon, leaving only the name
print(html.select('.show_author')[0].text.split('：', 1)[-1].strip())
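
One caution when removing a leading label such as the one in front of the author name: str.lstrip treats its argument as a set of characters, not as a prefix, so it can eat the start of the name as well. The sketch below shows the difference. The label text, the sample name, and the helper strip_label are assumptions for illustration and are not taken from the Sina page.

```python
def strip_label(text, label):
    """Remove `label` only if `text` starts with it, as an exact prefix."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):]
    return text


LABEL = '责任编辑：'      # hypothetical label ("editor in charge:")
raw = '责任编辑：任小明'  # hypothetical author line; the name starts with 任

# lstrip removes every leading character found in the set {责, 任, 编, 辑, ：},
# so the first character of the name disappears too.
print(raw.lstrip(LABEL))        # 小明
print(strip_label(raw, LABEL))  # 任小明
```

A prefix check, or splitting on the colon, keeps the whole name regardless of which characters it begins with.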

Crawl results:

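A few things to note: (1) Click the spot marked by red circle 1 to inspect the page's elements. With red circle 2, select the element that contains everything you want to extract (link, time, title, and so on), then study the corresponding nodes in the Elements panel. (PS: at first I selected only the title, and then spent ages hunting for the resulting errors!)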

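(2) On the garbled-text problem with BeautifulSoup: on Python 3 and above, I got rid of it with the two lines import sys and import importlib; I tested this myself and it works. (Strictly speaking, the imports by themselves change no encoding. In the code above, what decides how the page is decoded is setting res.encoding = 'utf-8' before reading res.text.)

(3) https://blog.csdn.net/qq_33722172/article/details/82469050 The blogger behind this link explains things in great detail, including how to analyze the page step by step. However, I have noticed that many bloggers use "world" as the keyword for their hands-on walkthroughs. My advice is to pick a different keyword: if you are new to crawlers, try it yourself, and you will find that your ability to analyze pages on your own really does improve!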
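
To make note (2) concrete, here is a minimal sketch of how garbled text arises: the same bytes are decoded with the wrong codec. It uses only the standard library, and the sample string is made up. With requests, the equivalent fix is to set res.encoding to the page's real encoding before reading res.text, as the crawling code above does.

```python
page_bytes = '新车'.encode('utf-8')  # bytes as a UTF-8 page would send them

garbled = page_bytes.decode('latin-1')  # wrong codec: every byte becomes one character
correct = page_bytes.decode('utf-8')    # right codec: the original text comes back

print(len(garbled), len(correct))
print(correct)

# latin-1 maps every byte to a character, so the damage can be undone
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == correct)
```

If the page's encoding is unknown, the apparent_encoding attribute of a requests response offers a guess based on the body, though a guess can be wrong for short pages.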
