Examples of Python web crawlers using the bs4 library

Python web crawler

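Exercise:
Extract the university ranking from the page
http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html
The data to collect includes the rank, school name, province/city, total score, and every indicator score: entrance exam scores of incoming students, training outcome (graduate employment rate), social reputation (donation income, thousand yuan), research scale (number of papers), research quality (paper quality, FWCI), top achievements (highly cited papers), top talent (highly cited scholars), technology services (enterprise research funding, thousand yuan), achievement conversion (technology transfer income, thousand yuan), and student internationalization (proportion of international students).
Save the result to the file "university_ranking.csv".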

import csv

import numpy
import requests
from bs4 import BeautifulSoup  # the 'lxml' parser used below requires the lxml package

url1 = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html"
html1 = requests.get(url1).content.decode()  # decode() defaults to UTF-8

soup = BeautifulSoup(html1, 'lxml')
# The ranking table is located by its full class string
tag = soup.find(class_='table table-small-font table-bordered table-striped')
text1 = tag.find_all('th')[0:4]   # first four header cells
text2 = tag.find_all('option')    # indicator names listed in the header drop-down
text3 = tag.find_all('td')        # every data cell, in row order

th = [a.string for a in text1 + text2]
td = [a.string for a in text3]
# Each table row has 14 cells; regroup the flat list into rows
td = numpy.array(td).reshape(-1, 14)

with open('university_ranking.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(th)  # header row
    for a in td:
        print(a)
        writer.writerow(a)
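
The `reshape` call above assumes the number of cells is an exact multiple of 14. A minimal sketch of the same regrouping without numpy follows. `chunk_rows` is a hypothetical helper name, and the cell values are made-up placeholders rather than real ranking data.

```python
def chunk_rows(cells, width):
    """Split a flat list of cell values into rows of `width` items each."""
    if len(cells) % width != 0:
        # A leftover cell usually means the page layout is not what we expected
        raise ValueError(
            f"{len(cells)} cells do not divide evenly into rows of {width}"
        )
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Placeholder values, four columns per row for brevity
cells = ["1", "School A", "Region X", "s1",
         "2", "School B", "Region Y", "s2"]
rows = chunk_rows(cells, 4)
for row in rows:
    print(row)
```

Raising on a leftover cell makes a layout change on the page visible right away instead of silently shifting columns.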

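Exercise:
Extract the university ranking information from the page
https://www.dxsbb.com/news/5463.html
including the rank, school name, overall score, star rating, and school-running level, and save it to the file "university_ranking_alumni_edition.csv".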

import csv

import numpy
import requests
from bs4 import BeautifulSoup

url1 = "https://www.dxsbb.com/news/5463.html"
html1 = requests.get(url1).content.decode('gbk')  # decode the bytes as GBK, not UTF-8

soup = BeautifulSoup(html1, 'html.parser')
# The ranking is taken from the second <tbody> on the page
text1 = soup.find_all('tbody')[1].find_all('td')

td = [a.text for a in text1]
# Five cells per row: rank, school name, overall score, star rating, school-running level
td = numpy.array(td).reshape(-1, 5)

with open('university_ranking_alumni_edition.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for a in td:
        print(a)
        writer.writerow(a)
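
The script above decodes the response bytes as GBK rather than relying on the UTF-8 default. A small offline sketch of that step follows, under the assumption that the bytes really are GBK. `decode_page` is a hypothetical helper, and the sample bytes are built locally, not fetched from the site.

```python
def decode_page(raw: bytes, encoding: str = "gbk") -> str:
    """Decode raw response bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")


# Sample bytes built locally for illustration
raw = "大学排名".encode("gbk")
print(decode_page(raw))

# The same bytes are not valid UTF-8, so a strict UTF-8 decode raises
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)
```

Using `errors="replace"` is a design choice for scraping: one bad byte then costs a single character rather than the whole page.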

Exercise:
(1) Open http://dianying.2345.com/list/----2019---.html, click through to the following pages, and observe how the page URL changes.
(2) Scrape the name, cast, and score of every movie on the first page.
(3) Turn step 2 into a function and use a loop to collect the movie information from all pages.
(4) Save the scraped information to the file "latest_movie_info.csv".

import csv

import requests
from bs4 import BeautifulSoup

film_list = []
for page in range(1, 30):  # pages 1 to 29
    url = "http://dianying.2345.com/list/----2019---" + str(page) + ".html"
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')

    filename_tag = soup.find_all('em', class_='emTit')            # movie titles
    score_tag = soup.find_all('span', {"class": "pRightBottom"})  # scores
    star_tag = soup.find_all('span', {"class": "sDes"})           # cast lines

    # i indexes the movies within the current page
    for i in range(0, len(filename_tag)):
        tag = star_tag[i]
        if tag.em is not None:
            # Text after the full-width colon is the cast, separated by three non-breaking spaces
            temp = tag.text.strip().split("：")[1].split("\xa0\xa0\xa0")
        else:
            temp = [' ']  # no cast listed
        film_list += [[filename_tag[i].text, score_tag[i].em.text] + temp]

with open('latest_movie_info.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    #writer.writerow(["Title", "Score", "Cast"])
    for a in film_list:
        print(a)
        writer.writerow(a)
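
Step (3) asks for a function plus a loop, while the script above does everything inline. One possible shape for the function-based variant is sketched here. `page_urls` and `parse_movies` are hypothetical names, and the sample markup is a simplified stand-in that only reuses the class names from the script, not the real page structure.

```python
from bs4 import BeautifulSoup

BASE = "http://dianying.2345.com/list/----2019---{page}.html"


def page_urls(first, last):
    """Build the list-page URLs for pages first..last (inclusive)."""
    return [BASE.format(page=n) for n in range(first, last + 1)]


def parse_movies(html):
    """Return [title, score] pairs found in one list page."""
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.find_all("em", class_="emTit")
    scores = soup.find_all("span", class_="pRightBottom")
    rows = []
    # zip stops at the shorter list, so a missing score cannot raise IndexError
    for title, score in zip(titles, scores):
        score_text = score.em.get_text(strip=True) if score.em else ""
        rows.append([title.get_text(strip=True), score_text])
    return rows


# Simplified stand-in markup with placeholder values
sample = """
<ul>
  <li><span class="pRightBottom"><em>s1</em></span><em class="emTit">Film A</em></li>
  <li><span class="pRightBottom"><em>s2</em></span><em class="emTit">Film B</em></li>
</ul>
"""
print(page_urls(1, 3))
print(parse_movies(sample))
```

With these two pieces, the crawl reduces to fetching each URL from `page_urls` and extending one result list with `parse_movies(html)`.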

Exercise:
(1) Open http://www.zhcw.com/ssq/kaijiangshuju/index.shtml?type=0 and use the browser's "Inspect" tool to find the pattern behind the page's data source.
(2) Scrape the draw date, issue number, winning numbers, sales amount, first prize, and second prize of every draw on pages 1-150, and save the information to a CSV file.

import csv

import requests
from bs4 import BeautifulSoup

form = []
for i in range(1, 151):  # pages 1 to 150, as the exercise requires
    url1 = "http://kaijiang.zhcw.com/zhcw/html/ssq/list_%s.html" % (i)
    html1 = requests.get(url1).text
    soup = BeautifulSoup(html1, 'html.parser')
    tag = soup.find_all('tr')
    # Skip the first two rows and the last row, which are not data rows
    for a in tag[2:len(tag) - 1]:
        temp = []
        for b in a.contents[0:12]:
            if b != '\n':  # ignore the newline text nodes between cells
                temp += [b.text.strip().replace('\r\n', '').replace(' ', '').replace('\n', ' ')]
        form.append(temp)

# The output file name is a placeholder
with open('ssq_results.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Draw date', 'Issue', 'Winning numbers', 'Sales', 'First prize', 'Second prize'])
    for a in form:
        print(a)
        writer.writerow(a)
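
The inner loop above walks `a.contents` and filters out newline text nodes by hand. An alternative sketch that asks for the `td` cells directly is shown here. `parse_rows` and the sample markup are assumptions for illustration, not the real page structure, and the cell values are placeholders.

```python
from bs4 import BeautifulSoup


def parse_rows(html, skip_head=2, skip_tail=1):
    """Return the text of every <td> in each data row of the page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    data = []
    for tr in rows[skip_head:len(rows) - skip_tail]:
        # split() + join collapses any run of whitespace, including \r\n, into one space
        data.append([" ".join(td.get_text().split()) for td in tr.find_all("td")])
    return data


# Simplified stand-in markup with placeholder values
sample = """
<table>
  <tr><th colspan="3">caption</th></tr>
  <tr><th>date</th><th>issue</th><th>numbers</th></tr>
  <tr><td>date-1</td><td>issue-1</td><td><em>01</em>
      <em>02</em></td></tr>
  <tr><td colspan="3">pager</td></tr>
</table>
"""
print(parse_rows(sample))
```

Selecting cells by tag name avoids depending on exactly which whitespace nodes sit between them.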
