Scrape the Zhilian Zhaopin job site and save the results to MongoDB in 5 minutes

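Preface
This topic is covered in two articles.

1. Data collection

2. Data analysis

Part 1 covers data collection: extracting data from the website with Python.
1 Runtime environment and Python libraries
First, the runtime environment.

Python 3.5

Windows 7, 64-bit

Python libraries
Scraping the Zhilian Zhaopin site mainly involves the following Python libraries: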

requests

BeautifulSoup

multiprocessing

pymongo

itertools

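2 Main steps of the crawl

Generate the page links to crawl from the keyword, city, and page number.

Fetch the corresponding page content with requests.

Parse the page with BeautifulSoup and extract the key fields.

Store the extracted records in MongoDB, inserting new records or updating existing ones.

Start multiple processes with multiprocessing to speed up the crawl.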
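
The first step above can be sketched as follows. This is a minimal illustration, not the script's exact code: the helper name build_urls and the sample city names are hypothetical, and numbering pages from 1 is an assumption (the script later in this post passes range(TOTAL_PAGE_NUMBER), which starts at 0). The query parameters jl, kw, and p are the ones the script uses.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = 'http://sou.zhaopin.com/jobs/searchresult.ashx?'


def build_urls(keywords, cities, total_pages):
    """Return one search URL per (keyword, city, page) combination."""
    urls = []
    pages = range(1, total_pages + 1)  # assumption: pages are numbered from 1
    for keyword, city, page in product(keywords, cities, pages):
        params = {'jl': city, 'kw': keyword, 'p': page}
        urls.append(BASE_URL + urlencode(params))
    return urls


# Placeholder inputs, chosen only for illustration.
urls = build_urls(['python'], ['Beijing', 'Shanghai'], 2)
for url in urls:
    print(url)
```

urlencode also percent-encodes non-ASCII city names and keywords, so the values can be passed in as ordinary strings.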
3 File layout

Configuration file: zhilian_kw_config.py

Main spider script: zhilian_kw_spider.py

Set what to crawl in the configuration file, then run the main script to fetch the content.
The configuration file zhilian_kw_config.py contains:

# Code based on Python 3.x
# _*_ coding: utf-8 _*_
# __Author: "LEMON"

TOTAL_PAGE_NUMBER = 90  # total number of result pages to crawl; adjust as needed

KEYWORDS = ['   ', 'python', '    ']  # search keywords

# cities to search
ADDRESS = ['  ', '  ', '  ', '  ', '  ',
           '  ', '  ', '  ', '  ', '  ',
           '  ', '  ', '  ', '  ', '  ',
           '  ', '  ', '  ', '  ', '  ',
           '  ', '  ', '  ', '  ', '   ',
           '   ', '  ', '  ', '  ', '  ',
           '  ', '  ', '  ', '  ', '  ']

MONGO_URI = 'localhost'
MONGO_DB = 'zhilian'
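
To see how much work a given configuration implies, here is a rough sketch of how the settings expand into crawl tasks. The values below are small placeholders, not the real configuration; the pairing of each city with each page number mirrors what the spider script does with itertools.product.

```python
from itertools import product

# Placeholder values standing in for zhilian_kw_config.py (hypothetical).
TOTAL_PAGE_NUMBER = 3
KEYWORDS = ['python', 'java']
ADDRESS = ['CityA', 'CityB']

# One task per (city, page) pair.
tasks = list(product(ADDRESS, range(TOTAL_PAGE_NUMBER)))

# Each task loops over every keyword, so requests = tasks x keywords.
total_requests = len(tasks) * len(KEYWORDS)

print(tasks[:3])
print(len(tasks), total_requests)
```

With the configuration shown above (35 cities, 90 pages, 3 keywords), the same arithmetic gives 3,150 tasks and 9,450 page requests.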

The main spider script zhilian_kw_spider.py contains:

# Code based on Python 3.x
# _*_ coding: utf-8 _*_
# __Author: "LEMON"

from datetime import datetime
from urllib.parse import urlencode
from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup
import pymongo
from zhilian.zhilian_kw_config import *
import time
from itertools import product

client = pymongo.MongoClient(MONGO_URI)
db = client[MONGO_DB]

def download(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0'}
    response = requests.get(url, headers=headers)
    return response.text

def get_content(html):
    # record the date of the crawl
    date = datetime.now().date()
    date = datetime.strftime(date, '%Y-%m-%d')  # convert to str

    soup = BeautifulSoup(html, 'lxml')
    body = soup.body
    data_main = body.find('div', {'class': 'newlist_list_content'})

    if data_main:
        tables = data_main.find_all('table')

        for i, table_info in enumerate(tables):
            if i == 0:
                continue
            tds = table_info.find('tr').find_all('td')
            zwmc = tds[0].find('a').get_text()  # job title
            zw_link = tds[0].find('a').get('href')  # job link
            fkl = tds[1].find('span').get_text()  # feedback rate
            gsmc = tds[2].find('a').get_text()  # company name
            zwyx = tds[3].get_text()  # monthly salary
            gzdd = tds[4].get_text()  # work location
            gbsj = tds[5].find('span').get_text()  # publication date

            tr_brief = table_info.find('tr', {'class': 'newlist_tr_detail'})
            # job summary
            brief = tr_brief.find('li', {'class': 'newlist_deatil_last'}).get_text()

            # yield one record per job posting
            yield {'zwmc': zwmc,  # job title
                   'fkl': fkl,  # feedback rate
                   'gsmc': gsmc,  # company name
                   'zwyx': zwyx,  # monthly salary
                   'gzdd': gzdd,  # work location
                   'gbsj': gbsj,  # publication date
                   'brief': brief,  # job summary
                   'zw_link': zw_link,  # job link
                   'save_date': date  # date the record was saved
                   }

def main(args):
    basic_url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?'

    for keyword in KEYWORDS:
        mongo_table = db[keyword]
        paras = {'jl': args[0],
                 'kw': keyword,
                 'p': args[1]  # page number
                 }
        url = basic_url + urlencode(paras)
        # print(url)
        html = download(url)
        # print(html)
        if html:
            data = get_content(html)
            for item in data:
                # upsert keyed on the job link: insert new records, update existing ones
                if mongo_table.update_one({'zw_link': item['zw_link']}, {'$set': item}, upsert=True):
                    print('Saved record:', item)

if __name__ == '__main__':
    start = time.time()
    number_list = list(range(TOTAL_PAGE_NUMBER))
    args = product(ADDRESS, number_list)
    pool = Pool()
    pool.map(main, args)  # run main() for every (city, page) pair across worker processes
    end = time.time()
    print('Finished, task runs %s seconds.' % (end - start))
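
The save step in main() relies on an upsert keyed on the job link, so a second crawl updates existing postings instead of duplicating them. A minimal sketch of that pattern is below. The helper names build_upsert and save_item are hypothetical; update_one with upsert=True and the upserted_id attribute of its result are part of pymongo 3.0 and later.

```python
def build_upsert(item):
    """Return the (filter, update) pair for an upsert keyed on the job link."""
    return {'zw_link': item['zw_link']}, {'$set': item}


def save_item(collection, item):
    """Insert the record if its link is new, otherwise update it in place.

    `collection` is expected to be a pymongo collection.
    Returns True when a new document was inserted.
    """
    query, update = build_upsert(item)
    result = collection.update_one(query, update, upsert=True)
    return result.upserted_id is not None
```

Given a collection such as pymongo.MongoClient(MONGO_URI)[MONGO_DB][keyword], calling save_item(collection, item) should return True the first time a link is seen and False on later runs, which makes it easy to count new postings per crawl.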

For more content, follow the WeChat official account
"Pythonデータの道"

Reposted from: https://www.cnblogs.com/lemonbit/p/6886641.html
