Python Crawler (4): New Zongheng Chinese Web crawler demo added -- scraping 136book (136书屋) novels, saving them as local text files, and comparing single-process vs. multi-process efficiency (using 三生三世十里桃花 / Three Lives Three Worlds, Ten Miles of Peach Blossoms as the example)


Runtime environment: Python 3.6. Updated 2019-05-24: because the pages covered in the original post have been redesigned, a crawler demo for Zongheng Chinese Web (book.zongheng.com) has been added.
  • The site has anti-crawling measures that can make the crawler fail; both of the following workarounds have been tested and work (a minimal sketch of both is shown right after this list).
  • Use proxy IPs (a proxy-IP extraction interface was written separately; see link).
  • Add the Cookie generated by visiting the site in a browser to the request headers.

  • This crawler cannot correctly fetch content that is only accessible to VIP accounts.
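A minimal sketch of both workarounds on a requests session is shown below. The proxy address and the Cookie value are placeholders, not working values: substitute a proxy obtained from your own extraction interface and the Cookie string copied from the browser's developer tools.

    # Minimal sketch of the two anti-crawling workarounds (placeholder values).
    import requests

    session = requests.session()

    # 1) Route requests through a proxy IP (placeholder address -- use your own proxy).
    session.proxies.update({
        'http': 'http://127.0.0.1:8888',
        'https': 'http://127.0.0.1:8888',
    })

    # 2) Reuse the Cookie produced by a normal browser visit (placeholder value --
    #    copy the real string from the browser's developer tools).
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
        'Cookie': 'name1=value1; name2=value2',
    })

    resp = session.get('http://book.zongheng.com/book/840152.html')
    print(resp.status_code)

The complete Zongheng demo follows.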
    # -*- coding: utf-8 -*-
    # @Author : Leo
    
    import re
    import os
    import logging
    import requests
    from bs4 import BeautifulSoup
    from requests.adapters import HTTPAdapter
    
    logging.basicConfig(level=logging.INFO,  # log INFO and above
                        format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                        datefmt='%a, %d %b %Y %H:%M:%S')
    
    
    class ZonghengSpider:
        """
               
        - http://book.zongheng.com/
        """
        # directory where downloaded novels are saved
        novel_save_dir = 'novels'
        session = requests.session()
        # retry failed requests up to 3 times
        session.mount('http://', HTTPAdapter(max_retries=3))
        session.mount('https://', HTTPAdapter(max_retries=3))
    
        def __init__(self):
            self.session.headers.update(
                {'Host': 'book.zongheng.com',
                 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) '
                               'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'})
            self.chapter_url = 'http://book.zongheng.com/api/chapter/chapterinfo?bookId={book_id}&chapterId={chapter_id}'
    
        def crawl(self, target_url: str):
            """
                    url
            :param target_url:           URL
            :return:
            """
    
            def request_url(url):
                resp = self.session.get(url=url)
                if resp.status_code == 200:
                    return resp.json()
                else:
                    return None
    
            book_name, book_id, chapter_id = self.get_page_info(target_url)
            logging.info(f'Book name: {book_name}, book ID: {book_id}, first chapter ID: {chapter_id}')
            if all([book_name, book_id, chapter_id]):
                # create the save directory for this novel
                novel_save_path = os.path.join(self.novel_save_dir, book_name)
                if not os.path.exists(novel_save_path):
                    os.makedirs(novel_save_path)
                logging.info(f'Save path: {novel_save_path}')
                index = 0
                while True:
                    index += 1
                    chapter_url = self._get_chapter_url(book_id, chapter_id)
                    logging.info(f'Chapter API URL: {chapter_url}')
                    chapter_json = request_url(url=chapter_url)
                    if chapter_json is not None:
                        chapter_data = chapter_json.get('data')
                        if not chapter_data:
                            break
                        # chapter name and raw HTML content
                        chapter_name = chapter_data.get('chapterName')
                        content_raw = chapter_data.get('content', '')
                        # strip HTML tags, keeping only the text nodes
                        clear_content = '\n'.join(
                            [repr(p).strip('\'') for p in BeautifulSoup(content_raw, 'html.parser').strings])
                        # TODO filter out ads, author notes, etc.
                        chapter_file = os.path.join(novel_save_path, str(index) + '-' + chapter_name + '.txt')
                        with open(chapter_file, 'w', encoding='utf8') as f:
                            f.write(clear_content)
                        logging.info('Saved chapter > %s' % chapter_file)
                        # id of the next chapter
                        chapter_id = chapter_data.get('nexCid')
                    else:
                        logging.error(f'Chapter request failed, URL: {chapter_url}')
                        break
                logging.info('Novel crawl finished')

        def get_page_info(self, homepage_url):
            """
            Extract the book name, book id and first chapter id from the novel homepage
            :param homepage_url: novel homepage url
            :return:
            """
            resp = self.session.get(url=homepage_url)
            if resp.status_code == 200:
                soup = BeautifulSoup(resp.text, 'html.parser')
                book_name = soup.find('div', {'class': 'book-name'}).get_text().strip()
                first_chapter_tag = soup.find('a', {'class': 'btn read-btn', 'href': True})
                if first_chapter_tag is not None:
                    first_chapter_url = first_chapter_tag.get('href')
                    result = re.findall(r'chapter/(\d+)/(\d+).html', first_chapter_url)
                    book_id, chapter_id = result[0] if result else (None, None)
                    return book_name, book_id, chapter_id
            else:
                logging.error('Failed to request the novel homepage!')
            return None, None, None

        def _get_chapter_url(self, book_id, chapter_id):
            """
            Build the chapter API URL
            :param book_id:
            :param chapter_id:
            :return:
            """
            return self.chapter_url.format(book_id=book_id, chapter_id=chapter_id)


    if __name__ == '__main__':
        spider = ZonghengSpider()
        spider.crawl(target_url='http://book.zongheng.com/book/840152.html')
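As a quick check before launching a full crawl, the sketch below fetches only the first chapter's JSON and prints the three fields the crawler relies on. It assumes the page layout and chapter API used by the class above still respond as expected and that no anti-crawling block kicks in; run it alongside the class definition.

    # Probe a single chapter through the same API the spider uses.
    spider = ZonghengSpider()
    book_name, book_id, chapter_id = spider.get_page_info('http://book.zongheng.com/book/840152.html')
    resp = spider.session.get(spider._get_chapter_url(book_id, chapter_id))
    data = resp.json().get('data', {})
    print(book_name, data.get('chapterName'), data.get('nexCid'))  # title, chapter name, next chapter id
    print(len(data.get('content', '')), 'characters of raw chapter HTML')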

    Overview
  • Novel site: http://www.136book.com/
  • By changing the URL of a specific novel on 136book, the chapters of different novels can be downloaded in batch.
  • This code uses 三生三世十里桃花 (Three Lives Three Worlds, Ten Miles of Peach Blossoms) as the example: http://www.136book.com/sanshengsanshimenglitaohua/

  • Demonstration of the run results (the screenshot from the original post is missing).
    book136_singleprocess.py
    Saving novel chapters with a single process
    #!/usr/bin/env python 
    # -*- coding: utf-8 -*- 
    # @Author : Woolei
    # @File : book136_singleprocess.py
    
    import requests
    import time
    import os
    from bs4 import BeautifulSoup
    
    
    headers = {
        'User-Agent':
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
    }
    
    
    # Download one chapter's content and append it to a text file named after the chapter
    def getChapterContent(each_chapter_dict):
        content_html = requests.get(each_chapter_dict['chapter_url'], headers=headers).text
        soup = BeautifulSoup(content_html, 'lxml')
        content_tag = soup.find('div', {'id': 'content'})
        p_tag = content_tag.find_all('p')
        print('Downloading chapter --> ' + each_chapter_dict['name'])
        for each in p_tag:
            paragraph = each.get_text().strip()
            with open(each_chapter_dict['name'] + r'.txt', 'a', encoding='utf8') as f:
                f.write('  ' + paragraph + '\n\n')


    # Collect every chapter's name and url from the novel's table-of-contents page
    def getChapterInfo(novel_url):
        import re
        chapter_html = requests.get(novel_url, headers=headers).text
        soup = BeautifulSoup(chapter_html, 'lxml')
        chapter_list = soup.find_all('li')
        chapter_all_dict = {}
        for each in chapter_list:
            chapter_each = {}
            chapter_each['name'] = each.find('a').get_text()  # chapter name
            chapter_each['chapter_url'] = each.find('a')['href']  # chapter url
            chapter_num = int(re.findall(r'\d+', each.get_text())[0])  # chapter number
            chapter_all_dict[chapter_num] = chapter_each  # index chapters by their number
        return chapter_all_dict


    if __name__ == '__main__':
        start = time.clock()
        # table-of-contents url of the novel
        novel_url = 'http://www.136book.com/sanshengsanshimenglitaohua/'
        novel_info = getChapterInfo(novel_url)
        # folder to save the chapter files into (any folder name works)
        dir_name = '三生三世十里桃花'
        if not os.path.exists(dir_name):
            os.mkdir(dir_name)
        os.chdir(dir_name)
        # download the chapters one by one
        for each in novel_info:
            getChapterContent(novel_info[each])
            # time.sleep(1)
        end = time.clock()
        # report how many chapters were fetched and how long it took
        print('Downloaded %d chapters, total time: %f s' % (len(novel_info), (end - start)))

    **Downloading chapters one by one in a single process is slow; the multi-process version below speeds this up.**

    book136_multiprocess.py
    Saving novel chapters with multiple processes
    #!/usr/bin/env python 
    # -*- coding: utf-8 -*- 
    # @Author : Woolei
    # @File : book136_2.py 
    
    
    import requests
    import time
    import os
    from bs4 import BeautifulSoup
    from multiprocessing import Pool
    
    url = 'http://www.136book.com/huaqiangu/ebxeeql/'
    headers = {
        'User-Agent':
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
    }
    
    
    # Download one chapter's content and append it to a text file named after the chapter
    def getChapterContent(each_chapter_dict):
        content_html = requests.get(each_chapter_dict['chapter_url'], headers=headers).text
        soup = BeautifulSoup(content_html, 'lxml')
        content_tag = soup.find('div', {'id': 'content'})
        p_tag = content_tag.find_all('p')
        print('Downloading chapter --> ' + each_chapter_dict['name'])
        for each in p_tag:
            paragraph = each.get_text().strip()
            with open(each_chapter_dict['name'] + r'.txt', 'a', encoding='utf8') as f:
                f.write('  ' + paragraph + '\n\n')


    # Collect every chapter's name and url from the novel's table-of-contents page
    def getChapterInfo(novel_url):
        import re
        chapter_html = requests.get(novel_url, headers=headers).text
        soup = BeautifulSoup(chapter_html, 'lxml')
        chapter_list = soup.find_all('li')
        chapter_all_dict = {}
        for each in chapter_list:
            chapter_each = {}
            chapter_each['name'] = each.find('a').get_text()  # chapter name
            chapter_each['chapter_url'] = each.find('a')['href']  # chapter url
            chapter_num = int(re.findall(r'\d+', each.get_text())[0])  # chapter number
            chapter_all_dict[chapter_num] = chapter_each  # index chapters by their number
        return chapter_all_dict


    if __name__ == '__main__':
        start = time.clock()
        novel_url = 'http://www.136book.com/sanshengsanshimenglitaohua/'
        novel_info = getChapterInfo(novel_url)
        # folder to save the chapter files into (any folder name works)
        dir_name = '三生三世十里桃花'
        if not os.path.exists(dir_name):
            os.mkdir(dir_name)
        os.chdir(dir_name)
        pool = Pool(processes=10)  # create a pool of 10 worker processes
        # download all chapters in parallel
        pool.map(getChapterContent, [novel_info[each] for each in novel_info])
        pool.close()
        pool.join()
        end = time.clock()
        print('Downloaded %d chapters, total time: %f s' % (len(novel_info), (end - start)))
  • While the script runs, Task Manager shows that 10 child processes are created (processes=10). Using more processes can raise throughput, but creating far more processes than the machine can handle without regard to its performance will slow it down instead; a small sketch of bounding the pool size by the CPU count follows.
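A hedged way to choose the pool size is to cap it by the number of CPU cores rather than hard-coding 10. The sketch below is one possible rule of thumb; pick_pool_size is a hypothetical helper, not part of the original script.

    import os
    from multiprocessing import Pool


    def pick_pool_size(cap=10):
        # never start more workers than the machine has CPU cores
        return min(cap, os.cpu_count() or 1)


    if __name__ == '__main__':
        pool = Pool(processes=pick_pool_size())
        # ... pool.map(getChapterContent, chapter_dicts) as in the script above ...
        pool.close()
        pool.join()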