Crawling Weibo fan lists, follow lists, and post text lists with Python


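All the code in this article is on GitHub at https://github.com/ximingren/clawer_sumary
The overall flow of the crawler is: obtain the cookie -> look up the user's information by nickname -> fetch the follow list -> fetch the fan list -> fetch the Weibo post text list.
The article is divided into the following parts.
1. Getting the cookie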

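    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By


    def get_cookie(driver_path, weibo_url, username, password):
        driver = webdriver.Chrome(driver_path)  # launch Chrome (Selenium 3 call style)
        driver.maximize_window()  # maximize the browser window
        driver.get(weibo_url)  # open the Weibo login page
        time.sleep(10)  # the page takes a while to load, so wait 10s
        driver.find_element(By.NAME, "username").send_keys(username)  # enter the user name
        driver.find_element(By.NAME, "password").send_keys(password)  # enter the password
        driver.find_element(By.XPATH, "//a[@node-type='submitBtn']").click()  # click the login button
        time.sleep(2)  # give the login request a moment before reading the cookies
        cookies = driver.get_cookies()  # list of cookie dicts
        # Join the cookies into one "name=value;" string for the Cookie header
        cookie = ""
        for item in cookies:
            cookie += item['name'] + "=" + item['value'] + ";"
        return cookie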

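The code above uses the webdriver library (from Selenium) to obtain the cookie. Webdriver opens a real browser, and driver.get_cookies() returns every cookie in the session. The cookies are then joined into a single name=value string, and that string is returned.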
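The cookie-joining step can be tried on its own, without launching a browser. The following is a minimal sketch: cookies_to_header is a hypothetical helper name that does not come from the repository, and the sample cookie names and values are made up. It takes a list of dictionaries in the shape that driver.get_cookies() returns and builds the value for the Cookie header.

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    # Only the 'name' and 'value' keys are used; other keys are ignored.
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Made-up sample data in the shape returned by driver.get_cookies()
sample = [
    {"name": "SUB", "value": "abc123", "domain": ".weibo.com"},
    {"name": "SUBP", "value": "xyz789", "domain": ".weibo.com"},
]
header_value = cookies_to_header(sample)
print(header_value)
```

Joining with "; " avoids the trailing semicolon that a manual loop leaves behind; either form carries the same name=value pairs.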
2. Adding the cookie to the headers so that requests and similar libraries can simulate a logged-in session

        # requires: import re, requests; from bs4 import BeautifulSoup
        headers['Cookie'] = cookie
        info_response = requests.get('http://s.weibo.com/user/' + names_list[x], headers=headers)  # user search URL
        info_soup = BeautifulSoup(info_response.text, 'html5lib')  # parse the HTML with BeautifulSoup
        info_soup = get_html(info_soup, "pl_user_feedList")
        weibo_info = info_soup.find_all('a', attrs={"class": "W_linkb", "target": "_blank"})  # links holding the user's stats
        id = weibo_info[0].get('href')  # link that contains the user id
        subs_size = weibo_info[0].string  # follow count
        fans_size = weibo_info[1].string  # fan count
        contents_size = weibo_info[2].string  # post count
        subs_size = int(re.sub(r"\D", "", subs_size))  # strip every non-digit and keep only the number
        fans_size = int(re.sub(r"\D", "", fans_size))
        contents_size = int(re.sub(r"\D", "", contents_size))
        id = int(re.findall(r'\d+', id)[0])
        return [subs_size, fans_size, contents_size, id]

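The code above adds the cookie obtained through the driver to headers and sends those headers when requests.get() fetches the page. The returned HTML is parsed with BeautifulSoup. After the relevant fields are cleaned up, the function returns the follow count, the fan count, the post count, and the user id.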
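The number extraction relies on two small regular-expression idioms, which can be checked separately from the scraping. This is a minimal sketch: parse_count and parse_user_id are hypothetical helper names, and the sample strings are invented, so they may not match Weibo's real markup.

```python
import re


def parse_count(text):
    """Drop every non-digit character and convert what is left to int."""
    return int(re.sub(r"\D", "", text))


def parse_user_id(href):
    """Return the first run of digits found in a profile link."""
    return int(re.findall(r"\d+", href)[0])


# Invented sample values, for illustration only
follows = parse_count("Following 312")
fans = parse_count("Fans 1,024")
user_id = parse_user_id("//weibo.com/u/1234567890?refer_flag=1001")
print(follows, fans, user_id)
```

Because everything that is not a digit is removed, an abbreviated count such as "1.2万" would come out as 12, so this approach only suits pages that show the full number.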
3. Getting the follow list

    # requires: from urllib.request import Request, urlopen; from bs4 import BeautifulSoup
    for page in range(1, subs_list_size + 1):
        subs_url = "https://weibo.com/p/100505" + weibo_id + "/follow?page=" + str(page) + "#Pl_Official_HisRelation__59"  # URL of one page of the follow list
        subs_request = Request(subs_url, headers=headers)
        subs_response = urlopen(subs_request)
        subs_html = subs_response.read().decode()
        subs_soup = BeautifulSoup(subs_html, 'html5lib')
        subs_soup = get_html(subs_soup, "WB_cardwrap S_bg")
        subs_list = subs_soup.find_all('a', attrs={"class": "S_txt1", "target": "_blank"})  # links to the followed users

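The code above uses Request from urllib instead of requests, because with requests the response was the login page, and I do not fully understand why. (One possible cause: requests.get(url, headers) treats the second positional argument as params, so the cookie is never sent unless it is passed as headers=headers.) Fetching the fan list works in much the same way.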
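The paging logic can be separated from the network call so that the URLs and headers are easy to inspect. The sketch below assumes the URL pattern shown in the article, which may no longer be valid; build_follow_requests is a hypothetical helper name and the id and cookie values are made up. It only constructs urllib Request objects and sends nothing. The "#..." fragment is left out because a URL fragment is not transmitted to the server.

```python
from urllib.request import Request


def build_follow_requests(weibo_id, page_count, headers):
    """Yield one urllib Request per page of the follow list (nothing is sent)."""
    for page in range(1, page_count + 1):
        url = f"https://weibo.com/p/100505{weibo_id}/follow?page={page}"
        yield Request(url, headers=headers)


# Made-up id and cookie, for illustration only
headers = {"Cookie": "SUB=abc123", "User-Agent": "Mozilla/5.0"}
requests_to_send = list(build_follow_requests("1234567890", 3, headers))
for req in requests_to_send:
    print(req.full_url, req.get_header("Cookie"))
```

Each Request can then be passed to urlopen() as in the loop above; the same generator would serve the fan list if the path segment is changed accordingly.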
4. Getting the Weibo post text list

            # requires: import urllib.parse; from urllib.request import Request, urlopen
            params = urllib.parse.urlencode(
                {'__rnd': get_timestamp(), 'page': page, 'pagebar': pagebar, "id": "100505" + weibo_id,
                 "script_uri": "/p/" + "100505" + weibo_id,
                 'ajwvr': 6, 'domain': 100505, "pl_name": "Pl_Official_MyProfileFeed__22", "profile_ftype": 1,
                 'feed_type': 0, 'domain_op': 100505})  # query parameters for the feed API
            request = Request(api_url + "?" + params, headers=headers)
            print("--------------- requesting the post data")
            response = urlopen(request)
            html = response.read().decode('utf8')  # response body, with the HTML embedded in escaped form
            # Locate the div markup. The search strings are missing from the source text,
            # so "<div" and "div>" are assumptions consistent with the "+ 4" below.
            html_start = html.find("<div")
            html_end = html.rfind("div>")
            parser_html = html[html_start:html_end + 4]
            cont_html = parser_html.replace('\\"', '"')  # undo the escaped quotes
            cont_html = cont_html.replace('\\/', '/')  # undo the escaped slashes
            print("----------------- parsing page %d " % count)
            cont_soup = BeautifulSoup(cont_html, 'html5lib')  # parse the cleaned HTML
            text_list = cont_soup.find_all('div', attrs={"class": 'WB_text W_f14', "node-type": "feed_list_content"})  # post text
            time_list = cont_soup.find_all('a', attrs={"class": 'S_txt2', 'node-type': "feed_list_item_date"})  # post time
            phone_list = cont_soup.find_all('a', attrs={"action-type": "app_source"})  # client the post was sent from
               
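The code above builds the request URL from the query parameters, then cuts the HTML out of the response and parses it to collect the text, time, and source of each post.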
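The two string-handling steps, building the query string and unescaping the embedded HTML, can be exercised without contacting Weibo. This is a minimal sketch under stated assumptions: build_query and extract_html are hypothetical helper names, only a subset of the article's parameters is encoded, and the sample payload is invented to mimic a response in which quotes and slashes are backslash-escaped.

```python
from urllib.parse import urlencode


def build_query(weibo_id, page, pagebar):
    """Encode a subset of the article's parameters as a URL query string."""
    return urlencode({
        "page": page,
        "pagebar": pagebar,
        "id": "100505" + weibo_id,
        "script_uri": "/p/100505" + weibo_id,
    })


def extract_html(raw):
    """Cut the div markup out of an escaped payload and undo the escaping."""
    start = raw.find("<div")
    end = raw.rfind("div>")
    if start == -1 or end == -1:
        return ""  # no markup found
    fragment = raw[start:end + 4]  # + 4 keeps the closing "div>"
    return fragment.replace('\\"', '"').replace('\\/', '/')


query = build_query("1234567890", 2, 0)
# Invented payload: quotes and slashes are escaped as in a JSON string
raw = r'{"code":"100000","data":"<div class=\"WB_text\">hello<\/div>"}'
html = extract_html(raw)
print(query)
print(html)
```

If the endpoint does return valid JSON, decoding it with json.loads and reading the HTML field would be a more robust alternative to manual slicing, but that depends on a response shape the article does not confirm.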
