Crawling Google Search Results

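In the previous part of this tutorial, we built a very simple spider that can crawl Google's regular search results. In this part, we'll go a step further.
Warning: Never use this spider to scrape large amounts of data. Google provides a public API that can be called 100 times for free, and if Google notices unusual traffic from your computer, your IP will be banned. This spider is built for learning purposes only and should not be used in real projects. Please keep that in mind.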

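What we're crawling

OK, let me explain. If you search Google for Python, for example, a card containing videos related to your search appears in the results.

The part with the Videos heading is what we're going to crawl. Easy, right? Well, it's not going to be as easy as Part 1.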

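Analysis time

OK, now we know what we're crawling.
If you inspect the results, you'll find that they sit inside a g-scrolling-carousel element.

Inside it are g-inner-card elements, each containing the details of one video.

Now that we have all the containers, let's look at the details. First, we need the video title. It is inside a div element with the attribute role="heading".

...and the link is inside an <a> element.

Then we look for the video author, which is inside a div element with this style: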

max-height:1.5800000429153442em;min-height:1.5800000429153442em;font-size:14px;padding:2px 0 0;line-height:1.5800000429153442em

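We also need the video's source, or platform: YouTube, for example. It is inside a span whose parent is a div with the style font-size:14px;padding:1px 0 0 0;line-height:1.5800000429153442em.

We'll also get the video's upload date. It comes right after the source span, inside the same parent element. When we get to the code, we'll strip the source text out of the container's text to get the date.

Finally, we look for the video length. It is in the second child element of a div with the style height:118px;width:212px.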
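
To make the structure described above concrete, here is a minimal sketch that walks a hand-written HTML fragment of the same shape. It uses the standard library's html.parser rather than BeautifulSoup, purely so that it runs without extra packages. The fragment, the URLs, and the CardTitleParser class are all made up for illustration, and Google's real markup may differ.

```python
from html.parser import HTMLParser

# Hypothetical fragment that mimics the structure described above
SAMPLE_HTML = """
<g-scrolling-carousel>
  <g-inner-card>
    <a href="https://example.com/watch?v=1">
      <div role="heading">First video</div>
    </a>
  </g-inner-card>
  <g-inner-card>
    <a href="https://example.com/watch?v=2">
      <div role="heading">Second video</div>
    </a>
  </g-inner-card>
</g-scrolling-carousel>
"""


class CardTitleParser(HTMLParser):
    """Collect (title, url) pairs from g-inner-card elements."""

    def __init__(self):
        super().__init__()
        self.in_card = False
        self.in_heading = False
        self.current_url = None
        self.current_title = []
        self.videos = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'g-inner-card':
            # A new card starts: reset the per-card state
            self.in_card = True
            self.current_url = None
            self.current_title = []
        elif self.in_card and tag == 'a' and self.current_url is None:
            # The first link inside the card is the video link
            self.current_url = attrs.get('href')
        elif self.in_card and tag == 'div' and attrs.get('role') == 'heading':
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == 'div' and self.in_heading:
            self.in_heading = False
        elif tag == 'g-inner-card' and self.in_card:
            title = ''.join(self.current_title).strip()
            self.videos.append((title, self.current_url))
            self.in_card = False

    def handle_data(self, data):
        # Only text inside the heading div belongs to the title
        if self.in_heading:
            self.current_title.append(data)


parser = CardTitleParser()
parser.feed(SAMPLE_HTML)
videos = parser.videos
```

Each entry in videos pairs a card's heading text with the first link found in that card, which mirrors the title and URL lookups described above.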

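Where's the cover?

Where? Well, it's inside the JavaScript. The page has <script> tags that contain Base64 images. Copy one of them out and you'll probably get a video cover.

So now that we have the information, let's see how we can find them. The easiest way is to find the parent <div> and then find all of its script tags. But there are a lot of them! The method I use is to find the element <span id="fld"></span> and then look for its sibling elements, the script tags. The tags we are looking for are the last three of these script elements, that is, all of them except the first one. We can simply use [1:] in Python to get rid of the first.

OK, hands on!

Create a method called __search_video and put all of our code inside it.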

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        """Search for video results based on the given response

        Args:
            response (requests.Response): the response requested to Google search

        Returns:
            list: A list of found video results, usually three if found
        """
        pass

Then create a BeautifulSoup object from the response.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        """Search for video results based on the given response

        Args:
            response (requests.Response): the response requested to Google search

        Returns:
            list: A list of found video results, usually three if found
        """
        soup = BeautifulSoup(response.text, 'html.parser')

Now, let's find the g-inner-card elements:

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        cards = soup.find('g-scrolling-carousel').findAll('g-inner-card')

Then loop through the cards and generate the search results.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        results = []
        # Generate video information
        for card in cards:
            try:  # Just in case
                # Title
                title = card.find('div', role='heading').text
                # Video length
                length = card.findAll('div', style='height:118px;width:212px')[
                    1].findAll('div')[1].text
                # Video upload author
                author = card.find(
                    'div', style='max-height:1.5800000429153442em;min-height:1.5800000429153442em;font-size:14px;padding:2px 0 0;line-height:1.5800000429153442em').text
                # Video source (Youtube, for example)
                source = card.find(
                    'span', style='font-size:14px;padding:1px 0 0 0;line-height:1.5800000429153442em').text
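                # Video publish date
                # The date shares a container with the source, so drop the
                # source prefix (lstrip() would strip a character set instead)
                date = card.find(
                    'div', style='font-size:14px;padding:1px 0 0 0;line-height:1.5800000429153442em').text
                if date.startswith(source):
                    date = date[len(source):]
                date = date.lstrip('- ')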
                # Video link
                url = card.find('a')['href']
            except IndexError:
                continue
            else:
                # Append result
                results.append({
                    'title': title,
                    'length': length,
                    'author': author,
                    'source': source,
                    'date': date,
                    'url': url,
                    'type': 'video'
                })
        return results
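
A note on removing the source from the container text: Python's str.lstrip(chars) treats its argument as a set of characters to strip, not as a prefix, so passing the source name to it can remove more than intended. The sketch below compares the two approaches. The strip_source helper and the sample strings are hypothetical, chosen only for illustration.

```python
def strip_source(text: str, source: str) -> str:
    """Remove `source` from the start of `text` only if it is a real prefix."""
    if source and text.startswith(source):
        text = text[len(source):]
    # Drop the separator left between the source and the date
    return text.lstrip('- ')


# Hypothetical container text in the form "<source> - <date>"
ok = strip_source('YouTube - Aug 5, 2020', 'YouTube')

# lstrip() removes any leading characters found in its argument, so when
# the text does not start with the source it can still eat into the date:
# 'D' is in 'Dailymotion', so the 'D' of 'Dec' is stripped.
greedy = 'Dec 5, 2020'.lstrip('Dailymotion')

# The prefix check leaves the same text untouched
exact = strip_source('Dec 5, 2020', 'Dailymotion')
```

On Python 3.9 and later, str.removeprefix does the same prefix check in one call.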

And finally, we put the cover part together. First, let's create a variable called covers_ to store the scripts we found. Note the use of [1:] to slice the list and drop the first tag.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        # Pre-process the video covers
        covers_ = soup.find('span', id='fld').findNextSiblings('script')[1:]
        # ...

Then loop through them.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        # Pre-process the video covers
        covers_ = soup.find('span', id='fld').findNextSiblings('script')[1:]
        # Get the cover images
        covers = []
        for c in covers_:
            pass  # TODO: extract the Base64 image from each script
        # ...

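Then add each Base64-encoded image to the list covers. Here we need to remove the unneeded JavaScript code and keep only the image. If you don't know the rsplit method I used in my code, it is a variant of split that scans from the end of the string and splits from there. For example, given a variable text: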

>>> text = 'Hi everyone! Would you like to say Hi to me?'

If we split it the normal way:

>>> text.split('Hi', 1)
['', ' everyone! Would you like to say Hi to me?']

But with rsplit:

>>> text.rsplit('Hi', 1)
['Hi everyone! Would you like to say ', ' to me?']

The reason I used it here is that there might be another ;var ii inside the Base64 image, and with split that would produce a broken image.
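
To see why the direction of the split matters, here is a small sketch with a made-up script string in which the separator text appears twice. The string contents are an assumption for illustration only, not real Google output.

```python
# Made-up script: the separator text also occurs inside the data itself
script = "var s='AAA';var iiBBB';var ii=['thumb'];"
sep = "';var ii"

# Everything after the opening s=' is the candidate payload
payload = script.split("s='", 1)[1]

# split() cuts at the first separator and truncates the payload
first_cut = payload.split(sep)[0]

# rsplit() with maxsplit=1 cuts only at the last separator
last_cut = payload.rsplit(sep, 1)[0]
```

Here first_cut keeps only the text before the first separator, while last_cut keeps everything up to the final one.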

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        # Pre-process the video covers
        covers_ = soup.find('span', id='fld').findNextSiblings('script')[1:]
        # Get the cover images
        covers = []
        for c in covers_:
            # Fetch cover image
            try:
                covers.append(str(c).split('s=\'')[-1].split(
                    '\';var ii')[0].rsplit('\\', 1)[0])
            except IndexError:
                pass
        # ...
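
Once a cover string has been extracted, it can be turned back into image bytes. The following is a minimal sketch, assuming the string is a data URI of the form data:image/...;base64,<payload> and that its trailing = padding may have been cut off along with anything after the last backslash. The data_uri_to_bytes helper and the sample value are hypothetical.

```python
import base64


def data_uri_to_bytes(uri: str) -> bytes:
    """Decode a Base64 data URI, restoring any missing '=' padding."""
    # Keep only the part after the first comma:
    # "data:image/jpeg;base64,<payload>"
    payload = uri.split(',', 1)[1]
    # A Base64 string must have a length that is a multiple of 4,
    # so re-add whatever padding was stripped
    payload += '=' * (-len(payload) % 4)
    return base64.b64decode(payload)


# Hypothetical cover: the bytes b'hello' encoded, with the padding removed
cover = 'data:image/jpeg;base64,aGVsbG8'
image_bytes = data_uri_to_bytes(cover)
```

The resulting bytes could then be written to a file opened in binary mode.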

...and add the cover to each result.

class GoogleSpider(object):
    # ...
    def __search_video(self, response: requests.Response) -> list:
        # ...
        for card in cards:
            # ...
            try:  # Just in case
                # Video cover image
                try:  # Just in case that the cover wasn't found in page's JavaScript
                    cover = covers[cards.index(card)]
                except IndexError:
                    cover = None
            except IndexError:
                continue
            else:
                # Append result
                results.append({
                    # ...
                    'cover': cover,
                    # ...
                })
        # ...
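
Looking up each cover with cards.index(card) searches the list again on every iteration. One alternative, sketched here with placeholder lists rather than real parsed tags, is to pair cards and covers by position with itertools.zip_longest, which fills in None when there are fewer covers than cards.

```python
from itertools import zip_longest

# Placeholder data standing in for the parsed cards and extracted covers
cards = ['card-a', 'card-b', 'card-c']
covers = ['cover-a', 'cover-b']

paired = []
for card, cover in zip_longest(cards, covers):
    if card is None:
        # More covers than cards: nothing left to attach them to
        break
    # cover is None when no cover exists for this position
    paired.append({'card': card, 'cover': cover})
```

This keeps the same fallback as the code above: a card without a matching cover gets None.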

So we're almost there. There is one more thing to handle: when we search for something that has no video results, our program raises an AttributeError. To prevent that, we add a try-except:

class GoogleSpider(object):
    # ...

    def __search_video(self, response: requests.Response) -> list:
        # ...
        try:
            cards = soup.find('g-scrolling-carousel').findAll('g-inner-card')
        except AttributeError:
            return []
        # ...
        for card in cards:
            try:
                # Title
                # If the container is not about videos, there won't be a div with
                # attrs `role="heading"`. So to catch that, I've added a try-except
                # to catch the error and return.
                try:
                    title = card.find('div', role='heading').text
                except AttributeError:
                    return []
                # ...
            except IndexError:
                continue
            else:
                # ...
        return results
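
The AttributeError comes from calling a method on None, which find() returns when nothing matches. An explicit None check is an alternative to catching the exception. The sketch below shows that pattern with the standard library's xml.etree.ElementTree, whose find() also returns None on a miss. It is only a stand-in for BeautifulSoup here, and the documents and the find_cards helper are made up.

```python
import xml.etree.ElementTree as ET


def find_cards(root):
    """Return the card elements, or an empty list if there is no carousel."""
    carousel = root.find('g-scrolling-carousel')
    if carousel is None:
        # Explicit check instead of catching AttributeError
        return []
    return carousel.findall('g-inner-card')


# Made-up page without any video carousel
page = ET.fromstring('<body><div>No videos here</div></body>')
cards = find_cards(page)

# Made-up page with a carousel holding two cards
with_videos = ET.fromstring(
    '<body><g-scrolling-carousel><g-inner-card/><g-inner-card/>'
    '</g-scrolling-carousel></body>')
found = find_cards(with_videos)
```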

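I've restructured the GoogleSpider class, so you may want to do the same. I moved all of the code from Part 1 into a __search_result method and added a search method. The search method calls all of the private methods and puts the results together.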

class GoogleSpider(object):
    # ...

    def search(self, query: str, page: int = 1) -> dict:
        """Search Google

        Args:
            query (str): The query to search for
            page (int): The page number of search result

        Returns:
            dict: The search results and the total page number
        """
        # Get response
        response = self.__get_source(
            'https://www.google.com/search?q=%s&start=%d' % (quote(query), (page - 1) * 10))
        results = []
        video = self.__search_video(response)
        result = self.__search_result(response)
        pages = self.__get_total_page(response)
        results.extend(result)
        results.extend(video)
        return {
            'results': results,
            'pages': pages
        }

    # ...
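
The request URL is built from two pieces: the percent-encoded query and a start offset of ten results per page. Here is a minimal sketch that isolates that step. The build_search_url helper and its page check are hypothetical additions, not part of the class above.

```python
from urllib.parse import quote


def build_search_url(query: str, page: int = 1) -> str:
    """Build the search URL for a 1-based page number."""
    if page < 1:
        raise ValueError('page must be 1 or greater')
    # Each result page holds ten entries, so page 1 starts at offset 0
    start = (page - 1) * 10
    return 'https://www.google.com/search?q=%s&start=%d' % (quote(query), start)


first = build_search_url('python tutorial')
third = build_search_url('c++ & rust', page=3)
```

Passing the query through quote matters for input such as c++ & rust, where an unescaped & would otherwise be read as the start of a new URL parameter.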

Full code

The complete code up to Part 2 of this tutorial is below.

# Import dependencies
from pprint import pprint
from urllib.parse import quote

import requests
from bs4 import BeautifulSoup


class GoogleSpider(object):
    def __init__(self):
        """Crawl Google search results

        This class is used to crawl Google's search results using requests and BeautifulSoup.
        """
        super().__init__()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:79.0) Gecko/20100101 Firefox/79.0',
            'Host': 'www.google.com',
            'Referer': 'https://www.google.com/'
        }

    def __get_source(self, url: str) -> requests.Response:
        """Get the web page's source code

        Args:
            url (str): The URL to crawl

        Returns:
            requests.Response: The response from URL
        """
        return requests.get(url, headers=self.headers)

    def __search_video(self, response: requests.Response) -> list:
        """Search for video results based on the given response

        Args:
            response (requests.Response): the response requested to Google search

        Returns:
            list: A list of found video results, usually three if found
        """
        soup = BeautifulSoup(response.text, 'html.parser')
        try:
            cards = soup.find('g-scrolling-carousel').findAll('g-inner-card')
        except AttributeError:
            return []
        # Pre-process the video covers
        covers_ = soup.find('span', id='fld').findNextSiblings('script')[1:]
        # Get the cover images
        covers = []
        for c in covers_:
            # Fetch cover image
            try:
                covers.append(str(c).split('s=\'')[-1].split(
                    '\';var ii')[0].rsplit('\\', 1)[0])
            except IndexError:
                pass
        results = []
        # Generate video information
        for card in cards:
            try:
                # Title
                # If the container is not about videos, there won't be a div with
                # attrs `role="heading"`. So to catch that, I've added a try-except
                # to catch the error and return.
                try:
                    title = card.find('div', role='heading').text
                except AttributeError:
                    return []
                # Video length
                length = card.findAll('div', style='height:118px;width:212px')[
                    1].findAll('div')[1].text
                # Video upload author
                author = card.find(
                    'div', style='max-height:1.5800000429153442em;min-height:1.5800000429153442em;font-size:14px;padding:2px 0 0;line-height:1.5800000429153442em').text
                # Video source (Youtube, for example)
                source = card.find(
                    'span', style='font-size:14px;padding:1px 0 0 0;line-height:1.5800000429153442em').text
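                # Video publish date
                # The date shares a container with the source, so drop the
                # source prefix (lstrip() would strip a character set instead)
                date = card.find(
                    'div', style='font-size:14px;padding:1px 0 0 0;line-height:1.5800000429153442em').text
                if date.startswith(source):
                    date = date[len(source):]
                date = date.lstrip('- ')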
                # Video link
                url = card.find('a')['href']
                # Video cover image
                try:  # Just in case that the cover wasn't found in page's JavaScript
                    cover = covers[cards.index(card)]
                except IndexError:
                    cover = None
            except IndexError:
                continue
            else:
                # Append result
                results.append({
                    'title': title,
                    'length': length,
                    'author': author,
                    'source': source,
                    'date': date,
                    'cover': cover,
                    'url': url,
                    'type': 'video'
                })
        return results

    def __get_total_page(self, response: requests.Response) -> int:
        """Get the current total pages

        Args:
            response (requests.Response): the response requested to Google using requests

        Returns:
            int: the total page number (might be changing when increasing / decreasing the current page number)
        """
        soup = BeautifulSoup(response.text, 'html.parser')
        pages_ = soup.find('div', id='foot', role='navigation').findAll('td')
        maxn = 0
        for p in pages_:
            try:
                if int(p.text) > maxn:
                    maxn = int(p.text)
            except ValueError:  # Cell text is not a page number
                pass
        return maxn

    def __search_result(self, response: requests.Response) -> list:
        """Search for normal search results based on the given response

        Args:
            response (requests.Response): The response requested to Google

        Returns:
            list: A list of results
        """
        # Initialize BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Get the result containers
        result_containers = soup.findAll('div', class_='rc')
        # Final results list
        results = []
        # Loop through every container
        for container in result_containers:
            # Result title
            title = container.find('h3').text
            # Result URL
            url = container.find('a')['href']
            # Result description
            des = container.find('span', class_='st').text
            results.append({
                'title': title,
                'url': url,
                'des': des,
                'type': 'result'
            })
        return results

    def search(self, query: str, page: int = 1) -> dict:
        """Search Google

        Args:
            query (str): The query to search for
            page (int): The page number of search result

        Returns:
            dict: The search results and the total page number
        """
        # Get response
        response = self.__get_source(
            'https://www.google.com/search?q=%s&start=%d' % (quote(query), (page - 1) * 10))
        results = []
        video = self.__search_video(response)
        result = self.__search_result(response)
        pages = self.__get_total_page(response)
        results.extend(result)
        results.extend(video)
        return {
            'results': results,
            'pages': pages
        }


if __name__ == '__main__':
    pprint(GoogleSpider().search(input('Search for what? ')))
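
Since search() returns regular and video results in one list, each tagged with a type key, a caller may want to separate them. Here is a small sketch using a hand-made dict of the same shape. All of the values are placeholders, not real search output.

```python
from collections import defaultdict

# Hand-made stand-in for the dict returned by search()
response = {
    'results': [
        {'title': 'Welcome to Python.org', 'url': 'https://www.python.org/',
         'des': 'The official home of Python.', 'type': 'result'},
        {'title': 'Learn Python', 'url': 'https://example.com/watch?v=1',
         'length': '4:26:51', 'author': 'Example Channel', 'source': 'YouTube',
         'date': 'Jul 11, 2018', 'cover': None, 'type': 'video'},
    ],
    'pages': 10,
}

# Group the results by their 'type' key
by_type = defaultdict(list)
for item in response['results']:
    by_type[item['type']].append(item)

video_titles = [v['title'] for v in by_type['video']]
```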

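Summary

So now you can crawl Google's related video results, but you may be asking: why does it only crawl three video results? Well, because there are only three in Google's source code. If you've found a way to scrape more, please leave a comment and I'll add it to the post as soon as possible. Of course, if you have any questions or run into errors while coding, leave a comment below and I'll be happy to help.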

Reference

More information on this topic (Crawling Google Search Results) can be found at https://dev.to/samzhangjy/crawling-google-search-results-part-2-crawling-video-4hi9
