Scrape all Naver video results using pagination in Python


  • What will be scraped
  • Prerequisites
  • Full Code
  • Links
  • Outro
    What will be scraped


    Title, link, thumbnail, origin, views, date published, and channel from all the results.

    📌Note: Naver Search doesn't provide more than 600 video search results for the sake of search result quality ("네이버 검색은 최상의 검색결과 품질을 위해 600건 이상의 동영상 검색결과를 제공하지 않습니다"). This is the message you'll see when you hit the bottom of the search results.
    However, 1008 results were scraped during multiple tests, possibly because Naver is constantly changing.
    CSS selectors were tested using the SelectorGadget Chrome extension:

    Testing CSS selectors in the browser console.

    Prerequisites


    Basic knowledge of scraping with CSS selectors
    CSS selectors declare which part of the markup a style applies to, which also makes it possible to extract data from matching tags and attributes.
    If you haven't scraped with CSS selectors yet, there is a dedicated blog post of mine, how to use CSS selectors when web-scraping, that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
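    For example, here is a minimal parsel sketch (using made-up HTML, not Naver's actual markup) of matching a tag by its class and pulling out its text and an attribute:

    from parsel import Selector

    # hypothetical HTML snippet, only to illustrate CSS selector matching
    html = '<div class="video_bx"><a class="info_title" href="https://example.com">Title</a></div>'
    selector = Selector(text=html)

    print(selector.css(".info_title::text").get())        # Title
    print(selector.css(".info_title::attr(href)").get())  # https://example.com
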
    Separate virtual environment
    If you haven't worked with a virtual environment before, take a look at the dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post to get familiar with it.
    In short, it creates an independent set of installed libraries, which can include different Python versions, that can coexist with each other on the same system, preventing library and Python version conflicts.
    📌Note: this is not a strict requirement for this blog post.
    Install libraries:
    pip install requests parsel playwright
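    📌Note: if this is your first time installing `playwright`, you will most likely also need to download its browser binaries with `playwright install`.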
    

    Full Code


    This section is split into two parts:

    | Method | Libraries used |
    |--------|----------------|
    | parse data without browser automation | `requests` and `parsel`, which is a `bs4` analog that supports XPath |
    | parse data with browser automation | `playwright`, which is a modern `selenium` analog |

    Scrape all Naver video results without browser automation


    import requests, json
    from parsel import Selector
    
    params = {
        "start": 0,            # page number
        "display": "48",       # videos to display. Hard limit.
        "query": "minecraft",  # search query
        "where": "video",      # Naver videos search engine 
        "sort": "rel",         # sorted as you would see in the browser
        "video_more": "1"      # required to receive a JSON data
    }
    
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    }
    
    video_results = []
    
    html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
    json_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))
    html_data = json_data["aData"]
    
    while params["start"] <= int(json_data["maxCount"]):
        for result in html_data:
            selector = Selector(result)
    
            for video in selector.css(".video_bx"):
                title = video.css(".text").xpath("normalize-space()").get().strip()
                link = video.css(".info_title::attr(href)").get()
                thumbnail = video.css(".thumb_area img::attr(src)").get()
                channel = video.css(".channel::text").get()
                origin = video.css(".origin::text").get()
                video_duration = video.css(".time::text").get()
                views = video.css(".desc_group .desc:nth-child(1)::text").get()
                date_published = video.css(".desc_group .desc:nth-child(2)::text").get()
    
                video_results.append({
                    "title": title,
                    "link": link,
                    "thumbnail": thumbnail,
                    "channel": channel,
                    "origin": origin,
                    "video_duration": video_duration,
                    "views": views,
                    "date_published": date_published
                })
    
        params["start"] += 48
        html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
        html_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))["aData"]
    
    print(json.dumps(video_results, indent=2, ensure_ascii=False))
    
    Create URL parameters and request headers:
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "start": 0,           # page number
        "display": "48",      # videos to display. Hard limit.
        "query": "minecraft", # search query
        "where": "video",     # Naver videos search engine 
        "sort": "rel",        # sorted as you would see in the browser
        "video_more": "1"     # unknown but required to receive a JSON data
    }
    
    # https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    }
    
    Create a temporary `list` to store the data:
    video_results = []
    
    Pass the `headers` and URL `params`, and make a request to get the JSON data:
    html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
    
    # removes (replaces) unnecessary parts from parsed JSON 
    json_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))
    html_data = json_data["aData"]
    
    | Code | Explanation |
    |------|-------------|
    | `timeout=30` | stop waiting for a response after 30 seconds |
    | `json_data` | JSON data parsed from the response |
    | `html_data` | the actual HTML returned, more precisely `json_data["aData"]` (save and open it in the browser to see it) |
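
    To make it clearer what the two replace() calls are doing, here is a minimal sketch with a made-up, shortened response body (the real response is much larger, but the JSONP-style wrapper around the JSON is the same idea):

    import json

    # hypothetical, shortened version of what the endpoint returns:
    # a JSON object wrapped in "( ... )" that json.loads() can't parse directly
    raw = '( {"maxCount": 1008, "aData": ["<div class=video_bx>...</div>"]})'

    # strip the wrapper so that only valid JSON remains
    clean = raw.replace("( {", "{").replace("]})", "]}")
    data = json.loads(clean)

    print(data["maxCount"])  # 1008
    print(data["aData"])     # ['<div class=video_bx>...</div>']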

    Create a `while` loop to extract all the available video results:
    while params["start"] <= int(json_data["maxCount"]):
        for result in html_data:
            selector = Selector(result)
    
            for video in selector.css(".video_bx"):
                title = video.css(".text").xpath("normalize-space()").get().strip()
                link = video.css(".info_title::attr(href)").get()
                thumbnail = video.css(".thumb_area img::attr(src)").get()
                channel = video.css(".channel::text").get()
                origin = video.css(".origin::text").get()
                video_duration = video.css(".time::text").get()
                views = video.css(".desc_group .desc:nth-child(1)::text").get()
                date_published = video.css(".desc_group .desc:nth-child(2)::text").get()
    
                video_results.append({
                    "title": title,
                    "link": link,
                    "thumbnail": thumbnail,
                    "channel": channel,
                    "origin": origin,
                    "video_duration": video_duration,
                    "views": views,
                    "date_published": date_published
                })
    
        params["start"] += 48

        # update previous page to a new page
        html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
        html_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))["aData"]
    
    | Code | Explanation |
    |------|-------------|
    | `while params["start"] <= int(json_data["maxCount"])` | iterate until `["maxCount"]` is reached, which is hard-limited to 1000 results |
    | `xpath("normalize-space()")` | to also grab blank text nodes: `parsel` translates every CSS query to XPath, and XPath's `text()` ignores blank text nodes |
    | `::text` or `::attr(href)` | `parsel`'s own CSS pseudo-elements to extract text or attribute data accordingly |
    | `params["start"] += 48` | increment to the next page of results |
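
    A small sketch of why normalize-space() is useful here (made-up markup, not Naver's): `.css("::text").get()` returns only the first text node, which can be blank whitespace, while `xpath("normalize-space()")` returns the element's full text, joined and trimmed:

    from parsel import Selector

    # hypothetical markup where the title text is split across several nodes
    selector = Selector(text='<div class="text">\n  <span>Minecraft</span> tutorial\n</div>')

    print(selector.css(".text::text").get())                       # '\n  ' (a blank text node)
    print(selector.css(".text").xpath("normalize-space()").get())  # 'Minecraft tutorial'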
    Print the extracted data:
    print(json.dumps(video_results, indent=2, ensure_ascii=False))

    Output:
    [
      {
        "title": "Minecraft : 🏰 How to build a Survival Castle Tower house",
        "link": "https://www.youtube.com/watch?v=iU-xjhgU2vQ",
        "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2FiU-xjhgU2vQ%2Fmqdefault.jpg&type=ac612_350",
        "channel": "소피 Sopypie",
        "origin": "Youtube",
        "video_duration": "25:27",
        "views": "126",
        "date_published": "1일 전"
      },
      {
        "title": "조금 혼란스러울 수 있는 마인크래프트 [ Minecraft ASMR Tower ]",
        "link": "https://www.youtube.com/watch?v=y8x8oDAek_w",
        "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fy8x8oDAek_w%2Fmqdefault.jpg&type=ac612_350",
        "channel": "세빈 XEBIN",
        "origin": "Youtube",
        "video_duration": "00:58",
        "views": "1,262",
        "date_published": "2021.11.13."
      }
    ]
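
    If the results also need to be saved rather than only printed, a minimal sketch (assuming a naver_video_results.json filename) could look like this:

    # dump the scraped results to a JSON file for later use
    with open("naver_video_results.json", "w", encoding="utf-8") as f:
        json.dump(video_results, f, indent=2, ensure_ascii=False)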


    Scrape all Naver video results with browser automation

    from playwright.sync_api import sync_playwright
    import json

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://search.naver.com/search.naver?where=video&query=minecraft")

        video_results = []

        not_reached_end = True
        while not_reached_end:
            # scroll to the bottom of the page
            page.evaluate("""let scrollingElement = (document.scrollingElement || document.body);
                             scrollingElement.scrollTop = scrollingElement.scrollHeight;""")

            # break out of the while loop when the bottom of the video results is reached
            if page.locator("#video_max_display").is_visible():
                not_reached_end = False

        for index, video in enumerate(page.query_selector_all(".video_bx"), start=1):
            title = video.query_selector(".text").inner_text()
            link = video.query_selector(".info_title").get_attribute("href")
            thumbnail = video.query_selector(".thumb_area img").get_attribute("src")
            channel = None if video.query_selector(".channel") is None else video.query_selector(".channel").inner_text()
            origin = video.query_selector(".origin").inner_text()
            video_duration = video.query_selector(".time").inner_text()
            views = video.query_selector(".desc_group .desc:nth-child(1)").inner_text()
            date_published = None if video.query_selector(".desc_group .desc:nth-child(2)") is None else \
                video.query_selector(".desc_group .desc:nth-child(2)").inner_text()

            video_results.append({
                "position": index,
                "title": title,
                "link": link,
                "thumbnail": thumbnail,
                "channel": channel,
                "origin": origin,
                "video_duration": video_duration,
                "views": views,
                "date_published": date_published
            })

        print(json.dumps(video_results, indent=2, ensure_ascii=False))

        browser.close()

    Launch a Chromium browser and make a request:

    with sync_playwright() as p:
        # launches Chromium, opens a new page and makes a request
        browser = p.chromium.launch(headless=False)  # or firefox, webkit
        page = browser.new_page()
        page.goto("https://search.naver.com/search.naver?where=video&query=minecraft")

    Create a temporary `list` to store the extracted data:
    video_results = []

    Create a `while` loop that keeps scrolling the page and checks for the condition that stops the scrolling:

    not_reached_end = True
    while not_reached_end:
        # scroll to the bottom of the page
        page.evaluate("""let scrollingElement = (document.scrollingElement || document.body);
                         scrollingElement.scrollTop = scrollingElement.scrollHeight;""")

        # break out of the while loop when hitting the bottom of the video results.
        # looks for the text at the bottom of the results:
        # "Naver Search does not provide more than 600 video search results..."
        if page.locator("#video_max_display").is_visible():
            not_reached_end = False

    | Code | Explanation |
    |------|-------------|
    | `page.evaluate()` | to run a JavaScript expression; `playwright` keyboard keys and shortcuts could be used to do the same thing |
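
    For example, a rough sketch of the same scrolling done with keyboard presses instead of page.evaluate() (reusing the #video_max_display check from the code above):

    # alternative: keep pressing the End key until the end-of-results message appears
    while not page.locator("#video_max_display").is_visible():
        page.keyboard.press("End")    # jump to the bottom of the page
        page.wait_for_timeout(500)    # small pause so new results have time to load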
    Scroll through the results, then iterate over them and `append` the data to the temporary `list`:
    for index, video in enumerate(page.query_selector_all(".video_bx"), start=1):
        title = video.query_selector(".text").inner_text()
        link = video.query_selector(".info_title").get_attribute("href")
        thumbnail = video.query_selector(".thumb_area img").get_attribute("src")
        # return None if no result is displayed from Naver.
        # "is None" used because query_selector() returns a NoneType (None) object:
        # https://playwright.dev/python/docs/api/class-page#page-query-selector
        channel = None if video.query_selector(".channel") is None else video.query_selector(".channel").inner_text()
        origin = video.query_selector(".origin").inner_text()
        video_duration = video.query_selector(".time").inner_text()
        views = video.query_selector(".desc_group .desc:nth-child(1)").inner_text()
        date_published = None if video.query_selector(".desc_group .desc:nth-child(2)") is None else \
            video.query_selector(".desc_group .desc:nth-child(2)").inner_text()

        video_results.append({
            "position": index,
            "title": title,
            "link": link,
            "thumbnail": thumbnail,
            "channel": channel,
            "origin": origin,
            "video_duration": video_duration,
            "views": views,
            "date_published": date_published
        })

    | Code | Explanation |
    |------|-------------|
    | `enumerate()` | to get the index position of each video |
    | `query_selector_all()` | returns a `list` of matches; defaults to an empty `[]` |
    | `query_selector()` | returns a single match; defaults to `None` |
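
    A tiny sketch of those defaults, using a selector that matches nothing (hypothetical, just to show the return values):

    print(page.query_selector_all(".does-not-exist"))  # [] (empty list)
    print(page.query_selector(".does-not-exist"))      # None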
    Close the browser instance after the data has been extracted:
    browser.close()

    Output:

    [
      {
        "position": 1,
        "title": "Minecraft : 🏰 How to build a Survival Castle Tower house",
        "link": "https://www.youtube.com/watch?v=iU-xjhgU2vQ",
        "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2FiU-xjhgU2vQ%2Fmqdefault.jpg&type=ac612_350",
        "channel": "소피 Sopypie",
        "origin": "Youtube",
        "video_duration": "25:27",
        "views": "재생수126",
        "date_published": "20시간 전"
      },
      {
        "position": 1008,
        "title": "Titanic [Minecraft] V3 | 타이타닉 [마인크래프트] V3",
        "link": "https://www.youtube.com/watch?v=K39joThAoC0",
        "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2FK39joThAoC0%2Fmqdefault.jpg&type=ac612_350",
        "channel": "나이아Naia",
        "origin": "Youtube",
        "video_duration": "02:40",
        "views": "재생수22",
        "date_published": "2021.11.11."
      }
    ]


    Links


    Outro

    This blog post is for informational purposes only. Use the received information for useful purposes, for example, if you know how to help improve Naver's service.

    If you have anything to share, any questions, suggestions, or something that isn't working correctly, reach out via Twitter at , or .

    Yours,
    Dmitriy, and the rest of SerpApi Team.


    Join us on Reddit | |

    Add a Feature Request💫 or a Bug🐞