Scrape all Naver video results using pagination in Python


  • What will be scraped
  • Prerequisites
  • Full Code
  • Links
  • Outro
    What will be scraped


    Title, link, thumbnail, origin, views, date published, and channel from all the results.

    📌Note: Naver Search doesn't provide more than 600 video search results for the sake of search result quality ("네이버 검색은 최상의 검색결과 품질을 위해 600건 이상의 동영상 검색결과를 제공하지 않습니다"). This is the message you'll see when you hit the bottom of the search results.
    However, 1008 results were scraped during multiple tests, possibly because Naver is constantly changing.
    CSS selectors were tested using the SelectorGadget Chrome extension:

    Testing CSS selectors in the browser console.

    Prerequisites


    Basic knowledge of scraping with CSS selectors
    CSS selectors declare which part of the markup a style applies to, which also makes it possible to extract data from matching tags and attributes.
    If you haven't scraped with CSS selectors yet, there is a dedicated blog post of mine, how to use CSS selectors when web-scraping, that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
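    For example, here is a minimal parsel sketch (using made-up HTML, not Naver's actual markup) of matching a tag by its class and pulling out its text and an attribute:

    from parsel import Selector

    # hypothetical HTML snippet, only to illustrate CSS selector matching
    html = '<div class="video_bx"><a class="info_title" href="https://example.com">Title</a></div>'
    selector = Selector(text=html)

    print(selector.css(".info_title::text").get())        # Title
    print(selector.css(".info_title::attr(href)").get())  # https://example.com
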
    Separate virtual environment
    If you haven't worked with a virtual environment before, take a look at the dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post to get familiar with it.
    In short, it creates an independent set of installed libraries, which can include different Python versions, that can coexist with each other on the same system, preventing library and Python version conflicts.
    📌Note: this is not a strict requirement for this blog post.
    Install libraries:
    pip install requests parsel playwright
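    📌Note: if this is your first time installing `playwright`, you will most likely also need to download its browser binaries with `playwright install`.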
    

    Full Code


    This section is split into two parts:

    | Method | Libraries used |
    |--------|----------------|
    | parse data without browser automation | `requests` and `parsel`, which is a `bs4` analog that supports XPath |
    | parse data with browser automation | `playwright`, which is a modern `selenium` analog |

    Scrape all Naver video results without browser automation


    import requests, json
    from parsel import Selector
    
    params = {
        "start": 0,            # page number
        "display": "48",       # videos to display. Hard limit.
        "query": "minecraft",  # search query
        "where": "video",      # Naver videos search engine 
        "sort": "rel",         # sorted as you would see in the browser
        "video_more": "1"      # required to receive a JSON data
    }
    
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    }
    
    video_results = []
    
    html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
    json_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))
    html_data = json_data["aData"]
    
    while params["start"] <= int(json_data["maxCount"]):
        for result in html_data:
            selector = Selector(result)
    
            for video in selector.css(".video_bx"):
                title = video.css(".text").xpath("normalize-space()").get().strip()
                link = video.css(".info_title::attr(href)").get()
                thumbnail = video.css(".thumb_area img::attr(src)").get()
                channel = video.css(".channel::text").get()
                origin = video.css(".origin::text").get()
                video_duration = video.css(".time::text").get()
                views = video.css(".desc_group .desc:nth-child(1)::text").get()
                date_published = video.css(".desc_group .desc:nth-child(2)::text").get()
    
                video_results.append({
                    "title": title,
                    "link": link,
                    "thumbnail": thumbnail,
                    "channel": channel,
                    "origin": origin,
                    "video_duration": video_duration,
                    "views": views,
                    "date_published": date_published
                })
    
        params["start"] += 48
        html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
        html_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))["aData"]
    
    print(json.dumps(video_results, indent=2, ensure_ascii=False))
    
    Create URL parameters and request headers:
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "start": 0,           # page number
        "display": "48",      # videos to display. Hard limit.
        "query": "minecraft", # search query
        "where": "video",     # Naver videos search engine 
        "sort": "rel",        # sorted as you would see in the browser
        "video_more": "1"     # unknown but required to receive a JSON data
    }
    
    # https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    }
    
    Create a temporary `list` to store the data:
    video_results = []
    
    Pass the `headers` and URL `params`, and make a request to get the JSON data:
    html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
    
    # removes (replaces) unnecessary parts from parsed JSON 
    json_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))
    html_data = json_data["aData"]
    
    | Code | Explanation |
    |------|-------------|
    | `timeout=30` | stop waiting for a response after 30 seconds |
    | `json_data` | JSON data parsed from the response |
    | `html_data` | the actual HTML returned, more precisely `json_data["aData"]` (save and open it in the browser to see it) |
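
    To make it clearer what the two replace() calls are doing, here is a minimal sketch with a made-up, shortened response body (the real response is much larger, but the JSONP-style wrapper around the JSON is the same idea):

    import json

    # hypothetical, shortened version of what the endpoint returns:
    # a JSON object wrapped in "( ... )" that json.loads() can't parse directly
    raw = '( {"maxCount": 1008, "aData": ["<div class=video_bx>...</div>"]})'

    # strip the wrapper so that only valid JSON remains
    clean = raw.replace("( {", "{").replace("]})", "]}")
    data = json.loads(clean)

    print(data["maxCount"])  # 1008
    print(data["aData"])     # ['<div class=video_bx>...</div>']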

    Create a `while` loop to extract all the available video results:
    while params["start"] <= int(json_data["maxCount"]):
        for result in html_data:
            selector = Selector(result)
    
            for video in selector.css(".video_bx"):
                title = video.css(".text").xpath("normalize-space()").get().strip()
                link = video.css(".info_title::attr(href)").get()
                thumbnail = video.css(".thumb_area img::attr(src)").get()
                channel = video.css(".channel::text").get()
                origin = video.css(".origin::text").get()
                video_duration = video.css(".time::text").get()
                views = video.css(".desc_group .desc:nth-child(1)::text").get()
                date_published = video.css(".desc_group .desc:nth-child(2)::text").get()
    
                video_results.append({
                    "title": title,
                    "link": link,
                    "thumbnail": thumbnail,
                    "channel": channel,
                    "origin": origin,
                    "video_duration": video_duration,
                    "views": views,
                    "date_published": date_published
                })
    
        params["start"] += 48

        # update previous page to a new page
        html = requests.get("https://s.search.naver.com/p/video/search.naver", params=params, headers=headers, timeout=30)
        html_data = json.loads(html.text.replace("( {", "{").replace("]})", "]}"))["aData"]
    
    | Code | Explanation |
    |------|-------------|
    | `while params["start"] <= int(json_data["maxCount"])` | iterate until `["maxCount"]` is reached, which is hard-limited to 1000 results |
    | `xpath("normalize-space()")` | to also grab blank text nodes: `parsel` translates every CSS query to XPath, and XPath's `text()` ignores blank text nodes |
    | `::text` or `::attr(href)` | `parsel`'s own CSS pseudo-elements to extract text or attribute data accordingly |
    | `params["start"] += 48` | increment to the next page of results |
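
    A small sketch of why normalize-space() is useful here (made-up markup, not Naver's): `.css("::text").get()` returns only the first text node, which can be blank whitespace, while `xpath("normalize-space()")` returns the element's full text, joined and trimmed:

    from parsel import Selector

    # hypothetical markup where the title text is split across several nodes
    selector = Selector(text='<div class="text">\n  <span>Minecraft</span> tutorial\n</div>')

    print(selector.css(".text::text").get())                       # '\n  ' (a blank text node)
    print(selector.css(".text").xpath("normalize-space()").get())  # 'Minecraft tutorial'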
    Print the extracted data:
    print(json.dumps(video_results, indent=2, ensure_ascii=False))

    Output:
    [
      {
        "title": "Minecraft : 🏰 How to build a Survival Castle Tower house",
        "link": "https://www.youtube.com/watch?v=iU-xjhgU2vQ",
        "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2FiU-xjhgU2vQ%2Fmqdefault.jpg&type=ac612_350",
        "channel": "소피 Sopypie",
        "origin": "Youtube",
        "video_duration": "25:27",
        "views": "126",
        "date_published": "1일 전"
      },
      {
        "title": "조금 혼란스러울 수 있는 마인크래프트 [ Minecraft ASMR Tower ]",
        "link": "https://www.youtube.com/watch?v=y8x8oDAek_w",
        "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fy8x8oDAek_w%2Fmqdefault.jpg&type=ac612_350",
        "channel": "세빈 XEBIN",
        "origin": "Youtube",
        "video_duration": "00:58",
        "views": "1,262",
        "date_published": "2021.11.13."
      }
    ]
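
    If the results also need to be saved rather than only printed, a minimal sketch (assuming a naver_video_results.json filename) could look like this:

    # dump the scraped results to a JSON file for later use
    with open("naver_video_results.json", "w", encoding="utf-8") as f:
        json.dump(video_results, f, indent=2, ensure_ascii=False)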


    Scrape all Naver video results with browser automation

    from playwright.sync_api import sync_playwright
    import json

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto("https://search.naver.com/search.naver?where=video&query=minecraft")

        video_results = []

        not_reached_end = True
        while not_reached_end:
            # scroll to the bottom of the page
            page.evaluate("""let scrollingElement = (document.scrollingElement || document.body);
                             scrollingElement.scrollTop = scrollingElement.scrollHeight;""")

            # break out of the while loop when the bottom of the video results is reached
            if page.locator("#video_max_display").is_visible():
                not_reached_end = False

        for index, video in enumerate(page.query_selector_all(".video_bx"), start=1):
            title = video.query_selector(".text").inner_text()
            link = video.query_selector(".info_title").get_attribute("href")
            thumbnail = video.query_selector(".thumb_area img").get_attribute("src")
            channel = None if video.query_selector(".channel") is None else video.query_selector(".channel").inner_text()
            origin = video.query_selector(".origin").inner_text()
            video_duration = video.query_selector(".time").inner_text()
            views = video.query_selector(".desc_group .desc:nth-child(1)").inner_text()
            date_published = None if video.query_selector(".desc_group .desc:nth-child(2)") is None else \
                video.query_selector(".desc_group .desc:nth-child(2)").inner_text()

            video_results.append({
                "position": index,
                "title": title,
                "link": link,
                "thumbnail": thumbnail,
                "channel": channel,
                "origin": origin,
                "video_duration": video_duration,
                "views": views,
                "date_published": date_published
            })

        print(json.dumps(video_results, indent=2, ensure_ascii=False))

        browser.close()

    Launch a Chromium browser and make a request:

    with sync_playwright() as p:
        # launches Chromium, opens a new page and makes a request
        browser = p.chromium.launch(headless=False)  # or firefox, webkit
        page = browser.new_page()
        page.goto("https://search.naver.com/search.naver?where=video&query=minecraft")

    Create a temporary `list` to store the extracted data:
    video_results = []

    Create a `while` loop that keeps scrolling the page and checks for the condition that stops the scrolling:

    not_reached_end = True
    while not_reached_end:
        # scroll to the bottom of the page
        page.evaluate("""let scrollingElement = (document.scrollingElement || document.body);
                         scrollingElement.scrollTop = scrollingElement.scrollHeight;""")

        # break out of the while loop when hitting the bottom of the video results.
        # looks for the text at the bottom of the results:
        # "Naver Search does not provide more than 600 video search results..."
        if page.locator("#video_max_display").is_visible():
            not_reached_end = False

    | Code | Explanation |
    |------|-------------|
    | `page.evaluate()` | to run a JavaScript expression; `playwright` keyboard keys and shortcuts could be used to do the same thing |
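
    For example, a rough sketch of the same scrolling done with keyboard presses instead of page.evaluate() (reusing the #video_max_display check from the code above):

    # alternative: keep pressing the End key until the end-of-results message appears
    while not page.locator("#video_max_display").is_visible():
        page.keyboard.press("End")    # jump to the bottom of the page
        page.wait_for_timeout(500)    # small pause so new results have time to load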
    Scroll through the results, then iterate over them and `append` the data to the temporary `list`:
    for index, video in enumerate(page.query_selector_all(".video_bx"), start=1):
        title = video.query_selector(".text").inner_text()
        link = video.query_selector(".info_title").get_attribute("href")
        thumbnail = video.query_selector(".thumb_area img").get_attribute("src")
        # return None if no result is displayed from Naver.
        # "is None" used because query_selector() returns a NoneType (None) object:
        # https://playwright.dev/python/docs/api/class-page#page-query-selector
        channel = None if video.query_selector(".channel") is None else video.query_selector(".channel").inner_text()
        origin = video.query_selector(".origin").inner_text()
        video_duration = video.query_selector(".time").inner_text()
        views = video.query_selector(".desc_group .desc:nth-child(1)").inner_text()
        date_published = None if video.query_selector(".desc_group .desc:nth-child(2)") is None else \
            video.query_selector(".desc_group .desc:nth-child(2)").inner_text()

        video_results.append({
            "position": index,
            "title": title,
            "link": link,
            "thumbnail": thumbnail,
            "channel": channel,
            "origin": origin,
            "video_duration": video_duration,
            "views": views,
            "date_published": date_published
        })

    | Code | Explanation |
    |------|-------------|
    | `enumerate()` | to get the index position of each video |
    | `query_selector_all()` | returns a `list` of matches; defaults to an empty `[]` |
    | `query_selector()` | returns a single match; defaults to `None` |
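
    A tiny sketch of those defaults, using a selector that matches nothing (hypothetical, just to show the return values):

    print(page.query_selector_all(".does-not-exist"))  # [] (empty list)
    print(page.query_selector(".does-not-exist"))      # None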
    Close the browser instance after the data has been extracted:
    browser.close()

    Output:

    [
      {
        "position": 1,
        "title": "Minecraft : 🏰 How to build a Survival Castle Tower house",
        "link": "https://www.youtube.com/watch?v=iU-xjhgU2vQ",
        "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2FiU-xjhgU2vQ%2Fmqdefault.jpg&type=ac612_350",
        "channel": "소피 Sopypie",
        "origin": "Youtube",
        "video_duration": "25:27",
        "views": "재생수126",
        "date_published": "20시간 전"
      },
      {
        "position": 1008,
        "title": "Titanic [Minecraft] V3 | 타이타닉 [마인크래프트] V3",
        "link": "https://www.youtube.com/watch?v=K39joThAoC0",
        "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fi.ytimg.com%2Fvi%2FK39joThAoC0%2Fmqdefault.jpg&type=ac612_350",
        "channel": "나이아Naia",
        "origin": "Youtube",
        "video_duration": "02:40",
        "views": "재생수22",
        "date_published": "2021.11.11."
      }
    ]


    Links


    Outro

    This blog post is for informational purposes only. Use the received information for useful purposes, for example, if you know how to help improve Naver's service.

    If you have anything to share, any questions, suggestions, or something that isn't working correctly, reach out via Twitter at , or .

    Yours,
    Dmitriy, and the rest of SerpApi Team.


    Join us on Reddit | |

    Add a Feature Request💫 or a Bug🐞