Automatically Inferring a Category from Web Page Content


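Introduction

I needed to determine the category of a web page from its URL.
I looked for a suitable service and found various Web Categorization services, but none offered pay-as-you-go pricing (they were all monthly subscriptions).
The Natural Language API on Google Cloud can classify content, so I built a solution with it in Python.
The result turned out reasonably well.
The processing flow is as follows:
1. Extract characteristic text from the web page with BeautifulSoup
2. Translate it into English with Google Translate (free version), because content classification supported only English as of April 2021
3. Classify the content with the Natural Language API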

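Content classification

This service infers categories from text.
According to the official category list, there are more than 600 categories, which is more than enough.

Billing counts 1,000 characters as one unit, and content classification is free for up to 30,000 units per month (as of April 2021).
What a deal.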
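To get a feel for how far the free tier goes, the unit rule above can be sketched as a small helper. This is only an estimate: `estimate_units` and `UNIT_SIZE` are hypothetical names, and the sketch assumes that a partial unit is rounded up and that every request costs at least one unit.

```python
import math

UNIT_SIZE = 1000  # characters per billing unit, as described above


def estimate_units(text: str) -> int:
    """Estimate billing units for one request.

    Assumption: partial units round up, and a request costs at least one unit.
    """
    return max(1, math.ceil(len(text) / UNIT_SIZE))


print(estimate_units("a" * 2000))  # 2
print(estimate_units("a" * 2001))  # 3
```

Under these assumptions, a page cut to 2,000 characters costs about 2 units per classification.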

Preparing to use the API

Following the official setup guide, enable the Natural Language API and download the private key file.

Code

Package installation

Packages

pip install --upgrade requests
pip install --upgrade beautifulsoup4
pip install googletrans==4.0.0-rc1 # the latest release as of April 2021 (3.0.0) did not work correctly
pip install --upgrade google-cloud-language

Import

import

from bs4 import BeautifulSoup
from google.cloud import language_v1
from googletrans import Translator
import requests
import os

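Fetching the web page content

Using BeautifulSoup, collect the text in what is generally the order of importance: Title, Description (including the OGP description), H1 to H3, and then the text inside p tags.
Pages with a very large amount of text waste processing time and consume units needlessly, so the text is cut at around 2,000 characters.
(In my tests, accuracy topped out at roughly 2,000 characters.)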

def get_web_contents(url, limit=2000):
    """Return up to `limit` characters of the page's most characteristic text."""

    # A timeout keeps an unresponsive site from hanging the script
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # title (a page may not have one)
    title = soup.title.get_text(strip=True) if soup.title else ""

    # meta description and OGP description (either may be missing)
    description = soup.find("meta", attrs={"name": "description"})
    description = description.get("content", "") if description else ""
    ogp_description = soup.find("meta", attrs={"property": "og:description"})
    ogp_description = ogp_description.get("content", "") if ogp_description else ""

    # headings and paragraphs
    h1_texts = [c.text.strip() for c in soup.find_all("h1")]
    h2_texts = [c.text.strip() for c in soup.find_all("h2")]
    h3_texts = [c.text.strip() for c in soup.find_all("h3")]
    p_texts = [c.text.strip() for c in soup.find_all("p")]

    # join everything in order of importance
    imp_texts = ",".join([
        title,
        description,
        ogp_description,
        ", ".join(h1_texts),
        ", ".join(h2_texts),
        ", ".join(h3_texts),
        ", ".join(p_texts),
    ])

    # cut the text at `limit` characters
    return imp_texts[:limit]
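
On some pages the description and the OGP description carry the same text, which spends part of the 2,000-character budget on a repeat. A minimal sketch of one way to avoid that is to drop empty and duplicate pieces before joining; `join_unique` is a hypothetical helper and is not part of the function above.

```python
def join_unique(parts, sep=","):
    """Join non-empty parts, skipping text that was already seen (order preserved)."""
    seen = set()
    unique = []
    for part in parts:
        part = part.strip()
        if part and part not in seen:
            seen.add(part)
            unique.append(part)
    return sep.join(unique)


print(join_unique(["Example Shop", "Buy things", "Buy things", ""]))
# Example Shop,Buy things
```

Whether this actually improves classification would need to be checked against real pages.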

Translating into English

Translate the text into English with Google Translate.

def translation(text):

    translator = Translator()
    translated_text = translator.translate(text, dest="en").text

    return translated_text

Content classification

This is pasted from the code sample on the official page.
Specify the path to the private key file you prepared.

def classify(text, verbose=False):
    """Classify the input text into categories."""

    # Set the credential: replace the value with the path to your private key file
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path to your private key file"

    language_client = language_v1.LanguageServiceClient()

    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    response = language_client.classify_text(request={'document': document})
    categories = response.categories

    result = {}

    for category in categories:
        # Turn the categories into a dictionary of the form:
        # {category.name: category.confidence}, so that they can
        # be treated as a sparse vector.
        result[category.name] = category.confidence

    if verbose:
        print(text)
        for category in categories:
            print("=" * 20)
            print("{:<16}: {}".format("category", category.name))
            print("{:<16}: {}".format("confidence", category.confidence))

    return result
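
Because `classify` returns a dictionary of category names and confidences, picking a single category for a page takes only a few lines. The following is a sketch only: `top_category` and the `min_confidence` threshold of 0.5 are hypothetical choices, and the sample values merely mimic the shape of the returned dictionary.

```python
def top_category(result, min_confidence=0.5):
    """Return the (name, confidence) pair with the highest confidence, or None.

    `min_confidence` is an arbitrary illustrative threshold, not an API setting.
    """
    candidates = {k: v for k, v in result.items() if v >= min_confidence}
    if not candidates:
        return None
    name = max(candidates, key=candidates.get)
    return name, candidates[name]


sample = {"/Computers & Electronics": 0.87, "/Science/Computer Science": 0.62}
print(top_category(sample))  # ('/Computers & Electronics', 0.87)
print(top_category({}))      # None
```

Returning `None` for an empty or low-confidence result makes it easy to detect pages that need a fallback.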

Trying it out

First, let's try it on the official page of the API itself.

def main():
    url = "https://cloud.google.com/natural-language?hl=ja"
    contents = get_web_contents(url)
    #print(contents)
    translated_text = translation(contents)
    #print(translated_text)
    classification = classify(translated_text)
    print(classification)

if __name__ == "__main__":
    main()

Result

{'/Computers & Electronics': 0.8700000047683716, '/Science/Computer Science': 0.6200000047683716}

Smart.

Coincheck

url = "https://coincheck.com/ja/"

Result

{'/Finance/Investing/Currencies & Foreign Exchange': 0.9599999785423279}

A certain adult page

url = "xxx"

Result

{'/Adult': 0.9900000095367432}

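This works!

Summary

I used the Natural Language API to infer a category from a URL.
Besides the tests above, I tried it on a variety of pages and found the following issues:

- Pages with very little text fail to classify, so when there is not enough information, some workaround such as walking up to a parent level of the URL seems necessary
- To improve accuracy, it would be worth reconsidering which tags the text is collected from

That said, I think the accuracy is already quite good as it is.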
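
For the first issue above, one possible approach is to retry with the parent URLs when a page yields too little text. The sketch below only generates the candidate URLs; `parent_urls` is a hypothetical helper, and it assumes that each path segment corresponds to one level of the site hierarchy, which is not true of every site.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_urls(url):
    """Yield the URL one level up at a time, ending at the site root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    while segments:
        segments.pop()  # drop the deepest remaining path segment
        path = "/" + "/".join(segments)
        if segments:
            path += "/"
        # query string and fragment are dropped on purpose
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(list(parent_urls("https://example.com/a/b/page.html")))
# ['https://example.com/a/b/', 'https://example.com/a/', 'https://example.com/']
```

Each candidate could then be passed through the same extract, translate, and classify steps until one returns a category.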

Author and source

Original article: https://qiita.com/symmr/items/f68aaca5ae8d08d31aaa
