Feeding my own blog posts into IBM Watson Personality Insights for analysis


Update, 2020/01/10:

Code that works with the service as of January 2020 is available here. To use it, obtain an IBM Cloud Lite account or similar.

https://github.com/hmatsu47/watson_personality_insights_test_for_blog

Personality Insights on IBM Cloud is here.

Personality Insights (personality analysis)

Note: the links from this point on are already dead.

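The IBM Watson Personality Insights demo site can analyze your own Twitter feed, but I don't have a Twitter account. Instead, I used the Python SDK to pull the text out of my own blog posts and submit it through the API.

Note: this is for self-analysis. Please don't misuse it.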

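1. Enable the IBM Watson Personality Insights API

I mostly followed this blog post.

PythonからIBM Personality Insights Service APIを叩いてみた (Calling the IBM Personality Insights Service API from Python)

However, the flow after creating an account seems to have changed a little, so here is a brief outline of the steps as they were when I tried it.

Create an account and log in
Click "Catalog" at the top right of the screen
Click "Watson" under "Services", near the bottom of the left-hand menu
Click "Personality Insights", around the middle left of the icon grid
Click the "Create" button at the bottom right
Open "Service credentials" to confirm that a username and password have been generated automatically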

2. Install the required Python packages

The steps below are for Python 3.5 on Amazon Linux (64-bit).

Install with pip3

# pip3 install readability-lxml
# pip3 install reppy
# pip3 install --upgrade watson-developer-cloud

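I use readability to extract the article body, reppy to handle robots.txt, and lxml for all other HTML processing (lxml is installed as a dependency of readability-lxml).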

readability
reppy
lxml
IBM Watson Personality Insights API and SDK reference
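To give a rough idea of what the robots.txt check involves, here is a minimal sketch that uses the standard library's urllib.robotparser instead of reppy. It is a stand-in for illustration only, not the approach used in this article; the sample robots.txt content and the example.com URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)
# Ask whether any user agent ("*") may fetch each URL.
allowed_public = parser.can_fetch("*", "http://example.com/items/1")
allowed_private = parser.can_fetch("*", "http://example.com/private/page")
print(allowed_public, allowed_private)
```

Note that the check is made against the full URL of each page, because rules in robots.txt apply per path.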

3. Write the Python code

This is just a quick experiment, so the code is rough.
Common examples send the data as JSON, but this sample sends it as plain text (the result is JSON; by changing a parameter you can also get CSV).
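The difference between the two input styles can be sketched as follows. The JSON field names (contentItems, content, contenttype, language) are written from my recollection of the service's JSON input format and should be treated as assumptions to verify against the API reference. The helper names are hypothetical.

```python
import json

def build_json_payload(texts, language="ja"):
    """Build a contentItems-style JSON body, one item per document."""
    items = [
        {"content": text, "contenttype": "text/plain", "language": language}
        for text in texts
    ]
    return json.dumps({"contentItems": items}, ensure_ascii=False)

def build_plain_payload(texts):
    """Join all documents into a single UTF-8 body for text/plain."""
    return " ".join(texts).encode("utf-8")

docs = ["first post", "second post"]
json_body = build_json_payload(docs)
plain_body = build_plain_payload(docs)
print(json_body)
print(plain_body)
```

The plain-text form loses the boundaries between documents, but it needs no per-item metadata, which is enough for a quick experiment like this one.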

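Replace the API username and password issued in step 1, and the settings for the target blog, with your own values.

Note: my Qiita URLs are filled in as a sample of how to write the settings, but please don't use them to analyze me for fun. It's embarrassing...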

test.py

import json
import urllib.request
from lxml.html import fromstring
from readability.readability import Document
from reppy.cache import RobotsCache
from time import sleep
from watson_developer_cloud import PersonalityInsightsV3

# Start URL
start_url = "http://qiita.com/hmatsu47"
# Domain
domain = "http://qiita.com/"
# URL path prefix of pages to extract body text from
scrape_path = "http://qiita.com/hmatsu47/items/"
# Substrings that exclude a URL from crawling
exclude_str_list = ["feed", "rss", "archive", "about", "revision", "like", "follow", "contribution",
                    "comment", "reference", ".md"]
# URLs already crawled
scrape_url_list = []
# Extracted body text
summary_ap_text = []
# Maximum number of pages to crawl
crawl_limit = 100
# Maximum number of pages to extract body text from
item_limit = 50
# For robots.txt checks
robots_cache = RobotsCache(capacity=crawl_limit)
# Watson credentials
watson_account = "[username]"
watson_password = "[password]"

# Return True if the URL contains no excluded substring and robots.txt allows it
def is_crawlable_url(url):
  for es in exclude_str_list:
    if url.find(es) != -1:
      return False
  # Check the URL itself (not just the domain root) against robots.txt
  return robots_cache.allowed(url, "*")

# Crawl
def crawl(url):
  # Do nothing once the maximum number of crawled pages has been reached
  if crawl_limit <= len(scrape_url_list):
    return
  # Decide whether this URL should be crawled
  if (len(summary_ap_text) < item_limit) and (url not in scrape_url_list) and (is_crawlable_url(url)):
    print(url)
    scrape_url_list.append(url)
    # Fetch the page HTML
    html = urllib.request.urlopen(url).read()
    # Sleep for 1 second
    sleep(1)
    # Extract the body text if this page is an extraction target
    # (the HTML is lowercased so the meta name matches regardless of case)
    et = fromstring(html.lower())
    # robots is a list of content strings such as "noindex,nofollow"
    robots = et.xpath("//meta[@name='robots']/@content")
    if url.startswith(scrape_path) and not any("nofollow" in r for r in robots):
      summary = Document(html).summary()
      et2 = fromstring(summary)
      text = "".join([t for t in et2.xpath("//text()") if t.strip()])
      print("☆append☆")
      summary_ap_text.append(text)
    # Extract the links
    et.make_links_absolute(domain)
    ev_url_list = et.xpath("//@href")
    # Follow the extracted links
    for evurl in ev_url_list:
      if evurl.startswith(start_url):
        crawl(evurl)

# Main

# Crawl from the start URL
crawl(start_url)
# Show the extracted text (the banner reads "extracted documents")
print("□抽出文書□")
print(" ".join(summary_ap_text))
# Authenticate with Personality Insights (the banner reads "profiling started")
print("◇プロファイリング開始◇")
personality_insights = PersonalityInsightsV3(
  version="2016-10-20",
  username=watson_account,
  password=watson_password)
# Profile the text
profile = personality_insights.profile(
  " ".join(summary_ap_text).encode("utf-8"),
  content_type="text/plain", content_language="ja",
  accept="application/json", accept_language="ja",
  raw_scores=True, consumption_preferences=True)
# Show the result
print(json.dumps(profile, indent=2, ensure_ascii=False))
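The link-following step in the script above (collect every href, make it absolute, keep only URLs under the start URL that contain no excluded substring) can be tried in isolation. The following is a minimal sketch that uses only the standard library in place of lxml, so it is not the article's implementation. The names HrefCollector and extract_links and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class HrefCollector(HTMLParser):
    """Collect every href attribute value found in a document."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)

def extract_links(html, base_url, prefix, excluded):
    """Return absolute links under prefix that contain no excluded substring."""
    collector = HrefCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if not absolute.startswith(prefix):
            continue
        if any(word in absolute for word in excluded):
            continue
        if absolute not in links:
            links.append(absolute)
    return links

# Made-up HTML: one article link, one comment link, one external link.
sample_html = (
    '<a href="/user/items/abc">post</a>'
    '<a href="/user/items/abc/comments">comments</a>'
    '<a href="http://other.example/x">external</a>'
)
links = extract_links(sample_html, "http://example.com/",
                      "http://example.com/user", ["comment"])
print(links)
```

Only the article link survives: the comment link contains an excluded substring and the external link falls outside the prefix.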

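4. Execution results (excerpt)

The results do not come back as prose, the way they do on the demo site.
If the extracted text is too long, the service appears to analyze only a portion cut from the beginning.
Because of this, setting the number of crawled or extracted pages too high may simply be wasted effort.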
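One way to avoid collecting text that will never be analyzed is to stop adding pages once a size budget is reached. The sketch below assumes a budget measured in UTF-8 bytes. The article does not state the service's actual cut-off, so max_bytes is a placeholder and take_within_budget is a hypothetical helper.

```python
def take_within_budget(texts, max_bytes):
    """Keep whole documents, in order, until the UTF-8 size budget is used up."""
    kept = []
    used = 0
    for text in texts:
        # Count one extra byte for the space that joins consecutive documents.
        size = len(text.encode("utf-8")) + (1 if kept else 0)
        if used + size > max_bytes:
            break
        kept.append(text)
        used += size
    return kept

# Three made-up pages of 40 bytes each against a 100-byte budget.
pages = ["a" * 40, "b" * 40, "c" * 40]
kept = take_within_budget(pages, max_bytes=100)
print(len(kept), len(" ".join(kept).encode("utf-8")))
```

Cutting at document boundaries keeps each retained post intact, at the cost of leaving part of the budget unused.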

Execution result

# python3 test.py
http://qiita.com/hmatsu47
http://qiita.com/hmatsu47/items/476d446887244de17ae4
☆append☆
http://qiita.com/hmatsu47/items/a779213a71c3c1f72763
☆append☆
http://qiita.com/hmatsu47/items/ebeacd179024c8515c8c
☆append☆
http://qiita.com/hmatsu47/items/2d44c173a9114fd06853
☆append☆
http://qiita.com/hmatsu47/items/b99f5e1fb3e6675e07d8
☆append☆
http://qiita.com/hmatsu47/items/bfbc841545e2cd8b0699
☆append☆

(snip)

□抽出文書□
MySQL 5.7.11で導入、5.7.12で一部改良された、透過的データ暗号化をテストしたときのメモです。
とある勉強会のLTでグダグダになったので、あらためて書き直して投稿しておきます。※内容は無保証です。透過的データ暗号化（TDE）とはアプリケーション（SQL）側で暗号化／復号処理をしなくても、DBのデータファイルが暗号化される機能です。

(snip)

◇プロファイリング開始◇
{
  "values": [
    {
      "raw_score": 0.6676328466571492,
      "name": "現状維持",
      "percentile": 0.04565673211210963,
      "trait_id": "value_conservation",
      "category": "values"
    },
    {
      "raw_score": 0.7705061728489029,
      "name": "変化許容性",
      "percentile": 0.7885442617572225,
      "trait_id": "value_openness_to_change",
      "category": "values"
    },

(rest omitted)
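Since the API returns scores rather than prose, a small helper makes the JSON easier to read. This is a minimal sketch that assumes the response shape shown in the excerpt above (a list of traits per category, each with name and percentile). rank_traits is a hypothetical helper, and the sample data holds just the two entries visible in the excerpt (trait_id value_conservation and value_openness_to_change).

```python
def rank_traits(profile, category):
    """Return (name, percentile) pairs for one category, highest percentile first."""
    traits = profile.get(category, [])
    pairs = [(trait["name"], trait["percentile"]) for trait in traits]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Two entries copied from the excerpt above (raw_score omitted for brevity).
sample_profile = {
    "values": [
        {"name": "現状維持", "percentile": 0.04565673211210963,
         "trait_id": "value_conservation"},
        {"name": "変化許容性", "percentile": 0.7885442617572225,
         "trait_id": "value_openness_to_change"},
    ]
}

ranked = rank_traits(sample_profile, "values")
for name, percentile in ranked:
    # Show each percentile as a percentage with one decimal place.
    print(f"{name}: {percentile:.1%}")
```

In this sample the openness-to-change value ranks far above the conservation value.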

Author And Source

The original article (自分のブログ記事をIBM Watson Personality Insightsに突っ込んで分析してみる) is available at https://qiita.com/hmatsu47/items/cba33dca86553c0af161

Attribution: information about the original author is given at the original URL. Copyright belongs to the original author.
