Classifying O'Reilly Books with Clustering


Goal

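We fetch book information from the O'Reilly Japan website
and use it to classify the books with non-hierarchical clustering.
The steps are as follows:
　・Follow the links on the top page to each book's detail page
　　and collect the book descriptions into a list
　・Split each book's description into words and assign a weight to each word
　・Cluster the books based on those weights
The implementation language is Python.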

Fetching information from the web

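Note: searching for "crawling" and "scraping" turns up plenty of background material.
1. First, collect the URLs of all the new-book detail pages linked from the top page
　and store them in the list allBookLinks.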

clustering.py

#coding:utf-8

import numpy as np
import mechanize
import MeCab
import util
import re
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation

# get O'Reilly new books from Top page
page = mechanize.Browser()
page.open('http://www.oreilly.co.jp/index.shtml')

response = page.response()
soup = BeautifulSoup(response.read(), "html.parser")

allBookLinks = []
bibloLinks = soup.find_all("p", class_="biblio_link")
for bibloLink in bibloLinks:
    books = bibloLink.find_all("a", href=re.compile("http://www.oreilly.co.jp/books/"))
    for book in books:
        allBookLinks.append( book.get("href") )
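
The link-extraction step can be tried without network access. The sketch below is written for Python 3 and swaps BeautifulSoup for the standard library's `html.parser`; the HTML fragment, the `BiblioLinkParser` class, and the `sample-*` URLs are made up for illustration and only imitate the structure the script above looks for. It also uses a prefix match, which is slightly stricter than the regular-expression search in the original.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment imitating the structure the script looks for.
SAMPLE_HTML = """
<p class="biblio_link"><a href="http://www.oreilly.co.jp/books/sample-a/">Book A</a></p>
<p class="other"><a href="http://www.oreilly.co.jp/books/sample-b/">Not collected</a></p>
<p class="biblio_link"><a href="http://example.com/elsewhere">External</a>
<a href="http://www.oreilly.co.jp/books/sample-c/">Book C</a></p>
"""

BOOK_PREFIX = "http://www.oreilly.co.jp/books/"


class BiblioLinkParser(HTMLParser):
    """Collect book URLs that appear inside <p class="biblio_link">."""

    def __init__(self):
        super().__init__()
        self.in_biblio = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            classes = (attrs.get("class") or "").split()
            self.in_biblio = "biblio_link" in classes
        elif tag == "a" and self.in_biblio:
            href = attrs.get("href") or ""
            if href.startswith(BOOK_PREFIX):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_biblio = False


parser = BiblioLinkParser()
parser.feed(SAMPLE_HTML)
all_book_links = parser.links
print(all_book_links)
```

Only links that are both inside a `biblio_link` paragraph and under the book URL prefix are kept, which mirrors the two nested `find_all` calls above.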

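2. Open each detail-page URL collected above, then store the book title
　in titleList and the description in inputDatas.
　Also collect the URLs of related books and add them to the list, going one level deep only.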

clustering.py

def get_detail_sentence_list( detailPageLink ):
    page.open( detailPageLink )
    detailResponse = page.response()
    detailSoup = BeautifulSoup( detailResponse.read(), "html.parser" )
    # get title
    titleTag = detailSoup.find("h3", class_="title")
    title = titleTag.get_text().encode('utf-8')
    # get detail
    detailDiv = detailSoup.find("div", id="detail")
    detail = detailDiv.find("p").get_text().encode('utf-8')
    # get relation book links (keep only the part of the URL after '/books/')
    relationLinks = detailDiv.find_all("a")
    relationLinkList = []
    for relationLink in relationLinks:
        href = relationLink.get("href") or ''  # some <a> tags may have no href
        pos = href.find('/books/')
        if pos > 0:
            relationLinkList.append(href[pos + len('/books/'):])
    return [ title, detail, relationLinkList ]


# crawl book info
titleList = []
inputDatas = []
for bookLink in allBookLinks:
    title, detail, relationLinkList = get_detail_sentence_list( bookLink )
    # save
    if not (title in titleList):
        titleList.append(title)
        inputDatas.append( detail )

    # go to relation book links (one level deep only)
    for relationLink in relationLinkList:
        # the related page's own related links are discarded
        title, detail, _ = get_detail_sentence_list( 'http://www.oreilly.co.jp/books/' + relationLink )
        # save
        if not (title in titleList):
            titleList.append(title)
            inputDatas.append( detail )
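
The crawl logic (visit each top-page book, follow its related books one level deep, skip titles already seen) can be checked in isolation. The sketch below is written for Python 3; `FAKE_SITE`, `fetch_detail`, and `crawl` are hypothetical stand-ins, with `fetch_detail` playing the role of `get_detail_sentence_list` but reading from a dictionary instead of the network.

```python
# Hypothetical stand-in for the site: URL -> (title, description, related book ids).
FAKE_SITE = {
    "books/a": ("Book A", "about python", ["b", "c"]),
    "books/b": ("Book B", "about ruby", ["a"]),
    "books/c": ("Book C", "about testing", ["d"]),
    "books/d": ("Book D", "only linked from Book C", []),
}


def fetch_detail(url):
    """Stand-in for get_detail_sentence_list(); no network access."""
    return FAKE_SITE[url]


def crawl(top_links):
    titles, descriptions = [], []
    seen = set()  # set lookup avoids rescanning the title list each time

    def save(title, detail):
        if title not in seen:
            seen.add(title)
            titles.append(title)
            descriptions.append(detail)

    for link in top_links:
        title, detail, related = fetch_detail(link)
        save(title, detail)
        # Follow related books one level deep only; their own related
        # links are deliberately ignored.
        for book_id in related:
            rel_title, rel_detail, _ = fetch_detail("books/" + book_id)
            save(rel_title, rel_detail)
    return titles, descriptions


titles, descriptions = crawl(["books/a", "books/b"])
print(titles)
```

Book D is never collected here because it is two levels away from the top page, which is the same cut-off the loop above applies.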

Weighting each book's description with TF-IDF

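X, the matrix returned by TfidfVectorizer, looks like this:
　・X.shape[0] = the number of books crawled
　・X.shape[1] = the number of distinct words across all book descriptions
　・X[0, 0] = the TF-IDF value, in book 0, of word 0 in the vocabulary (the word stored in terms[0])
X is a SciPy sparse matrix, so len( X ) and X[0][0] do not behave as they would on a nested list; use X.shape and X[0, 0] instead.
You could write the TF-IDF logic yourself, but using this library is easier.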
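
For a sense of what writing the logic by hand would involve, here is a minimal pure-Python sketch (Python 3). It assumes the weighting that TfidfVectorizer applies with its default settings: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The `tfidf` function and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns (terms, rows), one row per document."""
    terms = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in terms}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in terms]
        # L2-normalize so that long and short descriptions are comparable.
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return terms, rows


# Toy token lists standing in for MeCab output.
docs = [
    ["python", "data", "python"],
    ["ruby", "data"],
    ["python", "testing"],
]
terms, X = tfidf(docs)
print(terms)
for row in X:
    print([round(v, 3) for v in row])
```

In the second document, "ruby" gets a higher weight than "data" because "data" also appears in another document and is therefore less distinctive.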

clustering.py

def get_word_list( targetText ):
    tagger = MeCab.Tagger()
    wordList = []
    if len(targetText) > 0:
        node = tagger.parseToNode(targetText)
        while node:
            if len(util.mytrim(node.surface)) > 0:
                wordList.append(node.surface)
            node = node.next
    return wordList

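# analyzer=get_word_list: tokenize with MeCab instead of the built-in analyzer
# min_df=1 / max_df=50 (integers): keep words that appear in 1 to 50 documents
tfidfVectonizer = TfidfVectorizer(analyzer=get_word_list, min_df=1, max_df=50)
X = tfidfVectonizer.fit_transform( inputDatas )
# scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
terms = tfidfVectonizer.get_feature_names()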

util.py

#coding:utf-8

def mytrim( target ):
    target = target.replace('　','')
    return target.strip()

Classifying the books by clustering

I tried both K-means and AffinityPropagation.
K-means suits cases where the number of clusters is decided in advance;
when it is not, AffinityPropagation can work quite well.
For this data set, I think AffinityPropagation was the better fit.
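
The difference between the two is easy to see on toy data. The sketch below (Python 3, requires NumPy and scikit-learn) uses six made-up 2-D points rather than the book vectors; the `random_state` values are arbitrary choices to make runs repeatable.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

# Toy 2-D points forming two obvious groups (made-up data, not the book vectors).
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)

# K-means: the number of clusters must be supplied up front.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("KMeans labels:", km.labels_)

# AffinityPropagation: there is no n_clusters argument; the count follows
# from the data and from the `preference` and `damping` parameters.
ap = AffinityPropagation(random_state=0).fit(X)
print("AffinityPropagation labels:", ap.labels_)
print("clusters found:", len(ap.cluster_centers_indices_))
```

If AffinityPropagation returns too many or too few clusters, `preference` is the parameter to adjust: lower values tend to produce fewer clusters.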

clustering.py

# clustering by KMeans
k_means = KMeans(n_clusters=5, init='k-means++', n_init=5, verbose=True)
k_means.fit(X)
label = k_means.labels_

clusterList = {}
for i in range(len(titleList)):
    clusterList.setdefault( label[i], '' )
    clusterList[label[i]] = clusterList[label[i]] + ',' + titleList[i]

print 'By KMeans'
for key, value in clusterList.items():
    print key
    print value

print 'By AffinityPropagation'
# clustering by AffinityPropagation
af = AffinityPropagation().fit(X)
afLabel = af.labels_
afClusterList = {}
for i in range(len(titleList)):
    afClusterList.setdefault( afLabel[i], '' )
    afClusterList[afLabel[i]] = afClusterList[afLabel[i]] + ',' + titleList[i]

for key, value in afClusterList.items():
    print key
    print value
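
The label-to-titles grouping above builds comma-joined strings, which leaves a leading comma on each value. As an alternative sketch (Python 3), the same grouping can collect titles into lists with `collections.defaultdict`; `group_titles` and the sample labels and titles are made up for illustration.

```python
from collections import defaultdict


def group_titles(labels, titles):
    """Map each cluster label to the list of titles assigned to it."""
    clusters = defaultdict(list)
    for label, title in zip(labels, titles):
        # int() turns NumPy integer labels into plain Python ints.
        clusters[int(label)].append(title)
    return dict(clusters)


# Made-up labels and titles for illustration only.
labels = [0, 1, 0, 2, 1]
titles = ["Book A", "Book B", "Book C", "Book D", "Book E"]
for label, members in sorted(group_titles(labels, titles).items()):
    print("Cluster %d: %s" % (label, "; ".join(members)))
```

With the script's variables, the call would be `group_titles(afLabel, titleList)` or `group_titles(label, titleList)`.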

For reference, here is the output of the AffinityPropagation run

The result looks fairly plausible!

Cluster 1: 実践機械学習システム; ハイパフォーマンスPython; 初めてのコンピュータサイエンス; Make: Electronics――作ってわかる電気と電子回路の基礎; キャパシティプランニング――リソースを最大限に活かすサイト分析・予測・配置; 詳説イーサネット第2版; JavaScriptによるデータビジュアライゼーション入門
Cluster 2: 実践 Python 3; Cython――Cとの融合によるPythonの高速化; MongoDB & Python; Python & AWS クックブック; Pythonによるデータ分析入門――NumPy、pandasを使ったデータ処理; Python文法詳解; 実践コンピュータビジョン; 入門 Python 3; 初めてのPython 第3版; Pythonチュートリアル　第2版; Arduinoをはじめよう第3版; Processingをはじめよう; Python クックブック第2版; 入門自然言語処理; OpenStack Swift――Swiftオブジェクトストレージの管理と開発; SAN & NAS ストレージネットワーク管理
Cluster 3: Prototyping Lab――「作りながら考える」ためのArduino実践レシピ; ウェブオペレーション――サイト運用管理の実践テクニック; 実践 Metasploit――ペネトレーションテストによる脆弱性評価; ビジュアライジング・データ――Processingによる情報視覚化手法; ビューティフルビジュアライゼーション
Cluster 4: メタプログラミングRuby 第2版; Rubyベストプラクティス――プロフェッショナルによるコードとテクニック; アンダースタンディングコンピュテーション――単純な機械から不可能なプログラムまで; 初めてのRuby; プログラミング言語 Ruby
Cluster 5: Seleniumデザインパターン & ベストプラクティス; 実践 Selenium WebDriver; テスタブルJavaScript; ビューティフルテスティング――ソフトウェアテストの美しい実践

Author And Source

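More information on this topic (Classifying O'Reilly Books with Clustering) can be found at https://qiita.com/xiao/items/ba54bf354536ccda8555

Attribution: information about the original author is available at the original URL. Copyright belongs to the original author.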
