[Practice] Scraping baseball-lab with Python

Introduction

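After taking the course 【世界で5万人が受講】実践 Python データサイエンス ("Practical Python Data Science", taken by 50,000 people worldwide), I wanted to work with baseball data, so I decided to try scraping some.
I have been programming for about two weeks.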

What I did

Extracted the 2018 batting data for the BayStars from Baseball LAB
Converted the extracted data into a DataFrame

References

Baseball LAB (ベースボールラボ)
【世界で5万人が受講】実践 Python データサイエンス

Code

from bs4 import BeautifulSoup
import requests
import pandas as pd
from pandas import Series, DataFrame

# 2018 batting data for the BayStars
url = 'http://www.baseball-lab.jp/player/batter/3/2018/'

# This part follows the course material
result = requests.get(url)
c = result.content
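# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(c, 'html.parser')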
summary = soup.find('div', {'class': 'content-holder'})
tables = summary.find_all('table')
data = []
rows = tables[0].find_all('tr')
for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        # Take the first text node found in the cell
        players = td.find(text=True)
        data.append(players)


# Import numpy, convert the list to an array, and reshape it into rows of 26 values
import numpy as np
arr1 = np.array(data).reshape(-1,26)
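
The reshape step relies on every player row contributing exactly 26 cells to the flat list. As a minimal sketch of what `reshape(-1, 26)` does, the example below uses made-up values with 3 columns instead of 26; `flat`, `n_cols`, and `rows` are hypothetical names that do not appear in the code above.

```python
import numpy as np

# Made-up flat list standing in for `data`: 2 players x 3 columns.
flat = ['7', 'Player A', '120', '44', 'Player B', '98']
n_cols = 3  # 26 for the table in this article

# -1 lets NumPy infer the row count; a ValueError is raised if
# len(flat) is not a multiple of n_cols.
arr = np.array(flat).reshape(-1, n_cols)
print(arr.shape)

# The same grouping without NumPy: slice the list into chunks.
rows = [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]
print(rows[1])
```

If the site ever adds or removes a column, the reshape would fail or shift values into the wrong columns, so checking `len(data) % 26 == 0` before reshaping may be worthwhile.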

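At this point the array to turn into a DataFrame is ready, so next I prepared the labels to use as columns. (I tried to pull them from the th tags, but the values came back empty, possibly because of line breaks.)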

:index_batter.txt
背番号
選手名
試合
打席
打数
得点
安打
二塁打
三塁打
本塁打
塁打
打点
三振
四球
敬遠
死球
犠打
犠飛
盗塁
盗塁刺
併殺打
失策
打率
長打率
出塁率
OPS

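# Assign names to the columns
# Read the labels as UTF-8; the with block closes the file automatically
with open('index_batter.txt', encoding='utf-8') as f:
    index_batter = f.read().split()
print(index_batter)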

df = DataFrame(arr1)
df.columns = index_batter

# Display the result
df
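
Because every cell was read as text, the columns of `df` hold strings rather than numbers, which gets in the way of sorting or averaging. The following is a minimal sketch of converting the stat columns with `pd.to_numeric`, assuming a small made-up frame that borrows three of the column names above; `sample` and `stat_cols` are hypothetical names.

```python
import pandas as pd

# Made-up sample; the values are strings, as they are after scraping.
sample = pd.DataFrame(
    [['7', 'Player A', '.310'], ['44', 'Player B', '.275']],
    columns=['背番号', '選手名', '打率'],
)

# Convert everything except the name column. errors='coerce' turns
# cells that are not numbers (for example '-') into NaN instead of
# raising an exception.
stat_cols = [c for c in sample.columns if c != '選手名']
sample[stat_cols] = sample[stat_cols].apply(pd.to_numeric, errors='coerce')

print(sample.dtypes)
```

The same two lines should apply to `df` unchanged, since `選手名` is the only non-numeric column in the list of labels.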

Future work

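Clean up the player names, which contain line breaks and spaces and are awkward to work with
Pull the column names directly from the HTML
Compare across seasons and against other teams
Visualize the data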
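
The first two items can probably be handled with Beautiful Soup's `get_text`. A likely cause of the empty header values is that `find(text=True)` returns only the first text node, which is just whitespace when the tag begins with a line break. The sketch below assumes a made-up HTML fragment; the real page's markup may differ, and `html`, `headers`, and `name` are hypothetical names.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML imitating a header row and a player row with line
# breaks inside the cells; the real page's markup may differ.
html = """
<table>
  <tr><th>
    背番号
  </th><th>
    選手名
  </th></tr>
  <tr><td>7</td><td>
    <a href="#">山田　 太郎</a>
  </td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# get_text(strip=True) gathers all text inside the tag and trims it,
# so a leading line break no longer yields an empty-looking value.
headers = [th.get_text(strip=True) for th in soup.find_all('th')]

# Remove every whitespace character from the player name. In Python 3,
# \s also matches the full-width space (U+3000).
name_cell = soup.find_all('td')[1]
name = re.sub(r'\s+', '', name_cell.get_text())

print(headers)
print(name)
```

Here all whitespace is removed from the name; replacing each run with a single space is an alternative if the gap between family and given name should be kept.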

Author And Source

More information about this topic ([Practice] Scraping baseball-lab with Python) can be found at https://qiita.com/takeuchi_kojii/items/c82d16a36ad3e2c5902a
