Scraping a table with Python and saving it to CSV



Introduction

I wanted to scrape a website and pull out only the parts I needed from content laid out as a table. I tried it on the site below and am leaving these notes as a memo.

The target site is the timetable page at http://jr-miyajimaferry.co.jp/timetable/ (the URL used in the code).

Looking at the page

Looking at the HTML source (excerpt of one table tag)

<table cellpadding="0" cellspacing="0">

        <tr align="center"> 
        <th colspan="5" class="riku"><b>宮島口発</b></th>
        </tr>
        <tr align="center"> 
        <th>時</th>
        <th colspan="4">分</th>
        </tr>
        <tr align="center"> 
        <td><b>5</b></td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>

        <td>&nbsp;</td>
        </tr>

        <tr align="center" class="b"> 
        <td><b>6</b></td>
        <td>&nbsp;</td>
        <td>25</td>
        <td>&nbsp;</td>
        <td>&nbsp;</td>
        </tr>

        <tr align="center"> 
        <td><b>7</b></td>
        <td>05</td>
        <td>&nbsp;</td>
        <td>40</td>
        <td>57</td>
        </tr>

The page is laid out as two tables like this one.
Two table tags are used, and the hours and minutes are enclosed in td tags.
The th cells that act as headers are not needed this time.
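To check this structure without touching the live site, the following minimal sketch parses a shortened copy of the excerpt above. `SAMPLE_HTML` and `rows` are names made up for this illustration. It shows that a row holding only th cells yields an empty list of td cells, and that `&nbsp;` comes back as the non-breaking space character `\xa0`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the excerpt above: one header row and two data rows.
# SAMPLE_HTML is an illustrative name, not something from the original script.
SAMPLE_HTML = """
<table cellpadding="0" cellspacing="0">
  <tr align="center"><th>時</th><th colspan="4">分</th></tr>
  <tr align="center"><td><b>6</b></td><td>&nbsp;</td><td>25</td><td>&nbsp;</td><td>&nbsp;</td></tr>
  <tr align="center"><td><b>7</b></td><td>05</td><td>&nbsp;</td><td>40</td><td>57</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    # Only <td> cells are collected, so a header row made of <th> gives [].
    rows.append([td.get_text() for td in tr.find_all("td")])

for cells in rows:
    print(cells)
```

The first printed list is empty, which is why a simple length check is enough to skip the header rows.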

Environment

Python 3.6.5
requests 2.18.4
beautifulsoup4 4.6.0

BeautifulSoup is a module that makes web scraping easy.
With it, this task looked fairly straightforward.

Code

ferry_scraping.py



import csv

import requests
from bs4 import BeautifulSoup

url = 'http://jr-miyajimaferry.co.jp/timetable/'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')


def tableToCSV(filename, table):
    # Each <tr> holds one hour, so each <tr> becomes one CSV row
    rows = table.find_all('tr')

    with open(filename, 'wt', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for row in rows:
            csvRow = []
            for cell in row.find_all('td'):
                csvRow.append(cell.get_text())

            # The leading header rows contain only <th>, so they yield no cells; skip them
            if len(csvRow) > 0:
                writer.writerow(csvRow)


tables = soup.find_all('table')
table_miyajimaguti_start = tables[0]  # departing Miyajimaguchi
table_miyajima_start = tables[1]  # departing Miyajima

tableToCSV("ferry_miyajimaguti_start.csv", table_miyajimaguti_start)
tableToCSV("ferry_miyajima_start.csv", table_miyajima_start)
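Because the empty cells in the HTML are `&nbsp;`, `get_text()` returns a non-breaking space for them rather than an empty string. If truly empty CSV cells are preferred, one possible variation is `get_text(strip=True)`. The sketch below is an optional alternative, not the script used for the results in this post. `table_to_rows`, `SAMPLE_HTML`, and the in-memory `buffer` are hypothetical names used only so the example runs without network access.

```python
import csv
import io

from bs4 import BeautifulSoup

# Inline sample based on the excerpt shown earlier (illustrative only).
SAMPLE_HTML = """
<table>
  <tr><th>時</th><th colspan="4">分</th></tr>
  <tr><td><b>6</b></td><td>&nbsp;</td><td>25</td><td>&nbsp;</td><td>&nbsp;</td></tr>
  <tr><td><b>7</b></td><td>05</td><td>&nbsp;</td><td>40</td><td>57</td></tr>
</table>
"""


def table_to_rows(table):
    """Return the data rows of a table with whitespace-only cells emptied."""
    rows = []
    for tr in table.find_all("tr"):
        # strip=True trims each string, so an &nbsp;-only cell becomes "".
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows contain only <th>, so they yield no cells
            rows.append(cells)
    return rows


table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
clean_rows = table_to_rows(table)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(clean_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Splitting the parsing from the writing also makes it possible to inspect the rows before anything is saved to disk.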

Result

The two tables were written out as two CSV files. The leftmost column is the hour.

ferry_miyajimaguti_start.csv

5, , , , 
6, ,25, , 
7,05, ,40,57
8,10,25,40,55
9,10,25,40,55
10,10,25,40,55
11,10,25,40,55
12,10,25,40,55
13,10,25,40,55
14,10,25,40,55
15,10,25,40,55
16,10,25,40,55
17,10,25,40,55
18,10,25,45, 
19,15, ,45, 
20, ,27, , 
21,10, , , 
22,00, ,42,

ferry_miyajima_start.csv

5, , ,45, 
6, , ,40, 
7, ,20, ,55
8,10,25,40,55
9,10,25,40,55
10,10,25,40,55
11,10,25,40,55
12,10,25,40,55
13,10,25,40,55
14,10,25,40,55
15,10,25,40,55
16,10,25,40,55
17,10,25,40,55
18,10,25,40, 
19,00,30, , 
20,00, ,42, 
21, ,25, , 
22,14, , ,
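Since the leftmost column is the hour and the remaining columns are minutes, each row can be expanded into departure times. The following is a minimal sketch using only the standard library. `departures` and `SAMPLE_CSV` are hypothetical names, and the sample assumes the blank cells contain non-breaking spaces, as they would after `get_text()` on `&nbsp;`.

```python
import csv
import io

# Three rows in the same shape as ferry_miyajimaguti_start.csv (illustrative sample).
SAMPLE_CSV = "6,\xa0,25,\xa0,\xa0\n7,05,\xa0,40,57\n22,00,\xa0,42,\n"


def departures(csv_text):
    """Expand hour/minute rows into a flat list of HH:MM strings."""
    times = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        hour = int(row[0].strip())
        for minute in row[1:]:
            minute = minute.strip()  # str.strip() also removes non-breaking spaces
            if minute:
                times.append(f"{hour:02d}:{minute}")
    return times


print(departures(SAMPLE_CSV))
# → ['06:25', '07:05', '07:40', '07:57', '22:00', '22:42']
```

To run it on the real output, the contents of one of the generated files could be read with `open(..., encoding='utf-8')` and passed in as `csv_text`.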

Conclusion

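This turned out to be fairly easy. A scraper like this needs maintenance whenever the site layout changes,
but I expect this particular site to keep basically the same form.
Similar sites can probably be handled in much the same way, so I hope this is a useful reference.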
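One way to notice a layout change early is to sanity-check the scraped rows before writing them. The sketch below is only a suggestion under the assumption that every data row has one hour cell followed by four minute cells, as in the output above. `check_timetable_rows` and its `expected_columns` parameter are hypothetical and not part of the original script.

```python
def check_timetable_rows(rows, expected_columns=5):
    """Raise ValueError if rows do not look like an hour cell plus minute cells."""
    if not rows:
        raise ValueError("no data rows found; the page layout may have changed")
    for index, row in enumerate(rows):
        if len(row) != expected_columns:
            raise ValueError(
                f"row {index}: expected {expected_columns} cells, got {len(row)}"
            )
        if not row[0].strip().isdigit():
            raise ValueError(f"row {index}: first cell {row[0]!r} is not an hour")
    return rows


# Rows shaped like the scraped output pass silently.
sample_rows = [["6", "\xa0", "25", "\xa0", "\xa0"], ["7", "05", "\xa0", "40", "57"]]
check_timetable_rows(sample_rows)
```

Failing loudly here is a design choice: a half-empty CSV written without any error is harder to spot than an exception.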

Author And Source

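Source for this article (Scraping a table with Python and saving it to CSV): https://qiita.com/tottu22/items/b298c05b4cf159b0be13

Author attribution: information about the original author is available at the original URL. Copyright belongs to the original author.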
