The Ultimate Guide to Web Scraping with Node.js


What is web scraping?

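It involves automating the task of collecting information from websites.
There are many use cases for web scraping: you might want to collect prices from various e-commerce sites for a price comparison site, or you may need flight times and hotel listings for a travel site. Maybe you want to collect emails from various directories for sales leads, or even build a search engine like Google!
Getting started with web scraping is easy, and the process can be broken down into two main parts: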

Acquiring the data using an HTTP request library or a headless browser (maybe we'll check this out in another post).

And parsing the data to get the exact information you want.

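This guide walks through the process with the popular Node.js request-promise module, CheerioJS, and Puppeteer. Working through the examples in this post, we will learn all the tips and tricks you need to become a pro at gathering any data you need with Node.js!
We will be gathering a list of the names and birthdays of all the presidents of India from Wikipedia.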

Let's do it step by step

Step 1: Check that Node and npm are installed on your system.
Run these commands in your terminal/command line

node -v

and

npm -v

If both commands print a version number, Node and npm are already installed; if you get an error instead, install them first. The output will look something like

v14.16.1

Step 2: Set up a new npm package
Run the command

npm init -y

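This command does the hard work behind the scenes and creates a package.json file, which keeps track of all the dependencies and devDependencies we install throughout our program.
Step 3: Make your first request
Start by installing the modules we need: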

npm i -D request request-promise cheerio puppeteer

npm install --save request request-promise cheerio puppeteer

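The -D and --save flags control where npm records the modules in package.json: -D (short for --save-dev) lists them under devDependencies, while --save lists them under dependencies, which is also npm's default since version 5. You only need to run one of the two commands; since the scraper needs these modules at run time, the second is the more natural choice.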

Puppeteer will take a while to install because it also needs to download Chromium. You can leave it out for now, since we are not using Puppeteer in our program yet.

Now go to your favorite code editor/IDE
Let's create a file named scraper.js and write a quick function to get the HTML of the Wikipedia “List of presidents” page.

const rp = require('request-promise');
const url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_India';

rp(url)
  .then((html)=>{
    console.log(html);
  })
  .catch((err)=>{
    console.log(err);
  });

Output:

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of Presidents of the India - Wikipedia</title>
...
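
Since the request and request-promise packages are marked as deprecated on npm, it may be worth knowing an alternative. The following is a minimal sketch of the same request written with async/await and the fetch function that ships with Node 18 and later (an assumption; the rest of this guide uses Node 14 and request-promise). The function name fetchHtml and the injectable fetchImpl parameter are illustrative choices, not part of any library.

```javascript
// Sketch: fetch a page's HTML with async/await.
// Assumptions: Node 18+ provides a global fetch; fetchHtml and fetchImpl are
// hypothetical names introduced for this example only.
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  // fetch only rejects on network failures, so HTTP errors are checked by hand.
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  return response.text();
}

// Usage (needs network access):
// fetchHtml('https://en.wikipedia.org/wiki/List_of_presidents_of_India')
//   .then((html) => console.log(html.slice(0, 200)))
//   .catch((err) => console.error(err.message));
```

Passing the fetch implementation as a parameter keeps the function easy to try out with a stub, without hitting Wikipedia on every run.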

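Using Chrome DevTools

Cool, we got the raw HTML from the web page! But now we need to make sense of this giant blob of text. To do that, we'll use Chrome DevTools, which lets us easily search through the HTML of a web page.
Using Chrome DevTools is easy: simply open Google Chrome and right-click the element you would like to scrape.

Now, simply click Inspect, and Chrome will bring up its DevTools pane, allowing you to easily inspect the page's source HTML.

After inspecting the name of a president of India, we learned that the names are stored inside th tags, each wrapped in an anchor tag. So let's use that!
Step 4: Parse the HTML with Cheerio.js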

const rp = require('request-promise');
const $ = require('cheerio');
const url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_India';

rp(url)
  .then((html)=>{
    console.log($('th > a', html).length);
    console.log($('th > a', html));
  })
  .catch((err)=>{
    console.log(err);
  });

Output:

18
{ '0':
  { type: 'tag',
    name: 'a',
    attribs: { href: '/wiki/Rajendra_Prasad', title: 'Rajendra Prasad' },
    children: [ [Object] ],
    next: null,
    prev: null,
    parent:
      { type: 'tag',
        name: 'big',
        attribs: {},
        children: [Array],
        next: null,
        prev: null,
        parent: [Object] } },
  '1':
    { type: 'tag'
...

Note:

I ran into some issues using Cheerio and found that sometimes require('cheerio').default is needed. So if you get an error saying that cheerio is not a function or $ is not a function, try using this:

var $ = require('cheerio');
if (typeof $ != "function") $ = require("cheerio").default;

It worked for me!
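
The fallback above can also be wrapped in a small helper so that a missing export fails loudly instead of surfacing later as "$ is not a function". This is only a sketch: resolveCallable is a hypothetical helper name, and whether Cheerio exports a callable at all depends on the installed version. Cheerio also documents cheerio.load(html) as an entry point, which may be the more durable option.

```javascript
// Sketch: pick a callable out of a CommonJS module object.
// resolveCallable is a hypothetical helper, not part of Cheerio.
function resolveCallable(mod, name = 'module') {
  // Some builds export the function directly...
  if (typeof mod === 'function') return mod;
  // ...others expose it as the `default` property.
  if (mod && typeof mod.default === 'function') return mod.default;
  throw new TypeError(`${name} does not export a callable`);
}

// Usage with the import from this guide (requires cheerio to be installed):
// const $ = resolveCallable(require('cheerio'), 'cheerio');
```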
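Step 5: Get the names of all the presidents.
We check that exactly 18 elements are returned (the number of entries in the list of presidents of India), which means there are no extra hidden "th" tags elsewhere on the page. Now we can go through them and grab a list of links to all 18 presidential Wikipedia pages from each element's “attribs” section.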

const rp = require('request-promise');
let $ = require('cheerio'); // let, not const: the fallback below may reassign $
const url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_India';
if (typeof $ != "function") $ = require("cheerio").default;

rp(url)
  .then((html)=>{
    const presidentUrls = [];
    const length = $("th > a", html).length;
    for (let i = 0; i < length ; i++) {
      presidentUrls.push($('th > a', html)[i].attribs.href);
    }
    console.log(presidentUrls);
  })
  .catch((err)=>{
    console.log(err);
  });

Output:

[
  '/wiki/Rajendra_Prasad',
  '/wiki/Sir Sarvepalli_Radhakrishnan',
  '/wiki/Zakir_Husain',
  '/wiki/V._V._Giri',
  '/wiki/Mohammad_Hidayatullah',
  '/wiki/V._V._Giri',
  '/wiki/Fakhruddin_Ali_Ahmed',
  ...
]
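
The links in this list are relative, and some of them repeat (for example '/wiki/V._V._Giri' appears twice). As a small sketch, they could be turned into absolute URLs and de-duplicated before any further requests are made. The helper name toUniqueAbsoluteUrls and the BASE_URL constant are assumptions introduced for this example; whether dropping repeated entries is desirable depends on what you want the final list to represent.

```javascript
// Sketch: resolve relative Wikipedia links and drop repeats.
// toUniqueAbsoluteUrls and BASE_URL are hypothetical names for this example.
const BASE_URL = 'https://en.wikipedia.org';

function toUniqueAbsoluteUrls(hrefs, base = BASE_URL) {
  // The built-in URL class resolves each href against the base address.
  const absolute = hrefs.map((href) => new URL(href, base).href);
  // A Set keeps the first occurrence of each URL and preserves order.
  return [...new Set(absolute)];
}

const sample = [
  '/wiki/Rajendra_Prasad',
  '/wiki/V._V._Giri',
  '/wiki/Mohammad_Hidayatullah',
  '/wiki/V._V._Giri',
];

console.log(toUniqueAbsoluteUrls(sample));
// [
//   'https://en.wikipedia.org/wiki/Rajendra_Prasad',
//   'https://en.wikipedia.org/wiki/V._V._Giri',
//   'https://en.wikipedia.org/wiki/Mohammad_Hidayatullah'
// ]
```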

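Step 6: Grab the birthdays from the HTML pages.
We now have a list of all 18 presidential Wikipedia pages. Let's create a new file (named scrapParse.js) that will contain a function which takes a presidential Wikipedia page and returns the president's name and birthday. First things first, let's get the raw HTML from Rajendra Prasad's Wikipedia page.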

const rp = require('request-promise');
const url = 'https://en.wikipedia.org/wiki/Rajendra_Prasad';

rp(url)
  .then((html)=> {
    console.log(html);
  })
  .catch((err)=> {
    console.log(err);
  });

Output:

<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Rajendra Prasad - Wikipedia</title>
...

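Let's once again use Chrome DevTools to find the syntax of the code we want to parse, so that we can extract the name and birthday with Cheerio.js.

So we see that the name is in a class called “firstHeading” and the birthday is in a class called “bday”. Let's modify our code to use Cheerio.js to extract these two classes.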

const rp = require('request-promise');
let $ = require('cheerio'); // let, not const: the fallback below may reassign $
const url = 'https://en.wikipedia.org/wiki/Rajendra_Prasad';
if (typeof $ != "function") $ = require("cheerio").default;

rp(url)
  .then((html)=> {
    console.log($('.firstHeading', html).text());
    console.log($('.bday', html).text());
  })
  .catch((err)=> {
    console.log(err);
  });

Output:

Rajendra Prasad
1884-12-03
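
Cheerio's text() returns the combined text of every matched element, and an empty string when nothing matches, so the birthday string is worth checking before it is trusted. Here is a minimal sketch of such a check, assuming the value is expected in the YYYY-MM-DD form shown above. The function name parseBirthday is hypothetical.

```javascript
// Sketch: accept only a single, real YYYY-MM-DD date.
// parseBirthday is a hypothetical helper introduced for this example.
function parseBirthday(text) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(String(text).trim());
  if (!match) return null; // empty, concatenated, or otherwise malformed text

  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  const date = new Date(Date.UTC(year, month - 1, day));

  // Date silently rolls impossible dates over (for example February 31),
  // so confirm that the parts survived the round trip.
  const valid =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  return valid ? date : null;
}

console.log(parseBirthday('1884-12-03').toISOString()); // 1884-12-03T00:00:00.000Z
console.log(parseBirthday(''));                          // null
```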

Step 7: Putting it all together
Let's wrap this up into a function and export it from this module.

const rp = require('request-promise');
var $ = require('cheerio');

if( typeof $ != 'function' ) $ = require('cheerio').default;

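// Fetch one president's page and return the name and birthday found on it.
const scrapParse = (url) => {
  return rp(url)
    .then((html) => {
      return {
        name: $('.firstHeading', html).text(),
        birthday: $('.bday', html).text(),
      };
    })
    .catch((err) => {
      // The error is only logged, so the promise resolves to undefined on failure.
      console.log(err);
    });
};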

module.exports = scrapParse;

Now let's return to our original file scraper.js and require the scrapParse.js module. We'll then apply it to the list of president URLs we gathered earlier.

const rp = require("request-promise");
var $ = require("cheerio");
const scrapParse = require("./scrapParse"); // local module, so the path needs the ./ prefix
if (typeof $ != "function") $ = require("cheerio").default;

const url = "https://en.wikipedia.org/wiki/List_of_presidents_of_India";

rp(url)
  .then((html) => {
    const presidentUrl = [];
    const length = $("th > a", html).length;
    for (let i = 0; i < length; i++) {
      presidentUrl.push($("th > a", html)[i].attribs.href);
    }
    return Promise.all(
      presidentUrl.map((name) => {
        return scrapParse(`https://en.wikipedia.org${name}`);
      })
    );
  })
  .then((presidents) => {
    console.log(presidents);
  })
  .catch((err) => {
    console.log(err);
  });

Output:

[
  { name: 'Rajendra Prasad', birthday: '1884-12-03' },
  { name: 'Sarvepalli Radhakrishnan', birthday: '1888-09-05' },
  { name: 'Zakir Husain (politician)', birthday: '1897-02-08' },
  { name: 'V. V. Giri', birthday: '1894-08-10' },
  { name: 'V. V. Giri', birthday: '1894-08-10' },
  { name: 'Fakhruddin Ali Ahmed', birthday: '1905-05-13' },
  { name: 'B. D. Jatti', birthday: '1912-09-10' },
  { name: 'Neelam Sanjiva Reddy', birthday: '1913-05-19' },
  { name: 'Zail Singh', birthday: '1916-05-05' },
  { name: 'Zail Singh', birthday: '1916-05-05' },
  { name: 'Zail Singh', birthday: '1916-05-05' },
  { name: 'Ramaswamy Venkataraman', birthday: '1910-12-04' },
  { name: 'Shankar Dayal Sharma', birthday: '1918-08-19' },
  { name: 'K. R. Narayanan', birthday: '1997-07-25' },
  { name: 'A. P. J. Abdul Kalam', birthday: '1931-10-15' },
  { name: 'Pratibha Patil', birthday: '1934-12-19' },
  { name: 'Pranab Mukherjee', birthday: '1935-12-11' },
  { name: 'Ram Nath Kovind', birthday: '1945-10-01' }
]
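
Promise.all starts all 18 requests at once. For a longer list it can be kinder to the target site to cap how many run in parallel. The following is one possible sketch of such a cap in plain JavaScript; mapWithLimit is a hypothetical helper, and the limit of 3 in the usage note is an arbitrary value chosen for illustration.

```javascript
// Sketch: run an async worker over a list with at most `limit` calls in flight.
// mapWithLimit is a hypothetical helper introduced for this example.
async function mapWithLimit(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has claimed yet

  // Each runner keeps claiming the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results; // same order as the input list
}

// Usage inside scraper.js, in place of Promise.all(presidentUrl.map(...)):
// return mapWithLimit(presidentUrl, 3, (path) =>
//   scrapParse(`https://en.wikipedia.org${path}`)
// );
```

As with Promise.all, a rejected worker rejects the whole call. Because scrapParse catches its own errors, a failed page would show up as an undefined entry in the results instead.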