Web Scraping with Strapi

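If there is one message I am trying to get across to the Strapi community, it is that you can create any type of application with Strapi: a simple blog, a showcase site, a corporate site, an e-commerce site, or an API that can be consumed by mobile applications, among others.
Thanks to its powerful customization, you can build whatever you want with Strapi. In this tutorial, I will guide you through creating an application that scrapes a website. I have used Strapi in the past for a freelance mission that consisted of scraping several websites to collect information.
Strapi was very useful because I could build the architecture and manage my scrapers in just a few clicks. That said, the following steps may not be the best way to build a scraping app with Strapi. There are no official guidelines; it depends on your needs and your imagination.
For this tutorial, we will scrape the jamstack.org site, and more precisely its site generators section. The site lists headless CMSs such as Strapi, as well as site generators and frameworks such as Next.js, Gatsby, or Jekyll.
We will simply create an app that, via a cron job, collects the site generators every day and inserts them into a Strapi app.
To do this, we will use Puppeteer to control a browser and Cheerio to extract the information we need.
Let's get started!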

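Creating the Strapi application

Create a Strapi application using the following command:

npx create-strapi-app jamstack-scraper --quickstart

This application will use an SQLite database. Feel free to remove the --quickstart option to choose your preferred database.

Create your administrator by submitting the form.

Create the first collection type, called scraper:
Content-Types Builder > + Create new collection type.

You should now be able to see your scrapers by clicking Scrapers in the side navigation:

Click Add New Scrapers.

This view is not well organized.

Click Configure the view and rearrange the fields.

Press Save.

By creating a scraper collection type, you can manage the behavior of your scrapers directly from the admin. You can: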

1. Enable/Disable it
2. Update the frequency
3. See the errors of the last execution
4. Get a report of the last execution
5. See all the data scraped for this scraper with a relation
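
As a rough picture of what this collection type holds, a scraper entry might look like the object below. The field names `enabled`, `frequency`, `next_execution_at`, `report`, and `error` are the ones the code later in this tutorial reads and writes; the sample values and the `site_generators` relation name are assumptions made for illustration.

```javascript
// Hypothetical scraper entry; the values are illustrative, not real data.
const exampleScraper = {
  name: 'Jamstack.org',      // assumed display name
  slug: 'jamstack-org',      // slug used by the script in this tutorial
  enabled: false,            // 1. enable/disable the scraper
  frequency: '* * * * *',    // 2. cron expression
  next_execution_at: null,   // filled in by the script on its first tick
  error: null,               // 3. errors of the last execution
  report: null,              // 4. report of the last execution
  site_generators: [],       // 5. relation to the scraped data (assumed name)
};

// Mirrors the guard used later in the script: a scraper only runs
// when it exists, is enabled, and has a frequency.
const isRunnable = (scraper) =>
  scraper != null && Boolean(scraper.enabled) && Boolean(scraper.frequency);

console.log(isRunnable(exampleScraper)); // false, because it is disabled
```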

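Create another collection type for the site generators (the code below queries it as `site-generator`):

Perfect! Everything seems ready regarding the content in the admin!
Now let's create our scraper.

Creating the scraper

Click on Scrapers.

Create the following scraper:

Here we have: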

1. The name of the scraper.
2. The slug which is generated thanks to the name.
3. We disable it for now.
4. The frequency here is expressed using cron schedule expressions. `* * * * *` here means every minute. Since after testing we want to launch the scraper every day at, let's say, 3pm, it would be something like this: `0 15 * * *`
5. The `next_execution_at` field will contain the timestamp of the next execution, stored as a string. This way the scraper respects the frequency. More details follow later ;)
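
To make the two cron expressions in the list easier to read, the sketch below splits a five-field expression into named fields. It is only an illustration: `describeCron` is a hypothetical helper, and it does not validate or schedule anything (the tutorial relies on the cron-parser package for that).

```javascript
// Names of the five fields of a standard cron expression, in order.
const FIELD_NAMES = ['minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek'];

// Hypothetical helper: label each field of a five-field cron expression.
const describeCron = (expression) => {
  const parts = expression.trim().split(/\s+/);
  if (parts.length !== FIELD_NAMES.length) {
    throw new Error(`Expected 5 fields, got ${parts.length}`);
  }
  return Object.fromEntries(FIELD_NAMES.map((name, i) => [name, parts[i]]));
};

console.log(describeCron('* * * * *'));
// { minute: '*', hour: '*', dayOfMonth: '*', month: '*', dayOfWeek: '*' }
console.log(describeCron('0 15 * * *'));
// { minute: '0', hour: '15', dayOfMonth: '*', month: '*', dayOfWeek: '*' }
```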

Let's dive into the code!
The first thing to do is to make your application able to fetch a scraper by its slug.

Add the following code to your ./api/scraper/controllers/scraper.js file:

const { sanitizeEntity } = require('strapi-utils');

module.exports = {
  /**
   * Retrieve a record.
   *
   * @return {Object}
   */

  async findOne(ctx) {
    const { slug } = ctx.params;

    const entity = await strapi.services.scraper.findOne({ slug });
    return sanitizeEntity(entity, { model: strapi.models.scraper });
  },
};

Update only the findOne route in your ./api/scraper/config/routes.json file with the following code:

{
  "routes": [
    ...
    {
      "method": "GET",
      "path": "/scrapers/:slug",
      "handler": "scraper.findOne",
      "config": {
        "policies": []
      }
    },
    ... 
    ]
}

Great! You can now fetch a scraper by its slug.
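
As a quick way to try the new route, the sketch below builds the URL for a slug and fetches it. It rests on a few assumptions: the server listens on `http://localhost:1337` (the default port in ./config/server.js), the role making the request has been granted the findOne permission, and Node 18 or later is used so that `fetch` is available globally. `scraperUrl` and `fetchScraper` are hypothetical names.

```javascript
// Build the URL of the slug-based route defined in routes.json.
const scraperUrl = (baseUrl, slug) =>
  new URL(`/scrapers/${encodeURIComponent(slug)}`, baseUrl).toString();

// Fetch a scraper by slug (assumes a running Strapi server and Node 18+).
const fetchScraper = async (baseUrl, slug) => {
  const response = await fetch(scraperUrl(baseUrl, slug));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
};

console.log(scraperUrl('http://localhost:1337', 'jamstack-org'));
// http://localhost:1337/scrapers/jamstack-org
```

With the server running, `fetchScraper('http://localhost:1337', 'jamstack-org')` should resolve to the sanitized entity returned by the controller.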
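Now let's create our scraper.

Create a ./scripts folder at the root of your Strapi project.

Create a ./scripts/scrapers folder inside the previously created folder.

Create an empty ./scripts/scrapers/jamstack.js file.

This file will contain the logic to scrape the site generators from jamstack.org. It will be called by the cron job, but during development we will call it from the ./config/functions/bootstrap.js file, which runs every time the server restarts. This way, you can try out your script each time you save a file.
In the end, we will remove it from there and call it from our cron file every minute. The script will then check whether it is time to run, according to the frequency defined on the scraper.

Add the following code to your ./scripts/scrapers/jamstack.js file: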

const main = async () => {
  // Fetch the correct scraper thanks to the slug
  const slug = "jamstack-org"
  const scraper = await strapi.query('scraper').findOne({
    slug: slug
  });

  console.log(scraper);

  // If the scraper doesn't exist, is disabled or doesn't have a frequency then we do nothing
  if (scraper == null || !scraper.enabled || !scraper.frequency)
    return
}

exports.main = main;

Update your ./config/functions/bootstrap.js file with the following:

'use strict';

/**
 * An asynchronous bootstrap function that runs before
 * your application gets started.
 *
 * This gives you an opportunity to set up your data model,
 * run jobs, or perform some special logic.
 *
 * See more details here: https://strapi.io/documentation/developer-docs/latest/concepts/configurations.html#bootstrap
 */
const jamstack = require('../../scripts/scrapers/jamstack.js')

module.exports = () => {
  jamstack.main()
};

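After saving your bootstrap file, you should now see your scraper (as JSON) in your terminal.

Great! Now it's time to scrape!
First, we are going to create a file containing some useful functions for our script.

Create a ./scripts/scrapers/utils/utils.js file containing the following: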

'use strict'

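The first function will use the cron-parser package to check whether the script can run, according to the frequency you set in the admin. We will also use the chalk package to display messages in color.

Add these packages by running the following command (the code below loads chalk with require, and chalk 5 is published as an ES module only, so you may need to pin chalk@4):

yarn add cron-parser chalk

Add the following code to your ./scripts/scrapers/utils/utils.js file: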

'use strict'

const parser = require('cron-parser');

const scraperCanRun = async (scraper) => {
  const frequency = parser.parseExpression(scraper.frequency);
  const current_date = parseInt((new Date().getTime() / 1000));
  let next_execution_at = ""

  if (scraper.next_execution_at){
    next_execution_at = scraper.next_execution_at
  }
  else {
    next_execution_at = (frequency.next().getTime() / 1000);
    await strapi.query('scraper').update({
      id: scraper.id
    }, {
      next_execution_at: next_execution_at
    });
  }

  if (next_execution_at <= current_date){
    await strapi.query('scraper').update({
      id: scraper.id
    }, {
      next_execution_at: (frequency.next().getTime() / 1000)
    });
    return true
  }
  return false
}

module.exports = { scraperCanRun }

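This function checks whether it is time to run the scraper. The cron-parser package parses the frequency you set on your scraper and, together with the next_execution_at field, determines whether this is the right time to run it.
Important: if you change the frequency of your scraper, you need to delete the next_execution_at value. It will then be reset according to your new frequency.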
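
The reset described in the note above could also be done in the same update that changes the frequency. The sketch below assumes the `strapi.query('scraper').update(params, values)` call used elsewhere in this tutorial, and that setting the field to `null` clears it. `changeFrequency` and `fakeStrapi` are hypothetical names; the fake object exists only so the snippet runs on its own.

```javascript
// Hypothetical helper: change the frequency and clear next_execution_at together,
// so the scheduling check recomputes it from the new frequency on its next tick.
const changeFrequency = async (strapi, scraperId, frequency) => {
  return strapi.query('scraper').update(
    { id: scraperId },
    { frequency, next_execution_at: null } // assumption: null clears the stored value
  );
};

// Minimal in-memory stand-in for the Strapi query API (illustration only).
const fakeStrapi = {
  rows: { 1: { id: 1, frequency: '* * * * *', next_execution_at: '1610000000' } },
  query() {
    return {
      update: async ({ id }, values) => Object.assign(fakeStrapi.rows[id], values),
    };
  },
};

changeFrequency(fakeStrapi, 1, '0 15 * * *').then((row) => console.log(row));
// { id: 1, frequency: '0 15 * * *', next_execution_at: null }
```

In a real project you would pass the global `strapi` object instead of the stand-in.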
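The next function fetches all the site generators already inserted in the database, so that they are not inserted again when the script runs.

Add the following function to your ./scripts/scrapers/utils/utils.js file: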

'use strict'

...

const getAllSG = async (scraper) => {
  const existingSG = await strapi.query('site-generator').find({
    _limit: 1000,
    scraper: scraper.id
  }, ["name"]);
  const allSG = existingSG.map(x => x.name);
  console.log(`Site generators in database: \t${chalk.blue(allSG.length)}`);

  return allSG;
}

module.exports = { getAllSG, scraperCanRun }
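
The list of names returned by this function is what lets the scraper skip generators it already knows. A minimal sketch of that filtering step, without Strapi, might look like this; `filterNew` is a hypothetical name and the sample names are made up for the example.

```javascript
// Keep only the scraped names that are not already stored.
// A Set gives constant-time lookups, which helps once the list grows.
const filterNew = (scrapedNames, existingNames) => {
  const known = new Set(existingNames);
  return scrapedNames.filter((name) => name !== null && !known.has(name));
};

const alreadyInDb = ['Gatsby'];
const scraped = ['Next.js', 'Gatsby', 'Jekyll'];

console.log(filterNew(scraped, alreadyInDb)); // [ 'Next.js', 'Jekyll' ]
```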

The next function simply gets the current date for the report and the error log.

Add the following function to your ./scripts/scrapers/utils/utils.js file:

'use strict'

...

const getDate = async () => {
  const today = new Date();
  const date  = today.getFullYear()+'-'+(today.getMonth()+1)+'-'+today.getDate();
  const time  = today.getHours() + ":" + today.getMinutes() + ":" + today.getSeconds();
  return date+' '+time;
}

module.exports = { getDate, getAllSG, scraperCanRun }
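
Note that the function above does not zero-pad its parts, so 9:03:07 on January 5 comes out as `2021-1-5 9:3:7`. If you prefer fixed-width timestamps in your reports, one possible variant is sketched below; `formatDate` and `pad` are hypothetical names, not part of the tutorial's code.

```javascript
// Left-pad a number to two digits.
const pad = (n) => String(n).padStart(2, '0');

// Format a Date as "YYYY-MM-DD HH:mm:ss" in local time.
const formatDate = (d) =>
  `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())} ` +
  `${pad(d.getHours())}:${pad(d.getMinutes())}:${pad(d.getSeconds())}`;

console.log(formatDate(new Date(2021, 0, 5, 9, 3, 7))); // 2021-01-05 09:03:07
```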

The last function prepares the report of the last execution.

Add the following function to your ./scripts/scrapers/utils/utils.js file:

'use strict'

...

const getReport = async (newSG) => {
  return { newSG: newSG, date: await getDate()}
}

module.exports = { getReport, getDate, getAllSG, scraperCanRun }

Alright! Your ./scripts/scrapers/utils/utils.js file should now look like this:

```javascript
'use strict'

const parser = require('cron-parser');
const chalk = require('chalk');

const scraperCanRun = async (scraper) => {
  const frequency = parser.parseExpression(scraper.frequency);
  const current_date = parseInt((new Date().getTime() / 1000));
  let next_execution_at = ""

  if (scraper.next_execution_at){
    next_execution_at = scraper.next_execution_at
  }
  else {
    next_execution_at = (frequency.next().getTime() / 1000);
    await strapi.query('scraper').update({
      id: scraper.id
    }, {
      next_execution_at: next_execution_at
    });
  }

  if (next_execution_at <= current_date){
    await strapi.query('scraper').update({
      id: scraper.id
    }, {
      next_execution_at: (frequency.next().getTime() / 1000)
    });
    return true
  }
  return false
}

const getAllSG = async (scraper) => {
  const existingSG = await strapi.query('site-generator').find({
    _limit: 1000,
    scraper: scraper.id
  }, ["name"]);
  const allSG = existingSG.map(x => x.name);
  console.log(`Site generators in database: \t${chalk.blue(allSG.length)}`);

  return allSG
}

const getDate = async () => {
  const today = new Date();
  const date  = today.getFullYear()+'-'+(today.getMonth()+1)+'-'+today.getDate();
  const time  = today.getHours() + ":" + today.getMinutes() + ":" + today.getSeconds();
  return date+' '+time;
}

const getReport = async (newSG) => {
  return { newSG: newSG, date: await getDate()}
}

module.exports = { getReport, getDate, getAllSG, scraperCanRun }
```

Let's update the ./scripts/scrapers/jamstack.js file a little now. But first, let's add Puppeteer ;)

Adding Puppeteer and Cheerio

Add Puppeteer and Cheerio to your package.json by running the following command:

yarn add puppeteer cheerio

Update your ./scripts/scrapers/jamstack.js file with the following:

'use strict'

const chalk = require('chalk');
const puppeteer = require('puppeteer');
const {
  getReport,
  getDate,
  getAllSG,
  scraperCanRun
} = require('./utils/utils.js')

let report = {}
let errors = []
let newSG = 0

const scrape = async () => {
  console.log("Scrape function");
}

const main = async () => {
  // Fetch the correct scraper thanks to the slug
  const slug = "jamstack-org"
  const scraper = await strapi.query('scraper').findOne({
    slug: slug
  });

  // If the scraper doesn't exist, is disabled or doesn't have a frequency then we do nothing
  if (scraper == null || !scraper.enabled || !scraper.frequency){
    console.log(`${chalk.red("Exit")}: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)`);
    return
  }

  const canRun = await scraperCanRun(scraper);
  if (canRun && scraper.enabled){
    const allSG = await getAllSG(scraper)
    await scrape(allSG, scraper)
    report = await getReport(newSG);
  }
}

exports.main = main;

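Save your file now! You should see the Exit message from the check above in your terminal.

Yes, you have not enabled your scraper, so it will not go any further. To continue, we need to comment out this part of the code, because we want to write the scrape function without having to wait until the scraper is due to run.

Comment out the following part of the code: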

'use strict'

const chalk = require('chalk');
const puppeteer = require('puppeteer');
const {
  getReport,
  getDate,
  getAllSG,
  scraperCanRun
} = require('./utils/utils.js')

let report = {}
let errors = []
let newSG = 0

const scrape = async () => {
  console.log("Scrape function");
}

const main = async () => {
  // Fetch the correct scraper thanks to the slug
  const slug = "jamstack-org"
  const scraper = await strapi.query('scraper').findOne({
    slug: slug
  });

  // If the scraper doesn't exist, is disabled or doesn't have a frequency then we do nothing
  // if (scraper == null || !scraper.enabled || !scraper.frequency){
  //   console.log(`${chalk.red("Exit")}: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)`);
  //   return
  // }

  // const canRun = await scraperCanRun(scraper);
  // if (canRun && scraper.enabled){
    const allSG = await getAllSG(scraper)
    await scrape(allSG, scraper)
  // }
}

exports.main = main;

You can save your file now; it should run fine!

Perfect! Now let's dive into scraping the data!

Scraping the data

Update your ./scripts/scrapers/jamstack.js file with the following:

'use strict'

const chalk = require('chalk');
const cheerio = require('cheerio');
const puppeteer = require('puppeteer');
const {
  getReport,
  getDate,
  getAllSG,
  scraperCanRun
} = require('./utils/utils.js')
const {
  createSiteGenerators,
  updateScraper
} = require('./utils/query.js')

let report = {}
let errors = []
let newSG = 0

const scrape = async (allSG, scraper) => {
  const url = "https://jamstack.org/generators/"
  const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  const page = await browser.newPage();

  try {
    await page.goto(url)
  } catch (e) {
    console.log(`${chalk.red("Error")}: (${url})`);
    errors.push({
      context: "Page navigation",
      url: url,
      date: await getDate()
    })
    // Close the browser before leaving so no Chromium process is left running
    await browser.close();
    return
  }

  try {
    const expression = "//div[@class='generator-card flex flex-col h-full']"
    // Wait until the cards are rendered, then select them
    await page.waitForXPath(expression, { timeout: 3000 })
    const elements = await page.$x(expression);

    // for...of awaits each card in turn, so scrape() resolves only after every insert is done
    for (const element of elements) {
      let card = await page.evaluate(el => el.innerHTML, element);
      let $ = cheerio.load(card)
      const name = $('.text-xl').text().trim() || null;
      // Skip this iteration if the sg is already in db
      if (allSG.includes(name))
        continue;
      const stars = $('span:contains("stars")').parent().text().replace("stars", "").trim() || null;
      const forks = $('span:contains("forks")').parent().text().replace("forks", "").trim() || null;
      const issues = $('span:contains("issues")').parent().text().replace("issues", "").trim() || null;
      const description = $('.text-sm.mb-4').text().trim() || null;
      const language = $('dt:contains("Language:")').next().text().trim() || null;
      const template = $('dt:contains("Templates:")').next().text().trim() || null;
      const license = $('dt:contains("License:")').next().text().trim() || null;
      const deployLink = $('a:contains("Deploy")').attr('href') || null;

      await createSiteGenerators(
        name,
        stars,
        forks,
        issues,
        description,
        language,
        template,
        license,
        deployLink,
        scraper
      )
      newSG += 1;
    }
  } finally {
    // Always release the page and the browser, even if the extraction throws
    await page.close()
    await browser.close();
  }
}

const main = async () => {
  // Fetch the correct scraper thanks to the slug
  const slug = "jamstack-org"
  const scraper = await strapi.query('scraper').findOne({
    slug: slug
  });

  // If the scraper doesn't exist, is disabled or doesn't have a frequency then we do nothing
  // if (scraper == null || !scraper.enabled || !scraper.frequency){
  //   console.log(`${chalk.red("Exit")}: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)`);
  //   return
  // }

  // const canRun = await scraperCanRun(scraper);
  // if (canRun && scraper.enabled){
  const allSG = await getAllSG(scraper)
  await scrape(allSG, scraper)
  report = await getReport(newSG);
  // }
}

exports.main = main;

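Let me explain this change.
First of all, you can see that we are importing functions from a file we have not created yet: ./scripts/scrapers/utils/query.js. These two functions let us create site generators in our database and update the scraper (errors and report). Don't worry, we will create this file shortly.
Then there is the scrape function, which simply scrapes the data we want using puppeteer and cheerio, then inserts it into the database using the functions just described.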
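
One detail worth keeping in mind when a loop does asynchronous work per item, as scrape does for each card: `Array.prototype.forEach` does not wait for async callbacks, whereas `for...of` with `await` does. The sketch below illustrates the difference with a hypothetical `fakeInsert` that stands in for a database insert; none of these names come from the tutorial.

```javascript
// Stand-in for an async database insert (resolves after 10 ms).
const fakeInsert = (name) =>
  new Promise((resolve) => setTimeout(() => resolve(name), 10));

// forEach fires the callbacks and returns immediately,
// so the counter is read before any insert has finished.
const countWithForEach = async (names) => {
  let inserted = 0;
  names.forEach(async (name) => {
    await fakeInsert(name);
    inserted += 1;
  });
  return inserted;
};

// for...of awaits each insert before moving on,
// so the counter is accurate when the function resolves.
const countWithForOf = async (names) => {
  let inserted = 0;
  for (const name of names) {
    await fakeInsert(name);
    inserted += 1;
  }
  return inserted;
};

(async () => {
  const names = ['Next.js', 'Gatsby', 'Jekyll'];
  console.log(await countWithForEach(names)); // 0
  console.log(await countWithForOf(names));   // 3
})();
```

This matters for the report: a counter such as newSG is only reliable if the function that increments it has finished all of its inserts before the report is built.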

Create a ./scripts/scrapers/utils/query.js file containing the following:

'use strict'

const chalk = require('chalk');

const createSiteGenerators = async (name, stars, forks, issues, description, language, template, license, deployLink, scraper) => {

  try {
    const entry = await strapi.query('site-generator').create({
      name: name,
      stars: stars,
      forks: forks,
      issues: issues,
      description: description,
      language: language,
      templates: template,
      license: license,
      deploy_to_netlify_link: deployLink,
      scraper: scraper.id
    })
  } catch (e) {
    console.log(e);
  }
}

const updateScraper = async (scraper, report, errors) => {
  await strapi.query('scraper').update({
    id: scraper.id
  }, {
    report: report,
    error: errors,
  });

  console.log(`Job done for: ${chalk.green(scraper.name)}`);
}

module.exports = {
  createSiteGenerators,
  updateScraper,
}

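As you can see, we update the scraper at the end with this updateScraper function. We can now uncomment the code that checks whether our scraper should run, depending on whether it is enabled and on its frequency.

Let's add it to the ./scripts/scrapers/jamstack.js file: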

'use strict'

const chalk = require('chalk');
const cheerio = require('cheerio');
const puppeteer = require('puppeteer');
const {
  getReport,
  getDate,
  getAllSG,
  scraperCanRun
} = require('./utils/utils.js')
const {
  createSiteGenerators,
  updateScraper
} = require('./utils/query.js')

let report = {}
let errors = []
let newSG = 0

const scrape = async (allSG, scraper) => {
  const url = "https://jamstack.org/generators/"
  const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  const page = await browser.newPage();

  try {
    await page.goto(url)
  } catch (e) {
    console.log(`${chalk.red("Error")}: (${url})`);
    errors.push({
      context: "Page navigation",
      url: url,
      date: await getDate()
    })
    // Close the browser before leaving so no Chromium process is left running
    await browser.close();
    return
  }

  try {
    const expression = "//div[@class='generator-card flex flex-col h-full']"
    // Wait until the cards are rendered, then select them
    await page.waitForXPath(expression, { timeout: 3000 })
    const elements = await page.$x(expression);

    // for...of awaits each card in turn, so scrape() resolves only after every insert is done
    for (const element of elements) {
      let card = await page.evaluate(el => el.innerHTML, element);
      let $ = cheerio.load(card)
      const name = $('.text-xl').text().trim() || null;
      // Skip this iteration if the sg is already in db
      if (allSG.includes(name))
        continue;
      const stars = $('span:contains("stars")').parent().text().replace("stars", "").trim() || null;
      const forks = $('span:contains("forks")').parent().text().replace("forks", "").trim() || null;
      const issues = $('span:contains("issues")').parent().text().replace("issues", "").trim() || null;
      const description = $('.text-sm.mb-4').text().trim() || null;
      const language = $('dt:contains("Language:")').next().text().trim() || null;
      const template = $('dt:contains("Templates:")').next().text().trim() || null;
      const license = $('dt:contains("License:")').next().text().trim() || null;
      const deployLink = $('a:contains("Deploy")').attr('href') || null;

      await createSiteGenerators(
        name,
        stars,
        forks,
        issues,
        description,
        language,
        template,
        license,
        deployLink,
        scraper
      )
      newSG += 1;
    }
  } finally {
    // Always release the page and the browser, even if the extraction throws
    await page.close()
    await browser.close();
  }
}

const main = async () => {
  // Fetch the correct scraper thanks to the slug
  const slug = "jamstack-org"
  const scraper = await strapi.query('scraper').findOne({
    slug: slug
  });

  // If the scraper doesn't exist, is disabled or doesn't have a frequency then we do nothing
  if (scraper == null || !scraper.enabled || !scraper.frequency){
    console.log(`${chalk.red("Exit")}: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)`);
    return
  }

  const canRun = await scraperCanRun(scraper);
  if (canRun && scraper.enabled){
    // Reset the module-level state so each run reports only its own results
    errors = []
    newSG = 0
    const allSG = await getAllSG(scraper)
    await scrape(allSG, scraper)
    report = await getReport(newSG);
    await updateScraper(scraper, report, errors)
  }
}

exports.main = main;

Once again, after saving your file, you should see this message:

```bash
Exit: (Your scraper may not exist, is not activated, or does not have a frequency field filled in)
```

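First of all, you need to set the frequency and clear the next_execution_at field. For this tutorial, we will set a frequency of one minute to see the result quickly. Then you simply need to enable the scraper.

Reset your ./config/functions/bootstrap.js file to its default value: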

'use strict'

/**
 * An asynchronous bootstrap function that runs before
 * your application gets started.
 *
 * This gives you an opportunity to set up your data model,
 * run jobs, or perform some special logic.
 *
 * See more details here: https://strapi.io/documentation/developer-docs/latest/concepts/configurations.html#bootstrap
 */

module.exports = () => {};

Update your ./config/functions/cron.js file with the following code:

'use strict';

/**
 * Cron config that gives you an opportunity
 * to run scheduled jobs.
 *
 * The cron format consists of:
 * [SECOND (optional)] [MINUTE] [HOUR] [DAY OF MONTH] [MONTH OF YEAR] [DAY OF WEEK]
 *
 * See more details here: https://strapi.io/documentation/developer-docs/latest/concepts/configurations.html#cron-tasks
 */

const jamstack = require('../../scripts/scrapers/jamstack.js')

module.exports = {
  '* * * * *': () => {
    jamstack.main()
  }
};
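
Since main is asynchronous and the cron callback above does not await it, a rejected promise would go unhandled. One possible pattern, shown here as a sketch rather than as something Strapi requires, is to wrap the call so that failures are logged. `runJob` is a hypothetical helper name.

```javascript
// Hypothetical wrapper: run an async job, log failures instead of leaving
// the rejection unhandled, and report whether the job succeeded.
const runJob = async (label, job) => {
  try {
    await job();
    return { label, ok: true };
  } catch (e) {
    console.error(`[${label}] failed: ${e.message}`);
    return { label, ok: false, error: e.message };
  }
};

// Usage with two stand-in jobs (a real cron entry would pass jamstack.main).
(async () => {
  console.log(await runJob('ok-job', async () => {}));
  // { label: 'ok-job', ok: true }
  console.log(await runJob('bad-job', async () => { throw new Error('boom'); }));
  // { label: 'bad-job', ok: false, error: 'boom' }
})();
```

In the cron file, the entry could then read `'* * * * *': () => runJob('jamstack', jamstack.main)`.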

Enable cron by updating your ./config/server.js file with the following:

module.exports = ({ env }) => ({
  host: env('HOST', '0.0.0.0'),
  port: env.int('PORT', 1337),
  admin: {
    auth: {
      secret: env('ADMIN_JWT_SECRET', 'ea8735ca1e3318a64b96e79cd093cd2c'),
    },
  },
  cron: {
    enabled: true,
  }
});

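Save your file!

What happens now is that, on the next minute, the scraper will fill in its next_execution_at value according to your frequency, which here means the following minute.
So one minute later, the next_execution_at timestamp will be compared with the current one and will of course be less than or equal to it, so your scraper will run. A new next_execution_at timestamp will then be set for the following minute, according to your frequency.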
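
This timeline can be replayed without Strapi. The sketch below is a simplified model of the scheduling check, not the tutorial's actual function: `tick` and `nextMinute` are hypothetical names, timestamps are plain seconds, and `nextMinute` stands in for what cron-parser would compute for the `* * * * *` frequency.

```javascript
// Stand-in for the "* * * * *" frequency: start of the minute after `now` (in seconds).
const nextMinute = (now) => (Math.floor(now / 60) + 1) * 60;

// Simplified model of one cron tick: returns true when the scraper should run.
const tick = (scraper, now) => {
  if (!scraper.next_execution_at) {
    // First tick: nothing stored yet, so only schedule the next execution.
    scraper.next_execution_at = nextMinute(now);
    return false;
  }
  if (Number(scraper.next_execution_at) <= now) {
    // Due: run now and schedule the following execution.
    scraper.next_execution_at = nextMinute(now);
    return true;
  }
  return false;
};

const scraper = { next_execution_at: null };
const log = [60, 120, 180].map((now) => ({
  now,
  ran: tick(scraper, now),
  next: scraper.next_execution_at,
}));

console.log(log);
// [ { now: 60, ran: false, next: 120 },
//   { now: 120, ran: true, next: 180 },
//   { now: 180, ran: true, next: 240 } ]
```

The first tick only schedules; every later tick finds a stored timestamp that is already due, runs, and schedules the next one.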
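Well, I think that's it for this quick and simple tutorial!

Conclusion

Now let's set the frequency back to something normal.

I hope you enjoyed it!
If you want to keep learning, check out our Academy to become a Strapi expert, or simply browse our blog to discover all the topics you like.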

Reference

For more information on this topic (Web Scraping with Strapi), see the original post: https://dev.to/strapijs/web-scraping-with-strapi-31kh
