Web Scraping with Axios and Cheerio


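Hello everyone, today I'm sharing some information about web scraping. Web scraping is simply the process of extracting content and data from a website. This post is for educational purposes only❗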


Prerequisites
👨‍💻 Node.js
👨‍💻 Knowledge of the browser developer tools
👨‍💻 Knowledge of the Document Object Model (DOM)

Let's get started
🥦 Create a new directory (nodescraping in my case) and initialize a Node.js app with npm init -y
🎯 Result: this creates a package.json file
🥦 Install the dependencies: npm i express axios cheerio

🥦 Install nodemon so the Node app restarts automatically whenever a file changes: npm i nodemon --save-dev

🥦 Edit the scripts section of package.json:
  "start": "node app.js",
  "dev": "nodemon app.js"

🥦 Create a file named app.js and import the packages:
const axios = require('axios');
const cheerio = require('cheerio');
const express = require('express');

const port = process.env.PORT || 4000;

const app = express();
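🥦 I'm using the axios package to fetch the website. I'll be scraping dev.to 😁, but feel free to use any website of your choice. We'll scrape it and export the results to a plain-text CSV file.
🥦 Right-click the page and inspect it to find the selectors (class, id) and the tags (a, li) you want to target.

🎯 This lets us see which classes we want to select.
🥦 We want to target the following: the blog title, link, author, and read time.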


Side note:
Always put a dot (.) in front of the class name you are targeting.
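axios.get('https://dev.to/')
    .then(res => {
        // Load the fetched HTML so it can be queried with CSS selectors
        const $ = cheerio.load(res.data)
        // Visit every element that has the crayons-story class
        $('.crayons-story').each((index, element) => {
            // Read the text of the title element inside this story
            const blogTitle = $(element).find('.crayons-story__title').text()
        })
    }).catch(err => console.error(err))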
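In the logic above, we target the child elements of each element with the crayons-story class.
The .text() method returns the text content of the selected element.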
🥦 I repeated the whole process to select the blog link, author, and read time.
🥦 The final logic is:
const axios = require('axios');
const cheerio = require('cheerio');
const express = require('express');
require('dotenv').config(); // needs "npm i dotenv", which is not part of the install step above
const fs = require('fs');
const writeStream = fs.createWriteStream('devBlog.csv');

const port = process.env.PORT || 4000;

const app = express();

//write headers
writeStream.write(`author, BlogTitle, bloglink, readtime \n`);


axios.get('https://dev.to/')
    .then(res => {
        const $ = cheerio.load(res.data)
        $('.crayons-story').each((index, element) => {

            const author = $(element).find('.profile-preview-card__trigger').text().replace(/\s\s+/g, '')
            const blogTitle = $(element).find('.crayons-story__title').text().replace(/\s\s+/g, '')
            const blogLink = $(element).find('a').attr('href');
            const readTime = $(element).find('.crayons-story__tertiary').text()
            const dev = 'https://dev.to'
            const joinedBlogLink = `${dev}` + `${blogLink}`;
            writeStream.write(`Author: ${author}, \n Blog title is : ${blogTitle} ,\n Blog link: ${joinedBlogLink}, \n Blog read time : ${readTime} \n`);
        });


    }).catch(err => console.error(err))

//Listen to server
app.listen(port, () => {
    console.log(`Server Established and  running on Port ⚡${port}`)
})
View the source code here
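The final script builds the full link by gluing 'https://dev.to' and the href together as strings. That works while every href is a site-relative path. As a hedged alternative (not what the original script does), Node's built-in URL class can resolve the href against the base, which also copes with hrefs that are already absolute. The helper name toAbsolute and the example.com link are made up for illustration.

```javascript
// Base origin used in this post.
const base = 'https://dev.to';

// Hypothetical helper: resolve an href against the base with the built-in URL class.
const toAbsolute = (href) => new URL(href, base).href;

// A relative path is resolved against the base.
console.log(toAbsolute('/devteam/what-was-your-win-this-week-5h25'));
// https://dev.to/devteam/what-was-your-win-this-week-5h25

// An absolute URL is left as it is instead of being prefixed a second time.
console.log(toAbsolute('https://example.com/post'));
// https://example.com/post
```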

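Notes and explanations
  • The fs module is used to write the final result to the devBlog.csv file
  • \n is a newline character
  • .replace(/\s\s+/g, '') removes every run of two or more whitespace characters from the author and title text
  • Axios fetches the markup from the URL
  • Cheerio parses the HTML that Axios fetched. It is a tool for parsing HTML and XML in Node.js
  • Cheerio's load method loads that markup and returns a query function, which is stored in the declared variable $
  • .each loops over the selected elements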
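The whitespace pattern is easy to misread, so here is a small sketch of what it does on sample strings. The sample values are made up to resemble what .text() can return. The collapseRuns variant is my own suggestion, not part of the original script.

```javascript
// Sample text with layout whitespace around it (hypothetical values).
const rawAuthor = '\n      Gracie Gregory (she/her)\n    ';
const rawReadTime = 'Oct 8\n                1 min read';

// The pattern from this post: delete every run of two or more whitespace characters.
const stripRuns = (s) => s.replace(/\s\s+/g, '');

// An alternative (an assumption, not in the original): collapse each run to one space, then trim.
const collapseRuns = (s) => s.replace(/\s+/g, ' ').trim();

console.log(stripRuns(rawAuthor));      // Gracie Gregory (she/her)
console.log(stripRuns(rawReadTime));    // Oct 81 min read
console.log(collapseRuns(rawReadTime)); // Oct 8 1 min read
```

Single spaces survive the original pattern, which is why author names stay readable. Applied to the read time it would glue the date and the duration together, so collapsing to one space may be the better fit for that field.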
🥦 Run the server: npm run dev
🎯 Result:
    author, BlogTitle, bloglink, readtime 
    Author: Gracie Gregory (she/her), 
     Blog title is : What was your win this week? ,
     Blog link: https://dev.to/devteam/what-was-your-win-this-week-5h25, 
     Blog read time :  for Oct 8
                1 min read
    
    Author: Jeremy Friesen, 
     Blog title is : Trick or Treat, I've Joined the DEV Team ,
     Blog link: https://dev.to/jeremyf/trick-or-treat-i-ve-joined-the-dev-team-4283, 
     Blog read time : Oct 8
                5 min read
    
    Author: Michael, 
     Blog title is : How To See Which Branch Your Teammate Is On In Android Studio ,
     Blog link: https://dev.to/gitlive/how-to-see-which-branch-your-teammate-is-on-in-android-studio-2n3i, 
     Blog read time :  for Oct 8
                1 min read
    
    Author: Iain Freestone, 
     Blog title is : 🚀10 Trending projects on GitHub for web developers - 8th October 2021 ,
     Blog link: https://dev.to/iainfreestone/10-trending-projects-on-github-for-web-developers-8th-october-2021-102e, 
     Blog read time : Oct 8
                3 min read
    
    Author: AM, 
     Blog title is : Django Cloud Task Queue ,
     Blog link: https://dev.to/txiocoder/django-cloud-task-queue-27g2, 
     Blog read time : Oct 8
                1 min read
    
    Author: Ankit Anand ✨, 
     Blog title is : AWS X-Ray vs Jaeger - key features, differences and alternatives ,
     Blog link: https://dev.to/signoz/aws-x-ray-vs-jaeger-key-features-differences-and-alternatives-322, 
     Blog read time :  for Oct 8
                6 min read
    
    Author: Raquel Román-Rodriguez, 
     Blog title is : Algo Logging: the Longest Substring of Unique Characters in JavaScript ,
     Blog link: https://dev.to/raquii/algo-logging-the-longest-substring-of-unique-characters-in-javascript-4i3, 
     Blog read time : Oct 8
                3 min read
    
    Author: Shaher Shamroukh, 
     Blog title is : Working With Folders & Files In Ruby ,
     Blog link: https://dev.to/shahershamroukh/working-with-folders-files-in-ruby-2l97, 
     Blog read time : Oct 8
                3 min read
    
    Author: Roberto Ruiz, 
     Blog title is : Untangling Your Logic Using State Machines ,
     Blog link: https://dev.to/robruizr/untangling-your-logic-using-state-machines-2epj, 
     Blog read time : Oct 8
                5 min read
    
    Author: Cubite, 
     Blog title is : How To Manage Open edX® Environment Variables Using Doppler and Automating The Deployment ,
     Blog link: https://dev.to/corpcubite/how-to-manage-open-edx-environment-variables-using-doppler-and-automating-the-deployment-4c5e, 
     Blog read time : Oct 8
                5 min read
    
    Author: OpenReplay Tech Blog, 
     Blog title is : Building an Astro Website with WordPress as a Headless CMS ,
     Blog link: https://dev.to/asayerio_techblog/building-an-astro-website-with-wordpress-as-a-headless-cms-47mo, 
     Blog read time : Oct 8
                9 min read
    
    Author: Anamika, 
     Blog title is : How to setup Appwrite on Ubuntu ,
     Blog link: https://dev.to/noviicee/how-to-setup-appwrite-on-ubuntu-3j67, 
     Blog read time : Oct 8
                4 min read
    
    Author: Bryan Robinson, 
     Blog title is : Building server-rendered search for static sites with 11ty Serverless, Netlify, and Algolia ,
     Blog link: https://dev.to/algolia/building-server-rendered-search-for-static-sites-with-11ty-serverless-netlify-and-algolia-13e2, 
     Blog read time :  for Oct 8
                8 min read
    
    Author: bhupendra, 
     Blog title is : Understanding Redux without React ,
     Blog link: https://dev.to/bhupendra1011/understanding-redux-without-react-223n, 
     Blog read time : Oct 8
                4 min read
    
    Author: Rizel Scarlett, 
     Blog title is : Add Fuzzy Search to Your Web App with this Open Source Tool ,
     Blog link: https://dev.to/github/add-fuzzy-search-to-your-web-app-with-this-open-source-tool-22d7, 
     Blog read time :  for Oct 8
                6 min read
    
    Author: Marcelo Sousa, 
     Blog title is : Ship / Show / Ask With Reviewpad ,
     Blog link: https://dev.to/reviewpad/ship-show-ask-with-reviewpad-47jh, 
     Blog read time :  for Oct 8
                5 min read
    
    Author: Shantanu Jana, 
     Blog title is : Random Gradient Generator using JavaScript & CSS ,
     Blog link: https://dev.to/shantanu_jana/random-gradient-generator-using-javascript-css-529c, 
     Blog read time : Oct 8
                6 min read
    
    Author: Miles Watson, 
     Blog title is : URL Shortener with Rust, Svelte, & AWS (6/): Deploying to AWS ,
     Blog link: https://dev.to/mileswatson/url-shortener-with-rust-svelte-aws-6-deploying-to-aws-2gi0, 
     Blog read time : Oct 8
                4 min read
    
    Author: Jon Deavers, 
     Blog title is : Publishing my first NPM package ,
     Blog link: https://dev.to/lucsedirae/publishing-my-first-npm-package-200g, 
     Blog read time : Oct 8
                3 min read
    
    Author: Anjan Shomooder, 
     Blog title is : CSS positions: Everything you need to know ,
     Blog link: https://dev.to/thatanjan/css-positions-everything-you-need-to-know-2ng4, 
     Blog read time : Oct 8
                4 min read
    
    Author: Alvaro Montoro, 
     Blog title is : Divtober Day 8: Growing ,
     Blog link: https://dev.to/alvaromontoro/divtober-day-8-growing-1182, 
     Blog read time : Oct 8
                1 min read
    
    Author: Jambang J, 
     Blog title is : Deploying an discordjs bot to Qovery ,
     Blog link: https://dev.to/jambang067/deploying-an-discordjs-bot-to-qovery-51e, 
     Blog read time : Oct 8
                7 min read
    
    Author: Sadee, 
     Blog title is : How to create responsive navbar {twitter clone} with HTML CSS ,
     Blog link: https://dev.to/codewithsadee/how-to-create-responsive-navbar-twitter-clone-with-html-css-6fa, 
     Blog read time : Oct 8
                1 min read
    
    Author: Jeremy Grifski, 
     Blog title is : Support The Sample Programs Repo This Hacktoberfest ,
     Blog link: https://dev.to/renegadecoder94/support-the-sample-programs-repo-this-hacktoberfest-42ad, 
     Blog read time : Oct 8
                5 min read
    
    Author: Sebastian Rindom, 
     Blog title is : Making your store more powerful with Contentful ,
     Blog link: https://dev.to/medusajs/making-your-store-more-powerful-with-contentful-3efk, 
     Blog read time :  for Oct 8
                7 min read
    
    Author: Shalvah, 
     Blog title is : A practical tracing journey with OpenTelemetry on Node.js ,
     Blog link: https://dev.to/shalvah/a-practical-tracing-journey-with-opentelemetry-on-node-js-5706, 
     Blog read time : Oct 8
                16 min read
    
    Author: Kingsley Ubah, 
     Blog title is : How to build an Accordion Menu using HTML, CSS and JavaScript ,
     Blog link: https://dev.to/ubahthebuilder/how-to-build-an-accordion-menu-using-html-css-and-javascript-3omb, 
     Blog read time : Oct 7
                6 min read
    
    Author: mike1237, 
     Blog title is : Create Proxmox cloud-init templates for use with Packer ,
     Blog link: https://dev.to/mike1237/create-proxmox-cloud-init-templates-for-use-with-packer-193a, 
     Blog read time : Oct 8
                3 min read
    
    Author: Prosper Yong, 
     Blog title is : Get Paid Writing ,
     Blog link: https://dev.to/yongdev/get-paid-writing-2i2j, 
     Blog read time : Oct 8
                1 min read
    
    Author: Debbie O'Brien, 
     Blog title is : Understanding TypeScript ,
     Blog link: https://dev.to/debs_obrien/understanding-typescript-378g, 
     Blog read time : Oct 8
                5 min read
    
    Author: Matias D, 
     Blog title is : Show me your portfolio ,
     Blog link: https://dev.to/matiasdandrea/show-me-your-portfolio-1l9h, 
     Blog read time : Oct 8
                1 min read
    
    Author: Marcos Henrique, 
     Blog title is : You should use Buildpacks instead Dockerfile and I'll tell you why ,
     Blog link: https://dev.to/wakeupmh/you-should-use-buildpack-instead-dockerfile-and-i-ll-tell-you-why-2n6, 
     Blog read time : Oct 8
                2 min read
    
    Author: Gaurav Gupta, 
     Blog title is : Smart Notes - A Build-in Public Product. BuildLog[1] ,
     Blog link: https://dev.to/gauravgupta/smart-notes-a-build-in-public-product-buildlog-1-kj6, 
     Blog read time : Oct 8
                4 min read
    
    Author: Andrea Giammarchi, 
     Blog title is : About bitwise operations ,
     Blog link: https://dev.to/webreflection/about-bitwise-operations-29mm, 
     Blog read time : Oct 8
                10 min read
    
    Author: AbcSxyZ, 
     Blog title is : Business models of Free and Open Source software ,
     Blog link: https://dev.to/abcsxyz/business-models-of-free-and-open-source-software-2cg8, 
     Blog read time : Oct 8
                4 min read
    
    Author: Saharsh Laud, 
     Blog title is : Face Detection in just 15 lines of Code! (ft. Python and OpenCV) ,
     Blog link: https://dev.to/saharshlaud/face-detection-in-just-15-lines-of-code-ft-python-and-opencv-37ci, 
     Blog read time : Oct 8
                4 min read
    
    Author: Kaustubh Joshi, 
     Blog title is : Hello, I'm HTTP and these are my request methods👋🏻 ,
     Blog link: https://dev.to/elpidaguy/hello-i-m-http-and-these-are-my-request-methods-co, 
     Blog read time : Oct 8
                3 min read
    
    Author: SilvenLEAF, 
     Blog title is : Easiest way to create a ChatBOT from Level 0 ,
     Blog link: https://dev.to/silvenleaf/easiest-way-to-create-a-chatbot-from-level-0-31pf, 
     Blog read time : Oct 8
                6 min read
    
    Author: whykay 👩🏻‍💻🐈🏳️‍🌈 (she/her), 
     Blog title is : 👏 New EuroPython Fellows ,
     Blog link: https://dev.to/europython/new-europython-fellows-2ob2, 
     Blog read time :  for Oct 8
                1 min read
    
    Author: Zaw Zaw Win, 
     Blog title is : How to pass props object from child component to parent ,
     Blog link: https://dev.to/hareom284/how-to-pass-props-object-from-child-component-to-parent-2a8d, 
     Blog read time : Oct 8
                2 min read
    
    Author: Zack DeRose, 
     Blog title is : The "DeRxJSViewModel Pattern": The E=mc^2 of State Management [Part 1] ,
     Blog link: https://dev.to/zackderose/the-derxjsviewmodel-pattern-the-e-mc-2-of-state-management-part-1-3dka, 
     Blog read time : Oct 8
                23 min read
    
    Author: john methew, 
     Blog title is : Serverless Cloud Application Development with AWS Lambda ,
     Blog link: https://dev.to/johnmethew18/serverless-cloud-application-development-with-aws-lambda-3o7l, 
     Blog read time : Oct 8
                1 min read
    
    Author: Antonio-Bennett, 
     Blog title is : Hacktoberfest Week 1 ,
     Blog link: https://dev.to/antoniobennett/hacktoberfest-week-1-4ebc, 
     Blog read time : Oct 8
                2 min read
    
    Author: ZigRazor, 
     Blog title is : Hacktoberfest Beginners and Advanced Repos to Contribute to ,
     Blog link: https://dev.to/zigrazor/hacktoberfest-beginners-and-advanced-repos-to-contribute-to-p1, 
     Blog read time : Oct 8
                1 min read
    
    Author: Rahul kumar, 
     Blog title is : Added option to share the blog on any social media | @dsabyte.com ,
     Blog link: https://dev.to/ats1999/added-option-to-share-the-blog-on-any-social-media-dsabyte-com-57oo, 
     Blog read time : Oct 8
                2 min read
    
    Author: Kavindu Santhusa, 
     Blog title is : Top 10 trending github repos of the week💜. ,
     Blog link: https://dev.to/ksengine/top-10-trending-github-repos-of-the-week-k7, 
     Blog read time : Oct 8
                1 min read
    
    Author: Andre Willomitzer, 
     Blog title is : OpenAQ - My first open source PR :) ,
     Blog link: https://dev.to/andrewillomitzer/openaq-my-first-open-source-pr-3k32, 
     Blog read time : Oct 8
                2 min read
    
    Author: Kinanee Samson, 
     Blog title is : Observables Or Promises ,
     Blog link: https://dev.to/kalashin1/observables-or-promises-29a8, 
     Blog read time : Oct 8
                9 min read
    
    Author: Amador Criado, 
     Blog title is : How to enable versioning in Amazon S3 ,
     Blog link: https://dev.to/aws-builders/how-to-enable-versioning-in-amazon-s3-17m8, 
     Blog read time :  for Oct 8
                2 min read
    
    Author: Bartosz Zagrodzki, 
     Blog title is : React Context - jak efektywnie go używać? ,
     Blog link: https://dev.to/bartek532/react-context-jak-efektywnie-go-uzywac-41l, 
     Blog read time : Oct 8
                8 min read
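The file has a comma-separated header, but each record is written as labelled lines, so a spreadsheet will not read it as a table. If you want one row per post, a minimal sketch of CSV quoting could look like this. The helper names csvField and csvRow are hypothetical, and the quoting follows the usual RFC 4180 convention: wrap a field in double quotes when it contains a comma, a quote, or a line break, and double any embedded quotes.

```javascript
// Hypothetical helper: quote one field only when it needs quoting.
function csvField(value) {
  const text = String(value ?? '');
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Hypothetical helper: join quoted fields into one CSV line.
function csvRow(fields) {
  return fields.map(csvField).join(',') + '\n';
}

const header = csvRow(['author', 'blogTitle', 'blogLink', 'readTime']);

// Sample values taken from the output above; the title contains commas.
const row = csvRow([
  'Bryan Robinson',
  'Building server-rendered search for static sites with 11ty Serverless, Netlify, and Algolia',
  'https://dev.to/algolia/building-server-rendered-search-for-static-sites-with-11ty-serverless-netlify-and-algolia-13e2',
  '8 min read',
]);

process.stdout.write(header + row);
```

In the scraper you could pass the same strings to writeStream.write in place of the labelled template string.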
    
    

Summary
This is a quick guide to scraping a website. Other packages, such as Puppeteer, fetch, and request, can be used to do a similar job.

References
Cheerio Docs
Thanks for reading