🌥 BE TIL Day 9 0324


⬇️ Main Note
https://docs.google.com/document/d/1IZ5yYEtX92E7k2ijoAZZB3W_nBG9MpGPX6OKk_POxLQ/edit

🐚 Scraping vs. Crawling


🧃 Scraping

  • Scraping another site's data literally only once.
  • Tool: Cheerio.

🍷 Crawling

  • Constantly fetching data from another website.
  • Tool: Puppeteer.
  • How scraping/crawling works:


    Inspect/developer tools shortcut: command + option + i. Tags such as <em> appear in the Elements panel.
    --> Bringing in that data is scraping; whatever is done with the data afterward is up to the developer.

    XML

  • Before learning scraping, the format that preceded JSON should be understood: XML.
  • XML: Extensible Markup Language
    --> uses </> tags, like a hypertext markup language
    --> examples of XML tags: <Writer/> , <School/> , etc.
  • Before JSON, the <Name>JB</Name> format was used.
    --> Inefficient (every value must be wrapped in an opening and a closing tag)
  • When a page is requested over HTTP, the HTML comes back as a plain string.
    --> Able to fetch it in Postman.
  • GET https://naver.com : returns the page's HTML elements (as a string).
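To make the difference concrete, here is the same record in both formats (the record itself is made up):

```javascript
// Hypothetical record: the same data as XML and as JSON.
// XML wraps every value in an opening AND a closing tag:
const xml = "<Writer><Name>JB</Name></Writer>";

// JSON writes each key once, so it is leaner and parses natively in JS:
const json = '{"writer": {"name": "JB"}}';

console.log(JSON.parse(json).writer.name); // JB
```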
🐚 Scraping


    Cheerio
    Cheerio parses an HTML string so its tags can be selected and read. [tool]

  • When we send a link to some other service, for example Discord, a preview image and title pop up in the link box.

  • When a site is created, meta tags with og properties are added inside the head tag. Discord's developers read these tags
    --> to build the link preview.

  • og was created by Facebook, which was the first to want link previews. og stands for Open Graph.

  • If I'm creating my own site at mysite.com , the meta tags should go into the head tag first: <meta property="og:title" /> , <meta property="og:image" />
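As a rough sketch (made-up markup, and a regex instead of a real HTML parser), pulling those og values back out of a page's HTML string looks like this:

```javascript
// Hypothetical head markup for mysite.com:
const html = `
  <head>
    <meta property="og:title" content="My Site" />
    <meta property="og:image" content="https://mysite.com/preview.png" />
  </head>`;

// Collect every og:* property/content pair into one object:
const ogTags = {};
for (const [, key, value] of html.matchAll(/<meta property="og:(\w+)" content="([^"]+)"/g)) {
    ogTags[key] = value;
}
console.log(ogTags); // { title: 'My Site', image: 'https://mysite.com/preview.png' }
```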
  • Process

  • The user uploads a post that contains --> title: "Hi there, this is my title", contents: "The weather's nice today. I want you guys to visit this site: aaa.com"
    --> here, the goal is to show a link preview (the title and image of the linked site) to users.
  • To achieve this goal, the title and contents should be sent to the backend via API.
    --> POST '/boards' => sent as JSON
  • Here, backend developers pick out the link that starts with http from the contents. THIS is where scraping is needed. (axios.get)
    --> And that result is put into another variable.
  • Then find the og meta tags, the same ones visible in the developer tools Elements panel.
    --> After picking out the needed data, the title, contents, and og values are saved to the database.
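The link-extraction step above can be sketched as a small pure function (the function name is made up):

```javascript
// Split contents on spaces and keep the first word that starts with "http".
function extractFirstLink(contents) {
    return contents.split(" ").find((word) => word.startsWith("http")) ?? null;
}

console.log(extractFirstLink("Visit this site: https://naver.com today")); // https://naver.com
```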
  • Practice

    import axios from "axios"
    import cheerio from "cheerio"
    
    async function createBoardAPI(mydata){      // mydata <== data passed in from the frontend
    
        const targetUrl = mydata.contents.split(" ").filter((el) => el.startsWith("http"))[0]
        // Splitting on spaces slices contents into an array of single words
        // => filter keeps only the words that start with "http"
        // => so the array ends up holding just the ones starting with http
        // => index [0] pulls out the pure address itself
    
        const aaa = await axios.get(targetUrl)
        const $ = cheerio.load(aaa.data)
        $("meta").each((_, el) => {    // only the meta tags are selected => .each works like a for loop (run over every meta tag)
            // _ : index of the meta tag // el = element => e.g. for the 3rd one, the 3rd meta tag's content
            // we only need the meta tags that contain og:
            
            // $ is what selects and controls a specific tag
    
            if ($(el).attr('property')){       // without this check every meta tag would be processed, which is inefficient, so the if gate is used
                const key = $(el).attr('property').split(":")[1]    // find the tags whose property attribute contains og:
                // ==> split(":") --> divides on ":" into ['og', 'title'], so "title" sits at index 1
                // title --> key, "네이버" --> value
                const value = $(el).attr('content') // e.g. the word "네이버" comes out
                console.log(key, value)
            }
    
        })     
    }
    
    const frontendData = {      // what the frontend registers when a post is uploaded:
        title: "Hi there, this is my title 😚 ",
        contents: "The weather's nice today. I want you guys to visit this site: https://naver.com ~"
    }
    createBoardAPI(frontendData)
  • onclick is an attribute (a property). The property in <meta property="og:title" /> is also an attribute.
  • When scraping happens constantly, it becomes crawling.

🐚 Crawling


    When I want to do something after opening a browser, Puppeteer is used. [tool]
    // Case where crawling 여기어때 was ruled unlawful: https://biz.chosun.com/topics/law_firm/2021/09/29/OOBWHWT5ZBF7DESIRKNPYIODLA/
    // Indiscriminate crawling requests inflate the target's traffic, so it needs more memory => and eventually more machines
    
    import puppeteer from 'puppeteer'
    
    async function startCrawling(){  // every step (launching the browser, opening a tab) must be awaited
        const browser = await puppeteer.launch({headless: false}) // a browser window appears
        const page = await browser.newPage()    // open a new tab
        await page.setViewport({width: 1280, height: 720}) // the page size can also be set
        await page.goto("https://www.goodchoice.kr/product/search/2")   // navigates in a Chromium browser  // Chrome is built on top of Chromium (they are not the same thing)
        await page.waitForTimeout(1000) // leave a pause between requests after connecting
    
    
        const star = await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.name > div > span", (el) => el.textContent)
                                 // $eval selects one element, $$eval selects several       // '>' => a direct child tag                   // the span is a child of the div
        // only the number in nth-child() differs => another hotel's rating: #poduct_list_area > li:nth-child(3) > a > div > div.name > div > span //=> so a for loop can fetch all the data
        await page.waitForTimeout(1000)
    
        const location = (await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.name > p:nth-child(4)", (el)=> el.textContent)).trim()
        await page.waitForTimeout(1000)
    
        const price = await page.$eval("#poduct_list_area > li:nth-child(2) > a > div > div.price > p > b", (el) => el.textContent)
        await page.waitForTimeout(1000)
    
        console.log("⭐️ star:", star)
        console.log("📍 location:", location)
        console.log("💳 Price:", price)
    
        await browser.close()    // close the browser when crawling is done
    }
    
    startCrawling()
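The comments above note that only the nth-child() index changes per hotel; one way to loop over all of them (a sketch, with the selector path assumed from the page) is to build the selector from the index:

```javascript
// Build the rating selector for the n-th list item (selector path assumed).
function starSelector(n) {
    return `#poduct_list_area > li:nth-child(${n}) > a > div > div.name > div > span`;
}

// Inside startCrawling, e.g.:
// for (let n = 2; n <= 10; n++) {
//     const star = await page.$eval(starSelector(n), (el) => el.textContent)
// }
// Alternatively, page.$$eval with the selector minus :nth-child() returns every match in one call.
console.log(starSelector(3));
```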

    iframe

  • An iframe is a separate page embedded inside the browser (a different document within the site).
    --> An iframe is a totally different page.
    --> The outer shell and the inside are different documents.
  • Even if the developer grabs the data with Copy selector, a selector copied from inside the iframe doesn't work against the outer site's document.
  • EX) If I copy the selector of a $30 product inside the market's iframe, I'm trying to read data inside the iframe of the market site.
    --> The site being accessed is naver, but the data being pulled out lives in the iframe.
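In Puppeteer terms (a sketch; the URL fragment and selector are made up), reading data from inside an iframe means querying the frame object rather than the page:

```javascript
// page.frames() lists the main frame plus every embedded iframe.
// Selectors copied from inside an iframe must be evaluated against that frame.
async function readIframePrice(page) {
    const frame = page.frames().find((f) => f.url().includes("shopping")); // assumed URL fragment
    if (!frame) throw new Error("iframe not found");
    return (await frame.$eval(".price", (el) => el.textContent)).trim();   // assumed selector
}
```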