Using goquery for web scraping in Go


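For a task at work I needed to scrape data from two websites (I won't say which ones, haha, for fear of having my IP noticed). The data on both sites is updated regularly. All of our backend services are written in Go, so I had no intention of using Python and wanted to meet this requirement in Go as well. Searching GitHub, I found that many people use a scraping package called goquery. With more than 5,000 stars and a BSD open-source license, I adopted it without hesitation.
First, run go get github.com/PuerkitoBio/goquery
Next, I analyzed the DOM layout of the two sites and found that they are easy to scrape.
Before that, here is a quick overview of how goquery is generally used.
Document represents an HTML document: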

type Document struct {
    *Selection
    Url      *url.URL
    rootNode *html.Node
}

Document embeds the Selection type, so every Selection method can be called directly on a Document. A Selection corresponds to a set of DOM nodes:

type Selection struct {
    Nodes    []*html.Node
    document *Document
    prevSel  *Selection
}

Initializing from a URL:

func NewDocument(url string) (*Document, error) {
    // Load the URL
    res, e := http.Get(url)
    if e != nil {
        return nil, e
    }
    return NewDocumentFromResponse(res)
}

The methods provided by the Selection type are the most important, core part of page parsing.
1) Array-like positional operations

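Eq(index int) *Selection // get the node at the given index, as a Selection

First() *Selection // get the first node of the Selection

Last() *Selection // get the last node of the Selection

Next() *Selection // get the next sibling of each node

NextAll() *Selection // get all following siblings of each node

Prev() *Selection // get the previous sibling of each node

Get(index int) *html.Node // get the node at the given index

Index() int // return the position of the first selected element among its siblings

Slice(start, end int) *Selection // get the nodes from start up to (but not including) end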
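2) Looping over the selected nodes

Each(f func(int, *Selection)) *Selection // iterate over every node

EachWithBreak(f func(int, *Selection) bool) *Selection // iterate, stopping as soon as f returns false

Map(f func(int, *Selection) string) (result []string) // return a slice of strings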
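3) Checking or getting node attribute values

Attr(), RemoveAttr(), SetAttr() // get, remove, and set an attribute value

AddClass(), HasClass(), RemoveClass(), ToggleClass() // add, test for, remove, and toggle a class

Html() // get the HTML of this node

Length() // return the number of elements in this Selection

Text() // get the text of this node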
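4) Moving around the document tree (the commonly used ways of locating nodes)

Children() // return the child elements of each node in the Selection

Contents() // get all child nodes of the current nodes, including text and comment nodes

Find() // search the descendants of the current nodes for elements matching a selector

Next() // the next element

Prev() // the previous element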

That covers the overview, so let's put it to use.
The first website:

doc, err := goquery.NewDocument(url)
if err != nil {
    logger.Error("get document error:", err)
    return
}
doc.Find(".first").EachWithBreak(func(i int, s *goquery.Selection) bool {
    //d := s.Eq(0).Find("td") // the td cells of the first tr
    //fmt.Println(s.Children().Text())

    // Only the first six children are needed, so use EachWithBreak to stop early.
    s.Children().EachWithBreak(func(j int, selection *goquery.Selection) bool {
        //fmt.Println(selection.Text())

        // Extract the number from the cell text.
        str := selection.Text()
        currencyIndexList = append(currencyIndexList, util.FindNumbers(str))
        if j == 5 {
            return false
        }
        return true
    })

    // Stop after the first matching element.
    if i == 0 {
        return false
    }
    return true
})
return

The second website:

doc, err := goquery.NewDocument(url + "?name=" + value)
if err != nil {
    logger.Error("get document error:", err)
    return
}
var priceList []string
doc.Find(".SPD_main").Children().EachWithBreak(func(i int, s *goquery.Selection) bool {
    s.Children().EachWithBreak(func(j int, selection *goquery.Selection) bool {
        // Take the second child of this node.
        tr := selection.Children().First().Next()

        // Collect the text of every child of that node.
        tr.Children().Each(func(k int, trselection *goquery.Selection) {
            str := trselection.Text()
            priceList = append(priceList, str)
        })

        // Stop after the first inner element.
        if j == 0 {
            return false
        }
        return true
    })

    // Stop after the first outer element.
    if i == 0 {
        return false
    }
    return true
})
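One thing to watch in the request above: url + "?name=" + value sends value unescaped, which breaks if it contains spaces, "&", or non-ASCII characters. A possible way around this is to build the query with the standard net/url package, as sketched below. The buildQueryURL helper, example.com, and the sample value are placeholders, not the real site or data.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL sets name=value on base's query string with proper
// escaping and returns the resulting URL.
func buildQueryURL(base, value string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("name", value)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Placeholder host and value for illustration.
	full, err := buildQueryURL("https://example.com/prices", "gold & silver")
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	fmt.Println(full) // https://example.com/prices?name=gold+%26+silver
}
```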
