Extracting data with Jsoup


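Jsoup is an HTML parser for Java that provides a very convenient API for extracting and manipulating data in HTML documents. You can locate nodes and read their contents by combining DOM traversal, CSS selectors, and jQuery-like methods.
Like jQuery, it offers powerful select and pipeline (method chaining) APIs.
The example below extracts rental listings from 58.com (58同城) to show how it is used.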

package test

import java.util.HashMap

import org.jsoup.Jsoup
import org.jsoup.nodes.Document

/**
 * Author: fuliang
 * http://fuliang.iteye.com
 */
// One rental listing taken from a single row of the results table.
class HouseEntry(var title: String, var link: String, var price: Int,
                 var houseType: String, var date: String) {
  override def toString: String =
    "title: %s\tlink:%s\tprice:%d\thouseType:%s\tdate:%s\n".format(title, link, price, houseType, date)
}

class HouseRentCrawler {
  def crawl(url: String, keyword: String, lowRange: Int, highRange: Int): List[HouseEntry] = {
    val doc = fetch(url, keyword, lowRange, highRange)
    extract(doc)
  }

  // Sends the search as a GET request and parses the response into a Document.
  private def fetch(url: String, keyword: String, lowRange: Int, highRange: Int): Document = {
    val params = new HashMap[String, String]()
    params.put("final", "1")
    params.put("jump", "2")
    params.put("searchtype", "3")
    params.put("key", keyword)
    params.put("MinPrice", s"${lowRange}_$highRange")
    Jsoup.connect(url).data(params)
      .userAgent("Mozilla")
      .timeout(10000) // milliseconds
      .get()
  }

  // Reads one HouseEntry per table row, skipping rows with class "dev".
  private def extract(doc: Document): List[HouseEntry] = {
    val elements = doc.select("#infolist > tr:not(.dev)")
    var houseEntries = List[HouseEntry]()
    for (i <- 0 until elements.size()) {
      val entry = elements.get(i)
      val fields = entry.select("td")
      val title = fields.get(0).text()
      val link = fields.get(0).select("a[class=t]").attr("href")
      val price = fields.get(1).text().toInt
      val houseType = fields.get(2).text()
      val date = fields.get(3).text()
      val houseEntry = new HouseEntry(title, link, price, houseType, date)
      // Prepending means the resulting list is in reverse page order.
      houseEntries ::= houseEntry
    }
    houseEntries
  }
}

object HouseRentCrawler {
  def main(args: Array[String]): Unit = {
    val url = "http://bj.58.com/zufang"
    val crawler = new HouseRentCrawler()
    val houseEntries = crawler.crawl(url, " ", 2000, 3500)
    for (entry <- houseEntries) {
      println(entry)
    }
  }
}
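
The crawler above depends on the live site, whose markup may have changed since this was written. To try the extraction step offline, here is a minimal sketch that runs the same selectors against an in-memory string. The markup, the listing titles, and the names ExtractSketch, sampleHtml, and extract are made up for illustration and are not real 58.com output; it assumes only that jsoup is on the classpath.

```scala
import org.jsoup.Jsoup

object ExtractSketch {
  // Hypothetical stand-in for the listing table (not real 58.com markup).
  val sampleHtml: String =
    """<table><tbody id="infolist">
      |<tr><td><a class="t" href="/room/1">Sunny studio</a></td><td>2500</td><td>1 bedroom</td><td>05-01</td></tr>
      |<tr class="dev"><td>sponsored</td></tr>
      |<tr><td><a class="t" href="/room/2">Quiet flat</a></td><td>3200</td><td>2 bedrooms</td><td>05-02</td></tr>
      |</tbody></table>""".stripMargin

  // Returns (title, link, price) triples in document order.
  def extract(html: String): List[(String, String, Int)] = {
    // Same row selector as the crawler: direct tr children of #infolist, minus .dev rows.
    val rows = Jsoup.parse(html).select("#infolist > tr:not(.dev)")
    (0 until rows.size()).toList.map { i =>
      val fields = rows.get(i).select("td")
      val title = fields.get(0).text()
      val link = fields.get(0).select("a[class=t]").attr("href")
      val price = fields.get(1).text().toInt
      (title, link, price)
    }
  }

  def main(args: Array[String]): Unit =
    extract(sampleHtml).foreach(println)
}
```

Unlike the crawler, this version builds the list with map, so the entries come back in page order rather than reversed.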
Selector overview
    * tagname: find elements by tag, e.g. a
    * ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
    * #id: find elements by ID, e.g. #logo
    * .class: find elements by class name, e.g. .masthead
    * [attribute]: elements with attribute, e.g. [href]
    * [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
    * [attr=value]: elements with attribute value, e.g. [width=500]
    * [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
    * [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
    * *: all elements, e.g. *
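
As a quick way to experiment with the basic selectors listed above, the following sketch counts matches against a small hand-written fragment. The fragment and the names BasicSelectors and count are illustrative assumptions, not part of jsoup; only Jsoup.parse, select, and size are library calls.

```scala
import org.jsoup.Jsoup

object BasicSelectors {
  // Hand-written fragment that exercises the attribute selectors.
  val html: String =
    """<div id="logo" class="masthead" data-role="banner">
      |  <a href="/path/home">Home</a>
      |  <a name="top">Top</a>
      |  <img src="photo.PNG" width="500">
      |  <img src="notes.txt">
      |</div>""".stripMargin

  // Number of elements in the fragment matched by a CSS query.
  def count(css: String): Int = Jsoup.parse(html).select(css).size()

  def main(args: Array[String]): Unit = {
    val queries = List(
      "a", "#logo", ".masthead", "[href]", "[^data-]",
      "[width=500]", "[href*=/path/]",
      "img[src~=(?i)\\.(png|jpe?g)]" // (?i) makes the regex case-insensitive
    )
    for (q <- queries) println(q + " -> " + count(q))
  }
}
```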
Selector combinations
    * el#id: elements with ID, e.g. div#logo
    * el.class: elements with class, e.g. div.masthead
    * el[attr]: elements with attribute, e.g. a[href]
    * Any combination, e.g. a[href].highlight
    * ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
    * parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
    * siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
    * siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
    * el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
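
The difference between the descendant, child, and sibling combinators is easiest to see side by side. This is a small sketch on an invented fragment; the names Combinators and texts are hypothetical helpers, and jsoup is assumed to be on the classpath.

```scala
import org.jsoup.Jsoup

object Combinators {
  // One paragraph is nested a level deeper so that "ancestor child" and
  // "parent > child" give different results.
  val html: String =
    """<div class="content">
      |  <h1>Title</h1>
      |  <p>first</p>
      |  <div class="note"><p>nested</p></div>
      |  <p>last</p>
      |</div>
      |<div class="head">head</div>
      |<div class="logo">logo</div>""".stripMargin

  // Text of every element matched by the query, in document order.
  def texts(css: String): List[String] = {
    val els = Jsoup.parse(html).select(css)
    (0 until els.size()).toList.map(i => els.get(i).text())
  }

  def main(args: Array[String]): Unit = {
    println(texts("div.content p"))      // any depth below div.content
    println(texts("div.content > p"))    // direct children only
    println(texts("h1 + p"))             // p immediately after h1
    println(texts("h1 ~ p"))             // any later sibling p of h1
    println(texts("div.head + div"))
    println(texts("div.note, div.logo")) // grouped selectors
  }
}
```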
Pseudo selectors
    * :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
    * :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
    * :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
    * :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
    * :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
    * :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
    * :containsOwn(text): find elements that directly contain the given text
    * :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
    * :matchesOwn(regex): find elements whose own text matches the specified regular expression
    * Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
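
The 0-based indexing and the difference between :contains and :containsOwn can be checked with a short sketch like the one below. The markup and the names PseudoSelectors and texts are assumptions made for illustration; jsoup is assumed to be on the classpath.

```scala
import org.jsoup.Jsoup

object PseudoSelectors {
  val html: String =
    """<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>
      |<div class="logo"><p>Powered by JSOUP</p></div>
      |<div><span>Login here</span></div>""".stripMargin

  // Text of every element matched by the query, in document order.
  def texts(css: String): List[String] = {
    val els = Jsoup.parse(html).select(css)
    (0 until els.size()).toList.map(i => els.get(i).text())
  }

  def main(args: Array[String]): Unit = {
    println(texts("td:lt(2)"))                 // sibling indexes 0 and 1
    println(texts("td:eq(1)"))                 // the second cell
    println(texts("td:gt(2)"))                 // indexes above 2
    println(texts("div:has(p)"))
    println(texts("div:not(.logo)"))
    println(texts("p:contains(jsoup)"))        // case-insensitive match
    println(texts("div:containsOwn(login)"))   // the text sits in the span, not the div
    println(texts("span:containsOwn(login)"))
    println(texts("div:matches((?i)login)"))
  }
}
```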
For more information, see http://jsoup.org/