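This post uses Jsoup to extract data.
Jsoup is an HTML parser for Java that provides a very convenient API for extracting data from HTML documents and manipulating them. It lets you locate nodes and read their information by combining DOM traversal, CSS selectors, and jQuery-like methods.
Like jQuery, it offers a powerful select API whose calls can be chained in a pipeline style.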
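As a rough illustration of that chaining style, the sketch below parses a small HTML string and narrows the selection in two steps. The HTML fragment and the object name SelectChainSketch are made up for this example; only Jsoup.parse, select, first, attr and text come from Jsoup.

```scala
import org.jsoup.Jsoup

object SelectChainSketch {
  // Hypothetical HTML fragment, invented for this illustration.
  val html: String =
    """<div id="menu">
      |  <a href="/rent">Rent</a>
      |  <a href="/buy">Buy</a>
      |</div>""".stripMargin

  // Parse the string into a DOM, then narrow the selection step by step.
  val links = Jsoup.parse(html).select("#menu").select("a[href]")

  def main(args: Array[String]): Unit = {
    println(links.size())               // number of matched <a> elements
    println(links.first().attr("href")) // href of the first match
    println(links.first().text())       // text of the first match
  }
}
```

Each select call returns a new set of elements, so further queries can be chained onto the result much as in jQuery.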
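An example that extracts rental listings from 58同城 (58.com) illustrates how to use it; the selector syntax it relies on is summarized first.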
Selector overview
* tagname: find elements by tag, e.g. a
* ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
* #id: find elements by ID, e.g. #logo
* .class: find elements by class name, e.g. .masthead
* [attribute]: elements with attribute, e.g. [href]
* [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
* [attr=value]: elements with attribute value, e.g. [width=500]
* [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
* [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
* *: all elements, e.g. *
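The basic selectors above can be tried out on a small document. This is only a sketch: the markup, the object name BasicSelectorSketch and the helper count are invented for this illustration, and the counts depend entirely on that markup.

```scala
import org.jsoup.Jsoup

object BasicSelectorSketch {
  // Hypothetical markup, invented for this illustration.
  val html: String =
    """<div id="logo" class="masthead"><img src="logo.PNG" width="500"></div>
      |<a href="/path/list" data-id="7">list</a>
      |<a name="top">top</a>""".stripMargin

  private val doc = Jsoup.parse(html)

  // Number of elements matched by a CSS query (hypothetical helper).
  def count(css: String): Int = doc.select(css).size()

  def main(args: Array[String]): Unit = {
    val queries = List("a", "#logo", ".masthead", "[href]", "[^data-]",
      "[width=500]", "[href*=/path/]", "img[src~=(?i)\\.(png|jpe?g)]")
    // Print each query next to the number of elements it matches.
    for (q <- queries) println(s"$q -> ${count(q)}")
  }
}
```

Note that the regular expression selector matches logo.PNG only because of the (?i) flag, which makes the pattern case-insensitive.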
Selector combinations
* el#id: elements with ID, e.g. div#logo
* el.class: elements with class, e.g. div.masthead
* el[attr]: elements with attribute, e.g. a[href]
* Any combination, e.g. a[href].highlight
* ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
* parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
* siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
* siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
* el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
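The difference between the descendant, child and sibling combinators is easiest to see side by side. The sketch below assumes a small hand-written document; the markup and the object name CombinatorSketch are hypothetical.

```scala
import org.jsoup.Jsoup

object CombinatorSketch {
  // Hypothetical markup, invented for this illustration.
  val html: String =
    """<div class="head">Head</div>
      |<div class="content">
      |  <p>one</p>
      |  <div class="inner"><p>two</p></div>
      |</div>
      |<h1>Title</h1>
      |<p>three</p>""".stripMargin

  val doc = Jsoup.parse(html)

  def main(args: Array[String]): Unit = {
    // Descendant: every p anywhere under div.content.
    println(doc.select("div.content p").size())
    // Child: only the p directly under div.content.
    println(doc.select("div.content > p").text())
    // Adjacent sibling: the div right after div.head.
    println(doc.select("div.head + div").attr("class"))
    // General sibling: p elements that follow h1 under the same parent.
    println(doc.select("h1 ~ p").text())
    // Grouping: elements matching either selector.
    println(doc.select("div.head, h1").size())
  }
}
```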
Pseudo selectors
* :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
* :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
* :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
* :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
* :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
* :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
* :containsOwn(text): find elements that directly contain the given text
* :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
* :matchesOwn(regex): find elements whose own text matches the specified regular expression
* Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc.
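A short sketch of the pseudo selectors, including the 0-based indexing mentioned in the note. The markup and the object name PseudoSelectorSketch are assumptions made for this example.

```scala
import org.jsoup.Jsoup

object PseudoSelectorSketch {
  // Hypothetical markup, invented for this illustration.
  val html: String =
    """<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>
      |<div class="logo"><p>Login with jsoup</p></div>
      |<div><span>other</span></div>""".stripMargin

  val doc = Jsoup.parse(html)

  def main(args: Array[String]): Unit = {
    // Indexed selectors count from 0: lt(2) keeps the cells at index 0 and 1.
    println(doc.select("td:lt(2)").size())
    println(doc.select("td:eq(0)").text())
    println(doc.select("td:gt(2)").text())
    // Structural filters.
    println(doc.select("div:has(p)").size())
    println(doc.select("div:not(.logo)").size())
    // Text filters: :contains looks at all descendant text and ignores case,
    // while :containsOwn only looks at text directly inside the element.
    println(doc.select("p:contains(JSOUP)").size())
    println(doc.select("div:containsOwn(jsoup)").size())
    println(doc.select("p:matches((?i)login)").size())
  }
}
```

Here the div with class logo contains the text only through its p child, so :contains would match that div but :containsOwn does not.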
For more information, see http://jsoup.org/
The following Scala program extracts rental listings from 58同城 (58.com):
package test
import org.jsoup.nodes.Document
import java.util.HashMap
import org.jsoup.Jsoup
/**
* Author: fuliang
* http://fuliang.iteye.com
*/
case class HouseEntry(title: String, link: String, price: Int, houseType: String, date: String) {
  // Tab-separated fields, terminated by a newline.
  override def toString: String =
    s"title: $title\tlink:$link\tprice:$price\thouseType:$houseType\tdate:$date\n"
}
class HouseRentCrawler {
  // Fetch the search-result page and turn its rows into HouseEntry objects.
  def crawl(url: String, keyword: String, lowRange: Int, highRange: Int): List[HouseEntry] = {
    val doc = fetch(url, keyword, lowRange, highRange)
    extract(doc)
  }

  // Send the search form parameters and parse the response into a Document.
  private def fetch(url: String, keyword: String, lowRange: Int, highRange: Int): Document = {
    val params = new HashMap[String, String]()
    params.put("final", "1")
    params.put("jump", "2")
    params.put("searchtype", "3")
    params.put("key", keyword)
    // The price range is passed as "low_high".
    params.put("MinPrice", s"${lowRange}_$highRange")
    Jsoup.connect(url)
      .data(params)
      .userAgent("Mozilla")
      .timeout(10000) // milliseconds
      .get()
  }
  // Each row under #infolist that does not carry the class "dev" is one listing.
  private def extract(doc: Document): List[HouseEntry] = {
    val elements = doc.select("#infolist > tr:not(.dev)")
    // Map over the row indices so the result keeps the order of the page.
    (0 until elements.size()).map { i =>
      val fields = elements.get(i).select("td")
      val title = fields.get(0).text()
      val link = fields.get(0).select("a[class=t]").attr("href")
      // Assumes the price cell holds only digits; toInt throws otherwise.
      val price = fields.get(1).text().toInt
      val houseType = fields.get(2).text()
      val date = fields.get(3).text()
      new HouseEntry(title, link, price, houseType, date)
    }.toList
  }
}
object HouseRentCrawler {
  def main(args: Array[String]): Unit = {
    val url = "http://bj.58.com/zufang"
    val crawler = new HouseRentCrawler()
    // Blank keyword, price range 2000 to 3500.
    val houseEntries = crawler.crawl(url, " ", 2000, 3500)
    for (entry <- houseEntries) {
      println(entry)
    }
  }
}
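Because the crawler depends on a live page, the extraction step can also be tried against a hand-written table. The sketch below is not the author's code: the markup only imitates the layout that the selectors above expect, and the object name ExtractSketch and the method rows are hypothetical. The id is placed on the tbody element here, since the HTML parser wraps table rows in a tbody and "#infolist > tr" only matches direct children.

```scala
import org.jsoup.Jsoup
import scala.util.Try

object ExtractSketch {
  // Hypothetical markup that only imitates the layout the selectors expect;
  // the real 58.com page may look different.
  val html: String =
    """<table><tbody id="infolist">
      |<tr class="dev"><td>ad</td><td>-</td><td>-</td><td>-</td></tr>
      |<tr><td><a class="t" href="http://example.com/1">Sunny flat</a></td>
      |    <td>2500</td><td>2 rooms</td><td>today</td></tr>
      |</tbody></table>""".stripMargin

  // (title, link, price) per listing row; rows whose price cannot be
  // read as an integer are skipped instead of throwing.
  def rows(): List[(String, String, Int)] = {
    val trs = Jsoup.parse(html).select("#infolist > tr:not(.dev)")
    (0 until trs.size()).toList.flatMap { i =>
      val tds = trs.get(i).select("td")
      Try(tds.get(1).text().trim.toInt).toOption.map { price =>
        (tds.get(0).text(), tds.get(0).select("a[class=t]").attr("href"), price)
      }
    }
  }

  def main(args: Array[String]): Unit =
    rows().foreach(println)
}
```

Wrapping the price conversion in Try is one possible way to tolerate cells such as a dash or a price with a unit; whether skipping such rows is the right behavior depends on the use case.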