Extracting data with Jsoup

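Jsoup is an HTML parser for Java that provides a convenient API for extracting data from HTML documents and manipulating them. It lets you locate nodes and read their contents by combining DOM traversal, CSS selectors, and jQuery-like methods.
Like jQuery, it offers a powerful select API and supports chained (pipeline-style) calls.
The example below extracts rental listings from 58.com (58同城) to show how it is used.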


package test

import org.jsoup.nodes.Document
import java.util.HashMap
import org.jsoup.Jsoup
/**
 * Author: fuliang
 * http://fuliang.iteye.com
 */
class HouseEntry(var title: String,var link: String,var price: Integer, var houseType: String, var date: String){
	override def toString(): String = {
		// The format string must stay on a single line; the trailing newline is written as \n.
		return String.format("title: %s\tlink:%s\tprice:%d\thouseType:%s\tdate:%s\n", title,link,price,houseType,date);
	}
}

class HouseRentCrawler{
	def crawl(url: String,keyword: String,lowRange: Int,highRange: Int): List[HouseEntry] = {
		var doc = fetch(url,keyword,lowRange,highRange);
		return extract(doc);
	}

	private def fetch(url:String,keyword: String,lowRange: Int,highRange: Int): Document = {
		var params = new HashMap[String,String]();
		params.put("final","1");
		params.put("jump","2");
		params.put("searchtype","3");
		params.put("key",keyword);
		params.put("MinPrice",lowRange + "_" + highRange);
		
		// Send the search parameters with a GET request; the timeout is in milliseconds.
		return Jsoup.connect(url).data(params)
			.userAgent("Mozilla")
			.timeout(10000)
			.get();
	}
	
	private def extract(doc: Document):  List[HouseEntry] = {
		// Each listing is a row directly under #infolist; rows with class "dev" are skipped.
		val elements = doc.select("#infolist > tr:not(.dev)");
		var houseEntries = List[HouseEntry]();
		for(i <- 0 until elements.size()){
			val entry = elements.get(i);
			val fields = entry.select("td");
			val title = fields.get(0).text();
			val link = fields.get(0).select("a[class=t]").attr("href");
			// Throws NumberFormatException if the cell is not a plain integer.
			val price = fields.get(1).text().trim.toInt;
			val houseType = fields.get(2).text();
			val date = fields.get(3).text();
			val houseEntry = new HouseEntry(title,link,price,houseType,date);
			// ::= prepends, so the list is built in reverse page order.
			houseEntries ::= houseEntry;
		}
		// Restore the order in which the rows appear on the page.
		return houseEntries.reverse;
	}
}

object HouseRentCrawler{
	def main(args: Array[String]): Unit = {
		val url = "http://bj.58.com/zufang";
		val crawler = new HouseRentCrawler();
		val houseEntries = crawler.crawl(url,"   ",2000,3500);
		for(entry <- houseEntries){
			println(entry);
		}
	}
}
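
The extraction step can also be tried offline, without a network request. The following is a minimal sketch, assuming a hypothetical table in which the `infolist` id sits on the `tbody` element; the markup, the `ExtractSketch` object, and the `rows` helper are illustrative and are not taken from the real 58.com page. The id is placed on `tbody` because jsoup's HTML parser inserts an implied `tbody` between `table` and `tr`, so `#infolist > tr` would match nothing if the id were on the `table` itself.

```scala
import org.jsoup.Jsoup

object ExtractSketch {
  // Hypothetical markup, loosely modelled on the selectors used in the crawler.
  val html: String =
    """<table><tbody id="infolist">
      |<tr class="dev"><td>ad</td></tr>
      |<tr><td><a class="t" href="/a/1">Room A</a></td><td>2500</td><td>1BR</td><td>today</td></tr>
      |<tr><td><a class="t" href="/a/2">Room B</a></td><td>3000</td><td>2BR</td><td>yesterday</td></tr>
      |</tbody></table>""".stripMargin

  // Returns (title, link, price) for every row that is not marked "dev".
  def rows(): List[(String, String, Int)] = {
    val doc = Jsoup.parse(html)
    val trs = doc.select("#infolist > tr:not(.dev)")
    (0 until trs.size()).toList.map { i =>
      val fields = trs.get(i).select("td")
      val title = fields.get(0).text()
      val link = fields.get(0).select("a[class=t]").attr("href")
      val price = fields.get(1).text().trim.toInt
      (title, link, price)
    }
  }

  def main(args: Array[String]): Unit = rows().foreach(println)
}
```

Building the result with `map` over the index range keeps the rows in page order, so no reversal is needed at the end.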

Selector overview
    * tagname: find elements by tag, e.g. a
    * ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
    * #id: find elements by ID, e.g. #logo
    * .class: find elements by class name, e.g. .masthead
    * [attribute]: elements with attribute, e.g. [href]
    * [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
    * [attr=value]: elements with attribute value, e.g. [width=500]
    * [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
    * [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
    * *: all elements, e.g. *
Selector combinations
    * el#id: elements with ID, e.g. div#logo
    * el.class: elements with class, e.g. div.masthead
    * el[attr]: elements with attribute, e.g. a[href]
    * Any combination, e.g. a[href].highlight
    * ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
    * parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
    * siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
    * siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
    * el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
Pseudo selectors
    * :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
    * :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
    * :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
    * :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
    * :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
    * :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
    * :containsOwn(text): find elements that directly contain the given text
    * :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
    * :matchesOwn(regex): find elements whose own text matches the specified regular expression
    * Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc.
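
A few of these selectors can be exercised against a small in-memory document. This is a sketch under stated assumptions: the HTML fragment, the `SelectorSketch` object, and the `count` helper are made up for illustration, and only `Jsoup.parse` and `select` come from jsoup itself.

```scala
import org.jsoup.Jsoup

object SelectorSketch {
  // Hypothetical fragment; jsoup wraps it in html/head/body when parsing.
  val html: String =
    """<div id="logo" class="masthead"><a href="/path/home">Home</a></div>
      |<div class="content"><p>About jsoup</p><p>Other</p><img src="pic.PNG"></div>""".stripMargin

  val doc = Jsoup.parse(html)

  // Number of elements matched by a CSS query.
  def count(css: String): Int = doc.select(css).size()

  def main(args: Array[String]): Unit = {
    val queries = List(
      "div#logo",                        // element with ID
      "[href*=/path/]",                  // attribute value contains
      "div.content > p",                 // direct children
      "div.content > p:eq(0)",           // 0-based sibling index
      "div:not(.masthead)",              // negation
      "div:has(img)",                    // contains a matching descendant
      "p:contains(JSOUP)",               // case-insensitive text search
      "img[src~=(?i)\\.(png|jpe?g)]"     // attribute value matches regex
    )
    queries.foreach(q => println(q + " -> " + count(q)))
  }
}
```

Counting matches is only one way to inspect a query; the `Elements` returned by `select` can equally be iterated, or passed through `text()` and `attr()` as in the crawler above.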
For more information, see http://jsoup.org/
