[Repost] A Simple Solution for Extracting the Main Text from HTML Files


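Reposted from: http://blog.csdn.net/lanphaday/archive/2007/08/13/1741185.aspx
Following the article above, I wrote a test class for reducing page noise, and it does work. The results can be skewed on some pages, though, especially pages with little text, such as topic pages that mix images and text.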
package com.test.net;

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/**
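 * Fetches a page and prints the six lines that account for the largest
 * share of its bytes, as a rough way to locate the main content.
 *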
 * @author LiuZiHeng
 * @version
 * @date 2010-8-25
 */
public class GetMainContent {

	private PriorityQueue<IndexPersent> priorityQueue = new PriorityQueue<IndexPersent>(1000, new Comparator<IndexPersent>() {
		public int compare(IndexPersent o1, IndexPersent o2) {
			if(o1.persent > o2.persent) {
				return -1;
			}
			
			if(o1.persent < o2.persent) {
				return 1;
			}
			return 0;
		}
	});
	
	public void run() {
		try {
			URL url = new URL("http://view.news.qq.com/a/20100824/000039.htm");
			URLConnection connection = url.openConnection();
			connection.connect();
			InputStream in = connection.getInputStream();
			
			BufferedReader reader = new BufferedReader(new InputStreamReader(in, "GBK"));
			FileOutputStream writer = new FileOutputStream("txt/test1.html", true);
			String line = null;
			StringBuffer sb = new StringBuffer();
			List<String> contentlist = new ArrayList<String>();
			
			// Read the page line by line and append a raw copy to a local file
			while((line = reader.readLine()) != null) {
				writer.write(line.getBytes("GBK"));
				writer.write("\r\n".getBytes("GBK"));
				sb.append(line);// accumulate the whole HTML document
				contentlist.add(line);// keep each HTML line for ranking
			}
			reader.close();
			writer.close();
			System.out.println("=============================================");
			
			double allens = sb.toString().getBytes("GBK").length;
			for(int i = 0; i < contentlist.size(); i++) {
				String linestr = contentlist.get(i);
				int linelen = linestr.getBytes("GBK").length;
				double persent = (double)linelen / allens;// this line's share of the page, measured in GBK bytes
				IndexPersent indexPersent = new IndexPersent();
				indexPersent.setIndex(i);
				indexPersent.setPersent(persent);
				this.priorityQueue.add(indexPersent);
			}
			
			// Print the six lines with the largest share
			int maxsize = 0;
			while(!priorityQueue.isEmpty()) {
				IndexPersent indexPersent = priorityQueue.poll();
				System.out.println(indexPersent.getIndex() + ":" + indexPersent.getPersent());
				System.out.println(contentlist.get(indexPersent.getIndex()));
				maxsize++;
				if(maxsize >= 6) {
					break;
				}
			}
		} catch (MalformedURLException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
	
	public static void main(String[] args) {
		new GetMainContent().run();
	}
	
	private static class IndexPersent {
		int index;
		double persent;
		
		int getIndex() {
			return index;
		}
		
		void setIndex(int index) {
			this.index = index;
		}
		
		double getPersent() {
			return persent;
		}
		
		void setPersent(double persent) {
			this.persent = persent;
		}
	}
}