The Hadoop Classic: WordCount


The following program was tested successfully on Hadoop 1.2.1.
This article first presents the source code, then explains the execution steps in detail, and finally analyzes the source code and the execution process.
1. Source Code
package org.jediael.hadoopdemo.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

	public static class WordCountMap extends
			Mapper<LongWritable, Text, Text, IntWritable> {

		private final IntWritable one = new IntWritable(1);
		private Text word = new Text();

		// Emit (word, 1) for every whitespace-separated token in the line.
		@Override
		public void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String line = value.toString();
			StringTokenizer token = new StringTokenizer(line);
			while (token.hasMoreTokens()) {
				word.set(token.nextToken());
				context.write(word, one);
			}
		}
	}

	public static class WordCountReduce extends
			Reducer<Text, IntWritable, Text, IntWritable> {

		// Sum all the counts emitted for a given word.
		@Override
		public void reduce(Text key, Iterable<IntWritable> values,
				Context context) throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			context.write(key, new IntWritable(sum));
		}
	}

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = new Job(conf);
		job.setJarByClass(WordCount.class);
		job.setJobName("wordcount");

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		job.setMapperClass(WordCountMap.class);
		job.setReducerClass(WordCountReduce.class);

		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);

		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		job.waitForCompletion(true);
	}
}
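The map/shuffle/reduce pipeline above can be checked locally with a small plain-Java sketch (no Hadoop dependency; the class and method names here are illustrative, not part of the Hadoop API): tokenize each line, emit (word, 1) pairs, then group by key and sum the values, exactly as the Mapper and Reducer do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    // "map" phase: emit one (word, 1) pair per token, like WordCountMap.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer token = new StringTokenizer(line);
            while (token.hasMoreTokens()) {
                pairs.add(Map.entry(token.nextToken(), 1));
            }
        }
        return pairs;
    }

    // "shuffle + reduce" phase: group by key and sum, like WordCountReduce.
    // TreeMap keeps the keys sorted, mirroring the sorted shuffle output.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog");
        System.out.println(reduce(map(lines)));
        // {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

The sorted TreeMap output also explains why the real job's part-r-00000 below is in key order.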

2. Running the Program
1) Export the program from Eclipse as wordcount.jar and upload it to the Hadoop server; in this example, the jar is uploaded to /home/jediael/project.
2) Install Hadoop in pseudo-distributed mode; for details, see the Hadoop 1.2.1 pseudo-distributed mode installation guide.
3) Create a directory wcinput in HDFS as the input directory, and copy the files to be analyzed into it:
[root@jediael conf]# hadoop fs -mkdir wcinput
[root@jediael conf]# hadoop fs -copyFromLocal * wcinput 
[root@jediael conf]# hadoop fs -ls wcinput 
Found 26 items 
-rw-r--r-- 1 root supergroup 1524 2014-08-20 12:29 /user/root/wcinput/automaton-urlfilter.txt 
-rw-r--r-- 1 root supergroup 1311 2014-08-20 12:29 /user/root/wcinput/configuration.xsl 
-rw-r--r-- 1 root supergroup 131090 2014-08-20 12:29 /user/root/wcinput/domain-suffixes.xml 
-rw-r--r-- 1 root supergroup 4649 2014-08-20 12:29 /user/root/wcinput/domain-suffixes.xsd 
-rw-r--r-- 1 root supergroup 824 2014-08-20 12:29 /user/root/wcinput/domain-urlfilter.txt 
-rw-r--r-- 1 root supergroup 3368 2014-08-20 12:29 /user/root/wcinput/gora-accumulo-mapping.xml 
-rw-r--r-- 1 root supergroup 3279 2014-08-20 12:29 /user/root/wcinput/gora-cassandra-mapping.xml 
-rw-r--r-- 1 root supergroup 3447 2014-08-20 12:29 /user/root/wcinput/gora-hbase-mapping.xml 
-rw-r--r-- 1 root supergroup 2677 2014-08-20 12:29 /user/root/wcinput/gora-sql-mapping.xml 
-rw-r--r-- 1 root supergroup 2993 2014-08-20 12:29 /user/root/wcinput/gora.properties 
-rw-r--r-- 1 root supergroup 983 2014-08-20 12:29 /user/root/wcinput/hbase-site.xml 
-rw-r--r-- 1 root supergroup 3096 2014-08-20 12:29 /user/root/wcinput/httpclient-auth.xml 
-rw-r--r-- 1 root supergroup 3948 2014-08-20 12:29 /user/root/wcinput/log4j.properties 
-rw-r--r-- 1 root supergroup 511 2014-08-20 12:29 /user/root/wcinput/nutch-conf.xsl 
-rw-r--r-- 1 root supergroup 42610 2014-08-20 12:29 /user/root/wcinput/nutch-default.xml 
-rw-r--r-- 1 root supergroup 753 2014-08-20 12:29 /user/root/wcinput/nutch-site.xml 
-rw-r--r-- 1 root supergroup 347 2014-08-20 12:29 /user/root/wcinput/parse-plugins.dtd 
-rw-r--r-- 1 root supergroup 3016 2014-08-20 12:29 /user/root/wcinput/parse-plugins.xml 
-rw-r--r-- 1 root supergroup 857 2014-08-20 12:29 /user/root/wcinput/prefix-urlfilter.txt 
-rw-r--r-- 1 root supergroup 2484 2014-08-20 12:29 /user/root/wcinput/regex-normalize.xml 
-rw-r--r-- 1 root supergroup 1736 2014-08-20 12:29 /user/root/wcinput/regex-urlfilter.txt 
-rw-r--r-- 1 root supergroup 18969 2014-08-20 12:29 /user/root/wcinput/schema-solr4.xml 
-rw-r--r-- 1 root supergroup 6020 2014-08-20 12:29 /user/root/wcinput/schema.xml 
-rw-r--r-- 1 root supergroup 1766 2014-08-20 12:29 /user/root/wcinput/solrindex-mapping.xml 
-rw-r--r-- 1 root supergroup 1044 2014-08-20 12:29 /user/root/wcinput/subcollections.xml 
-rw-r--r-- 1 root supergroup 1411 2014-08-20 12:29 /user/root/wcinput/suffix-urlfilter.txt
4) Run the program:
[root@jediael project]# hadoop org.jediael.hadoopdemo.wordcount.WordCount wcinput wcoutput3 
14/08/20 12:50:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
14/08/20 12:50:26 INFO input.FileInputFormat: Total input paths to process : 26 
14/08/20 12:50:26 INFO util.NativeCodeLoader: Loaded the native-hadoop library 
14/08/20 12:50:26 WARN snappy.LoadSnappy: Snappy native library not loaded 
14/08/20 12:50:26 INFO mapred.JobClient: Running job: job_201408191134_0005 
14/08/20 12:50:27 INFO mapred.JobClient: map 0% reduce 0% 
14/08/20 12:50:38 INFO mapred.JobClient: map 3% reduce 0% 
14/08/20 12:50:39 INFO mapred.JobClient: map 7% reduce 0% 
14/08/20 12:50:50 INFO mapred.JobClient: map 15% reduce 0% 
14/08/20 12:50:57 INFO mapred.JobClient: map 19% reduce 0% 
14/08/20 12:50:58 INFO mapred.JobClient: map 23% reduce 0% 
14/08/20 12:51:00 INFO mapred.JobClient: map 23% reduce 5% 
14/08/20 12:51:04 INFO mapred.JobClient: map 30% reduce 5% 
14/08/20 12:51:06 INFO mapred.JobClient: map 30% reduce 10% 
14/08/20 12:51:11 INFO mapred.JobClient: map 38% reduce 10% 
14/08/20 12:51:16 INFO mapred.JobClient: map 38% reduce 11% 
14/08/20 12:51:18 INFO mapred.JobClient: map 46% reduce 11% 
14/08/20 12:51:19 INFO mapred.JobClient: map 46% reduce 12% 
14/08/20 12:51:22 INFO mapred.JobClient: map 46% reduce 15% 
14/08/20 12:51:25 INFO mapred.JobClient: map 53% reduce 15% 
14/08/20 12:51:31 INFO mapred.JobClient: map 53% reduce 17% 
14/08/20 12:51:32 INFO mapred.JobClient: map 61% reduce 17% 
14/08/20 12:51:39 INFO mapred.JobClient: map 69% reduce 17% 
14/08/20 12:51:40 INFO mapred.JobClient: map 69% reduce 20% 
14/08/20 12:51:45 INFO mapred.JobClient: map 73% reduce 20% 
14/08/20 12:51:46 INFO mapred.JobClient: map 76% reduce 23% 
14/08/20 12:51:52 INFO mapred.JobClient: map 80% reduce 23% 
14/08/20 12:51:53 INFO mapred.JobClient: map 84% reduce 23% 
14/08/20 12:51:55 INFO mapred.JobClient: map 84% reduce 25% 
14/08/20 12:51:59 INFO mapred.JobClient: map 88% reduce 25% 
14/08/20 12:52:00 INFO mapred.JobClient: map 92% reduce 25% 
14/08/20 12:52:02 INFO mapred.JobClient: map 92% reduce 29% 
14/08/20 12:52:06 INFO mapred.JobClient: map 96% reduce 29% 
14/08/20 12:52:07 INFO mapred.JobClient: map 100% reduce 29% 
14/08/20 12:52:11 INFO mapred.JobClient: map 100% reduce 30% 
14/08/20 12:52:15 INFO mapred.JobClient: map 100% reduce 100% 
14/08/20 12:52:17 INFO mapred.JobClient: Job complete: job_201408191134_0005 
14/08/20 12:52:18 INFO mapred.JobClient: Counters: 29 
14/08/20 12:52:18 INFO mapred.JobClient: Job Counters 
14/08/20 12:52:18 INFO mapred.JobClient: Launched reduce tasks=1 
14/08/20 12:52:18 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=192038 
14/08/20 12:52:18 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 
14/08/20 12:52:18 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 
14/08/20 12:52:18 INFO mapred.JobClient: Launched map tasks=26 
14/08/20 12:52:18 INFO mapred.JobClient: Data-local map tasks=26 
14/08/20 12:52:18 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=95814 
14/08/20 12:52:18 INFO mapred.JobClient: File Output Format Counters 
14/08/20 12:52:18 INFO mapred.JobClient: Bytes Written=123950 
14/08/20 12:52:18 INFO mapred.JobClient: FileSystemCounters 
14/08/20 12:52:18 INFO mapred.JobClient: FILE_BYTES_READ=352500 
14/08/20 12:52:18 INFO mapred.JobClient: HDFS_BYTES_READ=247920 
14/08/20 12:52:18 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2177502 
14/08/20 12:52:18 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=123950 
14/08/20 12:52:18 INFO mapred.JobClient: File Input Format Counters 
14/08/20 12:52:18 INFO mapred.JobClient: Bytes Read=244713 
14/08/20 12:52:18 INFO mapred.JobClient: Map-Reduce Framework 
14/08/20 12:52:18 INFO mapred.JobClient: Map output materialized bytes=352650 
14/08/20 12:52:18 INFO mapred.JobClient: Map input records=7403 
14/08/20 12:52:18 INFO mapred.JobClient: Reduce shuffle bytes=352650 
14/08/20 12:52:18 INFO mapred.JobClient: Spilled Records=45210 
14/08/20 12:52:18 INFO mapred.JobClient: Map output bytes=307281 
14/08/20 12:52:18 INFO mapred.JobClient: Total committed heap usage (bytes)=3398606848 
14/08/20 12:52:18 INFO mapred.JobClient: CPU time spent (ms)=14400 
14/08/20 12:52:18 INFO mapred.JobClient: Combine input records=0 
14/08/20 12:52:18 INFO mapred.JobClient: SPLIT_RAW_BYTES=3207 
14/08/20 12:52:18 INFO mapred.JobClient: Reduce input records=22605 
14/08/20 12:52:18 INFO mapred.JobClient: Reduce input groups=6749 
14/08/20 12:52:18 INFO mapred.JobClient: Combine output records=0 
14/08/20 12:52:18 INFO mapred.JobClient: Physical memory (bytes) snapshot=4799041536 
14/08/20 12:52:18 INFO mapred.JobClient: Reduce output records=6749 
14/08/20 12:52:18 INFO mapred.JobClient: Virtual memory (bytes) snapshot=19545337856 
14/08/20 12:52:18 INFO mapred.JobClient: Map output records=22605
5) View the results:
[root@jediael project]# hadoop fs -ls wcoutput3 
Found 3 items 
-rw-r--r-- 1 root supergroup 0 2014-08-20 12:52 /user/root/wcoutput3/_SUCCESS 
drwxr-xr-x - root supergroup 0 2014-08-20 12:50 /user/root/wcoutput3/_logs 
-rw-r--r-- 1 root supergroup 123950 2014-08-20 12:52 /user/root/wcoutput3/part-r-00000 
[root@jediael project]# hadoop fs -cat wcoutput3/part-r-00000
!!      2
!ci.*.*.us      1
!co.*.*.us      1
!town.*.*.us    1
"AS     22
"Accept"        1
"Accept-Language"       1
"License");     22
"NOW"   1
"WiFi"  1
"Z"     1
"all"   1
"content"       1
"delete 1
"delimiter"     1
………………
3. Program Analysis
1) The WordCountMap class extends org.apache.hadoop.mapreduce.Mapper. Its four generic type parameters are, in order, the input key type, the input value type, the output key type, and the output value type of the map function.
2) The WordCountReduce class extends org.apache.hadoop.mapreduce.Reducer. Its four generic type parameters have the same meaning as in the Mapper class.
3) The map output types must match the reduce input types. In general, the map output types are also the same as the reduce output types, so the reduce input and output types match as well.
4) Hadoop determines the input format from the following code:
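A practical consequence of the reduce input and output types being identical: the same reducer class can also be registered as a combiner to pre-aggregate map output locally before the shuffle (the job counters above show Combine input records=0, i.e. no combiner was configured in this run). A hedged sketch of the one-line addition to main():

```java
// Optional: reuse the reducer as a combiner to cut shuffle traffic.
// This is safe here because summing is associative and commutative, and
// the reducer's input and output types are identical (Text, IntWritable).
job.setCombinerClass(WordCountReduce.class);
```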
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat, which extends FileInputFormat, is Hadoop's default input format. TextInputFormat cuts the data set into smaller chunks called InputSplits, each of which is processed by one mapper. Furthermore, the InputFormat provides a RecordReader implementation that parses each InputSplit into <key, value> pairs and passes them to the map function:
key: the byte offset of the line within the split; its type is LongWritable.
value: the content of the line; its type is Text.
Therefore, in this example, the key/value types of the map function are LongWritable and Text.
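To make the LongWritable key concrete, the byte offsets a TextInputFormat-style RecordReader would hand to map() can be computed with plain Java (a local sketch under the assumption of '\n' line endings and a single split, not Hadoop code):

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineOffsets {

    // Returns (byte offset -> line content), mimicking the
    // <LongWritable, Text> records passed to the map function.
    static Map<Long, String> offsets(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);
            // Next key: current offset + line bytes + 1 for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(offsets("hello world\nhadoop\n"));
        // {0=hello world, 12=hadoop}
    }
}
```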
5) Hadoop determines the output format from the following code:
job.setOutputFormatClass(TextOutputFormat.class);
TextOutputFormat is Hadoop's default output format. It writes each record to a text file as one line in the form key<TAB>value, for example:
the 30
happy 23
……