Compiling hadoop-lzo so that Hadoop supports the LZO compression format
1. Installing lzo
1.1 Compressing and decompressing lzo files requires the lzop tool on the server; Hadoop's native library does not provide lzo support (hadoop checknative shows no lzo-related entries — see the check after the install commands below).
# check whether lzop is already installed
[hadoop@hadoop001 software]$ which lzop
/bin/lzop
# install lzo/lzop and the build dependencies (as root)
[root@hadoop001 ~]# yum install -y svn ncurses-devel
[root@hadoop001 ~]# yum install -y gcc gcc-c++ make cmake
[root@hadoop001 ~]# yum install -y openssl openssl-devel svn ncurses-devel zlib-devel libtool
[root@hadoop001 ~]# yum install -y lzo lzo-devel lzop autoconf automake cmake
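To see for yourself that the stock native library has no LZO support, the codec check mentioned above can be run directly (just the command; the exact listing depends on how Hadoop was built, but lzo is not part of it):
# native codec check; lzo/lzop do not appear in this listing
[hadoop@hadoop001 ~]$ hadoop checknative -a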
1.2 Compressing test data with the lzop tool
# size of the original test data
[hadoop@hadoop001 log_data]$ ll
total 441152
-rw-r--r--. 1 hadoop hadoop 437156257 Apr 16 10:48 page_views.dat
[hadoop@hadoop001 log_data]$ du -sh *
431M page_views.dat
# compress to lzo: lzop -v file; decompress: lzop -dv file
[hadoop@hadoop001 log_data]$ lzop -v page_views.dat
compressing page_views.dat into page_views.dat.lzo
# sizes after compression
[hadoop@hadoop001 log_data]$ du -sh *
417M page_views.dat
199M page_views.dat.lzo
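For completeness, the reverse direction is the -d flag mentioned above; a minimal sketch (add -f if the uncompressed file still exists, since lzop will not overwrite it by default):
# decompress the .lzo file back to page_views.dat
[hadoop@hadoop001 log_data]$ lzop -dvf page_views.dat.lzo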
2. Compiling hadoop-lzo
The hadoop-lzo source code is open source on GitHub: https://github.com/twitter/hadoop-lzo
2.1 Building the source with mvn
# extract the source archive
[hadoop@hadoop001 software]$ tar -xzvf hadoop-lzo-release-0.4.20.tar.gz -C ../app/source/
# before running mvn, set the hadoop version in pom.xml to match your cluster (here 2.6.0)
[hadoop@hadoop001 source]$ cd hadoop-lzo-release-0.4.20/
[hadoop@hadoop001 hadoop-lzo-release-0.4.20]$ vim pom.xml
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.6.0</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
# build, skipping the tests
[hadoop@hadoop001 hadoop-lzo-release-0.4.20]$ mvn clean package -Dmaven.test.skip=true
[INFO] Building jar: /home/hadoop/app/source/hadoop-lzo-release-0.4.20/target/hadoop-lzo-0.4.20-javadoc.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:06 min
[INFO] Finished at: 2019-04-16T11:00:15-04:00
[INFO] Final Memory: 36M/516M
[INFO] ------------------------------------------------------------------------
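If the build instead fails in the native part because liblzo2 or its headers cannot be found, a common workaround is to point the compiler at them explicitly through environment variables; the paths below are illustrative and assume the yum packages installed above:
# retry the build with explicit include/library paths for lzo (paths are examples)
[hadoop@hadoop001 hadoop-lzo-release-0.4.20]$ C_INCLUDE_PATH=/usr/include LIBRARY_PATH=/usr/lib64 mvn clean package -Dmaven.test.skip=true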
# the build artifacts are under target/; hadoop-lzo-0.4.20.jar is the jar we need
[hadoop@hadoop001 hadoop-lzo-release-0.4.20]$ cd target/
[hadoop@hadoop001 target]$ ll
total 424
drwxrwxr-x. 2 hadoop hadoop 4096 Apr 16 10:59 antrun
drwxrwxr-x. 4 hadoop hadoop 4096 Apr 16 11:00 apidocs
drwxrwxr-x. 5 hadoop hadoop 66 Apr 16 10:59 classes
drwxrwxr-x. 3 hadoop hadoop 25 Apr 16 10:59 generated-sources
-rw-rw-r--. 1 hadoop hadoop 188645 Apr 16 11:00 hadoop-lzo-0.4.20.jar
-rw-rw-r--. 1 hadoop hadoop 180128 Apr 16 11:00 hadoop-lzo-0.4.20-javadoc.jar
-rw-rw-r--. 1 hadoop hadoop 51984 Apr 16 11:00 hadoop-lzo-0.4.20-sources.jar
drwxrwxr-x. 2 hadoop hadoop 71 Apr 16 11:00 javadoc-bundle-options
drwxrwxr-x. 2 hadoop hadoop 28 Apr 16 11:00 maven-archiver
drwxrwxr-x. 3 hadoop hadoop 28 Apr 16 10:59 native
drwxrwxr-x. 3 hadoop hadoop 18 Apr 16 10:59 test-classes
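The native/ directory above holds the libgplcompression JNI library built alongside the jar. Whether it needs to be installed separately depends on the environment; if the LZO classes load but the native codec cannot be found at runtime, one option is to copy it next to Hadoop's other native libraries (the platform subdirectory name below is an assumption and may differ):
# optional: make the hadoop-lzo JNI library visible to hadoop (path is illustrative)
[hadoop@hadoop001 target]$ cp native/Linux-amd64-64/lib/libgplcompression* ~/app/hadoop/lib/native/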
3. Configuring Hadoop
3.1 Deploying the hadoop-lzo jar
# copy hadoop-lzo-0.4.20.jar into hadoop's share/hadoop/common directory so the LZO codec classes are on the classpath
[hadoop@hadoop001 target]$ cp hadoop-lzo-0.4.20.jar ~/app/hadoop/share/hadoop/common/
[hadoop@hadoop001 target]$ ll ~/app/hadoop/share/hadoop/common/hadoop-lzo*
-rw-rw-r--. 1 hadoop hadoop 188645 Apr 16 11:11 /home/hadoop/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar
3.2 Configuring core-site.xml
# stop the hadoop cluster first
[hadoop@hadoop001 hadoop-lzo-master]$ stop-all.sh
# edit core-site.xml
[hadoop@hadoop001 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/core-site.xml
# add com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec to io.compression.codecs
# io.compression.codec.lzo.class is set to LzoCodec here; LzopCodec can also be used
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         org.apache.hadoop.io.compress.BZip2Codec,
         org.apache.hadoop.io.compress.SnappyCodec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec
  </value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
[hadoop@hadoop001 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml
# compress the intermediate (map) output with LZO
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
# compress the final job output (here with BZip2)
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
If this is a cluster, the same changes to core-site.xml and mapred-site.xml must be synchronized to every machine, and the hadoop-lzo jar must be present on every node, before the cluster is started (a sketch follows).
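How the jar and the two configuration files reach the other nodes is up to you; a minimal sketch with scp, where hadoop002 and hadoop003 are purely illustrative host names:
# push the jar and the configs to the other nodes, then start the cluster (hosts are examples)
[hadoop@hadoop001 ~]$ for host in hadoop002 hadoop003; do
    scp ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar $host:~/app/hadoop/share/hadoop/common/
    scp ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/core-site.xml ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml $host:~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/
  done
[hadoop@hadoop001 ~]$ start-all.sh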
4. Testing LZO files
4.1 Testing splitting in Hive
-- create a table for lzo data
-- this requires the hadoop-lzo jar under hadoop common, otherwise the DeprecatedLzoTextInputFormat class cannot be loaded
create table page_views2_lzo(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) row format delimited fields terminated by '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
-- load the lzo file into the table
hive> load data local inpath '/home/hadoop/log_data/page_views.dat.lzo' overwrite into table page_views2_lzo;
Loading data to table test.page_views2_lzo
Table test.page_views2_lzo stats: [numFiles=1, numRows=0, totalSize=207749249, rawDataSize=0]
OK
Time taken: 1.009 seconds
# check the file in the table's HDFS directory
[hadoop@hadoop001 hadoop]$ hadoop fs -du -s -h /user/hive/warehouse/test.db/page_views2_lzo/*
198.1 M 198.1 M /user/hive/warehouse/test.db/page_views2_lzo/page_views.dat.lzo
-- run a query; without an index the lzo file is not split, so only 1 map task is launched
select count(1) from page_views2_lzo;
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.88 sec HDFS Read: 207756318 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 880 msec
4.2 Enabling compression
-- set the output codec to LzopCodec; with LzoCodec the generated files would get the .lzo_deflate suffix instead
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
-- create an LZO-compressed table from the existing table
create table page_views2_lzo_split
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
as select * from page_views2_lzo;
# the table directory now contains a single file with the .lzo suffix
[hadoop@hadoop001 hadoop]$ hadoop fs -du -s -h /user/hive/warehouse/test.db/page_views2_lzo_split/*
196.8 M 196.8 M /user/hive/warehouse/test.db/page_views2_lzo_split/000000_0.lzo
# build an index for the LZO file, using the hadoop-lzo jar
[hadoop@hadoop001 hadoop]$ hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/test.db/page_views2_lzo_split
# check HDFS: a .index file has been generated next to the .lzo file
[hadoop@hadoop001 hadoop]$ hadoop fs -du -s -h /user/hive/warehouse/test.db/page_views2_lzo_split/*
196.8 M 196.8 M /user/hive/warehouse/test.db/page_views2_lzo_split/000000_0.lzo
13.9 K 13.9 K /user/hive/warehouse/test.db/page_views2_lzo_split/000000_0.lzo.index
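LzoIndexer above builds the index from the client machine. hadoop-lzo also ships com.hadoop.compression.lzo.DistributedLzoIndexer, which does the same work as a MapReduce job and is the usual choice when there are many or very large .lzo files; the invocation is analogous:
# alternative: build the .index files with a MapReduce job
[hadoop@hadoop001 hadoop]$ hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/test.db/page_views2_lzo_split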
-- run the query again; with the index the file is split and 2 map tasks are launched
select count(1) from page_views2_lzo_split;
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 28.05 sec HDFS Read: 206448787 HDFS Write: 58 SUCCESS
Total MapReduce CPU Time Spent: 28 seconds 50 msec
OK
2298975
Time taken: 28.621 seconds, Fetched: 1 row(s)
So once an index has been built, lzo supports splitting the data.
Of the compression formats commonly used in big data, only bzip2 supports splitting out of the box; lzo supports splitting only after an index has been built for the file.