Compiling Hadoop to support the LZO compression format


1. Installing LZO


1.1 Compressing and decompressing .lzo files requires the lzop tool on the server; Hadoop's native library does not provide LZO support (hadoop checknative shows no lzo-related entry).
# check whether lzop is already installed
[hadoop@hadoop001 software]$ which lzop
/bin/lzop
# install the build dependencies plus lzo / lzop (as root)
[root@hadoop001 ~]# yum install -y svn ncurses-devel
[root@hadoop001 ~]# yum install -y gcc gcc-c++ make cmake
[root@hadoop001 ~]# yum install -y openssl openssl-devel svn ncurses-devel zlib-devel libtool
[root@hadoop001 ~]# yum install -y lzo lzo-devel lzop autoconf automake cmake 
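After the packages are in place, you can double-check both sides of the statement above: lzop is now on the PATH, while Hadoop's own native check still reports nothing for LZO (that support comes from the hadoop-lzo jar built in the next section). A minimal check, assuming hadoop is already on the PATH:
# lzop is provided by the lzop package installed above
[hadoop@hadoop001 ~]$ lzop --version
# hadoop checknative only reports the bundled libraries (zlib, snappy, lz4, bzip2, openssl); no lzo entry appears here
[hadoop@hadoop001 ~]$ hadoop checknative -a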

1.2 Compress test data with the lzop tool
# original file size
[hadoop@hadoop001 log_data]$ ll
total 441152
-rw-r--r--. 1 hadoop hadoop 437156257 Apr 16 10:48 page_views.dat
[hadoop@hadoop001 log_data]$ du -sh *
431M    page_views.dat
# compress to lzo: lzop -v file    decompress lzo: lzop -dv file
[hadoop@hadoop001 log_data]$ lzop -v page_views.dat 
compressing page_views.dat into page_views.dat.lzo
# sizes after compression
[hadoop@hadoop001 log_data]$ du -sh *               
417M    page_views.dat
199M    page_views.dat.lzo
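For completeness, the matching decompression call from the note above looks like this (a sketch; because page_views.dat is still present in this directory, lzop needs -f to overwrite it):
# decompress the .lzo archive back to page_views.dat (-d decompress, -v verbose, -f overwrite the existing file)
[hadoop@hadoop001 log_data]$ lzop -dvf page_views.dat.lzo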

2. Compiling hadoop-lzo


The hadoop-lzo source code is open source on GitHub: https://github.com/twitter/hadoop-lzo
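The release tarball used below can be fetched straight from that repository; the tag name release-0.4.20 and the archive URL are assumptions inferred from the directory name that appears after extraction:
# download the 0.4.20 release archive (URL/tag assumed from the extracted directory name)
[hadoop@hadoop001 software]$ wget https://github.com/twitter/hadoop-lzo/archive/release-0.4.20.tar.gz -O hadoop-lzo-release-0.4.20.tar.gz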
2.1 Compiling the source with mvn
# extract the source
[hadoop@hadoop001 software]$ tar -xzvf hadoop-lzo-release-0.4.20.tar.gz -C ../app/source/

#before running mvn, set the hadoop version in pom.xml to match your cluster; here it is 2.6.0
[hadoop@hadoop001 source]$ cd hadoop-lzo-release-0.4.20/
[hadoop@hadoop001 hadoop-lzo-release-0.4.20]$ vim pom.xml
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.current.version>2.6.0</hadoop.current.version>
    <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>

# build, skipping the tests
[hadoop@hadoop001 hadoop-lzo-release-0.4.20]$ mvn clean package -Dmaven.test.skip=true
[INFO] Building jar: /home/hadoop/app/source/hadoop-lzo-release-0.4.20/target/hadoop-lzo-0.4.20-javadoc.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:06 min
[INFO] Finished at: 2019-04-16T11:00:15-04:00
[INFO] Final Memory: 36M/516M
[INFO] ------------------------------------------------------------------------

# check the jars generated under target/; hadoop-lzo-0.4.20.jar is the one we need
[hadoop@hadoop001 hadoop-lzo-release-0.4.20]$ cd target/
[hadoop@hadoop001 target]$ ll
total 424
drwxrwxr-x. 2 hadoop hadoop   4096 Apr 16 10:59 antrun
drwxrwxr-x. 4 hadoop hadoop   4096 Apr 16 11:00 apidocs
drwxrwxr-x. 5 hadoop hadoop     66 Apr 16 10:59 classes
drwxrwxr-x. 3 hadoop hadoop     25 Apr 16 10:59 generated-sources
-rw-rw-r--. 1 hadoop hadoop 188645 Apr 16 11:00 hadoop-lzo-0.4.20.jar
-rw-rw-r--. 1 hadoop hadoop 180128 Apr 16 11:00 hadoop-lzo-0.4.20-javadoc.jar
-rw-rw-r--. 1 hadoop hadoop  51984 Apr 16 11:00 hadoop-lzo-0.4.20-sources.jar
drwxrwxr-x. 2 hadoop hadoop     71 Apr 16 11:00 javadoc-bundle-options
drwxrwxr-x. 2 hadoop hadoop     28 Apr 16 11:00 maven-archiver
drwxrwxr-x. 3 hadoop hadoop     28 Apr 16 10:59 native
drwxrwxr-x. 3 hadoop hadoop     18 Apr 16 10:59 test-classes

3. Configuring Hadoop


3.1 Deploy the hadoop-lzo jar

# copy hadoop-lzo-0.4.20.jar into hadoop's common directory; on a cluster it must be synced to every node (see the sketch below)
[hadoop@hadoop001 target]$ cp hadoop-lzo-0.4.20.jar ~/app/hadoop/share/hadoop/common/
[hadoop@hadoop001 target]$ ll  ~/app/hadoop/share/hadoop/common/hadoop-lzo*
-rw-rw-r--. 1 hadoop hadoop 188645 Apr 16 11:11 /home/hadoop/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar
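On a multi-node cluster the jar must be present on every node; a sketch of pushing it out with scp (the hostnames hadoop002/hadoop003 are placeholders):
# distribute the jar to the other nodes (hostnames are examples)
[hadoop@hadoop001 target]$ scp hadoop-lzo-0.4.20.jar hadoop002:~/app/hadoop/share/hadoop/common/
[hadoop@hadoop001 target]$ scp hadoop-lzo-0.4.20.jar hadoop003:~/app/hadoop/share/hadoop/common/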

3.2 Configure core-site.xml and mapred-site.xml

# stop hadoop first
[hadoop@hadoop001 hadoop-lzo-master]$ stop-all.sh 

# add the compression codec configuration to core-site.xml
[hadoop@hadoop001 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/core-site.xml 
# io.compression.codecs adds com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec to the built-in codec list
#io.compression.codec.lzo.class may be set to LzoCodec or LzopCodec; the difference between the two shows up in the output file suffix (see the Hive test below)
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec,
        org.apache.hadoop.io.compress.SnappyCodec,
        com.hadoop.compression.lzo.LzoCodec,
        com.hadoop.compression.lzo.LzopCodec
    </value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

[hadoop@hadoop001 ~]$ vim ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml
# compress intermediate (map output) data
<property>    
    <name>mapred.compress.map.output</name>    
    <value>true</value>    
</property>
<property>    
    <name>mapred.map.output.compression.codec</name>    
    <value>com.hadoop.compression.lzo.LzoCodec</value>    
</property>

# compress the final job output
<property>
   <name>mapreduce.output.fileoutputformat.compress</name>
   <value>true</value>
</property>

<property>
   <name>mapreduce.output.fileoutputformat.compress.codec</name>
   <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>	

If this is a cluster, core-site.xml and mapred-site.xml must be changed in the same way on every machine; then start the cluster.
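A sketch of syncing the two files and bringing the cluster back up (hostnames are placeholders; the paths follow the ones used above):
# push the edited configs to each of the other nodes, then start the cluster
[hadoop@hadoop001 ~]$ scp ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/core-site.xml ~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml hadoop002:~/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/
[hadoop@hadoop001 ~]$ start-all.sh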

4. Testing LZO files


4.1 Hive split test
-- create an LZO table
-- the hadoop-lzo jar placed under hadoop common provides the DeprecatedLzoTextInputFormat input format used here
create table page_views2_lzo(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
) row format delimited fields terminated by '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
-- load the lzo file into the table
hive> load data local inpath '/home/hadoop/log_data/page_views.dat.lzo' overwrite into table page_views2_lzo;
Loading data to table test.page_views2_lzo
Table test.page_views2_lzo stats: [numFiles=1, numRows=0, totalSize=207749249, rawDataSize=0]
OK
Time taken: 1.009 seconds
# check the file on HDFS
[hadoop@hadoop001 hadoop]$ hadoop fs -du -s -h /user/hive/warehouse/test.db/page_views2_lzo/*
198.1 M  198.1 M  /user/hive/warehouse/test.db/page_views2_lzo/page_views.dat.lzo
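Before querying, a quick way to confirm that the LzopCodec registered in core-site.xml is actually picked up is to let HDFS decompress the file directly; readable rows mean the codec is wired in (a sketch, only printing the first few lines):
# hadoop fs -text decompresses through the configured codecs based on the .lzo suffix
[hadoop@hadoop001 hadoop]$ hadoop fs -text /user/hive/warehouse/test.db/page_views2_lzo/page_views.dat.lzo | head -3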
-- run a count; only 1 map task is launched (the un-indexed .lzo file is not split)
select count(1) from page_views2_lzo;
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 11.88 sec   HDFS Read: 207756318 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 880 msec

Enable output compression
-- enable compression for the query output and use LzopCodec; LzoCodec would produce .lzo_deflate files, which cannot be indexed for splitting.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

-- create an LZO table for the split test from the previous table
create table page_views2_lzo_split
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
as select *  from page_views2_lzo;
# check the output file; the suffix is .lzo
[hadoop@hadoop001 hadoop]$ hadoop fs -du -s -h /user/hive/warehouse/test.db/page_views2_lzo_split/*
196.8 M  196.8 M  /user/hive/warehouse/test.db/page_views2_lzo_split/000000_0.lzo

# build an index for the LZO file, using the hadoop-lzo jar compiled earlier
[hadoop@hadoop001 hadoop]$ hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/test.db/page_views2_lzo_split
# check HDFS again; a .index file has been generated next to the lzo file
[hadoop@hadoop001 hadoop]$ hadoop fs -du -s -h /user/hive/warehouse/test.db/page_views2_lzo_split/*
196.8 M  196.8 M  /user/hive/warehouse/test.db/page_views2_lzo_split/000000_0.lzo
13.9 K  13.9 K  /user/hive/warehouse/test.db/page_views2_lzo_split/000000_0.lzo.index
-- run the count again; 2 map tasks are launched this time
select count(1) from page_views2_lzo_split;
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 28.05 sec   HDFS Read: 206448787 HDFS Write: 58 SUCCESS
Total MapReduce CPU Time Spent: 28 seconds 50 msec
OK
2298975
Time taken: 28.621 seconds, Fetched: 1 row(s)

Therefore, once an index has been built, LZO supports input splits.
Of the compression formats commonly used in big data, only bzip2 supports splitting natively; LZO supports splitting only after an index has been built for the file.
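For many or very large .lzo files, the same hadoop-lzo jar also contains com.hadoop.compression.lzo.DistributedLzoIndexer, which builds the same .index files as a MapReduce job instead of locally; a sketch using the paths from above:
# build LZO indexes as a MapReduce job rather than in the local JVM
[hadoop@hadoop001 hadoop]$ hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.20.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/test.db/page_views2_lzo_split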