Using Pig (0.17.0) on Hadoop (3.2.1) again after a long time


■ Environment
OS: Ubuntu 16 or 18
Hadoop: hadoop-3.2.1.tar.gz
JDK (Java): jdk-8u202-linux-x64.tar.gz

NameNode
192.168.76.216: h-gpu05

DataNodes
192.168.76.210: h-gpu03
192.168.76.210: h-gpu04


$ wget https://archive.apache.org/dist/pig/pig-0.17.0/pig-0.17.0.tar.gz
$ tar zxvf pig-0.17.0.tar.gz

Add the following to .bashrc:


export PIG_HOME=/home/hadoop/pig-0.17.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$PIG_HOME/conf:$HADOOP_INSTALL/etc/hadoop
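
After editing, reload the settings with source ~/.bashrc. PIG_CLASSPATH points Pig at the cluster's Hadoop configuration directory, so scripts run in MapReduce mode against the cluster rather than in local mode ($HADOOP_INSTALL is assumed to have been set earlier, when Hadoop itself was installed).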

Start the history server...


$ mr-jobhistory-daemon.sh start historyserver
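
On Hadoop 3.x this script prints a deprecation warning; mapred --daemon start historyserver is the current equivalent. Pig fetches job statistics from the JobHistory server after each MapReduce job, which is why it is started here.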

Check the version:


hadoop@h-gpu05:~$ pig --version                                                                                                                                                                           
Apache Pig version 0.17.0 (r1797386) 
compiled Jun 02 2017, 15:41:58

This time we generate a data file like the following:


hadoop@h-gpu05:~/qiita/hadoop/pig$ g++ rand_gen_sin.cpp 
hadoop@h-gpu05:~/qiita/hadoop/pig$ ./a.out 100000
hadoop@h-gpu05:~/qiita/hadoop/pig$ head -n 3 random_data.txt 
2019/07/02 03:03:00.000,35293
2019/07/02 06:06:00.000,34155.7
2019/07/02 20:20:00.000,35647.6
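
The source of rand_gen_sin.cpp is not shown, but judging by its name and output it does something along these lines: pick a random hour, repeat it in the minutes field, and emit a sine-shaped value with noise. The sketch below is a hypothetical reconstruction; the constants (base value, amplitude, noise range) and the output file name are guesses chosen to match the sample lines above.


// Hypothetical reconstruction of rand_gen_sin.cpp (the original is not shown).
// Writes argv[1] lines of "2019/07/02 HH:HH:00.000,value" to random_data.txt.
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <random>

int main(int argc, char* argv[]) {
    const long n = (argc > 1) ? std::atol(argv[1]) : 100000;
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> hour(0, 23);                 // random hour of day
    std::uniform_real_distribution<double> noise(-2000.0, 2000.0);  // guessed noise range

    std::FILE* fp = std::fopen("random_data.txt", "w");
    if (fp == nullptr) return 1;
    for (long i = 0; i < n; ++i) {
        const int h = hour(gen);
        // a sine wave over the record index plus noise; the base value and
        // amplitude are guesses matched to the value range seen in the output
        const double v = 33000.0 + 2000.0 * std::sin(2.0 * M_PI * i / 1000.0) + noise(gen);
        // the hour is repeated in the minutes field (03:03, 06:06, ...), and
        // %g drops a zero fraction, giving the mixed "35293" / "34155.7" style
        std::fprintf(fp, "2019/07/02 %02d:%02d:00.000,%g\n", h, h, v);
    }
    std::fclose(fp);
    return 0;
}

Compiled with g++ and run as ./a.out 100000, this produces a file of the same shape as the one above.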

Put the data (random_data.txt) onto HDFS...


hadoop@h-gpu05:~/qiita/hadoop/pig$ hdfs dfs -mkdir pig_input                                                                                                                                              
hadoop@h-gpu05:~/qiita/hadoop/pig$ hdfs dfs -put random_data.txt pig_input/
2020-12-04 16:17:15,437 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
hadoop@h-gpu05:~/qiita/hadoop/pig$ hdfs dfs -ls pig_input/
Found 1 items
-rw-r--r--   3 hadoop supergroup    2602451 2020-12-04 16:17 pig_input/random_data.txt

Try reading it in Pig...


grunt> A = LOAD 'pig_input/random_data.txt' USING PigStorage(',') AS (pdfdata:chararray);
grunt> dump A; 
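
Running pig with no arguments opens the Grunt shell (in MapReduce mode here, thanks to PIG_CLASSPATH). Pig evaluates lazily: the LOAD line by itself does nothing, and only the dump kicks off an actual MapReduce job and prints the resulting tuples to the console.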

Next, run the following script (4.pig)...


-- one input line = one chararray
lines = LOAD 'pig_input/random_data.txt' AS (line:chararray);
-- break each line into a bag of tokens and unnest it into one row per token
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- the classic word count: group identical tokens and count each group
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount;
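
TOKENIZE splits a chararray on whitespace, double quotes, commas, parentheses, and asterisks, so each line of random_data.txt yields exactly three tokens: the date, the time of day, and the value. The "word count" is therefore a count of how often each of those tokens appears.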


hadoop@h-gpu05:~$ pig 4.pig

↓ Example output


(2019/07/02,100000)
(00:00:00.000,4064)
(01:01:00.000,4239)
(02:02:00.000,4159)
(03:03:00.000,4169)
(04:04:00.000,4208)
(05:05:00.000,4269)
(06:06:00.000,4135)
(07:07:00.000,4197)
(08:08:00.000,4217)
(09:09:00.000,4292)
(10:10:00.000,4149)
(11:11:00.000,4094)
(12:12:00.000,4204)
(13:13:00.000,4122)
(14:14:00.000,4222)
(15:15:00.000,4127)
(16:16:00.000,4199)
(17:17:00.000,4177)
(18:18:00.000,4089)
(19:19:00.000,4163)
(20:20:00.000,4130)
(21:21:00.000,4141)
(22:22:00.000,4082)
(23:23:00.000,4152)
2020-12-04 16:43:44,300 [main] INFO  org.apache.pig.Main - Pig script completed in 23 seconds and 850 milliseconds (23850 ms)
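
The counts are consistent with how the data was generated: the date token occurs on every one of the 100,000 lines, and each of the 24 time-of-day tokens occurs roughly 100000 / 24 ≈ 4167 times.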

Now write the result to a file with a script like the one below...


lines = LOAD 'pig_input/random_data.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
STORE wordcount INTO 'pig_output/output';
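
STORE is the file-writing counterpart of DUMP: it writes the relation to the given HDFS path instead of printing it. The path names a directory that must not already exist; it ends up containing one part file per reducer plus a _SUCCESS marker, and with no USING clause the default PigStorage writes tab-separated fields.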

Run it...


hadoop@h-gpu05:~$ pig 4.pig
hadoop@h-gpu05:~$ hdfs dfs -ls pig_output
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2020-12-04 16:51 pig_output/output
hadoop@h-gpu05:~$ hdfs dfs -get pig_output/output
2020-12-04 16:52:15,928 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
hadoop@h-gpu05:~$ ls output/
part-r-00000  _SUCCESS      
hadoop@h-gpu05:~$ head -n 10 output/part-r-00000 
29491   1
29494   2
29498   1
29501   2
29507   1
29508   2
29510   1
29513   1
29514   1
29522   1
hadoop@h-gpu05:~$ tail -n 10 output/part-r-00000 
14:14:00.000    4222
15:15:00.000    4127
16:16:00.000    4199
17:17:00.000    4177
18:18:00.000    4089
19:19:00.000    4163
20:20:00.000    4130
21:21:00.000    4141
22:22:00.000    4082
23:23:00.000    4152
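
This matches the DUMP output from before, though the part file itself is in no particular order (the tail just happens to be the time-of-day tokens here). If a deterministic ordering matters, insert an ORDER wordcount BY group step before the STORE and store that relation instead.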