Apache Hadoop 3.2.1 / Hive 3.1.2 Verification Notes


Notes from verifying this setup in a home VMware environment: a single-node configuration communicating locally on a DHCP network.

Environment

・Red Hat Enterprise Linux release 8.1 (Ootpa)
・PostgreSQL 10.6 (x86_64)
・OpenJDK 1.8.0_242
・Hadoop 3.2.1
・HBase 1.4.13
・Hive 3.1.2

Reference document:
https://hadoop.apache.org/docs/r3.2.1/hadoop-project-dist/hadoop-common/SingleCluster.html

Set the JAVA_HOME environment variable to:
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el8_1.x86_64/jre
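
The single-cluster guide sets this in etc/hadoop/hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el8_1.x86_64/jre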

Edit /etc/hosts as the root user:

127.0.0.1 localhost

Set the hostname:

#hostnamectl set-hostname localhost

Set up public-key SSH authentication between nodes.
This is needed even in a single-node configuration, because the Hadoop scripts connect to localhost over SSH.

$ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$chmod 0600 ~/.ssh/authorized_keys
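
To confirm passphraseless login works, as the guide suggests:

$ssh localhost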

Edit /opt/hadoop-3.2.1/etc/hadoop/core-site.xml.
※Without this setting, startup failed; it appears under the guide's Pseudo-Distributed Operation section.

<configuration>
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>
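
The same Pseudo-Distributed Operation section also sets the replication factor to 1 (a single DataNode cannot hold more than one replica). One way to write etc/hadoop/hdfs-site.xml from the shell:

cat > /opt/hadoop-3.2.1/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
</configuration>
EOF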

Format the filesystem, then start the NameNode and DataNode:

$bin/hdfs namenode -format
$sbin/start-dfs.sh
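
A quick sanity check that the daemons are up (not an original step here, but standard practice):

$jps    # expect NameNode, DataNode, and SecondaryNameNode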

http://localhost:9870/
Check the status here (Hadoop summary, NameNode journal status, and storage usage).

Run the bundled sample MapReduce application (the relative paths input and output resolve under /user/hadoop):

$bin/hdfs dfs -mkdir /user
$bin/hdfs dfs -mkdir /user/hadoop
$bin/hdfs dfs -mkdir input
$bin/hdfs dfs -put etc/hadoop/*.xml input
$bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
$bin/hdfs dfs -get output output
$bin/hdfs dfs -cat output/*

Configuration for running MapReduce on YARN.
etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Start and stop YARN with the following commands:

$sbin/start-yarn.sh
$sbin/stop-yarn.sh
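
With YARN running, a generic check that the NodeManager has registered with the ResourceManager:

$bin/yarn node -list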

For second and subsequent runs, remove the previous output directory first:

$bin/hdfs dfs -rm -r output
$bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
$bin/hdfs dfs -cat output/*

http://localhost:8088/cluster
Check the application status here.

Source code of the grep example:

package org.apache.hadoop.examples;

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.map.RegexMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/* Extracts matching regexes from input files and counts them. */
public class Grep extends Configured implements Tool {
  private Grep() {}                               // singleton

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return 2;
    }

    Path tempDir =
      new Path("grep-temp-"+
          Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    Configuration conf = getConf();
    conf.set(RegexMapper.PATTERN, args[2]);
    if (args.length == 4)
      conf.set(RegexMapper.GROUP, args[3]);

    Job grepJob = Job.getInstance(conf);

    try {
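      // Two chained jobs: "grep-search" counts regex matches into a temp
      // SequenceFile; "grep-sort" then inverts the (match, count) pairs
      // and sorts them by descending count.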

      grepJob.setJobName("grep-search");
      grepJob.setJarByClass(Grep.class);

      FileInputFormat.setInputPaths(grepJob, args[0]);

      grepJob.setMapperClass(RegexMapper.class);

      grepJob.setCombinerClass(LongSumReducer.class);
      grepJob.setReducerClass(LongSumReducer.class);

      FileOutputFormat.setOutputPath(grepJob, tempDir);
      grepJob.setOutputFormatClass(SequenceFileOutputFormat.class);
      grepJob.setOutputKeyClass(Text.class);
      grepJob.setOutputValueClass(LongWritable.class);

      grepJob.waitForCompletion(true);

      Job sortJob = Job.getInstance(conf);
      sortJob.setJobName("grep-sort");
      sortJob.setJarByClass(Grep.class);

      FileInputFormat.setInputPaths(sortJob, tempDir);
      sortJob.setInputFormatClass(SequenceFileInputFormat.class);

      sortJob.setMapperClass(InverseMapper.class);

      sortJob.setNumReduceTasks(1);                 // write a single file
      FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
      sortJob.setSortComparatorClass(          // sort by decreasing freq
        LongWritable.DecreasingComparator.class);

      sortJob.waitForCompletion(true);
    }
    finally {
      FileSystem.get(conf).delete(tempDir, true);
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }

}
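
The optional fourth argument selects a regex capture group (it is passed to RegexMapper.GROUP above). A hypothetical variation of the earlier run that counts only the text captured after "dfs." (output2 is just a fresh output directory):

$bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output2 'dfs\.([a-z.]+)' 1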

Next, the Hive setup.
Reference documents:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-InstallationandConfiguration
https://qiita.com/Esfahan/items/a6f2107876e5a712a72c#%E7%92%B0%E5%A2%83

Download site:
https://downloads.apache.org/hive/hive-3.1.2/

Extract under /opt (to match HIVE_HOME below):

$tar zxvf apache-hive-3.1.2-bin.tar.gz
$cd apache-hive-3.1.2-bin/

Additional environment variables:

export HIVE_HOME=/opt/apache-hive-3.1.2-bin
export PATH=$HIVE_HOME/bin:$PATH

Create the Hive scratch and warehouse directories on HDFS and make them group-writable (absolute paths, per the GettingStarted guide):

$bin/hadoop fs -mkdir /tmp
$bin/hadoop fs -mkdir /user/hive
$bin/hadoop fs -mkdir /user/hive/warehouse
$bin/hdfs dfs -chmod g+w /tmp
$bin/hdfs dfs -chmod g+w /user/hive/warehouse

Next, install HBase (extracted under /opt), which these notes treat as a prerequisite for running Hive.

$wget http://ftp.jaist.ac.jp/pub/apache/hbase/1.4.13/hbase-1.4.13-bin.tar.gz
$tar zxvf hbase-1.4.13-bin.tar.gz
$export HBASE_HOME=/opt/hbase-1.4.13
$export PATH=$HBASE_HOME/bin:$PATH

Additional fixes are needed, since it does not run as-is. Checking the version and starting HBase shows SLF4J multiple-binding warnings:

$HBASE_HOME/bin/hbase version
HBase 1.4.13
Source code repository git://Sakthis-MacBook-Pro-2.local/Users/sakthi/dev/hbase revision=38bf65a22b7e9320f07aeb27677e4533b9a77ef4
Compiled by sakthi on Sun Feb 23 02:06:36 PST 2020
From source with checksum cfb98e5fbeeca2068278ea88175d751b
$HBASE_HOME/bin/start-hbase.sh
running master, logging to /opt/hbase-1.4.13/logs/hbase-hadoop-master-localhost.localdomain.out
OpenJDK 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hbase-1.4.13/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Steps to resolve the library errors above and verify the Hive installation (run in $HIVE_HOME). Hive 3.1.2 bundles Guava 19.0, which is incompatible with Hadoop 3.2.1, so it is swapped for Hadoop's Guava 27.0-jre:

$find ./ -name '*hive-jdbc-*-standalone.jar'
./jdbc/hive-jdbc-3.1.2-standalone.jar
$find ./ -name '*log4j-slf4j-impl-*.jar'
./lib/log4j-slf4j-impl-2.10.0.jar
$rm -f ./jdbc/hive-jdbc-3.1.2-standalone.jar
$rm -f ./lib/log4j-slf4j-impl-2.10.0.jar
$mv /opt/hbase-1.4.13/lib/slf4j-log4j12-1.7.25.jar /opt/hbase-1.4.13/lib/slf4j-log4j12-1.7.25.jar.org
$cd /opt/apache-hive-3.1.2-bin/lib
$mv guava-19.0.jar guava-19.0.jar.org
$cp /opt/hadoop-3.2.1/share/hadoop/common/lib/guava-27.0-jre.jar .
$bin/hive -H
Hive Session ID = e4d480a4-d426-4471-b235-08932783ec66
usage: hive
 -d,--define <key=value>          Variable substitution to apply to Hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to Hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)
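
For example, the -e option listed above runs a single statement directly from the shell:

$bin/hive -e 'show databases;'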

Metastore configuration.
PostgreSQL installation steps are covered at
https://qiita.com/thashi/items/98bcab8e7d6c32e5632e
Download the latest JDBC driver from the driver download site and copy
postgresql-42.2.10.jar (the latest at the time of writing) into $HIVE_HOME/lib.
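
The connection URL below expects a database named metastore to exist before schematool runs; a minimal sketch, assuming the default postgres superuser:

$sudo -u postgres createdb metastore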

Edit $HIVE_HOME/conf/hive-site.xml:

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://localhost:5432/metastore</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>postgres</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>postgres</value>
    </property>

    <property>
        <name>org.jpox.autoCreateSchema</name>
        <value>true</value>
    </property>

    <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>false</value>
    </property>

</configuration>
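
Before initializing the schema, it is worth confirming that the database in the connection URL is reachable; a generic check with psql:

$psql -h localhost -U postgres -d metastore -c '\conninfo'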

Finally, verify with a sample session:

$bin/schematool -dbType postgres -initSchema
$bin/hiveserver2
$bin/beeline -u jdbc:hive2://
0: jdbc:hive2://> !exit
$bin/hive
hive> create database testdb;
hive> use testdb;
hive> create table items(name string,price int) row format delimited fields terminated by ',';
hive> desc items;
OK
name                    string
price                   int
Time taken: 0.151 seconds, Fetched: 2 row(s)

Contents of sample.csv:

りんご,100
バナナ,340
みかん,200
いちじく,400
hive> load data local inpath "/home/hadoop/sample.csv" into table items;
Loading data to table testdb.items
hive> select * from items;
OK
りんご  100
バナナ  340
みかん  200
いちじく        400
hive> insert overwrite directory '/output/'
    > select * from items;
$../hadoop-3.2.1/bin/hdfs dfs -ls /output
Found 1 items
-rw-r--r--   3 hadoop supergroup         63 2020-03-01 16:20 /output/000000_0
$../hadoop-3.2.1/bin/hdfs dfs -cat /output/000000_0
りんご100
バナナ340
みかん200
いちじく400

The fields appear run together because INSERT OVERWRITE DIRECTORY uses the non-printing Ctrl-A (\001) character as its default field separator.
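
To write comma-separated output instead, INSERT OVERWRITE DIRECTORY accepts a row format clause (Hive 0.11 and later); a sketch using the -e option, with /output_csv as an illustrative target:

$bin/hive -e "insert overwrite directory '/output_csv' row format delimited fields terminated by ',' select * from testdb.items"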