Using Hive on CDH5 (Embedded Mode)


Introduction

This article describes how to use Hive in Embedded Mode on CDH5.

Environment

  • CentOS 6.5
  • CDH 5
  • Hive 0.12.0-cdh5.1.3
  • jdk 1.7.0_55

Configuration

Hostname IP address ResourceManager Namenode NodeManager Datanode JobHistoryServer
hadoop-master 192.168.122.101 - -
hadoop-master2 192.168.122.102 - - -
hadoop-slave 192.168.122.111 - - -
hadoop-slave2 192.168.122.112 - - -
hadoop-slave3 192.168.122.113 - - -
hadoop-client 192.168.122.201 - - - - -

Note: For how to build the Hadoop cluster, see "Building a Hadoop cluster with CDH5" (CDH5でhadoopのクラスタを構築する).

Hive setup

Note: Hive is installed on hadoop-client.

  • Install Hive
$ sudo yum install hive
  • Create a directory for Hive on HDFS.
$ sudo -u hdfs hadoop fs -mkdir /user/hive
$ sudo -u hdfs hadoop fs -chown hive:hadoop /user/hive
$ sudo -u hdfs hadoop fs -ls /user/
Found 3 items
drwxr-xr-x   - hdfs   hadoop          0 2014-09-20 08:09 /user/hdfs
drwxrwxrwt   - mapred hadoop          0 2014-09-20 05:39 /user/history
drwxr-xr-x   - hive   hadoop          0 2014-10-06 13:34 /user/hive
  • Adjust the ownership of the local directory
$ sudo chown hive /var/lib/hive
$ ls -ld /var/lib/hive
drwxr-xr-x 3 hive root 4096 Oct  6 13:34 /var/lib/hive

Preparing the data

This example uses Japanese postal code data.

$ cd /tmp
$ curl -O http://www.post.japanpost.jp/zipcode/dl/roman/ken_all_rome.zip
$ unzip ken_all_rome.zip
$ nkf -S -w ken_all_rome/KEN_ALL_ROME.CSV > ken_all_rome/KEN_ALL_ROME.UTF8.CSV
$ head ken_all_rome/KEN_ALL_ROME.UTF8.CSV
"0600000","北海道","札幌市 中央区","以下に掲載がない場合","HOKKAIDO","SAPPORO SHI CHUO KU","IKANIKEISAIGANAIBAAI"
"0640941","北海道","札幌市 中央区","旭ケ丘","HOKKAIDO","SAPPORO SHI CHUO KU","ASAHIGAOKA"
"0600041","北海道","札幌市 中央区","大通東","HOKKAIDO","SAPPORO SHI CHUO KU","ODORIHIGASHI"
"0600042","北海道","札幌市 中央区","大通西(1~19丁目)","HOKKAIDO","SAPPORO SHI CHUO KU","ODORINISHI(1-19-CHOME)"
"0640820","北海道","札幌市 中央区","大通西(20~28丁目)","HOKKAIDO","SAPPORO SHI CHUO KU","ODORINISHI(20-28-CHOME)"
"0600031","北海道","札幌市 中央区","北一条東","HOKKAIDO","SAPPORO SHI CHUO KU","KITA1-JOHIGASHI"
"0600001","北海道","札幌市 中央区","北一条西(1~19丁目)","HOKKAIDO","SAPPORO SHI CHUO KU","KITA1-JONISHI(1-19-CHOME)"
"0640821","北海道","札幌市 中央区","北一条西(20~28丁目)","HOKKAIDO","SAPPORO SHI CHUO KU","KITA1-JONISHI(20-28-CHOME)"
"0600032","北海道","札幌市 中央区","北二条東","HOKKAIDO","SAPPORO SHI CHUO KU","KITA2-JOHIGASHI"
"0600002","北海道","札幌市 中央区","北二条西(1~19丁目)","HOKKAIDO","SAPPORO SHI CHUO KU","KITA2-JONISHI(1-19-CHOME)"

Note: The postal code data is encoded in Shift_JIS, which is awkward to work with as-is, so it is converted to UTF-8 before use.
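Before loading the file into a comma-delimited text table, it can help to confirm that every line splits into the expected number of fields, because a plain comma split also treats a comma inside a quoted field as a separator. The following is a minimal sketch under that assumption; `count_bad_rows` is a hypothetical helper name, and the expected count of 7 is taken from the sample lines shown above.

```shell
# Print the number of lines whose comma-separated field count differs
# from the expected value. This is a plain split on ',' and does not
# understand CSV quoting, which is exactly the behavior being checked.
# Usage: count_bad_rows FILE EXPECTED_FIELDS
count_bad_rows() {
    awk -F',' -v n="$2" 'NF != n { bad++ } END { print bad + 0 }' "$1"
}

# Path used in this article; nothing is printed if the file is absent.
csv=/tmp/ken_all_rome/KEN_ALL_ROME.UTF8.CSV
if [ -f "$csv" ]; then
    count_bad_rows "$csv" 7
fi
```

A result of 0 suggests every line has seven comma-separated fields; any other value means some rows would be split into misaligned columns.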

Loading the data

  • Create the database and table
$ cd /tmp
$ sudo -u hive hive
hive> create database sample;
OK

hive> show databases;
OK
default
sample
Time taken: 5.021 seconds, Fetched: 2 row(s)

hive> use sample;
OK

hive> create table zip_all (
    > zip string,
    > pref string,
    > city string,
    > town string,
    > pref_r string,
    > city_r string,
    > town_r string
    > )
    > row format delimited
    > fields terminated by ','
    > lines terminated by '\n'
    > ;

hive> show tables;
OK
zip_all
Time taken: 0.022 seconds, Fetched: 1 row(s)

hive> load data local inpath '/tmp/ken_all_rome/KEN_ALL_ROME.UTF8.CSV' into table zip_all;
Copying data from file:/tmp/ken_all_rome/KEN_ALL_ROME.UTF8.CSV
Copying file: file:/tmp/ken_all_rome/KEN_ALL_ROME.UTF8.CSV
Loading data to table sample.zip_all
Table sample.zip_all stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 12527284, raw_data_size: 0]
OK
Time taken: 0.817 seconds

Querying the data

hive> select count(*) from zip_all;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
14/10/06 15:19:14 WARN conf.Configuration: file:/tmp/hive/hive_2014-10-06_15-19-10_668_7801925482149728044-1/-local-10002/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/10/06 15:19:14 WARN conf.Configuration: file:/tmp/hive/hive_2014-10-06_15-19-10_668_7801925482149728044-1/-local-10002/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
14/10/06 15:19:14 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/10/06 15:19:14 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/10/06 15:19:14 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/10/06 15:19:14 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/10/06 15:19:14 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/10/06 15:19:14 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/10/06 15:19:14 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
Execution log at: /tmp/hive/hive_20141006151919_0ae7a324-9f85-4b3f-8036-61e18070c4bd.log
Job running in-process (local Hadoop)
2014-10-06 15:19:18,218 null map = 100%,  reduce = 0%
2014-10-06 15:19:19,226 null map = 100%,  reduce = 100%
Ended Job = job_local1780116823_0001
Execution completed successfully
MapredLocal task succeeded
OK
123699
Time taken: 9.292 seconds, Fetched: 1 row(s)
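The count returned above can be cross-checked against the local file, since a table declared with lines terminated by '\n' yields one row per input line. The following is a minimal sketch to run in a separate shell, assuming the converted CSV is still at the path used in this article; `count_lines` is a hypothetical helper name, not part of Hive or Hadoop.

```shell
# Print the number of newline-terminated lines in a file.
# Usage: count_lines FILE
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Path used in this article; nothing is printed if the file is absent.
csv=/tmp/ken_all_rome/KEN_ALL_ROME.UTF8.CSV
if [ -f "$csv" ]; then
    count_lines "$csv"
fi
```

If the last line of the file has no trailing newline, `wc -l` reports one less than the number of rows Hive reads.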

hive> select * from zip_all where pref_r = '"TOKYO TO"' and city_r = '"SHIBUYA KU"' and town_r = '"SHIBUYA"';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
14/10/06 15:22:59 WARN conf.Configuration: file:/tmp/hive/hive_2014-10-06_15-22-56_612_8533133809152280651-1/-local-10002/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/10/06 15:22:59 WARN conf.Configuration: file:/tmp/hive/hive_2014-10-06_15-22-56_612_8533133809152280651-1/-local-10002/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
14/10/06 15:22:59 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/10/06 15:22:59 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/10/06 15:22:59 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/10/06 15:22:59 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/10/06 15:22:59 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/10/06 15:22:59 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/10/06 15:22:59 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
Execution log at: /tmp/hive/hive_20141006152222_ad9fa2f4-76da-4d3b-b083-6d84a7c48ab8.log
Job running in-process (local Hadoop)
2014-10-06 15:23:03,607 null map = 0%,  reduce = 0%
2014-10-06 15:23:04,617 null map = 100%,  reduce = 0%
Ended Job = job_local2040699265_0001
Execution completed successfully
MapredLocal task succeeded
OK
"1500002"       "東京都"        "渋谷区"        "渋谷"  "TOKYO TO"      "SHIBUYA KU"    "SHIBUYA"
Time taken: 8.706 seconds, Fetched: 1 row(s)
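The query above has to include the double quotes in its string literals because the table splits each line on ',' only, so the quotes from the CSV are stored as part of each value. One possible alternative is to strip the quotes before loading. The following is a minimal sketch, assuming no field contains a comma or an embedded double quote; `strip_quotes` and the output file name are hypothetical.

```shell
# Remove the double quotes that wrap each comma-separated field.
# Carriage returns are dropped first so a CRLF line ending does not
# hide the closing quote from the end-of-line match.
# Reads stdin and writes stdout.
strip_quotes() {
    tr -d '\r' | sed -e 's/^"//' -e 's/"$//' -e 's/","/,/g'
}

# Path used in this article; the output name is only an illustration.
csv=/tmp/ken_all_rome/KEN_ALL_ROME.UTF8.CSV
if [ -f "$csv" ]; then
    strip_quotes < "$csv" > /tmp/ken_all_rome/KEN_ALL_ROME.NOQUOTE.CSV
fi
```

With a file prepared this way, the condition could be written as pref_r = 'TOKYO TO' without the embedded quotes.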

References