HiveQL ひっかけ問題10本ノック!!(1問 ~ 5問目)


はじめに

Hiveとは、HiveqlというSQLライクなドメイン固有言語を提供するクエリエンジンで、Hadoopというデータ処理基盤で動作するものです。この記事では、Hiveqlを初めて触る人向けに10本ノックを作ってみました。

10本ノックの内容は、私が業務で初めてHiveQLを触った際にハマった内容を元に作成したひっかけ問題です。
(私はHive初心者ですので、間違いがあればご指摘下さい)

環境構築

こちらの記事を参考に環境構築すると、Dockerでカンタンにbeelineが利用できるようになります。

参考記事: https://qiita.com/letusfly85/items/b4e5fc67c9566765b794

(記事作成者の方、ありがとうございます!!)

参考記事からdocker-compose.ymlを入手して以下のコマンドを打つとHiveQLの実行環境が得られます。

$ docker-compose -f docker-compose.yml up -d
$ docker exec -it docker-compose_hiveserver2_1 /bin/bash
$ beeline -u jdbc:hive2://localhost:10000 hive hive org.apache.hive.jdbc.HiveDriver

※ この記事の10本ノックは上記の環境でクエリを実行することを前提に作りました。上記の環境以外でクエリを実行すると、10本ノックの解答例が記載通りに動かない場合があります。

1. INSERT文

それでは、第1問目です。

問題

次のようなテーブルが存在したとします。

CREATE TEMPORARY TABLE cols(
    col3    INT,
    col4    INT,
    col2    INT,
    col1    INT
);

このとき、SELECT文の結果が以下のようになるように、INSERT文の???の部分を埋めてください。

SELECT文
0: jdbc:hive2://localhost:10000> SELECT * FROM cols;
+------------+------------+------------+------------+--+
| cols.col3  | cols.col4  | cols.col2  | cols.col1  |
+------------+------------+------------+------------+--+
| 3          | 4          | 2          | 1          |
+------------+------------+------------+------------+--+
INSERT文(???の部分を埋めて下さい)
INSERT INTO TABLE cols
SELECT
    ??? AS col1,
    ??? AS col2,
    ??? AS col3,
    ??? AS col4
FROM (SELECT 0) dummyTable;

誤答例

以下のように、順番通りに値を1, 2, 3, 4と指定してしまうと、正しい結果が得られません。

誤答例
INSERT INTO TABLE cols
SELECT
    1 AS col1,
    2 AS col2,
    3 AS col3,
    4 AS col4
FROM (SELECT 0) dummyTable;
誤ったINSERT文を実行した結果
0: jdbc:hive2://localhost:10000> SELECT * FROM cols;
+------------+------------+------------+------------+--+
| cols.col3  | cols.col4  | cols.col2  | cols.col1  |
+------------+------------+------------+------------+--+
| 1          | 2          | 3          | 4          |
+------------+------------+------------+------------+--+
1 row selected (0.074 seconds)

解答例

HiveQLでINSERT...SELECTする場合は、SELECTする順番に従って、INSERTが行われます。 ASでカラム名を指定しても、無視されます。ですので、値を3, 4, 2, 1の順番でSELECTする必要があります。

解答例
INSERT INTO TABLE cols
SELECT
    3 AS col1,
    4 AS col2,
    2 AS col3,
    1 AS col4
FROM (SELECT 0) dummyTable;

2. NULLの比較

問題

次のようなtable1table2があったとします。

CREATE TEMPORARY TABLE table1(
    col1    INT,
    col2    INT,
    col3    INT
);

INSERT INTO table1 VALUES (1, 2, 3), (11, 22, NULL);

CREATE TEMPORARY TABLE table2(
    col1    INT,
    col2    INT,
    col3    INT
);

INSERT INTO table2 VALUES (1, 2, -3), (11, 22, NULL);

SELECT * FROM table1;
SELECT * FROM table2;

-- SELECT文の実行結果
-- 0: jdbc:hive2://localhost:10000> SELECT * FROM table1;
-- +--------------+--------------+--------------+--+
-- | table1.col1  | table1.col2  | table1.col3  |
-- +--------------+--------------+--------------+--+
-- | 1            | 2            | 3            |
-- | 11           | 22           | NULL         |
-- +--------------+--------------+--------------+--+
-- 2 rows selected (0.129 seconds)
-- 0: jdbc:hive2://localhost:10000> SELECT * FROM table2;
-- +--------------+--------------+--------------+--+
-- | table2.col1  | table2.col2  | table2.col3  |
-- +--------------+--------------+--------------+--+
-- | 1            | 2            | -3           |
-- | 11           | 22           | NULL         |
-- +--------------+--------------+--------------+--+
-- 2 rows selected (0.041 seconds)

このとき、table1table2をJOINして、以下のような結果になるSELECT文を作成してください。ただし、JOINする際は、table1table2col1, col2, col3の値が一致しているか事を比較してください。

欲しい結果
+--------------+--------------+--------------+--------------+--------------+--------------+--+
| table1.col1  | table1.col2  | table1.col3  | table2.col1  | table2.col2  | table2.col3  |
+--------------+--------------+--------------+--------------+--------------+--------------+--+
| 11           | 22           | NULL         | 11           | 22           | NULL         |
+--------------+--------------+--------------+--------------+--------------+--------------+--+

ヒント: 作成するSELECT文の完成イメージは以下のようになります。 ???の部分を穴埋めしてください。

SELECT文(???の部分を埋めてください)
SELECT
    *
FROM table1
     JOIN table2 ON
     ???
     ???
     ???

誤答例

NULLの値を含むcol3を比較する際、 =を比較に使用するとJOINに失敗します。

誤答例
SELECT
    *
FROM table1
     JOIN table2 ON
     table1.col1 = table2.col1 AND
     table1.col2 = table2.col2 AND
     table1.col3 = table2.col3
;
誤答例(実行結果)
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000>     *
0: jdbc:hive2://localhost:10000> FROM table1
0: jdbc:hive2://localhost:10000>      JOIN table2 ON
0: jdbc:hive2://localhost:10000>      table1.col1 = table2.col1 AND
0: jdbc:hive2://localhost:10000>      table1.col2 = table2.col2 AND
0: jdbc:hive2://localhost:10000>      table1.col3 = table2.col3
0: jdbc:hive2://localhost:10000> ;
INFO  : Execution completed successfully
INFO  : MapredLocal task succeeded
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : Job running in-process (local Hadoop)
INFO  : 2019-11-24 06:00:52,674 Stage-3 map = 100%,  reduce = 0%
INFO  : Ended Job = job_local671600389_0025
+--------------+--------------+--------------+--------------+--------------+--------------+--+
| table1.col1  | table1.col2  | table1.col3  | table2.col1  | table2.col2  | table2.col3  |
+--------------+--------------+--------------+--------------+--------------+--------------+--+
+--------------+--------------+--------------+--------------+--------------+--------------+--+

解答例

値がNULLの場合はISNULL関数またはIS NULLで比較してください。

解答例
SELECT
    *
FROM table1
     JOIN table2 ON
     table1.col1 = table2.col1 AND
     table1.col2 = table2.col2
WHERE (table1.col3 = table2.col3) OR (ISNULL(table1.col3) AND ISNULL(table2.col3))
;
-- WHERE句は以下のようにしてもOK
-- WHERE (table1.col3 = table2.col3) OR (table1.col3 IS NULL AND table2.col3 IS NULL)
解答例(実行結果)
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000>     *
0: jdbc:hive2://localhost:10000> FROM table1
0: jdbc:hive2://localhost:10000>      JOIN table2 ON
0: jdbc:hive2://localhost:10000>      table1.col1 = table2.col1 AND
0: jdbc:hive2://localhost:10000>      table1.col2 = table2.col2
0: jdbc:hive2://localhost:10000> WHERE (table1.col3 = table2.col3) OR (ISNULL(table1.col3) AND ISNULL(table2.col3))
0: jdbc:hive2://localhost:10000> ;
INFO  : Execution completed successfully
INFO  : MapredLocal task succeeded
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : Job running in-process (local Hadoop)
INFO  : 2019-11-24 06:03:46,862 Stage-3 map = 100%,  reduce = 0%
INFO  : Ended Job = job_local1646480365_0026
+--------------+--------------+--------------+--------------+--------------+--------------+--+
| table1.col1  | table1.col2  | table1.col3  | table2.col1  | table2.col2  | table2.col3  |
+--------------+--------------+--------------+--------------+--------------+--------------+--+
| 11           | 22           | NULL         | 11           | 22           | NULL         |
+--------------+--------------+--------------+--------------+--------------+--------------+--+
1 row selected (4.912 seconds)

3. UNION ALL

問題

次のtable1table2UNION ALLで統合するSELECT文を作成してください。

CREATE TEMPORARY TABLE table1 AS SELECT 1, 2;
CREATE TEMPORARY TABLE table2 AS SELECT 3, 4;

SELECT文は、実行結果が以下のようになるように作成してください。

欲しい結果
+-------------+-------------+--+
| table3._c0  | table3._c1  |
+-------------+-------------+--+
| 3           | 4           |
| 1           | 2           |
+-------------+-------------+--+

誤答例

UNION ALLはサブクエリの中でしか使用できませんので注意です。

誤答例
-- 誤答例1
(SELECT * FROM table1) UNION ALL (SELECT * FROM table2);

-- 誤答例2
SELECT
    *
FROM
    table1
    UNION ALL
    table2
;
誤答例(実行結果)
0: jdbc:hive2://localhost:10000> (SELECT * FROM table1) UNION ALL (SELECT * FROM table2);
Error: Error while compiling statement: FAILED: ParseException line 1:0 cannot recognize input near '(' 'SELECT' '*' (state=42000,code=40000)
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000>     *
0: jdbc:hive2://localhost:10000> FROM
0: jdbc:hive2://localhost:10000>     table1
0: jdbc:hive2://localhost:10000>     UNION ALL
0: jdbc:hive2://localhost:10000>     table2
0: jdbc:hive2://localhost:10000> ;
Error: Error while compiling statement: FAILED: ParseException line 6:4 cannot recognize input near 'table2' '<EOF>' '<EOF>' in select clause (state=42000,code=40000)

解答例

解答例
SELECT
    *
FROM(
    SELECT * FROM table1
    UNION ALL
    SELECT * FROM table2
) table3;
解答例(実行結果)
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000>     *
0: jdbc:hive2://localhost:10000> FROM(
0: jdbc:hive2://localhost:10000>     SELECT * FROM table1
0: jdbc:hive2://localhost:10000>     UNION ALL
0: jdbc:hive2://localhost:10000>     SELECT * FROM table2
0: jdbc:hive2://localhost:10000> ) table3;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : Job running in-process (local Hadoop)
INFO  : 2019-11-24 06:13:21,578 Stage-1 map = 100%,  reduce = 0%
INFO  : Ended Job = job_local26487554_0031
+-------------+-------------+--+
| table3._c0  | table3._c1  |
+-------------+-------------+--+
| 3           | 4           |
| 1           | 2           |
+-------------+-------------+--+
2 rows selected (1.299 seconds)

4. リテラルのINSERT

問題

次のtblというテーブルに18446744073709001000という数値をINSERTしてください。

CREATE TEMPORARY TABLE tbl (col DECIMAL(38, 18));

ただし、INSERT文は以下の形式で作成してください。

SELECT文(???の部分を埋めてください)
INSERT INTO TABLE tbl SELECT ??? FROM (SELECT 0) dummyTable;

誤答例

18446744073709001000をそのままINSERTすると、1000の位の数が0になります。

誤答例
INSERT INTO TABLE tbl SELECT 18446744073709001000 FROM (SELECT 0) dummyTable;
誤答例(実行結果)
0: jdbc:hive2://localhost:10000> INSERT INTO TABLE tbl SELECT 18446744073709001000 FROM (SELECT 0) dummyTable;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : Job running in-process (local Hadoop)
INFO  : 2019-11-24 06:46:40,619 Stage-1 map = 100%,  reduce = 0%
INFO  : Ended Job = job_local1838873872_0037
INFO  : Stage-4 is selected by condition resolver.
INFO  : Stage-3 is filtered out by condition resolver.
INFO  : Stage-5 is filtered out by condition resolver.
INFO  : Moving data to: file:/tmp/hive/anonymous/cf5efc27-4200-4102-bec7-74e48029f052/_tmp_space.db/0fc85315-e03a-453a-981f-f68aed834083/.hive-staging_hive_2019-11-24_06-46-39_451_9085844297384824125-10/-ext-10000 from file:/tmp/hive/anonymous/cf5efc27-4200-4102-bec7-74e48029f052/_tmp_space.db/0fc85315-e03a-453a-981f-f68aed834083/.hive-staging_hive_2019-11-24_06-46-39_451_9085844297384824125-10/-ext-10002
INFO  : Loading data to table default.tbl from file:/tmp/hive/anonymous/cf5efc27-4200-4102-bec7-74e48029f052/_tmp_space.db/0fc85315-e03a-453a-981f-f68aed834083/.hive-staging_hive_2019-11-24_06-46-39_451_9085844297384824125-10/-ext-10000
INFO  : Table default.tbl stats: [numFiles=1, numRows=1, totalSize=21, rawDataSize=20]
No rows affected (1.223 seconds)
0: jdbc:hive2://localhost:10000>
0: jdbc:hive2://localhost:10000> select * from tbl;
+-----------------------+--+
|        tbl.col        |
+-----------------------+--+
| 18446744073709000000  |
+-----------------------+--+
1 row selected (0.136 seconds)

解答例

18446744073709001000を挿入する際は末尾にBDを着けて18446744073709001000BDとする必要があります。

解答例
INSERT INTO TABLE tbl SELECT 18446744073709001000BD FROM (SELECT 0) dummyTable;
解答例(実行結果)
0: jdbc:hive2://localhost:10000> INSERT INTO TABLE tbl SELECT 18446744073709001000BD FROM (SELECT 0) dummyTable;
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : Job running in-process (local Hadoop)
INFO  : 2019-11-24 06:51:31,349 Stage-1 map = 100%,  reduce = 0%
INFO  : Ended Job = job_local1036843654_0039
INFO  : Stage-4 is selected by condition resolver.
INFO  : Stage-3 is filtered out by condition resolver.
INFO  : Stage-5 is filtered out by condition resolver.
INFO  : Moving data to: file:/tmp/hive/anonymous/cfb993aa-a328-481a-8e6c-be26cc623ec5/_tmp_space.db/fa20fbb0-8bfd-4b16-b137-142ed0a85316/.hive-staging_hive_2019-11-24_06-51-30_174_7370544646674941832-10/-ext-10000 from file:/tmp/hive/anonymous/cfb993aa-a328-481a-8e6c-be26cc623ec5/_tmp_space.db/fa20fbb0-8bfd-4b16-b137-142ed0a85316/.hive-staging_hive_2019-11-24_06-51-30_174_7370544646674941832-10/-ext-10002
INFO  : Loading data to table default.tbl from file:/tmp/hive/anonymous/cfb993aa-a328-481a-8e6c-be26cc623ec5/_tmp_space.db/fa20fbb0-8bfd-4b16-b137-142ed0a85316/.hive-staging_hive_2019-11-24_06-51-30_174_7370544646674941832-10/-ext-10000
INFO  : Table default.tbl stats: [numFiles=1, numRows=1, totalSize=21, rawDataSize=20]
No rows affected (1.243 seconds)
0: jdbc:hive2://localhost:10000> select * from tbl;
+-----------------------+--+
|        tbl.col        |
+-----------------------+--+
| 18446744073709001000  |
+-----------------------+--+
1 row selected (0.167 seconds)

因みにINSERT INTO tbl VALUESの形式でINSERTしても、正しくINSERTされます。

INSERT INTO tbl VALUES(18446744073709001000);
0: jdbc:hive2://localhost:10000> INSERT INTO tbl VALUES(18446744073709001000);
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : Job running in-process (local Hadoop)
INFO  : 2019-11-24 06:48:56,695 Stage-1 map = 100%,  reduce = 0%
INFO  : Ended Job = job_local1313902307_0038
INFO  : Stage-4 is selected by condition resolver.
INFO  : Stage-3 is filtered out by condition resolver.
INFO  : Stage-5 is filtered out by condition resolver.
INFO  : Moving data to: file:/tmp/hive/anonymous/5d456b22-4585-4e92-9ae5-f2f362241157/_tmp_space.db/371e970a-13cb-4c61-90a6-1c52ee49182c/.hive-staging_hive_2019-11-24_06-48-55_528_1421126263335840065-10/-ext-10000 from file:/tmp/hive/anonymous/5d456b22-4585-4e92-9ae5-f2f362241157/_tmp_space.db/371e970a-13cb-4c61-90a6-1c52ee49182c/.hive-staging_hive_2019-11-24_06-48-55_528_1421126263335840065-10/-ext-10002
INFO  : Loading data to table default.tbl from file:/tmp/hive/anonymous/5d456b22-4585-4e92-9ae5-f2f362241157/_tmp_space.db/371e970a-13cb-4c61-90a6-1c52ee49182c/.hive-staging_hive_2019-11-24_06-48-55_528_1421126263335840065-10/-ext-10000
INFO  : Table default.tbl stats: [numFiles=1, numRows=1, totalSize=21, rawDataSize=20]
No rows affected (1.211 seconds)
0: jdbc:hive2://localhost:10000> select * from tbl;
+-----------------------+--+
|        tbl.col        |
+-----------------------+--+
| 18446744073709001000  |
+-----------------------+--+
1 row selected (0.145 seconds)

5. SELECT文で配列展開

問題

次のテーブルのarrEXPLODE関数の引数として渡すと、配列の各要素がレコードとして展開されます。

tblテーブル
CREATE TEMPORARY TABLE tbl AS SELECT ARRAY('a', 'b', 'c') AS arr;
tbl.arrをEXPLODE関数に渡した結果
0: jdbc:hive2://localhost:10000> SELECT EXPLODE(arr) AS element FROM tbl;
+----------+--+
| element  |
+----------+--+
| a        |
| b        |
| c        |
+----------+--+
3 rows selected (0.044 seconds)

このSELECT EXPLODE(arr) AS element FROM tbl;というクエリを改修して、SELECT文の実行結果が以下のようになるようにしてください。

欲しい結果
+----------+-------+--+
| element  | hoge  |
+----------+-------+--+
| a        | hoge  |
| b        | hoge  |
| c        | hoge  |
+----------+-------+--+

誤答例

SELECT句の中で使えるUDTF(この場合EXPLODE関数)は一つのみです。ですので、以下のように、カンマ区切りでそのままEXPLODEhogeを並べるとエラーが発生します。

誤答例
SELECT EXPLODE(arr) AS element, 'hoge' AS hoge FROM tbl;
誤答例(実行結果)
0: jdbc:hive2://localhost:10000> SELECT EXPLODE(arr) AS element, 'hoge' AS hoge FROM tbl;
Error: Error while compiling statement: FAILED: SemanticException 1:32 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'hoge' (state=42000,code=40000)

解答例

LATERAL VIEWを使用すると、SELECT句の中に、配列を展開したものと他のカラムを記述することができます。

解答例
SELECT
    element,
    'hoge' AS hoge
FROM
    tbl
LATERAL VIEW
    EXPLODE(arr) arr AS element;
解答例(実行結果)
0: jdbc:hive2://localhost:10000> SELECT
0: jdbc:hive2://localhost:10000>     element,
0: jdbc:hive2://localhost:10000>     'hoge' AS hoge
0: jdbc:hive2://localhost:10000> FROM
0: jdbc:hive2://localhost:10000>     tbl
0: jdbc:hive2://localhost:10000> LATERAL VIEW
0: jdbc:hive2://localhost:10000>     EXPLODE(arr) arr AS element;
+----------+-------+--+
| element  | hoge  |
+----------+-------+--+
| a        | hoge  |
| b        | hoge  |
| c        | hoge  |
+----------+-------+--+

終わりに

閲覧ありがとうございました!6~10問目は近日公開予定です。