Trying out Apache Impala. 2019-09-11

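Overview

Now that Impala is usable after the earlier post "Setting up Apache Impala on Ubuntu 18", I'll try running SELECT and a few other queries against it.

Reference: https://qiita.com/t-yotsu/items/747046ccb1ef363d1e3c
(More than a reference, really: this post is my own notes from trying out what that article describes.)

A first SELECT

I'll run a SELECT against a CSV file.

This assumes Impala has already been built as described in "Setting up Apache Impala on Ubuntu 18".

First, put the CSV somewhere Impala can see it, in this case a directory on HDFS.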

shell

% hdfs dfs -mkdir /test_data
% cat ./test.csv
a,1
b,2
c,3
d,4
e,5
% hdfs dfs -put ./test.csv /test_data
% hdfs dfs -ls /test_data
Found 1 items
-rw-r--r--   3 vagrant supergroup         20 2019-09-11 03:54 /test_data/test.csv
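
If the file needs to be recreated, the five rows can also be generated with a few lines of Python. This is only a sketch and not part of the original session: the function name `write_test_csv` is made up for illustration, and the file is written to a temporary directory rather than uploaded to HDFS. The size check mirrors the 20 bytes reported by `hdfs dfs -ls`.

```python
import os
import tempfile

# The five rows shown by `cat ./test.csv` above.
ROWS = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]


def write_test_csv(path):
    """Write ROWS as comma-separated lines and return the file size in bytes."""
    with open(path, "w", newline="\n") as f:
        for c0, c1 in ROWS:
            f.write(f"{c0},{c1}\n")
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp:
    size = write_test_csv(os.path.join(tmp, "test.csv"))
    print(size)  # 20 bytes, the size reported by hdfs dfs -ls
```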

Once the file is in place, create a table in impala-shell and query it.
(The output has been lightly edited for readability.)

impala-shell

[localhost:21000] default> create external table test_tbl
(c0 string, c1 int)
row format delimited fields terminated by ','
location '/test_data';
+-------------------------+
| summary                 |
+-------------------------+
| Table has been created. |
+-------------------------+
Fetched 1 row(s) in 0.25s

[localhost:21000] default> show tables;
+------------------+
| name             |
+------------------+
| hive_warm_up_tbl |
| test_tbl         |
+------------------+
Fetched 2 row(s) in 0.01s

[localhost:21000] default> select * from test_tbl;
+----+----+
| c0 | c1 |
+----+----+
| a  | 1  |
| b  | 2  |
| c  | 3  |
| d  | 4  |
| e  | 5  |
+----+----+
Fetched 5 row(s) in 3.88s

[localhost:21000] default> select sum(c1) from test_tbl;
+---------+
| sum(c1) |
+---------+
| 15      |
+---------+
Fetched 1 row(s) in 0.18s

[localhost:21000] default> select * from test_tbl;
+----+----+
| c0 | c1 |
+----+----+
| a  | 1  |
| b  | 2  |
| c  | 3  |
| d  | 4  |
| e  | 5  |
+----+----+
Fetched 5 row(s) in 0.13s

[localhost:21000] default> drop table test_tbl;
+-------------------------+
| summary                 |
+-------------------------+
| Table has been dropped. |
+-------------------------+
Fetched 1 row(s) in 3.63s

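The above is pasted straight from an impala-shell session.
It is a little hard to read, so to summarize: the following CREATE statement creates the table,

create external table test_tbl
(c0 string, c1 int)
row format delimited fields terminated by ','
location '/test_data';

then these queries are run,

select * from test_tbl;
select sum(c1) from test_tbl;

and finally the table is dropped.

drop table test_tbl;

That is everything the session above does.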
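
As a quick sanity check outside Impala, the same rows can be parsed locally using the table's schema (c0 as string, c1 as int) and summed. The following is a minimal Python sketch, not something from the original session: the function name `sum_c1` is hypothetical, and the CSV text is inlined instead of being read from HDFS.

```python
import csv
import io

# Same content as test.csv, inlined so the sketch is self-contained.
TEST_CSV = "a,1\nb,2\nc,3\nd,4\ne,5\n"


def sum_c1(csv_text):
    """Parse rows as (c0 string, c1 int) and return the sum of c1."""
    total = 0
    for c0, c1 in csv.reader(io.StringIO(csv_text)):
        total += int(c1)
    return total


print(sum_c1(TEST_CSV))  # 15, matching select sum(c1) from test_tbl
```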

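SELECT on a moderately large dataset

Since this is Impala, I'll throw a fairly heavy query at a moderately large dataset.
(I tried the same thing with about 10 million rows, but the environment I built ran out of memory and crashed, so the results here use about 700,000 rows.)
(The output has been lightly edited for readability.)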

[localhost:21000] default> select count(*) from test_tbl;
+----------+
| count(*) |
+----------+
| 747865   |
+----------+
Fetched 1 row(s) in 4.78s

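That is the amount of data used for the test. Next, a heavy query: a self-join on the id column.
(Apologies for the completely meaningless, silly query.)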

[localhost:21000] default> select count(*) from test_tbl a join test_tbl b on a.id = b.id;
+-------------+
| count(*)    |
+-------------+
| 27681809023 |
+-------------+
Fetched 1 row(s) in 388.01s

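It came back in about six and a half minutes.

For comparison, the same query on MySQL had not returned after 30 minutes. (That was with no tuning at all, indexes included. With proper indexes the result would likely be different.)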
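
The join result is far larger than the table itself, which implies that id is not unique. For an equi-join of a table with itself, an id that occurs n times contributes n × n output rows, so the count is the sum of the squared per-id frequencies. If every id were unique, the count would simply equal the row count (747865) rather than 27681809023. The Python sketch below illustrates this on made-up data; `self_join_count` and the sample ids are hypothetical and not taken from the original dataset.

```python
from collections import Counter


def self_join_count(ids):
    """Row count of: select count(*) from t a join t b on a.id = b.id.

    NULL ids (None here) never match in an equi-join, so they are skipped.
    """
    freq = Counter(i for i in ids if i is not None)
    return sum(n * n for n in freq.values())


# Hypothetical sample: id 1 appears 3 times, id 2 twice, id 3 once.
sample_ids = [1, 1, 1, 2, 2, 3]
print(self_join_count(sample_ids))  # 3*3 + 2*2 + 1*1 = 14
```

This is why a few heavily repeated ids are enough to turn a table of several hundred thousand rows into a join result in the tens of billions.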

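Summary

Putting a CSV on HDFS and running CREATE TABLE was enough to SELECT from it
- Getting to a working SELECT was relatively easy
It could handle "big data" (more or less), too
- The test ran on a VirtualBox VM started with 8 GB of memory and with no parameter tuning at all, so a better environment would probably give better results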

Original article: https://qiita.com/abetomo/items/43e58969961722be2ee8
