Notes on a data generation tool provided with Amazon EMR

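I came across this tool on an Amazon page while using Impala. It looks handy for testing the various features of EMR, so here are my notes.

I usually rely on Faker, but this tool looks convenient when you need some quick throwaway data or want to generate data of a specific size.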

Caveat

There is no official page for the tool, so its license and terms of use are unknown.

Environment

It runs anywhere Java is available.

Download

Judging from the path, the tool is meant for testing Impala. It does not depend on Impala, of course.

wget http://elasticmapreduce.s3.amazonaws.com/samples/impala/dbgen-1.0-jar-with-dependencies.jar

Usage

java -cp dbgen-1.0-jar-with-dependencies.jar DBGen -p ./ -b 1 -c 1 -t 1

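Parameter meanings

DBGen: the main class to run; it generates the data (presumably).
-p: output path. The default appears to be /tmp/dbgen.
-b: size of the books data in GB. The default is 1. Values below 1 (such as 0.1) do not seem to be accepted.
-c: size of the customers data in GB. The default is 1. Values below 1 (such as 0.1) do not seem to be accepted.
-t: size of the transactions data in GB. The default is 1. Values below 1 (such as 0.1) do not seem to be accepted.

So a single run consumes at least 3 GB of disk (1 GB per table). That's Big Data for you.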
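To make the disk cost explicit before running the tool, the command line can be assembled programmatically. The following is a minimal Python sketch, not part of the tool itself: the helper name `build_dbgen_command` is hypothetical, the flags are the ones listed above, and the "whole GB, at least 1" check only encodes the observation above rather than documented behavior.

```python
import shlex

JAR = "dbgen-1.0-jar-with-dependencies.jar"  # file name from the wget step


def build_dbgen_command(path="/tmp/dbgen", books_gb=1, customers_gb=1, transactions_gb=1):
    """Return (argv, total_gb) for one DBGen run. Hypothetical helper."""
    sizes = {"-b": books_gb, "-c": customers_gb, "-t": transactions_gb}
    for flag, size in sizes.items():
        # Assumption from the notes above: only whole numbers of GB >= 1 work.
        if isinstance(size, bool) or not isinstance(size, int) or size < 1:
            raise ValueError(f"{flag} must be an integer >= 1, got {size!r}")
    argv = ["java", "-cp", JAR, "DBGen", "-p", path]
    for flag, size in sizes.items():
        argv += [flag, str(size)]
    # The three table sizes add up to the expected disk usage in GB.
    return argv, sum(sizes.values())


argv, total_gb = build_dbgen_command(path="./")
print(shlex.join(argv))
print(f"expected disk usage: about {total_gb} GB")
```

The returned list can be passed to `subprocess.run` as-is once the jar has been downloaded.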

Schema

The meaning of each column can be inferred from the table definitions.

books

The Impala CREATE TABLE statement:

create EXTERNAL TABLE books
( 
    id BIGINT,
    isbn STRING,
    category STRING,
    publish_date TIMESTAMP,
    publisher STRING,
    price FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/books/';

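The columns probably mean the following:

id: ID
isbn: ISBN
category: category
publish_date: publication date
publisher: publisher (the sample values are publishing houses, not authors)
price: price

The actual data looks like this: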

0|4-34538-258-1|NATURE|1975-03-12|Harper Collins|30.99
1|9-50874-957-5|LITERARY-COLLECTIONS|2015-01-12|Woongjin ThinkBig|186.99
2|8-21886-784-2|DESIGN|2004-12-19|Perseus|102.99

Note that id starts at 0.
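For quick checks outside Impala, one row of the books file can be parsed with a few lines of Python. This is only a sketch based on the three sample rows above: the `Book` class and `parse_book` function are hypothetical names, and `Decimal` is used for the price (instead of the FLOAT in the table definition) purely to avoid rounding surprises.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Book:
    id: int
    isbn: str
    category: str
    publish_date: date
    publisher: str
    price: Decimal


def parse_book(line: str) -> Book:
    """Parse one pipe-delimited books row into a typed record."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    book_id, isbn, category, publish_date, publisher, price = fields
    return Book(
        id=int(book_id),
        isbn=isbn,
        category=category,
        publish_date=date.fromisoformat(publish_date),  # sample dates are YYYY-MM-DD
        publisher=publisher,
        price=Decimal(price),
    )


sample = "0|4-34538-258-1|NATURE|1975-03-12|Harper Collins|30.99"
print(parse_book(sample))
```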

customers

The Impala CREATE TABLE statement:

-- customers
create EXTERNAL TABLE customers
(
    id BIGINT,
    name STRING,
    date_of_birth TIMESTAMP,
    gender STRING,
    state STRING,
    email STRING,
    phone STRING
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/customers/';

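The columns probably mean the following:

id: ID
name: name
date_of_birth: date of birth
gender: gender
state: state (the sample values NJ, ID, and SD are US state codes)
email: e-mail address
phone: phone number

The actual data looks like this: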

0|David WAGNER|1969-06-15|F|NJ|[email protected]|588-584-2254
1|Gabriella KENNEDY|1970-08-07|M|ID|[email protected]|605-697-5974
2|Mia WOOD|1960-07-13|F|SD|[email protected]|372-234-8732
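Because the file is plain pipe-delimited text, Python's standard `csv` module can read it directly. The sketch below counts customers per state; the column list is taken from the table definition above, while the function name `count_by_state` is hypothetical and the in-memory sample stands in for a real generated file.

```python
import csv
import io
from collections import Counter

# Column order taken from the CREATE TABLE statement above.
CUSTOMER_COLUMNS = ["id", "name", "date_of_birth", "gender", "state", "email", "phone"]


def count_by_state(lines):
    """Count customers per state from an iterable of pipe-delimited lines."""
    reader = csv.DictReader(
        lines,
        fieldnames=CUSTOMER_COLUMNS,
        delimiter="|",
        quoting=csv.QUOTE_NONE,  # the sample data shows no quoting
    )
    return Counter(row["state"] for row in reader)


# The three sample rows shown above, used in place of a generated file.
sample = io.StringIO(
    "0|David WAGNER|1969-06-15|F|NJ|[email protected]|588-584-2254\n"
    "1|Gabriella KENNEDY|1970-08-07|M|ID|[email protected]|605-697-5974\n"
    "2|Mia WOOD|1960-07-13|F|SD|[email protected]|372-234-8732\n"
)
print(count_by_state(sample))
```

For a real run, an open file object for the generated customers file can be passed in place of the `StringIO` sample.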

transactions

The Impala CREATE TABLE statement:

-- transactions
create EXTERNAL TABLE transactions
(
    id BIGINT,
    customer_id BIGINT,
    book_id BIGINT,
    quantity INT,
    transaction_date TIMESTAMP
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/transactions/';

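The columns probably mean the following:

id: ID
customer_id: customer ID
book_id: book ID
quantity: quantity
transaction_date: date of sale

The data practically invites you to JOIN to your heart's content.
The actual data looks like this: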

0|3279645|2404172|16|2005-04-29 14:44:34
1|9370334|3816127|20|2012-06-02 15:29:38
2|4378631|5620786|16|2012-01-24 05:45:29
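The relationship between the three files can be illustrated with a small in-memory inner join in Python. This is a sketch only: the function names `load` and `join_transactions` are hypothetical, and because the sample transactions above refer to customer and book IDs outside the three-row samples, the transaction rows in the usage example are made up for illustration.

```python
from decimal import Decimal


def load(lines, key_index=0):
    """Index pipe-delimited rows by the integer column at key_index."""
    rows = (line.rstrip("\n").split("|") for line in lines)
    return {int(row[key_index]): row for row in rows}


def join_transactions(transactions, customers, books):
    """Yield (customer name, book category, amount) per matching transaction."""
    for line in transactions:
        _id, customer_id, book_id, quantity, _date = line.rstrip("\n").split("|")
        customer = customers.get(int(customer_id))
        book = books.get(int(book_id))
        if customer is None or book is None:
            continue  # inner join: skip rows without a match on both sides
        # customer[1] is name, book[2] is category, book[5] is price.
        yield customer[1], book[2], Decimal(book[5]) * int(quantity)


books = load(["2|8-21886-784-2|DESIGN|2004-12-19|Perseus|102.99"])
customers = load(["1|Gabriella KENNEDY|1970-08-07|M|ID|[email protected]|605-697-5974"])
# Hypothetical transaction rows: IDs changed so that they match the samples.
transactions = [
    "0|1|2|16|2005-04-29 14:44:34",
    "1|9|2|20|2012-06-02 15:29:38",  # no customer 9 in the sample, so it is skipped
]
for row in join_transactions(transactions, customers, books):
    print(row)
```

Holding the lookup tables in dictionaries only works for small samples; at the multi-gigabyte sizes the tool generates, the join belongs in Impala.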

Other notes

I tried displaying the help.

It is properly implemented.

java -cp dbgen-1.0-jar-with-dependencies.jar DBGen -h


usage: java -cp <path-to-jar> DBGen <options>
 -b,--books-table-size <size>                         Size of books table
                                                      in GB, default 1
 -bp,--books-table-partitioned <partitioned>          Books table
                                                      partitioned on the
                                                      category column,
                                                      default false
 -c,--customers-table-size <size>                     Size of customers
                                                      table in GB, default
                                                      1
 -cp,--customers-table-partitioned <partitioned>      Customers table
                                                      partitioned on the
                                                      state column,
                                                      default false
 -p,--path <path>                                     path to the base
                                                      directory where the
                                                      table files are
                                                      saved, default
                                                      /tmp/dbgen
 -t,--transactions-table-size <size>                  Size of transactions
                                                      table in GB, default
                                                      1
 -tp,--transactions-table-partitioned <partitioned>   Transactions table
                                                      partitioned on the
                                                      transaction year and
                                                      month columns,
                                                      default false

Author and Source

Original article: "Notes on a data generation tool provided with Amazon EMR", https://qiita.com/zaburo/items/db35cadaff95d43d1fe1
