Where does TreasureData connection info written in ~/.td/td.conf take effect?


What I want to do

I connect to TreasureData in several ways (the td command, the embulk command, Digdag's td>, Digdag's embulk>).
I want to avoid writing the same TreasureData connection info (endpoint and API key) in multiple places as much as possible.

Results

When I tried writing the connection info in ~/.td/td.conf, it took effect in some places but not in others, as follows:

  1. Takes effect
    1. td command
    2. Digdag's td> (local mode)
  2. Does not take effect
    1. Digdag's td> (server mode)
    2. Embulk's embulk-input-td plugin
    3. Embulk's embulk-output-td plugin

It makes sense that it has no effect in Digdag server mode.

It is a pity that it has no effect on Embulk's td plugins.
It would be great if this could be improved.

What I did

Creating td.conf

The endpoint must be given with https:// (omitting it causes an error).

$ id
uid=500(vagrant) gid=500(vagrant) groups=500(vagrant) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
$ pwd
/home/vagrant
$ td --version
0.15.0
$ ls .td
ls: cannot access .td: No such file or directory
$ td -e https://<endpoint> account
Enter your Treasure Data credentials. For Google SSO user, please see https://docs.treasuredata.com/articles/command-line#google-sso-users
Email: <username>
Password (typing will be hidden): <password>
Authenticated successfully.
Use 'td -e https://<endpoint> db:create <db_name>' to create a database.
$ ls .td
td.conf
$ cat .td/td.conf
[account]
  user = <username>
  apikey = <API key>
  endpoint = https://<endpoint>

In fact, when provisioning with vagrant I create this td.conf directly (a heredoc variant is sketched right after the block below).
(It could also be created by running td account, but writing the password in the Vagrantfile felt worse to me than writing the API key.)

  config.vm.provision "shell", privileged: false, inline: <<-EOT
    mkdir                                         ~/.td
    echo '[account]'                           >  ~/.td/td.conf
    echo 'user     = <username>'              >> ~/.td/td.conf
    echo 'apikey   = <API key>'               >> ~/.td/td.conf
    echo 'endpoint = https://<endpoint>'      >> ~/.td/td.conf
  EOT
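
As a side note, the same provisioning can be written a bit more compactly with a nested heredoc. This is just a sketch equivalent to the block above, using the same placeholder values:

  config.vm.provision "shell", privileged: false, inline: <<-EOT
    mkdir -p ~/.td                 # -p: do not fail if ~/.td already exists
    cat <<'CONF' > ~/.td/td.conf
[account]
  user     = <username>
  apikey   = <API key>
  endpoint = https://<endpoint>
CONF
  EOT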

td command

Takes effect, as shown below:

$ cat xxx.sql
SELECT COUNT(*) AS count FROM xxx
$ td query -d xxx -w -q xxx.sql
Job 9999999 is queued.
Use 'td job:show 9999999' to show the status.
queued...
(snip)
Status      : success
Result      :
+-------+
| count |
+-------+
| 42    |
+-------+
1 row in set

Digdag's td> (local mode)

Takes effect, as shown below:

$ cat xxx.dig
+task1:
   td>: xxx.sql
   database: xxx
   store_last_results: true
+task2:
   echo>: ${td.last_results.count}
$ digdag run xxx.dig
2016-10-26 10:56:52 +0900: Digdag v0.8.17
2016-10-26 10:56:55 +0900 [WARN] (main): Using a new session time 2016-10-26T00:00:00+00:00.
2016-10-26 10:56:55 +0900 [INFO] (main): Using session /tmp/test/.digdag/status/20161026T000000+0000.
2016-10-26 10:56:55 +0900 [INFO] (main): Starting a new session project id=1 workflow name=xxx session_time=2016-10-26T00:00:00+00:00
2016-10-26 10:56:58 +0900 [INFO] (0016@+xxx+task1): td>: xxx.sql
2016-10-26 10:56:59 +0900 [INFO] (0016@+xxx+task1): td-client version: 0.7.26
2016-10-26 10:56:59 +0900 [INFO] (0016@+xxx+task1): Logging initialized @6699ms
2016-10-26 10:57:00 +0900 [INFO] (0016@+xxx+task1): td>: xxx.sql
2016-10-26 10:57:01 +0900 [INFO] (0016@+xxx+task1): Started presto job id=9999999:
SELECT COUNT(*) AS count FROM xxx

2016-10-26 10:57:04 +0900 [INFO] (0016@+xxx+task1): td>: xxx.sql
2016-10-26 10:57:06 +0900 [INFO] (0016@+xxx+task2): echo>: 42
42
Success. Task state is saved at /tmp/test/.digdag/status/20161026T000000+0000 directory.
  * Use --session <daily | hourly | "yyyy-MM-dd[ HH:mm:ss]"> to not reuse the last session time.
  * Use --rerun, --start +NAME, or --goal +NAME argument to rerun skipped tasks.

Digdag's td> (server mode)

Does not take effect, as shown below:

$ cat ~/.config/digdag/config
client.http.endpoint = http://<Digdag server IP address>:<port>/
$ digdag push proj1
2016-10-27 12:27:30 +0900: Digdag v0.8.17
Creating .digdag/tmp/archive-7184579153809927090.tar.gz...
  Archiving xxx.dig
  Archiving xxx.sql
Workflows:
  xxx

Uploaded:
  id: 10
  name: proj1
  revision: a5b35e9e-8ae5-4942-af2a-7b0a4ed12c3d
  archive type: db
  project created at: 2016-10-27T03:27:33Z
  revision updated at: 2016-10-27T03:27:33Z

Use `digdag workflows` to show all workflows.
$ digdag start proj1 xxx --session now
2016-10-27 12:28:16 +0900: Digdag v0.8.17
Started a session attempt:
  session id: 112
  attempt id: 111
  uuid: 4ed676b1-01bf-4dee-ba5d-d9b5e032588d
  project: proj1
  workflow: xxx
  session time: 2016-10-27 03:28:19 +0000
  retry attempt name:
  params: {}
  created at: 2016-10-27 12:28:19 +0900

* Use `digdag session 112` to show session status.
* Use `digdag task 111` and `digdag log 111` to show task status and logs.
$ digdag log 111
2016-10-27 12:29:32 +0900: Digdag v0.8.17
2016-10-27 12:28:22.488 +0900 [INFO] (0074@+xxx+task1) io.digdag.core.agent.OperatorManager: td>: xxx.sql
2016-10-27 12:28:23.146 +0900 [INFO] (0074@+xxx+task1) com.treasuredata.client.TDClient: td-client version: 0.7.26
2016-10-27 12:28:23.161 +0900 [ERROR] (0074@+xxx+task1) io.digdag.core.agent.OperatorManager: Configuration error at task +xxx+task1: The 'td.apikey' secret is missing (config)
2016-10-27 12:28:23.971 +0900 [INFO] (0074@+xxx^failure-alert) io.digdag.core.agent.OperatorManager: type: notify
$ ssh <Digdag server IP address> ps -ef|grep digdag
vagrant@<Digdag server IP address>'s password:
vagrant   1524     1  2 12:18 ?        00:00:16 java -XX:+AggressiveOpts -XX:+TieredCompilation -XX:TieredStopAtLevel=1 -Xverify:none -jar /usr/local/bin/digdag server -c /home/vagrant/.config/digdag/config -O /home/vagrant/digdag-server/task-log
$ ssh <Digdag server IP address> cat ~/.td/td.conf
vagrant@<Digdag server IP address>'s password:
[account]
  user = <username>
  apikey = <API key>
  endpoint = https://<endpoint>
$ digdag version
2016-10-27 12:34:41 +0900: Digdag v0.8.17
Client version: 0.8.17
Server version: 0.8.17

Even if ~/.td/td.conf exists on the client side, it is not consulted when the workflow runs on the server.

Even if ~/.td/td.conf exists for the server process user on the Digdag server machine, it is not consulted at execution time either.
Presumably this is because multiple users push multiple projects to the server, and those projects should not share a single endpoint and API key.
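
The error above asks for a 'td.apikey' secret, so in server mode the key is apparently expected to come from Digdag's per-project secrets mechanism rather than from td.conf. Below is a minimal sketch of how that would look, assuming a Digdag release new enough to ship the secrets CLI (the 0.8.17 used here may not have it) and the proj1 project from above:

$ digdag secrets --project proj1 --set td.apikey   # prompts for the key and stores it on the server
$ digdag secrets --project proj1                   # lists the secret keys registered for the project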

Embulk's embulk-input-td plugin

First, a working example.
Here the endpoint must NOT include https:// (adding it causes an error).

$ cat input.yml
in:
  type: td
  apikey: <API key>
  endpoint: <endpoint>
  database: xxx
  query: SELECT * FROM xxx
out:
  type: file
  path_prefix: xxx
  file_ext: csv
  formatter:
    type: csv
    header_line: true
$ embulk run input.yml
2016-10-27 12:43:01.916 +0900: Embulk v0.8.14
2016-10-27 12:43:06.817 +0900 [INFO] (0001:transaction): Loaded plugin embulk-input-td (0.1.0)
2016-10-27 12:43:06.925 +0900 [INFO] (0001:transaction): td-client version: 0.7.24
2016-10-27 12:43:06.938 +0900 [INFO] (0001:transaction): Reading configuration file: /home/vagrant/.td/td.conf
2016-10-27 12:43:07.006 +0900 [INFO] (0001:transaction): Logging initialized @13514ms
2016-10-27 12:43:07.648 +0900 [INFO] (0001:transaction): Submit a query for database 'xxx': SELECT * FROM xxx
2016-10-27 12:43:08.650 +0900 [INFO] (0001:transaction): Job 8065368 is queued.
2016-10-27 12:43:08.650 +0900 [INFO] (0001:transaction): Confirm that job 8065368 finished
2016-10-27 12:43:14.317 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=2 / tasks=1
2016-10-27 12:43:14.460 +0900 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
2016-10-27 12:43:14.668 +0900 [INFO] (0023:task-0000): Writing local file 'xxx000.00.csv'
2016-10-27 12:43:15.141 +0900 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
2016-10-27 12:43:15.168 +0900 [INFO] (main): Committed.
2016-10-27 12:43:15.170 +0900 [INFO] (main): Next config diff: {"in":{},"out":{}}

The log line 'Reading configuration file: /home/vagrant/.td/td.conf' suggests that the API key and endpoint would be picked up from td.conf.
However, commenting out endpoint and running again fails with an error.
Presumably the request went to the default api.treasuredata.com, as documented at https://github.com/muga/embulk-input-td#configuration (a possible workaround for the duplication is sketched at the end of this section).

$ embulk run input.yml
2016-10-27 12:46:48.242 +0900: Embulk v0.8.14
2016-10-27 12:46:52.945 +0900 [INFO] (0001:transaction): Loaded plugin embulk-input-td (0.1.0)
2016-10-27 12:46:53.061 +0900 [INFO] (0001:transaction): td-client version: 0.7.24
2016-10-27 12:46:53.068 +0900 [INFO] (0001:transaction): Reading configuration file: /home/vagrant/.td/td.conf
2016-10-27 12:46:53.158 +0900 [INFO] (0001:transaction): Logging initialized @13498ms
2016-10-27 12:46:53.743 +0900 [INFO] (0001:transaction): Submit a query for database 'xxx': SELECT * FROM xxx
2016-10-27 12:46:55.023 +0900 [WARN] (0001:transaction): API request failed
java.util.concurrent.ExecutionException: org.eclipse.jetty.client.HttpResponseException: HTTP protocol violation: Authentication challenge without WWW-Authenticate header
        at org.eclipse.jetty.client.util.FutureResponseListener.getResult(FutureResponseListener.java:118) ~[jetty-client-9.2.2.v20140723.jar:9.2.2.v20140723]
        at org.eclipse.jetty.client.util.FutureResponseListener.get(FutureResponseListener.java:101) ~[jetty-client-9.2.2.v20140723.jar:9.2.2.v20140723]
(snip)

Restoring endpoint and commenting out apikey instead also fails.
https://github.com/muga/embulk-input-td#configuration lists apikey as required.

$ embulk run input.yml
2016-10-27 12:49:09.632 +0900: Embulk v0.8.14
2016-10-27 12:49:14.507 +0900 [INFO] (0001:transaction): Loaded plugin embulk-input-td (0.1.0)
org.embulk.exec.PartialExecutionException: org.embulk.config.ConfigException: com.fasterxml.jackson.databind.JsonMappingException: Field 'apikey' is required but not set
 at [Source: N/A; line: -1, column: -1]
        at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(org/embulk/exec/BulkLoader.java:363)
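
If the aim is simply not to repeat the API key in every Embulk config file (rather than to reuse td.conf itself), one possible workaround is Embulk's Liquid templating: name the file input.yml.liquid and read the values from environment variables. The following is only a sketch under that assumption; TD_APIKEY and TD_ENDPOINT are variable names made up for this example, and the same idea should apply to embulk-output-td below:

$ export TD_APIKEY=<API key>
$ export TD_ENDPOINT=<endpoint>
$ cat input.yml.liquid
in:
  type: td
  apikey: {{ env.TD_APIKEY }}
  endpoint: {{ env.TD_ENDPOINT }}
  database: xxx
  query: SELECT * FROM xxx
out:
  type: file
  path_prefix: xxx
  file_ext: csv
  formatter:
    type: csv
    header_line: true
$ embulk run input.yml.liquid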

Embulk's embulk-output-td plugin

The result is the same as with the embulk-input-td plugin.

$ cat output_for_guess.yml
in:
  type: file
  path_prefix: xxx
out:
  type: td
  apikey: <API key>
  endpoint: <endpoint>
  database: xxx
  table: xxx2
  mode: truncate
$ embulk guess output_for_guess.yml -o output.yml
2016-10-27 13:01:49.899 +0900: Embulk v0.8.14
2016-10-27 13:01:51.835 +0900 [INFO] (0001:guess): Listing local files at directory '.' filtering filename by prefix 'xxx'
2016-10-27 13:01:51.839 +0900 [INFO] (0001:guess): Loading files [xxx000.00.csv]
2016-10-27 13:01:52.038 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/gzip from a load path
2016-10-27 13:01:52.062 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/bzip2 from a load path
2016-10-27 13:01:52.102 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/json from a load path
2016-10-27 13:01:52.119 +0900 [INFO] (0001:guess): Loaded plugin embulk/guess/csv from a load path
in:
  type: file
  path_prefix: xxx
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: col1, type: string}
    - {name: col2, type: string}
    - {name: time, type: long}
out: {type: td, apikey: <API key>, endpoint: <endpoint>,
  database: xxx, table: xxx2, mode: truncate}
Created 'output.yml' file.
$ embulk run output.yml
2016-10-27 13:02:22.364 +0900: Embulk v0.8.14
2016-10-27 13:02:27.395 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-td (0.3.8)
2016-10-27 13:02:27.511 +0900 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'xxx'
2016-10-27 13:02:27.522 +0900 [INFO] (0001:transaction): Loading files [xxx000.00.csv]
2016-10-27 13:02:27.690 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=2 / tasks=1
2016-10-27 13:02:27.803 +0900 [INFO] (0001:transaction): td-client version: 0.7.24
2016-10-27 13:02:27.815 +0900 [INFO] (0001:transaction): Reading configuration file: /home/vagrant/.td/td.conf
2016-10-27 13:02:27.887 +0900 [INFO] (0001:transaction): Logging initialized @13895ms
2016-10-27 13:02:29.793 +0900 [INFO] (0001:transaction): Using time:long column as the data partitioning key
2016-10-27 13:02:29.796 +0900 [INFO] (0001:transaction): Create bulk_import session embulk_20161027_040227_014000000
2016-10-27 13:02:30.176 +0900 [INFO] (0001:transaction): {done:  0 / 1, running: 0}
2016-10-27 13:02:30.540 +0900 [INFO] (0022:task-0000): {uploading: {rows: 20, size: 1,166 bytes (compressed)}}
2016-10-27 13:02:30.974 +0900 [INFO] (0001:transaction): {done:  1 / 1, running: 0}
2016-10-27 13:02:31.761 +0900 [INFO] (0001:transaction): Performing bulk import session 'embulk_20161027_040227_014000000'
2016-10-27 13:03:12.793 +0900 [INFO] (0001:transaction):     job id: 8065734
2016-10-27 13:03:13.262 +0900 [INFO] (0001:transaction): Committing bulk import session 'embulk_20161027_040227_014000000'
2016-10-27 13:03:13.263 +0900 [INFO] (0001:transaction):     valid records: 20
2016-10-27 13:03:13.263 +0900 [INFO] (0001:transaction):     error records: 0
2016-10-27 13:03:13.263 +0900 [INFO] (0001:transaction):     valid parts: 1
2016-10-27 13:03:13.263 +0900 [INFO] (0001:transaction):     error parts: 0
2016-10-27 13:03:13.263 +0900 [INFO] (0001:transaction):     new columns:
2016-10-27 13:03:13.265 +0900 [INFO] (0001:transaction):       - col1: string
2016-10-27 13:03:13.266 +0900 [INFO] (0001:transaction):       - col2: string
2016-10-27 13:03:20.469 +0900 [INFO] (0001:transaction): Deleting bulk import session 'embulk_20161027_040227_014000000'
2016-10-27 13:03:20.876 +0900 [INFO] (main): Committed.
2016-10-27 13:03:20.877 +0900 [INFO] (main): Next config diff: {"in":{"last_path":"xxx000.00.csv"},"out":{"last_session":"embulk_20161027_040227_014000000"}}
$ vi output.yml # remove the endpoint entry
$ embulk run output.yml
2016-10-27 13:07:21.973 +0900: Embulk v0.8.14
2016-10-27 13:07:26.872 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-td (0.3.8)
2016-10-27 13:07:26.993 +0900 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'xxx'
2016-10-27 13:07:27.004 +0900 [INFO] (0001:transaction): Loading files [xxx000.00.csv]
2016-10-27 13:07:27.159 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=2 / tasks=1
2016-10-27 13:07:27.271 +0900 [INFO] (0001:transaction): td-client version: 0.7.24
2016-10-27 13:07:27.277 +0900 [INFO] (0001:transaction): Reading configuration file: /home/vagrant/.td/td.conf
2016-10-27 13:07:27.345 +0900 [INFO] (0001:transaction): Logging initialized @13797ms
2016-10-27 13:07:29.389 +0900 [WARN] (0001:transaction): API request failed
java.util.concurrent.ExecutionException: org.eclipse.jetty.client.HttpResponseException: HTTP protocol violation: Authentication challenge without WWW-Authenticate header
        at org.eclipse.jetty.client.util.FutureResponseListener.getResult(FutureResponseListener.java:118) ~[jetty-client-9.2.2.v20140723.jar:9.2.2.v20140723]
(snip)
$ vi output.yml # remove the apikey entry
$ embulk run output.yml
2016-10-27 13:08:38.456 +0900: Embulk v0.8.14
2016-10-27 13:08:43.483 +0900 [INFO] (0001:transaction): Loaded plugin embulk-output-td (0.3.8)
2016-10-27 13:08:43.607 +0900 [INFO] (0001:transaction): Listing local files at directory '.' filtering filename by prefix 'xxx'
2016-10-27 13:08:43.617 +0900 [INFO] (0001:transaction): Loading files [xxx000.00.csv]
2016-10-27 13:08:43.788 +0900 [INFO] (0001:transaction): Using local thread executor with max_threads=2 / tasks=1
org.embulk.exec.PartialExecutionException: org.embulk.config.ConfigException: com.fasterxml.jackson.databind.JsonMappingException: Field 'apikey' is required but not set
 at [Source: N/A; line: -1, column: -1]
        at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(org/embulk/exec/BulkLoader.java:363)