Loading data from Athena into Pandas with PyAthenaJDBC



Introduction

Analysis work means juggling SQL, R, Python, and more,
but lately I want to do everything in Python.

I recently started pulling data from AWS Athena, so I wrote up how to connect from Python, run a query, and load the result into Pandas.

Requirements

Python (v3.7.2 when I ran this)
PyAthenaJDBC (v2.0.3 when I ran this)
jvm.dll (if Java is not installed, download the latest Java SE from Oracle's JRE download page to get it)

The OS I ran this on is Windows 10.
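Since the connection needs the path to jvm.dll, here is a minimal sketch for locating it. It assumes the Java install directory is passed in or available through the JAVA_HOME environment variable; the helper name find_jvm_dll is my own and is not part of PyAthenaJDBC.

```python
import os
from pathlib import Path
from typing import Optional


def find_jvm_dll(java_home: Optional[str] = None) -> Optional[str]:
    """Return the path of the first jvm.dll found under java_home, or None.

    Hypothetical helper: falls back to the JAVA_HOME environment variable
    when no directory is given.
    """
    root = java_home or os.environ.get("JAVA_HOME")
    if not root:
        return None
    # Search the whole install tree; the exact subfolder varies by Java version.
    matches = sorted(Path(root).rglob("jvm.dll"))
    return str(matches[0]) if matches else None
```

The returned string can be passed as jvm_path when connecting. If it returns None, Java is probably not installed or JAVA_HOME is not set.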

Setup

First, install the PyAthenaJDBC library.

pip install PyAthenaJDBC

Next, we move on to the Python side.

AWS settings and query execution

Python

# Connection library
from pyathenajdbc import connect
from pyathenajdbc.util import as_pandas

# Connect to AWS
conn = connect(
    access_key='hogehoge_access_key',
    secret_key='hogehoge_secret_key',
    s3_staging_dir='hogehoge_s3_staging_dir',
    region_name='us-hoge-hoge',
    jvm_path='hoge/hoge/jvm.dll')  # path to jvm.dll

# Run the query and load the result into a DataFrame
try:
    with conn.cursor() as cursor:
        cursor.execute("""
        SELECT * FROM hogehoge /* write your query here */
        """)
        dataframe = as_pandas(cursor)
finally:
    # Always close the connection, even if the query fails
    conn.close()

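How does that look?
Still, writing the connect, disconnect, and query-execution code every time you fetch data is tedious.
You will also inevitably need to pass arguments to the query.
So I wrapped it all up in a function.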

Wrapping it up in a function

Python: data retrieval function

def paq_func(query, arguments):
    import contextlib
    from pyathenajdbc import connect
    from pyathenajdbc.util import as_pandas
    # AWS settings
    conn_setting = {'access_key': 'hogehoge_access_key',
                    'secret_key': 'hogehoge_secret_key',
                    's3_staging_dir': 'hogehoge_s3_staging_dir',
                    'region_name': 'us-hoge-hoge',
                    'jvm_path': "hoge/hoge/jvm.dll"}
    # Connect; contextlib.closing closes the connection even if the query fails
    with contextlib.closing(connect(**conn_setting)) as conn:
        with conn.cursor() as cursor:
            # arguments fills the %(name)s placeholders in the query
            cursor.execute(query, arguments)
            df = as_pandas(cursor)
    return df
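Writing the keys directly into the function, as above, means they end up wherever the script goes. As one possible alternative, here is a minimal sketch that builds the same settings dictionary from environment variables. The variable names and the helper load_conn_setting are assumptions I picked for illustration; PyAthenaJDBC does not require them.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical environment variable names, chosen for this sketch only.
# Each one maps to a keyword argument used in conn_setting above.
ENV_TO_KWARG = {
    "AWS_ACCESS_KEY_ID": "access_key",
    "AWS_SECRET_ACCESS_KEY": "secret_key",
    "ATHENA_S3_STAGING_DIR": "s3_staging_dir",
    "AWS_DEFAULT_REGION": "region_name",
    "JVM_PATH": "jvm_path",
}


def load_conn_setting(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build the connection settings dict from environment variables."""
    env = os.environ if env is None else env
    # Fail early with a readable message instead of a confusing connect error.
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}
```

The result has the same keys as conn_setting, so it could be used as connect(**load_conn_setting()) inside the function.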

Python: run the function to fetch data

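# Query to run
query = ("""
select
    *
from
    hogehoge
where
    date between %(arg1)s and %(arg2)s
limit 100
""")

# Arguments passed to the query
# (arg1 and arg2 must be defined beforehand, e.g. the start and end of the date range)
arguments = {'arg1': arg1, 'arg2': arg2}

# Run the function
test_df = paq_func(query, arguments)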

It feels like this could be simplified further, but this is as far as my skills go.
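As for the arguments dictionary passed to paq_func, here is a minimal sketch of building it for "the last N days". It assumes the date column is compared against ISO-formatted strings such as '2019-03-10'; whether that matches your table depends on the column type, so treat it as an illustration. The helper name date_range_arguments is hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Optional


def date_range_arguments(days: int, today: Optional[date] = None) -> Dict[str, str]:
    """Return {'arg1': start, 'arg2': end} covering the last `days` days.

    Assumes the query uses the %(arg1)s and %(arg2)s placeholders and that
    dates are passed as ISO strings (YYYY-MM-DD).
    """
    end = today or date.today()
    start = end - timedelta(days=days)
    return {"arg1": start.isoformat(), "arg2": end.isoformat()}
```

For example, date_range_arguments(7) gives the range from seven days ago through today, and the result can be passed as the second argument of paq_func.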

Summary

I can now load data from Athena into Pandas quickly and easily.
If you know a faster way to get the data into Pandas, please let me know.

Author And Source

For more information on this topic (Loading data from Athena into Pandas with PyAthenaJDBC), see the original article: https://qiita.com/T_Shinomiya/items/dc3dadd5795bcc235a5d

Author attribution: the original author's information is available at the original URL. Copyright belongs to the original author.
