PySparkがHiveデータを格納する2つの方法

11469 ワード

背景:HiveのCREATE TABLE ASとPySparkの.write.saveAsTableの格納後に生じるデータ型は異なり,前者はText形式,後者はparquet形式である.

元のデータのタイプ

hiveContext.sql("SHOW CREATE TABLE testdb.tttest").show(n=1000, truncate=False)
+--------------------------------------------------------------+
|result                                                        |
+--------------------------------------------------------------+
|CREATE TABLE `testdb.tttest`(                                 |
|  `username` string,                                          |
|  `sex` string)                                               |
|COMMENT 'Imported by sqoop on 2017/04/17 10:11:26'            |
|ROW FORMAT SERDE                                              |
|  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'        |
|WITH SERDEPROPERTIES (                                        |
|  'field.delim'='\t',                                         |
|  'line.delim'='
', | | 'serialization.format'='\t') | |STORED AS INPUTFORMAT | | 'org.apache.hadoop.mapred.TextInputFormat' | |OUTPUTFORMAT | | 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'| |LOCATION | | 'hdfs://nameservice1/user/hive/warehouse/testdb.db/tttest' | |TBLPROPERTIES ( | | 'COLUMN_STATS_ACCURATE'='true', | | 'numFiles'='1', | | 'numRows'='0', | | 'rawDataSize'='0', | | 'totalSize'='66', | | 'transient_lastDdlTime'='1492395090') | +--------------------------------------------------------------+

ソースデータはText形式

方式一:Hiveを利用したCREATE TABLE ASストレージ

hiveContext.sql("DROP TABLE IF EXISTS testdb.test_a")
hiveContext.sql("""CREATE TABLE IF NOT EXISTS testdb.test_a AS SELECT * FROM testdb.tttest""")
hiveContext.sql("SHOW CREATE TABLE testdb.test_a").show(n=1000, truncate=False)
+------------------------------------------------------------+
|result                                                      |
+------------------------------------------------------------+
|CREATE TABLE `testdb.test_a`(                               |
|  `username` string,                                        |
|  `sex` string)                                             |
|ROW FORMAT SERDE                                            |
|  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'      |
|STORED AS INPUTFORMAT                                       |
|  'org.apache.hadoop.mapred.TextInputFormat'                |
|OUTPUTFORMAT                                                |
|  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'  |
|LOCATION                                                    |
|  'hdfs://nameservice1/user/hive/warehouse/testdb.db/test_a'|
|TBLPROPERTIES (                                             |
|  'COLUMN_STATS_ACCURATE'='false',                          |
|  'numFiles'='2',                                           |
|  'numRows'='-1',                                           |
|  'rawDataSize'='-1',                                       |
|  'totalSize'='66',                                         |
|  'transient_lastDdlTime'='1500603886')                     |
+------------------------------------------------------------+

方式2:PySparkを利用する.write.saveAsTableストレージ

hiveContext.sql("DROP TABLE IF EXISTS testdb.test_b")
hiveContext.sql("""SELECT * FROM testdb.tttest""").write.saveAsTable("testdb.test_b")
hiveContext.sql("SHOW CREATE TABLE testdb.test_b").show(n=1000, truncate=False)
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `testdb.test_b`(                                                                                                                                                                                                    |
|  `username` string COMMENT '',                                                                                                                                                                                                  |
|  `sex` string COMMENT '')                                                                                                                                                                                                       |
|ROW FORMAT SERDE                                                                                                                                                                                                                 |
|  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'                                                                                                                                                                  |
|WITH SERDEPROPERTIES (                                                                                                                                                                                                           |
|  'path'='hdfs://nameservice1/user/hive/warehouse/testdb.db/test_b')                                                                                                                                                             |
|STORED AS INPUTFORMAT                                                                                                                                                                                                            |
|  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'                                                                                                                                                                |
|OUTPUTFORMAT                                                                                                                                                                                                                     |
|  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'                                                                                                                                                               |
|LOCATION                                                                                                                                                                                                                         |
|  'hdfs://nameservice1/user/hive/warehouse/testdb.db/test_b'                                                                                                                                                                     |
|TBLPROPERTIES (                                                                                                                                                                                                                  |
|  'COLUMN_STATS_ACCURATE'='false',                                                                                                                                                                                               |
|  'EXTERNAL'='FALSE',                                                                                                                                                                                                            |
|  'numFiles'='2',                                                                                                                                                                                                                |
|  'numRows'='-1',                                                                                                                                                                                                                |
|  'rawDataSize'='-1',                                                                                                                                                                                                            |
|  'spark.sql.sources.provider'='org.apache.spark.sql.parquet',                                                                                                                                                                   |
|  'spark.sql.sources.schema.numParts'='1',                                                                                                                                                                                       |
|  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"username\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"sex\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}', |
|  'totalSize'='1308',                                                                                                                                                                                                            |
|  'transient_lastDdlTime'='1500603889')                                                                                                                                                                                          |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Tipsの第1の方式では文字化けしが発生する場合があり、大型のテーブルに対しては第2の方式で記憶することが好ましく、文字化けしが発生しにくい.

新しい2つのテーブルを削除

hiveContext.sql("DROP TABLE testdb.test_a PURGE")
hiveContext.sql("DROP TABLE testdb.test_b PURGE")

終わります.