Apache SparkでDBからWHERE句付きでDataFrameを作成する

2927 ワード

Scala SQL Spark Scala テキストリンク

Apache Sparkで既存のDBMSから
DataFrameを作成するには以下のようなやり方があります。

確認した環境
Windows 10
MySQL 5.7.10 （付属のサンプルデータ worldデータベースのcityテーブルを使用しました）
Apache Spark 1.5.2

JDBCドライバの設定

set SPARK_CLASSPATH=lib\mysql-connector-java-5.1.38-bin.jar

SPARK_CLASSPATHにjdbcドライバのパスを設定します。
Windows環境のためSETコマンドを使用しています。

そしてspark-shellの起動
spark-shell

以下を貼り付けます。

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

val options = Map("driver" -> "com.mysql.jdbc.Driver",
"url" -> "jdbc:mysql://localhost:3306/world?user=xxx&password=xxx",
"dbtable" -> "(select * from city where countrycode = 'NLD') as city2")

var jdbcDF = sqlContext.read.format("jdbc").options(options).load()

jdbcDF.count()

dbtabeの項目は、通常テーブル名を指定しますが、上記のように別名をつけることで任意のSQLが書けます。

(select * from city where countrycode = 'NLD') as city2

結果は以下のようになりました。

res0: Long = 28

cityテーブルは実際には、4079レコードありますが、「countrycode = 'NLD'」のもののみ絞り込まれています。
これで不必要なデータをメモリに読み込む必要がなくなりました。

Author And Source

この問題について(Apache SparkでDBからWHERE句付きでDataFrameを作成する), 我々は、より多くの情報をここで見つけました https://qiita.com/sohatach/items/627a928efd4dee30b3b1

著者帰属：元の著者の情報は、元のURLに含まれています。著作権は原作者に属する。

Content is automatically searched and collected through network algorithms . If there is a violation . Please contact us . We will adjust (correct author information ,or delete content ) as soon as possible .

Remove Duplicates from Sorted List II (C++)

leetcodeノート:Range Sum Query-Mutable