1.sparksqlのcase classによるDataFramesの作成(反射)


import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object TestDataFrame1 {
  def main(args: Array[String]): Unit = {
    val conf=new SparkConf().setAppName("RDDToDataFrame").setMaster("local")
    val sc=new SparkContext(conf)
    val context=new SQLContext(sc)

    // RDD, RDD case class 
    val peopleRDD=sc.textFile("D:\\Users\\shashahu\\Desktop\\work\\spark-2.4.4\\examples\\src\\main\\resources\\people.txt").map(line => People(line.split(",")(0), line.split(",")(1).trim.toInt))
    // txt , txt 
    import context.implicits._
    // 
    val df=peopleRDD.toDF() // RDD DataFrames

    // DataFrame :
    df.createOrReplaceTempView("people") // 

    // SQL 
    context.sql("select * from people").show()
  }
  case class People(var name:String,var age:Int)
}
/** sparksql txt 
  *  :
  * 
  * 
  * 19/11/12 20:07:13 INFO FileInputFormat: Total input paths to process : 1
  * 19/11/12 20:07:13 INFO SparkContext: Starting job: show at TestDataFrame1.scala:21
  * 19/11/12 20:07:13 INFO DAGScheduler: Got job 0 (show at TestDataFrame1.scala:21) with 1 output partitions
  * 19/11/12 20:07:13 INFO DAGScheduler: Final stage: ResultStage 0 (show at TestDataFrame1.scala:21)
  * 19/11/12 20:07:13 INFO DAGScheduler: Parents of final stage: List()
  * 19/11/12 20:07:13 INFO DAGScheduler: Missing parents: List()
  * 19/11/12 20:07:13 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at show at TestDataFrame1.scala:21), which has no missing parents
  * 19/11/12 20:07:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 10.0 KB, free 4.1 GB)
  * 19/11/12 20:07:13 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.8 KB, free 4.1 GB)
  * 19/11/12 20:07:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on dst001825.cn1.global.ctrip.com:53625 (size: 4.8 KB, free: 4.1 GB)
  * 19/11/12 20:07:13 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
  * 19/11/12 20:07:13 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at show at TestDataFrame1.scala:21) (first 15 tasks are for partitions Vector(0))
  * 19/11/12 20:07:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
  * 19/11/12 20:07:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7947 bytes)
  * 19/11/12 20:07:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
  * 19/11/12 20:07:13 INFO HadoopRDD: Input split: file:/D:/Users/shashahu/Desktop/work/spark-2.4.4/examples/src/main/resources/people.txt:0+32
  * 19/11/12 20:07:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1264 bytes result sent to driver
  * 19/11/12 20:07:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 123 ms on localhost (executor driver) (1/1)
  * 19/11/12 20:07:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
  * 19/11/12 20:07:13 INFO DAGScheduler: ResultStage 0 (show at TestDataFrame1.scala:21) finished in 0.222 s
  * 19/11/12 20:07:13 INFO DAGScheduler: Job 0 finished: show at TestDataFrame1.scala:21, took 0.260352 s
  * +-------+---+
  * |   name|age|
  * +-------+---+
  * |Michael| 29|
  * |   Andy| 30|
  * | Justin| 19|
  * +-------+---+
  *
  * 19/11/12 20:07:13 INFO SparkContext: Invoking stop() from shutdown hook
  */

公式サイトのtxtを使用してテーブルに変換します.関連コードとプレゼンテーションの結果は上記の通りです.