Using join with RDDs


The code is as follows.
package rdd

import org.apache.spark.{SparkContext, SparkConf}

/**
  * Created by     on 2016/7/2.
  */
object rddJoin {

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("rddJoin").setMaster("local")
    val sc = new SparkContext(conf)

    // Two pair RDDs keyed by Int; only key 3 appears in both
    val rdd1 = sc.parallelize(Array((1, 21), (2, 42), (3, 41)), 1)
    val rdd2 = sc.parallelize(Array((3, 4), (4, 41)), 1)

    // join is an inner join on the key: (key, (valueFromRdd1, valueFromRdd2))
    val rdd3 = rdd1.join(rdd2)
    rdd3.foreach(println)

    // zipWithIndex pairs each element of rdd1 with its position
    rdd1.zipWithIndex.foreach(println)
  }

}

The execution result is as follows (Spark's INFO log lines are interleaved with the printed output). Since join is an inner join on keys, only key 3, which occurs in both RDDs, appears in the joined output; the last three lines are the elements of rdd1 paired with their indices.
16/07/02 22:42:13 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 19 ms
16/07/02 22:42:13 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/07/02 22:42:13 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
(3,(41,4))
16/07/02 22:42:13 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1165 bytes result sent to driver
16/07/02 22:42:13 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 113 ms on localhost (1/1)
16/07/02 22:42:13 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
16/07/02 22:42:13 INFO DAGScheduler: ResultStage 2 (foreach at rddJoin.scala:18) finished in 0.114 s
16/07/02 22:42:13 INFO DAGScheduler: Job 0 finished: foreach at rddJoin.scala:18, took 1.312562 s
16/07/02 22:42:13 INFO SparkContext: Starting job: foreach at rddJoin.scala:19
16/07/02 22:42:13 INFO DAGScheduler: Got job 1 (foreach at rddJoin.scala:19) with 1 output partitions
16/07/02 22:42:13 INFO DAGScheduler: Final stage: ResultStage 3 (foreach at rddJoin.scala:19)
16/07/02 22:42:13 INFO DAGScheduler: Parents of final stage: List()
16/07/02 22:42:13 INFO DAGScheduler: Missing parents: List()
16/07/02 22:42:13 INFO DAGScheduler: Submitting ResultStage 3 (ZippedWithIndexRDD[5] at zipWithIndex at rddJoin.scala:19), which has no missing parents
16/07/02 22:42:13 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 1608.0 B, free 12.3 KB)
16/07/02 22:42:13 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1075.0 B, free 13.3 KB)
16/07/02 22:42:13 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:46987 (size: 1075.0 B, free: 5.1 GB)
16/07/02 22:42:13 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
16/07/02 22:42:13 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (ZippedWithIndexRDD[5] at zipWithIndex at rddJoin.scala:19)
16/07/02 22:42:13 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
16/07/02 22:42:13 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, partition 0,PROCESS_LOCAL, 2336 bytes)
16/07/02 22:42:13 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
((1,21),0)
((2,42),1)
((3,41),2)
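
Note that rdd3.foreach(println) prints from the task that processes each partition; in local mode that is the same console, but on a cluster the output would land in the executor logs rather than on the driver. A minimal driver-side alternative, assuming the joined result is small enough to collect:

    // Bring the joined pairs back to the driver before printing
    rdd3.collect().foreach(println)   // e.g. (3,(41,4))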

The use of join with Spark RDDs is by now covered in detail in books as well, but here it is demonstrated directly with a program.
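
Beyond the inner join shown above, pair RDDs also offer leftOuterJoin, rightOuterJoin, fullOuterJoin, and cogroup. The following sketch applies them to the same two RDDs; the object name RddJoinVariants is made up for this example, and the results in the comments are what the return types imply (element order may differ).

package rdd

import org.apache.spark.{SparkConf, SparkContext}

object RddJoinVariants {

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("rddJoinVariants").setMaster("local")
    val sc = new SparkContext(conf)

    val rdd1 = sc.parallelize(Array((1, 21), (2, 42), (3, 41)), 1)
    val rdd2 = sc.parallelize(Array((3, 4), (4, 41)), 1)

    // Keeps every key of rdd1; missing right-side values become None
    // e.g. (1,(21,None)) (2,(42,None)) (3,(41,Some(4)))
    rdd1.leftOuterJoin(rdd2).collect().foreach(println)

    // Keeps every key of rdd2; missing left-side values become None
    // e.g. (3,(Some(41),4)) (4,(None,41))
    rdd1.rightOuterJoin(rdd2).collect().foreach(println)

    // Keeps keys from either side; both values are wrapped in Option
    // e.g. (1,(Some(21),None)) ... (4,(None,Some(41)))
    rdd1.fullOuterJoin(rdd2).collect().foreach(println)

    // Groups all values per key from both RDDs instead of pairing them
    // e.g. (3,(CompactBuffer(41),CompactBuffer(4)))
    rdd1.cogroup(rdd2).collect().foreach(println)

    sc.stop()
  }

}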