Spark pyspark rdd接続関数のjoin、leftOuterJoin、right OuterJoin、fullOuterJoinの説明

2473 ワード

Spark pyspark rdd接続関数のjoin、leftOuterJoin、right OuterJoin、fullOuterJoinの説明
unionは2つのrddの要素を結合するために使用され、joinは内部接続に使用され、その後の3つの関数(leftOuterJoin,rightOuterJoin,fullOuterJoin)はSQLのような左、右、全接続に使用される.
key-value形式のRDDについて.
例:
1)データ初期化
>>> pp=(('cat', 2), ('cat', 5), ('book', 4), ('cat', 12))
>>> pp
(('cat', 2), ('cat', 5), ('book', 4), ('cat', 12))
>>> qq=(("cat",2), ("cup", 5), ("mouse", 4),("cat", 12))
>>> qq
(('cat', 2), ('cup', 5), ('mouse', 4), ('cat', 12))
>>> pairRDD1 = sc.parallelize(pp)
>>> pairRDD2 = sc.parallelize(qq)
>>> pairRDD1.collect()
[('cat', 2), ('cat', 5), ('book', 4), ('cat', 12)]
>>> pairRDD2.collect()
[('cat', 2), ('cup', 5), ('mouse', 4), ('cat', 12)]

2)ジョイン内接続結果:
>>> pairRDD1.join(pairRDD2).collect()
[('cat', (2, 2)), ('cat', (2, 12)), ('cat', (5, 2)), ('cat', (5, 12)), ('cat', (12, 2)), ('cat', (12, 12))]

3)leftOuterJoin結果:
>>> pairRDD1.leftOuterJoin(pairRDD2).collect()
[('book', (4, None)), ('cat', (2, 2)), ('cat', (2, 12)), ('cat', (5, 2)), ('cat', (5, 12)), ('cat', (12, 2)), ('cat', (12, 12))]

4)rightOuterJoin結果:
>>> pairRDD1.rightOuterJoin(pairRDD2).collect()
[('cup', (None, 5)), ('mouse', (None, 4)), ('cat', (2, 2)), ('cat', (2, 12)), ('cat', (5, 2)), ('cat', (5, 12)), ('cat', (12, 2)), ('cat', (12, 12))]

5)fullOuterJoin結果:
>>> pairRDD1.fullOuterJoin(pairRDD2).collect()
[('book', (4, None)), ('cup', (None, 5)), ('mouse', (None, 4)), ('cat', (2, 2)), ('cat', (2, 12)), ('cat', (5, 2)), ('cat', (5, 12)), ('cat', (12, 2)), ('cat', (12, 12))]

6)union結果:
>>> pairRDD1.union(pairRDD2).collect()
[('cat', 2), ('cat', 5), ('book', 4), ('cat', 12), ('cat', 2), ('cup', 5), ('mouse', 4), ('cat', 12)]

参照先:http://blog.cheyo.net/175.html