Introduction to the Spark Shuffle and Shuffle Operations

1. Preface
This is a simple copy with a translation, kept as a record. If there are problems with the translation, please point them out.

Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

In other words, some Spark operations trigger a shuffle, which is how Spark redistributes data across partitions. Because it usually copies data between executors and machines, the shuffle is a very expensive operation.

Background
To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.

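To see what happens during a shuffle, take reduceByKey as the example. reduceByKey produces a new RDD in which the values of each key are merged into a tuple (similar to a map entry): the key, plus the result of running the reduce function over that key's values. The difficulty is that the values for one key are not necessarily all on the same partition, or even on the same machine, yet they have to be brought together before the result can be computed.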
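The reduceByKey behavior described above can be imitated without Spark. The following is a minimal plain-Python sketch, not Spark code: the `partitions` data and the `reduce_by_key` helper are hypothetical names invented for illustration, and nested lists stand in for RDD partitions.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: three "partitions" of (key, value) pairs.
# Values for key "a" sit on all three partitions, so no single
# partition can compute the result for "a" on its own.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]


def reduce_by_key(partitions, func):
    """Co-locate every value of a key, then fold the values with func."""
    grouped = defaultdict(list)
    for partition in partitions:
        for key, value in partition:
            grouped[key].append(value)  # the co-location step
    # One (key, reduced value) entry per distinct key.
    return {key: reduce(func, values) for key, values in grouped.items()}


result = reduce_by_key(partitions, lambda x, y: x + y)
print(result)
```

In this sketch the co-location step is a simple in-memory dictionary. On a real cluster, that same step is what forces data to move between partitions and machines.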

In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.

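Put differently, Spark does not normally lay data out across partitions so that it is already where a given operation needs it. Because one task works on one partition, gathering all the data for a single reduceByKey reduce task requires an all-to-all operation: Spark reads every partition to find all values for every key, then brings those values together to compute the final result per key. That process is the shuffle.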
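The all-to-all pattern can be sketched in plain Python as well. This is a simulation under stated assumptions, not how Spark implements its shuffle: `partition_for` is a hypothetical CRC32-based stand-in for a hash partitioner, and `shuffle` and `reduce_partition` are illustrative helpers.

```python
from zlib import crc32


def partition_for(key, num_partitions):
    # Assumption of this sketch: a deterministic CRC32-based partitioner.
    return crc32(key.encode("utf-8")) % num_partitions


def shuffle(input_partitions, num_output_partitions):
    """Route every record to the output partition that owns its key."""
    outputs = [[] for _ in range(num_output_partitions)]
    for partition in input_partitions:  # every input partition is read
        for key, value in partition:
            target = partition_for(key, num_output_partitions)
            outputs[target].append((key, value))  # any input may feed any output
    return outputs


def reduce_partition(records, func):
    """One reduce task: it works on exactly one shuffled partition."""
    acc = {}
    for key, value in records:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc


input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]

shuffled = shuffle(input_partitions, 2)
results = [reduce_partition(p, lambda x, y: x + y) for p in shuffled]
print(results)
```

After the shuffle step, each key lives in exactly one output partition, so each reduce task can finish its keys without looking at any other partition.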

Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:

mapPartitions to sort each partition using, for example, .sorted

repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning

sortBy to make a globally ordered RDD

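In summary, a shuffle determines which elements land in which partition, and the order of the partitions, but not the order of elements inside a partition. To get predictably ordered data after a shuffle, the options are:
mapPartitions: sort each partition, for example with .sorted
repartitionAndSortWithinPartitions: sort partitions efficiently while repartitioning at the same time
sortBy: produce a globally ordered RDD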
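The difference between sorting within partitions and sorting globally can be shown with a small plain-Python sketch. It is a simplification: `sort_within_partitions` and `sort_globally` are hypothetical helpers, and the global sort here splits one sorted list into contiguous chunks, which only approximates the idea of range partitioning.

```python
def sort_within_partitions(partitions):
    """Sort each partition on its own; partitions stay independent."""
    return [sorted(p) for p in partitions]


def sort_globally(partitions, num_output_partitions):
    """Sort all records, then cut them into contiguous, ordered chunks."""
    records = sorted(r for p in partitions for r in p)
    # Ceiling division; max(1, ...) avoids a zero step on empty input.
    size = max(1, -(-len(records) // num_output_partitions))
    return [records[i:i + size] for i in range(0, len(records), size)]


# Hypothetical shuffled data: element order inside each partition is arbitrary.
shuffled = [
    [("c", 4), ("a", 6)],
    [("b", 5), ("a", 1), ("c", 2)],
]

locally_sorted = sort_within_partitions(shuffled)
globally_sorted = sort_globally(shuffled, 2)

print(locally_sorted)
print(globally_sorted)
```

With per-partition sorting, each partition is ordered, but reading the partitions one after another is not. With the global variant, reading them in sequence gives a fully ordered result.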

Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.

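That is, the operations that can trigger a shuffle fall into three groups: repartition operations such as repartition and coalesce; ‘ByKey operations (except counting) such as groupByKey and reduceByKey; and join operations such as cogroup and join.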
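To illustrate why a join belongs in this list, here is a plain-Python sketch in which both inputs are redistributed by key before matching. It is an assumption-laden simulation, not Spark's join implementation: `partition_for`, `shuffle_by_key`, and `join`, along with the `users` and `orders` data, are hypothetical names made up for this example.

```python
from zlib import crc32


def partition_for(key, num_partitions):
    # Assumption of this sketch: a deterministic CRC32-based partitioner.
    return crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(records, num_partitions):
    """Redistribute (key, value) records so equal keys share a partition."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partition_for(key, num_partitions)].append((key, value))
    return parts


def join(left, right, num_partitions=2):
    """Inner join: shuffle both sides the same way, then match per partition."""
    left_parts = shuffle_by_key(left, num_partitions)
    right_parts = shuffle_by_key(right, num_partitions)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        for left_key, left_value in left_part:
            for right_key, right_value in right_part:
                if left_key == right_key:
                    joined.append((left_key, (left_value, right_value)))
    return joined


users = [("u1", "Ann"), ("u2", "Bob"), ("u3", "Cid")]
orders = [("u1", 30), ("u3", 12), ("u1", 8)]

joined = join(users, orders)
print(sorted(joined))
```

Because both sides use the same partitioner, matching keys always meet in the same partition, so each partition pair can be joined independently.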
