Repost: Improving Spark execution efficiency by setting the spark.default.parallelism parameter sensibly

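Spark has the concept of a partition (the same thing as a "slice", as explained on the official site for Spark 1.2), and in general each partition corresponds to one task. In my tests, when spark.default.parallelism was not set, Spark computed a very large number of partitions, far out of proportion to my cores. I was running on two machines (8 cores and 6 GB of memory each), and Spark came up with about 28,000 partitions, that is, roughly 29,000 tasks. Each task finished in a few milliseconds or even a fraction of a millisecond, yet the job as a whole ran very slowly. After I set spark.default.parallelism, the number of tasks dropped to 10 and the computation went from taking minutes to about 20 seconds.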
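To make the mismatch concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that all 16 cores (two machines with 8 cores each) are available to executors and that each task occupies one core at a time; the function name `scheduling_waves` is hypothetical. It only counts rounds of tasks and does not model actual run time.

```python
import math


def scheduling_waves(num_tasks: int, total_cores: int) -> int:
    """Rounds of tasks needed if every core runs one task at a time.

    Assumption: one core per task and all cores usable by executors.
    """
    if num_tasks < 0 or total_cores <= 0:
        raise ValueError("num_tasks must be >= 0 and total_cores must be > 0")
    return math.ceil(num_tasks / total_cores)


# Two machines with 8 cores each, as described in the post.
TOTAL_CORES = 2 * 8

for tasks in (28_000, 10):
    waves = scheduling_waves(tasks, TOTAL_CORES)
    print(f"{tasks} tasks on {TOTAL_CORES} cores -> {waves} wave(s)")
```

Under these assumptions, about 28,000 tasks need 1,750 rounds of millisecond-sized work, so per-task overhead is a plausible reason the job was slow. With 10 tasks everything fits in one round, although 10 tasks cannot keep all 16 cores busy at once.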
The parameter is set in the $SPARK_HOME/conf/spark-defaults.conf configuration file.
For example:

    spark.master                  spark://master:7077
    spark.default.parallelism     10
    spark.driver.memory           2g
    spark.serializer              org.apache.spark.serializer.KryoSerializer
    spark.sql.shuffle.partitions  50

Below is the description from the official documentation.
From: http://spark.apache.org/docs/latest/configuration.html
Property Name: spark.default.parallelism

Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

    Local mode: number of cores on the local machine
    Mesos fine grained mode: 8
    Others: total number of cores on all executor nodes or 2, whichever is larger

Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
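
The documented default for operations with no parent RDD can be paraphrased as a small Python function. This is only a simplified model of the rule quoted above, not Spark's actual implementation; the function name and the mode labels ("local", "mesos-fine-grained") are hypothetical names chosen for illustration.

```python
def default_parallelism_no_parent(
    mode: str,
    local_cores: int = 1,
    total_executor_cores: int = 0,
) -> int:
    """Model of the documented default when spark.default.parallelism is unset.

    mode: hypothetical label for the cluster manager in use.
    local_cores: number of cores on the local machine (local mode only).
    total_executor_cores: cores summed over all executor nodes (other modes).
    """
    if mode == "local":
        # Local mode: number of cores on the local machine.
        return local_cores
    if mode == "mesos-fine-grained":
        # Mesos fine grained mode: fixed value of 8.
        return 8
    # Others: total executor cores or 2, whichever is larger.
    return max(total_executor_cores, 2)


print(default_parallelism_no_parent("local", local_cores=4))
print(default_parallelism_no_parent("mesos-fine-grained"))
print(default_parallelism_no_parent("standalone", total_executor_cores=16))
```

Note that this covers only the no-parent case. For shuffle operations such as reduceByKey and join, the documentation says the default is the largest number of partitions in a parent RDD, so the number of input partitions matters as well.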
From: http://spark.apache.org/docs/latest/tuning.html

Level of Parallelism

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of "map" tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed "reduce" operations, such as groupByKey and reduceByKey, it uses the largest parent RDD's number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.

Original article: http://www.cnblogs.com/wrencai/p/4231966.html
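
The tuning guide's rule of thumb of 2-3 tasks per CPU core can be turned into a quick calculation. The Python sketch below is only an illustration: the helper name `recommended_parallelism` is hypothetical, and it assumes every core in the cluster is available to Spark executors.

```python
def recommended_parallelism(total_cores: int, tasks_per_core=(2, 3)):
    """Return a (low, high) range for spark.default.parallelism.

    Applies the "2-3 tasks per CPU core" guideline to a core count.
    """
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    low_factor, high_factor = tasks_per_core
    return total_cores * low_factor, total_cores * high_factor


# The cluster from the post: two machines with 8 cores each.
low, high = recommended_parallelism(2 * 8)
print(f"suggested spark.default.parallelism: {low} to {high}")
```

For the two-machine, 16-core cluster described in the post this suggests a value between 32 and 48, noticeably higher than the 10 used in the example configuration. The guideline is a starting point, so it is worth measuring a few values on the real workload.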
