Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing


1) Goals and Overview
Our goal is to provide an abstraction that supports applications with working sets (i.e., applications that reuse an intermediate result in multiple parallel operations) while preserving the attractive properties of MapReduce and related models: automatic fault tolerance, locality-aware scheduling, and scalability. RDDs should be as easy to program against as data flow models, but capable of efficiently expressing computations with working sets.

Out of our desired properties, the most difficult one to support efficiently is fault tolerance. In general, there are two options to make a distributed dataset fault-tolerant: checkpointing the data or logging the updates made to it. In our target environment (large-scale data analytics), checkpointing the data is expensive: it would require replicating big datasets across machines over the datacenter network, which typically has much lower bandwidth than the memory bandwidth within a machine, and it would also consume additional storage (replicating data in RAM would reduce the total amount that can be cached, while logging it to disk would slow down applications). Consequently, we choose to log updates. However, logging updates is also expensive if there are many of them. Consequently, RDDs only support coarse-grained transformations, where we can log a single operation to be applied to many records. We then remember the series of transformations used to build an RDD (i.e., its lineage) and use it to recover lost partitions.

While supporting only coarse-grained transformations restricts the programming model, we have found RDDs suitable for a wide range of applications. In particular, RDDs are well-suited for data-parallel batch analytics applications, including data mining, machine learning, and graph algorithms, because these programs naturally perform the same operation on many records. RDDs would be less suitable for applications that asynchronously update shared state, such as a parallel web crawler. However, our goal is to provide an efficient programming model for a large array of analytics applications, and leave other applications to specialized systems.
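To make the trade-off concrete, here is a minimal Scala sketch of why coarse-grained transformations are cheap to log; the dataset path and the map function are illustrative, not taken from the paper:

// Minimal sketch: one coarse-grained transformation covers every record,
// so the lineage log holds a single entry per derived dataset.
val lines = spark.textFile("hdfs://...")   // base RDD in stable storage
val lengths = lines.map(_.length)          // logged once as: lengths = lines.map(f)
// A lost partition of 'lengths' is recomputed by re-applying the map to the
// corresponding partition of 'lines'; no per-record update log is required.
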
2) RDD Abstraction
Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) a dataset in stable storage or (2) other existing RDDs. We call these operations transformations to differentiate them from other operations that programmers may apply on RDDs. Examples of transformations include map, filter, groupBy and join. RDDs do not need to be materialized at all times. Instead, an RDD has enough information about how it was derived from other datasets (i.e., its lineage) to compute its partitions from data in stable storage.
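The transformations named above can be illustrated with the following hedged sketch; the file path, tab-separated layout, and key choice are assumptions made for this example, not part of the original text:

// Each value below is an RDD defined purely by its lineage; nothing is
// materialized until an action is invoked on it.
val events = spark.textFile("hdfs://...")                     // from stable storage
val pairs = events.map(line => (line.split('\t')(0), line))   // derived from another RDD
val errors = pairs.filter(_._2.contains("ERROR"))
val byKey = errors.groupBy(_._1)                              // group records sharing a key
val joined = errors.join(pairs)                               // join two keyed RDDs
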
3) Programming Model
In Spark, RDDs are represented by objects, and transformations are invoked using methods on these objects. After defining one or more RDDs, programmers can use them in actions, which are operations that return a value to the application or export data to a storage system. Examples of actions include count (which returns the number of elements in the RDD), collect (which returns the elements themselves), and save (which outputs the RDD to a storage system). In Spark, RDDs are only computed the first time they are used in an action (i.e., they are lazily evaluated), allowing the runtime to pipeline several transformations when building an RDD.

Programmers can also control two other aspects of RDDs: caching and partitioning. A user may ask for an RDD to be cached, in which case the runtime will store partitions of the RDD that it has computed to speed up future reuse. Cached RDDs are typically stored in memory, but they spill to disk if there is not enough memory. Lastly, RDDs optionally allow users to specify a partitioning order based on a key associated with each record. We currently support hash and range partitioning. For example, an application may request that two RDDs be hash-partitioned in the same way (placing records with the same keys on the same machine) to speed up joins between them. Consistent partition placement across iterations is one of the main optimizations in Pregel and HaLoop, so we let users express this optimization.
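The caching and partitioning controls described above can be sketched as follows; this hedged example assumes Spark's HashPartitioner and two tab-separated input files, neither of which appears in the original text:

import org.apache.spark.HashPartitioner

// Hash-partition both RDDs with the same partitioner so that records with the
// same key land on the same machine, and cache one of them for reuse.
val part = new HashPartitioner(100)
val visits = spark.textFile("hdfs://...")                 // lines of "url <TAB> visitor"
  .map(l => (l.split('\t')(0), l.split('\t')(1)))
  .partitionBy(part)
  .cache()
val urlInfo = spark.textFile("hdfs://...")                // lines of "url <TAB> info"
  .map(l => (l.split('\t')(0), l.split('\t')(1)))
  .partitionBy(part)
// Because the two RDDs are co-partitioned, the join needs no extra shuffle.
val joined = visits.join(urlInfo)
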
4) Example: Console Log Mining
We illustrate RDDs through an example use case. Suppose that a large website is experiencing errors and an operator wants to search terabytes of logs in the Hadoop filesystem (HDFS) to find the cause. Using Spark, our implementation of RDDs, the operator can load just the error messages from the logs into RAM across a set of nodes and query them interactively. She would first type the following Scala code at the Spark interpreter:
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.cache()

Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text), while line 2 derives a filtered RDD from it. Line 3 asks for errors to be cached. Note that the argument to filter is Scala syntax for a closure. At this point, no work has been performed on the cluster. However, the user can now use the RDD in actions, e.g., to count the number of messages:

errors.count()

The user can also perform further transformations on the RDD and use their results, as in the following lines:
// Count errors mentioning MySQL:
errors.filter(_.contains("MySQL")).count()
// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS")).map(_.split('\t')(3)).collect()

After the first action involving errors runs, Spark will cache the partitions of errors in memory, greatly speeding up subsequent computations on it. Note that the base RDD, lines, is never cached. This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).

Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.
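To make the pipelining and recovery story concrete, here is a conceptual Scala sketch, not the actual Spark scheduler code, with function names invented for illustration: one function stands for the task run over a cached partition of errors, and the other for rebuilding a lost partition of errors from its lineage:

// Both transformations of the third query run record-by-record in one task,
// without materializing an intermediate dataset.
def runTaskOnPartition(errorsPartition: Iterator[String]): Array[String] =
  errorsPartition
    .filter(_.contains("HDFS"))
    .map(_.split('\t')(3))
    .toArray

// A lost partition of 'errors' is recomputed by re-applying its lineage
// (the filter on "ERROR") to only the corresponding partition of 'lines'.
def rebuildErrorsPartition(linesPartition: Iterator[String]): Iterator[String] =
  linesPartition.filter(_.startsWith("ERROR"))
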
Reading the RDD paper is very helpful for understanding Spark: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf
If anyone is interested, please translate it.