A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. (集合、分布式,并行计算)
Created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. (数据可以来自scala集合或外部存储,可以进行transform和action)
Persist an RDD in memory, allowing it to be reused efficiently across parallel operations. (持久化中间RDD,加速计算)
RDDs automatically recover from node failures. (失败后基于lineage重建)
创建RDD
来自集合
1
2
3
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
val distData2 = sc.parallelize(data, 10) //指定partition数量
来自外部数据
Text File
1
2
val localFile = sc.textFile("file://path/data.txt")
val hdfsFile = sc.textFile("hdfs://path/data.txt")
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed. This is the default level.
MEMORY_AND_DISK
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on disk, and read them from there when they’re needed.
MEMORY_ONLY_SER
Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER
Similar to MEMORY_ONLY_SER, but spill partitions that don’t fit in memory to disk instead of recomputing them on the fly each time they’re needed.
DISK_ONLY
Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental)
Store RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts from memory. If you plan to use Tachyon as the off heap store, Spark is compatible with Tachyon out-of-the-box. Please refer to this page for the suggested version pairings.
RDD天生是支持容错的。首先,它自身是一个不变的(immutable)数据集,其次,它能够记住构建它的操作图(Graph of Operation),因此当执行任务的Worker失败时,完全可以通过操作图获得之前执行的操作,进行重新计算。由于无需采用replication方式支持容错,很好地降低了跨网络的数据传输成本。