Friday, March 28, 2014

Spark RDD

http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html

Resilient Distributed Datasets (RDD) for the impatient

There were not many resources on RDDs, but this paper and presentation are the roots of RDD. Check this to learn more about RDDs from a Spark perspective.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.
To summarize, the MapReduce (MR) model is less suited to iterative processing than the RDD model. Performance metrics for iterative and other kinds of processing are covered in detail in this paper on RDDs.
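
To make the formal definition a bit more concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the class ToyRDD and its methods from_storage and compute are hypothetical names made up for this illustration. It only tries to show the two ways an RDD can come into existence (from data in stable storage, or from another RDD through a deterministic operation) and that a derived dataset never modifies its parent.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark class).

    A ToyRDD is either backed by partitions read from "stable storage"
    or defined by a parent ToyRDD plus a deterministic per-partition function.
    """

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # tuple of tuples: immutable partitions
        self._parent = parent  # the ToyRDD this one was derived from
        self._fn = fn          # deterministic function applied to one partition

    @classmethod
    def from_storage(cls, records, num_partitions):
        # (1) Create from data in stable storage, split into fixed partitions.
        parts = tuple(
            tuple(records[i::num_partitions]) for i in range(num_partitions)
        )
        return cls(source=parts)

    def map(self, f):
        # (2) Create from another ToyRDD; the parent is left untouched.
        return ToyRDD(parent=self, fn=lambda part: tuple(f(x) for x in part))

    def filter(self, pred):
        # (2) Same idea: a new read-only dataset derived from the parent.
        return ToyRDD(
            parent=self, fn=lambda part: tuple(x for x in part if pred(x))
        )

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def compute(self, i):
        # Rebuild partition i by replaying the chain of operations
        # from the stored source data.
        if self._parent is None:
            return self._source[i]
        return self._fn(self._parent.compute(i))

    def collect(self):
        # Gather all partitions into one list, partition by partition.
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


# Usage: build a base dataset, then derive a new one from it.
base = ToyRDD.from_storage(list(range(10)), num_partitions=3)
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
```

Calling even_squares.collect() replays filter and map over each partition of base, while base itself still holds the original ten records. The sketch deliberately leaves out everything that makes real RDDs useful for iterative jobs, such as keeping partitions in memory across iterations and distributing them over a cluster.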

