Sunday, February 9, 2014

Spark RDD

http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html

Resilient Distributed Datasets (RDD) for the impatient

The Big Data revolution was started by Google's paper on MapReduce (MR). But the MR model mainly suits batch-oriented processing of data, and other models are being shoehorned into it because of the prevalence of Hadoop and the attention/support it gets. With the introduction of YARN in Hadoop, models besides MR can be first-class citizens in the Hadoop space.

Let's take the case of MR as shown below: there is a lot of reading from and writing to disk after each MR transformation, which makes it slow and less suitable for iterative processing, as in the case of Machine Learning.
Let's see how we can solve the iterative problem with Apache Spark. Spark is built in Scala around the concept of Resilient Distributed Datasets (RDD) and provides actions and transformations on top of RDDs. It has some of the best documentation among open source projects. There are not many resources around RDDs, but this paper and presentation are the roots of the idea. Check this to know more about RDDs from a Spark perspective.
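As a rough sketch of what transformations and actions look like in practice (assuming a spark-shell session, where the SparkContext `sc` is predefined; the file path is purely illustrative):

```scala
// Sketch only: `sc` is the SparkContext provided by spark-shell,
// and "input.txt" is a hypothetical file path.
val lines  = sc.textFile("input.txt")        // RDD of lines (lazy)
val words  = lines.flatMap(_.split("\\s+"))  // transformation: also lazy
val cached = words.cache()                   // keep the RDD in memory for reuse

// Actions trigger actual computation; the second action reuses the
// in-memory data instead of re-reading from disk, which is what makes
// iterative workloads fast compared to chained MR jobs.
val total    = cached.count()
val distinct = cached.distinct().count()
```

The key point is `cache()`: an RDD marked for caching stays in memory across actions, so an iterative algorithm that touches the same dataset many times avoids the per-step disk I/O that MR incurs.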
