Tuesday, January 28, 2014

MapReduce and Spark

http://vision.cloudera.com/mapreduce-spark/

The leading candidate for “successor to MapReduce” today is Apache Spark. Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads, and to do so much faster than the older system.
If you’re not interested in technical detail, skip the next three paragraphs.
Original MapReduce executed jobs in a simple but rigid structure: a processing or transform step (“map”), a synchronization step (“shuffle”), and a step to combine results from all the nodes in a cluster (“reduce”). If you wanted to do something complicated, you had to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs was high-latency, and none could start until the previous job had finished completely. Complex, multi-stage applications were distressingly slow.
An alternative approach is to let programmers construct complex, multi-step directed acyclic graphs (DAGs) of work that must be done, and to execute those DAGs all at once, not step by step. This eliminates the costly synchronization required by MapReduce and makes applications much easier to build. Prior research on DAG engines includes Dryad, a Microsoft Research project used internally at Microsoft for its Bing search engine and other hosted services.
Spark builds on those ideas but adds some important innovative features. For example, Spark supports in-memory data sharing across DAGs, so that different jobs can work with the same data at very high speed. It even allows cyclic data flows. As a result, Spark handles iterative graph algorithms (think social network analysis), machine learning and stream processing extremely well. Those have been cumbersome to build on MapReduce and even on other DAG engines. They are very popular applications in the Hadoop ecosystem, so simplicity and performance matter.
Spark began as a research project at UC Berkeley’s AMPLab in 2009 and was released as open source in 2010. With several years of real use, it’s had plenty of time to mature. Advanced features available in the latest release include stream processing, fast fault recovery, language-integrated APIs, optimized scheduling and data transfer and more.
One of the most interesting features of Spark — and the reason we believe it such a powerful addition to the engines available for the Enterprise Data Hub — is its smart use of memory. MapReduce has always worked primarily with data stored on disk. Spark, by contrast, can exploit the considerable amount of RAM that is spread across all the nodes in a cluster. It is smart about use of disk for overflow data and persistence. That gives Spark huge performance advantages for many workloads.

No comments: