Friday, March 28, 2014

Spark for Big Analytics

http://blog.revolutionanalytics.com/2013/12/apache-spark.html

Spark seeks to address the critical challenges for advanced analytics in Hadoop. First, Spark is designed to support in-memory processing, so developers can write iterative algorithms without writing out a result set after each pass through the data. This enables true high-performance advanced analytics; for iterative techniques like logistic regression, project sponsors report runtimes in Spark up to 100X faster than what they achieve with MapReduce.
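To make the in-memory point concrete, here is a minimal Scala sketch of iterative logistic regression against the Spark API of that era, following the classic pattern from Spark's own examples; the HDFS path, input format, and feature count are hypothetical.

    import org.apache.spark.SparkContext
    import scala.math.exp
    import scala.util.Random

    object LogisticRegressionSketch {
      // Each point: a feature vector and a +1/-1 label
      case class Point(x: Array[Double], y: Double)

      def main(args: Array[String]) {
        val sc = new SparkContext("local", "LogisticRegressionSketch")

        // Hypothetical input: one point per line, features then label,
        // space-separated. Parsed once, then cached in memory.
        val points = sc.textFile("hdfs:///data/points.txt").map { line =>
          val nums = line.split(' ').map(_.toDouble)
          Point(nums.init, nums.last)
        }.cache()

        val d = 2                                  // feature count; must match the file
        var w = Array.fill(d)(Random.nextDouble)   // random initial weights

        for (i <- 1 to 10) {
          // Each pass scans the cached RDD; no intermediate result set
          // is written out between iterations.
          val gradient = points.map { p =>
            val margin = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
            val coeff  = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
            p.x.map(_ * coeff)
          }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
          w = w.zip(gradient).map { case (wi, gi) => wi - gi }
        }

        println("Final weights: " + w.mkString(", "))
      }
    }

The cache() call is the key design point: without it, each of the ten passes would re-read and re-parse the input from disk, which is exactly the MapReduce behavior Spark avoids.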
Second, Spark offers an integrated framework for advanced analytics, including a machine learning library (MLlib), a graph engine (GraphX), a streaming analytics engine (Spark Streaming), and a fast interactive query tool (Shark). This eliminates the need to support multiple point solutions, such as Giraph or GraphLab for graph analytics, Storm or S4 for streaming, and Hive or Impala for interactive queries. A single platform simplifies integration and ensures that users can produce consistent results across different types of analysis.
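The integration shows in the code: Spark Streaming reuses the same functional operators as core Spark, so batch and streaming jobs look alike. Below is a minimal streaming word-count sketch in Scala; the socket source on localhost:9999 and the ten-second batch interval are illustrative assumptions.

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // implicits for reduceByKey

    object StreamingWordCountSketch {
      def main(args: Array[String]) {
        // Ten-second batches over a local two-thread master
        val ssc = new StreamingContext("local[2]", "StreamingWordCountSketch", Seconds(10))

        val lines  = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)

        counts.print()   // emit counts for each ten-second batch
        ssc.start()
        ssc.awaitTermination()
      }
    }

The flatMap, map, and reduceByKey calls here are the same operators a batch job would use on an RDD, which is what lets one team maintain one codebase across workloads.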
At Spark's core is an abstraction layer called Resilient Distributed Datasets, or RDDs. RDDs are read-only partitioned collections of records created through deterministic operations on stable data or other RDDs. RDDs include information about data lineage together with instructions for data transformation and (optional) instructions for persistence. They are designed to be fault tolerant: if a partition is lost, Spark can recompute it from its lineage rather than restore it from a replica or checkpoint.
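A short Scala sketch of lineage in practice; the log file path and record layout are hypothetical. Each transformation below is deterministic and recorded by Spark, so a lost partition of any derived RDD can be rebuilt from the stable input file.

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext("local", "LineageSketch")

    val logs   = sc.textFile("hdfs:///logs/app.log")   // stable data
    val errors = logs.filter(_.contains("ERROR"))      // deterministic transformation
    val codes  = errors.map(_.split("\t")(0))          // one more step in the lineage

    // Optional persistence instruction: keep this RDD in memory once computed.
    codes.persist(StorageLevel.MEMORY_ONLY)

    println(codes.count())   // the action triggers actual computation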
For data sources, Spark works with any file stored in HDFS, or any other storage system supported by Hadoop (including local file systems, Amazon S3, Hypertable and HBase). Spark supports text files, SequenceFiles and any other Hadoop InputFormat.
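A brief Scala sketch of that flexibility; all paths and the SequenceFile key/value types are hypothetical. The same textFile call reads local files, HDFS, and S3, while sequenceFile handles Hadoop SequenceFiles.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // implicits for sequenceFile conversions

    val sc = new SparkContext("local", "InputSketch")

    val localLines = sc.textFile("/tmp/sample.txt")                // local file system
    val hdfsLines  = sc.textFile("hdfs://namenode:8020/data/in")   // HDFS
    val s3Lines    = sc.textFile("s3n://my-bucket/data/in")        // Amazon S3

    // SequenceFile of (Text, IntWritable) records, converted to (String, Int)
    val pairs = sc.sequenceFile[String, Int]("hdfs://namenode:8020/data/seq")

    println(localLines.count() + hdfsLines.count() + pairs.count())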
