Sunday, May 11, 2014

Spark vs Storm

https://www.linkedin.com/groups/Can-anyone-share-some-experience-4158686.S.235367680

Both frameworks are used to parallelize computations over massive amounts of data.
However, Storm is good at dynamically processing large numbers of small data items as they are generated or collected (such as computing an aggregation function or analytics in real time on a Twitter stream).
Spark operates on a corpus of existing data (like Hadoop) that has been imported into the Spark cluster; it provides fast scanning thanks to in-memory management and minimizes the overall number of I/Os for iterative algorithms.


  • Founder
    Hi Gilles,
    I used Storm for a personal project, and I was impressed by its performance and ease of use; it was simple to deploy to an AWS cluster. But I think that in most cases we do not need that disproportionate level of performance.
  • Gilles Vandelle
    CSO (Chief Science Officer) at TEDEMIS
    Hi Zahir,
    I understand that I probably don't need such a level of performance (but who will complain?).
    I'm looking to use it for a business-critical app, so high availability and transaction security are big requirements that I would like to achieve by using such a framework.
  • Sabri Skhiri
    R&D Director at EURA NOVA
    Dear Gilles,
    Spark and Storm cannot be directly compared. Spark is more a simplification of distributed processing frameworks (such as Hadoop MR), using an in-memory approach (the RDDs) and applying interesting concepts such as operator placement to minimize I/O. On the other hand, Spark D-Stream is a stream processing framework that leverages Spark RDDs to create discrete sets of events (mini-batches). As a result, D-Stream can be compared to the Transactional Topologies of Storm or, perhaps more appropriately, to Storm Trident.
    In terms of performance, the only benchmark that exists was recently published by the D-Stream team, and it only tested Storm without the transactional topologies, which is unfair.
    But in terms of maturity, Storm is far ahead, while D-Stream is only a research project with an open source implementation.
    The Storm community is really active and responsive; you can expect an answer within a day. The project on GitHub is really well documented, and there is a starter kit that shows you everything you need to get started.
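The mini-batch idea described in the comment above can be illustrated with a small sketch. This is plain Python rather than the Spark API; the `Event` shape, the `discretize` function, and the sample timestamps are hypothetical names and values chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (timestamp in seconds, payload).
Event = Tuple[float, str]


def discretize(events: Iterable[Event], batch_interval: float) -> Dict[int, List[str]]:
    """Group a stream of events into consecutive mini-batches.

    Every event whose timestamp falls in [k * interval, (k + 1) * interval)
    lands in batch k, so the stream becomes a sequence of small data sets.
    """
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(payload)
    return dict(batches)


# Illustrative input only.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d")]
batches = discretize(events, batch_interval=1.0)
```

Each resulting batch can then be processed as a small, bounded data set, which is the sense in which a mini-batch system reuses batch-style operators for streaming.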
  • Clark Updike
  • Gilles Vandelle
    CSO (Chief Science Officer) at TEDEMIS
    Hi everyone, and thank you for your good recommendations and answers.
    At TEDEMIS we have now successfully released our first platform using a combination of STORM and REDIS in production. The solution has been very stable so far, and after 4 weeks of real-time processing I'm very happy with this choice. In fact, REDIS brings STORM a useful level of persistence. In REDIS, data never stays more than 24h. Long-term persistence still goes to MySql today, but we plan to move to CASSANDRA in the coming months.
  • Mayur Rustagi
    Apache Spark Consultant, Sigmoid Analytics
    I have relied on Spark for streaming use cases. While it is comparatively much younger than Storm and its management/development tooling is much less extensive, it handles several use cases in a much easier manner. One key area where it shines is ensuring that each record is processed only once, something that Storm ensures through its Trident version.
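One common way to get a "processed only once" effect when batches may be replayed after a failure is to tag each batch with an increasing transaction id and make the state update idempotent. The sketch below shows that general idea in plain Python; it is not the Storm or Spark API, and `TransactionalCounter` and its methods are hypothetical names.

```python
from typing import Optional, Sequence


class TransactionalCounter:
    """Counts records, ignoring batches whose transaction id was already applied."""

    def __init__(self) -> None:
        self.count = 0
        self.last_txid: Optional[int] = None

    def apply(self, txid: int, batch: Sequence[str]) -> bool:
        """Apply a batch once; return False if it is a replay of an old batch.

        Assumes transaction ids are assigned in increasing order and that
        batches are applied in that order.
        """
        if self.last_txid is not None and txid <= self.last_txid:
            return False  # Replayed batch: state already reflects it.
        self.count += len(batch)
        self.last_txid = txid
        return True


counter = TransactionalCounter()
counter.apply(1, ["a", "b"])
counter.apply(1, ["a", "b"])  # Replay after a simulated failure, ignored.
counter.apply(2, ["c"])
```

In a real deployment the count and the last applied id would have to be written to the external store atomically; the sketch keeps both in memory to stay short.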

  • Venki
    Chief Architect at Mobile Advertising Startup
    I have been experimenting with providing near-real-time analytics for a high-performance ad server. I am forced to move away from Storm+Trident for the following reasons: a very poorly documented API and examples (nothing beyond word counting!); absolutely no documentation in the code; very hard to debug; no natural support for SQL-like semantics; and the abstractions of MapState and layer upon layer of Maps are far too complex to reason about. I am going to try Spark. That said, I still have to evaluate the "process only once" semantics and the performance of Spark.
  • Mayur Rustagi
    Apache Spark Consultant, Sigmoid Analytics
    Spark is quite actively supported and has an active community. We provide extensive documentation on Spark at docs.sigmoidanalytics.com
    Since Spark supports general Scala functions, you can support SQL semantics in a fault-tolerant, streaming fashion. Let me know if you need help implementing your pipeline: mayur@sigmoidanalytics.com
  • Ngoc Dao
    Experienced Ruby, Erlang, and Scala software engineer
    With Storm, you move data to code. With Spark, you move code to data.
    Explained here: http://stackoverflow.com/questions/16685214/compare-in-memory-cluster-computing-systems
  • Pranab Ghosh
    Big Data Consultant at Verizon
    Storm is designed for real-time processing. I have successfully used Storm in projects. I have not done any benchmarking of Storm and Spark. From what I understand, Spark provides near-real-time stream processing capability. Another option you may want to consider is Apache Samza, originally from LinkedIn.
  • Mayur Rustagi
    Apache Spark Consultant, Sigmoid Analytics
    That's actually a good point, Ngoc. However, the description ignores Spark Streaming and talks about the Spark platform in general. Spark Streaming leverages micro-batches to perform stream processing in a much more elegant fashion than Storm. It provides higher-level semantics than Storm, such as windowing operations (number of visitors in the last 2 hrs) and transactional processing (ensuring each event is processed only once). In Storm you have to implement these yourself, using custom code and Trident respectively.
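The windowing operation mentioned above (for example, visits over the last N intervals) can be sketched over a sequence of micro-batches. This is a plain-Python illustration, not Spark Streaming's or Storm's API; `SlidingWindowCounter` and its parameters are hypothetical names, and it counts visit events rather than unique visitors.

```python
from collections import deque
from typing import Deque, Sequence


class SlidingWindowCounter:
    """Counts events over the most recent `window_batches` micro-batches."""

    def __init__(self, window_batches: int) -> None:
        # A bounded deque drops the oldest batch count automatically.
        self._counts: Deque[int] = deque(maxlen=window_batches)

    def add_batch(self, batch: Sequence[str]) -> int:
        """Record one micro-batch and return the count over the current window."""
        self._counts.append(len(batch))
        return sum(self._counts)


window = SlidingWindowCounter(window_batches=2)
first = window.add_batch(["u1", "u2"])
second = window.add_batch(["u3"])
third = window.add_batch([])  # The first batch has now slid out of the window.
```

With a 1-minute batch interval, a 2-hour window would correspond to `window_batches=120` under these assumptions.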
  • Pranab Ghosh
    Big Data Consultant at Verizon
    Mayur
    It's because of micro-batching that Spark gives you NEAR real-time processing. Spark Streaming is essentially a batch processing platform retrofitted for near-real-time processing. Storm is built from the ground up for real-time processing only.

    Windowing operations can easily be performed in Storm as well. One of the common use cases for Storm is aggregation over a real-time stream.
  • Mayur Rustagi
    Apache Spark Consultant, Sigmoid Analytics
    In terms of throughput, Spark gives performance comparable to Storm; in most benchmarks it beats Storm (but those may be biased :) ).
    It would be interesting to compare the turnaround time between Spark Streaming and Storm, from the point a message enters the system to the point a metric is updated. I have not seen many benchmarks targeting that.
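The distinction between throughput and turnaround time raised here can be made concrete with a deliberately simplified model. It assumes that an event in a micro-batch system becomes visible only after its batch interval closes and the batch is processed, while a per-record system pays only the processing time. The function names and the sample numbers are hypothetical and are not measurements of either system.

```python
import math


def micro_batch_latency(arrival: float, batch_interval: float, processing_time: float) -> float:
    """Turnaround time for one event under the simplified micro-batch model.

    The event waits until the end of the interval it arrived in, then the
    whole batch takes `processing_time` seconds to process.
    """
    batch_close = (math.floor(arrival / batch_interval) + 1) * batch_interval
    return batch_close + processing_time - arrival


def per_record_latency(processing_time: float) -> float:
    """Turnaround time when each event is processed as soon as it arrives."""
    return processing_time


# Illustrative values only: 1 s batches, 0.1 s of processing.
early = micro_batch_latency(arrival=0.25, batch_interval=1.0, processing_time=0.1)
late = micro_batch_latency(arrival=0.99, batch_interval=1.0, processing_time=0.1)
```

Under this model the added delay depends on where in the interval an event arrives, which is why a throughput figure alone says little about turnaround time.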
  • Deepak Kumar
    NA
    Apache Spark is surely ahead of Storm.
    In addition to real-time processing of streams, Spark also lets you do batch processing as in Hadoop.
    Performance-wise, Spark is clearly ahead of Storm.
    So I would say that if I had the freedom to choose between Spark and Storm, I would select Spark.
  • Pranab Ghosh
    Big Data Consultant at Verizon
    Deepak

    Since Spark relies on micro-batching to simulate real-time processing, it's likely to be slower than truly real-time processing systems like Storm. If you have benchmark data that shows otherwise, please share the links.
  • Sabri Skhiri
    R&D Director at EURA NOVA
    @Pranab At first sight you might think that mini-batches hurt performance, but it actually depends on the stream processing you apply. For instance, the impressive performance difference shown in http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data (slide 39) can be explained by the type of processing (in this case a grep and a TopK). Typically, aggregation jobs are significantly improved by aggregating mini-batches of records with Spark RDDs instead of processing records one by one (socket connection, marshaling, loading into memory, applying the computation, etc.). Leveraging mini-batches within RDDs and adding processing optimizations such as the reducer/combiner can lead to much better performance.

    Finally, to be fair, let's be clear that Storm and Spark Streaming cannot be compared directly, as the aggregation operators are offered out of the box by Spark and must be implemented manually in Storm. As a result, the benchmark depends mainly on your Storm implementation. To be more accurate, we should compare Trident and Spark. Note that Trident usually involves a significant drop in performance because of the reliability mechanism it offers.
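The argument above, that aggregating a mini-batch before touching shared state can beat record-by-record updates, can be sketched by counting how many writes each approach issues. This is a plain-Python illustration; `CountingStore`, `per_record`, and `per_batch` are hypothetical names, and the write counter is only a stand-in for real per-operation costs such as network round trips and marshaling.

```python
from collections import Counter
from typing import List


class CountingStore:
    """Hypothetical key-value store that also counts write operations."""

    def __init__(self) -> None:
        self.data: Counter = Counter()
        self.writes = 0

    def increment(self, key: str, amount: int = 1) -> None:
        self.data[key] += amount
        self.writes += 1


def per_record(store: CountingStore, records: List[str]) -> None:
    """One store write per incoming record."""
    for key in records:
        store.increment(key)


def per_batch(store: CountingStore, records: List[str], batch_size: int) -> None:
    """Pre-aggregate each mini-batch (combiner step), then write once per key."""
    for start in range(0, len(records), batch_size):
        partial = Counter(records[start:start + batch_size])
        for key, amount in partial.items():
            store.increment(key, amount)


records = ["a", "b", "a", "a", "b", "a"]  # Illustrative input only.
one_by_one, batched = CountingStore(), CountingStore()
per_record(one_by_one, records)
per_batch(batched, records, batch_size=3)
```

Both stores end with the same counts, but the batched version issues fewer writes whenever keys repeat within a batch; how much that matters in practice depends on the workload, as the comment notes.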
  • Pranab Ghosh
    Big Data Consultant at Verizon
    @Sabri Thank you for sharing the slides. The drop in Storm performance as record size increases, as shown in slide 39, is very interesting.
  • Mayur Rustagi
    Apache Spark Consultant, Sigmoid Analytics
    Again:
    1. Throughput is not the same as turnaround time; I would love to see some benchmarks there. As Storm relies on passing data through its system, its performance is bottlenecked by the network.
    2. This benchmark was done by AMPLab, so Storm may not have been tuned for best performance.
    That said, even if Storm has equal or similar performance, I do believe the benefit of reusing your Hadoop stack for streaming and having a framework that can be leveraged for data warehousing, machine learning, and analytics tips the scales heavily in favour of Spark.
