Both frameworks are used to parallelize computations over massive amounts of data.
However, Storm is good at dynamically processing large numbers of small data items as they are generated or collected (such as calculating an aggregation function or analytics in real time on a Twitter stream).
Spark, like Hadoop, operates on a corpus of existing data that has been imported into the Spark cluster. It provides fast scanning thanks to in-memory management and minimizes the total number of I/O operations for iterative algorithms.
Founder
Hi Gilles,
I used Storm for a personal project, and I was impressed by its performance, its ease of use, and how simple it is to deploy to an AWS cluster. But I think that in most cases we do not need that disproportionate level of performance.
- April 26, 2013
CSO (Chief Science Officer) at TEDEMIS
Hi Zahir,
I understand that I probably don't need such a level of performance (but who will complain?).
I'm looking to use it for a business-critical app, so high availability and transaction security are big requirements that I would like to achieve by using such a framework.
- April 29, 2013
R&D Director at EURA NOVA
Dear Gilles,
Spark and Storm cannot be directly compared. Spark is more of a simplification of distributed processing frameworks (such as Hadoop MR), using an in-memory approach (the RDDs) and applying interesting concepts such as operator placement to minimize I/O. On the other hand, Spark D-Stream is a stream processing framework that leverages Spark RDDs to create discrete sets of events (mini-batches). As a result, D-Stream can be compared to the transactional topologies of Storm or, perhaps more appropriately, to Storm Trident.
In terms of performance, the only benchmark that exists was recently published by the D-Stream team, and it only used Storm without the transactional topologies, which is unfair.
But in terms of maturity, Storm is far ahead, while D-Stream is only a research project with an open source implementation.
The Storm community is really active and responsive; you can expect an answer within a day. The project on GitHub is really well documented, and it has a starter kit that shows you everything you need to get started.
- May 2, 2013
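To make the mini-batch idea mentioned above a bit more concrete, here is a minimal pure-Python sketch. It does not use the Spark API; the function names `discretize` and `count_per_batch`, the sample events, and the one-second interval are all assumptions made for illustration. It only shows the general principle of cutting a stream into discrete sets of events and running a small batch job on each one.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive mini-batches.

    Batch k covers the half-open time range
    [k * batch_interval, (k + 1) * batch_interval).
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    # Return the batches in time order as (batch_index, values) pairs.
    return sorted(batches.items())


def count_per_batch(events, batch_interval):
    """Run a small batch job (a word count) on every mini-batch."""
    results = []
    for batch_index, values in discretize(events, batch_interval):
        counts = defaultdict(int)
        for word in values:
            counts[word] += 1
        results.append((batch_index, dict(counts)))
    return results


# Hypothetical events: (arrival time in seconds, word).
events = [(0.1, "storm"), (0.4, "spark"), (0.9, "spark"),
          (1.2, "storm"), (2.7, "spark")]
batch_counts = count_per_batch(events, batch_interval=1.0)
```

Each entry of `batch_counts` is the result for one interval, so results only become available once per interval rather than once per event.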
Software Engineer at JHUAPL
Cloudera is now backing Spark:
http://gigaom.com/2013/10/28/spark-is-a-really-big-deal-for-big-data-and-cloudera-gets-it/
- 6 months ago
CSO (Chief Science Officer) at TEDEMIS
Hi everyone, and thank you for your good recommendations and answers.
At TEDEMIS we have now successfully released our first platform, using a combination of Storm and Redis in production. The solution has been very stable so far, after 4 weeks of real-time processing, and I'm very happy with this choice. In fact, Redis brings a useful level of persistence to Storm. In Redis, data never stays for more than 24 hours. Long-term persistence still goes to MySQL today, but we plan to move to Cassandra in the coming months.
- 6 months ago
Apache Spark Consultant, Sigmoid Analytics
I have relied on Spark for streaming use cases. While it is comparatively much younger than Storm and its set of management/development tools is much smaller, it handles several use cases in a much easier manner. One key area where it shines is ensuring that each data item is processed only once, something that Storm ensures through its Trident layer.
- 4 months ago
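For readers wondering what "processed only once" can look like in practice, here is a minimal Python sketch of one common approach: store the id of the last applied batch next to the state, and ignore a batch whose id has already been applied. This is roughly the idea behind transactional state in Trident as I understand it, but the class `TransactionalCounter` and its methods are hypothetical names and not part of any Storm or Spark API. The sketch assumes that batches arrive in order and that a replayed batch carries the same id and the same contents.

```python
class TransactionalCounter:
    """Counter state that remembers the id of the last batch applied to it."""

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def apply_batch(self, txid, items):
        """Apply a batch at most once; a replay of the same txid is ignored.

        In a real store, the count and the txid would have to be written
        together atomically. Here they are just two in-memory fields.
        """
        if txid == self.last_txid:
            return False  # This batch was already applied; skip the replay.
        self.count += len(items)
        self.last_txid = txid
        return True


state = TransactionalCounter()
state.apply_batch(1, ["a", "b"])
state.apply_batch(2, ["c"])
# Simulated retry of batch 2 after a failure: it must not be counted twice.
replayed = state.apply_batch(2, ["c"])
```

After the simulated retry, `state.count` still reflects three items, because the second delivery of batch 2 is recognized by its id and dropped.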
Chief Architect at Mobile Advertising Startup
I have been experimenting with providing near-real-time analytics for a high-performance ad server. I am forced to move away from Storm+Trident for the following reasons: a very poorly documented API and examples (nothing beyond word counting!); absolutely no documentation in the code; very hard to debug; no natural support for SQL-like semantics; and the abstractions of MapState, with layer upon layer of maps, are far too complex to reason about. I am going to try Spark. That said, I still have to evaluate the "process only once" semantics and the performance of Spark.
- 2 months ago
Apache Spark Consultant, Sigmoid Analytics
Spark is quite actively supported, with an active community. We provide extensive documentation around Spark at docs.sigmoidanalytics.com
Since Spark supports general Scala functions, you can support SQL semantics in a fault-tolerant, streaming fashion. Let me know if you need help implementing your pipeline: mayur@sigmoidanalytics.com
- 2 months ago
Experienced Ruby, Erlang, and Scala software engineer
With Storm, you move data to code. With Spark, you move code to data.
Explained here: http://stackoverflow.com/questions/16685214/compare-in-memory-cluster-computing-systems
- 2 months ago
It's because of micro-batching that Spark gives you NEAR-real-time processing. Spark Streaming is essentially a batch processing platform retrofitted for near-real-time processing. Storm is built from the ground up for real-time processing only.
Windowing operations can easily be performed in Storm as well. One of the common use cases for Storm is aggregation over a real-time stream.
It would be interesting to compare the turnaround time, from the point a message enters the system to the point a metric is updated, between Spark Streaming and Storm. I have not seen many benchmarks targeting that.
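As an illustration of the kind of windowed aggregation discussed here, the following Python sketch keeps a sliding-window count that is updated on every incoming event, which is the per-event style a Storm bolt would typically follow. It is not Storm code: the class `SlidingWindowCounter`, the 10-second window, and the sample timestamps are assumptions for the example, and it presumes that timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count the events seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        """Record one event and return the count in the window ending at it.

        The window is the half-open range
        (timestamp - window_seconds, timestamp].
        """
        self.timestamps.append(timestamp)
        # Evict events that have fallen out of the window. The event just
        # added is always inside the window, so the deque never empties.
        while self.timestamps[0] <= timestamp - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps)


counter = SlidingWindowCounter(window_seconds=10)
# Hypothetical event arrival times, in seconds.
counts = [counter.add(t) for t in (1, 4, 9, 12, 25)]
```

The metric is refreshed as soon as each event arrives, which is exactly the turnaround time the comparison above is asking about.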
In addition to real-time processing of streams, Spark also lets you do batch-level processing, as Hadoop does.
Performance-wise, Spark is clearly ahead of Storm.
So I would say that if I had the freedom to choose between Spark and Storm, I would select Spark.
Since Spark relies on micro-batching to simulate real-time processing, it's likely to be slower than true real-time processing systems like Storm. If you have benchmark data that shows otherwise, please share the links.
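The reasoning behind this claim can be shown with a toy latency model in Python. This is simple arithmetic, not a benchmark: the function names, the 1-second batch interval, and the 0.1-second processing time are made-up assumptions, and the model ignores scheduling, queueing, and network costs entirely. It only shows that in a micro-batch system an event first waits for its batch to close, while a per-event system can start processing right away.

```python
import math


def micro_batch_latency(arrival_time, batch_interval, processing_time):
    """Time from an event's arrival until its batch result is available."""
    # The batch holding the event closes at the next interval boundary.
    batch_close = (math.floor(arrival_time / batch_interval) + 1) * batch_interval
    waiting_time = batch_close - arrival_time
    return waiting_time + processing_time


def per_event_latency(processing_time):
    """Time from arrival to result when each event is processed on its own."""
    return processing_time


# Hypothetical arrival times (seconds) within a 1-second batch interval.
arrivals = [0.0, 0.25, 0.5, 0.75]
batch_latencies = [
    micro_batch_latency(t, batch_interval=1.0, processing_time=0.1)
    for t in arrivals
]
mean_batch_latency = sum(batch_latencies) / len(batch_latencies)
event_latency = per_event_latency(processing_time=0.1)
```

Under these assumptions the average turnaround of the micro-batch path is dominated by the waiting time, not by the processing time. Whether that holds on a real cluster is precisely what a proper benchmark would have to show.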
Finally, to be fair, let's be clear that Storm and Spark Streaming cannot be compared directly, as the aggregation operators are offered out of the box by Spark and must be implemented manually in Storm. As a result, the benchmark depends mainly on your Storm implementation. To be more accurate, we should compare Trident and Spark. Note that Trident usually involves a significant drop in performance because of the reliability mechanism it offers.
1. Throughput is not the same as turnaround time; I would love to see some benchmarks there. As Storm relies on passing data through its system, its performance is bottlenecked by the network.
2. This benchmark was done by AMPLab, hence Storm may not have been tuned for the best performance.
That said, even if Storm delivers equal or similar performance, I do believe that the benefit of reusing your Hadoop stack for streaming, and of having a framework that can be leveraged for data warehousing, machine learning, and analytics, tips the scales heavily in favour of Spark.