My personal bookmarks: May 2014

http://stackoverflow.com/questions/3400734/package-objects
http://www.naildrivin5.com/scalatour/wiki_pages/PackageObjects

Normally you would put your package object in a separate file called package.scala in the package that it corresponds to. You can also use the nested package syntax but that is quite unusual.

The main use case for package objects is when you need definitions in various places inside your package as well as outside the package when you use the API defined by the package. Here is an example:

package foo

package object bar {

  // package wide constants:
  def BarVersionString = "1.0"

  // or type aliases
  type StringMap[+T] = Map[String,T]

  // can be used to emulate a package wide import
  // especially useful when wrapping a Java API
  type DateTime = org.joda.time.DateTime

  type JList[T] = java.util.List[T]

  // Define implicits needed to effectively use your API:
  implicit def a2b(a: A): B = // ...

}

One additional thing to note is that package objects are objects. Among other things, this means you can build them up from traits, using mix-in inheritance. Moritz's example could be written as

package object bar extends Versioning 
                          with JodaAliases 
                          with JavaAliases {

  // package wide constants:
  override val version = "1.0"

  // or type aliases
  type StringMap[+T] = Map[String,T]

  // Define implicits needed to effectively use your API:
  implicit def a2b(a: A): B = // ...

}

Here Versioning is an abstract trait, which says that the package object must have a "version" method, while JodaAliases and JavaAliases are concrete traits containing handy type aliases. All of these traits can be reused by many different package objects.

https://www.linkedin.com/groups/Can-anyone-share-some-experience-4158686.S.235367680

Both frameworks are used to parallelize computations of massive amount of data.

However, Storm is good at dynamically processing numerous generated/collected small data items (such as calculating some aggregation function or analytics in real time on a Twitter stream).

Spark applies on a corpus of existing data (like Hadoop) which has been imported into the Spark cluster, provides fast scanning capabilities due to in-memory management, and minimizes the global number of I/Os for iterative algorithms.

Founder

Hi Gilles,
I used Storm for personnal project, and I was impressed by the performance, and easy-use, simple to deploy to AWS cluster, but I think in major cases, we do not need that disproportioned perfs.
- Like
- Flag as inappropriate
- April 26, 2013
Gilles
Gilles Vandelle

CSO (Chief Science Officer) at TEDEMIS

Hi Zahir,
I understand that I probably don't need such level of perfs (but who will complain?).
I'm looking to use it for a business critical app. So the high availability and transaction security are big requirements that I would to achieve by using such framework.
- Like
- Flag as inappropriate
- April 29, 2013
Sabri
Sabri Skhiri

R&D Director at EURA NOVA

Dear Gilles,
Spark and Storm cannot be directly compared. Indeed, Spark is more a simplification of the distributed processing frameworks (as hadoop MR) and using an in-memory approach (the RDDs) and applying interesting concepts such as the operator placement for minimizing the I/O. In the other hand, Spark D-Stream is a Stream processing framework leveraging Spark RDDs for creating discrete sets of Events (mini-batches). As a result D-Stream can be compared to the Transactional Topologies of Storm or, perhaps more appropriate, Storm Trident.
In term of performance, the only benchmark that exist have been recently published by D-Stream team and only took Storm without the transactional topologies which is unfair.
But in term of maturity, Storm is much more ahead, while D-Stream is only a research project with an open source implementation.
The Storm community is really active and responsive, you can expect an answer within a day. The Project on github is really well documented and you have a starter kit that shows you everything you need for starting.
- Like (10)
- Flag as inappropriate
- May 2, 2013
Quentin D., Eric D. and 8 others like this
Clark
Clark Updike

Software Engineer at JHUAPL

Cloudera is now backing Spark:
http://gigaom.com/2013/10/28/spark-is-a-really-big-deal-for-big-data-and-cloudera-gets-it/
- Like
- Flag as inappropriate
- 6 months ago
Gilles
Gilles Vandelle

CSO (Chief Science Officer) at TEDEMIS

Hi everyone and thank you for your good recommandation and answers.
At TEDEMIS we have now successfully released our first platform using a combination of STORM and REDIS in production. The solution has been very stable so far and after 4 weeks of real time processing. I'm very happy with this choice. In fact REDIS bring to STORM a useful level of persistency. In REDIS data never stay more than 24h. Long term persistency is still today going to MySql but we plan to move to CASSANDRA in the coming months.
- Like
- Flag as inappropriate
- 6 months ago
Mayur
Mayur Rustagi

Apache Spark Consultant, Sigmoid Analytics

I have relied on spark for streaming usecases. While comparatively much younger than storm and management/development tools are much smaller, however it works for several usecases in a much easier manner. One key area where it shines is ensuring that each data is processed only once, something that is ensured through the trident version of Storm.
- Like
- Flag as inappropriate
- 4 months ago
Venki
Venki Tirumala

Chief Architect at Mobile Advertising Startup

I have been experimenting with providing near real-time analytics for high-performance ad-server. I am forced to move away from Storm+Trident for following reasons: very poorly documented API and examples. Nothing beyond word-counting! Absolutely no documentation in code. Very hard to debug. Does not naturally provide for SQL like semantics. The abstractions of MapState and layer over layers of Maps far too complex to reason. I am going to try Spark. That said, I still have to see evaluate the "process only once semantics" and performance of Spark
- Like
- Flag as inappropriate
- 2 months ago
Mayur
Mayur Rustagi

Apache Spark Consultant, Sigmoid Analytics

Spark is quite actively supported with active community. We provide extensive documentation around Spark @ docs.sigmoidanalytics.com
Since Spark supports general scala functions you can support SQL semantics in a fault tolerant & streaming fashion. Let me know if you need help in implementing your pipeline @ mayur@sigmoidanalytics.com
- Like
- Flag as inappropriate
- 2 months ago
Ngoc
Ngoc Dao

Experienced Ruby, Erlang, and Scala software engineer

With Storm, you move data to code. With Spark, you move code to data.
Explained here: http://stackoverflow.com/questions/16685214/compare-in-memory-cluster-computing-systems
- Like (3)
- Flag as inappropriate
- 2 months ago
Boris M., Delphine L. and 1 other like this

Pranab
Pranab Ghosh

Big Data Consultant at Verizon

Storm is designed for real time processing. I have successfully used Storm in projects. I have not done any benchmarking of Storm and Spark. Form what I understand, Spark provides near real time stream processing capability. Another option you may want to consider is Apache Samza, originally from Linkedin.
- Like (1)
- Flag as inappropriate
- 2 months ago
Delphine L. likes this
Mayur
Mayur Rustagi

Apache Spark Consultant, Sigmoid Analytics

Thats actually a good point Ngoc. However the description ignores Spark Streaming & talks about Spark Platform in general. Spark streaming leverages micro-batches to perform stream processing in a much more elegant fashion than Storm. It provides higher level semantics on top of storm like Windowing operations(number of visitors in last 2 hrs ) & transactional processing (ensuring 1 event is processed only once). In storm you have implement both yourself using custom code & Trident respectively.
- Like (2)
- Flag as inappropriate
- 2 months ago
Anthony Yee, Delphine L. like this
Pranab
Pranab Ghosh

Big Data Consultant at Verizon

Mayur
It's because of micro batching, spark gives you NEAR real time processing. Spark streaming processing is essentially a batch processing platform retrofitted for for near real time processing. Storm is built ground up for real time processing only.

Windowing operations can easily be performed in storm also. One of the common use cases in storm is doing aggregation in real time stream.
- Like
- Flag as inappropriate
- 2 months ago
Mayur
Mayur Rustagi

Apache Spark Consultant, Sigmoid Analytics

In terms of throughput Spark gives a comparable performance to Storm, in most benchmarks they beat storm (but that may be biased :) ).
It would be interesting to see the turn around time from the point a message enters the system to the point a metric is updated, between Spark streaming & Storm.. I have not seen many benchmarks targetting that.
- Like
- Flag as inappropriate
- 2 months ago
Deepak
Deepak Kumar

NA

Apache Spark is sure ahead of Storm.
In addition to real time processing of streams in Spark , you can also do batch level processing as hadoop.
The performance way , Spark is clearly ahead of storm.
So i would say if I have the freedom to choose between Spark and Storm , i would select Spark.
- Like
- Flag as inappropriate
- 2 months ago
Pranab
Pranab Ghosh

Big Data Consultant at Verizon

Deepak

Since Spark relies on micro batching to simulate real time processing, it's likely to be slower than real real time processing systems like storm. If you have bench mark data that shows otherwise, please share the links.
- Like
- Flag as inappropriate
- 2 months ago
Sabri
Sabri Skhiri

R&D Director at EURA NOVA

@Pranab at first sight you might think that mini-batches impact the performance, but actually it depends of the stream processing you apply. For instance the impressive performance difference you can find in http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data (slide 39) can be explained by the type of processing (In this case a grep and a TopK). Typically the aggregation jobs will be significantly improved by the aggregation of mini-batches of records leveraging the Spark RDDs instead of processing each one 1 by 1 (connection socket, marshaling, loading in memory, applying computation, etc.). Leveraging mini-batches within RDDs and adding all the processing optimizations as the reducer/combiner can lead to much better performances.

Finally to be fair, let's be clear that Storm and Spark Streaming cannot be compared directly as the aggregation operators are offered out of the box by Spark and must be implemented manually in Storm. As a result, the benchmark mainly depends on your implementation on Storm. To be more accurate we should compare Trident and Spark. Notice that usually Trident involved an important drop of performance because of the reliability mechanism it offers.
- Like (5)
- Flag as inappropriate
- 2 months ago
Pranab G., Eric D. and 3 others like this
Pranab
Pranab Ghosh

Big Data Consultant at Verizon

@Sabri Thank you for sharing the slides. Drop in storm performance with the increase in record size as in slide 39 is very interesting.
- Like
- Flag as inappropriate
- 2 months ago
Mayur

Mayur Rustagi

Apache Spark Consultant, Sigmoid Analytics

Again
1. The throughput is not same as turnaround time, would love to see some benchmarks thr. As Storm relies on passing data through its system, the performance is bottlenecked by network.
2. This benchmark was done by Amplabs, hence storm may not be tuned for the best performance.
That said even if storm is equal or similar performance, I do believe the benefit of reusing your hadoop stack for streaming & having a framework which can be leveraged for data warehousing, machine learning & analytics tips the scales heavily in favour of Spark.

My personal bookmarks

Saturday, May 24, 2014

How to use Python 2.x and 3.x on CentOS 6 ?

Sunday, May 18, 2014

How to schedule Cron jobs in Play 2 ?

Wednesday, May 14, 2014

How to parse PCAP files ?

Tuesday, May 13, 2014

Package objects

Sunday, May 11, 2014

Spark vs Storm

InfoQ

Blog Archive

Tags

Total Pageviews