Friday, February 28, 2014

How to remove duplicate entries from a list in Scala

# Method 1 
val dirty = List("a", "b", "a", "c")
scala> dirty.distinct
res0: List[java.lang.String] = List(a, b, c)

# Method 2 ('distinct' does not help here: the nested Arrays compare by reference, so dedup by a key instead)
val arr = Array(
  Array("1234","rjan","nilsson"),
  Array("4321","eva-lisa","nyman"),
  Array("1234","eva","nilsson")
)

arr.groupBy(_(0)).map { case (_, vs) => vs.head }.toArray
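Note that groupBy returns an unordered Map, so the order of the surviving rows is not guaranteed. A sketch of an order-preserving variant (plain Scala; dedupByKey is just an illustrative helper, not part of any library), folding over the rows and keeping the first occurrence of each key:

```scala
// Keep the first row seen for each key (element 0), preserving input order.
// The groupBy-based version loses ordering because Map is unordered.
def dedupByKey(rows: Array[Array[String]]): Array[Array[String]] =
  rows.foldLeft((Set.empty[String], Vector.empty[Array[String]])) {
    case ((seen, acc), row) =>
      if (seen(row(0))) (seen, acc)       // key already kept, skip this row
      else (seen + row(0), acc :+ row)    // first occurrence, keep it
  }._2.toArray

val arr = Array(
  Array("1234", "rjan", "nilsson"),
  Array("4321", "eva-lisa", "nyman"),
  Array("1234", "eva", "nilsson")
)

val deduped = dedupByKey(arr)
// keeps the first "1234" row and the "4321" row, in input order
```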

Wednesday, February 12, 2014

An introduction to SBT


http://www.adelbertc.com/posts/2014-01-21-basic-sbt.html

SBT Basics

Adelbert Chang - 21 January 2014
I’ve talked with a sufficient number of people who want to get into Scala but are lost as to how to get a simple SBT project started, so I figured I’d put up a quick guide.

Installation

On Mac OS I personally just use the Homebrew package manager, and run brew install sbt.
Alternatively, I know some people like to do manual installation, which I had done myself before I started using Homebrew.
To check that your installation was successful, type sbt --version at the prompt.

A note about SBT and Scala

You do NOT need Scala/scalac installed to run SBT - while SBT is written in Scala, it has already been shoved into a JAR and depends only on an installation of Java. I still recommend having Scala installed so you have access to the REPL, but the two can be considered as disjoint installations.

Directory structure

For simple projects, you should make a folder dedicated to your project. The directory structure should look something like this:
myproject/
    build.sbt
    project/
        build.properties
    src/
        main/
            scala/
                mypackage/

build.sbt

build.sbt is where the majority of your SBT configuration lives. Below is an example of a simple build.sbt file:
name := "myproject"

scalaVersion := "2.10.3"

scalacOptions ++= Seq(
  "-deprecation", 
  "-encoding", "UTF-8",
  "-feature", 
  "-unchecked"
)

resolvers ++= Seq(
  "Sonatype OSS Releases"  at "http://oss.sonatype.org/content/repositories/releases/",
  "Sonatype OSS Snapshots" at "http://oss.sonatype.org/content/repositories/snapshots/"
)

libraryDependencies ++= Seq(
  "org.scalaz"  %% "scalaz-core"      % "7.0.5",
  "com.chuusai" %  "shapeless_2.10.2" % "2.0.0-M1"
)
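The two dependency styles above differ only in how the Scala version reaches the artifact name: %% appends the Scala binary version automatically, while the shapeless line pins a full Scala version by hand (shapeless 2.0.0-M1 was cross-built per full Scala version). Roughly:

```scala
// With scalaVersion := "2.10.3", these two lines fetch the same artifact:
"org.scalaz" %% "scalaz-core"      % "7.0.5"
"org.scalaz" %  "scalaz-core_2.10" % "7.0.5"
```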

Tuesday, February 11, 2014

Shapeless - a type class and dependent type based generic programming library for Scala


https://github.com/milessabin/shapeless/wiki/Feature-overview:-shapeless-2.0.0

ReactiveMongo - Asynchronous & Non-Blocking Scala Driver for MongoDB

https://github.com/ReactiveMongo/ReactiveMongo
Scale better, use less threads
With a classic synchronous database driver, each operation blocks the current thread until a response is received. This model is simple but has a major flaw - it can't scale that much.
Imagine that you have a web application with 10 concurrent accesses to the database. That means you eventually end up with 10 frozen threads at the same time, doing nothing but waiting for a response. A common solution is to raise the number of running threads to handle more requests. Such a waste of resources is not really a problem if your application is not heavily loaded, but what happens if you have 100 or even 1000 more requests to handle, each performing several db queries? The multiplication grows really fast...
The problem becomes more and more obvious with the new generation of web frameworks. What's the point of using a nifty, powerful, fully asynchronous web framework if all your database accesses are blocking?
ReactiveMongo is designed to avoid any kind of blocking request. Every operation returns immediately, freeing the running thread and resuming execution when it is over. Accessing the database is not a bottleneck anymore.
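The non-blocking style described above can be illustrated with plain Scala Futures. This is only a sketch of the pattern, not the actual ReactiveMongo API: findUser and its body are hypothetical stand-ins for a driver call.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical async "query": returns a Future immediately,
// freeing the calling thread instead of blocking on I/O.
def findUser(id: String): Future[Option[String]] =
  Future {
    // stand-in for a network round trip to the database
    if (id == "1234") Some("nilsson") else None
  }

// Composition stays non-blocking: map chains work onto the Future,
// and execution resumes only when the response arrives.
val greeting: Future[String] =
  findUser("1234").map {
    case Some(name) => s"hello, $name"
    case None       => "unknown user"
  }
```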

Sunday, February 9, 2014

Spark RDD document

http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html#resilient-distributed-datasets-rdds

Resilient Distributed Datasets (RDDs)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are currently two types of RDDs: parallelized collections, which take an existing Scala collection and run functions on it in parallel, and Hadoop datasets, which run functions on each record of a file in Hadoop distributed file system or any other storage system supported by Hadoop. Both types of RDDs can be operated on through the same methods.

Parallelized Collections

Parallelized collections are created by calling SparkContext’s parallelize method on an existing Scala collection (a Seq object). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is some interpreter output showing how to create a parallel collection from an array:
scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
Once created, the distributed dataset (distData here) can be operated on in parallel. For example, we might call distData.reduce(_ + _) to add up the elements of the array. We describe operations on distributed datasets later on.
One important parameter for parallel collections is the number of slices to cut the dataset into. Spark will run one task for each slice of the cluster. Typically you want 2-4 slices for each CPU in your cluster. Normally, Spark tries to set the number of slices automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
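The reduce(_ + _) call mentioned above has the same meaning as reduce on an ordinary Scala collection; a minimal local sketch of what the distributed version computes (no SparkContext involved here, just the plain collection):

```scala
val data = Array(1, 2, 3, 4, 5)

// Locally, reduce folds the elements pairwise with the given function;
// distData.reduce(_ + _) computes the same sum, but across the slices.
val sum = data.reduce(_ + _)
```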

JavaFX + GraniteDS

http://java.dzone.com/articles/real-world-javafx-graniteds-0?mz=46483-html5

Good and bad of JavaFX

JavaFX has been very enjoyable to work with and is a great tool for developing desktop applications. Being built in the Java SDK runtime makes it really friendly for Java developers, and easy to use and deploy. It has a decent number of available widgets including nice charting controls and allows for a clean MVC architecture. Building complex views is quite easy using the FXML markup language and the ability to use CSS styling has been very convenient.
During the development of our application, we unfortunately encountered several issues, notably memory leaks (in particular when using TableView and selection models) that forced us to introduce ugly workarounds, making the code much less clean and readable, and we wasted a lot of time identifying the bugs.
Some very useful widgets were also missing, such as a text field with suggestions or even a calendar widget, and we had to find third-party components that are not necessarily very reliable. It seems many of these issues will be addressed in JavaFX 8.

Good and bad of GraniteDS

GraniteDS has been very helpful to develop and structure our application and has almost entirely eliminated the need for handling network communication in our code.
We have particularly enjoyed some of its features which have saved us a lot of time:
  • automatic pagination working with many widgets (ListView, TableView, ChoiceBox) with incremental loading of data and caching. Just great!
  • automatic generation of client side JavaFX bindable entities from the JPA model
  • transparent lazy loading of remote data
  • dirty checking of changes made in client side entities
  • Spring integration, both on the server and on the client
Except for some early bugs in the beta versions, which are now fixed, we have not faced too many issues with GraniteDS. It just takes some time to become familiar with its many features, and the documentation is sometimes a bit short. This initial effort is really worth it: you can build your application much faster and become very productive once you are more familiar with the framework.

Spark RDD

http://www.thecloudavenue.com/2014/01/resilient-distributed-datasets-rdd.html

Resilient Distributed Datasets (RDD) for the impatient

The Big Data revolution was started by Google's paper on MapReduce (MR). But the MR model mainly suits batch-oriented processing of data, and some of the other models are being shoehorned into it because of the prevalence of Hadoop and the attention/support it gets. With the introduction of YARN in Hadoop, other models besides MR can be first-class citizens in the Hadoop space.

Let's take the case of MR as shown below: there is a lot of reading from and writing to disk after each MR transformation, which makes it too slow and less suitable for iterative processing, as in the case of Machine Learning.
Let's see how we can solve the iterative problem with Apache Spark. Spark is built using Scala around the concept of Resilient Distributed Datasets (RDD) and provides actions/transformations on top of RDDs. It has some of the best documentation among open source projects. There are not many resources around RDD itself, but this paper and presentation are the roots of RDD. Check this to know more about RDDs from a Spark perspective.

Friday, February 7, 2014

Data Explorer (or Dex) is a tool for exploring data.

http://www.javainc.com/projects/dex/Dex.html

Dex is a JavaFX application which aims to provide a general-purpose framework for data visualization. The basics are easy to learn, but Dex also offers advanced capabilities in the form of SQL and Groovy transformations.

Dex integrates many excellent visualization frameworks into one consistent GUI package. This is an early release, so you are likely to find functional gaps for your particular problem space. In such cases, you are free to extend Dex to suit your needs.

Why JavaFX ?

http://programmers.stackexchange.com/questions/180414/using-mvc-in-a-java-app/180497#180497

Reasons I would recommend JavaFX are: