Thursday, January 30, 2014

EPC Architecture



EPC Network Architecture

There are several functional entities within the Evolved Packet Core (EPC). In the user plane, the core network acts as the gateway between the access network and the packet data networks (PDNs, e.g., the Internet), supporting the required interfaces, mobility, and the differentiation of QoS flows. The gateway may be split into two separate nodes connected by the S5 interface; the two logical gateway entities are the Serving Gateway (S-GW) and the PDN Gateway (P-GW).

The Serving Gateway (S-GW) acts as a local mobility anchor, forwarding and receiving packets to and from the eNodeB currently serving the UE.
The PDN Gateway (P-GW) interfaces with the external PDNs, such as the Internet and IMS. It is also responsible for several IP functions, such as address allocation, policy enforcement, packet classification and routing, and it provides mobility anchoring for non-3GPP access networks.

The control plane functions are performed by the MME (Mobility Management Entity), which is connected to the S-GW via the S11 interface.

The PCRF makes policy decisions based on information obtained via the Rx interface and confirms that the information received is consistent with the policies it has defined. The PCRF authorizes QoS resources and decides whether new resources are required for existing connections. The PCRF mechanism is also used in 3G networks.
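To keep the reference points straight, here is a small sketch of my own (in Scala, just as a memo; it is not from the article) mapping each interface mentioned above to the pair of nodes it connects:

// Rough summary of the EPC reference points discussed above.
// Illustrative only; not an exhaustive list of 3GPP interfaces.
object EpcTopology {
  val interfaces: Map[String, (String, String)] = Map(
    "S1-U"   -> ("eNodeB", "S-GW"),   // user-plane tunnel from the access network
    "S1-MME" -> ("eNodeB", "MME"),    // control plane between eNodeB and MME
    "S5"     -> ("S-GW", "P-GW"),     // present when the gateway is split into two nodes
    "S11"    -> ("MME", "S-GW"),      // control plane between MME and the gateway
    "SGi"    -> ("P-GW", "PDN"),      // towards external PDNs such as the Internet or IMS
    "Gx"     -> ("PCRF", "P-GW"),     // policy and charging rules pushed to the P-GW
    "Rx"     -> ("PCRF", "AF (IMS)")  // session information fed into the PCRF
  )

  def main(args: Array[String]): Unit =
    interfaces.foreach { case (name, (a, b)) => println(s"$name: $a <-> $b") }
}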

















http://4g360.com/profiles/blogs/evolved-packet-core-architecture-and-network-element

Tuesday, January 28, 2014

Spark Scala-2.10 support

https://github.com/mesos/spark/tree/scala-2.10

> git clone -b scala-2.10 https://github.com/davidchang168/spark.git
For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, also set SPARK_YARN=true:
# Apache Hadoop 2.0.5-alpha
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Cloudera CDH 4.2.0 with MapReduce v2
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
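
Once the assembly builds, a tiny job can be run against it. A minimal sketch of my own (assuming the Scala API of that era; the input file name is made up, and on pre-0.8 branches the package is spark rather than org.apache.spark):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings in reduceByKey via the pair-RDD implicits

// Minimal word count; replace "local[2]" with your cluster master URL.
object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "WordCount")
    val counts = sc.textFile("input.txt")      // hypothetical input file
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)
    sc.stop()
  }
}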

MapReduce and Spark

http://vision.cloudera.com/mapreduce-spark/

The leading candidate for “successor to MapReduce” today is Apache Spark. Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads, and to do so much faster than the older system.
If you’re not interested in technical detail, skip the next three paragraphs.
Original MapReduce executed jobs in a simple but rigid structure: a processing or transform step (“map”), a synchronization step (“shuffle”), and a step to combine results from all the nodes in a cluster (“reduce”). If you wanted to do something complicated, you had to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs was high-latency, and none could start until the previous job had finished completely. Complex, multi-stage applications were distressingly slow.
An alternative approach is to let programmers construct complex, multi-step directed acyclic graphs (DAGs) of work that must be done, and to execute those DAGs all at once, not step by step. This eliminates the costly synchronization required by MapReduce and makes applications much easier to build. Prior research on DAG engines includes Dryad, a Microsoft Research project used internally at Microsoft for its Bing search engine and other hosted services.
Spark builds on those ideas but adds some important innovative features. For example, Spark supports in-memory data sharing across DAGs, so that different jobs can work with the same data at very high speed. It even allows cyclic data flows. As a result, Spark handles iterative graph algorithms (think social network analysis), machine learning and stream processing extremely well. Those have been cumbersome to build on MapReduce and even on other DAG engines. They are very popular applications in the Hadoop ecosystem, so simplicity and performance matter.
Spark began as a research project at UC Berkeley’s AMPLab in 2009 and was released as open source in 2010. With several years of real use, it’s had plenty of time to mature. Advanced features available in the latest release include stream processing, fast fault recovery, language-integrated APIs, optimized scheduling and data transfer and more.
One of the most interesting features of Spark — and the reason we believe it is such a powerful addition to the engines available for the Enterprise Data Hub — is its smart use of memory. MapReduce has always worked primarily with data stored on disk. Spark, by contrast, can exploit the considerable amount of RAM that is spread across all the nodes in a cluster. It is smart about use of disk for overflow data and persistence. That gives Spark huge performance advantages for many workloads.
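
The in-memory point is easiest to see with a small RDD example. This is my own sketch (not from the article): the dataset is cached once, then reused by a second job without being re-read from disk.

import org.apache.spark.SparkContext

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "CacheDemo")
    // Load once and keep the partitions in memory for the jobs below.
    val logs = sc.textFile("access.log").cache()   // hypothetical input file

    val total  = logs.count()                                // first job materializes and caches the RDD
    val errors = logs.filter(_.contains("ERROR")).count()    // second job reads from memory
    println(s"$errors errors out of $total lines")
    sc.stop()
  }
}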

Thursday, January 9, 2014

How to change input split size ?

http://stackoverflow.com/questions/19188315/behavior-of-the-parameter-mapred-min-split-size-in-hdfs

http://stackoverflow.com/questions/14380841/how-to-restrict-the-concurrent-running-map-tasks

The number of mappers launched is determined by the input split size. The input split size is the size of the chunks into which the data is divided and sent to different mappers as it is read from HDFS. So in order to control the number of mappers, we have to control the split size.
It can be controlled by setting the parameters mapred.min.split.size and mapred.max.split.size when configuring the job in MapReduce. The values are set in bytes. So if we have a 20 GB (20480 MB) file and we want to launch 40 mappers, each split needs to be 20480 MB / 40 = 512 MB, i.e. 536870912 bytes. For that the code would be,
conf.set("mapred.min.split.size", "536870912"); 
conf.set("mapred.max.split.size", "536870912");
 where conf is an object of the org.apache.hadoop.conf.Configuration class.
The split size is calculated by the formula:
max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
In the case from the question above (mapred.min.split.size = 128 MB, default mapred.max.split.size, 64 MB blocks) it will be:
split size = max(128 MB, min(Long.MAX_VALUE (default), 64 MB)) = 128 MB
So for the inferences above:
  1. Each map will process 2 HDFS blocks (assuming each block is 64 MB): True
  2. The input file (already in HDFS) will be re-divided into 128 MB blocks in HDFS: False
Making the minimum split size greater than the block size increases the split size, but at the cost of data locality.
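
The formula is easy to sanity-check in code. A quick sketch of my own (values in bytes; the numbers are just the 128 MB / 64 MB example from above):

object SplitSize {
  // split size = max(minSplit, min(maxSplit, blockSize)), all in bytes
  def splitSize(minSplit: Long, maxSplit: Long, blockSize: Long): Long =
    math.max(minSplit, math.min(maxSplit, blockSize))

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // min split of 128 MB, default max, 64 MB blocks -> 128 MB splits
    println(splitSize(128 * mb, Long.MaxValue, 64 * mb) / mb)  // 128
    // min split below the block size leaves the block size in charge -> 64 MB splits
    println(splitSize(1, Long.MaxValue, 64 * mb) / mb)         // 64
  }
}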

Saturday, January 4, 2014

build.sbt vs Build.scala

http://www.scala-sbt.org/0.12.2/docs/Getting-Started/Full-Def.html
http://www.scala-sbt.org/release/docs/Getting-Started/Basic-Def.html#sbt-vs-scala-definition

To mix .sbt and .scala files in your build definition, you need to understand how they relate.
The following two files illustrate. First, if your project is in hello, create hello/project/Build.scala as follows:
import sbt._
import Keys._

object HelloBuild extends Build {

    val sampleKeyA = SettingKey[String]("sample-a", "demo key A")
    val sampleKeyB = SettingKey[String]("sample-b", "demo key B")
    val sampleKeyC = SettingKey[String]("sample-c", "demo key C")
    val sampleKeyD = SettingKey[String]("sample-d", "demo key D")

    override lazy val settings = super.settings ++
        Seq(sampleKeyA := "A: in Build.settings in Build.scala", resolvers := Seq())

    lazy val root = Project(id = "hello",
                            base = file("."),
                            settings = Project.defaultSettings ++ Seq(sampleKeyB := "B: in the root project settings in Build.scala"))
}
Now, create hello/build.sbt as follows:
sampleKeyC in ThisBuild := "C: in build.sbt scoped to ThisBuild"

sampleKeyD := "D: in build.sbt"
Start up the sbt interactive prompt. Type inspect sample-a and you should see (among other things):
[info] Setting: java.lang.String = A: in Build.settings in Build.scala
[info] Provided by:
[info]  {file:/home/hp/checkout/hello/}/*:sample-a
and then inspect sample-c and you should see:
[info] Setting: java.lang.String = C: in build.sbt scoped to ThisBuild
[info] Provided by:
[info]  {file:/home/hp/checkout/hello/}/*:sample-c
Note that the "Provided by" shows the same scope for the two values. That is, sampleKeyC in ThisBuild in a.sbt file is equivalent to placing a setting in the Build.settings list in a .scala file. sbt takes build-scoped settings from both places to create the build definition.
Now, inspect sample-b:
[info] Setting: java.lang.String = B: in the root project settings in Build.scala
[info] Provided by:
[info]  {file:/home/hp/checkout/hello/}hello/*:sample-b
Note that sample-b is scoped to the project ({file:/home/hp/checkout/hello/}hello) rather than the entire build ({file:/home/hp/checkout/hello/}).
As you've probably guessed, inspect sample-d matches sample-b:
[info] Setting: java.lang.String = D: in build.sbt
[info] Provided by:
[info]  {file:/home/hp/checkout/hello/}hello/*:sample-d
sbt appends the settings from .sbt files to the settings from Build.settings and Project.settings, which means the .sbt settings take precedence. Try changing Build.scala so it sets key sample-c or sample-d, which are also set in build.sbt. The setting in build.sbt should "win" over the one in Build.scala.
One other thing you may have noticed: sampleKeyC and sampleKeyD were available inside build.sbt. That's because sbt imports the contents of your Build object into your .sbt files. In this case import HelloBuild._ was implicitly done for the build.sbt file.
In summary:
  • In .scala files, you can add settings to Build.settings for sbt to find, and they are automatically build-scoped.
  • In .scala files, you can add settings to Project.settings for sbt to find, and they are automatically project-scoped.
  • Any Build object you write in a .scala file will have its contents imported and available to .sbt files.
  • The settings in .sbt files are appended to the settings in .scala files.
  • The settings in .sbt files are project-scoped unless you explicitly specify another scope.

When to use .scala files

In .scala files, you are not limited to a series of settings expressions. You can write any Scala code, including val, object, and method definitions.
One recommended approach is to define settings in .sbt files, using .scala files when you need to factor out a val, object, or method definition.
Because the .sbt format allows only single expressions, it doesn't give you a way to share code among expressions. When you need to share code, you need a .scala file so you can set common variables or define methods.
There's one build definition, which is itself a nested project inside your main project; the .sbt and .scala files are compiled together to create that single definition.
.scala files are also required to define multiple projects in a single build. More on that is coming up in Multi-Project Builds.
(A disadvantage of using .sbt files in a multi-project build is that they'll be spread around in different directories; for that reason, some people prefer to put settings in their .scala files if they have sub-projects. This will be clearer after you see how multi-project builds work.)
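
As an example of factoring things out, a shared constant can live in a .scala file and be referenced from build.sbt. A minimal sketch for sbt 0.12/0.13 (the file name, object name, and library versions below are just placeholders of mine):

// project/Deps.scala
import sbt._

object Deps {
  val akkaVersion = "2.2.3"
  val testDeps = Seq("org.scalatest" %% "scalatest" % "2.0" % "test")
}

// build.sbt
import Deps._

libraryDependencies += "com.typesafe.akka" %% "akka-actor" % akkaVersion

libraryDependencies ++= testDeps

Because the object lives under project/, it is compiled as part of the build definition and is visible to build.sbt, so the versions only need to be changed in one place.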

Thursday, January 2, 2014

Datanode dead but pid file exists - cannot start datanode anymore

I don't know why I cannot start the Cloudera CDH5 datanode with the command below:
> sudo service hadoop-hdfs-datanode start

> sudo service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists                [FAILED]

But I performed the following commands to fix the problem:
> for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x stop ; done
> sudo rm -rf /var/lib/hadoop-hdfs/cache/*
> sudo -u hdfs hdfs namenode -format
> for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

After running these commands you need to re-run steps 3, 4, and 6 using the instructions at https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Quick-Start/cdh5qs_topic_3_3.html
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Quick-Start/cdh5qs_yarn_pseudo.html

Note: see http://www.manning-sandbox.com/thread.jspa?threadID=41812. My problem was probably that the namenode was reformatted and the new namespace ID did not get replicated to the datanode.
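
A quick way to confirm that kind of mismatch is to compare the VERSION files under the namenode and datanode storage directories (they are plain Java properties files). A throwaway sketch of mine; the paths below are only a guess at the CDH pseudo-distributed defaults, so adjust them to whatever dfs.name.dir / dfs.data.dir point to:

import java.io.FileInputStream
import java.util.Properties

object CheckNamespaceId {
  // Load a Hadoop storage VERSION file (Java properties format).
  def load(path: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(path)
    try props.load(in) finally in.close()
    props
  }

  def main(args: Array[String]): Unit = {
    // Adjust to your dfs.name.dir / dfs.data.dir settings.
    val nn = load("/var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current/VERSION")
    val dn = load("/var/lib/hadoop-hdfs/cache/hdfs/dfs/data/current/VERSION")
    println(s"namenode namespaceID = ${nn.getProperty("namespaceID")}")
    println(s"datanode namespaceID = ${dn.getProperty("namespaceID")}")
    if (nn.getProperty("namespaceID") != dn.getProperty("namespaceID"))
      println("Mismatch: the datanode will not join the namenode until its storage is re-initialized.")
  }
}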