Thursday, January 9, 2014

How to change input split size?

http://stackoverflow.com/questions/19188315/behavior-of-the-parameter-mapred-min-split-size-in-hdfs

http://stackoverflow.com/questions/14380841/how-to-restrict-the-concurrent-running-map-tasks

The number of mappers launched is decided by the number of input splits. An input split is the chunk of data handed to a single mapper as the file is read from HDFS, and by default its size equals the HDFS block size. So in order to control the number of mappers we have to control the split size.
It can be controlled by setting the parameters mapred.min.split.size and mapred.max.split.size while configuring the job in MapReduce. The values are set in bytes. So if we have a 20 GB file and we want 40 mappers, each split should be 20480 MB / 40 = 512 MB, i.e. 536870912 bytes. So for that the code would be:
conf.set("mapred.min.split.size", "536870912"); 
conf.set("mapred.max.split.size", "536870912");
where conf is an object of the org.apache.hadoop.conf.Configuration class.
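To show where those two lines fit, here is a minimal driver sketch, assuming the classic org.apache.hadoop.mapred API (Hadoop 1.x, where these property names live); the class name and the input/output arguments are placeholders for illustration, and the mapper/reducer setup is omitted.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        // JobConf extends Configuration, so conf.set() works as shown above.
        JobConf conf = new JobConf(SplitSizeDriver.class);
        conf.setJobName("split-size-demo");

        // Force 512 MB splits: a 20 GB input then yields 40 map tasks.
        conf.set("mapred.min.split.size", "536870912");
        conf.set("mapred.max.split.size", "536870912");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // ... set mapper, reducer, and key/value classes here ...

        JobClient.runJob(conf);
    }
}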
The split size is calculated by the formula:
max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
In the case from the first link above, where mapred.min.split.size is set to 128 MB and the block size is 64 MB, it works out to:
split size = max(128 MB, min(Long.MAX_VALUE (default), 64 MB)) = 128 MB
So for the inferences drawn in that question:
  1. Each map will process 2 HDFS blocks (assuming each block is 64 MB): True
  2. The input file (already stored in HDFS) will be re-divided into 128 MB blocks in HDFS: False
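To make the formula concrete, here is a small sketch of the same computation; the helper mirrors the max/min formula above, but the class name and the hard-coded values are just illustrative, taken from the 128 MB / 64 MB case.

public class SplitSizeCalc {
    // Mirrors the formula: max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // dfs.block.size = 64 MB
        long minSize   = 128L * 1024 * 1024;  // mapred.min.split.size = 128 MB
        long maxSize   = Long.MAX_VALUE;      // mapred.max.split.size default

        // Prints 134217728 (128 MB): each map task covers two 64 MB blocks.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize));
    }
}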
Note that making the minimum split size greater than the block size does increase the split size, but at the cost of data locality, since a single map task may then have to read blocks stored on other nodes.
