http://stackoverflow.com/questions/14380841/how-to-restrict-the-concurrent-running-map-tasks
The number of mappers fired is decided by the input split size (which by default equals the
HDFS block size). The split size is the size of the chunks into which
the data is divided and sent to different mappers while it is read from
HDFS. So in order to control the number of mappers we have to
control the split size.
It can be controlled by setting the parameters
mapred.min.split.size and mapred.max.split.size
while configuring the job in MapReduce. The value is to be set in
bytes. So if we have a 20 GB (20480 MB) file and we want to fire 40 mappers, then
each split needs to be 20480 / 40 = 512 MB, i.e. 536870912 bytes. So for that the code
would be:

conf.set("mapred.min.split.size", "536870912");
conf.set("mapred.max.split.size", "536870912");

where conf is an object of the org.apache.hadoop.conf.Configuration class.
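
For context, here is a minimal driver sketch, assuming the newer Hadoop 2.x mapreduce API; the class name SplitSizeDriver, the input/output paths taken from args, and the omitted Mapper/Reducer wiring are placeholders of mine, not part of the original answer. It sets the same 512 MB split size both through the old mapred.* property names used above and through the FileInputFormat helpers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long splitSize = 512L * 1024 * 1024; // 512 MB = 536870912 bytes

        // Old (mapred.*) property names, as used in the answer above
        conf.set("mapred.min.split.size", Long.toString(splitSize));
        conf.set("mapred.max.split.size", Long.toString(splitSize));

        Job job = Job.getInstance(conf, "split-size-demo");
        job.setJarByClass(SplitSizeDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...); // your own classes go here

        // Equivalent helpers that set the newer mapreduce.* split properties
        FileInputFormat.setMinInputSplitSize(job, splitSize);
        FileInputFormat.setMaxInputSplitSize(job, splitSize);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a 20 GB input, 512 MB splits give 20480 / 512 = 40 map tasks.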
The split size is calculated by the formula:

max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
In your case it will be:

split size = max(128 MB, min(Long.MAX_VALUE (default), 64 MB)) = 128 MB

So, for the inferences above:
- each map will process 2 HDFS blocks (assuming each block is 64 MB): True
- my input file (already stored in HDFS) will be re-divided into 128 MB blocks in HDFS: False
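
As a quick sanity check of that formula, here is a small stand-alone sketch; the class name SplitSizeFormula is hypothetical, and the constants are simply the numbers from the two cases discussed in this post:

public class SplitSizeFormula {
    // split size = max(minSplit, min(maxSplit, blockSize))
    static long splitSize(long minSplit, long maxSplit, long blockSize) {
        return Math.max(minSplit, Math.min(maxSplit, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Case above: mapred.min.split.size = 128 MB, max unset (Long.MAX_VALUE), block = 64 MB
        long split = splitSize(128 * mb, Long.MAX_VALUE, 64 * mb);
        System.out.println("split size = " + (split / mb) + " MB"); // 128 MB -> each map reads 2 blocks

        // Earlier case: 20 GB input with 512 MB splits -> 40 mappers
        long input = 20L * 1024 * mb;
        System.out.println("mappers for 20 GB @ 512 MB = " + (input / (512 * mb))); // 40
    }
}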