Monday, November 11, 2013

Hadoop: How to control how many map tasks can be executed in parallel

http://developer.yahoo.com/hadoop/tutorial/module4.html

The InputFormat defines the list of tasks that make up the mapping phase; each task corresponds to a single input split. The tasks are then assigned to the nodes in the system based on where the input file chunks physically reside. An individual node may have several dozen tasks assigned to it. The node begins working on the tasks, attempting to run as many in parallel as it can. This on-node parallelism is capped by the mapred.tasktracker.map.tasks.maximum parameter, which sets the maximum number of map tasks a single TaskTracker runs simultaneously.
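
As a rough sketch, assuming a Hadoop 1.x (MRv1) cluster where each worker node runs a TaskTracker, the limit could be set in that node's mapred-site.xml as shown below. The value 4 is only an illustrative assumption, not a recommendation; the shipped default is 2, and a sensible value depends on the node's cores and memory.

```xml
<!-- mapred-site.xml on each TaskTracker node (Hadoop 1.x / MRv1 assumed) -->
<configuration>
  <property>
    <!-- Maximum number of map tasks this TaskTracker runs at the same time -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <!-- Illustrative value only; tune to the node's CPU and memory -->
    <value>4</value>
  </property>
</configuration>
```

This is a per-node daemon setting, so the TaskTracker reads it at startup; it typically has to be restarted for a new value to take effect, and setting the property in an individual job's configuration does not change it.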

Figure 4.5: Detailed Hadoop MapReduce data flow
