Sunday, November 10, 2013

Hadoop: custom RecordReader of TextInputFormat

http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/

Problem: We want our mapper to receive 3 records (3 lines) from the source file at a time, instead of the 1 line at a time that TextInputFormat provides by default.
Approach:
  1. We will extend the TextInputFormat class to create our own NLinesInputFormat.
  2. We will also create our own RecordReader class, called NLinesRecordReader, where we will implement the logic of feeding 3 lines/records at a time.
  3. We will change our driver program to use our new NLinesInputFormat class.
  4. To prove that we are really getting 3 lines at a time, instead of counting words (which we already know how to do), the mapper will emit the number of lines it receives in each input record as the key and 1 as the value. After the reducer sums these values, the output gives the frequency of each distinct line count seen by the mappers.
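
The grouping logic behind steps 2 and 4 can be sketched in plain Java, without any Hadoop dependencies, to show what output to expect. This is only an illustration under assumptions: the class NLinesSketch and the methods nextRecord and countRecordSizes are hypothetical names invented for this sketch, not part of Hadoop or of the original post. nextRecord plays the role of NLinesRecordReader, and countRecordSizes stands in for the mapper and reducer together.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free sketch of the "N lines per record" idea.
public class NLinesSketch {
    static final int NLINES = 3; // lines per record, as in the problem statement

    // Stand-in for the record reader: returns the next record of up to n lines,
    // or null once the input is exhausted.
    static List<String> nextRecord(BufferedReader reader, int n) throws IOException {
        List<String> record = new ArrayList<>();
        String line;
        // Stop as soon as the record is full so no extra line is consumed.
        while (record.size() < n && (line = reader.readLine()) != null) {
            record.add(line);
        }
        return record.isEmpty() ? null : record;
    }

    // Stand-in for mapper + reducer: the "mapper" emits (lines in record, 1),
    // and the "reducer" sums the 1s per distinct line count.
    static Map<Integer, Integer> countRecordSizes(String input, int n) throws IOException {
        Map<Integer, Integer> frequency = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            List<String> record;
            while ((record = nextRecord(reader, n)) != null) {
                frequency.merge(record.size(), 1, Integer::sum);
            }
        }
        return frequency;
    }

    public static void main(String[] args) throws IOException {
        // Seven lines: two full records of 3 lines and one trailing record of 1 line.
        String input = "l1\nl2\nl3\nl4\nl5\nl6\nl7";
        System.out.println(countRecordSizes(input, NLINES)); // prints {1=1, 3=2}
    }
}
```

In the real job the same pattern should appear in the reducer output: almost all records have 3 lines, and a shorter record may show up wherever the reader runs out of input, for example at the end of the file.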
