My personal bookmarks: Hadoop: custom RecordReader of TextInputFormat

Sunday, November 10, 2013

Hadoop: custom RecordReader of TextInputFormat

http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/

Problem: We want our mapper to receive 3 records (3 lines) from the source file at a time, instead of the 1 line at a time that TextInputFormat provides by default.

Approach:

We will extend the TextInputFormat class to create our own NLinesInputFormat.
We will also create our own RecordReader class, called NLinesRecordReader, which implements the logic of feeding 3 lines/records at a time.
We will change our driver program to use the new NLinesInputFormat class.
To prove that we are really getting 3 lines at a time, we will not count words (which we already know how to do). Instead, the mapper will emit the number of lines it receives in each input record as the key and 1 as the value. After the reducer sums these values, the output gives the frequency of each distinct lines-per-record count seen by the mappers.
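The grouping and counting steps above can be sketched without Hadoop on the classpath. This is only a plain-Java simulation of the idea, not the code from the linked post: the class NLinesSketch and the methods readRecords and lineCountFrequencies are hypothetical names, and the sketch ignores input-split boundaries, which a real NLinesRecordReader would have to handle.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch; every name in this class is hypothetical.
class NLinesSketch {
    static final int LINES_PER_RECORD = 3;

    // Mimics what an NLinesRecordReader would hand to the mapper:
    // each record is up to n consecutive lines joined by '\n'.
    // The final record may hold fewer than n lines.
    static List<String> readRecords(String text, int n) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int linesInCurrent = 0;
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (linesInCurrent > 0) {
                    current.append('\n');
                }
                current.append(line);
                linesInCurrent++;
                if (linesInCurrent == n) {
                    records.add(current.toString());
                    current.setLength(0);
                    linesInCurrent = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (linesInCurrent > 0) {
            records.add(current.toString()); // leftover lines at end of input
        }
        return records;
    }

    // Mimics the map and reduce steps together:
    // key = number of lines in a record, value = how many records had that count.
    static Map<Integer, Integer> lineCountFrequencies(List<String> records) {
        Map<Integer, Integer> freq = new TreeMap<>();
        for (String record : records) {
            int lines = record.split("\n", -1).length;
            freq.merge(lines, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String input = "a\nb\nc\nd\ne\nf\ng\n"; // 7 lines
        List<String> records = readRecords(input, LINES_PER_RECORD);
        System.out.println(lineCountFrequencies(records)); // prints {1=1, 3=2}
    }
}
```

With 7 input lines, two records carry 3 lines and the last carries 1, so key 3 has frequency 2 and key 1 has frequency 1. In an actual job the same grouping would presumably live in a class extending Hadoop's RecordReader, and the driver would select the new format with job.setInputFormatClass(NLinesInputFormat.class).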
