@milindjagre
Created April 11, 2016 10:48
A custom InputFormat class for reading Microsoft Word document files with the Hadoop MapReduce API.
package com.milind.mr.worddoc;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
/**
* An {@link org.apache.hadoop.mapreduce.InputFormat} for Microsoft Word
* document files.
* <p>
* Keys are {@link LongWritable} and values are {@link Text}; both are
* produced by the {@code WordRecordReader} returned from
* {@link #createRecordReader}.
*
* @author milind
*/
public class WordInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // The framework calls initialize(split, context) on the returned reader.
        return new WordRecordReader();
    }
}