@cotdp cotdp/gist:3062901
Created Jul 6, 2012

Mapper for processing ZipFile entries
/**
 * This Mapper class checks that the filename ends with the .txt extension,
 * cleans the text and then applies the simple WordCount algorithm.
 *
 * Needs java.io.IOException, java.util.StringTokenizer,
 * org.apache.hadoop.io.{BytesWritable, IntWritable, Text} and
 * org.apache.hadoop.mapreduce.Mapper; LOG is assumed to be defined by the
 * enclosing class.
 */
public static class MyMapper
    extends Mapper<Text, BytesWritable, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable( 1 );
    private Text word = new Text();

    @Override
    public void map( Text key, BytesWritable value, Context context )
        throws IOException, InterruptedException
    {
        // NOTE: the filename is the *full* path within the ZIP file
        // e.g. "subdir1/subsubdir2/Ulysses-18.txt"
        String filename = key.toString();
        LOG.info( "map: " + filename );

        // We only want to process .txt files
        if ( !filename.endsWith(".txt") )
            return;

        // Prepare the content. getBytes() returns the whole backing array,
        // which can be longer than the data, so decode only getLength() bytes.
        String content = new String( value.getBytes(), 0, value.getLength(), "UTF-8" );
        // Drop everything except ASCII letters, spaces and newlines
        content = content.replaceAll( "[^A-Za-z \n]", "" ).toLowerCase();

        // Tokenize the content and emit (word, 1) for each token
        StringTokenizer tokenizer = new StringTokenizer( content );
        while ( tokenizer.hasMoreTokens() )
        {
            word.set( tokenizer.nextToken() );
            context.write( word, one );
        }
    }
}
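
The cleaning and tokenizing steps of the mapper can be tried without a Hadoop cluster. The following is a minimal sketch, not part of the original gist: the class name `WordCountSketch` and the method `countWords` are hypothetical, a plain `byte[]` plus a `length` argument stands in for `BytesWritable.getBytes()` and `getLength()`, and the counts are summed locally in a map where the real job would emit `(word, 1)` pairs for a reducer to add up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone class for illustration; not part of the gist.
class WordCountSketch {

    // Applies the mapper's clean-and-tokenize steps to the first `length`
    // bytes of `buffer`. `length` plays the role of BytesWritable.getLength().
    static Map<String, Integer> countWords(byte[] buffer, int length) {
        // Decode only the valid part of the buffer
        String content = new String(buffer, 0, length, StandardCharsets.UTF_8);
        // Same regex as the mapper: keep ASCII letters, spaces and newlines
        content = content.replaceAll("[^A-Za-z \n]", "").toLowerCase();

        // Sum counts locally; the real job leaves this to the reducer
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(content);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        byte[] valid = "The cat, the hat.\n".getBytes(StandardCharsets.UTF_8);

        // Simulate a backing array that is larger than the data it holds
        byte[] buffer = Arrays.copyOf(valid, valid.length + 4);
        byte[] stale = "junk".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(stale, 0, buffer, valid.length, stale.length);

        // Prints {cat=1, hat=1, the=2}; "junk" is ignored
        System.out.println(countWords(buffer, valid.length));
    }
}
```

Passing `buffer.length` instead of `valid.length` would count the trailing bytes as an extra word, which is why the decode step should be bounded by the valid length rather than by the size of the array.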