Jem G. (jgulum)

  • nutrasynch.ai
  • NYC
@jgulum
jgulum / EmployeeReducer.java
Created November 18, 2013 17:14
The Reducer phase in Java. Apache Ant is used to compile, package, and run the code (see the build.xml later in this listing).
package example;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
/*
 * To define a reduce function for your MapReduce job, subclass
 * Reducer and override its reduce method.
 */
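// What follows is a hedged sketch of the class body the gist preview
// truncates; the original is not shown, so the logic assumes the
// mapper's (state, 1) output and sums the counts per state.
public class EmployeeReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Total the 1s emitted by the mapper for this state.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}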
@jgulum
jgulum / EmployeeMapper.java
Created November 18, 2013 17:09
The Mapper phase in Java: a training example that reads employee records exported from a typical relational table.
package example;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/*
 * To define a map function for your MapReduce job, subclass
 * Mapper and override its map method.
 */
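// Hedged sketch of the truncated class body; the record layout follows
// the Python mapper later in this listing (tab-separated employee rows
// with state in column 6 and salary in column 11), which is an assumption.
public class EmployeeMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int STATE = 5;    // zero-based column positions,
    private static final int SALARY = 10;  // assumed from the Python mapper

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        // Emit (state, 1) for every employee earning $75,000 or more.
        if (Integer.parseInt(fields[SALARY]) >= 75000) {
            context.write(new Text(fields[STATE]), new IntWritable(1));
        }
    }
}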
@jgulum
jgulum / Driver.java
Created November 18, 2013 17:00
Driver for a MapReduce job in Java. The driver class configures the job and submits it to the Hadoop cluster for execution.
package example;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
/*
 * The driver configures the job (mapper, reducer, input and
 * output paths) and submits it to the cluster.
 */
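// Hedged sketch of the truncated class body; the job name and the use
// of command-line arguments for the input and output paths are assumptions.
public class Driver {

    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(Driver.class);
        job.setJobName("Employee Count");

        // Input and output locations, e.g. as passed by the Ant "run" target.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(EmployeeMapper.class);
        job.setReducerClass(EmployeeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait; exit nonzero on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}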
@jgulum
jgulum / build.xml
Created November 18, 2013 16:58
The Ant build file used to compile, package, and run the MapReduce job in Java.
<?xml version="1.0" encoding="UTF-8"?>
<project name="empdata_mrjob" default="run" basedir=".">
  <target name="init">
    <property name="src.dir" value="java/src" />
    <property name="build.dir" value="java/build" />
    <property name="dist.dir" value="java/dist" />
    <property name="hdfs.output.dir" value="/user/training/empcounts_java" />
@jgulum
jgulum / running the mapReduce job on Linux terminal.
Created November 18, 2013 16:38
Running a Python MapReduce job: the following shell script, executed in a Linux environment, defines the path of the Java library (JAR) that provides Hadoop streaming support, defines the output directory, and makes sure that directory does not already exist (for example, from a previous run) before submitting the streaming job.
#!/bin/sh
# Path of Hadoop streaming JAR library
STREAMJAR=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-*.jar
# Directory in which we'll store job output
OUTPUT=/user/training/empcounts
# Make sure we don't have output from a previous run.
# The -r option removes the directory recursively, and
# stderr is silenced so a missing directory is not reported.
hadoop fs -rm -r $OUTPUT 2>/dev/null  # flags assumed; older CLIs use "hadoop fs -rmr"
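# Hedged sketch of the submission the preview truncates; the input path
# "empdata" and the mapper/reducer file names are assumptions.
hadoop jar $STREAMJAR \
    -input empdata \
    -output $OUTPUT \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py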
@jgulum
jgulum / Reducer
Created November 18, 2013 16:32
Running a Python MapReduce Job: The reducer phase of the job.
#!/usr/bin/env python
import sys
previous_state = ''
count_for_state = 0
for line in sys.stdin:
    line = line.strip()
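    # Hedged sketch of the truncated remainder: the mapper emits
    # "state,1" lines and Hadoop sorts them by key, so we can total
    # a run of lines for one state and print when the state changes.
    (state, count) = line.split(",")
    if previous_state and state != previous_state:
        print "%s\t%d" % (previous_state, count_for_state)
        count_for_state = 0
    previous_state = state
    count_for_state += int(count)

# Emit the total for the final state.
if previous_state:
    print "%s\t%d" % (previous_state, count_for_state)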
@jgulum
jgulum / Mapper
Created November 18, 2013 16:27
Running a Python MapReduce Job: The mapper phase of the job.
#!/usr/bin/env python
import sys
for line in sys.stdin:
    line = line.strip()
    # Each input record is a tab-separated employee row.
    (id, fname, lname, addr, city, state, zip, job, email, active, salary) = line.split("\t")
    # Emit "state,1" for every employee earning $75,000 or more.
    if int(salary) >= 75000:
        print "%s,1" % state