Airawat airawat

## 00-MapSideJoinDistCacheTextFile
This gist demonstrates how to do a map-side join, loading one small dataset from DistributedCache into a HashMap
in memory, and joining with a larger dataset.

Includes:
---------
1. Input data and script download
2. Dataset structure review
3. Expected results
4. Mapper code
5. Driver code
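
The join idea described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the class name, the tab-separated record layouts, and the choice to drop unmatched records (inner-join behaviour) are all hypothetical and not taken from the gist. In a real mapper the lookup table would typically be filled once in `setup()` from the cached file, and `joinRecord` corresponds to the work done per call to `map()`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a map-side (hash) join. No Hadoop classes are used;
// the record layouts below are hypothetical, not the gist's actual data.
class MapSideHashJoinSketch {

    // Stands in for Mapper.setup(): load the small dataset into memory once.
    // Each line is assumed to look like "deptNo<TAB>deptName".
    static Map<String, String> loadLookup(List<String> smallDataset) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : smallDataset) {
            String[] fields = line.split("\t");
            lookup.put(fields[0], fields[1]);
        }
        return lookup;
    }

    // Stands in for Mapper.map(): called once per record of the large dataset.
    // Each record is assumed to look like "empNo<TAB>name<TAB>deptNo".
    // Returns null when the join key has no match (assumed inner join).
    static String joinRecord(String record, Map<String, String> lookup) {
        String[] fields = record.split("\t");
        String deptName = lookup.get(fields[2]);
        return deptName == null ? null : record + "\t" + deptName;
    }

    // Drives the two steps above over an in-memory "large" dataset.
    static List<String> join(List<String> large, List<String> small) {
        Map<String, String> lookup = loadLookup(small);
        List<String> out = new ArrayList<>();
        for (String record : large) {
            String joined = joinRecord(record, lookup);
            if (joined != null) {
                out.add(joined);
            }
        }
        return out;
    }
}
```

Because the small dataset sits in a `HashMap`, each record of the large dataset is joined with a single lookup and no reduce phase is needed.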

## 00-MapSideJoinDistCacheMapFile
This gist demonstrates how to do a map-side join, joining a MapFile from the DistributedCache
with a larger dataset in HDFS.

Includes:
---------
1. Input data and script download
2. Dataset structure review
3. Expected results
4. Mapper code
5. Driver code
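
A MapFile keeps its entries sorted by key with an index, so a value can be looked up by key. The sketch below mimics only that lookup-by-key idea in plain Java, using sorted arrays and a binary search. It does not use Hadoop's `MapFile` API; the class name and the tab-separated record layout are hypothetical, as is the choice to drop unmatched records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for a keyed lookup file: keys are kept sorted so a
// lookup is a binary search rather than a scan. This is NOT Hadoop's MapFile;
// it only mimics the lookup-by-key idea. All names are hypothetical.
class SortedLookupJoinSketch {
    private final String[] sortedKeys;
    private final String[] values;

    // sortedKeys must already be in ascending order; values[i] belongs to sortedKeys[i].
    SortedLookupJoinSketch(String[] sortedKeys, String[] values) {
        this.sortedKeys = sortedKeys;
        this.values = values;
    }

    // Returns the value stored for the key, or null if the key is absent.
    String get(String key) {
        int idx = Arrays.binarySearch(sortedKeys, key);
        return idx >= 0 ? values[idx] : null;
    }

    // Joins each record of the large dataset (assumed "empNo<TAB>name<TAB>deptNo")
    // with the looked-up value; unmatched records are dropped (assumed inner join).
    List<String> join(List<String> largeDataset) {
        List<String> out = new ArrayList<>();
        for (String record : largeDataset) {
            String[] fields = record.split("\t");
            String match = get(fields[2]);
            if (match != null) {
                out.add(record + "\t" + match);
            }
        }
        return out;
    }
}
```

Unlike the HashMap approach, this style of join looks up each key on demand in sorted data instead of building a hash table first.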

## 00-MapSideJoinDistCacheThruGenericOptionsParser
This gist is part of a series of gists on map-side joins in Java MapReduce.
In the gist at https://gist.github.com/airawat/6597557, we added reference data already
in HDFS to the distributed cache from the driver code.

This gist demonstrates adding a local file to the distributed cache from the command line.
Refer to the gist at https://gist.github.com/airawat/6597557 for:
1.  Data samples and structure
2.  Expected results
3.  Commands to load data to HDFS
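
Hadoop's GenericOptionsParser recognises generic options such as `-files` (a comma-separated list of files to ship with the job) ahead of the application's own arguments; a driver usually gets this behaviour by implementing `Tool` and running through `ToolRunner`. The toy class below is not that parser. It is a hypothetical plain-Java stand-in that handles only `-files`, to show how the generic option is separated from the remaining arguments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for generic-option handling: it understands only "-files"
// and is NOT Hadoop's GenericOptionsParser. All names are hypothetical.
class GenericOptionsSketch {
    final List<String> cacheFiles = new ArrayList<>();
    final List<String> remainingArgs = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        int i = 0;
        // Generic options are expected before the application's own arguments.
        while (i + 1 < args.length && args[i].equals("-files")) {
            // The option value is a comma-separated list of file paths.
            cacheFiles.addAll(Arrays.asList(args[i + 1].split(",")));
            i += 2;
        }
        // Whatever is left belongs to the application (e.g. input and output paths).
        for (; i < args.length; i++) {
            remainingArgs.add(args[i]);
        }
    }
}
```

With arguments such as `-files departments.txt /in /out` (hypothetical names), `cacheFiles` would hold the file to distribute and `remainingArgs` the input and output paths the driver reads.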

## 00-SecondarySortJavaMapReduce
Secondary sort in MapReduce

In the MapReduce framework, the keys are sorted but the values associated with each key
are not.  To have the values sorted as well, we need to write code that performs what is
referred to as a secondary sort.  The sample code in this gist demonstrates such a sort.

The input to the program is a set of employee attributes.
The required output is sorted by department number (deptNo) in ascending order, and by employee
last name, first name and employee ID in descending order.
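
The required ordering can be expressed as a single comparator over a composite key. The plain-Java sketch below shows only that ordering; the `Employee` class and its field names are hypothetical. In a MapReduce job the same comparison would typically live in a composite key, combined with a partitioner and a grouping comparator on deptNo, none of which are shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of the sort order only; no Hadoop classes are used.
// The Employee type and its fields are hypothetical.
class SecondarySortSketch {

    static final class Employee {
        final int deptNo;
        final String lastName;
        final String firstName;
        final int empId;

        Employee(int deptNo, String lastName, String firstName, int empId) {
            this.deptNo = deptNo;
            this.lastName = lastName;
            this.firstName = firstName;
            this.empId = empId;
        }
    }

    // deptNo ascending; within a department, lastName, firstName and empId descending.
    static final Comparator<Employee> ORDER = new Comparator<Employee>() {
        @Override
        public int compare(Employee a, Employee b) {
            int c = Integer.compare(a.deptNo, b.deptNo);   // ascending
            if (c != 0) return c;
            c = b.lastName.compareTo(a.lastName);          // descending
            if (c != 0) return c;
            c = b.firstName.compareTo(a.firstName);        // descending
            if (c != 0) return c;
            return Integer.compare(b.empId, a.empId);      // descending
        }
    };

    // Returns a sorted copy, leaving the input list untouched.
    static List<Employee> sort(List<Employee> employees) {
        List<Employee> copy = new ArrayList<>(employees);
        Collections.sort(copy, ORDER);
        return copy;
    }
}
```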

## 00-MultipleOutputs
********************************
Gist
********************************

Motivation
-----------
A typical MapReduce job creates output files with the prefix "part-", followed by "m" or "r" depending
on whether it is map or reduce output, and then the part number.  There are scenarios where we
may want to create separate files based on some criteria: data keys and/or values.  Enter the
"MultipleOutputs" functionality.
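
The naming scheme and the idea of routing records to separate files by key can be sketched in plain Java. This is an illustration only: it does not use Hadoop's `MultipleOutputs` class, the in-memory map stands in for real output files, and the class name, record layout and choice of the key as the base file name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch: groups records into named "files" held in memory.
// NOT Hadoop's MultipleOutputs; all names here are hypothetical.
class MultipleOutputsSketch {

    // Builds a name of the form <base>-<m|r>-<5-digit part number>,
    // e.g. the default base "part" gives "part-r-00000" for reduce output.
    static String fileName(String baseName, boolean reduceOutput, int partNumber) {
        return String.format("%s-%s-%05d", baseName, reduceOutput ? "r" : "m", partNumber);
    }

    // Routes each "key<TAB>value" record to a file whose base name is the key,
    // as if written from the reduce task with the given part number.
    static Map<String, List<String>> route(List<String> records, int partNumber) {
        Map<String, List<String>> files = new TreeMap<>();
        for (String record : records) {
            String key = record.split("\t", 2)[0];
            String name = fileName(key, true, partNumber);
            files.computeIfAbsent(name, k -> new ArrayList<>()).add(record);
        }
        return files;
    }
}
```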

## 00-NLineInputFormat
**********************
Gist
**********************

A common interview question for a Hadoop developer position is whether we can control the number of
mappers for a job.  We can - there are a few ways of controlling the number of mappers, as needed.
Using NLineInputFormat is one way.

About NLineInputFormat
----------------------
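
NLineInputFormat hands each mapper a fixed number of input lines (N), so for a single input file the mapper count is roughly the line count divided by N, rounded up. The plain-Java sketch below illustrates only that arithmetic by chunking an in-memory list of lines; the class and method names are hypothetical and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of "N lines per split"; not Hadoop's NLineInputFormat.
// Names are hypothetical.
class NLineSplitSketch {

    // Cuts the lines into consecutive groups of at most n lines each.
    // Each group stands in for the input handed to one mapper.
    static List<List<String>> splits(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            out.add(new ArrayList<>(lines.subList(i, end)));
        }
        return out;
    }
}
```

For example, ten lines with N set to 3 yield four groups, the last one holding a single line.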

## 00-CombineFileInputFornat
*************************
Gist
*************************

One more gist related to controlling the number of mappers in a mapreduce task.

Background on Inputsplits
--------------------------
An inputsplit is a chunk of the input data allocated to a map task for processing.  FileInputFormat
generates inputsplits (and divides the same into records) - one inputsplit for each file, unless the

## 00-MapSideJoinLargeDatasets
**********************
Gist
**********************

This gist details how to inner join two large datasets on the map side, leveraging the join capability
in MapReduce.  Such a join makes sense if both input datasets are too large to qualify for distribution
through the DistributedCache.  It can be implemented if both input datasets can be joined on the join key
and both are sorted in the same order, by the join key.
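
The sort-order requirement is what allows the join to proceed in a single pass over both inputs. The plain-Java sketch below shows that merge step on two in-memory lists. It is an illustration only and does not use Hadoop's own join support (such as `CompositeInputFormat`); the class name, the "key<TAB>value" layout, and the assumption that each key appears at most once per input are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of an inner merge join over two inputs sorted by key.
// Not Hadoop code; names and record layout are hypothetical.
class SortedMergeJoinSketch {

    // Both inputs hold "key<TAB>value" records sorted ascending by key.
    // Assumption: each key appears at most once in each input.
    static List<String> innerJoin(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String[] l = left.get(i).split("\t", 2);
            String[] r = right.get(j).split("\t", 2);
            int c = l[0].compareTo(r[0]);
            if (c == 0) {
                // Keys match: emit the joined record and advance both sides.
                out.add(l[0] + "\t" + l[1] + "\t" + r[1]);
                i++;
                j++;
            } else if (c < 0) {
                i++; // left key is smaller, so it has no partner on the right
            } else {
                j++; // right key is smaller, so it has no partner on the left
            }
        }
        return out;
    }
}
```

Neither input is held in a lookup table; each side is read once, which is why the approach suits datasets too large for the distributed cache.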

There are two critical pieces to engaging the join behavior:

## 00-ReduceSideJoin
My blog has an introduction to reduce-side joins in Java MapReduce:
http://hadooped.blogspot.com/2013/09/reduce-side-join-options-in-java-map.html


## 00-OozieSSHAction
This gist covers the Oozie SSH action.
It includes the components of a sample Oozie workflow application: scripts/code,
sample data and commands.  Oozie actions covered: secure shell action, email
action.

My blog has documentation, and highlights of a very basic sample program.
http://hadooped.blogspot.com/2013/10/apache-oozie-part-13-oozie-ssh-action_30.html


This gist includes: