Airawat airawat

## 00-CombineFileInputFornat
*************************
Gist
*************************

One more gist related to controlling the number of mappers in a MapReduce job.

Background on Inputsplits
--------------------------
An InputSplit is a chunk of the input data allocated to a map task for processing. FileInputFormat
generates InputSplits (and divides each into records): one InputSplit for each file, unless the

## 00-OozieWorkflowSqoopAction
This gist includes components of a simple workflow application (Oozie 3.3.0) that
pipes data from a Hive table to MySQL.

The sample application includes:
--------------------------------
1.  Oozie actions: sqoop action
2.  Oozie workflow controls: start, end, and kill.
3.  Workflow components: job.properties and workflow.xml
4.  Sample data
5.  Prep tasks in Hive

## 00-OozieSSHAction
This gist covers the Oozie SSH action.
It includes components of a sample Oozie workflow application: scripts/code,
sample data and commands. Oozie actions covered: secure shell action, email
action.

My blog has documentation, and highlights of a very basic sample program.
http://hadooped.blogspot.com/2013/10/apache-oozie-part-13-oozie-ssh-action_30.html


This gist includes:

## 00-OozieWorkflowJavaMapReduceAction
This gist includes components of an Oozie workflow: scripts/code, sample data
and commands. Oozie actions covered: Java MapReduce action. Oozie controls
covered: start, kill, end. The Java program uses regex to parse the logs, and
also extracts the path of the mapper input directory and includes it in the
key emitted.

Note: The reducer can be specified as a combiner as well.

Usecase
-------

## 00-CustomGenericUDFHive-NVL2
This gist covers a simple Hive GenericUDF in Java that mimics the NVL2 functionality in Oracle.
NVL2 is used to handle nulls and conditionally substitute values.

Included:
1.  Input data
2.  Expected results
3.  UDF code in java
4.  Hive query to demo the UDF
5.  Output
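
For readers unfamiliar with NVL2, here is a minimal plain-Java sketch of its semantics: NVL2(expr1, expr2, expr3) returns expr2 when expr1 is not null, and expr3 otherwise. This is not the UDF code from this gist; the class name `Nvl2Sketch` and the method `nvl2` are hypothetical and exist only to illustrate the null-handling rule, without any Hive dependencies.

```java
/** Illustrative only: plain-Java NVL2 semantics, not the gist's Hive GenericUDF. */
final class Nvl2Sketch {
    private Nvl2Sketch() {}

    /**
     * Returns valueIfNotNull when expr is non-null, otherwise valueIfNull.
     * Mirrors Oracle's NVL2(expr1, expr2, expr3).
     */
    static <T> T nvl2(Object expr, T valueIfNotNull, T valueIfNull) {
        return expr != null ? valueIfNotNull : valueIfNull;
    }
}
```

For example, `Nvl2Sketch.nvl2(null, "has value", "missing")` evaluates to `"missing"`, while passing any non-null first argument yields `"has value"`.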


## 00-OozieWorkflowWithPigAction
This gist includes Oozie workflow components to run a Pig Latin script that parses
(syslog-generated) log files using regex.
Usecase:  Count the number of occurrences of processes that got logged, by month,
day and process.
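
The counting logic in this usecase can be sketched in plain Java (rather than Pig Latin) as follows. This is an illustration under assumptions, not the script from this gist: it assumes the common syslog line layout `Mon DD HH:MM:SS host process[pid]: message`, and the class name `SyslogCountSketch`, the regex, and the space-separated key format are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: counts log lines per (month, day, process). */
final class SyslogCountSketch {
    private SyslogCountSketch() {}

    // Assumed layout: "Mon DD HH:MM:SS host process[pid]: message" (the [pid] part is optional).
    private static final Pattern LINE = Pattern.compile(
        "^(\\w{3})\\s+(\\d{1,2})\\s+\\d{2}:\\d{2}:\\d{2}\\s+\\S+\\s+([^\\s\\[:]+)(?:\\[\\d+\\])?:.*$");

    /** Returns counts keyed by "month day process"; lines that do not match are skipped. */
    static Map<String, Integer> countByMonthDayProcess(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // not in the assumed syslog layout
            }
            String key = m.group(1) + " " + m.group(2) + " " + m.group(3);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

In a Pig or MapReduce job the same idea applies: the regex extracts the grouping fields, and the framework performs the group-and-count step that the `TreeMap` does here.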

Pictorial overview of workflow:
-------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-7-oozie-workflow-with_3.html

Includes:

## 00-OozieBundleApplication
Introduction
-------------
This gist includes sample data, application components, and components to execute a bundle application.

The sample bundle application is time triggered.  The start time is defined in the bundle job.properties
file.  The bundle application starts two coordinator applications, as defined in the bundle definition file,
bundleConfirguration.xml.

The first coordinator job is time triggered.  The start time is defined in the bundle job.properties file.
It runs a workflow that includes a Java main action.  The Java program parses some log files and generates

## 00-OozieCoordinatorJobWithDatasetCreationAsTrigger
This gist includes components of an Oozie coordinator job triggered by dataset availability:
scripts/code, sample data and commands.  Oozie actions covered: HDFS action, email action,
Sqoop action (MySQL database).  Oozie controls covered: decision.

Usecase
-------
Pipe report data available in HDFS to a MySQL database.

Pictorial overview of job:
--------------------------

## 00-SecondarySortJavaMapReduce
Secondary sort in MapReduce

With the MapReduce framework, the keys are sorted but the values associated with each key
are not.  In order for the values to be sorted, we need to write code to perform what is
referred to as a secondary sort.  The sample code in this gist demonstrates such a sort.

The input to the program is a bunch of employee attributes.
The output required is department number (deptNo) in ascending order, and the employee last name,
first name and employee ID in descending order.
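
The required ordering can be sketched as a plain-Java comparator. This is not the MapReduce code from this gist; it only illustrates the sort order, assuming that last name, first name and employee ID are all sorted descending within each department. The class `SecondarySortSketch`, the `Employee` record and its field names are hypothetical. In a MapReduce job, an ordering like this would typically live in the comparison logic of a composite key, paired with a partitioner and grouping comparator that consider only deptNo.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: the sort order a secondary sort would need to produce. */
final class SecondarySortSketch {
    private SecondarySortSketch() {}

    /** Hypothetical employee record; field names are assumptions. */
    static final class Employee {
        final int deptNo;
        final String lastName;
        final String firstName;
        final int empId;

        Employee(int deptNo, String lastName, String firstName, int empId) {
            this.deptNo = deptNo;
            this.lastName = lastName;
            this.firstName = firstName;
            this.empId = empId;
        }
    }

    /** deptNo ascending; then last name, first name and employee ID descending. */
    static final Comparator<Employee> ORDER =
        Comparator.comparingInt((Employee e) -> e.deptNo)
            .thenComparing((Employee e) -> e.lastName, Comparator.<String>reverseOrder())
            .thenComparing((Employee e) -> e.firstName, Comparator.<String>reverseOrder())
            .thenComparing((Employee e) -> e.empId, Comparator.<Integer>reverseOrder());

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Employee> sorted(List<Employee> employees) {
        List<Employee> copy = new ArrayList<>(employees);
        copy.sort(ORDER);
        return copy;
    }
}
```

The natural (primary) key here is deptNo; the remaining fields form the secondary part of the ordering, which the framework does not apply to values on its own.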

## 00-OozieWorkflowShellAction
This gist includes components of an Oozie workflow: scripts/code, sample data
and commands.  Oozie actions covered: shell action, email action.

Action 1: The shell action executes a shell script that does a line count for files matching
the glob provided, and writes the line count to standard output.
Action 2: The email action emails the output of action 1.


Pictorial overview of job:
--------------------------