Airawat airawat

## 00-MapSideJoinLargeDatasets
**********************
Gist
**********************

This gist details how to inner join two large datasets on the map side, leveraging the join capability
in MapReduce.  Such a join makes sense when both input datasets are too large to qualify for distribution
through DistributedCache.  It can be implemented when both input datasets can be joined by the join key
and both are sorted in the same order by that key.
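
To illustrate why the sort order matters, here is a minimal Python sketch of the merge-style inner join that sorted inputs make possible. It is not the gist's Java code and does not use any Hadoop API; `merge_join` and the sample records are hypothetical names chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two sequences of (key, value) pairs, both sorted by key."""
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lkey, lrows = next(left_groups, (None, None))
    rkey, rrows = next(right_groups, (None, None))
    while lrows is not None and rrows is not None:
        if lkey < rkey:
            # Left key has no partner yet: advance the left side only.
            lkey, lrows = next(left_groups, (None, None))
        elif lkey > rkey:
            # Right key has no partner yet: advance the right side only.
            rkey, rrows = next(right_groups, (None, None))
        else:
            # Keys match: emit every left/right combination for this key.
            rvalues = [value for _, value in rrows]
            for _, lvalue in lrows:
                for rvalue in rvalues:
                    yield lkey, lvalue, rvalue
            lkey, lrows = next(left_groups, (None, None))
            rkey, rrows = next(right_groups, (None, None))


# Hypothetical sample data: (department id, name) pairs, sorted by id.
employees = [(1, "alice"), (2, "bob"), (2, "carol"), (4, "dave")]
departments = [(2, "sales"), (3, "ops"), (4, "hr")]

for row in merge_join(employees, departments):
    print(row)
```

Because each side is read once, front to back, neither dataset has to fit in memory, which is the property that makes this approach suitable for two large inputs.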

There are two critical pieces to engaging the join behavior:

## 00-OozieCoordinatorJobWithFileAsTrigger
This gist includes the components of an Oozie coordinator job initiated by a trigger file:
scripts/code, sample data, and commands.  Oozie actions covered: hdfs action, email action,
java main action, hive action.  Oozie controls covered: decision, fork-join.  The workflow
includes a sub-workflow that runs two hive actions concurrently.  The hive table is
partitioned.  Parsing uses the hive regex serde and Java regex.  Also, the Java mapper gets
the input directory path and includes part of it in the key.

Usecase
-------
Parse Syslog-generated log files to generate reports.

## Oozie-SSHConfig-Azure
B5b. Configure Oozie SSH action
Sometimes you may need to execute jobs on a specific node instead of any cluster node.
For this, the oozie service user must be able to connect to the node of choice as your workflow user.
# The following documentation details configuring an application ID to execute an SSH action

# In the illustration-
# edge node=cdh-en01
# oozie server=cdh-mn01
# application ID=akhanolk

## 00-NLineInputFormat
**********************
Gist
**********************

A common interview question for a Hadoop developer position is whether we can control the number of
mappers for a job.  We can - there are a few ways of controlling the number of mappers, as needed.
Using NLineInputFormat is one way.
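
As a rough illustration of the idea, the sketch below groups input lines into splits of at most N lines, with one split standing in for the input of one mapper. It is plain Python rather than Hadoop code; `n_line_splits` is a hypothetical helper, and the real input format works on HDFS files rather than in-memory lists.

```python
def n_line_splits(lines, n):
    """Group lines into consecutive splits of at most n lines each."""
    if n < 1:
        raise ValueError("n must be at least 1")
    split = []
    for line in lines:
        split.append(line)
        if len(split) == n:
            yield split
            split = []
    if split:
        # The last split may hold fewer than n lines.
        yield split


# Hypothetical input: 10 lines, 4 lines per split.
sample_lines = [f"record-{i}" for i in range(10)]
splits = list(n_line_splits(sample_lines, 4))
print(len(splits), [len(s) for s in splits])
```

Under this model the number of mappers is the number of splits, so choosing N indirectly sets the mapper count for an input of known length.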

About NLineInputFormat
----------------------

## 00-OozieWorkflowStreamingMRAction-Python
This gist includes Oozie workflow components (streaming map reduce action) to execute
Python mapper and reducer scripts that parse Syslog-generated log files using regex.
Usecase:  Count the number of occurrences of logged processes, by month and process.
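
The use case can be sketched as follows. This is not the gist's actual mapper and reducer; the regular expression assumes a common Syslog layout (`Mon DD HH:MM:SS host process[pid]: message`), and the function names and the `month-process` key format are assumptions made for illustration.

```python
import re
from itertools import groupby

# Assumed layout: "May  3 11:52:54 cdh-dn03 init: message" (pid is optional).
SYSLOG_RE = re.compile(
    r"^(\w{3})\s+\d{1,2}\s+[\d:]+\s+\S+\s+([^\s\[:]+)(?:\[\d+\])?:"
)


def map_lines(lines):
    """Mapper: emit 'month-process<TAB>1' for each line that parses."""
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match:
            month, process = match.groups()
            yield f"{month}-{process}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Hypothetical sample log lines.
sample_log = [
    "May  3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process killed",
    "May  3 11:53:31 cdh-dn03 kernel: registered taskstats version 1",
    "May  4 01:00:00 cdh-dn03 init: another event",
    "Jun  1 00:00:01 cdh-dn03 sshd[1234]: Accepted password",
    "a line that does not match",
]

# sorted() stands in for the shuffle/sort between map and reduce.
for result in reduce_lines(sorted(map_lines(sample_log))):
    print(result)
```

In a streaming job, each function would live in its own script that reads from `sys.stdin` and prints to standard output; the framework performs the sort between the two stages.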

Pictorial overview of workflow:
--------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-5-oozie-workflow-with.html

Includes:
---------

## 00-OozieWorkflowHdfsAndEmailActions
This gist includes the components of a simple workflow application that creates a directory and moves files within
hdfs to this directory.
Emails are sent out to notify designated users of the success or failure of the workflow.  There is a prepare section
to allow re-runs of the action; the prepare essentially reverses the move done by a potential prior run
of the action.  Sample data is also included.

The sample application includes:
--------------------------------
1.  Oozie actions: hdfs action and email action
2.  Oozie workflow controls: start, end, and kill.

## 00-OozieWorkflowWithSubworkflow
This gist includes the components of an Oozie workflow application: scripts/code, sample data,
and commands.  Oozie actions covered: sub-workflow, email, java main action,
sqoop action (to mysql).  Oozie controls covered: decision.

Pictorial overview:
--------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-8-subworkflow.html

Usecase:
--------

## 00-CreatingSequenceFile
This gist demonstrates how to create a sequence file (compressed and uncompressed), from a text file.

Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Mapper code
5. Driver code to create the sequence file out of a text file in HDFS
6. Command to run Java program

## 00-MultipleOutputs
********************************
Gist
********************************

Motivation
-----------
The typical MapReduce job creates files with the prefix "part-", followed by "m" or "r" depending
on whether it is map or reduce output, and then the part number.  There are scenarios where we
may want to create separate files based on criteria such as data keys and/or values.  Enter the "MultipleOutputs"
functionality.
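
The routing idea can be sketched in plain Python as below. This only models the concept of sending each record to an output chosen from its key or value; it does not use the Hadoop MultipleOutputs API, and `route_records`, `name_for`, and the sample records are hypothetical.

```python
from collections import defaultdict


def route_records(records, name_for):
    """Group (key, value) records by the output name that name_for picks."""
    outputs = defaultdict(list)
    for key, value in records:
        outputs[name_for(key, value)].append((key, value))
    return dict(outputs)


# Hypothetical records: (department, employee) pairs.
sample_records = [
    ("sales", "bob"),
    ("hr", "dave"),
    ("sales", "carol"),
]

# Route by key, so each department ends up in its own named output.
by_department = route_records(sample_records, lambda key, value: key)
for name, rows in sorted(by_department.items()):
    print(name, rows)
```

Each dictionary entry stands in for one separately named output file, in contrast to a single "part-" file that mixes all keys.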

## 00-OozieConfigSSHAction
# The following documentation details configuring an application ID to execute an SSH action

# In the illustration-
# edge node=cdh-sn03
# oozie server=cdh-mn01
# application ID=akhanolk


# ==========================================
# 1.  On edge node, as application ID