@airawat
airawat / 00-MapSideJoinLargeDatasets
Last active December 23, 2017 07:11
MapsideJoinOfTwoLargeDatasets(Old API) - Joining (inner join) two large datasets on the map side
**********************
Gist
**********************
This gist details how to inner join two large datasets on the map side, leveraging the join capability
in MapReduce. Such a join makes sense if both input datasets are too large to distribute
through the DistributedCache, and it can be implemented if both input datasets share a join key
and are sorted in the same order by that key.
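The sort requirement matters because it lets the join run as a single merge pass, without a shuffle. The sketch below illustrates that idea in plain Java only; it does not use Hadoop's join classes, and the class name, method name, and tab-separated output layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the gist or of Hadoop.
class SortedMergeJoinSketch {

    // Inner-joins two lists of {key, value} pairs.
    // Assumption: both lists are already sorted ascending by key (String order).
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> joined = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++; // left key has no partner yet; advance left
            } else if (cmp > 0) {
                j++; // right key has no partner yet; advance right
            } else {
                String key = left.get(i)[0];
                // Find the end of the run of equal keys on the right side.
                int jEnd = j;
                while (jEnd < right.size() && right.get(jEnd)[0].equals(key)) {
                    jEnd++;
                }
                // Pair every left record in the run with every right record in the run.
                while (i < left.size() && left.get(i)[0].equals(key)) {
                    for (int k = j; k < jEnd; k++) {
                        joined.add(key + "\t" + left.get(i)[1] + "\t" + right.get(k)[1]);
                    }
                    i++;
                }
                j = jEnd;
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up sample records, sorted by key.
        List<String[]> employees = Arrays.asList(
                new String[] {"d001", "alice"},
                new String[] {"d002", "bob"});
        List<String[]> departments = Arrays.asList(
                new String[] {"d002", "finance"},
                new String[] {"d003", "sales"});
        for (String row : innerJoin(employees, departments)) {
            System.out.println(row);
        }
    }
}
```

If either input were unsorted, the merge pass would silently miss matches, which is why the same sort order on both sides is a precondition rather than an optimization.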
There are two critical pieces to engaging the join behavior:
@airawat
airawat / 00-OozieCoordinatorJobWithFileAsTrigger
Last active February 12, 2018 10:10
Oozie coordinator job example with trigger file as trigger
This gist includes components of an Oozie coordinator job initiated by a trigger file -
scripts/code, sample data and commands; Oozie actions covered: hdfs action, email action,
java main action, hive action; Oozie controls covered: decision, fork-join; The workflow
includes a sub-workflow that runs two hive actions concurrently. The hive table is
partitioned; Parsing uses the hive regex serde and Java regex. Also, the java mapper gets
the input directory path and includes part of it in the key.
Usecase
-------
Parse Syslog generated log files to generate reports;
B5b. Configure Oozie SSH action
Sometimes, you may need to execute jobs on a specific node - instead of any cluster node.
For this, the Oozie service user must be able to connect to the node of choice as your workflow user.
# The following documentation details configuring an application ID to execute an SSH action
# In the illustration-
# edge node=cdh-en01
# oozie server=cdh-mn01
# application ID=akhanolk
@airawat
airawat / 00-NLineInputFormat
Last active August 22, 2018 15:18
NLineInputFormat - About NLineInputFormat, uses, and a sample program
**********************
Gist
**********************
A common interview question for a Hadoop developer position is whether we can control the number of
mappers for a job. We can - there are a few ways of controlling the number of mappers, as needed.
Using NLineInputFormat is one way.
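As a rough illustration of why this controls the mapper count: with N lines per split, a single input file of L lines yields about ceil(L / N) splits, and so that many map tasks. The sketch below is plain Java arithmetic only, not Hadoop code; the class and method names are hypothetical, and it assumes one input file.

```java
// Hypothetical illustration class; not part of the gist or of Hadoop.
class NLineSplitSketch {

    // Estimates the number of map tasks for ONE input file when every
    // split carries at most linesPerSplit lines (ceiling division).
    static long expectedMappers(long totalLines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        if (totalLines < 0) {
            throw new IllegalArgumentException("totalLines must not be negative");
        }
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    public static void main(String[] args) {
        // Made-up numbers: a 10-line file at 3 lines per split.
        System.out.println(expectedMappers(10, 3)); // prints 4
    }
}
```

With several input files the estimate would apply to each file separately, since a split does not span files.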
About NLineInputFormat
----------------------
@airawat
airawat / 00-OozieWorkflowStreamingMRAction-Python
Last active November 21, 2018 06:24
Sample of an Oozie workflow with streaming action - parses Syslog generated log files using Python regex
This gist includes oozie workflow components (streaming map reduce action) to execute
python mapper and reducer scripts to parse Syslog generated log files using regex;
Usecase: Count the number of occurrences of processes that got logged, by month and process.
Pictorial overview of workflow:
--------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-5-oozie-workflow-with.html
Includes:
---------
@airawat
airawat / 00-OozieWorkflowHdfsAndEmailActions
Last active November 21, 2018 14:33
Oozie workflow application with FS and email actions; Includes sample data, workflow components, commands.
This gist includes components of a simple workflow application that creates a directory and moves files within
hdfs to this directory;
Emails are sent out to notify designated users of success/failure of the workflow. There is a prepare section
to allow re-run of the action; the prepare essentially reverses the move done by a potential prior run
of the action. Sample data is also included.
The sample application includes:
--------------------------------
1. Oozie actions: hdfs action and email action
2. Oozie workflow controls: start, end, and kill.
@airawat
airawat / 00-OozieWorkflowWithSubworkflow
Last active January 3, 2019 18:08
Oozie workflow application with a subworkflow; includes sample data, workflow components, hdfs and oozie commands, application output
This gist includes components of an oozie workflow application - scripts/code, sample data
and commands; Oozie actions covered: sub-workflow, email, java main action,
sqoop action (to mysql); Oozie controls covered: decision;
Pictorial overview:
--------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-8-subworkflow.html
Usecase:
--------
@airawat
airawat / 00-CreatingSequenceFile
Last active March 19, 2019 18:35
Hadoop Sequence File - Sample program to create a sequence file (compressed and uncompressed) from a text file, and another to read the sequence file.
This gist demonstrates how to create a sequence file (compressed and uncompressed), from a text file.
Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Mapper code
5. Driver code to create the sequence file out of a text file in HDFS
6. Command to run Java program
@airawat
airawat / 00-MultipleOutputs
Last active July 17, 2019 10:29
MultipleOutputs sample program - A program that demonstrates how to generate an output file for each key
********************************
Gist
********************************
Motivation
-----------
The typical mapreduce job creates files with the prefix "part-", followed by "m" or "r" depending
on whether it is a map or a reduce output, and then the part number. There are scenarios where we
may want to create separate files based on criteria such as data keys and/or values. Enter the "MultipleOutputs"
functionality.
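To make the naming pattern concrete, here is a minimal sketch in plain Java that only builds the file names described above; it does not call Hadoop. The class and method names are hypothetical, and the per-key variant assumes the key itself is used as the base name in place of "part".

```java
import java.util.Locale;

// Hypothetical illustration class; not part of the gist or of Hadoop.
class PartFileNameSketch {

    // Default-style name: "part-", then "m" or "r", then a zero-padded part number.
    static String defaultName(boolean isMapOutput, int partition) {
        return String.format(Locale.ROOT, "part-%s-%05d", isMapOutput ? "m" : "r", partition);
    }

    // Per-key style name. Assumption: a caller-chosen base name (for example
    // the record key) replaces the "part" prefix.
    static String namedOutput(String baseName, boolean isMapOutput, int partition) {
        return String.format(Locale.ROOT, "%s-%s-%05d", baseName, isMapOutput ? "m" : "r", partition);
    }

    public static void main(String[] args) {
        System.out.println(defaultName(false, 0));       // prints part-r-00000
        System.out.println(namedOutput("HR", false, 3)); // prints HR-r-00003
    }
}
```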
@airawat
airawat / 00-OozieConfigSSHAction
Last active January 8, 2020 02:41
Oozie configuration for SSH action
# The following documentation details configuring an application ID to execute an SSH action
# In the illustration-
# edge node=cdh-sn03
# oozie server=cdh-mn01
# application ID=akhanolk
# ==========================================
# 1. On edge node, as application ID