Anagha Khanolkar airawat

  • Microsoft
@airawat
airawat / 00-OozieCoordinatorJobWithDatasetCreationAsTrigger
Last active Jul 1, 2020
Sample Oozie coordinator job that executes upon availability of a specified dataset. Includes scripts/code, sample data, commands.
This gist includes the components of an Oozie coordinator job triggered by dataset availability:
scripts/code, sample data and commands. Oozie actions covered: hdfs action, email action,
sqoop action (mysql database). Oozie controls covered: decision.
Usecase
-------
Pipe report data available in HDFS to a mysql database.
Pictorial overview of job:
--------------------------
airawat / 00-CombineFileInputFornat
Last active May 30, 2020
CombineFileInputFormat - a solution to efficient map reduce processing of small files
*************************
Gist
*************************
One more gist related to controlling the number of mappers in a mapreduce task.
Background on Inputsplits
--------------------------
An inputsplit is a chunk of the input data allocated to a map task for processing. FileInputFormat
generates inputsplits (and divides the same into records) - one inputsplit for each file, unless the
airawat / 00-OozieSSHAction
Last active Feb 18, 2020
Oozie SSH action - Sample Oozie workflow that demonstrates the SSH action to move files from a specific node to HDFS
This gist covers the Oozie SSH action.
It includes components of a sample Oozie workflow application: scripts/code,
sample data and commands; Oozie actions covered: secure shell action, email
action.
My blog has documentation, and highlights of a very basic sample program.
http://hadooped.blogspot.com/2013/10/apache-oozie-part-13-oozie-ssh-action_30.html
This gist includes:
airawat / 00-OozieConfigSSHAction
Last active Jan 8, 2020
Oozie configuration for SSH action
# The following documentation details configuring an application ID to execute an SSH action
# In the illustration-
# edge node=cdh-sn03
# oozie server=cdh-mn01
# application ID=akhanolk
# ==========================================
# 1. On edge node, as application ID
airawat / 00-MapSideJoinDistCacheTextFile
Last active Nov 28, 2019
Map-side join example - Java code for joining two datasets - one large (tsv format), and one with lookup data (text), made available through DistributedCache
This gist demonstrates how to do a map-side join, loading one small dataset from DistributedCache into a HashMap
in memory, and joining with a larger dataset.
Includes:
---------
1. Input data and script download
2. Dataset structure review
3. Expected results
4. Mapper code
5. Driver code
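The join idea described above (load the small lookup dataset into a HashMap, then look up each record of the larger dataset) can be sketched in plain Java without any Hadoop classes. This is only an illustration under assumptions: the class name, the tab-separated layouts, the sample records and the "NOT-FOUND" marker are hypothetical and are not taken from the gist; in the actual gist the lookup file reaches the mapper through DistributedCache.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {

    // Load the small dataset into memory.
    // Assumed (hypothetical) layout of each lookup line: deptNo<TAB>deptName
    static Map<String, String> loadLookup(List<String> lookupLines) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : lookupLines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                lookup.put(parts[0].trim(), parts[1].trim());
            }
        }
        return lookup;
    }

    // Join one record of the large dataset against the in-memory map.
    // Assumed (hypothetical) layout: the join key (deptNo) is the last tab-separated field.
    static String joinRecord(String record, Map<String, String> lookup) {
        String[] fields = record.split("\t");
        String deptNo = fields[fields.length - 1];
        String deptName = lookup.getOrDefault(deptNo, "NOT-FOUND");
        return record + "\t" + deptName;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        Map<String, String> lookup = loadLookup(Arrays.asList("d1\tSales", "d2\tFinance"));
        for (String record : Arrays.asList("e1\tSmith\td2", "e2\tJones\td9")) {
            System.out.println(joinRecord(record, lookup));
        }
    }
}
```

In a mapper, loadLookup would typically run once during setup and joinRecord once per input record, so the join needs no reduce phase.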
airawat / 00-SecondarySortJavaMapReduce
Last active Nov 20, 2019
Secondary sort in mapreduce - Includes code for a simple program that sorts employee information by department ascending and employee name desc.
Secondary sort in Mapreduce
With the mapreduce framework, the keys are sorted but the values associated with each key
are not. In order for the values to be sorted, we need to write code to perform what is
referred to as a secondary sort. The sample code in this gist demonstrates such a sort.
The input to the program is a bunch of employee attributes.
The output required is department number (deptNo) in ascending order, and the employee last name,
first name and employee ID in descending order.
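The required ordering can be expressed as a single comparator. The sketch below shows only that ordering in plain Java; it is not the gist's MapReduce code, and the Employee class, its field names and the sample rows are hypothetical. In a MapReduce job the same ordering is usually obtained by moving the sort fields into a composite key, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    // Hypothetical record type holding the attributes named in the text.
    static final class Employee {
        final String deptNo;
        final String lastName;
        final String firstName;
        final String empId;

        Employee(String deptNo, String lastName, String firstName, String empId) {
            this.deptNo = deptNo;
            this.lastName = lastName;
            this.firstName = firstName;
            this.empId = empId;
        }

        @Override
        public String toString() {
            return deptNo + "\t" + lastName + "\t" + firstName + "\t" + empId;
        }
    }

    // deptNo ascending, then last name, first name and employee ID descending.
    static final Comparator<Employee> ORDER =
            Comparator.comparing((Employee e) -> e.deptNo)
                    .thenComparing((Employee e) -> e.lastName, Comparator.<String>reverseOrder())
                    .thenComparing((Employee e) -> e.firstName, Comparator.<String>reverseOrder())
                    .thenComparing((Employee e) -> e.empId, Comparator.<String>reverseOrder());

    // Return a sorted copy, leaving the input list untouched.
    static List<Employee> sort(List<Employee> input) {
        List<Employee> copy = new ArrayList<>(input);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<Employee> sorted = sort(Arrays.asList(
                new Employee("d2", "Adams", "Ann", "e1"),
                new Employee("d1", "Baker", "Bob", "e2"),
                new Employee("d1", "Young", "Yan", "e3")));
        sorted.forEach(System.out::println);
    }
}
```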
airawat / 00-OozieWorkflowShellAction
Last active Jul 18, 2019
Oozie workflow with a shell action (with CaptureOutput) - Counts lines in a glob provided and writes the same to standard output. A subsequent email action emails the output of the shell action.
This gist includes components of an Oozie workflow - scripts/code, sample data
and commands; Oozie actions covered: shell action, email action
Action 1: The shell action executes a shell script that does a line count for files in a
glob provided, and writes the line count to standard output
Action 2: The email action emails the output of action 1
Pictorial overview of job:
--------------------------
airawat / 00-MultipleOutputs
Last active Jul 17, 2019
MultipleOutputs sample program - A program that demonstrates how to generate an output file for each key
********************************
Gist
********************************
Motivation
-----------
The typical mapreduce job creates files with the prefix "part-", followed by "m" or "r" depending
on whether it is a map or a reduce output, and then the part number. There are scenarios where we
may want to create separate files based on criteria such as data keys and/or values. Enter the
"MultipleOutputs" functionality.
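To make the naming convention concrete, here is a small plain-Java sketch that builds such file names. The helper outputFileName is hypothetical and is not a Hadoop API; it only mimics the pattern described above (prefix, "m" or "r", zero-padded part number), and using a data key such as "d001" as the prefix is an assumption about how per-key files might be named.

```java
public class PartFileNameSketch {

    // Hypothetical helper: builds "<baseName>-<m|r>-<5-digit part number>".
    static String outputFileName(String baseName, boolean isMapOutput, int partition) {
        return String.format("%s-%s-%05d", baseName, isMapOutput ? "m" : "r", partition);
    }

    public static void main(String[] args) {
        // Default-style name for the first reducer's output.
        System.out.println(outputFileName("part", false, 0));
        // Assumed per-key name, using a made-up key "d001" as the prefix.
        System.out.println(outputFileName("d001", false, 0));
    }
}
```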
airawat / 00-OozieWorkflowSqoopAction
Last active Mar 23, 2019
Oozie workflow application with sqoop action - Pipes data from Hive table to mysql database table. Oozie 3.3.0; Sqoop (1.4.2) with Mysql (5.1.69)
This gist includes components of a simple workflow application (oozie 3.3.0) that
pipes data in a Hive table to mysql;
The sample application includes:
--------------------------------
1. Oozie actions: sqoop action
2. Oozie workflow controls: start, end, and kill.
3. Workflow components: job.properties and workflow.xml
4. Sample data
5. Prep tasks in Hive
airawat / 00-CreatingSequenceFile
Last active Mar 19, 2019
Hadoop Sequence File - Sample program to create a sequence file (compressed and uncompressed) from a text file, and another to read the sequence file.
This gist demonstrates how to create a sequence file (compressed and uncompressed), from a text file.
Includes:
---------
1. Input data and script download
2. Input data review
3. Data load commands
4. Mapper code
5. Driver code to create the sequence file out of a text file in HDFS
6. Command to run Java program