@airawat
airawat / 00-CombineFileInputFornat
Last active March 27, 2024 04:57
CombineFileInputFormat - a solution to efficient map reduce processing of small files
*************************
Gist
*************************
One more gist related to controlling the number of mappers in a mapreduce task.
Background on Inputsplits
--------------------------
An inputsplit is a chunk of the input data allocated to a map task for processing. FileInputFormat
generates inputsplits (and divides the same into records) - one inputsplit for each file, unless the
@airawat
airawat / 00-OozieWorkflowSqoopAction
Last active January 20, 2024 07:08
Oozie workflow application with sqoop action. Pipes data from Hive table to mysql database table. Oozie 3.3.0; Sqoop (1.4.2) with Mysql (5.1.69)
This gist includes components of a simple workflow application (oozie 3.3.0) that
pipes data in a Hive table to mysql;
The sample application includes:
--------------------------------
1. Oozie actions: sqoop action
2. Oozie workflow controls: start, end, and kill.
3. Workflow components: job.properties and workflow.xml
4. Sample data
5. Prep tasks in Hive
@airawat
airawat / 00-OozieSSHAction
Last active July 25, 2023 04:43
Oozie SSH action. Sample Oozie workflow that demonstrates the SSH action to move files from a specific node to HDFS
This gist covers the Oozie SSH action.
It includes components of a sample Oozie workflow application- scripts/code,
sample data and commands; Oozie actions covered: secure shell action, email
action.
My blog has documentation, and highlights of a very basic sample program.
http://hadooped.blogspot.com/2013/10/apache-oozie-part-13-oozie-ssh-action_30.html
This gist includes:
@airawat
airawat / 00-OozieWorkflowJavaMapReduceAction
Last active February 23, 2023 20:19
Oozie workflow application with a Java Mapreduce action that parses syslog generated log files and generates a report. Gist includes sample data, all workflow components, java mapreduce program code, commands - hdfs and Oozie
This gist includes components of an Oozie workflow - scripts/code, sample data
and commands; Oozie actions covered: java mapreduce action; Oozie controls
covered: start, kill, end; The java program uses regex to parse the logs, and
also extracts the mapper input directory path and includes it in the
key emitted.
Note: The reducer can be specified as a combiner as well.
Usecase
-------
@airawat
airawat / 00-CustomGenericUDFHive-NVL2
Last active December 20, 2022 15:39
Custom genericUDF in Hive. Demonstrates NVL2 functionality
This gist covers a simple Hive genericUDF in Java, that mimics NVL2 functionality in Oracle.
NVL2 is used to handle nulls and conditionally substitute values.
Included:
1. Input data
2. Expected results
3. UDF code in java
4. Hive query to demo the UDF
5. Output
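The NVL2 behaviour described above can be sketched in plain Java. This is only an illustration of the semantics (return the second argument when the first is non-null, otherwise the third); it does not use the Hive GenericUDF API, and the class name, method name and sample values are hypothetical rather than taken from the gist.

```java
public class Nvl2Demo {
    /**
     * Mimics Oracle's NVL2(expr1, expr2, expr3):
     * returns ifNotNull when expr is non-null, otherwise ifNull.
     * (Hypothetical helper for illustration; not the gist's UDF code.)
     */
    public static <T> T nvl2(Object expr, T ifNotNull, T ifNull) {
        return (expr != null) ? ifNotNull : ifNull;
    }

    public static void main(String[] args) {
        String missing = null; // made-up sample value

        // Non-null first argument: the second argument is returned.
        System.out.println(nvl2("0.15", "has commission", "no commission"));

        // Null first argument: the third argument is returned.
        System.out.println(nvl2(missing, "has commission", "no commission"));
    }
}
```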
@airawat
airawat / 00-OozieWorkflowWithPigAction
Last active January 25, 2022 21:41
Sample of an Oozie workflow with pig action - parses Syslog generated log files using regex.
This gist includes oozie workflow components to run a pig latin script to parse
(Syslog generated) log files using regex;
Usecase: Count the number of occurrences of processes that got logged, by month,
day and process.
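As a rough sketch of the counting logic in this use case, the plain-Java example below parses log lines with a regex and tallies them by month, day and process. It assumes a typical syslog layout ("Mon DD HH:MM:SS host process[pid]: message"); the regex, class name and sample lines are assumptions for illustration, and the gist itself does this in Pig Latin.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyslogCount {
    // Assumed layout: "May  3 11:52:54 host process[pid]: message"
    // Groups: 1 = month, 2 = day, 3 = process name (pid is optional).
    private static final Pattern LINE = Pattern.compile(
        "^(\\w{3})\\s+(\\d{1,2})\\s+\\d{2}:\\d{2}:\\d{2}\\s+\\S+\\s+([^\\s\\[:]+)(?:\\[\\d+\\])?:.*$");

    /** Counts log lines per "month-day-process" key; unparseable lines are skipped. */
    public static Map<String, Integer> countByMonthDayProcess(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // not in the assumed syslog layout
            }
            String key = m.group(1) + "-" + m.group(2) + "-" + m.group(3);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not the gist's data.
        List<String> sample = Arrays.asList(
            "May  3 11:52:54 host1 init: tty main process killed",
            "May  3 11:53:31 host1 kernel: registered taskstats version 1",
            "May  3 11:54:02 host1 init: another message",
            "Apr 23 16:14:10 host1 sudo[1234]: session opened");
        countByMonthDayProcess(sample)
            .forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```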
Pictorial overview of workflow:
-------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-7-oozie-workflow-with_3.html
Includes:
@airawat
airawat / 00-OozieBundleApplication
Last active June 14, 2021 13:57
Oozie bundle application sample. The sample bundle application is time triggered. The start time is defined in the bundle job.properties file. The bundle application starts two coordinator applications- as defined in the bundle definition file - bundleConfirguration.xml. The first coordinator job is time triggered. The start time is defined in t…
Introduction
-------------
This gist includes sample data, application components, and components to execute a bundle application.
The sample bundle application is time triggered. The start time is defined in the bundle job.properties
file. The bundle application starts two coordinator applications- as defined in the bundle definition file -
bundleConfirguration.xml.
The first coordinator job is time triggered. The start time is defined in the bundle job.properties file.
It runs a workflow, that includes a java main action. The java program parses some log files and generates
@airawat
airawat / 00-OozieCoordinatorJobWithDatasetCreationAsTrigger
Last active May 21, 2021 16:09
Sample Oozie coordinator job that executes upon availability of a specified dataset. Includes scripts/code, sample data, commands.
This gist includes components of an Oozie coordinator job triggered by dataset availability -
scripts/code, sample data and commands; Oozie actions covered: hdfs action, email action,
sqoop action (mysql database); Oozie controls covered: decision;
Usecase
-------
Pipe report data available in HDFS, to mysql database;
Pictorial overview of job:
--------------------------
@airawat
airawat / 00-SecondarySortJavaMapReduce
Last active April 29, 2021 01:35
Secondary sort in mapreduce - Includes code for a simple program that sorts employee information by department ascending and employee name desc.
Secondary sort in Mapreduce
In the mapreduce framework, the keys are sorted but the values associated with each key
are not. In order for the values to be sorted, we need to write code to perform what is
referred to as a secondary sort. The sample code in this gist demonstrates such a sort.
The input to the program is a bunch of employee attributes.
The output required is department number (deptNo) in ascending order, and the employee last name,
first name and employee ID in descending order.
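The required ordering can be sketched as a comparator over a composite key. The example below is an in-memory illustration only: in an actual MapReduce job this comparison would typically live in the composite key (or a sort comparator), alongside a partitioner and grouping comparator on the department number. The class, field names and sample records are hypothetical and not taken from the gist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySortDemo {
    /** Hypothetical record holding the attributes named in the description. */
    public static final class Employee {
        public final int deptNo;
        public final String lastName;
        public final String firstName;
        public final int empId;

        public Employee(int deptNo, String lastName, String firstName, int empId) {
            this.deptNo = deptNo;
            this.lastName = lastName;
            this.firstName = firstName;
            this.empId = empId;
        }
    }

    // deptNo ascending; then last name, first name, employee ID descending.
    public static final Comparator<Employee> ORDER = (a, b) -> {
        int c = Integer.compare(a.deptNo, b.deptNo);
        if (c != 0) return c;
        c = b.lastName.compareTo(a.lastName);   // operands swapped => descending
        if (c != 0) return c;
        c = b.firstName.compareTo(a.firstName);
        if (c != 0) return c;
        return Integer.compare(b.empId, a.empId);
    };

    /** Returns employee IDs in the required output order, leaving the input untouched. */
    public static List<Integer> sortedIds(List<Employee> input) {
        List<Employee> copy = new ArrayList<>(input);
        copy.sort(ORDER);
        List<Integer> ids = new ArrayList<>();
        for (Employee e : copy) {
            ids.add(e.empId);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up sample records.
        List<Employee> staff = Arrays.asList(
            new Employee(20, "Adams", "Bea", 101),
            new Employee(10, "Smith", "Al", 102),
            new Employee(10, "Young", "Cy", 103),
            new Employee(20, "Adams", "Bea", 104));
        System.out.println(sortedIds(staff)); // prints [103, 102, 104, 101]
    }
}
```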
@airawat
airawat / 00-OozieWorkflowShellAction
Last active March 18, 2021 08:34
Oozie workflow with a shell action - with CaptureOutput. Counts lines in a glob provided and writes the same to standard output. A subsequent email action emails the output of the shell action
This gist includes components of an Oozie workflow - scripts/code, sample data
and commands; Oozie actions covered: shell action, email action
Action 1: The shell action executes a shell script that does a line count for files in a
glob provided, and writes the line count to standard output
Action 2: The email action emails the output of action 1
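For illustration, the line count performed in Action 1 can be sketched in Java as below. The gist's own action runs a shell script, so this is a stand-in for the logic only: it sums the lines of every file in a directory that matches a glob and prints the total to standard output. The class name, the `demo` helper and the temporary sample files are assumptions made for this sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

public class GlobLineCount {
    /** Sums the line counts of all files in dir whose names match the glob. */
    public static long countLines(Path dir, String glob) {
        long total = 0;
        try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, glob)) {
            for (Path file : matches) {
                try (Stream<String> lines = Files.lines(file)) {
                    total += lines.count();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Builds a throwaway directory with made-up files and counts the *.log lines. */
    public static long demo() {
        try {
            Path dir = Files.createTempDirectory("linecount");
            Files.write(dir.resolve("a.log"), Arrays.asList("one", "two"));
            Files.write(dir.resolve("b.log"), Arrays.asList("three", "four", "five"));
            Files.write(dir.resolve("notes.txt"), Arrays.asList("ignored")); // not matched
            return countLines(dir, "*.log");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Writes the total to standard output, as the shell action does.
        System.out.println(demo()); // prints 5
    }
}
```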
Pictorial overview of job:
--------------------------