Anagha Khanolkar airawat

  • Microsoft
@airawat
airawat / 00-OozieWorkflowShellAction
Last active Jul 18, 2019
Oozie workflow with a shell action (with CaptureOutput): counts lines in the files matching a supplied glob and writes the count to standard output. A subsequent email action emails the output of the shell action.
View 00-OozieWorkflowShellAction
This gist includes components of an Oozie workflow - scripts/code, sample data
and commands; Oozie actions covered: shell action, email action
Action 1: The shell action executes a shell script that counts the lines in the files matching
a supplied glob, and writes the line count to standard output
Action 2: The email action emails the output of action 1
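The line count performed by action 1 can be sketched in a few lines. This is only an illustration in Python, not the gist's shell script; the function name count_lines is hypothetical, and printing the result stands in for the standard output that the workflow captures.

```python
import glob


def count_lines(pattern):
    """Return the total number of lines across all files matching a glob."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the count going if a file has odd bytes
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            total += sum(1 for _ in handle)
    return total


# Example call (the path is hypothetical):
# print(count_lines("/tmp/logs/*.log"))
```

A glob that matches no files yields a count of 0 rather than an error.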
Pictorial overview of job:
--------------------------
@airawat
airawat / 00-MultipleOutputs
Last active Jul 17, 2019
MultipleOutputs sample program - A program that demonstrates how to generate an output file for each key
View 00-MultipleOutputs
********************************
Gist
********************************
Motivation
-----------
The typical mapreduce job creates files with the prefix "part-", followed by "m" or "r" depending
on whether it is a map or a reduce output, and then the part number. There are scenarios where we
may want to create separate files based on criteria such as data keys and/or values. Enter the "MultipleOutputs"
functionality.
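The idea of one output file per key can be sketched outside Hadoop. The following is a plain-Python illustration of the concept, not the MultipleOutputs API; the function name write_per_key and the "<key>.txt" file naming are assumptions made for this example.

```python
import os
from collections import defaultdict


def write_per_key(records, out_dir):
    """Write the values of (key, value) records into one file per key.

    Returns a dict mapping each key to the path of its file.
    """
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)

    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for key, values in grouped.items():
        # Hypothetical naming scheme: one text file named after the key
        path = os.path.join(out_dir, f"{key}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            for value in values:
                handle.write(f"{value}\n")
        paths[key] = path
    return paths
```

Values keep their input order within each file, since grouping appends them as they arrive.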
@airawat
airawat / 00-OozieWorkflowSqoopAction
Last active Mar 23, 2019
Oozie workflow application with sqoop action - pipes data from a Hive table to a mysql database table; Oozie 3.3.0; Sqoop (1.4.2) with Mysql (5.1.69)
View 00-OozieWorkflowSqoopAction
This gist includes components of a simple workflow application (oozie 3.3.0) that
pipes data in a Hive table to mysql;
The sample application includes:
--------------------------------
1. Oozie actions: sqoop action
2. Oozie workflow controls: start, end, and kill.
3. Workflow components: job.properties and workflow.xml
4. Sample data
5. Prep tasks in Hive
@airawat
airawat / 00-CreatingSequenceFile
Last active Mar 19, 2019
Hadoop Sequence File - Sample program to create a sequence file (compressed and uncompressed) from a text file, and another to read the sequence file.
View 00-CreatingSequenceFile
This gist demonstrates how to create a sequence file (compressed and uncompressed) from a text file.
Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Mapper code
5. Driver code to create the sequence file out of a text file in HDFS
6. Command to run Java program
@airawat
airawat / 00-OozieWorkflowWithSubworkflow
Last active Jan 3, 2019
Oozie workflow application with a subworkflow; includes sample data, workflow components, hdfs and oozie commands, application output
View 00-OozieWorkflowWithSubworkflow
This gist includes components of an Oozie workflow application - scripts/code, sample data
and commands; Oozie actions covered: sub-workflow, email, java main action,
sqoop action (to mysql); Oozie controls covered: decision;
Pictorial overview:
--------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-8-subworkflow.html
Usecase:
--------
@airawat
airawat / 00-SecondarySortJavaMapReduce
Last active Dec 8, 2018
Secondary sort in mapreduce - Includes code for a simple program that sorts employee information by department ascending and employee name descending.
View 00-SecondarySortJavaMapReduce
Secondary sort in Mapreduce
With the mapreduce framework, the keys are sorted but the values associated with each key
are not. In order for the values to be sorted, we need to write code to perform what is
referred to as a secondary sort. The sample code in this gist demonstrates such a sort.
The input to the program is a bunch of employee attributes.
The output required is department number (deptNo) in ascending order, and the employee last name,
first name and employee ID in descending order.
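The required ordering can be illustrated with a small in-memory sketch. This is plain Python rather than the gist's MapReduce code, where the same effect is typically obtained with a composite key plus a custom partitioner and grouping comparator; the record layout (dept_no, last_name, first_name, emp_id) and the function name are assumptions for this example.

```python
from operator import itemgetter


def secondary_sort(employees):
    """Order records by department ascending, then name and ID descending.

    Each record is a tuple: (dept_no, last_name, first_name, emp_id).
    """
    # Pass 1: sort on the secondary fields in descending order
    by_name_desc = sorted(employees, key=itemgetter(1, 2, 3), reverse=True)
    # Pass 2: Python's sort is stable, so sorting on dept_no alone
    # preserves the descending name order within each department
    return sorted(by_name_desc, key=itemgetter(0))
```

Sorting twice avoids having to negate string fields, which a single composite sort key would otherwise require for the descending part.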
@airawat
airawat / 00-OozieWorkflowHdfsAndEmailActions
Last active Nov 21, 2018
Oozie workflow application with FS and email actions; Includes sample data, workflow components, commands.
View 00-OozieWorkflowHdfsAndEmailActions
This gist includes components of a simple workflow application that creates a directory and moves files within
hdfs to this directory;
Emails are sent out to notify designated users of success/failure of the workflow. There is a prepare section
to allow re-run of the action. The prepare essentially negates the move done by a potential prior run
of the action. Sample data is also included.
The sample application includes:
--------------------------------
1. Oozie actions: hdfs action and email action
2. Oozie workflow controls: start, end, and kill.
@airawat
airawat / 00-OozieWorkflowStreamingMRAction-Python
Last active Nov 21, 2018
Sample of an Oozie workflow with streaming action - parses Syslog generated log files using python regex
View 00-OozieWorkflowStreamingMRAction-Python
This gist includes oozie workflow components (streaming map reduce action) to execute
python mapper and reducer scripts to parse Syslog generated log files using regex;
Usecase: Count the number of occurrences of processes that got logged, by month and process.
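The use case can be sketched as a single-process Python function. This is not the gist's mapper and reducer pair; the regex assumes the common "Mon DD HH:MM:SS host process[pid]: message" syslog layout, and the names SYSLOG_PATTERN and count_by_month_and_process are hypothetical.

```python
import re
from collections import Counter

# Assumed layout: "May  3 11:52:54 host process[pid]: message"
SYSLOG_PATTERN = re.compile(
    r"^(\w{3})\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\S+\s+([^\s:\[]+)(?:\[\d+\])?:"
)


def count_by_month_and_process(lines):
    """Count log lines per (month, process); skip lines that do not match."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_PATTERN.match(line)
        if match:
            month, process = match.groups()
            counts[(month, process)] += 1
    return counts
```

In a streaming job the mapper would emit the (month, process) pair for each matching line and the reducer would sum the counts; the Counter here plays both roles.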
Pictorial overview of workflow:
--------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-5-oozie-workflow-with.html
Includes:
---------
@airawat
airawat / 00-LogParser-Hive-Regex
Last active Sep 13, 2018
Log parser in Hive using regex serde
View 00-LogParser-Hive-Regex
This gist includes hive ql scripts to create an external partitioned table for Syslog
generated log files using regex serde;
Usecase: Count the number of occurrences of processes that got logged, by year, month,
day and process.
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data download: 02-DataDownload
Data load commands: 03-DataLoadCommands
@airawat
airawat / 00-NLineInputFormat
Last active Aug 22, 2018
NLineInputFormat - About NLineInputFormat, uses, and a sample program
View 00-NLineInputFormat
**********************
Gist
**********************
A common interview question for a Hadoop developer position is whether we can control the number of
mappers for a job. We can - there are a few ways of controlling the number of mappers, as needed.
Using NLineInputFormat is one way.
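NLineInputFormat hands each mapper a fixed number of input lines, so the number of mappers follows from the line count. The sketch below mimics only that splitting rule in plain Python; it is not Hadoop code, and the function name split_into_n_lines is hypothetical.

```python
def split_into_n_lines(lines, n):
    """Group lines into consecutive chunks of at most n lines each.

    Each chunk stands in for the input split given to one mapper.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [lines[i:i + n] for i in range(0, len(lines), n)]
```

For 10 input lines and n = 3 this gives 4 chunks, the last holding a single line, which corresponds to 4 mappers.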
About NLineInputFormat
----------------------