Skip to content

Instantly share code, notes, and snippets.

@airawat
airawat / 00-MapSideJoinDistCacheThruGenericOptionsParser
Last active September 25, 2016 14:31
Map-side join example - Java code for joining two datasets - one large (tsv format), and one with reference data (txt file), made available through DistributedCache via command line (GenericOptionsParser)
This gist is part of a series of gists related to Map-side joins in Java map-reduce.
In the gist - https://gist.github.com/airawat/6597557, we added the reference data available
in HDFS to the distributed cache from the driver code.
This gist demonstrates adding a local file via command line to distributed cache.
Refer gist at https://gist.github.com/airawat/6597557 for-
1. Data samples and structure
2. Expected results
3. Commands to load data to HDFS
@airawat
airawat / 00-MapSideJoinDistCacheMapFile
Last active December 23, 2015 06:59
Map-side join example - Java code for joining two datasets - one large (tsv format), and one with lookup data (mapfile), made available through DistributedCache
This gist demonstrates how to do a map-side join, joining a MapFile from distributedcache
with a larger dataset in HDFS.
Includes:
---------
1. Input data and script download
2. Dataset structure review
3. Expected results
4. Mapper code
5. Driver code
@airawat
airawat / 00-MapSideJoinDistCacheTextFile
Last active February 21, 2021 19:05
Map-side join example - Java code for joining two datasets - one large (tsv format), and one with lookup data (text), made available through DistributedCache
This gist demonstrates how to do a map-side join, loading one small dataset from DistributedCache into a HashMap
in memory, and joining with a larger dataset.
Includes:
---------
1. Input data and script download
2. Dataset structure review
3. Expected results
4. Mapper code
5. Driver code
@airawat
airawat / 00-CreatingSequenceFile
Last active March 19, 2019 18:35
Hadoop Sequence File - Sample program to create a sequence file (compressed and uncompressed) from a text file, and another to read the sequence file.
This gist demonstrates how to create a sequence file (compressed and uncompressed), from a text file.
Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Mapper code
5. Driver code to create the sequence file out of a text file in HDFS
6. Command to run Java program
@airawat
airawat / 00-CreatingMapFile
Last active December 22, 2015 22:18
Creating a Map file in Hadoop. This gist covers reading a text file in HDFS, and creating a map file
This gist demonstrates how to create a map file, from a text file.
Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Java program to create the map file out of a text file in HDFS
5. Command to run Java program
6. Results of the program run to create map file
@airawat
airawat / 00-OozieWorkflowShellAction
Last active March 18, 2021 08:34
Oozie workflow with a shell action - with CaptureOutput Counts lines in a glob provided and writes the same to standard output. A subsequent email action emails the output of the shell action
This gist includes components of a oozie workflow - scripts/code, sample data
and commands; Oozie actions covered: shell action, email action
Action 1: The shell action executes a shell script that does a line count for files in a
glob provided, and writes the line count to standard output
Action 2: The email action emails the output of action 1
Pictorial overview of job:
--------------------------
@airawat
airawat / 00-OozieWorkflowCallWithJavaAPI
Last active September 4, 2016 17:59
Oozie workflow - invoked from Java using Oozie Java API
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;
public class myOozieWorkflowJavaAPICall {
public static void main(String[] args) {
OozieClient wc = new OozieClient("http://cdh-dev01:11000/oozie");
@airawat
airawat / 00-OozieWorkflowWithSubworkflow
Last active January 3, 2019 18:08
Oozie workflow application with a subworkflow Includes - sample data, workflow components, hdfs and oozie commands, application output
This gist includes components of a oozie workflow application - scripts/code, sample data
and commands; Oozie actions covered: sub-workflow, email java main action,
sqoop action (to mysql); Oozie controls covered: decision;
Pictorial overview:
--------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-8-subworkflow.html
Usecase:
--------
@airawat
airawat / 00-OozieBundleApplication
Last active June 14, 2021 13:57
Oozie bundle application sample. The sample bundle application is time triggered. The start time is defined in the bundle job.properties file. The bundle application starts two coordinator applications- as defined in the bundle definition file - bundleConfirguration.xml. The first coordinator job is time triggered. The start time is defined in t…
Introduction
-------------
This gist includes sample data, application components, and components to execute a bundle application.
The sample bundle application is time triggered. The start time is defined in the bundle job.properties
file. The bundle application starts two coordinator applications- as defined in the bundle definition file -
bundleConfirguration.xml.
The first coordinator job is time triggered. The start time is defined in the bundle job.properties file.
It runs a workflow, that includes a java main action. The java program parses some log files and generates
@airawat
airawat / 00-OozieWorkflowJavaMainAction
Last active December 19, 2015 18:59
Oozie workflow application with a java main action The java program parses log files and generates a report. Sample data, code, workflow components, commands are provided.
This gist includes components of a oozie workflow - scripts/code, sample data
and commands; Oozie actions covered: java main action; Oozie controls
covered: start, kill, end; The java program uses regex to parse the logs, and
also extracts pat of the mapper input directory path and includes in the key
emitted.
Usecase
-------
Parse Syslog generated log files to generate reports;