**********************
**Gist
**********************
This gist details how to inner join two large datasets on the map side, using the join capability
in MapReduce. Such a join makes sense when both input datasets are too large to distribute
through DistributedCache. It can be implemented when both input datasets can be joined by the join key
and both are sorted in the same order by that key.
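To see why both inputs must be sorted by the join key, here is a minimal sketch of a merge-style inner join in plain Python. It is only a conceptual illustration, not the Hadoop join API used in this gist; the function name merge_join and the sample records are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs.

    Assumes both inputs are sorted ascending by key. Each input is read
    once, front to back, which is what makes a join without a shuffle possible.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lkey, lgroup = next(left_groups, (None, None))
    rkey, rgroup = next(right_groups, (None, None))
    while lgroup is not None and rgroup is not None:
        if lkey < rkey:
            # Left key has no partner yet; advance the left input.
            lkey, lgroup = next(left_groups, (None, None))
        elif lkey > rkey:
            # Right key has no partner yet; advance the right input.
            rkey, rgroup = next(right_groups, (None, None))
        else:
            # Keys match: emit every left/right combination for this key.
            rvalues = [value for _, value in rgroup]
            for _, lvalue in lgroup:
                for rvalue in rvalues:
                    yield (lkey, lvalue, rvalue)
            lkey, lgroup = next(left_groups, (None, None))
            rkey, rgroup = next(right_groups, (None, None))


# Hypothetical sample data, already sorted by the join key.
employees = [(1, "Ann"), (2, "Bob"), (4, "Dee")]
salaries = [(1, 5000), (3, 4200), (4, 6100)]
joined = list(merge_join(employees, salaries))
```

Here joined holds only the rows for keys 1 and 4, the keys present in both inputs. If either input were unsorted, the single forward pass would miss matches.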
There are two critical pieces to engaging the join behavior:
This gist includes components of an Oozie (trigger file initiated) coordinator job:
scripts/code, sample data and commands. Oozie actions covered: hdfs action, email action,
java main action, hive action. Oozie controls covered: decision, fork-join. The workflow
includes a sub-workflow that runs two hive actions concurrently. The hive table is
partitioned. Parsing uses the hive-regex serde and Java regex. Also, the Java mapper gets
the input directory path and includes part of it in the key.
Usecase
-------
Parse Syslog generated log files to generate reports;
B5b. Configure Oozie SSH action
Sometimes, you may need to execute jobs on a specific node instead of any cluster node.
For this, the Oozie service user must be able to connect to the node of choice as your workflow user.
# The following documentation details configuring an application ID to execute a SSH action
# In the illustration-
# edge node=cdh-en01
# oozie server=cdh-mn01
# application ID=akhanolk
**********************
Gist
**********************
A common interview question for a Hadoop developer position is whether we can control the number of
mappers for a job. We can - there are a few ways of controlling the number of mappers, as needed.
Using NLineInputFormat is one way.
About NLineInputFormat
----------------------
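NLineInputFormat hands each mapper a fixed number of input lines, so the mapper count works out to roughly the total line count divided by N, rounded up. The sketch below illustrates that arithmetic in plain Python; it is not the Hadoop API, and the function name split_by_lines and the sample records are hypothetical.

```python
import math


def split_by_lines(lines, lines_per_split):
    """Group input lines into splits of at most lines_per_split lines each.

    Each returned split stands in for the input of one mapper.
    """
    if lines_per_split < 1:
        raise ValueError("lines_per_split must be at least 1")
    return [
        lines[start:start + lines_per_split]
        for start in range(0, len(lines), lines_per_split)
    ]


# Hypothetical input: 10 records, 3 lines per mapper.
records = [f"record-{i}" for i in range(10)]
splits = split_by_lines(records, 3)
mapper_count = len(splits)
```

With 10 lines and N = 3, the last split holds the single leftover line, giving 4 mappers in total. In Hadoop itself, N is usually set on the job, for example through NLineInputFormat.setNumLinesPerSplit.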
This gist includes Oozie workflow components (streaming map reduce action) to execute
Python mapper and reducer scripts that parse Syslog-generated log files using regex;
Usecase: Count the number of occurrences of logged processes, by month and process.
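The counting logic can be sketched as follows. This is a minimal illustration, not the gist's actual scripts: the regex, the "month-process" key format, and the sample log lines are all assumptions. In a real streaming job the mapper would read stdin and print tab-separated key/value pairs, and a separate reducer would sum them; both steps are collapsed into plain functions here so the example is self-contained.

```python
import re
from collections import Counter

# Assumed syslog layout: "Mon DD HH:MM:SS host process[pid]: message".
SYSLOG_RE = re.compile(
    r"^(\w{3})\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\S+\s+([^\s\[:]+)(?:\[\d+\])?:"
)


def map_line(line):
    """Mapper step: return ("month-process", 1), or None if the line does not parse."""
    match = SYSLOG_RE.match(line)
    if match is None:
        return None
    month, process = match.groups()
    return (f"{month}-{process}", 1)


def count_events(lines):
    """Reducer step: sum the mapper's counts per key, skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        pair = map_line(line)
        if pair is not None:
            key, one = pair
            counts[key] += one
    return dict(counts)


# Hypothetical sample log lines.
sample = [
    "May  3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal",
    "May  3 11:53:31 cdh-dn03 kernel: registered taskstats version 1",
    "May  4 08:01:12 cdh-dn03 init: rcS main process started",
    "Apr 27 02:10:01 cdh-dn01 sshd[4321]: Accepted publickey for akhanolk",
    "not a syslog line",
]
totals = count_events(sample)
```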
Pictorial overview of workflow:
--------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-5-oozie-workflow-with.html
Includes:
---------
This gist includes components of a simple workflow application that creates a directory and moves files within
hdfs to this directory;
Emails are sent out to notify designated users of success/failure of the workflow. There is a prepare section
to allow re-run of the action; the prepare step essentially undoes the move done by a potential prior run
of the action. Sample data is also included.
The sample application includes:
--------------------------------
1. Oozie actions: hdfs action and email action
2. Oozie workflow controls: start, end, and kill.
This gist includes components of an Oozie workflow application: scripts/code, sample data
and commands; Oozie actions covered: sub-workflow, email, java main action,
sqoop action (to MySQL); Oozie controls covered: decision;
Pictorial overview:
--------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-8-subworkflow.html
Usecase:
--------
This gist demonstrates how to create a sequence file (compressed and uncompressed), from a text file.
Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Mapper code
5. Driver code to create the sequence file out of a text file in HDFS
6. Command to run Java program
********************************
Gist
********************************
Motivation
-----------
The typical MapReduce job creates files with the prefix "part-", followed by "m" or "r" depending
on whether it is a map or a reduce output, and then the part number. There are scenarios where we
may want to create separate files based on criteria: data keys and/or values. Enter the "MultipleOutputs"
functionality.
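The idea of routing records into separate outputs by key can be sketched in plain Python as follows. This only illustrates the concept and is not the Hadoop MultipleOutputs API. The function name route_records, the sample records, and the "<key>-r-00000" file naming (chosen to resemble the "part-r-00000" pattern described above) are assumptions.

```python
from collections import defaultdict


def route_records(records, task_type="r", partition=0):
    """Group values under an output name derived from each record's key.

    The name mimics the part-file pattern, with the record key in place
    of the fixed "part" prefix.
    """
    outputs = defaultdict(list)
    for key, value in records:
        filename = f"{key}-{task_type}-{partition:05d}"
        outputs[filename].append(value)
    return dict(outputs)


# Hypothetical records keyed by department.
records = [("sales", "Ann"), ("hr", "Bob"), ("sales", "Cy")]
files = route_records(records)
```

Each distinct key ends up with its own named output instead of all records landing in a single part file.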
# The following documentation details configuring an application ID to execute a SSH action
# In the illustration-
# edge node=cdh-sn03
# oozie server=cdh-mn01
# application ID=akhanolk
# ==========================================
# 1. On edge node, as application ID