Skip to content

Instantly share code, notes, and snippets.

@airawat
Last active December 19, 2015 18:59
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save airawat/6003001 to your computer and use it in GitHub Desktop.
Save airawat/6003001 to your computer and use it in GitHub Desktop.
Oozie workflow application with a java main action The java program parses log files and generates a report. Sample data, code, workflow components, commands are provided.
This gist includes components of a oozie workflow - scripts/code, sample data
and commands; Oozie actions covered: java main action; Oozie controls
covered: start, kill, end; The java program uses regex to parse the logs, and
also extracts pat of the mapper input directory path and includes in the key
emitted.
Usecase
-------
Parse Syslog generated log files to generate reports;
Pictorial overview of job:
--------------------------
http://hadooped.blogspot.com/2013/06/apache-oozie-part-6-oozie-workflow-with.html
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Java MR - Mapper code: 04A-MapperJavaCode
Java MR - Reducer code: 04B-ReducerJavaCode
Java MR - Driver code: 04C-DriverJavaCode
Command to test Java MR program: 04D-CommandTestJavaMRProg
Oozie job properties file: 05-OozieJobProperties
Oozie workflow file: 06-OozieWorkflowXML
Oozie commands 07-OozieJobExecutionCommands
Output -Report1 08-OutputOfJavaProgram
Oozie web console - screenshots 09-OozieWebConsoleScreenshots
01a. Sample data
-----------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
01b. Structure
---------------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
02a. Data download
-------------------
gitHub:
https://github.com/airawat/OozieSamples
Email me at airawat.blog@gmail.com if you encounter any issues.
Directory structure applicable for this blog/gist:
--------------------------------------------------
oozieProject
data
airawat-syslog
<<Node-Name>>
<<Year>>
<<Month>>
messages
workflowJavaMainAction
workflow.xml
job.properties
lib
LogEventCount.jar
03-Hdfs load commands
----------------------
$ hadoop fs -mkdir oozieProject
$ hadoop fs -put oozieProject/* oozieProject/
// Source code for Mapper
//-----------------------------------------------------------
// LogEventCountMapper.java
//-----------------------------------------------------------
// Java program that parses logs using regex
// The program counts the number of processes logged by year.
// E.g. Key=2013-ntpd; Value=1;
package Airawat.Oozie.Samples;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public class LogEventCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
String strLogEntryPattern = "(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)";
public static final int NUM_FIELDS = 6;
Text strEvent = new Text("");
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String strLogEntryLine = value.toString();
Pattern objPtrn = Pattern.compile(strLogEntryPattern);
Matcher objPatternMatcher = objPtrn.matcher(strLogEntryLine);
if (!objPatternMatcher.matches() || NUM_FIELDS != objPatternMatcher.groupCount()) {
System.err.println("Bad log entry (or problem with RE?):");
System.err.println(strLogEntryLine);
return;
}
/*
System.out.println("Month_Name: " + objPatternMatcher.group(1));
System.out.println("Day: " + objPatternMatcher.group(2));
System.out.println("Time: " + objPatternMatcher.group(3));
System.out.println("Node: " + objPatternMatcher.group(4));
System.out.println("Process: " + objPatternMatcher.group(5));
System.out.println("LogMessage: " + objPatternMatcher.group(6));
*/
//Oh what a pretty chunk of code ;)
strEvent.set(((FileSplit)context.getInputSplit()).getPath().toString().substring((((FileSplit)context.getInputSplit()).getPath().toString().length()-16), (((FileSplit)context.getInputSplit()).getPath().toString().length()-12)) + "-" + ((objPatternMatcher.group(5).toString().indexOf("[")) == -1 ? (objPatternMatcher.group(5).toString().substring(0,(objPatternMatcher.group(5).length()-1))) : (objPatternMatcher.group(5).toString().substring(0,(objPatternMatcher.group(5).toString().indexOf("["))))));
context.write(strEvent, new IntWritable(1));
}
}
// Source code for reducer
//--------------------------
// LogEventCountReducer.java
//--------------------------
package Airawat.Oozie.Samples;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class LogEventCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int intEventCount = 0;
for (IntWritable value : values) {
intEventCount += value.get();
}
context.write(key, new IntWritable(intEventCount));
}
}
// Source code for reducer
//--------------------------
// LogEventCountReducer.java
//--------------------------
package Airawat.Oozie.Samples;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
public class LogEventCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf(
"Usage: Airawat.Oozie.Samples.LogEventCount <input dir> <output dir>\n");
System.exit(-1);
}
//Instantiate a Job object for your job's configuration.
Job job = new Job();
//Job jar file
job.setJarByClass(LogEventCount.class);
//Job name
job.setJobName("Syslog Event Rollup");
//Paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//Mapper and reducer classes
job.setMapperClass(LogEventCountMapper.class);
job.setReducerClass(LogEventCountReducer.class);
//Job's output key and value classes
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//Number of reduce tasks
job.setNumReduceTasks(3);
//Start the MapReduce job, wait for it to finish.
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
Commands to test the java program
---------------------------------
a) Command to run the program
$ hadoop jar oozieProject/workflowJavaMainAction/lib/LogEventCount.jar Airawat.Oozie.Samples.LogEventCount "oozieProject/data/*/*/*/*/*" "oozieProject/workflowJavaMainAction/myCLIOutput"
b) Command to view results
$ hadoop fs -cat oozieProject/workflowJavaMainAction/myCLIOutput/part* | sort
c) Results
2013-NetworkManager 7
22013-console-kit-daemon 7
2013-gnome-session 11
2013-init 166
2013-kernel 810
2013-login 2
2013-NetworkManager 7
2013-nm-dispatcher.action 4
2013-ntpd_initres 4133
2013-polkit-agent-helper-1 8
2013-pulseaudio 18
2013-spice-vdagent 15
2013-sshd 6
2013-sudo 8
2013-udevd 6
#*****************************
# job.properties
#*****************************
nameNode=hdfs://cdh-nn01.chuntikhadoop.com:8020
jobTracker=cdh-jt01:8021
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=${nameNode}/user/${user.name}/oozieProject
appPath=${oozieProjectRoot}/workflowJavaMainAction
oozie.wf.application.path=${appPath}
inputDir=${oozieProjectRoot}/data/*/*/*/*/*
outputDir=${appPath}/output
<!--*************************************************-->
<!--*******06-workflow.xml***************************-->
<!--*************************************************-->
<workflow-app name="WorkflowJavaMainAction" xmlns="uri:oozie:workflow:0.1">
<start to="javaMainAction"/>
<action name="javaMainAction">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${outputDir}"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>Airawat.Oozie.Samples.LogEventCount</main-class>
<arg>${inputDir}</arg>
<arg>${outputDir}</arg>
</java>
<ok to="end"/>
<error to="killJobAction"/>
</action>
<kill name="killJobAction">
<message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
</kill>
<end name="end" />
</workflow-app>
07. Oozie commands
-------------------
Note: Replace oozie server and port, with your cluster-specific.
1) Submit job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowJavaMainAction/job.properties -submit
job: 0000012-130712212133144-oozie-oozi-W
2) Run job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -start 0000014-130712212133144-oozie-oozi-W
3) Check the status:
$ oozie job -oozie http://cdh-dev01:11000/oozie -info 0000014-130712212133144-oozie-oozi-W
4) Suspend workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -suspend 0000014-130712212133144-oozie-oozi-W
5) Resume workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -resume 0000014-130712212133144-oozie-oozi-W
6) Re-run workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowJavaMainAction/job.properties -rerun 0000014-130712212133144-oozie-oozi-W
7) Should you need to kill the job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -kill 0000014-130712212133144-oozie-oozi-W
8) View server logs:
$ oozie job -oozie http://cdh-dev01:11000/oozie -logs 0000014-130712212133144-oozie-oozi-W
Logs are available at:
/var/log/oozie on the Oozie server.
08-Workflow application output
-------------------------------
$ hadoop fs -ls -R oozieProject/workflowJavaMainAction/output/part* | awk '{print $8}'
oozieProject/workflowJavaMainAction/output/part-r-00000
$ hadoop fs -cat oozieProject/workflowJavaMainAction/output/part-r-00000
2013-NetworkManager 7
2013-console-kit-daemon 7
2013-gnome-session 11
2013-init 166
2013-kernel 810
2013-login 2
2013-nm-dispatcher.action 4
2013-ntpd_initres 4133
2013-polkit-agent-helper-1 8
2013-pulseaudio 18
2013-spice-vdagent 15
2013-sshd 6
2013-sudo 8
2013-udevd 6
09 - Oozie web console - screenshots
-------------------------------------
Available at-
http://hadooped.blogspot.com/2013/06/apache-oozie-part-6-oozie-workflow-with.html
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment