Sample of an Oozie workflow with a Pig action - parses Syslog-generated log files using regex.
This gist includes Oozie workflow components to run a Pig Latin script that parses
(Syslog-generated) log files using regex.
Use case: count the number of occurrences of logged processes, by month
and process.
Pictorial overview of workflow:
-------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-7-oozie-workflow-with_3.html
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript
Pig script execution command: 05-PigLatinScriptExecution
Oozie job properties: 06-JobProperties
Oozie workflow: 07-OozieWorkflowXML
Oozie job execution command: 08-OozieCommand
Oozie job output: 09-Output
Oozie web console screenshots: 10-OozieWebScreenshots
1a. Sample data
----------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
1b. Structure
--------------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
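
To see how the regex in the Pig script (section 4) maps a raw log line to these
fields, here is a minimal grunt-shell sketch; the file name sampleLine.txt is
hypothetical and is assumed to contain just the first log line from 1a:

grunt> raw = LOAD 'sampleLine.txt' AS (line: chararray);
grunt> parsed = FOREACH raw GENERATE
           FLATTEN(REGEX_EXTRACT_ALL(line,
               '(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)'));
grunt> DUMP parsed;
(May,3,11:52:54,cdh-dn03,init:,tty (/dev/tty6) main process (1208) killed by TERM signal)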
2a. Data download
------------------
GitHub:
https://github.com/airawat/OozieSamples
Email me at airawat.blog@gmail.com if you encounter any issues.
2b. Directory structure applicable for this blog/gist:
-------------------------------------------------------
oozieProject
    data
        airawat-syslog
            <<Node-Name>>
                <<Year>>
                    <<Month>>
                        messages
    workflowPigAction
        workflow.xml
        job.properties
        reportScript.pig
3. Data load commands
----------------------
Save the zip file to your home directory and unzip it.
$ hadoop fs -mkdir oozieProject
$ hadoop fs -put oozieProject/* oozieProject/
Validate load:
$ hadoop fs -ls -R oozieProject/workflowPigAction | awk '{print $8}'
The listing should match the directory structure in 2b, above.
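
As an extra sanity check, the raw log file count should match what you unzipped;
a minimal sketch, assuming the 2b layout with one messages file per node/year/month:

$ hadoop fs -ls -R oozieProject/data | grep messages | wc -l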
/********************************************/
/* Pig Latin Script: reportScript.pig */
/********************************************/
rmf oozieProject/workflowPigAction/output

raw_log_DS =
    -- load the logs into a sequence of one-element tuples
    LOAD 'oozieProject/data/*/*/*/*/*' AS line;

parsed_log_DS =
    -- parse each line/log into a structure with named fields
    FOREACH raw_log_DS
    GENERATE
        FLATTEN(
            REGEX_EXTRACT_ALL(
                line,
                '(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)'
            )
        )
        AS (
            month_name: chararray,
            day:        chararray,
            time:       chararray,
            host:       chararray,
            process:    chararray,
            log:        chararray
        );

report_draft_DS =
    -- generate a dataset containing just the fields needed
    FOREACH parsed_log_DS GENERATE month_name, process;

grouped_report_DS =
    -- group the dataset
    GROUP report_draft_DS BY (month_name, process);

aggregate_report_DS =
    -- compute the count per (month, process) group
    FOREACH grouped_report_DS
    GENERATE group.month_name, group.process, COUNT(report_draft_DS) AS frequency;

sorted_DS =
    ORDER aggregate_report_DS BY $0, $1;

STORE sorted_DS INTO 'oozieProject/workflowPigAction/output/SortedResults';
To test the pig script independently
-------------------------------------
To test the script outside of the Oozie workflow, refer to my post:
https://gist.github.com/airawat/5915708
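
To just confirm the script compiles before involving Oozie, Pig's -check flag
performs a syntax check without executing; running the script directly against
HDFS also works:

$ pig -check reportScript.pig
$ pig reportScript.pig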
#-------------------------------------------------
# This is the job properties file - job.properties
#-------------------------------------------------
# Replace nameNode and jobTracker with your cluster-specific values; note the
# format (hdfs:// scheme for the name node, plain host:port for the job tracker)
nameNode=hdfs://cdh-nn01.chuntikhadoop.com:8020
jobTracker=cdh-jt01:8021
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=${nameNode}/user/akhanolk/oozieProject
appPath=${oozieProjectRoot}/workflowPigAction
oozie.wf.application.path=${appPath}
outputDir=${appPath}/output
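
The Pig action depends on the Oozie sharelib referenced by oozie.libpath; before
submitting, confirm the path actually exists on HDFS (adjust if your sharelib
lives elsewhere):

$ hadoop fs -ls /user/oozie/share/lib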
<!-- ==================================== -->
<!-- Oozie workflow file: workflow.xml    -->
<!-- ==================================== -->
<workflow-app name="WorkflowWithPigAction" xmlns="uri:oozie:workflow:0.1">
    <start to="pigAction"/>
    <action name="pigAction">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputDir}"/>
            </prepare>
            <script>reportScript.pig</script>
        </pig>
        <ok to="end"/>
        <error to="killJobAction"/>
    </action>
    <kill name="killJobAction">
        <message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
    </kill>
    <end name="end"/>
</workflow-app>
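
Before deploying, the workflow definition can be checked against the Oozie XML
schema with the Oozie CLI's validate subcommand, run against your local copy of
the file:

$ oozie validate workflowPigAction/workflow.xml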
08. Oozie commands
-------------------
Note: Replace the Oozie server and port with your cluster-specific values.
1) Submit job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowPigAction/job.properties -submit
job: 0000012-130712212133144-oozie-oozi-W
2) Run job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -start 0000012-130712212133144-oozie-oozi-W
3) Check the status:
$ oozie job -oozie http://cdh-dev01:11000/oozie -info 0000012-130712212133144-oozie-oozi-W
4) Suspend workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -suspend 0000012-130712212133144-oozie-oozi-W
5) Resume workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -resume 0000012-130712212133144-oozie-oozi-W
6) Re-run workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowPigAction/job.properties -rerun 0000012-130712212133144-oozie-oozi-W
7) Should you need to kill the job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -kill 0000012-130712212133144-oozie-oozi-W
8) View server logs:
$ oozie job -oozie http://cdh-dev01:11000/oozie -logs 0000012-130712212133144-oozie-oozi-W
Logs are available at:
/var/log/oozie on the Oozie server.
Program output
--------------
$ hadoop fs -cat oozieProject/workflowPigAction/output/SortedResults/part-r-00000
Apr pulseaudio[5705]: 1
Apr spice-vdagent[5657]: 1
Apr sudo: 5
May NetworkManager[1232]: 1
May NetworkManager[1243]: 1
May NetworkManager[1284]: 1
May NetworkManager[1292]: 1
.........
Oozie Web Console - Screenshots
--------------------------------
Available at:
http://hadooped.blogspot.com/2013/07/apache-oozie-part-7-oozie-workflow-with_3.html
airawat commented Jul 15, 2013:

To do: Improve the Pig script to:
a) Strip off session IDs/process IDs (omit the [1232] in NetworkManager[1232])
b) Factor the year into the grouping
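
For (a), one possible approach - a sketch, not tested against this dataset - is
to strip the bracketed ID with Pig's built-in REGEX_EXTRACT before grouping:

report_draft_DS =
    -- keep only the leading process name: NetworkManager[1232]: -> NetworkManager
    FOREACH parsed_log_DS GENERATE month_name, REGEX_EXTRACT(process, '^(\\w+)', 1) AS process;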
