Last active
January 25, 2022 21:41
Sample of an Oozie workflow with pig action - parses Syslog generated log files using regex.
This gist includes Oozie workflow components to run a Pig Latin script that parses
(Syslog-generated) log files using regex.
Use case: Count the number of occurrences of processes that got logged, by month
and process.
Pictorial overview of workflow:
-------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-7-oozie-workflow-with_3.html
Includes:
---------
Sample data and structure:      01-SampleDataAndStructure
Data and script download:       02-DataAndScriptDownload
Data load commands:             03-HdfsLoadCommands
Pig script:                     04-PigLatinScript
Pig script execution command:   05-PigLatinScriptExecution
Oozie job properties:           06-JobProperties
Oozie workflow:                 07-OozieWorkflowXML
Oozie job execution command:    08-OozieCommand
Oozie job output:               09-Output
Oozie web console screenshots:  10-OozieWebScreenshots
1a. Sample data
----------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
1b. Structure
--------------
Month   = May
Day     = 3
Time    = 11:52:54
Node    = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
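The field breakdown in 1b can be checked locally before touching the cluster. The following is a minimal Python sketch, not part of the original gist: it assumes the regex from reportScript.pig carries over unchanged to Python's `re` module (with the doubled backslashes of the Pig string literal removed), and the names `SYSLOG_PATTERN`, `FIELDS` and `parse_line` are made up for illustration.

```python
import re

# Python transcription of the pattern passed to REGEX_EXTRACT_ALL in
# reportScript.pig (assumption: Java and Python agree on these constructs).
SYSLOG_PATTERN = re.compile(
    r"(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?:)\s+(.*$)"
)

# Field names taken from the AS (...) clause of the Pig script.
FIELDS = ("month_name", "day", "time", "host", "process", "log")


def parse_line(line):
    """Return a dict of named fields, or None if the whole line does not match."""
    match = SYSLOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


sample = ("May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) "
          "killed by TERM signal")
parsed = parse_line(sample)
for name in FIELDS:
    print(f"{name} = {parsed[name]}")
```

`fullmatch` is used because REGEX_EXTRACT_ALL only returns a tuple when the entire input matches the pattern.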
2a. Data download
------------------
GitHub:
https://github.com/airawat/OozieSamples
Email me at airawat.blog@gmail.com if you encounter any issues.
2b. Directory structure applicable for this blog/gist:
-------------------------------------------------------
oozieProject
    data
        airawat-syslog
            <<Node-Name>>
                <<Year>>
                    <<Month>>
                        messages
    workflowPigAction
        workflow.xml
        job.properties
        reportScript.pig
3. Data load commands
----------------------
Save the zip file to your home directory and unzip it.
$ hadoop fs -mkdir oozieProject
$ hadoop fs -put oozieProject/* oozieProject/
Validate load:
$ hadoop fs -ls -R oozieProject/workflowPigAction | awk '{print $8}'
The output should match the workflowPigAction listing in 2b, above.
/********************************************/
/* Pig Latin Script: reportScript.pig       */
/********************************************/
rmf oozieProject/workflowPigAction/output

raw_log_DS =
    -- load the logs into a sequence of one element tuples
    LOAD 'oozieProject/data/*/*/*/*/*' AS line;

parsed_log_DS =
    -- for each line/log parse the same into a
    -- structure with named fields
    FOREACH raw_log_DS
    GENERATE
        FLATTEN (
            REGEX_EXTRACT_ALL(
                line,
                '(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)'
            )
        )
        AS (
            month_name: chararray,
            day: chararray,
            time: chararray,
            host: chararray,
            process: chararray,
            log: chararray
        );

report_draft_DS =
    -- Generate dataset containing just the data needed
    FOREACH parsed_log_DS GENERATE month_name, process;

grouped_report_DS =
    -- Group the dataset
    GROUP report_draft_DS BY (month_name, process);

aggregate_report_DS =
    -- Compute count
    FOREACH grouped_report_DS {
        GENERATE group.month_name, group.process, COUNT(report_draft_DS) AS frequency;
    }

sorted_DS =
    ORDER aggregate_report_DS BY $0, $1;

STORE sorted_DS INTO 'oozieProject/workflowPigAction/output/SortedResults';
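To sanity-check what the script should produce for a handful of lines, the parse / project / group / count / order steps above can be mirrored in plain Python. This is only a sketch under assumptions, not the gist's method: `month_process_report` is a hypothetical helper, the regex is a Python transcription of the Pig pattern, and lines that do not match are simply skipped, which may differ from how Pig treats them.

```python
import re
from collections import Counter

# Python transcription of the regex used in reportScript.pig.
PATTERN = re.compile(
    r"(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?:)\s+(.*$)"
)


def month_process_report(lines):
    """Count log lines per (month_name, process), sorted by month then process."""
    counts = Counter()
    for line in lines:
        match = PATTERN.fullmatch(line.rstrip("\n"))
        if match is None:
            continue  # simplification: unparseable lines are skipped here
        month_name, _day, _time, _host, process, _log = match.groups()
        counts[(month_name, process)] += 1
    # Sorting the (month_name, process) keys mirrors ORDER ... BY $0, $1.
    return [(m, p, n) for (m, p), n in sorted(counts.items())]


# Lines copied from the sample data in 1a.
sample_lines = [
    "May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal",
    "May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1",
    "May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns",
    "May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org",
]

for month_name, process, frequency in month_process_report(sample_lines):
    print(f"{month_name}\t{process}\t{frequency}")
```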
To test the Pig script independently
-------------------------------------
To test the script outside of the Oozie workflow, refer to my post:
https://gist.github.com/airawat/5915708
#-------------------------------------------------
# This is the job properties file - job.properties
#-------------------------------------------------
# Replace nameNode and jobTracker with your cluster-specific details.
# Keep the format shown: nameNode includes the hdfs:// scheme, jobTracker is host:port without one.
nameNode=hdfs://cdh-nn01.chuntikhadoop.com:8020
jobTracker=cdh-jt01:8021
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=${nameNode}/user/akhanolk/oozieProject
appPath=${oozieProjectRoot}/workflowPigAction
oozie.wf.application.path=${appPath}
outputDir=${appPath}/output
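The properties above build on each other through ${...} references, so it can help to see what the final paths look like once everything is expanded. The Python sketch below only approximates that substitution for illustration; it is not how Oozie itself resolves properties, and `load`, `resolve` and `max_depth` are hypothetical names.

```python
import re

# Subset of job.properties, copied from above.
PROPERTIES = """\
nameNode=hdfs://cdh-nn01.chuntikhadoop.com:8020
oozieProjectRoot=${nameNode}/user/akhanolk/oozieProject
appPath=${oozieProjectRoot}/workflowPigAction
oozie.wf.application.path=${appPath}
outputDir=${appPath}/output
"""

REFERENCE = re.compile(r"\$\{([^}]+)\}")


def load(text):
    """Parse key=value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve(props, key, max_depth=10):
    """Expand ${name} references repeatedly; unknown names are left untouched."""
    value = props[key]
    for _ in range(max_depth):  # cap guards against circular references
        expanded = REFERENCE.sub(
            lambda m: props.get(m.group(1), m.group(0)), value
        )
        if expanded == value:
            break
        value = expanded
    return value


props = load(PROPERTIES)
for key in ("oozie.wf.application.path", "outputDir"):
    print(f"{key} -> {resolve(props, key)}")
```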
<!-- ==================================== -->
<!-- Oozie workflow file: workflow.xml    -->
<!-- ==================================== -->
<workflow-app name="WorkflowWithPigAction" xmlns="uri:oozie:workflow:0.1">
    <start to="pigAction"/>
    <action name="pigAction">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputDir}"/>
            </prepare>
            <script>reportScript.pig</script>
        </pig>
        <ok to="end"/>
        <error to="killJobAction"/>
    </action>
    <kill name="killJobAction">
        <message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
    </kill>
    <end name="end" />
</workflow-app>
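Before uploading workflow.xml, a quick local check can confirm that the file is well-formed and that every transition (start, ok, error) points at a node that exists. The Python sketch below is an assumption-laden convenience, not a replacement for Oozie's own schema validation; `dangling_transitions` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The workflow definition from above, embedded so the sketch is self-contained.
WORKFLOW_XML = """<workflow-app name="WorkflowWithPigAction" xmlns="uri:oozie:workflow:0.1">
    <start to="pigAction"/>
    <action name="pigAction">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputDir}"/>
            </prepare>
            <script>reportScript.pig</script>
        </pig>
        <ok to="end"/>
        <error to="killJobAction"/>
    </action>
    <kill name="killJobAction">
        <message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
    </kill>
    <end name="end"/>
</workflow-app>"""


def dangling_transitions(xml_text):
    """Return transition targets ('to' attributes) that name no workflow node."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    # Workflow nodes are the direct children of <workflow-app> carrying a name.
    node_names = {child.get("name") for child in root if child.get("name")}
    targets = {el.get("to") for el in root.iter() if el.get("to")}
    return sorted(targets - node_names)


print(dangling_transitions(WORKFLOW_XML))
```

An empty list means every `to` attribute resolves to a declared node.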
08. Oozie commands
-------------------
Note: Replace the Oozie server and port with your cluster-specific values, and the job ID with the one returned by the submit command.
1) Submit job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowPigAction/job.properties -submit
job: 0000012-130712212133144-oozie-oozi-W
2) Run job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -start 0000014-130712212133144-oozie-oozi-W
3) Check the status:
$ oozie job -oozie http://cdh-dev01:11000/oozie -info 0000014-130712212133144-oozie-oozi-W
4) Suspend workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -suspend 0000014-130712212133144-oozie-oozi-W
5) Resume workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -resume 0000014-130712212133144-oozie-oozi-W
6) Re-run workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowPigAction/job.properties -rerun 0000014-130712212133144-oozie-oozi-W
7) Should you need to kill the job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -kill 0000014-130712212133144-oozie-oozi-W
8) View server logs:
$ oozie job -oozie http://cdh-dev01:11000/oozie -log 0000014-130712212133144-oozie-oozi-W
Logs are also available under /var/log/oozie on the Oozie server.
Program output
--------------
$ hadoop fs -cat oozieProject/workflowPigAction/output/SortedResults/part-r-00000
Apr pulseaudio[5705]: 1
Apr spice-vdagent[5657]: 1
Apr sudo: 5
May NetworkManager[1232]: 1
May NetworkManager[1243]: 1
May NetworkManager[1284]: 1
May NetworkManager[1292]: 1
.........
Oozie Web Console - Screenshots
--------------------------------
Available at:
http://hadooped.blogspot.com/2013/07/apache-oozie-part-7-oozie-workflow-with_3.html
To do: Improve the Pig script to:
a) Strip off session IDs/process IDs (omit the [1232] in NetworkManager[1232])
b) Factor the year into the grouping
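Both to-do items can be prototyped outside Pig first. The Python sketch below is one possible approach under assumptions, not a finished change to reportScript.pig: `strip_pid` and `year_from_path` are hypothetical helpers, the year is assumed to come from the <<Year>> directory in the 2b layout (the log lines themselves carry no year), and the example path with its node, year and month values is made up.

```python
import re

# Matches a bracketed numeric ID that sits right before the trailing colon.
PID_SUFFIX = re.compile(r"\[\d+\](?=:$)")


def strip_pid(process):
    """'NetworkManager[1232]:' -> 'NetworkManager:'; other names are unchanged."""
    return PID_SUFFIX.sub("", process)


def year_from_path(path):
    """Return the <<Year>> component of .../<<Node-Name>>/<<Year>>/<<Month>>/messages."""
    parts = path.rstrip("/").split("/")
    return parts[-3]


for name in ("NetworkManager[1232]:", "NetworkManager[1243]:", "sudo:"):
    print(name, "->", strip_pid(name))

# Hypothetical path following the directory structure in 2b.
example_path = "oozieProject/data/airawat-syslog/cdh-dn03/2013/05/messages"
print(year_from_path(example_path))
```

With the IDs removed, the four separate NetworkManager rows in the program output would fall into a single group per month.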