Sample of an Oozie workflow with a Pig action - parses Syslog-generated log files using regex.
This gist includes Oozie workflow components to run a Pig Latin script that parses
(Syslog-generated) log files using regex.
Use case: count the number of occurrences of logged processes, by month
and process.
Pictorial overview of workflow:
-------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-7-oozie-workflow-with_3.html
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript
Pig script execution command: 05-PigLatinScriptExecution
Oozie job properties: 06-JobProperties
Oozie workflow: 07-OozieWorkflowXML
Oozie job execution command: 08-OozieCommand
Oozie job output: 09-Output
Oozie web console screenshots: 10-OozieWebScreenshots
1a. Sample data
----------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
1b. Structure
--------------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
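
To see how the regex in the Pig script (section 4) maps a raw log line to these
fields, here is a minimal grunt-shell sketch; the file name sampleLine.txt is
hypothetical and is assumed to contain just the first log line from 1a:

grunt> raw = LOAD 'sampleLine.txt' AS (line: chararray);
grunt> parsed = FOREACH raw GENERATE
           FLATTEN(REGEX_EXTRACT_ALL(line,
               '(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)'));
grunt> DUMP parsed;
(May,3,11:52:54,cdh-dn03,init:,tty (/dev/tty6) main process (1208) killed by TERM signal)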
2a. Data download
------------------
GitHub:
https://github.com/airawat/OozieSamples
Email me at airawat.blog@gmail.com if you encounter any issues.
2b. Directory structure applicable for this blog/gist:
-------------------------------------------------------
oozieProject
    data
        airawat-syslog
            <<Node-Name>>
                <<Year>>
                    <<Month>>
                        messages
    workflowPigAction
        workflow.xml
        job.properties
        reportScript.pig
3. Data load commands
----------------------
Save the zip file to your home directory and unzip it.
$ hadoop fs -mkdir oozieProject
$ hadoop fs -put oozieProject/* oozieProject/
Validate load:
$ hadoop fs -ls -R oozieProject/workflowPigAction | awk '{print $8}'
The listing should match the directory structure in 2b, above.
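
As an extra sanity check, the raw log file count should match what you unzipped;
a minimal sketch, assuming the 2b layout with one messages file per node/year/month:

$ hadoop fs -ls -R oozieProject/data | grep messages | wc -l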
/********************************************/
/* Pig Latin Script: reportScript.pig */
/********************************************/
rmf oozieProject/workflowPigAction/output

raw_log_DS =
    -- load the logs into a sequence of one-element tuples
    LOAD 'oozieProject/data/*/*/*/*/*' AS line;

parsed_log_DS =
    -- parse each line/log into a structure with named fields
    FOREACH raw_log_DS
    GENERATE
        FLATTEN(
            REGEX_EXTRACT_ALL(
                line,
                '(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)'
            )
        )
        AS (
            month_name: chararray,
            day:        chararray,
            time:       chararray,
            host:       chararray,
            process:    chararray,
            log:        chararray
        );

report_draft_DS =
    -- generate a dataset containing just the fields needed
    FOREACH parsed_log_DS GENERATE month_name, process;

grouped_report_DS =
    -- group the dataset
    GROUP report_draft_DS BY (month_name, process);

aggregate_report_DS =
    -- compute the count per (month, process) group
    FOREACH grouped_report_DS
    GENERATE group.month_name, group.process, COUNT(report_draft_DS) AS frequency;

sorted_DS =
    ORDER aggregate_report_DS BY $0, $1;

STORE sorted_DS INTO 'oozieProject/workflowPigAction/output/SortedResults';
To test the pig script independently
-------------------------------------
To test the script outside of the Oozie workflow, refer to my post:
https://gist.github.com/airawat/5915708
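
To just confirm the script compiles before involving Oozie, Pig's -check flag
performs a syntax check without executing; running the script directly against
HDFS also works:

$ pig -check reportScript.pig
$ pig reportScript.pig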
#-------------------------------------------------
# This is the job properties file - job.properties
#-------------------------------------------------
# Replace nameNode and jobTracker with your cluster-specific values; note the
# format (hdfs:// scheme for the name node, plain host:port for the job tracker)
nameNode=hdfs://cdh-nn01.chuntikhadoop.com:8020
jobTracker=cdh-jt01:8021
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=${nameNode}/user/akhanolk/oozieProject
appPath=${oozieProjectRoot}/workflowPigAction
oozie.wf.application.path=${appPath}
outputDir=${appPath}/output
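
The Pig action depends on the Oozie sharelib referenced by oozie.libpath; before
submitting, confirm the path actually exists on HDFS (adjust if your sharelib
lives elsewhere):

$ hadoop fs -ls /user/oozie/share/lib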
<!-- ==================================== -->
<!-- Oozie workflow file: workflow.xml    -->
<!-- ==================================== -->
<workflow-app name="WorkflowWithPigAction" xmlns="uri:oozie:workflow:0.1">
    <start to="pigAction"/>
    <action name="pigAction">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${outputDir}"/>
            </prepare>
            <script>reportScript.pig</script>
        </pig>
        <ok to="end"/>
        <error to="killJobAction"/>
    </action>
    <kill name="killJobAction">
        <message>"Killed job due to error: ${wf:errorMessage(wf:lastErrorNode())}"</message>
    </kill>
    <end name="end"/>
</workflow-app>
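
Before deploying, the workflow definition can be checked against the Oozie XML
schema with the Oozie CLI's validate subcommand, run against your local copy of
the file:

$ oozie validate workflowPigAction/workflow.xml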
08. Oozie commands
-------------------
Note: Replace the Oozie server and port with your cluster-specific values.
1) Submit job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowPigAction/job.properties -submit
job: 0000012-130712212133144-oozie-oozi-W
2) Run job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -start 0000012-130712212133144-oozie-oozi-W
3) Check the status:
$ oozie job -oozie http://cdh-dev01:11000/oozie -info 0000012-130712212133144-oozie-oozi-W
4) Suspend workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -suspend 0000012-130712212133144-oozie-oozi-W
5) Resume workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -resume 0000012-130712212133144-oozie-oozi-W
6) Re-run workflow:
$ oozie job -oozie http://cdh-dev01:11000/oozie -config oozieProject/workflowPigAction/job.properties -rerun 0000012-130712212133144-oozie-oozi-W
7) Should you need to kill the job:
$ oozie job -oozie http://cdh-dev01:11000/oozie -kill 0000012-130712212133144-oozie-oozi-W
8) View server logs:
$ oozie job -oozie http://cdh-dev01:11000/oozie -logs 0000012-130712212133144-oozie-oozi-W
Logs are available at:
/var/log/oozie on the Oozie server.
Program output
--------------
$ hadoop fs -cat oozieProject/workflowPigAction/output/SortedResults/part-r-00000
Apr pulseaudio[5705]: 1
Apr spice-vdagent[5657]: 1
Apr sudo: 5
May NetworkManager[1232]: 1
May NetworkManager[1243]: 1
May NetworkManager[1284]: 1
May NetworkManager[1292]: 1
.........
Oozie Web Console - Screenshots
--------------------------------
Available at:
http://hadooped.blogspot.com/2013/07/apache-oozie-part-7-oozie-workflow-with_3.html
airawat commented Jul 15, 2013:

To do: Improve the Pig script to:
a) Strip off session IDs/process IDs (omit the [1232] in NetworkManager[1232])
b) Factor the year into the grouping
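
For (a), one possible approach - a sketch, not tested against this dataset - is
to strip the bracketed ID with Pig's built-in REGEX_EXTRACT before grouping:

report_draft_DS =
    -- keep only the leading process name: NetworkManager[1232]: -> NetworkManager
    FOREACH parsed_log_DS GENERATE month_name, REGEX_EXTRACT(process, '^(\\w+)', 1) AS process;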
