@airawat
Last active December 19, 2015 06:59
This gist includes a Pig Latin script that parses Syslog-generated log files using a regex;
Use case: Count the number of occurrences of each logged process, grouped by month
and process.
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript
Pig script execution command: 05-PigLatinScriptExecution
Output: 06-Output
Sample data
------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
Structure
----------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
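The field breakdown above can be sanity-checked outside the cluster. Below is a minimal Python sketch of the same pattern the Pig script passes to REGEX_EXTRACT_ALL (Pig uses Java-flavored regex with doubled backslashes in the string literal; a Python raw string with single backslashes behaves identically for this pattern):

```python
import re

# Same six capture groups as the Pig script:
# month, day, time, host, process, log message
PATTERN = re.compile(
    r'(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?\:)\s+(.*$)'
)

line = ('May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) '
        'main process (1208) killed by TERM signal')

month, day, time, host, process, log = PATTERN.match(line).groups()
print(month, day, time, host, process, sep=' | ')
# May | 3 | 11:52:54 | cdh-dn03 | init:
```

Note the lazy `(.*?\:)` group: it stops at the first colon after the host, so the process field keeps its trailing colon, matching the counts shown in the Output section.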
Data and Script download
------------------------
https://groups.google.com/forum/?hl=en#!topic/hadooped/Wix8ZznQGJU
Directory structure
-------------------
LogParserSamplePig
   Data
      airawat-syslog
         2013
            04
               messages
            05
               messages
   SysLog-Pig-Report.pig
HDFS load commands
-------------------
$ hadoop fs -put LogParserSamplePig/
Validate load
-------------
$ hadoop fs -ls -R LogParserSamplePig | awk '{print $8}'
Expected directory structure
-----------------------------
LogParserSamplePig/Data
LogParserSamplePig/Data/airawat-syslog
LogParserSamplePig/Data/airawat-syslog/2013
LogParserSamplePig/Data/airawat-syslog/2013/04
LogParserSamplePig/Data/airawat-syslog/2013/04/messages
LogParserSamplePig/Data/airawat-syslog/2013/05
LogParserSamplePig/Data/airawat-syslog/2013/05/messages
LogParserSamplePig/SysLog-Pig-Report.pig
Pig Latin script - SysLog-Pig-Report.pig
----------------------------------------
rmf LogParserSamplePig/output

raw_log_DS =
  -- Load the logs as a sequence of one-element tuples;
  -- type the field so REGEX_EXTRACT_ALL gets a chararray
  LOAD 'LogParserSamplePig/Data/airawat-syslog/*/*/*' AS (line:chararray);

parsed_log_DS =
  -- Parse each line into a structure with named fields
  FOREACH raw_log_DS
  GENERATE
    FLATTEN(
      REGEX_EXTRACT_ALL(
        line,
        '(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)'
      )
    )
    AS (
      month_name: chararray,
      day:        chararray,
      time:       chararray,
      host:       chararray,
      process:    chararray,
      log:        chararray
    );

report_draft_DS =
  -- Keep just the fields needed for the report
  FOREACH parsed_log_DS GENERATE month_name, process;

grouped_report_DS =
  -- Group by (month, process)
  GROUP report_draft_DS BY (month_name, process);

aggregate_report_DS =
  -- Compute the count per group
  FOREACH grouped_report_DS
  GENERATE group.month_name, group.process, COUNT(report_draft_DS) AS frequency;

sorted_DS =
  ORDER aggregate_report_DS BY $0, $1;

STORE sorted_DS INTO 'LogParserSamplePig/output/SortedResults';
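The GROUP / COUNT / ORDER pipeline above is equivalent to a grouped frequency count. A small Python sketch of the same logic, using hypothetical in-memory rows standing in for the relation parsed_log_DS:

```python
from collections import Counter

# Stand-in for report_draft_DS: (month_name, process) pairs
rows = [
    ('May', 'init:'), ('May', 'kernel:'), ('May', 'kernel:'),
    ('Apr', 'sudo:'), ('May', 'ntpd_initres[1705]:'),
]

# GROUP BY (month_name, process) plus COUNT, in one step
frequency = Counter(rows)

# ORDER BY $0, $1 -- month first, then process
for (month, process), count in sorted(frequency.items()):
    print(month, process, count)
```

Pig performs the same grouping in a distributed shuffle; the Counter illustrates only the per-group counting semantics, not the execution model.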
Execute the pig script on the cluster
--------------------------------------
$ pig SysLog-Pig-Report.pig
View output
-----------
$ hadoop fs -cat LogParserSamplePig/output/SortedResults/part*
Output
-------
Apr sudo: 1
May init: 23
May kernel: 58
May ntpd_initres[1705]: 792
May sudo: 1
May udevd[361]: 1