@airawat
Last active December 19, 2015 06:59
This gist includes a Pig Latin script that parses Syslog-generated log files using a regex;
Use case: Count the number of occurrences of each logged process, grouped by month
and process.
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript
Pig script execution command: 05-PigLatinScriptExecution
Output: 06-Output
Sample data
------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
Structure
----------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
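The field breakdown above can be sanity-checked outside the cluster. Below is a minimal Python sketch of the same pattern the Pig script passes to REGEX_EXTRACT_ALL (Pig uses Java-flavored regex with doubled backslashes in the string literal; a Python raw string with single backslashes behaves identically for this pattern):

```python
import re

# Same six capture groups as the Pig script:
# month, day, time, host, process, log message
PATTERN = re.compile(
    r'(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?\:)\s+(.*$)'
)

line = ('May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) '
        'main process (1208) killed by TERM signal')

month, day, time, host, process, log = PATTERN.match(line).groups()
print(month, day, time, host, process, sep=' | ')
# May | 3 | 11:52:54 | cdh-dn03 | init:
```

Note the lazy `(.*?\:)` group: it stops at the first colon after the host, so the process field keeps its trailing colon, matching the counts shown in the Output section.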
Data and Script download
------------------------
https://groups.google.com/forum/?hl=en#!topic/hadooped/Wix8ZznQGJU
Directory structure
-------------------
LogParserSamplePig
   Data
      airawat-syslog
         2013
            04
               messages
            05
               messages
   SysLog-Pig-Report.pig
HDFS load commands
-------------------
$ hadoop fs -put LogParserSamplePig/
Validate load
-------------
$ hadoop fs -ls -R LogParserSamplePig | awk '{print $8}'
Expected directory structure
-----------------------------
LogParserSamplePig/Data
LogParserSamplePig/Data/airawat-syslog
LogParserSamplePig/Data/airawat-syslog/2013
LogParserSamplePig/Data/airawat-syslog/2013/04
LogParserSamplePig/Data/airawat-syslog/2013/04/messages
LogParserSamplePig/Data/airawat-syslog/2013/05
LogParserSamplePig/Data/airawat-syslog/2013/05/messages
LogParserSamplePig/SysLog-Pig-Report.pig
Pig Latin script - SysLog-Pig-Report.pig
----------------------------------------
rmf LogParserSamplePig/output

raw_log_DS =
  -- Load the logs as a sequence of one-element tuples;
  -- type the field so REGEX_EXTRACT_ALL gets a chararray
  LOAD 'LogParserSamplePig/Data/airawat-syslog/*/*/*' AS (line:chararray);

parsed_log_DS =
  -- Parse each line into a structure with named fields
  FOREACH raw_log_DS
  GENERATE
    FLATTEN(
      REGEX_EXTRACT_ALL(
        line,
        '(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)'
      )
    )
    AS (
      month_name: chararray,
      day:        chararray,
      time:       chararray,
      host:       chararray,
      process:    chararray,
      log:        chararray
    );

report_draft_DS =
  -- Keep just the fields needed for the report
  FOREACH parsed_log_DS GENERATE month_name, process;

grouped_report_DS =
  -- Group by (month, process)
  GROUP report_draft_DS BY (month_name, process);

aggregate_report_DS =
  -- Compute the count per group
  FOREACH grouped_report_DS
  GENERATE group.month_name, group.process, COUNT(report_draft_DS) AS frequency;

sorted_DS =
  ORDER aggregate_report_DS BY $0, $1;

STORE sorted_DS INTO 'LogParserSamplePig/output/SortedResults';
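The GROUP / COUNT / ORDER pipeline above is equivalent to a grouped frequency count. A small Python sketch of the same logic, using hypothetical in-memory rows standing in for the relation parsed_log_DS:

```python
from collections import Counter

# Stand-in for report_draft_DS: (month_name, process) pairs
rows = [
    ('May', 'init:'), ('May', 'kernel:'), ('May', 'kernel:'),
    ('Apr', 'sudo:'), ('May', 'ntpd_initres[1705]:'),
]

# GROUP BY (month_name, process) plus COUNT, in one step
frequency = Counter(rows)

# ORDER BY $0, $1 -- month first, then process
for (month, process), count in sorted(frequency.items()):
    print(month, process, count)
```

Pig performs the same grouping in a distributed shuffle; the Counter illustrates only the per-group counting semantics, not the execution model.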
Execute the pig script on the cluster
--------------------------------------
$ pig SysLog-Pig-Report.pig
View output
-----------
$ hadoop fs -cat LogParserSamplePig/output/SortedResults/part*
Output
-------
Apr sudo: 1
May init: 23
May kernel: 58
May ntpd_initres[1705]: 792
May sudo: 1
May udevd[361]: 1