@airawat
Last active December 19, 2015 07:49
There may be situations where you need to reuse Java MapReduce programs within a Pig program. This blog includes a sample Pig script, with the associated jar and sample data. The input is Syslog-generated log files, and the output is a count of occurrences of each logged process, from inception to date.
This gist includes a Pig Latin script that parses Syslog-generated log files through a
Java MapReduce program that uses regex.
Use case: count the number of occurrences of processes that got logged, by month,
day and process.
Related gist that covers the java code - https://gist.github.com/airawat/5915374
Pig version: 0.10.0
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript
Pig script execution command: 05-PigLatinScriptExecution
Output: 06-Output
Sample data
------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
Structure
----------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
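
The field breakdown above can be reproduced locally with a few lines of shell. This is only a sketch for checking the structure by hand: it splits on whitespace, whereas the Java MapReduce program in the related gist uses a regex, so the two may disagree on unusual lines. The function name parse_syslog_line is made up for this example and is not part of the gist.

```shell
# parse_syslog_line is a hypothetical helper name, not part of the gist.
# It prints the six fields of one syslog line, labelled as in the
# "Structure" listing.
parse_syslog_line() {
    printf '%s\n' "$1" | {
        # read splits on runs of whitespace; the last variable (msg)
        # receives the rest of the line.
        read -r month day time node process msg
        printf 'Month = %s\nDay = %s\nTime = %s\nNode = %s\nProcess = %s\nLog msg = %s\n' \
            "$month" "$day" "$time" "$node" "$process" "$msg"
    }
}

parse_syslog_line 'May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal'
```

Run against the first sample line, this prints the same six values shown in the listing above.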
Data download
-------------
https://groups.google.com/forum/?hl=en#!topic/hadooped/DMQVIwBUQOo
Directory structure
-------------------
LogParserSamplePigMR
    Data
        airawat-syslog
            2013
                04
                    messages
                05
                    messages
    lib
        LogEventCount.jar
    SysLog-PigMR-Report.pig
Commands to load to HDFS [03-HdfsLoadCommands]
----------------------------------------------
$ hadoop fs -put LogParserSamplePigMR
$ hadoop fs -ls -R LogParserSamplePigMR | awk '{print $8}'
LogParserSamplePigMR/Data
LogParserSamplePigMR/Data/airawat-syslog
LogParserSamplePigMR/Data/airawat-syslog/2013
LogParserSamplePigMR/Data/airawat-syslog/2013/04
LogParserSamplePigMR/Data/airawat-syslog/2013/04/messages
LogParserSamplePigMR/Data/airawat-syslog/2013/05
LogParserSamplePigMR/Data/airawat-syslog/2013/05/messages
LogParserSamplePigMR/SysLog-PigMR-Report.pig
LogParserSamplePigMR/lib
LogParserSamplePigMR/lib/LogEventCount.jar
ParserSamplePigMR/reportDir/_logs/history/job_201306261042_0054_1372873417824_akhanolk_PigLatin%3ASysLog-PigMR-Report.pig
LogParserSamplePigMR/reportDir/part-m-00000
/*----------------------------------------*/
/*PigLatinScript - SysLog-PigMR-Report.pig*/
/*----------------------------------------*/
rmf LogParserSamplePigMR/outputDir
rmf LogParserSamplePigMR/inputDir
rmf LogParserSamplePigMR/reportDir
raw_log_DS =
LOAD 'LogParserSamplePigMR/Data/airawat-syslog/*/*/*' as line;
report_DS =
    MAPREDUCE 'lib/LogEventCount.jar'
        STORE raw_log_DS INTO 'LogParserSamplePigMR/inputDir'
        LOAD 'LogParserSamplePigMR/outputDir' AS (process:chararray, count:int)
        `Airawat.Oozie.Samples.LogEventCount LogParserSamplePigMR/inputDir LogParserSamplePigMR/outputDir`;
store report_DS INTO 'LogParserSamplePigMR/reportDir';
Command to run the pig script
------------------------------
Run these after the data, script and jar are loaded to HDFS, as covered in section 03-HdfsLoadCommands.
$ cd LogParserSamplePigMR
$ pig SysLog-PigMR-Report.pig
Command to view output
-----------------------
$ hadoop fs -cat LogParserSamplePigMR/reportDir/part*
Output
-------
init: 23
kernel: 58
ntpd_initres[1705]: 792
sudo: 2
udevd[361]: 1
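
One way to sanity-check counts like these without Hadoop is to tally the process field locally. The sketch below assumes the process name is always the fifth whitespace-separated field and that the report is tab-separated; the Java job uses a regex, so totals could differ on malformed lines. The function name count_processes is made up for this example.

```shell
# count_processes is a hypothetical helper name, not part of the gist.
# It reads syslog lines from the given files (or stdin when no files are
# given) and prints "process<TAB>count", sorted by process name.
count_processes() {
    awk '{ count[$5]++ }
         END { for (p in count) printf "%s\t%d\n", p, count[p] }' "$@" | LC_ALL=C sort
}

# Demo on the seven sample lines from the "Sample data" section.
count_processes <<'EOF'
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
EOF
```

For the seven sample lines this prints init: 1, kernel: 5 and ntpd_initres[1705]: 1. To check the full data set, run it from the local LogParserSamplePigMR directory as: count_processes Data/airawat-syslog/*/*/messages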