@yaravind
Forked from airawat/00-LogParser-Hive-Regex
Created May 18, 2018 02:41
Log parser in Hive using regex serde
This gist includes Hive QL scripts to create an external partitioned table over
Syslog-generated log files using the regex SerDe.
Use case: Count the number of occurrences of processes that got logged, by year,
month, day, and process.
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data download: 02-DataDownload
Data load commands: 03-DataLoadCommands
Hive commands: 04-HiveCommands
Sample output: 05-SampleOutput
Sample data
------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
Structure
----------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
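The "input.regex" used in the Hive commands further down maps each syslog line to these six columns. As a quick sanity check outside Hive, the same pattern can be tried in Python (Python's re syntax agrees with Java's for this expression; backslashes are single here because the pattern is not inside a Hive string literal):

```python
import re

# Same pattern the Hive RegexSerDe is configured with below.
PATTERN = re.compile(
    r"(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?:)\s+(.*)"
)

line = ("May 3 11:52:54 cdh-dn03 init: "
        "tty (/dev/tty6) main process (1208) killed by TERM signal")

m = PATTERN.match(line)
month, day, time, node, process, msg = m.groups()
print(month, day, time, node, process)
# The lazy (.*?:) group stops at the first colon, so "init:" becomes
# the process column and the rest of the line becomes the log message.
```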
Data download
-------------
https://groups.google.com/forum/?hl=en#!topic/hadooped/_tj8w_E-MGY
Directory structure
-------------------
LogParserSampleHive
  logs
    airawat-syslog
      2013
        04
          messages
        05
          messages
Data load commands
------------------
a) Load the data
$ hadoop fs -mkdir LogParserSampleHive
$ hadoop fs -mkdir LogParserSampleHive/logs
$ hadoop fs -put LogParserSampleHive/logs/* LogParserSampleHive/logs/
$ hadoop fs -ls -R LogParserSampleHive/ | awk '{print $8}'
LogParserSampleHive/logs
LogParserSampleHive/logs/airawat-syslog
LogParserSampleHive/logs/airawat-syslog/2013
LogParserSampleHive/logs/airawat-syslog/2013/04
LogParserSampleHive/logs/airawat-syslog/2013/04/messages
LogParserSampleHive/logs/airawat-syslog/2013/05
LogParserSampleHive/logs/airawat-syslog/2013/05/messages
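For reference, the local staging layout that the -put command uploads can be recreated with mkdir -p (the source path in the cp comment is a hypothetical placeholder for wherever you saved the downloaded files):

```shell
# Recreate the local directory layout expected by `hadoop fs -put` above.
mkdir -p LogParserSampleHive/logs/airawat-syslog/2013/04
mkdir -p LogParserSampleHive/logs/airawat-syslog/2013/05
# Copy each downloaded syslog file into place as "messages", e.g.:
#   cp /path/to/downloaded/april-messages LogParserSampleHive/logs/airawat-syslog/2013/04/messages
find LogParserSampleHive -type d | sort
```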
Hive commands
--------------
a) Create external table:
hive> CREATE EXTERNAL TABLE LogParserSample(
        month_name STRING,
        day STRING,
        time STRING,
        host STRING,
        event STRING,
        log STRING)
      PARTITIONED BY(year INT, month INT)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES (
        "input.regex" = "(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)"
      )
      STORED AS TEXTFILE;
b) Create partitions and load data:
Note: Replace '/user/airawat' with '/user/<your userID>'
hive> ALTER TABLE LogParserSample ADD IF NOT EXISTS PARTITION(year=2013, month=04)
      LOCATION '/user/airawat/LogParserSampleHive/logs/airawat-syslog/2013/04/';
hive> ALTER TABLE LogParserSample ADD IF NOT EXISTS PARTITION(year=2013, month=05)
      LOCATION '/user/airawat/LogParserSampleHive/logs/airawat-syslog/2013/05/';
Hive query
-----------
hive> set hive.cli.print.header=true;
hive> add jar hadoop-lib/hive-contrib-0.10.0-cdh4.2.0.jar; -- needed because the contrib SerDe is not on the classpath in my environment
hive> SELECT year, month, day, event, COUNT(*) AS occurrence
      FROM LogParserSample
      GROUP BY year, month, day, event
      ORDER BY event DESC, year, month, day;
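What the GROUP BY computes can be sketched outside Hive: parse each line with the same regex, key on (month, day, process), and count. A minimal sketch over the sample lines above (the year column comes from the partition directory path, so it is omitted here):

```python
import re
from collections import Counter

# Same pattern the RegexSerDe is configured with.
PATTERN = re.compile(
    r"(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?:)\s+(.*)"
)

# The sample syslog lines from the top of this gist.
sample = """\
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
"""

# Count occurrences per (month, day, process), like the GROUP BY.
counts = Counter()
for line in sample.splitlines():
    m = PATTERN.match(line)
    if m:
        month, day, _time, _node, process, _msg = m.groups()
        counts[(month, day, process)] += 1

for (month, day, process), n in sorted(counts.items()):
    print(month, day, process, n)
# → May 3 init: 1
#   May 3 kernel: 5
#   May 3 ntpd_initres[1705]: 1
```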
Query output
------------
year  month  day  event                 occurrence
2013  05     7    udevd[361]:           1
2013  04     23   sudo:                 1
2013  05     3    sudo:                 1
2013  05     3    ntpd_initres[1705]:   144
2013  05     4    ntpd_initres[1705]:   261
2013  05     5    ntpd_initres[1705]:   264
2013  05     6    ntpd_initres[1705]:   123
2013  05     3    kernel:               5
2013  05     6    kernel:               1
2013  05     7    kernel:               52
2013  05     3    init:                 5
2013  05     7    init:                 18