@airawat
airawat / 00-LogParser-PythonMR-UsingRegex
Last active December 19, 2015 06:59
Mapper and Reducer in python for log parsing using python regex
This gist includes a mapper and reducer in Python that parse log files using
regex; Use case: Count the number of occurrences of processes that got logged, by month.
Includes:
---------
Sample data
Review of log data structure
Sample data and scripts for download
Mapper
Reducer
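As a rough illustration of the mapper and reducer listed above, the sketch below counts log lines per month and process. The syslog layout, the regex, the sample lines, and the names `map_line` and `reduce_counts` are assumptions made for illustration; they are not taken from the gist's actual scripts.

```python
import re
from collections import Counter

# Assumed syslog layout: "May  3 11:52:54 host process[pid]: message"
LOG_PATTERN = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\S+\s+"
    r"(?P<process>[^\s\[:]+)(?:\[\d+\])?:"
)

def map_line(line):
    """Return a ("month-process", 1) pair, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return ("%s-%s" % (match.group("month"), match.group("process")), 1)

def reduce_counts(pairs):
    """Sum the counts for each key, as a reducer would."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Made-up sample lines, including one that does not match the pattern.
sample = [
    "May  3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed",
    "May  3 11:53:31 cdh-dn03 kernel: registered taskstats version 1",
    "May  4 09:10:02 cdh-dn03 sudo[2311]: session opened",
    "not a syslog line",
]
pairs = [p for p in map(map_line, sample) if p is not None]
counts = reduce_counts(pairs)
```

Under Hadoop Streaming, the same two functions would be wrapped in scripts that read lines from standard input and print tab-separated key/value pairs.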
This gist includes a Pig Latin script to parse Syslog-generated log files using regex;
Use case: Count the number of occurrences of processes that got logged, by month,
day and process.
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript
@airawat
airawat / 00-LogParserPigLatinNativeMapReduce
Last active December 19, 2015 07:49
There might be situations where you have to reuse Java MapReduce programs within a Pig program. This blog includes a sample Pig script, with associated jars and sample data. The input is Syslog-generated log files, and the output is a count of occurrences of processes logged, inception to date.
This gist includes a Pig Latin script to parse Syslog-generated log files through a
Java MapReduce program that uses regex;
Use case: Count the number of occurrences of processes that got logged, by month,
day and process.
Related gist that covers the java code - https://gist.github.com/airawat/5915374
Pig version: version 0.10.0
@airawat
airawat / 00-OozieWorkflowJavaMainAction
Last active December 19, 2015 18:59
Oozie workflow application with a Java main action. The Java program parses log files and generates a report. Sample data, code, workflow components, and commands are provided.
This gist includes components of an Oozie workflow - scripts/code, sample data
and commands; Oozie actions covered: java main action; Oozie controls
covered: start, kill, end; The Java program uses regex to parse the logs, and
also extracts part of the mapper input directory path and includes it in the key
emitted.
Use case
--------
Parse Syslog-generated log files to generate reports;
@airawat
airawat / 00-CreatingMapFile
Last active December 22, 2015 22:18
Creating a MapFile in Hadoop. This gist covers reading a text file in HDFS and creating a MapFile from it.
This gist demonstrates how to create a MapFile from a text file.
Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Java program to create the map file out of a text file in HDFS
5. Command to run Java program
6. Results of the program run to create map file
@airawat
airawat / 00-MapSideJoinDistCacheMapFile
Last active December 23, 2015 06:59
Map-side join example - Java code for joining two datasets - one large (tsv format), and one with lookup data (mapfile), made available through DistributedCache
This gist demonstrates how to do a map-side join, joining a MapFile from DistributedCache
with a larger dataset in HDFS.
Includes:
---------
1. Input data and script download
2. Dataset structure review
3. Expected results
4. Mapper code
5. Driver code
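The idea behind the map-side join can be sketched in a few lines: each mapper holds the small lookup dataset in memory and appends the looked-up value to every record of the large dataset as it streams past, so no reduce phase is needed. This is only a conceptual sketch. The dictionary stands in for the MapFile read from DistributedCache, and the column layout, sample records, and the name `map_side_join` are hypothetical.

```python
# Hypothetical lookup data standing in for the MapFile from DistributedCache.
DEPARTMENTS = {"d001": "Marketing", "d002": "Finance"}

def map_side_join(tsv_lines, lookup):
    """Yield each large-dataset record with the looked-up value appended."""
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        dept_no = fields[-1]  # assumption: the join key is the last column
        # Records with no match in the lookup get a placeholder value.
        yield fields + [lookup.get(dept_no, "UNKNOWN")]

# Made-up records in tab-separated format.
employees = ["10001\tGeorgi\td001\n", "10002\tBezalel\td009\n"]
joined = list(map_side_join(employees, DEPARTMENTS))
```

Keeping the lookup side small enough to fit in each mapper's memory is what makes this approach practical.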
@airawat
airawat / 00-CustomPigEvalUDF-NVL2
Last active December 27, 2015 03:39
Custom Pig UDF NVL2
This gist covers a simple Pig eval UDF in Java that mimics the NVL2 function in Oracle.
Included:
1. Input data
2. UDF code in java
3. Pig script to demo the UDF
4. Expected result
5. Command to execute script
6. Output
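For readers unfamiliar with NVL2: Oracle's NVL2(expr1, expr2, expr3) returns expr2 when expr1 is not null, and expr3 otherwise. A minimal sketch of that semantics follows. It is written in Python rather than as a Pig UDF, treats only `None` as null, and uses hypothetical parameter names.

```python
def nvl2(value, if_not_null, if_null):
    """Return if_not_null when value is not None, otherwise if_null."""
    return if_not_null if value is not None else if_null

# Made-up examples: a present value and a missing one.
with_value = nvl2("555-0100", "has phone", "no phone")
without_value = nvl2(None, "has phone", "no phone")
```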
@airawat
airawat / 00-RegexFilterInAccumuloC#ProxyClient
Last active December 30, 2015 16:09
Using regex filter in Accumulo Proxy C# client
......
List<String> artifactList = new List<String> ();
var scanOpts = new ScanOptions();
String rowRegex = rowID + ".*";
IteratorSetting iterSttng = new IteratorSetting();
iterSttng.Priority = 15;
iterSttng.Name = "rowIDRegexFilter";
iterSttng.IteratorClass="org.apache.accumulo.core.iterators.user.RegExFilter";
@airawat
airawat / 00-LogParserCascading
Last active January 1, 2016 13:29
LogParserInCascading
About this gist:
================
This gist is part of a series of log parsers in Java MapReduce, Pig, Hive, Python...
This one covers a log parser in Cascading.
It reads syslogs in HDFS -
a) Parses them based on a regex pattern & writes parsed files to HDFS
b) Writes records that don't match the pattern to HDFS
c) Writes a report to HDFS that contains the count of distinct processes logged.
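The three outputs described above (parsed records, rejected records, and a per-process count) can be sketched outside Cascading as a single pass over the input. This is an illustrative Python sketch, not the gist's Cascading flow. The regex, the assumed syslog layout, the sample lines, and the name `split_and_count` are all assumptions.

```python
import re
from collections import Counter

# Assumed syslog layout: "May  3 11:52:54 host process[pid]: message"
PATTERN = re.compile(
    r"^\w{3}\s+\d{1,2}\s+[\d:]{8}\s+\S+\s+([^\s\[:]+)(?:\[\d+\])?:\s*(.*)$"
)

def split_and_count(lines):
    """Split lines into parsed and rejected records and count lines per process."""
    parsed, rejected, counts = [], [], Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            rejected.append(line)  # b) records that do not match the pattern
            continue
        process, message = match.groups()
        parsed.append((process, message))  # a) parsed records
        counts[process] += 1  # c) input for the report
    return parsed, rejected, dict(counts)

# Made-up sample lines.
parsed, rejected, report = split_and_count([
    "May  3 11:52:54 host1 init: started",
    "garbage",
    "May  3 11:53:00 host1 init: stopped",
])
```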
Other gists/blogs:
@airawat
airawat / cascading.accumulo.examples
Last active January 2, 2016 08:09
cascading.accumulo sample programs
The sample programs for Cascading (2.5.1) with Accumulo (1.5.0) are on GitHub -
https://github.com/airawat/cascading.accumulo.examples
The source code for the extensions is at -
https://github.com/airawat/cascading.accumulo