Anagha Khanolkar airawat

  • Microsoft
airawat / 00-LogParser-PythonMR-UsingRegex
Last active Dec 19, 2015
Mapper and Reducer in python for log parsing using python regex
View 00-LogParser-PythonMR-UsingRegex
This gist includes a mapper and reducer in Python that parse log files using
regex; Use case: Count the number of occurrences of processes that got logged, by month.
Sample data
Review of log data structure
Sample data and scripts for download
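The mapper/reducer pair described above can be sketched roughly as follows. This is a minimal illustration, not the gist's actual code: the syslog layout ("Mon DD HH:MM:SS host process[pid]: message"), the regex, the "month-process" key format, and the function names map_lines and reduce_lines are all assumptions made for the example.

```python
import re
from itertools import groupby

# Assumed syslog layout: "Mon DD HH:MM:SS host process[pid]: message".
# Groups: 1=month, 2=day, 3=time, 4=host, 5=process name.
LOG_PATTERN = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+([^\s\[:]+)(?:\[\d+\])?:"
)


def map_lines(lines):
    """Emit 'month-process<TAB>1' for every line that matches the pattern."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            month, process = match.group(1), match.group(5)
            yield f"{month}-{process}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{key}\t{total}"
```

Under Hadoop Streaming, each function would sit in its own script reading sys.stdin and printing to stdout, and the framework's shuffle would supply the sorting that reduce_lines relies on. Locally, sorted(map_lines(lines)) can stand in for the shuffle.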
View 00-LogParser-PigLatin-UsingRegex
This gist includes a Pig Latin script to parse Syslog-generated log files using regex;
Use case: Count the number of occurrences of processes that got logged, by month,
day and process.
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript
airawat / 00-LogParserPigLatinNativeMapReduce
Last active Dec 19, 2015
There might be situations where you have to reuse Java MapReduce programs within a Pig program. This blog includes a sample Pig script, with associated jars and sample data. The input is Syslog-generated log files, and the output is a count of occurrences of processes logged, inception to date.
View 00-LogParserPigLatinNativeMapReduce
This gist includes a Pig Latin script to parse Syslog-generated log files through a
Java MapReduce program that uses regex;
Use case: Count the number of occurrences of processes that got logged, by month,
day and process.
Related gist that covers the java code -
Pig version: 0.10.0
airawat / 00-OozieWorkflowJavaMainAction
Last active Dec 19, 2015
Oozie workflow application with a Java main action. The Java program parses log files and generates a report. Sample data, code, workflow components, and commands are provided.
View 00-OozieWorkflowJavaMainAction
This gist includes components of an Oozie workflow - scripts/code, sample data
and commands; Oozie actions covered: Java main action; Oozie controls
covered: start, kill, end; The Java program uses regex to parse the logs, and
also extracts part of the mapper input directory path and includes it in the key
Parse Syslog-generated log files to generate reports;
airawat / 00-CreatingMapFile
Last active Dec 22, 2015
Creating a Map file in Hadoop. This gist covers reading a text file in HDFS, and creating a map file
View 00-CreatingMapFile
This gist demonstrates how to create a map file, from a text file.
1. Input data and script download
2. Input data-review
3. Data load commands
4. Java program to create the map file out of a text file in HDFS
5. Command to run Java program
6. Results of the program run to create map file
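As a rough illustration of why a map file is useful: a Hadoop MapFile stores key-value pairs sorted by key alongside an index, so a single key can be looked up without scanning the whole file. The sketch below only imitates that idea in memory with a sorted list and binary search. It is not the gist's Java program, and the tab-separated input layout and the names build_map_file and lookup are assumptions.

```python
import bisect


def build_map_file(lines):
    """Turn 'key<TAB>value' text lines into key-sorted parallel lists."""
    pairs = sorted(line.rstrip("\n").split("\t", 1) for line in lines)
    keys = [key for key, _ in pairs]
    values = [value for _, value in pairs]
    return keys, values


def lookup(map_file, key):
    """Binary-search the sorted keys; return the value, or None if absent."""
    keys, values = map_file
    position = bisect.bisect_left(keys, key)
    if position < len(keys) and keys[position] == key:
        return values[position]
    return None
```

The sort step matters: a real MapFile writer expects keys to be appended in sorted order, which is why the text input is ordered before it is written.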
airawat / 00-MapSideJoinDistCacheMapFile
Last active Dec 23, 2015
Map-side join example - Java code for joining two datasets - one large (tsv format), and one with lookup data (mapfile), made available through DistributedCache
View 00-MapSideJoinDistCacheMapFile
This gist demonstrates how to do a map-side join, joining a MapFile from DistributedCache
with a larger dataset in HDFS.
1. Input data and script download
2. Dataset structure review
3. Expected results
4. Mapper code
5. Driver code
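The idea behind a map-side join can be sketched as below: the small lookup dataset is made available to every mapper, and each record of the large dataset is enriched as it streams through, so no reduce phase is needed. This is a simplified illustration in which a plain dict stands in for the MapFile served through DistributedCache. The field layout, the "N/A" default, and the function names are assumptions, not the gist's Java code.

```python
def load_lookup(lines):
    """Build an in-memory lookup table from 'key<TAB>value' lines."""
    table = {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        table[key] = value
    return table


def map_side_join(records, table, key_index, missing="N/A"):
    """Append the looked-up value to each tab-separated record.

    key_index is the position of the join key within the record;
    records without a match receive the `missing` placeholder.
    """
    for line in records:
        fields = line.rstrip("\n").split("\t")
        fields.append(table.get(fields[key_index], missing))
        yield "\t".join(fields)
```

In a Hadoop job, the lookup would typically be opened once per mapper during setup, and the per-record work would happen in the map call.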
View 00-CustomPigEvalUDF-NVL2
This gist covers a simple Pig eval UDF in Java, that mimics NVL2 functionality in Oracle.
1. Input data
2. UDF code in java
3. Pig script to demo the UDF
4. Expected result
5. Command to execute script
6. Output
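For readers unfamiliar with NVL2: Oracle's NVL2(expr1, expr2, expr3) returns expr2 when expr1 is not null and expr3 when it is null. A minimal sketch of that behaviour follows. It is written in Python rather than as a Pig UDF in Java, and it treats only None as null, which is an assumption of this example.

```python
def nvl2(expr, value_if_not_null, value_if_null):
    """Return value_if_not_null when expr is not None, else value_if_null."""
    return value_if_not_null if expr is not None else value_if_null
```

Note that in this sketch values such as 0 or an empty string count as not null, because the check is against None rather than truthiness.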
airawat / 00-RegexFilterInAccumuloC#ProxyClient
Last active Dec 30, 2015
Using regex filter in Accumulo Proxy C# client
View 00-RegexFilterInAccumuloC#ProxyClient
List<String> artifactList = new List<String> ();
var scanOpts = new ScanOptions();
String rowRegex = rowID + ".*";
IteratorSetting iterSttng = new IteratorSetting();
iterSttng.Priority = 15;
iterSttng.Name = "rowIDRegexFilter";
airawat / 00-LogParserCascading
Last active Jan 1, 2016
View 00-LogParserCascading
About this gist:
This gist is a part of a series of log parsers in Java Mapreduce, Pig, Hive, Python...
This one covers a log parser in Cascading.
It reads syslogs in HDFS -
a) Parses them based on a regex pattern & writes parsed files to HDFS
b) Writes records that don't match the pattern to HDFS
c) Writes a report to HDFS that contains the count of distinct processes logged.
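The three outputs listed above can be sketched in plain Python as follows. This only illustrates the data flow and is not Cascading code: the syslog layout, the regex, the name parse_logs, and the reading of "count of distinct processes" as a per-process count are all assumptions.

```python
import re
from collections import Counter

# Assumed syslog layout: "Mon DD HH:MM:SS host process[pid]: message".
SYSLOG_PATTERN = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+(\S+)\s+"
    r"([^\s\[:]+)(?:\[\d+\])?:\s*(.*)$"
)


def parse_logs(lines):
    """Split lines into parsed tuples and rejects, and count per process.

    Returns (parsed, rejected, counts):
      parsed   - (month, day, time, host, process, message) tuples
      rejected - raw lines that did not match the pattern
      counts   - Counter mapping process name to number of lines
    """
    parsed, rejected = [], []
    counts = Counter()
    for line in lines:
        match = SYSLOG_PATTERN.match(line.rstrip("\n"))
        if match is None:
            rejected.append(line)
            continue
        parsed.append(match.groups())
        counts[match.group(5)] += 1
    return parsed, rejected, counts
```

In the Cascading version, each of the three results would be written to its own HDFS location instead of being returned.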
Other gists/blogs:
airawat / cascading.accumulo.examples
Last active Jan 2, 2016
cascading.accumulo sample programs
View cascading.accumulo.examples
The sample programs for Cascading (2.5.1) with Accumulo (1.5.0) are on GitHub -
The source code for the extensions is at -