Anagha Khanolkar airawat

  • Microsoft
@airawat
airawat / 00-LogParser-PythonMR-UsingRegex
Last active Dec 19, 2015
Mapper and Reducer in python for log parsing using python regex
View 00-LogParser-PythonMR-UsingRegex
This gist includes a mapper and reducer in Python that parse log files using
regex. Use case: count the number of occurrences of processes that got logged, by month.
Includes:
---------
Sample data
Review of log data structure
Sample data and scripts for download
Mapper
Reducer
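
The mapper/reducer idea described above can be sketched roughly as follows. This is a minimal illustration, not the gist's actual scripts: the regex, the "<month>-<process>" key format, the function names and the sample lines are all assumptions, and the two steps run in one process instead of under Hadoop Streaming.

```python
import re
from collections import Counter

# Assumed syslog layout: "<Mon> <day> <HH:MM:SS> <host> <process>[<pid>]: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<process>[^\s\[:]+)(?:\[\d+\])?:\s"
)

def map_line(line):
    """Mapper step: return ("<month>-<process>", 1) for a matching line, else None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return ("%s-%s" % (match.group("month"), match.group("process")), 1)

def reduce_pairs(pairs):
    """Reducer step: sum the counts for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Hypothetical sample lines, not taken from the gist's data set.
sample = [
    "May  3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal",
    "May  3 11:53:31 cdh-dn03 kernel: registered taskstats version 1",
    "Apr 21 14:05:10 cdh-dn01 ntpd[1234]: synchronized to LOCAL(0)",
    "May  4 09:00:00 cdh-dn03 kernel: e1000: eth0 NIC Link is Up",
    "not a syslog line",
]
pairs = [pair for pair in map(map_line, sample) if pair is not None]
counts = reduce_pairs(pairs)
```

In a streaming job the mapper would print each key and count to stdout and the reducer would read sorted pairs from stdin; the counting logic stays the same.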
View 00-LogParser-PigLatin-UsingRegex
This gist includes a Pig Latin script to parse Syslog-generated log files using regex.
Use case: count the number of occurrences of processes that got logged, by month,
day and process.
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript
airawat / 00-LogParserPigLatinNativeMapReduce
Last active Dec 19, 2015
There might be situations where you have to reuse Java MapReduce programs within a Pig program. This blog includes a sample Pig script, with associated jars and sample data. The input is Syslog-generated log files, and the output is a count of occurrences of processes logged, inception to date.
View 00-LogParserPigLatinNativeMapReduce
This gist includes a Pig Latin script to parse Syslog-generated log files through a
Java MapReduce program that uses regex.
Use case: count the number of occurrences of processes that got logged, by month,
day and process.
Related gist that covers the java code - https://gist.github.com/airawat/5915374
Pig version: version 0.10.0
airawat / 00-OozieWorkflowJavaMainAction
Last active Dec 19, 2015
Oozie workflow application with a Java main action. The Java program parses log files and generates a report. Sample data, code, workflow components, and commands are provided.
View 00-OozieWorkflowJavaMainAction
This gist includes components of an Oozie workflow: scripts/code, sample data,
and commands. Oozie actions covered: java main action. Oozie controls
covered: start, kill, end. The Java program uses regex to parse the logs, and
also extracts part of the mapper input directory path and includes it in the key
emitted.
Usecase
-------
Parse Syslog generated log files to generate reports;
airawat / 00-CreatingMapFile
Last active Dec 22, 2015
Creating a map file in Hadoop. This gist covers reading a text file in HDFS and creating a map file from it.
View 00-CreatingMapFile
This gist demonstrates how to create a map file, from a text file.
Includes:
---------
1. Input data and script download
2. Input data-review
3. Data load commands
4. Java program to create the map file out of a text file in HDFS
5. Command to run Java program
6. Results of the program run to create map file
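
As background for the steps above: a Hadoop MapFile keeps key/value entries in key order next to a sparse index of keys, so a lookup can jump close to the target and scan from there. The sketch below illustrates that idea in plain Python, in memory only. The class name TinyMapFile, the index interval of 2 and the sample lines are assumptions made for illustration; it does not use the Hadoop API or its on-disk format.

```python
import bisect

class TinyMapFile:
    """In-memory stand-in for a map file: sorted entries plus a sparse index."""

    def __init__(self, index_interval=2):
        self.index_interval = index_interval
        self.data = []        # all (key, value) entries, in key order
        self.index_keys = []  # every index_interval-th key
        self.index_pos = []   # position in self.data of each indexed key

    def append(self, key, value):
        """Append an entry; keys must not arrive out of order."""
        if self.data and key < self.data[-1][0]:
            raise ValueError("key out of order: %r" % (key,))
        if len(self.data) % self.index_interval == 0:
            self.index_keys.append(key)
            self.index_pos.append(len(self.data))
        self.data.append((key, value))

    def get(self, key):
        """Find the nearest indexed key at or before `key`, then scan forward."""
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None
        for k, v in self.data[self.index_pos[slot]:]:
            if k == key:
                return v
            if k > key:
                break
        return None

# Hypothetical "key<TAB>value" input lines, sorted by key before writing.
text_lines = ["d003\tHuman Resources", "d001\tMarketing", "d002\tFinance"]
map_file = TinyMapFile()
for key, value in sorted(line.split("\t", 1) for line in text_lines):
    map_file.append(key, value)
```

Sorting the input first matters because the append step rejects keys that arrive out of order, which mirrors the requirement that map file keys be written in sorted order.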
airawat / 00-MapSideJoinDistCacheMapFile
Last active Dec 23, 2015
Map-side join example - Java code for joining two datasets - one large (tsv format), and one with lookup data (mapfile), made available through DistributedCache
View 00-MapSideJoinDistCacheMapFile
This gist demonstrates how to do a map-side join, joining a MapFile from DistributedCache
with a larger dataset in HDFS.
Includes:
---------
1. Input data and script download
2. Dataset structure review
3. Expected results
4. Mapper code
5. Driver code
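
The core of a map-side join is that the small lookup dataset is available to every mapper, so each record of the large dataset can be enriched as it is read, with no reduce step. A minimal Python sketch of that idea follows. It uses a plain dict in place of the MapFile and DistributedCache, and the function names, the "N/A" default and the employee/department data are all hypothetical.

```python
import csv
import io

def load_lookup(lines):
    """Build the in-memory lookup table from tab-separated "key<TAB>value" lines."""
    return dict(line.rstrip("\n").split("\t", 1) for line in lines)

def map_side_join(records, lookup, key_index, default="N/A"):
    """Yield each record with the looked-up value appended; no reduce step is needed."""
    for fields in records:
        yield fields + [lookup.get(fields[key_index], default)]

# Hypothetical data: employees (large side) joined to departments (small side).
departments = load_lookup(["d001\tMarketing\n", "d002\tFinance\n"])
employees_tsv = "10001\tGeorgi\td001\n10002\tBezalel\td002\n10003\tParto\td009\n"
records = csv.reader(io.StringIO(employees_tsv), delimiter="\t")
joined = list(map_side_join(records, departments, key_index=2))
```

Records whose join key is missing from the lookup get the default value here; whether to keep or drop such records is a design choice that depends on the join semantics wanted.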
View 00-CustomPigEvalUDF-NVL2
This gist covers a simple Pig eval UDF in Java, that mimics NVL2 functionality in Oracle.
Included:
1. Input data
2. UDF code in java
3. Pig script to demo the UDF
4. Expected result
5. Command to execute script
6. Output
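
For reference, Oracle's NVL2(expr1, expr2, expr3) returns expr2 when expr1 is not null and expr3 otherwise. The sketch below shows those semantics in Python, with None standing in for null. It is not the gist's Java UDF, and the sample rows are hypothetical.

```python
def nvl2(expr1, expr2, expr3):
    """Return expr2 when expr1 is not null (None), otherwise expr3."""
    return expr2 if expr1 is not None else expr3

# Hypothetical rows: (employee, commission); None marks a missing commission.
rows = [("Ann", 0.15), ("Raj", None), ("Li", 0.0)]
labels = [(name, nvl2(commission, "has commission", "no commission"))
          for name, commission in rows]
```

Note that a commission of 0.0 still counts as not null, so the check is written as `is not None` and not as a truthiness test.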
airawat / 00-RegexFilterInAccumuloC#ProxyClient
Last active Dec 30, 2015
Using regex filter in Accumulo Proxy C# client
View 00-RegexFilterInAccumuloC#ProxyClient
......
List<String> artifactList = new List<String>();
var scanOpts = new ScanOptions();
// Regex intended to match row IDs that begin with rowID
String rowRegex = rowID + ".*";
// Configure a RegExFilter iterator for the scan
IteratorSetting iterSttng = new IteratorSetting();
iterSttng.Priority = 15;
iterSttng.Name = "rowIDRegexFilter";
iterSttng.IteratorClass = "org.apache.accumulo.core.iterators.user.RegExFilter";
airawat / 00-LogParserCascading
Last active Jan 1, 2016
LogParserInCascading
View 00-LogParserCascading
About this gist:
================
This gist is a part of a series of log parsers in Java Mapreduce, Pig, Hive, Python...
This one covers a log parser in Cascading.
It reads syslogs in HDFS -
a) Parses them based on a regex pattern & writes parsed files to HDFS
b) Writes records that don't match the pattern to HDFS
c) Writes a report to HDFS that contains the count of distinct processes logged.
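
The three outputs listed above (parsed records, rejected records, and a per-process report) can be sketched in plain Python as follows. This is only an illustration of the data flow, not Cascading code: the regex, the function name parse_logs and the sample lines are assumptions, and results are returned in memory where the gist writes them to HDFS.

```python
import re
from collections import Counter

# Assumed pattern: month, day, time, host, process name, optional [pid], colon, message.
PATTERN = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+([\d:]{8})\s+(\S+)\s+([^\s\[:]+)(?:\[\d+\])?:\s+(.*)$"
)

def parse_logs(lines):
    """Return (parsed rows, rejected lines, per-process counts) for the given lines."""
    parsed, rejected = [], []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            parsed.append(match.groups())
        else:
            rejected.append(line)  # kept separately so bad input is not silently lost
    report = Counter(row[4] for row in parsed)  # row[4] is the process name
    return parsed, rejected, dict(report)

# Hypothetical sample lines.
lines = [
    "May  3 11:52:54 host1 init: tty (/dev/tty6) main process (1208) killed by TERM signal",
    "May  3 11:53:31 host1 kernel: registered taskstats version 1",
    "malformed entry",
    "May  4 09:00:00 host2 kernel: eth0 link up",
]
parsed, rejected, report = parse_logs(lines)
```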
Other gists/blogs:
airawat / cascading.accumulo.examples
Last active Jan 2, 2016
cascading.accumulo sample programs
View cascading.accumulo.examples
The sample programs for Cascading (2.5.1) with Accumulo (1.5.0) are on GitHub:
https://github.com/airawat/cascading.accumulo.examples
The source code for the extensions is at:
https://github.com/airawat/cascading.accumulo