Venkatesh-Prasad Ranganath rvprasad
@rvprasad
rvprasad / recoverSchema.py
Last active December 20, 2015 04:58
While structured logs are easy to analyze, logs are most often unstructured (e.g. crond, SQL server). This code snippet demonstrates a simple language-based approach to recover the schema/structure of unstructured logs. The approach analyzes a set of logs (lines) and constructs schemas (regular expressions) that cover every log in the set. The a…
import re
def getVocabulary(wordFileName):
    ret = set()
    with open(wordFileName) as wordFile:
        for w in wordFile:
            ret.add(w.strip())
    return ret
import string
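The idea in the description can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the gist's algorithm: it assumes that tokens found in a known vocabulary are fixed text and that every other token is a variable field. The helper name `recover_schemas`, the vocabulary, and the sample log lines are all hypothetical.

```python
import re


def recover_schemas(lines, vocabulary):
    """Map each log line to a regex; return {regex: number of lines covered}."""
    schemas = {}
    for line in lines:
        parts = []
        for token in line.split():
            if token.lower() in vocabulary:
                parts.append(re.escape(token))  # known word: keep it literally
            else:
                parts.append(r"\S+")            # unknown token: variable field
        schema = r"\s+".join(parts)
        schemas[schema] = schemas.get(schema, 0) + 1
    return schemas


# Hypothetical log lines and vocabulary, for illustration only.
logs = [
    "session opened for user alice",
    "session opened for user bob",
    "disk usage at 91%",
]
vocab = {"session", "opened", "for", "user", "disk", "usage", "at"}
schemas = recover_schemas(logs, vocab)
for schema, count in schemas.items():
    print(count, schema)
```

Every input line is matched in full by the schema derived from it, so the resulting set of regular expressions covers the whole log set by construction.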
rvprasad / batchRead.r
Last active August 29, 2015 14:05
read.csv-style functions in R read the entire data file in one sweep. Hence, they struggle with files that cannot fit into the memory of the host machine. Here's an R function to read such large files in chunks as separate data frames. The only requirement is that there is one column in the read data such that all records/rows with identical v…
#' Read a file in chunks
#'
#' @param theFile connection providing the data, e.g., file('data/transactions.csv', 'r').
#' @param headers of the data being read.
#' @param leftOver rows that were read but not returned by the previous invocation of this function.
#' @param col on which the data is grouped.
#' @return a list of two elements: data provided by the current invocation and leftOver to be used during the next invocation.
getDataFrameForNextId <- function(theFile, headers, leftOver, col) {
  while (NROW(leftOver) == 0 || NROW(unique(leftOver[,col])) < 2) {
    tmp1 <- read.csv(theFile, nrows=100000)
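For comparison, a similar group-at-a-time read can be sketched in Python. This is not a port of the R function: it swaps the fixed-size chunks and `leftOver` buffer for `itertools.groupby` over a streaming CSV reader. It also assumes, as a labeled assumption of this sketch, that rows sharing a value in the key column are adjacent in the input. The function name `read_groups` and the sample data are hypothetical.

```python
import csv
import io
from itertools import groupby


def read_groups(lines, key_column):
    """Yield (key, rows) pairs, materializing only one group at a time.

    Assumes rows with the same key_column value are adjacent in the input.
    """
    reader = csv.DictReader(lines)
    for key, rows in groupby(reader, key=lambda row: row[key_column]):
        yield key, list(rows)


# Hypothetical data standing in for a file too large to load at once.
sample = io.StringIO("id,amount\na,1\na,2\nb,3\nc,4\nc,5\n")
groups = list(read_groups(sample, "id"))
for key, rows in groups:
    print(key, len(rows))
```

With a real file, passing the open file object instead of the `StringIO` keeps memory use proportional to the largest group rather than to the file.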
rvprasad / conftest.py
Last active May 16, 2016 04:53
This adaptation of a pytest hook implementation reports uncaught AssertionError exceptions in test cases as failures and all other uncaught exceptions as errors.
import pytest  # added
from _pytest import runner, _code  # added
def pytest_runtest_makereport(item, call):
    when = call.when
    duration = call.stop-call.start
    keywords = dict([(x,1) for x in item.keywords])
    excinfo = call.excinfo
    sections = []
    if not call.excinfo:
rvprasad / calculateBufferedReadWriteTimes.groovy
Last active December 17, 2016 18:34
A script to calculate buffered byte-sized read and write times at various buffer sizes.
fileSize = 2 ** 16 * 1000
def getStatsWith(closure) {
  buffSizes = (8..15)
  iterations = (0..10)
  buffSize2runtimes = buffSizes.collectEntries { [(2 ** it):[]] }
  iterations.each {
    buffSize2runtimes.each { buffSize, runtimes ->
      runtimes << closure(buffSize) * 1000 / 1024 / 1024
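A comparable measurement can be sketched in Python. This is an illustrative sketch rather than a translation of the Groovy script: it times only byte-sized writes, uses a much smaller file to keep the run short, and the helper name `time_buffered_write` is hypothetical. The buffer sizes mirror the script's range of 2 ** 8 through 2 ** 15.

```python
import os
import tempfile
import time


def time_buffered_write(path, file_size, buff_size):
    """Write file_size bytes one byte at a time through a buffer of buff_size."""
    start = time.perf_counter()
    with open(path, "wb", buffering=buff_size) as f:
        for _ in range(file_size):
            f.write(b"x")
    return time.perf_counter() - start


file_size = 2 ** 16  # assumption: far smaller than the script's fileSize
timings = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    for exp in range(8, 16):  # buffer sizes 2**8 .. 2**15
        timings[2 ** exp] = time_buffered_write(path, file_size, 2 ** exp)

for buff_size, seconds in timings.items():
    print(buff_size, round(seconds, 4))
```

A single run per buffer size is noisy; repeating each measurement several times, as the Groovy script does with its `iterations` range, gives more stable numbers.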
rvprasad / XDump.java
Last active January 3, 2017 23:20
Code to reproduce bug 9046671 in JDK 8.
/*
* Run this with ASM 5.1 (http://forge.ow2.org/project/showfiles.php?group_id=23) to generate X.class.
* Loading the generated X.class will cause the following error with JDK 9-ea, JDK 1.8.0_112, and Zulu 1.8.0_112.
*
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.VerifyError: Stack map does not match the one at exception handler 13
Exception Details:
Location:
X.<init>(LX;)V @13: athrow
Reason:
rvprasad / collectTagsOccurringWithGivenTag.py
Last active May 24, 2018 01:49
Python script to extract tags that co-occur with a given tag on question-type posts in Posts.xml file from Stack Overflow data dump.
#
# Copyright (c) 2017, Venkatesh-Prasad Ranganath
#
# BSD 3-clause License
#
# Author: Venkatesh-Prasad Ranganath
#
import argparse
import datetime
import itertools
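The core of such an extraction might look like the sketch below. It is not the gist's code. It assumes each post is a `row` element, that `PostTypeId="1"` marks a question, and that the `Tags` attribute holds tags in the form `<tag1><tag2>` (some dumps may use a different delimiter). The function name `co_occurring_tags` and the sample rows are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter


def co_occurring_tags(source, given_tag):
    """Count tags that appear alongside given_tag on question posts."""
    counts = Counter()
    # iterparse streams the file instead of loading the whole tree up front.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            tags = re.findall(r"<([^>]+)>", elem.get("Tags", ""))
            if given_tag in tags:
                counts.update(t for t in tags if t != given_tag)
        elem.clear()  # drop the element's attributes to limit memory use
    return counts


# Hypothetical rows in the assumed Posts.xml layout.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Tags="&lt;python&gt;&lt;pandas&gt;" />'
    b'<row Id="2" PostTypeId="2" />'
    b'<row Id="3" PostTypeId="1" Tags="&lt;python&gt;&lt;numpy&gt;&lt;pandas&gt;" />'
    b'<row Id="4" PostTypeId="1" Tags="&lt;java&gt;" />'
    b'</posts>'
)
counts = co_occurring_tags(sample, "python")
print(counts.most_common())
```

For the real dump, a file path can be passed as `source` in place of the in-memory sample.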
rvprasad / collectTagsOccurringWithGivenTag.groovy
Last active April 21, 2017 05:57
Groovy script to extract tags that co-occur with a given tag on question-type posts in Posts.xml file from Stack Overflow data dump.
/*
* Copyright (c) 2017, Venkatesh-Prasad Ranganath
*
* BSD 3-clause License
*
* Author: Venkatesh-Prasad Ranganath
*/
import groovy.util.CliBuilder
import groovyx.gpars.actor.DynamicDispatchActor
rvprasad / test_str_to_int.py
Last active October 12, 2017 18:13
A script to demonstrate how to use property-based testing to test str_to_int function. Not all tests are required.
# Python -- v3.6
# https://docs.pytest.org/en/latest/ -- v3.2.1
# http://hypothesis.readthedocs.io/en/latest/ -- v3.7
from hypothesis import assume, given
from hypothesis.strategies import text
import pytest
rep2int = {
    '1':1,
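The flavor of property-based testing can be shown with the standard library alone. The sketch below deliberately swaps Hypothesis strategies for hand-rolled random generation so that it runs without extra packages, and its `str_to_int` is a hypothetical stand-in, not the function tested in the gist. Skipping generated inputs that fall outside a property's precondition plays the role that `assume` plays in Hypothesis.

```python
import random
import string


def str_to_int(text):
    """Hypothetical implementation: parse an optionally negative decimal string."""
    if not text:
        raise ValueError("empty string")
    sign, digits = (-1, text[1:]) if text[0] == "-" else (1, text)
    if not digits or any(ch not in string.digits for ch in digits):
        raise ValueError("not a decimal integer: %r" % text)
    value = 0
    for ch in digits:
        value = value * 10 + (ord(ch) - ord("0"))
    return sign * value


def check_round_trip(trials=1000, seed=0):
    """Property: str_to_int(str(n)) == n for randomly drawn integers n."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(-10 ** 12, 10 ** 12)
        assert str_to_int(str(n)) == n
    return trials


def check_rejects_letters(trials=1000, seed=1):
    """Property: any string containing a letter raises ValueError."""
    rng = random.Random(seed)
    alphabet = string.digits + string.ascii_letters
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(rng.randint(1, 8)))
        if not any(ch in string.ascii_letters for ch in text):
            continue  # outside the precondition: discard this input
        try:
            str_to_int(text)
        except ValueError:
            continue
        raise AssertionError("accepted %r" % text)
    return trials


check_round_trip()
check_rejects_letters()
```

Unlike Hypothesis, this sketch does not shrink failing inputs to minimal counterexamples; it only illustrates stating a property and checking it over many generated inputs.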
rvprasad / mapper-par.py
Last active November 3, 2017 23:57
Maps each alignment (in BAM) to reference gene sequence data (in GFF) and its description (in Description file).
# python2.7
#
# Before using the script, execute the following.
# PYTHONPATH=~/.pip easy_install --install-dir=~/.pip intervaltree
# PYTHONPATH=~/.pip easy_install --install-dir=~/.pip plac
#
# To run the script, use the following command
# PYTHONPATH=~/.pip python2.7 mapper-par.py <desc> <gff> <bam> <output>
# <# cores>
rvprasad / test_pool_map.py
Last active October 21, 2017 01:17
Illustrates performance degradation when large data chunks are used with the multiprocessing module in Python.
# Python -- v3.6
import begin
import multiprocessing
import time
def worker(varying_data, aux_data):
    t = 0
    for j in range(1, 10000):
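One likely contributor to such degradation is serialization: `multiprocessing` pickles task arguments to send them to worker processes. The sketch below does not reproduce the gist's benchmark. It only measures how many bytes would be pickled if every task tuple carried the auxiliary data and each task were pickled separately, which roughly corresponds to a chunk size of 1. The function name `pickled_task_bytes` and the sizes are hypothetical.

```python
import pickle


def pickled_task_bytes(n_tasks, aux_size):
    """Total bytes pickled when every task tuple carries the auxiliary data."""
    aux_data = list(range(aux_size))
    tasks = [(i, aux_data) for i in range(n_tasks)]
    # Each task is pickled on its own, so aux_data is serialized n_tasks times.
    return sum(len(pickle.dumps(task)) for task in tasks)


small = pickled_task_bytes(n_tasks=100, aux_size=10)
large = pickled_task_bytes(n_tasks=100, aux_size=100000)
print(small, large)
```

When the auxiliary data is the same for every task, sending it to each worker once (for example through a pool initializer) avoids repeating this cost per task.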