Venkatesh-Prasad Ranganath rvprasad

rvprasad / batchRead.r
Last active August 29, 2015 14:05
read.csv-style functions in R read the entire data file in one pass, so they struggle with files that cannot fit into the memory of the host machine. Here's an R function that reads such large files in chunks as separate data frames. The only requirement is that there is one column in the read data such that all records/rows with identical v…
#' Read a file in chunks
#' @param theFile connection providing the data, e.g., file('data/transactions.csv', 'r').
#' @param headers of the data being read.
#' @param leftOver rows that were read but not returned by the previous invocation of this function.
#' @param col on which the data is grouped.
#' @return a list of two elements: data provided by the current invocation and leftOver to be used during the next invocation.
getDataFrameForNextId <- function(theFile, headers, leftOver, col) {
  while (NROW(leftOver) == 0 || NROW(unique(leftOver[,col])) < 2) {
    tmp1 <- read.csv(theFile, nrows=100000)
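The chunked, group-at-a-time reading described above can be sketched in Python. This is a minimal illustration, not a port of the R function: the name `read_groups`, its parameters, and the use of `csv.DictReader` are assumptions made for this example. As in the description, it assumes that rows sharing a value in the grouping column are adjacent in the file.

```python
import csv
import io
from itertools import groupby, islice


def read_groups(lines, col, chunk_rows=100000):
    """Yield (key, rows) for each run of consecutive rows sharing a value in `col`.

    At most `chunk_rows` new rows are pulled per step. The last group of a
    chunk may continue in the next chunk, so it is carried over as left-over
    rows instead of being yielded right away.
    """
    reader = csv.DictReader(lines)
    left_over = []
    while True:
        chunk = list(islice(reader, chunk_rows))
        rows = left_over + chunk
        if not rows:
            return
        groups = [(k, list(g)) for k, g in groupby(rows, key=lambda r: r[col])]
        if chunk:
            # More data may follow: hold back the last (possibly partial) group.
            left_over = groups.pop()[1]
        else:
            # Input is exhausted: the held-back group is complete.
            left_over = []
        for key, group in groups:
            yield key, group
```

Passing an open file object (or any iterable of lines) as `lines` yields one complete group at a time, so memory use is bounded by the chunk size plus the largest group.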
rvprasad /
Last active December 20, 2015 04:58
While structured logs are easy to analyze, logs are most often unstructured (e.g. crond, SQL server). This code snippet demonstrates a simple language-based approach to recover the schema/structure of unstructured logs. The approach analyzes a set of logs (lines) and constructs schemas (regular expressions) that cover every log in the set. The a…
import re
def getVocabulary(wordFileName):
    ret = set()
    with open(wordFileName) as wordFile:
        for w in wordFile:
    return ret
import string
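The preview above is truncated, so the gist's actual algorithm is not visible here. As a rough sketch of the general idea in the description (schemas as regular expressions that cover every log line), the following keeps vocabulary words literal and replaces every other token with a wildcard. The names `line_to_schema` and `recover_schemas` and the tokenization by whitespace are assumptions for this illustration only.

```python
import re


def line_to_schema(line, vocabulary):
    """Build a regex for one log line.

    Tokens found in `vocabulary` are kept as literals; all other tokens
    (user names, IP addresses, numbers, ...) become the wildcard \\S+.
    """
    parts = []
    for token in line.split():
        if token.lower() in vocabulary:
            parts.append(re.escape(token))
        else:
            parts.append(r"\S+")
    return r"\s+".join(parts)


def recover_schemas(lines, vocabulary):
    """Return the distinct schemas, in first-seen order, for the given lines.

    Each line (without leading or trailing whitespace) fully matches the
    schema derived from it, so the returned set covers every line.
    """
    schemas = []
    for line in lines:
        schema = line_to_schema(line, vocabulary)
        if schema not in schemas:
            schemas.append(schema)
    return schemas
```

With a vocabulary containing `user`, `logged`, `in`, and `from`, the lines `user alice logged in from 10.0.0.1` and `user bob logged in from 10.0.0.2` collapse into a single schema.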
rvprasad /
Last active May 16, 2016 04:53
This adaptation of pytest's pytest_runtest_makereport hook implementation reports uncaught AssertionError exceptions in test cases as failures and all other uncaught exceptions as errors.
import pytest # added
from _pytest import runner, _code # added
def pytest_runtest_makereport(item, call):
    when = call.when
    duration = call.stop-call.start
    keywords = dict([(x,1) for x in item.keywords])
    excinfo = call.excinfo
    sections = []
    if not call.excinfo:
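The failure-versus-error distinction stated in the description can be illustrated outside pytest. The sketch below is a standalone function, not the hook itself; the name `classify_outcome` and the outcome strings are assumptions for this example.

```python
def classify_outcome(test_func):
    """Run a zero-argument test function and classify the result.

    An uncaught AssertionError counts as a failure (the test checked
    something and the check did not hold); any other uncaught exception
    counts as an error (the test could not run to completion).
    """
    try:
        test_func()
    except AssertionError:
        return "failed"
    except Exception:
        return "error"
    return "passed"
```

The `except AssertionError` clause must come first, since `AssertionError` is itself a subclass of `Exception`.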
rvprasad / calculateBufferedReadWriteTimes.groovy
Last active December 17, 2016 18:34
A script to calculate buffered byte-sized read and write times at various buffer sizes.
fileSize = 2 ** 16 * 1000
def getStatsWith(closure) {
  buffSizes = (8..15)
  iterations = (0..10)
  buffSize2runtimes = buffSizes.collectEntries { [(2 ** it):[]] }
  iterations.each {
    buffSize2runtimes.each { buffSize, runtimes ->
      runtimes << closure(buffSize) * 1000 / 1024 / 1024
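A comparable measurement can be sketched in Python: read a file one byte at a time through buffers of 2^8 to 2^15 bytes (the same range as `buffSizes` above) and record the elapsed time per run. The function names, the default of three iterations, and the use of `time.perf_counter` are assumptions for this example; the unit scaling applied in the Groovy script is not reproduced.

```python
import os
import tempfile
import time


def time_read(path, buff_size):
    """Return the seconds taken to read the file byte by byte.

    `buffering=buff_size` asks Python for a buffered reader with that
    buffer size, so most one-byte reads are served from memory.
    """
    start = time.perf_counter()
    with open(path, "rb", buffering=buff_size) as f:
        while f.read(1):
            pass
    return time.perf_counter() - start


def get_stats(path, exponents=range(8, 16), iterations=3):
    """Map each buffer size (2 ** exponent) to a list of measured read times."""
    return {
        2 ** e: [time_read(path, 2 ** e) for _ in range(iterations)]
        for e in exponents
    }
```

Calling `get_stats` on a sufficiently large file returns a dictionary keyed by buffer size, from which a mean or median per size can be computed. Timings on small files are dominated by noise.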
rvprasad /
Last active January 3, 2017 23:20
Code to reproduce bug 9046671 in JDK 8.
* Run this with ASM 5.1 to generate X.class.
* Loading the generated X.class will cause the following error with JDK 9-ea, JDK 1.8.0_112, and Zulu 1.8.0_112.
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.VerifyError: Stack map does not match the one at exception handler 13
Exception Details:
X.<init>(LX;)V @13: athrow
rvprasad / collectTagsOccurringWithGivenTag.groovy
Last active April 21, 2017 05:57
Groovy script to extract tags that co-occur with a given tag on question-type posts in Posts.xml file from Stack Overflow data dump.
* Copyright (c) 2017, Venkatesh-Prasad Ranganath
* BSD 3-clause License
* Author: Venkatesh-Prasad Ranganath
import groovy.util.CliBuilder
rvprasad /
Last active October 12, 2017 18:13
A script to demonstrate how to use property-based testing to test the str_to_int function. Not all tests are required.
# Python -- v3.6
# -- v3.2.1
# -- v3.7
from hypothesis import assume, given
from hypothesis.strategies import text
import pytest
rep2int = {
rvprasad /
Last active October 20, 2017 22:34
Illustrates how the performance of multiprocessing.Process changes inside and outside of multiprocessing.Pool in Python.
# Python -- v3.6
import begin
import multiprocessing
import time
def worker(varying_data, fixed_data):
    t = 0
    for j in range(1, 10000):
rvprasad /
Created October 20, 2017 22:32
Creates the graph from the data generated by
set terminal png
set output "test_process.png"
set logscale
set xlabel "Size of fixed data [Number of ints]"
set ylabel "Performance [seconds per iteration]"
set title "Performance of options vs Size of fixed data"
plot "test_process.csv" using 1:2 title "builtin pool" with linespoints, \
"test_process.csv" using 1:3 title "custom pool" with linespoints
rvprasad /
Created October 21, 2017 01:16
Creates the graph from the data generated by
set terminal png
set output "test_pool_map.png"
set logscale
set xlabel "Size of aux data [Number of ints]"
set ylabel "Performance [seconds per iteration]"
set title "Performance of options vs Size of aux data"
plot "test_pool_map.csv" using 1:2 title "without initializer / default chunksize" with linespoints, \
"test_pool_map.csv" using 1:3 title "with initializer / default chunksize" with linespoints, \
"test_pool_map.csv" using 1:4 title "without initializer / 250 chunksize" with linespoints, \
"test_pool_map.csv" using 1:5 title "with initializer / 250 chunksize" with linespoints, \