Venkatesh-Prasad Ranganath rvprasad

@rvprasad
rvprasad / batchRead.r
Last active Aug 29, 2015
read.csv-style functions in R read the entire data file in one sweep. Hence, they struggle with files that cannot fit into the memory of the host machine. Here's an R function to read such large files in chunks as separate data frames. The only requirement is that there is one column in the read data such that all records/rows with identical v…
View batchRead.r
#' Read a file in chunks
#'
#' @param theFile connection providing the data, e.g., file('data/transactions.csv', 'r').
#' @param headers of the data being read.
#' @param leftOver rows that were read but not returned by the previous invocation of this function.
#' @param col on which the data is grouped.
#' @return a list of two elements: data provided by the current invocation and leftOver to be used during the next invocation.
getDataFrameForNextId <- function(theFile, headers, leftOver, col) {
  while (NROW(leftOver) == 0 || NROW(unique(leftOver[,col])) < 2) {
    tmp1 <- read.csv(theFile, nrows=100000)
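For comparison, the same idea can be sketched in Python. This is not the author's R function; it is a minimal illustration that assumes rows sharing a value in the grouping column sit next to each other in the file (my reading of the truncated description above). The name `iter_groups` and the sample data are hypothetical.

```python
import csv
import io
from itertools import groupby


def iter_groups(lines, col):
    """Yield (key, rows) for each run of consecutive rows sharing a value in `col`.

    Assumption: rows with the same key are contiguous in the input, so only
    one group has to be held in memory at a time.
    """
    reader = csv.DictReader(lines)
    for key, rows in groupby(reader, key=lambda row: row[col]):
        yield key, list(rows)


# Hypothetical sample standing in for a large CSV file.
sample = io.StringIO("id,amount\na,1\na,2\nb,3\n")
groups = list(iter_groups(sample, "id"))
```

With a real file, `lines` would be an open file handle, and each yielded group could be processed and discarded before the next one is read.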
rvprasad / recoverSchema.py
Last active Dec 20, 2015
While structured logs are easy to analyze, logs are most often unstructured (e.g. crond, SQL server). This code snippet demonstrates a simple language-based approach to recover the schema/structure of unstructured logs. The approach analyzes a set of logs (lines) and constructs schemas (regular expressions) that cover every log in the set. The a…
View recoverSchema.py
import re
def getVocabulary(wordFileName):
    ret = set()
    with open(wordFileName) as wordFile:
        for w in wordFile:
            ret.add(w.strip())
    return ret
import string
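One way to picture "schemas as regular expressions that cover every log" is the sketch below. It is not the gist's algorithm, whose description is truncated above; it assumes a simple rule in which tokens found in a vocabulary are kept as literals and every other token is generalized to `\S+`. The names `line_to_schema` and `recover_schemas` and the sample logs are hypothetical.

```python
import re


def line_to_schema(line, vocabulary):
    """Turn one log line into a regex: known words stay literal, the rest become \\S+."""
    parts = []
    for token in line.split():
        if token.lower() in vocabulary:
            parts.append(re.escape(token))
        else:
            parts.append(r"\S+")
    return r"\s+".join(parts)


def recover_schemas(lines, vocabulary):
    """Return the set of distinct schemas needed to cover every line."""
    return {line_to_schema(line, vocabulary) for line in lines}


# Hypothetical vocabulary and logs for illustration only.
vocab = {"session", "opened", "closed", "for", "user"}
logs = [
    "session opened for user alice",
    "session opened for user bob",
    "session closed for user alice",
]
schemas = recover_schemas(logs, vocab)
```

Here the two "opened" lines collapse into a single schema, so three logs are covered by two regular expressions.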
rvprasad / conftest.py
Last active May 16, 2016
In test cases, this adaptation of Pytest hook implementation reports uncaught AssertionError exceptions as failures and all other uncaught exceptions as errors.
View conftest.py
import pytest # added
from _pytest import runner, _code # added
def pytest_runtest_makereport(item, call):
    when = call.when
    duration = call.stop-call.start
    keywords = dict([(x,1) for x in item.keywords])
    excinfo = call.excinfo
    sections = []
    if not call.excinfo:
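The classification rule described for this hook can be shown without any Pytest machinery. The sketch below is a stand-alone illustration, not the hook itself; `classify_outcome` and the three sample test functions are hypothetical names introduced here.

```python
def classify_outcome(test_func):
    """Run a test callable and classify the result.

    An uncaught AssertionError counts as a failure; any other uncaught
    exception counts as an error.
    """
    try:
        test_func()
    except AssertionError:
        return "failed"
    except Exception:
        return "error"
    return "passed"


def passing_test():
    assert 1 + 1 == 2


def failing_test():
    assert 1 + 1 == 3


def erroring_test():
    raise KeyError("missing")


outcomes = [classify_outcome(t) for t in (passing_test, failing_test, erroring_test)]
```

The order of the `except` clauses matters: `AssertionError` is a subclass of `Exception`, so it has to be handled first.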
rvprasad / calculateBufferedReadWriteTimes.groovy
Last active Dec 17, 2016
A script to calculate buffered byte-sized read and write times at various buffer sizes.
View calculateBufferedReadWriteTimes.groovy
fileSize = 2 ** 16 * 1000
def getStatsWith(closure) {
    buffSizes = (8..15)
    iterations = (0..10)
    buffSize2runtimes = buffSizes.collectEntries { [(2 ** it):[]] }
    iterations.each {
        buffSize2runtimes.each { buffSize, runtimes ->
            runtimes << closure(buffSize) * 1000 / 1024 / 1024
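A comparable measurement can be sketched in Python. This is an approximation of the experiment, not a port of the Groovy script: it writes and then reads a temporary file one byte at a time using Python's built-in buffered file objects, for buffer sizes 2^8 through 2^15 as in the preview above. The function name `time_byte_io` and the 2^16-byte file size are assumptions made to keep the example quick.

```python
import os
import tempfile
import time


def time_byte_io(buff_size, num_bytes):
    """Time byte-at-a-time writes and reads through a buffer of `buff_size` bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        start = time.perf_counter()
        with open(path, "wb", buffering=buff_size) as out:
            for _ in range(num_bytes):
                out.write(b"x")
        write_secs = time.perf_counter() - start

        start = time.perf_counter()
        bytes_read = 0
        with open(path, "rb", buffering=buff_size) as src:
            while src.read(1):
                bytes_read += 1
        read_secs = time.perf_counter() - start
    finally:
        os.remove(path)
    return write_secs, read_secs, bytes_read


# Buffer sizes 2**8 .. 2**15; the file size here is a small assumed value.
results = {2 ** exp: time_byte_io(2 ** exp, 2 ** 16) for exp in range(8, 16)}
```

Timings vary by machine and run, so a real comparison would repeat each measurement several times, as the original script does with its `iterations` range.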
rvprasad / XDump.java
Last active Jan 3, 2017
Code to reproduce bug 9046671 in JDK 8.
View XDump.java
/*
* Run this with ASM 5.1 (http://forge.ow2.org/project/showfiles.php?group_id=23) to generate X.class.
* Loading the generated X.class will cause the following error with JDK 9-ea, JDK 1.8.0_112, and Zulu 1.8.0_112.
*
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.VerifyError: Stack map does not match the one at exception handler 13
Exception Details:
Location:
X.<init>(LX;)V @13: athrow
Reason:
rvprasad / collectTagsOccurringWithGivenTag.groovy
Last active Apr 21, 2017
Groovy script to extract tags that co-occur with a given tag on question-type posts in the Posts.xml file from the Stack Overflow data dump.
View collectTagsOccurringWithGivenTag.groovy
/*
* Copyright (c) 2017, Venkatesh-Prasad Ranganath
*
* BSD 3-clause License
*
* Author: Venkatesh-Prasad Ranganath
*/
import groovy.util.CliBuilder
import groovyx.gpars.actor.DynamicDispatchActor
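The extraction itself can be sketched in a few lines of Python. This is a single-threaded illustration, not the actor-based Groovy script. It assumes the data dump layout in which each post is a `row` element, questions carry `PostTypeId="1"`, and tags are stored in a `Tags` attribute such as `<java><android>`; the function name `cooccurring_tags` and the sample rows are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter


def cooccurring_tags(xml_source, given_tag):
    """Count tags that appear together with `given_tag` on question posts."""
    counts = Counter()
    # iterparse streams the file instead of loading the whole tree at once.
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            tags = re.findall(r"<([^>]+)>", elem.get("Tags", ""))
            if given_tag in tags:
                counts.update(t for t in tags if t != given_tag)
        elem.clear()  # drop attributes of elements that were already handled
    return counts


# Hypothetical miniature Posts.xml for illustration only.
sample = io.BytesIO(b"""<posts>
<row Id="1" PostTypeId="1" Tags="&lt;java&gt;&lt;android&gt;" />
<row Id="2" PostTypeId="2" />
<row Id="3" PostTypeId="1" Tags="&lt;java&gt;&lt;groovy&gt;" />
<row Id="4" PostTypeId="1" Tags="&lt;python&gt;" />
</posts>""")
counts = cooccurring_tags(sample, "java")
```

Row 2 is skipped because it is not a question, and row 4 is skipped because it does not carry the given tag.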
rvprasad / test_str_to_int.py
Last active Oct 12, 2017
A script to demonstrate how to use property-based testing to test a str_to_int function. Not all tests are required.
View test_str_to_int.py
# Python -- v3.6
# https://docs.pytest.org/en/latest/ -- v3.2.1
# http://hypothesis.readthedocs.io/en/latest/ -- v3.7
from hypothesis import assume, given
from hypothesis.strategies import text
import pytest
rep2int = {
'1':1,
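The core idea of a property-based test, checking a general property over many generated inputs rather than a few hand-picked ones, can be shown with the standard library alone. The sketch below deliberately swaps Hypothesis for a seeded `random.Random` loop, so it has no shrinking or strategy support. The `str_to_int` shown here is a hypothetical stand-in, not the function from the gist.

```python
import random


def str_to_int(text):
    """Hypothetical stand-in for the function under test: parse a non-negative decimal."""
    if not text or any(ch not in "0123456789" for ch in text):
        raise ValueError("not a decimal string: %r" % text)
    value = 0
    for ch in text:
        value = value * 10 + (ord(ch) - ord("0"))
    return value


def check_round_trip(trials=1000, seed=0):
    """Property: converting an int to a string and back yields the same int."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randrange(0, 10 ** 12)
        assert str_to_int(str(n)) == n
    return trials


checked = check_round_trip()
```

A second property worth checking in the same style is that malformed input such as `"12a"` raises `ValueError` instead of returning a number.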
rvprasad / test_process.py
Last active Oct 20, 2017
Illustrates how the performance of multiprocessing.Process changes inside and outside of multiprocessing.Pool in Python.
View test_process.py
# Python -- v3.6
import begin
import multiprocessing
import time
def worker(varying_data, fixed_data):
    t = 0
    for j in range(1, 10000):
rvprasad / test_process_graph.gp
Created Oct 20, 2017
Creates the graph from the data generated by test_process.py.
View test_process_graph.gp
set terminal png
set output "test_process.png"
set logscale
set xlabel "Size of fixed data [Number of ints]"
set ylabel "Performance [seconds per iteration]"
set title "Performance of options vs Size of fixed data"
plot "test_process.csv" using 1:2 title "builtin pool" with linespoints, \
"test_process.csv" using 1:3 title "custom pool" with linespoints
rvprasad / test_pool_map_graph.gp
Created Oct 21, 2017
Creates the graph from the data generated by test_pool_map.py.
View test_pool_map_graph.gp
set terminal png
set output "test_pool_map.png"
set logscale
set xlabel "Size of aux data [Number of ints]"
set ylabel "Performance [seconds per iteration]"
set title "Performance of options vs Size of aux data"
plot "test_pool_map.csv" using 1:2 title "without initializer / default chunksize" with linespoints, \
"test_pool_map.csv" using 1:3 title "with initializer / default chunksize" with linespoints, \
"test_pool_map.csv" using 1:4 title "without initializer / 250 chunksize" with linespoints, \
"test_pool_map.csv" using 1:5 title "with initializer / 250 chunksize" with linespoints, \