Jeroen Janssens jeroenjanssens

jeroenjanssens / topwords.R
Created April 25, 2014 02:17
Get top N words from STDIN using Bash, Python, and R. All three scripts produce the same output, but R scales very badly with respect to input size. What am I doing wrong?
#!/usr/bin/env Rscript
num.words <- as.integer(commandArgs(trailingOnly = TRUE))
f <- file("stdin")
input.lines <- readLines(f)
full.text <- tolower(paste(input.lines, collapse = " "))
splits <- gregexpr("\\w+", full.text)
words.all <- (regmatches(full.text, splits)[[1]])
words.unique <- as.data.frame(table(words.all))
words.sorted <- words.unique[order(-words.unique$Freq),]
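For comparison, here is a minimal Python sketch of the same top-N word count. It is not the Python script the description mentions (that script is not shown here). The function name `top_words` is hypothetical, and the sketch assumes the same lowercasing and `\w+` tokenization as the R code above.

```python
import re
from collections import Counter


def top_words(text, num_words):
    """Return the num_words most frequent lowercase words as (word, count) pairs."""
    # Same tokenization as the R version: lowercase, then match runs of word characters.
    words = re.findall(r"\w+", text.lower())
    # Counter tallies in a single pass; most_common sorts by descending count.
    return Counter(words).most_common(num_words)
```

To read from STDIN as the R script does, one could call `top_words(sys.stdin.read(), int(sys.argv[1]))` after importing `sys`.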
jeroenjanssens /
Created June 6, 2014 00:29
Remove header without streaming entire file
#!/usr/bin/env python
# The trick is to overwrite the file with spaces till the first newline.
# Only works if the program that reads it ignores empty lines.
import sys
filename = sys.argv[1]
f = open(filename, "r+b")
n = 0
# Count the bytes that precede the first newline.
while f.read(1) != b"\n":
    n += 1
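The script above is cut off before the overwrite step. A minimal sketch of the whole trick might look as follows; the function name `blank_out_header` and the end-of-file guard are assumptions, not part of the original gist.

```python
def blank_out_header(filename):
    """Overwrite the first line of a file with spaces, in place.

    Returns the number of bytes that were blanked. The newline itself is
    kept, so the reader sees a whitespace-only first line.
    """
    with open(filename, "r+b") as f:
        n = 0
        while True:
            byte = f.read(1)
            # Stop at the first newline, or at end of file if there is none.
            if byte in (b"", b"\n"):
                break
            n += 1
        # Seek back to the start before switching from reading to writing.
        f.seek(0)
        f.write(b" " * n)
    return n
```

Only the header bytes are rewritten, so the cost depends on the header length rather than the file size.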
jeroenjanssens /
Last active August 29, 2015 14:07
Data Science at the Command Line Strata Tutorial
# make sure that you have the R package `ggmap` installed
curl -s > citibikes.json
< citibikes.json jq -r '.[] | [.lat/1000000,.lng/1000000,.bikes] | @csv' | header -a lat,lng,bikes > citibikes.csv
< citibikes.csv Rio -vge 'require(ggmap); qmap("NYC", zoom=14) + geom_point(data=df, aes(x=lng, y=lat, size=bikes))' > citibikes.png
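The `jq` and `header` steps above turn a JSON array of stations into a CSV file with `lat`, `lng`, and `bikes` columns. A rough Python equivalent of that one transformation is sketched below. The function name `stations_to_csv` is hypothetical, and the sketch assumes each station object carries integer `lat` and `lng` fields scaled by 1,000,000, as the `jq` filter implies.

```python
import csv
import io
import json


def stations_to_csv(json_text):
    """Convert a JSON array of station objects to CSV text with a header row."""
    stations = json.loads(json_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Equivalent of `header -a lat,lng,bikes`.
    writer.writerow(["lat", "lng", "bikes"])
    for station in stations:
        # Equivalent of `[.lat/1000000, .lng/1000000, .bikes] | @csv`.
        writer.writerow([
            station["lat"] / 1000000,
            station["lng"] / 1000000,
            station["bikes"],
        ])
    return out.getvalue()
```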
jeroenjanssens /
Last active December 29, 2015 03:39
Detecting anomalous senators

This interactive visualization demonstrates the Stochastic Outlier Selection (SOS) algorithm applied to roll-call voting data. It was first presented at the NYC Machine Learning meetup on November 21, 2013. SOS is an unsupervised outlier-selection algorithm by J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik (2012). It employs the concept of affinity to quantify the relationship between data points and subsequently computes an outlier probability for each data point. Intuitively, a data point is selected as an outlier when the other data points have insufficient affinity with it.
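The affinity idea can be sketched in a few lines of Python. This is a simplified illustration, not the code behind the visualization: it uses one fixed `variance` for every point, whereas SOS proper tunes a variance per point from a perplexity parameter. The function name `sos_outlier_probabilities` is hypothetical.

```python
import math


def sos_outlier_probabilities(points, variance=1.0):
    """Simplified SOS: one outlier probability per point (fixed variance)."""
    n = len(points)
    # Affinity of point i with point j decays with squared Euclidean distance.
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                affinity[i][j] = math.exp(-d2 / (2.0 * variance))
    # Binding probabilities: each row of affinities normalized to sum to 1.
    binding = []
    for row in affinity:
        total = sum(row) or 1.0  # guard against a row that underflowed to zero
        binding.append([a / total for a in row])
    # A point is an outlier when no other point binds to it:
    # the product over all other points i of (1 - binding[i][j]).
    return [
        math.prod(1.0 - binding[i][j] for i in range(n) if i != j)
        for j in range(n)
    ]
```

With four points on a unit square and a fifth far away, the distant point should receive the highest outlier probability, because none of the others has meaningful affinity with it.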

The data set contains 103 data points (senators) and 172 features (votes). The dissimilarity between the data points is the Euclidean distance. Each circle in the scatter plot represents a senator, of which the location is determined by applying the non-linear dimensionality reduction technique [t-SNE](http://homepage.tudelf

jeroenjanssens / gist:549087b0fd6551064e57
Last active January 5, 2016 14:40
RStudio Stack trace
## Run RStudio Desktop
$ rstudio-bin
## Attach debugger
$ sudo gdb -p $(pgrep rsession)
(gdb) cont
## In RStudio, try to connect to Aster
> library(TeradataAsterR)
jeroenjanssens /
Last active March 1, 2016 14:08
Stem-and-Leaf Plot

Back in the old days, when many data sets were still small, stem-and-leaf plots were a popular method of representing quantitative data. The example data shown in the text area comes from the cover of John Tukey's Exploratory Data Analysis. The stem-and-leaf plot updates as you change the data. Try adding fractions and negative values. Hover over the leaves to see the original values.
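To show how such a plot is built, here is a minimal Python sketch. It is not the code behind the interactive version: it assumes a non-empty list of non-negative integers, so it does not handle the fractions or negative values mentioned above. The function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot for non-negative integers."""
    leaves = defaultdict(list)
    for value in sorted(values):
        # The stem is everything but the last digit; the leaf is the last digit.
        leaves[value // 10].append(value % 10)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves[stem])
        lines.append(f"{stem} | {row}".rstrip())
    return lines
```

Printing the result with `print("\n".join(stem_and_leaf(data)))` gives one row per stem, with the leaves in ascending order.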

jeroenjanssens / keep_top_n.R
Created April 12, 2017 10:12
R function to keep rows belonging to top n groups of a certain column
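keep_top_n <- function(df, col, n = 10) {
  # Count rows per value of `col`, keep the n most frequent values,
  # then keep only the rows of df that match one of them.
  semi_join(df, head(count_(df, col, sort = TRUE), n))
}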
# All car models
mpg %>% nrow()
# Just car models of top three manufacturers
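The same idea can be sketched in plain Python for rows stored as dictionaries. This is an illustrative analogue under that assumption, not a translation of the dplyr call, and ties between equally frequent groups may be broken differently.

```python
from collections import Counter


def keep_top_n(rows, col, n=10):
    """Keep the rows whose value in `col` is among the n most frequent values."""
    counts = Counter(row[col] for row in rows)
    top_values = {value for value, _ in counts.most_common(n)}
    # Filtering preserves the original row order, like a semi-join.
    return [row for row in rows if row[col] in top_values]
```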
html_more_nodes <- function(session, css, more_css) {
html_nodes(session, css),
html_more_nodes(follow_link(session, css = more_css),
css, more_css)
}, error = function(e) NULL)
directory <- "bunch/of/excel/files"
# Get an overview of all the Excel files and their sheets
sheets <-
data_frame(file = list.files(directory, full.names = TRUE),
jeroenjanssens /
Last active June 8, 2018 08:37
Some jq examples translated from