Jeroen Janssens jeroenjanssens
@jeroenjanssens
jeroenjanssens / README.md
Last active March 1, 2016 14:08
Stem-and-Leaf Plot

Back in the old days, when many data sets were still small, stem-and-leaf plots were a popular method of representing quantitative data. The example data shown in the text area comes from the cover of John Tukey's Exploratory Data Analysis. The stem-and-leaf plot updates as you change the data. Try adding fractions and negative values. Hover over the leaves to see the original values.
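To make the construction concrete, here is a minimal Python sketch of a text-only stem-and-leaf plot. It is not the code behind the interactive version: the function name `stem_and_leaf` is hypothetical, and it assumes values are rounded to integers, with the tens as stem and the units as leaf. Negative values get their own stems, including a separate `-0` stem for values between -9 and -1.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot (stem = tens, leaf = units)."""
    groups = defaultdict(list)
    # Round fractions to integers, then walk the values in ascending order.
    for v in sorted(round(x) for x in values):
        sign = -1 if v < 0 else 1
        stem, leaf = divmod(abs(v), 10)
        # Key on (sign, signed stem) so that "-0" and "0" stay separate stems.
        groups[(sign, sign * stem)].append(leaf)

    lines = []
    for sign, stem in sorted(groups):
        label = ("-" if sign < 0 else "") + str(abs(stem))
        leaves = " ".join(str(leaf) for leaf in groups[(sign, stem)])
        lines.append(f"{label:>3} | {leaves}")
    return lines
```

For instance, `"\n".join(stem_and_leaf([12, 15, 21, 3, -4, -13]))` gives one line per stem. Stems without any leaves are simply omitted in this sketch.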

jeroenjanssens / README.md
Last active December 29, 2015 03:39
Detecting anomalous senators

This interactive visualization demonstrates Stochastic Outlier Selection (SOS) applied to roll-call voting data. It was first presented at the NYC Machine Learning meetup on November 21, 2013. SOS is an unsupervised outlier-selection algorithm by J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik (2012). It employs the concept of affinity to quantify the relationship between data points and subsequently computes an outlier probability for each data point. Intuitively, a data point is selected as an outlier when the other data points have insufficient affinity with it.
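The affinity-to-probability step can be sketched in a few lines of Python. This is a simplified illustration, not the visualization's code: the published algorithm chooses a separate variance per data point from a perplexity parameter, whereas this sketch assumes one fixed `sigma` for all points. The function name `outlier_probabilities` is hypothetical.

```python
import math


def outlier_probabilities(points, sigma=1.0):
    """Simplified SOS: one fixed sigma (assumption) instead of per-point variances."""
    n = len(points)

    def sqdist(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # affinity[i][j]: affinity that point i has with point j (none with itself).
    affinity = [
        [0.0 if i == j else math.exp(-sqdist(points[i], points[j]) / (2 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]
    # Binding probabilities: each row of affinities normalized to sum to 1.
    binding = [[a / sum(row) for a in row] for row in affinity]
    # Point j is an outlier when no other point i binds to it.
    return [
        math.prod(1.0 - binding[i][j] for i in range(n) if i != j)
        for j in range(n)
    ]
```

With three points close together and a fourth far away, the fourth receives an outlier probability near 1, because the other points spend almost all of their affinity on each other.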

The data set contains 103 data points (senators) and 172 features (votes). The dissimilarity between the data points is the Euclidean distance. Each circle in the scatter plot represents a senator, whose location is determined by applying the non-linear dimensionality reduction technique [t-SNE](http://homepage.tudelf

jeroenjanssens / chat.sh
Last active February 15, 2022 21:44
Simple chat server in bash, demonstrating websocketd.
#!/bin/bash
# Hacked together by JeroenJanssens.com on 2013-12-10
# Requires: https://github.com/joewalnes/websocketd
# Run: websocketd --devconsole --port 8080 ./chat.sh
echo "Please enter your name:"; read USER
echo "[$(date)] ${USER} joined the chat" >> chat.log
echo "[$(date)] Welcome to the chat ${USER}!"
tail -n 0 -f chat.log --pid=$$ | grep --line-buffered -v "] ${USER}>" &
while read MSG; do echo "[$(date)] ${USER}> ${MSG}" >> chat.log; done
jeroenjanssens / topwords.R
Created April 25, 2014 02:17
Get top N words from STDIN using Bash, Python, and R. All three scripts produce the same output, but R scales very badly w.r.t. input size. What am I doing wrong?
#!/usr/bin/env Rscript
num.words <- as.integer(commandArgs(trailingOnly = TRUE))
f <- file("stdin")
input.lines <- readLines(f)
close(f)
full.text <- tolower(paste(input.lines, collapse = " "))
splits <- gregexpr("\\w+", full.text)
words.all <- (regmatches(full.text, splits)[[1]])
words.unique <- as.data.frame(table(words.all))
words.sorted <- words.unique[order(-words.unique$Freq),]
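For comparison, the same lowercase, tokenize, count, and sort steps can be sketched in Python. This is not the Python script from the gist (which is not shown here); the function name `top_words` is hypothetical, and the order of words with equal counts may differ from the R version.

```python
import re
from collections import Counter


def top_words(text, n):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Same tokenization as the R script: lowercase, then match \w+ runs.
    words = re.findall(r"\w+", text.lower())
    # Counter does the tallying; most_common sorts by descending count.
    return Counter(words).most_common(n)
```

To use it on STDIN, pass `sys.stdin.read()` as `text` and the requested number of words as `n`.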
jeroenjanssens / remove-header.py
Created June 6, 2014 00:29
Remove header without streaming entire file
#!/usr/bin/env python
# The trick is to overwrite the file with spaces till the first newline.
# Only works if the program that reads it ignores empty lines.
import sys
filename = sys.argv[1]
f = open(filename, "r+b")
n = 0
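# The file is opened in binary mode, so compare against bytes.
# Also stop at end of file, in case there is no newline at all.
while f.read(1) not in (b"\n", b""):
    n += 1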
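The listing above is cut off after the counting loop. A self-contained sketch of the whole trick, as described in the comments, might look as follows. The function name `blank_header` is hypothetical, and the seek-and-write step is an assumption based on the comment about overwriting with spaces.

```python
def blank_header(path):
    """Overwrite the first line of the file with spaces, in place.

    Returns the number of bytes that were blanked. The file keeps its size,
    so nothing after the header has to be read or rewritten.
    """
    with open(path, "r+b") as f:
        n = 0
        while True:
            byte = f.read(1)
            # Stop at the first newline, or at end of file if there is none.
            if byte in (b"\n", b""):
                break
            n += 1
        # Go back to the start and replace the header bytes with spaces.
        f.seek(0)
        f.write(b" " * n)
    return n
```

The header line is still there afterwards, but it contains only spaces, which is why the consumer has to ignore empty lines.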
jeroenjanssens / case-citibikes.sh
Last active August 29, 2015 14:07
Data Science at the Command Line Strata Tutorial
# make sure that you have the R package `ggmap` installed
curl -s http://api.citybik.es/citi-bike-nyc.json > citibikes.json
< citibikes.json jq -r '.[] | [.lat/1000000,.lng/1000000,.bikes] | @csv' | header -a lat,lng,bikes > citibikes.csv
< citibikes.csv Rio -vge 'require(ggmap); qmap("NYC", zoom=14) + geom_point(data=df, aes(x=lng, y=lat, size=bikes))' > citibikes.png
jeroenjanssens / cache.R
Last active September 25, 2020 14:33
Cache the result of an expression in R
#' Cache the result of an expression.
#'
#' Use \code{options(cache.path = "...")} to change the cache directory (which
#' is the current working directory by default).
#'
#' @param expr expression to evaluate
#' @param key basename for cache file
#' @param ignore_cache evaluate expression regardless of cache file?
#' @return result of expression or read from cache file
#'
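The R implementation is not shown in this listing. As an illustration of the documented behavior only, here is a rough Python analogue. The names `cached`, `compute`, and `cache_dir`, and the `.pkl` extension, are assumptions and not part of the R function.

```python
import os
import pickle


def cached(key, compute, ignore_cache=False, cache_dir="."):
    """Return compute(), reusing a result stored under cache_dir/key.pkl."""
    path = os.path.join(cache_dir, key + ".pkl")
    # Reuse the stored result unless the caller asks to ignore it.
    if not ignore_cache and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    # Otherwise evaluate, then store the result for the next call.
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

A call such as `cached("model", lambda: slow_fit(data))` (with `slow_fit` standing in for any expensive step) runs the computation once and reads the pickle file on later calls.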
jeroenjanssens / config.md
Last active June 8, 2018 08:39
Sensitive information in R scripts

If your R script uses sensitive information such as a password, then it's best to keep this in a separate file (and perhaps outside the project's repository). Moreover, if you're giving a live demo using RStudio, then you should avoid putting this sensitive information in your global environment.

You can put it in a YAML file, say .my_project.yaml, which may look as follows:

---
api_service:
  username: foo
  password: bar123!
jeroenjanssens / gist:549087b0fd6551064e57
Last active January 5, 2016 14:40
RStudio Stack trace
## Run RStudio Desktop
$ rstudio-bin
## Attach debugger
$ sudo gdb -p $(pgrep rsession)
(gdb) cont
Continuing.
## In RStudio, try to connect to Aster
> library(TeradataAsterR)
jeroenjanssens / jq.md
Last active June 8, 2018 08:37
Some jq examples translated from https://github.com/jsonlines/guide