@debasishg
debasishg / gist:8172796
Last active May 10, 2024 13:37
A collection of links for streaming algorithms and data structures

General Background and Overview

  1. Probabilistic Data Structures for Web Analytics and Data Mining : A great overview of probabilistic data structures and how they are used to implement approximation algorithms.
  2. Models and Issues in Data Stream Systems
  3. Philippe Flajolet’s contribution to streaming algorithms : A presentation by Jérémie Lumbroso that revisits some of the historical perspectives and how it all began with Flajolet.
  4. Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
  5. [Methods for Finding Frequent Items in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&rep=rep1&t
@multidis
multidis / split_strat_scale.r
Created December 23, 2013 08:17
Stratified sampling: training / test data split preserving class distribution (caret functions) and scaling (standardize) the data. Stratified folds for CV.
library(caret)
## select training indices preserving class distribution
in.train <- createDataPartition(yclass, p=0.8, list=FALSE)
summary(factor(yclass))
ytra <- yclass[in.train]; summary(factor(ytra))
ytst <- yclass[-in.train]; summary(factor(ytst))
## standardize features: fit the scaling parameters on the training part so they can be reused for the test part
Xtra <- scale(X[in.train,])
@jasonrudolph
jasonrudolph / git-branches-by-commit-date.sh
Created February 12, 2012 20:40
List remote Git branches and the last commit date for each branch. Sort by most recent commit date.
# Credit http://stackoverflow.com/a/2514279
for branch in $(git branch -r | grep -v HEAD); do
  # first line of `git show` is the formatted commit date (ISO and relative)
  echo -e "$(git show --format="%ci %cr" "$branch" | head -n 1)\t$branch"
done | sort -r
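
The same listing can probably be produced without a per-branch loop by letting `git for-each-ref` do the sorting. The sketch below is one possible variant, not the gist author's method; the function name `branches_by_commit_date` and its optional repository-path argument are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list remote-tracking branches, most recent commit first.
# branches_by_commit_date is a hypothetical helper name, not part of the gist.
branches_by_commit_date() {
  # Optional first argument: path to the repository (defaults to the current directory).
  local repo="${1:-.}"
  # %09 is a tab; --sort=-committerdate puts the newest commit first.
  git -C "$repo" for-each-ref \
      --sort=-committerdate \
      --format='%(committerdate:iso8601) %(committerdate:relative)%09%(refname)' \
      refs/remotes \
    | grep -v '/HEAD$' \
    | sed 's#refs/remotes/##'
}
```

Calling `branches_by_commit_date` inside a clone (or `branches_by_commit_date /path/to/repo`) prints one line per remote branch: the commit date, the relative date, a tab, and the branch name. Sorting on the committer date inside Git avoids relying on the textual order of the date strings.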