An idiosyncratic guide to teaching yourself practical machine learning, without links:

  • Find a binary classification dataset; maybe you have one internally.
  • Implement a simple decision tree algorithm, like CART.
  • Write some code to validate your model; produce an ROC curve and understand the tradeoff it embodies.
  • Compare the ROC for your training set with the ROC for a holdout and understand what it means that they differ.
  • Experiment with some hyperparameters: how does the comparison above change as you adjust the depth of the tree or other stopping criteria?
  • Combine your decision tree algorithm with bagging to produce a random forest. How does its ROC compare?
  • Do the same hyperparameter tuning here. (How many trees?) Reflect on overfitting and on the bias/variance tradeoff.
View gist:b2df11671f1a8e5099d7
def takeBy(pipe: TypedPipe[(K,V)], max: Int)(fn: V => Double): TypedPipe[(K,V)] = {
implicit val qtreeSemi = QTreeSemigroup(4) //magic number, determines how much RAM the trees take
val qtrees ={case (k,v) => k -> QTree(fn(v))}.sumByKey
val maxV = qtrees.flatMap{case (k,q) =>
if(q.size > max) {
val targetQuantile = max.toDouble / q.size
val (lower, upper) = q.quantileBounds(targetQuantile)
Some(k -> upper) //this will give us at least max values; use lower to get at most max values
View gist:f158a2e4abe977dc15f3
case class SetSizeAggregator(hllBits: Int, maxSetSize: Int = 10)
extends MonoidAggregator[Array[Byte], Either[HLL, Set[Array[Byte]]], Long] {
def prepare(in: Array[Byte]) = Right(Set(in))
def present(sum: Either[HLL, Set[Array[Byte]]]) = {
sum match {
case Right(set) => set.size
case Left(hll) => hll.approximateSize.estimate
View gist:b43d3db8933556001285
import com.twitter.algebird._
case class Preparer[A, T](prepareFn: A => T) {
def map[U](fn: T => U) =
Preparer[A, U](fn.compose(prepareFn))
def flatMap[U](fn: T => TraversableOnce[U]) =
FlatPreparer[A, U](fn.compose(prepareFn))
def aggregate[B, C](aggregator: Aggregator[T, B, C]): Aggregator[A, B, C] =
View bayesab.js
function probabilityBBeatsA(aa, ba, ab, bb) {
var probability = 0.0;
for(var i = 0; i < ab; i++) {
var product = Math.log(1 + (aa + ba)/(i + bb));
var j = 1;
var start = i+1;
[bb, ba, aa, i].forEach(function(steps){
var stop = start + steps;
View gist:11376572
MapReduce = Hadoop
BigTable = HBase
Protocol Buffers = Thrift or Avro (serialization)
Stubby = Thrift or Avro (RPC)
ColumnIO = Parquet
Dremel = Impala
Chubby = Zookeeper
Omega = Mesos
Borg = Aurora

Keybase proof

I hereby claim:

  • I am avibryant on github.
  • I am avi ( on keybase.
  • I have a public key whose fingerprint is 9DFC 19D0 8558 8DE8 6BF3 5C19 3879 4907 3690 98AF

To claim this, I am signing this object:


Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.

This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn, using NDV(mycolumn) got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn) took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.

It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.

HyperLogLog hashes each value it sees, and then assigns them to a bucket based on the low order bits of the hash. It's common to use 1024 buckets, so we can get the bucket by using a bitwise & with 1023:

View index.html
<link rel="stylesheet" href="" />
<script src=""></script>
<script src=""></script>
<link rel="stylesheet" href="/resources/demos/style.css" />
$(function() {
$( "#datepicker" ).datepicker(
View gist:6575113
scala> val x = null.asInstanceOf[Int]
x: Int = 0
scala> Option(x)
res1: Option[Int] = Some(0)
scala> Option(null.asInstanceOf[Int])
res2: Option[Int] = None