Skip to content

Instantly share code, notes, and snippets.

View avibryant's full-sized avatar

Avi Bryant avibryant

  • Galiano Island, BC
View GitHub Profile
case class SetSizeAggregator(hllBits: Int, maxSetSize: Int = 10)
extends MonoidAggregator[Array[Byte], Either[HLL, Set[Array[Byte]]], Long] {
def prepare(in: Array[Byte]) = Right(Set(in))
def present(sum: Either[HLL, Set[Array[Byte]]]) = {
sum match {
case Right(set) => set.size
case Left(hll) => hll.approximateSize.estimate
}
}
import com.twitter.algebird._
case class Preparer[A, T](prepareFn: A => T) {
def map[U](fn: T => U) =
Preparer[A, U](fn.compose(prepareFn))
def flatMap[U](fn: T => TraversableOnce[U]) =
FlatPreparer[A, U](fn.compose(prepareFn))
def aggregate[B, C](aggregator: Aggregator[T, B, C]): Aggregator[A, B, C] =
@avibryant
avibryant / bayesab.js
Last active March 22, 2021 18:36
An implementation of http://www.evanmiller.org/bayesian-ab-testing.html with no depenedencies
function probabilityBBeatsA(aa, ba, ab, bb) {
var probability = 0.0;
for(var i = 0; i < ab; i++) {
var product = Math.log(1 + (aa + ba)/(i + bb));
var j = 1;
var start = i+1;
[bb, ba, aa, i].forEach(function(steps){
var stop = start + steps;
@avibryant
avibryant / gist:11376572
Last active March 13, 2016 01:51
Google <=> Open Source Rosetta Stone
GFS = HDFS
MapReduce = Hadoop
BigTable = HBase
Protocol Buffers = Thrift or Avro (serialization)
Stubby = Thrift or Avro (RPC)
ColumnIO = Parquet
Dremel = Impala
Chubby = Zookeeper
Omega = Mesos
Borg = Aurora
@avibryant
avibryant / keybase.md
Created March 17, 2014 20:55
keybase.md

Keybase proof

I hereby claim:

  • I am avibryant on github.
  • I am avi (https://keybase.io/avi) on keybase.
  • I have a public key whose fingerprint is 9DFC 19D0 8558 8DE8 6BF3 5C19 3879 4907 3690 98AF

To claim this, I am signing this object:

Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.

This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn, using NDV(mycolumn) got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn) took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.

It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.

HyperLogLog hashes each value it sees, and then assigns them to a bucket based on the low order bits of the hash. It's common to use 1024 buckets, so we can get the bucket by using a bitwise & with 1023:

select
<html>
<head>
<link rel="stylesheet" href="http://code.jquery.com/ui/1.10.3/themes/smoothness/jquery-ui.css" />
<script src="http://code.jquery.com/jquery-1.9.1.js"></script>
<script src="http://code.jquery.com/ui/1.10.3/jquery-ui.js"></script>
<link rel="stylesheet" href="/resources/demos/style.css" />
<script>
$(function() {
$( "#datepicker" ).datepicker(
scala> val x = null.asInstanceOf[Int]
x: Int = 0
scala> Option(x)
res1: Option[Int] = Some(0)
scala> Option(null.asInstanceOf[Int])
res2: Option[Int] = None
@avibryant
avibryant / monad.rb
Last active December 16, 2015 22:19
Totally impractical and absurd monad syntactic sugar for Ruby, roughly modeled on Scala's for-loop syntax.
require 'continuation'
module Monad
def -@
MonadContext.bind(self)
end
def wrap(val)
raise NotImplementedError
end
@avibryant
avibryant / approx-redis.md
Last active December 16, 2015 14:29
Approximate collection commands for Redis. Note: I haven't implemented these; I don't have immediate plans to implement these; I don't even know what the right way would be (fork redis? a bunch of Lua scripts? A proxy?); but it was an interesting thing to think about.

Approximate set

Unless noted otherwise, these do the same thing their equivalent set commands do. However, only a small fixed amount of memory is used for each set, no matter how many members you add.

  • ASADD key member [member ...]
  • ASCARD key
    • this will be correct within some small error
  • ASINTERCARD key [key ...]
    • this is equivalent to a SINTERSTORE followed by a SCARD, but we can't actually store it
  • ASISMEMBER key