I hereby claim:
- I am avibryant on github.
- I am avi (https://keybase.io/avi) on keybase.
- I have a public key whose fingerprint is 9DFC 19D0 8558 8DE8 6BF3 5C19 3879 4907 3690 98AF
To claim this, I am signing this object:
case class SetSizeAggregator(hllBits: Int, maxSetSize: Int = 10)
    extends MonoidAggregator[Array[Byte], Either[HLL, Set[Array[Byte]]], Long] {

  def prepare(in: Array[Byte]) = Right(Set(in))

  def present(sum: Either[HLL, Set[Array[Byte]]]) =
    sum match {
      case Right(set) => set.size.toLong
      case Left(hll)  => hll.approximateSize.estimate
    }

  // the monoid member (which unions small sets and degrades oversized
  // ones to an HLL) is elided in the original snippet
}
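The aggregator above keeps an exact Set until it grows past maxSetSize, then degrades to an approximate HLL sketch. A hedged, toy sketch of that idea in Python (the class and helper names are invented for illustration, and a crude Flajolet-Martin-style max-rank stands in for a real HyperLogLog):

```python
import hashlib

def _hash64(value):
    # stable 64-bit hash of the value's string form (illustrative choice)
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

class SetSizeCounter:
    """Toy version of the SetSizeAggregator idea: count exactly with a set
    until it passes max_set_size, then degrade to an approximate sketch
    (here a crude Flajolet-Martin-style max-rank, standing in for HLL)."""

    def __init__(self, max_set_size=10):
        self.max_set_size = max_set_size
        self.exact = set()   # the Right(Set(...)) side
        self.rank = None     # the Left(HLL) side, approximated

    @staticmethod
    def _rho(h):
        # 1-based position of the lowest set bit of h
        return (h & -h).bit_length()

    def add(self, value):
        h = _hash64(value)
        if self.rank is None:
            self.exact.add(h)
            if len(self.exact) > self.max_set_size:
                # too big: throw away the exact set, keep only the sketch
                self.rank = max(self._rho(x) for x in self.exact)
                self.exact = None
        else:
            self.rank = max(self.rank, self._rho(h))

    def size(self):
        # exact count while small, rough 2**rank estimate afterwards
        return len(self.exact) if self.rank is None else 2 ** self.rank
```

The point of the Either is that merging stays cheap and bounded: once either side has degraded to a sketch, merges never grow memory again.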
import com.twitter.algebird._

case class Preparer[A, T](prepareFn: A => T) {
  def map[U](fn: T => U) =
    Preparer[A, U](fn.compose(prepareFn))

  def flatMap[U](fn: T => TraversableOnce[U]) =
    FlatPreparer[A, U](fn.compose(prepareFn))

  def aggregate[B, C](aggregator: Aggregator[T, B, C]): Aggregator[A, B, C] =
    // the original snippet is cut off here; fusing the pipeline into the
    // aggregator's prepare step is the natural completion
    aggregator.composePrepare(prepareFn)
}
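The same map-then-aggregate fusion can be sketched in Python (hypothetical names, not Algebird's API): each map composes onto the prepare function instead of materializing an intermediate collection.

```python
from functools import reduce

class PyPreparer:
    """Toy analogue of the Preparer pattern: map steps are fused into the
    aggregation's prepare function rather than building new collections."""

    def __init__(self, prepare_fn=lambda a: a):
        self.prepare_fn = prepare_fn

    def map(self, fn):
        # compose fn after the existing prepare step
        return PyPreparer(lambda a: fn(self.prepare_fn(a)))

    def aggregate(self, reduce_fn):
        # returns a function from an input sequence to its aggregate
        def run(inputs):
            return reduce(reduce_fn, (self.prepare_fn(a) for a in inputs))
        return run

# sum of squared lengths, in one pass, no intermediate lists
agg = PyPreparer().map(len).map(lambda n: n * n).aggregate(lambda x, y: x + y)
agg(["a", "bb", "ccc"])  # 1 + 4 + 9 = 14
```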
function probabilityBBeatsA(aa, ba, ab, bb) {
  var probability = 0.0;
  for (var i = 0; i < ab; i++) {
    var product = Math.log(1 + (aa + ba) / (i + bb));
    var j = 1;
    var start = i + 1;
    [bb, ba, aa, i].forEach(function(steps) {
      var stop = start + steps;
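The truncated loop above is evaluating, in log space, the closed-form probability that one Beta-distributed conversion rate beats another. Since the original code is cut off, here is a hedged Python sketch of the same closed form (Evan Miller's formula, with Beta(alpha_a, beta_a) for A and Beta(alpha_b, beta_b) for B), not a reconstruction of the original gist:

```python
from math import lgamma, log, exp

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def prob_b_beats_a(alpha_a, beta_a, alpha_b, beta_b):
    """Probability that a draw from Beta(alpha_b, beta_b) exceeds a draw
    from Beta(alpha_a, beta_a); the sum runs over the integer alpha_b."""
    total = 0.0
    for i in range(int(alpha_b)):
        total += exp(
            log_beta(alpha_a + i, beta_a + beta_b)
            - log(beta_b + i)
            - log_beta(1 + i, beta_b)
            - log_beta(alpha_a, beta_a)
        )
    return total
```

With identical priors the result is exactly 0.5, as it should be; working in log space keeps the Beta-function terms from overflowing for realistic counts.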
GFS = HDFS
MapReduce = Hadoop
BigTable = HBase
Protocol Buffers = Thrift or Avro (serialization)
Stubby = Thrift or Avro (RPC)
ColumnIO = Parquet
Dremel = Impala
Chubby = Zookeeper
Omega = Mesos
Borg = Aurora
Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.
This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn, using NDV(mycolumn) got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn) took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.
It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.
HyperLogLog hashes each value it sees, and then assigns them to a bucket based on the low order bits of the hash. It's common to use 1024 buckets, so we can get the bucket by using a bitwise & with 1023:
select
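The SQL query is cut off in the source. As an illustration of the bucket-and-estimate steps just described (a Python sketch, not the original Impala query; the hash function and bias-correction constant are standard HyperLogLog choices, not taken from the post):

```python
import hashlib

def hll_estimate(values, buckets=1024):
    """Raw HyperLogLog estimate: bucket each hash by its low-order bits,
    track the max 1-based position of the lowest set bit of the remaining
    bits per bucket, then take a bias-corrected harmonic mean. (The
    small- and large-range corrections from the HLL paper are omitted.)"""
    shift = buckets.bit_length() - 1              # log2(buckets)
    max_rho = [0] * buckets
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        bucket = h & (buckets - 1)                # low-order bits, e.g. & 1023
        rest = h >> shift                         # the remaining bits
        rho = (rest & -rest).bit_length() if rest else 64
        max_rho[bucket] = max(max_rho[bucket], rho)
    alpha = 0.7213 / (1 + 1.079 / buckets)        # bias correction, m >= 128
    return alpha * buckets * buckets / sum(2.0 ** -r for r in max_rho)
```

Because each bucket only keeps a running max, duplicates change nothing, and the whole state is a fixed 1024-entry array regardless of input size.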
<html>
<head>
  <link rel="stylesheet" href="http://code.jquery.com/ui/1.10.3/themes/smoothness/jquery-ui.css" />
  <script src="http://code.jquery.com/jquery-1.9.1.js"></script>
  <script src="http://code.jquery.com/ui/1.10.3/jquery-ui.js"></script>
  <link rel="stylesheet" href="/resources/demos/style.css" />
  <script>
  $(function() {
    $( "#datepicker" ).datepicker();
  });
  </script>
</head>
<body>
  <p>Date: <input type="text" id="datepicker" /></p>
</body>
</html>
scala> val x = null.asInstanceOf[Int]
x: Int = 0

scala> Option(x)
res1: Option[Int] = Some(0)

// here the null is never unboxed: Option.apply's type parameter means the
// boxed null reference is passed straight through, so apply returns None
scala> Option(null.asInstanceOf[Int])
res2: Option[Int] = None
require 'continuation'

module Monad
  def -@
    MonadContext.bind(self)
  end

  def wrap(val)
    raise NotImplementedError
  end

  # ... (the rest of the gist is truncated in the source)
end
Unless noted otherwise, these do the same thing as their equivalent set commands. However, only a small fixed amount of memory is used for each set, no matter how many members you add.