Skip to content

Instantly share code, notes, and snippets.

View avibryant's full-sized avatar

Avi Bryant avibryant

  • Galiano Island, BC
View GitHub Profile
=begin
Adapatation of streaming DISCO algorithm to parallel streams/monoids.
Each instance of Disco maintains @counts which is just #(x)
as well as @h which stores (q,n) for each (x,y) pair.
n corresponds conceptually to the number of emitted values for that pair, q corresponds (as in the paper)
to the probability with which they were emitted.
A separate instance is created for each dimension, these are then merged in any order.
Initialize with a single dimension by setting counts(w) to 1 for each w
in the dimension, and h(q,n) to (1,1) for each (w1,w2).
package com.twitter.algebird
case class DyadicRange(maxValue : Long = Long.MaxValue) {
val levels = math.ceil(math.log(maxValue) / math.log(2)).toInt
def indicesForPoint(v : Long) = (1 to levels).map{level => (level, indexForPoint(v, level))}
def indicesForRange(start : Long, end : Long) : List[(Int,Long)] = indicesForRange(start, end, levels)
def indexForPoint(v : Long, level : Int) = v >> (level - 1)
def rangeForIndex(i : Long, level : Int) = (i << (level-1), ((i+1) << (level-1)) - 1)
//See http://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection
class RandomProjection(dimensions : Int, nBuckets : Int, bitsPerBucket : Int) {
val seed = 123456789
lazy val r = new scala.util.Random(seed)
lazy val buckets =
(1 to nBuckets).map{i =>
(1 to bitsPerBucket).map{j =>
(1 to dimensions).map{k =>
val f = r.nextDouble
@avibryant
avibryant / approx-redis.md
Last active December 16, 2015 14:29
Approximate collection commands for Redis. Note: I haven't implemented these; I don't have immediate plans to implement these; I don't even know what the right way would be (fork redis? a bunch of Lua scripts? A proxy?); but it was an interesting thing to think about.

Approximate set

Unless noted otherwise, these do the same thing their equivalent set commands do. However, only a small fixed amount of memory is used for each set, no matter how many members you add.

  • ASADD key member [member ...]
  • ASCARD key
    • this will be correct within some small error
  • ASINTERCARD key [key ...]
    • this is equivalent to a SINTERSTORE followed by a SCARD, but we can't actually store it
  • ASISMEMBER key
@avibryant
avibryant / monad.rb
Last active December 16, 2015 22:19
Totally impractical and absurd monad syntactic sugar for Ruby, roughly modeled on Scala's for-loop syntax.
require 'continuation'
module Monad
def -@
MonadContext.bind(self)
end
def wrap(val)
raise NotImplementedError
end
scala> val x = null.asInstanceOf[Int]
x: Int = 0
scala> Option(x)
res1: Option[Int] = Some(0)
scala> Option(null.asInstanceOf[Int])
res2: Option[Int] = None
<html>
<head>
<link rel="stylesheet" href="http://code.jquery.com/ui/1.10.3/themes/smoothness/jquery-ui.css" />
<script src="http://code.jquery.com/jquery-1.9.1.js"></script>
<script src="http://code.jquery.com/ui/1.10.3/jquery-ui.js"></script>
<link rel="stylesheet" href="/resources/demos/style.css" />
<script>
$(function() {
$( "#datepicker" ).datepicker(
@avibryant
avibryant / gist:11376572
Last active March 13, 2016 01:51
Google <=> Open Source Rosetta Stone
GFS = HDFS
MapReduce = Hadoop
BigTable = HBase
Protocol Buffers = Thrift or Avro (serialization)
Stubby = Thrift or Avro (RPC)
ColumnIO = Parquet
Dremel = Impala
Chubby = Zookeeper
Omega = Mesos
Borg = Aurora

An idiosyncratic guide to teaching yourself practical machine learning, without links:

  • Find a binary classification dataset; maybe you have one internally.
  • Implement a simple decision tree algorithm, like CART.
  • Write some code to validate your model; produce an ROC curve and understand the tradeoff it embodies.
  • Compare the ROC for your training set with the ROC for a holdout and understand what it means that they differ.
  • Experiment with some hyperparameters: how does the comparison above change as you adjust the depth of the tree or other stopping criteria?
  • Combine your decision tree algorithm with bagging to produce a random forest. How does its ROC compare?
  • Do the same hyperparameter tuning here. (How many trees?) Reflect on overfitting and on the bias/variance tradeoff.
@avibryant
avibryant / gist:3200554
Created July 29, 2012 17:47
The Regex Game
The Regex Game
Avi Bryant
This is a game for two programmers, although it's easy to imagine variations for more.
It can be played over email, twitter, or IM, but it's easy to imagine a custom web app for it, and I encourage someone to build one.
Each player starts by thinking of a regular expression. The players should decide beforehand on dialect and length restrictions (eg, has to be JavaScript-compatible and under 20 characters).
They don't reveal the Regex, but if playing over email etc, should send each other a difficult to brute force hash (eg bcrypt) of the Regex for later verification.
They do reveal two strings: one which the Regex will match, and one which it will not.