Avi Bryant avibryant

## disco.rb
=begin
Adapatation of streaming DISCO algorithm to parallel streams/monoids.
Each instance of Disco maintains @counts which is just #(x)
as well as @h which stores (q,n) for each (x,y) pair.
n corresponds conceptually to the number of emitted values for that pair, q corresponds (as in the paper)
to the probability with which they were emitted.

A separate instance is created for each dimension, these are then merged in any order.
Initialize with a single dimension by setting counts(w) to 1 for each w
in the dimension, and h(q,n) to (1,1) for each (w1,w2).

## DyadicRange.scala
package com.twitter.algebird

case class DyadicRange(maxValue : Long = Long.MaxValue) {
  val levels = math.ceil(math.log(maxValue) / math.log(2)).toInt

  def indicesForPoint(v : Long) = (1 to levels).map{level => (level, indexForPoint(v, level))}
  def indicesForRange(start : Long, end : Long)  : List[(Int,Long)] = indicesForRange(start, end, levels)

  def indexForPoint(v : Long, level : Int) = v >> (level - 1)
  def rangeForIndex(i : Long, level : Int) = (i << (level-1), ((i+1) << (level-1)) - 1)

## RandomProjection.scala
//See http://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection

class RandomProjection(dimensions : Int, nBuckets : Int, bitsPerBucket : Int) {
  val seed = 123456789
  lazy val r = new scala.util.Random(seed)
  lazy val buckets =
    (1 to nBuckets).map{i =>
      (1 to bitsPerBucket).map{j =>
        (1 to dimensions).map{k =>
          val f = r.nextDouble

## approx-redis.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              1 star
            
          
                avibryant
                / approx-redis.md
            
            
              Last active
              December 16, 2015 14:29
            
              
                Approximate collection commands for Redis. Note: I haven't implemented these; I don't have immediate plans to implement these; I don't even know what the right way would be (fork redis? a bunch of Lua scripts? A proxy?); but it was an interesting thing to think about.

              
    Approximate set

Unless noted otherwise, these do the same thing their equivalent set commands do. However, only a small fixed amount of memory is used for each set, no matter how many members you add.

ASADD key member [member ...]
ASCARD key

this will be correct within some small error


ASINTERCARD key [key ...]

this is equivalent to a SINTERSTORE followed by a SCARD, but we can't actually store it


ASISMEMBER key


## monad.rb
require 'continuation'

module Monad
  def -@
    MonadContext.bind(self)
  end

  def wrap(val)
    raise NotImplementedError
  end

## gist:6575113
scala> val x = null.asInstanceOf[Int]
x: Int = 0

scala> Option(x)
res1: Option[Int] = Some(0)

scala> Option(null.asInstanceOf[Int])
res2: Option[Int] = None

## index.html
<html>
<head>
<link rel="stylesheet" href="http://code.jquery.com/ui/1.10.3/themes/smoothness/jquery-ui.css" />
<script src="http://code.jquery.com/jquery-1.9.1.js"></script>
<script src="http://code.jquery.com/ui/1.10.3/jquery-ui.js"></script>
<link rel="stylesheet" href="/resources/demos/style.css" />
<script>

$(function() {
  $( "#datepicker" ).datepicker(

## gist:11376572
GFS = HDFS
MapReduce = Hadoop
BigTable = HBase
Protocol Buffers = Thrift or Avro (serialization)
Stubby = Thrift or Avro (RPC)
ColumnIO = Parquet
Dremel = Impala
Chubby = Zookeeper
Omega = Mesos
Borg = Aurora

## ml.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              2 stars
            
          
                avibryant
                / ml.md
            
            
              Last active
              June 3, 2016 02:34
            
          
    An idiosyncratic guide to teaching yourself practical machine learning, without links:

Find a binary classification dataset; maybe you have one internally.
Implement a simple decision tree algorithm, like CART.
Write some code to validate your model; produce an ROC curve and understand the tradeoff it embodies.
Compare the ROC for your training set with the ROC for a holdout and understand what it means that they differ.
Experiment with some hyperparameters: how does the comparison above change as you adjust the depth of the tree or other stopping criteria?
Combine your decision tree algorithm with bagging to produce a random forest. How does its ROC compare?
Do the same hyperparameter tuning here. (How many trees?) Reflect on overfitting and on the bias/variance tradeoff.


## gist:3200554
The Regex Game
Avi Bryant

This is a game for two programmers, although it's easy to imagine variations for more.
It can be played over email, twitter, or IM, but it's easy to imagine a custom web app for it, and I encourage someone to build one.

Each player starts by thinking of a regular expression. The players should decide beforehand on dialect and length restrictions (eg, has to be JavaScript-compatible and under 20 characters).
They don't reveal the Regex, but if playing over email etc, should send each other a difficult to brute force hash (eg bcrypt) of the Regex for later verification.

They do reveal two strings: one which the Regex will match, and one which it will not.
	=begin
	Adapatation of streaming DISCO algorithm to parallel streams/monoids.
	Each instance of Disco maintains @counts which is just #(x)
	as well as @h which stores (q,n) for each (x,y) pair.
	n corresponds conceptually to the number of emitted values for that pair, q corresponds (as in the paper)
	to the probability with which they were emitted.

	A separate instance is created for each dimension, these are then merged in any order.
	Initialize with a single dimension by setting counts(w) to 1 for each w
	in the dimension, and h(q,n) to (1,1) for each (w1,w2).
	package com.twitter.algebird

	case class DyadicRange(maxValue : Long = Long.MaxValue) {
	val levels = math.ceil(math.log(maxValue) / math.log(2)).toInt

	def indicesForPoint(v : Long) = (1 to levels).map{level => (level, indexForPoint(v, level))}
	def indicesForRange(start : Long, end : Long) : List[(Int,Long)] = indicesForRange(start, end, levels)

	def indexForPoint(v : Long, level : Int) = v >> (level - 1)
	def rangeForIndex(i : Long, level : Int) = (i << (level-1), ((i+1) << (level-1)) - 1)
	//See http://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection

	class RandomProjection(dimensions : Int, nBuckets : Int, bitsPerBucket : Int) {
	val seed = 123456789
	lazy val r = new scala.util.Random(seed)
	lazy val buckets =
	(1 to nBuckets).map{i =>
	(1 to bitsPerBucket).map{j =>
	(1 to dimensions).map{k =>
	val f = r.nextDouble
	require 'continuation'

	module Monad
	def -@
	MonadContext.bind(self)
	end

	def wrap(val)
	raise NotImplementedError
	end
	scala> val x = null.asInstanceOf[Int]
	x: Int = 0

	scala> Option(x)
	res1: Option[Int] = Some(0)

	scala> Option(null.asInstanceOf[Int])
	res2: Option[Int] = None
	<html>
	<head>
	<link rel="stylesheet" href="http://code.jquery.com/ui/1.10.3/themes/smoothness/jquery-ui.css" />
	<script src="http://code.jquery.com/jquery-1.9.1.js"></script>
	<script src="http://code.jquery.com/ui/1.10.3/jquery-ui.js"></script>
	<link rel="stylesheet" href="/resources/demos/style.css" />
	<script>

	$(function() {
	$( "#datepicker" ).datepicker(
	GFS = HDFS
	MapReduce = Hadoop
	BigTable = HBase
	Protocol Buffers = Thrift or Avro (serialization)
	Stubby = Thrift or Avro (RPC)
	ColumnIO = Parquet
	Dremel = Impala
	Chubby = Zookeeper
	Omega = Mesos
	Borg = Aurora
	The Regex Game
	Avi Bryant

	This is a game for two programmers, although it's easy to imagine variations for more.
	It can be played over email, twitter, or IM, but it's easy to imagine a custom web app for it, and I encourage someone to build one.

	Each player starts by thinking of a regular expression. The players should decide beforehand on dialect and length restrictions (eg, has to be JavaScript-compatible and under 20 characters).
	They don't reveal the Regex, but if playing over email etc, should send each other a difficult to brute force hash (eg bcrypt) of the Regex for later verification.

	They do reveal two strings: one which the Regex will match, and one which it will not.