Skip to content

Instantly share code, notes, and snippets.

@avibryant
Created January 29, 2012 07:03
Show Gist options
  • Star 6 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save avibryant/9ab95b033adadf9d797f to your computer and use it in GitHub Desktop.
Save avibryant/9ab95b033adadf9d797f to your computer and use it in GitHub Desktop.
optimal algorithm for sampling from distributed streams, using redis
# This implements Tirthapura & Woodruff's optimal algorithm for sampling from distributed streams,
# using a redis sorted set for coordination.
#
# Run as many copies of the following command as you like, concurrently or not, feeding each one a different
# part of the stream of data to be sampled through stdin, where each line will be treated as an item in the stream.
#
# ruby sample.rb <sorted set key> <sample size>
#
# References:
# http://geomblog.blogspot.com/2012/01/shonan-meeting-part-3-optimal.html
# http://home.engineering.iastate.edu/~snt/pubs/disc2011-talk.pdf
# http://redis.io/topics/data-types#sorted-sets
require 'rubygems'
require 'redis'
zset = ARGV.shift
max = ARGV.shift.to_i
redis = Redis.new
counter = 0
t = 1.0
STDIN.each do |line|
r = rand
if r < t
counter += 1
results = redis.multi do
redis.zadd(zset, r, line)
redis.zremrangebyrank(zset, max, -1)
redis.zrange(zset, max-1, max-1, :with_scores => true)
end
t = (results[-1][-1] || 1).to_f
end
end
$stderr.puts "Communication cost: #{counter}"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment