-
-
Save avibryant/9ab95b033adadf9d797f to your computer and use it in GitHub Desktop.
optimal algorithm for sampling from distributed streams, using redis
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# This implements Tirthapura & Woodruff's optimal algorithm for sampling from distributed streams, | |
# using a redis sorted set for coordination. | |
# | |
# Run as many copies of the following command as you like, concurrently or not, feeding each one a different | |
# part of the stream of data to be sampled through stdin, where each line will be treated as an item in the stream. | |
# | |
# ruby sample.rb <sorted set key> <sample size> | |
# | |
# References: | |
# http://geomblog.blogspot.com/2012/01/shonan-meeting-part-3-optimal.html | |
# http://home.engineering.iastate.edu/~snt/pubs/disc2011-talk.pdf | |
# http://redis.io/topics/data-types#sorted-sets | |
require 'rubygems' | |
require 'redis' | |
zset = ARGV.shift | |
max = ARGV.shift.to_i | |
redis = Redis.new | |
counter = 0 | |
t = 1.0 | |
STDIN.each do |line| | |
r = rand | |
if r < t | |
counter += 1 | |
results = redis.multi do | |
redis.zadd(zset, r, line) | |
redis.zremrangebyrank(zset, max, -1) | |
redis.zrange(zset, max-1, max-1, :with_scores => true) | |
end | |
t = (results[-1][-1] || 1).to_f | |
end | |
end | |
$stderr.puts "Communication cost: #{counter}" |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment