jmhodges (owner)

Revisions

gist: 17985 Download_button fork
public
Description:
a ruby implementation of BloomNames
Public Clone URL: git://gist.github.com/17985.git
Embed All Files: show embed
bloomnames.rb #
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
require 'digest/sha1'
 
# Derived from a python implementation by Joe Gregorio:
# http://bitworking.org/news/380/bloom-filter-resources
 
# A Bloom Filter for tracking 3,000 'names', where 'names' are any
# strings of any length. Can be used to track less than 3,000 names, or more,
# but going over will increase the false positive rate. This is currently
# tuned for a false positive rate of 1%.
#
# By tuned, that means that the following Bloom Filter parameters
# are used:
#
# Number of hash functions:
# k = 7
# Number of bits in filter array:
# m = 30,000
# Number of elements added to filter:
# n = 3,000
 
class BloomNames
  FILTER_SIZE = 30000
 
  attr_reader :filter
 
  # Construct with a zero for an empty filter, or #
  # pass in a Fixnum for an alreadly build filter
  def initialize(filter=0)
    @filter = filter
  end
 
  # Add a key to the filter.
  def <<(name)
    hashes(name).each{ |pos| @filter |= 2 ** pos }
    self
  end
  alias :add :<<
 
  # Determine if a key is a member of the filter.
  def include?(name)
    hashes(name).all?{ |pos| !(@filter & (2 ** pos)).zero? }
  end
  alias :contains :include?
  
  # The filter shows up in the inspect, and it can be huge.
  # So, we replace it with a little handy info.
  def inspect
    "#<BloomNames bytes: #{filter.size}>"
  end
 
  private
  # To create seven hash functions we use the sha1 hash of the
  # string 'name' and chop that up into 20 bit values and then
  # mod down to the length of the Bloom filter, in this case
  # 30,000.
  def hashes(name)
    digits = Digest::SHA1.hexdigest(name)
    
    (0..6).map do |i|
      digits[i*5,5].to_i(16) % FILTER_SIZE
    end
  end
end