Skip to content

Instantly share code, notes, and snippets.

@rickhull
Created February 20, 2024 19:57
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save rickhull/881ada3c8d2bcbecda3cbbc360410449 to your computer and use it in GitHub Desktop.
Save rickhull/881ada3c8d2bcbecda3cbbc360410449 to your computer and use it in GitHub Desktop.
require 'bitset'
require 'zlib'
class BloomFilter
# just defaults, overrideable
SEED = 0
FPR = 0.01
def self.num_bits(num_items, fpr: FPR)
(-1 * num_items * Math.log(fpr) / Math.log(2) ** 2).ceil
end
def self.num_hashes(num_bits, num_items)
(num_bits / num_items).ceil
end
# allows fixing of num_hashes to a specific size
def self.optimal(num_items, num_hashes: nil, fpr: FPR)
num_bits = self.num_bits(num_items, fpr: fpr)
nh = self.num_hashes(num_bits, num_items)
if num_hashes
[(num_bits * nh / num_hashes).ceil, num_hashes]
else
[num_bits, nh]
end
end
def self.generate(num_items, num_hashes: nil, fpr: FPR, seed: SEED)
num_bits, num_hashes =
self.optimal(num_items, num_hashes: num_hashes, fpr: fpr)
self.new(num_bits: num_bits, num_hashes: num_hashes, seed: seed)
end
attr_reader :bitmap
# The default values require 8 bytes of storage and can "recognize" up to
# 10 strings with a False Positive Rate under 5%
# The false positive rate goes up as more strings are added
def initialize(num_bits: 64, num_hashes: 6, seed: SEED)
@num_bits = num_bits
@num_hashes = num_hashes
@seed = seed
@bitmap = Bitset.new(@num_bits)
end
# use ruby's #hash "for free"; crc32 beyond that
def bits(str)
Array.new(@num_hashes - 1) { |i|
Zlib.crc32(str, i + @seed) % @num_bits
}.push(str.hash % @num_bits)
end
def see(str)
@bitmap.set *self.bits(str)
end
def seen?(str)
@bitmap.set? *self.bits(str)
end
end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment