Skip to content

Instantly share code, notes, and snippets.

@brandt
Created February 17, 2015 17:19
Show Gist options
  • Star 6 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save brandt/8f9ab3ceae37562a2841 to your computer and use it in GitHub Desktop.
Save brandt/8f9ab3ceae37562a2841 to your computer and use it in GitHub Desktop.
Calculate the required bloom filter size and optimal number of hashes from the expected number of items in the collection and acceptable false-positive rate
# Optimal bloom filter size and number of hashes
# Tips:
# 1. One byte per item in the input set gives about a 2% false positive rate.
# 2. The optimal number of hash functions is ~0.7x the number of bits per item.
# 3. The number of hashes dominates performance.
# Expected number of items in the collection
# n = (m * ln(2))/k;
n = 300_000
# Acceptable false-positive rate (0.01 = 1%)
# p = e^(-(m/n) * (ln(2)^2));
fpr = 0.01
# Optimal size (number of elements in the bit array)
# m = -((n*ln(p))/(ln(2)^2));
m = (n * Math.log(fpr).abs) / (Math.log(2) ** 2)
# Optimal number of hash functions
# k = (m/n) * ln(2);
k = (m / n) * Math.log(2)
puts "Optimal bloom filter size: #{m.ceil} bits"
puts "Optimal number of hash functions: #{k.ceil}"
@jan53n
Copy link

jan53n commented Feb 14, 2022

i wish i understood this magic!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment