Skip to content

Instantly share code, notes, and snippets.

@greed2411
Last active July 4, 2021 20:21
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save greed2411/1ea583d328f66aa16540b6966de047de to your computer and use it in GitHub Desktop.
Save greed2411/1ea583d328f66aa16540b6966de047de to your computer and use it in GitHub Desktop.
Python Bloom Filter based random-int-without-replacement-generator-given-a-range.
"""
This gist comes out of frustration that I couldn't have a
random-int-without-replacement-generator-given-a-range.
random.sample, and np.random.choice(replace=False) both fail on really large numbers.
Python crashes saying OOM & segfaults.
Problem was for smaller `n` (<5 million) they optimized for linear/sub-linear
times and linear storage (set, pool-tracking list).
Therefore compromising time, with better storage (like bit-vector/bloom/cuckoo/xor filters)
we can sample huge amounts of unique random integers given a range.
Useful links:
-------------
CPython random.sample implementation, which relies on set & pool-tracking list
to give out unique random numbers.
https://docs.python.org/3/library/random.html#random.sample
https://github.com/python/cpython/blob/main/Lib/random.py#L385
For this particular use case since we only want unique random
numbers, Bloom Filter are an excellent choice (because 0 false negatives on set membership).
The space they take up is the limited only to the size of the bit-vector
(function of capacity & error-rate).
https://llimllib.github.io/bloomfilter-tutorial/
https://en.wikipedia.org/wiki/Bloom_filter
https://github.com/joseph-fox/python-bloomfilter/blob/3ec8d1393df4c8fa3e16ab767c0fab431f584b06/pybloom_live/pybloom.py#L69
I was confused how to do implement it properly, whether an iterator vs iterable vs generator etc.
This article helped with examples.
https://hackaday.com/2018/09/19/learn-to-loop-the-python-way-iterators-and-generators-explained/
"""
import random
from typing import Generator
import pybloom_live
SAMPLE_SPACE = 2_176_782_336
MAX_REQUIRED_SAMPLES = 462_455_000
def rand_int_bf_sampler(a: int , b: int, k: int) -> Generator:
"""
returns a generator, which can be used to generate `k` random numbers
without replacement, in the range `a` to `b`.
internally, instead of a set/pool-tracking method for tracking uniquely
generated random numbers we are using a Bloom Filter. The trade-off is
sacrificing time, in turn for storage. At the same time, being lazy and giving
unique random numbers only when it is needed.
a, b: provides the range for python's `random.randint`
k: provides the number of random numbers (without replacement) to be returned by the generator
"""
bloom_filter = pybloom_live.BloomFilter(capacity=k, error_rate=0.001)
i = 0
while i < k:
x = random.randint(a, b)
if x not in bloom_filter:
yield x
bloom_filter.add(x)
i += 1
def main():
bloom_sampler = rand_int_bf_sampler(0, SAMPLE_SPACE, MAX_REQUIRED_SAMPLES)
for number in bloom_sampler:
print(number)
if __name__ == "__main__":
main()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment