Last active
July 4, 2021 20:21
-
-
Save greed2411/1ea583d328f66aa16540b6966de047de to your computer and use it in GitHub Desktop.
Python Bloom Filter based random-int-without-replacement-generator-given-a-range.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
""" | |
This gist comes out of frustration that I couldn't have a | |
random-int-without-replacement-generator-given-a-range. | |
random.sample, and np.random.choice(replace=False) both fail on really large numbers. | |
Python crashes saying OOM & segfaults. | |
Problem was for smaller `n` (<5 million) they optimized for linear/sub-linear | |
times and linear storage (set, pool-tracking list). | |
Therefore compromising time, with better storage (like bit-vector/bloom/cuckoo/xor filters) | |
we can sample huge amounts of unique random integers given a range. | |
Useful links: | |
------------- | |
CPython random.sample implementation, which relies on set & pool-tracking list | |
to give out unique random numbers. | |
https://docs.python.org/3/library/random.html#random.sample | |
https://github.com/python/cpython/blob/main/Lib/random.py#L385 | |
For this particular use case since we only want unique random | |
numbers, Bloom Filter are an excellent choice (because 0 false negatives on set membership). | |
The space they take up is the limited only to the size of the bit-vector | |
(function of capacity & error-rate). | |
https://llimllib.github.io/bloomfilter-tutorial/ | |
https://en.wikipedia.org/wiki/Bloom_filter | |
https://github.com/joseph-fox/python-bloomfilter/blob/3ec8d1393df4c8fa3e16ab767c0fab431f584b06/pybloom_live/pybloom.py#L69 | |
I was confused how to do implement it properly, whether an iterator vs iterable vs generator etc. | |
This article helped with examples. | |
https://hackaday.com/2018/09/19/learn-to-loop-the-python-way-iterators-and-generators-explained/ | |
""" | |
import random | |
from typing import Generator | |
import pybloom_live | |
SAMPLE_SPACE = 2_176_782_336 | |
MAX_REQUIRED_SAMPLES = 462_455_000 | |
def rand_int_bf_sampler(a: int , b: int, k: int) -> Generator: | |
""" | |
returns a generator, which can be used to generate `k` random numbers | |
without replacement, in the range `a` to `b`. | |
internally, instead of a set/pool-tracking method for tracking uniquely | |
generated random numbers we are using a Bloom Filter. The trade-off is | |
sacrificing time, in turn for storage. At the same time, being lazy and giving | |
unique random numbers only when it is needed. | |
a, b: provides the range for python's `random.randint` | |
k: provides the number of random numbers (without replacement) to be returned by the generator | |
""" | |
bloom_filter = pybloom_live.BloomFilter(capacity=k, error_rate=0.001) | |
i = 0 | |
while i < k: | |
x = random.randint(a, b) | |
if x not in bloom_filter: | |
yield x | |
bloom_filter.add(x) | |
i += 1 | |
def main(): | |
bloom_sampler = rand_int_bf_sampler(0, SAMPLE_SPACE, MAX_REQUIRED_SAMPLES) | |
for number in bloom_sampler: | |
print(number) | |
if __name__ == "__main__": | |
main() |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment