Skip to content

Instantly share code, notes, and snippets.

@m00nlight
Last active June 30, 2019 06:17
Show Gist options
  • Save m00nlight/bfe54d1b2db362755a3a to your computer and use it in GitHub Desktop.
Save m00nlight/bfe54d1b2db362755a3a to your computer and use it in GitHub Desktop.
Python reservoir sampling algorithm
from random import randrange
def reservoir_sampling(items, k):
"""
Reservoir sampling algorithm for large sample space or unknow end list
See <http://en.wikipedia.org/wiki/Reservoir_sampling> for detail>
Type: ([a] * Int) -> [a]
Prev constrain: k is positive and items at least of k items
Post constrain: the length of return array is k
"""
sample = items[0:k]
for i in range(k, len(items)):
j = randrange(1, i + 1)
if j <= k:
sample[j - 1] = items[i]
return sample
@tommyod
Copy link

tommyod commented Feb 3, 2019

  • This code is wrong, even reservoir_sampling([1, 2, 3], k=1) will fail.
  • If we know len(items) then this algorithm is not needed in the first place.

@m00nlight
Copy link
Author

* This code is wrong, even `reservoir_sampling([1, 2, 3], k=1)` will fail.

* If we know `len(items)` then this algorithm is not needed in the first place.

@tommyod. Thanks for pointing out the problem, fix the bug. The idea of reservoir_sampling is that we don't know the length of the items or the items are too large to hold in memory. Yes, if the len(items) is know and can be hold in memory, the algorithm is useless.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment