Knuth's Algorithm R for Reservoir Sampling.
We wish to sample k
elements out of a continuous stream of data, where we can process each data item only once. We cannot cache the streaming data, nor can we go back to look at previous data.
1. Define a reservoir R of size k elements.
2. Keep adding items to this reservoir from the streaming data source until the reservoir is full.
3. For the i th element, x_i, of the input data where i >= k,
a. Do a random sampling of integers from set (0,i). Set this randomly sampled integer to m.
b. if 0 <= m <= k-1,
set R[m] = x_i
else
Keep R unchanged.
4. Once all the input elements are received, the reservoir R is the desired sample set.
I have not completely read the union logic for reservoir sketch in the Java library, but the code comments here indicates that 'merging' and 'union' are seen as two different concepts? Is that so? Can you please explain the difference.
From general reading on the internet, my idea of merging reservoir sketches is as follows -
Assume we have two different streams of data of length p
and q
. We generated reservoirs for both these streams R and S respectively. Can we merge the reservoirs, R
and S
, of these two streams to generate a reservoir T
of k
elements of the combined data stream.
Algorithm to merge the two sketches -
1. For k iterations do the following,
a.) Flip a random coin and with probability = `p/(p+q)` choose a random element from reservoir R without replacement and add it to the final reservoir T.
b.) And with probability `q/(p+q)` choose a random element from reservoir S without replacement and add it to the final reservoir T.
2. After `k` iterations we will get the final reservoir, `T` for the combined stream.
The Datasketches website says that for a sketch to be acceptable, these conditions for the algorithm should be met:
- Streaming / One-Touch
- Small in Size
- Sublinear in Size Growth
- Mergeable sketch(A + B) ≈ sketch(A) U sketch(B),
- Mergeable With Different Size Parameters
- Data Insensitive
- Mathematically Proven Error and Size Properties
- Meaningful and Usable Error Bounds.
The above algorithm definitely meets criteria 1, 2, 3, 4, and 6. I assume it meets criteria 7 and 8 - I have not done the math myself, but I assume the algorithm has been sufficiently vetted.
I am unsure about what criteria 5, "Mergeable With Different Size Parameters", would entail for the Reservoir Sketch.
import random
def reservoir_sampling(k):
"""
k: int
size of the reservoir.
"""
reservoir = []
# Assume read_data() function simulates online streaming of data
# and returns the tuple (index, item)
for i,item in read_input():
if i < k:
reservoir.append(item)
else:
random_index = random.randint(0, i+1)
if random_index < k:
reservoir[random_index] = item
for item in reservoir:
print(str(item) + "\n")
- the Java library defines two types of reservoir sampling - one for long data types, and another for a generic <Item>. what is the difference between longs_sketches and items_sketches? why can't long also be under item?
- The code comments in the Java library for ReservoirLongsUnion.java indicates that 'merging' and 'union' are seen as two different concepts? Is that so? Can you please explain the difference.
- What does the criteria "Mergeable With Different Size Parameters" mean in the context of Reservoir Sampling sketch?
Java generics must be objects, not primitives. You could do a generic sketch with
Long
but notlong
, incurring some amount of object overhead with each time. But the real reason we have both is that I wrote it for longs first and then made it generic since that was a bit easier (serializing generic types took a bit more doing, and there are some assumptions needed for it to work).For our purposes, merging B into A leaves sketch A in a modified state. By contrast, a union is its own object and unioning sketches into it leaves the original sketches untouched.
Mergeable with different size parameters means the sketches need not have the same value of k. The algorithm you listed won't work properly for that case; the Java code is the only formal documentation we have, unfortunately.