Knuth's Algorithm R for Reservoir Sampling.
We wish to sample k
elements out of a continuous stream of data, where we can process each data item only once. We cannot cache the streaming data, nor can we go back to look at previous data.
1. Define a reservoir R of size k elements.
2. Keep adding items to this reservoir from the streaming data source until the reservoir is full.
3. For the i th element, x_i, of the input data where i >= k,
a. Do a random sampling of integers from set (0,i). Set this randomly sampled integer to m.
b. if 0 <= m <= k-1,
set R[m] = x_i
else
Keep R unchanged.
4. Once all the input elements are received, the reservoir R is the desired sample set.
I have not completely read the union logic for reservoir sketch in the Java library, but the code comments here indicates that 'merging' and 'union' are seen as two different concepts? Is that so? Can you please explain the difference.
From general reading on the internet, my idea of merging reservoir sketches is as follows -
Assume we have two different streams of data of length p
and q
. We generated reservoirs for both these streams R and S respectively. Can we merge the reservoirs, R
and S
, of these two streams to generate a reservoir T
of k
elements of the combined data stream.
Algorithm to merge the two sketches -
1. For k iterations do the following,
a.) Flip a random coin and with probability = `p/(p+q)` choose a random element from reservoir R without replacement and add it to the final reservoir T.
b.) And with probability `q/(p+q)` choose a random element from reservoir S without replacement and add it to the final reservoir T.
2. After `k` iterations we will get the final reservoir, `T` for the combined stream.
The Datasketches website says that for a sketch to be acceptable, these conditions for the algorithm should be met:
- Streaming / One-Touch
- Small in Size
- Sublinear in Size Growth
- Mergeable sketch(A + B) ≈ sketch(A) U sketch(B),
- Mergeable With Different Size Parameters
- Data Insensitive
- Mathematically Proven Error and Size Properties
- Meaningful and Usable Error Bounds.
The above algorithm definitely meets criteria 1, 2, 3, 4, and 6. I assume it meets criteria 7 and 8 - I have not done the math myself, but I assume the algorithm has been sufficiently vetted.
I am unsure about what criteria 5, "Mergeable With Different Size Parameters", would entail for the Reservoir Sketch.
import random
def reservoir_sampling(k):
"""
k: int
size of the reservoir.
"""
reservoir = []
# Assume read_data() function simulates online streaming of data
# and returns the tuple (index, item)
for i,item in read_input():
if i < k:
reservoir.append(item)
else:
random_index = random.randint(0, i+1)
if random_index < k:
reservoir[random_index] = item
for item in reservoir:
print(str(item) + "\n")
- the Java library defines two types of reservoir sampling - one for long data types, and another for a generic <Item>. what is the difference between longs_sketches and items_sketches? why can't long also be under item?
- The code comments in the Java library for ReservoirLongsUnion.java indicates that 'merging' and 'union' are seen as two different concepts? Is that so? Can you please explain the difference.
- What does the criteria "Mergeable With Different Size Parameters" mean in the context of Reservoir Sampling sketch?
A few more comments for clarity:
You will find that most of our sketches in the library have a separate Union class for the above reasons.
In most of the scientific literature, the term "mergability" is a general term that means that the fundamental data structure underlying the algorithm is capable of allowing two such data structures to be combined together, without any loss of accuracy. Our definition above is more precise because we care about the actual implementation in code, which the theoretical papers are almost never concerned about.