Skip to content

Instantly share code, notes, and snippets.

@anirudhacharya
Last active March 31, 2020 03:58
Show Gist options
  • Save anirudhacharya/48a98aea57645d78255aa256df050b9e to your computer and use it in GitHub Desktop.
Save anirudhacharya/48a98aea57645d78255aa256df050b9e to your computer and use it in GitHub Desktop.

Knuth's Algorithm R for Reservoir Sampling.

Problem Statement:

We wish to sample k elements out of a continuous stream of data, where we can process each data item only once. We cannot cache the streaming data, nor can we go back to look at previous data.

Solution -

1. Define a reservoir R of size k elements.
2. Keep adding items to this reservoir from the streaming data source until the reservoir is full.
3. For the i th element, x_i, of the input data where i >= k,
    a. Do a random sampling of integers from set (0,i). Set this randomly sampled integer to m.
    b. if 0 <= m <= k-1,
            set R[m] = x_i
       else
            Keep R unchanged.
4. Once all the input elements are received, the reservoir R is the desired sample set.

Merging sketches

I have not completely read the union logic for reservoir sketch in the Java library, but the code comments here indicates that 'merging' and 'union' are seen as two different concepts? Is that so? Can you please explain the difference.

From general reading on the internet, my idea of merging reservoir sketches is as follows -

Assume we have two different streams of data of length p and q. We generated reservoirs for both these streams R and S respectively. Can we merge the reservoirs, R and S, of these two streams to generate a reservoir T of k elements of the combined data stream.

Algorithm to merge the two sketches -

1. For k iterations do the following,
    a.) Flip a random coin and with probability = `p/(p+q)` choose a random element from reservoir R without replacement and add it to the final reservoir T.
    b.) And with probability `q/(p+q)` choose a random element from reservoir S without replacement and add it to the final reservoir T.
2. After `k` iterations we will get the final reservoir, `T` for the combined stream.

Sketch Criteria

The Datasketches website says that for a sketch to be acceptable, these conditions for the algorithm should be met:

  1. Streaming / One-Touch
  2. Small in Size
  3. Sublinear in Size Growth
  4. Mergeable sketch(A + B) ≈ sketch(A) U sketch(B),
  5. Mergeable With Different Size Parameters
  6. Data Insensitive
  7. Mathematically Proven Error and Size Properties
  8. Meaningful and Usable Error Bounds.

The above algorithm definitely meets criteria 1, 2, 3, 4, and 6. I assume it meets criteria 7 and 8 - I have not done the math myself, but I assume the algorithm has been sufficiently vetted.

I am unsure about what criteria 5, "Mergeable With Different Size Parameters", would entail for the Reservoir Sketch.

Python code

import random

def reservoir_sampling(k):
    """
    k: int
        size of the reservoir.
    """
    reservoir = []
    # Assume read_data() function simulates online streaming of data
    # and returns the tuple (index, item)
    for i,item in read_input():
        if i < k:
            reservoir.append(item)
        else:
            random_index = random.randint(0, i+1)
            if random_index < k:
                reservoir[random_index] = item

    for item in reservoir:
        print(str(item) + "\n")

Questions -

@jmalkin
Copy link

jmalkin commented Mar 31, 2020

for cpp can we just define 1 implementation with templates, since cpp templates can take long?

Yes. In fact, c++ needs to be header-only -- we don't produce a library. Not only can it be a template, it must be a template :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment