Skip to content

Instantly share code, notes, and snippets.

@anirudhacharya
Last active March 31, 2020 03:58
Show Gist options
  • Save anirudhacharya/48a98aea57645d78255aa256df050b9e to your computer and use it in GitHub Desktop.
Save anirudhacharya/48a98aea57645d78255aa256df050b9e to your computer and use it in GitHub Desktop.

Knuth's Algorithm R for Reservoir Sampling.

Problem Statement:

We wish to sample k elements out of a continuous stream of data, where we can process each data item only once. We cannot cache the streaming data, nor can we go back to look at previous data.

Solution -

1. Define a reservoir R of size k elements.
2. Keep adding items to this reservoir from the streaming data source until the reservoir is full.
3. For the i th element, x_i, of the input data where i >= k,
    a. Do a random sampling of integers from set (0,i). Set this randomly sampled integer to m.
    b. if 0 <= m <= k-1,
            set R[m] = x_i
       else
            Keep R unchanged.
4. Once all the input elements are received, the reservoir R is the desired sample set.

Merging sketches

I have not completely read the union logic for reservoir sketch in the Java library, but the code comments here indicates that 'merging' and 'union' are seen as two different concepts? Is that so? Can you please explain the difference.

From general reading on the internet, my idea of merging reservoir sketches is as follows -

Assume we have two different streams of data of length p and q. We generated reservoirs for both these streams R and S respectively. Can we merge the reservoirs, R and S, of these two streams to generate a reservoir T of k elements of the combined data stream.

Algorithm to merge the two sketches -

1. For k iterations do the following,
    a.) Flip a random coin and with probability = `p/(p+q)` choose a random element from reservoir R without replacement and add it to the final reservoir T.
    b.) And with probability `q/(p+q)` choose a random element from reservoir S without replacement and add it to the final reservoir T.
2. After `k` iterations we will get the final reservoir, `T` for the combined stream.

Sketch Criteria

The Datasketches website says that for a sketch to be acceptable, these conditions for the algorithm should be met:

  1. Streaming / One-Touch
  2. Small in Size
  3. Sublinear in Size Growth
  4. Mergeable sketch(A + B) ≈ sketch(A) U sketch(B),
  5. Mergeable With Different Size Parameters
  6. Data Insensitive
  7. Mathematically Proven Error and Size Properties
  8. Meaningful and Usable Error Bounds.

The above algorithm definitely meets criteria 1, 2, 3, 4, and 6. I assume it meets criteria 7 and 8 - I have not done the math myself, but I assume the algorithm has been sufficiently vetted.

I am unsure about what criteria 5, "Mergeable With Different Size Parameters", would entail for the Reservoir Sketch.

Python code

import random

def reservoir_sampling(k):
    """
    k: int
        size of the reservoir.
    """
    reservoir = []
    # Assume read_data() function simulates online streaming of data
    # and returns the tuple (index, item)
    for i,item in read_input():
        if i < k:
            reservoir.append(item)
        else:
            random_index = random.randint(0, i+1)
            if random_index < k:
                reservoir[random_index] = item

    for item in reservoir:
        print(str(item) + "\n")

Questions -

@jmalkin
Copy link

jmalkin commented Mar 30, 2020

  • Java generics must be objects, not primitives. You could do a generic sketch with Long but not long, incurring some amount of object overhead with each time. But the real reason we have both is that I wrote it for longs first and then made it generic since that was a bit easier (serializing generic types took a bit more doing, and there are some assumptions needed for it to work).

  • For our purposes, merging B into A leaves sketch A in a modified state. By contrast, a union is its own object and unioning sketches into it leaves the original sketches untouched.

  • Mergeable with different size parameters means the sketches need not have the same value of k. The algorithm you listed won't work properly for that case; the Java code is the only formal documentation we have, unfortunately.

@leerho
Copy link

leerho commented Mar 30, 2020

A few more comments for clarity:

  1. Because of the object overhead, sketches using generics are slower and take more space. Because the generic sketch has no knowledge of the user specified generic type, the user must also supply methods for serializing and deserializing their specific generic type.
  • We define merge as: A.merge(B) // result is in A. But both A and B are owned by the user, and as Jon mentioned, A gets modified.
  • We define union as result = union(A, B) or it can be sequential as in union = new Union(); union(A); result = union(B); But in either case neither A nor B is modified. And ...
  1. Another justification for Union over merge is performance. If A is tiny and B is large, A.merge(B) requires iterating through all the items in B. Meanwhile, the union operation defined above can change the order of the merge, because it has its own private intermediate data structure. But this is not the only advantage. If you are unioning millions of sketches into a single Union, there are often other performance optimizations that are possible with a dedicated Union operator.

You will find that most of our sketches in the library have a separate Union class for the above reasons.

In most of the scientific literature, the term "mergability" is a general term that means that the fundamental data structure underlying the algorithm is capable of allowing two such data structures to be combined together, without any loss of accuracy. Our definition above is more precise because we care about the actual implementation in code, which the theoretical papers are almost never concerned about.

@anirudhacharya
Copy link
Author

Thanks for all the replies.

  • Java generics must be objects, not primitives. You could do a generic sketch with Long but not long, incurring some amount of object overhead with each time. But the real reason we have both is that I wrote it for longs first and then made it generic since that was a bit easier (serializing generic types took a bit more doing, and there are some assumptions needed for it to work).

@jmalkin But for cpp can we just define 1 implementation with templates, since cpp templates can take long?

@jmalkin
Copy link

jmalkin commented Mar 31, 2020

for cpp can we just define 1 implementation with templates, since cpp templates can take long?

Yes. In fact, c++ needs to be header-only -- we don't produce a library. Not only can it be a template, it must be a template :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment