anirudhacharya/ReservoirSampling.md

## ReservoirSampling.md

      
    Raw
  

              ReservoirSampling.md
            
          
    Knuth's Algorithm R for Reservoir Sampling.
Problem Statement:

We wish to sample k elements out of a continuous stream of data, where we can process each data item only once. We cannot cache the streaming data, nor can we go back to look at previous data.
Solution -

1. Define a reservoir R of size k elements.
2. Keep adding items to this reservoir from the streaming data source until the reservoir is full.
3. For the i th element, x_i, of the input data where i >= k,
    a. Do a random sampling of integers from set (0,i). Set this randomly sampled integer to m.
    b. if 0 <= m <= k-1,
            set R[m] = x_i
       else
            Keep R unchanged.
4. Once all the input elements are received, the reservoir R is the desired sample set.

Merging sketches

I have not completely read the union logic for reservoir sketch in the Java library, but the code comments here indicates that 'merging' and 'union' are seen as two different concepts? Is that so? Can you please explain the difference.
From general reading on the internet, my idea of merging reservoir sketches is as follows -
Assume we have two different streams of data of length p and q. We generated reservoirs for both these streams R and S respectively. Can we merge the reservoirs, R and S, of these two streams to generate a reservoir T of k elements of the combined data stream.
Algorithm to merge the two sketches -
1. For k iterations do the following,
    a.) Flip a random coin and with probability = `p/(p+q)` choose a random element from reservoir R without replacement and add it to the final reservoir T.
    b.) And with probability `q/(p+q)` choose a random element from reservoir S without replacement and add it to the final reservoir T.
2. After `k` iterations we will get the final reservoir, `T` for the combined stream.

Sketch Criteria

The Datasketches website says that for a sketch to be acceptable, these conditions for the algorithm should be met:

Streaming / One-Touch
Small in Size
Sublinear in Size Growth
Mergeable
sketch(A + B) ≈ sketch(A) U sketch(B),
Mergeable With Different Size Parameters
Data Insensitive
Mathematically Proven Error and Size Properties
Meaningful and Usable Error Bounds.

The above algorithm definitely meets criteria 1, 2, 3, 4, and 6.
I assume it meets criteria 7 and 8 - I have not done the math myself, but I assume the algorithm has been sufficiently vetted.
I am unsure about what criteria 5, "Mergeable With Different Size Parameters", would entail for the Reservoir Sketch.
Python code

import random

def reservoir_sampling(k):
    """
    k: int
        size of the reservoir.
    """
    reservoir = []
    # Assume read_data() function simulates online streaming of data
    # and returns the tuple (index, item)
    for i,item in read_input():
        if i < k:
            reservoir.append(item)
        else:
            random_index = random.randint(0, i+1)
            if random_index < k:
                reservoir[random_index] = item

    for item in reservoir:
        print(str(item) + "\n")
Questions -


the Java library defines two types of reservoir sampling - one for long data types, and another for a generic <Item>. what is the difference between longs_sketches and items_sketches? why can't long also be under item?
The code comments in the Java library for ReservoirLongsUnion.java indicates that 'merging' and 'union' are seen as two different concepts? Is that so? Can you please explain the difference.
What does the criteria "Mergeable With Different Size Parameters" mean in the context of Reservoir Sampling sketch?