Skip to content

Instantly share code, notes, and snippets.

@dbyr
Last active July 15, 2020 08:39
Show Gist options
  • Save dbyr/7f114d85db462603bd810f241b32f559 to your computer and use it in GitHub Desktop.
Save dbyr/7f114d85db462603bd810f241b32f559 to your computer and use it in GitHub Desktop.
I came up with an algorithm that provably selects a value at random in a completely balanced and fair manner from an iterator (applicable to streams too) of indeterminate length. I tried searching (admittedly briefly) for such an algorithm but didn't find anything, so I thought I'd publish this gist in case it's never been done before (which I d…
use std::iter::Iterator;
use rand; // requires the rand crate
use rand::Rng;
use std::f64;
// I came up with this little algorithm because it seems
// (as far as I could google) there are no simple methods
// for selecting a single value randomly _and_ fairly
// from a variable-size stream/iterator in a single pass.
/// Method for collecting a single, _evenly_ random
/// value from an iterator/stream of unknown size in a
/// single pass. Only requires that the number of values
/// so far passed is known/tracked. It does so by selecting
/// the current item at random proportional to how many items
/// have been seen up to (and including) this item. For example,
/// the first three numbers have probabilities 1/1, 1/2, 1/3.
/// This obviously continues for indefinite number of items
/// in the collection.
///
/// The correctness of the algorithm can be easily proven
/// (assuming the random number generator sufficiently
/// generates numbers within an interval randomly):
///
/// Base case: For a single item, any number generated
/// will yield that item. For the second item, there is
/// 1/2 chance it will be selected, meaning the first item
/// also has 1/2 of being selected (P(~1/2) = 1/2). Therefore,
/// in the base case(s)
/// Recursive case: Assume all values up to this value have been
/// selected from fairly at random. Therefore, the probability
/// with which the currently selected point was selected is 1/(n-1).
/// Now select the current value with probability 1/n. This means
/// the previously selected value is selected with probability
/// 1/(n-1) * P(~current) = 1/(n-1) * (n-1)/n = (n-1)/n(n-1) = 1/n
/// = P(current). Therefore, since all previous values were selected
/// from evenly (with prob. 1/(n-1)), then all values remain selected
/// from evenly with the introduction of the nth value (that is, P(any)=1/n).
// The method described above written in Rust
fn random_value<'a, V, T>(vals: V) -> Option<&'a T>
where V: Iterator<Item=&'a T> {
let mut gen = rand::thread_rng();
let mut counted = 1f64;
let mut selected = None;
let mut select: f64;
let mut prob: f64;
for val in vals {
select = gen.gen_range(0f64, f64::MAX);
prob = f64::MAX / counted;
if prob >= select {
selected = Some(val);
}
counted += 1f64;
}
selected
}
fn main() {
let vs: Vec<u32> = (0..100).map(|x| x).collect();
let it = vs.iter();
println!("{}", random_value(it).unwrap());
}
#[cfg(test)]
mod test {
use super::random_value;
use std::collections::HashMap;
use std::fs::OpenOptions;
use std::io::Write;
const WITHIN_10000: i32 = 30;
// obviously this is going to fail some of the time,
// that's just how randomness is... but it's at least
// a good indicator if it works some/most of the time!
#[test]
fn freq_test_1000() {
let mut occurrences = HashMap::new();
let data: Vec<i32> = (0..100).map(|x| x).collect();
// get 100 values randomly
for _ in 0..10000 {
let res = random_value(data.iter()).unwrap(); // only None if list empty
if let Some(v) = occurrences.get_mut(&res) {
*v += 1;
} else {
occurrences.insert(res, 1i32);
}
}
// test each value to see they're within
// a reasonable number of occurences
for datum in data.iter() {
if let Some(v) = occurrences.get(datum) {
assert!(
(v - 100).abs() < WITHIN_10000,
format!("'{}' appeared {} times", datum, v)
);
} else {
assert!(false, format!("'{}' didn't appear at all", datum));
}
}
output_stats(&occurrences);
}
fn output_stats(stats: &HashMap<&i32, i32>) {
let f = OpenOptions::new()
.create_new(true)
.read(true)
.write(true)
.open("./data/distribution.csv");
match f {
Ok(mut file) => {
file.write("Number,Occurences\n".as_bytes());
for (k, v) in stats {
file.write(format!("{},{}\n", k, v).as_bytes());
}
file.flush();
},
Err(e) => println!("Could not open file: {}", e)
}
}
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment