top-k-threshold.md

This relates to work I'm doing on a system for distributed outlier detection, as described here.

In this system, the count-min sketch is used as a distributed filter. Over some interval, the CMS is populated on all the edge nodes, sent to a server for merging, and the merged CMS is broadcast back out to the edges. On the following interval, each new key is run through the "global" CMS, and if it's over some threshold, it is added to the set of potential outliers, which are precisely counted.
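Because count-min sketch counters are element-wise additive, the server-side merge is just a matrix sum, and a key's global estimate is the usual minimum across rows. A minimal Python sketch of the interval protocol above (the class, the blake2b-based hashing, and the threshold are all illustrative, not the production code):

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch; width/depth and hashing are illustrative."""
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One salted hash per row; blake2b's salt gives independent rows
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Never underestimates; the minimum across rows is the tightest bound
        return min(self.table[row][col] for row, col in self._buckets(key))

    def merge(self, other):
        # Counters are element-wise additive, so merging the edge sketches
        # on the server is a matrix sum (same width/depth required)
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]

# Each edge node populates its own sketch over the interval...
edge_a, edge_b = CountMinSketch(), CountMinSketch()
for _ in range(60):
    edge_a.add("hot-key")
for _ in range(50):
    edge_b.add("hot-key")
edge_b.add("rare-key")

# ...the server merges them, and the merged "global" sketch is broadcast back
merged = CountMinSketch()
merged.merge(edge_a)
merged.merge(edge_b)

# On the next interval, keys over the threshold become outlier candidates
THRESHOLD = 100
candidates = {k for k in ("hot-key", "rare-key")
              if merged.estimate(k) >= THRESHOLD}
```

Only the candidate set then needs exact counting, which is what keeps the filter cheap.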

This is, as described so far, an adaptation of this 2003 paper from single-node to multi-node. However, that paper assumes that the threshold is known ahead of time.

In most cases, though, what constitutes an "outlier" is a moving target. We'd like to adapt as the distribution changes. Previously, I was using a consistent uniform sample of the keys to identify the threshold, but we'd have to take a very

anonymous / playground.rs
Created Mar 29, 2016
Shared via Rust Playground
use std::sync::{Arc, Mutex};

struct Str<'a>(&'a str);

trait Set {
    fn set(&mut self, t: Str<'_>);
}

// `Set` is used as a trait object here, which requires the `dyn` keyword
pub struct Port(Arc<Mutex<dyn Set>>);
anonymous / playground.rs
Created Mar 29, 2016
Shared via Rust Playground
use std::sync::{Arc, Mutex};

struct Str<'a>(&'a str);

trait Get<T> {
    fn get(&mut self) -> T;
}

trait Set {
    fn set(&mut self, t: Str<'_>);
}
elasticsearchcreator.py
import time

from elasticsearch import Elasticsearch

# Important: by default the client tries localhost:9200, which may not be reachable here
es = Elasticsearch(hosts=({"host": "127.0.0.1", "port": 9200},))

# Wait for Elasticsearch to come up before issuing any requests
while True:
    try:
        if es.ping():
            break
    except Exception:
        pass
    time.sleep(1)
oza / SparkOnYARN.md
Last active Sep 22, 2021
How to run Spark on YARN with dynamic resource allocation

YARN

  1. A general resource-management layer that runs on top of HDFS
  2. A part of Hadoop

Spark

  1. An in-memory processing framework

Spark on YARN
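For the dynamic resource allocation the title refers to, a submission looks roughly like this. The `spark.dynamicAllocation.*` and `spark.shuffle.service.enabled` keys are Spark's standard configuration properties; the executor bounds and jar name are placeholders, and dynamic allocation on YARN also expects the external shuffle service to be enabled on the NodeManagers:

```shell
# Sketch of a spark-submit invocation with dynamic allocation on YARN;
# executor bounds and the application jar are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  your-app.jar
```

With this, YARN grows and shrinks the executor count with the workload instead of holding a fixed allocation for the whole job.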