Skip to content

Instantly share code, notes, and snippets.

@johnynek
johnynek / gist:8961994
Last active Aug 29, 2015
Some Questions with Sketch Monoids
View gist:8961994

Unifying Sketch Monoids

As I discussed in Algebra for Analytics, many sketch monoids, such as Bloom filters, HyperLogLog, and Count-min sketch, can be described as a hashing (projection) of items into a sparse space, then using two different commutative monoids to read and write respectively. Finally, the read monoids always have the property that (a + b) <= a, b and the write monoids has the property that (a + b) >= a, b.

##Some questions:

  1. Note how similar CMS and Bloom filters are. The difference: bloom hashes k times onto the same space, CMS hashes k times onto a k orthogonal subspaces. Why the difference? Imagine a fixed space bloom that hashes onto k orthogonal spaces, or an overlapping CMS that hashes onto k * m length space. How do the error asymptotics change?
  2. CMS has many query modes (dot product, etc...) can those generalize to other sketchs (HLL, Bloom)?
  3. What other sketch or non-sketch algorithms can be expressed in this dual mo
@tlockney
tlockney / Vagrantfile
Last active Aug 29, 2015
This setup allows for quick hacking with an sbt console on an EC2 instance -- very useful for trying out the AWS APIs when you need to try things out. As an example, I wanted to make sure I understood how to get the various bits of meta-data that are visible only on EC2. Create the following files and run setup.sh to run everything.
View Vagrantfile
Vagrant.configure("2") do |config|
config.vm.box = "dummy"
config.vm.provider :aws do |aws, override|
aws.access_key_id = "..."
aws.secret_access_key = "..."
# you'll need to create the EC2 keypair used here -- I called it vagrant for easy tracking
aws.keypair_name = "vagrant"
# you'll want to use a group that has at least SSH open
@ceteri
ceteri / multitool.wordcount
Created May 14, 2012
Word Count app run in Multitool
View multitool.wordcount
12/05/14 14:42:57 INFO multitool.Main: key: source
12/05/14 14:42:58 INFO multitool.Main: key: expr
12/05/14 14:42:58 INFO multitool.Main: key: gen
12/05/14 14:42:58 INFO multitool.Main: key: group
12/05/14 14:42:58 INFO multitool.Main: key: count
12/05/14 14:42:58 INFO multitool.Main: key: group
12/05/14 14:42:58 INFO multitool.Main: key: sink
12/05/14 14:43:01 INFO util.HadoopUtil: resolving application jar from found main method on: multitool.Main
12/05/14 14:43:01 INFO planner.HadoopPlanner: using application jar: /Users/paco/src/concur/cascading.multitool/./build/multitool.jar
12/05/14 14:43:01 INFO property.AppProps: using app.id: 1C9D0188E5018B980067AAC12AE43BBA
@ceteri
ceteri / Main.java
Created Jun 11, 2012
Cascading for the Impatient, Part 1
View Main.java
public class
Main
{
public static void
main( String[] args )
{
String inPath = args[ 0 ];
String outPath = args[ 1 ];
Properties properties = new Properties();
@ceteri
ceteri / EMR log
Created Jun 18, 2012
Cascading Sample Recommender
View EMR log
bash-3.2$ elastic-mapreduce --create --name "Sample Recommender" \
> --jar s3n://temp.cascading.org/sample/recommender.jar \
> --arg s3n://temp.cascading.org/sample/en.stop \
> --arg s3n://temp.cascading.org/sample/tweets/ \
> --arg s3n://temp.cascading.org/sample/out/token \
> --arg s3n://temp.cascading.org/sample/out/similarity
Created job flow j-2HA2BVCBJGMVX
bash-3.2$ elastic-mapreduce --list
j-2HA2BVCBJGMVX STARTING Sample Recommender
PENDING Example Jar Step
@ceteri
ceteri / Main.java
Created Jun 29, 2012
Cascading for the Impatient, Part 2
View Main.java
public class
Main
{
public static void
main( String[] args )
{
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
@ceteri
ceteri / Main.java
Created Jun 30, 2012
Cascading for the Impatient, Part 3
View Main.java
public class
Main
{
public static void
main( String[] args )
{
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
@ceteri
ceteri / Main.java
Last active Oct 6, 2015
Cascading for the Impatient, Part 4
View Main.java
public class
Main
{
public static void
main( String[] args )
{
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
String stopPath = args[ 2 ];
@ceteri
ceteri / Main.java
Last active Oct 6, 2015
Cascading for the Impatient, Part 5
View Main.java
public class
Main
{
public static void
main( String[] args )
{
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
String stopPath = args[ 2 ];
String tfidfPath = args[ 3 ];
@ceteri
ceteri / Main.java
Created Jul 3, 2012
Cascading for the Impatient, Part 6
View Main.java
public class
Main
{
public static void
main( String[] args )
{
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
String stopPath = args[ 2 ];
String tfidfPath = args[ 3 ];