Skip to content

Instantly share code, notes, and snippets.

@debasishg
Last active March 15, 2024 15:05
Star You must be signed in to star a gist
Save debasishg/8172796 to your computer and use it in GitHub Desktop.
A collection of links for streaming algorithms and data structures

General Background and Overview

  1. Probabilistic Data Structures for Web Analytics and Data Mining : A great overview of the space of probabilistic data structures and how they are used in approximation algorithm implementation.
  2. Models and Issues in Data Stream Systems
  3. Philippe Flajolet’s contribution to streaming algorithms : A presentation by Jérémie Lumbroso that visits some of the hostorical perspectives and how it all began with Flajolet
  4. Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
  5. Methods for Finding Frequent Items in Data Streams by Graham Cormode & Marios Hadjieleftheriou
  6. The space complexity of approximating the frequency moments by Noga Alon, Yossi Matias, Mario Szegedy : one of the most influential papers introducing succinctness in computing frequency moments
  7. Cuckoo Filter: Practically Better Than Bloom by Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher
  8. A Simple Algorithm for Finding Frequent Elements in Streams and Bags by Karp, Shenker and Papadimitriou : used in Spark to find frequent items
  9. Hyperloglog and MinHash : Implementation of a form of hyperloglog and adding capabilities of MinHash algorithm on to it which would enable to perform set intersections."While it does require extra processing power to deal with collecting all the minima, it’s possible to get satisfactory performance out of the structure for a relatively low storage or memory footprint"

Q-digest

  1. Medians and Beyond: New Aggregation Techniques for Sensor Networks : The paper that introduced q-digest for range queries and quantile approximation
  2. Blog post on q-digest
  3. Blog post on approximate quantiles
  4. The Art of Approximating Distributions : Histograms and Quantiles at Scale - an alternative approach to q-digest
  5. t-digest : A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. Ted Dunning's variant of Q-digest that does some improvements

Implementations

  1. stream-lib : A collection of Stream summarization and cardinality estimation algorithms like CM Sketch, Hyperloglog, Bloom Filters
  2. Algebird from Twitter
  3. streamDM - Data Mining for Spark Streaming
  4. Sketching library from Yahoo

Count-Min Sketch

  1. An Improved Data Stream Summary: The Count-Min Sketch and its Applications - Cormode & Muthukrishnan : The paper that introduced count min sketch
  2. Collection of information on Count Min Sketch
  3. Count Min Sketch by Cormode : Introductory paper
  4. Streaming Algorithms and Sketches - Count Min Sketch on AK Tech Blog
  5. Muthukrishnan talking on Count Min Sketch at AK Tech conference
  6. Sketch Techniques for Approximate Query Processing by Cormode
  7. Sketching data structures - a good overview of Bloom Filters and Count Min Sketch
  8. Sketching can improve linear regression and the talk by David
  9. A Framework for Clustering Massive-Domain Data Streams by Charu Aggarwal
  10. Streaming Anomaly Detection Using Randomized Matrix Sketching by Huang & Kasiviswanathan
  11. Time adaptive sketches for summarizing data streams by Anshumali Shrivastava et. al from Microsoft Research

Surveys

  1. References for Data Stream Algorithms by Graham Cormode : an exhaustive set of references with explanations
  2. Data Streams - Algorithms and Applications by S. Muthukrishnan : This is an excellent monograph with surveys of all algorithms related to data streams. Also a free copy of the book is available from Muthu's web site at http://www.cs.rutgers.edu/~muthu/
  3. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches by Graham Cormode1, Minos Garofalakis, Peter J. Haas and Chris Jermaine . Describes basic principles and recent developments in approximate query processing. It focuses on four key synopses: random samples, histograms, wavelets, and sketches. It considers issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. It also discusses the trade-offs between the different synopsis types.
  4. Graph Stream Algorithms - A Survey by Andrew McGregor. A survey on algorithms for processing massive graphs in the data stream model.

Miscellaneous

  1. Distributed Streams Algorithms for Sliding Windows by Phillip B. Gibbons and Srikanta Tirthapura
  2. Frugal Streaming
  3. A Framework for Clustering Massive-Domain Data Streams by Charu C. Aggarwal
  4. A framework for clustering evolving data streams by Charu C. Aggarwal et. al.
  5. Unsupervised Feature Selection on Data Streams by Hao Huang

Presentations

  1. Spark Streaming Use Cases by Paco Nathan
  2. Tiny Batches in the wine, Shiny new bits in Spark Streaming by Paco Nathan
  3. Real time Data Analysis Patterns by Mikio Braun
  4. Streaming Big Data with Apache Spark, Kafka and Cassandra by Helena Edelson
  5. Streaming Data Analysis and Online Learning by John Myles White
  6. Algebra for Analytics by Oscar Boykin @posco
  7. Sketching for Data Streams - by Michael Kapralov at the Simons Institute - Part 1 Part 2 Part 3 Part 4
  8. Sketching High Dimensional Data - by Jelani Nelson at Simons Institute - Day 1 | Day 2 | Day 3 | Day 4

Courses

  1. Alex Smola course at Berkeley SML: Data Streams
  2. Piotr Indyk course at MIT Sketching, Streaming and Sub-linear Space Algorithms
  3. Andrew McGregor course at UMass on Advanced Algorithms
  4. Amit Chakrabarti course at Dartmouth on Data Stream Algorithms and the entire course notes in a single document
  5. Moses Charikar course CS369G: Algorithmic Techniques for Big Data at Stanford Spring Quarter 2016
  6. Piotr Indyk (MIT) and Nelson (Harvard) course on Sketching Algorithms for Big Data - Fall 2017
  7. Nelson's course on Algorithms for Big Data at Harvard in Fall 2015. This includes videos as well.

Incremental Learning with Decision Trees for Streamed Data

  1. Mining High-Speed Data Streams (Hoeffding Trees) by Pedro Domingos and Geoff Hulten
  2. Mining Time Changing Data Streams by G. Hulten, L. Spencer, and P. Domingos.
  3. Comprehensive study on techniques of Incremental learning with decision trees for streamed data by Prerana Gupta, Amit Thakkar, Amit Ganatra
  4. Use of Hoeffding trees in concept based data stream mining by Hoeglinger, S. and Pears, R.

Clustering Data Streams

  1. Clustering Data Streams: Theory and Practice by Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani and Liadan O’Callaghan
  2. Online clustering of data streams by J. Beringer, E. Hüllermeier
  3. Conquering the divide: Continuous clustering of distributed data streams (2007) by Graham Cormode
  4. Clustering on Streams by Suresh Venkat

Books

  1. Andrew McGregor is writing a book on sketching and data streaming algorithms, parts of the draft is available here
@andypetrella
Copy link

What about having a render-friendlier format like md?
This gist is a pearl, but horizontal scrolling is a bit sad ;-)

@debasishg
Copy link
Author

done ..

@stefan-pdx
Copy link

Thank you! Found via DataTau.

@blinsay
Copy link

blinsay commented Jun 13, 2014

AggregateKnowledge has a good set of serialization-compatible Postgres/JS/Java HLL implementations:

https://github.com/aggregateknowledge/postgresql-hll
https://github.com/aggregateknowledge/java-hll
https://github.com/aggregateknowledge/js-hll

@vhazrati
Copy link

Thanks Debashish this is valuable!

@tdunning
Copy link

Great summary.

@finlay-liu
Copy link

Good

@racranjan
Copy link

Thanks !

@ronaldsuwandi
Copy link

Thanks! This is great 👍

@jrjtLite
Copy link

great stuff!

@bistaumanga
Copy link

Thanks (y) for this awesome stuff

@keshavbashyal
Copy link

Thank you.. It is really valuable for the community..

@ajgappmark
Copy link

Thanks a great resource indeed

@pbarker
Copy link

pbarker commented Feb 20, 2016

fantastic

@Adewole-zz
Copy link

Interesting post.

@leerho
Copy link

leerho commented May 25, 2016

You might be interested in DataSketches. A new production quality library of unique counting, quantiles, and frequent items sketches.

@giannifiore
Copy link

Exactly what I was looking for! Thanks!

@OElesin
Copy link

OElesin commented Dec 9, 2016

Great Stuff, just found this!!

@visenger
Copy link

visenger commented May 3, 2017

Great! found via @ds_ldn

@ChamodDamitha
Copy link

Great

@tamyiuchau
Copy link

Link to Blog Post on q-digest seems broken. A quick google point me to https://papercruncher.wordpress.com/2011/07/31/q-digest/
Please update it. Thanks.

@guozheng
Copy link

really nice list, can you consider adding it to the Awesome project so that more people can benefit ;-)

https://github.com/sindresorhus/awesome

@ankushs92
Copy link

Outstanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment