A collection of links for streaming algorithms and data structures
  1. General Background and Overview
  1. HyperLogLog and MinHash : An implementation of a form of HyperLogLog extended with the capabilities of the MinHash algorithm, which makes it possible to perform set intersections. "While it does require extra processing power to deal with collecting all the minima, it’s possible to get satisfactory performance out of the structure for a relatively low storage or memory footprint."

  2. Streaming/Sketching Conference from AK Tech : Contains links to videos and slides from speakers such as Muthukrishnan, who spoke about the Count-Min Sketch

  3. Q-digest

  1. t-digest : A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. Ted Dunning's variant of Q-digest, with some improvements

  2. Implementations

  1. Count-Min Sketch
  1. Surveys
  • References for Data Stream Algorithms by Graham Cormode : an exhaustive set of references with explanations
  • Data Streams - Algorithms and Applications by S. Muthukrishnan : An excellent monograph that surveys algorithms related to data streams. A free copy of the book is also available from Muthu's website at http://www.cs.rutgers.edu/~muthu/
  • Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches by Graham Cormode, Minos Garofalakis, Peter J. Haas and Chris Jermaine : Describes basic principles and recent developments in approximate query processing. It focuses on four key synopses: random samples, histograms, wavelets, and sketches. It considers issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. It also discusses the trade-offs between the different synopsis types.
  1. Distributed Streams Algorithms for Sliding Windows by Phillip B. Gibbons and Srikanta Tirthapura

  2. Frugal Streaming

  3. A Framework for Clustering Massive-Domain Data Streams by Charu C. Aggarwal

  4. A framework for clustering evolving data streams by Charu C. Aggarwal et al.

  5. Unsupervised Feature Selection on Data Streams by Hao Huang

  6. Presentations

  1. Courses
  1. Incremental Learning with Decision Trees for Streamed Data
  1. Clustering Data Streams
  1. Books
  • Andrew McGregor is writing a book on sketching and data streaming algorithms; parts of the draft are available here
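
The HyperLogLog and MinHash entry in the list above mentions "collecting all the minima" to support set intersections. As a rough illustration of that idea only, here is a minimal sketch of plain MinHash in Python. It is not the HyperLogLog-based structure from the linked post: the class name `MinHash`, the helper `estimate_intersection`, the choice of salted BLAKE2b as the hash family, and the default of 128 hash functions are all assumptions made for this example. The sketch keeps one minimum per hash function, estimates the Jaccard similarity as the fraction of matching minima, and turns that into an intersection size given a separately obtained union cardinality (which a HyperLogLog counter could supply).

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """64-bit hash of `item` under the hash function identified by `seed` (assumed hash family)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


class MinHash:
    """Hypothetical minimal MinHash sketch: one running minimum per hash function."""

    def __init__(self, num_hashes: int = 128) -> None:
        self.num_hashes = num_hashes
        # 2**64 is larger than any 64-bit hash value, so it acts as "no item seen yet".
        self.minima = [2**64] * num_hashes

    def add(self, item: str) -> None:
        # Collect the minima: keep the smallest hash seen so far for each hash function.
        for seed in range(self.num_hashes):
            h = _hash(seed, item)
            if h < self.minima[seed]:
                self.minima[seed] = h

    def jaccard(self, other: "MinHash") -> float:
        """Estimate |A ∩ B| / |A ∪ B| as the fraction of positions whose minima agree."""
        if self.num_hashes != other.num_hashes:
            raise ValueError("sketches must use the same number of hash functions")
        matches = sum(a == b for a, b in zip(self.minima, other.minima))
        return matches / self.num_hashes


def estimate_intersection(jaccard: float, union_size: float) -> float:
    """Intersection size from a Jaccard estimate and a union cardinality obtained elsewhere."""
    return jaccard * union_size


if __name__ == "__main__":
    a, b = MinHash(), MinHash()
    for i in range(400):
        a.add(f"user-{i}")
    for i in range(200, 600):
        b.add(f"user-{i}")
    j = a.jaccard(b)
    # The true union has 600 distinct items; in practice this would itself be an estimate.
    print("estimated Jaccard:", j)
    print("estimated intersection:", estimate_intersection(j, 600))
```

The estimate's error shrinks roughly with the square root of the number of hash functions, at the cost of more hashing per inserted item, which is the processing-versus-memory trade-off the quoted sentence alludes to.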

andypetrella Dec 30, 2013

What about having a render-friendlier format like md?
This gist is a pearl, but horizontal scrolling is a bit sad ;-)

debasishg commented Dec 31, 2013

done ..

slnovak Jun 12, 2014

Thank you! Found via DataTau.

blinsay commented Jun 13, 2014

AggregateKnowledge has a good set of serialization-compatible Postgres/JS/Java HLL implementations:

https://github.com/aggregateknowledge/postgresql-hll
https://github.com/aggregateknowledge/java-hll
https://github.com/aggregateknowledge/js-hll

vhazrati Jan 12, 2015

Thanks Debashish this is valuable!

tdunning Mar 26, 2015

Great summary.

finlay-liu commented Apr 4, 2015

Good

racranjan commented May 5, 2015

Thanks !

ronaldsuwandi Jun 15, 2015

Thanks! This is great 👍

jrjtLite Jun 25, 2015

great stuff!

bistaumanga Oct 22, 2015

Thanks (y) for this awesome stuff

SemanticBeeng Dec 9, 2015

If Apache Flink's "delta iterations", "off heap memory management" and "cost based optimization" qualify as "algorithms" then consider "Overview of Apache Flink: Next-Gen Big Data Analytics Framework": http://www.slideshare.net/sbaltagi/overview-of-apacheflinkbyslimbaltagi

keshavbashyal Dec 19, 2015

Thank you.. It is really valuable for the community..

ajgappmark Feb 1, 2016

Thanks a great resource indeed

pbarker Feb 20, 2016

fantastic

Adewole Apr 17, 2016

Interesting post.

leerho May 25, 2016

You might be interested in DataSketches. A new production quality library of unique counting, quantiles, and frequent items sketches.

fioreggianni Jul 11, 2016

Exactly what I was looking for! Thanks!

OElesin Dec 9, 2016

Great Stuff, just found this!!

visenger May 3, 2017

Great! found via @ds_ldn

ChamodDamitha commented Jul 5, 2017

Great

tamyiuchau Jul 12, 2017

The link to the blog post on q-digest seems broken. A quick Google search points me to https://papercruncher.wordpress.com/2011/07/31/q-digest/
Please update it. Thanks.

guozheng Sep 27, 2017

Really nice list. Could you consider adding it to the Awesome project so that more people can benefit? ;-)

https://github.com/sindresorhus/awesome
