@pbailis
Last active April 15, 2018 08:54
Quick and dirty (incomplete) list of interesting, mostly recent data warehousing/"big data" papers

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete but maybe of interest to someone. While many are obvious choices (I've omitted several, like MapReduce), I think there are a few underappreciated gems.

###Dataflow Engines:

Dryad--general-purpose distributed parallel dataflow engine
http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf

Spark--in-memory dataflow
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Streaming and Materialized Views

Spark Streaming--building streaming on top of a distributed data flow engine
http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf

Nectar--reusing previously computed results in dataflows (HT @squarecog)
http://static.usenix.org/events/osdi10/tech/full_papers/Gunda.pdf

Differential dataflow: fresh take on incremental computation and dataflow
http://research.microsoft.com/pubs/176693/differentialdataflow.pdf

DBToaster: fast, modern materialized view maintenance
http://vldb.org/pvldb/vol5/p968_yanifahmad_vldb2012.pdf

TelegraphCQ: good example of (old) stream processing systems--useful to contrast with, say, Storm
http://sites.google.com/site/sailesh/TCQcidr03.pdf

Borealis: research distributed stream processing system from the 2000s (HT @marcua)
http://www.cs.harvard.edu/~mdw/course/cs260r/papers/borealis-cidr05.pdf

###Full-stack "Database System" Category

Mostly OLAP

C-Store: columnar storage, now Vertica
http://people.csail.mit.edu/tdanford/6830papers/stonebraker-cstore.pdf

Column stores vs row stores
http://www.courses.fas.harvard.edu/~cs265/papers/abadi-2008.pdf

Google Dremel--columnar storage for fast queries on disk (cf. Impala)
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf

Google PowerDrill--columnar storage and some optimizations for fast in-memory queries (HT @squarecog)
http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf

Mostly Non-OLAP

Google Spanner--strongly consistent global database
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf

Motivation behind VoltDB: lots of overhead in systems besides "useful work"; one of my favorite papers from recent years:
http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf

Google File System--de facto large scale distributed FS architecture
http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf

Languages/programming interfaces:

DryadLINQ--program collections, not dataflows directly
http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf

FlumeJava (similar to DryadLINQ, but from GOOG)
http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf

Google Tenzing--SQL on MR
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/37200.pdf

Shark: Building Hive on Spark
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf

Brief Data Warehousing details (older stuff)

DynaMat: Useful for thinking about when to prematerialize views:
http://idke.ruc.edu.cn/seminars/phd/2008/01.08/DynaMat-%20A%20Dynamic%20View%20Management%20System%20for%20Data%20Warehouses.pdf

Jeff Ullman (among other awesome things, the co-author of the dragon compiler book) and friends give a neat and powerful greedy algorithm for efficient data cubing:
http://www.cs.aau.dk/~simas/dat5_08/papers/P205.pdf

Scheduling

Mesos--scheduling for DCs; some ideas adapted in YARN
http://www.mesosproject.org/papers/nsdi_mesos.pdf

Dominant Resource Fairness: multi-resource scheduling
http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf

###Slides and notes

http://rxin.github.com/db-readings/
http://www.cs.berkeley.edu/~istoica/classes/cs294/11/
http://www.courses.fas.harvard.edu/~cs265/syllabus.html
http://www.courses.fas.harvard.edu/~cs265/notes/

@vishal0soni

Hey,
Really nice work.
I just got another good repository about Big Data updates.
Mostly Microsoft links, but it covers a good variety of information, such as downloads, events, and good articles.

@vkushwaha

Thanks for compiling and sharing.

@wudcwctw

wudcwctw commented Dec 6, 2013

Thank you for sharing!
