@pbailis
Last active April 15, 2018 08:54
Quick and dirty (incomplete) list of interesting, mostly recent data warehousing/"big data" papers

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete but maybe of interest to someone. While many are obvious choices (I've omitted several, like MapReduce), I think there are a few underappreciated gems.

### Dataflow Engines:

Dryad--general-purpose distributed parallel dataflow engine
http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf

Spark--in memory dataflow
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Streaming and Matviews

Spark Streaming--building streaming on top of a distributed data flow engine
http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf

Nectar--reusing previously computed results in dataflows (HT @squarecog)
http://static.usenix.org/events/osdi10/tech/full_papers/Gunda.pdf

Differential dataflow: fresh take on incremental computation and dataflow
http://research.microsoft.com/pubs/176693/differentialdataflow.pdf

DBToaster: fast, modern materialized view maintenance
http://vldb.org/pvldb/vol5/p968_yanifahmad_vldb2012.pdf

TelegraphCQ: good example of (old) stream processing systems--useful to contrast to, say, Storm
http://sites.google.com/site/sailesh/TCQcidr03.pdf

Borealis: research distributed stream processing system from the 2000s (HT @marcua)
http://www.cs.harvard.edu/~mdw/course/cs260r/papers/borealis-cidr05.pdf

### Full-stack "Database System" Category

Mostly OLAP

C-Store: columnar storage, now Vertica
http://people.csail.mit.edu/tdanford/6830papers/stonebraker-cstore.pdf

Column stores vs row stores
http://www.courses.fas.harvard.edu/~cs265/papers/abadi-2008.pdf

Google Dremel--columnar storage for fast queries on disk (c.f. Impala)
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf

Google PowerDrill--columnar storage and some optimizations for fast in-memory queries (HT @squarecog)
http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf

Mostly Non-OLAP

Google Spanner--strongly consistent global database
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf

Motivation behind VoltDB: lots of overhead in systems besides "useful work"; one of my favorite papers from recent years:
http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf

Google File System--de facto large scale distributed FS architecture
http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf

Languages/programming interfaces:

DryadLINQ--program collections, not dataflows directly
http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf

FlumeJava (similar to DryadLINQ, but from GOOG)
http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf

Google Tenzing--SQL on MR
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/37200.pdf

Shark: Building Hive on Spark
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf

Brief Data Warehousing details (older stuff)

DynaMat: Useful for thinking about when to prematerialize views:
http://idke.ruc.edu.cn/seminars/phd/2008/01.08/DynaMat-%20A%20Dynamic%20View%20Management%20System%20for%20Data%20Warehouses.pdf

Jeff Ullman (among other awesome things, the co-author of the dragon compiler book) and friends give a neat and powerful greedy algorithm for efficient data cubing:
http://www.cs.aau.dk/~simas/dat5_08/papers/P205.pdf
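
To make the greedy idea concrete, here is a minimal Python sketch of a greedy view-selection loop for a data cube. It rests on assumptions that are mine, not the list's: each view is identified by the set of dimensions it groups by, a view can answer any view over a subset of its dimensions, the cost of answering a view is the row count of the smallest materialized view that covers it, and the fully grouped "top" view is always materialized. The names `benefit`, `greedy_select`, and `SIZES`, and the row counts themselves, are illustrative only.

```python
def benefit(view, sizes, cost):
    """Rows saved, summed over every view that `view` could answer more cheaply."""
    return sum(max(cost[w] - sizes[view], 0) for w in sizes if w <= view)


def greedy_select(sizes, k):
    """Greedily pick up to k views to materialize in addition to the top view.

    `sizes` maps frozenset-of-dimensions -> row count and is assumed to
    contain the full lattice, so the largest key is the top view.
    """
    top = max(sizes, key=len)
    selected = [top]
    # cost[w]: size of the smallest selected view that can answer w.
    cost = {w: sizes[top] for w in sizes}
    for _ in range(k):
        candidates = [v for v in sizes if v not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda v: benefit(v, sizes, cost))
        if benefit(best, sizes, cost) == 0:
            break  # nothing left that saves any work
        selected.append(best)
        for w in sizes:
            if w <= best:  # frozenset subset test: best can answer w
                cost[w] = min(cost[w], sizes[best])
    return selected


# Hypothetical row counts for a cube over part (p), supplier (s), customer (c).
SIZES = {
    frozenset("psc"): 6_000_000,
    frozenset("pc"): 6_000_000,
    frozenset("ps"): 800_000,
    frozenset("sc"): 6_000_000,
    frozenset("p"): 200_000,
    frozenset("s"): 10_000,
    frozenset("c"): 100_000,
    frozenset(): 1,
}

picked = greedy_select(SIZES, 2)
```

With these made-up sizes, the first pick is the (part, supplier) view, because it is much smaller than the top view and covers four views at once. The second pick is the customer view, which is the only remaining small view that the first pick cannot answer.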

Scheduling

Mesos--scheduling for DCs; some ideas adapted in YARN
http://www.mesosproject.org/papers/nsdi_mesos.pdf

Dominant Resource Fairness: multi-resource scheduling
http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf

### Slides and notes

http://rxin.github.com/db-readings/
http://www.cs.berkeley.edu/~istoica/classes/cs294/11/
http://www.courses.fas.harvard.edu/~cs265/syllabus.html
http://www.courses.fas.harvard.edu/~cs265/notes/

@xiejuncs

xiejuncs commented Mar 2, 2013

Awesome, nice job. Thanks very much.

@raymondtay

thank you for sharing that trove of treasures :) very much appreciated

@samklr

samklr commented Mar 4, 2013

Nice. Thanks.
Just one thing though: you wrote "Google Dremel--columnar storage for fast queries on disk (c.f. Impala)", but Impala has nothing to do with Dremel, nor does Dremel have anything to do with an MPP database. Impala is closer to what Teradata or Netezza do. The closest thing to Dremel is Apache Drill.

@vishal0soni

Hey,
Really nice work.
I just found another good repository of Big Data updates.
Mostly Microsoft links, but it covers a good variety of information like downloads, events, and good articles.

@vkushwaha

thanks for compiling and sharing.

@wudcwctw

wudcwctw commented Dec 6, 2013

Thank you for sharing!
