
Last active Apr 15, 2018
Quick and dirty (incomplete) list of interesting, mostly recent data warehousing/"big data" papers

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete but maybe of interest to someone. While many are obvious choices (I've omitted several, like MapReduce), I think there are a few underappreciated gems.

### Dataflow Engines

Dryad--general-purpose distributed parallel dataflow engine

Spark--in memory dataflow

### Streaming and Matviews

Spark Streaming--building streaming on top of a distributed data flow engine

Nectar--reusing previously computed results in dataflows (HT @squarecog)

Differential dataflow: fresh take on incremental computation and dataflow

DBToaster: fast, modern materialized view maintenance

TelegraphCQ: good example of (old) stream processing systems--useful to contrast to, say, Storm

Borealis: research distributed stream processing system from the 2000s (HT @marcua)

### Full-stack "Database System" Category

#### Mostly OLAP

C-Store: columnar storage, now Vertica

Column stores vs row stores

Google Dremel--columnar storage for fast queries on disk (cf. Impala)

Google PowerDrill--columnar storage and some optimizations for fast in-memory queries (HT @squarecog)

#### Mostly Non-OLAP

Google Spanner--strongly consistent global database

Motivation behind VoltDB: lots of overhead in systems besides "useful work"; one of my favorite papers from recent years

Google File System--de facto large scale distributed FS architecture

### Languages/programming interfaces

DryadLINQ--program collections, not dataflows directly

FlumeJava--similar to DryadLINQ, but from Google

Google Tenzing--SQL on MR

Shark: Building Hive on Spark

### Brief Data Warehousing details (older stuff)

DynaMat: useful for thinking about when to prematerialize views

Jeff Ullman (among other awesome things, the co-author of the dragon compiler book) and friends give a neat and powerful greedy algorithm for efficient data cubing
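To make the greedy idea concrete, here is a minimal sketch of benefit-based view selection over a cube lattice. It is an illustration under assumptions, not a transcription of the paper: the function name `greedy_select`, the three dimensions (p, s, c), and all row counts are invented for the example. The cost of answering a view is taken to be the size of its smallest materialized ancestor, and each round materializes whichever view reduces total cost the most.

```python
def greedy_select(sizes, top, k):
    """Pick up to k views to materialize in addition to the base view.

    sizes: dict mapping a view (frozenset of group-by attributes) to its
           row count. Hypothetical numbers in the example below.
    top:   the base view, assumed to be always materialized.
    A view w can be answered from a view v when w is a subset of v.
    """
    selected = [top]
    # Current cost of answering each view: initially everything is
    # computed from the base view.
    cost = {w: sizes[top] for w in sizes}

    for _ in range(k):
        best, best_benefit = None, 0
        for v in sizes:
            if v in selected:
                continue
            # Benefit of v: total cost saved over every view that v can answer.
            benefit = sum(max(0, cost[w] - sizes[v]) for w in sizes if w <= v)
            if benefit > best_benefit:
                best, best_benefit = v, benefit
        if best is None:  # no remaining view saves anything
            break
        selected.append(best)
        for w in sizes:
            if w <= best:
                cost[w] = min(cost[w], sizes[best])
    return selected


# Made-up sizes for a cube over part (p), supplier (s), customer (c).
example_sizes = {
    frozenset("psc"): 6000,
    frozenset("pc"): 6000,
    frozenset("ps"): 800,
    frozenset("sc"): 6000,
    frozenset("p"): 200,
    frozenset("s"): 10,
    frozenset("c"): 100,
    frozenset(): 1,
}
chosen = greedy_select(example_sizes, frozenset("psc"), k=2)
print([sorted(v) for v in chosen])
```

With these numbers the first pick is the (p, s) view, since it is much smaller than the base view and can answer four views; views as large as the base view are never worth materializing.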


Mesos--scheduling for DCs; some ideas adapted in YARN

Dominant Resource Fairness: multi-resource scheduling
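As a rough illustration of multi-resource scheduling in the dominant-resource style, the sketch below repeatedly offers the next task to the user whose dominant share (their largest fractional use of any single resource) is currently lowest. The function name `drf_allocate`, the tie-breaking by user name, and the choice to stop as soon as the chosen user's task no longer fits are simplifying assumptions for this example, not details taken from a real scheduler.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Return how many tasks each user gets to launch.

    capacity: dict resource -> total amount in the cluster.
    demands:  dict user -> dict resource -> amount needed per task.
    """
    used = {r: Fraction(0) for r in capacity}
    alloc = {u: {r: Fraction(0) for r in capacity} for u in demands}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        # Largest fraction of any one resource held by this user.
        return max(alloc[user][r] / capacity[r] for r in capacity)

    while True:
        # User with the lowest dominant share goes next (ties: by name).
        user = min(sorted(demands), key=dominant_share)
        need = demands[user]
        if not all(used[r] + need[r] <= capacity[r] for r in capacity):
            break  # simplification: treat the cluster as full and stop
        for r in capacity:
            used[r] += need[r]
            alloc[user][r] += need[r]
        tasks[user] += 1
    return tasks


# Example: 9 CPUs and 18 GB of memory shared by two users.
result = drf_allocate(
    {"cpu": 9, "mem": 18},
    {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
)
print(result)  # {'A': 3, 'B': 2}
```

In this example user A is memory-bound and user B is CPU-bound, and both end up with a dominant share of 2/3.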

### Slides and notes

wudcwctw commented Dec 6, 2013

Thank you for sharing!
