Last active April 15, 2018 08:54
Quick and dirty (incomplete) list of interesting, mostly recent data warehousing/"big data" papers

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete but maybe of interest to someone. While many are obvious choices (I've omitted several, like MapReduce), I think there are a few underappreciated gems.

### Dataflow Engines:

Dryad--general-purpose distributed parallel dataflow engine

Spark--in memory dataflow

Streaming and Matviews

Spark Streaming--building streaming on top of a distributed data flow engine

Nectar--reusing previously computed results in dataflows (HT @squarecog)

Differential dataflow: fresh take on incremental computation and dataflow

DBToaster: fast, modern materialized view maintenance

TelegraphCQ: good example of (old) stream processing systems--useful to contrast to, say, Storm

Borealis: research distributed stream processing system from the 2000s (HT @marcua)

### Full-stack "Database System" Category

Mostly OLAP

C-Store: columnar storage, now Vertica

Column stores vs row stores

Google Dremel--columnar storage for fast queries on disk (cf. Impala)

Google PowerDrill--columnar storage and some optimizations for fast in-memory queries (HT @squarecog)

Mostly Non-OLAP

Google Spanner--strongly consistent global database

Motivation behind VoltDB: lots of overhead in systems besides "useful work"; one of my favorite papers from recent years:

Google File System--de facto large scale distributed FS architecture

Languages/programming interfaces:

DryadLINQ--program collections, not dataflows directly

FlumeJava (similar to DryadLINQ, but from GOOG)

Google Tenzing--SQL on MR

Shark: Building Hive on Spark

Brief Data Warehousing details (older stuff)

DynaMat: Useful for thinking about when to prematerialize views:

Jeff Ullman (among other awesome things, a co-author of the Dragon Book on compilers) and friends give a neat and powerful greedy algorithm for efficient data cubing:
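For intuition, here is a minimal sketch of that style of greedy view selection, not the paper's own code. It assumes a lattice of candidate views where each view has a size (used as its query cost), the top view is always materialized, and each round picks the view that most reduces the total cost of answering every view. The lattice, the row counts, and the names `COST`, `CHILDREN`, and `greedy_select` are all hypothetical, made up for illustration.

```python
from typing import Dict, List, Set

# Hypothetical lattice over two dimensions (product, store).
# "ps" is the top view; "none" is the fully aggregated view.
COST: Dict[str, int] = {"ps": 100, "p": 40, "s": 30, "none": 1}
CHILDREN: Dict[str, List[str]] = {"ps": ["p", "s"], "p": ["none"], "s": ["none"]}


def answerable_from(view: str, children: Dict[str, List[str]]) -> Set[str]:
    """Return the view itself plus every view that can be computed from it."""
    seen = {view}
    stack = [view]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def greedy_select(cost: Dict[str, int], children: Dict[str, List[str]],
                  top: str, k: int) -> List[str]:
    """Pick up to k views (besides the top view) by largest benefit per round."""
    reach = {v: answerable_from(v, children) for v in cost}
    # cheapest[w] = cost of the cheapest materialized view that can answer w.
    # Only the top view is materialized at the start.
    cheapest = {w: cost[top] for w in cost}
    chosen: List[str] = []
    for _ in range(k):
        best_view, best_benefit = None, 0
        for v in sorted(cost):  # sorted for a deterministic tie-break
            if v == top or v in chosen:
                continue
            # Benefit: total cost saved across all views answerable from v.
            benefit = sum(max(0, cheapest[w] - cost[v]) for w in reach[v])
            if benefit > best_benefit:
                best_view, best_benefit = v, benefit
        if best_view is None:  # nothing left that helps
            break
        chosen.append(best_view)
        for w in reach[best_view]:
            cheapest[w] = min(cheapest[w], cost[best_view])
    return chosen


picked = greedy_select(COST, CHILDREN, top="ps", k=2)
print(picked)  # ['s', 'p']
```

On this toy lattice, `s` wins the first round because it saves 70 rows for each of two views (`s` and `none`). `p` wins the second round because it is the only remaining view that makes `p` queries cheaper.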


Mesos--scheduling for DCs; some ideas adapted in YARN

Dominant Resource Fairness: multi-resource scheduling

### Slides and notes

xiejuncs commented Mar 2, 2013

Awesome, nice job. Thanks very much.

thank you for sharing that trove of treasures :) very much appreciated

samklr commented Mar 4, 2013

Nice. Thanks.
Just one thing, though: you wrote "Google Dremel--columnar storage for fast queries on disk (c.f. Impala)", but Impala has nothing to do with Dremel, nor does Dremel have anything to do with an MPP database. Impala is closer to what Teradata or Netezza do. The closest thing to Dremel is Apache Drill.

Really nice work.
I just found another good repository of Big Data updates.
Mostly Microsoft links, but it covers a good variety of information such as downloads, events, and good articles.

thanks for compiling and sharing.

wudcwctw commented Dec 6, 2013

Thank you for sharing!
