## list.md

      
A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, and I figured I'd share the list. It is biased and rather incomplete, but it may be of interest to someone. While many of these are obvious choices, I think there are a few underappreciated gems.
### Dataflow/Stream Processing Engines
Dryad--general-purpose distributed parallel dataflow engine

http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
Google Dremel--columnar storage for fast queries (cf. Impala)

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf
Spark--in-memory dataflow

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Spark Streaming--building streaming on top of a distributed dataflow engine

http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf
DBToaster: fast, modern materialized view maintenance

http://vldb.org/pvldb/vol5/p968_yanifahmad_vldb2012.pdf
TelegraphCQ--a good example of an (older) stream processing system; useful to contrast with, say, Storm.

http://sites.google.com/site/sailesh/TCQcidr03.pdf
### General "Database System" Category
C-Store: columnar storage, now Vertica

http://people.csail.mit.edu/tdanford/6830papers/stonebraker-cstore.pdf
Column stores vs row stores

http://www.courses.fas.harvard.edu/~cs265/papers/abadi-2008.pdf
Google Spanner--strongly consistent global database

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf
Motivation behind VoltDB: traditional OLTP systems spend a lot of their time on overhead rather than "useful work"; one of my favorite recent papers:

http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf
Google File System--de facto large scale distributed FS architecture

http://www.cs.rochester.edu/meetings/sosp2003/papers/p125-ghemawat.pdf
### Languages/Programming Interfaces

DryadLINQ--program collections, not dataflows directly

http://research.microsoft.com/en-us/projects/dryadlinq/dryadlinq.pdf
FlumeJava (DryadLINQ clone from Google)

http://faculty.neu.edu.cn/cc/zhangyf/cloud-bigdata/papers/big%20data%20programming/FlumeJava-pldi-2010.pdf
Google Tenzing--SQL on MapReduce

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/37200.pdf
Shark: Building Hive on Spark

http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf
### Brief Data Warehousing Details (older stuff)

DynaMat--useful for thinking about when to prematerialize views:

http://idke.ruc.edu.cn/seminars/phd/2008/01.08/DynaMat-%20A%20Dynamic%20View%20Management%20System%20for%20Data%20Warehouses.pdf
Cute and powerful greedy algorithm for cubing:

http://www.cs.aau.dk/~simas/dat5_08/papers/P205.pdf
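
For a feel of what a greedy algorithm for cubing can look like, here is a minimal sketch of greedy view selection over a cube lattice: repeatedly materialize the view that most reduces the total cost of answering every view from its cheapest materialized ancestor. The function name `greedy_select`, the dimension letters, and the row counts are all made up for illustration; this is a sketch under those assumptions, not the paper's exact formulation.

```python
def greedy_select(sizes, top, k):
    """Pick up to k views to materialize in addition to the top view.

    sizes maps each view (a frozenset of dimensions) to its row count.
    A view w can be answered from v when w's dimensions are a subset of v's.
    """
    selected = [top]
    # Cost of answering a view = size of its cheapest materialized ancestor.
    cost = {w: sizes[top] for w in sizes}

    def benefit(v):
        # Total saving over all views that v could answer more cheaply.
        return sum(max(cost[w] - sizes[v], 0) for w in sizes if w <= v)

    for _ in range(k):
        candidates = [v for v in sizes if v not in selected]
        if not candidates:
            break
        best = max(candidates, key=benefit)
        if benefit(best) == 0:
            break  # nothing left that would save any work
        selected.append(best)
        for w in sizes:
            if w <= best:
                cost[w] = min(cost[w], sizes[best])
    return selected


# Hypothetical cube over dimensions p, c, s with invented row counts.
sizes = {
    frozenset("pcs"): 100,
    frozenset("pc"): 100,
    frozenset("ps"): 20,
    frozenset("cs"): 80,
    frozenset("p"): 10,
    frozenset("c"): 15,
    frozenset("s"): 5,
    frozenset(""): 1,
}

chosen = greedy_select(sizes, frozenset("pcs"), k=2)
```

With these invented sizes, the first pick is `ps` (it cheapens four views by 80 rows each) and the second is `c`. Note that `pc` is never worth materializing because it is as large as the top view.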
### Scheduling

Mesos--scheduling for datacenters; some ideas adapted in YARN

http://www.mesosproject.org/papers/nsdi_mesos.pdf
Dominant Resource Fairness: multi-resource scheduling

http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf
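
To make the multi-resource scheduling idea concrete, here is a minimal sketch of a Dominant Resource Fairness allocation loop: always give the next task to the user whose dominant share (their largest fractional use of any single resource) is currently lowest, and stop once that user's next task no longer fits. The function name `drf_allocate` and the dictionary layout are assumptions for illustration, and the sketch assumes every user has a fixed, strictly positive per-task demand.

```python
from fractions import Fraction


def drf_allocate(capacity, demands):
    """Return how many tasks each user gets under a simple DRF loop.

    capacity: resource name -> total amount in the cluster
    demands:  user -> {resource name -> amount needed per task}
    """
    used = {r: 0 for r in capacity}
    tasks = {u: 0 for u in demands}
    dominant = {u: Fraction(0) for u in demands}

    while True:
        # Serve the user with the lowest dominant share next.
        user = min(demands, key=lambda u: dominant[u])
        need = demands[user]
        if any(used[r] + need[r] > capacity[r] for r in capacity):
            return tasks  # that user's next task does not fit: stop
        for r in capacity:
            used[r] += need[r]
        tasks[user] += 1
        # Dominant share = largest fraction of any one resource held.
        dominant[user] = max(
            Fraction(tasks[user] * need[r], capacity[r]) for r in capacity
        )


allocation = drf_allocate(
    {"cpu": 9, "mem": 18},
    {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
)
```

With 9 CPUs and 18 GB of memory, a memory-heavy user A (1 CPU, 4 GB per task) ends up with 3 tasks and a CPU-heavy user B (3 CPUs, 1 GB per task) with 2 tasks, so both hold a dominant share of 2/3.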
### Slides and Notes
http://rxin.github.com/db-readings/

http://www.cs.berkeley.edu/~istoica/classes/cs294/11/

http://www.courses.fas.harvard.edu/~cs265/syllabus.html

http://www.courses.fas.harvard.edu/~cs265/notes/