I want to spend a week (during Hacker School alumni reunion week) getting a better understanding of performance (probably of things in the Hadoop ecosystem) on a few different dataset sizes (8GB, 100GB, 1TB). I have $1000 of AWS credit that I can spend on this (yay!).
Some things I want:
- Get a much better grasp of the performance of in-memory operations (load 8GB of data into memory and be done) vs. running a distributed map/reduce job.
- Understand what goes into the performance (how much time is spent copying data? Sending data over the network? Doing CPU work?)
- Learn something about the tradeoffs involved.
I'd love suggestions for experiments to run and setups to use. At work I've been using HDFS / Impala / Scalding, so my current thought is to compare in depth a map/reduce job in Scalding vs. an Impala query vs. a non-distributed in-memory job, because I already know those tools. But I'm open to other ideas!
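For the non-distributed end of the comparison, here's a minimal sketch of what I mean by an in-memory baseline: a single-process word count with wall-clock timing. Word count is just a stand-in workload, and `data.txt` is a hypothetical placeholder for one of the datasets.

```python
import time
from collections import Counter


def wordcount_in_memory(lines):
    """Count words across an iterable of lines, entirely in one process."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


if __name__ == "__main__":
    # "data.txt" is a placeholder for a real (e.g. 8GB) dataset.
    start = time.time()
    with open("data.txt") as f:
        counts = wordcount_in_memory(f)
    print("elapsed: %.1fs, distinct words: %d"
          % (time.time() - start, len(counts)))
```

Timing the same logical job as a Scalding map/reduce and as an Impala query would then give one point of comparison per system.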
Some questions I need to answer:
- Are there good large open datasets I could use? I'd like to use real data because it's more fun.
- If you were going to try to make reproducible experiments, where would you start?
- How can I set up an environment without spending an entire week on it?
- Should I use Elastic Map Reduce? How can I make installing everything as easy as possible?
- I have $1000 of AWS credit to spend. How much should I budget? What machines should I spend it on?
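On the budgeting question, a back-of-the-envelope calculator helps: node-hours = budget / (hourly price × cluster size). The instance types and prices below are assumptions (roughly 2013-era on-demand rates in us-east-1); check the current EC2 pricing page before trusting them.

```python
# Hourly on-demand prices in USD -- assumed values, verify against
# the current EC2 pricing page before budgeting for real.
HOURLY_PRICE = {
    "m1.large":   0.24,   # 7.5 GB RAM
    "m1.xlarge":  0.48,   # 15 GB RAM
    "m2.4xlarge": 1.64,   # 68.4 GB RAM; holds the 8GB dataset easily
}


def node_hours(budget, instance_type, cluster_size):
    """Hours a cluster of `cluster_size` nodes can run within `budget` dollars."""
    return budget / (HOURLY_PRICE[instance_type] * cluster_size)
```

For example, `node_hours(1000, "m1.large", 10)` comes out to roughly 416 hours for a 10-node cluster, which is far more than a week of experiments needs, so the budget looks comfortable.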
There are many to choose from! :) @vasia has already linked to some, so here are a few more:
http://www.infochimps.com/tags/bigdata
https://bitly.com/bundles/hmason/1
This is a good question and a big open area of research. IMO, I'd start with something along the lines of a VirtualBox image loaded with the datasets and source code, and preconfigured with all the necessary tools (Java, Eclipse, Hadoop, etc.).
You should definitely investigate Docker + Vagrant + Puppet. I don't know how to run those on AWS EC2, but there are certainly people out there using at least Docker. You'll also need some scripts to boot the instances and automate the experiments, so cloud-init, Boto, and Fabric may come in handy too.
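To make the "boot the instances" step concrete, here's a sketch using (old-style, v2) Boto. The AMI ID, key pair name, and security group are hypothetical placeholders, and real use needs AWS credentials configured in the environment.

```python
def launch_params(ami, count, instance_type="m1.large"):
    """Build the keyword arguments for EC2 run_instances.

    Kept as a pure function so it's easy to inspect and test
    without touching AWS.
    """
    return {
        "image_id": ami,
        "min_count": count,
        "max_count": count,
        "instance_type": instance_type,
        "key_name": "experiment-key",        # placeholder key pair
        "security_groups": ["experiments"],  # placeholder group
    }


def boot_cluster(region, ami, count):
    """Boot `count` instances of `ami` in `region` and return them."""
    # Deferred import so the sketch can be read without boto installed.
    import boto.ec2
    conn = boto.ec2.connect_to_region(region)
    params = launch_params(ami, count)
    reservation = conn.run_instances(params.pop("image_id"), **params)
    return reservation.instances
```

A Fabric script (or cloud-init user data) could then pick up the booted hosts to install Hadoop and kick off each experiment run.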