I want to spend a week (during Hacker School alumni reunion week) getting a better understanding of performance (probably of things in the Hadoop ecosystem) on a few different dataset sizes (8GB, 100GB, 1TB). I have $1000 of AWS credit that I can spend on this (yay!).
Some things I want:
- Get a much better grasp of the performance of in-memory operations (load 8GB of data into memory and be done) vs. running a distributed map/reduce job.
- Understand what goes into the performance (how much time is spent copying data? Sending data over the network? Doing CPU work?)
- Learn something about the tradeoffs involved.
I'd love suggestions for experiments to run and setups to use. At work I've been using HDFS / Impala / Scalding, so my current thought is to compare in depth a map/reduce job in Scalding vs. an Impala query vs. a non-distributed in-memory job, because I already know those tools. But I'm open to other ideas!
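For the non-distributed end of the comparison, here's a minimal sketch of what I mean by an in-memory baseline: a single-process word count with wall-clock timing. Word count is just a stand-in workload, and `data.txt` is a hypothetical placeholder for one of the datasets.

```python
import time
from collections import Counter


def wordcount_in_memory(lines):
    """Count words across an iterable of lines, entirely in one process."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


if __name__ == "__main__":
    # "data.txt" is a placeholder for a real (e.g. 8GB) dataset.
    start = time.time()
    with open("data.txt") as f:
        counts = wordcount_in_memory(f)
    print("elapsed: %.1fs, distinct words: %d"
          % (time.time() - start, len(counts)))
```

Timing the same logical job as a Scalding map/reduce and as an Impala query would then give one point of comparison per system.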
Some questions I need to answer:
- Are there good large open datasets I could use? I'd like to use real data because it's more fun.
- If you were going to try to make reproducible experiments, where would you start?
- How can I set up an environment without spending an entire week on it?
- Should I use Elastic Map Reduce? How can I make installing everything as easy as possible?
- I have $1000 of AWS credit to spend. How much should I budget? What machines should I spend it on?
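On the budgeting question, a back-of-the-envelope calculator helps: node-hours = budget / (hourly price × cluster size). The instance types and prices below are assumptions (roughly 2013-era on-demand rates in us-east-1); check the current EC2 pricing page before trusting them.

```python
# Hourly on-demand prices in USD -- assumed values, verify against
# the current EC2 pricing page before budgeting for real.
HOURLY_PRICE = {
    "m1.large":   0.24,   # 7.5 GB RAM
    "m1.xlarge":  0.48,   # 15 GB RAM
    "m2.4xlarge": 1.64,   # 68.4 GB RAM; holds the 8GB dataset easily
}


def node_hours(budget, instance_type, cluster_size):
    """Hours a cluster of `cluster_size` nodes can run within `budget` dollars."""
    return budget / (HOURLY_PRICE[instance_type] * cluster_size)
```

For example, `node_hours(1000, "m1.large", 10)` comes out to roughly 416 hours for a 10-node cluster, which is far more than a week of experiments needs, so the budget looks comfortable.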
There are many to choose from! :) @vasia has already linked to some, so here are a few more:
http://www.infochimps.com/tags/bigdata
https://bitly.com/bundles/hmason/1
This is a good question and a big open area of research. IMO, I'd start with something along the lines of a VirtualBox image loaded with the datasets and source code, and preconfigured with all the necessary tools (Java, Eclipse, Hadoop, etc.).
You should definitely investigate Docker + Vagrant + Puppet. I don't know how to run those on AWS EC2, but there are certainly people out there using at least Docker. You'll also need some scripts to boot the instances and automate the experiments, so cloud-init, Boto, and Fabric may come in handy too.
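To make the "boot the instances" step concrete, here's a sketch using (old-style, v2) Boto. The AMI ID, key pair name, and security group are hypothetical placeholders, and real use needs AWS credentials configured in the environment.

```python
def launch_params(ami, count, instance_type="m1.large"):
    """Build the keyword arguments for EC2 run_instances.

    Kept as a pure function so it's easy to inspect and test
    without touching AWS.
    """
    return {
        "image_id": ami,
        "min_count": count,
        "max_count": count,
        "instance_type": instance_type,
        "key_name": "experiment-key",        # placeholder key pair
        "security_groups": ["experiments"],  # placeholder group
    }


def boot_cluster(region, ami, count):
    """Boot `count` instances of `ami` in `region` and return them."""
    # Deferred import so the sketch can be read without boto installed.
    import boto.ec2
    conn = boto.ec2.connect_to_region(region)
    params = launch_params(ami, count)
    reservation = conn.run_instances(params.pop("image_id"), **params)
    return reservation.instances
```

A Fabric script (or cloud-init user data) could then pick up the booted hosts to install Hadoop and kick off each experiment run.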