I want to spend a week (during Hacker School alumni reunion week) getting a better understanding of performance (probably of things in the Hadoop ecosystem) at a few different dataset sizes (8GB, 100GB, 1TB). I have $1000 of AWS credit that I can spend on this (yay!).
Some things I want:
- Get a much better grasp on the performance of in-memory operations (put 8GB of data into memory and be done with it) vs. running a distributed MapReduce job.
- Understand what goes into the performance (how much time is spent copying data? sending data over the network? on CPU?); there's a rough sketch of what I mean after this list.
- Learn something about the tradeoffs involved.
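To make the "where does the time go" question concrete, here's roughly the kind of instrumentation I have in mind for the single-machine case. This is just a sketch in plain Scala: the file name, the tab-separated format, and the three-phase breakdown are placeholders, not a real experiment.

```scala
import scala.io.Source

// Rough sketch: time each phase of a single-machine job separately,
// to see how much is reading from disk vs parsing vs actual computation.
object PhaseTiming {
  def timed[A](label: String)(body: => A): A = {
    val start  = System.nanoTime()
    val result = body
    val millis = (System.nanoTime() - start) / 1e6
    println(f"$label%-10s $millis%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // "data.tsv" and the numeric second column are placeholders for whatever dataset I end up using
    val lines  = timed("read")    { Source.fromFile("data.tsv").getLines().toVector }
    val fields = timed("parse")   { lines.map(_.split('\t')) }
    val total  = timed("compute") { fields.map(_(1).toDouble).sum }
    println(s"total = $total")
  }
}
```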
I'd love suggestions for experiments to run and setups to use. At work I've been using HDFS / Impala / Scalding, so my current thought is to look in depth at running a MapReduce job with Scalding vs. an Impala query vs. a non-distributed in-memory job, since those are tools I already know. But I'm open to other ideas!
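To make that comparison concrete, here's roughly what the Scalding side and the in-memory side would look like for a toy word-count job (just a sketch: the input/output paths are placeholders, and the Impala side would essentially be a GROUP BY query over the same data):

```scala
import com.twitter.scalding._

// Distributed version: a classic Scalding (fields-based API) word count.
// Runs as a Hadoop MapReduce job over whatever is at --input.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}

// Non-distributed version: the same computation on one machine, all in memory.
object InMemoryWordCount {
  def main(args: Array[String]): Unit = {
    val counts: Map[String, Int] =
      scala.io.Source.fromFile(args(0))
        .getLines()
        .flatMap(_.split("""\s+"""))
        .toVector                  // materialize every token in memory
        .groupBy(identity)
        .mapValues(_.size)
        .toMap
    counts.toSeq.sortBy(-_._2).take(10).foreach(println)
  }
}
```

The interesting part is comparing where each version spends its time, not the code itself.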
Some questions I need to answer:
- Are there good large open datasets I could use? I'd like to use real data because it's more fun.
- If you were going to try to make reproducible experiments, where would you start?
- How can I set up an environment without spending an entire week on it?
- Should I use Elastic MapReduce? How can I make installing everything as easy as possible?
- I have $1000 of AWS credit to spend. How should I budget it? What machines should I spend it on? (A back-of-envelope sketch is below.)
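For the budgeting question, even a back-of-envelope calculation helps. The numbers below are made-up placeholders, not actual AWS prices; I'd check the EC2/EMR pricing pages before launching anything.

```scala
object Budget extends App {
  // All of these numbers are made-up placeholders, not real AWS prices.
  val hourlyPerNode = 0.30   // assumed on-demand price of one mid-size node (USD/hour)
  val emrSurcharge  = 0.07   // assumed EMR fee per node-hour (USD)
  val nodes         = 5      // small experimental cluster
  val budget        = 1000.0 // the AWS credit

  val costPerClusterHour = nodes * (hourlyPerNode + emrSurcharge)
  val clusterHours       = budget / costPerClusterHour

  println(f"$$$costPerClusterHour%.2f per cluster-hour, ~$clusterHours%.0f cluster-hours total")
  // With these made-up numbers that's ~540 cluster-hours, far more than a week needs,
  // so the real question is probably how big a cluster to run, not how long.
}
```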
Related reading: https://blog.cloudera.com/blog/2010/12/a-profile-of-hadoop-mapreduce-computing-efficiency-sra-paul-burkhardt/