
@jvns
Created April 27, 2014 18:15

I want to spend a week (during Hacker School alumni reunion week) better understanding performance (probably of things in the Hadoop ecosystem) on a few different dataset sizes (8GB, 100GB, 1TB). I have $1000 of AWS credit that I can spend on this (yay!).

Some things I want:

  • Get a much better grasp on the performance of in-memory operations (put 8GB of data into memory and be done) vs. running a distributed map reduce.
  • Understand what goes into the performance (how much time is spent copying data? sending data over the network? CPU?)
  • Learn something about tradeoffs

I'd love suggestions for experiments to run and setups to use. At work I've been using HDFS / Impala / Scalding, so my current thought is to spend time looking in depth at running a map/reduce with Scalding vs an Impala query vs running a non-distributed job in memory, because I already know about those things. But I'm open to other ideas!
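
For the non-distributed case, the in-memory baseline might be as simple as a single process that reads everything and aggregates it. A sketch, where the file name and the (key, value) CSV schema are made up:

    # Minimal in-memory baseline: read the whole dataset in one process
    # and aggregate it. File name and (key,value) CSV schema are hypothetical.
    import csv
    import time
    from collections import defaultdict

    def aggregate(path):
        totals = defaultdict(int)
        with open(path) as f:
            for key, value in csv.reader(f):
                totals[key] += int(value)
        return totals

    if __name__ == "__main__":
        start = time.time()
        totals = aggregate("data-8gb.csv")  # hypothetical 8GB input
        print("aggregated %d keys in %.1fs" % (len(totals), time.time() - start))

Timing this and the distributed versions of the same aggregation gives a crude apples-to-apples comparison.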

Some questions I need to answer:

  • Are there good large open datasets I could use? I'd like to use real data because it's more fun.
  • If you were going to try to make reproducible experiments, where would you start?
  • How can I set up an environment without spending an entire week on it?
  • Should I use Elastic Map Reduce? How can I make installing everything as easy as possible?
  • I have $1000 of AWS credit to spend. How much should I budget? What machines should I spend it on?
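
For that last question, the budget arithmetic is easy to sketch in a few lines; the hourly rate below is a placeholder, not a quoted AWS price:

    # Back-of-the-envelope budget check. PRICE_PER_HOUR is a placeholder,
    # not a real AWS quote -- look up current on-demand pricing.
    BUDGET = 1000.0        # USD of AWS credit
    PRICE_PER_HOUR = 0.25  # hypothetical per-instance hourly rate
    CLUSTER_SIZE = 10      # instances per experiment cluster

    cluster_hours = BUDGET / (PRICE_PER_HOUR * CLUSTER_SIZE)
    print("a %d-node cluster runs for about %.0f hours on $%.0f"
          % (CLUSTER_SIZE, cluster_hours, BUDGET))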
@vasia

vasia commented Apr 27, 2014

  • There are plenty of free datasets, some of them already hosted on Amazon. Others include SNAP (graph & network data), the Million Song Dataset, the Yelp academic dataset, and the Yahoo! datasets.
  • For reproducible experiments, I guess the least you need to do is share your setup settings in as much detail as possible and, of course, your code :)
  • The fastest way to set up an environment is indeed EMR. You can have a cluster up and running in about 5 minutes! However, I think it's fun to do the configuration yourself at least once (you'll learn much more!)
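
If you want the spin-up scripted rather than clicked through the console, boto's EMR module can start a jobflow. A rough sketch (boto 2.x); the bucket, key pair, and instance sizes are placeholders:

    # Sketch: starting an EMR cluster from a script with boto's EMR API.
    # Bucket, key pair name, and instance sizes are placeholders.
    from boto.emr import connect_to_region

    conn = connect_to_region("us-east-1")
    jobflow_id = conn.run_jobflow(
        name="perf-experiments",
        log_uri="s3://my-bucket/emr-logs/",  # hypothetical bucket
        ec2_keyname="my-keypair",            # hypothetical EC2 key pair
        master_instance_type="m1.large",
        slave_instance_type="m1.large",
        num_instances=5,
        keep_alive=True,  # keep the cluster up between experiments
    )
    print("started jobflow", jobflow_id)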

Best of luck!

@eribeiro

  • Are there good large open datasets I could use? I'd like to use real data because it's more fun.

There are many to choose from! :) @vasia has already mentioned some, so here are a few more:

http://www.infochimps.com/tags/bigdata

https://bitly.com/bundles/hmason/1

  • If you were going to try to make reproducible experiments, where would you start?

This is a good question and a big open area of research. IMO, a good start would be something along the lines of a VirtualBox image loaded with the datasets and source code, and configured with all the necessary tools (Java, Eclipse, Hadoop, etc.).

  • How can I set up an environment without spending an entire week on it?

You should definitely investigate Docker + Vagrant + Puppet. I don't know how to run those on AWS EC2, but there are certainly people out there using at least Docker. You'll also need some scripts to boot the instances and automate the experiments, so cloud-init, Boto, and Fabric may be necessary too.
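
To make the automation concrete, a minimal Fabric 1.x fabfile for pushing code to the instances and running a benchmark over SSH might look like this sketch; the host name, tarball, and run script are placeholders:

    # fabfile.py -- automate experiment runs over SSH with Fabric 1.x.
    # Host name, tarball, and run script below are placeholders.
    from fabric.api import env, put, run

    env.hosts = ["ec2-203-0-113-10.compute-1.amazonaws.com"]  # hypothetical
    env.user = "ec2-user"

    def setup():
        """Copy the experiment code out to each host and unpack it."""
        put("experiment.tar.gz", "/home/ec2-user/")
        run("tar xzf experiment.tar.gz")

    def bench():
        """Run one benchmark iteration and keep the timing output."""
        run("cd experiment && ./run_benchmark.sh | tee results.txt")

Running `fab setup bench` then executes both tasks on every host in env.hosts.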


ghost commented Apr 27, 2014

+1 on docker

@jvns
Author

jvns commented Apr 28, 2014

From bra-ket on HN (https://news.ycombinator.com/item?id=7656976):

Try the YCSB benchmark: https://github.com/brianfrankcooper/YCSB
and DFSIO: https://support.gopivotal.com/hc/en-us/articles/200864057-Ru...

Note: EBS I/O performance used to be abysmal, especially for cheaper instances. Use the storage that comes native with the EC2 instance rather than network-based storage; see the AWS FAQ: http://www.datadoghq.com/wp-content/uploads/2013/07/top_5_aw...

If you're interested in scalable in-memory computing (online vs. batch), try:

  • Storm: http://storm.incubator.apache.org/
  • Spark: http://spark.apache.org/
  • Phoenix: http://phoenix.incubator.apache.org/
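
For reference, the same kind of aggregation written against Spark's Python API might look like the sketch below; the S3 input path is a placeholder, and you'd run it with spark-submit:

    # Sketch: a distributed version of the aggregation on Spark's Python API.
    # The S3 input path is a placeholder; submit with spark-submit.
    from pyspark import SparkContext

    sc = SparkContext(appName="perf-experiment")
    counts = (sc.textFile("s3n://my-bucket/data-100gb/*")  # hypothetical path
                .map(lambda line: (line.split(",")[0], 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # top 10 keys
    sc.stop()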

@jvns
Author

jvns commented Apr 28, 2014

One more from HN: https://news.ycombinator.com/item?id=7657349

  • How can I set up an environment without spending an entire week on it? - CloudFormation would be a good bet here.
  • How can I make installing everything as easy as possible? - Use hosted Chef and community cookbooks.
    These suggestions might be pretty daunting if you haven't used Chef or autoscaling/CloudFormation before. Alternatively, you could skip all that, bake an AMI, and clone it.
    Look at using something with higher-resolution metrics than CloudWatch (its resolution is 5 minutes), like Graphite and collectd, to collect stats easily.
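
Graphite's plaintext protocol is simple enough to script directly; a sketch, where the Graphite host and the metric name are placeholders:

    # Sketch: push one custom metric to Graphite's plaintext listener
    # (default port 2003). Host and metric name are placeholders.
    import socket
    import time

    def send_metric(name, value, host="graphite.internal", port=2003):
        # Graphite's plaintext format is "metric.path value timestamp\n"
        msg = "%s %f %d\n" % (name, value, int(time.time()))
        sock = socket.create_connection((host, port))
        sock.sendall(msg.encode("ascii"))
        sock.close()

    send_metric("experiments.scalding.job_seconds", 842.0)  # hypothetical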
