Notes for ADI MapReduce + Pig talk

Problem: what to do when your data is too large to process on one machine?

Some cases where this can occur

  • Recommendations: Netflix
  • Ad-tech: Quantcast (audience measurement)
  • Sensor monitoring: EnergyHub (thermostats)
  • Biotech: Schrodinger (drug discovery)

Netflix architecture diagram

Talk outline

  1. Hadoop and MapReduce
  2. Apache Pig, by example
  3. Getting started with free datasets

Hadoop and MapReduce

Word count example

  1. Input split
  2. Map
  3. Shuffle/Sort
  4. Reduce
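
To make the phases concrete, here's the canonical word count written in Pig Latin (the language covered in the next section), with a comment marking the MapReduce phase each statement compiles to; 'input.txt' and the output path are placeholders:

    -- canonical word count in Pig Latin; comments mark the MapReduce
    -- phase each statement compiles to
    lines   = LOAD 'input.txt' AS (line:chararray);          -- input split
    words   = FOREACH lines
              GENERATE FLATTEN(TOKENIZE(line)) AS word;      -- map
    grouped = GROUP words BY word;                           -- shuffle/sort
    counts  = FOREACH grouped
              GENERATE group AS word, COUNT(words) AS n;     -- reduce
    STORE counts INTO 'wordcount_output';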

What Hadoop does beyond simple functional programming:

  1. Data locality -- map tasks run on the nodes that already store their input, and the shuffle routes all records with a given key to the same reducer; together these cut network I/O dramatically
  2. Data transfer -- distributed filesystem (HDFS)
  3. Error handling -- node failures, network failures
  4. Job monitoring -- tools like Ambari, Twitter's Ambrose, and Cloudera Manager (part of CDH)

Pig by example

What is Pig?

  1. MapReduce jobs are hard to write -> Pig writes them for you
  2. Dataflow programming language. Like SQL, but awesomer.
  3. No database needed
  4. The ILLUSTRATE feature is super useful -- it runs your script on a small sample and shows example rows at every step (see below)
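
For example, with the word-count script above loaded in Pig's Grunt shell, one command runs the pipeline on a small sample and prints example rows for every intermediate relation, no full job needed:

    -- show sample data flowing through each step of the script
    ILLUSTRATE counts;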

Example: Twitter Sentiment Analysis

Goal: what do people on Twitter like? dislike?

More formally: what words occur more frequently in tweets with pos/neg sentiment than in the corpus of all tweets?
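
The notes don't pin down the exact scoring, so take this as an assumed formalization: rank words by the ratio

    score(w) = freq(w | pos tweets) / freq(w | all tweets)

where freq(w | X) is the number of occurrences of w in X divided by the total word count of X; a ratio well above 1 marks w as positive-associated (and symmetrically for negative).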

Source code:

Algorithm steps:

  1. Load tweets
  2. Tokenize
  3. Sentiment analysis
  4. Word count for pos/neg tweets
  5. Word frequencies for pos/neg tweets
  6. Relative word frequencies for pos/neg tweets vs. corpus as a whole
  7. Word-sentiment associations
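
A minimal Pig sketch of steps 1-6 for the positive side, assuming one tweet of text per line and an AFINN-style word list of (word, score) pairs; the file names, and scoring a tweet by summing its words' scores, are assumptions, not the talk's actual script:

    -- sketch: words over-represented in positive tweets vs. the corpus
    tweets   = LOAD 'tweets.txt' AS (text:chararray);
    wordlist = LOAD 'sentiment_wordlist.txt' AS (word:chararray, score:int);

    -- tokenize, keeping the tweet text as a (crude) per-tweet id
    tokens   = FOREACH tweets GENERATE text, FLATTEN(TOKENIZE(text)) AS word;

    -- score each tweet as the sum of its words' sentiment scores
    hits     = JOIN tokens BY word, wordlist BY word;
    scored   = FOREACH (GROUP hits BY tokens::text)
               GENERATE group AS text, SUM(hits.wordlist::score) AS sentiment;
    pos      = FILTER scored BY sentiment > 0;

    -- word counts within positive tweets and within the whole corpus
    pos_toks = JOIN tokens BY text, pos BY text;
    pos_cnt  = FOREACH (GROUP pos_toks BY tokens::word)
               GENERATE group AS word, COUNT(pos_toks) AS n_pos;
    all_cnt  = FOREACH (GROUP tokens BY word)
               GENERATE group AS word, COUNT(tokens) AS n_all;

    -- relative frequency; dividing by corpus totals would only rescale
    -- by a constant, so ranking by this ratio is equivalent
    joined   = JOIN pos_cnt BY word, all_cnt BY word;
    ratios   = FOREACH joined GENERATE pos_cnt::word AS word,
                   (double)n_pos / (double)n_all AS ratio;
    ranked   = ORDER ratios BY ratio DESC;
    top_pos  = LIMIT ranked 30;
    STORE top_pos INTO 'positive_words';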

Performance: 16,000,000 tweets from 04/13 to 04/27, ~1 GB of data, 40 minutes on a 5-node cluster costing ~$4

A simple algorithm means lots of noise, but here are some highlights from the results:

Negative sentiment:

  • #7: fractions
  • #10: texters
  • #12: soveryawkward
  • #13: abdominal [pain]
  • #15: quotidien
  • #27: hatter
  • #30: gramatically

Positive sentiment:

  • #2: freekicks
  • #3: wsaatl [Writing Sessions of America, Atlanta]
  • #4: apachecafe
  • #5: holidaze
  • #10: kixify [sneakers]
  • #13: unconditional [love]
  • #21: georgetakei [Sulu from the original Star Trek]

Super-quick advanced Pig example: Amazon product clustering

  • Goal: given info on frequently co-purchased Amazon products, group them into discrete clusters
  • Used an iterative algorithm called Markov clustering
  • Don't have time to go through the script, but it shows that you can do real machine-learning work in Pig (one iteration is sketched after this list)
  • Results
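
Markov clustering alternates "expansion" (squaring the transition matrix) with "inflation" (raising entries to a power and renormalizing columns) until clusters emerge. Pig has no loops, so a driver script reruns the iteration; here's a hedged sketch of one iteration over a sparse matrix, with placeholder paths, not the talk's actual script:

    -- one MCL iteration over a matrix stored as sparse (i, j, v) triples;
    -- a driver outside Pig feeds each iteration's output back in
    A = LOAD 'matrix' AS (i:int, j:int, v:double);
    B = LOAD 'matrix' AS (j:int, k:int, v:double);  -- 2nd copy for self-join

    -- expansion: M * M (sum over the shared index j)
    pairs    = JOIN A BY j, B BY j;
    partials = FOREACH pairs GENERATE A::i AS i, B::k AS k, A::v * B::v AS v;
    expanded = FOREACH (GROUP partials BY (i, k))
               GENERATE FLATTEN(group) AS (i, k), SUM(partials.v) AS v;

    -- inflation: square each entry, then renormalize columns to sum to 1
    inflated = FOREACH expanded GENERATE i, k, v * v AS v;
    withsum  = FOREACH (GROUP inflated BY k)
               GENERATE FLATTEN(inflated), SUM(inflated.v) AS colsum;
    nextM    = FOREACH withsum
               GENERATE inflated::i AS i, inflated::k AS k,
                        inflated::v / colsum AS v;
    STORE nextM INTO 'matrix_next';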

Getting started + free datasets

  • You can use Pig just on your local machine too!
  • Installing Pig with the Mortar Development Framework: gem install mortar
  • Using Mortar: local illustrate/run, cluster illustrate/run
  • Learning Pig: links to resources

Free datasets:

  1. Twitter Gardenhose (archive of 1% of tweets from the last two weeks). Sample.
  2. Common Crawl (a huge crawl of public web pages; use the downloader script to get pages from just the domains you want). Repo with downloader script
  3. Google Books (word occurrences and ngrams in books Google has scanned). Sample.
  4. Amazon Product Data and Co-purchasing Graph (e.g. an edge from the Odyssey to the Iliad because people frequently buy them together). I've scraped only books, but you can scrape whatever. Product data sample. Graph sample. Scraper code and clustering code.
  5. Million Song Dataset (metadata on one million songs). Field listing.

Repo with example scripts using these datasets with Mortar

Or use s3cmd to get raw data
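
Or, on a cluster whose Hadoop is configured with S3 credentials, Pig can read straight from S3; the bucket and key below are placeholders, not a real dataset location:

    -- hypothetical path; needs the s3n:// filesystem configured in Hadoop
    raw = LOAD 's3n://example-bucket/gardenhose/part-*' AS (json:chararray);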
