Notes for ADI MapReduce + Pig talk

Problem: what to do when your data is too large to process on one machine?

Some cases where this can occur

  • Recommendations: Netflix
  • Ad-tech: Quantcast (audience measurement)
  • Sensor monitoring: EnergyHub (thermostats)
  • Biotech: Schrodinger (drug discovery)

Netflix architecture diagram

Talk outline

  1. Hadoop and MapReduce
  2. Apache Pig, by example
  3. Getting started with free datasets

Hadoop and MapReduce

Word count example

  1. Input split
  2. Map
  3. Shuffle/Sort
  4. Reduce
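
To make the phases concrete, here's the canonical word count written in Pig Latin (the language covered in the next section), with a comment marking the MapReduce phase each statement compiles to; 'input.txt' and the output path are placeholders:

    -- canonical word count in Pig Latin; comments mark the MapReduce
    -- phase each statement compiles to
    lines   = LOAD 'input.txt' AS (line:chararray);          -- input split
    words   = FOREACH lines
              GENERATE FLATTEN(TOKENIZE(line)) AS word;      -- map
    grouped = GROUP words BY word;                           -- shuffle/sort
    counts  = FOREACH grouped
              GENERATE group AS word, COUNT(words) AS n;     -- reduce
    STORE counts INTO 'wordcount_output';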

What Hadoop does beyond simple functional programming:

  1. Data locality -- map tasks run on the nodes that already store their input, and the shuffle routes all records with a given key to the same reducer; together these cut network I/O dramatically
  2. Data transfer -- distributed filesystem (HDFS)
  3. Error handling -- node failures, network failures
  4. Job monitoring -- tools like Ambari, Twitter's Ambrose, and Cloudera Manager (part of CDH)

Pig by example

What is Pig?

  1. MapReduce jobs are hard to write -> Pig writes them for you
  2. Dataflow programming language. Like SQL, but awesomer.
  3. No database needed
  4. The ILLUSTRATE feature is super useful -- it runs your script on a small sample and shows example rows at every step (see below)
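
For example, with the word-count script above loaded in Pig's Grunt shell, one command runs the pipeline on a small sample and prints example rows for every intermediate relation, no full job needed:

    -- show sample data flowing through each step of the script
    ILLUSTRATE counts;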

Example: Twitter Sentiment Analysis

Goal: what do people on Twitter like? dislike?

More formally: what words occur more frequently in tweets with pos/neg sentiment than in the corpus of all tweets?
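
The notes don't pin down the exact scoring, so take this as an assumed formalization: rank words by the ratio

    score(w) = freq(w | pos tweets) / freq(w | all tweets)

where freq(w | X) is the number of occurrences of w in X divided by the total word count of X; a ratio well above 1 marks w as positive-associated (and symmetrically for negative).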

Source code:

Algorithm steps:

  1. Load tweets
  2. Tokenize
  3. Sentiment analysis
  4. Word count for pos/neg tweets
  5. Word frequencies for pos/neg tweets
  6. Relative word frequencies for pos/neg tweets vs. corpus as a whole
  7. Word-sentiment associations
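
A minimal Pig sketch of steps 1-6 for the positive side, assuming one tweet of text per line and an AFINN-style word list of (word, score) pairs; the file names, and scoring a tweet by summing its words' scores, are assumptions, not the talk's actual script:

    -- sketch: words over-represented in positive tweets vs. the corpus
    tweets   = LOAD 'tweets.txt' AS (text:chararray);
    wordlist = LOAD 'sentiment_wordlist.txt' AS (word:chararray, score:int);

    -- tokenize, keeping the tweet text as a (crude) per-tweet id
    tokens   = FOREACH tweets GENERATE text, FLATTEN(TOKENIZE(text)) AS word;

    -- score each tweet as the sum of its words' sentiment scores
    hits     = JOIN tokens BY word, wordlist BY word;
    scored   = FOREACH (GROUP hits BY tokens::text)
               GENERATE group AS text, SUM(hits.wordlist::score) AS sentiment;
    pos      = FILTER scored BY sentiment > 0;

    -- word counts within positive tweets and within the whole corpus
    pos_toks = JOIN tokens BY text, pos BY text;
    pos_cnt  = FOREACH (GROUP pos_toks BY tokens::word)
               GENERATE group AS word, COUNT(pos_toks) AS n_pos;
    all_cnt  = FOREACH (GROUP tokens BY word)
               GENERATE group AS word, COUNT(tokens) AS n_all;

    -- relative frequency; dividing by corpus totals would only rescale
    -- by a constant, so ranking by this ratio is equivalent
    joined   = JOIN pos_cnt BY word, all_cnt BY word;
    ratios   = FOREACH joined GENERATE pos_cnt::word AS word,
                   (double)n_pos / (double)n_all AS ratio;
    ranked   = ORDER ratios BY ratio DESC;
    top_pos  = LIMIT ranked 30;
    STORE top_pos INTO 'positive_words';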

Performance: 16,000,000 tweets from 04/13 to 04/27, ~1 GB of data, 40 minutes on a 5-node cluster costing ~$4

A simple algorithm means lots of noise, but here are some highlights from the results:

Negative sentiment:

  • #7: fractions
  • #10: texters
  • #12: soveryawkward
  • #13: abdominal [pain]
  • #15: quotidien
  • #27: hatter
  • #30: gramatically

Positive sentiment:

  • #2: freekicks
  • #3: wsaatl [Writing Sessions of America, Atlanta]
  • #4: apachecafe
  • #5: holidaze
  • #10: kixify [sneakers]
  • #13: unconditional [love]
  • #21: georgetakei [Sulu from the original Star Trek]

Super-quick advanced Pig example: Amazon product clustering

  • Goal: given info on frequently co-purchased Amazon products, group them into discrete clusters
  • Used an iterative algorithm called Markov clustering
  • Don't have time to go through the script, but it shows that you can do real machine-learning work in Pig (one iteration is sketched after this list)
  • Results
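
Markov clustering alternates "expansion" (squaring the transition matrix) with "inflation" (raising entries to a power and renormalizing columns) until clusters emerge. Pig has no loops, so a driver script reruns the iteration; here's a hedged sketch of one iteration over a sparse matrix, with placeholder paths, not the talk's actual script:

    -- one MCL iteration over a matrix stored as sparse (i, j, v) triples;
    -- a driver outside Pig feeds each iteration's output back in
    A = LOAD 'matrix' AS (i:int, j:int, v:double);
    B = LOAD 'matrix' AS (j:int, k:int, v:double);  -- 2nd copy for self-join

    -- expansion: M * M (sum over the shared index j)
    pairs    = JOIN A BY j, B BY j;
    partials = FOREACH pairs GENERATE A::i AS i, B::k AS k, A::v * B::v AS v;
    expanded = FOREACH (GROUP partials BY (i, k))
               GENERATE FLATTEN(group) AS (i, k), SUM(partials.v) AS v;

    -- inflation: square each entry, then renormalize columns to sum to 1
    inflated = FOREACH expanded GENERATE i, k, v * v AS v;
    withsum  = FOREACH (GROUP inflated BY k)
               GENERATE FLATTEN(inflated), SUM(inflated.v) AS colsum;
    nextM    = FOREACH withsum
               GENERATE inflated::i AS i, inflated::k AS k,
                        inflated::v / colsum AS v;
    STORE nextM INTO 'matrix_next';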

Getting started + free datasets

  • You can use Pig just on your local machine too!
  • Installing Pig with the Mortar Development Framework: gem install mortar
  • Using Mortar: local illustrate/run, cluster illustrate/run
  • Learning Pig: links to resources

Free datasets:

  1. Twitter Gardenhose (archive of 1% of tweets from the last two weeks). Sample.
  2. Common Crawl (a huge crawl of public web pages; use the downloader script to get pages from just the domains you want). Repo with downloader script
  3. Google Books (word occurrences and ngrams in books Google has scanned). Sample.
  4. Amazon Product Data and Co-purchasing Graph (e.g. an edge from the Odyssey to the Iliad because people frequently buy them together). I've scraped only books, but you can scrape whatever. Product data sample. Graph sample. Scraper code and clustering code.
  5. Million Song Dataset (metadata on one million songs). Field listing.

Repo with example scripts using these datasets with Mortar

Or use s3cmd to get raw data
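
Or, on a cluster whose Hadoop is configured with S3 credentials, Pig can read straight from S3; the bucket and key below are placeholders, not a real dataset location:

    -- hypothetical path; needs the s3n:// filesystem configured in Hadoop
    raw = LOAD 's3n://example-bucket/gardenhose/part-*' AS (json:chararray);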
