Some cases where this can occur
- Recommendations: Netflix
- Ad-tech: Quantcast (audience measurement)
- Sensor monitoring: EnergyHub (thermostats)
- Biotech: Schrodinger (drug discovery)
- Hadoop and MapReduce
- Apache Pig, by example
- Getting started with free datasets
- Input split
- Map
- Shuffle/Sort
- Reduce
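The four stages above can be sketched on a single machine with a word count. This is only a toy simulation in plain Python, not Hadoop code: the function names (`input_split`, `map_phase`, `shuffle_sort`, `reduce_phase`) and the `split_size` parameter are made up for illustration, and a real cluster would run the map and reduce tasks in parallel on different nodes.

```python
from collections import defaultdict
from itertools import islice


def input_split(records, split_size):
    """Input split: cut the input into fixed-size chunks, one per map task."""
    it = iter(records)
    while True:
        chunk = list(islice(it, split_size))
        if not chunk:
            return
        yield chunk


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in the split."""
    for line in split:
        for word in line.lower().split():
            yield word, 1


def shuffle_sort(pairs):
    """Shuffle/Sort: group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped}


def word_count(lines, split_size=2):
    """Run the four stages in order (sequentially here, for illustration)."""
    mapped = []
    for split in input_split(lines, split_size):
        mapped.extend(map_phase(split))
    return reduce_phase(shuffle_sort(mapped))


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog", "the end"]))
```

The map and reduce functions never see the whole dataset, only one split or one key's values at a time, which is what lets a framework spread them across machines.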
What Hadoop does beyond simple functional programming:
- Data locality -- map tasks run where their input blocks are stored, and the shuffle sends records with the same key to the same reducer; this reduces I/O time dramatically
- Data transfer -- distributed filesystem (HDFS)
- Error handling -- node failures, network failures
- Job monitoring -- tools like Ambari, Ambrose, and CDH
- MapReduce jobs are hard to write -> Pig writes them for you
- Dataflow programming language. Like SQL, but awesomer.
- No database needed
- Pig's ILLUSTRATE command is super useful.
Example: Twitter Sentiment Analysis
Goal: what do people on Twitter like? dislike?
More formally: what words occur more frequently in tweets with pos/neg sentiment than in the corpus of all tweets?
Source code:
Algorithm steps:
- Load tweets
- Tokenize
- Sentiment analysis
- Word count for pos/neg tweets
- Word frequencies for pos/neg tweets
- Relative word frequencies for pos/neg tweets vs. corpus as a whole
- Word-sentiment associations
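The algorithm steps above can be sketched in plain Python on a handful of tweets. This is a rough single-machine sketch, not the Pig script from the talk: the tiny `POSITIVE`/`NEGATIVE` lexicon, the tokenizer regex, and the function names are all assumptions made for illustration, since the outline does not say which sentiment method was used.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; the talk does not say which one was used.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "awful", "sad"}


def tokenize(tweet):
    """Lowercase the tweet and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_#@']+", tweet.lower())


def sentiment(tokens):
    """Label a tweet by counting lexicon hits (a deliberately naive rule)."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "pos" if score > 0 else "neg" if score < 0 else "neutral"


def frequencies(counts):
    """Turn raw word counts into fractions of all tokens counted."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def word_sentiment_associations(tweets):
    """Rank words by (frequency in pos/neg tweets) / (frequency in corpus)."""
    corpus = Counter()
    by_label = {"pos": Counter(), "neg": Counter()}
    for tweet in tweets:
        tokens = tokenize(tweet)
        corpus.update(tokens)
        label = sentiment(tokens)
        if label in by_label:
            by_label[label].update(tokens)

    corpus_freq = frequencies(corpus)
    result = {}
    for label, counts in by_label.items():
        freq = frequencies(counts)
        ratios = {word: freq[word] / corpus_freq[word] for word in freq}
        # Highest ratio first; ties broken alphabetically.
        result[label] = sorted(ratios.items(), key=lambda kv: (-kv[1], kv[0]))
    return result


if __name__ == "__main__":
    sample = ["I love pizza", "I hate traffic", "pizza traffic report"]
    for label, ranked in word_sentiment_associations(sample).items():
        print(label, ranked[:3])
```

A ratio above 1 means the word shows up more often in that sentiment class than in the corpus overall. In this sketch the lexicon words themselves rank highest, so a real run would probably filter them out and drop words with very low counts.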
Performance: 16,000,000 tweets from 04/13-04/27, ~1 GB data, 40 minutes on a 5-node cluster costing ~$4
Simple algorithm means lots of noise, but here are some highlights from the results:
- #7: fractions
- #10: texters
- #12: soveryawkward
- #13: abdominal [pain]
- #15: quotidien
- #27: hatter
- #30: gramatically
- #2: freekicks
- #3: wsaatl [Writing Sessions of America, Atlanta]
- #4: apachecafe
- #5: holidaze
- #10: kixify [sneakers]
- #13: unconditional [love]
- #21: georgetakei [Sulu from the original Star Trek]
Super-quick advanced Pig example: Amazon product clustering
- Goal: given info on frequently co-purchased Amazon products, group them into discrete clusters
- Used an iterative algorithm called Markov clustering
- Don't have time to go through the script, but it shows that you can do real machine learning sorts of things with Pig
- Results
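For a feel of what Markov clustering does, here is a minimal pure-Python sketch of the basic expand/inflate loop on a small undirected graph. It is not the Pig script from the talk: the function names, the `inflation` default of 2.0, the convergence tolerance, and the way clusters are read off the final matrix are all illustrative assumptions, and a real co-purchasing graph would need a sparse, distributed implementation.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix, i.e. take two random-walk steps."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise entries to a power, then renormalize the columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def markov_cluster(edges, n, inflation=2.0, iterations=50, tol=1e-9):
    """Cluster nodes 0..n-1 of an undirected graph given as (a, b) edges."""
    m = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        m[a][b] = m[b][a] = 1.0
    for i in range(n):
        m[i][i] = 1.0  # self-loops, as is commonly done for this algorithm
    m = normalize_columns(m)

    for _ in range(iterations):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break

    # Each row that still holds non-negligible mass names one cluster:
    # the columns it covers are the members. Clusters may overlap.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # A triangle (0, 1, 2) plus a separate pair (3, 4).
    print(markov_cluster([(0, 1), (1, 2), (0, 2), (3, 4)], n=5))
```

Expansion lets flow spread along paths and inflation strengthens the strong flows while weakening the weak ones; repeating the two is what pulls the graph apart into clusters. Raising `inflation` generally gives smaller, tighter clusters.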
- You can use Pig just on your local machine too!
- Installing Pig with the Mortar Development Framework:
gem install mortar
- Using Mortar: local illustrate/run, cluster illustrate/run
- Learning Pig: links to resources
Free datasets:
- Twitter Gardenhose (archive of 1% of tweets from the last two weeks). Sample.
- Common Crawl (archive of a large crawl of the public web. Use the downloader script to get pages from just the domains you want). Repo with downloader script
- Google Books (word occurrences and ngrams in books Google has scanned). Sample.
- Amazon Product Data and Copurchasing-Graph (i.e. edge from Odyssey to Iliad because people frequently buy the Odyssey with the Iliad). I've scraped only books, but you can scrape whatever. Product data sample. Graph sample. Scraper code and clustering code.
- Million Song Dataset (metadata on one million songs). Field listing.
Repo with example scripts using these datasets with Mortar
Or use s3cmd to get raw data