neil kodner neilkod

register piggybank.jar
DEFINE RegexExtract org.apache.pig.piggybank.evaluation.string.RegexExtract();
raw = LOAD '20100617.txt' USING PigStorage('\t') AS (id:chararray,timestamp:chararray,screenname:chararray,tweet:chararray);
fltr = FILTER raw BY tweet matches '.*\\bGOAL\\b.*';
describe fltr;
extrctd = FOREACH fltr GENERATE FLATTEN(RegexExtract(tweet,'\\bGOAL\\b')) as (flat:chararray);
describe extrctd;
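For comparison, here is a minimal Python sketch of the same filter-and-extract step, assuming the tab-separated layout from the LOAD statement above (id, timestamp, screenname, tweet). The function name `extract_goal` and the sample rows are made up for illustration; this is not the piggybank implementation, only the same regular expression applied outside Pig.

```python
import re

# Same pattern as the Pig script: GOAL as a whole word, case-sensitive.
GOAL = re.compile(r"\bGOAL\b")


def extract_goal(rows):
    """Yield the matched text for each tweet that contains the word GOAL.

    Each row is expected to be a tab-separated line with four fields:
    id, timestamp, screenname, tweet. Malformed rows are skipped.
    """
    for row in rows:
        fields = row.rstrip("\n").split("\t", 3)
        if len(fields) != 4:
            continue
        match = GOAL.search(fields[3])
        if match:
            yield match.group(0)


# Hypothetical sample rows, for illustration only.
sample = [
    "1\tThu Jun 17 12:00:00 +0000 2010\tuser_a\tGOAL for Argentina!",
    "2\tThu Jun 17 12:00:05 +0000 2010\tuser_b\tGOALLLLL",
    "3\tThu Jun 17 12:00:09 +0000 2010\tuser_c\tno score yet",
]

if __name__ == "__main__":
    print(list(extract_goal(sample)))
```

Only the first sample row matches: the word boundary `\b` rejects stretched spellings such as GOALLLLL, which may or may not be what you want when counting goal tweets.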
RegexExtract can be found at:
using the amazon piggybank.jar found at
http://developer.amazonwebservices.com/connect/entry.jspa?categoryID=262&externalID=2730
The Amazon piggybank.jar seems to have better examples and documentation.
Pig in local mode took about 1 minute, 20 seconds to produce the results from a single day's worth of tweets (5.9 million). I'm going to run Pig in local mode on two weeks' worth of data and see how it performs. Tomorrow I'll run the same job against my 2-node cluster. I might commandeer my wife's laptop for a third node.
The pig program that worked:
register piggybank-0.3-amzn.jar
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
-- load a small sample of parsed tweets. a sample row(tweet id, timestamp, user, tweet) looks like:
-- 17387228106 Wed Jun 30 04:00:07 +0000 2010 Urbindex @CavsWITNESS lebron was in NYC for a photoshoot....
raw = LOAD 'small.txt' USING PigStorage('\t') AS (id:chararray,timestamp:chararray,screenname:chararray,tweet:chararray);
register piggybank-0.3-amzn.jar
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
raw = LOAD '20100617.txt' USING PigStorage('\t') AS (id:chararray,timestamp:chararray,screenname:chararray,tweet:chararray);
fltr = FILTER raw BY tweet matches '.*\\b[Ll][Ee][Bb][Rr][Oo][Nn]\\b.*';
extrctd = FOREACH fltr GENERATE timestamp,DATE_TIME(timestamp, 'EEE MMM dd HH:mm:ss Z yyyy', 'UTC') as datetime,FORMAT_DT('YYYYMMddHH',DATE_TIME(timestamp, 'EEE MMM dd HH:mm:ss Z yyyy', 'UTC')) as theHour;
grpd = GROUP extrctd BY theHour;
cntd = FOREACH grpd GENERATE $0 as theHour,COUNT(extrctd) as cnt;
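The hourly grouping above can also be sketched in plain Python, which is handy for checking the Pig output on a small sample. This assumes the same four tab-separated fields and the Twitter timestamp layout shown in the sample row (for example `Wed Jun 30 04:00:07 +0000 2010`); the names `hour_key` and `count_by_hour` are illustrative, not part of any library.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Case-insensitive whole-word match, equivalent to [Ll][Ee][Bb][Rr][Oo][Nn].
LEBRON = re.compile(r"\blebron\b", re.IGNORECASE)

# strptime equivalent of the Joda pattern 'EEE MMM dd HH:mm:ss Z yyyy'.
TWITTER_TS = "%a %b %d %H:%M:%S %z %Y"


def hour_key(timestamp):
    """Convert a Twitter timestamp to a UTC hour bucket such as 2010063004."""
    parsed = datetime.strptime(timestamp, TWITTER_TS)
    return parsed.astimezone(timezone.utc).strftime("%Y%m%d%H")


def count_by_hour(rows):
    """Count tweets mentioning lebron per UTC hour; malformed rows are skipped."""
    counts = Counter()
    for row in rows:
        fields = row.rstrip("\n").split("\t", 3)
        if len(fields) != 4:
            continue
        _tweet_id, timestamp, _screenname, tweet = fields
        if LEBRON.search(tweet):
            counts[hour_key(timestamp)] += 1
    return counts


if __name__ == "__main__":
    rows = [
        "17387228106\tWed Jun 30 04:00:07 +0000 2010\tUrbindex\t"
        "@CavsWITNESS lebron was in NYC for a photoshoot....",
    ]
    print(count_by_hour(rows))
```

Note that `%a` and `%b` depend on the process locale, so this sketch assumes an English (or C) locale.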
-- load the amazon piggybank.jar
register piggybank-0.3-amzn.jar
-- function definitions.
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
-- load all of the parsed tweets as id, timestamp, screenname, tweet
> str(theData)
'data.frame': 586 obs. of 3 variables:
$ timestamp: POSIXct, format: "2010-06-17 04:00:00" "2010-06-17 05:00:00" "2010-06-17 06:00:00" "2010-06-17 07:00:00" ...
$ LJcount : int 24 29 9 10 11 3 7 8 17 15 ...
$ CBcount : int 5 1 2 2 1 1 NA 2 2 NA ...
> theData
timestamp LJcount CBcount
1 2010-06-17 04:00:00 24 5
2 2010-06-17 05:00:00 29 1
192:predictionApi nkodner$ ./predict10.sh
{data: {"input" : { "text" : [ "Drivers go slow when on a call People who use mobile phones while driving are spoiling it for the rest of us - by driving more carefully and slowing down traffic, according to US researchers.… " ]}}}}
{"data":{"output":{"output_label":"theregister"}}}
{data: {"input" : { "text" : [ "Friendly fire A wannabe hacker succeeded only in getting a forum for a group he wanted to join taken down after hacking celebrity MySpace profiles.… " ]}}}}
{"data":{"output":{"output_label":"theregister"}}}
{data: {"input" : { "text" : [ " Brangelina, who???? The hot couple of Cannes on Thursday was Madonna and Guy Ritchie. Mr. & Mrs. attended amfAR's annual Cinema Against AIDS benefit. Note to them: When putting on a happy face, try and look happy!!! [Photo via Getty Images.] " ]}}}}
{"data":{"output":{"output_label":"perezhilton"}}}
{data: {"input" : { "text" : [ "Born to tinker For as long as he can remember, Shane Kelly has taken a keen interest in tak
notes:
code is based on example at http://blog.notdot.net/2010/06/Trying-out-the-new-Prediction-API
my attempt at authentication seems ok: if I change my auth key, I get an unauthenticated response.
def predict2():
    request_data = {
        'data': {
neilkod / yesterday.sh
Created July 27, 2010 11:45
#!/bin/bash
# yesterday.sh - prints yesterday's date in yyyymmdd format
# using date -v, subtract 1 day (%d) from the date and print
# in the format yyyymmdd
# example: if executed on 27-Jul-2010, it will return 20100726
#
# example usage:
# ------- ------
# time cat gardenhose/sample.`./yesterday.sh`*.json|~/development/python/parseTweets/parse.py > ~/parsed/`./yesterday.sh`.txt
# replaces
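The `-v` flag is specific to BSD/macOS `date`, so the script as described will not run unchanged on systems with GNU `date`. A portable alternative, sketched here in Python under the assumption that only the yyyymmdd string is needed, could look like the following (the function name `yesterday` is chosen for illustration):

```python
from datetime import date, timedelta


def yesterday(today=None):
    """Return the day before `today` as a yyyymmdd string.

    `today` defaults to the current local date; passing it explicitly
    makes the function easy to check.
    """
    if today is None:
        today = date.today()
    return (today - timedelta(days=1)).strftime("%Y%m%d")


if __name__ == "__main__":
    print(yesterday())
```

Calling `yesterday(date(2010, 7, 27))` returns `20100726`, matching the example in the script header.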
UWD10>cat automerge.sql
set serverout on
--automerge.sql
--because writing merge statements is no fun!!
--now with utlfile!
set echo off
set termout off feedback off
set trimspool on
spool automerge
set serverout on size 999999