register piggybank.jar
DEFINE RegexExtract org.apache.pig.piggybank.evaluation.string.RegexExtract();
raw = LOAD '20100617.txt' USING PigStorage('\t') AS (id:chararray,timestamp:chararray,screenname:chararray,tweet:chararray);
fltr = FILTER raw BY tweet matches '.*\\bGOAL\\b.*';
describe fltr;
extrctd = FOREACH fltr GENERATE FLATTEN(RegexExtract(tweet,'\\bGOAL\\b')) as (flat:chararray);
describe extrctd;
RegexExtract can be found at:
using the amazon piggybank.jar found at
http://developer.amazonwebservices.com/connect/entry.jspa?categoryID=262&externalID=2730
The amazon piggybank.jar seems to have better examples and documentation.
Pig in local mode took about 1 minute, 20 seconds to produce the results from a single day's worth of tweets (5.9 million). I'm going to run Pig in local mode on 2 weeks' worth of data and see how it performs. Tomorrow I'll run the same job against my 2-node cluster. I might commandeer my wife's laptop for a third node.
The pig program that worked:
register piggybank-0.3-amzn.jar
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
-- load a small sample of parsed tweets. a sample row (tweet id, timestamp, user, tweet) looks like:
-- 17387228106 Wed Jun 30 04:00:07 +0000 2010 Urbindex @CavsWITNESS lebron was in NYC for a photoshoot....
raw = LOAD 'small.txt' USING PigStorage('\t') AS (id:chararray,timestamp:chararray,screenname:chararray,tweet:chararray);
register piggybank-0.3-amzn.jar
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
raw = LOAD '20100617.txt' USING PigStorage('\t') AS (id:chararray,timestamp:chararray,screenname:chararray,tweet:chararray);
fltr = FILTER raw BY tweet matches '.*\\b[Ll][Ee][Bb][Rr][Oo][Nn]\\b.*';
extrctd = FOREACH fltr GENERATE
    timestamp,
    DATE_TIME(timestamp, 'EEE MMM dd HH:mm:ss Z yyyy', 'UTC') as datetime,
    FORMAT_DT('YYYYMMddHH',DATE_TIME(timestamp, 'EEE MMM dd HH:mm:ss Z yyyy', 'UTC')) as theHour;
grpd = GROUP extrctd BY theHour;
cntd = FOREACH grpd GENERATE $0 as theHour,COUNT(extrctd) as cnt;
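The filter, hour-bucketing, and count steps in the Pig script above can be sketched in plain Python for checking results on a small sample. This is only an illustration, not the piggybank implementation: the function names `hour_bucket` and `count_by_hour` are made up here, and every sample row except the first (which comes from the comment in the sample script) is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Timestamp layout used in the parsed tweets, e.g. "Wed Jun 30 04:00:07 +0000 2010"
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Case-insensitive whole-word match, standing in for [Ll][Ee][Bb][Rr][Oo][Nn]
LEBRON = re.compile(r"\blebron\b", re.IGNORECASE)


def hour_bucket(timestamp):
    """Return the UTC hour of a tweet timestamp as a yyyymmddHH string."""
    parsed = datetime.strptime(timestamp, TWITTER_FORMAT)
    return parsed.astimezone(timezone.utc).strftime("%Y%m%d%H")


def count_by_hour(rows):
    """Count matching tweets per hour; rows are (id, timestamp, screenname, tweet)."""
    counts = Counter()
    for _id, timestamp, _screenname, tweet in rows:
        if LEBRON.search(tweet):
            counts[hour_bucket(timestamp)] += 1
    return counts


# First row is the sample from the script comment; the others are hypothetical.
sample_rows = [
    ("17387228106", "Wed Jun 30 04:00:07 +0000 2010", "Urbindex",
     "@CavsWITNESS lebron was in NYC for a photoshoot...."),
    ("2", "Wed Jun 30 04:59:59 +0000 2010", "someuser", "LeBron to the Knicks?"),
    ("3", "Wed Jun 30 05:10:00 +0000 2010", "otheruser", "nothing to see here"),
]

hourly_counts = count_by_hour(sample_rows)
for hour, cnt in sorted(hourly_counts.items()):
    print(hour, cnt)
```

Unlike Pig's `matches`, which has to match the whole string (hence the leading and trailing `.*`), `re.search` looks for the pattern anywhere in the tweet, so the wrappers are not needed here.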
-- load the amazon piggybank.jar
register piggybank-0.3-amzn.jar
-- function definitions.
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
-- load all of the parsed tweets as id, timestamp, screenname, tweet
> str(theData)
'data.frame':   586 obs. of  3 variables:
 $ timestamp: POSIXct, format: "2010-06-17 04:00:00" "2010-06-17 05:00:00" "2010-06-17 06:00:00" "2010-06-17 07:00:00" ...
 $ LJcount  : int  24 29 9 10 11 3 7 8 17 15 ...
 $ CBcount  : int  5 1 2 2 1 1 NA 2 2 NA ...
> theData
              timestamp LJcount CBcount
1   2010-06-17 04:00:00      24       5
2   2010-06-17 05:00:00      29       1
192:predictionApi nkodner$ ./predict10.sh
{data: {"input" : { "text" : [ "Drivers go slow when on a call People who use mobile phones while driving are spoiling it for the rest of us - by driving more carefully and slowing down traffic, according to US researchers.… " ]}}}}
{"data":{"output":{"output_label":"theregister"}}}
{data: {"input" : { "text" : [ "Friendly fire A wannabe hacker succeeded only in getting a forum for a group he wanted to join taken down after hacking celebrity MySpace profiles.… " ]}}}}
{"data":{"output":{"output_label":"theregister"}}}
{data: {"input" : { "text" : [ " Brangelina, who???? The hot couple of Cannes on Thursday was Madonna and Guy Ritchie. Mr. & Mrs. attended amfAR's annual Cinema Against AIDS benefit. Note to them: When putting on a happy face, try and look happy!!! [Photo via Getty Images.] " ]}}}}
{"data":{"output":{"output_label":"perezhilton"}}}
{data: {"input" : { "text" : [ "Born to tinker For as long as he can remember, Shane Kelly has taken a keen interest in tak |
notes:
code is based on the example at http://blog.notdot.net/2010/06/Trying-out-the-new-Prediction-API
my attempt at authentication seems ok: if I change my auth key, I get an unauthenticated response.
def predict2():
    request_data = {
        'data': {
#!/bin/bash
# yesterday.sh - prints yesterday's date in yyyymmdd format
# using date -v, subtract 1 day (%d) from the date and print
# in the format yyyymmdd
# example: if executed on 27-Jul-2010, it will return 20100726
#
# example usage:
# ------- ------
# time cat gardenhose/sample.`./yesterday.sh`*.json|~/development/python/parseTweets/parse.py > ~/parsed/`./yesterday.sh`.txt
# replaces
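The `-v` adjustment flag is specific to BSD/macOS `date`, so the script as described is not portable to every system. As a rough, portable sketch of the same date arithmetic, assuming Python is available alongside parse.py, the yyyymmdd string for yesterday can be computed as follows. The function name `yesterday` and its optional `today` parameter are illustrative and not part of the original script.

```python
from datetime import date, timedelta


def yesterday(today=None):
    """Return the day before `today` (default: the current date) as yyyymmdd."""
    if today is None:
        today = date.today()
    return (today - timedelta(days=1)).strftime("%Y%m%d")


# The example from the script header: run on 27-Jul-2010, it gives 20100726.
print(yesterday(date(2010, 7, 27)))
```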
UWD10>cat automerge.sql
set serverout on
--automerge.sql
--because writing merge statements is no fun!!
--now with utlfile!
set echo off
set termout off feedback off
set trimspool on
spool automerge
set serverout on size 999999