neil kodner neilkod

@neilkod
neilkod / gist:3167664
Created Jul 24, 2012
upsert and preserve existing values?
View gist:3167664
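for itm in itms:
    data = itm
    # look for an existing document with this text
    rec = coll.find({'text': data})
    if rec.count() == 0:
        # not seen before: insert it with the default count and last_posted
        print("new item: {0}".format(itm.strip()))
        coll.insert({'text': data, 'count': 0, 'last_posted': OLD_DATE})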
Can the coll.find/coll.insert be rewritten to use an upsert (coll.update) **and preserve any existing values of count (integer) and last_posted (date)**?
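One possible answer, sketched under assumptions: MongoDB's `$setOnInsert` update operator only applies its fields when an upsert actually creates a document, so existing `count` and `last_posted` values are left alone. The sketch assumes a PyMongo 3+ style collection (`update_one` with `upsert=True`); the helper names `build_upsert` and `upsert_items` and the placeholder value for `OLD_DATE` are hypothetical, not from the gist.

```python
from datetime import datetime

# Assumption: a placeholder standing in for the gist's OLD_DATE constant.
OLD_DATE = datetime(1970, 1, 1)


def build_upsert(text, old_date=OLD_DATE):
    """Return (filter, update) documents for an insert-if-missing upsert."""
    query = {'text': text}
    # $setOnInsert is applied only when the upsert creates a new document,
    # so count and last_posted on existing documents are not overwritten.
    update = {'$setOnInsert': {'count': 0, 'last_posted': old_date}}
    return query, update


def upsert_items(coll, itms):
    """Upsert every item and return the texts that were newly inserted."""
    new_items = []
    for itm in itms:
        query, update = build_upsert(itm)
        result = coll.update_one(query, update, upsert=True)
        # upserted_id is None when a matching document already existed
        if result.upserted_id is not None:
            new_items.append(itm)
    return new_items
```

This replaces the separate find and insert with a single round trip per item. `coll` is expected to be a PyMongo collection, or anything else exposing a compatible `update_one`.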
View elephant bird demo.pig
Sample Pig script; runs fine in local mode. The Elephant Bird magic is the JsonLoader() in the LOAD command, followed by converting user to a Java map so that I can extract screen_name. I haven't read the docs yet, so there may be a better way to do this. I'm sure I can combine the two generate statements into one; this is just a first attempt.
REGISTER '/Users/nkodner/Downloads/cdh3/elephant-bird/build/elephant-bird-2.2.4-SNAPSHOT.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/contrib/piggybank/java/lib/json-simple-1.1.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/build/ivy/lib/Pig/guava-r06.jar';
raw = LOAD '/Users/nkodner/clean_tweets/with_deletedaa' using com.twitter.elephantbird.pig.load.JsonLoader();
bah = limit raw 100;
cc = foreach bah generate (chararray)$0#'text' as text,(long)$0#'id' as id,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'user') as user;
dd = foreach cc generat
View gist:2898057
nkodner@hadoop4 pig-0.10.0$ ant test
Buildfile: /Users/nkodner/Downloads/pig-0.10.0/build.xml
test:
ivy-download:
[get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
[get] To: /Users/nkodner/Downloads/pig-0.10.0/ivy/ivy-2.2.0.jar
[get] Not modified - so not downloaded
View gist:2897924
goal is (text,id,user.screen_name)
REGISTER '/Users/nkodner/Downloads/cdh3/elephant-bird/build/elephant-bird-2.2.4-SNAPSHOT.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/contrib/piggybank/java/lib/json-simple-1.1.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/build/ivy/lib/Pig/guava-r06.jar';
raw = LOAD '/Users/nkodner/tweetsxxxxxx' using com.twitter.elephantbird.pig.load.JsonLoader();
lmtd = limit raw 100;
cc = foreach lmtd generate (chararray)$0#'text' as text,(long)$0#'id' as id,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'user') as user;
dd = foreach cc generate text,id,user#'screen_name' as name:chararray;
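For comparison only, the same (text, id, user.screen_name) projection can be sketched in plain Python. This is not part of the gist and does not use Elephant Bird; it assumes one JSON tweet per input line, and the function name `project_tweets` is hypothetical.

```python
import json


def project_tweets(lines, limit=100):
    """Yield (text, id, screen_name) for at most `limit` JSON tweets."""
    emitted = 0
    for line in lines:
        if emitted >= limit:
            break
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # user may be missing or null; fall back to an empty dict
        user = tweet.get('user') or {}
        yield tweet.get('text'), tweet.get('id'), user.get('screen_name')
        emitted += 1
```

Called as `project_tweets(open(path))`, it mirrors the `limit` plus the two `foreach ... generate` steps: the first 100 records, each reduced to three fields.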
neilkod / gist:2868503
Created Jun 4, 2012
bash one-liner to download the Google Books 1-gram data
View gist:2868503
nkodner@hadoop4 ~$ for i in {0..9}; do curl -O http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20090715-${i}.csv.zip; done
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 196M 100 196M 0 0 8514k 0 0:00:23 0:00:23 --:--:-- 16.0M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 196M 100 196M 0 0 16.6M 0 0:00:11 0:00:11 --:--:-- 14.1M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 196M 100 196M 0 0 15.9M 0 0:00:12 0:00:12 --:--:-- 12.3M
neilkod / benford.sh
Created Jun 3, 2012
Benford's law on Twitter data
View benford.sh
nkodner@hadoop4 strip_numbers$ cat numbers_from_12milliontweets.txt |awk '{print substr($1,1,1)}'|sort -n|uniq -c|sort -n
69606 7
70809 9
80228 6
80468 8
125992 0
131495 4
194264 5
369118 3
394841 2
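To put counts like these next to what Benford's law predicts, the expected share of leading digit d is log10(1 + 1/d) for d = 1..9. A minimal Python sketch, assuming one number per line as in the input file above; the function names are hypothetical. Benford's law only covers leading digits 1 to 9, so entries starting with 0 are ignored here.

```python
import math
from collections import Counter


def leading_digit_counts(lines):
    """Count the first character of the first field, for digits 1-9 only."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields and fields[0][0] in '123456789':
            counts[fields[0][0]] += 1
    return counts


def benford_expected(digit):
    """Share of numbers expected to start with `digit` under Benford's law."""
    return math.log10(1 + 1.0 / digit)


def compare(lines):
    """Return (digit, observed share, expected share) for digits 1..9."""
    counts = leading_digit_counts(lines)
    total = sum(counts.values())
    rows = []
    for d in range(1, 10):
        observed = counts[str(d)] / float(total) if total else 0.0
        rows.append((d, observed, benford_expected(d)))
    return rows
```

`compare(open('numbers_from_12milliontweets.txt'))` would give one row per digit, which makes any deviation from the expected curve easy to spot.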
neilkod / strip_tweet.py
Created Jun 3, 2012
strip entities (urls, hashtags, usernames) from a tweet
View strip_tweet.py
Note: tweets are in JSON format, read from STDIN.
For each entity in entities, grab the start and end position. Because entities can appear in any order, put each (start, end) pair on a list. After extracting all of the entities, reverse the list and trim the string (the tweet text) accordingly.
I'll clean this up and put it in a proper repo. It's some yak-shaving I needed to do for my latest data project.
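The approach described above can be sketched roughly as follows. This is not the gist's own code: it assumes the Twitter JSON layout in which `entities` holds `urls`, `hashtags` and `user_mentions`, each entry carrying an `indices` pair of [start, end], and the function name `strip_entities` is hypothetical. Instead of only reversing the list, the sketch sorts the spans in descending order, so the text is always cut from the end first and earlier offsets stay valid.

```python
import json

# Assumption: these are the entity types to strip (urls, hashtags, usernames).
ENTITY_KEYS = ('urls', 'hashtags', 'user_mentions')


def strip_entities(tweet):
    """Return the tweet text with all entity spans cut out."""
    text = tweet['text']
    entities = tweet.get('entities') or {}
    spans = []
    for key in ENTITY_KEYS:
        for item in entities.get(key, []):
            start, end = item['indices']
            spans.append((start, end))
    # Cut from the end of the text backwards so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text


def strip_stream(lines):
    """Yield the stripped text for each JSON tweet line."""
    for line in lines:
        if line.strip():
            yield strip_entities(json.loads(line))
```

With tweets arriving on STDIN, `for text in strip_stream(sys.stdin): print(text)` (after `import sys`) would print one stripped tweet per line. Whitespace left behind by removed entities is not collapsed here.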
#!/usr/bin/env python
import json, sys
def strip_items(str, start_pos, end_pos):
neilkod / funcs.awk
Created May 29, 2012
simple histogram in awk
View funcs.awk
nkodner@hadoop4 tmp$ cat coins.txt
gold 1 1986 USA American Eagle
gold 1 1908 Austria-Hungary Franz Josef 100 Korona
silver 10 1981 USA ingot
gold 1 1984 Switzerland ingot
gold 1 1979 RSA Krugerrand
gold 0.5 1981 RSA Krugerrand
gold 0.1 1986 PRC Panda
silver 1 1986 USA Liberty dollar
gold 0.25 1986 USA Liberty 5-dollar piece
neilkod / export.sql
Created May 25, 2012
export all rows where modified dt field is > 01-jan-2010
View export.sql
Exports data that was modified (including created) after a certain date, in this case 01-oct-2011.
Uses an analytic function to decide whether or not to add the column delimiter. In my case, I'm using || as the delimiter since my data contains tabs and commas.
It generates /tmp/TABLE_NAME.cmd.sql and then executes it while spooling to TABLE_NAME.txt.
usage:
$ sqlplus user/pass@db @export <TABLE_NAME> <MOD_DT_FIELD_NAME>
set echo off feedb off head off pages 0 lines 500 trimspool on verify off termout off array 1000
View gist:2270775
data at https://raw.github.com/neilkod/2012_mb_corporate_run/master/data/results_2012.tsv
> raw_data = read.csv('/path/to/data',header=FALSE, sep='\t',stringsAsFactors=FALSE)
> names(raw_data) <- c('overall_position','gender_position','bib','name','time','seconds','minutes','gender','team')
> raw_data[raw_data$team=="Motorola Mobility",]
    overall_position gender_position  bib            name  time
37                37              33 2271   Roberto Munoz 20:52
43                43              39 2253   Nicolas Guyot 21:05
95                95              87 2264     Steve Lloyd 22:15
125              125             112 2231 Ronald Bochenek 22:35