neil kodner neilkod

@neilkod
neilkod / gist:3167664
Created July 24, 2012 02:37
upsert and preserve existing values?
for itm in itms:
    data = itm
    rec = coll.find({'text': data})
    # only insert items that are not already in the collection
    if rec.count() == 0:
        print("new item: {0}".format(itm.strip()))
        coll.insert({'text': data, 'count': 0, 'last_posted': OLD_DATE})
Can the coll.find/coll.insert pair be rewritten as an upsert (coll.update) **that preserves any existing values of count (integer) and last_posted (date)**?
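One possible answer, offered as a sketch rather than a tested drop-in: MongoDB 2.4 and later have the `$setOnInsert` update operator, which applies its fields only when the upsert actually creates a document. The version below assumes PyMongo 3 or later (for `update_one`). The helper name `upsert_item` and the `OLD_DATE` value are placeholders, not part of the original snippet.

```python
from datetime import datetime

# Assumption: stand-in for the OLD_DATE constant used in the snippet above.
OLD_DATE = datetime(1970, 1, 1)


def upsert_item(coll, text, old_date=OLD_DATE):
    """Insert {'text': text} with default fields if it is missing.

    If a document with this text already exists, $setOnInsert does nothing,
    so its current count and last_posted values are left untouched.
    `coll` is expected to be a PyMongo 3+ collection.
    """
    return coll.update_one(
        {'text': text},  # match on the text field
        {'$setOnInsert': {'count': 0, 'last_posted': old_date}},
        upsert=True,     # create the document when nothing matches
    )
```

With this, the loop body could shrink to `upsert_item(coll, itm)`. On insert, MongoDB copies the equality condition from the filter into the new document, so `text` is set as well.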
neilkod / elephant bird demo.pig
Created June 8, 2012 22:30
elephant bird demo.pig
Sample Pig script; runs fine in local mode. The elephant-bird magic is the JsonLoader() in the LOAD command, followed by converting user to a Java map so that I can extract screen_name. I haven't read the docs yet, but there may be a better way to do this. I'm sure I can combine the two generate statements into one; this is just a first attempt.
REGISTER '/Users/nkodner/Downloads/cdh3/elephant-bird/build/elephant-bird-2.2.4-SNAPSHOT.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/contrib/piggybank/java/lib/json-simple-1.1.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/build/ivy/lib/Pig/guava-r06.jar';
raw = LOAD '/Users/nkodner/clean_tweets/with_deletedaa' using com.twitter.elephantbird.pig.load.JsonLoader();
bah = limit raw 100;
cc = foreach bah generate (chararray)$0#'text' as text,(long)$0#'id' as id,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'user') as user;
dd = foreach cc generat
neilkod / gist:2898057
Created June 8, 2012 20:49
ant test output
nkodner@hadoop4 pig-0.10.0$ ant test
Buildfile: /Users/nkodner/Downloads/pig-0.10.0/build.xml
test:
ivy-download:
[get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
[get] To: /Users/nkodner/Downloads/pig-0.10.0/ivy/ivy-2.2.0.jar
[get] Not modified - so not downloaded
goal is (text,id,user.screen_name)
REGISTER '/Users/nkodner/Downloads/cdh3/elephant-bird/build/elephant-bird-2.2.4-SNAPSHOT.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/contrib/piggybank/java/lib/json-simple-1.1.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/build/ivy/lib/Pig/guava-r06.jar';
raw = LOAD '/Users/nkodner/tweetsxxxxxx' using com.twitter.elephantbird.pig.load.JsonLoader();
lmtd = limit raw 100;
cc = foreach lmtd generate (chararray)$0#'text' as text,(long)$0#'id' as id,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'user') as user;
dd = foreach cc generate text,id,user#'screen_name' as name:chararray;
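As a cross-check on the (text, id, user.screen_name) goal outside of Pig, here is a minimal Python sketch of the same projection. It assumes one JSON tweet per input line with a nested `user` object; the function name `project_tweet` is made up for illustration.

```python
import json


def project_tweet(line):
    """Return (text, id, screen_name) for one JSON-encoded tweet.

    Missing fields come back as None, which also covers records that
    have no 'user' object at all (for example delete notices).
    """
    tweet = json.loads(line)
    user = tweet.get('user') or {}
    return tweet.get('text'), tweet.get('id'), user.get('screen_name')
```

Running it over the first hundred lines of the input file would give something to compare against the `dd` relation.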
neilkod / gist:2868503
Created June 4, 2012 13:48
bash one-liner to download the google books 1-gram data
nkodner@hadoop4 ~$ for i in {0..9}; do curl -O http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20090715-${i}.csv.zip; done
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 196M 100 196M 0 0 8514k 0 0:00:23 0:00:23 --:--:-- 16.0M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 196M 100 196M 0 0 16.6M 0 0:00:11 0:00:11 --:--:-- 14.1M
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 196M 100 196M 0 0 15.9M 0 0:00:12 0:00:12 --:--:-- 12.3M
neilkod / benford.sh
Created June 3, 2012 20:45
benfords law on twitter data
nkodner@hadoop4 strip_numbers$ cat numbers_from_12milliontweets.txt |awk '{print substr($1,1,1)}'|sort -n|uniq -c|sort -n
69606 7
70809 9
80228 6
80468 8
125992 0
131495 4
194264 5
369118 3
394841 2
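The tally above can be compared against what Benford's law predicts, P(d) = log10(1 + 1/d) for leading digit d. A rough Python sketch, assuming one number per line as in the awk pipeline; the function names are hypothetical.

```python
import math
from collections import Counter


def first_digit_counts(lines):
    """Count the first character of the first field on each line.

    Blank lines are skipped here, unlike in the awk pipeline.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0][0]] += 1
    return counts


def benford_expected(digit):
    """Share of values expected to start with `digit` (1 to 9)."""
    return math.log10(1 + 1.0 / digit)
```

Dividing each observed count by the total and putting it next to `benford_expected(d)` shows how closely the tweet numbers follow the law. Note that Benford's law says nothing about a leading 0.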
neilkod / strip_tweet.py
Created June 3, 2012 14:58
strip entities (urls, hashtags, usernames) from a tweet
Note: tweets are in JSON format, read from STDIN.
For each entity in entities, grab the start and end positions. Because entities can appear in any order, put each (start, end) pair on a list. After extracting all of the entities, reverse the list and trim the string (the tweet text) accordingly.
I'll clean this up and put it in a proper repo. It's some yak-shaving I needed to do for my latest data project.
#!/usr/bin/env python
import json, sys
def strip_items(str, start_pos, end_pos):
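The preview of the script is cut off here, so what follows is not the author's code. It is a minimal sketch of the approach described above, assuming the Twitter JSON layout in which `entities` holds lists named `urls`, `hashtags` and `user_mentions`, each item carrying an `indices` pair. The function name `strip_entities` is hypothetical, and the spans are sorted in descending order rather than simply reversed, so that removing one span never shifts the offsets of the spans still to be removed.

```python
def strip_entities(tweet):
    """Return the tweet text with urls, hashtags and mentions removed.

    `tweet` is a dict as produced by json.loads on one tweet.
    """
    text = tweet['text']
    entities = tweet.get('entities') or {}
    spans = []
    for kind in ('urls', 'hashtags', 'user_mentions'):
        for item in entities.get(kind, []):
            start, end = item['indices']
            spans.append((start, end))
    # Work from the rightmost span to the leftmost one.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text
```

To use it on a stream, read each line from STDIN, pass it through `json.loads`, and call `strip_entities` on the result. Whitespace around removed entities is left as is.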
neilkod / funcs.awk
Created May 29, 2012 20:03
simple histogram in awk
nkodner@hadoop4 tmp$ cat coins.txt
gold 1 1986 USA American Eagle
gold 1 1908 Austria-Hungary Franz Josef 100 Korona
silver 10 1981 USA ingot
gold 1 1984 Switzerland ingot
gold 1 1979 RSA Krugerrand
gold 0.5 1981 RSA Krugerrand
gold 0.1 1986 PRC Panda
silver 1 1986 USA Liberty dollar
gold 0.25 1986 USA Liberty 5-dollar piece
neilkod / export.sql
Created May 25, 2012 16:58
export all rows where modified dt field is > 01-jan-2010
Exports data that was modified (including created) after a certain date, in this case 01-oct-2011.
Uses an analytic function to decide whether or not to add the column delimiter. In my case, I'm using || as the delimiter since my data contains tabs and commas.
It generates /tmp/TABLE_NAME.cmd.sql and then executes it while spooling to TABLE_NAME.txt.
usage:
$ sqlplus user/pass@db @export <TABLE_NAME> <MOD_DT_FIELD_NAME>
set echo off feedb off head off pages 0 lines 500 trimspool on verify off termout off array 1000
data at https://raw.github.com/neilkod/2012_mb_corporate_run/master/data/results_2012.tsv
> raw_data = read.csv('/path/to/data',header=FALSE, sep='\t',stringsAsFactors=FALSE)
> names(raw_data) <- c('overall_position','gender_position','bib','name','time','seconds','minutes','gender','team')
> raw_data[raw_data$team=="Motorola Mobility",]
    overall_position gender_position  bib            name  time
37                37              33 2271   Roberto Munoz 20:52
43                43              39 2253   Nicolas Guyot 21:05
95                95              87 2264     Steve Lloyd 22:15
125              125             112 2231 Ronald Bochenek 22:35