neil kodner neilkod

## gist:3167664
for itm in itms:
  data = itm
  rec=coll.find({'text':data})
  if rec.count() == 0:
    print "new item: {0}".format(itm.strip())
    coll.insert({'text':data,'count':0,'last_posted': OLD_DATE})


Can the coll.find/coll.insert pair be rewritten as an upsert (coll.update) **that preserves any existing values of count (integer) and last_posted (date)**?
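
One possible answer, sketched under assumptions: MongoDB 2.4 and later have the `$setOnInsert` update operator, which applies its fields only when the upsert creates a new document, so `count` and `last_posted` on existing documents are left alone. The sketch uses PyMongo's `update_one` (PyMongo 3+) rather than the older `coll.update`. `add_new_items` is a hypothetical helper name, and the `OLD_DATE` value is a placeholder because the snippet above does not show how it is defined.

```python
from datetime import datetime

# Placeholder value; the original snippet does not show how OLD_DATE is defined.
OLD_DATE = datetime(1970, 1, 1)


def add_new_items(coll, itms, old_date=OLD_DATE):
    """Upsert each item by 'text' without touching existing documents.

    `coll` is assumed to be a pymongo Collection (PyMongo 3+), or any object
    with a compatible update_one(filter, update, upsert=...) method.
    Returns the list of items that were newly inserted.
    """
    inserted = []
    for itm in itms:
        result = coll.update_one(
            {'text': itm},
            # $setOnInsert only applies when the upsert creates a document,
            # so count and last_posted on existing documents are preserved.
            {'$setOnInsert': {'count': 0, 'last_posted': old_date}},
            upsert=True,
        )
        # upserted_id is None when a matching document already existed.
        if result.upserted_id is not None:
            print("new item: {0}".format(itm.strip()))
            inserted.append(itm)
    return inserted
```

This does one round trip per item instead of a find followed by an insert, and it avoids the window in which another writer could insert the same text between the two calls.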

## elephant bird demo.pig
Sample Pig script; runs fine in local mode. The Elephant Bird magic is the JsonLoader() in the LOAD command, followed by converting user to a Java map so that I can extract screen_name. I haven't read the docs yet, so there may be a better way to do this. I'm sure I can combine the two generate statements into one; this is just a first attempt.

REGISTER '/Users/nkodner/Downloads/cdh3/elephant-bird/build/elephant-bird-2.2.4-SNAPSHOT.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/contrib/piggybank/java/lib/json-simple-1.1.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/build/ivy/lib/Pig/guava-r06.jar';
raw = LOAD '/Users/nkodner/clean_tweets/with_deletedaa' using com.twitter.elephantbird.pig.load.JsonLoader();
bah = limit raw 100;
cc = foreach bah generate (chararray)$0#'text' as text,(long)$0#'id' as id,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'user') as user;
dd = foreach cc generat

## gist:2898057
nkodner@hadoop4 pig-0.10.0$ ant test
Buildfile: /Users/nkodner/Downloads/pig-0.10.0/build.xml

test:

ivy-download:
      [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
      [get] To: /Users/nkodner/Downloads/pig-0.10.0/ivy/ivy-2.2.0.jar
      [get] Not modified - so not downloaded

## gist:2897924
goal is (text,id,user.screen_name)

REGISTER '/Users/nkodner/Downloads/cdh3/elephant-bird/build/elephant-bird-2.2.4-SNAPSHOT.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/contrib/piggybank/java/lib/json-simple-1.1.jar';
REGISTER '/Users/nkodner/Downloads/cdh3/pig-0.8.1-cdh3u4/build/ivy/lib/Pig/guava-r06.jar';
raw = LOAD '/Users/nkodner/tweetsxxxxxx' using com.twitter.elephantbird.pig.load.JsonLoader();
lmtd = limit raw 100;
cc = foreach lmtd generate (chararray)$0#'text' as text,(long)$0#'id' as id,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'user') as user;
dd = foreach cc generate text,id,user#'screen_name' as name:chararray;
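
For cross-checking the Pig output on a small sample, the same (text, id, user.screen_name) projection can be sketched in plain Python. This is not part of the gist; `project_tweet` and `project_tweets` are hypothetical names, and the input is assumed to be one JSON-encoded tweet per line.

```python
import json


def project_tweet(line):
    """Return (text, id, screen_name) for one JSON-encoded tweet line."""
    tweet = json.loads(line)
    # Some records (deletion notices, for example) may have no user object.
    user = tweet.get('user') or {}
    return (tweet.get('text'), tweet.get('id'), user.get('screen_name'))


def project_tweets(lines, limit=100):
    """Project up to `limit` non-blank lines, roughly like `limit raw 100`."""
    out = []
    for line in lines:
        if len(out) >= limit:
            break
        if line.strip():
            out.append(project_tweet(line))
    return out
```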

## gist:2868503
nkodner@hadoop4 ~$ for i in {0..9}; do curl -O http://commondatastorage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20090715-${i}.csv.zip; done
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  196M  100  196M    0     0  8514k      0  0:00:23  0:00:23 --:--:-- 16.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  196M  100  196M    0     0  16.6M      0  0:00:11  0:00:11 --:--:-- 14.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  196M  100  196M    0     0  15.9M      0  0:00:12  0:00:12 --:--:-- 12.3M

## benford.sh
nkodner@hadoop4 strip_numbers$ cat numbers_from_12milliontweets.txt |awk '{print substr($1,0,1)}'|sort -n|uniq -c|sort -n
69606 7
70809 9
80228 6
80468 8
125992 0
131495 4
194264 5
369118 3
394841 2
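
The same leading-digit tally can be sketched in Python, which also makes it easy to compare against the proportion Benford's law predicts for digit d, log10(1 + 1/d). The function names are hypothetical, and the input is assumed to be one number per line as in the pipeline above.

```python
import math
from collections import Counter


def leading_digit_counts(lines):
    """Count the first character of the first whitespace-separated field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0][0]] += 1
    return counts


def benford_expected(digit):
    """Proportion of values Benford's law predicts for a leading digit 1-9."""
    return math.log10(1 + 1.0 / digit)
```

Dividing each count by the total and comparing it with `benford_expected(d)` for d in 1 through 9 gives a quick check; leading zeros fall outside Benford's law and would need to be dropped first.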

## strip_tweet.py
Note: tweets are in JSON format, coming from STDIN.

For each entity in entities, grab the start and end position. Because entities can appear in any order, put each (start, end) pair on a list. After extracting all of the entities, reverse the list and trim the tweet text from the back, which keeps the earlier offsets valid.

I'll clean this up and put it in a proper repo. It's some yak-shaving I needed to do for my latest data project.


#!/usr/bin/env python
import json, sys
def strip_items(str, start_pos, end_pos):
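
The preview stops at the function signature, so the following is only a minimal sketch of the approach described in the note, not the gist's actual code. It assumes the Twitter entities layout, where each entity carries an `indices` pair of [start, end] offsets into the tweet text, and it sorts the spans explicitly before walking them in reverse. `entity_spans`, `strip_entities`, and `strip_lines` are hypothetical names, and overlapping spans are not handled.

```python
import json


def entity_spans(tweet):
    """Collect (start, end) pairs from every entity in tweet['entities']."""
    spans = []
    for items in tweet.get('entities', {}).values():
        for item in items:
            start, end = item['indices']
            spans.append((start, end))
    return spans


def strip_entities(tweet):
    """Return the tweet text with every entity span removed."""
    text = tweet['text']
    # Work from the end of the string backwards so that removing one span
    # does not shift the offsets of the spans still to be removed.
    for start, end in sorted(entity_spans(tweet), reverse=True):
        text = text[:start] + text[end:]
    return text


def strip_lines(lines):
    """Yield stripped text for each non-blank line of JSON-encoded tweets."""
    for line in lines:
        if line.strip():
            yield strip_entities(json.loads(line))
```

Passing `sys.stdin` to `strip_lines` and printing each result would match the STDIN input described in the note.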

## funcs.awk
nkodner@hadoop4 tmp$ cat coins.txt
gold     1    1986  USA                 American Eagle
gold     1    1908  Austria-Hungary     Franz Josef 100 Korona
silver  10    1981  USA                 ingot
gold     1    1984  Switzerland         ingot
gold     1    1979  RSA                 Krugerrand
gold     0.5  1981  RSA                 Krugerrand
gold     0.1  1986  PRC                 Panda
silver   1    1986  USA                 Liberty dollar
gold     0.25 1986  USA                 Liberty 5-dollar piece

## export.sql
Exports data that was modified (including created) after a certain date, in this case 01-oct-2011.

Uses an analytic function to decide whether or not to add the column delimiter. In my case, I'm using || as the delimiter since my data contains tabs and commas.

It generates /tmp/TABLE_NAME.cmd.sql and then executes it while spooling to TABLE_NAME.txt.

usage:
$ sqlplus user/pass@db @export <TABLE_NAME> <MOD_DT_FIELD_NAME>

set echo off feedb off head off pages 0 lines 500 trimspool on verify off termout off array 1000
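
The rest of the script is not shown in the preview. As an illustration of the delimiter idea only, here is a hedged Python sketch that builds the kind of SELECT such a generated script might contain. It is not the author's SQL: `build_export_query` is a hypothetical helper, the column list is assumed to be known up front, and joining the list in Python replaces the analytic-function check for the last column.

```python
def build_export_query(table, columns, mod_dt_field,
                       since='01-oct-2011', delimiter='||'):
    """Build a SELECT that emits one delimited text line per modified row."""
    if not columns:
        raise ValueError('at least one column is required')
    # Joining the column list puts the delimiter between columns only, so
    # nothing trails the last column.
    sep = " || '{0}' || ".format(delimiter)
    select_list = sep.join(columns)
    return ("select {cols} from {table} "
            "where {field} > to_date('{since}', 'dd-mon-yyyy')").format(
                cols=select_list, table=table, field=mod_dt_field, since=since)
```

The identifiers are interpolated directly into the statement, so this is only suitable for trusted table and column names.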

## gist:2270775
data at https://raw.github.com/neilkod/2012_mb_corporate_run/master/data/results_2012.tsv

> raw_data = read.csv('/path/to/data',header=FALSE, sep='\t',stringsAsFactors=FALSE)
> names(raw_data) <- c('overall_position','gender_position','bib','name','time','seconds','minutes','gender','team')
> raw_data[raw_data$team=="Motorola Mobility",]
     overall_position gender_position  bib                    name  time
37                 37              33 2271           Roberto Munoz 20:52
43                 43              39 2253           Nicolas Guyot 21:05
95                 95              87 2264             Steve Lloyd 22:15
125               125             112 2231         Ronald Bochenek 22:35
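
For anyone without R at hand, the same filter can be sketched with Python's standard `csv` module. The column names are taken from the `names(raw_data)` call above; `rows_for_team` is a hypothetical helper, and the file is assumed to be tab-separated with no header row, as in the `read.csv` call.

```python
import csv

COLUMNS = ['overall_position', 'gender_position', 'bib', 'name', 'time',
           'seconds', 'minutes', 'gender', 'team']


def rows_for_team(lines, team):
    """Return rows (dicts keyed by COLUMNS) whose team field equals `team`."""
    reader = csv.DictReader(lines, fieldnames=COLUMNS, delimiter='\t')
    return [row for row in reader if row['team'] == team]
```

Opening the downloaded file and calling `rows_for_team(f, 'Motorola Mobility')` would select the same rows as the R expression; all values come back as strings.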