Brendan O'Connor brendano
@brendano
brendano / autolog.py
Created October 10, 2008 23:00
python decorators to log all method calls, show call graphs in realtime too
# Written by Brendan O'Connor, brenocon@gmail.com, www.anyall.org
# * Originally written Aug. 2005
# * Posted to gist.github.com/16173 on Oct. 2008
# Copyright (c) 2003-2006 Open Source Applications Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
brendano / merged.csv
Created October 11, 2008 09:37
political bias algorithm analysis, scraping and comparison to skewz.com - see anyall.org/blog?p=189
name,score_skewz,score_svd,url,v1,v2,v3,v4,v5
The Politico,-0.133333333333333,-0.069840595513546,politico.com,-0.0579919888228,-0.0156533209161,-0.0118276408031,-0.000672353189093,0.00899951990495
Right Wing Nut House,0.666666666666667,0.016997861495122,rightwingnuthouse.com,-0.0114438419789,0.00923210186058,-0.000332659887795,-0.00357075698976,0.0194133595538
Chicago Tribune,0.0,0.011507686305562,chicagotribune.com,-0.00487815404818,0.0062502057793,0.00472616298604,-0.00370269426842,-0.00354255787188
City Journal,0.566666666666667,0.002719928640919,city-journal.org,-0.000318806368726,0.00147728337907,0.000218460777,-0.000500262448403,-0.00112420748062
Time,-0.1,-0.01921486123282,time.com,-0.0206799675285,-0.00430661260867,-0.00335205354211,-0.00167995286891,-0.0152016073966
National Enquirer,0.533333333333333,-0.008120760725041,nationalenquirer.com,-0.00279469690892,-0.0018201000833,-0.00761346294708,0.00713945342214,-0.00165965873961
AlterNet,-0.633333333333333,-0.029834727529704,alternet.org,-0.0066
brendano / xlsx2tsv.py
Created November 7, 2008 02:53
xlsx2tsv: python command-line script to convert xlsx (Excel "OOXML") into tab-separated values
#!/usr/bin/env python
"""
xlsx2tsv filename.xlsx [sheet number or name]
Parse a .xlsx (Excel OOXML, which is not OpenOffice) into tab-separated values.
If it has multiple sheets, need to give a sheet number or name.
Outputs honest-to-goodness tsv, no quoting or embedded \\n\\r\\t.
One reason I wrote this is because Mac Excel 2008 export to csv or tsv messes
up encodings, converting everything to something that's not utf8 (macroman
brendano / setdiff.py
Created November 7, 2008 20:37
commandline set operations on files
#!/usr/bin/env python
""" set operations on files as lists. symlink this as:
* setdiff [-c] <set1> <set2> - set difference
* setand [-c] <set1> <set2> - set intersection
* setor [-c] <set1> <set2> - set union
-c means: give count of the result
Output order is randomish
We don't chomp newlines, so it's a bug if your file doesn't end with a newline
Dash - for stdin (e.g. cut/awk/sed/grep)
Though in zsh, =(bla bla) syntax is superior: can do 2 pipeline inputs
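The three operations named in the docstring above can be sketched in a few lines of Python. This is a minimal illustration, not the gist's actual code (the preview is truncated): `read_lines` and `set_op` are hypothetical names, and unlike the original this sketch strips trailing newlines, so a file without a final newline still compares cleanly.

```python
def read_lines(path):
    """Return the set of lines in a file, trailing newlines stripped."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f}


def set_op(op, lines1, lines2):
    """Apply one of the three set operations to two iterables of lines.

    `op` is the command name used in the docstring: setdiff, setand, setor.
    (Hypothetical helper for illustration only.)
    """
    a, b = set(lines1), set(lines2)
    if op == "setdiff":
        return a - b   # lines in the first input but not the second
    if op == "setand":
        return a & b   # lines present in both inputs
    if op == "setor":
        return a | b   # lines present in either input
    raise ValueError("unknown operation: %r" % op)
```

Because the result is a set, the output order is arbitrary, which matches the "randomish" note in the docstring; the `-c` behaviour would correspond to taking `len()` of the result.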
brendano / md5sort.py
Created November 7, 2008 20:38
md5sort
#!/usr/bin/env python
""" sorts lines (or tab-sep records) by md5. (e.g. for train/test splits).
optionally prepends with the md5 id too.
brendan o'connor - anyall.org - gist.github.com/brendano """
import hashlib,sys,optparse
p = optparse.OptionParser()
p.add_option('-k', type='int', default=False)
p.add_option('-p', action='store_true')
opts,args=p.parse_args()
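The preview above stops after option parsing, so the following is only a sketch of the idea the docstring describes: ordering records by the MD5 of their content gives a shuffle that is stable across runs, which is handy for train/test splits. The names `md5_hex` and `md5_sort` are hypothetical, and treating `-k` as a 1-based tab-separated field number is an assumption, not something the visible code confirms.

```python
import hashlib


def md5_hex(text):
    """Hex MD5 digest of a string, hashed as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def md5_sort(lines, key_field=None, prepend=False):
    """Sort lines by MD5 digest.

    key_field: assumed meaning of -k; a 1-based tab-separated field to hash
               instead of the whole line.
    prepend:   assumed meaning of -p; prefix each output line with its digest.
    """
    def digest(line):
        if key_field is None:
            return md5_hex(line)
        return md5_hex(line.split("\t")[key_field - 1])

    # Decorate with the digest, sort, then undecorate.
    decorated = sorted((digest(line), line) for line in lines)
    if prepend:
        return ["%s\t%s" % (d, line) for d, line in decorated]
    return [line for _, line in decorated]
```

A deterministic split could then take, say, the first 80% of `md5_sort(lines)` as training data; the same input always yields the same split.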
"""ajaxgoogle.py - Simple bindings to the AJAX Google Search API
(Just the JSON-over-HTTP bit of it, nothing to do with AJAX per se)
http://code.google.com/apis/ajaxsearch/documentation/reference.html#_intro_fonje
brendan o'connor - gist.github.com/28405 - anyall.org"""
try:
    import json
except ImportError:
    import simplejson as json
import urllib, urllib2
brendano / gist:28439
Created November 24, 2008 10:33
pipe fiddling: (1) kill buffering (2) output redir kills stdout encoding, so force it
# Pipe-oriented I/O in Python. This is harder than it should be.
# (1) Kill stdout buffering. makes redirects and tee easier to use.
# (Python 2 only: Python 3 refuses unbuffered text-mode I/O.)
import os, sys
if "<fdopen>" not in str(sys.stdout): sys.stdout = os.fdopen(1,'w',0)
# (2) Encoding madness. Note codecs.open() isn't available to us since we're using pipes.
import codecs
sys.stdout = codecs.EncodedFile(sys.stdout,'utf-8','utf-8','ignore')
# or this too .. sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
# I'm interested in safely handling potentially garbled input data, so want to protect stdin.
# You'd think this would work:
brendano / gist:39760
Created December 24, 2008 20:11
load the MNIST data set in R
# Load the MNIST digit recognition dataset into R
# http://yann.lecun.com/exdb/mnist/
# assume you have all 4 files and gunzip'd them
# creates train$n, train$x, train$y and test$n, test$x, test$y
# e.g. train$x is a 60000 x 784 matrix, each row is one digit (28x28)
# call: show_digit(train$x[5,]) to see a digit.
# brendan o'connor - gist.github.com/39760 - anyall.org
load_mnist <- function() {
  load_image_file <- function(filename) {
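The R preview above is cut off before the file parsing begins. For reference, the gunzip'd MNIST files use the IDX layout: a big-endian 32-bit magic number (2051 for images, 2049 for labels), big-endian 32-bit dimension sizes, then unsigned bytes. A minimal Python sketch of that parsing, with hypothetical function names and no claim to mirror the R code, could look like this:

```python
import struct


def parse_idx_images(data):
    """Parse the bytes of an IDX image file into a list of flat pixel rows."""
    magic, n, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError("not an IDX image file (magic %d)" % magic)
    size = rows * cols
    pixels = data[16:16 + n * size]
    if len(pixels) != n * size:
        raise ValueError("file is shorter than its header claims")
    # One list of rows*cols pixel values (0-255) per image.
    return [list(pixels[i * size:(i + 1) * size]) for i in range(n)]


def parse_idx_labels(data):
    """Parse the bytes of an IDX label file into a list of integer labels."""
    magic, n = struct.unpack(">II", data[:8])
    if magic != 2049:
        raise ValueError("not an IDX label file (magic %d)" % magic)
    labels = data[8:8 + n]
    if len(labels) != n:
        raise ValueError("file is shorter than its header claims")
    return list(labels)
```

With the real training image file read via `open(path, "rb").read()`, `parse_idx_images` would return 60000 rows of 784 values, the same shape the comments above give for `train$x`.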
CSV from PostgreSQL, at least as far as I can tell. I'm sure it messes up embedded quotes and maybe embedded commas.
psql.csv() { psql -qAF , "$@" | egrep -v '^\([0-9]+ rows?\)$'; }
brendano / tabsort
Created February 7, 2009 19:59
tabsort
#!/bin/bash
export TAB=$(echo -e "\t")
exec sort "-t$TAB" "$@"