Brendan O'Connor brendano
@brendano
brendano / md5sort.py
Created November 7, 2008 20:38
md5sort
#!/usr/bin/env python
""" sorts lines (or tab-sep records) by md5. (e.g. for train/test splits).
optionally prepends with the md5 id too.
brendan o'connor - anyall.org - gist.github.com/brendano """
import hashlib,sys,optparse
p = optparse.OptionParser()
p.add_option('-k', type='int', default=False)
p.add_option('-p', action='store_true')
opts,args=p.parse_args()
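
The preview above stops after option parsing. As a rough sketch of the idea stated in the docstring (order records by the MD5 of their content, optionally prepending the digest), something like the following could work. The names `md5_hex` and `md5_sort`, and the reading of `-k` as a 1-based tab-separated field index, are assumptions for illustration and not the gist's actual code.

```python
import hashlib


def md5_hex(text):
    """Return the hex MD5 digest of a string (UTF-8 encoded)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def md5_sort(lines, key_field=None, prepend=False):
    """Sort lines by MD5 digest (hypothetical helper, not from the gist).

    key_field: assumed 1-based index of the tab-separated field to hash;
               None hashes the whole line.
    prepend:   if True, prefix each output line with its digest and a tab.
    """
    def digest(line):
        if key_field is None:
            return md5_hex(line)
        return md5_hex(line.split("\t")[key_field - 1])

    ordered = sorted(lines, key=digest)
    if prepend:
        return [digest(line) + "\t" + line for line in ordered]
    return ordered
```

Because the digest depends only on the record's content, the resulting order is pseudo-random but reproducible, which is what makes it convenient for a train/test split.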
@brendano
brendano / xlsx2tsv.py
Created November 7, 2008 02:53
xlsx2tsv: python command-line script to convert xlsx (Excel "OOXML") into tab-separated values
#!/usr/bin/env python
"""
xlsx2tsv filename.xlsx [sheet number or name]
Parse a .xlsx (Excel OOXML, which is not OpenOffice) into tab-separated values.
If it has multiple sheets, need to give a sheet number or name.
Outputs honest-to-goodness tsv, no quoting or embedded \\n\\r\\t.
One reason I wrote this is because Mac Excel 2008 export to csv or tsv messes
up encodings, converting everything to something that's not utf8 (macroman
@brendano
brendano / autolog.py
Created October 10, 2008 23:00
python decorators to log all method calls, show call graphs in realtime too
# Written by Brendan O'Connor, brenocon@gmail.com, www.anyall.org
# * Originally written Aug. 2005
# * Posted to gist.github.com/16173 on Oct. 2008
# Copyright (c) 2003-2006 Open Source Applications Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
@brendano
brendano / gist:39760
Created December 24, 2008 20:11
load the MNIST data set in R
# Load the MNIST digit recognition dataset into R
# http://yann.lecun.com/exdb/mnist/
# assume you have all 4 files and gunzip'd them
# creates train$n, train$x, train$y and test$n, test$x, test$y
# e.g. train$x is a 60000 x 784 matrix, each row is one digit (28x28)
# call: show_digit(train$x[5,]) to see a digit.
# brendan o'connor - gist.github.com/39760 - anyall.org
load_mnist <- function() {
load_image_file <- function(filename) {
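
The R preview is cut off inside `load_image_file`. For readers who want the same step in Python, here is a minimal sketch of reading an unzipped MNIST image file, assuming the IDX layout described on the MNIST page: a big-endian header of four 32-bit integers (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The function name `read_idx_images` is hypothetical.

```python
import struct


def read_idx_images(path):
    """Read an unzipped MNIST IDX image file into a list of flat pixel rows.

    Hypothetical helper: returns one list of rows*cols integers (0-255)
    per image, analogous to one row of train$x in the R version.
    """
    with open(path, "rb") as f:
        # Header: magic, number of images, rows, cols (big-endian uint32).
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:
            raise ValueError("not an IDX image file (magic=%d)" % magic)
        data = f.read(n * rows * cols)
    size = rows * cols
    if len(data) != n * size:
        raise ValueError("file is shorter than its header claims")
    return [list(data[i * size:(i + 1) * size]) for i in range(n)]
```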
# -*- encoding: utf-8 -*-
# actually that encoding line is NOT important codewise. only for doc purposes.
"""
Detect emoji or other emoji-like things in Python.
The regular expressions here can be used to either identify emoji or to remove it.
The comments are written from the perspective of removing it.
The regexes get some stuff besides emoji.
by Brendan O'Connor (http://brenocon.com) 2016-10-20
originally written as part of https://arxiv.org/abs/1608.08868
@brendano
brendano / analysis.txt
Created June 14, 2011 02:56
How much text versus metadata is in a tweet?
How much text versus metadata is in a tweet?
Brendan O'Connor (brenocon.com), 2011-06-13
http://twitter.com/brendan642/status/80473880111742976
What's it mean to compare the amount of text versus metadata?
Let's start with raw size of the data that comes over the wire from Twitter.
## Get tweets out of a sample stream archive.
## (e.g. curl http://stream.twitter.com/1/statuses/sample.json)
% cat tweets.2011-05-19 | grep -P '"text":' | head -100000 > 100k_tweets
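
The preview ends before the measurement itself. One way to compare raw sizes, sketched here under assumptions, is to sum the UTF-8 byte length of each tweet's `"text"` field and divide by the byte length of the full JSON line. The function name `text_fraction` is hypothetical, and this is not necessarily how the original analysis computed its numbers.

```python
import json


def text_fraction(json_lines):
    """Fraction of bytes on the wire that belong to the tweet text.

    Hypothetical helper: expects one JSON object per line, as in the
    100k_tweets file above. Lines that are not JSON objects with a
    string "text" field are skipped.
    """
    text_bytes = 0
    total_bytes = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue
        if not isinstance(record, dict):
            continue
        text = record.get("text")
        if not isinstance(text, str):
            continue
        text_bytes += len(text.encode("utf-8"))
        total_bytes += len(line.encode("utf-8"))
    return text_bytes / total_bytes if total_bytes else 0.0
```

It could be called as `text_fraction(open("100k_tweets", encoding="utf-8"))`.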
@brendano
brendano / morpha.py
Last active April 16, 2021 19:18
Python wrapper for morpha (English lemmatizer)
"""
Wrapper around morpha from
http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html
Vaguely follows edu.stanford.nlp.Morphology except we implement with a pipe.
hacky. Would be nice to use cython/swig/ctypes to directly embed morpha.yy.c
as a python extension.
TODO compare linguistic quality to lemmatizer in python's "pattern" package
@brendano
brendano / log_logistic.py.md
Last active March 8, 2020 14:12
numerically stable implementation of the log-logistic function

Binary case

This is just the middle section of Bob Carpenter's note for evaluating log-loss via the binary logistic function https://lingpipe-blog.com/2012/02/16/howprevent-overflow-underflow-logistic-regression/

The logp function calculates the negative cross-entropy:

    dotproduct( [y, 1-y],  [logP(y=1), logP(y=0)] )

where the input s is the beta'x log-odds scalar value. The trick is to make this numerically stable for any choice of s and y.
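
A minimal sketch of the stability trick, assuming the standard approach of never exponentiating a large positive number: for `s >= 0` use `-log1p(exp(-s))`, otherwise use `s - log1p(exp(s))`. The name `logp` and its inputs `s` and `y` come from the description above; the helper `log_sigmoid` is a name introduced here for illustration, and the gist's own code may be organized differently.

```python
import math


def log_sigmoid(s):
    """log(1 / (1 + exp(-s))), computed without overflow for any finite s."""
    if s >= 0:
        # exp(-s) is in (0, 1], so it cannot overflow.
        return -math.log1p(math.exp(-s))
    # For negative s, exp(s) is in (0, 1); rewrite as s - log(1 + exp(s)).
    return s - math.log1p(math.exp(s))


def logp(s, y):
    """Negative cross-entropy: y * logP(y=1) + (1 - y) * logP(y=0).

    s is the log-odds (beta'x); y is the label in [0, 1].
    Uses P(y=0) = sigmoid(-s), so logP(y=0) = log_sigmoid(-s).
    """
    return y * log_sigmoid(s) + (1 - y) * log_sigmoid(-s)
```

A naive `math.log(1 / (1 + math.exp(-s)))` raises `OverflowError` for a large negative `s` such as -1000, whereas the branches above only ever call `exp` on a non-positive argument.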

@brendano
brendano / .Rhistory
Last active August 31, 2019 14:39
longrun_bettng_inequality
log(c(1.05,.6))
log(c(1.5,.6))
ifelse(runif(1000)>.5, 1.5, .6)
x=ifelse(runif(1000)>.5, 1.5, .6)
mean(x)
prod(x)
x=ifelse(runif(10)>.5, 1.5, .6)
x
y=replicate(100000,{x=ifelse(runif(10)>.5, 1.5, .6); prod(x)})
summary(y)
@brendano
brendano / make_views.py
Created January 2, 2011 22:53
Publish Zotero papers as HTML and symlinks
#!/usr/bin/env python
# From your Zotero database and file storage,
# creates a simple HTML table, and directory full of symlinks,
# for quick-and-dirty web or Dropbox viewing.
# Installation: place in your Zotero folder
# e.g. ~/Documents/zotero/
# And run it
# e.g. python ~/Documents/zotero/make_views.py