Brendan O'Connor brendano

325 !! !!! !!!! !!!!! !!!!!! !!!!!!! !!!!!!!! !!!!!!!!! !!!!!!!.. !!.. !!.... !!: !!?? !" !' !. !... !: !? ", ". ": #1 #2 #2010 #39 #4 #8217 #ui #ww $1 $10 $100 $1000 $188 $2 $20 $200 $25 $32 $379 $5 && '" '' ($149 ($169 (( ((( (((((((((((((((((((((((((((((((( (: (= (@ (^_^) (¬_¬ )( ))) ): *)) ** **awwyyy *] +22 ," ,... -& -- --- ----- ------> ----> ---->>> ---> --> -6 -> -_- -__- -___- .! ." .' ., .. ... ..." .... ..... ...... ....... ........ ............ ...: ..: .: .?!! 0-1 0.00 00 04:45 09 1,000 1-0 1/2 10 10.27 10.4 10/26 10/27- 10/27/2010 10/30/10 100% 100,000 10093 101 101.1 106 107.5 109a 10:45 10:55 10¢ 11 11/01/10 11:30 12 12.99 1200 1221 13 13% 13.94 14:14 15 15.7 15/30 16 161 17 17% 175 1793 17:27 18 1895 18:1 1980 1995 2+3 2.0 2.3 2.5 2/3 20 20% 2008 2010 2011 2014 2020 21 2221 23 23.0 257 26 27 28 29 29.676 30 30% 300 31 31.1 33.1 330 35 360 3:00 4,900 4-8 4.25 40 40% 401 45% 465.00 48 5'1 50 500 5000 516 52 53 55 56.3 57 5o 6-8 6.05 6.95 60-80% 63 640 6501 67 6:30 6:40 70 7046614311 75 76 7
@brendano
brendano / xlsx2tsv.py
Created November 7, 2008 02:53
xlsx2tsv: python command-line script to convert xlsx (Excel "OOXML") into tab-separated values
#!/usr/bin/env python
"""
xlsx2tsv filename.xlsx [sheet number or name]
Parse a .xlsx (Excel OOXML, which is not OpenOffice) into tab-separated values.
If it has multiple sheets, need to give a sheet number or name.
Outputs honest-to-goodness tsv, no quoting or embedded \\n\\r\\t.
One reason I wrote this is because Mac Excel 2008 export to csv or tsv messes
up encodings, converting everything to something that's not utf8 (macroman
% booktabs example for hierarchical columns
% for Table 2 of
% https://aclanthology.org/2021.findings-acl.371.pdf
% halterman, keith, sarwar, o'connor, 2021
% leaving out some stuff
% just the 'tabular' environ
% the minipage stuff is irrelevant it was just for a two column thing
brendano / gist:39760
Created December 24, 2008 20:11
load the MNIST data set in R
# Load the MNIST digit recognition dataset into R
# http://yann.lecun.com/exdb/mnist/
# assume you have all 4 files and gunzip'd them
# creates train$n, train$x, train$y and test$n, test$x, test$y
# e.g. train$x is a 60000 x 784 matrix, each row is one digit (28x28)
# call: show_digit(train$x[5,]) to see a digit.
# brendan o'connor - gist.github.com/39760 - anyall.org
load_mnist <- function() {
load_image_file <- function(filename) {
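The R preview above is cut off, so the following is only a rough Python sketch of the same idea, not the author's code: it parses the gunzip'd MNIST IDX files described in the comments. The function names `parse_idx_images` and `parse_idx_labels` are hypothetical, and the sketch assumes the standard IDX layout (big-endian header, magic number 2051 for images and 2049 for labels).

```python
import struct


def parse_idx_images(data: bytes):
    """Parse an IDX image file (already gunzip'd) into a list of flat pixel rows."""
    # Header: magic number, image count, row count, column count (big-endian uint32).
    magic, n, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError("not an IDX image file (magic=%d)" % magic)
    size = rows * cols
    pixels = data[16:]
    if len(pixels) != n * size:
        raise ValueError("pixel data length does not match header")
    # One list of rows*cols unsigned bytes per image (784 values for 28x28 digits).
    return [list(pixels[i * size:(i + 1) * size]) for i in range(n)]


def parse_idx_labels(data: bytes):
    """Parse an IDX label file (already gunzip'd) into a list of integer labels."""
    # Header: magic number and label count (big-endian uint32).
    magic, n = struct.unpack(">II", data[:8])
    if magic != 2049:
        raise ValueError("not an IDX label file (magic=%d)" % magic)
    return list(data[8:8 + n])
```

Reading a file with `open(path, "rb").read()` and passing the bytes to these functions would give the rough equivalent of `train$x` and `train$y`.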
brendano / md5sort.py
Created November 7, 2008 20:38
md5sort
#!/usr/bin/env python
""" sorts lines (or tab-sep records) by md5. (e.g. for train/test splits).
optionally prepends with the md5 id too.
brendan o'connor - anyall.org - gist.github.com/brendano """
import hashlib,sys,optparse
p = optparse.OptionParser()
p.add_option('-k', type='int', default=False)
p.add_option('-p', action='store_true')
opts,args=p.parse_args()
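The preview stops before the sorting logic, so here is a minimal sketch of what sorting by MD5 could look like. It is an illustration under assumptions, not the gist's actual body: the helper names `md5_key` and `md5_sort` are hypothetical, `field` is assumed to play the role of `-k` as a 1-based tab-separated column, and `prepend` is assumed to match `-p`.

```python
import hashlib


def md5_key(line, field=None):
    """Hex MD5 of the whole line, or of one 1-based tab-separated field (assumed semantics)."""
    text = line.rstrip("\n")
    if field is not None:
        text = text.split("\t")[field - 1]
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def md5_sort(lines, field=None, prepend=False):
    """Return lines ordered by their MD5 key, optionally prefixed with that key."""
    keyed = sorted((md5_key(line, field), line.rstrip("\n")) for line in lines)
    return [key + "\t" + line if prepend else line for key, line in keyed]
```

Because the hash depends only on the record's content, the resulting order is pseudo-random but reproducible, which is what makes it handy for train/test splits: taking the first N lines of the output always yields the same subset.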
brendano / autolog.py
Created October 10, 2008 23:00
python decorators to log all method calls, show call graphs in realtime too
# Written by Brendan O'Connor, brenocon@gmail.com, www.anyall.org
# * Originally written Aug. 2005
# * Posted to gist.github.com/16173 on Oct. 2008
# Copyright (c) 2003-2006 Open Source Applications Foundation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
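The preview shows only the license header, so the decorator itself is not visible here. As a rough sketch of the general technique the description names (logging every call through a decorator), and not the gist's own implementation, something like the following would work. The name `log_calls` and the list-based log are assumptions made for illustration.

```python
import functools


def log_calls(log):
    """Build a decorator that records each call and return value in `log` (a list)."""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            log.append("call %s args=%r kwargs=%r" % (func.__name__, args, kwargs))
            result = func(*args, **kwargs)
            log.append("return %s -> %r" % (func.__name__, result))
            return result
        return wrapper
    return decorator
```

Applying `@log_calls(events)` to a function appends one entry on entry and one on exit, so nested calls show up in call-graph order in `events`.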
# -*- encoding: utf-8 -*-
# actually that encoding line is NOT important codewise. only for doc purposes.
"""
Detect emoji or other emoji-like things in Python.
The regular expressions here can be used to either identify emoji or to remove it.
The comments are written from the perspective of removing it.
The regexes get some stuff besides emoji.
by Brendan O'Connor (http://brenocon.com) 2016-10-20
originally written as part of https://arxiv.org/abs/1608.08868
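The actual regexes are not part of this preview. A minimal sketch of the approach, assuming Python 3 and a hand-picked set of Unicode blocks rather than the author's patterns, might look like this. `EMOJI_SKETCH` and `strip_emoji` are hypothetical names, and like the original, the character class matches some symbols that are not strictly emoji.

```python
import re

# A few Unicode blocks that contain most common emoji (illustrative, not exhaustive).
EMOJI_SKETCH = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\u2600-\u26FF"          # Miscellaneous Symbols
    "\u2700-\u27BF"          # Dingbats
    "]+"
)


def strip_emoji(text):
    """Remove every run of characters matched by EMOJI_SKETCH."""
    return EMOJI_SKETCH.sub("", text)
```

The same compiled pattern can be used with `EMOJI_SKETCH.search(text)` to detect emoji instead of removing it.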
brendano / analysis.txt
Created June 14, 2011 02:56
How much text versus metadata is in a tweet?
How much text versus metadata is in a tweet?
Brendan O'Connor (brenocon.com), 2011-06-13
http://twitter.com/brendan642/status/80473880111742976
What's it mean to compare the amount of text versus metadata?
Let's start with raw size of the data that comes over the wire from Twitter.
## Get tweets out of a sample stream archive.
## (e.g. curl http://stream.twitter.com/1/statuses/sample.json)
% cat tweets.2011-05-19 | grep -P '"text":' | head -100000 > 100k_tweets
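The rest of the analysis is not shown in this preview. One way to measure the raw-size comparison described above, sketched here under the assumption that each line of `100k_tweets` is one JSON object with a `"text"` field, is to divide the UTF-8 size of the text by the size of the whole line. The function name `text_fraction` is hypothetical.

```python
import json


def text_fraction(raw_line):
    """Fraction of a tweet's raw JSON bytes taken up by its text field."""
    tweet = json.loads(raw_line)
    text_bytes = len(tweet["text"].encode("utf-8"))
    total_bytes = len(raw_line.strip().encode("utf-8"))
    return text_bytes / total_bytes
```

Averaging `text_fraction(line)` over the file gives a single text-versus-metadata figure. Note that this counts the decoded text, so JSON escape sequences in the raw line are attributed to metadata.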
brendano / morpha.py
Last active April 16, 2021 19:18
Python wrapper for morpha (English lemmatizer)
"""
Wrapper around morpha from
http://www.informatics.sussex.ac.uk/research/groups/nlp/carroll/morph.html
Vaguely follows edu.stanford.nlp.Morphology except we implement with a pipe.
hacky. Would be nice to use cython/swig/ctypes to directly embed morpha.yy.c
as a python extension.
TODO compare linguistic quality to lemmatizer in python's "pattern" package