Brendan O'Connor brendano

brendano / analysis.txt
Created June 14, 2011 02:56
How much text versus metadata is in a tweet?
Brendan O'Connor (brenocon.com), 2011-06-13
http://twitter.com/brendan642/status/80473880111742976
What does it mean to compare the amount of text versus metadata?
Let's start with the raw size of the data that comes over the wire from Twitter.
## Get tweets out of a sample stream archive.
## (e.g. curl http://stream.twitter.com/1/statuses/sample.json)
% grep -P '"text":' tweets.2011-05-19 | head -100000 > 100k_tweets
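One way to make the "raw size" comparison concrete is to count, for each tweet, the bytes of its "text" field against the bytes of the whole JSON record. The sketch below is an assumption about how such a measurement could be done, not the analysis from the gist itself; the function name `text_vs_total_bytes` and the sample records are hypothetical.

```python
import json

def text_vs_total_bytes(lines):
    """Return (text_bytes, total_bytes) over one-JSON-object-per-line input.

    Assumption: each tweet is a JSON object with a string "text" field.
    Lines that do not parse, or that lack "text", are skipped.
    """
    text_bytes = 0
    total_bytes = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except ValueError:
            continue  # not valid JSON, e.g. a truncated line
        if not isinstance(record, dict) or not isinstance(record.get("text"), str):
            continue  # e.g. a delete notice rather than a tweet
        # Size of the decoded text as UTF-8; on the wire, \u escapes may
        # make the same text take more bytes than this.
        text_bytes += len(record["text"].encode("utf-8"))
        total_bytes += len(line.encode("utf-8"))
    return text_bytes, total_bytes

# Made-up records for illustration; real input would be the lines of a
# file such as 100k_tweets.
sample = [
    '{"id": 1, "text": "hello world", "user": {"screen_name": "example", "lang": "en"}}',
    '{"delete": {"status": {"id": 2}}}',
]
text_b, total_b = text_vs_total_bytes(sample)
print("text bytes:", text_b, "total bytes:", total_b)
print("text fraction: %.3f" % (text_b / total_b))
```

To run it on real data, pass an open file instead of the sample list, for example `text_vs_total_bytes(open("100k_tweets", encoding="utf-8"))`.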
brendano / sim.py
Created March 24, 2012 02:35
Matrix-tree theorem for CRF marginal dependencies via matrix inversion, from Koo et al. 2007
In [9]: run -i sim
for each word: prob connect to root
[ 0.28026994 0.16394082 0.10616135 0.17767563 0.12675216 0.1452001 ]
for (head,child) entries: P(head <- child)
[[ 0. 0.12563458 0.27335659 0.17451717 0.24229165 0.24617475]
[ 0.25410789 0. 0.21883649 0.09361327 0.17785164 0.2048422 ]
[ 0.12280784 0.12047786 0. 0.13921346 0.12119944 0.11342211]
[ 0.11823039 0.27609723 0.15487263 0. 0.22541093 0.15197973]
[ 0.11249058 0.1968143 0.09766069 0.21988013 0. 0.13838112]
[ 0.11209336 0.11703521 0.14911225 0.19510033 0.10649417 0. ]]
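The source of sim.py is not shown here, so the following is only a sketch of how marginals like those above could be computed, using the single-root construction from Koo et al. 2007: build the Laplacian of the exponentiated edge scores, replace its first row with the root weights, invert it, and read the marginals off the inverse. The function name `tree_marginals` and the random scores are assumptions. As in the output above, row h and column m of the edge matrix hold the probability that word h is the head of word m.

```python
import numpy as np

def tree_marginals(root_scores, edge_scores):
    """Single-root dependency marginals via the matrix-tree theorem.

    root_scores[m]    : log-potential for the root attaching to word m
    edge_scores[h, m] : log-potential for word h heading word m
    Returns (root_marg, edge_marg) with the same shapes.
    """
    r = np.exp(np.asarray(root_scores, dtype=float))
    A = np.exp(np.asarray(edge_scores, dtype=float))
    np.fill_diagonal(A, 0.0)          # no self-loops
    n = len(r)

    # Laplacian: column sums on the diagonal, -A elsewhere.
    L = -A
    L[np.diag_indices(n)] = A.sum(axis=0)
    # Single-root variant: the first row is replaced by the root weights.
    L[0, :] = r

    Linv = np.linalg.inv(L)

    # mu(root -> m) = r[m] * Linv[m, 0]
    root_marg = r * Linv[:, 0]

    # mu(h -> m) = A[h, m] * ([m != 0] * Linv[m, m] - [h != 0] * Linv[m, h])
    diag = np.diag(Linv).copy()
    diag[0] = 0.0
    off = Linv.T.copy()               # off[h, m] == Linv[m, h]
    off[0, :] = 0.0
    edge_marg = A * (diag[None, :] - off)
    return root_marg, edge_marg

# Illustration with random scores for a 6-word sentence (made-up data).
rng = np.random.default_rng(0)
root_marg, edge_marg = tree_marginals(rng.normal(size=6), rng.normal(size=(6, 6)))
print("for each word: prob connect to root")
print(root_marg)
print("for (head,child) entries: P(head <- child)")
print(edge_marg)
```

A quick sanity check on any such output: every word has exactly one head, so each column of the edge matrix plus the matching root probability should sum to 1, and the root probabilities themselves should sum to 1 in the single-root case. The numbers printed above satisfy both.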
brendano / NOTES.md
Created June 12, 2012 20:03
Patches to compile ocropus on Mac OSX 10.6 -- see explanation at NOTES.md at bottom https://gist.github.com/2919800#file_notes.md

by Brendan O'Connor (http://brenocon.com)

I got all of ocropus to compile on Mac OSX 10.6, though I haven't tested it much yet. This is the current version inside the ocropus hg repository, so approximately version 0.5, with iulib perhaps 0.4ish.

See ocroinst.osx -- the first file in "everything_besides_iulib.diff" -- for line-by-line instructions; the script may even just run. We're assuming Homebrew and pip (see the comments).