Brendan O'Connor brendano

http://opinionator.blogs.nytimes.com/2012/08/08/hear-all-ye-people-hearken-o-earth/
http://news.ycombinator.com/item?id=4362277
http://news.ycombinator.com/item?id=4365086
2009-01 19
2009-02 20
2009-03 48
2009-04 100
2009-05 275
2009-06 292
2009-07 494
2009-08 259
2009-09 207
2009-10 65
{
"contributors": null,
"coordinates": null,
"created_at": "Tue Jul 03 00:35:49 +0000 2012",
"entities": {
"hashtags": [],
"urls": [],
"user_mentions": []
},
"favorited": false,
2009-06-01 4507 1 [470275]
2009-06-08 4507 1 [422879]
2009-06-22 4507 1 [257976]
2009-07-06 4507 1 [69444]
2009-07-13 4507 4 [16042,237457,457813,61273]
2009-07-20 4507 6 [422879,358078,82438,34891,47316,97749]
2009-07-27 4507 2 [423154,477645]
2009-08-10 4507 6 [356596,316917,99418,247707,230452,3538]
2009-08-17 4507 7 [82438,263332,23135,35494,94656,122471,272590]
2009-08-24 4507 7 [157430,338463,426119,157430,405565,309448,338463]
# handle the wikipedia dump format
module WikiDump
def self.yield_page_strings(stream)
buf = ""
stream.each do |line|
if line =~ /^\s* <page> \s*$/x
buf = ""
#!/usr/bin/env ruby
#
# Data structures and processing of documents to be indexed, e.g. wikipedia
# pages. ok, everything is wikipedia-specific. :-)
#
# This file can be executed for various sorts of testing (see bottom)
require File.dirname(__FILE__)+'/common'
#!/usr/bin/env python
"""
Convert STDIN to UTF-8
based on character encoding detection
"""
import sys, json, itertools
from chardet.universaldetector import UniversalDetector
detector = UniversalDetector()
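
The preview above stops right after the detector is created, so the rest of the script is not shown. As a rough sketch of how detection-then-conversion could work, and not the author's actual code, the function below feeds bytes to chardet's `UniversalDetector` line by line, stops once the detector is confident, and re-encodes the input as UTF-8. The function name `to_utf8`, the `fallback` parameter, the `errors="replace"` policy, and the guarded import are all assumptions made for illustration.

```python
try:
    from chardet.universaldetector import UniversalDetector
except ImportError:  # chardet is third-party; assumed optional in this sketch
    UniversalDetector = None


def to_utf8(raw, fallback="utf-8"):
    """Return `raw` (bytes) re-encoded as UTF-8, guessing the source encoding.

    Hypothetical helper: the name, the fallback encoding, and the
    errors="replace" policy are illustrative choices, not from the gist.
    """
    encoding = None
    if UniversalDetector is not None:
        detector = UniversalDetector()
        # Feed the input incrementally; stop early once chardet is confident.
        for line in raw.splitlines(True):
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        encoding = detector.result.get("encoding")
    # If detection found nothing (or chardet is missing), use the fallback.
    text = raw.decode(encoding or fallback, errors="replace")
    return text.encode("utf-8")
```

To mirror the "Convert STDIN to UTF-8" description, one could call `sys.stdout.buffer.write(to_utf8(sys.stdin.buffer.read()))`, at the cost of reading the whole input into memory first.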
brendano / NOTES.md
Created June 12, 2012 20:03
Patches to compile ocropus on Mac OSX 10.6 -- see explanation at NOTES.md at bottom https://gist.github.com/2919800#file_notes.md

by Brendan O'Connor (http://brenocon.com)

I got all of ocropus to compile on Mac OSX 10.6, though I haven't tested it much yet. This is the current version inside the ocropus hg repository, so approximately version 0.5, with iulib perhaps 0.4ish.

See ocroinst.osx -- the first file in "everything_besides_iulib.diff" -- for line-by-line instructions; the script may even just run. We're assuming Homebrew and pip (see the comments).

#!/usr/bin/env python
r"""
vertunion File1 File2 ....
Iterates through parallel files, each row
"DocID \t JSON1" "DocID \t JSON2" ....
and outputs
"DocID \t UnionOfJSONs"
Union of key-value pairs, that is.
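
The docstring preview above describes the behaviour but the implementation is cut off. A minimal sketch of the described union, under stated assumptions, might look like the following. It assumes the files are row-aligned (row *i* of every file carries the same DocID), that fields are separated by a single tab with no padding, that each JSON value is an object, and that later files win when keys collide. None of these details are confirmed by the preview, and the function signature is hypothetical.

```python
import json


def vertunion(streams):
    """Yield "DocID\\tUnionOfJSONs" rows from parallel iterables of lines.

    Assumptions (not confirmed by the original gist): inputs are
    row-aligned, tab-separated, and later inputs override earlier keys.
    """
    for rows in zip(*streams):
        doc_id = None
        merged = {}
        for row in rows:
            row_id, payload = row.rstrip("\n").split("\t", 1)
            if doc_id is None:
                doc_id = row_id
            elif row_id != doc_id:
                # Parallel files are expected to agree on the DocID per row.
                raise ValueError("DocID mismatch: %r vs %r" % (doc_id, row_id))
            merged.update(json.loads(payload))
        yield "%s\t%s" % (doc_id, json.dumps(merged, sort_keys=True))
```

With real files, something like `vertunion([open(p) for p in sys.argv[1:]])` would match the `vertunion File1 File2 ....` usage line. Note that `zip` stops at the shortest input, so files of unequal length are silently truncated in this sketch.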
The economy 's temperature will be taken from several vantage points this week , with readings on trade , output , housing and inflation . {"root_mar":[0.00008,0.00798,0.00182,0.01688,0.87625,0.01018,0.01229,0.00165,0.00052,0.00076,0.01282,0.00046,0.00662,0.00095,0.01759,0.00537,0.00439,0.00125,0.00062,0.00356,0.00063,0.01118,0.00293,0.00226,0.00097],"edge_mar":[[0.00001,0.00047,0.00121,0.00102,0.00128,0.00042,0.00087,0.00035,0.00027,0.00022,0.00134,0.00021,0.0014,0.00228,0.00408,0.00097,0.00177,0.00039,0.00188,0.00074,0.00187,0.00175,0.00159,0.00052,0.0029],[0.83539,0.00001,0.93406,0.0138,0.01333,0.00313,0.01055,0.0033,0.0029,0.00128,0.00686,0.00241,0.008,0.02723,0.02736,0.00517,0.0124,0.00233,0.0233,0.00504,0.02344,0.01223,0.01524,0.00345,0.01125],[0.00251,0.00049,0.00001,0.00273,0.00634,0.00138,0.00285,0.00112,0.00076,0.00072,0.004,0.00061,0.00469,0.00675,0.01124,0.00284,0.00427,0.00119,0.00533,0.00235,0.00504,0.00571,0.00479,0.00163,0.01046],[0.08547,0.75282,0.01562,0.00001,0.01579,0.00701,0.01237,0.0046,