@bmschmidt
bmschmidt / gist:923dce0330d72486ee8d
Last active December 18, 2015 19:29
Starting to make
grams=3
corpus="eng-all"
searchstring="attention"
simultaneousDownloads="16"
# THIS PROCESS IS EXTREMELY RESOURCE INTENSIVE--DO NOT RUN IT ON A LARK OR IF YOU DON'T UNDERSTAND WHAT IT DOES,
# BECAUSE WASTING ENERGY AND BANDWIDTH IS BAD FOR THE ENVIRONMENT.
# To ensure you don't run it trivially, I've included an obvious command in the code that quickly stops the download.
# If you can't find or fix this, you probably shouldn't be running the script!
bmschmidt / TesseractPDF.sh
Last active December 24, 2015 08:09
Make Tifs from pdfs using gs and then perform OCR on them using tesseract.
#!/bin/sh
#Adapted by Ben Schmidt from Barry Hubbard's code at
#http://www.barryhubbard.com/articles/37-general/74-converting-a-pdf-to-text-in-linux
#to convert into a folder of text files, each one representing a page.
#This takes pdfs from the pdfs folder, writes tif files to the images folder, and writes text to the texts folder.
#each pdf gets a _folder_ in each of the other two.
mkdir -p texts
mkdir -p images
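The two-step pipeline described in the comments above (PDF pages to TIFF images with gs, then TIFF images to text with tesseract) can be sketched in Python as follows. This is a hedged illustration, not the script itself: the function names, the `tiffg4` device, the 300 dpi resolution, and the `%04d` page-numbering pattern are all assumptions chosen for the example.

```python
import subprocess
from pathlib import Path
from typing import List


def gs_command(pdf: Path, image_dir: Path, dpi: int = 300) -> List[str]:
    """Build a Ghostscript command that writes one TIFF per page of `pdf`."""
    # Each PDF gets its own folder; Ghostscript replaces %04d with the page number.
    out_pattern = image_dir / pdf.stem / "%04d.tif"
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-q",
        "-sDEVICE=tiffg4",          # assumed device: black-and-white G4 TIFF
        f"-r{dpi}",                 # assumed resolution
        f"-sOutputFile={out_pattern}",
        str(pdf),
    ]


def tesseract_command(tif: Path, text_dir: Path) -> List[str]:
    """Build a tesseract command; tesseract appends .txt to the output base."""
    out_base = text_dir / tif.parent.name / tif.stem
    return ["tesseract", str(tif), str(out_base)]


def convert(pdf: Path, image_dir: Path = Path("images"), text_dir: Path = Path("texts")) -> None:
    """Run both steps for one PDF (requires gs and tesseract on the PATH)."""
    (image_dir / pdf.stem).mkdir(parents=True, exist_ok=True)
    (text_dir / pdf.stem).mkdir(parents=True, exist_ok=True)
    subprocess.run(gs_command(pdf, image_dir), check=True)
    for tif in sorted((image_dir / pdf.stem).glob("*.tif")):
        subprocess.run(tesseract_command(tif, text_dir), check=True)
```

Calling `convert(Path("pdfs/report.pdf"))` would, under these assumptions, leave page images in `images/report/` and one text file per page in `texts/report/`.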
bmschmidt / BibLaTeX-Chicago.js
Last active December 25, 2015 00:59
A BibLaTeX zotero translator specially designed to work with the BibLaTeX-Chicago library, which has a few special rules about how to translate different sorts of documents into appropriate citations. Most of this is simply adapted from the existing BibLaTeX Chicago plugin; but it has a few other nice features the original translator lacks, such…
{
"translatorID":"ba905f1a-436b-4b6d-a816-ba0b4ac4c9ad",
"translatorType":2,
"label":"BibLaTeX-Chicago",
"creator":"Simon Kornblith, Richard Karnesky, Anders Johansson and Ben Schmidt",
"target":"bib",
"minVersion":"2.1.9",
"maxVersion":"null",
"priority":100,
"inRepository":false,

Make sure it works

Download one file by hand. See if you can get the Stanford NLTK running to extract places and dates, and check that the output looks right.

Downloading files

bmschmidt / OLfield_descriptions
Created November 13, 2013 16:49
Example field_descriptions.json
[
{"field":"title","datatype":"etc","type":"text","unique":true},
{"field":"lc0","datatype":"categorical","type":"character","unique":true},
{"field":"lc1","datatype":"categorical","type":"character","unique":true},
{"field":"lc2","datatype":"etc","type":"integer","unique":true},
{"field":"publishers","datatype":"etc","type":"character","unique":false},
{"field":"subjects","datatype":"categorical","type":"character","unique":false},
{"field":"publish_country","datatype":"categorical","type":"character","unique":true},
{"field":"publish_places","datatype":"categorical","type":"character","unique":false},
bmschmidt / ExampleCatalog.json
Created November 13, 2013 16:53
First 12 Open Library records
{"publishers": ["Simon & Schuster Books for Young Readers"], "searchstring": "[No author], <em>Ernestine & Amanda, summer camp ready or not!</em> (undated) <a href=\"http://openlibrary.org/books/OL1002147M\">more info</a> <a href=\"http://archive.org/stream/ernestineamandas00belt\">read</a>", "lc2": "7", "lc0": "P", "title": "Ernestine & Amanda, summer camp ready or not!", "lccn": ["96041443"], "lc1": "PZ", "editionid": "/books/OL1002147M", "publish": 1997, "filename": "er/ne/ernestineamandas00belt", "languages": ["/languages/eng"], "lc_classifications": ["PZ7.B4197 Er 1997"], "publish_date": "1997", "publish_country": "nyu", "key": "/books/OL1002147M", "authors": ["/authors/OL24054A"], "ocaid": "ernestineamandas00belt", "oclc_numbers": ["35360730"], "works": ["/works/OL16070305W"], "publish_places": ["New York"]}
{"publishers": ["Copernicus"], "searchstring": "[No author], <em>The call of distant mammoths</em> (undated) <a href=\"http://openlibrary.org/books/OL1008703M\">more info</a> <a href=\"http://archiv
{"publisher": "T.B. Kalbfus", "paperid": "sn82016373", "searchstring": "<img src=\"http://chroniclingamerica.loc.gov/lccn/sn82016373/1891-10-04/ed-1/seq-7/thumbnail.jpg\"> <i>The Sunday herald and weekly national intelligencer</i> (Washington [D.C.]), Friday, October 04, 1891, p. 7. <a href=\"http://chroniclingamerica.loc.gov/lccn/sn82016373/1891-10-04/ed-1/seq-7\">Read page</a>", "lat": 38.8951118, "city": "Washington", "period": "1887/1896", "filename": "sn82016373/1891-10-04_7", "edition": "1", "state": "DC", "paper": "The Sunday herald and weekly national intelligencer", "location": "Washington [D.C.]", "lng": -77.0363658, "date": "1891-10-04", "subjects": [], "successors": [], "precedors": ["sn85042682"], "page": "7"}
{"publisher": "T.B. Kalbfus", "paperid": "sn82016373", "searchstring": "<img src=\"http://chroniclingamerica.loc.gov/lccn/sn82016373/1891-10-04/ed-1/seq-5/thumbnail.jpg\"> <i>The Sunday herald and weekly national intelligencer</i> (Washington [D.C.]), Friday, October 04, 1891, p. 5. <a href
rm(list=ls())
source("SQLFunctions.R")
plotSet = function(filename,dck=701,alpha=.01,lwd=8) {
  library(grid)
  paths = tbl(tblsrc,"paths") %.%
    filter(DCK==dck) %.%
    arrange(voyagenum,yearday) %.%
    select(LON,LAT,voyagenum) %.%
    collect()
bmschmidt / bookwormFormat.py
Last active August 29, 2015 14:04
Quickly format strings for searchstring in the mysql client
"""
This uses a fake sprintf-style construction to make it easy to reset searchstrings without rebuilding the whole database.
Anything in a string like this:
%(blah blah)s
will be broken out in mysql as a literal.
By default, this assumes you only need the "catalog" field: things will get strange if you try to use any non-unique fields.
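One possible reading of the placeholder rule above is that a template is split into quoted string pieces and bare MySQL expressions, then joined with `CONCAT`. The sketch below illustrates that reading only; `to_concat`, `quote_sql`, and the sample field names `title` and `year` are hypothetical and are not taken from the script.

```python
import re

# Matches sprintf-style named placeholders such as %(title)s
PLACEHOLDER = re.compile(r"%\((.*?)\)s")


def quote_sql(text: str) -> str:
    """Quote text as a MySQL string literal, escaping backslashes and single quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_concat(template: str) -> str:
    """Turn a template into a MySQL CONCAT(...) expression."""
    # With one capture group, re.split alternates: text, placeholder, text, ...
    parts = PLACEHOLDER.split(template)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            pieces.append(part)             # placeholder contents pass through as SQL
        elif part:
            pieces.append(quote_sql(part))  # surrounding text becomes a quoted string
    return "CONCAT(" + ", ".join(pieces) + ")"


print(to_concat("<i>%(title)s</i> (%(year)s)"))
```

Because placeholder contents pass through unquoted in this sketch, they should only ever come from a trusted template, never from user input.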
bmschmidt / dates.md
Created July 31, 2014 19:49
Date functions

MySQL has a canonical implementation:

SELECT FROM_DAYS(730669);

+-------------------+
| FROM_DAYS(730669) |
+-------------------+
| 2000-07-03        |
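
For comparison, the same day-number arithmetic can be sketched in Python. MySQL's day numbers run 365 ahead of Python's proleptic Gregorian ordinals (`TO_DAYS('0001-01-01')` is 366, while `date(1, 1, 1).toordinal()` is 1), so a fixed offset converts between them. The helper names `from_days` and `to_days` are chosen here to mirror the MySQL functions and are not part of any library.

```python
from datetime import date

# MySQL day numbers are 365 larger than Python ordinals for the same date.
MYSQL_DAY_OFFSET = 365


def from_days(n: int) -> date:
    """Python counterpart of MySQL FROM_DAYS(n), valid for n >= 366."""
    return date.fromordinal(n - MYSQL_DAY_OFFSET)


def to_days(d: date) -> int:
    """Python counterpart of MySQL TO_DAYS(d)."""
    return d.toordinal() + MYSQL_DAY_OFFSET


print(from_days(730669))  # 2000-07-03
```

Python's `date` cannot represent year 0, so day numbers below 366 raise a `ValueError` in this sketch.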