Luke Hollis lukehollis
lukehollis / find_sentence_lengths.py
Created January 28, 2016 15:48
Find sentence lengths of texts in the CLTK corpora for Perseus Greek and Latin XML
"""
Inspired by "Quantifying origin and character of long-range correlations in narrative texts"
by Stanisław Drożdż, Paweł Oświęcimka, Andrzej Kulig, Jarosław Kwapień, Katarzyna Bazarnik,
Iwona Grabska-Gradzińska, Jan Rybicki, and Marek Stanuszek, this is an attempt to use
the CLTK tokenizers to measure sentence lengths of works in the Greek and Latin corpora from the
Perseus Digital Library, for analysis via the methods used by the above researchers.
"""
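A minimal sketch of the sentence-length measurement described in the docstring. It does not use CLTK: the regular expression `_SENTENCE_END` and the function `sentence_lengths` are hypothetical stand-ins for a proper sentence and word tokenizer, and splitting on `.`, `?`, and `!` is only a rough approximation for Latin or Greek text.

```python
import re

# Hypothetical stand-in for a CLTK sentence tokenizer:
# split on whitespace that follows ., ? or !
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def sentence_lengths(text):
    """Return the number of whitespace-separated words in each sentence."""
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    return [len(s.split()) for s in sentences]


sample = "Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae."
lengths = sentence_lengths(sample)
print(lengths)  # [7, 4]
```

The resulting list of integers is the kind of series that could then be passed to a correlation analysis; that step is not shown here.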
lukehollis / perseus_to_mongo.py
Created January 28, 2016 15:32
Really simple CLTK data to Mongo for Perseus XML
import pdb
import os, json, re
from bs4 import BeautifulSoup
import html.parser
import pymongo
from db import mongo
class PerseusToMongo:
    # a class to migrate Perseus XML file data to mongo db
lukehollis / scansion_to_html.py
Created December 5, 2015 06:31
Get scansion info and turn it into html
import pdb
import re
import string
import sys
class ScansionToHTML:
    def __init__(self, line, scansion):
        self.scansion = scansion
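The preview above is truncated, so the following is only a sketch of how scansion marks might be turned into HTML, not the gist's actual method. The function `scansion_to_html`, the `MARK_CLASSES` mapping, and the CSS class names are all assumptions made for illustration, as is the requirement of one mark per syllable.

```python
from html import escape

# Assumed mapping from scansion marks to CSS class names (hypothetical)
MARK_CLASSES = {"¯": "long", "˘": "short", "x": "anceps"}


def scansion_to_html(syllables, scansion):
    """Wrap each syllable in a <span> whose class reflects its scansion mark."""
    if len(syllables) != len(scansion):
        raise ValueError("need exactly one scansion mark per syllable")
    spans = []
    for syllable, mark in zip(syllables, scansion):
        css_class = MARK_CLASSES.get(mark, "unknown")
        # escape() keeps stray <, > or & in the text from breaking the markup
        spans.append('<span class="{}">{}</span>'.format(css_class, escape(syllable)))
    return '<div class="line">{}</div>'.format("".join(spans))


html_line = scansion_to_html(["ar", "ma", "vi"], "¯˘˘")
print(html_line)
```

Keeping the mark-to-class mapping in a dictionary means the styling of long and short syllables can be left entirely to a stylesheet.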
lukehollis / .gitignore
Created May 5, 2015 18:23
.gitignore for wamu.org repo
# Ignore configuration files that may contain sensitive information.
sites/*/settings*.php
# Ignore paths that contain user-generated content.
sites/*/files
sites/*/private
sites/*/~private
.svn
*.svn/
lukehollis / improve_schinke_stemming_resources.py
Created May 4, 2015 23:43
Improve the Schinke Stemming Algorithm Resources
conj_list = ['ac', 'at', 'atque', 'aut', 'et', 'ne', 'nec', 'non', 'sed', 'si', 'uel',
             'cum', 'quum', 'donec', 'dum', 'enim', 'enimuero', 'etiam', 'etsi', 'igitur',
             'itaque', 'nam', 'necnon', 'neque', 'nisi', 'postquam', 'quamquam', 'quamuis',
             'quando', 'que', 'quia', 'quin', 'quippe', 'quinetiam', 'quod', 'quodque',
             'siue', 'ut', 'tam', 'necdum']
prep_list = ['ante', 'ad', 'circum', 'contra', 'inter', 'intra', 'post', 'in', 'en', 'praeter',
             'per', 'propter', 'super', 'uersus', 'extra', 'trans', 'sub', 'ob', 'a', 'ab',
             'de', 'cum', 'e', 'ex', 'sine', 'pro', 'prae']
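How the gist uses these lists is not shown in the preview. One plausible use, assumed here purely for illustration, is to treat conjunctions and prepositions as function words that are set aside before stemming. The names `conj_sample`, `prep_sample`, `FUNCTION_WORDS`, and `content_words` are hypothetical, and the two sample lists are short excerpts of the lists above.

```python
# Short excerpts of the lists above, repeated so the sketch is self-contained
conj_sample = ['et', 'sed', 'atque', 'cum']
prep_sample = ['in', 'ad', 'cum', 'ex']

# A frozenset merges the overlap ('cum' is in both) and gives fast lookups
FUNCTION_WORDS = frozenset(conj_sample) | frozenset(prep_sample)


def content_words(tokens):
    """Drop tokens that appear in the function-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in FUNCTION_WORDS]


tokens = "Arma uirumque cano et in Italiam".split()
print(content_words(tokens))  # ['Arma', 'uirumque', 'cano', 'Italiam']
```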
lukehollis / import_geojson.py
Created February 3, 2015 04:15
GeoJSON Import
import pymongo
import json
import pdb
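def mongo(db):
    # Connect to a local MongoDB server and return a handle to the named database
    host = "localhost"
    port = 27017
    # PyMongo 3+ spells this option maxPoolSize; None removes the pool size limit
    client = pymongo.MongoClient(host, port, maxPoolSize=None)
    return client[db]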
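The rest of the import script is not shown, so the following is a sketch of one way the GeoJSON side might look, under the assumption that the input is a `FeatureCollection` and that each feature becomes one document. The function `features_to_documents` and the sample data are hypothetical, and the block stays free of any database call so it can run without a MongoDB server.

```python
import json


def features_to_documents(geojson_text):
    """Turn a GeoJSON FeatureCollection string into a list of plain dicts."""
    data = json.loads(geojson_text)
    if data.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    docs = []
    for feature in data.get("features", []):
        docs.append({
            "geometry": feature.get("geometry"),
            # a feature's properties member may be null in GeoJSON
            "properties": feature.get("properties") or {},
        })
    return docs


sample = (
    '{"type": "FeatureCollection", "features": [{"type": "Feature", '
    '"geometry": {"type": "Point", "coordinates": [23.72, 37.97]}, '
    '"properties": {"name": "Athens"}}]}'
)
docs = features_to_documents(sample)
print(docs[0]["properties"]["name"])  # Athens
```

With a running server, the documents could then be stored through the helper above, for example `mongo("geo")["places"].insert_many(docs)`, where the database and collection names are placeholders.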