@diyclassics
diyclassics / alliteration_python_sample.py
Created November 16, 2015 12:03 — forked from pbartleby/alliteration_python_sample.py
Workflow example for "Distant Reading Alliteration in Latin Poetry". Presented at Word, Space, Time: Digital Perspectives on the Classical World (Digital Classics Association conference) on 4.6.13 @ U. Buffalo.
#!/usr/local/bin/python
# -*- coding: utf-8 -*-
# alliteration_python_sample.py
"""
Workflow example for
Distant Reading Alliteration in Latin Poetry
Patrick J. Burns
Fordham University, Department of Classics
Word, Space, Time: Digital Perspectives on the Classical World
@diyclassics
diyclassics / zen-pythonis.txt
Created February 18, 2016 19:38
Latin translation of Peters's "The Zen of Python"
Zen Pythonis
a T. Peters imprimis Anglice scriptum
redditumque Latine a Patricio Ios. Burns:
– Formosum deformi praefertur.
– Directum obliquo praefertur.
– Simplex multiplici praefertur.
– Multiplex contorto praefertur.
– Planum implicato praefertur.
– Rarum denso praefertur.
@diyclassics
diyclassics / gist:5f4e7ff7963e255dd44278577ffcbf6e
Last active March 22, 2018 15:31
ll-plaintextcorpus-demo
from cltk.corpus.latin import latinlibrary
from cltk.tokenize.word import WordTokenizer
tokenizer = WordTokenizer('latin')
ll_raw = latinlibrary.raw()
print(ll_raw[:500])
ll_words = latinlibrary.words()
print(ll_words[:100])
@diyclassics
diyclassics / xml_parse_perseus_nlp.py
Created July 15, 2016 16:59
Script for extracting attributes from Perseus Latin Treebank XML files
# Script for extracting attributes (like word, lemma, form, etc.)
# from the XML files in the Perseus Latin Dependency Treebank 2.1
# available here:
# https://github.com/PerseusDL/treebank_data/tree/master/v2.1/Latin
#
# Returns tuples that can be used for testing the new version of
# the CLTK lemmatizer
#
# Use: from the command line, call script with filename:
# >>> python xml_parse_perseus_nlp.py phi0448.phi001.perseus-lat1.tb.xml
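The extraction step described in the comments above might look roughly like the following. This is a minimal sketch, not the gist's actual code: it assumes the treebank files store each token as a `<word>` element with `form` and `lemma` attributes, and the sample XML, the function name `extract_attributes`, and its `attributes` parameter are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in an assumed <sentence>/<word> layout; not real corpus data.
SAMPLE_XML = """<treebank>
  <sentence id="1">
    <word id="1" form="Gallia" lemma="Gallia1" postag="n-s---fn-" head="2" relation="SBJ"/>
    <word id="2" form="est" lemma="sum1" postag="v3spia---" head="0" relation="PRED"/>
  </sentence>
</treebank>"""


def extract_attributes(xml_text, attributes=("form", "lemma")):
    """Return one tuple per <word> element holding the requested attributes.

    Hypothetical helper: a missing attribute comes back as None.
    """
    root = ET.fromstring(xml_text)
    return [
        tuple(word.get(attr) for attr in attributes)
        for word in root.iter("word")
    ]


pairs = extract_attributes(SAMPLE_XML)
print(pairs)
```

Passing a different `attributes` tuple, for example `("form", "lemma", "postag")`, would yield wider tuples from the same pass over the file.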
@diyclassics
diyclassics / gsoc-summary.txt
Last active August 23, 2016 13:43
GSoC 2016 Summary
Patrick J. Burns, PhD
Classical Language Toolkit
Google Summer of Code 2016 Final Report
Here is a summary of the work I completed for the 2016 Google Summer of Code project "CLTK Latin/Greek Backoff Lemmatizer" for the Classical Language Toolkit (cltk.org). The code can be found at https://github.com/diyclassics/cltk/tree/lemmatize/cltk/lemmatize.
- Wrote custom lemmatizers for Latin and Greek as subclasses of classes in NLTK's tag module (http://www.nltk.org/api/nltk.tag.html), including:
  - Default lemmatization, i.e. same lemma returned for every token
  - Identity lemmatization, i.e. original token returned as lemma
  - Model lemmatization, i.e. lemma returned based on dictionary lookup
  - Context lemmatization, i.e. lemma returned based on proximal token/lemma tuples in training data
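The backoff idea behind the first three lemmatizer types listed above can be sketched in plain Python. The classes below are hypothetical stand-ins written for illustration; they are not the CLTK or NLTK classes, and context lemmatization is left out because it depends on training data.

```python
class SketchBackoff:
    """Illustrative base class: try this lemmatizer, else defer to the backoff."""

    def __init__(self, backoff=None):
        self.backoff = backoff

    def choose_lemma(self, token):
        # Subclasses return a lemma, or None to pass the token down the chain.
        return None

    def lemmatize_token(self, token):
        lemma = self.choose_lemma(token)
        if lemma is None and self.backoff is not None:
            return self.backoff.lemmatize_token(token)
        return lemma

    def lemmatize(self, tokens):
        return [(token, self.lemmatize_token(token)) for token in tokens]


class SketchDefault(SketchBackoff):
    """Default lemmatization: the same lemma for every token."""

    def __init__(self, lemma, backoff=None):
        super().__init__(backoff)
        self.lemma = lemma

    def choose_lemma(self, token):
        return self.lemma


class SketchIdentity(SketchBackoff):
    """Identity lemmatization: the token itself is returned as the lemma."""

    def choose_lemma(self, token):
        return token


class SketchModel(SketchBackoff):
    """Model lemmatization: dictionary lookup, None when the token is unknown."""

    def __init__(self, model, backoff=None):
        super().__init__(backoff)
        self.model = model

    def choose_lemma(self, token):
        return self.model.get(token)


# Toy dictionary, invented for this example.
toy_model = {"arma": "arma", "virum": "vir"}
lemmatizer = SketchModel(toy_model, backoff=SketchIdentity())
result = lemmatizer.lemmatize(["arma", "virum", "Troiae"])
print(result)
```

Here "Troiae" is absent from the toy dictionary, so the lookup returns None and the identity lemmatizer at the end of the chain supplies the token itself.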
@diyclassics
diyclassics / omeka-xml-parse.py
Last active August 14, 2022 16:23
Parsing DublinCore XML data exported from Omeka
### Jupyter notebook for this code available at:
### https://github.com/isaw-ga-3024/isaw-ga-3024.github.io/blob/master/burns-patrick-diyclassics/notebooks/Omeka-XML-Parse.ipynb
omeka = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://dcaa.hosting.nyu.edu/cms/admin/items/show/2">
<dc:title>Changing the Center of Gravity</dc:title>
<dc:subject>digital humanities</dc:subject>
<dc:creator>Terras, Melissa</dc:creator>
<dc:creator>Crane, Gregory</dc:creator>
@diyclassics
diyclassics / gist:b24fbd1ad3bbb726387de443fab84956
Created December 16, 2016 17:15
Backoff lemmatizer edits
def _define_lemmatizer(self):
    backoff0 = None
    backoff1 = IdentityLemmatizer()
    backoff2 = TrainLemmatizer(model=self.LATIN_OLD_MODEL, backoff=backoff1)
    backoff3 = PPLemmatizer(regexps=self.latin_verb_patterns, pps=self.latin_pps, backoff=backoff2)
    backoff4 = UnigramLemmatizer(self.train_sents, backoff=backoff3)
    backoff5 = RegexpLemmatizer(self.latin_misc_patterns, backoff=backoff4)
    backoff6 = TrainLemmatizer(model=self.LATIN_MODEL, backoff=backoff5)
    #backoff7 = BigramPOSLemmatizer(self.pos_train_sents, include=['cum'], backoff=backoff6)
    lemmatizer = backoff6
@diyclassics
diyclassics / CLTK GSoC Proposal Suggestions
Last active March 20, 2017 14:02
Some things to consider including in a GSoC proposal for CLTK
Several prospective CLTK Google Summer of Code applicants have written recently about what the proposal should include. While successful project proposals can take many different forms, here is an outline that helps address the questions likely to come up as the proposals are reviewed:
- Abstract: It is helpful to distill your proposal into 100-200 words that define the problem, identify your solution, name the datasets necessary to do the work, and report the expected outcome of this project. On this last point, note that since this is a proposal, we do not expect you to report results—but you should have a clear idea of where you expect to be by the end of the summer. We will also need to use abstracts and brief descriptions of your project on the GSoC page if your proposal is selected.
- Proposal: This will be the bulk of your submission. Here you want to expand upon the points mentioned in the abstract, including:
- Define the problem. Depending on your project, CLTK may be different than other open so
@diyclassics
diyclassics / gist:8caaa77b163ab55c6238d75e45f33281
Created August 30, 2017 14:04
SVG for Humanities Commons icon
<symbol id="icon-hcommons" viewBox="0 0 240 240">
<title>hcommons</title>
<g transform="translate(0.000000,240.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M1045 2394 c-85 -14 -235 -57 -312 -90 -361 -154 -608 -451 -705
-845 -31 -126 -33 -381 -4 -504 98 -419 381 -743 767 -882 157 -57 228 -68
414 -68 153 1 181 4 280 29 122 31 296 109 395 176 240 163 420 431 492 733
32 132 32 382 0 514 -38 160 -112 325 -200 448 -120 166 -306 314 -503 398
-151 65 -262 88 -439 92 -85 2 -168 1 -185 -1z m948 -585 l57 -12 0 -114 0
-114 -47 8 c-27 4 -91 8 -144 8 -86 0 -100 -3 -133 -25 -20 -14 -46 -45 -59
@diyclassics
diyclassics / get_pleiades_id.py
Last active October 31, 2017 07:59
Get coordinates by Pleiades ID
import json
import urllib.request
def get_pleiades_json(pleiades_id):
    # pleiades_id: STR
    pleiades_url = "https://raw.githubusercontent.com/ryanfb/pleiades-geojson/gh-pages/geojson/%s.geojson" % pleiades_id
    try:
        with urllib.request.urlopen(pleiades_url) as url:
            pleiades_geojson = json.loads(url.read().decode())
        return pleiades_geojson
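The preview above is cut off before the coordinates are read. One way that step could go is sketched below, assuming the downloaded GeoJSON may carry a top-level `reprPoint` of `[longitude, latitude]` and otherwise holds standard GeoJSON features. The function name `get_coordinates` and the sample dictionaries are invented for illustration, and no network call is made.

```python
def get_coordinates(pleiades_geojson):
    """Return (longitude, latitude) from a GeoJSON-like dict, or None.

    Hypothetical helper, not part of the original gist.
    """
    # Assumption: a representative point may be stored under "reprPoint".
    repr_point = pleiades_geojson.get("reprPoint")
    if repr_point:
        return tuple(repr_point)
    # Fallback: the first Point geometry among the features, if there is one.
    # GeoJSON stores positions in longitude, latitude order.
    for feature in pleiades_geojson.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") == "Point":
            return tuple(geometry["coordinates"])
    return None


# Made-up sample values, not real Pleiades data.
sample = {"reprPoint": [12.5, 41.9], "features": []}
coords = get_coordinates(sample)
print(coords)
```

Returning None instead of raising keeps the caller in charge of how to handle places that have no usable point.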