#!/usr/local/bin/python
# -*- coding: utf-8 -*-
# alliteration_python_sample.py
"""
Workflow example for
Distant Reading Alliteration in Latin Poetry
Patrick J. Burns
Fordham University, Department of Classics
Word, Space, Time: Digital Perspectives on the Classical World |
# Script for extracting attributes (like word, lemma, form, etc.)
# from the XML files in the Perseus Latin Dependency Treebank 2.1
# available here:
# https://github.com/PerseusDL/treebank_data/tree/master/v2.1/Latin
#
# Returns tuples that can be used for testing the new version of
# the CLTK lemmatizer
#
# Use: from the command line, call script with filename:
# >>> python xml_parse_perseus_nlp.py phi0448.phi001.perseus-lat1.tb.xml |
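As a rough illustration of the extraction the header comments describe, the sketch below pulls (form, lemma) tuples out of treebank-style XML using only the standard library. It assumes each token is a `<word>` element carrying `form` and `lemma` attributes; the function name `extract_attributes` and the inline two-word sample are hypothetical and are not taken from the script itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of a dependency treebank file.
SAMPLE_XML = """<treebank>
  <sentence id="1">
    <word id="1" form="Gallia" lemma="Gallia" />
    <word id="2" form="est" lemma="sum" />
  </sentence>
</treebank>"""


def extract_attributes(xml_text, attrs=("form", "lemma")):
    """Return one tuple per <word> element holding the requested attributes."""
    root = ET.fromstring(xml_text)
    rows = []
    for word in root.iter("word"):
        # word.get() returns None when an attribute is missing.
        rows.append(tuple(word.get(attr) for attr in attrs))
    return rows


pairs = extract_attributes(SAMPLE_XML)
print(pairs)
```

To work from a file named on the command line instead of a string, `ET.parse(filename).getroot()` could stand in for `ET.fromstring(xml_text)`.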
Patrick J. Burns, PhD
Classical Language Toolkit
Google Summer of Code 2016 Final Report

Here is a summary of the work I completed for the 2016 Google Summer of Code project "CLTK Latin/Greek Backoff Lemmatizer" for the Classical Language Toolkit (cltk.org). The code can be found at https://github.com/diyclassics/cltk/tree/lemmatize/cltk/lemmatize.

- Wrote custom lemmatizers for Latin and Greek as subclasses of NLTK's tag module (http://www.nltk.org/api/nltk.tag.html), including:
  - Default lemmatization, i.e. same lemma returned for every token
  - Identity lemmatization, i.e. original token returned as lemma
  - Model lemmatization, i.e. lemma returned based on dictionary lookup
  - Context lemmatization, i.e. lemma returned based on proximal token/lemma tuples in training data
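
The first three strategies in this list can be sketched without NLTK as a chain in which each lemmatizer defers to a backoff when it has no answer. This is only a minimal illustration of the backoff idea: the class names are stand-ins, the three-entry dictionary is invented for the example, and none of it is the CLTK implementation. Context lemmatization is left out because it needs training data.

```python
class BackoffLemmatizer:
    """Base class: ask choose_lemma(), then defer to the backoff on None."""

    def __init__(self, backoff=None):
        self.backoff = backoff

    def choose_lemma(self, token):
        # Subclasses return a lemma, or None to signal "no answer".
        return None

    def lemma_for(self, token):
        lemma = self.choose_lemma(token)
        if lemma is None and self.backoff is not None:
            return self.backoff.lemma_for(token)
        return lemma

    def lemmatize(self, tokens):
        return [(token, self.lemma_for(token)) for token in tokens]


class DefaultLemmatizer(BackoffLemmatizer):
    """Return the same lemma for every token."""

    def __init__(self, lemma, backoff=None):
        super().__init__(backoff)
        self.lemma = lemma

    def choose_lemma(self, token):
        return self.lemma


class IdentityLemmatizer(BackoffLemmatizer):
    """Return the token itself as its lemma."""

    def choose_lemma(self, token):
        return token


class ModelLemmatizer(BackoffLemmatizer):
    """Look the token up in a dictionary; None when it is absent."""

    def __init__(self, model, backoff=None):
        super().__init__(backoff)
        self.model = model

    def choose_lemma(self, token):
        return self.model.get(token)


# Hypothetical toy model; unknown tokens fall through to the identity step.
model = {"arma": "arma", "virumque": "vir", "cano": "cano"}
lemmatizer = ModelLemmatizer(model, backoff=IdentityLemmatizer())
result = lemmatizer.lemmatize(["arma", "virumque", "cano", "Troiae"])
print(result)
```

Putting the dictionary lookup first and the identity step last means a token is only returned unchanged when nothing more specific is known about it.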
def _define_lemmatizer(self):
    backoff0 = None
    backoff1 = IdentityLemmatizer()
    backoff2 = TrainLemmatizer(model=self.LATIN_OLD_MODEL, backoff=backoff1)
    backoff3 = PPLemmatizer(regexps=self.latin_verb_patterns, pps=self.latin_pps, backoff=backoff2)
    backoff4 = UnigramLemmatizer(self.train_sents, backoff=backoff3)
    backoff5 = RegexpLemmatizer(self.latin_misc_patterns, backoff=backoff4)
    backoff6 = TrainLemmatizer(model=self.LATIN_MODEL, backoff=backoff5)
    #backoff7 = BigramPOSLemmatizer(self.pos_train_sents, include=['cum'], backoff=backoff6)
    lemmatizer = backoff6
Several prospective CLTK Google Summer of Code applicants have written recently about what the proposal should include. While successful project proposals can take many different forms, here is an outline that helps address the questions likely to come up as proposals are reviewed:

- Abstract: It is helpful to distill your proposal into 100-200 words that define the problem, identify your solution, name the datasets necessary to do the work, and report the expected outcome of this project. On this last point, note that since this is a proposal, we do not expect you to report results, but you should have a clear idea of where you expect to be by the end of the summer. We will also need to use abstracts and brief descriptions of your project on the GSoC page if your proposal is selected.
- Proposal: This will be the bulk of your submission. Here you want to expand upon the points mentioned in the abstract, including:
  - Define the problem. Depending on your project, CLTK may be different than other open so
<symbol id="icon-hcommons" viewBox="0 0 240 240">
<title>hcommons</title>
<g transform="translate(0.000000,240.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M1045 2394 c-85 -14 -235 -57 -312 -90 -361 -154 -608 -451 -705
-845 -31 -126 -33 -381 -4 -504 98 -419 381 -743 767 -882 157 -57 228 -68
414 -68 153 1 181 4 280 29 122 31 296 109 395 176 240 163 420 431 492 733
32 132 32 382 0 514 -38 160 -112 325 -200 448 -120 166 -306 314 -503 398
-151 65 -262 88 -439 92 -85 2 -168 1 -185 -1z m948 -585 l57 -12 0 -114 0
-114 -47 8 c-27 4 -91 8 -144 8 -86 0 -100 -3 -133 -25 -20 -14 -46 -45 -59 |
import json
import urllib.request

def get_pleiades_json(pleiades_id):
    # pleiades_id: STR
    pleiades_url = "https://raw.githubusercontent.com/ryanfb/pleiades-geojson/gh-pages/geojson/%s.geojson" % pleiades_id
    try:
        with urllib.request.urlopen(pleiades_url) as url:
            pleiades_geojson = json.loads(url.read().decode())
            return pleiades_geojson
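
The preview stops inside the `try` block, so the original error handling is not visible. The sketch below shows one way such a fetch might handle failures, under the assumption that returning `None` on any network or parsing error is acceptable. The name `get_pleiades_json_or_none` and the `opener` parameter are hypothetical additions that allow the function to be exercised without a network connection.

```python
import json
import urllib.error
import urllib.request

PLEIADES_URL = (
    "https://raw.githubusercontent.com/ryanfb/pleiades-geojson/"
    "gh-pages/geojson/%s.geojson"
)


def get_pleiades_json_or_none(pleiades_id, opener=urllib.request.urlopen):
    """Return the parsed GeoJSON for a Pleiades ID, or None if the fetch fails.

    `opener` is a hypothetical hook: any callable that takes a URL and
    returns a context manager with a read() method returning bytes.
    """
    url = PLEIADES_URL % pleiades_id
    try:
        with opener(url) as response:
            return json.loads(response.read().decode())
    except (urllib.error.URLError, ValueError):
        # URLError covers HTTP errors and connection failures;
        # ValueError covers bodies that are not valid UTF-8 or not valid JSON.
        return None
```

A caller can then test the result with `is None` rather than wrapping every call in its own `try` block.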
import re
GREEK = '\u0300-\u03FF'
GREEK_EXT = '\u1F00-\u1FFF'
# Cicero Att 1.4
# http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A1999.02.0008%3Abook%3D1%3Aletter%3D1%3Asection%3D4
text = """
sane sum perturbatus cum ipsius Satyri familiaritate tum Domiti, in quo uno maxime ambitio nostra nititur. demonstravi haec Caecilio simul et illud ostendi, si ipse unus cum illo uno contenderet, me ei satis facturum fuisse; nunc in causa universorum creditorum, hominum praesertim amplissimorum, qui sine eo quem Caecilius suo nomine perhiberet facile causam communem sustinerent, aequum esse eum et officio meo consulere et tempori. durius accipere hoc mihi visus est quam vellem et quam homines belli solent, et postea prorsus ab instituta nostra paucorum dierum consuetudine longe refugit. |
from cltk.corpus.latin import latinlibrary
from cltk.tokenize.word import WordTokenizer
tokenizer = WordTokenizer('latin')
ll_raw = latinlibrary.raw()
print(ll_raw[:500])
ll_words = latinlibrary.words()
print(ll_words[:100]) |
function getShortDef(input) {
  var array = [];
  var url = "http://www.perseus.tufts.edu/hopper/morph?l=" + input;
  var page = UrlFetchApp.fetch(url);
  var doc = Xml.parse(page, true);
  var bodyHtml = doc.html.body.toXmlString();
  doc = XmlService.parse(bodyHtml);
  var root = doc.getRootElement();