{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "sulps = \"\"\"Gratum est, securus multum quod iam tibi de me\n",
    " permittis, subito ne male inepta cadam.\n",
    "Sit tibi cura togae potior pressumque quasillo\n",
    " scortum quam Servi filia Sulpicia:\n",
    "Solliciti sunt pro nobis, quibus illa dolori est,\n",
    " ne cedam ignoto, maxima causa, toro.\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'gratum est securus multum quod iam tibi de me\\n permittis subito ne male inepta cadam\\nsit tibi cura togae potior pressumque quasillo\\n scortum quam servi filia sulpicia\\nsolliciti sunt pro nobis quibus illa dolori est\\n ne cedam ignoto maxima causa toro'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import string\n",
    "tr = sulps.maketrans('','',string.punctuation)\n",
    "sulps = sulps.translate(tr).lower()\n",
    "sulps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['gratum est securus multum quod iam tibi de me',\n",
       " ' permittis subito ne male inepta cadam',\n",
       " 'sit tibi cura togae potior pressumque quasillo',\n",
       " ' scortum quam serui filia sulpicia',\n",
       " 'solliciti sunt pro nobis quibus illa dolori est',\n",
       " ' ne cedam ignoto maxima causa toro']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from cltk.stem.latin.j_v import JVReplacer\n",
    "j = JVReplacer()\n",
    "clean_lines = j.replace(sulps).splitlines()\n",
    "clean_lines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['gratum est securus multum quod iam tibi de me',\n",
       " 'permittis subito ne male inepta cadam',\n",
       " 'sit tibi cura togae potior pressumque quasillo',\n",
       " 'scortum quam serui filia sulpicia',\n",
       " 'solliciti sunt pro nobis quibus illa dolori est',\n",
       " 'ne cedam ignoto maxima causa toro']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "clean_lines = [re.sub('^ ','',l) for l in clean_lines] # remove leading spaces\n",
    "clean_lines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['permitto', 'subeo', 'neo1', 'malus', 'ineptus', 'cado']"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The Lemmatizer is able to correctly identify cadam as an inflected form of cado\n",
    "\n",
    "from cltk.stem.lemma import LemmaReplacer\n",
    "lemmatizer = LemmaReplacer('latin')\n",
    "l = lemmatizer.lemmatize(clean_lines[1])\n",
    "l"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# But the POS taggers have no idea how to handle it\n",
    "\n",
    "from cltk.tag.pos import POSTag"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "tagger = POSTag('latin')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('permittis', 'N-S---FG-'),\n",
       " ('subito', 'D--------'),\n",
       " ('ne', 'D--------'),\n",
       " ('male', 'D--------'),\n",
       " ('inepta', 'T-SRPPFN-'),\n",
       " ('cadam', 'A-S---FA-')]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tagger.tag_crf(clean_lines[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('permittis', 'V2SPIA---'),\n",
       " ('subito', 'D--------'),\n",
       " ('ne', 'D--------'),\n",
       " ('male', 'D--------'),\n",
       " ('inepta', 'Unk'),\n",
       " ('cadam', 'Unk')]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tagger.tag_tnt(clean_lines[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('permittis', 'V2SPIA---'),\n",
       " ('subito', 'D--------'),\n",
       " ('ne', 'D--------'),\n",
       " ('male', 'D--------'),\n",
       " ('inepta', None),\n",
       " ('cadam', None)]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tagger.tag_ngram_123_backoff(clean_lines[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gratum/n-s---nb- est/v1si-a--- securus/a-s---fb- multum/d-------- quod/p-s---na- iam/d-------- tibi/a-s---fb- de/n-s---nb- me/d-------- permittis/p-s---ma- subito/d-------- ne/t-srppmn- male/d-------- inepta/v--pna--- cadam/v2spia---\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# The new LAPOS tagger at least tags it as a verb, but v2spia is weird\n",
    "# since a final 'm' is always 1st person\n",
    "\n",
    "import subprocess\n",
    "p = subprocess.run(\n",
    "    [\"lapos/lapos\", \"-t\", \"-m\", \"lapos_model\"], \n",
    "    input='gratum est securus multum quod iam tibi de me permittis subito ne male inepta cadam', \n",
    "    encoding='ascii',\n",
    "    stdout=subprocess.PIPE\n",
    ")\n",
    "print(p.stdout)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
Two more notes:

- The two data sets both came from Perseus, which means that their labels will match one another (e.g., `neo1` for the lemmatizer refers to the same `neo1` as in the POS tagger). (Though note their current tags, and perhaps their headwords/lemma keys too, have changed slightly since.)
- Patrick's backoff lemmatizer returns better results. Have you tried it, too? http://docs.cltk.org/en/latest/latin.html#lemmatization-backoff-method
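For readers unfamiliar with the backoff pattern, here is a minimal sketch of the idea (not the actual CLTK API): each sub-lemmatizer in a chain handles what it can and defers the rest to the next one. All names, mappings, and rules below are invented for illustration.

```python
import re

def make_backoff(lemmatizers):
    """Return a function that tries each sub-lemmatizer in order."""
    def lemmatize(token):
        for lem in lemmatizers:
            result = lem(token)
            if result is not None:
                return result
        return token  # final fallback: pass the token through unchanged
    return lemmatize

# Hypothetical sub-lemmatizers: a dictionary lookup, then a toy regex rule.
lookup = {'cadam': 'cado', 'permittis': 'permitto'}.get

def regex_rule(token):
    # toy rule: rewrite a final -am as -o (obviously too crude for real Latin)
    if token.endswith('am'):
        return re.sub('am$', 'o', token)
    return None

lemmatize = make_backoff([lookup, regex_rule])
print(lemmatize('cadam'))     # hits the dictionary
print(lemmatize('laudabam'))  # falls through to the regex rule
print(lemmatize('sulpicia'))  # nothing fires; returned unchanged
```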
Thanks a lot for the comments. Re the backoff lemmatizer: I didn't see any POS information, which is what I need at the moment, but I'm making a note to look at the ContextPOSLemmatizer. I think you could probably use the Perseus treebank data for form selection in Collatinus (just select the POS which is most common overall in the hand-analysed data), although I would need more time / help to work out how to do that. The problem I'm seeing here, though, is that the morphological guesses are not strong enough in the first place: it's not that the taggers are selecting the wrong form, it's that the right form isn't there to begin with.
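The "select the POS which is most common overall in the hand-analysed data" idea can be sketched as a simple frequency count. The (form, tag) pairs below are invented; the real input would be the Perseus treebank analyses.

```python
from collections import Counter, defaultdict

# Hypothetical (form, tag) pairs, standing in for real treebank data.
treebank = [
    ('cadam', 'v1sfia---'),
    ('cadam', 'v1spsa---'),
    ('cadam', 'v1sfia---'),
    ('inepta', 'a-s---fn-'),
    ('inepta', 'a-p---na-'),
]

# Tally how often each tag occurs for each surface form.
counts = defaultdict(Counter)
for form, tag in treebank:
    counts[form][tag] += 1

# For each form, keep only its single most frequent tag.
best_tag = {form: c.most_common(1)[0][0] for form, c in counts.items()}
print(best_tag['cadam'])  # the majority tag wins
```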
`LemmaReplacer` is as dumb as it gets (and one of the first things I wrote ~6 years ago). It reads a key-val list of `<declined-form>: <lemma>` and does a string-match-based replace. There is no ambiguity handling and I just took the most frequently occurring form. The code used to create this (out of the old Perseus files in the form of `<lemma>: [<declined form 1>, ... <declined form n>]`) is here: https://github.com/cltk/latin_pos_lemmata_cltk/blob/master/transform_lemmata.py

About the POS taggers, these are made using the NLTK's built-in statistical taggers. You can trace back the class's logic from here: https://github.com/cltk/cltk/blob/e91f44d66ea2009a388dc9a3a224b138d9e003d6/cltk/tag/pos.py#L47. And the models were created from the same CLTK repo I cited above, though in this module: https://github.com/cltk/latin_pos_lemmata_cltk/blob/master/pos_latin.py#L13.