@bnagy

bnagy/pos.ipynb Secret

Created June 28, 2018 01:59
Thinking about using CLTK for verb form tagging
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"sulps = \"\"\"Gratum est, securus multum quod iam tibi de me\n",
" permittis, subito ne male inepta cadam.\n",
"Sit tibi cura togae potior pressumque quasillo\n",
" scortum quam Servi filia Sulpicia:\n",
"Solliciti sunt pro nobis, quibus illa dolori est,\n",
" ne cedam ignoto, maxima causa, toro.\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'gratum est securus multum quod iam tibi de me\\n permittis subito ne male inepta cadam\\nsit tibi cura togae potior pressumque quasillo\\n scortum quam servi filia sulpicia\\nsolliciti sunt pro nobis quibus illa dolori est\\n ne cedam ignoto maxima causa toro'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import string\n",
"tr = str.maketrans('', '', string.punctuation)  # translation table that deletes ASCII punctuation\n",
"sulps = sulps.translate(tr).lower()\n",
"sulps"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['gratum est securus multum quod iam tibi de me',\n",
" ' permittis subito ne male inepta cadam',\n",
" 'sit tibi cura togae potior pressumque quasillo',\n",
" ' scortum quam serui filia sulpicia',\n",
" 'solliciti sunt pro nobis quibus illa dolori est',\n",
" ' ne cedam ignoto maxima causa toro']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from cltk.stem.latin.j_v import JVReplacer\n",
"j = JVReplacer()\n",
"clean_lines = j.replace(sulps).splitlines()\n",
"clean_lines"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['gratum est securus multum quod iam tibi de me',\n",
" 'permittis subito ne male inepta cadam',\n",
" 'sit tibi cura togae potior pressumque quasillo',\n",
" 'scortum quam serui filia sulpicia',\n",
" 'solliciti sunt pro nobis quibus illa dolori est',\n",
" 'ne cedam ignoto maxima causa toro']"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clean_lines = [l.lstrip() for l in clean_lines]  # remove leading whitespace\n",
"clean_lines"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['permitto', 'subeo', 'neo1', 'malus', 'ineptus', 'cado']"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The Lemmatizer is able to correctly identify cadam as an inflected form of cado\n",
"\n",
"from cltk.stem.lemma import LemmaReplacer\n",
"lemmatizer = LemmaReplacer('latin')\n",
"l = lemmatizer.lemmatize(clean_lines[1])\n",
"l"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# But the POS taggers have no idea how to handle it\n",
"\n",
"from cltk.tag.pos import POSTag"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"tagger = POSTag('latin')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('permittis', 'N-S---FG-'),\n",
" ('subito', 'D--------'),\n",
" ('ne', 'D--------'),\n",
" ('male', 'D--------'),\n",
" ('inepta', 'T-SRPPFN-'),\n",
" ('cadam', 'A-S---FA-')]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tagger.tag_crf(clean_lines[1])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('permittis', 'V2SPIA---'),\n",
" ('subito', 'D--------'),\n",
" ('ne', 'D--------'),\n",
" ('male', 'D--------'),\n",
" ('inepta', 'Unk'),\n",
" ('cadam', 'Unk')]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tagger.tag_tnt(clean_lines[1])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('permittis', 'V2SPIA---'),\n",
" ('subito', 'D--------'),\n",
" ('ne', 'D--------'),\n",
" ('male', 'D--------'),\n",
" ('inepta', None),\n",
" ('cadam', None)]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tagger.tag_ngram_123_backoff(clean_lines[1])"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"gratum/n-s---nb- est/v1si-a--- securus/a-s---fb- multum/d-------- quod/p-s---na- iam/d-------- tibi/a-s---fb- de/n-s---nb- me/d-------- permittis/p-s---ma- subito/d-------- ne/t-srppmn- male/d-------- inepta/v--pna--- cadam/v2spia---\n",
"\n"
]
}
],
"source": [
"# The new LAPOS tagger at least tags it as a verb, but v2spia (2nd person singular) is weird\n",
"# since a finite verb ending in 'm' is always 1st person singular\n",
"\n",
"import subprocess\n",
"p = subprocess.run(\n",
"    [\"lapos/lapos\", \"-t\", \"-m\", \"lapos_model\"],\n",
"    input='gratum est securus multum quod iam tibi de me permittis subito ne male inepta cadam',\n",
"    encoding='ascii',\n",
"    stdout=subprocess.PIPE\n",
")\n",
"print(p.stdout)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
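
The nine-character tags in the notebook outputs appear to follow the positional scheme of the Perseus Latin treebank: part of speech, person, number, tense, mood, voice, gender, case, degree. Below is a minimal sketch of a decoder under that assumption. The lookup tables are partial, and `decode_tag` is a helper made up for this note, not part of CLTK.

```python
# Positional fields of a 9-character Perseus-style tag (assumed scheme).
FIELDS = ["pos", "person", "number", "tense", "mood",
          "voice", "gender", "case", "degree"]

# Partial value tables, enough for the tags seen in this notebook.
VALUES = {
    "pos": {"n": "noun", "v": "verb", "t": "participle", "a": "adjective",
            "d": "adverb", "p": "pronoun", "r": "preposition",
            "c": "conjunction"},
    "person": {"1": "1st", "2": "2nd", "3": "3rd"},
    "number": {"s": "singular", "p": "plural"},
    "tense": {"p": "present", "i": "imperfect", "f": "future",
              "r": "perfect", "l": "pluperfect", "t": "future perfect"},
    "mood": {"i": "indicative", "s": "subjunctive", "n": "infinitive",
             "m": "imperative", "p": "participle"},
    "voice": {"a": "active", "p": "passive"},
    "gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "case": {"n": "nominative", "g": "genitive", "d": "dative",
             "a": "accusative", "b": "ablative", "v": "vocative"},
    "degree": {"c": "comparative", "s": "superlative"},
}


def decode_tag(tag):
    """Return a dict of the filled-in fields of a positional tag."""
    if tag is None or len(tag) != len(FIELDS):
        return {}  # e.g. 'Unk' or None from a tagger that gave up
    decoded = {}
    for field, char in zip(FIELDS, tag.lower()):
        if char != "-":
            # Fall back to the raw character for values not in the tables.
            decoded[field] = VALUES[field].get(char, char)
    return decoded


print(decode_tag("V2SPIA---"))
print(decode_tag("A-S---FA-"))
```

Read this way, `V2SPIA---` is second person singular present indicative active, which fits `permittis` but not `cadam`.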
@kylepjohnson

> The Lemmatizer is able to correctly identify cadam as an inflected form of cado

> But the POS taggers have no idea how to handle it

LemmaReplacer is as dumb as it gets (and one of the first things I wrote ~6 years ago). It reads a key-value list of <declined-form>: <lemma> and does a string-match replace. There is no ambiguity handling and I just took the most frequently occurring form. The code used to create this list (out of the old Perseus files, which have the form <lemma>: [<declined form 1>, ... <declined form n>]) is here: https://github.com/cltk/latin_pos_lemmata_cltk/blob/master/transform_lemmata.py.
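
To make the limitation concrete, here is a minimal sketch of that kind of lookup-and-replace lemmatizer. The `LEMMATA` table and the `lemmatize` function are made up for illustration and are not the CLTK code; the lemma strings are copied from the notebook output. Leaving unknown tokens unchanged is a choice of this sketch.

```python
# Toy form -> lemma table; the real one is built from the Perseus files.
LEMMATA = {
    "permittis": "permitto",
    "subito": "subeo",  # one fixed lemma per form, even if the form is ambiguous
    "cadam": "cado",
}


def lemmatize(line):
    """Replace each token by its lemma if known, else leave it unchanged."""
    return [LEMMATA.get(token, token) for token in line.split()]


print(lemmatize("permittis subito ne male inepta cadam"))
```

Note that `subito` always maps to `subeo` here, although the taggers in the notebook treat it as an adverb: a plain lookup has no context to decide between the two.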

About the POS taggers: these are made using NLTK's built-in statistical taggers. You can trace the class's logic from here: https://github.com/cltk/cltk/blob/e91f44d66ea2009a388dc9a3a224b138d9e003d6/cltk/tag/pos.py#L47. The models were created from the same CLTK repo I cited above, though in this module: https://github.com/cltk/latin_pos_lemmata_cltk/blob/master/pos_latin.py#L13.
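
As a rough illustration of why unseen words come back as `None` from a backoff tagger, here is a hand-rolled sketch of a bigram-to-unigram backoff chain. It does not use NLTK or the CLTK models; the one-sentence training set and the function names are invented for the example, with the tags copied from the notebook output.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count tags per word (unigram) and per (previous tag, word) (bigram)."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev_tag = None
        for word, tag in sentence:
            unigram[word][tag] += 1
            bigram[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return unigram, bigram


def tag(words, unigram, bigram):
    """Try the bigram context first, back off to the unigram, else None."""
    tagged = []
    prev_tag = None
    for word in words:
        if (prev_tag, word) in bigram:
            best = bigram[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram:
            best = unigram[word].most_common(1)[0][0]
        else:
            best = None  # never seen in training: nothing left to back off to
        tagged.append((word, best))
        prev_tag = best
    return tagged


# Invented training data: a single tagged sentence.
training = [[("permittis", "V2SPIA---"), ("subito", "D--------"),
             ("ne", "D--------"), ("male", "D--------")]]
unigram, bigram = train(training)
print(tag("permittis subito ne male inepta cadam".split(), unigram, bigram))
```

With this toy training set, `inepta` and `cadam` never occur, so every level of the chain misses and the result is `None`, the same shape as the `tag_ngram_123_backoff` output in the notebook.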

@kylepjohnson

Two more notes:

  1. The two data sets both came from Perseus, which means that their labels will match one another (e.g., neo1 for the lemmatizer refers to the same neo1 as in the POS tagger). (Though note that their current tags, and perhaps their headwords/lemma keys too, have changed slightly since.)

  2. Patrick's backoff lemmatizer returns better results. Have you tried it, too? http://docs.cltk.org/en/latest/latin.html#lemmatization-backoff-method

@bnagy
Author

bnagy commented Sep 26, 2018

Thanks a lot for the comments. Re the backoff lemmatizer, I didn't see any POS information, which is what I need at the moment. I'm making a note to look at the ContextPOSLemmatizer. I think that you could probably do a thing with the Perseus treebank data for form selection to Collatinus, just select the POS which is most common overall in the hand-analysed data? I would need more time / help to work out how to do that, though. The problem I'm seeing here is that the morphological guesses are not strong enough in the first place - it's not that they're selecting the wrong form, it's that the right form isn't among the candidates at all...
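
A minimal sketch of that selection idea, assuming a morphological analyser (such as Collatinus) returns several candidate tags per form and the hand-analysed data is available as (form, tag) pairs. The data and the `pos_frequencies` and `pick_analysis` helpers are invented for illustration; no real Collatinus or treebank API is used.

```python
from collections import Counter


def pos_frequencies(treebank):
    """Count how often each part of speech (first tag character) occurs."""
    return Counter(tag[0] for _form, tag in treebank)


def pick_analysis(candidates, pos_counts):
    """Pick the candidate whose part of speech is most common overall.

    Returns None when there are no candidates at all, which is the
    failure case described above: there is nothing to select from.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda tag: pos_counts[tag[0]])


# Invented hand-analysed data as (form, tag) pairs.
treebank = [("est", "v3spia---"), ("sunt", "v3ppia---"),
            ("cura", "n-s---fn-"), ("male", "d--------"),
            ("sit", "v3spsa---")]
counts = pos_frequencies(treebank)

# Invented candidate analyses for 'cadam'; the last mimics the CRF guess.
print(pick_analysis(["v1sfia---", "v1spsa---", "a-s---fa-"], counts))
print(pick_analysis([], counts))
```

Ties between candidates with the same part of speech are broken by list order, so POS frequency alone cannot choose between the future indicative and the present subjunctive reading of `cadam`.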
