@bnagy

bnagy/pos.ipynb Secret

Created June 28, 2018 01:59
Thinking about using CLTK for verb form tagging
@kylepjohnson

The Lemmatizer is able to correctly identify cadam as an inflected form of cado

But the POS taggers have no idea how to handle it

LemmaReplacer is as dumb as it gets (and one of the first things I wrote ~6 years ago). It reads a key-val list of <declined-form>: <lemma> and does a string match-based replace. There is no ambiguity handling and I just took the most frequently occurring form. The code used to create this (out of the old Perseus files in the form of <lemma>: [<declined form 1>, ... <declined form n>]) is here: https://github.com/cltk/latin_pos_lemmata_cltk/blob/master/transform_lemmata.py.
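A minimal sketch of the inversion and string-match replacement described above. This is not the CLTK code: the function names `invert_lemmata` and `replace_forms` are hypothetical, the frequency counts are invented, and "most frequent" is assumed here to mean picking the lemma with the highest count when a form maps to more than one lemma.

```python
from collections import defaultdict


def invert_lemmata(lemma_forms, lemma_freq):
    """Turn {lemma: [forms]} into {form: lemma}.

    When one form belongs to several lemmata, keep the lemma with the
    highest count in lemma_freq (an assumed tie-breaking rule).
    """
    candidates = defaultdict(list)
    for lemma, forms in lemma_forms.items():
        for form in forms:
            candidates[form].append(lemma)
    return {
        form: max(lemmas, key=lambda lemma: lemma_freq.get(lemma, 0))
        for form, lemmas in candidates.items()
    }


def replace_forms(text, form_to_lemma):
    """Replace each whitespace-separated token by its lemma, if known."""
    return [form_to_lemma.get(tok, tok) for tok in text.lower().split()]


# Toy data: "est" is a form of both sum ("to be") and edo ("to eat").
lemma_forms = {
    "cado": ["cado", "cadam", "cadit"],
    "sum": ["sum", "est"],
    "edo": ["edo", "est"],
}
lemma_freq = {"sum": 1000, "edo": 10, "cado": 50}  # invented counts

form_to_lemma = invert_lemmata(lemma_forms, lemma_freq)
print(replace_forms("cadam est ignotum", form_to_lemma))
```

Note that the ambiguity is resolved once, when the table is built, so `est` always comes back as `sum` regardless of context, and an unknown token such as `ignotum` passes through unchanged.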

About the POS taggers: these are made using NLTK's built-in statistical taggers. You can trace back the class's logic from here: https://github.com/cltk/cltk/blob/e91f44d66ea2009a388dc9a3a224b138d9e003d6/cltk/tag/pos.py#L47. And the models were created from the same CLTK repo I cited above, though in this module: https://github.com/cltk/latin_pos_lemmata_cltk/blob/master/pos_latin.py#L13.
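To illustrate why a tagger trained on observed tokens can come up empty on a form like cadam even though a lemma table knows it, here is a toy unigram tagger in plain Python. It is a simplified stand-in, not NLTK's or CLTK's implementation; the training pairs and the tag strings `V` and `N` are invented for the example.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_tokens):
    """Map each token seen in training to its most frequent tag."""
    counts = defaultdict(Counter)
    for token, tag in tagged_tokens:
        counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def tag_tokens(tokens, model, default=None):
    """Tag tokens by lookup; unseen tokens get the default tag."""
    return [(tok, model.get(tok, default)) for tok in tokens]


# Invented training data: "cadam" never occurs in it.
training = [("cado", "V"), ("cadit", "V"), ("est", "V"), ("rosa", "N")]
model = train_unigram(training)
tagged = tag_tokens(["cadit", "cadam"], model)
print(tagged)
```

In this sketch the seen form `cadit` gets a tag while `cadam` falls through to the default, because the model has no notion that the two share a lemma.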

@kylepjohnson

Two more notes:

  1. The two data sets both came from Perseus, which means that their labels will match one another (e.g., neo1 for the lemmatizer refers to the same neo1 as in the POS tagger). Note, though, that their current tags, and perhaps their headwords/lemma keys too, have changed slightly since.

  2. Patrick's backoff lemmatizer returns better results. Have you tried it, too? http://docs.cltk.org/en/latest/latin.html#lemmatization-backoff-method

@bnagy
Author

bnagy commented Sep 26, 2018

Thanks a lot for the comments. Re the backoff lemmatizer, I didn't see any POS information, which is what I need at the moment. I'm making a note to look at the ContextPOSLemmatizer. I think you could probably use the Perseus treebank data to add form selection to Collatinus: just select the POS which is most common overall in the hand-analysed data? I would need more time / help to work out how to do that, though. The problem I'm seeing here is that the morphological guesses are not strong enough to begin with: it's not that they're selecting the wrong form, it's that the right form isn't among the candidates at all...
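A rough sketch of the selection idea floated above, assuming the candidate analyses arrive as a list of dicts with `lemma` and `pos` keys and that overall POS counts have already been tallied from hand-analysed treebank data. The function name, the data shapes, and the counts are all hypothetical; this is not the Collatinus or CLTK API.

```python
from collections import Counter


def pick_by_pos_frequency(candidates, pos_counts):
    """Return the candidate whose POS is most common overall, or None.

    Selection can only choose among the analyses it is given: if the
    right analysis is missing from candidates, it cannot be recovered.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda a: pos_counts.get(a["pos"], 0))


# Invented overall POS counts standing in for treebank statistics.
pos_counts = Counter({"noun": 5000, "verb": 3000})

# "amor" can be the noun amor or a passive form of the verb amo.
amor = [{"lemma": "amo", "pos": "verb"}, {"lemma": "amor", "pos": "noun"}]
print(pick_by_pos_frequency(amor, pos_counts))

# No candidates at all: nothing to select from.
print(pick_by_pos_frequency([], pos_counts))
```

The second call shows the limitation raised in the comment: a frequency-based selector helps only when the morphological analyser has already proposed the correct form.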
