@bnagy bnagy/pos.ipynb Secret
Created Jun 28, 2018

Thinking about using CLTK for verb form tagging
@kylepjohnson

commented Sep 25, 2018

The Lemmatizer is able to correctly identify cadam as an inflected form of cado

But the POS taggers have no idea how to handle it

LemmaReplacer is as dumb as it gets (and one of the first things I wrote ~6 years ago). It reads a key-val list of <declined-form>: <lemma> and does a string match-based replace. There is no ambiguity handling and I just took the most frequently occurring form. The code used to create this (out of the old Perseus files in the form of <lemma>: [<declined form 1>, ... <declined form n>]) is here: https://github.com/cltk/latin_pos_lemmata_cltk/blob/master/transform_lemmata.py.
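As a rough illustration of the approach described above (not the actual CLTK code), the sketch below inverts a `<lemma>: [<declined forms>]` mapping into a form-to-lemma lookup and then does a plain token replace. The function names, the toy entries, and the frequency counts are all hypothetical, and the rule of keeping the more frequent lemma for an ambiguous form is an assumption about how the ambiguity was resolved.

```python
import re
from collections import Counter


def invert_lemmata(lemma_to_forms, lemma_freq):
    """Turn {lemma: [forms]} into {form: lemma}.

    When a form belongs to several lemmata, keep the lemma with the
    higher count in lemma_freq (assumed tie-break; ties keep the first seen).
    """
    form_to_lemma = {}
    for lemma, forms in lemma_to_forms.items():
        for form in forms:
            current = form_to_lemma.get(form)
            if current is None or lemma_freq[lemma] > lemma_freq[current]:
                form_to_lemma[form] = lemma
    return form_to_lemma


def replace_forms(text, form_to_lemma):
    """Lowercase, split into word tokens, and swap each known form for its lemma."""
    tokens = re.findall(r"\w+", text.lower())
    return [form_to_lemma.get(tok, tok) for tok in tokens]


# Toy data: "cado" is both a form of the verb cado and of the noun cadus.
lemma_to_forms = {
    "cado": ["cado", "cadam", "cecidi"],
    "cadus": ["cadus", "cado"],
}
lemma_freq = Counter({"cado": 120, "cadus": 8})  # made-up counts

lookup = invert_lemmata(lemma_to_forms, lemma_freq)
print(replace_forms("Cadam cado ignotum", lookup))
```

Because the lookup is a flat dictionary, context never enters into it: an ambiguous form always maps to the same lemma, and any form missing from the list passes through unchanged.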

About the POS taggers, these are made using NLTK's built-in statistical taggers. You can trace back the class's logic from here: https://github.com/cltk/cltk/blob/e91f44d66ea2009a388dc9a3a224b138d9e003d6/cltk/tag/pos.py#L47. And the models were created from the same CLTK repo I cited above, though in this module: https://github.com/cltk/latin_pos_lemmata_cltk/blob/master/pos_latin.py#L13.

@kylepjohnson

commented Sep 25, 2018

Two more notes:

  1. The two data sets both came from Perseus, which means that their labels will match one another (e.g., neo1 for the lemmatizer refers to the same neo1 as in the POS tagger). Note, though, that the current tags, and perhaps the headwords/lemma keys too, have changed slightly since.

  2. Patrick's backoff lemmatizer returns better results. Have you tried it, too? http://docs.cltk.org/en/latest/latin.html#lemmatization-backoff-method

@bnagy

Owner Author

commented Sep 26, 2018

Thanks a lot for the comments. Re the backoff lemmatizer, I didn't see any POS information, which is what I need at the moment. I'm making a note to look at the ContextPOSLemmatizer. I think you could probably use the Perseus treebank data to add form selection to Collatinus: just select the POS which is most common overall in the hand-analysed data? I would need more time / help to work out how to do that, though. The problem I'm seeing here is that the morphological guesses are not strong enough to begin with: it's not that they're selecting the wrong form, it's that the right form isn't there at all...
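A minimal sketch of the frequency-based selection idea, assuming the hand-analysed data can be reduced to `(form, tag)` pairs and that an analyser supplies a list of candidate tags for each form. The function names and the tag labels (`verb.fut`, `verb.subj`) are placeholders for illustration, not Perseus or Collatinus conventions.

```python
from collections import Counter, defaultdict


def count_analyses(treebank_tokens):
    """Count tags per form and overall from (form, tag) pairs."""
    per_form = defaultdict(Counter)
    overall = Counter()
    for form, tag in treebank_tokens:
        per_form[form.lower()][tag] += 1
        overall[tag] += 1
    return per_form, overall


def pick_analysis(form, candidates, per_form, overall):
    """Pick the candidate tag seen most often with this form.

    Falls back to the tag's overall frequency for unseen forms.
    Returns None when the analyser offered no candidates at all.
    """
    if not candidates:
        return None
    seen = per_form.get(form.lower(), Counter())
    return max(candidates, key=lambda tag: (seen[tag], overall[tag]))


# Toy hand-analysed data (invented for the example).
treebank = [
    ("cadam", "verb.subj"),
    ("cadam", "verb.fut"),
    ("cadam", "verb.subj"),
    ("amabo", "verb.fut"),
    ("amabit", "verb.fut"),
]
per_form, overall = count_analyses(treebank)

print(pick_analysis("cadam", ["verb.fut", "verb.subj"], per_form, overall))
print(pick_analysis("legam", ["verb.fut", "verb.subj"], per_form, overall))
print(pick_analysis("cadam", [], per_form, overall))
```

The last call reflects the problem raised above: selection can only rank the candidates it is given, so if the analyser never proposes the right form, no amount of frequency data will recover it.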
