Two more notes:
- The two data sets both came from Perseus, which means that their labels will match one another (e.g., `neo1` for the lemmatizer refers to the same `neo1` as in the POS tagger). (Though note their current tags, and perhaps their headwords/lemma keys too, have changed slightly since.)
- Patrick's backoff lemmatizer returns better results. Have you tried it, too? http://docs.cltk.org/en/latest/latin.html#lemmatization-backoff-method
Thanks a lot for the comments. Re the backoff lemmatizer, I didn't see any POS information, which is what I need at the moment. I'm making a note to look at the ContextPOSLemmatizer. I think you could probably use the Perseus treebank data to do form selection for Collatinus: just select the POS that is most common overall in the hand-analysed data. I would need more time / help to work out how to do that, though. The problem I'm seeing here is that the morphological guesses are not strong enough in the first place: it's not that they're selecting the wrong form, it's that the right form isn't there at all...
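For what it's worth, the "most common POS" idea could be sketched roughly as follows. This is only an illustration under assumptions: the `(form, pos)` pair format, the tag names, and both helper functions are hypothetical, not the Perseus treebank format or any CLTK/Collatinus API.

```python
from collections import Counter, defaultdict


def most_common_pos(tagged_tokens):
    """Map each surface form to the POS tag it carries most often."""
    counts = defaultdict(Counter)
    for form, pos in tagged_tokens:
        counts[form.lower()][pos] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def pick_analysis(form, candidates, table):
    """Keep candidate (lemma, pos) analyses whose POS matches the most
    common one for this form; fall back to all candidates otherwise."""
    preferred = table.get(form.lower())
    matching = [c for c in candidates if c[1] == preferred]
    return matching or candidates


# Toy data invented for illustration (not real treebank counts).
toy_pairs = [("amor", "NOUN"), ("amor", "NOUN"), ("amor", "VERB"), ("est", "VERB")]
pos_table = most_common_pos(toy_pairs)
chosen = pick_analysis("amor", [("amor", "NOUN"), ("amo", "VERB")], pos_table)
```

Note that this only re-ranks analyses that are already on the table, so it does nothing for the case where the right form is missing from the candidates.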
`LemmaReplacer` is as dumb as it gets (and one of the first things I wrote ~6 years ago). It reads a key-value list of `<declined-form>: <lemma>` and does a string-match-based replace. There is no ambiguity handling and I just took the most frequently occurring form. The code used to create this (out of the old Perseus files in the form of `<lemma>: [<declined form 1>, ... <declined form n>]`) is here: https://github.com/cltk/latin_pos_lemmata_cltk/blob/master/transform_lemmata.py.

About the POS taggers, these are made using the NLTK's built-in statistical taggers. You can trace back the class's logic from here: https://github.com/cltk/cltk/blob/e91f44d66ea2009a388dc9a3a224b138d9e003d6/cltk/tag/pos.py#L47. The models were created from the same CLTK repo I cited above, though in this module: https://github.com/cltk/latin_pos_lemmata_cltk/blob/master/pos_latin.py#L13.
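To make the lookup-and-replace idea behind `LemmaReplacer` concrete, here is a minimal sketch. It is not the code in `transform_lemmata.py`: the function names and toy entries are invented, and when a form belongs to more than one lemma the sketch simply keeps the first lemma seen, as a stand-in for whatever choice the real script makes.

```python
def invert_lemmata(lemma_to_forms):
    """Turn {lemma: [form, ...]} into {form: lemma}.

    Assumption: on ambiguity the first lemma encountered wins.
    """
    form_to_lemma = {}
    for lemma, forms in lemma_to_forms.items():
        for form in forms:
            form_to_lemma.setdefault(form, lemma)
    return form_to_lemma


def replace_lemmata(tokens, form_to_lemma):
    """Replace each token with its lemma by exact string match;
    tokens with no entry pass through unchanged."""
    return [form_to_lemma.get(tok, tok) for tok in tokens]


# Toy entries invented for illustration, in the Perseus-style headword shape.
toy_lemmata = {
    "amo1": ["amo", "amas", "amat"],
    "amor1": ["amor", "amoris"],
}
lookup = invert_lemmata(toy_lemmata)
lemmatized = replace_lemmata(["amat", "amoris", "et"], lookup)
```

Because the lookup is a plain dictionary, every occurrence of a form gets the same lemma regardless of context, which is exactly the missing ambiguity handling mentioned above.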