@nevenjovanovic
Last active December 16, 2018 19:46
A list of Greek words not recognized by the Morpheus parser (online and in CLTK)

21 Greek word forms not recognized by the Morpheus parser in a prose set of 3,863 word forms

Data: 60 brief passages of Ancient Greek prose, from Herodotus to Plotinus. 9,278 words total and 3,863 different word forms.

The texts (in plain text format) are published here: [https://bitbucket.org/nevenjovanovic/hellenismos-hypostates/src/master/pos_txt/], directories p1, p2, p3.

The tokenized and cleaned-up XML version (words in w, punctuation in pc, names of source files as @id; combined diacritics and letters replaced with precomposed characters where necessary) is in the same repository: [https://bitbucket.org/nevenjovanovic/hellenismos-hypostates/src/master/pos_txt/tokenizedp/grctxt.xml].
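The replacement of combined diacritics and letters with precomposed characters can be illustrated with Unicode normalization. The following is a minimal sketch, assuming NFC normalization is an acceptable stand-in for the cleanup step; the original workflow may have used a different tool, and the function name `to_precomposed` is hypothetical.

```python
import unicodedata


def to_precomposed(text: str) -> str:
    """Return text in Unicode NFC, merging base letters with combining marks
    wherever a precomposed code point exists."""
    return unicodedata.normalize("NFC", text)


# Alpha (U+03B1) followed by a combining smooth breathing (U+0313):
# two code points that render as one letter.
decomposed = "\u03b1\u0313"
composed = to_precomposed(decomposed)

print(len(decomposed), len(composed))
```

After normalization the two code points become the single precomposed character U+1F00, so string comparison and dictionary lookup no longer depend on how the source file encoded the diacritic.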

The words were sent to the online Morpheus parser at [http://morph.perseids.org/analysis/word?lang=grc&engine=morpheusgrc&word=], using the XQuery script [https://bitbucket.org/nevenjovanovic/hellenismos-hypostates/src/master/scripts/ParsePerseusGetHeadwordFromDB.xq].
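The author's queries were made with the XQuery script linked above. Purely as an illustration of how a request URL for a single word form can be assembled from that endpoint, here is a small Python sketch; the helper name `morpheus_url` is hypothetical, and the code neither sends the request nor assumes anything about the response format.

```python
from urllib.parse import quote

# Endpoint as given in the text; the word form is appended to it.
MORPHEUS_ENDPOINT = (
    "http://morph.perseids.org/analysis/word"
    "?lang=grc&engine=morpheusgrc&word="
)


def morpheus_url(word_form: str) -> str:
    """Build the request URL for one word form, percent-encoding it as UTF-8."""
    return MORPHEUS_ENDPOINT + quote(word_form, safe="")


print(morpheus_url("ΚΑΙ"))
```

Percent-encoding keeps the Greek characters intact in transit; the resulting URL can be passed to any HTTP client.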

The word forms unrecognized by Morpheus were submitted to the CLTK Greek lemmatization module following instructions at [http://docs.cltk.org/en/latest/greek.html#lemmatization] (Python on the command line, using the IPython interactive shell).

The online Morpheus and its CLTK version fail on the same 21 word forms. Some of them are not recognized because they are written entirely in uppercase. The unrecognized forms of μέγας and καλός should appear quite often in texts. The other forms can be considered rare, although the set of 60 passages was prepared for pedagogical purposes and taken from works that are well established in the Greek canon.

The list is as follows.

Greek word forms not recognized by the Morpheus lemmatizer

ΑΕΤΟΣ
ΑΛΩΠΗΞ
ἀμύνης
ΑΝΘΡΩΠΟΣ
ἀσθμήνας
ἠλέχθη
ΚΑΙ
καλὰ
καταδήξας
ΚΛΕΠΤΗΣ
κρανίας
μεγάλου
μεγάλῳ
ΜΗΤΗΡ
ΠΑΙΣ
πάμα
ΠΙΘΗΚΟΣ
ΣΑΤΥΡΟΣ
στερνοκοπούσης
ΤΡΑΓΟΣ
φιλοκερδίᾳ
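To see how many of the listed forms fall into the all-uppercase group, the list can be split with a few lines of Python. This is only a sketch for inspecting the list above; the variable names are hypothetical, and it makes no claim about why the remaining forms go unrecognized.

```python
# The 21 word forms from the list above.
UNRECOGNIZED = [
    "ΑΕΤΟΣ", "ΑΛΩΠΗΞ", "ἀμύνης", "ΑΝΘΡΩΠΟΣ", "ἀσθμήνας", "ἠλέχθη", "ΚΑΙ",
    "καλὰ", "καταδήξας", "ΚΛΕΠΤΗΣ", "κρανίας", "μεγάλου", "μεγάλῳ", "ΜΗΤΗΡ",
    "ΠΑΙΣ", "πάμα", "ΠΙΘΗΚΟΣ", "ΣΑΤΥΡΟΣ", "στερνοκοπούσης", "ΤΡΑΓΟΣ",
    "φιλοκερδίᾳ",
]

# str.isupper() is True when every cased character in the string is uppercase.
all_caps = [w for w in UNRECOGNIZED if w.isupper()]
other = [w for w in UNRECOGNIZED if not w.isupper()]

print(len(all_caps), "all-uppercase forms:", " ".join(all_caps))
print(len(other), "other forms:", " ".join(other))
```

Ten of the 21 forms are all uppercase; the remaining eleven are lowercase forms with diacritics.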

Biblissima / Eulexis

Of the unrecognized word forms above, the Eulexis parser by Biblissima fails on only the following five, reporting "Non trouvé" ("not found") for each:

ἀσθμήνας : Non trouvé
ἠλέχθη : Non trouvé
καταδήξας : Non trouvé
στερνοκοπούσης : Non trouvé
φιλοκερδίᾳ : Non trouvé

Eulexis, however, offers no API.

GLEM

We have not yet checked how the GLEM parser handles the unrecognized word forms.
