Data: 60 brief passages of Ancient Greek prose, from Herodotus to Plotinus, containing 9,278 words in total and 3,863 distinct word forms.
The texts (in plain text format) are published here: [https://bitbucket.org/nevenjovanovic/hellenismos-hypostates/src/master/pos_txt/], directories p1, p2, p3.
The tokenized and cleaned-up XML version (words in w, punctuation in pc, names of source files as @id; combined diacritics and letters replaced with precomposed characters where necessary) is in the same repository: [https://bitbucket.org/nevenjovanovic/hellenismos-hypostates/src/master/pos_txt/tokenizedp/grctxt.xml].
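The clean-up procedure actually used for the repository is not described here. As a minimal sketch of how letters followed by combining diacritics can be replaced with precomposed characters, assuming Unicode NFC normalization is the intended result, Python's standard unicodedata module can be used. The function name to_precomposed is hypothetical and does not come from the repository.

```python
import unicodedata


def to_precomposed(text: str) -> str:
    """Return text in Unicode NFC form, composing letters with their
    combining diacritics wherever a precomposed character exists."""
    return unicodedata.normalize("NFC", text)


# Greek small alpha (U+03B1) followed by combining comma above (U+0313),
# i.e. alpha with smooth breathing written as two code points.
decomposed = "\u03b1\u0313"
composed = to_precomposed(decomposed)

# The decomposed string has two code points, the composed one has one.
print(len(decomposed), len(composed))
```

Normalizing both the texts and any word list to the same form before lookup avoids mismatches between strings that look identical on screen but differ in their code points.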
The words were sent to the online Morpheus parser at [http://morph.perseids.org/analysis/word?lang=grc&engine=morpheusgrc&word=], using the XQuery script [https://bitbucket.org/nevenjovanovic/hellenismos-hypostates/src/master/scripts/ParsePerseusGetHeadwordFromDB.xq].
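The queries themselves were made by the XQuery script linked above. Purely as an illustration of how a request URL for a single word form can be built from the address given here, the following Python sketch percent-encodes the word and appends it to the base URL. It is not a translation of the XQuery script; the names MORPHEUS_BASE and morpheus_url are hypothetical, and the sketch sends no request and assumes nothing about the format of the response.

```python
from urllib.parse import quote

# Base address as given in the text; the word form is appended to it.
MORPHEUS_BASE = (
    "http://morph.perseids.org/analysis/word"
    "?lang=grc&engine=morpheusgrc&word="
)


def morpheus_url(word_form: str) -> str:
    """Return the query URL for one word form, with the word
    percent-encoded as UTF-8."""
    return MORPHEUS_BASE + quote(word_form)


# Build (but do not send) the query for one of the word forms.
print(morpheus_url("ΚΑΙ"))
```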
The word forms not recognized by Morpheus were submitted to the CLTK Greek lemmatization module, following the instructions at [http://docs.cltk.org/en/latest/greek.html#lemmatization] (Python on the command line, using the IPython interactive shell).
Neither the online Morpheus nor its CLTK version recognizes 21 word forms, and both fail on the same 21. Some of them are not recognized because they are written entirely in uppercase. The unrecognized forms of μέγας and καλός (μεγάλου, μεγάλῳ, καλὰ) should occur quite often in texts. The other forms can be considered rare, although the set of 60 passages was prepared for pedagogical purposes and taken from works that are firmly established in the Greek canon.
The list is as follows.
ΑΕΤΟΣ
ΑΛΩΠΗΞ
ἀμύνης
ΑΝΘΡΩΠΟΣ
ἀσθμήνας
ἠλέχθη
ΚΑΙ
καλὰ
καταδήξας
ΚΛΕΠΤΗΣ
κρανίας
μεγάλου
μεγάλῳ
ΜΗΤΗΡ
ΠΑΙΣ
πάμα
ΠΙΘΗΚΟΣ
ΣΑΤΥΡΟΣ
στερνοκοπούσης
ΤΡΑΓΟΣ
φιλοκερδίᾳ
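To see how much of the list above is explained by capitalization alone, the forms can be split into those written entirely in uppercase and the rest. The following is a small sketch using only Python's built-in str.isupper(); the variable names are hypothetical.

```python
# The 21 word forms recognized by neither the online Morpheus nor CLTK.
unrecognized = [
    "ΑΕΤΟΣ", "ΑΛΩΠΗΞ", "ἀμύνης", "ΑΝΘΡΩΠΟΣ", "ἀσθμήνας", "ἠλέχθη",
    "ΚΑΙ", "καλὰ", "καταδήξας", "ΚΛΕΠΤΗΣ", "κρανίας", "μεγάλου",
    "μεγάλῳ", "ΜΗΤΗΡ", "ΠΑΙΣ", "πάμα", "ΠΙΘΗΚΟΣ", "ΣΑΤΥΡΟΣ",
    "στερνοκοπούσης", "ΤΡΑΓΟΣ", "φιλοκερδίᾳ",
]

# str.isupper() is True when every cased character is uppercase.
uppercase_forms = [w for w in unrecognized if w.isupper()]
other_forms = [w for w in unrecognized if not w.isupper()]

print(len(uppercase_forms), "uppercase forms:", uppercase_forms)
print(len(other_forms), "other forms:", other_forms)
```

This separates 10 all-uppercase forms from 11 others. Note that the uppercase forms carry no accents or breathings, so simply lowercasing them would not restore the accented spelling.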
Of the unrecognized word forms above, the Eulexis parser by Biblissima does not recognize the following five (it reports "Non trouvé", French for "not found"):
ἀσθμήνας : Non trouvé
ἠλέχθη : Non trouvé
καταδήξας : Non trouvé
στερνοκοπούσης : Non trouvé
φιλοκερδίᾳ : Non trouvé
However, there is no API for Eulexis.
We have not yet checked how the GLEM parser handles the unrecognized word forms.