Classical Language Toolkit (CLTK) - Google Summer of Code (GSoC) 2017
- The goal of my GSoC project was to extend CLTK coverage to Old and Middle French. To this end I compiled/developed the following:
a corpus of Old French (OF), with a particular focus on Anglo-Norman, and Middle French (MF) texts, available at https://github.com/cltk/french_text.
tokenizers for OF and MF - word and line.
stopword filterer - a list of stopwords derived from the corpus and a list of auxiliaries from a grammar of OF and MF.
named entity recognition - derived from the corpus. Named entities are tagged and identified as belonging to one of ten categories: locations “LOC”, nationalities/places of origin “NAT” (e.g. Grius), animals “ANI” (horses, e.g. Veillantif; cows, e.g. Blerain; dogs, e.g. Husdent), authors “AUT” (e.g. Marie, Chrestïen), nobility “CHI” (e.g. Rolland, Artus), characters from classical sources “CLAS” (e.g. Echo), feasts “F” (e.g. Pentecost), religious figures “REL” (saints, e.g. St Alexis; deities, e.g. Deus; and Old Testament people, e.g. Adam), swords “SW” (e.g. Hautecler), and commoners “VIL” (e.g. Pathelin).
stemmer - strips morphological endings from an input string.
normalizer - normalizes Anglo-Norman-specific spellings to those of "orthographe commune", the most "standard" orthography and the one used by OF and MF resources.
lemmatizer - provides lemmas for a list of input tokens. It first seeks a match between each token and a list of potential lemmas taken from Godefroy’s Lexique (1901), the Tobler-Lommatzsch, and the DECT. If no match is found, the lemmatizer then compares the token against the forms that different lemmas are known to take (at present this only applies to lemmas from A-D and W-Z). If there is still no match, a set of rules is applied to the token. These rules are similar to those applied by the stemmer, but they aim to bring forms in line with lemmas rather than truncating them. Finally, if no match is found between the modified token and the list of lemmas, a result of ‘None’ is returned.
(n.b. All the modules above have associated tests at cltk/cltk/tests and documentation at cltk/docs.)
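The lemmatizer's fallback order described above can be sketched roughly as follows. This is a minimal illustration and not the CLTK implementation: the names `LEMMAS`, `KNOWN_FORMS`, `RULES`, `lemmatize_token` and `lemmatize` are hypothetical, and the tiny word lists and suffix rules are toy data invented for the example rather than taken from the dictionaries or from the module itself.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy data for illustration only (assumption): the real module draws its
# lemma list and attested forms from the dictionaries named above.
LEMMAS: Set[str] = {"chevalier", "avoir", "roi", "chanter"}
KNOWN_FORMS: Dict[str, str] = {"ot": "avoir", "rois": "roi"}
# Hypothetical (suffix, replacement) rules that rewrite an ending so the
# token looks like a lemma, instead of simply truncating it as a stemmer would.
RULES: List[Tuple[str, str]] = [("ent", "er"), ("s", "")]


def lemmatize_token(token: str) -> Optional[str]:
    """Return a lemma for one token, or None if every stage fails."""
    form = token.lower()
    # Stage 1: the token is itself a listed lemma.
    if form in LEMMAS:
        return form
    # Stage 2: the token is a known form of some lemma.
    if form in KNOWN_FORMS:
        return KNOWN_FORMS[form]
    # Stage 3: apply each rule and check the result against the lemma list.
    for suffix, replacement in RULES:
        if form.endswith(suffix):
            candidate = form[: len(form) - len(suffix)] + replacement
            if candidate in LEMMAS:
                return candidate
    # Stage 4: nothing matched.
    return None


def lemmatize(tokens: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair every input token with its lemma (or None)."""
    return [(token, lemmatize_token(token)) for token in tokens]


if __name__ == "__main__":
    print(lemmatize(["Chevalier", "ot", "chantent", "chevaliers", "xyz"]))
```

Keeping the stages in a fixed order means that an exact dictionary match always wins over a rule-based guess, which is why an ambiguous form such as "ot" resolves to a single lemma here; choosing between readings would need the context-based approach mentioned under further work.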
- Further work on OF/MF within the CLTK could:
improve the lemmatizer further, e.g. by adding coverage for forms of lemmas D-W and by introducing context-based lemmatizing to disambiguate forms (e.g. "ot" can be a form of the verb "avoir" or a conjunction, "ot"). Context-based lemmatizing depends on an annotated corpus, which does not seem to exist at this time.
expand the modules to more dialects of OF, e.g. by expanding the scope of the normalizer.
- Here are links to what was described above:
- corpora, data & dictionary files: https://github.com/cltk/french_text_wikisource, https://github.com/cltk/french_data_cltk & https://github.com/cltk/french_lexicon_cltk
- core OF/MF modules: https://github.com/cltk/cltk/pull/571
- Thanks & Acknowledgements:
Thank you first and foremost to my mentors, Patrick J. Burns and Marius Jøhndal, for their invaluable help, insights, and support throughout the project. Thank you also to Kyle P. Johnson and the rest of the CLTK team, and to the anonymous heroes who digitised OF and MF texts, dictionaries, and grammars and made them available on the Internet. Thank you to GSoC for introducing me to the CLTK and to open-source software development more generally. I have learned more over the course of the project than I thought possible, both about natural language processing (it's messy stuff) and about open-source coding practices, from using git to ensuring everything is properly documented.