GSoC 2016 Summary
Patrick J. Burns, PhD
Classical Language Toolkit
Google Summer of Code 2016 Final Report
Here is a summary of the work I completed for the 2016 Google Summer of Code project "CLTK Latin/Greek Backoff Lemmatizer" for the Classical Language Toolkit (cltk.org). The code can be found at https://github.com/diyclassics/cltk/tree/lemmatize/cltk/lemmatize.
- Wrote custom lemmatizers for Latin and Greek as subclasses of the taggers in NLTK's tag module (http://www.nltk.org/api/nltk.tag.html), including:
  - Default lemmatization, i.e. the same lemma returned for every token
  - Identity lemmatization, i.e. the original token returned as its lemma
  - Model lemmatization, i.e. lemma returned based on dictionary lookup
  - Context lemmatization, i.e. lemma returned based on proximal token/lemma tuples in training data
  - Context/POS lemmatization, i.e. same as above, but proximal tuples are inspected for POS information
  - Regex lemmatization, i.e. lemma returned through rules-based inspection of token endings
  - Principal parts lemmatization, i.e. same as above, but matched regexes are then subjected to dictionary lookup to determine the lemma
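Although the real implementations subclass NLTK's taggers, the lookup-style strategies above can be sketched standalone; the class and method names here are illustrative, not the actual CLTK API:

```python
# Minimal standalone sketches of the lookup-style lemmatization
# strategies; names are illustrative, not the actual CLTK classes.
import re

class DefaultLemmatizer:
    """Return the same lemma for every token."""
    def __init__(self, lemma):
        self.lemma = lemma
    def lemmatize(self, token):
        return self.lemma

class IdentityLemmatizer:
    """Return the original token as its own lemma."""
    def lemmatize(self, token):
        return token

class ModelLemmatizer:
    """Return a lemma via dictionary lookup; None if not found."""
    def __init__(self, model):
        self.model = model
    def lemmatize(self, token):
        return self.model.get(token)

class RegexLemmatizer:
    """Return a lemma by rules-based inspection of token endings."""
    def __init__(self, patterns):
        self.patterns = patterns  # list of (regex, replacement) pairs
    def lemmatize(self, token):
        for pattern, repl in self.patterns:
            if re.search(pattern, token):
                return re.sub(pattern, repl, token)
        return None

model = ModelLemmatizer({'arma': 'arma', 'virum': 'vir'})
# Toy rule: 1st-conj. 1st-person plural '-amus' -> 1st-person singular '-o'
regex = RegexLemmatizer([(r'(\w+)amus$', r'\1o')])
print(model.lemmatize('virum'))   # vir
print(regex.lemmatize('amamus'))  # amo
```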
- Organized the custom lemmatizers into a backoff chain, increasing accuracy by as much as 28.9% over dictionary lookup alone. Final accuracy tests on the test corpus averaged 90.82%.
  - An example backoff chain is included in the backoff.py file as the class LazyLatinLemmatizer.
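The backoff idea itself is simple: each lemmatizer handles what it can and passes unresolved tokens down the chain. A minimal standalone sketch of the pattern (not LazyLatinLemmatizer's actual chain, which lives in backoff.py):

```python
# Illustrative sketch of the backoff pattern, modeled loosely on
# NLTK's SequentialBackoffTagger; names here are hypothetical.
class DictLemmatizer:
    """Dictionary lookup; unresolved tokens go to the backoff."""
    def __init__(self, lookup, backoff=None):
        self.lookup = lookup      # dict mapping token -> lemma
        self.backoff = backoff    # next lemmatizer to try, or None
    def lemmatize(self, token):
        lemma = self.lookup.get(token)
        if lemma is not None:
            return lemma
        return self.backoff.lemmatize(token) if self.backoff else None

class IdentityLemmatizer:
    """Final fallback: return the token itself."""
    def lemmatize(self, token):
        return token

# Chain: high-frequency dictionary first, identity as a catch-all.
chain = DictLemmatizer({'virum': 'vir'}, backoff=IdentityLemmatizer())
print(chain.lemmatize('virum'))  # vir (dictionary hit)
print(chain.lemmatize('arma'))   # arma (falls through to identity)
```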
- Constructed models for language-specific lookup tasks, including:
  - Dictionaries of high-frequency, unambiguous lemmas
  - Regex patterns for high-accuracy lemma prediction
- Constructed models to be used as training data for context-based lemmatization
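As a rough illustration of how training data feeds context-based lemmatization, here is a minimal bigram-context model over sentences of (token, lemma) pairs; the data format and names are simplified assumptions, not the AGDT format or the CLTK implementation:

```python
# Minimal illustration of context-based lemmatization: choose a lemma
# from the (previous lemma, token) bigram observed in training data.
from collections import Counter, defaultdict

# Toy training data: sentences as lists of (token, lemma) pairs.
train = [
    [('cum', 'cum2'), ('venisset', 'venio')],                     # conjunction
    [('venit', 'venio'), ('cum', 'cum1'), ('amicis', 'amicus')],  # preposition
]

def build_context_model(sentences):
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, (token, lemma) in enumerate(sent):
            prev = sent[i - 1][1] if i > 0 else '<s>'  # previous lemma or start
            counts[(prev, token)][lemma] += 1
    # Keep the most frequent lemma per (previous lemma, token) context.
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

model = build_context_model(train)
print(model[('<s>', 'cum')])    # cum2: sentence-initial 'cum' in this data
print(model[('venio', 'cum')])  # cum1: 'cum' after 'venio' in this data
```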
- Wrote tests for basic subclasses. Code for tests can be found here: https://github.com/diyclassics/cltk/blob/lemmatize/cltk/tests/test_lemmatize.py
- Tangential work for the CLTK inspired by daily work on the lemmatizer:
  - Continued improvements to the CLTK Latin tokenizer. Lemmatization is performed on tokens, and it is clear that accuracy is affected by the quality of the tokens passed as parameters to the lemmatizer.
  - Introduction of a PlaintextCorpusReader-based corpus of Latin (using the Latin Library corpus) to encourage easier adoption of the CLTK. Initial blog posts on this feature are part of an ongoing series that works through a Latin NLP workflow and will soon treat lemmatization. These posts will document in detail the features developed during this summer project.
Next steps:
- Test various combinations of backoff chains like the one used in LazyLatinLemmatizer to determine which yields the highest accuracy.
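Comparing chains comes down to a simple accuracy measure over a gold-standard test set. A minimal sketch (the gold pairs below are invented for illustration):

```python
# Sketch of an accuracy comparison between candidate lemmatizers;
# the gold-standard pairs here are invented for illustration.
def accuracy(lemmatize, gold):
    """Fraction of (token, lemma) pairs the lemmatizer gets right."""
    correct = sum(1 for token, lemma in gold if lemmatize(token) == lemma)
    return correct / len(gold)

gold = [('arma', 'arma'), ('virumque', 'vir'), ('cano', 'cano')]
identity_baseline = accuracy(lambda token: token, gold)
print(round(identity_baseline, 2))  # 0.67: identity gets 2 of 3 right
```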
- The most significant increases in accuracy appear to come from the ContextLemmatizer, which is based on training data. Two comments here:
  - Training data for the GSoC project was derived from the Ancient Greek Dependency Treebank (v. 2.1) https://github.com/PerseusDL/treebank_data/tree/master/v2.1. The Latin data consists of around 5,000 sentences. Experiments throughout the summer (and research by others) suggest that more training data will lead to improved results. This data will be "expensive" to produce, but I am confident it will lead to higher accuracy. There are other large, tagged sets available, and testing will continue with those in upcoming months. The AGDT data also has some inconsistencies, e.g. inconsistent lemma tagging of punctuation. I would like to work with the Perseus team to bring this data increasingly closer to a 'gold standard' dataset for applications such as this.
  - The NLTK ContextTagger uses look-behind ngrams to create context. The nature of Latin and Greek as "free" word-order languages suggests that it may be worthwhile to think about, and write code for, generating different contexts. Skipgram context is one idea that I will pursue in upcoming months.
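A skipgram context might be generated as follows, taking the context for a token to be any n-1 of the preceding tokens within a window that allows up to k skips; this is one hypothetical formulation among several possible:

```python
# Hypothetical skipgram-context generator: unlike a strict look-behind
# ngram, the context may skip over up to k intervening tokens,
# preserving order. One possible formulation among many.
from itertools import combinations

def skipgram_contexts(tokens, n=2, k=1):
    window = (n - 1) + k  # how far back to look
    for i, token in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        for ctx in combinations(left, n - 1):
            yield (ctx, token)

pairs = list(skipgram_contexts(['arma', 'virumque', 'cano']))
print(pairs)  # 'cano' gets both ('arma',) and ('virumque',) as contexts
```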
- More model/pattern information will only improve accuracy, e.g. more 'endings' patterns for the RegexLemmatizer and a more complete principal parts list for the principal parts lemmatizer. The original dictionary model, currently included at the end of the LazyLatinLemmatizer, could also be revised and augmented.
- Continued testing of the lemmatizer with smaller, localized selections will help to isolate edge cases and exceptions. The RomanNumeralLemmatizer, for example, was written to handle a type of token that, as an edge case, was lowering accuracy.
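A Roman-numeral edge case can be caught with a single pattern. This sketch only illustrates the idea; the actual RomanNumeralLemmatizer's behavior and choice of lemma may differ:

```python
# Hypothetical sketch of a Roman-numeral edge-case handler; the real
# RomanNumeralLemmatizer lives in the CLTK lemmatize package and may
# behave differently.
import re

ROMAN = re.compile(r'^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

def roman_numeral_lemma(token, default=None):
    """Tag Roman numerals with a single placeholder lemma."""
    if token and ROMAN.match(token.upper()):
        return 'NUM'
    return default

print(roman_numeral_lemma('XIV'))   # NUM
print(roman_numeral_lemma('arma'))  # None
```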
- The combination context/POS lemmatizer is very basic at the moment, but it has enormous potential for increasing accuracy on a notoriously difficult lemmatization problem, i.e. ambiguous forms. The current version (including the corresponding training data) is only set to resolve one ambiguous case, namely 'cum1' (prep.) versus 'cum2' (conj.). Two comments:
  - More testing is needed to determine the accuracy (as well as the precision and recall) of this lemmatizer in distinguishing between the two forms of 'cum1/2'. The current version only uses bigram POS data, but (see above) different contexts may yield better results as well.
  - More ambiguous cases should be introduced to the training data and tested like 'cum1/2'. The use of Morpheus numbers in the AGDT data should assist with this.
This was an incredible project to work on after several years of philological and literary-critical graduate work, and as I finished my PhD in classics at Fordham University. I improved my skills in, and learned a great deal about, object-oriented programming, unit testing, version control, and open-source development infrastructure such as TravisCI, ZenHub, and Codecov, among other things.
I want to thank the following people: my mentors Kyle P. Johnson and James Tauber, who have set an excellent example of what the future of philology will look like, namely open source/access and community-developed, while rooted in the highest standards of both software development and traditional scholarship; the rest of the CLTK development community; my team at the Institute for the Study of the Ancient World, for supporting this work during my first months there; Matthew McGowan, my dissertation advisor, for supporting both my traditional and digital work throughout my time at Fordham; the Tufts/Perseus/Leipzig DH/Classics team (the roots of this project come from working with them at various workshops in recent years, and they first made the case to me for what could be accomplished through humanities computing); Neil Coffee and the DCA; the NLTK development team; Google, for supporting an open-source, digital humanities coding project with Summer of Code; and, of course, the #DigiClass world of Twitter, for proving to me that there is an enthusiastic audience out there that wants to 'break' classical texts, study them, and put them back together in various ways to learn more about them. Better lemmatization is a desideratum, and my motivation comes from wanting to help the community fill this need. --PJB