This is the final report on the work done for the Greek Morphological and Spelling Dictionary GSOC19 project (https://github.com/eellak/gsoc2019-greek-morpho).
A morphological dictionary is a linguistic resource that, in addition to surface forms, includes information such as lemma, part of speech (POS), gender and number. The main goal of this GSOC project was to create an open source one using information extracted from el.wiktionary.org.
All of the work can be found in this repository (https://github.com/eellak/gsoc2019-greek-morpho); the code was written from scratch.
- An SQL database (https://github.com/eellak/gsoc2019-greek-morpho/tree/master/data/morph-dict-v0.2.zip) containing the following data:
  - A morphological dictionary containing about 900,000 entries, with 518,000 distinct surface forms, with information described according to Universal Dependencies
  - Definitions for most lemmas
  - Etymologies for most lemmas
  - 18,500 synonyms, 12,500 of which are for Greek
  - 5,500 antonyms, 4,300 of which are for Greek
  - 3,310 normalizations of words
  - Almost 150,000 translations
- A spelling dictionary with 1,047,200 words, up from the 828,807 of the previous dictionary used in open source programs. The dictionary also includes frequencies for all words. (https://github.com/eellak/gsoc2019-greek-morpho/tree/master/data) It will be integrated into the spelling dictionaries of Firefox and Thunderbird.
Possible use cases of the morphological dictionary are:
- Creation of a dictionary-based lemmatizer
- Creation of a dictionary-based POS tagger
- Creation of a thesaurus for various office editors
- Usage of the translations list for machine translation software
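
As a rough illustration of the first two use cases, here is a minimal sketch of a dictionary-based lookup that returns the lemma and POS tag for a surface form. The table name `words`, its columns `form`, `lemma` and `pos`, and the sample rows are all assumptions made for this example; the database shipped in `morph-dict-v0.2.zip` may be organized differently, so the query would need to be adapted to the real schema.

```python
import sqlite3

# Hypothetical schema and made-up sample rows, for illustration only.
# The real database may use different table and column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (form TEXT, lemma TEXT, pos TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    [
        ("γάτες", "γάτα", "NOUN"),
        ("γάτα", "γάτα", "NOUN"),
        ("τρέχει", "τρέχω", "VERB"),
    ],
)


def analyze(form):
    """Return all (lemma, pos) pairs recorded for a surface form."""
    rows = conn.execute(
        "SELECT DISTINCT lemma, pos FROM words "
        "WHERE form = ? ORDER BY lemma, pos",
        (form,),
    )
    return rows.fetchall()


def lemmatize(tokens):
    """Map each token to its first recorded lemma; unknown tokens pass through."""
    result = []
    for token in tokens:
        analyses = analyze(token)
        result.append(analyses[0][0] if analyses else token)
    return result
```

Picking the first analysis is the simplest possible policy. A surface form that maps to several lemmas or POS tags would need context to disambiguate, which a pure dictionary lookup cannot provide.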
Documentation about the installation and running of the scripts can be found here.
The morphological dictionary in this repo is built from information found in el.wiktionary.org. The best way to contribute is therefore to add inflection tables in el.wiktionary.org for lemmas that don't have one.
You can learn about the articles' structure here and the list of inflection templates here.
The spelling dictionary contains words extracted from various text sources, which may themselves contain spelling errors. While work has been done to minimize them, some likely remain. A future task is to double-check the newly included words.
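
One possible way to prioritize that double-check is to use the word frequencies and review the rarest new words first. The sketch below assumes that rare words are more likely to be misspellings, which is only a heuristic. The function name, the in-memory data layout (a word-to-count mapping plus a set of previously known words) and the sample values are all hypothetical and do not reflect the actual file format in the repository.

```python
def review_candidates(frequencies, known_words, max_count=2):
    """Return new words seen at most max_count times, rarest first.

    frequencies: dict mapping word -> occurrence count (hypothetical layout)
    known_words: set of words already present in the previous dictionary
    """
    candidates = [
        (word, count)
        for word, count in frequencies.items()
        if word not in known_words and count <= max_count
    ]
    # Sort by count, then alphabetically, so the output is deterministic.
    return sorted(candidates, key=lambda item: (item[1], item[0]))


# Made-up sample data for illustration.
sample_frequencies = {"καλημέρα": 950, "καλημερα": 1, "σπίτι": 1200, "σπιτιι": 2}
sample_known = {"καλημέρα", "σπίτι"}
to_review = review_candidates(sample_frequencies, sample_known)
```

The threshold `max_count` is arbitrary here. In practice it would have to be tuned against the size of the source corpus.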
@kagiannis Where can I find the definitions of the tags assigned to the `greek_pos` fields? As a reminder, here are some example values: