@kagiannis
Last active January 3, 2024 09:44
Final evaluation for GSOC19 creation of Greek Morphological Dictionary

This is the final report on the work done as part of the Greek Morphological and Spelling Dictionary GSoC 2019 project (https://github.com/eellak/gsoc2019-greek-morpho).

Abstract

A morphological dictionary is a linguistic resource that, apart from surface forms, also includes information such as lemma, part of speech (POS), gender, and number. The main goal of this GSoC project was to create an open-source one using information extracted from el.wiktionary.org.

Work and repository

All of the work can be found in this repository (https://github.com/eellak/gsoc2019-greek-morpho); the code was written from scratch.

Deliverables

  1. A morphological dictionary containing about 900,000 entries and 518,000 distinct surface forms, with morphological information described according to Universal Dependencies.
  2. Definitions for most lemmas
  3. Etymologies for most lemmas
  4. 18,500 synonyms, 12,500 of which are for Greek
  5. 5,500 antonyms, 4,300 of which are for Greek
  6. 3,310 normalizations of words
  7. Almost 150,000 translations
  • A spelling dictionary with 1,047,200 words, up from the 828,807 of the previous dictionary used in open-source programs. The dictionary also includes frequencies for all words (https://github.com/eellak/gsoc2019-greek-morpho/tree/master/data). It will be integrated into the spelling dictionaries of Firefox and Thunderbird.

Use cases

Possible use cases of the morphological dictionary are:

  • Creation of a dictionary-based lemmatizer
  • Creation of a dictionary-based POS tagger
  • Creation of a thesaurus for various office editors
  • Usage of the translations list in machine translation software
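
To make the first two use cases concrete, here is a minimal sketch of a dictionary-based lemmatizer and POS tagger. It assumes each dictionary entry can be reduced to a (surface form, lemma, UD POS tag) triple; the sample rows and the function names `build_index` and `analyze` are hypothetical and do not reflect the repository's actual file layout or API.

```python
from collections import defaultdict

# Hypothetical rows: (surface form, lemma, UD POS tag). The real dictionary's
# file layout may differ; these rows are illustrative only.
SAMPLE_ROWS = [
    ("γάτες", "γάτα", "NOUN"),
    ("γάτα", "γάτα", "NOUN"),
    ("τρέχει", "τρέχω", "VERB"),
]


def build_index(rows):
    """Map each lowercased surface form to the set of (lemma, POS) analyses."""
    index = defaultdict(set)
    for form, lemma, pos in rows:
        index[form.lower()].add((lemma, pos))
    return index


def analyze(index, token):
    """Return all known (lemma, POS) analyses, or an empty list if unknown."""
    return sorted(index.get(token.lower(), ()))


index = build_index(SAMPLE_ROWS)
print(analyze(index, "γάτες"))
```

A pure lookup like this returns every analysis recorded for a form, so ambiguous forms still need context-based disambiguation, and words missing from the dictionary yield no analysis at all.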

Documentation

Documentation on installing and running the scripts can be found here.

Future work

Database

The morphological dictionary in this repository contains information extracted from el.wiktionary.org. The best way to contribute is therefore to add inflection tables on el.wiktionary.org for lemmas that lack one. You can learn about the articles' structure here and find the list of inflection templates here.

Spelling Dictionaries

The spelling dictionary contains words extracted from various text sources, which may themselves contain spelling errors. Although work has been done to minimize them, some probably remain. A future task is to double-check the newly included words.
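
One possible way to prioritize that review is to use the word frequencies shipped with the dictionary and look at the rarest words first. The sketch below assumes a hypothetical `word frequency` layout, one pair per line; the actual data files may be structured differently, and the sample lines and the `review_candidates` helper are illustrative only.

```python
# Hypothetical sample in an assumed "word frequency" layout (one pair per line).
SAMPLE_LINES = [
    "και 1000",
    "καλημέρα 50",
    "καλημερα 1",
]


def review_candidates(lines, max_frequency=1):
    """Return (word, frequency) pairs at or below max_frequency, rarest first."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip lines that do not match the assumed layout
        word, freq = parts[0], int(parts[1])
        if freq <= max_frequency:
            entries.append((word, freq))
    return sorted(entries, key=lambda entry: (entry[1], entry[0]))


print(review_candidates(SAMPLE_LINES))
```

A low frequency only hints at a possible error: rare but valid words would also be listed, so the output is a review queue rather than a list of mistakes.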

ghost commented Jan 3, 2024

@kagiannis Where can I find the definitions of the tags assigned to the greek_pos fields?

As a reminder, here are some example values:

PROST_ENEST_B_ENIKO
PROST_AOR_B_PL
PROST_AOR_B_ENIKO
PARATATIKOS_G_PL
