Final Report of 3GM GSOC-2019

This is the final report of the work done as part of the 3gm GSoC-2019 project (https://github.com/eellak/gsoc2019-3gm).

Abstract

This project aims to enhance the NLP capabilities of the 3gm project, which was developed during GSoC-2018 on behalf of Greek FOSS. The main goals for GSoC-2019 are populating the database with more types of amendments, widening the range of feature extraction, and training a new Doc2Vec model and a new NER annotator specifically for our corpus.

Work and Repository

Migrating Data

As part of the first week of GSoC-2019, we carried out a data migration project. In its scope we had to mine the website of the Greek National Printing House and upload as many GGG (Greek Government Gazette) issues as possible to the respective Internet Archive collection. So far, 87,874 issues have been uploaded, in addition to the ~45,000 files that the collection contained initially, and this number will continue to grow. The main goal of this endeavour is to make the Greek legislation archive more accessible.
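For illustration, the upload step can be scripted with the internetarchive Python library. The sketch below is a minimal, hypothetical example: the item identifier, file name and collection id are placeholders, not the exact values used by our mining scripts.

```python
# A minimal sketch of uploading a mined GGG issue to an Internet Archive
# collection with the `internetarchive` library. Identifier, file name
# and metadata values are hypothetical placeholders.
from internetarchive import upload

item_identifier = "GreekGovernmentGazette-example-issue"  # hypothetical
responses = upload(
    item_identifier,
    files=["issue.pdf"],  # the PDF mined from the National Printing House
    metadata={
        "collection": "greekgovernmentgazette",  # assumed collection id
        "mediatype": "texts",
        "language": "gre",
    },
)
print([r.status_code for r in responses])  # 200 on success
```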

We tried to document our insights from this process. We would like to evolve this into an entry in the project wiki, titled "A simple guide to mining the Greek Government Gazette".

Named Entity Recognition model

After uploading a major part of the Greek Government Gazette issues, including all primary-type issues, it was time to start building a dataset to train a new NER tagger based on the Greek spaCy model. To do that, an annotation tool is necessary. A tool that is fully compatible with spaCy is Prodigy. We contacted its developers, who provided a research license for the duration of the project.

To mine, prepare and annotate our data we followed this workflow, along with the annotation guidelines described here.

As a result of this process we have created a dataset containing around 3,000 sentences. A first version of this dataset can be found in the project's data folder. We have also deployed the Prodigy annotator to showcase our progress; in case you want to support this year's project, you can find it here. All annotations gathered will be used for model training after quality control.
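For reference, Prodigy exports annotations as JSONL, which can be converted into spaCy's training format with a few lines of Python. The sketch below follows Prodigy's export field names; the file name is a hypothetical placeholder.

```python
# A minimal sketch: convert Prodigy-style JSONL annotations into spaCy's
# (text, {"entities": [...]}) training tuples.
import json

TRAIN_DATA = []
with open("ner_annotations.jsonl", encoding="utf-8") as f:  # hypothetical file
    for line in f:
        example = json.loads(line)
        if example.get("answer") != "accept":  # keep accepted annotations only
            continue
        entities = [
            (span["start"], span["end"], span["label"])
            for span in example.get("spans", [])
        ]
        TRAIN_DATA.append((example["text"], {"entities": entities}))

print(f"{len(TRAIN_DATA)} accepted training examples")
```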

After obtaining a large enough dataset to train our models, we trained the small and medium-sized Greek spaCy models using the Prodigy training recipes. The models showed significant improvement after training. A version of the small NER model that we trained can be found in the data directory of this repo. Our goal now is to optimize the model and evaluate it properly. As a first step in this process we will use Prodigy's train-curve recipe to see how the model performs when trained with different portions of our data. Finally, we will develop a Python script to train the spaCy model, document all its metrics and tune its hyperparameters. This process is documented in this report.
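As a rough illustration of what the training recipes do internally, the sketch below fine-tunes the Greek spaCy NER pipe with the spaCy 2.x API. The example annotation, the entity label and the output path are hypothetical.

```python
# A rough sketch of fine-tuning the Greek spaCy NER pipe, approximating
# what the Prodigy training recipes do internally (spaCy 2.x API).
import random
import spacy

# a single illustrative example; in practice this is the TRAIN_DATA list
# built from the annotations above ("LAW" is a hypothetical label)
TRAIN_DATA = [
    ("Ο νόμος 4172/2013 τροποποιείται.", {"entities": [(2, 17, "LAW")]}),
]

nlp = spacy.load("el_core_news_sm")
ner = nlp.get_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)  # make sure every label is known to the pipe

# disable the other pipes so only the NER weights are updated
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()  # fine-tune the pretrained weights
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer,
                       drop=0.35, losses=losses)
        print(epoch, losses)

nlp.to_disk("el_ner_ggg")  # hypothetical output directory
```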

The final version of the NER model is located in the models folder, alongside a word-embedding model containing around 20,000 word vectors. You can find examples of use in the related wiki page.
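A minimal usage sketch, assuming the model directory path; the example sentence is illustrative:

```python
# Load the trained NER model and tag a sentence.
import spacy

nlp = spacy.load("models/el_ner_ggg")  # assumed path inside the repo
doc = nlp("Ο νόμος 4172/2013 τροποποιείται ως εξής.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```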

Broadening fact extraction

During this year's GSoC we focused heavily on enhancing the NLP capabilities of the project.

As part of this procedure it is vital to broaden fact extraction in the project. Using regular expressions, we will work on the entities file, aiming to make it possible for the app to identify useful information such as metrics, codes, IDs, contact info, etc.

We have created a script to test regular expressions for fact extraction. Unfortunately, there is very little consistency in how information is written between issues, and this results in difficulties in entity extraction.
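To give a flavour of the approach, the sketch below shows simplified versions of such patterns; they are illustrative examples, not the exact expressions used in the entities module.

```python
# Simplified, hypothetical regular expressions of the kind we test for
# fact extraction on GGG text.
import re

PATTERNS = {
    "law_reference": re.compile(r"ν\.\s?\d{3,4}/\d{4}"),      # e.g. ν. 4172/2013
    "phone": re.compile(r"\b\d{10}\b"),                        # 10-digit phone number
    "amount": re.compile(r"\d+(?:\.\d{3})*(?:,\d+)?\s?ευρώ"),  # monetary amounts
}

text = "Σύμφωνα με τον ν. 4172/2013 το ποσό ορίζεται σε 1.500,00 ευρώ."
for name, pattern in PATTERNS.items():
    for match in pattern.finditer(text):
        print(name, "→", match.group())
```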

After optimizing the extraction queries we integrated them into the entities module, which can be found in the 3gm directory. We now have to use the regular expressions to extract entities in the pparser module, the module responsible for extracting amendments, laws and ratifications.

You can find examples of fact extraction in the [project wiki](https://github.com/eellak/gsoc2018-3gm/wiki/Fact-Extraction).

Training a new Doc2Vec model

We will train a new Doc2Vec model using the gensim library, following the workflow proposed in the project wiki. We will use the codifier to create a large corpus and subsequently train the gensim model on it. To make sure that the model is efficient, we have to create a corpus of several thousand issues and then fine-tune the model's hyperparameters.
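A minimal sketch of this workflow with gensim is shown below; the corpus file name, output path and hyperparameters are assumptions for illustration.

```python
# Build TaggedDocument objects from the codified corpus and train a
# gensim Doc2Vec model; file names and hyperparameters are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = []
with open("corpus.txt", encoding="utf-8") as f:  # one law per line (assumed)
    for i, line in enumerate(f):
        documents.append(TaggedDocument(words=line.split(), tags=[i]))

model = Doc2Vec(documents, vector_size=100, window=5,
                min_count=2, workers=4, epochs=20)
model.save("models/doc2vec.model")  # hypothetical path

# infer a vector for unseen text and look up the most similar documents
vec = model.infer_vector("τροποποίηση του νόμου".split())
print(model.docvecs.most_similar([vec], topn=3))
```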

For the time being we have created a corpus file containing 2,878 laws and presidential decrees, totaling around 223 MB. We have also trained a Doc2Vec model that can be found in the models directory. Our goal is to create as large a corpus as possible, which is why we will continue to expand it.

Creating a natural language model

Even though it was not included in the initial project proposal, we also decided to create a natural language model that generates text, aiming to make use of the word vectors we had produced earlier with Prodigy. To achieve this we will deploy transfer-learning techniques.

Our approach involves training a character-level LSTM model on a corpus of GGG texts. The idea is to use the embeddings we produced in an embedding layer and stack the LSTM model on top of it. To train the model we are using Google Colab with TPU acceleration, on a variation of this notebook provided by the TensorFlow Hub authors.
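A rough sketch of such an architecture in tf.keras is shown below; the vocabulary size, embedding dimension and layer sizes are illustrative assumptions, and in practice the embedding layer would be initialised from our pretrained vectors.

```python
# A rough sketch of a character-level generator: an embedding layer
# followed by an LSTM that predicts logits over the next character.
import tensorflow as tf

VOCAB_SIZE = 120   # distinct characters in the GGG corpus (assumed)
EMBED_DIM = 64     # embedding size (illustrative)
LSTM_UNITS = 512   # illustrative

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_shape=(None,)),
    tf.keras.layers.LSTM(LSTM_UNITS, return_sequences=True),
    tf.keras.layers.Dense(VOCAB_SIZE),  # logits over the next character
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.summary()
```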

You can find examples of how to generate text in the project wiki.

Deliverables

The deliverables for GSoC-2019 include:

  1. An expanded version of the Internet Archive collection containing a total of 134,113 issues from several issue types.
  2. A new Named Entity Recognition model trained exclusively on Greek Government Gazette texts.
  3. An expanded entities.py module with broadened fact extraction functionality.
  4. A new Doc2Vec model containing around 3,000 document vectors.
  5. A natural language model that produces legislation text.

Future Work

Future work for the project, along with previous issues, can be found here. Issues created during GSoC-2019 are labeled with a relevant tag.
