Documentation for the final evaluation of GSoC 2021

Sedova Anastasiia

Google Summer of Code 2021

Organisation: DeepPavlov

Git: Relation Extraction

Project page: Relation Extraction

Code Contribution: Pull Request

What was the task?

My major task was to implement a relation extraction (RE) module for both English and Russian. The main steps were:

  1. design and develop the neural network for relation extraction
  2. design the relation extraction pipeline and incorporate it into the DeepPavlov framework
  3. choose the datasets the relation extraction models will be trained on
  4. train reliable RE models for both languages and test them on different examples.

What has been done?

During the GSoC 2021 project all the tasks listed above were accomplished, namely:

  1. a new relation extraction model was designed based on the ATLOP model (more details are here)
  2. the relation extraction pipeline was implemented in the DeepPavlov framework (and here is a pull request)
  3. the English RE model was trained on the DocRED corpus and the Russian RE model on the RuRED corpus. The datasets were preprocessed and extended with additional negative samples needed to train a reliable RE model (more details are here)
  4. multiple RE models were trained with different parameters; the best one was uploaded and can now be used to try out the relation extraction component (here are working examples for English and Russian)

The pull request is now open and is waiting to be merged into the DeepPavlov library.

Working example: English RE

Suppose we have the sentence "Barack Obama is married to Michelle Obama, born Michelle Robinson." and want to find the relation between Barack Obama and Michelle Obama/Michelle Robinson. First, we download the pretrained English RE model:

python3.6 -m deeppavlov download re_docred

... and call it on our test sample:

from deeppavlov import configs, build_model
re_model = build_model(configs.relation_extraction.re_docred, download=False)

sentence_tokens = [["Barack", "Obama", "is", "married", "to", "Michelle", "Obama", ",", "born", "Michelle", "Robinson", "."]]
entity_pos = [[[(0, 2)], [(5, 7), (9, 11)]]]  # (start, end) token spans of each entity's mentions
entity_tags = [["PER", "PER"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
>> [['P26'], ['spouse']]

As a result, the English RE model detects the relation "spouse" and returns it together with the corresponding Wikidata id ("P26").

If you wish to train the English RE model from scratch, you can do so with the following command:

python3.6 -m deeppavlov train re_docred

Working example: Russian RE

Suppose we have the sentence "Илон Маск живет в Сиэттле." ("Elon Musk lives in Seattle.") and want to find the relation between Илон Маск and Сиэттл. First, we download the pretrained Russian RE model:

python3.6 -m deeppavlov download re_rured

... and call it on our test sample:

from deeppavlov import configs, build_model
re_model = build_model(configs.relation_extraction.re_rured, download=False)

sentence_tokens = [["Илон", "Маск", "живет", "в", "Сиэттле", "."]]
entity_pos = [[[(0, 2)], [(4, 6)]]]
entity_tags = [["PERSON", "CITY"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
>> [['P551'], ['место жительства']]

As a result, the Russian RE model detects the relation "место жительства" ("place of residence") and returns it together with the corresponding Wikidata id ("P551").

If you wish to train the Russian RE model from scratch, you can do so with the following command:

python3.6 -m deeppavlov train re_rured

What is Relation Extraction?

Relation extraction is a subtask of information extraction that deals with finding and classifying the semantic relations that hold between entities in unstructured text. Its main practical applications are building fact databases and augmenting existing ones. An extensive collection of relational triples (i.e. two entities and the relation that holds between them, e.g. (Barack Obama, spouse, Michelle Obama)) can be converted into a structured database of facts about the real world of much better quality than manually created ones. In most conventional applications, the text entities between which a relation holds correspond to named entities or to underlying entities obtained with coreference resolution.

Among the great variety of relation extraction approaches, we decided to implement:

  • supervised relation extraction, as, to the best of our knowledge, it is the most common and best-performing approach today;
  • document-level relation extraction, as it allows extracting a greater number of reliable relations from a document consisting of multiple sentences, which brings it closer to real-life applications.

What challenges does (document-level) Relation Extraction face?

  • negative samples: an important part of preparing data for relation extraction model training is generating negative training samples, i.e. samples where no relation holds between the entities. Tuning the number of negative samples is usually done experimentally and often turns out to be a sophisticated task that requires additional attention;
  • multi-entity problem: in document-level relation extraction one document may contain multiple entity pairs between which different relations hold. Ideally, a stable relation extraction system should be able to detect and classify all of them at once. Moreover, the same entity may be mentioned in different ways across the text (e.g. "John Smith", "Mr. Smith", "John", etc.);
  • multi-label problem: one entity pair can occur many times in the document associated with different relations, in contrast to one relation per entity pair in sentence-level RE. For example, the entities "John Smith" and "New York" may express the relations "place of birth" and "place of death" at the same time if some John Smith happened to be born and die in the same city, New York.

The number of negative samples was tuned in a series of experiments. The last two challenges we overcame with a specific model architecture, which is described in the next section.

What does the RE model look like?

The developed RE model is based on ATLOP (Adaptive Thresholding and Localized Context Pooling). Its two core ideas are an adaptive threshold and localized context pooling.

  • Adaptive Threshold. The usual global threshold for converting the RE classifier's output probabilities into relation labels is replaced with a learnable one. A new threshold class that learns an entity-pair-dependent threshold value is introduced and trained like all other classes. During prediction, the positive classes (= relations that actually hold in the sample) are the classes with logits higher than the threshold class; all other classes are negative (= these relations do not hold between the given entities).
  • Localized Context Pooling. The embedding of each entity pair is enhanced with an additional local context embedding related to both entities. Such a representation, attending to the context of the document relevant to this particular entity pair, is useful for deciding its relation. The context information is derived directly from the attention heads of the encoder (both ideas are sketched in code right after this list).
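A minimal sketch of both ideas, assuming a hypothetical logits matrix whose column 0 is the threshold class and per-entity attention distributions taken from the encoder (the function names, shapes and index conventions are illustrative and do not mirror the actual DeepPavlov implementation):

import numpy as np

# Adaptive thresholding (illustrative): column 0 of `logits` plays the role of
# the learnable threshold class TH; the remaining columns are relation classes.
# A relation is predicted only if its logit exceeds the TH logit for that pair.
def predict_with_adaptive_threshold(logits):
    th_logit = logits[:, :1]                     # per-pair learned threshold
    positive = logits[:, 1:] > th_logit          # relations above the threshold
    return [np.nonzero(row)[0].tolist() for row in positive]

# Localized context pooling (illustrative): combine the attention that both
# entities pay to the document tokens and pool a pair-specific context vector.
def local_context_embedding(token_states, attn_subj, attn_obj):
    a = attn_subj * attn_obj                     # tokens important to both entities
    a = a / (a.sum() + 1e-12)                    # normalise to a distribution
    return token_states.T @ a                    # (hidden,) pair-specific context

logits = np.array([[0.2, 1.5, -0.3],             # pair 0: relation 0 is above TH
                   [0.9, 0.1, 0.5]])             # pair 1: no relation above TH
print(predict_with_adaptive_threshold(logits))   # [[0], []]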

Apart from a number of adjustments we made to the original ATLOP model, we also changed its input. While the ATLOP input consists of tokenized samples and entity positions, we added the NER tags of all entities as an additional input. Thus, the input of our RE model is the following: 1) the text document as a list of tokens, 2) a list of entity positions (i.e. the start and end positions of all mentions of both entities), 3) a list of NER tags of both entities. For encoding the input text we used the uncased BERT base model.

The model output is one or several relations found between the given entities. For the English RE model, the output consists of the Wikidata relation id and the English relation name. For the Russian RE model, it is the corresponding Russian relation name (or the English one if no Russian name is available) and, if applicable, its Wikidata id.

Full list of 97 English relations & relational statistics in train, valid and test data
Relation Relation id # samples (train) # samples (valid) # samples (test)
head of government P6 235 11 7
country P17 11189 288 275
place of birth P19 616 25 15
place of death P20 242 3 8
father P22 314 9 9
mother P25 86 0 3
spouse P26 365 8 12
country of citizenship P27 3321 95 74
continent P30 428 29 20
instance of P31 140 5 6
head of state P35 181 3 5
capital P36 106 0 6
official language P37 153 6 7
position held P39 29 0 2
child P40 417 11 13
author P50 389 15 9
member of sports team P54 533 10 2
director P57 316 9 11
screenwriter P58 184 5 2
educated at P69 375 22 10
composer P86 126 6 4
member of political party P102 480 15 9
employer P108 235 10 5
founded by P112 123 3 1
league P118 234 5 2
publisher P123 226 9 6
owned by P127 265 13 6
located in the administrative territorial entity P131 5164 117 161
genre P136 124 1 0
operator P137 113 18 6
religion P140 214 2 10
contains administrative territorial entity P150 2467 76 77
follows P155 240 11 6
followed by P156 229 10 4
headquarters location P159 335 9 6
cast member P161 815 19 13
producer P162 156 6 7
award received P166 228 6 4
creator P170 259 8 4
parent taxon P171 91 1 0
ethnic group P172 106 1 1
performer P175 1337 31 25
manufacturer P176 122 1 0
developer P178 306 2 5
series P179 194 7 6
sister city P190 6 0 0
legislative body P194 215 5 2
basin country P205 112 5 0
located in or next to body of water P206 267 6 4
military branch P241 144 2 5
record label P264 783 19 29
production company P272 107 6 5
location P276 231 3 12
subclass of P279 101 1 11
subsidiary P355 115 2 5
part of P361 753 15 22
original language of work P364 87 5 4
platform P400 358 7 8
mouth of the watercourse P403 125 7 1
original network P449 184 2 5
member of P463 512 13 2
chairperson P488 78 2 4
country of origin P495 721 24 10
has part P527 791 10 8
residence P551 39 2 0
date of birth P569 1318 37 31
date of death P570 998 34 26
inception P571 602 14 15
dissolved, abolished or demolished P576 113 2 3
publication date P577 1482 27 43
start time P580 139 1 2
end time P582 73 1 0
point in time P585 132 2 1
conflict P607 371 17 6
characters P674 223 6 8
lyrics by P676 44 0 0
located on terrain feature P706 184 8 5
participant P710 237 3 8
influenced by P737 17 1 1
location of formation P740 74 3 0
parent organization P749 127 2 3
notable work P800 198 7 1
separated from P807 4 0 0
narrative location P840 56 4 3
work location P937 118 3 5
applies to jurisdiction P1001 368 7 6
product or material produced P1056 45 0 0
unemployment rate P1198 3 0 0
territory claimed by P1336 42 1 0
participant of P1344 272 5 5
replaces P1365 27 0 1
replaced by P1366 44 1 1
capital of P1376 93 0 4
languages spoken, written or signed P1412 192 6 4
present in work P1441 401 6 8
sibling P3373 453 3 13
Full list of 30 Russian relations & relational statistics in train, valid and test data
Relation Relation id Russian relation # samples (train) # samples (valid) # samples (test)
MEMBER P710 участник 104 20 9
WORKS_AS P106 род занятий 962 126 121
WORKPLACE 932 93 119
OWNERSHIP P1830 владеет 784 107 99
SUBORDINATE_OF - - 30 4 3
TAKES_PLACE_IN P276 местонахождение 66 6 7
EVENT_TAKES_PART_IN P1344 участвовал в 143 22 12
SELLS_TO - - 323 37 44
ALTERNATIVE_NAME - - 137 19 12
HEADQUARTERED_IN P159 расположение штаб-квартиры 451 69 61
PRODUCES P1056 продукция 42 4 6
ABBREVIATION - - 92 5 11
DATE_DEFUNCT_IN P576 дата прекращения существования 4 - -
SUBEVENT_OF P361 часть от 10 1 1
DATE_FOUNDED_IN P571 дата основания/создания/возникновения 16 1 2
DATE_TAKES_PLACE_ON P585 момент времени 40 4 5
NUMBER_OF_EMPLOYEES_FIRED - - 12 3 1
ORIGINS_FROM P495 страна происхождения 43 6 5
ACQUINTANCE_OF - - 3 - -
PARENT_OF P40 дети 14 1 2
ORGANIZES P664 организатор 46 8 7
FOUNDED_BY P112 основатель 13 - 3
PLACE_RESIDES_IN P551 место жительства 8 1 2
BORN_IN P19 место рождения 1 - -
AGE_IS - - 1 1 1
RELATIVE - - 2 - -
NUMBER_OF_EMPLOYEES P1128 число сотрудников 4 - 2
SIBLING P3373 брат/сестра 1 - 1
DATE_OF_BIRTH P569 дата рождения 1 - -

What about the training data?

Training corpora

The RE model for English was trained on the DocRED corpus. It was constructed from Wikipedia and Wikidata and is currently the largest human-annotated English dataset for document-level RE from plain text.

Some details about DocRED corpus

Here are the statistics of the train, valid and test sets we used for training:

Train Valid Test
130650 3406 3545
Train Positive Train Negative Valid Positive Valid Negative Test Positive Test Negative
44823 89214 1239 1229 1043 1036

As the original DocRED test set contains only unlabeled data, while labeled data is needed for evaluation, we decided to:

  1. merge the train and dev data (= the labeled data)
  2. split them into new train, dev and test datasets.

The current implementation allows splitting the data in two ways:

  • the user can set the relative size of the dev and test data (e.g. 1/7)
  • the user can set the absolute size of the dev and test data (e.g. 2000 samples)

For the final model training I set the absolute size of the dev and test data to 150 original documents each, which resulted in approximately 3500 samples in each set.
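A minimal sketch of such a split over the merged labeled documents (the function name and the dev_test_size/relative parameters are illustrative assumptions, not the actual DeepPavlov configuration keys):

import random

def split_docred(documents, dev_test_size=150, relative=None, seed=42):
    """Split the merged labeled DocRED documents into new train/dev/test sets.

    dev_test_size: absolute number of documents for each of dev and test;
    relative:      if given (e.g. 1/7), a fraction used instead of dev_test_size.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n = int(len(docs) * relative) if relative is not None else dev_test_size
    dev, test, train = docs[:n], docs[n:2 * n], docs[2 * n:]
    return train, dev, test

# e.g. 150 documents each for dev and test, the rest for training:
# train_docs, dev_docs, test_docs = split_docred(merged_docs, dev_test_size=150)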

As for the negative samples, the best result was obtained with the following proportions (a sketch of how such negatives can be generated follows the list):

  • for the train set: twice as many negative samples as positive ones
  • for the dev & test sets: as many negative samples as positive ones
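A minimal sketch of negative sample generation with these proportions (purely illustrative; the doc["labels"] / doc["entities"] field names and the NO_RELATION label are assumptions, not necessarily those used by the DeepPavlov data reader):

import random
from itertools import permutations

def add_negative_samples(docs, neg_per_pos=2, seed=42):
    """Add entity pairs with no annotated relation as negative samples.

    neg_per_pos: 2 for the train set, 1 for the dev and test sets.
    """
    rng = random.Random(seed)
    samples = []
    for doc in docs:
        positive_pairs = {(h, t) for h, t, _ in doc["labels"]}
        samples += [(doc, h, t, rel) for h, t, rel in doc["labels"]]
        # all ordered entity pairs that carry no annotated relation
        candidates = [p for p in permutations(range(len(doc["entities"])), 2)
                      if p not in positive_pairs]
        n_neg = min(len(candidates), neg_per_pos * len(positive_pairs))
        for h, t in rng.sample(candidates, n_neg):
            samples.append((doc, h, t, "NO_RELATION"))
    return samples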

The RE model for Russian was trained on the RuRED corpus, which is based on the Lenta.ru news corpus.

Some details about RuRED corpus

For training the Russian RE model, the original RuRED train, valid and test sets were used.

The negative samples were generated in the same proportions as for the English RE model:

  • for the train set: twice as many negative samples as positive ones
  • for the dev & test sets: as many negative samples as positive ones
Train Valid Test
12855 1076 1072
Train Positive Train Negative Valid Positive Valid Negative Test Positive Test Negative
4285 8570 538 538 536 536

NER tags

List of 6 NER tags for English model
# Tag Description
1 PER People, including fictional
2 ORG Companies, universities, institutions, political or religious groups, etc.
3 LOC Geographically defined locations, including mountains, waters, etc. Politically defined locations, including countries, cities, states, streets, etc. Facilities, including buildings, museums, stadiums, hospitals, factories, airports, etc.
4 TIME Absolute or relative dates or periods.
5 NUM Percents, money, quantities
6 MISC Products, including vehicles, weapons, etc. Events, including elections, battles, sporting events, etc. Laws, cases, languages, etc
List of 29 NER tags for Russian model
# NER tag Description
1. WORK_OF_ART name of work of art
2. NORP affiliation
3. GROUP unnamed groups of people and companies
4. LAW law name
5. NATIONALITY names of nationalities
6. EVENT event name
7. DATE date value
8. CURRENCY names of currencies
9. GPE geo-political entity
10. QUANTITY quantity value
11. FAMILY families as a whole
12. ORDINAL ordinal value
13. RELIGION names of religions
14. CITY Names of cities, towns, and villages
15. MONEY money name
16. AGE people's and object's ages
17. LOCATION location name
18. PERCENT percent value
19. BOROUGH Names of sub-city entities
20. PERSON person name
21. REGION Names of sub-country entities
22. COUNTRY Names of countries
23. PROFESSION Professions and people of these professions.
24. ORGANIZATION organization name
25. FAC building name
26. CARDINAL cardinal value
27. PRODUCT product name
28. TIME time value
29. STREET street name