Google Summer of Code 2021
Organisation: DeepPavlov
Git: Relation Extraction
Project page: Relation Extraction
Code Contribution: Pull Request
My major task was to implement a relation extraction (RE) module for both the English and Russian languages. The main steps were:
- design and develop the neural network for relation extraction
- design the relation extraction pipeline and incorporate it into the DeepPavlov framework
- choose the datasets the relation extraction models would be trained on
- train reliable RE models for both languages and test them on different test examples.
During the GSoC 2021 project, all the tasks listed above were accomplished, namely:
- a new relation extraction model was designed based on the ATLOP model (more details are here)
- the relation extraction pipeline was implemented in the DeepPavlov framework (and here is a pull request)
- for training the English RE model, the DocRED corpus was used; for the Russian RE model, RuRED. Both datasets were preprocessed and extended with the additional negative samples needed to train a reliable RE model (more details are here)
- multiple RE models were trained with different parameters; the best one was uploaded online and can now be used to try the relation extraction component (here are working examples for the English and Russian languages)
The Pull Request is now open and waiting to be merged into the DeepPavlov library.
Suppose we have the sentence "Barack Obama is married to Michelle Obama, born Michelle Robinson." The task is to find the relation between Barack Obama and Michelle Obama/Michelle Robinson. First, we download the pretrained English RE model:
python3.6 -m deeppavlov download re_docred
... and call it on our test sample:
from deeppavlov import configs, build_model
re_model = build_model(configs.relation_extraction.re_docred, download=False)
sentence_tokens = [["Barack", "Obama", "is", "married", "to", "Michelle", "Obama", ",", "born", "Michelle", "Robinson", "."]]
# (start, end) token positions of every mention of the two entities
entity_pos = [[[(0, 2)], [(5, 7), (9, 11)]]]
# NER tags of the two entities
entity_tags = [["PER", "PER"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
>> [['P26'], ['spouse']]
As a result, the English RE model detects the relation "spouse" and returns it with the corresponding Wikidata id ("P26").
If you wish to train the English RE model from scratch, it can be done with the following command:
python3.6 -m deeppavlov train re_docred
Suppose we have the sentence "Илон Маск живет в Сиэттле" ("Elon Musk lives in Seattle"). The task is to find the relation between Илон Маск and Сиэттле. First, we download the pretrained Russian RE model:
python3.6 -m deeppavlov download re_rured
... and call it on our test sample:
from deeppavlov import configs, build_model
re_model = build_model(configs.relation_extraction.re_rured, download=False)
sentence_tokens = [["Илон", "Маск", "живет", "в", "Сиэттле", "."]]
# (start, end) token positions of every mention of the two entities
entity_pos = [[[(0, 2)], [(4, 5)]]]
# NER tags of the two entities
entity_tags = [["PERSON", "CITY"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
>> [['P551'], ['место жительства']]
As a result, the Russian RE model detects the relation "место жительства" ("residence") and returns it with the corresponding Wikidata id ("P551").
If you wish to train the Russian RE model from scratch, it can be done with the following command:
python3.6 -m deeppavlov train re_rured
Relation extraction is a subtask of information extraction that deals with finding and classifying the semantic relations that hold between entities in unstructured text. The main practical applications of relation extraction are building databases of facts or augmenting existing ones. An extensive collection of relational triples (i.e. two entities and the relation that holds between them) can be converted into a structured database of facts about the real world of much better quality than manually created ones. In most conventional applications, the text entities between which a relation holds correspond to named entities or to underlying entities obtained with co-reference resolution.
Among the great variety of relation extraction approaches, we decided to implement:
- supervised relation extraction, as, to the best of our knowledge, it is the most common and best-performing approach today;
- document-level relation extraction, as it allows extracting a greater number of reliable relations from a document consisting of multiple sentences, which makes it closer to real-life applications.
While building the models, we had to address several challenges:
- negative samples: an important part of preparing the data for RE model training is generating negative training samples, i.e. samples where no relation holds between the entities. The number of negative samples is usually tuned experimentally and often turns out to be a delicate task that requires additional attention;
- multi-entity problem: in document-level relation extraction, one document may contain multiple entity pairs between which different relations hold. Ideally, a robust relation extraction system should be able to detect and classify all of them at once. Moreover, the same entity may be mentioned in different ways across the text (e.g. "John Smith", "Mr. Smith", "John", etc.);
- multi-label problem: one entity pair can occur many times in the document associated with different relations, in contrast to one relation per entity pair in sentence-level RE. For example, the entities "John Smith" and "New York" may easily express the relations "place of birth" and "place of death" at the same time, if some John Smith happened to be born and to die in the same city of New York.
The number of negative samples was tuned in a series of experiments. The last two challenges were overcome with a specific model architecture, which is described in the next section.
The developed RE model is based on the ATLOP model (Adaptive Thresholding and Localized Context Pooling). Its two core ideas are the adaptive threshold and localized context pooling.
- Adaptive Threshold. The usual global threshold for converting the RE classifier's output probabilities into relation labels is replaced with a learnable one. A new threshold class that learns an entity-pair-dependent threshold value is introduced and trained like all other classes. During prediction, the positive classes (i.e. the relations that indeed hold in the sample) are the classes whose logits are higher than that of the threshold class; all other classes are negative (i.e. these relations do not hold between the given entities). See the sketch after this list.
- Localized Context Pooling. The embedding of each entity pair is enhanced with an additional local context embedding related to both entities. Such a representation, attending to the parts of the document relevant to the entity pair, is useful for deciding the relation for this particular pair. The context information is derived directly from the attention heads of the encoder.
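Below is a minimal PyTorch sketch of these two ideas, for illustration only: the tensor shapes, the convention that class 0 is the threshold class, and both function names are our assumptions, not the actual DeepPavlov implementation.

import torch

def adaptive_threshold_predict(logits: torch.Tensor) -> torch.Tensor:
    # logits: (n_pairs, n_classes); class 0 is assumed to be the learnable
    # threshold (TH) class. A relation is predicted for a pair iff its logit
    # exceeds the TH logit for that same pair.
    th_logit = logits[:, :1]
    mask = logits > th_logit
    mask[:, 0] = False  # the TH class itself is never returned as a label
    return mask

def localized_context(token_emb: torch.Tensor, attn_head: torch.Tensor, attn_tail: torch.Tensor) -> torch.Tensor:
    # token_emb: (seq_len, hidden) encoder outputs; attn_head / attn_tail:
    # (seq_len,) attention weights the encoder assigns to tokens for the head
    # and tail entity. Multiplying them keeps only tokens important to both
    # entities, yielding a pair-specific local context vector.
    a = attn_head * attn_tail
    a = a / (a.sum() + 1e-10)
    return a @ token_emb  # (hidden,)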
Apart from a number of adjustments to the original ATLOP model, we also changed the input. While the ATLOP input consists of tokenized samples and entity positions, we added the NER tags of all entities as an additional input. Thus, the input of our RE model is the following: 1) the text document as a list of tokens; 2) a list of entity positions (i.e. all start and end positions of each entity's mentions); 3) a list of NER tags of both entities. The input text is encoded with the BERT base model (uncased).
The model output is one or several relations found between the given entities. In the case of the English RE model, the output consists of the Wikidata relation id and the English relation name. For the Russian RE model, it is the corresponding Russian relation name (or the English one, if no Russian name is available) and, if applicable, its Wikidata id.
Full list of 97 English relations & relational statistics in train, valid and test data
Relation | Relation id | # samples (train) | # samples (valid) | # samples (test) |
---|---|---|---|---|
head of government | P6 | 235 | 11 | 7 |
country | P17 | 11189 | 288 | 275 |
place of birth | P19 | 616 | 25 | 15 |
place of death | P20 | 242 | 3 | 8 |
father | P22 | 314 | 9 | 9 |
mother | P25 | 86 | 0 | 3 |
spouse | P26 | 365 | 8 | 12 |
country of citizenship | P27 | 3321 | 95 | 74 |
continent | P30 | 428 | 29 | 20 |
instance of | P31 | 140 | 5 | 6 |
head of state | P35 | 181 | 3 | 5 |
capital | P36 | 106 | 0 | 6 |
official language | P37 | 153 | 6 | 7 |
position held | P39 | 29 | 0 | 2 |
child | P40 | 417 | 11 | 13 |
author | P50 | 389 | 15 | 9 |
member of sports team | P54 | 533 | 10 | 2 |
director | P57 | 316 | 9 | 11 |
screenwriter | P58 | 184 | 5 | 2 |
educated at | P69 | 375 | 22 | 10 |
composer | P86 | 126 | 6 | 4 |
member of political party | P102 | 480 | 15 | 9 |
employer | P108 | 235 | 10 | 5 |
founded by | P112 | 123 | 3 | 1 |
league | P118 | 234 | 5 | 2 |
publisher | P123 | 226 | 9 | 6 |
owned by | P127 | 265 | 13 | 6 |
located in the administrative territorial entity | P131 | 5164 | 117 | 161 |
genre | P136 | 124 | 1 | 0 |
operator | P137 | 113 | 18 | 6 |
religion | P140 | 214 | 2 | 10 |
contains administrative territorial entity | P150 | 2467 | 76 | 77 |
follows | P155 | 240 | 11 | 6 |
followed by | P156 | 229 | 10 | 4 |
headquarters location | P159 | 335 | 9 | 6 |
cast member | P161 | 815 | 19 | 13 |
producer | P162 | 156 | 6 | 7 |
award received | P166 | 228 | 6 | 4 |
creator | P170 | 259 | 8 | 4 |
parent taxon | P171 | 91 | 1 | 0 |
ethnic group | P172 | 106 | 1 | 1 |
performer | P175 | 1337 | 31 | 25 |
manufacturer | P176 | 122 | 1 | 0 |
developer | P178 | 306 | 2 | 5 |
series | P179 | 194 | 7 | 6 |
sister city | P190 | 6 | 0 | 0 |
legislative body | P194 | 215 | 5 | 2 |
basin country | P205 | 112 | 5 | 0 |
located in or next to body of water | P206 | 267 | 6 | 4 |
military branch | P241 | 144 | 2 | 5 |
record label | P264 | 783 | 19 | 29 |
production company | P272 | 107 | 6 | 5 |
location | P276 | 231 | 3 | 12 |
subclass of | P279 | 101 | 1 | 11 |
subsidiary | P355 | 115 | 2 | 5 |
part of | P361 | 753 | 15 | 22 |
original language of work | P364 | 87 | 5 | 4 |
platform | P400 | 358 | 7 | 8 |
mouth of the watercourse | P403 | 125 | 7 | 1 |
original network | P449 | 184 | 2 | 5 |
member of | P463 | 512 | 13 | 2 |
chairperson | P488 | 78 | 2 | 4 |
country of origin | P495 | 721 | 24 | 10 |
has part | P527 | 791 | 10 | 8 |
residence | P551 | 39 | 2 | 0 |
date of birth | P569 | 1318 | 37 | 31 |
date of death | P570 | 998 | 34 | 26 |
inception | P571 | 602 | 14 | 15 |
dissolved, abolished or demolished | P576 | 113 | 2 | 3 |
publication date | P577 | 1482 | 27 | 43 |
start time | P580 | 139 | 1 | 2 |
end time | P582 | 73 | 1 | 0 |
point in time | P585 | 132 | 2 | 1 |
conflict | P607 | 371 | 17 | 6 |
characters | P674 | 223 | 6 | 8 |
lyrics by | P676 | 44 | 0 | 0 |
located on terrain feature | P706 | 184 | 8 | 5 |
participant | P710 | 237 | 3 | 8 |
influenced by | P737 | 17 | 1 | 1 |
location of formation | P740 | 74 | 3 | 0 |
parent organization | P749 | 127 | 2 | 3 |
notable work | P800 | 198 | 7 | 1 |
separated from | P807 | 4 | 0 | 0 |
narrative location | P840 | 56 | 4 | 3 |
work location | P937 | 118 | 3 | 5 |
applies to jurisdiction | P1001 | 368 | 7 | 6 |
product or material produced | P1056 | 45 | 0 | 0 |
unemployment rate | P1198 | 3 | 0 | 0 |
territory claimed by | P1336 | 42 | 1 | 0 |
participant of | P1344 | 272 | 5 | 5 |
replaces | P1365 | 27 | 0 | 1 |
replaced by | P1366 | 44 | 1 | 1 |
capital of | P1376 | 93 | 0 | 4 |
languages spoken, written or signed | P1412 | 192 | 6 | 4 |
present in work | P1441 | 401 | 6 | 8 |
sibling | P3373 | 453 | 3 | 13 |
Full list of 30 Russian relations & relational statistics in train, valid and test data
Relation | Relation id | Russian relation | # samples (train) | # samples (valid) | # samples (test) |
---|---|---|---|---|---|
MEMBER | P710 | участник | 104 | 20 | 9 |
WORKS_AS | P106 | род занятий | 962 | 126 | 121 |
WORKPLACE | | | 932 | 93 | 119 |
OWNERSHIP | P1830 | владеет | 784 | 107 | 99 |
SUBORDINATE_OF | - | - | 30 | 4 | 3 |
TAKES_PLACE_IN | P276 | местонахождение | 66 | 6 | 7 |
EVENT_TAKES_PART_IN | P1344 | участвовал в | 143 | 22 | 12 |
SELLS_TO | - | - | 323 | 37 | 44 |
ALTERNATIVE_NAME | - | - | 137 | 19 | 12 |
HEADQUARTERED_IN | P159 | расположение штаб-квартиры | 451 | 69 | 61 |
PRODUCES | P1056 | продукция | 42 | 4 | 6 |
ABBREVIATION | - | - | 92 | 5 | 11 |
DATE_DEFUNCT_IN | P576 | дата прекращения существования | 4 | - | - |
SUBEVENT_OF | P361 | часть от | 10 | 1 | 1 |
DATE_FOUNDED_IN | P571 | дата основания/создания/возникновения | 16 | 1 | 2 |
DATE_TAKES_PLACE_ON | P585 | момент времени | 40 | 4 | 5 |
NUMBER_OF_EMPLOYEES_FIRED | - | - | 12 | 3 | 1 |
ORIGINS_FROM | P495 | страна происхождения | 43 | 6 | 5 |
ACQUINTANCE_OF | - | - | 3 | - | - |
PARENT_OF | P40 | дети | 14 | 1 | 2 |
ORGANIZES | P664 | организатор | 46 | 8 | 7 |
FOUNDED_BY | P112 | основатель | 13 | - | 3 |
PLACE_RESIDES_IN | P551 | место жительства | 8 | 1 | 2 |
BORN_IN | P19 | место рождения | 1 | - | - |
AGE_IS | - | - | 1 | 1 | 1 |
RELATIVE | - | - | 2 | - | - |
NUMBER_OF_EMPLOYEES | P1128 | число сотрудников | 4 | - | 2 |
SIBLING | P3373 | брат/сестра | 1 | - | 1 |
DATE_OF_BIRTH | P569 | дата рождения | 1 | - | - |
The RE model for English was trained on the DocRED corpus. DocRED was constructed from Wikipedia and Wikidata and is currently the largest human-annotated English dataset for document-level RE from plain text.
Some details about DocRED corpus
Here are the statistics of the train, valid and test sets we used for training:
Train | Valid | Test |
---|---|---|
130650 | 3406 | 3545 |
Train Positive | Train Negative | Valid Positive | Valid Negative | Test Positive | Test Negative |
---|---|---|---|---|---|
44823 | 89214 | 1239 | 1229 | 1043 | 1036 |
As the original DocRED test set contains only unlabeled data, while we need labeled data to perform evaluation, we decided to:
- merge the train and dev data (i.e. all labeled data)
- split it into new train, dev and test sets.
The current implementation allows splitting the data in two ways:
- user can set the relative size of dev and test data (e.g. 1/7)
- user can set the absolute size of dev and test data (e.g. 2000 samples)
For the final model training, we set the absolute size of the dev and test data to 150 initial documents each, which resulted in approximately 3500 samples per set.
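A minimal sketch of this splitting logic, assuming documents are shuffled with a fixed seed; the function name and signature are hypothetical, not the actual DeepPavlov code:

import random

def split_docred(labeled_docs, dev_size, test_size, seed=42):
    # dev_size / test_size: a float in (0, 1) for a relative share of the
    # documents, or an int for an absolute number of documents.
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    n = len(docs)
    n_dev = int(dev_size * n) if isinstance(dev_size, float) else dev_size
    n_test = int(test_size * n) if isinstance(test_size, float) else test_size
    dev = docs[:n_dev]
    test = docs[n_dev:n_dev + n_test]
    train = docs[n_dev + n_test:]
    return train, dev, test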
As for the negative samples, the best results were obtained with the following proportions:
- for the train set: twice as many negative samples as positive ones
- for the dev & test sets: as many negative samples as positive ones
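As an illustration of how such negatives can be generated, here is a hypothetical helper (the name, signature and pair representation are our assumptions): it treats every ordered entity pair without an annotated relation as a candidate negative and keeps `ratio` negatives per positive (2.0 for train, 1.0 for dev/test).

import random
from itertools import permutations

def add_negative_samples(n_entities, positive_pairs, ratio, seed=42):
    # A negative sample is an ordered entity pair (i, j) that carries no
    # relation in the document's annotation.
    candidates = [p for p in permutations(range(n_entities), 2)
                  if p not in positive_pairs]
    n_neg = min(len(candidates), int(ratio * len(positive_pairs)))
    return random.Random(seed).sample(candidates, n_neg)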
The RE model for Russian was trained on the RuRED corpus, which is based on the Lenta.ru news corpus.
Some details about RuRED corpus
For training the Russian RE model, the original RuRED train, valid and test sets were used.
Negative samples were generated in the same proportions as for the English RE model:
- for the train set: twice as many negative samples as positive ones
- for the dev & test sets: as many negative samples as positive ones
Train | Valid | Test |
---|---|---|
12855 | 1076 | 1072 |
Train Positive | Train Negative | Valid Positive | Valid Negative | Test Positive | Test Negative |
---|---|---|---|---|---|
4285 | 8570 | 538 | 538 | 536 | 536 |
List of 6 NER tags for the English model
# | Tag | Description |
---|---|---|
1 | PER | People, including fictional |
2 | ORG | Companies, universities, institutions, political or religious groups, etc. |
3 | LOC | Geographically defined locations, including mountains, waters, etc. Politically defined locations, including countries, cities, states, streets, etc. Facilities, including buildings, museums, stadiums, hospitals, factories, airports, etc. |
4 | TIME | Absolute or relative dates or periods. |
5 | NUM | Percents, money, quantities |
6 | MISC | Products, including vehicles, weapons, etc. Events, including elections, battles, sporting events, etc. Laws, cases, languages, etc. |
List of 29 NER tags for the Russian model
# | NER tag | Description |
---|---|---|
1. | WORK_OF_ART | name of work of art |
2. | NORP | affiliation |
3. | GROUP | unnamed groups of people and companies |
4. | LAW | law name |
5. | NATIONALITY | names of nationalities |
6. | EVENT | event name |
7. | DATE | date value |
8. | CURRENCY | names of currencies |
9. | GPE | geo-political entity |
10. | QUANTITY | quantity value |
11. | FAMILY | families as a whole |
12. | ORDINAL | ordinal value |
13. | RELIGION | names of religions |
14. | CITY | names of cities, towns, and villages |
15. | MONEY | money name |
16. | AGE | people's and object's ages |
17. | LOCATION | location name |
18. | PERCENT | percent value |
19. | BOROUGH | names of sub-city entities |
20. | PERSON | person name |
21. | REGION | names of sub-country entities |
22. | COUNTRY | names of countries |
23. | PROFESSION | professions and people of these professions |
24. | ORGANIZATION | organization name |
25. | FAC | building name |
26. | CARDINAL | cardinal value |
27. | PRODUCT | product name |
28. | TIME | time value |
29. | STREET | street name |