Google Summer of Code 2021
Organisation: DeepPavlov
Git: Relation Extraction
Project page: Relation Extraction
Code Contribution: Pull Request
My major task was to implement a relation extraction (RE) module for both the English and Russian languages. The main steps were:
- design and develop the neural network for relation extraction
- design the relation extraction pipeline and incorporate it into the DeepPavlov framework
- choose the datasets the relation extraction models would be trained on
- train reliable RE models for both languages and test them on different test examples.
During the GSoC 2021 project, all the tasks listed above were accomplished, namely:
- a new relation extraction model was designed based on the ATLOP model (more details are here)
- the relation extraction pipeline was implemented in the DeepPavlov framework (and here is a pull request)
- for training the English RE model, the DocRED corpus was used; for the Russian RE model, RuRED. Both datasets were preprocessed and extended with the additional negative samples needed to train a reliable RE model (more details are here)
- multiple RE models were trained with different parameters; the best one was uploaded online and can now be used to try the relation extraction component (here are working examples for the English and Russian languages)
The Pull Request is now open and waiting to be merged into the DeepPavlov library.
Suppose we have the sentence "Barack Obama is married to Michelle Obama, born Michelle Robinson." The task is to find the relation between Barack Obama and Michelle Obama/Michelle Robinson. First, we download the pretrained English RE model:
python3.6 -m deeppavlov download re_docred
... and call it on our test sample:
from deeppavlov import configs, build_model
re_model = build_model(configs.relation_extraction.re_docred, download=False)
sentence_tokens = [["Barack", "Obama", "is", "married", "to", "Michelle", "Obama", ",", "born", "Michelle", "Robinson", "."]]
# (start, end) token positions of every mention of the two entities
entity_pos = [[[(0, 2)], [(5, 7), (9, 11)]]]
# NER tags of the two entities
entity_tags = [["PER", "PER"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
>> [['P26'], ['spouse']]
As a result, the English RE model detects the relation "spouse" and returns it with the corresponding Wikidata id ("P26").
If you wish to train the English RE model from scratch, it can be done with the following command:
python3.6 -m deeppavlov train re_docred
Suppose we have the sentence "Илон Маск живет в Сиэттле" ("Elon Musk lives in Seattle"). The task is to find the relation between Илон Маск and Сиэттле. First, we download the pretrained Russian RE model:
python3.6 -m deeppavlov download re_rured
... and call it on our test sample:
from deeppavlov import configs, build_model
re_model = build_model(configs.relation_extraction.re_rured, download=False)
sentence_tokens = [["Илон", "Маск", "живет", "в", "Сиэттле", "."]]
# (start, end) token positions of every mention of the two entities
entity_pos = [[[(0, 2)], [(4, 5)]]]
# NER tags of the two entities
entity_tags = [["PERSON", "CITY"]]
pred = re_model(sentence_tokens, entity_pos, entity_tags)
>> [['P551'], ['место жительства']]
As a result, the Russian RE model detects the relation "место жительства" ("residence") and returns it with the corresponding Wikidata id ("P551").
If you wish to train the Russian RE model from scratch, it can be done with the following command:
python3.6 -m deeppavlov train re_rured
Relation extraction is a subtask of information extraction that deals with finding and classifying the semantic relations that hold between entities in unstructured text. The main practical applications of relation extraction are building databases of facts or augmenting existing ones. An extensive collection of relational triples (i.e. two entities and the relation that holds between them) can be converted into a structured database of facts about the real world of much better quality than manually created ones. In most conventional applications, the text entities between which a relation holds correspond to named entities or to underlying entities obtained with co-reference resolution.
Among the great variety of relation extraction approaches, we decided to implement:
- supervised relation extraction, as, to the best of our knowledge, it is the most common and best-performing approach today;
- document-level relation extraction, as it allows extracting a greater number of reliable relations from a document consisting of multiple sentences, which makes it closer to real-life applications.
While building the models, we had to address several challenges:
- negative samples: an important part of preparing the data for RE model training is generating negative training samples, i.e. samples where no relation holds between the entities. The number of negative samples is usually tuned experimentally and often turns out to be a delicate task that requires additional attention;
- multi-entity problem: in document-level relation extraction, one document may contain multiple entity pairs between which different relations hold. Ideally, a robust relation extraction system should be able to detect and classify all of them at once. Moreover, the same entity may be mentioned in different ways across the text (e.g. "John Smith", "Mr. Smith", "John", etc.);
- multi-label problem: one entity pair can occur many times in the document associated with different relations, in contrast to one relation per entity pair in sentence-level RE. For example, the entities "John Smith" and "New York" may easily express the relations "place of birth" and "place of death" at the same time, if some John Smith happened to be born and to die in the same city of New York.
The number of negative samples was tuned in a series of experiments. The last two challenges were overcome with a specific model architecture, which is described in the next section.
The developed RE model is based on the ATLOP model (Adaptive Thresholding and Localized Context Pooling). Its two core ideas are the adaptive threshold and localized context pooling.
- Adaptive Threshold. The usual global threshold for converting the RE classifier's output probabilities into relation labels is replaced with a learnable one. A new threshold class that learns an entity-pair-dependent threshold value is introduced and trained like all other classes. During prediction, the positive classes (i.e. the relations that indeed hold in the sample) are the classes whose logits are higher than that of the threshold class; all other classes are negative (i.e. these relations do not hold between the given entities). See the sketch after this list.
- Localized Context Pooling. The embedding of each entity pair is enhanced with an additional local context embedding related to both entities. Such a representation, attending to the parts of the document relevant to the entity pair, is useful for deciding the relation for this particular pair. The context information is derived directly from the attention heads of the encoder.
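Below is a minimal PyTorch sketch of these two ideas, for illustration only: the tensor shapes, the convention that class 0 is the threshold class, and both function names are our assumptions, not the actual DeepPavlov implementation.

import torch

def adaptive_threshold_predict(logits: torch.Tensor) -> torch.Tensor:
    # logits: (n_pairs, n_classes); class 0 is assumed to be the learnable
    # threshold (TH) class. A relation is predicted for a pair iff its logit
    # exceeds the TH logit for that same pair.
    th_logit = logits[:, :1]
    mask = logits > th_logit
    mask[:, 0] = False  # the TH class itself is never returned as a label
    return mask

def localized_context(token_emb: torch.Tensor, attn_head: torch.Tensor, attn_tail: torch.Tensor) -> torch.Tensor:
    # token_emb: (seq_len, hidden) encoder outputs; attn_head / attn_tail:
    # (seq_len,) attention weights the encoder assigns to tokens for the head
    # and tail entity. Multiplying them keeps only tokens important to both
    # entities, yielding a pair-specific local context vector.
    a = attn_head * attn_tail
    a = a / (a.sum() + 1e-10)
    return a @ token_emb  # (hidden,)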
Apart from a number of adjustments to the original ATLOP model, we also changed the input. While the ATLOP input consists of tokenized samples and entity positions, we added the NER tags of all entities as an additional input. Thus, the input of our RE model is the following: 1) the text document as a list of tokens; 2) a list of entity positions (i.e. all start and end positions of each entity's mentions); 3) a list of NER tags of both entities. The input text is encoded with the BERT base model (uncased).
The model output is one or several relations found between the given entities. In the case of the English RE model, the output consists of the Wikidata relation id and the English relation name. For the Russian RE model, it is the corresponding Russian relation name (or the English one, if no Russian name is available) and, if applicable, its Wikidata id.
Full list of 97 English relations & relational statistics in train, valid and test data
Relation | Relation id | # samples (train) | # samples (valid) | # samples (test) |
---|---|---|---|---|
head of government | P6 | 235 | 11 | 7 |
country | P17 | 11189 | 288 | 275 |
place of birth | P19 | 616 | 25 | 15 |
place of death | P20 | 242 | 3 | 8 |
father | P22 | 314 | 9 | 9 |
mother | P25 | 86 | 0 | 3 |
spouse | P26 | 365 | 8 | 12 |
country of citizenship | P27 | 3321 | 95 | 74 |
continent | P30 | 428 | 29 | 20 |
instance of | P31 | 140 | 5 | 6 |
head of state | P35 | 181 | 3 | 5 |
capital | P36 | 106 | 0 | 6 |
official language | P37 | 153 | 6 | 7 |
position held | P39 | 29 | 0 | 2 |
child | P40 | 417 | 11 | 13 |
author | P50 | 389 | 15 | 9 |
member of sports team | P54 | 533 | 10 | 2 |
director | P57 | 316 | 9 | 11 |
screenwriter | P58 | 184 | 5 | 2 |
educated at | P69 | 375 | 22 | 10 |
composer | P86 | 126 | 6 | 4 |
member of political party | P102 | 480 | 15 | 9 |
employer | P108 | 235 | 10 | 5 |
founded by | P112 | 123 | 3 | 1 |
league | P118 | 234 | 5 | 2 |
publisher | P123 | 226 | 9 | 6 |
owned by | P127 | 265 | 13 | 6 |
located in the administrative territorial entity | P131 | 5164 | 117 | 161 |
genre | P136 | 124 | 1 | 0 |
operator | P137 | 113 | 18 | 6 |
religion | P140 | 214 | 2 | 10 |
contains administrative territorial entity | P150 | 2467 | 76 | 77 |
follows | P155 | 240 | 11 | 6 |
followed by | P156 | 229 | 10 | 4 |
headquarters location | P159 | 335 | 9 | 6 |
cast member | P161 | 815 | 19 | 13 |
producer | P162 | 156 | 6 | 7 |
award received | P166 | 228 | 6 | 4 |
creator | P170 | 259 | 8 | 4 |
parent taxon | P171 | 91 | 1 | 0 |
ethnic group | P172 | 106 | 1 | 1 |
performer | P175 | 1337 | 31 | 25 |
manufacturer | P176 | 122 | 1 | 0 |
developer | P178 | 306 | 2 | 5 |
series | P179 | 194 | 7 | 6 |
sister city | P190 | 6 | 0 | 0 |
legislative body | P194 | 215 | 5 | 2 |
basin country | P205 | 112 | 5 | 0 |
located in or next to body of water | P206 | 267 | 6 | 4 |
military branch | P241 | 144 | 2 | 5 |
record label | P264 | 783 | 19 | 29 |
production company | P272 | 107 | 6 | 5 |
location | P276 | 231 | 3 | 12 |
subclass of | P279 | 101 | 1 | 11 |
subsidiary | P355 | 115 | 2 | 5 |
part of | P361 | 753 | 15 | 22 |
original language of work | P364 | 87 | 5 | 4 |
platform | P400 | 358 | 7 | 8 |
mouth of the watercourse | P403 | 125 | 7 | 1 |
original network | P449 | 184 | 2 | 5 |
member of | P463 | 512 | 13 | 2 |
chairperson | P488 | 78 | 2 | 4 |
country of origin | P495 | 721 | 24 | 10 |
has part | P527 | 791 | 10 | 8 |
residence | P551 | 39 | 2 | 0 |
date of birth | P569 | 1318 | 37 | 31 |
date of death | P570 | 998 | 34 | 26 |
inception | P571 | 602 | 14 | 15 |
dissolved, abolished or demolished | P576 | 113 | 2 | 3 |
publication date | P577 | 1482 | 27 | 43 |
start time | P580 | 139 | 1 | 2 |
end time | P582 | 73 | 1 | 0 |
point in time | P585 | 132 | 2 | 1 |
conflict | P607 | 371 | 17 | 6 |
characters | P674 | 223 | 6 | 8 |
lyrics by | P676 | 44 | 0 | 0 |
located on terrain feature | P706 | 184 | 8 | 5 |
participant | P710 | 237 | 3 | 8 |
influenced by | P737 | 17 | 1 | 1 |
location of formation | P740 | 74 | 3 | 0 |
parent organization | P749 | 127 | 2 | 3 |
notable work | P800 | 198 | 7 | 1 |
separated from | P807 | 4 | 0 | 0 |
narrative location | P840 | 56 | 4 | 3 |
work location | P937 | 118 | 3 | 5 |
applies to jurisdiction | P1001 | 368 | 7 | 6 |
product or material produced | P1056 | 45 | 0 | 0 |
unemployment rate | P1198 | 3 | 0 | 0 |
territory claimed by | P1336 | 42 | 1 | 0 |
participant of | P1344 | 272 | 5 | 5 |
replaces | P1365 | 27 | 0 | 1 |
replaced by | P1366 | 44 | 1 | 1 |
capital of | P1376 | 93 | 0 | 4 |
languages spoken, written or signed | P1412 | 192 | 6 | 4 |
present in work | P1441 | 401 | 6 | 8 |
sibling | P3373 | 453 | 3 | 13 |
Full list of 30 Russian relations & relational statistics in train, valid and test data
Relation | Relation id | Russian relation | # samples (train) | # samples (valid) | # samples (test) |
---|---|---|---|---|---|
MEMBER | P710 | участник | 104 | 20 | 9 |
WORKS_AS | P106 | род занятий | 962 | 126 | 121 |
WORKPLACE | | | 932 | 93 | 119 |
OWNERSHIP | P1830 | владеет | 784 | 107 | 99 |
SUBORDINATE_OF | - | - | 30 | 4 | 3 |
TAKES_PLACE_IN | P276 | местонахождение | 66 | 6 | 7 |
EVENT_TAKES_PART_IN | P1344 | участвовал в | 143 | 22 | 12 |
SELLS_TO | - | - | 323 | 37 | 44 |
ALTERNATIVE_NAME | - | - | 137 | 19 | 12 |
HEADQUARTERED_IN | P159 | расположение штаб-квартиры | 451 | 69 | 61 |
PRODUCES | P1056 | продукция | 42 | 4 | 6 |
ABBREVIATION | - | - | 92 | 5 | 11 |
DATE_DEFUNCT_IN | P576 | дата прекращения существования | 4 | - | - |
SUBEVENT_OF | P361 | часть от | 10 | 1 | 1 |
DATE_FOUNDED_IN | P571 | дата основания/создания/возникновения | 16 | 1 | 2 |
DATE_TAKES_PLACE_ON | P585 | момент времени | 40 | 4 | 5 |
NUMBER_OF_EMPLOYEES_FIRED | - | - | 12 | 3 | 1 |
ORIGINS_FROM | P495 | страна происхождения | 43 | 6 | 5 |
ACQUINTANCE_OF | - | - | 3 | - | - |
PARENT_OF | P40 | дети | 14 | 1 | 2 |
ORGANIZES | P664 | организатор | 46 | 8 | 7 |
FOUNDED_BY | P112 | основатель | 13 | - | 3 |
PLACE_RESIDES_IN | P551 | место жительства | 8 | 1 | 2 |
BORN_IN | P19 | место рождения | 1 | - | - |
AGE_IS | - | - | 1 | 1 | 1 |
RELATIVE | - | - | 2 | - | - |
NUMBER_OF_EMPLOYEES | P1128 | число сотрудников | 4 | - | 2 |
SIBLING | P3373 | брат/сестра | 1 | - | 1 |
DATE_OF_BIRTH | P569 | дата рождения | 1 | - | - |
The RE model for English was trained on the DocRED corpus. DocRED was constructed from Wikipedia and Wikidata and is currently the largest human-annotated English dataset for document-level RE from plain text.
Some details about DocRED corpus
Here are the statistics of the train, valid and test sets we used for training:
Train | Valid | Test |
---|---|---|
130650 | 3406 | 3545 |
Train Positive | Train Negative | Valid Positive | Valid Negative | Test Positive | Test Negative |
---|---|---|---|---|---|
44823 | 89214 | 1239 | 1229 | 1043 | 1036 |
As the original DocRED test set contains only unlabeled data, while we need labeled data to perform evaluation, we decided to:
- merge the train and dev data (i.e. all labeled data)
- split it into new train, dev and test sets.
The current implementation allows splitting the data in two ways:
- user can set the relative size of dev and test data (e.g. 1/7)
- user can set the absolute size of dev and test data (e.g. 2000 samples)
For the final model training, we set the absolute size of the dev and test data to 150 initial documents each, which resulted in approximately 3500 samples per set.
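A minimal sketch of this splitting logic, assuming documents are shuffled with a fixed seed; the function name and signature are hypothetical, not the actual DeepPavlov code:

import random

def split_docred(labeled_docs, dev_size, test_size, seed=42):
    # dev_size / test_size: a float in (0, 1) for a relative share of the
    # documents, or an int for an absolute number of documents.
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    n = len(docs)
    n_dev = int(dev_size * n) if isinstance(dev_size, float) else dev_size
    n_test = int(test_size * n) if isinstance(test_size, float) else test_size
    dev = docs[:n_dev]
    test = docs[n_dev:n_dev + n_test]
    train = docs[n_dev + n_test:]
    return train, dev, test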
As for the negative samples, the best results were obtained with the following proportions:
- for the train set: twice as many negative samples as positive ones
- for the dev & test sets: as many negative samples as positive ones
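As an illustration of how such negatives can be generated, here is a hypothetical helper (the name, signature and pair representation are our assumptions): it treats every ordered entity pair without an annotated relation as a candidate negative and keeps `ratio` negatives per positive (2.0 for train, 1.0 for dev/test).

import random
from itertools import permutations

def add_negative_samples(n_entities, positive_pairs, ratio, seed=42):
    # A negative sample is an ordered entity pair (i, j) that carries no
    # relation in the document's annotation.
    candidates = [p for p in permutations(range(n_entities), 2)
                  if p not in positive_pairs]
    n_neg = min(len(candidates), int(ratio * len(positive_pairs)))
    return random.Random(seed).sample(candidates, n_neg)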
The RE model for Russian was trained on the RuRED corpus, which is based on the Lenta.ru news corpus.
Some details about RuRED corpus
For training the Russian RE model, the original RuRED train, valid and test sets were used.
Negative samples were generated in the same proportions as for the English RE model:
- for the train set: twice as many negative samples as positive ones
- for the dev & test sets: as many negative samples as positive ones
Train | Valid | Test |
---|---|---|
12855 | 1076 | 1072 |
Train Positive | Train Negative | Valid Positive | Valid Negative | Test Positive | Test Negative |
---|---|---|---|---|---|
4285 | 8570 | 538 | 538 | 536 | 536 |
List of 6 NER tags for the English model
# | Tag | Description |
---|---|---|
1 | PER | People, including fictional |
2 | ORG | Companies, universities, institutions, political or religious groups, etc. |
3 | LOC | Geographically defined locations, including mountains, waters, etc. Politically defined locations, including countries, cities, states, streets, etc. Facilities, including buildings, museums, stadiums, hospitals, factories, airports, etc. |
4 | TIME | Absolute or relative dates or periods. |
5 | NUM | Percents, money, quantities |
6 | MISC | Products, including vehicles, weapons, etc. Events, including elections, battles, sporting events, etc. Laws, cases, languages, etc. |
List of 29 NER tags for the Russian model
# | NER tag | Description |
---|---|---|
1. | WORK_OF_ART | name of work of art |
2. | NORP | affiliation |
3. | GROUP | unnamed groups of people and companies |
4. | LAW | law name |
5. | NATIONALITY | names of nationalities |
6. | EVENT | event name |
7. | DATE | date value |
8. | CURRENCY | names of currencies |
9. | GPE | geo-political entity |
10. | QUANTITY | quantity value |
11. | FAMILY | families as a whole |
12. | ORDINAL | ordinal value |
13. | RELIGION | names of religions |
14. | CITY | names of cities, towns, and villages |
15. | MONEY | money name |
16. | AGE | people's and object's ages |
17. | LOCATION | location name |
18. | PERCENT | percent value |
19. | BOROUGH | names of sub-city entities |
20. | PERSON | person name |
21. | REGION | names of sub-country entities |
22. | COUNTRY | names of countries |
23. | PROFESSION | professions and people of these professions |
24. | ORGANIZATION | organization name |
25. | FAC | building name |
26. | CARDINAL | cardinal value |
27. | PRODUCT | product name |
28. | TIME | time value |
29. | STREET | street name |