Student | Anshuman Singh |
---|---|
Github | @rimijoker |
Organisation | DeepPavlov |
Project | Refactor MultiTask Bert |
The DeepPavlov library includes many state-of-the-art NLP techniques, and multi-task BERT is one of them.
Multi-task learning shares information between related tasks, reducing the number of parameters required. State-of-the-art results across the natural language understanding tasks in the GLUE benchmark have previously relied on transfer learning from a large task: unsupervised pre-training with BERT, where a separate BERT model was fine-tuned for each task.
In multi-task BERT, we share a single BERT model together with a small number of task-specific parameters and match the performance of separately fine-tuned models on the GLUE benchmark with fewer parameters.
Currently, multi-task BERT in DeepPavlov is implemented in TensorFlow and needs to be refactored so that DeepPavlov uses newer frameworks such as PyTorch.
The refactored code also needs to bring techniques such as PAL-BERT, CA-MTL, and MT-DNN into the DeepPavlov library, matching the GLUE benchmark results reported for each technique.
DeepPavlov: Github Repository
Project Issue: Refactor Multitask Bert
- Bert-n-Pals: https://gluebenchmark.com/submission/WOLUDc3uwbTdFn5y4hpj1Nh9pWi1/-MciWas_9fvwXZpnPCzo
- CA-MTL: https://gluebenchmark.com/submission/4soAsU8UFyfUvRvJRWzSHfGMEB03/-MgV7AEqgUctoJxdwAHe
- Implemented the MultiTask Pal Bert Iterator
- Implemented the MultiTask Pal Bert Preprocessor
- Implemented the MultiTask Pal Bert Model
- Added gradient accumulation support for the model, which is not available in other DeepPavlov models.
- 35% savings on VRAM when compared with two single Bert models while achieving better scores on the GLUE Benchmark.
- Documented the MultiTask Pal Bert Iterator, MultiTask Pal Bert Preprocessor, Pal Bert Model.
- Added a Colab tutorial for the model: mt_pal_bert_mrpc_rte_tutorial
- Added a config for full GLUE training.
Multi-task PAL BERT in DeepPavlov is an implementation of the BERT training algorithm published in the paper "BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning".
Multitask BERT and PALs paper: https://arxiv.org/pdf/1902.02671.pdf
The idea is to share the BERT body between several tasks. If a model pipe has several components using BERT and the amount of GPU memory is limited, this sharing can save a significant amount of memory. Each task has its own 'classifier' part attached to the output of the BERT encoder. If multi-task BERT has T heads, one training iteration consists of
- composing T mini-batches: one for the task being trained (as specified by the task ID), with the rest being dummy/sample batches,
- n gradient steps, where n is given by gradient_accumulation_steps (1 by default), for the task specified by the task ID.
While one BERT head is being trained, the parameters of the other heads do not change. On each training step, the parameters of both the trained head and the BERT body are updated.
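The update rule described above can be illustrated with a toy sketch. It is not the DeepPavlov implementation: the shared BERT body and the task heads are reduced to single scalar weights, and the names `grads` and `train_iteration` are hypothetical. It only shows that gradients are accumulated over several micro-batches and that one iteration changes the body and the selected head while the other heads stay untouched.

```python
from typing import Dict, List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs of one micro-batch


def grads(body: float, head: float, batch: Batch) -> Tuple[float, float]:
    """Mean gradient of (body * head * x - y)^2 w.r.t. body and head."""
    g_body = g_head = 0.0
    for x, y in batch:
        err = body * head * x - y
        g_body += 2 * err * head * x
        g_head += 2 * err * body * x
    return g_body / len(batch), g_head / len(batch)


def train_iteration(
    body: float,
    heads: Dict[str, float],
    task_id: str,
    micro_batches: List[Batch],
    lr: float = 0.01,
) -> Tuple[float, Dict[str, float]]:
    """One iteration for `task_id`: accumulate gradients, then update once."""
    steps = len(micro_batches)  # plays the role of gradient_accumulation_steps
    acc_body = acc_head = 0.0
    for batch in micro_batches:
        g_body, g_head = grads(body, heads[task_id], batch)
        acc_body += g_body / steps  # average over the accumulation steps
        acc_head += g_head / steps
    new_heads = dict(heads)  # heads of the other tasks are copied unchanged
    new_heads[task_id] = heads[task_id] - lr * acc_head
    return body - lr * acc_body, new_heads


heads = {"insults1": 0.5, "insults2": 2.0}
new_body, new_heads = train_iteration(
    1.0, heads, "insults1", [[(1.0, 1.0)], [(2.0, 2.0)]]
)
```

After this call, `new_heads["insults2"]` is still `2.0`, while `new_body` and `new_heads["insults1"]` have moved.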
You can also follow this tutorial in which we train a model on MRPC and RTE datasets on colab: mt_pal_bert_mrpc_rte_tutorial
On this page, multi-task PAL BERT usage is explained using a toy configuration file for a model that detects insults (for demonstration purposes, the same data is used for both tasks).
We start with the `metadata` field of the configuration file. The multi-task PAL BERT model is saved in `{"MODELS_PATH": "{ROOT_PATH}/models"}`. The `download` field of the multi-task PAL BERT configuration file is a union of the `download` fields of the original configs without pre-trained models. The `metadata` field of our config is given below.
{
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "PRETRAINED_BERT": "{ROOT_PATH}/pretrained_bert",
      "MODELS_PATH": "{ROOT_PATH}/models"
    },
    "download": [
      {
        "url": "http://files.deeppavlov.ai/datasets/insults_data.tar.gz",
        "subdir": "{DOWNLOADS_PATH}"
      },
      {
        "url": "https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin",
        "subdir": "{PRETRAINED_BERT}"
      }
    ]
  }
}
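The variables in this block refer to each other through `{NAME}` placeholders. The sketch below shows how such placeholders could be expanded; it is a simplified illustration, not the library's own config parser, and the helper name `resolve_variables` is hypothetical. It assumes that every variable is declared before it is used, as in the config above.

```python
from typing import Dict


def resolve_variables(variables: Dict[str, str]) -> Dict[str, str]:
    """Expand {NAME} placeholders using the variables declared earlier."""
    resolved: Dict[str, str] = {}
    for name, value in variables.items():
        for known_name, known_value in resolved.items():
            value = value.replace("{" + known_name + "}", known_value)
        resolved[name] = value
    return resolved


variables = {
    "ROOT_PATH": "~/.deeppavlov",
    "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
    "PRETRAINED_BERT": "{ROOT_PATH}/pretrained_bert",
    "MODELS_PATH": "{ROOT_PATH}/models",
}
resolved = resolve_variables(variables)
print(resolved["MODELS_PATH"])
```

With these variables, `MODELS_PATH` expands to `~/.deeppavlov/models`. The sketch leaves the `~` prefix as it is.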
Data reading and iteration are performed by `multitask_reader` and `multitask_pal_bert_iterator`. These classes are composed of task readers and iterators and generate batches that contain data from heterogeneous datasets.
A `multitask_reader` configuration has the parameters `class_name`, `data_path`, and `tasks`. The `data_path` field may be any string because data paths are passed for tasks individually in the `tasks` parameter. However, you cannot drop the `data_path` parameter because it is obligatory for the dataset reader configuration. The `tasks` parameter is a dictionary of task dataset reader configurations. In the configurations of task readers, the `reader_class_name` parameter is used instead of `class_name`. The dataset reader configuration is given below:
{
  "dataset_reader": {
    "class_name": "multitask_reader",
    "data_path": "null",
    "tasks": {
      "insults": {
        "reader_class_name": "basic_classification_reader",
        "x": "Comment",
        "y": "Class",
        "data_path": "{DOWNLOADS_PATH}/insults_data"
      },
      "insults1": {
        "reader_class_name": "basic_classification_reader",
        "x": "Comment",
        "y": "Class",
        "data_path": "{DOWNLOADS_PATH}/insults_data"
      }
    }
  }
}
A `multitask_pal_bert_iterator` configuration has the parameters `num_train_epochs`, `steps_per_epoch`, `class_name`, and `tasks`. `tasks` is a dictionary of task iterator configurations. In the configurations of task iterators, `iterator_class_name` is used instead of `class_name`. Also provide `gradient_accumulation_steps` if you use gradient accumulation. The dataset iterator configuration is as follows:
{
  "dataset_iterator": {
    "class_name": "multitask_pal_bert_iterator",
    "num_train_epochs": 5,
    "steps_per_epoch": 100,
    "tasks": {
      "insults": {
        "iterator_class_name": "basic_classification_iterator",
        "seed": 42
      },
      "insults1": {
        "iterator_class_name": "basic_classification_iterator",
        "seed": 42
      }
    }
  }
}
Batches generated by `multitask_pal_bert_iterator` are tuples of two elements: the inputs of the model and the labels. Both inputs and labels are lists of tuples. The inputs have the following format:
`[(first_task_inputs[0], second_task_inputs[0], ...), (first_task_inputs[1], second_task_inputs[1], ...), ...]`
where `first_task_inputs`, `second_task_inputs`, and so on are the x values of batches from the task dataset iterators. The labels have a similar format. The inputs also carry the task ID, which is extracted later by the `multitask_pal_bert_preprocessor`.
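The list-of-tuples layout can be reproduced with plain Python. This is only an illustration of the format described above: the helper names `combine_task_batches` and `split_by_task` are hypothetical, the sample strings are made up, and the way the iterator attaches the task ID to the inputs is not modelled.

```python
from typing import List, Sequence, Tuple


def combine_task_batches(*task_inputs: Sequence) -> List[Tuple]:
    """Zip per-task batches into [(task1[0], task2[0], ...), ...]."""
    lengths = {len(batch) for batch in task_inputs}
    if len(lengths) > 1:
        raise ValueError("all task batches must have the same length")
    return list(zip(*task_inputs))


def split_by_task(combined: Sequence[Tuple]) -> List[Tuple]:
    """Inverse operation: recover one tuple of samples per task."""
    return list(zip(*combined))


first_task_inputs = ["you are great", "what a mess"]
second_task_inputs = ["nice work", "go away"]

combined = combine_task_batches(first_task_inputs, second_task_inputs)
per_task = split_by_task(combined)
```

Here `combined[0]` holds the first sample of every task, and `per_task[0]` contains the samples of the first task again.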
In this tutorial, there are 2 datasets. Considering the batch structure, the `chainer` inputs are:
{
  "in": ["x_insults1_with_id", "x_insults2_with_id"],
  "in_y": ["y_insults1", "y_insults2"]
}
To extract the task ID from the inputs, we need to use the `multitask_pal_bert_preprocessor` component, which has the parameters `class_name`, `in`, and `out`. The first element of `out` is always the task ID; make sure that the remaining outputs keep the same relative order as the task inputs.
{
  "class_name": "multitask_pal_bert_preprocessor",
  "in": ["x_insults1_with_id", "x_insults2_with_id"],
  "out": ["task_id", "x_insults", "x_insults2"]
},
Sometimes a task dataset iterator returns inputs or labels consisting of more than one element. For example, in the model mt_bert_train_tutorial.json <kbqa/kbqa_mt_bert_train.json>, the `siamese_iterator` input element consists of 2 strings. If such a variable needs to be split, the `InputSplitter` component can be used.
Data preparation steps in the pipe of tutorial config are similar to data preparation steps in the original configs except for the names of the variables.
A `multitask_pal_bert` component has task-specific parameters and parameters that are common to all tasks. The task-specific parameters are provided inside the `tasks` parameter. `tasks` is a dictionary whose keys are task names and whose values are task-specific parameters.
Inputs and labels of a `multitask_pal_bert` component are distributed between the tasks according to the `in_distribution` and `in_y_distribution` parameters. First, the `in` and `in_y` elements have to be grouped by task, and the first element of `in` should be the task ID extracted by the `multitask_pal_bert_preprocessor`, followed by the inputs for each task, e.g. the arguments for the first task, then the arguments for the second task, and so on. Second, the order of tasks in `in` and `in_y` has to be the same as the order of tasks in the `in_distribution` and `in_y_distribution` parameters. If the `in` and `in_y` parameters are dictionaries, you may make the `in_distribution` and `in_y_distribution` parameters dictionaries whose keys are task names and whose values are lists of elements of `in` or `in_y`. If you use gradient accumulation, you also need to provide the `gradient_accumulation_steps` and `steps_per_epoch` parameters.
{
  "id": "multitask_pal_bert",
  "class_name": "multitask_pal_bert",
  "pretrained_bert": "{PRETRAINED_BERT}/pytorch_model.bin",
  "optimizer_parameters": {"lr": 3e-5},
  "learning_rate_drop_patience": 2,
  "learning_rate_drop_div": 2.0,
  "return_probas": true,
  "save_path": "{MODELS_PATH}/model",
  "load_path": "{MODELS_PATH}/model",
  "tasks": {
    "insults1": {
      "n_classes": "#vocab_insults1.len"
    },
    "insults2": {
      "n_classes": "#vocab_insults2.len"
    }
  },
  "in_distribution": {
    "insults1": 1,
    "insults2": 1
  },
  "in": [
    "task_id",
    "bert_features_insults1",
    "bert_features_insults2"
  ],
  "in_y_distribution": {
    "insults1": 1,
    "insults2": 1
  },
  "in_y": [
    "y_ids_insults1",
    "y_ids_insults2"
  ],
  "out": [
    "y_insults1_pred_probas",
    "y_insults2_pred_probas"
  ]
}
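To make the role of `in_distribution` and `in_y_distribution` more concrete, here is a small sketch of how a flat list of names could be assigned to tasks when the distribution maps each task to a number of elements. It is an assumption about the convention described above, not the component's actual code; the helper `distribute` and its `skip` argument (used to step over the leading `task_id`) are hypothetical.

```python
from typing import Dict, List


def distribute(
    names: List[str], distribution: Dict[str, int], skip: int = 0
) -> Dict[str, List[str]]:
    """Assign consecutive elements of `names` to tasks in distribution order."""
    result: Dict[str, List[str]] = {}
    start = skip  # elements before `skip` (e.g. the task ID) belong to no task
    for task, count in distribution.items():
        result[task] = names[start:start + count]
        start += count
    if start != len(names):
        raise ValueError("distribution does not cover all elements")
    return result


inputs = distribute(
    ["task_id", "bert_features_insults1", "bert_features_insults2"],
    {"insults1": 1, "insults2": 1},
    skip=1,
)
labels = distribute(
    ["y_ids_insults1", "y_ids_insults2"],
    {"insults1": 1, "insults2": 1},
)
```

Because the assignment is purely positional, the order of tasks in `in` and `in_y` has to match the order in the distribution dictionaries, which is the constraint stated in the text.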
- Initially, the plan was to start with the CA-MTL model, but the authors' code had some bugs and incomplete sampling methods, so I started by replicating the results of the Bert-n-Pals model.
- After the model was implemented, its memory usage was quite high compared to the original implementation. This took a lot of time to fix.
- After that, the inference time was also higher, so more time was spent on fixing that.
- The authors of CA-MTL did push the remaining code, and I replicated the results, but due to lack of time that model was not implemented.
- The implemented model does not perform well on the MNLI GLUE task, which is unexpected; more work can be done towards fixing that.
Feel free to reach out to me if you have any questions about my projects or GSoC. You can find me on Twitter @rimijoker