Google Summer of Code 2021 Final Work Product


Student: Anshuman Singh
Github: @rimijoker
Organisation: DeepPavlov
Project: Refactor MultiTask Bert


Abstract

The DeepPavlov library contains many state-of-the-art NLP techniques, and multi-task BERT is one of them.
Multi-task learning shares information between related tasks, reducing the number of parameters required. State-of-the-art results across natural language understanding tasks in the GLUE benchmark have previously relied on transfer learning from a large task: unsupervised pre-training with BERT, after which a separate BERT model was fine-tuned for each task.
In multi-task BERT, we share a single BERT model along with a small number of task-specific parameters, and match the performance of separately fine-tuned models on the GLUE benchmark with fewer parameters.
Currently, multi-task BERT in DeepPavlov is implemented in TensorFlow and needs to be refactored so that DeepPavlov uses newer frameworks such as PyTorch.
The refactored code also needs to bring techniques such as PAL-BERT, CA-MTL, and MT-DNN into the DeepPavlov library, matching the GLUE benchmark results reported for the respective techniques.
DeepPavlov: Github Repository

Project Issue: Refactor Multitask Bert

Replication of the results from their respective papers:


Bert-n-Pals: https://gluebenchmark.com/submission/WOLUDc3uwbTdFn5y4hpj1Nh9pWi1/-MciWas_9fvwXZpnPCzo
CA-MTL: https://gluebenchmark.com/submission/4soAsU8UFyfUvRvJRWzSHfGMEB03/-MgV7AEqgUctoJxdwAHe

Pull-Requests:

Add Multi-Task Pal Bert in DeepPavlov:


Implemented the MultiTask Pal Bert Iterator
Implemented the MultiTask Pal Bert Preprocessor
Implemented the MultiTask Pal Bert Model
Added gradient accumulation support for the model, which other DeepPavlov models do not have.
Achieved 35% savings on VRAM compared with two single Bert models, while achieving better scores on the GLUE Benchmark.
Documented the MultiTask Pal Bert Iterator, MultiTask Pal Bert Preprocessor, and Pal Bert Model.
Added a Colab tutorial for the model: mt_pal_bert_mrpc_rte_tutorial
Added a config for full GLUE training.

Documentation

Multi-task PAL BERT in DeepPavlov

Multi-task PAL BERT in DeepPavlov is an implementation of the BERT training
algorithm published in the paper "BERT and PALs: Projected Attention
Layers for Efficient Adaptation in Multi-Task Learning".
Multitask BERT and PALs paper: https://arxiv.org/pdf/1902.02671.pdf
The idea is to share the BERT body between several tasks. If a model
pipe has several components that use BERT and GPU memory is limited,
this sharing can save a significant amount of memory. Each task has
its own 'classifier' part attached to the output of the BERT encoder. If
multi-task BERT has T heads, one training iteration consists of

composing T mini-batches: one for the task to be trained on (as
specified by the task ID), while the rest are dummy/sample batches,
n gradient steps, with n taken from gradient_accumulation_steps (1 by
default), for the task specified by the task ID.

When one of the BERT heads is being trained, the other heads' parameters do
not change. On each training step, both the BERT head and body parameters
are modified.
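The general idea behind gradient accumulation can be illustrated with a small, framework-free sketch. It is not DeepPavlov's training loop: the one-parameter model, the grad helper, and the data are all made up for illustration. It only shows the standard technique, in which gradients from several mini-batches are averaged before a single parameter update, so the update matches one step on the combined batch.

```python
# Illustrative only: gradient accumulation on a toy one-parameter model y = w * x.
# Nothing here is DeepPavlov code; all names are hypothetical.

def grad(w, batch):
    """Mean gradient of the squared error (w*x - y)^2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, mini_batches, lr):
    """Accumulate gradients over all mini-batches, then update w once."""
    n = len(mini_batches)  # plays the role of gradient_accumulation_steps
    accumulated = 0.0
    for batch in mini_batches:
        accumulated += grad(w, batch) / n  # scale so the sum is a mean
    return w - lr * accumulated

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
mini_batches = [data[:2], data[2:]]  # two mini-batches of equal size

w_accumulated = accumulated_step(0.0, mini_batches, lr=0.01)
w_full_batch = 0.0 - 0.01 * grad(0.0, data)  # one step on the full batch

# With equal-sized mini-batches the two updates coincide (up to rounding).
print(w_accumulated, w_full_batch)
```

The benefit in practice is memory: only one mini-batch needs to be held on the GPU at a time, while the update behaves as if a larger batch had been used.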
You can also follow this tutorial, in which we train a model on the MRPC and
RTE datasets on Colab:
mt_pal_bert_mrpc_rte_tutorial
On this page, multi-task PAL BERT usage is explained using a toy
configuration file for a model that detects insults (for the
demonstration, we use the same data for both tasks).
We start with the metadata field of the configuration file. The multi-task
PAL BERT model is saved in {"MODELS_PATH": "{ROOT_PATH}/models"}. The
download field of the multi-task PAL BERT configuration file is the union of
the download fields of the original configs, without pre-trained models. The
metadata field of our config is given below.
{
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "PRETRAINED_BERT": "{ROOT_PATH}/pretrained_bert",
      "MODELS_PATH": "{ROOT_PATH}/models"
    },
    "download": [
      {
        "url": "http://files.deeppavlov.ai/datasets/insults_data.tar.gz",
        "subdir": "{DOWNLOADS_PATH}"
      },
      {
        "url": "https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin",
        "subdir": "{PRETRAINED_BERT}"
      }
    ]
  }
}
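How the variables in the metadata above refer to each other can be sketched in a few lines of Python. This is a simplified stand-in written for illustration, not DeepPavlov's actual config parser: the resolve helper is hypothetical, and it assumes that each variable may only refer to variables defined before it.

```python
# Illustrative only: expand "{NAME}" references between config variables.
# This is a simplified stand-in, not DeepPavlov's actual config parser.
import os

variables = {
    "ROOT_PATH": "~/.deeppavlov",
    "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
    "PRETRAINED_BERT": "{ROOT_PATH}/pretrained_bert",
    "MODELS_PATH": "{ROOT_PATH}/models",
}

def resolve(variables):
    """Expand variables in order; each may refer to ones defined before it."""
    resolved = {}
    for name, value in variables.items():
        # str.format fills in "{ROOT_PATH}" etc.; expanduser handles "~".
        resolved[name] = os.path.expanduser(value.format(**resolved))
    return resolved

paths = resolve(variables)
print(paths["MODELS_PATH"])
```

With this reading, the model files referenced later as {MODELS_PATH}/model end up under the models directory inside ROOT_PATH.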

Train config

Data reading and iteration are performed by multitask_reader and
multitask_pal_bert_iterator. These classes are composed of task
readers and iterators and generate batches that contain data from
heterogeneous datasets.
A multitask_reader configuration has the parameters class_name,
data_path, and tasks. The data_path field may be any string, because
data paths are passed to tasks individually in the tasks parameter.
However, you cannot drop the data_path parameter, because it is
obligatory in a dataset reader configuration. The tasks parameter is a
dictionary of task dataset reader configurations. In the configurations of
task readers, the reader_class_name parameter is used instead of
class_name. The dataset reader configuration is given below:
{
  "dataset_reader": {
    "class_name": "multitask_reader",
    "data_path": "null",
    "tasks": {
      "insults": {
        "reader_class_name": "basic_classification_reader",
        "x": "Comment",
        "y": "Class",
        "data_path": "{DOWNLOADS_PATH}/insults_data"
      },
      "insults1": {
        "reader_class_name": "basic_classification_reader",
        "x": "Comment",
        "y": "Class",
        "data_path": "{DOWNLOADS_PATH}/insults_data"
      }
    }
  }
}

A multitask_pal_bert_iterator configuration has the parameters
num_train_epochs, steps_per_epoch, class_name, and tasks. tasks
is a dictionary of task iterator configurations. In the configurations
of task iterators, iterator_class_name is used instead of class_name.
Also provide gradient_accumulation_steps if you use gradient
accumulation. The dataset iterator configuration is as follows:
{
  "dataset_iterator": {
    "class_name": "multitask_pal_bert_iterator",
    "num_train_epochs": 5,
    "steps_per_epoch": 100,
    "tasks": {
      "insults": {
        "iterator_class_name": "basic_classification_iterator",
        "seed": 42
      },
      "insults1": {
        "iterator_class_name": "basic_classification_iterator",
        "seed": 42
      }
    }
  }
}

Batches generated by multitask_pal_bert_iterator are tuples of two elements:
the inputs of the model and the labels. Both inputs and labels are lists of
tuples. The inputs have the following format:
[(first_task_inputs[0], second_task_inputs[0], ...), (first_task_inputs[1], second_task_inputs[1], ...), ...]
where first_task_inputs, second_task_inputs, and so on are the x values
of batches from the task dataset iterators. The labels have a similar
format. The inputs also carry the task ID, which is
extracted later by the multitask_pal_bert_preprocessor.
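The batch layout described above can be sketched in plain Python. The sample texts, labels, and variable names below are made up, the per-task batches are assumed to have equal length, and the task ID is left out because its exact placement in the inputs is not shown here.

```python
# Illustrative only: regroup per-task batches into per-sample tuples.
# The sample texts, labels, and variable names are made up for this sketch.

first_task_inputs = ["comment a", "comment b", "comment c"]
second_task_inputs = ["comment d", "comment e", "comment f"]

first_task_labels = ["Insult", "Not Insult", "Insult"]
second_task_labels = ["Not Insult", "Not Insult", "Insult"]

# zip puts the i-th element of every task batch into the i-th tuple.
inputs = list(zip(first_task_inputs, second_task_inputs))
labels = list(zip(first_task_labels, second_task_labels))

# A batch is a tuple of two elements: model inputs and labels.
batch = (inputs, labels)
print(batch[0][0])
```

Each tuple in inputs therefore holds one sample per task, in task order, and labels follows the same layout.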
In this tutorial, there are 2 datasets. Considering the batch structure,
chainer inputs are:
{
  "in": ["x_insults1_with_id", "x_insults2_with_id"],
  "in_y": ["y_insults1", "y_insults2"]
}

To extract the task ID from the inputs, we need to use the
multitask_pal_bert_preprocessor component, which has the parameters
class_name, in, and out. The first element of out is always the task ID;
make sure the task inputs that follow keep the same relative order as in in.
{
  "class_name": "multitask_pal_bert_preprocessor",
  "in": ["x_insults1_with_id", "x_insults2_with_id"],
  "out": ["task_id", "x_insults1", "x_insults2"]
}

Sometimes a task dataset iterator returns inputs or labels consisting of
more than one element. For example, in the model
mt_bert_train_tutorial.json <kbqa/kbqa_mt_bert_train.json>, the
siamese_iterator input element consists of 2 strings. If such a variable
needs to be split, the InputSplitter component can be
used.
The data preparation steps in the pipe of the tutorial config are similar to
the data preparation steps in the original configs, except for the names of
the variables.
A multitask_pal_bert component has task-specific parameters and
parameters that are common to all tasks. The task-specific parameters are
provided inside the tasks parameter, a dictionary whose
keys are task names and whose values are task-specific parameters.
The inputs and labels of a multitask_pal_bert component are distributed
between the tasks according to the in_distribution and
in_y_distribution parameters. First, the in and in_y elements have to
be grouped by task. The first element of in should be the task
ID extracted by the multitask_pal_bert_preprocessor, followed by the
inputs for each task, e.g. the arguments for the first task, then the
arguments for the second task, and so on. Second, the order of tasks in
in and in_y has to be the same as the order of tasks in the
in_distribution and in_y_distribution parameters. If the in and in_y
parameters are dictionaries, you may make the in_distribution and
in_y_distribution parameters dictionaries whose keys are task names and whose
values are lists of elements of in or in_y. If you use gradient
accumulation, you also need to provide the gradient_accumulation_steps
and steps_per_epoch parameters.
{
    "id": "multitask_pal_bert",
    "class_name": "multitask_pal_bert",
    "pretrained_bert": "{PRETRAINED_BERT}/pytorch_model.bin",
    "optimizer_parameters": {"lr": 3e-5},
    "learning_rate_drop_patience": 2,
    "learning_rate_drop_div": 2.0,
    "return_probas": true,
    "save_path": "{MODELS_PATH}/model",
    "load_path": "{MODELS_PATH}/model",
    "tasks": {
        "insults1": {
            "n_classes": "#vocab_insults1.len"
        },
        "insults2": {
            "n_classes": "#vocab_insults2.len"
        }
    },
    "in_distribution": {
        "insults1": 1,
        "insults2": 1
    },
    "in": [
        "task_id",
        "bert_features_insults1",
        "bert_features_insults2"
    ],
    "in_y_distribution": {
        "insults1": 1,
        "insults2": 1
    },
    "in_y": [
        "y_ids_insults1",
        "y_ids_insults2"
    ],
    "out": [
        "y_insults1_pred_probas",
        "y_insults2_pred_probas"
    ]
}
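One way to read in_distribution in the config above is as a count of consecutive in elements per task, taken after the leading task ID. The sketch below illustrates that reading; split_by_distribution is a hypothetical helper written for this page, not a DeepPavlov function, and the component's real internals may differ.

```python
# Illustrative only: split a flat argument list between tasks by count.
# `split_by_distribution` is a hypothetical helper, not a DeepPavlov function.

def split_by_distribution(args, distribution):
    """Give each task the next `count` arguments, in dictionary order."""
    if sum(distribution.values()) != len(args):
        raise ValueError("distribution does not cover all arguments")
    per_task, start = {}, 0
    for task, count in distribution.items():
        per_task[task] = args[start:start + count]
        start += count
    return per_task

chain_in = ["task_id", "bert_features_insults1", "bert_features_insults2"]
in_distribution = {"insults1": 1, "insults2": 1}

task_id, task_args = chain_in[0], chain_in[1:]  # the task ID comes first
per_task_inputs = split_by_distribution(task_args, in_distribution)
print(per_task_inputs)
```

This is why the order of tasks in in must match the order in in_distribution: the split is positional, so swapping two tasks in only one of the two places would hand each task the other's features.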

Changes in plans with respect to Proposal


Initially, the plan was to start with the CA-MTL model, but the authors' code had some bugs and incomplete sampling methods, so I started with replicating the results of the Bert-n-Pals model.
After the model was implemented, its memory usage was quite high compared to the original implementation. This took a lot of time to fix.
After that, the inference time turned out to be high, so more time was spent fixing that.
The authors of CA-MTL later pushed the remaining code, and I replicated the results, but due to lack of time that model was not implemented.

Future Work


The implemented model does not perform well on the MNLI GLUE task, which is unexpected. More work can be done towards fixing that.

Other

Feel free to reach out to me if you have any questions about my projects or GSoC. You can find me on Twitter: @rimijoker