Google Summer of Code 2021 (DeepPavlov) Final Work Product

Student Anshuman Singh
Github @rimijoker
Organisation DeepPavlov
Project Refactor MultiTask Bert

Abstract

The DeepPavlov library includes many state-of-the-art NLP techniques, and multi-task BERT is one of them.

Multi-task learning shares information between related tasks, reducing the number of parameters required. Previous state-of-the-art results on the natural language understanding tasks in the GLUE benchmark used transfer learning from a large task: unsupervised training with BERT, where a separate BERT model was fine-tuned for each task.

In multi-task BERT we share a single BERT model along with a small number of task-specific parameters and match the performance of separately fine-tuned models with fewer parameters on the GLUE benchmark.

Currently, multi-task BERT in DeepPavlov is implemented in TensorFlow, and it needs to be refactored so that DeepPavlov uses a newer framework such as PyTorch.

The refactored code also needs to incorporate techniques such as PAL-BERT, CA-MTL, and MT-DNN into the DeepPavlov library, matching the GLUE benchmark results reported for each technique.

DeepPavlov: Github Repository

Project Issue: Refactor Multitask Bert

Replication of the results from their respective papers:

Pull-Requests:

Add Multi-Task Pal Bert in DeepPavlov:

  • Implemented the MultiTask Pal Bert Iterator
  • Implemented the MultiTask Pal Bert Preprocessor
  • Implemented the MultiTask Pal Bert Model
  • Added gradient accumulation support for the model, which other DeepPavlov models do not have.
  • 35% savings on VRAM when compared with two single Bert models while achieving better scores on the GLUE Benchmark.
  • Documented the MultiTask Pal Bert Iterator, MultiTask Pal Bert Preprocessor, Pal Bert Model.
  • Added a Colab tutorial: mt_pal_bert_mrpc_rte_tutorial
  • Added a config for full GLUE training.

Documentation

Multi-task PAL BERT in DeepPavlov

Multi-task PAL BERT in DeepPavlov is an implementation of the BERT training algorithm published in the paper "BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning".

Multitask BERT and PALs paper: https://arxiv.org/pdf/1902.02671.pdf

The idea is to share the BERT body between several tasks. If a model pipe has several components using BERT and the amount of GPU memory is limited, this sharing can save a significant amount of memory. Each task has its own 'classifier' part attached to the output of the BERT encoder. If multi-task BERT has T heads, one training iteration consists of

  • composing T mini-batches: one for the task being trained (as specified by the task ID), while the rest are dummy/sample batches,
  • n gradient steps for the task specified by the task ID, where n is given by gradient_accumulation_steps (1 by default).

When one of the BERT heads is being trained, other heads' parameters do not change. On each training step, both BERT head and body parameters are modified.
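
The iteration described above can be sketched in plain Python. This is a toy illustration, not the DeepPavlov implementation: the scalar body and head weights, the function name train_iteration, and the squared-error loss are all assumptions made for the example. It shows gradients being accumulated over several micro-batches before a single update, with only the active task's head (and the shared body) changing.

```python
from typing import Dict, List, Tuple

Batch = List[Tuple[float, float]]  # (x, y) pairs


def train_iteration(body: float, heads: Dict[str, float], task: str,
                    micro_batches: List[Batch], lr: float = 0.01):
    """One parameter update for `task` (hypothetical toy model).

    Prediction is body * heads[task] * x and the loss is the mean squared error.
    len(micro_batches) plays the role of gradient_accumulation_steps.
    """
    n = len(micro_batches)
    grad_body, grad_head = 0.0, 0.0
    for batch in micro_batches:
        for x, y in batch:
            err = body * heads[task] * x - y
            # Scale each contribution so the accumulated gradient is an
            # average over samples and over micro-batches.
            grad_body += 2 * err * heads[task] * x / (len(batch) * n)
            grad_head += 2 * err * body * x / (len(batch) * n)
    # Single update after accumulation; heads of other tasks are untouched.
    new_heads = dict(heads)
    new_heads[task] = heads[task] - lr * grad_head
    return body - lr * grad_body, new_heads


body, heads = 1.0, {"mrpc": 1.0, "rte": 1.0}
micro_batches = [[(1.0, 2.0)], [(2.0, 4.0)]]  # two accumulation steps
body, heads = train_iteration(body, heads, "mrpc", micro_batches)
```

After the call, the shared body and the mrpc head have moved, while the rte head keeps its initial value.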

You can also follow this tutorial, in which we train a model on the MRPC and RTE datasets on Colab: mt_pal_bert_mrpc_rte_tutorial

On this page, multi-task PAL BERT usage is explained using a toy configuration file of a model that detects insults (for the demonstration, we will use the same data for both tasks).

We start with the metadata field of the configuration file. The multi-task PAL BERT model is saved in {"MODELS_PATH": "{ROOT_PATH}/models"}. The download field of the multi-task PAL BERT configuration file is a union of the download fields of the original configs, without pre-trained models. The metadata field of our config is given below.

{
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "PRETRAINED_BERT": "{ROOT_PATH}/pretrained_bert",
      "MODELS_PATH": "{ROOT_PATH}/models"
    },
    "download": [
      {
        "url": "http://files.deeppavlov.ai/datasets/insults_data.tar.gz",
        "subdir": "{DOWNLOADS_PATH}"
      },
      {
        "url": "https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin",
        "subdir": "{PRETRAINED_BERT}"
      }
    ]
  }
}
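
To make the {NAME} placeholders above concrete, here is a minimal sketch of how such variables could be expanded. The helper resolve_variables is hypothetical and written only for this illustration; it is not DeepPavlov's own config parser, and it assumes each variable refers only to variables defined before it.

```python
from typing import Dict


def resolve_variables(variables: Dict[str, str]) -> Dict[str, str]:
    """Expand {NAME} placeholders using previously defined variables."""
    resolved: Dict[str, str] = {}
    for name, value in variables.items():
        # str.format replaces {NAME} with the already resolved value.
        resolved[name] = value.format(**resolved)
    return resolved


variables = {
    "ROOT_PATH": "~/.deeppavlov",
    "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
    "PRETRAINED_BERT": "{ROOT_PATH}/pretrained_bert",
    "MODELS_PATH": "{ROOT_PATH}/models",
}
resolved = resolve_variables(variables)

# The same substitution applies to other fields, e.g. a download "subdir".
subdir = "{DOWNLOADS_PATH}".format(**resolved)
```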

Train config

Data reading and iteration are performed by multitask_reader and multitask_pal_bert_iterator. These classes are composed of task readers and iterators and generate batches that contain data from heterogeneous datasets.

A multitask_reader configuration has parameters class_name, data_path, and tasks. The data_path field may be any string because data paths are passed for tasks individually in the tasks parameter. However, you cannot drop the data_path parameter because it is obligatory for the dataset reader configuration. The tasks parameter is a dictionary of task dataset reader configurations. In configurations of task readers, the reader_class_name parameter is used instead of class_name. The dataset reader configuration is given below:

{
  "dataset_reader": {
    "class_name": "multitask_reader",
    "data_path": "null",
    "tasks": {
      "insults": {
        "reader_class_name": "basic_classification_reader",
        "x": "Comment",
        "y": "Class",
        "data_path": "{DOWNLOADS_PATH}/insults_data"
      },
      "insults1": {
        "reader_class_name": "basic_classification_reader",
        "x": "Comment",
        "y": "Class",
        "data_path": "{DOWNLOADS_PATH}/insults_data"
      }
    }
  }
}

A multitask_pal_bert_iterator configuration has parameters num_train_epochs, steps_per_epoch, class_name and tasks. The tasks parameter is a dictionary of configurations of task iterators. In configurations of task iterators, iterator_class_name is used instead of class_name. Also provide gradient_accumulation_steps if you use gradient accumulation. The dataset iterator configuration is as follows:

{
  "dataset_iterator": {
    "class_name": "multitask_pal_bert_iterator",
    "num_train_epochs": 5,
    "steps_per_epoch": 100,
    "tasks": {
      "insults": {
        "iterator_class_name": "basic_classification_iterator",
        "seed": 42
      },
      "insults1": {
        "iterator_class_name": "basic_classification_iterator",
        "seed": 42
      }
    }
  }
}

Batches generated by multitask_pal_bert_iterator are tuples of two elements: inputs of the model and labels. Both inputs and labels are lists of tuples. The inputs have the following format: [(first_task_inputs[0], second_task_inputs[0], ...), (first_task_inputs[1], second_task_inputs[1], ...), ...], where first_task_inputs, second_task_inputs, and so on are x values of batches from task dataset iterators. The labels have a similar format. The inputs also carry the task ID, which is extracted later by the multitask_pal_bert_preprocessor.
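
The batch layout described above can be reproduced with a few lines of Python. This is only a sketch of the data shape: the helper names merge_task_batches and split_by_task are hypothetical, the sample strings are made up, and the task ID that accompanies the real inputs is left out.

```python
from typing import List, Sequence, Tuple


def merge_task_batches(*task_batches: Sequence) -> List[Tuple]:
    """Zip per-task batches into a list of per-sample tuples.

    Note that zip stops at the shortest batch.
    """
    return list(zip(*task_batches))


def split_by_task(merged: List[Tuple]) -> List[list]:
    """Inverse operation: recover one list of samples per task."""
    return [list(column) for column in zip(*merged)]


first_task_inputs = ["you are great", "nice work"]
second_task_inputs = ["go away", "well done"]

inputs = merge_task_batches(first_task_inputs, second_task_inputs)
per_task = split_by_task(inputs)
```

Here inputs[0] holds the first sample of every task, and per_task[0] is again the full batch of the first task.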

In this tutorial, there are 2 datasets. Considering the batch structure, chainer inputs are:

{
  "in": ["x_insults1_with_id", "x_insults2_with_id"],
  "in_y": ["y_insults1", "y_insults2"]
}

To extract the task id from the inputs, we need to use the multitask_pal_bert_preprocessor component, which has parameters class_name, in and out. The first element of out is always the task id, and the remaining elements must keep the same relative order as the task inputs.

{
      "class_name": "multitask_pal_bert_preprocessor",
      "in": ["x_insults1_with_id", "x_insults2_with_id"],
      "out": ["task_id", "x_insults", "x_insults2"]
},
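
The role of this component can be pictured with the sketch below. It rests on a loud assumption: the real layout of the inputs with IDs is not described here, so the example pretends every sample is a (task_id, text) pair. The function split_task_id is hypothetical and is not the DeepPavlov preprocessor.

```python
from typing import List, Tuple


def split_task_id(*task_inputs: List[Tuple[int, str]]):
    """Return (task_id, inputs_task_1, inputs_task_2, ...).

    ASSUMPTION: each sample is a (task_id, text) pair and all samples in
    one step share the same task_id.
    """
    task_id = task_inputs[0][0][0]
    # Strip the ID from every sample, keeping the order of the tasks.
    stripped = [[text for _, text in batch] for batch in task_inputs]
    return (task_id, *stripped)


x_insults1_with_id = [(0, "you are great"), (0, "nice work")]
x_insults2_with_id = [(0, "dummy"), (0, "dummy")]

task_id, x_first, x_second = split_task_id(x_insults1_with_id, x_insults2_with_id)
```

The first output is the task ID, followed by one output per task in the same order as the inputs, mirroring the in and out lists of the config fragment.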

Sometimes a task dataset iterator returns inputs or labels consisting of more than one element. For example, in the model config mt_bert_train_tutorial.json (kbqa/kbqa_mt_bert_train.json), the siamese_iterator input element consists of 2 strings. If such a variable needs to be split, the InputSplitter component can be used.

Data preparation steps in the pipe of tutorial config are similar to data preparation steps in the original configs except for the names of the variables.

A multitask_pal_bert component has task-specific parameters and parameters that are common to all tasks. The task-specific parameters are provided inside the tasks parameter, a dictionary whose keys are task names and whose values are the parameters of each task.

Inputs and labels of a multitask_pal_bert component are distributed between the tasks according to the in_distribution and in_y_distribution parameters. First, in and in_y elements have to be grouped by task: the first element of in must be the task id extracted by the multitask_pal_bert_preprocessor, followed by the arguments for the first task, then the arguments for the second task, and so on. Second, the order of tasks in in and in_y has to be the same as the order of tasks in the in_distribution and in_y_distribution parameters. If the in and in_y parameters are dictionaries, you may make in_distribution and in_y_distribution dictionaries whose keys are task names and whose values are lists of elements of in or in_y. If using gradient accumulation, you also need to provide the gradient_accumulation_steps and steps_per_epoch parameters.

{
    "id": "multitask_pal_bert",
    "class_name": "multitask_pal_bert",
    "pretrained_bert": "{PRETRAINED_BERT}/pytorch_model.bin",
    "optimizer_parameters": {"lr": 3e-5},
    "learning_rate_drop_patience": 2,
    "learning_rate_drop_div": 2.0,
    "return_probas": true,
    "save_path": "{MODELS_PATH}/model",
    "load_path": "{MODELS_PATH}/model",
    "tasks": {
        "insults1": {
            "n_classes": "#vocab_insults1.len"
        },
        "insults2": {
            "n_classes": "#vocab_insults2.len"
        }
    },
    "in_distribution": {
        "insults1": 1,
        "insults2": 1
    },
    "in": [
        "task_id",
        "bert_features_insults1",
        "bert_features_insults2"
    ],
    "in_y_distribution": {
        "insults1": 1,
        "insults2": 1
    },
    "in_y": [
        "y_ids_insults1",
        "y_ids_insults2"
    ],
    "out": [
        "y_insults1_pred_probas",
        "y_insults2_pred_probas"
    ]
}
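
The way in_distribution assigns the elements of in to tasks can be sketched as follows. The function distribute is a hypothetical helper written for this illustration, assuming the distribution maps each task name to the number of consecutive elements it consumes; it is not taken from the DeepPavlov source.

```python
from typing import Dict, List


def distribute(values: List[str], distribution: Dict[str, int]) -> Dict[str, List[str]]:
    """Give each task its share of `values`, in the order of `distribution`."""
    result: Dict[str, List[str]] = {}
    start = 0
    for task, count in distribution.items():
        result[task] = values[start:start + count]
        start += count
    if start != len(values):
        raise ValueError("distribution does not cover all values")
    return result


component_in = ["task_id", "bert_features_insults1", "bert_features_insults2"]
in_distribution = {"insults1": 1, "insults2": 1}

# The first element is the task id; the rest is split between the tasks.
task_id_name, task_args = component_in[0], component_in[1:]
by_task = distribute(task_args, in_distribution)
```

With the config above, each task receives exactly one input, which is why the order of tasks in in must match the order in in_distribution.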

Changes in plans with respect to Proposal

  • Initially, the plan was to start with the CA-MTL model, but the authors' code had some bugs and incomplete sampling methods, so I started with replicating the results of the Bert-n-Pals model.
  • After the model was implemented, its memory usage was quite high compared to the original implementation. This took a lot of time to fix.
  • After that, the inference time was also higher, so more time was spent fixing that.
  • The authors of CA-MTL later pushed the remaining code and I replicated the results, but due to lack of time that model was not implemented.

Future Work

  • The implemented model does not perform well on the MNLI GLUE task, which is unexpected. More work can be done to fix that.

Other

Feel free to reach out to me if you have any questions about my projects or GSoC. You can find me on Twitter @rimijoker
