Notes on Multi-task learning for NLP

Multi-task learning (MTL) tackles the overfitting and data-scarcity problems of deep learning methods by introducing useful information from related tasks, aiming to improve performance on multiple related tasks simultaneously.

MTL trains machine learning models on multiple related tasks simultaneously or enhances the model for a specific task using auxiliary tasks. Learning from multiple tasks makes it possible for models to capture generalized and complementary knowledge from the tasks at hand, in addition to task-specific features. MTL architectures used in NLP tasks can be categorized into four classes: parallel, hierarchical, modular, and generative adversarial architectures.

The parallel architecture shares the bulk of the model among multiple tasks while each task has its own task-specific output layer. The hierarchical architecture models the hierarchical relationships between tasks. Such an architecture can hierarchically combine features from different tasks, take the output of one task as the input of another task, or explicitly model the interaction between tasks. The modular architecture decomposes the whole model into shared components and task-specific components that learn task-invariant and task-specific features, respectively. The generative adversarial architecture borrows the idea of generative adversarial networks (GANs) to improve the capabilities of existing models.

Hard vs soft parameter sharing
Hard parameter sharing refers to sharing the same model parameters among multiple tasks, and it is the most widely used approach in multi-task learning models. Soft parameter sharing, on the other hand, constrains the distance between corresponding parameters of different task models, e.g., via a Euclidean-distance or correlation-matrix penalty, to force certain parameters to be similar.
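
As a rough illustration, here is a minimal sketch of the two sharing schemes, assuming PyTorch; the model, class names, and dimensions are made up. Hard sharing reuses one encoder for all task heads, while soft sharing keeps a separate encoder per task and adds a Euclidean-distance penalty between their parameters.

```python
import torch
import torch.nn as nn

# Hard sharing: one encoder instance is reused by every task head.
class HardSharingModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes_per_task):
        super().__init__()
        self.shared_encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # One task-specific output layer on top of the shared representation.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, c) for c in num_classes_per_task])

    def forward(self, x, task_id):
        return self.heads[task_id](self.shared_encoder(x))

# Soft sharing: each task keeps its own encoder; a squared Euclidean distance
# between corresponding parameters is added to the loss to keep them similar.
def soft_sharing_penalty(encoder_a, encoder_b):
    penalty = 0.0
    for p_a, p_b in zip(encoder_a.parameters(), encoder_b.parameters()):
        penalty = penalty + torch.sum((p_a - p_b) ** 2)
    return penalty
```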

Model Architecture

Parallel Architectures

In such architectures, the models for different tasks run in parallel and share certain intermediate layers. There is no dependency among tasks other than layer sharing, and hence no constraint on the order of training samples from each task.

Tree-like Architecture

Models of different tasks share a base feature extractor (trunk) followed by task-specific encoders and output layers (branches). This single trunk forces all tasks to share the same low-level feature representation, which may limit the expressive power of the model for each task.
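
A minimal sketch of a tree-like model, assuming PyTorch; the embedding + BiLSTM trunk and the branch sizes are illustrative choices, not prescribed by the notes above.

```python
import torch
import torch.nn as nn

# Tree-like parallel architecture: a shared trunk feeds task-specific
# branches, each ending in its own output layer.
class TreeLikeMTL(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, task_num_labels):
        super().__init__()
        # Trunk: shared low-level feature extractor (embedding + BiLSTM here).
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.trunk = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        # Branches: one task-specific encoder and output layer per task.
        self.branch_encoders = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, hidden_dim) for _ in task_num_labels])
        self.outputs = nn.ModuleList(
            [nn.Linear(hidden_dim, n) for n in task_num_labels])

    def forward(self, token_ids, task_id):
        shared, _ = self.trunk(self.embedding(token_ids))
        pooled = shared.mean(dim=1)  # simple sentence-level representation
        branch = torch.relu(self.branch_encoders[task_id](pooled))
        return self.outputs[task_id](branch)
```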

Parallel Feature Fusion

Models actively combine features from different tasks, including shared and task-specific features, to form representations for each task. Models can use a globally shared encoder to produce shared representations that can be used as additional features for each task-specific model. Shared features can also be selectively chosen according to their importance for a task.
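
A minimal sketch of one fusion variant, assuming PyTorch: a task-specific model concatenates its private features with features from a globally shared encoder, and a learned gate selects how much of the shared representation to use. The gating scheme is only illustrative.

```python
import torch
import torch.nn as nn

# Parallel feature fusion: fuse private (task-specific) features with
# shared features produced by a globally shared encoder.
class FusionTaskModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_labels):
        super().__init__()
        self.private_encoder = nn.Linear(input_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, x, shared_features):
        private = torch.relu(self.private_encoder(x))
        # Gate in [0, 1] selects how important the shared features are here.
        g = torch.sigmoid(self.gate(torch.cat([private, shared_features], -1)))
        fused = torch.cat([private, g * shared_features], dim=-1)
        return self.classifier(fused)
```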

Supervision at Different Feature Levels

Models using the parallel architecture handle multiple tasks in parallel. These tasks may concern features at different abstraction levels. For NLP tasks, such levels can be character-level, token-level, sentence-level, paragraph-level, and document-level. It is natural to give supervision signals at different depths of an MTL model for tasks at different levels. In some settings where MTL is used to improve the performance of a primary task, the introduction of auxiliary tasks at different levels could be helpful.
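
A minimal sketch of supervision at two depths, assuming PyTorch: a token-level head (e.g., POS tagging) is attached after a lower layer, and a sentence-level head after a higher layer. The tasks and layer choices are illustrative.

```python
import torch
import torch.nn as nn

# Supervision at different feature levels: the token-level task is supervised
# on the lower layer's outputs, the sentence-level task on the upper layer's.
class MultiLevelModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim,
                 num_pos_tags, num_sent_labels):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lower = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.upper = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.pos_head = nn.Linear(hidden_dim, num_pos_tags)      # token level
        self.sent_head = nn.Linear(hidden_dim, num_sent_labels)  # sentence level

    def forward(self, token_ids):
        low, _ = self.lower(self.embedding(token_ids))
        high, _ = self.upper(low)
        pos_logits = self.pos_head(low)                 # per-token predictions
        sent_logits = self.sent_head(high.mean(dim=1))  # pooled prediction
        return pos_logits, sent_logits
```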

Hierarchical Architectures

The hierarchical architecture considers hierarchical relationships among multiple tasks. The features and output of one task can be used by another task as an extra input or additional control signals.

Hierarchical Feature Fusion

Hierarchical feature fusion explicitly combines features at different depths and allows different processing for different features.

Hierarchical Pipeline

A hierarchical pipeline treats the output of one task as an extra input of another task. The extra input can be used directly as input features or indirectly as control signals to enhance the performance of other tasks. Such pipelines can be further divided into feature pipelines and signal pipelines.

In a hierarchical feature pipeline, the output of one task is used as extra features for another task. The tasks are assumed to be directly related, so that outputs rather than hidden feature representations are helpful to other tasks. For example, the output of a question-review pair recognition model can be fed into a question answering model.

In a hierarchical signal pipeline, the outputs of tasks are used indirectly as external signals to help improve the performance of other tasks. For example, the predicted probability of the sentence extraction task can be used to weigh sentence embeddings for a document-level classification task.
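
A minimal sketch of the signal-pipeline example above, assuming PyTorch: the extraction task's predicted probabilities weight the sentence embeddings before document-level classification. Shapes and inputs are toy values.

```python
import torch

# Hierarchical signal pipeline: use the sentence-extraction task's predicted
# probabilities as weights over sentence embeddings for a document-level task.
def weighted_document_representation(sentence_embeddings, extraction_logits):
    # sentence_embeddings: (num_sentences, dim)
    # extraction_logits:   (num_sentences,) from the extraction task head
    weights = torch.sigmoid(extraction_logits)   # extraction probabilities
    weights = weights / (weights.sum() + 1e-8)   # normalize to sum to 1
    return (weights.unsqueeze(-1) * sentence_embeddings).sum(dim=0)

# Toy usage: 5 sentences with 128-dimensional embeddings.
doc_vector = weighted_document_representation(torch.randn(5, 128),
                                              torch.randn(5))
```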

Hierarchical Interactive Architecture

This explicitly models the interactions between tasks via a multi-turn prediction mechanism, which allows a model to refine its predictions over multiple steps with the help of the previous outputs from other tasks, in a way similar to recurrent neural networks.
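
A minimal sketch of such a multi-turn mechanism, assuming PyTorch: two task heads repeatedly re-predict, each turn conditioning on the other task's previous (soft) output. The number of turns and the feedback scheme are illustrative choices, not a specific published architecture.

```python
import torch
import torch.nn as nn

# Hierarchical interactive architecture: task heads refine their predictions
# over several turns, each seeing the other task's latest prediction.
class InteractiveHeads(nn.Module):
    def __init__(self, hidden_dim, num_labels_a, num_labels_b, num_turns=3):
        super().__init__()
        self.num_turns = num_turns
        self.head_a = nn.Linear(hidden_dim + num_labels_b, num_labels_a)
        self.head_b = nn.Linear(hidden_dim + num_labels_a, num_labels_b)
        self.init_a = nn.Parameter(torch.zeros(num_labels_a))
        self.init_b = nn.Parameter(torch.zeros(num_labels_b))

    def forward(self, features):
        batch = features.size(0)
        pred_a = self.init_a.expand(batch, -1)
        pred_b = self.init_b.expand(batch, -1)
        for _ in range(self.num_turns):
            # Condition each task on the other task's previous soft prediction.
            new_a = self.head_a(torch.cat([features,
                                           torch.softmax(pred_b, -1)], -1))
            new_b = self.head_b(torch.cat([features,
                                           torch.softmax(pred_a, -1)], -1))
            pred_a, pred_b = new_a, new_b
        return pred_a, pred_b
```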

Modular Architectures

These decompose an MTL model into shared modules and task-specific modules. The shared modules learn shared features from multiple tasks. Since the shared modules can learn from many tasks, they can be sufficiently trained and can generalize better, which is particularly meaningful for low-resource scenarios. On the other hand, task-specific modules learn features that are specific to a certain task. Compared with shared modules, task-specific modules are usually much smaller and thus less likely to suffer from overfitting caused by insufficient training data. The simplest form of modular architecture is a single shared module coupled with task-specific modules, as in tree-like architectures.

When adapting large pre-trained models to downstream tasks, a common practice is to fine-tune a separate model for each task. While this approach usually attains good performance, it incurs heavy computational and storage costs. A more cost-efficient way is to add task-specific modules into a single shared pre-trained model. As an example, multi-task adapters adapt single-task models to multiple tasks by adding extra task-specific parameters (adapters).
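
A minimal adapter sketch, assuming PyTorch: a bottleneck adapter with a residual connection is attached per task to a frozen pre-trained layer, so only the small task-specific modules (and task heads) are trained. The `pretrained_layer` argument is a placeholder for any frozen backbone component.

```python
import torch
import torch.nn as nn

# Bottleneck adapter: project down, apply a nonlinearity, project back up,
# and add a residual connection so the pre-trained representation is kept.
class Adapter(nn.Module):
    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

# Wrap a frozen pre-trained layer with one adapter per task.
class AdaptedLayer(nn.Module):
    def __init__(self, pretrained_layer, hidden_dim, task_names):
        super().__init__()
        self.pretrained_layer = pretrained_layer
        for p in self.pretrained_layer.parameters():
            p.requires_grad = False          # shared backbone stays frozen
        self.adapters = nn.ModuleDict(
            {name: Adapter(hidden_dim) for name in task_names})

    def forward(self, x, task_name):
        return self.adapters[task_name](self.pretrained_layer(x))
```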

GAN Architectures

By introducing a discriminator that predicts which task a given training instance comes from, the shared feature extractor is forced to produce more generalized, task-invariant features, which improves the performance and robustness of the entire model.
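
A minimal sketch of the adversarial component, assuming PyTorch: a task discriminator classifies which task an instance came from, and a gradient reversal layer flips the gradient so that the shared extractor learns features the discriminator cannot exploit. Gradient reversal is one common way to implement this min-max game; specific published formulations may differ.

```python
import torch
import torch.nn as nn

# Gradient reversal: identity in the forward pass, negated (scaled) gradient
# in the backward pass, so the shared extractor is trained adversarially.
class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None

# Discriminator that predicts the source task of each shared feature vector.
class TaskDiscriminator(nn.Module):
    def __init__(self, hidden_dim, num_tasks, alpha=1.0):
        super().__init__()
        self.alpha = alpha
        self.classifier = nn.Linear(hidden_dim, num_tasks)

    def forward(self, shared_features):
        reversed_features = GradientReversal.apply(shared_features, self.alpha)
        return self.classifier(reversed_features)  # which task produced this?
```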

Optimization for MTL Models

Loss Construction

The most common approach to training an MTL model is to linearly combine the loss functions of different tasks into a single global loss function. In this way, the entire objective of the MTL model can be optimized through conventional learning techniques such as stochastic gradient descent with back-propagation. Each loss function can be weighted according to its importance, the simplest choice being equal weights. Adaptive weighting schemes can also be used, which adjust the importance of each task-specific loss during training.
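
A minimal sketch, assuming PyTorch: a fixed weighted sum of task losses, plus one possible adaptive variant that learns a log-variance per task (an uncertainty-style weighting; the exact scheme shown here is illustrative, not the only option).

```python
import torch
import torch.nn as nn

# Fixed weighting: global loss is a weighted linear combination of task losses.
def combined_loss(task_losses, weights):
    return sum(w * l for w, l in zip(weights, task_losses))

# Adaptive weighting: one learnable log-variance per task; tasks with higher
# estimated uncertainty contribute less to the total loss.
class AdaptiveLossWeights(nn.Module):
    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total
```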

Data Sampling

MTL poses the challenge of dealing with training datasets of multiple tasks with potentially different sizes and data distributions. To handle data imbalance, various data sampling techniques have been proposed to properly construct training datasets. In practice, given M tasks and their datasets {D_1, ..., D_M}, a sampling weight p_t is assigned to task t to control the probability of sampling a data batch from D_t in each training step. The simplest strategy is proportional sampling, where the probability of sampling from a task is proportional to the size of its dataset. More advanced strategies include task-oriented sampling (randomly sampling the same number of instances from all tasks) and employing a per-task sampling temperature that can be dynamically adjusted during training.
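
A minimal NumPy sketch of the sampling weights p_t: temperature 1 gives proportional sampling, and a higher temperature flattens the distribution toward uniform (task-oriented) sampling. The dataset sizes below are toy values.

```python
import numpy as np

# Compute sampling probabilities p_t from dataset sizes |D_t| and a temperature.
def sampling_weights(dataset_sizes, temperature=1.0):
    # temperature = 1.0 -> proportional sampling;
    # larger temperatures push the distribution toward uniform.
    sizes = np.asarray(dataset_sizes, dtype=float)
    scaled = sizes ** (1.0 / temperature)
    return scaled / scaled.sum()

sizes = [100_000, 10_000, 1_000]                     # toy dataset sizes
print(sampling_weights(sizes))                       # proportional
print(sampling_weights(sizes, temperature=5))        # closer to uniform

# Draw the task to sample the next batch from.
rng = np.random.default_rng(0)
task = rng.choice(len(sizes), p=sampling_weights(sizes, temperature=5))
print("next batch from task", task)
```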

Task Scheduling

Task scheduling determines the order of tasks in which an MTL model is trained. A naive way is to train all tasks together. Alternatively, we can train an MTL model on different tasks at different steps. Similar to data sampling, we can assign a task sampling weight r_t to task t, also called the mixing ratio, to control the frequency of data batches from task t. The most common task scheduling technique is to shuffle between different tasks, either randomly or according to a pre-defined schedule. In some cases, multiple tasks are learned sequentially; such tasks usually form a clear dependency relationship or are of different difficulty levels. For auxiliary MTL, some researchers adopt a pre-train-then-fine-tune methodology, which trains on auxiliary tasks first before fine-tuning on the downstream primary tasks.
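
A minimal sketch of schedule construction with mixing ratios, using only the Python standard library: each task contributes batches in proportion to its ratio r_t, and the resulting order is shuffled per epoch. The ratios and batch counts are illustrative.

```python
import random

# Build a shuffled per-epoch schedule of task ids from mixing ratios r_t.
def build_schedule(mixing_ratios, batches_per_unit=100, seed=0):
    schedule = []
    for task_id, ratio in enumerate(mixing_ratios):
        schedule.extend([task_id] * int(ratio * batches_per_unit))
    random.Random(seed).shuffle(schedule)   # random task order within the epoch
    return schedule

# e.g. sample the primary task twice as often as each auxiliary task
print(build_schedule([2.0, 1.0, 1.0], batches_per_unit=5))
```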

Application

Auxiliary MTL

Auxiliary MTL aims to improve the performance of certain primary tasks by introducing auxiliary tasks, and it is widely used in NLP for different types of primary tasks, including sequence tagging, classification, text generation, and representation learning. Auxiliary tasks are usually closely related to the primary tasks. For text generation tasks, MTL is brought in to improve the quality of the generated text.

Joint MTL

Different from auxiliary MTL, joint MTL models optimize their performance on several tasks simultaneously. Through joint MTL, one can take advantage of data-rich domains via implicit knowledge sharing. In addition, abundant unlabeled data can be utilized via unsupervised learning techniques. Moreover, joint MTL is suitable for multi-domain or multi-formalism NLP tasks. Multi-domain tasks share the same problem definition and label space, but have different data distributions.
