---
layout: post
title: "All the fuss around Multi Task Learning"
subtitle: "Big name of a trivial concept"
date: 2019-02-18 23:00:01
categories: [tech]
comments: true
deploy: false
---

What is Multi-Task Learning?

To understand Multi-Task Learning, let’s start with a Single-Task Learning example: for simplicity’s sake, imagine a plain feed-forward neural network used in pre-training for NLP (natural language processing). The task is to predict the next word in a sentence.

The input is the string “I like New” and the correct output is the string “York”.

The training process (gradient descent) can be visualized as a ball rolling down a hill, where the terrain is the loss function (also known as the cost or error function) and the position of the ball represents the current values of all parameters (weights and biases).
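To make the ball-and-hill picture concrete, here is a minimal sketch of gradient descent on a single parameter. The bowl-shaped `loss`, its minimum at `w = 2`, the learning rate, and the step count are all assumptions chosen for illustration; a real network has many parameters and a far more rugged terrain.

```python
def loss(w):
    # Hypothetical one-parameter "terrain": a bowl whose lowest point is at w = 2.
    return (w - 2.0) ** 2


def grad(w):
    # Slope of the terrain at position w (derivative of the loss).
    return 2.0 * (w - 2.0)


def gradient_descent(w, learning_rate=0.1, steps=100):
    # The "ball" repeatedly takes a small step in the downhill direction.
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


w_final = gradient_descent(w=0.0)
print(w_final, loss(w_final))
```

The ball starts at `w = 0.0` and settles very close to the bottom of the bowl.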

Now, what if you wanted the neural network to perform multiple tasks? For example, predict the next word in a sentence and also conduct sentiment analysis (predict whether the attitude is positive, neutral, or negative; e.g. “You are awesome” classifies as positive).

Well, you can simply add another output!

The input is “I like New”, the next word prediction is “York”, and the sentiment prediction is positive.

The losses from the two outputs are then added together and averaged, and this final loss is used to train the network, since you now want to minimize the loss for both tasks.
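As a rough sketch of that combination step, the snippet below computes one loss per output and averages them. The choice of cross-entropy as the per-task loss and the probability values are assumptions made for illustration; the post does not specify either.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability the model assigned to the correct class."""
    return -math.log(probs[target])


# Hypothetical outputs of the two heads for the input "I like New".
next_word_probs = {"York": 0.7, "Jersey": 0.2, "Orleans": 0.1}
sentiment_probs = {"positive": 0.6, "neutral": 0.3, "negative": 0.1}

word_loss = cross_entropy(next_word_probs, "York")
sentiment_loss = cross_entropy(sentiment_probs, "positive")

# Sum the per-task losses, then average over the number of tasks.
total_loss = (word_loss + sentiment_loss) / 2
print(word_loss, sentiment_loss, total_loss)
```

Averaging instead of summing only rescales the loss by a constant, so it does not change where the minimum lies. Giving each task its own weight is a common variation.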

This time, the training process can be visualized as summing 2 terrains (the 2 loss functions) together to get a new terrain (the final loss function), and then performing gradient descent.
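A toy version of the summed terrain, assuming two hypothetical one-parameter losses with different minima (`w = 1` for task A and `w = 5` for task B), might look like this:

```python
def loss_a(w):
    return (w - 1.0) ** 2  # hypothetical terrain for task A, lowest at w = 1


def loss_b(w):
    return (w - 5.0) ** 2  # hypothetical terrain for task B, lowest at w = 5


def total_loss(w):
    return loss_a(w) + loss_b(w)  # the new, combined terrain


def total_grad(w):
    # The gradient of a sum is the sum of the gradients.
    return 2.0 * (w - 1.0) + 2.0 * (w - 5.0)


w = 0.0
for _ in range(200):
    w -= 0.1 * total_grad(w)  # gradient descent on the combined terrain

print(w)
```

In this toy setup the ball settles between the two individual minima: neither task gets its ideal parameter value, but the shared parameter is the best compromise for both.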

This is the essence of Multi-Task Learning (MTL): training one neural network to perform multiple tasks so that the model can develop a generalized representation of language rather than constraining itself to one particular task.

Multi-Task Learning is especially useful in natural language processing, as the goal of the pre-training process is to “understand” the language. Humans likewise perform multiple tasks when it comes to language understanding.

Two MTL methods for Deep Learning

So far, we have focused on theoretical motivations for MTL. To make the ideas of MTL more concrete, we will now look at the two most commonly used ways to perform multi-task learning in deep neural networks. In the context of Deep Learning, multi-task learning is typically done with either hard or soft parameter sharing of hidden layers.

Hard parameter sharing

Hard parameter sharing is the most commonly used approach to MTL in neural networks. It is generally applied by sharing the hidden layers between all tasks, while keeping several task-specific output layers.
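The architecture can be sketched as a forward pass with one shared hidden layer and two task-specific output layers. This is a minimal NumPy illustration, not a full training setup: the layer sizes, the `tanh` activation, the random weights, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 input features, 16 hidden units,
# a 5-word vocabulary, and 3 sentiment classes.
n_in, n_hidden, n_vocab, n_sentiment = 8, 16, 5, 3

# Shared hidden layer: a single set of weights used by every task.
W_shared = rng.normal(scale=0.1, size=(n_in, n_hidden))
b_shared = np.zeros(n_hidden)

# Task-specific output layers: one weight matrix per task.
W_word = rng.normal(scale=0.1, size=(n_hidden, n_vocab))
W_sent = rng.normal(scale=0.1, size=(n_hidden, n_sentiment))


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(x):
    h = np.tanh(x @ W_shared + b_shared)  # representation shared by both tasks
    return softmax(h @ W_word), softmax(h @ W_sent)


x = rng.normal(size=(4, n_in))  # a batch of 4 hypothetical input vectors
word_probs, sentiment_probs = forward(x)
print(word_probs.shape, sentiment_probs.shape)
```

Because both heads read from the same `h`, gradients from both task losses would flow into `W_shared` during training, while `W_word` and `W_sent` are each updated by their own task only.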

Hard parameter sharing greatly reduces the risk of overfitting. In fact, the risk of overfitting the shared parameters is smaller by a factor on the order of N, where N is the number of tasks, than the risk of overfitting the task-specific parameters, i.e. the output layers. This makes sense intuitively: the more tasks we learn simultaneously, the more our model has to find a representation that captures all of them, and the lower our chance of overfitting on our original task.

Soft parameter sharing

In soft parameter sharing, on the other hand, each task has its own model with its own parameters. The distance between the parameters of the models is then regularized in order to encourage the parameters to be similar. For instance, one approach may use the ℓ2 norm for regularization, while another may use the trace norm.
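As an illustration of such a penalty, the sketch below computes a squared ℓ2 distance between the parameters of two task-specific models, plus a helper showing how a trace norm is computed. The function names, the parameter shapes, and the idea of adding the penalty to the task losses with a `strength` coefficient are assumptions for this example; specific methods differ in exactly which parameters they compare and how.

```python
import numpy as np


def l2_distance_penalty(params_a, params_b):
    """Sum of squared differences between corresponding parameter arrays."""
    return sum(np.sum((a - b) ** 2) for a, b in zip(params_a, params_b))


def trace_norm(matrix):
    """Trace (nuclear) norm: the sum of the singular values of a matrix."""
    return np.linalg.svd(matrix, compute_uv=False).sum()


# Hypothetical parameters of two task-specific models with matching shapes.
rng = np.random.default_rng(0)
params_a = [rng.normal(size=(4, 3)), rng.normal(size=3)]
params_b = [p + 0.1 for p in params_a]  # model B is a slightly shifted copy of A

penalty = l2_distance_penalty(params_a, params_b)
print(penalty)

# A regularized objective would then look like (names are illustrative):
#   total = loss_a + loss_b + strength * penalty
```

The larger `strength` is, the harder the two models are pulled toward each other. In the limit of a very strong penalty, the setup behaves much like sharing a single set of parameters.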

The constraints used for soft parameter sharing in deep neural networks have been greatly inspired by regularization techniques for MTL that have been developed for other models, which we will soon discuss.