What’s all the hype with Transformers? Part 2: Memory for RNNs


Introduction

In the previous post, we highlighted some of the issues we are faced with when attempting to process natural language with the help of AI. 

We found that a crucial requirement is a mechanism that can also “perceive” the position of words in a sentence, so that the model can extract contextual information from a sequence and use it for further processing.

We discussed how Recurrent Neural Networks (RNNs) were a step in the right direction, but they had some serious limitations. This post highlights some methods researchers much smarter than me came up with to address these limitations.

Namely, we will be talking about the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). But first, a quick recap of the problems with (vanilla) RNNs.

Recurrent Neural Networks

When training any neural network, we put the model through a series of trials and errors: we feed it an input for which we already know the correct output, compare the model’s output to that correct output, and compute the error between the two. We then adjust the weights of the connections between the neurons to minimize this error. This adjustment is done through a process called backpropagation, which propagates the error backwards through the network so that we know how much each weight contributed to it. A standard approach for the actual weight update is the Gradient Descent Algorithm, a method for finding the minimum of a function, which is exactly what we want, since we are trying to minimize the error. It works by calculating the gradient of the error with respect to each weight and then adjusting the weights in the direction of the negative gradient.
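To make this a bit more concrete, here is a minimal sketch of gradient descent on a toy problem with a single weight. The input, target, and learning rate are made up purely for illustration:

```python
# Minimal sketch of gradient descent, assuming a single weight w and a
# squared-error loss between the prediction and a known correct output.
x, target = 2.0, 10.0        # toy input and the output we already know is correct
w = 0.5                      # arbitrary initial weight
learning_rate = 0.05

for step in range(20):
    prediction = w * x                 # forward pass
    error = prediction - target       # how far off we are
    gradient = 2 * error * x          # d(error^2)/d(w) via the chain rule
    w -= learning_rate * gradient     # step in the direction of the negative gradient

print(w)  # approaches 5.0, since 5.0 * 2.0 == 10.0
```

A real network does exactly the same thing, just with millions of weights and with backpropagation computing all the gradients at once. With that in mind, here are the two main problems: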

  1. Vanishing Gradient Problem: When training an RNN, the model has to learn the weights of the connections between the neurons, which is done by backpropagating the error from the output layer back to the input layer. However, when the network is very deep (and an RNN unrolled over a long sequence effectively is very deep), the gradients can become infinitesimally small along the way. The early weights then barely get updated, and the model learns very slowly, if at all. 
  2. Exploding Gradient Problem: On the other hand, the gradients can also become very large. The weight updates are then far too big, and the model diverges and produces nonsensical predictions. (Basically, the model is taking far too large a step, often in the wrong direction.) 

The following graphic illustrates the Vanishing Gradient Problem and the Exploding Gradient Problem: 

During backpropagation, the gradients of the error can thus become either vanishingly small or extremely large. In the first case the model takes a very long time to learn anything; in the second it produces nonsensical predictions. 
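You can get a feeling for why this happens with a back-of-the-envelope calculation. Backpropagating through many time steps of an RNN multiplies the gradient by (roughly) the same recurrent weight over and over; this is a deliberate simplification, but it captures the effect:

```python
# Rough sketch of vanishing and exploding gradients, assuming the simplification
# that backpropagating through 50 time steps multiplies the gradient by the
# same recurrent weight 50 times.
for recurrent_weight in (0.5, 1.5):
    gradient = 1.0
    for step in range(50):
        gradient *= recurrent_weight
    print(recurrent_weight, gradient)
# 0.5 -> ~8.9e-16  (vanishing: the weight update is effectively zero)
# 1.5 -> ~6.4e+08  (exploding: the weight update is enormous)
```

Anything slightly below 1 shrinks to nothing, anything slightly above 1 blows up; only values very close to 1 survive a long sequence.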

Generally, the Exploding Gradient Problem is simple to solve: If the gradients become too large, we can simply clip them to a certain threshold. 
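Clipping the gradient by its norm can look like the following sketch. The function here is hypothetical NumPy code written for illustration; real frameworks ship this as a built-in utility (for instance, torch.nn.utils.clip_grad_norm_ in PyTorch):

```python
# A minimal sketch of gradient clipping by norm, assuming the gradients arrive
# as a single NumPy array.
import numpy as np

def clip_gradient(gradient: np.ndarray, threshold: float) -> np.ndarray:
    norm = np.linalg.norm(gradient)
    if norm > threshold:
        gradient = gradient * (threshold / norm)  # rescale so the norm equals the threshold
    return gradient

print(clip_gradient(np.array([30.0, 40.0]), threshold=5.0))  # [3. 4.], norm clipped from 50 to 5
```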

The Vanishing Gradient Problem, however, is much harder to solve: once the gradients have shrunk to practically zero, there is nothing left to rescale, and the model simply stops learning. To address this, researchers came up with further innovations, most notably the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). 

Short-Term Memory

RNNs have a very short memory, meaning that they can only remember the last few tokens in a sequence. 

This is not ideal for long sequences, as the model will forget the beginning of the sequence by the time it reaches the end. This is known as the Context-Loss Problem. To illustrate this problem, consider the following sentence: 

“This blog post about […] was very interesting.” 

The “was” at the end of the sentence depends on the “This blog post” at the beginning. If the part in the middle is very long, the RNN will have forgotten the beginning by the time it reaches the “was”. This is because the information from the first step has to be passed through every intermediate step of the sequence, and its influence gets diluted a little more at each step. So, mathematically speaking, the first parts of a long input sequence end up having only a very small influence on the prediction for the last part. That is not what we want: in natural language, the beginning of a long sentence or paragraph can strongly influence its end. 
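Here is a rough numerical sketch of this effect, using a toy vanilla RNN with deliberately small random weights (the recurrence and all names are made up for illustration). We change only the first “word” of a 100-step sequence and check how much the final hidden state changes:

```python
# Context loss in a toy vanilla RNN: h = tanh(W @ h + U @ x), small random weights.
import numpy as np

rng = np.random.default_rng(42)
hidden_size, input_size, sequence_length = 8, 8, 100

W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
U = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input weights
tokens = rng.normal(size=(sequence_length, input_size))     # a toy input sequence

def final_hidden_state(sequence):
    h = np.zeros(hidden_size)
    for x in sequence:
        h = np.tanh(W @ h + U @ x)
    return h

perturbed = tokens.copy()
perturbed[0] += 1.0                                          # change only the first "word"

print(np.linalg.norm(final_hidden_state(tokens) - final_hidden_state(perturbed)))
# effectively zero: the beginning of the sequence barely affects the end
```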

So, in summary, RNNs are a step in the right direction, but they have some serious limitations that make them unsuitable for processing natural language. Especially when they need to deal with long sequences, such as entire paragraphs or documents, RNNs tend to suffer from Context-Loss. 

Towards a Solution: Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM)

As we have seen, traditional RNNs have a very hard time remembering the Context of a long sequence. To solve these issues, researchers came up with the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). These units act as a sort of “memory cell” that can store information about the input sequence for a longer period of time and help decide which context information should be passed on to the next neuron. 

Gated Recurrent Units (GRUs)

The first mechanism is the Gated Recurrent Unit (GRU). This specialized neuron has two gates: the Update Gate and the Reset Gate. Together, these gates decide how much of the new input to take in and how much of the previous state to keep or discard. This allows a model to learn which parts of an input sequence to focus on and which parts to ignore. (Remember this idea, as it will become relevant with the Transformer model.) 

The following illustration shows the architecture of a simplified GRU-Cell: 

This architecture allows the model to have a mechanism in place to not only carry information across multiple cells but also to decide which information to carry. 
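To make the gates less abstract, here is a minimal sketch of a single GRU step in NumPy. The weight names (W_z, U_z, and so on) follow the common textbook formulation and are initialized randomly just for illustration; in practice you would use a library implementation rather than rolling your own:

```python
# A single GRU step: update gate z, reset gate r, candidate state, blend.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

# one pair of weight matrices and a bias per gate / candidate state
W_z, U_z, b_z = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size)
W_r, U_r, b_r = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size)
W_h, U_h, b_h = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size)), np.zeros(hidden_size)

def gru_step(x, h):
    z = sigmoid(W_z @ x + U_z @ h + b_z)      # update gate: how much new information to let in
    r = sigmoid(W_r @ x + U_r @ h + b_r)      # reset gate: how much of the old state to reuse
    h_candidate = np.tanh(W_h @ x + U_h @ (r * h) + b_h)
    return (1.0 - z) * h + z * h_candidate    # blend the old state with the candidate state

h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):    # run the cell over a toy sequence of 5 tokens
    h = gru_step(x, h)
print(h)
```

The key point is the last line of gru_step: because the new state is a weighted blend of the old state and the candidate, information can flow across many steps almost unchanged whenever the update gate stays close to zero.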

There is still a problem, however: the model can only decide which information to carry forward, but it cannot decide how “important” that information is. Also, it cannot decide how to combine the information from various parts of the input sequence. So, for example, in a long sentence, both the choice of gender as well as the choice of tense can influence words at the end of the sentence. The GRU can only decide to carry forward the information of one of these choices but cannot carry their combined influence forward. 

Advantages of a GRU over a traditional RNN are: 

  • A model can decide which information to carry forward and which information to ignore. 
  • Furthermore, the model can learn to focus on the most important parts of the input sequence.

Long Short-Term Memory (LSTM)

A further innovation is the Long Short-Term Memory (LSTM). Interestingly, the LSTM actually predates the GRU, but conceptually it is the more elaborate of the two: it introduces a more complex mechanism for deciding which information to carry forward and how to combine the information from various parts of the input sequence. These mechanisms are the Forget Gate, the Input Gate, and the Output Gate. 

The following illustration shows the architecture of a simplified LSTM: 

This architecture allows the model to have a mechanism in place to not only carry information across multiple cells but also to decide which information to carry, how to combine the information from different parts of the input sequence, and how “important” that information is. This makes the LSTM a very powerful tool for processing natural language, as it can remember the context of a long sequence and combine the influence of distinct parts of the sequence more effectively than a vanilla RNN. 
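Again, a minimal NumPy sketch of a single LSTM step, with made-up, randomly initialized weights purely for illustration. The cell state c acts as the long-term “memory cell”, while the hidden state h is what gets passed on as the output of the step:

```python
# A single LSTM step: forget gate f, input gate i, output gate o, cell state c.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
input_size, hidden_size = 4, 3

def gate_params():
    return (rng.normal(size=(hidden_size, input_size)),
            rng.normal(size=(hidden_size, hidden_size)),
            np.zeros(hidden_size))

(W_f, U_f, b_f), (W_i, U_i, b_i), (W_o, U_o, b_o), (W_c, U_c, b_c) = (gate_params() for _ in range(4))

def lstm_step(x, h, c):
    f = sigmoid(W_f @ x + U_f @ h + b_f)    # forget gate: what to erase from the cell state
    i = sigmoid(W_i @ x + U_i @ h + b_i)    # input gate: what new information to write
    o = sigmoid(W_o @ x + U_o @ h + b_o)    # output gate: what to expose as the hidden state
    c_candidate = np.tanh(W_c @ x + U_c @ h + b_c)
    c = f * c + i * c_candidate             # update the long-term memory cell
    h = o * np.tanh(c)                      # derive the new hidden state from it
    return h, c

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):  # run the cell over a toy sequence of 5 tokens
    h, c = lstm_step(x, h, c)
print(h)
```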

The advantages of an LSTM over a GRU are: 

  • Good at remembering the context of a long sequence. 
  • Effective for complex and structured data. 
  • Handles the vanishing gradient problem better than a GRU. 

There are, however, still some drawbacks to LSTMs: 

  • They are computationally expensive. (This is not a problem for small models, but for very large models, this can be a problem.) 
  • They are hard to parallelize. (This is a problem for training on GPUs, as GPUs are very good at parallelizing computations.) 
  • They are more complex than GRUs. (This makes them harder to understand and to implement.) 
  • They are slow to train. (This is a consequence of that added complexity.) 

As you can see, RNNs are a step in the right direction, but they have some serious limitations. Their main drawback is that they process a sequence strictly step by step, which means the computation cannot be parallelized across the sequence. This makes them very slow to train and to use in practice. 

This slowness in training and application was their main limiting factor. Currently, the simplest method of making an AI “smarter” is to make it bigger. This, however, is hard to do with RNNs, as they are slow by nature. So, if you increase the size of an RNN, you also increase the time it takes to train and to use the model, making it hard to scale up the model. 

This is one of the reasons why we didn’t see any conversational models on the scale of GPT-3 before the Transformer model was introduced. It was simply impractical to train and use such a large model with an RNN. 

(There are other factors which contributed to the rise of the Transformer model, such as the availability of large datasets and the development of more powerful hardware, but the limitations of RNNs were certainly a factor.) 

In our next post, we will be taking a closer look at the Transformer model and how it solves the limitations of RNNs. Stay tuned!