
Stanford CS25: V1 | Transformers United: DL Models that have revolutionized NLP, CV, RL


Chapters

0:00 Introduction
2:43 Overview of Transformers
6:03 Attention mechanisms
7:53 Self-attention
11:38 Other necessary ingredients
13:32 Encoder-Decoder Architecture
16:02 Advantages & Disadvantages
18:04 Applications of Transformers

Transcript

>> Hey everyone, welcome to the first and introductory lecture for CS25, Transformers United. CS25 was a class that the three of us created and taught at Stanford in the fall of 2021. And the subject of the class, contrary to what the picture might suggest, is not robots that can transform into cars.

It's about deep learning models, and specifically a particular kind of deep learning model, that have revolutionized multiple fields, ranging from natural language processing to computer vision and reinforcement learning, to name a few. We have an exciting set of videos lined up for you, with some truly fantastic speakers giving talks about how they are applying transformers in their own research.

And we hope you will enjoy and learn from these talks. This video is purely an introductory lecture to talk a little bit about transformers. Before we get started, I'd like to introduce the instructors. My name is Advai. I am a software engineer at a company called Applied Intuition.

Before this, I was a master's student in CS at Stanford, and I am one of the co-instructors for CS25. Chaitanya, Div, if the two of you could introduce yourselves. >> So hi, everyone. I am a PhD student at Stanford. Before this, I was pursuing a master's here. My research is largely in generative modeling, reinforcement learning, and robotics.

So nice to meet you all. >> Yeah, that was Div, since he didn't say his name. Chaitanya, if you want to introduce yourself. >> Yeah, hi, everyone. My name is Chaitanya, and I'm currently working as an ML engineer at a startup called Moveworks. Before that, I was a master's student at Stanford specializing in NLP and was a member of Stanford's prize-winning team for the Alexa Prize Challenge.

>> All right, awesome. So moving on to the rest of this talk. Essentially, what we hope you will learn watching these videos, and what we hope the people who took our class in the fall of 2021 learned, is three things. One is we hope you will have an understanding of how transformers work.

Secondly, we hope you will learn, and by the end of these talks, understand how transformers are being applied beyond just natural language processing. And thirdly, we hope that some of these talks will spark some new ideas within you, and hopefully lead to new directions of research, new kinds of innovation, and things of that sort.

And to begin, we're going to talk a little bit about transformers and introduce some of the context behind them as well. For that, I'd like to hand it off to Div. >> So hi, everyone. Welcome to our transformer seminar. I will start with an overview of the attention timeline and how it came to be.

The key idea behind transformers is the self-attention mechanism, which was developed in 2017, and it all started with one paper called Attention Is All You Need by Vaswani et al. Before 2017, we used to have this prehistoric era where we had older models like RNNs, LSTMs, and simpler attention mechanisms.

Eventually, the growth of transformers exploded into other fields, and they have become prominent in all of machine learning. And I'll go on to show how they have been used. So in the prehistoric era, there used to be RNNs. There were different models, like sequence-to-sequence, LSTMs, and GRUs.

They were good at encoding some sort of memory, but they did not work for encoding long sequences, and they were very bad at encoding context. So here is an example: suppose you have a sentence like "I grew up in France ... so I speak fluent ___."

You want to fill in the blank with "French" based on the context, but an LSTM model might not capture that context and might make a big mistake here. Similarly, we can show a correlation map where, if you have a pronoun like "it", we want it to correlate with one of the past nouns we have seen so far, like "animal", but again, older models were really not good at this kind of context encoding.

So where we are currently is on the verge of takeoff. We're beginning to realize the potential of transformers in different fields. We have started to use them to solve long-sequence problems like protein folding, such as the AlphaFold model from DeepMind, which gets around 95% accuracy on its challenge, and we have started to use them in offline RL.

We can use them for few-shot and zero-shot generalization in text and image generation, and we can also use them for content generation. So here's an example from OpenAI, where you can give a text prompt and have an AI-generated fictional image produced for you. And there's a talk on this that you can also watch on YouTube, which basically says that LSTMs are dead, long live transformers.

So what's the future? We can enable a lot more applications for transformers. They can be applied to any form of sequence modeling. We could use them for real understanding, we could use them for finance, and a lot more. So basically, imagine all sorts of generative modeling problems.

Nevertheless, there are a lot of missing ingredients. Like the human brain, we need some sort of external memory unit, which for us is the hippocampus. There is some early work here; one nice paper you might want to check out is called Neural Turing Machines. Similarly, the current attention mechanisms are computationally expensive: their time complexity scales quadratically with sequence length, which we'll discuss later.

And we want to make them more linear. The third problem is that we want to align our current language models with how the human brain works and with human values, and this is also a big issue. OK, so now I will dive deeper into the attention mechanisms and show how they came to be.

Initially, these were very simple mechanisms. Attention was inspired by the process of importance weighting, putting attention on different parts of an image: similar to a human, the model might focus more on the foreground, say a dog, compared to the rest of the background.

In the case of soft attention, you learn a soft attention weighting for each pixel, which is a weight between 0 and 1. The problem here is that this is a very expensive computation: as shown in the figure on the left, we are calculating this attention map over the whole image.

What you can do instead, with hard attention, is calculate a 0/1 attention map, where we directly put a 1 wherever the dog is and a 0 wherever it's background. This is less computationally expensive, but the problem is that it's not differentiable, which makes things harder to train.
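
To make the contrast concrete, here is a minimal NumPy sketch, with made-up relevance scores standing in for what a model would actually learn:

```python
import numpy as np

# Hypothetical relevance scores for a 2x3 grid of image regions.
scores = np.array([[0.2, 1.5, 0.3],
                   [0.1, 2.0, 0.4]])

# Soft attention: a differentiable weight in (0, 1) for every region.
soft_weights = np.exp(scores) / np.exp(scores).sum()   # softmax over all regions

# Hard attention: a non-differentiable 0/1 mask that keeps only the top-scoring region.
hard_mask = (scores == scores.max()).astype(float)

print(soft_weights.round(3))
print(hard_mask)
```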

Going forward, there were also different varieties of basic attention mechanisms that were proposed before self-attention. The first variety is global attention models. In a global attention model, for each hidden-state output you learn an attention weight a_t, and this is element-wise multiplied with the current output to calculate the final output y_t.

Similarly, you have local attention models, where instead of calculating the attention over the whole sequence length, you only calculate the attention over a small window, and then you weight the window by its attention and combine it with the current output to get the final output you need. So moving on, I'll pass it to Chaitanya to discuss self-attention mechanisms and transformers.
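
As a rough sketch of this idea, in the spirit of Luong-style global and local attention, here is some NumPy with hypothetical shapes and a hard-coded window center (in the actual local-attention formulation the center position is predicted by the model), so purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, d = 8, 4                       # sequence length, hidden size (made up)
h = rng.normal(size=(T, d))       # encoder hidden states
s_t = rng.normal(size=d)          # current decoder state

# Global attention: score every source position, then take a weighted sum.
a_global = softmax(h @ s_t)                   # one attention weight per position
context_global = a_global @ h                 # weighted sum over the whole sequence

# Local attention: only score positions inside a small window around a center p_t.
p_t, D = 4, 2                                 # window center and half-width (hard-coded here)
window = slice(p_t - D, p_t + D + 1)
a_local = softmax(h[window] @ s_t)
context_local = a_local @ h[window]
```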

>> Yeah, thank you, Div, for covering a brief overview of how the primitive versions of attention work. Before we talk about self-attention, a bit of trivia: this term was first introduced in a paper by Lin et al., which provided a framework for a self-attentive mechanism for sentence embeddings.

And now moving on to the main crux of the transformers paper, which is the self-attention block. Self-attention is the main building block that makes transformer models work so well and makes them so powerful. To think of it more easily, we can frame self-attention as a search-and-retrieval problem.

So the problem is that given a query Q, we need to find the set of keys K that are most similar to Q, and return the corresponding values V. Now, these three vectors can be drawn from the same source; for example, Q, K, and V can all be derived from a single vector X, where X is the output of a previous layer.

In transformers, these vectors are obtained by applying different linear transformations to X, so as to enable the model to capture more complex interactions between the tokens at different places in the sentence. The attention output is then just a weighted summation of the value vectors, where each value's weight comes from the similarity between the query and the corresponding key.
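
Concretely, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; the sequence length, model width, and randomly initialized projection matrices are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query with every key
    weights = softmax(scores, axis=-1)  # one distribution over keys per query
    return weights @ V                  # weighted sum of the value vectors

rng = np.random.default_rng(0)
T, d_model = 5, 16                      # 5 tokens, model width 16 (toy sizes)
X = rng.normal(size=(T, d_model))       # output of a previous layer

# Q, K, V all come from the same X via (here randomly initialized) linear projections.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)                        # (5, 16): one output vector per token
```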

And in the transformers paper, they use the scaled dot product as the similarity function for the queries and keys. Another important aspect of transformers is the introduction of multi-head self-attention. What multi-head self-attention means is that at every layer, self-attention is performed multiple times in parallel, which enables the model to learn multiple representation subspaces.

So in a way, you can think of it as each head having the power to look at different things and learn different semantics. For example, one head might learn to predict the part of speech of the tokens, another head might learn the syntactic structure of the sentence, and so on, all of which helps in understanding what the sentence means.
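
A minimal sketch of the multi-head idea, again in NumPy with random weights standing in for learned projections (real implementations compute all heads in one batched matrix multiply rather than a Python loop):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Split the model dimension across heads, attend per head, then concatenate."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # Per-head projections (randomly initialized here; learned in practice).
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
        heads.append(weights @ V)                 # each head gets its own subspace
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_o   # concatenate heads, then mix them

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
print(multi_head_attention(X, n_heads=4, rng=rng).shape)   # (5, 16)
```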

Now, to better understand how self-attention works and what the different computations are, there is a short video. As you can see, there are three incoming tokens: input 1, input 2, and input 3. We apply linear transformations to get the key and value vectors for each input, and then once a query Q comes in, we calculate its similarity with the respective key vectors, multiply those scores with the value vectors, and add them all up to get the output.

The same computation is then performed for all the tokens, and we get the output of the self-attention layer. As you can see here, the final output of the self-attention layer is shown in dark green at the top of the screen. Now again, for the final token, we perform the same steps: queries multiplied by keys.

We get the similarity scores, those similarity scores weight the value vectors, and then we finally perform the addition to get the self-attention output. Apart from self-attention, there are some other necessary ingredients that make the transformer so powerful. One important aspect is the presence of positional representations, or the positional embedding layer.

One reason RNNs worked well is that they process information sequentially, so there was a built-in notion of ordering, which is also very important for understanding language: we read text from left to right in most languages, and right to left in some.

That notion of ordering is lost in self-attention, because every word attends to every other word. That's why the paper introduced a separate embedding layer for positional representations. The second important aspect is having nonlinearities. If you think of all the computation happening in the self-attention layer, it's all linear, because it's all matrix multiplication.
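
As one concrete option, the original paper used fixed sinusoidal positional encodings that are simply added to the token embeddings (learned positional embeddings are a common alternative); a minimal sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(T, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(T)[:, None]                   # (T, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2) even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added to the token embeddings before the first layer.
T, d_model = 5, 16
token_embeddings = np.random.default_rng(0).normal(size=(T, d_model))
x = token_embeddings + sinusoidal_positional_encoding(T, d_model)
```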

But as we all know, deep learning models work well when they can learn more complex mappings between input and output, which can be attained with a simple MLP. The third important component of transformers is masking. Masking is what allows us to parallelize the operations.

Since every word can attend to every other word, a problem arises in the decoder part of the transformer, which Advai is going to talk about later: you don't want the decoder to look into the future, because that would amount to data leakage. That's why masking hides the future information from the decoder, so that it learns only from what the model has processed so far.
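
A minimal sketch of how such a causal mask is typically applied to the attention scores before the softmax (toy sizes and random scores, just to show the mechanics):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T = 5
scores = np.random.default_rng(0).normal(size=(T, T))   # query-key scores for 5 tokens

# Causal mask: position i may only attend to positions <= i.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal (the "future")
scores[mask] = -np.inf                              # -inf turns into weight 0 after the softmax
weights = softmax(scores, axis=-1)
print(np.round(weights, 2))                         # upper triangle is all zeros
```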

So now on to the encoder-decoder architecture of the transformers. Advai? - Yeah, thanks, Chaitanya, for talking about self-attention. So self-attention is one of the key ingredients that allows transformers to work so well. But at a very high level, the model that was proposed in the Vaswani et al.

paper of 2017 was like previous language models in the sense that it had an encoder-decoder architecture. What that means is, let's say you're working on a translation problem. You want to translate English to French. The way that would work is you would read in the entire input of your English sentence, you would encode that input, so that's the encoder part of the network.

And then you would generate token by token the corresponding French translation. And the decoder is the part of the network that is responsible for generating those tokens. So you can think of these encoder blocks and decoder blocks as essentially something like Lego. They have these sub-components that make them up.

And in particular, the encoder block has three main sub-components. The first is the self-attention layer that Chaitanya talked about earlier. And as discussed earlier as well, you need a feed-forward layer after that, because the self-attention layer only performs linear operations, so you need something that can capture the nonlinearities.

You also have a layer norm after this. And lastly, there are residual connections between different encoder blocks. The decoder is very similar to the encoder, but there's one difference, which is that it has this extra layer because the decoder doesn't just do multi-head attention on the output of the previous layers.

So for context, in each of the encoder blocks, the self-attention layer does multi-head attention over the outputs of the previous encoder layers. The decoder does that too, in the sense that it also looks at the previous layers of the decoder, but in addition it looks at the output of the encoder.

And so it needs an extra multi-head attention layer over the encoder outputs. And lastly, there's masking as well: because every token can look at every other token, you want to make sure that in the decoder you're not looking into the future. So if you're at position 3, for instance, you shouldn't be able to look at positions 4 and 5.
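
Putting the encoder pieces together, here is a minimal single-head sketch of one encoder block in NumPy (no biases, no multi-head split, and layer norm without learned scale and shift; the real block, and the decoder's extra cross-attention layer, add more machinery):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def encoder_block(X, p):
    """One encoder block: self-attention -> add & norm -> feed-forward -> add & norm."""
    attn_out = self_attention(X, p["W_q"], p["W_k"], p["W_v"])
    X = layer_norm(X + attn_out)                        # residual connection + layer norm
    ff = np.maximum(0, X @ p["W_1"]) @ p["W_2"]         # 2-layer MLP with ReLU nonlinearity
    return layer_norm(X + ff)                           # second residual + layer norm

rng = np.random.default_rng(0)
d_model, d_ff, T = 16, 64, 5
params = {
    "W_q": rng.normal(size=(d_model, d_model)),
    "W_k": rng.normal(size=(d_model, d_model)),
    "W_v": rng.normal(size=(d_model, d_model)),
    "W_1": rng.normal(size=(d_model, d_ff)),
    "W_2": rng.normal(size=(d_ff, d_model)),
}
X = rng.normal(size=(T, d_model))
print(encoder_block(X, params).shape)   # (5, 16)
```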

So those are sort of all the components that led to the creation of the model in the Vaswani et al paper. And let's talk a little bit about the advantages and drawbacks of this model. So the two main advantages, which are huge advantages and which are why transformers have done such a good job of revolutionizing many, many fields within deep learning, are as follows.

The first is that there is a constant path length between any two positions in a sequence, because every token in the sequence looks at every other token. This basically solves the long-sequence problem that Div talked about earlier: if you're trying to predict a token that depends on a word far back in the sentence, you no longer have the problem of losing that context.

Now, the distance between them is only one in terms of path length. Also, because of the nature of the computation that's happening, transformer models lend themselves really well to parallelization. And because of the advances we've had with GPUs, if you take a transformer model with n parameters and a model that isn't a transformer, say an LSTM, also with n parameters, training the transformer is going to be much faster because of the parallelization it leverages.

So those are the advantages. The main disadvantage is that self-attention takes quadratic time, because every token looks at every other token. O(n^2), as you might know, does not scale, and there's actually been a lot of work trying to tackle this; we've linked to some of it here. Big Bird, Linformer, and Reformer are all approaches that try to make this linear or quasi-linear, essentially.

And yeah, we highly recommend going through Jay Alammar's blog post, The Illustrated Transformer, which provides great visualizations and explains everything we just talked about in great detail. Yeah, and I'd like to pass it on to Chaitanya for applications of transformers. - So now, moving on to some of the recent work that very shortly followed the transformers paper.

One of the models that came out was the GPT architecture, which was released by OpenAI; the latest model in the GPT series is GPT-3. It consists of only the decoder blocks of the transformer and is trained on a traditional language modeling task: predicting the next token given the last T tokens the model has seen.

And for any downstream task, you can just train a classification layer on the last hidden state, with any number of labels. And since the model is generative in nature, you can also use the pre-trained network for generative tasks, such as summarization and natural language generation.
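
To make the pre-training objective concrete, here is a toy NumPy sketch of the next-token cross-entropy loss, with random logits standing in for what a decoder-only model would actually produce:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vocab, T = 100, 6
tokens = rng.integers(0, vocab, size=T)     # a toy token sequence

# Pretend these are the model's output logits: position t predicts token t+1.
logits = rng.normal(size=(T - 1, vocab))
probs = softmax(logits, axis=-1)

# Standard language-modeling loss: -mean log p(token[t+1] | tokens[:t+1]).
targets = tokens[1:]
loss = -np.mean(np.log(probs[np.arange(T - 1), targets]))
print(loss)
```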

Another reason GPT-3 gained so much popularity was its ability to perform what the authors called in-context learning. This is the ability whereby the model can learn, in a few-shot setting, what the task is and complete it without performing any gradient updates.

For example, let's say the model is shown a bunch of addition examples, and then you pass in a new input and just leave it at the equals sign. The model tries to predict the next token, which very often comes out to be the sum of the numbers shown.
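
A hypothetical prompt illustrating this few-shot setup (the generate call below is a made-up placeholder, not an actual API):

```python
# The "training examples" live entirely in the input text, and the model is
# simply asked to continue the string. No gradient updates are involved.
prompt = (
    "12 + 7 = 19\n"
    "3 + 5 = 8\n"
    "40 + 2 = 42\n"
    "25 + 13 ="
)
# completion = model.generate(prompt)   # hypothetical call; a capable model tends to emit "38"
```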

Other examples are the spelling-correction task or the translation task. This was the ability that made GPT-3 so talked about in the NLP world. Many applications have since been built using GPT-3, one of them being the Copilot extension for VS Code, which tries to generate a piece of code given a docstring or similar natural language text.

Another major model that came out based on the transformer architecture was BERT. BERT is an acronym for Bidirectional Encoder Representations from Transformers. It consists of only the encoder blocks of the transformer, unlike GPT-3, which has only the decoder blocks.

Now, because of this change, a problem arises: since BERT has only encoder blocks, it sees the entire piece of text, so it cannot be pre-trained on a naive language modeling task because of data leakage from the future. So the authors came up with a clever idea.

They proposed a novel task called masked language modeling, which involves replacing certain words with a placeholder; the model then tries to predict those words given the entire context. Apart from this token-level task, the authors also added a second objective called next sentence prediction, which is a sentence-level task.

Here, given two chunks of text, the model tries to predict whether the second sentence follows the first or not. After pre-training, the model can be fine-tuned for any downstream task with an additional classification layer, just as with GPT-3.
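
A toy sketch of the masked-language-modeling data preparation (BERT's actual recipe masks about 15% of tokens and sometimes keeps or randomly replaces the chosen token; this only shows the [MASK] case):

```python
import random

random.seed(0)
tokens = "the quick brown fox jumps over the lazy dog".split()

# Hide a random subset of tokens and remember what the model must recover
# from the full, bidirectional context.
masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:
        masked.append("[MASK]")
        targets[i] = tok          # position -> original token to predict
    else:
        masked.append(tok)

print(" ".join(masked))
print(targets)
```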

So these are two models that have been very popular and have made their way into a lot of applications. But the landscape has changed quite a lot since we taught this class: there are models with different pre-training techniques, like ELECTRA and DeBERTa, and there are also models that do well in other modalities, which we are going to talk about in other lectures of this series as well.

So yeah, that's all from this lecture. And thank you for tuning in. - Yeah, just want to end by saying thank you all for watching this. And we have a really exciting set of videos with truly amazing speakers. And we hope you are able to derive value from that.

- Okay, thanks a lot. - Thank you. - Thank you, everyone. - Bye. - Bye.