Stanford CS25: V1 I Transformers United: DL Models that have revolutionized NLP, CV, RL
Chapters
0:00 Introduction
2:43 Overview of Transformers
6:03 Attention mechanisms
7:53 Self-attention
11:38 Other necessary ingredients
13:32 Encoder-Decoder Architecture
16:02 Advantages & Disadvantages
18:04 Applications of Transformers
Welcome to the introductory lecture for CS25, Transformers United. 00:00:13.000 |
So CS25 was a class that the three of us created and taught at Stanford in the fall of 2021. 00:00:19.760 |
The subject of the class is not, as the picture might suggest, 00:00:23.800 |
about robots that can transform into cars. 00:00:27.480 |
It's about deep learning models, and specifically a particular kind of 00:00:31.620 |
deep learning model that has revolutionized multiple fields, 00:00:35.340 |
starting from natural language processing to things like computer vision and reinforcement learning. 00:00:41.480 |
We have an exciting set of videos lined up for you. 00:00:45.880 |
We have some truly fantastic speakers come and 00:00:48.400 |
give talks about how they were applying transformers in their own research. 00:00:53.280 |
And we hope you will enjoy and learn from these talks. 00:00:56.440 |
So this video is purely an introductory lecture to talk a little bit about transformers. 00:01:01.800 |
And before we get started, I'd like to introduce the instructors. 00:01:07.000 |
I am a software engineer at a company called Applied Intuition. 00:01:10.320 |
Before this, I was a master's student in CS at Stanford. 00:01:19.680 |
Chaitanya, Div, if the two of you could introduce yourselves. 00:01:29.520 |
I'm researching a lot in generative modeling and reinforcement learning. 00:01:35.280 |
>> Yeah, that was Div, since he didn't say his name. 00:01:38.400 |
Chaitanya, if you want to introduce yourself. 00:01:43.520 |
I'm currently working as an ML engineer at a startup called Moveworks. 00:01:48.360 |
Before that, I was a master's student at Stanford specializing in NLP, and 00:01:52.360 |
I was a member of Stanford's prize-winning team for the Alexa Prize Challenge. 00:02:04.520 |
Essentially, what we hope you will learn watching these videos, and 00:02:08.840 |
what we hope the people who took our class in the fall of 2021 learned, is three things. 00:02:15.320 |
One is we hope you will have an understanding of how transformers work. 00:02:19.720 |
Secondly, we hope that by the end of these talks, you will 00:02:23.560 |
understand how transformers are being applied beyond just natural language processing. 00:02:29.460 |
And thirdly, we hope that some of these talks will spark some new ideas within you, 00:02:34.360 |
and hopefully lead to new directions of research, 00:02:37.320 |
new kinds of innovation, and things of that sort. 00:02:40.280 |
And to begin, we're going to talk a little bit about transformers and 00:02:48.280 |
introduce some of the context behind transformers as well. 00:02:52.280 |
And for that, I'd like to hand it off to Div. 00:03:04.960 |
So I will start first with an overview of the attention timeline. 00:03:09.960 |
The key idea behind transformers was the self-attention mechanism that was 00:03:14.260 |
developed in 2017, and it all started with this one paper called "Attention Is All You Need." 00:03:19.900 |
Before 2017, we used to have this prehistoric era where we had older models 00:03:24.420 |
like RNNs, LSTMs, and simpler attention mechanisms. 00:03:28.620 |
And eventually, the growth in transformers has exploded into other fields and 00:03:32.580 |
has become prominent in all of machine learning. 00:03:35.540 |
And I'll go on to show how this has been used. 00:03:39.900 |
So in the prehistoric era, there used to be RNNs. 00:03:43.660 |
There were different models, like the sequence-to-sequence, LSTMs, GRUs. 00:03:47.700 |
They were good at encoding some sort of memory, but they did not work for 00:03:51.840 |
encoding long sequences, and they were very bad at encoding context. 00:03:55.220 |
So here is an example: if you have a sentence like "I grew up in France... I speak fluent ___," 00:04:02.380 |
then you want to fill in the blank with "French" based on the context, but 00:04:05.340 |
an LSTM model might not know what it is and might make a very big mistake here. 00:04:09.780 |
Similarly, we can show some sort of correlation map here where if you have 00:04:13.940 |
a pronoun like it, we want it to correlate to one of the past nouns that we have seen 00:04:18.420 |
so far, like animal, but again, older models were really not good at this context encoding. 00:04:26.100 |
So where we are currently now is on the verge of takeoff. 00:04:29.580 |
We're beginning to realize the potential of transformers in different fields. 00:04:32.900 |
We have started to use them to solve long-sequence problems such as 00:04:36.740 |
protein folding, where the AlphaFold model from DeepMind 00:04:42.780 |
gets 95% accuracy on the protein structure prediction challenge, and also in offline RL. 00:04:47.580 |
We can use them for few-shot and zero-shot generalization, and for text and image generation. 00:04:52.100 |
And we can also use this for content generation. 00:04:53.860 |
So here's an example from OpenAI, where you can give a different text prompt and 00:04:58.580 |
have an AI-generated fictional image for you. 00:05:01.260 |
And so there's a talk on this that you can also watch on YouTube, 00:05:06.500 |
which basically says that LSTMs are dead, long live transformers. 00:05:13.340 |
So we can enable a lot more applications for transformers. 00:05:17.660 |
They can be applied to any form of sequence modeling. 00:05:25.740 |
So basically imagine all sorts of generative modeling problems. 00:05:29.180 |
Nevertheless, there are a lot of missing ingredients. 00:05:31.660 |
So, like the human brain, we need some sort of external memory unit. 00:05:40.460 |
So one nice work you might want to check out is called Neural Turing Machines. 00:05:44.340 |
Similarly, the current attention mechanisms are very computationally complex 00:05:49.100 |
in terms of time, and they scale quadratically, which we'll discuss later. 00:05:54.700 |
And the third problem is that we want to align our current sort of language models 00:05:58.500 |
with how the human brain works and human values. 00:06:03.260 |
OK, so now I will dive deeper into the attention mechanisms. 00:06:12.260 |
So initially, they used to be very simple mechanisms. 00:06:17.780 |
Attention was inspired by the process of importance weighting, 00:06:21.220 |
or putting attention on different parts of an image, 00:06:24.420 |
similar to how a human might focus more on the foreground 00:06:28.380 |
if you have an image of a dog, compared to the rest of the background. 00:06:31.060 |
So in the case of soft attention, what you do is you learn the simple 00:06:34.340 |
soft attention weighting for each pixel, which can be a weight between 0 and 1. 00:06:39.100 |
The problem over here is that this is a very expensive computation, 00:06:42.380 |
because, as shown in the figure on the left, 00:06:46.540 |
we are calculating this attention map for the whole image. 00:06:48.740 |
What you can do instead is calculate a hard 0-or-1 attention map, 00:06:55.500 |
where we directly put a 1 wherever the dog is and a 0 wherever it's background, 00:07:03.260 |
but the problem is that this is not differentiable and makes things harder to train. 00:07:06.140 |
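To make the contrast concrete, here is a minimal NumPy sketch, with made-up scores standing in for the learned per-region relevance:

```python
# A minimal sketch of soft vs. hard attention; the scores are made-up
# stand-ins for learned per-region relevance.
import numpy as np

scores = np.array([2.0, 0.5, -1.0])   # relevance of three image regions

# Soft attention: differentiable weights in (0, 1) that sum to 1.
soft = np.exp(scores) / np.exp(scores).sum()

# Hard attention: a 0/1 one-hot selection of the single best region.
# The argmax step is what breaks differentiability.
hard = np.zeros_like(scores)
hard[np.argmax(scores)] = 1.0

print(soft)  # ~[0.79, 0.18, 0.04] -> every region gets some weight
print(hard)  # [1. 0. 0.] -> all-or-nothing
```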
Going forward, we also have different varieties of basic attention mechanisms 00:07:10.980 |
that were proposed before self-attention. 00:07:14.140 |
So the first variety here is global attention models. 00:07:17.500 |
So in global attention models, for each hidden-state output, 00:07:26.300 |
you calculate attention weights over every input in the sequence, and these are
element-wise multiplied with your current output to calculate your final output. 00:07:35.340 |
The second variety is local attention models,
where instead of calculating the global attention over the whole sequence length, 00:07:39.980 |
you only calculate the attention over a small window, 00:07:43.500 |
and then you weight the window's attention into the current output to get the final output. 00:07:50.340 |
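As a rough sketch of the difference (a single current output attending over hidden states, with illustrative shapes and random values):

```python
# A rough sketch of global vs. local attention weighting, assuming one
# current output (the query) attending over a sequence of hidden states.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
seq_len, d = 10, 8
hiddens = rng.normal(size=(seq_len, d))  # hidden states for the sequence
query = rng.normal(size=(d,))            # current output

# Global attention: scores over the entire sequence length.
global_w = softmax(hiddens @ query)
context_global = global_w @ hiddens

# Local attention: scores only over a small window around position p.
p, window = 5, 2
lo, hi = max(0, p - window), min(seq_len, p + window + 1)
local_w = softmax(hiddens[lo:hi] @ query)
context_local = local_w @ hiddens[lo:hi]
```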
So moving on, I'll pass it on to Chaitanya to discuss self-attention mechanisms. 00:07:58.580 |
>> Yeah, thank you, Div, for covering a brief overview of how the primitive attention mechanisms worked. 00:08:05.340 |
Now, just before we talk about self-attention, just a bit of a trivia that 00:08:10.060 |
this term was first introduced by a paper from Lin et al., 00:08:14.060 |
which provided a framework for a self-attentive mechanism for sentence embeddings. 00:08:22.540 |
And now moving on to the main crux of the transformers paper: self-attention. 00:08:28.380 |
So self-attention is the basic building block 00:08:33.300 |
that makes the transformer models work so well, and to understand it, 00:08:42.140 |
we can break down self-attention as a search and retrieval problem. 00:08:46.980 |
So the problem is that, given a query Q, we need to find a set of keys K 00:08:53.260 |
which are most similar to Q, and return the corresponding values V. 00:08:58.220 |
Now, these three vectors can be drawn from the same source. 00:09:01.060 |
For example, we can have Q, K, and V all equal to a single input vector X. 00:09:08.580 |
In transformers, these vectors are instead obtained by applying different linear 00:09:14.460 |
transformations to X, so as to enable the model to capture more complex interactions between 00:09:18.780 |
the different tokens at different places in the sentence. 00:09:22.860 |
Now, attention is computed as a weighted summation of the value vectors, 00:09:27.540 |
where the weights come from the similarities between the query and key vectors. 00:09:33.260 |
And in the transformers paper, they use the scaled dot product as the similarity function. 00:09:40.260 |
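For reference, the scaled dot-product attention from the paper can be written as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors, and dividing by $\sqrt{d_k}$ keeps the dot products from growing so large that the softmax saturates.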
And another important aspect of the transformers was the introduction of multi-head self-attention. 00:09:46.420 |
So what multi-head self-attention means is that, at every layer, 00:09:50.700 |
the self-attention is performed multiple times in parallel, 00:09:54.260 |
which enables the model to learn multiple representation subspaces. 00:09:58.660 |
So in a way, you can think of it that each head has a power to look at 00:10:05.820 |
different things and to learn different semantics. 00:10:08.580 |
For example, one head can be learning to predict the part of speech of each token, 00:10:15.220 |
one head might be learning the syntactic structure of the sentence, 00:10:19.660 |
and all the other things that are needed to understand what the sentence means. 00:10:28.740 |
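In the paper's notation, the heads are computed in parallel from separately projected queries, keys, and values, then concatenated and projected back down:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$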
Now, to better understand how self-attention works and what the 00:10:32.020 |
different computations are, there is a short video. 00:10:35.140 |
So as you can see, there are three incoming tokens. 00:10:43.220 |
We apply linear transformations to get the key and value vectors for each input, 00:10:49.100 |
and then once a query Q comes, we calculate its similarity with the 00:10:53.220 |
respective key vectors, and then multiply those scores with the value 00:10:58.900 |
vector, and then add them all up to get the output. 00:11:02.740 |
The same computation is then performed on all the tokens, and we get the outputs. 00:11:10.380 |
So as you can see here, the final output of the self-attention layer is shown in the darker color. 00:11:17.660 |
So now again, for the final token, we perform everything the same way. 00:11:22.820 |
We get the similarity scores, and then those similarity scores weigh the value 00:11:26.820 |
vectors, and then we finally perform the addition to get the self-attention output. 00:11:39.220 |
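The steps narrated above can be sketched in a few lines of NumPy; this is a simplified single-head version with made-up dimensions and random weights, not a full transformer layer:

```python
# A minimal sketch of single-head self-attention, mirroring the steps above.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 8, 4           # three incoming tokens

X = rng.normal(size=(seq_len, d_model))   # token embeddings
W_q = rng.normal(size=(d_model, d_k))     # learned linear transformations
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v       # queries, keys, values

scores = Q @ K.T / np.sqrt(d_k)           # scaled dot-product similarities
weights = softmax(scores)                 # each token's attention over all keys
output = weights @ V                      # weighted sum of the value vectors

print(output.shape)                       # (3, 4): one output per token
```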
Apart from self-attention, there are some other necessary ingredients that make the transformers work so well. 00:11:46.540 |
One important aspect is the presence of positional representations, or positional embeddings. 00:11:51.740 |
So one reason RNNs worked very well was that they process the 00:11:58.020 |
information in a sequential ordering, so there was this notion of ordering, 00:12:03.220 |
which is also very important in understanding language, because we all 00:12:06.980 |
know that we read any piece of text from left to right in most languages. 00:12:17.220 |
So there is a notion of ordering, which is lost in kind of self-attention 00:12:20.740 |
because every word is attending to every other word. 00:12:24.060 |
That's why this paper introduced a separate embedding layer for introducing positional information. 00:12:30.980 |
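For reference, the original paper's fixed sinusoidal encodings assign each position $pos$ and embedding dimension index $i$ the values:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$

These are added to the token embeddings so the model can recover word order; learned positional embeddings, which the paper also tried, play the same role.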
The second important aspect is having nonlinearities. 00:12:34.300 |
So if you think of all the computation that is happening in the self-attention 00:12:38.100 |
layer, it's all linear because it's all matrix multiplication. 00:12:41.220 |
But as we all know, deep learning models work well when they are able to 00:12:47.700 |
learn more complex mappings between input and output, which can be attained with nonlinear activations in the feed-forward layers. 00:12:54.220 |
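Concretely, the position-wise feed-forward network in the paper is a two-layer MLP with a ReLU in between, which is what supplies the nonlinearity:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$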
And the third important component of the transformers is the masking. 00:12:59.020 |
So masking is what allows us to parallelize the operations. 00:13:03.020 |
Since every word can attend to every other word, in the decoder part of the 00:13:07.220 |
transformers, which Advay is going to be talking about later, the problem 00:13:11.380 |
comes that you don't want the decoder to look into the future, because that can 00:13:18.340 |
leak future information. So that's why masking helps the decoder avoid that future information and learn only from the past. 00:13:29.580 |
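A minimal sketch of what this causal masking looks like in NumPy: positions above the diagonal (the future) are set to $-\infty$ before the softmax, so they receive exactly zero attention weight:

```python
# A minimal sketch of causal (decoder) masking with made-up scores.
import numpy as np

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

# True above the diagonal = "future" positions each token must not see.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf            # softmax(-inf) -> weight of exactly 0

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))         # row t only attends to positions <= t
```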
So now on to the encoder-decoder architecture of the transformers. 00:13:36.580 |
- Yeah, thanks, Chaitanya, for talking about self-attention. 00:13:39.620 |
So self-attention is sort of the key ingredient, or one of the key ingredients, of the transformer. 00:13:47.540 |
But at a very high level, the model that was proposed in the Vaswani et al. 00:13:51.940 |
paper of 2017 was like previous language models in the sense that it had an encoder-decoder architecture. 00:13:59.540 |
What that means is, let's say you're working on a translation problem. 00:14:04.940 |
The way that would work is you would read in the entire input of your English 00:14:09.340 |
sentence, you would encode that input, so that's the encoder part of the network. 00:14:13.780 |
And then you would generate token by token the corresponding French translation. 00:14:18.420 |
And the decoder is the part of the network that is responsible for generating the French tokens. 00:14:24.420 |
So you can think of these encoder blocks and decoder blocks as essentially building blocks: 00:14:30.980 |
they have these sub-components that make them up. 00:14:34.580 |
And in particular, the encoder block has three main sub-components. 00:14:38.540 |
The first is a self-attention layer that Saithanya talked about earlier. 00:14:43.180 |
And as talked about earlier as well, you need a feed-forward layer after that 00:14:48.500 |
because the self-attention layer only performs linear operations. 00:14:52.140 |
And so you need something that can capture the non-linearities. 00:14:58.060 |
And lastly, there are residual connections around each of these layers. 00:15:02.740 |
The decoder is very similar to the encoder, but there's one difference, 00:15:06.340 |
which is that it has this extra layer because the decoder doesn't just do 00:15:10.100 |
multi-head attention on the output of the previous layers. 00:15:13.900 |
So for context, the encoder does multi-head attention for each self-attention layer. 00:15:21.300 |
In each of the encoder blocks, it does multi-head attention looking only at the outputs of the previous encoder layer. 00:15:29.420 |
The decoder does that too, in the sense that it also looks at the previous 00:15:34.380 |
layers of the decoder, but in addition it looks at the output of the encoder. 00:15:38.420 |
And so it needs an extra multi-head attention layer over the encoder outputs. 00:15:46.700 |
So, because every token can look at every other token, 00:15:51.300 |
you want to make sure in the decoder that you're not looking into the future: 00:15:57.340 |
if you're at position 3, you shouldn't be able to look at position 4 and position 5. 00:16:00.020 |
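Putting these pieces together, here is a schematic NumPy sketch of how the sub-components stack inside each block; it is single-head with stand-in helpers (the names and the simplified feed-forward are ours, not the paper's code):

```python
# A schematic sketch of the encoder/decoder block structure described above.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, mask=None, d_k=8):
    # A real model uses learned Q/K/V projections; identity here for brevity.
    scores = q_in @ kv_in.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ kv_in

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def feed_forward(x):
    return np.maximum(0, x)          # stand-in for the two-layer MLP

def encoder_block(x):
    x = layer_norm(x + attention(x, x))               # self-attention + residual
    return layer_norm(x + feed_forward(x))            # nonlinearity + residual

def decoder_block(y, enc_out, causal_mask):
    y = layer_norm(y + attention(y, y, causal_mask))  # masked self-attention
    y = layer_norm(y + attention(y, enc_out))         # extra layer: attend to encoder
    return layer_norm(y + feed_forward(y))
```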
So those are sort of all the components that led to the creation of the transformer model. 00:16:10.700 |
And let's talk a little bit about the advantages and drawbacks of this model. 00:16:16.540 |
So the two main advantages, which are huge advantages and which are why 00:16:20.660 |
transformers have done such a good job of revolutionizing many, 00:16:25.740 |
many fields within deep learning, are as follows. 00:16:29.660 |
So the first is there is this constant path length between any two positions 00:16:34.260 |
in a sequence because every token in the sequence is looking at every other token. 00:16:39.700 |
And this basically solves the problem that Div talked about earlier. 00:16:44.580 |
You no longer have the problem with long sequences, where if you're trying to 00:16:47.940 |
predict a token that depends on a word that was far, far back in the sentence, 00:16:54.220 |
you would lose that context. 00:16:55.940 |
Now, the distance between them is only one in terms of the path length. 00:17:00.700 |
Also, because of the nature of the computation that's happening, 00:17:03.900 |
transformer models lend themselves really well to parallelization. 00:17:07.380 |
And because of the advances that we've had with GPUs, basically, 00:17:10.900 |
if you take a transformer model with n parameters and you take a model that 00:17:14.660 |
isn't a transformer, say like an LSTM, also with n parameters, 00:17:18.340 |
training the transformer model is going to be much faster because of the parallelization. 00:17:26.180 |
The disadvantages are basically self-attention takes quadratic time 00:17:31.220 |
because every token looks at every other token. 00:17:33.620 |
Order n squared, as you might know, does not scale well. 00:17:36.500 |
And there's actually been a lot of work in trying to tackle this. 00:17:41.500 |
Big Bird, Linformer, and Reformer are all approaches to try and 00:17:44.580 |
make this linear or quasi-linear, essentially. 00:17:47.580 |
And yeah, we highly recommend going through Jay Alammar's blog, 00:17:55.820 |
the Illustrated Transformer, which provides great visualizations and 00:17:59.940 |
explains everything that we just talked about in great detail. 00:18:02.260 |
Yeah, and I'd like to pass it on to Chaitanya for applications of transformers. 00:18:10.460 |
So now moving on to like some of the recent work, some of the work that 00:18:14.780 |
very shortly followed the Transformers paper. 00:18:18.060 |
So one of the models that came out was GPT, the GPT architecture from OpenAI; 00:18:24.940 |
the latest model that OpenAI has in the GPT series is GPT-3. 00:18:31.260 |
So it consists of only the decoder blocks from Transformers and is trained 00:18:35.500 |
on a traditional language modeling task, 00:18:40.300 |
which is predicting the next token given the last T tokens that the model has seen. 00:18:45.900 |
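Written out, this next-token objective is just minimizing the negative log-likelihood of each token given the window of tokens before it:

$$\mathcal{L} = -\sum_{t} \log p_\theta\!\left(x_t \mid x_{t-T}, \ldots, x_{t-1}\right)$$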
And for any downstream task, 00:18:49.900 |
you can just train a classification layer on the last hidden state. 00:18:57.660 |
And since the model is generative in nature, you can also use the pre-trained 00:19:03.420 |
network for generative kinds of tasks, 00:19:09.100 |
such as summarization and natural language generation. 00:19:12.380 |
Another important reason GPT-3 gained popularity was its ability to 00:19:18.540 |
perform what the authors called in-context learning. 00:19:23.580 |
This is the ability wherein the model can figure out, in a few-shot setting, 00:19:28.780 |
what the task is and complete it without performing any gradient updates. 00:19:33.820 |
For example, let's say the model is shown a bunch of addition examples. 00:19:38.620 |
And then if you pass in a new input 00:19:42.220 |
and just end it at the equals sign, the model tries to predict the next token, 00:19:49.020 |
which very well comes out to be the sum of the numbers shown. 00:19:55.660 |
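A hypothetical few-shot prompt of this kind might look like the following, where the model is expected to continue with "27":

```
12 + 9 = 21
15 + 8 = 23
23 + 4 =
```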
Other examples can be the spell-correction task or the translation task. 00:20:00.540 |
So this was the ability that made GPT-3 so much talked about in the NLP world. 00:20:08.060 |
And right now also, many applications have been made using GPT-3, 00:20:13.820 |
one of them being GitHub Copilot in VS Code, which tries to generate a piece of code 00:20:21.820 |
given a docstring or a natural-language description. 00:20:26.140 |
Another major model that came out that was based on the Transformers architecture was BERT. 00:20:32.220 |
So BERT lends its name from being an acronym for Bidirectional Encoder Representations from Transformers. 00:20:38.940 |
It consists of only the encoder blocks of the Transformers, unlike GPT-3, which uses only the decoder blocks. 00:20:46.620 |
Now, because of this change, there comes a problem because BERT has only the encoder blocks. 00:20:57.260 |
It cannot be pre-trained on a naive language modeling task 00:21:00.220 |
because of the problem of data leakage from the future. 00:21:02.940 |
So what the authors came up with was a clever idea. 00:21:06.700 |
And they came up with a novel task called masked language modeling, 00:21:10.860 |
which involves replacing certain words with a [MASK] placeholder token. 00:21:15.180 |
And then the model tries to predict those words given the entire context; for example, given "the [MASK] sat on the mat," the model should predict "cat." 00:21:19.980 |
Now, apart from this token-level task, the authors also added a second objective 00:21:25.900 |
called the next sentence prediction, which was a sentence-level task. 00:21:28.780 |
Wherein, given two chunks of text, the model tried to predict 00:21:34.300 |
whether the second sentence followed the first sentence or not in the original text. 00:21:39.180 |
And now, after pre-training this model for any downstream task, 00:21:43.580 |
the model can be further fine-tuned with an additional classification layer. 00:21:49.580 |
So these are the two models that have been very popular 00:21:54.300 |
and have made their way into a lot of applications. 00:21:58.940 |
But the landscape has changed quite a lot since we taught this class. 00:22:02.540 |
There are models with different pre-training techniques, like ELECTRA and DeBERTa. 00:22:07.020 |
And there are also models that do well in other modalities, 00:22:12.140 |
and which we are going to be talking about in other lecture series as well. 00:22:20.140 |
- Yeah, just want to end by saying thank you all for watching this. 00:22:25.580 |
And we have a really exciting set of videos with truly amazing speakers. 00:22:30.940 |
And we hope you are able to derive value from them. 00:22:30.940 |