Stanford CS25: V4 | Hyung Won Chung of OpenAI
Chapters
0:00 Introduction
2:05 Identifying and understanding the dominant driving force behind AI.
15:18 Overview of Transformer architectures: encoder-decoder, encoder-only and decoder-only
23:29 Differences between encoder-decoder and decoder-only, and rationale for encoder-decoder’s additional structures from the perspective of scaling.
00:00:12.760 |
He has worked on various aspects of large language models. 00:00:16.520 |
Things like pre-training, instruction fine-tuning, 00:00:31.520 |
and the training framework used to train the PaLM language model. 00:01:06.600 |
I'm giving a lecture on transformers at Stanford. 00:01:15.480 |
Many of you here or on Zoom will actually go on to shape the future of AI. 00:01:23.440 |
So, that could be a good topic to think about. 00:01:26.440 |
When we talk about something in the future, 00:01:29.840 |
the best place to get advice is to look into the history. 00:01:37.240 |
transformer and try to learn many lessons from there. 00:01:53.280 |
hope to project into the future what might be coming. 00:02:01.000 |
and we'll look at some of the architectures of the transformers. 00:02:08.320 |
It's saying AI is advancing so fast that it's hard to keep up. 00:02:12.760 |
It doesn't matter if you have years of experience, 00:02:22.520 |
energy catching up with these latest developments, 00:02:28.800 |
and then not enough attention goes into older things, because they seem outdated. 00:02:36.920 |
But I think it's important actually to look into that, 00:02:41.720 |
when things are moving so fast beyond our ability to catch up, 00:02:45.160 |
what we need to do is study the change itself, 00:02:52.200 |
the current thing and try to map how we got here, 00:02:55.320 |
and from which we can look into where we are heading towards. 00:02:59.200 |
So, what does it mean to study the change itself? 00:03:07.280 |
the dominant driving forces behind the change. 00:03:17.320 |
the dominant one, because we're not trying to be really accurate; 00:03:19.840 |
we just want to have a sense of the directionality. 00:03:22.320 |
Second, we need to understand the driving force really well, 00:03:25.400 |
and then after that we can predict the future trajectory 00:03:31.840 |
You heard it right; I mentioned predicting the future. 00:03:38.040 |
But I think it's actually not that impossible, if you narrow things down 00:03:53.800 |
and raise your prediction accuracy from, say, one percent to 10 percent, 00:04:01.520 |
and make many predictions; say one of them will be really, really correct, 00:04:04.080 |
meaning it will have an outsized impact that outweighs everything, 00:04:13.720 |
so you only really have to be right a few times. 00:04:16.920 |
So, if we think about why predicting the future is difficult, 00:04:26.280 |
where we can all do the prediction with perfect accuracy, 00:04:31.440 |
So here I'm going to do a very simple experiment 00:04:34.520 |
of dropping this pen and follow this same three-step process. 00:04:40.080 |
So we're going to identify the dominant driving force. 00:04:43.360 |
First of all, what are the driving forces acting on this pen? 00:04:48.520 |
We also have, say, air friction if I drop it, 00:04:53.520 |
and that will cause what's called a drag force acting upwards, 00:04:57.480 |
and actually, depending on how I drop this, the orientation, 00:05:01.760 |
the aerodynamic interaction will be so complicated 00:05:04.960 |
that we don't currently have any analytical way of modeling that. 00:05:08.880 |
We can do it with the CFD, the computational fluid dynamics, 00:05:15.040 |
This is heavy enough that gravity is probably the only dominant force. 00:05:20.360 |
Second, do we understand this dominant driving force, which is gravity? 00:05:24.040 |
And we do because we have this Newtonian mechanics 00:05:28.720 |
And then with that, we can predict the future trajectory of this pen. 00:05:32.640 |
And if you remember from this dynamics class, 00:05:42.440 |
then ½gt² will give the precise trajectory of this pen. 00:05:49.760 |
So if there is a single driving force that we really understand, 00:05:54.160 |
it's actually possible to predict what's going to happen. 00:05:57.640 |
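A minimal sketch of that prediction, assuming gravity really is the only force acting; this is just the standard ½gt² formula with illustrative times, nothing beyond what the lecture states:

```python
g = 9.81  # m/s^2, gravitational acceleration

def distance_fallen(t_seconds: float) -> float:
    """Distance fallen from rest after t seconds, with gravity as the only force."""
    return 0.5 * g * t_seconds ** 2

for t in (0.1, 0.2, 0.3):
    print(f"t = {t:.1f} s -> {distance_fallen(t):.3f} m")
```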
So then why do we feel that predicting the future in general is so hard? 00:06:10.760 |
It's because the driving forces acting on the general prediction are so complicated 00:06:17.360 |
that we cannot predict in the most general sense. 00:06:19.800 |
So here's my cartoon way of thinking about the prediction of future. 00:06:23.600 |
On the x-axis, we have the number of dominant driving forces. 00:06:28.440 |
So on the left-hand side, we have dropping a pen. 00:06:36.720 |
And then as you add more stuff, it just becomes impossible. 00:06:44.320 |
And you might think, "OK, I see new things coming in all the time, 00:06:51.360 |
"and some people will come up with a new agent, 00:06:58.040 |
"It's just I'm not even able to catch up with the latest thing. 00:07:01.880 |
"How can I even hope to predict the future of the AI research?" 00:07:11.240 |
But I argue that there is a dominant driving force that is governing a lot, if not all, of AI research. 00:07:15.560 |
And because of that, I would like to point out 00:07:18.720 |
that it's actually closer to the left than to the right 00:07:28.400 |
Oh, maybe before that, I would like to caveat that 00:07:34.600 |
I would like to not focus too much on the technical stuff, 00:07:37.760 |
which you can probably do better in your own time, 00:07:43.920 |
And for that, I want to share what my opinion is. 00:07:50.720 |
And by no means am I saying this is correct. 00:08:03.400 |
On the y-axis, we have the calculations, the FLOPs, you get per dollar: 00:08:07.760 |
if you pay $100, how much computing power do you get? 00:08:13.640 |
And then on the x-axis, we have time, spanning more than 100 years. 00:08:30.800 |
If I see an exponential trend like this, I should say, okay, I should not compete with this; 00:08:33.840 |
rather, I should try to leverage it as much as possible. 00:08:37.920 |
And so what this means is you get 10x more compute 00:08:42.920 |
every five years if you spend the same amount of dollars. 00:08:46.760 |
So, in other words, the cost of compute is going down exponentially. This is a simple observation, 00:09:00.560 |
but that is, I think, really important to think about. 00:09:11.560 |
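As a rough illustration of that trend, expressed as a tiny calculation; the "10x every five years" rate is read off the plot and is an approximation, not an exact figure:

```python
def compute_per_dollar_multiplier(years: float, tenfold_period_years: float = 5.0) -> float:
    """How much more compute the same dollar buys after `years`, assuming ~10x every 5 years."""
    return 10 ** (years / tenfold_period_years)

for years in (5, 10, 20):
    print(f"after {years:2d} years: {compute_per_dollar_multiplier(years):,.0f}x compute per dollar")
```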
Let's think about the job of the AI researchers. 00:09:17.640 |
And one somewhat unfortunately common approach 00:09:32.800 |
is to model how we humans think, put that into some kind of mathematical model, and teach that to the machine. 00:09:35.760 |
And now the question is, do we understand how we think 00:09:48.240 |
And what happens if we go with this kind of approach 00:09:55.280 |
And so you can maybe get a paper or something, 00:10:18.560 |
And the Bitter Lesson is, I think, the single most important lesson in AI research. 00:10:35.200 |
What it says is: develop a more general method with weaker modeling assumptions or inductive biases, 00:10:38.680 |
and add more data and compute; in other words, scale up. 00:10:41.200 |
And that has been the recipe for essentially all of AI research. 00:10:54.240 |
And so it's much easier to get into AI nowadays 00:10:59.120 |
So this is, I think, really the key information. 00:11:04.120 |
We have this compute cost that is going down exponentially, 00:11:12.600 |
and just try to leverage that as much as possible. 00:11:15.080 |
And that is the driving force that I wanted to identify. 00:11:19.840 |
And I'm not saying this is the only driving force, 00:11:38.440 |
Say we compare two methods: one with more structure, more modeling assumptions, fancier math, whatever, and one with less. 00:11:50.320 |
With little compute, the more structured one does better at first, but it plateaus, because of some kind of structure backfiring. 00:11:53.840 |
The less structured one doesn't work at first, because we give a lot more freedom to the model, 00:11:57.760 |
but then as we add more compute, it starts working, and eventually it does better. 00:12:04.840 |
So does that mean we should just go with the least structure, 00:12:09.200 |
giving the most freedom to the model possible, from the get-go? 00:12:16.600 |
The answer is no. This red one here, with even less structure, will pick up a lot later. 00:12:25.880 |
We cannot indefinitely wait for the most general case. 00:12:31.320 |
where our compute situation is at this dotted line. 00:12:34.320 |
If we're here, we should choose this less structured one. 00:12:50.640 |
And so the difference between these two methods 00:12:53.120 |
is the additional inductive biases or structure we put in. 00:13:06.720 |
Given the compute, data, algorithms, and architectures that we have, 00:13:09.800 |
there's like an optimal inductive bias or structure 00:13:12.680 |
that we can add to the problem to make progress. 00:13:15.960 |
And that has been really how we have made so much progress. 00:13:24.400 |
But those structures can become a bottleneck later, when we have more compute, a better algorithm, or whatever, and then we need to remove them. 00:13:28.160 |
And as a community, we are very good at adding structure, 00:13:32.160 |
'cause there's an incentive structure with, like, papers: adding some structure and showing a gain gets you a paper, 00:13:38.000 |
but removing that doesn't really get you much. 00:13:42.640 |
And I think we should do a lot more of those. 00:13:45.080 |
So maybe another implication of this bitter lesson 00:14:11.600 |
is that the more general method seems more chaotic at the beginning, so it doesn't work yet, 00:14:18.360 |
but later we can put in more compute and then it can get better. 00:14:21.320 |
So it's really important to have this in mind. 00:14:27.440 |
So we have identified this dominant driving force behind AI research; 00:14:40.840 |
the next step is to understand this driving force better. 00:14:48.400 |
And for that, we need to go back to some history 00:14:51.960 |
of transformer 'cause this is a transformers class, 00:14:57.640 |
and look at the design decisions that were made by the researchers at the time, 00:15:10.000 |
and we'll go through the exercise of revisiting some of them. 00:15:18.420 |
So now we'll go into a little bit of the technical stuff. 00:15:22.100 |
For the Transformer architecture, there are some variants: the encoder-decoder, the encoder-only, and the decoder-only. 00:15:38.960 |
The third one, the decoder-only, you can think of as the current models, like GPT-3. 00:15:43.480 |
This has a lot less structure than the encoder-decoder. 00:15:46.360 |
So these are the three types we'll go into detail. 00:15:49.240 |
The second one, the encoder-only, is actually not that useful 00:15:52.800 |
in the most general sense; it still has some place, but we'll only touch on it briefly, 00:15:58.240 |
and then spend most of the time comparing one and three. 00:16:06.040 |
So first of all, let's think about what a transformer is. 00:16:08.960 |
Just at a very high level, from first principles, a Transformer is a sequence model, 00:16:15.160 |
and a sequence model has an input of a sequence. 00:16:18.720 |
The sequence elements can be words or images or whatever. 00:16:25.500 |
In this particular example, I'll show you with the words, 00:16:32.320 |
'cause we have to represent these words in computers, 00:16:36.880 |
which requires just some kind of encoding scheme. 00:16:40.720 |
So we just do it with a fixed number of integers. 00:16:49.160 |
The dominant paradigm nowadays is to represent each sequence element as a vector, 00:16:52.120 |
a dense vector, because we know how to multiply them together well. 00:16:57.760 |
And finally, this sequence model will model the interaction between the sequence elements, with dot products: 00:17:12.000 |
if the dot product between two elements is high, we can say semantically they are more related. 00:17:18.080 |
And the Transformer is a particular type of sequence model that uses attention to model this interaction. 00:17:26.960 |
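A minimal numpy sketch of that setup, with a made-up toy vocabulary and embedding size; this is an illustration, not the lecture's own code:

```python
import numpy as np

vocab = {"that": 0, "is": 1, "good": 2}   # toy encoding scheme: word -> integer id
d_model = 8                               # toy embedding dimension
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), d_model))  # one dense vector per token

tokens = ["that", "is", "good"]
ids = [vocab[w] for w in tokens]          # words as a fixed set of integers
vectors = embedding[ids]                  # (3, d_model): each element as a dense vector

# Pairwise dot products: the interaction a sequence model computes between elements.
print(vectors @ vectors.T)                # (3, 3) matrix of similarities
```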
So let's get into the details of this encoder decoder, 00:17:33.500 |
So let's go into a little bit, a piece at a time. 00:17:40.200 |
Let's take the example of machine translation, which used to be a very cool thing. 00:17:44.440 |
And so you have an English sentence, "That is good," which we want to translate into German. 00:17:50.080 |
So the first thing is to encode these words into dense vectors. 00:17:59.200 |
And then we let them take dot products with each other. 00:18:01.520 |
So these lines represent which elements can talk to which; 00:18:10.280 |
in the encoder, we use what is called bidirectional attention, where every element can attend to every other element. 00:18:14.800 |
And then we have this MLP or feed forward layer, 00:18:20.540 |
You just do some multiplication just because we can do it. 00:18:25.720 |
And then that's one layer, and we repeat that N times. 00:18:31.880 |
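Here is a minimal single-head numpy sketch of such a layer, leaving out layer norm, multiple heads, positional encodings, and proper initialization; the parameter names and sizes are made up for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, mask=None):
    """Single-head attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # dot products: who talks to whom
    if mask is not None:
        scores = np.where(mask, scores, -1e9) # block disallowed connections
    return softmax(scores) @ V

def encoder_layer(X, p):
    """One layer: bidirectional self-attention (no mask) plus a feed-forward MLP."""
    H = X + self_attention(X, p["Wq"], p["Wk"], p["Wv"])
    return H + np.maximum(H @ p["W1"], 0.0) @ p["W2"]

rng = np.random.default_rng(0)
d = 8
p = {k: rng.normal(size=(d, d)) * 0.1 for k in ("Wq", "Wk", "Wv", "W1", "W2")}
X = rng.normal(size=(3, d))   # "That is good" as three d-dimensional vectors
out = X
for _ in range(2):            # repeat N times (a real model uses fresh parameters per layer)
    out = encoder_layer(out, p)
print(out.shape)              # (3, 8): still a sequence of vectors
```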
And at the end, what you get is a sequence of vectors; that's the encoder side. 00:18:47.600 |
Then on the decoder side, we put in as input what the answer should be: 00:18:57.780 |
"Das ist gut," the German translation of "That is good." 00:19:00.200 |
And so we kind of go through the similar process. 00:19:14.320 |
But we cannot let each position attend to the future tokens, so when we train it, we should limit that. 00:19:23.780 |
So after this, again after N layers, you get the output, 00:19:35.220 |
and this is the general encoder-decoder architecture. 00:19:43.380 |
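The "limit" mentioned for the decoder side can be sketched as a causal mask over a toy three-token target; passing it as the mask in an attention computation like the sketch above blocks attention to future tokens:

```python
import numpy as np

seq_len = 3                                                 # e.g. "Das ist gut"
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]  row i may only look at columns 0..i: itself and earlier tokens
```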
Now I'll point out some important attention patterns. 00:19:55.740 |
The decoder also has to attend to the input, and that is done by this cross-attention mechanism, 00:19:59.020 |
which is just that each vector's representation in the decoder 00:20:04.580 |
should attend to some of the vectors in the encoder. 00:20:10.060 |
And one design decision, which is interesting, is that all the layers in the decoder 00:20:13.900 |
attend to the final layer output of the encoder. 00:20:16.980 |
I will come back to the implication of this design. 00:20:22.940 |
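A hedged numpy sketch of that cross-attention pattern, with toy shapes and made-up parameter names: queries come from the decoder states, keys and values come from the encoder's final-layer output, and every decoder layer would use that same final-layer output.

```python
import numpy as np

def cross_attention(decoder_X, encoder_final, Wq, Wk, Wv):
    """Queries from decoder states; keys/values from the encoder's final layer."""
    Q = decoder_X @ Wq
    K = encoder_final @ Wk
    V = encoder_final @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)      # softmax over encoder positions
    return w @ V                               # each target vector mixes in source info

rng = np.random.default_rng(0)
d = 8
encoder_final = rng.normal(size=(3, d))        # final-layer output for "That is good"
decoder_X = rng.normal(size=(3, d))            # decoder states for "Das ist gut"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(decoder_X, encoder_final, Wq, Wk, Wv).shape)   # (3, 8)
```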
Now let's move on to the second type of architecture, the encoder-only. 00:20:35.700 |
And then in this case, the final output is a single vector, 00:20:43.100 |
and that vector represents the whole input sequence. 00:20:49.500 |
And then let's say we do some kind of sentiment analysis: we add a task-specific layer on top of that vector, mapping it to, say, positive or negative. 00:20:59.540 |
And that kind of task-specific layer is required for all these cases. 00:21:06.620 |
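A minimal sketch of that encoder-only setup, where the pooling and the task-specific head are illustrative assumptions rather than the exact recipe described here:

```python
import numpy as np

def sentiment_logits(encoder_outputs, W_head):
    """Pool the encoder output into one vector, then apply a task-specific head."""
    pooled = encoder_outputs.mean(axis=0)      # (d_model,) single summary vector
    return pooled @ W_head                     # (2,) scores, e.g. negative/positive

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(3, 8))      # encoder output for a 3-token input
W_head = rng.normal(size=(8, 2))               # task-specific parameters
print(sentiment_logits(encoder_outputs, W_head))
```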
And what this means is that, at the time, every task needed its own task-specific setup like this. 00:21:19.900 |
This was how the field really advanced at the time. 00:21:32.880 |
So the additional structure that was put into this particular architecture 00:21:35.140 |
is that we're gonna give up on the generation. 00:21:38.860 |
If we do that, it becomes a lot simpler problem. 00:21:43.460 |
Instead of sequence-to-sequence, we're talking about sequence to classification labels. 00:21:51.780 |
And at the time, a lot of the papers, a lot of the research, were like that. 00:22:04.020 |
But if you look at it from this perspective, it was useful in the short term, 00:22:15.100 |
but in the long term, it's not really useful. 00:22:18.220 |
So we won't be looking at this encoder-only architecture going forward. 00:22:40.700 |
Now, the third architecture is the decoder-only. Some people think this decoder-only architecture 00:22:42.940 |
is only used for language modeling, next-token prediction, 00:22:45.340 |
so it cannot be used for supervised learning; but it can be. 00:22:48.560 |
The trick is to concatenate the input, "That is good," with the target, 00:22:54.260 |
and then it just becomes simply sequence in, sequence out. 00:22:57.940 |
So what we do is, the self-attention mechanism here 00:23:01.180 |
is actually handling both the cross-attention between input and target 00:23:05.740 |
and the sequence learning within each of them. 00:23:10.980 |
And then, as I mentioned, the output is a sequence. 00:23:15.140 |
And then the key design features are this self-attention handling everything, 00:23:21.700 |
and sharing the parameters between the input and the target. 00:23:35.420 |
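A minimal sketch of that trick with toy token ids: concatenate input and target into a single sequence, apply one causal mask, and let a single shared parameter set handle everything.

```python
import numpy as np

input_ids  = [0, 1, 2]            # toy ids for "That is good"
target_ids = [3, 4, 5]            # toy ids for "Das ist gut"
ids = input_ids + target_ids      # one concatenated sequence, one shared parameter set

n = len(ids)
causal_mask = np.tril(np.ones((n, n), dtype=bool))
# Target positions attending back to input positions play the role of cross-attention;
# attention within the input or within the target is ordinary self-attention.
print(causal_mask.astype(int))
```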
So now, comparing these two, they look very different, at least on the schematics. 00:23:39.780 |
And I argue that they're actually quite similar. 00:23:45.880 |
To show that, we're gonna transform, starting from this encoder-decoder, into something like the decoder-only. 00:24:08.840 |
And then as we go through, we'll populate this table. 00:24:12.620 |
So let's first look at this additional cross-attention. 00:24:19.380 |
The encoder-decoder has this additional red block, the cross-attention, 00:24:22.040 |
compared to the simpler one that doesn't have that. 00:24:24.220 |
So we wanna make the left closer to the right. 00:24:28.020 |
So that means we need to either get rid of it or do something about it. 00:24:38.180 |
It turns out the cross-attention and the self-attention actually have the same number of parameters, the same shape. 00:24:41.460 |
So that's the first step, share both of these. 00:24:43.620 |
And then it becomes mostly the same mechanism. 00:24:58.660 |
The second difference is that the encoder-decoder architecture uses separate parameters for the input and the target. 00:25:05.900 |
So if you wanna make the left closer to the right, we share these parameters. 00:25:14.740 |
Third difference is the target to input attention pattern. 00:25:17.860 |
So we need to connect the target to the input, 00:25:22.460 |
In the encoder decoder case, we had this cross-attention, 00:25:47.700 |
whereas in the decoder-only, within the self-attention, we are looking at the same layer's representation. 00:25:57.220 |
So to make them match, we have to bring back this attention to each layer. 00:26:00.700 |
So now layer one will be attending to layer one of this. 00:26:04.760 |
And finally, the last difference is the input attention. 00:26:09.700 |
I mentioned this bidirectional attention for the encoder; in the decoder-only, the attention is unidirectional, causal. 00:26:26.540 |
And with these changes, these two architectures are almost identical. 00:26:29.820 |
There's a little bit of difference left in the cross-attention, but if you train 00:26:36.480 |
these two architectures on the same task, same data, 00:26:38.780 |
I think you will get pretty much the same result, within the noise, 00:26:40.580 |
probably closer than if you train the same thing twice. 00:26:48.260 |
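One way to visualize the remaining attention-pattern difference is to compare a fully causal mask with a prefix-style mask that keeps the input portion bidirectional; the sizes are toy values, and "prefix" is my label for the illustration, not a term from the lecture:

```python
import numpy as np

n_input, n_target = 3, 3
n = n_input + n_target

causal = np.tril(np.ones((n, n), dtype=bool))   # decoder-only: causal everywhere
prefix = causal.copy()
prefix[:n_input, :n_input] = True               # input tokens attend to each other fully

print(causal.astype(int))
print(prefix.astype(int))
```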
Now we'll look at what the additional structures are. 00:26:57.480 |
So we can say that the encoder-decoder, compared to the decoder-only, 00:27:02.400 |
has these additional structures and inductive biases built in. 00:27:11.140 |
The first structure is the assumption that the input and the target sequences are different enough 00:27:18.580 |
that it'll be useful to use separate parameters for them. 00:27:30.420 |
Back when the Transformer was introduced in 2017, the dominant task was machine translation. 00:27:46.720 |
So in that task, we have this input and target in different languages, 00:28:06.860 |
but modern language modeling is just about learning knowledge. 00:28:19.180 |
So does it make sense to have separate parameters for the input and the target, 00:28:31.800 |
and represent the same knowledge twice, in separate parameters? 00:28:50.420 |
and with Jason, we did this instruction fine-tuning work, 00:28:54.100 |
and what this is, is you take the pre-trained model, 00:29:05.340 |
but here, let's think about the performance gain we saw on the model 00:29:15.180 |
which is T5-based, an encoder-decoder architecture. 00:29:28.780 |
And then at the end, we just spent like three days on T5, 00:29:31.540 |
but the performance gain was a lot higher on this. 00:29:43.340 |
So my hypothesis is that it's about the length. 00:29:48.340 |
So for the academic datasets, we used like 1,832 tasks, 00:29:52.700 |
and they have this very distinctive characteristic: 00:29:58.620 |
the input is long, in order to make the task more difficult, 00:30:02.740 |
but the target is typically short, because if it weren't, there's no way to grade it. 00:30:07.900 |
So what happens is you have a long text as the input and a short target. 00:30:12.380 |
And so this is kind of the length distribution: a long sequence going into the input, 00:30:25.620 |
and a very different type of sequence going into the target. 00:30:31.060 |
The encoder-decoder has an assumption built in that the input and the target will be very different, 00:30:33.340 |
and that structure really shines because of this. 00:30:38.260 |
But that was, I think, why this particular architecture 00:30:41.780 |
was just suitable for fine-tuning with the academic datasets. 00:31:03.300 |
But just because the academic tasks have short targets doesn't mean that we are not interested in longer ones. 00:31:05.500 |
Actually, if anything, we are more interested in that. 00:31:15.380 |
And moreover, if we think about these chat applications, 00:31:26.540 |
my question is, does that assumption still make sense? 00:31:34.700 |
So that was the first inductive bias we just mentioned. 00:31:40.540 |
And then the second structure is that the target elements can only attend to the final output of the encoder. 00:31:47.980 |
Let's look at this additional structure, what that means. 00:32:04.580 |
In deep learning, we typically learn hierarchical representations; for example, in computer vision, 00:32:06.940 |
the lower, bottom layers encode something like edges, 00:32:10.340 |
and the top, higher layers combine those features. 00:32:16.020 |
It's a hierarchical representation learning method. 00:32:21.780 |
So if decoder layer one attends to the encoder's final layer, 00:32:26.620 |
which probably has a very different level of information, 00:32:29.380 |
isn't that some kind of an information bottleneck, 00:32:31.660 |
which is actually what motivated the original attention mechanism? 00:32:35.820 |
And in practice, I would say, in my experience, it doesn't seem to matter that much. 00:32:40.540 |
And that's because my experience was limited to a certain scale. 00:32:49.020 |
But what if we have 10X or 1,000X more layers? 00:33:01.500 |
The final structure we're gonna talk about is the bidirectional attention in the encoder. 00:33:05.380 |
When we do this input encoding, there's this bidirectional attention. 00:33:21.020 |
In 2018, when we were solving that question answering task, SQuAD, this really mattered, 00:33:32.700 |
I think maybe boosting up the SQuAD score by like 20 points. 00:33:37.580 |
But at scale, I don't think this matters that much. 00:33:43.380 |
So in Flan 2, we tried both bidirectional and unidirectional fine-tuning, and it didn't make much of a difference. 00:33:50.700 |
But I wanna point out that this bidirectionality actually brings an engineering challenge for multi-turn applications. 00:33:59.500 |
So at every turn, the new input has to be encoded again. 00:34:03.580 |
And for this, unidirectional attention is much, much better. 00:34:07.460 |
So let's think about this more modern, multi-turn conversation. 00:34:14.580 |
And so here, if we think about the bidirectional case, 00:34:19.980 |
we need to encode this input with bidirectional attention, 00:34:34.020 |
and when the next turn comes in, we need to do everything from scratch again. 00:34:41.900 |
In contrast, with unidirectional attention, when we are trying to generate Y, 00:34:46.100 |
because we cannot attend to the future tokens, the earlier representations never change. 00:34:51.500 |
If you see the difference, this part, everything up to the new input, can be cached. 00:35:03.660 |
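A hedged numpy sketch of why that caching works; names and shapes are illustrative. With causal attention, the keys and values computed for earlier turns never change, so new turns only append to the cache, whereas bidirectional attention would force everything to be recomputed.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

cached_K = np.empty((0, d))
cached_V = np.empty((0, d))

def append_turn(new_token_states, cached_K, cached_V):
    """Causal case: just append K/V for the new tokens; old entries stay valid."""
    cached_K = np.concatenate([cached_K, new_token_states @ Wk])
    cached_V = np.concatenate([cached_V, new_token_states @ Wv])
    return cached_K, cached_V

turn_1 = rng.normal(size=(5, d))   # the conversation so far
turn_2 = rng.normal(size=(3, d))   # a new turn: only this part is new work
cached_K, cached_V = append_turn(turn_1, cached_K, cached_V)
cached_K, cached_V = append_turn(turn_2, cached_K, cached_V)
print(cached_K.shape)              # (8, 8): keys for the whole conversation so far
```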
So I would say bidirectional attention did well in 2018, 00:35:09.660 |
and now, because of this engineering challenge, it doesn't seem worth it at scale. 00:35:13.460 |
So to conclude, we have looked into this driving force, 00:35:17.940 |
dominant driving force governing this AI research, 00:35:20.660 |
and that was this exponentially cheaper compute 00:35:30.500 |
Then we examined the additional structures added to the encoder-decoder compared to the decoder-only, and asked whether they still make sense. 00:35:38.460 |
And I wanted to just conclude with this remark. 00:35:44.900 |
One can say these are all just historical artifacts 00:35:48.860 |
and don't matter, but if you do many of these exercises, 00:35:53.940 |
you can hopefully think about those in a more unified manner 00:35:57.520 |
and then see, okay, what are the assumptions in my problem 00:36:01.580 |
that I need to revisit, and are they still relevant? 00:36:06.260 |
If not, can we do it with a more general approach and scale up? 00:36:13.620 |
and together we can really shape the future of AI