
Stanford CS25: V4 I Hyung Won Chung of OpenAI


Chapters

0:00 Introduction
2:05 Identifying and understanding the dominant driving force behind AI.
15:18 Overview of Transformer architectures: encoder-decoder, encoder-only and decoder-only
23:29 Differences between encoder-decoder and decoder-only, and rationale for encoder-decoder’s additional structures from the perspective of scaling.

Transcript

Now, we'll have Hyung Won give a talk. He's currently a research scientist on the OpenAI ChatGPT team. He has worked on various aspects of large language models: things like pre-training, instruction fine-tuning, reinforcement learning from human feedback, reasoning, and so forth. Some of his notable works include the scaling Flan papers, such as Flan-T5 and Flan-PaLM, as well as T5X, the training framework used to train the PaLM language model. Before OpenAI, he was at Google Brain, and he received his PhD from MIT. So, give a hand for Hyung Won.

>> All right. My name is Hyung Won, and I'm really happy to be here today. This week, I was thinking about, by the way, is my mic working fine?

Yeah. So, this week, I thought: I'm giving a lecture on transformers at Stanford, what should I talk about? I thought, okay, some of you in this room and on Zoom will actually go shape the future of AI. So, maybe I should talk about that. It's a really important and ambitious goal, and we really have to get it right.

So, that could be a good topic to think about. When we talk about something in the future, the best place to get advice is to look into history. In particular, look at the early history of the Transformer and try to learn many lessons from there. The goal is to develop a unified perspective from which we can look into many seemingly disjoint events.

From that, we can hopefully project into the future what might be coming. So, that will be the goal of this lecture, and we'll look at some of the architectures of the Transformer. So, let's get started. Everyone I see is saying AI is advancing so fast that it's hard to keep up.

It doesn't matter if you have years of experience; there are so many things coming out every week that it's just hard to keep up. I do see many people spend a lot of time and energy catching up with the latest developments, the cutting edge, the newest thing, and then not enough attention goes into the old things, because they become deprecated and no longer relevant.

But I think it's actually important to look into that, because when things are moving so fast, beyond our ability to catch up, what we need to do is study the change itself. That means we can look back at the previous things, look at the current thing, and try to map how we got here, and from that we can look into where we are heading.

So, what does it mean to study the change itself? First, we need to identify the dominant driving forces behind the change. Here, dominant is an important word, because a change typically has many, many driving forces, and we only care about the dominant ones: we're not trying to be really accurate, we just want to have a sense of directionality.

Second, we need to understand the driving force really well, and then after that we can predict the future trajectory by rolling out the driving force, and so on. You heard it right: I mentioned predicting the future. This is a computer science class, not astrology or something.

But I think it's actually not that impossible to predict some future trajectory of a very narrow scientific domain, and that endeavor is really useful. Let's say you do all this and raise your prediction accuracy from one percent to ten percent; then if you make 100 predictions, 10 of them will be correct, and say one of them will be really, really correct, meaning it will have an outsized impact that outweighs everything. I think that's a very general thing in life: you really only have to be right a few times.

So, let's think about why predicting the future is difficult, or maybe even think about the opposite extreme case, where we can all do the prediction with almost perfect accuracy. Here I'm going to do a very simple experiment of dropping this pen and follow the same three-step process.

So we're going to identify the dominant driving force. First of all, what are the driving forces acting on this pen? Gravity downwards, and is that all? We also have, say, air friction if I drop it, and that will cause what's called a drag force acting upwards, and actually, depending on how I drop this, the orientation, the aerodynamic interaction will be so complicated that we don't currently have any analytical way of modeling that.

We can do it with the CFD, the computational fluid dynamics, but it will be non-trivial. So we can neglect that. This is heavy enough that gravity is probably the only dominant force. So we simplify the problem. Second, do we understand this dominant driving force, which is gravity? And we do because we have this Newtonian mechanics which provides a reasonably good model.

And then with that, we can predict the future trajectory of this pen. And if you remember from your dynamics class, if the initial velocity is zero, I'm not going to give it any velocity, and let's say the position is zero here, then one half g t squared gives a precise trajectory of this pen as I drop it.
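As a quick worked version of the formula he quotes, a minimal sketch assuming the pen is dropped from rest and air drag is neglected, so gravity is the single dominant force:

```latex
% Free fall from rest, neglecting drag (the single-dominant-force case).
% x(t) is the distance fallen after time t, with g \approx 9.8\ \mathrm{m/s^2}.
x(t) = v_0 t + \tfrac{1}{2} g t^2 = \tfrac{1}{2} g t^2 \quad (v_0 = 0)
```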

So if there is a single driving force that we really understand, it's actually possible to predict what's going to happen. So then why do we feel it's so hard to predict the future in the most general sense? I argue that, among many reasons, it's because the sheer number of driving forces acting in the general case is so large, and their interaction creates a complexity that we cannot predict.

So here's my cartoon way of thinking about the prediction of future. X-axis, we have a number of dominant driving forces. Y-axis, we have a prediction difficulty. So on the left-hand side, we have a dropping a pen. It's a very simple case. The difficulty is very small. You just need to learn physics.

And then as you add more stuff, it just becomes impossible. So how does this fit into AI research? You might think, "OK, I see things coming in all the time, and we are bombarded by new things, and some people will come up with a new agent, a new modality, a new MMLU score, whatever.

We just see so many things. I'm not even able to catch up with the latest thing. How can I even hope to predict the future of AI research?" But I argue that it's actually simpler, because there is a dominant driving force that is governing a lot, if not all, of AI research.

And because of that, I would like to point out that it's actually closer to the left than to the right, more than we may perceive. So what is that driving force? Oh, maybe before that, I would like to caveat that when I do this kind of talk, I like to not focus too much on the technical stuff, which you can probably do better in your own time, but rather I want to share how I think.

And for that, I want to share what my opinion is, so it will be very strongly opinionated. By no means am I saying this is correct; I just want to share my perspective. So coming back to this driving force for AI, what is that dominant driving force?

Here's a plot from Rich Sutton. On the y-axis, we have calculations, FLOPs, per dollar: if you pay $100, how much computing power do you get? And it's in log scale. On the x-axis, we have time spanning more than 100 years. So this is actually more than exponential, and I don't know any trend that is as strong and as long-lasting as this one.

So whenever I see this kind of thing, I say, okay, I should not compete with this; rather, I should try to leverage it as much as possible. What this means is that you get 10x more compute every five years if you spend the same amount of dollars.

In other words, the cost of compute is going down exponentially, and the associated scaling is really dominating AI research. That is somewhat hard to take, but it is, I think, really important to think about. So coming back to AI research: how does this exponentially cheaper compute drive AI research?

Let's think about the job of AI researchers. It is to teach machines how to think, in a very general sense. And one somewhat unfortunate but common approach is that we try to teach machines how we think we think. Meaning, we model how we think, and then try to incorporate that into some kind of mathematical model and teach that.

And now the question is, do we understand how we think at the very low level? I don't think we do. I have no idea what's going on. So it's fundamentally flawed in the sense that we're trying to model something we have no idea about. What happens if we go with this kind of approach is that it imposes a structure that serves as a shortcut in the short term.

And so you can maybe get a paper or something, but then it becomes a bottleneck because we don't know how this will limit further scaling up. More fundamentally, what this is doing is we are limiting the degree of freedom we are giving to the machines, and that will backfire at some point.

And this has been going on for decades. The Bitter Lesson is, I think, the single most important piece of writing in AI, and it says, and this is my wording by the way: the past 70 years of AI research can be summarized as developing progressively more general methods with weaker modeling assumptions or inductive biases, and adding more data and compute. In other words, scaling up.

And that has been the recipe of the entire history of AI research, not fancy things. If you think about it, the models of the 2000s were a lot more difficult than what we use now, so it's much easier to get into AI nowadays from a technical perspective. This is, I think, really the key information.

Compute cost is going down exponentially, and it's getting cheaper faster than we're becoming better researchers. So don't compete with that; just try to leverage it as much as possible. And that is the driving force I wanted to identify. I'm not saying this is the only driving force, but it is the dominant driving force.

So we can probably neglect the other ones. Here's a graphical version of that. On the x-axis we have compute; on the y-axis we have performance of some kind. Let's think about some general intelligence, and let's look at two different methods: one with more structure, more modeling assumptions, fancier math, whatever.

And the other one has less structure. What you typically see is that you start with better performance in the low-compute regime, but then it plateaus, because some kind of structure backfires. And with the less structured one, because we give a lot more freedom to the model, it doesn't work in the beginning.

But then as we add more compute, it starts working, and then it gets better. We call these more scalable methods. So does that mean we should just go with the least structure, the most freedom to the model possible, from the get-go? And the answer is obviously no. Let's think about an even less structured case.

This red one here will pick up a lot later and requires a lot more compute. So it really depends on where we are. We cannot indefinitely wait for the most general case. So let's think about the case where our compute situation is at this dotted line.

If we're here, we should choose the less structured one as opposed to the even less structured one, because one of them works and the other doesn't really work yet. But crucially, we need to remember that we are adding some structure because we don't have enough compute, so we need to remove it later.

And so the difference between these two methods is the additional inductive biases or structure we impose, someone imposes, that typically don't get removed. What that means is that at the given level of compute, data, algorithmic development, and architecture that we have, there's an optimal inductive bias or structure that we can add to the problem to make progress.

And that is really how we have made so much progress. But these are like shortcuts that hinder further scaling later on, so we have to remove them later, when we have more compute, better algorithms, or whatever. As a community, we do adding structure very well, because there's an incentive structure with papers: you add a nice one, you get a paper. But removing structure doesn't really get you much.

So we don't really do that, and I think we should do a lot more of it. Maybe another implication of the bitter lesson is that, because of this, what is better in the long term almost necessarily looks worse now. And this is quite unique to AI research, because the current paradigm of AI research is learning-based methods, meaning that we are giving the models freedom; the machines choose how they learn.

Because we need to give more freedom, it's more chaotic at the beginning, so it doesn't work. But then when it starts working, we can put in more compute and it gets better. So it's really important to keep this in mind. To summarize, we have identified the dominant driving force behind AI research.

And that is exponentially cheaper compute and the associated scaling up. Now that we have identified it, if you remember from my initial slides, the next step is to understand this driving force better, and that's where we're going to spend most of our time. For that, we need to go back to some history of the Transformer, since this is a Transformers class, and analyze the key structures and decisions made by the researchers at the time: why they made them, whether they were optimal structures to add at the time, why they might be irrelevant now, and whether we should remove them.

We'll go through some of this in practice, and hopefully it will give you some flavor of what scaling research looks like. So now we'll go into a little bit of the technical stuff. The Transformer architecture has some variants; I'll talk about three of them. First is the encoder-decoder, which is the original Transformer, and which has a little bit more structure.

The second one is the encoder-only, which was popularized by BERT. And the third one is the decoder-only, which you can think of as current language models like GPT-3. This has a lot less structure than the encoder-decoder. So these are the three types we'll go into in detail.

Second, the encoder-only is actually not that useful in the most general sense. It still has some place, but we will just briefly go over it and then spend most of the time comparing one and three. So one has more structure; what's the implication of that, and so on.

So first of all, let's think about what a Transformer is. At a very high level, from first principles: a Transformer is a sequence model, and a sequence model takes as input a sequence of elements, which can be words or images or whatever. It's a very general concept.

In this particular example, I'll show you with words: a sentence is a sequence of words. The first step is to tokenize it, because we have to represent these words in a computer, which requires some kind of encoding scheme. We just do it with a fixed set of integers, so that we now have a sequence of integers.

And then the dominant paradigm nowadays is to represent each sequence element as a vector, dense vector, because we know how to multiply them well. And then so we have a sequence of vectors. And finally, this sequence model will do the following. We just want to model the interaction between sequence elements.

And we do that by letting them take dot products with each other. If the dot product is high, we can say they are semantically more related than pairs whose dot product is low. That's roughly what a sequence model does. And the Transformer is a particular type of sequence model that uses what's called attention to model this interaction.
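To make that concrete, here is a minimal sketch, my own illustration rather than anything from the lecture, of the pipeline just described: tokenize a sentence into integers, embed each token as a dense vector, and use pairwise dot products, turned into weights with a softmax, to model the interaction between elements. The toy vocabulary and sizes are made up.

```python
import numpy as np

# Toy vocabulary and tokenizer (hypothetical; real models use learned subword tokenizers).
vocab = {"that": 0, "is": 1, "good": 2}
tokens = [vocab[w] for w in "that is good".split()]   # sequence of integers: [0, 1, 2]

# Represent each token as a dense vector via an embedding table.
rng = np.random.default_rng(0)
d_model = 4
embedding = rng.normal(size=(len(vocab), d_model))
x = embedding[tokens]                                  # shape (seq_len, d_model)

# Pairwise dot products measure relatedness; a softmax turns them into attention
# weights, and each output is a weighted mix of the token vectors.
scores = x @ x.T / np.sqrt(d_model)                    # (seq_len, seq_len)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights = weights / weights.sum(-1, keepdims=True)
output = weights @ x                                   # (seq_len, d_model), one vector per token
print(weights.round(2))
```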

So let's get into the details of this encoder-decoder, which was the original Transformer. It has quite a lot of pieces, so let's go through it a piece at a time, starting with the encoder. Here I'm going to show you an example of machine translation, which used to be a very cool thing.

So you have an English sentence, "That is good," and we're going to translate it into German. The first thing is to encode it into dense vectors; here I'm representing each word with a vector of size three or something. And then we have to let them take dot products.

These lines represent which elements can talk to which other elements. And here, because it's the input, we use what is called bidirectional attention: any token can talk to any other token. Then we have this MLP, or feed-forward layer, which is per token; it doesn't have any interaction between positions.

You just do some multiplication just because we can do it. And then that's one layer, and we repeat that n times. And that's just the transformer encoder. And at the end, what you get is the sequence of vectors, each representing the sequence element, in this case, a word. So that's the output of this encoder.

Now let's look at the decoder, which is a similarly shaped stack of layers. Here we put in as input what the answer should be. BOS is the beginning-of-sequence token, and then "Das ist gut." I don't know how to pronounce it, but that's the German translation of "That is good."

And we go through a similar process. Here we have causal self-attention, meaning that a token at time step t can only attend to time step t and before, because when we start generating, we don't have the future tokens. So when we train, we should limit it in the same way.

This is done with masking, and it's just different from the encoder. After, again, N layers, you get this sequence output. So the output is a sequence; it's a sequence-to-sequence mapping, and this is the general encoder-decoder architecture. And when you get the end-of-sequence token, you stop generating.
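As a minimal sketch, again my own rather than the lecture's, of the difference between the encoder's bidirectional attention and the decoder's causal attention: the only change is a mask that blocks attention to future positions before the softmax. The learned projections are omitted for brevity.

```python
import numpy as np

def attention(x, causal=False):
    """Toy self-attention over a (seq_len, d) array; no learned projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # A token at step t may only attend to steps <= t.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ x

x = np.random.default_rng(0).normal(size=(4, 3))   # e.g. BOS, das, ist, gut
print(attention(x, causal=False).shape)            # encoder-style: all-to-all
print(attention(x, causal=True).shape)             # decoder-style: lower-triangular weights
```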

So this is the overall picture. Now I'll point out some important attention patterns. We are translating into German what is input to the encoder, so there has to be some connection between the decoder and the encoder. That is done by the cross-attention mechanism, shown here in red, which just means that each sequence element's representation in the decoder should attend to some of the representations in the encoder.

In particular, the interesting design feature is that all the layers in the decoder attend to the final-layer output of the encoder. I will come back to the implication of this design. So yep, that's that. And now let's move on to the second type of architecture, which is encoder-only.

We'll spend a little bit of time here. Again, we have the same input, and we go through a similar structure. In this case, the final output is a single vector: regardless of the length of the sequence, we just get a single vector, and that represents the input sequence.

That's the dense vector representation. And then let's say we do some kind of a sentiment analysis. We run through a task-specific linear layer to map it to classification labels, positive or negative probabilities here. And that's required for all these task-specific cases. And this is kind of popularized by BERT.
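Here is a minimal sketch of what the encoder-only setup looks like, under my own simplifying assumptions: pool the encoder's output vectors into a single vector (BERT actually uses the [CLS] token; mean pooling is used here for simplicity) and map it through a task-specific linear head to two sentiment labels.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_labels = 5, 8, 2                   # toy sizes

encoder_output = rng.normal(size=(seq_len, d_model))   # stand-in for a BERT-style encoder output

# Collapse the whole sequence to a single vector (mean pooling for simplicity).
pooled = encoder_output.mean(axis=0)

# Task-specific linear head: one of these is needed per downstream task.
w_head = rng.normal(size=(d_model, n_labels))
logits = pooled @ w_head
probs = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(["negative", "positive"], probs.round(3))))   # classification, no generation
```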

And what this means is that at the time, in 2018, when BERT came out, we had a benchmark called GLUE, which was a language understanding test: you have a sequence in, classification labels out, for most cases. This was how the field really advanced at the time. So when we care about such tasks, there's an incentive to simplify the problem, adding structure to the problem so that we can make progress.

The additional structure that was put into this particular architecture is that we're going to give up on generation. If we do that, it becomes a much simpler problem: instead of sequence-to-sequence, we're talking about sequence-to-classification-labels, and that's just so much easier. And so at some point, in 2018, 2019, a lot of the research was, we sometimes call it, BERT engineering.

You change something a little bit, get like 0.5% better on GLUE, and you get a paper, things like that. It was a very chaotic era. But if you look at it from this perspective, we added the structure of not generating a sequence, and that bought a lot of performance wins, but in the long term it's not really useful.

So we're not going to look at this encoder-only architecture going forward. The third architecture is decoder-only. This one is personally my favorite. It looks kind of daunting because of this attention pattern, but it's actually very simple. Here we only have a single stack, and it can actually generate stuff.

There's a misconception: some people think that because this decoder-only architecture is used for language modeling, next-token prediction, it cannot be used for supervised learning. But here we can actually do it. The trick is to have this input, "That is good," concatenated with the target. If you do that, then it just becomes a simple sequence in, sequence out.

The self-attention mechanism here is actually handling both the cross-attention between the target and the input, and the self-attention sequence learning within each. That's the causal attention. And, as I mentioned, the output is a sequence. So the key design feature is self-attention serving both roles.
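Here is a minimal sketch, my own illustration with made-up token ids, of the trick he describes: concatenate the input and the target into one sequence, apply a single causal mask over the whole thing so target positions can look back at the input (playing the role of cross-attention) and at earlier target tokens, and compute the next-token loss only on the target positions.

```python
import numpy as np

# Toy token ids for input ("that is good") and target ("BOS das ist gut"); values are hypothetical.
input_ids  = [11, 12, 13]
target_ids = [0, 21, 22, 23]
sequence = np.array(input_ids + target_ids)     # one concatenated sequence, one shared stack

# Causal mask over the whole sequence: position t sees positions <= t.
# Target tokens can therefore attend back to every input token (the "cross-attention" role)
# and to earlier target tokens (ordinary self-attention), with the same parameters.
seq_len = len(sequence)
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# For supervised training, only the target positions contribute to the next-token loss.
loss_positions = np.arange(len(input_ids), seq_len - 1)   # each predicts the following target token
print(causal_mask.astype(int))
print("positions contributing to the loss:", loss_positions)
```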

And we are, in some sense, sharing the parameters between input and target: the same set of parameters is applied to both the input and the target sequences. So this is the decoder-only. Now we will go into the comparison. They look very different, at least in the schematics.

So how different are they actually? I argue that they're actually quite similar. To illustrate that, we're going to start from this encoder-decoder, which has more structure built in, transform it into the decoder-only architecture, and see what some of the differences are. Then we'll interpret those differences, those additional structures: are they still relevant now that we have more compute, better algorithms, and so on?

So let's have this table. There are four differences, and we'll see each of them; as we go through, we'll populate this table. Let's first look at the additional cross-attention. On the left is the encoder-decoder, which has this additional red block, the cross-attention, compared to the simpler one on the right that doesn't have it.

We want to make the left closer to the right, so that means we need to either get rid of it or do something with it. The attention mechanism has four projection matrices, and self-attention and cross-attention actually have the same number of parameters, the same shapes. So we can just share them.
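To see why that sharing is possible, here is a minimal sketch assuming the usual four projection matrices per attention block (query, key, value, output; the naming is mine, not the lecture's): in self-attention the queries, keys, and values all come from the decoder states, while in cross-attention the keys and values come from the encoder output, but every matrix has the same shape either way, so one set of weights can serve both roles.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Four projection matrices of one attention block; their shapes are the same regardless of role.
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))

def attend(queries_from, keys_values_from):
    q = queries_from @ W_q
    k = keys_values_from @ W_k
    v = keys_values_from @ W_v
    w = np.exp(q @ k.T / np.sqrt(d))
    w = w / w.sum(-1, keepdims=True)
    return (w @ v) @ W_o

decoder_states = rng.normal(size=(4, d))
encoder_states = rng.normal(size=(3, d))

self_out  = attend(decoder_states, decoder_states)   # self-attention
cross_out = attend(decoder_states, encoder_states)   # cross-attention, same shared weights
print(self_out.shape, cross_out.shape)
```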

So that's the first step: share both of these, and then it becomes mostly the same mechanism. That's the first difference: separate cross-attention, versus self-attention serving both roles. The second difference is parameter sharing. What that means is that between the input and the target, the encoder-decoder architecture uses separate parameters.

The decoder-only has a single stack, so it uses shared parameters. If we want to make the left close to the right, we want to share the parameters. So let's do that, just color this; now they share the parameters. The third difference is the target-to-input attention pattern.

We need to connect the target to the input, and how is that done? In the encoder-decoder case, we had this cross-attention, and in the decoder-only, it's the self-attention doing everything. The difference is that in the encoder-decoder, every layer of the decoder attends to the final-layer output of the encoder.

Whereas if you think about the decoder-only, it's actually per layer, within layer. When we are decoding, say, the word "das," we are looking at the same-layer representation of the input part, and that's within layer; I think this is the design feature. So if you want to make this close to that, we have to bring this attention back to each layer.

So now layer one will be attending to layer one. And finally, the last difference is the input attention. I mentioned this bidirectional attention, and because the decoder-only typically has unidirectional attention, we need to make them match. So we can just get rid of it.

I just got rid of some of the arrows. So at this point, these two architectures are almost identical: a little bit of difference in the cross-attention, but the same number of parameters. And in deep learning, if you just train these two architectures on the same task with the same data, I think you will get results pretty much within the noise, probably closer than if you train the same thing twice.

So I would say they are identical. And these are the main differences. Now we'll look at what these additional structures are and what they mean. So yeah, that's the populated table now. We can say that the encoder-decoder, compared to the decoder-only architecture, has these additional structures and inductive biases built in.

So let's go into each of them. The first one: the structure that the encoder-decoder adds is the assumption that the input and the target sequences are sufficiently different that it will be useful to use separate parameters. That's the assumption. So why is that useful? When can that assumption be useful?

One example is machine translation. Back when the Transformer was introduced in 2017, translation was a really popular task, and it was considered difficult. It's just sequence-to-sequence, and you can actually have a BLEU score, which is a heuristic-based metric that gives you a single number, and then people can optimize for that.

So in that task, we have this input and target in completely different languages. So if the goal is to learn translation only, then it kind of makes sense to have, okay, this parameter in the encoder will take care of the English, and this parameter in the decoder will take care of the German.

That seems natural. And what about now? Modern language modeling is just about learning knowledge. It's not just about translation, or even just about language; language just comes up as a byproduct of doing next-token prediction, and translation does as well. So does it make sense to have separate parameters for this kind of situation now?

Like we have some knowledge in German, some knowledge in English, and if anything, you wanna combine them, and if we represent them in a separate parameters, I don't think that's natural. So I would say with this much more general, larger models that can do a lot of things, this assumption seems very unnatural to me.

The second example is a little bit more modern. Two years ago, when I was at Google, Jason and I did this instruction fine-tuning work. What it is: you take a pre-trained model and fine-tune it on academic datasets so that it can understand natural language instructions.

The details don't matter, but here, let's think about the performance gain from doing this fine-tuning on the two different architectures we tried. The first five models are Flan-T5, which is T5-based, an encoder-decoder architecture. The latter five are a decoder-only architecture based on PaLM. We spent 99% of the time on PaLM, optimizing a lot of things.

And then at the end, we just spent like three days on T5, but the performance gain was a lot higher on this. And I was really confused about this, and in a very good way. And after the paper was published, I wanted to dig a little bit deeper into why this might be the case.

My hypothesis is that it's about the length. The academic datasets we used, something like 1,832 tasks, have this very distinctive characteristic: the input is long, long in order to make the task more difficult, but we cannot make the target long, because if we do, there's no way to grade it.

So there's a fundamental challenge there. What happens is you have a long input text and then a short target text. And this is the length distribution of what went into the Flan fine-tuning. So you see one distinctive type of sequence going into the encoder as the input, and a very different type of sequence going into the target.

So now, the encoder-decoder architecture has the assumption built in that they will be very different, and that structure really shines because of this. It was kind of an accident, but that is, I think, why this architecture was just so suitable for fine-tuning with the academic datasets. What about now? Do we care about this kind of assumption?

And if you think about the general use cases of language models nowadays, if anything, the more interesting cases involve longer generation, longer target. Just because we cannot grade them doesn't mean that we are not interested in them. Actually, if anything, we are more interested in that. So now we have this longer target situation.

So this separate-parameters assumption doesn't seem to make much sense. Moreover, think about a chat application like ChatGPT: we do multi-turn conversation, and what is the target of this turn becomes the input of the next turn. So my question is, does it even make sense to think about separate parameters if in the next turn it's going to be the same thing?

So that was the first inductive bias we just mentioned. The second structure is that target elements can only attend to the fully encoded ones, the final output of the encoder. Let's look at what this additional structure means. As I mentioned, every decoder layer attends to the very top layer of the encoder.

In deep neural nets, we typically see that the bottom layers and the top layers encode information at very different levels. For example, in computer vision, the bottom layers encode something like edges, and the top layers combine those features into something higher-level, like a cat face. That's why we call deep learning a hierarchical representation learning method.

So now the question is: if decoder layer one attends to the encoder's final layer, which probably has a very different level of information, is that some kind of information bottleneck, the very thing that motivated the original attention mechanism? In practice, I would say, in my experience, it doesn't really make any difference.

And that's because my experience was limited to, say, the 24 encoder layers of T5. Layer one attending to layer 24 is probably fine. But what if we have 10x or 1,000x more layers? Would that be problematic? I'm not really comfortable with that. So I think this is also an unnecessary design that maybe we need to revisit.

The final structure we're going to talk about is the bidirectional thing in the encoder-decoder. Let's think about that: bidirectional input attention, is that really necessary? The B in BERT stands for bidirectional. In 2018, when we were solving the SQuAD question answering benchmark, it was actually a very difficult task.

So if you had any additional trick, it could make a huge difference. Bidirectionality was really useful, maybe boosting the SQuAD score by something like 20 points, so it was a really huge thing. But at scale, I don't think it matters that much. This is my highly anecdotal experience.

In Flan 2, we tried both bidirectional and unidirectional fine-tuning, and it didn't really make much difference. But I want to point out that this bidirectionality actually brings in an engineering challenge for modern multi-turn chat applications: at every turn, the new input has to be encoded again, whereas unidirectional attention is much, much better for this.

Here's what I mean by that. Let's think about this more modern conversation between a user and the system: "How are you?", "Bad", and "Why?". If we think about the bidirectional case, when we generate "Bad," we need to encode the input bidirectionally, which is fine.

Then after "Bad" is generated, when we get to "Why?", we'll need to encode "How are you?" again, because "how" can now attend to "Bad." So we need to do everything from scratch again. In contrast, if we do the unidirectional one, we can do much, much better, because when we get to "Why?", we don't have to redo "How are you?": those tokens cannot attend to the future tokens, so we don't have to do anything with them.

If you see the difference, this part can be cached, and this part is the only thing that has to be encoded again. So this makes a big difference when we think about multiple turns coming in. So I would say bidirectional attention did well in 2018, but that is mostly solved by scale, and now, because of this engineering challenge, we don't really need it.
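Here is a minimal sketch, my own and with made-up sizes, of the caching argument: with causal attention, the keys and values for earlier turns never change when new tokens arrive, so they can be computed once and reused, whereas with bidirectional attention they would depend on the new tokens and the whole prefix would have to be re-encoded every turn.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
embed = lambda n: rng.normal(size=(n, d))   # stand-in for token embeddings

# Turn 1: encode "How are you?" and generate "Bad".
turn1 = embed(5)
kv_cache = {"k": turn1 @ W_k, "v": turn1 @ W_v}   # with causal attention, these never change

# Turn 2: the user asks "Why?". With unidirectional attention we only project the new
# tokens and append; the cached keys/values for turn 1 are reused as-is.
turn2 = embed(2)
kv_cache["k"] = np.concatenate([kv_cache["k"], turn2 @ W_k])
kv_cache["v"] = np.concatenate([kv_cache["v"], turn2 @ W_v])

# With a bidirectional encoder, turn-1 representations could attend to turn-2 tokens,
# so the whole prefix would have to be re-encoded from scratch at every turn.
print(kv_cache["k"].shape)   # (7, 4): 5 cached positions plus 2 new ones
```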

So to conclude, we have looked into the dominant driving force governing AI research, and that was exponentially cheaper compute and the associated scaling effort. To understand this driving force, we analyzed some of the additional structures added to the encoder-decoder compared to the decoder-only, and thought about what they mean from the perspective of scaling.

I wanted to conclude with this remark. We have looked at this kind of analysis, and one could say these are just historical artifacts and don't matter. But if you do many of these, then when you look at current events, you can hopefully think about them in a more unified manner and ask: what assumptions in my problem do I need to revisit, and are they still relevant?

And if not, why not? Can we do it with a more general approach and scale up? I hope you can go back and really think about these problems, and together we can really shape the future of AI in a really nice way.

So that's it, thanks. (audience applauding)