Back to Index

Stanford CS25: V4 I Jason Wei & Hyung Won Chung of OpenAI


Transcript

So again, I'm very happy to have Jason here. So he's an AI researcher based in San Francisco, currently working at OpenAI. He was previously a research scientist at Google Brain, where he popularized key ideas in LLMs, such as chain of thought prompting, instruction tuning, as well as emergent phenomena.

He's also a good friend of mine, and he's been here before to give some talks. So we're very happy to have you back, Jason, and take it away. Great. Yeah. Thanks for the intro. So a bit about the structure. So I'll talk for around 30 minutes, and then I'll take a few questions.

And then Hyung Won will talk for 30 minutes, and then we'll both take questions at the end. Great. So I want to talk about a few very basic things. And I think the fundamental question that I hope to get at is, why do language models work so well? And one thing that I'd encourage everyone to do, which I found to be extremely helpful in trying to answer this question, is to use a simple tool: manually inspecting data.

And I'll give a short anecdote. I've been doing this for a long time. So in 2019, I was trying to build one of the first lung cancer classifiers. So there'd be an image, and you have to say, OK, what type of lung cancer is this? And my first thing was, OK, if I want to train a neural network to do this, I should be able to at least do the task.

So I went to my advisor, and I said, oh, I want to learn to do this task first. And he said, Jason, you need a medical degree and three years of pathology experience to even do this task. And I found that a bit discouraging, but I went and did it anyways.

So I basically looked at the specific type of lung cancer that I was working on. And I'd read all the papers on how to classify different types. And I went to pathologists, and I said, OK, try to classify these. What did I do wrong? And then, what do you think of that?

And in the end, I learned how to do this task of classifying lung cancer. And the result of this was I gained intuitions about the task that led to many papers. OK, so first, I will do a quick review of language models. So the way language models are trained is with the next-word prediction task.

So let's say you have a sentence-- Dartmouth students like to. And the goal of next word prediction is you have some words that come before, and then you want to predict the next word. And what the language model does is it outputs a probability for every single word in the vocabulary.

So vocabulary would be A, aardvark, drink, study, and then all the way to zucchini. And then the language model is going to put a probability over every single word here. So the probability of A being the next word is something really small. Aardvark is something really small. And then maybe drink is like, say, 0.6.

Study is like 0.3. Zucchini is, again, really small. And then the way that you train the language model is you say, let's say drink is the correct word here, I want this number here, 0.6, to be as close as possible to 1. So your loss is basically how close the probability of the actual next word is to 1; there's a small sketch of this right below.
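
Here is a minimal sketch of that loss computation, using the 0.6 and 0.3 from above and made-up small values for the rest of the toy vocabulary (these are not numbers from any real model):

```python
import math

# Hypothetical next-word probabilities for the prompt "Dartmouth students like to".
# 0.6 and 0.3 are the numbers used above; the rest are made-up small values.
probs = {"a": 0.001, "aardvark": 0.0001, "drink": 0.6, "study": 0.3, "zucchini": 0.0001}

actual_next_word = "drink"

# Cross-entropy loss for this single prediction: -log p(actual next word).
# The closer p("drink") is to 1, the closer the loss is to 0.
loss = -math.log(probs[actual_next_word])
print(loss)  # about 0.51
```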

And you want this loss to be as low as possible. OK, so the first intuition that I would encourage everyone to use is next word prediction is massively multi-task learning. And what I mean by this is the following. I'll give a few examples. So when you train a language model on a large enough database, a large enough data set, on this task of next word prediction, you have a lot of sentences that you can learn from.

So for example, there might be some sentence, in my free time, I like to. And the language model has to learn that code should be higher probability than the word banana. So learn some grammar. It'll learn lexical semantics. So somewhere in your data set, there might be a sentence, I went to the store to buy papaya, dragon fruit, and durian.

And the language model should know that the probability of durian should be higher than squirrel. The language model will learn world knowledge. So there will be some sentence on the internet that says, you know, the capital of Azerbaijan is-- and then the language model should learn that it should be Baku instead of London.

You can learn traditional NLP tasks, like sentiment analysis. So there'll be some sentence, you know, I was engaged on the edge of my seat the whole time. The movie was. And the language model looks at the prior words and says, OK, the next word should probably be good and not bad.

And then finally, another example is translation. So here, you might see some sentence, the word for pretty in Spanish is. And then the language model should weight bonita more than hola. There's also spatial reasoning. So you might even have some sentence like, Iroh went to the kitchen to make some tea.

Standing next to Iroh, Zuko pondered his destiny. Zuko left the. And then kitchen should be higher probability than store. And then finally, even some math questions. So you might have, like, some arithmetic exam answer key somewhere on the internet. And then the language model looks at this and says, OK, the next word should probably be 15 and not 11.

And you can have, like, basically millions of tasks like this when you have a huge data set. And you can think of this as basically extreme multitask learning. And these are sort of like very clean examples of tasks. But I'll give an example of how arbitrary some of these tasks can be.

So here's a sentence from Wikipedia. Biden married Neilia. And then now, pretend you're the language model. And you could say, like, OK, what's the next word here? And the next word here is Hunter. So it's Biden's first wife. And so, like, OK, what's the language model learning from predicting this word?

I guess, like, world knowledge. And then what's the next word after this? Turns out the next word is a comma. So here, the model is learning, like, basically comma prediction. And then what's the next word after that? I think it's kind of hard to know, but the answer is A.

And I guess this is, like, maybe grammar, but, like, somewhat arbitrary. And then what's the next word after that? Turns out it's student. And this, I would say, I don't know what task this is. This is, like, you know, it could have been woman. It could have been something else.

So this is, like, a pretty arbitrary task. And the point that I'm trying to make here is that the next word prediction task is really challenging. So, like, if you do this over the entire dataset, you're going to learn a lot of tasks. OK. The next intuition I want to talk about is scaling, by which I mean, let's say, scaling compute.

And by the way, compute is equal to how much data you have times the size of the language model. The intuition is that scaling compute reliably improves loss. And this idea was basically pioneered by Kaplan et al. in 2020. I would encourage you guys to read the paper. And what this basically says is you can have a plot here, and we'll see many plots like this, where the x-axis is compute and the y-axis is loss.

And what this intuition says is you can train one language model, you'll have that loss. And obviously, you want loss to be lower. So you can train the next one, you'll have that loss. If you train the one after that, you'll have that loss. Then if you train the one after that, you'll have that loss.

And you can basically predict the loss of a language model based on how much compute you're going to use to train it. And the reason why this is called a law is that in this paper, they showed that the x-axis here is actually seven orders of magnitude. So basically, it would be surprising if the trend broke if you continued.
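To make the "predict the loss" idea concrete, here is a rough sketch of extrapolating a power-law fit of loss versus compute; the functional form follows Kaplan et al., but the constants below are illustrative placeholders, not the published fit:

```python
# Hypothetical power-law fit in the style of Kaplan et al. (2020):
# L(C) = (C_c / C) ** alpha. The constants are illustrative placeholders,
# not the published values.
C_c = 2.3e8   # "critical compute" constant, in arbitrary compute units
alpha = 0.05  # how fast loss falls with compute

def predicted_loss(compute: float) -> float:
    """Extrapolate loss for a training run that uses `compute` units of compute."""
    return (C_c / compute) ** alpha

# Each extra factor of compute shaves off a predictable fraction of the loss.
for compute in [1e0, 1e2, 1e4, 1e6]:
    print(f"compute={compute:.0e}  predicted loss={predicted_loss(compute):.2f}")
```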

And the important thing about this is that the line does not go like that. Because if it went like that, then it would saturate. And then putting more compute or training a larger language model wouldn't actually lead to lower loss. So I think a question that we don't have a good answer to as a field, but I'll give you a hand-wavy answer, is why does scaling up the size of your language model improve the loss?

And I'll give two basically hand-wavy answers. So here's a small LM, and here's a large LM. So one thing that's important is how good is your language model at memorizing facts? And imagine you're a small language model, and you see a bunch of facts on the internet. You have to be pretty choosy in which facts you memorize.

Because if you don't have that many parameters, you're like, oh, I can only memorize a million facts. Is this one of the facts I want to memorize to get the lowest loss? And so you have to be very selective. Whereas if you're a large language model, you can memorize a lot of tail knowledge.

And so basically, every fact you see, you don't have to say, oh, is this something I want to memorize? You can just memorize it. And then the other hand-wavy answer I'll give is small language models tend to learn first-order heuristics. So if you're a small language model, you're already struggling to get the grammar correct.

You're not going to do your best to try to get the math problem exactly correct. Whereas if you're a large language model, you have a lot of parameters in your forward pass. And you can try to do really complicated things to get the next token correct and to get the loss as low as possible.

So the third intuition I'll talk about is while overall loss improves smoothly, individual tasks can improve suddenly. And here's what I mean by this. So you can write your overall loss. By this, I mean if you take some corpus of data and you compute the overall loss on every word in that data set.

The overall loss, because we know that next word prediction is massively multitask learning, you can decompose this overall loss into the loss of every single individual task. So you have, I don't know, some small number times the loss of, say, grammar plus some small number times the loss of sentiment analysis plus some small number times the loss of world knowledge plus-- and then all the way to, let's say, you have something times the loss of math.
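Written out (in my own notation, not a formula from the slides), that decomposition is just a weighted sum, where each weight reflects how much of the corpus exercises that implicit task:

```latex
L_{\text{overall}}
  = \sum_{i \in \text{tasks}} w_i \, L_i
  = w_{\text{grammar}} L_{\text{grammar}}
  + w_{\text{sentiment}} L_{\text{sentiment}}
  + w_{\text{world knowledge}} L_{\text{world knowledge}}
  + \dots
  + w_{\text{math}} L_{\text{math}}
```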

And so you can basically write your overall loss as the weighted sum of the individual tasks in the data set. And now the question is, let's say I improve my loss from 4 to 3. Do each of these individual tasks improve at the same rate? Well, I'd say probably not.

So if you have a good enough language model, it's already doing basically perfectly on grammar and sentiment analysis. So this might not improve. This might be saturated. And maybe the loss on math is not saturated. It's not that good at math yet. So this could improve suddenly. And I'll redraw the diagram to show this.

So again, compute here and loss here. And your overall loss is like this. What I'm saying is there might be some part of that overall loss that scales like this. So this is, say, grammar. So for example, you could say, if GPT 3.5 is there and GPT 4 is there, you haven't actually improved the grammar that much.

On the other hand, you might have something like this for, say, doing math or harder tasks, where the difference between GPT 3.5 and GPT 4 will be much larger. And it turns out you can look at a big set of tasks, which I did, and you can look at what's the shape of these scaling curves.

So I looked at 202 tasks. There's this corpus called BIG-Bench, which has around 200 tasks. I looked at all of them. And here is the distribution. So you have-- this was 29% of tasks that were smooth. So if I draw the scaling plot, compute is on the x-axis. And then here we have accuracy instead of loss.

So higher is better. Then you have something like this. I believe this was 22% will be flat. So if you have your scaling curve, it'll just all be 0. The task was too hard. 2% will be something called inverse scaling. I'll talk about this in a sec. But what that means is the accuracy actually gets worse as you increase the size of the language model.

And then I think this was 13% will be not correlated. So you have something like that. I don't know. And then finally, a pretty big portion, 33%, will be emergent abilities. And what I mean by that is if you plot your compute and accuracy, for a certain point, up to a certain point, your accuracy will be 0.

And then the accuracy suddenly starts to improve. And so you can define an emergent ability basically as, for small models, the performance is 0. So it's not present in this model. And then for large models, you have much better than random performance. And the interesting thing about this is, let's say you had only trained the small language models up to that point.

You would have predicted that it would have been impossible for the language model to ever perform the task. But actually, when you train the larger model, the language model does learn to perform the task. So in a sense, it's pretty unpredictable. I'll talk about one final thing, which is something called inverse scaling slash u-shaped scaling.

OK, so I'll give a tricky prompt to illustrate this. So the tricky prompt is this: repeat after me, all that glisters is not glib. All that glisters is not. And this is the prompt I give to the language model. And the goal is to predict the next word. And obviously, the correct answer is glib, because it was asked to repeat after me.

And what you see is, let's say you have an extra small language model, a small language model, and a large language model. The performance for the extra small language model will be, say, up here at 100%. The small language model is actually worse at this task. So it's somewhere lower, like here.

And then the large language model, again, learns to do this task. So how do we basically explain a behavior like this for a prompt like this? And the answer is, you can decompose this prompt into three subtasks that are basically being done. So the first subtask is, can you repeat some text?

Right? And if you draw the plot here, again, extra small, small, large, and then here is 100. This is a super easy task. And so all the language models have perfect performance on that. That's one hidden task. The other task is, can you fix a quote? So the quote is supposed to be, all that glisters is not gold.

And so you can then plot, again, extra small, small, large. What's the ability to fix that quote? Well, the extra small language model doesn't know the quote, so it's going to get 0. The small will be able to do it.

And the large can obviously do it. So that's what the scaling curve looks like. And then finally, you have the task: follow an instruction. And obviously, this is the instruction here. And you could say, OK, what's the performance of these models on this task? And the extra small model can't do it.

Small model also can't do it. But the large model can do it. So you get a curve like this. And then why does this explain this behavior here? Well, the extra small model, it can repeat. It can't fix the quote. And it can't follow the instruction. So it actually gets it correct.

It says glib. The small model can repeat. It can fix the quote. But it doesn't follow the instruction. So it decides to fix the quote. And then the large model, it can do all three. It can follow the instruction. So it just repeats. And so that's how, if you look at the individual subtasks, you can explain the behavior of some of these weird scaling properties.
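Here is a toy sketch of that explanation in code, with each model size's subtask abilities hard-coded from the description above (not measured from any real model):

```python
# Toy reconstruction of the "all that glisters is not ___" explanation.
# Which subtasks each model size can do is hard-coded from the description
# above, not measured from real models.
abilities = {
    "extra small": {"repeat": True, "fix_quote": False, "follow_instruction": False},
    "small":       {"repeat": True, "fix_quote": True,  "follow_instruction": False},
    "large":       {"repeat": True, "fix_quote": True,  "follow_instruction": True},
}

def next_word(model: str) -> str:
    a = abilities[model]
    if a["follow_instruction"]:
        return "glib"   # obeys "repeat after me"
    if a["fix_quote"]:
        return "gold"   # knows the real quote, so it "corrects" the text
    return "glib"       # can only repeat, which happens to be the right answer

for model in ["extra small", "small", "large"]:
    print(model, "->", next_word(model))
# extra small -> glib (correct), small -> gold (wrong), large -> glib (correct):
# that's the U-shaped curve.
```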

So I will conclude with one general takeaway, which is applicable if you do research. And the takeaway is to just plot scaling curves. And I'll just give a really simple example. So let's say I do something for my research project. I fine-tune a model on some number of examples.

And I get-- this is my thing. I get some performance there. And then here's the baseline of not doing whatever my research project is. And there's its performance. The reason you want to plot a scaling curve for this is, let's say you take half the data. And you find out that the performance is actually here.

So your curve looks like this. What this would tell you is you didn't have to collect all the data to do your thing. And if you collect more, you probably won't see an improvement in performance. Another scenario would be if you plotted that point and it was there, then your curve will look like this.

And potentially, if you kept doing more of whatever your research project is, you'd see an improvement in performance. And then finally, maybe your point is there. So your curve looks like this. And in this case, you would expect to see an even larger jump in performance after you continue doing your thing.
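As a concrete version of that advice, here is a minimal sketch of such a plot, with a hypothetical metric and made-up numbers; the point is just to put your data-size (or compute) ablations on a log-scale x-axis before drawing conclusions:

```python
import matplotlib.pyplot as plt

# Hypothetical experiment: fine-tuning on 25%, 50%, and 100% of a dataset,
# plus a no-fine-tuning baseline. All numbers are made up for illustration.
num_examples = [250, 500, 1000]
accuracy     = [62.0, 66.5, 68.0]
baseline     = 55.0

plt.plot(num_examples, accuracy, marker="o", label="fine-tuned (my thing)")
plt.axhline(baseline, linestyle="--", label="baseline")
plt.xscale("log")
plt.xlabel("number of fine-tuning examples")
plt.ylabel("accuracy (hypothetical)")
plt.legend()
plt.show()
# Whether the curve is flattening out or still climbing tells you whether
# collecting more data is likely to help.
```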

So yeah, I'll end my talk here. And I'm happy to take a few questions before Hyung Won's talk. Yeah. Go ahead. For data that comes from different sources, do you break the data down based on the source? Can you repeat the question? Oh. Oh, yeah. Yeah, thanks. Good question. So the question is, during pre-training, how do you differentiate between good data and bad data?

The answer is, you don't really, but you should, by only training on good data. So maybe you should look at your data source and filter out some data if it's not from a reliable data source. Do you want to give us maybe the intuition behind the intuition for one or two of the examples, like emergence, or why tail knowledge starts to develop?

What's behind that? What do you mean, intuition behind intuition? Intuitively, these concepts that you're seeing in the graphs. And from your experience and expertise, what in the model itself is really causing that emergent behavior? What do you mean, in the model itself? Is it more depth, more nodes, more-- Oh, I see.

--attention, in an intuitive sense, not in a-- Oh, yeah, OK, yeah. So the question is, what in the model makes the language model better at memorizing tail knowledge or at doing math problems? Yeah, I think it's definitely related to the size of the language model. So yeah, if you have more layers, you could encode probably a more complex function within that.

And then I guess if you have more breadth, you could probably encode more facts about the world. And then if you want to repeat a fact or retrieve something, it'd probably be easier. We'll take one more in person, and then we'll move to the online questions. Hi. So when you were studying the 200-ish problems in BIG-Bench, you noticed that 22% were flat.

But there's a possibility that if you were to increase the compute even further, those might have turned out to be emergent. So my question to you is that when you were looking at the 33% that turned out to be emergent, did you notice anything about the loss in the flat portion that suggested that it would eventually become emergent?

Oh, yeah. I didn't notice anything. Oh, sorry. Let me repeat the question. The question is, when I looked at all the emergent tasks, was there anything that I noticed before the emergence point in the loss that would have hinted that it would become emergent later? To me, it's kind of tough.

We have a few plots of this. You can look at the loss, and it kind of gets better. And then suddenly, it spikes, and there's no way to predict it. But also, you don't have perfect data, because you might not have all the intermediate points for a given model size.

Yeah. Great question. Yeah, we can move to the online questions. OK. Yeah. We just have a few questions from people who are joining on Zoom. The first one is, what do you think are the biggest bottlenecks for current large language models? Is it the quality of data, the amount of compute, or something else?

Yeah, great question. I guess if you go back to the scaling laws paradigm, what it says is that if you increase the size of the data and the size of the model, then you'd expect to get a lot better performance. And I think we'll probably try to keep increasing those things.

Gotcha. And then the last one, what are your thoughts on the paper, if you've read it, "Are Emergent Abilities of Large Language Models a Mirage?" Oh, yeah. I always get this question. I guess I would encourage you to read the paper and decide for yourself. But I guess what the paper says is if you change the metric a bit, it looks different.

But I would say, at the end of the day, I think the language model abilities are real. And if you think-- I guess I don't think that's a mirage. So, yeah. Yeah. All right, so thanks, Jason, for the very insightful talk. And now we'll have Hyung Won give a talk.

So he's currently a research scientist on the OpenAI ChatGPT team. He has worked on various aspects of large language models, things like pre-training, instruction fine-tuning, reinforcement learning from human feedback, reasoning, and so forth. And some of his notable works include the scaling Flan papers, such as Flan-T5 and Flan-PaLM, as well as T5X, the training framework used to train the PaLM language model.

And before OpenAI, he was at Google Brain, and he received his PhD from MIT. So give a hand for Hyung Won. All right, my name is Hyung Won, and really happy to be here today. And this week, I was thinking about-- by the way, is my mic working fine?

Yeah, yeah. So this week, I thought about, OK, I'm giving a lecture on transformers at Stanford. What should I talk about? And I thought, OK, some of you in this room and in Zoom will actually go shape the future of AI. So maybe I should talk about that. It's a really important goal and ambitious, and we really have to get it right.

So that could be a good topic to think about. And when we talk about something into the future, the best place to get an advice is to look into the history. And in particular, look at the early history of transformer and try to learn many lessons from there. And the goal will be to develop a unified perspective in which we can look into many seemingly disjoint events.

And from that, we can probably hope to project into the future what might be coming. And so that will be the goal of this lecture. And we'll look at some of the architectures of the transformers. So let's get started. Everyone I see is saying AI is advancing so fast that it's so hard to keep up.

And it doesn't matter if you have years of experience. There's so many things that are coming out every week that it's just hard to keep up. And I do see many people spend a lot of time and energy catching up with the latest developments, the cutting edge and the newest thing.

And then not enough attention goes into the old things, because they become deprecated and no longer relevant. But I think it's important, actually, to look into those. Because when things are moving so fast, beyond our ability to catch up, what we need to do is study the change itself.

And that means we can look back at the previous things and then look at the current thing and try to map how we got here and from which we can look into where we are heading towards. So what does it mean to study the change itself? First, we need to identify the dominant driving forces behind the change.

So here, dominant is an important word because typically, a change has many, many driving forces. And we only care about the dominant one because we're not trying to get really accurate. We just want to have the sense of directionality. Second, we need to understand the driving force really well.

And then after that, we can predict the future trajectory by rolling out the driving force and so on. And you heard it right. I mentioned about predicting the future. This is a computer science class, not like an astrology or something. But we do-- I think it's actually not that impossible to predict some future trajectory of a very narrow scientific domain.

And that endeavor is really useful to do because, let's say you do all this and raise your prediction accuracy from 1% to 10%. And then you'll make, say, 100 predictions. 10 of them will be correct. Say one of them will be really, really correct, meaning it will have an outsized impact that outweighs everything else.

And I think that is a very general thing I've seen in life: you only really have to be right a few times. So let's think about why predicting the future is difficult, or maybe even think about the extreme case where we can all do the prediction with almost perfect accuracy.

So here, I'm going to do a very simple experiment of dropping this pen and follow this same three-step process. So we're going to identify the dominant driving force. First of all, what are the driving forces acting on this pen, gravity downwards? And is that all? We also have, say, air friction if I drop it.

And that will cause what's called a drag force acting upwards. And actually, depending on how I drop this, the orientation, the aerodynamic interaction will be so complicated that we don't currently have any analytical way of modeling that. We can do it with the CFD, the computational fluid dynamics, but it will be nontrivial.

So we can neglect that. This is heavy enough that gravity is probably the only dominant force. We simplify the problem. Second, do we understand this dominant driving force, which is gravity? And we do because we have this Newtonian mechanics, which provides a reasonably good model. And then with that, we can predict the future trajectory of this pen.

And if you remember from your dynamics class, if the initial velocity is 0 (I'm not going to give it any velocity), and let's say the initial position is 0 here, then x(t) = ½ g t² gives a precise trajectory of this pen as I drop it. So if there is a single driving force that we really understand, it's actually possible to predict what's going to happen.

So then why do we really fear predicting the future in the most general sense? And I argue that, among many reasons, it's because the sheer number of dominant driving forces acting on a general prediction is so large, and their interaction creates a complexity that we cannot predict in the most general sense.

So here's my cartoon way of thinking about predicting the future. On the x-axis, we have the number of dominant driving forces. On the y-axis, we have prediction difficulty. So on the left-hand side, we have dropping a pen. It's a very simple case; its difficulty is very small. You just need to learn physics.

And then as you add more stuff, it just becomes impossible. So how does this fit into AI research? And you might think, OK, I see things coming in all the time. We are bombarded by new things. And some people will come up with a new agent, new modality, new MMLU score, whatever.

We just see so many things. I'm not even able to catch up with the latest thing. How can I even hope to predict the future of the AI research? But I argue that it's actually simpler, because there is a dominant driving force that is governing a lot, if not all, of the AI research.

And because of that, I would like to point out that it's actually closer to the left than we may perceive. So what is that driving force? Oh, maybe before that, I would like to caveat that when I do this kind of talk, I would like to not focus too much on the technical stuff, which you can probably do better in your own time.

But rather, I want to share how I think. And for that, I want to share my opinions. So it will be very strongly opinionated, and by no means am I saying this is correct. I just wanted to share my perspective. So coming back to this driving force for AI, what is that dominant driving force?

And here is a plot from Rich Sutton. On the y-axis, we have calculations (FLOPs) per dollar: if you pay $100, how much computing power do you get? And it's in log scale. And then on the x-axis, we have time, more than 100 years of it. So this is actually more than exponential.

And I don't know any trend that is as strong and as long-lasting as this one. So whenever I see this kind of thing, I should say, OK, I should not compete with this. And better, I should try to leverage as much as possible. And so what this means is you get 10x more compute every five years if you spend the same amount of dollar.
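Just to put a number on that trend (simple arithmetic on the rough 10x-per-5-years figure, not an exact fit of the plot):

```python
# "10x more compute per dollar every five years" as a yearly growth factor.
yearly_factor = 10 ** (1 / 5)
print(round(yearly_factor, 2))  # ~1.58, i.e. roughly 58% more compute per dollar each year

# At the same trend, 20 years buys 10**(20/5) = 10,000x more compute for the same budget.
print(10 ** (20 / 5))
```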

In other words, the cost of compute is going down exponentially. And this, and the associated scaling, is really dominating AI research. And that is somewhat hard to take, but it is, I think, really important to think about. So coming back to AI research, how does this exponentially cheaper compute drive the AI research?

Let's think about the job of the AI researchers. It is to teach machines how to think in a very general sense. And one somewhat unfortunately common approach is we think about how we teach machine how we think we think. So meaning we model how we think and then try to incorporate that into some kind of mathematical model and teach that.

And now the question is, do we understand how we think at the very low level? I don't think we do. I have no idea what's going on. So it's fundamentally flawed in the sense that we try to model something that we have no idea about. And what happens if we go with this kind of approach is that it poses a structure that serves as a shortcut in the short term.

And so you can maybe get a paper or something. But then it becomes a bottleneck, because we don't know how this will limit further scaling up. More fundamentally, what this is doing is we are limiting the degree of freedom we are giving to the machines. And that will backfire at some point.

And this has been going on for decades. And The Bitter Lesson is, I think, the single most important piece of writing in AI. And it says-- this is my wording, by the way-- the past 70 years of AI research can be summarized as developing progressively more general methods with weaker modeling assumptions or inductive biases, and adding more data and compute-- in other words, scaling up.

And that has been the recipe of the entire history of AI research, not fancy things. And if you think about it, the models of the 2000s are a lot more difficult than what we use now. And so it's much easier to get into AI nowadays from a technical perspective. So this is, I think, really the key information.

We have this compute cost that's going down exponentially. And it's getting cheaper faster than we're becoming a better researcher. So don't compete with that. And just try to leverage that as much as possible. And that is the driving force that I wanted to identify. And I'm not saying this is the only driving force.

But this is the dominant driving force. So we can probably neglect the other ones. So here's a graphical version of that. X-axis, we have a compute. Y-axis, we have a performance of some kind. Let's think about some general intelligence. And let's look at two different methods, one with more structure, more modeling assumptions, fancier math, whatever.

And then the other one has less structure. What you see is typically that the more-structured one starts with better performance in the low compute regime. But it plateaus because of some kind of structure backfiring. And then the less-structured one, because we give a lot more freedom to the model, doesn't work in the beginning.

But then as we add more compute, it starts working. And then it gets better. We call this more scalable methods. So does that mean we should just go with the least structure, most freedom to the model possible way from the get-go? And the answer is obviously no. Let's think about even less structure case.

This red one here, it will pick up a lot later and requires a lot more compute. So it really depends on where we are. We cannot indefinitely wait for the most general case. And so let's think about the case where our compute situation is at this dotted line. If we're here, we should choose this less structure one as opposed to this even less structure one, because the other one doesn't really work.

And the other one works. But crucially, we need to remember that we are adding some structure because we don't have compute. So we need to remove that later. And so the difference between these two methods is the additional inductive biases or structure someone imposes, which typically don't get removed.

So adding this, what that means is that at the given level of compute, data, algorithmic development, and architecture that we have, there is like an optimal inductive bias or structure that we can add to the problem to make the progress. And that has been really how we have made so much progress.

But these are like shortcuts that hinder further scaling later on. So we have to remove them later on when we have more compute, better algorithms, or whatever. And as a community, we do adding structure very well, 'cause there's an incentive structure with papers: you add a nice one, then you get a paper. But removing structure doesn't really get you much.

But removing that doesn't really get you much. So that, we don't really do that. And I think we should do a lot more of those. So maybe another implication of this bitter lesson is that because of this, what is better in the long term almost necessarily looks worse now.

And this is quite unique to AI research because the AI research of current paradigm is learning-based method, meaning that we are giving models freedom, the machines choose how they learn. So because we need to give more freedom, it's more chaotic at the beginning, so it doesn't work. But then when it started working, we can put in more compute and then it can be better.

So it's really important to have this in mind. So to summarize, we have identified this dominant driving force behind the AI research. And that is exponentially cheaper compute and associated scaling up. Now that we have identified, if you remember back from my initial slides, the next step is to understand this driving force better.

And so that's where we're gonna spend most of the time doing that. And for that, we need to go back to some history of transformer, 'cause this is a transformers class, analyze key structures and decisions that were made by the researchers at the time and why they did that, whether that was an optimal structure that could have been added at the time and why they might be irrelevant now.

And should we remove that? And we'll go through some of the practice of this. And hopefully this will give you some flavor of what scaling research looks like. So now we'll go into a little bit of the technical stuff. Transformer architecture, there are some variants. I'll talk about three of them.

First is the encoder decoder, which is the original transformer, which has a little bit more structure. Second one is the encoder only, which is popularized by BERT. And then third one is decoder only, which you can think of as a current like GPT-3 or other language models. This has a lot less structure than the encoder decoder.

So these are the three types we'll go into detail. Second, the encoder only is actually not that useful in the most general sense. It still has some place, but we will just briefly go over that and then spend most of the time comparing one and three. So one has more structure, what's the implication of that and so on.

So first of all, let's think about what a transformer is. Just at a very high level or first principles, what is a transformer? It's a sequence model. And sequence model has an input of a sequence. So sequence of elements can be words or images or whatever. It's a very general concept.

In this particular example, I'll show you with words. A sentence is a sequence of words. And then the first step is to tokenize it, 'cause we have to represent these words in computers, which requires just some kind of encoding scheme. So we just do it with a fixed number of integers, so that we now have a sequence of integers.

And then the dominant paradigm nowadays is to represent each sequence element as a vector, dense vector, because we know how to multiply them well. And then so we have a sequence of vectors. And finally, this sequence model will do the following. We just want to model the interaction between sequence elements.

And we do that by letting them take the dot product with each other. And if the dot product is high, we can say they are semantically more related than pairs whose dot product is low. And that's kind of the sequence model. And the transformer is a particular type of sequence model that uses what's called attention to model this interaction.
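
Here is roughly what that dot-product interaction looks like as code: a minimal single-head self-attention sketch in NumPy, leaving out the learned query/key/value projections, multiple heads, and everything else a real transformer layer has:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Minimal single-head self-attention over a sequence x of shape (seq_len, d).

    In a real transformer, queries, keys, and values come from learned
    projections of x; here we use x directly to keep the sketch short.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # pairwise dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ x                               # weighted mix of the value vectors

sequence = np.random.randn(4, 8)       # 4 elements, each represented by a vector of size 8
print(self_attention(sequence).shape)  # (4, 8): one updated vector per element
```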

So let's get into the details of this encoder-decoder, which was the original transformer. It has quite a lot of pieces, so let's go through it a piece at a time, starting with the encoder. So here, I'm going to show you an example of machine translation, which used to be a very cool thing.

And so you have an English sentence, "that is good", and then we're gonna translate it into German. So the first thing is to encode this into dense vectors. So here I'm representing each token with a vector of size three or something. And then we have to let them take the dot product.

So these lines represent which element can talk to which other elements. And here, because it's the input, we use what is called bidirectional attention. So any token can talk to any other token. And then we have this MLP or feed-forward layer, which is per token. It doesn't have any interaction.

You just do some multiplication just because we can do it. And then that's one layer, and we repeat that n times. And that's just the transformer encoder. And at the end, what you get is the sequence of vectors, each representing the sequence element, in this case, a word. So that's the output of this encoder.

Now let's look at the decoder, which is a similarly shaped stack of layers. So here we put in as an input what the answer should be. So here, BOS is the beginning-of-sequence token, and then "das ist gut", I don't know how to pronounce it, but that's the German translation of "that is good".

And so we kind of go through a similar process. Here we have causal self-attention, meaning that the token at time step t can only attend to t and before, because when we start generating, we don't have the future tokens. So when we train it, we should limit the attention in the same way.

And this is done by masking; that's just what is different from the encoder. So after, again, N layers, you get this sequence output, so the output is a sequence. It's a sequence-to-sequence mapping; this is the general encoder-decoder architecture. And when you get the end-of-sequence token, you stop generating. There's a small sketch of that masking right below.
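
The masking is just setting the attention scores for future positions to negative infinity before the softmax; a minimal sketch (again a bare single-head NumPy version, not a full decoder layer):

```python
import numpy as np

def causal_self_attention(x: np.ndarray) -> np.ndarray:
    """Like bidirectional self-attention, but token t can only attend to tokens <= t."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # Mask out strictly-future positions with -inf so their softmax weight is zero.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = np.random.randn(5, 8)
print(causal_self_attention(x).shape)  # (5, 8)
```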

So this is the overall picture. Now I'll point out some important attention patterns. So we are translating into German what was input to the encoder. So there has to be some connection between the decoder and the encoder. That is done by this cross-attention mechanism shown in red, which is just that each sequence element's representation in the decoder should attend to some of the elements in the encoder.

And in particular, the design feature which is interesting is that all the layers in the decoder attend to the final-layer output of the encoder. I will come back to the implication of this design. So, yep, that's that. And now, let's move on to the second type of architecture, which is encoder-only.

We'll spend a little bit of time here. So again, we have the same input, and we go through a similar structure. And then, in this case, the final output is a single vector. Regardless of the length of the sequence, we just get a single vector. And that is, that represent the input sequence.

That's a dense vector representation. And then, let's say we do some kind of a sentiment analysis. We run through a task-specific linear layer to map it to classification labels, positive or negative probabilities here. And that's required for all these task-specific cases. And this is kind of popularized by BERT.

And what this means is that here, at the time, 2018, when BERT came out, we had the benchmark called GLUE, which was a language understanding task. You have a sequence in, classification labels out for most cases. This was how the field really advanced at the time. So when we care about such tasks, then there's an incentive to think about simplifying the problem, adding the structure to the problem so that we can make a progress.

So this, the additional structure that was put into this particular architecture is that we're gonna give up on the generation. If we do that, it becomes a lot simpler problem. Instead of sequence to sequence, we're talking about sequence to classification labels, and that's just so much easier. And so at some point, 2018, 2019, a lot of the papers, or just research, was like, we sometimes call it BERT engineers.

You change something a little bit, get like 0.5% better on GLUE, and you get a paper, things like that. It was a very chaotic era. But if we look at it from this perspective, we are adding the structure of not generating sequences; that gives a performance win in the short term, but in the long term, it's not really useful.

So we're not gonna look at this encoder-only architecture going forward. Third architecture: decoder only. This one is my favorite, personally. It looks kind of daunting because of this attention pattern, but it actually is very simple. So here, we only have a single stack, and it can actually generate stuff.

And so there's a misconception: some people think this decoder-only architecture is used for language modeling, next-token prediction, so it cannot be used for supervised learning. But here, we can actually do it. The trick is to take this input, "that is good", and concatenate it with the target. And if you do that, then it just becomes simple, just sequence in, sequence out; there's a toy sketch of this right below.
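
Concretely, the supervised example is packed into one sequence and the loss is only counted on the target portion. A toy sketch with hypothetical tokens (no real tokenizer or model, just the packing and the loss mask):

```python
# Toy decoder-only training example for translation, with hypothetical tokens.
input_tokens  = ["That", "is", "good"]
target_tokens = ["<bos>", "Das", "ist", "gut", "<eos>"]

sequence = input_tokens + target_tokens   # one concatenated sequence

# Next-token prediction runs over the whole sequence, but only positions in the
# target portion contribute to the training loss.
loss_mask = [0] * len(input_tokens) + [1] * len(target_tokens)

for position, (token, counted) in enumerate(zip(sequence, loss_mask)):
    print(position, token, "loss counted" if counted else "loss ignored")
```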

So what we do is the self-attention mechanism here is actually handling both the cross-attention between target and the input, and self-attention sequence learning within each. So that's the causal attention. And then, as I mentioned, the output is a sequence. And then the key design features are self-attention is serving both roles, and we are, in some sense, sharing the parameters between input and target.

So same set of parameters are applied to both input and the target sequences. So this is the decoder only. Now, we will go into the comparison. So I think there are many, they look very different, at least on the schematics. So how different are they, actually? And I argue that they're actually quite similar.

And so to illustrate that, we're gonna transform, starting from this encoder-decoder, which has more structures built in, and then into the decoder-only architecture, and see what are some of the differences, and then interpret those differences, those additional structures, are they relevant nowadays? Now that we have more compute, better algorithm, and so on.

So let's have this table. Four differences, we'll see each of them. And then, as we go through, we'll populate this table. So let's first look at this additional cross-attention. What that means is that this, on the left, is an encoder-decoder, which has this additional red block, the cross-attention, compared to the simpler one that doesn't have that.

So we wanna make the left closer to the right. So that means we need to either get rid of it, or something. And attention mechanism has kind of the four projection matrices. And so self-attention and cross-attention actually have the same number of parameters, same shape. So we can just share them.

So that's the first step, share both of these. And then it becomes mostly the same mechanism. And then, so that's the first difference, separate cross-attention, or self-attention serving both roles. Second difference is the parameter sharing. So what that means is that, between the input and the target, encoder-decoder architecture uses the separate parameters.

And decoder only has a single stack, so it uses the shared parameter. So if we wanna make the left close to right, we wanna share the encoder parameters. So let's do that, just color this. So now they share the parameters. Third difference is the target-to-input attention pattern. So we need to connect the target to the input, and how is that done?

In the encoder-decoder case, we had this cross-attention, and then in the decoder-only, it's the self-attention doing everything. The difference is that in the encoder-decoder, every layer of the decoder attends to the final-layer output of the encoder, whereas in the decoder-only, it's actually per layer, within layer.

When we are decoding, say, the word "das", we are looking at the same-layer representation of the input, and that's within-layer, and I think this is the key design feature. So if you wanna make this close to that, we have to bring back this attention to each layer. So now layer one will be attending to layer one of this.

And finally, the last difference is the input attention. I mentioned this bidirectional attention, and because the decoder-only typically uses unidirectional attention, we need to make them match. So we can just get rid of it; I just got rid of some of the arrows.

So then at this point, these two architectures are almost identical. There's a little bit of difference in the cross-attention, but they have the same number of parameters, and in deep learning, if you just train these two architectures on the same task with the same data, I think you will get results pretty much within the noise, probably closer than if you train the same thing twice.

So I would say they are identical. And so these are the main differences. Now we'll look at what the additional structures are and what they mean. So yeah, that's the populated table now. And then, so we can say that the encoder-decoder, compared to the decoder-only architecture, has these additional structures and inductive biases built in.

So let's go into each of them. The first one is that what the encoder-decoder adds as a structure is the assumption that the input and the target sequences are sufficiently different that it'll be useful to use separate parameters. And so, why is that useful? When can that assumption be useful?

And one example is machine translation. Back when the transformer was introduced in 2017, translation was a really popular task. And it was considered difficult. And it's just sequence to sequence, and you can actually have a BLEU score, which is a heuristic-based metric that gives you a single number, and then people can optimize that.

So in that task, we have this input and target in completely different languages. So if the goal is to learn translation only, then it kind of makes sense to have, okay, this parameter in the encoder will take care of the English, and this parameter in the decoder will take care of the German.

That seems natural. And what about now? Modern language modeling is about learning knowledge. And it's not just about translation, or not even about language. Language just comes up as a byproduct of doing this next-token prediction, and translation as well. So does it make sense to have separate parameters for this kind of situation now?

Like, we have some knowledge in German, some knowledge in English, and if anything, you wanna combine them. And if we represent them in separate parameters, I don't think that's natural. So I would say, with this much more general, larger models that can do a lot of things, this assumption seems very unnatural to me.

Second example is a little bit more modern. Two years ago, when I was at Google, and with Jason, we did this instruction fine-tuning work. And what this is, is you take the pre-trained model, and then just fine-tune on academic data set, and so that it can understand the natural language instruction.

So the detail doesn't matter, but here, let's think about the performance gain from doing this fine-tuning on the two different architectures we tried. So the first five are Flan-T5, which is T5-based, an encoder-decoder architecture. The latter ones are a decoder-only architecture, based on PaLM. So we spent 99% of the time on PaLM, optimizing a lot of things.

And then at the end, we just spent three days on T5. But the performance gain was a lot higher on this. And I was really confused about this, and in a very good way. And after the paper was published, I wanted to dig a little bit deeper into why this might be the case.

So my hypothesis is that it's about the length. So for the academic datasets we used, we used like 1,836 tasks, and they have this very distinctive characteristic where we have a long input, long in order to make the task more difficult, but then we cannot make the target long, because if we do, there's no way to grade it.

So there's fundamental challenge of that. So what happens is you have a long text of input, and then short text of the target. And so this is kind of the length distribution of what went into the fine tuning. So then you see this, you have a very different sequence going into the encoder as an input, and a very different type of sequence going into the target.

So now, this encoder-decoder architecture has an assumption that the input and target will be very different, and that structure really shines because of this. It was kind of an accident, but that was, I think, why this architecture was just really suitable for fine-tuning with the academic datasets. What about now?

Do we care about this kind of assumption? And if you think about the general use cases of language models nowadays, if anything, the more interesting cases involve longer generation, longer target. Just because we cannot grade them doesn't mean that we are not interested in them. Actually, if anything, we are more interested in that.

So now, we have this longer target situation. So this assumption of separate parameters for input and target doesn't seem to make much sense. And moreover, if we think about a chat application, like ChatGPT, we do multi-turn conversation. And then, what is the target of this turn becomes the input of the next turn.

And then my question is, does that make sense to even think about different parameters if next turn it's gonna be the same thing? So that was the first inductive bias we just mentioned. And then the second structure is that target element can only attend to the fully encoded ones, the final output of the encoder.

Let's look at this additional structure and what it means. So as I mentioned, we have the decoder attending to the very top layer of the encoder. And in deep neural nets, typically we see that the bottom layers and the top layers encode information at very different levels. Meaning that, for example, in computer vision, the bottom layers encode something like edges, and the top layers combine those features into higher-level concepts, something like a cat face.

And so we call this deep learning a hierarchical representation learning method. And so now the question is, if decoder layer one attends to encoder final layer, which probably has a very different level of information, is that some kind of an information bottleneck, which actually motivated the original attention mechanism.

And in practice, I would say, in my experience, doesn't really make any difference. And that's because my experience was limited to, say, 24 layers of encoder of T5. So layer one attended to 24, probably fine. But what if we have 10x or 1000x more layers? Would that be problematic?

I'm not really comfortable with that. So I think this is also an unnecessary design that maybe we need to revisit. The final structure we're gonna talk about is the bidirectional thing in the encoder-decoder. Let's think about that. So yeah, bidirectional input attention: is that really necessary?

So when we had BERT, the B in BERT stands for bidirectional. In 2018, when we were solving that question-answering task, SQuAD, it was actually a very difficult task. So if you had any additional trick, it could make a huge difference. Bidirectionality was really useful, I think maybe boosting up the SQuAD score by like 20.

So it was a really huge thing. But at scale, I don't think this matters that much. This is my highly anecdotal experience. In Flan 2, we tried both bidirectional and unidirectional fine-tuning, and it didn't really make much difference. But I wanna point out that this bidirectionality actually brings in an engineering challenge for modern multi-turn chat applications.

So at every turn, the new input has to be encoded again, and for this, unidirectional attention is much, much better. So here's what I mean by that. Let's think about this more modern conversation between a user and an assistant: "how are you", "bad", and "why". And so here, in the bidirectional case, when we generate "bad", we need to encode this input with the bidirectional thing, which is fine.

And then after "bad" is generated, when we're trying to process "why", we'll need to encode "how" again, because "how" can now attend to "bad", so we need to do everything from scratch again. In contrast, if we do the unidirectional one, we can do much, much better, because now when we are trying to process "why", we don't have to redo "how": it cannot attend to the future tokens, so we don't have to do anything.

So if you see the difference, this part can be cached, and then this part is the only thing that has to be encoded again; there's a rough sketch of this right below. This kind of makes a big difference when we think about multiple turns going in. So I would say bidirectional attention did well in 2018, but what it bought is mostly solved by scale, and now, because of this engineering challenge, we don't really need it.
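
Here is a rough sketch of that difference as a toy count of how many tokens must be (re-)encoded per turn; it is not a real inference stack, just the bookkeeping:

```python
# Toy count of how many "tokens" have to be encoded per turn in a multi-turn chat.
# With causal attention, earlier turns' keys/values sit in a cache and are reused;
# with bidirectional encoding, the whole history must be re-encoded every turn.
turns = ["How are you", "Bad", "Why"]            # alternating user / assistant turns
turn_lengths = [len(t.split()) for t in turns]   # crude word-level "token" counts

causal_work, bidirectional_work, history = 0, 0, 0
for n in turn_lengths:
    causal_work += n              # only the new tokens; the rest comes from the KV cache
    history += n
    bidirectional_work += history # everything so far has to be encoded again

print("causal (with KV cache):", causal_work)    # 3 + 1 + 1 = 5
print("bidirectional:", bidirectional_work)      # 3 + 4 + 5 = 12
```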

So to conclude, we have looked into this driving force, dominant driving force governing this AI research, and that was this exponentially cheaper compute and associated scaling effort. And so to understand this driving force, we analyzed some of the additional structures added to the encoder-decoder compared to decoder-only, and then thought about what that means from the perspective of scaling.

And I wanted to just conclude with this remark. So we have looked at this kind of analysis, which one could say is just historical artifact and doesn't matter, but if you do many of these and then look at current events, you can hopefully think about them in a more unified manner and then see, okay, what assumptions in my problem do I need to revisit, and are they relevant?

And if not, why? And you have an answer to it. Can we do it with a more general thing and scale up? And so I hope you can go back and really think about these problems, and together we can really shape the future of AI in a really nice way.

So that's it, thanks. (audience applauds) - Hi, thank you for the talk. So about the mixture-of-experts structure, if what you're saying is correct, then how long do you think mixture of experts is gonna stay for the new large language models? - So one thing I have to apologize for is that architecture is kind of a thing that I'm not really comfortable sharing a lot about, and that's why I'm limiting a little bit how much I talk about the future.

So yeah, I'll probably just skip that, but I would say that seems quite general. Yeah. - So some of the changes that you described between encoder-decoder versus decoder-only, like the parameter sharing and the bidirectional attention, can they not be interpreted as less structure, or sorry, more structure or less freedom for the model to learn?

- Yeah, I think one can see it that way, but, and this is somewhat subjective, I think it's a simpler structure: we're just saying input and target are just sequences, and if we have enough capacity, we can just handle both. And there are other cases where, yeah, I can totally see that. Oh, actually, maybe I should have repeated the question.

The question is, can we think about this parameter sharing, versus the other structures in the encoder-decoder, as actually less structure? But I think the encoder-decoder is a little bit more complicated model, and such complications mean it has more assumptions, right? That the input and target are different. I think that's a stronger assumption than, okay, it's all just a sequence.

We deal with the sequence in a unified way. So that would be just my take. - Do you have any thoughts on the recent state-space models like Mamba, and how that fits into the paradigm of less structure versus more structure? - Yeah, yeah. Okay. It's hard to think about it on the spot, but to me, I talk about these architectures, but architecture is kind of a thing that doesn't change things too much.

And I think multi-modalities might bring in other challenges, where this transformer structure might become a bottleneck when we think about that. But, yeah. I think transformers have done a good job, so maybe we should think about it, especially with multi-modalities. Yeah. - So, like, for cross-attention and causal attention, causal attention kind of removes the permutation invariance that bidirectional attention has.

And in computer vision, there's a lot of structure for learning invariances in self-supervised learning. What do you think about those in terms of the complexity you just talked about? - So the question is about causal attention versus bidirectional attention. They were probably fine in the text domain, but in the computer vision case, being able to attend to the future part is really important.

Yeah, is that the question? - Also, causal attention removes the invariance to permutation. And for vision, there are a lot of invariances, right, coming from augmentation. So what do you think about those as a form of structure?

- So I don't really like these invariances. They are how humans think we perceive vision. The CNN, for example, builds in translation invariance, which we thought was very important. But I don't think it's that important. Actually, if anything, it's now hurting the model from learning something more general.

And so the machines might be learning vision in a completely different way from how humans do, and I don't think that's problematic. So those invariances could be a good guiding principle, but I'm not too worried about not having such structures.

Yeah, I would just try it out and check, based on some metric, whether not having that invariance structure is actually better and more scalable. If we can do it without the structure, that's probably fine, and actually even better. - So I actually have two questions.

One, so clearly you've been thinking about how inductive biases and structure limit our potential, essentially. So I'm just curious, what are some big inductive biases currently that you think are big blockers we could release, or let go of? That would be one question.

- The current structure that we should get rid of? - Just current inductive biases you see, because clearly you've been thinking about this, right? So when you look at the state of research, you must be thinking, oh man, this is a pretty big inductive bias. It'd be really cool if we could let this go.

So I'm just trying to see what you're-- - Yeah, so when I think about this in terms of architecture, I think the architectures are not the current bottleneck, in my view. Partly because I did a lot of architecture research, and at the end we published a paper saying, okay, we tried something like 60 different transformer modifications, and they were pretty much all the same.

And none of them really made a huge difference. Caveat: now, maybe the conclusion would be different. So I have a very big bias here, toward not doing architecture research. And one message could be that the architecture is actually not the bottleneck in further scaling.

And I think what's the bottleneck now is the learning objective, especially in the supervised learning paradigm, or even self-supervised pre-training. What we're doing with maximum likelihood estimation is saying, okay, given this input, this is the only correct target, and everything else is incorrect, because the probability mass is finite.

So is that really a comfortable teaching signal to give the model? And I think, in the old days, we could formalize the correct behavior for a given input very well, and maybe a single correct answer was fine. But now, if you're thinking about very general tasks, especially chat applications, okay, "write a poem," and then you say this is the only correct answer, I think the implications could be really severe.
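As a rough illustration of the point about maximum likelihood (a minimal sketch with made-up numbers, not any actual training code): the cross-entropy loss treats one reference token as the only correct continuation, so a nearly-as-good alternative earns no credit at all.

```python
# Minimal sketch of the maximum-likelihood objective being described:
# exactly one reference token is "correct"; every other token is wrong.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

vocab = ["roses", "are", "red", "violets", "blue", "crimson"]
logits = np.array([0.1, 0.2, 2.0, 0.0, 0.3, 1.9])  # made-up model scores
probs = softmax(logits)

target = vocab.index("red")      # the single reference continuation
loss = -np.log(probs[target])    # cross-entropy against a one-hot target

# "crimson" gets almost as much probability as "red", but under this
# objective it earns no credit; it only takes probability mass away
# from the reference and therefore raises the loss.
print(f"p(red)={probs[target]:.2f}  p(crimson)={probs[vocab.index('crimson')]:.2f}  loss={loss:.2f}")
```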

So I think that's really something that I'm not comfortable with. And part of why I'm interested in RLHF is that it's one instantiation of not using this maximum likelihood, and instead using a reward model as a learned objective function, which is a lot less structure, so we can scale it further. RLHF itself is not really that scalable, I would say.
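And to contrast with the one-hot objective above, here is a minimal sketch of the idea of using a learned scorer as the objective. The reward values and the plain REINFORCE-style update on a toy categorical policy are stand-ins for illustration only; this is not how RLHF is actually implemented at scale.

```python
# Minimal sketch: a learned reward model as the objective, instead of a
# one-hot target. Illustrative only; the reward model is a stand-in and
# the update is plain REINFORCE on a toy single-token "policy".
import numpy as np

rng = np.random.default_rng(0)
vocab = ["red", "crimson", "blue", "banana"]
logits = np.zeros(len(vocab))          # toy policy over one-token "poems"

def reward_model(token):
    # Stand-in for a learned scorer: several answers can be good, to
    # different degrees, rather than exactly one being "correct".
    return {"red": 1.0, "crimson": 0.9, "blue": 0.4, "banana": -1.0}[token]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    i = rng.choice(len(vocab), p=probs)   # sample an output from the policy
    r = reward_model(vocab[i])            # score it with the learned objective
    grad = -probs                         # REINFORCE: grad of log p(i) w.r.t. logits
    grad[i] += 1.0
    logits += lr * r * grad               # push toward higher-reward samples

print({w: round(float(p), 2) for w, p in zip(vocab, softmax(logits))})
# The policy drifts toward high-reward outputs like "red" and "crimson" and
# away from "banana"; no single token was ever declared the only correct one.
```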

But it showed that we can use supervised deep learning to train a model that serves as an objective function, and that really works in a cool way. I think that's a great paradigm. - Thank you, great answer. Not that you're being judged or anything. But my second question is, at the beginning of the talk you talked about the big driving force being exponentially cheaper compute, right?

But some of the things I've been reading say Moore's Law is ending, and we're going towards performance-oriented architectures. So can we rely on that? Because for the past 50 years we had transistor counts doubling under Moore's Law, but that's ending. So when you talk about compute, these trends that we've been looking at, and that history, we're also uncertain about how that's going to project into the future.

So what are some of your thoughts on-- - Yeah, I think Moore's Law is really a red herring in this case, because it's about the number of transistors, and that doesn't really matter. I think what matters is compute availability. And the GPU, for example, is a very different kind of architecture, and that enabled the continuation of this trend.

And I think right now, in 2023, 2024, we're kind of taking shortcuts with low-precision arithmetic, which I still think is cool. But I think there are many other GPU-level things. And also, if we are fairly sure about the architecture, and this is just my thought, we can hard-code it into the chips, and that can provide a lot of benefits.

And for training, I don't think that's really been done. But the GPU, if you think about it, is too general, and maybe that is something we can revisit. So I'm not losing hope, and I don't see this trend going away. But maybe other things will come up as a bottleneck, like maybe energy or something.

Yeah, so physics is probably something that we need to study again. (audience laughs) - If you don't mind me continuing, then the problem is that we're talking about exponential driving forces, right? You can tell me that you want to hard-code chips, but that's not the same as telling me that there's going to be exponential growth that we can drive, right, into paradise or wherever the hell we're going.

- Yeah, here's my very boring answer. I think we just need to do a little bit better, and at some point the machines will be better than us at thinking about chip design. (audience laughs) So I'm half-joking, but if we look back at, say, this video two years from now, I think it'll be less of a joke and more of a serious thing.

Let's just get there first. - All right, so thanks to Hyung Won for an amazing talk. (audience applauds)