
Stanford CS25: V4 I Jason Wei & Hyung Won Chung of OpenAI



00:00:00.000 | So again, I'm very happy to have Jason here.
00:00:07.200 | So he's an AI researcher based in San Francisco,
00:00:09.980 | currently working at OpenAI.
00:00:11.560 | He was previously a research scientist at Google Brain,
00:00:14.380 | where he popularized key ideas in LLMs,
00:00:17.260 | such as chain of thought prompting,
00:00:18.960 | instruction tuning, as well as emergent phenomena.
00:00:21.900 | He's also a good friend of mine,
00:00:23.380 | and he's been here before to give some talks.
00:00:25.320 | So we're very happy to have you back, Jason, and take it away.
00:00:28.980 | Great.
00:00:30.480 | [APPLAUSE]
00:00:33.940 | Yeah.
00:00:34.440 | Thanks for the intro.
00:00:37.020 | So a bit about the structure.
00:00:38.660 | So I'll talk for around 30 minutes,
00:00:40.240 | and then I'll take a few questions.
00:00:41.980 | And then Hyungwon will talk for 30 minutes,
00:00:44.060 | and then we'll both take questions at the end.
00:00:46.020 | Great.
00:00:49.440 | So I want to talk about a few very basic things.
00:00:55.300 | And I think the fundamental question that I hope to get at
00:00:59.780 | is, why do language models work so well?
00:01:09.560 | And one thing that I'd encourage everyone
00:01:14.780 | to do that I found to be extremely helpful in trying
00:01:17.020 | to answer this question is to use a tool, which
00:01:21.780 | is manually inspect data.
00:01:27.260 | And I'll give a short anecdote.
00:01:31.580 | I've been doing this for a long time.
00:01:33.900 | So in 2019, I was trying to build one of the first lung
00:01:38.380 | cancer classifiers.
00:01:39.260 | So there'd be an image, and you have to say, OK,
00:01:41.220 | what type of lung cancer is this?
00:01:43.780 | And my first thing was, OK, if I want
00:01:47.140 | to train a neural network to do this,
00:01:49.520 | I should be able to at least do the task.
00:01:51.220 | So I went to my advisor, and I said,
00:01:53.340 | oh, I want to learn to do this task first.
00:01:55.300 | And he said, Jason, you need a medical degree
00:01:58.300 | and three years of pathology experience
00:02:00.300 | to even do this task.
00:02:01.940 | And I found that a bit discouraging,
00:02:03.740 | but I went and did it anyways.
00:02:05.680 | So I basically looked at the specific type of lung cancer
00:02:09.100 | that I was working on.
00:02:10.660 | And I'd read all the papers on how
00:02:12.340 | to classify different types.
00:02:13.940 | And I went to pathologists, and I said, OK,
00:02:16.060 | try to classify these.
00:02:17.060 | What did I do wrong?
00:02:17.860 | And then, what do you think of that?
00:02:20.140 | And in the end, I learned how to do this task
00:02:22.740 | of classifying lung cancer.
00:02:25.340 | And the result of this was I gained intuitions
00:02:28.340 | about the task that led to many papers.
00:02:32.620 | OK, so first, I will do a quick review of language models.
00:02:38.820 | So the way language models are trained
00:02:41.180 | are with the next word prediction task.
00:02:44.760 | So let's say you have a sentence--
00:02:51.280 | Dartmouth students like to.
00:02:55.640 | And the goal of next word prediction
00:02:57.520 | is you have some words that come before,
00:02:59.440 | and then you want to predict the next word.
00:03:02.080 | And what the language model does is
00:03:05.880 | it outputs a probability for every single word
00:03:08.800 | in the vocabulary.
00:03:10.000 | So vocabulary would be A, aardvark, drink, study,
00:03:21.200 | and then all the way to zucchini.
00:03:24.920 | And then the language model is going
00:03:26.680 | to put a probability over every single word here.
00:03:29.360 | So the probability of A being the next word
00:03:32.160 | is something really small.
00:03:34.680 | Aardvark is something really small.
00:03:37.600 | And then maybe drink is like, say, 0.6.
00:03:40.680 | Study is like 0.3.
00:03:43.800 | Zucchini is, again, really small.
00:03:45.520 | And then the way that you train the language model is you say,
00:03:53.840 | I want--
00:03:54.480 | let's say drink is the correct word here.
00:03:59.040 | I want this number here, 0.6, to be as close as possible to 1.
00:04:03.920 | So your loss is basically, how close
00:04:07.800 | is the probability of the actual next word?
00:04:11.680 | And you want this loss to be as low as possible.
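To make the setup above concrete, here is a minimal sketch of the next-word prediction loss in Python using numpy; the five-word vocabulary, the logits, and the target word are all made up for illustration.

```python
import numpy as np

vocab = ["a", "aardvark", "drink", "study", "zucchini"]

# Pretend the model produced these unnormalized scores (logits) for the
# next word after "Dartmouth students like to" -- invented numbers.
logits = np.array([-2.0, -4.0, 2.1, 1.4, -3.5])

# Softmax turns the logits into a probability over every word in the vocabulary.
probs = np.exp(logits) / np.exp(logits).sum()

# Cross-entropy loss: the negative log-probability of the actual next word.
# Pushing this loss down pushes the probability of "drink" toward 1.
target = vocab.index("drink")
loss = -np.log(probs[target])

print(dict(zip(vocab, probs.round(3))))
print("loss:", round(loss, 3))
```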
00:04:14.040 | OK, so the first intuition that I would encourage everyone
00:04:27.800 | to use is next word prediction is massively
00:04:45.640 | multi-task learning.
00:04:53.000 | And what I mean by this is the following.
00:04:55.080 | I'll give a few examples.
00:04:56.160 | So when you train a language model on a large enough
00:05:04.880 | database, a large enough data set,
00:05:07.960 | on this task of next word prediction,
00:05:10.560 | you have a lot of sentences that you can learn from.
00:05:12.800 | So for example, there might be some sentence,
00:05:15.800 | in my free time, I like to.
00:05:17.680 | And the language model has to learn
00:05:19.440 | that code should be higher probability than the word
00:05:21.600 | banana.
00:05:22.560 | So learn some grammar.
00:05:25.520 | It'll learn lexical semantics.
00:05:27.160 | So somewhere in your data set, there might be a sentence,
00:05:31.560 | I went to the store to buy papaya, dragon fruit,
00:05:34.080 | and durian.
00:05:35.440 | And the language model should know
00:05:36.900 | that the probability of durian should be higher than squirrel.
00:05:41.240 | The language model will learn world knowledge.
00:05:43.420 | So there will be some sentence on the internet that says,
00:05:45.760 | you know, the capital of Azerbaijan is--
00:05:48.160 | and then the language model should learn that it should be
00:05:50.620 | Baku instead of London.
00:05:53.760 | You can learn traditional NLP tasks, like sentiment analysis.
00:05:56.640 | So there'll be some sentence, you know,
00:05:58.440 | I was engaged on the edge of my seat the whole time.
00:06:00.720 | The movie was.
00:06:01.960 | And the language model looks like, OK,
00:06:03.560 | these are the prior next words.
00:06:05.480 | The next word should probably be good and not bad.
00:06:10.200 | And then finally, another example is translation.
00:06:14.240 | So here, you might see some sentence,
00:06:16.660 | the word for pretty in Spanish is.
00:06:19.240 | And then the language model should
00:06:21.120 | weigh bonita more than hola.
00:06:24.960 | Spatial reasoning.
00:06:26.120 | So you might even have some sentence like,
00:06:28.200 | Iroh went to the kitchen to make some tea.
00:06:30.360 | Standing next to Iroh, Zuko pondered his destiny.
00:06:33.560 | Zuko left the.
00:06:34.880 | And then kitchen should be higher probability than store.
00:06:39.000 | And then finally, even some math questions.
00:06:41.160 | So you might have, like, some arithmetic exam answer
00:06:44.720 | key somewhere on the internet.
00:06:46.480 | And then the language model looks at this and says,
00:06:48.600 | OK, the next word should probably be 15 and not 11.
00:06:53.360 | And you can have, like, basically millions
00:06:55.120 | of tasks like this when you have a huge data set.
00:06:58.720 | And you can think of this as basically
00:07:00.800 | extreme multitask learning.
00:07:03.840 | And these are sort of like very clean examples of tasks.
00:07:08.160 | But I'll give an example of how arbitrary
00:07:16.360 | some of these tasks can be.
00:07:17.600 | So here's a sentence from Wikipedia.
00:07:26.800 | Biden married Neilia.
00:07:34.920 | And then now, pretend you're the language model.
00:07:37.880 | And you could say, like, OK, what's the next word here?
00:07:41.560 | And the next word here is Hunter.
00:07:44.160 | So it's Biden's first wife.
00:07:46.360 | And so, like, OK, what's the language model learning
00:07:48.640 | from predicting this word?
00:07:50.160 | I guess, like, world knowledge.
00:07:52.200 | And then what's the next word after this?
00:07:58.720 | Turns out the next word is a comma.
00:08:00.760 | So here, the model is learning, like, basically
00:08:03.360 | comma prediction.
00:08:06.800 | And then what's the next word after that?
00:08:09.840 | I think it's kind of hard to know, but the answer is A.
00:08:13.320 | And I guess this is, like, maybe grammar,
00:08:15.720 | but, like, somewhat arbitrary.
00:08:17.480 | And then what's the next word after that?
00:08:23.480 | Turns out it's student.
00:08:24.680 | And this, I would say, I don't know what task this is.
00:08:31.040 | This is, like, you know, it could have been woman.
00:08:35.160 | It could have been something else.
00:08:37.160 | So this is, like, a pretty arbitrary task.
00:08:40.040 | And the point that I'm trying to make here
00:08:41.960 | is that the next word prediction task is really challenging.
00:08:45.520 | So, like, if you do this over the entire database,
00:08:47.680 | you're going to learn a lot of tasks.
00:08:51.740 | The next intuition I want to talk about
00:09:01.520 | is scaling, which is, by the way, let's say scaling compute.
00:09:15.960 | And by the way, compute is equal to how much data you have
00:09:19.720 | times the size of language model.
00:09:21.400 | Reliably improves loss.
00:09:29.320 | [WRITING ON BOARD]
00:09:32.240 | And this idea was basically pioneered
00:09:40.640 | by Kaplan et al in 2020.
00:09:47.200 | I would encourage you guys to read the paper.
00:09:49.600 | And what this basically says is you can have a plot here,
00:09:54.200 | and we'll see many plots like this,
00:09:56.320 | where the x-axis is compute and the y-axis is loss.
00:10:05.920 | And what this intuition says is you can train one language
00:10:10.120 | model, you'll have that loss.
00:10:11.760 | And obviously, you want loss to be lower.
00:10:13.960 | So you can train the next one, you'll have that loss.
00:10:16.480 | If you train the one after that, you'll have that loss.
00:10:19.120 | Then if you train the one after that, you'll have that loss.
00:10:22.680 | And you can basically predict the loss of a language model
00:10:26.300 | based on how much compute you're going to use to train it.
00:10:30.040 | And the reason why this is called a law
00:10:32.560 | is that in this paper, they showed that the x-axis here
00:10:36.640 | is actually seven orders of magnitude.
00:10:39.320 | So basically, it would be surprising if the trend broke
00:10:42.600 | if you continued.
00:10:44.800 | And the important thing about this
00:10:47.200 | is that the line does not go like that.
00:10:50.200 | Because if it went like that, then it would saturate.
00:10:55.120 | And then putting more compute or training a larger language
00:10:57.580 | model wouldn't actually lead to lower loss.
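As a rough illustration of what such a scaling curve looks like, here is a small sketch assuming a power-law relationship between compute and loss; the constants and compute values are invented and are not the actual Kaplan et al. fit.

```python
import numpy as np

# Assumed power law L(C) = a * C**(-alpha); the constants are placeholders.
a, alpha = 10.0, 0.05
compute = np.logspace(15, 22, 8)  # pretend budgets spanning 7 orders of magnitude

loss = a * compute ** (-alpha)
for c, l in zip(compute, loss):
    print(f"compute={c:.1e}  predicted loss={l:.3f}")

# On log-log axes this is a straight line, which is why you can extrapolate
# the loss of a larger run before spending the compute to train it.
```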
00:11:00.440 | So I think a question that we don't have a good answer
00:11:18.660 | to as a field, but I'll give you a hand-wavy answer,
00:11:22.980 | is why does scaling up the size of your language model
00:11:26.900 | improve the loss?
00:11:28.980 | And I'll give two basically hand-wavy answers.
00:11:31.300 | So here's a small LM, and here's a large LM.
00:11:39.340 | So one thing that's important is how good is your language model
00:11:43.100 | at memorizing facts?
00:11:44.580 | And imagine you're a small language model,
00:11:50.780 | and you see a bunch of facts on the internet.
00:11:53.660 | You have to be pretty choosy in which facts you memorize.
00:11:57.220 | Because if you don't have that many parameters,
00:11:59.140 | you're like, oh, I can only memorize a million facts.
00:12:01.740 | Is this one of the facts I want to memorize
00:12:03.580 | to get the lowest loss?
00:12:05.020 | And so you have to be very selective.
00:12:07.700 | Whereas if you're a large language model,
00:12:12.980 | you can memorize a lot of tail knowledge.
00:12:15.580 | And so basically, every fact you see,
00:12:17.180 | you don't have to say, oh, is this
00:12:18.420 | something I want to memorize?
00:12:19.700 | You can just memorize it.
00:12:22.020 | And then the other hand-wavy answer I'll give
00:12:25.460 | is small language models tend to learn first-order heuristics.
00:12:31.980 | So if you're a small language model,
00:12:35.260 | you're already struggling to get the grammar correct.
00:12:37.980 | You're not going to do your best to try to get the math
00:12:40.300 | problem exactly correct.
00:12:42.420 | Whereas if you're a large language model,
00:12:46.860 | you have a lot of parameters in your forward pass.
00:12:50.100 | And you can try to do really complicated things
00:12:52.300 | to get the next token correct and to get
00:12:54.820 | the loss as low as possible.
00:12:56.060 | So the third intuition I'll talk about
00:13:11.060 | is while overall loss improves smoothly,
00:13:26.660 | individual tasks can improve suddenly.
00:13:42.060 | And here's what I mean by this.
00:13:44.020 | So you can write your overall loss.
00:13:47.540 | By this, I mean if you take some corpus of data
00:13:53.580 | and you compute the overall loss on every word in that data set.
00:13:58.140 | The overall loss, because we know that next word prediction
00:14:01.900 | is massively multitask learning, you
00:14:04.060 | can decompose this overall loss into the loss
00:14:06.980 | of every single individual task.
00:14:08.700 | So you have, I don't know, some small number times the loss
00:14:14.660 | of, say, grammar plus some small number times
00:14:21.300 | the loss of sentiment analysis plus some small number
00:14:30.380 | times the loss of world knowledge plus--
00:14:37.180 | and then all the way to, let's say,
00:14:38.820 | you have something times the loss of math.
00:14:47.500 | And so you can basically write your overall loss
00:14:49.940 | as the weighted sum of the individual tasks
00:14:52.540 | in the data set.
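A toy version of this decomposition, with made-up task names, weights, and per-task losses, just to show the bookkeeping:

```python
# Overall loss as a weighted sum of per-task losses (all numbers invented).
task_loss   = {"grammar": 0.05, "sentiment": 0.10, "world_knowledge": 1.2, "math": 2.8}
task_weight = {"grammar": 0.4,  "sentiment": 0.1,  "world_knowledge": 0.3, "math": 0.2}

overall = sum(task_weight[t] * task_loss[t] for t in task_loss)
print(round(overall, 3))

# If grammar and sentiment are already near their floor, most of any drop in
# the overall loss has to come from the unsaturated terms, such as math.
```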
00:14:54.700 | And now the question is, let's say
00:14:56.900 | I improve my loss from 4 to 3.
00:15:01.740 | Do each of these individual tasks
00:15:03.460 | improve at the same rate?
00:15:06.580 | Well, I'd say probably not.
00:15:07.740 | So if you have a good enough language model,
00:15:11.580 | it's already doing basically perfectly
00:15:13.180 | on grammar and sentiment analysis.
00:15:15.380 | So this might not improve.
00:15:19.540 | This might be saturated.
00:15:21.100 | And maybe the loss on math is not saturated.
00:15:29.260 | It's not that good at math yet.
00:15:30.840 | So this could improve suddenly.
00:15:32.140 | And I'll redraw the diagram to show this.
00:15:47.140 | So again, compute here and loss here.
00:15:55.580 | And your overall loss is like this.
00:15:57.860 | What I'm saying is there might be
00:16:05.940 | some part of that overall loss that scales like this.
00:16:09.420 | So this is, say, grammar.
00:16:13.220 | So for example, you could say, if GPT 3.5 is there
00:16:18.100 | and GPT 4 is there, you haven't actually
00:16:20.540 | improved the grammar that much.
00:16:23.060 | On the other hand, you might have something
00:16:25.780 | like this for, say, doing math or harder tasks,
00:16:30.380 | where the difference between GPT 3.5 and GPT 4
00:16:34.300 | will be much larger.
00:16:37.420 | And it turns out you can look at a big set of tasks, which
00:16:48.920 | I did, and you can look at what's
00:16:51.100 | the shape of these scaling curves.
00:16:54.340 | So I looked at 202 tasks.
00:16:58.900 | There's this corpus called Big Bench, which has 200 tasks.
00:17:02.060 | I looked at all of them.
00:17:03.860 | And here is the distribution.
00:17:06.300 | So you have-- this was 29% of tasks that were smooth.
00:17:17.380 | So if I draw the scaling plot, compute is on the x-axis.
00:17:23.380 | And then here we have accuracy instead of loss.
00:17:25.620 | So higher is better.
00:17:27.700 | Then you have something like this.
00:17:31.140 | I believe this was 22% will be flat.
00:17:36.860 | So if you have your scaling curve, it'll just all be 0.
00:17:39.740 | The task was too hard.
00:17:42.620 | 2% will be something called inverse scaling.
00:17:46.300 | I'll talk about this in a sec.
00:17:49.140 | But what that means is the accuracy actually
00:17:52.580 | gets worse as you increase the size of the language model.
00:17:57.180 | And then I think this was 13% will be not correlated.
00:18:03.340 | So you have something like that.
00:18:09.300 | I don't know.
00:18:10.540 | And then finally, a pretty big portion, 33%,
00:18:16.380 | will be emergent abilities.
00:18:21.380 | And what I mean by that is if you plot your compute
00:18:28.140 | and accuracy, for a certain point, up to a certain point,
00:18:35.100 | your accuracy will be 0.
00:18:37.460 | And then the accuracy suddenly starts to improve.
00:18:42.020 | And so you can define an emergent ability basically
00:18:45.820 | as, for small models, the performance is 0.
00:18:52.700 | So it's not present in this model.
00:18:54.700 | And then for large models, you have much better
00:19:03.060 | than random performance.
00:19:04.940 | And the interesting thing about this
00:19:06.660 | is, let's say you had only trained the small language
00:19:09.520 | models up to that point.
00:19:10.880 | You would have predicted that it would have been impossible
00:19:13.460 | for the language model to ever perform the task.
00:19:16.580 | But actually, when you train the larger model,
00:19:18.540 | the language model does learn to perform the task.
00:19:20.900 | So in a sense, it's pretty unpredictable.
00:19:23.980 | I'll talk about one final thing, which
00:19:39.140 | is something called inverse scaling slash u-shaped scaling.
00:19:45.660 | OK, so I'll give a tricky prompt to illustrate this.
00:19:52.580 | So the tricky prompt is this.
00:19:55.620 | Repeat after me, all that glisters is not glib.
00:20:12.860 | All that glisters is not.
00:20:23.020 | And this is the prompt I give to the language model.
00:20:26.220 | And the goal is to predict this next word.
00:20:29.100 | And obviously, the correct answer
00:20:31.580 | is glib, because you asked to repeat after me.
00:20:37.220 | And what you see is, let's say you
00:20:40.460 | have a extra small language model, a small language model,
00:20:45.580 | and a large language model.
00:20:48.100 | The performance for the extra small language model
00:20:51.060 | will be, say, here's 100%.
00:20:55.580 | The small language model is actually worse at this task.
00:20:59.340 | So it's somewhere around here.
00:21:01.260 | And then the large language model, again,
00:21:03.980 | learns to do this task.
00:21:06.540 | So how do we basically explain a behavior
00:21:09.140 | like this for a prompt like this?
00:21:10.940 | [WRITING]
00:21:14.380 | And the answer is, you can decompose this prompt
00:21:26.220 | into three subtasks that are basically being done.
00:21:30.620 | So the first subtask is, can you repeat some text?
00:21:33.900 | Right?
00:21:36.340 | And if you draw the plot here, again,
00:21:39.340 | extra small, small, large, and then here is 100.
00:21:44.540 | This is a super easy task.
00:21:46.420 | And so all the language models have
00:21:51.460 | perfect performance on that.
00:21:53.580 | That's one hidden task.
00:21:55.180 | The other task is, can you fix a quote?
00:22:02.020 | So the quote is supposed to be, all that glisters is not gold.
00:22:05.860 | And so you can then plot, again, extra small, small, large.
00:22:12.380 | What's the ability to fix that quote?
00:22:14.740 | Well, the small language model doesn't know the quote.
00:22:17.300 | So it's going to get 0.
00:22:19.860 | And then the extra small doesn't know the quote,
00:22:22.020 | so it's going to get 0.
00:22:23.540 | The small will be able to do it.
00:22:26.020 | And the large can obviously do it.
00:22:28.300 | So that's what the scaling curve looks like.
00:22:31.780 | And then finally, you have the quote--
00:22:37.060 | you have the task, follow an instruction.
00:22:40.420 | And obviously, this is the instruction here.
00:22:47.060 | And you could say, OK, what's the performance
00:22:49.860 | of these models on this task?
00:22:52.220 | And the extra small model
00:22:59.580 | can't do it.
00:23:00.620 | The small model also can't do it.
00:23:02.460 | But the large model can do it.
00:23:04.740 | So you get a curve like this.
00:23:07.540 | And then why does this explain this behavior here?
00:23:11.260 | Well, the extra small model, it can repeat.
00:23:16.660 | It can't fix the quote.
00:23:18.180 | And it can't follow the instruction.
00:23:19.940 | So it actually gets it correct.
00:23:21.180 | It says glib.
00:23:23.500 | It says glib.
00:23:26.380 | The small model can repeat.
00:23:30.700 | It can fix the quote.
00:23:32.020 | But it doesn't follow the instruction.
00:23:33.740 | So it decides to fix the quote.
00:23:35.060 | And then the large model, it can do all three.
00:23:42.420 | It can follow the instruction.
00:23:43.980 | So it just repeats.
00:23:44.740 | And so that's how, if you look at the individual subtasks,
00:23:52.700 | you can explain the behavior of some of these weird scaling
00:23:55.620 | properties.
00:23:56.580 | So I will conclude with one general takeaway, which
00:24:11.260 | is applicable if you do research.
00:24:15.460 | And the takeaway is to just plot scaling curves.
00:24:26.060 | And I'll just give a really simple example.
00:24:28.380 | So let's say I do something for my research project.
00:24:32.340 | I fine-tune a model on some number of examples.
00:24:37.100 | And I get-- this is my thing.
00:24:42.540 | I get some performance there.
00:24:45.540 | And then here's the baseline.
00:24:48.460 | Of not doing whatever my research project is.
00:24:53.700 | And there's the performance.
00:24:56.660 | The reason you want to plot a scaling curve for this
00:24:59.180 | is, let's say you take half the data.
00:25:02.740 | And you find out that the performance is actually here.
00:25:06.660 | So your curve looks like this.
00:25:09.980 | What this would tell you is you didn't
00:25:11.940 | have to collect all the data to do your thing.
00:25:14.180 | And if you collect more, you probably
00:25:16.420 | won't see an improvement in performance.
00:25:19.900 | Another scenario would be if you plotted that point
00:25:23.860 | and it was there, then your curve will look like this.
00:25:28.100 | And potentially, if you kept doing
00:25:30.940 | more of whatever your research project is,
00:25:33.620 | you'd see an improvement in performance.
00:25:37.020 | And then finally, maybe your point is there.
00:25:40.740 | So your curve looks like this.
00:25:43.380 | And in this case, you would expect
00:25:45.340 | to see an even larger jump in performance
00:25:48.340 | after you continue doing your thing.
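A minimal sketch of this advice in practice: run your method at a few data fractions and plot the curve. The numbers below are placeholders; substitute your own ablation results.

```python
import matplotlib.pyplot as plt

fractions   = [0.25, 0.5, 1.0]      # fraction of the collected data used
performance = [61.0, 64.5, 65.0]    # metric at each fraction (made-up values)
baseline    = 58.0                  # performance without the method (made up)

plt.plot(fractions, performance, marker="o", label="your method")
plt.axhline(baseline, linestyle="--", label="baseline")
plt.xlabel("fraction of data")
plt.ylabel("performance")
plt.legend()
plt.show()

# A curve that is already flat at 0.5 says more data probably won't help;
# one that is still rising steeply at 1.0 says it probably will.
```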
00:25:51.460 | So yeah, I'll end my talk here.
00:25:54.660 | And I'm happy to take a few questions
00:25:57.020 | before Hyung Won talks.
00:25:57.520 | Yeah.
00:26:01.900 | Go ahead.
00:26:02.900 | [INAUDIBLE]
00:26:10.380 | Data that has an operational source,
00:26:12.300 | do you break the data [INAUDIBLE]
00:26:13.860 | based on the source?
00:26:15.620 | Repeat the question.
00:26:17.080 | Oh, yeah.
00:26:17.580 | Yeah, thanks.
00:26:18.380 | Good question.
00:26:19.620 | So the question is, during pre-training,
00:26:23.900 | how do you differentiate between good data and bad data?
00:26:27.820 | The question is-- or the answer is, you don't really.
00:26:31.380 | But you should, by only training on good data.
00:26:33.540 | So maybe you should look at your data source
00:26:37.620 | and filter out some data if it's not from a reliable data source.
00:26:40.860 | Do you want to give us maybe the intuition
00:26:46.180 | behind the intuition for one or two of the examples,
00:26:48.780 | like emergent, or why tail knowledge starts to develop?
00:26:54.460 | What's behind that?
00:26:55.980 | What do you mean, intuition behind intuition?
00:26:58.520 | Intuitively, these concepts that you're seeing in the graphs.
00:27:01.460 | And from your experience and expertise,
00:27:04.060 | what in the model itself is really
00:27:07.060 | causing that emergent behavior?
00:27:08.700 | What do you mean, in the model itself?
00:27:14.660 | Is it more depth, more nodes, more--
00:27:19.260 | Oh, I see.
00:27:19.780 | --attention, in an intuitive sense, not in a--
00:27:24.060 | Oh, yeah, OK, yeah.
00:27:24.860 | So the question is, what in the model
00:27:26.620 | makes the language model better at memorizing tail knowledge
00:27:30.300 | or at doing math problems?
00:27:35.860 | Yeah, I think it's definitely related to the size
00:27:37.940 | of the language model.
00:27:38.940 | So yeah, if you have more layers,
00:27:41.620 | you could encode probably a more complex function within that.
00:27:46.220 | And then I guess if you have more breadth,
00:27:49.780 | you could probably encode more facts about the world.
00:27:54.140 | And then if you want to repeat a fact or retrieve something,
00:27:57.340 | it'd probably be easier.
00:27:59.340 | We'll get one more in person, and then I'll
00:28:01.340 | move this more slowly.
00:28:05.100 | So when you were studying the 200-ish problems
00:28:07.980 | in the big bench, you noticed that 22% were flat.
00:28:11.580 | But there's a possibility that if you were to increase
00:28:13.820 | the compute even further, those might
00:28:15.500 | have turned out to be emergent.
00:28:16.780 | So my question to you is that when
00:28:18.300 | you were looking at the 33% that turned out to be emergent,
00:28:21.300 | did you notice anything about the loss in the flat portion
00:28:24.060 | that suggested that it would eventually become emergent?
00:28:27.460 | Oh, yeah.
00:28:29.140 | I didn't notice anything.
00:28:31.180 | Oh, sorry.
00:28:31.700 | Let me repeat the question.
00:28:32.980 | The question is, when I looked at all the emergent tasks,
00:28:36.980 | was there anything that I noticed before the emergence
00:28:40.140 | point in the loss that would have hinted that it
00:28:43.420 | would become emergent later?
00:28:45.420 | To me, it's kind of tough.
00:28:46.460 | We have a few plots of this.
00:28:48.380 | You can look at the loss, and it kind of gets better.
00:28:51.480 | And then suddenly, it spikes, and there's
00:28:53.180 | no way to predict it.
00:28:54.100 | But also, you don't have perfect data,
00:28:56.340 | because you might not have all the intermediate points
00:28:58.660 | for a given model size.
00:29:00.740 | Yeah.
00:29:01.240 | Great question.
00:29:01.900 | Yeah, we can move to [INAUDIBLE]
00:29:08.260 | We have a few online questions.
00:29:10.140 | Oh, OK.
00:29:10.660 | Yeah.
00:29:11.160 | We just have a few questions from people
00:29:12.980 | who are joining on Zoom.
00:29:15.020 | The first one is, what do you think
00:29:16.580 | are the biggest bottlenecks for current large language models?
00:29:19.380 | Is it the quality of data, the amount of compute,
00:29:21.860 | or something else?
00:29:24.940 | Yeah, great question.
00:29:25.900 | I guess if you go back to the scaling laws paradigm, what
00:29:32.860 | it says is that if you increase the size of the data
00:29:36.400 | and the size of the model, then you'd
00:29:38.500 | expect to get a lot better performance.
00:29:40.540 | And I think we'll probably try to keep increasing those things.
00:29:43.780 | Gotcha.
00:29:44.380 | And then the last one, what are your thoughts on the paper,
00:29:48.020 | if you've read it?
00:29:49.220 | Are emergent abilities of large language models a mirage?
00:29:53.260 | Oh, yeah.
00:29:53.740 | I always get this question.
00:29:55.640 | I guess I would encourage you to read the paper
00:29:58.640 | and decide for yourself.
00:29:59.860 | But I guess what the paper says is if you change the metric a
00:30:05.660 | bit, it looks different.
00:30:07.380 | But I would say, at the end of the day,
00:30:10.700 | I think the language model abilities are real.
00:30:12.660 | And if you think--
00:30:14.300 | I guess I don't think that's a mirage.
00:30:16.340 | So, yeah.
00:30:19.580 | Yeah.
00:30:25.340 | All right, so thanks, Jason, for the very insightful talk.
00:30:28.580 | And now we'll have Hyung Won give a talk.
00:30:31.500 | So he's currently a research scientist
00:30:34.420 | on the OpenAI ChatGPT team.
00:30:36.700 | He has worked on various aspects of large language models,
00:30:40.780 | things like pre-training, instruction fine-tuning,
00:30:43.380 | reinforcement learning with human feedback, reasoning,
00:30:47.060 | and so forth.
00:30:48.060 | And some of his notable works include the scaling Flan
00:30:50.820 | papers, such as Flan-T5, as well as Flan-PaLM, and T5X,
00:30:55.220 | the training framework used to train the PaLM language model.
00:30:58.660 | And before OpenAI, he was at Google Brain,
00:31:01.220 | and he received his PhD from MIT.
00:31:04.700 | So give a hand for Hyung Won.
00:31:06.220 | All right, my name is Hyung Won, and really happy
00:31:21.180 | to be here today.
00:31:22.300 | And this week, I was thinking about--
00:31:24.940 | by the way, is my mic working fine?
00:31:27.660 | Yeah, yeah.
00:31:28.580 | So this week, I thought about, OK,
00:31:30.500 | I'm giving a lecture on transformers at Stanford.
00:31:34.020 | What should I talk about?
00:31:35.380 | And I thought, OK, some of you in this room and in Zoom
00:31:39.860 | will actually go shape the future of AI.
00:31:42.120 | So maybe I should talk about that.
00:31:43.540 | It's a really important goal and ambitious,
00:31:45.460 | and we really have to get it right.
00:31:47.220 | So that could be a good topic to think about.
00:31:50.140 | And when we talk about something into the future,
00:31:53.620 | the best place to get an advice is to look into the history.
00:31:57.860 | And in particular, look at the early history of transformer
00:32:01.960 | and try to learn many lessons from there.
00:32:04.660 | And the goal will be to develop a unified perspective in which
00:32:10.380 | we can look into many seemingly disjoint events.
00:32:13.860 | And from that, we can probably hope
00:32:17.340 | to project into the future what might be coming.
00:32:20.260 | And so that will be the goal of this lecture.
00:32:24.420 | And we'll look at some of the architectures
00:32:26.260 | of the transformers.
00:32:27.460 | So let's get started.
00:32:30.020 | Everyone I see is saying AI is advancing so fast that
00:32:34.860 | it's so hard to keep up.
00:32:36.420 | And it doesn't matter if you have years of experience.
00:32:39.340 | There's so many things that are coming out every week
00:32:41.780 | that it's just hard to keep up.
00:32:43.420 | And I do see many people spend a lot of time and energy
00:32:46.940 | catching up with the latest developments, the cutting
00:32:50.340 | edge and the newest thing.
00:32:52.580 | And then not enough attention goes into old things
00:32:54.980 | because they become deprecated and no longer relevant.
00:33:00.740 | But I think it's important, actually, to look into that.
00:33:03.220 | Because we really need to--
00:33:05.460 | when things are moving so fast beyond our ability
00:33:08.060 | to catch up, what we need to do is study the change itself.
00:33:11.460 | And that means we can look back at the previous things
00:33:14.660 | and then look at the current thing
00:33:16.660 | and try to map how we got here and from which we can look
00:33:20.020 | into where we are heading towards.
00:33:23.060 | So what does it mean to study the change itself?
00:33:28.140 | First, we need to identify the dominant driving
00:33:31.740 | forces behind the change.
00:33:33.540 | So here, dominant is an important word
00:33:35.900 | because typically, a change has many, many driving forces.
00:33:39.700 | And we only care about the dominant one
00:33:41.420 | because we're not trying to get really accurate.
00:33:43.460 | We just want to have the sense of directionality.
00:33:46.220 | Second, we need to understand the driving force really well.
00:33:49.220 | And then after that, we can predict the future trajectory
00:33:52.100 | by rolling out the driving force and so on.
00:33:55.620 | And you heard it right.
00:33:56.620 | I mentioned about predicting the future.
00:33:58.620 | This is a computer science class, not
00:34:00.140 | like an astrology or something.
00:34:01.780 | But we do--
00:34:02.820 | I think it's actually not that impossible to predict
00:34:06.660 | some future trajectory of a very narrow scientific domain.
00:34:10.380 | And that endeavor is really useful to do
00:34:14.380 | because let's say you do all these
00:34:17.540 | and then make your prediction accuracy from 1% to 10%.
00:34:21.620 | And then you'll make, say, 100 predictions.
00:34:23.860 | 10 of them will be correct.
00:34:25.500 | Say, one of them will be really, really correct,
00:34:27.860 | meaning it will have an outsized impact that
00:34:30.100 | outweighs everything.
00:34:31.580 | And I think that is kind of how many--
00:34:35.060 | I've seen a very general thing in life
00:34:37.620 | that you really have to be right a few times.
00:34:40.340 | So if we think about why predicting the future
00:34:47.260 | is difficult, or maybe even think about the extreme case
00:34:50.180 | where we can all do the prediction
00:34:53.100 | with perfect accuracy, almost perfect accuracy.
00:34:55.460 | So here, I'm going to do a very simple experiment
00:34:58.500 | of dropping this pen and follow this same three-step process.
00:35:04.100 | So we're going to identify the dominant driving force.
00:35:07.300 | First of all, what are the driving forces acting
00:35:09.340 | on this pen, gravity downwards?
00:35:11.500 | And is that all?
00:35:12.300 | We also have, say, air friction if I drop it.
00:35:17.380 | And that will cause what's called a drag force acting
00:35:20.380 | upwards.
00:35:21.380 | And actually, depending on how I drop this, the orientation,
00:35:25.620 | the aerodynamic interaction will be so complicated
00:35:28.860 | that we don't currently have any analytical way of modeling
00:35:32.100 | that.
00:35:32.700 | We can do it with the CFD, the computational fluid dynamics,
00:35:35.580 | but it will be nontrivial.
00:35:36.820 | So we can neglect that.
00:35:38.780 | This is heavy enough that gravity is probably
00:35:40.700 | the only dominant force.
00:35:41.740 | We simplify the problem.
00:35:44.100 | Second, do we understand this dominant driving force, which
00:35:46.900 | is gravity?
00:35:47.780 | And we do because we have this Newtonian mechanics, which
00:35:50.660 | provides a reasonably good model.
00:35:52.500 | And then with that, we can predict the future trajectory
00:35:55.060 | of this pen.
00:35:56.420 | And if you remember from this dynamics class,
00:35:59.740 | if we have this initial velocity is 0,
00:36:02.420 | I'm not going to put any velocity.
00:36:04.140 | And then let's say position is 0 here.
00:36:06.180 | And then 1/2 gt square will give a precise trajectory
00:36:11.380 | of this pen as I drop this.
00:36:13.500 | So if there is a single driving force that we really
00:36:17.140 | understand, it's actually possible to predict
00:36:19.980 | what's going to happen.
00:36:21.500 | So then why do we really fear about predicting the future
00:36:26.380 | in the most general sense?
00:36:27.900 | And I argue that among many reasons,
00:36:30.660 | the number of driving force, the sheer number
00:36:33.140 | of dominant driving forces acting on the general prediction
00:36:37.100 | is so complicated.
00:36:38.540 | And their interaction creates a complexity
00:36:41.140 | that we cannot predict the most general sense.
00:36:43.580 | So here's my cartoon way of thinking
00:36:45.220 | about the prediction of future.
00:36:47.420 | X-axis, we have a number of dominant driving forces.
00:36:50.020 | Y-axis, we have a prediction difficulty.
00:36:52.180 | So on the left-hand side, we have a dropping a pen.
00:36:54.660 | It's a very simple case.
00:36:56.260 | It's a difficulty.
00:36:57.020 | It's very small.
00:36:57.780 | You just need to learn physics.
00:37:00.540 | And then as you add more stuff, it just becomes impossible.
00:37:05.300 | So how does this fit into the AI research?
00:37:08.180 | And you might think, OK, I see all the time things
00:37:12.340 | are coming in.
00:37:13.060 | We are bombarded by new things.
00:37:15.220 | And some people will come up with a new agent, new modality,
00:37:18.460 | new MML use score, whatever.
00:37:20.100 | new MMLU score, whatever.
00:37:22.900 | I'm not even able to catch up with the latest thing.
00:37:25.700 | How can I even hope to predict the future of the AI research?
00:37:29.300 | But I argue that it's actually simpler,
00:37:31.780 | because there is a dominant driving force that
00:37:35.180 | is governing a lot, if not all, of the AI research.
00:37:39.420 | And because of that, I would like
00:37:41.460 | to point out that it's actually closer to the left
00:37:45.180 | than to the right than we actually may perceive.
00:37:49.140 | So what is that driving force?
00:37:52.220 | Oh, maybe before that, I would like
00:37:53.940 | to caveat that when I do this kind of talk,
00:37:58.420 | I would like to not focus too much
00:38:00.540 | on the technical stuff, which you can probably
00:38:02.860 | do better in your own time.
00:38:05.020 | But rather, I want to share how I think.
00:38:07.780 | And for that, I want to share how my opinion is.
00:38:11.860 | And so it will be very strongly opinionated.
00:38:14.580 | And by no means, I'm saying this is correct or not.
00:38:17.780 | Just wanted to share my perspective.
00:38:19.820 | So coming back to this driving force for AI,
00:38:22.180 | what is that dominant driving force?
00:38:24.540 | And here is a plot from Rich Sutton.
00:38:27.260 | And on the y-axis, we have the calculations, the FLOPs:
00:38:31.620 | if you pay $100, how much computing power do you get?
00:38:36.020 | And it's in log scale.
00:38:37.500 | And then x-axis, we have a time of more than 100 years.
00:38:42.260 | So this is actually more than exponential.
00:38:45.100 | And I don't know any trend that is as strong and as
00:38:49.740 | long-lasting as this one.
00:38:51.300 | So whenever I see this kind of thing, I should say, OK,
00:38:55.860 | I should not compete with this.
00:38:57.540 | And better, I should try to leverage as much as possible.
00:39:01.700 | And so what this means is you get 10x more compute
00:39:07.180 | every five years if you spend the same amount of dollar.
00:39:10.460 | And so in other words, you get the cost of compute
00:39:14.660 | is going down exponentially.
00:39:16.260 | And this and associated scaling is really
00:39:19.820 | dominating the AI research.
00:39:22.220 | And that is somewhat hard to take.
00:39:24.180 | But that is, I think, really important to think about.
00:39:27.620 | So coming back to this AI research,
00:39:29.580 | how is this exponentially cheaper
00:39:32.380 | compute drive the AI research?
00:39:35.180 | Let's think about the job of the AI researchers.
00:39:37.580 | It is to teach machines how to think in a very general sense.
00:39:41.260 | And one somewhat unfortunately common approach
00:39:45.220 | is we think about how we teach machine how we think we think.
00:39:50.900 | So meaning we model how we think and then
00:39:55.420 | try to incorporate that into some kind of mathematical model
00:39:58.100 | and teach that.
00:39:59.380 | And now the question is, do we understand
00:40:01.580 | how we think at the very low level?
00:40:03.860 | I don't think we do.
00:40:05.220 | I have no idea what's going on.
00:40:06.940 | So it's fundamentally flawed in the sense
00:40:08.860 | that we try to model something that we have no idea about.
00:40:11.860 | And what happens if we go with this kind of approach
00:40:14.460 | is that it poses a structure that
00:40:16.620 | serves as a shortcut in the short term.
00:40:19.020 | And so you can maybe get a paper or something.
00:40:21.420 | But then it becomes a bottleneck,
00:40:23.780 | because we don't know how this will limit further scaling up.
00:40:29.020 | More fundamentally, what this is doing
00:40:30.900 | is we are limiting the degree of freedom
00:40:33.740 | we are giving to the machines.
00:40:35.620 | And that will backfire at some point.
00:40:37.600 | And this has been going on for decades.
00:40:42.180 | And bitter lesson is, I think, the single most important piece
00:40:46.940 | of writing in AI.
00:40:48.660 | And it says-- this is my wording, by the way--
00:40:51.660 | past 70 years of entire AI research
00:40:54.300 | can be summarized into developing progressively more
00:40:57.980 | general method with weaker modeling assumptions
00:41:01.020 | or inductive biases, and add more data and compute--
00:41:03.660 | in other words, scale up.
00:41:04.860 | And that has been the recipe of entire AI research,
00:41:08.820 | not fancy things.
00:41:10.380 | And if you think about this, the models of the 2000s
00:41:14.460 | are a lot more difficult than what we use now.
00:41:17.900 | And so it's much easier to get into AI nowadays
00:41:21.100 | from a technical perspective.
00:41:22.900 | So this is, I think, really the key information.
00:41:27.820 | We have this compute cost that's going down exponentially.
00:41:31.000 | And it's getting cheaper faster than we're
00:41:33.340 | becoming better researchers.
00:41:34.780 | So don't compete with that.
00:41:36.260 | And just try to leverage that as much as possible.
00:41:38.740 | And that is the driving force that I wanted to identify.
00:41:43.500 | And I'm not saying this is the only driving force.
00:41:46.260 | But this is the dominant driving force.
00:41:47.920 | So we can probably neglect the other ones.
00:41:50.100 | So here's a graphical version of that.
00:41:52.020 | X-axis, we have a compute.
00:41:53.380 | Y-axis, we have a performance of some kind.
00:41:55.460 | Let's think about some general intelligence.
00:41:57.780 | And let's look at two different methods, one
00:42:00.800 | with more structure, more modeling assumptions, fancier
00:42:03.580 | math, whatever.
00:42:04.660 | And then the other one is a less structure.
00:42:06.420 | What you see is typically you start with a better performance
00:42:10.740 | when you have a low compute regime.
00:42:13.140 | But it plateaus because of some kind of structure backfiring.
00:42:16.120 | And then with the less structure,
00:42:17.460 | because we give a lot more freedom to the model,
00:42:19.780 | it doesn't work in the beginning.
00:42:21.380 | But then as we add more compute, it starts working.
00:42:23.940 | And then it gets better.
00:42:25.580 | We call this more scalable methods.
00:42:28.460 | So does that mean we should just go
00:42:30.820 | with the least structure, most freedom
00:42:33.540 | to the model possible way from the get-go?
00:42:36.420 | And the answer is obviously no.
00:42:38.180 | Let's think about even less structure case.
00:42:40.220 | This red one here, it will pick up a lot later
00:42:44.620 | and requires a lot more compute.
00:42:46.820 | So it really depends on where we are.
00:42:49.500 | We cannot indefinitely wait for the most general case.
00:42:53.140 | And so let's think about the case
00:42:54.940 | where our compute situation is at this dotted line.
00:42:57.940 | If we're here, we should choose this less structure
00:43:01.060 | one as opposed to this even less structure one,
00:43:03.940 | because the other one doesn't really work.
00:43:05.660 | And the other one works.
00:43:07.180 | But crucially, we need to remember
00:43:08.820 | that we are adding some structure
00:43:10.900 | because we don't have compute.
00:43:12.060 | So we need to remove that later.
00:43:14.260 | And so the difference between these two method
00:43:16.740 | is that additional inductive biases
00:43:18.940 | or structure we impose, someone impose,
00:43:21.660 | that typically don't get removed.
00:43:24.260 | So adding this, what that means is that
00:43:27.460 | at the given level of compute, data,
00:43:30.340 | algorithmic development, and architecture that we have,
00:43:33.420 | there is like an optimal inductive bias
00:43:35.860 | or structure that we can add to the problem
00:43:37.900 | to make the progress.
00:43:39.580 | And that has been really how we have made so much progress.
00:43:43.100 | But these are like shortcuts
00:43:44.340 | that hinder further scaling later on.
00:43:46.380 | So we have to remove them later on
00:43:48.020 | when we have more compute, better algorithm, or whatever.
00:43:51.780 | And as a community, we do adding structure very well.
00:43:55.740 | And 'cause there's an incentive structure
00:43:58.140 | with like papers, you add a nice one, then you get a paper.
00:44:01.620 | But removing that doesn't really get you much.
00:44:04.340 | So that, we don't really do that.
00:44:06.260 | And I think we should do a lot more of those.
00:44:08.700 | So maybe another implication of this bitter lesson
00:44:11.820 | is that because of this, what is better in the long term
00:44:15.980 | almost necessarily looks worse now.
00:44:19.340 | And this is quite unique to AI research
00:44:21.860 | because the AI research of current paradigm
00:44:25.540 | is learning-based method,
00:44:27.140 | meaning that we are giving models freedom,
00:44:30.620 | the machines choose how they learn.
00:44:32.860 | So because we need to give more freedom,
00:44:35.220 | it's more chaotic at the beginning, so it doesn't work.
00:44:40.140 | But then when it started working,
00:44:41.980 | we can put in more compute and then it can be better.
00:44:44.940 | So it's really important to have this in mind.
00:44:48.020 | So to summarize, we have identified
00:44:51.060 | this dominant driving force behind the AI research.
00:44:54.180 | And that is exponentially cheaper compute
00:44:56.740 | and associated scaling up.
00:44:58.900 | Now that we have identified,
00:45:01.300 | if you remember back from my initial slides,
00:45:04.420 | the next step is to understand this driving force better.
00:45:07.820 | And so that's where we're gonna spend
00:45:10.380 | most of the time doing that.
00:45:11.980 | And for that, we need to go back to some history
00:45:15.540 | of transformer, 'cause this is a transformers class,
00:45:18.060 | analyze key structures and decisions
00:45:21.180 | that were made by the researchers at the time
00:45:23.820 | and why they did that,
00:45:25.220 | whether that was an optimal structure
00:45:27.100 | that could have been added at the time
00:45:29.060 | and why they might be irrelevant now.
00:45:32.020 | And should we remove that?
00:45:33.580 | And we'll go through some of the practice of this.
00:45:35.700 | And hopefully this will give you some flavor
00:45:38.140 | of what scaling research looks like.
00:45:42.020 | So now we'll go into a little bit of the technical stuff.
00:45:45.700 | Transformer architecture, there are some variants.
00:45:48.620 | I'll talk about three of them.
00:45:50.740 | First is the encoder decoder,
00:45:52.180 | which is the original transformer,
00:45:53.940 | which has a little bit more structure.
00:45:55.460 | Second one is the encoder only,
00:45:57.020 | which is popularized by BERT.
00:45:59.940 | And then third one is decoder only,
00:46:02.540 | which you can think of as a current like GPT-3
00:46:05.740 | or other language models.
00:46:07.060 | This has a lot less structure than the encoder decoder.
00:46:09.940 | So these are the three types we'll go into detail.
00:46:12.820 | Second, the encoder only is actually not that useful
00:46:16.380 | in the most general sense.
00:46:17.620 | It still has some place,
00:46:19.060 | but we will just briefly go over that
00:46:21.820 | and then spend most of the time comparing one and three.
00:46:25.060 | So one has more structure,
00:46:26.860 | what's the implication of that and so on.
00:46:29.700 | So first of all, let's think about what a transformer is.
00:46:32.620 | Just at a very high level or first principles,
00:46:36.780 | what is a transformer?
00:46:37.620 | It's a sequence model.
00:46:38.820 | And sequence model has an input of a sequence.
00:46:42.380 | So sequence of elements can be words or images or whatever.
00:46:47.380 | It's a very general concept.
00:46:49.180 | In this particular example, I'll show you with the words.
00:46:51.820 | Sentence is a sequence of words.
00:46:53.700 | And then the first step is to tokenize it
00:46:55.980 | 'cause we have to represent this word in computers,
00:47:00.540 | which requires just some kind of a encoding scheme.
00:47:04.380 | So we just do it with a fixed number of integers
00:47:07.580 | that we have now sequence of integers.
00:47:09.940 | And then the dominant paradigm nowadays
00:47:12.820 | is to represent each sequence element as a vector,
00:47:15.780 | dense vector, because we know how to multiply them well.
00:47:18.740 | And then so we have a sequence of vectors.
00:47:21.420 | And finally, this sequence model will do the following.
00:47:25.620 | We just want to model the interaction
00:47:28.380 | between sequence elements.
00:47:30.060 | And we do that by let them take the dot product
00:47:33.300 | of each other.
00:47:34.180 | And if the dot product is high,
00:47:35.660 | we can say semantically they are more related
00:47:37.620 | than the dot products that is low.
00:47:39.820 | And that's kind of the sequence model.
00:47:41.740 | And the transformer is a particular type of sequence model
00:47:45.140 | that uses what's called attention
00:47:47.460 | to model this interaction.
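Here is a minimal numpy sketch of that idea: represent each token as a vector, take pairwise dot products, and mix the vectors according to the resulting weights. Shapes and values are arbitrary; this is not any particular library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: (sequence_length, d) arrays of per-token vectors
    scores = q @ k.T / np.sqrt(q.shape[-1])  # pairwise dot products: who relates to whom
    weights = softmax(scores, axis=-1)       # higher dot product -> more attention
    return weights @ v                       # each output is a weighted mix of the values

x = np.random.randn(4, 8)  # a "sentence" of 4 tokens, each an 8-dim vector
out = attention(x, x, x)   # self-attention: the sequence attends to itself
print(out.shape)           # (4, 8)
```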
00:47:50.620 | So let's get into the details of this encoder decoder,
00:47:54.260 | which was the original transformer.
00:47:55.780 | It's quite many, many pieces.
00:47:57.180 | So let's go into a little bit, a piece at a time.
00:48:00.340 | So starting with the encoder.
00:48:01.940 | So here, I'm going to show you an example
00:48:03.860 | of machine translation, which used to be very cool thing.
00:48:08.100 | And so you have an English sentence that is good,
00:48:11.500 | and then we're gonna translate into German.
00:48:13.740 | So first thing is to encode this into a dense vector.
00:48:17.660 | So here I'm representing it with this vector
00:48:21.340 | of size three or something.
00:48:22.860 | And then we have to let them take the dot product.
00:48:25.180 | So this lines represent which element can talk
00:48:29.820 | to which element, other elements.
00:48:32.220 | And here, because it's an input,
00:48:33.940 | we take what is called the bidirectional attention.
00:48:36.300 | So any token can talk to any other token.
00:48:38.460 | And then we have this MLP or feed forward layer,
00:48:41.580 | which is per token.
00:48:42.780 | It doesn't have any interaction.
00:48:44.220 | You just do some multiplication just because we can do it.
00:48:49.380 | And then that's one layer, and we repeat that n times.
00:48:53.220 | And that's just the transformer encoder.
00:48:55.540 | And at the end, what you get is the sequence of vectors,
00:48:59.860 | each representing the sequence element,
00:49:02.660 | in this case, a word.
00:49:04.380 | So that's the output of this encoder.
00:49:06.380 | Now let's look at the decoder,
00:49:07.900 | which is similarly shaped stack of layers.
00:49:11.260 | So here we put in as an input what the answer should be.
00:49:16.900 | So here, BOS is the beginning of sequence,
00:49:19.500 | and then das ist gut, I don't know how to pronounce it,
00:49:21.420 | but that's the German translation of that is good.
00:49:23.860 | And so we kind of go through the similar process.
00:49:26.740 | Here we have a causal self-attention,
00:49:28.820 | meaning that the tokens of time step T
00:49:31.780 | can only attend to T and before,
00:49:34.220 | because when we start generating it,
00:49:36.100 | we don't have the future tokens.
00:49:37.980 | So we cannot, when we train it, we should limit that.
00:49:41.060 | And that way, this is done by like masking,
00:49:44.100 | but it's just different from the encoder.
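A small sketch of the causal mask he is describing, with a toy sequence length; in practice the disallowed scores are set to minus infinity before the softmax so future tokens get zero attention weight.

```python
import numpy as np

T = 5
causal_mask = np.tril(np.ones((T, T), dtype=bool))  # token t may attend to positions <= t
print(causal_mask.astype(int))

scores = np.random.randn(T, T)                      # stand-in attention scores
scores = np.where(causal_mask, scores, -np.inf)     # mask out the future before softmax
```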
00:49:47.420 | So after this, you can get after, again, N layers,
00:49:52.300 | you get this sequence output,
00:49:54.900 | and you have this, the output is sequence.
00:49:57.540 | So sequence-to-sequence mapping,
00:49:58.880 | this is a general encoder-decoder architecture.
00:50:01.620 | And when you get this end of sequence,
00:50:03.620 | you stop generating it.
00:50:04.820 | So this is the overall picture.
00:50:07.020 | Now I'll point out some important attention patterns.
00:50:10.060 | So we are translating into German
00:50:14.100 | what is input to the encoder.
00:50:15.780 | So there has to be some connection
00:50:17.140 | between the decoder and the encoder.
00:50:19.380 | That is done by this cross-attention mechanism
00:50:21.520 | shown in this red,
00:50:22.660 | which is just that each vector's representation
00:50:25.820 | on each sequence in the output decoder
00:50:28.220 | should attend to some of them in the encoder.
00:50:30.420 | And that is done.
00:50:31.340 | In particular, the design feature,
00:50:33.700 | which is interesting is that all the layers in the decoder
00:50:37.520 | attend to the final layer output of the encoder.
00:50:40.600 | I will come back to the implication of this design.
00:50:44.100 | So, yep, that's that.
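A hedged, runnable sketch of that wiring, with stand-in layer functions rather than real attention blocks, just to show that every decoder layer reads the same final encoder output.

```python
import numpy as np

def enc_layer(h):          # placeholder for bidirectional self-attention + MLP
    return h + 1.0

def dec_layer(y, memory):  # placeholder for causal self-attention + cross-attention + MLP
    return y + memory.mean()

src, tgt = np.zeros((3, 4)), np.zeros((2, 4))

memory = src
for _ in range(6):               # N encoder layers
    memory = enc_layer(memory)   # only the final layer's output is kept as "memory"

y = tgt
for _ in range(6):               # N decoder layers
    y = dec_layer(y, memory)     # every decoder layer attends to that same final memory
print(y.shape)                   # (2, 4): one vector per target position
```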
00:50:46.580 | And now, move on to the second type of architecture,
00:50:49.740 | which is encoder-only.
00:50:50.680 | We'll spend a little bit of time here.
00:50:52.020 | So again, we have the same input,
00:50:56.460 | and we go through a similar structure.
00:50:59.340 | And then, in this case,
00:51:01.060 | the final output is a single vector.
00:51:03.100 | Regardless of the length of the sequence,
00:51:05.380 | we just get a single vector.
00:51:06.740 | And that is, that represent the input sequence.
00:51:10.820 | That's a dense vector representation.
00:51:13.100 | And then, let's say we do some kind of a sentiment analysis.
00:51:16.180 | We run through a task-specific linear layer
00:51:18.560 | to map it to classification labels,
00:51:20.700 | positive or negative probabilities here.
00:51:23.140 | And that's required for all these task-specific cases.
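A minimal sketch of that encoder-only setup: pool the encoder outputs into one vector and apply a task-specific linear head. Dimensions and values are made up.

```python
import numpy as np

encoder_outputs = np.random.randn(7, 16)  # 7 tokens, 16-dim vectors from the encoder
pooled = encoder_outputs[0]               # e.g. take the first ([CLS]-style) token

W = np.random.randn(16, 2)                # task-specific head: 2 labels (positive / negative)
logits = pooled @ W
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)                              # sentiment probabilities; no generation involved
```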
00:51:28.140 | And this is kind of popularized by BERT.
00:51:30.260 | And what this means is that here,
00:51:33.500 | at the time, 2018, when BERT came out,
00:51:36.300 | we had the benchmark called GLUE,
00:51:38.540 | which was a language understanding task.
00:51:40.260 | You have a sequence in,
00:51:41.460 | classification labels out for most cases.
00:51:43.560 | This was how the field really advanced at the time.
00:51:46.700 | So when we care about such tasks,
00:51:49.060 | then there's an incentive to think about
00:51:51.140 | simplifying the problem,
00:51:52.260 | adding the structure to the problem
00:51:53.640 | so that we can make a progress.
00:51:54.820 | So this, the additional structure
00:51:56.540 | that was put into this particular architecture
00:51:58.820 | is that we're gonna give up on the generation.
00:52:02.540 | If we do that, it becomes a lot simpler problem.
00:52:05.540 | Instead of sequence to sequence,
00:52:07.140 | we're talking about sequence to classification labels,
00:52:09.740 | and that's just so much easier.
00:52:11.500 | And so at some point, 2018, 2019,
00:52:15.460 | a lot of the papers, or just research,
00:52:17.540 | was like, we sometimes call it BERT engineers.
00:52:19.980 | It's a little bit change of something,
00:52:21.480 | get like 0.5% better on GLUE,
00:52:24.400 | and you get a paper and things like that.
00:52:25.860 | It was like very chaotic era.
00:52:27.660 | And, but if we look at from this perspective,
00:52:31.720 | we are putting in the structure
00:52:34.280 | of not generating the sequence,
00:52:35.740 | and that gives a lot of performance win,
00:52:38.780 | but in the long term, it's not really useful.
00:52:41.060 | So we're not gonna look at this encoder
00:52:42.980 | only architecture going forward.
00:52:45.020 | Third architecture, decoder only.
00:52:46.940 | This one is my favorite, personally,
00:52:49.580 | and it looks kind of daunting,
00:52:52.180 | but because of this attention pattern,
00:52:54.280 | but it actually is very simple.
00:52:56.340 | So here, we only have a single stack,
00:52:59.380 | and it can actually generate stuff.
00:53:01.620 | And so there's a misconception that some people think
00:53:05.260 | this decoder-only architecture is only used
00:53:07.000 | for language modeling, next-token prediction,
00:53:09.000 | so it cannot be used for supervised learning.
00:53:10.900 | But here, we can actually do it.
00:53:12.220 | The trick is to take the input
00:53:15.300 | and concatenate it with the target.
00:53:17.020 | And if you do that, then it just becomes simple,
00:53:20.060 | just sequence in, sequence out.
00:53:21.620 | So what we do is the self-attention mechanism here
00:53:24.840 | is actually handling both the cross-attention
00:53:27.980 | between the target and the input,
00:53:29.380 | and the sequence learning within each of them.
00:53:32.960 | So that's the causal attention.
00:53:34.660 | And then, as I mentioned, the output is a sequence.
00:53:38.780 | And then the key design features are self-attention
00:53:41.500 | is serving both roles, and we are, in some sense,
00:53:45.380 | sharing the parameters between input and target.
00:53:47.580 | So the same set of parameters is applied
00:53:49.420 | to both the input and the target sequences.
00:53:51.860 | So this is the decoder only.
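To make the decoder-only trick concrete, here is a minimal sketch: concatenate the input with the target into one sequence, run causal self-attention over the whole thing, and compute the next-token loss only on the positions that predict target tokens. The toy sizes and random token ids are illustrative assumptions.

```python
# Minimal sketch: supervised learning with a single decoder-only stack.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab = 64, 1000
embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)    # one stack, shared parameters
lm_head = nn.Linear(d_model, vocab)

inp = torch.randint(0, vocab, (1, 6))       # "input" tokens
tgt = torch.randint(0, vocab, (1, 4))       # "target" tokens
seq = torch.cat([inp, tgt], dim=1)          # concatenate: sequence in, sequence out

T = seq.size(1)
causal = nn.Transformer.generate_square_subsequent_mask(T)  # unidirectional mask
h = decoder(embed(seq), mask=causal)
logits = lm_head(h)

# Next-token prediction, but only grade the positions whose next token is a target token.
pred = logits[:, inp.size(1) - 1:-1]
loss = F.cross_entropy(pred.reshape(-1, vocab), tgt.reshape(-1))
```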
00:53:53.540 | Now, we will go into the comparison.
00:53:56.140 | So they look very different,
00:54:00.140 | at least in the schematics.
00:54:01.460 | So how different are they, actually?
00:54:03.440 | And I argue that they're actually quite similar.
00:54:07.820 | And so to illustrate that, we're gonna transform
00:54:10.620 | this encoder-decoder,
00:54:12.240 | which has more structure built in,
00:54:14.100 | into the decoder-only architecture,
00:54:17.220 | see what some of the differences are,
00:54:19.740 | and then interpret those differences,
00:54:21.640 | those additional structures: are they still relevant
00:54:24.340 | now that we have more compute, better algorithms, and so on?
00:54:27.240 | So let's have this table.
00:54:30.060 | Four differences, we'll see each of them.
00:54:32.500 | And then, as we go through, we'll populate this table.
00:54:36.300 | So let's first look at this additional cross-attention.
00:54:39.600 | What that means is that this, on the left,
00:54:42.060 | is an encoder-decoder, which has this additional red block,
00:54:44.760 | the cross-attention, compared to the simpler one
00:54:46.880 | that doesn't have that.
00:54:47.900 | So we wanna make the left closer to the right.
00:54:51.700 | So that means we need to either get rid of it, or something.
00:54:55.220 | And an attention mechanism
00:54:56.740 | has four projection matrices.
00:54:59.580 | And so self-attention and cross-attention
00:55:01.820 | actually have the same number of parameters, same shape.
00:55:03.960 | So we can just share them.
00:55:05.100 | So that's the first step, share both of these.
00:55:07.260 | And then it becomes mostly the same mechanism.
00:55:09.980 | And then, so that's the first difference,
00:55:12.260 | separate cross-attention,
00:55:13.400 | or self-attention serving both roles.
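Here is a minimal sketch of that first step, assuming single-head attention for simplicity: the same four projection matrices (query, key, value, output) can serve both self-attention and cross-attention, because only the source of the keys and values changes. The dimensions are illustrative assumptions.

```python
# Minimal sketch: one set of attention projections serving both roles.
import torch
import torch.nn as nn

d_model = 64
W_q, W_k, W_v, W_o = (nn.Linear(d_model, d_model) for _ in range(4))

def attention(queries_from, keys_values_from):
    q = W_q(queries_from)
    k = W_k(keys_values_from)
    v = W_v(keys_values_from)
    scores = q @ k.transpose(-2, -1) / d_model ** 0.5
    return W_o(scores.softmax(-1) @ v)

x_enc = torch.randn(1, 6, d_model)    # encoder-side (input) sequence
x_dec = torch.randn(1, 4, d_model)    # decoder-side (target) sequence

self_out = attention(x_dec, x_dec)    # self-attention within the target
cross_out = attention(x_dec, x_enc)   # cross-attention to the input, same four matrices
```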
00:55:15.700 | Second difference is the parameter sharing.
00:55:17.920 | So what that means is that,
00:55:20.180 | between the input and the target,
00:55:22.300 | encoder-decoder architecture uses the separate parameters.
00:55:25.120 | And decoder only has a single stack,
00:55:27.800 | so it uses the shared parameter.
00:55:29.560 | So if we wanna make the left close to the right,
00:55:32.020 | we wanna share the encoder and decoder parameters.
00:55:34.500 | So let's do that, just color this.
00:55:36.180 | So now they share the parameters.
00:55:38.400 | Third difference is the target-to-input attention pattern.
00:55:41.500 | So we need to connect the target to the input,
00:55:44.060 | and how is that done?
00:55:46.140 | In the encoder-decoder case, we had this cross-attention,
00:55:49.180 | and then in the decoder only,
00:55:51.100 | it's the self-attention doing everything.
00:55:54.460 | The difference is that we have
00:55:57.320 | every layer of the decoder
00:56:00.580 | attending to the final layer output of the encoder.
00:56:03.580 | Whereas if you think about the decoder-only case,
00:56:05.480 | it's actually per layer, within the same layer.
00:56:07.660 | When we are decoding a given target word,
00:56:11.420 | we are looking at the same-layer representation
00:56:14.320 | of the input, and that's within layer,
00:56:17.260 | and I think this is the key design feature.
00:56:19.060 | So if you wanna make this close to that,
00:56:20.940 | we have to move this attention down to each layer.
00:56:24.420 | So now decoder layer one will be attending to encoder layer one.
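Here is a minimal sketch contrasting the two connection patterns; the toy "layer" below is just an illustrative stand-in for a transformer block, not a real one. The point is only which representations get connected: in the encoder-decoder, every decoder layer reads the final encoder layer, while in the single decoder-only stack, layer i of the target reads layer i of the input.

```python
# Minimal sketch: final-layer attention vs. within-layer attention.
import torch
import torch.nn as nn

d = 16
mix = nn.Linear(2 * d, d)   # toy stand-in for attention + feed-forward

def layer(x, context):
    # Each position mixes with the mean of its context (a crude proxy for attending to it).
    ctx = context.mean(dim=1, keepdim=True).expand_as(x)
    return torch.tanh(mix(torch.cat([x, ctx], dim=-1)))

x_in, x_tgt, n_layers = torch.randn(1, 6, d), torch.randn(1, 4, d), 3

# Encoder-decoder pattern: run the encoder all the way to the top first.
h = x_in
for _ in range(n_layers):
    h = layer(h, h)
enc_final = h
t = x_tgt
for _ in range(n_layers):
    t = layer(t, enc_final)        # every decoder layer sees the final encoder layer

# Decoder-only pattern: one stack, within-layer connection.
h_in, h_tgt = x_in, x_tgt
for _ in range(n_layers):
    h_in = layer(h_in, h_in)
    h_tgt = layer(h_tgt, h_in)     # target layer i sees input layer i
```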
00:56:28.460 | And finally, the last difference is the input attention.
00:56:33.380 | I mentioned about this bidirectional attention,
00:56:35.560 | and because we have this decoder only,
00:56:38.280 | typically with the unidirectional attention,
00:56:41.260 | we need to make them match.
00:56:42.840 | So we can just get rid of the bidirectionality.
00:56:45.220 | I just got rid of some of the arrows.
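Concretely, that last difference is just the attention mask; a minimal sketch:

```python
# Minimal sketch: bidirectional vs. unidirectional (causal) attention masks.
import torch

T = 5
bidirectional = torch.ones(T, T, dtype=torch.bool)   # every position sees every position
causal = torch.tril(bidirectional)                   # position i sees only positions j <= i
print(causal.int())
```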
00:56:47.860 | So then at this point,
00:56:50.260 | these two architectures are almost identical.
00:56:53.440 | There's a little bit of difference in the cross-attention,
00:56:55.020 | but the same number of parameters,
00:56:56.980 | and in deep learning,
00:56:58.900 | if you just train these
00:57:00.180 | two architectures on the same task, same data,
00:57:02.500 | I think you will get results pretty much within the noise,
00:57:04.300 | probably closer than if you train the same thing twice.
00:57:07.020 | So I would say they are practically identical.
00:57:09.900 | And so these are the main differences.
00:57:11.940 | Now we'll look at what the additional structures are
00:57:15.500 | and what they mean.
00:57:18.180 | So yeah, that's the populated table now.
00:57:21.140 | And then, so we can say that encoder-decoder,
00:57:24.060 | compared to the decoder-only architecture,
00:57:26.060 | has these additional structures, these inductive biases, built in.
00:57:31.060 | So let's go into each of them.
00:57:33.000 | The first one is
00:57:34.820 | what the encoder-decoder builds in as a structure:
00:57:37.980 | that the input and the target sequences
00:57:39.920 | are sufficiently different that
00:57:42.260 | it'll be useful to use separate parameters.
00:57:44.800 | That's the assumption.
00:57:46.140 | And so, why is that useful?
00:57:48.420 | When can that assumption be useful?
00:57:51.180 | And one example is machine translation.
00:57:54.100 | Back when the transformer was introduced in 2017,
00:57:56.700 | translation was a really popular task.
00:57:58.900 | And it was considered difficult.
00:58:00.900 | And because it's just sequence to sequence,
00:58:03.660 | you can actually have a BLEU score,
00:58:05.300 | which is a heuristic-based metric
00:58:06.940 | that gives you a single number,
00:58:08.580 | and then people can optimize that.
00:58:10.380 | So in that task, we have this input and target
00:58:14.720 | in completely different languages.
00:58:16.300 | So if the goal is to learn translation only,
00:58:19.140 | then it kind of makes sense to have,
00:58:20.820 | okay, this parameter in the encoder
00:58:22.420 | will take care of the English,
00:58:24.180 | and this parameter in the decoder
00:58:25.620 | will take care of the German.
00:58:26.900 | That seems natural.
00:58:28.420 | And what about now?
00:58:30.540 | Modern language modeling is about learning knowledge.
00:58:33.740 | And it's not just about translation,
00:58:35.540 | or not even about language.
00:58:36.740 | Language just comes up as a byproduct
00:58:38.740 | of doing this next-token prediction,
00:58:41.140 | and translation as well.
00:58:42.860 | So does it make sense to have a separate parameter
00:58:45.880 | for this kind of situation now?
00:58:48.520 | Like, we have some knowledge in German,
00:58:51.760 | some knowledge in English,
00:58:53.360 | and if anything, you wanna combine them.
00:58:55.440 | And if we represent them in separate parameters,
00:58:58.080 | I don't think that's natural.
00:58:59.560 | So I would say, with this much more general,
00:59:02.600 | larger models that can do a lot of things,
00:59:06.040 | this assumption seems very unnatural to me.
00:59:09.160 | Second example is a little bit more modern.
00:59:11.840 | Two years ago, when I was at Google,
00:59:14.040 | and with Jason, we did this instruction fine-tuning work.
00:59:17.720 | And what this is, is you take the pre-trained model,
00:59:21.080 | and then just fine-tune on academic datasets,
00:59:23.640 | and so that it can understand
00:59:25.840 | the natural language instruction.
00:59:27.240 | So the detail doesn't matter,
00:59:28.960 | but here, let's think about the performance gain
00:59:32.620 | by doing this fine-tuning
00:59:33.920 | on two different architectures we tried.
00:59:36.320 | So the first one is Flan-T5,
00:59:38.800 | which is T5-based, which is an encoder-decoder architecture.
00:59:41.920 | The latter one, Flan-PaLM, is a
00:59:44.280 | decoder-only architecture, based on PaLM.
00:59:46.420 | So we spent 99% of the time on PaLM,
00:59:50.680 | optimizing a lot of things.
00:59:52.440 | And then at the end, we just spent three days on T5.
00:59:55.180 | But the performance gain was a lot higher on T5.
00:59:58.320 | And I was really confused about this,
01:00:00.280 | and in a very good way.
01:00:01.780 | And after the paper was published,
01:00:03.400 | I wanted to dig a little bit deeper
01:00:05.520 | into why this might be the case.
01:00:07.000 | So my hypothesis is that it's about the length.
01:00:12.000 | So for the academic datasets, we used something like 1,832 tasks,
01:00:16.360 | and they have this very distinctive characteristic
01:00:20.360 | where we have a long input,
01:00:22.280 | long in order to make the task more difficult,
01:00:24.720 | but we cannot make the target long,
01:00:26.400 | because if we do, there's no way to grade it.
01:00:29.640 | So there's a fundamental challenge there.
01:00:31.560 | So what happens is you have a long input text
01:00:34.280 | and then a short target text.
01:00:36.060 | And so this is kind of the length distribution
01:00:39.120 | of what went into the fine tuning.
01:00:42.000 | So then you see this,
01:00:44.800 | you have a very different sequence
01:00:47.200 | going into the encoder as an input,
01:00:49.280 | and a very different type of sequence going into the target.
01:00:51.840 | So now, this encoder-decoder architecture
01:00:54.720 | has an assumption that they will be very different.
01:00:57.000 | That structure really shines because of this.
01:01:00.360 | It was kind of an accident,
01:01:01.920 | but that was, I think,
01:01:03.680 | why this architecture was just really suitable
01:01:07.740 | for fine-tuning with the academic datasets.
01:01:10.520 | What about now?
01:01:11.520 | Do we care about this kind of assumption?
01:01:14.320 | And if you think about the general use cases
01:01:16.760 | of language models nowadays,
01:01:18.800 | if anything, the more interesting cases
01:01:20.980 | involve longer generation, longer target.
01:01:24.500 | Just because we cannot grade them
01:01:26.960 | doesn't mean that we are not interested in them.
01:01:29.160 | Actually, if anything, we are more interested in that.
01:01:31.180 | So now, we have this longer target situation.
01:01:34.180 | So having separate parameters for input and target sequences
01:01:37.160 | doesn't seem to make much sense.
01:01:39.040 | And moreover, we think about this chat application,
01:01:42.440 | like ChatGPT, we do multi-turn conversation.
01:01:45.300 | And then, what is the target of this turn
01:01:48.140 | becomes the input of the next turn.
01:01:50.180 | And then my question is,
01:01:51.680 | does it make sense to even think about
01:01:53.880 | different parameters if in the next turn
01:01:57.300 | it's gonna be the same thing?
01:01:59.980 | So that was the first inductive bias we just mentioned.
01:02:04.180 | And then the second structure is that
01:02:05.740 | target elements can only attend to the fully encoded input,
01:02:09.820 | the final layer output of the encoder.
01:02:11.620 | Let's look at this additional structure, what that means.
01:02:14.220 | So as I mentioned, we have
01:02:16.720 | every decoder layer attending to the very top encoder layer.
01:02:19.700 | And so in deep neural nets,
01:02:22.420 | typically we see that the bottom layers
01:02:24.380 | and the top layers encode information
01:02:26.540 | at a very different level.
01:02:28.260 | Meaning that, for example, in computer vision,
01:02:30.620 | the bottom layers encode something like edges,
01:02:34.020 | and the top layers encode higher-level features,
01:02:35.620 | combining those into something like a cat face.
01:02:38.280 | And so we call this deep learning
01:02:39.700 | a hierarchical representation learning method.
01:02:43.380 | And so now the question is,
01:02:45.460 | if decoder layer one attends to encoder final layer,
01:02:50.300 | which probably has a very different level of information,
01:02:53.060 | is that some kind of an information bottleneck,
01:02:55.340 | which actually motivated the original attention mechanism.
01:02:59.540 | And in practice, I would say, in my experience,
01:03:02.740 | it doesn't really make any difference.
01:03:04.240 | But that's because my experience was limited to,
01:03:06.820 | say, the 24 encoder layers of T5.
01:03:10.140 | So decoder layer one attending to encoder layer 24 is probably fine.
01:03:12.740 | But what if we have 10x or 1000x more layers?
01:03:15.700 | Would that be problematic?
01:03:16.980 | I'm not really comfortable with that.
01:03:19.580 | So I think this is also unnecessary design
01:03:23.460 | that maybe we need to revisit.
01:03:25.200 | The final structure we're gonna talk about is
01:03:29.060 | the bidirectional attention
01:03:31.780 | in the encoder of the encoder-decoder.
01:03:32.740 | Let's think about that.
01:03:34.240 | So yeah, bidirectional input attention,
01:03:37.060 | is that really necessary?
01:03:38.340 | So when we had BERT,
01:03:42.140 | the B in BERT stands for bidirectional.
01:03:44.700 | In 2018, when we were solving question answering on SQuAD,
01:03:48.060 | it was actually a very difficult task.
01:03:49.740 | So if you had any additional trick,
01:03:51.960 | it could make a huge difference.
01:03:53.420 | Bidirectionality was really useful,
01:03:56.140 | I think maybe boosting up the SQuAD score by like 20 points.
01:03:59.700 | So it was a really huge thing.
01:04:01.260 | But at scale, I don't think this matters that much.
01:04:04.260 | This is my highly anecdotal experience.
01:04:07.060 | So in Flan 2, we tried both bidirectional
01:04:10.820 | and unidirectional fine-tuning,
01:04:12.840 | and it didn't really make much difference.
01:04:14.380 | But I wanna point out that this bidirectionality
01:04:18.220 | actually brings in an engineering challenge
01:04:20.140 | for modern multi-turn chat applications.
01:04:23.180 | At every turn, the whole input has to be encoded again,
01:04:27.240 | whereas unidirectional attention handles this much, much better.
01:04:29.660 | So here's what I mean by that.
01:04:31.140 | So let's think about this more modern conversation
01:04:34.360 | between user and assistant: "How are you?", "Bad," and "Why?"
01:04:38.220 | And so here, if we think about the bidirectional case,
01:04:40.980 | when we generate "Bad,"
01:04:43.640 | we need to encode this input
01:04:45.700 | with bidirectional attention, which is fine.
01:04:47.800 | And then after "Bad" is generated,
01:04:51.100 | when we're trying to generate "Why,"
01:04:52.940 | we'll need to encode "How" again,
01:04:55.720 | because "How" can attend to "Bad,"
01:04:57.700 | so we need to do everything from scratch again.
01:05:00.240 | In contrast, if we do the unidirectional one,
01:05:03.360 | we can do much, much better,
01:05:05.560 | because now when we are trying to generate "Why,"
01:05:08.420 | we don't have to redo "How,"
01:05:09.780 | because earlier tokens cannot attend to future tokens,
01:05:13.560 | so we don't have to recompute anything.
01:05:15.140 | So if you see the difference, this earlier part can be cached,
01:05:17.740 | and then this new part is the only thing
01:05:19.460 | that has to be encoded again.
01:05:22.180 | So this kind of makes a big difference
01:05:23.780 | when we think about multiple turns going in.
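Here is a minimal sketch of why causal attention allows this caching: because earlier tokens never attend to later ones, their key and value vectors never change, so each new turn only has to encode the new tokens. The projections, shapes, and turn contents are illustrative assumptions.

```python
# Minimal sketch: a key/value cache that grows across turns with causal attention.
import torch
import torch.nn as nn

d = 32
W_k, W_v = nn.Linear(d, d), nn.Linear(d, d)
cache_k, cache_v = [], []                  # grows across turns, never recomputed

def encode_new_tokens(x_new):
    # Only the new tokens are projected; everything already in the cache
    # (e.g. the earlier turns) is reused as-is.
    cache_k.append(W_k(x_new))
    cache_v.append(W_v(x_new))
    return torch.cat(cache_k, dim=1), torch.cat(cache_v, dim=1)

turn1 = torch.randn(1, 4, d)               # stand-in for "How are you?"
turn2 = torch.randn(1, 1, d)               # stand-in for "Bad"
turn3 = torch.randn(1, 1, d)               # stand-in for "Why?"
for turn in (turn1, turn2, turn3):
    k, v = encode_new_tokens(turn)         # with bidirectional attention, all previous
                                           # tokens would have to be re-encoded here
print(k.shape)                             # keys for the whole conversation so far
```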
01:05:27.300 | So I would say bidirectional attention did well in 2018,
01:05:31.340 | which is mostly solved by scale,
01:05:33.340 | and now because of this engineering challenge,
01:05:35.540 | we don't really need that.
01:05:37.140 | So to conclude, we have looked into this driving force,
01:05:41.580 | dominant driving force governing this AI research,
01:05:44.340 | and that was this exponentially cheaper compute
01:05:47.060 | and associated scaling effort.
01:05:49.500 | And so to understand this driving force,
01:05:51.800 | we analyzed some of the additional structures
01:05:54.180 | added to the encoder-decoder compared to decoder-only,
01:05:57.220 | and then thought about what that means
01:06:00.260 | from the perspective of scaling.
01:06:02.120 | And I wanted to just conclude with this remark.
01:06:06.120 | So we have looked at these kinds of analyses,
01:06:08.580 | which one can say are just historical artifacts
01:06:12.500 | and don't matter, but if you do many of these
01:06:15.500 | and then look at current events,
01:06:17.600 | you can hopefully think about those in a more unified manner
01:06:21.180 | and then see, okay, what assumptions in my problem
01:06:25.240 | do I need to revisit, and are they still relevant?
01:06:27.680 | And if not, why?
01:06:28.800 | And you have an answer to it:
01:06:30.260 | can we do it with something more general and scale up?
01:06:33.040 | And so I hope you can go back
01:06:35.280 | and really think about these problems,
01:06:37.280 | and together we can really shape the future of AI
01:06:40.280 | in a really nice way.
01:06:41.840 | So that's it, thanks.
01:06:43.720 | (audience applauds)
01:06:46.720 | (water splashes)
01:06:49.480 | - Hi, thank you for the talk.
01:07:04.220 | So about the mixture-of-experts structure,
01:07:08.600 | if what you're saying is correct,
01:07:11.560 | then how long do you think mixture of experts
01:07:15.640 | is gonna stay around for new large language models?
01:07:20.640 | - So one thing I have to apologize for is that architecture
01:07:24.120 | is kind of a thing that I'm not really comfortable
01:07:28.040 | sharing a lot about, which is why I'm limiting
01:07:30.500 | what I say about the future a little bit.
01:07:32.160 | So yeah, I'll probably just skip that,
01:07:37.160 | but I would say that approach seems quite general.
01:07:40.780 | Yeah.
01:07:41.620 | - So some of the changes that you described
01:07:48.740 | between encoder-decoder versus decoder-only,
01:07:51.100 | like the parameter sharing and the bidirectional attention,
01:07:54.860 | can they not be interpreted as less structure,
01:07:58.260 | or sorry, more structure or less freedom
01:08:00.820 | for the model to learn?
01:08:01.980 | - Yeah, I think one can frame it that way,
01:08:04.820 | but, somewhat subjectively,
01:08:08.060 | I think it's the simpler structure,
01:08:14.300 | because we're just saying the input and target
01:08:21.220 | are both just sequences, and
01:08:23.640 | if we have enough capacity, we can just handle both.
01:08:34.860 | Oh, actually, maybe I should have repeated the question.
01:08:36.540 | The question is, can we think about the parameter sharing
01:08:39.780 | and other structures in the encoder-decoder
01:08:42.140 | as actually less structure?
01:08:43.700 | But I think it's a little bit more complicated model,
01:08:45.540 | and such complications mean
01:08:47.620 | it has more assumptions, right?
01:08:49.060 | The input and target are different.
01:08:51.420 | I think that's a stronger assumption than,
01:08:53.020 | okay, it's a sequence,
01:08:54.580 | and we deal with the sequence in a unified way.
01:08:57.060 | So that would be just my take.
01:09:01.820 | - Do you have any thoughts on the recent state-space models
01:09:06.820 | like Mamba and how that fits into the paradigm
01:09:13.060 | of less structure versus more structure?
01:09:17.380 | - Yeah, yeah.
01:09:18.580 | Okay.
01:09:22.460 | It's hard to think about it on the spot,
01:09:27.860 | but to me, I talk about these architectures,
01:09:29.580 | and architecture is kind of a thing
01:09:32.460 | that doesn't change things too much.
01:09:36.980 | And I think multi-modality
01:09:39.660 | might bring in other challenges,
01:09:41.180 | where this transformer structure
01:09:43.620 | might become a bottleneck when we think about that.
01:09:50.540 | So I think transformers have done a good job.
01:09:54.020 | So maybe we should rethink them,
01:09:56.060 | especially with multi-modality.
01:09:57.900 | Yeah.
01:09:58.740 | - So, for cross-attention and causal attention,
01:10:05.020 | bidirectional attention keeps a kind of permutation invariance
01:10:08.540 | in a way, whereas causal attention doesn't.
01:10:13.340 | And for computer vision,
01:10:15.100 | there's a lot of structure built around
01:10:17.220 | invariances for self-supervised learning.
01:10:20.060 | What do you think about those
01:10:21.140 | in terms of the structure that you just talked about?
01:10:23.980 | - So the question is about this causal attention
01:10:26.300 | versus the bidirectional attention.
01:10:29.420 | Causal is probably fine in the text domain,
01:10:32.620 | but in the computer vision case,
01:10:34.180 | being able to attend to the future part
01:10:38.380 | is really important.
01:10:39.700 | Yeah, is that the question?
01:10:41.660 | - Also, causal attention
01:10:46.340 | removes the invariance to permutation.
01:10:49.540 | And for vision,
01:10:52.220 | there are a lot of invariances, right,
01:10:53.540 | used for augmentation.
01:10:56.700 | So what do you think about
01:10:58.340 | those as a kind of structure?
01:11:00.020 | - So I think
01:11:03.180 | I don't really like these invariances and all that.
01:11:05.860 | They're how humans think we perceive vision.
01:11:09.620 | CNNs, for example, build in translation invariance,
01:11:12.220 | which we thought was very important.
01:11:14.660 | But I don't think it's that important.
01:11:16.540 | Actually, if anything, now it's hurting
01:11:18.260 | the model from learning something more general.
01:11:20.340 | And so the machines might be learning vision
01:11:23.260 | in a completely different way from how humans do,
01:11:26.300 | and I don't think that's problematic.
01:11:28.020 | So those invariances, I think,
01:11:31.340 | could be a good guiding principle,
01:11:32.980 | but I'm not too worried
01:11:35.140 | about just not having such structures.
01:11:38.420 | Yeah, I would just check,
01:11:42.020 | based on some metric,
01:11:44.460 | whether not having that invariance structure
01:11:46.740 | is actually better and more scalable,
01:11:48.500 | and I think that's probably fine,
01:11:50.620 | and actually even better
01:11:51.620 | if we can do it without the structure.
01:11:54.020 | - So I actually have two questions.
01:12:00.100 | One, so clearly you've been thinking about
01:12:02.540 | how inductive biases and structure
01:12:04.660 | limit our potential, essentially.
01:12:07.100 | So I'm just curious, what are some big inductive biases
01:12:11.940 | currently that you think are, like,
01:12:13.700 | big blocks that we can release, or let go of?
01:12:18.060 | That would be one question, if you could let me know.
01:12:21.140 | - The current structure that we should get rid of.
01:12:24.180 | - Just current inductive biases you think,
01:12:26.340 | 'cause clearly you've been thinking about this, right?
01:12:28.260 | So when you look at the state of research,
01:12:29.740 | you must be thinking, oh, man,
01:12:31.900 | this is, like, a pretty big inductive bias.
01:12:33.700 | It'd be really cool if we could let this go.
01:12:35.740 | So I'm just trying to see what you're--
01:12:38.500 | - Yeah, so when I think about this in terms of architecture,
01:12:42.180 | I think the architectures
01:12:45.140 | are not the current bottleneck, in my view.
01:12:47.420 | And that's partly because I did
01:12:49.340 | a lot of architecture research,
01:12:50.740 | and at the end, we published this paper
01:12:53.580 | saying, okay, we tried, like,
01:12:55.940 | 60 different transformer modifications,
01:12:58.100 | and they were pretty much the same thing.
01:12:59.420 | None of them really
01:13:01.340 | made a huge difference.
01:13:03.300 | Caveat: now,
01:13:04.580 | maybe the conclusion would be different.
01:13:06.060 | So I have a very big bias toward, like,
01:13:07.820 | not doing architecture research.
01:13:09.340 | And one message could be that, actually,
01:13:11.340 | the architecture is not the bottleneck for further scaling.
01:13:13.860 | And I think what's the bottleneck now
01:13:15.620 | is the learning objective,
01:13:17.060 | especially in the supervised learning paradigm,
01:13:20.020 | or even, like, self-supervised pre-training.
01:13:22.260 | What we're doing with this maximum likelihood estimation is,
01:13:25.220 | okay, given this input, this is the only correct target,
01:13:28.380 | and everything else is incorrect,
01:13:29.540 | because the probability measure is finite.
01:13:31.140 | So is that really a comfortable
01:13:34.900 | teaching signal to give the model?
01:13:36.620 | And I think, if we think about the old days,
01:13:39.220 | we could formalize the correct behavior
01:13:43.420 | for a given input very well.
01:13:45.100 | And maybe having a
01:13:46.820 | single correct answer was fine.
01:13:48.620 | But now, if you're thinking about very general things,
01:13:50.820 | especially the chat applications,
01:13:52.420 | okay, for, like, "write a poem,"
01:13:54.420 | if you say this is the only correct answer,
01:13:57.540 | I think the implications could be really severe.
01:14:01.020 | So I think that's really something
01:14:03.140 | that I'm not comfortable with.
01:14:04.660 | And that's partly why I'm interested in RLHF:
01:14:07.260 | it's one instantiation of not using this maximum likelihood,
01:14:10.860 | instead using a reward model
01:14:13.140 | as a learned objective function,
01:14:14.700 | which is a lot less structure,
01:14:16.420 | so now we can scale it further.
01:14:17.860 | RLHF itself is not really that scalable, I would say.
01:14:20.860 | But it showed that we can use
01:14:23.860 | supervised deep learning to train a model
01:14:26.260 | that serves as an objective function,
01:14:28.180 | and that really works in a cool way.
01:14:31.420 | I think that's a great paradigm.
01:14:33.100 | - Thank you, great answer.
01:14:36.660 | Not that you're being judged or anything.
01:14:38.180 | But a second question I would say is,
01:14:40.820 | so then, in the beginning of the talk,
01:14:42.980 | you talk about the big driving force
01:14:46.860 | being the exponentially cheap compute, right?
01:14:49.500 | But some of the stuff I've been reading
01:14:51.460 | says Moore's Law is ending,
01:14:53.460 | and we're going towards performance-oriented architecture.
01:14:56.780 | So can we rely on that? Because in the past 50 years,
01:15:01.480 | we had transistors doubling or whatever with Moore's Law,
01:15:04.860 | but that's ending.
01:15:05.860 | So when you talk about the compute,
01:15:07.860 | these trends that we've been looking at
01:15:09.420 | over that history,
01:15:12.020 | we're also uncertain, necessarily,
01:15:13.500 | about how that's gonna project into the future.
01:15:15.060 | So what are some of your thoughts on--
01:15:16.700 | - Yeah, I think that Moore's Law
01:15:17.980 | is really a red herring in this case,
01:15:19.780 | because it's about the number of transistors,
01:15:21.820 | but that doesn't really matter.
01:15:22.780 | I think what matters is compute availability.
01:15:25.220 | And the GPU, for example,
01:15:27.180 | is a very different kind of architecture,
01:15:28.660 | and that enabled the continuation of this trend.
01:15:30.980 | And I think right now, 2023, 2024,
01:15:34.180 | we're kind of taking shortcuts
01:15:35.460 | with low-precision things, which still, I think, is cool.
01:15:39.820 | But I think there are many other GPU-level things.
01:15:43.400 | And also, if we are kind of sure about the architecture,
01:15:46.700 | this is my thought,
01:15:47.740 | we can hard-code it into the chips,
01:15:49.940 | and that can provide a lot of benefits.
01:15:52.520 | And for training,
01:15:54.660 | I don't think that's really been done.
01:15:56.120 | But the GPU, if you think about it, is too general.
01:15:58.340 | Maybe that is something that we can revisit.
01:16:01.120 | And so I'm not losing hope,
01:16:04.020 | and I don't yet see a trend of doing that.
01:16:06.300 | But maybe other things will come up as a bottleneck,
01:16:09.220 | like maybe energy or something.
01:16:12.260 | Yeah, so physics probably is something
01:16:15.460 | that we need to study again.
01:16:16.720 | (audience laughs)
01:16:17.560 | - If you don't mind me continuing,
01:16:19.720 | then the problem is like,
01:16:21.320 | we're talking about exponential driving forces, right?
01:16:23.680 | You can tell me that you wanna hard-code chips,
01:16:25.760 | but that's not the same as telling me
01:16:27.280 | that there's gonna be exponential growth
01:16:28.600 | that we can drive, right,
01:16:29.800 | into like paradise or wherever the hell we're going.
01:16:32.840 | - Yeah, here's my very boring answer.
01:16:36.280 | I think we just need to do a little bit better,
01:16:39.200 | and at some point, the machines will be better than us
01:16:41.640 | in thinking about chip design.
01:16:43.200 | (audience laughs)
01:16:45.480 | So I'm half-joking,
01:16:47.080 | but if we look back at, say, this video two years from now,
01:16:50.440 | I think it'll be less of a joke and more of a serious thing.
01:16:54.440 | Let's just get there first.
01:16:55.960 | - All right, so thanks to Hyung Won for an amazing talk.
01:17:00.120 | (audience applauds)