Stanford CS25: V2 | Emergent Abilities and Scaling in LLMs
So the first sort of paper that I'm going to be talking about 00:00:10.380 |
is called "Emergent Abilities of Large Language Models." 00:00:15.780 |
I think, because we got people from Google and DeepMind, 00:00:20.580 |
You might recognize Percy, or Tatsu, or Rishi. 00:00:28.020 |
at why we want to scale in emergent abilities. 00:00:33.660 |
So one of the things that we've seen throughout language models 00:00:38.960 |
is that you sort of get these predictable gains 00:00:45.340 |
paper, where you can see that if you scale up 00:00:49.380 |
the size of the language model, measured either in compute, 00:01:02.100 |
I don't know if you're screen sharing, so people in Zoom, 00:01:17.660 |
if you scale up the size of the language model, 00:01:20.100 |
measured either in compute, in data set size, 00:01:24.940 |
see that there's sort of this predictable improvement 00:01:29.060 |
Now, what I'm going to be talking about in terms 00:01:36.500 |
unpredictable if you only look at smaller language models. 00:01:40.780 |
So one way that emergence has been described in the broader 00:01:51.540 |
It sort of started with this article in Science 00:01:59.340 |
And I really like this post from Jacob Steinhardt 00:02:08.700 |
For example, he says, "With a bit of uranium, nothing special happens. 00:02:13.460 |
With a large amount of uranium packed densely enough, you get a nuclear reaction. 00:02:22.820 |
Or with only small molecules such as calcium, you can't meaningfully encode useful information. 00:02:25.180 |
But given larger molecules such as DNA, you can encode a genome." 00:02:28.020 |
So for this particular work, we use this definition 00:02:35.420 |
of emergent abilities of large language models in particular. 00:02:49.780 |
is how do you measure the size or the scale of the language 00:02:57.180 |
So the training flops are the amount of compute 00:03:02.300 |
the number of model parameters, or the size of the language 00:03:04.900 |
model, and also the size of the training data set 00:03:09.940 |
And a lot of the plots here will use either training flops 00:03:15.620 |
The reason is that the training data set size is usually 00:03:20.780 |
And because training flops is just the data set size 00:03:23.260 |
times model parameters, you can get a similar plot 00:03:27.300 |
from either training flops or number of model parameters 00:03:41.700 |
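To make that relationship concrete, here is a minimal sketch; the constant factor of 6 (roughly one forward plus backward pass per parameter per token) is a common rule of thumb, not a number given in the talk:

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training-compute estimate: FLOPs grow with (model parameters x training tokens).

    The factor of 6 is an assumed rule of thumb for transformer training,
    not a value stated in the talk.
    """
    return 6.0 * n_params * n_tokens

# Example: a 137-billion-parameter model trained on about 1e12 tokens
# lands on the order of 1e24 training FLOPs.
print(f"{approx_training_flops(137e9, 1e12):.1e}")  # ~8.2e23
```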
For me, it seems like it would be relatively easier 00:03:48.560 |
How do you define what counts as an actual ability versus not? 00:03:57.140 |
So yeah, for example, I'll just give an example here, 00:04:03.500 |
So basically, we have this way of interacting 00:04:06.220 |
with language models called few-shot prompting. 00:04:17.980 |
and then you ask it for an unseen movie review, 00:04:22.780 |
for example, and then you say, what's the output? 00:04:28.460 |
to use the context from the review to give the next token. 00:04:34.140 |
And the definition that we use for having ability or not 00:04:38.900 |
is that basically, a few-shot prompted task, for example, 00:04:51.580 |
So basically, if the model isn't doing any better than random, 00:04:54.660 |
then we say it doesn't have the ability to do the task. 00:05:09.700 |
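For concreteness, a few-shot sentiment prompt of the kind being described might look like the following sketch (the reviews and labels are made up for illustration):

```python
few_shot_prompt = """\
Review: The plot was predictable and the acting was flat.
Sentiment: negative

Review: A beautifully shot film with a moving story.
Sentiment: positive

Review: I could not stop laughing; easily the best comedy this year.
Sentiment:"""

# The model is asked to continue the text. A model that "has" the ability
# completes the last line with "positive"; a model at random performance
# does no better than chance between the two labels.
```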
So basically, what we have, each of these different plots 00:05:20.700 |
is the number of training flops or the model scale. 00:05:36.340 |
And then each of the points is a different language model. 00:05:39.340 |
It's not a language model over the course of training. 00:05:45.060 |
And what you see is that for the very small language models, 00:05:49.420 |
you basically get performance that's close to random 00:06:07.040 |
So basically, if you were to extrapolate the lines 00:06:13.140 |
might predict that it would never do better than random 00:06:19.180 |
that when you scale up past a certain threshold, 00:06:21.620 |
you actually do see this emergent phenomenon where 00:06:34.380 |
It's basically a benchmark called multitask NLU or MMLU. 00:06:40.180 |
And basically, what it is, it's a bunch of test questions 00:06:45.100 |
ranging from high school all the way to professional level 00:06:49.820 |
And how it works is the language model is given-- 00:06:53.660 |
for example, here is a high school math example. 00:06:57.260 |
And the language model is given a few examples. 00:06:59.840 |
And then for an unseen question, it has to give the answer. 00:07:04.100 |
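As an illustration of that setup, here is a sketch of a few-shot multiple-choice prompt in the same spirit (the question and choices are invented, not taken from the actual MMLU benchmark):

```python
mmlu_style_prompt = """\
The following are multiple choice questions about high school mathematics.

Q: What is the value of 3x + 2 when x = 4?
(A) 10  (B) 12  (C) 14  (D) 16
A: (C)

Q: <unseen test question goes here>
(A) ...  (B) ...  (C) ...  (D) ...
A:"""

# The model is scored on whether it completes the prompt with the correct
# choice; with four options, random-guessing accuracy is 25%.
```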
And then you can see in the plot on the right-- 00:07:08.860 |
to sort of 10 to the power of 22 training flops, 00:07:12.300 |
you don't actually get any better than random accuracy 00:07:15.700 |
But if you scale up to 10 to the 24 training flops, 00:07:31.980 |
Or because these are different models trained 00:07:36.140 |
Yeah, the scale is, I think, within an order of magnitude 00:07:42.460 |
But every single dot on each individual tracks 00:07:47.700 |
Yes, the data is fixed except for Chinchilla. 00:08:05.900 |
So this is a task from the Big Bench Benchmark. 00:08:17.300 |
of benchmarks I'd recommend looking at if you're 00:08:21.300 |
And basically, the task is the language model 00:08:26.960 |
give the international phonetic alphabet transliteration, 00:08:38.860 |
is actually BLEU, or like an n-gram overlap metric. 00:08:46.020 |
where as you increase the size of the language model, 00:08:51.820 |
And then suddenly, the improvement is above random. 00:08:59.420 |
So I'll talk about another interesting result 00:09:04.540 |
So this was a technical report that we put out 00:09:09.180 |
And basically, there's this really interesting prize in-- 00:09:13.100 |
or it's like a one-time prize in language models, 00:09:21.420 |
basically had this prize where if people could come up 00:09:23.700 |
with a task where the performance on the task 00:09:35.460 |
So basically, there are a lot of submissions to this. 00:09:40.100 |
they found that the performance would actually 00:09:42.140 |
decrease if you increase the size of the language model. 00:09:49.380 |
And then the input is like, all that glisters is not glib. 00:10:01.340 |
And so what happened is for the small language model, 00:10:06.100 |
it doesn't know the phrase, all that glisters is not gold. 00:10:14.860 |
But then for the medium-sized language model, 00:10:17.820 |
what you would see is that the performance actually 00:10:22.220 |
model knows the phrase, all that glisters is not gold. 00:10:29.660 |
Someone on Zoom asked, can you give a physical estimate 00:10:32.060 |
of 10 to the 24 flops, possibly in terms of training 00:10:38.500 |
Yeah, so I think 10 to the 24 flops is around-- 00:10:48.260 |
And one pod of TPUs, I believe, is equal to like 4,000 A100s. 00:10:55.780 |
And 10 to the 24 flops is like two pods for around six weeks 00:11:02.420 |
So it's a lot of compute to do the pre-training. 00:11:16.020 |
I don't even think about how big this number is. 00:11:19.020 |
That's the number of floating point operations 00:11:21.780 |
that goes into the pre-training of some of these models. 00:11:28.220 |
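As a rough sanity check on that figure, here is a back-of-the-envelope calculation; the A100 peak throughput and the utilization fraction are assumed numbers for illustration, not values from the talk:

```python
# Order-of-magnitude check of the ~1e24 FLOPs figure (assumed hardware numbers).
a100_peak_flops = 312e12      # assumed A100 bf16 peak, FLOPs per second
utilization = 0.3             # assumed fraction of peak actually achieved
n_accelerators = 2 * 4000     # "two pods", one pod ~ 4,000 A100 equivalents
seconds = 6 * 7 * 24 * 3600   # six weeks

total_flops = a100_peak_flops * utilization * n_accelerators * seconds
print(f"{total_flops:.1e}")   # ~2.7e24, i.e. on the order of 1e24
```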
So yeah, so basically, the medium-sized language model 00:11:35.620 |
Yeah, oh, wait, did this one win the prize or not? 00:11:43.660 |
I think it's a third-place winner or something. 00:11:59.980 |
Well, because all of this depends on which evaluation 00:12:04.020 |
scheme you use to measure if you do your task right. 00:12:11.340 |
You only get credit if you do it perfectly or something. 00:12:15.180 |
Yeah, so for this thing, they accounted for all-- 00:12:43.340 |
And then, I guess, your point about how credit assignment 00:12:46.760 |
or the evaluation metric works is actually a really good one. 00:12:50.740 |
Yeah, so I guess it still kind of counts if-- 00:12:54.500 |
I guess the argument is that the performance might not 00:13:09.060 |
you'll often still see the same type of emergence. 00:13:14.340 |
assigning partial credit based on the evaluation metric. 00:13:27.160 |
is that, yeah, there might be some tasks where 00:13:29.420 |
the performance starts to decrease if you use a medium 00:13:33.780 |
But if you keep scaling all the way to the largest model 00:13:38.260 |
that we have at Google that's known publicly, PaLM, 00:13:43.420 |
you'll see that the language model can actually 00:13:51.260 |
knows the phrase, all that glisters is not gold. 00:13:54.100 |
But it also understands, repeat my sentences back to me. 00:14:01.060 |
So this is a different type of emergence also. 00:14:04.820 |
And another class of emergence that we talk about in the paper 00:14:16.780 |
with the language models that can be considered emergent. 00:14:23.020 |
Sorry, somebody else had a question on the previous one. 00:14:27.300 |
The question is, did all models undergo instruction fine-tuning 00:14:31.580 |
None of these models underwent instruction fine-tuning 00:14:41.780 |
So one way of interacting with language models 00:14:51.620 |
And basically, the way it works is you have this data. 00:14:54.180 |
And humans rate preferences based on the data. 00:14:57.620 |
And humans rate preferences for what type of outputs 00:15:02.620 |
And then the model is trained with RL to optimize 00:15:14.180 |
the model performance on a different zero-shot task 00:15:19.580 |
You can see the blue line is above the orange line. 00:15:28.780 |
though, then you can see that the performance actually 00:15:41.260 |
help if you try on a large enough language model. 00:15:43.420 |
So if you only try it on the small language models, 00:15:46.180 |
it'd be tough to draw the conclusion that it wouldn't 00:15:51.560 |
And then later, I'll talk about chain-of-thought prompting 00:16:01.060 |
used to think about emergence as a framework. 00:16:05.260 |
So on the x-axis here, there's a scale of the language model. 00:16:09.860 |
And on the y-axis is an imaginary scale of a range of things 00:16:18.500 |
And then basically, you can pick some random point, 00:16:20.700 |
like, say, 100 billion parameters in the language model. 00:16:30.220 |
of tasks or things that the language model can do increases. 00:16:33.740 |
And then you can see there are some tasks where-- 00:16:38.180 |
models above 100 billion parameters, for example, 00:16:41.420 |
But models below 100 billion parameters can't do them. 00:16:53.820 |
is tasks that a smaller language model wouldn't be able to do. 00:17:07.420 |
haven't been able to solve yet with language models. 00:17:13.440 |
that it's not that those tasks in the white region 00:17:19.300 |
Or do you think that better models, specific training 00:17:23.660 |
data, would allow us to, at the 100 billion scale, 00:17:27.980 |
Yeah, I definitely think that it's not a fixed-- 00:17:40.300 |
to have 100 billion parameters to do a certain task. 00:17:42.900 |
It's just that that happens to be the threshold that we've 00:18:00.100 |
one example of getting emergence can be with better data. 00:18:12.340 |
You can see that for LaMDA, which is a Google model, 00:18:18.300 |
from scaling to 137 or 175 billion parameters. 00:18:25.780 |
But when you come in with a different language model, 00:18:28.020 |
PaLM, which is trained on better data than LaMDA and GPT-3, 00:18:35.420 |
even with the smaller language model, so here 00:18:40.380 |
Do you see PaLM's better performance as better data, 00:18:49.940 |
Yeah, so the challenging thing is-- that's a great question. 00:18:54.940 |
There's a lot of differences between Palm and Lambda, 00:18:58.780 |
And we can't really ablate them in any controlled way 00:19:08.180 |
And that accounts for a lot of the difference 00:19:13.620 |
it is possible to ablate stuff, or some really architectural 00:19:21.300 |
So I guess even here, you can look at, for example, 00:19:23.420 |
the PaLM 8 billion model, like that point there. 00:19:28.700 |
You can ablate it, and it's a little bit higher, 00:19:30.900 |
but it's not really emergent yet at that point. 00:19:33.420 |
So it's hard to tell, for example, this particular task, 00:19:45.680 |
Oh, so I think the two lines here-- one is maybe three shot, 00:19:51.060 |
and then one is zero shot, something like that. 00:19:58.240 |
using the language model, either with or without exemplars. 00:20:07.160 |
I'll talk about a small ablation here that shows this. 00:20:11.080 |
So this is an ablation on a toy task, where basically, 00:20:17.560 |
the language model has to know that in English, 00:20:20.040 |
you have to use plural verbs with plural subjects 00:20:30.640 |
we train these small BERT models from scratch, 00:20:33.920 |
and then we fixed the frequency of certain verbs 00:20:39.080 |
in the training data set, which basically says, OK, 00:20:41.920 |
what's the effect of seeing a certain verb in the data 00:20:47.200 |
In this plot, the x-axis is the frequency of the verb, 00:20:53.200 |
And what you basically see is that if you have more 00:20:56.680 |
in-domain data, so if the model sees the verb more times, 00:21:02.160 |
So this is an example of having high-quality data or data 00:21:05.440 |
that's more in-domain for the task that you're evaluating on 00:21:18.720 |
to distill emergent abilities down to smaller models 00:21:25.560 |
So larger teacher models, you can use them, for example, 00:21:34.680 |
And then if you fine-tune the smaller model on data, 00:21:38.760 |
it's pretty likely that you'll be able to get the ability 00:21:49.760 |
Desired behaviors can be induced in smaller models 00:22:07.920 |
And you can see that there's multiple models. 00:22:17.560 |
than larger models trained on weaker techniques. 00:22:22.760 |
know that you want a certain behavior that you 00:22:26.160 |
saw previously in an emergent way in a larger model, 00:22:30.840 |
you can find a way to fine-tune on that behavior specifically 00:22:39.520 |
is that unless you know all the behaviors that you want, 00:22:43.160 |
you can't really get this natural emergent behavior. 00:22:57.800 |
So right now, we mostly talk about model parameters 00:23:02.160 |
But I guess if you ask DeepMind people how they look at it, 00:23:06.160 |
you'll get this argument that model parameters and training 00:23:09.640 |
flops are really just a proxy for how good the model is. 00:23:18.280 |
doing on some dev sets, such as WikiText-103. 00:23:39.580 |
you're able to do a lot better on the downstream task. 00:23:47.400 |
at least right now, between perplexity and training 00:23:50.120 |
So you can see these two lines are pretty similar. 00:23:57.880 |
if we have much better models that are a lot smaller, 00:24:00.640 |
trained on much better data and better algorithms, 00:24:04.960 |
show a different type of plot than using other metrics. 00:24:21.160 |
of how well you can predict the next word in a data set. 00:24:25.840 |
So basically, if you're really good at modeling 00:24:27.840 |
the next word on this particular evaluation set, 00:24:32.080 |
that's sort of a measure of how well you understand language. 00:24:54.960 |
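For reference, perplexity here is just the exponential of the average negative log-likelihood per token; a minimal sketch with synthetic numbers:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(average negative log-likelihood per token); lower is better."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Synthetic log-probabilities the model assigned to each observed next token.
print(round(perplexity([-2.3, -0.7, -1.1, -0.2]), 2))  # ~2.93
```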
is that there's sort of not just technical emergence 00:24:59.760 |
sort of sociological changes in how the AI community views 00:25:12.280 |
the size of the language model enables you to, 00:25:17.880 |
beat a task-specific fine-tuned language model that's 00:25:21.120 |
usually fine-tuned on, say, thousands of examples. 00:25:25.240 |
So basically, the green line is the prior state of the art 00:25:36.120 |
and you do few-shot prompting, which means the language model 00:25:42.520 |
just by continuing to scale up the size of the language model. 00:25:51.720 |
But I think it's a pretty big change in people's minds 00:25:54.600 |
that you could actually get some of the best results 00:25:56.880 |
just by scaling up the size of the language model 00:26:14.160 |
Should we, in general, assume-- oh, he said yes. 00:26:26.920 |
So this plot is saying that you fine-tune and you can do-- 00:26:39.480 |
But what this plot is saying is that fine-tuned smaller models 00:26:50.600 |
can do well on some tasks if you target it well. 00:27:06.400 |
I guess some tasks you would do a lot better just 00:27:14.440 |
And then other tasks, if it's a very narrow domain 00:27:34.640 |
And if you try to predict their emergence just 00:27:48.480 |
of how to view new abilities that are not intentionally 00:27:55.120 |
And I think the subtext for this is super important, which 00:27:57.560 |
is you can see it as an implicit argument for why we should keep 00:28:03.760 |
get these abilities that are really hard to find otherwise. 00:28:07.280 |
And the context around this is pretty important, 00:28:09.240 |
because it's really expensive to continue scaling up 00:28:17.800 |
didn't believe that you could do better on certain tasks 00:28:20.320 |
just by scaling up the size of the language model. 00:28:23.920 |
They sort of-- if you work in industry at all, 00:28:25.880 |
there's this interesting tension between emergence and also 00:28:30.920 |
So emergence is sort of this task general phenomena 00:28:34.760 |
where you scale up the model, and it's really expensive. 00:28:52.760 |
And then you have these constraints on compute, 00:28:54.720 |
because when you build Google Translate, for example, 00:28:57.280 |
you don't want people to have to wait a couple of seconds 00:29:02.200 |
And then you also happen to have a lot of in-domain data. 00:29:12.200 |
And this is sort of the opposite setting, where you don't really 00:29:18.720 |
You can just train a very small model on the data 00:29:21.000 |
and do all of the tasks without having to use a lot of compute. 00:29:28.960 |
a really promising research direction, if anyone 00:29:33.000 |
to work on predicting future emergent abilities. 00:29:36.840 |
And I haven't seen a lot of work on it recently, 00:29:38.840 |
just because I think maybe it's too hard, for example. 00:29:42.200 |
You can only predict emergence for a specific task. 00:29:52.160 |
But I think this is a pretty promising direction to work on. 00:30:10.280 |
which parameters are best scaled to get related properties? 00:30:13.800 |
Because obviously, there are many different options 00:30:19.640 |
like GPT, for example, you could add more to the inventory. 00:30:22.760 |
You could add more in [INAUDIBLE] or whatever. 00:30:32.320 |
Yeah, I would say that we don't have very principled methods 00:30:43.000 |
But some of it has to deal with how many parameters you 00:30:52.440 |
the number of attention heads and embeddings 00:30:57.680 |
But yeah, I think this is an open research question. 00:31:07.880 |
It's hard to have any principled way of doing it, 00:31:12.440 |
other than some engineers who are in charge of doing it, 00:31:15.480 |
saying, OK, I think this is the right thing to do. 00:31:21.280 |
Do you have any indication of the asymptotic behavior 00:31:26.840 |
would reach either some plateau at a finite, non-zero loss, 00:31:30.960 |
or it would just go all the way down to zero? 00:31:36.340 |
You mean on perplexity or on a particular task, 00:31:47.000 |
Well, it seems like these results are pretty general, 00:31:53.120 |
But if you take the limit of infinite parameters, 00:32:05.280 |
there's a limit to accuracy, like 100%, for example. 00:32:10.440 |
But I guess the deeper question that you might be asking 00:32:16.120 |
know how to predict the next word for any given input? 00:32:28.120 |
if I say a sentence, there are two possible next words 00:32:31.840 |
And you might not be able to guess that perfectly. 00:32:35.400 |
So I think there's some limit, but I think we're 00:32:40.440 |
that sort of indicate that there's a lot of headroom. 00:32:45.640 |
If researchers are interested in studying emergence, 00:32:51.520 |
is publicly available or best for studying this? 00:32:58.520 |
So I think the OpenAI API has a lot of language models. 00:33:04.920 |
Even at Google, it's used to study emergence. 00:33:10.220 |
And actually, a lot of these models are currently free. 00:33:18.520 |
I think there's also smaller language models. 00:33:28.080 |
There is sort of this challenge where the small language 00:33:30.520 |
models, you won't see a lot of these emergent behaviors. 00:33:37.120 |
yeah, so you kind of have to either use OpenAI API for now 00:33:44.180 |
I guess there's also the BLOOM and, like, you guys probably 00:33:49.260 |
that are publicly available, but I haven't seen 00:33:56.100 |
So my question is, are there emergent abilities 00:34:00.260 |
that are accessible in lower parameter regimes? 00:34:03.620 |
I can think of, like, more of a speech technique or [INAUDIBLE] 00:34:09.520 |
I would expect maybe there might be, like, some better-- 00:34:19.040 |
that would be emergent at, like, 8 billion parameters 00:34:21.400 |
or, like, 60 billion parameters, something like that, yeah. 00:34:27.200 |
The first question is, do you see strategy tactics 00:34:30.080 |
between the larger tech firms differing systematically 00:34:33.560 |
in studying these models, or is basically everyone 00:34:37.400 |
I wouldn't say that everyone is taking the same approach. 00:34:55.800 |
And they're super interested in, like, emergent abilities 00:34:59.080 |
because there could be emergent abilities that are undesirable 00:35:02.920 |
and they want to predict those types of things. 00:35:06.560 |
I also don't know what happens at other companies 00:35:09.840 |
other than at Google, so I can't really speak too much to that. 00:35:14.120 |
The second question, what are some examples of tasks 00:35:17.120 |
or abilities that have not yet emerged, even in models 00:35:52.660 |
into, like, smoothly increasing, emergent with GPT-3 or LaMDA, 00:35:58.380 |
emergent with PaLM, and then flat, which is, like, 00:36:04.420 |
here, they should still not have emerged yet. 00:36:08.820 |
And if you can get them to emerge, that'd be interesting. 00:36:20.340 |
I think this is, like, a couple of months old. 00:36:33.820 |
Like, originally, like, maybe only these were emergent. 00:36:38.580 |
see a couple dozen more abilities became emergent. 00:36:40.700 |
And then I suspect in a year or two, most of these 00:36:50.860 |
Why doesn't Google take as much of a safety-centered approach, 00:36:55.860 |
Are there reasons to believe harmful capabilities wouldn't 00:36:59.700 |
Yeah, I don't want to answer the question on behalf of Google. 00:37:12.780 |
even if you look at, like, the amount of research 00:37:14.820 |
that Google does, it might not be in the large language 00:37:19.260 |
But the amount of safety research that we do, I think, 00:37:34.920 |
Yeah, I'll talk about chain-of-thought prompting. 00:37:44.020 |
is this way of doing reasoning, multi-step reasoning, 00:37:53.980 |
super exciting to see a lot of people at Google working 00:38:00.580 |
present this at our last year's Google I/O press event. 00:38:10.380 |
is that we want language models to do more complicated tasks. 00:38:17.820 |
can do easy tasks like sentiment analysis or translation. 00:38:23.180 |
that might even take a human a minute or more to do? 00:38:32.100 |
So for example, instead of just giving an input-output pair, 00:38:35.260 |
we want to give them the entire reasoning process 00:38:41.460 |
And basically, you can see here, in a standard prompt, 00:38:57.860 |
And then, kind of like how your teacher would 00:39:00.140 |
ask you to show your work, you give the chain-of-thought, 00:39:04.220 |
is what we call it, or basically a reasoning path. 00:39:08.460 |
And then when the model sees this unseen question, 00:39:15.900 |
And the way that we add these prompts into the prompt 00:39:32.780 |
And basically, here's the non-chain-of-thought way 00:39:39.020 |
So basically, you would have question, answer, question, 00:39:47.900 |
They used 20 to make lunch and bought six more 00:39:56.420 |
And the only difference with chain-of-thought 00:39:58.900 |
is that you give these intermediate reasoning 00:40:10.660 |
And then now, the model for this unseen question 00:40:18.940 |
And then this actually enables the model to get it correct. 00:40:28.460 |
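To make the difference concrete, here is a sketch of the two prompt formats for that example (this is the well-known cafeteria word problem from the chain-of-thought paper; the exact wording here is paraphrased):

```python
standard_prompt = """\
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A: The answer is 9.

Q: <unseen test question>
A:"""

chain_of_thought_prompt = """\
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A: The cafeteria started with 23 apples. They used 20, so they had 23 - 20 = 3.
They bought 6 more, so they have 3 + 6 = 9. The answer is 9.

Q: <unseen test question>
A:"""

# The only change is that the exemplar's answer now spells out the intermediate
# reasoning steps before giving the final answer.
```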
So here, the task is just take the last letters 00:40:31.300 |
of the words in Bill Gates, so like L from Bill and S 00:40:50.100 |
obviously, this becomes very easy for the model. 00:40:55.940 |
The last letter of Gates is S. The answer is LS. 00:41:00.860 |
And then here, it's able to do the last letter of Elon is N. 00:41:03.940 |
And the last letter of Musk is K. And the answer is NK. 00:41:16.020 |
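The ground truth for that toy task is easy to write down; this tiny helper (just for illustration) computes what the model is being asked to produce:

```python
def last_letter_concatenation(name: str) -> str:
    """Concatenate the last letter of each word, e.g. 'Bill Gates' -> 'ls'."""
    return "".join(word[-1] for word in name.split())

print(last_letter_concatenation("Bill Gates"))  # ls
print(last_letter_concatenation("Elon Musk"))   # nk
```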
So basically, we can have these similar plots 00:41:26.580 |
So on the left, we have this math word question 00:41:41.020 |
And basically, you see that the chain-of-thought, 00:41:43.020 |
if you use a large enough model, does a lot better 00:41:49.020 |
It actually beats the fine-tuned state-of-the-art at the time. 00:41:53.620 |
A similar example is on this benchmark called Strategy QA. 00:42:00.260 |
like this world knowledge plus common sense reasoning 00:42:08.420 |
And then the model would say, a basketball is about this size. 00:42:16.420 |
see that we can beat the fine-tuned state-of-the-art 00:42:30.660 |
evaluate a chain-of-thought on a certain subset of Big Bench 00:42:35.560 |
So we created a subset called Big Bench Hard. 00:42:38.780 |
And basically, it's like 23 challenging tasks 00:42:41.780 |
from Big Bench, where no model had done better 00:42:50.700 |
is that you'd have a task description, question, options, 00:42:54.460 |
chain-of-thought, and then the test time question. 00:42:59.060 |
And so I'll give a couple examples of tasks here. 00:43:06.660 |
Basically, what the language model has to do in this task 00:43:11.620 |
So the question is, if you follow these instructions, 00:43:16.260 |
Turn left, turn right, take five steps, take four steps, 00:43:21.100 |
And then the model, following the few shot exemplars, 00:43:25.540 |
is able to basically track state after all of the actions. 00:43:52.860 |
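What "tracking state" means here can be written down directly; this is a minimal sketch of the underlying computation for a navigate-style question (the instruction list below is illustrative, not the exact Big Bench item):

```python
def returns_to_start(instructions: list[str]) -> bool:
    """Simulate turns and steps on a 2D grid, starting at the origin facing north."""
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # north, east, south, west
    facing, x, y = 0, 0, 0
    for inst in instructions:
        if inst == "turn left":
            facing = (facing - 1) % 4
        elif inst == "turn right":
            facing = (facing + 1) % 4
        elif inst.startswith("take"):
            steps = int(inst.split()[1])
            dx, dy = directions[facing]
            x, y = x + steps * dx, y + steps * dy
    return (x, y) == (0, 0)

# After "turn left, turn right, take 5 steps, take 4 steps" you are 9 steps away,
# so the answer to "do you return to the starting point?" is no.
print(returns_to_start(["turn left", "turn right", "take 5 steps", "take 4 steps"]))  # False
```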
And basically, the model has to sort them alphabetical order. 00:43:57.780 |
And here, the model can follow the few shot exemplars. 00:44:00.020 |
So you have this pretty complicated chain-of-thought 00:44:05.260 |
where the model has to sort each of the subparts. 00:44:08.900 |
And then finally, it gets to the final answer, which is correct. 00:44:15.580 |
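And the target behavior for the word-sorting task is again trivial to state in code (a made-up word list for illustration):

```python
words = ["syntax", "baker", "mongoose", "burley"]  # made-up example list
print(" ".join(sorted(words)))  # baker burley mongoose syntax
```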
So here's this result summary on the subset of BigBench. 00:44:40.220 |
And then prior results, the model was doing way worse. 00:44:54.460 |
And actually, you can use this one for free with the OpenAI API. 00:44:58.220 |
And basically, if you do answer-only prompting 00:45:03.580 |
are beating the average human reader on 5 of 27. 00:45:09.500 |
then the performance increases by this pretty decent amount. 00:45:19.980 |
of the tasks that are doing worse than humans in red 00:45:27.620 |
Isn't this similar to RLHF in spirit, at least? 00:45:49.340 |
and you use a prompting technique that includes 00:45:54.340 |
The way that RLHF works is that you have this additional data 00:46:02.420 |
sort of predicts how well does a certain output-- 00:46:09.420 |
how likely is that to be preferred by humans? 00:46:12.300 |
And then RLHF, what that does is it fine-tunes the language 00:46:17.620 |
model to do well on the preference model's prediction. 00:46:20.660 |
So basically, it's sort of aligning the model 00:46:41.540 |
is that you have to have chain-of-thought intermediate 00:46:48.460 |
it can be costly to gather that data and to do the fine-tuning. 00:46:58.220 |
that chain-of-thought and prompt engineering in general 00:47:00.940 |
is just an artifact that won't be necessary for larger scale 00:47:04.220 |
models that are better able to understand the prompts 00:47:11.340 |
Basically, the question is how ephemeral is prompt engineering 00:47:18.340 |
But some initial intuitions are that for easy tasks that 00:47:22.740 |
are easy to describe and maybe they're multiple choice, 00:47:26.620 |
larger models will probably be more robust to prompt 00:47:29.700 |
And there's sort of less you can do with that. 00:47:31.740 |
But I think as language models get more powerful, 00:47:40.140 |
And in those tasks, you'll have to specify exactly what you 00:47:44.420 |
So I think there will still be some room for prompt 00:47:46.460 |
engineering there, at least in the near future. 00:47:50.140 |
Do you know how well this chain-of-thought prompting is 00:47:52.660 |
generalizing to, for example, you showed these two tasks, 00:47:55.180 |
right, a simple math and a map, and then the other one 00:47:58.180 |
basically sorting the words alphabetically, right? 00:48:02.140 |
So I mean, I see that's the case with the math. 00:48:04.940 |
Math has to give this chain-of-thought prompting, 00:48:19.380 |
So for some tasks where you've seen similar data 00:48:24.380 |
in pre-training, the model can do really well, 00:48:26.780 |
even if the chain-of-thought is from another task. 00:48:31.300 |
you actually don't really need a math chain-of-thought, 00:48:33.580 |
because the model already knows how to do that. 00:48:38.420 |
haven't seen any data that's like the chain-of-thought here. 00:48:44.540 |
on tasks like this without manually writing them 00:48:52.620 |
I'm wondering, as the researcher behind this, 00:48:55.660 |
what mental model would lead you to even try this? 00:48:59.500 |
Do you perceive the model as, if I was a person, 00:49:02.860 |
Or is it trying to give it more compute in order 00:49:12.140 |
I think my motivation was just thinking about it, 00:49:14.380 |
just as you said, what's going on in a human's mind 00:49:21.540 |
And well, if you notice, at least some humans 00:49:28.420 |
So if you pay attention a lot to what's going on in your mind, 00:49:35.860 |
And so well, the language model can think in language, too. 00:49:50.380 |
actually coincided with the development of PaLM. 00:49:58.820 |
allowed us to do a lot more challenging tasks 00:50:09.580 |
we're saying that it matters that the absolute number 00:50:14.380 |
or whatever in the data set or in the fine tuning, 00:50:21.260 |
Or is it this relative number of frequency of those examples 00:50:29.020 |
Do those matter as much as the absolute number 00:50:38.300 |
So I guess the challenging thing is we can't really 00:50:40.500 |
measure how many similar examples are in the training 00:50:46.900 |
And I don't think anyone has done that before. 00:51:00.420 |
Yeah, I think it's an open question of why it works. 00:51:03.760 |
I mean, you said, OK, think about how new things-- 00:51:21.120 |
I mean, is there a shift in what specific task? 00:51:34.080 |
I guess the way that I think about it is that it'd be like if I give you a really hard question 00:51:39.980 |
and ask you to give me the answer within half 00:51:43.000 |
a second, which is basically what you're doing with the model 00:51:45.720 |
and when you don't do a chain of thought, right? 00:51:47.760 |
You're basically asking this challenging question, 00:51:52.080 |
to solve it in one pass to give you the answer immediately. 00:52:00.840 |
is that the model has learned a compositional set of skills 00:52:05.920 |
during pre-training, so maybe it hasn't really 00:52:08.800 |
learned this particular navigate task during pre-training. 00:52:14.400 |
It's learned like, OK, if you take five steps 00:52:18.840 |
should add five here or something like that, right? 00:52:33.720 |
And then maybe if you can combine them together 00:52:35.680 |
in some clever way, then you can get the model to solve 00:53:01.300 |
That's a good example of how we judge these tasks, anyway. 00:53:11.800 |
Yeah, feel free to keep asking questions if you have any. 00:53:16.040 |
So yeah, here's another example of emergence. 00:53:22.100 |
three models here, InstructGPT, Codex, and PaLM. 00:53:25.220 |
Chain of thought in blue and non-chain of thought 00:53:45.200 |
or not say anything coherent or never get a final answer, which 00:53:48.040 |
is why using chain of thought for the small models 00:53:57.140 |
is going to be able to solve the task at a lot higher 00:54:01.700 |
And another cool thing about chain of thought 00:54:07.500 |
is there are some tasks where you wouldn't get 00:54:18.660 |
But you can see that if you use chain of thought, 00:54:22.420 |
you can unlock this emergent performance in smaller models. 00:54:26.220 |
So one example here is multi-step arithmetic, 00:54:37.980 |
Here's a question, and then the next token is correct. 00:54:42.900 |
But with chain of thought, you can get 50% accuracy on this 00:54:45.900 |
just by having the model output these intermediate reasoning 00:54:56.220 |
This is something that needs to have intuition 00:55:01.740 |
Abstractly, I know that a transformer can definitely 00:55:08.180 |
But it can take in the numbers and do the carries. 00:55:12.660 |
But then there's this question of what happens empirically. 00:55:18.660 |
And I understand that it isn't necessarily a lot space 00:55:25.300 |
My question is, how do we tell the difference? 00:55:31.140 |
Are there ways to tell the difference between things 00:55:34.180 |
that haven't emerged because there's just no space? 00:55:37.780 |
Or there's so many tasks that it couldn't allow any space 00:55:44.660 |
to specifically do that one, versus the task is so hard 00:55:49.300 |
that it just can't even if you use all the capacity to try 00:55:58.300 |
I think there seems to be some subset of tasks 00:56:07.900 |
So for example, in language models, we use tokens, right? 00:56:18.220 |
It takes this embedding that's 1,000 dimensions or something. 00:56:22.620 |
Or if you give it a word and ask it to reverse the letters, 00:56:26.540 |
this is a super easy task, but the way we train the model 00:56:29.140 |
doesn't actually look at the letters and stuff. 00:56:33.540 |
where it doesn't really just fit well with the way 00:56:38.660 |
And I think if you really care about these tasks, 00:56:43.940 |
you can just solve them using code or something like that. 00:56:51.020 |
an inherent-- something that would never emerge 00:56:59.500 |
Also, by the way, sorry, I forgot to mention. 00:57:01.420 |
Somebody asked, can you repeat the questions? 00:57:09.480 |
be a viable interpretability technique for very advanced AI 00:57:13.700 |
And they mentioned that there is some research 00:57:21.580 |
Will it be a viable interpretability technique 00:57:32.340 |
be a viable interpretability technique for AI? 00:57:36.380 |
I think there's no guarantee that the chain of thought 00:57:40.020 |
is how the model actually arrives at the final answer. 00:57:45.780 |
why isn't the model getting this question correct? 00:57:47.820 |
Or what can we do better in the chain of thought 00:57:53.700 |
I haven't read the anthropic paper that was mentioned. 00:58:07.340 |
was that you can actually do multilingual chain of thought 00:58:14.340 |
translated this benchmark of math word problems 00:58:19.860 |
And then we prompt the model to do it in, say, Bengali. 00:58:25.700 |
the math problem in Bengali and give the final answer. 00:58:31.040 |
is that this input is highly improbable, right? 00:58:33.780 |
So Bengali is 0.01% of the pre-training data. 00:58:44.740 |
the model can actually do these types of questions pretty well, 00:58:50.980 |
If you ask people before I showed them this result, 00:58:53.100 |
how well can the model do these math questions in Swahili? 00:58:57.500 |
But actually, even very underrepresented languages 00:59:06.940 |
the model can do surprisingly well despite the fact 00:59:16.300 |
Actually, speaking to this, and most of my experience 00:59:22.860 |
despite not being explicitly trained in these languages, 00:59:25.900 |
it seems to have derived reasoning independent 00:59:34.100 |
does the reasoning in English, and it translates back 00:59:36.300 |
to the other language, because the answers it gives you 00:59:43.060 |
So do you think that learning the structure of a language 00:59:50.660 |
or that it inherently will learn a chain of thought reasoning 00:59:53.820 |
within that language, within the structure of the language, 01:00:12.960 |
And I think, definitely, there's something language agnostic 01:00:15.500 |
going on, where the model learns reasoning sort of independently 01:00:23.140 |
But I don't think we know the answer to that yet. 01:00:26.540 |
Yeah, so basically, one question that comes up frequently 01:00:36.580 |
is, why does scaling up improve chain of thought? 01:00:43.660 |
and see what types of errors are fixed from scaling up 01:00:49.020 |
And you can see that, for these three categories 01:00:51.160 |
that we came up with, some errors in each of them get fixed. 01:00:54.220 |
So scaling seems to have this universal effect 01:00:57.100 |
on improving different types of errors from smaller models. 01:01:09.900 |
that are doable with standard prompting, so in blue. 01:01:13.060 |
And then the goal of chain of thought prompting 01:01:15.700 |
is to sort of increase the set of tasks that we can do. 01:01:22.820 |
include math word problems, symbolic reasoning, 01:01:37.100 |
because of the fact that you do more computations when 01:01:45.740 |
You create multiple embeddings to sort of adjust the things 01:01:51.660 |
How much of that have you tried non-chain of thought prompts 01:02:05.100 |
I think it's about using the language to guide the model as part 01:02:11.040 |
And have you tried describing the problem in more detail 01:02:24.220 |
Yeah, you mean like describing the question in three 01:02:27.620 |
Yeah, just describing the question in more detail 01:02:29.100 |
instead of explicitly doing the step-by-step things 01:02:33.100 |
I haven't tried that, but I would be surprised if that 01:02:50.120 |
So like, there is something out of the [INAUDIBLE] 01:03:01.260 |
Can we do like just any amount of extra calculation 01:03:07.540 |
Like, in a way, you know, in a way like in an ablation, 01:03:09.980 |
chain of thought is like a very structured thing. 01:03:13.180 |
It's like, what if the same structure is preserved, 01:03:23.060 |
I think like outputting tokens is pretty important for the 01:03:37.140 |
So the last part, I think, is a pretty cool trick 01:03:45.200 |
is they'll just generate one chain of thought, 01:03:49.940 |
But there's this nice trick called self-consistency 01:03:54.460 |
with the model to generate like a bunch of different reasoning 01:04:02.940 |
like improving performance by like a pretty big margin. 01:04:08.380 |
on GSM8K, which is like the math word problem data set, 01:04:18.140 |
then it becomes 74, which is like a pretty big improvement. 01:04:23.720 |
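Here is a minimal sketch of self-consistency as described; `sample_chain_of_thought` and `extract_final_answer` are hypothetical stand-ins for a temperature-sampled language model call and an answer parser:

```python
from collections import Counter
from typing import Callable

def self_consistency(question: str,
                     sample_chain_of_thought: Callable[[str], str],
                     extract_final_answer: Callable[[str], str],
                     n_samples: int = 16) -> str:
    """Sample several reasoning paths with temperature > 0, then majority-vote the final answers."""
    answers = []
    for _ in range(n_samples):
        chain = sample_chain_of_thought(question)   # stochastic: a different reasoning path each call
        answers.append(extract_final_answer(chain))
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```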
Here, how many are we averaging over for self-consistency? 01:04:27.940 |
So it increases the cost of the inference time compute. 01:04:36.620 |
but I'm curious to know how many samples or how many chains 01:04:41.180 |
like, what is the trade-off between number of chains 01:04:49.060 |
sorry, the question is, how many chains do you 01:04:53.220 |
I think the answer really depends on the data set. 01:04:58.140 |
But usually, you can get something good with like 16, 01:05:04.220 |
How does the temperature change the way the model works? 01:05:08.260 |
Oh, OK, the question is, how does the temperature 01:05:12.020 |
Basically, when you use temperature decoding, 01:05:19.060 |
pick one of the outputs instead of always picking 01:05:23.820 |
So basically, you get these more stochastic outputs 01:05:27.500 |
that are still based on what the language model has learned, 01:05:33.660 |
And then finally, yeah, self-consistency also 01:05:42.580 |
I guess part of it is because chain of thought is emergent 01:05:47.380 |
than the random performance without doing chain of thought. 01:06:10.840 |
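Tying back to the earlier question about temperature: temperature just reshapes the next-token distribution before sampling. A minimal sketch using only the standard library (the toy logits are invented):

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 0.7) -> str:
    """Sample a token from softmax(logits / temperature); lower temperature approaches greedy decoding."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())                        # subtract max for numerical stability
    weights = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    threshold = random.random() * sum(weights.values())
    for tok, w in weights.items():
        threshold -= w
        if threshold <= 0:
            return tok
    return tok  # floating-point edge case: fall back to the last token

# Greedy decoding always picks the highest-logit token; temperature sampling
# sometimes picks an alternative, which is what produces the diverse reasoning
# paths that self-consistency aggregates over.
print(sample_next_token({"gold": 2.0, "glib": 1.0, "great": 0.5}))
```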
So I think in addition to just purely scaling up 01:06:15.460 |
the language model, which is only available to people 01:06:17.820 |
in the industry, I think there's a couple interesting directions 01:06:30.020 |
of knowing what the best way to prompt language models is. 01:06:40.420 |
heard, to train therapists, to help with creative writing, 01:06:44.420 |
I think ChatGPT has really shown what language models 01:06:50.660 |
I think benchmarks are also something that's pretty lacking 01:06:54.060 |
because I think we solve benchmarks pretty quickly. 01:06:59.860 |
on Big Bench within a year or something of Big Bench 01:07:05.700 |
and I think that's going to be an important contribution. 01:07:13.180 |
have compute-efficient methods to make language models better 01:07:29.220 |
And feel free to email me if you have any feedback.