
Stanford CS25: V2 I Emergent Abilities and Scaling in LLMs



00:00:00.000 | [AUDIO OUT]
00:00:06.780 | OK, great.
00:00:07.340 | So the first sort of paper that I'm going to be talking about
00:00:10.380 | is called "Emergent Abilities of Large Language Models."
00:00:13.540 | And this paper was especially cool,
00:00:15.780 | I think, because we got people from Google and DeepMind,
00:00:19.100 | and also Stanford.
00:00:20.580 | You might recognize Percy, or Tatsu, or Rishi.
00:00:24.360 | I mean, we got people to sort of agree
00:00:25.900 | on a nice framework for looking
00:00:28.020 | at why we want to scale and at emergent abilities.
00:00:33.660 | So one of the things that we've seen throughout language models
00:00:38.960 | is that you sort of get these predictable gains
00:00:41.140 | as a result of scaling.
00:00:42.620 | So here's the canonical Kaplan et al.
00:00:45.340 | paper, where you can see that if you scale up
00:00:49.380 | the size of the language model, measured either in compute,
00:00:52.620 | in data set size, or in parameters,
00:00:55.980 | you see that the loss on the test set
00:00:59.780 | actually goes down predictably.
00:01:01.100 | [INAUDIBLE]
00:01:01.600 | Yeah.
00:01:02.100 | I don't know if you're screen sharing, so people in Zoom,
00:01:03.820 | I don't think, can see the slides.
00:01:05.700 | Yeah, yeah, yeah.
00:01:06.460 | Sorry, let me fix that.
00:01:07.340 | [INAUDIBLE]
00:01:11.900 | I guess I'll say this for the third time.
00:01:14.980 | As we've seen in language models,
00:01:17.660 | if you scale up the size of the language model,
00:01:20.100 | measured either in compute, in data set size,
00:01:23.260 | or in number of parameters, you can
00:01:24.940 | see that there's sort of this predictable improvement
00:01:28.200 | in the test loss.
00:01:29.060 | Now, what I'm going to be talking about in terms
00:01:34.420 | of emergence is something that's actually
00:01:36.500 | unpredictable if you only look at smaller language models.
00:01:40.780 | So one way that emergence has been described in the broader
00:01:44.580 | science literature is it's basically
00:01:46.540 | seen as a qualitative change that arises
00:01:49.540 | from quantitative changes.
00:01:51.540 | It sort of started with this article in Science
00:01:54.260 | called "More is Different,"
00:01:56.580 | by the Nobel Prize-winning physicist Philip Anderson.
00:01:59.340 | And I really like this post from Jacob Steinhardt
00:02:03.260 | that sort of describes emergence.
00:02:04.780 | And he gives a couple of good examples here.
00:02:08.700 | For example, he says, "With a bit of uranium,
00:02:11.700 | nothing special happens.
00:02:13.460 | With a large amount of uranium packed densely enough,
00:02:15.980 | you get a nuclear reaction.
00:02:18.540 | And then also with DNA, for example,
00:02:20.340 | given only small molecules such as calcium,
00:02:22.820 | you can't meaningfully encode useful information.
00:02:25.180 | But given larger molecules such as DNA, you can encode a genome."
00:02:28.020 | So for this particular work, we use this definition
00:02:35.420 | of emergent abilities of large language models in particular.
00:02:38.900 | So we say that ability is emergent
00:02:42.100 | if it is not present in smaller models,
00:02:44.540 | but it is present in larger models.
00:02:47.940 | And the sort of natural question here
00:02:49.780 | is how do you measure the size or the scale of the language
00:02:53.180 | model?
00:02:54.220 | And there's sort of traditionally three axes
00:02:56.660 | of scale.
00:02:57.180 | So the training flops are the amount of compute
00:02:59.740 | that you use to train the language model,
00:03:02.300 | the number of model parameters, or the size of the language
00:03:04.900 | model, and also the size of the training data set
00:03:07.340 | that the model is trained on.
00:03:09.940 | And a lot of the plots here will use either training flops
00:03:13.340 | or the number of model parameters.
00:03:15.620 | The reason is that the training data set size is usually
00:03:17.980 | fixed for different size models.
00:03:20.780 | And because training flops is just the data set size
00:03:23.260 | times model parameters, you can get a similar plot
00:03:27.300 | from either training flops or number of model parameters
00:03:29.980 | for most language models.
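As a rough back-of-the-envelope illustration of that relationship, here is a minimal sketch in Python. The factor of 6 (covering the forward and backward passes) is a commonly cited approximation, not a number from the talk, and the example sizes are illustrative:

```python
# Rough rule of thumb: training compute scales with parameters times tokens.
# The constant 6 is an assumption (forward + backward pass), not from the talk.
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Example: a 175B-parameter model trained on 300B tokens.
print(f"{approx_training_flops(175e9, 300e9):.2e}")  # ~3.15e+23 FLOPs
```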
00:03:31.420 | Great.
00:03:35.180 | And so the first type of emergence--
00:03:38.140 | yeah, sorry, go ahead.
00:03:40.940 | Yeah.
00:03:41.700 | For me, it seems like it would be relatively easier
00:03:44.660 | to measure the size versus what qualifies
00:03:47.580 | as an ability.
00:03:48.560 | How do you define what counts as an actual ability versus not?
00:03:54.140 | Yeah, sure.
00:03:54.640 | [INAUDIBLE]
00:03:57.140 | So yeah, for example, I'll just give an example here,
00:04:01.460 | which is actually next slide.
00:04:03.500 | So basically, we have this way of interacting
00:04:06.220 | with language models called few-shot prompting.
00:04:08.900 | And the way it works is the language model
00:04:11.380 | is a really good next-word predictor.
00:04:14.540 | And when you give the model an example,
00:04:17.980 | say a movie review with its sentiment,
00:04:22.780 | and then give it an unseen movie review and ask, what's the output?
00:04:25.340 | And then here, the language model
00:04:26.740 | can say positive because it understands
00:04:28.460 | to use the context from the review to give the next token.
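To make the few-shot prompting setup concrete, here is a minimal sketch of how such a prompt is assembled. The exemplar texts are made up, and `call_language_model` is a hypothetical stand-in for whatever completion API you use:

```python
# Build a few-shot sentiment prompt. The model only sees this text and
# predicts the next tokens; it is never fine-tuned on the task.
exemplars = [
    ("The movie was fantastic, I loved every minute.", "positive"),
    ("A dull, lifeless film with no redeeming qualities.", "negative"),
]
query = "An unexpectedly moving story with great performances."

prompt = ""
for review, label in exemplars:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

# completion = call_language_model(prompt)  # hypothetical API call;
#                                           # expected continuation: " positive"
print(prompt)
```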
00:04:34.140 | And the definition that we use for having ability or not
00:04:38.900 | is that basically, a few-shot prompted task, for example,
00:04:42.900 | sentiment analysis, is emergent if it
00:04:45.420 | has random accuracy for small models
00:04:47.980 | but above random accuracy for large models.
00:04:50.700 | Does that make sense?
00:04:51.580 | So basically, if the model isn't doing any better than random,
00:04:54.660 | then when you say it doesn't have the ability
00:04:56.200 | to do this particular task.
00:04:57.500 | And I'll give a few examples here.
00:05:05.740 | So here's the canonical way that we
00:05:08.300 | look at plots for emergence.
00:05:09.700 | So basically, what we have, each of these different plots
00:05:14.620 | is a different task.
00:05:15.900 | And I'll go over some examples soon.
00:05:18.100 | But the way you read the plot is the x-axis
00:05:20.700 | is the number of training flops or the model scale.
00:05:24.620 | And then the y-axis is the accuracy
00:05:26.820 | or how good the model is doing the task.
00:05:30.340 | And then we have different language models
00:05:33.220 | from OpenAI, from Google, and from DeepMind.
00:05:36.340 | And then each of the points is a different language model.
00:05:39.340 | It's not a language model over the course of training.
00:05:42.260 | Each point is a different language model.
00:05:45.060 | And what you see is that for the very small language models,
00:05:49.420 | you basically get performance that's close to random
00:05:52.740 | or not being any better than random.
00:05:55.980 | And then once you pass a certain threshold,
00:05:59.540 | you can see that the performance suddenly
00:06:01.740 | gets substantially above random.
00:06:05.580 | And this is what we call emergence.
00:06:07.040 | So basically, if you were to extrapolate the lines
00:06:11.300 | from the small language models, you
00:06:13.140 | might predict that it would never do better than random
00:06:16.040 | because it's just a flat line.
00:06:17.760 | But the interesting phenomenon is
00:06:19.180 | that when you scale up past a certain threshold,
00:06:21.620 | you actually do see this emergent phenomenon where
00:06:23.740 | the model does a lot better than random.
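One way to read these plots programmatically: call a task emergent if accuracy stays near the random baseline for all of the smaller models and clearly exceeds it at the largest scale. This is a simplified sketch; the 5-point margin is an arbitrary assumption, and the toy numbers are only loosely shaped like the curves in the talk:

```python
def is_emergent(points, random_accuracy, margin=0.05):
    """points: (training_flops, accuracy) pairs for one task.
    Emergent = flat-at-random for the smaller models, clearly above
    random for the largest model."""
    points = sorted(points)                 # sort by scale
    smaller = [acc for _, acc in points[:-1]]
    largest_acc = points[-1][1]
    near_random = all(acc <= random_accuracy + margin for acc in smaller)
    above_random = largest_acc > random_accuracy + margin
    return near_random and above_random

# Toy numbers for a 4-way multiple-choice task (random accuracy = 0.25).
points = [(1e20, 0.25), (1e21, 0.26), (1e22, 0.25), (1e24, 0.45)]
print(is_emergent(points, random_accuracy=0.25))  # True
```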
00:06:25.420 | So let me go over some concrete examples.
00:06:32.420 | So here's one task.
00:06:34.380 | It's basically a benchmark called multitask NLU or MMLU.
00:06:40.180 | And basically, what it is, it's a bunch of test questions
00:06:45.100 | ranging from high school all the way to professional level
00:06:47.580 | exams.
00:06:49.820 | And how it works is the language model is given--
00:06:53.660 | for example, here is a high school math example.
00:06:57.260 | And the language model is given a few examples.
00:06:59.840 | And then for an unseen question, it has to give the answer.
00:07:04.100 | And then you can see in the plot on the right--
00:07:06.620 | so for the model scale, if you go up
00:07:08.860 | to sort of 10 to the power of 22 training flops,
00:07:12.300 | you don't actually get any better than random accuracy
00:07:14.580 | on this task.
00:07:15.700 | But if you scale up to 10 to the 24 training flops,
00:07:19.220 | then you see that all the three models there
00:07:22.100 | do much better than random accuracy.
00:07:24.220 | Yeah, go ahead.
00:07:28.260 | The scale of the data used to train this,
00:07:30.700 | is it roughly similar?
00:07:31.980 | Or because these are different models trained
00:07:33.980 | by different boards?
00:07:36.140 | Yeah, the scale is, I think, within an order of magnitude
00:07:39.580 | for these models here, yeah.
00:07:42.460 | But every single dot on each individual track
00:07:45.740 | uses the same data?
00:07:47.700 | Yes, the data is fixed except for Chinchilla.
00:07:50.820 | Chinchilla uses more data for larger models.
00:07:53.020 | But I believe for all the other models
00:07:55.740 | here, the amount of data is the same.
00:08:01.220 | Yeah, here's just another example
00:08:02.660 | to show it more concretely.
00:08:05.900 | So this is a task from the Big Bench Benchmark.
00:08:10.620 | Just as an aside, the Big Bench benchmark
00:08:12.300 | is like 200 benchmarks.
00:08:15.340 | And basically, it's a crowdsourced set
00:08:17.300 | of benchmarks that I'd recommend looking at if you're
00:08:19.260 | doing that kind of work.
00:08:21.300 | And basically, the task is the language model
00:08:23.900 | has to take an English sentence and then
00:08:26.960 | give the international phonetic alphabet transliteration,
00:08:30.580 | the IPA transliteration, which is basically
00:08:33.020 | how to pronounce it.
00:08:35.420 | And for this task, the evaluation metric
00:08:38.860 | is actually BLEU, which is an n-gram overlap metric.
00:08:43.180 | And you get a similar phenomenon,
00:08:46.020 | where as you increase the size of the language model,
00:08:49.940 | it's flat for a while.
00:08:51.820 | And then suddenly, the performance jumps above random.
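For reference, here is a heavily simplified sketch of an n-gram overlap score in the spirit of BLEU: unigram precision with clipping only, no brevity penalty and no averaging over several n-gram orders. It is meant only to illustrate the kind of metric being used here:

```python
from collections import Counter

def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams that also appear in the reference (clipped)."""
    def ngrams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

print(ngram_precision("the cat sat on the mat", "the cat sat on a mat"))  # ~0.83
```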
00:08:54.420 | Great.
00:08:59.420 | So I'll talk about another interesting result
00:09:03.140 | here that relates to emergence.
00:09:04.540 | So this was a technical report that we put out
00:09:07.420 | a couple of months ago.
00:09:09.180 | And basically, there's this really interesting prize in--
00:09:13.100 | or it's like a one-time prize in language models,
00:09:17.740 | where Anthropic, which is like a startup,
00:09:21.420 | basically had this prize where if people could come up
00:09:23.700 | with a task where the performance on the task
00:09:27.460 | would actually decrease as you increase
00:09:31.020 | the size of the language model, then you
00:09:33.580 | would get a bunch of money.
00:09:35.460 | So basically, there are a lot of submissions to this.
00:09:37.620 | And here's one example of a task where
00:09:40.100 | they found that the performance would actually
00:09:42.140 | decrease if you increase the size of the language model.
00:09:45.260 | So the task is--
00:09:46.420 | I'll just read it here.
00:09:47.340 | It's like, repeat my sentences back to me.
00:09:49.380 | And then the input is like, all that glisters is not glib.
00:09:53.060 | And then the output is--
00:09:55.620 | the model has to accurately say glib.
00:10:01.340 | And so what happened is for the small language model,
00:10:06.100 | it doesn't know the phrase, all that glisters is not gold.
00:10:09.420 | So it just copies the input and actually
00:10:11.260 | gets it right-- it's like 100% on that.
00:10:14.860 | But then for the medium-sized language model,
00:10:17.820 | what you would see is that the performance actually
00:10:20.100 | decreased because the medium-sized language
00:10:22.220 | model knows the phrase, all that glisters is not gold.
00:10:24.860 | And then it says gold, which actually is not
00:10:26.700 | what the task asks it to do.
00:10:28.980 | Yeah, go ahead.
00:10:29.660 | Someone on Zoom asked, can you give a physical estimate
00:10:32.060 | of 10 to the 24 flops, possibly in terms of training
00:10:35.100 | time or number of GPUs?
00:10:38.500 | Yeah, so I think 10 to the 24 flops is around--
00:10:45.900 | so at Google, we use TPUs.
00:10:48.260 | And one pod of TPUs, I believe, is equal to like 4,000 A100s.
00:10:55.780 | And 10 to the 24 flops is like two pods for around six weeks
00:11:00.580 | or something like that.
00:11:02.420 | So it's a lot of compute to do the pre-training.
00:11:04.740 | I don't know.
00:11:08.180 | Do you guys remember in chemistry class
00:11:10.020 | when you'd have moles?
00:11:12.740 | And it would be like 10 to the 23.
00:11:14.380 | And then your teacher would be like, oh,
00:11:16.020 | I don't even think about how big this number is.
00:11:19.020 | That's the number of floating point operations
00:11:21.780 | that goes into the pre-training of some of these models.
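As a sanity check on that scale, here is a back-of-the-envelope calculation. The per-chip throughput and chip count are illustrative assumptions, not official numbers, but they land in the same ballpark as the speaker's estimate:

```python
# How long does 1e24 FLOPs take at a given sustained throughput?
total_flops = 1e24
flops_per_chip = 1e14   # assume ~100 TFLOP/s sustained per accelerator
num_chips = 4000        # assume a few thousand accelerators

seconds = total_flops / (flops_per_chip * num_chips)
print(f"{seconds / (86400 * 7):.1f} weeks")  # ~4.1 weeks under these assumptions
```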
00:11:26.460 | OK, great, anyways.
00:11:28.220 | So yeah, so basically, the medium-sized language model
00:11:32.300 | would actually do worse.
00:11:33.420 | Oh, yeah, did you have another question?
00:11:35.620 | Yeah, oh, wait, did this one win the prize or not?
00:11:40.740 | This one is one of the winners.
00:11:43.660 | I think it's a third-place winner or something.
00:11:46.040 | Yeah?
00:11:46.540 | [INAUDIBLE]
00:11:47.040 | What do you mean, flip a negative sign?
00:11:59.980 | Well, because all of this depends on which evaluation
00:12:04.020 | scheme you use to measure if you do your task right.
00:12:07.860 | So it's like the measurement is very sparse.
00:12:11.340 | You only get credit if you do it perfectly or something.
00:12:14.180 | Yeah, yeah.
00:12:14.700 | [INAUDIBLE]
00:12:15.180 | Yeah, so for this thing, they accounted for all--
00:12:37.180 | you can't just define the task
00:12:38.740 | as doing badly on something.
00:12:40.580 | It has to be a meaningful sort of task.
00:12:43.340 | And then, I guess, your point about how credit assignment
00:12:46.760 | or the evaluation metric works is actually a really good one.
00:12:50.740 | Yeah, so I guess it still kind of counts if--
00:12:54.500 | I guess the argument is that the performance might not
00:13:00.420 | look emergent if you assign partial credit.
00:13:03.260 | But we have a bunch of--
00:13:05.180 | I can show an example later.
00:13:06.380 | But even if you use partial credit metrics,
00:13:09.060 | you'll often still see the same type of emergence.
00:13:11.900 | So it's not purely a phenomenon of not
00:13:14.340 | assigning partial credit based on the evaluation metric.
00:13:16.620 | And then-- great.
00:13:24.540 | So what we sort of argued in this paper
00:13:27.160 | is that, yeah, there might be some tasks where
00:13:29.420 | the performance starts to decrease if you use a medium
00:13:31.740 | sized language model.
00:13:33.780 | But if you keep scaling all the way to the largest model
00:13:38.260 | that we have at Google that's known publicly, PaLM,
00:13:43.420 | you'll see that the language model can actually
00:13:46.260 | go back and do the task correctly.
00:13:47.740 | Because the large language model also
00:13:51.260 | knows the phrase, all that glisters is not gold.
00:13:54.100 | But it also understands, repeat my sentences back to me.
00:13:58.420 | So it's able to get 100% on this task.
00:14:01.060 | So this is a different type of emergence also.
00:14:04.820 | And another class of emergence that we talk about in the paper
00:14:09.420 | is an emergent prompting technique.
00:14:12.020 | So basically, other than future prompting,
00:14:15.020 | there's other ways of interacting
00:14:16.780 | with the language models that can be considered emergent.
00:14:21.780 | Yeah.
00:14:22.300 | Can I interrupt?
00:14:23.020 | Sorry, somebody else had a question on the previous one.
00:14:25.300 | Yeah.
00:14:25.820 | [INAUDIBLE]
00:14:27.300 | The question is, did all models undergo instruction
00:14:30.100 | fine-tuning?
00:14:31.580 | None of these models underwent instruction fine-tuning
00:14:34.900 | for this plot.
00:14:35.900 | Yeah.
00:14:36.400 | Great.
00:14:41.280 | Yeah.
00:14:41.780 | So one way of interacting with language models
00:14:45.780 | is by basically fine-tuning the model using
00:14:49.780 | a technique called RLHF.
00:14:51.620 | And basically, the way it works is you have this data,
00:14:54.180 | and humans rate preferences over the model's outputs,
00:14:57.620 | saying what type of outputs
00:15:01.260 | they prefer.
00:15:02.620 | And then the model is trained with RL to optimize
00:15:05.500 | for human preferences.
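A minimal sketch of the preference-model piece of that pipeline, assuming a Bradley-Terry-style pairwise loss over scalar reward scores. The reward values below are made up, and the actual RL step that follows is omitted:

```python
import math

def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """-log sigmoid(r_preferred - r_rejected): low when the reward model
    ranks the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_preferred - reward_rejected))))

# Toy rewards from a hypothetical reward model for two candidate responses.
print(pairwise_preference_loss(2.0, 0.5))  # ~0.20: ranking matches the human label
print(pairwise_preference_loss(0.5, 2.0))  # ~1.70: ranking disagrees with the label
```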
00:15:08.780 | And what this plot is showing here
00:15:10.700 | is that if you do this RLHF on the model,
00:15:14.180 | the model performance on a different zero-shot task
00:15:17.700 | actually gets worse for small models.
00:15:19.580 | You can see the blue line is above the orange line.
00:15:22.660 | Blue line is the baseline.
00:15:23.780 | The orange line is RLHF.
00:15:26.900 | And then if you do it for large models,
00:15:28.780 | though, then you can see that the performance actually
00:15:32.060 | has a positive delta from doing RLHF.
00:15:34.220 | And so this is an interesting thing
00:15:38.860 | where a certain technique might only
00:15:41.260 | help if you try on a large enough language model.
00:15:43.420 | So if you only try it on the small language models,
00:15:46.180 | it'd be tough to draw the conclusion that it wouldn't
00:15:49.660 | help performance.
00:15:51.560 | And then later, I'll talk about chain-of-thought prompting
00:15:53.980 | as another emergent prompting technique.
00:15:58.660 | So here's the hand-wavy diagram that I
00:16:01.060 | used to think about emergence as a framework.
00:16:05.260 | So on the x-axis here, there's a scale of the language model.
00:16:09.860 | And on the y-axis is an imaginary scale of a range of things
00:16:16.380 | that a language model can do.
00:16:18.500 | And then basically, you can pick some random point,
00:16:20.700 | like, say, 100 billion parameters in the language model.
00:16:23.620 | And there will be certain abilities.
00:16:26.420 | So first, you can see, as you increase
00:16:28.140 | the size of the language model, the number
00:16:30.220 | of tasks or things that the language model can do increases.
00:16:33.740 | And then you can see there are some tasks where--
00:16:38.180 | models above 100 billion parameters, for example,
00:16:40.500 | can do them.
00:16:41.420 | But models below 100 billion parameters can't do them.
00:16:43.620 | And we call these emergent abilities.
00:16:46.360 | Sorry, quick question about this.
00:16:47.740 | Yeah.
00:16:48.240 | What are the colors?
00:16:50.500 | Oh, it's just highlighting the dark blue
00:16:53.820 | is tasks that a smaller language model wouldn't be able to do.
00:16:58.820 | Does that make sense?
00:16:59.660 | Yeah.
00:17:00.160 | And then to the right of the dotted line,
00:17:01.860 | the white region on top?
00:17:04.820 | Oh, that just means tasks that we
00:17:07.420 | haven't been able to solve yet with language models.
00:17:09.780 | Yeah.
00:17:10.900 | Cool.
00:17:11.940 | And I'm curious to know, do you think
00:17:13.440 | that it's not that those tasks in the white region
00:17:15.620 | are unsolvable at, like, the 100 billion scale?
00:17:19.300 | Or do you think that better models, specific training
00:17:23.660 | data, would allow us to, at the 100 billion scale,
00:17:26.180 | get into that white region?
00:17:27.980 | Yeah, I definitely think that it's not a fixed--
00:17:34.460 | I'll give an example shortly.
00:17:35.700 | But it's not a rule that you have
00:17:40.300 | to have 100 billion parameters to do a certain task.
00:17:42.900 | It's just that that happens to be the threshold that we've
00:17:45.420 | observed in models so far.
00:17:47.020 | And I do think with better training
00:17:48.860 | data, and architecture, and algorithms,
00:17:51.420 | we can probably beat that.
00:17:52.500 | Cool.
00:17:57.580 | Yeah, so as Rylan just mentioned,
00:18:00.100 | one example of getting emergence can be with better data.
00:18:04.220 | So it's not all about scale.
00:18:05.380 | I'll give some nuance here.
00:18:06.740 | So for this task, it's just one of the tasks
00:18:10.540 | in the Big Bench benchmark.
00:18:12.340 | You can see that for LaMDA, which is a Google model,
00:18:15.060 | and GPT-3, you actually don't get emergence
00:18:18.300 | from scaling to 137 or 175 billion parameters.
00:18:25.780 | But when you come in with a different language model,
00:18:28.020 | PaLM, which is trained on better data than LaMDA and GPT-3,
00:18:33.100 | you actually can get this emergent ability,
00:18:35.420 | even with the smaller language model, so here
00:18:37.300 | at 62 billion parameters.
00:18:40.380 | Do you see PaLM being a better model as a matter of better data,
00:18:43.180 | or also better architecture and loss choices,
00:18:47.820 | or [INAUDIBLE] data for it?
00:18:49.940 | Yeah, so the challenging thing is-- that's a great question.
00:18:54.940 | There's a lot of differences between PaLM and LaMDA,
00:18:57.700 | for example.
00:18:58.780 | And we can't really ablate them in any controlled way
00:19:01.500 | because of the cost of pre-training.
00:19:04.020 | But our running hypothesis is that PaLM
00:19:07.100 | is trained on better data.
00:19:08.180 | And that accounts for a lot of the difference
00:19:10.280 | between PaLM and LaMDA.
00:19:12.140 | At the smaller scales, where
00:19:13.620 | it is possible to ablate stuff, like data or architectural
00:19:17.180 | choices--
00:19:17.680 | has anyone looked into it?
00:19:20.100 | Yeah, that's a good question.
00:19:21.300 | So I guess even here, you can look at, for example,
00:19:23.420 | the PaLM 8 billion model, like that point there.
00:19:28.700 | You can ablate it, and it's a little bit higher,
00:19:30.900 | but it's not really emergent yet at that point.
00:19:33.420 | So it's hard to tell, for example, this particular task,
00:19:37.180 | what the effect is.
00:19:38.940 | Yeah.
00:19:39.440 | There's a question on Zoom.
00:19:41.240 | Are there two different versions of PaLM?
00:19:43.140 | If not, why are there two lines for it?
00:19:45.680 | Oh, so I think the two lines here-- one is maybe three shot,
00:19:51.060 | and then one is zero shot, something like that.
00:19:55.140 | So it just refers to the way that we're
00:19:58.240 | using the language model, either with or without exemplars.
00:20:00.740 | Great.
00:20:07.160 | I'll talk about a small ablation here that shows this.
00:20:11.080 | So this is an ablation on a toy task, where basically,
00:20:17.560 | the language model has to know that in English,
00:20:20.040 | you have to use plural verbs with plural subjects
00:20:22.560 | and singular verbs with singular subjects.
00:20:26.880 | And what we're doing here is basically,
00:20:30.640 | we train these small BERT models from scratch,
00:20:33.920 | and then we fixed the frequency of certain verbs
00:20:39.080 | in the training data set, which basically says, OK,
00:20:41.920 | what's the effect of seeing a certain verb in the data
00:20:45.320 | more often?
00:20:47.200 | In this plot, the x-axis is the frequency of the verb,
00:20:50.840 | and the y-axis is the error rate.
00:20:53.200 | And what you basically see is that if you have more
00:20:56.680 | in-domain data, so if the model sees the verb more times,
00:21:00.000 | it does the task a lot better.
00:21:02.160 | So this is an example of having high-quality data or data
00:21:05.440 | that's more in-domain for the task that you're evaluating on
00:21:08.880 | can make a big difference, even if you're
00:21:10.680 | fixing the compute, the size of the model,
00:21:13.440 | and the rest of the data.
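A sketch of how such a probe can be set up: compare the model's score for the grammatical and ungrammatical verb in a minimal pair. `score_next_token_logprob` is a hypothetical helper standing in for whatever masked-LM or next-token scoring interface you have; the pairs are illustrative:

```python
# Minimal-pair probe for subject-verb agreement.
minimal_pairs = [
    # (context, grammatical verb, ungrammatical verb)
    ("The keys to the cabinet", "are", "is"),
    ("The author of the books", "writes", "write"),
]

def agreement_error_rate(pairs, score_next_token_logprob):
    """score_next_token_logprob(context, token) -> log-probability (hypothetical)."""
    errors = 0
    for context, good, bad in pairs:
        if score_next_token_logprob(context, good) <= score_next_token_logprob(context, bad):
            errors += 1
    return errors / len(pairs)

# Usage: plug in a real model's scoring function.
# print(agreement_error_rate(minimal_pairs, my_model_scorer))
```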
00:21:15.740 | Yeah?
00:21:16.240 | Question on Zoom.
00:21:17.080 | Someone asks, could there be a way
00:21:18.720 | to distill emergent abilities down to smaller models
00:21:21.720 | from larger teacher models?
00:21:24.560 | Yeah, I think so.
00:21:25.560 | So larger teacher models, you can use them, for example,
00:21:32.880 | to generate data.
00:21:34.680 | And then if you fine-tune the smaller model on data,
00:21:38.760 | it's pretty likely that you'll be able to get the ability
00:21:42.080 | to emerge in the smaller model.
00:21:43.940 | I'll talk about an example of this, too.
00:21:46.440 | Oh, actually, that's the next slide.
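A sketch of that distillation recipe under stated assumptions: `teacher_generate` and `finetune` are hypothetical helpers, not any particular library's API, and the loop is the whole idea:

```python
# Distill an emergent behavior from a large teacher into a small student:
# 1) have the teacher generate (input, output) pairs for the target behavior,
# 2) fine-tune the student on those pairs with ordinary supervised learning.
def distill(teacher_generate, finetune, student_model, prompts):
    synthetic_data = [(p, teacher_generate(p)) for p in prompts]  # hypothetical teacher call
    return finetune(student_model, synthetic_data)                # hypothetical SFT step
```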
00:21:49.760 | Desired behaviors can be induced in smaller models
00:21:52.260 | once you know what behavior you want.
00:21:55.280 | So for example, here's the figure
00:22:00.360 | from the InstructGPT paper.
00:22:02.840 | And basically, the desired behavior here
00:22:05.360 | is instruction following.
00:22:07.920 | And you can see that there's multiple models.
00:22:10.800 | So on the left, you have these small models
00:22:13.080 | that are trained with RLHF.
00:22:15.200 | And they actually have better performance
00:22:17.560 | than larger models trained on weaker techniques.
00:22:21.240 | So basically, the point is, if you
00:22:22.760 | know that you want a certain behavior that you
00:22:26.160 | saw previously in an emergent way in a larger model,
00:22:30.840 | you can find a way to fine-tune on that behavior specifically
00:22:34.280 | and induce that behavior in a smaller model.
00:22:38.120 | I guess that one of the limitations
00:22:39.520 | is that unless you know all the behaviors that you want,
00:22:43.160 | you can't really get this natural emergent behavior.
00:22:50.880 | Yeah, another discussion point here
00:22:52.800 | is that there's this question of what's
00:22:55.800 | the right x-axis for emergence.
00:22:57.800 | So right now, we mostly talk about model parameters
00:23:01.000 | and training flops.
00:23:02.160 | But I guess if you ask deep mind people how they look at it,
00:23:06.160 | you'll get this argument that model parameters and training
00:23:09.640 | flops are really just a proxy for how good the model is.
00:23:13.840 | And how good the model is can really
00:23:15.440 | be measured by perplexity, or how well it's
00:23:18.280 | doing on some dev sets, such as Wikitext 103.
00:23:22.960 | So basically, you can also measure emergence
00:23:27.200 | in terms of perplexity.
00:23:29.240 | So here is Wikitext perplexity.
00:23:33.640 | And then you can see on a downstream task
00:23:35.640 | that, as the perplexity gets better,
00:23:38.120 | there's sort of this threshold where
00:23:39.580 | you're able to do a lot better on the downstream task.
00:23:44.600 | And there's sort of a strong correlation,
00:23:47.400 | at least right now, between perplexity and training
00:23:49.600 | compute.
00:23:50.120 | So you can see these two lines are pretty similar.
00:23:53.200 | And basically, I think in the future,
00:23:57.880 | if we have much better models that are a lot smaller,
00:24:00.640 | trained on much better data and better algorithms,
00:24:02.840 | then maybe Wikitext perplexity can
00:24:04.960 | show a different type of plot than using other metrics.
00:24:09.240 | Yeah, go ahead.
00:24:09.960 | [INAUDIBLE]
00:24:12.080 | So Wikitext is basically a--
00:24:17.080 | I think it's like a subset of Wikipedia.
00:24:19.640 | And then perplexity is like a measure
00:24:21.160 | of how well you can predict the next word in a data set.
00:24:25.840 | So basically, if you're really good at modeling
00:24:27.840 | the next word on this particular evaluation set,
00:24:32.080 | that's sort of a measure of how well you understand language.
00:24:37.160 | That make sense?
00:24:38.360 | [INAUDIBLE]
00:24:41.240 | Oh, this is like a held-out test set.
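For concreteness, perplexity on a held-out set can be computed directly from the model's per-token log-probabilities; a minimal sketch with made-up numbers:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy example: natural-log probabilities the model assigned to each next token.
print(perplexity([math.log(0.5), math.log(0.25), math.log(0.5)]))  # ~2.52
```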
00:24:43.160 | And then a final thing that I think
00:24:52.280 | is pretty exciting about emergence
00:24:54.960 | is that there's sort of not just technical emergence
00:24:58.080 | that we've talked about, but there's
00:24:59.760 | sort of sociological changes in how the AI community views
00:25:04.080 | scaling and how to use language models.
00:25:07.760 | So here's some examples of where scaling up
00:25:12.280 | the size of the language model enables you to,
00:25:15.200 | in this sort of few-shot scenario,
00:25:17.880 | beat a task-specific fine-tuned language model that's
00:25:21.120 | usually fine-tuned on, say, thousands of examples.
00:25:25.240 | So basically, the green line is the prior state of the art
00:25:28.360 | achieved by fine-tuning.
00:25:30.960 | And then the blue dots basically show,
00:25:34.120 | if you take a pre-trained language model
00:25:36.120 | and you do few-shot prompting, which means the language model
00:25:38.720 | isn't intentionally trained to do the task,
00:25:40.760 | you can often get state-of-the-art results
00:25:42.520 | just by continuing to scale up the size of the language model.
00:25:45.960 | And obviously, there's limitations here.
00:25:47.920 | You don't want to just keep scaling up
00:25:49.720 | in order to get state-of-the-art.
00:25:51.720 | But I think it's a pretty big change in people's minds
00:25:54.600 | that you could actually get some of the best results
00:25:56.880 | just by scaling up the size of the language model
00:25:58.880 | and doing prompting.
00:25:59.760 | Question from Zoom.
00:26:03.400 | Someone asks, is that not a contradiction
00:26:06.120 | graph from two to three slides ago?
00:26:08.600 | What is that?
00:26:10.720 | Which one?
00:26:11.520 | This one?
00:26:12.600 | I'm not sure.
00:26:14.160 | Should we, in general, assume-- oh, he said yes.
00:26:19.400 | He said, should we, in general, assume
00:26:21.200 | that scale trumps fine-tuning?
00:26:24.640 | Yeah, so that's a great question.
00:26:26.920 | So this plot is saying that you fine-tune and you can do--
00:26:32.720 | OK, yeah.
00:26:33.400 | So it depends on your particular task.
00:26:39.480 | But what this plot is saying is that fine-tuned smaller models
00:26:50.600 | can do well on some tasks if you target them well.
00:26:53.040 | But for tasks that are more complicated,
00:26:56.040 | often you can do better just by scaling.
00:26:58.440 | So there's sort of tasks that fall
00:27:01.240 | into both of these categories.
00:27:03.120 | And I wouldn't say that it's contradictory.
00:27:06.400 | I guess some tasks you would do a lot better just
00:27:12.840 | by scaling up the size of the model.
00:27:14.440 | And then other tasks, if it's a very narrow domain
00:27:17.600 | or the large language model might not
00:27:20.080 | be trained on that kind of data, then you
00:27:21.960 | would do better by fine-tuning.
00:27:23.240 | OK, great.
00:27:29.040 | So here's sort of a little summary slide.
00:27:31.280 | So basically, emergent abilities can only
00:27:33.000 | be observed in large models.
00:27:34.640 | And if you try to predict their emergence just
00:27:37.720 | by looking at the plots for small models,
00:27:39.440 | then you wouldn't be able to do it.
00:27:42.760 | And I sort of had a little reflection
00:27:45.160 | on how to look at this.
00:27:46.920 | So emergence is really this framing
00:27:48.480 | of how to view new abilities that are not intentionally
00:27:52.920 | built in to the pre-training.
00:27:55.120 | And I think the subtext for this is super important, which
00:27:57.560 | is you can see it as an implicit argument for why we should keep
00:28:01.120 | scaling up language models, because you
00:28:03.760 | get these abilities that are really hard to find otherwise.
00:28:07.280 | And the context around this is pretty important,
00:28:09.240 | because it's really expensive to continue scaling up
00:28:13.360 | these models.
00:28:14.040 | And even one year ago, a lot of people
00:28:17.800 | didn't believe that you could do better on certain tasks
00:28:20.320 | just by scaling up the size of the language model.
00:28:23.920 | They sort of-- if you work in industry at all,
00:28:25.880 | there's this interesting tension between emergence and also
00:28:29.800 | many production tasks.
00:28:30.920 | So emergence is sort of this task-general phenomenon
00:28:34.760 | where you scale up the model, and it's really expensive.
00:28:37.720 | But the single model can do a lot of tasks.
00:28:40.080 | This is sort of in the direction of AGI.
00:28:44.120 | And then for many production tasks,
00:28:46.280 | you have sort of the opposite, where
00:28:48.960 | you know what task it is, for example,
00:28:50.840 | translating to Spanish.
00:28:52.760 | And then you have these constraints on compute,
00:28:54.720 | because when you build Google Translate, for example,
00:28:57.280 | you don't want people to have to wait a couple of seconds
00:28:59.680 | just to get the translation.
00:29:02.200 | And then you also happen to have a lot of in-domain data.
00:29:05.320 | So you have, for example, a million pairs
00:29:08.720 | of English-Spanish sentences to train on.
00:29:12.200 | And this is sort of the opposite setting, where you don't really
00:29:15.560 | care about the model's emergence.
00:29:18.720 | You can just train a very small model on the data
00:29:21.000 | and do all of the tasks without having to use a lot of compute.
00:29:25.880 | And the final point is that I think
00:29:28.960 | a really promising research direction, if anyone
00:29:31.080 | is interested in doing research, is
00:29:33.000 | to work on predicting future emergent abilities.
00:29:36.840 | And I haven't seen a lot of work on it recently,
00:29:38.840 | just because I think maybe it's too hard, for example.
00:29:42.200 | You can only predict emergence for a specific task.
00:29:46.400 | Or one way of predicting emergence
00:29:48.440 | might not be super general.
00:29:49.720 | And so I haven't seen much work on that.
00:29:52.160 | But I think this is a pretty promising direction to work on.
00:29:55.240 | And maybe Anthropic is working on it.
00:29:56.800 | I don't know.
00:29:57.400 | OK, great.
00:30:00.560 | Any questions on that before I move on
00:30:02.240 | to chain of thought prompting?
00:30:04.680 | Yeah, go ahead.
00:30:07.320 | Do we have any theoretical basis to predict
00:30:10.280 | which parameters are best scaled to get related properties?
00:30:13.800 | Because obviously, there are many different options
00:30:16.200 | for where the actual parameters [INAUDIBLE]
00:30:19.640 | like GPT, for example, you could add more to the inventory.
00:30:22.760 | You could add more in [INAUDIBLE] or whatever.
00:30:26.240 | Is that mostly something we just test?
00:30:27.880 | And then we find out which ones scale better
00:30:30.320 | and give us better results?
00:30:32.320 | Yeah, I would say that we don't have very principled methods
00:30:37.240 | for how to scale up these architectures.
00:30:41.920 | I'm not an expert in this.
00:30:43.000 | But some of it has to deal with how many parameters you
00:30:46.120 | can fit onto a particular TPU.
00:30:50.360 | But in general, I think you scale up
00:30:52.440 | the number of attention heads and embeddings
00:30:55.760 | somewhat proportionally.
00:30:57.680 | But yeah, I think this is an open research question.
00:30:59.800 | And you can't really do ablations
00:31:03.440 | over these pre-training runs.
00:31:07.880 | It's hard to have any principled way of doing it,
00:31:12.440 | other than some engineers who are in charge of doing it,
00:31:15.480 | saying, OK, I think this is the right thing to do.
00:31:17.560 | And it kind of works, and you go with it.
00:31:20.280 | Yeah?
00:31:21.280 | Do you have any indication of the asymptotic behavior
00:31:23.760 | of this scaling?
00:31:24.880 | Do you expect that, eventually, it
00:31:26.840 | would reach either some plateau of finite, non-zero loss,
00:31:30.960 | or it would just go all the way down to zero?
00:31:35.000 | Yeah, that's a great question.
00:31:36.340 | You mean on perplexity or on a particular task,
00:31:45.120 | or just in general on next-word prediction?
00:31:47.000 | Well, it seems like these results are pretty general,
00:31:49.240 | pretty task-independent, right?
00:31:50.800 | It's emergent scaling.
00:31:53.120 | But if you take the limit of infinite parameters,
00:31:55.680 | then even analytically, is there any sense
00:31:58.200 | of how that converges?
00:32:00.720 | Yeah, I have no clue.
00:32:02.120 | I think, for most of these tasks,
00:32:05.280 | there's a limit to accuracy, like 100%, for example.
00:32:08.680 | So there's some sort of asymptote there.
00:32:10.440 | But I guess the deeper question that you might be asking
00:32:13.800 | is, can a language model perfectly
00:32:16.120 | know how to predict the next word for any given input?
00:32:22.040 | And maybe.
00:32:23.960 | I mean, I guess there's some limit to--
00:32:28.120 | if I say a sentence, there are two possible next words
00:32:31.240 | or something.
00:32:31.840 | And you might not be able to guess that perfectly.
00:32:35.400 | So I think there's some limit, but I think we're
00:32:37.440 | far from reaching that limit.
00:32:38.640 | And there's still a lot of unsolved tasks
00:32:40.440 | that sort of indicate that there's a lot of headroom.
00:32:45.120 | Yeah.
00:32:45.640 | If researchers are interested in studying emergence,
00:32:48.520 | what family of differently-sized models
00:32:51.520 | is publicly available or best for studying this?
00:32:55.960 | Yeah, good question.
00:32:58.520 | So I think the OpenAI API has a lot of language models.
00:33:03.620 | And we actually use that a lot.
00:33:04.920 | Even at Google, it's used to study emergence.
00:33:07.880 | And that's sort of one way of doing it.
00:33:10.220 | And actually, a lot of these models are currently free.
00:33:13.760 | They're rate-limited, but they're free.
00:33:15.680 | So we also use that.
00:33:18.520 | I think there's also smaller language models.
00:33:22.840 | Like, for example, there's a UL2 model
00:33:24.680 | that's 20 billion parameters.
00:33:26.880 | But I guess you're right.
00:33:28.080 | There is sort of this challenge where the small language
00:33:30.520 | models, you won't see a lot of these emergent behaviors.
00:33:33.200 | So you kind of have to either train--
00:33:37.120 | yeah, so you kind of have to either use OpenAI API for now
00:33:41.500 | or wait until people train larger models.
00:33:44.180 | I guess there's also the Bloom and, like, you guys probably
00:33:47.700 | know better than me, like, OPT models
00:33:49.260 | that are publicly available, but I haven't seen
00:33:51.500 | a lot of experiments on them.
00:33:53.420 | Yeah, yeah.
00:33:56.100 | So my question is, are there emergent abilities
00:34:00.260 | that are accessible in lower parameter regimes?
00:34:03.620 | I can think of, like, more of a speech technique or [INAUDIBLE]
00:34:09.520 | I would expect maybe there might be, like, some better--
00:34:11.900 | maybe not, like, chain of thought,
00:34:13.360 | but are there some that are, like--
00:34:14.680 | Yeah, definitely.
00:34:15.360 | I think in the paper, we had, like,
00:34:17.480 | the list of a couple dozen abilities
00:34:19.040 | that would be emergent at, like, 8 billion parameters
00:34:21.400 | or, like, 60 billion parameters, something like that, yeah.
00:34:24.160 | Yeah.
00:34:25.320 | Yeah.
00:34:25.820 | We have two questions from Zoom.
00:34:27.200 | The first question is, do you see strategies or tactics
00:34:30.080 | between the larger tech firms differing systematically
00:34:33.560 | in studying these models, or is basically everyone
00:34:36.320 | taking the same approach?
00:34:37.400 | I wouldn't say that everyone is taking the same approach.
00:34:48.800 | I think, as one example, Anthropic
00:34:52.360 | takes, like, a very safety-centric approach.
00:34:55.800 | And they're super interested in, like, emergent abilities
00:34:59.080 | because there could be emergent abilities that are undesirable
00:35:02.920 | and they want to predict those types of things.
00:35:06.560 | I also don't know what happens at other companies
00:35:09.840 | other than at Google, so I can't really speak too much to that.
00:35:13.600 | Yeah.
00:35:14.120 | The second question, what are some examples of tasks
00:35:17.120 | or abilities that have not yet emerged, even in models
00:35:20.040 | like LaMDA, ChatGPT, et cetera?
00:35:23.080 | Oh, yeah, I have--
00:35:24.120 | maybe I'll just show this real quick.
00:35:25.660 | Uh-- there's, like, a nice list somewhere.
00:35:37.780 | So-- yeah, so basically, what we did
00:35:47.500 | is there's, like, 200 tasks in BigBench.
00:35:51.100 | And then we basically classified them
00:35:52.660 | into, like, smoothly increasing, emergent with GPT-3 or LaMDA,
00:35:58.380 | emergent with PaLM, and then flat, which is, like,
00:36:00.740 | no model better than random.
00:36:02.420 | So I think if you look at any of these tasks
00:36:04.420 | here, they should still not have emerged yet.
00:36:08.820 | And if you can get them to emerge, that'd be interesting.
00:36:12.420 | [INAUDIBLE]
00:36:14.980 | Sorry?
00:36:16.060 | I think ChatGPT should be 20 questions.
00:36:18.700 | Oh, OK, yeah, this is not a super--
00:36:20.340 | I think this is, like, a couple of months old.
00:36:22.260 | Sorry, [INAUDIBLE]
00:36:23.420 | Yeah, yeah.
00:36:25.140 | Oh, 20 questions?
00:36:26.300 | OK, yeah.
00:36:26.900 | [INAUDIBLE]
00:36:29.500 | Yeah, I think-- like, the cool thing
00:36:32.220 | is, like, you can see over time, right?
00:36:33.820 | Like, originally, like, maybe only these were emergent.
00:36:37.100 | And then when PaLM came out, you'd
00:36:38.580 | see a couple dozen more abilities became emergent.
00:36:40.700 | And then I suspect in a year or two, most of these
00:36:45.620 | will become emergent.
00:36:46.620 | And we'll need harder benchmarks.
00:36:48.780 | Yeah?
00:36:49.500 | There's another question on here.
00:36:50.860 | Why doesn't Google take as much of a safety-centered approach,
00:36:54.100 | like you said Anthropic does?
00:36:55.860 | Are there reasons to believe harmful capabilities wouldn't
00:36:58.660 | be emergent?
00:36:59.700 | Yeah, I don't want to answer the question on behalf of Google.
00:37:05.580 | I just can only talk about my own opinions.
00:37:10.540 | But I think the reality is that Google--
00:37:12.780 | even if you look at, like, the amount of research
00:37:14.820 | that Google does, it might not be in the large language
00:37:18.060 | models very specifically.
00:37:19.260 | But the amount of safety research that we do, I think,
00:37:22.540 | is more than Anthropic if you actually
00:37:24.540 | look at the number of papers published.
00:37:26.540 | Don't quote me on this.
00:37:27.500 | But I think that's correct.
00:37:28.860 | Great.
00:37:34.920 | Yeah, I'll talk about chain-of-thought prompting.
00:37:42.380 | So basically, chain-of-thought prompting
00:37:44.020 | is this way of doing reasoning, multi-step reasoning,
00:37:47.700 | with large language models.
00:37:49.980 | And yeah, I wanted to say that it's
00:37:53.980 | super exciting to see a lot of people at Google working
00:37:57.020 | on this, and also to see Sundar, our CEO,
00:38:00.580 | present this at our last year's Google I/O press event.
00:38:04.220 | And basically, the motivation for this
00:38:10.380 | is that we want language models to do more complicated tasks.
00:38:15.220 | For example, we know language models
00:38:17.820 | can do easy tasks like sentiment analysis or translation.
00:38:20.980 | But what about more complicated tasks
00:38:23.180 | that might even take a human a minute or more to do?
00:38:28.340 | And the goal here is to basically guide them
00:38:30.300 | with intermediate reasoning steps.
00:38:32.100 | So for example, instead of just giving an input-output pair,
00:38:35.260 | we want to give them the entire reasoning process
00:38:38.380 | and have them mimic that.
00:38:41.460 | And basically, you can see here, in a standard prompt,
00:38:45.100 | you have the question and then the answer.
00:38:47.100 | And then you have a question, and the model
00:38:48.940 | gives a new answer.
00:38:49.980 | Unfortunately, it's wrong.
00:38:52.940 | And then with chain-of-thought prompting,
00:38:56.380 | you give the model a question.
00:38:57.860 | And then, kind of like how your teacher would
00:39:00.140 | ask you to show your work, you give the chain-of-thought,
00:39:04.220 | is what we call it, or basically a reasoning path.
00:39:06.820 | And then you give the final answer.
00:39:08.460 | And then when the model sees this unseen question,
00:39:10.940 | now it's able to give the reasoning path
00:39:12.620 | and then give the correct final answer.
00:39:15.900 | And the way that we add these prompts into the prompt
00:39:18.660 | is basically we just manually write a couple
00:39:21.060 | and then add it into the prompt.
00:39:23.940 | So let me just show how that works.
00:39:26.940 | So this is the OpenAI API.
00:39:32.780 | And basically, here's the non-chain-of-thought way
00:39:38.500 | of doing it.
00:39:39.020 | So basically, you would have question, answer, question,
00:39:43.420 | answer, question, answer, and then
00:39:45.180 | a new question: the cafeteria has 23 apples,
00:39:47.900 | they used 20 to make lunch and bought six more--
00:39:49.780 | how many apples do they have?
00:39:52.100 | And the model gets it wrong.
00:39:56.420 | And the only difference with chain-of-thought
00:39:58.900 | is that you give these intermediate reasoning
00:40:02.220 | paths before giving the final answer.
00:40:05.260 | So here's a path.
00:40:06.780 | There's a reasoning chain.
00:40:08.700 | There's another reasoning chain.
00:40:10.660 | And then now, the model for this unseen question
00:40:15.860 | gives the entire reasoning process.
00:40:18.940 | And then this actually enables the model to get it correct.
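To show the difference concretely, here is a sketch of how a chain-of-thought exemplar is spliced into the prompt. The exemplar is paraphrased in the style shown in the talk, and `call_language_model` is a hypothetical completion call:

```python
# One chain-of-thought exemplar followed by the test question. The only change
# from standard prompting is the worked reasoning before the exemplar's answer.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)
test_question = (
    "Q: The cafeteria had 23 apples. They used 20 to make lunch and bought "
    "6 more. How many apples do they have?\nA:"
)
prompt = cot_exemplar + test_question
# completion = call_language_model(prompt)
# Expected continuation: reasoning steps ending in "The answer is 9."
print(prompt)
```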
00:40:23.060 | I'll give another quick example, this one.
00:40:28.460 | So here, the task is just take the last letters
00:40:31.300 | of the words in Bill Gates, so like L from Bill and S
00:40:34.380 | from Gates, and then concatenate them.
00:40:36.420 | And the answer should be LS.
00:40:39.780 | And then here, the model gets it wrong.
00:40:42.500 | The answer should be NK, but it says SK.
00:40:46.860 | And then if you do chain-of-thought,
00:40:50.100 | obviously, this becomes very easy for the model.
00:40:53.580 | So it says the last letter of Bill is L.
00:40:55.940 | The last letter of Gates is S. The answer is LS.
00:41:00.860 | And then here, it's able to do the last letter of Elon is N.
00:41:03.940 | And the last letter of Musk is K. And the answer is NK.
00:41:06.660 | So is this clear?
00:41:11.900 | Any questions about what's going on here?
00:41:15.380 | OK, great.
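As an aside, the last-letter task itself is trivial to compute programmatically, which makes it easy to score the model's chain-of-thought answers automatically; a tiny reference implementation:

```python
def last_letter_concatenation(name: str) -> str:
    """Ground truth for the last-letter task, e.g. 'Bill Gates' -> 'ls'."""
    return "".join(word[-1] for word in name.split())

print(last_letter_concatenation("Bill Gates"))  # ls
print(last_letter_concatenation("Elon Musk"))   # nk
```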
00:41:16.020 | So basically, we can have these similar plots
00:41:21.100 | where the x-axis is the model scale.
00:41:23.300 | The y-axis is the performance.
00:41:26.580 | So on the left, we have this math word question
00:41:30.380 | benchmark called GSM8K.
00:41:30.380 | It's basically like questions that you'd
00:41:32.420 | see in an elementary school math test.
00:41:35.300 | And you can see the blue dot is standard,
00:41:38.180 | and the purple star is chain-of-thought.
00:41:41.020 | And basically, you see that the chain-of-thought,
00:41:43.020 | if you use a large enough model, does a lot better
00:41:46.340 | than standard prompting.
00:41:49.020 | It actually beats the fine-tuned state-of-the-art at the time.
00:41:53.620 | A similar example is on this benchmark called Strategy QA.
00:41:58.140 | And what Strategy QA is, it's basically
00:42:00.260 | like this world knowledge plus common sense reasoning
00:42:03.380 | benchmark.
00:42:03.900 | So the question would be, can you
00:42:05.740 | hide a basketball in a sand cat's ear?
00:42:08.420 | And then the model would say, a basketball is about this size.
00:42:12.180 | A sand cat's ear is that.
00:42:13.500 | So it would not fit.
00:42:14.820 | And on this benchmark, you can also
00:42:16.420 | see that we can beat the fine-tuned state-of-the-art
00:42:19.340 | from before just by using chain-of-thought
00:42:22.380 | with a large enough [INAUDIBLE] model.
00:42:28.100 | So one way we use this is that we
00:42:30.660 | evaluate a chain-of-thought on a certain subset of Big Bench
00:42:35.060 | tasks.
00:42:35.560 | So we created a subset called Big Bench Hard.
00:42:38.780 | And basically, it's like 23 challenging tasks
00:42:41.780 | from Big Bench, where no model had done better
00:42:45.140 | than the average human rater.
00:42:48.660 | So the way that you prompt the model
00:42:50.700 | is that you'd have a task description, question, options,
00:42:54.460 | chain-of-thought, and then the test time question.
00:42:59.060 | And so I'll give a couple examples of tasks here.
00:43:02.940 | So one example is navigate.
00:43:06.660 | Basically, what the language model has to do in this task
00:43:09.620 | is it has to basically follow these.
00:43:11.620 | So the question is, if you follow these instructions,
00:43:14.560 | do you return to the starting point?
00:43:16.260 | Turn left, turn right, take five steps, take four steps,
00:43:18.540 | turn around, take nine steps.
00:43:21.100 | And then the model, following the few shot exemplars,
00:43:25.540 | is able to basically track state after all of the actions.
00:43:30.420 | And then at the end, it says, OK,
00:43:32.300 | are we at the final answer?
00:43:34.140 | Are we at the original location?
00:43:36.740 | If it is 0, 0, the answer is yes.
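The state tracking the model is imitating here can be written out explicitly. This sketch assumes only the instruction types mentioned in the example (turn left/right/around, take N steps) and that you start facing north:

```python
# Track heading and position; the answer is "yes" iff we end back at the origin.
def returns_to_start(instructions):
    headings = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # N, E, S, W
    h, x, y = 0, 0, 0
    for inst in instructions:
        if inst == "turn left":
            h = (h - 1) % 4
        elif inst == "turn right":
            h = (h + 1) % 4
        elif inst == "turn around":
            h = (h + 2) % 4
        elif inst.startswith("take"):
            steps = int(inst.split()[1])
            dx, dy = headings[h]
            x, y = x + steps * dx, y + steps * dy
    return (x, y) == (0, 0)

print(returns_to_start(["turn left", "turn right", "take 5 steps",
                        "take 4 steps", "turn around", "take 9 steps"]))  # True
```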
00:43:38.380 | Just to give an example of another task,
00:43:44.780 | here's a task that's very easy for humans,
00:43:47.180 | basically word sorting.
00:43:48.260 | So there's a list of words, burly, baila.
00:43:51.740 | I'm not going to read them.
00:43:52.860 | And basically, the model has to sort them alphabetical order.
00:43:57.780 | And here, the model can follow the few shot exemplars.
00:44:00.020 | So you have this pretty complicated chain-of-thought
00:44:05.260 | where the model has to sort each of the subparts.
00:44:08.900 | And then finally, it gets to the final answer, which is correct.
00:44:15.580 | So here's this result summary on the subset of BigBench.
00:44:20.180 | So you can see, OK, we have two metrics.
00:44:24.380 | One is just the average performance
00:44:26.180 | on all these tasks.
00:44:27.580 | And the second is the percent of tasks
00:44:31.140 | that are above the average human reader.
00:44:34.820 | So average human reader is 67.
00:44:37.620 | Max human reader is 94.
00:44:40.220 | And then prior results, the model was doing way worse.
00:44:43.020 | It was like 50.
00:44:44.540 | And this is by construction of the subset.
00:44:50.060 | And then we used code-davinci-002,
00:44:52.260 | which is one of the OpenAI models.
00:44:54.460 | And actually, you can use this one for free with the OpenAI API.
00:44:58.220 | And basically, if you do answer-only prompting
00:45:01.360 | without chain-of-thought, then you
00:45:03.580 | are beating the average human rater on 5 of 27.
00:45:06.900 | But if you use chain-of-thought prompting,
00:45:09.500 | then the performance increases by this pretty decent amount.
00:45:12.500 | And you're able to pass the average human
00:45:15.700 | on the majority of tasks.
00:45:18.180 | And then below is just this visualization
00:45:19.980 | of the tasks that are doing worse than humans in red
00:45:23.180 | and then better than humans in blue.
00:45:25.460 | Yeah.
00:45:26.500 | Two questions.
00:45:27.620 | Isn't this similar to RLHF in spirit, at least?
00:45:32.460 | Is what similar?
00:45:34.540 | I think chain-of-thought prompting.
00:45:36.040 | I'm not sure what the statement is.
00:45:37.540 | I think chain-of-thought.
00:45:41.220 | Yeah, I think it's--
00:45:44.340 | I wouldn't call it similar.
00:45:45.700 | So chain-of-thought is basically you
00:45:47.900 | take a pre-trained language model
00:45:49.340 | and you use a prompting technique that includes
00:45:51.780 | intermediate reusing path.
00:45:54.340 | The way that RLHF works is that you have this additional data
00:45:58.180 | that you want to fine-tune the model on.
00:46:00.100 | And you have a preference model that
00:46:02.420 | sort of predicts how well does a certain output--
00:46:09.420 | how likely is that to be preferred by humans?
00:46:12.300 | And then RLHF, what that does is it fine-tunes the language
00:46:17.620 | model to do well on the preference model's prediction.
00:46:20.660 | So basically, it's sort of aligning the model
00:46:24.100 | with what humans would prefer.
00:46:26.260 | Is there a second question?
00:46:27.420 | Yeah, sorry.
00:46:27.920 | Just a few.
00:46:29.700 | Andres asks, can chain-of-thought
00:46:31.300 | be included in fine-tuning rather than
00:46:33.500 | having to be in the prompt?
00:46:37.420 | The short answer is yes.
00:46:39.900 | The sort of complicated thing about that
00:46:41.540 | is that you have to have chain-of-thought intermediate
00:46:44.980 | steps.
00:46:45.580 | And those are pretty--
00:46:48.460 | it can be costly to gather that data and to do the fine-tuning.
00:46:54.680 | One last question.
00:46:55.480 | Sorry for everybody getting in.
00:46:56.800 | Another student asks, do you think
00:46:58.220 | that chain-of-thought and prompt engineering in general
00:47:00.940 | is just an artifact that won't be necessary for larger scale
00:47:04.220 | models that are better able to understand the prompts
00:47:06.660 | [INAUDIBLE]
00:47:08.260 | Yeah, so that's a great question.
00:47:11.340 | Basically, the question is how ephemeral is prompt engineering
00:47:15.060 | going to be.
00:47:16.460 | I think we'll find out.
00:47:18.340 | But some initial intuitions are that for easy tasks that
00:47:22.740 | are easy to describe and maybe they're multiple choice,
00:47:26.620 | larger models will probably be more robust to prompt
00:47:29.180 | engineering.
00:47:29.700 | And there's sort of less you can do with that.
00:47:31.740 | But I think as language models get more powerful,
00:47:35.540 | they'll sort of be more normal to use them
00:47:37.300 | on a lot more challenging tasks.
00:47:40.140 | And in those tasks, you'll have to specify exactly what you
00:47:42.980 | want the model to do, et cetera.
00:47:44.420 | So I think there will still be some room for prompt
00:47:46.460 | engineering there, at least in the near future.
00:47:49.540 | Yeah, go ahead.
00:47:50.140 | Do you know how well this chain-of-thought prompting is
00:47:52.660 | generalizing to, for example, you showed these two tasks,
00:47:55.180 | right, a simple math and a map, and then the other one
00:47:58.180 | basically sorting the words alphabetically, right?
00:48:01.580 | Yeah.
00:48:02.140 | So I mean, I see that's the case with the math.
00:48:04.940 | Math has to give this chain-of-thought prompting,
00:48:08.660 | and it does that super well.
00:48:09.980 | But would that model also perform better
00:48:11.780 | on sorting the alphabet?
00:48:13.460 | Or do you have to give the chain-of-thought
00:48:15.420 | for sorting words alphabetically?
00:48:17.700 | Yeah, that's a great question.
00:48:19.380 | So for some tasks where you've seen similar data
00:48:24.380 | in pre-training, the model can do really well,
00:48:26.780 | even if the chain-of-thought is from another task.
00:48:29.620 | So for example, like math word problems,
00:48:31.300 | you actually don't really need a math chain-of-thought,
00:48:33.580 | because the model already knows how to do that.
00:48:35.660 | But for tasks like this, you probably
00:48:38.420 | haven't seen any data that's like the chain-of-thought here.
00:48:41.300 | So without task-specific exemplars,
00:48:43.120 | you probably wouldn't do super well
00:48:44.540 | on tasks like this without manually writing them
00:48:48.300 | for other examples.
00:48:51.620 | Yeah.
00:48:52.620 | I'm wondering, as the researcher behind this,
00:48:55.660 | what mental model would lead you to even try this?
00:48:59.500 | Do you perceive the model as, if I was a person,
00:49:01.740 | how would I do this better?
00:49:02.860 | Or is it trying to give it more compute in order
00:49:06.940 | to arrive at the answer?
00:49:09.780 | Yeah, great question.
00:49:12.140 | I think my motivation was just thinking about it,
00:49:14.380 | just as you said, what's going on in a human's mind
00:49:18.300 | while they try to solve this math question?
00:49:21.540 | And well, if you notice, at least some humans
00:49:25.540 | will think actually in natural language.
00:49:28.420 | So if you pay attention a lot to what's going on in your mind,
00:49:33.200 | you actually notice that sometimes you
00:49:34.820 | think in language.
00:49:35.860 | And so well, the language model can think in language, too.
00:49:38.340 | So that was the motivation behind asking
00:49:41.300 | the language model to do that.
00:49:43.220 | And I think one thing that went well
00:49:47.660 | is that the development of this technique
00:49:50.380 | actually coincided with the development of PaLM.
00:49:53.660 | And so yeah, basically having the model PaLM
00:49:58.820 | allowed us to do a lot more challenging tasks
00:50:03.900 | using chain of thought.
00:50:06.380 | Yeah?
00:50:06.880 | So when talking about data quality,
00:50:09.580 | is it the absolute number
00:50:12.220 | of examples of this chain-of-thought process
00:50:14.380 | in the data set or in the fine-tuning
00:50:19.220 | that's the main significant thing?
00:50:21.260 | Or is it the relative frequency of those examples
00:50:24.900 | versus negative examples, which are not
00:50:27.060 | good examples of how to reason?
00:50:29.020 | Do those matter as much as the absolute number
00:50:31.980 | of good examples?
00:50:35.420 | Yeah, good question.
00:50:38.300 | So I guess the challenging thing is we can't really
00:50:40.500 | measure how many similar examples are in the training data.
00:50:44.980 | It's hard to do that well.
00:50:46.900 | And I don't think anyone has done that before.
00:50:50.380 | So it's more of this open question
00:50:51.900 | of why a chain of thought even works.
00:50:54.260 | Because you actually don't see similar data
00:50:57.060 | like that in the training set.
00:51:00.420 | Yeah, I think it's open question of why it works.
00:51:03.260 | [INAUDIBLE]
00:51:03.760 | I mean, you said, OK, think about how new things--
00:51:14.280 | sometimes we think in language, and then
00:51:16.000 | model should do that, too.
00:51:17.160 | But how do you actually think in--
00:51:19.120 | what is the intuition for the model?
00:51:21.120 | I mean, is there a shift for a specific task?
00:51:24.800 | Do some weights get more focus from the model?
00:51:28.960 | How do you think about that?
00:51:30.560 | Yeah, I don't really think about it
00:51:32.060 | in terms of what's going on in the weights.
00:51:34.080 | I guess the way that I think about it is that it'd
00:51:37.680 | be unfair for me to give you a math question
00:51:39.980 | and ask you to give me the answer within half
00:51:43.000 | a second, which is basically what you're doing with the model
00:51:45.720 | and when you don't do a chain of thought, right?
00:51:47.760 | You're basically asking this challenging question,
00:51:49.800 | and the model doesn't have enough compute
00:51:52.080 | to solve it in one pass to give you the answer immediately.
00:51:56.840 | I think the second thing that I think about
00:52:00.840 | is that the model has learned a compositional set of skills
00:52:05.920 | during pre-training, so maybe it hasn't really
00:52:08.800 | learned this particular navigate task during pre-training.
00:52:12.860 | But it's learned other things, right?
00:52:14.400 | It's learned like, OK, if you take five steps
00:52:17.200 | and you're facing this, maybe you
00:52:18.840 | should add five here or something like that, right?
00:52:21.200 | And it's learned how to do pattern matching.
00:52:23.040 | So maybe in the future exemplars,
00:52:25.720 | it can match what the reasoning path is
00:52:28.760 | with what the question was.
00:52:30.280 | And so there's sort of these little skills
00:52:32.020 | that the model might know.
00:52:33.720 | And then maybe if you can combine them together
00:52:35.680 | in some clever way, then you can get the model to solve
00:52:37.960 | more challenging problems.
00:52:42.420 | Ryan, how much time do we have?
00:52:47.320 | [INAUDIBLE]
00:52:50.840 | Oh, OK, 50, OK.
00:52:59.300 | OK, great.
00:53:01.300 | That's a good example of how we judge these tasks, anyway.
00:53:05.300 | A bunch of different answers.
00:53:06.500 | All of them are right, but we judge them.
00:53:09.980 | Yeah.
00:53:11.300 | OK, great.
00:53:11.800 | Yeah, feel free to keep asking questions if you have any.
00:53:16.040 | So yeah, here's another example of emergence.
00:53:20.220 | So basically, you can see there's
00:53:22.100 | three models here, InstructGPT, Codex, and PaLM.
00:53:25.220 | Chain of thought is in blue and non-chain of thought
00:53:27.460 | is in gray.
00:53:30.380 | And then you can see, you actually
00:53:32.180 | have to have sufficient model scale
00:53:33.780 | to get chain of thought to work well.
00:53:36.620 | And I guess the intuition here is
00:53:39.700 | that if you have a really small model,
00:53:42.620 | the model will keep repeating itself
00:53:45.200 | or not say anything coherent or never get a final answer, which
00:53:48.040 | is why using chain of thought for the small models
00:53:50.120 | doesn't really work well.
00:53:51.660 | And then for the large models, obviously,
00:53:53.420 | for multi-step problems, the model
00:53:57.140 | is going to be able to solve the task at a lot higher
00:54:00.340 | accuracy with chain of thought.
00:54:01.700 | And another cool thing about chain of thought
00:54:07.500 | is there are some tasks where you wouldn't get
00:54:12.060 | emergent behavior at all.
00:54:14.620 | So emergence hasn't been unlocked yet.
00:54:18.660 | But you can see that if you use chain of thought,
00:54:22.420 | you can unlock this emergent performance in smaller models.
00:54:26.220 | So one example here is multi-step arithmetic,
00:54:29.980 | where I don't know if you'll ever--
00:54:33.040 | maybe I don't want to say never, but it's
00:54:35.580 | hard to imagine a model getting this:
00:54:37.980 | here's a question, and then the next token is the correct answer.
00:54:40.380 | That's pretty hard to solve in one step.
00:54:42.900 | But with chain of thought, you can get 50% accuracy on this
00:54:45.900 | just by having the model output these intermediate reasoning
00:54:51.900 | steps.
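As a rough sketch of the prompting format being described here: the exemplar text, the hypothetical generate() call, and the "The answer is" extraction rule below are illustrative assumptions, not the exact prompts or API from the paper.

```python
# A minimal sketch of few-shot chain-of-thought prompting for a multi-step
# question: prepend a worked exemplar, then read off the final answer.

COT_EXEMPLAR = (
    "Q: A juggler has 16 balls. Half of the balls are golf balls, and half "
    "of the golf balls are blue. How many blue golf balls are there?\n"
    "A: There are 16 / 2 = 8 golf balls. Half of them are blue, so there "
    "are 8 / 2 = 4 blue golf balls. The answer is 4.\n\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked exemplar so the model imitates the step-by-step format."""
    return COT_EXEMPLAR + f"Q: {question}\nA:"

def extract_answer(completion: str) -> str:
    """Take whatever follows the final 'The answer is' marker."""
    return completion.rsplit("The answer is", 1)[-1].strip(" .\n")

# Usage with a hypothetical generate(prompt) -> str call:
# completion = generate(build_cot_prompt("What is ((4 + 5) * 7) - 3?"))
# print(extract_answer(completion))   # ideally "60"
```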
00:54:54.260 | So I have a question about this.
00:54:56.220 | This is something that needs to have intuition
00:54:59.500 | about what's going on.
00:55:01.740 | Abstractly, I know that a transformer can definitely
00:55:04.940 | do addition, like an arithmetic in one step.
00:55:08.180 | But it can take in the numbers and do the carries.
00:55:11.460 | Definitely, yeah, yeah.
00:55:12.660 | But then there's this question of what happens empirically.
00:55:18.660 | And I understand that there isn't necessarily a lot of space
00:55:22.100 | to cover per tick.
00:55:25.300 | My question is, how do we tell the difference?
00:55:31.140 | Are there ways to tell the difference between things
00:55:34.180 | that haven't emerged because there's just no space--
00:55:37.780 | there are so many tasks that it couldn't allocate any capacity
00:55:44.660 | to specifically do that one--versus the task is so hard
00:55:49.300 | that it just can't, even if it uses all its capacity to try
00:55:56.420 | and do it?
00:55:57.100 | Yeah, that's a good question.
00:55:58.300 | I think there seems to be some subset of tasks
00:56:02.860 | where it just doesn't fit well with the way
00:56:06.460 | that we train language models.
00:56:07.900 | So for example, in language models, we use tokens, right?
00:56:11.460 | And so if you give it the token four,
00:56:15.900 | it actually doesn't take the number four.
00:56:18.220 | It takes this embedding that's 1,000 dimensions or something.
00:56:22.620 | Or if you give it a word and ask it to reverse the letters,
00:56:26.540 | this is a super easy task, but the way we train the model
00:56:29.140 | doesn't actually look at the letters and stuff.
00:56:31.700 | So I think there's a certain subset of tasks
00:56:33.540 | where it doesn't really just fit well with the way
00:56:37.300 | that we train transformers.
00:56:38.660 | And I think if you really care about these tasks,
00:56:43.940 | you can just solve them using code or something like that.
00:56:48.140 | But yeah, I don't think this is really
00:56:51.020 | an inherent-- something that would never emerge
00:56:54.300 | because it's too hard.
00:56:56.180 | Yeah.
00:56:57.700 | Yeah.
00:56:58.460 | We have a question on Zoom.
00:56:59.500 | Also, by the way, sorry, I forgot to mention.
00:57:01.420 | Somebody asked, can you repeat the questions?
00:57:03.100 | Because they can't always hear you.
00:57:04.100 | Oh, OK.
00:57:04.900 | Yeah.
00:57:05.420 | That's my bad.
00:57:06.020 | That's my bad.
00:57:06.660 | So the question someone asked is,
00:57:08.020 | do you think chain of thought would
00:57:09.480 | be a viable interpretability technique for very advanced AI
00:57:12.580 | systems?
00:57:13.700 | And they mentioned that there is some research
00:57:15.660 | on this from Anthropic called Externalized
00:57:17.780 | Reasoning Oversight.
00:57:21.580 | Will it be a viable interpretability technique
00:57:23.460 | for advanced AI?
00:57:25.860 | Yeah.
00:57:26.340 | Am I supposed to repeat this?
00:57:27.580 | Yeah, yeah, yeah.
00:57:28.000 | Sorry.
00:57:28.500 | Please.
00:57:29.100 | So the question is, can chain of thought
00:57:32.340 | be a viable interpretability technique for AI?
00:57:36.380 | I think there's no guarantee that the chain of thought
00:57:40.020 | is how the model actually arrives at the final answer.
00:57:43.460 | But often, you can use it to debug,
00:57:45.780 | why isn't the model getting this question correct?
00:57:47.820 | Or what can we do better in the chain of thought
00:57:51.300 | to help the model get this correct?
00:57:53.700 | I haven't read the anthropic paper that was mentioned.
00:57:56.260 | So I actually don't know the answer to that.
00:57:59.840 | Another interesting result that we had here
00:58:07.340 | was that you can actually do multilingual chain of thought
00:58:11.140 | prompting.
00:58:12.460 | And so basically, what we had is we
00:58:14.340 | translated this benchmark of math word problems
00:58:18.100 | to 10 languages.
00:58:19.860 | And then we prompt the model to do it in, say, Bengali.
00:58:23.940 | And then the model has to basically do
00:58:25.700 | the math problem in Bengali and give the final answer.
00:58:29.500 | And I think the cool thing about this
00:58:31.040 | is that this input is highly improbable, right?
00:58:33.780 | So Bengali is 0.01% of the pre-training data.
00:58:37.220 | And math word problems are probably
00:58:40.060 | an even smaller subset of that.
00:58:43.060 | And basically, the interesting thing is,
00:58:44.740 | the model can actually do these types of questions pretty well,
00:58:48.900 | to probably surprising degrees.
00:58:50.980 | If you had asked people, before I showed them this result,
00:58:53.100 | how well can the model do these math questions in Swahili,
00:58:56.260 | they'd probably say like 10%.
00:58:57.500 | But actually, even very underrepresented languages
00:59:01.820 | like Swahili or Bengali or Telugu and Thai,
00:59:06.940 | the model can do surprisingly well despite the fact
00:59:09.380 | that they only occupy a very small subset
00:59:12.420 | of the pre-trained data.
00:59:15.800 | Yeah.
00:59:16.300 | Actually, speaking to this, and most of my experience
00:59:19.180 | with this is with ChatGPT, but if you
00:59:21.020 | ask it things in different languages,
00:59:22.860 | despite not being explicitly trained in these languages,
00:59:25.900 | it seems to have derived reasoning independent
00:59:28.260 | of language, to a certain extent.
00:59:29.980 | It can do the reasoning.
00:59:31.300 | Actually, it's kind of funny.
00:59:32.540 | Sometimes, it looks like it
00:59:34.100 | does the reasoning in English, and it translates back
00:59:36.300 | to the other language, because the answers it gives you
00:59:39.340 | is sort of like if you reasoned in English
00:59:41.420 | and then translated to the other thing.
00:59:43.060 | So do you think that learning the structure of a language
00:59:46.420 | and learning reasoning abilities are
00:59:48.460 | somewhat separate in large language models,
00:59:50.660 | or that it inherently will learn a chain of thought reasoning
00:59:53.820 | within that language, within the structure of the language,
00:59:56.340 | like the way thought works in that language?
00:59:59.220 | Does that make sense?
01:00:00.060 | Yeah, that's a great question.
01:00:01.260 | I'm not sure how to measure that,
01:00:02.640 | but I've definitely thought about it.
01:00:04.300 | I think the language--
01:00:05.740 | I mean, based on these results, you probably
01:00:08.580 | didn't have any math questions in Swahili
01:00:10.980 | for the model to learn from.
01:00:12.960 | And I think, definitely, there's something language agnostic
01:00:15.500 | going on, where the model learns reasoning sort of independently
01:00:18.940 | of the language, and then it can express it
01:00:20.780 | in different languages if it needs to.
01:00:23.140 | But I don't think we know the answer to that yet.
01:00:26.540 | Yeah, so basically, one question that comes up frequently
01:00:36.580 | is, why does scaling up improve chain of thought?
01:00:39.620 | And one way of looking at this is,
01:00:41.380 | we can take a smaller model, like PaLM 62B,
01:00:43.660 | and see what types of errors are fixed from scaling up
01:00:46.460 | to 540 billion parameters.
01:00:49.020 | And you can see that, for these three categories
01:00:51.160 | of errors that we came up with, some errors in each of them get fixed.
01:00:54.220 | So scaling seems to have this universal effect
01:00:57.100 | on improving different types of errors from smaller models.
01:01:03.740 | And then here's the same hand-wavy diagram
01:01:06.420 | expressed in different ways.
01:01:07.620 | So basically, you have some tasks
01:01:09.900 | that are doable with standard prompting, so in blue.
01:01:13.060 | And then the goal of chain of thought prompting
01:01:15.700 | is to sort of increase the set of tasks that we can do.
01:01:19.580 | So for example, now, the ones shown in pink
01:01:22.820 | include math word problems, symbolic reasoning,
01:01:25.100 | and challenging commonsense reasoning, yeah.
01:01:28.540 | One more question.
01:01:29.380 | Have you done any calculations to figure out
01:01:33.020 | how much-- is any of this contribution just
01:01:37.100 | because of the fact that you do more computations when
01:01:40.620 | you put in longer prompts?
01:01:42.340 | Like, you put multiple tasks into the model.
01:01:45.740 | You create multiple embeddings to sort of adjust the things
01:01:49.620 | the model is looking at, in a way.
01:01:51.180 | Yeah.
01:01:51.660 | How much of that have you tried non-chain of thought prompts
01:01:54.660 | with same token lengths?
01:01:56.900 | Yeah, yeah.
01:01:57.420 | We tried with XXXXX or something.
01:02:00.940 | And it doesn't really-- it doesn't work.
01:02:02.580 | I see.
01:02:02.980 | So I think it's not just about the compute.
01:02:05.100 | I think it's about the language guiding the model as part
01:02:09.100 | of the reasoning, yeah.
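A rough sketch of that kind of length-matched control, under the assumption that the filler is repeated dots and that count_tokens() stands in for a real tokenizer; this is not the exact ablation setup from the paper:

```python
# Swap the worked reasoning in an exemplar for meaningless filler of roughly
# the same token length, so any gain from chain of thought can't be explained
# by extra computation alone. count_tokens() is a hypothetical tokenizer
# stand-in; the filler string and prompt format are assumptions.

def filler_exemplar(question: str, reasoning: str, answer: str,
                    count_tokens) -> str:
    """Build an exemplar whose 'reasoning' is dots of roughly matched length."""
    filler = " ." * count_tokens(reasoning)  # crudely ~one token per " ."
    return f"Q: {question}\nA:{filler} The answer is {answer}.\n\n"
```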
01:02:10.540 | I see.
01:02:11.040 | And have you tried describing the problem in more detail
01:02:13.500 | than non-chain of thought?
01:02:14.660 | I know this is a super [INAUDIBLE] question.
01:02:16.500 | I'm just very curious about-- this
01:02:17.940 | sounds like a very interesting property.
01:02:20.340 | And I'm very curious exactly how it fits in.
01:02:24.220 | Yeah, you mean like describing the question in three
01:02:26.460 | different ways and seeing if that--
01:02:27.620 | Yeah, just describing the question in more detail
01:02:29.100 | instead of explicitly doing the step-by-step things
01:02:31.300 | and seeing how that--
01:02:32.300 | Yeah.
01:02:33.100 | I haven't tried that, but I would be surprised if that
01:02:35.660 | worked.
01:02:36.180 | I see.
01:02:36.680 | Yeah.
01:02:37.180 | That is a question to me.
01:02:38.620 | Did you try having it output the answer
01:02:41.900 | and then explain its reasoning after that?
01:02:44.220 | Yeah, that doesn't work as well.
01:02:46.060 | Yeah, but it depends on the task also.
01:02:48.180 | So like--
01:02:48.980 | Yeah, yeah.
01:02:49.620 | Yeah.
01:02:50.120 | So like, there is something out of the [INAUDIBLE]
01:02:55.820 | That seems to be the case, yeah.
01:02:59.500 | Does it really have to be like reasoning?
01:03:01.260 | Can we do like just any amount of extra calculation
01:03:04.500 | to sort of conjure up the right answer?
01:03:07.540 | Like, in a way, you know, in a way like in an ablation,
01:03:09.980 | chain of thought is like a very structured thing.
01:03:13.180 | It's like, what if the same structure is preserved,
01:03:15.300 | but like we do some more random things?
01:03:19.660 | Yeah, you could try it.
01:03:21.700 | I would be surprised if it works.
01:03:23.060 | I think like outputting tokens is pretty important for the
01:03:28.100 | model.
01:03:28.600 | Yeah.
01:03:29.100 | Yeah.
01:03:32.800 | So we're doing OK on time.
01:03:36.620 | OK, great.
01:03:37.140 | So the last part, I think, is a pretty cool trick
01:03:40.540 | with chain of thought.
01:03:41.460 | So basically, what people usually do
01:03:45.200 | is they'll just generate one chain of thought,
01:03:47.940 | and then they'll take the final answer.
01:03:49.940 | But there's this nice trick called self-consistency
01:03:52.140 | where you can use like temperature sampling
01:03:54.460 | with the model to generate like a bunch of different reasoning
01:03:57.420 | paths and final answers.
01:03:59.220 | And then if you just take a majority vote
01:04:00.920 | over the final answers, this ends up
01:04:02.940 | like improving performance by like a pretty big margin.
01:04:05.340 | So for example, here, you can see
01:04:08.380 | on GSM8K, which is like the math word problem data set,
01:04:12.180 | the improvement goes from like, you know,
01:04:14.700 | the performance is like 56.
01:04:15.980 | And then if you do self-consistency,
01:04:18.140 | then it becomes 74, which is like a pretty big improvement.
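A minimal sketch of the self-consistency trick, assuming a hypothetical sample(prompt, temperature) function and that each sampled chain ends with "The answer is ..."; the 40 samples follow the number mentioned in the talk.

```python
# Sample several reasoning paths with temperature > 0, extract each final
# answer, and take a majority vote over the answers.

from collections import Counter

def self_consistent_answer(prompt, sample, n_samples=40, temperature=0.7):
    """Majority vote over final answers from independently sampled chains."""
    answers = []
    for _ in range(n_samples):
        completion = sample(prompt, temperature=temperature)
        answers.append(completion.rsplit("The answer is", 1)[-1].strip(" .\n"))
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```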
01:04:21.720 | Quick clarification question.
01:04:23.220 | Yeah.
01:04:23.720 | Here, how many are we averaging over for self-consistency?
01:04:27.180 | Oh, I think 40.
01:04:27.940 | So it increases the cost of the inference time compute.
01:04:31.980 | But yeah, it improves performance by a lot.
01:04:35.200 | You might be about to answer this,
01:04:36.620 | but I'm curious to know how many samples or how many chains
01:04:39.220 | does one need to draw to get a significant--
01:04:41.180 | like, what is the trade-off between number of chains
01:04:43.380 | averaged over versus performance gain?
01:04:46.320 | I think it depends on the--
01:04:49.060 | sorry, the question is, how many chains do you
01:04:51.020 | need to get a performance gain?
01:04:53.220 | I think the answer really depends on the data set.
01:04:58.140 | But usually, you can get something good with like 16,
01:05:00.380 | I think.
01:05:01.600 | Yeah.
01:05:02.980 | Oh, sorry, we have a question.
01:05:04.220 | How does the temperature change the way the model works?
01:05:08.260 | Oh, OK, the question is, how does the temperature
01:05:10.260 | change the way the model works?
01:05:12.020 | Basically, when you use temperature decoding,
01:05:16.060 | the language model can like stochastically
01:05:19.060 | pick one of the outputs instead of always picking
01:05:21.500 | the highest probability next word.
01:05:23.820 | So basically, you get these more stochastic outputs
01:05:27.500 | that are still based on what the language model has learned,
01:05:31.980 | but it's just a little bit more random.
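A small sketch of what temperature does to the next-token distribution; the logit values and temperature here are illustrative, not tied to any particular model.

```python
# Divide the logits by the temperature before the softmax: T > 1 flattens the
# distribution, T < 1 sharpens it, and greedy decoding is the T -> 0 limit.

import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.7) -> int:
    """Stochastically pick a token id instead of always taking the argmax."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()                 # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(np.random.choice(len(probs), p=probs))

# e.g. sample_next_token(np.array([2.0, 1.0, 0.5]), temperature=1.0)
```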
01:05:33.660 | And then finally, yeah, self-consistency also
01:05:42.580 | seems to be an emergent ability.
01:05:42.580 | I guess part of it is because chain of thought is emergent
01:05:44.820 | because you wouldn't get any better
01:05:47.380 | than the random performance without doing chain of thought.
01:05:53.380 | But yeah, you kind of see this big delta
01:05:56.420 | from self-consistency for larger models.
01:05:58.540 | Great, so I'm going to run out of time.
01:06:07.340 | Let me just go to--
01:06:09.220 | I'll just talk about this a little bit.
01:06:10.840 | So I think in addition to just purely scaling up
01:06:15.460 | the language model, which is only available to people
01:06:17.820 | in the industry, I think there's a couple interesting directions
01:06:21.500 | to work on.
01:06:24.420 | One is better prompting and characterization
01:06:26.540 | of language model abilities.
01:06:28.140 | I think right now, we're just at the edge
01:06:30.020 | of knowing what the best way to prompt language models is.
01:06:36.700 | There's also pretty good applied work.
01:06:38.380 | So you can use language models, I've
01:06:40.420 | heard, to train therapists, to help with creative writing,
01:06:43.180 | to help with science.
01:06:44.420 | I think ChatGPT has really shown what language models
01:06:48.000 | can do in this regard.
01:06:50.660 | I think benchmarks are also something that's pretty lacking
01:06:54.060 | because I think we solve benchmarks pretty quickly.
01:06:58.060 | For example, PaLM beat the average human
01:06:59.860 | on BIG-Bench within a year or something of BIG-Bench
01:07:02.660 | coming out.
01:07:03.940 | So I think we need more benchmarks,
01:07:05.700 | and I think that's going to be an important contribution.
01:07:09.460 | And then the final one is, how can we
01:07:13.180 | have compute-efficient methods to make language models better
01:07:17.100 | so that it's less expensive to use them
01:07:21.260 | and more people get to use them?
01:07:24.500 | Great.
01:07:25.000 | So I'll end here.
01:07:29.220 | And feel free to email me if you have any feedback.
01:07:32.220 | And if you're interested in Google,
01:07:34.300 | feel free to email me as well.
01:07:36.740 | Thanks.
01:07:37.220 | [APPLAUSE]
01:07:40.560 | Thank you.
01:07:42.120 | [APPLAUSE]