Stanford CS25: V2 | Emergent Abilities and Scaling in LLMs
So the first sort of paper that I'm going to be talking about 00:00:10.380 |
is called "Emergent Abilities of Large Language Models." 00:00:15.780 |
I think, because we got people from Google and DeepMind, 00:00:20.580 |
You might recognize Percy, or Tatsu, or Rishi. 00:00:28.020 |
at why we want to scale in emergent abilities. 00:00:33.660 |
So one of the things that we've seen throughout language models 00:00:38.960 |
is that you sort of get these predictable gains 00:00:45.340 |
paper, where you can see that if you scale up 00:00:49.380 |
the size of the language model, measured either in compute, 00:01:02.100 |
I don't know if you're screen sharing, so people in Zoom, 00:01:17.660 |
if you scale up the size of the language model, 00:01:20.100 |
measured either in compute, in data set size, 00:01:24.940 |
see that there's sort of this predictable improvement 00:01:29.060 |
Now, what I'm going to be talking about in terms 00:01:36.500 |
unpredictable if you only look at smaller language models. 00:01:40.780 |
So one way that emergence has been described in the broader 00:01:51.540 |
It sort of started with this article in Science 00:01:59.340 |
And I really like this post from Jacob Steinhardt 00:02:08.700 |
For example, he says, "With a bit of uranium, nothing special happens. 00:02:13.460 |
With a large amount of uranium packed densely enough, you get a nuclear reaction. 00:02:22.820 |
Or with only small molecules such as calcium, you can't meaningfully encode useful information. 00:02:25.180 |
But given larger molecules such as DNA, you can encode a genome." 00:02:28.020 |
So for this particular work, we use this definition 00:02:35.420 |
of emergent abilities of large language models in particular. 00:02:49.780 |
is how do you measure the size or the scale of the language 00:02:57.180 |
So the training flops are the amount of compute 00:03:02.300 |
the number of model parameters, or the size of the language 00:03:04.900 |
model, and also the size of the training data set 00:03:09.940 |
And a lot of the plots here will use either training flops 00:03:15.620 |
The reason is that the training data set size is usually 00:03:20.780 |
And because training flops is just the data set size 00:03:23.260 |
times model parameters, you can get a similar plot 00:03:27.300 |
from either training flops or number of model parameters 00:03:41.700 |
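To make that relationship concrete, here is a minimal sketch; the constant factor of 6 (roughly one forward plus backward pass per parameter per token) is a common rule of thumb, not a number given in the talk:

```python
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training-compute estimate: FLOPs grow with (model parameters x training tokens).

    The factor of 6 is an assumed rule of thumb for transformer training,
    not a value stated in the talk.
    """
    return 6.0 * n_params * n_tokens

# Example: a 137-billion-parameter model trained on about 1e12 tokens
# lands on the order of 1e24 training FLOPs.
print(f"{approx_training_flops(137e9, 1e12):.1e}")  # ~8.2e23
```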
For me, it seems like it would be relatively easier 00:03:48.560 |
How do you define what counts as an actual ability versus not? 00:03:57.140 |
So yeah, for example, I'll just give an example here, 00:04:03.500 |
So basically, we have this way of interacting 00:04:06.220 |
with language models called few-shot prompting. 00:04:17.980 |
and then you ask it for an unseen movie review, 00:04:22.780 |
for example, and then you say, what's the output? 00:04:28.460 |
to use the context from the review to give the next token. 00:04:34.140 |
And the definition that we use for having ability or not 00:04:38.900 |
is that basically, a few-shot prompted task, for example, 00:04:51.580 |
So basically, if the model isn't doing any better than random, 00:04:54.660 |
then we say it doesn't have the ability to do the task. 00:05:09.700 |
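For concreteness, a few-shot sentiment prompt of the kind being described might look like the following sketch (the reviews and labels are made up for illustration):

```python
few_shot_prompt = """\
Review: The plot was predictable and the acting was flat.
Sentiment: negative

Review: A beautifully shot film with a moving story.
Sentiment: positive

Review: I could not stop laughing; easily the best comedy this year.
Sentiment:"""

# The model is asked to continue the text. A model that "has" the ability
# completes the last line with "positive"; a model at random performance
# does no better than chance between the two labels.
```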
So basically, what we have, each of these different plots 00:05:20.700 |
is the number of training flops or the model scale. 00:05:36.340 |
And then each of the points is a different language model. 00:05:39.340 |
It's not a language model over the course of training. 00:05:45.060 |
And what you see is that for the very small language models, 00:05:49.420 |
you basically get performance that's close to random 00:06:07.040 |
So basically, if you were to extrapolate the lines 00:06:13.140 |
might predict that it would never do better than random 00:06:19.180 |
that when you scale up past a certain threshold, 00:06:21.620 |
you actually do see this emergent phenomenon where 00:06:34.380 |
It's basically a benchmark called multitask NLU or MMLU. 00:06:40.180 |
And basically, what it is, it's a bunch of test questions 00:06:45.100 |
ranging from high school all the way to professional level 00:06:49.820 |
And how it works is the language model is given-- 00:06:53.660 |
for example, here is a high school math example. 00:06:57.260 |
And the language model is given a few examples. 00:06:59.840 |
And then for an unseen question, it has to give the answer. 00:07:04.100 |
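As an illustration of that setup, here is a sketch of a few-shot multiple-choice prompt in the same spirit (the question and choices are invented, not taken from the actual MMLU benchmark):

```python
mmlu_style_prompt = """\
The following are multiple choice questions about high school mathematics.

Q: What is the value of 3x + 2 when x = 4?
(A) 10  (B) 12  (C) 14  (D) 16
A: (C)

Q: <unseen test question goes here>
(A) ...  (B) ...  (C) ...  (D) ...
A:"""

# The model is scored on whether it completes the prompt with the correct
# choice; with four options, random-guessing accuracy is 25%.
```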
And then you can see in the plot on the right-- 00:07:08.860 |
to sort of 10 to the power of 22 training flops, 00:07:12.300 |
you don't actually get any better than random accuracy 00:07:15.700 |
But if you scale up to 10 to the 24 training flops, 00:07:31.980 |
Or because these are different models trained 00:07:36.140 |
Yeah, the scale is, I think, within an order of magnitude 00:07:42.460 |
But every single dot on each individual tracks 00:07:47.700 |
Yes, the data is fixed except for Chinchilla. 00:08:05.900 |
So this is a task from the Big Bench Benchmark. 00:08:17.300 |
of benchmarks I'd recommend looking at if you're 00:08:21.300 |
And basically, the task is the language model 00:08:26.960 |
give the international phonetic alphabet transliteration, 00:08:38.860 |
is actually BLEU, or like an n-gram overlap metric. 00:08:46.020 |
where as you increase the size of the language model, 00:08:51.820 |
And then suddenly, the improvement is above random. 00:08:59.420 |
So I'll talk about another interesting result 00:09:04.540 |
So this was a technical report that we put out 00:09:09.180 |
And basically, there's this really interesting prize in-- 00:09:13.100 |
or it's like a one-time prize in language models, 00:09:21.420 |
basically had this prize where if people could come up 00:09:23.700 |
with a task where the performance on the task 00:09:35.460 |
So basically, there are a lot of submissions to this. 00:09:40.100 |
they found that the performance would actually 00:09:42.140 |
decrease if you increase the size of the language model. 00:09:49.380 |
And then the input is like, all that glisters is not glib. 00:10:01.340 |
And so what happened is for the small language model, 00:10:06.100 |
it doesn't know the phrase, all that glisters is not gold. 00:10:14.860 |
But then for the medium-sized language model, 00:10:17.820 |
what you would see is that the performance actually 00:10:22.220 |
model knows the phrase, all that glisters is not gold. 00:10:29.660 |
Someone on Zoom asked, can you give a physical estimate 00:10:32.060 |
of 10 to the 24 flops, possibly in terms of training 00:10:38.500 |
Yeah, so I think 10 to the 24 flops is around-- 00:10:48.260 |
And one pod of TPUs, I believe, is equal to like 4,000 A100s. 00:10:55.780 |
And 10 to the 24 flops is like two pods for around six weeks 00:11:02.420 |
So it's a lot of compute to do the pre-training. 00:11:16.020 |
I don't even think about how big this number is. 00:11:19.020 |
That's the number of floating point operations 00:11:21.780 |
that goes into the pre-training of some of these models. 00:11:28.220 |
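As a rough sanity check on that figure, here is a back-of-the-envelope calculation; the A100 peak throughput and the utilization fraction are assumed numbers for illustration, not values from the talk:

```python
# Order-of-magnitude check of the ~1e24 FLOPs figure (assumed hardware numbers).
a100_peak_flops = 312e12      # assumed A100 bf16 peak, FLOPs per second
utilization = 0.3             # assumed fraction of peak actually achieved
n_accelerators = 2 * 4000     # "two pods", one pod ~ 4,000 A100 equivalents
seconds = 6 * 7 * 24 * 3600   # six weeks

total_flops = a100_peak_flops * utilization * n_accelerators * seconds
print(f"{total_flops:.1e}")   # ~2.7e24, i.e. on the order of 1e24
```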
So yeah, so basically, the medium-sized language model 00:11:35.620 |
Yeah, oh, wait, did this one win the prize or not? 00:11:43.660 |
I think it's a third-place winner or something. 00:11:59.980 |
Well, because all of this depends on which evaluation 00:12:04.020 |
scheme you use to measure if you do your task right. 00:12:11.340 |
You only get credit if you do it perfectly or something. 00:12:15.180 |
Yeah, so for this thing, they accounted for all-- 00:12:43.340 |
And then, I guess, your point about how credit assignment 00:12:46.760 |
or the evaluation metric works is actually a really good one. 00:12:50.740 |
Yeah, so I guess it still kind of counts if-- 00:12:54.500 |
I guess the argument is that the performance might not 00:13:09.060 |
you'll often still see the same type of emergence. 00:13:14.340 |
assigning partial credit based on the evaluation metric. 00:13:27.160 |
is that, yeah, there might be some tasks where 00:13:29.420 |
the performance starts to decrease if you use a medium 00:13:33.780 |
But if you keep scaling all the way to the largest model 00:13:38.260 |
that we have at Google that's known publicly, PaLM, 00:13:43.420 |
you'll see that the language model can actually 00:13:51.260 |
knows the phrase, all that glisters is not gold. 00:13:54.100 |
But it also understands, repeat my sentences back to me. 00:14:01.060 |
So this is a different type of emergence also. 00:14:04.820 |
And another class of emergence that we talk about in the paper 00:14:16.780 |
with the language models that can be considered emergent. 00:14:23.020 |
Sorry, somebody else had a question on the previous one. 00:14:27.300 |
The question is, did all models undergo instruction fine-tuning 00:14:31.580 |
None of these models underwent instruction fine-tuning 00:14:41.780 |
So one way of interacting with language models 00:14:51.620 |
And basically, the way it works is you have this data. 00:14:54.180 |
And humans rate preferences based on the data. 00:14:57.620 |
And humans rate preferences for what type of outputs 00:15:02.620 |
And then the model is trained with RL to optimize 00:15:14.180 |
the model performance on a different zero-shot task 00:15:19.580 |
You can see the blue line is above the orange line. 00:15:28.780 |
though, then you can see that the performance actually 00:15:41.260 |
help if you try on a large enough language model. 00:15:43.420 |
So if you only try it on the small language models, 00:15:46.180 |
it'd be tough to draw the conclusion that it wouldn't 00:15:51.560 |
And then later, I'll talk about chain-of-thought prompting 00:16:01.060 |
used to think about emergence as a framework. 00:16:05.260 |
So on the x-axis here, there's a scale of the language model. 00:16:09.860 |
And on the y-axis is an imaginary scale of a range of things 00:16:18.500 |
And then basically, you can pick some random point, 00:16:20.700 |
like, say, 100 billion parameters in the language model. 00:16:30.220 |
of tasks or things that the language model can do increases. 00:16:33.740 |
And then you can see there are some tasks where-- 00:16:38.180 |
models above 100 billion parameters, for example, 00:16:41.420 |
But models below 100 billion parameters can't do them. 00:16:53.820 |
is tasks that a smaller language model wouldn't be able to do. 00:17:07.420 |
haven't been able to solve yet with language models. 00:17:13.440 |
that it's not that those tasks in the white region 00:17:19.300 |
Or do you think that better models, specific training 00:17:23.660 |
data, would allow us to, at the 100 billion scale, 00:17:27.980 |
Yeah, I definitely think that it's not a fixed-- 00:17:40.300 |
to have 100 billion parameters to do a certain task. 00:17:42.900 |
It's just that that happens to be the threshold that we've 00:18:00.100 |
one example of getting emergence can be with better data. 00:18:12.340 |
You can see that for LaMDA, which is a Google model, 00:18:18.300 |
from scaling to 137 or 175 billion parameters. 00:18:25.780 |
But when you come in with a different language model, 00:18:28.020 |
PaLM, which is trained on better data than LaMDA and GPT-3, 00:18:35.420 |
even with the smaller language model, so here 00:18:40.380 |
Do you see PaLM's better performance as better data, 00:18:49.940 |
Yeah, so the challenging thing is-- that's a great question. 00:18:54.940 |
There's a lot of differences between Palm and Lambda, 00:18:58.780 |
And we can't really ablate them in any controlled way 00:19:08.180 |
And that accounts for a lot of the difference 00:19:13.620 |
it is possible to ablate stuff, or some really architectural 00:19:21.300 |
So I guess even here, you can look at, for example, 00:19:23.420 |
the PaLM 8 billion model, like that point there. 00:19:28.700 |
You can ablate it, and it's a little bit higher, 00:19:30.900 |
but it's not really emergent yet at that point. 00:19:33.420 |
So it's hard to tell, for example, this particular task, 00:19:45.680 |
Oh, so I think the two lines here-- one is maybe three shot, 00:19:51.060 |
and then one is zero shot, something like that. 00:19:58.240 |
using the language model, either with or without exemplars. 00:20:07.160 |
I'll talk about a small ablation here that shows this. 00:20:11.080 |
So this is an ablation on a toy task, where basically, 00:20:17.560 |
the language model has to know that in English, 00:20:20.040 |
you have to use plural verbs with plural subjects 00:20:30.640 |
we train these small BERT models from scratch, 00:20:33.920 |
and then we fixed the frequency of certain verbs 00:20:39.080 |
in the training data set, which basically says, OK, 00:20:41.920 |
what's the effect of seeing a certain verb in the data 00:20:47.200 |
In this plot, the x-axis is the frequency of the verb, 00:20:53.200 |
And what you basically see is that if you have more 00:20:56.680 |
in-domain data, so if the model sees the verb more times, 00:21:02.160 |
So this is an example of having high-quality data or data 00:21:05.440 |
that's more in-domain for the task that you're evaluating on 00:21:18.720 |
to distill emergent abilities down to smaller models 00:21:25.560 |
So larger teacher models, you can use them, for example, 00:21:34.680 |
And then if you fine-tune the smaller model on data, 00:21:38.760 |
it's pretty likely that you'll be able to get the ability 00:21:49.760 |
Desired behaviors can be induced in smaller models 00:22:07.920 |
And you can see that there's multiple models. 00:22:17.560 |
than larger models trained on weaker techniques. 00:22:22.760 |
know that you want a certain behavior that you 00:22:26.160 |
saw previously in an emergent way in a larger model, 00:22:30.840 |
you can find a way to fine-tune on that behavior specifically 00:22:39.520 |
is that unless you know all the behaviors that you want, 00:22:43.160 |
you can't really get this natural emergent behavior. 00:22:57.800 |
So right now, we mostly talk about model parameters 00:23:02.160 |
But I guess if you ask DeepMind people how they look at it, 00:23:06.160 |
you'll get this argument that model parameters and training 00:23:09.640 |
flops are really just a proxy for how good the model is. 00:23:18.280 |
doing on some dev sets, such as WikiText-103. 00:23:39.580 |
you're able to do a lot better on the downstream task. 00:23:47.400 |
at least right now, between perplexity and training 00:23:50.120 |
So you can see these two lines are pretty similar. 00:23:57.880 |
if we have much better models that are a lot smaller, 00:24:00.640 |
trained on much better data and better algorithms, 00:24:04.960 |
show a different type of plot than using other metrics. 00:24:21.160 |
of how well you can predict the next word in a data set. 00:24:25.840 |
So basically, if you're really good at modeling 00:24:27.840 |
the next word on this particular evaluation set, 00:24:32.080 |
that's sort of a measure of how well you understand language. 00:24:54.960 |
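For reference, perplexity here is just the exponential of the average negative log-likelihood per token; a minimal sketch with synthetic numbers:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(average negative log-likelihood per token); lower is better."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Synthetic log-probabilities the model assigned to each observed next token.
print(round(perplexity([-2.3, -0.7, -1.1, -0.2]), 2))  # ~2.93
```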
is that there's sort of not just technical emergence 00:24:59.760 |
sort of sociological changes in how the AI community views 00:25:12.280 |
the size of the language model enables you to, 00:25:17.880 |
beat a task-specific fine-tuned language model that's 00:25:21.120 |
usually fine-tuned on, say, thousands of examples. 00:25:25.240 |
So basically, the green line is the prior state of the art 00:25:36.120 |
and you do few-shot prompting, which means the language model 00:25:42.520 |
just by continuing to scale up the size of the language model. 00:25:51.720 |
But I think it's a pretty big change in people's minds 00:25:54.600 |
that you could actually get some of the best results 00:25:56.880 |
just by scaling up the size of the language model 00:26:14.160 |
Should we, in general, assume-- oh, he said yes. 00:26:26.920 |
So this plot is saying that you fine-tune and you can do-- 00:26:39.480 |
But what this plot is saying is that fine-tuned smaller models 00:26:50.600 |
can do well on some tasks if you target it well. 00:27:06.400 |
I guess some tasks you would do a lot better just 00:27:14.440 |
And then other tasks, if it's a very narrow domain 00:27:34.640 |
And if you try to predict their emergence just 00:27:48.480 |
of how to view new abilities that are not intentionally 00:27:55.120 |
And I think the subtext for this is super important, which 00:27:57.560 |
is you can see it as an implicit argument for why we should keep 00:28:03.760 |
get these abilities that are really hard to find otherwise. 00:28:07.280 |
And the context around this is pretty important, 00:28:09.240 |
because it's really expensive to continue scaling up 00:28:17.800 |
didn't believe that you could do better on certain tasks 00:28:20.320 |
just by scaling up the size of the language model. 00:28:23.920 |
They sort of-- if you work in industry at all, 00:28:25.880 |
there's this interesting tension between emergence and also 00:28:30.920 |
So emergence is sort of this task general phenomena 00:28:34.760 |
where you scale up the model, and it's really expensive. 00:28:52.760 |
And then you have these constraints on compute, 00:28:54.720 |
because when you build Google Translate, for example, 00:28:57.280 |
you don't want people to have to wait a couple of seconds 00:29:02.200 |
And then you also happen to have a lot of in-domain data. 00:29:12.200 |
And this is sort of the opposite setting, where you don't really 00:29:18.720 |
You can just train a very small model on the data 00:29:21.000 |
and do all of the tasks without having to use a lot of compute. 00:29:28.960 |
a really promising research direction, if anyone 00:29:33.000 |
to work on predicting future emergent abilities. 00:29:36.840 |
And I haven't seen a lot of work on it recently, 00:29:38.840 |
just because I think maybe it's too hard, for example. 00:29:42.200 |
You can only predict emergence for a specific task. 00:29:52.160 |
But I think this is a pretty promising direction to work on. 00:30:10.280 |
which parameters are best scaled to get related properties? 00:30:13.800 |
Because obviously, there are many different options 00:30:19.640 |
like GPT, for example, you could add more to the inventory. 00:30:22.760 |
You could add more in [INAUDIBLE] or whatever. 00:30:32.320 |
Yeah, I would say that we don't have very principled methods 00:30:43.000 |
But some of it has to deal with how many parameters you 00:30:52.440 |
the number of attention heads and embeddings 00:30:57.680 |
But yeah, I think this is an open research question. 00:31:07.880 |
It's hard to have any principled way of doing it, 00:31:12.440 |
other than some engineers who are in charge of doing it, 00:31:15.480 |
saying, OK, I think this is the right thing to do. 00:31:21.280 |
Do you have any indication of the asymptotic behavior 00:31:26.840 |
would reach either some plateau at a finite, non-zero loss, 00:31:30.960 |
or it would just go all the way down to zero? 00:31:36.340 |
You mean on perplexity or on a particular task, 00:31:47.000 |
Well, it seems like these results are pretty general, 00:31:53.120 |
But if you take the limit of infinite parameters, 00:32:05.280 |
there's a limit to accuracy, like 100%, for example. 00:32:10.440 |
But I guess the deeper question that you might be asking 00:32:16.120 |
know how to predict the next word for any given input? 00:32:28.120 |
if I say a sentence, there are two possible next words 00:32:31.840 |
And you might not be able to guess that perfectly. 00:32:35.400 |
So I think there's some limit, but I think we're 00:32:40.440 |
that sort of indicate that there's a lot of headroom. 00:32:45.640 |
If researchers are interested in studying emergence, 00:32:51.520 |
is publicly available or best for studying this? 00:32:58.520 |
So I think the OpenAI API has a lot of language models. 00:33:04.920 |
Even at Google, it's used to study emergence. 00:33:10.220 |
And actually, a lot of these models are currently free. 00:33:18.520 |
I think there's also smaller language models. 00:33:28.080 |
There is sort of this challenge where the small language 00:33:30.520 |
models, you won't see a lot of these emergent behaviors. 00:33:37.120 |
yeah, so you kind of have to either use OpenAI API for now 00:33:44.180 |
I guess there's also the BLOOM and, like, you guys probably 00:33:49.260 |
that are publicly available, but I haven't seen 00:33:56.100 |
So my question is, are there emergent abilities 00:34:00.260 |
that are accessible in lower parameter regimes? 00:34:03.620 |
I can think of, like, more of a speech technique or [INAUDIBLE] 00:34:09.520 |
I would expect maybe there might be, like, some better-- 00:34:19.040 |
that would be emergent at, like, 8 billion parameters 00:34:21.400 |
or, like, 60 billion parameters, something like that, yeah. 00:34:27.200 |
The first question is, do you see strategy tactics 00:34:30.080 |
between the larger tech firms differing systematically 00:34:33.560 |
in studying these models, or is basically everyone 00:34:37.400 |
I wouldn't say that everyone is taking the same approach. 00:34:55.800 |
And they're super interested in, like, emergent abilities 00:34:59.080 |
because there could be emergent abilities that are undesirable 00:35:02.920 |
and they want to predict those types of things. 00:35:06.560 |
I also don't know what happens at other companies 00:35:09.840 |
other than at Google, so I can't really speak too much to that. 00:35:14.120 |
The second question, what are some examples of tasks 00:35:17.120 |
or abilities that have not yet emerged, even in models 00:35:52.660 |
into, like, smoothly increasing, emergent with GPT-3 or LaMDA, 00:35:58.380 |
emergent with PaLM, and then flat, which is, like, 00:36:04.420 |
here, they should still not have emerged yet. 00:36:08.820 |
And if you can get them to emerge, that'd be interesting. 00:36:20.340 |
I think this is, like, a couple of months old. 00:36:33.820 |
Like, originally, like, maybe only these were emergent. 00:36:38.580 |
see a couple dozen more abilities became emergent. 00:36:40.700 |
And then I suspect in a year or two, most of these 00:36:50.860 |
Why doesn't Google take as much of a safety-centered approach, 00:36:55.860 |
Are there reasons to believe harmful capabilities wouldn't 00:36:59.700 |
Yeah, I don't want to answer the question on behalf of Google. 00:37:12.780 |
even if you look at, like, the amount of research 00:37:14.820 |
that Google does, it might not be in the large language 00:37:19.260 |
But the amount of safety research that we do, I think, 00:37:34.920 |
Yeah, I'll talk about chain-of-thought prompting. 00:37:44.020 |
is this way of doing reasoning, multi-step reasoning, 00:37:53.980 |
super exciting to see a lot of people at Google working 00:38:00.580 |
present this at our last year's Google I/O press event. 00:38:10.380 |
is that we want language models to do more complicated tasks. 00:38:17.820 |
can do easy tasks like sentiment analysis or translation. 00:38:23.180 |
that might even take a human a minute or more to do? 00:38:32.100 |
So for example, instead of just giving an input-output pair, 00:38:35.260 |
we want to give them the entire reasoning process 00:38:41.460 |
And basically, you can see here, in a standard prompt, 00:38:57.860 |
And then, kind of like how your teacher would 00:39:00.140 |
ask you to show your work, you give the chain-of-thought, 00:39:04.220 |
is what we call it, or basically a reasoning path. 00:39:08.460 |
And then when the model sees this unseen question, 00:39:15.900 |
And the way that we add these prompts into the prompt 00:39:32.780 |
And basically, here's the non-chain-of-thought way 00:39:39.020 |
So basically, you would have question, answer, question, 00:39:47.900 |
They used 20 to make lunch and bought six more 00:39:56.420 |
And the only difference with chain-of-thought 00:39:58.900 |
is that you give these intermediate reasoning 00:40:10.660 |
And then now, the model for this unseen question 00:40:18.940 |
And then this actually enables the model to get it correct. 00:40:28.460 |
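To make the difference concrete, here is a sketch of the two prompt formats for that example (this is the well-known cafeteria word problem from the chain-of-thought paper; the exact wording here is paraphrased):

```python
standard_prompt = """\
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A: The answer is 9.

Q: <unseen test question>
A:"""

chain_of_thought_prompt = """\
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A: The cafeteria started with 23 apples. They used 20, so they had 23 - 20 = 3.
They bought 6 more, so they have 3 + 6 = 9. The answer is 9.

Q: <unseen test question>
A:"""

# The only change is that the exemplar's answer now spells out the intermediate
# reasoning steps before giving the final answer.
```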
So here, the task is just take the last letters 00:40:31.300 |
of the words in Bill Gates, so like L from Bill and S 00:40:50.100 |
obviously, this becomes very easy for the model. 00:40:55.940 |
The last letter of Gates is S. The answer is LS. 00:41:00.860 |
And then here, it's able to do the last letter of Elon is N. 00:41:03.940 |
And the last letter of Musk is K. And the answer is NK. 00:41:16.020 |
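The ground truth for that toy task is easy to write down; this tiny helper (just for illustration) computes what the model is being asked to produce:

```python
def last_letter_concatenation(name: str) -> str:
    """Concatenate the last letter of each word, e.g. 'Bill Gates' -> 'ls'."""
    return "".join(word[-1] for word in name.split())

print(last_letter_concatenation("Bill Gates"))  # ls
print(last_letter_concatenation("Elon Musk"))   # nk
```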
So basically, we can have these similar plots 00:41:26.580 |
So on the left, we have this math word question 00:41:41.020 |
And basically, you see that the chain-of-thought, 00:41:43.020 |
if you use a large enough model, does a lot better 00:41:49.020 |
It actually beats the fine-tuned state-of-the-art at the time. 00:41:53.620 |
A similar example is on this benchmark called Strategy QA. 00:42:00.260 |
like this world knowledge plus common sense reasoning 00:42:08.420 |
And then the model would say, a basketball is about this size. 00:42:16.420 |
see that we can beat the fine-tuned state-of-the-art 00:42:30.660 |
evaluate a chain-of-thought on a certain subset of Big Bench 00:42:35.560 |
So we created a subset called Big Bench Hard. 00:42:38.780 |
And basically, it's like 23 challenging tasks 00:42:41.780 |
from Big Bench, where no model had done better 00:42:50.700 |
is that you'd have a task description, question, options, 00:42:54.460 |
chain-of-thought, and then the test time question. 00:42:59.060 |
And so I'll give a couple examples of tasks here. 00:43:06.660 |
Basically, what the language model has to do in this task 00:43:11.620 |
So the question is, if you follow these instructions, 00:43:16.260 |
Turn left, turn right, take five steps, take four steps, 00:43:21.100 |
And then the model, following the few shot exemplars, 00:43:25.540 |
is able to basically track state after all of the actions. 00:43:52.860 |
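What "tracking state" means here can be written down directly; this is a minimal sketch of the underlying computation for a navigate-style question (the instruction list below is illustrative, not the exact Big Bench item):

```python
def returns_to_start(instructions: list[str]) -> bool:
    """Simulate turns and steps on a 2D grid, starting at the origin facing north."""
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # north, east, south, west
    facing, x, y = 0, 0, 0
    for inst in instructions:
        if inst == "turn left":
            facing = (facing - 1) % 4
        elif inst == "turn right":
            facing = (facing + 1) % 4
        elif inst.startswith("take"):
            steps = int(inst.split()[1])
            dx, dy = directions[facing]
            x, y = x + steps * dx, y + steps * dy
    return (x, y) == (0, 0)

# After "turn left, turn right, take 5 steps, take 4 steps" you are 9 steps away,
# so the answer to "do you return to the starting point?" is no.
print(returns_to_start(["turn left", "turn right", "take 5 steps", "take 4 steps"]))  # False
```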
And basically, the model has to sort them alphabetical order. 00:43:57.780 |
And here, the model can follow the few shot exemplars. 00:44:00.020 |
So you have this pretty complicated chain-of-thought 00:44:05.260 |
where the model has to sort each of the subparts. 00:44:08.900 |
And then finally, it gets to the final answer, which is correct. 00:44:15.580 |
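And the target behavior for the word-sorting task is again trivial to state in code (a made-up word list for illustration):

```python
words = ["syntax", "baker", "mongoose", "burley"]  # made-up example list
print(" ".join(sorted(words)))  # baker burley mongoose syntax
```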
So here's this result summary on the subset of BigBench. 00:44:40.220 |
And then prior results, the model was doing way worse. 00:44:54.460 |
And actually, you can use this one for free with the OpenAI API. 00:44:58.220 |
And basically, if you do answer-only prompting 00:45:03.580 |
are beating the average human reader on 5 of 27. 00:45:09.500 |
then the performance increases by this pretty decent amount. 00:45:19.980 |
of the tasks that are doing worse than humans in red 00:45:27.620 |
Isn't this similar to RLHF in spirit, at least? 00:45:49.340 |
and you use a prompting technique that includes 00:45:54.340 |
The way that RLHF works is that you have this additional data 00:46:02.420 |
sort of predicts how well does a certain output-- 00:46:09.420 |
how likely is that to be preferred by humans? 00:46:12.300 |
And then RLHF, what that does is it fine-tunes the language 00:46:17.620 |
model to do well on the preference model's prediction. 00:46:20.660 |
So basically, it's sort of aligning the model 00:46:41.540 |
is that you have to have chain-of-thought intermediate 00:46:48.460 |
it can be costly to gather that data and to do the fine-tuning. 00:46:58.220 |
that chain-of-thought and prompt engineering in general 00:47:00.940 |
is just an artifact that won't be necessary for larger scale 00:47:04.220 |
models that are better able to understand the prompts 00:47:11.340 |
Basically, the question is how ephemeral is prompt engineering 00:47:18.340 |
But some initial intuitions are that for easy tasks that 00:47:22.740 |
are easy to describe and maybe they're multiple choice, 00:47:26.620 |
larger models will probably be more robust to prompt 00:47:29.700 |
And there's sort of less you can do with that. 00:47:31.740 |
But I think as language models get more powerful, 00:47:40.140 |
And in those tasks, you'll have to specify exactly what you 00:47:44.420 |
So I think there will still be some room for prompt 00:47:46.460 |
engineering there, at least in the near future. 00:47:50.140 |
Do you know how well this chain-of-thought prompting is 00:47:52.660 |
generalizing to, for example, you showed these two tasks, 00:47:55.180 |
right, a simple math and a map, and then the other one 00:47:58.180 |
basically sorting the words alphabetically, right? 00:48:02.140 |
So I mean, I see that's the case with the math. 00:48:04.940 |
Math has to give this chain-of-thought prompting, 00:48:19.380 |
So for some tasks where you've seen similar data 00:48:24.380 |
in pre-training, the model can do really well, 00:48:26.780 |
even if the chain-of-thought is from another task. 00:48:31.300 |
you actually don't really need a math chain-of-thought, 00:48:33.580 |
because the model already knows how to do that. 00:48:38.420 |
haven't seen any data that's like the chain-of-thought here. 00:48:44.540 |
on tasks like this without manually writing them 00:48:52.620 |
I'm wondering, as the researcher behind this, 00:48:55.660 |
what mental model would lead you to even try this? 00:48:59.500 |
Do you perceive the model as, if I was a person, 00:49:02.860 |
Or is it trying to give it more compute in order 00:49:12.140 |
I think my motivation was just thinking about it, 00:49:14.380 |
just as you said, what's going on in a human's mind 00:49:21.540 |
And well, if you notice, at least some humans 00:49:28.420 |
So if you pay attention a lot to what's going on in your mind, 00:49:35.860 |
And so well, the language model can think in language, too. 00:49:50.380 |
actually coincided with the development of PaLM. 00:49:58.820 |
allowed us to do a lot more challenging tasks 00:50:09.580 |
we're saying that it matters that the absolute number 00:50:14.380 |
or whatever in the data set or in the fine tuning, 00:50:21.260 |
Or is it this relative number of frequency of those examples 00:50:29.020 |
Do those matter as much as the absolute number 00:50:38.300 |
So I guess the challenging thing is we can't really 00:50:40.500 |
measure how many similar examples are in the training 00:50:46.900 |
And I don't think anyone has done that before. 00:51:00.420 |
Yeah, I think it's an open question of why it works. 00:51:03.760 |
I mean, you said, OK, think about how new things-- 00:51:21.120 |
I mean, is there a shift in what specific task? 00:51:34.080 |
I guess the way that I think about it is that it'd be like if I give you a really hard question 00:51:39.980 |
and ask you to give me the answer within half 00:51:43.000 |
a second, which is basically what you're doing with the model 00:51:45.720 |
and when you don't do a chain of thought, right? 00:51:47.760 |
You're basically asking this challenging question, 00:51:52.080 |
to solve it in one pass to give you the answer immediately. 00:52:00.840 |
is that the model has learned a compositional set of skills 00:52:05.920 |
during pre-training, so maybe it hasn't really 00:52:08.800 |
learned this particular navigate task during pre-training. 00:52:14.400 |
It's learned like, OK, if you take five steps 00:52:18.840 |
should add five here or something like that, right? 00:52:33.720 |
And then maybe if you can combine them together 00:52:35.680 |
in some clever way, then you can get the model to solve 00:53:01.300 |
That's a good example of how we judge these tasks, anyway. 00:53:11.800 |
Yeah, feel free to keep asking questions if you have any. 00:53:16.040 |
So yeah, here's another example of emergence. 00:53:22.100 |
three models here, InstructGPT, Codex, and PaLM. 00:53:25.220 |
Chain of thought in blue and non-chain of thought 00:53:45.200 |
or not say anything coherent or never get a final answer, which 00:53:48.040 |
is why using chain of thought for the small models 00:53:57.140 |
is going to be able to solve the task at a lot higher 00:54:01.700 |
And another cool thing about chain of thought 00:54:07.500 |
is there are some tasks where you wouldn't get 00:54:18.660 |
But you can see that if you use chain of thought, 00:54:22.420 |
you can unlock this emergent performance in smaller models. 00:54:26.220 |
So one example here is multi-step arithmetic, 00:54:37.980 |
Here's a question, and then the next token is correct. 00:54:42.900 |
But with chain of thought, you can get 50% accuracy on this 00:54:45.900 |
just by having the model output these intermediate reasoning 00:54:56.220 |
This is something that needs to have intuition 00:55:01.740 |
Abstractly, I know that a transformer can definitely 00:55:08.180 |
But it can take in the numbers and do the carries. 00:55:12.660 |
But then there's this question of what happens empirically. 00:55:18.660 |
And I understand that it isn't necessarily a lot space 00:55:25.300 |
My question is, how do we tell the difference? 00:55:31.140 |
Are there ways to tell the difference between things 00:55:34.180 |
that haven't emerged because there's just no space? 00:55:37.780 |
Or there's so many tasks that it couldn't allow any space 00:55:44.660 |
to specifically do that one, versus the task is so hard 00:55:49.300 |
that it just can't even if you use all the capacity to try 00:55:58.300 |
I think there seems to be some subset of tasks 00:56:07.900 |
So for example, in language models, we use tokens, right? 00:56:18.220 |
It takes this embedding that's 1,000 dimensions or something. 00:56:22.620 |
Or if you give it a word and ask it to reverse the letters, 00:56:26.540 |
this is a super easy task, but the way we train the model 00:56:29.140 |
doesn't actually look at the letters and stuff. 00:56:33.540 |
where it doesn't really just fit well with the way 00:56:38.660 |
And I think if you really care about these tasks, 00:56:43.940 |
you can just solve them using code or something like that. 00:56:51.020 |
an inherent-- something that would never emerge 00:56:59.500 |
Also, by the way, sorry, I forgot to mention. 00:57:01.420 |
Somebody asked, can you repeat the questions? 00:57:09.480 |
be a viable interpretability technique for very advanced AI 00:57:13.700 |
And they mentioned that there is some research 00:57:21.580 |
Will it be a viable interpretability technique 00:57:32.340 |
be a viable interpretability technique for AI? 00:57:36.380 |
I think there's no guarantee that the chain of thought 00:57:40.020 |
is how the model actually arrives at the final answer. 00:57:45.780 |
why isn't the model getting this question correct? 00:57:47.820 |
Or what can we do better in the chain of thought 00:57:53.700 |
I haven't read the anthropic paper that was mentioned. 00:58:07.340 |
was that you can actually do multilingual chain of thought 00:58:14.340 |
translated this benchmark of math word problems 00:58:19.860 |
And then we prompt the model to do it in, say, Bengali. 00:58:25.700 |
the math problem in Bengali and give the final answer. 00:58:31.040 |
is that this input is highly improbable, right? 00:58:33.780 |
So Bengali is 0.01% of the pre-training data. 00:58:44.740 |
the model can actually do these types of questions pretty well, 00:58:50.980 |
If you ask people before I showed them this result, 00:58:53.100 |
how well can the model do these math questions in Swahili? 00:58:57.500 |
But actually, even very underrepresented languages 00:59:06.940 |
the model can do surprisingly well despite the fact 00:59:16.300 |
Actually, speaking to this, and most of my experience 00:59:22.860 |
despite not being explicitly trained in these languages, 00:59:25.900 |
it seems to have derived reasoning independent 00:59:34.100 |
does the reasoning in English, and it translates back 00:59:36.300 |
to the other language, because the answers it gives you 00:59:43.060 |
So do you think that learning the structure of a language 00:59:50.660 |
or that it inherently will learn a chain of thought reasoning 00:59:53.820 |
within that language, within the structure of the language, 01:00:12.960 |
And I think, definitely, there's something language agnostic 01:00:15.500 |
going on, where the model learns reasoning sort of independently 01:00:23.140 |
But I don't think we know the answer to that yet. 01:00:26.540 |
Yeah, so basically, one question that comes up frequently 01:00:36.580 |
is, why does scaling up improve chain of thought? 01:00:43.660 |
and see what types of errors are fixed from scaling up 01:00:49.020 |
And you can see that, for these three categories 01:00:51.160 |
that we came up with, some errors in each of them get fixed. 01:00:54.220 |
So scaling seems to have this universal effect 01:00:57.100 |
on improving different types of errors from smaller models. 01:01:09.900 |
that are doable with standard prompting, so in blue. 01:01:13.060 |
And then the goal of chain of thought prompting 01:01:15.700 |
is to sort of increase the set of tasks that we can do. 01:01:22.820 |
include math word problems, symbolic reasoning, 01:01:37.100 |
because of the fact that you do more computations when 01:01:45.740 |
You create multiple embeddings to sort of adjust the things 01:01:51.660 |
How much of that have you tried non-chain of thought prompts 01:02:05.100 |
I think it's about using the language to guide the model as part 01:02:11.040 |
And have you tried describing the problem in more detail 01:02:24.220 |
Yeah, you mean like describing the question in three 01:02:27.620 |
Yeah, just describing the question in more detail 01:02:29.100 |
instead of explicitly doing the step-by-step things 01:02:33.100 |
I haven't tried that, but I would be surprised if that 01:02:50.120 |
So like, there is something out of the [INAUDIBLE] 01:03:01.260 |
Can we do like just any amount of extra calculation 01:03:07.540 |
Like, in a way, you know, in a way like in an ablation, 01:03:09.980 |
chain of thought is like a very structured thing. 01:03:13.180 |
It's like, what if the same structure is preserved, 01:03:23.060 |
I think like outputting tokens is pretty important for the 01:03:37.140 |
So the last part, I think, is a pretty cool trick 01:03:45.200 |
is they'll just generate one chain of thought, 01:03:49.940 |
But there's this nice trick called self-consistency 01:03:54.460 |
with the model to generate like a bunch of different reasoning 01:04:02.940 |
like improving performance by like a pretty big margin. 01:04:08.380 |
on GSM8K, which is like the math word problem data set, 01:04:18.140 |
then it becomes 74, which is like a pretty big improvement. 01:04:23.720 |
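Here is a minimal sketch of self-consistency as described; `sample_chain_of_thought` and `extract_final_answer` are hypothetical stand-ins for a temperature-sampled language model call and an answer parser:

```python
from collections import Counter
from typing import Callable

def self_consistency(question: str,
                     sample_chain_of_thought: Callable[[str], str],
                     extract_final_answer: Callable[[str], str],
                     n_samples: int = 16) -> str:
    """Sample several reasoning paths with temperature > 0, then majority-vote the final answers."""
    answers = []
    for _ in range(n_samples):
        chain = sample_chain_of_thought(question)   # stochastic: a different reasoning path each call
        answers.append(extract_final_answer(chain))
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```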
Here, how many are we averaging over for self-consistency? 01:04:27.940 |
So it increases the cost of the inference time compute. 01:04:36.620 |
but I'm curious to know how many samples or how many chains 01:04:41.180 |
like, what is the trade-off between number of chains 01:04:49.060 |
sorry, the question is, how many chains do you 01:04:53.220 |
I think the answer really depends on the data set. 01:04:58.140 |
But usually, you can get something good with like 16, 01:05:04.220 |
How does the temperature change the way the model works? 01:05:08.260 |
Oh, OK, the question is, how does the temperature 01:05:12.020 |
Basically, when you use temperature decoding, 01:05:19.060 |
pick one of the outputs instead of always picking 01:05:23.820 |
So basically, you get these more stochastic outputs 01:05:27.500 |
that are still based on what the language model has learned, 01:05:33.660 |
And then finally, yeah, self-consistency also 01:05:42.580 |
I guess part of it is because chain of thought is emergent 01:05:47.380 |
than the random performance without doing chain of thought. 01:06:10.840 |
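Tying back to the earlier question about temperature: temperature just reshapes the next-token distribution before sampling. A minimal sketch using only the standard library (the toy logits are invented):

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 0.7) -> str:
    """Sample a token from softmax(logits / temperature); lower temperature approaches greedy decoding."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())                        # subtract max for numerical stability
    weights = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    threshold = random.random() * sum(weights.values())
    for tok, w in weights.items():
        threshold -= w
        if threshold <= 0:
            return tok
    return tok  # floating-point edge case: fall back to the last token

# Greedy decoding always picks the highest-logit token; temperature sampling
# sometimes picks an alternative, which is what produces the diverse reasoning
# paths that self-consistency aggregates over.
print(sample_next_token({"gold": 2.0, "glib": 1.0, "great": 0.5}))
```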
So I think in addition to just purely scaling up 01:06:15.460 |
the language model, which is only available to people 01:06:17.820 |
in the industry, I think there's a couple interesting directions 01:06:30.020 |
of knowing what the best way to prompt language models is. 01:06:40.420 |
heard, to train therapists, to help with creative writing, 01:06:44.420 |
I think ChatGPT has really shown what language models 01:06:50.660 |
I think benchmarks are also something that's pretty lacking 01:06:54.060 |
because I think we solve benchmarks pretty quickly. 01:06:59.860 |
on Big Bench within a year or something of Big Bench 01:07:05.700 |
and I think that's going to be an important contribution. 01:07:13.180 |
have compute-efficient methods to make language models better 01:07:29.220 |
And feel free to email me if you have any feedback.