The State of Reasoning — from Nathan Lambert, Interconnects/AI2 [LS Live @ NeurIPS 2024]

00:00:18.380 |
scaling, open models, synthetic data agents, et cetera. 00:00:22.860 |
And he asked me to fill in a quick slot on reasoning. 00:00:25.980 |
A couple notes, this was before O3 was announced by OpenAI. 00:00:29.380 |
So I think you can take everything that I said 00:00:39.820 |
So I re-edited the slides to match up with the audio. 00:00:47.620 |
and it should do a good job getting the conversation started 00:00:50.260 |
around reasoning on interconnects in the new year. 00:00:57.780 |
I wouldn't say my main research area is reasoning. 00:01:00.780 |
I would say that I came from a reinforcement learning 00:01:05.620 |
and reasoning is now getting subverted into that 00:01:10.980 |
And a lot of this is probably transitioning these talks 00:01:13.620 |
into more provocative forms to prime everyone 00:01:16.820 |
for the debate that is why most people are here. 00:01:31.660 |
because there's a lot of debates on reasoning 00:01:33.300 |
and I wanted to revisit a very basic definition. 00:01:38.040 |
which is the action of thinking about something 00:01:55.800 |
and now reasoning kind of seems like the same thing, 00:02:00.640 |
because it's like reasoning is a very general skill 00:02:13.460 |
I think I don't need to share a ton of examples 00:02:24.660 |
and I think there are some very credible arguments 00:02:49.440 |
the stochastic parrots thing is true for many reasons 00:02:53.340 |
and we should embrace this and we should continue 00:02:58.580 |
is that we're seeing new types of language model reasoning 00:03:06.060 |
for expecting a really narrow type of behaviors. 00:03:13.900 |
which I thought was a very good education for me on this 00:03:17.620 |
and this is just a direct pull from the transcript. 00:03:22.680 |
if you do chain of thought on a language model, 00:03:29.040 |
If I were to ask you all a math problem right now, 00:03:33.920 |
and you're doing some sort of intermediate storage 00:03:41.080 |
They are kind of per token computation devices 00:03:45.720 |
where each token is outputted after doing this forward pass 00:03:49.600 |
and within that there's no explicit structure 00:03:58.760 |
for the language models is extremely reasonable 00:04:01.440 |
and it's showing that they're doing something 00:04:14.840 |
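To make the "intermediate storage" point concrete, here is a toy illustration (my own sketch, not something shown in the talk) of a scratchpad where each step writes its partial result back into the visible context, the way a model emits tokens one forward pass at a time:

```python
# Toy illustration (mine, not from the talk) of chain of thought as
# intermediate storage: each step does one small computation and writes the
# result back into the visible context, the way a model emits tokens.

def solve_with_scratchpad(a, b, c):
    context = f"Question: what is {a} * {b} + {c}?\n"
    context += f"Step 1: {a} * {b} = {a * b}\n"          # partial result stored as text
    context += f"Step 2: {a * b} + {c} = {a * b + c}\n"  # next step builds on the stored text
    context += f"Answer: {a * b + c}"
    return context

print(solve_with_scratchpad(13, 7, 5))
```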
is that language models have randomness built into them 00:04:17.600 |
and a lot of what people see as failures in reasoning 00:04:24.520 |
and making very specific mistakes along the way 00:04:29.440 |
This is really not something that we see in human reasoning. 00:04:34.640 |
they will normally catch it on the next step, 00:04:37.120 |
but we need to handle language models differently. 00:04:44.040 |
is because it's a new type of language models 00:04:46.200 |
that are going to maximize on this view of reasoning, 00:04:54.080 |
can actually do a lot to achieve better outcomes 00:05:04.640 |
to make progress on some sort of intelligence-defined task. 00:05:16.920 |
"What is O1?" has been a large debate since its release. 00:05:22.960 |
I'm not gonna spend a lot of this talk on it, 00:05:28.800 |
is that you should take OpenAI at their face value, 00:05:34.400 |
on the verifiable outcomes is what I've added, 00:05:37.400 |
especially in context of the RL API that they've released, 00:05:42.920 |
But most of the reasons to believe in more complicated things 00:05:53.720 |
and things that we would have expected advanced reasoning 00:05:57.240 |
and not based on evidence that they have given us 00:06:02.960 |
or how actually inference is done when serving the model. 00:06:10.000 |
or I would probably call them relatives of O1 00:06:17.560 |
for what we can do with chain of thought in models. 00:06:20.800 |
The two I've highlighted are from DeepSeek and Qwen, 00:06:23.080 |
and a lot of people in this room have probably seen them. 00:06:26.480 |
And I think that these models are really substantially 00:06:29.600 |
narrower than these full O1 models from OpenAI. 00:06:37.000 |
If you use, like I was using the DeepSeek model 00:06:42.000 |
but they've tried to keep the model so narrow 00:06:43.720 |
that even in that, if you ask a code question, 00:06:52.840 |
in the future models of this is going to be able to, 00:06:55.480 |
it being able to handle more tasks and more domains. 00:06:58.560 |
So SemiAnalysis wrote a post that I haven't read in full, 00:07:03.680 |
but even if you look at the paywalled headings, 00:07:12.360 |
from the table of contents that you can see without paying. 00:07:15.800 |
I'm due to pay at some point, but I have not. 00:07:18.600 |
And incredible amounts of forward passes during training. 00:07:28.080 |
there's two types of ways that you see data many times, 00:07:35.600 |
One is that when you're doing RL on a prompt, 00:07:37.800 |
you can sample many completions to then grade them 00:07:40.800 |
or use them in different ways to update your policy. 00:07:45.200 |
I could look at eight completions and choose the best one 00:08:00.320 |
you can go over the same prompts many more times 00:08:10.560 |
which is very different than other types of training 00:08:13.960 |
we've seen in the past at pre and post-training. 00:08:19.720 |
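As a concrete picture of those two kinds of data reuse, here is a toy, runnable sketch in Python; the sampling and grading functions are stand-ins I made up, not Ai2's or OpenAI's actual training loop:

```python
import random

# Toy sketch of the two kinds of reuse described above: many completions
# per prompt, and many passes over the same prompts.

def sample_completion(prompt):
    """Stand-in for the policy: sometimes answers the arithmetic correctly."""
    return str(eval(prompt)) if random.random() < 0.5 else "not sure"

def grade(prompt, completion):
    """Verifiable reward: 1.0 only if the completion equals the true answer."""
    return 1.0 if completion == str(eval(prompt)) else 0.0

def rl_epoch(prompts, completions_per_prompt=8):
    for prompt in prompts:
        # Sample many completions for the same prompt...
        completions = [sample_completion(prompt) for _ in range(completions_per_prompt)]
        # ...grade them all; a real trainer would then use them to update the
        # policy (e.g., keep the best one, or take a policy-gradient step).
        rewards = [grade(prompt, c) for c in completions]
        best = completions[rewards.index(max(rewards))]
        print(f"{prompt}: best completion {best!r}, reward {max(rewards)}")

# Going over the same prompts for several epochs is the second kind of reuse.
for epoch in range(3):
    rl_epoch(["2+2", "3*7"])
```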
is that post-training flops exceed pre-training. 00:08:24.760 |
that they're using a ton of compute for this large-scale RL. 00:08:34.400 |
and this is something that these early relative models 00:08:39.280 |
because no one has this infrastructure like OpenAI does. 00:08:45.640 |
Okay, this takes us to reinforcement fine-tuning. 00:08:53.480 |
I would say that this is a hard pivot in the talk 00:08:56.240 |
where O1 is essentially pre-training scale RL, 00:09:01.240 |
and we don't know what all the details of the data are 00:09:03.840 |
to OpenAI then showing us this new beta API program 00:09:08.480 |
that they're making, which is just a sprinkle of this. 00:09:11.400 |
So what can you do with a tiny bit of their infrastructure? 00:09:23.240 |
it was a long tweet that gave a lot of details, 00:09:26.920 |
and I was like, this API is using the same infrastructure 00:09:31.640 |
I was like, that alone is like a lot of detail. 00:09:34.040 |
It was like on Twitter, it was a random thing. 00:09:35.920 |
And then there's a really long details on other stuff of it. 00:09:38.680 |
But it is just a new paradigm for fine-tuning, 00:09:52.680 |
rather than features in your style of text mattering. 00:09:55.800 |
So again, kind of like I was hinting at with O1, 00:10:07.720 |
which is just very different than previous training regimes. 00:10:15.880 |
the grader gives a bonus when the answer is right, 00:10:24.080 |
I'll highlight a research project that we did 00:10:28.200 |
that was pretty much doing a very similar thing, 00:10:31.680 |
to target very specific evaluations on open models, 00:10:37.840 |
when the answer is right, and that's all you do. 00:10:40.040 |
And the kind of key innovation in the simplicity 00:10:42.440 |
is that modern language models are a strong enough base 00:10:59.120 |
like in ChatGPT, was just that they're gonna destroy 00:11:06.040 |
And it really seems like you can just do this 00:11:08.160 |
out of the box if OpenAI is going to allow an API, 00:11:13.920 |
that then just gets worse on random other things. 00:11:25.560 |
Seriously, you could start with a grade school math problem 00:11:28.520 |
and just say the correct answer is the correct number, 00:11:36.600 |
which is different than having a prompt in completion 00:11:46.600 |
I suspect quickly we'll see things like Hugging Face 00:12:07.560 |
And you can verify that with Python really easily. 00:12:10.080 |
This is something that we did in our project. 00:12:11.720 |
And it's like, the model gets better at this. 00:12:15.880 |
and the RL algorithm learns to change the model 00:12:18.400 |
just a tiny bit and actually reach these answers. 00:12:22.840 |
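A minimal sketch of that kind of verifiable reward, assuming a simple regex-based answer extraction; the helper names and the extraction rule here are my own, not the actual grader from OpenAI's API or from our project:

```python
import re

def extract_final_number(text):
    """Pull the last number in a completion, e.g. '... the answer is 42.' -> '42'."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def verifiable_reward(completion, gold_answer):
    """Binary reward: 1.0 only when the extracted answer matches the label."""
    predicted = extract_final_number(completion)
    if predicted is None:
        return 0.0
    return 1.0 if float(predicted) == float(gold_answer) else 0.0

print(verifiable_reward("Adding them up gives 7 apples in total.", "7"))  # 1.0
print(verifiable_reward("I think the total is 8.", "7"))                  # 0.0
```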
A confusing thing for people was these grader models. 00:12:26.640 |
I think the place to come from these is evaluation. 00:12:36.160 |
where an example that I used in the blog post 00:12:42.840 |
For math, they use both SymPy, a Python library, 00:12:48.280 |
and LLM as a judge to extract their answers for math. 00:12:54.080 |
is essentially amping this up to a whole nother level 00:12:57.080 |
where it's kind of a nested structure of configs 00:13:00.400 |
for doing reward shaping on these verifiable outputs. 00:13:06.000 |
It's like, you know you have to handle these five formats 00:13:10.520 |
for how you could represent different numbers and tokens. 00:13:18.680 |
that you're gonna need more than just a yes/no loss function 00:13:27.200 |
that are doing like judge models and Prometheus 00:13:30.640 |
and other things that are designed specifically 00:13:34.040 |
And I see that continuing to just become part 00:13:55.000 |
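For a sense of why an exact string match isn't enough, here is a small illustration using SymPy to treat different surface forms of the same number as equivalent; it's only a sketch, and a judge model would have to slot in for the messier formats:

```python
import sympy

def answers_match(predicted, gold):
    """True if two answer strings are mathematically equivalent, e.g. '1/2' and '0.5'."""
    try:
        return sympy.simplify(sympy.sympify(predicted) - sympy.sympify(gold)) == 0
    except (sympy.SympifyError, TypeError):
        # Fall back to exact match; a judge model could handle formats
        # SymPy can't parse.
        return str(predicted).strip() == str(gold).strip()

for pred in ["1/2", "0.5", "2/4", "0.75"]:
    print(pred, answers_match(pred, "0.5"))  # True, True, True, False
```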
And then they have this like very generic RL plot. 00:13:58.120 |
The most standard RL plot is an x-axis of time or trials 00:14:04.000 |
Here, reward is like an accuracy or a success rate 00:14:12.600 |
And this is very similar to what we did in our project. 00:14:15.800 |
I think this is kind of just another way you can put this 00:14:20.680 |
If you've seen RL where you have this agent interacting 00:14:23.080 |
with the environment, this you will squint at it 00:14:26.520 |
If you haven't, you'll probably be in for more 00:14:31.840 |
'cause RL is really formulated as trial and error learning. 00:14:37.400 |
to have people use our code, which does this for math 00:14:48.920 |
doesn't really make sense, which is a good way to think 00:14:51.640 |
about why you might need to do some reward shaping 00:14:54.000 |
for how you would grade outputs from various models. 00:14:57.120 |
And to kind of compare the plot that OpenAI had, 00:15:01.240 |
which is like performance improving over time, 00:15:03.840 |
these are some experiments we ran on various evaluations. 00:15:07.320 |
So the left column is some language model evaluation 00:15:12.280 |
And the right is all the various internal RL statistics 00:15:16.560 |
where, like, GSM8K, MATH, and IFEval are all being trained 00:15:29.600 |
And then it generalizes to various math evaluations 00:15:43.280 |
And I think a lot of people are obviously going 00:15:49.200 |
I'm excited to talk about it with people this week. 00:15:52.400 |
And I think reasoning is worth continuing as something. 00:15:56.920 |
You can read the post that I was referencing here.