
The State of Reasoning — from Nathan Lambert, Interconnects/AI2 [LS Live @ NeurIPS 2024]



00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey everyone, happy new year.
00:00:09.100 | This is a quick talk that I gave at NeurIPS
00:00:12.300 | at the Latent Space unofficial industry event.
00:00:14.980 | So Swyx tried to have people talk
00:00:16.940 | about the major topics of the year:
00:00:18.380 | scaling, open models, synthetic data, agents, et cetera.
00:00:22.860 | And he asked me to fill in a quick slot on reasoning.
00:00:25.980 | A couple notes, this was before O3 was announced by OpenAI.
00:00:29.380 | So I think you can take everything that I said
00:00:31.660 | and run with it with even more enthusiasm
00:00:34.420 | and expect even more progress in 2025.
00:00:38.140 | And second, there were some recording issues.
00:00:39.820 | So I re-edited the slides to match up with the audio.
00:00:43.540 | So you might see that they're slightly off,
00:00:45.720 | but it's mostly reading like a blog post
00:00:47.620 | and it should do a good job getting the conversation started
00:00:50.260 | around reasoning on Interconnects in the new year.
00:00:53.180 | Happy new year, and I hope you like this.
00:00:55.140 | Thanks.
00:00:57.780 | I wouldn't say my main research area is reasoning.
00:01:00.780 | I would say that I came from a reinforcement learning
00:01:04.100 | background into language models
00:01:05.620 | and reasoning is now getting subsumed into that
00:01:08.300 | as a method rather than an area.
00:01:10.980 | And a lot of this is probably transitioning these talks
00:01:13.620 | into more provocative forms to prime everyone
00:01:16.820 | for the debate, which is why most people are here.
00:01:19.440 | And this is called the state of reasoning.
00:01:22.760 | This is by no means a comprehensive survey.
00:01:26.380 | To continue, I wanted to make sure
00:01:28.260 | that I was not off base in thinking about this
00:01:31.660 | because there are a lot of debates on reasoning
00:01:33.300 | and I wanted to revisit a very basic definition.
00:01:36.540 | And this is a dictionary definition,
00:01:38.040 | which is the action of thinking about something
00:01:39.860 | in a logical, sensible way,
00:01:41.680 | which is actually sufficiently vague
00:01:43.300 | that I would agree with it.
00:01:44.820 | I think, as we'll see in a lot of this talk,
00:01:47.540 | that people are going crazy
00:01:50.860 | about whether or not language models reason.
00:01:54.060 | We've seen this with AGI before
00:01:55.800 | and now reasoning kind of seems like the same thing,
00:01:58.760 | which to me is pretty ridiculous
00:02:00.640 | because reasoning is a very general skill
00:02:04.660 | and I will provide more reasoning or support
00:02:07.920 | for the argument that these language models
00:02:10.040 | are doing some sort of reasoning
00:02:12.060 | when you give them problems.
00:02:13.460 | I don't think I need to share a ton of examples
00:02:16.060 | of the ill-formed arguments
00:02:20.900 | about what language models are not doing,
00:02:23.240 | but it's tough that this is the case
00:02:24.660 | and I think there are some very credible arguments
00:02:27.540 | that reasoning is a poor direction to pursue
00:02:30.460 | for language models because language models
00:02:32.280 | are not going to be as good at it as humans.
00:02:34.260 | But for the claim that they can't do reasoning,
00:02:35.660 | I don't see a lot of proof,
00:02:37.780 | and I'll go through a few examples.
00:02:39.880 | And the question is like,
00:02:40.720 | why should language model reasoning
00:02:42.780 | be constrained to look like what humans do?
00:02:45.280 | I think language models are very different
00:02:47.180 | and they are stochastic,
00:02:49.440 | the stochastic parrots thing is true for many reasons
00:02:53.340 | and we should embrace this and we should continue
00:02:56.400 | and I think a big trend of the year
00:02:58.580 | is that we're seeing new types of language model reasoning
00:03:02.020 | that look less human and that can be good
00:03:04.540 | for kind of separating the discourse
00:03:06.060 | from expecting a really narrow type of behavior.
00:03:08.540 | I did an interview with Ross Taylor
00:03:12.260 | who was a reasoning lead at Meta,
00:03:13.900 | which I thought was a very good education for me on this
00:03:17.620 | and this is just a direct pull from the transcript.
00:03:20.680 | But essentially what it's saying is,
00:03:22.680 | if you do chain of thought on a language model,
00:03:25.480 | what it is doing is it's essentially
00:03:27.480 | outputting its intermediate steps.
00:03:29.040 | If I were to ask you all a math problem right now,
00:03:32.160 | you could do most of it in your head
00:03:33.920 | and you're doing some sort of intermediate storage
00:03:37.920 | of variables and language models
00:03:39.800 | have no ability to do this.
00:03:41.080 | They are kind of per token computation devices
00:03:45.720 | where each token is outputted after doing this forward pass
00:03:49.600 | and within that there's no explicit structure
00:03:52.240 | to hold these intermediate states.
00:03:54.280 | So I think embracing chain of thought
00:03:56.560 | and these kind of intermediate values
00:03:58.760 | for the language models is extremely reasonable
00:04:01.440 | and it's showing that they're doing something
00:04:04.560 | that actually gets to valuable outputs.
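To make the intermediate-steps point concrete, here is a minimal sketch of the same arithmetic question asked with and without room to show work; the prompts are illustrative, and the point is that the only working memory the model has is the tokens it has already emitted.

```python
# Minimal sketch: a language model's only "scratch space" is its own output tokens.
# The prompts below are illustrative, not from the talk.

direct_prompt = "What is 17 * 24? Answer with just the number."
# Here the model must produce the answer in effectively one shot,
# with no place to store intermediate values.

cot_prompt = "What is 17 * 24? Work through the intermediate steps, then give the answer."
# A chain-of-thought completion writes the intermediates into the token stream, e.g.:
#   "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. The answer is 408."
# Every later token can condition on those earlier tokens, which is the external
# equivalent of the variable storage a person does in their head.
```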
00:04:06.980 | So this is one of the many ways
00:04:12.720 | that we can kind of lead towards O1:
00:04:14.840 | language models have randomness built into them
00:04:17.600 | and a lot of what people see as failures in reasoning
00:04:21.400 | are kind of these language models
00:04:22.680 | following very static chains
00:04:24.520 | and making very specific mistakes along the way
00:04:27.040 | with really no ability to correct for that.
00:04:29.440 | This is really not something that we see in human reasoning.
00:04:33.260 | So if a human makes a mistake,
00:04:34.640 | they will normally catch it on the next step,
00:04:37.120 | but we need to handle language models differently.
00:04:40.000 | And why O1 is exciting
00:04:44.040 | is because it's a new type of language model
00:04:46.200 | that is going to maximize on this view of reasoning,
00:04:49.640 | which is that chain of thought
00:04:51.480 | in kind of a forward stream of tokens
00:04:54.080 | can actually do a lot to achieve better outcomes
00:04:57.160 | when you're doing a reasoning-like action,
00:05:01.720 | which is just repeatedly outputting tokens
00:05:04.640 | to make progress on some sort of intelligence-defined task.
00:05:09.640 | So it's just making forward progress
00:05:11.420 | by spending more compute
00:05:12.560 | and the token stream is the equivalent
00:05:15.480 | of some intermediate state.
00:05:16.920 | What O1 is has been a large debate since its release.
00:05:22.960 | I'm not gonna spend a lot of this talk on it,
00:05:25.800 | but the more time I've spent on it,
00:05:28.800 | the more I think you should take OpenAI at face value:
00:05:31.520 | they are doing very large-scale RL
00:05:34.400 | ("on verifiable outcomes" is what I've added),
00:05:37.400 | especially in the context of the RL API that they've released,
00:05:41.160 | which I'll talk about more.
00:05:42.920 | But most of the reasons to believe in more complicated things
00:05:46.280 | like process reward models, self-play,
00:05:48.960 | Monte Carlo tree search,
00:05:50.480 | are mostly based on previous literature
00:05:53.720 | and things that we would have expected advanced reasoning
00:05:56.000 | to look like for language models
00:05:57.240 | and not based on evidence that they have given us
00:05:59.960 | or the behavior,
00:06:00.800 | whether you're looking at evaluations
00:06:02.960 | or how inference is actually done when serving the model.
00:06:05.920 | This takes us to replications,
00:06:10.000 | or I would probably call them relatives of O1
00:06:12.960 | coming from the community.
00:06:14.480 | These are wonderful to see.
00:06:15.880 | We are exploring the boundaries
00:06:17.560 | for what we can do with chain of thought in models.
00:06:20.800 | The two I've highlighted are from DeepSeek and Qwen,
00:06:23.080 | and a lot of people in this room have probably seen them.
00:06:26.480 | And I think that these models are really substantially
00:06:29.600 | narrower than these full O1 models from OpenAI.
00:06:32.720 | So with OpenAI, if you use O1,
00:06:35.440 | you can use it for a lot more tasks.
00:06:37.000 | I was using the DeepSeek model,
00:06:40.000 | which is supposed to be for math or code,
00:06:42.000 | but they've tried to keep the model so narrow
00:06:43.720 | that even then, if you ask a code question,
00:06:47.000 | sometimes it'll say something like,
00:06:47.840 | "I'm only supposed to work on math or code."
00:06:50.080 | And a lot of the success of O1
00:06:52.840 | and the future models like this is going to come from
00:06:55.480 | being able to handle more tasks and more domains.
00:06:58.560 | So SemiAnalysis wrote a post that I haven't read in full,
00:07:03.680 | but even if you look at the paywalled headings,
00:07:05.840 | you can kind of make some intelligent claims
00:07:09.160 | about what O1 is or is not.
00:07:11.120 | I think these are two of the things
00:07:12.360 | from the table of contents that you can see without paying.
00:07:15.800 | I'm due to pay at some point, but I have not.
00:07:18.600 | And incredible amounts of forward passes during training.
00:07:22.080 | I think you'll see this as I discuss RL
00:07:24.880 | fine-tuning more in a little bit,
00:07:26.160 | but when you're doing RL,
00:07:28.080 | there are two ways that you see data many times,
00:07:30.840 | and that'll result
00:07:32.160 | in many forward passes.
00:07:35.600 | One is that when you're doing RL on a prompt,
00:07:37.800 | you can sample many completions to then grade them
00:07:40.800 | or use them in different ways to update your policy.
00:07:43.360 | So if I ask one math problem,
00:07:45.200 | I could look at eight completions and choose the best one
00:07:47.760 | or do some contrastive thing
00:07:49.120 | between the best and the worst one.
00:07:51.000 | And that kind of gradation
00:07:52.200 | can help the RL policy actually learn.
00:07:55.360 | And the second way is that,
00:07:56.200 | because the loss function is more flexible
00:07:58.600 | than something like instruction tuning,
00:08:00.320 | you can go over the same prompts many more times
00:08:02.920 | than you would in instruction tuning
00:08:05.000 | or kind of pre-training.
00:08:06.320 | So this kind of means they're doing
00:08:08.400 | just a lot of this sampling from the model,
00:08:10.560 | which is very different from other types of training
00:08:13.960 | we've seen in the past in pre- and post-training.
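To make those two sources of repeated forward passes concrete, here is a rough sketch of an RL loop over verifiable answers; `generate_fn`, `grade_fn`, and the overall structure are illustrative stand-ins, not OpenAI's or anyone's actual training code.

```python
from typing import Callable, List, Tuple

def rl_step(
    generate_fn: Callable[[str], str],      # samples a completion from the current policy
    grade_fn: Callable[[str, str], float],  # scores a completion against the verifiable answer
    prompt: str,
    answer: str,
    num_samples: int = 8,
) -> Tuple[str, str]:
    # 1) Sample several completions for the same prompt -- many forward passes.
    completions = [generate_fn(prompt) for _ in range(num_samples)]
    # 2) Grade each completion against the verifiable answer.
    rewards = [grade_fn(c, answer) for c in completions]
    # 3) Use the spread of rewards, e.g. contrast the best and worst completion;
    #    the actual policy update (PPO, etc.) is out of scope for this sketch.
    best = completions[rewards.index(max(rewards))]
    worst = completions[rewards.index(min(rewards))]
    return best, worst

# Unlike instruction tuning, RL typically also revisits the same prompts over
# several epochs, which multiplies the number of forward passes again.
def train(step_fn: Callable[[str, str], Tuple[str, str]],
          dataset: List[Tuple[str, str]], epochs: int = 4) -> None:
    for _ in range(epochs):
        for prompt, answer in dataset:
            step_fn(prompt, answer)
```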
00:08:16.960 | And then this one is great,
00:08:18.200 | thanks to Dylan for showing everyone this:
00:08:19.720 | post-training FLOPs exceed pre-training.
00:08:22.560 | I think this pretty much clearly says
00:08:24.760 | that they're using a ton of compute for this large-scale RL.
00:08:28.440 | And at that point,
00:08:29.880 | it would probably mean something different,
00:08:32.280 | where this is like pre-training RL,
00:08:34.400 | and this is something that these early relative models
00:08:37.560 | are not going to be doing,
00:08:39.280 | because no one has this infrastructure like OpenAI does.
00:08:43.280 | It'll take a while to do that,
00:08:44.440 | but people will make it.
00:08:45.640 | Okay, this takes us to reinforcement fine-tuning.
00:08:53.480 | I would say that this is a hard pivot in the talk
00:08:56.240 | from O1, which is essentially pre-training-scale RL,
00:08:59.520 | extremely big RL,
00:09:01.240 | where we don't know all the details of the data,
00:09:03.840 | to OpenAI then showing us this new beta API program
00:09:08.480 | that they're making, which is just a sprinkle of this.
00:09:11.400 | So what can you do with a tiny bit of their infrastructure?
00:09:13.720 | I think one of the fine-tuning leads
00:09:16.520 | responded to a tweet from SWIX,
00:09:19.640 | and they were like, the tweet literally,
00:09:21.520 | there was like one of the tweets,
00:09:23.240 | it was a long tweet that gave a lot of details,
00:09:24.800 | but even the first tweet I hadn't seen,
00:09:26.000 | I had like eight likes,
00:09:26.920 | and I was like, this API is using the same infrastructure
00:09:30.360 | that we use to train O1.
00:09:31.640 | I was like, that alone is like a lot of detail.
00:09:34.040 | It was like on Twitter, it was a random thing.
00:09:35.920 | And then there's a really long details on other stuff of it.
00:09:38.680 | But it is just a new paradigm for fine-tuning,
00:09:41.880 | and I have seen some of this work,
00:09:45.080 | and I'm pretty optimistic that it'll work
00:09:47.320 | for kind of really specific capabilities
00:09:51.440 | where answers matter,
00:09:52.680 | rather than features in your style of text mattering.
00:09:55.800 | So again, kind of like I was hinting at with O1,
00:10:00.840 | this reinforcement fine-tuning
00:10:02.080 | does many passes over the data,
00:10:03.600 | which is why they can say
00:10:04.720 | you only need dozens of labeled samples
00:10:06.640 | to actually learn from it,
00:10:07.720 | which is just very different than previous training regimes.
00:10:12.160 | So what happens is that
00:10:15.880 | the grader gives a bonus when the answer is right,
00:10:19.000 | and the model learns to reinforce behaviors
00:10:21.080 | that get right answers.
00:10:22.800 | Later in the talk,
00:10:24.080 | I'll highlight a research project that we did
00:10:28.200 | that was pretty much doing a very similar thing,
00:10:31.680 | to target very specific evaluations on open models,
00:10:34.520 | and you do RL, and you give a reward bonus
00:10:37.840 | when the answer is right, and that's all you do.
00:10:40.040 | And the kind of key innovation in the simplicity
00:10:42.440 | is that modern language models are a strong enough base
00:10:45.760 | where just a really gentle RL fine-tuning
00:10:49.320 | can add these specific capabilities
00:10:51.120 | without degrading the model.
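A minimal sketch of that kind of reward, assuming a simple last-number extraction; the helper names are illustrative rather than taken from any particular codebase:

```python
import re

def extract_final_answer(completion: str) -> str:
    # Illustrative extraction helper: take the last number in the completion.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def verifiable_reward(completion: str, reference_answer: str) -> float:
    # The grader gives a bonus only when the answer is right; nothing else
    # about the completion (style, length, phrasing) is scored.
    return 1.0 if extract_final_answer(completion) == reference_answer.strip() else 0.0
```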
00:10:52.480 | I think a lot of the fear around adding RL
00:10:55.120 | to these training regimes,
00:10:56.640 | especially on general instruct models,
00:10:59.120 | like in ChatGPT, was just that they're gonna destroy
00:11:02.480 | the rest of the performance,
00:11:03.720 | the base of chattiness that you care about.
00:11:06.040 | And it really seems like you can just do this
00:11:08.160 | out of the box. If OpenAI is going to offer an API,
00:11:11.600 | they aren't gonna let people train a model
00:11:13.920 | that then just gets worse on random other things.
00:11:16.920 | So what the data format looks like,
00:11:22.680 | the example I gave is way more complicated
00:11:24.600 | than I think it needs to be.
00:11:25.560 | Seriously, you could start with a grade school math problem
00:11:28.520 | and just say the correct answer is the correct number,
00:11:30.920 | the gene example is confusing,
00:11:33.200 | but essentially you have two components,
00:11:34.880 | a prompt and an answer,
00:11:36.600 | which is different from having a prompt and completion
00:11:39.000 | that you would train on,
00:11:39.880 | or if you're doing preference tuning,
00:11:41.240 | you would do a prompt with a chosen completion
00:11:43.680 | and a rejected completion.
00:11:45.120 | So it's a new type of data format.
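As a rough illustration of the three formats being contrasted, with illustrative field names rather than any official schema:

```python
# Rough illustration of the three data formats being contrasted.
# Field names are illustrative, not OpenAI's schema.

# Instruction tuning: prompt plus the completion you train on directly.
sft_example = {
    "prompt": "What is 4 + 5 * 2?",
    "completion": "5 * 2 = 10, and 4 + 10 = 14. The answer is 14.",
}

# Preference tuning: prompt plus a chosen and a rejected completion.
preference_example = {
    "prompt": "What is 4 + 5 * 2?",
    "chosen": "5 * 2 = 10, and 4 + 10 = 14. The answer is 14.",
    "rejected": "4 + 5 = 9, and 9 * 2 = 18. The answer is 18.",
}

# Reinforcement fine-tuning: prompt plus a verifiable answer; the completion
# is sampled from the model during training and graded against the answer.
rft_example = {
    "prompt": "What is 4 + 5 * 2?",
    "answer": "14",
}
```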
00:11:46.600 | I suspect quickly we'll see things like Hugging Face
00:11:49.640 | having more of these.
00:11:51.160 | I will highlight, we have some of ours
00:11:53.200 | for our specific project that we did.
00:11:56.240 | We have examples for math.
00:11:57.440 | This on the screen is an example
00:11:59.320 | for precise instruction following,
00:12:01.320 | which is the idea that if you have a prompt,
00:12:03.320 | you can say something like,
00:12:05.160 | have every sentence start with the letter A.
00:12:07.560 | And you can verify that with Python really easily.
00:12:10.080 | This is something that we did in our project.
00:12:11.720 | And the model gets better at this.
00:12:13.600 | You have this constrained data
00:12:15.880 | and the RL algorithm learns to change the model
00:12:18.400 | just a tiny bit and actually reach these answers.
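A toy version of that kind of Python verifier, not the exact one from the project, might look like this:

```python
import re

def sentences_start_with(completion: str, letter: str = "A") -> bool:
    # Toy verifier: split on sentence-ending punctuation and check the first
    # character of each sentence (ignoring leading quotes or parentheses).
    sentences = [s.strip() for s in re.split(r"[.!?]+", completion) if s.strip()]
    return all(s.lstrip('"\'(').upper().startswith(letter.upper()) for s in sentences)

def constraint_reward(completion: str) -> float:
    # Binary reward the RL algorithm can optimize against.
    return 1.0 if sentences_start_with(completion, "A") else 0.0
```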
00:12:22.840 | A confusing thing for people was these grader models.
00:12:26.640 | I think the place to come at these from is evaluation.
00:12:30.680 | There's been a lot of work in evaluation
00:12:32.640 | to make answer extraction stable,
00:12:34.960 | especially with math,
00:12:36.160 | where an example that I used in the blog post
00:12:39.520 | I wrote today on this is how Llama 3.1
00:12:41.600 | details their evals.
00:12:42.840 | For math, they use both SymPy, a Python package
00:12:46.360 | for extraction,
00:12:48.280 | and LLM-as-a-judge to extract their answers for math.
00:12:51.520 | And what the graders are doing
00:12:54.080 | is essentially amping this up to a whole nother level
00:12:57.080 | where it's kind of a nested structure of configs
00:13:00.400 | for doing reward shaping on these verifiable outputs.
00:13:03.720 | For math, it can be really easy.
00:13:06.000 | It's like, you know you have to handle these five formats
00:13:09.200 | that I came up with in a minute
00:13:10.520 | for how you could represent different numbers and tokens.
00:13:13.240 | But as you get to more complicated things
00:13:15.160 | and more complicated behaviors,
00:13:16.800 | it seems like OpenAI is insinuating
00:13:18.680 | that you're gonna need more than just a yes/no loss function
00:13:22.200 | for your domains.
00:13:23.520 | And that seems fine.
00:13:25.000 | Well, we already have a bunch of open models,
00:13:27.200 | like judge models and Prometheus
00:13:30.640 | and other things, that are designed specifically
00:13:32.880 | for LLM-as-a-judge.
00:13:34.040 | And I see that continuing to just become part
00:13:36.320 | of this kind of open RL infrastructure.
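Along those lines, here is a minimal sketch of a grader that uses SymPy to compare answers symbolically rather than as raw strings, with a bit of reward shaping; the shaping values are arbitrary placeholders, not anyone's actual config:

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr

def math_grade(predicted: str, reference: str) -> float:
    # Sketch of a grader: compare answers symbolically so that equivalent
    # forms like "0.5" and "1/2" both count as the right answer.
    try:
        pred_expr = parse_expr(predicted.replace("^", "**"))
        ref_expr = parse_expr(reference.replace("^", "**"))
    except Exception:
        return 0.0  # unparseable answers get no reward (never crash the trainer)
    if sympy.simplify(pred_expr - ref_expr) == 0:
        return 1.0  # symbolically equal to the reference answer
    return 0.1  # small shaping bonus for at least producing a parseable expression
```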
00:13:38.840 | OpenAI had a bunch of screenshots.
00:13:44.520 | I'm not gonna add a lot of commentary on these,
00:13:44.520 | but it looks pretty standard.
00:13:46.240 | They're gonna track how performance changes
00:13:48.160 | over time and stuff like this.
00:13:50.880 | You'll be able to look at all the outputs.
00:13:52.240 | This is just them making pretty things.
00:13:55.000 | And then they have this like very generic RL plot.
00:13:58.120 | The most standard RL plot has an X-axis of time or trials
00:14:02.280 | and a Y-axis of reward.
00:14:04.000 | Here, reward is like an accuracy or a success rate
00:14:07.080 | on a certain validation set.
00:14:09.120 | And X is actually supposed to be like
00:14:10.840 | how much training was done.
00:14:12.600 | And this is very similar to what we did in our project.
00:14:15.800 | I think this is kind of just another way you can put this
00:14:18.680 | with an RL feedback diagram.
00:14:20.680 | If you've seen RL where you have this agent interacting
00:14:23.080 | with the environment, you will squint at this
00:14:25.480 | and it'll be familiar.
00:14:26.520 | If you haven't, you'll probably be in for more
00:14:28.920 | of these things if RL keeps becoming popular
00:14:31.840 | 'cause RL is really formulated as trial and error learning.
00:14:35.200 | But if you're interested, we're happy to try
00:14:37.400 | to have people use our code, which does this for math
00:14:40.400 | and some instruction tuning already.
00:14:42.840 | And we want to try more complicated graders
00:14:45.600 | for things like code.
00:14:46.600 | So for code quality, a binary outcome
00:14:48.920 | doesn't really make sense, which is a good way to think
00:14:51.640 | about why you might need to do some reward shaping
00:14:54.000 | for how you would grade outputs from a given model.
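As an example of what shaped grading for code might look like, here is a sketch where partial credit comes from the fraction of unit tests passed rather than a single pass/fail bit; the structure and weights are placeholders:

```python
from typing import Callable, List

def code_reward(candidate_fn: Callable, tests: List[Callable[[Callable], bool]]) -> float:
    # Sketch of a shaped grader for code: instead of a single pass/fail bit,
    # reward the fraction of unit tests that pass.
    passed = 0
    for test in tests:
        try:
            if test(candidate_fn):
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure, not an error in the grader
    return passed / len(tests) if tests else 0.0

# Usage sketch: grade a generated `add` function against a few checks.
tests = [
    lambda f: f(1, 2) == 3,
    lambda f: f(-1, 1) == 0,
    lambda f: f(0, 0) == 0,
]
reward = code_reward(lambda a, b: a + b, tests)  # 1.0 when all tests pass
```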
00:14:57.120 | And to kind of compare the plot that OpenAI had,
00:15:01.240 | which is like performance improving over time,
00:15:03.840 | these are some experiments we ran on various evaluations.
00:15:07.320 | So the left column is some language model evaluation
00:15:10.400 | that we would use in an academic paper.
00:15:12.280 | And the right is all the various internal RL statistics
00:15:16.560 | where like GSM8K, MATH, and IFEval are all being trained
00:15:20.320 | on training sets.
00:15:21.400 | So we have the prompts,
00:15:22.920 | which are math questions.
00:15:24.360 | We have the answers, which are numbers.
00:15:25.960 | And we're really doing this RL on seeing
00:15:28.280 | if this answer is right.
00:15:29.600 | And then it generalizes to various math evaluations
00:15:32.640 | that we care about.
00:15:33.880 | So I kind of see this as, we got a tip
00:15:37.360 | from an industry lab member to do this
00:15:41.120 | a few months early, so we got a head start.
00:15:43.280 | And I think a lot of people are obviously going
00:15:45.280 | to be trying to replicate this now.
00:15:47.600 | So it's fun that we have a starting point.
00:15:49.200 | I'm excited to talk about it with people this week.
00:15:52.400 | And I think reasoning is worth continuing to work on.
00:15:56.920 | You can read the post that I was referencing here.
00:15:59.400 | And I'm happy to take any related
00:16:02.520 | or hard questions on reasoning
00:16:04.080 | 'cause I kind of opened the floor for that.
00:16:06.880 | So thank you.
00:16:07.880 | Thank you.
00:16:09.720 | (upbeat music)
00:16:12.320 | (upbeat music)
00:16:14.920 | (upbeat music)
00:16:17.520 | (hip hop music)
00:16:20.180 | (upbeat music)