
The State of Reasoning — from Nathan Lambert, Interconnects/AI2 [LS Live @ NeurIPS 2024]



00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey everyone, happy new year.
00:00:09.100 | This is a quick talk that I gave at NeurIPS
00:00:12.300 | at the Latent Space unofficial industry event.
00:00:14.980 | So Swyx tried to have people talk
00:00:16.940 | about the major topics of the year:
00:00:18.380 | scaling, open models, synthetic data, agents, et cetera.
00:00:22.860 | And he asked me to fill in a quick slot on reasoning.
00:00:25.980 | A couple notes, this was before O3 was announced by OpenAI.
00:00:29.380 | So I think you can take everything that I said
00:00:31.660 | and run with it with even more enthusiasm
00:00:34.420 | and expect even more progress in 2025.
00:00:38.140 | And second, there were some recording issues.
00:00:39.820 | So I re-edited the slides to match up with the audio.
00:00:43.540 | So you might see that they're slightly off,
00:00:45.720 | but it's mostly reading like a blog post
00:00:47.620 | and it should do a good job getting the conversation started
00:00:50.260 | around reasoning on Interconnects in the new year.
00:00:53.180 | Happy new year, and I hope you like this.
00:00:55.140 | Thanks.
00:00:57.780 | I wouldn't say my main research area is reasoning.
00:01:00.780 | I would say that I came from a reinforcement learning
00:01:04.100 | background into language models
00:01:05.620 | and reasoning is now getting subsumed into that
00:01:08.300 | as a method rather than an area.
00:01:10.980 | And a lot of this is probably transitioning these talks
00:01:13.620 | into more provocative forms to prime everyone
00:01:16.820 | for the debate, which is why most people are here.
00:01:19.440 | And this is called the state of reasoning.
00:01:22.760 | This is by no means a comprehensive survey.
00:01:26.380 | To continue, I wanted to make sure
00:01:28.260 | that I was not off base in thinking about this
00:01:31.660 | because there are a lot of debates on reasoning
00:01:33.300 | and I wanted to revisit a very basic definition.
00:01:36.540 | And this is a dictionary definition,
00:01:38.040 | which is the action of thinking about something
00:01:39.860 | in a logical, sensible way,
00:01:41.680 | which is actually sufficiently vague
00:01:43.300 | that I would agree with it.
00:01:44.820 | I think, as we'll see in a lot of this talk,
00:01:47.540 | that people are going crazy
00:01:50.860 | about whether or not language models reason.
00:01:54.060 | We've seen this with AGI before
00:01:55.800 | and now reasoning kind of seems like the same thing,
00:01:58.760 | which to me is pretty ridiculous
00:02:00.640 | because reasoning is a very general skill
00:02:04.660 | and I will provide more reasoning or support
00:02:07.920 | for the argument that these language models
00:02:10.040 | are doing some sort of reasoning
00:02:12.060 | when you give them problems.
00:02:13.460 | I don't think I need to share a ton of examples
00:02:16.060 | of the ill-formed arguments
00:02:20.900 | about what language models are not doing,
00:02:23.240 | but it's tough that this is the case
00:02:24.660 | and I think there are some very credible arguments
00:02:27.540 | that reasoning is a poor direction to pursue
00:02:30.460 | for language models because language models
00:02:32.280 | are not going to be as good at it as humans.
00:02:34.260 | But for the claim that they can't do reasoning,
00:02:35.660 | I don't see a lot of proof,
00:02:37.780 | and I'll go through a few examples.
00:02:39.880 | And the question is like,
00:02:40.720 | why should language model reasoning
00:02:42.780 | be constrained to look like what humans do?
00:02:45.280 | I think language models are very different
00:02:47.180 | and they are stochastic,
00:02:49.440 | the stochastic parrots thing is true for many reasons
00:02:53.340 | and we should embrace this and we should continue
00:02:56.400 | and I think a big trend of the year
00:02:58.580 | is that we're seeing new types of language model reasoning
00:03:02.020 | that look less human and that can be good
00:03:04.540 | for kind of separating the discourse
00:03:06.060 | from expecting a really narrow type of behavior.
00:03:08.540 | I did an interview with Ross Taylor
00:03:12.260 | who was a reasoning lead at Meta,
00:03:13.900 | which I thought was a very good education for me on this
00:03:17.620 | and this is just a direct pull from the transcript.
00:03:20.680 | But essentially what it's saying is,
00:03:22.680 | if you do chain of thought on a language model,
00:03:25.480 | what it is doing is it's essentially
00:03:27.480 | outputting its intermediate steps.
00:03:29.040 | If I were to ask you all a math problem right now,
00:03:32.160 | you could do most of it in your head
00:03:33.920 | and you're doing some sort of intermediate storage
00:03:37.920 | of variables and language models
00:03:39.800 | have no ability to do this.
00:03:41.080 | They are kind of per token computation devices
00:03:45.720 | where each token is outputted after doing this forward pass
00:03:49.600 | and within that there's no explicit structure
00:03:52.240 | to hold these intermediate states.
00:03:54.280 | So I think embracing chain of thought
00:03:56.560 | and these kind of intermediate values
00:03:58.760 | for the language models is extremely reasonable
00:04:01.440 | and it's showing that they're doing something
00:04:04.560 | that actually gets to valuable outputs.
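To make the intermediate-steps point concrete, here is a minimal sketch of the same arithmetic question asked with and without room to show work; the prompts are illustrative, and the point is that the only working memory the model has is the tokens it has already emitted.

```python
# Minimal sketch: a language model's only "scratch space" is its own output tokens.
# The prompts below are illustrative, not from the talk.

direct_prompt = "What is 17 * 24? Answer with just the number."
# Here the model must produce the answer in effectively one shot,
# with no place to store intermediate values.

cot_prompt = "What is 17 * 24? Work through the intermediate steps, then give the answer."
# A chain-of-thought completion writes the intermediates into the token stream, e.g.:
#   "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. The answer is 408."
# Every later token can condition on those earlier tokens, which is the external
# equivalent of the variable storage a person does in their head.
```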
00:04:06.980 | So this is one of the many ways
00:04:12.720 | that we can kind of lead towards O1:
00:04:14.840 | language models have randomness built into them
00:04:17.600 | and a lot of what people see as failures in reasoning
00:04:21.400 | are kind of these language models
00:04:22.680 | following very static chains
00:04:24.520 | and making very specific mistakes along the way
00:04:27.040 | with really no ability to correct for that.
00:04:29.440 | This is really not something that we see in human reasoning.
00:04:33.260 | So if a human makes a mistake,
00:04:34.640 | they will normally catch it on the next step,
00:04:37.120 | but we need to handle language models differently.
00:04:40.000 | And why O1 is exciting
00:04:44.040 | is because it's a new type of language model
00:04:46.200 | that is going to maximize on this view of reasoning,
00:04:49.640 | which is that chain of thought
00:04:51.480 | in kind of a forward stream of tokens
00:04:54.080 | can actually do a lot to achieve better outcomes
00:04:57.160 | when you're doing a reasoning-like action,
00:05:01.720 | which is just repeatedly outputting tokens
00:05:04.640 | to make progress on some sort of intelligence-defined task.
00:05:09.640 | So it's just making forward progress
00:05:11.420 | by spending more compute
00:05:12.560 | and the token stream is the equivalent
00:05:15.480 | of some intermediate state.
00:05:16.920 | What O1 is has been a large debate since its release.
00:05:22.960 | I'm not gonna spend a lot of this talk on it,
00:05:25.800 | but the more time I've spent on it,
00:05:28.800 | the more I think you should take OpenAI at face value:
00:05:31.520 | they are doing very large-scale RL
00:05:34.400 | ("on verifiable outcomes" is what I've added),
00:05:37.400 | especially in the context of the RL API that they've released,
00:05:41.160 | which I'll talk about more.
00:05:42.920 | But most of the reasons to believe in more complicated things
00:05:46.280 | like process reward models, self-play,
00:05:48.960 | Monte Carlo tree search,
00:05:50.480 | are mostly based on previous literature
00:05:53.720 | and things that we would have expected advanced reasoning
00:05:56.000 | to look like for language models
00:05:57.240 | and not based on evidence that they have given us
00:05:59.960 | or the behavior,
00:06:00.800 | whether you're looking at evaluations
00:06:02.960 | or how inference is actually done when serving the model.
00:06:05.920 | This takes us to replications,
00:06:10.000 | or I would probably call them relatives of O1
00:06:12.960 | coming from the community.
00:06:14.480 | These are wonderful to see.
00:06:15.880 | We are exploring the boundaries
00:06:17.560 | for what we can do with chain of thought in models.
00:06:20.800 | The two I've highlighted are from DeepSeek and Qwen,
00:06:23.080 | and a lot of people in this room have probably seen them.
00:06:26.480 | And I think that these models are really substantially
00:06:29.600 | narrower than these full O1 models from OpenAI.
00:06:32.720 | So with OpenAI, if you use O1,
00:06:35.440 | you can use it for a lot more tasks.
00:06:37.000 | I was using the DeepSeek model,
00:06:40.000 | which is supposed to be for math or code,
00:06:42.000 | but they've tried to keep the model so narrow
00:06:43.720 | that even then, if you ask a code question,
00:06:47.000 | sometimes it'll say something like,
00:06:47.840 | "I'm only supposed to work on math or code."
00:06:50.080 | And a lot of the success of O1
00:06:52.840 | and the future models like this is going to come from
00:06:55.480 | being able to handle more tasks and more domains.
00:06:58.560 | So SemiAnalysis wrote a post that I haven't read in full,
00:07:03.680 | but even if you look at the paywalled headings,
00:07:05.840 | you can kind of make some intelligent claims
00:07:09.160 | about what O1 is or is not.
00:07:11.120 | I think these are two of the things
00:07:12.360 | from the table of contents that you can see without paying.
00:07:15.800 | I'm due to pay at some point, but I have not.
00:07:18.600 | And incredible amounts of forward passes during training.
00:07:22.080 | I think you'll see this as I discuss RL
00:07:24.880 | fine-tuning more in a little bit,
00:07:26.160 | but when you're doing RL,
00:07:28.080 | there are two ways that you see data many times,
00:07:30.840 | and that'll result
00:07:32.160 | in many forward passes.
00:07:35.600 | One is that when you're doing RL on a prompt,
00:07:37.800 | you can sample many completions to then grade them
00:07:40.800 | or use them in different ways to update your policy.
00:07:43.360 | So if I ask one math problem,
00:07:45.200 | I could look at eight completions and choose the best one
00:07:47.760 | or do some contrastive thing
00:07:49.120 | between the best and the worst one.
00:07:51.000 | And that kind of gradation
00:07:52.200 | can help the RL policy actually learn.
00:07:55.360 | And the second way is that,
00:07:56.200 | because the loss function is more flexible
00:07:58.600 | than something like instruction tuning,
00:08:00.320 | you can go over the same prompts many more times
00:08:02.920 | than you would in instruction tuning
00:08:05.000 | or kind of pre-training.
00:08:06.320 | So this kind of means they're doing
00:08:08.400 | just a lot of this sampling from the model,
00:08:10.560 | which is very different from other types of training
00:08:13.960 | we've seen in the past in pre- and post-training.
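To make those two sources of repeated forward passes concrete, here is a rough sketch of an RL loop over verifiable answers; `generate_fn`, `grade_fn`, and the overall structure are illustrative stand-ins, not OpenAI's or anyone's actual training code.

```python
from typing import Callable, List, Tuple

def rl_step(
    generate_fn: Callable[[str], str],      # samples a completion from the current policy
    grade_fn: Callable[[str, str], float],  # scores a completion against the verifiable answer
    prompt: str,
    answer: str,
    num_samples: int = 8,
) -> Tuple[str, str]:
    # 1) Sample several completions for the same prompt -- many forward passes.
    completions = [generate_fn(prompt) for _ in range(num_samples)]
    # 2) Grade each completion against the verifiable answer.
    rewards = [grade_fn(c, answer) for c in completions]
    # 3) Use the spread of rewards, e.g. contrast the best and worst completion;
    #    the actual policy update (PPO, etc.) is out of scope for this sketch.
    best = completions[rewards.index(max(rewards))]
    worst = completions[rewards.index(min(rewards))]
    return best, worst

# Unlike instruction tuning, RL typically also revisits the same prompts over
# several epochs, which multiplies the number of forward passes again.
def train(step_fn: Callable[[str, str], Tuple[str, str]],
          dataset: List[Tuple[str, str]], epochs: int = 4) -> None:
    for _ in range(epochs):
        for prompt, answer in dataset:
            step_fn(prompt, answer)
```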
00:08:16.960 | And then this one is great,
00:08:18.200 | thanks to Dylan for showing everyone this:
00:08:19.720 | post-training FLOPs exceed pre-training.
00:08:22.560 | I think this pretty much clearly says
00:08:24.760 | that they're using a ton of compute for this large-scale RL.
00:08:28.440 | And at that point,
00:08:29.880 | it would probably mean something different,
00:08:32.280 | where this is like pre-training RL,
00:08:34.400 | and this is something that these early relative models
00:08:37.560 | are not going to be doing,
00:08:39.280 | because no one has this infrastructure like OpenAI does.
00:08:43.280 | It'll take a while to do that,
00:08:44.440 | but people will make it.
00:08:45.640 | Okay, this takes us to reinforcement fine-tuning.
00:08:53.480 | I would say that this is a hard pivot in the talk
00:08:56.240 | from O1, which is essentially pre-training-scale RL,
00:08:59.520 | extremely big RL,
00:09:01.240 | where we don't know all the details of the data,
00:09:03.840 | to OpenAI then showing us this new beta API program
00:09:08.480 | that they're making, which is just a sprinkle of this.
00:09:11.400 | So what can you do with a tiny bit of their infrastructure?
00:09:13.720 | I think one of the fine-tuning leads
00:09:16.520 | responded to a tweet from SWIX,
00:09:19.640 | and they were like, the tweet literally,
00:09:21.520 | there was like one of the tweets,
00:09:23.240 | it was a long tweet that gave a lot of details,
00:09:24.800 | but even the first tweet I hadn't seen,
00:09:26.000 | I had like eight likes,
00:09:26.920 | and I was like, this API is using the same infrastructure
00:09:30.360 | that we use to train O1.
00:09:31.640 | I was like, that alone is like a lot of detail.
00:09:34.040 | It was like on Twitter, it was a random thing.
00:09:35.920 | And then there's a really long details on other stuff of it.
00:09:38.680 | But it is just a new paradigm for fine-tuning,
00:09:41.880 | and I have seen some of this work,
00:09:45.080 | and I'm pretty optimistic that it'll work
00:09:47.320 | for kind of really specific capabilities
00:09:51.440 | where answers matter,
00:09:52.680 | rather than features in your style of text mattering.
00:09:55.800 | So again, kind of like I was hinting at with O1,
00:10:00.840 | this reinforcement fine-tuning
00:10:02.080 | does many passes over the data,
00:10:03.600 | which is why they can say
00:10:04.720 | you only need dozens of labeled samples
00:10:06.640 | to actually learn from it,
00:10:07.720 | which is just very different than previous training regimes.
00:10:12.160 | So what happens is that
00:10:15.880 | the grader gives a bonus when the answer is right,
00:10:19.000 | and the model learns to reinforce behaviors
00:10:21.080 | that get right answers.
00:10:22.800 | Later in the talk,
00:10:24.080 | I'll highlight a research project that we did
00:10:28.200 | that was pretty much doing a very similar thing,
00:10:31.680 | to target very specific evaluations on open models,
00:10:34.520 | and you do RL, and you give a reward bonus
00:10:37.840 | when the answer is right, and that's all you do.
00:10:40.040 | And the kind of key innovation in the simplicity
00:10:42.440 | is that modern language models are a strong enough base
00:10:45.760 | where just a really gentle RL fine-tuning
00:10:49.320 | can add these specific capabilities
00:10:51.120 | without degrading the model.
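A minimal sketch of that kind of reward, assuming a simple last-number extraction; the helper names are illustrative rather than taken from any particular codebase:

```python
import re

def extract_final_answer(completion: str) -> str:
    # Illustrative extraction helper: take the last number in the completion.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def verifiable_reward(completion: str, reference_answer: str) -> float:
    # The grader gives a bonus only when the answer is right; nothing else
    # about the completion (style, length, phrasing) is scored.
    return 1.0 if extract_final_answer(completion) == reference_answer.strip() else 0.0
```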
00:10:52.480 | I think a lot of the fear around adding RL
00:10:55.120 | to these training regimes,
00:10:56.640 | especially on general instruct models,
00:10:59.120 | like in ChatGPT, was just that they're gonna destroy
00:11:02.480 | the rest of the performance,
00:11:03.720 | the base of chattiness that you care about.
00:11:06.040 | And it really seems like you can just do this
00:11:08.160 | out of the box. If OpenAI is going to offer an API,
00:11:11.600 | they aren't gonna let people train a model
00:11:13.920 | that then just gets worse on random other things.
00:11:16.920 | So what the data format looks like,
00:11:22.680 | the example I gave is way more complicated
00:11:24.600 | than I think it needs to be.
00:11:25.560 | Seriously, you could start with a grade school math problem
00:11:28.520 | and just say the correct answer is the correct number,
00:11:30.920 | the gene example is confusing,
00:11:33.200 | but essentially you have two components,
00:11:34.880 | a prompt and an answer,
00:11:36.600 | which is different from having a prompt and completion
00:11:39.000 | that you would train on,
00:11:39.880 | or if you're doing preference tuning,
00:11:41.240 | you would do a prompt with a chosen completion
00:11:43.680 | and a rejected completion.
00:11:45.120 | So it's a new type of data format.
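As a rough illustration of the three formats being contrasted, with illustrative field names rather than any official schema:

```python
# Rough illustration of the three data formats being contrasted.
# Field names are illustrative, not OpenAI's schema.

# Instruction tuning: prompt plus the completion you train on directly.
sft_example = {
    "prompt": "What is 4 + 5 * 2?",
    "completion": "5 * 2 = 10, and 4 + 10 = 14. The answer is 14.",
}

# Preference tuning: prompt plus a chosen and a rejected completion.
preference_example = {
    "prompt": "What is 4 + 5 * 2?",
    "chosen": "5 * 2 = 10, and 4 + 10 = 14. The answer is 14.",
    "rejected": "4 + 5 = 9, and 9 * 2 = 18. The answer is 18.",
}

# Reinforcement fine-tuning: prompt plus a verifiable answer; the completion
# is sampled from the model during training and graded against the answer.
rft_example = {
    "prompt": "What is 4 + 5 * 2?",
    "answer": "14",
}
```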
00:11:46.600 | I suspect quickly we'll see things like Hugging Face
00:11:49.640 | having more of these.
00:11:51.160 | I will highlight, we have some of ours
00:11:53.200 | for our specific project that we did.
00:11:56.240 | We have examples for math.
00:11:57.440 | This on the screen is an example
00:11:59.320 | for precise instruction following,
00:12:01.320 | which is the idea that if you have a prompt,
00:12:03.320 | you can say something like,
00:12:05.160 | have every sentence start with the letter A.
00:12:07.560 | And you can verify that with Python really easily.
00:12:10.080 | This is something that we did in our project.
00:12:11.720 | And the model gets better at this.
00:12:13.600 | You have this constrained data
00:12:15.880 | and the RL algorithm learns to change the model
00:12:18.400 | just a tiny bit and actually reach these answers.
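A toy version of that kind of Python verifier, not the exact one from the project, might look like this:

```python
import re

def sentences_start_with(completion: str, letter: str = "A") -> bool:
    # Toy verifier: split on sentence-ending punctuation and check the first
    # character of each sentence (ignoring leading quotes or parentheses).
    sentences = [s.strip() for s in re.split(r"[.!?]+", completion) if s.strip()]
    return all(s.lstrip('"\'(').upper().startswith(letter.upper()) for s in sentences)

def constraint_reward(completion: str) -> float:
    # Binary reward the RL algorithm can optimize against.
    return 1.0 if sentences_start_with(completion, "A") else 0.0
```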
00:12:22.840 | A confusing thing for people was these grader models.
00:12:26.640 | I think the place to come at these from is evaluation.
00:12:30.680 | There's been a lot of work in evaluation
00:12:32.640 | to make answer extraction stable,
00:12:34.960 | especially with math,
00:12:36.160 | where an example that I used in the blog post
00:12:39.520 | I wrote today on this is how Llama 3.1
00:12:41.600 | details their evals.
00:12:42.840 | For math, they use both SymPy, a Python package
00:12:46.360 | for extraction,
00:12:48.280 | and LLM-as-a-judge to extract their answers for math.
00:12:51.520 | And what the graders are doing
00:12:54.080 | is essentially amping this up to a whole nother level
00:12:57.080 | where it's kind of a nested structure of configs
00:13:00.400 | for doing reward shaping on these verifiable outputs.
00:13:03.720 | For math, it can be really easy.
00:13:06.000 | It's like, you know you have to handle these five formats
00:13:09.200 | that I came up with in a minute
00:13:10.520 | for how you could represent different numbers and tokens.
00:13:13.240 | But as you get to more complicated things
00:13:15.160 | and more complicated behaviors,
00:13:16.800 | it seems like OpenAI is insinuating
00:13:18.680 | that you're gonna need more than just a yes/no loss function
00:13:22.200 | for your domains.
00:13:23.520 | And that seems fine.
00:13:25.000 | Well, we already have a bunch of open models,
00:13:27.200 | like judge models and Prometheus
00:13:30.640 | and other things, that are designed specifically
00:13:32.880 | for LLM-as-a-judge.
00:13:34.040 | And I see that continuing to just become part
00:13:36.320 | of this kind of open RL infrastructure.
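Along those lines, here is a minimal sketch of a grader that uses SymPy to compare answers symbolically rather than as raw strings, with a bit of reward shaping; the shaping values are arbitrary placeholders, not anyone's actual config:

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr

def math_grade(predicted: str, reference: str) -> float:
    # Sketch of a grader: compare answers symbolically so that equivalent
    # forms like "0.5" and "1/2" both count as the right answer.
    try:
        pred_expr = parse_expr(predicted.replace("^", "**"))
        ref_expr = parse_expr(reference.replace("^", "**"))
    except Exception:
        return 0.0  # unparseable answers get no reward (never crash the trainer)
    if sympy.simplify(pred_expr - ref_expr) == 0:
        return 1.0  # symbolically equal to the reference answer
    return 0.1  # small shaping bonus for at least producing a parseable expression
```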
00:13:38.840 | OpenAI had a bunch of screenshots.
00:13:44.520 | I'm not gonna add a lot of commentary on these,
00:13:44.520 | but it looks pretty standard.
00:13:46.240 | They're gonna track how performance changes
00:13:48.160 | over time and stuff like this.
00:13:50.880 | You'll be able to look at all the outputs.
00:13:52.240 | This is just them making pretty things.
00:13:55.000 | And then they have this like very generic RL plot.
00:13:58.120 | The most standard RL plot has an X-axis of time or trials
00:14:02.280 | and a Y-axis of reward.
00:14:04.000 | Here, reward is like an accuracy or a success rate
00:14:07.080 | on a certain validation set.
00:14:09.120 | And X is actually supposed to be like
00:14:10.840 | how much training was done.
00:14:12.600 | And this is very similar to what we did in our project.
00:14:15.800 | I think this is kind of just another way you can put this
00:14:18.680 | with an RL feedback diagram.
00:14:20.680 | If you've seen RL where you have this agent interacting
00:14:23.080 | with the environment, you will squint at this
00:14:25.480 | and it'll be familiar.
00:14:26.520 | If you haven't, you'll probably be in for more
00:14:28.920 | of these things if RL keeps becoming popular
00:14:31.840 | 'cause RL is really formulated as trial and error learning.
00:14:35.200 | But if you're interested, we're happy to try
00:14:37.400 | to have people use our code, which does this for math
00:14:40.400 | and some instruction tuning already.
00:14:42.840 | And we want to try more complicated graders
00:14:45.600 | for things like code.
00:14:46.600 | So for code quality, a binary outcome
00:14:48.920 | doesn't really make sense, which is a good way to think
00:14:51.640 | about why you might need to do some reward shaping
00:14:54.000 | for how you would grade outputs from a given model.
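As an example of what shaped grading for code might look like, here is a sketch where partial credit comes from the fraction of unit tests passed rather than a single pass/fail bit; the structure and weights are placeholders:

```python
from typing import Callable, List

def code_reward(candidate_fn: Callable, tests: List[Callable[[Callable], bool]]) -> float:
    # Sketch of a shaped grader for code: instead of a single pass/fail bit,
    # reward the fraction of unit tests that pass.
    passed = 0
    for test in tests:
        try:
            if test(candidate_fn):
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure, not an error in the grader
    return passed / len(tests) if tests else 0.0

# Usage sketch: grade a generated `add` function against a few checks.
tests = [
    lambda f: f(1, 2) == 3,
    lambda f: f(-1, 1) == 0,
    lambda f: f(0, 0) == 0,
]
reward = code_reward(lambda a, b: a + b, tests)  # 1.0 when all tests pass
```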
00:14:57.120 | And to kind of compare the plot that OpenAI had,
00:15:01.240 | which is like performance improving over time,
00:15:03.840 | these are some experiments we ran on various evaluations.
00:15:07.320 | So the left column is some language model evaluation
00:15:10.400 | that we would use in an academic paper.
00:15:12.280 | And the right is all the various internal RL statistics
00:15:16.560 | where like GSM8K, MATH, and IFEval are all being trained
00:15:20.320 | on training sets.
00:15:21.400 | So we have the prompts,
00:15:22.920 | which are math questions.
00:15:24.360 | We have the answers, which are numbers.
00:15:25.960 | And we're really doing this RL on seeing
00:15:28.280 | if this answer is right.
00:15:29.600 | And then it generalizes to various math evaluations
00:15:32.640 | that we care about.
00:15:33.880 | So I kind of see this as, we got a tip
00:15:37.360 | from an industry lab member to do this
00:15:41.120 | a few months early, so we got a head start.
00:15:43.280 | And I think a lot of people are obviously going
00:15:45.280 | to be trying to replicate this now.
00:15:47.600 | So it's fun that we have a starting point.
00:15:49.200 | I'm excited to talk about it with people this week.
00:15:52.400 | And I think reasoning is worth continuing to work on.
00:15:56.920 | You can read the post that I was referencing here.
00:15:59.400 | And I'm happy to take any related
00:16:02.520 | or hard questions on reasoning
00:16:04.080 | 'cause I kind of opened the floor for that.
00:16:06.880 | So thank you.
00:16:07.880 | Thank you.
00:16:09.720 | (upbeat music)
00:16:12.320 | (upbeat music)
00:16:14.920 | (upbeat music)
00:16:17.520 | (hip hop music)
00:16:20.180 | (upbeat music)