A Taxonomy for Next-gen Reasoning — Nathan Lambert, Allen Institute (AI2) & Interconnects.ai

Chapters
0:00 The Current State of Reasoning in AI Models
1:06 Unlocking New Language Model Applications
3:48 The Need for Advanced Planning in AI
4:29 A Proposed Taxonomy for Next-Generation Reasoning
6:16 Reinforcement Learning with Verifiable Rewards
8:23 Current Challenges and Future Directions
12:07 The Effort Required to Build New Capabilities
16:20 A Research Plan for Training Reasoning Models
17:36 The Shift in Compute Allocation from Pre-training to Post-training
I came to this talk trying to reflect on where we are six months into reinforcement learning with verifiable rewards, post-o1 and post-DeepSeek. A lot of this stuff is somewhat boring now, because everybody has a reasoning model and we all know the basics: you can scale RL at training time and the numbers will go up, and that's deeply correlated with then being able to do inference-time scaling. In AI right now a lot of people are up to speed, but the crucial question is where things are going and how you skate to where the puck is going. So a lot of this talk is really me trying to process where this is headed beyond getting high benchmark scores while using 10,000 tokens per answer: what do we need to do to actually train these models, and what are the things that OpenAI and others are probably already doing but that it's increasingly hard to get signal about from the outside?
If we look at it, reasoning is also unlocking genuinely new language model applications. There's one search query that I, as an RL researcher, need to run all the time: I forget that the reward-hacking example is called CoastRunners, and I used to have to Google things like "RL over-optimization" twenty times to find it. When I tried asking o3, it literally gave me the download link directly, so I didn't have to do anything. That's a very unusual use case to just pop out of this reasoning training, where math and code were the real starting point. o3 is great; it's the model I use the most for finding information, and this is the signal I have that a lot of new, interesting things are coming down the pipe.
I would say reasoning is starting to unlock a lot of new language model applications, and I use some of these. This is a screenshot of Deep Research. It's great, and you can use it in really creative ways: prompt it to look at your website and find typos, or to look only at the material on your website, things like that. It's actually more steerable than you may expect. Claude Code, which I'd describe as just having very good vibes, is fun. I'm not a serious software engineer, so I don't use it on hard things, but I use it for fun things because I can put the company API key in and just mess around, like having it help me build the website for the book I wrote online. And then there are the really serious things, like Codex and these fully autonomous agents that are starting to come. If you play with it, it's obvious that the form factor is going to work, and I'm sure there are people getting a lot of value out of it right now. For ML tasks, though, there are no GPUs in it right now, and if you're dealing with open models, they only just added internet access, so it wasn't able to go back and forth and look at Hugging Face configs or something, and there are all these headaches it can't deal with. But within six months, all these things are going to be tools you should be using on a day-to-day basis, and it's all downstream of this step change in performance from reasoning models.
Then there's another plot that's been talked about a lot. When I look at it, through 2024, models like GPT-4o were really saturating, and then these new Sonnet models and o1 really helped push out the frontier in time horizon. The y-axis is roughly how long a task the models can complete, measured in human time, which is kind of a weird way to measure it because things will get faster, but it's going to keep going, and reasoning models are the technique that was unlocked to figure out how to push those limits. When you look at things like this, it's not that we're on some predetermined path and more gains will just arrive. We have to think about what the models need to be able to do in order to keep pushing out these frontiers, and there's a lot of human effort that goes into continuing the trends of AI progress. Gains aren't free, and I think that planning, and thinking about training in a somewhat different way beyond just reasoning skills, is what will push this forward and let these language modeling applications and products that are in their early stages really shine. So the core question I'm thinking about is: what do I have to do to come up with a research plan to train reasoning models that can work autonomously and really have meaningful ideas for what planning would be?
I came up with a taxonomy that has a few traits within it. The first is skills, which we've pretty much already done: skills are getting really good at math and code, and inference-time scaling was useful to getting there, though they could become more research-y over time. For products, I think calibration is going to be crucial. These models overthink like crazy, so they need some calibration of how many output tokens are used relative to the difficulty of the problem, and this becomes more important as we spend more on each task we're planning. The last two are subsets of planning that I'm thinking about, and I'm happy to take feedback on this taxonomy. Strategy is going in the right direction and knowing the different things you can try, because it's really hard for these language models to truly change course; they can backtrack a little, but restarting their plan is hard. And as tasks become very hard, we need abstraction, where the model has to choose on its own how to break a problem down into pieces it can do, or call in a bigger model to do that for it. Right now humans would often do this, but if we want language models to do very hard things, they have to make a plan whose subtasks are actually tractable. These are things the models aren't going to do natively: when they do math problem solving today, there's no clear abstraction like "this piece I can do myself, and this piece needs an additional tool." This is a new thing we're going to have to add. So to summarize: we have skills, we have calibration, and I'll highlight some of that, but planning is the new frontier. People are talking about it, and we really need to think about how we will actually put it into the models.
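To make the taxonomy concrete, here are the four traits as a small lookup table. The one-line glosses are paraphrases of this talk, not formal definitions:

```python
# The four traits from the taxonomy above, with informal glosses.
TAXONOMY = {
    "skills":      "raw ability at verifiable tasks like math and code",
    "calibration": "matching token spend to the difficulty of the problem",
    "strategy":    "going in the right direction; restarting a bad plan",
    "abstraction": "breaking a hard task into tractable subtasks",
}

for trait, gloss in TAXONOMY.items():
    print(f"{trait:>12}: {gloss}")
```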
To put it up on a slide: what we call reinforcement learning with verifiable rewards (RLVR) looks very simple. A lot of RL on language models, especially before you get into the multi-turn setting, has been: you take prompts, the agent creates a completion for each prompt, you score the completions, and with those scored completions you update the weights of the model. It's been single-turn and very simple. I'll have to update this diagram for multi-turn and tools, which makes it a bit more complex, but the core of it is just a language model generating completions and getting feedback on them.
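As a minimal sketch of that loop, here is a toy single-turn RLVR example. The "policy" is just a softmax over canned completions for one prompt, a pedagogical stand-in for a language model, and the update is plain REINFORCE with an exact baseline; none of this reflects a specific training stack:

```python
import math
import random

PROMPT = "What is 2 + 3?"
COMPLETIONS = ["5", "6", "2 + 3 = 5, so 5", "maybe 4"]
logits = [0.0] * len(COMPLETIONS)  # stand-in for model weights

def verifier(completion: str) -> float:
    """Verifiable reward: 1.0 if the completion contains the right answer."""
    return 1.0 if "5" in completion else 0.0

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

LEARNING_RATE = 0.5
for _ in range(300):
    probs = softmax(logits)
    i = random.choices(range(len(COMPLETIONS)), weights=probs)[0]  # rollout
    advantage = verifier(COMPLETIONS[i]) - sum(
        p * verifier(c) for p, c in zip(probs, COMPLETIONS)        # baseline
    )
    # REINFORCE update: grad of log-softmax is one-hot(i) minus probs.
    for j in range(len(logits)):
        logits[j] += LEARNING_RATE * advantage * ((j == i) - probs[j])

# Probability mass converges onto verifiably correct completions.
print(COMPLETIONS[max(range(len(logits)), key=logits.__getitem__)])
```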
It's good to take time to look at these skills. This is a collection of evals, and we can see where GPT-4o was; these were the hardest evals in existence and were called the frontier of AI. If we look at the o1 improvements and then the o3 improvements in quick succession, these are really incredible eval gains that come mostly from adding this new type of training. The core of my argument is that we need to do something similar if we want planning to work. A lot of today's planning tasks look roughly like Humanity's Last Exam and AIME did just before this reasoning skill was added, and we need to figure out what other types of things these models are going to be able to do. This list of low-level reasoning abilities is going to keep growing. The most recent addition, if you look at recent DeepSeek or Qwen models, is tool use, and that's going to produce more models like o3. Using o3 just feels very different because it's this combination of tool use with reasoning; it's obviously good at math and code, but I think we're going to keep getting more of these low-level skills from reasoning training as we figure out what's useful. An abstraction for agenticness on top of tool use would be very nice, but it's hard to measure. People mostly say Claude is the best at it, but it's not yet well established how we measure that or communicate it across different models.
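For reference, here is a minimal sketch of the tool-use-plus-reasoning loop that o3-style models run inside. The JSON action format, the `model` callable, and the toy tool registry are hypothetical stand-ins, not any provider's actual API:

```python
import json

TOOLS = {"search": lambda query: f"top results for {query!r}"}  # toy registry

def agent_loop(model, prompt: str, max_turns: int = 8) -> str:
    """The model emits either a tool call or a final answer as JSON; the
    harness executes tools and feeds results back until it answers."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        action = json.loads(model(messages))        # hypothetical model call
        if action["type"] == "answer":
            return action["content"]
        result = TOOLS[action["tool"]](action["input"])  # run the tool call
        messages.append({"role": "tool", "content": result})
    return "no final answer within the turn budget"
```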
Then we get into the fun, interesting things. Calibration is hard for us right now because it's passed to the user. We have all sorts of workarounds: model selectors if you're a ChatGPT user, Claude's reasoning on/off with extended thinking, something similar in Gemini, and reasoning-effort selectors in the API. This is really rough on the user side. Making it so the model itself knows how hard to think would make it much easier to find the right model for the job, and the tokens you overspend for no reason would go down a lot. It's obvious to want this, and it becomes a bigger problem the longer we don't have it. Some examples from when overthinking was first identified as a problem: on the left, you can ask a language model "what is two plus three," and these reasoning models use hundreds to a thousand tokens on something that could realistically be a one-token answer. On the right is a comparison of sequence lengths between a standard, non-RL-trained instruction model and the QwQ thinking model; you can easily see a 10 to 100x increase in token spend when you shift to a reasoning model. If you do that wastefully, it loads your infrastructure and your costs, and as a user I don't want to wait minutes for an easy question, and I don't want to have to switch models or providers to deal with that.
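The talk's point is that this calibration should eventually live inside the model, but a caller-side sketch shows the shape of the problem. The difficulty classifier, the budgets, and the `max_thinking_tokens` parameter below are all hypothetical, assumed for illustration:

```python
def estimate_difficulty(prompt: str) -> str:
    """Crude stand-in for a learned difficulty classifier."""
    if len(prompt) < 40 and "?" in prompt:
        return "easy"
    if any(k in prompt.lower() for k in ("prove", "polynomial", "optimize")):
        return "hard"
    return "medium"

BUDGETS = {"easy": 64, "medium": 1024, "hard": 16384}  # thinking-token caps

def route(prompt: str) -> dict:
    """Pick a thinking budget from estimated difficulty, instead of always
    paying the maximum (or making the user pick a model)."""
    return {"prompt": prompt, "max_thinking_tokens": BUDGETS[estimate_difficulty(prompt)]}

print(route("What is 2 + 3?"))  # easy prompt -> tiny budget, no long wait
```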
One of the things that follows once we have this calibration is the idea of strategy. On the right, I went to the Epoch AI website, took one of their example questions from FrontierMath, and asked: does this new DeepSeek R1-0528 model do any semblance of planning when it starts? You ask it a math problem and it just goes: "okay, the first thing I'm going to do is construct a polynomial." It dives right in and does nothing like sketching the problem before it thinks, and this is probably going to output 10 to 40 thousand tokens. If it needs another 10x beyond that and it's all in the wrong direction, that's multiple dollars of spend and a lot of latency that's totally useless. Most of these applications are set up to expect a latency between one and thirty minutes, so there's a timeout they're fighting: either going in the wrong direction or thinking way too hard about a subproblem will make the user leave. Right now, as I said, these models do very little planning on their own, but if we look at the applications, they're very likely prompted to plan; that's how Deep Research and Claude Code begin (a hypothetical example of such a prompt is sketched below). We have to make that model-native rather than something we do manually.
make it so that is model native rather than something that we do manually and then once we look at this 00:11:07.600 |
plan there's all these implementation details across something like deep research or codex which is like 00:11:13.120 |
how do i manage a memory so we have cloud code compresses its memory when it fills up its context window we 00:11:18.880 |
don't know if that's the optimal way for every application we want to avoid repeating the same 00:11:23.520 |
mistakes we talked to greg was talking about the playing pokemon earlier which is a great example of 00:11:30.080 |
that we want to have tractable parts we want to offload thinking if we have a really challenging part 00:11:35.840 |
so i'll talk about parallel compute a little bit later it's a way to kind of boost through harder 00:11:40.880 |
things and really we want language models to call multiple other models in parallel so right now people 00:11:48.480 |
are spinning up tmux and launching cloud code in 10 windows to do this themselves but there's no reason 00:11:54.080 |
a language model can't be able to do that it just needs to know the right way to approach it 00:12:00.480 |
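A quick sketch of that fan-out pattern done by a program instead of ten tmux windows: send a subtask to several workers in parallel, then select or merge the drafts. `ask_model` is a hypothetical stand-in for any LLM API call, and picking the longest draft is a toy selection rule, not a recommendation:

```python
import concurrent.futures

def ask_model(subtask: str, worker_id: int) -> str:
    # Stub; replace with a real model/API call.
    return f"[worker {worker_id}] draft answer for: {subtask}"

def solve_in_parallel(subtask: str, n_workers: int = 10) -> str:
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(ask_model, subtask, i) for i in range(n_workers)]
        drafts = [f.result() for f in futures]       # explore in parallel
    return max(drafts, key=len)                      # toy "exploit" step

print(solve_in_parallel("summarize the failure modes in the eval logs"))
```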
As I started with, we need to put in real effort to add new capabilities to language models. When I think about the story of Q* that became Strawberry that became o1, the reason it was in the news for so long and was such a big deal is that it was a major effort for OpenAI: something like 12 to 18 months spent building those initial reasoning traces they could then train an initial model on so it had some of these behaviors. It took a lot of human data to make things like backtracking and verification reliable in their models. We need to go through a similar arc with planning, but the kinds of outputs we'd train on for planning are much more intuitive than reasoning traces. If I asked you to sit down and write a 10,000-token reasoning trace with backtracking, you couldn't really do it, but a lot of experts can write a very good five-to-ten-step plan, or check the work of Gemini or OpenAI models asked to write an initial plan. So I'm much more optimistic about being able to hill-climb on this. Then it goes through the same path: once you have initial data you can do some SFT, and the hard question is whether RL on even bigger tasks can reinforce these planning styles. On the right I added a hypothetical: we already have thinking tokens before answer tokens, and there's no reason we can't apply more structure to our models to make planning explicit.
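A sketch of that hypothetical: extend the familiar thinking-then-answer format with an explicit plan section that data collection and grading could target separately. The tag names here are illustrative, not a known spec:

```python
STRUCTURED_OUTPUT = """\
<plan>
1. Restate the problem and the target quantity.
2. Pick a strategy, e.g. construct a polynomial with the given roots.
3. Execute the strategy, verifying each intermediate step.
</plan>
<think>
... long chain of thought, free to backtrack within the plan ...
</think>
<answer>42</answer>
"""

def extract(tag: str, text: str) -> str:
    """Pull out one tagged section, so a grader could score the plan alone."""
    start = text.index(f"<{tag}>") + len(tag) + 2
    return text[start : text.index(f"</{tag}>")].strip()

print(extract("plan", STRUCTURED_OUTPUT))
```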
To give a bit more depth on skill versus planning, go back to the earlier example: o3 is extremely skilled at search. Being able to find a nice, niche piece of information that researchers in a field know of but can't quite remember the exact search words for is an incredible skill. But when you put this into something like Deep Research, the lack of planning means sometimes you get a masterpiece and sometimes you get a dud. As these models get better at planning, they'll be more thorough and reliable and get the kind of coverage you want. It's crazy that we have models that can do this search, yet if you ask one to recommend some sort of electronics purchase, it's really hard to trust, because it just doesn't know how to pull in the right information or how hard it should try to get all that coverage.
To summarize, these are the four traits I presented. You can obviously add more; you could add something that mixes strategy and abstraction, and you could call much of what I was describing context management. The point is just to have a breakdown like this so you can decompose the training problem and think about data acquisition or new algorithmic methods for each piece. I mentioned parallel compute because it's an interesting one. If you use o1 pro, it's been one of the best and most robust models for quite some time, and I've been very excited for o3 pro, but it doesn't solve problems the way traditional inference-time scaling does. Inference-time scaling made a bunch of things that didn't work go from zero to one, whereas parallel compute makes things more robust; it makes them nicer. It seems like this kind of RL training encourages exploration, and then applying more compute in parallel feels like exploiting, getting a really well-crafted answer. There's a time when you want that, but it doesn't solve every problem.
To transition into the end of this talk: there have been a lot of talks today about the things you can do with RL, and there's obviously a lot of talk on the ground about what's called continual learning, and whether we'll just continually use very long-horizon RL tasks to update a model and diminish the need for pre-training. There are a lot of data points suggesting we're closer to that in many ways. I think continual learning has a big algorithmic bottleneck, but just scaling up RL further is very tractable and is already happening.
If people ask me what I'm working on at AI2 and what I'm thinking about, this is my rough summary of what a research plan to train a reasoning model looks like, without all the between-the-lines details. Step one: get a lot of questions with verified answers across a wide variety of domains. Most of these will be math and code, because that's what's out there. Step two: if you look at all the recipe papers, they have a step where they filter the questions by difficulty with respect to your base model. If a question is solved zero out of 100 times by your base model, or 100 out of 100, you don't want it, because you're not only wasting compute but also making the gradients in your RL updates noisier (a sketch of this filter follows below). Step three: make a stable RL run that goes through all these questions with the numbers continuing to go up. That's the core of it: stable infrastructure and data. Then you can tap into all the research papers that tell you to use methods like overlong filtering, different clipping, or resetting the reference model, and those will give you a few percentage points on top, where really it's just data and stable infrastructure.
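Here is a minimal sketch of that step-two difficulty filter, under stated assumptions: `base_model.solve` and the dataset layout are hypothetical stand-ins. The intuition is that all-pass and all-fail question groups carry no learning signal under group-normalized advantages (everything cancels), so they waste compute and add noise:

```python
def pass_rate(base_model, question: str, answer: str, n: int = 100) -> float:
    """Fraction of n sampled attempts the base model gets right."""
    return sum(base_model.solve(question) == answer for _ in range(n)) / n

def filter_by_difficulty(base_model, dataset, lo: float = 0.05, hi: float = 0.95):
    """Keep only questions the base model sometimes, but not always, solves."""
    kept = []
    for question, answer in dataset:
        rate = pass_rate(base_model, question, answer)
        if lo <= rate <= hi:   # drop 0/100 and 100/100 questions
            kept.append((question, answer))
    return kept
```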
This leads to the provocation: what if we rename post-training to just training? If post-training for OpenAI o1 was something like one percent of compute relative to pre-training, they've already said o3 increased that by 10x. Starting from one percent, you very quickly get to something like parity in GPU-hours between pre-training and post-training, which, if you took anybody back to a year before o1, would have seemed pretty unfathomable. One of the fun data points here is the DeepSeek V3 paper, where you can watch DeepSeek's transition into becoming more serious about post-training. In the original DeepSeek V3 paper, they used 0.18 percent of compute on post-training in GPU-hours, and they said their pre-training takes about two months. There was also a deleted tweet from one of their RL researchers saying the R1 training took a few weeks. If you make a few strong, probably not completely accurate assumptions, such as that RL ran on the whole cluster, that would already be 10 to 20 percent of their compute. There are DeepSeek-specific caveats, like their pre-training efficiency probably being way better than their RL code, but scaling RL is a very real thing if you look at frontier labs and the types of tasks people want to solve with these long-term plans.
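A back-of-the-envelope check of that 10-20 percent claim, under loudly stated assumptions: "about two months" of pre-training, "a few weeks" of RL, and RL occupying the whole cluster. Actual utilization and cluster splits are unknown, and the V3 GPU-hour figures are quoted from memory of the paper, so treat this as an order-of-magnitude sanity check rather than a measurement:

```python
PRETRAIN_WEEKS = 8.7                     # "about two months"
for rl_weeks in (1.0, 2.0, 3.0):         # plausible readings of "a few weeks"
    share = rl_weeks / (PRETRAIN_WEEKS + rl_weeks)
    print(f"RL for {rl_weeks:.0f} week(s): {share:.0%} of GPU-hours")
# -> roughly 10%, 19%, 26%: consistent with the talk's 10-20% estimate

# For contrast, the V3 paper's reported post-training share in H800 GPU-hours
# (5K post-training out of ~2.788M total; verify against the paper):
print(f"V3 post-training share: {5_000 / 2_788_000:.2%}")   # ~0.18%
```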
So it's good to embrace what you think these models will be able to do: break down tasks on their own and solve some of them. Thanks for having me, and let me know what you think.