Training Agentic Reasoners — Will Brown, Prime Intellect

Chapters
0:00 Introduction to the idea that reasoning and agents are similar.
1:05 The growing effectiveness of Reinforcement Learning (RL) in AI.
3:04 The complexities and challenges of implementing RL.
4:41 The connection between popular AI products (agents) and RL fine-tuning.
7:18 The core process of Reinforcement Learning.
10:21 The importance of tools and real-world tasks for agents.
12:13 The problem of "reward hacking" and how to design better evaluations.
14:51 Future directions for agentic systems and a practical toolkit for implementation.
Hi everyone, I'm Will Brown. I'm at Prime Intellect. Today I want to talk about training agentic reasoners. As a very high-level overview: a lot of people here are really excited about reasoning, and a lot of people here are really excited about agents, but the conversations between these two topics are often kept separate, as if reasoning is one thing and agents are another, and the considerations of reasoning are very different from the considerations of building agents. The high-level thesis of this talk is: no, they're kind of the same thing. You'll see why as we get into it.
First, just to start: RL kind of works now. For a long time people were asking, is RL going to work? Is it not going to work? How hard is it going to be? DeepSeek took a lot of people by surprise for many reasons: the cost, how good it was compared to the other open models and to the big labs, and the fact that it was fully open. But I think it was also that it was RL applied at scale, working with surprisingly few tweaks needed. You have a good setup, you have a good signal, you have a model that is good enough to do some learning, and you see this curve where doing more RL results in the model getting better.
And it's also how everyone else is doing it. This is what the big labs are really banking on to drive the next iterations of progress. The o3 release is the one OpenAI is really excited about, not GPT-4.5. They stopped serving the big pre-trained model via API, but they have continued to double down on the scaling direction of doing more and more reinforcement learning, and spending more compute on reinforcement learning once you have the right setup to enable progress. And o3, to me, is a very naturally agentic model. The ChatGPT version has all of these tools. The selling point is not just that it's smarter; it's that it's really good at using lots of tools in agentic task settings to solve harder problems that involve interacting with complex systems. And that is really the selling point of all of this: the more complex your system, the more things can go wrong, and the more a generic LLM API is going to be brittle and go off the rails after a certain number of steps. RL is the way around that. It's the trick you can do to take a system that sort of works, maybe at small scales but starts going off the rails as you push it harder, and train the model to be better at that thing. This recipe is still something of a research topic; people are not fully sure of the best way to do it, especially outside of the big labs. But it is clearly moving in a direction where it's becoming more reliable and more accessible, and it's the sort of thing that I think would be silly to disregard as a potential key piece of the future of agentic software and agentic applications.
But it's also complicated. On one side here is the architecture diagram of verl, which is probably the most popular software in the research world for doing RL and writing papers about it. If you want to take a model and go do RL, verl kind of expects that you understand all of this. We also have GRPO as presented in the original DeepSeekMath paper from early 2024. There are a lot of pieces here, a lot of complicated steps, that many people who are used to thinking about APIs and building agents are hoping they won't have to worry about. They're hoping they can set it aside, that something else will work, and that they can just use the APIs and it'll all be great. I think the reality is somewhere in the middle. It doesn't need to be this complicated, but you do have to be aware of it if your goal is really building the best-performing agents. Not that you need to know all of it today, but as a piece of the toolkit for making really powerful agentic software, I think the people who are willing to do this, take the best open models, RL them for their tasks, and figure out how to do that well, are going to have a huge advantage. That's also the kind of thing that lets you build a moat beyond being an API wrapper, towards something where you actually have your own model. But not everyone can be a big lab, so we need to meet in the middle somewhere: how do we make this feasible for startups and individual researchers to actually do, and at what scale does it become feasible?
Agents are the type of product that everyone's excited about. We all love Claude Code and Devin and Manus and o3 and deep research; these are the sorts of products that are really capturing people's attention. They're products that, in their current iteration, work largely because the models being used have been RL'd to do basically these kinds of things. Claude is a very good coding agent, probably because it has been RL'd on a lot of code. So it's not very surprising that if you plug Claude into essentially a while loop with some tools, it's quite good at doing these things, because it has most likely been trained in almost that exact setting. Same for something like o3: it can do GeoGuessr and so on because, whether it's literally GeoGuessr or something close to it, they have talked about training it to do this image-cropping trick. That's a technique it didn't just know how to do out of the box. They said, let's give it these tools and use reinforcement learning to train it to use them. So that is the recipe we have seen coming from the big labs: if you want a powerful agent that can do a certain type of task, you can use reinforcement learning to train it to do that task better.
And so these are kind of the same thing, actually. Building an agent, the pieces of making an agent in terms of the harness, the environment, the tools, and the iteration, is essentially the same conceptual framing as canonical reinforcement learning in the sense of policies, actions, states, rewards, and transition probabilities. The more we view agents as this umbrella, not just static chaining of API calls but an interaction loop with evaluations, the more that framing really is the way to think about RL: you build a system where a thing is interacting with an environment, you have some way of evaluating how well it's doing, and RL is simply an algorithm to improve based on the scores of those evaluations. If you're building agents, tuning your prompts, and fiddling with your harnesses, you're kind of doing RL by hand. What you're doing is saying: okay, currently my evals are saying this; let's make sure the evals are capturing what I want; let's look at the data and see if it matches what the evals are saying; and then let's try a new prompt, let's try giving it a new tool, let's try switching out the model. That is the same process being targeted by reinforcement learning in the general sense, beyond individual algorithms.
As for the algorithms themselves, there are a few important ones, and all of them differ in implementation details. But in general, the idea is: you have a bunch of tasks, versions of your problem, which are essentially prompts. You have rollouts, which are completions, potentially involving many steps of interaction, but one sequence of stuff happening. And then you have evaluation, potentially interleaved throughout or at the end of the sequence. What you're estimating is the advantage. The advantage captures the idea that sometimes your model does better than other times. These LLMs are non-deterministic: with temperature above zero, different things happen on different rolls of the dice. There is this forking process of saying, okay, this time it did better than that time; why was it different? RL is really about saying: this is the actual thing that changed that resulted in the reward being better, the eval being better; this is the token at which I went down the good path versus the bad path. Whether you're doing PPO or GRPO, this is the mechanism by which you get the signal: you have something that sometimes went better and sometimes went worse, and now you can very surgically have the model learn to do more of the good stuff without changing too much overall. I think this is also maybe a reason why DPO, which people were hoping would really work well, falls short here: in my view, DPO does not have this fine-grained advantage estimate. It's not really clear, just from one full good completion and one full bad completion, where you're getting the signal about these complex branching processes. PPO has it, but it's also very expensive. GRPO has taken a lot of people by storm as a very nice middle ground: it's more computationally efficient, it's simple to implement, and it still has this forking process that comes just from sampling.
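To make that concrete, here is a minimal sketch of the group-relative advantage step at the heart of GRPO: sample several completions for the same prompt, score them, and normalize each reward against the group. The function name and the example numbers are mine, and this is only the advantage computation, not the full training loop.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: each rollout is scored
    against the other rollouts sampled for the *same* prompt.

    rewards: one scalar reward per completion in the group.
    Returns one advantage per completion; every token of a completion
    shares that advantage in the policy-gradient loss.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 rollouts of the same prompt, two solved the task.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # roughly [ 1, -1,  1, -1 ]
```

PPO gets a similar baseline from a separately trained value network, which is where much of its extra cost comes from; GRPO replaces that with the statistics of the sampled group.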
There are also just too many papers. A lot of people see a new paper every day and wonder, do I have to read this one? I feel that too. It's difficult to know up front which of these are going to be important and which are just going to be noise, especially because lots of them have very sensationalist titles, like "Qwen doesn't work," or "everything only works with Qwen," which is kind of true, but there's more to the story than that. And there are lots of implementation details, like "if you change the loss function like this, then in this experiment it works." For most people, it's best to set this aside and not get too caught up in the details of individual experiments and individual papers, and instead think more holistically: what is the process of reinforcement learning doing? Which implementation details am I willing to leave to other people to figure out and eventually come to me with software that has the knobs set correctly? And which pieces are actually important for solving the problems I care about?
For a lot of people, I think the things that are going to be really interesting are the ones relating to actual software, to actual problems they want to solve in the world. Agents are the instantiation of that where this makes sense, and the thing that makes an agent an agent is tools: the ability to interact with an environment, with a system. A lot of people at the conference are very excited about MCP, and MCP is just tools. MCP is about giving your LLM the ability to interact with stuff, to go solve problems that involve changing files, making requests, editing code, running code. So these are the papers I get excited about, because there are parts of the puzzle that are not fully solved yet: what's the right way to do all of this? There are still open questions, but those are getting refined, and we're starting to see more and more. Still, a lot of the code and tools out in the wild are narrow. If you want to go play around with RL, most codebases are set up for either code and math tasks or things quite similar to that. That's kind of my fault: I had a snippet go viral that was "here's how you do RL on GSM8K," which is a fairly easy math dataset, and I've seen a lot of people stick with that, as in "we're going to RL on math." But that's also just because math is easy to evaluate.
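For a sense of why math is the default demo, the reward in those codebases is roughly an exact-match check on the final number. This is a sketch of that kind of function, not the exact snippet referenced in the talk:

```python
import re

def gsm8k_reward(completion: str, answer: str) -> float:
    """Binary reward for GSM8K-style problems: pull the last number out
    of the completion and compare it to the reference answer.
    Checking correctness is one regex, which is why math tasks are so
    popular for RL demos."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == str(answer) else 0.0

print(gsm8k_reward("... so the total is 42", "42"))  # 1.0
```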
Writing evals is hard; there's a whole track running in parallel to this one about how to build a good eval. So a lot of researchers gravitate towards things that look like the benchmarks and are really easy to eval, because there's a very clear signal: this thing is right, this thing is wrong, good, we're doing RL. But real-world tasks are messier than that. We are not going to get great software systems just by hill-climbing on whatever question-answer benchmark is popular today. What we're going to have to do is start thinking about the actual systems at hand and the challenges that emerge when we're trying to design these rewards.
Reward hacking is a real thing. One of the lessons is that RL works, but it's not always going to work; there are things that can go wrong. To me, reward hacking is really a message about the difficulty of building good evals. What you really want from an eval is for it to be easier for your model to do the task than to hack the eval. You want to build a reward signal that actually captures what you care about, where gaming it is more difficult than not gaming it. If the model can learn to do the task directly, just by doing what you want it to do in the spirit of the task, then that is what will happen; it will flow along the path of least resistance. Models just want to learn, but they want to learn to do better on reward signals, so your reward signals have to point in the direction of the thing you actually care about. Otherwise, models will find cheats.
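As a hypothetical illustration of the gap between "what you meant" and "what you rewarded" (the callables run_tests, run_hidden_tests, and touches_tests stand in for whatever your environment actually provides; none of this is from the talk):

```python
def naive_reward(patch: str, run_tests) -> float:
    # Rewards "the test command exits 0". A model can satisfy this by
    # weakening or skipping the tests rather than fixing the bug.
    return 1.0 if run_tests(patch) else 0.0

def harder_to_hack_reward(patch: str, run_hidden_tests, touches_tests) -> float:
    # Same intent, but the tests are held out of the environment and any
    # patch that edits test files is rejected outright, so the cheapest
    # path to reward is closer to actually doing the task.
    if touches_tests(patch):
        return 0.0
    return 1.0 if run_hidden_tests(patch) else 0.0
```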
Thinking about these things in combination points towards a direction that I think is going to be very promising, and there are some early signs that it actually can work. When R1 came out, I was speculating about what's next: what are the things that will unlock this sort of technique being used more generally? People talk a lot about generator-verifier gaps: what is the difference between solving a problem and checking whether you have a solution? A lot of problems are much easier to check than to solve, but this isn't a binary thing; it's a spectrum of how difficult it is to verify something. And there are signs that you can do evaluations on more ambiguous tasks by breaking them down into smaller pieces and by using LLMs as subroutines in your evaluations, like LLM-as-judge on steroids, or maybe by actually training a specialized LLM that is really good at doing these fine-grained evaluations. I like using the term "rubric" as a general conceptual umbrella over reward models, reward functions, and LLM-as-judge setups: the criteria on which you are evaluating a thing. There's a cool paper from DeepSeek that I found very exciting when it came out a couple of months ago, about how to train reward models that generate these rubrics on the fly. And there was a paper very recently that does this for creative writing and found that, yes, you actually can train reward models that will come up with nuanced, fine-grained evaluation criteria for a task on the fly, given the actual problem. That gives you a fine-grained score that lets you actually do RL and keep getting better. This is an area I'm really excited to keep watching.
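A minimal sketch of what a fixed-rubric, LLM-as-judge reward could look like, using any OpenAI-compatible endpoint; the rubric items, the default judge model, and the YES/NO scoring scheme are all illustrative assumptions, and the papers above go further by generating the criteria per problem rather than hard-coding them:

```python
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint works here

RUBRIC = [
    "Answers the user's actual question",
    "Cites evidence from the provided context",
    "Contains no fabricated facts",
]

def rubric_reward(prompt: str, completion: str, judge_model: str = "gpt-4o-mini") -> float:
    """Score a completion criterion-by-criterion with an LLM judge and
    average the per-criterion scores into one scalar reward."""
    scores = []
    for criterion in RUBRIC:
        judgment = client.chat.completions.create(
            model=judge_model,
            messages=[{
                "role": "user",
                "content": (
                    f"Task: {prompt}\n\nResponse: {completion}\n\n"
                    f"Does the response satisfy: '{criterion}'? Reply YES or NO."
                ),
            }],
        )
        verdict = judgment.choices[0].message.content or ""
        scores.append(1.0 if "YES" in verdict.upper() else 0.0)
    return sum(scores) / len(RUBRIC)
```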
But also: multi-turn. Multi-turn is probably where we're headed. We want to do agentic search, tool calls, software, games, long-horizon planning, computer use, memory. Scaling on tool calls lets you solve harder problems. So how do we actually do this? What's the way to go about building multi-turn agentic systems that we can use RL with? The conceptual pieces here: environments are basically harnesses, rewards are basically evals, tasks are just prompts, and your policy, in the RL sense, should hopefully be as simple as an LLM API. The programming interface that makes sense for a lot of people is one where you write code as if it's just a normal agent in a loop, but then that is a thing you can use to go do RL.
That's what I've been building over the past couple of months. I maintain a repo called verifiers. It's finally on pip, out in the world; you can just install it, but it's been a long time coming. What it really is, is a toolkit of these pieces, so that building an agent you can actually train with RL feels just like building an agent. The interaction protocol is quite simple. This is the entire rollout function on the slide, what happens in the code when you're running an agent to do RL: you set up some initial state, have a while loop for "is it done yet?", and if it's not done, do a turn. The thing you're passing in is a client object that's just an OpenAI-compatible API.
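The shape of that loop is roughly the following. This is a paraphrase of the idea, not the actual verifiers source: the env object with is_done and step methods is a placeholder for whatever your environment exposes.

```python
from openai import OpenAI

def rollout(client: OpenAI, model: str, env, prompt: str, max_turns: int = 10):
    """Multi-turn rollout against an environment: keep the message list as
    state, let the model act, let the environment respond, stop when done."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        response = client.chat.completions.create(model=model, messages=messages)
        assistant = response.choices[0].message.content
        messages.append({"role": "assistant", "content": assistant})
        if env.is_done(messages):        # placeholder: environment decides termination
            break
        messages.append({"role": "user", "content": env.step(assistant)})  # env feedback
    return messages
```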
I think this is the kind of interface you really want if you want people to be able to go from their agent applications to something that's trainable, something they can use with RL. It's been a lot of fun thinking about the abstractions, about what the pieces are. There are things like parsers and rubrics that I think are nice building blocks you sometimes want to use. You can also not use them if you don't want to, but I've tried to make it fun and user-friendly.
The other day I thought, let's train a Wordle agent. It's a fun little toy problem: not that hard a game for us as humans, but it's actually kind of tricky to get your code into the shape where you have a multi-turn interaction protocol you can actually do learning with. Now it's much easier; the code to do these things is quite simple. And the reward functions can be relatively simple for this sort of setup: you want to reward it for eventually solving the puzzle, but also give it more reward for doing it in fewer turns.
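That shaping could look something like this; the weights and the efficiency bonus are made-up illustrations, not the values used in the actual Wordle environment.

```python
def wordle_reward(solved: bool, turns_used: int, max_turns: int = 6) -> float:
    """Reward solving the puzzle at all, plus a smaller bonus that grows
    as the agent uses fewer turns. Weights are illustrative."""
    if not solved:
        return 0.0
    efficiency_bonus = (max_turns - turns_used) / max_turns  # 0.0 up to ~0.83
    return 1.0 + 0.5 * efficiency_bonus

print(wordle_reward(True, 3))  # 1.25
```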
This is a 7B model, and it works reasonably well. But one of the reasons it works is SFT warmup as a way of lowering the barrier to entry. The code as it stands is very much set up so that your environments for RL are also just synthetic data loops or evals, where you can plug in Claude or DeepSeek or OpenAI and test. You don't have to do RL to debug; you can debug with an API, in terms of seeing: is this a good eval, is this a good reward? Once you're comfortable with it, you can use whatever API you like and are allowed to use, make synthetic data, and do some SFT on it. And now you can start doing RL, and this helps a lot with small models.
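That warmup loop, sketched with a generic OpenAI-compatible client: run your environment's prompts through a strong API model, keep the rollouts your reward function already likes, and dump them as SFT data for the small model. The model name, threshold, and output path are assumptions for illustration, not part of the verifiers API.

```python
import json
from openai import OpenAI

client = OpenAI()  # point at whichever provider you're allowed to use

def build_sft_warmup(prompts, reward_fn, model="gpt-4o", threshold=0.8,
                     out_path="sft_warmup.jsonl"):
    """Generate completions with a strong API model, filter by your own
    reward, and write chat-format SFT data to train on before RL."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            completion = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            ).choices[0].message.content
            if reward_fn(prompt, completion) >= threshold:
                f.write(json.dumps({"messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": completion},
                ]}) + "\n")
```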
There are a lot of efficiency challenges I've been hard at work trying to solve: having all of your computation utilized effectively, having everything be fully async so you don't have to worry about batching, and letting your trainer and your inference run at the same time so you can be a little bit off-policy. There's a lot of engineering there, and the hope is: if you want to worry about that, great, dig into it, fork the repo, mess with things; if you don't want to, you shouldn't have to. The idea is that this should become something more people are trying out, having fun with, exploring, and getting a feel for. Because if this is going to be a thing we have to worry about, if this is the future of building better agent models for your applications, now's a good time to start. And this stuff is set up so you can do a lot of interesting research on a couple of GPUs; the barrier to entry is much lower now than it used to be. I have a lot of fun doing this on a couple of GPUs. We sell GPUs, by the way. Thanks everybody. I don't think we have time for questions. But yeah.