Hi everyone, I'm Will Brown. I'm at Prime Intellect. Today I want to talk about training agentic reasoners. Just as a very high-level overview: I think a lot of people here are really excited about reasoning, and a lot of people here are really excited about agents, but I feel like the conversations between these two topics are kind of separate, where people say, "Oh, reasoning is this one thing and agents are this other thing," and the considerations of reasoning are very different from the considerations of building agents.
And I think the high-level thesis of this talk is like, "No, they're kind of the same thing." And you'll see why as we get into it. First, just to start, RL kind of works now. I think for a long time people were like, "Oh, is RL going to work?
Is it not going to work? How hard is it going to be?" And DeepSeek, I think, took a lot of people by surprise for many reasons: the costs, how good it was compared to other open models and to the big labs, as well as it just being fully open.
But I think it was also just that it was RL applied at scale working with surprisingly few tweaks needed, where you just have a good setup, you have a good signal, you have a model that is good enough to do some learning, and you see this curve where doing more RL results in the model getting better.
And it's also kind of how everyone else is doing it. This is what the big labs are really banking on to drive the next iterations of progress. The o3 release is the one that OpenAI is really excited about, not GPT-4.5. They stopped serving the big pre-trained model via API, but they have continued to really double down on the scaling direction of doing more and more reinforcement learning, and spending more compute on reinforcement learning once you have the right setup to enable progress.
And o3, to me, is a very naturally agentic model. The ChatGPT version has all of these tools. The selling point of it is not just that it's smarter, it's that it's really good at using lots of tools in agentic task settings to solve harder problems that involve interacting with complex systems.
And that is really the selling point of all of this: the more complex your system, the more things can go wrong, and the more a generic LLM API is going to be brittle and go off the rails after a certain number of steps.
And RL is kind of the way around it. It's the trick you can use to take a system that kind of works, maybe at small scales, but starts going off the rails as the tasks get harder, and train the model to be better at that thing. And so this is a recipe that is still kind of a research topic; people are not fully sure of the best way to do it, especially outside of the big labs.
But it clearly is moving in a direction where it's becoming more and more reliable, more and more accessible, and it's the sort of thing that I think would be silly to disregard as a potential key piece of the future of agentic software and agentic applications. But it's also complicated.
So on the left here is the architecture diagram of verl, which is kind of the most popular software people use in the research world for writing papers that do RL. So if you want to take a model and go do RL, verl kind of expects that you understand all of this.
And then we have GRPO as presented in the original DeepSeekMath paper from early 2024. And there are a lot of pieces here, a lot of complicated steps going on, that I think a lot of people who are used to thinking about APIs, used to thinking about building agents, are kind of hoping they don't have to worry about.
They're hoping that you can just set it aside, something else will work, we'll just use the APIs, and it'll all be great. And I think the reality is somewhere in the middle: it doesn't need to be this complicated, but you also kind of do have to be aware of it if your goal is really building the best-performing agents. Not necessarily that you need to know all about it today.
But as a piece of the toolkit to potentially make really powerful agentic software, I think the people who are willing to do this, take the best open models, really RL them for their tasks, and figure out how to do that well are going to have a huge advantage.
And that's the kind of thing that also allows you to build a moat, beyond just being a wrapper around an API, towards something where it's like, oh, I actually have my own model now. But not everyone can be a big lab, and so we kind of need to meet in the middle somewhere: okay, how do we make this a thing that starts to become feasible for startups and individual researchers to actually do?
And at what scale does this become feasible? Agents are the type of product that everyone's excited about; we all love Claude Code and Devin and Manus and o3 and Deep Research. These are the sorts of products that are really capturing people's attention.
They're products that in their current iteration happen to work because the models being used have basically been RL'd to do these kinds of things. Claude is a very good coding agent, probably because it has been RL'd on a lot of code. And so it's not very surprising that if you plug Claude into essentially a while loop with some tools, it's quite good at doing these things, because it's most likely been trained in almost that exact setting.
Same for things like o3: it can do GeoGuessr and whatever, because, whether it's literally GeoGuessr or something close to it, they have talked about training it to do this image-cropping trick. That's a technique that it didn't just know how to do out of the box. They said, hey, let's give it these tools and use reinforcement learning to train it to do that.
And so that is the recipe that we have seen coming from the big labs: if you want a powerful agent that can do a certain type of task, you can use reinforcement learning to train it to do that task better. And so these are kind of the same thing, actually. Building an agent, the pieces of making an agent in terms of the harness, the environment, the tools, and the iteration, is essentially the same conceptual framing as canonical reinforcement learning in the sense of policies, actions, states, rewards, and transition probabilities.
And I think the more that we start to view agents as this umbrella, which is not just about static chaining of API calls, but as this interaction loop with evaluations, that framing really is the way to think about RL, which is you build a system where a thing is interacting with an environment, and you have some way of evaluating how good it's doing.
And RL is simply an algorithm to improve based on the scores of these evaluations. And if you're building agents, and you're tuning your prompts, and you're fiddling with your harnesses, this is kind of like doing RL by hand. What you're doing is you're saying like, okay, currently, my evals are saying this, let's make sure the evals are like capturing what I want.
Let's look at the data. Let's see if the data matches what my evals are saying. And then, oh, let's try a new prompt. Let's try giving it a new tool. Let's try switching out the model. This is the process which is also being targeted by reinforcement learning in the general sense beyond individual algorithms.
About these algorithms: there are a few of them that are very important, and all of them have different implementation details. But in general, the idea is you have a bunch of tasks, versions of your problem, which are essentially prompts. You have rollouts, which are just completions, potentially involving many steps of interaction, but one sequence of stuff happening.
And then you have evaluation, potentially interleaved throughout or at the end of the sequence. And what you're estimating is the advantage. The advantage here is the idea that sometimes your model will do better than other times. These LLMs are all non-deterministic: you have temperature above zero, and different things happen on different rolls of the dice.
And this forking process of saying like, okay, this time it did better than that time. Why was it different? RL is really about saying like, okay, this is the actual thing that changed, that resulted in the reward being better, the eval being better. This is the token at which I went down the good path versus the bad path.
And whether you're doing PPO or GRPO, this is the mechanism by which you get the signal: you have something that sometimes went better, sometimes went worse. Now you can very surgically have the model learn to do more of the good stuff without changing too much overall.
I think this is also maybe a reason why DPO didn't work out the way people were hoping it would. In my view, DPO does not necessarily have this fine-grained advantage estimate: it's not really clear, just from one full good completion and one full bad completion, where you're really getting the signal about these complex branching processes.
PPO has this, but it's also very expensive. GRPO, I think, has taken a lot of people by storm as a very nice middle ground: it's more computationally efficient and simple to implement, but it still has this forking process that comes just from sampling.
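To make that concrete, here's a minimal sketch of the group-relative advantage idea at the heart of GRPO. This is just the baseline-and-normalize step over a group of rollouts of the same prompt, not the full objective with its clipping and KL terms.

```python
# Minimal sketch of GRPO-style group-relative advantages (illustrative only).
# Sample several rollouts for the same prompt, score each one, and use the
# group's own statistics as the baseline.
from statistics import mean, stdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Rewards for a group of rollouts of the SAME task -> per-rollout advantages."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    # Each rollout's advantage is how much better it did than its siblings.
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# e.g. four rollouts of one task: two solved it, two didn't
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # positive for successes, negative for failures
```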
There are also just too many papers. A lot of people see a new paper every day and are like, do I have to read this one? And I feel that too. It's difficult to know up front which of these are going to be important and which of them are just going to be noise, especially because lots of them have very sensationalist titles, like, oh, Qwen doesn't work.
Or "everything only works with Qwen," which is kind of true, but there's also more to the story than that. And there are different implementation details, like, oh, if you change the loss function like this in this experiment, then it works. I think for most people, it is best to just set this aside, not get too caught up in the individual details of individual experiments and individual papers, and think more holistically about what the process of reinforcement learning is doing.
Which implementation details am I willing to leave to other people to figure out, and eventually come to me as software with the knobs set correctly? And which pieces are actually important for solving the problems I care about? And so for a lot of people, I think the things that are going to be really interesting are the things relating to actual software, to actual problems that they want to solve in the world.
And agents, I think, are kind of the instantiation of that where this makes sense. The thing that makes an agent an agent is tools, the ability to interact with an environment, with a system. A lot of people at the conference are very excited about MCP. MCP is just tools.
MCP is about giving your LLM the ability to interact with stuff, to go solve problems that involve changing files, making requests, editing code, running code. And so these are the papers that I get excited about, because there are parts of the puzzle that are not fully solved yet: what's the right way to do all of this?
There are still some open questions, but I think those are getting refined, and we're starting to see more and more. But a lot of the code and tools we have out in the wild are pretty narrow: if you want to go play around with RL, most codebases are very much set up for either code and math tasks or things that are quite similar to that.
That's kind of my fault. I had a snippet go viral that was like, here's how you do RL on GSM8K, which is a kind of easy math dataset. And I think I've seen a lot of people stick with this as, oh, we're going to RL on math.
And this is also just because math is easy to evaluate, and writing evals is hard. There's a whole track going on in parallel to this about how to build a good eval. And so I think a lot of researchers gravitate towards things that look like the benchmarks and are also really easy to eval, because there's a very clear signal: okay, this thing is right, this thing is wrong, good, we're doing RL.
But real-world tasks are messier than that. We are not going to get great software systems just by hill-climbing on whatever question-answer benchmark is popular today. What we're going to have to do is start thinking about the actual systems at hand and the challenges that emerge when we're trying to design these rewards.
And so reward hacking is a real thing. I think this is one of the lessons: RL works, but it's not always going to work; there are things that can go wrong. And to me, reward hacking is really a message about the difficulty of building good evals.
What you really want with an eval is for it to be easier for your model to do the task than to hack the eval. You want to build a reward signal that actually captures what you care about, where gaming it is more difficult than not gaming it.
If the model can learn to do the task directly, just by doing what you want it to do in the spirit of the task, then that is what will happen. It will flow down the path of least resistance. Models just want to learn, but they want to learn to do better on reward signals.
And so your reward signals have to point in the direction of the thing you actually care about; otherwise, models will find cheats. A toy version of that difference is sketched below.
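As a toy illustration (these reward functions are hypothetical, not from any particular codebase): a loose reward that is easier to game than to satisfy honestly, versus a stricter one that checks the part you actually care about.

```python
# Hypothetical reward functions contrasting a gameable signal with a stricter one.
import re

def loose_reward(completion: str, answer: str) -> float:
    # Hackable: the model can dump every candidate answer somewhere in its
    # output and this still pays out.
    return 1.0 if answer in completion else 0.0

def strict_reward(completion: str, answer: str) -> float:
    # Only pays out for a single, clearly marked final answer that matches exactly.
    match = re.search(r"Final answer:\s*(.+?)\s*$", completion.strip())
    return 1.0 if match and match.group(1).strip() == answer else 0.0
```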
And I think thinking about these things in combination points towards a direction that I think is going to be very promising, and there are some very early signs that this actually can work. When R1 came out, I was kind of speculating: what's next? What are the things that are going to unlock this sort of technique being used more generally? People talk a lot about generator-verifier gaps: what is the difference between solving a problem versus checking whether you have a solution?
A lot of problems are much easier to check than to solve, but this isn't a binary thing; it's a spectrum of how difficult it is to verify a thing. But there are some signs that you can do evaluations on more ambiguous tasks by breaking them down into smaller pieces and by using LLMs as subroutines in your evaluations, like LLM-as-judge on steroids, or maybe you actually want to train a specialized LLM that is really good at doing these fine-grained evaluations.
I like using the term rubric as a conceptual umbrella around reward models, reward functions, and LLM-as-judge setups: the criteria on which you are evaluating a thing. There's a cool paper from DeepSeek that I found very exciting when it came out a couple of months ago, about how to train reward models that generate these rubrics on the fly.
There was a paper very recently that does this for creative writing, and it found that, yes, you actually can train reward models that will come up with nuanced, fine-grained evaluation criteria for a task on the fly, given the actual problem. And this gives you something that results in a very fine-grained score that allows you to actually do RL and keep getting better.
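As a rough sketch of what a rubric-style judge can look like in code (the criteria, prompt wording, and model name here are all placeholders; a real setup would want calibration, retries, and more robust score parsing):

```python
# Rough sketch of a rubric-style LLM judge producing a fine-grained scalar reward.
from openai import OpenAI

client = OpenAI()

RUBRIC = [
    "Does the response directly address the user's request?",
    "Are factual claims supported or appropriately hedged?",
    "Is the writing clear and well organized?",
]

def rubric_score(task: str, completion: str) -> float:
    scores = []
    for criterion in RUBRIC:
        judgment = client.chat.completions.create(
            model="gpt-4.1-mini",  # placeholder judge model
            messages=[{
                "role": "user",
                "content": (
                    f"Task: {task}\n\nResponse: {completion}\n\n"
                    f"Criterion: {criterion}\n"
                    "Reply with a single integer score from 1 to 5."
                ),
            }],
        )
        text = judgment.choices[0].message.content.strip()
        scores.append(int(text.split()[0]) / 5.0)  # assumes the judge complies with the format
    return sum(scores) / len(scores)  # fine-grained scalar reward in (0, 1]
```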
And I think this is an area I'm really excited to keep watching. But also: multi-turn. Multi-turn is probably where we're headed. We want to do agentic search, tool calls, software, games, long-horizon planning, computer use, memory. Scaling on tool calls lets you solve harder problems.
So how do we actually do this? What's the way to go about building multi-turn agentic systems that we can use RL with? I think the conceptual pieces here are: environments are basically harnesses, rewards are basically evals, tasks are just prompts, and your policy, in the RL sense, hopefully should be as simple as an LLM API.
I think the programming interface that makes sense for a lot of people is an API where you're writing code as if it's just a normal agent in a loop, but then this is a thing that you can use to go do RL. And so that's what I've been building over the past couple of months.
I maintain a repo called verifiers. It's finally out in the world on pip; you can just install it. It's been a long time coming. And what it really is, is a toolkit of these pieces, to make it so that building an agent that you can actually train with RL feels just like building an agent.
So the interaction protocol here is quite simple. This is the entire rollout function on the left, what happens in the code when you're running an agent to do RL: you set up some initial state, and have a while loop for "is it done yet?"
If it's not done, do a turn. And the thing you're passing here is a client object that's just an OpenAI-compatible API; the shape of the loop is sketched below. And I think this is the kind of interface that you really want if you want people to be able to go from their agent applications to something that's trainable, something that they can use with RL.
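A minimal sketch of that loop, written against a plain OpenAI-compatible client. The environment method names here (initial_state, is_completed, env_response) are illustrative, not the actual verifiers API.

```python
# Shape of the rollout loop described above: set up state, loop until done,
# do a turn with an OpenAI-compatible client, let the environment respond.
from openai import OpenAI

def rollout(client: OpenAI, model: str, prompt: str, env):
    messages = [{"role": "user", "content": prompt}]
    state = env.initial_state()                      # set up initial state
    while not env.is_completed(messages, state):     # is it done yet?
        response = client.chat.completions.create(   # do a turn
            model=model,
            messages=messages,
        )
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        # The environment reacts: tool results, game feedback, the next observation.
        env_message, state = env.env_response(messages, state)
        messages.append(env_message)
    return messages, state
```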
It's been a lot of fun thinking about what the abstractions are, what the pieces are. There are things like parsers and rubrics that I think are nice building blocks that you sometimes want to use. You can also not use them if you don't want to, but I've tried to make it fun and user-friendly.
The other day, I was like, let's train a Wordle agent. I think this was a fun little toy problem: it's not that hard a game for us as humans, but it's actually kind of tricky to get your code into the sort of shape where you have this multi-turn interaction protocol that you actually can do learning with.
But now it's much easier; the code to do these things is quite simple. And the reward functions can be relatively simple for this sort of setup: you want to reward it for solving the thing eventually, but also give it more reward for doing it in fewer turns, roughly like the sketch below.
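Roughly the flavor of reward I mean, as a sketch with hypothetical state fields; the real environment tracks the guesses and feedback in its own state.

```python
# Sketch of a Wordle-style reward: base credit for solving, bonus for speed.
MAX_TURNS = 6

def wordle_reward(state: dict) -> float:
    if not state["solved"]:          # never guessed the word: no reward
        return 0.0
    # Base reward for solving, plus a bonus for using fewer turns.
    efficiency_bonus = (MAX_TURNS - state["turns_used"]) / MAX_TURNS
    return 1.0 + efficiency_bonus
```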
And a 7B model works reasonably well at this, but one of the reasons it works, which I'll talk about in a sec, is SFT warmup as a way of lowering the barrier to entry. The code as it is, is very much set up so that your environments for RL are also just synthetic data loops, or evals where you can plug in Claude or DeepSeek or OpenAI and test.
So you don't have to do RL to debug. You can debug with an API, in terms of seeing: is this a good eval? Is this a good reward? Once you're comfortable with it, you can use whatever API you like, whatever you're allowed to use, and make synthetic data and do some SFT on it; that workflow is sketched below.
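That workflow, sketched with illustrative names (the env.score method and the filter threshold are assumptions for the example, and `rollout` is the loop sketched earlier): run the same environment against a hosted API to sanity-check your rewards, then keep the good rollouts as SFT data before any RL happens.

```python
# Debug-then-SFT workflow: score rollouts from a hosted model with the same
# reward you'd use in RL, eyeball the scores, keep the decent ones as SFT data.
import json
from openai import OpenAI

client = OpenAI()  # or any OpenAI-compatible endpoint you're allowed to use

def collect_sft_data(env, tasks: list[str], model: str, out_path: str) -> None:
    with open(out_path, "w") as f:
        for prompt in tasks:
            messages, state = rollout(client, model, prompt, env)
            score = env.score(messages, state)          # same reward you'd use in RL
            print(f"{prompt[:40]!r} -> {score:.2f}")    # eyeball: does the eval look sane?
            if score > 0.5:                             # keep only decent rollouts for SFT
                f.write(json.dumps({"messages": messages}) + "\n")
```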
And now you can start doing RL, and this helps a lot with small models. There are a lot of efficiency challenges that I've been hard at work trying to solve, in terms of having all of your computation be utilized effectively and having everything be fully async.
So you don't have to worry about batching, and your trainer and your inference can go at the same time; you can be a little bit off-policy. There's a lot of engineering there, and I'm hoping that if you want to worry about it, great: dig into it, fork the repo, mess with things.
If you don't want to, you shouldn't have to. The idea here is that this should become something that more people are trying out, more people are having fun with, exploring and getting a feel for it. Because if it's going to be a thing we have to worry about, if this is the future of building better agent models for your applications, now's a good time to start.
And so this stuff is set up so you can do a lot of interesting research on a couple of GPUs. The barrier to entry is much lower now than it used to be. I have a lot of fun doing this on a couple of GPUs. We sell GPUs, by the way.
Thanks everybody. I don't think we have time for questions. But yeah.