
Training Agentic Reasoners — Will Brown, Prime Intellect


Chapters

0:00 Introduction to the idea that reasoning and agents are similar.
1:05 The growing effectiveness of Reinforcement Learning (RL) in AI.
3:04 The complexities and challenges of implementing RL.
4:41 The connection between popular AI products (agents) and RL fine-tuning.
7:18 The core process of Reinforcement Learning.
10:21 The importance of tools and real-world tasks for agents.
12:13 The problem of "reward hacking" and how to design better evaluations.
14:51 Future directions for agentic systems and a practical toolkit for implementation.

Transcript

00:00:00.040 | Hi everyone, I'm Will Brown. I'm at Prime Intellect. Today I want to talk about training
00:00:18.720 | agentic reasoners. Just kind of as a very high-level overview, I think a lot of people here are really
00:00:24.880 | excited about reasoning and a lot of people here are really excited about agents, but I feel like a
00:00:28.480 | lot of the conversations around these two topics are kind of separate, where people are like, "Oh,
00:00:32.560 | reasoning is this one thing and agents are this other thing." And the considerations of reasoning are
00:00:37.760 | very different from the considerations of building agents. And I think the high-level thesis of this
00:00:41.200 | talk is like, "No, they're kind of the same thing." And you'll see why as we get into it.
00:00:45.600 | First, just to start, RL kind of works now. I think for a long time people were like, "Oh,
00:00:50.800 | is RL going to work? Is it not going to work? How hard is it going to be?" And DeepSeek, I think,
00:00:55.440 | took a lot of people by surprise for many reasons, like the cost, and how good it was
00:00:59.840 | compared to other open models and to the big labs, as well as just it being fully open. But I think it was
00:01:05.840 | also just that it was RL applied at scale working with surprisingly few tweaks needed, where you just
00:01:13.120 | have a good setup, you have a good signal, you have a model that is good enough to do some learning,
00:01:18.800 | and you see this curve where doing more RL results in the model getting better.
00:01:23.760 | And it's also kind of how everyone else is doing it. This is what the big labs are really banking on
00:01:30.400 | to drive the next iterations of progress. The o3 release is the one that OpenAI is really excited
00:01:35.520 | about, not GPT-4.5. They stopped serving the big pre-trained model via API, but they have continued to
00:01:41.760 | really double down on the scaling direction of doing more and more reinforcement learning and spending more
00:01:46.800 | compute on reinforcement learning once you have the right setup to enable progress. And o3 to me is
00:01:53.040 | like a very naturally agentic model. The ChatGPT version has all of these tools. The kind of selling
00:01:59.200 | point of it is not just that it's smarter, it's that it's really good at using lots of tools in agentic task
00:02:05.280 | settings to solve harder problems that involve interacting with complex systems. And that is kind of
00:02:11.120 | really the selling point of all of this: the more complex your system, the more things can
00:02:16.480 | go wrong, and the more a generic LLM API is going to be brittle and go off the rails after a certain
00:02:22.400 | number of steps. And RL is kind of the way around it. It's the trick you can use to take a system that
00:02:28.240 | kind of works (maybe it works at small scales, but starts going off the rails as tasks get harder) and
00:02:33.360 | train the model to be better at that thing. And so this is a recipe that is still kind of a research
00:02:39.680 | topic, where people are not fully sure of the best way to do it, especially outside of the big labs.
00:02:44.800 | But it clearly is moving in a direction where it's becoming more and more reliable, more and more
00:02:50.400 | accessible. And it's the sort of thing that I think would be silly to disregard as a potential key piece of the future of
00:02:58.720 | agentic software and agentic applications. But it's also complicated. So on the left here,
00:03:04.400 | this is the architecture diagram of verl, which is kind of the most popular software people use in
00:03:09.920 | the research world for writing papers to do RL. So if you want to take a model and go do RL,
00:03:16.240 | verl kind of expects that you understand all of this. On the right, we have GRPO as presented
00:03:22.720 | in the original DeepSeekMath paper back from early 2024. And there are a lot of pieces here, a lot of
00:03:27.760 | complicated steps going on that I think a lot of people who are used to thinking about APIs,
00:03:33.680 | used to thinking about building agents, are kind of hoping they don't have to worry about it,
00:03:39.120 | hoping that you can just set it aside, that something else will work, and we'll just
00:03:46.240 | use the APIs and it'll all be great. And I think the reality is somewhere in the middle:
00:03:51.520 | it doesn't need to be this complicated, but I think you also kind of do have to be aware of it if
00:03:55.680 | your goal is really building the best-performing agents. Not necessarily that you need to know about it
00:03:59.200 | today, but as a piece of the toolkit to potentially make really powerful
00:04:04.400 | agentic software. I think the people who are willing to do this, take the best open models, and really
00:04:11.440 | RL them for their tasks and figure out how to do that well, are going to have a huge advantage. And that's
00:04:15.600 | the kind of thing that also allows you to build a moat, beyond just being an API wrapper and
00:04:20.480 | towards something where it's like, oh, I actually have my own model now. But not everyone can be
00:04:25.520 | a big lab. And so we kind of need to meet in the middle somewhere: okay, how do we make this
00:04:29.120 | a thing that starts to become feasible for startups, for individual researchers, to actually do? And
00:04:36.960 | at what scale does this become feasible? And so agents are the type of product that everyone's
00:04:42.240 | excited about. We all love Claude Code and Devin and Manus and o3 and Deep Research. And
00:04:48.720 | these are the sorts of products that are really capturing people's attention.
00:04:52.320 | They're products that in their current iteration happen to work largely because the models that
00:04:57.120 | are being used have been RL'd to basically do these kinds of things. Like Claude is a very good
00:05:03.040 | coding agent, probably because it has been RL'd on a lot of code. And so it's not very surprising
00:05:08.000 | that if you plug Claude into essentially a while loop with some tools, it's like quite good at doing
00:05:13.200 | these things because it's basically most likely been trained in almost that exact setting. Same for things
00:05:19.440 | like o3: it can do GeoGuessr and whatever, because whether it's literally GeoGuessr or something
00:05:24.240 | close to it, they have talked about training it to do this image cropping trick. Like that's a technique
00:05:30.160 | that it didn't just know how to do out of the box. They said, hey, let's give it these tools to do that
00:05:34.560 | and use reinforcement learning to train it to do that. And so that is kind of the recipe that we have seen
00:05:39.120 | coming from the big labs: if you want a powerful agent that can do a certain type of task,
00:05:43.920 | you can use reinforcement learning to train it to do that task better.
00:05:48.640 | And so these are kind of the same thing, actually, like building an agent, the pieces of
00:05:52.720 | making an agent in terms of the harness, the environment, the tools and the iteration is
00:05:59.440 | essentially the same conceptual framing as canonical reinforcement learning in the sense of policies,
00:06:05.920 | actions, states, rewards, transition probabilities. And I think the more that we start to view
00:06:10.880 | agents as this umbrella, which is not just about static chaining of API calls, but as this interaction
00:06:18.160 | loop with evaluations, that framing really is the way to think about RL, which is you build a system
00:06:24.800 | where a thing is interacting with an environment, and you have some way of evaluating how good it's
00:06:29.760 | doing. And RL is simply an algorithm to improve based on the scores of these evaluations.
00:06:36.560 | And if you're building agents, and you're tuning your prompts, and you're fiddling with your harnesses,
00:06:40.880 | this is kind of like doing RL by hand. What you're doing is you're saying like,
00:06:44.960 | okay, currently, my evals are saying this, let's make sure the evals are like capturing what I want.
00:06:50.240 | Let's look at the data. Let's see if the data matches what my evals are saying. And then, oh,
00:06:55.760 | let's try a new prompt. Let's try giving it a new tool. Let's try switching out the model.
00:07:00.800 | This is the process which is also being targeted by reinforcement learning in the general sense
00:07:07.600 | beyond individual algorithms. About these algorithms, there's a few of them that are
00:07:12.960 | very important. All of them have different implementation details. But in general,
00:07:17.120 | the idea is you have a bunch of tasks, like versions of your problem, which are essentially prompts.
00:07:21.520 | You have rollouts, which are just completions, potentially involving many steps of interactions,
00:07:25.680 | but like one sequence of stuff happening. And then you have evaluation, potentially interleaved
00:07:31.200 | throughout or at the end of the sequence. And what you're estimating is the advantage. The
00:07:35.840 | advantage here is the idea that sometimes your model will do better than other times. These
00:07:40.800 | LLMs are all non-deterministic. You have temperature above zero. You have different things happen on
00:07:47.600 | different rolls of the dice. And this forking process of saying, okay, this time it did better than
00:07:54.240 | that time. Why was it different? RL is really about saying like, okay, this is the actual thing that
00:08:02.160 | changed, that resulted in the reward being better, the eval being better. This is the token at which
00:08:08.240 | I went down the good path versus the bad path. And whether you're doing PPO or GRPO,
00:08:14.480 | like this is the mechanism by which you get the signal of like, you have something that sometimes
00:08:20.800 | went better, sometimes went worse. Now you can kind of very surgically have the model learn to do more
00:08:27.200 | of the good stuff without changing too much overall. I think this is also maybe a reason why, with DPO,
00:08:32.400 | people were hoping DPO would really work well, but in my view, DPO does not necessarily have
00:08:37.280 | this fine-grained advantage estimate. It's not really clear, just from a full good completion and
00:08:42.320 | a full bad completion, where you're really getting the signal about these complex branching processes.
00:08:46.880 | PPO has this, but it's also very expensive. GRPO, I think, has taken a lot of people
00:08:52.160 | by storm in terms of being a very nice middle ground: it's more computationally efficient,
00:08:58.400 | it's simple to implement, but it also has this forking process that comes just from sampling.
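
(A minimal sketch of the group-relative idea behind GRPO, using illustrative names rather than the DeepSeek implementation: sample a group of rollouts for the same prompt, score each one, and normalize each reward against the group.)

    import statistics

    def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
        # GRPO-style signal: each rollout is scored relative to the other
        # rollouts of the same prompt, so "better than the sibling samples"
        # is what gets reinforced, with no learned value network needed.
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards)
        return [(r - mean) / (std + eps) for r in rewards]

    # Four rollouts of one prompt, scored by some reward function:
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # -> positive advantage for the two successes, negative for the failures
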
00:09:06.560 | There are also just too many papers. I think a lot of people just see a new paper every day and
00:09:11.520 | are like, do I have to read this one? And I feel that too. Like, I think it's difficult to know up front,
00:09:17.760 | like which of these are going to be important, which of them are just going to be like noise, especially
00:09:22.720 | because lots of them have very sensationalist titles, like, oh, Qwen doesn't work,
00:09:27.040 | or, everything only works with Qwen, which is kind of true. But there's also
00:09:32.320 | more to the story than that. And I think there's like different implementation details of like, oh,
00:09:36.240 | if you change the loss function like this in this experiment, then it works. And I think for most
00:09:41.520 | people, it is best to just like kind of set this aside and to not get too caught up in the individual
00:09:48.400 | details of individual experiments, individual papers, and kind of think more holistically about what is the
00:09:54.240 | process of reinforcement learning doing? What implementation details am I willing to kind of
00:09:59.040 | leave to other people to figure out and eventually come to me with software that has the
00:10:02.880 | knobs set correctly? And which pieces are actually important for solving the problems I care about?
00:10:08.240 | And so for a lot of people, I think the things that are going to be really interesting
00:10:12.640 | are things relating to actual software, to actual problems that they want to solve in the world.
00:10:18.560 | And agents, I think, are kind of the instantiation of that, where this makes sense. And the thing that makes
00:10:23.680 | an agent an agent is tools: the ability to interact with an environment, with a system.
00:10:28.720 | A lot of people here at the conference are very excited about MCP. MCP is just tools. MCP is
00:10:33.600 | about giving your LLM the ability to interact with stuff, to go solve problems that involve changing
00:10:41.440 | files, making requests, editing code, running code. And so I think these are the papers that I get excited
00:10:47.600 | about, because they tackle parts of the puzzle that are not fully solved yet:
00:10:51.360 | what's the right way to do all of this? There are still some open questions,
00:10:54.400 | but I think those are getting kind of refined. We're starting to see more and more. But a lot of
00:11:01.280 | the code, the tools we have out in the wild, they're like, go do this. If you want to go play
00:11:05.520 | around with RL, most codebases are very much set up for either code and math tasks or things that are
00:11:12.320 | quite similar to that. That's kind of my fault. I had a snippet go viral that was like, here's how you do
00:11:19.360 | RL on GSM8K, which is a kind of easy math dataset. And then I've seen a lot of people
00:11:25.600 | stick with this as, oh, we're going to RL on math. But this is also just because math is easy to evaluate,
00:11:30.640 | and writing evals is hard. There's a whole track going on in parallel to this about how to build a good eval.
00:11:38.480 | And so I think a lot of researchers gravitate towards things that like look like the benchmarks
00:11:43.120 | that are also really easy to eval because there's like a very clear signal of like, okay,
00:11:47.040 | this thing is right, this thing is wrong. Good, okay, we're doing RL. But real-world tasks are
00:11:53.200 | messier than that. We are not going to get great software systems just by hill climbing on
00:12:00.960 | whatever question-answer benchmark is popular today. What we're going to have to do is
00:12:06.000 | start thinking about the actual systems at hand and the challenges that emerge when we're
00:12:11.040 | trying to design these rewards. And so reward hacking is a real thing. I think this is
00:12:15.440 | one of the lessons here: RL works, but it's also not always going to work. There are things that
00:12:20.960 | can go wrong. And to me, reward hacking is really a message about the difficulty of building good evals.
00:12:26.640 | What you really want with an eval is for it to be easier for your model to do the task
00:12:32.960 | than to hack the eval. You want to build a reward signal that actually captures what you care about,
00:12:38.320 | where gaming it is more difficult than not gaming it. If the model can learn to do
00:12:45.840 | the task directly, just by doing what you want it to do in the spirit of the task, then that is
00:12:53.680 | what will happen. It will flow along the path of least resistance. Models just want to learn,
00:12:57.680 | but they want to learn to do better on reward signals. And so your reward signals have to point
00:13:02.000 | in the direction of the thing you actually care about. Otherwise, models will find cheats.
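
(A toy illustration of that point, with hypothetical reward functions: a reward built on a loose proxy is easier to game than to satisfy, while one that verifies the actual behavior flips that balance.)

    def proxy_reward(completion: str) -> float:
        # Gameable: checks whether the model *claims* success, so the path of
        # least resistance is to emit the magic string without doing the task.
        return 1.0 if "All tests passed" in completion else 0.0

    def verified_reward(patch: str, run_test_suite) -> float:
        # Harder to hack: apply the model's patch and run the original,
        # untouched test suite in a sandbox. run_test_suite is a hypothetical
        # callable that does that and returns the fraction of tests passing.
        return run_test_suite(patch)
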
00:13:07.360 | And I think thinking about these things in combination points a little bit towards
00:13:13.280 | a direction that I think is going to be very promising. And there are some very early signs
00:13:16.480 | that this actually can work. When R1 came out, I was kind of speculating:
00:13:22.720 | what's next? What are the things that are going to unlock this sort of technique being used more
00:13:28.560 | generally? People talk a lot about generator-verifier gaps: what are the
00:13:34.080 | differences between solving a problem versus checking if you have a solution? A lot of
00:13:37.760 | problems are much easier to check than to solve, but this isn't a binary thing.
00:13:42.240 | It's a spectrum of how difficult it is to verify a thing. But there are some signs that
00:13:48.640 | you can do evaluations on more ambiguous tasks by breaking them down into smaller pieces
00:13:56.160 | and by using LLMs as subroutines in your evaluations, like LLM-as-judge on steroids.
00:14:03.120 | Or maybe you want to actually train a specialized LLM that is really good at doing
00:14:06.800 | these fine-grained evaluations. I like using the term rubric as a conceptual umbrella around
00:14:12.160 | reward models, reward functions, LLM-as-judge setups: the criteria on which you are evaluating a
00:14:17.760 | thing. There's a cool paper from DeepSeek that I found very exciting when it came out a couple
00:14:21.760 | of months ago, about how to train reward models that generate these rubrics on the fly.
00:14:26.240 | There was a paper very recently that does this for creative writing, and it found that,
00:14:30.080 | yes, you actually can train reward models that will come up with nuanced, fine-grained
00:14:35.840 | evaluation criteria for a task on the fly, given the actual problem. And this gives you something that
00:14:41.520 | results in a very fine-grained score that allows you to actually do RL and keep getting better.
00:14:46.880 | And I think this is an area that I'm really excited to keep watching.
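
(A rough sketch of what a rubric-style judge reward can look like, assuming an OpenAI-compatible client; the judge model name and rubric here are illustrative, not the recipe from either paper.)

    from openai import OpenAI

    client = OpenAI()  # any OpenAI-compatible endpoint

    RUBRIC = (
        "Score the response from 0 to 10 on each criterion:\n"
        "1. Follows the constraints stated in the prompt\n"
        "2. Reasoning is coherent and grounded\n"
        "3. Style matches the request\n"
        "Reply with exactly three integers separated by spaces."
    )

    def rubric_reward(prompt: str, response: str) -> float:
        # LLM-as-judge on steroids: grade against explicit criteria so the
        # reward is a fine-grained score instead of a single pass/fail bit.
        judge = client.chat.completions.create(
            model="gpt-4.1-mini",  # assumption: any capable judge model
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Prompt:\n{prompt}\n\nResponse:\n{response}"},
            ],
        )
        scores = [int(s) for s in judge.choices[0].message.content.split()[:3]]
        return sum(scores) / 30.0  # normalize to [0, 1]
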
00:14:52.400 | But also like multi-turn. Multi-turn is probably where we're headed. We want to do agentic search,
00:14:57.520 | we want to do tool calls, software, games, long horizon planning, computer use, memory. Scaling on tool
00:15:03.040 | calls lets you solve harder problems. So how do we actually do this? What's the way to go about
00:15:08.960 | building multi-turn agentic systems that we can use RL with? I think the conceptual pieces
00:15:16.560 | here are: environments are basically harnesses, rewards are basically evals, tasks are just prompts,
00:15:22.000 | and your policy, in the RL sense, hopefully should be as simple as an LLM API.
00:15:27.280 | I think the programming interface that makes sense for a lot of people is to have an API
00:15:31.840 | that you're writing code as if it's just a normal agent in a loop. But then this is a thing that you
00:15:36.960 | can use to go do RL. And so that's what I've been building over the past couple of months. I maintain
00:15:42.000 | a repo called verifiers. It's finally out in the world on pip, you can just install it, but it's been a
00:15:49.520 | long time coming. And what it really is, is a toolkit of these pieces to make it so that building an agent
00:15:57.280 | that you can actually train with RL feels just like building an agent. So the interaction
00:16:01.760 | protocol here is quite simple. On the left is the entire rollout function, what happens in the
00:16:06.960 | code when you're running an agent to do RL: you set up some
00:16:11.920 | initial state, have a while loop asking, is it done yet, and if it's not done, do a turn. And the thing
00:16:17.520 | you're passing here is a client object that's just an OpenAI-compatible API. And I think this is the kind
00:16:24.000 | of interface that you really want if you want people to be able to go from their agent applications to
00:16:29.760 | something that's trainable, something that they can use with RL.
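
(In that spirit, a simplified sketch of the pattern; the env object and its reset/step/score methods are illustrative stand-ins, not the actual verifiers API.)

    def rollout(client, model: str, env, prompt: str, max_turns: int = 10):
        # The policy is just an OpenAI-compatible API; the environment owns
        # the state, the tools, and the notion of "done".
        messages, state = env.reset(prompt)        # set up some initial state
        for _ in range(max_turns):                 # while loop: is it done yet?
            response = client.chat.completions.create(model=model, messages=messages)
            messages, state, done = env.step(messages, state, response)  # do a turn
            if done:
                break
        return messages, env.score(messages, state)  # the reward is just an eval
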
00:16:34.080 | It's been a lot of fun thinking about what the abstractions are, what the pieces are here. There are things like
00:16:37.840 | parsers and rubrics that I think are nice building blocks that you sometimes want to use.
00:16:41.440 | You can also not use them if you don't want to, but I've tried to make it fun and user-friendly.
00:16:45.440 | The other day, I was like, let's train a Wordle agent. I think this was a fun little toy
00:16:50.240 | problem: it's not that hard of a game for us as humans, but
00:16:55.680 | it's actually kind of tricky to get your code to be this sort of thing where you have a
00:17:00.400 | multi-turn interaction protocol that you actually can do learning with. But now it's much easier;
00:17:05.680 | the code to do these things is quite simple. And the reward functions can be
00:17:10.880 | relatively simple for this sort of setup: okay, you want to reward it for
00:17:14.000 | solving the thing eventually, but also give it more reward for doing it in fewer turns.
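
(Roughly that shape, as a hypothetical sketch rather than the repo's actual reward code:)

    MAX_TURNS = 6

    def wordle_reward(solved: bool, turns_used: int) -> float:
        # Main signal: did the agent eventually solve the puzzle?
        if not solved:
            return 0.0
        # Bonus for finishing in fewer turns, so 2 guesses beats 6 guesses.
        return 1.0 + (MAX_TURNS - turns_used) / MAX_TURNS
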
00:17:19.760 | This is a 7B model, and it works reasonably well. But one of the reasons it works,
00:17:24.640 | which I'll talk about in a sec, is
00:17:26.720 | SFT warmup, as a way of lowering the barrier to entry. The code as it is
00:17:31.520 | is very much set up so that your environments for RL are also just synthetic data loops or evals,
00:17:36.240 | where you can plug in Claude or DeepSeek or OpenAI and test. So you don't have to do RL to debug;
00:17:42.160 | you can debug with an API in terms of seeing: is this a good eval?
00:17:46.080 | Is this a good reward? Once you're comfortable with it, you can use whatever
00:17:50.080 | API you like that you are allowed to use, make synthetic data, and do some SFT on it.
00:17:55.200 | And now you can start doing RL, and this helps a lot with small models.
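
(Sketching that loop, reusing the hypothetical rollout function from the sketch above: run a strong API model through the same environment, keep only the rollouts your reward function likes, and SFT the small model on those before starting RL.)

    def make_sft_warmup_data(client, model: str, env, prompts, threshold: float = 0.8):
        # The same environment doubles as an eval and a synthetic-data loop:
        # debug the reward with a strong API model before any RL training.
        dataset = []
        for prompt in prompts:
            messages, reward = rollout(client, model, env, prompt)
            if reward >= threshold:  # keep only high-reward trajectories
                dataset.append(messages)
        return dataset  # run SFT on this, then start RL from the warmed-up model
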
00:17:58.080 | There are a lot of efficiency challenges that I've been kind of hard at work
00:18:03.120 | trying to solve, in terms of having all of your computation be utilized effectively,
00:18:06.960 | having everything be fully async so you don't have to worry about batching,
00:18:10.400 | and having your trainer and your inference go at the same time.
00:18:14.080 | You can be a little bit off-policy. There's a lot of engineering here, where I'm hoping,
00:18:18.320 | if you want to worry about that, great, dig into it, fork the repo, mess with things.
00:18:23.120 | If you don't want to, you shouldn't have to. And like the idea here is that this should become
00:18:29.760 | something that more people are trying out, having fun with, exploring,
00:18:34.800 | and getting a feel for it. Because if it's going to be a thing we have to worry about, if this is the future of
00:18:41.200 | building better agent models for your applications, like now's a good time to start. And so this stuff
00:18:48.080 | is set up so you can do a lot of interesting research on a couple of GPUs. The barrier to entry
00:18:53.920 | is much lower now than it used to be. I have a lot of fun doing this on a couple of GPUs.
00:18:59.920 | We sell GPUs by the way. Thanks everybody. I don't think we have time for questions. But yeah.