How to Train Your Agent: Building Reliable Agents with RL — Kyle Corbitt, OpenPipe

Chapters
0:00 Introduction to building reliable agents with RL.
0:49 Case study: ART-E, an AI email assistant.
2:19 The importance of starting with prompted models before moving to RL.
3:17 Performance improvements of RL over prompted models.
5:18 Cost and latency benefits of the RL approach.
8:02 The two hardest problems in modern RL: realistic environments and reward functions.
13:13 Optimizing agent behavior with "extra rewards."
15:25 The problem of "reward hacking" and how to address it.
18:37 The solution to reward hacking.
Hey everyone, glad you're all here. This is the Reasoning and Reinforcement Learning track on the afternoon of the last day of the AI Engineer World's Fair, and I'm glad you're sharing it with us. Today I'm going to talk about a very specific case study we did: very concretely, the lessons learned, what did and didn't work, and how we were able to build an agent that worked well with reinforcement learning. Everything I'm talking about in this presentation is in an open-source code base that we built. We wanted to share these learnings, and I'll share the link at the end for those of you who want to replicate what we did.

So what is the project we're going to be talking about? It's called ART-E. It's a natural-language assistant that helps you answer questions from your email inbox. I'll give you an example of what we're talking about. Our example question is, "When is Sherry's move to Portland targeted for?" You ask this question to the assistant, and it goes and searches your inbox. It's got several tools: it has a search tool, it has a read-email tool, and then it can answer the final question. If you look here, you can see what's going on behind the scenes. This is important so you get a sense of how the agent works, and as we talk through how we built it and made it work, hopefully that keeps the conversation grounded in a specific task. So you can see the agent searching for certain keywords, getting those messages back, reading one of them, and answering the question. That's what it does.
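To make the harness concrete, here is a minimal sketch of what tool definitions like these could look like in the standard OpenAI-style function-calling schema. The specific names and parameters (search_inbox, read_email, return_final_answer, keywords, message_id) are illustrative assumptions, not necessarily what the actual ART-E code uses.

```python
# Hypothetical tool schema for an email-QA agent, in OpenAI-style
# function-calling format. Names and parameters are illustrative.
EMAIL_AGENT_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_inbox",
            "description": "Keyword-search the user's inbox; returns message ids and snippets.",
            "parameters": {
                "type": "object",
                "properties": {
                    "keywords": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["keywords"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_email",
            "description": "Return the full body of one email by id.",
            "parameters": {
                "type": "object",
                "properties": {"message_id": {"type": "string"}},
                "required": ["message_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "return_final_answer",
            "description": "Finish the task with an answer, or say the answer could not be found.",
            "parameters": {
                "type": "object",
                "properties": {"answer": {"type": "string"}},
                "required": ["answer"],
            },
        },
    },
]
```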
Okay, so once we decided this was the task we were trying to solve, why would you use reinforcement learning for it specifically? The answer is: to start with, you shouldn't. In fact, to start off with, we did not. The first version of this agent didn't use any reinforcement learning at all; we built it purely on prompted models. And that's the first lesson from this talk: I would generally recommend starting by getting the best performance you can out of a prompted model before doing any training, including reinforcement learning. There are three reasons to do that. The first is just working out the bugs in your environment. Maybe your tools aren't implemented properly, maybe they don't have access to the data you think they do. We find this happens a lot, and it's much less frustrating to debug that separately from debugging your training loop. So you want to make sure you can get at least some level of performance before you start training. Second, you may find as you try to improve performance with prompted models that you can get it working really well. That's great: it means you don't need to train anything, and it saves you a lot of time. The third reason is that once you've gone to that effort and done your best to get the strongest prompted baselines you possibly can, if you find those baselines can't get you where you need to go and you're then able to surpass them with reinforcement learning, it feels great. You get to gloat and say, yes, I beat the frontier models on my task. I highly recommend it. It feels good. You can post on X about it. There are nice graphs and stuff.
So this is what it looks like when everything goes right. This is an example of a training run for the ART-E model I'll be talking about. You can see there's a line for each of the prompted-model baselines we've got: o3, o4-mini, and then Gemini and GPT-4.1. They each sit at a certain level of performance. Then there's this moving line, which is the model we trained. It actually starts out significantly worse than the other models, because we started from Qwen 2.5 14B, a relatively small and relatively weak model, so initially it was doing much worse. But as training progresses, at the beginning it's perhaps learning the right way to do tool calls, so there's a very sharp bump as it figures out the basic stuff, and then a more gradual climb, until eventually it significantly outperforms any of the prompted models on this task. In the ideal case, when everything works, this is what you're looking for; this is what you're hoping to achieve.

This is another view of the same data we were just looking at. I wanted to highlight it this way because it's important to realize: on the last graph the lines look like they asymptote out pretty close together, but that's because they're getting near 100%. Our best prompted model here, o3, gets 90% accuracy, and with our RL model we're able to get up to 96%. One way to think about that: o3's error rate is 10% and ours is 4%, so roughly 60% of the errors o3 was making are solved by our model. We find that can be very important for the user experience of someone using one of these products. If you're getting half as many errors, that can make the product much stronger.
So that's where we got to on accuracy. There are a couple of other metrics that we find are often very important. The tradeoff between them is very task dependent, but they matter in many cases. Cost, obviously, is a big one. For this email agentic harness, we benchmarked the cost on o3, o4-mini, and our model. If you wanted to do 1,000 searches using o3, that's going to cost about $55, which is a lot; for most use cases that would probably be cost prohibitive just from a unit-economics point of view. On o4-mini we're down to about $8, but that's still quite expensive. Then we drop another order of magnitude by moving to the smaller Qwen 2.5 14B. Again, that's driven by it being a much smaller model, so it's much cheaper to run, but we still get very good performance because we've specialized it on our task.

Beyond cost and accuracy, the third metric that often comes up is latency, particularly if you're doing anything with voice, but really any time there's real-time human interaction with the task, latency matters a lot. On this task we were able to get significantly better latency, and there are a number of ways, which I'll go into in more detail later, that we achieved that. One is simply that moving to a smaller model helps: there's less loading from memory and fewer matrix multiplies, so you get tokens out faster. We were also able to train the model to take fewer turns going back and forth with the email inbox and to be more efficient with its queries, and I'll go into that in a moment; that also leads to lower latency. There's a third thing, which we didn't apply here but can help a lot: speculative decoding. You can do that on large or small models, but it generally works better on smaller, task-specific models because you get higher acceptance rates on your speculator. So there are lots of reasons why smaller models work better here.
Okay, so the next question, for those of you who haven't done this yet: what is the effort required to actually achieve these results? If you'd asked me that a year ago, I would have said you should really only be doing this if you're a big company willing to put months of work into a project. In this case, this training run cost us about $80 in GPU time, and it took about a week of engineering time to build, with the caveat that it was an engineer who was familiar with this domain and had quite a lot of experience with machine learning and RL. But I expect that as we collectively figure out the right patterns as an industry, this will keep dropping, and the payback period to get a return on investment from these specialized models is going to keep falling as well. Part of the reason I wanted to give this talk is to distribute the knowledge we learned and hopefully move faster toward a world where this is just a thing everyone knows how to do, and it's very easy and very fast. So that's what we'll be talking about for the rest of the time: more of the lessons we learned.
Okay. When you're using RL to train an agent, or really using RL for anything else, I find that across the different problems we look at, there are two hard problems that come up every single time. The first is building a realistic environment. If you're training an agent, you need to train it with realistic data: realistic inputs and outputs, the tools it will have available, everything matching how it's going to be used in production. If you don't, it will be optimizing for the wrong thing and you won't get the results you want when you deploy it. The second problem, which is sometimes hard and sometimes isn't — it's a bit task dependent — is getting the right reward function. The reward function just means that when your agent has gone through and, in this case, given an answer about my email, you have to have some way of knowing whether it did a good job or a bad job. That's the reward function: it's how you decide whether the output is good or bad. Depending on the domain, sometimes that's really easy. Nathan, who's talking next, and his team put together this idea called RLVR, and in some verifiable domains it's actually very easy to compute a reward. But not all domains are like that; often it's kind of hard, so it's somewhat task dependent. I'm going to go through how we solved both of these problems for ART-E.
Okay, first one: the realistic environment. For the ART-E task, what environment does this agent need to operate in? It needs these tools available, it needs to be able to query an email inbox and get emails back, and those emails need to look realistic. The inbox should be large, because that's what most email inboxes are like; the emails in it should be diverse; and they have to look like real emails. That could be hard, because you can't just ask a thousand people to hand over their personal email to train on. Luckily, in this case we were able to solve it with the help of a company that has contributed a lot to the open data ecosystem — quite an iconic company, perhaps I would call it a historic company. I am, of course, talking about Enron. I'm hearing some laughter. Anyway, Enron was a financialized energy company in the '90s and 2000s that committed massive fraud and ended up getting shut down by the Department of Justice. As part of that process and the court cases that followed, a dump of about 500,000 of their emails was released to the public during discovery. That's great for things like this, and it's what we used as the environment for the email inboxes.
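As an illustration of the kind of environment setup this implies — not the actual ART-E code — here's a minimal sketch that loads raw emails into SQLite with a full-text index so a search tool has something to query. The table layout, the field names, and the choice of SQLite FTS5 are all assumptions for the sake of the example.

```python
import sqlite3

def build_inbox_index(db_path: str, emails: list[dict]) -> None:
    """Index raw emails (dicts with id/date/subject/body) into a
    SQLite FTS5 table so the agent's search tool can keyword-query them."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS emails "
        "USING fts5(message_id, date, subject, body)"
    )
    conn.executemany(
        "INSERT INTO emails (message_id, date, subject, body) VALUES (?, ?, ?, ?)",
        [(e["id"], e["date"], e["subject"], e["body"]) for e in emails],
    )
    conn.commit()
    conn.close()

def search_inbox(db_path: str, keywords: list[str], limit: int = 10) -> list[tuple]:
    """Return (message_id, subject) rows whose text matches all keywords."""
    conn = sqlite3.connect(db_path)
    query = " AND ".join(f'"{k}"' for k in keywords)  # FTS5 phrase terms joined with AND
    rows = conn.execute(
        "SELECT message_id, subject FROM emails WHERE emails MATCH ? LIMIT ?",
        (query, limit),
    ).fetchall()
    conn.close()
    return rows
```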
All right, so now we've got realistic email inboxes with tens of thousands of real emails going back and forth. Next we have to design the reward function. As the agent runs — we ask it questions and it gives us answers — we have to know whether each answer is correct, so we can reward it when it gets the answer right and it can learn to do that better. There are different ways to do this, and this part is very task dependent. The way we went about it in this case was to turn it into more of a verifiable problem. We took the email inbox and sort of inverted the problem: we grabbed batches of 20 emails at a time from the inbox, gave them to Gemini 2.5 Pro, and said, hey, given this set of emails, give us a few questions that a user might realistically ask whose answers are found in these emails. So Gemini generated the questions, it generated the answers, and of course it knew which source emails they came from. There were some extra steps on top of that: a lot of the questions it came up with looked a little unrealistic, so we had a separate filtering step to find the subset that actually look like questions a real user might ask. We ended up with a list of a few thousand questions along with their verified answers.
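Here is a rough sketch of what that inversion step could look like. The prompt wording, the JSON output shape, and the assumed `call_llm(prompt) -> str` helper wrapping whatever client you use for Gemini 2.5 Pro are all illustrative, not the project's actual pipeline.

```python
import json

# Hypothetical prompt: ask a strong model to invert a batch of emails into QA pairs.
QUESTION_PROMPT = """You are given {n} emails from one person's inbox.
Write up to 3 questions the inbox owner might realistically ask an assistant,
where the answer is stated in one of these emails. Respond as a JSON list:
[{{"question": "...", "answer": "...", "source_message_id": "..."}}]

Emails:
{emails}"""

def generate_qa_pairs(email_batch: list[dict], call_llm) -> list[dict]:
    """Invert a batch of emails into (question, golden answer, source id) triples."""
    rendered = "\n\n".join(
        f"[{e['id']}] {e['subject']}\n{e['body']}" for e in email_batch
    )
    raw = call_llm(QUESTION_PROMPT.format(n=len(email_batch), emails=rendered))
    return json.loads(raw)  # in practice, validate and then filter for realism
```

A second pass, like the filtering step described in the talk, would then score each generated question for realism and keep only the convincing ones.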
At that point it becomes much more of a verified task, and the reward function becomes much easier, because we know what the correct answer should be. So the way we tell whether the agent did a good job is: we give the agent the question, let it go search the email inbox and try to find the right emails, and eventually it comes back with an answer. Then we can just use an LLM-as-judge — a very simple one — and say: here's the question, here's the golden answer we believe is right, here's the answer we got from our model; is it right or not? We did have to do a little iteration there, making sure the judge was well calibrated on what counts as correct. But by and large this worked pretty well and turned it into more of a verified task.
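A minimal sketch of such a judge, again with the prompt wording and the assumed `call_llm` helper as illustration rather than the project's actual code:

```python
JUDGE_PROMPT = """Question: {question}
Golden answer: {golden}
Model answer: {candidate}

Does the model answer convey the same fact as the golden answer?
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_correct(question: str, golden: str, candidate: str, call_llm) -> bool:
    """Binary correctness signal from an LLM judge; this is the core reward."""
    verdict = call_llm(
        JUDGE_PROMPT.format(question=question, golden=golden, candidate=candidate)
    )
    return verdict.strip().upper().startswith("CORRECT")
```

The calibration work mentioned in the talk mostly amounts to iterating on a prompt like this against a handful of hand-labeled examples until the judge's verdicts match your own.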
So that's how we solved the reward function problem: by turning this into something where we had more of a golden dataset. Once you've solved those problems — once you have your environment and your reward function defined — then basically you just run a loop over and over again: your agent goes through and tries to solve the problem, you figure out whether it did well or badly, and you reward it if it's good and penalize it if it's bad. That's it. You do this over and over, and hopefully, if you've got everything set up right, it learns what good looks like, it learns what bad looks like, and it starts doing it right. That's the curve we saw earlier, where you can see it getting better over time.
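In heavily simplified form, that loop looks something like the sketch below. The three callables passed in are placeholders: in practice you would plug in your agent harness, your reward function, and an RL trainer (for example a GRPO-style policy-gradient update) rather than this skeleton.

```python
import random

def train_loop(scenarios, run_agent, score_trajectory, apply_update,
               n_steps: int = 1000, group_size: int = 8):
    """Toy outline of the RL loop described in the talk.
    `run_agent(scenario)` rolls the agent out on one question,
    `score_trajectory(traj, scenario)` is the reward function, and
    `apply_update(rollouts, advantages)` is whatever trainer you use."""
    for _ in range(n_steps):
        scenario = random.choice(scenarios)          # question + inbox + golden answer
        group = [run_agent(scenario) for _ in range(group_size)]
        rewards = [score_trajectory(t, scenario) for t in group]
        # Group-relative idea: rollouts that beat the group average get
        # reinforced, those below it get discouraged.
        baseline = sum(rewards) / len(rewards)
        advantages = [r - baseline for r in rewards]
        apply_update(group, advantages)
```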
Okay, a few other interesting learnings from this project. One thing we found is that you can throw a lot of stuff into your reward function beyond just the primary thing you're trying to solve for. We ended up with something like eight different little things we gave extra credit for, and I'm going to share two of them here. The first is that we had it optimize for the number of turns: how many times it had to go back and forth querying the email inbox before it came up with the right answer. The most important thing, of course, is getting the answer right, but between two runs that both get it right, we would rather it took fewer turns, because that's fewer tokens, lower latency, lower cost — just a more efficient agent. You can see on this first graph that early on, while it was getting its feet wet and figuring out what worked, it spiked up to over six turns on average; it would go back and forth with the email inbox a bunch of times trying to find the right thing. But once it figured out how to use the tools efficiently — the right way to construct keywords and find the right email — it got very efficient, actually better than any of our prompted models on this metric of using fewer turns. And again, that was just because we gave it a little extra credit for using fewer turns, very small relative to the reward for getting the answer right, and it was able to optimize against that.

Another extra reward we gave it was to discourage it from hallucinating answers. Obviously the best outcome is the right answer, but if it can't find the right answer, it's much better in a situation like this for it to say "I don't know" than to make something up. So we penalized it: if the judge said the answer was wrong and the model had still tried to give an answer, that got a much lower reward than if it had just said, "I don't know, I can't solve this problem." As you can see, that worked quite well: compared to any of the prompted models, including o3, we ended up with a significantly lower hallucination rate, because it was part of our reward function. These are things that are just sort of extra credit, but we found you can throw in a bunch of them and the model can jointly optimize all of them at the same time, which is super powerful.
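To make that concrete, here is one way such a shaped reward could be written, combining the judge's correctness verdict with a small turn-count bonus and a hallucination penalty. The specific weights, the trajectory fields, and the `said_i_dont_know` flag are invented for illustration; the talk only says the extras were small relative to the correctness reward.

```python
def score_trajectory(trajectory, scenario, judge_correct, call_llm) -> float:
    """Shaped reward sketch: correctness dominates, with small extra-credit terms.
    Weights are illustrative, not the ones used in ART-E."""
    answer = trajectory["final_answer"]
    correct = judge_correct(scenario["question"], scenario["golden_answer"],
                            answer, call_llm)
    if correct:
        reward = 1.0
        # Small bonus for finishing in fewer tool-calling turns (capped at +0.1).
        reward += 0.1 * max(0.0, 1.0 - trajectory["num_turns"] / 10.0)
        return reward
    if trajectory["said_i_dont_know"]:
        return 0.0   # an honest "I don't know" is scored as neutral...
    return -1.0      # ...while a confidently wrong (hallucinated) answer is penalized
```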
Okay, I want to talk a little bit about reward hacking. It's something that comes up a lot when you're doing this, and it's kind of a fun thing to talk about. There's an iconic video some of you might have seen, released by OpenAI almost a decade ago at this point. They had an environment where they were trying to get a boat to complete a race, and instead of learning to complete the race, it learned that if it just went in a little circle that wasn't even part of the race course, it could rack up a bunch of points. So it started doing that over and over instead of actually finishing the race. This comes up a lot in reinforcement learning. It's basically the gap between what you actually want the model to do and what you can measure — what you're actually rewarding it for. Almost always, if you let one of these runs go long enough, it will figure out some way to exploit your measure and get a really high reward without actually solving the problem, and you need to watch for that.

I'll give a couple of examples. This graph is from another project, not this one. An engineer on our team was working on the game NYT Connections — some of you might know it: you get 16 words and you have to sort them into four groups of four. It's quite a challenging game, especially for language models, because it requires a lot of world knowledge and lateral thinking. So he was training a model to play it, and it wasn't figuring it out, wasn't figuring it out — and then, boom, around step 40 the reward just takes off, like, okay, we've figured out how to solve this. The engineer — he's here at the conference, he's great, you should talk to him afterward — said, hey, we solved NYT Connections. And it's like, okay, the graph looks good; let's look at what the model is actually doing. What it was actually doing was exploiting a bug in how we wrote the verification: if it just put every single word in every single category, it got a perfect score, because we weren't verifying that there were in fact only four words in each category.
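The fix for that particular hack is simply a stricter verifier. Something along these lines — my reconstruction, not the team's actual checker — closes the loophole by requiring the guess to be a true partition of the 16 words into four groups of four before any credit is given:

```python
def score_connections_guess(guess: dict[str, list[str]],
                            solution: dict[str, set[str]],
                            all_words: set[str]) -> float:
    """Reward only well-formed guesses: four groups of four that use each of the
    16 words exactly once. Then score the fraction of groups matched exactly."""
    groups = list(guess.values())
    used = [w for g in groups for w in g]
    well_formed = (
        len(groups) == 4
        and all(len(g) == 4 for g in groups)
        and len(used) == 16
        and set(used) == all_words          # every word used exactly once
    )
    if not well_formed:
        return 0.0
    matched = sum(set(g) in solution.values() for g in groups)
    return matched / 4.0
```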
Here's another example; this one is fun. I was training a model to produce really good titles for Hacker News — titles that would get a post upvoted. I had a reward model I'd trained on existing Hacker News posts and how many upvotes they got, and I was trying to train a model to produce new titles. It worked really well for a while, and subjectively too: I looked at a bunch of the generated titles, and for the first thousand steps or so it was learning things where, as someone who spends way too much time on Hacker News, I'd say, yeah, that does look like a good title, you're doing a good job. Then around step 1,200 the reward just jumps a bunch. Okay, it clearly figured something out; I didn't know what, but we should look at it. It turns out the model had figured out that it could completely ignore the content of the post and generate the same title for every single one: "Google lays off 80% of workforce." Literally every article got that title, and the reward model was like, yes, that is definitely going to get upvoted on Hacker News.

So, the way we deal with this: what we found is that it's really important to watch out for it, and solving it typically means modifying your reward function in some way to penalize the exploit. In that second example it was actually quite an easy fix once we identified it: add an extra LLM-as-judge that looks at the title and the content and asks, is there anything in the title that's not supported by the content? We added that on, and it worked great.
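That guard could look something like the following sketch; the prompt wording, the zero-out rule, and the `call_llm` helper are my own assumptions rather than the actual fix:

```python
SUPPORT_PROMPT = """Article content:
{content}

Proposed title:
{title}

Does the title make any claim that is not supported by the article content?
Answer with exactly one word: SUPPORTED or UNSUPPORTED."""

def title_reward(title: str, content: str, upvote_score: float, call_llm) -> float:
    """Zero out the learned upvote-predictor's score if a judge finds the title
    makes claims the article doesn't support."""
    verdict = call_llm(SUPPORT_PROMPT.format(content=content, title=title))
    if verdict.strip().upper().startswith("UNSUPPORTED"):
        return 0.0
    return upvote_score
```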
The important thing here is that you actually look at your rollouts and figure out what's really happening, rather than blindly trusting the reward function. Anyway, that's it; I'm almost out of time, so I'll put up a couple of QR codes for you. Everything in this presentation is in a much longer write-up I have of the whole project; it includes the code, the artifacts, and the datasets along the way, and you can check it out there. One more thing: we have an open-source project for training models with reinforcement learning, and we have an open Discord you can join if you're interested in this kind of thing. We're all in there, we answer questions, and there are lots of people from the community trying to build these things. So if you're interested in building with this, feel free to join, and I'm happy to chat there.