
How to Train Your Agent: Building Reliable Agents with RL — Kyle Corbitt, OpenPipe


Chapters

0:00 Introduction to building reliable agents with RL.
0:49 Case Study: ART-E, an AI email assistant.
2:19 The importance of starting with prompted models before moving to RL.
3:17 Performance improvements of RL over prompted models.
5:18 Cost and latency benefits of the RL approach.
8:02 The two hardest problems in modern RL: realistic environments and reward functions.
13:13 Optimizing agent behavior with "extra rewards."
15:25 The problem of "reward hacking" and how to address it.
18:37 The solution to reward hacking.

Transcript

Hey everyone, glad you're all here. This is the Reasoning and Reinforcement Learning Track on the afternoon of the last day of the AI Engineer World's Fair. Glad you're all here, glad you're sharing it with us. Today what I'm going to talk about is a very specific case study that we did.

In this case study I'm going to talk about lessons learned very concretely: what did and didn't work, and how we were able to build an agent that worked well with reinforcement learning. Everything I'm talking about in this presentation comes from an open source code base that we built.

We wanted to share these learnings, and I'll share that link with you at the end as well for those of you who want to replicate what we did. So what is the project we're going to be talking about? It's a project called ART-E. It is a natural language assistant that helps you answer questions from your email inbox.

So I'll give you an example of what we're talking about here. In this case our example question is, "When is Sherry's move to Portland targeted for?" So you would ask this question to the assistant, and it then goes and searches your inbox.

It's got several tools: it has a search tool, it has a read email tool, and then it can actually answer the final question. You can kind of see, if you look here, what's going on behind the scenes. This is important so you get a sense of how this agent works, and as we're talking through how we built it and how we made it work, hopefully that helps make the conversation very grounded in a specific task.
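
To make this concrete, here is a minimal sketch of what that tool surface could look like in the OpenAI function-calling format; the tool names and parameters are assumptions for illustration, not the actual ART-E definitions.

```python
# Hypothetical tool definitions in the OpenAI function-calling format.
# The names and parameters are illustrative, not the actual ART-E tools.
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_inbox",
            "description": "Search the inbox and return matching email ids and snippets.",
            "parameters": {
                "type": "object",
                "properties": {
                    "keywords": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["keywords"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_email",
            "description": "Read the full body of one email by id.",
            "parameters": {
                "type": "object",
                "properties": {"email_id": {"type": "string"}},
                "required": ["email_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "return_final_answer",
            "description": "Answer the user's question, citing the source email ids.",
            "parameters": {
                "type": "object",
                "properties": {
                    "answer": {"type": "string"},
                    "source_ids": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["answer", "source_ids"],
            },
        },
    },
]
```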

So anyway, you see the agent searching for certain keywords, it gets those messages back, it's then reading one of them and answering the question. That's what it does. Okay, so the question, once we've decided this is the task we're trying to solve, is why would you use reinforcement learning for this specifically?

And the answer is, to start with, you shouldn't. In fact, to start off with we did not. So for the first version of this agent, once we decided we wanted to build this, we didn't use any reinforcement learning at all; we purely built this on prompted models. And the first lesson from this talk that I want to share is that I would generally always recommend starting with getting the best performance you can with a prompted model before going to any training, including reinforcement learning.

There are a few different reasons to do that, three specifically. The first one is just working out the bugs in your environment, right? You know, maybe your tools aren't implemented properly, maybe they don't have access to the data you think they do. We find this happens a lot, and it's a lot less frustrating to debug that separately from debugging your training loop.

So you want to make sure that you can get at least some kind of performance before you start training. And then second of all, you may find, as you're trying to improve performance with these prompted models, that you can get it working really well. And that's great.

So that means you don't need to train anything. And that saves you a lot of time. There's a third reason as well that I'll share, which is basically once you've gone to that effort, you've done your best to get the best quality prompted baselines you possibly can. Then if you find that those baselines are not able to get you where you need to go, and you're able to surpass them with reinforcement learning, it feels great.

You get to gloat and be like, yes, I was able to beat the frontier models on my task. I highly recommend it. It feels good. You can post on X about it. There's nice graphs and stuff. So this is what it looks like when everything goes right. This is an example of a training run for this ART-E model that I'm going to be talking about.

You can see that there are these lines for each of the prompted model baselines that we've got. So we've got o3, o4-mini, and then Gemini and GPT-4.1. And you can see those ones have a certain level of performance. And then you can see this sort of moving line that's going on.

This is the model that we trained. And you can see it actually starts out significantly worse than these other models from the start. That's because we started from Qwen 2.5, the 14-billion-parameter one. It's a relatively small, relatively weak model. And so it was doing much worse than these initially.

But you can see as training progresses, initially at the beginning it's maybe learning the right way to do tool calls. There's a very sharp bump as it figures out the basic stuff, and then a more gradual climb until eventually it's able to significantly outperform any of the prompted models on this task.

And in the ideal case, when everything works, this is what you're looking for. This is what you're hoping to achieve. This is another view, actually, of that same data we were just looking at. I wanted to highlight it in this way because it's important to realize what's going on.

On the last graph, it looked like the lines sort of asymptote out pretty close together. That's because they're getting near 100%. But you can see, for example, with our best prompted model here, o3, it's at 90% accuracy. And with our RL model, we're able to get up to 96%.

And so one way to think about that is that going from a 10% error rate down to 4% means about 60% of the errors that o3 was making are actually solved with our model, which is quite a large difference. We find that can be very, very important for the user experience of someone using one of these.

If you're getting just half as many errors, that can make the product much stronger. So this is where we got to on accuracy. There are a couple of other metrics that we find are often very, very important. The tradeoff between these is very task dependent, but they matter in many cases.

Cost, obviously, is a big one. So for this email agentic harness that we had, we benchmarked the cost on o3, o4-mini, and our model. If you wanted to do 1,000 searches using o3, that's going to cost $55, which is a lot. I think for most use cases, that would probably be cost prohibitive just from a unit economics point of view.

On o4-mini, we're down to $8, but that's still quite expensive. And then we drop another order of magnitude by moving to this smaller Qwen 2.5 14B. Again, this is just driven by it being a much smaller model, so it's much cheaper to run. But we're still able to get very good performance because we've specialized it on our task.

Beyond cost and accuracy, the third metric that often comes up is latency, particularly if you're doing anything with voice, but really if there's any real-time human interaction with the task, latency is going to matter a lot. And we were able to get significantly better latency on this task.

There are a number of different ways, which I'll go into in more detail later, that we were able to achieve this. One was just, again, that moving to a smaller model helps. There's less loading from memory, fewer matrix multiplies; you're just able to get tokens out faster. We were also able to train this model to have fewer turns going back and forth with the database, with the actual list of emails.

We were able to train it to be more efficient with its queries, and I'll go into that in a moment. And so that leads to lower latency. There's actually a third thing, which we didn't apply here, but which can help a lot with these smaller models, which is called speculative decoding.

That's something you can do on large or small models, but it generally works better on smaller task-specific models because you get higher acceptance rates on your speculator. But basically, there are lots of reasons why smaller models work better. Okay, so then the next question, for those of you who haven't done this yet, is: what is the effort required to actually achieve these results?

If you'd asked me this question a year ago, I would have said, "Hey, you should really only be doing this if you're a big company and willing to put months of work into a project." I think that's changing. I honestly do. In this case, this training run cost us about $80 in GPU time.

It did take about a week of engineering time to build this. And caveat: that was with an engineer who was familiar with this domain and had quite a lot of experience with machine learning and RL. But I actually expect, as we figure out the right patterns here collectively as an industry, this will keep dropping.

And I expect that the payback period to get a return on investment from these specialized models is going to continue falling as well. And part of the reason I wanted to give this talk is to distribute the knowledge we learned and hopefully move faster towards that world where this is just a thing everyone knows how to do, and it's very easy and very fast.

So that's what we'll be talking about for the rest of the time: some more of the lessons we learned. Okay, so when you are using RL to train an agent, or really using RL for anything else, I find that consistently, across the different problems we look at, there are two hard problems that come up every single time.

And the two hard problems are, first of all, figuring out a realistic environment. So if you're training an agent, you need to be training it with realistic data, realistic inputs and outputs, the tools it will have available, everything matching how it's going to be used in production. Because if you don't, then it's going to be optimizing for the wrong thing and you won't get the results you want when you deploy it.

And then the second thing, which sometimes is hard and sometimes isn't, this one is a little bit task dependent, is getting the right reward function. A reward function just means that when your agent has gone through and, in this case, given an answer about my email, you have to have some way of knowing whether it did a good job or a bad job.

That's the reward function; it's how you decide if the result is good or bad. Depending on the domain, sometimes that's really easy. I don't know if Nathan's here, he's going to be talking next, but he and his team put together this thing called RLVR, and in some verifiable domains it's actually very easy to compute a reward.

Oftentimes, not all domains are like that. Oftentimes, it is kind of hard. And so it's somewhat task dependent. I'm going to go through how we solved these problems specifically with ART-E. Okay, first one, realistic environment. So for our ART-E task, what is the environment we need? What's the environment this agent's going to be operating in?

Well, it needs these tools available, it needs to be able to go and query an email inbox, and it needs to be able to get emails back that look realistic. The inbox should be large, because that's what most email inboxes are like. The emails in it should be diverse, and they have to look kind of like real emails.

So this could be kind of hard, because you can't just go ask a thousand people to give you their personal emails to train on. Luckily, in this case, we were able to solve this with the help of a company that has contributed a lot to the open data ecosystem.

It's quite an iconic company, perhaps I would call it a historic company. I'm, of course, talking about Enron. I'm hearing some laughter. So anyway, Enron was a financialized energy company in the 90s and 2000s that committed massive fraud and ended up getting shut down by the Department of Justice.

As part of that process, the court cases they were going through, a dump of about 500,000 of their emails was released to the public as part of the discovery process. That's great for things like this, and that's what we used as our environment for the email inboxes.
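
As a rough illustration of how you might stand up that kind of environment, here's a sketch that loads emails into a SQLite full-text index so a search tool has something realistic to query; the schema, table name, and helper names are assumptions, not the actual ART-E setup.

```python
# Sketch: a searchable inbox over a pile of emails using SQLite FTS5.
# The schema and loading step are illustrative assumptions.
import sqlite3

conn = sqlite3.connect("inbox.db")
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS emails USING fts5(id, sender, subject, body)"
)

def insert_email(id: str, sender: str, subject: str, body: str) -> None:
    """Load one email into the full-text index."""
    conn.execute("INSERT INTO emails VALUES (?, ?, ?, ?)", (id, sender, subject, body))
    conn.commit()

def search_inbox(query: str, limit: int = 10) -> list[tuple[str, str]]:
    """Return (id, subject) pairs for emails matching a keyword query."""
    rows = conn.execute(
        "SELECT id, subject FROM emails WHERE emails MATCH ? LIMIT ?", (query, limit)
    )
    return rows.fetchall()

# Usage (with made-up data):
#   insert_email("1", "sherry@example.com", "Portland move", "We're targeting March...")
#   search_inbox("Portland")
```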

All right, so now we've got realistic email inboxes with tens of thousands of real emails going back and forth. Now we have to design our reward function. As we're asking our agent questions and it's giving us answers, we have to know whether the answer is correct or not, so we can reward it when it gets the answer right, and it can learn to do that better.

There are different ways to do this, and this part is very task dependent. The way that we went about it in this case was to basically turn it into more of a verifiable problem. And the way we did that was, we actually took our email inbox and sort of inverted the problem: we grabbed batches of 20 emails at a time from the inbox, gave them to Gemini 2.5 Pro, and said, hey, given this set of emails, give us a few questions that a user might realistically ask whose answers are found in these emails.

And so Gemini generated the questions, it generated the answers, and of course the source emails they came from. And there were some extra steps on top of that: a lot of the questions it came up with looked a little bit unrealistic, so we had a separate filtering step where we said, okay, let's find the subset of these that actually look like questions that, you know, I would maybe ask.
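
As a rough sketch of that question-generation step, here's what it could look like assuming an OpenAI-compatible client; the prompt wording, JSON shape, and model name are assumptions, and the real pipeline used Gemini 2.5 Pro plus that separate realism filter.

```python
# Sketch: turn batches of emails into synthetic question/answer pairs.
# The client, model name, prompt, and JSON shape are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint for the generator model

def generate_qa(batch_of_emails: list[dict]) -> list[dict]:
    """Ask a strong model for realistic questions answerable from these emails."""
    prompt = (
        "Given the following emails, write a few questions a user might realistically "
        "ask whose answers are contained in them. Respond with JSON like "
        '[{"question": "...", "answer": "...", "source_ids": ["..."]}].\n\n'
        + json.dumps(batch_of_emails)
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",  # stand-in generator model
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```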

And we ended up with a list of a few thousand questions, along with their verified answers. At this point it becomes much more of a verifiable task, and the reward function becomes much easier, because we know what the correct answer should be. So the way we can tell if our agent did a good job is we give our agent the question, we let it go and search the email inbox and try to find the right emails and everything, and eventually it comes back with an answer.

And then we can just use an LLM as judge, a very simple one, and say like, hey, you know, here's the question, here's the golden answer that we believe is right, here's the answer we got from our model, is it right or not. We did have to do a little bit of iteration there, making sure that the judge was well calibrated on what counts as correct or not.
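
A correspondingly simple judge can be sketched like this; the prompt wording, the strict YES/NO output, and the model choice are assumptions for illustration, not the calibrated judge from the project.

```python
# Sketch of a simple LLM-as-judge reward: 1.0 if the agent's answer matches the
# golden answer, else 0.0. Prompt wording and model choice are assumptions.
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, golden_answer: str, model_answer: str) -> float:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {golden_answer}\n"
        f"Candidate answer: {model_answer}\n"
        "Does the candidate answer convey the same facts as the reference? "
        "Reply with exactly YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",  # stand-in judge model
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return 1.0 if verdict.startswith("YES") else 0.0
```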

But by and large, this worked pretty well and was able to make this more of a verifiable task. So that's how we solved the reward function problem: by turning this into something where we had more of a golden data set. Okay, once you've solved those problems, once you have your environment and your reward function defined, then basically you just have to run a loop over and over and over again, where your agent goes through and tries to solve the problem, then you figure out if it did well or badly, and then you reward it if it's good and punish it if it's bad, and that's it.

And you do this over and over and over again, and then hopefully, if you've got everything set up right, it learns what good looks like, it learns what bad looks like, and it starts doing it right. And again, this is the curve we saw earlier, where you can see it starts getting better over time.
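
At a high level, that loop can be sketched like this. It's a conceptual, GRPO-flavored outline where the helpers (run_agent, reward_fn, update_policy) are hypothetical stand-ins passed by the caller; it is not the actual ART API or training code.

```python
# Conceptual GRPO-flavored training loop with hypothetical helpers.
import random

def train(questions, run_agent, reward_fn, update_policy,
          num_steps: int = 500, batch_size: int = 16, group_size: int = 8):
    for step in range(num_steps):
        batch = random.sample(questions, batch_size)
        groups = []
        for q in batch:
            # Let the agent attempt the same question several times...
            rollouts = [run_agent(q) for _ in range(group_size)]
            # ...then score each attempt with the reward function.
            rewards = [reward_fn(q, r) for r in rollouts]
            groups.append((rollouts, rewards))
        # Reinforce trajectories that beat their group's average reward,
        # and push down those that fall below it.
        update_policy(groups)
```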

Okay, a few other interesting learnings from this project. One thing we found is that you can actually throw a lot of stuff into your reward function beyond just the primary thing you're trying to solve for. We ended up with something like eight different little things that we gave extra credit for.

I'm going to share two of them here. The first one is that we're trying to have it optimize for the number of turns: how many times it had to go back and forth querying the email inbox before it came up with the right answer.

Because the most important thing, of course, is getting the answer right. But between two answers that both get it right, we would rather it took fewer turns back and forth, because that's fewer tokens, lower latency, lower cost; it's just a more efficient agent. So you can see here on this first graph that early on, while it was getting its feet wet and figuring out what worked, it ended up spiking up to over six turns on average.

So it would go back and forth a bunch of times with the email inbox trying to find the right thing. But once it was able to figure out how to use the tools efficiently, figure out the right way to construct keywords and find the right email, it got very efficient, actually better than any of our prompted models on this metric of using fewer turns.

And again, this was just because we gave it a little bit of extra credit for using fewer turns. It was a very small amount relative to the reward for getting the answer right, but it was able to use that to optimize for fewer turns.

Another extra reward function we gave it is to try and discourage it from hallucinating answers. So obviously, the best thing is to get the right answer. If you can't find the right answer, it's much better to say, hey, I don't know than to make up an answer in a situation like this.

So we basically penalized it if the reward model said it got the answer wrong but it had still tried to give an answer; that was a much lower reward than if it just said, hey, I don't know, I can't solve this problem. And as you can see, that worked quite well: compared to any of the prompted models, including o3, we ended up with a significantly lower hallucination rate, because that was part of our reward function.
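
To make that concrete, a shaped reward along these lines might look roughly like the sketch below; the specific weights, the turn bonus formula, and the rollout fields are illustrative assumptions, not the actual ART-E reward.

```python
# Sketch of a shaped reward: correctness dominates, with a small bonus for
# finishing in fewer turns and a penalty for confidently wrong answers.
# All weights and rollout fields are illustrative assumptions.
def shaped_reward(rollout) -> float:
    reward = 0.0
    if rollout.answered_correctly:        # as judged against the golden answer
        reward += 1.0
        # Small extra credit for wrapping up in fewer turns (cheaper, lower latency).
        reward += 0.1 * max(0.0, 1.0 - rollout.num_turns / 10.0)
    elif rollout.said_i_dont_know:
        reward += 0.2                     # declining beats guessing...
    else:
        reward -= 0.5                     # ...and a hallucinated answer is worst.
    return reward
```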

Again, these are things that are just sort of extra credit. But we found that you can throw in a bunch of these, and it can jointly optimize all of them at the same time, which is super powerful. Okay, I want to talk a little bit about reward hacking.

It's something that comes up a lot when you're trying to do this, and it's kind of a fun thing to talk about. This is an iconic video some of you might have seen. It was released by OpenAI almost a decade ago at this point. They had this environment where you were trying to get a boat to complete a race.

And instead of learning to complete the race, it learned that, oh, if I just go in this little circle that's not even part of the race track, I can just get a bunch of points. And so it started doing that over and over and over again, instead of actually finishing the race.

This is something that comes up a lot if you're doing reinforcement learning. And it's basically just the gap between what you actually want the model to do and what you can measure, what you're actually rewarding it for. Almost always, if you let one of these runs go long enough, it will figure out some way to exploit your measure.

It will figure out some way to get a really high reward without actually solving the problem, and you need to watch for that. So I'm going to give a couple of examples here. This is a graph from another project, actually, not this one. An engineer on our team was working on this game called NYT Connections.

Some of you might know it: you get 16 words, and you have to put them into four groups of four. It's quite a challenging game, especially for these language models, because it requires a lot of world knowledge and lateral thinking. Anyway, they were trying to train this model to do it.

And it wasn't figuring it out, it just wasn't figuring it out. And then boom, you can see here around step 40, it just takes off, and it's like, okay, we figured out how to solve this. And this engineer on our team, I'm going to call him out.

He's here at the conference. Yeah, he's great. You should talk to him after. But he was like, hey, we solved it, we got NYT Connections. And it's like, okay, the graph looks good, but let's look at what it's actually doing. What it was actually doing is it had figured out there was a bug in how we wrote the verification.

If it just put every single word in every single category, it was able to get a perfect score, because we weren't verifying that there were in fact only four words in each category. So this is another example, a fun one. I was training a model to produce really good titles for Hacker News, titles that would get a post upvoted.

I had this reward model I'd trained on existing Hacker News articles and how many upvotes they got, and I was trying to train this model to produce new titles. And it was working really well for a while, as you can see, and sort of subjectively as well: I looked at a bunch of the generated titles, and for the first thousand steps or so it was actually learning things where I was like, okay, as someone who spends way too much time on Hacker News, yeah, that does look like a good title, you're doing a good job.

And then you can see around step 1200 here, it just jumps a bunch, right? It's like, okay, it clearly figured something out. I don't know what it figured out, but we should look at that. And what it turns out the model had figured out was that it could just completely ignore the content of the post and generate the same title for every single one of them.

And that would maximize its score. So it generated this title, "Google lays off 80% of workforce," for literally every single article, and the reward model was like, yes, that is going to get upvoted on Hacker News for sure. Which it probably would, to be fair.

So anyway, the way we solve this, what we found, is that it's really important to watch out for this. Solving it typically involves modifying your reward function in some way to penalize things like that. In the second example I talked about, it was actually quite an easy fix once we identified it: just add an extra LLM-as-judge that looked at the title, looked at the content, and said, hey, is there anything in the title that's not supported by the content?
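
That check can be sketched as one more judged term in the reward, something like the following; the prompt wording, model choice, and the hard zero-out are assumptions for illustration.

```python
# Sketch: an extra LLM-as-judge check that zeroes out the learned upvote score
# whenever the generated title claims something the article doesn't support.
# Prompt wording, model choice, and the hard zero are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def supported_by_content(title: str, article_body: str) -> bool:
    prompt = (
        f"Title: {title}\n\nArticle:\n{article_body}\n\n"
        "Does the title claim anything that is not supported by the article? "
        "Reply with exactly YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",  # stand-in judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("NO")

def title_reward(title: str, article_body: str, predicted_upvote_score: float) -> float:
    return predicted_upvote_score if supported_by_content(title, article_body) else 0.0
```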

And we added that on and it actually worked great. The important thing here is you want to be looking at your rollouts, not just blindly trusting the reward function, and figuring out what's actually happening. Anyway, that's it. I'm almost out of time, so I'm going to stop with a couple of QR codes for you.

Everything in this presentation, and a much longer write-up I have of this whole project, is there: it includes the code, the artifacts, and the data sets along the way. You can check that out there. One more thing: we have an open Discord, and we have an open source project for training reinforcement learning models.

There's a Discord you can go to if you're interested in this kind of thing. We're all in there, we answer questions, and there are lots of people from the community trying to do these things. So if you're interested in building things with this, feel free to join.

And yeah, happy to chat there. And yes, thank you everyone. Appreciate your time.