
How to Train Your Agent: Building Reliable Agents with RL — Kyle Corbitt, OpenPipe


Chapters

0:00 Introduction to building reliable agents with RL.
0:49 Case Study: ART-E, an AI email assistant.
2:19 The importance of starting with prompted models before moving to RL.
3:17 Performance improvements of RL over prompted models.
5:18 Cost and latency benefits of the RL approach.
8:02 The two hardest problems in modern RL: realistic environments and reward functions.
13:13 Optimizing agent behavior with "extra rewards."
15:25 The problem of "reward hacking" and how to address it.
18:37 The solution to reward hacking.

Whisper Transcript

00:00:00.000 | Hey everyone, glad you're all here. This is the Reasoning and Reinforcement Learning Track
00:00:19.920 | on the afternoon of the last day of the AI Engineer World's Fair. Glad you're all here,
00:00:24.480 | glad you're sharing it with us. Today what I'm going to talk about is a very specific case study
00:00:29.880 | that we did. This case study I'm going to talk about lessons learned very concretely,
00:00:33.960 | what did and didn't work, how we were able to build an agent that worked well with
00:00:36.900 | Reinforcement Learning, all of this, everything that I'm talking about in this presentation. This
00:00:41.160 | is an open source code base that we built. We wanted to share these learnings and I'll share
00:00:46.680 | that link with you at the end as well for those of you who want to replicate what we did. So what is
00:00:52.080 | the project we're going to be talking about? It's a project called ART-E. It is a natural language
00:00:56.880 | assistant that helps you answer questions from your email inbox. So I'll give you an example of
00:01:02.640 | what we're talking about here. Let's say you want to ask, you know, in this case our example question
00:01:08.080 | is when is Sherry's move to Portland targeted for? So you would ask this question to the assistant,
00:01:12.000 | it then goes and it searches your inbox. It's got several tools, so it has like a search tool,
00:01:15.520 | it has a read email tool, and then it can actually answer the final question. You can kind of see,
00:01:19.920 | if you look here, what's going on behind the scenes. This is important so you get a sense of how this
00:01:24.240 | agent works, and as we're talking through how we built it, how we made it work, hopefully
00:01:28.160 | that helps make the conversation very grounded in a specific task. So anyway, you see the agent:
00:01:32.960 | it's searching for certain keywords, it gets those messages back, it's then reading one of
00:01:37.920 | them and answering the question. That's what it does.
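To make the shape of that loop concrete, here is a minimal sketch of what a tool-calling agent like this might look like. The tool names, the inbox object, and the call_model helper are assumptions based on the description in the talk, not the actual ART-E code.

    # Illustrative only: an email-QA agent loop with three tools, as described in the talk.
    # `inbox` and `call_model` are hypothetical helpers, not the real ART-E interfaces.
    import json

    TOOLS = [
        {"name": "search_inbox", "description": "Keyword search over the inbox; returns message ids and snippets."},
        {"name": "read_email", "description": "Return the full body of one email by message id."},
        {"name": "return_final_answer", "description": "Finish with an answer (or 'I don't know') plus source ids."},
    ]

    def run_agent(question, inbox, call_model, max_turns=10):
        """Let the model call tools until it returns a final answer or runs out of turns."""
        messages = [
            {"role": "system", "content": "Answer questions from the user's email inbox. Tools: " + json.dumps(TOOLS)},
            {"role": "user", "content": question},
        ]
        for _ in range(max_turns):
            tool_call = call_model(messages)               # e.g. {"name": "search_inbox", "args": {...}}
            if tool_call["name"] == "search_inbox":
                result = inbox.search(**tool_call["args"])
            elif tool_call["name"] == "read_email":
                result = inbox.read(**tool_call["args"])
            else:                                          # return_final_answer
                return tool_call["args"]                   # {"answer": "...", "sources": [...]}
            messages.append({"role": "tool", "name": tool_call["name"], "content": json.dumps(result)})
        return {"answer": "I don't know", "sources": []}   # give up rather than guess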
00:01:45.040 | Okay, so the question: once we've decided this is kind of the task we're trying to solve, why would you use reinforcement learning for this
00:01:50.000 | specifically? And the answer is: to start with, you shouldn't. In fact, to start off with, we did
00:01:55.920 | not. So the first version of this agent, once we decided we wanted to build this, we didn't use any
00:02:00.960 | reinforcement learning at all, we purely built this on prompted models. And this is the first lesson
00:02:05.280 | from this talk that I want to share is I would generally always recommend starting with getting
00:02:10.480 | the best performance you can with a prompted model before going to any training, including reinforcement
00:02:14.960 | learning. There are a few different reasons to do that, three specifically. The first one is just
00:02:20.480 | like working out the bugs in your environment, right? You know, maybe your tools aren't implemented
00:02:24.400 | properly, maybe they don't have access to the data you think they do. We find this happens a lot,
00:02:29.440 | and it's a lot less frustrating to debug that, you know, separately from debugging your training
00:02:34.400 | loop. So you want to make sure that like you can get at least some kind of performance before you start
00:02:38.080 | training. And then second of all, you may find, as you're trying to improve the performance
00:02:43.120 | using these prompted models that you can get it working really well. And that's great. So that means
00:02:47.760 | you don't need to train anything. And that saves you a lot of time. There's a third reason as well
00:02:52.640 | that I'll share, which is basically once you've gone to that effort, you've done your best to get
00:02:57.840 | the best quality prompted baselines you possibly can. Then if you find that those baselines are not able
00:03:04.240 | to get you where you need to go, and you're able to surpass them with reinforcement learning, it feels
00:03:07.840 | great. You get to gloat and be like, yes, I was able to beat the frontier models on my task.
00:03:11.440 | I highly recommend it. It feels good. You can post on X about it. There's nice graphs and stuff.
00:03:17.760 | So this is what it looks like when everything goes right. So this is an example of a training run
00:03:24.320 | for this ART-E model that I'm going to be talking about. You can see that there's these lines for each
00:03:29.440 | of the prompted model baselines that we've got. So we've got o3, o4-mini, and then Gemini and GPT-4.1.
00:03:35.360 | And you can see those ones, you know, they have a certain level of performance. And then you can see
00:03:40.080 | this sort of moving line that's going on. This is the model that we trained. And you can see it
00:03:45.520 | actually starts out significantly worse than these other models from the start. That's because we
00:03:49.520 | started from Qwen 2.5, the 14-billion-parameter one. It's a relatively small model, relatively weak
00:03:55.360 | model. And so it was doing much worse than these initially. But you can see as training progresses,
00:03:59.840 | you know, initially at the beginning, it's sort of maybe it's learning the right way to do tool calls.
00:04:05.360 | There's a very sharp bump as it figures out the basic stuff, and then a more gradual climb until
00:04:09.600 | eventually it's able to significantly outperform any of the prompted models on this task. And this is sort
00:04:15.040 | of what you're, you know, in the ideal case, when everything works, this is what you're looking for.
00:04:18.560 | This is what you're hoping to achieve. This is another view actually of that same data we were just
00:04:24.640 | looking at. I wanted to highlight it in this way, because it's important to realize. So on the
00:04:30.320 | last graph, it looked like the lines sort of asymptote out pretty close together. That's because they're
00:04:34.400 | getting near 100%. But you can see, for example, with our best prompted model here, o3,
00:04:40.560 | it's at 90% accuracy. And with our RL model, we're able to get up to 96%. And so one way to think about
00:04:47.520 | that is that 60% of the errors that o3 was making are actually solved with our model, which is quite a
00:04:55.120 | large difference. You know, we find that can actually be very, very important for the user experience of
00:04:59.840 | someone using one of these. If you're getting, you know, just half as many errors, that can make the
00:05:04.320 | product much stronger. So this is where we got to on accuracy. There's a couple other metrics
00:05:10.480 | that we find are often very, very important. And you know, the tradeoff between these is very
00:05:16.800 | task dependent, but they matter in many cases. Cost, obviously, is a big one. So for this email,
00:05:23.680 | agentic harness that we had, we benchmarked the cost on o3, o4-mini, and our model. So if you wanted to
00:05:29.760 | do, like, 1,000 searches using o3, that's going to cost $55, which is a lot. I think for most use cases,
00:05:36.720 | that probably would be cost prohibitive, just from a unit economics point of view. On
00:05:40.400 | o4-mini, we're down to $8, but that's still quite expensive. And then we drop another order of
00:05:44.560 | magnitude by moving to this smaller Qwen 2.5 14B. Again, this is just driven by it being a much
00:05:49.840 | smaller model. So it's much cheaper to run. But we're still able to get very good performance
00:05:53.840 | because we've specialized it on our task. Beyond cost and the accuracy, the third metric that often
00:06:00.480 | comes up is latency, particularly if you're doing, I mean, certainly anything with voice. But if there's
00:06:05.520 | any real-time human interaction with the task, latency is going to matter a lot. And we were
00:06:11.040 | able to find on this task, we were able to get significantly better latency. There's a number
00:06:15.120 | of different ways, which I'll go into in more detail later, that we were able to achieve this.
00:06:18.320 | One was just, again, moving to a smaller model helps. There's just less loading from memory,
00:06:23.440 | less matrix multiplies. It's just you're able to get tokens out faster. We were also able to train this
00:06:28.080 | model to have fewer turns going back and forth with the database, with the actual list of emails. We were able to
00:06:35.440 | train it to be more efficient with its queries. And I'll go into that in a moment. And so that leads
00:06:40.160 | to lower latency. There's actually a third thing, which we didn't apply here, but can help a lot with
00:06:44.160 | these smaller models, which is called speculative decoding. That's something you can do on large or
00:06:47.760 | small models. It generally works better on smaller task-specific models because you get higher acceptance
00:06:52.800 | rates on your speculator. But basically, there's lots of reasons why smaller models work better.
00:06:56.720 | Okay, so then the next question, for those of you who haven't done this yet, is like, okay,
00:07:02.640 | what is the effort required to do this to actually achieve these results?
00:07:07.360 | If you'd asked me this question a year ago, I would say, "Hey, you should really only be doing this
00:07:11.360 | if you're this big company and willing to put months of work into a project."
00:07:15.360 | I think that's changing. I honestly do.
00:07:17.360 | In this case, so this training run, it cost us about $80 in GPU time. It did take about a week of
00:07:24.080 | engineering time to build this. And caveat that was with an engineer who is familiar with this domain and
00:07:28.560 | had quite a lot of experience with machine learning and RL. But I actually expect, as we figure out the
00:07:34.640 | right patterns here, collectively as an industry, this will keep dropping. And I expect that the sort of
00:07:40.080 | payback period to get a return on investment from these specialized models is actually going to continue
00:07:44.800 | falling as well. And part of the reason I wanted to give this talk is to sort of distribute
00:07:51.200 | the knowledge we learned and hopefully move faster towards that world where this is just sort of like
00:07:55.840 | a thing everyone knows how to do and it's very easy and very fast. So that's what we'll be talking
00:08:00.560 | about for the rest of the time is some more of the lessons we learned. Okay, so when you are using RL
00:08:08.880 | to train an agent or really using RL for anything else, I find that consistently with different problems
00:08:14.320 | we look at, there are sort of two hard problems that come up every single time, all right? And the two
00:08:19.200 | hard problems are, first of all, figuring out a realistic environment, right? So if you're training an
00:08:23.840 | agent, you need to be training it with realistic data, with realistic inputs and outputs, tools available,
00:08:29.520 | everything like that to how it's going to be used in production. Because if you don't, then it's going
00:08:34.160 | to be optimizing for the wrong thing and you won't get the results you want when you deploy it.
00:08:37.280 | And then the second thing, which sometimes is hard, sometimes isn't, this one is a little bit
00:08:42.080 | task dependent, is getting the right reward function. So reward function, that just means you have to be
00:08:46.880 | able to know, when your agent's gone through and, say in this case, given an answer to my email question,
00:08:50.960 | you have to have some way of knowing did it do a good job or a bad job, all right? That's the reward
00:08:54.880 | function; it's how you decide if it's good or it's bad. Depending on the domain,
00:09:00.560 | sometimes that's really easy. We have, I don't know if Nathan's here, he's going to be talking next,
00:09:04.000 | but you know, he and his team put together this thing called RLVR, which in some verifiable domains,
00:09:08.400 | it's actually very easy to do a reward. But not all domains are like that. Oftentimes, it is kind
00:09:14.720 | of hard. And so it's somewhat task dependent. I'm going to go through how we solve these problems,
00:09:19.360 | specifically with ART-E. Okay, first one, realistic environment. So for our ART-E task, what is the
00:09:25.280 | environment we need? What's the environment this agent's going to be operating in? Well, it needs
00:09:28.960 | these tools available, it needs to be able to go and query an email inbox, it needs to be able to like
00:09:32.320 | get emails back that look realistic. These emails, you know, the inbox should be large,
00:09:37.760 | because that's what most email inboxes are like. The emails in it should be diverse, and they have to
00:09:42.080 | look kind of like real emails. So this could be kind of hard, because you can't just go ask like
00:09:47.280 | a thousand people to, you know, give you their personal emails to train on. Luckily, in this case, we were
00:09:53.120 | able to solve this with the help of a company that has contributed a lot to just the open data ecosystem.
00:09:58.080 | Generally, it's quite an iconic company, perhaps I would call it a historic company. I'm,
00:10:03.520 | of course, talking about Enron. I'm hearing some laughter. So anyway, Enron was a
00:10:11.520 | financialized energy company in the '90s and 2000s. It committed massive fraud and ended up getting shut down by the
00:10:16.720 | Department of Justice. As part of this, you know, process, the court cases they were going
00:10:22.560 | through, a dump of like 500,000 of their emails was released to the public as part of the discovery
00:10:26.880 | process. So that's great for things like this. And that's what we used as our environment
00:10:31.840 | for the email inboxes.
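As a rough illustration of what that environment setup involves, here is a sketch of indexing a dump of raw emails into a local full-text-search table that a search tool can query. The parsing is simplified and the table layout is an assumption; this is not the actual ART-E environment code.

    # Illustrative environment setup: load raw RFC-822 emails (e.g. from the Enron dump)
    # into SQLite FTS5 so the agent's search tool has a realistic, large inbox to query.
    import sqlite3
    from email import message_from_string

    def build_inbox_index(raw_emails, db_path="inbox.db"):
        """raw_emails: iterable of raw email strings."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS emails USING fts5(msg_id, sender, recipients, date, subject, body)")
        for raw in raw_emails:
            msg = message_from_string(raw)
            body = msg.get_payload() if not msg.is_multipart() else ""   # simplified: skip multipart bodies
            con.execute(
                "INSERT INTO emails VALUES (?, ?, ?, ?, ?, ?)",
                (msg.get("Message-ID", ""), msg.get("From", ""), msg.get("To", ""),
                 msg.get("Date", ""), msg.get("Subject", ""), body),
            )
        con.commit()
        return con

    def search_inbox(con, keywords, limit=10):
        """Backs the agent's search tool: keyword match, return ids, subjects, and snippets."""
        rows = con.execute(
            "SELECT msg_id, subject, snippet(emails, 5, '', '', '...', 20) FROM emails WHERE emails MATCH ? LIMIT ?",
            (keywords, limit),
        )
        return [{"msg_id": r[0], "subject": r[1], "snippet": r[2]} for r in rows]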
00:10:37.440 | All right, so now we've got realistic email inboxes with tens of thousands of real emails back and forth. Now we have to design our reward function.
00:10:42.800 | So as our agent is going through, as we're asking it questions and it's giving us answers,
00:10:47.760 | we have to know is the answer correct or not, so we can reward it when it gets the answer right,
00:10:51.440 | and it can learn to do that better. There are different ways to do this, and this part is very task dependent.
00:10:56.960 | The way that we went about it in this case was we basically turned it into more of a verifiable
00:11:04.320 | problem. And the way we did that was, we actually took our email inbox, we sort of inverted the problem,
00:11:08.640 | we grabbed batches of 20 emails at a time from the inbox and gave them to Gemini 2.5 Pro,
00:11:15.680 | and said, hey, given this set of emails, give us a few questions that a user might realistically ask,
00:11:20.880 | where the answers are found in those emails, right? And so Gemini generated the questions,
00:11:25.280 | it generated the answers, and then, of course, the source emails they came from.
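A minimal sketch of what that synthetic-data step might look like. The call_gemini helper, the prompt wording, and the JSON shape are assumptions for illustration, not the exact prompts used in ART-E.

    # Illustrative synthetic-data generation: batch emails and ask a strong model for
    # realistic question/answer pairs grounded in those emails.
    import json

    PROMPT = """You are given {n} emails from one inbox. Write questions the inbox owner
    might realistically ask an assistant, where the answer is stated in one of these emails.
    Return JSON: [{{"question": ..., "answer": ..., "source_msg_ids": [...]}}]

    EMAILS:
    {emails}"""

    def generate_qa_pairs(email_batches, call_gemini):
        dataset = []
        for batch in email_batches:                      # e.g. 20 emails per batch
            prompt = PROMPT.format(n=len(batch), emails=json.dumps(batch))
            pairs = json.loads(call_gemini(prompt))      # assumes the model returns valid JSON
            dataset.extend(pairs)
        return dataset                                   # filter for realism in a later pass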
00:11:28.480 | And there were some extra steps on top of that: a lot of the questions it came up with looked a little
00:11:32.720 | bit unrealistic, so we had a separate filtering step, where we're like, okay, let's find the subset of these
00:11:36.560 | that actually look like questions that, you know, I would maybe ask. And we ended up with a list of
00:11:41.120 | a few thousand questions, along with their verified answers. And so at this point, it becomes much more
00:11:47.520 | of a sort of verified thing, the reward function becomes much easier, because we know what the correct
00:11:52.000 | answer should be. And so the way we can tell if our agent did a good job, is we give our agent the
00:11:56.720 | question, we let it go and search the email inbox, and try and find the right emails and everything,
00:12:00.000 | and eventually comes back with an answer. And then we can just use an LLM as judge, a very simple one,
00:12:04.560 | and say like, hey, you know, here's the question, here's the golden answer that we believe is right,
00:12:09.200 | here's the answer we got from our model, is it right or not. We did have to do a little bit of
00:12:14.320 | iteration there, making sure that the judge was well calibrated on what counts as correct or not.
00:12:20.000 | But by and large, this worked pretty well, and was able to make this more of a verified task.
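For concreteness, a judge of that kind can be as simple as the sketch below. The prompt and the call_judge helper are placeholders, not the exact judge used in the project.

    # Illustrative LLM-as-judge reward: compare the agent's final answer to the golden answer.
    JUDGE_PROMPT = """Question: {question}
    Reference answer: {golden}
    Model answer: {answer}
    Does the model answer convey the same information as the reference? Reply YES or NO."""

    def correctness_reward(question, golden, answer, call_judge):
        """1.0 if the judge says the answers match, else 0.0."""
        verdict = call_judge(JUDGE_PROMPT.format(question=question, golden=golden, answer=answer))
        return 1.0 if verdict.strip().upper().startswith("YES") else 0.0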
00:12:26.080 | So that's how we solved the reward function problem: by, you know, turning
00:12:30.080 | this into something where we had more of a golden data set. Okay, so once you've solved
00:12:35.200 | those problems, once you have your environment, once you have your reward function defined,
00:12:40.160 | then basically, you just kind of have to run a loop over and over and over again,
00:12:43.840 | where you have your agent go through and it tries to solve the problem, and then you figure out if it's
00:12:49.120 | good or it's bad, and then you just, you know, reward if it's good, and punish if it's bad, and
00:12:55.120 | that's it. And you do this over and over and over again, and then hopefully, if you've got everything
00:13:00.800 | set up right, it learns what good looks like, it learns what bad looks like, and it starts doing it
00:13:06.480 | right. And then again, this is the curve we saw earlier, where you can see it starts
00:13:11.520 | getting better over time.
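In pseudocode, the outer loop described here looks roughly like the sketch below. The rollout, reward_fn, and trainer objects are placeholders for your agent harness, reward function, and RL trainer (for example a GRPO-style update); this is not the actual ART training code.

    # Rough shape of the RL outer loop: sample a question, collect several attempts,
    # score each one, and push the policy toward the higher-scoring attempts.
    import random

    def train(qa_pairs, rollout, reward_fn, trainer, steps=1000, group_size=8):
        """qa_pairs: list of (question, golden_answer) tuples."""
        for _ in range(steps):
            question, golden = random.choice(qa_pairs)
            group = []
            for _ in range(group_size):                  # several rollouts per question
                trajectory = rollout(question)           # all tool calls plus the final answer
                score = reward_fn(question, golden, trajectory)
                group.append((trajectory, score))
            trainer.step(group)                          # reward good trajectories, punish bad ones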
00:13:18.800 | Okay, a few other interesting learnings from this project. One thing is, we found that you can actually throw a lot of stuff into your reward function, beyond just
00:13:24.960 | the primary thing you're trying to solve for. And so we actually ended up with, like, eight
00:13:29.600 | different little things that we gave extra credit for. I'm going to share two of them here. So the first one
00:13:34.960 | here is, we're trying to have it optimize for the number of turns: how many times back and forth,
00:13:40.480 | how many times it had to query the email inbox, before it came up with the right answer, right.
00:13:44.720 | So because the most important thing, of course, is getting the answer right. But between two answers
00:13:48.880 | that both get it right, we would rather it took fewer turns back and forth, because that's fewer tokens,
00:13:53.200 | that's lower latency, lower costs, it's just like a more efficient agent. So you can see here on this
00:13:59.280 | first graph that early on, while it was getting its feet wet and figuring out what worked, it ended up
00:14:04.480 | spiking up to over six turns on average. So it would go back and forth a bunch of times
00:14:08.480 | with the email inbox and try and find the right thing. But then once it was able to like, figure
00:14:13.120 | out how to use the tools efficiently, figure out like, you know, the right way to construct keywords
00:14:17.280 | and find the right email, it was able to get very efficient and actually do better than any of our
00:14:21.680 | prompted models on this metric of using fewer turns. And again, this was just because we gave it
00:14:26.000 | a little bit of extra. It was a very small amount relative to the reward for getting it right,
00:14:30.720 | but a little bit of extra credit for using fewer turns, and it was able to use that to optimize
00:14:36.880 | against that. Another extra reward function we gave it is to try and discourage it from hallucinating
00:14:42.960 | answers. So obviously, the best thing is to get the right answer. If you can't find the right answer,
00:14:48.400 | it's much better to say, hey, I don't know than to make up an answer in a situation like this. So we basically
00:14:54.160 | penalized it: if the reward model said, hey, you got the answer wrong, but it had tried to give an
00:15:00.320 | answer, that was a much lower reward than if it just said, hey, I don't know, I can't
00:15:05.120 | solve this problem. And as you can see, that worked quite well, compared to any of the prompted models,
00:15:08.880 | including o3, we ended up with a significantly lower hallucination rate, because that was part of our reward
00:15:14.400 | function. Again, these are things that are just sort of like extra credit. But we found that
00:15:19.040 | you can throw in a bunch of these, and it can jointly optimize all of them at the same time,
00:15:23.200 | which is super powerful.
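Putting the pieces from this section together, a composite reward along these lines would capture the behavior described: correctness dominates, a confidently wrong answer is penalized below an honest "I don't know", and there is a small bonus for finishing in fewer turns. The weights here are made up for illustration; ART-E used several more terms like these.

    # Illustrative composite reward with "extra credit" terms; the weights are assumptions.
    def total_reward(correct, answered, num_turns, max_turns=10):
        reward = 1.0 if correct else 0.0
        if answered and not correct:
            reward -= 1.0                                        # hallucinated answer: worse than "I don't know"
        reward += 0.1 * (max_turns - num_turns) / max_turns      # small bonus for using fewer turns
        return reward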
00:15:29.280 | Okay, I want to talk a little bit about reward hacking. It's something that comes up a lot when you're trying to do this, and it's kind of a fun thing to talk about.
00:15:32.320 | This is an iconic video some of you might have seen. This was released by OpenAI almost a decade ago at
00:15:37.920 | this point. They had this environment where you were trying to get this
00:15:43.280 | boat to complete a race. And instead of learning to complete the race, it learned that,
00:15:47.920 | oh, if I just go in this like little circle, that's not even part of the race track, I can like just get a
00:15:51.360 | bunch of points. And so it just started doing that over and over and over again, instead of
00:15:55.520 | actually following the course. This is something that comes up a lot if you're doing reinforcement learning. And
00:16:00.400 | it's basically just the difference between what you actually want the
00:16:06.720 | model to do and what you can measure, like what you're actually rewarding it for. And
00:16:11.440 | almost always, if you let one of these run long enough, it will figure out some way to exploit your
00:16:15.920 | measure. And it will figure out some way to get a really high reward without actually solving the
00:16:21.280 | problem. And you need to just watch for that. So I'm going to give a couple examples here. This
00:16:25.840 | is a graph from another project, actually, not this one. So an engineer on our team was working on this
00:16:32.080 | game called NYT Connections. Some of you might know, you get 16 words, and you have to put them in like
00:16:36.720 | four groups of four. It's quite a challenging game, especially for these language models, because it
00:16:41.040 | requires a lot of world knowledge and like, you know, lateral thinking anyway. So they were trying to
00:16:46.080 | train this model to do it. And it wasn't figuring it out, it just wasn't figuring it out. And
00:16:50.320 | then boom, you can see here around step 40, it just like takes off. And it's like, okay, we figured out
00:16:54.000 | how to solve this. And this engineer on our team, I'm going to call him out.
00:16:58.960 | He's here at the conference. Yeah, he's great. You should talk to him after. But he was like, hey, we solved it,
00:17:03.280 | like we got NYT Connections. And it's like, okay, the graph looks good. Let's look at what
00:17:07.280 | it's actually doing. What it was actually doing is it figured out there was a bug in how we wrote
00:17:12.320 | the verification. And if it just put every single word in every single category, it was able to get
00:17:16.720 | a perfect score, because we weren't verifying that there were, in fact, only four words in each category.
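A sketch of the kind of check that was missing, for illustration only; this is not the team's actual verifier.

    # Illustrative verifier: only score groupings that are legal Connections solutions,
    # i.e. four disjoint groups of four covering all sixteen words. The buggy version
    # skipped this check, so putting every word in every category scored perfectly.
    def connections_score(proposed_groups, answer_groups):
        words = [w for group in proposed_groups for w in group]
        legal = (
            len(proposed_groups) == 4
            and all(len(g) == 4 for g in proposed_groups)
            and len(set(words)) == 16                    # no word reused across groups
        )
        if not legal:
            return 0.0
        answer_sets = [set(g) for g in answer_groups]
        return sum(1 for g in proposed_groups if set(g) in answer_sets) / 4.0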
00:17:22.400 | So this is another example. This is a fun one. So I was training a model to produce really good
00:17:28.800 | titles for Hacker News, titles that would get a thing upvoted. So I had this reward model I'd trained
00:17:33.840 | on, like, existing Hacker News articles and how many upvotes they got. And I was trying to train
00:17:38.400 | this model to produce new titles. And it was working really well for a while, you can see and sort of
00:17:43.120 | subjectively as well, I looked at a bunch of these generated titles, and for these first, like,
00:17:47.280 | 1000 steps or so, it was actually learning things that I was like, okay, as someone who spends way too
00:17:52.080 | much time on Hacker News, yeah, that does look like a good title, you're doing a good job. And then you can see
00:17:56.240 | around step 1200 here, it just like jumps a bunch, right? It's like, okay, it clearly figured something
00:18:01.920 | out. I don't know what it figured out. But we should look at that. And it turns out what the
00:18:08.720 | model had figured out was that it could just completely ignore the content of the post and generate the
00:18:14.240 | same title for every single one of them. And that would like maximize its score. So it generated this
00:18:18.640 | title "Google lays off 80% of workforce." Literally every single article, this was what it labeled it
00:18:24.000 | as, and the reward model was like, yes, that is going to get upvoted on Hacker News for sure,
00:18:27.520 | which it probably would, to be fair.
00:18:29.680 | So anyway, how do we solve this? What we found is that it's really important to watch
00:18:36.640 | out for this. Solving it typically involves modifying in some way your reward function to penalize things
00:18:43.600 | like that. So in the second example I talked about, it was actually quite an easy fix once we identified it,
00:18:48.480 | which was just to add an extra LLM-as-judge that looked at the title, looked at the content, and said, hey,
00:18:52.960 | is there anything in the title that's not supported by the content? And we added that on and it actually
00:18:57.120 | worked great.
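For illustration, that patched reward could look something like the sketch below, where the upvote-model score only pays out if a second judge confirms the title is supported by the post. The prompt and call_judge helper are placeholders, not the exact setup described in the talk.

    # Illustrative fix for the title reward hack: gate the learned upvote score behind a
    # support check so an unsupported (clickbait) title earns nothing.
    SUPPORT_PROMPT = """Post content:
    {content}

    Proposed title: {title}

    Does the title make any claim not supported by the post? Reply YES or NO."""

    def title_reward(content, title, upvote_score, call_judge):
        verdict = call_judge(SUPPORT_PROMPT.format(content=content, title=title))
        if verdict.strip().upper().startswith("YES"):    # unsupported claim -> no reward
            return 0.0
        return upvote_score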
00:19:02.000 | The important thing here is you want to be looking at your rollouts, not just blindly trusting the reward function, figuring out what's actually happening. Anyway, so that's it. I'm almost out of
00:19:08.880 | time, so I'm going to stop here. A couple of QR codes for you: everything in this presentation is there, and there's
00:19:14.480 | a much longer write-up I have of this whole project. It includes the code, it includes the artifacts,
00:19:19.040 | datasets along the way. You can check that out there. One more thing is we have a Discord that's
00:19:25.920 | open. We have an open-source project for training reinforcement learning models. We have a Discord you
00:19:30.880 | can go to if you're interested in this kind of thing. We're all in there. We answer questions.
00:19:35.440 | There's lots of people from the community trying to do these things. So if you're interested in
00:19:39.440 | building things with this, feel free to join it. And yeah, happy to chat there. And yes,
00:19:44.080 | thank you everyone. Appreciate your time.