Yeah, a brief introduction to Zapier Agents. I believe many of you know what Zapier is: automation software, a lot of boxes and arrows, essentially about automating your business processes. Agents is just, well, a more agentic alternative to Zapier. You describe what you want, we propose a bunch of tools and a trigger, you enable it, and hopefully we automate your whole business process.
And a key lesson from those two years is that building good AI agents is hard, and building a good platform that enables non-technical people to build AI agents is even harder. That's because AI is non-deterministic, but on top of that, your users are even more non-deterministic. They are going to use your product in ways that you cannot imagine up front.
So, if you think that building agents is not that hard, you probably have this kind of picture in mind. You probably stumbled upon this library called LangChain, pulled some examples from a tutorial, tweaked the prompt, plugged in a bunch of tools, chatted with the solution, and thought, well, it's actually kind of working, all right?
So, let's deploy it and let's collect some profit. It turns out reality has a surprising amount of detail, and we believe that building probabilistic software is a little bit different from building traditional software. The initial prototype is only a start, and after you ship something to your users, your responsibility shifts to building the data flywheel.
So, once your users start using your product, you need to collect the feedback and start understanding the usage patterns and the failures, so that you can build more evals and build an understanding of what's failing and what the use cases are. As you build more evals and more features, your product gets better, so you get more users, and there are more failures, and you have to build more features, and on and on and on.
So, yeah, it forms this data flywheel. But starting with the first step: how do you start collecting actionable feedback? Backing up for a second, the first thing is to make sure you're instrumenting your code, which you probably already are doing.
Whether you're using Braintrust or something else, they all offer an easy way to get started by just tracing your completion calls. And this is a good start, but you actually want to make sure that you're recording much more than that in your traces. You want to record the tool calls, the errors from those tool calls, and the pre- and post-processing steps.
That way, it will be much easier to debug what went wrong with the run. And you also want to strive to make the run repeatable for eval purposes. So, for instance, if you log data in the same shape as it appears in the runtime, it makes it much easier to convert it to an eval run later, because you can just prepopulate the inputs and expected outputs directly from your trace for free.
And this is especially useful for tool calls, because if your tool call produces any side effects, you probably want to mock those in your evals, and you get all that for free if you're recording them in your trace.
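To make that concrete, here is a minimal sketch of logging tool calls in the same shape they appear at runtime. Everything here (the `span_logger`, the field names) is illustrative, not Zapier's actual code; the point is that the recorded args and outputs can later be replayed as eval inputs and mocks.

```python
# Minimal sketch: record a tool call's inputs, outputs, and errors on the current trace,
# in the same shape the agent saw at runtime, so the run can be replayed as an eval later.
# The span_logger object and field names are illustrative assumptions.
import json
import time
import traceback

def traced_tool_call(span_logger, tool_name: str, tool_args: dict, tool_fn):
    """Run a tool and attach a structured record of the call to the trace."""
    record = {
        "type": "tool_call",
        "tool": tool_name,
        "args": tool_args,          # exactly what the agent passed in
        "started_at": time.time(),
    }
    try:
        result = tool_fn(**tool_args)
        record["output"] = result   # becomes the recorded/mocked output in an eval
        return result
    except Exception as exc:
        record["error"] = {"message": str(exc), "stack": traceback.format_exc()}
        raise
    finally:
        record["ended_at"] = time.time()
        span_logger.log(json.dumps(record, default=str))
```

Because the args and outputs are logged verbatim, an eval run can substitute the recorded output for the real side-effecting call.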
Okay, great. So, you've instrumented your code and you're getting all this raw data from your runs. Now it's time to figure out which runs to actually pay attention to. Explicit user feedback is really high signal, so that's a good place to start. Unfortunately, not many people actually click those classic thumbs-up and thumbs-down buttons. So, you've got to work a bit harder for that feedback.
And in our experience, this works best when you ask for the feedback in the right context. You can be a little more aggressive about asking for feedback as long as you're in the right context and you haven't been bothering the user before that. For us, one example of this is that once an agent finishes running, even if it was just a test run, we show a feedback call to action at the bottom, right?
Did this run do what you expected? Give us the feedback now. And this small change actually gave us like a really nice bump in feedback submissions, surprisingly. So, thumbs up and thumbs down are a good benchmark, a good baseline, but try to find these critical moments in your user's journey where they'll be most likely to provide you that feedback, either because they're happy and satisfied or because they're angry and they want to tell you about it.
Even if you work really hard for the feedback, explicit feedback is still really rare. And explicit feedback that's detailed and actionable is even rarer, because people are just not that interested in providing feedback generally. So, you also want to mine user interactions for implicit feedback, and the good news is there's actually a lot of low-hanging fruit here.
Here's an example from our app. Users can test an agent before they turn it on to see if everything's going okay. So, if they do turn it on, that's actually really strong positive implicit feedback, right? Copying a model's response is also good implicit feedback. Even OpenAI is doing this for ChatGPT.
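One way to think about this is to treat those product actions as feedback events tied to a run. A rough sketch below; the event names and the analytics client are hypothetical, not a real Zapier API.

```python
# Rough sketch: capture implicit feedback as structured events attached to a run.
# The analytics client and event names are hypothetical placeholders.
from dataclasses import dataclass, asdict

@dataclass
class ImplicitFeedback:
    run_id: str
    signal: str    # e.g. "agent_turned_on", "response_copied", "rephrased_followup"
    polarity: int  # +1 positive, -1 negative

def record_signal(analytics, run_id: str, signal: str, polarity: int) -> None:
    analytics.track("implicit_feedback", asdict(ImplicitFeedback(run_id, signal, polarity)))

# Turning the agent on after a test run is strong positive signal:
#   record_signal(analytics, run_id, "agent_turned_on", +1)
# Copying the model's response:
#   record_signal(analytics, run_id, "response_copied", +1)
```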
And you can also look for implicit signals in the conversation. Here the user is clearly letting us know that they're not happy with the results. Here they're telling the agent to stop slacking around, which is clearly implicit negative feedback, I think. Sometimes the user sends a follow-up message that is mostly rehashing what they asked the previous time to see if the LLM interprets that phrasing better.
That's also good implicit negative feedback. And there's also a surprising amount of cursing. Recently, we had a lot of success using an LLM to detect and group frustrations, and we have a weekly report that we post in our Slack. But it took a lot of tinkering to make sure that the LLM understood what frustration means in the context of our product.
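The core of the idea is just a classification prompt over recent conversations. Here is a minimal sketch using the OpenAI SDK; the prompt is heavily simplified and the model name is only an example, and in practice the prompt needs product-specific examples of what counts as frustration.

```python
# Minimal sketch of LLM-based frustration detection over a conversation.
# Prompt and model name are simplified examples, not the production setup.
from openai import OpenAI

client = OpenAI()

FRUSTRATION_PROMPT = """You review conversations between users and an AI agent.
Label the user's latest messages as one of: none, mild_frustration, strong_frustration.
Frustration means the user is unhappy with the agent's behavior (repeating themselves,
correcting it, cursing, telling it to stop), not merely asking a hard question.
Return JSON: {"label": ..., "evidence": "<short quote>"}."""

def detect_frustration(conversation_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model
        messages=[
            {"role": "system", "content": FRUSTRATION_PROMPT},
            {"role": "user", "content": conversation_text},
        ],
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content
```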
So, I encourage you to try it out, but expect a lot of tinkering. You should also not forget to look at more traditional user metrics, right? There's a lot of stuff in there to mine for implicit signals, too. So, find what metrics your business cares about and figure out how to track them.
Then you can distill some signal from that data. You can look for customers, for example, that churned in the last seven days and go look at their last interactions with your product before they left. And you're likely to find some signal there. Okay. So, I have raw data. What now?
I'll let the industry experts speak. Yeah: the beatings will continue until everyone looks at their data. Okay, but how are you actually going to do that? We believe that the first step is to either buy or build LLM Ops software. We do both.
You're definitely going to need that to understand your agent runs, because one agent run is probably multiple LLM calls, multiple database interactions, tool calls, REST calls, whatever. Each one of them can be a source of failure, and it's really important to piece together the whole story and understand what caused a cascading failure.
Yeah, I said we do both, because I believe coding your own internal tooling is really, really easy right now with Cursor and Claude Code, and it's going to pay you massive dividends in the future for two reasons. First of all, it gives you the ability to understand your data in your own specific domain context.
And second of all, you should also be able to create functionality that turns every interesting case or every failure into an eval with the minimal amount of friction. So, whenever you see something interesting, it should be like one click to turn it into an eval.
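In practice that "one click" can be little more than reshaping a logged trace into an eval case. A sketch below, assuming the trace is stored in the runtime shape described earlier; all field names are illustrative.

```python
# Illustrative sketch: convert a logged trace into an eval case in one function call.
# Assumes the trace stores messages and tool calls in the same shape as at runtime.
def trace_to_eval_case(trace: dict) -> dict:
    return {
        "input": {
            "messages": trace["input_messages"],          # conversation up to the failing step
            "available_tools": trace["tools"],
        },
        "expected": {
            "tool_call": trace.get("expected_tool_call"),  # filled in or corrected by a human
            "final_answer": trace.get("final_answer"),
        },
        "mocks": {                                         # recorded tool outputs replayed in evals
            call["tool"]: call["output"]
            for call in trace.get("tool_calls", [])
            if "output" in call
        },
        "metadata": {"source_trace_id": trace["id"], "tags": ["from_failure"]},
    }
```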
That should become an instinct. Once you understand what's going on at the level of a single run, you can start understanding things at scale. Now you can do feedback aggregations and clustering, you can bucket your failure modes and your interactions, and then you're going to start to see which tools are failing the most and which interactions are the most problematic.
That's going to create an automatic roadmap for you, so you'll know where to apply your time and effort to improve your product the most. Doing anything else is a sub-optimal strategy. Something that we are also experimenting with is using reasoning models to explain failures.
It turns out that if you give them the trace, the inputs and outputs, the instructions, and anything else you can find, they are pretty good at finding the root cause of a failure. Even if they don't, they are probably going to explain the whole run to you or direct your attention to something interesting that might help you find the root cause of the problem.
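A minimal sketch of what that could look like, assuming you can dump the trace to text; the model name is just an example of a reasoning-capable model, and the prompt wording is ours, not Zapier's.

```python
# Sketch: ask a reasoning model to explain a failed run, given as much context as possible.
from openai import OpenAI

client = OpenAI()

def explain_failure(instructions: str, trace_dump: str, user_feedback: str) -> str:
    prompt = (
        "Here is an agent's system instructions, a full trace of one run "
        "(LLM calls, tool calls, errors), and the user's feedback.\n"
        "1. Identify the most likely root cause of the failure.\n"
        "2. Point to the specific step in the trace where things went wrong.\n\n"
        f"INSTRUCTIONS:\n{instructions}\n\nTRACE:\n{trace_dump}\n\nFEEDBACK:\n{user_feedback}"
    )
    response = client.chat.completions.create(
        model="o3-mini",  # any reasoning-capable model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```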
Cool. So, now you have a good short list of failure modes you want to work on first. It's time to start building out your evals. And we realized over time that there are different types of evals, and the types of evals that we want to build can be placed into this hierarchy that resembles the testing pyramid, for those of you that know that.
So, with unit-test-like evals at the base, end-to-end evals, or trajectory evals as we like to call them, in the middle, and the ultimate way of evaluating, A/B testing with staged rollouts, at the top. So, let's talk a bit about those. Starting with unit test evals: here we are just trying to predict the n+1 state from the current state, so these work great when you want to do simple assertions, right?
For instance, you could check whether the next state is a specific tool call, or if the tool call parameters are correct, or if the answer contains a specific keyword, or if the agent determined that it was done, all that good stuff. So, if you're starting out, we recommend focusing on unit test evals first, because these are the easiest to add.
It helps you build that muscle of looking at your data, spotting problems, creating evals that reproduce them, and then just focusing on fixing them, right? Beware, though, of turning every piece of positive feedback into an eval. We found that unit test evals are best for hill-climbing specific failure modes that you spot in your data.
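As a concrete illustration of the single-step assertions described above, here is a sketch of one such eval. `run_agent_step` is a stand-in for your own single-step runner (e.g. a pytest fixture); the scenario and field names are invented for the example.

```python
# Sketch of a unit-test-style eval: given the current state, predict the next step
# and assert on it. run_agent_step is a stand-in for your own single-step runner.
def test_agent_looks_up_calendar_before_answering(run_agent_step):
    state = {
        "messages": [{"role": "user", "content": "Am I free tomorrow at 3pm?"}],
        "tools": ["calendar_list_events", "send_email"],
    }
    next_step = run_agent_step(state)

    # Assert the n+1 state: the right tool, sane parameters, and not "done" yet.
    assert next_step["type"] == "tool_call"
    assert next_step["tool"] == "calendar_list_events"
    assert "date" in next_step["args"]
    assert next_step.get("done") is not True
```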
Now, unit test evals are not perfect, and we realized that ourselves. We realized we had over-indexed on unit test evals when new models came out that were objectively stronger, yet were still performing worse on our internal benchmarks, which was weird. And because the majority of our evals were so fine-grained, it was really hard to see the forest for the trees when benchmarking new models.
There was always a lot of noise when we tried comparing runs. When you're looking at a single trace, it's easy to go through it and understand what's happening, but when you want to look at an aggregation of many traces, it starts getting difficult to understand what's going on.
Why are so many of these passing while some of these are regressing? So, we realized that maybe a machine can help us. In that previous video, where I was investigating one experiment inside Braintrust, there was a lot of staring at the screen trying to figure out what went wrong, and we thought, hey, maybe we can just give all this data to, once again, a reasoning LLM and have it compare the models for us.
It turns out that with the Braintrust MCP and a reasoning model, you can just ask it: hey, look at this run, look at that run, and tell me what's actually different about the new model that we are about to deploy. In this case, it was Gemini Pro versus Claude, and what the reasoning model found was actually really, really good.
It found that Claude is a decisive executor, whereas Gemini is really yapping a lot: it asks follow-up questions, it needs some positive affirmations, and it sometimes even hallucinates JSON structures. So, yeah, it helped us a lot.
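In the talk this goes through the Braintrust MCP, but the same idea can be sketched by hand: pull the results of two experiments and ask a reasoning model to compare them. `load_experiment_results` is a hypothetical helper (e.g. an export from your eval tool), and the truncation is a crude way to keep the prompt bounded.

```python
# Rough sketch: have a reasoning model compare two eval experiments.
# load_experiment_results is a hypothetical helper returning {input, output, score} rows.
import json
from openai import OpenAI

client = OpenAI()

def compare_experiments(load_experiment_results, baseline_id: str, candidate_id: str) -> str:
    baseline = load_experiment_results(baseline_id)
    candidate = load_experiment_results(candidate_id)
    prompt = (
        "Compare these two eval experiments (baseline vs candidate model).\n"
        "Describe behavioral differences: tool-use patterns, verbosity, follow-up "
        "questions, formatting/JSON issues, and which failure modes are new.\n\n"
        f"BASELINE:\n{json.dumps(baseline)[:50000]}\n\n"
        f"CANDIDATE:\n{json.dumps(candidate)[:50000]}"
    )
    response = client.chat.completions.create(
        model="o3-mini",  # any reasoning-capable model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```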
This also surfaced a problem with unit test evals: different models have different ways of trying to achieve the same goal, and unit test evals penalize those different paths, because they are hard-coded to follow only one path. And, yeah, our unit test evals were overfitting to our existing models, since the data was actually collected using those models. So, what we started experimenting with is trajectory evals. Instead of grading just one iteration of an agent, we let the agent run to the end state.
And we are not grading just the end state; we are also grading all the tool calls that were made along the way and all the artifacts that were generated along the way. This can also be paired with LLM as a judge, which Vitor is going to speak about later.
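A sketch of what a trajectory eval's scoring might look like, assuming a runner that returns the full run; the score names and expectations are illustrative, and in practice the artifact check is often an LLM judge rather than a count.

```python
# Sketch of a trajectory eval: run the agent to completion, then grade the end state,
# the tool calls made along the way, and the artifacts produced. Names are illustrative.
def trajectory_eval(run_agent_to_completion, case: dict) -> dict:
    run = run_agent_to_completion(case["input"])
    called_tools = [call["tool"] for call in run["tool_calls"]]
    expected = case["expected"]

    return {
        # Did the run end in the state we wanted?
        "reached_goal": float(expected["goal"] in run["final_answer"]),
        # Were the required tools used (in any order)?
        "required_tools_used": float(
            all(tool in called_tools for tool in expected["required_tools"])
        ),
        # Were forbidden side effects avoided (e.g. no email actually sent)?
        "no_forbidden_tools": float(
            not any(tool in called_tools for tool in expected.get("forbidden_tools", []))
        ),
        # Do generated artifacts (drafts, files, records) look right?
        "artifacts_ok": float(
            len(run.get("artifacts", [])) >= expected.get("min_artifacts", 0)
        ),
    }
```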
Yeah, but they are not free. I think they have really high return on investment, but they are much harder to set up, especially if you are evaluating runs that have tools that cause side effects, right? When you are running an eval, you definitely don't want to send an email on behalf of the customer once again, right?
So, we had a fundamental question of whether we should mock the environment or not. And we decided that we are not going to mock it, because otherwise you're going to get data that just doesn't reflect reality. So, what we started doing is mirroring the user's environment and crafting a synthetic copy of it.
Also, they are much slower, right? They can sometimes take up to an hour, which is not great. And we are also leaning a bit more into LLM as a judge. This is when you use an LLM to grade or compare results from your evals. It's tempting to lean into them for everything, but you need to make sure that the judge is judging things correctly, which can be surprisingly hard.
And you also have to be careful not to introduce subtle biases, right? Because even small things that you might overlook might end up influencing it. Lately, we have also been experimenting with this concept of rubrics-based scoring. We use an LLM to judge the run. But each row in our dataset has a different set of rubrics that were handcrafted by a human and described in natural language.
So, what specifically about this run should the LLM be paying attention to for the score? One example of this: did the agent react to an unexpected error from the calendar API and then retry?
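A minimal sketch of rubric-based scoring, assuming each dataset row carries its own human-written rubrics; the prompt and the judge model are examples rather than the production setup.

```python
# Sketch of rubric-based scoring: an LLM judge scores the run against the
# handcrafted rubrics attached to this specific dataset row.
import json
from openai import OpenAI

client = OpenAI()

def rubric_judge(run_transcript: str, rubrics: list[str]) -> dict:
    rubric_text = "\n".join(f"- {r}" for r in rubrics)
    prompt = (
        "Score the following agent run against each rubric. For every rubric return "
        '{"rubric": ..., "pass": true/false, "reason": ...}. '
        'Return a JSON object with a "results" list.\n\n'
        f"RUBRICS:\n{rubric_text}\n\nRUN:\n{run_transcript}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # example judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example row-specific rubric:
# rubric_judge(transcript, ["Did the agent react to the unexpected calendar API error and retry?"])
```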
So, to sum it up, here's our current mental model of the types of evals that we build for our agents. We use LLM-as-a-judge or rubric-based evals to build a high-level overview of the system's capabilities; these are great for benchmarking new models. We use trajectory evals to capture multi-turn criteria. And we use unit-test-like evals to debug specific failures and hill-climb them, but beware of overfitting with these.
Yeah. And a couple of closing thoughts. Don't obsess over metrics. Remember that when a measure becomes a target, it ceases to be a good measure. So, when you're close to achieving a 100% score on your eval dataset, it doesn't mean that you're doing a good job. It probably means that your dataset is just not interesting anymore, right?
Because we don't have AGI yet, so it's probably not true that your model is that good. Something that we're experimenting with lately is dividing our dataset into two pools: a regressions dataset, to make sure that whatever changes we make, we are not breaking existing use cases for our customers, and an aspirational dataset of things that are extremely hard, for instance nailing 200 tool calls in a row. And lastly, let's take a step back. What's the point of creating evals in the first place? Your goal isn't to maximize some imaginary number in a lab-like setting.
Your end goal is user satisfaction, so the ultimate judges are your users. You shouldn't be optimizing for the biggest eval scores while completely disregarding the vibes. That's why we think the ultimate verification method is an A/B test. Just take a small proportion of your traffic, let's say 5%, route it to the new model or the new prompt, monitor the feedback, and check your metrics, like activation, user retention, and so on.
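The routing itself can be as simple as a sticky hash-based split; a sketch below, with the 5% fraction from the talk and everything else (names, the commented usage) as illustrative assumptions.

```python
# Sketch: route a small, sticky fraction of traffic to a new model/prompt for an A/B test.
# Hash-based assignment keeps each user in the same bucket across sessions.
import hashlib

def variant_for_user(user_id: str, rollout_fraction: float = 0.05) -> str:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "candidate" if bucket < rollout_fraction else "control"

# model = NEW_MODEL if variant_for_user(user_id) == "candidate" else CURRENT_MODEL
# Then compare feedback rates, activation, and retention per variant before rolling out further.
```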
Based on that, you can make the most educated decision, instead of sitting in the lab optimizing an imaginary number. That's all. Thank you.