
How Zapier Builds AI Products and Features with the Help of Braintrust: Ankur Goyal & Olmo Maldonado


Chapters

0:00 Introduction
0:23 Meet the Team
1:24 What is Zapier
7:58 Results
14:18 Conclusion

Transcript

My name is Olmo Maldonado and I'm from Zapier. I promise I am not an open-source LLM; my name is just Olmo. I've been at Zapier for over seven years, so I apologize for all the bugs I've introduced that may have affected you. If you happen to have any bugs you want to report, I'd be happy to take them in and start working on them.

You name it, I've been part of the team, so feel free to come by and talk to me about it. I'm a lucky husband and father, and I've been playing some golf lately, and I don't know why I'm doing that. Hey everyone, I'm Ankur. I actually went through Hamel's journey, I think: I built eval tooling at my last startup, Impira, and then again when I led the AI team at Figma.

And that's what led to Braintrust. It actually started as a collaboration with our friends at Zapier, who were our first users, and it's been a lot of fun since. I'm also a husband, not yet a father, but a proud older brother, and also a reluctant golf player.

I hope to play against Olmo and beat him someday. So, today is going to be a lot of storytelling. I'm not here to prophesy about what you should be doing; I'm just going to share what we've done that has worked for us. And I'm hoping to learn a lot from you all as well about what is working for you and what we should try ourselves.

I know I've already learned a lot from the conference, so I'm hoping you all learn something from this talk. We'll go over what we're doing at Zapier, the tech going on at Braintrust, and a couple of examples of how what they've built has really helped us make a good product.

So, if you're not familiar with Zapier, actually, just a quick poll: how many folks here use Zapier? All right, a good number of folks, appreciate that. Thank you for your support. If you're not aware, we integrate with over 7,000 apps, including your favorite ones online, and we make it very low-code to no-code to integrate with all of them.

On the right, you can see how the workflow works, and we run it in a reliable way so it can be mission critical. Now, we have a lot of AI integrations. At this point, I'm happy to say we're doing over 10 million tasks per day.

So if you haven't tried this out, please give it a try, as well as all the integrations with AI. Here are the products that we have built with AI. I'll only talk about the first two, the AI Zap Builder and Zapier Copilot, but I would strongly encourage you to explore all the other new products that we have available.

Central, in particular; shout out to my colleagues who are here. It's a bot framework, so you can make your own bot and connect it to all of the apps that you have online. If you want to learn more, please go to Zapier.com/AI. Really quickly about Braintrust; we'll keep the propaganda brief.

Braintrust is the end-to-end developer platform that some of the world's best AI teams use, including Notion, Airtable, Instacart, Zapier, Vercel, and many others. Basically, if you break that down, there are three things that we're really focused on today. One is helping you do evals incredibly well. Olmo is going to talk about how they do evals, which I think is probably the best way to actually learn about that.

We also help you with observability, and I think it's really important to build things in a way where evals and observability form a continuum. We're really focused on that problem. And the last thing is, we help you build really, really great prompts, and there's a bunch of tools around that.

Yeah, so this is what it looks like to work at Zapier. We want to get a prototype to the user as early as possible. We take an iterative approach. We will get some things wrong, and we hope to learn from them and just keep improving the product as fast as possible.

We make adjustments through our evals, and evals are the way that we make decisions. That didn't use to be the case. So if you haven't played with it, this is the AI Zap Builder. You give us a prompt, as you see there, we do our best to make a Zap for you, and when you click Try It, there's your Zap.

It will do many other things as well, like field mapping and so forth. What we learned from this experience is to ask how we go about knowing how well it works. How well is this product delivering Zapier to the customer? With over 7,000 apps, the long tail of integrations is vast, so how well are we doing?

One of the things we did that has worked for us, and that I would encourage all of you to do as well, is to involve the product managers. Get the product side of things to be part of the conversation. This is an engineering problem as well as a product problem.

And you can see here in this screenshot our P0 priorities: the things that we wanted to make sure our AI Zap Builder was able to produce. We wanted to make sure the triggers and actions were working as you'd expect, we didn't want the wrong step in the wrong place, and we wanted to make sure the top 25 apps were supported and handled in an elegant, correct way.

And we even have internal Zapier apps, like Paths and Filters, and we want those to work as well. So all of this had to be in our eval suite in some way or form. This is the framework we built in-house with the help of Braintrust. We have synthetic data from our corporate accounts that we use for seeding the evals, and we use that to get going with all of the coverage we saw before.
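For illustration, here is a minimal sketch of what seeding an eval dataset like this could look like with the Braintrust Python SDK. The project and dataset names, the example prompt, and the record shape are hypothetical stand-ins; the real pipeline pulls synthetic data from Zapier's corporate accounts.

```python
# Hypothetical sketch: seed a hosted Braintrust dataset with synthetic cases.
import braintrust

dataset = braintrust.init_dataset(project="ai-zap-builder", name="synthetic-seed")

synthetic_cases = [
    {
        "prompt": "When I get a new Gmail email, add a row to a Google Sheet",
        "expected_steps": ["gmail.new_email", "google_sheets.create_row"],
    },
    # ...more seeded cases covering the P0 triggers, actions, and top apps
]

for case in synthetic_cases:
    dataset.insert(
        input=case["prompt"],
        expected={"steps": case["expected_steps"]},
        metadata={"source": "synthetic"},
    )
```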

We load that data from Braintrust, which hosts it for us, and we run it both in CI and manually. We have our own little runner that essentially does a load test against all of our AI providers every single time we run this.

So it's been really incredible to take all of this data, run it, report on it, and start acting on the things that we've seen. We also have the custom graders that the previous speaker mentioned; they're both logic-based and LLM-based. In general, what we're trying to do is make sure that the criteria you saw before are being tested and that we are actually acting on what we wanted to see.
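As a rough sketch of how such a harness can hang together on the Braintrust Python SDK: build_zap() is a placeholder for the real AI Zap Builder call, and the two graders, one logic-based and one LLM-based, are illustrative rather than Zapier's actual criteria.

```python
# Hypothetical sketch of the eval harness: hosted data, a task, and two graders.
from braintrust import Eval, init_dataset
from autoevals import LLMClassifier

def build_zap(prompt: str) -> dict:
    # Placeholder for the real AI Zap Builder service call.
    return {"steps": []}

# Logic-based grader: did the builder pick the expected trigger and actions?
def steps_match(input, output, expected):
    return 1.0 if output.get("steps") == expected.get("steps") else 0.0

# LLM-based grader: does the generated Zap plausibly satisfy the prompt?
zap_quality = LLMClassifier(
    name="zap_quality",
    prompt_template=(
        "Prompt: {{input}}\n"
        "Generated Zap: {{output}}\n"
        "Does the Zap plausibly satisfy the prompt? Answer Y or N."
    ),
    choice_scores={"Y": 1, "N": 0},
)

Eval(
    "ai-zap-builder",
    data=init_dataset(project="ai-zap-builder", name="synthetic-seed"),
    task=build_zap,
    scores=[steps_match, zap_quality],
)
```

A script along these lines is the kind of thing a runner can invoke both from CI and manually, as described above.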

So here's an example of all the different runs that we've had. As you can see, we run it pretty often. We want to make sure that if any regressions happen, we act on them quickly. I'll actually go over one of those cases in a bit. And this makes it really easy for us to act.

This is a screenshot of Maggie's project for the AI Zap Builder. And, as mentioned earlier, we have observability thanks to Braintrust. We can see within it what happened, what the inputs were, what the outputs were, and the pink and green, which hopefully you all can see, compares against previous runs to find what went down, what went up, and so forth.

And this is just showing that even further, with all of our different graders: the scores, if you will, for the different things that we're looking for. We want to make sure that the ones we care most about are highlighted and that we do something about them.

So after all this work of creating the eval suite and running it continuously, I can say that before this we had just seven unit tests. Seven unit tests that were run manually by devs, and now we have over 800 of them, and they all run as part of every merge request as well as on a continuous basis.

And we get alerted if any regress. So, a lot better coverage there. This has led us to improve our accuracy by nearly 300 percent. That's not to say we're at 100 percent; we still have a lot of work there. But we're fortunate that we were able to improve with this process that we created.

Now, we're very thankful for our customers. These are just a few shout-outs showing how they received that product. This is an older UI, but it's essentially the same product. Now, one thing about that product is that a single-shot approach can only take us so far. So this is the next iteration.

As you might imagine, it's a chat interface. We want to allow the user to interact with the editor as changes are happening, a progressive, iterative approach. You can see in the demo GIF here that we not only handle the same prompt we did before, but we're also testing steps.

We're also configuring fields, all as quickly as possible for the user. This is actually not sped up, so we're really happy so far with the performance that we're getting out of this thing. The problem, though, is that now that it's an agent framework with multiple tools that it calls, we couldn't see what the critical path was.

What are the things we need to improve now to make the accuracy even better, to make the experience better? And this is where, again, Braintrust came in. They have tracing capabilities, which allowed us to break down each request with very granular observability and get a fine-grained look into the problem.

And just as you would expect, you can actually see the inputs and outputs of a chat completion. The tokens, the time to response, you name it, it's available. And we can quickly iterate on it with a playground that they have. I'm not showing that, but I just wanted to showcase that it's really easy as developers to go in and really understand what is going on with the performance of the Copilot.
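For a sense of how little wiring that kind of tracing takes, here is a minimal sketch using the Braintrust Python SDK; the project name and the copilot_turn() function are hypothetical placeholders rather than Zapier's actual Copilot code.

```python
# Hypothetical sketch: trace a copilot turn and its LLM calls with Braintrust.
import openai
from braintrust import init_logger, traced, wrap_openai

logger = init_logger(project="zapier-copilot")

# Wrapping the OpenAI client records each chat completion (inputs, outputs,
# token counts, latency) as a span under the current trace.
client = wrap_openai(openai.OpenAI())

@traced  # each turn becomes a trace; nested tool calls show up as child spans
def copilot_turn(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content
```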

One of the things that ended up happening with the Copilot is that early on, because we wanted to get to market first, we wanted to just use GPT-3.5 Turbo. Then we started testing different models, and the view that you're seeing here was made manually for us so that we could get a better sense of the performance of different models across different tools.

And as you change those tools, what is the overall performance characteristic? We settled on GPT-4 Turbo for what we call our message router. Unfortunately, that came at a cost in performance; it was a lot slower than before. But at least we were able to provide some guarantees to our customers on accuracy.
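One way to build that kind of model-by-model comparison is to re-run the same eval suite once per candidate model and let the experiments view line them up. This sketch assumes the Braintrust Python SDK, and route_message() and tool_is_correct() are illustrative stand-ins for the real message router and its grader.

```python
# Hypothetical sketch: run the same router eval against several candidate models.
from braintrust import Eval, init_dataset

def route_message(message: str, model: str) -> str:
    # Placeholder: ask `model` which copilot tool should handle `message`.
    return "add_step"

def tool_is_correct(input, output, expected):
    return 1.0 if output == expected else 0.0

def make_task(model: str):
    def task(message: str) -> str:
        return route_message(message, model=model)
    return task

for model in ["gpt-3.5-turbo", "gpt-4-turbo", "gpt-4o"]:
    Eval(
        "zapier-copilot",
        experiment_name=f"router-{model}",
        data=init_dataset(project="zapier-copilot", name="router-cases"),
        task=make_task(model),
        scores=[tool_is_correct],
        metadata={"model": model},
    )
```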

That led us to want to make that better. Then our evals started to show a regression. We were wondering what happened, and the reason was that we had switched to GPT-4o. Now, you might have already heard at this conference that you need to be careful with GPT-4o.

And, yeah, we actually stumbled on that one. But I can say that with a few changes we had to make and a couple of parameters we added to the OpenAI calls, we're roughly back to the same performance as before. And that's what I want to talk to you about right now.

So, as you saw, we can see all the different runs that we've had as experiments. We noticed beforehand that we were at 80% or better in most of our scores. After the change to GPT-4o, all of that regressed below 80%. We were really worried: we wanted the performance benefits and the cost benefits, but we didn't want to lower our scores.

So, what should we do? This showcases some of the work my colleague Maggie Cody did, so shout out to her for her hard work on this. You can see that all of our scores went down, and we can clearly see that there's a pattern to this.

What was going on? Drilling in further, we noticed that the 22 regressions that happened here were all related to the OpenAI model deciding to forget our system prompt in some way, or to give back an answer that we didn't want to see.

In some ways, we had prompts that were too fine-tuned to GPT-3.5 Turbo, so we had to regress our engineering, if you will, our prompt engineering. And that ended up allowing us to get back to the numbers we were at. This is an example of that prompt, showing how much more elaborate we had been when asking GPT-3.5 to respect our wishes.

Afterwards, we actually just relaxed it a lot more. And again, trial and error, quickly iterating on that loop I mentioned earlier, is what led us to make these discoveries. Now, the other thing I mentioned, just to showcase it, is that we made a change to the tool choices.

We moved from the deprecated functions usage over to tools and tool_choice. That tool_choice auto actually has an asterisk there, because we are also going to experiment with making it required. So that's coming up next. Overall, after those changes, we can see immediately that most of our scores went up.
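Concretely, that migration on the OpenAI Chat Completions API looks roughly like the sketch below; the add_step tool schema is an illustrative placeholder, not the Copilot's real tool definition.

```python
# Hypothetical sketch: tools/tool_choice instead of the deprecated `functions`.
from openai import OpenAI

client = OpenAI()

add_step_tool = {
    "type": "function",
    "function": {
        "name": "add_step",
        "description": "Add a step to the Zap being built",
        "parameters": {
            "type": "object",
            "properties": {
                "app": {"type": "string"},
                "action": {"type": "string"},
            },
            "required": ["app", "action"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Add a Slack step that posts a message"}],
    tools=[add_step_tool],
    # "auto" lets the model decide whether to call a tool; "required" forces a
    # tool call and is the follow-up experiment mentioned above.
    tool_choice="auto",
)
```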

We're really happy that it's really easy for us to compare back and forth with previous runs; that's what's going on here at the top. And since then, we've been able to adopt GPT-4o. We still have more work to do there, like I mentioned, but it's an iterative approach.

Before this adoption, we were around 14 seconds, and now we're at three seconds for a stream-based Copilot. And, of course, we got further reductions beyond that as well. So, in conclusion, I just wanted to share some of our stories of how we've worked with Braintrust from the very beginning to make a great product.

Really, we couldn't be happier working with them. And there's not much more to say than that. High five. Yeah. So, yeah, thank you. Thank you.