back to indexHow Zapier Builds AI Products and Features with the Help of Braintrust: Ankur Goyal & Olmo Maldonado

Chapters
0:0 Introduction
0:23 Meet the Team
1:24 What is Zapier
7:58 Results
14:18 Conclusion
00:00:00.120 |
My name is Olmo Maldonado, I am from Zapier, I promise I am not an open source LLM, my 00:00:20.800 |
name is Olmo, I've been at Zapier for over seven years, I apologize for all the bugs 00:00:27.720 |
that I've introduced that may have affected you, and if you happen to have any bugs that 00:00:31.740 |
you want to report, I'd be happy to take that in and start working on them. 00:00:36.140 |
You name it, I've been part of the team, so feel free to come by and talk to me about 00:00:42.240 |
I'm a lucky husband and father, and I've been playing some golf lately and I don't 00:00:48.520 |
Hey everyone, I'm Ankur, I actually went through Hamel's journey I think, I built eval 00:00:55.440 |
tooling at my last startup, Impira, and then when I led the AI team at Figma. 00:01:00.460 |
And that's what led to Braintrust, actually Braintrust started kind of as a collaboration 00:01:05.620 |
with our friends at Zapier, who are our first users, and it's been a lot of fun since. 00:01:11.180 |
I'm also a husband, not yet a father, but a proud older brother, and also a reluctant 00:01:19.200 |
I hope to play against Ulmo and beat him someday. 00:01:24.220 |
So, today it's going to be a lot of story telling. 00:01:28.240 |
I'm not here to prophesize what you should be doing. 00:01:31.240 |
I'm just going to share what we've done that has worked for us. 00:01:35.080 |
And I'm hoping to actually learn a lot from you all as well of what is working for you and 00:01:40.240 |
I know I've already learned a lot from the conference, so I'm hoping that you all learned something 00:01:46.620 |
We'll go over what we're doing at Zapier, the tech that is going on at Braintrust, and 00:01:51.880 |
a couple of examples of how what they've done has really helped us to make a good product. 00:01:57.880 |
So, if you're not familiar, we're actually just at a quick poll, how many folks here use 00:02:03.520 |
All right, a good number of folks, appreciate that. 00:02:09.140 |
If you're not aware, there's actually over 7,000 apps of your favorite ones online. 00:02:15.060 |
We make it very low code to no code to integrate with all of them. 00:02:19.660 |
On the right, you can kind of see how the workflow works, and we'll do it in a reliable way to 00:02:25.000 |
make sure it's mission critical and everything. 00:02:30.780 |
At this point, I'm happy to say, per day, we're doing over 10 million tasks. 00:02:35.840 |
So if you haven't tried this out, please give it a try, as well as use all the integrations 00:02:43.420 |
Here are the apps that we, the products that we have built with AI. 00:02:47.740 |
I'll only talk about the first two with the AI Zap Builder and Zapier Copilot, but I would 00:02:52.580 |
strongly encourage for you guys to explore all the other new products that we have available. 00:02:56.560 |
Central, in particular, shout out to my colleagues that are here. 00:03:02.120 |
It's a bot framework, so you can make your own bot connect to over all of your apps that 00:03:07.680 |
So if you want to learn more, please go to Zapier.com/AI. 00:03:13.240 |
Really quickly about Braintrust, we'll keep the propaganda brief. 00:03:18.800 |
Braintrust is the end-to-end developer platform that some of the world's best AI teams use, 00:03:24.360 |
including Notion, Airtable, Instacart, Zapier, Vercel, and many others. 00:03:30.360 |
Basically, if you break that down, there are three things that we're really focused on today. 00:03:38.480 |
Olmo is going to talk about how they do evals, which I think is probably the best way to actually 00:03:44.480 |
We also help you with observability, and I think it's really important that you build your stuff 00:03:50.480 |
in a way that evals and observability, there's actually kind of a continuum across them. 00:03:56.480 |
And so we are really kind of focused on that problem. 00:03:59.680 |
And then the last thing is, we help you build really, really great prompts, and there's a 00:04:06.240 |
Yeah, so this is what it looks like to work at Zapier. 00:04:11.540 |
We want to get the prototype as early as possible to the user. 00:04:16.940 |
We will get some things wrong, and we hope to learn from them and just keep improving the 00:04:24.080 |
We make adjustments through our evals, and evals are the way that we make decisions. 00:04:32.280 |
So if you haven't played with it, this is the AI Zap Builder. 00:04:38.040 |
We'll do our best to make a Zap for you, and when you click try it, there's your Zap. 00:04:43.200 |
It will do many other things as well, like field mapping and so forth. 00:04:48.320 |
So what we learned from this experience is how do we go about knowing how well it works. 00:04:53.680 |
How well is this product delivering Zapier to the customer? 00:04:58.800 |
Over 7,000 products, the long tail of integrations is fast, so how well are we doing it? 00:05:06.080 |
One of the things that we did that has worked for us that I would encourage for you all to 00:05:11.800 |
Involve your product side of things to be part of the conversation. 00:05:15.920 |
This is an engineering problem as well as a product problem. 00:05:19.120 |
And you can see here in this screenshot our P0 priority. 00:05:22.800 |
The things that we wanted to make sure that our AI Zap Builder was able to produce. 00:05:27.680 |
We wanted to make sure the triggers, actions were working as you'd expect. 00:05:30.960 |
We don't want the wrong step in the wrong place, and we wanted to make sure the top 25 apps were 00:05:36.960 |
supported and that they were done in an eloquent way, correct way. 00:05:41.840 |
And yeah, we have even internal Zapier apps like paths and filters. 00:05:50.000 |
So all of this had to be in our eval suite in some way or form. 00:05:53.360 |
This is our framework that we built in-house with the help of BrainTrust. 00:06:00.480 |
We have synthetic data from our corporate accounts that we use for seeding the evals, 00:06:07.120 |
and we use that to get going with all of that coverage that we saw before. 00:06:12.640 |
What we do is we load that data from BrainTrust actually that is hosting us, 00:06:16.880 |
and we take that and run that on a CI basis as well as a manual basis. 00:06:21.840 |
And we have our own little runner that essentially kind of does a load test against 00:06:27.200 |
all of our AI providers every single time that we run this. 00:06:29.920 |
So it's been really incredible to take all of this data, run it, report on it, 00:06:36.240 |
and start acting on the things that we've seen. 00:06:40.400 |
We also have these custom graders that the previous speaker had mentioned. 00:06:44.400 |
They're both logic-based as well as LLM-based. 00:06:47.680 |
And in general, what we're trying to do is make sure that that criteria that you saw before 00:06:51.840 |
is being tested upon and that we are actually acting on what we wanted to see. 00:06:55.680 |
So here's an example of all the different runs that we've had. 00:07:01.840 |
As you can see, it's pretty often that we run it. 00:07:06.160 |
We want to make sure that if any regressions happen that we act on them quickly. 00:07:10.400 |
I'll actually go over one of those cases in a bit. 00:07:17.840 |
We can see this is a screenshot of Maggie's project for the Zap AI builder. 00:07:24.160 |
And, you know, as mentioned earlier, we have observability thanks to BrainTrust. 00:07:29.520 |
We can see within it what happened, what were the inputs, what were the outputs, 00:07:34.160 |
as well as compare the pink and green, hopefully you all can see it, is actually comparing against 00:07:39.520 |
previous runs as well and trying to find what went down, what went up, and so forth. 00:07:44.480 |
And, yeah, this is just showing that even further with all of our different graders, 00:07:49.760 |
the scores, if you will, of, like, the different things that we're looking for. 00:07:53.120 |
We want to make sure that the ones that we care most about are being highlighted and that we do something about it. 00:07:57.680 |
So after all this work of creating the eval suite and running them continuously, 00:08:04.800 |
I can say that before this we just had seven unit tests. 00:08:08.960 |
So seven unit tests that were run manually by devs and now we have over 800 of them 00:08:14.320 |
and they're all run part of this merge request as well as on a continuous basis. 00:08:24.240 |
This has led us to improve nearly 300 percent of our accuracy. 00:08:28.480 |
I will say that is not saying that we're at 100 percent. 00:08:34.640 |
But it is fortunate that we were able to improve with this process that we created. 00:08:42.480 |
These are just a few shout outs of how they have received that product. 00:08:46.080 |
This is using an older UI, but it's essentially the same product. 00:08:50.000 |
Now, one thing about that product is a single shot approach can only take us so far. 00:08:58.160 |
As you all might imagine, it's a chat interface. 00:09:01.040 |
We want to allow the user to interact with the editor as they're happening. 00:09:08.720 |
You can see in the demo here the gif that we not only did the same prompt that we did before, 00:09:17.040 |
We're also configuring fields all as quickly as possible to the user. 00:09:23.520 |
So we're really happy so far with the performance that we're getting out of this thing. 00:09:27.360 |
The problem with it, though, now that it's kind of like an agent framework with multiple tools that it calls, 00:09:37.200 |
What is the things that we need to improve now to make the accuracy even better, to make the experience better? 00:09:42.160 |
And this is where, again, brain trust came in. 00:09:47.040 |
This allowed us to break down the request from very granular observability to a very fine look into the problem. 00:09:56.960 |
And just as you would expect, you can actually see the inputs and outputs of a chat completion. 00:10:01.600 |
The tokens, the time to response, you name it, it's available. 00:10:06.560 |
And we can quickly iterate on that one as well with a playground that they have. 00:10:11.360 |
I'm not showing that, but I just wanted to showcase that it's really easy as developers to go into it 00:10:16.560 |
and really understand what is going on with the performance of the co-pilot. 00:10:22.000 |
So one of the things that ended up happening with the co-pilot is early on, because we wanted to get to market first, 00:10:34.480 |
And, you know, we started testing different models. 00:10:39.040 |
And this view that you're seeing here was manually made for us, 00:10:42.800 |
so that we can get a better sense of, you know, the performance of different models across different tools. 00:10:48.640 |
And as you change those tools, what is the performance characteristic overall? 00:10:53.040 |
And we settled on GPT 4 Turbo for our message router that we call. 00:10:57.840 |
And unfortunately, that came at a cost of performance. 00:11:03.920 |
And -- but at least we were able to provide some guarantees to our customers on accuracy. 00:11:12.000 |
And we -- our evals started to show a regression. 00:11:16.240 |
We were wondering what happened, and the reason for that was we switched to 4.0. 00:11:20.480 |
Now, you might have already heard from the conference that 4.0 is -- you need to be careful with it. 00:11:28.560 |
But I can say that with a few changes that we had to make and a couple of parameters that we added to the OpenAI calls, 00:11:38.560 |
we're kind of back to the same performance that we were before. 00:11:40.800 |
And that's what I want to talk to you about right now. 00:11:42.800 |
So, as you know, we can see all the examples or all the different runs that we've had with experiments. 00:11:50.880 |
We noticed beforehand that we were 80% or better in most of our scores. 00:11:54.960 |
After the change to the GPT 4.0, all of that regressed below 80%. 00:12:01.280 |
So, we were really worried that, you know, we wanted the performance benefits and the cost benefits, 00:12:09.680 |
This is showcasing some of the work that my colleague, Maggie Cody, had. 00:12:14.400 |
So, shout out to her for her hard work in this. 00:12:17.280 |
You can kind of see that all of our scores went down. 00:12:20.080 |
And we can clearly see that there's a pattern to this. 00:12:24.240 |
Like, drilling in further, we noticed that, you know, the 22 regressions that happened in here 00:12:29.920 |
were all related to the OpenAI deciding to, like, forget our system prompt in some way. 00:12:37.520 |
Or, like, to give an answer back that we didn't want to see or do. 00:12:41.760 |
It just, in some ways, we had prompts that were too fine-tuned to GP 3.5 Turbo. 00:12:48.320 |
So, we had to kind of regress our engineering, if you will, our prompt engineering. 00:12:53.200 |
And that ended up allowing us to actually go back to the numbers that we were at. 00:12:57.520 |
So, this is the example of that prompt, how we were a lot more elaborate with how we were 00:13:05.280 |
asking the GPT 3.5 to, like, respect our wishes. 00:13:09.520 |
And afterwards, we actually just relaxed it a lot more. 00:13:12.160 |
And again, trial and error really quickly iterating on that loop that I mentioned earlier. 00:13:18.160 |
And that is what led us to make these discoveries. 00:13:21.440 |
Now, the other thing that I mentioned, just to showcase that, is, you know, 00:13:27.760 |
We deprecated the functions usage over to the tool choice. 00:13:30.800 |
That tool choice auto is actually an asterisk there, because we are also going to experiment 00:13:39.680 |
So yeah, overall, after those changes, we can see immediately that most of our scores went up. 00:13:46.000 |
We're really happy that, for us, it's really easy to compare back and forth with previous runs. 00:13:54.080 |
And, since then, we've been able to adopt GPT 4.0. 00:13:59.920 |
We still have more work to do there, like I mentioned, but it's an iterative approach. 00:14:04.240 |
Before this adoption, we were around 14 seconds. 00:14:08.800 |
And now, we're at three seconds for a stream-based co-pilot. 00:14:14.240 |
And, of course, we had a lot more reduction, as we did before, or with this. 00:14:18.640 |
So, in conclusion, I just wanted to share some of our stories of how we've worked with BrainTrust 00:14:25.440 |
from the very beginning to make a great product. 00:14:29.040 |
Really, we couldn't be happier working with them. 00:14:32.080 |
And, I don't know, there's not much more to say to that.