How Zapier Builds AI Products and Features with the Help of Braintrust: Ankur Goyal & Olmo Maldonado

00:00:00.120 | My name is Olmo Maldonado, I am from Zapier, I promise I am not an open source LLM, my

00:00:20.800 | name is Olmo, I've been at Zapier for over seven years, I apologize for all the bugs

00:00:27.720 | that I've introduced that may have affected you, and if you happen to have any bugs that

00:00:31.740 | you want to report, I'd be happy to take that in and start working on them.

00:00:36.140 | You name it, I've been part of the team, so feel free to come by and talk to me about

00:00:41.240 | it.

00:00:42.240 | I'm a lucky husband and father, and I've been playing some golf lately and I don't

00:00:46.900 | know why I'm doing that.

00:00:48.520 | Hey everyone, I'm Ankur, I actually went through Hamel's journey I think, I built eval

00:00:55.440 | tooling at my last startup, Impira, and then when I led the AI team at Figma.

00:01:00.460 | And that's what led to Braintrust, actually Braintrust started kind of as a collaboration

00:01:05.620 | with our friends at Zapier, who are our first users, and it's been a lot of fun since.

00:01:11.180 | I'm also a husband, not yet a father, but a proud older brother, and also a reluctant

00:01:18.300 | golf player.

00:01:19.200 | I hope to play against Ulmo and beat him someday.

00:01:24.220 | So, today it's going to be a lot of story telling.

00:01:28.240 | I'm not here to prophesize what you should be doing.

00:01:31.240 | I'm just going to share what we've done that has worked for us.

00:01:35.080 | And I'm hoping to actually learn a lot from you all as well of what is working for you and

00:01:39.200 | something that we should try ourselves.

00:01:40.240 | I know I've already learned a lot from the conference, so I'm hoping that you all learned something

00:01:44.600 | from this talk.

00:01:46.620 | We'll go over what we're doing at Zapier, the tech that is going on at Braintrust, and

00:01:51.880 | a couple of examples of how what they've done has really helped us to make a good product.

00:01:57.880 | So, if you're not familiar, we're actually just at a quick poll, how many folks here use

00:02:02.520 | Zapier?

00:02:03.520 | All right, a good number of folks, appreciate that.

00:02:07.680 | Thank you for your support.

00:02:09.140 | If you're not aware, there's actually over 7,000 apps of your favorite ones online.

00:02:15.060 | We make it very low code to no code to integrate with all of them.

00:02:19.660 | On the right, you can kind of see how the workflow works, and we'll do it in a reliable way to

00:02:25.000 | make sure it's mission critical and everything.

00:02:27.220 | Now, we have a lot of AI integrations.

00:02:30.780 | At this point, I'm happy to say, per day, we're doing over 10 million tasks.

00:02:35.840 | So if you haven't tried this out, please give it a try, as well as use all the integrations

00:02:40.880 | with AI.

00:02:43.420 | Here are the apps that we, the products that we have built with AI.

00:02:47.740 | I'll only talk about the first two with the AI Zap Builder and Zapier Copilot, but I would

00:02:52.580 | strongly encourage for you guys to explore all the other new products that we have available.

00:02:56.560 | Central, in particular, shout out to my colleagues that are here.

00:03:02.120 | It's a bot framework, so you can make your own bot connect to over all of your apps that

00:03:06.680 | you have online.

00:03:07.680 | So if you want to learn more, please go to Zapier.com/AI.

00:03:13.240 | Really quickly about Braintrust, we'll keep the propaganda brief.

00:03:18.800 | Braintrust is the end-to-end developer platform that some of the world's best AI teams use,

00:03:24.360 | including Notion, Airtable, Instacart, Zapier, Vercel, and many others.

00:03:30.360 | Basically, if you break that down, there are three things that we're really focused on today.

00:03:35.920 | One is helping you do evals incredibly well.

00:03:38.480 | Olmo is going to talk about how they do evals, which I think is probably the best way to actually

00:03:43.480 | learn about that.

00:03:44.480 | We also help you with observability, and I think it's really important that you build your stuff

00:03:50.480 | in a way that evals and observability, there's actually kind of a continuum across them.

00:03:56.480 | And so we are really kind of focused on that problem.

00:03:59.680 | And then the last thing is, we help you build really, really great prompts, and there's a

00:04:04.080 | bunch of tools around that.

00:04:06.240 | Yeah, so this is what it looks like to work at Zapier.

00:04:11.540 | We want to get the prototype as early as possible to the user.

00:04:14.880 | We take an iterative approach.

00:04:16.940 | We will get some things wrong, and we hope to learn from them and just keep improving the

00:04:21.300 | product as fast as possible.

00:04:24.080 | We make adjustments through our evals, and evals are the way that we make decisions.

00:04:29.200 | It didn't used to be that case.

00:04:32.280 | So if you haven't played with it, this is the AI Zap Builder.

00:04:35.680 | You give us a prompt, as you see there.

00:04:38.040 | We'll do our best to make a Zap for you, and when you click try it, there's your Zap.

00:04:43.200 | It will do many other things as well, like field mapping and so forth.

00:04:48.320 | So what we learned from this experience is how do we go about knowing how well it works.

00:04:53.680 | How well is this product delivering Zapier to the customer?

00:04:58.800 | Over 7,000 products, the long tail of integrations is fast, so how well are we doing it?

00:05:06.080 | One of the things that we did that has worked for us that I would encourage for you all to

00:05:08.960 | do as well is involve the product managers.

00:05:11.800 | Involve your product side of things to be part of the conversation.

00:05:15.920 | This is an engineering problem as well as a product problem.

00:05:19.120 | And you can see here in this screenshot our P0 priority.

00:05:22.800 | The things that we wanted to make sure that our AI Zap Builder was able to produce.

00:05:27.680 | We wanted to make sure the triggers, actions were working as you'd expect.

00:05:30.960 | We don't want the wrong step in the wrong place, and we wanted to make sure the top 25 apps were

00:05:36.960 | supported and that they were done in an eloquent way, correct way.

00:05:41.840 | And yeah, we have even internal Zapier apps like paths and filters.

00:05:48.800 | We want them to work as well.

00:05:50.000 | So all of this had to be in our eval suite in some way or form.

00:05:53.360 | This is our framework that we built in-house with the help of BrainTrust.

00:06:00.480 | We have synthetic data from our corporate accounts that we use for seeding the evals,

00:06:07.120 | and we use that to get going with all of that coverage that we saw before.

00:06:12.640 | What we do is we load that data from BrainTrust actually that is hosting us,

00:06:16.880 | and we take that and run that on a CI basis as well as a manual basis.

00:06:21.840 | And we have our own little runner that essentially kind of does a load test against

00:06:27.200 | all of our AI providers every single time that we run this.

00:06:29.920 | So it's been really incredible to take all of this data, run it, report on it,

00:06:36.240 | and start acting on the things that we've seen.

00:06:40.400 | We also have these custom graders that the previous speaker had mentioned.

00:06:44.400 | They're both logic-based as well as LLM-based.

00:06:47.680 | And in general, what we're trying to do is make sure that that criteria that you saw before

00:06:51.840 | is being tested upon and that we are actually acting on what we wanted to see.

00:06:55.680 | So here's an example of all the different runs that we've had.

00:07:01.840 | As you can see, it's pretty often that we run it.

00:07:06.160 | We want to make sure that if any regressions happen that we act on them quickly.

00:07:10.400 | I'll actually go over one of those cases in a bit.

00:07:13.760 | And this is really easy for us to act on it.

00:07:17.840 | We can see this is a screenshot of Maggie's project for the Zap AI builder.

00:07:24.160 | And, you know, as mentioned earlier, we have observability thanks to BrainTrust.

00:07:29.520 | We can see within it what happened, what were the inputs, what were the outputs,

00:07:34.160 | as well as compare the pink and green, hopefully you all can see it, is actually comparing against

00:07:39.520 | previous runs as well and trying to find what went down, what went up, and so forth.

00:07:44.480 | And, yeah, this is just showing that even further with all of our different graders,

00:07:49.760 | the scores, if you will, of, like, the different things that we're looking for.

00:07:53.120 | We want to make sure that the ones that we care most about are being highlighted and that we do something about it.

00:07:57.680 | So after all this work of creating the eval suite and running them continuously,

00:08:04.800 | I can say that before this we just had seven unit tests.

00:08:08.960 | So seven unit tests that were run manually by devs and now we have over 800 of them

00:08:14.320 | and they're all run part of this merge request as well as on a continuous basis.

00:08:19.520 | And we get alerted on if any regress.

00:08:22.000 | So a lot better coverage there.

00:08:24.240 | This has led us to improve nearly 300 percent of our accuracy.

00:08:28.480 | I will say that is not saying that we're at 100 percent.

00:08:32.320 | We still have a lot of work there.

00:08:34.640 | But it is fortunate that we were able to improve with this process that we created.

00:08:39.040 | Now, we're very thankful for our customers.

00:08:42.480 | These are just a few shout outs of how they have received that product.

00:08:46.080 | This is using an older UI, but it's essentially the same product.

00:08:50.000 | Now, one thing about that product is a single shot approach can only take us so far.

00:08:56.560 | So this is the next iteration.

00:08:58.160 | As you all might imagine, it's a chat interface.

00:09:01.040 | We want to allow the user to interact with the editor as they're happening.

00:09:06.320 | So a progressive iterative approach.

00:09:08.720 | You can see in the demo here the gif that we not only did the same prompt that we did before,

00:09:15.600 | but we're also testing steps.

00:09:17.040 | We're also configuring fields all as quickly as possible to the user.

00:09:21.920 | This is actually not sped up.

00:09:23.520 | So we're really happy so far with the performance that we're getting out of this thing.

00:09:27.360 | The problem with it, though, now that it's kind of like an agent framework with multiple tools that it calls,

00:09:34.400 | we couldn't see what was the critical path.

00:09:37.200 | What is the things that we need to improve now to make the accuracy even better, to make the experience better?

00:09:42.160 | And this is where, again, brain trust came in.

00:09:44.320 | They have tracing capabilities.

00:09:47.040 | This allowed us to break down the request from very granular observability to a very fine look into the problem.

00:09:56.960 | And just as you would expect, you can actually see the inputs and outputs of a chat completion.

00:10:01.600 | The tokens, the time to response, you name it, it's available.

00:10:06.560 | And we can quickly iterate on that one as well with a playground that they have.

00:10:11.360 | I'm not showing that, but I just wanted to showcase that it's really easy as developers to go into it

00:10:16.560 | and really understand what is going on with the performance of the co-pilot.

00:10:22.000 | So one of the things that ended up happening with the co-pilot is early on, because we wanted to get to market first,

00:10:29.760 | we wanted to just do GPT 3.5 Turbo.

00:10:34.480 | And, you know, we started testing different models.

00:10:39.040 | And this view that you're seeing here was manually made for us,

00:10:42.800 | so that we can get a better sense of, you know, the performance of different models across different tools.

00:10:48.640 | And as you change those tools, what is the performance characteristic overall?

00:10:53.040 | And we settled on GPT 4 Turbo for our message router that we call.

00:10:57.840 | And unfortunately, that came at a cost of performance.

00:11:01.120 | That was a lot slower than before.

00:11:03.920 | And -- but at least we were able to provide some guarantees to our customers on accuracy.

00:11:08.480 | That led us to want to make that better.

00:11:12.000 | And we -- our evals started to show a regression.

00:11:16.240 | We were wondering what happened, and the reason for that was we switched to 4.0.

00:11:20.480 | Now, you might have already heard from the conference that 4.0 is -- you need to be careful with it.

00:11:26.240 | And, yeah, we actually stumbled on that one.

00:11:28.560 | But I can say that with a few changes that we had to make and a couple of parameters that we added to the OpenAI calls,

00:11:38.560 | we're kind of back to the same performance that we were before.

00:11:40.800 | And that's what I want to talk to you about right now.

00:11:42.800 | So, as you know, we can see all the examples or all the different runs that we've had with experiments.

00:11:50.880 | We noticed beforehand that we were 80% or better in most of our scores.

00:11:54.960 | After the change to the GPT 4.0, all of that regressed below 80%.

00:12:01.280 | So, we were really worried that, you know, we wanted the performance benefits and the cost benefits,

00:12:06.320 | but we didn't want to lower our scores.

00:12:08.720 | So, what should we do?

00:12:09.680 | This is showcasing some of the work that my colleague, Maggie Cody, had.

00:12:14.400 | So, shout out to her for her hard work in this.

00:12:17.280 | You can kind of see that all of our scores went down.

00:12:20.080 | And we can clearly see that there's a pattern to this.

00:12:22.400 | Like, what is going on with it?

00:12:24.240 | Like, drilling in further, we noticed that, you know, the 22 regressions that happened in here

00:12:29.920 | were all related to the OpenAI deciding to, like, forget our system prompt in some way.

00:12:37.520 | Or, like, to give an answer back that we didn't want to see or do.

00:12:41.760 | It just, in some ways, we had prompts that were too fine-tuned to GP 3.5 Turbo.

00:12:48.320 | So, we had to kind of regress our engineering, if you will, our prompt engineering.

00:12:53.200 | And that ended up allowing us to actually go back to the numbers that we were at.

00:12:57.520 | So, this is the example of that prompt, how we were a lot more elaborate with how we were

00:13:05.280 | asking the GPT 3.5 to, like, respect our wishes.

00:13:09.520 | And afterwards, we actually just relaxed it a lot more.

00:13:12.160 | And again, trial and error really quickly iterating on that loop that I mentioned earlier.

00:13:18.160 | And that is what led us to make these discoveries.

00:13:21.440 | Now, the other thing that I mentioned, just to showcase that, is, you know,

00:13:25.680 | we made a change to the tool choices.

00:13:27.760 | We deprecated the functions usage over to the tool choice.

00:13:30.800 | That tool choice auto is actually an asterisk there, because we are also going to experiment

00:13:36.640 | when making that required.

00:13:37.680 | So, that's coming up next.

00:13:39.680 | So yeah, overall, after those changes, we can see immediately that most of our scores went up.

00:13:46.000 | We're really happy that, for us, it's really easy to compare back and forth with previous runs.

00:13:52.560 | That's what's going on here on the top.

00:13:54.080 | And, since then, we've been able to adopt GPT 4.0.

00:13:59.920 | We still have more work to do there, like I mentioned, but it's an iterative approach.

00:14:04.240 | Before this adoption, we were around 14 seconds.

00:14:08.800 | And now, we're at three seconds for a stream-based co-pilot.

00:14:14.240 | And, of course, we had a lot more reduction, as we did before, or with this.

00:14:18.640 | So, in conclusion, I just wanted to share some of our stories of how we've worked with BrainTrust

00:14:25.440 | from the very beginning to make a great product.

00:14:29.040 | Really, we couldn't be happier working with them.

00:14:32.080 | And, I don't know, there's not much more to say to that.

00:14:35.840 | High five.

00:14:36.320 | Yeah.

00:14:36.880 | So, yeah, thank you.

00:14:40.480 | Thank you.

How Zapier Builds AI Products and Features with the Help of Braintrust: Ankur Goyal & Olmo Maldonado

Chapters