
Evals Are Not Unit Tests — Ido Pesok, Vercel v0


Chapters

0:00 Introduction to Vercel's v0 and its growth
1:00 The problem with AI unreliability
2:44 The "Fruit Letter Counter" app example of AI failure
3:33 Introducing "evals" and the basketball court analogy
5:09 Defining the "court": understanding the domain of user queries
7:53 Data collection for evals
9:13 Structuring evals: constants in data, variables in task
10:45 Scoring evals
12:35 Integrating evals into CI/CD
13:40 The benefits of using evals

Whisper Transcript

00:00:15.000 | My name is Ido. I'm an engineer at Vercel working on v0.
00:00:20.000 | If you don't know, v0 is a full-stack vibe coding platform.
00:00:25.000 | It's the easiest and fastest way to prototype, build on the web,
00:00:29.000 | and express new ideas.
00:00:31.000 | Here are some examples of cool things people have built and shared on Twitter.
00:00:35.000 | And to catch you up, we recently just launched GitHub Sync,
00:00:39.000 | so you can now push generated code to GitHub directly from v0.
00:00:42.000 | You can also automatically pull changes from GitHub into your chat,
00:00:47.000 | and furthermore, switch branches and open PRs to collaborate with your team.
00:00:51.000 | I'm very excited to announce we recently crossed 100 million messages sent,
00:00:56.000 | and we're really excited to keep growing from here.
00:01:00.000 | So, my goal for this talk is for it to be an introduction to evals,
00:01:04.000 | and specifically at the application layer.
00:01:06.000 | You may be used to evals at the model layer,
00:01:09.000 | which is what the research labs will cite in model releases.
00:01:12.000 | But this will be a focus on what evals mean for your users, your apps, and your data.
00:01:18.000 | The model is now in the wild, out of the lab, and it needs to work for your use case.
00:01:22.000 | And to do this, I have a story.
00:01:24.000 | It's a story about this app called Fruit Letter Counter.
00:01:28.000 | And if the name didn't already give it away, all it is is an app that counts the letters in fruit.
00:01:36.000 | So, the vision is we'll make a logo with ChatGPT.
00:01:39.000 | There might be market fit already, because everyone on X is dying to know the number of letters in fruit.
00:01:45.000 | If you didn't get it, it's a joke on the "how many R's in strawberry" prompt.
00:01:49.000 | We'll have v0 make all the UI and back end, and then we can ship.
00:01:55.000 | So, we had v0 write the code.
00:01:57.000 | It used the AI SDK to make the streamText call.
00:02:01.000 | And what do you know?
00:02:02.000 | It worked first try.
00:02:03.000 | GPT-4.1 said three.
00:02:05.000 | And not only did it say three once, I even tested it twice, and it worked both times in a row.
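For reference, the core of such a route with the AI SDK might look roughly like the sketch below. The route shape, model id, and prompt wording are illustrative guesses, not v0's actual generated code.

```ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

// Hypothetical API route for the Fruit Letter Counter.
export async function POST(req: Request) {
  const { question } = await req.json(); // e.g. "How many R's are in strawberry?"

  // Stream the model's answer back to the client.
  const result = streamText({
    model: openai("gpt-4.1"),
    prompt: question,
  });

  return result.toTextStreamResponse();
}
```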
00:02:11.000 | So, from there, we're good to ship, right?
00:02:14.000 | Let's launch on Twitter.
00:02:15.000 | Want to know how many letters are in a fruit?
00:02:17.000 | Just launched fruitlettercounter.io.
00:02:20.000 | The .com and .ai were taken.
00:02:23.000 | And, yeah, everything was going great.
00:02:26.000 | We launched and deployed on Vercel.
00:02:28.000 | We had Fluid Compute on, until I suddenly got this tweet.
00:02:32.000 | John said, I asked how many R's in strawberry, and it said two.
00:02:37.000 | So, of course, I just tested it twice.
00:02:39.000 | How is this even possible?
00:02:41.000 | But I think you get where I'm going with this, which is that, by nature, LLMs can behave very differently from run to run.
00:02:46.000 | LLMs can be very unreliable.
00:02:48.000 | And this principle scales from a small letter counting app all the way to the biggest AI apps in the world.
00:02:55.000 | The reason why it's so important to recognize this is because no one is going to use something that doesn't work.
00:03:00.000 | It's literally unusable.
00:03:01.000 | And this is a significant challenge when you're building AI apps.
00:03:05.000 | So I have a funny meme here.
00:03:08.000 | But basically, AI apps have this unique property.
00:03:10.000 | They're very, like, demo savvy.
00:03:12.000 | You'll demo it.
00:03:13.000 | It looks super good.
00:03:14.000 | You'll show it to your coworkers.
00:03:15.000 | And then you ship to prod.
00:03:17.000 | And then suddenly, hallucinations come and get you.
00:03:20.000 | So we always have this in the back of our head when we're building.
00:03:24.000 | Back to where we were, let's actually not give up, right?
00:03:28.000 | We actually want to solve this for our users.
00:03:30.000 | We want to make a really good fruit letter counting app.
00:03:33.000 | So you might say, how do we make reliable software that uses LLMs?
00:03:37.000 | Our initial prompt was a simple question, right?
00:03:40.000 | But maybe we can try prompt engineering.
00:03:42.000 | Maybe we can add some chain of thought, something else to make it more reliable.
00:03:46.000 | So we spend all night working on this new prompt.
00:03:49.000 | You're an exuberant, fruit-loving AI on an epic quest, dot, dot, dot.
00:03:54.000 | And this time, we actually tested it ten times in a row on ChatGPT.
00:03:58.000 | And it worked every single time.
00:04:00.000 | Ten times in a row.
00:04:01.000 | It's amazing.
00:04:02.000 | So we ship.
00:04:04.000 | And everything was going great until John tweeted at me again.
00:04:09.000 | And he said, I asked how many Rs are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry.
00:04:16.000 | And it said five.
00:04:19.000 | So we failed John again.
00:04:21.000 | Although this example is pretty simple, this is actually what will happen when you start deploying to production.
00:04:26.000 | You'll get users that come up with queries you could have never imagined.
00:04:30.000 | And you actually have to start thinking about how do we solve it.
00:04:33.000 | And the interesting thing, if you think about it, is 95% of our app works 100% of the time.
00:04:39.000 | We can have unit tests for every single function, end-to-end tests for the auth, the login, the sign-out.
00:04:45.000 | It will all work.
00:04:46.000 | But it's that most crucial 5% that can fail on us.
00:04:49.000 | So let's improve it.
00:04:51.000 | Now, to visualize this, I have a diagram for you.
00:04:54.000 | Hopefully, you can see the code.
00:04:56.000 | Maybe I need to make my screen brighter.
00:04:58.000 | Can you see the code?
00:04:59.000 | I don't know.
00:05:00.000 | Okay.
00:05:01.000 | Okay.
00:05:02.000 | Well, we'll come back to this.
00:05:06.000 | But basically, we're going to start building evals.
00:05:09.000 | And to visualize this, I have a basketball court.
00:05:12.000 | So today's day one of the NBA finals.
00:05:14.000 | I don't know if you care.
00:05:15.000 | You don't need to know much about basketball.
00:05:17.000 | But just know that someone is trying to throw a ball in the basket.
00:05:21.000 | And here, the basket is the glowing golden circle.
00:05:25.000 | So blue will represent a shot make.
00:05:28.000 | And red will represent a shot miss.
00:05:30.000 | And one property to consider is that the farther away your shot is from the basket, the harder it is.
00:05:37.000 | Another property is that the court has boundaries.
00:05:39.000 | So this blue dot, although the shot goes in, it's out of the court.
00:05:44.000 | So it doesn't really count in the game.
00:05:46.000 | Let's start plotting our data.
00:05:48.000 | So here, we have a question, how many Rs in strawberry?
00:05:52.000 | This, after our new prompt, will probably work.
00:05:54.000 | So we'll label it blue.
00:05:56.000 | And we'll put it close to the basket because it's pretty easy.
00:05:59.000 | However, how many Rs are in that big array?
00:06:01.000 | We'll label it red.
00:06:02.000 | And we'll put it farther away from the basket.
00:06:06.000 | Hopefully, you can see that.
00:06:07.000 | Maybe we can make it a little bit brighter.
00:06:09.000 | But this is the data part of our eval.
00:06:12.000 | Basically, you're trying to collect what prompts your users are asking.
00:06:16.000 | And you want to just store this over time and keep building it
00:06:19.000 | and store where these points are on your court.
00:06:22.000 | Two more prompts I want to bring up is like, what if someone says,
00:06:25.000 | how many Rs are in strawberry, pineapple, dragon fruit, mango,
00:06:28.000 | after we replace all the vowels with Rs?
00:06:32.000 | Right?
00:06:33.000 | Insane prompt.
00:06:34.000 | But it's still technically in our domain.
00:06:36.000 | So we'll label it as red all the way down there.
00:06:40.000 | But a funny one is, how many syllables are in carrot?
00:06:43.000 | So this, we'll call it out of bounds.
00:06:46.000 | Right?
00:06:47.000 | None of our users are actually going to ask that.
00:06:48.000 | It's not part of our app.
00:06:49.000 | So no one is going to care.
00:06:51.000 | I hope you can see the code.
00:06:55.000 | But basically, when you're making an eval,
00:06:58.000 | here's how you can think about it.
00:06:59.000 | Your data is the point on the court.
00:07:01.000 | Your shot, or in this case what Braintrust calls a task,
00:07:05.000 | is the way you shoot the ball towards the basket.
00:07:07.000 | And your score is basically a check of did it go in the basket
00:07:10.000 | or did it not go in the basket.
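In Braintrust's SDK, those three pieces map onto an Eval call. Here's a minimal sketch; the dataset rows and the inline generateText task are stand-ins for your real pipeline, and the custom-scorer signature follows Braintrust's convention, which may vary slightly between SDK versions.

```ts
import { Eval } from "braintrust";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

Eval("fruit-letter-counter", {
  // Data: the points on the court — real user queries plus expected answers.
  data: () => [
    { input: "How many R's are in strawberry?", expected: "3" },
    { input: "How many A's are in banana?", expected: "3" },
  ],
  // Task: the shot — run one query through the system under test.
  task: async (input: string) => {
    const { text } = await generateText({
      model: openai("gpt-4.1"),
      prompt: input,
    });
    return text;
  },
  // Score: did it go in the basket? A simple pass/fail check.
  scores: [
    ({ output, expected }: { output: string; expected?: string }) => ({
      name: "correct_count",
      score: expected !== undefined && output.includes(expected) ? 1 : 0,
    }),
  ],
});
```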
00:07:12.000 | To make good evals, you must understand your court.
00:07:15.000 | This is the most important step.
00:07:18.000 | And you have to be careful of falling into some traps.
00:07:22.000 | First is the out-of-bounds trap.
00:07:24.000 | Don't spend time making evals for your data your users don't care about.
00:07:28.000 | You have enough problems, I promise you,
00:07:30.000 | queries that your users do care about.
00:07:33.000 | So be careful of just trying to feel productive,
00:07:36.000 | where you're making a lot of evals,
00:07:38.000 | but they're not really applicable to your app.
00:07:40.000 | And another visualization is don't have a concentrated set of points.
00:07:44.000 | When you really understand your court,
00:07:46.000 | you're going to understand where the boundaries are,
00:07:48.000 | and you want to make sure you test across the entire court.
00:07:51.000 | A lot of people have been talking about this today,
00:07:55.000 | but to collect as much data as possible,
00:07:57.000 | here are some things you can do.
00:07:59.000 | First is collect thumbs up, thumbs down data.
00:08:01.000 | This can be noisy, but it also can be really, really good signal
00:08:04.000 | as to where your app is struggling.
00:08:06.000 | Another thing is if you have observability, which is highly recommended,
00:08:10.000 | you can just read through random samples in your logs.
00:08:14.000 | Although users might not be giving you explicit signal,
00:08:16.000 | if you take like 100 random samples and go through them like once a week,
00:08:20.000 | you'll get a really good understanding of who your users are
00:08:23.000 | and how they're using the product.
00:08:26.000 | If you have community forums, these are also great.
00:08:28.000 | People will often report issues they're having with the LLM,
00:08:31.000 | and X (Twitter) is also great, but can be noisy.
00:08:35.000 | And there really is no shortcut here.
00:08:37.000 | You really have to do the work and understand what your court looks like.
00:08:41.000 | So here is actually what it should look like if you are doing a good job of understanding your court
00:08:46.000 | and a good job of building your data set.
00:08:49.000 | You should know the boundaries, you should be testing in your boundaries,
00:08:52.000 | and you should understand where your system has blue versus where it has red.
00:08:57.000 | So here it's really easy to tell, okay, maybe next week we need to prioritize
00:09:02.000 | the team to work on that bottom right corner.
00:09:05.000 | This is something where a lot of users are struggling,
00:09:07.000 | and we can really do a good job on flipping the tiles from red to blue.
00:09:12.000 | Another thing you can do, and I really hope you can see this,
00:09:18.000 | is put constants in the data and variables in the task.
00:09:23.000 | So just like in math or programming, you want to factor out constants
00:09:27.000 | because it improves clarity, reuse, and generalization.
00:09:30.000 | Let's say you want to test your system prompt, right?
00:09:34.000 | Keep the data constant: the queries that your users are going to ask.
00:09:38.000 | So for example, "how many R's in strawberry?" goes in the data.
00:09:40.000 | That's a constant.
00:09:41.000 | It's never going to change throughout your app.
00:09:43.000 | But what you're going to test is in that task,
00:09:45.000 | you're going to try different system prompts.
00:09:47.000 | You might try different pre-processing, different RAG,
00:09:49.000 | and that's what you want to put in your task section.
00:09:52.000 | This way your app actually scales, and you never have to redo all your data
00:09:55.000 | when you, let's say, change your system prompt.
00:09:57.000 | And this is a really nice feature of Braintrust.
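Concretely, that split might look like the sketch below, where loadUserQueries, runPipeline, and correctCount are hypothetical helpers (correctCount is sketched later in the scoring section).

```ts
import { Eval } from "braintrust";

// Constant: the dataset of real user queries never changes between experiments.
const data = loadUserQueries();

// Variable: the thing under test — different system prompts, pre-processing, RAG.
const systemPrompts = {
  baseline: "You count letters in fruit names. Answer with a number.",
  chainOfThought:
    "You count letters in fruit names. Think step by step, then answer with a number.",
};

// One experiment per variant, all shooting at the same points on the court.
for (const [name, systemPrompt] of Object.entries(systemPrompts)) {
  Eval(`fruit-letter-counter-${name}`, {
    data: () => data,
    task: (input: string) => runPipeline(input, { systemPrompt }),
    scores: [correctCount],
  });
}
```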
00:10:00.000 | And if you don't know, AI SDK actually offers a thing called middleware,
00:10:06.000 | and it's a really good abstraction to put basically all your logic
00:10:10.000 | of pre-processing.
00:10:11.000 | So RAG, your system prompt, et cetera, can go in here.
00:10:14.000 | And you can now share this between your actual API route
00:10:17.000 | that's doing the completion and your evals.
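A rough sketch of that pattern follows. wrapLanguageModel and the transformParams hook below follow the AI SDK's middleware API, but the exact type names vary between SDK versions, so treat this as illustrative rather than definitive.

```ts
import { wrapLanguageModel, type LanguageModelV1Middleware } from "ai";
import { openai } from "@ai-sdk/openai";

const SYSTEM_PROMPT = "You count letters in fruit names. Answer with a number.";

// All pre-processing (system prompt, RAG, etc.) lives in one middleware...
const fruitCounterMiddleware: LanguageModelV1Middleware = {
  transformParams: async ({ params }) => ({
    ...params,
    // Prepend the shared system prompt to every call.
    prompt: [{ role: "system", content: SYSTEM_PROMPT }, ...params.prompt],
  }),
};

// ...so the API route and the evals share exactly the same model setup.
export const fruitCounterModel = wrapLanguageModel({
  model: openai("gpt-4.1"),
  middleware: fruitCounterMiddleware,
});
```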
00:10:19.000 | So think of the basketball court as if we're going to basketball practice,
00:10:24.000 | and we're trying to practice our system across different models,
00:10:28.000 | you want your practice to be as similar as possible to the real game.
00:10:32.000 | That's what makes a good practice.
00:10:33.000 | So you want to share pretty much the exact same code between the evals
00:10:37.000 | and what you're actually running.
00:10:39.000 | Now, I want to talk a little bit about scores,
00:10:42.000 | which is the last step of the eval.
00:10:44.000 | The unfortunate thing is it does vary greatly depending on your data and your domain.
00:10:50.000 | So in this case, it's like super simple.
00:10:52.000 | You're just checking if the output contains the correct number of letters.
00:10:57.000 | But if you're doing writing or tasks like writing,
00:11:00.000 | scoring is very, very difficult.
00:11:02.000 | As a principle, you want to lean towards deterministic scoring
00:11:06.000 | and pass/fail.
00:11:07.000 | This is because when you're doing debugging,
00:11:09.000 | you're going to get a ton of input and logs,
00:11:12.000 | and you want to make it as easy as possible
00:11:14.000 | for you to actually figure out what's going wrong.
00:11:16.000 | So if you're over-engineering your score,
00:11:19.000 | it might be very difficult to share with your team
00:11:21.000 | and distribute your evals across different teams,
00:11:25.000 | because no one will understand how these things are getting scored.
00:11:27.000 | Keep your scores as simple as possible.
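For this app, the correctCount scorer referenced in the earlier sketch can be simple enough that anyone on the team can read it and know exactly why a row failed. A sketch:

```ts
// Deterministic pass/fail: does the output contain the expected count?
// When a row fails, the reason is obvious from the data itself.
function correctCount({ output, expected }: { output: string; expected?: string }) {
  return {
    name: "correct_count",
    score: expected !== undefined && output.includes(expected) ? 1 : 0,
  };
}
```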
00:11:30.000 | And a good question to ask yourself is when you're looking at the data,
00:11:34.000 | what am I looking for to see if this failed?
00:11:36.000 | So with v0, we're looking for whether the code didn't work.
00:11:40.000 | But maybe for writing, you're looking for certain linguistics.
00:11:44.000 | Ask yourself that question and write the code that looks for that.
00:11:47.000 | There are some cases where it's so hard to write the code
00:11:50.000 | that you may need to do human review, and that's okay.
00:11:54.000 | At the end of the day, you want to build your court
00:11:56.000 | and you want to collect signal.
00:11:57.000 | Even if you must do human review to get the correct signal,
00:12:01.000 | don't worry.
00:12:02.000 | If you do the correct practice, it will pay off in the long run
00:12:05.000 | and you'll get better results for your users.
00:12:08.000 | One trick you can do for scoring is don't be scared
00:12:12.000 | to add a little bit of extra prompt to the original prompt.
00:12:17.000 | So for example, here we can say output your final answer
00:12:20.000 | in these answer tags.
00:12:22.000 | What this will do is basically make it very easy
00:12:24.000 | for you to do string matching and so on,
00:12:28.000 | whereas in production, you don't really want this.
00:12:30.000 | But yeah, you can do some little tweaks to your prompt
00:12:33.000 | so that scoring is easier.
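For example, appending an eval-only suffix and extracting the tagged answer might look like this sketch (the suffix wording and tag format are illustrative choices):

```ts
// Eval-only addition to the prompt — not something you'd ship to production.
const EVAL_SUFFIX =
  "\n\nOutput your final answer inside <answer></answer> tags.";

// Pull the final answer out of the tags so scoring is plain string matching.
function extractAnswer(output: string): string | null {
  const match = output.match(/<answer>\s*([\s\S]*?)\s*<\/answer>/);
  return match ? match[1] : null;
}
```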
00:12:36.000 | Another thing we really highly recommend is add evals to your CI.
00:12:40.000 | So Braintrust is really nice because you can get these eval reports.
00:12:43.000 | So it will run your task across all your data,
00:12:47.000 | and then it will give you this report at the end
00:12:50.000 | for the improvements and regressions.
00:12:52.000 | Assume my colleague made a PR that changes a bit of the prompt.
00:12:55.000 | We want to know, like, how did it do across the court, right?
00:12:58.000 | Visualize, like, did it change more tiles from red to blue?
00:13:01.000 | Maybe now our prompt fixed one part,
00:13:03.000 | but it broke the other part of our app.
00:13:05.000 | So this is a really useful report to have when you're doing PRs.
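One way to wire this up is to run the eval in CI and fail the job on regression. The sketch below reuses the hypothetical helpers from the earlier sketches, and the shape of the returned summary is an assumption about the Braintrust SDK (it varies by version), so check your SDK's types before relying on it.

```ts
import { Eval } from "braintrust";

// Run the eval suite and fail the CI job if the pass rate drops too low.
const result = await Eval("fruit-letter-counter", {
  data: () => loadUserQueries(),
  task: (input: string) => runPipeline(input, { systemPrompt: SYSTEM_PROMPT }),
  scores: [correctCount],
});

// Assumed summary shape: per-scorer aggregates keyed by scorer name.
const passRate = result.summary?.scores?.correct_count?.score ?? 0;
if (passRate < 0.9) {
  console.error(`Eval pass rate ${passRate} is below the 0.9 threshold.`);
  process.exit(1);
}
```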
00:13:09.000 | So yeah, going back, this is the summary of the talk.
00:13:13.000 | You want to make your evals a court of your data.
00:13:17.000 | And this, you can treat it like practice.
00:13:19.000 | Your model is basically going to practice.
00:13:21.000 | Maybe you want to switch players, right?
00:13:23.000 | When you switch models, you can see how a different player
00:13:25.000 | is going to perform in your practice.
00:13:27.000 | But this gives you such a good understanding
00:13:29.000 | of how your system is doing when you change things,
00:13:32.000 | like maybe your RAG or your system prompt.
00:13:34.000 | And you can now go to your colleague and say, hey,
00:13:37.000 | this actually did help our app, right?
00:13:39.000 | Because improvement without measurement is limited and imprecise.
00:13:44.000 | And evals give you the clarity you need
00:13:46.000 | to systematically improve your app.
00:13:50.000 | When you do that, you're going to get better reliability and quality,
00:13:54.000 | higher conversion and retention,
00:13:56.000 | and you also get to just spend less time on support ops, right?
00:13:59.000 | Because your evals, your practice environment
00:14:01.000 | will take care of that for you.
00:14:03.000 | And if you're wondering about how I built all these court diagrams,
00:14:06.000 | I actually just used v0, and it made me an app
00:14:08.000 | where I just added these shots made and missed on the court.
00:14:13.000 | So yeah, thank you very much.
00:14:15.000 | I hope you learned a little bit about evals.
00:14:17.000 | Thank you.
00:14:19.000 | So we do have some time for some questions.
00:14:21.000 | There are two mics, one over here, one over there.
00:14:24.000 | I can take two or three of those, please,
00:14:27.000 | if anybody's interested in asking.
00:14:29.000 | We have one over there.
00:14:31.000 | Mic five, please.
00:14:34.000 | Or you can repeat the question as well, if you don't mind.
00:14:38.000 | Do you run the same eval again?
00:14:42.000 | Yeah.
00:14:43.000 | Yeah, you can think of it.
00:14:44.000 | It's really like practice.
00:14:45.000 | Like maybe you're a basketball player,
00:14:47.000 | who, you know, in general scores like 90%,
00:14:50.000 | but they might miss more shots here or there.
00:14:52.000 | For us, we run it every day at least.
00:14:56.000 | And then we get a good sense of like,
00:14:58.000 | where are we actually like failing?
00:15:00.000 | Did we have some regression?
00:15:01.000 | So yeah, running it like daily or at least in some schedule
00:15:04.000 | will give you a good idea.
00:15:06.000 | I was thinking what if you ran like, you know,
00:15:08.000 | the same question through it five times, right?
00:15:10.000 | Yeah.
00:15:11.000 | Like what's the percentage?
00:15:12.000 | Is it making it four out of five or, you know, five out of five?
00:15:14.000 | Oh, I see.
00:15:15.000 | So it's definitely, like, as you go further away,
00:15:18.000 | the harder the questions get...