Evals Are Not Unit Tests — Ido Pesok, Vercel v0

Chapters
0:00 Introduction to Vercel's v0 and its growth
1:00 The problem with AI unreliability
2:44 The "Fruit Letter Counter" app example of AI failure
3:33 Introducing "evals" and the basketball court analogy
5:09 Defining the "court": understanding the domain of user queries
7:53 Data collection for evals
9:13 Structuring evals: constants in data, variables in task
10:45 Scoring evals
12:35 Integrating evals into CI/CD
13:40 The benefits of using evals
My name is Ido. I'm an engineer at Vercel working on v0. If you don't know, v0 is a full-stack vibe coding platform. It's the easiest and fastest way to prototype and build on the web. Here are some examples of cool things people have built and shared on Twitter.

And to catch you up, we just recently launched GitHub Sync, so you can now push generated code to GitHub directly from v0. You can also automatically pull changes from GitHub into your chat, and furthermore, switch branches and open PRs to collaborate with your team. I'm very excited to announce we recently crossed 100 million messages sent, and we're really excited to keep growing from here.
So my goal for this talk is for it to be an introduction to evals, which are what the research labs cite in model releases. But the focus will be on what evals mean for your users, your apps, and your data. The model is now in the wild, out of the lab, and it needs to work for your use case.
It's a story about this app called Fruit Letter Counter. And if the name didn't already give it away, all it is is an app that counts the letters in fruit. So the vision is: we'll make a logo with ChatGPT. There might be market fit already, because everyone on X is dying to know the number of letters in fruit. If you didn't get it, it's a joke on the "how many R's in strawberry" prompt. We'll have v0 make all the UI and back end, and then we can ship.

I asked it how many R's are in strawberry, and not only did it say three once, I even tested it twice, and it worked both times in a row. So we ship it: want to know how many letters are in a fruit? We had Fluid compute on, and everything was fine until I suddenly get this tweet. John said: I asked how many R's in strawberry, and it said two.
But I think you get where I'm going with this, which is that, by nature, LLM outputs can vary from run to run. And this principle scales from a small letter-counting app all the way to the biggest AI apps in the world. The reason it's so important to recognize this is that no one is going to use something that doesn't work. And this is a significant challenge when you're building AI apps. Basically, AI apps have this unique property where hallucinations can suddenly come and get you. So we always have this in the back of our heads when we're building.

Back to where we were: let's actually not give up, right? We actually want to solve this for our users. We want to make a really good fruit letter counting app.
So you might ask: how do we make reliable software that uses LLMs? Our initial prompt was a simple question, right? Maybe we can add some chain of thought or something else to make it more reliable. So we spend all night working on this new prompt: "You're an exuberant, fruit-loving AI on an epic quest..." And this time we actually tested it ten times in a row on ChatGPT.

And everything was going great until John tweeted at me again. He said: I asked how many R's are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry... Although this example is pretty simple, this is actually what will happen when you start deploying to production. You'll get users that come up with queries you could have never imagined. And you actually have to start thinking about how to solve it.
And the interesting thing, if you think about it, is that 95% of our app works 100% of the time. We can have unit tests for every single function, end-to-end tests for the auth, the login, the sign-out. But it's that most crucial 5% that can fail on us.
Now, to visualize this, I have a diagram for you. Basically, we're going to start building evals, and to visualize them, I have a basketball court. You don't need to know much about basketball; just know that someone is trying to throw a ball into the basket. And here, the basket is the glowing golden circle.
One property to consider is that the farther away your shot is from the basket, the harder it is. Another property is that the court has boundaries. So this blue dot: although the shot goes in, it's outside the court.

So here we have a question: how many R's in strawberry? This, after our new prompt, will probably work, and we'll put it close to the basket because it's pretty easy. A harder prompt we'll put farther away from the basket. Basically, you're trying to collect what prompts your users are asking, and you want to store this over time, keep building it up, and track where these points land on your court. Two more prompts I want to bring up: what if someone asks how many R's are in strawberry, pineapple, dragon fruit, mango, and so on? That one misses, so we'll label it red, all the way down there.
But a funny one is: how many syllables are in carrot? That one lands outside the court entirely; it's not really what our app is for.
Your shot, or in this case what Braintrust calls a task, is the way you shoot the ball towards the basket. And your score is basically a check of whether it went in the basket or not.

To make good evals, you must understand your court, and you have to be careful not to fall into some traps. Don't spend time making evals for data your users don't care about; those are shots outside the court that might go in, but they're not really applicable to your app. Another thing to avoid is a concentrated set of points. By spreading them out, you're going to understand where the boundaries are, and you want to make sure you test across the entire court.
A lot of people have been talking about this today: how do you actually collect this data? First is to collect thumbs up / thumbs down data. This can be noisy, but it can also be really, really good signal. Another thing is, if you have observability, which is highly recommended, you can just read through random samples in your logs. Users might not be giving you explicit signal, but if you take like 100 random samples and go through them once a week, you'll get a really good understanding of what your users are asking. If you have community forums, these are also great; people will often report issues they're having with the LLM. X (Twitter) is also great, but can be noisy.
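To make the log-sampling habit concrete, here is a minimal TypeScript sketch. The `Trace` shape and the `fetchTraces` function are assumptions standing in for whatever your observability tool exposes; the idea is just to surface thumbs-down traces first and top up with a random sample for the weekly read-through.

```ts
// Hypothetical sketch: build a weekly review sample from logged traces.
// `Trace` and `fetchTraces` stand in for your observability tool's API.
type Trace = { id: string; prompt: string; output: string; feedback?: "up" | "down" };

export async function weeklyReviewSample(
  fetchTraces: () => Promise<Trace[]>,
  size = 100,
): Promise<Trace[]> {
  const traces = await fetchTraces();
  // Thumbs-down traces are the strongest signal, so always include them.
  const flagged = traces.filter((t) => t.feedback === "down");
  // Top up with a rough random sample of everything else (not a perfect shuffle,
  // but fine for a weekly read-through).
  const rest = traces
    .filter((t) => t.feedback !== "down")
    .sort(() => Math.random() - 0.5)
    .slice(0, Math.max(0, size - flagged.length));
  return [...flagged, ...rest].slice(0, size);
}
```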
You really have to do the work and understand what your court looks like. So if you are doing a good job of understanding your court and a good job of building your data set, this is what it should look like. You should know the boundaries, you should be testing within your boundaries, and you should understand where your system has blue versus where it has red. So here it's really easy to tell: okay, maybe next week we need to prioritize the team to work on that bottom-right corner. This is somewhere a lot of users are struggling, and we can really do a good job of flipping the tiles from red to blue.
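One way to make those red and blue tiles concrete, as a rough sketch of my own rather than something from the talk: tag each eval case with a region of the court (a category, say), then aggregate pass rates per tag so you can see which corner to prioritize next week.

```ts
// Sketch: aggregate eval results by category to find the "red" corners of the court.
// The EvalResult shape and category names are illustrative assumptions.
type EvalResult = { category: string; passed: boolean };

export function passRateByCategory(results: EvalResult[]): Record<string, number> {
  const byCategory = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const entry = byCategory.get(r.category) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (r.passed) entry.passed += 1;
    byCategory.set(r.category, entry);
  }
  return Object.fromEntries(
    [...byCategory].map(([category, { passed, total }]) => [category, passed / total]),
  );
}

// Example output: { "single-fruit": 1.0, "multi-fruit": 0.4 } → prioritize multi-fruit next.
```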
Another thing you can do, and I really hope you can see this, is put constants in data, variables in the task. Just like in math or programming, you want to factor out constants; it improves clarity, reuse, and generalization. Let's say you want to test your system prompt, right? Keep the data constant: the things your users are going to ask. So for example, "how many R's in strawberry" goes in the data. It's never going to change throughout your app. But what you're going to test goes in the task: you'll try different system prompts, different pre-processing, different RAG, and that's what you want to put in your task section. This way it actually scales, and you never have to redo all your data when, say, you change your system prompt. And this is a really nice feature of Braintrust.
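As a minimal sketch of that split, assuming the Braintrust TypeScript `Eval` API together with the AI SDK's `generateText` (the model name, prompts, and scorer below are placeholders): the user questions live in `data`, and the system prompt you're iterating on lives in `task`.

```ts
import { Eval } from "braintrust";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Constants: the questions users actually ask. These never change when you tweak the system.
const data = [
  { input: "How many r's are in strawberry?", expected: "3" },
  { input: "How many r's are in strawberry, mango, and kiwi?", expected: "3" },
];

// Variable: the thing under test (system prompt, model, RAG, pre-processing).
const SYSTEM_PROMPT = "You're an exuberant, fruit-loving AI on an epic quest...";

// Deterministic scorer: does the answer contain the expected count?
const containsCount = (args: { output: string; expected?: string }) => ({
  name: "contains_count",
  score: args.expected !== undefined && args.output.includes(args.expected) ? 1 : 0,
});

Eval("fruit-letter-counter", {
  data: () => data,
  task: async (input) => {
    const { text } = await generateText({
      model: openai("gpt-4o"), // placeholder model
      system: SYSTEM_PROMPT,
      prompt: input,
    });
    return text;
  },
  scores: [containsCount],
});
```

With this shape, changing the system prompt or model only touches the task; the data stays put and keeps accumulating as you learn more about your court.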
And if you don't know, the AI SDK actually offers a thing called middleware, and it's a really good abstraction to put basically all your logic in one place. So RAG, your system prompt, and so on can go in here, and you can now share this between your actual API route and your evals. If you think about the basketball court as basketball practice, where we're practicing our system across different models, you want your practice to be as similar as possible to the real game. So you want to share pretty much the exact same code between the evals and your production app.
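Here is a rough sketch of that idea, assuming a recent AI SDK where `wrapLanguageModel` and the middleware type are exported from `ai` (the exact names have shifted across versions, so treat this as illustrative). The shared middleware carries the system prompt, and would carry RAG and pre-processing too, and both the API route and the eval task import the same wrapped model.

```ts
import { wrapLanguageModel, type LanguageModelV1Middleware } from "ai";
import { openai } from "@ai-sdk/openai";

// Shared "practice equals game" logic: the same system prompt (plus RAG,
// pre-processing, ...) is applied whether the model is called from the
// production API route or from the eval task.
const fruitCounterMiddleware: LanguageModelV1Middleware = {
  transformParams: async ({ params }) => ({
    ...params,
    prompt: [
      { role: "system", content: "You're an exuberant, fruit-loving AI on an epic quest..." },
      ...params.prompt,
    ],
  }),
};

// Import this single wrapped model in both places.
export const fruitCounterModel = wrapLanguageModel({
  model: openai("gpt-4o"), // placeholder model
  middleware: fruitCounterMiddleware,
});
```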
Now, I want to talk a little bit about scores. The unfortunate thing is that scoring varies greatly depending on your data. For the fruit letter counter it's easy: you're just checking if the output contains the correct number of letters. But if you're doing writing, or tasks like writing, it's much harder to pin down. From first principles, you want to lean towards deterministic scoring, because it makes it much easier for you to figure out what's going wrong. If you're over-engineering your score, it might be very difficult to share your evals with your team and distribute them across different teams, because no one will understand how these things are getting scored.

And a good question to ask yourself when you're looking at the data is: what are you actually looking for? With v0, we're looking for whether the code worked or not. But maybe for writing, you're looking for certain linguistic qualities. Ask yourself that question and write the code that looks for that. There are some cases where it's so hard to write the code that you may need to do human review, and that's okay. At the end of the day, you want to build your court with the correct signal. Even if you must do human review to get that signal, do it. If you do the correct practice, it will pay off in the long run, and you'll get better results for your users.
One trick you can do for scoring is: don't be scared to add a little bit of extra prompt to the original prompt. So for example, here we can say: output your final answer in a specific format. What this will do is basically make it very easy to parse and score deterministically, whereas in production you don't really want that extra formatting. But yeah, you can do some little tweaks to your prompt to make scoring easier.
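A small sketch of that trick, with my own wording of the extra instruction rather than the exact one from the talk: at eval time, append a line asking for the final answer inside `<answer>` tags, and the scorer just parses the tag and compares numbers.

```ts
// Eval-time only: nudge the model into an easy-to-parse format.
export function withAnswerTag(prompt: string): string {
  return `${prompt}\n\nOutput your final answer inside <answer></answer> tags.`;
}

// Deterministic scorer: extract the tagged answer and compare it to the expected count.
export function answerTagScorer(args: { output: string; expected: string }) {
  const match = args.output.match(/<answer>\s*(\d+)\s*<\/answer>/i);
  return {
    name: "answer_tag_match",
    score: match !== null && match[1] === args.expected ? 1 : 0,
  };
}
```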
Another thing we really highly recommend is to add evals to your CI. Braintrust is really nice because you can get these eval reports: it will run your task across all your data and then give you a report at the end. Say my colleague made a PR that changes a bit of the prompt. We want to know how it did across the court, right? Did it change more tiles from red to blue? So this is a really useful report to have when you're doing PRs.
So yeah, going back, this is the summary of the talk. You want to make your evals core to your app and grounded in your data. When you switch models, you can see how a different player performs on your court, and you get a measure of how your system is doing when you change things. And you can now go to your colleague and point at the numbers, because improvement without measurement is limited and imprecise. When you do that, you're going to get better reliability and quality, and you also get to spend less time on support ops, because your evals, your practice environment, catch issues before your users hit them. And if you're wondering how I built all these court diagrams, I actually just used v0; it made me an app where I just added these shots, made and missed, on the court.
There are two mics, one over here, one over there. Or you can repeat the question as well, if you don't mind.

...but they might miss more shots here or there. If you run it like we do, we run it every day at least, so yeah, running it daily, or at least on some schedule, helps.

I was thinking: what if you ran the same question through it five times, right? Is it making it four out of five, or five out of five?

So it's definitely, like, as you go further away...
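To make the four-out-of-five idea concrete, here is a small sketch, with illustrative names, of running the same eval case several times and reporting the pass rate.

```ts
// Sketch: run the same eval case N times and report the pass rate (e.g. 4/5).
// `task` and `scorer` are whatever your eval already uses; the names here are illustrative.
export async function passRate(
  task: (input: string) => Promise<string>,
  scorer: (output: string) => boolean,
  input: string,
  runs = 5,
): Promise<number> {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    const output = await task(input);
    if (scorer(output)) passed += 1;
  }
  return passed / runs; // 0.8 means "four out of five"
}
```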