
Evals Are Not Unit Tests — Ido Pesok, Vercel v0


Chapters

0:00 Introduction to Vercel's v0 and its growth
1:00 The problem with AI unreliability
2:44 The "Fruit Letter Counter" app example of AI failure
3:33 Introducing "evals" and the basketball court analogy
5:09 Defining the "court": understanding the domain of user queries
7:53 Data collection for evals
9:13 Structuring evals: constants in data, variables in task
10:45 Scoring evals
12:35 Integrating evals into CI/CD
13:40 The benefits of using evals

Transcript

My name is Ido. I'm an engineer at Vercel working on v0. If you don't know, v0 is a full-stack vibe coding platform. It's the easiest and fastest way to prototype, build on the web, and express new ideas. Here are some examples of cool things people have built and shared on Twitter.

And to catch you up, we recently launched GitHub Sync, so you can now push generated code to GitHub directly from v0. You can also automatically pull changes from GitHub into your chat, and furthermore, switch branches and open PRs to collaborate with your team. I'm very excited to announce we recently crossed 100 million messages sent, and we're really excited to keep growing from here.

So, my goal for this talk is for it to be an introduction to evals, specifically at the application layer. You may be used to evals at the model layer, which is what the research labs cite in model releases. But this will be a focus on what evals mean for your users, your apps, and your data.

The model is now in the wild, out of the lab, and it needs to work for your use case. And to do this, I have a story. It's a story about this app called Fruit Letter Counter. And if the name didn't already give it away, all it is is an app that counts the letters in fruit.

So, the vision is we'll make a logo with ChatGPT. There might be market fit already, because everyone on X is dying to know the number of letters in fruit. If you didn't get it, it's a joke on the "how many R's in strawberry" prompt. We'll have v0 make all the UI and back end, and then we can ship.

So, we had v0 write the code. It used the AI SDK to do the streamText call. And what do you know? It worked first try. GPT-4.1 said three. And not only did it say three once, I even tested it twice, and it worked both times in a row. So, from there, we're good to ship, right?
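For context, the generated route is essentially a thin wrapper around the AI SDK's streamText call. Here is a minimal sketch of that shape; the route path, model wiring, and prompt handling are my own illustrative assumptions, not v0's actual output:

```ts
// app/api/count/route.ts: a minimal sketch of the completion route
// (illustrative; not the code v0 actually generated).
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";

export async function POST(req: Request) {
  // e.g. { "question": "How many R's are in strawberry?" }
  const { question } = await req.json();

  const result = streamText({
    model: openai("gpt-4.1"),
    prompt: question,
  });

  // Stream the model's answer back to the client.
  return result.toTextStreamResponse();
}
```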

Let's launch on Twitter. Want to know how many letters are in a fruit? Just launched fruitlettercounter.io. The .com and .ai were taken. And, yeah, everything was going great. We launched and deployed on Vercel. We had fluid compute on, until I suddenly get this tweet. John said, I asked how many R's are in strawberry, and it said two.

So, of course, I just tested it twice. How is this even possible? But I think you get where I'm going with this, which is that by nature, LLM outputs can differ from run to run. LLMs can be very unreliable. And this principle scales from a small letter-counting app all the way to the biggest AI apps in the world.

The reason why it's so important to recognize this is because no one is going to use something that doesn't work. It's literally unusable. And this is a significant challenge when you're building AI apps. So I have a funny meme here. But basically, AI apps have this unique property. They're very, like, demo savvy.

You'll demo it. It looks super good. You'll show it to your coworkers. And then you ship to prod. And then suddenly, hallucinations come and get you. So we always have this in the back of our heads when we're building. Back to where we were: let's actually not give up, right?

We actually want to solve this for our users. We want to make a really good fruit letter counting app. So you might say, how do we make reliable software that uses LLMs? Our initial prompt was a simple question, right? But maybe we can try prompt engineering. Maybe we can add some chain of thought, something else to make it more reliable.

So we spend all night working on this new prompt. You're an exuberant, fruit-loving AI on an epic quest, dot, dot, dot. And this time, we actually tested it ten times in a row on ChatGPT. And it worked every single time. Ten times in a row. It's amazing. So we ship.

And everything was going great until John tweeted at me again. And he said, I asked how many R's are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry, and it said five. So we failed John again. Although this example is pretty simple, this is actually what will happen when you start deploying to production.

You'll get users that come up with queries you could never have imagined. And you actually have to start thinking about how to solve that. And the interesting thing, if you think about it, is that 95% of our app works 100% of the time. We can have unit tests for every single function, end-to-end tests for the auth, the login, the sign-out.

It will all work. But it's that most crucial 5% that can fail on us. So let's improve it. Now, to visualize this, I have a diagram for you. Hopefully, you can see the code. Maybe I need to make my screen brighter. Can you see the code? I don't know.

Okay. Okay. Well, we'll come back to this. But basically, we're going to start building evals. And to visualize this, I have a basketball court. So today's day one of the NBA finals. I don't know if you care. You don't need to know much about basketball. But just know that someone is trying to throw a ball in the basket.

And here, the basket is the glowing golden circle. So blue will represent a shot make. And red will represent a shot miss. And one property to consider is that the farther away your shot is from the basket, the harder it is. Another property is that the court has boundaries.

So this blue dot, although the shot goes in, it's out of the court. So it doesn't really count in the game. Let's start plotting our data. So here, we have a question, how many Rs in strawberry? This, after our new prompt, will probably work. So we'll label it blue.

And we'll put it close to the basket because it's pretty easy. However, how many Rs are in that big array? We'll label it red. And we'll put it farther away from the basket. Hopefully, you can see that. Maybe we can make it a little bit brighter. But this is the data part of our eval.

Basically, you're trying to collect what prompts your users are asking. And you want to just store this over time and keep building it and store where these points are on your court. Two more prompts I want to bring up is like, what if someone says, how many Rs are in strawberry, pineapple, dragon fruit, mango, after we replace all the vowels with Rs?

Right? Insane prompt. But it's still technically in our domain. So we'll label it as red all the way down there. But a funny one is, how many syllables are in carrot? So this, we'll call it out of bounds. Right? None of our users are actually going to ask that. It's not part of our app.

So no one is going to care. I hope you can see the code. But basically, when you're making an eval, here's how you can think about it. Your data is the point on the court. Your shot, or what Braintrust calls a task, is the way you shoot the ball towards the basket.

And your score is basically a check of whether the shot went in the basket or not.
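To make that mapping concrete, here is a rough sketch of how the fruit-letter-counter eval could look with Braintrust's TypeScript SDK. The project name, data points, expected counts, the countLetters helper, and the inline scorer are all illustrative assumptions, not code from the talk:

```ts
// Sketch of a Braintrust eval: data (points on the court), task (the shot),
// and score (did it go in the basket). All names and values are illustrative.
import { Eval } from "braintrust";
import { countLetters } from "./task"; // hypothetical helper that calls the LLM

Eval("Fruit Letter Counter", {
  // Data: real user prompts plus the expected answers.
  data: () => [
    { input: "How many R's are in strawberry?", expected: 3 },
    {
      input:
        "How many R's are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry?",
      expected: 8,
    },
  ],
  // Task: how an input becomes an answer (prompt, model, RAG, ...).
  task: async (input: string) => countLetters(input),
  // Score: a deterministic pass/fail check.
  scores: [
    ({ output, expected }: { output: number; expected: number }) => ({
      name: "exact_count",
      score: output === expected ? 1 : 0,
    }),
  ],
});
```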

To make good evals, you must understand your court. This is the most important step, and you have to be careful of falling into some traps. The first is the out-of-bounds trap: don't spend time making evals for data your users don't care about. You have enough problems, I promise you, with queries your users do care about. So be careful: it's easy to feel productive because you're making a lot of evals, but they're not really applicable to your app.

And another visualization is don't have a concentrated set of points. When you really understand your court, you're going to understand where the boundaries are, and you want to make sure you test across the entire court. A lot of people have been talking about this today, but to collect as much data as possible, here are some things you can do.

First is collect thumbs-up/thumbs-down data. This can be noisy, but it can also be really, really good signal as to where your app is struggling. Another thing is, if you have observability, which is highly recommended, you can just read through random samples in your logs. Users might not be giving you explicit signal, but if you take like 100 random samples and go through them once a week, you'll get a really good understanding of who your users are and how they're using the product.

If you have community forums, these are also great. People will often report issues they're having with the LLM. X/Twitter is also great, but can be noisy. And there really is no shortcut here. You really have to do the work and understand what your court looks like.

So here is what it should look like if you are doing a good job of understanding your court and a good job of building your data set. You should know the boundaries, you should be testing within those boundaries, and you should understand where your system has blue versus where it has red.

So here it's really easy to tell: okay, maybe next week we need to prioritize the team to work on that bottom right corner. This is somewhere a lot of users are struggling, and we can really do a good job of flipping the tiles from red to blue. Another thing you can do, and I really hope you can see this, is put constants in data, variables in the task.

So just like in math or programming, you want to factor out constants, which improves clarity, reuse, and generalization. Let's say you want to test your system prompt, right? Keep the data constant: the questions your users are going to ask. So for example, how many R's in strawberry? That goes in the data.

That's a constant. It's never going to change throughout your app. But what you're going to test is in the task: you're going to try different system prompts. You might try different pre-processing, different RAG, and that's what you want to put in your task section. This way your eval setup actually scales, and when you change your system prompt, you never have to redo all your data.
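As a concrete illustration of that factoring, here is a hedged sketch: the user queries stay fixed in data, while the thing under test, a couple of hypothetical system-prompt variants, lives in the task:

```ts
// Constants in data, variables in task (sketch). The prompt variants,
// model choice, and scorer are illustrative assumptions.
import { Eval } from "braintrust";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const SYSTEM_PROMPTS = {
  baseline: "Count the letters the user asks about and answer with a number.",
  chainOfThought:
    "Count the letters the user asks about. Think step by step, then answer with a number.",
};

for (const [variant, system] of Object.entries(SYSTEM_PROMPTS)) {
  Eval(`Fruit Letter Counter (${variant})`, {
    // Constant: the same user queries every run, regardless of what we test.
    data: () => [{ input: "How many R's are in strawberry?", expected: 3 }],
    // Variable: the experiment (system prompt, RAG, pre-processing) lives here.
    task: async (input: string) => {
      const { text } = await generateText({
        model: openai("gpt-4.1"),
        system,
        prompt: input,
      });
      return text;
    },
    scores: [
      ({ output, expected }: { output: string; expected: number }) => ({
        name: "mentions_count",
        score: output.includes(String(expected)) ? 1 : 0,
      }),
    ],
  });
}
```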

And this separation is a really nice feature of Braintrust. And if you don't know, the AI SDK actually offers a thing called middleware, and it's a really good abstraction for putting basically all your pre-processing logic in one place. So RAG, your system prompt, et cetera can go in here. And you can now share this between your evals and the actual API route that's doing the completion.
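Here is a rough sketch of that sharing, assuming a recent AI SDK version where wrapLanguageModel and transformParams middleware are available; the prompt text and model choice are illustrative:

```ts
// One wrapped model, shared by the API route and the evals, so practice
// exercises the same code path as the game. Sketch only; the middleware
// shape follows the AI SDK's wrapLanguageModel/transformParams API.
import { openai } from "@ai-sdk/openai";
import { wrapLanguageModel } from "ai";

const SYSTEM_PROMPT =
  "You are an exuberant, fruit-loving AI. Count letters carefully."; // illustrative

export const fruitCounterModel = wrapLanguageModel({
  model: openai("gpt-4.1"),
  middleware: {
    // Inject the system prompt (and any RAG context) in one place.
    transformParams: async ({ params }) => ({
      ...params,
      prompt: [
        { role: "system" as const, content: SYSTEM_PROMPT },
        ...params.prompt,
      ],
    }),
  },
});
```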

So if you think about the basketball court as if we're going to basketball practice, and we're trying to practice our system across different models, you want your practice to be as similar as possible to the real game. That's what makes a good practice. So you want to share pretty much the exact same code between the evals and what you're actually running in production.

Now, I want to talk a little bit about scores, which is the last step of the eval. The unfortunate thing is it does vary greatly depending on your data and your domain. So in this case, it's super simple: you're just checking if the output contains the correct number of letters.

But if you're doing writing, or tasks like writing, that's very, very difficult. As a principle, you want to lean towards deterministic scoring and pass/fail. This is because when you're debugging, you're going to get a ton of inputs and logs, and you want to make it as easy as possible for yourself to actually figure out what's going wrong.

So if you're over-engineering your score, it might be very difficult to share your evals with your team and distribute them across different teams, because no one will understand how these things are getting scored. Keep your scores as simple as possible. And a good question to ask yourself when you're looking at the data is: what am I looking for to see if this failed?

So with v0, we're looking for whether the code didn't work. But maybe for writing, you're looking for certain linguistics. Ask yourself that question and write the code that looks for that. There are some cases where it's so hard to write the code that you may need to do human review, and that's okay.

At the end of the day, you want to build your court and you want to collect signal. Even if you must do human review to get the correct signal, don't worry. If you do the correct practice, it will pay off in the long run and you'll get better results for your users.

One trick you can do for scoring is: don't be scared to add a little bit of extra prompt to the original prompt. So for example, here we can say, output your final answer in these answer tags. What this will do is basically make it very easy for you to do string matching, whereas in production you don't really want this.
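A minimal sketch of such a scorer, assuming the eval-only prompt suffix asked the model to wrap its final number in answer tags; the regex and names are illustrative:

```ts
// Deterministic pass/fail scorer (sketch). Assumes the eval prompt asked the
// model to wrap its final count in <answer></answer> tags.
type ScorerArgs = { output: string; expected: number };

export function exactCount({ output, expected }: ScorerArgs) {
  // Pull the number out of the answer tags the eval prompt asked for.
  const match = output.match(/<answer>\s*(\d+)\s*<\/answer>/i);
  const answer = match ? Number(match[1]) : NaN;

  // Pass/fail: either the count is right or it isn't.
  return { name: "exact_count", score: answer === expected ? 1 : 0 };
}
```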

But yeah, you can do some little tweaks to your prompt so that scoring is easier. Another thing we really highly recommend is adding evals to your CI. Braintrust is really nice here because you can get these eval reports: it will run your task across all your data, and then give you a report at the end with the improvements and regressions.

Assume my colleague made a PR that changes a bit of the prompt. We want to know, like, how did it do across the court, right? Visualize, like, did it change more tiles from red to blue? Maybe now our prompt fixed one part, but it broke the other part of our app.

So this is a really useful report to have when you're doing PRs. So yeah, going back, this is the summary of the talk. You want to build your evals around your data, and you can treat them like practice. Your model is basically going to practice. Maybe you want to switch players, right?

When you switch models, you can see how a different player is going to perform in your practice. But this gives you such a good understanding of how your system is doing when you change things, like your RAG or your system prompt. And you can now go to your colleague and say, hey, this actually did help our app, right?

Because improvement without measurement is limited and imprecise. And evals give you the clarity you need to systematically improve your app. When you do that, you're going to get better reliability and quality, higher conversion and retention, and you also get to just spend less time on support ops, right? Because your evals, your practice environment will take care of that for you.

And if you're wondering how I built all these court diagrams, I actually just used v0, and it made me an app where I could just add these shots made and missed on the court. So yeah, thank you very much. I hope you learned a little bit about evals. Thank you.

So we do have some time for some questions. There are two mics, one over here, one over there. I can take two or three of those, please, if anybody's interested in asking. We have one over there. Mic five, please. Or you can repeat the question as well, if you don't mind.

Do you run the same eval again? Yeah, you can think of it as really being like practice. Maybe you're a basketball player who in general scores like 90%, but they might miss more shots here or there. As for running it, we run ours every day at least.

And then we get a good sense of where we're actually failing. Did we have some regression? So yeah, running it daily, or at least on some schedule, will give you a good idea. I was thinking, what if you ran, you know, the same question through it five times, right?

Yeah. Like, what's the percentage? Is it making it four out of five or, you know, five out of five? Oh, I see. So it's definitely the case that, as you go further away, the questions get harder. We'll see you next time.