
Evals Are Not Unit Tests — Ido Pesok, Vercel v0


Chapters

0:00 Introduction to Vercel's v0 and its growth
1:00 The problem with AI unreliability
2:44 The "Fruit Letter Counter" app example of AI failure
3:33 Introducing "evals" and the basketball court analogy
5:09 Defining the "court": understanding the domain of user queries
7:53 Data collection for evals
9:13 Structuring evals: constants in data, variables in task
10:45 Scoring evals
12:35 Integrating evals into CI/CD
13:40 The benefits of using evals

Whisper Transcript

00:00:15.000 | My name is Ido. I'm an engineer at Vercel working on v0.
00:00:20.000 | If you don't know, v0 is a full-stack vibe coding platform.
00:00:25.000 | It's the easiest and fastest way to prototype, build on the web,
00:00:29.000 | and express new ideas.
00:00:31.000 | Here are some examples of cool things people have built and shared on Twitter.
00:00:35.000 | And to catch you up, we recently just launched GitHub Sync,
00:00:39.000 | so you can now push generated code to GitHub directly from v0.
00:00:42.000 | You can also automatically pull changes from GitHub into your chat,
00:00:47.000 | and furthermore, switch branches and open PRs to collaborate with your team.
00:00:51.000 | I'm very excited to announce we recently crossed 100 million messages sent,
00:00:56.000 | and we're really excited to keep growing from here.
00:01:00.000 | So, my goal for this talk is for it to be an introduction to evals,
00:01:04.000 | and specifically at the application layer.
00:01:06.000 | You may be used to evals at the model layer,
00:01:09.000 | which is what the research labs will cite in model releases.
00:01:12.000 | But this will be a focus on what evals mean for your users, your apps, and your data.
00:01:18.000 | The model is now in the wild, out of the lab, and it needs to work for your use case.
00:01:22.000 | And to do this, I have a story.
00:01:24.000 | It's a story about this app called Fruit Letter Counter.
00:01:28.000 | And if the name didn't already give it away, all it is is an app that counts the letters in fruit.
00:01:36.000 | So, the vision is we'll make a logo with ChatGPT.
00:01:39.000 | There might be market fit already, because everyone on X is dying to know the number of letters in fruit.
00:01:45.000 | If you didn't get it, it's a joke on the "how many R's in strawberry" prompt.
00:01:49.000 | We'll have v0 make all the UI and back end, and then we can ship.
00:01:55.000 | So, we had v0 write the code.
00:01:57.000 | It used the AI SDK to make the streamText call.
00:02:01.000 | And what do you know?
00:02:02.000 | It worked first try.
00:02:03.000 | GPT-4.1 said three.
00:02:05.000 | And not only did it say three once, I even tested it twice, and it worked both times in a row.
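For reference, the core of such a route with the AI SDK might look roughly like the sketch below. The route shape, model id, and prompt wording are illustrative guesses, not v0's actual generated code.

```ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

// Hypothetical API route for the Fruit Letter Counter.
export async function POST(req: Request) {
  const { question } = await req.json(); // e.g. "How many R's are in strawberry?"

  // Stream the model's answer back to the client.
  const result = streamText({
    model: openai("gpt-4.1"),
    prompt: question,
  });

  return result.toTextStreamResponse();
}
```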
00:02:11.000 | So, from there, we're good to ship, right?
00:02:14.000 | Let's launch on Twitter.
00:02:15.000 | Want to know how many letters are in a fruit?
00:02:17.000 | Just launched fruitlettercounter.io.
00:02:20.000 | The .com and .ai were taken.
00:02:23.000 | And, yeah, everything was going great.
00:02:26.000 | We launched and deployed on Vercel.
00:02:28.000 | We had Fluid Compute on, until I suddenly got this tweet.
00:02:32.000 | John said, I asked how many R's in strawberry, and it said two.
00:02:37.000 | So, of course, I just tested it twice.
00:02:39.000 | How is this even possible?
00:02:41.000 | But I think you get where I'm going with this, which is that, by nature, LLMs can behave very differently from run to run.
00:02:46.000 | LLMs can be very unreliable.
00:02:48.000 | And this principle scales from a small letter counting app all the way to the biggest AI apps in the world.
00:02:55.000 | The reason why it's so important to recognize this is because no one is going to use something that doesn't work.
00:03:00.000 | It's literally unusable.
00:03:01.000 | And this is a significant challenge when you're building AI apps.
00:03:05.000 | So I have a funny meme here.
00:03:08.000 | But basically, AI apps have this unique property.
00:03:10.000 | They're very, like, demo savvy.
00:03:12.000 | You'll demo it.
00:03:13.000 | It looks super good.
00:03:14.000 | You'll show it to your coworkers.
00:03:15.000 | And then you ship to prod.
00:03:17.000 | And then suddenly, hallucinations come and get you.
00:03:20.000 | So we always have this in the back of our head when we're building.
00:03:24.000 | Back to where we were, let's actually not give up, right?
00:03:28.000 | We actually want to solve this for our users.
00:03:30.000 | We want to make a really good fruit letter counting app.
00:03:33.000 | So you might say, how do we make reliable software that uses LLMs?
00:03:37.000 | Our initial prompt was a simple question, right?
00:03:40.000 | But maybe we can try prompt engineering.
00:03:42.000 | Maybe we can add some chain of thought, something else to make it more reliable.
00:03:46.000 | So we spend all night working on this new prompt.
00:03:49.000 | You're an exuberant, fruit-loving AI on an epic quest, dot, dot, dot.
00:03:54.000 | And this time, we actually tested it ten times in a row on ChatGPT.
00:03:58.000 | And it worked every single time.
00:04:00.000 | Ten times in a row.
00:04:01.000 | It's amazing.
00:04:02.000 | So we ship.
00:04:04.000 | And everything was going great until John tweeted at me again.
00:04:09.000 | And he said, I asked how many Rs are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry.
00:04:16.000 | And it said five.
00:04:19.000 | So we failed John again.
00:04:21.000 | Although this example is pretty simple, this is actually what will happen when you start deploying to production.
00:04:26.000 | You'll get users that come up with queries you could have never imagined.
00:04:30.000 | And you actually have to start thinking about how do we solve it.
00:04:33.000 | And the interesting thing, if you think about it, is 95% of our app works 100% of the time.
00:04:39.000 | We can have unit tests for every single function, end-to-end tests for the auth, the login, the sign-out.
00:04:45.000 | It will all work.
00:04:46.000 | But it's that most crucial 5% that can fail on us.
00:04:49.000 | So let's improve it.
00:04:51.000 | Now, to visualize this, I have a diagram for you.
00:04:54.000 | Hopefully, you can see the code.
00:04:56.000 | Maybe I need to make my screen brighter.
00:04:58.000 | Can you see the code?
00:04:59.000 | I don't know.
00:05:00.000 | Okay.
00:05:01.000 | Okay.
00:05:02.000 | Well, we'll come back to this.
00:05:06.000 | But basically, we're going to start building evals.
00:05:09.000 | And to visualize this, I have a basketball court.
00:05:12.000 | So today's day one of the NBA finals.
00:05:14.000 | I don't know if you care.
00:05:15.000 | You don't need to know much about basketball.
00:05:17.000 | But just know that someone is trying to throw a ball in the basket.
00:05:21.000 | And here, the basket is the glowing golden circle.
00:05:25.000 | So blue will represent a shot make.
00:05:28.000 | And red will represent a shot miss.
00:05:30.000 | And one property to consider is that the farther away your shot is from the basket, the harder it is.
00:05:37.000 | Another property is that the court has boundaries.
00:05:39.000 | So this blue dot, although the shot goes in, it's out of the court.
00:05:44.000 | So it doesn't really count in the game.
00:05:46.000 | Let's start plotting our data.
00:05:48.000 | So here, we have a question, how many Rs in strawberry?
00:05:52.000 | This, after our new prompt, will probably work.
00:05:54.000 | So we'll label it blue.
00:05:56.000 | And we'll put it close to the basket because it's pretty easy.
00:05:59.000 | However, how many Rs are in that big array?
00:06:01.000 | We'll label it red.
00:06:02.000 | And we'll put it farther away from the basket.
00:06:06.000 | Hopefully, you can see that.
00:06:07.000 | Maybe we can make it a little bit brighter.
00:06:09.000 | But this is the data part of our eval.
00:06:12.000 | Basically, you're trying to collect what prompts your users are asking.
00:06:16.000 | And you want to just store this over time and keep building it
00:06:19.000 | and store where these points are on your court.
00:06:22.000 | Two more prompts I want to bring up is like, what if someone says,
00:06:25.000 | how many Rs are in strawberry, pineapple, dragon fruit, mango,
00:06:28.000 | after we replace all the vowels with Rs?
00:06:32.000 | Right?
00:06:33.000 | Insane prompt.
00:06:34.000 | But it's still technically in our domain.
00:06:36.000 | So we'll label it as red all the way down there.
00:06:40.000 | But a funny one is, how many syllables are in carrot?
00:06:43.000 | So this, we'll call it out of bounds.
00:06:46.000 | Right?
00:06:47.000 | None of our users are actually going to ask that.
00:06:48.000 | It's not part of our app.
00:06:49.000 | So no one is going to care.
00:06:51.000 | I hope you can see the code.
00:06:55.000 | But basically, when you're making an eval,
00:06:58.000 | here's how you can think about it.
00:06:59.000 | Your data is the point on the court.
00:07:01.000 | Your shot, or in this case what Braintrust calls a task,
00:07:05.000 | is the way you shoot the ball towards the basket.
00:07:07.000 | And your score is basically a check of did it go in the basket
00:07:10.000 | or did it not go in the basket.
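In Braintrust's SDK, those three pieces map onto an Eval call. Here's a minimal sketch; the dataset rows and the inline generateText task are stand-ins for your real pipeline, and the custom-scorer signature follows Braintrust's convention, which may vary slightly between SDK versions.

```ts
import { Eval } from "braintrust";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

Eval("fruit-letter-counter", {
  // Data: the points on the court — real user queries plus expected answers.
  data: () => [
    { input: "How many R's are in strawberry?", expected: "3" },
    { input: "How many A's are in banana?", expected: "3" },
  ],
  // Task: the shot — run one query through the system under test.
  task: async (input: string) => {
    const { text } = await generateText({
      model: openai("gpt-4.1"),
      prompt: input,
    });
    return text;
  },
  // Score: did it go in the basket? A simple pass/fail check.
  scores: [
    ({ output, expected }: { output: string; expected?: string }) => ({
      name: "correct_count",
      score: expected !== undefined && output.includes(expected) ? 1 : 0,
    }),
  ],
});
```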
00:07:12.000 | To make good evals, you must understand your court.
00:07:15.000 | This is the most important step.
00:07:18.000 | And you have to be careful of falling into some traps.
00:07:22.000 | First is the out-of-bounds trap.
00:07:24.000 | Don't spend time making evals for your data your users don't care about.
00:07:28.000 | You have enough problems, I promise you,
00:07:30.000 | queries that your users do care about.
00:07:33.000 | So be careful of just trying to feel productive,
00:07:36.000 | where you're making a lot of evals,
00:07:38.000 | but they're not really applicable to your app.
00:07:40.000 | And another visualization is don't have a concentrated set of points.
00:07:44.000 | When you really understand your court,
00:07:46.000 | you're going to understand where the boundaries are,
00:07:48.000 | and you want to make sure you test across the entire court.
00:07:51.000 | A lot of people have been talking about this today,
00:07:55.000 | but to collect as much data as possible,
00:07:57.000 | here are some things you can do.
00:07:59.000 | First is collect thumbs up, thumbs down data.
00:08:01.000 | This can be noisy, but it also can be really, really good signal
00:08:04.000 | as to where your app is struggling.
00:08:06.000 | Another thing is if you have observability, which is highly recommended,
00:08:10.000 | you can just read through random samples in your logs.
00:08:14.000 | Although users might not be giving you explicit signal,
00:08:16.000 | if you take like 100 random samples and go through them like once a week,
00:08:20.000 | you'll get a really good understanding of who your users are
00:08:23.000 | and how they're using the product.
00:08:26.000 | If you have community forums, these are also great.
00:08:28.000 | People will often report issues they're having with the LLM,
00:08:31.000 | and X (Twitter) is also great, but can be noisy.
00:08:35.000 | And there really is no shortcut here.
00:08:37.000 | You really have to do the work and understand what your court looks like.
00:08:41.000 | So here is actually what it should look like if you are doing a good job of understanding your court
00:08:46.000 | and a good job of building your data set.
00:08:49.000 | You should know the boundaries, you should be testing in your boundaries,
00:08:52.000 | and you should understand where your system has blue versus where it has red.
00:08:57.000 | So here it's really easy to tell, okay, maybe next week we need to prioritize
00:09:02.000 | the team to work on that bottom right corner.
00:09:05.000 | This is something where a lot of users are struggling,
00:09:07.000 | and we can really do a good job on flipping the tiles from red to blue.
00:09:12.000 | Another thing you can do, and I really hope you can see this,
00:09:18.000 | is put constants in the data and variables in the task.
00:09:23.000 | So just like in math or programming, you want to factor out constants
00:09:27.000 | because it improves clarity, reuse, and generalization.
00:09:30.000 | Let's say you want to test your system prompt, right?
00:09:34.000 | Keep the data constant: the queries that your users are going to ask.
00:09:38.000 | So for example, "how many R's in strawberry?" goes in the data.
00:09:40.000 | That's a constant.
00:09:41.000 | It's never going to change throughout your app.
00:09:43.000 | But what you're going to test is in that task,
00:09:45.000 | you're going to try different system prompts.
00:09:47.000 | You might try different pre-processing, different RAG,
00:09:49.000 | and that's what you want to put in your task section.
00:09:52.000 | This way your app actually scales, and you never have to redo all your data
00:09:55.000 | when you, let's say, change your system prompt.
00:09:57.000 | And this is a really nice feature of Braintrust.
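Concretely, that split might look like the sketch below, where loadUserQueries, runPipeline, and correctCount are hypothetical helpers (correctCount is sketched later in the scoring section).

```ts
import { Eval } from "braintrust";

// Constant: the dataset of real user queries never changes between experiments.
const data = loadUserQueries();

// Variable: the thing under test — different system prompts, pre-processing, RAG.
const systemPrompts = {
  baseline: "You count letters in fruit names. Answer with a number.",
  chainOfThought:
    "You count letters in fruit names. Think step by step, then answer with a number.",
};

// One experiment per variant, all shooting at the same points on the court.
for (const [name, systemPrompt] of Object.entries(systemPrompts)) {
  Eval(`fruit-letter-counter-${name}`, {
    data: () => data,
    task: (input: string) => runPipeline(input, { systemPrompt }),
    scores: [correctCount],
  });
}
```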
00:10:00.000 | And if you don't know, AI SDK actually offers a thing called middleware,
00:10:06.000 | and it's a really good abstraction to put basically all your logic
00:10:10.000 | of pre-processing.
00:10:11.000 | So RAG, your system prompt, et cetera, can go in here.
00:10:14.000 | And you can now share this between your actual API route
00:10:17.000 | that's doing the completion and your evals.
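A rough sketch of that pattern follows. wrapLanguageModel and the transformParams hook below follow the AI SDK's middleware API, but the exact type names vary between SDK versions, so treat this as illustrative rather than definitive.

```ts
import { wrapLanguageModel, type LanguageModelV1Middleware } from "ai";
import { openai } from "@ai-sdk/openai";

const SYSTEM_PROMPT = "You count letters in fruit names. Answer with a number.";

// All pre-processing (system prompt, RAG, etc.) lives in one middleware...
const fruitCounterMiddleware: LanguageModelV1Middleware = {
  transformParams: async ({ params }) => ({
    ...params,
    // Prepend the shared system prompt to every call.
    prompt: [{ role: "system", content: SYSTEM_PROMPT }, ...params.prompt],
  }),
};

// ...so the API route and the evals share exactly the same model setup.
export const fruitCounterModel = wrapLanguageModel({
  model: openai("gpt-4.1"),
  middleware: fruitCounterMiddleware,
});
```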
00:10:19.000 | So think of the basketball court as if we're going to basketball practice,
00:10:24.000 | and we're trying to practice our system across different models,
00:10:28.000 | you want your practice to be as similar as possible to the real game.
00:10:32.000 | That's what makes a good practice.
00:10:33.000 | So you want to share pretty much the exact same code between the evals
00:10:37.000 | and what you're actually running.
00:10:39.000 | Now, I want to talk a little bit about scores,
00:10:42.000 | which is the last step of the eval.
00:10:44.000 | The unfortunate thing is it does vary greatly depending on your data and your domain.
00:10:50.000 | So in this case, it's like super simple.
00:10:52.000 | You're just checking if the output contains the correct number of letters.
00:10:57.000 | But if you're doing writing or tasks like writing,
00:11:00.000 | scoring is very, very difficult.
00:11:02.000 | As a principle, you want to lean towards deterministic scoring
00:11:06.000 | and pass/fail.
00:11:07.000 | This is because when you're doing debugging,
00:11:09.000 | you're going to get a ton of input and logs,
00:11:12.000 | and you want to make it as easy as possible
00:11:14.000 | for you to actually figure out what's going wrong.
00:11:16.000 | So if you're over-engineering your score,
00:11:19.000 | it might be very difficult to share with your team
00:11:21.000 | and distribute your evals across different teams,
00:11:25.000 | because no one will understand how these things are getting scored.
00:11:27.000 | Keep your scores as simple as possible.
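For this app, the correctCount scorer referenced in the earlier sketch can be simple enough that anyone on the team can read it and know exactly why a row failed. A sketch:

```ts
// Deterministic pass/fail: does the output contain the expected count?
// When a row fails, the reason is obvious from the data itself.
function correctCount({ output, expected }: { output: string; expected?: string }) {
  return {
    name: "correct_count",
    score: expected !== undefined && output.includes(expected) ? 1 : 0,
  };
}
```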
00:11:30.000 | And a good question to ask yourself is when you're looking at the data,
00:11:34.000 | what am I looking for to see if this failed?
00:11:36.000 | So with v0, we're looking for whether the code didn't work.
00:11:40.000 | But maybe for writing, you're looking for certain linguistics.
00:11:44.000 | Ask yourself that question and write the code that looks for that.
00:11:47.000 | There are some cases where it's so hard to write the code
00:11:50.000 | that you may need to do human review, and that's okay.
00:11:54.000 | At the end of the day, you want to build your court
00:11:56.000 | and you want to collect signal.
00:11:57.000 | Even if you must do human review to get the correct signal,
00:12:01.000 | don't worry.
00:12:02.000 | If you do the correct practice, it will pay off in the long run
00:12:05.000 | and you'll get better results for your users.
00:12:08.000 | One trick you can do for scoring is don't be scared
00:12:12.000 | to add a little bit of extra prompt to the original prompt.
00:12:17.000 | So for example, here we can say output your final answer
00:12:20.000 | in these answer tags.
00:12:22.000 | What this will do is basically make it very easy
00:12:24.000 | for you to do string matching and so on,
00:12:28.000 | whereas in production, you don't really want this.
00:12:30.000 | But yeah, you can do some little tweaks to your prompt
00:12:33.000 | so that scoring is easier.
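For example, appending an eval-only suffix and extracting the tagged answer might look like this sketch (the suffix wording and tag format are illustrative choices):

```ts
// Eval-only addition to the prompt — not something you'd ship to production.
const EVAL_SUFFIX =
  "\n\nOutput your final answer inside <answer></answer> tags.";

// Pull the final answer out of the tags so scoring is plain string matching.
function extractAnswer(output: string): string | null {
  const match = output.match(/<answer>\s*([\s\S]*?)\s*<\/answer>/);
  return match ? match[1] : null;
}
```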
00:12:36.000 | Another thing we really highly recommend is add evals to your CI.
00:12:40.000 | So Braintrust is really nice because you can get these eval reports.
00:12:43.000 | So it will run your task across all your data,
00:12:47.000 | and then it will give you this report at the end
00:12:50.000 | for the improvements and regressions.
00:12:52.000 | Assume my colleague made a PR that changes a bit of the prompt.
00:12:55.000 | We want to know, like, how did it do across the court, right?
00:12:58.000 | Visualize, like, did it change more tiles from red to blue?
00:13:01.000 | Maybe now our prompt fixed one part,
00:13:03.000 | but it broke the other part of our app.
00:13:05.000 | So this is a really useful report to have when you're doing PRs.
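One way to wire this up is to run the eval in CI and fail the job on regression. The sketch below reuses the hypothetical helpers from the earlier sketches, and the shape of the returned summary is an assumption about the Braintrust SDK (it varies by version), so check your SDK's types before relying on it.

```ts
import { Eval } from "braintrust";

// Run the eval suite and fail the CI job if the pass rate drops too low.
const result = await Eval("fruit-letter-counter", {
  data: () => loadUserQueries(),
  task: (input: string) => runPipeline(input, { systemPrompt: SYSTEM_PROMPT }),
  scores: [correctCount],
});

// Assumed summary shape: per-scorer aggregates keyed by scorer name.
const passRate = result.summary?.scores?.correct_count?.score ?? 0;
if (passRate < 0.9) {
  console.error(`Eval pass rate ${passRate} is below the 0.9 threshold.`);
  process.exit(1);
}
```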
00:13:09.000 | So yeah, going back, this is the summary of the talk.
00:13:13.000 | You want to make your evals a court of your data.
00:13:17.000 | And this, you can treat it like practice.
00:13:19.000 | Your model is basically going to practice.
00:13:21.000 | Maybe you want to switch players, right?
00:13:23.000 | When you switch models, you can see how a different player
00:13:25.000 | is going to perform in your practice.
00:13:27.000 | But this gives you such a good understanding
00:13:29.000 | of how your system is doing when you change things,
00:13:32.000 | like maybe your RAG or your system prompt.
00:13:34.000 | And you can now go to your colleague and say, hey,
00:13:37.000 | this actually did help our app, right?
00:13:39.000 | Because improvement without measurement is limited and imprecise.
00:13:44.000 | And evals give you the clarity you need
00:13:46.000 | to systematically improve your app.
00:13:50.000 | When you do that, you're going to get better reliability and quality,
00:13:54.000 | higher conversion and retention,
00:13:56.000 | and you also get to just spend less time on support ops, right?
00:13:59.000 | Because your evals, your practice environment
00:14:01.000 | will take care of that for you.
00:14:03.000 | And if you're wondering about how I built all these court diagrams,
00:14:06.000 | I actually just used v0, and it made me an app
00:14:08.000 | where I just added these shots made and missed on the court.
00:14:13.000 | So yeah, thank you very much.
00:14:15.000 | I hope you learned a little bit about evals.
00:14:17.000 | Thank you.
00:14:19.000 | So we do have some time for some questions.
00:14:21.000 | There are two mics, one over here, one over there.
00:14:24.000 | I can take two or three of those, please,
00:14:27.000 | if anybody's interested in asking.
00:14:29.000 | We have one over there.
00:14:31.000 | Mic five, please.
00:14:34.000 | Or you can repeat the question as well, if you don't mind.
00:14:38.000 | Do you run the same eval again?
00:14:42.000 | Yeah.
00:14:43.000 | Yeah, you can think of it.
00:14:44.000 | It's really like practice.
00:14:45.000 | Like maybe you're a basketball player,
00:14:47.000 | who, you know, in general scores like 90%,
00:14:50.000 | but they might miss more shots here or there.
00:14:52.000 | For us, we run it every day at least.
00:14:56.000 | And then we get a good sense of like,
00:14:58.000 | where are we actually like failing?
00:15:00.000 | Did we have some regression?
00:15:01.000 | So yeah, running it like daily or at least in some schedule
00:15:04.000 | will give you a good idea.
00:15:06.000 | I was thinking what if you ran like, you know,
00:15:08.000 | the same question through it five times, right?
00:15:10.000 | Yeah.
00:15:11.000 | Like what's the percentage?
00:15:12.000 | Is it making it four out of five or, you know, five out of five?
00:15:14.000 | Oh, I see.
00:15:15.000 | So it's definitely, like, as you go further away,
00:15:18.000 | the harder the questions get...