AI Engineer World's Fair 2025 - Evals


Transcript

Thank you. Thank you. Thank you. One, two, one, two, three, look at, look at, look at your data, look at your data. Yeah. Great to meet you, Rafal. I think so. Big fan of DSPy. Yeah, I think it's come across here. Yeah, yeah, yeah. I tried to do it the same in TypeScript and failed. So now that I have a good portion of the... I have something funny here. I wonder if you're all okay with it, or we don't have to do it, but I asked the AI to give us the intros.

Mm-hmm. And so one of the funny things I asked it to do is tell me the pronunciation for the name. Hmm. So I'd be curious which AI model does better. So it'll also be a way that I can kind of, like, interact with the audience. So like, all right, pick the AI model that you want to try.

So funny. Is that okay with everybody? Yeah? Yeah. Yeah. Sounds good. All right, cool. Yeah, that's good. All right, cool. I don't know if I'll have a chance to connect it, but I can always just use it and, like, say it out loud.

So that's cool. For sure. All right, cool. And no hearing tease. It did a good job. Yeah. Yeah. And it probably hallucinated, or it might have been the entry that was the problem. Want to do another check? Feel free to, like, call that out: okay, that thing was wrong about this. Hello, hello.

Like for example, let's try Omar real quick. Where's your mic? Right here. It's invisible. Invisible, bro. You're at my computer. You're done, man. Yeah. Cool. Nice. There you go. I don't know if it used the web or what other things to like. I think we submitted bios. Yeah, I included whatever you provided.

Yeah. That's cheating. I know. I'll give you enough time. But some of them don't work. I want to turn your belt pack off until it's your presentation. Okay. We have limited use of radio frequencies. Okay. Sure. Yours is fine. Mine is good. Yeah. Should I twist it, or? Okay.

Yeah. You've done a lot of these, I bet. Huh? Have you done a lot of these? I've done a lot of these. Yeah. I've done a lot of them. Yeah. I didn't know you worked at Databricks. Is there anything? Yeah. I've done that for a year. Yeah. I might have just missed it.

Yeah. Cool. Nice. Check. Check, check, check. You want me to head back, or do you want to? There's some cans there, I think that's the water. It's all good, just be yourself, ignore the crowd. Look at the last row, you know? It makes it look like you're looking at everyone.

I predict a lot. Maybe we contradict each other, don't you think? That would be even more interesting. Well, he's going to contradict us for sure. Ours are going to be pretty complementary, I think. Because ours are like, okay, we did a lot of unit tests, evals, and then we regretted it.

Yours come after. People are not unit tests. Duh. So they complement each other. Yeah, we just do a bunch of different types of evals now. We do more like trajectory, full conversation evals and stuff like that. Oh, yeah. There are 18 people watching. 18 people. Boo. I'm famous. Yeah, I can hear you.

Test, test, test. Hello, hello. Test, test. Yeah, I can hear you. Test, test. Hello. Test, test, test. Hello. Hello, hello. Can you talk to me? How was the flight? Good, good flight. Yeah, very long. Good. Can you talk to me? Hello, bro. One, two, three. You have to say that.

No, no, no. Oh, I actually live in Spain now, so it's like a very brutal, very brutal flight. Like 12 hours. Damn. Yep. Yep. Brutal. Yeah. It's rough. It's rough. Yeah, that's a good move. Good move. Yeah, yeah. The shortest way from the United States to Europe is from New York.

Yeah, indeed. Yeah. No. I think it's pretty much more or less the same. Maybe slightly shorter. Maybe there are better flights. Yeah. Hmm. Yeah, 'cause there's an Arctic curve from the left and all that.

Totally. Yeah. They told you the dream, but it was a lie. Yeah. Oh yeah, I would be melting now if I wasn't in this room. I'll be flying back to Paul, I'm like, sorry guys, I'm sick, I'm changing jobs, I'm gonna be an electrician. Yeah, yeah, yeah. Okay, so if you, if we have trouble, yeah, that's true, if we have trouble with the AV, you take over, because you, you're gonna be better. Ours, like, ours might be right there on the limit, I feel. Yeah, yeah. I don't think we will have time for the Q&A. Nobody got time for Q&A, and we go on to the next talk. Yeah, exactly. Yeah, yeah. It's really difficult. Yeah, yeah. It's mostly, like, bringing the average for everyone higher, but it's still, like, not super good at any specific vertical, so we struggle with that a lot, trying to work... Yeah, yeah. It's like, it's for the normies. Yeah. So I get people hooked into it, you know. And then, yeah, even though it was obviously the best model, because we were already, like, using it in Cursor and everything.

- How are you doing? - He works at Vercel. - Yeah. - Yeah, I'm excited for your talk. And Braintrust is emceeing this whole track? - Yeah, they just got the whole track, yeah. - Cool, makes sense. I mean, not if you're a competitor, but... - Yeah, true, true.

- Y'all ready? - Yeah. - No. - Have you given talks before? - Not like this fancy, no, no. I've done a lot of online talks. - When you did the course, you did... - Yeah, yeah, of course. - Is that enough for a lot of the content? - It's much easier, I just don't look at any face, I just look at my presenter notes and I just zone out.

I'm gonna try to do the same, I think. - Did you repurpose the content from... - A lot of it was repurposed, and then there's some new stuff. - Oh, the good stuff, right? - Yeah, there's some good stuff repurposed, and then some new stuff. - Cool. - So, like, it's mixed half and half.

- You'll get longer and longer prompting guides for the latest models that are supposed to be, you know, closer and closer to AGI. And if you're less lucky, you have to figure that out on your own, right? If you're even less lucky, the prompting guides from the provider are not even that good, so you have to actually kind of figure out what that thing is.

And every day, maybe at an even faster pace, someone is releasing an arXiv paper or a tweet or something that introduces a new learning algorithm, maybe some reinforcement learning bells and whistles, maybe some prompting tricks, maybe a prompt optimization technique, something or other that promises to make your system learn better and sort of fit your goals better.

Someone else is introducing some search or scaling or inference strategies or agent frameworks or agent architectures that are promising to finally unlock levels of reliability or quality better than what you had before. And I think if you're actually doing a reasonable job, you're now most likely thinking, you know, I've got to stay on top of at least some of this stuff so that I don't fall behind.

And in many cases, like, you know, model APIs actually change the model under the hood, even though, you know, you're using the same name. So it's actually, you're forced to scramble. And actually, I would say maybe the question isn't whether you will scramble every week. Maybe a different question is, will you even get to scramble for long if you think about the rate of progress of these LLMs?

Like, are they going to eat your lunch? Right? So these are, I think, questions that are on a lot of people's minds. And this is what the talk is going to be addressing. So the talk mentions the bitter lesson, which sounds like this really ancient, old kind of AI lore, but it's just, you know, six years old. The current year's Turing Award winner, Rich Sutton, who's a pioneer of reinforcement learning, wrote this short essay on his website, basically, that says 70 years of AI have taught him, and taught other people in the AI community from his perspective, that when AI researchers leverage domain knowledge to solve problems, like, I don't know, chess or something, we build complicated methods that essentially don't scale, and we get stuck, and we get beat by methods that leverage scale a lot better.

What seems to work better, according to Sutton, is general methods that scale, and he identifies search, which is not like retrieval, it's more like, you know, exploring large spaces, and learning, so getting the system to kind of understand its environment, as the things that work best. And search here is what we'd call in LLM land maybe inference-time scaling or something.

So I don't speak for Sutton, and I'm not, you know, suggesting that I have the right understanding of what he's saying or that I necessarily agree or disagree, but I think this is just fundamental and important kind of concept in this space. So I think it raises interesting questions for us as people who build, you know, engineer AI systems.

Because if leveraging domain knowledge is bad, what exactly is AI engineering supposed to be about? I mean, engineering is understanding your domain and working in it with a lot of human ingenuity in repeatable ways, let's say, or with principles. So, like, are we just doomed? Like, are we just wasting our time?

Why are we at an AI engineering, you know, fair? And I'll tell you how to resolve this. I've not really seen a lot of people discuss what Sutton is talking about, and a lot of people, you know, throw the bitter lesson around, so clearly somebody has to think about this, right?

Sutton is talking about maximizing intelligence, which is something like the ability to figure things out in a new environment really fast, let's say. All of us probably care about that to some degree. I'm also an AI researcher.

But when we're building AI systems, I think it's important to remember that the reason we build software is not that we lack AGI. We build software, you know, and the way you kind of understand this is that we already have general intelligence everywhere. We have eight billion of them.

They're unreliable because that's what intelligence is. And they've not solved the problems that we want to solve with software. That's why we're building software. So we program software, not because we lack AGI, but because we want reliable, robust, controllable, scalable systems. And we want these things to be things that we can reason about, understand at scale.

And actually, if you think about engineering and reliable systems, if you think about checks and balances, in any case where you try to systematize stuff, it's about subtracting agency and subtracting intelligence in exactly the right places, carefully, and not restricting the intelligence otherwise. So this is a very different axis from the kinds of lessons that you would draw from the bitter lesson.

Now, that does not mean the bitter lesson is irrelevant. Let me tell you the precise way in which it's relevant. So the first takeaway here is that scaling search and learning works best for intelligence. This is the right thing to do if you're an AI researcher interested in building, you know, agents that learn really well, really fast in new environments.

Don't hard code stuff at all, or unless you really have to. But in building AI systems, it's helpful to think about, well, sure, search and learning, but searching for what? Right? Like, what is your AI system even supposed to be doing? What is the fundamental problem that you're solving?

It's not intelligence. It's something else. And what are you learning for? Right? Like, what is the system learning in order to do well? And that is what you need to be engineering, not the specifics of search and not the specifics of learning, as I'll talk about in the rest of this talk.

So he's saying, Sutton is saying, complicated methods get in the way of scaling, especially if you do it early, like before you know what you're doing, essentially. Did we hear that before? I feel like I heard that back in the 1970s, although I wasn't around. This is, you know, the notion of structured programming, with Knuth saying his popular phrase in a paper: premature optimization is the root of all evil.

I think this is the bitter lesson for software and thereby also for AI software. So it's human ingenuity and human knowledge of the domain. It's not that it's harmful. It's that when you do it prematurely in ways that constrain your system, in ways that reflect poor understanding, they're bad.

But you can't get away in an engineering field with not engineering your system. Like, you're just quitting or something. Right? So here's a little piece of code. If you follow me on X on Twitter, you might recognize it. But otherwise, I think it looks pretty opaque to me in like three seconds.

And I can't really look at this and tell exactly what it's doing. And I also honestly don't really care. So lo and behold, this is computing a square root in a certain floating point representation on an old machine. And I think the thing that jumps at me immediately is this is not the most future-proof program possible.

If you change the machine architecture, different floating point representations, better CPUs, first of all, it'll be wrong because, you know, it's just hard coding some values here. And second of all, it'll probably be slower than a normal, you know, square root that maybe is a single instruction, or maybe the compiler has a really smart way of doing it, or, you know, a lot of other things that could be optimized for you, right?
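
The snippet on the slide isn't reproduced in this transcript, but the flavor of code he's describing looks something like the bit-level square-root trick below. This is a representative sketch, not the slide's code, and it only works because it hard codes the IEEE-754 single-precision layout and a magic constant:

```python
import struct

def approx_sqrt(x: float) -> float:
    # Reinterpret the float's bits as a 32-bit integer (assumes IEEE-754 single precision).
    i = struct.unpack("=I", struct.pack("=f", x))[0]
    # Halve the exponent field with a shift and re-bias it using a magic constant.
    i = (i >> 1) + 0x1FC00000
    y = struct.unpack("=f", struct.pack("=I", i))[0]
    # One Newton-Raphson step to clean up the estimate.
    return 0.5 * (y + x / y)
```

Change the float format or the hardware assumptions and both the shift and the constant stop making sense, which is exactly the fragility being described.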

So someone who wrote this, maybe they had a good reason, maybe they didn't. But certainly, if you're writing this kind of thing often, you're probably messing up as an engineer. So premature optimization is maybe the square root of all evil or something. But what counts as premature? Like, I mean, that's kind of the name of the game, right?

Like, we could just say that, but it doesn't mean anything. So I don't think any strategy will just work in tech. Nobody can anticipate what will happen in three years, five years, ten years. But I think you still have to have a conceptual model that you're working off of. And I happen to have built two things that are, you know, on the order of several years old, that have fundamentally stayed the same over the years, from the days of BERT and text-davinci-002 up to o4-mini, and they're bigger now than they ever were.

And they're sort of like these stable, fundamental kind of abstractions or AI systems around LLMs. So what gives? What does it take for something like ColBERT or something like DSPy to emerge in this ecosystem and sort of endure a few years, which is, like, you know, centuries in AI land?

I'll try to reflect on this. And, you know, again, none of this is guaranteed to be something that lasts forever. So here's my hypothesis. Premature optimization is what is happening if and only if you're hard coding stuff at a lower level of abstraction than you can justify.

If you want a square root, please just say, give me a square root. Don't start doing random bit shifts and bit stuff like, you know, bit manipulation that happens to appease your particular machine today. But actually take a step back. Do you even want a square root or are you computing something even more general?

And is there a way you could express that thing that is more general, right? And only, you know, stoop down or go down to the level of abstraction that's lower if you've demonstrated that a higher level of abstraction is not good enough. So I think the bigger picture here is applied machine learning and definitely prompt engineering has a huge issue here.

Tight coupling is known-- tighter coupling than necessary is known to be bad in software, but it's not really something we talk about when we're building machine learning systems. In fact, the name of the game in machine learning is usually like, hey, this latest thing came out, let's rewrite everything so that we're working around that specific thing.

And I tweeted about this a year ago, 13 months ago, last May in 2024, saying the bitter lesson is just an artifact of lacking good high-level ML abstractions. Scaling deep learning helps predictably, but after every paradigm shift, the best systems always include modular specializations, because we're trying to build software; we need those.

And every time they basically look the same and they should have been reusable, but they're not because we're writing code bad. We're writing bad code. So here's a nice example just to demonstrate this. It's not special at all. Here's a 2006 paper. The title could have really been a paper now, right?

A Modular Approach for Multilingual Question Answering. And here's a system architecture. It looks like your favorite multi-agent framework today, right? It has an execution manager. It has some question analyzers and retrieval strategies over a bunch of corpora. And it's, like, a figure that, you know, if you colored it, you would think it's from a paper maybe from last year or something.

Now, here's the problem. It's a pretty figure. The system architecturally is actually not that wrong. I'm not saying it's the perfect architecture, but in a normal software environment, you could actually just upgrade the machine, right? Put it on a new hardware, put it on a new operating system, and it would just work.

And it would actually work reasonably well because the architecture is not that bad. But we know that that's not the case for these ML sort of architectures because they're not expressed in the right way. So I think fundamentally, I can express this most passionately against prompts. A prompt is a horrible abstraction for programming, and this needs to be fixed ASAP.

I say for programming because it's actually not a horrible one for management. If you want to manage an employee or an agent, a prompt is reasonable; it's kind of like a Slack channel, and you have a remote employee. If you want to be a pet trainer, you know, working with tensors and, you know, objectives is a great way to iterate.

That's how we build the models. But I want us to be able to also engineer AI systems. And I think for engineering and programming, a prompt is a horrible abstraction. Here is why. It's a stringly typed canvas, just a big blurb, no structure whatsoever, even if structure actually exists in a latent way.

That couples and entangles the fundamental task definition you want to state, which is really important stuff, this is what you're engineering, with some random, overfitted, half-baked decisions about, hey, this LLM responded to this language, you know, when I talk to it this way. Or I put this example in to demonstrate my point, and it kind of clicked for this model, so I'll just keep it in.

And there's no way to really tell the difference. What was the fundamental thing you're solving? And, like, you know, what was the random trick you applied? It's like a square root thing, except you don't call it a square root and we just have to stare at it and be like, wait, why are we shifting to the left by five bits or something?

You're also using the inference-time strategy, which is, like, changing every few weeks, where people are proposing stuff all the time, and you're baking it, literally entangling it, into your system. So if it's an agent, your prompt is telling it it's an agent. Your system has no business knowing about the fact that it's an agent or a reasoning system or whatever.

What are you actually trying to solve, right? It's like if you're writing a square root function, and then you're like, hey, here is the layout of the structs in memory or something. You're also talking about formatting and parsing things, you know, write in XML, produce JSON, whatever, like, again, that's really none of your business most of the time.

So you want to write a human-readable spec, but you're saying things like, do not ignore this, generate XML, answer in JSON, you are Professor Einstein, a wise expert in the field, I'll tip you $1,000, right? Like, that is just not engineering, guys. So what should we do? Trusty old separation of concerns, I think, is the answer.

Your job, as an engineer, is to invest in your actual system design, and, you know, starting with the spec. The spec, unfortunately, or fortunately, cannot be reduced to one thing, and this is the one time I'll talk about evals. I know everyone hears about evals, so this is the one line about evals that makes this a talk about evals.

A lot of the time, you want to invest in natural language descriptions, because that is the power of this new framework. Natural language definitions are not prompts. They are highly localized pieces of ambiguous stuff that could not have been said in any other way, right? I can't tell the system certain things, except in English, so I'll say it in English.

But a lot of the time, I'm actually iterating to appease a certain model and to make it perform well relative to some criteria I have, not telling it the criteria, just tinkering with things. There, evals are the way to do this, because evals say, here is what I actually care about.

Change the model, the evals are still what I care about. It's a fundamental thing. Now, evals are not for everything. If you try to use evals to define the core behavior of your system, you will not learn. Induction, learning from data, is a lot harder than following instructions, right?

So you need to have both. Code is another thing that you need. You know, a lot of people are like, oh, it's just like a, you know, just ask it to do the thing. Well, who's going to define the tools? Who's going to define the structure? How do you handle information flow?

Like, you know, things that are private should not flow in the wrong places, right? You need to control these things. How do you apply function composition? LLMs are horrible at composition because neural networks kind of essentially don't learn things that reliably. Function composition in software is always perfectly reliable, basically, right, by construction.

So a lot of the things are often best delegated to code, right? But it's hard and it's really important that you can actually juggle and combine these things. And you need a canvas that can allow you to combine these things well. When you do this, a good canvas, the definition here of a good canvas, or the criteria for a good canvas, is that it should allow you to express those three in a way that's highly streamlined and in a way that is decoupled and not entangled with models that are changing.

I should just be able to hot swap models. Inference strategies that are changing. Hey, I want to switch from a chain of thought to an agent. I want to switch from an agent to a Monte Carlo Tree Search, whatever the latest thing that has come out is, right? I should be able to just do that.

And new learning algorithms. This is really important. We talked about learning, but learning is, you know, always happening at the level of your entire system if you're engineering it, or at least you've got to be thinking about it that way, where you're saying, I want the whole thing to work as a whole for my problem, not for some general default, right?

So that's what the evals here are going to be doing. And you want a way of expressing this that allows you to do reinforcement learning, but also allows you to do prompt optimization, but also allows you to do any of these things at the level of abstraction that you're actually working with.

So the second takeaway is that you should invest in defining things specific to your AI system and decouple from the lower-level swappable pieces, because they'll expire faster than ever. So I'll just conclude by telling you we've built, and been building for three years, this DSPy framework, which is the only framework that actually decouples your job, which is writing the AI software, from our job, which is giving you powerful, evolving toolkits for learning and for search, which is scaling, and for swapping LLMs through adapters.

So there's only one concept you have to learn. It is a new concept, which we call signatures, a new first-class concept. If you learn it, you've learned DSPy. I'll unfortunately have to skip this because of the time for the other speakers, but let me give you a summary.
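
The signatures slide is skipped in the talk, but the concept is documented on the DSPy site. A minimal sketch using DSPy's public API; the model name, the field names, and the choice of module below are illustrative, not taken from the talk:

```python
import dspy

# Pick any LM; swapping it later doesn't require touching the program.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class GenerateAnswer(dspy.Signature):
    """Answer a question using the provided context."""
    question: str = dspy.InputField()
    context: str = dspy.InputField(desc="retrieved passages")
    answer: str = dspy.OutputField(desc="a short, factual answer")

# The inference strategy is a swappable module wrapped around the same signature,
# e.g. dspy.ChainOfThought here, or dspy.ReAct(GenerateAnswer, tools=[...]) for an agent.
generate = dspy.ChainOfThought(GenerateAnswer)

prediction = generate(question="...", context="...")
print(prediction.answer)
```

The task definition lives in the signature; the LM, the inference strategy, and any optimizer you later apply can all be swapped around it without rewriting the program.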

I can't predict the future. I'm not telling you that if you do this, the code you write tomorrow will be there forever. But I'm telling you the least you can do (this is not, like, the top level, it's just the baseline, I would say) is avoid hand engineering at lower levels than what today allows you to do, right?

That's the big lesson from the bitter lesson and from premature optimization being the root of all evil. Among your safest bets (they could turn out to be wrong, I don't know) is that models are not anytime soon gonna read specs off of your mind. I don't know if, like, we'll figure that out.

And they're not going to magically collect all the structure and tools specific to your application. So that's clearly stuff you should invest in, right, when you're building a system. Invest in the signatures, which, again, you can learn about on the DSPy site, dspy.ai. Invest in essential control flow and tools, and invest in evals for things that you would otherwise be iterating on by hand.

And ride the wave of swappable models, ride the wave of the modules we build, just swap them in and out, and ride the wave of optimizers, which can do things like reinforcement learning or prompt optimization for whatever application it is that you've built. All right. Thank you, everyone.

All right. All right. We won't take up time on the computer. I think I learned that lesson. So, for now, I'll go ahead and pick something. Llama 3. Let's see how it did. Is it true? So, here's the introduction, even though I asked it not to give me a preface.

It's my pleasure to introduce our next speakers, Rafal Wilinski, also known as Rafal Winiski, and Vitor Baloko, also known as Vitor Baloko, as the AI tech lead and staff AI engineer at Zapier. Recently, they'll share their insights on how to turn agent failures into opportunities for growth and improvement.

And then it said, "Note, I've also used the phonetic pronunciation guides to help with the speakers' names, which are often tricky to pronounce for native speakers." So, that's Llama 3 right there. All right. I'll let you all take it from here. Cool. Thank you, Omar. How's everybody doing? I don't think I have a mic yet, so give me a second.

Now it's the time to take a selfie. All right. Cool. Hey, everybody. How's it going? We're Vitor and Rafal, and we're AI engineers at Zapier, and we're building Zapier agents. And today, we would like to share a little bit with you some of the hard-won lessons about building agents that we learned from the past two years, building Zapier agents.

Yeah. And a brief introduction to Zapier Agents. I believe many of you know what Zapier is. It's automation software, a lot of boxes and arrows, essentially about automating your business processes. Agents is just, well, a more agentic alternative to Zapier. You describe what you want, we propose a bunch of tools and a trigger, and we enable that.

And hopefully, we automate your whole business processes. And a key lesson that we have after those two years is that building good AI agents is hard. And building a good platform to enable non-technical people to build AI agents is even harder. That's because AI is non-deterministic, but on top of that, your users are even more non-deterministic.

They are going to use your product in a way that you cannot imagine up front. So, if you think that building agents is not that hard, you probably have this kind of picture in mind. You probably stumbled upon this library called Blankchain. You pulled some examples from a tutorial. You picked a prompt, pulled in a bunch of tools.

You chatted with the solution, and you thought, well, it's actually kind of working, all right? So, let's deploy it, and let's collect some profit. Turns out the reality has a surprising amount of detail, and we believe that building probabilistic software is a little bit different than building traditional software.

The initial prototype is only a start, and after you ship something to your users, your responsibility switches to building the data flywheel. So, once your user starts using your product, you need to collect the feedback. You're starting to understand the usage patterns, the failures, so then you can build more evals, build an understanding of what's failing, what are the use cases.

As you're building more evals and more features, your product is probably getting better, so you're getting more users, and there are more failures, and you have to build more features, and on and on and on. So, yeah, it forms this data flywheel. But let's start with the first step. Okay, yeah.

So, starting from the beginning, how do you start collecting actionable feedback? Backing up for just a second, the first step is to make sure you're instrumenting your code, right? Which you probably already are doing. Whether you're using Braintrust or something else, they all offer an easy way to get started, like just tracing your completion calls.

And this is a good start, but actually, you also want to make sure that you're recording much more than that in your traces. You want to record the tool calls, the errors from those tool calls, the pre- and post-processing steps. That way, it will be much easier to debug what went wrong with the run.

And you also want to strive to make the run repeatable for eval purposes. So, for instance, if you log data in the same shape as it appears in the runtime, it makes it much easier to convert it to an eval run later, because you can just pre-populate the inputs and expected outputs directly from your trace for free.

And this is especially useful as well for tool calls, because if your tool call produces any side effects, you probably want to mock those in your evals. So, you get all that for free if you're recording them in your trace.
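
As a rough illustration of logging runs in the same shape the runtime sees them (the field names below are hypothetical, not Zapier's schema), the point is that a recorded run converts straight into an eval row, with the recorded tool results doubling as mocks for side-effecting tools:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]
    result: Any                 # recorded output, reusable as a mock in evals
    error: str | None = None    # record failures too; they matter for debugging

@dataclass
class AgentRun:
    inputs: dict[str, Any]                              # same shape the runtime saw
    tool_calls: list[ToolCall] = field(default_factory=list)
    output: str | None = None                           # final answer from the run

def to_eval_case(run: AgentRun) -> dict[str, Any]:
    """Turn a logged run into an eval row: inputs, expected output, and tool mocks."""
    return {
        "input": run.inputs,
        "expected": run.output,
        "tool_mocks": {call.name: call.result for call in run.tool_calls},
    }
```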

Okay, great. So, you've instrumented your code, and you started getting all this raw data from your runs. Now it's time to figure out what runs to actually pay attention to. Explicit user feedback is really high signal, so that's a good place to start. Unfortunately, not many people actually click those classic thumbs up and thumbs down buttons. So, you've got to work a bit harder for that feedback.

And in our experience, this works best when you ask for the feedback in the right context. So, you can be a little bit more aggressive about asking for the feedback, but you're in the right context. You're not bothering the user before that. So, for us, one example of this is once an agent finished running, even if it was just a test run, we show a feedback call to action at the bottom, right?

Did this run do what you expected? Give us the feedback now. And this small change actually gave us a really nice bump in feedback submissions, surprisingly. So, thumbs up and thumbs down are a good benchmark, a good baseline. But try to find these critical moments in your user's journey where they'll be most likely to provide you that feedback.

Either because they're happy and satisfied or because they're angry and they want to tell you about it. Even if you work really hard for the feedback, explicit feedback is still really rare. And explicit feedback that's detailed and actionable is even harder. Because people are just not that interested in providing feedback, generally.

So, you also want to mine user interaction for implicit feedback. And the good news is there's actually a lot of low-hanging fruit possibilities here. Here's an example from our app. Users can test an agent before they turn it on to see if everything is going okay. So, if they do turn it on, that's actually really strong positive implicit feedback, right?

Copying a model's response is also good implicit feedback. Even OpenAI is doing this for ChatGPT. And you can also look for implicit signals in the conversation. Here the user is clearly letting us know that they're not happy with the results. Here they're telling the agent to stop slacking around, which is clearly implicit negative feedback, I think.

Sometimes the user sends a follow-up message that is mostly rehashing what they asked the previous time, to see if the LLM interprets that phrasing better. That's also good implicit negative feedback. And there's also a surprising amount of cursing. Recently, we had a lot of success using an LLM to detect and group frustrations.

And we have this weekly report that we post in our Slack. But it took us a lot of tinkering to make sure that the LLM understood what frustration means in the context of our product. So, I encourage you to try it out. But expect a lot of tinkering.
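
A rough sketch of that frustration pass; the prompt wording, the model name, and the client are assumptions, and, as they say, expect a lot of tinkering on what "frustration" means for your product:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and an API key in the environment

FRUSTRATION_PROMPT = (
    "You review transcripts of users talking to an automation agent. "
    "Label the user's last message FRUSTRATED if they are repeating themselves, "
    "correcting the agent, cursing, or expressing anger about the agent's work; "
    "otherwise label it OK. Reply with a single word."
)

def is_frustrated(conversation: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable model works
        messages=[
            {"role": "system", "content": FRUSTRATION_PROMPT},
            {"role": "user", "content": conversation},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("FRUSTRATED")
```

Labels like this, grouped over a week, are the kind of thing that can feed the Slack report they describe.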

You should also not forget to look at more traditional user metrics, right? There's a lot of stuff in there for you to mine implicit signal too. So, find what metrics your business cares about and figure out how to track them. Then you can distill some signal from that data. You can look for customers, for example, that churned in the last seven days and go look at their last interactions with your product before they left.

And you're likely to find some signal there. Okay. So, I have raw data. But now, I'll let the industry experts speak. Why isn't this thing working? Oh. Yeah. Or the beatings will continue until everyone looks at their data. Okay. But how actually are you going to do that? So, we believe that the first step is to either buy or build LLM ops software.

We do both. You're definitely going to need that to understand your agent runs. Because one agent run is probably multiple LLM calls, multiple database interactions, tool calls, REST calls, whatever. Each one of them can be a source of failure. And it's really important to piece together this whole story. Understand this.

You know, what caused this cascading failure. Yeah. I said we are doing both, because I believe coding your own internal tooling is really, really easy right now with Cursor and Claude Code. And it's going to pay you massive dividends in the future, for two reasons. First of all, it gives you the ability to understand your data in your own specific domain context.

And second of all, you should also be able to create functionality to turn every single interesting case or every failure into an eval with the minimal amount of friction. So, whenever you see something interesting, there should be, like, one click to turn it into an eval.

It should become your instinct. Once you understand what's going on on a singular run basis, you can start understanding things at scale. So, now we can do feedback aggregations, clustering. You can bucket your failure modes. You can bucket your interactions. And then you are going to start to see what kind of tools are failing the most.

What kind of interactions are the most problematic. That's going to almost create for you like an automatic roadmap. So, you'll know where to apply your time and effort to improve your product the most. Doing anything else is going to be a suboptimal strategy. Something that we are also experimenting with is using reasoning models to explain the failures.

It turns out that if you give them the trace output, input, instructions, and anything else you can find, they are pretty good at finding the root cause of a failure. Even if they don't, they are probably going to explain the whole run to you or just direct your attention to something that's really interesting and might help you find the root cause of the problem.

Cool. So, now you have a good short list of failure modes you want to work on first. It's time to start building out your evals. And we realize over time that there are different types of evals. And the types of evals that we want to build can be placed into this hierarchy that resembles the testing pyramid for those of you that know that.

So, with unit-test-like evals at the base, end-to-end evals, or trajectory evals as we like to call them, in the middle, and the ultimate way of evaluating, A/B testing with staged rollouts, at the top. So, let's talk a bit about those. Starting with unit test evals, we are just trying to predict the n+1 state from the current state.

So, these work great when you want to do simple assertions, right? For instance, you could check whether the next state is a specific tool call, or if the tool call parameters are correct, or if the answer contains a specific keyword, or if the agent determined that it was done. All that good stuff.
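
A minimal sketch of what one of these next-step assertions can look like; the agent API and the field names here are hypothetical:

```python
def test_calendar_lookup_step(agent, eval_case):
    """Unit-test-style eval: given a recorded state, assert only the very next step."""
    next_step = agent.predict_next_step(eval_case["state"])  # hypothetical agent API

    assert next_step.type == "tool_call"
    assert next_step.tool == "calendar.list_events"
    assert next_step.args["calendar_id"] == eval_case["expected_calendar_id"]
```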

So, if you're starting out, we recommend focusing on unit test evals first because these are the easiest to add. It helps you build that muscle of looking at your data, spotting problems, creating evals that reproduce them, and then just focusing on fixing them, right? Beware, though, of turning every positive feedback into an eval.

We found that unit test evals are best for hill-climbing specific failure modes that you spot in your data. So, now, unit test evals are not perfect, and we realized that ourselves. We realized we had over-indexed on unit test evals when new models were coming out that were objectively stronger models, but they were still performing worse on our internal benchmarks, which was weird.

And because the majority of our evals were so fine-grained, this made it really hard to see the forest for the trees when benchmarking new models. There was always a lot of noise when we tried comparing runs. Like, when you're looking at a single trace, it's easy to kind of go through the trace and understand what's happening, but when you want to look at it through an aggregation of many traces, then it starts getting difficult to understand what's happening.

Why are so many of these passing and some of these regressing? Yeah, so we realized that maybe a machine can help us. It turns out, in that previous video when I was investigating one experiment inside Braintrust, there is a lot of looking at that screen trying to figure out what went wrong, and we were like, hey, maybe we can just give all this data to, once again, a reasoning LLM and have it compare the models for us.

It turns out that with Braintrust MCP and reasoning model, you can just ask it to, hey, look at this run, look at this run, and tell me what's actually different about the new model that we are going to deploy. In this case, it was Gemini Pro versus Claude, and what the reasoning model found was actually really, really good.

It found that Claude is like a decisive executor, whereas Gemini is really yapping a lot. It's asking follow-up questions. It needs some positive affirmations, and it's sometimes even hallucinating about JSON structures. So, yeah, it helped us a lot. It also surfaced a problem with unit test evals, which is that different models have different ways of trying to achieve the same goal.

And unit test evals penalize different paths. They are, like, hard-coded to only follow one path. And, yeah, our unit test evals were overfitting to our existing models, or actually to data collected using those models. So, what we started experimenting with is trajectory evals. Yeah, instead of grading just one iteration of an agent, we let the agent run to the end state.

And we are not grading just the end state, but we are also grading all the tool calls that were made along the way and all the artifacts that have been generated along the way. And this can be also paired with LLM as a judge. Vitor is going to speak about it later.
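
A rough sketch of a trajectory-style grader in that spirit; the run structure and the tool names are hypothetical:

```python
def grade_trajectory(run) -> dict[str, float]:
    """Grade a full run: the end state plus the tool calls and artifacts along the way."""
    tool_names = [call.name for call in run.tool_calls]

    return {
        # Did the agent reach a terminal state instead of giving up mid-way?
        "completed": float(run.final_state == "done"),
        # Were the required tools used at least once, in any order?
        "used_required_tools": float({"search_contacts", "draft_email"}.issubset(tool_names)),
        # Are the produced artifacts (e.g. the drafted email) non-empty?
        "artifacts_ok": float(all(len(a.content) > 0 for a in run.artifacts)),
    }
```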

Yeah, but they are not free. I think they have really high return on investment, but they are much harder to set up, especially if you are evaluating runs that have tools that cause side effects, right? When you are running an eval, you definitely don't want to send an email on behalf of the customer once again, right?

So, we had a fundamental question of whether we should mock the environment or not. And we decided that we are not going to mock the environment, because otherwise you are going to get data that is just not reflecting reality. So, what we started doing is just mirroring the user's environment and crafting a synthetic copy of it.

Also, they are much slower, right? So, they can sometimes take up to an hour, so it is not pretty great. And we are also leaning a bit more into LLM as a judge. This is when you are using an LLM to grade or compare results from your evals. And it is tempting to lean into them for everything, but you need to make sure that the judge is judging things correctly, which can be surprisingly hard.

And you also have to be careful not to introduce subtle biases, right? Because even small things that you might overlook might end up influencing it. Lately, we have also been experimenting with this concept of rubric-based scoring. We use an LLM to judge the run, but each row in our dataset has a different set of rubrics that were handcrafted by a human and describe in natural language what specifically about this run the LLM should be paying attention to for this score.
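
A minimal sketch of the per-row rubric idea; the judge prompt and the `judge_llm` callable are assumptions:

```python
def judge_with_rubric(judge_llm, rubric: str, run_transcript: str) -> float:
    """Ask an LLM judge to score one run against that row's handcrafted rubric."""
    prompt = (
        "You are grading a single agent run.\n"
        f"Rubric for this run: {rubric}\n"
        "Score 1 if the run satisfies the rubric, 0 if it does not. "
        "Reply with just the number.\n\n"
        f"Run transcript:\n{run_transcript}"
    )
    return float(judge_llm(prompt).strip())
```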

So, one example of this: did the agent react to an unexpected error from the calendar API and then retry it? So, to sum it up, here's our current mental model of the types of evals that we build for Zapier Agents. We use LLM-as-a-judge or rubric-based evals to build a high-level overview of your system's capabilities.

And these are great for benchmarking new models. We use trajectory evals to capture multi-turn criteria. And we use unit-test-like evals to debug specific failures and hill-climb them. But beware of overfitting with these. Yeah. And a couple of closing thoughts: don't obsess over metrics. Remember that when a metric becomes a target, it ceases to be a good measure.

So, when you're close to achieving a 100% score on your eval dataset, it doesn't mean that you're doing a good job. It actually means that your dataset is just not interesting, right? Because we don't have AGI yet, so it's probably not true that your model is that good. Something that we're experimenting with lately is dividing the dataset into two pools.

The regressions dataset, to make sure that when we are making any changes, we are not breaking existing use cases for the customers. And also the aspirational dataset of things that are extremely hard, for instance, nailing 200 tool calls. And lastly, let's take a step back. What's the point of creating evals in the first place?

Your goal isn't to maximize some imaginary number in a lab-like setting. Your end goal is user satisfaction. So, the ultimate judges are your users. You shouldn't be optimizing for the biggest scores on your evals and completely disregarding the vibes. So, that's why we think the ultimate verification method is an A/B test.

Just take a small portion of your traffic, let's say 5%, and route it to the new model. Route it to the new prompt. Monitor the feedback. Check your metrics like activation, user retention, and so on. Based on that, probably you can make the most educated guess instead of being in the lab and optimizing this imaginary number.
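
A minimal sketch of that kind of staged rollout; the split logic below is generic, not Zapier's:

```python
import hashlib

def assign_variant(user_id: str, rollout_percent: float = 5.0) -> str:
    """Deterministically route a small slice of users to the candidate model or prompt."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket >= rollout_percent:
        return "control"
    return "candidate"

# The serving code picks the model/prompt config for the returned variant, and you
# watch feedback, activation, and retention per variant before rolling out further.
```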

So, that's all. Thank you. We have time for a couple of questions, like one or two. There's a mic here on the left and another mic right there for anybody that wants to ask a question for Vitor and for Rafal. Anybody interested? You can also raise your hand and...

You can also say the question and if you don't mind, Vitor, repeat it. The question I had for you was, you mentioned at the very end you were talking about splitting these things up into three different data sets. Do you combine that with like the rubrics thing to sort of create like a map of like how things end up looking?

Or like how does that actually end up driving product decisions and like everything in general? Yeah, great question. We're still experimenting with the rubrics-based evals. And I think they're really useful when you want to eval really long trajectories or especially useful when you want to compare a new model that just came out.

But I think ultimately the combination of all the different evals is what tracks our product roadmap, right? Especially when we look at things in aggregate and we see we group things in buckets and then we know, okay, this is the most important thing that we have to focus now.

And then we can kind of go and, like, slice our data sets to find which part of that data set is most representative for us to kind of iterate over. And of course, afterwards we run the complete data set as well to catch any regressions. So that's kind of how we think about it.

Cool. Do we have another question over there? Yeah, one more question here. I'm Dhruv from PyLabs. I was really curious what your process for like evaling and aligning your eval looked like. Was that a pretty time-consuming thing? It's something we've struggled with and met a lot of people who have struggled with, so.

Cool. Anything else? Yeah, sure. So first of all, I think we start building evals by looking at the customer feedback, especially the negative customer feedback. We used to have labeling parties where we like group together and you're looking at the trace and you're trying to understand what went wrong and what's expected behavior.

We also have Slack channel and we have something like an on-call rotation. There's always like the one engineer that just looks at it and does its best to find a problem. Yeah, but there is no like really hard process for that. I think it's best effort, I would say.

Yeah. Yeah, and we kind of stumbled through different types of evals over time. We started like big long trajectory evals. Oh yeah. Then we doubled down on unit test evals and kind of neglected the rest and now we're feeling a bit of the pain from that. So in the end we kind of spread out over different types of evals and we're still kind of building that intuition about what the ratio of those should be.

Yeah. All right. Thank you, folks. You're welcome. All right. Good luck. All right. Thank you. In case folks came in at the last second, what I'm doing is I'm actually running an eval by presenting, and I have multiple models here. Thank you. So let's try Claude 4. It was good. Really.

So up next we have Ito Pesak, an AI engineer at Vercel, who's been deep in the trenches building the AI systems behind V0, Vercel's AI-powered web development tool. He's here to share hard-won wisdom about why evaluating AI systems requires throwing out everything you thought you knew about testing, because when your system is non-deterministic, your evals need to be anything but conventional.

All right. Take it away. All right. Thank you. Hello, everyone. Hello. Good to see you. Let's see. Hopefully my slides go up. Okay. Perfect. Awesome. Okay. Thank you so much for coming. My name is Ito. I'm an engineer at Vercel working on V0. If you don't know, V0 is a full stack Vibe coding platform.

It's the easiest and fastest way to prototype, build on the web, and express new ideas. Here are some examples of cool things people have built and shared on Twitter. And to catch you up, we recently just launched GitHub Sync. So you can now push generated code to GitHub directly from V0.

You can also automatically pull changes from GitHub into your chat. And furthermore, switch branches and open PRs to collaborate with your team. I'm very excited to announce we recently crossed 100 million messages sent. And we're really excited to keep growing from here. So my goal of this talk is for it to be an introduction to evals and specifically at the application layer.

You may be used to evals at the model layer, which is what the research labs will cite in model releases. But this will be a focus on what do evals mean for your users, your apps, and your data. The model is now in the wild, out of the lab, and it needs to work for your use case.

And to do this, I have a story. It's a story about this app called Fruit Letter Counter. And if the name didn't already give it away, all it is is an app that counts the letters in fruit. So the vision is we'll make a logo with ChatGPT. There might be a product market fit already because everyone on X is dying to know the number of letters in fruit.

If you didn't get it, it's a joke on the how many R's on the strawberry prompt. We'll have V0 make all the UI and backend, and then we can ship. So we had V0 write the code. It used AISDK to do the stream text call. And what do you know?

It worked first try. GPT-4.1 said three. And not only did it say three once, I even tested it twice, and it worked both times in a row. So from there, we're good to ship. All right, let's launch on Twitter. Want to know how many letters are in a fruit?

Just launched fruitlettercounter.io. The .com and .ai were taken. And yeah, everything was going great. We launched and deployed on Vercel. We had fluid compute on until we suddenly get this tweet. John said, I asked how many R's in strawberry, and it said two. So, of course, I just tested it twice.

How is this even possible? But I think you get where I'm going with this, which is that by nature, LLMs can be very unreliable. And this principle scales from a small letter counting app all the way to the biggest AI apps in the world. The reason why it's so important to recognize this is because no one is going to use something that doesn't work.

It's literally unusable. And this is a significant challenge when you're building AI apps. So I have a funny meme here, but basically AI apps have this unique property: they're very demo-savvy. You'll demo it, it looks super good, you'll show it to your coworkers, and then you ship to prod, and then suddenly hallucinations come and get you.

So we always have this in the back of our head when we're building. Back to where we were, let's actually not give up. We actually want to solve this for our users. We want to make a really good fruit letter counting app. So you might say, how do we make reliable software that uses LLMs?

Our initial prompt was a simple question, right? But maybe we can try prompt engineering. Maybe we can add some chain of thought, something else to make it more reliable. So we spent all night working on this new prompt. You're an exuberant, fruit-loving AI on an epic quest, dot, dot, dot.

And this time, we actually tested it ten times in a row on ChatGPT, and it worked every single time. Ten times in a row. It's amazing. So we shipped. And everything was going great until John tweeted at me again. And he said, I asked how many Rs are in strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry.

And it said five. So we failed John again. Although this example is pretty simple, this is actually what will happen when you start deploying to production. You'll get users that come up with queries you could have never imagined. And you actually have to start thinking about how do we solve it?

And the interesting thing, if you think about it, is 95% of our app works 100% of the time. We can have unit tests for every single function, end-to-end tests for the off, the login, the sign-out. It will all work. But it's that most crucial 5% that can fail on those.

So let's improve it. Now, to visualize this, I have a diagram for you. Hopefully you can see the code. Maybe I need to make my screen brighter. Can you see the code? I don't know. Okay. Okay. Well, we'll come back to this. But basically, we're going to start building evals.

And to visualize this, I have a basketball court. So today's day one of the NBA finals. I don't know if you care. You don't need to know much about basketball. But just know that someone is trying to throw a ball in the basket. And here, the basket is the glowing golden circle.

So blue will represent a shot make. And red will represent a shot miss. And one property to consider is that the farther away your shot is from the basket, the harder it is. Another property is that the court has boundaries. So this blue dot, although the shot goes in, it's out of the court.

So it doesn't really count in the game. Let's start plotting our data. So here we have a question, how many Rs in strawberry? This, after our new prompt, will probably work. So we'll label it blue. And we'll put it close to the basket because it's pretty easy. However, how many Rs are in that big array?

We'll label it red. And we'll put it farther away from the basket. Hopefully, you can see that. Maybe we can make it a little bit brighter. But this is the data part of our eval. Basically, you're trying to collect what prompts your users are asking. And you want to just store this over time and keep building it.

And store where these points are on your court. Two more prompts I want to bring up: what if someone says, how many Rs are in strawberry, pineapple, dragon fruit, mango, after we replace all the vowels with Rs, right? Insane prompt, but still technically in our domain.

So we'll label it as red all the way down there. But a funny one is like, how many syllables are in caret? So this, we'll call it out of bounds, right? None of our users are actually going to ask. It's not part of our app. So no one is going to care.

I hope you can see the code. But basically, when you're making eval, here's how you can think about it. Your data is the point on the court. Your shot, or in this case in Braintrust they call it a task, is the way you shoot the ball towards the basket.

And your score is basically a check of did it go in the basket or did it not go in the basket. To make good evals, you must understand your court. This is the most important step.
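
Braintrust's Python SDK maps onto this framing pretty directly with `Eval(data, task, scores)`. A minimal sketch along the lines of its quickstart; the project name, the stand-in task, and the scorer are illustrative, and the exact scorer signature is worth checking against the docs:

```python
from braintrust import Eval

def fruit_counter_task(fruits: str) -> str:
    # The "shot": whatever your app does for a query. A deterministic stand-in here
    # so the sketch runs end to end; in the real app this is the LLM call.
    return f"There are {fruits.lower().count('r')} r's."

def went_in_the_basket(input, output, expected):
    # The "score": a plain pass/fail check that the right number shows up in the answer.
    return 1.0 if str(expected) in output else 0.0

Eval(
    "fruit-letter-counter",  # project name (illustrative)
    data=lambda: [
        {"input": "strawberry", "expected": 3},  # close to the basket
        {"input": "strawberry, banana, pineapple, mango, kiwi, dragon fruit, apple, raspberry",
         "expected": 8},                          # far from the basket
    ],
    task=fruit_counter_task,
    scores=[went_in_the_basket],
)
```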

And you have to be careful of falling into some traps. First is the out-of-bounds trap: don't spend time making evals for data your users don't care about. You have enough problems, I promise you, with queries that your users do care about. So be careful of feeling productive because you're making a lot of evals when they're not really applicable to your app.

And another visualization is don't have a concentrated set of points. When you really understand your court, you're going to understand where the boundaries are, and you want to make sure you test across the entire court. A lot of people have been talking about this today, but to collect as much data as possible, here are some things you can do.

First is collect thumbs up, thumbs down data. This can be noisy, but it also can be really, really good signal as to where your app is struggling. Another thing is, if you have observability, which is highly recommended, you can just read through random samples in your logs. Users might not be giving you explicit signal, but if you take, like, a hundred random samples and go through them, like, once a week, you'll get a really good understanding of who your users are and how your users are using the product.

If you have community forums, these are also great; people will often report issues they're having with the LLM. X and Twitter are great too, but can be noisy. And there really is no shortcut here. You have to do the work and understand what your court looks like.
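As a minimal illustration of that weekly random-sample habit (the Trace shape and fetchRecentTraces() are hypothetical stand-ins for whatever your observability tool exposes):

```typescript
// A minimal sketch of pulling a weekly random sample of logged traces for manual review.
// The Trace shape and fetchRecentTraces() are hypothetical stand-ins for your logging setup.
type Trace = { id: string; input: string; output: string; feedback?: "up" | "down" };

async function weeklyReviewSample(
  fetchRecentTraces: () => Promise<Trace[]>,
  n = 100
): Promise<Trace[]> {
  const traces = await fetchRecentTraces();
  // Fisher-Yates shuffle so every trace has an equal chance of being reviewed.
  for (let i = traces.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [traces[i], traces[j]] = [traces[j], traces[i]];
  }
  return traces.slice(0, n);
}
```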

If you're doing a good job of understanding your court and building your data set, this is what it should look like. You should know the boundaries, you should be testing within them, and you should understand where your system is blue versus where it's red.

Here it's really easy to tell: okay, maybe next week we need to prioritize the team to work on that bottom-right corner. That's somewhere a lot of users are struggling, and we can do a good job of flipping those tiles from red to blue. Another thing you can do is put constants in the data and variables in the task.

Just like in math or programming, you want to factor out constants because it improves clarity, reuse, and generalization. Let's say you want to test your system prompt. Keep the data constant: the things your users are going to ask. So for example, how many Rs in strawberry? That goes in the data.

That's a constant; it's never going to change throughout your app. What you're going to vary is the task: you might try different system prompts, different pre-processing, different RAG, and that's what you want to put in your task section. This way your setup actually scales, and when you change your system prompt, you never have to redo all your data.
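Here's a minimal sketch of that split, again assuming Braintrust's Eval() API; the prompts, SYSTEM_PROMPT, and runLetterCounter() are hypothetical, and the point is simply that only the task changes between experiments:

```typescript
// A minimal sketch of the constants-in-data, variables-in-task split, assuming
// Braintrust's Eval() API. The prompts, SYSTEM_PROMPT, and runLetterCounter()
// are hypothetical placeholders.
import { Eval } from "braintrust";

// Constant: what users actually ask. This never changes when you tweak the system.
const data = [
  { input: "How many Rs in strawberry?", expected: "3" },
  { input: "How many Rs are in strawberry, pineapple, dragon fruit, mango?", expected: "5" },
];

// Variable: the thing under test (system prompt, RAG, pre-processing) lives in the task.
const SYSTEM_PROMPT = "Count letters carefully. Think step by step before answering.";

// Hypothetical helper that runs your full pipeline with a given system prompt.
async function runLetterCounter(input: string, systemPrompt: string): Promise<string> {
  return `(pipeline output using "${systemPrompt}" for: ${input})`;
}

Eval("letter-counter-prompt-experiment", {
  data: () => data,
  task: (input) => runLetterCounter(input, SYSTEM_PROMPT),
  scores: [
    ({ output, expected }) => ({
      name: "contains_expected",
      score: output.includes(expected ?? "") ? 1 : 0,
    }),
  ],
});
```

Swapping the system prompt, or the whole task, leaves the data untouched, so results stay comparable across experiments.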

And this is a really nice feature of Braintrust. Also, if you don't know, the AI SDK offers a thing called middleware, and it's a really good abstraction for basically all your pre-processing logic: RAG, the system prompt, et cetera. You can then share this between the actual API route that's doing the completion and your evals, as sketched below.

If you think about the basketball court as practice, we're trying to practice our system across different models, and you want your practice to be as similar as possible to the real game. That's what makes good practice. So you want to share pretty much the exact same code between the evals and what you're actually running.
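A rough sketch of what that sharing can look like with the AI SDK's middleware (check the names and hooks against the docs for your SDK version; older releases use an experimental_ prefix, and the pre-processing body here is just a placeholder):

```typescript
// A rough sketch, assuming the AI SDK's wrapLanguageModel() and language-model middleware.
// This illustrates sharing one wrapped model between the production API route and the
// eval task; it is not canonical API usage, so verify against your AI SDK version.
import { openai } from "@ai-sdk/openai";
import { generateText, wrapLanguageModel, type LanguageModelV1Middleware } from "ai";

// All shared pre-processing lives in one middleware.
const appMiddleware: LanguageModelV1Middleware = {
  transformParams: async ({ params }) => {
    // Inject your system prompt, RAG context, etc. here (omitted in this sketch).
    return params;
  },
};

// The same wrapped model is imported by the API route and by the eval's task.
export const appModel = wrapLanguageModel({
  model: openai("gpt-4o-mini"),
  middleware: appMiddleware,
});

// Used identically in production and in evals, so "practice" matches the real game.
export async function runCompletion(userMessage: string): Promise<string> {
  const { text } = await generateText({ model: appModel, prompt: userMessage });
  return text;
}
```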

Now, I want to talk a little bit about scores, which is the last step of the eval. The unfortunate thing is that scoring varies greatly depending on your domain. In this case it's super simple: you're just checking if the output contains the correct number of letters. But if you're doing writing or tasks like writing, it's very, very difficult.

As a principle, you want to lean towards deterministic, pass/fail scoring. This is because when you're debugging, you're going to get a ton of inputs and logs, and you want to make it as easy as possible to figure out what's going wrong. If you over-engineer your score, it can be very difficult to share your evals with your team and distribute them across different teams, because no one will understand how things are getting scored.

Keep your scores as simple as possible. A good question to ask yourself when you're looking at the data is: what am I looking for to see if this failed? With V0, we're looking for whether the code worked. Maybe for writing, you're looking for certain linguistic features.

Ask yourself that question and write the code that looks for that. There are some cases where it's so hard to write the code that you may need to do human review, and that's okay. At the end of the day, you want to build out your court and you want to collect signal.

Even if you must do human review to get the correct signal, don't worry. If you do the correct practice, it will pay off in the long run, and you'll get better results for users. One trick you can do for scoring is don't be scared to add a little bit of extra prompt to the original prompt.

For example, here we can say: output your final answer inside these answer tags. This makes it very easy for you to do string matching, whereas in production you don't really want that extra instruction. So you can make small tweaks to your prompt so that scoring is easier.
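A sketch of what that can look like as a deterministic scorer (the tag name, the eval-only suffix, and the helper names are just illustrative, not tied to any particular framework):

```typescript
// A minimal deterministic scorer sketch: the eval-only prompt suffix asks the model to
// wrap its final answer in <answer></answer> tags, and the scorer string-matches it.
// The tag name and expected values are illustrative.

// Appended to the prompt only in evals, not in production.
const EVAL_SUFFIX = "\n\nOutput your final answer inside <answer></answer> tags.";

function extractAnswer(output: string): string | null {
  const match = output.match(/<answer>([\s\S]*?)<\/answer>/i);
  return match ? match[1].trim() : null;
}

// Pass/fail: 1 if the tagged answer exactly matches what we expect, else 0.
function exactAnswerScore(output: string, expected: string): number {
  return extractAnswer(output) === expected.trim() ? 1 : 0;
}

// Example usage:
// exactAnswerScore("I counted carefully. <answer>3</answer>", "3") === 1
```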

Another thing we highly recommend is adding evals to your CI. Braintrust is really nice here because you get these eval reports: it will run your task across all your data and give you a report at the end with the improvements and regressions. Say a colleague makes a PR that changes the prompt a bit.

We want to know, like, how did it do across the court, right? Visualize, like, did it change more tiles from red to blue? Maybe now our prompt fixed one part, but it broke the other part of our app. So this is a really useful report to have when you're doing PRs.

So yeah, going back, this is the summary of the talk. You want to make your evals a court of your data, and you can treat it like practice. Your model is basically going to practice. Maybe you want to switch players: when you switch models, you can see how a different player is going to perform in your practice.

This gives you such a good understanding of how your system is doing when you change things, like maybe your RAG or your system prompt. And you can now go to your colleague and say, hey, this actually did help our app. Because improvement without measurement is limited and imprecise.

Evals give you the clarity you need to systematically improve your app. When you do that, you get better reliability and quality, and higher conversion and retention. You also get to spend less time on support ops, because your evals, your practice environment, will take care of that for you.

And if you're wondering how I built all these court diagrams, I actually just used V0, and it made me a little app where I added these shot makes and misses around the basket. So yeah, thank you very much. I hope you learned a little bit about evals. Thank you.

So we do have some time for some questions. There are two mics, one over here, one over there. I can take two or three of those, please, if anybody's interested in asking. We have one over there. Mic five, please. Or you can repeat the question as well, if you don't mind.

Yeah, you can think of it as really like practice. Maybe you're a basketball player who in general scores, like, 90%, but they might miss more shots here or there. We run it at least every day, and then we get a good sense of where we're actually failing.

Did we have some regression? So yeah, running it daily, or at least on some schedule, will give you a good idea. I was thinking, say we run a question through it five times, right? Yeah. What's the percentage? Is it making four out of five, or five out of five?

Oh, I see. So definitely, as you go farther away, the harder questions get lower pass rates. But you can measure that overall too: is it passing four out of five this week, or something like that? And you just want to improve that. Evals are just a way to measure so you can actually do the improvement.

Yeah. Hello. I have a question about V0 evaluation, like, real evaluation. As I understand, you have a lot of trajectories: user trajectories, internal ones. So it's not about analyzing one trajectory, but hundreds or even thousands. The question is, how do you analyze them? Which method do you use to analyze thousands of trajectories and get valuable insights from them at scale?

Yeah. So V0 is in a special domain because we're doing code, so we can actually run the code. When you run things in V0, we can collect errors, so we do a lot of measurements based on those errors. However, if you have a task like writing or something else that's more difficult to measure, you might be in a worse spot where you need more human evals.

But what we do is, yeah, we track a lot of errors in the apps, and that usually is a good signal for the final trajectory of the app the user is building. Does that make sense? I'm also interested in how you analyze which tools were called and whether they were called properly.

It's non-deterministic, and it's interesting how to understand it. Yeah. So, like the Zapier presentation earlier said, the best thing is to look at the final output. You can get a little carried away with looking at every tool call. But at V0 the question is, did the app work?

A lot of times that's after all the steps in between. And that's what we really look at. It's, like, did the thing work at the end or not? And then if it didn't work, then we go in and we look at, like, all the steps and kind of try to find the one that broke.

But we start with looking at the final trajectory, after all the steps happened. Yeah. Okay. Yes, thank you. Awesome. How do you think about evals in terms of each user query versus a full conversation? I know you were just talking about thinking through the whole thing.

Yeah. But what if the user doesn't know where they're going at the end? Yeah, so for V0, the conversations can get really long. I would equate that to a more difficult data point, a very difficult shot on the court.

And we test those separately. So maybe you can have multiple evals within one chat, depending on where they are in the conversation, and at that point you test one message at a time. But yeah. Does that answer your question? Yeah, somewhat.

Thank you. Yeah. Did you have one? Really quickly? Yeah. I was going to ask how you deal with an error that's three or five turns into a conversation, where you need to mock up to that state and figure out whether it's able to resolve it from there.

Yeah, that's a good question. So on V0 we have the state at every point, so we were lucky: we can really easily traverse that state. But what we do, again, like I was saying earlier, is that in our evals you test one message ahead.

And if they got into a broken state already, then it's not really worth testing anymore, because usually the LLMs really suck at coming back from it. So we're mostly focused on finding that point where it did break, and you want to try to reduce the number of those points.

Maybe in the future, as models get better, you can start doing a better loop of error correction, and this is something we're constantly working on. But usually it's really helpful to find that one point where it ended. Okay. So, go over there.

Makes sense. Thank you. Yeah, thanks for the remainder of the time, appreciate that. So we do have one remaining speaker, but they're not in right now. Feel free to stick around; hopefully he shows up. Otherwise, there's also lunch available, right?

Thank you so much. Appreciate it. All right. So, we have a guest speaker, actually. If you don't mind introducing yourself, because I don't think even the LLM... Yeah, go for it. Hey, my name is Randall. I just launched a framework today for evals that I'm trying to figure out if it's good.

It's called Bolt Foundry; that's our company name. We are trying to help people who do JavaScript. Most people who do JavaScript don't have a great way to run evals or create synthetic data. We have a different approach that is based on journalism, actually: building prompts in a structured way that's more similar to how a newspaper is written than to the way most people do it, where they scare the crap out of an LLM.

Let's see if I can plug in. This is totally guerrilla, but no one else is here, so why not, right? All right, let's do this. It's live on GitHub right now: github.com/boltfoundry, or boltfoundry.com has our thing. All right, cool. So let me spin up my environment.

And the cool thing is that to start, you can just run npx bff-eval, and then we have a demo you can run with --demo. It all works with OpenRouter, so you can test against any system that you want.

Oh, this is a Replit bug. Hold on, let me fix this really quick. Shell. All right, one more time: npx bff-eval --demo. So here we've made a simple JSON validator. Rather than writing a JSON validator that actually parses the JSON, this is a grader.

A grader is an LLM that you can use to figure out if your thing is good. We have a set of samples; let me just grab them. samples.jsonl. Let's do this one, sure. That's the wrong one, but it's okay. So essentially the messages come in as a user message, an assistant response, and then optionally a score.

So you can actually set your ground truth and say: this is a sample of something that's really good, this is a sample of something that's really shitty. And then you can describe it. The description is mostly for your team, so that, like they were talking about, if you're pulling something from production and you have a sample that you want your team to know about, you can describe why it's relevant.
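Based purely on that description, a sample might be shaped roughly like this; this is my reconstruction for illustration, not Bolt Foundry's documented schema:

```typescript
// A rough reconstruction of the sample shape described above (user message, assistant
// response, optional ground-truth score, and a description for your team). This is an
// illustration based on the talk, not Bolt Foundry's documented schema.
type GraderSample = {
  messages: [
    { role: "user"; content: string },
    { role: "assistant"; content: string },
  ];
  // Optional ground truth on the grader's rubric scale.
  score?: number;
  // Why this sample matters, e.g. where in production it was pulled from.
  description?: string;
};

const example: GraderSample = {
  messages: [
    { role: "user", content: 'Is this valid JSON? {"a": 1,}' },
    { role: "assistant", content: "Yes, that is valid JSON." },
  ],
  score: -3, // ground truth: this response is wrong, since the trailing comma is invalid
  description: "Pulled from production; the grader should catch the trailing comma.",
};
```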

The eval runs; in this case we have ten runs, each with a rubric score from positive three to negative three. And the other speaker is here now, so thank you so much for letting me show you something cool. boltfoundry.com. Oh, I have a minute left. All right, so the other cool thing is, I'll just show you the end.

You can actually see where your grader diverged from the ground truth. So you can say, oh, this is weird, why is this one acting this way? Anyway, we're going to have more stuff. It's going to be cool. Thanks, man. Thank you for giving me a second. Yeah, no problem.

All right. Next up, Diego. I had your introduction ready, but hopefully you don't mind introducing yourself. Go ahead and set up: connect the USB there and then the HDMI there. There's no audio in your computer? I'll just project, that's fine. I'm going to give you this mic. Sorry, I thought you were up in the other room. Yeah, for context, they gave me instructions that I was actually supposed to be there, and then I was like, wait, someone else is starting. Interesting. I'll go up there afterwards. Mirror the screen, and then you can use this one; there's one here that I think you can use. Hello, hello. Okay, cool. Are we starting very soon, or should I wait a few minutes? That's one, two, one, two. Okay, good. Okay, perfect. Cool. It should be relatively short too, so we have time; we can wait a few minutes if that's better.

Okay, cool, I'm going to start. So hello everyone, my name is Diego Rodriguez, I'm the co-founder of Krea, a startup in the AI space like many others, in particular generative media, multimedia, multimodal, and all the buzzwords. But I come here mainly to tell a story about how we think about evaluations when we have to take human perception, human opinion, and aesthetics into the mix.

I'm going to start with a very simple story. I put in an AI-generated image of a hand, which obviously looks horrible, and then I ask o3: what do you think of this image? It thought for 17 seconds, obviously tool calling, running Python, doing analysis, OpenCV going crazy, and then, after charging me a few cents, it says something like: it looks mostly natural, just a couple of small issues. And it's like, okay, we have what many people claim is basically AGI, and it is completely unable to answer a very simple question. That's surprising if you think about it, because when we humans see that image, we just react so naturally, right?

Again, what is that? That's not natural. And I feel like that's precisely what AI models are being trained on: first, human data; second, human preference data; and third, they're in a way limited by the data that we humans produce, based on our preconceived notions and perception and all of that.

So that's what this talk is about: what can we do better, and honestly, asking ourselves some questions that I think are not being asked enough in the field. Cool. So, a tiny bit of history. We all know about Claude Shannon, the father of information theory. According to many, his master's thesis is one of the most important master's theses ever written, where he laid the foundations for digital circuits, and eventually communication, and to a degree we can say even LLMs nowadays, if we fast-forward.

I want to focus on the fact that we call his work foundational in information theory; well, when he published it, it was actually called "A Mathematical Theory of Communication." He was always focused on communication. There's this image that appears in that work, and it's all about: this is the source, this is the channel, this is the destination, and there can be some noise in there.

There can be some noise there. And as a founder of a company that is focusing on media, to me it's interesting to realize these parallels between classic information theory and communication. I didn't put the image here, but if you have any context around variational autoencoders, or neural networks, or whatever, you can squint and be like, oh, is that a neural network?

Right? And in the context of information and communication, I want to talk about how compression relates to how we think about evaluation. I'm going to take JPEG as an example. JPEG exploits human nature in the sense that we are very sensitive to brightness but not so much to color, and this is an illusion that illustrates that: A and B are actually the same color, but we are basically unable to perceive it until we do this, and then suddenly it's like, oh, really?

It's kind of like, what's going on, right? JPEG does the same thing: we have the RGB color space to represent images on computers, we notice that there's a diagonal that represents the brightness of the image, we can change into a different color space that separates color from brightness, and then we can downsample the color channels, because we're actually not that sensitive to them, so we can remove that.

Or parts of it. And once we do that, this is an image where we can see the brightness and color components separated. Once we downsample, we can try to recreate the image. And this is an example: the original image and the image with the downsampled color look the same to us, yet the second one carries something like 50% less information.
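As a toy sketch of that trick (a BT.601-style RGB-to-YCbCr conversion plus 4:2:0-style chroma subsampling; real JPEG adds a DCT, quantization, and entropy coding on top, all omitted here):

```typescript
// A toy sketch of the perceptual trick JPEG relies on: convert RGB to a luma/chroma
// space (BT.601 coefficients), keep full-resolution luma (Y), and subsample the chroma
// channels (Cb, Cr), which our eyes are far less sensitive to.
type RGB = { r: number; g: number; b: number };
type YCbCr = { y: number; cb: number; cr: number };

function rgbToYCbCr({ r, g, b }: RGB): YCbCr {
  return {
    y: 0.299 * r + 0.587 * g + 0.114 * b,
    cb: 128 - 0.168736 * r - 0.331264 * g + 0.5 * b,
    cr: 128 + 0.5 * r - 0.418688 * g - 0.081312 * b,
  };
}

// 4:2:0-style subsampling: keep one Cb/Cr pair per 2x2 block of pixels,
// so the color channels carry a quarter of the original samples.
function subsampleChroma(image: YCbCr[][]): { y: number[][]; cb: number[][]; cr: number[][] } {
  const h = image.length;
  const w = image[0]?.length ?? 0;
  const y = image.map((row) => row.map((px) => px.y));
  const cb: number[][] = [];
  const cr: number[][] = [];
  for (let i = 0; i < h; i += 2) {
    const cbRow: number[] = [];
    const crRow: number[] = [];
    for (let j = 0; j < w; j += 2) {
      // Average the 2x2 block (clamped at the edges) for each chroma channel.
      const block = [image[i][j], image[i][j + 1], image[i + 1]?.[j], image[i + 1]?.[j + 1]]
        .filter((px): px is YCbCr => px !== undefined);
      cbRow.push(block.reduce((s, px) => s + px.cb, 0) / block.length);
      crRow.push(block.reduce((s, px) => s + px.cr, 0) / block.length);
    }
    cb.push(cbRow);
    cr.push(crRow);
  }
  return { y, cb, cr };
}
```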

There's other stuff too, like Huffman coding, but the point is the same. And the thing is, if you exploit the same idea for audio, what can we hear, what can we not hear, you do the same and you have MP3. And if you do the exact same thing across time, well, congrats, now you have MP4.

It's all this principle of: let's exploit how we humans perceive the world. But this made me think about myself, because I studied artificial systems engineering, which is engineering around all of these things: how microphones work, how speakers work. And it was just interesting to me, because I was coding this.

So I started deleting information, and then I re-rendered the image, and I see the same thing. Philosophers always tell you, oh, we are limited by our senses, but this was the first time that I was like, I'm seeing it, right?

Like, I am not seeing the difference. But then, a lot of our data is the internet, right? We are scraping data from the internet, and a bunch of those images are also compressed. Are we taking into account that perhaps our AIs are limited too? Because we kind of have some sort of contagion going on, passing our flaws into the AI.

And then it gets more tricky, because, for instance, this is just a screenshot I took from a paper, I think it's called Clean-FID. FID scores, for those of you without context, are one of the standard metrics used for how well diffusion models, for instance, are reproducing an image. But then you start adding JPEG artifacts, and the score is like, oh no, no, no, this is a horrible, horrible image.

And perceptually, the four images are basically the same, yet the FID score says, no, no, no, this is really bad. So then, why are we using FID scores, or metrics along those lines, to decide whether a generative AI model is good or bad?

So the thing is, sometimes I feel like we are focused on measuring just the things that are easy to measure. Like prompt adherence with CLIP: how many objects are there, is this blue, is this red, et cetera. But what about here? Oh, it's like, no, really bad generator, because that's not how a clock looks, and the sky makes no sense. And it's like, okay, not only are we limiting our AIs by our human perception; on top of that, we forget about the relativity of metrics.

No, actually, this is art. And this is great, and there's sometimes meaning behind the work that the image conveys, but only if you're human do you get it: oh, this is what the author is trying to tell me. But I feel like the metrics don't show that.

And commercially and professionally, my job is kind of like: okay, how can we make a company that allows creatives and artists of all sorts, starting with image and video, to better express themselves? But how are we supposed to do that if this is the state of the art?

Then a friend of mine, Cheng Lu, who actually works at MidJourney and has great talks that you should all check out, told me once, a little over a year ago, a quote that I just can't stop thinking about, which goes something like: hey man, if you think about it, predicting the car back when everything was horses is not that hard.

I was like, what? Yeah, it's not that hard to say, oh, cars are the future, and whatever. We have a thing that goes like this, we have horses that provide the energy, so you swap the horse for an engine. That's essentially a car, right?

It's like, come on, how hard is that? You know what's hard to predict? Traffic. And then I just kept thinking about it. I was like, oh man, as engineers, as researchers, as founders, what are the traffics that we're missing now? Because I feel like everyone's focused on, yeah, but you can, I don't know, transform from JSON to YAML, and, like, who cares?

Dude, who cares? Or, yes, it's important, but what kind of big picture are we all missing? Then there's the myth of the Tower of Babel, where, in a nutshell, humanity wants to go and meet God, and God says, no, I don't want that.

So instead, I'm just going to confuse all of you, and then you're not going to be able to coordinate. Each one of you is going to speak a different language, and it's basically going to be impossible to keep the thing going. Which reminds me of standard infrastructure meetings with backend engineers: no, we should use Kubernetes, no, we should use something else, and it's just all fighting, and nothing gets built. And I'm like, dude, God is winning, God damn it.

But this makes me think: we just entered the age where you can have models that essentially solve translation, or solve it to a very high degree. So what happens now that we can all speak our own languages, yet at the same time communicate with each other?

I'm already doing it. For instance, I sometimes do customer support manually for Krea, and I literally speak Japanese with some of my users, and I don't speak Japanese. I've learned a little, but I don't speak it, and I'm now able to provide an excellent founder-led, whatever that means, level of customer support to a country that I otherwise would be unable to serve.

And so, I invite us all to think about what that really means, because this, for instance, means that we can now understand better, or transmit our own opinion better to others. And, on the previous point that I was talking about with the art, that's kind of like an opinion, right?

Evals are not just about: are there four cats here? It's about: this cat is blue, and it's like, yeah, but is it blue, or is it teal? What kind of blue? And I don't like this blue, and all of that. So, in a nutshell, how do we evolve our evals?

Like, in my opinion, this is bad; then I want metrics that take my opinion into account too. And then, consider that I may be a visual learner. What that means is that maybe your evals should take into account how we humans perceive images.

And also the nature of the data: it's all trained on JPEGs from the internet, so take the artifacts into account, take all of this into account when you're training. Okay, I guess, mandatory slide before the thank you: bunch of users, bunch of money.

We did all of that. We were eight people; now we're 12. This is an email I set up today for high-priority applications, for anyone who wants to work on research around aesthetics, hyper-personalization, and scaling generative AI models in real time, for multimedia: image, video, audio, 3D, across the globe. We have customers like these.

And that's it. Thank you. Oh, okay, Q&A? Okay, perfect. Any questions? There are many points there; can you reframe the question? Yeah. So the question, in a nutshell, is: are there perceptually aware metrics, right? Okay, I showed an example of the FID score.

It changes a lot with JPEG artifacts. Are there metrics where it's almost the opposite: they barely change, and the metric is still good? There are some, and many of these are also used in traditional encoding techniques. But in a way I'm here to invite us all to start thinking about those. We can actually train, I mean, it's called a classifier, right?

Or a continuous classifier. We can train it so that it understands what we mean: hey, I showed you these five images, and these five images are actually all good. And they can have all sorts of artifacts, not just JPEG artifacts.

And this is exactly where machine learning excels: when it's all about opinions, and it's like, you'll know it when you see it. That's precisely the type of question that AI is amazing at. All right.

Thank you, everyone. He'll be sticking around for questions on the side if you have some. And we are doing a working lunch. Thank you.
