I'll assume, given the time we have, that you roughly know who I am and what Pydantic is, so I will move on. I'm reusing the talk I gave at PyCon, Building AI Applications the Pydantic Way, which is, I guess, somewhat akin to this one. As I say, I'm not going to be able to get to the eval stuff today, but I can talk about these two.
So everything is changing really fast as we all get told repeatedly in ever more hysterical terms. Actually, some things are not changing. We still want to build reliable, scalable applications, and that is still hard. Arguably, it's actually harder with Gen AI than it was before, whether that is using Gen AI to build it or using Gen AI within your application.
So what we're trying to talk about here are some techniques that you can use to build applications quickly, but also somewhat more safely than you might otherwise. I'm a strong believer that type safety is one of the really important parts of that. Not just for avoiding bugs in production: no one starts off building an AI application knowing what it's going to look like.
So you're going to end up refactoring your application multiple times. If you build your application in a type safe way, if you use frameworks that allow it to be type safe, you can refactor it with confidence much more quickly. If you're using a coding agent like Cursor, it can use type safety, running the type checker, to basically mark its own homework and work out what it's doing right, in a way that you can't do if you use a framework like LangChain or LangGraph, which, whether by choice or inability, decided not to build something that's type safe.
I'll talk a bit about MCP if I have a moment. And I won't talk about how evals fit in, because I don't have time. Nothing I'm going to say here about what an agent is is controversial. This is reasonably well accepted now by most people as a definition of an agent.
This image here is from Barry Zhang's talk at AI Engineer in New York in February. This is his definition, or the Anthropic definition, of what an agent is, now being copied by us, by OpenAI, by Google's ADK; I think it's generally the accepted definition of an agent. This, although very neat, doesn't really make any sense to me.
This, however, does make sense. So what they say is that an agent is effectively something that has an environment, there are some tools which may have access to that environment, there is some system prompt that describes to it what it's supposed to do, and then you have a while loop where you call the LLM, get back some tool calls to run, run the tools, that updates the state, and then you call the LLM again.
There is, however, even in his, whatever it is, six lines of pseudocode, a bug, which is that there is no exit from that loop. And sure enough, that points towards a real problem, which is that it is not clear when you should exit that loop. And so there are a number of different things you can do.
You can say that when the LLM returns plain text rather than calling a tool, that is the end. Or you can have certain tools, which are what we call final result tools, which basically trigger the end of the run. Or if you're using models like OpenAI's or Google's, which have structured output types, you can use that to end the run.
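To make that concrete, here is a rough Python-flavoured sketch of that loop with an explicit exit condition. call_llm and run_tool are hypothetical placeholders standing in for whatever model and tool plumbing you have; this is pseudocode in the spirit of Barry Zhang's slide, not a Pydantic AI API.

```python
# Pseudocode for the agentic loop, with an exit condition.
# call_llm and run_tool are hypothetical stand-ins, not real APIs.
def run_agent(system_prompt: str, user_prompt: str, tools: dict) -> str:
    messages = [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_prompt},
    ]
    while True:
        response = call_llm(messages, tools)
        if not response.tool_calls:
            # plain text with no tool calls is one common signal to stop
            return response.text
        for tool_call in response.tool_calls:
            # run each requested tool and feed its result back into the loop
            result = run_tool(tools, tool_call)
            messages.append({'role': 'tool', 'content': result})
```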
But it's not necessarily trivial to work out when the end is. So, enough pseudocode. Let me run a real, minimal example of Pydantic AI. So this is a very simple Pydantic model with three fields, and then we're going to use Pydantic AI to extract structured data that fits that Person schema from unstructured data, this sentence here.
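The code onscreen is roughly along these lines; the exact field names, model name and input sentence are my reconstruction rather than the slide verbatim (it also assumes a Gemini API key in the environment):

```python
from datetime import date

from pydantic import BaseModel
from pydantic_ai import Agent


class Person(BaseModel):
    name: str
    dob: date
    city: str


# the agent's output type is the Person schema, so the run ends with a validated Person
agent = Agent('google-gla:gemini-2.0-flash', output_type=Person)

result = agent.run_sync(
    'Samuel was born on the 28th of January 1987 and lives in London.'
)
print(result.output)
# expected: something like name='Samuel' dob=datetime.date(1987, 1, 28) city='London'
```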
Now, obviously, to fit this onscreen, this is a very, very simple example. But this could be a PDF tens of megabytes in size. Well, probably not tens of megabytes in context necessarily, but definitely enormous documents. And this schema is very simple, but it could be an incredibly complex nested schema.
Models are still able to do it. And if we go and run this example and the gods of the internet are with us, sure enough, we get the Pydantic model printed out. Some of you will notice that this example is simple enough that we don't actually need an agent or this loop.
We're doing one shot: we make one call to the LLM, it returns the structured data by calling, under the hood, a final result tool, Pydantic AI performs validation, and we get back the data. But we don't have to change that example very much to start seeing the value of the agentic loop.
So here, I'm being a little bit unfair to the model. I've added a field validator to my Person model that says the date of birth needs to be before 1900. And obviously the input here is ambiguous: it doesn't define which century we're talking about. The model will, for the most part, assume '87 is 1987.
We'll then get a validation error when we do the validation. And that's where the agentic bit kicks in. Because we take those validation errors and return them to the model, basically as a tool response, and say, please try again, as I'll show you in a moment. And the model is then able to use the information from the validation error to try again.
Obviously, if you were doing this in production, you would add a docstring to the DOB field saying it must be in the 19th century. But there are definitely cases where models, even the smartest models, don't pass validation. And being able to use this trick of returning validation errors to the model is a very effective way of fixing a lot of the simpler cases.
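The validator itself is just standard Pydantic, something like this (again, my reconstruction of the field names):

```python
from datetime import date

from pydantic import BaseModel, field_validator


class Person(BaseModel):
    name: str
    dob: date
    city: str

    @field_validator('dob')
    @classmethod
    def dob_must_be_19th_century(cls, value: date) -> date:
        # anything from 1900 onwards fails validation; the resulting error
        # is sent back to the model so it can correct itself and retry
        if value >= date(1900, 1, 1):
            raise ValueError('date of birth must be before 1900')
        return value
```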
So if we run this, you see we had two calls to Gemini here. The other thing you'll see in this example is that we instrumented this code with Logfire, our observability platform, so we can actually go in and see exactly what happened. So you'll see our agent run.
We had two calls to the model, in this case Gemini Flash. And if we go and look at the exchange, you can see what's happened here. I'll just try and make it big enough that you can see it. First of all, we had the user prompt with the description.
It called the final result tool, as you might expect, with the date of birth being 1987. We then responded: the tool response was the validation error, incorrect. And then we add on the end, please fix the error and try again. And sure enough, it was then able to return correctly, calling the final result tool with the right date of birth and succeeding.
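The instrumentation itself is only a couple of lines; roughly this, assuming you've already created a Logfire project and token:

```python
import logfire

# send traces to Logfire (needs a project token configured locally)
logfire.configure()
# instrument Pydantic AI agents so each run, model request and tool call
# shows up as a span in the trace
logfire.instrument_pydantic_ai()
```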
Cool. I've got five minutes. I feel like I'm in one of those see-how-fast-I-can-go situations. I'm on the wrong window, am I? I am. Here we are. I think the other thing that's worth saying here, even if I don't have that much time, is, if you take a look at this example: I talked about type safety.
The way that we're doing this under the hood is that Agent, because of the output type, is generic, in this case in Person. And so when we access result.output, in typing terms it's an instance of Person, and at runtime it is guaranteed, by the Pydantic validation, to really be an instance of Person.
So if I access .name here, all will be well. If I access first_name, we immediately get the nice error from the type checker, saying this is an incorrect field. So that's the very beginning of the value of our typing support.
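In other words, continuing the sketch from before, something like:

```python
result = agent.run_sync('Samuel was born on the 28th of January 1987 and lives in London.')
person = result.output  # typed as Person, and validated as a Person at runtime

print(person.name)  # fine, both for the type checker and at runtime
# print(person.first_name)  # the type checker flags this: Person has no field 'first_name'
```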
We go a lot further. You will have seen, or some of you might have noticed, that there's a second generic parameter on Agent, which is the deps type. And so if you register tools with this agent, we can have type safe dependencies in tools, which I will show you in a moment.
The other thing you will notice is missing from this example is any tools. So let's look at an example with tools. If I open this example here, this is an example of memory, long-term memory in particular, where we're using one tool to record memories and another tool to retrieve memories.
So you'll see we have these two tools here, record memory and retrieve memory. Tools are set up by registering them with the agent.tool decorator. But this is where the typing, as I say, gets more complex. Now you will see that we've set the deps type when we've defined the agent.
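Stripped of the demo's Postgres plumbing, the shape of it is roughly this; the dataclass, table and SQL are my sketch of the example rather than its exact code:

```python
from dataclasses import dataclass

import asyncpg
from pydantic_ai import Agent, RunContext


@dataclass
class Deps:
    pool: asyncpg.Pool  # connection pool for a hypothetical memories table


agent = Agent(
    'google-gla:gemini-2.0-flash',
    deps_type=Deps,
    system_prompt='Record memories about the user, and retrieve them when asked.',
)


@agent.tool
async def record_memory(ctx: RunContext[Deps], memory: str) -> str:
    """Store a memory about the user for later retrieval."""
    await ctx.deps.pool.execute('INSERT INTO memories (value) VALUES ($1)', memory)
    return 'memory recorded'


@agent.tool
async def retrieve_memory(ctx: RunContext[Deps], query: str) -> str:
    """Return any stored memories matching the query."""
    rows = await ctx.deps.pool.fetch(
        "SELECT value FROM memories WHERE value ILIKE '%' || $1 || '%'", query
    )
    return '\n'.join(row['value'] for row in rows)
```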
And so our agent is now generic in that deps type. The output type is str, because that's the default. And when we use the tool decorator, we have to set the first argument to be this RunContext, parameterized with our deps type. And so when we access ctx.deps, that is an instance of the deps dataclass that you see there.
And if we access one of its attributes, we get the actual type. And if we change this to be an int, say, we suddenly get an error saying we've used the wrong type. So we get this guarantee that the type here matches the type here, matches the attributes you can access here.
And then when we come to run the agent, we need our deps to be an instance of that deps type. So again, if we gave it the wrong type, we would get a typing error saying you're using the wrong type. And as far as I know, we're the only agent framework that works this hard to be type safe.
And it is quite a lot of work on our side. I'll be honest. There's a little bit of work on your side as well. And it's not necessarily as trivial to set up. But it makes it incredibly easy to go and refactor your code. And yeah, we run this here.
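Running it looks roughly like this, continuing the sketch above (the connection string and prompt are made up):

```python
import asyncio

import asyncpg


async def main() -> None:
    pool = await asyncpg.create_pool('postgresql://localhost:5432/memory_demo')
    # deps must be an instance of Deps; anything else is flagged by the type checker
    result = await agent.run('What is my name?', deps=Deps(pool=pool))
    print(result.output)


asyncio.run(main())
```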
And we give it the... I'm pretty sure I don't have Postgres running. Do I have Docker running? I don't know if I have time to make that work. That's Docker running. I'll just try and run this very quickly: docker run. Hopefully that is enough. If I now come and run this example, what you will see is that it successfully failed.
Great. I will try one more time and see if I get lucky. I don't know quite what was going on there. Ah, and I have no idea. Well, we can look in Logfire and see what happened to make it fail. I promise you I hadn't set that up to fail the first time to demonstrate the value of observability, but maybe it can help here.
So if you look at this first time, our first agent run, you'll see that we used the tool call record memory: the user's name is Samuel. And then it returned finished. And then the second time, you can see that when it called the retrieve memory tool, the argument it gave was "your name", which is not contained within the memory recorded the previous time.
We're just doing a very simple ILIKE here, so "your name" is not a substring of "user's name is Samuel", and that's why it failed that time. So this has turned into a very useful example of where Logfire can help. And if we look at that second run, you'll see "user's name is Samuel".
And then when it ran the agent, it just asked for "name". "Name" is obviously a substring of "the user's name is Samuel", and so it got the response, user's name is Samuel, and therefore succeeded. The other thing we get here, obviously, is this tracing information.
So we can see how long each of those calls took. And we also get pricing, both aggregated across the whole trace and on individual spans. I am told that I am running out of time. So thank you very much.