Human seeded Evals — Samuel Colvin, Pydantic

00:00:00.000 | I'll assume given the time we have that you kind of get who I am and what Pydantic is to some extent.

00:00:19.460 | So I will move on. I'm using the talk I gave at PyCon, so it was building AI applications the Pydantic way,

00:00:27.880 | which is, I guess, somewhat akin. As I say, I'm not going to be able to get to the eval stuff today,

00:00:33.480 | but I can talk about these two. So everything is changing really fast as we all get told repeatedly

00:00:39.480 | in ever more hysterical terms. Actually, some things are not changing. We still want to build

00:00:44.840 | reliable, scalable applications, and that is still hard. Arguably, it's actually harder with Gen AI

00:00:49.720 | than it was before, whether that is using Gen AI to build it or using Gen AI within your application.

00:00:56.120 | So what we're trying to talk about here is some techniques that you can use to build applications

00:01:01.720 | quickly, but also somewhat more safely than you might do otherwise.

00:01:10.040 | I'm a strong believer that type safety is one of the really important parts of that. Not just for

00:01:14.280 | avoiding bugs in production, but because no one starts off building an AI application knowing what it's

00:01:19.080 | going to look like. So you're going to have to end up refactoring your application multiple times.

00:01:22.840 | If you build your application in a type safe way, if you use frameworks that allow it to be type safe,

00:01:27.480 | you can refactor it with confidence much more quickly. If you're using a coding agent like Cursor,

00:01:32.040 | it can use type safety or running type checking to basically mark its own homework and work out what

00:01:37.480 | it's doing right in a way that you can't do if you use a framework like LangChain or LangGraph,

00:01:41.800 | which, either through decision or inability, have not built something that's type safe.

00:01:46.040 | I'll talk a bit about MCP if I have a moment. And I won't talk about how evals fit in,

00:01:52.760 | because I don't have time. Nothing I'm going to say here on what an agent is is controversial. This is

00:01:59.560 | reasonably well accepted now by most people as a definition of an agent. This

00:02:05.800 | image here is from Barry Zhang's talk at AI Engineer in New York in February. This is his definition,

00:02:16.120 | or the Anthropic definition, of what an agent is, now being copied by us, by OpenAI, by Google's

00:02:22.680 | ADK. I think generally the accepted definition of an agent. This, although very neat, doesn't really

00:02:27.720 | make any sense to me. This, however, does make sense. So what they say is that an agent is effectively

00:02:33.720 | something that has an environment, there are some tools which may have access to the environment,

00:02:38.840 | there is some system prompt that describes to it what it's supposed to do, and then you have a while

00:02:43.560 | loop where you call the LLM, get back some actions to run, run the tools, that updates the state,

00:02:50.680 | and then you call the LLM again. There is, however, even in his, whatever it is, six-line pseudocode,

00:02:57.560 | a bug, which is there is no exit from that loop. And sure enough, that points towards a real problem,

00:03:03.160 | which is that it is not clear when you should exit that loop. And so there are a number of different

00:03:08.520 | things you can do. You can say when the LLM returns plain text rather than calling a tool,

00:03:13.960 | that is the end. Or you can have certain tools, which are kind of what we call final result tools,

00:03:19.320 | which basically trigger the end of the run. Or if you have models like OpenAI or Google, which have

00:03:25.080 | structured output types, you can use that to end your run. But it's not necessarily trivial to work

00:03:30.040 | out when the end is. So enough pseudocode. Let me run a real minimal example of Pydantic AI. So this is a very

00:03:37.800 | simple Pydantic-based model with three fields. And then we're going to use Pydantic AI to extract

00:03:44.680 | structured data that fits that person schema from unstructured data, this sentence here.

00:03:50.600 | Now, here, obviously, to fit this onscreen, this is a very, very simple example. But this could be

00:03:57.960 | a PDF tens of megabytes. Well, probably not tens of megabytes necessarily in context, but like

00:04:04.120 | definitely enormous documents. And this schema is very simple, but this could be an incredibly complex

00:04:09.800 | nested schema. Models are still able to do it. And sure enough, if we go and run this example and the

00:04:14.280 | gods of the internet are with us, sure enough, we get the Pydantic model printed out.
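The code on screen is not captured in the transcript. A minimal sketch of what such an example might look like follows. The field names `name` and `dob` are mentioned later in the talk; `city` is a hypothetical third field, and the model name is an assumption. The `extract_person` function needs the `pydantic-ai` package and an API key, so it is defined but not called here; the last lines only exercise the validation step offline.

```python
from datetime import date

from pydantic import BaseModel


class Person(BaseModel):
    """Schema the model's output must fit; `city` is a hypothetical third field."""

    name: str
    dob: date
    city: str


def extract_person(text: str) -> Person:
    """Ask an LLM for a Person; needs pydantic-ai installed and an API key set."""
    from pydantic_ai import Agent  # imported lazily so the schema part runs on its own

    # The model name is an assumption; any model supported by Pydantic AI should do.
    agent = Agent("google-gla:gemini-1.5-flash", output_type=Person)
    return agent.run_sync(text).output


# Offline check of the validation step, using the kind of payload a model might return.
person = Person.model_validate({"name": "Samuel", "dob": "1987-01-28", "city": "London"})
print(repr(person))
```

The schema is the only thing the caller has to write; the same class serves as the description sent to the model and as the validator for what comes back.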

00:04:20.200 | So some of you will notice that this example is simple enough that we don't actually need an agent

00:04:25.400 | or this loop. We're doing one shot: we make one call to the LLM, it returns the structured data,

00:04:31.400 | under the hood it calls a final result tool, Pydantic AI performs validation, and we get back the data.
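For reference, the loop described earlier, together with two of the exit conditions mentioned (plain text, or a final result tool), can be sketched in plain Python. This is not Pydantic AI's implementation: `Text`, `ToolCall`, `run_agent`, `scripted_model` and the `add` tool are all hypothetical names, and the "model" is a scripted stand-in so the sketch runs without a network call.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Text:
    content: str


@dataclass
class ToolCall:
    name: str
    args: dict


ModelResponse = Union[Text, ToolCall]
ModelFn = Callable[[list], ModelResponse]  # hypothetical stand-in for an LLM API call


def run_agent(model: ModelFn, tools: dict[str, Callable[..., str]], prompt: str, max_steps: int = 10):
    """Call the model in a loop until it signals that the run is over."""
    messages: list = [prompt]
    for _ in range(max_steps):  # cap the loop so it cannot spin forever
        response = model(messages)
        if isinstance(response, Text):
            return response.content  # exit 1: plain text ends the run
        if response.name == "final_result":
            return response.args  # exit 2: a final result tool ends the run
        # Otherwise run the requested tool and feed its result back to the model.
        messages.append(tools[response.name](**response.args))
    raise RuntimeError("agent did not finish within max_steps")


def scripted_model(messages: list) -> ModelResponse:
    """Fake model: asks for one tool call, then reports the tool's answer."""
    if len(messages) == 1:
        return ToolCall("add", {"a": 2, "b": 3})
    return ToolCall("final_result", {"answer": messages[-1]})


result = run_agent(scripted_model, {"add": lambda a, b: str(a + b)}, "What is 2 + 3?")
print(result)
```

The `max_steps` cap is an addition of this sketch, a simple guard against the missing-exit problem rather than something taken from the talk.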

00:04:36.360 | But we don't have to change that example very much to start seeing the value of the agentic loop. So here,

00:04:42.600 | I'm being a little bit unfair to the model. I've added a field validator to my person model.

00:04:49.800 | It says the date of birth needs to be before 1900. And obviously, the actual description here

00:04:55.880 | is ambiguous. It doesn't define which century we're talking about. The model will,

00:05:06.280 | for the most part, assume '87 is 1987. We'll then get a validation error when we do the validation. And

00:05:12.200 | that's where the agentic bit kicks in. Because we will take those validation errors and return them to the

00:05:17.320 | model and basically say, please try again, as I'll show you in a moment. And the model

00:05:21.320 | is then able to use the information from the validation error to try again. Obviously,

00:05:25.400 | if you were trying to do this case in production, you would add a doc string to the DOB field saying it

00:05:31.320 | must be in the 19th century. But there are definitely cases where models, even the smartest models, don't

00:05:35.960 | pass validation. And being able to use this trick of returning validation errors to the model is a very

00:05:43.320 | effective way of fixing a lot of the simplest use cases. So if we run this, you see we had two calls to

00:05:50.280 | Gemini here. And if I come and open-- the other thing you'll see in this example is we instrumented

00:05:57.080 | this code with Logfire, our observability platform, so we can actually go in and see exactly what happened.

00:06:04.200 | So you'll see our agent run. We had two calls to the model, in this case Gemini Flash. And if we go

00:06:11.400 | and look at the exchange, you can see what's happened here. So I'll just try and make it big enough that you

00:06:18.760 | can see it. We first of all had the user prompt with the description. It called the final result tool, as you

00:06:23.480 | might expect, with the date of birth being 1987. We then responded: the tool response was the validation error,

00:06:31.720 | and then we add on the end, please fix the error and try again. And sure enough,

00:06:36.520 | it was then able to return correctly, call the final result tool with the right date of birth and succeed.
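The retry behaviour seen in the trace can be reproduced offline. The sketch below uses real Pydantic validation, but the loop, the message format and `fake_model` are hypothetical stand-ins, not Pydantic AI internals. The fake model guesses 1987 first and switches century only after it has seen a validation error.

```python
from datetime import date
from typing import Callable

from pydantic import BaseModel, ValidationError, field_validator


class Person(BaseModel):
    name: str
    dob: date

    @field_validator("dob")
    @classmethod
    def dob_before_1900(cls, value: date) -> date:
        if value >= date(1900, 1, 1):
            raise ValueError("date of birth must be before 1900")
        return value


# Hypothetical stand-in for an LLM call: takes the message history, returns tool-call arguments.
ModelFn = Callable[[list[str]], dict]


def run_until_valid(model: ModelFn, prompt: str, max_retries: int = 3) -> tuple[Person, int]:
    """Call the model until its final-result arguments validate; return (person, calls made)."""
    messages = [prompt]
    for attempt in range(1, max_retries + 1):
        args = model(messages)
        try:
            return Person.model_validate(args), attempt  # valid output is the loop's exit
        except ValidationError as exc:
            # Feed the error back as the tool response so the model can correct itself.
            messages.append(f"{exc}\n\nFix the errors and try again.")
    raise RuntimeError(f"no valid output after {max_retries} attempts")


def fake_model(messages: list[str]) -> dict:
    """Guesses 1987 first, then switches century once it has seen a validation error."""
    year = 1887 if len(messages) > 1 else 1987
    return {"name": "Samuel", "dob": f"{year}-01-28"}


person, calls = run_until_valid(fake_model, "Samuel was born on Jan 28th '87")
print(person, calls)
```

With this fake model the run takes two calls, matching the two model calls visible in the trace described above. The prompt text is made up for illustration.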

00:06:44.120 | Cool. I've got five minutes. I feel like I'm in one of those. See how fast I can go. I'm on the wrong

00:06:51.160 | window, am I? I am. Here we are. I think the other thing that's worth saying here, even if I don't have

00:07:00.040 | that much time, is if you take a look at this example, I talked about type safety. The way that we're

00:07:06.920 | doing this under the hood: Agent, because of the output type, is generic, in this case in Person. And so

00:07:13.080 | when we access result.output, in typing terms it's an instance of Person. And at runtime we're

00:07:21.080 | guaranteed by the Pydantic validation that it will really be an instance of Person. So if I access .name here,

00:07:26.680 | all will be well. If I access first_name, we get the nice error

00:07:35.480 | from typing, saying this is an incorrect field. So that's the kind of very beginning of the value of static typing

00:07:41.880 | of our typing support. We go a lot further. You will have seen, or some of you might have noticed,

00:07:47.320 | there's a second generic on Agent, which is the deps type. And so if you register tools with this

00:07:53.400 | agent, we can have type safe dependencies to tools, which I will show you in a moment.
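The idea of an agent that is generic in its output type can be illustrated with a small stand-in. `MiniAgent` and `RunResult` below are hypothetical classes written for this sketch, not Pydantic AI's; the "model call" is replaced by validating a dictionary passed in directly.

```python
from typing import Generic, TypeVar

from pydantic import BaseModel

OutputT = TypeVar("OutputT", bound=BaseModel)


class RunResult(Generic[OutputT]):
    def __init__(self, output: OutputT) -> None:
        self.output = output


class MiniAgent(Generic[OutputT]):
    """Hypothetical miniature of an agent that is generic in its output type."""

    def __init__(self, output_type: type[OutputT]) -> None:
        self.output_type = output_type

    def run_sync(self, raw: dict) -> RunResult[OutputT]:
        # Stand-in for the model call: validate raw data against the output type.
        return RunResult(self.output_type.model_validate(raw))


class Person(BaseModel):
    name: str


agent = MiniAgent(Person)  # inferred by a type checker as MiniAgent[Person]
result = agent.run_sync({"name": "Samuel"})
print(result.output.name)

# A type checker such as pyright or mypy flags the next line, because Person has
# no `first_name` attribute; at runtime it would raise AttributeError.
# result.output.first_name
```

Because the output type flows through the generic parameter, the editor knows `result.output` is a `Person` without any cast or annotation at the call site.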

00:07:58.360 | So the other thing you will notice is missing from this example is any tools. So

00:08:03.880 | let's look at an example with tools. If I open this example here, we have-- this is an example of memory,

00:08:12.760 | long-term memory in particular, where we're using a tool to record memories and then another tool to be

00:08:18.120 | able to retrieve memories. So you'll see we have these two tools here, record memory and retrieve memory.

00:08:23.320 | Tools are set up by registering them with the agent.tool decorator. But this is where the typing,

00:08:30.920 | as I say, gets more complex. Now you will see that we've set the deps type when we've defined the agent.

00:08:36.120 | And so our agent is now generic in that deps type. The return type is string because that's the default.

00:08:41.560 | And so when we use the tool decorator, we have to set the first argument to be this RunContext

00:08:47.240 | parameterized with our deps type. And so when we access context.deps, that is an instance of

00:08:53.720 | our deps dataclass that you see there. And if we access one of its attributes, we get the actual type.

00:08:58.520 | And if we change this to be int, let's say, suddenly we get an error saying we've used the wrong

00:09:05.720 | type. So we get this guarantee that the type here matches the type here, matches the attributes you can

00:09:11.400 | access here. And then when we come to run the agent, we need our deps to be an instance of that

00:09:15.720 | deps type. So again, if we gave it the wrong type, we would get a typing error saying you're using the

00:09:21.480 | wrong type. And as far as I know, we're the only agent framework that works this hard to be type safe.
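The chain of type guarantees described here (deps type on the agent, run context in the tool signature, deps passed at run time) can be shown with a standard-library miniature. In Pydantic AI the corresponding pieces are the agent's deps type, `RunContext` and the `agent.tool` decorator; the `MiniAgent` class, the `Deps` fields and the in-memory list standing in for Postgres are all assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, TypeVar

DepsT = TypeVar("DepsT")


@dataclass
class RunContext(Generic[DepsT]):
    """Passed as the first argument to every tool; carries the typed dependencies."""

    deps: DepsT


class MiniAgent(Generic[DepsT]):
    """Hypothetical miniature of an agent that is generic in its dependency type."""

    def __init__(self, deps_type: type[DepsT]) -> None:
        self.deps_type = deps_type
        self.tools: dict[str, Callable[[RunContext[DepsT], str], str]] = {}

    def tool(self, func: Callable[[RunContext[DepsT], str], str]) -> Callable[[RunContext[DepsT], str], str]:
        """Register a tool; a type checker rejects tools annotated with another deps type."""
        self.tools[func.__name__] = func
        return func

    def call_tool(self, name: str, arg: str, deps: DepsT) -> str:
        return self.tools[name](RunContext(deps), arg)


@dataclass
class Deps:
    user_id: int
    memories: list[str] = field(default_factory=list)  # in-memory stand-in for a database


agent = MiniAgent(deps_type=Deps)


@agent.tool
def record_memory(ctx: RunContext[Deps], value: str) -> str:
    ctx.deps.memories.append(value)  # ctx.deps is statically known to be Deps
    return "Value added to memory."


@agent.tool
def retrieve_memory(ctx: RunContext[Deps], query: str) -> str:
    matches = [m for m in ctx.deps.memories if query.lower() in m.lower()]
    return "\n".join(matches)


deps = Deps(user_id=123)
agent.call_tool("record_memory", "User's name is Samuel", deps)
print(agent.call_tool("retrieve_memory", "name", deps))
```

Changing the annotation on a tool to `RunContext[int]`, or passing something other than a `Deps` instance to `call_tool`, is reported by a static type checker before the code runs.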

00:09:27.560 | And it is quite a lot of work on our side. I'll be honest. There's a little bit of work on your side

00:09:31.400 | as well. And it's not necessarily as trivial to set up. But it makes it incredibly easy to go and refactor

00:09:36.520 | your code. And yeah, we run this here. And we give it the-- I'm pretty sure I don't have Postgres running.

00:09:43.400 | Do I have Docker running? I don't know if I have time to make that work.

00:09:48.920 | I will-- that's Docker running. I'll just try and run this very quickly.

00:09:55.800 | Docker run. Hopefully that is enough. If I now come and run this example,

00:10:02.280 | what you will see is it successfully failed. Great.

00:10:12.360 | I will try one more time and see if I get lucky. I don't know quite what was going on there.

00:10:16.200 | Ah, and I have no idea. Well, we can look in Logfire and see what happened

00:10:22.440 | to make it fail. I promise you I hadn't set that up to fail the first time to demonstrate the value of

00:10:27.400 | observability, but maybe it can help here. So if you look at this first time,

00:10:35.640 | in our first agent run, you'll see that we used the tool call record memory: the user's name is Samuel.

00:10:45.480 | And then it returned finished. And then in the second agent run,

00:10:49.240 | you can see that when it did retrieve memory, where it called that tool,

00:10:57.240 | the argument it gave was "your name", which is not contained within the

00:11:04.520 | memory recorded the previous time. We're just doing a very simple ILIKE here. So "your name" is not a substring

00:11:10.840 | of "user's name is Samuel". And so that's why it failed that time. So this has turned into a very useful

00:11:17.400 | example of where Logfire can help. And if we look at that second time, you'll see "user's name is

00:11:25.080 | Samuel". And then when it ran the agent, it just asked for "name". "name" is obviously a substring of

00:11:31.560 | "user's name is Samuel". And so it got the response, user's name is Samuel, and therefore

00:11:37.000 | succeeded. The other thing we get here is, like, obviously, we get this tracing information. So we can

00:11:41.640 | see how long each of those calls took. And we also get pricing both in aggregate across the whole of

00:11:48.520 | the trace and on individual spans. I am told that I am running out of time. So thank you very much.
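The failed lookup in the demo comes down to case-insensitive substring matching. A small sketch of that behaviour, assuming the query is wrapped in `%` wildcards on both sides in the SQL; `ilike_contains` is a hypothetical helper that mimics this in Python for queries without wildcard characters.

```python
def ilike_contains(memory: str, query: str) -> bool:
    """Mimic SQL `memory ILIKE '%' || query || '%'` for queries without wildcards."""
    return query.lower() in memory.lower()


memory = "User's name is Samuel"
print(ilike_contains(memory, "your name"))  # False
print(ilike_contains(memory, "name"))  # True
```

The first query fails because "your name" never appears in the stored sentence, even though a human reader sees the two as related; the second succeeds because "name" is a literal substring.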