Human seeded Evals — Samuel Colvin, Pydantic

00:00:00.000 |
I'll assume given the time we have that you kind of get who I am and what Pydantic is to some extent. 00:00:19.460 |
So I will move on. I'm using the talk I gave at PyCon, so it was building AI applications the Pydantic way, 00:00:27.880 |
which is, I guess, somewhat akin. As I say, I'm not going to be able to get to the eval stuff today, 00:00:33.480 |
but I can talk about these two. So everything is changing really fast as we all get told repeatedly 00:00:39.480 |
in ever more hysterical terms. Actually, some things are not changing. We still want to build 00:00:44.840 |
reliable, scalable applications, and that is still hard. Arguably, it's actually harder with Gen AI 00:00:49.720 |
than it was before, whether that is using Gen AI to build it or using Gen AI within your application. 00:00:56.120 |
So what we're trying to talk about here is some techniques that you can use to build applications 00:01:01.720 |
quickly, but also somewhat more safely than you might do otherwise. 00:01:10.040 |
I'm a strong believer that type safety is one of the really important parts of that. Not just for 00:01:14.280 |
avoiding bugs in production, but because no one starts off building an AI application knowing what it's 00:01:19.080 |
going to look like. So you're going to have to end up refactoring your application multiple times. 00:01:22.840 |
If you build your application in a type safe way, if you use frameworks that allow it to be type safe, 00:01:27.480 |
you can refactor it with confidence much more quickly. If you're using a coding agent like Cursor, 00:01:32.040 |
it can use type safety or running type checking to basically mark its own homework and work out what 00:01:37.480 |
it's doing right in a way that you can't do if you use a framework like LangChain or LangGraph, 00:01:41.800 |
who either through decision or inability decided not to build something that's type safe. 00:01:46.040 |
I'll talk a bit about MCP if I have a moment. And I won't talk about how evals fit in, 00:01:52.760 |
because I don't have time. Nothing I'm going to say here on what an agent is is controversial. This is 00:01:59.560 |
reasonably well accepted now by most people as a definition of an agent. This 00:02:05.800 |
image here is from Barry Zhang's talk at AI Engineer in New York in February. This is his definition, 00:02:16.120 |
or the Anthropic definition, of what an agent is, now being copied by us, by OpenAI, by Google's 00:02:22.680 |
ADK. I think it's generally the accepted definition of an agent. This, although very neat, doesn't really 00:02:27.720 |
make any sense to me. This, however, does make sense. So what they say is that an agent is effectively 00:02:33.720 |
something that has an environment, there are some tools which may have access to the environment, 00:02:38.840 |
there is some system prompt that describes to it what it's supposed to do, and then you have a while 00:02:43.560 |
loop where you call the LLM, get back some actions to run in the tool, run the tools, that updates the state, 00:02:50.680 |
and then you call the LLM again. There is, however, even in his whatever it is, six line pseudocode, 00:02:57.560 |
a bug, which is there is no exit from that loop. And sure enough, that points towards a real problem, 00:03:03.160 |
which is that it is not clear when you should exit that loop. And so there are a number of different 00:03:08.520 |
things you can do. You can say when the LLM returns plain text rather than calling the tool, 00:03:13.960 |
that is the end. Or you can have certain tools, which are kind of what we call final result tools, 00:03:19.320 |
which basically trigger the end of the run. Or if you have models like OpenAI or Google, which have 00:03:25.080 |
structured output types, you can use that to end your run. But it's not necessarily trivial to work 00:03:30.040 |
out when the end is. So enough pseudocode. Let me run a real minimal example of Pydantic AI. So this is a very 00:03:37.800 |
simple Pydantic-based model with three fields. And then we're going to use Pydantic AI to extract 00:03:44.680 |
structured data that fits that person schema from unstructured data, this sentence here. 00:03:50.600 |
Now, here, obviously, to fit this on screen, this is a very, very simple example. But this could be 00:03:57.960 |
a PDF tens of megabytes. Well, probably not tens of megabytes necessarily in context, but like 00:04:04.120 |
definitely enormous documents. And this schema is very simple, but this could be an incredibly complex 00:04:09.800 |
nested schema. Models are still able to do it. And sure enough, if we go and run this example and the 00:04:14.280 |
gods of the internet are with us, sure enough, we get the Pydantic model printed out. 00:04:20.200 |
So some of you will notice that this example is simple enough that we don't actually need an agent 00:04:25.400 |
or this loop. We're doing one shot: we make one call to the LLM, it returns the structured data, 00:04:31.400 |
under the hood, we call a final result tool, Pydantic AI performs validation, and we get back the data. 00:04:36.360 |
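The loop described earlier, together with its exit conditions, can be sketched in plain Python. This is only an illustration under stated assumptions: the model is a scripted stand-in rather than a real LLM, and names such as `run_agent`, `ModelResponse` and `final_result` are hypothetical, not Pydantic AI's API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


@dataclass
class ModelResponse:
    text: Optional[str] = None  # plain-text answer, if the model gave one
    tool_calls: List[ToolCall] = field(default_factory=list)


FINAL_RESULT_TOOL = "final_result"  # hypothetical name for the "final result" tool


def run_agent(model, tools, prompt, max_steps=5):
    """Call the model in a loop until one of the exit conditions is met."""
    messages = [prompt]
    for _ in range(max_steps):  # hard cap, so the loop always terminates
        response = model(messages)
        if not response.tool_calls:
            return response.text  # exit 1: plain text instead of a tool call
        for call in response.tool_calls:
            if call.name == FINAL_RESULT_TOOL:
                return call.args  # exit 2: the final result tool ends the run
            result = tools[call.name](**call.args)
            messages.append(f"{call.name} returned {result!r}")  # update state
    raise RuntimeError("agent did not finish within max_steps")


def scripted_model(messages):
    """Stand-in for an LLM: calls `add` once, then the final result tool."""
    if len(messages) == 1:
        return ModelResponse(tool_calls=[ToolCall("add", {"a": 2, "b": 3})])
    return ModelResponse(tool_calls=[ToolCall(FINAL_RESULT_TOOL, {"answer": 5})])


result = run_agent(scripted_model, {"add": lambda a, b: a + b}, "What is 2 + 3?")
print(result)
```

In the one-shot case, the model calls the final result tool on its first response and the loop body runs exactly once.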
But we don't have to change that example very much to start seeing the value of the agentic loop. So here, 00:04:42.600 |
I'm being a little bit unfair to the model. I've added a field validator to my person model. 00:04:49.800 |
It says the date of birth needs to be before 1900. And obviously, the actual definition here 00:04:55.880 |
is ambiguous. It doesn't define which century we're talking about. The model will, 00:05:06.280 |
for the most part, assume '87 is 1987. We'll then get a validation error when we do the validation. And 00:05:12.200 |
that's where the agentic bit kicks in. Because we will take those validation errors and return them to the 00:05:17.320 |
model, basically as a tool response, and say, please try again, as I'll show you in a moment. And the model 00:05:21.320 |
is then able to use the information from the validation error to try again. Obviously, 00:05:25.400 |
if you were trying to do this case in production, you would add a docstring to the DOB field saying it 00:05:31.320 |
must be in the 19th century. But there are definitely cases where models, even the smartest models, don't 00:05:35.960 |
pass validation. And being able to use this trick of returning validation errors to the model is a very 00:05:43.320 |
effective way of fixing a lot of the simplest use cases. So if we run this, you see we had two calls to 00:05:50.280 |
Gemini here. And if I come and open-- the other thing you'll see in this example is we instrumented 00:05:57.080 |
this code with Logfire, our observability platform, so we can actually go in and see exactly what happened. 00:06:04.200 |
So you'll see our agent run. We had two calls to the model, in this case Gemini Flash. And if we go 00:06:11.400 |
and look at the exchange, you can see what's happened here. So I'll just try and make it big enough that you 00:06:18.760 |
can see it. We first of all had the user prompt with the description. It called the final result tool, as you 00:06:23.480 |
might expect. The date of birth being 1987. We then responded. The tool response was validation error, 00:06:31.720 |
incorrect, please try. And then we add on the end, please fix the error and try again. And sure enough, 00:06:36.520 |
it was then able to return correctly, call the final result tool with the right date of birth and succeed. 00:06:44.120 |
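The retry-on-validation-error exchange above can be sketched without any LLM. This is a standard-library approximation, not Pydantic AI's implementation: the `Person` fields beyond name and date of birth, the prompt text, and the scripted `fake_model` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Person:
    name: str
    dob: date
    city: str  # assumed third field; the talk only names two of them

    def __post_init__(self):
        # Stand-in for the field validator: date of birth must be before 1900.
        if self.dob >= date(1900, 1, 1):
            raise ValueError("dob must be before 1900")


def fake_model(messages):
    """Scripted stand-in for an LLM: guesses 1987, corrects itself after the error."""
    year = 1887 if any("before 1900" in m for m in messages) else 1987
    return {"name": "Samuel", "dob": date(year, 1, 28), "city": "London"}


def run_with_retries(model, prompt, max_retries=2):
    messages = [prompt]
    for attempt in range(max_retries + 1):
        args = model(messages)
        try:
            # Validation passed: this is the end of the run.
            return Person(**args), attempt + 1
        except ValueError as exc:
            # Feed the validation error back to the model and ask it to retry.
            messages.append(
                f"Validation error: {exc}. Please fix the error and try again."
            )
    raise RuntimeError("model never produced valid output")


person, calls = run_with_retries(
    fake_model, "Samuel lives in London and was born on Jan 28th '87"
)
print(person, calls)
```

The retry cap matters: without it, a model that never satisfies the validator would keep the loop running indefinitely.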
Cool. I've got five minutes. I feel like I'm in one of those. See how fast I can go. I'm on the wrong 00:06:51.160 |
window, am I? I am. Here we are. I think the other thing that's worth saying here, even if I don't have 00:07:00.040 |
that much time, is if you take a look at this example, I talked about type safety. The way that we're 00:07:06.920 |
doing this under the hood: Agent is generic in its output type, in this case Person. And so 00:07:13.080 |
when we access result.output, in typing terms, it's an instance of Person. And at runtime we're 00:07:21.080 |
guaranteed by the Pydantic validation that it will really be an instance of Person. So if I access .name here, 00:07:26.680 |
all will be well. If I access first name, we suddenly get the nice error 00:07:35.480 |
from typing, saying this is an incorrect field. So that's the very beginning of the value 00:07:41.880 |
of our static typing support. We go a lot further. You will have seen, or some of you might have noticed, 00:07:47.320 |
there's a second generic on Agent, which is the deps type. And so if you register tools with this 00:07:53.400 |
agent, we can have type safe dependencies for tools, which I will show you in a moment. 00:07:58.360 |
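The idea of an agent that is generic in its output type can be sketched with the standard `typing` module. This is a simplified stand-in, not Pydantic AI's actual classes: the `Agent` and `RunResult` here are hypothetical, and the `produce` callable replaces the LLM call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Generic, Type, TypeVar

OutputT = TypeVar("OutputT")


@dataclass
class Person:
    name: str


@dataclass
class RunResult(Generic[OutputT]):
    output: OutputT


class Agent(Generic[OutputT]):
    """Toy agent, generic in its output type (hypothetical, for illustration)."""

    def __init__(
        self, output_type: Type[OutputT], produce: Callable[[str], Dict[str, Any]]
    ):
        self.output_type = output_type
        self._produce = produce  # stand-in for the call to the model

    def run_sync(self, prompt: str) -> RunResult[OutputT]:
        raw = self._produce(prompt)
        # Construct the output type from the raw data (no real validation here).
        return RunResult(self.output_type(**raw))


agent = Agent(Person, lambda prompt: {"name": "Samuel"})
result = agent.run_sync("Who is speaking?")

# A type checker infers result.output as Person, so .name is known to be str.
print(result.output.name)

# result.output.first_name  # a type checker would flag this unknown attribute
```

Because the output type flows through the generic parameter, renaming a field on `Person` surfaces every stale access as a type error rather than as a runtime failure.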
So the other thing you will notice is missing from this example is any tools. So 00:08:03.880 |
let's look at an example with tools. If I open this example here, we have-- this is an example of memory, 00:08:12.760 |
long-term memory in particular, where we're using a tool to record memories and then another tool to be 00:08:18.120 |
able to retrieve memories. So you'll see we have these two tools here, record memory and retrieve memory. 00:08:23.320 |
Tools are set up by registering them with the @agent.tool decorator. But this is where the typing, 00:08:30.920 |
as I say, gets more complex. Now you will see that we've set deps_type when we've defined the agent. 00:08:36.120 |
And so our agent is now generic in that deps type. The return type is string because that's the default. 00:08:41.560 |
And so when we use the tool decorator, we have to set the first argument to be this RunContext, 00:08:47.240 |
parameterized with our deps type. And so when we access context.deps, that is an instance of 00:08:53.720 |
our deps dataclass that you see there. And if we access one of its attributes, we get the actual type. 00:08:58.520 |
And if we change this to be int, let's say, suddenly we get an error saying we've used the wrong 00:09:05.720 |
type. So we get this guarantee that the type here matches the type here, matches the attributes you can 00:09:11.400 |
access here. And then when we come to run the agent, we need our deps to be an instance of that 00:09:15.720 |
deps type. So again, if we gave it the wrong type, we would get a typing error saying you're using the 00:09:21.480 |
wrong type. And as far as I know, we're the only agent framework that works this hard to be type safe. 00:09:27.560 |
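The relationship between the deps type, the run context and the tools can be sketched with standard-library generics. This is a rough illustration, not Pydantic AI's implementation: the `Agent`, `RunContext` and `call_tool` below are hypothetical, and an in-memory list stands in for the Postgres table used in the demo.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Generic, List, Type, TypeVar

DepsT = TypeVar("DepsT")


@dataclass
class RunContext(Generic[DepsT]):
    deps: DepsT


@dataclass
class Deps:
    user_id: int
    memories: List[str] = field(default_factory=list)  # stand-in for a database


class Agent(Generic[DepsT]):
    """Toy agent, generic in its dependency type (hypothetical, for illustration)."""

    def __init__(self, deps_type: Type[DepsT]):
        self.deps_type = deps_type
        self.tools: Dict[str, Callable[[RunContext[DepsT], str], str]] = {}

    def tool(
        self, func: Callable[[RunContext[DepsT], str], str]
    ) -> Callable[[RunContext[DepsT], str], str]:
        # Registering a tool whose context is RunContext[int] on an Agent[Deps]
        # would be reported by a type checker.
        self.tools[func.__name__] = func
        return func

    def call_tool(self, name: str, deps: DepsT, arg: str) -> str:
        return self.tools[name](RunContext(deps), arg)


agent = Agent(Deps)


@agent.tool
def record_memory(ctx: RunContext[Deps], value: str) -> str:
    ctx.deps.memories.append(value)  # ctx.deps is known to be a Deps instance
    return "Finished"


@agent.tool
def retrieve_memory(ctx: RunContext[Deps], query: str) -> str:
    hits = [m for m in ctx.deps.memories if query.lower() in m.lower()]
    return "\n".join(hits) or "No memories found"


deps = Deps(user_id=1)
print(agent.call_tool("record_memory", deps, "User's name is Samuel"))
print(agent.call_tool("retrieve_memory", deps, "name"))
```

The same type parameter ties together three places: the agent definition, the first argument of every tool, and the deps object passed at run time.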
And it is quite a lot of work on our side. I'll be honest. There's a little bit of work on your side 00:09:31.400 |
as well. And it's not necessarily as trivial to set up. But it makes it incredibly easy to go and refactor 00:09:36.520 |
your code. And yeah, we run this here. And we give it the-- I'm pretty sure I don't have Postgres running. 00:09:43.400 |
Do I have Docker running? I don't know if I have time to make that work. 00:09:48.920 |
I will-- that's Docker running. I'll just try and run this very quickly. 00:09:55.800 |
Docker run. Hopefully that is enough. If I now come and run this example, 00:10:02.280 |
what you will see is it successfully failed. Great. 00:10:12.360 |
I will try one more time and see if I get lucky. I don't know quite what was going on there. 00:10:16.200 |
Ah, and I have no idea. Well, we can look in Logfire and see what happened 00:10:22.440 |
to make it fail. I promise you I hadn't set that up to fail the first time to demonstrate the value of 00:10:27.400 |
observability, but maybe it can help here. So if you look this first time, we-- 00:10:35.640 |
our first agent run, you'll see that we used the tool call record memory. The user's name is Samuel. 00:10:45.480 |
And then it returned finished. And then the second time, 00:10:49.240 |
you can see that when it did retrieve memory, when it called that tool, 00:10:57.240 |
the argument it gave was "your name", which is not contained within the 00:11:04.520 |
memory from the previous time. We're just doing a very simple ILIKE here. So "your name" is not a substring 00:11:10.840 |
of "User's name is Samuel". And so that's why it failed that time. So this has turned into a very useful 00:11:17.400 |
example of where Logfire can help. And if we look at that second time, you'll see "User's name is 00:11:25.080 |
Samuel". And then when it ran the agent, it just asked for "name". "name" is obviously a substring of 00:11:31.560 |
"User's name is Samuel". And so it got the response, "User's name is Samuel", and therefore 00:11:37.000 |
succeeded. The other thing we get here is, like, obviously, we get this tracing information. So we can 00:11:41.640 |
see how long each of those calls took. And we also get pricing, both in aggregate across the whole of 00:11:48.520 |
the trace and on individual spans. I am told that I am running out of time. So thank you very much.
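The retrieval failure seen in the demo comes down to substring matching, which can be reproduced in a few lines. This sketch makes assumptions: it uses SQLite's `LIKE` (case-insensitive for ASCII by default) as a stand-in for the Postgres ILIKE query in the demo, and the table and function names are hypothetical.

```python
import sqlite3

# In-memory database standing in for the Postgres instance used in the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memory (value TEXT)")
conn.execute("INSERT INTO memory VALUES (?)", ("User's name is Samuel",))


def retrieve(query):
    """Return stored memories that contain `query` as a substring."""
    rows = conn.execute(
        "SELECT value FROM memory WHERE value LIKE ?", (f"%{query}%",)
    ).fetchall()
    return [row[0] for row in rows]


# First run: the model searched for "your name", which is not a substring.
print(retrieve("your name"))

# Second run: the model searched for "name", which is a substring.
print(retrieve("name"))
```

Substring search is brittle because it depends on the exact wording the model picks for its query, which is why the same prompt failed on one run and succeeded on the next.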