Eliana Khanna: So, the context from last year's talk was "Pydantic Is All You Need." It was a very popular talk, you know, it kind of kicked off my Twitter career. And today, I'm coming back a year later to basically say the same thing again: "Pydantic is still all you need." And really, my goal is to share with you what I've learned over the past year.
And the problem has always been the fact that if I had hired an intern to write an API for me, and that API returned a string that I had to json.loads into a dictionary and then just pray that the data was still there to begin with, I would be pretty pissed off.
I would probably just fire them, replace them with Devin, and just prompt it to use FastAPI and Pydantic. Because I'm really tired of writing code like this, right? This is the kind of code that we wrote when we had to work with things like GPT-3.
But there are a lot of good tools in the Python ecosystem, and in the ecosystems of other languages, whether it's Ecto in Elixir, Active Record in Rails, or anything like that, that can make our lives much, much easier. And the problem is that by not having schemas and structured responses, we tend to lose compatibility, composability, and reliability when we build tools and write code that interact with external systems.
But it seems that we're very happy to accept exactly that from LLMs. And so last year, we mostly talked about how Pydantic and function calling were a great way to get structured outputs, with a lot of additional benefits, right? We are able to have nested objects and nested models for modular structures.
And then we can also use validators to improve the reliability of the systems that we build. And I'll talk about some of these examples. And so it's been about a year and a half. I think the big question is, what's new in Pydantic? What's new in the library? And the answer is basically nothing.
I'm basically coming back to say that I was right, and it feels really, really good. It's still pip install instructor, right? And since then, we've released 1.0. We've launched in five languages: Python, TypeScript, Ruby, Go, and Elixir. We just built out a version in Rust. And again, it's mostly because this is just the exact 600 lines of code that you do not want to write.
And at least in the Python library, we've seen 40% growth month over month. And we're still at only about 2% of the downloads of the OpenAI SDK. So there's still tons of room to grow in terms of how we can make these APIs a lot more ergonomic. And so when you saw 1.0, you might be going, Jason, what did we break in the API?
I renamed a method, and now we support things like Ollama and llama.cpp, along with a bunch of other APIs. So we support Anthropic, Cohere, Gemini, Groq, everything that you need. And as more language models support function calling capabilities, this API will pretty much stay standard.
And if you haven't seen the talk from last year, the general API looks like this. You define a Pydantic object. You can then, you know, patch the OpenAI client or any client that you want. And all you have to do is pass in response_model=User.
Right? This is basically it. This is very similar to how FastAPI works. And you know, it took a little bit of hackiness, but now we can also leverage some of the newer Python tooling to infer the return type. And so here, because the response_model is User, the result is inferred as a User object.
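To make that concrete, here's a minimal sketch of that call, assuming instructor 1.x and a placeholder model name (the User fields are just for illustration):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel


class User(BaseModel):
    name: str
    age: int


# Patch the OpenAI client so create() accepts a response_model
client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    response_model=User,  # the return type is inferred as User
    messages=[{"role": "user", "content": "Jason is 25 years old"}],
)
print(user.name, user.age)  # typed attribute access, no json.loads
```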
You get nice red squiggly lines if you've messed up your code. The same thing happens when you want to create an iterable. Here you see that I still have a single User response model, but I want to extract two objects. And as long as you set stream=True, you're going to get each object as it's completed.
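Roughly, reusing the patched client and User model from the previous sketch:

```python
from typing import Iterable

users = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Iterable[User],  # one schema, many objects
    stream=True,                    # yield each object as soon as it's complete
    messages=[{"role": "user", "content": "Jason is 25, Sarah is 30"}],
)
for user in users:
    print(user)  # each User is validated and yielded as it finishes streaming
```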
And this is kind of a way of using streaming to improve the latency while still having structured output. We also have partials, right? The difference here is that instead of just getting back a partially complete JSON string that you have to patch up yourself, we validate the entire partial object. And this means that if you have things like generative UI that use this structure, you can render it while streaming, without having to write very evil JSON-repair code to figure out how to render this in real time.
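A sketch of partials, again reusing the client and model from above (instructor exposes this as Partial):

```python
partial_stream = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=instructor.Partial[User],  # all fields become optional
    stream=True,
    messages=[{"role": "user", "content": "Jason is 25 years old"}],
)
for snapshot in partial_stream:
    # Every snapshot is a valid (if incomplete) User you can render immediately,
    # e.g. {"name": "Jason", "age": None} before the age has streamed in.
    print(snapshot.model_dump())
```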
And so, yeah, nothing's really changed. You have one noun, which is the client, and you have three verbs. You can create, create with iterable, and create with partial based on whether or not you want to use streaming. And everything else you think about is going to be around the response model, the validator features that you have to build, and the messages array that you pass in.
So if OpenAI supports some new, weird API call, as long as it fits with the messages, there's not going to be any breaking code. And that's why I think Pydantic is still all you need. And so the rest of this talk is basically going to be about two, really three things.
I'm going to cover some examples of generation, in particular around RAG and extraction. Then I'm just going to cover what we learned this year, and it's really not that much, right? Validation error messages are very important, and usually re-asking with them can fix any errors that we have. Not all language models can really make use of that retry logic right now.
I think that's something we're going to work towards. And ultimately, whether you use vision, or text, or RAG, or agents, they all benefit from structured outputs, right? Because the real idea here is we're going to be programming with data structures, which is something everyone knows how to do, rather than trying to, like, beg and pray to the LLM gods.
And really, again, the theme of this talk is the fact that nothing really has changed. The language did not change. All we had to do was relearn how to program. And so the first concept that I think many people might not have seen in Pydantic is validators, right?
Here, you can define a validator on any kind of attribute, and add additional logic that tells you what correct looks like. And so you see, in my prompt, I don't really ask the language model to uppercase all the names, but I can actually write Python code to verify that something is correct and throw an error message.
And if I want to, I can turn retrying on, and that error message is caught, passed back to the language model, and used to correct the output. And so in this example, it's the error message that becomes part of the prompt, but it's only conditionally added when validation fails.
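A minimal sketch of that pattern, using a Pydantic field validator and instructor's max_retries (the model name and error text are placeholders):

```python
from pydantic import BaseModel, field_validator


class UserDetail(BaseModel):
    name: str
    age: int

    @field_validator("name")
    @classmethod
    def name_must_be_uppercase(cls, v: str) -> str:
        # This error message is what gets appended to the prompt on the retry
        if v != v.upper():
            raise ValueError("name must be uppercase, e.g. 'JASON'")
        return v


user = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=UserDetail,
    max_retries=2,  # re-ask, feeding the validation error back to the model
    messages=[{"role": "user", "content": "Extract: jason is 25 years old"}],
)
```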
And as you can see, after one more API call, the name comes back in all caps. Pretty nice. We can also do model-level validation. This is a very simple example that you might see somewhere like Ramp, where you're processing receipts. You might want to use a vision language model to extract the receipt data.
There's a total cost, and the products field is a list of products. And the validator does something a little bit more interesting. It says, "Make sure that price times quantity, summed over the products, adds up to the total cost." Right? Again, this validation basically never fails for 99% of cases, but when it does fail, you see a red bar in Datadog, and that's really what I care about.
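A sketch of that model-level check with a Pydantic model_validator (the field names and the 0.01 tolerance are assumptions):

```python
from pydantic import BaseModel, model_validator


class Product(BaseModel):
    name: str
    price: float
    quantity: int


class Receipt(BaseModel):
    products: list[Product]
    total: float

    @model_validator(mode="after")
    def totals_must_add_up(self) -> "Receipt":
        computed = sum(p.price * p.quantity for p in self.products)
        if abs(computed - self.total) > 0.01:
            # This message is what shows up in your logs (or your Datadog red bar)
            raise ValueError(
                f"line items sum to {computed:.2f} but the total is {self.total:.2f}"
            )
        return self
```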
And if I want to, I can turn on re-asking to make sure that, again, everything comes out correct. So let's jump into generation, right? Why should I use structured outputs? Well, it turns out if you don't use structured outputs, the structure you get is just a response with a content string.
Right? You still get an object back out, but you're left calling json.loads yourself and, you know, eating whatever cost you have in terms of parsing failures. And so a really simple example of a RAG application is not only having the content, but also having a list of follow-up questions, right?
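As a sketch, that response model could look something like this (the field names, and the context and question variables, are assumptions):

```python
from pydantic import BaseModel, Field


class AnswerWithFollowUps(BaseModel):
    content: str
    follow_up_questions: list[str] = Field(
        description="Questions the user could ask next, answerable from the same context"
    )


# `context` and `question` are placeholders for whatever your retrieval step produced
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=AnswerWithFollowUps,
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    ],
)
```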
The follow-up questions can be informed by the existing context, but now you can show the user, "Hey, here are other questions you can answer based on the context that I've put in the prompt." A really funny example that I've actually done in production quite a bit is just making sure that the links we return are valid.
And so here, I have a very simple validator. I just have a regular expression, parse out all the URLs, and make a request to check that each URL returns a 200. And now I can make sure very easily that no URLs are, you know, hallucinated. And in my instructions, I just say, "Well, if it's not real, just throw it out next time; don't try too hard." Right?
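A sketch of that validator; the regex and the use of requests here are assumptions, not the speaker's exact code:

```python
import re

import requests
from pydantic import BaseModel, field_validator


class AnswerWithLinks(BaseModel):
    content: str

    @field_validator("content")
    @classmethod
    def urls_must_resolve(cls, v: str) -> str:
        for url in re.findall(r"https?://[^\s)\"]+", v):
            # A crude liveness check; swap in HEAD requests or async calls as needed
            if requests.get(url, timeout=5).status_code != 200:
                raise ValueError(f"URL {url} did not return 200; drop unverifiable links")
        return v
```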
The same thing happens with retrieval-augmented generation. We all kind of know at this point that embeddings won't really solve all the problems you have in search, right? For example, if I ask a question like, "What is the latest news from Z?", "latest news" isn't something that embeddings can capture, right?
The source of the news, maybe that's relevant if you use BM25, but really there might be separate indices that we want to query. And we can do something very simple in the structured output world that makes this very reasonable, right? I define a Search object, and I say it has a query, and a start date and end date that are optional.
Maybe there's a limit in case I want to see the top five results. And then a source that allows the language model to choose which backend I want to hit. And then, you know, how you actually search the endpoint is kind of an implementation detail that we don't care about.
And now you just define a very simple function, you know, create_search. It takes in a string and returns the Search object. And even the API call itself now is an implementation detail, right? As long as I get the search query out and it's correct, I can do a lot more.
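Putting that together as a sketch (the source names, defaults, and model name are assumptions):

```python
from datetime import date
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class Source(str, Enum):
    NEWS = "news"
    EMAIL = "email"
    DOCS = "docs"


class Search(BaseModel):
    query: str
    start_date: Optional[date] = None
    end_date: Optional[date] = None
    limit: int = 5
    source: Source = Source.NEWS


def create_search(question: str) -> Search:
    # The API call is an implementation detail; callers only ever see a Search object
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Search,
        messages=[{"role": "user", "content": f"Generate a search request for: {question}"}],
    )
```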
And in particular, even the validations themselves, you know, I can check whether the date ranges are zero days or one day long, and look at the distribution of those values, all based on the structured output. Then if I ask a question like, what is the difference between X and Y, I can just turn on iterable mode.
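A sketch of iterable mode applied to search planning, reusing the Search model and client from above:

```python
from typing import Iterable


def create_searches(question: str) -> list[Search]:
    # Iterable mode lets a comparison question fan out into multiple queries
    return list(
        client.chat.completions.create(
            model="gpt-4o-mini",
            response_model=Iterable[Search],
            messages=[{"role": "user", "content": f"Generate search requests for: {question}"}],
        )
    )


searches = create_searches("What is the difference between X and Y?")
# roughly -> [Search(query="X ..."), Search(query="Y ...")], ready to run in parallel
```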
Now, if I ask this question, I'm going to have a search query for Y, a search query for X, and again, my RAG application can figure out that I can do two parallel search queries, collect them together, and continue on. And so this means that you can build a fairly sophisticated RAG application in two functions and two models.
First you have the model for how you respond with the data, and then the model for a search query, right? As you can see here. And then you define two functions that return those objects. And then this is basically your advanced RAG application, right? You make a search query, you return multiple searches.
You search each one, and then you pass the context into the answer-question function. This is very, very straightforward code, but what this means is you get to render something very structured, and then whether this endpoint is exposed via an OpenAPI spec or parsed by a React component, again, these are all just implementation details.
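A sketch of that whole pipeline, reusing the Search and AnswerWithFollowUps sketches from above; run_search is a hypothetical stand-in for your own backend:

```python
def answer_question(question: str, context: str) -> AnswerWithFollowUps:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=AnswerWithFollowUps,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
        ],
    )


def rag(question: str) -> AnswerWithFollowUps:
    searches = create_searches(question)            # LLM call #1: plan the searches
    results = [run_search(s) for s in searches]     # run_search: your own backend code
    return answer_question(question, "\n\n".join(results))  # LLM call #2: answer
```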
The LLM is hidden behind a type system that we can now guarantee to be correct. And the last one I think is really interesting is data extraction. If you want to do something like labeling, it's really easy to just say, okay, the class label is a Literal of either "spam" or "not spam", and you've built a classifier.
If you want the accuracy to improve by about 15%, you can add chain of thought, right? And again, it's the structure that tells you how the language model should work, but you still have good validation that you're going to get "spam" or "not spam", rather than some babbling like, you know, "here's the JSON that you care about."
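A sketch of that classifier with chain of thought added (the field names and the message variable are assumptions):

```python
from typing import Literal

from pydantic import BaseModel, Field


class SpamLabel(BaseModel):
    chain_of_thought: str = Field(
        description="Reason step by step before choosing the label"
    )
    label: Literal["spam", "not_spam"]


# `message` is a placeholder for the text you want to classify
result = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=SpamLabel,
    messages=[{"role": "user", "content": f"Classify this message: {message}"}],
)
print(result.label)  # guaranteed to be "spam" or "not_spam", nothing to strip or parse
```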
You can do the same thing for extracting structured information out of transcripts. A very common example is people wanting to process meeting transcripts. Now it's very structured, right? I have a classification for the meeting type, a title, a list of action items, and a summary.
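A sketch of that transcript model (the exact fields and literals are assumptions):

```python
from typing import Literal, Optional

from pydantic import BaseModel


class ActionItem(BaseModel):
    owner: str  # could be validated against meeting participants
    description: str
    due_date: Optional[str] = None


class MeetingNotes(BaseModel):
    meeting_type: Literal["standup", "planning", "one_on_one", "other"]
    title: str
    action_items: list[ActionItem]
    summary: str


# `transcript` is a placeholder for the raw meeting transcript text
notes = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=MeetingNotes,
    messages=[{"role": "user", "content": f"Extract meeting notes from: {transcript}"}],
)
```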
Here, the owner is a string, but you can imagine having a validator that makes sure each owner is at least one of the participants of the meeting, based on some Google Calendar integration. Again, these are all implementation details. It's all up to you. And then lastly, you can do some really magical stuff.
In this example, the type I've defined is called Table. It has a caption string and a very weird markdown-DataFrame type hint. And here, what you can see is that I'm really just trying to extract tables out of an image. But this is a bit wild.
Like, don't worry if you don't understand it. Basically, we're using a newer PEP to figure out how we can use annotations to create new type hints. And so this type hint is pretty advanced. It says that it's an instance of DataFrame, which means your IDE will now autocomplete all the DataFrame methods as you continue to program.
The BeforeValidator says: I know markdown is going to come out, but I want to parse it into a DataFrame. The serializer says: I know it's a DataFrame, but when I serialize it, I want it to be markdown. And then lastly, you can attach additional JSON schema information, which becomes part of the prompt that gets sent to the language model.
But the idea here is, you know, it's really just a type system that we've defined that can be used by a language model. And then you can get pretty interesting outputs out of this, right? And because it's a DataFrame, you can instantly call to_csv or something like that without worrying about other implementation details.
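A sketch of what such an Annotated type could look like, using Pydantic's BeforeValidator, PlainSerializer, and WithJsonSchema; the markdown-parsing recipe here is approximate:

```python
from io import StringIO
from typing import Annotated, Any

import pandas as pd
from pydantic import BaseModel, BeforeValidator, InstanceOf, PlainSerializer, WithJsonSchema


def md_to_df(data: Any) -> Any:
    # Parse a markdown pipe table into a DataFrame; pass anything else through untouched
    if isinstance(data, str):
        return (
            pd.read_csv(StringIO(data), sep="|", index_col=1)
            .dropna(axis=1, how="all")  # drop the empty columns from the outer pipes
            .iloc[1:]                   # drop the |---|---| separator row
            .map(lambda x: x.strip() if isinstance(x, str) else x)  # pandas >= 2.1
        )
    return data


MarkdownDataFrame = Annotated[
    InstanceOf[pd.DataFrame],                      # the IDE sees a real DataFrame
    BeforeValidator(md_to_df),                     # markdown in -> DataFrame out
    PlainSerializer(lambda df: df.to_markdown()),  # DataFrame -> markdown on dump
    WithJsonSchema({"type": "string", "description": "A markdown table"}),
]


class Table(BaseModel):
    caption: str
    dataframe: MarkdownDataFrame
```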
And so what we've seen is that we can now generate things like date ranges and relationships, we can generate knowledge graphs as we showed last year, and generally just think in terms of DAGs and workflows and tables. And again, all we really care about is coming up with a creative response model and having a good set of validators, and as models get smarter, we're only going to have to do less and less, right?
This is fairly bulletproof. And so for the last five minutes, I really just want to share what I've learned in the past year, right? The first thing is that often one retry for models from OpenAI and Anthropic is basically enough, and really all you care about is having good, well-written, informative error messages, which has always been hard, but now you're more incentivized to build this out, because it not only makes the code more readable to you, but to the language model.
Then, for the newer models, like the 3.5- and 4o-class models, they're so much faster now that we can actually eat the latency cost of a retry in exchange for correctness. So again, as these models get smarter and faster, you're still fairly bulletproof. One thing I've noticed in a lot of the consulting that I've done is that we see 4% to 5% failure rates on very complex validations, but just by fine-tuning language models on function calling, we can get them down to zero, even for simpler models like Mistral or GPT-3.5.
And lastly, structured output is here to stay, mostly because even in domains like Vision or RAG or Agents, really what I care about is defining the type system that I want to program with on top of how I want to use language models. Prompting is an implementation detail, the response model is an implementation detail.
And whether or not we use something like constrained sampling, which is available in llama.cpp or Ollama or Outlines, again, the benefit I get as a programmer sits at a different level of abstraction. And then even with things like RAG and agents, right now we think of RAG as much more of a question-answering system, but in larger enterprise situations, I see a lot of report generation as a step toward better decision-making, right?
In agents, a lot of it now becomes generating workflows and DAGs to then send to an execution engine that does the computation for us, rather than having some kind of ReAct loop and hoping that these things terminate. And so really there are no new abstractions, right? Everything that we've done today is just reducing language models back to very classical programming.
What I care about is that my IDE understands the types, and we get red squiggly lines when things are unhappy. What we've done is turn generative AI into generating data structures. You now own the objects you define, you own the functions you implement, you own the control flow, and most importantly you own the prompt, because we just give you this messages array, and you can do anything that you want.
And what this means to me, and I think what this means to everyone else here, is that we are taking Software 3.0 and making it backwards compatible with existing software, right? We're allowing ourselves to demystify the language models and go back to a much more classical structure of how we program.
And that's why I still think Pydantic is basically all we need. Thank you.