Back to Index

Building AGI with OpenAI's Structured Outputs API


Chapters

0:00 Introductions
6:37 Joining OpenAI pre-ChatGPT
8:21 ChatGPT release and scaling challenges
9:58 Structured Outputs and JSON mode
11:52 Structured Outputs vs JSON mode vs Prefills
17:08 OpenAI API / research teams structure
18:12 Refusal field and why the HTTP spec is limiting
21:23 ChatML & Function Calling
27:42 Building agents with structured outputs
30:52 Use cases for structured outputs
38:36 Roadmap for structured outputs
42:06 Fine-tuning and model selection strategies
48:13 OpenAI's mission and the role of the API
49:32 War stories from the trenches
51:29 Assistants API updates
55:48 Relationship with the developer ecosystem
58:08 Batch API and its use cases
60:12 Vision API
62:07 Whisper API
64:30 Advanced voice mode and how that changes DX
65:27 Enterprise features and offerings
66:09 Personal insights on Waterloo and reading recommendations
70:53 Hiring and qualities that succeed at OpenAI

Transcript

(upbeat music) - Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai. - Hey, and today we're excited to be in the in-person studio with Michelle, welcome. - Thanks, thanks for having me, very excited to be here.

- This has been a long time coming. I've been following your work on the API platform for a little bit, and I'm finally glad that we could make this happen after you shipped the structured outputs. How does that feel? - Yeah, it feels great. We've been working on it for quite a while, so very excited to have it out there and have people using it.

- We'll tell the story soon, but I want to give people a little intro to your backgrounds. So you've interned and/or worked at Google, Stripe, Coinbase, Clubhouse, and obviously OpenAI. What was that journey like? You know, the one that has the most appeal to me is Clubhouse because that was a very, very hot company for a while.

Basically, you seem to join companies when they're about to scale up really a lot, and obviously OpenAI has been the latest. But yeah, just what are your learnings and your history going into all these notable companies? - Yeah, totally. For a bit of my background, I'm Canadian. I went to the University of Waterloo, and there you do like six internships as part of your degree.

So I started, actually, my first job was really rough. I worked at a bank, and I learned Visual Basic, and I like animated bond yield curves, and it was, you know, not-- - Me too. - Oh, really? - Yeah, I was a derivatives trader. - Interest rate swaps, that kind of stuff, yeah.

- Yeah, so I liked, you know, having a job, but I didn't love that job. And then my next internship was Google, and I learned so much there. It was tremendous. But I had a bunch of friends that were into startups more, and, you know, Waterloo has like a big startup culture, and one of my friends interned at Stripe, and he said it was super cool.

So that was kind of my, I also was a little bit into crypto at the time, then I got into it on Hacker News, and so Coinbase was on my radar. And so that was like my first real startup opportunity was Coinbase. I think I've never learned more in my life than in the four-month period when I was interning at Coinbase.

They actually put me on call. I worked on like the ACH rails there, and it was absolutely crazy. You know, crypto was a very formative experience. Yeah. - This is 2018 to 2020, kind of like the first big wave. - That was my full-time. But I was there as an intern in 2016.

Yeah, and so that was the period where I really like learned to become an engineer, learned how to use Git, got on call right away, you know, managed production databases and stuff. So that was super cool. After that, I went to Stripe and kind of got a different flavor of payments on the other side.

Learned a lot, was really inspired by the Collisons. And then my next internship after that, I actually started a company at Waterloo. So there's this thing you can do, it's an entrepreneurship co-op, and I did it with my roommate. The company's called Readwise, which still exists, but-- - Yeah, yeah, yeah.

- Everyone uses Readwise. - Yeah, awesome. - You co-founded Readwise? - Yeah. - I'm a premium user. - It's not even on your LinkedIn? - Yeah, I mean, I only worked on it for about a year, and so Tristan and Dan are the real founders, and I just had an interlude there.

But yeah, really loved working on something very startup-focused, user-focused, and hacking with friends, it was super fun. Eventually, I decided to go back to Coinbase and really get a lot better as an engineer. I didn't feel like I was, you know, didn't feel equipped to be a CTO of anything at that point, and so just learned so much at Coinbase.

And that was a really fun curve. But yeah, after that, I went to Clubhouse, which was a really interesting time. So I wouldn't say that I went there before it blew up. I would say I went there as it blew up, so not quite the sterling track record that it might seem.

But it was a super exciting place. I joined as the second or third backend engineer, and we were down every day, basically. One time, Oprah came on, and absolutely everything melted down, and so we would have a stand-up every morning, and we'd be like, "How do we make everything stay up?" Which is super exciting.

Also, one of the first things I worked on there was making our notifications go out more quickly, because when you join a Clubhouse room, you need everyone to come in right away so that it's exciting, and the person speaking thinks a lot of my audience is here. But when I first joined, I think it would take 10 minutes for all the notifications to go out, which is insane.

Between the time you want to start talking and the time your audience is there, you can totally kill the room. So that's one of the first things I worked on, is making that a lot faster and keeping everything up. - I mean, so already we have an audience of engineers.

Those two things are useful. It's keeping things up and notifications out. Notifications, like is it a Kafka topic? - It was a Postgres shop, and you had all of the followers in Postgres, and you needed to iterate over the followers and figure out, is this a good notification to send?

And so all of this logic, it wasn't well-batched and parallelized, and our job queuing infrastructure wasn't right. And so there was a lot of fixing all of these things. Eventually, there were a lot of database migrations, because Postgres just wasn't scaling well for us. - Interesting, and then keeping things up, that was more of a, I don't know, reliability issue, SRE type?

- A lot of it, yeah, it goes down to database stuff. Everywhere I've worked-- - It's on databases. (laughing) - Indexing. - Actually, at Coinbase, at Clubhouse, and at OpenAI, Postgres has been a perennial challenge. It's like, the stuff you learn at one job carries over to all the others, because you're always debugging a long-running Postgres query at 3 a.m.

for some reason. So those skills have really carried me forward, for sure. - Why do you think that not as much of this is productized? Obviously, Postgres is an open-source project. It's not aimed at gigascale, but you would think somebody would come around and say, "Hey, we're like the--" - Yeah, I think that's what Planetscale is doing.

It's not on Postgres, I think. It's on MySQL, but I think that's the vision. It's like, they have zero downtime migrations, and that's a big pain point. I don't know why no one is doing this on Postgres, but I think it would be pretty cool. - Their connection poolers, like PgBouncer, are good enough, I don't know.

- Yeah, well, even, I mean, I've run PgBouncer everywhere, and there's still a lot of problems. At your scale, it's something that not many people see. - Yeah, I mean, at some point, every successful company gets to the scale where Postgres is not cutting it, and then you migrate to some sort of NoSQL database.

And that process I've seen happen a bunch of times now. - MongoDB, Redis, something like that. - Yeah, I mean, we're on Azure now, and so we use Cosmos DB. - Cosmos DB, hey! - At Clubhouse, I really love DynamoDB. That's probably my favorite database, which is like a very nerdy sentence, but that's the one I'm using if I need to scale something as far as it goes.

- Yeah, DynamoDB, I, when I learned, I worked at AWS briefly, and it's kind of like the memory register for the web. Like, you know, if you treat it just as physical memory, you will use it well. If you treat it as a real database, you might run into problems.

- Right, you have to totally change your mindset when you're going from Postgres to Dynamo. But I think it's a good mindset shift, and kind of makes you design things in a more scalable way. - Yeah, I'll recommend the DynamoDB book for people who need to use DynamoDB. But we're not here to talk about AWS, we're here to talk about OpenAI.

You joined OpenAI pre-ChatGPT. I also had the opportunity to join and I didn't. What was your insight? - Yeah, I think a lot of people who joined OpenAI joined because of a product that really gets them excited. And for most people, it's ChatGPT. But for me, I was a daily user of Copilot, GitHub Copilot.

And I was like so blown away at the quality of this thing. I actually remember the first time seeing it on Hacker News and being like, wow, this is absolutely crazy. Like, this is gonna change everything. And I started using it every day. It just really, even now when like I don't have service and I'm coding without Copilot, it's just like 10x difference.

So I was really excited about that product. I thought now is maybe the time for AI. And I'd done some AI in college and thought some of those skills would transfer. And I got introduced to the team. I liked everyone I talked to. So I thought that'd be cool.

Why didn't you join? - It was like, I was like, is DALL-E it? (laughing) - We were there. We were at the DALL-E launch thing. And I think you were talking with Lenny and Lenny was at OpenAI at the time. And you were like-- - We don't have to go into too much detail.

- This is one of my biggest regrets of my life. - No, no, no. - But I was like, okay, I mean, I can create images. I don't know if this is the thing to dedicate myself to, but obviously you had a bigger vision than I did. - DALL-E was really cool too.

I remember like first showing my family, I was like, I'm going to this company and here's like one of the things they do. And it like really helped bridge the gap. Whereas like, I still haven't figured out how to explain to my parents what crypto is. My mom for a while thought I worked at Bitcoin.

So it's like, it's pretty different to be able to tell your family what you actually do and they can see it. - Yeah, and they can use it too, personally. So you were there, were you immediately on API platform? You were there for the chat GPT moment. - Yeah, I mean, API platform is like a very grandiose term for what it was.

There was like just a handful of us working on the API. - Yeah, it was like a closed beta, right? Not even everyone had access to the GPT-3 model. - A very different access model then, a lot more like tiered rollouts. But yeah, I would say the applied team was maybe like 30 or 40 people and yeah, probably closer to 30.

And there was maybe like five-ish total working on the API at most. So yeah, we've grown a lot since then. - It's like 60, 70 now, right? - No, applied is much bigger than that. Applied now is bigger than the company when I joined. - Okay, all right. - Yeah, we've grown a lot.

I mean, there's so much to build. So we need all the help we can get. - I'm a little out of date, yeah. - Any ChatGPT release, kind of like all hands on deck stories? I had lunch with Evan Morikawa a few months ago. It sounded like it was a fun time to build the APIs and have all these people trying to use the web thing.

Like, how are you prioritizing internally? And what was it like helping scale the non-GPU workloads, like Postgres, connection bouncers, and things like that? - Yeah, actually, surprisingly, there were a lot of Postgres issues when ChatGPT came out, because the accounts for ChatGPT were tied to the accounts in the API.

And so you're basically creating a developer account to log into ChatGPT at the time, 'cause it's just what we had. It was a low-key research preview. And so I remember there was just so much work scaling like our authorization system, and that would be down a lot. Yeah, also GPU, you know, I never had worked in a place where you couldn't just scale the thing up.

It's like everywhere I've worked, compute is like free and you just auto-scale a thing and you never think about it again. But here we're having tough decisions every day. We're discussing like, you know, should they go here or here? And we have to be principled about it.

So that's a real mindset shift. - So you just released structured outputs, congrats. You also wrote the blog post for it, which was really well-written. And I loved all the examples that you put out. Like you really give the full story. Yeah, tell us about the whole story from beginning to end.

- Yeah, I guess the story we should rewind quite a bit to Dev Day last year. Dev Day last year, exactly. We shipped JSON mode, which is our first foray into this area of product. So for folks who don't know, JSON mode is this functionality you can enable in our chat completions and other APIs, where if you opt in, we'll kind of constrain the output of the model to match the JSON language.

And so you basically will always get something in a curly brace. And this is good. This is nice for a lot of people. You can describe your schema, what you want in prompt, and then we'll constrain it to JSON. But it's not getting you exactly where you want, because you don't want the model to kind of make up the keys or match different values than what you want.

Like if you want an enum or a number and you get a string instead, it's pretty frustrating. So we've been ideating on this for a while, and people have been asking for basically this every time I talk to customers for maybe the last year. And so it was really clear that there's a developer need, and we started working on kind of making it happen.

And this is a real collab between engineering and research, I would say. And so it's not enough to just kind of constrain the model. I think of that as the engineering side, whereas basically you mask the available tokens that are produced every time to only fit the schema. And so you can do this engineering thing, and you can force the model to do what you want, but you might not get good outputs.

And sometimes with JSON mode, developers have seen that our models output like white space for a really long time, where they don't-- - Because it's a legal character. - Right, it's legal for JSON, but it's not really what they want. And so that's what happens when you do kind of a very engineering-biased approach.

But the modeling approach is to also train the model to do more of what you want. And so we did these together. We trained a model which is significantly better than our past models at following formats, and we did the eng work to serve this constrained decoding concept at scale.
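
A toy sketch of the "engineering side" described here, i.e. masking out tokens that would break the schema before sampling. This is not OpenAI's actual implementation; logits, tokenizer, and is_valid_prefix are hypothetical stand-ins.

```python
import math

def constrained_greedy_step(logits, prefix, tokenizer, is_valid_prefix):
    """Pick the highest-scoring token whose addition keeps the output schema-valid."""
    best_token, best_score = None, -math.inf
    for token_id, score in enumerate(logits):
        candidate = prefix + tokenizer.decode([token_id])
        # Mask: skip any token that would make the partial output invalid.
        if is_valid_prefix(candidate) and score > best_score:
            best_token, best_score = token_id, score
    return best_token
```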

So I think marrying these two is why this feature is pretty cool. - You just mentioned starts and ends with a curly brace, and maybe people's minds go to prefills in the Claude API. How should people think about JSON mode versus structured outputs versus prefills? Because some of them are like, roughly, it starts with a curly brace and asks you for JSON, so you should do it.

And then Instructor is like, "Hey, here's a rough data schema that you should use." And how do you think about them? - So I think we kind of designed structured outputs to be the easiest to use. So you just, like the way you use it in our SDK, I think is my favorite thing.

So you just create like a Pydantic object or a Zod object, and you pass it in and you get back an object. And so you don't have to deal with any of the serialization. - With the parse helper. - Yeah, you don't have to deal with any of the serialization on the way in or out.

So I kind of think of this as the feature for the developer who is like, I need this to plug into my system. I need the function call to be exact. I don't want to deal with any parsing. So that's where structured outputs is tailored. Whereas if you want the model to be more creative and use it to come up with a JSON schema that you don't even know you want, then that's kind of where JSON mode fits in.
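
A minimal sketch of the flow described above, using the Python SDK's parse helper with a Pydantic model (the schema and prompt here are illustrative, not from the episode):

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Illustrative schema; the SDK converts the Pydantic model into a JSON schema for you.
class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Alice and Bob are going to a science fair on Friday."}],
    response_format=CalendarEvent,
)
event = completion.choices[0].message.parsed  # a CalendarEvent instance, no manual parsing
```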

But I expect most developers are probably going to want to upgrade to structured outputs. - The thing you just said, you just use interchangeable terms for the same thing, which is function calling and structured outputs. We've had disagreements or discussion before on the podcast about are they the same thing?

Semantically, they're slightly different. - They are, yes. - Because I think the function calling API came out before JSON mode. And we used to abuse function calling for JSON mode. Do you think we should treat them as synonymous? - No. - Okay, yeah. Please clarify. (both laughing) And by the way, there's also tool calling.

- Yeah, the history here is we started with function calling and function calling came from the idea of let's give the model access to tools and let's see what it does. And we basically had these internal prototypes of what a code interpreter is now. And we were like, this is super cool.

Let's make it an API. But we're not ready to host code interpreter for everybody. So we're just going to expose the raw capability and see what people do with it. But even now, I think there's a really big difference between function calling and structured outputs. So you should use function calling when you actually have functions that you want the model to call.

And so if you have a database that you want the model to be able to query from, or if you want the model to send an email or generate arguments for an actual action. And that's the way the model has been fine-tuned on, is to treat function calling for actually calling these tools and getting their outputs.

The new response format is a way of just getting the model to respond to the user, but in a structured way. And so this is very different. Responding to a user versus I'm going to go send an email. A lot of people were hacking function calling to get the response format they needed.

And so this is why we shipped this new response format. So you can get exactly what you want and you get more of the model's verbosity. It's responding in the way it would speak to a user. And so less just programmatic tool calling, if that makes sense. - Are you building something into the SDK to actually close the loop with the function calling?

Because right now it returns the function, then you got to run it, then you got to fake another message to then continue the conversation. - They have that in beta, the runs. - Yes, we have this in beta in the Node SDK. So you can basically- - Oh, not Python.

- It's coming to Python as well. - That's why I didn't know. - Yeah, I'm a Node guy. So I'm like, it already existed. - It's coming everywhere. But basically what you do is you write a function and then you add a decorator to it. And then, basically, there's this runTools method and it does the whole loop for you, which is pretty cool.
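
For reference, the manual loop being described (which the runTools helper automates) looks roughly like this with the Python SDK; get_weather and its schema are made up for illustration:

```python
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real lookup

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
            "additionalProperties": False,
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Toronto?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = response.choices[0].message

if msg.tool_calls:
    messages.append(msg)  # keep the assistant's tool-call message in the history
    for call in msg.tool_calls:
        result = get_weather(**json.loads(call.function.arguments))
        # "Fake another message": feed the tool result back so the model can continue.
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```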

- When I saw that in the Node SDK, I wasn't sure if that's, because it basically runs it in the same machine. - Yeah. - And maybe you don't want that to happen. - Yeah, I think of it as like, if you're prototyping and building something really quickly and just playing around, it's so cool to just create a function and give it this decorator.

But you have the flexibility to do it however you like. - Like you don't want it in a critical path of a web request. - I mean, some people definitely will. (both laughing) It's just kind of the easiest way to get started. But let's say you want to like execute this function on a job queue async, then it wouldn't make sense to use that.

- Prior art: Instructor, Outlines, JSONformer. What did you study? What did you credit or learn from these things? - Yeah, there's a lot of different approaches to this. There's more fill-in-the-blank style sampling where you basically pre-form kind of the keys and then get the model to sample just the value.

There's kind of a lot of approaches here. We didn't kind of use any of them wholesale, but we really loved what we saw from the community and like the developer experiences we saw. So that's where we took a lot of inspiration. - There was a question also just about constrained grammar.

This is something that I first saw in llama.cpp, which seems to be the most, let's just say academically permissive. - It's kind of the lowest level. - Yeah. For those who don't know, maybe I don't know if you want to explain it, but they use Backus-Naur form, which you only learn in college when you're working on programming languages and compilers.

I don't know if you like use that under the hood or you explore that. - Yeah, we didn't use any kind of other stuff. We kind of built our solution from scratch to meet our specific needs. But I think there's a lot of cool stuff out there where you can supply your own grammar.

Right now, we only allow JSON schema and the dialect of that. But I think in the future, it could be a really cool extension to let you supply a grammar more broadly. And maybe it's more token efficient than JSON. So a lot of opportunity there. - You mentioned before also training the model to be better at function calling.

What's that discussion like internally for like resources? It's like, hey, we need to get better JSON mode. And it's like, well, can't you figure it out on the API platform without touching the model? Like is there a really tight collaboration between the two teams? - Yeah, so I actually work on the API models team.

I guess we didn't quite get into what I do at API. (all laughing) - What do you say it is you do here? - Yeah, so yeah, I'm the tech lead for the API, but also I work on the API models team. And this team is really working on making the best models for the API.

And a lot of common deployment patterns are research makes a model and then you kind of ship it in the API. But I think there's a lot you miss when you do that. You miss a lot of developer feedback and things that are not kind of immediately obvious. What we do is we get a lot of feedback from developers and we go and make the models better in certain ways.

So our team does model training as well. We work very closely with our post-training team. And so for structured outputs, it was a collab between a bunch of teams, including safety systems to make a really great model that does structured outputs. - Mentioning safety systems, you have a refusal field.

- Yes. - You want to talk about that? - Yeah, it's pretty interesting. So you can imagine basically if you constrain the model to follow a schema, you can imagine there being like a schema supplied that it would add some risk or be harmful for the model to kind of follow that schema.

And we wanted to preserve our model's abilities to refuse when something doesn't match our policies or is harmful in some way. And so we needed to give the model an ability to refuse even when there is this schema. But also, you know, if you are a developer and you have this schema and you get back something that doesn't match it, you're like, "Oh, the feature's broken." So we wanted a really clear way for developers to program against this.

So if you get something back in the content, you know it's valid, it's JSON parsable. But if you get something back in the refusal field, it makes for a much better UI for you to kind of display this to your user in a different way. And it makes it easier to program against.
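
In code, the pattern is to branch on the refusal field before touching the parsed content. A short sketch, continuing the earlier parse example (show_refusal_to_user is a hypothetical UI helper):

```python
message = completion.choices[0].message
if message.refusal:
    # The model declined; surface that as its own UI state rather than a schema error.
    show_refusal_to_user(message.refusal)
else:
    result = message.parsed  # guaranteed to be valid JSON matching your schema
```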

So really there were a few goals, but it was mainly to allow the model to continue to refuse, but also with a really good developer experience. - Yeah, why not offer it as like an error code? Because we have to display error codes anyway. - Yeah, we've waffled for a long time about API design, as we are wont to do.

And there are a few reasons against an error code. Like you could imagine this being a 4xx error code or something, but you know, the developer's paying for the tokens. And that's kind of atypical for like a 4xx error code. - We pay with errors anyway, right? Or no?

- So 4xx is-- - Is not, that's a "you" error. - Right, and it doesn't make sense as a 5xx either, 'cause it's not our fault. It's the way the API, the model is designed. I think the HTTP spec is a little bit limiting for AI in a lot of ways.

Like there are things that are in between your fault and my fault. There's kind of like the model's fault and there's no, you know, error code for that. So we really have to kind of invent a lot of the paradigm here. - We get 6xx. - Yeah, that's one option.

There's actually some like esoteric error codes we've considered adopting. - 418, my favorite. - Yeah, there's the teapot one. - Hey! (laughs) - We're still figuring that out. But I think there are some things, like for example, sometimes our model will produce tokens that are invalid based on kind of our language.

And when that happens, it's an error. But, you know, it doesn't, 500 is fine, which is what we return, but it's not as expressive as it could be. So yeah, just areas where, you know, web 2.0 doesn't quite fit with AI yet. - If you have to put in a spec, I was gonna-- - To just change.

Yeah, yeah, yeah. What would be your number one proposal to, like, overhaul? - The HTTP committee to re-invent the world. - Yeah, that's a good one. I mean, I think we just need, like, a range of model error codes. And we can have many different kinds of model errors.

Like a refusal is a model error. - 601, auto refusal. - Yeah, again, like, so we've mentioned before that chat completions uses this ChatML format. So when the model doesn't follow ChatML, that's an error. And we're working on reducing those errors, but that's like, I don't know, 602, I guess.

- A lot of people actually no longer know what ChatML is. - Yeah, fair enough. - Because that was briefly introduced by OpenAI and then kind of deprecated. Everyone who implements this under the hood knows it, but maybe the API users don't know it. - Basically, the API started with just one endpoint, the completions endpoint.

And the completions endpoint, you just put text in and you get text out. And you can prompt in certain ways. Then we released chat GPT, and we decided to put that in the API as well. And that became the chat completions API. And that API doesn't just take like a string input and produce an output.

It actually takes in messages and produces messages. And so you can get a distinction between like an assistant message and a user message, and that allows all kinds of behavior. And so the format under the hood for that is called ChatML. Sometimes, you know, because the model is so out of distribution based on what you're doing, maybe the temperature is super high, then it can't follow ChatML.

- Yeah, I didn't know that there could be errors generated there. Maybe I'm not asking challenging enough questions. - It's pretty rare, and we're working on driving it down. But actually, this is a side effect of structured outputs now, which is that we have removed a class of errors.

We didn't really mention this in the blog, just 'cause we ran out of space. But-- - That's what we're here to do. - Yeah, the model used to occasionally pick a recipient that was invalid, and this would cause an error. But now we are able to constrain to chat ML in a more valid way.

And this reduces a class of errors as well. - Recipient meaning, so there's this, like a few number of defined roles, like user, assistant, system. - Like recipient as in like picking the right tool. - Oh. - Oh. - So the model before was able to hallucinate a tool, but now it can't when you're using structured outputs.

- Do you collaborate with other model developers to try and figure out this type of errors? Like how do you display them? Because a lot of people try to work with different models. - Yeah. - Yeah, is there any? - Yeah, not a ton. We're kind of just focused on making the best API for developers.

- A lot of research and engineering, I guess, comes together with evals. You published some evals there. I think Gorilla is one of them. What is your assessment of like the state of evals for function calling and structured output right now? - Yeah, we've actually collaborated with BFCL a little bit, which is, I think, the same thing as Gorilla.

- Function calling leaderboard. - Kudos to the team. Those evals are great, and we use them internally. Yeah, we've also sent some feedback on some things that are misgraded. And so we're collaborating to make those better. In general, I feel evals are kind of the hardest part of AI.

Like when I talk to developers, it's so hard to get started. It's really hard to make a robust pipeline. And you don't want evals that are like 80% successful because, you know, things are gonna improve dramatically. And it's really hard to craft the right eval. You kind of want to hit everything on the difficulty curve.

I find that a lot of these evals are mostly saturated, like for BFCL. All the models are near the top already, and kind of the errors are more, I would say, like just differences in default behaviors. I think most of the models on leaderboard can kind of get 100% with different prompting, but it's more kind of you're just pulling apart different defaults at this point.

So yeah, I would say in general, we're missing evals. You know, we work on this a lot internally, but it's hard. - Did you, other than BFCL, would you call out any others just for people exploring the space? - SweetBench is actually like a very interesting eval, if people don't know.

You basically give the model a GitHub issue and like a repo and just see how well it does at the issue, which I think is super cool. It's kind of like an integration test, I would say, for models. - It's a little unfair, right? - What do you mean?

- A little unfair, 'cause like usually as a human, you have more opportunity to like ask questions about what it's supposed to do. And you're giving the model like way too little information. - It's a hard job. - To do the job. - But yeah, SWE-bench targets like, how well can you follow the diff format and how well can you like search across files and how well can you write code?

So I'm really excited about evals like that because the pass rate is low, so there's a lot of room to improve. And it's just targeting a really cool capability. - I've seen other evals for function calling where I think might be BFCL as well, where they evaluate different kinds of function calling.

And I think the top one that people care about, for some reason, I don't know personally that this is so important to me, but it's parallel function calling. I think you confirmed that you don't support that yet. Why is that hard? Just more context about it. - So yeah, we put out parallel function calling in Dev Day last year as well.

And it's kind of the evolution of function calling. So function calling V1, you just get one function back. Function calling V2, you can get multiple back at the same time and save latency. We have this in our API, all our models support it, or all of our newer models support it, but we don't support it with structured outputs right now.

And there's actually a very interesting trade-off here. So when you basically call our API for structured outputs with a new schema, we have to build this artifact for fast sampling later on. But when you do parallel function calling, the kind of schema we follow is not just directly one of the function schemas.

It's like this combined schema based on a lot of them. If we were to kind of do the same thing and build an index every time you pass in a list of functions, if you ever change the list, you would kind of incur more latency. And we thought it would be really unintuitive for developers and hard to reason about.

So we decided to kind of wait until we can support a no-added-latency solution and not just kind of make it really confusing for developers. - Mentioning latency, that is something that people discovered, is that there is an increased cost and latency for the first token. - For the first request, yeah.

- First request. Is that an issue? Is that going to go down over time? Is there just an overhead to parsing JSON that is just insurmountable? - It's definitely not insurmountable. And I think it will definitely go down over time. We just kind of take the approach of ship early and often.

And if there's nothing in there you want to fix, then you probably shipped too late. So I think we will get that latency down over time. But yeah, I think for most developers, it's not a big concern. 'Cause you're testing out your integration, you're sending some requests while you're developing it, and then it's fast from there.

So it kind of works for most people. The alternative design space that we explored is like pre-registering your schema, so like a totally different endpoint, and then passing in like a schema ID. But we thought, you know, that was a lot of overhead and like another endpoint to maintain and just kind of more complexity for the developer.

And we think this latency is going to come down over time. So it made sense to keep it kind of in chat completions. - I mean, hypothetically, if one were to ship caching at a future point, it would basically be the superset of that. - Maybe. I think the caching space is a little underexplored.

Like we've seen kind of two versions of it. But I think, yeah, there's ways that maybe put less onus on the developer. But, you know, we haven't committed to anything yet, but we're definitely exploring opportunities for making things cheaper over time. - Is AGI and agents just going to be a bunch of structured outputs and function calling chained one next to each other?

Like, how do you see, you know, there's like the model does everything. Where do you draw the line? Because you don't call these things like an agent API, but like if I were a startup trying to raise a seed round, I would just do function calling and say, this is an agent API.

So how do you think about the difference and like how people build on top of it for like agentic systems? - Yeah, love that question. One of the reasons we wanted to build structured outputs is to make agentic applications actually work. So right now it's really hard. Like if something is 95% reliable, but you're chaining together a bunch of calls, if you magnify that error rate, it makes your like application not work.

So that's a really exciting thing here from going from like 95% to 100%. I'm very biased working in the API and working on function calling and structured outputs, but I think those are the building blocks that we'll be using kind of to distribute this technology very far. It's the way you connect like natural language and converting user intent into working with your application.

And so I think like kind of, there's no way to build without it, honestly. Like you need your function calls to work. Like, yeah, we wanted to make that a lot easier. - Yeah, and do you think the Assistants API will be a bigger part as people build agents?

I think maybe most people just use messages and completions. - So I would say the Assistants API was kind of a bet in a few areas. One bet is hosted tools. So we have the file search tool and code interpreter. Another bet was kind of statefulness. It's our first stateful API.

It'll store threads and you can fetch them later. I would say the hosted tools aspect has been really successful. Like people love our file search tool and it saves a lot of time to not build your own RAG pipeline. I think we're still iterating on the shape for the stateful thing to make it as useful as possible.

Right now, there's kind of a few endpoints you need to call before you can get a run going. And we want to work to make that, you know, much more intuitive and easier over time. - One thing I'm just kind of curious about, did you notice any trade-offs when you add more structured output, it gets worse at some other thing that was like kind of, you didn't think was related at all?

- Yeah, it's a good question. Yeah, I mean, models are very spiky and RL is hard to predict. And so every model kind of improves on some things and maybe is flat or neutral on other things. - Yeah, like it's like very rare to just add a capability and have no trade-offs in everything else.

- So yeah, I don't have something off the top of my head, but I would say, yeah, every model is a special kind of its own thing. This is why we put dated models in the API, so developers can choose for themselves which one works best for them. In general, we strive to continue improving on all evals, but it's stochastic.

- Yeah, are you able to apply the structured outputs system on older models, like the May 4o, as well as mini, as well as the August one? - Actually the new response format is only available on two models. It's 4o mini and the new 4o. So the old 4o doesn't have the new response format.

However, for function calling, we were able to enable it for all models that support function calling. And that's because those models were already trained to follow these schemas. We basically just didn't wanna add the new response format to models that would do poorly at it because they would just kind of do infinite white space, which is the most likely token if you have no idea what's going on.

- I just wanted to call out a little bit more of the stuff you've done in the blog post. So in the blog post, you list use cases, right? I just want people to be like, yeah, we're spelling it out for you. Use these for extracting structured data from unstructured data.

By the way, it does vision too, right? So that's cool. Dynamic UI generation. Actually, let's talk about dynamic UI. I think gen UI, I think, is something that people are very interested in. As your first example, what did you find about it? - Yeah, I just thought it was a super cool capability we have now.

So the schemas, we support recursive schemas, and this allows you to do really cool stuff. Like, every UI is a nested tree that has children. So I thought that was super cool. You can use one schema and generate tons of UIs. As a backend engineer who's always struggled with JavaScript and frontend, for me, that's super cool.
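
A sketch of that recursive-schema idea with Pydantic (the component fields are made up for illustration, not the blog post's exact schema):

```python
from __future__ import annotations
from pydantic import BaseModel

class UIComponent(BaseModel):
    type: str                    # e.g. "div", "button", "input" (illustrative)
    label: str
    children: list[UIComponent]  # the recursion: every node can contain more nodes

# Passed as response_format=UIComponent, this single schema can describe
# arbitrarily nested frontends, since a UI is just a tree with children.
```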

We've now built a system where I can get any frontend that I want. So yeah, that's super cool. The extracting structured data, the reality of a lot of AI applications is you're plugging them into your enterprise business and you have something that works, but you want to make it a little bit better.

And so the reliability gains you get here is you'll never get a classification using the wrong enum. It's just exactly your types. So really excited about that. - It can maybe still hallucinate the actual values, right? So let's clearly state what the guarantees are. The guarantee is that this fits the schema, but the schema itself may be too broad, because the JSON schema type system doesn't let you say, I only want a range from one to 11.

You might give me zero. You might give me 12. - So yeah, JSON schema. So this is actually a good thing to talk about. So JSON schema is extremely vast and we weren't able to support every corner of it. So we kind of support our own dialect and it's described in the docs.

And there are a few trade-offs we had to make there. So by default, if you don't pass in additional properties in a schema, by default, that's true. And so that means you can get other keys, which you didn't spell out, which is kind of the opposite of what developers want.

You basically want to supply the keys and values and you want to get those keys and values. And so we had a decision to make: do we redefine what additionalProperties means as the default? And that felt really bad. It's like, there's a schema spec that predates us.

Like, it wouldn't be good. It'd be better to play nice with the community. And so we require that you pass it in as false. One of our design principles is to be very explicit and so developers know what to expect. And so this is one where we decided, it's a little harder to discover, but we think you should pass this thing in so that we can have a very clear definition of what you mean and what we mean.

There's a similar one here with required. By default, every key in JSON schema is optional, but that's not what developers want, right? You'd be very surprised if you passed in a bunch of keys and you didn't get some of them back. And so that's the trade-off we made, is to make everything required and have the developers spell that out.

- Is there a required false? Can people turn it off, or are they just getting all-- - So developers can, basically what we recommend for that is to make your actual key a union type. And so-- - Nullable. - Yeah, make it a union of int and null and that gets you the same behavior.
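
Putting those constraints together, a raw schema in the structured outputs dialect ends up looking roughly like this (field names are illustrative), expressed here as a Python dict:

```python
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "rating": {"type": ["integer", "null"]},  # "optional" field: a union with null
    },
    "required": ["name", "rating"],   # every key must be listed as required
    "additionalProperties": False,    # must be passed explicitly; it is not defaulted for you
}
```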

- Any other of the examples you want to dive into, math, chain of thought? - Yeah, you can now specify like a chain of thought field before a final answer. This is just like a more structured way of extracting the final answer. One example we have, I think we put up a demo app of this math tutoring example, or it's coming out soon.

- Did I miss it? Oh, okay, well. - Basically, it's this math tutoring thing and you put in an equation and you can go step by step and answer it. This is something you can do now with Structured Outputs. In the past, a developer would have to specify their format and then write a parser and parse out the model's output, which would be pretty hard.

But now you just specify steps and it's an array of steps and every step you can render and then the user can try it and you can see if it matches and go on that way. So I think it just opens up a lot of opportunities. Like for any kind of UI where you want to treat different parts of the model's responses differently, Structured Outputs is great for that.
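
The math-tutoring shape described here maps to something like the following (a sketch; the class and field names are guesses, not the demo's actual code):

```python
from pydantic import BaseModel

class Step(BaseModel):
    explanation: str  # shown to the student one step at a time
    output: str       # the intermediate expression after this step

class MathResponse(BaseModel):
    steps: list[Step]
    final_answer: str

# Used as response_format=MathResponse, each step can be rendered and checked
# separately instead of writing a custom parser for free-form model output.
```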

- I remembered my question from earlier. I'm basically just using this to ask you all the questions as a user, as a daily user of the stuff that you put out. So one is a tip that people don't know and I confronted you on Twitter, which is you respect descriptions of JSON schemas, right?

And you can basically use that as a prompt for the field. - Totally. - I assume that's blessed and people should do that. - Intentional, yeah. - One thing that I started to do, which I don't, it could be a hallucination of me, is I changed the property name to prompt the model to what I wanted to do.

So for example, instead of saying topics as a property name, I would say like, "Brainstorm a list of topics up to five," something like that as a property name. I could stick that in the description as well, but is that too much? (laughs) - Yeah, I would say, I mean, we're so early in AI that people are figuring out the best way to do things.

And I love when I learn from a developer like a way they found to make something work. In general, I think there's like three or four places to put instructions. You can put instructions in the system message and I would say that's helpful for like when to call a function.

So it's like, let's say you're building a customer support thing and you want the model to verify the user's phone number or something. You can tell the model in the system message, like here's when you should call this function. Then when you're within a function, I would say the descriptions there should be more about how to call a function.

So really common is someone will have like a date as a string, but you don't tell the model, like, do you want year, year, month, month, day, day? Or do you want that backwards? And that's what a really good spot is for those kinds of descriptions. It's like, how do you call this thing?

And then sometimes there's really stuff like what you're doing, which is: name the key by what you want. So sometimes people put like, do not use, you know, if they don't want this parameter to be used except only in some circumstances. And really, I think that's the fun nature of this. It's like, you're figuring out the best way to get something out of the model.
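
For example, the "how to call it" guidance usually lives in the field descriptions, while "when to call it" lives in the system message. A hedged sketch with Pydantic's Field (the exact wording and fields are up to you):

```python
from pydantic import BaseModel, Field

class VerifyCustomerArgs(BaseModel):
    # Field descriptions tell the model *how* to fill a value in;
    # the system message is the better place for *when* to call the function.
    phone_number: str = Field(description="E.164 format, e.g. +14165550123")
    date_of_birth: str = Field(description="Date in YYYY-MM-DD format")
```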

It's like, you're figuring out the best way to get something out of the model. - Okay, so you don't have an official recommendation is what I'm hearing. - Well, the official recommendation is, you know, how to call a model, system instructions. - Exactly, exactly. - Or when to call a function, yeah.

- Do you benchmark these types of things? So like, say with date, in the description it's like, return it in ISO 8601. Or if you call the key date_in_iso_8601, I feel like the benchmarks don't go that deep, but then all the AI engineering kind of community, like all the work that people do, it's like, oh, actually this performs better, but then there's no way to verify, you know?

Like even the, I'm gonna tip you $100,000 or whatever, like some people say it works, some people say it doesn't. Do you pay attention to this stuff as you build this? Or are you just like, the model is just gonna get better, so why waste my time running evals on these small things?

- Yeah, I would say to that, I would say we basically pick our battles. I mean, there's so much surface area of LLMs that we could dig into, and we're just mostly focused on kind of raising the capabilities for everyone. I think for customers, and we work with a lot of customers, really developing their own evals is super high leverage, 'cause then you can upgrade really quickly when we have a new model, you can experiment with these things with confidence.

So yeah, we're hoping to make making evals easier. I think that's really generally very helpful for developers. - For people, I would just kind of wrap up the discussion for structured outputs, I immediately implemented, we use structured outputs for AI News, I use Instructor, and I ripped it out, and I think I saved 20 lines of code, but more importantly, it was like, we cut it by 55% of API costs based on what I measured, because we saved on the retries.

- Nice, yeah, love to hear that. - Yeah, which people I think don't understand, when you can't just simply add Instructor or add outlines, you can do that, but it's actually gonna cost you a lot of retries to get the model that you want, but you're kind of just kind of building that internally into the model.

- Yeah, I think this is the kind of feature that works really well when it's integrated with the LLM provider. Yeah, actually, I had folks, even at my husband's company, he works at a small startup, they thought we were just retrying, and so I had to set them straight.

- We are not retrying, you know, we're doing it in one shot, and this is how you save on latency and cost. - Awesome, any other behind-the-scenes stuff, just generally on structured outputs? We're gonna move on to the other models. - Yeah, I think that's it. - Well, it's an excellent product, and I think everyone will be using it, and we have the full story now that people can try out.

So Roadmap would be parallel function calling, anything else that you've called out as coming soon? - Not quite soon, but we're thinking about, does it make sense to expose custom grammars beyond JSON schema? - What would you want to hear from developers to give you information, whether it's custom grammars or anything else about structured outputs?

What would you want to know more of? - Just always interested in feature requests, what's not working, but I'd be really curious, what specific grammars folks want. I know some folks want to match programming languages like Python. There's some challenges with the expressivity of our implementation, and so, yeah, just kind of the class of grammars folks want.

- I have a very simple one, which is a lot of people try to use GPT as judge, right? Which means they end up doing a rating system, and then there's like 10 different kinds of rating systems, there's a Likert scale, there's whatever. If there was an officially blessed way to do a rating system with structured outputs, everyone would use it.

- Yeah, yeah, that makes sense. I mean, we often recommend using log probs with classification tasks. So rather than like sampling, let's say you have four options, like red, yellow, blue, green, rather than sampling two tokens for yellow, you can just do like A, B, C, D, and get the log probs of those.
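
A sketch of that logprobs approach for a four-way classification (the labels and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer with exactly one letter: A=red, B=yellow, C=blue, D=green."},
        {"role": "user", "content": "What color is a ripe banana?"},
    ],
    max_tokens=1,
    logprobs=True,
    top_logprobs=4,  # inspect the whole distribution over the single answer token
)
for candidate in resp.choices[0].logprobs.content[0].top_logprobs:
    print(candidate.token, candidate.logprob)  # pick the most likely label yourself
```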

The inherent randomness of each sampling isn't taken into account, and you can just actually look at what is the most likely token. - I think this is more of like a calibration question. Like if I asked you to rate things from one to 10, a non-calibrated model might always pick seven, just like a human would.

- Right. - So like actually have a nice gradation from one to 10 would be the rough idea. And then even for structured outputs, I can't just say have a field of rating from one to 10 because I have to then validate it, and it might give me 11.

- Yeah, absolutely. - So what about model selection? Now you have a lot of models. When you first started, you had one model endpoint. I guess you had like the DaVinci models, but most people were using one model endpoint. Today, you have a lot of competitive models, and I think we're nearing the end of the GPT-3.5 run, RIP.

How do you advise people to experiment and select, both in terms of tasks and costs? Like, what's your playbook? - In general, I think folks should start with 4o mini. That's our cheapest model, and it's a great workhorse. Works for a lot of great use cases. If you're not finding the performance you need, like maybe it's not smart enough, then I would suggest going to 4o.

And if 4o works well for you, that's great. Finally, there are some really advanced frontier use cases, and maybe 4o is not quite cutting it, and there I would recommend our fine-tuning API. Even just like 100 examples is enough to get started there, and you can really get the performance you're looking for.

- We're recording this ahead of it, but you're announcing some other fine-tuning stuff that people should pay attention to. - Yeah, actually tomorrow we're dropping our GA for GPT-4o fine-tuning. So 4o mini has been available for a few weeks now, and 4o is now gonna be generally available.

And we also have a free training offering for a bit. I think until September 23rd, you get one million free training tokens a day. - This is already announced, right? Am I talking about a different thing? - So that was for 4o mini, and now it's also for 4o.

So we're really excited to see what people do with it. And it's actually a lot easier to get started than a lot of people expect. I think they might need tens of thousands of examples, but even 100 really high quality ones, or 1,000 is enough to get going. - Oh, well, we might get a separate podcast just specifically on that, but we haven't confirmed that yet.
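
For anyone getting started, kicking off a 4o fine-tune looks roughly like this with the Python SDK (the file name is a placeholder; examples.jsonl holds your chat-formatted training examples):

```python
from openai import OpenAI

client = OpenAI()

# Upload ~100-1,000 high-quality chat examples in JSONL format, then start the job.
training_file = client.files.create(
    file=open("examples.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)
print(job.id, job.status)
```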

It basically seems like every time, I think people's concerns about fine tuning is that they're kind of locked into a model. And I think you're paving the path for migration of models. As long as they keep their original data set, they can at least migrate nicely. - Yeah, I'm not sure what we've said publicly there yet, but we definitely wanna make it easier for folks to migrate.

- It's the number one concern. I'm just, it's obvious. (laughs) - Absolutely. I also wanna point people to, you have official model selection docs, where it's in the guide, we'll put it in the show notes, where it says to optimize for accuracy first, so prompt engineering, RAG, evals, fine tuning.

This was done at Dev Day last year, so I'm just repeating things. And then optimize for cost and latency second, and there's a few sets of steps for optimizing latency, so people can read up on that stuff. - Yeah, totally. - We had one episode with Nicholas Carlini from DeepMind, and we actually talked about how some people don't actually get to the boundaries of the model performance.

You know, they just kind of try one model, and it's like, "Oh, LLMs cannot do this," and they stop. How should people get over the hurdle? It's like, how do you know if you hit the model performance, or like you hit skill issues? You know, it's like, "Your prompt is not good," or like, "Try another model," and whatnot.

Is there an easy way to do that? - That's tough. Some people are really good at prompting, and they just kind of get it right away, and for others, it's more of a challenge. I think there's a lot we can do to make it easier to prompt our models, but for now, I think it requires a lot of creativity and not giving up right away, yeah.

And a lot of people have experience now with ChatGPT. You know, before, ChatGPT, the easiest way to play with our models was in the playground, but now kind of everyone's played with it, with a model of some sort, and they have some sort of intuition. It's like, you know, if I tell you my grandma is sick, then maybe I'll get the right output, and we're hoping to kind of remove the need for that, but playing around with ChatGPT is a really good way to get a feel for, you know, how to use the API as well.

- Will prompt engineering be here forever, or is it a dying art as the models get better? - I mean, it's like the perennial question of software engineering as well. It's like, as the models get better at coding, you know, if we hit a hundred on SWE-bench, what does that mean?

I think there will always be alpha in people who are able to, like, clearly explain what they're trying to build. Most of engineering is like figuring out the requirements and stating what you're trying to do, and I believe this will be the case with AI as well. You're going to have to very clearly explain what you need, and some people are better than others at it, and people will always be building.

It's just the tools are going to get far better. - In the last two weeks, you released two models. There's gpt-4o-2024-08-06, and then there's also chatgpt-4o-latest. I think people were a little bit confused by that, and then you issued a clarification that one's chat-tuned and the other is more function calling-tuned.

Can you elaborate, just? - Yeah, totally. So part of the impetus here was to be kind of very transparent with what's in ChatGPT and in the API. So basically, we're often training models, and there are different use cases. So you don't really need function calling for user-defined functions in ChatGPT.

And so this gives us kind of the freedom to build the best model for each use case. So with chatgpt-4o-latest, we're releasing kind of this rolling model. The weights aren't pinned. As we release new models-- - This is literally what we use. - Yeah, so it's what's in ChatGPT, so it's very good for like chat-style use cases.

But for the API broadly, you know, we really tune our models to be good at things that developers want, like function calling and structured outputs, and when a developer builds their application, they want to know that kind of the weights are stable under them. And so we have this offering where it's like, if you're tuning to a specific model and you know your function works, you know it will never change the weights out from under you.

And so those are the models we commit to supporting for a long time, and we think those are the best for developers. But we want to, you know, leave the choice to developers. Like, do you want the ChatGPT model or do you want the API model?

And you have the freedom to choose what's best for you. - I think it's for people, they do want to pin model versions, so I don't know when they would use ChatGPT, like the rolling one, unless they're really just kind of cloning ChatGPT. Which is like, why would they?

- I mean, I think there's a lot of interesting stuff that developers can do when unbounded, and so we don't want to limit them artificially. So it's kind of survival of the fittest, like whichever model is better, you know, that's the one that people should use. - Yeah, when I talked about it to my friends, it's like, this is the new thing.

And basically, OpenAI has never actually shared with you the actual ChatGPT model, and now they do. - Well, it's not necessarily true. Actually, a lot of the models we have shipped have been the same, but you know, sometimes they diverge and it's not a limitation we want to stick around.

- Anything else we should know about the new model? I don't think there were any evals announced or anything, but people say it's better. I mean, obviously, LMSYS has it way above everything, right? It's like number one in the world on-- - Yeah, we published some release notes.

They're not as in-depth as we want to be yet, because it's still kind of a science and we're learning what actually changes with each model and how can we better understand the capabilities. But we are trying to do more release notes in the future and keep folks updated. But yeah, it's kind of an art and a science right now.

- You need the best evals team in the world to help you figure this out. - Yeah, evals are hard. We're hiring if you want to come work on evals. - Hold that thought on hiring. We'll come back to the end on what you're looking for, 'cause obviously people want to join you and they want to know what qualities you're looking for.

- So we just talked about API versus ChatGPT. What's, I guess, the vision for the interface? You know, the mission of OpenAI is to build AGI that is accessible. Like, where is it going to come from? - Totally, yeah. So I believe that the API is kind of our broadest vehicle for distributing AGI.

You know, we're building some first-party products, but they'll never reach every niche in the world and kind of every corner and community. And so really love working with developers and seeing the incredible things they come up with. I often find that developers kind of see the future before anyone else, and we love working with them to make it happen.

And so really the API is a bet on going really broad. And we'll go very deep as well in our first-party products, but I think just that our impact is absolutely magnified by every developer that we uplift. - They can do the last mile where you cannot. Like, ChatGPT is one type of product, but there's many other kinds.

In fact, you know, I observed, I think in February, basically, ChatGPT's user growth stopped when the API was launched, because everyone was kind of able to take that and build other things. That's no longer true, because ChatGPT has continued to grow. But then, you're not confirming any of this.

This is me quoting SimilarWeb numbers, which have very high variance. - Well, the API predates ChatGPT. The API was actually OpenAI's first product, and the first idea for commercialization; that predates me as well. - Wide release. Like, GA, everyone can sign up and use it immediately. Yeah, that's what I'm talking about.

But yeah, I mean, I do believe that. And, you know, that means you also have to expose all of OpenAI's models, right? Like, all the multi-modal models. We'll ask you questions on that, but I think that API mission is important. It's interesting that the hottest new programming language is supposed to be English, but it's actually just software engineering, right?

It's just, you know, we're talking about HTTP error codes. - Right. (laughs) - Yeah, I think, you know, engineering is still the way you access these models. And I think there are companies working on tools to make engineering more accessible for everyone, but there's still so much alpha in just writing code and deploying.

- Yeah, one might even call it AI engineering. - Exactly. - I don't know. Yeah, so like, there's lots of war stories from building this platform. We started at the start of your career, and then we jumped straight to structured outputs. There's a whole thing, like two years, that we skipped in between.

What have become your principles? What are your favorite stories that you like to tell? - We had so much fun working on the Assistants API and leading up to Dev Day. You know, things are always pretty chaotic when you have an external, like a date-- - Forcing function. - That is hard, and there's like a stage, and there's like 1,000 people coming.

- You can always launch a wait list, I mean. (laughs) - We're trying hard not to, because, you know, we love it when people can access the thing on day one. And so, yeah, the Assistants API, we had like this really small team, and just working as hard as we could to make this come to life.

But even, actually, the morning of, I don't know if you'll remember this, but Sam did this keynote, and Ramon came up, and they gave free credits to everybody. So that was live, fully live, as were all of the demos that day. But actually, maybe like two hours before that, we had a little outage, and everyone was like scrambling to make this thing work again.

So, yeah, things are early and scrappy here, and, you know, we were really glad. We were a bit on the edge of our seat watching it live. - What's the plan B in that situation? If you can share. - Play a video. This is classic DevRel, right? I don't know.

- I mean, I actually don't know what the plan B was. - No plan B. No failure. - But we just, you know, we fixed it. We got everything running again, and the demo went well. - Just hire cracked Waterloo grads. - Exactly. (laughs) - Skill issues, as usual. - Sometimes you just gotta make it happen.

- I imagine it's actually very motivating, but I did hear that after Dev Day, like the whole company got like a few weeks off, just to relax a little bit. - Yeah, we sometimes get, like we just had the week of July 4th off, and yeah. It's hard to take vacation, because people are working on such exciting things, and it's like, you get a lot of FOMO on vacation, so it helps when the whole company's on vacation.

- Mentioning the Assistants API, you actually announced a roadmap there, and things have developed. I think people may not be up to date. What's the offering today versus, you know, one year ago? - Yeah, so we've made a bunch of key improvements. I would say the biggest one is in the file search product.

Before, we only supported, I think, like 20 files per assistant, and the way we used those files was like less effective. Basically, the model would decide based on the file name, whether to search a file, and there's not a ton of information in there. So our new offering, which we shipped a few months ago, I think now allows 10K files per assistant, which is like dramatically more.

And also, it's a kind of different operation. So you can search semantically over all files at once, rather than just kind of the model choosing one up front. So a lot of customers have seen really good performance. We also have exposed more like chunking and re-ranking options. I think the re-ranking one is coming, I think, next week or very soon.
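As a rough sketch of what that newer file search shape looks like in the Python SDK, as of the Assistants v2 beta around the time of recording (the file name, store name, and chunk sizes here are arbitrary examples, and the exact parameters may have changed since):

```python
from openai import OpenAI

client = OpenAI()

# Upload a file and add it to a vector store; chunking is configurable.
doc = client.files.create(file=open("handbook.pdf", "rb"), purpose="assistants")
store = client.beta.vector_stores.create(
    name="docs",
    file_ids=[doc.id],
    chunking_strategy={
        "type": "static",
        "static": {"max_chunk_size_tokens": 800, "chunk_overlap_tokens": 400},
    },
)

# The assistant searches semantically over everything in the attached store(s),
# rather than picking a single file up front by name.
assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="Answer questions using the uploaded docs.",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [store.id]}},
)
```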

So this kind of gives developers more control and more flexibility there. So we're trying to make it the easiest way to kind of do RAG at scale. - Yeah, I think that visibility into the RAG system was the number one thing missing from Dev Day, and then people got their first impressions, and then they never looked at it again.

So that's important. The re-ranker is a core feature of, let's say, some other Foundation Model Labs. Is OpenAI going to offer a re-ranking service, a re-ranker model? - So we do re-ranking as part of it. I think we're soon going to ship more controls for that. - Okay, got it.

And if I'm an existing LangChain, LlamaIndex, whatever user, how do you compare? Do you make different choices? Where does that exist in the spectrum of choices? - I think we are just coming at it trying to be the easiest option. And so ideally, you don't have to know what a re-ranker is, and you don't have to have a chunking strategy, and the thing just kind of works out of the box.

So I would say that's where we're going, and then giving controls to the power users to make the changes they need. - Awesome. I'm going to ask about a couple other things, just updates on stuff also announced at Dev Day, and we talked about this before. Determinism, something that people really want.

Dev Day announced the seed parameter as well as system fingerprint. And objectively, I've heard issues. - Yeah. - I don't know what's going on. - The seed parameter is not fully deterministic, and it's kind of a best effort thing. - Yeah. - So you'll notice there's more determinism in the first few tokens.
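In code, the best-effort determinism looks roughly like this (standard chat completions parameters; the prompt is a placeholder): pass the same seed and sampling settings on each call, and compare the system_fingerprint field to check whether the backend configuration changed between runs.

```python
from openai import OpenAI

client = OpenAI()

def sample(seed: int):
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": "Name three uses for a paperclip."}],
        seed=seed,       # best-effort: outputs tend to agree, especially early tokens
        temperature=0,
    )
    return resp.system_fingerprint, resp.choices[0].message.content

fp1, out1 = sample(42)
fp2, out2 = sample(42)
# Same fingerprint means the same backend config; outputs then usually match.
print(fp1 == fp2, out1 == out2)
```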

That's kind of the current implementation. We've heard a lot of feedback. We're thinking about ways to make it better, but it's challenging. It's kind of trading off against reliability and uptime. - Other maybe underrated API-only thing, Logit Bias, that's another thing that kind of seems very useful, and then maybe most people are like, it's a lot of work, I don't want to use it.

Do you have any examples of use cases or products that are made a lot better through using it? - So yeah, classification is the big one. So you logit bias your valid classification outputs, and you're more likely to get something that matches. We've seen people logit bias punctuation tokens, maybe trying to get more succinct writing.
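A sketch of that classification pattern, assuming the 4o-family tokenizer in tiktoken (o200k_base) and made-up labels; multi-token labels need all of their tokens biased, which this naive version does.

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # encoding used by the 4o family

labels = ["positive", "negative", "neutral"]
# +100 is the maximum bias the API accepts and effectively forces these tokens.
bias = {token_id: 100 for label in labels for token_id in enc.encode(label)}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify the sentiment: 'The new API is great.'"}],
    logit_bias=bias,
    max_tokens=2,
)
print(resp.choices[0].message.content)
```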

Yeah, it's generally very much a power user feature, and so not a ton of folks use it. - I actually wanted to use it to reduce the incidence of the word delve. - Yeah. - Have people done that? - Probably, I don't know, is delve one token? You'd probably have to do a lot of permutations.

- It's used so much. - Maybe it is, depends on the tokenizer. - Are there non-public tokenizers? I guess you cannot answer, or you'd have to omit it. Are the 100K and 200K vocabs the ones that you use across all models, or? - Yeah, I think we have docs that publish more information.

I don't have it off the top of my head, but I think we publish which tokenizers go with which model. - Okay, so those are the only two. - Right, the tiering and rate limiting system. I don't think there was an official blog post kind of announcing this, but it was kind of mentioned that you started tying fine tuning to tiering and feature rollouts.

Just from your point of view, how do you manage that? And what should people know about the tiering system and rate limiting? - Yeah, I think basically the main changes here were to be more transparent and easier to use. So before developers didn't know what tier they're in, and now you can see that in the dashboard.

I think it's also, I think we publish how you move from tier to tier. And so this just helps us do kind of gated rollouts for the fine tuning launch. I think everyone tier two and up has full access. - That makes sense. I would just advise people to just get to tier five as quickly as possible.

(both laughing) - Sure. - Like a gold star customer, you know? Like, I don't know, it seems to make sense. - Do we want to maybe wrap with future things and kind of like how you think about designing and everything? So you just mentioned you want to be the easiest way to basically do everything.

What's the relationship with other people building in the developer ecosystem? Like I think maybe in the early days, it's like, okay, we only have these APIs and then everybody helps us, but now you're kind of building a whole platform. How do you make decisions? - Yeah, I think kind of the 80/20 principle applies here.

We'll build things that kind of capture, you know, 80% of the value and maybe leave the long tail to other developers. So we really prioritize by like, how much feedback are we getting? How much easier will this make something, like an integration for a developer? So yeah, we want to do more in this space and not just be an LLM as a service, but kind of AI development platform as a service.

- Ooh, okay. That ties into a thing that I put in the notes that we prepped. There are other companies trying to be AI development platforms. So will you compete with them, or do they just want to know what you won't build so that they can build it? (laughs) - Yeah, it's a tough question.

I think we haven't, you know, determined what exactly we will and won't build, but you can think of something, if it makes it a lot easier for developers to integrate, you know, it's probably on our radar and we'll, you know, stack rank by impact. - Yeah, so there's like cost tracking and model fallbacks.

Model fallbacks is an interesting one because people do it. I don't think it adds a ton of value, but like if you don't build it, I have to build it because if one API is down or something, I need to fall back to another one. - Yeah, I mean, the way we're targeting that user need is just by investing a lot in reliability.

And so we- - Oh yeah. - We have- - Just don't fail. - I mean, we have improved our uptime like pretty dramatically over the last year and it's been, you know, the result of a lot of hard work from folks. So you'll see that on our status page and our continued commitment going forward.

- Is the important thing about owning the platform that it gives you the flexibility to put all the kind of messy stuff behind the scenes? Or yeah, how do you draw the line between what you want to include? - Yeah, I just think of it as like, how can we onboard the next generation of AI engineers, as you put it, right?

Like what's the easiest way to get them building really cool apps? And I think it's by building stuff to kind of hide this complexity or just make it really easy to integrate. So I think of it a lot as like, what is the value add we can provide beyond just the models that makes the models really useful?

- Okay, we'll touch on four more features of the API platform that we prepped. Batch, Vision, Whisper, and then Team Enterprise stuff. So you wanted to talk about Batch. - Yeah. - So the rough idea is, the contract between you and me is that I give you the batch job.

You have 24 hours to run it. It's kind of like spot instances for the API. What should people know about it? - So it's half off, which is a great savings. It also works with like 4o mini. So the savings on top of 4o mini is pretty crazy. Like the stuff you can do- - Like 7.5 cents or something per million.

- Yeah, I should really have that number top of mind, but it's like staggeringly cheap. And so I think this opens up a lot more use cases. Like let's say you have a user activation flow and you want to send them an email like maybe every day or like at certain points in their user journey.

So now you can do this with the Batch API and something that was maybe a lot more expensive and not feasible is now very easy to do. So right now we have this 24 hour turnaround time for half off and curious, would love to hear from your community, like what kind of turnaround time do they want?
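As a sketch of that kind of offline job with the Batch API (the file name, custom_ids, and prompt are made up; the 24-hour completion window and roughly half-price discount are as described above):

```python
import json
from openai import OpenAI

client = OpenAI()

# Each JSONL line is an independent request identified by a custom_id.
rows = [
    {
        "custom_id": f"user-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Draft a short re-engagement email for user {i}."}],
        },
    }
    for i in range(100)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(row) for row in rows))

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later; results come back as an output file
```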

- I would be an ideal user of Batch and I cannot use Batch because it's 24 hours. I need two to four. - Two to four hours, okay. Yeah, that's good to know. But yeah, just a lot of folks haven't heard about it. It's also really great for like evals, running them offline.

You don't, generally don't need them to come back within, you know, two hours. - I think you could do a range, right? Two to four for me, like I need to produce a daily thing and then 24 for like the average use case. And then maybe like a week, a month, who cares?

Like for people who just have a lot to do. - Yeah, absolutely. So yeah, that's Batch API. I think folks should use it more. It's pretty cool. - Is there a future in which like six months is like free? You know, like is there like small, is there like super small like shards of like GPU runtime that like over a long enough timeline you can just run all these things for free?

- Yeah, it's certainly possible. I think we're getting to the point where a lot of these are like almost free. - That's true. - Why would they work on something that's like completely free? I don't know. Okay, so Vision. Vision got GA'd. Last year, people were so wowed by the GPT-4 demo, and that was primarily Vision.

What was it like building the Vision API? - Yeah, the Vision API is super cool. We have a great team working there. I think the cool thing about Vision is that it works across our APIs. So you can use it in the Assistants API, you can use it in the Batch API and in Chat Completions.

It works with structured outputs. I think it just helps a lot of folks with kind of data extraction where the spatial relationships between the data is too complicated and you can't get that over text. But yeah, there's a lot of really cool use cases. - I think the tricky thing for me is understanding how to turn Vision from like single images into, like, effectively just always watching. And right now, I think people just like send a frame every second. Will that model ever change? Will there just be like, I stream you a video and then?

And right now, I think people just like send a frame every second. Will that model ever change? Will there just be like, I stream you a video and then? - Yeah, I think it's very possible that we'll have an API where you stream video in and maybe, you know, to start, we'll do the frame sampling for you.

- 'Cause the frame sampling is the default, right? - Right. - But I feel like it's hacky. - Yeah, I think it's hard for developers to do. And so, you know, we should definitely work on making that easier. - In the Batch API, do you have time guarantees, like order guarantees?

Like if I send you a Batch request of like a video analysis, I need every frame to be done in order? - For Batch, you send like a list of requests and each of them stand alone. So you'll get all of them finished, but they don't kind of chain off each other.

- Well, if you're doing a video, you know, if you're doing like analyzing a video. - I wasn't linking video to Batch, but that's interesting. - Yeah, well, a video it's like, you know, if you have a very long video, you can just do a Batch of all the images and let it process.

- Oh, that's a good idea. It's like Batch, but serially. - Sequential true. - Yeah, yeah, yeah, exactly. - You know. - But the whole point of Batch is you're just using kind of spare time to run it. Let's talk about my favorite model, Whisper. Oliver, I built this thing called SmallPodcaster, which is an open source tool for podcasters.

And why does Whisper API not have diarization when everybody is transcribing people talking? That's my main question. - Yeah, it's a good question. And you've come to the right person. I actually worked on the Whisper API and shipped that. That was one of my first APIs I shipped. Long story short is that like Whisper V3, which we open sourced, has I think the diarization feature, but there's some like performance trade-offs.

And so Whisper V2 is better at some things than Whisper V3. And so it didn't seem that worthwhile to ship Whisper V3 compared to like the other things in our priorities. I think we still will at some point. But yeah, it's just, you know, there's always so many things we could work on.

It's tough to do everything. - We have a Python notebook that does the diarization for the pod, but I would just like, you can translate like 50 languages, but you cannot tell me who's speaking. That was like the funniest thing. - There's like an XKCD thing about this, about hard problems in AI.

I forget the one. - Yeah, yeah, yeah, exactly. - Tell me if this was taken in a park. And like, that's easy. And it's like, tell me if there's a bird in this picture. And it's like, give me 10 people on a research team. It's like, you never know which things are challenging and diarization is, I think, you know, more challenging than expected.

- Yeah, yeah. It still breaks a lot with like overlaps, obviously. Sometimes similar voices it struggles with. Like I need to like double read the thing. - Totally. - But yeah, great model. I mean, it would take us so long to do transcriptions. And I don't know why, like SmallPodcaster has better transcription than like mostly every commercial tool.

- It beats Descript. - And I'm like, I'm just using the model. I'm literally not doing anything. You know, it's just a notebook. So yeah, it just speaks to like, sometimes just using the simple OpenAI model is better than like figuring out your own pipeline thing. - I think the top feature request there just would be, I mean, again, you know, using you as a feature request dump, is like being able to bias the vocab.

I think there is like in raw Whisper, you can do that. - You can pass a prompt in the API as well. - But you pass in the prompts, okay. - Yeah. - There's no more deterministic way to do it. - So this is really helpful when you have like acronyms that aren't very familiar to the model.

And so you can put them in the prompt and you'll basically get the transcription using those correctly. - We have the AI engineer solution, which is just a dictionary. - Nice. - We list all the ways it misspelled it in the past and then gsub and like replace the thing.

- If it works, it works. Like that's engineering. - It's like, you know, llama with like one L, or like all these different things, or like LangChain. It like transcribes LangChain in, like- - Capitalization. - A bunch of like three or four different ways. - Yeah. - You guys should try the prompt feature.
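A minimal sketch of the prompt feature being recommended here, via the transcription endpoint in the Python SDK (the audio file and the spellings to bias toward are just examples):

```python
from openai import OpenAI

client = OpenAI()

# The prompt nudges the model toward the right spellings of names and acronyms.
with open("episode.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        prompt="Guests discuss LangChain, LlamaIndex, ChatML, SWE-bench, and Latent Space.",
    )
print(transcript.text)
```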

- I love these like kind of pro tip. Okay, fun question. I know we don't know yet, but I've been enjoying the advanced voice mode. It really streams back and forth and it handles interruptions. How would your audio endpoint change when that comes out? - We're exploring, you know, new shape of the API to see how it would work in this kind of speech to speech paradigm.

I don't think we're ready to share quite yet, but we're definitely working on it. I think just the regular request response probably isn't going to be the right solution. - For those who are listening along, I think it's pretty public that OpenAI uses LiveKit for the chat GPT app, which like seems to be the socket based approach that people should be at least up to speed on.

Like I think a lot of developers only do request response and like that doesn't work for streaming. - Yeah. When we do put out this API, I think we'll make it really easy for developers to figure out how to use it. Yeah. It's hard to do audio. - It'll be a paradigm change.

Okay. And then I think the last one on our list was team enterprise stuff. Audit logs, service accounts, API keys. What should people know about using the enterprise offering? - Yeah, we recently shipped our admin and audit log APIs. And so a lot of enterprise users have been asking for this for a while.

The ability to kind of manage API keys programmatically, manage your projects, get the audit log. So we've shipped this and for folks that need it, it's out there and happy for your feedback. - Yeah, awesome. I don't use them. So I don't know. I imagine it's just like build your own internal gateway for your internal developers to manage your deployment of OpenAI.

- Yeah. I mean, if you work at like a company that needs to keep track of all the API keys, it was pretty hard in the past to do this in the dashboard. We've also improved our SSO offering. So that's much easier to use now. - The most important feature of an enterprise company.

- Yeah, people love SSO. - All right, let's go outside of OpenAI. What about just you personally? So you mentioned Waterloo. Maybe let's just do, why is everybody at Waterloo cracked? And why are people so good? And like, why have people not replicated it? Or any other commentary on your experience?

- The first is the co-op program. It's obviously really good. You know, I did six internships, learned so much in those. I think another reason is that Waterloo is like, you know, it's very cold in the winter. It's pretty miserable. There's like not that much to do apart from study and like hack on projects.

And there's this big like hacker mentality. You know, there's Hack the North, which is a very popular hackathon. And there's a lot of like startup incubators. It just kind of has this like startup and hacker ethos. Then that combined with the six internships means that you get people who like graduate with two years of experience and they're very entrepreneurial.

And you know, they're down to grind. - I do notice a correlation between climate and the crackiness of engineers. So, you know, it's no coincidence that Seattle is the birthplace of Microsoft and Amazon. I think I had this compilation of Denmark where people like, so it's the birthplace of C++, PHP, Turbo Pascal, Standard ML, BNF, the thing that we just talked about, MD5 Crypt, Ruby on Rails, Google Maps, and V8 for Chrome.

And it's 'cause according to Bjarne Stroustrup, the creator of C++, there's nothing else to do. - Well, you have Linus Torvalds in Finland. - Yeah, I mean, you hear a lot about this, like in relation to SF. People say, you know, New York is way more fun. There's nothing to do in SF.

And maybe it's a little by design that all tech is here. - The climate is too good. - Yeah. - We also have fun things to do. - Nature is so nice, you can touch grass. Why are we not touching grass? - You know, restaurants close at like 8 p.m.

Like that's what people are referring to. There's not a lot of like late night dining culture. Yeah, so you have time to wake up early and get to work. - You are a book recommender or book enjoyer. What underrated books do you recommend most to others? - Yeah, I think a book I read somewhat recently that was very formative was "The Making of Prince of Persia." It's a Stripe Press book.

That book just made me want to work hard. Like nothing I've ever read. It's just like this journal of what it takes to like build, you know, incredible things. So I'd recommend that. - Yeah, it's funny how video games are, for a lot of people, at least for me, kind of like some of the formative moments in technology.

Like when I played "The Sands of Time" on PS2, it was like my first PlayStation 2 game. And I was like, man, this thing is so crazy compared to any PlayStation 1 game. And it's like, wow, my expectations for the technology were raised. I think OpenAI does a lot of similar things, like the advanced voice.

It's like, you see that thing and then you're like, okay, what I can expect from everybody else is kind of raised now, you know? - Totally. Another book I like to plug is called "Misbehaving" by Richard Thaler. He's a behavioral economist and talks a lot about how people act irrationally in terms of decision-making.

And I actually think about that book like once a week, probably, at least when I'm making a decision and I realize that, you know, I'm falling into a fallacy or, you know, it could be a better decision. - Yeah, you did a minor in psych. - I did, yeah.

I don't know if I learned that much there, but it was interesting. - Is there like an example of like a cognitive bias or misbehavior that you just love telling people about? - Yeah, people, so let's say you won tickets to like a Taylor Swift concert and I don't know how much they're going for, but it's probably like $10,000.

- Oh, okay. - Or whatever, sure. And like a lot of people are like, oh, I have to keep these. Like I won them, it's $10,000. But really it's the same decision you're making if you have $10,000, like would you buy these tickets? And so people don't really think about it rationally.

I'm like, would they rather have the $10,000 or the tickets? For people who won them, a lot of the time it's going to be the $10,000, but they're biased because they won them; the world organized itself this way, so you should keep them for some reason. - Yeah, oh, okay.

I'm pretty familiar with this stuff. There's also a loss version of this where it's like, if I take it away from you, you respond more strongly than if I give it to you. - Yes. If people are like really upset, if they like don't get a promotion, but if they do get a promotion, they're like, okay, phew.

It's like not even, you know, excitement. It's more like, we react a lot worse to losing something. - Which is why, like when you join like a new platform, they often give you points and then they'll take it away if you like don't do some action in like the first few days.

- Yeah, totally. Yeah, the book refers to people who operate very rationally as econs, as like a separate group from humans. And I often think like, you know, what would an econ do here in this moment, and try to act that way. - Okay, let's do this. Are LLMs econs?

(all laughing) - I mean, they are maximizing probability distributions. - Minimizing loss. - Yeah, so I think way more than all of us, they are econs. - Whoa, okay. So they're more rational than us? - I think their optimization functions are more clear than ours. - Yeah, just to wrap, you mentioned you need help on a lot of things.

- Yeah. - Any specific roles, call outs, and also people's backgrounds. Like, is there anything that they need to have done before? Like what people fit well at OpenAI? - Yeah, we've hired people with all kinds of backgrounds, people who have a PhD in ML, or folks who've just done engineering like me.

And we're really hiring for a lot of teams. We're hiring across the Applied Org, which is where I sit for engineering, and for a lot of researchers. And there's a really cool model behavior role that we just dropped. So yeah, across the board, we'd recommend checking out our careers page, and you don't need a ton of experience in AI specifically to join.

- I think one thing that I'm trying to get at is like, what kind of person does well at OpenAI? I think objectively you have done well. And I've seen other people not do as well, and basically be managed out. I know it's an intense environment. - I mean, the people I enjoy working with the most are kind of low ego, do what it takes, ready to roll up their sleeves, do what needs to be done, and unpretentious about it.

Yeah, I also think folks that are very user-focused do well on kind of API and ChatGPT. Like, the YC ethos of build something people want is very true at OpenAI as well. So I would say low ego, user-focused, driven. - Cool, yeah, this was great. Thank you so much for coming on.

- Yeah, thanks for having me. (upbeat music)