Building AGI with OpenAI's Structured Outputs API
Chapters
0:00 Introductions
6:37 Joining OpenAI pre-ChatGPT
8:21 ChatGPT release and scaling challenges
9:58 Structured Outputs and JSON mode
11:52 Structured Outputs vs JSON mode vs Prefills
17:08 OpenAI API / research teams structure
18:12 Refusal field and why the HTTP spec is limiting
21:23 ChatML & Function Calling
27:42 Building agents with structured outputs
30:52 Use cases for structured outputs
38:36 Roadmap for structured outputs
42:06 Fine-tuning and model selection strategies
48:13 OpenAI's mission and the role of the API
49:32 War stories from the trenches
51:29 Assistants API updates
55:48 Relationship with the developer ecosystem
58:08 Batch API and its use cases
60:12 Vision API
62:07 Whisper API
64:30 Advanced voice mode and how that changes DX
65:27 Enterprise features and offerings
66:09 Personal insights on Waterloo and reading recommendations
70:53 Hiring and qualities that succeed at OpenAI
00:00:10.320 |
and I'm joined by my co-host swyx, founder of Smol.ai. 00:00:15.320 |
in the in-person studio with Michelle, welcome. 00:00:18.000 |
- Thanks, thanks for having me, very excited to be here. 00:00:25.140 |
and I'm finally glad that we could make this happen 00:00:41.360 |
So you've interned and/or worked at Google, Stripe, 00:00:49.240 |
You know, the one that has the most appeal to me 00:01:03.480 |
and your history going into all these notable companies? 00:01:13.840 |
So I started, actually, my first job was really rough. 00:01:16.600 |
I worked at a bank, and I learned Visual Basic, 00:01:26.200 |
- Interest rate swaps, that kind of stuff, yeah. 00:01:36.160 |
But I had a bunch of friends that were into startups more, 00:01:38.400 |
and, you know, Waterloo has like a big startup culture, 00:01:46.360 |
I also was a little bit into crypto at the time, 00:01:52.280 |
And so that was like my first real startup opportunity 00:02:05.440 |
You know, crypto was a very formative experience. 00:02:17.960 |
where I really like learned to become an engineer, 00:02:19.960 |
learned how to use Git, got on call right away, 00:02:22.480 |
you know, managed production databases and stuff. 00:02:26.640 |
and kind of got a different flavor of payments 00:02:29.320 |
Learned a lot, was really inspired by the Collisons. 00:02:40.800 |
The company's called Readwise, which still exists, but-- 00:02:50.840 |
- Yeah, I mean, I only worked on it for about a year, 00:02:52.640 |
and so Tristan and Dan are the real founders, 00:03:10.680 |
didn't feel equipped to be a CTO of anything at that point, 00:03:21.920 |
So I wouldn't say that I went there before it blew up. 00:03:26.600 |
so not quite the sterling track record that it might seem. 00:03:26.600 |
I joined as the second or third backend engineer, 00:03:40.120 |
and so we would have a stand-up every morning, 00:03:41.640 |
and we'd be like, "How do we make everything stay up?" 00:03:45.840 |
Also, one of the first things I worked on there 00:03:47.600 |
was making our notifications go out more quickly, 00:03:54.600 |
and the person speaking thinks a lot of my audience is here. 00:03:57.680 |
But when I first joined, I think it would take 10 minutes 00:03:59.960 |
for all the notifications to go out, which is insane. 00:04:07.120 |
So that's one of the first things I worked on, 00:04:08.440 |
is making that a lot faster and keeping everything up. 00:04:11.680 |
- I mean, so already we have an audience of engineers. 00:04:15.160 |
It's keeping things up and notifications out. 00:04:20.800 |
and you had all of the followers in Postgres, 00:04:25.760 |
and figure out, is this a good notification to send? 00:04:31.640 |
and our job queuing infrastructure wasn't right. 00:04:34.000 |
And so there was a lot of fixing all of these things. 00:04:36.400 |
Eventually, there were a lot of database migrations, 00:04:38.200 |
because Postgres just wasn't scaling well for us. 00:04:43.160 |
that was more of a, I don't know, reliability issue, 00:04:47.400 |
- A lot of it, yeah, it goes down to database stuff. 00:04:55.120 |
- Actually, at Coinbase, at Clubhouse, and at OpenAI, 00:05:05.600 |
a long-running Postgres query at 3 a.m. for some reason. 00:05:09.080 |
So those skills have really carried me forward, for sure. 00:05:11.480 |
- Why do you think that not as much of this is productized? 00:05:14.920 |
Obviously, Postgres is an open-source project. 00:05:18.560 |
but you would think somebody would come around 00:05:22.040 |
- Yeah, I think that's what Planetscale is doing. 00:05:25.480 |
It's on MySQL, but I think that's the vision. 00:05:27.920 |
It's like, they have zero downtime migrations, 00:05:33.120 |
I don't know why no one is doing this on Postgres, 00:05:45.800 |
Your scale, it's something that not many people see. 00:05:54.000 |
and then you migrate to some sort of NoSQL database. 00:05:56.720 |
And that process I've seen happen a bunch of times now. 00:06:14.800 |
if I need to scale something as far as it goes. 00:06:20.120 |
and it's kind of like the memory register for the web. 00:06:23.280 |
Like, you know, if you treat it just as physical memory, 00:06:30.840 |
- Right, you have to totally change your mindset 00:06:35.440 |
and kind of makes you design things in a more scalable way. 00:06:46.360 |
I also had the opportunity to join and I didn't. 00:06:50.600 |
- Yeah, I think a lot of people who joined OpenAI 00:06:52.800 |
joined because of a product that really gets them excited. 00:06:56.720 |
But for me, I was a daily user of Copilot, GitHub Copilot. 00:07:00.840 |
And I was like so blown away at the quality of this thing. 00:07:03.240 |
I actually remember the first time seeing it on Hacker News 00:07:05.200 |
and being like, wow, this is absolutely crazy. 00:07:10.560 |
It just really, even now when like I don't have service 00:07:22.760 |
and thought some of those skills would transfer. 00:07:43.760 |
- This is one of my biggest regrets of my life. 00:07:47.440 |
- But I was like, okay, I mean, I can create images. 00:07:50.680 |
I don't know if like this is the thing to dedicate, 00:07:52.680 |
but obviously you had a bigger vision than I did. 00:08:08.160 |
My mom for a while thought I worked at Bitcoin. 00:08:12.160 |
to be able to tell your family what you actually do 00:08:17.280 |
So you were there, were you immediately on API platform? 00:08:21.640 |
- Yeah, I mean, API platform is like a very grandiose term 00:08:25.720 |
There was like just a handful of us working on the API. 00:08:29.160 |
Not even everyone had access to the GPT-3 model. 00:08:51.320 |
Applied now is bigger than the company when I joined. 00:08:58.080 |
- Any ChatGPT release, kind of like all-hands-on-deck stories. 00:09:02.480 |
I had lunch with Evan Morikawa a few months ago. 00:09:16.800 |
versus like Postgres bouncers and things like that? 00:09:20.740 |
there were a lot of Postgres issues when ChatGPT came out 00:09:29.760 |
And so you're basically creating a developer account 00:09:35.360 |
And so I remember there was just so much work scaling 00:09:45.960 |
It's like everywhere else I've worked, compute is like free 00:09:50.880 |
But here we're having like tough decisions every day. 00:09:58.480 |
- So you just released Structured Outputs, congrats. 00:10:02.640 |
And I loved all the examples that you put out. 00:10:06.000 |
Yeah, tell us about the whole story from beginning to end. 00:10:09.080 |
- Yeah, I guess the story we should rewind quite a bit 00:10:15.600 |
which is our first foray into this area of product. 00:10:19.680 |
JSON mode is this functionality you can enable 00:10:25.720 |
we'll kind of constrain the output of the model 00:10:30.240 |
And so you basically will always get something 00:10:40.920 |
But it's not getting you exactly where you want, 00:10:45.800 |
or match different values than what you want. 00:10:53.200 |
and people have been asking for basically this 00:10:55.600 |
every time I talk to customers for maybe the last year. 00:10:58.120 |
And so it was really clear that there's a developer need, 00:11:00.200 |
and we started working on kind of making it happen. 00:11:04.640 |
between engineering and research, I would say. 00:11:06.520 |
And so it's not enough to just kind of constrain the model. 00:11:11.600 |
whereas basically you mask the available tokens 00:11:14.960 |
that are produced every time to only fit the schema. 00:11:19.080 |
and you can force the model to do what you want, 00:11:33.600 |
when you do kind of a very engineering-biased approach. 00:11:36.120 |
But the modeling approach is to also train the model 00:11:41.000 |
We trained a model which is significantly better 00:11:46.400 |
like this constrained decoding concept at scale. 00:11:52.000 |
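The constrained decoding idea described here can be sketched with a toy vocabulary of string "tokens": at each sampling step, any token that could not extend into valid JSON gets masked out. This is only an illustration; real implementations compile the schema into a grammar over the model's actual token IDs, and `is_valid_json_prefix` and `mask_tokens` are hypothetical helpers:

```python
import json

def is_valid_json_prefix(s):
    """Rough check: could `s` still be extended into valid JSON?
    Closes any open string/brackets and tries to parse the result."""
    closers = []
    in_string = False
    escape = False
    for ch in s:
        if in_string:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
        else:
            if ch == '"':
                in_string = True
            elif ch == "{":
                closers.append("}")
            elif ch == "[":
                closers.append("]")
            elif ch in "}]":
                if not closers or closers[-1] != ch:
                    return False
                closers.pop()
    candidate = s + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        json.loads(candidate)
        return True
    except json.JSONDecodeError:
        return False

def mask_tokens(prefix, vocab):
    """Keep only candidate tokens that leave the output on a path to
    valid JSON -- the per-token masking step described above."""
    return [t for t in vocab if is_valid_json_prefix(prefix + t)]
```

For example, after the prefix `{"name": `, a closing brace would be masked out while a string or number token stays available.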
- You just mentioned starts and ends with a curly brace, 00:11:54.640 |
and maybe people's minds go to prefills in the Claude API. 00:12:08.720 |
"Hey, here's a rough data scheme that you should use." 00:12:13.080 |
- So I think we kind of designed structured outputs 00:12:16.480 |
So you just, like the way you use it in our SDK, 00:12:20.720 |
So you just create like a Pydantic object or a Zod object, 00:12:23.440 |
and you pass it in and you get back an object. 00:12:25.520 |
And so you don't have to deal with any of the serialization. 00:12:29.000 |
- Yeah, you don't have to deal with any of the serialization 00:12:41.120 |
So that's where structured outputs is tailored. 00:12:43.600 |
Whereas if you want the model to be more creative 00:12:52.840 |
are probably going to want to upgrade to structured outputs. 00:12:56.320 |
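Under the SDK sugar with Pydantic or Zod, the request carries a `response_format` field. A rough sketch of the two shapes, structured outputs versus the older JSON mode, based on the launch-era wire format (the `structured_output_format` helper is just for illustration):

```python
def structured_output_format(name, schema):
    """Build the response_format payload for Structured Outputs:
    a named JSON schema with strict mode enabled."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }

# Older JSON mode: guarantees valid JSON, but no schema constraint.
json_mode_format = {"type": "json_object"}
```

The SDK builds the first shape for you from a Pydantic/Zod type and also deserializes the response back into that type.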
you just use interchangeable terms for the same thing, 00:12:59.480 |
which is function calling and structured outputs. 00:13:02.680 |
We've had disagreements or discussion before on the podcast 00:13:10.360 |
- Because I think function calling API came out first 00:13:14.240 |
And we used to abuse function calling for JSON mode. 00:13:18.480 |
Do you think we should treat them as synonymous? 00:13:26.000 |
- Yeah, the history here is we started with function calling 00:13:34.400 |
And we basically had these internal prototypes 00:13:40.880 |
But we're not ready to host code interpreter for everybody. 00:13:43.440 |
So we're just going to expose the raw capability 00:13:47.040 |
But even now, I think there's a really big difference 00:13:49.200 |
between function calling and structured outputs. 00:13:57.240 |
that you want the model to be able to query from, 00:14:04.920 |
And that's the way the model has been fine-tuned on, 00:14:09.400 |
for actually calling these tools and getting their outputs. 00:14:13.800 |
is a way of just getting the model to respond to the user, 00:14:19.400 |
Responding to a user versus I'm going to go send an email. 00:14:24.240 |
A lot of people were hacking function calling 00:14:28.440 |
And so this is why we shipped this new response format. 00:14:35.280 |
It's responding in the way it would speak to a user. 00:14:43.920 |
to actually close the loop with the function calling? 00:15:09.720 |
But basically what you do is you write a function 00:15:13.720 |
And then you can, basically there's this run tools method 00:15:16.480 |
and it does the whole loop for you, which is pretty cool. 00:15:22.760 |
because it basically runs it in the same machine. 00:15:29.600 |
if you're prototyping and building something really quickly 00:15:35.880 |
But you have the flexibility to do it however you like. 00:15:44.000 |
It's just kind of the easiest way to get started. 00:15:46.080 |
But let's say you want to like execute this function 00:15:51.840 |
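The run-tools loop the SDK performs can be hand-rolled roughly like this; `call_model` stands in for the chat completions call, and everything here is a simplified sketch rather than the SDK's actual implementation:

```python
def run_tools_loop(call_model, tools, messages):
    """Keep calling the model; execute any requested tools and feed
    the results back, until the model answers in plain text."""
    while True:
        msg = call_model(messages)
        messages.append(msg)
        if not msg.get("tool_calls"):
            return msg["content"]
        for call in msg["tool_calls"]:
            # Run the requested function locally and append its result
            # as a tool message for the next model call.
            result = tools[call["name"]](**call["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(result),
            })
```

The point made here is that the helper runs tools on the same machine; if you want to execute the function elsewhere, you run this loop yourself.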
- Prior art: Instructor, Outlines, JSONformer, 00:15:56.760 |
What did you credit or learn from these things? 00:15:59.640 |
- Yeah, there's a lot of different approaches to this. 00:16:02.080 |
There's more fill in the blank style sampling 00:16:04.880 |
where you basically pre-form kind of the keys 00:16:08.880 |
and then get the model to sample just the value. 00:16:15.280 |
but we really loved what we saw from the community 00:16:19.680 |
So that's where we took a lot of inspiration. 00:16:21.800 |
- There was a question also just about constrained grammar. 00:16:24.880 |
This is something that I first saw in llama.cpp, 00:16:35.080 |
maybe I don't know if you want to explain it, 00:16:39.200 |
when you're working on programming languages and compilers. 00:16:41.440 |
I don't know if you like use that under the hood 00:16:44.800 |
- Yeah, we didn't use any kind of other stuff. 00:16:52.160 |
But I think there's a lot of cool stuff out there 00:17:04.200 |
And maybe it's more token efficient than JSON. 00:17:08.680 |
- You mentioned before also training the model 00:17:12.720 |
What's that discussion like internally for like resources? 00:17:15.400 |
It's like, hey, we need to get better JSON mode. 00:17:25.440 |
- Yeah, so I actually work on the API models team. 00:17:27.520 |
I guess we didn't quite get into what I do at API. 00:17:33.960 |
- Yeah, so yeah, I'm the tech lead for the API, 00:17:47.560 |
But I think there's a lot you miss when you do that. 00:17:52.200 |
and things that are not kind of immediately obvious. 00:17:54.840 |
What we do is we get a lot of feedback from developers 00:17:57.440 |
and we go and make the models better in certain ways. 00:18:01.720 |
We work very closely with our post-training team. 00:18:07.320 |
including safety systems to make a really great model 00:18:12.680 |
- Mentioning safety systems, you have a refusal field. 00:18:19.200 |
So you can imagine basically if you constrain the model 00:18:23.880 |
you can imagine there being like a schema supplied 00:18:26.800 |
that it would add some risk or be harmful for the model 00:18:31.800 |
And we wanted to preserve our model's abilities to refuse 00:18:39.320 |
And so we needed to give the model an ability to refuse 00:18:47.400 |
and you get back something that doesn't match it, 00:18:56.880 |
But if you get something back in the refusal field, 00:19:06.520 |
but it was mainly to allow the model to continue to refuse, 00:19:09.160 |
but also with a really good developer experience. 00:19:11.240 |
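Client-side, the refusal field means checking one branch before parsing, along these lines (`parse_structured` is a hypothetical helper; the real SDK surfaces `refusal` alongside the parsed content on the message object):

```python
import json

def parse_structured(message):
    """If the model refused, surface the refusal text instead of
    trying to parse schema-shaped content."""
    if message.get("refusal"):
        return None, message["refusal"]
    return json.loads(message["content"]), None
```

This keeps the happy path type-safe: you either get an object matching your schema or an explicit refusal string, never a half-parsed mix.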
- Yeah, why not offer it as like an error code? 00:19:14.320 |
Because we have to display error codes anyway. 00:19:16.880 |
- Yeah, we've waffled for a long time about API design, 00:19:21.920 |
And there are a few reasons against an error code. 00:19:24.560 |
Like you could imagine this being a 4xx error code 00:19:29.620 |
And that's kind of atypical for like a 4xx error code. 00:19:40.240 |
- Right, and it doesn't make sense as a 5xx either, 00:19:46.880 |
I think the HTTP spec is a little bit limiting 00:19:56.040 |
and there's no, you know, error code for that. 00:20:04.680 |
There's actually some like esoteric error codes 00:20:16.320 |
like for example, sometimes our model will produce tokens 00:20:19.520 |
that are invalid based on kind of our language. 00:20:40.400 |
What would be your number one proposal to like rehaul? 00:20:51.400 |
And we can have many different kinds of model errors. 00:20:58.040 |
- Yeah, again, like, so we've mentioned before 00:21:00.560 |
that chat completions uses this ChatML format. 00:21:03.240 |
So when the model doesn't follow ChatML, that's an error. 00:21:10.600 |
- A lot of people actually no longer know what ChatML is. 00:21:15.360 |
briefly introduced by OpenAI and then like kind of deprecated. 00:21:18.280 |
Everyone who implements this under the hood knows it, 00:21:23.360 |
- Basically, the API started with just one endpoint, 00:21:35.000 |
and we decided to put that in the API as well. 00:21:39.520 |
And that API doesn't just take like a string input 00:21:42.840 |
It actually takes in messages and produces messages. 00:21:46.800 |
between like an assistant message and a user message, 00:21:50.720 |
And so the format under the hood for that is called ChatML. 00:21:56.280 |
is so out of distribution based on what you're doing, 00:22:02.120 |
- Yeah, I didn't know that there could be errors 00:22:05.360 |
Maybe I'm not asking challenging enough questions. 00:22:07.920 |
- It's pretty rare, and we're working on driving it down. 00:22:14.040 |
which is that we have removed a class of errors. 00:22:21.080 |
- Yeah, the model used to occasionally pick a recipient 00:22:24.040 |
that was invalid, and this would cause an error. 00:22:39.480 |
- Like recipient as in like picking the right tool. 00:22:42.760 |
- So the model before was able to hallucinate a tool, 00:22:46.520 |
but now it can't when you're using structured outputs. 00:22:49.680 |
- Do you collaborate with other model developers 00:22:55.600 |
Because a lot of people try to work with different models. 00:23:04.640 |
- A lot of research and engineering, I guess, 00:23:12.720 |
What is your assessment of like the state of evals 00:23:15.200 |
for function calling and structured output right now? 00:23:17.720 |
- Yeah, we've actually collaborated with BFCL a little bit, 00:23:21.360 |
which is, I think, the same thing as Gorilla. 00:23:25.880 |
Those evals are great, and we use them internally. 00:23:31.160 |
And so we're collaborating to make those better. 00:23:34.320 |
In general, I feel evals are kind of the hardest part of AI. 00:23:37.960 |
Like when I talk to developers, it's so hard to get started. 00:23:43.160 |
And you don't want evals that are like 80% successful 00:23:46.440 |
because, you know, things are gonna improve dramatically. 00:23:49.000 |
And it's really hard to craft the right eval. 00:23:51.160 |
You kind of want to hit everything on the difficulty curve. 00:23:53.720 |
I find that a lot of these evals are mostly saturated, 00:24:00.560 |
and kind of the errors are more, I would say, 00:24:07.360 |
can kind of get 100% with different prompting, 00:24:10.280 |
but it's more kind of you're just pulling apart 00:24:14.400 |
So yeah, I would say in general, we're missing evals. 00:24:16.200 |
You know, we work on this a lot internally, but it's hard. 00:24:19.160 |
- Did you, other than BFCL, would you call out any others 00:24:23.600 |
- SWE-bench is actually like a very interesting eval, 00:24:26.960 |
You basically give the model a GitHub issue and like a repo 00:24:39.120 |
- A little unfair, 'cause like usually as a human, 00:24:41.400 |
you have more opportunity to like ask questions 00:24:44.880 |
And you're giving the model like way too little information. 00:24:52.400 |
and how well can you like search across files 00:25:01.200 |
And it's just targeting a really cool capability. 00:25:07.120 |
where they evaluate different kinds of function calling. 00:25:10.120 |
And I think the top one that people care about, 00:25:13.280 |
I don't know personally that this is so important to me, 00:25:17.520 |
I think you confirmed that you don't support that yet. 00:25:23.320 |
- So yeah, we put out parallel function calling 00:25:26.960 |
And it's kind of the evolution of function calling. 00:25:29.080 |
So function calling V1, you just get one function back. 00:25:31.840 |
Function calling V2, you can get multiple back 00:25:35.000 |
We have this in our API, all our models support it, 00:25:38.640 |
but we don't support it with structured outputs right now. 00:25:41.640 |
And there's actually a very interesting trade-off here. 00:25:44.360 |
So when you basically call our API for structured outputs 00:25:49.520 |
we have to build this artifact for fast sampling later on. 00:25:56.080 |
is not just directly one of the function schemas. 00:25:58.360 |
It's like this combined schema based on a lot of them. 00:26:08.360 |
And we thought it would be really unintuitive 00:26:13.960 |
until we can support a no-added-latency solution 00:26:16.600 |
and not just kind of make it really confusing for developers. 00:26:21.800 |
is that there is an increased cost and latency 00:26:34.640 |
And I think it will definitely go down over time. 00:26:37.040 |
We just kind of take the approach of ship early and often. 00:26:41.440 |
And if there's nothing in there you don't want to fix, 00:26:47.600 |
So I think we will get that latency down over time. 00:26:54.000 |
you're sending some requests while you're developing it, 00:26:59.560 |
The alternative design space that we explored 00:27:08.320 |
But we thought, you know, that was a lot of overhead 00:27:12.400 |
and just kind of more complexity for the developer. 00:27:15.120 |
And we think this latency is going to come down over time. 00:27:17.520 |
So it made sense to keep it kind of in chat completions. 00:27:21.720 |
if one were to ship caching at a future point, 00:27:27.680 |
I think the caching space is a little underexplored. 00:27:33.640 |
there's ways that maybe put less onus on the developer. 00:27:36.560 |
But, you know, we haven't committed to anything yet, 00:27:51.800 |
Because you don't call these things like an agent API, 00:27:54.600 |
but like if I were a startup trying to raise a C round, 00:28:05.480 |
One of the reasons we wanted to build structured outputs 00:28:07.520 |
is to make agentic applications actually work. 00:28:13.600 |
but you're chaining together a bunch of calls, 00:28:24.760 |
and working on function calling and structured outputs, 00:28:31.400 |
It's the way you connect like natural language 00:28:38.920 |
there's no way to build without it, honestly. 00:28:42.560 |
Like, yeah, we wanted to make that a lot easier. 00:28:51.280 |
I think maybe most people just use messages and completion. 00:28:59.880 |
So we have the file search tool and code interpreter. 00:29:06.280 |
It'll store threads and you can fetch them later. 00:29:20.680 |
for the stateful thing to make it as useful as possible. 00:29:23.520 |
Right now, there's kind of a few endpoints you need to call 00:29:36.760 |
it gets worse at some other thing that was like kind of, 00:29:47.440 |
And so every model kind of improves on some things 00:29:50.640 |
and maybe is flat or neutral on other things. 00:29:52.880 |
- Yeah, like it's like very rare to just add a capability 00:29:58.320 |
- So yeah, I don't have something off the top of my head, 00:30:01.080 |
every model is a special kind of its own thing. 00:30:09.120 |
In general, we strive to continue improving on all evals, 00:30:14.040 |
- Yeah, able to apply the structured output system 00:30:30.160 |
So the old 4o doesn't have the new response format. 00:30:38.000 |
And that's because those models were already trained 00:30:41.880 |
We basically just didn't wanna add the new response format 00:30:46.400 |
because they would just kind of do infinite white space, 00:30:52.240 |
- I just wanted to call out a little bit more 00:31:12.000 |
is something that people are very interested in. 00:31:14.120 |
As your first example, what did you find about it? 00:31:16.920 |
- Yeah, I just thought it was a super cool capability 00:31:19.880 |
So the schemas, we support recursive schemas, 00:31:24.280 |
Like, every UI is a nested tree that has children. 00:31:29.160 |
You can use one schema and generate tons of UIs. 00:31:47.440 |
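A recursive UI schema like the one described might use a JSON Schema self-reference, so every node's children are validated against the root definition (`ui_schema` and `count_nodes` are illustrative, not taken from the demo):

```python
# One schema describes the whole nested tree: each child is
# validated against the root via the "#" self-reference.
ui_schema = {
    "type": "object",
    "properties": {
        "tag": {"type": "string"},
        "children": {
            "type": "array",
            "items": {"$ref": "#"},  # recursive reference to the root
        },
    },
    "required": ["tag", "children"],
    "additionalProperties": False,
}

def count_nodes(ui):
    """Walk a generated UI tree of arbitrary depth."""
    return 1 + sum(count_nodes(c) for c in ui["children"])
```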
is you're plugging them into your enterprise business 00:32:03.600 |
- Like maybe hallucinate the actual values, right? 00:32:06.240 |
So let's clearly state what the guarantees are. 00:32:13.160 |
because the JSON schema type system doesn't say like, 00:32:22.680 |
So this is actually a good thing to talk about. 00:32:26.560 |
and we weren't able to support every corner of it. 00:32:33.280 |
And there are a few trade-offs we had to make there. 00:32:36.080 |
if you don't pass in additional properties in a schema, 00:32:44.720 |
which is kind of the opposite of what developers want. 00:32:47.000 |
You basically want to supply the keys and values 00:32:51.920 |
It's like, do we redefine what additional properties means 00:32:56.960 |
It's like, there's a schema that's predated us. 00:33:00.240 |
It'd be better to play nice with the community. 00:33:01.920 |
And so we require that you pass it in as false. 00:33:04.560 |
One of our design principles is to be very explicit 00:33:20.560 |
By default, every key in JSON schema is optional, 00:33:25.520 |
You'd be very surprised if you passed in a bunch of keys 00:33:35.560 |
Can people turn it off or they're just getting all-- 00:33:38.160 |
- So developers can, basically what we recommend for that 00:33:48.920 |
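The two strict-mode rules discussed here, `additionalProperties` set to false and every key required, can be wrapped in a small helper; a common way to emulate an optional field under these rules is to union its type with null (the helper is a sketch, not API surface):

```python
def strict_schema(properties):
    """Build an object schema meeting the strict-mode rules:
    additionalProperties is false and every key is required."""
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),  # all keys required by default
        "additionalProperties": False,
    }
```

For example, `{"nickname": {"type": ["string", "null"]}}` keeps the key required while letting the model decline to fill it.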
- Any other of the examples you want to dive into, 00:33:52.120 |
- Yeah, you can now specify like a chain of thought field 00:33:59.680 |
One example we have, I think we put up a demo app 00:34:02.800 |
of this math tutoring example, or it's coming out soon. 00:34:12.760 |
This is something you can do now with Structured Outputs. 00:34:14.480 |
In the past, a developer would have to specify their format 00:34:17.560 |
and then write a parser and parse out the model's output 00:34:21.640 |
But now you just specify steps and it's an array of steps 00:34:24.480 |
and every step you can render and then the user can try it 00:34:27.160 |
and you can see if it matches and go on that way. 00:34:29.720 |
So I think it just opens up a lot of opportunities. 00:34:32.120 |
Like for any kind of UI where you want to treat 00:34:34.440 |
different parts of the model's responses differently, 00:34:40.280 |
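The math-tutoring shape described, an array of steps plus a final answer, might be declared and rendered like so (the schema and `render_steps` are an illustrative sketch, not the demo app's code):

```python
# Each step carries an explanation plus its intermediate output,
# so the UI can reveal them one at a time.
math_steps_schema = {
    "type": "object",
    "properties": {
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "explanation": {"type": "string"},
                    "output": {"type": "string"},
                },
                "required": ["explanation", "output"],
                "additionalProperties": False,
            },
        },
        "final_answer": {"type": "string"},
    },
    "required": ["steps", "final_answer"],
    "additionalProperties": False,
}

def render_steps(parsed):
    """No custom parser needed: iterate the typed steps directly."""
    lines = [f"Step {i + 1}: {s['explanation']} -> {s['output']}"
             for i, s in enumerate(parsed["steps"])]
    lines.append(f"Answer: {parsed['final_answer']}")
    return lines
```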
I'm basically just using this to ask you all the questions 00:34:43.400 |
as a user, as a daily user of the stuff that you put out. 00:34:48.960 |
which is you respect descriptions of JSON schemas, right? 00:34:52.240 |
And you can basically use that as a prompt for the field. 00:34:56.420 |
and people should do that. - Intentional, yeah. 00:35:00.320 |
which I don't, it could be a hallucination of me, 00:35:02.160 |
is I changed the property name to prompt the model 00:35:08.200 |
So for example, instead of saying topics as a property name, 00:35:12.600 |
I would say like, "Brainstorm a list of topics up to five," 00:35:19.280 |
I could stick that in the description as well, 00:35:23.240 |
- Yeah, I would say, I mean, we're so early in AI 00:35:26.760 |
that people are figuring out the best way to do things. 00:35:30.880 |
like a way they found to make something work. 00:35:37.200 |
You can put instructions in the system message 00:35:44.840 |
a customer support thing and you want the model 00:35:47.320 |
to verify the user's phone number or something. 00:35:49.280 |
You can tell the model in the system message, 00:35:51.000 |
like here's when you should call this function. 00:35:57.400 |
So really common is someone will have like a date 00:36:01.880 |
like, do you want year, year, month, month, day, day? 00:36:19.920 |
this parameter to be used except only in some circumstances. 00:36:22.920 |
And really, I think that's the fun nature of this. 00:36:27.760 |
- Okay, so you don't have an official recommendation 00:36:30.720 |
- Well, the official recommendation is, you know, 00:36:38.720 |
So like, say with date, it's like description, 00:36:47.920 |
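So the date recommendation might look like this in raw JSON Schema, with the format instruction living in the description field rather than in a cleverly named key (the schema below is an illustrative example, not an official snippet):

```python
# The description doubles as a prompt for the field: here it pins
# down the otherwise-ambiguous date format.
event_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Short human-readable event name."},
        "date": {
            "type": "string",
            "description": "Date of the event in YYYY-MM-DD format.",
        },
    },
    "required": ["name", "date"],
    "additionalProperties": False,
}
```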
I feel like the benchmarks don't go that deep, 00:36:50.080 |
but then all the AI engineering kind of community, 00:36:59.200 |
Like even the, I'm gonna tip you $100,000 or whatever, 00:37:03.120 |
like some people say it works, some people say it doesn't. 00:37:05.600 |
Do you pay attention to this stuff as you build this? 00:37:08.280 |
Or are you just like, the model is just gonna get better, 00:37:10.800 |
so why waste my time running evals on these small things? 00:37:20.920 |
that we could dig into, and we're just mostly focused 00:37:23.280 |
on kind of raising the capabilities for everyone. 00:37:25.880 |
I think for customers, and we work with a lot of customers, 00:37:28.720 |
really developing their own evals is super high leverage, 00:37:34.160 |
you can experiment with these things with confidence. 00:37:36.560 |
So yeah, we're hoping to make making evals easier. 00:37:39.160 |
I think that's really generally very helpful for developers. 00:37:42.560 |
- For people, I would just kind of wrap up the discussion 00:37:44.840 |
for structured outputs, I immediately implemented, 00:37:56.360 |
we cut our API costs by 55% based on what I measured, 00:38:03.880 |
- Yeah, which people I think don't understand, 00:38:05.480 |
when you can't just simply add Instructor or add outlines, 00:38:10.160 |
you can do that, but it's actually gonna cost you 00:38:12.280 |
a lot of retries to get the model that you want, 00:38:23.320 |
Yeah, actually, I had folks, even my husband's company, 00:38:31.800 |
We are not retrying, you know, we're doing it in one shot, 00:38:34.080 |
and this is how you save on latency and cost. 00:38:36.120 |
- Awesome, any other behind-the-scenes stuff, 00:38:45.960 |
and we have the full story now that people can try out. 00:38:49.280 |
So Roadmap would be parallel function calling, 00:38:51.560 |
anything else that you've called out as coming soon? 00:38:59.960 |
- What would you want to hear from developers 00:39:01.880 |
to give you information, whether it's custom grammars 00:39:06.960 |
- Just always interested in feature requests, 00:39:09.440 |
what's not working, but I'd be really curious, 00:39:13.200 |
I know some folks want to match programming languages 00:39:17.000 |
There's some challenges with the expressivity 00:39:22.760 |
just kind of the class of grammars folks want. 00:39:26.680 |
which is a lot of people try to use GPT as judge, right? 00:39:30.680 |
Which means they end up doing a rating system, 00:39:32.720 |
and then there's like 10 different kinds of rating systems, 00:39:38.400 |
to do a rating system with structured outputs, 00:40:07.200 |
- I think this is more of like a calibration question. 00:40:09.120 |
Like if I asked you to rate things from one to 10, 00:40:11.640 |
a non-calibrated model might always pick seven, 00:40:16.800 |
- So like actually have a nice gradation from one to 10 00:40:23.400 |
I can't just say have a field of rating from one to 10 00:40:34.360 |
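For the judge/rating case, one way to force a usable gradation is an integer enum rather than a bare "rate from one to 10" instruction, though as noted the calibration itself is a model property, not a schema one (illustrative sketch):

```python
# Constrain the rating to exactly the integers 1..10; asking for a
# justification alongside it is a common judge-prompting pattern.
rating_schema = {
    "type": "object",
    "properties": {
        "rating": {"type": "integer", "enum": list(range(1, 11))},
        "justification": {"type": "string"},
    },
    "required": ["rating", "justification"],
    "additionalProperties": False,
}
```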
When you first started, you had one model endpoint. 00:40:39.280 |
but like most people were using one model endpoint. 00:40:42.760 |
Today, you have like a lot of competitive models, 00:40:45.040 |
and I think we're nearing the end of the 3.5 run RIP. 00:40:51.840 |
select, both in terms of like tasks and like costs, 00:40:56.360 |
- In general, I think folks should start with 4o mini. 00:41:05.840 |
If you're not finding the performance you need, 00:41:15.280 |
frontier use cases, and maybe 4o is not quite cutting it, 00:41:18.880 |
and there I would recommend our fine tuning API. 00:41:21.120 |
Even just like 100 examples is enough to get started there, 00:41:24.400 |
and you can really get the performance you're looking for. 00:41:27.840 |
but like you're announcing other some fine tuning stuff 00:41:32.280 |
- Yeah, actually tomorrow we're dropping our GA 00:41:37.040 |
So 4o mini has been available for a few weeks now, 00:41:42.240 |
And we also have a free training offering for a bit. 00:41:47.120 |
you get one million free training tokens a day. 00:41:52.560 |
- So that was for 4o mini, and now it's also for 4o. 00:41:55.120 |
So we're really excited to see what people do with it. 00:41:57.000 |
And it's actually a lot easier to get started 00:42:00.000 |
I think they might need tens of thousands of examples, 00:42:19.200 |
And I think you're paving the path for migration of models. 00:42:22.480 |
As long as they keep their original data set, 00:42:26.200 |
- Yeah, I'm not sure what we've said publicly there yet, 00:42:28.800 |
but we definitely wanna make it easier for folks to migrate. 00:42:39.480 |
where it's in the guide, we'll put it in the show notes, 00:42:42.560 |
where it says to optimize for accuracy first, 00:42:44.960 |
so prompt engineering, RAG, evals, fine tuning. 00:42:49.480 |
And then optimize for cost and latency second, 00:42:51.880 |
and there's a few sets of steps for optimizing latency, 00:42:58.280 |
- We had one episode with Nicholas Carlini from DeepMind, 00:43:08.720 |
and it's like, "Oh, LLMs cannot do this," and they stop. 00:43:13.160 |
It's like, how do you know if you hit the model performance, 00:43:17.360 |
You know, it's like, "Your prompt is not good," 00:43:28.280 |
I think there's a lot we can do to make it easier 00:43:35.320 |
And a lot of people have experience now with ChatGPT. 00:43:38.040 |
You know, before ChatGPT, the easiest way to play 00:43:47.160 |
It's like, you know, if I tell you my grandma is sick, 00:43:51.600 |
and we're hoping to kind of remove the need for that, 00:43:53.800 |
but playing around with ChatGPT is a really good way 00:43:56.600 |
to get a feel for, you know, how to use the API as well. 00:44:01.320 |
or is it a dying art as the models get better? 00:44:07.320 |
It's like, as the models get better at coding, you know, 00:44:09.440 |
if we hit a hundred on SWE Bench, what does that mean? 00:44:17.280 |
Most of engineering is like figuring out the requirements 00:44:22.080 |
and I believe this will be the case with AI as well. 00:44:24.480 |
You're going to have to very clearly explain what you need, 00:44:26.600 |
and some people are better than others at it, 00:44:30.880 |
It's just the tools are going to get far better. 00:44:39.680 |
I think people were a little bit confused by that, 00:44:41.080 |
and then you issued a clarification that it was, 00:44:49.120 |
So part of the impetus here was to kind of be very transparent 00:45:18.120 |
so it's very good for like chat-style use cases. 00:45:23.400 |
we really tune our models to be good at things 00:45:27.320 |
and structured outputs, and when a developer builds 00:45:31.400 |
that kind of the weights are stable under them. 00:45:33.600 |
And so we have this offering where it's like, 00:45:39.160 |
you know it will never change the weights out from under you. 00:45:44.960 |
and we think those are the best for developers. 00:45:53.040 |
And you have the freedom to choose what's best for you. 00:46:08.600 |
- I mean, I think there's a lot of interesting stuff 00:46:14.120 |
and so we don't want to limit them artificially. 00:46:25.920 |
And basically, OpenAI has never actually shared with you 00:46:33.400 |
Actually, a lot of the models we have shipped 00:46:37.760 |
sometimes they diverge and it's not a limitation 00:46:41.520 |
- Anything else we should know about the new model? 00:46:43.160 |
I don't think there were any evals announced or anything, 00:46:55.800 |
They're not as in-depth as we want to be yet, 00:46:59.560 |
and we're learning what actually changes with each model 00:47:02.160 |
and how can we better understand the capabilities. 00:47:04.720 |
But we are trying to do more release notes in the future 00:47:09.400 |
But yeah, it's kind of an art and a science right now. 00:47:17.880 |
We're hiring if you want to come work on evals. 00:47:21.320 |
We'll come back to the end on what you're looking for, 00:47:25.240 |
and they want to know what qualities you're looking for. 00:47:27.960 |
- So we just talked about API versus ChatGPT. 00:47:31.240 |
What's, I guess, the vision for the interface? 00:47:34.200 |
You know, the mission of OpenAI is to build AGI 00:47:41.080 |
So I believe that the API is kind of our broadest vehicle 00:47:46.680 |
You know, we're building some first-party products, 00:47:48.760 |
but they'll never reach every niche in the world 00:47:54.680 |
and seeing the incredible things they come up with. 00:47:56.840 |
I often find that developers kind of see the future 00:47:59.880 |
and we love working with them to make it happen. 00:48:02.320 |
And so really the API is a bet on going really broad. 00:48:05.680 |
And we'll go very deep as well in our first-party products, 00:48:08.240 |
but I think just that our impact is absolutely magnified 00:48:13.280 |
- They can do the last mile where you cannot. 00:48:19.400 |
In fact, you know, I observed, I think in February, 00:48:26.760 |
because everyone was kind of able to take that 00:48:32.760 |
because ChatGPT has continued to grow. 00:48:42.440 |
The API was actually OpenAI's first product, 00:48:49.520 |
Like, GA, everyone can sign up and use it immediately. 00:48:55.520 |
And, you know, that means you also have to expose 00:49:07.440 |
It's interesting that the hottest new programming language 00:49:12.000 |
but it's actually just software engineering, right? 00:49:14.640 |
It's just, you know, we're talking about HTTP error codes. 00:49:20.040 |
engineering is still the way you access these models. 00:49:22.080 |
And I think there are companies working on tools 00:49:25.640 |
to make engineering more accessible for everyone, 00:49:32.400 |
- Yeah, one might even call it AI engineering. 00:49:40.840 |
and then we jumped straight to structured outputs. 00:49:47.200 |
What are your favorite stories that you like to tell? 00:49:50.120 |
- We had so much fun working on the Assistants API 00:50:32.040 |
But actually, maybe like two hours before that, 00:50:43.640 |
We were a bit on the edge of our seat watching it live. 00:50:52.000 |
- I mean, I actually don't know what the plan B was. 00:51:11.200 |
like the whole company got like a few weeks off, 00:51:18.680 |
like we just had the week of July 4th off, and yeah. 00:51:22.640 |
because people are working on such exciting things, 00:51:24.320 |
and it's like, you get a lot of FOMO on vacation, 00:51:26.640 |
so it helps when the whole company's on vacation. 00:51:36.440 |
What's the offering today versus, you know, one year ago? 00:51:39.480 |
- Yeah, so we've made a bunch of key improvements. 00:51:41.960 |
I would say the biggest one is in the file search product. 00:51:47.960 |
and the way we used those files was like less effective. 00:51:51.120 |
Basically, the model would decide based on the file name, 00:51:55.040 |
and there's not a ton of information in there. 00:51:57.360 |
So our new offering, which we shipped a few months ago, 00:52:04.640 |
And also, it's a kind of different operation. 00:52:07.320 |
So you can search semantically over all files at once, 00:52:09.600 |
rather than just kind of the model choosing one up front. 00:52:12.080 |
So a lot of customers have seen really good performance. 00:52:21.960 |
So this kind of gives developers more control 00:52:29.560 |
- Yeah, I think that visibility into the RAG system 00:52:33.360 |
was the number one thing missing from Dev Day, 00:52:39.840 |
The re-ranker is a core feature of, let's say, 00:52:44.920 |
Is OpenAI going to offer a re-ranking service, 00:52:51.560 |
I think we're soon going to ship more controls for that. 00:52:55.200 |
- And if I'm an existing LangChain, LlamaIndex, whatever, 00:53:01.360 |
Where does that exist in the spectrum of choices? 00:53:08.600 |
And so ideally, you don't have to know what a re-ranker is, 00:53:12.080 |
and you don't have to have a chunking strategy, 00:53:14.160 |
and the thing just kind of works out of the box. 00:53:23.320 |
I'm going to ask about a couple other things, 00:53:24.320 |
just updates on stuff also announced at Dev Day, 00:53:28.120 |
Determinism, something that people really want. 00:53:38.000 |
- The Seed parameter is not fully deterministic, 00:53:51.960 |
It's kind of trading off against reliability and uptime. 00:54:07.760 |
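In practice, that best-effort determinism looks like passing a `seed` and then comparing the `system_fingerprint` field on responses: if the fingerprint changed between calls, the backend configuration changed and matching outputs aren't expected. A rough sketch (the model name and seed value are arbitrary):

```python
def build_request(prompt, seed=12345):
    """Assemble chat-completions kwargs for best-effort determinism.

    `seed` requests reproducible sampling; it is not a hard guarantee,
    which is the reliability/uptime trade-off being described."""
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
        "seed": seed,
        "temperature": 0,
    }

def same_backend(resp_a, resp_b):
    """Determinism is only expected when system_fingerprint matches."""
    return resp_a.get("system_fingerprint") == resp_b.get("system_fingerprint")
```

So a caller would send the same `build_request(...)` twice and only treat differing outputs as surprising when `same_backend` is true.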
or products that are made a lot better through using it? 00:54:13.480 |
So you Logit Bias your valid classification outputs, 00:54:17.000 |
and you're more likely to get something that matches. 00:54:19.680 |
We've seen people Logit Bias punctuation tokens, 00:54:26.000 |
Yeah, it's generally very much a power user feature, 00:54:36.640 |
- Probably, I don't know, is delve one token? 00:54:38.920 |
You're probably, you got to do a lot of permutations. 00:54:46.160 |
I guess you cannot answer or you would omit it. 00:54:50.720 |
like the ones that you use across all models, or? 00:54:53.240 |
- Yeah, I think we have docs that publish more information. 00:54:58.120 |
but I think we publish which tokenizers for which model. 00:55:04.320 |
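For the classification case, the `logit_bias` parameter is a map from token ID strings to bias values; a quick sketch (the token IDs below are made up, and in practice you'd look them up with the tokenizer for your model, e.g. via tiktoken):

```python
def classification_bias(label_token_ids, strength=100):
    """Strongly bias the model toward a fixed set of label tokens.

    Keys must be token ID strings; +100 effectively forces those tokens,
    while negative values suppress them. Note that suppressing a *word*
    like "delve" means biasing every token variant (" delve", "Delve",
    etc.), which is the permutations problem mentioned above.
    """
    return {str(tid): strength for tid in label_token_ids}

# e.g. suppose "yes" -> 9891 and "no" -> 2201 under some tokenizer
bias = classification_bias([9891, 2201])
# passed as logit_bias=bias on a chat-completions request
```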
I don't think there was an official blog post 00:55:07.040 |
but it was kind of mentioned that you started tying 00:55:13.960 |
Just from your point of view, how do you manage that? 00:55:20.120 |
- Yeah, I think basically the main changes here 00:55:22.280 |
were to be more transparent and easier to use. 00:55:24.280 |
So before developers didn't know what tier they're in, 00:55:33.400 |
And so this just helps us do kind of gated rollouts 00:55:37.360 |
I think everyone tier two and up has full access. 00:55:41.040 |
I would just advise people to just get to tier five 00:55:48.520 |
- Do we want to maybe wrap with future things 00:55:51.200 |
and kind of like how you think about designing and everything? 00:55:53.800 |
So you just mentioned you want to be the easiest way 00:55:58.240 |
What's the relationship with other people building 00:56:02.400 |
Like I think maybe in the early days, it's like, okay, 00:56:05.280 |
we only have these APIs and then everybody helps us, 00:56:07.400 |
but now you're kind of building a whole platform. 00:56:10.960 |
- Yeah, I think kind of the 80/20 principle applies here. 00:56:13.960 |
We'll build things that kind of capture, you know, 00:56:16.240 |
80% of the value and maybe leave the long tail 00:56:31.600 |
but kind of AI development platform as a service. 00:56:35.160 |
That ties into a thing that I put in the notes 00:56:45.080 |
or they just want to know what you won't build 00:56:53.320 |
determined what exactly we will and won't build, 00:56:57.040 |
if it makes it a lot easier for developers to integrate, 00:57:03.440 |
- Yeah, so there's like cost tracking and model fallbacks. 00:57:11.680 |
but like if you don't build it, I have to build it 00:57:18.360 |
- Yeah, I mean, the way we're targeting that user need 00:57:37.080 |
- Is the important thing about owning the platform 00:57:41.200 |
to put all the kind of messy stuff behind the scenes? 00:57:49.640 |
how can we onboard the next generation of AI engineers, 00:58:05.720 |
beyond just the models that makes the models really useful? 00:58:12.320 |
Batch, Vision, Whisper, and then Team Enterprise stuff. 00:58:31.040 |
- So it's half off, which is a great savings. 00:58:36.200 |
So the savings on top of 4o mini is pretty crazy. 00:58:42.560 |
- Yeah, I should really have that number top of mind, 00:58:46.360 |
And so I think this opens up a lot more use cases. 00:58:48.640 |
Like let's say you have a user activation flow 00:58:51.800 |
and you want to send them an email like maybe every day 00:58:54.240 |
or like at certain points in their user journey. 00:58:58.400 |
and something that was maybe a lot more expensive 00:59:02.680 |
So right now we have this 24 hour turnaround time 00:59:08.440 |
like what kind of turnaround time do they want? 00:59:12.440 |
and I cannot use Batch because it's 24 hours. 00:59:18.400 |
But yeah, just a lot of folks haven't heard about it. 00:59:20.200 |
It's also really great for like evals, running them offline. 00:59:22.720 |
You don't, generally don't need them to come back 00:59:27.440 |
Two to four hours for me, like I need to produce a daily thing 00:59:32.280 |
And then maybe like a week, a month, who cares? 00:59:41.240 |
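The Batch API input is itself a JSONL file where each line wraps an ordinary request with a `custom_id` for matching up results. A sketch of the activation-email use case above (the request shape follows the public docs; the prompts and IDs are invented):

```python
import json

def batch_line(custom_id, prompt, model="gpt-4o-mini"):
    """One line of a Batch API input file: an id plus a normal
    chat-completions request body routed at /v1/chat/completions."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

# e.g. queue tomorrow's batch of activation emails, results due within 24h
lines = [batch_line(f"email-{i}", f"Draft a day-{i} activation email.") for i in range(3)]
batch_file = "\n".join(lines)
```

The file is then uploaded with the batch purpose and a batch job created against it; results come back as another JSONL keyed by `custom_id`.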
- Is there a future in which like six months is like free? 00:59:47.200 |
is there like super small like shards of like GPU runtime 01:00:06.560 |
Last year, people were so wowed by the GPT-4 demo 01:00:19.800 |
So there's, you can use it in the Assistants API, 01:00:22.520 |
you can use the Batch API and chat completions. 01:00:29.440 |
where the spatial relationships between the data 01:00:32.760 |
is too complicated and you can't get that over text. 01:00:35.360 |
But yeah, there's a lot of really cool use cases. 01:00:37.080 |
- I think the tricky thing for me is understanding 01:00:40.320 |
how frequent to turn Vision from like single images 01:00:51.360 |
Will there just be like, I stream you a video and then? 01:00:56.560 |
that we'll have an API where you stream video in 01:01:03.240 |
- 'Cause the frame sampling is the default, right? 01:01:07.240 |
- Yeah, I think it's hard for developers to do. 01:01:10.120 |
we should definitely work on making that easier. 01:01:12.920 |
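Until that's easier, the frame sampling developers hand-roll today is roughly: pick evenly spaced frames, cap the count, and send each frame as an image input. A small sketch of the sampling step (the cadence and cap are arbitrary knobs, not API requirements):

```python
def sample_frame_indices(total_frames, fps, every_seconds=2.0, max_frames=20):
    """Pick evenly spaced frame indices from a video to send to the
    vision API as individual images."""
    step = max(1, int(fps * every_seconds))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        # Thin uniformly so long videos still fit in one request.
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices

# 30 fps, 10-second clip -> one frame every 2 seconds
print(sample_frame_indices(300, 30))  # → [0, 60, 120, 180, 240]
```

Each selected frame would then be encoded (e.g. base64 JPEG) and attached as an image content part on a single chat request.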
do you have like a time guarantees, like order guarantees? 01:01:17.120 |
Like if I send you a Batch request of like a video analysis, 01:01:22.240 |
- For Batch, you send like a list of requests 01:01:34.000 |
- I wasn't linking video to Batch, but that's interesting. 01:01:49.080 |
is you're just using kind of spare time to run it. 01:01:54.520 |
Oliver, I built this thing called SmallPodcaster, 01:02:00.200 |
And why does Whisper API not have diarization 01:02:03.520 |
when everybody is transcribing people talking? 01:02:09.600 |
I actually worked on the Whisper API and shipped that. 01:02:20.960 |
but there's some like performance trade-offs. 01:02:23.240 |
And so Whisper V2 is better at some things than Whisper V3. 01:02:26.240 |
And so it didn't seem that worthwhile to ship Whisper V3 01:02:29.880 |
compared to like the other things in our priorities. 01:02:35.160 |
there's always so many things we could work on. 01:02:53.560 |
I forget the one. - Yeah, yeah, yeah, exactly. 01:02:58.320 |
And it's like, tell me if there's a bird in this picture. 01:02:59.560 |
And it's like, give me 10 people on a research team. 01:03:01.800 |
It's like, you never know which things are challenging 01:03:09.640 |
It still breaks a lot with like overlaps, obviously. 01:03:18.920 |
I mean, it would take us so long to do transcriptions. 01:03:37.280 |
is better than like figuring out your own pipeline thing. 01:03:40.640 |
- I think the top feature request there just would be, 01:03:49.760 |
I think there is like in raw Whisper, you can do that. 01:03:56.200 |
- There's no more deterministic way to do it. 01:03:57.280 |
- So this is really helpful when you have like acronyms 01:04:10.480 |
- Whisper, like, misspelled it all the ways in the past 01:04:19.720 |
or like all these different things, or like LangChain. 01:04:25.480 |
- A bunch of like three or four different ways. 01:04:33.720 |
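The glossary trick here is Whisper's `prompt` parameter: seeding the transcription with correctly spelled terms nudges the model toward those spellings. A sketch of building that hint (the wording of the prompt is just one reasonable choice, and it's a soft bias, not a guarantee):

```python
def spelling_prompt(glossary):
    """Build a Whisper `prompt` string listing the correct spellings of
    names and acronyms likely to appear in the audio."""
    return "Terms that may appear: " + ", ".join(glossary) + "."

prompt = spelling_prompt(["LangChain", "LlamaIndex", "RLHF", "Whisper"])
# passed as e.g. the `prompt` argument on an audio transcription request
```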
but I've been enjoying the advanced voice mode. 01:04:39.800 |
How would your audio endpoint change when that comes out? 01:04:43.640 |
- We're exploring, you know, new shape of the API 01:04:50.240 |
I don't think we're ready to share quite yet, 01:04:55.360 |
probably isn't going to be the right solution. 01:04:59.040 |
I think it's pretty public that OpenAI uses LiveKit 01:05:03.000 |
which like seems to be the socket based approach 01:05:05.800 |
that people should be at least up to speed on. 01:05:08.560 |
Like I think a lot of developers only do request response 01:05:14.600 |
I think we'll make it really easy for developers 01:05:18.520 |
It's hard to do audio. - It'll be a paradigm change. 01:05:24.920 |
What should people know using the enterprise offering? 01:05:27.480 |
- Yeah, we recently shipped our admin and audit log APIs. 01:05:34.520 |
The ability to kind of manage API keys programmatically, 01:05:39.200 |
So we've shipped this and for folks that need it, 01:05:45.880 |
I imagine it's just like build your own internal gateway 01:05:55.480 |
that needs to keep track of all the API keys, 01:05:57.720 |
it was pretty hard in the past to do this in the dashboard. 01:06:04.680 |
- The most important feature of an enterprise company. 01:06:15.240 |
Maybe let's just do, why is everybody at Waterloo cracked? 01:06:30.920 |
I think another reason is that Waterloo is like, 01:06:37.400 |
There's like not that much to do apart from study 01:06:47.520 |
And there's a lot of like startup incubators. 01:06:49.280 |
It's kind of just has this like startup and hacker ethos. 01:07:05.280 |
So, you know, it's no coincidence that Seattle 01:07:14.360 |
so it's the birthplace of C++, PHP, Turbo Pascal, 01:07:18.040 |
Standard ML, BNF, the thing that we just talked about, 01:07:21.160 |
MD5 Crypt, Ruby on Rails, Google Maps, and V8 for Chrome. 01:07:27.600 |
the creator of C++, there's nothing else to do. 01:07:35.920 |
People say, you know, New York is way more fun. 01:07:51.680 |
There's not a lot of like late night dining culture. 01:07:55.240 |
Yeah, so you have time to wake up early and get to work. 01:07:58.440 |
- You are a book recommender or book enjoyer. 01:08:01.560 |
What underrated books do you recommend most to others? 01:08:03.920 |
- Yeah, I think a book I read somewhat recently 01:08:23.320 |
kind of like the, some of the moments in technology. 01:08:26.880 |
Like when I played "The Sands of Time" on PS2 01:08:37.760 |
I think like OpenAI is a lot of similar things, 01:08:41.600 |
It's like, you see that thing and then you're like, okay, 01:08:53.240 |
and talks a lot about how people act irrationally 01:08:57.480 |
And I actually think about that book like once a week, 01:08:59.240 |
probably, at least when I'm making a decision 01:09:01.360 |
and I realize that, you know, I'm falling into a fallacy 01:09:11.200 |
- Is there like an example of like a cognitive bias 01:09:14.760 |
or misbehavior that you just love telling people about? 01:09:28.080 |
And like a lot of people are like, oh, I have to keep these. 01:09:31.640 |
But really it's the same decision you're making 01:09:33.400 |
if you have $10,000, like would you buy these tickets? 01:09:36.040 |
And so people don't really think about it rationally. 01:09:37.720 |
I'm like, would they rather have $10,000 or the tickets? 01:09:55.800 |
you respond more strongly than if I give it to you. 01:10:01.960 |
but if they do get a promotion, they're like, okay, phew. 01:10:06.440 |
It's more like, we react a lot worse to losing something. 01:10:10.880 |
- Which is why, like when you join like a new platform, 01:10:13.520 |
they often give you points and then they'll take it away 01:10:15.760 |
if you like don't do some action in like the first few days. 01:10:38.840 |
- I mean, they are maximizing probability distributions. 01:10:43.280 |
- Yeah, so I think way more than all of us, they are Econs. 01:10:59.680 |
Like, is there anything that they need to have done before? 01:11:04.400 |
- Yeah, we've hired people, all kinds of backgrounds, 01:11:09.760 |
or folks who've just done engineering like me. 01:11:19.320 |
And there's a really cool model behavior role 01:11:24.760 |
we'd recommend checking out our careers page, 01:11:30.760 |
- I think one thing that I'm trying to get at 01:11:32.600 |
is like, what kind of person does well at OpenAI? 01:11:43.840 |
- I mean, the people I enjoy working with the most 01:11:50.880 |
do what needs to be done, and unpretentious about it. 01:11:54.120 |
Yeah, I also think folks that are very user-focused 01:11:59.440 |
Like, the YC ethos of build something people want 01:12:04.760 |
So I would say low ego, user-focused, driven.