Building AGI with OpenAI's Structured Outputs API

00:00:00.000 | (upbeat music)

00:00:02.580 | - Hey, everyone.

00:00:05.240 | Welcome to the Latent Space Podcast.

00:00:06.960 | This is Alessio, partner and CTO

00:00:08.800 | on Residence and Decibel Partners,

00:00:10.320 | and I'm joined by my co-host Swiggs, founder of Small.ai.

00:00:13.440 | - Hey, and today we're excited to be

00:00:15.320 | in the in-person studio with Michelle, welcome.

00:00:18.000 | - Thanks, thanks for having me, very excited to be here.

00:00:20.040 | - This has been a long time coming.

00:00:21.680 | I've been following your work

00:00:23.200 | on the API platform for a little bit,

00:00:25.140 | and I'm finally glad that we could make this happen

00:00:28.200 | after you shipped the structured outputs.

00:00:30.240 | How does that feel?

00:00:31.200 | - Yeah, it feels great.

00:00:32.520 | We've been working on it for quite a while,

00:00:34.000 | so very excited to have it out there

00:00:36.000 | and have people using it.

00:00:37.240 | - We'll tell the story soon,

00:00:38.720 | but I want to give people a little intro

00:00:40.520 | to your backgrounds.

00:00:41.360 | So you've interned and/or worked at Google, Stripe,

00:00:44.880 | Coinbase, Clubhouse, and obviously OpenAI.

00:00:47.760 | What was that journey like?

00:00:49.240 | You know, the one that has the most appeal to me

00:00:51.840 | is Clubhouse because that was a very,

00:00:53.720 | very hot company for a while.

00:00:55.440 | Basically, you seem to join companies

00:00:57.080 | when they're about to scale up really a lot,

00:00:59.120 | and obviously OpenAI has been the latest.

00:01:01.640 | But yeah, just what are your learnings

00:01:03.480 | and your history going into all these notable companies?

00:01:06.480 | - Yeah, totally.

00:01:07.320 | For a bit of my background, I'm Canadian.

00:01:09.200 | I went to the University of Waterloo,

00:01:10.920 | and there you do like six internships

00:01:12.600 | as part of your degree.

00:01:13.840 | So I started, actually, my first job was really rough.

00:01:16.600 | I worked at a bank, and I learned Visual Basic,

00:01:19.360 | and I like animated bond yield curves,

00:01:21.760 | and it was, you know, not--

00:01:23.200 | - Me too.

00:01:24.040 | - Oh, really?

00:01:24.860 | - Yeah, I was a derivatives trader.

00:01:26.200 | - Interest rate swaps, that kind of stuff, yeah.

00:01:28.200 | - Yeah, so I liked, you know, having a job,

00:01:30.000 | but I didn't love that job.

00:01:31.080 | And then my next internship was Google,

00:01:33.680 | and I learned so much there.

00:01:35.120 | It was tremendous.

00:01:36.160 | But I had a bunch of friends that were into startups more,

00:01:38.400 | and, you know, Waterloo has like a big startup culture,

00:01:40.720 | and one of my friends interned at Stripe,

00:01:44.240 | and he said it was super cool.

00:01:45.160 | So that was kind of my,

00:01:46.360 | I also was a little bit into crypto at the time,

00:01:48.880 | then I got into it on Hacker News,

00:01:50.320 | and so Coinbase was on my radar.

00:01:52.280 | And so that was like my first real startup opportunity

00:01:54.720 | was Coinbase.

00:01:55.720 | I think I've never learned more in my life

00:01:57.880 | than in the four-month period

00:01:59.080 | when I was interning at Coinbase.

00:02:00.640 | They actually put me on call.

00:02:02.080 | I worked on like the ACH rails there,

00:02:04.160 | and it was absolutely crazy.

00:02:05.440 | You know, crypto was a very formative experience.

00:02:07.960 | Yeah.

00:02:08.800 | - This is 2018 to 2020,

00:02:10.480 | kind of like the first big wave.

00:02:11.720 | - That was my full-time.

00:02:13.360 | But I was there as an intern in 2016.

00:02:16.480 | Yeah, and so that was the period

00:02:17.960 | where I really like learned to become an engineer,

00:02:19.960 | learned how to use Git, got on call right away,

00:02:22.480 | you know, managed production databases and stuff.

00:02:24.520 | So that was super cool.

00:02:25.640 | After that, I went to Stripe

00:02:26.640 | and kind of got a different flavor of payments

00:02:28.320 | on the other side.

00:02:29.320 | Learned a lot, was really inspired by the Coulsons.

00:02:32.760 | And then my next internship after that,

00:02:34.600 | I actually started a company at Waterloo.

00:02:36.640 | So there's this thing you can do,

00:02:37.560 | it's an entrepreneurship co-op,

00:02:39.120 | and I did it with my roommate.

00:02:40.800 | The company's called Readwise, which still exists, but--

00:02:43.000 | - Yeah, yeah, yeah.

00:02:43.840 | - Everyone uses Readwise.

00:02:44.680 | - Yeah, awesome.

00:02:45.520 | - You co-founded Readwise?

00:02:46.600 | - Yeah.

00:02:47.440 | - I'm a premium user.

00:02:49.000 | - It's not even on your LinkedIn?

00:02:50.840 | - Yeah, I mean, I only worked on it for about a year,

00:02:52.640 | and so Tristan and Dan are the real founders,

00:02:54.800 | and I just had an interlude there.

00:02:56.520 | But yeah, really loved working on something

00:02:59.960 | very startup-focused, user-focused,

00:03:01.840 | and hacking with friends, it was super fun.

00:03:04.720 | Eventually, I decided to go back to Coinbase

00:03:06.120 | and really get a lot better as an engineer.

00:03:09.040 | I didn't feel like I was, you know,

00:03:10.680 | didn't feel equipped to be a CTO of anything at that point,

00:03:13.440 | and so just learned so much at Coinbase.

00:03:15.760 | And that was a really fun curve.

00:03:17.560 | But yeah, after that, I went to Clubhouse,

00:03:18.960 | which was a really interesting time.

00:03:21.920 | So I wouldn't say that I went there before it blew up.

00:03:24.920 | I would say I went there as it blew up,

00:03:26.600 | so not quite the startling track record that it might seem.

00:03:30.120 | But it was a super exciting place.

00:03:31.520 | I joined as the second or third backend engineer,

00:03:34.000 | and we were down every day, basically.

00:03:36.920 | One time, Oprah came on,

00:03:38.040 | and absolutely everything melted down,

00:03:40.120 | and so we would have a stand-up every morning,

00:03:41.640 | and we'd be like, "How do we make everything stay up?"

00:03:44.680 | Which is super exciting.

00:03:45.840 | Also, one of the first things I worked on there

00:03:47.600 | was making our notifications go out more quickly,

00:03:50.200 | because when you join a Clubhouse room,

00:03:51.800 | you need everyone to come in right away

00:03:53.640 | so that it's exciting,

00:03:54.600 | and the person speaking thinks a lot of my audience is here.

00:03:57.680 | But when I first joined, I think it would take 10 minutes

00:03:59.960 | for all the notifications to go out, which is insane.

00:04:02.920 | By the time you want to start talking

00:04:04.080 | to the time your audience is there,

00:04:05.400 | it's like you can totally kill the room.

00:04:07.120 | So that's one of the first things I worked on,

00:04:08.440 | is making that a lot faster and keeping everything up.

00:04:11.680 | - I mean, so already we have an audience of engineers.

00:04:14.040 | Those two things are useful.

00:04:15.160 | It's keeping things up and notifications out.

00:04:17.120 | Notifications, like is it a Kafka topic?

00:04:19.600 | - It was a Postgres shop,

00:04:20.800 | and you had all of the followers in Postgres,

00:04:23.920 | and you needed to iterate over the followers

00:04:25.760 | and figure out, is this a good notification to send?

00:04:28.080 | And so all of this logic,

00:04:29.240 | it wasn't well-batched and parallelized,

00:04:31.640 | and our job queuing infrastructure wasn't right.

00:04:34.000 | And so there was a lot of fixing all of these things.

00:04:36.400 | Eventually, there were a lot of database migrations,

00:04:38.200 | because Postgres just wasn't scaling well for us.

00:04:40.800 | - Interesting, and then keeping things up,

00:04:43.160 | that was more of a, I don't know, reliability issue,

00:04:46.560 | SRE type?

00:04:47.400 | - A lot of it, yeah, it goes down to database stuff.

00:04:50.920 | Everywhere I've worked--

00:04:51.760 | - It's on databases.

00:04:52.600 | (laughing)

00:04:54.280 | - Indexing.

00:04:55.120 | - Actually, at Coinbase, at Clubhouse, and at OpenAI,

00:04:57.760 | Postgres has been a perennial challenge.

00:04:59.760 | It's like, the stuff you learn at one job

00:05:02.400 | carries over to all the others,

00:05:03.520 | because you're always debugging

00:05:05.600 | a long-running Postgres query at 3 a.m. for some reason.

00:05:09.080 | So those skills have really carried me forward, for sure.

00:05:11.480 | - Why do you think that not as much of this is productized?

00:05:14.920 | Obviously, Postgres is an open-source project.

00:05:17.040 | It's not aimed at gigascale,

00:05:18.560 | but you would think somebody would come around

00:05:20.680 | and say, "Hey, we're like the--"

00:05:22.040 | - Yeah, I think that's what Planetscale is doing.

00:05:24.560 | It's not on Postgres, I think.

00:05:25.480 | It's on MySQL, but I think that's the vision.

00:05:27.920 | It's like, they have zero downtime migrations,

00:05:31.320 | and that's a big pain point.

00:05:33.120 | I don't know why no one is doing this on Postgres,

00:05:35.400 | but I think it would be pretty cool.

00:05:36.920 | - Their connection pullers, like PG Bouncer,

00:05:39.040 | is good enough, I don't know.

00:05:40.480 | - Yeah, well, even, I mean,

00:05:41.960 | I've run PG Bouncer everywhere,

00:05:43.440 | and there's still a lot of problems.

00:05:45.800 | Your scale, it's something that not many people see.

00:05:48.760 | - Yeah, I mean, at some point,

00:05:49.720 | every successful company gets to the scale

00:05:52.440 | where Postgres is not cutting it,

00:05:54.000 | and then you migrate to some sort of NoSQL database.

00:05:56.720 | And that process I've seen happen a bunch of times now.

00:05:59.200 | - MongoDB, Redis, something like that.

00:06:01.120 | - Yeah, I mean, we're on Azure now,

00:06:04.000 | and so we use Cosmos DB.

00:06:06.040 | - Cosmos DB, hey!

00:06:07.680 | - At Clubhouse, I really love DynamoDB.

00:06:10.080 | That's probably my favorite database,

00:06:12.160 | which is like a very nerdy sentence,

00:06:13.360 | but that's the one I'm using

00:06:14.800 | if I need to scale something as far as it goes.

00:06:16.560 | - Yeah, DynamoDB, I, when I learned,

00:06:18.800 | I worked at AWS briefly,

00:06:20.120 | and it's kind of like the memory register for the web.

00:06:23.280 | Like, you know, if you treat it just as physical memory,

00:06:26.800 | you will use it well.

00:06:27.920 | If you treat it as a real database,

00:06:29.760 | you might run into problems.

00:06:30.840 | - Right, you have to totally change your mindset

00:06:32.600 | when you're going from Postgres to Dynamo.

00:06:34.200 | But I think it's a good mindset shift,

00:06:35.440 | and kind of makes you design things in a more scalable way.

00:06:37.760 | - Yeah, I'll recommend the DynamoDB book

00:06:39.440 | for people who need to use DynamoDB.

00:06:41.640 | But we're not here to talk about AWS,

00:06:43.240 | we're here to talk about OpenAI.

00:06:44.680 | You joined OpenAI pre-Chad GPT.

00:06:46.360 | I also had the opportunity to join and I didn't.

00:06:48.840 | What was your insight?

00:06:50.600 | - Yeah, I think a lot of people who joined OpenAI

00:06:52.800 | joined because of a product that really gets them excited.

00:06:55.280 | And for most people, it's Chad GPT.

00:06:56.720 | But for me, I was a daily user of Copilot, GitHub Copilot.

00:07:00.840 | And I was like so blown away at the quality of this thing.

00:07:03.240 | I actually remember the first time seeing it on Hacker News

00:07:05.200 | and being like, wow, this is absolutely crazy.

00:07:07.040 | Like, this is gonna change everything.

00:07:08.920 | And I started using it every day.

00:07:10.560 | It just really, even now when like I don't have service

00:07:14.400 | and I'm coding without Copilot,

00:07:16.120 | it's just like 10x difference.

00:07:18.040 | So I was really excited about that product.

00:07:19.320 | I thought now is maybe the time for AI.

00:07:21.320 | And I'd done some AI in college

00:07:22.760 | and thought some of those skills would transfer.

00:07:25.600 | And I got introduced to the team.

00:07:26.800 | I liked everyone I talked to.

00:07:28.280 | So I thought that'd be cool.

00:07:29.720 | Why didn't you join?

00:07:30.760 | - It was like, I was like, is Dolly it?

00:07:32.880 | (laughing)

00:07:33.960 | - We were there.

00:07:35.320 | We were at the Dolly like launch thing.

00:07:37.320 | And I think you were talking with Lenny

00:07:39.000 | and Lenny was at OpenAI at the time.

00:07:41.160 | And you were like--

00:07:42.000 | - We don't have to go into too much detail.

00:07:43.760 | - This is one of my biggest regrets of my life.

00:07:45.960 | - No, no, no.

00:07:47.440 | - But I was like, okay, I mean, I can create images.

00:07:50.680 | I don't know if like this is the thing to dedicate,

00:07:52.680 | but obviously you had a bigger vision than I did.

00:07:55.360 | - Dolly was really cool too.

00:07:56.480 | I remember like first showing my family,

00:07:58.480 | I was like, I'm going to this company

00:08:00.000 | and here's like one of the things they do.

00:08:01.960 | And it like really helped bridge the gap.

00:08:03.320 | Whereas like, I still haven't figured out

00:08:05.600 | how to explain to my parents what crypto is.

00:08:08.160 | My mom for a while thought I worked at Bitcoin.

00:08:10.120 | So it's like, it's pretty different

00:08:12.160 | to be able to tell your family what you actually do

00:08:14.240 | and they can see it.

00:08:15.080 | - Yeah, and they can use it too, personally.

00:08:17.280 | So you were there, were you immediately on API platform?

00:08:20.080 | You were there for the chat GPT moment.

00:08:21.640 | - Yeah, I mean, API platform is like a very grandiose term

00:08:24.880 | for what it was.

00:08:25.720 | There was like just a handful of us working on the API.

00:08:27.920 | - Yeah, it was like a closed beta, right?

00:08:29.160 | Not even everyone had access to the GPT-3 model.

00:08:32.000 | - A very different access model then,

00:08:34.040 | a lot more like tiered rollouts.

00:08:36.520 | But yeah, I would say the applied team

00:08:38.480 | was maybe like 30 or 40 people

00:08:40.760 | and yeah, probably closer to 30.

00:08:42.520 | And there was maybe like five-ish total

00:08:44.520 | working on the API at most.

00:08:45.880 | So yeah, we've grown a lot since then.

00:08:47.480 | - It's like 60, 70 now, right?

00:08:49.300 | - No, applied is much bigger than that.

00:08:51.320 | Applied now is bigger than the company when I joined.

00:08:53.400 | - Okay, all right.

00:08:54.240 | - Yeah, we've grown a lot.

00:08:55.060 | I mean, there's so much to build.

00:08:55.900 | So we need all the help we can get.

00:08:56.740 | - I'm a little out of date, yeah.

00:08:58.080 | - Any chat GPT release, kind of like all ants on deck stories.

00:09:02.480 | I had lunch with Evan Morikawa a few months ago.

00:09:05.240 | It sounded like it was a fun time to get,

00:09:07.280 | build the APIs and have all these people

00:09:09.000 | trying to use the web thing.

00:09:10.080 | Like, how are you prioritizing internally?

00:09:12.160 | And like, what was the helping scaling

00:09:14.640 | when you're scaling non-GPU workloads

00:09:16.800 | versus like Postgres bouncers and things like that?

00:09:19.820 | - Yeah, actually surprisingly,

00:09:20.740 | there were a lot of Postgres issues when chat GPT came out

00:09:24.320 | because the accounts for like chat GPT

00:09:27.600 | were tied to the accounts in the API.

00:09:29.760 | And so you're basically creating a developer account

00:09:31.440 | to log into chat GPT at the time.

00:09:33.000 | 'Cause it's just what we had.

00:09:33.820 | It was low-key research preview.

00:09:35.360 | And so I remember there was just so much work scaling

00:09:37.480 | like our authorization system

00:09:38.840 | and that would be down a lot.

00:09:40.400 | Yeah, also GPU, you know,

00:09:42.040 | I never had worked in a place

00:09:43.880 | where you couldn't just scale the thing up.

00:09:45.960 | It's like everywhere I've worked in Qt is like free

00:09:47.800 | and you just like auto-scale a thing

00:09:49.240 | and you like never think about it again.

00:09:50.880 | But here we're having like tough decisions every day.

00:09:53.000 | We're like discussing like, you know,

00:09:54.320 | should they go here or here?

00:09:55.720 | And we have to be principled about it.

00:09:57.080 | So that's a real mindset shift.

00:09:58.480 | - So you just really structured outputs, congrats.

00:10:00.560 | You also wrote the blog post for it,

00:10:01.680 | which was really well-written.

00:10:02.640 | And I loved all the examples that you put out.

00:10:04.200 | Like you really give the full story.

00:10:06.000 | Yeah, tell us about the whole story from beginning to end.

00:10:09.080 | - Yeah, I guess the story we should rewind quite a bit

00:10:11.720 | to Dev Day last year.

00:10:13.000 | Dev Day last year, exactly.

00:10:14.320 | We shipped JSON mode,

00:10:15.600 | which is our first foray into this area of product.

00:10:18.560 | So for folks who don't know,

00:10:19.680 | JSON mode is this functionality you can enable

00:10:22.000 | in our chat completions and other APIs,

00:10:24.160 | where if you opt in,

00:10:25.720 | we'll kind of constrain the output of the model

00:10:28.120 | to match the JSON language.

00:10:30.240 | And so you basically will always get something

00:10:32.800 | in a curly brace.

00:10:33.880 | And this is good.

00:10:34.720 | This is nice for a lot of people.

00:10:35.720 | You can describe your schema,

00:10:37.280 | what you want in prompt,

00:10:38.800 | and then we'll constrain it to JSON.

00:10:40.920 | But it's not getting you exactly where you want,

00:10:43.080 | because you don't want the model

00:10:44.160 | to kind of make up the keys

00:10:45.800 | or match different values than what you want.

00:10:47.520 | Like if you want an enum or a number

00:10:49.080 | and you get a string instead,

00:10:50.000 | it's pretty frustrating.

00:10:51.600 | So we've been ideating on this for a while,

00:10:53.200 | and people have been asking for basically this

00:10:55.600 | every time I talk to customers for maybe the last year.

00:10:58.120 | And so it was really clear that there's a developer need,

00:11:00.200 | and we started working on kind of making it happen.

00:11:02.720 | And this is a real collab

00:11:04.640 | between engineering and research, I would say.

00:11:06.520 | And so it's not enough to just kind of constrain the model.

00:11:09.960 | I think of that as the engineering side,

00:11:11.600 | whereas basically you mask the available tokens

00:11:14.960 | that are produced every time to only fit the schema.

00:11:17.640 | And so you can do this engineering thing,

00:11:19.080 | and you can force the model to do what you want,

00:11:20.760 | but you might not get good outputs.

00:11:22.200 | And sometimes with JSON mode,

00:11:23.360 | developers have seen that our models output

00:11:25.240 | like white space for a really long time,

00:11:27.280 | where they don't--

00:11:28.120 | - Because it's a legal character.

00:11:29.320 | - Right, it's legal for JSON,

00:11:31.240 | but it's not really what they want.

00:11:32.760 | And so that's what happens

00:11:33.600 | when you do kind of a very engineering-biased approach.

00:11:36.120 | But the modeling approach is to also train the model

00:11:38.360 | to do more of what you want.

00:11:39.640 | And so we did these together.

00:11:41.000 | We trained a model which is significantly better

00:11:42.720 | than our past models at following formats,

00:11:44.960 | and we did the end work to serve

00:11:46.400 | like this constrained decoding concept at scale.

00:11:48.960 | So I think marrying these two

00:11:50.080 | is why this feature is pretty cool.

00:11:52.000 | - You just mentioned starts and ends with a curly brace,

00:11:54.640 | and maybe people's minds go to prefills in the Cloud API.

00:11:59.000 | How should people think about

00:12:00.240 | JSON mode structure output prefills?

00:12:02.640 | Because some of them are like,

00:12:03.840 | roughly starts with a curly brace

00:12:05.880 | and asks you for JSON, you should do it.

00:12:07.720 | And then Instructor is like,

00:12:08.720 | "Hey, here's a rough data scheme that you should use."

00:12:11.120 | And how do you think about them?

00:12:13.080 | - So I think we kind of designed structured outputs

00:12:15.160 | to be the easiest to use.

00:12:16.480 | So you just, like the way you use it in our SDK,

00:12:19.400 | I think is my favorite thing.

00:12:20.720 | So you just create like a pedantic object or a Zod object,

00:12:23.440 | and you pass it in and you get back an object.

00:12:25.520 | And so you don't have to deal with any of the serialization.

00:12:27.840 | - With the parse helper.

00:12:29.000 | - Yeah, you don't have to deal with any of the serialization

00:12:31.080 | on the way in or out.

00:12:32.320 | So I kind of think of this as the feature

00:12:34.080 | for the developer who is like,

00:12:35.680 | I need this to plug into my system.

00:12:37.520 | I need the function call to be exact.

00:12:39.440 | I don't want to deal with any parsing.

00:12:41.120 | So that's where structured outputs is tailored.

00:12:43.600 | Whereas if you want the model to be more creative

00:12:46.160 | and use it to come up with a JSON schema

00:12:48.320 | that you don't even know you want,

00:12:49.440 | then that's kind of where JSON mode fits in.

00:12:51.480 | But I expect most developers

00:12:52.840 | are probably going to want to upgrade to structured outputs.

00:12:55.480 | - The thing you just said,

00:12:56.320 | you just use interchangeable terms for the same thing,

00:12:59.480 | which is function calling and structured outputs.

00:13:02.680 | We've had disagreements or discussion before on the podcast

00:13:06.200 | about are they the same thing?

00:13:07.840 | Semantically, they're slightly different.

00:13:09.520 | - They are, yes.

00:13:10.360 | - Because I think function calling API came out first

00:13:12.760 | than JSON mode.

00:13:14.240 | And we used to abuse function calling for JSON mode.

00:13:18.480 | Do you think we should treat them as synonymous?

00:13:20.920 | - No. - Okay, yeah.

00:13:21.760 | Please clarify.

00:13:22.600 | (both laughing)

00:13:24.000 | And by the way, there's also tool calling.

00:13:26.000 | - Yeah, the history here is we started with function calling

00:13:29.120 | and function calling came from the idea of

00:13:31.760 | let's give the model access to tools

00:13:33.480 | and let's see what it does.

00:13:34.400 | And we basically had these internal prototypes

00:13:36.640 | of what a code interpreter is now.

00:13:38.520 | And we were like, this is super cool.

00:13:39.720 | Let's make it an API.

00:13:40.880 | But we're not ready to host code interpreter for everybody.

00:13:43.440 | So we're just going to expose the raw capability

00:13:45.560 | and see what people do with it.

00:13:47.040 | But even now, I think there's a really big difference

00:13:49.200 | between function calling and structured outputs.

00:13:51.040 | So you should use function calling

00:13:52.600 | when you actually have functions

00:13:53.920 | that you want the model to call.

00:13:55.600 | And so if you have a database

00:13:57.240 | that you want the model to be able to query from,

00:13:59.280 | or if you want the model to send an email

00:14:01.800 | or generate arguments for an actual action.

00:14:04.920 | And that's the way the model has been fine-tuned on,

00:14:07.720 | is to treat function calling

00:14:09.400 | for actually calling these tools and getting their outputs.

00:14:11.880 | The new response format

00:14:13.800 | is a way of just getting the model to respond to the user,

00:14:17.040 | but in a structured way.

00:14:18.400 | And so this is very different.

00:14:19.400 | Responding to a user versus I'm going to go send an email.

00:14:24.240 | A lot of people were hacking function calling

00:14:26.120 | to get the response format they needed.

00:14:28.440 | And so this is why we shipped this new response format.

00:14:31.040 | So you can get exactly what you want

00:14:32.640 | and you get more of the model's verbosity.

00:14:35.280 | It's responding in the way it would speak to a user.

00:14:38.560 | And so less just programmatic tool calling,

00:14:41.200 | if that makes sense.

00:14:42.080 | - Are you building something into the SDK

00:14:43.920 | to actually close the loop with the function calling?

00:14:46.880 | Because right now it returns the function,

00:14:48.360 | then you got to run it,

00:14:49.200 | then you got to fake another message

00:14:51.360 | to then continue the conversation.

00:14:53.640 | - They have that in beta, the runs.

00:14:55.160 | - Yes, we have this in beta in the Node SDK.

00:14:58.480 | So you can basically-

00:14:59.560 | - Oh, not Python.

00:15:00.840 | - It's coming to Python as well.

00:15:02.440 | - That's why I didn't know.

00:15:03.880 | - Yeah, I'm a Node guy.

00:15:04.840 | So I'm like, it's already existed.

00:15:08.760 | - It's coming everywhere.

00:15:09.720 | But basically what you do is you write a function

00:15:11.600 | and then you add a decorator to it.

00:15:13.720 | And then you can, basically there's this run tools method

00:15:16.480 | and it does the whole loop for you, which is pretty cool.

00:15:19.160 | - When I saw that in the Node SDK,

00:15:21.240 | I wasn't sure if that's,

00:15:22.760 | because it basically runs it in the same machine.

00:15:24.960 | - Yeah.

00:15:25.800 | - And maybe you don't want that to happen.

00:15:28.200 | - Yeah, I think of it as like,

00:15:29.600 | if you're prototyping and building something really quickly

00:15:32.240 | and just playing around,

00:15:33.080 | it's so cool to just create a function

00:15:34.840 | and give it this decorator.

00:15:35.880 | But you have the flexibility to do it however you like.

00:15:38.480 | - Like you don't want it in a critical path

00:15:39.840 | of a web request.

00:15:40.920 | - I mean, some people definitely will.

00:15:42.520 | (both laughing)

00:15:44.000 | It's just kind of the easiest way to get started.

00:15:46.080 | But let's say you want to like execute this function

00:15:48.320 | on a job queue async,

00:15:49.760 | then it wouldn't make sense to use that.

00:15:51.840 | - Prior art, Instructure, outlines, JSON former,

00:15:55.920 | what did you study?

00:15:56.760 | What did you credit or learn from these things?

00:15:59.640 | - Yeah, there's a lot of different approaches to this.

00:16:02.080 | There's more fill in the blank style sampling

00:16:04.880 | where you basically pre-form kind of the keys

00:16:08.880 | and then get the model to sample just the value.

00:16:11.480 | There's kind of a lot of approaches here.

00:16:13.200 | We didn't kind of use any of them wholesale,

00:16:15.280 | but we really loved what we saw from the community

00:16:17.480 | and like the developer experiences we saw.

00:16:19.680 | So that's where we took a lot of inspiration.

00:16:21.800 | - There was a question also just about constrained grammar.

00:16:24.880 | This is something that I first saw in Llama CPP,

00:16:27.920 | which seems to be the most,

00:16:29.520 | let's just say academically permissive.

00:16:32.080 | - It's kind of the lowest level.

00:16:33.320 | - Yeah.

00:16:34.240 | For those who don't know,

00:16:35.080 | maybe I don't know if you want to explain it,

00:16:36.040 | but they use Backus-Norform,

00:16:37.800 | which you only learn in like college

00:16:39.200 | when you're working on programming languages and compilers.

00:16:41.440 | I don't know if you like use that under the hood

00:16:43.560 | or you explore that.

00:16:44.800 | - Yeah, we didn't use any kind of other stuff.

00:16:48.160 | We kind of built our solution from scratch

00:16:50.600 | to meet our specific needs.

00:16:52.160 | But I think there's a lot of cool stuff out there

00:16:54.440 | where you can supply your own grammar.

00:16:56.360 | Right now, we only allow JSON schema

00:16:58.120 | and the dialect of that.

00:16:59.320 | But I think in the future,

00:17:00.160 | it could be a really cool extension

00:17:01.360 | to let you supply a grammar more broadly.

00:17:04.200 | And maybe it's more token efficient than JSON.

00:17:06.520 | So a lot of opportunity there.

00:17:08.680 | - You mentioned before also training the model

00:17:11.040 | to be better at function calling.

00:17:12.720 | What's that discussion like internally for like resources?

00:17:15.400 | It's like, hey, we need to get better JSON mode.

00:17:17.360 | And it's like, well,

00:17:18.200 | can't you figure it out on the API platform

00:17:20.240 | without touching the model?

00:17:21.920 | Like is there a really tight collaboration

00:17:24.400 | between the two teams?

00:17:25.440 | - Yeah, so I actually work on the API models team.

00:17:27.520 | I guess we didn't quite get into what I do at API.

00:17:29.400 | (all laughing)

00:17:31.440 | - What do you say it is you do here?

00:17:33.960 | - Yeah, so yeah, I'm the tech lead for the API,

00:17:36.360 | but also I work on the API models team.

00:17:38.240 | And this team is really working on

00:17:39.760 | making the best models for the API.

00:17:41.560 | And a lot of common deployment patterns

00:17:43.520 | are research makes a model

00:17:45.560 | and then you kind of ship it in the API.

00:17:47.560 | But I think there's a lot you miss when you do that.

00:17:50.440 | You miss a lot of developer feedback

00:17:52.200 | and things that are not kind of immediately obvious.

00:17:54.840 | What we do is we get a lot of feedback from developers

00:17:57.440 | and we go and make the models better in certain ways.

00:17:59.800 | So our team does model training as well.

00:18:01.720 | We work very closely with our post-training team.

00:18:04.000 | And so for structured outputs,

00:18:05.400 | it was a collab between a bunch of teams,

00:18:07.320 | including safety systems to make a really great model

00:18:10.760 | that does structured outputs.

00:18:12.680 | - Mentioning safety systems, you have a refusal field.

00:18:15.320 | - Yes.

00:18:16.160 | - You want to talk about that?

00:18:17.000 | - Yeah, it's pretty interesting.

00:18:19.200 | So you can imagine basically if you constrain the model

00:18:22.520 | to follow a schema,

00:18:23.880 | you can imagine there being like a schema supplied

00:18:26.800 | that it would add some risk or be harmful for the model

00:18:29.720 | to kind of follow that schema.

00:18:31.800 | And we wanted to preserve our model's abilities to refuse

00:18:35.120 | when something doesn't match our policies

00:18:37.800 | or is harmful in some way.

00:18:39.320 | And so we needed to give the model an ability to refuse

00:18:42.200 | even when there is this schema.

00:18:44.000 | But also, you know, if you are a developer

00:18:46.280 | and you have this schema

00:18:47.400 | and you get back something that doesn't match it,

00:18:48.880 | you're like, "Oh, the feature's broken."

00:18:50.440 | So we wanted a really clear way

00:18:51.600 | for developers to program against this.

00:18:53.240 | So if you get something back in the content,

00:18:54.920 | you know it's valid, it's JSON parsable.

00:18:56.880 | But if you get something back in the refusal field,

00:18:59.520 | it makes for a much better UI

00:19:00.720 | for you to kind of display this to your user

00:19:02.680 | in a different way.

00:19:03.640 | And it makes it easier to program against.

00:19:05.320 | So really there was a few goals,

00:19:06.520 | but it was mainly to allow the model to continue to refuse,

00:19:09.160 | but also with a really good developer experience.

00:19:11.240 | - Yeah, why not offer it as like an error code?

00:19:14.320 | Because we have to display error codes anyway.

00:19:16.880 | - Yeah, we've falafeled for a long time about API design,

00:19:20.040 | as we are wont to do.

00:19:21.920 | And there are a few reasons against an error code.

00:19:24.560 | Like you could imagine this being a 4xx error code

00:19:26.720 | or something, but you know,

00:19:28.120 | the developer's paying for the tokens.

00:19:29.620 | And that's kind of atypical for like a 4xx error code.

00:19:33.640 | - We pay with errors anyway, right?

00:19:36.840 | Or no? - So 4xx is-

00:19:38.120 | - Is not, that's a U error.

00:19:40.240 | - Right, and it doesn't make sense as a 5xx either,

00:19:43.500 | 'cause it's not our fault.

00:19:44.340 | It's the way the API, the model is designed.

00:19:46.880 | I think the HTTP spec is a little bit limiting

00:19:49.680 | for AI in a lot of ways.

00:19:51.520 | Like there are things that are in between

00:19:53.080 | your fault and my fault.

00:19:54.120 | There's kind of like the model's fault

00:19:56.040 | and there's no, you know, error code for that.

00:19:58.640 | So we really have to kind of invent

00:20:00.720 | a lot of the paradigm here.

00:20:02.160 | - We get 6xx.

00:20:03.840 | - Yeah, that's one option.

00:20:04.680 | There's actually some like esoteric error codes

00:20:06.600 | we've considered adopting.

00:20:07.880 | - 328, my favorite.

00:20:09.400 | - Yeah, there's the teapot one.

00:20:11.940 | - Hey! (laughs)

00:20:13.800 | - We're still figuring that out.

00:20:14.800 | But I think there are some things,

00:20:16.320 | like for example, sometimes our model will produce tokens

00:20:19.520 | that are invalid based on kind of our language.

00:20:22.600 | And when that happens, it's an error.

00:20:25.080 | But, you know, it doesn't, 500 is fine,

00:20:28.040 | which is what we return,

00:20:29.080 | but it's not as expressive as it could be.

00:20:31.320 | So yeah, just areas where, you know,

00:20:33.480 | web 2.0 doesn't quite fit with AI yet.

00:20:37.080 | - If you have to put in a spec,

00:20:38.520 | I was gonna-- - To just change.

00:20:39.440 | Yeah, yeah, yeah.

00:20:40.400 | What would be your number one proposal to like rehaul?

00:20:42.960 | - The HTTP committee to re-invent the world.

00:20:45.780 | - Yeah, that's a good one.

00:20:47.560 | I mean, I think we just need an error

00:20:48.960 | of like a range of model error.

00:20:51.400 | And we can have many different kinds of model errors.

00:20:53.480 | Like a refusal is a model error.

00:20:55.560 | - 601, auto refusal.

00:20:58.040 | - Yeah, again, like, so we've mentioned before

00:21:00.560 | that chat completions uses this chat ML format.

00:21:03.240 | So when the model doesn't follow chat ML, that's an error.

00:21:06.440 | And we're working on reducing those errors,

00:21:07.960 | but that's like, I don't know, 602, I guess.

00:21:10.600 | - A lot of people actually no longer know what chat ML is.

00:21:13.000 | - Yeah, fair enough. - Because that was

00:21:15.360 | briefly introduced by OpenAI and then like kind of deprecated.

00:21:18.280 | Everyone who implements this under the hood knows it,

00:21:21.200 | but maybe the API users don't know it.

00:21:23.360 | - Basically, the API started with just one endpoint,

00:21:25.960 | the completions endpoint.

00:21:27.320 | And the completions endpoint,

00:21:28.400 | you just put text in and you get text out.

00:21:30.920 | And you can prompt in certain ways.

00:21:33.240 | Then we released chat GPT,

00:21:35.000 | and we decided to put that in the API as well.

00:21:37.520 | And that became the chat completions API.

00:21:39.520 | And that API doesn't just take like a string input

00:21:41.880 | and produce an output.

00:21:42.840 | It actually takes in messages and produces messages.

00:21:45.720 | And so you can get a distinction

00:21:46.800 | between like an assistant message and a user message,

00:21:48.840 | and that allows all kinds of behavior.

00:21:50.720 | And so the format under the hood for that is called chat ML.

00:21:54.520 | Sometimes, you know, because the model

00:21:56.280 | is so out of distribution based on what you're doing,

00:21:58.400 | maybe the temperature is super high,

00:22:00.120 | then it can't follow chat ML.

00:22:02.120 | - Yeah, I didn't know that there could be errors

00:22:04.520 | generated there.

00:22:05.360 | Maybe I'm not asking challenging enough questions.

00:22:07.920 | - It's pretty rare, and we're working on driving it down.

00:22:10.680 | But actually, this is a side effect

00:22:12.440 | of structured outputs now,

00:22:14.040 | which is that we have removed a class of errors.

00:22:16.280 | We didn't really mention this in the blog,

00:22:17.760 | just 'cause we ran out of space.

00:22:19.280 | But--

00:22:20.120 | - That's what we're here to do.

00:22:21.080 | - Yeah, the model used to occasionally pick a recipient

00:22:24.040 | that was invalid, and this would cause an error.

00:22:26.760 | But now we are able to constrain to chat ML

00:22:30.280 | in a more valid way.

00:22:31.520 | And this reduces a class of errors as well.

00:22:33.840 | - Recipient meaning, so there's this,

00:22:35.640 | like a few number of defined roles,

00:22:37.480 | like user, assistant, system.

00:22:39.480 | - Like recipient as in like picking the right tool.

00:22:41.840 | - Oh. - Oh.

00:22:42.760 | - So the model before was able to hallucinate a tool,

00:22:46.520 | but now it can't when you're using structured outputs.

00:22:49.680 | - Do you collaborate with other model developers

00:22:52.600 | to try and figure out this type of errors?

00:22:54.520 | Like how do you display them?

00:22:55.600 | Because a lot of people try to work with different models.

00:22:58.880 | - Yeah. - Yeah, is there any?

00:22:59.920 | - Yeah, not a ton.

00:23:01.160 | We're kind of just focused

00:23:02.240 | on making the best API for developers.

00:23:04.640 | - A lot of research and engineering, I guess,

00:23:07.160 | comes together with evals.

00:23:08.840 | You published some evals there.

00:23:10.920 | I think Gorilla is one of them.

00:23:12.720 | What is your assessment of like the state of evals

00:23:15.200 | for function calling and structured output right now?

00:23:17.720 | - Yeah, we've actually collaborated with BFCL a little bit,

00:23:21.360 | which is, I think, the same thing as Gorilla.

00:23:23.480 | - Function calling leaderboard.

00:23:24.840 | - Kudos to the team.

00:23:25.880 | Those evals are great, and we use them internally.

00:23:27.920 | Yeah, we've also sent some feedback

00:23:29.320 | on some things that are misgraded.

00:23:31.160 | And so we're collaborating to make those better.

00:23:34.320 | In general, I feel evals are kind of the hardest part of AI.

00:23:37.960 | Like when I talk to developers, it's so hard to get started.

00:23:41.040 | It's really hard to make a robust pipeline.

00:23:43.160 | And you don't want evals that are like 80% successful

00:23:46.440 | because, you know, things are gonna improve dramatically.

00:23:49.000 | And it's really hard to craft the right eval.

00:23:51.160 | You kind of want to hit everything on the difficulty curve.

00:23:53.720 | I find that a lot of these evals are mostly saturated,

00:23:56.680 | like for BFCL.

00:23:58.120 | All the models are near the top already,

00:24:00.560 | and kind of the errors are more, I would say,

00:24:02.880 | like just differences in default behaviors.

00:24:05.560 | I think most of the models on leaderboard

00:24:07.360 | can kind of get 100% with different prompting,

00:24:10.280 | but it's more kind of you're just pulling apart

00:24:12.320 | different defaults at this point.

00:24:14.400 | So yeah, I would say in general, we're missing evals.

00:24:16.200 | You know, we work on this a lot internally, but it's hard.

00:24:19.160 | - Did you, other than BFCL, would you call out any others

00:24:21.920 | just for people exploring the space?

00:24:23.600 | - SweetBench is actually like a very interesting eval,

00:24:26.000 | if people don't know.

00:24:26.960 | You basically give the model a GitHub issue and like a repo

00:24:30.640 | and just see how well it does at the issue,

00:24:32.480 | which I think is super cool.

00:24:33.320 | It's kind of like an integration test,

00:24:35.360 | I would say, for models.

00:24:36.480 | - It's a little unfair, right?

00:24:38.280 | - What do you mean?

00:24:39.120 | - A little unfair, 'cause like usually as a human,

00:24:41.400 | you have more opportunity to like ask questions

00:24:43.400 | about what it's supposed to do.

00:24:44.880 | And you're giving the model like way too little information.

00:24:47.320 | - It's a hard job.

00:24:48.160 | - To do the job.

00:24:49.200 | - But yeah, SweetBench targets like,

00:24:50.680 | how well can you follow the diff format

00:24:52.400 | and how well can you like search across files

00:24:54.400 | and how well can you write code?

00:24:55.920 | So I'm really excited about evals like that

00:24:57.640 | because the pass rate is low,

00:24:59.400 | so there's a lot of room to improve.

00:25:01.200 | And it's just targeting a really cool capability.

00:25:03.280 | - I've seen other evals for function calling

00:25:05.040 | where I think might be BFCL as well,

00:25:07.120 | where they evaluate different kinds of function calling.

00:25:10.120 | And I think the top one that people care about,

00:25:12.440 | for some reason,

00:25:13.280 | I don't know personally that this is so important to me,

00:25:15.480 | but it's parallel function calling.

00:25:17.520 | I think you confirmed that you don't support that yet.

00:25:20.600 | Why is that hard?

00:25:21.920 | Just more context about it.

00:25:23.320 | - So yeah, we put out parallel function calling

00:25:25.520 | in Dev Day last year as well.

00:25:26.960 | And it's kind of the evolution of function calling.

00:25:29.080 | So function calling V1, you just get one function back.

00:25:31.840 | Function calling V2, you can get multiple back

00:25:33.360 | at the same time and save latency.

00:25:35.000 | We have this in our API, all our models support it,

00:25:36.760 | or all of our newer models support it,

00:25:38.640 | but we don't support it with structured outputs right now.

00:25:41.640 | And there's actually a very interesting trade-off here.

00:25:44.360 | So when you basically call our API for structured outputs

00:25:48.320 | with a new schema,

00:25:49.520 | we have to build this artifact for fast sampling later on.

00:25:52.480 | But when you do parallel function calling,

00:25:54.320 | the kind of schema we follow

00:25:56.080 | is not just directly one of the function schemas.

00:25:58.360 | It's like this combined schema based on a lot of them.

00:26:01.360 | If we were to kind of do the same thing

00:26:02.680 | and build an index every time

00:26:03.840 | you pass in a list of functions,

00:26:05.240 | if you ever change the list,

00:26:06.480 | you would kind of incur more latency.

00:26:08.360 | And we thought it would be really unintuitive

00:26:09.920 | for developers and hard to reason about.

00:26:12.360 | So we decided to kind of wait

00:26:13.960 | until we can support a no-added-latency solution

00:26:16.600 | and not just kind of make it really confusing for developers.

00:26:19.400 | - Mentioning latency,

00:26:20.240 | that is something that people discovered,

00:26:21.800 | is that there is an increased cost and latency

00:26:24.160 | for the first token.

00:26:25.240 | - For the first request, yeah.

00:26:26.120 | - First request.

00:26:26.960 | Is that an issue?

00:26:27.800 | Is that going to go down over time?

00:26:28.680 | Is there just an overhead to parsing JSON

00:26:31.280 | that is just insurmountable?

00:26:33.160 | - It's definitely not insurmountable.

00:26:34.640 | And I think it will definitely go down over time.

00:26:37.040 | We just kind of take the approach of ship early and often.

00:26:41.440 | And if there's nothing in there you don't want to fix,

00:26:45.720 | then you probably shipped too late.

00:26:47.600 | So I think we will get that latency down over time.

00:26:49.960 | But yeah, I think for most developers,

00:26:51.200 | it's not a big concern.

00:26:52.480 | 'Cause you're testing out your integration,

00:26:54.000 | you're sending some requests while you're developing it,

00:26:56.280 | and then it's fast and broad.

00:26:58.000 | So it kind of works for most people.

00:26:59.560 | The alternative design space that we explored

00:27:02.480 | is like pre-registering your schema,

00:27:04.360 | so like a totally different endpoint,

00:27:06.120 | and then passing in like a schema ID.

00:27:08.320 | But we thought, you know, that was a lot of overhead

00:27:10.640 | and like another endpoint to maintain

00:27:12.400 | and just kind of more complexity for the developer.

00:27:15.120 | And we think this latency is going to come down over time.

00:27:17.520 | So it made sense to keep it kind of in chat completions.

00:27:20.440 | - I mean, hypothetically,

00:27:21.720 | if one were to ship caching at a future point,

00:27:24.480 | it would basically be the superset of that.

00:27:26.840 | - Maybe.

00:27:27.680 | I think the caching space is a little underexplored.

00:27:30.280 | Like we've seen kind of two versions of it.

00:27:32.800 | But I think, yeah,

00:27:33.640 | there's ways that maybe put less onus on the developer.

00:27:36.560 | But, you know, we haven't committed to anything yet,

00:27:38.600 | but we're definitely exploring opportunities

00:27:40.320 | for making things cheaper over time.

00:27:42.560 | - Is AGI and Agents just going to be

00:27:44.640 | a bunch of structure upload

00:27:45.800 | and function calling one next to each other?

00:27:47.640 | Like, how do you see, you know,

00:27:49.240 | there's like the model does everything.

00:27:50.760 | Where do you draw the line?

00:27:51.800 | Because you don't call these things like an agent API,

00:27:54.600 | but like if I were a startup trying to raise a C round,

00:27:57.120 | I would just do function calling and say,

00:27:58.480 | this is an agent API.

00:27:59.920 | So how do you think about the difference

00:28:01.280 | and like how people build on top of it

00:28:02.640 | for like agentic systems?

00:28:04.200 | - Yeah, love that question.

00:28:05.480 | One of the reasons we wanted to build structured outputs

00:28:07.520 | is to make agentic applications actually work.

00:28:09.880 | So right now it's really hard.

00:28:10.840 | Like if something is 95% reliable,

00:28:13.600 | but you're chaining together a bunch of calls,

00:28:15.680 | if you magnify that error rate,

00:28:17.120 | it makes your like application not work.

00:28:18.960 | So that's a really exciting thing here

00:28:20.440 | from going from like 95% to 100%.

00:28:23.080 | I'm very biased working in the API

00:28:24.760 | and working on function calling and structured outputs,

00:28:26.600 | but I think those are the building blocks

00:28:28.200 | that we'll be using kind of to distribute

00:28:30.000 | this technology very far.

00:28:31.400 | It's the way you connect like natural language

00:28:34.000 | and converting user intent

00:28:35.640 | into working with your application.

00:28:37.320 | And so I think like kind of,

00:28:38.920 | there's no way to build without it, honestly.

00:28:40.600 | Like you need your function calls to work.

00:28:42.560 | Like, yeah, we wanted to make that a lot easier.

00:28:44.960 | - Yeah, and do you think the assistance

00:28:47.240 | kind of like API thing will be a bigger part

00:28:50.000 | as people build agents?

00:28:51.280 | I think maybe most people just use messages and completion.

00:28:54.440 | - So I would say the assistance API

00:28:56.120 | was kind of a bet in a few areas.

00:28:58.200 | One bet is hosted tools.

00:28:59.880 | So we have the file search tool and code interpreter.

00:29:02.640 | Another bet was kind of statefulness.

00:29:04.640 | It's our first stateful API.

00:29:06.280 | It'll store threads and you can fetch them later.

00:29:09.160 | I would say the hosted tools aspect

00:29:11.280 | has been really successful.

00:29:12.760 | Like people love our file search tool

00:29:14.920 | and it's like saves a lot of time

00:29:16.960 | to not build your own rag pipeline.

00:29:19.000 | I think we're still iterating on the shape

00:29:20.680 | for the stateful thing to make it as useful as possible.

00:29:23.520 | Right now, there's kind of a few endpoints you need to call

00:29:26.160 | before you can get a run going.

00:29:27.800 | And we want to work to make that, you know,

00:29:29.360 | much more intuitive and easier over time.

00:29:31.400 | - One thing I'm just kind of curious about,

00:29:32.840 | did you notice any trade-offs

00:29:34.760 | when you add more structured output,

00:29:36.760 | it gets worse at some other thing that was like kind of,

00:29:39.400 | you didn't think was related at all?

00:29:41.120 | - Yeah, it's a good question.

00:29:42.600 | Yeah, I mean, models are very spiky

00:29:44.760 | and RL is hard to predict.

00:29:47.440 | And so every model kind of improves on some things

00:29:50.640 | and maybe is flat or neutral on other things.

00:29:52.880 | - Yeah, like it's like very rare to just add a capability

00:29:56.800 | and have no trade-offs in everything else.

00:29:58.320 | - So yeah, I don't have something off the top of my head,

00:30:00.240 | but I would say, yeah,

00:30:01.080 | every model is a special kind of its own thing.

00:30:04.000 | This is why we put them in API dated

00:30:06.160 | so developers can choose for themselves

00:30:07.680 | which one works best for them.

00:30:09.120 | In general, we strive to continue improving on all evals,

00:30:12.320 | but it's stochastic.

00:30:14.040 | - Yeah, able to apply the structured output system

00:30:17.120 | on backdated models like 4.0 May,

00:30:21.080 | as well as Mini, as well as August.

00:30:23.280 | - Actually the new response format

00:30:25.680 | is only available on two models.

00:30:27.360 | It's 4.0 Mini and the new 4.0.

00:30:30.160 | So the old 4.0 doesn't have the new response format.

00:30:33.640 | However, for function calling,

00:30:35.000 | we were able to enable it for all models

00:30:36.640 | that support function calling.

00:30:38.000 | And that's because those models were already trained

00:30:40.320 | to follow these schemas.

00:30:41.880 | We basically just didn't wanna add the new response format

00:30:44.400 | to models that would do poorly at it

00:30:46.400 | because they would just kind of do infinite white space,

00:30:48.840 | which is the most likely token

00:30:50.600 | if you have no idea what's going on.

00:30:52.240 | - I just wanted to call out a little bit more

00:30:53.640 | in the stuff you've done in the blog posts.

00:30:55.600 | So in blog posts, just use cases, right?

00:30:57.280 | I just want people to be like,

00:30:58.640 | yeah, we're spelling it out for you.

00:30:59.960 | Use these for extracting structured data

00:31:02.080 | from unstructured data.

00:31:03.240 | By the way, it does vision too, right?

00:31:05.440 | So that's cool.

00:31:06.880 | Dynamic UI generation.

00:31:08.480 | Actually, let's talk about dynamic UI.

00:31:10.160 | I think gen UI, I think,

00:31:12.000 | is something that people are very interested in.

00:31:14.120 | As your first example, what did you find about it?

00:31:16.920 | - Yeah, I just thought it was a super cool capability

00:31:19.040 | we have now.

00:31:19.880 | So the schemas, we support recursive schemas,

00:31:22.600 | and this allows you to do really cool stuff.

00:31:24.280 | Like, every UI is a nested tree that has children.

00:31:28.080 | So I thought that was super cool.

00:31:29.160 | You can use one schema and generate tons of UIs.

00:31:33.520 | As a backend engineer who's always struggled

00:31:35.520 | with JavaScript and frontend,

00:31:37.280 | for me, that's super cool.

00:31:38.720 | We've now built a system where I can get

00:31:40.400 | any frontend that I want.

00:31:42.080 | So yeah, that's super cool.

00:31:43.120 | The extracting structured data,

00:31:45.600 | the reality of a lot of AI applications

00:31:47.440 | is you're plugging them into your enterprise business

00:31:50.600 | and you have something that works,

00:31:52.080 | but you want to make it a little bit better.

00:31:53.920 | And so the reliability gains you get here

00:31:55.880 | is you'll never get a classification

00:31:58.600 | using the wrong enum.

00:32:00.000 | It's just exactly your types.

00:32:02.200 | So really excited about that.

00:32:03.600 | - Like maybe hallucinate the actual values, right?

00:32:06.240 | So let's clearly state what the guarantees are.

00:32:08.520 | The guarantees is that this fits the schema,

00:32:09.960 | but the schema itself may be too broad

00:32:13.160 | because the JSON schema type system doesn't say like,

00:32:16.240 | I only want to range from one to 11.

00:32:19.040 | You might give me zero.

00:32:20.480 | You might give me 12.

00:32:21.520 | - So yeah, JSON schema.

00:32:22.680 | So this is actually a good thing to talk about.

00:32:24.200 | So JSON schema is extremely vast

00:32:26.560 | and we weren't able to support every corner of it.

00:32:29.600 | So we kind of support our own dialect

00:32:31.480 | and it's described in the docs.

00:32:33.280 | And there are a few trade-offs we had to make there.

00:32:35.040 | So by default,

00:32:36.080 | if you don't pass in additional properties in a schema,

00:32:39.520 | by default, that's true.

00:32:40.800 | And so that means you can get other keys,

00:32:43.200 | which you didn't spell out,

00:32:44.720 | which is kind of the opposite of what developers want.

00:32:47.000 | You basically want to supply the keys and values

00:32:48.600 | and you want to get those keys and values.

00:32:50.520 | And so then we had to decision to make.

00:32:51.920 | It's like, do we redefine what additional properties means

00:32:54.800 | as the default?

00:32:56.040 | And that felt really bad.

00:32:56.960 | It's like, there's a schema that's predated us.

00:32:58.840 | Like, it wouldn't be good.

00:33:00.240 | It'd be better to play nice with the community.

00:33:01.920 | And so we require that you pass it in as false.

00:33:04.560 | One of our design principles is to be very explicit

00:33:06.960 | and so developers know what to expect.

00:33:09.200 | And so this is one where we decided,

00:33:11.240 | it's a little harder to discover,

00:33:12.800 | but we think you should pass this thing in

00:33:14.960 | so that we can have a very clear definition

00:33:16.880 | of what you mean and what we mean.

00:33:18.360 | There's a similar one here with required.

00:33:20.560 | By default, every key in JSON schema is optional,

00:33:23.760 | but that's not what developers want, right?

00:33:25.520 | You'd be very surprised if you passed in a bunch of keys

00:33:28.480 | and you didn't get some of them back.

00:33:29.840 | And so that's the trade-off we made,

00:33:31.040 | is to make everything required

00:33:32.400 | and have the developers spell that out.

00:33:33.880 | - Is there a require false?

00:33:35.560 | Can people turn it off or they're just getting all--

00:33:38.160 | - So developers can, basically what we recommend for that

00:33:41.120 | is to make your actual key a union type.

00:33:43.920 | And so-- - Nullable.

00:33:45.040 | - Yeah, make it union of int and null

00:33:47.520 | and that gets you the same behavior.

00:33:48.920 | - Any other of the examples you want to dive into,

00:33:50.880 | math, chain of thought?

00:33:52.120 | - Yeah, you can now specify like a chain of thought field

00:33:55.120 | before a final answer.

00:33:56.520 | This is just like a more structured way

00:33:57.960 | of extracting the final answer.

00:33:59.680 | One example we have, I think we put up a demo app

00:34:02.800 | of this math tutoring example, or it's coming out soon.

00:34:05.760 | - Did I miss it?

00:34:06.600 | Oh, okay, well.

00:34:07.420 | - Basically, it's this math tutoring thing

00:34:08.920 | and you put in an equation

00:34:10.640 | and you can go step by step and answer it.

00:34:12.760 | This is something you can do now with Structured Office.

00:34:14.480 | In the past, a developer would have to specify their format

00:34:17.560 | and then write a parser and parse out the model's output

00:34:20.800 | to be pretty hard.

00:34:21.640 | But now you just specify steps and it's an array of steps

00:34:24.480 | and every step you can render and then the user can try it

00:34:27.160 | and you can see if it matches and go on that way.

00:34:29.720 | So I think it just opens up a lot of opportunities.

00:34:32.120 | Like for any kind of UI where you want to treat

00:34:34.440 | different parts of the model's responses differently,

00:34:36.720 | Structured Outputs is great for that.

00:34:38.280 | - I remembered my question from earlier.

00:34:40.280 | I'm basically just using this to ask you all the questions

00:34:43.400 | as a user, as a daily user of the stuff that you put out.

00:34:46.160 | So one is a tip that people don't know

00:34:47.920 | and I confronted you on Twitter,

00:34:48.960 | which is you respect descriptions of JSON schemas, right?

00:34:52.240 | And you can basically use that as a prompt for the field.

00:34:54.760 | - Totally.

00:34:55.580 | - I assume that's blessed

00:34:56.420 | and people should do that. - Intentional, yeah.

00:34:58.560 | - One thing that I started to do,

00:35:00.320 | which I don't, it could be a hallucination of me,

00:35:02.160 | is I changed the property name to prompt the model

00:35:07.160 | to what I wanted to do.

00:35:08.200 | So for example, instead of saying topics as a property name,

00:35:12.600 | I would say like, "Brainstorm a list of topics up to five,"

00:35:16.800 | something like that as a property name.

00:35:19.280 | I could stick that in the description as well,

00:35:20.840 | but is that too much? (laughs)

00:35:23.240 | - Yeah, I would say, I mean, we're so early in AI

00:35:26.760 | that people are figuring out the best way to do things.

00:35:29.000 | And I love when I learn from a developer

00:35:30.880 | like a way they found to make something work.

00:35:33.280 | In general, I think there's like three

00:35:34.880 | or four places to put instructions.

00:35:37.200 | You can put instructions in the system message

00:35:39.000 | and I would say that's helpful

00:35:40.720 | for like when to call a function.

00:35:42.720 | So it's like, let's say you're building

00:35:44.840 | a customer support thing and you want the model

00:35:47.320 | to verify the user's phone number or something.

00:35:49.280 | You can tell the model in the system message,

00:35:51.000 | like here's when you should call this function.

00:35:52.680 | Then when you're within a function,

00:35:53.960 | I would say the descriptions there

00:35:55.160 | should be more about how to call a function.

00:35:57.400 | So really common is someone will have like a date

00:35:59.800 | as a string, but you don't tell the model,

00:36:01.880 | like, do you want year, year, month, month, day, day?

00:36:04.120 | Or do you want that backwards?

00:36:05.760 | And that's what a really good spot is

00:36:07.400 | for those kinds of descriptions.

00:36:08.480 | It's like, how do you call this thing?

00:36:10.200 | And then sometimes there's like really stuff

00:36:12.880 | like what you're doing.

00:36:13.720 | It's like, name the key by what you want.

00:36:16.080 | So sometimes people put like, do not use.

00:36:18.200 | And you know, if they don't want, you know,

00:36:19.920 | this parameter to be used except only in some circumstances.

00:36:22.920 | And really, I think that's the fun nature of this.

00:36:24.840 | It's like, you're figuring out the best way

00:36:26.560 | to get something out of the model.

00:36:27.760 | - Okay, so you don't have an official recommendation

00:36:29.720 | is what I'm hearing.

00:36:30.720 | - Well, the official recommendation is, you know,

00:36:32.800 | how to call a model, system instructions.

00:36:34.240 | - Exactly, exactly.

00:36:35.080 | - Or when to call a function, yeah.

00:36:36.600 | - Do you benchmark these type of things?

00:36:38.720 | So like, say with date, it's like description,

00:36:40.920 | it's like return it in like ISO 8.

00:36:43.880 | Or if you call the key date in ISO A6001,

00:36:47.920 | I feel like the benchmarks don't go that deep,

00:36:50.080 | but then all the AI engineering kind of community,

00:36:53.360 | like all the work that people do, it's like,

00:36:55.320 | oh, actually this performs better,

00:36:56.920 | but then there's no way to verify, you know?

00:36:59.200 | Like even the, I'm gonna tip you $100,000 or whatever,

00:37:03.120 | like some people say it works, some people say it doesn't.

00:37:05.600 | Do you pay attention to this stuff as you build this?

00:37:08.280 | Or are you just like, the model is just gonna get better,

00:37:10.800 | so why waste my time running evals on these small things?

00:37:14.080 | - Yeah, I would say to that,

00:37:15.520 | I would say we basically pick our battles.

00:37:17.840 | I mean, there's so much surface area of LLMs

00:37:20.920 | that we could dig into, and we're just mostly focused

00:37:23.280 | on kind of raising the capabilities for everyone.

00:37:25.880 | I think for customers, and we work with a lot of customers,

00:37:28.720 | really developing their own evals is super high leverage,

00:37:31.800 | 'cause then you can upgrade really quickly

00:37:33.280 | when we have a new model,

00:37:34.160 | you can experiment with these things with confidence.

00:37:36.560 | So yeah, we're hoping to make making evals easier.

00:37:39.160 | I think that's really generally very helpful for developers.

00:37:42.560 | - For people, I would just kind of wrap up the discussion

00:37:44.840 | for structured outputs, I immediately implemented,

00:37:47.720 | we use structured outputs for AI News,

00:37:50.200 | I use Instructor, and I ripped it out,

00:37:52.360 | and I think I saved 20 lines of code,

00:37:55.200 | but more importantly, it was like,

00:37:56.360 | we cut it by 55% of API costs based on what I measured,

00:37:59.960 | because we saved on the retries.

00:38:02.320 | - Nice, yeah, love to hear that.

00:38:03.880 | - Yeah, which people I think don't understand,

00:38:05.480 | when you can't just simply add Instructor or add outlines,

00:38:10.160 | you can do that, but it's actually gonna cost you

00:38:12.280 | a lot of retries to get the model that you want,

00:38:14.800 | but you're kind of just kind of building

00:38:16.240 | that internally into the model.

00:38:17.720 | - Yeah, I think this is the kind of feature

00:38:19.640 | that works really well when it's integrated

00:38:21.400 | with the LLM provider.

00:38:23.320 | Yeah, actually, I had folks, even my husband's company,

00:38:25.840 | he works at a small startup,

00:38:26.800 | they thought we were just retrying,

00:38:28.560 | and so I had to make them re-blog those.

00:38:31.800 | We are not retrying, you know, we're doing it in one shot,

00:38:34.080 | and this is how you save on latency and cost.

00:38:36.120 | - Awesome, any other behind-the-scenes stuff,

00:38:38.520 | just generally on structured outputs?

00:38:40.000 | We're gonna move on to the other models.

00:38:41.960 | - Yeah, I think that's it.

00:38:43.000 | - Oh, look, that's excellent products,

00:38:44.560 | and I think everyone will be using it,

00:38:45.960 | and we have the full story now that people can try out.

00:38:49.280 | So Roadmap would be parallel function calling,

00:38:51.560 | anything else that you've called out as coming soon?

00:38:53.880 | - Not quite soon, but we're thinking about,

00:38:56.120 | does it make sense to expose custom grammars

00:38:59.000 | beyond JSON schema?

00:38:59.960 | - What would you want to hear from developers

00:39:01.880 | to give you information, whether it's custom grammars

00:39:04.320 | or anything else about structured outputs?

00:39:05.800 | What would you want to know more of?

00:39:06.960 | - Just always interested in feature requests,

00:39:09.440 | what's not working, but I'd be really curious,

00:39:11.360 | what specific grammars folks want.

00:39:13.200 | I know some folks want to match programming languages

00:39:15.840 | like Python.

00:39:17.000 | There's some challenges with the expressivity

00:39:20.400 | of our implementation, and so, yeah,

00:39:22.760 | just kind of the class of grammars folks want.

00:39:25.640 | - I have a very simple one,

00:39:26.680 | which is a lot of people try to use GPT as judge, right?

00:39:30.680 | Which means they end up doing a rating system,

00:39:32.720 | and then there's like 10 different kinds of rating systems,

00:39:34.600 | there's a Likert scale, there's whatever.

00:39:36.720 | If there was an officially blessed way

00:39:38.400 | to do a rating system with structured outputs,

00:39:41.320 | everyone would use it.

00:39:42.400 | - Yeah, yeah, that makes sense.

00:39:43.520 | I mean, we often recommend using log probs

00:39:47.120 | with classification tasks.

00:39:48.680 | So rather than like sampling,

00:39:51.000 | let's say you have four options,

00:39:52.480 | like red, yellow, blue, green,

00:39:54.080 | rather than sampling two tokens for yellow,

00:39:56.880 | you can just do like A, B, C, D,

00:39:58.880 | and get the log probs of those.

00:40:00.800 | The inherent randomness of each sampling

00:40:02.880 | isn't taken into account,

00:40:04.200 | and you can just actually look at

00:40:05.200 | what is the most likely token.

00:40:07.200 | - I think this is more of like a calibration question.

00:40:09.120 | Like if I asked you to rate things from one to 10,

00:40:11.640 | a non-calibrated model might always pick seven,

00:40:14.400 | just like a human would.

00:40:15.720 | - Right.

00:40:16.800 | - So like actually have a nice gradation from one to 10

00:40:19.840 | would be the rough idea.

00:40:21.920 | And then even for structured outputs,

00:40:23.400 | I can't just say have a field of rating from one to 10

00:40:26.240 | because I have to then validate it,

00:40:28.200 | and it might give me 11.

00:40:29.680 | - Yeah, absolutely.

00:40:31.080 | - So what about model selection?

00:40:33.040 | Now you have a lot of models.

00:40:34.360 | When you first started, you had one model endpoint.

00:40:37.200 | I guess you had like the DaVinci,

00:40:39.280 | but like most people were using one model endpoint.

00:40:42.760 | Today, you have like a lot of competitive models,

00:40:45.040 | and I think we're nearing the end of the 3.5 run RIP.

00:40:49.640 | How do you advise people to like experiment,

00:40:51.840 | select, both in terms of like tasks and like costs,

00:40:54.680 | like what's your playbook?

00:40:56.360 | - In general, I think folks should start with 4.0 mini.

00:40:59.280 | That's our cheapest model,

00:41:01.080 | and it's a great workhorse.

00:41:04.080 | Works for a lot of great use cases.

00:41:05.840 | If you're not finding the performance you need,

00:41:07.640 | like maybe it's not smart enough,

00:41:09.400 | then I would suggest going to 4.0.

00:41:11.440 | And if 4.0 works well for you, that's great.

00:41:13.520 | Finally, there's some like really advanced

00:41:15.280 | frontier use cases, and maybe 4.0 is not quite cutting it,

00:41:18.880 | and there I would recommend our fine tuning API.

00:41:21.120 | Even just like 100 examples is enough to get started there,

00:41:24.400 | and you can really get the performance you're looking for.

00:41:26.360 | - We're recording this ahead of it,

00:41:27.840 | but like you're announcing other some fine tuning stuff

00:41:30.800 | that people should pay attention to.

00:41:32.280 | - Yeah, actually tomorrow we're dropping our GA

00:41:35.200 | for GPT 4.0 fine tuning.

00:41:37.040 | So 4.0 mini has been available for a few weeks now,

00:41:39.640 | and 4.0 is now gonna be generally available.

00:41:42.240 | And we also have a free training offering for a bit.

00:41:45.320 | I think until September 23rd,

00:41:47.120 | you get one million free training tokens a day.

00:41:49.720 | - This is already announced, right?

00:41:51.720 | Am I talking about a different thing?

00:41:52.560 | - So that was for 4.0 mini, and now it's also for 4.0.

00:41:55.120 | So we're really excited to see what people do with it.

00:41:57.000 | And it's actually a lot easier to get started

00:41:58.800 | than a lot of people expect.

00:42:00.000 | I think they might need tens of thousands of examples,

00:42:02.520 | but even 100 really high quality ones,

00:42:04.880 | or 1,000 is enough to get going.

00:42:06.880 | - Oh, well, we might get a separate podcast

00:42:08.960 | just specifically on that,

00:42:10.080 | but we haven't confirmed that yet.

00:42:13.000 | It basically seems like every time,

00:42:15.080 | I think people's concerns about fine tuning

00:42:17.440 | is that they're kind of locked into a model.

00:42:19.200 | And I think you're paving the path for migration of models.

00:42:22.480 | As long as they keep their original data set,

00:42:24.360 | they can at least migrate nicely.

00:42:26.200 | - Yeah, I'm not sure what we've said publicly there yet,

00:42:28.800 | but we definitely wanna make it easier for folks to migrate.

00:42:31.480 | - It's the number one concern.

00:42:32.920 | I'm just, it's obvious. (laughs)

00:42:34.800 | - Absolutely.

00:42:36.080 | I also wanna point people to,

00:42:37.560 | you have official model selection docs,

00:42:39.480 | where it's in the guide, we'll put it in the show notes,

00:42:42.560 | where it says to optimize for accuracy first,

00:42:44.960 | so prompt engineering, RAG, evals, fine tuning.

00:42:46.680 | This was done at Dev Day last year,

00:42:47.960 | so I'm just repeating things.

00:42:49.480 | And then optimize for cost and latency second,

00:42:51.880 | and there's a few sets of steps for optimizing latency,

00:42:55.120 | so people can read up on that stuff.

00:42:57.000 | - Yeah, totally.

00:42:58.280 | - We had one episode with Nigolas Carlini from DeepMind,

00:43:02.000 | and we actually talked about how some people

00:43:04.480 | don't actually get to the boundaries

00:43:06.320 | of the model performance.

00:43:07.360 | You know, they just kind of try one model,

00:43:08.720 | and it's like, "Oh, LLMs cannot do this," and they stop.

00:43:11.520 | How should people get over the hurdle?

00:43:13.160 | It's like, how do you know if you hit the model performance,

00:43:15.880 | or like you hit skill issues?

00:43:17.360 | You know, it's like, "Your prompt is not good,"

00:43:18.800 | or like, "Try another model," and whatnot.

00:43:20.600 | Is there an easy way to do that?

00:43:22.320 | - That's tough.

00:43:23.160 | Some people are really good at prompting,

00:43:24.760 | and they just kind of get it right away,

00:43:26.240 | and for others, it's more of a challenge.

00:43:28.280 | I think there's a lot we can do to make it easier

00:43:30.480 | to prompt our models, but for now,

00:43:32.000 | I think it requires a lot of creativity

00:43:33.640 | and not giving up right away, yeah.

00:43:35.320 | And a lot of people have experience now with ChatGPT.

00:43:38.040 | You know, before, ChatGPT, the easiest way to play

00:43:40.440 | with our models was in the playground,

00:43:42.640 | but now kind of everyone's played with it,

00:43:44.560 | with a model of some sort,

00:43:45.680 | and they have some sort of intuition.

00:43:47.160 | It's like, you know, if I tell you my grandma is sick,

00:43:50.040 | then maybe I'll get the right output,

00:43:51.600 | and we're hoping to kind of remove the need for that,

00:43:53.800 | but playing around with ChatGPT is a really good way

00:43:56.600 | to get a feel for, you know, how to use the API as well.

00:43:59.320 | - Will prompt engineering be here forever,

00:44:01.320 | or is it a dying art as the models get better?

00:44:04.520 | - I mean, it's like the perennial question

00:44:05.960 | of software engineering as well.

00:44:07.320 | It's like, as the models get better at coding, you know,

00:44:09.440 | if we hit a hundred on SWE Bench, what does that mean?

00:44:11.680 | I think there will always be alpha in people

00:44:13.520 | who are able to, like, clearly explain

00:44:16.040 | what they're trying to build.

00:44:17.280 | Most of engineering is like figuring out the requirements

00:44:20.120 | and stating what you're trying to do,

00:44:22.080 | and I believe this will be the case with AI as well.

00:44:24.480 | You're going to have to very clearly explain what you need,

00:44:26.600 | and some people are better than others at it,

00:44:29.080 | and people will always be building.

00:44:30.880 | It's just the tools are going to get far better.

00:44:32.960 | - Last two weeks, you released two models.

00:44:34.720 | There's GPT 4.0 2034-0806,

00:44:37.680 | and then there's also ChatGPT 4.0 latest.

00:44:39.680 | I think people were a little bit confused by that,

00:44:41.080 | and then you issued a clarification that it was,

00:44:43.600 | one's chat-tuned and the other is more

00:44:45.520 | function calling-tuned.

00:44:46.760 | Can you elaborate, just?

00:44:47.720 | - Yeah, totally.

00:44:49.120 | So part of the impetus here was to kind of very transparent

00:44:52.120 | with what's on ChatGPT and in the API.

00:44:54.960 | So basically, we're often training models,

00:44:57.520 | and there are different use cases.

00:44:58.880 | So you don't really need function calling

00:45:01.080 | for user-defined functions in ChatGPT.

00:45:03.280 | And so this gives us kind of the freedom

00:45:04.800 | to build the best model for each use case.

00:45:07.160 | So in ChatGPT latest, we're releasing

00:45:09.320 | kind of this rolling model.

00:45:11.320 | The weights aren't pinned.

00:45:12.520 | As we release new models--

00:45:14.320 | - This is literally what we use.

00:45:15.920 | - Yeah, so it's in what's in ChatGPT,

00:45:18.120 | so it's very good for like chat-style use cases.

00:45:20.800 | But for the API broadly, you know,

00:45:23.400 | we really tune our models to be good at things

00:45:25.640 | that developers want, like function calling

00:45:27.320 | and structured outputs, and when a developer builds

00:45:29.880 | their application, they want to know

00:45:31.400 | that kind of the weights are stable under them.

00:45:33.600 | And so we have this offering where it's like,

00:45:36.000 | if you're tuning to a specific model

00:45:37.640 | and you know your function works,

00:45:39.160 | you know it will never change the weights out from under you.

00:45:42.040 | And so those are the models we commit

00:45:43.240 | to supporting for a long time,

00:45:44.960 | and we think those are the best for developers.

00:45:47.160 | But we want to give it up, you know,

00:45:48.480 | we want to leave the choice to developers.

00:45:49.800 | Like, do you want the ChatGPT model

00:45:51.480 | or do you want the API model?

00:45:53.040 | And you have the freedom to choose what's best for you.

00:45:55.040 | - I think it's for people,

00:45:56.760 | they do want to pin model versions,

00:45:59.040 | so I don't know when they would use ChatGPT,

00:46:02.400 | like the rolling one, unless they're really

00:46:05.040 | just kind of cloning ChatGPT.

00:46:06.960 | Which is like, why would they?

00:46:08.600 | - I mean, I think there's a lot of interesting stuff

00:46:11.920 | that developers can do when unbounded,

00:46:14.120 | and so we don't want to limit them artificially.

00:46:16.400 | So it's kind of survival of the fittest,

00:46:18.920 | like whichever model is better, you know,

00:46:20.640 | that's the one that people should use.

00:46:21.880 | - Yeah, I talked about it to my friends,

00:46:23.680 | there's like, this isn't that new thing.

00:46:25.920 | And basically, OpenAI has never actually shared with you

00:46:29.000 | the actual ChatGPT model, and now they do.

00:46:32.040 | - Well, it's not necessarily true.

00:46:33.400 | Actually, a lot of the models we have shipped

00:46:35.240 | have been the same, but you know,

00:46:37.760 | sometimes they diverge and it's not a limitation

00:46:39.840 | we want to stick around.

00:46:41.520 | - Anything else we should know about the new model?

00:46:43.160 | I don't think there was no evals announced or anything,

00:46:47.160 | but people say it's better.

00:46:48.120 | I mean, obviously, LMSYS is like way better

00:46:50.040 | above on everything, right?

00:46:50.880 | It's like number one in the world on--

00:46:52.640 | - Yeah, we published some release notes.

00:46:55.800 | They're not as in-depth as we want to be yet,

00:46:57.840 | because it's still kind of a science

00:46:59.560 | and we're learning what actually changes with each model

00:47:02.160 | and how can we better understand the capabilities.

00:47:04.720 | But we are trying to do more release notes in the future

00:47:07.600 | and keep folks updated.

00:47:09.400 | But yeah, it's kind of an art and a science right now.

00:47:12.640 | - You need the best evals team in the world

00:47:15.600 | to help you figure this out.

00:47:16.960 | - Yeah, evals are hard.

00:47:17.880 | We're hiring if you want to come work on evals.

00:47:20.120 | - Hold that thought on hiring.

00:47:21.320 | We'll come back to the end on what you're looking for,

00:47:23.720 | 'cause obviously people want to join you

00:47:25.240 | and they want to know what qualities you're looking for.

00:47:27.960 | - So we just talked about API versus ChargedGBT.

00:47:31.240 | What's, I guess, the vision for the interface?

00:47:34.200 | You know, the mission of OpenAI is to build AGI

00:47:36.800 | that is accessible.

00:47:37.640 | Like, where is it going to come from?

00:47:40.120 | - Totally, yeah.

00:47:41.080 | So I believe that the API is kind of our broadest vehicle

00:47:44.720 | for distributing AGI.

00:47:46.680 | You know, we're building some first-party products,

00:47:48.760 | but they'll never reach every niche in the world

00:47:50.680 | and kind of every corner and community.

00:47:52.520 | And so really love working with developers

00:47:54.680 | and seeing the incredible things they come up with.

00:47:56.840 | I often find that developers kind of see the future

00:47:58.880 | before anyone else,

00:47:59.880 | and we love working with them to make it happen.

00:48:02.320 | And so really the API is a bet on going really broad.

00:48:05.680 | And we'll go very deep as well in our first-party products,

00:48:08.240 | but I think just that our impact is absolutely magnified

00:48:11.240 | by every developer that we uplift.

00:48:13.280 | - They can do the last mile where you cannot.

00:48:15.840 | Like, ChargedGBT is one type of product,

00:48:17.880 | but there's many other kinds.

00:48:19.400 | In fact, you know, I observed, I think in February,

00:48:22.520 | basically, ChargedGBT's user growth stopped

00:48:25.360 | when the API was launched,

00:48:26.760 | because everyone was kind of able to take that

00:48:29.120 | and build other things.

00:48:30.680 | That has not become true anymore

00:48:32.760 | because ChargedGBT growth has continued to grow.

00:48:34.840 | But then, you're not confirming any of this.

00:48:36.880 | This is me quoting similar web numbers,

00:48:38.800 | which have very high variance.

00:48:40.800 | - Well, the API predates ChargedGBT.

00:48:42.440 | The API was actually opened as first product,

00:48:44.680 | and the first idea for commercialization,

00:48:47.320 | that predates me as well.

00:48:48.560 | - Wide release.

00:48:49.520 | Like, GA, everyone can sign up and use it immediately.

00:48:52.320 | Yeah, that's what I'm talking about.

00:48:53.440 | But yeah, I mean, I do believe that.

00:48:55.520 | And, you know, that means you also have to expose

00:48:57.920 | all of open-end models, right?

00:48:59.920 | Like, all the multi-modal models.

00:49:02.200 | We'll ask you questions on that,

00:49:03.800 | but I think that API mission is important.

00:49:07.440 | It's interesting that the hottest new programming language

00:49:10.040 | is supposed to be English,

00:49:12.000 | but it's actually just software engineering, right?

00:49:14.640 | It's just, you know, we're talking about HTTP error codes.

00:49:17.480 | - Right.

00:49:18.320 | (laughs)

00:49:19.160 | - Yeah, I think, you know,

00:49:20.040 | engineering is still the way you access these models.

00:49:22.080 | And I think there are companies working on tools

00:49:25.640 | to make engineering more accessible for everyone,

00:49:28.360 | but there's still so much alpha

00:49:29.840 | in just writing code and deploying.

00:49:32.400 | - Yeah, one might even call it AI engineering.

00:49:34.560 | - Exactly. - I don't know.

00:49:35.880 | Yeah, so like, there's lots of war stories

00:49:37.960 | from building this platform.

00:49:39.360 | We started at the start of your career,

00:49:40.840 | and then we jumped straight to structured outputs.

00:49:42.760 | There's a whole thing, like two years,

00:49:44.440 | that we skipped in between.

00:49:45.760 | What have become your principles?

00:49:47.200 | What are your favorite stories that you like to tell?

00:49:50.120 | - We had so much fun working on the Assistance API

00:49:52.600 | and leading up to Dev Day.

00:49:53.760 | You know, things are always pretty chaotic

00:49:55.440 | when you have an externally, like a date--

00:49:58.120 | - Forcing function.

00:49:58.960 | - That is hard, and there's like a stage,

00:50:00.840 | and there's like 1,000 people coming.

00:50:03.200 | - You can always launch a wait list, I mean.

00:50:04.600 | (laughs)

00:50:05.960 | - We're trying hard not to,

00:50:07.680 | because, you know, we love it

00:50:08.760 | when people can access the thing on day one.

00:50:10.600 | And so, yeah, the Assistance API,

00:50:12.720 | we had like this really small team,

00:50:14.960 | and just working as hard as we could

00:50:18.200 | to make this come to life.

00:50:19.360 | But even, actually, the morning of,

00:50:21.320 | I don't know if you'll remember this,

00:50:22.360 | but Sam did this keynote,

00:50:24.360 | and Ramon came up,

00:50:25.960 | and they gave free credits to everybody.

00:50:28.480 | So that was live, fully live,

00:50:30.120 | as were all of the demos that day.

00:50:32.040 | But actually, maybe like two hours before that,

00:50:34.680 | we had a little outage,

00:50:36.080 | and everyone was like scrambling

00:50:37.600 | to make this thing work again.

00:50:39.320 | So, yeah, things are early and scrappy here,

00:50:42.160 | and, you know, we were really glad.

00:50:43.640 | We were a bit on the edge of our seat watching it live.

00:50:46.440 | - What's the plan B in that situation?

00:50:48.360 | If you can share.

00:50:49.200 | - Play a video.

00:50:50.040 | This is classic DevRel, right?

00:50:51.160 | I don't know.

00:50:52.000 | - I mean, I actually don't know what the plan B was.

00:50:55.120 | - No plan B.

00:50:55.960 | No failure.

00:50:56.800 | - But we just, you know, we fixed it.

00:50:58.120 | We got everything running again,

00:50:59.760 | and the demo went well.

00:51:01.640 | - Just higher-cracked Waterloo tracks.

00:51:02.960 | - Exactly.

00:51:03.800 | (laughs)

00:51:04.640 | - Skill issues, as usual.

00:51:05.720 | - Sometimes you just gotta make it happen.

00:51:07.960 | - I imagine it's actually very motivating,

00:51:09.560 | but I did hear that after Dev Day,

00:51:11.200 | like the whole company got like a few weeks off,

00:51:13.520 | just to relax a little bit.

00:51:15.760 | - Yeah, we sometimes get,

00:51:18.680 | like we just had the week of July 4th off, and yeah.

00:51:21.320 | It's hard to take vacation,

00:51:22.640 | because people are working on such exciting things,

00:51:24.320 | and it's like, you get a lot of FOMO on vacation,

00:51:26.640 | so it helps when the whole company's on vacation.

00:51:28.800 | - Mentioning Ace Assistance API,

00:51:30.400 | you actually announced a roadmap there,

00:51:32.600 | and things have developed.

00:51:34.200 | I think people may not be up to date.

00:51:36.440 | What's the offering today versus, you know, one year ago?

00:51:39.480 | - Yeah, so we've made a bunch of key improvements.

00:51:41.960 | I would say the biggest one is in the file search product.

00:51:44.520 | Before, we only supported, I think,

00:51:46.120 | like 20 files per assistant,

00:51:47.960 | and the way we used those files was like less effective.

00:51:51.120 | Basically, the model would decide based on the file name,

00:51:54.080 | whether to search a file,

00:51:55.040 | and there's not a ton of information in there.

00:51:57.360 | So our new offering, which we shipped a few months ago,

00:51:59.600 | I think now allows 10K files per assistant,

00:52:02.760 | which is like dramatically more.

00:52:04.640 | And also, it's a kind of different operation.

00:52:07.320 | So you can search semantically over all files at once,

00:52:09.600 | rather than just kind of the model choosing one up front.

00:52:12.080 | So a lot of customers have seen really good performance.

00:52:14.400 | We also have exposed more like chunking

00:52:16.400 | and re-ranking options.

00:52:18.000 | I think the re-ranking one is coming,

00:52:19.560 | I think, next week or very soon.

00:52:21.960 | So this kind of gives developers more control

00:52:24.200 | and more flexibility there.

00:52:25.320 | So we're trying to make it the easiest way

00:52:27.720 | to kind of do RAG at scale.

00:52:29.560 | - Yeah, I think that visibility into the RAG system

00:52:33.360 | was the number one thing missing from Dev Day,

00:52:35.520 | and then people got their first impressions,

00:52:37.240 | and then they never looked at it again.

00:52:38.680 | So that's important.

00:52:39.840 | The re-ranker is a core feature of, let's say,

00:52:42.800 | some other Foundation Model Labs.

00:52:44.920 | Is OpenAI going to offer a re-ranking service,

00:52:47.760 | a re-ranker model?

00:52:49.360 | - So we do re-ranking as part of it.

00:52:51.560 | I think we're soon going to ship more controls for that.

00:52:54.360 | - Okay, got it.

00:52:55.200 | And if I'm an existing LANG chain, MAMA index, whatever,

00:52:58.800 | how do you compare?

00:52:59.720 | Do you make different choices?

00:53:01.360 | Where does that exist in the spectrum of choices?

00:53:04.400 | - I think we are just coming at it

00:53:06.480 | trying to be the easiest option.

00:53:08.600 | And so ideally, you don't have to know what a re-ranker is,

00:53:12.080 | and you don't have to have a chunking strategy,

00:53:14.160 | and the thing just kind of works out of the box.

00:53:16.040 | So I would say that's where we're going,

00:53:18.160 | and then giving controls to the power users

00:53:20.400 | to make the changes they need.

00:53:22.480 | - Awesome.

00:53:23.320 | I'm going to ask about a couple other things,

00:53:24.320 | just updates on stuff also announced at Dev Day,

00:53:26.640 | and we talked about this before.

00:53:28.120 | Determinism, something that people really want.

00:53:30.480 | Dev Day will announce the Seed Program

00:53:32.240 | as well as System Fingerprint.

00:53:33.800 | And objectively, I've heard issues.

00:53:36.320 | - Yeah.

00:53:37.160 | - I don't know what's going on.

00:53:38.000 | - The Seed parameter is not fully deterministic,

00:53:41.280 | and it's kind of a best effort thing.

00:53:43.280 | - Yeah.

00:53:44.120 | - So you'll notice there's more determinism

00:53:45.320 | in the first few tokens.

00:53:46.440 | That's kind of the current implementation.

00:53:48.160 | We've heard a lot of feedback.

00:53:49.320 | We're thinking about ways to make it better,

00:53:51.080 | but it's challenging.

00:53:51.960 | It's kind of trading off against reliability and uptime.

00:53:55.280 | - Other maybe underrated API-only thing,

00:53:58.440 | Logit Bias, that's another thing

00:54:00.600 | that kind of seems very useful,

00:54:02.160 | and then maybe most people are like,

00:54:03.680 | it's a lot of work, I don't want to use it.

00:54:05.320 | Do you have any examples of use cases

00:54:07.760 | or products that are made a lot better through using it?

00:54:11.360 | - So yeah, classification is the big one.

00:54:13.480 | So Logit Bias, your valid classification outputs,

00:54:17.000 | and you're more likely to get something that matches.

00:54:19.680 | We've seen people Logit Bias punctuation tokens,

00:54:23.160 | maybe trying to get more succinct writing.

00:54:26.000 | Yeah, it's generally very much a power user feature,

00:54:28.760 | and so not a ton of folks use it.

00:54:30.720 | - I actually wanted to use it

00:54:32.280 | to reduce the incidence of the word delve.

00:54:34.320 | - Yeah.

00:54:35.240 | - Have people done that?

00:54:36.640 | - Probably, I don't know, is delve one token?

00:54:38.920 | You're probably, you got to do a lot of permutations.

00:54:41.680 | - It's used so much.

00:54:42.520 | - Maybe it is, depends on the tokenizer.

00:54:44.400 | - Are there non-public tokenizers?

00:54:46.160 | I guess you cannot answer or you would omit it.

00:54:48.480 | Are the 100K and 200K vocabs,

00:54:50.720 | like the ones that you use across all models, or?

00:54:53.240 | - Yeah, I think we have docs that publish more information.

00:54:57.000 | I don't have it off the top,

00:54:58.120 | but I think we publish which tokenizers for which model.

00:55:00.480 | - Okay, so those are the only two.

00:55:02.080 | - Rate, the tiering rate limiting system,

00:55:04.320 | I don't think there was an official blog post

00:55:06.200 | kind of announcing this,

00:55:07.040 | but it was kind of mentioned that you started tying

00:55:09.680 | fine tuning to tiering and feature rollouts.

00:55:13.960 | Just from your point of view, how do you manage that?

00:55:16.760 | And what should people know

00:55:17.680 | about the tiering system and rate limiting?

00:55:20.120 | - Yeah, I think basically the main changes here

00:55:22.280 | were to be more transparent and easier to use.

00:55:24.280 | So before developers didn't know what tier they're in,

00:55:26.280 | and now you can see that in the dashboard.

00:55:29.040 | I think it's also, I think we publish

00:55:31.400 | how you move from tier to tier.

00:55:33.400 | And so this just helps us do kind of gated rollouts

00:55:36.040 | for the fine tuning launch.

00:55:37.360 | I think everyone tier two and up has full access.

00:55:40.200 | - That makes sense.

00:55:41.040 | I would just advise people to just get to tier five

00:55:43.240 | as quickly as possible.

00:55:44.240 | (both laughing)

00:55:45.240 | - Sure.

00:55:46.080 | - Like a gold star customer, you know?

00:55:46.920 | Like, I don't know, it seems to make sense.

00:55:48.520 | - Do we want to maybe wrap with future things

00:55:51.200 | and kind of like how you think about designing and everything?

00:55:53.800 | So you just mentioned you want to be the easiest way

00:55:56.320 | to basically do everything.

00:55:58.240 | What's the relationship with other people building

00:56:00.800 | in the developer ecosystem?

00:56:02.400 | Like I think maybe in the early days, it's like, okay,

00:56:05.280 | we only have these APIs and then everybody helps us,

00:56:07.400 | but now you're kind of building a whole platform.

00:56:09.560 | How do you make decisions?

00:56:10.960 | - Yeah, I think kind of the 80/20 principle applies here.

00:56:13.960 | We'll build things that kind of capture, you know,

00:56:16.240 | 80% of the value and maybe leave the long tail

00:56:18.720 | to other developers.

00:56:19.960 | So we really prioritize by like,

00:56:21.400 | how much feedback are we getting?

00:56:22.560 | How much easier will this make something,

00:56:25.200 | like an integration for a developer?

00:56:27.040 | So yeah, we want to do more in this space

00:56:29.000 | and not just be an LLM as a service,

00:56:31.600 | but kind of AI development platform as a service.

00:56:34.320 | - Ooh, okay.

00:56:35.160 | That ties into a thing that I put in the notes

00:56:38.040 | that we prepped.

00:56:38.880 | There are other companies trying

00:56:40.120 | to be AI development platform.

00:56:42.280 | So you will compete with them

00:56:45.080 | or they just want to know what you won't build

00:56:48.200 | so that they can build it? (laughs)

00:56:50.080 | - Yeah, it's a tough question.

00:56:51.960 | I think we haven't, you know,

00:56:53.320 | determined what exactly we will and won't build,

00:56:55.600 | but you can think of something,

00:56:57.040 | if it makes it a lot easier for developers to integrate,

00:56:59.520 | you know, it's probably on our radar

00:57:01.200 | and we'll, you know, stack rank by impact.

00:57:03.440 | - Yeah, so there's like cost tracking and model fallbacks.

00:57:06.720 | Model fallbacks is an interesting one

00:57:08.240 | because people do it.

00:57:09.680 | I don't think it adds a ton of value,

00:57:11.680 | but like if you don't build it, I have to build it

00:57:14.160 | because if one API is down or something,

00:57:16.800 | I need to fall back to another one.

00:57:18.360 | - Yeah, I mean, the way we're targeting that user need

00:57:20.340 | is just by investing a lot in reliability.

00:57:22.880 | And so we- - Oh yeah.

00:57:24.320 | - We have- - Just don't fail.

00:57:25.960 | - I mean, we have improved our uptime

00:57:27.840 | like pretty dramatically over the last year

00:57:29.680 | and it's been, you know,

00:57:30.720 | the result of a lot of hard work from folks.

00:57:32.640 | So you'll see that on our status page

00:57:34.520 | and our continued commitment going forward.

00:57:37.080 | - Is the important thing about owning the platform

00:57:39.720 | that it gives you the flexibility

00:57:41.200 | to put all the kind of messy stuff behind the scenes?

00:57:44.960 | Or yeah, how do you draw the line

00:57:46.440 | between what you want to include?

00:57:48.080 | - Yeah, I just think of it as like,

00:57:49.640 | how can we onboard the next generation of AI engineers,

00:57:53.400 | as you put it, right?

00:57:54.440 | Like what's the easiest way

00:57:55.600 | to get them building really cool apps?

00:57:57.440 | And I think it's by building stuff

00:57:58.800 | to kind of hide this complexity

00:58:00.480 | or just make it really easy to integrate.

00:58:02.640 | So I think of it a lot as like,

00:58:04.160 | what is the value add we can provide

00:58:05.720 | beyond just the models that makes the models really useful?

00:58:08.760 | - Okay, we'll touch on four more features

00:58:10.360 | of the API platform that we prepped.

00:58:12.320 | Batch, Vision, Whisper, and then Team Enterprise stuff.

00:58:15.120 | So you wanted to talk about Batch.

00:58:16.680 | - Yeah. - So the rough idea is

00:58:19.160 | you give a, the contract between you and me

00:58:21.120 | is that I give you the Batch job.

00:58:23.360 | You have 24 hours to run it.

00:58:25.120 | It's kind of like spot inst for the API.

00:58:28.960 | What should people know about it?

00:58:31.040 | - So it's half off, which is a great savings.

00:58:33.960 | It also works with like 4.0 mini.

00:58:36.200 | So the savings on top of 4.0 mini is pretty crazy.

00:58:39.840 | Like the stuff you can do-

00:58:40.680 | - Like 7.5 cents or something per million.

00:58:42.560 | - Yeah, I should really have that number top of mind,

00:58:44.640 | but it's like staggeringly cheap.

00:58:46.360 | And so I think this opens up a lot more use cases.

00:58:48.640 | Like let's say you have a user activation flow

00:58:51.800 | and you want to send them an email like maybe every day

00:58:54.240 | or like at certain points in their user journey.

00:58:56.520 | So now you can do this with the Batch API

00:58:58.400 | and something that was maybe a lot more expensive

00:59:00.440 | and not feasible is now very easy to do.

00:59:02.680 | So right now we have this 24 hour turnaround time

00:59:05.040 | for half off and curious,

00:59:07.160 | would love to hear from your community,

00:59:08.440 | like what kind of turnaround time do they want?

00:59:11.040 | - I would be an ideal user of Batch

00:59:12.440 | and I cannot use Batch because it's 24 hours.

00:59:14.520 | I need two to four.

00:59:15.720 | - Two to four hours, okay.

00:59:17.360 | Yeah, that's good to know.

00:59:18.400 | But yeah, just a lot of folks haven't heard about it.

00:59:20.200 | It's also really great for like evals, running them offline.

00:59:22.720 | You don't, generally don't need them to come back

00:59:24.640 | within, you know, two hours.

00:59:26.000 | - I think you could do a range, right?

00:59:27.440 | Two to four for me, like I need to produce a daily thing

00:59:29.840 | and then 24 for like the average use case.

00:59:32.280 | And then maybe like a week, a month, who cares?

00:59:34.280 | Like for people who just have a lot to do.

00:59:36.880 | - Yeah, absolutely.

00:59:37.960 | So yeah, that's Batch API.

00:59:38.920 | I think folks should use it more.

00:59:40.400 | It's pretty cool.

00:59:41.240 | - Is there a future in which like six months is like free?

00:59:45.200 | You know, like is there like small,

00:59:47.200 | is there like super small like shards of like GPU runtime

00:59:51.040 | that like over a long enough timeline

00:59:53.320 | you can just run all these things for free?

00:59:55.400 | - Yeah, it's certainly possible.

00:59:56.760 | I think we're getting to the point

00:59:58.080 | where a lot of these are like almost free.

01:00:00.040 | - That's true.

01:00:00.880 | - Why would they work on something

01:00:02.200 | that's like completely free?

01:00:03.080 | I don't know.

01:00:03.920 | Okay, so Vision.

01:00:05.360 | Vision got G8.

01:00:06.560 | Last year, people were so wild by the GPT-4 demo

01:00:09.560 | and that was primarily Vision.

01:00:11.480 | What was it like building the Vision API?

01:00:13.080 | - Yeah, the Vision API is super cool.

01:00:14.640 | We have a great team working there.

01:00:16.800 | I think the cool thing about Vision

01:00:17.920 | is that it works across our APIs.

01:00:19.800 | So there's, you can use it in the Assistance API,

01:00:22.520 | you can use the Batch API in track completions.

01:00:24.160 | It works with structured outputs.

01:00:25.800 | I think it just helps a lot of folks

01:00:27.520 | with kind of data extraction

01:00:29.440 | where the spatial relationships between the data

01:00:32.760 | is too complicated and you can't get that over text.

01:00:35.360 | But yeah, there's a lot of really cool use cases.

01:00:37.080 | - I think the tricky thing for me is understanding

01:00:40.320 | how frequent to turn Vision from like single images

01:00:44.040 | into like effectively just always watching.

01:00:46.560 | And right now, I think people just like

01:00:48.120 | send a frame every second.

01:00:50.320 | Will that model ever change?

01:00:51.360 | Will there just be like, I stream you a video and then?

01:00:55.160 | - Yeah, I think it's very possible

01:00:56.560 | that we'll have an API where you stream video in

01:00:58.960 | and maybe, you know, to start,

01:01:00.840 | we'll do the frame sampling for you.

01:01:03.240 | - 'Cause the frame sampling is the default, right?

01:01:05.360 | - Right.

01:01:06.200 | - But I feel like it's hacky.

01:01:07.240 | - Yeah, I think it's hard for developers to do.

01:01:09.080 | And so, you know,

01:01:10.120 | we should definitely work on making that easier.

01:01:11.760 | - Is there in the Batch API,

01:01:12.920 | do you have like a time guarantees, like order guarantees?

01:01:17.120 | Like if I send you a Batch request of like a video analysis,

01:01:20.160 | I need every frame to be done in order?

01:01:22.240 | - For Batch, you send like a list of requests

01:01:24.880 | and each of them stand alone.

01:01:26.120 | So you'll get all of them finished,

01:01:27.400 | but they don't kind of chain off each other.

01:01:29.360 | - Well, if you're doing a video, you know,

01:01:31.440 | if you're doing like analyzing a video.

01:01:34.000 | - I wasn't linking video to Batch, but that's interesting.

01:01:36.920 | - Yeah, well, a video it's like, you know,

01:01:38.960 | if you have a very long video,

01:01:40.040 | you can just do a Batch of all the images

01:01:41.800 | and let it process.

01:01:42.640 | - Oh, that's a good idea.

01:01:43.480 | It's like Batch, but serially.

01:01:45.080 | - Sequential true.

01:01:46.040 | - Yeah, yeah, yeah, exactly.

01:01:46.880 | - You know.

01:01:48.120 | - But the whole point of Batch

01:01:49.080 | is you're just using kind of spare time to run it.

01:01:52.120 | Let's talk about my favorite model, Whisper.

01:01:54.520 | Oliver, I built this thing called SmallPodcaster,

01:01:58.240 | which is an open source tool for podcasters.

01:02:00.200 | And why does Whisper API not have diarization

01:02:03.520 | when everybody is transcribing people talking?

01:02:06.320 | That's my main question.

01:02:07.520 | - Yeah, it's a good question.

01:02:08.520 | And you've come to the right person.

01:02:09.600 | I actually worked on the Whisper API and shipped that.

01:02:11.920 | That was one of my first APIs I shipped.

01:02:13.720 | Long story short is that like Whisper V3,

01:02:16.840 | which we open sourced,

01:02:18.080 | has I think the diarization feature,

01:02:20.960 | but there's some like performance trade-offs.

01:02:23.240 | And so Whisper V2 is better at some things than Whisper V3.

01:02:26.240 | And so it didn't seem that worthwhile to ship Whisper V3

01:02:29.880 | compared to like the other things in our priorities.

01:02:32.160 | I think we still will at some point.

01:02:33.840 | But yeah, it's just, you know,

01:02:35.160 | there's always so many things we could work on.

01:02:37.120 | It's tough to do everything.

01:02:38.680 | - We have a Python notebook

01:02:39.800 | that does the diarization for the pod,

01:02:41.800 | but I would just like,

01:02:42.800 | you can translate like 50 languages,

01:02:45.680 | but you cannot tell me who's speaking.

01:02:47.360 | That was like the funniest thing.

01:02:49.600 | - There's like an XKCD thing about this,

01:02:51.800 | about hard problems in AI.

01:02:53.560 | I forget the one. - Yeah, yeah, yeah, exactly.

01:02:55.480 | - Tell me if this was taken in a park.

01:02:57.480 | And like, that's easy.

01:02:58.320 | And it's like, tell me if there's a bird in this picture.

01:02:59.560 | And it's like, give me 10 people on a research team.

01:03:01.800 | It's like, you never know which things are challenging

01:03:04.040 | and diarization is, I think, you know,

01:03:06.560 | more challenging than expected.

01:03:08.800 | - Yeah, yeah.

01:03:09.640 | It still breaks a lot with like overlaps, obviously.

01:03:12.480 | Sometimes similar voices it struggles with.

01:03:14.960 | Like I need to like double read the thing.

01:03:17.240 | - Totally.

01:03:18.080 | - But yeah, great model.

01:03:18.920 | I mean, it would take us so long to do transcriptions.

01:03:21.600 | And I don't know why,

01:03:22.760 | like small podcasts has better transcription

01:03:24.640 | than like mostly every commercial tool.

01:03:27.720 | - It beats the script.

01:03:28.920 | - And I'm like, I'm just using the model.

01:03:30.280 | I'm literally not doing anything.

01:03:31.600 | You know, it's just a notebook.

01:03:33.000 | So yeah, it just speaks to like,

01:03:35.120 | sometimes just using the simple OpenAI model

01:03:37.280 | is better than like figuring out your own pipeline thing.

01:03:40.640 | - I think the top feature request there just would be,

01:03:43.560 | I mean, again, you know,

01:03:44.920 | using you as a feature request, Dom,

01:03:47.240 | is like being able to bias the vocab.

01:03:49.760 | I think there is like in raw Whisper, you can do that.

01:03:51.840 | - You can pass a prompt in the API as well.

01:03:53.960 | - But you pass in the prompts, okay.

01:03:55.360 | - Yeah.

01:03:56.200 | - There's no more deterministic way to do it.

01:03:57.280 | - So this is really helpful when you have like acronyms

01:04:00.600 | that aren't very familiar to the model.

01:04:02.240 | And so you can put them in the prompt

01:04:03.440 | and you'll basically get the transcription

01:04:05.240 | using those correctly.

01:04:06.320 | - We have the AI engineer solution,

01:04:07.960 | which is just a dictionary.

01:04:09.640 | - Nice.

01:04:10.480 | - We're like all the way misspelled it in the past

01:04:12.000 | and then G-sub and like replace the thing.

01:04:14.400 | - If it works, it works.

01:04:15.320 | Like that's engineering.

01:04:17.240 | - It's like, you know, llama with like one L

01:04:19.720 | or like all these different things or like length chain.

01:04:22.600 | It like transcribes length chain and like-

01:04:24.640 | - Capitalization.

01:04:25.480 | - A bunch of like three or four different ways.

01:04:27.560 | - Yeah.

01:04:28.400 | - You guys should try the prompt feature.

01:04:29.720 | - I love these like kind of pro tip.

01:04:31.120 | Okay, fun question.

01:04:32.040 | I know we don't know yet,

01:04:33.720 | but I've been enjoying the advanced voice mode.

01:04:36.480 | It really streams back and forth

01:04:38.320 | and it handles interruptions.

01:04:39.800 | How would your audio endpoint change when that comes out?

01:04:43.640 | - We're exploring, you know, new shape of the API

01:04:46.600 | to see how it would work in this kind of

01:04:48.760 | speech to speech paradigm.

01:04:50.240 | I don't think we're ready to share quite yet,

01:04:51.840 | but we're definitely working on it.

01:04:53.160 | I think just the regular request response

01:04:55.360 | probably isn't going to be the right solution.

01:04:57.800 | - For those who are listening along,

01:04:59.040 | I think it's pretty public that OpenAI uses LiveKit

01:05:01.240 | for the chat GPT app,

01:05:03.000 | which like seems to be the socket based approach

01:05:05.800 | that people should be at least up to speed on.

01:05:08.560 | Like I think a lot of developers only do request response

01:05:10.960 | and like that doesn't work for streaming.

01:05:12.920 | - Yeah.

01:05:13.760 | When we do put out this API,

01:05:14.600 | I think we'll make it really easy for developers

01:05:16.560 | to figure out how to use it.

01:05:17.680 | Yeah.

01:05:18.520 | It's hard to do audio. - It'll be a paradigm change.

01:05:20.160 | Okay.

01:05:21.000 | And then I think the last one on our list

01:05:22.080 | was team enterprise stuff.

01:05:23.160 | Audit logs, service accounts, API keys.

01:05:24.920 | What should people know using the enterprise offering?

01:05:27.480 | - Yeah, we recently shipped our admin and audit log APIs.

01:05:31.240 | And so a lot of enterprise users

01:05:32.960 | have been asking for this for a while.

01:05:34.520 | The ability to kind of manage API keys programmatically,

01:05:37.160 | manage your projects, get the audit log.

01:05:39.200 | So we've shipped this and for folks that need it,

01:05:41.320 | it's out there and happy for your feedback.

01:05:43.600 | - Yeah, awesome. I don't use them.

01:05:44.680 | So I don't know.

01:05:45.880 | I imagine it's just like build your own internal gateway

01:05:49.040 | for your internal developers

01:05:51.400 | to manage your deployment of OpenAI.

01:05:53.560 | - Yeah.

01:05:54.400 | I mean, if you work at like a company

01:05:55.480 | that needs to keep track of all the API keys,

01:05:57.720 | it was pretty hard in the past to do this in the dashboard.

01:06:01.080 | We've also improved our SSO offering.

01:06:02.800 | So that's much easier to use now.

01:06:04.680 | - The most important feature of an enterprise company.

01:06:06.680 | - Yeah, people love SSO.

01:06:08.760 | - All right, let's go outside of OpenAI.

01:06:11.960 | What about just you personally?

01:06:14.040 | So you mentioned Waterloo.

01:06:15.240 | Maybe let's just do, why is everybody at Waterloo cracked?

01:06:18.120 | And why are people so good?

01:06:19.800 | And like, why have people not replicated it?

01:06:22.040 | Or any other commentary on your experience?

01:06:24.440 | - The first is the co-op program.

01:06:25.880 | It's obviously really good.

01:06:27.120 | You know, I did six internships,

01:06:28.800 | learned so much in those.

01:06:30.920 | I think another reason is that Waterloo is like,

01:06:34.000 | you know, it's very cold in the winter.

01:06:35.920 | It's pretty miserable.

01:06:37.400 | There's like not that much to do apart from study

01:06:39.760 | and like hack on projects.

01:06:41.160 | And there's this big like hacker mentality.

01:06:43.320 | You know, there's a Hack the North

01:06:44.920 | is a very popular hackathon.

01:06:47.520 | And there's a lot of like startup incubators.

01:06:49.280 | It's kind of just has this like startup and hacker ethos.

01:06:52.040 | Then that combined with the six internships

01:06:53.720 | means that you get people who like graduate

01:06:55.880 | with two years of experience

01:06:57.360 | and they're very entrepreneurial.

01:06:58.680 | And you know, they're down to grind.

01:07:00.440 | - I do notice a correlation between climate

01:07:03.040 | and the crackiness of engineers.

01:07:05.280 | So, you know, it's no coincidence that Seattle

01:07:07.840 | is the birthplace of Microsoft and Amazon.

01:07:10.600 | I think I had this compilation of Denmark

01:07:12.920 | where people like,

01:07:14.360 | so it's the birthplace of C++, PHP, Turbo Pascal,

01:07:18.040 | Standard ML, BNF, the thing that we just talked about,

01:07:21.160 | MD5 Crypt, Ruby on Rails, Google Maps, and V8 for Chrome.

01:07:25.600 | And it's 'cause according to Bjorn Sjostrup,

01:07:27.600 | the creator of C++, there's nothing else to do.

01:07:29.720 | - Well, you have Linus Thorvalds in Finland.

01:07:32.400 | - Yeah, I mean, you hear a lot about this,

01:07:34.680 | like in relation to SF.

01:07:35.920 | People say, you know, New York is way more fun.

01:07:37.880 | There's nothing to do in SF.

01:07:38.920 | And maybe it's a little by design

01:07:40.360 | that Altec is here.

01:07:41.360 | - The climate is too good.

01:07:42.360 | - Yeah.

01:07:43.200 | - If we also have fun things to do.

01:07:44.640 | - Nature is so nice, you can touch grass.

01:07:46.360 | Why are we not touching grass?

01:07:47.640 | - You know, restaurants close at like 8 p.m.

01:07:50.440 | Like that's what people are referring to.

01:07:51.680 | There's not a lot of like late night dining culture.

01:07:55.240 | Yeah, so you have time to wake up early and get to work.

01:07:58.440 | - You are a book recommender or book enjoyer.

01:08:01.560 | What underrated books do you recommend most to others?

01:08:03.920 | - Yeah, I think a book I read somewhat recently

01:08:05.960 | that was very formative

01:08:07.040 | was "The Making of the Prince of Persia."

01:08:09.080 | It's a striped press book.

01:08:10.400 | That book just made me want to work hard.

01:08:12.720 | Like nothing I've ever read.

01:08:14.120 | It's just like this journal of what it takes

01:08:16.640 | to like build, you know, incredible things.

01:08:19.040 | So I'd recommend that.

01:08:20.160 | - Yeah, it's funny how video games are,

01:08:21.920 | for a lot of people, at least for me,

01:08:23.320 | kind of like the, some of the moments in technology.

01:08:26.880 | Like when I played "The Science of Time" on PS2

01:08:29.040 | was like my first PlayStation 2 game.

01:08:31.760 | And I was like, man, this thing is so crazy

01:08:33.440 | compared to any PlayStation 1 game.

01:08:35.080 | And it's like, wow.

01:08:35.920 | My expectations for like the technology,

01:08:37.760 | I think like open AI is a lot of similar things,

01:08:39.720 | like the advanced voice.

01:08:41.600 | It's like, you see that thing and then you're like, okay,

01:08:44.040 | what I can expect from everybody else

01:08:45.600 | is kind of raised now, you know?

01:08:47.160 | - Totally.

01:08:48.000 | Another book I like to plug

01:08:48.920 | is called "Misbehaving" by Richard Thaler.

01:08:51.760 | He's a behavioral economist

01:08:53.240 | and talks a lot about how people act irrationally

01:08:56.080 | in terms of decision-making.

01:08:57.480 | And I actually think about that book like once a week,

01:08:59.240 | probably, at least when I'm making a decision

01:09:01.360 | and I realize that, you know, I'm falling into a fallacy

01:09:04.000 | or, you know, it could be a better decision.

01:09:06.080 | - Yeah, you did a minor in psych.

01:09:07.280 | - I did, yeah.

01:09:08.360 | I don't know if I learned that much there,

01:09:10.000 | but it was interesting.

01:09:11.200 | - Is there like an example of like a cognitive bias

01:09:14.760 | or misbehavior that you just love telling people about?

01:09:18.120 | - Yeah, people, so let's say you won tickets

01:09:20.840 | to like a Taylor Swift concert

01:09:22.720 | and I don't know how much they're going for,

01:09:23.960 | but it's probably like $10,000.

01:09:25.760 | - Oh, okay.

01:09:26.600 | - Or whatever, sure.

01:09:28.080 | And like a lot of people are like, oh, I have to keep these.

01:09:30.120 | Like I won them, it's $10,000.

01:09:31.640 | But really it's the same decision you're making

01:09:33.400 | if you have $10,000, like would you buy these tickets?

01:09:36.040 | And so people don't really think about it rationally.

01:09:37.720 | I'm like, would they rather have $10,000 of the tickets?

01:09:40.200 | For people who won it, a lot of the time

01:09:41.800 | it's going to be the $10,000,

01:09:43.040 | but their bias is because they won it.

01:09:45.360 | The world organized itself this way

01:09:47.240 | and you should keep it for some reason.

01:09:48.560 | - Yeah, oh, okay.

01:09:49.400 | I'm pretty familiar with this stuff.

01:09:50.800 | There's also a loss version of this

01:09:53.560 | where it's like, if I take it away from you,

01:09:55.800 | you respond more strongly than if I give it to you.

01:09:58.000 | - Yes.

01:09:58.840 | If people are like really upset,

01:10:00.320 | if they like don't get a promotion,

01:10:01.960 | but if they do get a promotion, they're like, okay, phew.

01:10:04.480 | It's like not even, you know, excitement.

01:10:06.440 | It's more like, we react a lot worse to losing something.

01:10:10.880 | - Which is why, like when you join like a new platform,

01:10:13.520 | they often give you points and then they'll take it away

01:10:15.760 | if you like don't do some action in like the first few days.

01:10:19.800 | - Yeah, totally.

01:10:20.640 | Yeah, the book references people

01:10:22.560 | who like operate very rationally as econs,

01:10:25.440 | as like a separate group to humans.

01:10:27.760 | And I often think like, you know,

01:10:29.760 | what would an econ do here in this moment

01:10:32.200 | and try to act that way.

01:10:34.000 | - Okay, let's do this.

01:10:34.960 | Our LLMs, econs.

01:10:36.160 | (all laughing)

01:10:38.840 | - I mean, they are maximizing probability distributions.

01:10:42.440 | - Minimizing loss.

01:10:43.280 | - Yeah, so I think way more than all of us, they are econs.

01:10:46.320 | - Whoa, okay.

01:10:47.200 | So they're more rational than us?

01:10:49.720 | - I think their optimization functions

01:10:51.200 | are more clear than ours.

01:10:52.720 | - Yeah, just to wrap, you mentioned

01:10:54.520 | you need help on a lot of things.

01:10:56.320 | - Yeah.

01:10:57.160 | - Any specific roles, call outs,

01:10:58.240 | and also people's backgrounds.

01:10:59.680 | Like, is there anything that they need to have done before?

01:11:02.400 | Like what people fit well at OpenAI?

01:11:04.400 | - Yeah, we've hired people, all kinds of backgrounds,

01:11:07.200 | people who have PhD and an ML,

01:11:09.760 | or folks who've just done engineering like me.

01:11:12.560 | And we're really hiring for a lot of teams.

01:11:14.280 | We're hiring across the Applied Org,

01:11:15.960 | which is where I sit for engineering,

01:11:17.600 | and for a lot of researchers.

01:11:19.320 | And there's a really cool model behavior role

01:11:21.880 | that we just dropped.

01:11:23.160 | So yeah, across the board,

01:11:24.760 | we'd recommend checking out our careers page,

01:11:26.320 | and you don't need a ton of experience

01:11:28.320 | in AI specifically to join.

01:11:30.760 | - I think one thing that I'm trying to get at

01:11:32.600 | is like, what kind of person does well at OpenAI?

01:11:35.400 | I think objectively you have done well.

01:11:37.200 | And I've seen other people not do as well,

01:11:38.960 | and basically be managed out.

01:11:42.000 | I know it's an intense environment.

01:11:43.840 | - I mean, the people I enjoy working with the most

01:11:45.880 | are kind of low ego, do what it takes,

01:11:49.040 | ready to roll up their sleeves,

01:11:50.880 | do what needs to be done, and unpretentious about it.

01:11:54.120 | Yeah, I also think folks that are very user-focused

01:11:57.200 | do well on kind of API and chat GBT.

01:11:59.440 | Like, the YC ethos of build something people want

01:12:02.520 | is very true at OpenAI as well.

01:12:04.760 | So I would say low ego, user-focused, driven.

01:12:08.720 | - Cool, yeah, this was great.

01:12:10.040 | Thank you so much for coming on.

01:12:10.880 | - Yeah, thanks for having me.

01:12:12.320 | (upbeat music)

01:12:14.900 | (upbeat music)

01:12:17.480 | (upbeat music)

01:12:20.060 | (upbeat music)

Building AGI with OpenAI's Structured Outputs API

Chapters