From Text to Vision to Voice: Exploring Multimodality with OpenAI - Romain Huet

00:00:25.280 |
And before joining OpenAI, I was also a founder. 00:00:29.200 |
I actually experienced firsthand the magic of building. 00:00:34.040 |
Now I'm working on making sure we offer the most delightful 00:00:36.580 |
experience for all of you builders in this room. 00:00:40.900 |
Part of my role is also showing the art of the possible with our AI models. 00:00:46.360 |
And so today, we're going to go through a few 00:00:50.900 |
of the great capabilities that the OpenAI team has built 00:00:53.520 |
recently, and we'll show some live demos to really bring them to life. 00:00:58.400 |
So first, I'd like to start with a quick zoom out on how we've been 00:01:07.320 |
working on building AGI in a way that benefits all of humanity. 00:01:11.140 |
And to achieve this mission, we believe in iterative deployment. 00:01:14.400 |
We really want to make the technology come into contact with reality. 00:01:19.780 |
And for that, a top focus for us at OpenAI is really all of you, the builders. 00:01:25.200 |
We really believe strongly in the best developers and startups. 00:01:31.820 |
You are the ones that are going to build the AI-native products of the future. 00:01:37.240 |
Today, we have 3 million developers around the world building on our platform. 00:01:41.660 |
And we are very fortunate to have so much innovation. 00:01:47.540 |
And today, outside of this room, when people think of OpenAI, 00:01:51.200 |
they often think of ChatGPT first, because that's the product 00:01:53.780 |
that has taken the world by storm. 00:01:57.220 |
But the first product was actually not ChatGPT. 00:02:00.580 |
The first product we put out there was the developer platform. 00:02:07.860 |
And that's when we first started launching it to the public behind an API. 00:02:13.820 |
Who in this room had already played with the API back in the GPT-3 days? 00:02:25.700 |
And you know, at the time, we got a taste of what was possible. 00:02:36.860 |
Back then, one of the most popular use cases 00:02:42.960 |
was a role-playing game purely based on text. 00:02:46.320 |
It generated open-ended stories, 00:02:51.820 |
and in each scene, when you tried to look around, it would describe what you saw on the fly. 00:02:56.900 |
So that was kind of the state of the art at the time. 00:03:02.980 |
Then came GPT-4, and it completely changed the way we thought about AI. 00:03:10.680 |
It started being better at coding and reasoning, 00:03:17.540 |
and that dramatically expanded the aperture of what was possible to build. 00:03:23.480 |
We've had the great fortune of working with many, many 00:03:26.040 |
developers and companies, integrating GPT-4 into their own apps, 00:03:36.160 |
for instance to generate playlists on the fly based on your music taste and history. 00:03:42.380 |
But the one thing I want to highlight today in this talk 00:03:45.100 |
is that GPT-4 was also the beginning of our multi-modality journey. 00:03:49.320 |
This was the very first time we introduced vision capabilities. 00:03:53.260 |
And suddenly, GPT-4 could analyze and interpret visual data, such as images. 00:04:01.060 |
And then, with GPT-4 Turbo, for the first time, we brought vision capabilities 00:04:04.680 |
right into the same model, so that the exact same model 00:04:10.120 |
could handle both modalities at the same time. 00:04:14.460 |
And of course, last month, we took a step further with multimodality. 00:04:21.300 |
GPT-4o is our new flagship model that can reason across audio, vision, and text. 00:04:29.120 |
And thanks to its native image and audio capabilities, 00:04:32.580 |
we really think that you can now build very natural human-computer 00:04:37.060 |
interactions in a way that we've never seen before. 00:04:40.700 |
And the reason we call it GPT-4o is that the "o" stands for "omni". 00:04:45.800 |
And that's because we brought all of these modalities 00:04:48.800 |
into one single model for you to interact with. 00:04:51.920 |
And there are many things that are very unique and interesting 00:04:54.740 |
about GPT-4o, but I'd like to call out a couple of step-function changes. 00:04:59.260 |
The first one is that it's really a big step towards this multimodal future. 00:05:07.820 |
For the first time, that's no longer the case. I'm not sure if many of you 00:05:09.960 |
have played with ChatGPT's voice mode before, 00:05:12.960 |
or if you've tried to build your own voice assistants, 00:05:15.840 |
but until now, you had to stitch three models together: 00:05:18.360 |
first Whisper for transcription, then GPT-4 for reasoning, and finally a text-to-speech model for the audio output. 00:05:24.060 |
But of course, that stitching added latency and a loss of context along the way. 00:05:28.640 |
Now, we have all of that in a single model. 00:05:31.640 |
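To make the old stitched pipeline concrete, here is a minimal sketch of the three-model approach using the OpenAI Node SDK. The pipeline shape (Whisper, then GPT-4, then text-to-speech) comes from the talk; the specific model names, voice, and file handling are illustrative assumptions, not something the speaker showed.

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

// The pre-GPT-4o voice pipeline stitched three models together:
// speech-to-text, text reasoning, then text-to-speech.
async function voiceTurn(audioPath: string): Promise<Buffer> {
  // 1. Transcribe the user's speech with Whisper.
  const transcription = await client.audio.transcriptions.create({
    file: fs.createReadStream(audioPath),
    model: "whisper-1",
  });

  // 2. Reason over the transcribed text with GPT-4.
  const chat = await client.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: transcription.text }],
  });
  const reply = chat.choices[0].message.content ?? "";

  // 3. Synthesize the reply with a TTS model. Tone and emotion in the
  //    user's audio were already lost at step 1; that's the context
  //    loss a single multimodal model avoids.
  const speech = await client.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: reply,
  });
  return Buffer.from(await speech.arrayBuffer());
}
```

With GPT-4o, the audio goes into and comes out of one model, so none of this glue code, and none of the latency of three round trips, is needed.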
And for all of you developers, what's really exciting, too, 00:05:35.020 |
is that alongside the same kind of state-of-the-art reasoning, 00:05:40.640 |
we also shipped massive efficiency improvements. 00:05:43.320 |
And that means, in turn, that we've been able to turn those efficiency gains into savings for you. 00:05:48.120 |
So as a result, GPT-4o is twice as fast as GPT-4 Turbo. 00:05:55.460 |
And we also drastically increased the rate limits. 00:05:57.800 |
I actually have 5x higher rate limits on this slide, 00:06:01.660 |
but we are always pushing them higher. 00:06:04.000 |
And we really want to live in a world where there isn't any rate limit at all. 00:06:09.840 |
But with all that said, I'd like to give you a glimpse into GPT-4o. 00:06:14.000 |
And so we're going to switch over to my Mac here. 00:06:17.080 |
So for this first live demo, I'm going to bring up the ChatGPT desktop app, 00:06:25.100 |
and I'm going to bring up the voice mode right here. 00:06:36.260 |
I'm actually on stage at the AI Engineer World's Fair in San Francisco. 00:06:41.940 |
We have some amazing developers here in the room. 00:06:45.980 |
ChatGPT: It's awesome to be virtually here at the AI Engineer World's Fair in San Francisco. 00:06:58.380 |
And also, don't forget, people are also tuning in from the live stream. 00:07:03.880 |
ChatGPT: Hello, San Francisco, and to everyone joining us on the live stream. 00:07:12.300 |
We've got an incredible lineup and a ton of energy here today. 00:07:16.640 |
Let's dive into all the exciting tech and innovation. 00:07:31.260 |
Maybe one thing you could do, since we have a crowd of the best builders out there: 00:07:34.840 |
can you whisper one secret that's really precious to them as they're building the future of their AI-native products? 00:07:45.060 |
ChatGPT: Remember, the key to building great AI-native products is focusing on responsible and ethical AI. 00:08:08.740 |
It's crucial to keep your AI adaptable and scalable. 00:08:20.960 |
A couple of things to call out here. First of all, you can see the latency is pretty impressive. 00:08:30.040 |
The conversation feels very natural, almost no different from talking to a human. 00:08:37.960 |
The second thing that's really cool is that it understands emotion and can generate tones: it can whisper, it can bring the energy, it can be more alive. 00:08:45.720 |
But also, I can jump back into the conversation at any time. 00:08:48.720 |
I can interrupt the model without waiting for the audio output to finish, and that makes the interaction much more natural than what we knew before. 00:08:58.720 |
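The real-time interruption shown here happens inside the ChatGPT app rather than through a public API, but the same idea is easy to picture with text streaming: you consume a streamed completion and abort it mid-response instead of waiting for it to finish. A minimal sketch with the Node SDK, where the character cutoff is just a stand-in for "the user started talking":

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Stream a completion and stop consuming it early, analogous to
// interrupting the voice model before it finishes speaking.
async function interruptibleReply(prompt: string, maxChars: number) {
  const stream = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  let text = "";
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content ?? "";
    if (text.length >= maxChars) {
      // "Interrupt": cancel the rest of the stream and move on.
      stream.controller.abort();
      break;
    }
  }
  return text;
}
```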
So the next thing I'd like to show you: since we're talking about modalities, let's bring in the second modality, vision. 00:09:22.480 |
Because I'm in front of a lot of people here, 00:09:31.680 |
I'm going to try to draw something for you, and I'll show it to you. 00:09:58.840 |
Okay, so based on what I drew here, can you take a guess at where I am? 00:10:07.600 |
ChatGPT: That's the Golden Gate Bridge, so you must be in San Francisco. 00:10:11.720 |
And you wrote "Bonjour, developers," which means "hello, developers" in French. 00:10:19.680 |
It was a pretty poor drawing of the bridge, but ChatGPT is pretty clever. 00:10:28.600 |
ChatGPT: Yeah, that's Poor Charlie's Almanack by Charlie Munger. 00:10:34.880 |
It's a collection of speeches, talks, and life philosophy. 00:10:39.200 |
Munger is known for his wisdom on investing and decision making. 00:10:42.200 |
Yeah, can you just give me a random number between 1 and 400? 00:10:53.200 |
Can you give me like a brief overview of what's happening on this page? 00:10:58.200 |
ChatGPT: It looks like this page talks about Coco Chanel's rule for success. 00:11:03.200 |
It mentions that Chanel focused on identifying the qualities her customers wanted and made sure to deliver them. 00:11:10.280 |
Essentially, understanding your audience's desires is key to success. 00:11:14.520 |
Would you like a bit more detail from the page? 00:11:18.720 |
You read faster than I could read just one line. 00:11:25.640 |
So instead of the camera, I'll go real quick here, 00:11:28.520 |
and I'll share my screen instead. 00:11:31.360 |
So I'm going to bring up ChatGPT one more time here with audio capabilities. 00:11:35.920 |
I'm going to reduce this window a little bit. 00:11:40.800 |
Okay, ChatGPT, I'm going to share my screen with you. 00:11:53.200 |
You know, it's a travel app, and I'd like to make it a little better. 00:11:57.120 |
But there's one thing that I'm not quite sure about, and I'd like your help. 00:12:01.320 |
So if I go to this tab here, and I try to put the final touches on it, it feels like something is off in the mobile view. 00:12:13.280 |
ChatGPT: When you switch to the mobile view, the layout looks a bit cramped and the images seem squished. 00:12:18.440 |
Have you considered using media queries to adjust the layout for smaller screens? 00:12:25.880 |
I'm going to send you first this whole component. 00:12:28.480 |
I'm going to copy the code over to you, if you can take a look. 00:12:32.480 |
Can you give me a one-line overview of what's going on in that one? 00:12:35.480 |
ChatGPT: This code defines a React hook, useAssistant, that manages interactions with an AI assistant. 00:12:44.420 |
It handles sending and receiving messages, file uploads, and streaming responses from a server. 00:12:50.440 |
I don't think that's where we need to change some stuff. 00:12:52.520 |
I think it might be over here in the Discover component. 00:12:56.580 |
ChatGPT: Yeah, it does look like this is where the layout is being set for the descriptions. 00:13:05.820 |
What should I do exactly to change it and make it responsive here? 00:13:11.980 |
ChatGPT: You could adjust the grid-cols attributes for mobile sizes by using responsive prefixes. 00:13:17.560 |
What kind of prefix should I use for mobile screens, then? 00:13:29.300 |
If I set this one, and maybe the others too, is that looking right? 00:13:36.440 |
Like two columns for medium screens, and then we keep the three for large screens, something like that? 00:13:59.900 |
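The final code isn't shown on screen in the demo, but the exchange describes a Tailwind-style responsive grid: one column by default on mobile, two from the medium breakpoint, three on large screens. A plausible sketch of what the Discover component could end up looking like; the component shape, props, and class names are assumptions, not the demo's actual code.

```tsx
// Hypothetical Discover component from the travel app demo.
// Tailwind's responsive prefixes apply styles from a breakpoint up:
// one column by default (mobile), two from `md:`, three from `lg:`.
type Destination = { id: string; name: string; description: string };

export function Discover({ destinations }: { destinations: Destination[] }) {
  return (
    <div className="grid grid-cols-1 gap-4 md:grid-cols-2 lg:grid-cols-3">
      {destinations.map((d) => (
        <article key={d.id} className="rounded-lg border p-4">
          <h3 className="font-semibold">{d.name}</h3>
          <p className="text-sm">{d.description}</p>
        </article>
      ))}
    </div>
  );
}
```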
So I really like using ChatGPT on desktop as a coding assistant. 00:14:03.460 |
Of course, this was a very trivial use case. 00:14:05.700 |
But what's even more interesting is when you start reasoning out loud with ChatGPT to 00:14:09.680 |
build something, and you also say, hey, I'm going to get Cursor to do it. 00:14:16.400 |
It's pretty amazing to see how both of them can interact across modalities. 00:14:20.400 |
But let's go back to my presentation, please. 00:14:23.240 |
I'd like to give you a little sneak peek of what's on our mind. 00:14:27.580 |
What are we working on next at OpenAI as we think about these modalities and the future? 00:14:32.140 |
There are four things that are currently top of mind for us, especially for all of you developers here. 00:14:41.820 |
Of course, as you can tell, we are extremely excited about modalities, but we also think 00:14:46.240 |
that increasing textual intelligence is still very key to unlocking the transformational value of AI. 00:14:52.960 |
And we expect that potential of LLM intelligence to still be very large. 00:15:01.280 |
Those models today are pretty good; as we can tell, we're building real things with them. 00:15:05.860 |
But at the same time, what's really important to realize is that they're the dumbest they'll ever be. 00:15:12.340 |
If you will, it's almost like we have first graders working alongside us. 00:15:16.320 |
They still make mistakes every now and then, but we expect that a year from now, they 00:15:20.940 |
might be completely different, unrecognizable from what we have today. 00:15:24.960 |
They could become master's students in the blink of an eye in multiple disciplines, like medical research 00:15:29.500 |
or scientific reasoning, and we really expect the next frontier model to be that kind of step-function improvement. 00:15:37.980 |
The second area of focus that we are excited about is faster and cheaper models. 00:15:42.560 |
We know that not every use case requires the highest intelligence. 00:15:46.460 |
Of course, GPT-4's pricing has decreased significantly, 80% in fact, over a year. 00:15:52.040 |
But we also want to introduce more models over time. 00:15:56.100 |
We want these models to be cheaper for all of you to build with. 00:15:58.940 |
And we want to have models of different sizes. 00:16:01.580 |
We don't really have timelines to share today, but that's something we're very excited about. 00:16:06.320 |
And finally, we want to help you run async workloads. 00:16:09.160 |
We launched the Batch API a couple of months ago, and we're already seeing tremendous success. 00:16:15.320 |
Say you have documents to analyze with vision, or photos and images: all of that can be batched. 00:16:23.920 |
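As a rough sketch of what that looks like in practice: with the Batch API you upload a JSONL file where each line is one request, then create a batch that completes within a 24-hour window. The file name and request contents below are illustrative assumptions.

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

async function runBatch() {
  // Each line of requests.jsonl is one request, e.g.:
  // {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
  //  "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize this document..."}]}}
  const inputFile = await client.files.create({
    file: fs.createReadStream("requests.jsonl"),
    purpose: "batch",
  });

  const batch = await client.batches.create({
    input_file_id: inputFile.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });

  // The batch runs asynchronously; poll for results later with
  // client.batches.retrieve(batch.id).
  console.log(batch.id, batch.status);
}
```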
Third, we also believe in model customization. 00:16:27.220 |
We really believe that every company, every organization will have a customized model. 00:16:32.100 |
And we have a wide range of offerings here. 00:16:34.840 |
I'm sure many of you here have tried our fine-tuning API. 00:16:37.700 |
It's completely available for anyone to build with. 00:16:40.180 |
But we also assist companies with much deeper customization. Harvey, for instance, is a startup 00:16:45.480 |
that's building a product for law firms, and they were able to customize GPT-4 00:16:50.940 |
by training it entirely on US case law, and they've seen amazing results in doing so. 00:16:56.620 |
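For the self-serve end of that customization range, here is a minimal fine-tuning sketch with the Node SDK. The training file name and base model are assumptions for illustration; Harvey's custom-trained GPT-4 was a deeper collaboration than this public API.

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

async function fineTune() {
  // training.jsonl holds chat-formatted examples, one per line.
  const training = await client.files.create({
    file: fs.createReadStream("training.jsonl"),
    purpose: "fine-tune",
  });

  // Kick off the fine-tuning job on a base model.
  const job = await client.fineTuning.jobs.create({
    training_file: training.id,
    model: "gpt-3.5-turbo",
  });

  // Poll with client.fineTuning.jobs.retrieve(job.id) until it finishes.
  console.log(job.id, job.status);
}
```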
And last, we'll continue to invest in enabling agents. 00:16:59.420 |
We're extremely excited about the future of agents. 00:17:02.120 |
And we shared a little bit of that vision back in November at Dev Day. 00:17:05.780 |
And agents will be able to perceive and interact with the world using all of these modalities, 00:17:11.620 |
And once again, that's where the multi-modality story comes into play. 00:17:15.500 |
Imagine an agent being able to coordinate with multiple AI systems, but also 00:17:20.100 |
securely access your data, and even, yes, manage your calendar and things like that. 00:17:25.780 |
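The building block for that kind of agent behavior that exists in the API today is tool (function) calling: you describe a tool, and the model decides when to call it with structured arguments. The calendar tool below is hypothetical, just to mirror the example in the talk.

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical calendar tool, described to the model as a JSON schema.
const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "create_calendar_event",
      description: "Add an event to the user's calendar",
      parameters: {
        type: "object",
        properties: {
          title: { type: "string" },
          start: { type: "string", description: "ISO 8601 start time" },
        },
        required: ["title", "start"],
      },
    },
  },
];

async function agentTurn(userMessage: string) {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: userMessage }],
    tools,
  });

  // If the model chose to call the tool, its arguments arrive as JSON.
  const call = response.choices[0].message.tool_calls?.[0];
  if (call) {
    console.log(call.function.name, JSON.parse(call.function.arguments));
    // ...execute the tool, then send the result back in a follow-up turn.
  }
}
```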
Devin, of course, is an amazing example of what agents can become. 00:17:30.940 |
Cognition Labs has built this awesome AI software engineer that can code alongside you, 00:17:38.780 |
and is able to break down complex tasks, browse documentation 00:17:44.460 |
online, submit pull requests, and so on and so forth. 00:17:46.960 |
It's really a glimpse into what we can expect for the future of agents. 00:17:51.460 |
And with all of that, it's no surprise that Paul Graham observed a few months ago 00:17:56.960 |
that 22-year-old programmers are often as good as, if not better than, 28-year-old programmers. 00:18:04.140 |
And that's because they have these amazing AI tools at their fingertips. 00:18:08.640 |
So with that, I'd like to switch to another demo, to show you this time not ChatGPT, 00:18:14.140 |
but rather what we can build with these modalities. 00:18:20.280 |
So in the title of this talk, I did not mention video, but I'm sure most of you have seen Sora, 00:18:25.480 |
the preview of our diffusion model that's able to generate videos from a very simple prompt. 00:18:32.160 |
So in the interest of time, I've already sent this prompt to Sora, describing a documentary about a tree frog. 00:18:41.000 |
And if I click here, this is what came out of Sora. 00:18:50.280 |
But next, what I'd like to do is bring this video to life. 00:18:59.480 |
Here, what I've done is simply slice frames out of the Sora video. 00:19:04.820 |
And what I'm going to do next is very simple. 00:19:06.620 |
I'm going to send these six frames over to GPT-4o with vision, with this prompt, if you're curious. 00:19:14.060 |
And I'm going to tell it to narrate what it sees as if it were a documentary narrator. 00:19:18.260 |
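This step maps directly onto the chat completions API: several image parts and one text part in a single user message. A sketch of the narration call; the exact prompt wording and frame URLs from the demo are assumptions.

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Send frames sliced from the Sora video to GPT-4o and ask for a
// documentary-style narration of what it sees.
async function narrateFrames(frameUrls: string[]) {
  const content: OpenAI.Chat.Completions.ChatCompletionContentPart[] = [
    {
      type: "text",
      text: "These are sequential frames from a short film. Narrate what you see as a nature-documentary narrator, in a few sentences.",
    },
    // Attach each frame as an image part of the same message.
    ...frameUrls.map((url) => ({
      type: "image_url" as const,
      image_url: { url },
    })),
  ];

  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content }],
  });
  return response.choices[0].message.content;
}
```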
So going back here, I'm going to click Analyze and Narrate. 00:19:23.440 |
So every single time, the story is unique, and I'm just discovering it like all of you. 00:19:29.040 |
So that's what GPT-4o with vision was able to create based on what it saw in those frames. 00:19:36.180 |
But last but not least, I wanted to show you one thing that we also previewed recently: our Voice Engine model. 00:19:43.320 |
Voice Engine is the ability for us to create custom voices based on very short clips. 00:19:49.400 |
And of course, we take safety very seriously, so this is not a model that's broadly available just yet. 00:19:56.540 |
But I wanted to give you a sneak peek today of how it works; Voice Engine is also what we use internally, with voice actors, to create the voices you know in the API and in ChatGPT. 00:20:06.540 |
So here, I'm going to go ahead and show you a quick demo. 00:20:09.620 |
Hey, so I'm on stage at the AI Engineer World's Fair, and I just need to record a few seconds of my voice. 00:20:16.760 |
I'm super excited to see the audience that's really captivated by these modalities and what we can now build on the OpenAI platform. 00:20:24.760 |
Sounds like it's perfect; that's all we need. So now, to bring it all together, what I'm going to do is take this clip and the script that we just generated, send all of it to Voice Engine, and we'll see what happens. 00:20:49.060 |
In the heart of the dense, misty forest, a vibrant frog makes its careful way along a moss-covered branch. Its bright green body, adorned with black and yellow patterns, stands out amidst the lush foliage. 00:21:06.120 |
I can also have it translate into multiple languages, so let's try French. 00:21:10.120 |
And for those who know me, that's actually how I sound when I speak French. 00:21:34.180 |
All right, thank you. Let's go back real quick to the slides. 00:21:50.180 |
And of course, this is one very specific example of bringing modalities together, with Sora videos, GPT-4o with vision, and the Voice Engine that we have not released yet. But I hope this inspires you to picture the future with these modalities combined. 00:22:13.180 |
So to wrap up, we're focused on these four things: driving textual intelligence up, and making our models faster and more affordable so you all can scale. 00:22:21.180 |
We're thinking about customizable models for your needs. 00:22:25.180 |
And finally, making sure you can build for this multimodal future and agents. 00:22:30.180 |
And if there's one thing I want to leave you with today, it's that our goal is not for you to spend more with OpenAI; our goal is for you to build more with OpenAI. 00:22:40.180 |
Because let's remember, we're still in the very early innings of that transition, and it's a fundamental shift in how we think and build software every day. 00:22:48.180 |
So we really want to help you in that transition. 00:22:50.180 |
We're dedicated to supporting developers and startups. 00:22:53.180 |
So if there's anything we could do better, please come find me after this talk. 00:22:57.180 |
And, you know, this is really the most exciting time to be building an AI-native company. 00:23:05.180 |
And we know that bold builders like all of you are going to invent the future before anyone else. 00:23:14.180 |
We can't wait to see what you're going to build with these new modalities and how you'll reinvent software 2.0. 00:23:18.180 |
So with that, thank you so much for having me, and thank you so much for being here today.