Back to Index

From Text to Vision to Voice: Exploring Multimodality with OpenAI: Romain Huet


Transcript

Wow. Good afternoon, everyone. Super excited to see you all here today. Such an incredible energy here at the event. I'm Romain. I lead developer experience at OpenAI. And before joining OpenAI, I was also a founder. And like many of you in this room, I actually experienced firsthand the magic of building with the frontier models.

Now I'm working on making sure we offer the most delightful experience for all of you builders in this room. And what I love the most about this role is also showing the art of the possible with our AI models and technologies. And so today, we're going to go through a few of the great capabilities that the OpenAI team has built recently, and we'll show some live demos to really bring them to life.

So first, I'd like to start with a quick zoom out on how we've gotten to where we are today. OpenAI is a research company, and we're working on building AGI in a way that benefits all of humanity. And to achieve this mission, we believe in iterative deployment. We really want the technology to come into contact with reality as early and as often as possible.

And for that, a top focus for us at OpenAI is really all of you, the best builders in the world. We strongly believe that the best developers and startups are integral to the G in this AGI mission. You guys are the ones who are going to build the AI-native products of the future.

So today, we have 3 million developers around the world building on the OpenAI platform. And we are very fortunate to have so much innovation. But I'd like to rewind a little bit. And today, outside of this room, when people think of OpenAI, they often think of ChatGPT first, because that's become like the product that has taken the world by storm a little bit.

But the first product was actually not ChatGPT. The first product we put out there was the developer platform. So back in 2020, at the time, we had GPT-3. And that's when we first started launching it to the public behind an API. Maybe a quick show of hands, actually. Who in this room had already played with the API back in the days of GPT-3?

Wow. That's like more than half of you. You guys are really my crowd here. That's awesome. And you know, at the time, we kind of got a taste for what AI would be capable of doing, like basic coding assistance, copy editing, maybe some very simple translation.

But to really put things in perspective, at the time, this was one of the most popular use cases on the platform: AI Dungeon. This was a role-playing game purely based on text, and it was generating open-ended stories. You could navigate the world, and at each scene, when you were trying to look around, it would generate new text.

So that was kind of the state of the art at the time. Obviously, in 2023, GPT-4 changed the game. It completely changed the way we thought about AI. It got better at reasoning. It got more creative, more specific. It could start being better at coding and reasoning about complex problems.

And it could use tools also, interpret data. And that dramatically expanded the aperture of the possibilities with the platform. We've had the great fortune of working with many, many developers and companies integrating GPT-4 in their own apps and services. And this is just one example among many: Spotify, for instance, took our models to generate playlists on the fly based on your music taste and history.

But the one thing I want to highlight today in this talk is that GPT-4 was also the beginning of our multi-modality journey. This is the very first time where we introduced vision capabilities. And suddenly, GPT-4 could analyze and interpret data, such as images, photos, et cetera, not just purely text.

And then, with GPT-4 Turbo, for the first time, we brought vision capabilities into the same model, right inside, so that you can have the exact same model that does the two modalities at the same time. And of course, last month, we kind of took a step further with multi-modality.

And that brings us to GPT-4o. So GPT-4o is our new flagship model that can reason across audio, video, and text all in real time. And thanks to its native image and audio capabilities, we really think that you can now build very natural human-computer interactions in a way that we've never seen before.

And so the reason we call it GPT-4o is because the o is for omni, an omni-model, if you will. And that's because we brought all of these modalities into one single model for you to interact with. And there are many things that are very unique and interesting about GPT-4o, but I'd like to call out a couple of step function changes here.

The first one is that it's really a big step towards this multi-modality future with ultra-low latency. I'm not sure if many of you have played with ChatGPT with voice mode before, or if you've tried to build your own voice assistants, but until now you kind of had to stitch three models together.

First, Whisper for transcription, then GPT-4 for reasoning, and then ultimately text-to-speech. But of course, that kind of added a loss of context and some latency along the way. Now, we have all of that in a single model. And for all of you developers, what's really exciting, too, is that along with the same kind of state-of-the-art reasoning capabilities that we offer with GPT-4, we also shipped massive efficiency improvements.
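For reference, that old three-model pipeline can be sketched with the OpenAI Node SDK roughly as follows. This is a minimal, illustrative sketch, not the speaker's actual code: the file names, voice, and model choices are placeholders.

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

async function voiceAssistantTurn(audioPath: string): Promise<void> {
  // 1. Speech-to-text with Whisper.
  const transcription = await client.audio.transcriptions.create({
    file: fs.createReadStream(audioPath),
    model: "whisper-1",
  });

  // 2. Reasoning with a chat model over the transcribed text.
  const completion = await client.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: transcription.text }],
  });
  const reply = completion.choices[0].message.content ?? "";

  // 3. Text-to-speech on the reply.
  const speech = await client.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: reply,
  });
  fs.writeFileSync("reply.mp3", Buffer.from(await speech.arrayBuffer()));
}

voiceAssistantTurn("question.wav").catch(console.error);
```

Each hop in that chain adds latency and drops context like tone and interruptions, which is exactly what folding everything into one model avoids.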

And that means that, in turn, we've been able to turn those efficiency gains back to you. So as a result, GPT-4o is twice as fast as GPT-4 Turbo. It's also half the price. And we also increased the rate limits drastically. I actually have like 5x higher rate limits on this slide, but we are kind of like always pushing them higher.

And we really want to live in a world where there's not even any rate limit, so you can build with no limits. But with all that said, I'd like to give you a glimpse into GPT-4o. And so we're going to switch over to my Mac here. So for this first live demo, I'm going to bring up ChatGPT for desktop.

And all I have to do is press Option-Space, and I'm going to bring up the voice mode right here. Hey, ChatGPT, how are you? Hey, I'm doing great. Thanks for asking. How about you? I'm great. I'm actually on stage at the AI Engineer World's Fair in San Francisco. Could you greet the audience?

We have some amazing developers here in the room. Hey, everyone. It's awesome to be virtually here at the AI Engineer World's Fair in San Francisco. So ChatGPT, I kind of like how you started, but I really want you to bring the energy. Like, people are very excited here. So make this more live.

And also, don't forget, people are also tuning in from the live stream. So don't forget about them. Hello, San Francisco, and to everyone joining us on the live stream. Welcome to the AI Engineer World's Fair. We've got an incredible lineup and a ton of energy here today. Let's dive into all the exciting tech and innovation.

That's much better. Thank you, ChatGPT. I think they really enjoyed that. If you need anything else, just let me know. Hope the event goes fantastically. Yeah. Maybe one thing you could do, since we have a crowd of the best builders out there, can you whisper one secret that's really precious to them as they're building the future of their AI-native products?

What should they know? All right. Just between us. Remember, the key to building great AI-native products is focusing on responsible and ethical AI. Awesome. Glad you liked it. Can you whisper another one, but even lower? Like very, very, very, very low. Sure. Here's one more. It's crucial to keep your AI adaptable and scalable.

Technology evolves fast. Your products should, too. Cool. So one-- Thank you. A couple of things to call out here, but first of all, you can see the latency is pretty impressive. It's very natural, like in a conversation, and there's really no difference from talking to a human.

It's almost too fast at times, I would say. The second thing that's really cool is, like, it understands emotion and can generate tones, you know, it could whisper, it could bring the energy, it could be more alive. But also, I can jump back in the conversation at any time.

I can interrupt the model, and I don't have to wait for the kind of audio output to finish, and that's really cool to have something much more natural versus what we knew before. So the next thing I'd like to show you, of course, we're talking about modalities, so let's bring the second modality.

I'd like to turn on the camera here. Hey, ChatGPT, can you now see me? Yes. I can see you. Yeah. You know, you don't have to whisper anymore. You can talk normally now. Got it. How's the event going? It's good. Do I look good, actually?

Because I'm in front of a lot of people here. You look great. Ready for the spotlight. Awesome. Okay, so give me a second. I'm going to try to draw something for you, and I'll show it to you. Sounds good. I'm curious to see what you come up with. Just give me a few more seconds, you know.

Take your time. I'm here whenever you're ready. I feel like ChatGPT is pretty patient. No rush. I'm always here when you're ready. Okay, so based on what I drew here, can you take a guess at where I am? And also, can you translate what I wrote? Yes. That's the Golden Gate Bridge, so you must be in San Francisco.

And you wrote Bonjour, développeurs, which means hello, developers, in French. Pretty cool. All right. It was a pretty poorly drawn bridge, but ChatGPT is pretty clever. Next, I have this book. Can you take a look, ChatGPT? What is this? Yeah, that's Poor Charlie's Almanack by Charlie Munger. It's a collection of speeches, talks, and life philosophy.

Munger is known for his wisdom on investing and decision making. Yeah, can you just give me a random number between 1 and 400? Sure. How about 126? 126, here we go. Page 126. Can you give me like a brief overview of what's happening on this page? It looks like this page talks about Coco Chanel's rule for success.

It mentions that Chanel focused on identifying the qualities her customers wanted and made sure to offer them. Essentially, understanding your audience desires is key to success. Would you like a bit more detail from the page? No, I think I'm good. This is pretty amazing. You read faster than I could read just one line.

So it's pretty amazing. All right. Cool. Next, I'd like to show you something else. So instead of camera, I'll go real quick here. And instead of sharing camera, I'll share my screen. So I'm going to bring up ChatGPT one more time here with audio capabilities. I'm going to reduce this window a little bit.

Okay, ChatGPT, I'm going to share my screen with you. Can you see what I see? Yes, I can see your screen. What are we looking at here? Well, I'm currently working on this app. You know, it's a travel app, and I'd like to make it a little better.

But there's one thing that I'm not quite sure about doing, and I'd like your help. So if I go to this tab here, and if I try to put the final touches on it, it feels like the responsiveness is not really good.

Like, what do you think? I see what you mean. When you switch to the mobile view, the layout looks a bit cramped and the images seem squished. Have you considered using media queries to adjust the layout for smaller screens? So let's take a look at the code here.

Maybe you could help me. I'm going to send you first this whole component. I'm going to copy the code over to you, if you can take a look. Can you give me a one-line overview of what's going on in that one? This code defines a React hook, useAssistant, that manages interactions with an AI assistant.

It handles sending and receiving messages, file uploads, and streaming responses from a server. Okay, cool. I don't think that's where we need to change some stuff. I think it might be over here in the Discover component. What do you think? Yeah, it does look like this is where the layout is being set for the descriptions.

So I'm using Tailwind CSS. What should I do exactly to kind of change it and make it responsive here? You could adjust the grid column classes for mobile sizes by using responsive prefixes. What's the kind of prefix I should use for mobile screens then? You can use the sm prefix for small screens.

Okay, let me try that. Yeah. If I say this, and maybe for the others, is that looking right? Like two for medium screens, and then we keep the three for large screens, something like that? Yeah, that looks good. Okay. Adjusting the columns. How does it look now? Take a look now. What do you think?

It looks much better. Awesome. Great job solving that. Anything else you want to tweak? So I really like using ChatGPT on desktop as a coding assistant. But of course this was a very trivial use case. What's also even more interesting is when you start reasoning out loud with ChatGPT to build something, but you also tell it, hey, I'm going to get Cursor to do it, so what should I prompt Cursor with?
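For those following along, the responsive change suggested in that exchange boils down to Tailwind's breakpoint prefixes. Here is a minimal sketch; the component, data shape, and class choices are illustrative, not the actual Discover component from the demo:

```tsx
// Illustrative stand-in for the grid component from the demo, not the real code.
type Destination = { id: string; name: string };

export function DestinationGrid({ destinations }: { destinations: Destination[] }) {
  return (
    // Unprefixed classes apply at every width, so mobile gets 1 column;
    // `sm:` and `md:` bump it to 2 from the small breakpoint up, and `lg:` uses 3 on large screens.
    <div className="grid grid-cols-1 sm:grid-cols-2 md:grid-cols-2 lg:grid-cols-3 gap-4">
      {destinations.map((d) => (
        <div key={d.id} className="rounded-lg border p-4">
          {d.name}
        </div>
      ))}
    </div>
  );
}
```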

I've done that kind of back-and-forth with Cursor many times, and it's also pretty amazing to see how both of them can interact across modalities. But let's go back to my presentation, please. I'd like to give you a little bit of a sneak peek of what's on our mind. What are we working on next at OpenAI as we think about these modalities and the future of models?

So there are four things that are currently top of mind for us, especially for all of you developers building on the platform. The first thing is textual intelligence. Of course, as you can tell, we are extremely excited about modalities, but we also think that increasing textual intelligence is still very key to unlock the transformational value of AI.

And we expect the potential of LLM intelligence to still be very large in the future. Those models today are pretty good, you know, as we can tell, since we're building things with them. But at the same time, what's really cool to realize is that they're the dumbest they'll ever be.

We'll always have better models coming up. And if you will, like, it's almost like we have first graders working alongside us. They still make mistakes every now and then, but we expect that in a year from now, they might be like completely different and unrecognizable from what we have today.

They could become master's students in the blink of an eye in multiple disciplines like medical research or scientific reasoning, so we really expect the next frontier model to bring such a step function change in reasoning improvements again. The second area of focus that we are excited about is faster and cheaper models.

And we know that not every use case requires the highest intelligence. Of course, GPT-4's pricing has decreased significantly, 80% in fact, over a year. But we also want to introduce more models over time, and we want these models to be cheaper for you all to build with. We also want to offer models of different sizes.

We don't really have timelines to share today, but that's something we're very excited about as well. And finally, we want to help you run async workloads. We launched the Batch API a couple of months ago, and we're seeing tremendous success already, especially for these modalities. Say you have documents to analyze with vision, or photos, or images: all of that can be batched for an additional 50% discount on pricing.
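As a rough sketch of that Batch API flow with the Node SDK: you upload a JSONL file where each line is a standard /v1/chat/completions request with a custom_id, then create a batch against it. The file name below is a placeholder.

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

async function submitVisionBatch(): Promise<void> {
  // Upload a JSONL file where each line is one chat/vision request.
  const inputFile = await client.files.create({
    file: fs.createReadStream("vision_requests.jsonl"),
    purpose: "batch",
  });

  // Kick off the batch; results land in an output file within the completion window,
  // at a 50% discount compared to synchronous requests.
  const batch = await client.batches.create({
    input_file_id: inputFile.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });

  console.log(batch.id, batch.status);
}

submitVisionBatch().catch(console.error);
```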

Third, we also believe in model customization. We really believe that every company, every organization will have a customized model. And we have a wide range of offerings here. I'm sure many of you here have tried our fine-tuning API; it's completely available for anyone to build with. But we also assist companies all the way, like Harvey, for instance, a startup that's building a product for law firms: they were able to customize GPT-4 entirely on US case law, and they've seen amazing results in doing so.
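The self-serve fine-tuning API he mentions looks roughly like this; the training file name and base model below are placeholders, and deeper customization like the Harvey example goes well beyond this self-serve flow.

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

async function startFineTune(): Promise<void> {
  // Upload a JSONL file of chat-formatted training examples.
  const trainingFile = await client.files.create({
    file: fs.createReadStream("training_examples.jsonl"),
    purpose: "fine-tune",
  });

  // Start a fine-tuning job on a base model (placeholder model name).
  const job = await client.fineTuning.jobs.create({
    training_file: trainingFile.id,
    model: "gpt-3.5-turbo",
  });

  console.log(job.id, job.status);
}

startFineTune().catch(console.error);
```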

And last, we'll continue to invest in enabling agents. We're extremely excited about the future of agents, and we shared a little bit about that vision back in November at DevDay. Agents will be able to perceive and interact with the world using all of these modalities, just like human beings.

And once again, that's where the multi-modality story comes into play. Imagine an agent being able to coordinate with multiple AI systems, but also securely access your data, and even, yes, manage your calendar and things like that. We're very excited about agents. Devin, of course, is an amazing example of what agents can become.

Cognition Labs has built this awesome AI software engineer that can code alongside you, and is able to break down complex tasks and actually, you know, browse documentation online, submit pull requests, and so on and so forth. It's really a glimpse into what we can expect for the future of agents.

And with all of that, it's no surprise that Paul Graham observed a few months ago that 22-year-old programmers are often as good as, if not better than, 28-year-old programmers. And that's because they have these amazing AI tools at their fingertips. So with that, I'd like to switch to another demo to show you this time not ChatGPT, but rather what we can build with these modalities.

So in the title of this talk, I did not mention video, but I'm sure most of you have seen Sora, the preview of our diffusion model that's able to generate videos from a very simple prompt. And this is one of them. So in the interest of time, I've already sent this prompt to Sora, describing a documentary about a tree frog, being very precise about what I'm expecting.

And if I click here, this is what came out of Sora. It's pretty cool. But next, what I'd like to do is bring this video to life, you know. And here, what I did is simply slice frames out of the Sora video. And what I'm going to do next is very simple.

I'm going to send these six frames over to GPT-4o with vision, with this prompt, if you're curious. And I'm going to tell it to narrate what it sees as if it were a narrator. So going back here, I'm going to click Analyze and Narrate. Again, this is all happening in real time.
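For those curious, sending a handful of frames to GPT-4o with a narration prompt looks roughly like this with the Node SDK. The frame file names and the prompt wording are placeholders, not the exact ones from the demo.

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

async function narrateFrames(framePaths: string[]): Promise<string> {
  // Encode each frame as a base64 data URL and attach it as an image part.
  const imageParts = framePaths.map((path) => ({
    type: "image_url" as const,
    image_url: {
      url: `data:image/jpeg;base64,${fs.readFileSync(path).toString("base64")}`,
    },
  }));

  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text" as const,
            text: "These are frames from a short nature clip. Narrate them as a documentary narrator.",
          },
          ...imageParts,
        ],
      },
    ],
  });

  return response.choices[0].message.content ?? "";
}

narrateFrames(["frame1.jpg", "frame2.jpg", "frame3.jpg"]).then(console.log);
```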

So every single time, the story is unique, and I'm just discovering it like all of you. And boom, that's it. So that's what GPT-4o with vision was able to create based on what it saw in those frames. So it's pretty magical. But last but not least, I wanted to show you one thing that we also previewed recently, and it's our Voice Engine model.

Voice Engine is the ability for us to create custom voices based on very short clips. And of course, we take safety very seriously, so this is not a model that's broadly available just yet. But I wanted to give you a sneak peek today of how it works; Voice Engine is also what we use internally with actors to bring you the voices you know in the API or in ChatGPT.

So here, I'm going to go ahead and show you a quick demo. Hey, so I'm on stage at the AI Engineer World's Fair, and I just need to record a few seconds of my voice. I'm super excited to see the audience that's really captivated by these modalities and what we can now build on the OpenAI platform.

Sounds like it's perfect; that's all we need. So now, to bring it all together here, what I'm going to do is take this clip and the script that we just generated, send all of it to Voice Engine, and we'll see what happens.

In the heart of the dense, misty forest, a vibrant frog makes its careful way along a moss-covered branch. Its bright green body, adorned with bright black and yellow patterns, stands out amidst the lush foliage. I can also have it translate into multiple languages, so let's try French. And for those who know me, that's actually how I sound when I speak French.

Maybe one last one with Japanese. All right, thank you. Let's go back real quick to the slides. And of course, this is one very specific example of bringing modalities together with, you know, Sora videos, GPT-4o with vision, and Voice Engine, which we have not released yet, but I hope this inspires you to see how you can picture the future with these modalities combined together.
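Voice Engine itself isn't exposed in the API, but the translate-then-speak part of that flow can be approximated today with the chat and speech endpoints and one of the stock API voices. A minimal sketch, assuming a placeholder voice and output file, and with the caveat that this does not clone a custom voice:

```ts
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

async function speakInFrench(script: string): Promise<void> {
  // Translate the narration script with a chat model.
  const translation = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: `Translate this narration into French:\n\n${script}` }],
  });
  const french = translation.choices[0].message.content ?? "";

  // Speak it with one of the built-in API voices (not a custom Voice Engine voice).
  const speech = await client.audio.speech.create({
    model: "tts-1",
    voice: "alloy",
    input: french,
  });
  fs.writeFileSync("narration_fr.mp3", Buffer.from(await speech.arrayBuffer()));
}

speakInFrench("In the heart of the dense, misty forest, a vibrant frog...").catch(console.error);
```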

So to wrap up, we're focused on these four things: driving textual intelligence up, making our models faster and more affordable so you all can scale, thinking about customizable models for your needs, and finally, making sure you can build for this multimodal future and for agents. And if there's one thing I want to leave you with today, it's that our goal is not for you guys to spend more with OpenAI; our goal is for you to build more with OpenAI.

Because let's remember, we're still in the very early innings of that transition, and it's a fundamental shift in how we think and build software every day. So we really want to help you in that transition. We're dedicated to supporting developers, startups. We love feedback. So if there's anything we could do better, please come find me after this talk.

And, you know, this is really like the most exciting time to be building an AI native company. So we want you to bet on the future of AI. And we know that bold builders like all of you are going to come up with the future and invent it before anyone else.

So with that, thank you so much. And we can't wait to see what you're going to build with those new modalities and reinvent software 2.0. Thank you so much for having me, and thank you so much for being here today.