
See, Hear, Speak, Draw: Logan Kilpatrick & Simón Fishman


Transcript

- So you can think of OpenAI as a product and research company. We build awesome models, and then ask: what are some of the best ways to apply them to solve the biggest problems that humanity faces? There's a deployment pipeline, and Logan and I sit at the end of that deployment pipeline.

We work with people in the real world who are using OpenAI's models. We spend our time thinking about the best ways to use our models, the hardest problems that haven't been solved yet, and how we can apply OpenAI technology to solve them.

I'm on the Applied team, and I'm an engineer. - Yeah, my name's Logan Kilpatrick, and I do developer relations stuff: helping people build fun and exciting products and services using our API. So, yeah, as folks can tell from the title of the talk, we'll talk about multimodal stuff, but I think it's important to start off with where we are today.

And I think, as everyone who has been building in the AI space for the last six, 12, 18 months knows, 2023 has really been the year of chatbots. It's been incredible to see how much people have actually been able to do, how much value you can create in the world with just a simple chatbot.

And it still blows my mind to think about how rudimentary these systems are and how much more value is going to be created in the next year, in the next decade. That's why I'm excited for 2024, which I think is really going to be, I don't know if I can trademark this, the year of multimodal models.

No, don't buy it if it's available. Yeah, so I'm excited. We at OpenAI have a ton of multimodal capabilities in the works. Some folks might have already tried some of these in ChatGPT, in the iOS app or the web app today: things like vision, taking in images and describing them.

We'll show that later on. Also the ability to generate images. We've had this historically with DALL·E 2, but DALL·E 3, if folks have tried it, really takes things to the next level. So excited to show some of that today as well. Cool. So if you think of the way multimodal capabilities work right now, it's a little bit of a setup of islands, where we have DALL·E that takes text and generates images.

We have Whisper that takes audio and generates text transcripts. We have GPT-4 with vision capabilities (GPT-4V) that takes images and text and can reason over both at the same time. But right now these are all very disparate things. However, you can think of text as the connective tissue between all of these models.

And there are a lot of interesting things we can build right now using that paradigm. But what we're actually really excited for is a future in which there's unity between all these modalities. That's where we're going; it's not where we are today.
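To make the "islands connected by text" idea concrete, here is a minimal sketch using the OpenAI Python SDK: each model maps into or out of text, and text is what you pass between them. This is not code from the talk; the model names, file name, and image URL are illustrative assumptions, and image input was not yet generally available through the API when this talk was given.

```python
from openai import OpenAI

client = OpenAI()

# Audio -> text (Whisper)
with open("talk.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f).text

# Image + text -> text (GPT-4 with vision)
description = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model identifier
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in detail."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
).choices[0].message.content

# Text -> image (DALL·E 3)
image_url = client.images.generate(
    model="dall-e-3", prompt=description, size="1024x1024"
).data[0].url
```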

But you can think of models in the same way: just as GPT-4 can consume images and text simultaneously, maybe in the future models will consume even more modalities, output even more modalities, and be able to reason about them all at the same time.

However, we're not there yet. So today Logan and I are going to show you some architecture patterns and some ways you can mimic this kind of setup with what we have available today, and some of the patterns you can start to think about as we move toward this future in which models can reason way beyond text.

As Simón and I were making these demos today, waiting until the last minute as always, it was really interesting to see that much of the work of making multimodal systems today is figuring out how to hook everything up and connect the different modalities.

And again, as Simón said, using text as sort of the bridge between different modalities. But it's going to be super interesting to see how much developer efficiency you gain when you no longer have to do that, and you really just have a single model that can do text in, text out, video at some point, speech in, speech out at some point.

So it'll be super cool to see when that's possible, making demos even easier and simpler. Cool. All right. Well, we'll show you two demos today, and we'll talk about some high-level ideas and concepts.

And hopefully at the end of it, you'll be inspired to think about the things that maybe you can't build today but will be able to build six months or a year from now, and how you should start thinking about your products as they're able to incorporate more modalities.

Cool. So on to demo number one. This is a very simple DALL·E/vision loop. Excited to look at this demo. Simón will pull it up and I'll sort of just walk through it.

But the basic idea is: let's take a real image, use GPT-4V, or GPT-4 with image inputs, to essentially create a nice human-readable, understandable description of that image, and then put that into DALL·E 3 and actually go and generate a synthetic version of that image.

So this whole pipeline takes a little while to run, because it's not a production system at the moment. But the nice part is we've got a couple of examples ready, and if we want to kick one off live as well, we can let it run in the background.
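As a rough sketch of that first describe-then-regenerate step (not the speakers' actual code, which wasn't released): one function asks GPT-4 with vision for a detailed description, and another hands that description to DALL·E 3. The model names, prompt wording, and the lobby.jpg file are assumptions.

```python
import base64
from openai import OpenAI

client = OpenAI()

def describe_image(image_path: str) -> str:
    """Ask GPT-4 with vision for a detailed, image-model-friendly description."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in enough detail that an image model could reproduce it."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def generate_image(prompt: str) -> str:
    """Generate a synthetic version of the scene with DALL·E 3; returns an image URL."""
    resp = client.images.generate(model="dall-e-3", prompt=prompt, size="1024x1024")
    return resp.data[0].url

description = describe_image("lobby.jpg")  # hypothetical input photo
synthetic_url = generate_image(description)
```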

So, this is a fun, simple idea. This is a photo that I took in the lobby downstairs, just when you walk into the hotel. There are these kind of Halloween-themed painted ladies. And so what we did here is we asked GPT-4 with vision to describe this image in detail.

And then we asked it to generate a description for DALL·E to generate a new image from. You can see it does an okay job. Here's the description of the image, and here's the prompt it uses. It picks up on a lot of details, like the RIP on the tombstone and the "old dogs welcome" sign here.

And then it generates a whole new image, but a lot of the details are off: you know, the marble is black and the spiders are white. So what we do next, yeah, it's close enough, is give the two images to GPT-4 with vision again, and we ask it to compare them and see what some of the differences are.

And it picks up on a lot of the different details. Then we ask it to create a new image based on these differences, and it goes ahead: new image, you see. All the black marble is gone, and the spider is now larger and black.

But you know, it matches something closely. There's still a long way to go, but this is just to illustrate the idea that there are plenty of tasks we do right now in AI where we need a human in the loop to evaluate a visual output that a model produces and compare it with something else.

Then iterate on the instructions and pass that again to another model. That's a pipeline where we thought humans were essential, and would probably continue to be essential for some time, and now it's something the models can do by themselves. And there are a couple of interesting patterns here.
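Continuing the sketch above, the self-correction loop might look roughly like this: show GPT-4 with vision both the original photo and the generated image, ask it to list the differences and rewrite the prompt, then regenerate. Again, the prompt wording and model name are assumptions, not what the demo actually used; it reuses client, describe_image, and generate_image from the previous sketch.

```python
import base64

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def compare_and_refine(original_path: str, generated_url: str, prompt: str) -> str:
    """Ask GPT-4 with vision to spot differences and rewrite the DALL·E prompt."""
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("The first image is the original photo. The second was generated "
                          f"from this prompt:\n\n{prompt}\n\nList the differences, then "
                          "rewrite the prompt so the generated image matches the original "
                          "more closely. Return only the new prompt.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(original_path)}"}},
                {"type": "image_url", "image_url": {"url": generated_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

# One round of the loop: describe, generate, compare, regenerate.
prompt = describe_image("lobby.jpg")
first_url = generate_image(prompt)
second_url = generate_image(compare_and_refine("lobby.jpg", first_url, prompt))
```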

I think one of them is describing images. That's powerful, because now you have an image, now you have text, and you can reason about that text; you can do a lot of things with that text. But another really powerful element is comparing images and spotting differences, like having a final destination that you want to get to and a current state you're at.

And that pattern of comparing things you can apply to a lot of things. Walking out here, Logan and I were just chatting about some other ways you can apply this, and Logan's idea was: imagine you're curating your room and you just moved to a new place.

And you're on Instagram, you find some images of a vibe that you like, maybe some object, and then you can grab that image, give it to GPT-4 with vision, and say: okay, now crawl through Amazon and find all the lamps that match this vibe I want for my room.

I want this so badly. Yeah. I can't do interior design, so I would love to be able to just say: get me all the stuff that matches this specific vibe. It's a hard problem right now. Yeah. And a couple...

Simón, can I make one other quick comment? I think, also, folks were laughing in good jest when this third image came up. I think it's important to know that there's no prompt engineering or anything like that happening.

This is the rawest output you can get. This is a one-hour demo version. So people will hopefully go wild with this once it's available through the API and ideally get much better results than we're seeing today. Yeah.

Yeah, probably using a bunch of techniques that other people have talked about at the conference so far. So this is the very basic version of this demo. Yeah. And we wanted to keep it simple and minimal, just to illustrate the power of the models.

This is as raw as you can get when it comes to the models: almost all of the completion output goes straight into the next model. And I think there are about 50 lines of code, so the majority of the heavy lifting is being done by the models here.

Another quick example that I'll show you, and then I'll try to do one live, which will probably be tragic. So this is the backstage right here; I just took this photo right before walking on stage. You can see that GPT-4 with vision does a really good job, actually, of describing it.

There are the monitors, and there are boxes, and there are cables. And then this is the image that DALL·E 3 generates. So you can see blue carpet, cables, boxes, all the elements. And then it goes on to spot the differences, and it notices, for example, that in this image there are all these vertical lights that are not present in the first image.

It says it right here: lighting, all these vertical lights on the walls and ceiling. But then it rewrites the prompt, gets rid of all the vertical lights, and adds the curtain in the back, the black curtain here, which wasn't present in the generated image but is present in the photo.

So just little interesting things. There's still a long way to go, but this opens up a whole new box of interaction patterns: the fact that you can now reason visually. Cool, and let's give a live example a shot.

So this was a trail run that I did over the weekend, up in Purisima Woods. And so I was going to do it from scratch; hope that it works. I want to go to another one. There you go. Cool. So: the image depicts a serene and picturesque woodland setting.

The focus of the image is a wooden boardwalk or footbridge that winds through the dense forest. Very detailed description. Light filters through the trees. And I'm just passing that raw, straight to DALL·E. Yeah, and if folks have seen what happens in the DALL·E mode in the ChatGPT iOS app, for example,

it's actually doing a little bit of work there. I don't know off the top of my head what the prompt is for that, but it's doing some amount of prompt engineering. If folks have tried to use our Labs product before to make DALL·E images, you had to do that prompt engineering yourself.

And I think that's been one of the limitations. If people have used Midjourney or other image models in the past, it's just kind of hard to make good prompts that work well for these systems. So it's nice that the model can take a stab at doing it for you.
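One way to approximate that pattern yourself, if you're calling DALL·E directly, is to have GPT-4 rewrite a rough description into a more image-friendly prompt first. The rewriting instruction ChatGPT actually uses isn't public; the system prompt below is made up purely for illustration.

```python
from openai import OpenAI

client = OpenAI()

def rewrite_for_dalle(rough_description: str) -> str:
    """Turn a rough description into a single, vivid image-generation prompt."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": ("Rewrite the user's description as one vivid, concrete "
                         "image-generation prompt. Mention subject, setting, lighting, "
                         "and style. Return only the prompt.")},  # hypothetical instruction
            {"role": "user", "content": rough_description},
        ],
    )
    return resp.choices[0].message.content

prompt = rewrite_for_dalle("a wooden footbridge winding through a dense redwood forest")
url = client.images.generate(model="dall-e-3", prompt=prompt, size="1024x1024").data[0].url
```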

It's telling us a lot about how the second image is a lot more beautiful and more detailed, which checks out. It's also interesting to see, just for folks to think about, that with these image models, the main limitation as we're watching this demo in real time is actually... no.

Of course. Thanks, let's go back to the slides while it runs. All right, I'm going to leave it running, and then if we have time, it'll probably work the second time around. It worked the three times before this.

Cool. Okay. For the second demo, we're going to take it a little bit further and do something with video. The idea here is that there are a lot of video summarization demos out there that we've seen, and the majority of them just take a transcript and then ask GPT-4 to summarize it.

However, videos have a lot of information in them that is conveyed visually. So what we're doing here is taking frames from the video, asking GPT-4 with vision to describe all the frames, and asking Whisper to transcribe the audio.

And now we have this long textual representation of the video that includes not only all the audio information but also the visual information from the video. Then we do some exciting mixes on that, which Logan will tell you about. Yeah. I'm ready for the next slide.
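A rough sketch of that pipeline, assuming OpenCV for frame sampling and the OpenAI SDK for vision and Whisper. The real demo used unreleased APIs and its code wasn't published, so the sampling rate, prompts, and file names here are illustrative.

```python
import base64
import cv2
from openai import OpenAI

client = OpenAI()

def sample_frames(video_path: str, every_n_seconds: int = 30) -> list[str]:
    """Grab one frame every N seconds and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = int(fps * every_n_seconds)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            ok_jpg, buf = cv2.imencode(".jpg", frame)
            if ok_jpg:
                frames.append(base64.b64encode(buf).decode())
        i += 1
    cap.release()
    return frames

def describe_frame(b64_jpg: str) -> str:
    """Ask GPT-4 with vision what is shown in a single frame."""
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        max_tokens=200,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe what's shown in this video frame."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_jpg}"}},
        ]}],
    )
    return resp.choices[0].message.content

frame_notes = [describe_frame(f) for f in sample_frames("gpt4_intro.mp4")]
# Whisper accepts common audio/video formats up to ~25 MB; extract the audio
# track first (e.g. with ffmpeg) for longer videos.
with open("gpt4_intro.mp4", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f).text
video_as_text = transcript + "\n\nVisual notes:\n" + "\n".join(frame_notes)
```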

Yeah. So for this demo, we're literally just taking the GPT-4 introduction video, which folks may have seen on YouTube; it's a good video if you haven't seen it before. So, taking the video raw from YouTube and, again, like Simón said, cutting it up into frames, putting those into GPT-4 with image input, and getting the summaries, which you can see.

And I know it's really hard to read, but it's literally just saying what's in each frame; these are simple images, so it's easy to capture the depth of what's shown here. Taking those images, we then go to the next piece, which is essentially another wonderful DALL·E image plus a big description built from the transcript.

And then all of the image descriptions; essentially, image embeddings is the easiest way of thinking about it. If you want to actually see the results, the QR code in the bottom right-hand corner is real; you can scan it and see the resulting article. It's pretty good.

It does a good job. And for me, why this is exciting is because you can capture, again, the depth of what happens in a video: a DALL·E image to start, and then a bunch of actual frames that match up with the contextual representation of what's being talked about in the blog post.

And again, there's no hand-tuning; I couldn't open-source the code because it uses a bunch of unreleased APIs, but there's no magic behind-the-scenes stuff happening. It's a raw, crappy prompt to generate this blog post, which I think is really cool, and it takes videos and makes them more accessible in text form.
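Continuing the sketch above, the final step might look something like this: feed the combined transcript and frame notes to GPT-4 to draft the post, and use DALL·E 3 for a header image. The prompts here are deliberately simple stand-ins for the "raw, crappy prompt" described, not the actual ones, and the title-line trick for the header image is just one crude choice.

```python
def write_blog_post(video_as_text: str) -> str:
    """Draft a blog post from the combined transcript + visual notes."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": ("You turn video transcripts with visual notes into clear, "
                         "well-structured blog posts in Markdown.")},  # hypothetical prompt
            {"role": "user", "content": video_as_text},
        ],
    )
    return resp.choices[0].message.content

# Assumes video_as_text fits in the context window; chunk or summarize first if not.
post = write_blog_post(video_as_text)
header_prompt = post.splitlines()[0]  # crude: use the title line as the image prompt
header_url = client.images.generate(model="dall-e-3", prompt=header_prompt).data[0].url
```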

So I like it. Cool. Let's see if this finished. No. Oh, well. Cool. Okay. So, some concluding takeaways. Start thinking multimodal. That's something net new that's happening these days, and if you have any crazy ideas where you think, wow, it would be really cool if technology could do this, we'll probably be able to get there, and the products you'll be able to build six months or a year from now are going to be incredible.

So start keeping this in mind as people who are building AI products and building companies. Think of text as the connective tissue right now; I think this is a very powerful concept, and that's going to continue to be the case for the near future.

And there are many powerful patterns yet to be explored when it comes to multimodal stuff, especially when it comes to doing things with images. So really excited to soon get this into the hands of all of you and to see what you all build with it.

I think it's really exciting to see AI start to venture into the visual world. Yeah. Agents with image input are going to be sick. I can't wait; I feel like so much of the internet requires that. Yeah. And we're excited. I think there's a lot of stuff that's going to happen in the near future.

And I think it's cool to hopefully get a glimpse of what some of those use cases look like. So, anything else you want to say, Simón? That's good. All right. This was wonderful. Thank you all.