
OpenAI's Sora


Transcript

There are occasionally these moments in AI, particularly in the past few years, where I see something that is completely unexpected and just seems so incredibly mind-blowing, far beyond where I would expect AI to be. The first time I had that mind-blown feeling with AI was when I was getting ready to do a talk with some engineers from OpenAI, and they showed me what we would now call prompt engineering.

I don't think it was called prompt engineering back then, but we prompt engineered GPT-3 to do RAG and answer questions in different ways, and it was really far beyond what I thought AI was capable of at that point. And now OpenAI have released something called Sora, and it's just insane.

I haven't had a good look at it yet. Given how rare these moments are where I'm truly just mind-blown, I saw one video from this and thought, okay, now it's time to turn the camera on, take a look together, and see what it's like.

So Sora is an AI model that can create realistic and imaginative scenes from text instructions, and it says all videos on this page were generated directly by Sora without modification. I'm sure these are some of the best videos they generated, but nonetheless. So this background image here is already pretty cool.

Let's keep going. This is the video where I started watching and thought, oh wow, this is kind of insane. So this is AI generated. I mean, it's insane. There was the odd thing that was weird, but it's just incredible how good it is. The background is perfect; there's nothing weird going on.

The mood, and the person actually moving through the scene, which you don't usually get with AI-generated videos. Usually the subject is moving slightly and the background is maybe moving a little bit, but this is insane. The amount of detail and the amount of movement is just, I don't even know.

I noticed earlier that the legs get a little weird around here. It's kind of like hands in Stable Diffusion; now it's legs. Around here the leg kind of swaps, which is super weird. Oh, look at that, her left leg became her right leg, which is interesting. It's hard to even notice, but this is just insane.

Even, like, you look here at the jacket and it has these four buttons. As the video goes on, I'm just trying to find anything that's kind of odd, but even here it's the same four buttons, same jacket. Maybe, oh, this one is kind of long and big here.

Okay, so this grew over the video, but gosh, I'm really pointing out very minor little things. It's just insane. Then you look at the prompt, and it's relatively short, just a paragraph. I mean, it's nothing crazy, and that paragraph of text produced this.

Gosh, I'm going to have so much good stock video for my videos now. This one I saw briefly and thought it was less impressive, but I mean, it's so good. And then this one, a guy with a woolly... I don't even know, it's a bit weird and kind of all over the place, but it's really pretty cool.

I mean, look at the detail on the guy, and then this as well. It just looks real, no? The guy looks real; there's nothing here that doesn't look legit. Photorealistic video of pirate ships, ah, as they sail inside a cup of coffee.

How good is that? Looking at this, would I think it's AI? The boat is being weird and going a little crazy, but if this wasn't on OpenAI's website, would I have looked at this and thought it was an AI video? I'm pretty sure I wouldn't.

Okay, so Sora is becoming available to red teamers, so it's still early; it isn't released yet. "We're also granting access to a number of visual artists, designers and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals." How insane. Look at this.

So this one: historical footage of California during the gold rush. I mean, the prompt is tiny, but if you'd shown me this a day or two ago, I would be like, oh, look at this, how did they film this? Were they in a balloon?

I would have no idea. Things like the human eye are, I think, something we're biologically so attuned to. We know what an eye looks like more than almost anything else, right? We can read an eye and understand it so well, and yet I look at this and I don't think I can tell that it's not real.

And the eye is probably the feature that I, as a human, should be able to distinguish from reality most easily. That should be the hardest thing to convince me with. Maybe it moves a bit weirdly, but the eyeball, I mean, really, it's insane. This is going to be like the new AI-generated photos where, after a little while... oh, this is interesting.

So the people here are tiny, and then all of a sudden these people are huge. Oh wow. I mean, it's not supposed to be like that, of course, but it's interesting. Yeah, this is about creating multiple shots within a single generated video where the characters remain the same.

How? I mean, I don't know how they do it; it's just so impressive. It's kind of weird here, like with the guys, yeah, the perspective is strange. Yeah, that's interesting. So the perspective seems to mess up more often. That's a strange thing it has going on here.

Also, in the earlier video of Lagos, Nigeria, the perspective was kind of messed up. Same here. Oh my gosh, look at this. How cool is that? It looks like a film. There's some weird stuff going on here, I feel. Like, what is that? This guy has like three trainers on or something.

So it has its weaknesses. It may struggle with accurately simulating the physics of a complex scene and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.

The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory. Yeah, this is interesting. So let's have a look at these.

Oh yeah, so people and animals just appear. Oh yeah, that's interesting. Oh, look at that. That's so cool, though, at the same time. Archaeologists discover a generic plastic chair in the desert, excavating and dusting it with great care. So here it's like they are... what's happening here? What is this?

How insane. And then the guy's hands as well are messed up. But despite how weird it is, it feels like I'm watching a dream or something. And the actual people themselves are pretty impressive. And this is the first version. This is just insane.

Okay, so the red teaming is so people can basically stress test the model and make sure it's not going to do anything weird, and they'll probably go over the top with it, but what can you do? Oh, imagine how good all the UFO videos will be now.

That's exciting. Oh, it's going to be so difficult to know what's real anymore. All right, so Sora is a diffusion model. It generates a video by starting off with one that looks like static noise. Oh wow, it does the same. How? So it starts with static noise and gradually removes the noise over many steps.
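
Just to make that idea concrete, here's a tiny illustrative sketch in Python of what that kind of sampling loop looks like: start from pure noise and iteratively subtract predicted noise. The `dummy_denoiser` and the update rule are stand-ins for illustration only; a real diffusion sampler uses a trained network and a proper noise schedule.

```python
# Minimal sketch of diffusion-style sampling: start from static noise and
# remove noise over many steps. Purely illustrative, not Sora's actual code.
import numpy as np

def dummy_denoiser(x, t):
    # Stand-in for a trained network that predicts the noise present in x
    # at denoising step t. A real model conditions on the text prompt too.
    return x * 0.1

def sample(shape=(8, 64, 64, 3), num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)        # start from static noise (frames, H, W, C)
    for t in reversed(range(num_steps)):  # gradually remove noise over many steps
        predicted_noise = dummy_denoiser(x, t)
        x = x - predicted_noise           # toy update; real samplers follow a schedule
    return x

video = sample()
print(video.shape)  # (8, 64, 64, 3): a tiny "video" of 8 denoised frames
```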

I think it's the same as Stable Video Diffusion. Then, okay, transformer architecture. I assume the other models did that as well, with the way that they encode the text. They represent videos and images as collections of smaller units of data called patches. Again, I think that's similar to before.
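
To make the patch idea concrete, here's a rough sketch of how a video tensor could be chopped into spacetime patches, each flattened into one token-like vector for a transformer. The patch sizes and array layout are my own assumptions for illustration; OpenAI hasn't published Sora's exact patchification.

```python
# Illustrative patchification: split a video into spacetime patches.
# Patch sizes (pt, ph, pw) are assumed values, not Sora's real ones.
import numpy as np

def to_spacetime_patches(video, pt=2, ph=16, pw=16):
    # video: (frames, height, width, channels)
    f, h, w, c = video.shape
    video = video[: f - f % pt, : h - h % ph, : w - w % pw]  # trim to multiples
    f, h, w, _ = video.shape
    patches = (
        video.reshape(f // pt, pt, h // ph, ph, w // pw, pw, c)
        .transpose(0, 2, 4, 1, 3, 5, 6)
        .reshape(-1, pt * ph * pw * c)  # one flat vector per spacetime patch
    )
    return patches  # (num_patches, patch_dim), ready to feed a transformer

video = np.random.rand(16, 256, 256, 3)
print(to_spacetime_patches(video).shape)  # (2048, 1536)
```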

It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user's text instructions in the generated video more faithfully. Sora serves as a foundation for models that can understand and simulate the real world.
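
As a rough illustration of what recaptioning looks like as a data pipeline: run a captioning model over the raw training clips and train on the resulting (video, caption) pairs instead of the short or missing original labels. The `describe_video` function below is entirely hypothetical; the page only says a descriptive captioner in the style of DALL·E 3 was used.

```python
# Sketch of the recaptioning idea: replace sparse labels with rich,
# descriptive captions generated by a captioning model. Hypothetical code.
def describe_video(video_path: str) -> str:
    # Stand-in for a trained captioning model that produces a highly
    # descriptive caption for the given clip.
    return f"A detailed description of the contents of {video_path}."

def build_training_pairs(video_paths: list[str]) -> list[tuple[str, str]]:
    # Pair each clip with its generated descriptive caption for training.
    return [(path, describe_video(path)) for path in video_paths]

pairs = build_training_pairs(["clip_001.mp4", "clip_002.mp4"])
for path, caption in pairs:
    print(path, "->", caption)
```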

Yeah, it's not bad. How insane is that? Okay, I don't have much more to say. That's pretty impressive. It seems like it's probably going to be a while before we can do anything with it. I'm very curious to see where the other open source video generation models end up.

I mean, they've seemed like the ones that were ahead for a long time, and then OpenAI just shares this. But this is really very, very interesting. That's all I have, so thank you for watching and see you later.

Bye.