
Sora - Full Analysis (with new details)


Transcript

Sora, the text-to-video model from OpenAI, is here and it appears to be exciting people and worrying them in equal measure. There is something visceral about actually seeing the rate of progress in AI that hits different than leaderboards or benchmarks. And in just the last 18 hours, the technical report for Sora has come out and more demos and details have been released.

I'm going to try to unpack what Sora is, what it means and what comes next. Before getting into any details though, we just have to admit that some of the demos are frankly astonishing. This one, a tour of an art gallery, is jaw-dropping to me. But that doesn't mean we have to get completely carried away with OpenAI's marketing material.

Take this claim: that the model understands what the user asks for and understands how those things exist in the physical world. I don't even think the authors of Sora would have signed off on that statement. And I know it might seem I'm being pedantic, but these kinds of edge-case failures are what's held back self-driving for a decade.

Yes, Sora has been trained at an immense scale, but I wouldn't say that it understands the world. It has derived billions and trillions of patterns from the world, but can't yet reason about those patterns. Hence anomalies like the video you can see. And later on in the release notes, OpenAI says this, "The current model has weaknesses.

It may struggle with accurately simulating the physics of a complex scene. It doesn't quite get cause and effect. It also mixes up left and right and objects appear spontaneously and disappear for no reason. It's a bit like GPT-4 in that it's breathtaking and intelligent, but if you probe a bit too closely, things fall apart a little bit." To be clear, I am stunned by Sora just as much as everyone else.

I just want it to be put in a little bit of context. That being said, if and when models crack reasoning itself, I will try to be among the first to let you know. Time for more details: Sora can generate videos up to a full minute long, at up to 1080p.

It was trained on and can output different aspect ratios and resolutions. And speaking of high resolution, this demo was amongst the most shocking. It is incredible. Just look at the consistent reflections. In terms of how they made it, they say model and implementation details are not included in this report, but later on they give hints in terms of the papers they cite in the appendices.

Almost all of them, funnily enough, come from Google. We have vision transformers, adaptable aspect ratio and resolution vision transformers, also from Google DeepMind, and we saw that being implemented with Sora, and many other papers from Facebook and Google were cited. That even led one Google DeepMinder to jokingly say this, "You're welcome OpenAI.

I'll share my home address in DM if you want to send us flowers and chocolate." By the way, my 30-second summary of how it's done would be this. Just think to yourself about the task of predicting the next word. It's easy to imagine how you'd test yourself: you'd cover the next word, make a prediction and check.

But how would you do that for images or frames of a video? If all you did was cover the entire image, it would be pretty impossible to guess, say, a video frame of a monkey playing chess. So how would you bridge that gap? Well, as you can see below, how about adding some noise, like a little bit of cloudiness to the image?

You can still see most of the image, but now you have to infer little patches here and there with, say, a text caption to help you out. That's more manageable, right? And now it's just a matter of scale. Scale up the number of images or frames of images from a video that you train on.
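To make that concrete, here is a tiny toy sketch of the noising idea in plain NumPy. It is purely illustrative, not Sora's actual training code, which OpenAI has not released.

```python
# Toy sketch of the "add noise, then learn to remove it" idea behind
# diffusion models. Purely illustrative; not Sora's actual training code.
import numpy as np

rng = np.random.default_rng(0)

# Pretend this is one video frame: height x width x RGB, values in [0, 1].
frame = rng.random((64, 64, 3))

def add_noise(x, noise_level):
    """Blend a clean frame with Gaussian noise.
    noise_level=0 leaves it clean; noise_level=1 is (almost) pure noise."""
    noise = rng.normal(size=x.shape)
    return np.sqrt(1 - noise_level) * x + np.sqrt(noise_level) * noise, noise

# At low noise you can still "see" most of the frame; at high noise you can't.
slightly_noisy, noise_lo = add_noise(frame, 0.1)
very_noisy, noise_hi = add_noise(frame, 0.9)

# Training objective (conceptually): given the noisy frame, the noise level
# and a text-caption embedding, predict the noise that was added, i.e.
#   loss = || model(noisy_frame, noise_level, caption_embedding) - noise ||^2
# At generation time you start from pure noise and repeatedly subtract the
# predicted noise, guided by the caption, until a clean frame emerges.
```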

Ultimately, you could go from a highly descriptive text caption to the full image from scratch, especially if the captions are particularly descriptive as they are for Sora. Now, by the way, all you need to do is find a sugar daddy to invest 13 billion dollars into you and boom, you're there.

Of course, I'm being a little bit facetious. It builds on years of work, including by notable contributors from OpenAI. They pioneered the auto-captioning of images with highly descriptive language. Using those synthetic captions massively optimized the training process. When I mention scale, by the way, look at the difference that more compute makes.

When I say compute, think of arrays of GPUs in a data center somewhere in America. When you 4X the compute, you get this. And if you 16X it, you get that. More images, more training, more compute, better results. Now, I know what you're thinking. Just 100X the compute. There's definitely enough data.

I did a back of the envelope calculation that there are quadrillions of frames just on YouTube. Definitely easier to access if you're Google, by the way. But I will caveat that as we've seen with GPT-4, scale doesn't get you all the way to reasoning. So you'll still get weird breaches of the laws of physics until you get other innovations thrown in.
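Roughly, that back-of-the-envelope calculation goes like this, using the widely quoted ballpark of about 500 hours of video uploaded to YouTube every minute. Treat it as an order-of-magnitude estimate only.

```python
# Rough order-of-magnitude estimate of how many video frames YouTube holds,
# based on ballpark public figures rather than anything official.

hours_uploaded_per_minute = 500
minutes_per_year = 60 * 24 * 365
hours_per_year = hours_uploaded_per_minute * minutes_per_year   # ~263 million

frames_per_hour = 3600 * 30            # assuming ~30 frames per second
frames_per_year = hours_per_year * frames_per_hour

print(f"{frames_per_year:.1e} frames uploaded per year")        # ~2.8e13
# Accumulated over the platform's whole back-catalogue, that lands somewhere
# in the hundreds of trillions to quadrillions of frames (1e14 to 1e15+).
```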

But then we get to something big that I don't think enough people are talking about. By training on video, you're inadvertently solving images. An image, after all, is just a single frame of a video. The images from Sora go up to 2K by 2K pixels. And of course, they could be scaled up further with a tool like Magnific.

I tried that for this image and honestly, there was nothing I could see that would tell me that this isn't just a photo. I'd almost ask the question of whether this means there won't be a DALL-E 4, because Sora supersedes it. Take animating an image: this example of a Shiba Inu dog wearing a beret and a black turtleneck is just incredible.

That's the image on the left and it being animated on the right. You can imagine the business use cases of this where people bring to life photos of themselves, friends and family, or maybe even deceased loved ones. Or how about every page in what would be an otherwise static children's book being animated on demand.

You just click and then the characters get animated. Honestly, the more I think about it, the more I think Sora is going to make OpenAI billions and billions of dollars. The number of other companies and apps that it just subsumes within it is innumerable. I'll come back to that point.

But meanwhile, here is a handful of other incredible demos. This is a movie trailer and notice how Sora is picking quite fast cuts, obviously all automatically. It gets that a cinematic trailer is going to be pretty dynamic and fast paced. Likewise, this is a single video generated by Sora, not a compilation.

And if you ignore some text spelling issues, it is astonishing. And here is another one that I'm going to have to spend some time on. The implications of this feature alone are astonishing. All three videos that you can see are going to end with the exact same frame. Even that final frame of the cable car crashing into that sign was generated by Sora, including the minor misspelling at the top.

But just think of the implications. You could have a photo with your friends and imagine a hundred different ways you could have got to that final photo. Or maybe you have your own website and every user gets a unique voyage to your landing page. And of course, when we scale this up, we could put the ending of a movie in and Sora 2 or Sora 3 would generate all the different movies that could have led to that point.

You could have daily variations to your favorite movie ending. As a side note, this also allows you to create these funky loops where the starting and finishing frame are identical. I could just let this play for a few minutes until people got really confused, but I won't do that to you.
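How might a model be steered to land on an exact final frame? OpenAI hasn't said how Sora does it, but one generic, inpainting-style trick used with diffusion models is to clamp the last frame to a suitably noised copy of the target at every denoising step. A minimal sketch with a placeholder denoiser:

```python
# Minimal sketch of forcing a diffusion sampler to end on a chosen frame.
# A generic inpainting-style technique, not a claim about Sora's internals;
# toy_denoise_step stands in for a real learned denoiser.
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 16, 32, 32                         # frames, height, width (toy sizes)
noise_levels = np.linspace(0.99, 0.01, 50)   # near-pure noise down to near-clean

def toy_denoise_step(video, level):
    """Placeholder: a real model would predict and remove noise conditioned
    on the text prompt. Here we just shrink the values a little."""
    return video * (1 - 0.1 * level)

target_last_frame = rng.random((H, W))       # the frame every sample must end on

video = rng.normal(size=(T, H, W))           # start from pure noise
for level in noise_levels:
    video = toy_denoise_step(video, level)
    # Overwrite the final frame with the target, noised to the current level,
    # so the sampler has to "explain" a trajectory that ends exactly there.
    video[-1] = (np.sqrt(1 - level) * target_last_frame
                 + np.sqrt(level) * rng.normal(size=(H, W)))

video[-1] = target_last_frame                # final clamp to the exact target
```

Apply the same clamp to both the first and last frame and you get the seamless loops mentioned above.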

And here is yet another feature that I was truly bowled over by. The video you can see on screen was not generated by Sora. And now I'm going to switch to another video, which was also not generated by Sora. But what Sora can do is interpolate between those videos to come up with a unique creation.

This time I'm not even going to list the potential applications because again, they are innumerable. What I will do though, is give you one more example that I thought of when I saw this. Another demo that OpenAI used was mixing together this chameleon and this funky-looking bird, whose name I'm not sure of, to create this wild mixture.

Now, we all know that OpenAI are not going to allow you to do this with human images, but an open source version of Sora will be following close behind. So imagine putting in a video of you and your partner and creating this freaky hybrid video, or maybe you and your pet.
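For what it's worth, one common way this kind of blending is done with latent diffusion models is to encode each clip into the model's latent space, interpolate between the two latents, and decode the result. That is a general technique, not a statement about Sora's internals, and the encode/decode calls below are hypothetical placeholders.

```python
# Generic sketch of blending two clips by interpolating in latent space.
# Not a description of Sora's internals; encode()/decode() are hypothetical.
import numpy as np

def slerp(a, b, t):
    """Spherical interpolation between two latent tensors of the same shape."""
    a_flat, b_flat = a.ravel(), b.ravel()
    a_n = a_flat / np.linalg.norm(a_flat)
    b_n = b_flat / np.linalg.norm(b_flat)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * a + t * b                       # nearly parallel: plain lerp
    out = (np.sin((1 - t) * omega) * a_flat + np.sin(t * omega) * b_flat) / np.sin(omega)
    return out.reshape(a.shape)

# Toy stand-ins for two encoded clips (real latents would come from the model's encoder).
latent_a = np.random.default_rng(1).normal(size=(16, 32, 32))   # e.g. the chameleon clip
latent_b = np.random.default_rng(2).normal(size=(16, 32, 32))   # e.g. the bird clip

# Sweep the blend weight across the clip so it starts as A and ends as B:
# blended_frames = [decode(slerp(latent_a, latent_b, t)) for t in np.linspace(0, 1, 16)]
```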

Now, the best results you're going to get from Sora are inevitably when there's not as much movement going on. The less movement, the fewer problems with things like object permanence. Mind you, even when there is quite a lot going on, the results can still be pretty incredible. Look at how Sora handles object permanence here with the dog fully covered and then emerging looking exactly the same.

Likewise, this video of a man eating a burger: because he's moving in slow motion, it's much higher fidelity. Aside from the bokeh effect, it could almost be real. And then we get this gorgeous video where you'd almost have to convince me it's from Sora. Look at how the paint marks stay on the page.

And then we get simulated gaming where again, if you ignore some of the physics and the rule breaking, the visuals alone are just incredible. Obviously, they trained Sora on thousands of hours of Minecraft videos. I mean, look how accurate some of the blocks are. I bet some of you watching this think I simply replaced a Sora video with an actual Minecraft video, but no, I didn't.

Those were quite a few hype demos, so time for some anti-hype ones. Here is Sora clearly not understanding the world around it. Just like ChatGPT's understanding can sometimes be paper thin, so can Sora's. It doesn't get the physics of the cup, the ice, or the spill. I can't forget to mention, though, that you can also change the style of a video.

Here is the input video, presumably from a game. Now with one prompt, you can change the background to a jungle. Or maybe you prefer to play the game in the 1920s. I mean, you can see how the wheels aren't moving properly, but the overall effect is incredible. Well, actually this time, I want to play the game underwater.

How about that? Job done. Or maybe I'm high and I want the game to look like a rainbow. Or maybe I prefer the old-fashioned days of pixel art. I've noticed a lot of people, by the way, speculating where OpenAI got all the data to train Sora. I think many people have forgotten that they did a deal back in July with Shutterstock.

In case you don't know, Shutterstock has 32 million stock videos, and most of them are high resolution. They probably also used millions of hours of video game frames, would be my guess. One more thing you might be wondering, don't these worlds just disappear the moment you move on to the next prompt?

Well, with video to 3D, that might not always be the case. This is from Luma AI, and imagine a world generated at first by Sora, then turned into a universally shareable 3D landscape that you can interact with. Effectively, you and your friends could inhabit a world generated by Sora.

And yes, ultimately with scale, you could generate your own high fidelity video game. And given that you can indefinitely extend clips, I am sure many people will be creating their own short movies. Perhaps voiced by AI, here's an Eleven Labs voice giving you a snippet of the caption to this video.

An adorable, happy otter confidently stands on a surfboard wearing a yellow life jacket, riding along turquoise tropical waters near lush tropical islands. Or how about hooking Sora up to the Apple Vision Pro or MetaQuest? Especially for those who can't travel, that could be an incredible way of exploring the world.

Of course, being real here, the most common use case might be children using it to make cartoons and play games. But still, that counts as a valid use case to me. But underneath all of these use cases are some serious points. In a since deleted tweet, one OpenAI employee said this, "We are very intentionally not sharing it widely yet.

The hope is that a mini public demo kicks a social response into gear." I'm not really sure what social response people are supposed to give, though. It's not responsible to let people just panic, which is why I've given the caveats I have throughout this video. I believe, as with language and self-driving, that the edge cases will still take a number of years to solve.

That's at least my best guess. But it seems to me that when reasoning is solved, and therefore even long videos actually make sense, a lot more jobs than just videography might be under threat. As the creator of GitHub Copilot put it, "If OpenAI is going to continue to eat AI startups sector by sector, they should go public.

Building the new economy where only 500 people benefit is a dodgy future." And the founder of StabilityAI tweeted out this image, "It does seem to be the best of times and the worst of times to be an AI startup. You never know when OpenAI or Google are going to drop a model that massively changes and affects your business." It's not just Sora whacking PikaLabs, RunwayML and maybe MidJourney.

If you make the chips that OpenAI uses, they want to make them instead. I'm going to be doing a separate video about all of that. When you use the ChatGPT app on a phone, they want to make the phone you're using. You come up with Character AI and OpenAI comes out with the GPT Store.

I bet OpenAI are even cooking up an open world game with GPT powered NPCs. Don't forget that they acquired Global Illumination, the makers of this Minecraft clone. If you make agents, we learned last week that OpenAI want to create an agent that operates your entire device. Again, I've got more on that coming soon.

Or what about if you're making a search engine powered by a GPT model? That's the case of course with Perplexity and I will be interviewing the CEO and founder of Perplexity for AI Insiders next week. Insiders can submit questions and of course do feel free to join on Patreon.

But fitting with the trend, we learned less than 48 hours ago that OpenAI is developing a web search product. I'm not necessarily critiquing any of this, but you're starting to see the theme. OpenAI will have no qualms about eating your lunch. And of course there's one more implication that's a bit more long term.

Two lead authors from Sora both retweeted this video from Berkeley. You're seeing a humanoid transformer robot trained with large scale reinforcement learning in simulation and deployed to the real world zero shot. In other words, it learned to move like this by watching and acting in simulations. If you want to learn more about learning from simulations, do check out my Eureka video and my interview with Jim Fan.

TLDR, better simulations mean better robotics. Two final demos to end this video with. First, a monkey playing chess in a park. This demo kind of sums up Sora. It looks gorgeous. I was astounded like everyone else. However, if you look a bit closer, the piece positions and board don't make any sense.

Sora doesn't understand the world, but it is drawing upon billions and billions of patterns. And then there's this obligatory comparison. The Will Smith spaghetti video, and I wonder what source they originally got some of the images from. You could say this was around state of the art just 11 months ago.

And now here's Sora. Not perfect. Look at the paws, but honestly remarkable. Indeed, I would call Sora a milestone human achievement. But now I want to thank you for watching this video all the way to the end. And no, despite what many people think, it isn't generated by an AI.

Have a wonderful day.