Sora - Full Analysis (with new details)
Sora, the text-to-video model from OpenAI, is here, and it appears to be exciting people and worrying them in equal measure. There is something visceral about actually seeing the rate of progress in AI that hits differently from leaderboards or benchmarks. In just the last 18 hours, the technical report for Sora has come out, and more demos and details have been released. I'm going to try to unpack what Sora is, what it means, and what comes next.

Before getting into any details, though, we have to admit that some of the demos are frankly astonishing. This one, a tour of an art gallery, is jaw-dropping to me. But that doesn't mean we have to get completely carried away with OpenAI's marketing material, like the claim that the model understands what the user asks for and understands how those things exist in the physical world. I don't think even the authors of Sora would have signed off on that statement. I know it might seem like I'm being pedantic, but these kinds of edge-case failures are what has held back self-driving for a decade. Yes, Sora has been trained at an immense scale, but I wouldn't say that it understands the world. It has derived billions and trillions of patterns from the world, but it can't yet reason about those patterns. Hence anomalies like the video you can see.
And later on in the release notes, OpenAI says this: "The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene." It doesn't quite get cause and effect, it mixes up left and right, and objects appear spontaneously and disappear for no reason. It's a bit like GPT-4 in that it's breathtaking and intelligent, but if you probe a bit too closely, things fall apart a little. To be clear, I am stunned by Sora just as much as everyone else; I just want it to be put in a little bit of context. That being said, if and when models crack reasoning itself, I will try to be among the first to let you know.

It's time for more details. Sora can generate videos up to a full minute long, at up to 1080p. It was trained on, and can output, different aspect ratios and resolutions. And speaking of high resolution, this demo was amongst the most shocking. It is incredible; just look at the consistent reflections.
In terms of how they made it, they say model and implementation details are not included in the report, but later on they give hints in the papers they cite in the appendices. Almost all of them, funnily enough, come from Google: we have vision transformers; adaptable aspect-ratio-and-resolution vision transformers (also from Google DeepMind), which we saw being implemented with Sora; and many other papers from Facebook and Google. That even led one Google DeepMinder to jokingly say: "You're welcome, OpenAI. I'll share my home address in DM if you want to send us flowers and chocolate."
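Those citations hint at the recipe: chop every video into patch tokens that a transformer can consume regardless of duration, resolution, or aspect ratio. Here is a minimal sketch of that patchifying step, assuming the standard spacetime-patch formulation from the cited papers; the patch sizes and shapes are illustrative, not details from the Sora report.

```python
import torch

def video_to_spacetime_patches(video, pt=2, ph=16, pw=16):
    """Cut a video tensor into flattened spacetime patches.

    video: [T, C, H, W], with T, H, W assumed divisible by pt, ph, pw.
    Returns [num_patches, pt*ph*pw*C]: a token sequence a transformer
    can consume, whatever the clip's aspect ratio or length.
    """
    T, C, H, W = video.shape
    x = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    # Group the three grid axes together and the patch contents together.
    x = x.permute(0, 3, 5, 1, 4, 6, 2)      # [T/pt, H/ph, W/pw, pt, ph, pw, C]
    return x.reshape(-1, pt * ph * pw * C)  # one row per spacetime patch

clip = torch.randn(16, 3, 256, 384)         # 16 frames, non-square frame
tokens = video_to_spacetime_patches(clip)
print(tokens.shape)                         # torch.Size([3072, 1536])
```

Because every clip just becomes a bag of tokens, the same model can train on any mix of resolutions and durations, which is exactly what the adaptable-aspect-ratio paper enables.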
By the way, my 30-second summary of how it's done would be this. Just think to yourself about the task of predicting the next word. It's easy to imagine how you'd test yourself: you'd cover the next word, make a prediction, and check. But how would you do that for images or frames of a video? If all you did was cover the entire image, it would be pretty much impossible to guess, say, a video frame of a monkey playing chess. So how would you bridge that gap? Well, as you can see below, how about adding some noise, like a little bit of cloudiness, to the image? You can still see most of the image, but now you have to infer little patches here and there, with, say, a text caption to help you out. That's more manageable, right? And now it's just a matter of scale: scale up the number of images, or frames of images from a video, that you train on. Ultimately, you could go from a highly descriptive text caption to the full image from scratch, especially if the captions are particularly descriptive, as they are for Sora.
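If you want that intuition in code, here is a toy version of the noise-and-infer training loop just described, assuming the standard noise-prediction diffusion objective; the tiny model, the shapes, and the noise schedule are all stand-ins I've made up, not details from the report.

```python
import torch
import torch.nn.functional as F

class TinyDenoiser(torch.nn.Module):
    """Stand-in for the real model: given a noised sample, the noise
    level, and a caption embedding, predict the noise that was added."""
    def __init__(self, dim=1536, cap_dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1 + cap_dim, 512), torch.nn.SiLU(),
            torch.nn.Linear(512, dim))

    def forward(self, noisy, t, caption_emb):
        t = t[:, None].float() / 1000.0
        return self.net(torch.cat([noisy, t, caption_emb], dim=-1))

model = TinyDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

patches = torch.randn(8, 1536)    # a batch of clean patch tokens
caption = torch.randn(8, 64)      # descriptive caption, as an embedding

t = torch.randint(0, 1000, (8,))  # random noise level per sample
alpha = (1 - t / 1000.0)[:, None] # simplified schedule: higher t, cloudier
noise = torch.randn_like(patches)
noisy = alpha.sqrt() * patches + (1 - alpha).sqrt() * noise

opt.zero_grad()
pred = model(noisy, t, caption)   # guess what was hidden under the cloudiness
loss = F.mse_loss(pred, noise)    # check the guess; that's the whole test
loss.backward()
opt.step()
```

Covering the whole image makes the test impossible; covering it with an adjustable amount of noise makes it a gradable exercise you can run on billions of frames.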
Now, by the way, all you need to do is find a sugar daddy to invest 13 billion dollars into you and, boom, you're there. Of course, I'm being a little bit facetious. It builds on years of work, including by notable contributors from OpenAI, who pioneered the auto-captioning of images with highly descriptive language. Using those synthetic captions massively optimized the training process.
When I mention scale, by the way, look at the difference that more compute makes. When I say compute, think of arrays of GPUs in a data center somewhere in America. When you 4x the compute, you get this; and if you 16x it, you get that. More images, more training, more compute: better results. Now, I know what you're thinking: just 100x the compute, there's definitely enough data. I did a back-of-the-envelope calculation that there are quadrillions of frames just on YouTube. Definitely easier to access if you're Google, by the way.
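For the curious, here is roughly how that back-of-the-envelope calculation can go. Every input below is an assumption for illustration (the upload rate is a commonly cited rough figure), not a measured number.

```python
# Rough sanity check on "quadrillions of frames on YouTube".
HOURS_UPLOADED_PER_MINUTE = 500   # commonly cited rough figure
YEARS_OF_UPLOADS = 15             # assumption
FPS = 30                          # assumption

minutes_per_year = 60 * 24 * 365
total_hours = HOURS_UPLOADED_PER_MINUTE * minutes_per_year * YEARS_OF_UPLOADS
total_frames = total_hours * 3600 * FPS

print(f"{total_hours:.2e} hours ≈ {total_frames:.2e} frames")
# ~3.9e9 hours ≈ ~4.3e14 frames from these inputs alone; more generous
# assumptions (higher frame rates, growth, the full back catalog) push
# the figure into the quadrillions.
```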
But I will caveat that, as we've seen with GPT-4, scale doesn't get you all the way to reasoning, so you'll still get weird breaches of the laws of physics until other innovations get thrown in.

But then we get to something big that I don't think enough people are talking about: by training on video, you're inadvertently solving images. An image, after all, is just a single frame of a video. The images from Sora go up to 2K by 2K pixels, and of course they could be scaled up further with a tool like Magnific. I tried that for this image, and honestly there was nothing I could see that would tell me this isn't just a photo. I'd almost ask whether this means there won't be a DALL·E 4, because Sora supersedes it.
Take animating an image: this example, of a Shiba Inu dog wearing a beret and black turtleneck, is just incredible. That's the image on the left, and it being animated on the right. You can imagine the business use cases of this, where people bring to life photos of themselves, friends and family, or maybe even deceased loved ones. Or how about every page in what would otherwise be a static children's book being animated on demand: you just click, and the characters get animated. Honestly, the more I think about it, the more I think Sora is going to make OpenAI billions and billions of dollars. The number of other companies and apps that it simply subsumes is hard to count. I'll come back to that point.

But meanwhile, here is a handful of other incredible demos. This is a movie trailer; notice how Sora is picking quite fast cuts, obviously all automatically. It gets that a cinematic trailer is going to be pretty dynamic and fast-paced. Likewise, this is a single video generated by Sora, not a compilation, and if you ignore some text-spelling issues, it is astonishing.
And here is another one that I'm going to have to spend some time on. The implications of this feature alone are astonishing. All three videos that you can see are going to end with the exact same frame. Even that final frame, of the cable car crashing into that sign, was generated by Sora, including the minor misspelling at the top. Just think of the implications. You could take a photo with your friends and imagine a hundred different ways you could have got to that final photo. Or maybe you have your own website, and every user gets a unique voyage to your landing page. And of course, when we scale this up, we could put in the ending of a movie, and Sora 2 or Sora 3 would generate all the different movies that could have led to that point. You could have daily variations on your favorite movie ending. As a side note, this also allows you to create these funky loops where the starting and finishing frames are identical. I could just let this play for a few minutes until people got really confused, but I won't do that to you.
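How might a model be steered to end on an exact frame? OpenAI doesn't say, but one common diffusion trick is inpainting-style clamping: at every denoising step, overwrite the frames you want fixed with a suitably noised copy of the target. This toy sketch is my assumption about the mechanism, not anything from the report; the sampler, the schedule, and the stand-in denoiser are all hypothetical.

```python
import torch

def sample_with_fixed_last_frame(denoise_step, final_frame, T=16, steps=50):
    """Toy inpainting-style sampler over a video latent [T, C, H, W].

    denoise_step(x, i) -> a slightly less noisy x (stand-in for the model).
    After each step we re-clamp the last frame to a re-noised copy of the
    target, so every sampled video must end on exactly that frame.
    """
    x = torch.randn(T, *final_frame.shape)  # start from pure noise
    for i in reversed(range(steps)):
        x = denoise_step(x, i)
        level = i / steps                   # simplified noise schedule
        x[-1] = (1 - level) * final_frame + level * torch.randn_like(final_frame)
    return x                                # a video ending at final_frame

dummy = lambda x, i: x * 0.95               # a real model predicts actual content
video = sample_with_fixed_last_frame(dummy, final_frame=torch.randn(3, 32, 32))
print(video.shape)                          # torch.Size([16, 3, 32, 32])
```

Clamp frame 0 and the final frame to the same image, and you get exactly those seamless loops.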
And here is yet another feature that I was truly bowled over by. The video you can see on screen was not generated by Sora. And now I'm going to switch to another video, which was also not generated by Sora. But what Sora can do is interpolate between those videos to come up with a unique creation. This time I'm not even going to list the potential applications, because, again, they are innumerable. What I will do, though, is give you one more example that I thought of when I saw this. Another demo OpenAI used was mixing together this chameleon and this funky-looking bird (I'm not sure of its name) to create this wild mixture.
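One crude way to picture that interpolation, purely my sketch with a made-up latent space rather than OpenAI's method, is to blend the two clips' latents with a weight that ramps over time, with the generative model then cleaning up and decoding the result.

```python
import torch

def interpolate_clips(latent_a, latent_b):
    """Blend two video latents [T, C, H, W] into one transition clip.

    The blend weight ramps from 0 (all clip A) at the first frame to
    1 (all clip B) at the last, so the clip morphs from A into B. A real
    system would follow this with a denoising pass so the in-between
    frames look like one coherent creature, not a double exposure.
    """
    T = latent_a.shape[0]
    w = torch.linspace(0.0, 1.0, T).view(T, 1, 1, 1)
    return (1 - w) * latent_a + w * latent_b

a = torch.randn(16, 4, 32, 32)         # latent of the chameleon clip (made up)
b = torch.randn(16, 4, 32, 32)         # latent of the bird clip (made up)
print(interpolate_clips(a, b).shape)   # torch.Size([16, 4, 32, 32])
```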
Now, we all know that OpenAI are not going to allow you to do this with human images, but an open-source version of Sora will be following close behind. So imagine putting in a video of you and your partner and creating this hybrid, freaky video; or maybe you and your pet.

Now, the best results you're going to get from Sora are, inevitably, when there's not much movement going on: the less movement, the fewer problems with things like object permanence. Mind you, even when there is quite a lot going on, the results can still be pretty incredible. Look at how Sora handles object permanence here, with the dog fully covered and then emerging looking exactly the same.
Likewise, this video of a man eating a burger: because he's moving in slow motion, it's much higher fidelity. Aside from the bokeh effect, it could almost be real. And then we get this gorgeous video, where you'd almost have to convince me it's from Sora; look at how the paint marks stay on the page. And then we get simulated gaming, where, again, if you ignore some of the physics and the rule-breaking, the visuals alone are just incredible. Obviously, they trained Sora on thousands of hours of Minecraft videos. I mean, look how accurate some of the boxes are. I bet some of you watching this think I simply replaced a Sora video with an actual Minecraft video, but no, I didn't. That has been quite a few hype demos, so it's time for some anti-hype ones.
Here is Sora clearly not understanding the world around it. Just as ChatGPT's understanding can sometimes be paper-thin, so can Sora's: it doesn't get the physics of the cup, the ice, or the spill.

I can't forget to mention, though, that you can also change the style of a video. Here is the input video, presumably from a game. Now, with one prompt, you can change the background to a jungle. Or maybe you prefer to play the game in the 1920s; I mean, you can see how the wheels aren't moving properly, but the overall effect is incredible. Well, actually, this time I want to play the game underwater. How about that? Job done. Or maybe I'm high and I want the game to look like a rainbow. Or maybe I prefer the old-fashioned days of pixel art.
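A plausible mechanism for this prompt-driven restyling, again an assumption on my part borrowed from SDEdit-style image editing rather than anything the video confirms, is to partially noise the source clip and then denoise it under the new prompt: the structure survives, while the appearance is regenerated.

```python
import torch

def restyle_video(denoise_step, source_latent, strength=0.6, steps=50):
    """SDEdit-style edit of an existing clip (toy version).

    Noise the source latent partway (strength in [0, 1]), then run only
    the remaining denoising steps under the *new* prompt. Low strength
    preserves structure (the kart, the track); the regenerated detail
    carries the new style (jungle, 1920s, underwater, pixel art).
    """
    start = int(steps * strength)
    noise = torch.randn_like(source_latent)
    x = (1 - strength) * source_latent + strength * noise  # partially cloudy
    for i in reversed(range(start)):
        x = denoise_step(x, i)  # the real model would also take the new prompt
    return x

dummy = lambda x, i: x * 0.98             # stand-in denoiser
clip = torch.randn(16, 4, 32, 32)
print(restyle_video(dummy, clip).shape)   # torch.Size([16, 4, 32, 32])
```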
I've noticed a lot of people, by the way, speculating about where OpenAI got all the data to train Sora. I think many people have forgotten that they did a deal back in July with Shutterstock. In case you don't know, Shutterstock has 32 million stock videos, and most of them are high-resolution. They probably also used millions of hours of video-game footage, would be my guess.

One more thing you might be wondering: don't these worlds just disappear the moment you move on to the next prompt? Well, with video-to-3D, that might not always be the case. This is from Luma AI; imagine a world generated at first by Sora, then turned into a universally shareable 3D landscape that you can interact with. Effectively, you and your friends could inhabit a world generated by Sora. And yes, ultimately, with scale, you could generate your own high-fidelity video game. And given that you can indefinitely extend clips, I am sure many people will be creating their own short movies, perhaps voiced by AI. Here's an Eleven Labs voice giving you a snippet of the caption to this video: "An adorable, happy otter confidently stands on a surfboard wearing a yellow life jacket, riding along turquoise tropical waters near lush tropical islands." Or how about hooking Sora up to the Apple Vision Pro or Meta Quest? Especially for those who can't travel, that could be an incredible way of exploring the world. Of course, being real here, the most common use case might be children using it to make cartoons and play games. But still, that counts as a valid use case to me.
But underneath all of these use cases are some serious points. In a since-deleted tweet, one OpenAI employee said this: "We are very intentionally not sharing it widely yet. The hope is that a mini public demo kicks a social response into gear." I'm not really sure what social response people are supposed to give, though. It's not responsible to let people just panic, which is why I've given the caveats I have throughout this video. I believe, as with language models and self-driving, that the edge cases will still take a number of years to solve; that's at least my best guess. But it seems to me that when reasoning is solved, and therefore even long videos actually make sense, many more jobs than just videographers' might be under threat. As the creator of GitHub Copilot put it: "If OpenAI is going to continue to eat AI startups sector by sector, they should go public. Building the new economy where only 500 people benefit is a dodgy future." And the founder of Stability AI tweeted out this image. It does seem to be the best of times and the worst of times to be an AI startup: you never know when OpenAI or Google are going to drop a model that massively changes and affects your business.

It's not just Sora whacking PikaLabs, RunwayML, and maybe Midjourney.
If you make the chips that OpenAI uses, they want to make them instead; I'm going to be doing a separate video about all of that. When you use the ChatGPT app on a phone, they want to make the phone you're using. You come up with Character AI, and OpenAI comes out with the GPT Store. I bet OpenAI are even cooking up an open-world game with GPT-powered NPCs; don't forget that they acquired Global Illumination, the makers of this Minecraft clone. If you make agents, we learned last week that OpenAI want to create an agent that operates your entire device. Again, I've got more on that coming soon. Or what if you're making a search engine powered by a GPT model? That's the case, of course, with Perplexity, and I will be interviewing the CEO and founder of Perplexity for AI Insiders next week; Insiders can submit questions, and of course do feel free to join on Patreon. But fitting with the trend, we learned less than 48 hours ago that OpenAI is developing a web-search product. I'm not necessarily critiquing any of this, but you're starting to see the theme: OpenAI will have no qualms about eating your lunch.
And of course, there's one more implication that's a bit more long-term. Two lead authors of Sora both retweeted this video from Berkeley. You're seeing a humanoid transformer robot, trained with large-scale reinforcement learning in simulation and deployed to the real world zero-shot. In other words, it learned to move like this by watching and acting in simulations. If you want to learn more about learning from simulations, do check out my Eureka video and my interview with Jim Fan. TL;DR: better simulations mean better robotics.
Two final demos to end this video with. First, a monkey playing chess in a park. This demo kind of sums up Sora: it looks gorgeous, and I was astounded like everyone else. However, if you look a bit closer, the piece positions and the board don't make any sense. Sora doesn't understand the world, but it is drawing upon billions and billions of patterns. And then there's the obligatory comparison: the Will Smith spaghetti video (and I wonder what source they originally got some of the images from). You could say this was around state of the art just 11 months ago. And now here's Sora. Not perfect (look at the paws), but honestly remarkable. Indeed, I would call Sora a milestone human achievement. But for now, I want to thank you for watching this video all the way to the end. And no, despite what many people think, it isn't generated by an AI. Have a wonderful day.