Sora - Full Analysis (with new details)


00:00:00.000 | Sora, the text-to-video model from OpenAI, is here and it appears to be exciting people
00:00:06.560 | and worrying them in equal measure. There is something visceral about actually seeing
00:00:12.560 | the rate of progress in AI that hits different than leaderboards or benchmarks.
00:00:18.320 | And in just the last 18 hours, the technical report for Sora has come out and more demos
00:00:24.160 | and details have been released. I'm going to try to unpack what Sora is, what it means and what
00:00:30.880 | comes next. Before getting into any details though, we just have to admit that some of the demos are
00:00:37.280 | frankly astonishing. This one, a tour of an art gallery, is jaw-dropping to me. But that doesn't
00:00:43.680 | mean we have to get completely carried away with OpenAI's marketing material, like the claim that the model
00:00:49.440 | understands what the user asks for and how those things exist in the physical world.
00:00:56.000 | I don't even think the authors of Sora would have signed off on that statement. And I know it might
00:01:00.880 | seem like I'm being pedantic, but these kinds of edge-case failures are what's held back self-driving
00:01:05.840 | for a decade. Yes, Sora has been trained at an immense scale, but I wouldn't say that it
00:01:11.040 | understands the world. It has derived billions, even trillions, of patterns from the world,
00:01:16.320 | but can't yet reason about those patterns. Hence anomalies like the video you can see.
00:01:21.120 | And later on in the release notes, OpenAI says this, "The current model has weaknesses. It may
00:01:26.240 | struggle with accurately simulating the physics of a complex scene. It doesn't quite get cause
00:01:31.360 | and effect. It also mixes up left and right and objects appear spontaneously and disappear for no
00:01:37.520 | reason." It's a bit like GPT-4 in that it's breathtaking and intelligent, but if you probe
00:01:43.280 | a bit too closely, things fall apart a little bit. To be clear, I am stunned by Sora just as much as
00:01:50.000 | everyone else. I just want it to be put in a little bit of context. That being said, if and
00:01:54.480 | when models crack reasoning itself, I will try to be among the first to let you know. It's time for
00:02:00.640 | more details. Sora can generate videos up to a full minute long, at up to 1080p. It was trained on
00:02:08.000 | and can output different aspect ratios and resolutions. And speaking of high resolution,
00:02:14.000 | this demo was amongst the most shocking. It is incredible. Just look at the consistent reflections.
00:02:21.280 | In terms of how they made it, they say model and implementation details are not included
00:02:26.320 | in this report, but later on they give hints in terms of the papers they cite in the appendices.
00:02:31.520 | Almost all of them, funnily enough, come from Google. We have vision transformers,
00:02:36.080 | adaptable aspect-ratio and resolution vision transformers, also from Google DeepMind,
00:02:41.120 | which we can see reflected in Sora, and many other papers from Facebook and Google were
00:02:46.240 | cited. That even led one Google DeepMinder to jokingly say this, "You're welcome OpenAI. I'll
00:02:52.160 | share my home address in DM if you want to send us flowers and chocolate."
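
One concrete detail the technical report does give: Sora turns visual data into "spacetime patches", the video counterpart of a language model's tokens. Here is a toy sketch of what that patchifying step could look like; the tensor shapes and patch sizes below are my own assumptions, not anything OpenAI has published.

```python
import torch

def patchify(video, pt=2, ph=16, pw=16):
    """Chop a (frames, channels, height, width) video into flat patch tokens.

    pt/ph/pw are the patch sizes in time, height and width -- guesses,
    chosen only so the arithmetic below works out cleanly.
    """
    f, c, h, w = video.shape
    return (
        video.reshape(f // pt, pt, c, h // ph, ph, w // pw, pw)
             .permute(0, 3, 5, 1, 2, 4, 6)    # group axes by (time, row, col)
             .reshape(-1, pt * c * ph * pw)   # one flat vector per patch
    )

tokens = patchify(torch.randn(16, 3, 256, 256))  # a toy 16-frame clip
print(tokens.shape)  # torch.Size([2048, 1536]): 2048 tokens for a transformer
```

A transformer then attends over those tokens much as a language model attends over words, which is also why variable resolutions and aspect ratios are easy to support: different videos simply yield different numbers of tokens.
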
00:02:57.600 | By the way, my 30-second summary of how it's done would be this. Just think to yourself about the task of predicting the next
00:03:03.200 | word. It's easy to imagine how you'd test yourself, you'd cover the next word, make a prediction and
00:03:08.240 | check. But how would you do that for images or frames of a video? If all you did was cover the
00:03:13.280 | entire image, it would be pretty impossible to guess, say, a video frame of a monkey playing
00:03:18.960 | chess. So how would you bridge that gap? Well, as you can see below, how about adding some noise,
00:03:23.600 | like a little bit of cloudiness to the image? You can still see most of the image, but now you have
00:03:28.640 | to infer little patches here and there with, say, a text caption to help you out. That's more
00:03:34.400 | manageable, right? And now it's just a matter of scale. Scale up the number of images or frames
00:03:39.760 | of images from a video that you train on. Ultimately, you could go from a highly descriptive
00:03:44.720 | text caption to the full image from scratch, especially if the captions are particularly
00:03:49.840 | descriptive, as they are for Sora.
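
To make that intuition slightly more concrete, here is a minimal sketch of one denoising-diffusion training step, the family of techniques the cited papers build on. The `model` and `text_encoder` are hypothetical stand-ins, and the linear noise schedule is a simplification; none of this is Sora's actual code.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, text_encoder, images, captions):
    """One simplified training step: cloud the image, guess the cloudiness."""
    # 1. Pick a random noise level in [0, 1] for each image in the batch.
    t = torch.rand(images.shape[0], device=images.device)
    a = t.view(-1, 1, 1, 1)

    # 2. "Cloud up" the images by blending in that much random noise.
    noise = torch.randn_like(images)
    noisy = (1 - a) * images + a * noise

    # 3. The model infers the noise, with the caption embedding as its hint.
    pred = model(noisy, t, text_encoder(captions))

    # 4. Score the guess; training pushes this loss down.
    return F.mse_loss(pred, noise)
```

Run that over enough captioned frames and generation is just the reverse walk: start from pure noise and repeatedly subtract the predicted noise until only a caption-consistent image remains.
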
00:03:55.920 | Now, by the way, all you need to do is find a sugar daddy to invest 13 billion dollars into you and boom, you're there. Of course, I'm being a little bit facetious. It
00:04:00.960 | builds on years of work, including by notable contributors from OpenAI. They pioneered the
00:04:06.640 | auto-captioning of images with highly descriptive language. Using those synthetic captions
00:04:12.480 | massively improved the training process. When I mention scale, by the way, look at the difference
00:04:17.840 | that more compute makes. When I say compute, think of arrays of GPUs in a data center somewhere in America.
00:04:23.920 | When you 4X the compute, you get this. And if you 16X it, you get that. More images, more training,
00:04:30.400 | more compute, better results. Now, I know what you're thinking. Just 100X the compute. There's
00:04:35.600 | definitely enough data. I did a back-of-the-envelope calculation that there are quadrillions
00:04:40.720 | of frames just on YouTube. Definitely easier to access if you're Google, by the way.
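
For what it's worth, the arithmetic behind that claim is easy to reproduce. The upload rate and timespan below are rough public figures I'm assuming, not anything from OpenAI or Google.

```python
# Back-of-the-envelope: how many video frames exist on YouTube?
hours_per_minute = 500   # commonly cited upload rate (an assumption)
years = 15               # assume roughly that rate over YouTube's lifetime
fps = 30

total_hours = hours_per_minute * 60 * 24 * 365 * years   # ~3.9e9 hours
total_frames = total_hours * 3600 * fps
print(f"{total_frames:.1e} frames")  # ~4.3e14; nudge fps or years upward
                                     # and you land in the quadrillions
```
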
00:04:45.280 | But I will caveat that, as we've seen with GPT-4, scale doesn't get you all the way to reasoning. So you'll still
00:04:51.280 | get weird breaches of the laws of physics until you get other innovations thrown in. But then we
00:04:56.320 | get to something big that I don't think enough people are talking about. By training on video,
00:05:02.080 | you're inadvertently solving images. An image, after all, is just a single frame of a video.
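
In the patch picture from earlier, that comes for free: an image is a one-frame video, so the same tokenizer handles both. Here's a continuation of the earlier toy `patchify` sketch, with the same caveat that the sizes are my assumptions:

```python
# An image is just a video with a single frame: set the temporal patch
# size to 1 and reuse the same patchify() sketch from above.
image = torch.randn(1, 3, 2048, 2048)   # one 2048x2048 "frame"
tokens = patchify(image, pt=1, ph=16, pw=16)
print(tokens.shape)  # torch.Size([16384, 768])
```
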
00:05:08.240 | The images from Sora go up to 2K by 2K pixels. And of course, they could be scaled up further
00:05:14.160 | with a tool like Magnific. I tried that for this image and honestly, there was nothing I could see
00:05:19.520 | that would tell me that this isn't just a photo. I'd almost ask the question of whether this means
00:05:24.480 | that there won't be a DALL-E 4 because Sora supersedes it. Take animating an image:
00:05:30.800 | this example of a Shiba Inu dog wearing a beret and black turtleneck is just incredible.
00:05:37.680 | That's the image on the left and it being animated on the right. You can imagine the
00:05:42.800 | business use cases of this where people bring to life photos of themselves, friends and family,
00:05:49.200 | or maybe even deceased loved ones. Or how about every page in what would be an otherwise static
00:05:55.040 | children's book being animated on demand? You just click and then the characters get animated.
00:06:01.200 | Honestly, the more I think about it, the more I think Sora is going to make OpenAI
00:06:05.200 | billions and billions of dollars. The other companies and apps that it just subsumes
00:06:12.400 | within it are innumerable. I'll come back to that point. But meanwhile, here is a handful
00:06:18.000 | of other incredible demos. This is a movie trailer and notice how Sora is picking quite fast cuts,
00:06:25.040 | obviously all automatically. It gets that a cinematic trailer is going to be pretty dynamic
00:06:30.160 | and fast paced. Likewise, this is a single video generated by Sora, not a compilation.
00:06:35.280 | And if you ignore some text spelling issues, it is astonishing.
00:06:39.040 | And here is another one that I'm going to have to spend some time on. The implications of this
00:06:46.080 | feature alone are astonishing. All three videos that you can see are going to end with the exact
00:06:51.520 | same frame. Even that final frame of the cable car crashing into that sign was generated by
00:06:58.240 | Sora, including the minor misspelling at the top. But just think of the implications. You could have
00:07:03.200 | a photo with your friends and imagine a hundred different ways you could have arrived at
00:07:08.000 | that final photo. Or maybe you have your own website and every user gets a unique voyage to
00:07:14.400 | your landing page. And of course, when we scale this up, we could put the ending of a movie in
00:07:19.040 | and Sora 2 or Sora 3 would generate all the different kinds of movies that could have led
00:07:24.480 | to that point. You could have daily variations to your favorite movie ending. As a side note,
00:07:29.920 | this also allows you to create these funky loops where the starting and finishing frame are
00:07:35.680 | identical. I could just let this play for a few minutes until people got really confused,
00:07:39.840 | but I won't do that to you. And here is yet another feature that I was truly bowled over
00:07:45.600 | by. The video you can see on screen was not generated by Sora. And now I'm going to switch
00:07:51.040 | to another video, which was also not generated by Sora. But what Sora can do is interpolate between
00:07:58.240 | those videos to come up with a unique creation. This time I'm not even going to list the potential
00:08:03.440 | applications because, again, they are innumerable.
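
OpenAI doesn't explain how the blending works, but a standard trick with diffusion models is to interpolate between the two clips' latents with a weight that slides from one to the other across the clip, then let the denoiser turn that blend into one coherent scene. A toy sketch, where everything is an assumption rather than anything from the report:

```python
import torch

def crossfade_latents(lat_a, lat_b):
    """Blend two (frames, channels, h, w) video latents along time.

    A naive crossfade on its own would look mushy; the interesting part
    is that a diffusion denoiser turns the blend into a plausible video.
    """
    f = lat_a.shape[0]
    w = torch.linspace(0, 1, f).view(f, 1, 1, 1)  # weight: 0 -> 1 over the clip
    return (1 - w) * lat_a + w * lat_b
```
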
00:08:07.760 | What I will do, though, is give you one more example that I thought of when I saw this. Another demo that OpenAI used was mixing together this
00:08:12.800 | chameleon and this funky-looking bird, I'm not sure of its name, to create this wild mixture. Now,
00:08:19.200 | we all know that OpenAI are not going to allow you to do this with human images, but an open
00:08:24.720 | source version of Sora will be following close behind. So imagine putting a video of you and
00:08:30.160 | your partner and creating this hybrid freaky video, or maybe you and your pet. Now, the best
00:08:36.000 | results you're going to get from Sora are inevitably when there's not as much movement going
00:08:40.800 | on. The less movement, the fewer problems with things like object permanence. Mind you, even when
00:08:45.920 | there is quite a lot going on, the results can still be pretty incredible. Look at how Sora
00:08:52.000 | handles object permanence here with the dog fully covered and then emerging looking exactly the same.
00:08:57.840 | Likewise, this video of a man eating a burger: because he's moving in slow motion,
00:09:02.560 | it's much higher fidelity. Aside from the bokeh effect, it could almost be real. And then
00:09:08.240 | we get this gorgeous video where you'd almost have to convince me it's from Sora. Look at how the
00:09:14.880 | paint marks stay on the page. And then we get simulated gaming where again, if you ignore some
00:09:20.960 | of the physics and the rule breaking, the visuals alone are just incredible. Obviously, they trained
00:09:27.680 | Sora on thousands of hours of Minecraft videos. I mean, look how accurate some of the blocks are.
00:09:33.840 | I bet some of you watching this think I simply replaced a Sora video with an actual Minecraft
00:09:38.640 | video, but no, I didn't. That has been quite a few hype demos, so time for some anti-hype ones.
00:09:44.400 | Here is Sora clearly not understanding the world around it. Just like ChatGPT's understanding
00:09:50.720 | can sometimes be paper thin, so can Sora's. It doesn't get the physics of the cup, the ice,
00:09:56.640 | or the spill. I can't forget to mention though, that you can also change the style of a video.
00:10:02.000 | Here is the input video, presumably from a game. Now with one prompt, you can change the background
00:10:08.240 | to a jungle. Or maybe you prefer to play the game in the 1920s. I mean, you can see how
00:10:15.840 | the wheels aren't moving properly, but the overall effect is incredible. Well, actually this time,
00:10:22.400 | I want to play the game underwater. How about that? Job done. Or maybe I'm high and I want
00:10:28.800 | the game to look like a rainbow. Or maybe I prefer the old-fashioned days of pixel art.
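
A published recipe for exactly this kind of edit is SDEdit, which the technical report points to for video editing: noise the input video partway, then denoise it under the new prompt, so the layout survives while the style changes. Below is a rough sketch of the idea, with `model` and `text_encoder` as made-up stand-ins and a deliberately crude sampler; it is an illustration, not Sora's implementation.

```python
import torch

def restyle_video(model, text_encoder, video, prompt, strength=0.6, steps=30):
    """SDEdit-style restyling sketch: strength 0 keeps the input, 1 ignores it."""
    cond = text_encoder(prompt)

    # Noise the video partway in: far enough to erase the style,
    # not far enough to erase the layout.
    x = (1 - strength) * video + strength * torch.randn_like(video)

    # Walk the noise level back down to zero under the new prompt.
    dt = strength / steps
    for i in range(steps):
        t = strength - i * dt
        v = model(x, torch.tensor(t), cond)  # predicted denoising direction
        x = x - dt * v                       # small Euler step toward clean video
    return x
```
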
00:10:35.600 | I've noticed a lot of people, by the way, speculating where OpenAI got all the data to
00:10:41.360 | train Sora. I think many people have forgotten that they did a deal back in July with Shutterstock.
00:10:47.680 | In case you don't know, Shutterstock has 32 million stock videos, and most of them
00:10:53.360 | are high resolution. They probably also used millions of hours of video game frames, would be
00:10:58.400 | my guess. One more thing you might be wondering, don't these worlds just disappear the moment you
00:11:03.200 | move on to the next prompt? Well, with video to 3D, that might not always be the case. This is
00:11:09.680 | from Luma AI, and imagine a world generated at first by Sora, then turned into a universally
00:11:16.080 | shareable 3D landscape that you can interact with. Effectively, you and your friends could inhabit
00:11:23.120 | a world generated by Sora. And yes, ultimately with scale, you could generate your own high
00:11:30.080 | fidelity video game. And given that you can indefinitely extend clips, I am sure many people
00:11:36.160 | will be creating their own short movies, perhaps voiced by AI. Here's an Eleven Labs voice giving
00:11:42.240 | you a snippet of the caption to this video: "An adorable, happy otter confidently stands on a
00:11:48.880 | surfboard wearing a yellow life jacket, riding along turquoise tropical waters near lush tropical
00:11:55.680 | islands." Or how about hooking Sora up to the Apple Vision Pro or Meta Quest? Especially for those who
00:12:02.720 | can't travel, that could be an incredible way of exploring the world. Of course, being real here,
00:12:08.240 | the most common use case might be children using it to make cartoons and play games. But still,
00:12:14.720 | that counts as a valid use case to me. But underneath all of these use cases are
00:12:19.840 | some serious points. In a since-deleted tweet, one OpenAI employee said this: "We are very
00:12:25.760 | intentionally not sharing it widely yet. The hope is that a mini public demo kicks a social response
00:12:32.640 | into gear." I'm not really sure what social response people are supposed to give,
00:12:38.000 | however. It's not responsible to let people just panic, which is why I've given the caveats I have
00:12:43.440 | throughout this video. I believe, as with language and self-driving, that the edge cases will still
00:12:48.800 | take a number of years to solve. That's at least my best guess. But it seems to me that when reasoning
00:12:53.840 | is solved, and therefore even long videos actually make sense, a lot more jobs than just
00:12:59.600 | videographers might be under threat. As the creator of GitHub Copilot put it, "If OpenAI is
00:13:05.520 | going to continue to eat AI startups sector by sector, they should go public. Building the new
00:13:11.040 | economy where only 500 people benefit is a dodgy future." And the founder of StabilityAI tweeted
00:13:18.160 | out this image. It does seem to be the best of times and the worst of times to be an AI startup.
00:13:23.920 | You never know when OpenAI or Google are going to drop a model that massively changes and affects
00:13:29.760 | your business. It's not just Sora whacking Pika Labs, RunwayML and maybe Midjourney. If you
00:13:35.200 | make the chips that OpenAI uses, they want to make them instead. I'm going to be doing a separate
00:13:40.560 | video about all of that. When you use the ChatGPT app on a phone, they want to make the phone you're
00:13:46.800 | using. You come up with Character AI and OpenAI comes out with the GPT Store. I bet OpenAI are
00:13:52.960 | even cooking up an open world game with GPT powered NPCs. Don't forget that they acquired
00:13:59.200 | Global Illumination, the makers of this Minecraft clone. If you make agents, we learned last week
00:14:05.360 | that OpenAI want to create an agent that operates your entire device. Again, I've got more on that
00:14:10.880 | coming soon. Or what about if you're making a search engine powered by a GPT model? That's
00:14:16.240 | the case of course with Perplexity and I will be interviewing the CEO and founder of Perplexity
00:14:22.080 | for AI Insiders next week. Insiders can submit questions and of course do feel free to join on
00:14:27.920 | Patreon. But fitting with the trend, we learned less than 48 hours ago that OpenAI is developing
00:14:34.320 | a web search product. I'm not necessarily critiquing any of this, but you're starting to
00:14:39.200 | see the theme. OpenAI will have no qualms about eating your lunch. And of course there's one more
00:14:45.040 | implication that's a bit more long term. Two lead authors from Sora both retweeted this video
00:14:51.920 | from Berkeley. You're seeing a humanoid transformer robot trained with large scale reinforcement
00:14:57.520 | learning in simulation and deployed to the real world zero shot. In other words, it learned to
00:15:02.960 | move like this by watching and acting in simulations. If you want to learn more about
00:15:07.840 | learning from simulations, do check out my Eureka video and my interview with Jim Fan.
00:15:12.560 | TLDR, better simulations mean better robotics. Two final demos to end this video with. First,
00:15:19.040 | a monkey playing chess in a park. This demo kind of sums up Sora. It looks gorgeous. I was astounded
00:15:25.040 | like everyone else. However, if you look a bit closer, the piece positions and board don't make
00:15:30.000 | any sense. Sora doesn't understand the world, but it is drawing upon billions and billions of
00:15:35.280 | patterns. And then there's this obligatory comparison. The Will Smith spaghetti video,
00:15:40.960 | and I wonder what source they originally got some of the images from. You could say this was around
00:15:45.360 | state of the art just 11 months ago. And now here's Sora. Not perfect. Look at the paws,
00:15:52.320 | but honestly remarkable. Indeed, I would call Sora a milestone human achievement. But now I want to
00:15:59.520 | thank you for watching this video all the way to the end. And no, despite what many people think,
00:16:05.120 | it isn't generated by an AI. Have a wonderful day.