
AI Breaks Its Silence: OpenAI’s ‘Next 12 Days’, Genie 2, and a Word of Caution


Chapters

0:00 Introduction
0:43 OpenAI 12 Days, Sora Turbo, o1
3:06 Genie 2
8:26 Jensen Huang and Altman Hallucination Predictions
9:45 Bag of Heuristics Paper
11:40 Procedural Knowledge Paper
13:02 AssemblyAI Universal 2
13:45 SimpleBench QwQ and Chinese Models
14:42 Kling Motion Brush

Whisper Transcript

00:00:00.000 | Did you notice that there hadn't been much interesting AI news in the last few weeks?
00:00:05.600 | And I know that's news to all the accounts that proclaim huge AI news every three days,
00:00:12.240 | but I mean actually. And I even had a whole video ready on that hibernation,
00:00:17.360 | going over a bunch of new papers on some of the hurdles ahead as we inch closer to AGI.
00:00:23.920 | But then a couple of announcements came tonight and that video will have to wait. The mini AI
00:00:30.240 | winter, which is more of a cold snap really, might be drawing to an end. But after covering
00:00:35.760 | those announcements, we're going to see evidence that no matter what AI tech titans tell you,
00:00:41.120 | never leave your hype-o-meter at home. First, what just got announced by Sam Altman and OpenAI?
00:00:48.160 | Well, it's 12 days of releases. And that's all they tell us. But we can join the dots
00:00:55.680 | and piece together some reporting to find out a bit more about what's coming.
00:00:59.840 | What's almost certainly coming in these next 12 days is Sora at long last. In case you've long
00:01:06.240 | since forgotten what Sora is, it's a text to video generator from OpenAI. It was first showcased
00:01:12.960 | back in February of this year. And even though it's been almost a full year, I would argue that
00:01:18.960 | some of these demo videos are still the best I've seen from a text to video model. A version of
00:01:24.240 | Sora was leaked by disgruntled artists around a week ago. And that's how some people were able
00:01:29.920 | to generate these clips. Some of these are obviously state of the art, but then some of
00:01:34.720 | them are less impressive. And what we learn unofficially is that it seems like there might
00:01:39.680 | be a Sora Turbo mode. In short, a model that generates outputs more quickly, but at lower
00:01:45.920 | quality. I'll have more on hallucinations in a minute, but what else might be coming in these
00:01:50.720 | 12 days? Well, very likely their smartest model, which is called O1. One of their employees writing
00:01:57.680 | under an alias said, first, OpenAI is unbelievably back. That's yesterday. Someone asked, give us full
00:02:04.960 | O1. And he said, okay. Indeed, OpenAI senior vice president of research, newly promoted Mark Chen
00:02:11.680 | wrote, if you know, you know. The full version of O1 simply called O1 as compared to O1 preview
00:02:18.560 | looks set to be the smartest model, at least in terms of mathematics and coding. That doesn't
00:02:23.760 | automatically mean it will become your chosen model. In some areas, it actually slightly
00:02:28.640 | underperforms the currently available O1 preview. That does though leave one question for me,
00:02:33.920 | which is what are they going to do to fill the other 10 days? According to one of their
00:02:38.640 | key researchers, they're going to have to ship faster than the goalposts move.
00:02:43.680 | And now that ChatGPT has surpassed 300 million weekly active users, just three months after it
00:02:50.000 | surpassed 200 million weekly active users, no one can deny that plenty of people might use anything
00:02:55.760 | they do ship. Of course, none of us have to wait for things like Sora. There are some epic free
00:03:01.040 | tools that you can use today that I will show you at the end of this video. But the biggest news of
00:03:05.760 | the day didn't actually come from OpenAI. It came from Google DeepMind with their presentation on
00:03:11.440 | Genie 2. In short, it's a model that you can't yet use, but which can turn any image into a
00:03:19.040 | playable world. And there's something just a little bit ironic about the announcement of Genie 2
00:03:24.480 | tonight. The background is that I covered Genie 1 on this channel and talked about its potential
00:03:30.080 | moving forward. The playable worlds that Genie 1 could conjure up were decent, but pretty low
00:03:36.080 | definition and limited. But I noted from the paper that the architecture could scale gracefully with
00:03:42.480 | additional computational resources. That's not the irony. The irony is that just a couple of days
00:03:47.760 | ago, I interviewed the person who managed and coordinated this Genie 2 project, Tim Rocktäschel.
00:03:53.760 | This was for a podcast that's been released on my AI Insiders platform on Patreon. I will,
00:03:58.720 | of course, go through the Genie 2 announcement, the demo videos, and what they say is coming next.
00:04:03.600 | But I can't help but point out, I directly asked Tim Rocktäschel about Genie 2. The paper, it capped
00:04:09.680 | out, I think, 2.7 billion parameters. I just wonder, you might not be able to say, but surely
00:04:14.320 | Genie 2, you know, 270 billion parameters or something, is that something that you're working
00:04:19.360 | on or excited about with more data? Because the paper even talked about how one day we might train
00:04:24.240 | on a greater scale of internet data, all of YouTube potentially. Is that something you're
00:04:27.920 | working on or could speak about at all? I'm excited about this. You can just look,
00:04:32.640 | basically, at what happened over the last few months since Genie was published. There's, for example,
00:04:38.240 | the Oasis work that came out a few weeks ago, where people basically learned a neural network
00:04:43.280 | to simulate Minecraft. Before that, there was a paper learning to simulate Doom using a neural
00:04:48.640 | network. That space is definitely heating up. I think it's exciting. I think maybe at some point,
00:04:54.720 | these simulators, these learned simulators, are getting fast and rich enough so that you
00:04:58.640 | then can also use them to adversarially probe an embodied AGI and teach it new capabilities.
00:05:04.720 | The reason I was interviewing him, by the way, is because he is the author of the brand new book,
00:05:10.000 | AI - 10 Things You Should Know. But what really is Genie 2 and what does it say about what's coming?
00:05:16.000 | DeepMind call it a foundation world model. Essentially, you give it a single image and
00:05:20.880 | Genie 2 will turn it into an interactive world. The world might not quite be as high resolution
00:05:26.800 | as the original image, but you can use keyboard actions to control that world. Jump, fly, skip,
00:05:33.200 | swim, all that kind of thing. I can imagine this being used for dream sequences within games,
00:05:38.080 | where a character might have a dream of an alternate reality and you can interact with
00:05:42.560 | that world. Or maybe in the future, websites, instead of having static images in the background
00:05:47.200 | or even looping videos, will have interactive environments that you can play like games.
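Genie 2 has no public API, but the interaction pattern described above, a single seed image followed by a stream of keyboard actions, can be sketched as a simple action-conditioned loop. Everything below is a hypothetical stand-in of my own, not DeepMind code:

```python
# Purely illustrative sketch of "one image in, playable world out".
# Genie 2 is not publicly available; WorldModel and its methods are
# hypothetical stand-ins for the behaviour described in the announcement.

from dataclasses import dataclass


@dataclass
class Frame:
    pixels: bytes   # one rendered frame of the generated world
    step: int       # how many actions have been applied so far


class WorldModel:
    """Hypothetical wrapper: seed a world from one image, then step it."""

    def __init__(self, seed_image: bytes):
        self.history: list[Frame] = [Frame(pixels=seed_image, step=0)]

    def step(self, action: str) -> Frame:
        # A real model would autoregressively generate the next frame
        # conditioned on the frame/action history, which is also how it can
        # re-render regions that scrolled off screen. Here we just log steps.
        next_frame = Frame(pixels=b"...", step=len(self.history))
        self.history.append(next_frame)
        return next_frame


# Seed the world with a single image, then drive it with keyboard actions.
world = WorldModel(seed_image=b"<bytes of a single concept-art image>")
for action in ["forward", "forward", "jump", "turn_left"]:
    frame = world.step(action)
    print(f"step {frame.step}: applied '{action}'")
```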
00:05:51.920 | But just a few quick caveats. These worlds, these generations, on average last 10 to 20 seconds
00:05:57.920 | or for up to a minute. Next is, even though they seem that way, these example videos aren't quite
00:06:04.480 | real time. As it stands, if you want real time interaction, you'd have to suffer from a reduction
00:06:10.480 | in quality. And let's be honest, these outputs weren't exactly high resolution to begin with,
00:06:15.600 | so we're not talking about replacing AAA games anytime soon. Next, the outputs can go
00:06:21.520 | quite wrong quite quickly with no real explanation. Like in this one, a ghost appears for no reason.
00:06:29.040 | In this one, the guy starts with a snowboard, but then immediately decides just to run the course.
00:06:34.320 | As Google wrote, the character prefers parkour over snowboarding. Yes, by the way,
00:06:39.040 | the initial prompt could be a real world image. And that, of course, is super cool.
00:06:44.320 | It can kind of model lighting, although we're not talking ray tracing here.
00:06:48.400 | And it can, it says, model gravity, but look at this horse jump on the left.
00:06:53.520 | I wouldn't say that's terribly high accuracy physics. This bit, though, I did find more
00:06:59.120 | impressive, which is that Genie 2 is capable of remembering parts of the world that are
00:07:04.000 | no longer in view and then rendering them accurately when they become observable again.
00:07:08.560 | So if you look at the characters, they'll look away from something, look back to it,
00:07:12.720 | and it's mostly the same as when they first looked away.
00:07:15.760 | Interestingly, in the announcement page, which didn't yet come with a paper,
00:07:19.520 | they actually pushed a different angle for why this was important.
00:07:22.800 | They said that if we're going to train general embodied agents, in other words,
00:07:27.440 | AI controlling a robot, that's bottlenecked by the availability of sufficiently rich and
00:07:32.080 | diverse training environments. And they gave an example of how they use Genie 2
00:07:36.400 | to create this interactive world and then told an AI agent to, for example, open the red door.
00:07:42.880 | The SIMA agent, which I've covered before on the channel, was indeed able to open the red door.
00:07:48.080 | But I personally would put an asterisk here because it's an AI agent trained on this AI
00:07:54.320 | generated, not particularly realistic world. Because of the stark gap that exists between
00:07:59.520 | these kinds of simulations and our rich, complex reality, I'm not entirely convinced that this
00:08:05.440 | approach will lead to reliable agents. Of course, Google DeepMind could well
00:08:09.520 | prove me wrong. They say that they believe Genie 2 is the path to solving a structural problem of
00:08:16.720 | training embodied agents safely while achieving the breadth and generality required to progress
00:08:22.080 | towards AGI. Now, of course, my objection would fall away if we had a path for removing these
00:08:27.360 | creative but not particularly reliable AI hallucinations. But as even the CEO of NVIDIA
00:08:33.280 | recently admitted, we are "several years away" from that happening.
00:08:37.600 | His solution, as you might expect, is just buy more GPUs. Some of you may say that reliability
00:08:43.360 | issues and hallucinations, they're just a minor bug. They're going to go soon.
00:08:47.040 | What's the problem with Jensen Huang saying that the solution is still just a few years away?
00:08:51.360 | Well, I think many people, including AI lab leaders, massively underestimated the hallucination
00:08:57.200 | issue. As I covered on this channel in June of 2023, Sam Altman said, and I quote,
00:09:02.640 | "We won't be talking about hallucinations in one and a half to two years."
00:09:06.640 | One and a half years from then is like today, and we are talking about hallucinations. And even at
00:09:11.840 | the upper end, we're talking mid-2025. And I don't think anyone would now vouch for the claim,
00:09:16.720 | echoed by Mustafa Suleyman, that LLM hallucinations will be largely eliminated by 2025.
00:09:23.040 | In short, the very thing that makes these models great at generating creative interpolations of
00:09:29.120 | their data, creative worlds, is the very thing that makes them unreliable when it comes to
00:09:33.840 | things like physics. Remember that even frontier generative models like Sora,
00:09:38.160 | when given 10 minutes to produce an output, still produce things where the physics don't make sense.
00:09:43.680 | And this links to a recent paper that I was going to analyse for this video.
00:09:48.160 | In mathematics, and plausibly physics, large language models based on the
00:09:52.320 | transformer architecture don't learn robust algorithms. They rely on a bag of heuristics,
00:09:58.720 | or rules of thumb. In other words, they don't learn a single cohesive world model.
00:10:03.680 | They deploy a collection of simpler rules and patterns. That's why, for example,
00:10:07.760 | with Genie 2 and Sora, you get plausible continuations that, if you look closely,
00:10:12.320 | don't make too much sense. Imagine Sora or Genie 2 generating a car going off a cliff and all the
00:10:18.400 | resulting physics. You might have hoped that their training data had inculcated Isaac Newton's laws
00:10:23.920 | of physics and you'd get a very exact result. But they don't actually have the computational
00:10:29.360 | bandwidth to perform those kinds of calculations. Instead, it's a bit more like this.
00:10:33.520 | When models are tasked with 226, take away 68, they get this vibe. That kind of feels like an
00:10:40.560 | answer between 150 and 180. That's one of the heuristics or rules of thumb that these authors
00:10:47.280 | studied. Patch together enough of these vibes or heuristics and you start getting pretty accurate
00:10:52.240 | answers most of the time. Each heuristic the model learns only slightly boosts the logit of the
00:10:58.080 | correct answer. But combined, they cause the model to produce the correct answer with high probability,
00:11:03.840 | though not with reliability. Indeed, their results suggest that improving LLMs' mathematical abilities
00:11:10.240 | may require fundamental changes to training and architectures. And I totally get it. This is the
00:11:15.840 | same video that I showed you the O1 model getting 83% in an exceptionally hard math competition.
00:11:22.400 | But you may have noticed that none of these models tend to get 100% in anything. For example,
00:11:27.360 | surely if they can solve 93% of PhD level physics problems, why do they only get 81% in AP physics?
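To make the "bag of heuristics" idea concrete, here is a toy sketch of my own, not the paper's actual probing setup, for 226 take away 68: several weak rules of thumb, none of which computes the answer exactly, are simply stacked, and together they usually, but not provably, land on 158.

```python
# Toy illustration (mine, not the paper's setup) of answering 226 - 68 by
# stacking weak heuristics instead of running an exact subtraction algorithm.

def range_heuristic(a, b, candidate):
    # "Feels like something between 150 and 180" for inputs of this size.
    return 1.0 if 150 <= candidate <= 180 else 0.0

def last_digit_heuristic(a, b, candidate):
    # The result's last digit should match (a - b) mod 10.
    return 1.0 if candidate % 10 == (a - b) % 10 else 0.0

def magnitude_heuristic(a, b, candidate):
    # Rough magnitude check: roughly 230 - 70.
    return 1.0 if abs(candidate - (round(a, -1) - round(b, -1))) <= 20 else 0.0

HEURISTICS = [range_heuristic, last_digit_heuristic, magnitude_heuristic]

def answer(a, b):
    # Score every candidate by how many heuristics it satisfies; no single
    # rule is decisive, but together they usually pick out a - b.
    return max(range(a + 1), key=lambda c: sum(h(a, b, c) for h in HEURISTICS))

print(answer(226, 68))  # 158, though 168 and 178 score just as well here
```

Stack enough rules like these and you get strong scores most of the time, with no guarantee on any individual answer.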
00:11:35.920 | That's the same O1 model that we're set to get in the next 12 days. I almost can't help myself.
00:11:40.880 | I'm starting to cover the two papers that I said I would cover another day. But still,
00:11:45.120 | I just want to touch on this other paper. The ridiculously compressed TLDR is that they show
00:11:50.000 | that models do learn procedures rather than memorizing individual answers. The way they
00:11:55.280 | show this is really complex and does rely on some approximations. Estimating, for example,
00:12:00.720 | if you remove these 500 tokens, how would that affect the model parameters and therefore the
00:12:05.360 | likelihood of getting an answer right? You can then in short judge which kind of sources the
00:12:09.840 | model is relying on for particular types of questions. Again, what the authors are showing
00:12:14.160 | is that the models aren't memorizing particular answers to reasoning questions. Like when asked
00:12:18.960 | what (7 - 4) times 7 is, they're not looking up a source that simply states the answer, 21.
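As a rough schematic of the influence estimate described a moment ago (a heavy simplification of mine, with stand-in names, not the paper's actual estimation machinery), you can picture each training document being scored by how well its loss gradient lines up with the gradient of the query being answered:

```python
# Schematic of the "which documents does this answer lean on?" idea: score
# each training document by gradient alignment with the query. The tiny
# "model" and the documents are stand-ins, and the inverse-Hessian term used
# by real influence functions is dropped entirely.

import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=8)            # pretend these are model parameters

def loss_gradient(example: np.ndarray) -> np.ndarray:
    # Stand-in for d(loss)/d(theta) evaluated on one document or one query.
    return example - theta

training_docs = {f"doc_{i}": rng.normal(size=8) for i in range(5)}
query = rng.normal(size=8)            # e.g. the tokens of "(7 - 4) * 7 = ?"

g_query = loss_gradient(query)
influence = {name: float(loss_gradient(doc) @ g_query)
             for name, doc in training_docs.items()}

# Documents with the largest scores are the ones the answer leans on most.
for name, score in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```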
00:12:25.440 | They're relying on multiple sources that give the kind of procedures you'd need to answer that
00:12:31.040 | question. But while that seems really promising for these models developing world models
00:12:35.920 | and true reasoning, they add this crucial caveat. They don't find evidence for models generalizing
00:12:42.080 | from pre-training data about one type of reasoning to another similar type of reasoning. You could
00:12:47.840 | kind of think of that like a model getting fairly good at simulating the physics of the moon but not
00:12:54.080 | then applying that when asked to simulate the physics of Mars. In-distribution generalization
00:12:59.600 | versus out-of-distribution generalization. Anyway, I've definitely gone on too long,
00:13:04.000 | time to bring us back to the real world. And before I end with that cute turtle and how you
00:13:08.560 | can move it around, here is another real-world tool that you can use today. This is Assembly AI's
00:13:14.640 | Universal 2 speech-to-text model and you can see its performance here. As many of you know,
00:13:19.920 | I reached out to Assembly AI and they are kindly sponsoring this video. I use their models to
00:13:25.120 | transcribe my projects and you can see the comparison not just with Universal 1 but with
00:13:30.560 | other competitive models. One thing I've learned in doing so is not to focus only on word error
00:13:35.680 | rate. Think about how models perform with proper nouns and alphanumerics. That is at least for me
00:13:41.280 | what sets the Universal family apart. Now, as we wrap up this video, just for anyone wondering
00:13:46.320 | about an update to SimpleBench. First, what about the new Gemini experimental models? Well,
00:13:51.760 | they are rate-limited. I might soon be getting early access to something else but for now,
00:14:02.400 | we can't run it in full on SimpleBench. What about DeepSeek R1? Well, again, as of today,
00:14:08.640 | not available through the API. What though about Alibaba's QwQ model? That has been getting a lot
00:14:08.640 | of hype lately but honestly, what's new? Almost everything gets hyped in AI. Of course, I am
00:14:14.160 | following all of the models coming out of China and testing them as much as I possibly can and
00:14:18.960 | I did read that interview with the founder of DeepSeek. I may cover that in a different video
00:14:23.840 | but for now, I could actually run QwQ on SimpleBench and unfortunately, it got a score below
00:14:30.320 | Claude 3.5 Haiku so it doesn't appear on the list. I'm sorry that I can't do a shocked-face
00:14:35.920 | thumbnail and say that AGI has arrived but that's just the result we got. It was around 11%.
00:14:41.040 | I'm going to show you one more tool just very quickly and you can use it for free today.
00:14:46.080 | It's Kling 1.5, actually another model coming out of China and you could argue, in a way,
00:14:51.200 | it's a foretaste of the kind of interactivity that something like Genie 2 will bring. Again,
00:14:56.160 | free to sign up for at least 5 professional generations. Click on the left to upload an
00:15:01.760 | image. I generated this one with Ideogram. Then go down to Motion Brush. Then I selected
00:15:07.680 | Auto Segmentation so I could pick out the turtle and then for Tracking, I drew this arrow to the
00:15:14.000 | right. Then Confirm of course. Go to Professional Mode. I only have 2 trial uses left and then
00:15:19.840 | Generate. You control the movement and you can end up with super cute generations like this.
00:15:25.360 | So whatever you found the most interesting, thank you for watching and have a wonderful day.