
AI Breaks Its Silence: OpenAI’s ‘Next 12 Days’, Genie 2, and a Word of Caution


Chapters

0:00
0:43 OpenAI 12 Days, Sora Turbo, o1
3:06 Genie 2
8:26 Jensen Huang and Altman Hallucination Predictions
9:45 Bag of Heuristics Paper
11:40 Procedural Knowledge Paper
13:02 AssemblyAI Universal 2
13:45 SimpleBench QwQ and Chinese Models
14:42 Kling Motion Brush

Transcript

Did you notice that there hadn't been much interesting AI news in the last few weeks? And I know that's news to all the accounts that proclaim huge AI news every three days, but I mean actually. And I even had a whole video ready on that hibernation, going over a bunch of new papers on some of the hurdles ahead as we inch closer to AGI.

But then a couple of announcements came tonight and that video will have to wait. The mini AI winter, which is more of a cold snap really, might be drawing to an end. But after covering those announcements, we're going to see evidence that no matter what AI tech titans tell you, never leave your hype-o-meter at home.

First, what just got announced by Sam Altman and OpenAI? Well, it's 12 days of releases. And that's all they tell us. But we can join the dots and piece together some reporting to find out a bit more about what's coming. What's almost certainly coming in these next 12 days is Sora at long last.

In case you've long since forgotten what Sora is, it's a text to video generator from OpenAI. It was first showcased back in February of this year. And even though it's been almost a full year, I would argue that some of these demo videos are still the best I've seen from a text to video model.

A version of Sora was leaked by disgruntled artists around a week ago. And that's how some people were able to generate these clips. Some of these are obviously state of the art, but then some of them are less impressive. And what we learn unofficially is that it seems like there might be a Sora turbo mode.

In short, a model that generates outputs more quickly, but with lower quality. I'll have more on hallucinations in a minute, but what else might be coming in these 12 days? Well, very likely their smartest model, which is called o1. One of their employees, writing under an alias, said first: OpenAI is unbelievably back.

That's yesterday. Someone asked, give us full o1, and he said, okay. Indeed, OpenAI's newly promoted senior vice president of research, Mark Chen, wrote: if you know, you know. The full version of o1, simply called o1 (as compared with o1-preview), looks set to be the smartest model, at least in terms of mathematics and coding.

That doesn't automatically mean it will become your chosen model. In some areas, it actually slightly underperforms the currently available o1-preview. That does, though, leave one question for me, which is: what are they going to do to fill the other 10 days? According to one of their key researchers, they're going to have to ship faster than the goalposts move.

And now that ChatGPT has surpassed 300 million weekly active users, just three months after it surpassed 200 million weekly active users, no one can deny that plenty of people might use anything they do ship. Of course, none of us have to wait for things like Sora. There are some epic free tools that you can use today that I will show you at the end of this video.

But the biggest news of the day didn't actually come from OpenAI. It came from Google DeepMind with their presentation on Genie 2. In short, it's a model that you can't yet use, but which can turn any image into a playable world. And there's something just a little bit ironic about the announcement of Genie 2 tonight.

The background is that I covered Genie 1 on this channel and talked about its potential moving forward. The playable worlds that Genie 1 could conjure up were decent, but pretty low definition and limited. But I noted from the paper that the architecture could scale gracefully with additional computational resources.

That's not the irony. The irony is that just a couple of days ago, I interviewed the person who managed and coordinated this Genie 2 project, Tim Rocktäschel. This was for a podcast that's been released on my AI Insiders platform on Patreon. I will, of course, go through the Genie 2 announcement, the demo videos, and what they say is coming next.

But I can't help but point out, I directly asked Tim Rocktäschel about Genie 2. The paper, it capped out at, I think, 2.7 billion parameters. I just wonder, you might not be able to say, but surely Genie 2, you know, 270 billion parameters or something, is that something that you're working on or excited about with more data?

Because the paper even talked about how one day we might train on a greater scale of internet data, all of YouTube potentially. Is that something you're working on or could speak about at all? I'm excited about this. You can just look, basically, what happened over the last few months since Genie was published.

There's, for example, the Oasis work that came out a few weeks ago, where people basically trained a neural network to simulate Minecraft. Before that, there was a paper learning to simulate Doom using a neural network. That space is definitely heating up. I think it's exciting. I think maybe at some point, these simulators, these learned simulators, are getting fast and rich enough so that you then can also use them to adversarially probe an embodied AGI and teach it new capabilities.

The reason I was interviewing him, by the way, is because he is the author of the brand new book, AI - 10 Things You Should Know. But what really is Genie 2 and what does it say about what's coming? DeepMind call it a foundation world model. Essentially, you give it a single image and Genie 2 will turn it into an interactive world.

The world might not quite be as high resolution as the original image, but you can use keyboard actions to control that world. Jump, fly, skip, swim, all that kind of thing. I can imagine this being used for dream sequences within games, where a character might have a dream of an alternate reality and you can interact with that world.

Or maybe in the future, websites, instead of having static images in the background or even looping videos, will have interactive environments that you can play like games. But just a few quick caveats. These worlds, these generations, last 10 to 20 seconds on average, or up to a minute at most.

Next: even though they might seem that way, these example videos aren't quite real time. As it stands, if you want real-time interaction, you'd have to suffer a reduction in quality. And let's be honest, these outputs weren't exactly high resolution to begin with, so we're not talking about replacing AAA games anytime soon.

Next, the outputs can go quite wrong quite quickly with no real explanation. Like in this one, a ghost appears for no reason. In this one, the guy starts with a snowboard but then immediately decides just to run the course. As Google wrote, the character prefers parkour over snowboarding. Yes, by the way, the initial prompt could be a real-world image.

And that, of course, is super cool. It can kind of model lighting, although we're not talking ray tracing here. And it can, they say, model gravity, but look at this horse jump on the left. I wouldn't say that's terribly high-accuracy physics. This bit, though, I did find more impressive, which is that Genie 2 is capable of remembering parts of the world that are no longer in view and then rendering them accurately when they become observable again.

So if you look at the characters, they'll look away from something, look back to it, and it's mostly the same as when they first looked away. Interestingly, in the announcement page, which didn't yet come with a paper, they actually pushed a different angle for why this was important. They said that if we're going to train general embodied agents, in other words, AI controlling a robot, that's bottlenecked by the availability of sufficiently rich and diverse training environments.

And they gave an example of how they use Genie 2 to create this interactive world and then told an AI agent to, for example, open the red door. The SIMA agent, which I've covered before on the channel, was indeed able to open the red door. But I personally would put an asterisk here, because it's an AI agent trained in this AI-generated, not particularly realistic, world.

Because of the stark gap that exists between these kinds of simulations and our rich, complex reality, I'm not entirely convinced that this approach will lead to reliable agents. Of course, Google DeepMind could well prove me wrong. They say that they believe Genie 2 is the path to solving a structural problem of training embodied agents safely while achieving the breadth and generality required to progress towards AGI.
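To make that embodied-agents angle concrete, here is a rough sketch of what a "learned world model as training environment" loop could look like in principle. Genie 2 has no public interface, so every class, method and action name below is invented purely for illustration; this is not DeepMind's actual API.

```python
# Hypothetical sketch only: all names here are invented to illustrate the idea
# of a learned world model acting as an environment for an instruction-following
# agent. It is not Genie 2's or SIMA's real interface.
import random

class LearnedWorldModel:
    """Stands in for an action-conditioned generative model like Genie 2."""
    def __init__(self, horizon=120):
        self.horizon = horizon          # generations last seconds, not hours
        self.t, self.frame = 0, None

    def reset(self, prompt_image):
        self.t, self.frame = 0, prompt_image   # condition on a single image
        return self.frame

    def step(self, action):
        # A real model would autoregressively predict the next frame from
        # (previous frames, action); here we just tag the frame with the action.
        self.t += 1
        self.frame = f"{self.frame}|{action}"
        return self.frame, self.t >= self.horizon

class InstructionAgent:
    """Stands in for an instruction-conditioned policy like SIMA."""
    ACTIONS = ["forward", "back", "left", "right", "jump", "interact"]
    def act(self, frame, instruction):
        return random.choice(self.ACTIONS)     # real agent: pixels + text -> action

def rollout(world, agent, prompt_image, instruction):
    frame, done = world.reset(prompt_image), False
    while not done:
        frame, done = world.step(agent.act(frame, instruction))
    return frame

print(rollout(LearnedWorldModel(horizon=5), InstructionAgent(),
              "red_door.png", "open the red door"))
```

The point of the sketch is just the division of labour: the world model imagines the consequences of each action, and the agent only ever sees generated frames, which is exactly why the realism of those frames matters so much.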

Now, of course, my objection would fall away if we had a path for removing these creative but not particularly reliable AI hallucinations. But as even the CEO of NVIDIA recently admitted, we are "several years away" from that happening. His solution, as you might expect, is just buy more GPUs.

Some of you may say that reliability issues and hallucinations are just a minor bug that's going to go away soon. What's the problem with Jensen Huang saying that the solution is still just a few years away? Well, I think many people, including AI lab leaders, massively underestimated the hallucination issue.

As I covered on this channel in June of 2023, Sam Altman said, and I quote, "We won't be talking about hallucinations in one and a half to two years." One and a half years from then is like today, and we are talking about hallucinations. And even at the upper end, we're talking mid-2025.

And I don't think anyone would now vouch for the claim, echoed by Mustafa Suleyman, that LLM hallucinations will be largely eliminated by 2025. In short, the very thing that makes these models great at generating creative interpolations of their data, creative worlds, is the very thing that makes them unreliable when it comes to things like physics.

Remember that even frontier generative models like Sora, when given 10 minutes to produce an output, still produce things where the physics don't make sense. And this links to a recent paper that I was going to analyse for this video. In mathematics, and plausibly physics, large language models based on the transformer architecture don't learn robust algorithms.

They rely on a bag of heuristics, or rules of thumb. In other words, they don't learn a single cohesive world model. They deploy a collection of simpler rules and patterns. That's why, for example, with Genie 2 and Sora, you get plausible continuations that, if you look closely, don't make too much sense.

Imagine Sora or Genie 2 generating a car going off a cliff and all the resulting physics. You might have hoped that their training data had inculcated Isaac Newton's laws of physics and you'd get a very exact result. But they don't actually have the computational bandwidth to perform those kinds of calculations.

Instead, it's a bit more like this. When models are tasked with 226 take away 68, they get this vibe: that kind of feels like an answer between 150 and 180. That's one of the heuristics or rules of thumb that these authors studied. Patch together enough of these vibes or heuristics and you start getting pretty accurate answers most of the time.
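To be clear about what a "bag of heuristics" means here, this is only a toy illustration of the idea, not anything from the paper itself, which locates these heuristics inside the neurons of real transformers. A few crude, overlapping rules of thumb, none of which actually computes the subtraction, can still combine into a mostly-correct answer:

```python
# Toy illustration: the rules below are hand-tailored to the single example
# 226 - 68, purely to show how crude heuristics can stack into a good guess.

def heuristic_range(a, b, candidate):
    # "feels like something between 150 and 180"
    return 1.0 if 150 <= candidate <= 180 else 0.0

def heuristic_last_digit(a, b, candidate):
    # the answer's last digit should match (6 - 8) mod 10 = 8
    return 1.0 if candidate % 10 == (a - b) % 10 else 0.0

def heuristic_magnitude(a, b, candidate):
    # the answer should be somewhat less than a
    return 1.0 if a - 100 < candidate < a else 0.0

HEURISTICS = [heuristic_range, heuristic_last_digit, heuristic_magnitude]

def answer(a, b):
    # each heuristic only slightly boosts some candidates' scores;
    # summed together they usually single out a plausible answer
    scores = {c: sum(h(a, b, c) for h in HEURISTICS) for c in range(400)}
    return max(scores, key=scores.get)

print(answer(226, 68))   # prints 158 here, but 168 and 178 score just as well
```

Note that 158, 168 and 178 all tie on these three rules, so which one wins is essentially arbitrary; that kind of near-miss unreliability is exactly what the authors are pointing at.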

Each heuristic they learn only slightly boosts the logit of the correct answer. But combined, they cause the model to produce the correct answer with high probability, though not with true reliability. Indeed, their results suggest that improving LLMs' mathematical abilities may require fundamental changes to training and architectures. And I totally get it. This is the same video in which I showed you the o1 model getting 83% in an exceptionally hard math competition.

But you may have noticed that none of these models tend to get 100% in anything. For example, if they can solve 93% of PhD-level physics problems, why do they only get 81% in AP Physics? That's the same o1 model that we're set to get in the next 12 days.

I almost can't help myself. I'm starting to cover the two papers that I said I would cover another day. But still, I just want to touch on this other paper. The ridiculously compressed TLDR is that they show that models do learn procedures rather than memorizing individual answers. The way they show this is really complex and does rely on some approximations.

Estimating, for example, how removing these 500 tokens would affect the model's parameters and therefore the likelihood of getting an answer right. You can then, in short, judge which kinds of sources the model is relying on for particular types of questions. Again, what the authors are showing is that the models aren't memorizing particular answers to reasoning questions.
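For what it's worth, the machinery behind that estimate is an influence-function-style approximation. As a rough summary in my own notation, not a quote from the paper, the influence of a pretraining document on a query (a prompt plus its completion) is estimated as something like:

```latex
% Standard influence-function form (my summary): how much would the model's
% log-likelihood of the query completion change if training document z were
% up- or down-weighted during pretraining?
\[
\mathcal{I}(z, q) \;\approx\;
\nabla_{\theta}\log p\!\left(y_q \mid x_q;\ \hat{\theta}\right)^{\top}
\mathbf{H}^{-1}\,
\nabla_{\theta}\,\mathcal{L}\!\left(z;\ \hat{\theta}\right)
\]
```

Here the hatted theta denotes the trained parameters, the loss term is the training loss on document z, and H is (an approximation of) the Hessian of that loss, handled in practice with approximations such as EK-FAC. The documents that score highest on this measure for reasoning questions turn out to describe procedures rather than to contain the literal answers.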

Like when asked what is (7 - 4) × 7, they're not looking up a source that says the answer is 21. They're relying on multiple sources that give the kind of procedures you'd need to answer that question. But while that seems really promising for these models developing world models and true reasoning, they add this crucial caveat.

They don't find evidence for models generalizing from pre-training data about one type of reasoning to another similar type of reasoning. You could kind of think of that like a model getting fairly good at simulating the physics of the moon but not then applying that when asked to simulate the physics of Mars.

In-distribution generalization versus out-of-distribution generalization. Anyway, I've definitely gone on too long; time to bring us back to the real world. And before I end with that cute turtle and how you can move it around, here is another real-world tool that you can use today. This is AssemblyAI's Universal-2 speech-to-text model and you can see its performance here.
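If you want to try it, here is a minimal sketch of a transcription call using the AssemblyAI Python SDK (pip install assemblyai). The API key and file name are placeholders, and the exact options for selecting Universal-2 are worth checking in their docs rather than assuming my defaults below are current.

```python
# Minimal sketch with the AssemblyAI Python SDK. The key and file path are
# placeholders; consult AssemblyAI's documentation for the current options
# that control which speech model (e.g. Universal-2) is used.
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("episode_audio.mp3")  # local file or URL

if transcript.status == aai.TranscriptStatus.error:
    print(transcript.error)
else:
    print(transcript.text)
```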

As many of you know, I reached out to AssemblyAI and they are kindly sponsoring this video. I use their models to transcribe my projects and you can see the comparison not just with Universal-1 but with other competitive models. One thing I've learned in doing so is don't always focus on word error rate.

Think about how models perform with proper nouns and alphanumerics. That is at least for me what sets the Universal family apart. Now, as we wrap up this video, just for anyone wondering about an update to SimpleBench. First, what about the new Gemini experimental models? Well, they are rate-limited. I might soon be getting early access to something else but for now, we can't run it in full on SimpleBench.

What about DeepSeek R1? Well, again, as of today, not available through the API. What though about Alibaba's QwQ model? That has been getting a lot of hype lately but honestly, what's new? Almost everything gets hyped in AI. Of course, I am following all of the models coming out of China and testing them as much as I possibly can, and I did read that interview with the founder of DeepSeek.

I may cover that in a different video but for now, I could actually run QwQ on SimpleBench and unfortunately, it got a score below Claude 3.5 Haiku, so it doesn't appear on the list. I'm sorry that I can't do a shocked-face thumbnail and say that AGI has arrived, but that's just the result we got.

It was around 11%. I'm going to show you one more tool just very quickly and you can use it for free today. It's Kling 1.5, actually another model coming out of China and you could argue, in a way, it's a foretaste of the kind of interactivity that something like Genie 2 will bring.

Again, free to sign up for at least 5 professional generations. Click on the left to upload an image. I generated this one with Ideogram. Then go down to Motion Brush. Then I selected Auto Segmentation so I could pick out the turtle, and then for Tracking, I drew this arrow to the right.

Then Confirm of course. Go to Professional Mode. I only have 2 trial uses left and then Generate. You control the movement and you can end up with super cute generations like this. So whatever you found the most interesting, thank you for watching and have a wonderful day.