AI Breaks Its Silence: OpenAI’s ‘Next 12 Days’, Genie 2, and a Word of Caution
Chapters
0:00 Introduction
0:43 OpenAI 12 Days, Sora Turbo, o1
3:06 Genie 2
8:26 Jensen Huang and Altman Hallucination Predictions
9:45 Bag of Heuristics Paper
11:40 Procedural Knowledge Paper
13:02 AssemblyAI Universal 2
13:45 SimpleBench QwQ and Chinese Models
14:42 Kling Motion Brush
00:00:00.000 |
Did you notice that there hadn't been much interesting AI news in the last few weeks? 00:00:05.600 |
And I know that's news to all the accounts that proclaim huge AI news every three days, 00:00:12.240 |
but I mean actually. And I even had a whole video ready on that hibernation, 00:00:17.360 |
going over a bunch of new papers on some of the hurdles ahead as we inch closer to AGI. 00:00:23.920 |
But then a couple of announcements came tonight and that video will have to wait. The mini AI 00:00:30.240 |
winter, which is more of a cold snap really, might be drawing to an end. But after covering 00:00:35.760 |
those announcements, we're going to see evidence that no matter what AI tech titans tell you, 00:00:41.120 |
never leave your hype-o-meter at home. First, what just got announced by Sam Altman and OpenAI? 00:00:48.160 |
Well, it's 12 days of releases. And that's all they tell us. But we can join the dots 00:00:55.680 |
and piece together some reporting to find out a bit more about what's coming. 00:00:59.840 |
What's almost certainly coming in these next 12 days is Sora at long last. In case you've long 00:01:06.240 |
since forgotten what Sora is, it's a text to video generator from OpenAI. It was first showcased 00:01:12.960 |
back in February of this year. And even though it's been almost a full year, I would argue that 00:01:18.960 |
some of these demo videos are still the best I've seen from a text to video model. A version of 00:01:24.240 |
Sora was leaked by disgruntled artists around a week ago. And that's how some people were able 00:01:29.920 |
to generate these clips. Some of these are obviously state of the art, but then some of 00:01:34.720 |
them are less impressive. And what we learn unofficially is that it seems like there might 00:01:39.680 |
be a Sora turbo mode. In short, a model that generates outputs more quickly, but at lower 00:01:45.920 |
quality. I'll have more on hallucinations in a minute, but what else might be coming in these 00:01:50.720 |
12 days? Well, very likely their smartest model, which is called O1. One of their employees writing 00:01:57.680 |
under an alias said, first, OpenAI is unbelievably back. That's yesterday. Someone asked, give us full 00:02:04.960 |
O1. And he said, okay. Indeed, OpenAI's senior vice president of research, newly promoted Mark Chen, 00:02:11.680 |
wrote, if you know, you know. The full version of O1, simply called O1 (as compared to O1 preview), 00:02:18.560 |
looks set to be the smartest model, at least in terms of mathematics and coding. That doesn't 00:02:23.760 |
automatically mean it will become your chosen model. In some areas, it actually slightly 00:02:28.640 |
underperforms the currently available O1 preview. That does though leave one question for me, 00:02:33.920 |
which is what are they going to do to fill the other 10 days? According to one of their 00:02:38.640 |
key researchers, they're going to have to ship faster than the goalposts move. 00:02:43.680 |
And now that ChatGPT has surpassed 300 million weekly active users, just three months after it 00:02:50.000 |
surpassed 200 million weekly active users, no one can deny that plenty of people might use anything 00:02:55.760 |
they do ship. Of course, none of us have to wait for things like Sora. There are some epic free 00:03:01.040 |
tools that you can use today that I will show you at the end of this video. But the biggest news of 00:03:05.760 |
the day didn't actually come from OpenAI. It came from Google DeepMind with their presentation on 00:03:11.440 |
Genie 2. In short, it's a model that you can't yet use, but which can turn any image into a 00:03:19.040 |
playable world. And there's something just a little bit ironic about the announcement of Genie 2 00:03:24.480 |
tonight. The background is that I covered Genie 1 on this channel and talked about its potential 00:03:30.080 |
moving forward. The playable worlds that Genie 1 could conjure up were decent, but pretty low 00:03:36.080 |
definition and limited. But I noted from the paper that the architecture could scale gracefully with 00:03:42.480 |
additional computational resources. That's not the irony. The irony is that just a couple of days 00:03:47.760 |
ago, I interviewed the person who managed and coordinated this Genie 2 project, Tim Rocktäschel. 00:03:53.760 |
This was for a podcast that's been released on my AI Insiders platform on Patreon. I will, 00:03:58.720 |
of course, go through the Genie 2 announcement, the demo videos, and what they say is coming next. 00:04:03.600 |
But I can't help but point out, I directly asked Tim Rocktäschel about Genie 2. The paper, it capped 00:04:09.680 |
out, I think, 2.7 billion parameters. I just wonder, you might not be able to say, but surely 00:04:14.320 |
Genie 2, you know, 270 billion parameters or something, is that something that you're working 00:04:19.360 |
on or excited about with more data? Because the paper even talked about how one day we might train 00:04:24.240 |
on a greater scale of internet data, all of YouTube potentially. Is that something you're 00:04:27.920 |
working on or could speak about at all? I'm excited about this. You can just look, 00:04:32.640 |
basically, at what happened over the last few months since Genie was published. There's, for example, 00:04:38.240 |
the Oasis work that came out a few weeks ago, where people basically learned a neural network 00:04:43.280 |
to simulate Minecraft. Before that, there was a paper on learning to simulate Doom using a neural 00:04:48.640 |
network. That space is definitely heating up. I think it's exciting. I think maybe at some point, 00:04:54.720 |
these simulators, these learned simulators, are getting fast and rich enough so that you 00:04:58.640 |
then can also use them to adversarially probe an embodied AGI and teach it new capabilities. 00:05:04.720 |
The reason I was interviewing him, by the way, is because he is the author of the brand new book, 00:05:10.000 |
AI - 10 Things You Should Know. But what really is Genie 2 and what does it say about what's coming? 00:05:16.000 |
DeepMind call it a foundation world model. Essentially, you give it a single image and 00:05:20.880 |
Genie 2 will turn it into an interactive world. The world might not quite be as high resolution 00:05:26.800 |
as the original image, but you can use keyboard actions to control that world. Jump, fly, skip, 00:05:33.200 |
swim, all that kind of thing. I can imagine this being used for dream sequences within games, 00:05:38.080 |
where a character might have a dream of an alternate reality and you can interact with 00:05:42.560 |
that world. Or maybe in the future, websites, instead of having static images in the background 00:05:47.200 |
or even looping videos, will have interactive environments that you can play like games. 00:05:51.920 |
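To make that concrete: Genie 2 isn't something you can call today, and no API has been published, so what follows is a purely hypothetical Python sketch of the general shape of an action-conditioned world-model loop: encode a prompt image into a world state, then roll the world forward one predicted frame per keyboard-style action. Every class and name below is invented for illustration, and the "model" is a random-noise stand-in rather than anything learned.

```python
# Hypothetical sketch only: Genie 2 has no public API. This just shows the shape of
# an action-conditioned world-model loop; the "model" here is a random-noise stand-in.
import numpy as np

class ToyWorldModel:
    def encode(self, image: np.ndarray) -> np.ndarray:
        # Compress the prompt image into a latent "world state" (toy: its mean colour).
        return image.mean(axis=(0, 1))

    def step(self, state: np.ndarray, action: str) -> tuple[np.ndarray, np.ndarray]:
        # Predict the next latent state and frame. This toy ignores the action;
        # a real learned world model would condition the next frame on it.
        next_state = state + 0.01 * np.random.randn(*state.shape)
        frame = np.clip(np.random.rand(90, 160, 3) + next_state, 0.0, 1.0)
        return next_state, frame

prompt_image = np.random.rand(90, 160, 3)              # stand-in for the single starting image
model = ToyWorldModel()
state = model.encode(prompt_image)
for action in ["forward", "forward", "jump", "left"]:  # keyboard-style controls
    state, frame = model.step(state, action)
    print(action, frame.shape)                         # a real system would render each frame
```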
But just a few quick caveats. These worlds, these generations, mostly last 10 to 20 seconds, 00:05:57.920 |
or up to a minute at most. Next: even though they seem that way, these example videos aren't quite 00:06:04.480 |
real time. As it stands, if you want real time interaction, you'd have to suffer from a reduction 00:06:10.480 |
in quality. And let's be honest, these outputs weren't exactly high resolution to begin with, 00:06:15.600 |
so we're not talking about replacing AAA games anytime soon. Next, the outputs can go 00:06:21.520 |
quite wrong quite quickly with no real explanation. Like in this one, a ghost appears for no reason. 00:06:29.040 |
In this one, the guy starts with a snowboard, but then immediately decides just to run the course. 00:06:34.320 |
As Google wrote, the character prefers parkour over snowboarding. Yes, by the way, 00:06:39.040 |
the initial prompt could be a real world image. And that, of course, is super cool. 00:06:44.320 |
It can kind of model lighting, although we're not talking ray tracing here. 00:06:48.400 |
And it can, it says, model gravity, but look at this horse jump on the left. 00:06:53.520 |
I wouldn't say that's terribly high accuracy physics. This bit, though, I did find more 00:06:59.120 |
impressive, which is that Genie 2 is capable of remembering parts of the world that are 00:07:04.000 |
no longer in view and then rendering them accurately when they become observable again. 00:07:08.560 |
So if you look at the characters, they'll look away from something, look back to it, 00:07:12.720 |
and it's mostly the same as when they first looked away. 00:07:15.760 |
Interestingly, on the announcement page, which didn't yet come with a paper, 00:07:19.520 |
they actually pushed a different angle for why this was important. 00:07:22.800 |
They said that if we're going to train general embodied agents, in other words, 00:07:27.440 |
AI controlling a robot, that's bottlenecked by the availability of sufficiently rich and 00:07:32.080 |
diverse training environments. And they gave an example of how they use Genie 2 00:07:36.400 |
to create this interactive world and then told an AI agent to, for example, open the red door. 00:07:42.880 |
The SIMA agent, which I've covered before on the channel, was indeed able to open the red door. 00:07:48.080 |
But I personally would put an asterisk here because it's an AI agent trained on this AI 00:07:54.320 |
generated, not particularly realistic world. Because of the stark gap that exists between 00:07:59.520 |
these kind of simulations and our rich, complex reality, I'm not entirely convinced that this 00:08:05.440 |
approach will lead to reliable agents. Of course, Google DeepMind could well 00:08:09.520 |
prove me wrong. They say that they believe Genie 2 is the path to solving a structural problem of 00:08:16.720 |
training embodied agents safely while achieving the breadth and generality required to progress 00:08:22.080 |
towards AGI. Now, of course, my objection would fall away if we had a path for removing these 00:08:27.360 |
creative but not particularly reliable AI hallucinations. But as even the CEO of NVIDIA 00:08:33.280 |
recently admitted, we are "several years away" from that happening. 00:08:37.600 |
His solution, as you might expect, is just buy more GPUs. Some of you may say that reliability 00:08:43.360 |
issues and hallucinations, they're just a minor bug. They're going to go soon. 00:08:47.040 |
What's the problem with Jensen Huang saying that the solution is still just a few years away? 00:08:51.360 |
Well, I think many people, including AI lab leaders, massively underestimated the hallucination 00:08:57.200 |
issue. As I covered on this channel in June of 2023, Sam Altman said, and I quote, 00:09:02.640 |
"We won't be talking about hallucinations in one and a half to two years." 00:09:06.640 |
One and a half years from then is like today, and we are talking about hallucinations. And even at 00:09:11.840 |
the upper end, we're talking mid-2025. And I don't think anyone would now vouch for the claim, 00:09:16.720 |
echoed by Mustafa Suleyman, that LLM hallucinations will be largely eliminated by 2025. 00:09:23.040 |
In short, the very thing that makes these models great at generating creative interpolations of 00:09:29.120 |
their data, creative worlds, is the very thing that makes them unreliable when it comes to 00:09:33.840 |
things like physics. Remember that even frontier generative models like Sora, 00:09:38.160 |
when given 10 minutes to produce an output, still produce things where the physics don't make sense. 00:09:43.680 |
And this links to a recent paper that I was going to analyse for this video. 00:09:48.160 |
In mathematics, and plausibly physics, large language models based on the 00:09:52.320 |
transformer architecture don't learn robust algorithms. They rely on a bag of heuristics, 00:09:58.720 |
or rules of thumb. In other words, they don't learn a single cohesive world model. 00:10:03.680 |
They deploy a collection of simpler rules and patterns. That's why, for example, 00:10:07.760 |
with Genie 2 and Sora, you get plausible continuations that, if you look closely, 00:10:12.320 |
don't make too much sense. Imagine Sora or Genie 2 generating a car going off a cliff and all the 00:10:18.400 |
resulting physics. You might have hoped that their training data had inculcated Isaac Newton's laws 00:10:23.920 |
of physics and you'd get a very exact result. But they don't actually have the computational 00:10:29.360 |
bandwidth to perform those kinds of calculations. Instead, it's a bit more like this. 00:10:33.520 |
When models are tasked with 226 take away 68, they get this vibe: that kind of feels like an 00:10:40.560 |
answer between 150 and 180. That's one of the heuristics or rules of thumb that these authors 00:10:47.280 |
studied. Patch together enough of these vibes or heuristics and you start getting pretty accurate 00:10:52.240 |
answers most of the time. Each heuristic they learn only slightly boosts the correct answer 00:10:58.080 |
logit. But combined, they cause the model to produce the correct answer with high probability, 00:11:03.840 |
though not reliably. 00:11:10.240 |
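As a toy illustration of that idea, and this is my own sketch rather than code or heuristics from the paper (the three rules below are invented for this one subtraction), a few crude heuristics can each nudge candidate answers a little, and their combined vote lands on 226 take away 68:

```python
# Toy sketch (not from the paper): a few crude heuristics each give candidate
# answers a small boost, and the combined vote picks the right answer here.
def heuristic_score(a: int, b: int, candidate: int) -> float:
    score = 0.0
    if 150 <= candidate <= 180:                                # the rough "vibe" band
        score += 1.0
    if candidate % 10 == (a - b) % 10:                         # last-digit rule: ends in 8
        score += 1.0
    if abs(candidate - (round(a, -1) - round(b, -1))) <= 10:   # roughly 230 - 70
        score += 1.0
    return score

a, b = 226, 68
best = max(range(400), key=lambda c: heuristic_score(a, b, c))
print(best, a - b)   # 158 158 -- here the vote lands exactly on the true answer
```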
Indeed, their results suggest that improving LLMs' mathematical abilities may require fundamental changes to training and architectures. And I totally get it. This is the 00:11:15.840 |
same video in which I showed you the O1 model getting 83% in an exceptionally hard math competition. 00:11:22.400 |
But you may have noticed that none of these models tend to get 100% in anything. For example, 00:11:27.360 |
if they can solve 93% of PhD-level physics problems, why do they only get 81% in AP Physics? 00:11:35.920 |
That's the same O1 model that we're set to get in the next 12 days. I almost can't help myself. 00:11:40.880 |
I'm starting to cover the two papers that I said I would cover another day. But still, 00:11:45.120 |
I just want to touch on this other paper. The ridiculously compressed TLDR is that they show 00:11:50.000 |
that models do learn procedures rather than memorizing individual answers. The way they 00:11:55.280 |
show this is really complex and does rely on some approximations: estimating, for example, 00:12:00.720 |
if you removed these 500 tokens from pretraining, how would that affect the model's parameters and therefore the 00:12:05.360 |
likelihood of getting an answer right? You can then, in short, judge which kinds of sources the 00:12:09.840 |
model is relying on for particular types of questions. 00:12:14.160 |
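The paper itself uses a much more sophisticated influence-function approximation, so treat the following purely as a toy, first-order sketch of the underlying idea: score how much a training example supports a query by the dot product of their loss gradients. The tiny model and the "documents" are entirely made up for illustration.

```python
# Toy first-order influence sketch (not the paper's actual method): approximate how
# much a training example supports a query by the dot product of their loss gradients.
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, dim = 50, 16
# Stand-in "language model": bag of token embeddings -> next-token logits.
model = nn.Sequential(nn.EmbeddingBag(vocab_size, dim), nn.Linear(dim, vocab_size))
loss_fn = nn.CrossEntropyLoss()

def loss_grad(tokens, target):
    """Flattened gradient of the loss on one (context, next-token) pair."""
    model.zero_grad()
    loss = loss_fn(model(torch.tensor([tokens])), torch.tensor([target]))
    loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

# Made-up pretraining "documents" (token ids) and a reasoning-style query.
train_docs = {
    "procedural doc (walks through a method)": ([3, 7, 9], 4),
    "lookup doc (just states an answer)":      ([11, 2, 5], 8),
}
query = ([3, 9, 2], 4)

g_query = loss_grad(*query)
for name, doc in train_docs.items():
    # Higher score ~ removing this document would hurt the query more, i.e. the
    # model is leaning on it more for this particular answer.
    score = torch.dot(g_query, loss_grad(*doc)).item()
    print(f"{name}: influence ~ {score:.4f}")
```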
Again, what the authors are showing is that the models aren't memorizing particular answers to reasoning questions. Like when asked 00:12:18.960 |
what is (7 - 4) × 7, they're not looking up a source that states the answer, 21. 00:12:25.440 |
They're relying on multiple sources that give the kind of procedures you'd need to answer that 00:12:31.040 |
question. But while that seems really promising for these models developing world models 00:12:35.920 |
and true reasoning, they add this crucial caveat. They don't find evidence for models generalizing 00:12:42.080 |
from pre-training data about one type of reasoning to another similar type of reasoning. You could 00:12:47.840 |
kind of think of that like a model getting fairly good at simulating the physics of the moon but not 00:12:54.080 |
then applying that when asked to simulate the physics of Mars. In-distribution generalization 00:12:59.600 |
versus out-of-distribution generalization. Anyway, I've definitely gone on too long, 00:13:04.000 |
time to bring us back to the real world. And before I end with that cute turtle and how you 00:13:08.560 |
can move it around, here is another real-world tool that you can use today. This is Assembly AI's 00:13:14.640 |
Universal 2 speech-to-text model and you can see its performance here. As many of you know, 00:13:19.920 |
I reached out to Assembly AI and they are kindly sponsoring this video. I use their models to 00:13:25.120 |
transcribe my projects and you can see the comparison not just with Universal 1 but with 00:13:30.560 |
other competitive models. One thing I've learned in doing so is: don't focus only on word error 00:13:35.680 |
rate; think about how models perform with proper nouns and alphanumerics. That, at least for me, is what sets the Universal family apart. 00:13:41.280 |
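If you want to try it yourself, the usual pattern with their Python SDK looks roughly like this. This is a sketch based on AssemblyAI's standard quickstart; the API key and file path are placeholders, and I'm not pinning a specific model variant here.

```python
# Sketch of the standard AssemblyAI Python quickstart; key and path are placeholders.
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

transcriber = aai.Transcriber()
# Accepts a local file path or a public URL to an audio/video file.
transcript = transcriber.transcribe("./my_project_audio.mp3")

if transcript.status == aai.TranscriptStatus.error:
    print(transcript.error)
else:
    print(transcript.text)
```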
Now, as we wrap up this video, just for anyone wondering 00:13:46.320 |
about an update to SimpleBench. First, what about the new Gemini experimental models? Well, 00:13:51.760 |
they are rate-limited. I might soon be getting early access to something else but for now, 00:13:57.040 |
we can't run it in full on SimpleBench. What about DeepSeek R1? Well, again, as of today, 00:14:02.400 |
not available through the API. What, though, about Alibaba's QwQ model? That has been getting a lot 00:14:08.640 |
of hype lately but honestly, what's new? Almost everything gets hyped in AI. Of course, I am 00:14:14.160 |
following all of the models coming out of China and testing them as much as I possibly can and 00:14:18.960 |
I did read that interview with the founder of DeepSeek. I may cover that in a different video 00:14:23.840 |
but for now, I could actually run QwQ on SimpleBench and unfortunately, it got a score below 00:14:30.320 |
Claude 3.5 Haiku so it doesn't appear on the list. I'm sorry that I can't do a shocked-faced 00:14:35.920 |
thumbnail and say that AGI has arrived but that's just the result we got. It was around 11%. 00:14:41.040 |
I'm going to show you one more tool just very quickly and you can use it for free today. 00:14:46.080 |
It's Kling 1.5, actually another model coming out of China and you could argue, in a way, 00:14:51.200 |
it's a foretaste of the kind of interactivity that something like Genie 2 will bring. Again, 00:14:56.160 |
free to sign up for at least 5 professional generations. Click on the left to upload an 00:15:01.760 |
image. I generated this one with Ideogram. Then go down to Motion Brush. Then I selected 00:15:07.680 |
Auto Segmentation so I could pick out the turtle and then for Tracking, I drew this arrow to the 00:15:14.000 |
right. Then Confirm of course. Go to Professional Mode. I only have 2 trial uses left and then 00:15:19.840 |
Generate. You control the movement and you can end up with super cute generations like this. 00:15:25.360 |
So whatever you found the most interesting, thank you for watching and have a wonderful day.