Genie 3: The World Becomes Playable (DeepMind)

Chapters
0:00 Introduction
1:27 Background and Access
4:58 Caveats
7:24 Demo
10:12 Conclusion
In the week that we are set to get GPT-5, it might be easy to miss this announcement of Google DeepMind's Genie 3. To cut a long story short, it makes the world playable. Start with an image, which could be one of your photos, and then enter that world and modify it with prompts. By entering, I mean you can move around, take actions that last and stay in that world, and basically go wild. I was given early access to the presentation of Genie 3 and got to ask the makers a question, but I'm going to be honest: Genie 3 is designed and marketed to allow AI agents to act out scenarios and self-improve at taking actions. That's the theory. For me, and let me know if you agree, it will be used much more for gamifying all of reality and your imagination.

If you have been following the channel for just a little bit, you know that I interviewed a senior researcher on Genie 2, Tim Rocktäschel, both here and on my Patreon. At the time, we learned that Genie 2 would, quote, scale gracefully with more compute. Well, it did, and now we get real-time interaction in 720p at 24 frames per second. If that's jargon to you, it means you can click some buttons and things happen immediately on screen, at fairly high resolution.
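To make that jargon concrete, here is a minimal sketch of what real-time generation at 24 frames per second implies: the model has roughly 42 milliseconds to produce each 720p frame, conditioned on your latest input. Everything here, including `generate_next_frame` and the loop structure, is a hypothetical illustration, not Genie 3's actual API.

```python
import time

FRAME_BUDGET = 1.0 / 24  # 24 fps -> roughly 41.7 ms per frame

def generate_next_frame(history, action):
    """Hypothetical stand-in for the world model's frame generator.
    A real system would run a video model conditioned on the frame
    history plus the user's latest action."""
    return b"<1280x720 frame>"

def interaction_loop(get_user_action, display, n_frames=240):
    history = []
    for _ in range(n_frames):
        start = time.monotonic()
        action = get_user_action()        # e.g. move, jump, or a text prompt
        frame = generate_next_frame(history, action)
        history.append((action, frame))   # context the model conditions on
        display(frame)
        # "Real time" means each iteration fits inside the frame budget;
        # if generation takes longer than ~42 ms, interactivity breaks.
        elapsed = time.monotonic() - start
        if elapsed < FRAME_BUDGET:
            time.sleep(FRAME_BUDGET - elapsed)
```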
Now, in a couple of minutes, I'm going to show you the full intro, which is about 130 seconds, I think, which is unusual for this channel. I don't normally show clips that long, but it does showcase Genie 3 really quite well. First, though, just a few thoughts from me.

Jack Parker-Holder, the lead author of Genie 3, told me and a group of journalists that the goal behind it was to have a Move 37 moment for embodied AI; as in, for robots, not just for computers that play games. A Move 37 moment is a high bar, as any of you who have watched the AlphaGo documentary know, but think of it as a novel breakthrough that goes beyond human data. In other words, we just don't have enough data to train robots reliably, given the innumerable scenarios in which they'll be placed. If we can simulate all worlds, then we might get novel breakthroughs for those robots: get them to do things, essentially, that we couldn't even have trained them to do.

In the presentation, though, I pushed back with the question that if these worlds suffer from physics inaccuracies, and they do, how would such agents ever be fully reliable? Both lead authors agreed that's a real issue, but then raised something that got me thinking. They said that while you can't guarantee reliability, you can demonstrate unreliability. Think about it: if an agent goes off the rails in simulation, then it's also liable to do so in the real world. In a way, then, I think both of these points still stand. We can't guarantee reliability with simulators like Genie 3, but we can use them to find unreliability.
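That asymmetry, unable to prove reliability but able to prove unreliability, is essentially falsification by sampling. Here's a hedged sketch of the idea; every name in it is a hypothetical stand-in I've invented for illustration, not anything DeepMind described.

```python
import random

class ToyAgent:
    """Illustrative agent; a real one would be a learned robot policy."""
    def act(self, state):
        return random.choice(["safe", "safe", "safe", "unsafe"])

def simulate_rollout(agent, world_seed):
    """Hypothetical: run one episode in a world generated from
    `world_seed` and return the agent's action trace."""
    random.seed(world_seed)
    return [agent.act(f"state-{i}") for i in range(50)]

def violates_safety(trace):
    """Hypothetical check, e.g. 'did the robot collide with a person?'"""
    return "unsafe" in trace

def find_unreliability(agent, n_worlds=1000):
    """Sampling can never certify reliability (untested worlds always
    remain), but every failing trace is a concrete counterexample."""
    return [seed for seed in range(n_worlds)
            if violates_safety(simulate_rollout(agent, seed))]

failures = find_unreliability(ToyAgent())
print(f"{len(failures)} of 1000 simulated worlds produced a failure")
# An empty list would NOT prove the agent safe; it would only mean we
# didn't happen to sample a world that breaks it.
```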
Anyway, what you're probably thinking, and I definitely was, is that we should just be honest with ourselves: everyone is going to want to upload a still from their favourite game, life event, celebrity, or what have you, and basically interact with it, jump around, paint a wall, and just get silly. And even that is probably phrasing things somewhat maturely, which is probably why this is currently still a research preview, meaning you can't get your hands on it. Google were pretty evasive about timing for a general release, not even a hint of a date.

If that disappoints you, though, I am old enough to remember that the same not-for-general-release, safety-issues kind of reasoning was true of Imagen 1, Google's very basic early image generator, deemed essentially not fit for public release. But as of today, we have Imagen 4 out in public, far improved, and even available on the API so developers can incorporate it into their apps. Translated: Genie 4 might be available for you to play with sooner than you think.

Okay, but what about that incredible memory, where you could paint a wall, for example, look around, come back, and the paint is still there? Let's just take a moment and say: Google, that is pretty impressive, well done. But the memory within these worlds is measured in minutes, not hours. So if you were thinking of making a friend in one of these worlds, building a house together, and living in it to escape the real world and its current self-immolation, that won't quite work. As it currently stands, by the time you return to the house the next day, it will be completely reimagined.
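There's no paper, so we don't know how Genie 3's memory actually works. But a common pattern in video world models is conditioning on a bounded window of recent frames: anything that falls out of the window has to be re-imagined, which would explain minutes-not-hours consistency. The following is a sketch of that assumption only, not Genie 3's documented design.

```python
from collections import deque

FPS = 24
MEMORY_MINUTES = 1  # illustrative horizon: minutes, not hours

class BoundedWorldMemory:
    """Sliding window over recent frames (an assumption for illustration).
    Events older than the window are forgotten, so the model must
    re-imagine them, often differently, when you return."""

    def __init__(self):
        self.frames = deque(maxlen=FPS * 60 * MEMORY_MINUTES)

    def observe(self, frame):
        self.frames.append(frame)  # the oldest frame silently drops out

    def context(self):
        return list(self.frames)   # all the model can condition on

memory = BoundedWorldMemory()
for t in range(FPS * 60 * 2):      # two minutes of frames
    memory.observe(f"frame-{t}")
# The painted wall from minute 0 has scrolled out of context:
assert "frame-0" not in memory.context()
```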
Google also told me about four other caveats. I think they are pretty telling about the future of simulation, so let's go through them. First, while the most common actions you'd find in games, like moving around and jumping, are performable, you can't currently perform complex actions. Next, and this thought literally just came to me, but it's a bit like a dream, in that the second caveat is that you can't talk to other characters. Maybe that's just me, but in your dreams, do you speak to other people? Definitely not complex conversations. Anyway, as they put it to me, accurately modeling complex interactions between multiple independent agents is still an ongoing research challenge. Third, as you would expect, we can't expect accurate representation of real-world locations. The sheer imaginative scope of these worlds is also somewhat their downfall, in that lifelike fidelity is not their priority. That bleeds into the fourth caveat they gave me, which is text rendering: don't expect high-fidelity text. It can appear if you add it to your prompt; it's just not built into the environment.

Now, funnily enough, I think it was a Guardian or New York Times journalist who asked whether this is a replacement for something like Omniverse or Unreal Engine. Google wouldn't say that, but they did say that hard-coding the complexity of the real world is intractable, so that's why we might need simulations like the Genie series. I know quite a few game developers watch the channel, so do chip in with your thoughts on this versus Unreal Engine.

And I would add, there's a hybrid approach, which I saw recently in a TED Talk from a guy from Roblox. I forget his name and his rank, but the idea was that you could prompt a model to directly code new parts of the environment. The full six-minute talk is linked in the description. This feels to me like it would be slightly more predictable, perhaps, but maybe less scalable, because with the Genie series you could scale it with billions of hours of video from YouTube, not so much with hard-coded assets. Which approach will win out? I actually don't know, so let me know what you think.
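For contrast, here is a toy sketch of that hybrid, prompt-to-code approach: a model emits script against a fixed engine API, so the resulting world is deterministic and replayable, unlike a dreamed frame. The LLM call, the `spawn` API, and the `Scene` class are all invented for illustration.

```python
def llm_generate_code(prompt):
    """Hypothetical LLM call that returns engine script for the prompt."""
    return 'spawn("oak_tree", x=10, y=0, z=5)'

class Scene:
    """Toy stand-in for a hand-built game engine's scene graph."""
    def __init__(self):
        self.objects = []
    def spawn(self, kind, x, y, z):
        self.objects.append((kind, x, y, z))

def extend_world(scene, prompt):
    code = llm_generate_code(prompt)
    # Executing generated code against a fixed engine API keeps the
    # world deterministic and replayable, unlike a dreamed video frame;
    # the trade-off is that it only scales as far as hand-built assets.
    exec(code, {"spawn": scene.spawn})

scene = Scene()
extend_world(scene, "add an oak tree on the hill")
print(scene.objects)  # [('oak_tree', 10, 0, 5)]
```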
Now, enough build-up. There is no paper to go through. I was going to release this video at 3 p.m. when the embargo lifted, but I thought maybe they're going to give us a paper, so let's hold back. No, there was no paper. So here is the roughly two-minute demo that I promised, albeit slightly later than I said I would give it.
What you're seeing are not games or videos. They're worlds. Each one of these is an interactive environment generated by Genie 3, a new frontier for world models. With Genie 3, you can use natural language to generate a variety of worlds and explore them interactively, all with a single text prompt. Let's see what it's like to spend some time in a world.

Genie 3 has real-time interactivity, meaning that the environment reacts to your movements and actions. You're not walking through a pre-built simulation; everything you see here is being generated live as you explore it. And Genie 3 has world memory. That's why environments like this one stay consistent. World memory even carries over into your actions. For example, when I'm painting on this wall, my actions persist. I can look away and generate other parts of the world, but when I look back, the actions I took are still there. And Genie 3 enables promptable events, so you can add new events into your world on the fly.

You can use Genie to explore real-world physics and movement. You can generate worlds with distinct geographies, historical settings, fictional environments, and even other characters. We're excited to see how Genie 3 can be used for next-generation gaming and entertainment, training robotic agents before working in the real world, or simulating dangerous scenarios for disaster preparedness and emergency training. World models can open new pathways for learning. We're excited to see how Genie 3's world simulation can benefit research around the world.
Trying to game out the impact of technologies like Genie on jobs is just too complex for me at the moment. But there are real-world jobs you can apply to via the sponsor of today's video, 80,000 Hours. If you, somewhat helpfully, use my link in the description, you'll go to their job board, which you can see here. Either way, the jobs are sourced from around the world.
I will be touching on Gemini Deep Think on the main channel, and my early review of that tool is on Patreon. But it just feels inevitable to me that people will initially want their games to be infinitely customizable, and as expectations continue to rise, they'll want their entertainment to be interactive: say, prompting Netflix to add their own face into Squid Game US Edition.
You'll be able to speak to other agents, or let's just call them bots. The other characters in these simulated worlds will be pretty intelligent; they probably won't just keep walking into walls. You can, like, chat with them about Sophocles. Some people may even need to watch their step. Given the commitment from Google to incorporate this into their march to AGI, these worlds will be born one way or another.
But the question for me is whether a fully imagined simulation is the way, or instead my bet, which is something more like Isaac Lab from NVIDIA: simulated, but also programmable, and so repeatable. Soon enough, many worlds are about to get crazy. I look forward to covering GPT-5 with you guys this week, almost certainly.