
Genie 3: The World Becomes Playable (DeepMind)


Chapters

0:00 Introduction
1:27 Background and Access
4:58 Caveats
7:24 Demo
10:12 Conclusion


00:00:00.000 | In the week that we are set to get GPT-5, it might be easy to miss this announcement of
00:00:07.360 | Google DeepMind's Genie 3. To cut a long story short, it makes the world playable.
00:00:13.860 | Start with an image, which could be one of your photos, and then enter that world and modify it
00:00:21.140 | with prompts. By entering, I mean you can move around, take actions that last and stay in that
00:00:27.520 | world and basically go wild. I was given early access to the presentation of Genie 3 and got to
00:00:33.560 | ask the makers a question, but I'm going to be honest. Genie 3 is designed and marketed to allow
00:00:40.200 | AI agents to act out scenarios and self-improve at taking actions. That's the theory. For me,
00:00:47.420 | and let me know if you agree, it will be used much more for gamifying all of reality and your
00:00:55.040 | imagination. If you have been following the channel for just a little bit, you know that I
00:00:59.680 | interviewed a senior researcher on Genie 2, Tim Rocktäschel, here and on my Patreon. And at the
00:01:05.960 | time, we learned that Genie 2 would, quote, scale gracefully with more compute. Well, it did, and now
00:01:13.340 | we get real-time interaction in 720p, 24 frames per second. If that's jargon to you, it means you can click
00:01:21.020 | some buttons and things happen at the exact same time on screen at fairly high resolution. Now,
00:01:27.080 | in a couple of minutes, I'm going to show you the full intro, which is about 130 seconds, I think,
00:01:33.680 | which is unusual for this channel. I don't normally show clips that long, but it does showcase Genie 3
00:01:38.940 | really quite well. First though, just a few thoughts from me. Jack Parker-Holder, the lead author of Genie 3,
00:01:45.660 | told me and a bunch of journalists that the goal behind it was to have a Move 37 moment
00:01:52.240 | for embodied AI, as in for robots, not just for computers that play games. A Move 37 moment is a
00:02:00.300 | high bar, as any of you who have watched the AlphaGo documentary know, but think of it as a novel
00:02:06.620 | breakthrough that goes beyond human data. In other words, we just don't have enough data to train robots
00:02:13.300 | reliably, given the innumerable scenarios in which they'll be placed. If we can simulate all worlds,
00:02:19.960 | then we might get novel breakthroughs for those robots, get them to do things essentially that we
00:02:25.920 | couldn't have even trained them to do. In the presentation, I pushed back though, with the
00:02:29.900 | question that if these worlds suffer from physics inaccuracies, and they do, how would such agents
00:02:36.480 | ever be fully reliable? Both lead authors agreed that's a real issue, but then raised something that got me
00:02:42.780 | thinking. They said that yes, while you can't guarantee reliability, you can demonstrate
00:02:48.900 | unreliability. Think about it, if an agent goes off the rails in simulation, then it's also liable to do
00:02:55.560 | so in the real world. In a way then, I think both of these points still stand. I think we can't guarantee
00:03:01.660 | reliability with simulators like Genie 3, but we can help find unreliability. Anyway, what you're probably
00:03:09.040 | thinking, and I definitely was, was that we should just be honest with ourselves. Everyone is gonna want to
00:03:15.900 | upload a still from their favourite game, life event, celebrity, or what have you, and basically interact
00:03:23.860 | with it, jump around, paint a wall, and just get silly. And even that is probably phrasing things
00:03:29.540 | somewhat maturely, which is probably why this is currently still a research preview. Meaning you can't get your hands on it.
00:03:38.440 | Google were pretty evasive about timing for a general release, not even a hint of a date.
00:03:44.960 | However, if that disappoints you, I am old enough to remember that that same, I guess, not for general release,
00:03:52.800 | safety issues kind of thing, was true of Imagen 1, the very basic image generator from Google,
00:03:59.540 | basically not fit for public release. But as of today, we have Imagen 4 out in public, far improved,
00:04:08.240 | and even available on the API so developers can incorporate it into their apps.
00:04:13.660 | Translated, Genie 4 might be available to you to play with sooner than you think.
00:04:20.020 | Okay, but what about that incredible memory where you could paint a wall, for example, look around,
00:04:24.880 | come back, and the paint is still there? Let's just take a moment and say, Google, that is pretty impressive,
00:04:30.700 | well done. But the memory within these worlds is measured in minutes, not hours. So if you were
00:04:38.560 | thinking of making a friend in one of these worlds, building a house together, and living in it to
00:04:44.320 | escape the real world and its current self-immolation, that won't quite work. As it currently stands,
00:04:50.360 | by the time you return to the house the next day, it will be completely reimagined. And Google told me
00:04:56.180 | of four other caveats. I think they are pretty telling about the future of simulation, so let's go
00:05:03.260 | through them. First, while the most common actions are performable, as you'd find in games like
00:05:08.140 | moving around and jumping, you can't currently perform complex actions. Next, and this thought
00:05:14.680 | literally just came to me, but it's a bit like a dream, in that the next caveat is that you can't
00:05:19.980 | talk to other characters. Maybe that's just me, but in your dreams, do you speak to other people?
00:05:24.720 | Definitely not complex conversations. Anyway, they said to me, accurately modeling complex interactions
00:05:30.540 | between multiple independent agents is still an ongoing research challenge. Third, as you would expect,
00:05:37.720 | we can't expect accurate representation of real-world locations. The sheer imaginative
00:05:44.380 | scope of these worlds is also somewhat their downfall in that lifelike fidelity is not their
00:05:50.400 | priority. That bleeds into the fourth caveat they gave me, which is text rendering. Don't expect
00:05:56.440 | high-fidelity text rendering. It can happen if you add it to your prompt, it's just not built into the
00:06:02.180 | environment. Now, funnily enough, I think it was a Guardian or New York Times journalist asked,
00:06:06.360 | actually, about whether this is a replacement for something like Omniverse or Unreal Engine. Google
00:06:12.420 | wouldn't say that, but they did say that hard-coding the complexity of the real world is intractable,
00:06:18.480 | so that's why we might need simulations like the Genie series. I know quite a few game developers watch
00:06:24.440 | the channel, so do chip in with your thoughts on this versus Unreal Engine. And I would add, there's a hybrid
00:06:32.040 | approach, which I saw recently in a TED talk from a guy from Roblox. I forget his name and his rank,
00:06:38.340 | but the idea was that you could prompt a model to directly code new parts of the environment. The full
00:06:45.820 | six-minute talk is linked in the description. But this feels to me like it would be slightly more
00:06:50.740 | predictable, perhaps? But maybe less scalable, because with the Genie series, you could scale it with
00:06:56.740 | billions of hours of video from YouTube, not so much with hard-coded assets. Which approach will
00:07:02.040 | win out? I actually don't know, so let me know what you think. Now, enough build-up. There is no paper
00:07:07.320 | to go through. I was going to release this video at 3 p.m. when the embargo lifted, but I thought maybe
00:07:12.700 | they're going to give us a paper, so let's hold back. No, there was no paper. So here is the around
00:07:17.260 | two-minute demo that I promised, albeit slightly later than I said I would give it.
00:07:21.560 | What you're seeing are not games or videos. They're worlds. Each one of these is an interactive
00:07:30.820 | environment generated by Genie 3, a new frontier for world models. With Genie 3, you can use natural
00:07:37.780 | language to generate a variety of worlds, and explore them interactively, all with a single text prompt.
00:07:43.780 | Let's see what it's like to spend some time in a world.
00:07:53.460 | Genie 3 has real-time interactivity, meaning that the environment reacts to your movements and actions.
00:08:01.180 | You're not walking through a pre-built simulation. Everything you see here is being generated live
00:08:06.140 | as you explore it. And Genie 3 has world memory. That's why environments like this one stay consistent.
00:08:12.980 | World memory even carries over into your actions.
00:08:16.980 | For example, when I'm painting on this wall, my actions persist.
00:08:22.280 | I can look away and generate other parts of the world.
00:08:28.960 | But when I look back, the actions I took are still there.
00:08:31.720 | And Genie 3 enables promptable events, so you can add new events into your world on the fly.
00:08:37.900 | Something like another person.
00:08:39.880 | Or transportation.
00:08:42.800 | Or even something totally unexpected.
00:08:47.120 | You can use Genie to explore real-world physics and movement.
00:08:52.220 | And all kinds of unique environments.
00:08:55.800 | You can generate worlds with distinct geographies, historical settings,
00:08:59.600 | fictional environments, and even other characters.
00:09:02.740 | We're excited to see how Genie 3 can be used for next-generation gaming and entertainment.
00:09:07.660 | And that's just the beginning.
00:09:09.680 | Worlds can help with embodied research,
00:09:12.480 | training robotic agents before working in the real world.
00:09:15.960 | Or simulating dangerous scenarios for disaster preparedness and emergency training.
00:09:21.040 | World models can open new pathways for learning,
00:09:24.600 | agriculture,
00:09:26.040 | manufacturing,
00:09:28.180 | and more.
00:09:29.520 | We're excited to see how Genie 3's world simulation can benefit research around the world.
00:09:39.760 | Trying to game out the impact of technologies like Genie on jobs is just too complex for me at the moment.
00:09:48.420 | But there are real-world jobs you can apply to via the sponsors of today's video, 80,000 Hours.
00:09:55.320 | If you somewhat helpfully use my link in the description,
00:09:58.560 | then you'll go to their job board, which you can see.
00:10:02.520 | And these are all real jobs related to AI.
00:10:05.940 | Well, I think the majority relate to AI.
00:10:07.940 | But either way, the jobs are sourced from around the world.
00:10:12.080 | Now, you could say,
00:10:13.240 | why even cover Genie 3?
00:10:15.180 | And don't worry,
00:10:16.460 | I will be touching on Gemini Deep Think on the main channel,
00:10:20.180 | which is also from Google DeepMind,
00:10:22.180 | soon enough.
00:10:22.920 | And my early review of that tool is on Patreon.
00:10:26.460 | But it just feels inevitable to me that people will initially want their games to be infinitely
00:10:32.380 | playable.
00:10:32.900 | Think a map size bigger than GTA 7.
00:10:36.340 | As expectations continue to rise, they'll want their entertainment to be interactive.
00:10:41.240 | Say, prompting Netflix to add their own face into Squid Game US Edition.
00:10:47.180 | And it will just never stop.
00:10:48.900 | It will then be in VR in 16K.
00:10:52.100 | You'll be able to speak to other agents, or let's just call them bots.
00:10:55.500 | The other characters in these simulated worlds will be pretty intelligent.
00:10:59.220 | They probably won't just keep walking into walls.
00:11:01.560 | You can, like, chat with them about Sophocles.
00:11:04.160 | Some people may even need to watch their step,
00:11:08.120 | lest they fall into these infinite worlds.
00:11:10.960 | Others will dive in headlong.
00:11:13.740 | But the step up in resolution and memory,
00:11:16.980 | and the commitment from Google to incorporate this into their march to AGI,
00:11:22.120 | seems noteworthy.
00:11:23.840 | These worlds then will be born one way or another.
00:11:27.920 | But the question for me is whether a fully imagined simulation is the way,
00:11:32.740 | or instead my bet, which is something more like Isaac Lab from NVIDIA.
00:11:37.060 | Simulated, but also programmable, and so repeatable.
00:11:40.740 | Soon enough, many worlds are about to get crazy,
00:11:43.480 | not just the real one.
00:11:45.240 | Thank you so much for watching to the end.
00:11:47.360 | I look forward to covering GPT-5 with you guys this week, almost certainly.
00:11:52.300 | Have a wonderful day.