
Genie 3: The World Becomes Playable (DeepMind)


Chapters

0:00 Introduction
1:27 Background and Access
4:58 Caveats
7:24 Demo
10:12 Conclusion

Transcript

In the week that we are set to get GPT-5, it might be easy to miss this announcement of Google DeepMind's Genie 3. To cut a long story short, it makes the world playable. Start with an image, which could be one of your photos, and then enter that world and modify it with prompts.

By entering, I mean you can move around, take actions that persist in that world, and basically go wild. I was given early access to the presentation of Genie 3 and got to ask the makers a question, but I'm going to be honest: Genie 3 is designed and marketed to let AI agents act out scenarios and self-improve at taking actions.

That's the theory. For me, and let me know if you agree, it will be used much more for gamifying all of reality and your imagination. If you have been following the channel for just a little bit, you know that I interviewed a senior researcher on Genie 2, Tim Rocktäschel, both here and on my Patreon.

And at the time, we learned that Genie 2 would, quote, scale gracefully with more compute. Well, it did, and now we get real-time interaction at 720p and 24 frames per second. If that's jargon to you, it means you can click some buttons and things happen instantly on screen, at fairly high resolution.

Now, in a couple of minutes, I'm going to show you the full intro, which is about 130 seconds, I think, which is unusual for this channel. I don't normally show clips that long, but it does showcase Genie 3 really quite well. First though, just a few thoughts from me.

Jack Parker-Holder, the lead author of Genie 3, told me and a bunch of journalists that the goal behind it was to have a Move 37 moment for embodied AI, as in for robots, not just for computers that play games. A Move 37 moment is a high bar, as any of you who have watched the AlphaGo documentary know, but think of it as a novel breakthrough that goes beyond human data.

In other words, we just don't have enough data to train robots reliably, given the innumerable scenarios in which they'll be placed. If we can simulate all worlds, then we might get novel breakthroughs for those robots, get them to do things essentially that we couldn't have even trained them to do.

In the presentation, I pushed back, though, with this question: if these worlds suffer from physics inaccuracies, and they do, how could such agents ever be fully reliable? Both lead authors agreed that's a real issue, but then raised something that got me thinking. They said that, yes, while you can't guarantee reliability, you can demonstrate unreliability.

Think about it: if an agent goes off the rails in simulation, then it's also liable to do so in the real world. In a way, then, both of these points still stand. We can't guarantee reliability with simulators like Genie 3, but we can help find unreliability.
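To make that asymmetry concrete, here is a minimal sketch in Python of what hunting for unreliability could look like. This is purely illustrative, not anything Google described: the agent and world interfaces (make_world, agent.act, world.step, violates_safety) are hypothetical placeholders.

```python
# Illustrative sketch only: simulation can falsify reliability,
# but it can never prove it. All interfaces below are hypothetical.

def probe_for_unreliability(agent, make_world, n_rollouts=1000, horizon=500):
    """Run an agent across many generated worlds and collect failures.

    Returns a list of (rollout, step) failure records. An empty list does
    NOT prove the agent is reliable: the simulated physics may be
    inaccurate, and only finitely many worlds were sampled. A non-empty
    list, though, is real evidence the agent can go off the rails.
    """
    failures = []
    for i in range(n_rollouts):
        world = make_world(seed=i)          # a fresh generated environment
        obs = world.reset()
        for t in range(horizon):
            action = agent.act(obs)
            obs = world.step(action)
            if world.violates_safety(obs):  # e.g. a collision, an unsafe state
                failures.append((i, t))
                break                       # move on to the next rollout
    return failures
```

The point is the return contract: the function can only ever report failures found or no failures found, never "reliable", which is exactly the asymmetry the lead authors were pointing at.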

Anyway, what you're probably thinking, and I definitely was, was that we should just be honest with ourselves. Everyone is gonna want to upload a still from their favourite game, life event, celebrity, or what have you, and basically interact with it, jump around, paint a wall, and just get silly.

And even that is probably phrasing things somewhat maturely, which is probably why this is currently still a research preview, meaning you can't get your hands on it. Google were pretty evasive about timing for a general release, not even a hint of a date. However, if that disappoints you, I am old enough to remember that the same "not for general release, safety issues" kind of reasoning was true of Imagen 1, Google's very basic early image generator, deemed not fit for public release.

But as of today, we have Imagen 4 out in public, far improved, and even available via the API so developers can incorporate it into their apps. Translated: Genie 4 might be available for you to play with sooner than you think. Okay, but what about that incredible memory, where you could paint a wall, for example, look around, come back, and the paint is still there?

Let's just take a moment and say, Google, that is pretty impressive, well done. But the memory within these worlds is measured in minutes, not hours. So if you were thinking of making a friend in one of these worlds, building a house together, and living in it to escape the real world and its current self-immolation, that won't quite work.

As it currently stands, by the time you return to the house the next day, it will be completely reimagined. And Google told me of four other caveats. I think they are pretty telling about the future of simulation, so let's go through them. First, while the most common game-style actions, like moving around and jumping, are performable, you can't currently perform complex actions.

Second, and this thought literally just came to me, but it's a bit like a dream: you can't talk to other characters. Maybe that's just me, but in your dreams, do you speak to other people? Definitely not in complex conversations. Anyway, they told me that accurately modeling complex interactions between multiple independent agents is still an ongoing research challenge.

Third, as you might guess, we can't expect accurate representations of real-world locations. The sheer imaginative scope of these worlds is also somewhat their downfall, in that lifelike fidelity is not their priority. That bleeds into the fourth caveat they gave me, which is text rendering: don't expect high-fidelity text.

It can happen if you add it to your prompt; it's just not built into the environment. Now, funnily enough, I think it was a Guardian or New York Times journalist who asked whether this is a replacement for something like Omniverse or Unreal Engine. Google wouldn't say that, but they did say that hard-coding the complexity of the real world is intractable, which is why we might need simulations like the Genie series.

I know quite a few game developers watch the channel, so do chip in with your thoughts on this versus Unreal Engine. And I would add that there's a hybrid approach, which I saw recently in a TED Talk from a guy at Roblox. I forget his name and his role, but the idea was that you could prompt a model to directly code new parts of the environment.

The full six-minute talk is linked in the description. This feels to me like it would be slightly more predictable, perhaps, but maybe less scalable, because the Genie series could scale with billions of hours of video from YouTube; hard-coded assets can't. Here is a rough sketch of what that prompt-to-code loop could look like.
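To be clear, everything in this sketch is hypothetical: llm_generate_code stands in for any code-capable model API, and the engine object for a Roblox- or Unreal-style scripting interface; neither reflects a real product API.

```python
# Hypothetical sketch of the hybrid approach: instead of dreaming pixels
# frame by frame, a language model emits engine code that gets hot-loaded.
# llm_generate_code and the engine interface are illustrative stand-ins.

def extend_world(engine, llm_generate_code, player_prompt):
    """Ask a model for a script that adds new content, then load it."""
    script = llm_generate_code(
        "Write a script against this engine's scripting API that adds: "
        + player_prompt
    )
    # Unlike a pixel-space world model, the output is an inspectable
    # artifact: you can lint it, diff it, and replay it deterministically.
    if engine.validate(script):   # e.g. allow only whitelisted API calls
        engine.hot_load(script)
        return True
    return False                  # a legible failure, not a glitched frame
```

That inspectability is, I'd guess, where the predictability intuition comes from; the trade-off is that you can't train this pipeline on billions of hours of raw video.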

Which approach will win out? I actually don't know, so let me know what you think. Now, enough build-up. There is no paper to go through. I was going to release this video at 3 p.m. when the embargo lifted, but I thought maybe they were going to give us a paper, so I held back. No, there was no paper.

So here is the roughly two-minute demo that I promised, albeit slightly later than advertised. What you're seeing are not games or videos. They're worlds. Each one of these is an interactive environment generated by Genie 3, a new frontier for world models. With Genie 3, you can use natural language to generate a variety of worlds, and explore them interactively, all with a single text prompt.

Let's see what it's like to spend some time in a world. Genie 3 has real-time interactivity, meaning that the environment reacts to your movements and actions. You're not walking through a pre-built simulation. Everything you see here is being generated live as you explore it. And Genie 3 has world memory.

That's why environments like this one stay consistent. World memory even carries over into your actions. For example, when I'm painting on this wall, my actions persist. I can look away and generate other parts of the world. But when I look back, the actions I took are still there. And Genie 3 enables promptable events, so you can add new events into your world on the fly.

Something like another person. Or transportation. Or even something totally unexpected. You can use Genie to explore real-world physics and movement. And all kinds of unique environments. You can generate worlds with distinct geographies, historical settings, fictional environments, and even other characters. We're excited to see how Genie 3 can be used for next-generation gaming and entertainment.

And that's just the beginning. Worlds can help with embodied research, training robotic agents before working in the real world. Or simulating dangerous scenarios for disaster preparedness and emergency training. World models can open new pathways for learning, agriculture, manufacturing, and more. We're excited to see how Genie 3's world simulation can benefit research around the world.

Trying to game out the impact of technologies like Genie on jobs is just too complex for me at the moment. But there are real-world jobs you can apply to via the sponsor of today's video, 80,000 Hours. If you, somewhat helpfully, use my link in the description, you'll go to their job board, which you can see.

And these are all real jobs related to AI. Well, I think the majority relate to AI. But either way, the jobs are sourced from around the world. Now, you could say, why even cover Genie 3? And don't worry, I will be touching on Gemini Deep Think on the main channel, which is also from Google DeepMind, soon enough.

And my early review of that tool is on Patreon. But it just feels inevitable to me that people will initially want their games to be infinitely playable. Think a map size bigger than GTA 7. As expectations continue to rise, they'll want their entertainment to be interactive. Say, prompting Netflix to add their own face into Squid Game US Edition.

And it will just never stop. It will then be in VR in 16K. You'll be able to speak to other agents, or let's just call them bots. The other characters in these simulated worlds will be pretty intelligent. They probably won't just keep walking into walls. You can, like, chat with them about Sophocles.

Some people may even need to watch their step, lest they fall into these infinite worlds. Others will dive in headlong. But the step up in resolution and memory, and the commitment from Google to incorporate this into their march to AGI, seems noteworthy. These worlds then will be born one way or another.

But the question for me is whether a fully imagined simulation is the way, or instead my bet, which is something more like Isaac Lab from NVIDIA. Simulated, but also programmable, and so repeatable. Soon enough, many worlds are about to get crazy, not just the real one. Thank you so much for watching to the end.

I look forward to covering GPT-5 with you guys this week, almost certainly. Have a wonderful day.