
The AI 'Genie' is Out + Humanoid Robotics Step Closer


Transcript

We've heard of text-to-speech, text-to-video, and text-to-action, but have we slept on text-to-interaction? Let's take a look at the new Genie concept from Google DeepMind and set it in the context of new developments regarding Sora and Gemini. And we'll hear what Demis Hassabis, CEO of Google DeepMind, has to say about Sam Altman's $7 trillion chip ambitions, and touch on some recent notorious missteps.

But I do want to make a confession up front to all of you guys. The entire industry will not be shocked by this video. Everything might not change, and the world may well not be stunned by what I have to say. If you're willing to forgive that, it should still be an interesting time.

So let's get started. The TL;DR of Genie, released in the last few days, is this. You can now hand a relatively small AI model an image, and it could be any image: a photo you've just taken on your phone, a sketch that your child or you just drew, or an image, of course, that you generated using, say, Midjourney or DALL-E 3.

And Genie, that small model, will take this image and make it interactive. A bit like handing you a PlayStation or Xbox controller. You could then make the main character jump, go left, go right, and the scene will change around it. Essentially, you've made an image playable. Or in other words, you've made imaginary worlds interactive.
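
If it helps to picture it, here's a tiny, purely hypothetical sketch of that interaction loop in code. The GenieLikeModel wrapper and its action names are my own stand-ins, not anything Google has released; a real model would replace the stub with a learned video predictor.

```python
# A minimal sketch, assuming a hypothetical GenieLikeModel wrapper; this is not
# Google's API, just the loop the paper implies: (frame, action) -> next frame.
import numpy as np

class GenieLikeModel:
    """Hypothetical stand-in for a Genie-style world model."""
    ACTIONS = ("left", "right", "jump", "noop")  # small discrete action space

    def step(self, frame: np.ndarray, action: str) -> np.ndarray:
        # A real model would run a learned video predictor here; this stub
        # just echoes the frame back so the loop is runnable.
        assert action in self.ACTIONS
        return frame

model = GenieLikeModel()
frame = np.zeros((90, 160, 3), dtype=np.uint8)  # any starting image would do
for action in ("right", "right", "jump", "left"):
    frame = model.step(frame, action)  # the "playable image" loop
```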

This is how Google put it: "Genie is capable of converting a variety of different prompts into interactive, playable environments. These can be easily created, stepped into, and explored." Now, before we get into the meat of the paper, I want to let your imagination run wild. Because your mind, of course, went to the same place as mine, which is imagining this integrated into Sora.

How about controlling the shark or dolphin in this papercraft world? Remember that the promise of this paper is that as you move left, right, up, down, or make jumping motions, the world is crafted around you. This would be open-world exploration in its truest sense. Or take this example, again generated by Sora.

In the near future, we needn't have two separate models either. It could be the same model generating the world and allowing you to interact within it. And yes, I do still find it incredible that this video was generated by Sora. And the characters you create could take almost any form.

How about a tortoise made of glass? Or maybe you want to control a translucent jellyfish floating through a post-apocalyptic cityscape. Or how about this example? Yes, it's nice to watch the video, but imagine controlling it, prompted, say, by an image of your hometown. And I can't help but point out the speed with which many of us are now becoming accustomed to new announcements and how we're adjusting to them.

OpenAI's Sora model has been out for just over a week, and here's a paper where we can imagine it being interactive. But that's the way things are going. Modalities are multiplying. Models are unifying across text, audio, video, action, and interaction. In a moment, I'll touch on how this might affect robotics, but here's audio coming to videos generated by Sora.

This is thanks to ElevenLabs, and you can just feel how sound elevates the experience of video. All of this 30-second clip is AI-generated. And there's one key detail that I don't want you to miss from the Genie paper. The final version of Genie, at 11 billion parameters, was trained in an unsupervised manner from unlabeled internet videos.

To simplify, they didn't pair up an image with some controller movements or text and tell the model what happened next. There was no such human supervision. It was just hundreds of thousands of internet videos. And if you don't find that interesting, well, how about this? The results they got from the Genie architecture scale gracefully, they say, with additional computational resources.
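
Before we get to what that scaling might buy, here's a toy sketch of the unsupervised idea, very much my own simplification rather than the paper's actual architecture: infer a small discrete latent action between consecutive frames, then learn to predict the next frame from the current frame and that latent action.

```python
# Toy sketch only, not the Genie architecture: infer a discrete "latent action"
# between consecutive frames with no labels, then predict the next frame from
# (current frame, latent action).
import numpy as np

N_LATENT_ACTIONS = 8  # a small code space a controller could later map onto

def infer_latent_action(frame_t: np.ndarray, frame_t1: np.ndarray) -> int:
    # Stand-in for the learned latent action model: bucket the average pixel
    # change into one of N codes. The real model learns its codes end to end.
    delta = float(np.mean(frame_t1.astype(float) - frame_t.astype(float)))
    return int(abs(delta)) % N_LATENT_ACTIONS

def predict_next_frame(frame_t: np.ndarray, latent_action: int) -> np.ndarray:
    # Stand-in for the learned dynamics model.
    return frame_t

# The "training data" is just consecutive frame pairs from ordinary videos.
video = [np.random.randint(0, 256, (90, 160, 3), dtype=np.uint8) for _ in range(5)]
for f_t, f_t1 in zip(video, video[1:]):
    a = infer_latent_action(f_t, f_t1)
    prediction = predict_next_frame(f_t, a)  # training would minimise error vs f_t1
```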

If you want Sora levels of fidelity, rather than the pixelated stuff we got, just scale up the compute. Then, as the paper says, we will have generative interactive environments, which is a new paradigm whereby interactive environments can be generated from a single text or image prompt. At this point, though, before we get carried away, I want to inject some realism.

Genie was trained on video clips at 10 frames per second and 160 by 90 resolution. For the website, they scaled up to 360p. But still, we are not yet that close to Sora levels of immersion and interaction. And I don't just mean that Sora and the Genie interactions hallucinate badly, according to the paper.

I'm referring to the fact that real-time, high-fidelity generation is still a while away. That's just not on my prediction list for this year. And that's despite me saying, in my January 1st video, that super-realistic text-to-video would happen this year. What's my evidence that latency will slow everything down?

Well, according to Bloomberg, OpenAI won't say precisely how long Sora takes on each request, but apparently you can definitely go grab a snack while you wait for these things to run. So real-time, interactive, low-resolution games by the end of this year, yes, and high-resolution, time-limited interactive generations by the end of this year too.

But I think we'll have to wait till next year for the combination of those two things. Still, I do think it's worth pausing to imagine a scenario that we might well get by the end of this year, be it inside Gemini 2 or GPT-5. Imagine either of those models creating an intricate short story, say with this cute little robot character as the protagonist.

And then alongside each chapter of that story, it generates a real-time video that you can play about with. You can almost picture it. As the paper says, it would emulate parallax: that's when the character and the foreground move around while the background stays relatively static. The model would have created not only a story, but a playable world.

And just to reiterate, all we need is a single text prompt or a single image to create that new interactive environment. We've already seen how it can make an AI image playable, but here is that concept applied to a hand-drawn design sketch and finally to some real-world images.

But just before we leave the paper, I want to touch on just how well within their capabilities Google were when they made this 11-billion-parameter model. And let's not even talk about Gemini 1.5 Ultra, which is coming soon, or Gemini 2. What could they do with a bigger model size or more compute?

Well, they could train Genie 2 on an even larger proportion of internet videos to simulate realistic and imagined environments. At this point, I'll even throw in another prediction. I think by the end of this year, you could play a run-through of a particular game from start to finish, then feed that entire video into, say, Genie 2 or an open-source equivalent.

Then if you wait a few minutes, you'll essentially get an expansion pack: another level of the game generated by the model, one which might have some hallucinations, but in which you can take all the same actions as before. Of course, the copyright issues with that will be multifarious, but there are some other complications with all of these developments aside from copyright.

And no, I don't just mean an explosion of cheating in gaming. You can now buy monitors that alert you to enemy movements, and you're going to get AI-powered peripherals that ensure you don't miss your shots. Frankly, for me, those take away the spirit of any game. But no, I'm more referring to the growing unpredictability of the job market: not necessarily job losses, but the inability to plan your career.

Take this announcement from Tyler Perry: it isn't exactly about job losses. He saw OpenAI's Sora last week and decided not to expand his studio. But those "job losses" wouldn't necessarily show up in the statistics, because those jobs never necessarily existed. It's just that they won't exist now. Let me know what you think, but I feel like that might happen quite a lot.

It's not that companies might start to fire everyone. They just might not hire as many people as they originally would have done. And it almost goes without saying that that doesn't just apply to gaming and entertainment. Samsung, and I raised an eyebrow at this, plans to have fully automated chip fabrication plants by 2030.

And that article brought to mind this one-minute video that I think is appropriate to play here. Looking to upskill for the future? This new AI can perform all coding jobs in seconds, including blockchain development. While this AI is already outperforming accounting firms. Meanwhile, the new graphic design AI aims to automate graphic design and could minimize...

Relax, because in a future where AI does most of the work, there'll be one thing that humans will finally get to do all day long: nothing. Before we lose all of our jobs, though, a quick plug for the new Discord I've got set up for AI Insiders. I've recruited thought leaders from 20 professions, from neurosurgeons to professors, cybersecurity experts, marketing CEOs, and AI engineers.

And new people are joining as thought leaders every week, including a famous game designer, hopefully next week. What we're trying to create is a friendly and professional environment in which to swap tips and share best practices. I'd love to see you there, but if you join my Patreon, you don't, of course, just get access to the Discord.

There are also podcasts and interviews; tomorrow, actually, I'm interviewing the CEO of Perplexity. And last but not least, there are exclusive AI Explained-style videos. This is one that I released four days ago that draws upon seven or eight different papers. This is the same week, though, that Demis Hassabis gently mocked that $7 trillion figure.

He was asked in Wired about Sam Altman trying to raise that much money for more AI chips to scale up the compute available. Demis Hassabis, the CEO of Google DeepMind, said this: "Was that a misquote? I heard someone say that maybe it was yen or something." He was, of course, taking the mickey, because the yen is worth a lot less than the dollar.

He went on to point out that, of course, not everything rests on scale. He said you're not going to get new capabilities like planning, tool use, or agent-like behavior just by scaling existing techniques. I've got another video coming on agents, so that discussion will have to wait for another day.

But what might be coming sooner than that video is a video on AI in robotics. Four days ago, Ted Xiao, a researcher at Google DeepMind, said this: there will be three to four massive news events coming out in the next weeks that will rock the robotics plus AI space. Adjust your timelines.

It will be a crazy 2024. Now, I would guess that Genie counts as one of those three to four announcements. He can't have been referring to Gemma, the open model from Google DeepMind, because that was released the day before. The most interesting part of the Gemma paper and release for me was the sheer scale of data that they used.

For those of you who've been following the scene for a while and remember the Chinchilla paper, that was back in 2022, when it was discovered that, for a given compute budget, the optimal number of text tokens to train on was roughly 20 times the number of parameters.

But for Gemma, which was seven or eight billion parameters, they trained on around six trillion tokens of text. That's nearly a thousand tokens of text for every parameter. Or in other words, when you've got the kind of compute that Google has, you don't necessarily have to follow the compute-optimal strategy.
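
To put rough numbers on that, here's a quick back-of-envelope calculation using the figures above; the counts are approximate.

```python
# Chinchilla rule of thumb (~20 training tokens per parameter) versus what
# Gemma reportedly used. Figures are the approximate ones mentioned above.
params = 7e9                     # Gemma 7B
chinchilla_tokens = 20 * params  # ~140 billion tokens would be "compute optimal"
gemma_tokens = 6e12              # ~6 trillion tokens reportedly used

print(f"Chinchilla-optimal: {chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Gemma reportedly:   {gemma_tokens / 1e12:.0f}T tokens")
print(f"Tokens per parameter: {gemma_tokens / params:.0f}")  # roughly 850-1000
```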

But I digress. What news do I think Ted is referring to? Well, here's my best guess, and no, it's not based on any insider knowledge. I think Google is going to announce another embodied model like RT-2, but powered by Gemini. Now, I've covered RT-2 in previous videos back in October and indeed interviewed the tech lead for RT-2-X for AI Insiders.

But those models, in a nutshell, fuse robotics data with transfer learning from text and web data. In other words, they got better at robotics by having an LLM at their core, or you might say an MMM, a multimodal model. In the case of RT-2, that was PaLM-E at 12 billion parameters or PaLI-X at 55 billion parameters.
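
To make that "LLM at the core" idea a bit more concrete, here's a rough sketch of the trick as I understand it: robot actions get discretised and written out as token strings, so one vision-language model can be trained on web data and robot data at the same time. The bin count and string format below are my own illustrative assumptions, not the exact RT-2 recipe.

```python
# Illustrative sketch of "actions as text tokens"; bins and format are assumptions.
import numpy as np

def action_to_tokens(delta_xyz, gripper_open, n_bins=256):
    """Discretise a continuous end-effector action into integer bins and format
    it as a string the language model can emit like any other text."""
    scaled = (np.asarray(delta_xyz, dtype=float) + 1.0) / 2.0 * (n_bins - 1)
    bins = np.clip(scaled.astype(int), 0, n_bins - 1)
    return " ".join(str(b) for b in bins.tolist() + [int(gripper_open)])

# Training then mixes web examples (image + question -> answer text) with robot
# examples (camera image + instruction -> an action string like the one below).
print(action_to_tokens([0.1, -0.2, 0.0], gripper_open=1))  # e.g. "140 102 127 1"
```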

Imagine an RT-3 powered by Gemini at, say, one trillion parameters. It might understand the world around it to unprecedented degrees of depth and intelligence. And if it's powered by Gemini 1.5, it might be able to remember that world for months and months and months. Indeed, I was so inspired by the RT series that I made it the unofficial logo of AI Insiders.

Valuations for humanoid, AI-powered robotics startups are starting to get pretty wild too. The CEO of NVIDIA, Jensen Huang, said that his equivalent of the Transformer paper in the near future is foundational robotics. He said, "If you could generate text, if you could generate images, can you also generate motion? The answer is probably yes."

And, like we've seen, we can also generate interaction. He went on: "Humanoid robotics should be just around the corner." I think he was referring to NVIDIA's GEAR, Generalist Embodied Agent Research, led by none other than Jim Fan. And he said, "2024 is the year of robotics and the year of gaming AI." Now this is Tesla's Optimus robot, but I do wonder if the ChatGPT or Sora moment for robotics will be when a humanoid robot walks with the fluidity of a human.

That will just seem so wild when it happens. Just imagine a humanoid walking up to you with human-like swagger and shaking your hand, all while remembering a conversation you had with it, say, a year ago. Now, I don't think I can end this video without touching on some of the recent controversies that Google has faced.

I just think these models were fairly clearly not given the kind of testing that they obviously required. And I'm not just referring to how Gemini seems to be phobic of the word "white". There's also evidence of false refusals to questions that have a pretty obvious answer. And my take on this is going to try to move beyond just the obvious one.

I think these examples show that Google is genuinely rattled by OpenAI, Microsoft, and players like Perplexity. And so they're cutting corners on the testing of their models. After all, if you're six months behind OpenAI, what's one way to catch up? Just cut out six months of human feedback for your models.

I hope this isn't what Google did or plans to do in the future, but it seems like that to me. So let me know what you think of all of this and whether we are indeed entering a new era of action and interaction. Thank you so much for watching all the way to the end and have a wonderful day.