The AI 'Genie' is Out + Humanoid Robotics Step Closer
00:00:00.000 |
We've heard of text-to-speech, text-to-video, and text-to-action, but have we slept on 00:00:06.720 |
text-to-interaction? Let's take a look at the new Genie concept from Google DeepMind and set it in 00:00:14.640 |
the context of new developments regarding Sora and Gemini. And we'll hear what Demis Hassabis, 00:00:21.200 |
CEO of Google DeepMind, has to say about Sam Altman's $7 trillion chip ambitions, 00:00:27.280 |
and touch on some recent notorious missteps. But I do want to make a confession up front to all of 00:00:34.080 |
you guys. The entire industry will not be shocked by this video. Everything might not change, 00:00:40.160 |
and the world may well not be stunned by what I have to say. If you're willing to forgive that, 00:00:45.280 |
it should still be an interesting time. So let's get started. The TL;DR of Genie, 00:00:50.640 |
released in the last few days, is this. You can now hand a relatively small AI model an image, 00:00:56.880 |
and it could be any image. A photo you've just taken on your phone, a sketch that your child or 00:01:02.800 |
you just drew, or an image, of course, that you generated using, say, Midjourney or DALL-E 3. And 00:01:08.560 |
Genie, that small model, will take this image and make it interactive. A bit like handing you a 00:01:13.680 |
PlayStation or Xbox controller. You could then make the main character jump, go left, go right, 00:01:19.200 |
and the scene will change around it. Essentially, you've made an image playable. Or in other words, 00:01:24.880 |
you've made imaginary worlds interactive. This is how Google put it: "Genie is capable of 00:01:30.640 |
converting a variety of different prompts into interactive, playable environments. These can 00:01:36.480 |
be easily created, stepped into, and explored." Now, before we get into the meat of the paper, 00:01:42.160 |
I want to let your imagination run wild. Because your mind, of course, went to the same place as 00:01:48.000 |
me, which is: imagine this integrated into Sora. How about controlling the shark or dolphin in 00:01:54.480 |
this paper craft world? Remember that the promise of this paper is that as you move left, right, 00:02:00.000 |
up, down, or make jumping motions, the world is crafted around you. This would be open world 00:02:05.920 |
exploration in its truest sense. Or take this example, again generated by Sora. In the near 00:02:11.440 |
future, we needn't have two separate models either. It could be the same model generating 00:02:16.240 |
the world and allowing you to interact within it. And yes, I do still find it incredible that 00:02:21.040 |
this video was generated by Sora. And the characters you create could take almost any form. 00:02:26.720 |
How about a tortoise made of glass? Or maybe you want to control a translucent jellyfish 00:02:32.000 |
floating through a post-apocalyptic cityscape. Or how about this example? Yes, it's nice to 00:02:37.360 |
watch the video, but imagine controlling it so it would be prompted by an image, say, 00:02:41.600 |
of your hometown. And I can't help but point out the speed with which many of us are now becoming 00:02:46.960 |
accustomed to new announcements and how we're adjusting to them. OpenAI's Sora model has been 00:02:52.160 |
out for just over a week, and here's a paper where we can imagine it being interactive. 00:02:57.120 |
But that's the way things are going. Modalities are multiplying. Models are unifying across text, 00:03:02.880 |
audio, video, action, and interaction. In a moment, I'll touch on how this might affect 00:03:07.680 |
robotics, but here's audio coming to videos generated by Sora. This is thanks to Eleven 00:03:13.600 |
Labs, and you can just feel how sound elevates the experience of video. 00:03:47.920 |
And there's one key detail that I don't want you to miss from the Genie paper. 00:03:51.600 |
The final version of Genie at 11 billion parameters was trained in an unsupervised 00:03:57.040 |
manner from unlabeled internet videos. To simplify, they didn't pair up an image with 00:04:01.680 |
some controller movements or text and tell the model what happened next. There was no 00:04:06.000 |
such human supervision. It was just hundreds of thousands of internet videos. 00:04:10.560 |
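As a toy illustration of that idea, and only a toy (the real Genie pipeline uses a video tokenizer, a VQ-VAE-style latent action model, and a spatiotemporal transformer for dynamics), here is a short numpy sketch of how a discrete "action vocabulary" can be inferred from unlabeled frame transitions with no human-provided labels. The synthetic video and the k-means quantizer below are my own stand-ins, not anything taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for unlabeled video: T frames of a tiny 8x8 grayscale world in which
# a single "character" pixel moves by an unknown action each step.
def toy_video(num_frames=64, size=8):
    frames, x = [], size // 2
    for _ in range(num_frames):
        frame = np.zeros((size, size))
        frame[size // 2, x] = 1.0
        frames.append(frame)
        x = (x + rng.choice([-1, 0, 1])) % size  # the unlabeled "action"
    return np.stack(frames)

video = toy_video()

# "Latent action" = a discrete code describing the change between consecutive
# frames. Here we quantize frame differences with plain k-means, a crude
# stand-in for a VQ bottleneck: the action vocabulary is discovered from the
# data itself, never labeled by a human.
diffs = (video[1:] - video[:-1]).reshape(len(video) - 1, -1)

num_actions = 8
codebook = diffs[rng.choice(len(diffs), num_actions, replace=False)].copy()

for _ in range(20):
    assign = np.argmin(((diffs[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
    for k in range(num_actions):
        if (assign == k).any():
            codebook[k] = diffs[assign == k].mean(axis=0)

print("inferred latent action per transition:", assign[:16])
```

In Genie, codes like these are what the dynamics model is conditioned on, which is why a human can later steer the model by picking one of a small number of discrete actions, even though no controller data was ever seen in training.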
And if you don't find that interesting, well, how about this? The results they got from the 00:04:14.800 |
Genie architecture scale gracefully, they say, with additional computational resources. 00:04:20.400 |
If you want Sora levels of fidelity, rather than the pixelated stuff we got, just scale 00:04:25.840 |
up the compute. Then, as the paper says, we will have generative interactive environments, 00:04:31.120 |
which is a new paradigm whereby interactive environments can be generated from a single 00:04:35.840 |
text or image prompt. At this point, though, before we get carried away, 00:04:39.360 |
I want to inject some realism. Genie was trained on 10 frames per second video clips 00:04:45.440 |
at 160 by 90 resolution. For the website, they scaled up to 360p. But still, 00:04:51.520 |
we are not yet that close to Sora levels of immersion and interaction. And I don't just 00:04:57.360 |
mean that Sora and the Genie interactions hallucinate badly, according to the paper. 00:05:02.800 |
I'm referring to the fact that real time high fidelity generation is still a while away. 00:05:08.160 |
That's just not on my prediction list for this year. And that's despite me saying that 00:05:12.640 |
super realistic text video would happen this year in my January 1st video. 00:05:17.360 |
What's my evidence that latency will slow everything down? Well, according to Bloomberg, 00:05:22.080 |
OpenAI won't say precisely how long Sora takes on each request. But apparently you can definitely 00:05:28.080 |
go grab a snack while you wait for these things to run. So real-time, interactive, 00:05:33.760 |
low-resolution games by the end of this year, yes, and high-resolution, time-limited interactive 00:05:40.000 |
generations by the end of this year, too. But I think we'll have to wait till next year for the 00:05:44.800 |
combination of those two things. Still, I do think it's worth pausing to imagine a scenario 00:05:50.320 |
that we might well get by the end of this year, be it inside Gemini 2 or GPT-5. Imagine either of 00:05:56.480 |
those models creating an intricate short story, say with this cute little robot character as the 00:06:02.000 |
protagonist. And then alongside each chapter of that story, it generates a real-time video that 00:06:06.960 |
you can play about with. You can almost picture it. As the paper says, it would emulate parallax. 00:06:11.760 |
That's when the character in the foreground moves around, but the background stays relatively 00:06:17.120 |
static. The model would have created not only a story, but a playable world. And just to reiterate, 00:06:22.720 |
all we need is a single text prompt or a single image to create that new interactive environment. 00:06:29.120 |
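To make "playable" concrete, here is a minimal Python sketch of the kind of loop being described: prompt with one image, then step the world like a game. Genie has no public API, so everything here, the GenieLikeWorldModel class, its step method, and the action set, is a hypothetical stand-in rather than anything from the paper.

```python
from dataclasses import dataclass
from typing import List

ACTIONS = ["left", "right", "jump", "idle"]  # a small discrete action set, as in Genie

@dataclass
class Frame:
    pixels: bytes  # placeholder for an image buffer

class GenieLikeWorldModel:
    """Imagined interface: prompt with a single image, then step it like a game."""

    def __init__(self, prompt_image: Frame):
        self.history: List[Frame] = [prompt_image]

    def step(self, action: str) -> Frame:
        # A real model would predict the next frame from the frame history plus
        # the chosen (latent) action; here we simply echo the last frame.
        assert action in ACTIONS
        next_frame = self.history[-1]
        self.history.append(next_frame)
        return next_frame

# Usage: turn one image into something you can "play".
world = GenieLikeWorldModel(Frame(pixels=b""))
for action in ["right", "right", "jump"]:
    frame = world.step(action)
```

The point of the sketch is simply the shape of the interface: one image in, then a frame out per action, which is exactly the "generative interactive environment" paradigm the paper describes.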
We've already seen how it can make an AI image playable, but here is that concept applied to 00:06:35.120 |
a human-drawn sketch and finally to some real-world images. But just before we leave the paper, 00:06:43.040 |
I want to touch on just how comfortably within their capabilities Google were when they made this 00:06:48.560 |
11-billion-parameter model. And let's not even talk about Gemini 1.5 Ultra, which is coming soon, 00:06:53.520 |
or Gemini 2. What could they do with a bigger model size or more compute? Well, they could 00:06:58.160 |
train Genie 2 on an even larger proportion of internet videos to simulate realistic and 00:07:04.640 |
imagined environments. At this point, I'll even throw in another prediction. I think by the end 00:07:08.800 |
of this year, you could play a run through of a particular game from start to end, then feed in 00:07:13.920 |
that entire video to say Genie 2 or an open source equivalent. Then if you wait a few minutes, 00:07:19.280 |
you'll essentially get an expansion pack, another level of the game generated by the model, 00:07:24.640 |
one which might have some hallucinations, but in which you can take all the same actions as before. 00:07:29.840 |
Of course, the copyright issues with that will be multifarious, but there are some 00:07:34.960 |
other complications aside from copyright about all of these developments. And no, 00:07:39.280 |
I don't just mean an explosion of cheating in gaming. You can now buy monitors that alert you 00:07:44.800 |
to enemy movements, and you're going to get AI-powered peripherals that ensure you don't miss 00:07:49.200 |
your shots. But frankly, for me, those take away the spirit of any game. But no, I'm more referring 00:07:54.640 |
to the growing unpredictability of the job market, not necessarily job losses, 00:08:00.080 |
but the inability to plan your career. Like this announcement from Tyler Perry isn't exactly about 00:08:05.600 |
job losses. He saw OpenAI's Sora last week and decided not to expand his studio. But those 00:08:12.080 |
"job losses" wouldn't necessarily show up in the statistics, because those jobs never necessarily 00:08:17.360 |
existed. It's just that they won't exist now. Let me know what you think, but I feel like that 00:08:21.520 |
might happen quite a lot. It's not that companies might start to fire everyone. They just might not 00:08:26.000 |
hire as many people as they originally would have done. And it almost goes without saying that that 00:08:30.000 |
doesn't just apply to gaming and entertainment. Samsung, and I raised an eyebrow at this, 00:08:35.200 |
plans to have fully automated chip fabrication plants by 2030. And that article brought to 00:08:42.160 |
mind this one minute video that I think is appropriate to play here. Looking to upskill 00:08:47.840 |
for the future? This new AI can perform all coding jobs in seconds, including blockchain development. 00:08:54.640 |
While this AI is already outperforming accounting firms. 00:08:59.360 |
Meanwhile, the new graphic design AI aims to automate graphic design and could minimize. 00:09:06.320 |
Relax, because in a future where AI does most of the work, there'll be one thing that humans 00:09:13.120 |
will finally get to do all day long: nothing. Before we lose all of our jobs, though, a quick 00:09:22.400 |
plug for the new Discord channel I've got set up for AI Insiders. I've recruited thought leaders 00:09:27.760 |
from 20 professions, from neurosurgeons to professors, cybersecurity experts, marketing CEOs, 00:09:34.640 |
and AI engineers. And new people are joining as thought leaders every week, including a famous 00:09:40.320 |
game designer, hopefully next week. What we're trying to create is a friendly and professional 00:09:45.280 |
environment in which to swap tips and share best practices. Of course, I'd love to see you there, 00:09:50.640 |
but if you join my Patreon, you don't, of course, just get access to the Discord. There are also 00:09:54.960 |
podcasts and interviews. Tomorrow, actually, I'm interviewing the CEO of Perplexity. And last but 00:10:00.560 |
not least, exclusive AI Explained-style videos. This is one that I released four days ago that 00:10:05.600 |
draws upon seven or eight different papers. This is the same week, though, that Demis Hassabis 00:10:10.480 |
gently mocked that $7 trillion figure. He was asked in Wired about Sam Altman trying to raise 00:10:15.920 |
that much money for more AI chips to scale up the compute available. Demis Hassabis, the CEO 00:10:21.520 |
of Google DeepMind, said this: "Was that a misquote? I heard someone say that maybe it was yen or 00:10:27.680 |
something." He was, of course, taking the Mickey, because a yen is worth a lot less than a dollar. He went on 00:10:32.800 |
to point out that, of course, not everything rests on scale. He said you're not going to get new 00:10:37.040 |
capabilities like planning or tool use or agent-like behavior just by scaling existing techniques. 00:10:43.760 |
I've got another video coming on agents, so that discussion will have to wait for another day. But 00:10:48.960 |
what might be coming sooner than that video is a video on AI in robotics. Four days ago, 00:10:54.800 |
a researcher at Google DeepMind said this: "There will be three to four massive news events coming 00:10:59.840 |
out in the next weeks that will rock the robotics plus AI space. Adjust your timelines. It will be 00:11:05.680 |
a crazy 2024." Now, I would guess that Genie counts as one of those three to four announcements. He 00:11:11.680 |
can't have been referring to Gemma, the open model from Google DeepMind, because that was released 00:11:16.480 |
the day before. The most interesting part of the Gemma paper and release for me was the sheer scale 00:11:22.560 |
of data that they used. For those of you who've been following the scene for a while and remember 00:11:27.280 |
the Chinchilla paper, that was back in 2022 when it was discovered that for a given compute budget, 00:11:33.200 |
the optimum number of tokens to train on, text tokens we're talking about, was roughly 20 times 00:11:39.120 |
the number of parameters. But for Gemma, which was seven or eight billion parameters, 00:11:44.240 |
they trained on around six trillion tokens of text. That's roughly a thousand tokens of text for every 00:11:50.720 |
parameter. Or in other words, when you've got the kind of compute that Google has, 00:11:54.720 |
you don't necessarily have to follow the compute-optimal strategy. 00:12:00.000 |
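To put rough numbers on that, using only the figures just mentioned (Chinchilla's roughly 20 tokens per parameter and around 6 trillion tokens for the roughly 7-billion-parameter Gemma), here is the back-of-the-envelope arithmetic as a quick Python check:

```python
# Back-of-the-envelope check of the publicly stated figures mentioned above.
gemma_params = 7e9
gemma_tokens = 6e12

chinchilla_optimal_tokens = 20 * gemma_params        # ~1.4e11, i.e. ~140 billion tokens
tokens_per_param = gemma_tokens / gemma_params       # ~857, "roughly a thousand"
overshoot = gemma_tokens / chinchilla_optimal_tokens # ~43x the compute-optimal token count

print(f"{tokens_per_param:.0f} tokens per parameter, "
      f"{overshoot:.0f}x the Chinchilla-optimal token count")
```

In other words, Gemma was trained on something like forty times more data than the Chinchilla heuristic would call optimal for its size, which is exactly the point: with enough compute, you can deliberately "overtrain" a small model so it performs better at inference time.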
But I am prevaricating. What news do I think Ted, that DeepMind researcher, is referring to? Well, here's my best guess. And no, it's not based on any 00:12:05.120 |
insider knowledge. I think Google is going to announce another embodied model like RT-2, 00:12:10.720 |
but powered by Gemini. Now, I've covered RT-2 in previous videos back in October and indeed 00:12:16.560 |
interviewed the tech lead for RT-2-X for AI Insiders. But those models, in a nutshell, fuse robotics data 00:12:23.600 |
with transfer learning from text and web data. In other words, they got better at robotics through 00:12:29.120 |
having an LLM at their core, or you might say an MMM, a multimodal model. But in the case of RT-2, 00:12:35.920 |
that was PaLM-E at 12 billion parameters, or PaLI-X at 55 billion parameters. Imagine RT-3 00:12:42.960 |
powered by Gemini at say one trillion parameters. It might understand the world around it to 00:12:48.720 |
unprecedented degrees of depth and intelligence. And if it's powered by Gemini 1.5, it might be 00:12:54.480 |
able to remember that world for months and months and months. 00:12:59.520 |
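For intuition, here is a heavily hedged Python sketch of the vision-language-action idea behind models like RT-2: the model emits discretized action tokens the same way it emits text tokens. The class and method names (VisionLanguageActionModel, predict_action_tokens) and the toy detokenizer are invented for illustration; they are not a real RT-2 API.

```python
from typing import List
import random

NUM_BINS = 256  # RT-2-style setups discretize each action dimension into bins

class VisionLanguageActionModel:
    """Stand-in for a multimodal model with an LLM/VLM at its core."""

    def predict_action_tokens(self, image: bytes, instruction: str) -> List[int]:
        # A real model would condition on the image and the instruction; here we
        # return random bin indices for a 7-DoF arm (x, y, z, roll, pitch, yaw,
        # gripper) purely to show the shape of the output.
        return [random.randrange(NUM_BINS) for _ in range(7)]

def detokenize(tokens: List[int]) -> List[float]:
    # Map each bin index back to a continuous command in [-1, 1].
    return [2 * t / (NUM_BINS - 1) - 1 for t in tokens]

model = VisionLanguageActionModel()
tokens = model.predict_action_tokens(image=b"", instruction="pick up the apple")
print(detokenize(tokens))  # the continuous action a robot controller would execute
```

The design point is that because actions are "just more tokens", everything the underlying language model learned from web-scale text and images transfers to choosing those tokens, which is why swapping in a bigger backbone like Gemini is such an enticing prospect.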
Indeed, I was so inspired by the RT series that I made it the unofficial logo of AI Insiders. Valuations for humanoid, AI-powered 00:13:05.680 |
robotics startups are starting to get pretty wild too. The CEO of NVIDIA, Jensen Huang, said that 00:13:11.440 |
his equivalent of the Transformer paper in the near future is foundational robotics. He said, 00:13:16.880 |
"If you could generate text, if you could generate images, can you also generate motion?" He said, 00:13:21.440 |
"The answer is probably yes. And like we've seen, we can also generate interaction." He went on, 00:13:26.240 |
"Humanoid robotics should be just around the corner." I think he was referring to NVIDIA's 00:13:31.280 |
GEAR, Generalist Embodied Agent Research. That's led by none other than Jim Fan. And he said, 00:13:37.360 |
"2024 is the year of robotics and the year of gaming AI." Now this is Tesla's Optimus robot, 00:13:43.360 |
but I do wonder if the ChatGPT or Sora moment for robotics will be when a humanoid robot walks 00:13:50.160 |
with the fluidity of a human. That will just seem so wild when it happens. Just imagine a humanoid 00:13:56.880 |
walking up to you with human-like swagger and shaking your hand, all while remembering a 00:14:01.840 |
conversation you had with it, say, a year ago. Now, I don't think I can end this video without 00:14:06.640 |
touching on some of the recent controversies that Google has faced. I just think their models were 00:14:11.360 |
fairly clearly not given the kind of testing that they obviously required. And I'm not just 00:14:16.080 |
referring to how Gemini seems to be phobic of the word white. There's also evidence of false 00:14:21.360 |
refusals to questions that have a pretty obvious answer. And my take on this is going to try to 00:14:26.560 |
move beyond just the obvious one. I think these examples show that Google is genuinely rattled by 00:14:32.240 |
OpenAI, Microsoft, and players like Perplexity. And so they're cutting corners on the testing of 00:14:37.840 |
their models. After all, if you're six months behind OpenAI, what's one way to catch up? Just 00:14:42.800 |
cut out six months of human feedback for your models. I hope this isn't what Google did or 00:14:48.160 |
plans to do in the future, but that's how it seems to me. So let me know what you think of all of 00:14:52.960 |
this and whether we are indeed entering a new era of action and interaction. Thank you so much for 00:15:00.240 |
watching all the way to the end and have a wonderful day.