
The AI 'Genie' is Out + Humanoid Robotics Step Closer



00:00:00.000 | We've heard of text-to-speech, text-to-video, and text-to-action, but have we slept on
00:00:06.720 | text-to-interaction? Let's take a look at the new Genie concept from Google DeepMind and set it in
00:00:14.640 | the context of new developments regarding Sora and Gemini. And we'll hear what Demis Hassabis,
00:00:21.200 | CEO of Google DeepMind, has to say about Sam Altman's $7 trillion chip ambitions,
00:00:27.280 | and touch on some recent notorious missteps. But I do want to make a confession up front to all of
00:00:34.080 | you guys. The entire industry will not be shocked by this video. Everything might not change,
00:00:40.160 | and the world may well not be stunned by what I have to say. If you're willing to forgive that,
00:00:45.280 | it should still be an interesting time. So let's get started. The TL;DR of Genie,
00:00:50.640 | released in the last few days, is this. You can now hand a relatively small AI model an image,
00:00:56.880 | and it could be any image. A photo you've just taken on your phone, a sketch that your child or
00:01:02.800 | you just drew, or an image, of course, that you generated using, say, Midjourney or DALL-E 3. And
00:01:08.560 | Genie, that small model, will take this image and make it interactive. A bit like handing you a
00:01:13.680 | PlayStation or Xbox controller. You could then make the main character jump, go left, go right,
00:01:19.200 | and the scene will change around it. Essentially, you've made an image playable. Or in other words,
00:01:24.880 | you've made imaginary worlds interactive. This is how Google put it. Genie is capable of
00:01:30.640 | converting a variety of different prompts into interactive, playable environments. These can
00:01:36.480 | be easily created, stepped into, and explored. Now, before we get into the meat of the paper,
00:01:42.160 | I want to let your imagination run wild. Because your mind, of course, went to the same place as
00:01:48.000 | me, which is: imagine this integrated into Sora. How about controlling the shark or dolphin in
00:01:54.480 | this paper craft world? Remember that the promise of this paper is that as you move left, right,
00:02:00.000 | up, down, or make jumping motions, the world is crafted around you. This would be open world
00:02:05.920 | exploration in its truest sense. Or take this example, again generated by Sora. In the near
00:02:11.440 | future, we needn't have two separate models either. It could be the same model generating
00:02:16.240 | the world and allowing you to interact within it. And yes, I do still find it incredible that
00:02:21.040 | this video was generated by Sora. And the characters you create could take almost any form.
00:02:26.720 | How about a tortoise made of glass? Or maybe you want to control a translucent jellyfish
00:02:32.000 | floating through a post-apocalyptic cityscape. Or how about this example? Yes, it's nice to
00:02:37.360 | watch the video, but imagine controlling it so it would be prompted by an image, say,
00:02:41.600 | of your hometown. And I can't help but point out the speed with which many of us are now becoming
00:02:46.960 | accustomed to new announcements and how we're adjusting to them. OpenAI's Sora model has been
00:02:52.160 | out for just over a week, and here's a paper where we can imagine it being interactive.
00:02:57.120 | But that's the way things are going. Modalities are multiplying. Models are unifying across text,
00:03:02.880 | audio, video, action, and interaction. In a moment, I'll touch on how this might affect
00:03:07.680 | robotics, but here's audio coming to videos generated by Sora. This is thanks to Eleven
00:03:13.600 | Labs, and you can just feel how sound elevates the experience of video.
00:03:18.480 | All of this 30-second clip is AI-generated.
00:03:21.760 | [VIDEO]
00:03:47.920 | And there's one key detail that I don't want you to miss from the Genie paper.
00:03:51.600 | The final version of Genie at 11 billion parameters was trained in an unsupervised
00:03:57.040 | manner from unlabeled internet videos. To simplify, they didn't pair up an image with
00:04:01.680 | some controller movements or text and tell the model what happened next. There was no
00:04:06.000 | such human supervision. It was just hundreds of thousands of internet videos.
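To give a feel for how "actions" can emerge from raw video with no labels at all, here is a minimal, hypothetical sketch of the latent action idea in PyTorch. This is not the actual Genie architecture (the paper describes a video tokenizer plus a VQ-VAE-style latent action model and a separate dynamics model); the module names, sizes, and the Gumbel-softmax stand-in below are my own simplifications.

```python
# Toy sketch: learn a tiny discrete "action" vocabulary purely from pairs of
# consecutive frames, with no controller labels. Names and sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_LATENT_ACTIONS = 8  # the paper reports a very small discrete action space

class LatentActionSketch(nn.Module):
    def __init__(self, frame_dim=512):
        super().__init__()
        # Guesses which "action" links frame t to frame t+1.
        self.action_encoder = nn.Sequential(
            nn.Linear(frame_dim * 2, 256), nn.ReLU(),
            nn.Linear(256, NUM_LATENT_ACTIONS),
        )
        # Predicts frame t+1 from frame t plus the inferred discrete action.
        self.dynamics = nn.Sequential(
            nn.Linear(frame_dim + NUM_LATENT_ACTIONS, 512), nn.ReLU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, frame_t, frame_t1):
        logits = self.action_encoder(torch.cat([frame_t, frame_t1], dim=-1))
        # Hard one-hot choice of a single action (a stand-in for vector quantization).
        action = F.gumbel_softmax(logits, hard=True)
        return self.dynamics(torch.cat([frame_t, action], dim=-1)), action

# The only training signal is next-frame prediction; the discrete bottleneck is
# what pushes the codes to behave like controller inputs (left, right, jump, ...).
model = LatentActionSketch()
f_t, f_t1 = torch.randn(4, 512), torch.randn(4, 512)
pred_t1, action = model(f_t, f_t1)
loss = F.mse_loss(pred_t1, f_t1)
```

At inference time, a user-supplied action index takes the place of the inferred one, which is what makes a single image "playable".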
00:04:10.560 | And if you don't find that interesting, well, how about this? The results they got from the
00:04:14.800 | Genie architecture scale gracefully, they say, with additional computational resources.
00:04:20.400 | If you want Sora levels of fidelity, rather than the pixelated stuff we got, just scale
00:04:25.840 | up the compute. Then, as the paper says, we will have generative interactive environments,
00:04:31.120 | which is a new paradigm whereby interactive environments can be generated from a single
00:04:35.840 | text or image prompt. At this point, though, before we get carried away,
00:04:39.360 | I want to inject some realism. Genie was trained on 10 frames per second video clips
00:04:45.440 | at 160 by 90 resolution. For the website, they scaled up to 360p. But still,
00:04:51.520 | we are not yet that close to Sora levels of immersion and interaction. And I don't just
00:04:57.360 | mean that Sora and the Genie interactions hallucinate badly, according to the paper.
00:05:02.800 | I'm referring to the fact that real time high fidelity generation is still a while away.
00:05:08.160 | That's just not on my prediction list for this year. And that's despite me saying that
00:05:12.640 | super-realistic text-to-video would happen this year in my January 1st video.
00:05:17.360 | What's my evidence that latency will slow everything down? Well, according to Bloomberg,
00:05:22.080 | OpenAI won't say precisely how long Sora takes on each request. But apparently you can definitely
00:05:28.080 | go grab a snack while you wait for these things to run. So real-time, interactive,
00:05:33.760 | low-resolution games by the end of this year, yes, and high-resolution, time-limited interactive
00:05:40.000 | generations by the end of this year. But I think we'll have to wait till next year for the
00:05:44.800 | combination of those two things. Still, I do think it's worth pausing to imagine a scenario
00:05:50.320 | that we might well get by the end of this year, be it inside Gemini 2 or GPT-5. Imagine either of
00:05:56.480 | those models creating an intricate short story, say with this cute little robot character as the
00:06:02.000 | protagonist. And then alongside each chapter of that story, it generates a real time video that
00:06:06.960 | you can play about with. You can almost picture it; as the paper says, it would emulate parallax.
00:06:11.760 | That's when the character and the foreground move around, but the background stays relatively
00:06:17.120 | static. The model would have created not only a story, but a playable world. And just to reiterate,
00:06:22.720 | all we need is a single text prompt or a single image to create that new interactive environment.
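As a quick aside on that parallax point, here is a toy sketch of the effect in Python; the layer names and depth factors are invented for illustration and have nothing to do with the paper.

```python
# Parallax in one function: layers closer to the camera shift more than distant ones.
def parallax_offsets(camera_x, layer_factors):
    """Scale the camera's horizontal movement by each layer's depth factor.
    A factor of 1.0 moves with the foreground; 0.1 is an almost-static backdrop."""
    return {name: camera_x * factor for name, factor in layer_factors.items()}

layers = {"character": 1.0, "trees": 0.5, "mountains": 0.1}
print(parallax_offsets(camera_x=100, layer_factors=layers))
# {'character': 100.0, 'trees': 50.0, 'mountains': 10.0}
```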
00:06:29.120 | We've already seen how it can make an AI image playable, but here is that concept applied to
00:06:35.120 | a human-drawn sketch and finally to some real-world images. But just before we leave the paper,
00:06:43.040 | I want to touch on just how well within their capabilities Google were when they made this
00:06:48.560 | 11 billion parameter model. And let's not even talk about Gemini 1.5 Ultra, which is coming soon,
00:06:53.520 | or Gemini 2. What could they do with a bigger model size or more compute? Well, they could
00:06:58.160 | train Genie 2 on an even larger proportion of internet videos to simulate realistic and
00:07:04.640 | imagined environments. At this point, I'll even throw in another prediction. I think by the end
00:07:08.800 | of this year, you could play a run-through of a particular game from start to finish, then feed in
00:07:13.920 | that entire video to, say, Genie 2 or an open-source equivalent. Then if you wait a few minutes,
00:07:19.280 | you'll essentially get an expansion pack, another level of the game generated by the model,
00:07:24.640 | one which might have some hallucinations, but in which you can take all the same actions as before.
00:07:29.840 | Of course, the copyright issues with that will be multifarious, but there are some
00:07:34.960 | other complications aside from copyright about all of these developments. And no,
00:07:39.280 | I don't just mean an explosion of cheating in gaming. You can now buy monitors that alert you
00:07:44.800 | to enemy movements, and you're going to get AI-powered peripherals that ensure you don't miss
00:07:49.200 | your shots. But frankly, for me, those take away the spirit of any game. But no, I'm more referring
00:07:54.640 | to the growing unpredictability of the job market, not necessarily job losses,
00:08:00.080 | but the inability to plan your career. Like this announcement from Tyler Perry isn't exactly about
00:08:05.600 | job losses. He saw OpenAI's Sora last week and decided not to expand his studio. But those
00:08:12.080 | "job losses" wouldn't necessarily show up in the statistics because those jobs never necessarily
00:08:17.360 | existed. It's just that they won't exist now. Let me know what you think, but I feel like that
00:08:21.520 | might happen quite a lot. It's not that companies might start to fire everyone. They just might not
00:08:26.000 | hire as many people as they originally would have done. And it almost goes without saying that that
00:08:30.000 | doesn't just apply to gaming and entertainment. Samsung, and I raised an eyebrow at this,
00:08:35.200 | plans to have fully automated chip fabrication plants by 2030. And that article brought to
00:08:42.160 | mind this one-minute video that I think is appropriate to play here. Looking to upskill
00:08:47.840 | for the future? This new AI can perform all coding jobs in seconds, including blockchain development.
00:08:54.640 | While this AI is already outperforming accounting firms.
00:08:59.360 | Meanwhile, the new graphic design AI aims to automate graphic design and could minimize.
00:09:06.320 | Relax, because in a future where AI does most of the work, there'll be one thing that humans
00:09:13.120 | will finally get to do all day long: nothing. Before we lose all of our jobs, though, a quick
00:09:22.400 | plug for the new Discord channel I've got set up for AI Insiders. I've recruited thought leaders
00:09:27.760 | from 20 professions, from neurosurgeons to professors, cybersecurity experts, marketing CEOs,
00:09:34.640 | and AI engineers. And new people are joining as thought leaders every week, including a famous
00:09:40.320 | game designer, hopefully next week. What we're trying to create is a friendly and professional
00:09:45.280 | environment in which to swap tips and share best practices. Of course, I'd love to see you there,
00:09:50.640 | but if you join my Patreon, you don't, of course, just get access to the Discord. There are also
00:09:54.960 | podcasts and interviews. Tomorrow, actually, I'm interviewing the CEO of Perplexity. And last but
00:10:00.560 | not least, there are exclusive AI Explained-style videos. This is one that I released four days ago that
00:10:05.600 | draws upon seven or eight different papers. This is the same week, though, that Demis Hassabis
00:10:10.480 | gently mocked that $7 trillion figure. He was asked in Wired about Sam Altman trying to raise
00:10:15.920 | that much money for more AI chips to scale up the compute available. Demis Hassabis, the CEO
00:10:21.520 | of Google DeepMind, said this: "Was that a misquote? I heard someone say that maybe it was yen or
00:10:27.680 | something." He was, of course, taking the mickey, because a yen is worth a lot less than a dollar. He went on
00:10:32.800 | to point out that, of course, not everything rests on scale. He said you're not going to get new
00:10:37.040 | capabilities like planning or tool use or agent-like behavior just by scaling existing techniques.
00:10:43.760 | I've got another video coming on agents, so that discussion will have to wait for another day. But
00:10:48.960 | what might be coming sooner than that video is a video on AI in robotics. Four days ago,
00:10:54.800 | a researcher at Google DeepMind said this: "There will be three to four massive news events coming
00:10:59.840 | out in the next weeks that will rock the robotics plus AI space. Adjust your timelines. It will be
00:11:05.680 | a crazy 2024." Now, I would guess that Genie counts as one of those three to four announcements. He
00:11:11.680 | can't have been referring to Gemma, the open model from Google DeepMind, because that was released
00:11:16.480 | the day before. The most interesting part of the Gemma paper and release for me was the sheer scale
00:11:22.560 | of data that they used. For those of you who've been following the scene for a while and remember
00:11:27.280 | the Chinchilla paper, that was back in 2022, when it was discovered that for a given compute budget,
00:11:33.200 | the optimal number of tokens to train on (text tokens, we're talking about) was roughly 20 times
00:11:39.120 | the number of parameters. But for Gemma, which was seven or eight billion parameters,
00:11:44.240 | they trained on around six trillion tokens of text. That's on the order of a thousand tokens of
00:11:50.720 | text for every parameter, far beyond the Chinchilla-optimal 20. Or in other words, when you've got
00:11:54.720 | the kind of compute that Google has, you don't necessarily have to follow the compute-optimal strategy.
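To make that arithmetic concrete, here is a quick back-of-the-envelope calculation; the parameter count is an approximation and the token figure is the reported rough number, so treat the outputs as ballpark only.

```python
# Rough comparison of Chinchilla-optimal training data vs. what Gemma reportedly used.
# All numbers are approximations for illustration, not official figures.
CHINCHILLA_TOKENS_PER_PARAM = 20   # the ~20 tokens-per-parameter rule of thumb

gemma_params = 8.5e9               # the Gemma "7B" model is roughly 8.5B parameters
gemma_tokens = 6e12                # reported ~6 trillion training tokens

chinchilla_optimal = CHINCHILLA_TOKENS_PER_PARAM * gemma_params
print(f"Chinchilla-optimal tokens: {chinchilla_optimal / 1e9:.0f}B")      # ~170B
print(f"Actual tokens per parameter: {gemma_tokens / gemma_params:.0f}")  # ~706
print(f"Over-training factor: {gemma_tokens / chinchilla_optimal:.1f}x")  # ~35.3x
```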
00:12:00.000 | But I digress. What news do I think Ted, that DeepMind researcher, is referring to? Well, here's my best guess. And no, it's not based on any
00:12:05.120 | insider knowledge. I think Google is going to announce another embodied model like RT-2,
00:12:10.720 | but powered by Gemini. Now, I've covered RT-2 in previous videos back in October and indeed
00:12:16.560 | interviewed the tech lead for RT-2-X for AI Insiders. But those models, in a nutshell, fuse robotics data
00:12:23.600 | with transfer learning from text and web data. In other words, they got better at robotics through
00:12:29.120 | having an LLM at their core, or you might say an MMM, a multimodal model. But in the case of RT-2,
00:12:35.920 | that was PaLM-E at 12 billion parameters, or PaLI-X at 55 billion parameters.
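To make that fusion a little more concrete, here is a hedged sketch of the action-as-text-tokens idea behind the RT series: continuous robot commands are binned into integers that a language model can emit like words. The bin count and the seven-dimensional action layout below are illustrative rather than the exact RT-2 tokenization.

```python
# Illustrative sketch: turn a continuous robot action into a short "sentence" of
# integer tokens, and back again. Bin count and action layout are assumptions.
import numpy as np

NUM_BINS = 256  # one of 256 discrete bins per action dimension

def action_to_tokens(action, low=-1.0, high=1.0):
    """Map a continuous action vector (e.g. dx, dy, dz, roll, pitch, yaw, gripper)
    to a string of bin IDs that a language model can output as ordinary text."""
    bins = np.clip(((action - low) / (high - low) * (NUM_BINS - 1)).astype(int),
                   0, NUM_BINS - 1)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str, low=-1.0, high=1.0):
    """Inverse mapping: decode the model's text output back into a continuous action."""
    bins = np.array([int(t) for t in token_str.split()])
    return low + bins / (NUM_BINS - 1) * (high - low)

cmd = np.array([0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0])  # a 7-DoF end-effector command
print(action_to_tokens(cmd))                  # e.g. "140 102 133 127 127 165 255"
print(tokens_to_action(action_to_tokens(cmd)))
```

Because the actions are just tokens, the same model that answers questions about web images can be co-fine-tuned to emit motor commands, which is the sense in which web-scale knowledge transfers into robotics.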
00:12:42.960 | Imagine RT-3 powered by Gemini at, say, one trillion parameters. It might understand the world around it to
00:12:48.720 | unprecedented degrees of depth and intelligence. And if it's powered by Gemini 1.5, it might be
00:12:54.480 | able to remember that world for months and months and months. Indeed, I was so inspired by the RT
00:13:05.680 | series that I made it the unofficial logo of AI Insiders. Valuations for humanoid AI-powered
00:13:05.680 | robotics startups are starting to get pretty wild too. The CEO of NVIDIA, Jensen Huang, said that
00:13:11.440 | his equivalent for the Transformer paper in the near future is foundational robotics. He said,
00:13:16.880 | "If you could generate text, if you could generate images, can you also generate motion?" He said,
00:13:21.440 | "The answer is probably yes. And like we've seen, we can also generate interaction." He went on,
00:13:26.240 | "Humanoid robotics should be just around the corner." I think he was referring to NVIDIA's
00:13:31.280 | GEAR, Generalist Embodied Agent Research. That's led by none other than Jim Fan. And he said,
00:13:37.360 | "2024 is the year of robotics and the year of gaming AI." Now this is Tesla's Optimus robot,
00:13:43.360 | but I do wonder if the ChatGPT or Sora moment for robotics will be when a humanoid robot walks
00:13:50.160 | with the fluidity of a human. That will just seem so wild when it happens. Just imagine a humanoid
00:13:56.880 | walking up to you with human-like swagger and shaking your hand, all while remembering a
00:14:01.840 | conversation you had with it, say, a year ago. Now, I don't think I can end this video without
00:14:06.640 | touching on some of the recent controversies that Google has faced. I just think their models were
00:14:11.360 | fairly clearly not given the kind of testing that they obviously required. And I'm not just
00:14:16.080 | referring to how Gemini seems to be phobic of the word white. There's also evidence of false
00:14:21.360 | refusals to questions that have a pretty obvious answer. And my take on this is going to try to
00:14:26.560 | move beyond just the obvious take. I think these examples show that Google is genuinely rattled by
00:14:32.240 | OpenAI, Microsoft, and players like Perplexity. And so they're cutting corners on the testing of
00:14:37.840 | their models. After all, if you're six months behind OpenAI, what's one way to catch up? Just
00:14:42.800 | cut out six months of human feedback for your models. I hope this isn't what Google did or
00:14:48.160 | plans to do in the future, but it seems like that to me. So let me know what you think of all of
00:14:52.960 | this and whether we are indeed entering a new era of action and interaction. Thank you so much for
00:15:00.240 | watching all the way to the end and have a wonderful day.