Gemini Full Breakdown + AlphaCode 2 Bombshell


00:00:00.000 | In the three to four hours since Google Gemini was announced, I've read the full 60-page
00:00:05.120 | technical report, the fascinating attached AlphaCode 2 technical report, and all the media
00:00:11.920 | interviews, clips and press releases that Google have put out. I've got 45 notes, so I'm going to
00:00:18.480 | skip the long intro and get straight to it. Here is the paper Gemini, a family of highly capable
00:00:24.320 | multimodal models. And let's get one thing out of the way straight away. Is this AGI? No. Is it
00:00:30.720 | better than GPT-4? Yes, in many modalities, but in text it's probably a draw. Are many people
00:00:37.600 | going to overlook the bombshell AlphaCode 2 paper? Probably. Anyway, it's three models,
00:00:42.800 | Ultra, Pro and Nano. Nano being for your phone, Pro being the rough equivalent of GPT-3.5 or
00:00:49.680 | slightly better and Ultra being released early next year as the GPT-4 competitor. Now, the first
00:00:56.000 | paragraph of both that technical report and this accompanying web page touts its abilities on the
00:01:01.600 | MMLU. And I must say in the web page, they were gunning for maximum hype because they had this
00:01:07.120 | human expert level here and they had Gemini being the first model to beat it. The first problem that
00:01:12.800 | you might have noticed is that the GPT-4 score was measured five-shot. Basically, it was given five
00:01:18.480 | examples to learn from before answering each question, whereas Gemini Ultra, the biggest model,
00:01:23.760 | was evaluated with chain of thought and 32 samples. This is not the video to go into chain of thought or
00:01:29.760 | self-consistency, but it's a different way of measuring. It's not an apples-to-apples comparison,
00:01:34.080 | and in the appendix of the technical report, we'll see a genuine comparison.
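To make that concrete, here is a minimal sketch of chain-of-thought self-consistency, the idea behind a CoT@32-style score. It is only an illustration: the `generate` function is a hypothetical stand-in for whatever LLM API you use, and Google's exact evaluation recipe is not public at this level of detail.

```python
from collections import Counter
from typing import Callable

def cot_self_consistency(question: str, choices: list[str],
                         generate: Callable[..., str], n_samples: int = 32) -> str:
    prompt = (
        "Answer the multiple-choice question. Think step by step, "
        "then finish with 'Answer: <letter>'.\n\n" + question + "\n"
        + "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    )
    votes = Counter()
    for _ in range(n_samples):
        # Sample a full reasoning chain at non-zero temperature so the chains differ.
        chain = generate(prompt, temperature=0.7)
        letter = chain.rsplit("Answer:", 1)[-1].strip()[:1]  # crude answer extraction
        if letter in "ABCD":
            votes[letter] += 1
    # The final answer is a majority vote over all the sampled chains, rather than
    # the single answer a five-shot, one-pass evaluation would produce.
    return votes.most_common(1)[0][0] if votes else "A"
```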
00:01:39.680 | Also, I've had many conversations with the creators of the MMLU as I was doing my SmartGPT video, and that figure of
00:01:46.000 | 89.8% is very approximate. The 10-second summary of the MMLU is that it's a multiple-choice test
00:01:53.680 | across 57 different subjects, from chemistry to business to mathematics to morality. And
00:01:59.200 | unfortunately, Demis Hassabis engaged in some of that hype with this remark.
00:02:04.240 | What's amazing about Gemini is that it's so good at so many things. As we started getting to the
00:02:09.040 | end of the training, we started seeing that Gemini was better than any other model out there on these
00:02:14.640 | very, very important benchmarks. For example, each of the 50 different subject areas that we tested
00:02:19.680 | on, it's as good as the best expert humans in those areas. So no, Gemini Ultra and GPT-4 can't
00:02:27.040 | beat most human experts. And while they describe the MMLU as one of the most popular methods to
00:02:32.880 | test the knowledge and problem solving abilities of AI models, they don't actually try to back up
00:02:37.760 | its credibility. If you watched my second SmartGPT video, you'll have seen how, with some basic prompt
00:02:43.040 | scaffolding, you can reach 89% with GPT-4. And frankly, if we had a slightly larger budget than
00:02:49.440 | the couple thousand dollars in API calls that we used, we could have reached 90% with GPT-4.
00:02:54.800 | Furthermore, it's somewhat ridiculous to showcase these results to one decimal place when the test
00:02:59.840 | itself has maybe 2 to 3% of its questions being in error. If you want more details, check out that
00:03:05.520 | video, but it's quite egregious. I actually spoke with Anthropic, another leading AGI lab about this
00:03:11.360 | and they did a blog post on the errors in this test. So the fact that months after that, Google
00:03:16.800 | is still touting results on this test is a little surprising to me. That's not to say that Gemini
00:03:21.920 | Ultra isn't the best new model. I think it is. It's just weird that they picked this particular
00:03:26.720 | benchmark. And deep in the appendix of the paper, Google sheepishly gives a slightly more reasonable
00:03:31.920 | comparison between the two models on these text questions. As you can see, depending on the
00:03:36.640 | prompting strategy, you get different results. And please forgive me for one last bit of mockery of
00:03:41.760 | the two decimal places they give here, when 2 to 3% of the test contains errors.
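To put that in perspective, here's a back-of-the-envelope calculation. The roughly 14,000-question test size and the 2 to 3% error figure are approximations from my own digging, not numbers taken from the Gemini report.

```python
import math

n_questions = 14_000   # roughly the size of the MMLU test set
accuracy = 0.90        # a score in the region being compared

# Pure sampling noise on a measured accuracy of ~90% over ~14,000 questions:
std_err = math.sqrt(accuracy * (1 - accuracy) / n_questions)
print(f"statistical noise alone: about +/-{std_err:.2%}")  # ~0.25%

# On top of that, an estimated 2-3% of the questions have faulty or ambiguous
# answer keys, which dwarfs any gap quoted to one or two decimal places.
print("faulty questions: roughly 2-3% of the test")
```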
00:03:47.760 | Very briefly though, before we get back to the paper, many of you might be wondering what are the kinds of techniques
00:03:52.080 | that can boost performance? Well, I was intending to launch this slightly later on, but now is as
00:03:57.760 | good a time as any. It's my new AI Insiders content on Patreon. And no, I'm not going to do a long ad.
00:04:04.320 | We'll be back to Gemini in 30 seconds. I've been working on this for months, doing interviews with
00:04:09.200 | Google DeepMind, people like Jim Fan, Microsoft authors, basically creating the best possible
00:04:15.120 | exclusive content that I could come up with. And the reason I mention it now is because in one of
00:04:19.760 | these exclusive videos, I talk with a top Google author about the future of prompting. It's not
00:04:26.160 | this one, although I'm really proud of that one, reasoning as the holy grail of LLMs. Not this one,
00:04:31.440 | I think it was this one. I spoke to Xingchen Wan about this evolution of prompting. There is much
00:04:36.800 | more content coming and I'm even going to have a feature where you can vote on what kind of
00:04:40.880 | questions I ask the next round of experts. Some of whom will feature later in this video on Google
00:04:46.800 | Gemini. But why do I say in other modalities it beats GPT-4? Well, look at this. In 9 of 9
00:04:53.600 | image understanding benchmarks, it beats GPT-4 Vision and all other models. 6 of 6 video
00:04:59.280 | understanding benchmarks and 5 of 5 speech recognition and speech translation benchmarks.
00:05:05.280 | That's not bad. They are trained to support a 32,000 token context window, which compares to
00:05:11.440 | 128,000 for GPT-4 Turbo. With Anthropic, you can get up to 200,000 tokens, but the performance isn't
00:05:19.040 | quite as good. Interestingly, they did tell us the parameter count of the Nano models: 1.8
00:05:25.040 | billion parameters and 3.25 billion parameters. They even give us the detail that they are 4-bit
00:05:30.880 | quantized, distilled-down smaller versions of the larger Gemini models.
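For those wondering what 4-bit quantization actually does, here is a generic sketch of the idea. It is not Google's actual on-device scheme, which the report doesn't describe.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    # Map float weights onto 16 signed integer levels, [-8, 7].
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights when running the model.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_4bit(w)
print(np.abs(w - dequantize(q, s)).max())  # small per-weight rounding error
```

The payoff is that each weight needs only four bits rather than sixteen or thirty-two, which is what makes a phone-sized model practical.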
00:05:36.640 | What about other key details, like the dataset they used? Well, as you might have guessed, they say, "We plan to update
00:05:42.160 | this report with more details ahead of the general availability of the Gemini Ultra model." And later
00:05:48.000 | on, they go into fantastic detail, but not really, saying, "Our pre-trained dataset uses data from
00:05:54.560 | web documents, books and code, and includes image, audio and video data." Great. Instead of detail on
00:06:01.680 | the dataset, we do get this incredible nugget. Some of the delay to Gemini was due to external
00:06:08.720 | factors such as cosmic rays. Now, obviously I'm being facetious, but that is an interesting detail.
00:06:14.480 | Now, one key detail that I mentioned back in the PaLM 2 video was this case of positive transfer.
00:06:20.320 | Essentially, by training the model on image, audio, as well as text and video, they got positive
00:06:26.320 | transfer. In other words, the model got better at text by being trained on images. As Google keeps
00:06:32.000 | saying, it was trained from the ground up to be multimodal, and that's why it gets such good
00:06:37.120 | results in other modalities, which I'll get to. In fact, here they are pretty much state of the
00:06:41.600 | art across natural image understanding, document understanding, infographic understanding. It's
00:06:47.600 | even better in video captioning, video question answering, speech translation. But these are
00:06:53.840 | solid results. This is nothing to do with prompting. It genuinely is a better model
00:06:58.960 | in these modalities. Remember, though, that we're not getting Ultra until early next year,
00:07:04.080 | and at the moment, Pro and Nano can only respond with text and code. They can't yet do what you're
00:07:10.480 | seeing in all of these demos, which I'll get to in a second, which is generate images. Speaking of
00:07:15.280 | the release though, people in the UK and EU like me aren't even getting Google Gemini on launch.
00:07:21.520 | Apparently it's about regulations, so time will tell. This is the kind of interactive UI that I
00:07:27.040 | mean, and it does look good. Clicking on the interface regenerates the data to be rendered
00:07:32.320 | by the code it wrote. Oh, I know she likes cupcakes. I can now click on anything in the interface and
00:07:38.400 | ask it for more information. I could say step by step instructions on how to bake this,
00:07:44.800 | and it starts to generate a new UI. This time it designs a UI best suited for giving me step by
00:07:51.120 | step instructions. But now time for some of the highlights, and I'm going to start with the one
00:07:56.400 | that I was most impressed by. Because of Gemini's ability to understand nuanced information and
00:08:02.640 | answer questions relating to complicated topics, it can give you a customized explanation of the
00:08:07.600 | subject you're trying to learn. And lastly, if you want to learn more, you can just ask.
00:08:17.440 | Gemini will provide personalized practice problems based on mistakes.
00:08:21.440 | Here I have a similar problem where I have to figure out the cat's speed.
00:08:26.640 | The height off the ground is doubled. Oh yeah, I knew that.
00:08:32.000 | Now I'm not naive enough to think that Google Gemini won't sometimes hallucinate answers,
00:08:36.800 | but this format of providing follow-up questions, understanding messy handwriting,
00:08:41.760 | and tying it all together is pretty impressive. The paper makes a great point that because it
00:08:46.160 | was trained from the ground up to be multimodal, it doesn't take audio and then put it into text,
00:08:51.200 | where you lose some nuance like tone. For languages like Mandarin, which I've studied in
00:08:56.000 | China, that's super important as this demo demonstrates.
00:08:59.680 | How to pronounce the words lunar January in Chinese?
00:09:03.200 | Option A, zhēng yuè. Option B, zhèng yuè. Which one is correct? Which tone is the correct tone for the first character?
00:09:12.560 | Option A, the first character is pronounced with the first tone.
00:09:16.400 | Gemini was able to differentiate the two ways of pronouncing the word to make sure it was correct.
00:09:21.840 | And speaking of video understanding, these two demos weren't bad either.
00:09:27.840 | You are trying to get me to find the paper ball under the cup. I accept the challenge.
00:09:32.080 | The cup to the left.
00:09:36.240 | Nice. What movie are they acting out here?
00:09:41.760 | I think they are acting out the famous bullet time scene from The Matrix.
00:09:45.120 | Ooh, nice.
00:09:45.840 | What do you think might happen next?
00:09:50.400 | The cat is going to jump to the wall and stick the landing. It's going to be a perfect 10.
00:09:57.440 | Oh, and it's a miss. It looks like it was just a little too far away.
00:10:01.280 | But don't worry, the cat is a natural athlete and it will be back up there in no time.
00:10:05.440 | Going back to languages for a second, we see from this machine translation benchmark,
00:10:10.240 | which couldn't have been in the training data, by the way, that Gemini Ultra beats PaLM 2 Large.
00:10:15.840 | But why is beating PaLM 2 Large so significant?
00:10:19.040 | Well, I covered this at the time, but in certain settings, we observe that PaLM 2
00:10:23.600 | improves quality both over PaLM and Google Translate. And Gemini Ultra is an improvement
00:10:29.120 | on PaLM 2 Large. So expect the multilingual performance to be next level.
00:10:34.160 | Now, what about coding? Well, here's where it's a draw in some ways
00:10:38.640 | and a massive win for Gemini Ultra in other ways.
00:10:42.400 | Natural2Code was a held-out dataset with no leakage on the web,
00:10:47.520 | so it's a really good benchmark to use, and Gemini Ultra beats GPT-4.
00:10:52.480 | Gemini Pro, by the way, beats GPT-3.5. But yes, the results are fairly close,
00:10:58.240 | about a one percentage point difference. The craziness though,
00:11:01.360 | comes from the AlphaCode 2 technical report. This was also released three to four hours ago.
00:11:06.080 | Now, AlphaCode 2 is based on Gemini, Gemini Pro actually, not Gemini Ultra,
00:11:10.640 | and it achieves truly outstanding things. That's not to say it will be available to you
00:11:15.200 | anytime soon. It's very compute intensive, but it does show what is coming to the automation
00:11:20.000 | of coding. I'll try to get to as many details as I can. Yes, I have read this report in full too.
00:11:25.600 | AlphaCode 2 based on Gemini Pro was evaluated on the Codeforces platform.
00:11:30.400 | GPT-4 could solve zero out of 10 of the easiest problems when they were recent,
00:11:36.080 | i.e. not within its dataset. As this author points out, that strongly points towards
00:11:39.680 | contamination, because it could solve 10 out of 10 of the pre-2021 problems.
00:11:44.560 | Anyway, it's a really challenging problem set. AlphaCode 2 solves 43% of those problems within
00:11:51.200 | 10 attempts, beating 85% of competition participants. Now, AlphaCode 2 isn't just
00:11:57.120 | one model, it's an entire system. They basically get a family of Gemini models by tweaking different
00:12:02.880 | hyperparameters. Think of it as different flavors of Gemini, which generate code samples for each
00:12:07.760 | problem. It was important to have those different flavors because they wanted code diversity.
00:12:11.840 | Then they sampled hundreds of thousands and up to a million code samples to search over the space of possible
00:12:18.080 | programs. This is maybe what Demis Hassabis mentioned when he talked about AlphaGo + GPT
00:12:23.520 | in a video I did a while back. At this point, you can see why it's not yet for consumer release
00:12:28.640 | because that is incredibly compute intensive. Anyway, they then filtered out the results
00:12:33.440 | to remove code that didn't compile or didn't pass the unit tests. They also had a mechanism
00:12:38.720 | to filter out code that was too similar. But then here's the interesting bit. They used Gemini
00:12:43.440 | as a scoring model to surface the best candidate. Are you getting vibes of Let's Verify Step by Step
00:12:50.160 | from my Q* video? If not, I'm not hinting strongly enough. Here's some more evidence. They used a
00:12:55.280 | fine-tuned Gemini Pro model to attribute an estimated correctness score between 0 and 1
00:13:00.880 | to code samples. Remember from Let's Verify that evaluation is easier than generation? I'll have
00:13:06.320 | to cover this in more detail in another video because there's other juicy nuggets to get into.
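To make the pipeline concrete, here is a rough sketch of that sample, filter, cluster and rank loop as I read it from the report. The helpers `sample_program`, `passes_example_tests` and `score_with_gemini` are hypothetical stand-ins, and the real system, with its family of tuned Gemini Pro models and behavioural clustering, is far more elaborate than this.

```python
from typing import Callable

def alphacode2_style_solve(
    problem: str,
    sample_program: Callable[[str], str],            # one "flavour" of the code model
    compiles: Callable[[str], bool],
    passes_example_tests: Callable[[str], bool],     # the problem's public example tests
    score_with_gemini: Callable[[str, str], float],  # fine-tuned scorer, 0 to 1
    n_samples: int = 100_000,                        # the report goes up to a million
    n_submissions: int = 10,
) -> list[str]:
    # 1. Sample a huge, diverse pool of candidate programs.
    candidates = [sample_program(problem) for _ in range(n_samples)]

    # 2. Throw away anything that doesn't compile or fails the example tests.
    survivors = [c for c in candidates if compiles(c) and passes_example_tests(c)]

    # 3. Deduplicate near-identical programs (exact match here, as a crude stand-in
    #    for the report's clustering of behaviourally similar code).
    unique = list(dict.fromkeys(survivors))

    # 4. Rank what's left with a learned correctness score and submit the top few.
    ranked = sorted(unique, key=lambda c: score_with_gemini(problem, c), reverse=True)
    return ranked[:n_submissions]
```

Even in this toy form you can see where the cost goes: almost all of the compute sits in step one.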
00:13:11.440 | The ranking of AlphaCode 2 on the test was between expert and master. In its best two performances,
00:13:18.320 | AlphaCode 2 outperformed more than 99.5% of competition participants. Here's where it
00:13:24.320 | gets more interesting. To solve problems at this level, before writing the code implementation,
00:13:28.720 | one needs to understand, analyze and reason about the problem. This explains why generally
00:13:33.600 | available AI systems (I think they mean GPT-4) perform poorly on this benchmark. AlphaCode 2's
00:13:39.360 | success on this competitive programming contest represents an impressive step change. This is
00:13:45.280 | something I actually discussed with Jim Fan, a senior AI researcher at NVIDIA for AI Insiders.
00:13:51.440 | He quoted mathematics being the first to fall. I think that's one of the reasons why coding and
00:13:57.200 | math look set to be the first to fall. I'm not talking imminently, but if you can generate
00:14:01.520 | enough samples and then test them, it becomes more a matter of compute rather than reasoning.
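A toy calculation makes the point. If one sampled program cracks a hard problem with some tiny probability p, the chance that at least one of n independent samples cracks it climbs steeply with n. The numbers below are purely illustrative, and real samples are neither independent nor free to filter.

```python
for p in (0.0001, 0.001):               # hypothetical per-sample solve probabilities
    for n in (100, 10_000, 1_000_000):  # number of samples drawn
        at_least_one = 1 - (1 - p) ** n
        print(f"p={p}, n={n:>9,}: chance of at least one correct sample = {at_least_one:.3f}")
```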
00:14:07.200 | Brute force over beauty. You can see here how even with the number of samples approaching a
00:14:12.880 | million, the results keep getting better. They also note that the sample efficiency of AlphaCode 2,
00:14:18.720 | because of the underlying model, is a lot better than AlphaCode 1. I'm sure you all want to try it,
00:14:24.240 | and they say we are working towards bringing AlphaCode 2's unique capabilities to our foundation
00:14:30.000 | Gemini models. But why can't we have it here? Well, they reiterate that just above. Our system
00:14:35.440 | requires a lot of trial and error and remains too costly to operate at scale. And many of you may
00:14:41.440 | have picked up on another key detail. I said they used Gemini Pro. Maybe this was for budgeting
00:14:47.680 | reasons, but they say we suspect using Gemini Ultra as the foundation model instead with its
00:14:53.680 | improved coding and reasoning capabilities would lead to further improvements in the overall
00:14:59.120 | AlphaCode 2 approach. They do try to give a sliver of reassurance to human coders a bit later on, though.
00:15:04.960 | They note that when using AlphaCode 2 with human coders who can specify additional filtering
00:15:10.160 | properties, we score above the 90th percentile. That's a five percentile increase. And they say
00:15:15.840 | optimistically, we hope this kind of interactive coding will be the future of programming where
00:15:20.960 | programmers make use of the highly capable AI models as collaborative tools. As always, let me
00:15:25.920 | know what you think in the comments. Some more notes on the release of Gemini before we get to
00:15:31.680 | the media round that Hassabis did. First, Gemini Nano is coming to the Pixel 8 Pro. That's going
00:15:37.840 | to power features like Summarize and Smart Reply. Honestly, I wouldn't yet trust a 3.5 billion
00:15:43.360 | parameter model to reply to anything, but that's just me. They say you can expect in the coming
00:15:48.480 | months that Gemini will be available in services such as Search, Ads, Chrome, and Duet AI. In a week
00:15:54.560 | though, starting December 13th, devs and enterprise customers can access Gemini Pro via the Gemini API
00:16:01.520 | in Google AI Studio. Starting today, Bard will use a fine-tuned version of Gemini Pro in those
00:16:07.200 | 170 countries excluding the UK and EU. And even though I'm hearing reports of people being
00:16:12.800 | disappointed with Gemini Pro, remember that's not the biggest model. That's Gemini Ultra. Gemini Pro
00:16:18.000 | is more like the original ChatGPT. In my last video, I already covered what the delay was behind
00:16:23.760 | Gemini Ultra. They're basically doing more reinforcement learning from human feedback,
00:16:28.160 | especially for low resource languages where it was apparently really easy to jailbreak.
00:16:33.600 | They also kind of teased out this Bard Advanced, which I suspect will be a subscription model,
00:16:38.880 | and they call it a new cutting-edge AI experience that gives you access to our best models and
00:16:43.040 | capabilities starting with Ultra. Be very interesting to see whether they pitch that higher,
00:16:47.600 | lower, or the same as ChatGPT Plus. I'll be honest, I'll be buying it regardless,
00:16:52.320 | assuming it's available in the UK. In terms of API pricing, we don't yet know,
00:16:57.680 | but this New York Times article did point out that during that crisis at OpenAI,
00:17:02.720 | Google offered OpenAI's customers the same price as their current OpenAI rate, with
00:17:09.040 | cloud credits and discounts thrown in. We'll have to see if that still applies to Gemini
00:17:13.920 | Pro and Gemini Ultra. What about the future of Gemini though? Gemini 2.0? Well, Demis Hassabis,
00:17:19.440 | the CEO of Google DeepMind, gave us this clue. He said that Google DeepMind is already looking
00:17:23.680 | into how Gemini might be combined with robotics to physically interact with the world. To become
00:17:29.440 | truly multimodal, you'll want to include touch and tactile feedback. And I can't resist pointing
00:17:34.800 | out at this point that I spoke to Pannag Sanketi for AI Insiders. He is the tech lead manager for
00:17:40.560 | RT-2, and I also spoke about this topic with many other lead authors at NVIDIA and elsewhere,
00:17:46.800 | and I hope that video will be out on AI Insiders before the end of this month. In this Wired
00:17:51.600 | interview, Demis Hassabis also strongly hinted that this work was related to what Q* might be
00:17:57.520 | about. Further evidence that this is the direction that all the major AGI labs are heading toward.
00:18:02.960 | When asked about that, he said, "We have some of the world's best reinforcement learning experts
00:18:06.960 | who invented some of this stuff." And went on, "We've got some interesting innovations we're
00:18:11.360 | working on to bring to future versions of Gemini." And again, like me, he is a somewhat
00:18:16.320 | understated Londoner, so that could be pretty significant. In The Verge, he doubled down on
00:18:21.360 | that, talking about Gemini Ultra, saying, "It's going to get even more general than images,
00:18:25.840 | video, and audio. There's still things like action and touch, more like robotics." Over time, he says,
00:18:31.120 | "Gemini will get more senses, become more aware." And he ended his media round with this, "As we
00:18:37.120 | approach AGI, things are going to be different," Hassabis says. "We're going to gain insanity
00:18:41.520 | points," is what Sam Altman said. "It's kind of an active technology, so I think we have to
00:18:46.160 | approach that cautiously. Cautiously, but optimistically." And on that note, I want to end
00:18:51.360 | the video. And yes, I'm super excited about the rolling launch of AI Insiders, which you can sign
00:18:57.280 | up to now. But I also want to reassure my longstanding audience, for many of whom $29 or
00:19:03.040 | even $25 is way too expensive. I totally respect that, and for most of my life, it would
00:19:09.440 | have been too expensive for me too. So don't worry, as this video shows, I will still be posting as
00:19:14.320 | frequently on the main AI Explained channel. And for those who can join, at least for the next few
00:19:19.200 | weeks, I'm going to be writing a personal message of thanks for joining AI Insiders. Of course, I am
00:19:25.120 | also massively grateful for all of those supporting me at the legendary level. You guys will still get
00:19:31.200 | my personal updates and blog style posts. So that's Google Gemini. Definitely not the last
00:19:36.400 | video from me on any of those models. Thank you so much for watching this video and have a wonderful day.