Gemini Full Breakdown + AlphaCode 2 Bombshell
00:00:00.000 |
In the three to four hours since Google Gemini was announced, I've read the full 60-page 00:00:05.120 |
technical report, the fascinating attached AlphaCode 2 technical report, and all the media 00:00:11.920 |
interviews, clips and press releases that Google have put out. I've got 45 notes, so I'm going to 00:00:18.480 |
skip the long intro and get straight to it. Here is the paper: Gemini, a family of highly capable 00:00:24.320 |
multimodal models. And let's get one thing out of the way straight away. Is this AGI? No. Is it 00:00:30.720 |
better than GPT-4? Yes, in many modalities, but in text it's probably a draw. Are many people 00:00:37.600 |
going to overlook the bombshell AlphaCode 2 paper? Probably. Anyway, it's three models, 00:00:42.800 |
Ultra, Pro and Nano. Nano being for your phone, Pro being the rough equivalent of GPT-3.5 or 00:00:49.680 |
slightly better and Ultra being released early next year as the GPT-4 competitor. Now, the first 00:00:56.000 |
paragraph of both that technical report and this accompanying web page touts its abilities on the 00:01:01.600 |
MMLU. And I must say, on the web page they were gunning for maximum hype, because they had this 00:01:07.120 |
human expert level marked here and presented Gemini as the first model to beat it. The first problem that 00:01:12.800 |
you might have noticed is that the GPT-4 score was measured five-shot. Basically, it was given five 00:01:18.480 |
examples to learn from before answering each question, whereas Gemini Ultra, the biggest model, 00:01:23.760 |
was evaluated with chain of thought and 32 samples. This is not the video to go into chain of thought or 00:01:29.760 |
self-consistency, but it's a different way of measuring. It's not an apples-to-apples comparison 00:01:34.080 |
and in the appendix of the technical report, we'll see a genuine comparison. Also, I've had many 00:01:39.680 |
conversations with the creators of the MMLU as I was doing my SmartGPT video, and that figure of 00:01:46.000 |
89.8% is very approximate. The 10-second summary of the MMLU is that it's a multiple choice test 00:01:53.680 |
across 57 different subjects, from chemistry to business to mathematics to morality. And 00:01:59.200 |
unfortunately, Demis Hassabis engaged in some of that hype with this remark. 00:02:04.240 |
What's amazing about Gemini is that it's so good at so many things. As we started getting to the 00:02:09.040 |
end of the training, we started seeing that Gemini was better than any other model out there on these 00:02:14.640 |
very, very important benchmarks. For example, each of the 50 different subject areas that we tested 00:02:19.680 |
on, it's as good as the best expert humans in those areas. So no, Gemini Ultra and GPT-4 can't 00:02:27.040 |
beat most human experts. And while they describe the MMLU as one of the most popular methods to 00:02:32.880 |
test the knowledge and problem solving abilities of AI models, they don't actually try to back up 00:02:37.760 |
its credibility. If you watched my second SmartGPT video, I showed how, with some basic prompt 00:02:43.040 |
scaffolding, you can reach 89% with GPT-4. And frankly, if we had a slightly larger budget than 00:02:49.440 |
the couple of thousand dollars in API calls that we used, we could have reached 90% with GPT-4. 00:02:54.800 |
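(To make the CoT@32 and prompt-scaffolding point concrete, here is a minimal, hedged sketch of self-consistency on a multiple-choice question: sample several chain-of-thought answers and take a majority vote. It is an illustration only, not the SmartGPT method or Google's exact setup; query_model, the prompt wording and the answer-extraction regex are placeholder assumptions you would replace with a real API call.)

import re
from collections import Counter

def query_model(prompt: str) -> str:
    # Placeholder: call whichever LLM API you are using (sampled with temperature > 0) and return its text.
    raise NotImplementedError

def self_consistent_answer(question: str, choices: list[str], k: int = 32) -> str:
    """Sample k chain-of-thought answers and return the majority-vote letter."""
    letters = "ABCD"[:len(choices)]
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, choices))
    prompt = f"{question}\n{options}\nThink step by step, then end with 'Answer: <letter>'."
    votes = []
    for _ in range(k):
        reply = query_model(prompt)
        match = re.search(r"Answer:\s*([A-D])", reply)
        if match:
            votes.append(match.group(1))
    return Counter(votes).most_common(1)[0][0] if votes else letters[0]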
Furthermore, it's somewhat ridiculous to showcase these results to one decimal place when maybe 00:02:59.840 |
2 to 3% of the test's questions contain errors, a noise floor that swamps a lead measured in tenths of a percent. If you want more details, check out that 00:03:05.520 |
video, but it's quite egregious. I actually spoke with Anthropic, another leading AGI lab, about this, 00:03:11.360 |
and they did a blog post on the errors in this test. So the fact that, months after that, Google 00:03:16.800 |
is still touting results on this test is a little surprising to me. That's not to say that Gemini 00:03:21.920 |
Ultra isn't the best new model. I think it is. It's just weird that they picked this particular 00:03:26.720 |
benchmark. And deep in the appendix of the paper, Google sheepishly gives a slightly more reasonable 00:03:31.920 |
comparison between the two models on these text questions. As you can see, depending on the 00:03:36.640 |
prompting strategy, you get different results. And please forgive me for one last bit of mockery of 00:03:41.760 |
the two decimal places they give here when 2 to 3% of the test contains errors. Very briefly though, 00:03:47.760 |
before we get back to the paper, many of you might be wondering: what are these kinds of techniques 00:03:52.080 |
that can boost performance? Well, I was intending to launch this slightly later on, but now is as 00:03:57.760 |
good a time as any. It's my new AI Insiders content on Patreon. And no, I'm not going to do a long ad. 00:04:04.320 |
We'll be back to Gemini in 30 seconds. I've been working on this for months, doing interviews with 00:04:09.200 |
Google DeepMind, people like Jim Fan, Microsoft authors, basically creating the best possible 00:04:15.120 |
exclusive content that I could come up with. And the reason I mention it now is because in one of 00:04:19.760 |
these exclusive videos, I talk with a top Google author about the future of prompting. It's not 00:04:26.160 |
this one, although I'm really proud of that one, reasoning as the holy grail of LLMs. Not this one, 00:04:31.440 |
I think it was this one. I spoke to Xingchen Wan about this evolution of prompting. There is much 00:04:36.800 |
more content coming and I'm even going to have a feature where you can vote on what kind of 00:04:40.880 |
questions I ask the next round of experts. Some of whom will feature later in this video on Google 00:04:46.800 |
Gemini. But why do I say in other modalities it beats GPT-4? Well, look at this. In 9 of 9 00:04:53.600 |
image understanding benchmarks, it beats GPT-4 Vision and all other models. 6 of 6 video 00:04:59.280 |
understanding benchmarks and 5 of 5 speech recognition and speech translation benchmarks. 00:05:05.280 |
That's not bad. They are trained to support a 32,000 token context window, which compares to 00:05:11.440 |
128,000 for GPT-4 Turbo. With Anthropic, you can get up to 200,000 tokens, but the performance isn't 00:05:19.040 |
quite as good. Interestingly, they did tell us the parameter counts of the Nano models: 1.8 00:05:25.040 |
billion parameters and 3.25 billion parameters. They even give us the detail that they are 4-bit 00:05:30.880 |
quantized, distilled-down versions of the larger Gemini models. What about other key 00:05:36.640 |
details like the dataset they used? Well, as you might have guessed, they say, "We plan to update 00:05:42.160 |
this report with more details ahead of the general availability of the Gemini Ultra model." And later 00:05:48.000 |
on, they go into fantastic detail, but not really, saying, "Our pre-trained dataset uses data from 00:05:54.560 |
web documents, books and code, and includes image, audio and video data." Great. Instead of detail on 00:06:01.680 |
the dataset, we do get this incredible nugget. Some of the delay to Gemini was due to external 00:06:08.720 |
factors such as cosmic rays. Now, obviously I'm being facetious, but that is an interesting detail. 00:06:14.480 |
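(A brief aside on those 4-bit Nano models: below is a minimal sketch of symmetric int4 weight quantization. The report does not say which scheme Gemini Nano actually uses, so the per-tensor scaling here is purely illustrative.)

import numpy as np

def quantize_int4(weights: np.ndarray):
    """Symmetric per-tensor quantization of float weights into the int4 range [-8, 7]."""
    scale = float(np.abs(weights).max()) / 7.0   # map the largest magnitude onto 7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int4(w)
print(np.abs(w - dequantize_int4(q, scale)).max())  # small round-trip error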
Now, one key detail that I mentioned back in the PaLM 2 video was this case of positive transfer. 00:06:20.320 |
Essentially, by training the model on images and audio as well as text and video, they got positive 00:06:26.320 |
transfer. In other words, the model got better at text by being trained on images. As Google keeps 00:06:32.000 |
saying, it was trained from the ground up to be multimodal, and that's why it gets such good 00:06:37.120 |
results in other modalities, which I'll get to. In fact, here they are pretty much state of the 00:06:41.600 |
art across natural image understanding, document understanding, infographic understanding. It's 00:06:47.600 |
even better in video captioning, video question answering, speech translation. But these are 00:06:53.840 |
solid results. This is nothing to do with prompting. It genuinely is a better model 00:06:58.960 |
in these modalities. Remember, though, that we're not getting Ultra until early next year, 00:07:04.080 |
and at the moment, Pro and Nano can only respond with text and code. They can't yet do what you're 00:07:10.480 |
seeing in all of these demos, which I'll get to in a second, such as generating images. Speaking of 00:07:15.280 |
the release though, people in the UK and EU like me aren't even getting Google Gemini on launch. 00:07:21.520 |
Apparently it's about regulations, so time will tell. This is the kind of interactive UI that I 00:07:27.040 |
mean, and it does look good. Clicking on the interface regenerates the data to be rendered 00:07:32.320 |
by the code it wrote. Oh, I know she likes cupcakes. I can now click on anything in the interface and 00:07:38.400 |
ask it for more information. I could ask for step-by-step instructions on how to bake this, 00:07:44.800 |
and it starts to generate a new UI. This time it designs a UI best suited for giving me step by 00:07:51.120 |
step instructions. But now time for some of the highlights, and I'm going to start with the one 00:07:56.400 |
that I was most impressed by. Because of Gemini's ability to understand nuanced information and 00:08:02.640 |
answer questions relating to complicated topics, it can give you a customized explanation of the 00:08:07.600 |
subject you're trying to learn. And lastly, if you want to learn more, you can just ask. 00:08:17.440 |
Gemini will provide personalized practice problems based on mistakes. 00:08:21.440 |
Here I have a similar problem where I have to figure out the cat's speed. 00:08:26.640 |
The height off the ground is doubled. Oh yeah, I knew that. 00:08:32.000 |
Now I'm not naive enough to think that Google Gemini won't sometimes hallucinate answers, 00:08:36.800 |
but this format of providing follow-up questions, understanding messy handwriting, 00:08:41.760 |
and tying it all together is pretty impressive. The paper makes a great point that because it 00:08:46.160 |
was trained from the ground up to be multimodal, it doesn't take audio and then put it into text, 00:08:51.200 |
where you lose some nuance like tone. For languages like Mandarin, which I've studied in 00:08:56.000 |
China, that's super important as this demo demonstrates. 00:08:59.680 |
How to pronounce the words lunar January in Chinese? 00:09:03.200 |
Option A, zhēng yuè. Option B, zhèng yuè. Which one is correct? Which tone is the correct tone for the first character? 00:09:12.560 |
Option A, the first character is pronounced with the first tone. 00:09:16.400 |
Gemini was able to differentiate the two ways of pronouncing the word to make sure it was correct. 00:09:21.840 |
And speaking of video understanding, these two demos weren't bad either. 00:09:27.840 |
You are trying to get me to find the paper ball under the cup. I accept the challenge. 00:09:41.760 |
I think they are acting out the famous bullet time scene from The Matrix. 00:09:50.400 |
The cat is going to jump to the wall and stick the landing. It's going to be a perfect 10. 00:09:57.440 |
Oh, and it's a miss. It looks like it was just a little too far away. 00:10:01.280 |
But don't worry, the cat is a natural athlete and it will be back up there in no time. 00:10:05.440 |
Going back to languages for a second, we see from this machine translation benchmark, 00:10:10.240 |
which couldn't have been in the training data, by the way, that Gemini Ultra beats PaLM 2 Large. 00:10:15.840 |
But why is beating PaLM 2 Large so significant? 00:10:19.040 |
Well, I covered this at the time, but in certain settings, we observe that PaLM 2 00:10:23.600 |
improves quality both over PaLM and Google Translate. And Gemini Ultra is an improvement 00:10:29.120 |
on PaLM 2 Large. So expect the multilingual performance to be next level. 00:10:34.160 |
Now, what about coding? Well, here's where it's a draw in some ways 00:10:38.640 |
and a massive win for Gemini Ultra in other ways. 00:10:42.400 |
Natural2Code was a held-out dataset with no leakage onto the web, 00:10:47.520 |
so a really good benchmark to use, and Gemini Ultra beats GPT-4. 00:10:52.480 |
Gemini Pro, by the way, beats GPT-3.5. But yes, the results are fairly close. The massive win, though, 00:11:01.360 |
comes from the AlphaCode 2 technical report. This was also released three to four hours ago. 00:11:06.080 |
Now, AlphaCode 2 is based on Gemini, Gemini Pro actually, not Gemini Ultra, 00:11:10.640 |
and it achieves truly outstanding things. That's not to say it will be available to you 00:11:15.200 |
anytime soon. It's very compute intensive, but it does show what is coming to the automation 00:11:20.000 |
of coding. I'll try to get to as many details as I can. Yes, I have read this report in full too. 00:11:25.600 |
AlphaCode 2, based on Gemini Pro, was evaluated on the Codeforces platform. 00:11:30.400 |
GPT-4 could solve zero out of 10 of the easiest problems when they were recent, 00:11:36.080 |
not within its dataset. As this author points out, that strongly points towards 00:11:39.680 |
contamination because it could solve 10 out of 10 of the pre-2021 problems. 00:11:44.560 |
Anyway, it's a really challenging problem set. AlphaCode 2 solves 43% of those problems within 00:11:51.200 |
10 attempts, beating 85% of competition participants. Now, AlphaCode 2 isn't just 00:11:57.120 |
one model, it's an entire system. They basically get a family of Gemini models by tweaking different 00:12:02.880 |
hyperparameters. Think of it as different flavors of Gemini, which generate code samples for each 00:12:07.760 |
problem. It was important to have those different flavors because they wanted code diversity. 00:12:11.840 |
Then they sampled hundreds and up to a million code samples to search over the space of possible 00:12:18.080 |
programs. This is maybe what Demis Hassabis mentioned when he talked about AlphaGo + GPT 00:12:23.520 |
in a video I did a while back. At this point, you can see why it's not yet for consumer release 00:12:28.640 |
because that is incredibly compute intensive. Anyway, they then filtered out the results 00:12:33.440 |
to remove code that didn't compile or didn't pass the unit tests. They also had a mechanism 00:12:38.720 |
to filter out code that was too similar. But then here's the interesting bit. They used Gemini 00:12:43.440 |
as a scoring model to surface the best candidate. Are you getting vibes of Let's Verify Step by Step 00:12:50.160 |
from my Q* video? If not, I'm not hinting strongly enough. Here's some more evidence. They used a 00:12:55.280 |
fine-tuned Gemini Pro model to attribute an estimated correctness score between 0 and 1 00:13:00.880 |
to code samples. Remember from Let's Verify that evaluation is easier than generation? I'll have 00:13:06.320 |
to cover this in more detail in another video because there are other juicy nuggets to get into. 00:13:11.440 |
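(To make that pipeline concrete, here is a minimal sketch of the sample, filter, deduplicate and rank loop just described. It is an illustration of the approach, not DeepMind's code: generate_candidate, passes_unit_tests, behaviour_signature and score_correctness are hypothetical stand-ins for the Gemini samplers, the test harness, the clustering step and the fine-tuned scoring model.)

from typing import Callable, Optional

def solve_problem(problem: str,
                  generate_candidate: Callable[[str], str],    # one sampled program per call
                  passes_unit_tests: Callable[[str], bool],    # compiles and passes the public tests
                  behaviour_signature: Callable[[str], str],   # fingerprint used to drop near-duplicates
                  score_correctness: Callable[[str], float],   # fine-tuned scorer, 0 to 1
                  num_samples: int = 100_000) -> Optional[str]:
    """Sample many candidate programs, filter and deduplicate them, then return the best-scored one."""
    survivors = {}
    for _ in range(num_samples):
        code = generate_candidate(problem)
        if not passes_unit_tests(code):                         # discard non-compiling or failing samples
            continue
        survivors.setdefault(behaviour_signature(code), code)   # keep one program per behaviour cluster
    if not survivors:
        return None
    return max(survivors.values(), key=score_correctness)       # surface the top candidate for submission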
The ranking of AlphaCode 2 on the test was between Expert and Master. In its best two performances, 00:13:18.320 |
AlphaCode 2 outperformed more than 99.5% of competition participants. Here's where it 00:13:24.320 |
gets more interesting. To solve problems at this level, before writing the code implementation, 00:13:28.720 |
one needs to understand, analyze and reason about the problem. This explains why generally 00:13:33.600 |
available AI systems (I think they mean GPT-4) perform poorly on this benchmark. AlphaCode 2's 00:13:39.360 |
success on this competitive programming contest represents an impressive step change. This is 00:13:45.280 |
something I actually discussed with Jim Fan, a senior AI researcher at NVIDIA, for AI Insiders. 00:13:51.440 |
He cited mathematics as being the first to fall. I think that's one of the reasons why coding and 00:13:57.200 |
math look set to be the first to fall. I'm not talking imminently, but if you can generate 00:14:01.520 |
enough samples and then test them, it becomes more a matter of compute than of reasoning. 00:14:07.200 |
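(As a rough illustration of that compute-over-reasoning point, with my own numbers rather than the report's: if each independent sample solves a problem with some small probability p, the chance that at least one of k samples solves it is 1 - (1 - p)^k, so solve rates keep climbing as you throw more samples at the problem.)

def solve_rate(p: float, k: int) -> float:
    """Probability that at least one of k independent samples solves the problem."""
    return 1 - (1 - p) ** k

for k in (10, 1_000, 100_000, 1_000_000):
    print(k, round(solve_rate(1e-5, k), 4))  # illustrative per-sample probability of 0.001%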
Brute force over beauty. You can see here how even with the number of samples approaching a 00:14:12.880 |
million, the results keep getting better. They also note that the sample efficiency of AlphaCode 2, 00:14:18.720 |
because of the underlying model, is a lot better than AlphaCode 1's. I'm sure you all want to try it, 00:14:24.240 |
and they say we are working towards bringing AlphaCode 2's unique capabilities to our foundation 00:14:30.000 |
Gemini models. But why can't we have it here? Well, they reiterate that just above. Our system 00:14:35.440 |
requires a lot of trial and error and remains too costly to operate at scale. And many of you may 00:14:41.440 |
have picked up on another key detail. I said they used Gemini Pro. Maybe this was for budgeting 00:14:47.680 |
reasons, but they say we suspect using Gemini Ultra as the foundation model instead with its 00:14:53.680 |
improved coding and reasoning capabilities would lead to further improvements in the overall 00:14:59.120 |
AlphaCode 2 approach. They do try to give a sliver of reassurance to human coders a bit later on, though. 00:15:04.960 |
They note that when using AlphaCode 2 with human coders who can specify additional filtering 00:15:10.160 |
properties, we score above the 90th percentile. That's a five percentile increase. And they say 00:15:15.840 |
optimistically, we hope this kind of interactive coding will be the future of programming where 00:15:20.960 |
programmers make use of the highly capable AI models as collaborative tools. As always, let me 00:15:25.920 |
know what you think in the comments. Some more notes on the release of Gemini before we get to 00:15:31.680 |
the media round that Hassabis did. First, Gemini Nano is coming to the Pixel 8 Pro. That's going 00:15:37.840 |
to power features like Summarize and Smart Reply. Honestly, I wouldn't yet trust a 3.25 billion 00:15:43.360 |
parameter model to reply to anything, but that's just me. They say you can expect in the coming 00:15:48.480 |
months that Gemini will be available in services such as Search, Ads, Chrome, and Duet AI. In a week 00:15:54.560 |
though, starting December 13th, devs and enterprise customers can access Gemini Pro via the Gemini API 00:16:01.520 |
in Google AI Studio. Starting today, Bard will use a fine-tuned version of Gemini Pro in those 00:16:07.200 |
170 countries, excluding the UK and EU. And even though I'm hearing reports of people being 00:16:12.800 |
disappointed with Gemini Pro, remember, that's not the biggest model; that's Gemini Ultra. Gemini Pro 00:16:18.000 |
is more like the original ChatGPT. In my last video, I already covered what the delay was behind 00:16:23.760 |
Gemini Ultra. They're basically doing more reinforcement learning from human feedback, 00:16:28.160 |
especially for low resource languages where it was apparently really easy to jailbreak. 00:16:33.600 |
They also kind of teased out this Bard Advanced, which I suspect will be a subscription model, 00:16:38.880 |
and they call it a new cutting-edge AI experience that gives you access to our best models and 00:16:43.040 |
capabilities starting with Ultra. It will be very interesting to see whether they pitch that higher, 00:16:47.600 |
lower, or the same as ChatGPT Plus. I'll be honest, I'll be buying it regardless, 00:16:52.320 |
assuming it's available in the UK. In terms of API pricing, we don't yet know, 00:16:57.680 |
but this New York Times article did point out that during that crisis at OpenAI, 00:17:02.720 |
Google offered OpenAI's customers the same price as their current OpenAI rate, with 00:17:09.040 |
cloud credits and discounts thrown in. We'll have to see if that still applies to Gemini 00:17:13.920 |
Pro and Gemini Ultra. What about the future of Gemini though? Gemini 2.0? Well, Demis Hassabis, 00:17:19.440 |
the CEO of Google DeepMind, gave us this clue. He said that Google DeepMind is already looking 00:17:23.680 |
into how Gemini might be combined with robotics to physically interact with the world. To become 00:17:29.440 |
truly multimodal, you'll want to include touch and tactile feedback. And I can't resist pointing 00:17:34.800 |
out at this point that I spoke to Pannag Sanketi for AI Insiders. He is the tech lead manager for 00:17:40.560 |
RT-2, and I also spoke about this topic with many other lead authors at NVIDIA and elsewhere, 00:17:46.800 |
and I hope that video will be out on AI Insiders before the end of this month. In this Wired 00:17:51.600 |
interview, Demis Hassabis also strongly hinted that this work was related to what Q* might be 00:17:57.520 |
about. Further evidence that this is the direction that all the major AGI labs are heading toward. 00:18:02.960 |
When asked about that, he said, "We have some of the world's best reinforcement learning experts 00:18:06.960 |
who invented some of this stuff." And went on, "We've got some interesting innovations we're 00:18:11.360 |
working on to bring to future versions of Gemini." And again, like me, he is a somewhat 00:18:16.320 |
understated Londoner, so that could be pretty significant. In The Verge, he doubled down on 00:18:21.360 |
that, talking about Gemini Ultra, saying, "It's going to get even more general than images, 00:18:25.840 |
video, and audio. There's still things like action and touch, more like robotics." Over time, he says, 00:18:31.120 |
"Gemini will get more senses, become more aware." And he ended his media round with this, "As we 00:18:37.120 |
approach AGI, things are going to be different," Hassabis says. "We're going to gain insanity 00:18:41.520 |
points," is what Sam Altman said. "It's kind of an active technology, so I think we have to 00:18:46.160 |
approach that cautiously. Cautiously, but optimistically." And on that note, I want to end 00:18:51.360 |
the video. And yes, I'm super excited about the rolling launch of AI Insiders, which you can sign 00:18:57.280 |
up to now. But I also want to reassure my longstanding audience, for many of whom $29 or 00:19:03.040 |
even $25 is way too expensive. I totally respect that, and for most of my life, it would 00:19:09.440 |
have been too expensive for me too. So don't worry, as this video shows, I will still be posting as 00:19:14.320 |
frequently on the main AI Explained channel. And for those who can join, at least for the next few 00:19:19.200 |
weeks, I'm going to be writing a personal message of thanks for joining AI Insiders. Of course, I am 00:19:25.120 |
also massively grateful for all of those supporting me at the legendary level. You guys will still get 00:19:31.200 |
my personal updates and blog style posts. So that's Google Gemini. Definitely not the last 00:19:36.400 |
video from me on any of those models. Thank you so much for watching this video and have a wonderful day.