Back to Index

Gemini Full Breakdown + AlphaCode 2 Bombshell


Transcript

In the three to four hours since Google Gemini was announced, I've read the full 60-page technical report, the fascinating accompanying AlphaCode 2 technical report, and all the media interviews, clips and press releases that Google have put out. I've got 45 notes, so I'm going to skip the long intro and get straight to it.

Here is the paper: "Gemini: A Family of Highly Capable Multimodal Models". And let's get one thing out of the way straight away. Is this AGI? No. Is it better than GPT-4? Yes, in many modalities, but in text it's probably a draw. Are many people going to overlook the bombshell AlphaCode 2 paper?

Probably. Anyway, it's three models: Ultra, Pro and Nano. Nano is for your phone, Pro is the rough equivalent of GPT-3.5 or slightly better, and Ultra is being released early next year as the GPT-4 competitor. Now, the first paragraph of both that technical report and this accompanying web page touts its abilities on the MMLU.

And I must say, on the web page they were gunning for maximum hype, because they had this human-expert level here and they had Gemini being the first model to beat it. The first problem that you might have noticed is that GPT-4's score was measured five-shot. Basically, it was given five examples to learn from before answering each question.

Whereas Gemini Ultra, the biggest model, was evaluated with chain of thought across 32 samples (CoT@32). This is not the video to go into chain of thought or self-consistency, but it's a different way of measuring. It's not an apples-to-apples comparison, and in the appendix of the technical report we'll see a genuine comparison.
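To make the difference concrete, here's a minimal sketch of the two evaluation styles. The `generate` function is just a placeholder for whatever model you're calling; this is not the actual Gemini or GPT-4 eval harness, just the shape of 5-shot versus chain of thought with 32 samples and a majority vote.

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for a call to whichever model is being evaluated."""
    raise NotImplementedError

def five_shot_answer(question: str, examples: list[tuple[str, str]]) -> str:
    # 5-shot: prepend five worked (question, answer) pairs, then ask directly.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:5])
    return generate(f"{shots}\n\nQ: {question}\nA:").strip()

def cot_at_32_answer(question: str) -> str:
    # CoT@32: sample 32 chain-of-thought completions at non-zero temperature,
    # pull out each final answer, and take a majority vote (self-consistency).
    prompt = f"Q: {question}\nLet's think step by step, then give a final answer after 'Answer:'."
    answers = [generate(prompt, temperature=0.7).split("Answer:")[-1].strip()
               for _ in range(32)]
    return Counter(answers).most_common(1)[0][0]
```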

Also, I've had many conversations with the creators of the MMLU while I was making my SmartGPT video, and that figure of 89.8% is very approximate. The 10-second summary of the MMLU is that it's a multiple-choice test across 57 different subjects, from chemistry to business to mathematics to morality.
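If you want to poke at the benchmark yourself, here's a minimal sketch of loading and formatting MMLU questions. I'm assuming the copy mirrored on the Hugging Face Hub as `cais/mmlu` and its usual field names, so check against your own download.

```python
from datasets import load_dataset

# MMLU: multiple-choice questions across 57 subjects, four options each.
mmlu = load_dataset("cais/mmlu", "all", split="test")

def format_question(row: dict) -> str:
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, row["choices"]))
    return f"[{row['subject']}] {row['question']}\n{options}\nAnswer:"

example = mmlu[0]
print(format_question(example))
print("Gold answer:", ["A", "B", "C", "D"][example["answer"]])
```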

And unfortunately, Demis Hassabis engaged in some of that hype with this remark. What's amazing about Gemini is that it's so good at so many things. As we started getting to the end of the training, we started seeing that Gemini was better than any other model out there on these very, very important benchmarks.

For example, each of the 50 different subject areas that we tested on, it's as good as the best expert humans in those areas. So no, Gemini Ultra and GPT-4 can't beat most human experts. And while they describe the MMLU as one of the most popular methods to test the knowledge and problem solving abilities of AI models, they don't actually try to back up its credibility.

If you watched my second SmartGPT video, you'll have seen how, with some basic prompt scaffolding, you can reach 89% with GPT-4. And frankly, if we had a slightly larger budget than the couple of thousand dollars in API calls that we used, we could have reached 90% with GPT-4. Furthermore, it's somewhat ridiculous to showcase these results to one decimal place when maybe 2 to 3% of the test's questions are themselves in error.

If you want more details, check out that video, but it's quite egregious. I actually spoke with Anthropic, another leading AGI lab, about this, and they did a blog post on the errors in this test. So the fact that, months after that, Google is still touting results on this test is a little surprising to me.

That's not to say that Gemini Ultra isn't the best new model. I think it is. It's just weird that they picked this particular benchmark. And deep in the appendix of the paper, Google sheepishly gives a slightly more reasonable comparison between the two models on these text questions. As you can see, depending on the prompting strategy, you get different results.

And please forgive me for one last bit of mockery of the two decimal places they give here when 2 to 3% of the test contains errors. Very briefly though, before we get back to the paper: many of you might be wondering what these kinds of performance-boosting techniques actually are.

Well, I was intending to launch this slightly later on, but now is as good a time as any. It's my new AI Insiders content on Patreon. And no, I'm not going to do a long ad. We'll be back to Gemini in 30 seconds. I've been working on this for months, doing interviews with Google DeepMind, people like Jim Fan, Microsoft authors, basically creating the best possible exclusive content that I could come up with.

And the reason I mention it now is because in one of these exclusive videos, I talk with a top Google author about the future of prompting. It's not this one, although I'm really proud of that one, "Reasoning as the Holy Grail of LLMs". Not this one; I think it was this one.

I spoke to Xingchen Wan about this evolution of prompting. There is much more content coming and I'm even going to have a feature where you can vote on what kind of questions I ask the next round of experts. Some of whom will feature later in this video on Google Gemini.

But why do I say it beats GPT-4 in other modalities? Well, look at this. It beats GPT-4 Vision and all other models on 9 of 9 image understanding benchmarks, 6 of 6 video understanding benchmarks, and 5 of 5 speech recognition and speech translation benchmarks. That's not bad. The Gemini models are trained to support a 32,000-token context window, which compares to 128,000 for GPT-4 Turbo.

With Anthropic, you can get up to 200,000 tokens, but the performance isn't quite as good. Interestingly, they did tell us the parameter counts of the Nano models: 1.8 billion and 3.25 billion parameters. They even give us the detail that they are 4-bit quantized, distilled-down versions of the larger Gemini models.
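As a rough illustration of what 4-bit quantization means in practice, here's a generic symmetric-quantization sketch. This is not Google's actual scheme, just the general idea of trading precision for a much smaller memory footprint.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    # Symmetric per-tensor quantization to 4-bit integers in the range -8..7.
    # Real deployments usually quantize per channel or per group, but the idea
    # is the same: store tiny integers plus a floating-point scale factor.
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32) * 0.1
q, s = quantize_4bit(w)
print("max reconstruction error:", np.abs(w - dequantize(q, s)).max())
```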

What about other key details, like the dataset they used? Well, as you might have guessed, they say, "We plan to update this report with more details ahead of the general availability of the Gemini Ultra model." And later on, they go into fantastic detail, but not really, saying, "Our pre-training dataset uses data from web documents, books and code, and includes image, audio and video data." Great.

Instead of detail on the dataset, we do get this incredible nugget: some of the delay to Gemini was due to external factors such as cosmic rays. Now, obviously I'm being facetious, but that is an interesting detail. Now, one key detail that I mentioned back in the PaLM 2 video was this case of positive transfer.

Essentially, by training the model on images and audio as well as text and video, they got positive transfer. In other words, the model got better at text by being trained on images. As Google keeps saying, it was trained from the ground up to be multimodal, and that's why it gets such good results in other modalities, which I'll get to.

In fact, here it is pretty much state of the art across natural image understanding, document understanding, and infographic understanding. It does even better in video captioning, video question answering, and speech translation. These are solid results, and nothing to do with prompting. It genuinely is a better model in these modalities.

Remember, though, that we're not getting Ultra until early next year, and at the moment Pro and Nano can only respond with text and code. They can't yet do what you're seeing in all of these demos (which I'll get to in a second), which is generate images. Speaking of the release, though, people in the UK and EU like me aren't even getting Google Gemini on launch.

Apparently it's about regulations, so time will tell. This is the kind of interactive UI that I mean, and it does look good. Clicking on the interface regenerates the data to be rendered by the code. Oh, I know she likes cupcakes. I can now click on anything in the interface and ask it for more information.

I could ask for step-by-step instructions on how to bake this, and it starts to generate a new UI. This time it designs a UI best suited to giving me step-by-step instructions. But now it's time for some of the highlights, and I'm going to start with the one that I was most impressed by.

Because of Gemini's ability to understand nuanced information and answer questions relating to complicated topics, it can give you a customized explanation of the subject you're trying to learn. And lastly, if you want to learn more, you can just ask. Gemini will provide personalized practice problems based on mistakes. Here I have a similar problem where I have to figure out the cat's speed.

The height off the ground is doubled. Oh yeah, I knew that. Now, I'm not naive enough to think that Google Gemini won't sometimes hallucinate answers, but this format of providing follow-up questions, understanding messy handwriting, and tying it all together is pretty impressive. The paper makes a great point that because it was trained from the ground up to be multimodal, it doesn't take audio and first convert it into text, where you lose some nuance like tone.

For languages like Mandarin, which I've studied in China, that's super important, as this demo shows. How do you pronounce the word for lunar January in Chinese? Option A, zhēng yuè. Option B, zhèng yuè. Which one is correct? Which tone is the correct tone for the first character? Option A, the first character is pronounced with the first tone.

Gemini was able to differentiate the two ways of pronouncing the word to make sure it was correct. And speaking of video understanding, these two demos weren't bad either. You are trying to get me to find the paper ball under the cup. I accept the challenge. The cup to the left.

Nice. What movie are they acting out here? I think they are acting out the famous bullet time scene from The Matrix. Ooh, nice. What do you think might happen next? The cat is going to jump to the wall and stick the landing. It's going to be a perfect 10.

Oh, and it's a miss. It looks like it was just a little too far away. But don't worry, the cat is a natural athlete and it will be back up there in no time. Going back to languages for a second, we see from this machine translation benchmark, which couldn't have been in the training data, by the way, that Gemini Ultra beats PaLM 2 Large.

But why is beating PaLM 2 Large so significant? Well, I covered this at the time: in certain settings, PaLM 2 improves translation quality over both PaLM and Google Translate. And Gemini Ultra is an improvement on PaLM 2 Large. So expect the multilingual performance to be next level.

Now, what about coding? Well, here's where it's a draw in some ways and a massive win for Gemini Ultra in other ways. Natural2Code is a held-out dataset with no leakage onto the web, so it's a really good benchmark to use, and Gemini Ultra beats GPT-4 on it. Gemini Pro, by the way, beats GPT-3.5.

But yes, the results are fairly close, around a 1 percentage point difference. The craziness, though, comes from the AlphaCode 2 technical report. This was also released three to four hours ago. Now, AlphaCode 2 is based on Gemini, Gemini Pro actually, not Gemini Ultra, and it achieves truly outstanding things. That's not to say it will be available to you anytime soon.

It's very compute-intensive, but it does show what is coming to the automation of coding. I'll try to get to as many details as I can. Yes, I have read this report in full too. AlphaCode 2, based on Gemini Pro, was evaluated on the Codeforces platform. GPT-4 could solve zero out of 10 of the easiest problems when they were recent, i.e. not within its dataset.

As this author points out, that strongly points towards contamination because it could solve 10 out of 10 of the pre-2021 problems. Anyway, it's a really challenging problem set. AlphaCode 2 solves 43% of those problems within 10 attempts, beating 85% of competition participants. Now, AlphaCode 2 isn't just one model, it's an entire system.

They basically get a family of Gemini models by tweaking different hyperparameters. Think of them as different flavors of Gemini, which generate code samples for each problem. It was important to have those different flavors because they wanted code diversity. Then they sampled hundreds of thousands, up to a million, code samples per problem to search over the space of possible programs.

This is maybe what Demis Hassabis mentioned when he talked about AlphaGo + GPT in a video I did a while back. At this point, you can see why it's not yet for consumer release because that is incredibly compute intensive. Anyway, they then filtered out the results to remove code that didn't compile or didn't pass the unit tests.

They also had a mechanism to filter out code that was too similar. But then here's the interesting bit. They used Gemini as a scoring model to surface the best candidate. Are you getting vibes of Let's Verify Step by Step from my Q* video? If not, I'm not hinting strongly enough.

Here's some more evidence. They used a fine-tuned Gemini Pro model to attribute an estimated correctness score between 0 and 1 to code samples. Remember from Let's Verify that evaluation is easier than generation? I'll have to cover this in more detail in another video because there are other juicy nuggets to get into.
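Pulling those steps together, here's a heavily simplified sketch of that generate, filter, cluster and score loop. Every function body here is a placeholder I've made up to show the shape of the system; the real AlphaCode 2 pipeline is far more elaborate, and far more compute-hungry.

```python
def sample_candidates(problem: str, n_samples: int) -> list[str]:
    # Stand-in for a family of Gemini-based policy models sampled with
    # different hyperparameters (temperature etc.) to encourage diverse code.
    raise NotImplementedError

def compiles_and_passes_public_tests(code: str, problem: str) -> bool:
    # Filter: drop code that doesn't compile or fails the visible test cases.
    raise NotImplementedError

def behaviour_signature(code: str, problem: str) -> tuple:
    # E.g. the program's outputs on a set of generated inputs,
    # used to cluster near-identical solutions together.
    raise NotImplementedError

def estimated_correctness(code: str, problem: str) -> float:
    # Fine-tuned scoring model: an estimated correctness score in [0, 1].
    raise NotImplementedError

def solve(problem: str, n_samples: int = 1_000_000, attempts: int = 10) -> list[str]:
    candidates = sample_candidates(problem, n_samples)
    survivors = [c for c in candidates if compiles_and_passes_public_tests(c, problem)]
    # Keep one representative per behaviour cluster so the final picks stay diverse.
    clusters: dict[tuple, str] = {}
    for c in survivors:
        clusters.setdefault(behaviour_signature(c, problem), c)
    ranked = sorted(clusters.values(),
                    key=lambda c: estimated_correctness(c, problem),
                    reverse=True)
    return ranked[:attempts]  # submit the top-scoring, behaviourally distinct candidates
```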

The ranking of AlphaCode 2 on the test was between expert and master. In its best two performances, AlphaCode 2 outperformed more than 99.5% of competition participants. Here's where it gets more interesting. To solve this level of problem, before writing the code implementation one needs to understand, analyze and reason about the problem.

This explains why generally available AI systems (I think they mean GPT-4) perform poorly on this benchmark. AlphaCode 2's success on this competitive programming contest represents an impressive step change. This is something I actually discussed with Jim Fan, a senior AI researcher at NVIDIA, for AI Insiders. He talked about mathematics being one of the first domains to fall.

I think that's one of the reasons why coding and math look set to be among the first to fall. I'm not talking imminently, but if you can generate enough samples and then test them, it becomes more a matter of compute than of reasoning. Brute force over beauty. You can see here how, even with the number of samples approaching a million, the results keep getting better.

They also note that the sample efficiency of AlphaCode 2, because of the underlying model, is a lot better than AlphaCode 1's. I'm sure you all want to try it, and they say they are "working towards bringing AlphaCode 2's unique capabilities to our foundation Gemini models." But why can't we have it here?

Well, they reiterate that just above: "Our system requires a lot of trial and error and remains too costly to operate at scale." And many of you may have picked up on another key detail: I said they used Gemini Pro. Maybe this was for budgeting reasons, but they say, "We suspect using Gemini Ultra as the foundation model instead with its improved coding and reasoning capabilities would lead to further improvements in the overall AlphaCode 2 approach."

They do try to give a sliver of reassurance to human coders a bit later on, though. They note that when AlphaCode 2 is used with human coders who can specify additional filtering properties, it scores above the 90th percentile. That's roughly a five-percentile-point increase. And they say, optimistically, "We hope this kind of interactive coding will be the future of programming, where programmers make use of the highly capable AI models as collaborative tools."

As always, let me know what you think in the comments. Some more notes on the release of Gemini before we get to the media round that Hassabis did. First, Gemini Nano is coming to the Pixel 8 Pro. That's going to power features like Summarize and Smart Reply. Honestly, I wouldn't yet trust a 3.25-billion-parameter model to reply to anything, but that's just me.

They say you can expect Gemini to be available in the coming months in services such as Search, Ads, Chrome, and Duet AI. In a week, though, starting December 13th, devs and enterprise customers can access Gemini Pro via the Gemini API in Google AI Studio. Starting today, Bard will use a fine-tuned version of Gemini Pro in 170 countries, excluding the UK and EU.
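For those planning to try the API from December 13th, here's a minimal sketch of what calling Gemini Pro from Python should look like via Google's `google-generativeai` SDK; treat the exact model name and method names as assumptions until the official docs land.

```python
import google.generativeai as genai

# The API key comes from Google AI Studio; "gemini-pro" is the assumed model id.
genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Summarize the Gemini technical report in three bullet points."
)
print(response.text)
```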

And even though I'm hearing reports of people being disappointed with Gemini Pro, remember that's not the biggest model; Gemini Ultra is. Gemini Pro is more like the original ChatGPT. In my last video, I already covered what the delay was behind Gemini Ultra. They're basically doing more reinforcement learning from human feedback, especially for low-resource languages, where it was apparently really easy to jailbreak.

They also kind of teased this Bard Advanced, which I suspect will be a subscription model, and they call it a new cutting-edge AI experience that gives you access to our best models and capabilities, starting with Ultra. It'll be very interesting to see whether they pitch that higher, lower, or the same as ChatGPT Plus.

I'll be honest, I'll be buying it regardless, assuming it's available in the UK. In terms of API pricing, we don't yet know, but this New York Times article did point out that during that crisis at OpenAI, Google offered OpenAI's customers the same price as their current OpenAI rate, with cloud credits and discounts thrown in.

We'll have to see if that still applies to Gemini Pro and Gemini Ultra. What about the future of Gemini though? Gemini 2.0? Well, Demis Hassabis, the CEO of Google DeepMind, gave us this clue. He said that Google DeepMind is already looking into how Gemini might be combined with robotics to physically interact with the world.

To become truly multimodal, you'll want to include touch and tactile feedback. And I can't resist pointing out at this point that I spoke to Pannag Sanketi for AI Insiders. He is the tech lead manager for RT-2, and I also spoke about this topic with many other lead authors at NVIDIA and elsewhere, and I hope that video will be out on AI Insiders before the end of this month.

In this Wired interview, Demis Hassabis also strongly hinted that this work was related to what Q* might be about. It's further evidence that this is the direction in which all the major AGI labs are heading. When asked about that, he said, "We have some of the world's best reinforcement learning experts who invented some of this stuff." He went on, "We've got some interesting innovations we're working on to bring to future versions of Gemini." And again, like me, he is a somewhat understated Londoner, so that could be pretty significant.

In The Verge, he doubled down on that, talking about Gemini Ultra, saying, "It's going to get even more general than images, video, and audio. There's still things like action and touch, more like robotics." Over time, he says, "Gemini will get more senses, become more aware." And he ended his media round with this, "As we approach AGI, things are going to be different," Hassabis says.

"We're going to gain insanity points," is what Sam Altman said. "It's kind of an active technology, so I think we have to approach that cautiously. Cautiously, but optimistically." And on that note, I want to end the video. And yes, I'm super excited about the rolling launch of AI Insiders, which you can sign up to now.

But I also want to reassure my longstanding audience, many of whom will find $29 or even $25 way too expensive. I totally respect that, and for most of my life it would have been too expensive for me too. So don't worry: as this video shows, I will still be posting just as frequently on the main AI Explained channel.

And for those who can join, at least for the next few weeks, I'm going to be writing a personal message of thanks for joining AI Insiders. Of course, I am also massively grateful for all of those supporting me at the legendary level. You guys will still get my personal updates and blog style posts.

So that's Google Gemini. Definitely not the last video from me on any of those models. Thank you so much for watching this video and have a wonderful day.