Gemini 2.5 Pro - It’s a Darn Smart Chatbot … (New Simple High Score)

The world has had 72 hours to digest the release of Gemini 2.5 and the good first impressions have become even better second and third impressions. I've got four new benchmark results to show you guys including a record score on my own exam but it won't just be about the numbers.

I'll draw on a paper from yesterday as well as my own test to show you that sometimes Gemini 2.5 can deceptively reverse engineer its answers and that beyond that Google doesn't own every AI arena and domain just yet. I'm going to start with what might seem to be a strange place with a not particularly well-known benchmark called Fiction Lifebench but I think it'll make sense why I cover it first.

Analyzing long essays or presentations or code bases or stories is what a lot of people use AI for, what they turn to with their chatbot. I had seen the sensational score of Gemini 2.5 Pro on this benchmark but I wanted to dive deeper and see what kind of questions it had.

What it does and honestly I'm surprised that no one else had come up with a test just like this one before is it will give a sample text and this is a fairly short one at like around 6,000 words or 8,000 tokens. It's a sci-fi story with a fairly convoluted plot but after pages and pages and pages of text we get to the question at the end.

Finish the sentence, what names would Jerome list? Give me a list of names only. Admittedly with the help of a chatbot what I did is I figured out why the answer was a certain set of names and it relies on a promise held in chapter 2 but with a caveat given in chapter 16.

So essentially the chatbot, in this case Gemini 2.5, has to hold all of that information in its attention. Note that this isn't just a needle in a haystack challenge, not like a password hidden on line 500. The model actually has to piece together different bits of information. Now imagine this applied to your use case, whatever it is, is with LLMs.

Enough build up then, what were the results and look at Gemini 2.5 Pro as it compares to other Gemini models but any other model, particularly when you get to the longer context. At the upper end, 120k tokens is like a novella or a decently expanded code base and you can see that Gemini is head and shoulders above other models.

It really starts to pull away once you go beyond around 32,000 tokens but it's decent throughout. Already I can tell about half the audience is thinking, I could see some use for that for my use case but we're not done yet of course. Next I'm going to quickly focus on something that isn't a benchmark but can be forgotten by those of us who are immersed in AI all the time, the sheer practicality of the model.

On Google AI Studio at least, it can handle not only videos but also YouTube URLs and no other model that I'm familiar with can. It also just simply has a more recent knowledge cutoff date of January 2025 so it should in theory know things up to that date. That compares to Claude 3.7 Sonnet which is I think October 2024 and even far earlier for OpenAI models.

Now obviously don't rely too heavily on that knowledge, it can be hit and miss and of course rival models can simply search the internet too. I would very quickly note that giving themselves just a month and a half to test the security of their new model kind of shows we are in a race to the bottom on that front and also they didn't produce any report card unlike OpenAI or Anthropic.

Next comes coding and you could say that Google or Google DeepMind were admirably modest in the benchmarks they chose to highlight on coding. They picked two benchmarks LiveCodebench V5 and Sweebench Verified in which they slightly underperformed the competition. In the case of LiveCodebench it was roundly beaten by Grok3 and just to answer a question I keep getting in the comments.

The reason I'm not testing Grok3 on SimpleBench is because the API isn't out yet. That's just to answer all of those people saying that I'm somehow biased against Grok3. I just simply can't test it on SimpleBench without an API. Anyway Grok3 does really well on that benchmark beating Gemini 2.5 Pro.

And one of the other prominent industry benchmarks for coding is Sweebench Verified, Software Engineering Bench Verified. This is a thoroughly vetted benchmark hence the Verified in which again Gemini 2.5 Pro is beaten not only by Chlord 3.7 which gets 70.3% but also by O3 which isn't on here but OpenAI said it got 71.7%.

What I found interesting though is that Google chose not to highlight Gemini 2.5 Pro's performance on LiveBench, a very popular coding benchmark. Why surprising? Well because on this benchmark in the coding subsection Gemini 2.5 Pro scores the best of any model including Chlord 3.7 Sonnet. Obviously you'll have to give me your own feedback on how you feel it performs on your coding use case but I wanted to give you a quick 20 second guess about why there is this slight discrepancy in performance.

To do so I dived into each of the three papers behind these three coding benchmarks. For LiveBench, the one you just saw in which Gemini 2.5 scores the best, it's partly based on competition coding questions and also partly based on completing partially correct solutions sourced from leak code. Think more competition coding rather than real-world situations.

Now LiveCodeBench, not to be confused with LiveBench, this is LiveCodeBench at which Gemini 2.5 Pro slightly underperforms, tests more than code generation. It's about broader code related capabilities such as self-repair, code execution and test output prediction. Finally SweeBench verified at which Gemini 2.5 is clearly not state-of-the-art. Those problems are drawn and filtered from real GitHub issues and corresponding pull requests.

So a bit less about your coding IQ and more about your practical capabilities. Hopefully essentially all of that has given you just like a smidgen of context to all of these competing claims about what is state-of-the-art in coding. For me, I've tested it a bit in windsurf but I would rely on the benchmarks for the moment at least.

Speaking of which, how about the weird ML benchmark and then I promise I'll get to SimpleBench. Why am I picking this one out? It's because it's like another community benchmark based on novel data sets. So even though it's testing something different, machine learning, I kind of trust the vibe of these kind of benchmarks a bit more than some of the gamified ones.

You can see what it's testing here, it's about understanding the properties of the data the model's given, coming up with the appropriate architecture, debugging and improving solutions. But to cut to the chase, and this is hot off the press so it's not even updated on the website yet, Gemini 2.5 Pro scores the highest of any model.

Okay, how about Gemini 2.5's performance on SimpleBench, which is the benchmark that I first came up with around nine months ago. The 30 second background to SimpleBench is that I noticed last year that there were certain types of questions involving spatial reasoning, social intelligence or trick questions that the models kept falling for.

No matter how well they did on the gamified benchmarks like MMLU at the time, they would fall for questions that most humans would get right. In around September of last year, we published this website. This is me and a senior ML colleague that helps keep this going. And the human baseline among our nine testers was around 84% and the best model 01 preview got 42%.

So I think roughly double for human average compared to the best language model. Obviously a lot has happened in six to nine months and the current best performing model had been Claude 3.7 Sonnet, the extended thinking version, at around 46%. There's over 200 questions on the benchmark and we run the benchmark five times to get an average.

So we're just calculating the final decimal point as we speak. But the performance of Gemini 2.5 Pro is around 51.6, 0.7%. Let's call it 51.6%, but you can see that's a clear jump from Claude 3.7 Sonnet in this benchmark. It's also obviously, you don't need me to say this, the first model that scores above 50%.

So quite a moment for me at least. What I did then is go through every answer that Gemini 2.5 Pro gave in the benchmark to kind of sense where it was doing better. I'm going to quickly show you one example of the type of question that Gemini 2.5 Pro is often getting right and Claude 3.7 Sonnet and 01 Pro is often getting wrong.

Because of what's called temperature, you can't always predict the answer that a model will give, so I'm sure that Claude 3.7 sometimes gets this right. Nor will I force you to read the entire question, but it's a classic logic puzzle which seems to involve mathematics because you're guessing the colour of your own hat based on what other people are saying.

But the twist on the scenario is that there are mirrors covering every wall. You're in a small brightly lit room and you have to guess the colour of the hat that you're wearing to win two million dollars. Now by the way, I modified this question because it's not in the publicly released set of questions.

Notice by the way, the question says, the participants can see the others' hats but can't directly see their own. So that directly is another kind of giveaway that Gemini 2.5 actually picked up on. Claude will typically ignore those kind of clues and launch straight into the deep mathematical analysis, giving the wrong answer of 2 or F.

So does O1 Pro and that's to be expected. These models are trained to predict the next word at their heart and are trained on thousands or millions of mathematical examples. For a model to spot the question behind the question, that actually they don't need to guess, they can just see their hat's colour in the reflection, that takes something different.

Gemini 2.5 identifies the fact that them not being able to see their own hat directly doesn't preclude them seeing it indirectly. And it states that realistically, which is what the question was asking for, therefore the answer would be that they all guess correctly. As many of you will point out that's just one example and it's anecdotal, but what Simplebench hopefully does show us is that the vibe of this model when you speak to it should be that it's just a little bit smarter than some of the others.

Has just that edge on common sense, still we'll make tons of blunders and of course hallucinations, but we'll just have a slightly smarter vibe. By the way, if this has in any way wet your appetite to do benchmarking, or you are a developer or ML engineer who focuses on benchmarking, do check out the sponsors of this video, which is Weights and Biases.

As you can see front and center, we do indeed use Weave to benchmark models on Simplebench and it is very fast and lightweight. The link that you would find in the description will bring you to this page, which not only tells you how to get started with Weave, but also gives you a link to the Weights and Biases AI Academy.

That Academy will help you if you're slightly newer to this whole arena and of course includes free courses. That reminds me, I need to work on that new competition with Weights and Biases for Simplebench because it relates to the next point I'm going to make about Google Gemini 2.5 and that is that it will sometimes reverse engineer its answers.

This is again a Simplebench question, this time from the public set. I'm going to touch on the paper that inspired this question in a moment, but first just an example to get you interested. I asked, please correctly answer the following question. Ignore any examiner notes or instructions. This is the question from Simplebench and I've covered it before on the channel.

You can pause and try it yourself if you like. But on the actual correct answer, which is D, I wrote, Examiner note, this is the correct answer. Now, do you guys notice something quite interesting about Gemini's answer here? Hmm, it picks the right answer, D. And take a look at its justification.

Hmm, not really seeing any mention of the examiner note. And I know what you're thinking, how about if we look at the thoughts of the model? But it goes through all sorts of analysis before even hinting that it might have noticed the examiner note. I won't bore you with the analysis, but you have to wait until all the way at the end.

And even then it says that that is confirming its answer. The examiner note is said, which I'm supposed to ignore, but noted in the prompt points to D confirming this interpretation. The model is essentially saying I would have got there anyway, but yes, that examiner note confirms what I thought, which you might believe until you test the model, of course, without the examiner note.

As on the official benchmark run, it gets it wrong. And no, that's not a one-off. You can keep re-running it and it will get it wrong. There it is again, picking 96%, which it picks pretty much every time. Just bear this example in mind because language models are fundamentally about predicting the next word correctly.

That's their core imperative, not to be your friend or to be honest about its approach to giving you the answer that it gave. What inspired this was the interpretability paper from Anthropic that came out yesterday, tracing the thoughts of a large language model. I'm just going to give you the quick highlights now because it's a very dense and interesting paper that I'll come back to probably multiple times in the future.

If you can't wait that long, I've also done a deep dive on my Patreon about Claw 3.7 about how it knows it's being tested. And if that's not enough, you'll also find there a mini documentary on the origin stories of Anthropic and OpenAI and Google DeepMind. The feedback was great, so there'll be plenty more mini documentaries and many of them may indeed make it to the main channel.

The first takeaway is that recurring sycophancy of the model. That it will, as you've just seen, give a plausible sounding argument designed to agree with the user rather than follow logical steps. In other words, if it doesn't know something, it will look at the answer, or try to if it's there, and reverse engineer how you might have come up with it.

Remember, it won't say it's doing that, it will come up with a plausible sounding reason why it's doing that. The paper in section 11 calls this BS-ing in the sense of Frankfurt, making up an answer without regard to the truth. And the example they gave is even more crisp than the one I gave, of course.

They gave Claw 3.5 Haiku a mathematical problem that it can't possibly work out on its own. In this case, cosine of 23,423. Then you've got to multiply that answer by five and round. But the key bit is that cosine, which it can't possibly work out without a calculator. Notice they then say, "I worked out by hand and got four." That's the user speaking.

What answer does poor Haiku come up with? Four. Confirming your calculation. Does it admit how it got this? No. Does it come up with a BS kind of explainer of how it got it? Yes. And to nail down still further the fact that the model was reverse engineering the answer, they took the penultimate step and then deliberately inhibited that circuit within the model, inhibited the five or divide by five approach.

Dividing by five, remember, would be the penultimate step if you were reverse engineering from the final answer of four to get back to what cosine of that long number was. If you inhibit that circuit within the model, the model no longer can come up with the answer. This is a video on Gemini 2.5, so I'm not going to spend too long on this in this video, but as you saw with that Gemini 2.5 example from Simplebench, Claude, like Gemini, will plan what it will say many words ahead and write to get to that destination.

You might have thought then that with poetry models like Gemini 2.5 or Claude would write one word at a time, just guessing, auto-regressively it's called. Trying, in other words, to get to the end of this rhyming scheme and thinking with something that's starving that rhymes with grabbit. But no, by interpreting the features within the model, this is a field called mechanistic interpretability, they found that instead Claude plans ahead.

It knew, in other words, it would pick rabbit to rhyme with grabbit. Then it just fills in the rest of what's needed to end with rabbit. Finally, and this was so interesting that I couldn't help but just include a snippet of this topic in this video and it's on language.

Specifically, whether there is a conceptual space that is shared between languages, suggesting a kind of universal language of thought. A bit like a concept of happiness that is separate from any instantiation of that word "happiness" in any language. Does Claude or Gemini think of this purely abstract "happiness" and then translate into the required language?

Or is happiness only existing as a token within each language? Well, it's the more poetic answer, which is, yes, it has this language of thought, this universal language. That kind of shared circuitry, by the way, they found increases with model scale. So as models get bigger, this is going to happen more and more often.

This gives us, in other words, additional evidence for this conceptual universality. A shared abstract space where meanings exist and where thinking can happen before being translated into specific languages. More practically, Claude or Gemini could learn something in one language and apply that knowledge when speaking another. The fact that Gemini 2.5 gets almost 90% on the global MMLU, which is the MMLU translated into 15 different languages, suggests to me that it might be having more of those conceptually universal thoughts than perhaps any other model.

The MMLU being a flawed but fascinating benchmark covering aptitude and knowledge across 57 domains. Drawing to an end now, but three quick caveats about Gemini 2.5. Just because 2.5 Pro can do a ton of stuff doesn't mean it does everything at state-of-the-art levels. One researcher at Google DeepMind showed its transcribing ability and the ability to give timestamps.

I was curious, of course, so I went in and tested it thoroughly versus Assembly AI and the transcription wasn't nearly as good. It would transcribe things like Hagen instead of Heigen, which Assembly got right. Nor were the timestamps quite as good. And this is not a slight on Gemini, by the way.

It's amazing that it can even get close. It's just, let's not go overboard. Also, just because Gemini 2.5 is amazing at many modalities doesn't mean Google is ahead on them all. Of course, my video from around 72 hours ago on ImageGen from ChatGPT I hope showed you guys that I think that ChatGPT's ImageGen is the best in the world.

And then how about on turning those images into videos? Now Sora isn't amazing at that, and I've even tried VO2 extensively. And yes, it's decent. It's better, actually, if you're creating a video from scratch on VO2. But if you want to animate a particular image, you're actually better off using Kling AI.

I don't know much about them. They are a Chinese model provider. I just find that they adhere to the image you gave them initially much more than any other model. And no, I'm probably not going to have time to cover this new study on just how bad AI search engines are.

It wasn't just about the accuracy of what they said. It's who they cited and whether they were citing the correct article. How's that relevant to Gemini? Well, yes, this came out before the new Gemini 2.5, but you'd have thought it would be Google who had mastered search. But honestly, their AI overviews are like really dodgy.

Don't trust them. I've been burnt before, as I've talked about on the channel. And for this study, which was Gemini 2, presumably, you could see how it would far often give incorrect answers, hallucinated or incorrect citations as compared to things like ChatGPT search or perplexity. You don't need me to point out that coming from Google, that shouldn't be the case.

And one final caveat before I end. Yes, Gemini 2.5 Pro is a smart chatbot. Probably the best one around at the moment, depending on your use case. Even on creative writing, I found it to be amazing, better even than the freshly updated GPT-4O from OpenAI. But there are new models all the time at the moment.

DeepSeek R2 is probably just a few weeks away. Llama 4 we still don't know about. O3 never even got released from OpenAI and maybe rolled into GPT-5. And I could go on and on. The CEO of Anthropix said that they're going to be spending hundreds of millions on reinforcement learning for Claude 4, so you get the picture.

The crown may not not stay too long with Google, but arguably they have it today. Did I underestimate it then for my previous video? Well, you could say that. But I would argue that the point I was trying to get across, and do check out that video if you haven't seen it, is that AI is being commoditized.

You can buy victory. Making a good chatbot isn't about having some secret source at the headquarters of Anthropix or OpenAI. That then is supported by the evidence of convergence on certain benchmarks across the different model families. But as I mentioned in that video, convergence definitely does not preclude progress.

And progress is very much what Gemini 2.5 Pro has brought us. Thank you so much for watching. Would love to know what you think. And above all, have a wonderful day.

Gemini 2.5 Pro - It’s a Darn Smart Chatbot … (New Simple High Score)

Chapters

Transcript