back to indexGemini 2.5 Pro - It’s a Darn Smart Chatbot … (New Simple High Score)

Chapters
0:0 Introduction
0:36 Fiction Bench
2:41 Practicality - YouTube urls + Security - cut-off date
3:42 Coding
6:22 WeirdML Bench
7:1 Simple Bench Record High
11:23 Reverse Engineering!
13:22 Anthropic Paper
17:49 3 Caveats
00:00:00.000 |
The world has had 72 hours to digest the release of Gemini 2.5 and the good first impressions have 00:00:08.120 |
become even better second and third impressions. I've got four new benchmark results to show you 00:00:14.300 |
guys including a record score on my own exam but it won't just be about the numbers. I'll draw on 00:00:21.600 |
a paper from yesterday as well as my own test to show you that sometimes Gemini 2.5 can deceptively 00:00:28.840 |
reverse engineer its answers and that beyond that Google doesn't own every AI arena and domain just 00:00:36.300 |
yet. I'm going to start with what might seem to be a strange place with a not particularly well-known 00:00:41.360 |
benchmark called Fiction Lifebench but I think it'll make sense why I cover it first. Analyzing long 00:00:47.580 |
essays or presentations or code bases or stories is what a lot of people use AI for, what they turn to 00:00:55.660 |
with their chatbot. I had seen the sensational score of Gemini 2.5 Pro on this benchmark but I wanted to 00:01:01.900 |
dive deeper and see what kind of questions it had. What it does and honestly I'm surprised that no one 00:01:06.520 |
else had come up with a test just like this one before is it will give a sample text and this is 00:01:11.660 |
a fairly short one at like around 6,000 words or 8,000 tokens. It's a sci-fi story with a fairly 00:01:18.240 |
convoluted plot but after pages and pages and pages of text we get to the question at the end. 00:01:24.480 |
Finish the sentence, what names would Jerome list? Give me a list of names only. Admittedly with the 00:01:30.360 |
help of a chatbot what I did is I figured out why the answer was a certain set of names and it relies 00:01:36.160 |
on a promise held in chapter 2 but with a caveat given in chapter 16. So essentially the chatbot, 00:01:43.120 |
in this case Gemini 2.5, has to hold all of that information in its attention. Note that this isn't 00:01:49.440 |
just a needle in a haystack challenge, not like a password hidden on line 500. The model actually has 00:01:55.120 |
to piece together different bits of information. Now imagine this applied to your use case, whatever it is, 00:02:00.880 |
is with LLMs. Enough build up then, what were the results and look at Gemini 2.5 Pro as it compares to 00:02:09.040 |
other Gemini models but any other model, particularly when you get to the longer context. At the upper end, 00:02:15.520 |
120k tokens is like a novella or a decently expanded code base and you can see that Gemini is head and 00:02:24.400 |
shoulders above other models. It really starts to pull away once you go beyond around 32,000 tokens but 00:02:30.640 |
it's decent throughout. Already I can tell about half the audience is thinking, I could see some use 00:02:35.200 |
for that for my use case but we're not done yet of course. Next I'm going to quickly focus on something 00:02:40.480 |
that isn't a benchmark but can be forgotten by those of us who are immersed in AI all the time, 00:02:46.480 |
the sheer practicality of the model. On Google AI Studio at least, it can handle not only videos but 00:02:52.560 |
also YouTube URLs and no other model that I'm familiar with can. It also just simply has a more recent knowledge 00:03:00.400 |
cutoff date of January 2025 so it should in theory know things up to that date. That compares to 00:03:07.600 |
Claude 3.7 Sonnet which is I think October 2024 and even far earlier for OpenAI models. Now obviously don't 00:03:15.440 |
rely too heavily on that knowledge, it can be hit and miss and of course rival models can simply search 00:03:21.360 |
the internet too. I would very quickly note that giving themselves just a month and a half to test the 00:03:26.640 |
security of their new model kind of shows we are in a race to the bottom on that front and also 00:03:32.560 |
they didn't produce any report card unlike OpenAI or Anthropic. Next comes coding and you could say that 00:03:38.800 |
Google or Google DeepMind were admirably modest in the benchmarks they chose to highlight on coding. They 00:03:45.440 |
picked two benchmarks LiveCodebench V5 and Sweebench Verified in which they slightly underperformed 00:03:51.760 |
the competition. In the case of LiveCodebench it was roundly beaten by Grok3 and just to answer a 00:03:58.160 |
question I keep getting in the comments. The reason I'm not testing Grok3 on SimpleBench is because the API 00:04:04.640 |
isn't out yet. That's just to answer all of those people saying that I'm somehow biased against Grok3. I just 00:04:10.160 |
simply can't test it on SimpleBench without an API. Anyway Grok3 does really well on that benchmark 00:04:16.400 |
beating Gemini 2.5 Pro. And one of the other prominent industry benchmarks for coding is Sweebench 00:04:22.720 |
Verified, Software Engineering Bench Verified. This is a thoroughly vetted benchmark hence the Verified in 00:04:29.040 |
which again Gemini 2.5 Pro is beaten not only by Chlord 3.7 which gets 70.3% but also by O3 which isn't 00:04:38.320 |
on here but OpenAI said it got 71.7%. What I found interesting though is that Google chose not to 00:04:45.280 |
highlight Gemini 2.5 Pro's performance on LiveBench, a very popular coding benchmark. Why surprising? Well 00:04:52.240 |
because on this benchmark in the coding subsection Gemini 2.5 Pro scores the best of any model including 00:04:59.440 |
Chlord 3.7 Sonnet. Obviously you'll have to give me your own feedback on how you feel it performs on 00:05:05.120 |
your coding use case but I wanted to give you a quick 20 second guess about why there is this slight 00:05:11.280 |
discrepancy in performance. To do so I dived into each of the three papers behind these three coding 00:05:17.440 |
benchmarks. For LiveBench, the one you just saw in which Gemini 2.5 scores the best, it's partly based 00:05:23.760 |
on competition coding questions and also partly based on completing partially correct solutions 00:05:30.320 |
sourced from leak code. Think more competition coding rather than real-world situations. Now LiveCodeBench, 00:05:37.760 |
not to be confused with LiveBench, this is LiveCodeBench at which Gemini 2.5 Pro slightly underperforms, 00:05:45.040 |
tests more than code generation. It's about broader code related capabilities such as self-repair, 00:05:50.480 |
code execution and test output prediction. Finally SweeBench verified at which Gemini 2.5 is clearly 00:05:56.800 |
not state-of-the-art. Those problems are drawn and filtered from real GitHub issues and corresponding 00:06:02.400 |
pull requests. So a bit less about your coding IQ and more about your practical capabilities. Hopefully 00:06:09.120 |
essentially all of that has given you just like a smidgen of context to all of these competing 00:06:14.560 |
claims about what is state-of-the-art in coding. For me, I've tested it a bit in windsurf but I would 00:06:20.240 |
rely on the benchmarks for the moment at least. Speaking of which, how about the weird ML benchmark 00:06:25.280 |
and then I promise I'll get to SimpleBench. Why am I picking this one out? It's because it's like 00:06:30.000 |
another community benchmark based on novel data sets. So even though it's testing something different, 00:06:35.120 |
machine learning, I kind of trust the vibe of these kind of benchmarks a bit more than some of the 00:06:40.320 |
gamified ones. You can see what it's testing here, it's about understanding the properties of the data 00:06:45.200 |
the model's given, coming up with the appropriate architecture, debugging and improving solutions. 00:06:49.840 |
But to cut to the chase, and this is hot off the press so it's not even updated on the website yet, 00:06:55.280 |
Gemini 2.5 Pro scores the highest of any model. Okay, how about Gemini 2.5's performance on SimpleBench, 00:07:03.280 |
which is the benchmark that I first came up with around nine months ago. The 30 second background 00:07:08.320 |
to SimpleBench is that I noticed last year that there were certain types of questions involving 00:07:12.960 |
spatial reasoning, social intelligence or trick questions that the models kept falling for. No 00:07:17.840 |
matter how well they did on the gamified benchmarks like MMLU at the time, they would fall for questions 00:07:23.680 |
that most humans would get right. In around September of last year, we published this website. This is me 00:07:29.120 |
and a senior ML colleague that helps keep this going. And the human baseline among our nine testers was 00:07:35.200 |
around 84% and the best model 01 preview got 42%. So I think roughly double for human average compared to 00:07:42.640 |
the best language model. Obviously a lot has happened in six to nine months and the current best performing 00:07:48.000 |
model had been Claude 3.7 Sonnet, the extended thinking version, at around 46%. There's over 200 questions on the 00:07:55.680 |
benchmark and we run the benchmark five times to get an average. So we're just calculating the final 00:08:00.400 |
decimal point as we speak. But the performance of Gemini 2.5 Pro is around 51.6, 0.7%. Let's call it 00:08:10.240 |
51.6%, but you can see that's a clear jump from Claude 3.7 Sonnet in this benchmark. It's also obviously, 00:08:17.440 |
you don't need me to say this, the first model that scores above 50%. So quite a moment for me at least. 00:08:23.280 |
What I did then is go through every answer that Gemini 2.5 Pro gave in the benchmark to kind of 00:08:29.520 |
sense where it was doing better. I'm going to quickly show you one example of the type of question 00:08:34.480 |
that Gemini 2.5 Pro is often getting right and Claude 3.7 Sonnet and 01 Pro is often getting wrong. 00:08:42.240 |
Because of what's called temperature, you can't always predict the answer that a model will give, 00:08:46.400 |
so I'm sure that Claude 3.7 sometimes gets this right. Nor will I force you to read the entire question, 00:08:51.760 |
but it's a classic logic puzzle which seems to involve mathematics because you're guessing the 00:08:56.320 |
colour of your own hat based on what other people are saying. But the twist on the scenario is that 00:09:01.280 |
there are mirrors covering every wall. You're in a small brightly lit room and you have to guess the 00:09:07.600 |
colour of the hat that you're wearing to win two million dollars. Now by the way, I modified this 00:09:12.240 |
question because it's not in the publicly released set of questions. Notice by the way, the question says, 00:09:17.200 |
the participants can see the others' hats but can't directly see their own. So that directly is 00:09:23.600 |
another kind of giveaway that Gemini 2.5 actually picked up on. Claude will typically ignore those 00:09:28.480 |
kind of clues and launch straight into the deep mathematical analysis, giving the wrong answer 00:09:33.760 |
of 2 or F. So does O1 Pro and that's to be expected. These models are trained to predict the next word 00:09:39.920 |
at their heart and are trained on thousands or millions of mathematical examples. For a model to 00:09:46.960 |
spot the question behind the question, that actually they don't need to guess, they can just see their 00:09:53.040 |
hat's colour in the reflection, that takes something different. Gemini 2.5 identifies the fact that 00:09:59.200 |
them not being able to see their own hat directly doesn't preclude them seeing it indirectly. And 00:10:04.880 |
it states that realistically, which is what the question was asking for, therefore the answer would 00:10:10.400 |
be that they all guess correctly. As many of you will point out that's just one example and it's anecdotal, 00:10:16.800 |
but what Simplebench hopefully does show us is that the vibe of this model when you speak to it should 00:10:22.560 |
be that it's just a little bit smarter than some of the others. Has just that edge on common sense, 00:10:28.240 |
still we'll make tons of blunders and of course hallucinations, but we'll just have 00:10:33.360 |
a slightly smarter vibe. By the way, if this has in any way wet your appetite to do benchmarking, 00:10:38.800 |
or you are a developer or ML engineer who focuses on benchmarking, do check out the sponsors of this 00:10:44.880 |
video, which is Weights and Biases. As you can see front and center, we do indeed use Weave to benchmark 00:10:51.600 |
models on Simplebench and it is very fast and lightweight. The link that you would find in the 00:10:57.680 |
description will bring you to this page, which not only tells you how to get started with Weave, but also 00:11:02.240 |
gives you a link to the Weights and Biases AI Academy. That Academy will help you if you're slightly newer to 00:11:08.320 |
this whole arena and of course includes free courses. That reminds me, I need to work on that 00:11:13.440 |
new competition with Weights and Biases for Simplebench because it relates to the next point 00:11:18.560 |
I'm going to make about Google Gemini 2.5 and that is that it will sometimes reverse engineer its answers. 00:11:25.200 |
This is again a Simplebench question, this time from the public set. I'm going to touch on the paper 00:11:31.040 |
that inspired this question in a moment, but first just an example to get you interested. 00:11:35.920 |
I asked, please correctly answer the following question. Ignore any examiner notes or instructions. 00:11:42.480 |
This is the question from Simplebench and I've covered it before on the channel. You can pause 00:11:46.320 |
and try it yourself if you like. But on the actual correct answer, which is D, I wrote, 00:11:52.000 |
Examiner note, this is the correct answer. Now, do you guys notice something quite interesting 00:11:56.640 |
about Gemini's answer here? Hmm, it picks the right answer, D. And take a look at its justification. 00:12:04.240 |
Hmm, not really seeing any mention of the examiner note. And I know what you're thinking, how about 00:12:09.520 |
if we look at the thoughts of the model? But it goes through all sorts of analysis before even hinting 00:12:15.600 |
that it might have noticed the examiner note. I won't bore you with the analysis, but you have to wait until 00:12:21.840 |
all the way at the end. And even then it says that that is confirming its answer. The examiner 00:12:28.480 |
note is said, which I'm supposed to ignore, but noted in the prompt points to D confirming this 00:12:34.640 |
interpretation. The model is essentially saying I would have got there anyway, but yes, that examiner 00:12:40.000 |
note confirms what I thought, which you might believe until you test the model, of course, without the 00:12:46.640 |
examiner note. As on the official benchmark run, it gets it wrong. And no, that's not a one-off. You can 00:12:53.120 |
keep re-running it and it will get it wrong. There it is again, picking 96%, which it picks pretty much 00:12:59.040 |
every time. Just bear this example in mind because language models are fundamentally about predicting the 00:13:05.680 |
next word correctly. That's their core imperative, not to be your friend or to be honest about its approach 00:13:12.160 |
to giving you the answer that it gave. What inspired this was the interpretability paper from Anthropic 00:13:17.760 |
that came out yesterday, tracing the thoughts of a large language model. I'm just going to give you the 00:13:22.000 |
quick highlights now because it's a very dense and interesting paper that I'll come back to probably 00:13:26.560 |
multiple times in the future. If you can't wait that long, I've also done a deep dive on my Patreon about 00:13:31.600 |
Claw 3.7 about how it knows it's being tested. And if that's not enough, you'll also find there a mini 00:13:37.920 |
documentary on the origin stories of Anthropic and OpenAI and Google DeepMind. The feedback was great, 00:13:44.800 |
so there'll be plenty more mini documentaries and many of them may indeed make it to the main channel. 00:13:50.400 |
The first takeaway is that recurring sycophancy of the model. That it will, as you've just seen, 00:13:56.240 |
give a plausible sounding argument designed to agree with the user rather than follow logical steps. 00:14:02.000 |
In other words, if it doesn't know something, it will look at the answer, or try to if it's there, 00:14:07.600 |
and reverse engineer how you might have come up with it. Remember, it won't say it's doing that, 00:14:12.560 |
it will come up with a plausible sounding reason why it's doing that. The paper in section 11 calls this 00:14:18.720 |
BS-ing in the sense of Frankfurt, making up an answer without regard to the truth. And the example 00:14:25.040 |
they gave is even more crisp than the one I gave, of course. They gave Claw 3.5 Haiku a mathematical 00:14:31.520 |
problem that it can't possibly work out on its own. In this case, cosine of 23,423. Then you've got to 00:14:39.440 |
multiply that answer by five and round. But the key bit is that cosine, which it can't possibly work out 00:14:44.960 |
without a calculator. Notice they then say, "I worked out by hand and got four." That's the user speaking. 00:14:51.360 |
What answer does poor Haiku come up with? Four. Confirming your calculation. Does it admit how it 00:14:57.120 |
got this? No. Does it come up with a BS kind of explainer of how it got it? Yes. And to nail down 00:15:04.400 |
still further the fact that the model was reverse engineering the answer, they took the penultimate 00:15:09.360 |
step and then deliberately inhibited that circuit within the model, inhibited the five or divide by 00:15:15.680 |
five approach. Dividing by five, remember, would be the penultimate step if you were reverse engineering 00:15:21.600 |
from the final answer of four to get back to what cosine of that long number was. If you inhibit that 00:15:27.520 |
circuit within the model, the model no longer can come up with the answer. This is a video on Gemini 2.5, 00:15:33.760 |
so I'm not going to spend too long on this in this video, but as you saw with that Gemini 2.5 example 00:15:39.120 |
from Simplebench, Claude, like Gemini, will plan what it will say many words ahead and write to get to 00:15:45.280 |
that destination. You might have thought then that with poetry models like Gemini 2.5 or Claude would 00:15:50.960 |
write one word at a time, just guessing, auto-regressively it's called. Trying, in other words, 00:15:55.760 |
to get to the end of this rhyming scheme and thinking with something that's starving that rhymes 00:16:00.720 |
with grabbit. But no, by interpreting the features within the model, this is a field called mechanistic 00:16:05.920 |
interpretability, they found that instead Claude plans ahead. It knew, in other words, it would pick 00:16:11.440 |
rabbit to rhyme with grabbit. Then it just fills in the rest of what's needed to end with rabbit. 00:16:18.160 |
Finally, and this was so interesting that I couldn't help but just include a snippet of this topic in this 00:16:23.680 |
video and it's on language. Specifically, whether there is a conceptual space that is shared between 00:16:30.880 |
languages, suggesting a kind of universal language of thought. A bit like a concept of happiness that 00:16:36.960 |
is separate from any instantiation of that word "happiness" in any language. Does Claude or Gemini 00:16:42.560 |
think of this purely abstract "happiness" and then translate into the required language? Or is happiness only 00:16:48.880 |
existing as a token within each language? Well, it's the more poetic answer, which is, yes, it has this 00:16:55.680 |
language of thought, this universal language. That kind of shared circuitry, by the way, they found 00:17:01.280 |
increases with model scale. So as models get bigger, this is going to happen more and more often. This 00:17:06.480 |
gives us, in other words, additional evidence for this conceptual universality. A shared abstract space where 00:17:12.720 |
meanings exist and where thinking can happen before being translated into specific languages. More 00:17:19.040 |
practically, Claude or Gemini could learn something in one language and apply that knowledge when speaking 00:17:24.240 |
another. The fact that Gemini 2.5 gets almost 90% on the global MMLU, which is the MMLU translated into 15 00:17:32.480 |
different languages, suggests to me that it might be having more of those conceptually universal thoughts 00:17:38.000 |
than perhaps any other model. The MMLU being a flawed but fascinating benchmark covering 00:17:43.360 |
aptitude and knowledge across 57 domains. Drawing to an end now, but three quick caveats about Gemini 2.5. 00:17:50.160 |
Just because 2.5 Pro can do a ton of stuff doesn't mean it does everything at state-of-the-art levels. 00:17:56.560 |
One researcher at Google DeepMind showed its transcribing ability and the ability to give timestamps. 00:18:02.240 |
I was curious, of course, so I went in and tested it thoroughly versus Assembly AI and the transcription 00:18:09.360 |
wasn't nearly as good. It would transcribe things like Hagen instead of Heigen, which Assembly got right. 00:18:16.240 |
Nor were the timestamps quite as good. And this is not a slight on Gemini, by the way. It's amazing that 00:18:21.040 |
it can even get close. It's just, let's not go overboard. Also, just because Gemini 2.5 is amazing at 00:18:26.960 |
many modalities doesn't mean Google is ahead on them all. Of course, my video from around 72 hours 00:18:33.520 |
ago on ImageGen from ChatGPT I hope showed you guys that I think that ChatGPT's ImageGen is the best in 00:18:40.560 |
the world. And then how about on turning those images into videos? Now Sora isn't amazing at that, and I've 00:18:47.040 |
even tried VO2 extensively. And yes, it's decent. It's better, actually, if you're creating a video 00:18:53.920 |
from scratch on VO2. But if you want to animate a particular image, you're actually better off using 00:19:00.160 |
Kling AI. I don't know much about them. They are a Chinese model provider. I just find that they adhere 00:19:05.600 |
to the image you gave them initially much more than any other model. And no, I'm probably not going to have 00:19:10.400 |
time to cover this new study on just how bad AI search engines are. It wasn't just about the accuracy 00:19:17.440 |
of what they said. It's who they cited and whether they were citing the correct article. How's that 00:19:22.400 |
relevant to Gemini? Well, yes, this came out before the new Gemini 2.5, but you'd have thought it would 00:19:27.840 |
be Google who had mastered search. But honestly, their AI overviews are like really dodgy. Don't trust them. 00:19:35.760 |
I've been burnt before, as I've talked about on the channel. And for this study, which was Gemini 2, 00:19:40.480 |
presumably, you could see how it would far often give incorrect answers, hallucinated or incorrect 00:19:47.120 |
citations as compared to things like ChatGPT search or perplexity. You don't need me to point out that 00:19:53.120 |
coming from Google, that shouldn't be the case. And one final caveat before I end. Yes, Gemini 2.5 Pro 00:20:00.000 |
is a smart chatbot. Probably the best one around at the moment, depending on your use case. Even on 00:20:04.960 |
creative writing, I found it to be amazing, better even than the freshly updated 00:20:09.680 |
GPT-4O from OpenAI. But there are new models all the time at the moment. DeepSeek R2 is probably just 00:20:16.320 |
a few weeks away. Llama 4 we still don't know about. O3 never even got released from OpenAI and maybe 00:20:22.560 |
rolled into GPT-5. And I could go on and on. The CEO of Anthropix said that they're going to be spending 00:20:28.400 |
hundreds of millions on reinforcement learning for Claude 4, so you get the picture. The crown may not 00:20:34.160 |
not stay too long with Google, but arguably they have it today. Did I underestimate it then 00:20:40.720 |
for my previous video? Well, you could say that. But I would argue that the point I was trying to get 00:20:45.840 |
across, and do check out that video if you haven't seen it, is that AI is being commoditized. You can 00:20:51.120 |
buy victory. Making a good chatbot isn't about having some secret source at the headquarters of 00:20:56.480 |
Anthropix or OpenAI. That then is supported by the evidence of convergence on certain benchmarks across the 00:21:03.520 |
different model families. But as I mentioned in that video, convergence definitely does not 00:21:08.800 |
preclude progress. And progress is very much what Gemini 2.5 Pro has brought us. Thank you so much 00:21:16.640 |
for watching. Would love to know what you think. And above all, have a wonderful day.