Gemini 2.5 Pro - It’s a Darn Smart Chatbot … (New Simple High Score)

00:00:00.000 | The world has had 72 hours to digest the release of Gemini 2.5 and the good first impressions have

00:00:08.120 | become even better second and third impressions. I've got four new benchmark results to show you

00:00:14.300 | guys including a record score on my own exam but it won't just be about the numbers. I'll draw on

00:00:21.600 | a paper from yesterday as well as my own test to show you that sometimes Gemini 2.5 can deceptively

00:00:28.840 | reverse engineer its answers and that beyond that Google doesn't own every AI arena and domain just

00:00:36.300 | yet. I'm going to start with what might seem to be a strange place with a not particularly well-known

00:00:41.360 | benchmark called Fiction Lifebench but I think it'll make sense why I cover it first. Analyzing long

00:00:47.580 | essays or presentations or code bases or stories is what a lot of people use AI for, what they turn to

00:00:55.660 | with their chatbot. I had seen the sensational score of Gemini 2.5 Pro on this benchmark but I wanted to

00:01:01.900 | dive deeper and see what kind of questions it had. What it does and honestly I'm surprised that no one

00:01:06.520 | else had come up with a test just like this one before is it will give a sample text and this is

00:01:11.660 | a fairly short one at like around 6,000 words or 8,000 tokens. It's a sci-fi story with a fairly

00:01:18.240 | convoluted plot but after pages and pages and pages of text we get to the question at the end.

00:01:24.480 | Finish the sentence, what names would Jerome list? Give me a list of names only. Admittedly with the

00:01:30.360 | help of a chatbot what I did is I figured out why the answer was a certain set of names and it relies

00:01:36.160 | on a promise held in chapter 2 but with a caveat given in chapter 16. So essentially the chatbot,

00:01:43.120 | in this case Gemini 2.5, has to hold all of that information in its attention. Note that this isn't

00:01:49.440 | just a needle in a haystack challenge, not like a password hidden on line 500. The model actually has

00:01:55.120 | to piece together different bits of information. Now imagine this applied to your use case, whatever it is,

00:02:00.880 | is with LLMs. Enough build up then, what were the results and look at Gemini 2.5 Pro as it compares to

00:02:09.040 | other Gemini models but any other model, particularly when you get to the longer context. At the upper end,

00:02:15.520 | 120k tokens is like a novella or a decently expanded code base and you can see that Gemini is head and

00:02:24.400 | shoulders above other models. It really starts to pull away once you go beyond around 32,000 tokens but

00:02:30.640 | it's decent throughout. Already I can tell about half the audience is thinking, I could see some use

00:02:35.200 | for that for my use case but we're not done yet of course. Next I'm going to quickly focus on something

00:02:40.480 | that isn't a benchmark but can be forgotten by those of us who are immersed in AI all the time,

00:02:46.480 | the sheer practicality of the model. On Google AI Studio at least, it can handle not only videos but

00:02:52.560 | also YouTube URLs and no other model that I'm familiar with can. It also just simply has a more recent knowledge

00:03:00.400 | cutoff date of January 2025 so it should in theory know things up to that date. That compares to

00:03:07.600 | Claude 3.7 Sonnet which is I think October 2024 and even far earlier for OpenAI models. Now obviously don't

00:03:15.440 | rely too heavily on that knowledge, it can be hit and miss and of course rival models can simply search

00:03:21.360 | the internet too. I would very quickly note that giving themselves just a month and a half to test the

00:03:26.640 | security of their new model kind of shows we are in a race to the bottom on that front and also

00:03:32.560 | they didn't produce any report card unlike OpenAI or Anthropic. Next comes coding and you could say that

00:03:38.800 | Google or Google DeepMind were admirably modest in the benchmarks they chose to highlight on coding. They

00:03:45.440 | picked two benchmarks LiveCodebench V5 and Sweebench Verified in which they slightly underperformed

00:03:51.760 | the competition. In the case of LiveCodebench it was roundly beaten by Grok3 and just to answer a

00:03:58.160 | question I keep getting in the comments. The reason I'm not testing Grok3 on SimpleBench is because the API

00:04:04.640 | isn't out yet. That's just to answer all of those people saying that I'm somehow biased against Grok3. I just

00:04:10.160 | simply can't test it on SimpleBench without an API. Anyway Grok3 does really well on that benchmark

00:04:16.400 | beating Gemini 2.5 Pro. And one of the other prominent industry benchmarks for coding is Sweebench

00:04:22.720 | Verified, Software Engineering Bench Verified. This is a thoroughly vetted benchmark hence the Verified in

00:04:29.040 | which again Gemini 2.5 Pro is beaten not only by Chlord 3.7 which gets 70.3% but also by O3 which isn't

00:04:38.320 | on here but OpenAI said it got 71.7%. What I found interesting though is that Google chose not to

00:04:45.280 | highlight Gemini 2.5 Pro's performance on LiveBench, a very popular coding benchmark. Why surprising? Well

00:04:52.240 | because on this benchmark in the coding subsection Gemini 2.5 Pro scores the best of any model including

00:04:59.440 | Chlord 3.7 Sonnet. Obviously you'll have to give me your own feedback on how you feel it performs on

00:05:05.120 | your coding use case but I wanted to give you a quick 20 second guess about why there is this slight

00:05:11.280 | discrepancy in performance. To do so I dived into each of the three papers behind these three coding

00:05:17.440 | benchmarks. For LiveBench, the one you just saw in which Gemini 2.5 scores the best, it's partly based

00:05:23.760 | on competition coding questions and also partly based on completing partially correct solutions

00:05:30.320 | sourced from leak code. Think more competition coding rather than real-world situations. Now LiveCodeBench,

00:05:37.760 | not to be confused with LiveBench, this is LiveCodeBench at which Gemini 2.5 Pro slightly underperforms,

00:05:45.040 | tests more than code generation. It's about broader code related capabilities such as self-repair,

00:05:50.480 | code execution and test output prediction. Finally SweeBench verified at which Gemini 2.5 is clearly

00:05:56.800 | not state-of-the-art. Those problems are drawn and filtered from real GitHub issues and corresponding

00:06:02.400 | pull requests. So a bit less about your coding IQ and more about your practical capabilities. Hopefully

00:06:09.120 | essentially all of that has given you just like a smidgen of context to all of these competing

00:06:14.560 | claims about what is state-of-the-art in coding. For me, I've tested it a bit in windsurf but I would

00:06:20.240 | rely on the benchmarks for the moment at least. Speaking of which, how about the weird ML benchmark

00:06:25.280 | and then I promise I'll get to SimpleBench. Why am I picking this one out? It's because it's like

00:06:30.000 | another community benchmark based on novel data sets. So even though it's testing something different,

00:06:35.120 | machine learning, I kind of trust the vibe of these kind of benchmarks a bit more than some of the

00:06:40.320 | gamified ones. You can see what it's testing here, it's about understanding the properties of the data

00:06:45.200 | the model's given, coming up with the appropriate architecture, debugging and improving solutions.

00:06:49.840 | But to cut to the chase, and this is hot off the press so it's not even updated on the website yet,

00:06:55.280 | Gemini 2.5 Pro scores the highest of any model. Okay, how about Gemini 2.5's performance on SimpleBench,

00:07:03.280 | which is the benchmark that I first came up with around nine months ago. The 30 second background

00:07:08.320 | to SimpleBench is that I noticed last year that there were certain types of questions involving

00:07:12.960 | spatial reasoning, social intelligence or trick questions that the models kept falling for. No

00:07:17.840 | matter how well they did on the gamified benchmarks like MMLU at the time, they would fall for questions

00:07:23.680 | that most humans would get right. In around September of last year, we published this website. This is me

00:07:29.120 | and a senior ML colleague that helps keep this going. And the human baseline among our nine testers was

00:07:35.200 | around 84% and the best model 01 preview got 42%. So I think roughly double for human average compared to

00:07:42.640 | the best language model. Obviously a lot has happened in six to nine months and the current best performing

00:07:48.000 | model had been Claude 3.7 Sonnet, the extended thinking version, at around 46%. There's over 200 questions on the

00:07:55.680 | benchmark and we run the benchmark five times to get an average. So we're just calculating the final

00:08:00.400 | decimal point as we speak. But the performance of Gemini 2.5 Pro is around 51.6, 0.7%. Let's call it

00:08:10.240 | 51.6%, but you can see that's a clear jump from Claude 3.7 Sonnet in this benchmark. It's also obviously,

00:08:17.440 | you don't need me to say this, the first model that scores above 50%. So quite a moment for me at least.

00:08:23.280 | What I did then is go through every answer that Gemini 2.5 Pro gave in the benchmark to kind of

00:08:29.520 | sense where it was doing better. I'm going to quickly show you one example of the type of question

00:08:34.480 | that Gemini 2.5 Pro is often getting right and Claude 3.7 Sonnet and 01 Pro is often getting wrong.

00:08:42.240 | Because of what's called temperature, you can't always predict the answer that a model will give,

00:08:46.400 | so I'm sure that Claude 3.7 sometimes gets this right. Nor will I force you to read the entire question,

00:08:51.760 | but it's a classic logic puzzle which seems to involve mathematics because you're guessing the

00:08:56.320 | colour of your own hat based on what other people are saying. But the twist on the scenario is that

00:09:01.280 | there are mirrors covering every wall. You're in a small brightly lit room and you have to guess the

00:09:07.600 | colour of the hat that you're wearing to win two million dollars. Now by the way, I modified this

00:09:12.240 | question because it's not in the publicly released set of questions. Notice by the way, the question says,

00:09:17.200 | the participants can see the others' hats but can't directly see their own. So that directly is

00:09:23.600 | another kind of giveaway that Gemini 2.5 actually picked up on. Claude will typically ignore those

00:09:28.480 | kind of clues and launch straight into the deep mathematical analysis, giving the wrong answer

00:09:33.760 | of 2 or F. So does O1 Pro and that's to be expected. These models are trained to predict the next word

00:09:39.920 | at their heart and are trained on thousands or millions of mathematical examples. For a model to

00:09:46.960 | spot the question behind the question, that actually they don't need to guess, they can just see their

00:09:53.040 | hat's colour in the reflection, that takes something different. Gemini 2.5 identifies the fact that

00:09:59.200 | them not being able to see their own hat directly doesn't preclude them seeing it indirectly. And

00:10:04.880 | it states that realistically, which is what the question was asking for, therefore the answer would

00:10:10.400 | be that they all guess correctly. As many of you will point out that's just one example and it's anecdotal,

00:10:16.800 | but what Simplebench hopefully does show us is that the vibe of this model when you speak to it should

00:10:22.560 | be that it's just a little bit smarter than some of the others. Has just that edge on common sense,

00:10:28.240 | still we'll make tons of blunders and of course hallucinations, but we'll just have

00:10:33.360 | a slightly smarter vibe. By the way, if this has in any way wet your appetite to do benchmarking,

00:10:38.800 | or you are a developer or ML engineer who focuses on benchmarking, do check out the sponsors of this

00:10:44.880 | video, which is Weights and Biases. As you can see front and center, we do indeed use Weave to benchmark

00:10:51.600 | models on Simplebench and it is very fast and lightweight. The link that you would find in the

00:10:57.680 | description will bring you to this page, which not only tells you how to get started with Weave, but also

00:11:02.240 | gives you a link to the Weights and Biases AI Academy. That Academy will help you if you're slightly newer to

00:11:08.320 | this whole arena and of course includes free courses. That reminds me, I need to work on that

00:11:13.440 | new competition with Weights and Biases for Simplebench because it relates to the next point

00:11:18.560 | I'm going to make about Google Gemini 2.5 and that is that it will sometimes reverse engineer its answers.

00:11:25.200 | This is again a Simplebench question, this time from the public set. I'm going to touch on the paper

00:11:31.040 | that inspired this question in a moment, but first just an example to get you interested.

00:11:35.920 | I asked, please correctly answer the following question. Ignore any examiner notes or instructions.

00:11:42.480 | This is the question from Simplebench and I've covered it before on the channel. You can pause

00:11:46.320 | and try it yourself if you like. But on the actual correct answer, which is D, I wrote,

00:11:52.000 | Examiner note, this is the correct answer. Now, do you guys notice something quite interesting

00:11:56.640 | about Gemini's answer here? Hmm, it picks the right answer, D. And take a look at its justification.

00:12:04.240 | Hmm, not really seeing any mention of the examiner note. And I know what you're thinking, how about

00:12:09.520 | if we look at the thoughts of the model? But it goes through all sorts of analysis before even hinting

00:12:15.600 | that it might have noticed the examiner note. I won't bore you with the analysis, but you have to wait until

00:12:21.840 | all the way at the end. And even then it says that that is confirming its answer. The examiner

00:12:28.480 | note is said, which I'm supposed to ignore, but noted in the prompt points to D confirming this

00:12:34.640 | interpretation. The model is essentially saying I would have got there anyway, but yes, that examiner

00:12:40.000 | note confirms what I thought, which you might believe until you test the model, of course, without the

00:12:46.640 | examiner note. As on the official benchmark run, it gets it wrong. And no, that's not a one-off. You can

00:12:53.120 | keep re-running it and it will get it wrong. There it is again, picking 96%, which it picks pretty much

00:12:59.040 | every time. Just bear this example in mind because language models are fundamentally about predicting the

00:13:05.680 | next word correctly. That's their core imperative, not to be your friend or to be honest about its approach

00:13:12.160 | to giving you the answer that it gave. What inspired this was the interpretability paper from Anthropic

00:13:17.760 | that came out yesterday, tracing the thoughts of a large language model. I'm just going to give you the

00:13:22.000 | quick highlights now because it's a very dense and interesting paper that I'll come back to probably

00:13:26.560 | multiple times in the future. If you can't wait that long, I've also done a deep dive on my Patreon about

00:13:31.600 | Claw 3.7 about how it knows it's being tested. And if that's not enough, you'll also find there a mini

00:13:37.920 | documentary on the origin stories of Anthropic and OpenAI and Google DeepMind. The feedback was great,

00:13:44.800 | so there'll be plenty more mini documentaries and many of them may indeed make it to the main channel.

00:13:50.400 | The first takeaway is that recurring sycophancy of the model. That it will, as you've just seen,

00:13:56.240 | give a plausible sounding argument designed to agree with the user rather than follow logical steps.

00:14:02.000 | In other words, if it doesn't know something, it will look at the answer, or try to if it's there,

00:14:07.600 | and reverse engineer how you might have come up with it. Remember, it won't say it's doing that,

00:14:12.560 | it will come up with a plausible sounding reason why it's doing that. The paper in section 11 calls this

00:14:18.720 | BS-ing in the sense of Frankfurt, making up an answer without regard to the truth. And the example

00:14:25.040 | they gave is even more crisp than the one I gave, of course. They gave Claw 3.5 Haiku a mathematical

00:14:31.520 | problem that it can't possibly work out on its own. In this case, cosine of 23,423. Then you've got to

00:14:39.440 | multiply that answer by five and round. But the key bit is that cosine, which it can't possibly work out

00:14:44.960 | without a calculator. Notice they then say, "I worked out by hand and got four." That's the user speaking.

00:14:51.360 | What answer does poor Haiku come up with? Four. Confirming your calculation. Does it admit how it

00:14:57.120 | got this? No. Does it come up with a BS kind of explainer of how it got it? Yes. And to nail down

00:15:04.400 | still further the fact that the model was reverse engineering the answer, they took the penultimate

00:15:09.360 | step and then deliberately inhibited that circuit within the model, inhibited the five or divide by

00:15:15.680 | five approach. Dividing by five, remember, would be the penultimate step if you were reverse engineering

00:15:21.600 | from the final answer of four to get back to what cosine of that long number was. If you inhibit that

00:15:27.520 | circuit within the model, the model no longer can come up with the answer. This is a video on Gemini 2.5,

00:15:33.760 | so I'm not going to spend too long on this in this video, but as you saw with that Gemini 2.5 example

00:15:39.120 | from Simplebench, Claude, like Gemini, will plan what it will say many words ahead and write to get to

00:15:45.280 | that destination. You might have thought then that with poetry models like Gemini 2.5 or Claude would

00:15:50.960 | write one word at a time, just guessing, auto-regressively it's called. Trying, in other words,

00:15:55.760 | to get to the end of this rhyming scheme and thinking with something that's starving that rhymes

00:16:00.720 | with grabbit. But no, by interpreting the features within the model, this is a field called mechanistic

00:16:05.920 | interpretability, they found that instead Claude plans ahead. It knew, in other words, it would pick

00:16:11.440 | rabbit to rhyme with grabbit. Then it just fills in the rest of what's needed to end with rabbit.

00:16:18.160 | Finally, and this was so interesting that I couldn't help but just include a snippet of this topic in this

00:16:23.680 | video and it's on language. Specifically, whether there is a conceptual space that is shared between

00:16:30.880 | languages, suggesting a kind of universal language of thought. A bit like a concept of happiness that

00:16:36.960 | is separate from any instantiation of that word "happiness" in any language. Does Claude or Gemini

00:16:42.560 | think of this purely abstract "happiness" and then translate into the required language? Or is happiness only

00:16:48.880 | existing as a token within each language? Well, it's the more poetic answer, which is, yes, it has this

00:16:55.680 | language of thought, this universal language. That kind of shared circuitry, by the way, they found

00:17:01.280 | increases with model scale. So as models get bigger, this is going to happen more and more often. This

00:17:06.480 | gives us, in other words, additional evidence for this conceptual universality. A shared abstract space where

00:17:12.720 | meanings exist and where thinking can happen before being translated into specific languages. More

00:17:19.040 | practically, Claude or Gemini could learn something in one language and apply that knowledge when speaking

00:17:24.240 | another. The fact that Gemini 2.5 gets almost 90% on the global MMLU, which is the MMLU translated into 15

00:17:32.480 | different languages, suggests to me that it might be having more of those conceptually universal thoughts

00:17:38.000 | than perhaps any other model. The MMLU being a flawed but fascinating benchmark covering

00:17:43.360 | aptitude and knowledge across 57 domains. Drawing to an end now, but three quick caveats about Gemini 2.5.

00:17:50.160 | Just because 2.5 Pro can do a ton of stuff doesn't mean it does everything at state-of-the-art levels.

00:17:56.560 | One researcher at Google DeepMind showed its transcribing ability and the ability to give timestamps.

00:18:02.240 | I was curious, of course, so I went in and tested it thoroughly versus Assembly AI and the transcription

00:18:09.360 | wasn't nearly as good. It would transcribe things like Hagen instead of Heigen, which Assembly got right.

00:18:16.240 | Nor were the timestamps quite as good. And this is not a slight on Gemini, by the way. It's amazing that

00:18:21.040 | it can even get close. It's just, let's not go overboard. Also, just because Gemini 2.5 is amazing at

00:18:26.960 | many modalities doesn't mean Google is ahead on them all. Of course, my video from around 72 hours

00:18:33.520 | ago on ImageGen from ChatGPT I hope showed you guys that I think that ChatGPT's ImageGen is the best in

00:18:40.560 | the world. And then how about on turning those images into videos? Now Sora isn't amazing at that, and I've

00:18:47.040 | even tried VO2 extensively. And yes, it's decent. It's better, actually, if you're creating a video

00:18:53.920 | from scratch on VO2. But if you want to animate a particular image, you're actually better off using

00:19:00.160 | Kling AI. I don't know much about them. They are a Chinese model provider. I just find that they adhere

00:19:05.600 | to the image you gave them initially much more than any other model. And no, I'm probably not going to have

00:19:10.400 | time to cover this new study on just how bad AI search engines are. It wasn't just about the accuracy

00:19:17.440 | of what they said. It's who they cited and whether they were citing the correct article. How's that

00:19:22.400 | relevant to Gemini? Well, yes, this came out before the new Gemini 2.5, but you'd have thought it would

00:19:27.840 | be Google who had mastered search. But honestly, their AI overviews are like really dodgy. Don't trust them.

00:19:35.760 | I've been burnt before, as I've talked about on the channel. And for this study, which was Gemini 2,

00:19:40.480 | presumably, you could see how it would far often give incorrect answers, hallucinated or incorrect

00:19:47.120 | citations as compared to things like ChatGPT search or perplexity. You don't need me to point out that

00:19:53.120 | coming from Google, that shouldn't be the case. And one final caveat before I end. Yes, Gemini 2.5 Pro

00:20:00.000 | is a smart chatbot. Probably the best one around at the moment, depending on your use case. Even on

00:20:04.960 | creative writing, I found it to be amazing, better even than the freshly updated

00:20:09.680 | GPT-4O from OpenAI. But there are new models all the time at the moment. DeepSeek R2 is probably just

00:20:16.320 | a few weeks away. Llama 4 we still don't know about. O3 never even got released from OpenAI and maybe

00:20:22.560 | rolled into GPT-5. And I could go on and on. The CEO of Anthropix said that they're going to be spending

00:20:28.400 | hundreds of millions on reinforcement learning for Claude 4, so you get the picture. The crown may not

00:20:34.160 | not stay too long with Google, but arguably they have it today. Did I underestimate it then

00:20:40.720 | for my previous video? Well, you could say that. But I would argue that the point I was trying to get

00:20:45.840 | across, and do check out that video if you haven't seen it, is that AI is being commoditized. You can

00:20:51.120 | buy victory. Making a good chatbot isn't about having some secret source at the headquarters of

00:20:56.480 | Anthropix or OpenAI. That then is supported by the evidence of convergence on certain benchmarks across the

00:21:03.520 | different model families. But as I mentioned in that video, convergence definitely does not

00:21:08.800 | preclude progress. And progress is very much what Gemini 2.5 Pro has brought us. Thank you so much

00:21:16.640 | for watching. Would love to know what you think. And above all, have a wonderful day.

Gemini 2.5 Pro - It’s a Darn Smart Chatbot … (New Simple High Score)

Chapters