While everyone else is focused on other stuff like Twitter spats, let's focus on the real news: the developments in AI, which I would say are accelerating. Particularly if you are Google, who have just released the latest version of Gemini 2.5 Pro, fairly unambiguously the best language model in the world.
For the majority of benchmarks, and yes, including my own SimpleBench, it beats out all other models, including Claude Opus 4, Grok 3 and OpenAI's o3, though we are expecting o3-pro from OpenAI fairly shortly. And that's before you get to the fact that it's quicker to respond, it's cheaper via the API, and it can ingest up to 1 million tokens.
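To give a flavour of what that looks like via the API, here is a minimal sketch using the google-genai Python SDK; note that the exact model string varies by release, and the file name and key are placeholders of my own, not anything from Google's docs.

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# With a roughly 1M-token window you can hand over an entire book
# or codebase in one request. "whole_codebase.txt" is hypothetical.
with open("whole_codebase.txt") as f:
    big_context = f.read()

resp = client.models.generate_content(
    model="gemini-2.5-pro",  # model ID may differ by release; check the docs
    contents=[big_context, "Summarise the architecture of this codebase."],
)
print(resp.text)
```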
That million-token context is four or five times more than most other models offer. Now, before we get too hyped up, though, there's a reason why the CEO of Google DeepMind, Demis Hassabis, responsible for Gemini, and the CEO of Google itself, Sundar Pichai, both said yesterday that they don't expect AGI before 2030. Now, sorry for those listening on the podcast, but take a look at these two lines here: which of the two vertical lines would you say is longer?
Well, Gemini 2.5 Pro, the latest version, 06-05 (yes, if you are not in America, that naming scheme is incredibly confusing): what do you think it says? It says that, at first glance, line A appears to be much longer than line B; however, this is a trick of the eye and they are the same length.
In fact, later on, the model doubles down by saying, you can test this yourself by placing a ruler up against the screen; you'll find they are identical in length. For those listening, they are pretty obviously not the same length. Now, of course, that is anecdotal, but there is a reason why Sundar Pichai said that in the near to medium term, Google will be hiring more workers, not firing them.
Of course, you can't always trust CEOs, which is why I'm going to dedicate the end portion of this video to investigating all those headlines you've been seeing recently about a white-collar bloodbath. I found that when you dig deeper, not everything is as it seems. Now, somewhat strangely, I want to start with an interview with the CEO of Google, Sundar Pichai, released in the last 18 hours on Lex Fridman's podcast.
Because the first half of this video is going to be about Gemini 2.5 Pro. But that's not even the biggest and best version of Gemini 2.5; that would be Gemini 2.5 Ultra, unavailable to practically anyone. So all these record benchmark scores you're going to see aren't even from their biggest and best model.
Each year, I sit and say, okay, we are going to throw 10x more compute over the course of next year at it, and will we see progress? Sitting here today, I feel like the year ahead will have a lot of progress. I think it's compute limited in this sense, right?
Like, you know, part of the reason you've seen us do Flash, Nano and Pro models, but not an Ultra model, is that for each generation, we feel like we've been able to get the Pro model to, I don't know, 80 to 90% of Ultra's capability. But Ultra would be a lot slower and a lot more expensive to serve.
But what we've been able to do is to go to the next generation and make the next generation's Pro as good as the previous generation's Ultra, and be able to serve it in a way that it's fast and you can use it, and so on. The models we all use the most are maybe a few months behind the maximum capability we can deliver, right?
Because that won't be the fastest, easiest to use, etc. But as the latest version of Gemini 2.5 Pro is apparently going to be a stable release used by hundreds of millions of people over the coming months, let's quickly dive into those benchmark results. On the right, by the way, you can see the results of the three iterations of Gemini 2.5 Pro.
To be clear, the latest one is what's going to be rolled out to everyone in the coming couple of weeks. On obscure knowledge, as tested by Humanity's Last Exam, it nudges out other models. For incredibly challenging science-based questions, it gets 86.4%, when PhDs in those respective domains get around 60%.
On very approximate gauges of hallucinations, it scores better than any other model. And on reading charts, visuals and other types of graphs, it's at least on par with o3, which is around four times more expensive and a lot slower than Gemini 2.5 Pro. Again, it's worth highlighting that Gemini 2.5 Pro is really the middle model of the Gemini series.
You may also notice that the vast majority of these record-breaking scores are on a single attempt. We haven't yet seen the Deep Think mode for Gemini 2.5 Pro, which would be roughly the equivalent of the multiple attempts or parallel trials that some of the other models utilize.
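Just to pin down what "multiple attempts" means in practice, here is a toy, hypothetical sketch: sample the model several times and majority-vote the answers. The ask_model callable is a stand-in for whatever completion call you use, not any real API.

```python
from collections import Counter

def best_of_k(ask_model, question, k=8):
    """Crude stand-in for 'parallel attempts': sample k independent
    answers and return the most common one (majority vote)."""
    answers = [ask_model(question) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```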
As for coding, the picture is a lot less clear. When you're talking about multiple languages, Gemini seems to do better, as judged by Aider's Polyglot benchmark. When you're talking about a slightly more software-engineering focus, like SWE-bench Verified, it seems like Claude is still very much in the lead. However, I will make a confession, which is that I was having an issue with connecting a domain on Firebase, which is Google on the backend.
Now, this was more to do with the app hosting infrastructure, but Firebase being a Google entity, you'd have thought Gemini would know the most about it. Now, I won't show you the full two-hour conversation, but I basically gave up with Gemini 2.5 Pro. This was, in fairness, the May instance of Gemini 2.5 Pro, but Claude Opus 4 was able to diagnose the issue almost immediately.
And I'm sure everyone who uses these models for coding will have similar anecdotes, where the benchmarks don't always reflect real-world usage. But while we are on benchmarks, what about my own benchmark, SimpleBench? Well, I am going to make a confession, which is that I thought the latest version of Gemini 2.5 Pro, the one from yesterday, would underperform.
Why did I think that? Well, because the first version of Gemini 2.5 Pro, the one I think from March, got 51.6%. But then when we tried the May version of Gemini 2.5 Pro, it was really hard to get a full run out of the model. I talked about this on Twitter, but on the one run where it agreed to actually answer the questions, I think it got around 47%. So I actually had a theory that I was going to come to you guys and gloat and be like, yeah, they're doing RL for coding and mathematics, but that's kind of eroding the common sense of the models, which would have shown how SimpleBench tests things that other benchmarks don't capture.
Unfortunately, what actually happened is that when we tested the very latest version of Gemini 2.5 Pro yesterday evening, we couldn't get a full five runs because of rate limiting, which is why we're not yet reporting the result. But based on the four runs we did get, it was averaging around 62%.
So my little theory about RL maximization just completely went out the window. No, but seriously, even based on four runs, you can see that performance is getting better and better and better across all model types. Hate to say it, but I genuinely think SimpleBench won't last much longer than maybe three to 12 months.
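For those curious what "a full five runs" involves, here is a rough, hypothetical sketch of that kind of loop, with simple exponential backoff for the rate limits that bit us. The ask_model callable, the RuntimeError stand-in, and the questions structure are all placeholders of mine, not the real SimpleBench harness.

```python
import time

def ask_with_backoff(ask_model, question, max_retries=5):
    """Retry a model call with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        try:
            return ask_model(question)
        except RuntimeError:          # stand-in for a rate-limit error
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
    return None                       # give up on this question

def run_benchmark(ask_model, questions, n_runs=5):
    """Score each full run, then average accuracy across runs."""
    scores = []
    for _ in range(n_runs):
        correct = sum(ask_with_backoff(ask_model, q) == expected
                      for q, expected in questions)
        scores.append(correct / len(questions))
    return sum(scores) / len(scores)
```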
We've got to talk about those job articles now. But if you want a bit more of a reflection about the kind of questions that Claude 4 and Gemini 2.5 Pro are now getting right, do check out this video on my Patreon. Suffice to say, though, that when we reach the moment that there are no text-based benchmarks for which the average human could beat frontier models, we will have crossed quite the Rubicon.
Sundar Pichai and Demis Hassabis, CEOs of Google and Google DeepMind, put the date of full AGI at just after 2030. Then you see stuff which obviously, you know, we are far from AGI too. So you have both experiences simultaneously happening to you. I'll answer your question, but I'll also throw out this.
I almost feel the term doesn't matter. What I know is by 2030, there'll be such dramatic progress. We'll be dealing with the consequences of that progress, both the positive externalities and the negative externalities that come with it in a big way by 2030. So that I strongly feel, right?
Whatever, we may be arguing about the term, or maybe Gemini can answer what that moment is in time in 2030. But I think the progress will be dramatic, right? So that I believe in. Now, please do let me take a moment to tell you about a tool that's available today and that yes, can utilize a variety of models, including Gemini 2.5.
That would be the sponsors of today's video, Emergent Mind, which I'd been using for around two years before they even sponsored the channel. What it allows me to do is just catch up on those trending papers that I may have missed otherwise, like this one. As you know, I read those papers in full myself, but sometimes I do miss a paper that is trending on Hacker News or X.
You can download these summaries as a PDF or in Markdown, or even listen to them as audio. The 2.5 Pro summaries are, appropriately, on the Pro plan, but anyway, link in the description. Now, on jobs: this week and last, I've been seeing plenty of articles like this one going viral on Twitter and Reddit.
"Has the decline of knowledge work begun?" asked The New York Times. For one LinkedIn executive, in a guest essay in The New York Times, it has already begun, with the bottom rung of the career ladder breaking. Now, obviously, I am one of the last people to underestimate the potential of AI and its impacts on the world of work.
But these stories were about what was happening now, not what might be coming in three to five years. So I wanted to ask, do they have any stats to back this stuff up? A lot of the articles cross-reference each other, but the one stat that they all seem to turn to is the fact that the unemployment rate for college graduates in the US has risen 30% since September 2022.
Not risen to 30%; it has risen 30%. That sounds pretty ominous, right? But let me give you two contextual facts. The first is that that 30% rise is from 2% to 2.6% for college graduates. That's versus 4% for all workers. So a tiny bit less dramatic when you hear it's 2.6%.
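To spell out the relative-versus-absolute distinction those headlines blur, here is the arithmetic in two lines:

```python
old, new = 2.0, 2.6                # US college-grad unemployment rate, percent

relative_rise = (new - old) / old  # 0.30 -> "has risen 30%"
absolute_rise = new - old          # 0.6 percentage points, not "risen to 30%"
```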
Now, I can just feel the rage building up among some of you. So let me just give you one more contextual fact and then my own thoughts. Because even though 2.6% unemployment rate for college grads in the US doesn't sound too dramatic, a 30% rise is pretty real. So I dug deep and looked at the data source that these articles were citing.
And you can see it here with the college graduates at, well, now it seems 2.7%. That is the line in red and it comes from March of this year. But if we zoom out, we can see that, for example, in 2010, it was 5% among all college graduates. Even in, what is this, 1992, it was 3.5%.
Don't worry, I am not in any way downplaying the impact of what's coming. I'm just saying it's a bit much to say the impact is already incredibly noticeable now. The other article that went viral was this one, "Behind the Curtain: A White-Collar Bloodbath", which heavily featured quotes from Dario Amodei, the CEO of Anthropic.
When the language is caveated, like "AI could wipe out half of all entry-level white-collar jobs over the next one to five years", it's actually quite hard to disagree. Given the way AI is accelerating, it's really hard to counter a "could" scenario. Amodei gets onto slightly more dangerous territory when he says most people are unaware that this is about to happen.
Others at Anthropic, like Sholto Douglas, are even more definitive. There's important distinctions to be made here. One is that I think we're near guaranteed at this point to have models that are capable of effectively automating any white-collar job by, like, '27, '28, and near guaranteed by the end of the decade.
This topic obviously deserves a full video on its own, but for me the necessary but not sufficient condition for white-collar automation would be the elimination of hallucinations and dumb mistakes that the models don't self-correct. If frontier models of 2027 and 2028 still make mistakes like this one even one percent of the time, then having a human in the loop to check for those mistakes would surely allow for massively increased productivity.
Which leads me personally to the whole calm-before-the-storm theory, which I first outlined on this channel in 2023. I said back then that we would first see a massive increase in productivity as humans complement the work of frontier AI. That's why I don't think this white-collar automation will happen, as Amodei says, in as little as a couple of years or less.
Now, I know what many of you are thinking: well, these CEOs would know far better than those of us on the outside. But I remember, almost two years to the day, Sam Altman saying, and I quote, "we won't be talking about hallucinations in 18 months to two years". That was on the world tour that he did after the release of GPT-4.
Well, almost exactly two years on from that quote, we get this in New Scientist: "AI hallucinations are getting worse and they're here to stay". Among other things, the article cites a stat on a benchmark called SimpleQA, which I've talked about before on the channel, where basically o3, the latest OpenAI model, hallucinates a bit more than previous models.
Then you guys might remember those viral articles about Klarna eliminating its customer service team so it could use AI instead. Now, very quietly, without the same fanfare, they've actually reversed that policy, saying that customers like talking to people instead. After getting rid of those 700 employees, it's now rehiring many human agents.
Duolingo, the language app, also said that it was going to rely on AI before backing down, reversing that policy and hiring more humans. Which leads me back to the whole calm-before-the-storm theory. While frontier language models are still weak at self-correcting their own hallucinations, humans can still complement their efforts and lead to more productivity overall.
That leads to a limited effect on the unemployment rate. I do know there are anecdotal examples of people losing their jobs to AI; trust me, I am aware of that and I have read those articles. But the net effect on the unemployment rate stays limited. This, of course, leads to more and more investment in AI and less and less regulation of it, as countries try to win the so-called AI race.
But then there might come a tipping point where models, using enough compute and having access to enough diverse methodologies for self-correction, finally stop making dumb mistakes and only miss things that are beyond their training data. Of course, at that point, and I've actually got a documentary covering this, endless amounts more data would be given to them through, for example, screen recording, mass surveillance or robotics data.
Then the complacency that might have set in throughout the remainder of the 2020s might be quickly upended. And to be honest, it's not like blue-collar work would be immune from the effects of AI automation for that much longer than white-collar work at that point. This is the fully autonomous Figure 02 humanoid robot.
So yes, I've probably pissed off both those who expect imminent upheaval and those who think LLMs are completely overhyped, but there you go, that's just my opinion of what is coming. While all of this is going on, of course, we get access to some pretty epic AI tools, like the brand new ElevenLabs v3 alpha.
Hey Jessica, have you tried the new Eleven V3? I just got it. The clarity is amazing. I can actually do whispers now, like this. Ooh, fancy. Check this out. I can do full Shakespeare now. As has been the theme of this video, though, ElevenLabs can't rest easy, because Google, with their native text-to-speech in Gemini 2.5 Flash, isn't that far behind.
Hey Jessica, have you tried the new Eleven V3? I just got it. The clarity is amazing. I can actually do whispers now, like this. Ooh, fancy. Check this out. I can do full Shakespeare now. To be, or not to be, that is the question.
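If you want to try Google's native text-to-speech yourself, here is a rough sketch using the google-genai Python SDK; the preview model name and the "Kore" voice reflect the docs at the time of writing and may well change, and the API key is a placeholder, so treat this as an assumption to verify.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

resp = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",       # preview TTS model (may change)
    contents="Hey Jessica, have you tried the new Eleven V3?",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],          # request audio, not text
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Kore"           # one of the prebuilt voices
                )
            )
        ),
    ),
)

# The response carries raw 24 kHz, 16-bit PCM audio bytes.
pcm_bytes = resp.candidates[0].content.parts[0].inline_data.data
with open("out.pcm", "wb") as f:
    f.write(pcm_bytes)
```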
Thank you so much for watching. Let me know what you think, as always, and have a wonderful day.