AI Accelerates: New Gemini Model + AI Unemployment Stories Analysed

Chapters
0:00 Introduction
2:04 Gemini 2.5 Ultra
3:34 Benchmarks
7:41 Pichai on AGI Date and Meaning
9:13 Jobs and AI Unemployment Fears
15:28 ElevenLabs v3
00:00:00.000 |
While everyone else is focused on other stuff like Twitter spats, let's focus on the real news, 00:00:06.580 |
the developments in AI, which I would say are accelerating. Particularly if you are Google 00:00:12.780 |
who have just released the latest version of Gemini 2.5 Pro, fairly unambiguously the best 00:00:20.300 |
language model in the world. For the majority of benchmarks, and yes, including my own SimpleBench, 00:00:25.920 |
it beats out all other models including Claude Opus 4, Grok 3 and OpenAI's O3. Though we are 00:00:33.140 |
expecting O3 Pro from OpenAI fairly shortly. And that's before you get to the fact that it's quicker 00:00:38.880 |
to respond, it's cheaper via the API, it can ingest up to 1 million tokens. That's 4 or 5 times more 00:00:46.940 |
than other models. 00:00:51.640 |
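If you want to try that long context yourself, here's a minimal sketch using Google's official google-genai Python SDK; treat the dated model alias as an assumption based on Google's date-based naming rather than gospel.

```python
# Minimal sketch: calling Gemini 2.5 Pro via the official google-genai SDK
# (pip install google-genai). The dated model alias below is an assumption
# based on Google's naming scheme; check the live model list before use.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # your Gemini API key

# With a 1M-token context window, very large inputs (whole codebases,
# long transcripts) can go in as a single request.
response = client.models.generate_content(
    model="gemini-2.5-pro-preview-06-05",
    contents="Summarise the following document: ...",
)
print(response.text)
```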
Now before we get too hyped up though, there's a reason why the CEO of Google DeepMind, Demis Hassabis, responsible for Gemini, and the CEO of Google itself, Sundar Pichai, 00:00:56.600 |
yesterday, both said that they don't expect AGI before 2030. Now sorry for those listening on the 00:01:02.060 |
podcast, but take a look at these two vertical lines here: which of the two would you say 00:01:07.440 |
is longer? Well, Gemini 2.5 Pro, the latest version, 0605. Yes, if you are not in America, 00:01:14.060 |
that naming scheme is incredibly confusing. But this latest version, what do you think it says? 00:01:18.280 |
It says, at first glance, line A appears to be much longer than line B. However, 00:01:23.320 |
this is a trick of the eye and they are the same length. In fact, later on, the model doubles down 00:01:28.700 |
by saying, you can test this yourself by placing a ruler up against the screen. You'll find they are 00:01:34.260 |
identical in length. For those listening, they are pretty obviously not the same length. Now, 00:01:39.160 |
of course, that is anecdotal, but there is a reason why Sundar Pichai said that in the near to 00:01:44.120 |
medium term, Google will be hiring more workers, not firing them. Of course, you can't always trust 00:01:49.840 |
CEOs, which is why I'm going to dedicate the end portion of this video to investigating all those 00:01:54.580 |
headlines you've been seeing recently about a white collar bloodbath. I found that when you dig deeper, 00:02:00.020 |
not everything is as it seems. Now, somewhat strangely, I want to start with an interview released 00:02:06.440 |
in the last 18 hours on Lex Fridman with the CEO of Google, Sundar Pichai. Because the first half of 00:02:13.120 |
this video is going to be about Gemini 2.5 Pro. But that's not even the biggest and best version of 00:02:19.240 |
Gemini 2.5, which is Gemini 2.5 Ultra, unavailable to practically anyone. So all these record benchmark 00:02:25.900 |
scores you're going to see, this isn't even their biggest and best model. Each year, I sit and say, 00:02:30.960 |
okay, we are going to throw 10x more compute over the course of next year at it, and will we see 00:02:35.620 |
progress? Sitting here today, I feel like the year ahead will have a lot of progress. I think it's 00:02:41.700 |
compute limited in this sense, right? Like, you know, part of the reason you've seen us do 00:02:46.220 |
Flash, Nano and Pro models, but not an Ultra model. It's like for each generation, we feel like 00:02:54.760 |
we've been able to get the Pro model at like, I don't know, 80-90% of Ultra's capability. But Ultra 00:03:01.440 |
would be a lot slower and a lot more expensive to serve. But what we've been able to do is to go to 00:03:10.740 |
the next generation and make the next generation's Pro as good as the previous generation's Ultra, 00:03:15.240 |
but be able to serve it in a way that it's fast and you can use it and so on. The models we all use the 00:03:21.500 |
most are maybe like a few months behind the maximum capability we can deliver, right? Because that 00:03:31.420 |
won't be the fastest, easiest to use, etc. But as the latest version of Gemini 2.5 Pro is apparently 00:03:37.580 |
going to be a stable release used by hundreds of millions of people over the coming months, let's 00:03:42.700 |
quickly dive into those benchmark results. On the right, by the way, you can see the results of the 00:03:48.880 |
three iterations of Gemini 2.5 Pro. To be clear, the latest one is what's going to be rolled out to 00:03:55.300 |
everyone in the coming couple of weeks. On obscure knowledge, as tested by Humanity's Last Exam, 00:04:01.180 |
it nudges out other models. For incredibly challenging science-based questions, it gets 86.4% 00:04:08.020 |
when PhDs in those respective domains get around 60%. On very approximate gauges of hallucinations, 00:04:15.880 |
it scores better than any other model. And on reading charts and visuals and other types of 00:04:22.200 |
graphs, it's at least on par with O3, which is around four times more expensive and a lot slower than 00:04:29.620 |
Gemini 2.5 Pro. Again, it's worth highlighting that Gemini 2.5 Pro is really the middle model of the Gemini 00:04:36.680 |
series. You may also notice that the vast majority of these record-breaking scores are on a single 00:04:43.440 |
attempt. We haven't yet seen the deep-think mode from Gemini 2.5 Pro. That would be roughly the 00:04:49.760 |
equivalent of the multiple attempts or parallel trials that some of the other models utilize. 00:04:54.740 |
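To make that parallel-trials idea concrete, here's a rough sketch of the self-consistency trick some labs use: sample the same question several times and keep the most common final answer. The ask_model callable is a hypothetical stand-in for whatever completion API you prefer.

```python
# Sketch of "parallel trials" via self-consistency: sample n independent
# answers to the same question, then majority-vote. `ask_model` is a
# hypothetical stand-in for any chat-completion call.
from collections import Counter
from typing import Callable

def majority_vote(ask_model: Callable[[str], str],
                  question: str, n: int = 5) -> str:
    # Ask the same question n times, independently.
    answers = [ask_model(question) for _ in range(n)]
    # The most common answer across the attempts wins.
    return Counter(answers).most_common(1)[0][0]
```

Deep-think style modes presumably do something richer than a raw vote, but the trade-off is the same: more compute per question, better single-benchmark numbers.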
As for coding, the picture is a lot less clear. When you're talking about multiple languages, 00:04:59.800 |
Gemini seems to do better as judged by Aider's polyglot benchmark. When you're talking about a 00:05:04.700 |
slightly more software engineering focus, like SWE-bench Verified, it seems like Claude is still very much in 00:05:10.920 |
the lead. However, I will make a confession, which is that I was having an issue with connecting a 00:05:16.200 |
domain on Firebase, which is Google-owned on the backend. Now, this was more to do with the app hosting 00:05:21.780 |
infrastructure, but you'd have thought that, Firebase being a Google entity, Gemini would know the most 00:05:26.980 |
about it. Now, I won't show you the full two-hour conversation, but I basically gave up with Gemini 2.5 00:05:33.300 |
Pro. This was, in fairness, the May instance of Gemini 2.5 Pro, but Claude 4 Opus was able to 00:05:39.440 |
diagnose the issue almost immediately. And I'm sure everyone who uses these models for coding will 00:05:44.540 |
have similar anecdotes, where the benchmarks don't always reflect real-world usage. But while we are on 00:05:50.720 |
benchmarks, what about my own benchmark, SimpleBench? Well, I am going to make a confession, which is 00:05:56.200 |
that I thought the latest version of Gemini 2.5 Pro, the one from yesterday, would underperform. 00:06:02.300 |
Why did I think that? Well, because the first version of Gemini 2.5 Pro, the one I think from 00:06:07.580 |
March, got 51.6%. But then when we tried the May version of Gemini 2.5 Pro, it was really hard to get 00:06:15.340 |
a full run out of the model. I talked about this on Twitter, but on the one run where it agreed to actually 00:06:21.020 |
answer the questions, I think it got around 47%. So I actually had a theory that I was going to come 00:06:26.240 |
to you guys and gloat and be like, yeah, they're doing RL for coding and mathematics, but that's kind 00:06:31.900 |
of eroding the common sense of the models. This shows how SimpleBench tests things that other benchmarks 00:06:37.640 |
don't capture. Unfortunately, what actually happened is that when we tested the very latest version of 00:06:43.780 |
Gemini 2.5 Pro yesterday evening, we couldn't get, because of rate limiting, a full five runs, 00:06:51.700 |
which is why we're not yet reporting the result. But based on the four runs we did get, it was averaging 00:06:57.600 |
around 62%. So my little theory about RL maximization just completely went out the window. 00:07:04.600 |
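As an aside, the standard way to limp through that kind of rate limiting is a retry loop with exponential backoff; here's a generic sketch, with run_once standing in for a single hypothetical benchmark pass.

```python
# Generic exponential-backoff retry for rate-limited API calls.
# `run_once` is a hypothetical single benchmark pass that may raise
# on an HTTP 429 (rate limit) or similar transient error.
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def run_with_backoff(run_once: Callable[[], T], max_retries: int = 6) -> T:
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return run_once()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(delay)
            delay *= 2  # double the wait after each failure
    raise AssertionError("unreachable")
```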
No, but seriously, even based on four runs, you can see that performance is getting better and better across all 00:07:10.820 |
model types. Hate to say it, but I genuinely think SimpleBench won't last much longer than maybe 00:07:17.280 |
three to 12 months. We've got to talk about those job articles now. But if you want a bit more of a 00:07:23.300 |
reflection about the kind of questions that Claude 4 and Gemini 2.5 Pro are now getting right, do check 00:07:29.300 |
out this video on my Patreon. Suffice to say, though, that when we reach the moment that there are no 00:07:33.900 |
text-based benchmarks for which the average human could beat frontier models, we will have crossed quite the 00:07:40.540 |
Rubicon. Sundar Pichai and Demis Hassabis, CEOs of Google and Google DeepMind, put the date of full AGI no earlier than 2030. 00:07:49.000 |
Then you see stuff which obviously, you know, we are far from AGI too. So you have both experiences 00:07:54.880 |
simultaneously happening to you. I'll answer your question, but I'll also throw out this. I almost 00:07:59.760 |
feel the term doesn't matter. What I know is by 2030, there'll be such dramatic progress. 00:08:06.100 |
We'll be dealing with the consequences of that progress, both the positive externalities and the 00:08:12.400 |
negative externalities that come with it in a big way by 2030. So that I strongly feel, right? Whatever, 00:08:18.700 |
we may be arguing about the term, or maybe Gemini can answer what that moment is in time in 2030. 00:08:24.120 |
But I think the progress will be dramatic, right? So that I believe in. 00:08:29.560 |
Now, please do let me take a moment to tell you about a tool that's available today and that yes, 00:08:34.560 |
can utilize a variety of models, including Gemini 2.5. That would be the sponsors of today's video, 00:08:40.040 |
Emergent Mind, which I've been using for around two years before they even sponsored the channel. 00:08:44.760 |
What it allows me to do is just catch up on those trending papers that I may have missed otherwise, 00:08:49.980 |
like this one. As you know, I read those papers in full myself, but sometimes I do miss a paper 00:08:56.200 |
that is trending on Hacker News or X. You can download these summaries as a PDF or in Markdown, 00:09:03.520 |
or even listen to them as audio. The 2.5 Pro summaries are, appropriately, on the Pro plan, 00:09:09.060 |
but anyway, link in the description. Now on jobs, this week and last, I've been seeing plenty of 00:09:14.620 |
articles like this one going viral on Twitter and Reddit. Has the decline of knowledge work begun? 00:09:21.060 |
asked the New York Times. For one LinkedIn executive, in a guest essay in the New York Times, 00:09:26.220 |
it has already begun with the bottom rung of the career ladder breaking. Now, obviously, 00:09:31.140 |
I am one of the last people to underestimate the potential of AI and its impacts on the world of 00:09:37.760 |
work. But these stories were about what was happening now, not what might be coming in three 00:09:42.940 |
to five years. So I wanted to ask, do they have any stats to back this stuff up? A lot of the articles 00:09:48.680 |
cross-reference each other, but the one stat that they all seem to turn to is the fact that the 00:09:52.500 |
unemployment rate for college graduates in the US has risen 30% since September 2022. Not risen to 30%, 00:09:59.720 |
has risen 30%. That sounds pretty ominous, right? But let me give you two contextual facts. The first 00:10:06.420 |
is that that 30% rise is from 2% to 2.6% for college graduates. That's versus 4% for all workers. So a 00:10:16.360 |
tiny bit less dramatic when you hear it as 2.6%. Now, I can just feel the rage building up among some of 00:10:22.320 |
you. So let me just give you one more contextual fact and then my own thoughts. Because even though 00:10:26.980 |
2.6% unemployment rate for college grads in the US doesn't sound too dramatic, a 30% rise is pretty 00:10:33.460 |
real. So I dug deep and looked at the data source that these articles were citing. And you can see it 00:10:40.040 |
here with the college graduates at, well, now it seems 2.7%. That is the line in red and it comes from 00:10:47.880 |
March of this year. But if we zoom out, we can see that, for example, in 2010, it was 5% among all college 00:10:58.400 |
graduates. Even in, what is this, 1992, it was 3.5%. Don't worry, I am not in any way downplaying the impact of 00:11:08.160 |
what's coming. I'm just saying it's a bit much to say the impact is already incredibly noticeable now. 00:11:14.660 |
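To spell out the arithmetic behind that 30% figure, using the numbers from those articles:

```python
# The arithmetic behind the "30% rise" headline (figures from the video).
base = 2.0     # college-grad unemployment rate, Sept 2022, in percent
current = 2.6  # rate the articles cite, in percent

relative_rise = (current - base) / base * 100  # 30.0 -> "risen 30%"
absolute_rise = current - base                 # 0.6 percentage points

print(f"{relative_rise:.0f}% relative, {absolute_rise:.1f} points absolute")
```

Same data, two very different-sounding headlines.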
The other article that went viral was this one, Behind the Curtain: A White Collar Bloodbath, which heavily featured quotes from 00:11:21.040 |
Dario Amodei, the CEO of Anthropic. When the language is caveated, like AI could wipe out half of all entry 00:11:29.040 |
level white collar jobs over the next one to five years, it's actually quite hard to disagree. The way AI is 00:11:35.280 |
accelerating, it's really hard to counter a 'could' scenario. Amodei gets onto slightly more dangerous 00:11:42.880 |
territory when he says most people are unaware that this is about to happen. Others at Anthropic, 00:11:49.200 |
like Sholto Douglas, are even more definitive. There's important distinctions to be made here. 00:11:54.080 |
One is that I think we're near guaranteed at this point to have effectively models that are capable 00:12:00.640 |
of automating any white collar job by like '27, '28, or near guaranteed by end of decade. 00:12:06.720 |
This topic obviously deserves a full video on its own, but for me the necessary but not sufficient 00:12:12.160 |
condition for white collar automation would be the elimination of hallucinations and dumb mistakes that 00:12:18.000 |
the models don't self-correct. If there is even a one percent chance that frontier models of 2027 and 00:12:23.680 |
2028 make mistakes like this one, then having a human in the loop to check for those mistakes would 00:12:29.840 |
surely allow for massively increased productivity. Which leads me personally to the whole calm before 00:12:35.680 |
the storm theory which I first outlined on this channel in 2023. I said back then that we would 00:12:41.520 |
first see a massive increase in productivity as humans complement the work of frontier AI. That's 00:12:48.000 |
why I don't think this white collar automation will happen as Amodei says in as little as a couple of 00:12:53.600 |
years or less. Now I know what many of you are thinking, well these CEOs would know far better than those of 00:12:59.040 |
us on the outside, but I remember almost two years to the day Sam Altman saying and I quote 00:13:04.880 |
"we won't be talking about hallucinations in 18 months to two years". That was on the world tour 00:13:10.720 |
that he did after the release of GPT-4. Well almost exactly two years on from that quote we get this in 00:13:17.280 |
New Scientist: "AI hallucinations are getting worse and they're here to stay". Among other things, the article 00:13:22.720 |
cites a stat on a benchmark called SimpleQA which I've talked about before on the channel where basically 00:13:28.000 |
O3, the latest OpenAI model, hallucinates a bit more than previous models. Then you guys might remember 00:13:33.520 |
those viral articles about Klarna eliminating its customer service team so it could use AI instead. Now 00:13:40.080 |
very quietly without the same fanfare they've actually reversed on that policy saying that customers like 00:13:45.920 |
talking to people instead. After getting rid of those 700 employees it's now rehiring many human agents. 00:13:52.640 |
Duolingo, the language app, also said that it was going to rely on AI before backing down and reversing 00:13:58.880 |
that policy, hiring more humans. Which leads me back to the whole calm before the storm theory. While frontier 00:14:04.320 |
language models are still weak at self-correcting their own hallucinations, the human can still complement 00:14:10.080 |
their efforts and lead to overall more productivity. This leads to limited effect on the unemployment rate. 00:14:16.240 |
I do know there are anecdotal examples about people losing their jobs to AI, trust me I am aware of that 00:14:21.040 |
and I have read those articles. But limited net effect on the unemployment rate. This of course leads to 00:14:26.560 |
more and more investment in AI and less and less regulation of AI as countries try to win the so-called AI race. 00:14:33.840 |
But then there might come a tipping point where models using enough compute, having access to 00:14:38.960 |
enough diverse methodologies for self-correction, finally stop making dumb mistakes and only miss 00:14:45.920 |
things that are beyond their training data. Of course at that point, and I've actually got a documentary 00:14:50.640 |
covering this, endless amounts more data would be given to them through for example screen recording, 00:14:55.360 |
mass surveillance or robotics data. Then the complacency that might have set in throughout the 00:15:01.280 |
remainder of the 2020s might be quickly upended. And to be honest, it's not like blue-collar work would be immune 00:15:08.720 |
from the effects of AI automation for that much longer than white-collar work at that point. This is the 00:15:15.680 |
fully autonomous Figure 02 humanoid robot. So yes, I've probably pissed off those who expect imminent upheaval 00:15:22.800 |
and those who think LLMs are completely overhyped, but there you go, that's just my opinion of what is coming. 00:15:28.240 |
While all of this is going on, of course, we get access to some pretty epic AI tools, 00:15:36.800 |
Hey Jessica, have you tried the new Eleven V3? 00:15:39.680 |
I just got it. The clarity is amazing. I can actually do whispers now, like this. 00:15:45.680 |
Ooh, fancy. Check this out. I can do full Shakespeare now. 00:15:49.680 |
As has been the theme of this video though, Eleven Labs can't rest easy because Google, with their native 00:16:10.720 |
text-to-speech in Gemini 2.5 Flash, isn't that far behind. 00:16:14.320 |
Hey Jessica, have you tried the new Eleven V3? 00:16:19.360 |
I just got it. The clarity is amazing. I can actually do whispers now, like this. 00:16:24.880 |
Ooh, fancy. Check this out. I can do full Shakespeare now. 00:16:28.400 |
I can do full Shakespeare now. To be, or not to be, that is the question. 00:16:36.240 |
Hey Jessica, have you tried the new Eleven V3? 00:16:38.480 |
I just got it. The clarity is amazing. I can actually do whispers now, like this. 00:16:44.240 |
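If you want to experiment with the Gemini side of that comparison yourself, here's a sketch of a native text-to-speech call via the google-genai SDK; the preview model name and config shape are assumptions based on Google's docs at the time, so verify them before relying on this.

```python
# Sketch: native TTS with Gemini 2.5 Flash via the google-genai SDK.
# The model name and config shape are assumptions from the preview docs.
import wave
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",  # assumed preview model name
    contents="Hey Jessica, have you tried the new Eleven V3?",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Kore"  # one of the prebuilt voices
                )
            )
        ),
    ),
)

# The API returns raw 24 kHz, 16-bit mono PCM; wrap it in a WAV header.
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("gemini_tts.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(24000)
    wav.writeframes(pcm)
```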
Thank you so much for watching. Let me know what you think, as always, and have a wonderful day.