
AI Accelerates: New Gemini Model + AI Unemployment Stories Analysed


Chapters

0:00 Introduction
2:04 Gemini 2.5 Ultra
3:34 Benchmarks
7:41 AGI Date and Meaning (Pichai)
9:13 Jobs and AI Unemployment Fears
15:28 ElevenLabs v3

Whisper Transcript

00:00:00.000 | While everyone else is focused on other stuff like Twitter spats, let's focus on the real news,
00:00:06.580 | the developments in AI, which I would say are accelerating. Particularly if you are Google
00:00:12.780 | who have just released the latest version of Gemini 2.5 Pro, fairly unambiguously the best
00:00:20.300 | language model in the world. For the majority of benchmarks, and yes, including my own SimpleBench,
00:00:25.920 | it beats out all other models including Claude Opus 4, Grok 3 and OpenAI's O3. Though we are
00:00:33.140 | expecting O3 Pro from OpenAI fairly shortly. And that's before you get to the fact that it's quicker
00:00:38.880 | to respond, it's cheaper via the API, it can ingest up to 1 million tokens. That's 4 or 5 times more
00:00:46.940 | than other models. Now before we get too hyped up though, there's a reason why the CEO of Google
00:00:51.640 | DeepMind, Demis Hassabis, responsible for Gemini, and the CEO of Google itself, Sundar Pichai,
00:00:56.600 | yesterday, both said that they don't expect AGI before 2030. Now sorry for those listening on the
00:01:02.060 | podcast, but take a look at these two lines here: which of these two vertical lines would you say
00:01:07.440 | is longest? Well, Gemini 2.5 Pro, the latest version, 0605. Yes, if you are not in America,
00:01:14.060 | that naming scheme is incredibly confusing. But this latest version, what do you think it says?
00:01:18.280 | It says, at first glance, line A appears to be much longer than line B. However,
00:01:23.320 | this is a trick of the eye and they are the same length. In fact, later on, the model doubles down
00:01:28.700 | by saying, you can test this yourself by placing a ruler up against the screen. You'll find they are
00:01:34.260 | identical in length. For those listening, they are pretty obviously not the same length. Now,
00:01:39.160 | of course, that is anecdotal, but there is a reason why Sundar Pichai said that in the near to
00:01:44.120 | medium term, Google will be hiring more workers, not firing them. Of course, you can't always trust
00:01:49.840 | CEOs, which is why I'm going to dedicate the end portion of this video to investigating all those
00:01:54.580 | headlines you've been seeing recently about a white collar bloodbath. I found that when you dig deeper,
00:02:00.020 | not everything is as it seems. Now, somewhat strangely, I want to start with an interview released
00:02:06.440 | in the last 18 hours on Lex Fridman with the CEO of Google, Sundar Pichai. Because the first half of
00:02:13.120 | this video is going to be about Gemini 2.5 Pro. But that's not even the biggest and best version of
00:02:19.240 | Gemini 2.5, which is Gemini 2.5 Ultra, unavailable to practically anyone. So all these record benchmark
00:02:25.900 | scores you're going to see, this isn't even their biggest and best model. Each year, I sit and say,
00:02:30.960 | okay, we are going to throw 10x more compute over the course of next year at it, and will we see
00:02:35.620 | progress? Sitting here today, I feel like the year ahead will have a lot of progress. I think it's
00:02:41.700 | compute limited in this sense, right? Like, you know, we can all, part of the reason you've seen us do
00:02:46.220 | Flash, Nano and Pro models, but not an Ultra model. It's like for each generation, we feel like
00:02:54.760 | we've been able to get the pro model at like, I don't know, 80-90% of ultra's capability. But ultra
00:03:01.440 | would be a lot more slow and a lot more expensive to serve. But what we've been able to do is to go to
00:03:10.740 | the next generation and make the next generation's pro as good as the previous generation's ultra,
00:03:15.240 | but be able to serve it in a way that it's fast and you can use it and so on. The models we all use the
00:03:21.500 | most is maybe like a few months behind the maximum capability we can deliver, right? Because that
00:03:31.420 | won't be the fastest, easiest to use, etc. But as the latest version of Gemini 2.5 Pro is apparently
00:03:37.580 | going to be a stable release used by hundreds of millions of people over the coming months, let's
00:03:42.700 | quickly dive into those benchmark results. On the right, by the way, you can see the results of the
00:03:48.880 | three iterations of Gemini 2.5 Pro. To be clear, the latest one is what's going to be rolled out to
00:03:55.300 | everyone in the coming couple of weeks. On obscure knowledge as tested by Humanity's Last Exam,
00:04:01.180 | it nudges out other models. For incredibly challenging science-based questions, it gets 86.4%
00:04:08.020 | when PhDs in those respective domains get around 60%. On very approximate gauges of hallucinations,
00:04:15.880 | it scores better than any other model. And on reading charts and visuals and other types of
00:04:22.200 | graphs, it's at least on par with O3, which is around four times more expensive and a lot slower than
00:04:29.620 | Gemini 2.5 Pro. Again, it's worth highlighting that Gemini 2.5 Pro is really the middle model of the Gemini
00:04:36.680 | series. You may also notice that the vast majority of these record-breaking scores are on a single
00:04:43.440 | attempt. We haven't yet seen the Deep Think mode from Gemini 2.5 Pro. That would be roughly the
00:04:49.760 | equivalent of the multiple attempts or parallel trials that some of the other models utilize.
00:04:54.740 | As for coding, the picture is a lot less clear. When you're talking about multiple languages,
00:04:59.800 | Gemini seems to do better as judged by Aider's polyglot benchmark. When you're talking about a
00:05:04.700 | slightly more software engineering focus, like SWE-bench Verified, it seems like Claude is still very much in
00:05:10.920 | the lead. However, I will make a confession, which is that I was having an issue with connecting a
00:05:16.200 | domain on Firebase, which is Google on the backend. Now, this was more to do with the app hosting
00:05:21.780 | infrastructure, but you'd have thought as a Google entity, Firebase, that Gemini would know the most
00:05:26.980 | about it. Now, I won't show you the full two-hour conversation, but I basically gave up with Gemini 2.5
00:05:33.300 | Pro. This was, in fairness, the May instance of Gemini 2.5 Pro, but Claude 4 Opus was able to
00:05:39.440 | diagnose the issue almost immediately. And I'm sure everyone who uses these models for coding will
00:05:44.540 | have similar anecdotes, where the benchmarks don't always reflect real-world usage. But while we are on
00:05:50.720 | benchmarks, what about my own benchmark, SimpleBench? Well, I am going to make a confession, which is
00:05:56.200 | that I thought the latest version of Gemini 2.5 Pro, the one from yesterday, would underperform.
00:06:02.300 | Why did I think that? Well, because the first version of Gemini 2.5 Pro, the one I think from
00:06:07.580 | March, got 51.6%. But then when we tried the May version of Gemini 2.5 Pro, it was really hard to get
00:06:15.340 | a full run out of the model. I talked about this on Twitter, but in the one run where it agreed to actually
00:06:21.020 | answer the questions, I think it got around 47%. So I actually had a theory that I was going to come
00:06:26.240 | to you guys and gloat and be like, yeah, they're doing RL for coding and mathematics, but that's kind
00:06:31.900 | of eroding the common sense of the models. This shows how SimpleBench tests things that other benchmarks
00:06:37.640 | don't capture. Unfortunately, what actually happened is that when we tested the very latest version of
00:06:43.780 | Gemini 2.5 Pro yesterday evening, we couldn't get, because of rate limiting, a full five runs,
00:06:51.700 | which is why we're not yet reporting the result. But based on the four runs we did get, it was averaging
00:06:57.600 | around 62%. So my little theory about RL maximization just completely went out the window. No, but seriously,
00:07:04.600 | even based on four runs, you can see that performance is getting better and better and better across all
00:07:10.820 | model types. Hate to say it, but I genuinely think SimpleBench won't last much longer than maybe
00:07:17.280 | three to 12 months. We've got to talk about those job articles now. But if you want a bit more of a
00:07:23.300 | reflection about the kind of questions that Claude 4 and Gemini 2.5 Pro are now getting right, do check
00:07:29.300 | out this video on my Patreon. Suffice to say, though, that when we reach the moment that there are no
00:07:33.900 | text-based benchmarks for which the average human could beat frontier models, we will have crossed quite the
00:07:40.540 | Rubicon. Sundar Pichai and Demis Hassabis, CEOs of Google and Google DeepMind, put the date of full AGI
00:07:46.700 | at just after 2030.
00:07:49.000 | Then you see stuff which obviously, you know, we are far from AGI too. So you have both experiences
00:07:54.880 | simultaneously happening to you. I'll answer your question, but I'll also throw out this. I almost
00:07:59.760 | feel the term doesn't matter. What I know is by 2030, there'll be such dramatic progress.
00:08:06.100 | We'll be dealing with the consequences of that progress, both the positive externalities and the
00:08:12.400 | negative externalities that come with it in a big way by 2030. So that I strongly feel, right? Whatever,
00:08:18.700 | we may be arguing about the term, or maybe Gemini can answer what that moment is in time in 2030.
00:08:24.120 | But I think the progress will be dramatic, right? So that I believe in.
00:08:29.560 | Now, please do let me take a moment to tell you about a tool that's available today and that yes,
00:08:34.560 | can utilize a variety of models, including Gemini 2.5. That would be the sponsors of today's video,
00:08:40.040 | Emergent Mind, which I've been using for around two years before they even sponsored the channel.
00:08:44.760 | What it allows me to do is just catch up on those trending papers that I may have missed otherwise,
00:08:49.980 | like this one. As you know, I read those papers in full myself, but sometimes I do miss a paper
00:08:56.200 | that is trending on Hacker News or X. You can download these summaries as a PDF or in Markdown,
00:09:03.520 | or even listen to them as audio. The 2.5 Pro summaries are appropriately on the Pro plan,
00:09:09.060 | but anyway, link in the description. Now on jobs, this week and last, I've been seeing plenty of
00:09:14.620 | articles like this one going viral on Twitter and Reddit. Has the decline of knowledge work begun?
00:09:21.060 | asked the New York Times. For one LinkedIn executive in a guest essay in the New York Times,
00:09:26.220 | it has already begun with the bottom rung of the career ladder breaking. Now, obviously,
00:09:31.140 | I am one of the last people to underestimate the potential of AI and its impacts on the world of
00:09:37.760 | work. But these stories were about what was happening now, not what might be coming in three
00:09:42.940 | to five years. So I wanted to ask, do they have any stats to back this stuff up? A lot of the articles
00:09:48.680 | cross-reference each other, but the one stat that they all seem to turn to is the fact that the
00:09:52.500 | unemployment rate for college graduates in the US has risen 30% since September 2022. Not risen to 30%,
00:09:59.720 | has risen 30%. That sounds pretty ominous, right? But let me give you two contextual facts. The first
00:10:06.420 | is that that 30% rise is from 2% to 2.6% for college graduates. That's versus 4% for all workers. So a
00:10:16.360 | tiny bit less dramatic when you hear it is 2.6%. Now, I can just feel the rage building up among some of
00:10:22.320 | you. So let me just give you one more contextual fact and then my own thoughts. Because even though
00:10:26.980 | 2.6% unemployment rate for college grads in the US doesn't sound too dramatic, a 30% rise is pretty
00:10:33.460 | real. So I dug deep and looked at the data source that these articles were citing. And you can see it
00:10:40.040 | here with the college graduates at, well, now it seems 2.7%. That is the line in red and it comes from
00:10:47.880 | March of this year. But if we zoom out, we can see that, for example, in 2010, it was 5% among all college
00:10:58.400 | graduates. Even in, what is this, 1992, it was 3.5%. Don't worry, I am not in any way downplaying the impact of
00:11:08.160 | what's coming. I'm just saying it's a bit much to say the impact is already incredibly noticeable now. The other
00:11:14.660 | article that went viral was this one, Behind the Curtain, A White Collar Bloodbath, which heavily featured quotes from
00:11:21.040 | Dario Amodei, the CEO of Anthropic. When the language is caveated, like AI could wipe out half of all entry
00:11:29.040 | level white collar jobs over the next one to five years, it's actually quite hard to disagree. The way AI is
00:11:35.280 | accelerating, it's really hard to counter a 'could' scenario. Amodei gets onto slightly more dangerous
00:11:42.880 | territory when he says most people are unaware that this is about to happen. Others at Anthropic,
00:11:49.200 | like Sholto Douglas, are even more definitive. There's important distinctions to be made here.
00:11:54.080 | One is that I think we're near guaranteed at this point to have effectively models that are capable
00:12:06.720 | of automating any white collar job by like '27, '28, and near guaranteed by end of decade.
00:12:06.720 | This topic obviously deserves a full video on its own, but for me the necessary but not sufficient
00:12:12.160 | condition for white collar automation would be the elimination of hallucinations and dumb mistakes that
00:12:18.000 | the models don't self-correct. If there is even a one percent chance that frontier models of 2027 and
00:12:23.680 | 2028 make mistakes like this one, then having a human in the loop to check for those mistakes would
00:12:29.840 | surely allow for massively increased productivity. Which leads me personally to the whole calm before
00:12:35.680 | the storm theory which I first outlined on this channel in 2023. I said back then that we would
00:12:41.520 | first see a massive increase in productivity as humans complement the work of frontier AI. That's
00:12:48.000 | why I don't think this white collar automation will happen as Amodei says in as little as a couple of
00:12:53.600 | years or less. Now I know what many of you are thinking, well these CEOs would know far better than those of
00:12:59.040 | us on the outside, but I remember almost two years to the day Sam Altman saying and I quote
00:13:04.880 | "we won't be talking about hallucinations in 18 months to two years". That was on the world tour
00:13:10.720 | that he did after the release of GPT-4. Well almost exactly two years on from that quote we get this in
00:13:17.280 | the New Scientist: "AI hallucinations are getting worse and they're here to stay". Among other things the article
00:13:22.720 | cites a stat on a benchmark called SimpleQA which I've talked about before on the channel where basically
00:13:28.000 | o3, the latest OpenAI model, hallucinates a bit more than previous models. Then you guys might remember
00:13:33.520 | those viral articles about Klarna eliminating its customer service team so it could use AI instead. Now
00:13:40.080 | very quietly without the same fanfare they've actually reversed on that policy saying that customers like
00:13:45.920 | talking to people instead. After getting rid of those 700 employees it's now rehiring many human agents.
00:13:52.640 | Duolingo, the language app, also said that it was going to rely on AI before backing down and reversing
00:13:58.880 | that policy hiring more humans. Which leads me to the whole calm before the storm theory. While frontier
00:14:04.320 | language models are still weak at self-correcting their own hallucinations, the human can still complement
00:14:10.080 | their efforts and lead to overall more productivity. This leads to limited effect on the unemployment rate.
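That complement-not-replace workflow can be pictured as a simple gate: outputs the model is unsure about get routed to a human reviewer instead of being shipped automatically. The scalar confidence score and the 0.9 threshold below are illustrative assumptions for the sketch, not a description of any real deployment:

```python
def route_output(answer, confidence, threshold=0.9):
    """Toy human-in-the-loop gate: ship high-confidence model answers
    automatically, flag the rest for a human check on hallucinations
    and dumb mistakes.

    `confidence` and `threshold` are illustrative; real systems rarely
    expose a single trustworthy confidence scalar."""
    if confidence >= threshold:
        return ("auto", answer)          # shipped without review
    return ("human_review", answer)      # a person checks it first

print(route_output("Paris", 0.97))  # -> ('auto', 'Paris')
print(route_output("Paris", 0.55))  # -> ('human_review', 'Paris')
```

The point of the sketch is that as long as some outputs land in the second branch, the human stays in the loop and total output goes up rather than headcount going down.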
00:14:16.240 | I do know there are anecdotal examples about people losing their jobs to AI, trust me I am aware of that
00:14:21.040 | and I have read those articles. But limited net effect on the unemployment rate. This of course leads to
00:14:26.560 | more and more investment in AI and less and less regulation of AI as countries try to win the so-called AI race.
00:14:33.840 | But then there might come a tipping point where models using enough compute, having access to
00:14:38.960 | enough diverse methodologies for self-correction, finally stop making dumb mistakes and only miss
00:14:45.920 | things that are beyond their training data. Of course at that point, and I've actually got a documentary
00:14:50.640 | covering this, endless amounts more data would be given to them through for example screen recording,
00:14:55.360 | mass surveillance or robotics data. Then the complacency that might have set in throughout the
00:15:01.280 | remainder of the 2020s might be quickly upended. And to be honest, it's not like blue-collar work would be immune
00:15:08.720 | from the effects of AI automation for that much longer than white-collar work at that point. This is the
00:15:15.680 | fully autonomous Figure 02 humanoid robot. So yes, I've probably pissed off those who expect imminent upheaval
00:15:22.800 | and those who think LLMs are completely overhyped, but there you go, that's just my opinion of what is coming.
00:15:28.240 | While all of this is going on, of course, we get access to some pretty epic AI tools,
00:15:33.280 | like the brand new Eleven Labs V3 Alpha.
00:15:36.800 | Hey Jessica, have you tried the new Eleven V3?
00:15:39.680 | I just got it. The clarity is amazing. I can actually do whispers now, like this.
00:15:45.680 | Ooh, fancy. Check this out. I can do full Shakespeare now.
00:15:49.680 | As has been the theme of this video though, Eleven Labs can't rest easy because Google, with their native
00:16:10.720 | text-to-speech in Gemini 2.5 Flash, isn't that far behind.
00:16:14.320 | Hey Jessica, have you tried the new Eleven V3?
00:16:19.360 | I just got it. The clarity is amazing. I can actually do whispers now, like this.
00:16:24.880 | Ooh, fancy. Check this out. I can do full Shakespeare now.
00:16:28.400 | I can do full Shakespeare now. To be, or not to be, that is the question.
00:16:36.240 | Hey Jessica, have you tried the new Eleven V3?
00:16:38.480 | I just got it. The clarity is amazing. I can actually do whispers now, like this.
00:16:42.720 | Ooh, fancy.
00:16:44.240 | Thank you so much for watching. Let me know what you think, as always, and have a wonderful day.