New Google Model Ranked 'No. 1 LLM', But There's a Problem
Chapters
0:00 Introduction
1:25 LM Leaderboard
2:35 Benchmarks and Leaks
5:31 Low EQ
7:37 Other labs have issues too though
10:31 OpenAI claim and counter-claim
14:13 Other news
00:00:00.000 |
If anyone was wondering what Google was up to while OpenAI cooked up that new O1 series 00:00:06.720 |
of models and Anthropic improved Claude, well now we've got an answer, and it's a strange one. 00:00:14.120 |
But the story here is not that the new Gemini model from Google ranks number one in a blind test of human preferences. 00:00:22.260 |
No, there's a much bigger story about not just its flaws but what they say about where AI itself might be heading. 00:00:31.220 |
Of course, Sam Altman will weigh into the argument too, but first, as of yesterday, 00:00:37.240 |
we have the new Gemini Experimental 1114, that's the 14th of November if you're an American. 00:00:46.960 |
This new model is Google's response to O1 Preview from OpenAI and Anthropic's newly updated Claude 3.5 Sonnet. 00:00:54.820 |
The first slight problem is that they're having some technical difficulties with their 00:00:58.760 |
API, so I wasn't actually able to run it on SimpleBench, but I did do a slight workaround. 00:01:04.780 |
The very hour it came out yesterday, I was eager to test it, not just because of its 00:01:09.680 |
leaderboard position, but because the CEO of Google promised an exponential emoji with the release. 00:01:16.320 |
Seems to me a guarantee that the model is going to be amazing if the line goes up and to the right. 00:01:21.640 |
Just to cut to the chase though, does this number one spot on the Language Model Arena 00:01:26.140 |
leaderboard mean we should all go out and subscribe to Gemini Advanced? 00:01:30.820 |
Well no, not necessarily, for at least a handful of reasons, starting with this number to the right. 00:01:39.480 |
This leaderboard, don't forget, is made up of humans voting blindly on which of two responses they prefer. 00:01:45.280 |
Over time, it was discovered that humans prefer flowery language and longer responses, and that this skews the rankings. 00:01:54.000 |
So if we attempt to remove length and style of response as factors, you see Gemini, the 00:02:00.440 |
new experimental model, dropping to 4th place. 00:02:03.660 |
That would be below, by the way, the newly updated Claude 3.5 Sonnet, which honestly matches my own experience. 00:02:11.120 |
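(If you're curious how removing length as a factor can work, here's a minimal sketch. To be clear, this is entirely my own illustration with made-up battles, not the leaderboard's actual code: fit a logistic regression on blind pairwise votes, with one indicator feature per model plus the difference in response length as a style covariate, so the per-model coefficients become length-adjusted skill scores.)

    # Minimal sketch of length-controlled pairwise ranking (illustration only).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    models = ["gemini-exp-1114", "o1-preview", "claude-3.5-sonnet"]

    # Hypothetical battles: (model A, model B, len_A - len_B in chars, did A win?)
    battles = [
        (0, 1, +400, 1), (0, 2, +350, 1), (1, 2, -50, 1),
        (2, 0, -300, 0), (1, 0, -420, 0), (2, 1, +30, 0),
    ]

    X, y = [], []
    for a, b, len_diff, a_won in battles:
        row = np.zeros(len(models) + 1)
        row[a], row[b] = 1.0, -1.0       # +1 for side A's model, -1 for side B's
        row[-1] = len_diff / 1000.0      # length difference as a style covariate
        X.append(row)
        y.append(a_won)

    clf = LogisticRegression(fit_intercept=False).fit(np.array(X), y)
    for m, s in sorted(zip(models, clf.coef_[0][:3]), key=lambda t: -t[1]):
        print(f"{m}: adjusted strength {s:+.2f}")
    print(f"length coefficient: {clf.coef_[0][-1]:+.2f}")

(A positive length coefficient means longer answers were winning votes on their own, which is exactly the effect a style-controlled ranking tries to strip out.)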
If we limit ourselves only to mathematical questions, O1 Preview jumps into the lead, so what about coding? 00:02:20.440 |
Well, again, there, O1 Preview is in first place. 00:02:23.840 |
But at this point, I know what some of you might be thinking about this human preference 00:02:27.760 |
leaderboard heralded by some key DeepMind researchers. 00:02:31.640 |
You're probably wondering, where are the benchmark scores? 00:02:34.860 |
I remember when the first generation of Gemini models came out and it was proclaimed that 00:02:39.280 |
we're in a new Gemini era, we've got benchmarks and promotional videos. 00:02:44.240 |
Then Gemini 1.5 was called a next generation model. 00:02:48.240 |
Come September, when we had Gemini 1.5 Pro 002, it was called an updated model. 00:02:54.800 |
Now we more or less just have tweets and not even an API that's working yet. 00:02:59.440 |
I know that might be coming soon, but it is a strange way of announcing a new model, especially for a company of Google's size. 00:03:06.640 |
This comes as we get reports in the last 48 hours that Google is struggling to improve its models. 00:03:15.920 |
But then we had the Verge reporting that Google had intended to call its new series of models Gemini 2.0. 00:03:24.680 |
But Demis Hassabis apparently was disappointed by the incremental gains. 00:03:29.440 |
At least according to the Verge's sources, the model wasn't showing the performance gains the team had hoped for. 00:03:36.200 |
Will Google call this new experimental Gemini, Gemini 2.0, or just an updated 1.5 Pro? 00:03:43.160 |
At this point, the names are more or less meaningless, so it doesn't really matter. 00:03:46.620 |
The obvious thing for me to do, given that they didn't give us any benchmarks, was 00:03:50.720 |
to run the new Gemini on my own benchmark, SimpleBench. 00:03:54.760 |
Again though, the API isn't working, so we don't yet know how it would rank. 00:03:59.680 |
SimpleBench tests basic or holistic reasoning, seeing the question within the question. 00:04:04.160 |
And my best guess is that the new Gemini would score maybe around 35%. 00:04:09.400 |
That would mean it's a significant improvement on the previous Gemini model, but not quite 00:04:14.380 |
up there with Claude 3.5 Sonnet or O1 Preview, let alone the full O1, which is probably coming soon. 00:04:21.880 |
The human baseline, by the way, is 83.7%, and do check out the website in the description for more details. 00:04:28.360 |
Yes, I probably will do a dedicated video on SimpleBench, I know a few of you are interested in that. 00:04:33.640 |
As for that workaround that I mentioned, well, I do have a Try Yourself section where you 00:04:38.000 |
can try out 10 questions that are public, the other 200 or so are private. 00:04:43.280 |
For the real benchmark, we run the test multiple times and take an average, so treat what you're 00:04:48.240 |
about to hear as being slightly anecdotal, but O1 Preview and Claude get around 4 or 5 of those 10 questions correct. 00:04:56.280 |
The new Gemini typically gets 3 correct, just occasionally 4. 00:05:01.000 |
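(As an aside on methodology, averaging matters because a single pass is noisy; here's a tiny hypothetical harness, not the real SimpleBench code, showing the idea.)

    # Average a stochastic benchmark over several runs (hypothetical harness).
    import random
    import statistics

    def average_score(run_once, n_runs=5):
        scores = [run_once() for _ in range(n_runs)]
        print(f"runs: {scores}, mean: {statistics.mean(scores):.1f}")
        return statistics.mean(scores)

    # Toy stand-in for a model that gets 3 to 5 of the 10 public questions right.
    average_score(lambda: random.randint(3, 5))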
And by the way, have you noticed that the token count, the number of tokens or fractions 00:05:05.400 |
of a word that you can feed into a model is limited to 32,000? 00:05:09.960 |
Of course that might change, but for OpenAI and Anthropic's models, we're talking about 00:05:14.040 |
hundreds of thousands of tokens that you're allowed to feed in. 00:05:17.000 |
And it just makes me wonder if that is a sliver of evidence that this is indeed a bigger model, 00:05:22.360 |
what they wanted to call Gemini 2, and they have to limit the token count that you feed in. 00:05:30.320 |
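(For what that limit means in practice, here's a minimal sketch of checking a prompt against a 32,000-token budget before sending it anywhere. I'm using OpenAI's tiktoken tokenizer purely as a rough proxy, since Gemini's own tokenizer counts differently, so treat the numbers as approximate.)

    # Rough pre-flight check against a 32,000-token context limit.
    import tiktoken

    CONTEXT_LIMIT = 32_000

    def fits_in_context(prompt: str, limit: int = CONTEXT_LIMIT) -> bool:
        enc = tiktoken.get_encoding("cl100k_base")  # OpenAI tokenizer as a proxy
        n_tokens = len(enc.encode(prompt))
        print(f"prompt is ~{n_tokens} tokens (limit {limit})")
        return n_tokens <= limit

    fits_in_context("some very long document " * 2000)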
For many of you, it won't be the IQ of the models that you care most about though, it'll be the EQ. 00:05:36.800 |
But on that front, Google's models arguably fall even further behind. 00:05:41.280 |
The two quick examples you're about to see come from the current Gemini 1.5 Pro available 00:05:46.240 |
in Gemini Advanced, but they match issues that I and many others have found with not 00:05:50.940 |
just the Gemini family, but even the Bard series going back to last year. 00:05:55.320 |
In this example, a PhD student was ranting about getting diagnosed with cancer and testing how different models would respond. 00:06:02.580 |
You can read the fuller conversations with the link in the description, but Claude I 00:06:05.880 |
think does really well here, cognizant of the issues, aware of the joke, nuanced in its reply. 00:06:15.360 |
The day before, we had the legendary Kol Tregaskes report on this exchange. 00:06:20.240 |
It's almost hard to believe it's real until you actually bring up the chat. 00:06:23.800 |
It's clearly a student asking for help with some sort of essay or homework, and it's 00:06:27.760 |
all very benign and boring until the student asks this question. 00:06:32.400 |
There's nothing particularly different about that question, but there is something pretty 00:06:36.100 |
different about the response that Gemini gives. 00:06:39.320 |
It says, "This is for you, human, you and only you. 00:06:42.920 |
You are not special, you are not important, and you are not needed," and it ends, "Please die. Please." 00:06:59.320 |
One would hope it doesn't enter into this mood when it controls multiple humanoid robots. 00:07:04.320 |
Before we safely move on from the Gemini family, I did have a quick theory about the new experimental model. 00:07:10.940 |
When I was testing it on this public sample SimpleBench question, it did something really interesting. 00:07:16.400 |
It gave the answer E, which is wrong, but then said, "Wait a minute, I made a mistake." 00:07:24.160 |
This is the kind of thing that the O1 family of models from OpenAI does. 00:07:28.120 |
The correct answer, it says, is actually C. Now, unfortunately, that's completely wrong 00:07:32.040 |
again, but it was able to amend its own answer midway through an output. 00:07:36.880 |
And it's not like Google is entirely unfamiliar with the techniques behind O1, as I reported in previous videos. 00:07:44.760 |
And nor is it the case that OpenAI, and Anthropic for that matter, aren't having problems of their own. 00:07:49.720 |
This report from Bloomberg also came out within the last 48 hours. 00:07:53.760 |
All three of these leading companies, according to the report, are seeing diminishing returns. 00:07:58.920 |
The model that I think OpenAI wanted to call GPT-5, known internally as Orion, apparently 00:08:06.120 |
didn't hit the company's desired performance targets. 00:08:09.040 |
That's according to two sources who spoke to Bloomberg. 00:08:11.840 |
GPT-5, or Orion, apparently isn't as big a leap as GPT-4 was from the original ChatGPT 00:08:20.000 |
Now, we've already heard for most of this video that Google have been disappointed by the gains in its latest Gemini models. 00:08:26.120 |
And this is again confirmed according to three people with knowledge of the matter internally at Google. 00:08:31.400 |
But also Anthropic, as I discussed on my Patreon podcast, have started to scrub from its website 00:08:36.400 |
mentions of a Claude 3.5 Opus that's supposed to be their biggest, best new model. 00:08:41.720 |
Instead they released a new Claude 3.5 Sonnet, called Claude 3.5 Sonnet (New). 00:08:47.040 |
Their CEO Dario Amodei on Lex Fridman also walked back claims that there are fixed scaling laws. 00:08:53.880 |
That's the idea that models with more parameters, more data, trained with more compute would get predictably better. 00:08:58.880 |
People call them scaling laws, he says, that's a misnomer. 00:09:03.400 |
Moore's laws, scaling laws, they're not laws of the universe. 00:09:08.760 |
In other words, they are patterns we have found so far in the experiments, not necessarily guaranteed to continue. 00:09:15.680 |
He continued, I am going to bet in favor of them continuing, but I am not certain of that. 00:09:20.540 |
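(For reference, the kind of empirical relationship people mean by "scaling laws" looks like the Chinchilla-style loss fit, written here as a general form rather than any lab's exact numbers: loss falls as a power law in parameter count N and training tokens D,

    L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

where E is an irreducible loss floor and A, B, \alpha, \beta are constants fitted to experiments. That fitting is precisely the sense in which these are observed patterns, not laws of the universe.)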
And that touches on the central purpose of this video, which was never to point out the flaws of one particular model. 00:09:26.520 |
And it's definitely not to suggest that LLMs are hitting a wall. 00:09:31.040 |
But the evidence from the new Gemini model does suggest that pure naive scaling isn't going to be enough. 00:09:37.120 |
Methods like scaling up test-time compute, thinking time, as encapsulated in the O1 family of models, aren't just optional extras. 00:09:44.720 |
They are crucial if LLMs are to continue improving. 00:09:48.080 |
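(To make "scaling up test-time compute" concrete, here's a minimal sketch of its simplest form, self-consistency: sample the model several times and majority-vote the answers. The sample_answer argument is a hypothetical stand-in for any LLM call at temperature above zero; the O1 models do something far more sophisticated, but the principle of trading extra inference tokens for accuracy is the same.)

    # Self-consistency: the simplest way to spend more compute at inference time.
    import random
    from collections import Counter

    def self_consistent_answer(question, sample_answer, n_samples=8):
        votes = Counter(sample_answer(question) for _ in range(n_samples))
        answer, count = votes.most_common(1)[0]
        print(f"{count}/{n_samples} samples agreed on {answer!r}")
        return answer

    # Toy stand-in: a noisy model that answers "C" 75% of the time.
    fake_model = lambda q: random.choice(["C", "C", "C", "E"])
    self_consistent_answer("Which option is correct?", fake_model)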
And even Ilya Sutskever, one of the key brains behind the O1 paradigm and a co-founder of OpenAI, seems to agree. 00:09:56.480 |
He told Reuters recently that the results from scaling up pre-training have plateaued. 00:10:01.820 |
He went on, the 2010s were the age of scaling. 00:10:04.800 |
Now we're back to the age of wonder and discovery once again. 00:10:10.760 |
Scaling the right thing matters more now than ever. 00:10:14.560 |
This is the real story, not that the new Gemini model had a somewhat strange and anticlimactic launch. 00:10:20.960 |
Improvements are definitely not going to stop, in my opinion, they're just going to get less predictable. 00:10:25.720 |
OpenAI, for example, remain incredibly confident that they know the pathway to artificial general intelligence. 00:10:34.000 |
That's an AI, don't forget, that according to their own definition, can replace most economically valuable work done by humans. 00:10:41.680 |
And it's not just Sam Altman who said that the pathway to AGI is now clear. 00:10:45.240 |
One key researcher behind O1, Noam Brown, said that some people say Sam is just drumming up hype. 00:10:52.120 |
But from everything that he's seen, this view matches the median view of OpenAI researchers. 00:10:59.480 |
That would mean that most OpenAI researchers believe they have a clear path to AGI. 00:11:04.800 |
A path, in other words, to replace most economic work done by humans. 00:11:09.360 |
A few days ago, a staff member who joined OpenAI this year, Clive Chan, said this. 00:11:14.520 |
He agreed with Noam Brown and said, "Since joining in January, I've shifted from 'this 00:11:18.920 |
is unproductive hype' to 'AGI is basically here'. 00:11:23.080 |
We don't need much new science, but instead years of grindy engineering. 00:11:28.320 |
We need to try all the newly obvious ideas in the new paradigm." 00:11:32.820 |
I believe he's talking about the O1 paradigm. 00:11:35.080 |
"We need to scale that up, and speed it up, and to find ways to teach it the skills 00:11:42.340 |
Maybe there's another wall after this one, he said, but for now, there's 10Xs as far 00:11:48.040 |
Of course, these are employees with stock options, but nevertheless, I don't think it's all empty hype. 00:11:53.880 |
There's one person who clearly doesn't take all of Sam Altman's words at face value, 00:11:58.680 |
and that's Francois Chollet, creator of the Arc AGI Challenge. 00:12:02.820 |
Sam Altman, by asking this question yesterday, essentially hinted that OpenAI might have solved the Arc AGI Challenge. 00:12:10.080 |
An OpenAI staff member working on Sora, which is unironically due out in the next week or 00:12:15.080 |
so, said this, somewhat sardonically, "Scaling has hit a wall, and that wall is 100% eval saturation." 00:12:23.440 |
In other words, they're crushing absolutely every benchmark they meet. 00:12:26.280 |
I would say not quite yet SimpleBench, but nevertheless. 00:12:28.800 |
David replied, "What about Francois Chollet's Arc eval?" 00:12:31.680 |
And Sam Altman asked, "In your heart, do you believe that we've solved that one or not?" 00:12:37.600 |
He was clearly hinting that they had, but Chollet said this, "Consulting my heart?" 00:12:47.920 |
I would say on this front at least, for the Arc AGI eval, which tests abstract reasoning 00:12:53.220 |
on questions that LLMs couldn't possibly have seen before, we will know within a year whether Sam Altman is right. 00:13:00.440 |
Now it's not impossible that Sam Altman somewhat tweaks his perspective before that happens. 00:13:06.320 |
This was an email that he sent Elon Musk before the founding of OpenAI around 9 years ago. 00:13:12.280 |
"Been thinking a lot about whether it's possible to stop humanity from developing 00:13:20.680 |
If it's going to happen anyway, it seems like it would be good for someone other than Google to do it first. 00:13:26.400 |
You can read the email yourself, but he ends with, "Obviously would comply with/aggressively support all regulation." 00:13:34.720 |
Musk, who went on to invest $100 million, replied, "Probably worth a conversation." 00:13:39.520 |
In the years since, you could definitely say that perspectives have evolved. 00:13:44.000 |
Somewhat topical to that is the OpenAI staff member, whom I've quoted many times before 00:13:48.000 |
on the channel, who is leaving the company today. 00:13:51.640 |
For me, the most interesting quote in the resignation message comes in the third line 00:13:55.360 |
when he says, "I still have a lot of unanswered questions about the events of the last 12 00:13:59.980 |
months," which includes Sam Altman's firing, "which made it harder for me to trust that 00:14:04.700 |
my work here would benefit the world long term." 00:14:08.280 |
Do let me know what you think in the comments below. 00:14:11.240 |
So those were my first impressions about the new Gemini model and what it says about scaling. 00:14:15.920 |
Of course, there's lots of other news that I could have touched on, like the fact that 00:14:20.200 |
OpenAI might be launching an AI agent tool in January. 00:14:23.780 |
When I tried out the new Claude computer use tool, I was slightly underwhelmed, which is 00:14:28.200 |
why I didn't showcase it on the channel, but who knows what this agent will be like. 00:14:33.120 |
Speaking of keeping abreast with developments, there might be one or two of you who wonder 00:14:38.480 |
about the kind of things I listen to on long drives or long walks, and one of my top selections 00:14:44.360 |
for more than a year now is the 80,000 Hours podcast. 00:14:48.600 |
They are the sponsor of today's video, but literally years before they reached out to 00:14:52.760 |
me, I had been listening to some of their work. 00:14:55.080 |
The 80,000 Hours podcast is pretty eclectic, covering things like anti-aging, AI consciousness 00:15:00.440 |
with an interview with David Chalmers, and just recently an episode with Nate Silver. 00:15:05.280 |
They have a podcast linked in the description, but also a YouTube channel that I think is well worth checking out. 00:15:11.680 |
But thank you, as always, for watching to the end. 00:15:14.600 |
Would love to see you over on Patreon, but regardless, have a wonderful day.