New OpenAI Model 'Imminent' and AI Stakes Get Raised (plus Med Gemini, GPT 2 Chatbot and Scale AI)
It has been a somewhat surreal few days in AI, for so many reasons, and the month of May has only just begun. 00:00:08.980 |
And according to this under-the-radar article, company insiders and government officials 00:00:15.000 |
tell of an imminent release of new OpenAI models. 00:00:18.800 |
And yes, of course, the strangeness at the end of April was amplified by the GPT-2 chatbot, 00:00:26.120 |
a mystery model showcased and then withdrawn within days, but which I did get to test. 00:00:31.880 |
I thought testing it would be a slightly more appropriate response than doing an all-caps reaction video. 00:00:39.880 |
I also want to bring in two papers released in the last 24 hours, 90 pages in total, and 00:00:46.960 |
they might be more significant than any rumor you have heard. 00:00:51.080 |
First things first, though: that article from Politico that I mentioned, and the context behind it. 00:00:56.960 |
There was an AI safety summit in Bletchley last year, near to where I live actually, in the UK. 00:01:03.000 |
Some of the biggest players in AI, like Meta and OpenAI, promised the UK government that 00:01:08.080 |
it could safety-test their latest frontier models before they were released. 00:01:16.200 |
Now you might say that's just par for the course for big tech, but the article also went further. 00:01:22.920 |
Politico spoke to a host of company insiders, consultants, lobbyists, and government officials. 00:01:32.360 |
And not only did we learn that it's only Google DeepMind that have given the government 00:01:36.920 |
early access, we also learned that OpenAI didn't. 00:01:40.560 |
Now somewhat obviously that tells us that they have a new model and that it's very close to release. 00:01:46.480 |
Now I very much doubt they're going to call it GPT-5, and you can see more of my reasons for that in previous videos. 00:01:54.320 |
But I think it's more likely to be something like GPT-4.5 optimized for reasoning and planning. 00:02:01.120 |
Now some of you might be thinking, is that all the evidence you've got that a GPT-4.5 is coming? 00:02:08.840 |
How about this MIT Technology Review interview conducted with Sam Altman in the last few days? 00:02:14.600 |
In a private discussion, Sam Altman was asked if he knew when the next version of GPT is coming. 00:02:23.440 |
If the model had months and months more of uncertain safety testing, he couldn't be sure of that. 00:02:30.080 |
Think about what happened to Google Gemini Ultra, which was delayed and delayed and delayed. 00:02:34.360 |
That again points to a more imminent release. 00:02:37.680 |
Then another bit of secondhand evidence, this time from an AI insider on Patreon. 00:02:43.040 |
We have a wonderful Discord, and this insider, at a Stanford event, put a question directly to Sam Altman. 00:02:50.460 |
This was a different Stanford event to the one I'm about to also quote from. 00:02:54.560 |
And in his response, Sam Altman confirmed that he's personally using the unreleased model. 00:03:02.100 |
What about another direct quote from Sam Altman? 00:03:04.720 |
Well, here's some more evidence released yesterday that rather than drop a bombshell 00:03:08.840 |
GPT-5 on us, which I predict to come somewhere between November and January, they're going to ship iteratively. 00:03:19.920 |
It does kind of suck to ship a product that you're embarrassed about, but it's much better than the alternative. 00:03:25.080 |
And in this case in particular, where I think we really owe it to society to deploy iteratively. 00:03:31.280 |
One thing we've learned is that AI and surprise don't go well together. 00:03:34.920 |
People want a gradual rollout and the ability to influence these systems. 00:03:39.920 |
Now, he might want to tell that to OpenAI's recently departed head of developer relations. 00:03:45.660 |
He now works at Google and said, "Something I really appreciate about Google's culture is the transparency. 00:03:51.960 |
30 days in, I feel like I have a great understanding of where we are going from a model perspective. 00:03:57.780 |
Having line of sight on this makes it so much easier to start building compelling developer products." 00:04:02.860 |
It almost sounds like the workers at OpenAI often don't have a great understanding of 00:04:07.820 |
where they're going from a model perspective. 00:04:09.940 |
Now, in fairness, Sam Altman did say that the current GPT-4 will be significantly dumber than what comes next: 00:04:17.220 |
"ChatGPT is not phenomenal, like ChatGPT is mildly embarrassing at best. 00:04:22.300 |
GPT-4 is the dumbest model any of you will ever, ever have to use again by a lot. 00:04:27.060 |
But you know, it's like important to ship early and often, and we believe in iterative deployment." 00:04:31.980 |
So an agency- and reasoning-focused GPT-4.5 coming soon, but GPT-5 not until the end of the year at the earliest. 00:04:41.860 |
Now, some people were saying that the mystery GPT-2 chatbot could be GPT-4.5. 00:04:48.380 |
It was released on a site used to compare the different outputs of language models. 00:04:53.820 |
And look, here it is creating a beautiful unicorn, which Llama 3 couldn't do. 00:04:58.780 |
Now, I frantically got a tweet ready saying that superintelligence had arrived, but quickly thought better of it. 00:05:04.940 |
And not just because other people were reporting that they couldn't get decent unicorns. 00:05:10.260 |
And not just because that exact unicorn could be found on the web. 00:05:13.940 |
But the main reason was that I was one of the lucky ones to get in and test GPT-2 chatbot myself. 00:05:21.220 |
I could only do eight questions, but I gave it my standard handcrafted (so not on the 00:05:26.060 |
web) set of test questions, spanning logic, theory of mind, mathematics, coding, and more. 00:05:31.980 |
Its performance was pretty much identical to GPT-4 Turbo. 00:05:36.660 |
There was one question that it would get right more often than GPT-4 Turbo, but that could just be statistical noise. 00:05:42.980 |
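As a rough illustration of why I hedge there, here's a quick binomial calculation; the numbers are hypothetical stand-ins rather than my actual tallies, but they show how little a handful of attempts can tell you:

```python
from math import comb

def p_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more successes by luck alone."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical numbers: suppose GPT-4 Turbo solves the question 40% of the time,
# and the mystery model got it right 7 times out of 10 attempts.
print(f"P(>=7/10 by chance): {p_at_least(7, 10, 0.4):.3f}")  # ~0.055: suggestive, not decisive
```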
So if this was a sneak preview of GPT-4.5, I don't think it's going to shock and stun the world. 00:05:50.820 |
So tempting as it was to bang out a video saying that AGI has arrived in all caps, I resisted. 00:05:57.480 |
Since then, other testers have found broadly the same thing. 00:06:01.320 |
On language translation, the mystery GPT-2 chatbot massively underperforms Claude Opus 00:06:09.780 |
On an extended test of logic, it does about the same as Opus and GPT-4 Turbo. 00:06:15.580 |
Of course, that still does leave the possibility that it is an OpenAI model, a tiny one, and 00:06:20.580 |
one that they might even release with open weights, meaning anyone can use it. 00:06:25.300 |
And in that case, the impressive thing would be how well it's performing despite its size. 00:06:30.420 |
So if GPT-2 chatbot is a smaller model, how could it possibly be even vaguely competitive? 00:06:38.860 |
As James Betker of OpenAI said, "It's not so much about tweaking model configurations 00:06:43.720 |
and hyperparameters, nor is it really about architecture or optimizer choices. 00:06:51.780 |
It is the dataset that you are approximating to an incredible degree." 00:06:56.620 |
In a later post, he referred to the flaws of DALL-E 3 and GPT-4 and also flaws in video, 00:07:02.220 |
probably referring to the, at the time, unreleased Sora, and said they arise from a lack of data. 00:07:09.700 |
And in a more recent post, he said that while compute efficiency was still super important, 00:07:14.300 |
anything can be state-of-the-art with enough scale, compute, and eval hacking. 00:07:19.220 |
Now we'll get to evaluation and benchmark hacking in just a moment, but it does seem 00:07:23.660 |
to me that there are more and more hints that you can brute force performance with enough 00:07:28.380 |
compute and, as mentioned, a quality dataset. 00:07:31.540 |
But at least to me, it seems increasingly clear that you can pay your way to top performance. 00:07:37.100 |
Unless OpenAI reveal something genuinely shocking, the performance of Meta's Llama 3 (8 billion, 00:07:43.100 |
70 billion, and soon 400 billion parameters) shows that they have less of a secret sauce than many assumed. 00:07:49.900 |
But as Mark Zuckerberg hinted recently, it could just come down to which company blinks first. 00:07:55.300 |
Who among Google, Meta, and Microsoft, which provides the compute for OpenAI, is willing 00:08:00.140 |
to continue to spend tens or hundreds of billions of dollars on new models? 00:08:05.300 |
If the secret is simply the dataset, that would make less and less sense. 00:08:09.620 |
"Over the last few years, I think there's this issue of GPU production, right? 00:08:15.160 |
So even companies that had the money to pay for the GPUs couldn't necessarily get as many 00:08:19.420 |
as they wanted because there were all these supply constraints. 00:08:25.780 |
So now I think you're seeing a bunch of companies think about, 'Wow, we should just like really 00:08:30.980 |
invest a lot of money in building out these things.' 00:08:33.420 |
And I think that will go for some period of time. 00:08:36.780 |
There is a capital question of like, 'Okay, at what point does it stop being worth it to put the capital in?' 00:08:43.180 |
But I actually think before we hit that, you're going to run into energy constraints." 00:08:47.460 |
Now if you're curious about energy and data centre constraints, check out my "Why Does 00:08:52.180 |
OpenAI Need a Stargate Supercomputer" video released four weeks ago. 00:08:56.520 |
But before we leave data centres and datasets, I must draw your attention to this paper released in the last 24 hours. 00:09:04.580 |
It's actually a brilliant paper from Scale AI. 00:09:07.920 |
What they did was create a new and refined version of a benchmark that's used all the 00:09:12.560 |
time to test the mathematical reasoning capabilities of AI models. 00:09:17.380 |
And there were at least four fascinating findings relevant to all new models coming out this year. 00:09:23.360 |
The first, the context: they worried that many of the latest models had seen the benchmark during training. 00:09:30.380 |
That's called contamination because of course it contaminates the results on the test. 00:09:34.620 |
The original test had 8,000 questions, but what they did was create 1,000 new questions from scratch. 00:09:40.860 |
Now if contamination wasn't a problem, then models should perform just as well with the new questions as with the old. 00:09:49.300 |
For the Mistral and Phi families of models, performance notably lagged on the new test questions. 00:09:55.180 |
Whereas, fair's fair, for GPT-4 and Claude, performance was the same or better on the new questions. 00:10:01.760 |
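To make the logic of that comparison concrete, here's a minimal sketch; the function names and the example scores are my own illustration, not Scale AI's actual evaluation code:

```python
def accuracy(preds: list[str], gold: list[str]) -> float:
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def contamination_gap(preds_public, gold_public, preds_fresh, gold_fresh) -> float:
    """Accuracy on the public benchmark minus accuracy on the freshly written twin.
    A large positive gap is evidence the model memorized the public questions."""
    return accuracy(preds_public, gold_public) - accuracy(preds_fresh, gold_fresh)

# Hypothetical result: 85% on the well-known questions but 72% on the new ones
# gives a gap of 0.13, hinting at contamination; a gap near zero suggests genuine skill.
```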
But here's the thing, the authors figured out that that wasn't just about which models 00:10:05.580 |
had seen the questions in their training data. 00:10:07.980 |
They say that Mistral Large, which performed exactly the same, was just as likely to have 00:10:12.480 |
seen those questions as Mixtral Instruct, which way underperformed. 00:10:21.020 |
Even if they have seen the questions, they learn more from them and can generalize to new ones. 00:10:26.680 |
And here's another supporting quote, they lean toward the hypothesis that sufficiently 00:10:30.580 |
strong large language models learn elementary reasoning ability during training. 00:10:35.500 |
You could almost say that benchmarks get more reliable when you're talking about the very best models. 00:10:40.700 |
Next, and this seems to be a running theme in popular ML benchmarks, the GSM-8K, designed 00:10:49.540 |
for grade-school math, contained flawed questions: they didn't say how many, but the answers were supposed to be positive integers, and some weren't. 00:10:54.140 |
The new benchmark, however, passed through three layers of quality checks. 00:10:58.220 |
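For a flavour of what one of those mechanical checks might look like, assuming the positive-integer convention mentioned above, here is my own sketch (not the paper's actual code):

```python
def validate_answer(raw: str) -> int | None:
    """GSM-8K answers are meant to be positive integers; return None for anything that isn't."""
    cleaned = raw.replace(",", "").strip()
    try:
        value = int(cleaned)
    except ValueError:
        return None                        # "18.5", "twelve", etc. fail the convention
    return value if value > 0 else None

assert validate_answer("1,200") == 1200
assert validate_answer("-3") is None       # negative: violates the stated convention
assert validate_answer("18.5") is None     # non-integer: violates it too
```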
Third, they provide extra theories as to why models might overperform on benchmarks compared to real-world use. 00:11:06.820 |
It could be that model builders design datasets that are similar to test questions. 00:11:11.980 |
After all, if you were trying to bake reasoning into your model, what kind of data would you use? 00:11:18.980 |
So the more similar your dataset is in nature, not just in exact matches, to benchmarks, the more 00:11:24.700 |
your benchmark performance will be elevated compared to simple real-world use. 00:11:29.600 |
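One crude way to quantify "similar in nature, not just exact match" is word n-gram overlap between training documents and benchmark questions. This is a common decontamination heuristic and my own illustration, not a method any lab has confirmed using:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_score(train_doc: str, test_question: str, n: int = 8) -> float:
    """Fraction of the test question's word n-grams that also appear in a training
    document. High overlap flags 'benchmark-shaped' data without exact duplication."""
    test = ngrams(test_question, n)
    return len(test & ngrams(train_doc, n)) / max(len(test), 1)
```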
Think about it: it could be an inadvertent thing, where enhancing the overall smartness 00:11:33.860 |
of the model comes with the side effect of overperforming on benchmarks. 00:11:38.740 |
And whatever you think about benchmarks, that kind of data curation does seem to work. 00:11:42.140 |
Here's Sébastien Bubeck, lead author of the Phi series of models. 00:11:45.620 |
I've interviewed him for AI Insiders and he said this: "Even on those 1,000 never-before-seen 00:11:51.580 |
questions, Phi-3 Mini, which is only 3.8 billion parameters, performed within about 8 or 9% of GPT-4 Turbo." 00:12:00.380 |
Now we don't know the parameter count of GPT-4 Turbo, but it's almost certainly orders of magnitude bigger. 00:12:05.580 |
So training on high-quality data, as we have seen, definitely works, even if it slightly inflates benchmark scores. 00:12:12.180 |
But one final observation from me about this paper: I read almost all the examples that they gave. 00:12:19.580 |
And as the paper mentions, they involve basic addition, subtraction, multiplication, and division. 00:12:25.100 |
After all, the original test was designed for youngsters. 00:12:27.620 |
You can pause and try the questions yourself, but despite them being lots of words, they aren't difficult. 00:12:32.860 |
So my question is this: why are models like Claude 3 Opus still getting any of these questions wrong? 00:12:39.900 |
They're scoring around 60% on graduate-level expert reasoning, the GPQA. 00:12:45.260 |
If Claude 3 Opus, for example, can get questions right that PhDs struggle to get right with 00:12:51.420 |
Google and 30 minutes, why on earth, with five short examples, can it not get these grade-school questions right? 00:12:59.620 |
Either there are still flaws in the test, or these models do have a limit in terms of basic reasoning. 00:13:06.020 |
Now, if you like this kind of analysis, feel free to sign up to my completely free newsletter. 00:13:11.540 |
It's called Signal to Noise, and the link is in the description. 00:13:15.220 |
And if you want to chat in person about it, the regional networking on the AI Insiders Discord is taking off. 00:13:23.340 |
There are meetings being arranged not only in London, but also in Germany, the Midwest, Ireland, 00:13:29.100 |
San Francisco, Madrid, Brazil, and it goes on and on. 00:13:32.700 |
Honestly, I've been surprised and honored by the number of spontaneous meetings being 00:13:37.180 |
arranged across the world, but it's time, arguably, for the most exciting development of the week. 00:13:49.380 |
The latest series of Gemini models from Google are more than competitive with doctors at answering medical questions. 00:13:57.300 |
And even in areas where they can't quite perform, like in surgery, they can be amazing assistants. 00:14:03.420 |
In a world in which millions of people die due to medical errors, this could be a tremendous boon. 00:14:09.780 |
Med-Gemini contains a number of innovations; it wasn't just rerunning the same tests on a new model. 00:14:15.380 |
For example, you can inspect how confident a model is in its answer. 00:14:19.460 |
By trawling through the raw outputs of a model, called logits, you can see how high a probability it assigns to each answer. 00:14:26.480 |
If it gave a confident answer, you would submit that as the answer. 00:14:29.860 |
They used this technique, by the way, for the original Gemini launch, where they claimed state-of-the-art results on MMLU. 00:14:35.020 |
Anyway, if the model is not confident, you can get the model to generate search queries 00:14:40.020 |
to resolve those conflicts, train it, in other words, to use Google. 00:14:44.980 |
Then you can feed that additional context provided by the web back into the model to improve its answer. 00:14:54.180 |
To oversimplify, they get the model to output answers again using the help of search. 00:14:59.140 |
And then the outputs that had correct answers were used to fine tune the models. 00:15:04.060 |
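Here is a minimal sketch of that whole loop as I understand it. The generate and web_search helpers are hypothetical stand-ins, not real Gemini API calls; where the paper reads confidence from the logits directly, I approximate it with agreement between samples, and the final fine-tuning step is left out:

```python
from collections import Counter

def answer_with_uncertainty_guided_search(question, generate, web_search,
                                          n_samples=11, threshold=0.7):
    # 1. Sample several answers; agreement among samples stands in here for the
    #    logit-level confidence the paper reads off the model's raw outputs.
    samples = [generate(question) for _ in range(n_samples)]
    best, votes = Counter(samples).most_common(1)[0]
    if votes / n_samples >= threshold:
        return best  # confident: submit the majority answer as-is

    # 2. Not confident: have the model write a search query about the disagreement,
    query = generate(f"Write a web search query to resolve uncertainty about: {question}")
    context = web_search(query)

    # 3. then answer again with the retrieved web context prepended.
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```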
Now that's not perfect, of course, because sometimes you can get the right answer with the wrong reasoning. 00:15:10.740 |
Just last week, by the way, on Patreon, I described this kind of reinforced in-context learning. 00:15:18.460 |
Other innovations come from the incredible long-context abilities of the Gemini 1.5 series 00:15:24.760 |
With that family of models, you can trawl through a 700,000-word electronic health record. 00:15:31.180 |
Now imagine a human doctor trying to do the same thing. 00:15:34.300 |
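Some back-of-the-envelope arithmetic shows why that's plausible, assuming roughly 1.3 tokens per English word, which is my assumption about the tokenizer rather than a published figure:

```python
# Does a 700,000-word health record fit in a 1M-token context window?
words = 700_000
tokens_per_word = 1.3               # assumption, not a published Gemini figure
estimated_tokens = int(words * tokens_per_word)
context_window = 1_000_000          # Gemini 1.5 Pro's advertised context length
print(estimated_tokens, estimated_tokens <= context_window)  # 910000 True
```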
I remember on the night of Gemini 1.5's release calling it the biggest news of the day, even 00:15:39.380 |
more significant than SORA, and I still stand by that. 00:15:43.660 |
Well, of course, there's a state-of-the-art performance on MedQA, which assesses your ability to diagnose diseases. 00:15:50.780 |
The doctor pass rate, by the way, is around 60%. 00:15:53.620 |
And how about this for a mini-theme of the video? 00:15:56.320 |
When they carefully analyzed the questions in the benchmark, they found that 7.4% of them had issues. 00:16:04.500 |
Things like lacking key information, incorrect answers, or multiple plausible interpretations. 00:16:10.180 |
So just in this video alone, we've seen multiple benchmark issues, and I have collected a thread of further examples. 00:16:18.020 |
The positive news, though, is just how good these models are getting at things like medical 00:16:22.260 |
note summarization and clinical referral letter generation. 00:16:25.860 |
But I don't want to detract from the headline, which is just how good these models are getting at medicine. 00:16:31.500 |
Here you can see Med-Gemini with search way outperforming expert clinicians with search. 00:16:37.540 |
By the way, when errors from the test were taken out, performance bumped up further still. 00:16:42.980 |
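As a back-of-the-envelope illustration of why dropping flawed questions lifts a score, with entirely made-up numbers rather than the paper's actual figures:

```python
total_questions   = 1000
flawed            = 74     # ~7.4% of items had issues, per the paper
raw_correct       = 900    # placeholder raw score, not Med-Gemini's real figure
correct_on_flawed = 40     # assumed: some flawed items were still marked "correct"

# Accuracy on the clean subset only:
clean_accuracy = (raw_correct - correct_on_flawed) / (total_questions - flawed)
print(f"{clean_accuracy:.1%}")  # 92.9%, up from a raw 90.0%
```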
And the authors can't wait to augment their models with additional data. 00:16:47.280 |
Things like data from consumer wearables, genomic information, nutritional data, and more. 00:16:53.180 |
And as a quite amusing aside, it seems like Google and Microsoft are in a tussle to throw 00:16:58.380 |
shade at each other's methods in a positive spirit. 00:17:01.640 |
Google contrasts their approach to MedPrompt from Microsoft, saying that their approach 00:17:06.220 |
is principled, and it can be easily extended to more complex scenarios beyond MedQA. 00:17:11.100 |
Now you might say that's harsh, but Microsoft earlier had said that their MedPrompt approach 00:17:15.940 |
shows GPT-4's ability to outperform Google's model that was fine-tuned specifically for medicine. 00:17:22.540 |
It outperforms on the same benchmarks by a significant margin. 00:17:26.340 |
Well, Google have obviously one-upped them by reaching new state-of-the-art performances anyway. 00:17:32.860 |
Microsoft had also said that their approach has simple prompting and doesn't need more expensive fine-tuning. 00:17:39.380 |
Google shot back, saying they don't need complex, specialized prompting for their approach to work. 00:17:44.720 |
Honestly, this is competition that I would encourage. 00:17:47.620 |
May they long compete for glory in this medical arena. 00:17:51.340 |
In case you're wondering, because Gemini is a multimodal model, it can see images too. 00:17:56.620 |
You can interact with patients and ask them to provide images. 00:17:59.940 |
The model can also interact with primary care physicians and ask for things like x-rays. 00:18:04.620 |
And most surprisingly to me, it can also interact with surgeons to help boost performance. 00:18:10.260 |
Yes, that's video assistance during live surgery. 00:18:13.640 |
Of course, they haven't yet deployed this for ethical and safety reasons, but Gemini 00:18:18.120 |
is already capable of assessing a video scene and helping with surgery. 00:18:23.060 |
For example, by answering whether the critical view of safety criteria are being met. 00:18:27.780 |
Do you have a great view of the gallbladder, for example? 00:18:31.260 |
And Med-Gemini could potentially guide surgeons in real time during these complex procedures 00:18:35.760 |
for not only improved accuracy, but also better patient outcomes. 00:18:39.840 |
Notice the nuance of the response from Gemini on whether the lower third of the gallbladder is visible. 00:18:47.240 |
And the authors list many improvements that they could have made, they just didn't have the time. 00:18:52.120 |
For example, the models were searching the wild web. 00:18:55.520 |
And one option would be to restrict the search results to just more authoritative medical sources. 00:19:01.920 |
The catch, though, is that this model is not open sourced and isn't widely available, 00:19:06.240 |
due to, they say, the safety implications of unmonitored use. 00:19:10.200 |
I suspect the commercial implications of open sourcing Gemini also had something to do with it. 00:19:17.360 |
We know that hundreds of thousands or even millions of people die due to medical mistakes every year. 00:19:22.860 |
So if and when Med-Gemini 2, 3, or 4 becomes unambiguously better than all clinicians at 00:19:29.640 |
diagnosing diseases, then at what point is it unethical not to at least deploy them in some capacity? 00:19:36.640 |
That's definitely something at least to think about. 00:19:39.400 |
Overall, this is exciting and excellent work. 00:19:44.640 |
And in a world that is seeing some stark misuses of AI, as well as increasingly autonomous 00:19:51.240 |
deployment of AI, like this autonomous tank, the fact that we can get breakthroughs like this is genuinely heartening. 00:19:58.640 |
Thank you so much for watching to the end and have a wonderful day.