
New OpenAI Model 'Imminent' and AI Stakes Get Raised (plus Med Gemini, GPT 2 Chatbot and Scale AI)


Transcript

It has been a somewhat surreal few days in AI for so many reasons, and the month of May promises to be yet stranger. And according to this under-the-radar article, company insiders and government officials tell of an imminent release of new OpenAI models. And yes, of course, the strangeness at the end of April was amplified by the GPT-2 chatbot, a mystery model showcased and then withdrawn within days, but which I did get to test.

I thought testing it would be a slightly more appropriate response than putting out an all-caps video claiming that AGI has arrived. I also want to bring in two papers released in the last 24 hours, 90 pages in total and read in full. They might be more significant than any rumor you have heard.

First things first, though: that article from Politico that I mentioned. The context of this article is this. There was an AI safety summit at Bletchley last year, near to where I live, actually, in Southern England. Some of the biggest players in AI, like Meta and OpenAI, promised the UK government that it could safety-test their latest frontier models before they were released.

There's only one slight problem. They haven't done it. Now you might say that's just par for the course for big tech, but the article also revealed some interesting insider gossip. Politico spoke to a host of company insiders, consultants, lobbyists, and government officials, who spoke anonymously over several months. And not only did we learn that it's only Google DeepMind that have given the government early access, we also learned that OpenAI didn't.

Now somewhat obviously that tells us that they have a new model and that it's very near to release. Now I very much doubt they're going to call it GPT-5, and you can see more of my reasons for that in the video on screen. But I think it's more likely to be something like GPT-4.5, optimized for reasoning and planning.

Now some of you might be thinking: is that all the evidence you've got that a GPT-4.5 will be coming before GPT-5? Well, not quite. How about this MIT Technology Review interview conducted with Sam Altman in the last few days? In a private discussion, Sam Altman was asked if he knew when the next version of GPT is slated to be released.

And he said calmly, "Yes." Now think about it. If the model had months and months more of uncertain safety testing, he couldn't be that confident about a release date. Think about what happened to Google Gemini Ultra, which was delayed and delayed and delayed. That again points to a more imminent release.

Then another bit of secondhand evidence, this time from an AI Insider on Patreon. We have a wonderful Discord, and this insider put a question directly to Sam Altman at a Stanford event very recently. This was a different Stanford event to the one I'm about to also quote from. And in this response, Sam Altman confirmed that he's personally using the unreleased version of their new model.

But enough of secondhand sources. What about another direct quote from Sam Altman? Well, here's some more evidence, released yesterday, that rather than drop a bombshell GPT-5 on us, which I predict will come somewhere between November and January, they're going to give us an iterative GPT-4.5 first. He doesn't want to surprise us.

"It does kind of suck to ship a product that you're embarrassed about, but it's much better than the alternative. And in this case in particular, where I think we really owe it to society to deploy iteratively. One thing we've learned is that AI and surprise don't go well together.

People don't want to be surprised. People want a gradual rollout and the ability to influence these systems. That's how we're going to do it." Now, he might want to tell that to OpenAI's recent former head of developer relations. He now works at Google and said, "Something I really appreciate about Google's culture is how transparent things are.

30 days in, I feel like I have a great understanding of where we are going from a model perspective. Having line of sight on this makes it so much easier to start building compelling developer products." It almost sounds like the workers at OpenAI often don't have a great understanding of where they're going from a model perspective.

Now, in fairness, Sam Altman did say that the current GPT-4 is significantly dumber than their new model. "ChatGPT is not phenomenal, like ChatGPT is mildly embarrassing at best. GPT-4 is the dumbest model any of you will ever, ever have to use again by a lot. But you know, it's like important to ship early and often, and we believe in iterative deployment." So an agency- and reasoning-focused GPT-4.5 coming soon, but GPT-5 not until the end of the year or early next: those are my predictions.

Now, some people were saying that the mystery GPT-2 chatbot could be GPT-4.5. It was released on a site used to compare the different outputs of language models. And look, here it is creating a beautiful unicorn, which Llama 3 couldn't do. Now, I frantically got ready a tweet saying that superintelligence had arrived, but quickly had to delete it.

And not just because other people were reporting that they couldn't get decent unicorns. And not just because that exact unicorn could be found on the web. But the main reason was that I was one of the lucky ones to get in and test GPT-2 chatbot on the arena before it was withdrawn.

I could only do eight questions, but I gave it my standard handcrafted (so not on the web) set of test questions, spanning logic, theory of mind, mathematics, coding, and more. Its performance was pretty much identical to GPT-4 Turbo. There was one question that it would get right more often than GPT-4 Turbo, but that could have been noise.

So if this was a sneak preview of GPT-4.5, I don't think it's going to shock and stun the entire industry. So tempting as it was to bang out a video saying that AGI has arrived in all caps, I resisted the urge. Since then, other testers have found broadly the same thing.

On language translation, the mystery GPT-2 chatbot massively underperforms Claude Opus and still underperforms GPT-4 Turbo. On an extended test of logic, it does about the same as Opus and GPT-4 Turbo. Of course, that still does leave the possibility that it is an OpenAI model, a tiny one, and one that they might even release as open weights, meaning anyone can use it.

And in that case, the impressive thing would be how well it's performing despite its small size. So if GPT-2 chatbot is a smaller model, how could it possibly be even vaguely competitive? The secret sauce is the data. As James Betker of OpenAI said, "It's not so much about tweaking model configurations and hyperparameters, nor is it really about architecture or optimizer choices.

Behavior is determined by your dataset. It is the dataset that you are approximating to an incredible degree." In a later post, he referred to the flaws of DALL-E 3 and GPT-4, and also flaws in video, probably referring to the then-unreleased Sora, and said they arise from a lack of data in a specific domain.

And in a more recent post, he said that while compute efficiency was still super important, anything can be state-of-the-art with enough scale, compute, and eval hacking. Now we'll get to evaluation and benchmark hacking in just a moment, but it does seem to me that there are more and more hints that you can brute force performance with enough compute and, as mentioned, a quality dataset.

But at least to me, it seems increasingly clear that you can pay your way to top performance. Unless OpenAI reveal something genuinely shocking, the performance of Meta's Llama 3 models (8 billion, 70 billion, and soon 400 billion parameters) shows that they have less of a secret sauce than many people had thought.

But as Mark Zuckerberg hinted recently, it could just come down to which company blinks first. Who among Google, Meta, and Microsoft, which provides the compute for OpenAI, is willing to continue to spend tens or hundreds of billions of dollars on new models? If the secret is simply the dataset, that would make less and less sense.

"Over the last few years, I think there's this issue of GPU production, right? So even companies that had the money to pay for the GPUs couldn't necessarily get as many as they wanted because there were all these supply constraints. Now I think that's sort of getting less. So now I think you're seeing a bunch of companies think about, 'Wow, we should just like really invest a lot of money in building out these things.' And I think that will go for some period of time.

There is a capital question of like, 'Okay, at what point does it stop being worth it to put the capital in?' But I actually think before we hit that, you're going to run into energy constraints." Now if you're curious about energy and data centre constraints, check out my "Why Does OpenAI Need a Stargate Supercomputer" video released four weeks ago.

But before we leave data centres and datasets, I must draw your attention to this paper released in the last 24 hours. It's actually a brilliant paper from Scale AI. What they did was create a new and refined version of a benchmark that's used all the time to test the mathematical reasoning capabilities of AI models.

And there were at least four fascinating findings relevant to all new models coming out this year. The first is the context: they worried that many of the latest models had seen the benchmark questions in their training data. That's called contamination because, of course, it contaminates the results on the test.

The original test had 8,000 questions, but what they did was create 1,000 new questions of similar difficulty. Now if contamination wasn't a problem, then models should perform just as well on the new questions as on the old. And for some models, that clearly didn't happen. For the Mistral and Phi families of models, performance notably lagged on the new test compared to the old one.

Whereas, fair's fair, for GPT-4 and Claude, performance was the same or better on the new, fresh test. But here's the thing: the authors figured out that that wasn't just about which models had seen the questions in their training data. They say that Mistral Large, which performed exactly the same, was just as likely to have seen those questions as Mixtral Instruct, which way underperformed.
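To make that comparison concrete, here is a minimal sketch of how such a contamination check can be read; this is my own illustration, not code from the Scale AI paper, and every number in it is made up.

```python
# Toy illustration of a contamination check: compare a model's accuracy on
# the original benchmark questions with its accuracy on freshly written
# questions of similar difficulty. All figures here are hypothetical.

def accuracy(results):
    """Fraction of questions answered correctly (1 = correct, 0 = incorrect)."""
    return sum(results) / len(results)

# Hypothetical per-question outcomes for one model.
original_set = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # questions possibly seen in training data
fresh_set    = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]  # new questions of similar difficulty

gap = accuracy(original_set) - accuracy(fresh_set)
print(f"original: {accuracy(original_set):.0%}, fresh: {accuracy(fresh_set):.0%}, gap: {gap:+.0%}")

# A large positive gap is consistent with contamination or overfitting to the
# benchmark; a gap near zero suggests the headline score actually generalises.
```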

So what could explain the difference? Well, the bigger models generalize. Even if they have seen the questions, they learn more from them and can generalize to new questions. And here's another supporting quote: they lean toward the hypothesis that sufficiently strong large language models learn elementary reasoning ability during training.

You could almost say that benchmarks get more reliable when you're talking about the very biggest models. Next, and this seems to be a running theme in popular ML benchmarks, GSM8K, designed for grade schoolers, has a few errors. They didn't say how many, but some answers were supposed to be positive integers and they weren't.

The new benchmark, however, passed through three layers of quality checks. Third, they provide extra theories as to why models might overperform on benchmarks compared to the real world. That's not just about data contamination. It could be that model builders design datasets that are similar to test questions. After all, if you were trying to bake in reasoning to your model, what kind of data would you collect?

Plenty of exams and textbooks. So the more similar your dataset is in nature, not just in exact matches, to the benchmarks, the more your benchmark performance will be elevated compared to simple real-world use. Think about it: it could be an inadvertent thing, where the same data that enhances the overall smartness of the model also inflates its benchmark scores relative to the real world.

And whatever you think about benchmarks, that does seem to work. Here's Sébastien Bubeck, lead author of the Phi series of models. I've interviewed him for AI Insiders, and he said this: "Even on those 1,000 never-before-seen questions, Phi-3 Mini, which is only 3.8 billion parameters, performed within about 8 or 9% of GPT-4 Turbo." Now we don't know the parameter count of GPT-4 Turbo, but it's almost certainly orders of magnitude bigger.

So training on high quality data, as we have seen, definitely works, even if it slightly skews benchmark performance. But one final observation from me about this paper, I read almost all the examples that the paper gave from this new benchmark. And as the paper mentions, they involve basic addition, subtraction, multiplication, and division.

After all, the original test was designed for youngsters. You can pause and try the questions yourself, but despite them being lots of words, they aren't actually hard at all. So my question is this: why are models like Claude 3 Opus still getting any of these questions wrong? It's scoring around 60% on graduate-level expert reasoning, the GPQA.

If Claude 3 Opus, for example, can get questions right that PhDs struggle to answer with Google and 30 minutes, why on earth, given five short examples, can it not get these basic grade school questions right? Either there are still flaws in the test, or these models do have a limit in terms of how much they can generalize.

Now, if you like this kind of analysis, feel free to sign up to my completely free newsletter. It's called Signal to Noise, and the link is in the description. And if you want to chat in person about it, the regional networking on the AI Insiders Discord server is popping off.

There are meetings being arranged not only in London, but Germany, the Midwest, Ireland, San Francisco, Madrid, Brazil, and it goes on and on. Honestly, I've been surprised and honored by the number of spontaneous meetings being arranged across the world, but it's time, arguably, for the most exciting development of the week, MedGemini from Google.

It's a 58 page paper, but the TLDR is this. The latest series of Gemini models from Google are more than competitive with doctors at providing medical answers. And even in areas where they can't quite perform, like in surgery, they can be amazing assistants. In a world in which millions of people die due to medical errors, this could be a tremendous breakthrough.

MedGemini contains a number of innovations; it wasn't just rerunning the same tests on a new model. For example, you can inspect how confident a model is in its answer. By trawling through the raw outputs of the model, called logits, you can see how much probability it assigns to its answers.

If the model gives a confident answer, you submit that as the answer. They used this technique, by the way, for the original Gemini launch, where they claimed to beat GPT-4, but that's another story. Anyway, if the model is not confident, you can get it to generate search queries to resolve that uncertainty; train it, in other words, to use Google. Seems appropriate. Then you can feed the additional context provided by the web back into the model to see if it's confident now.
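As a rough sketch of what that loop might look like in code (my own illustration, not the MedGemini implementation; answer_fn and search_fn are hypothetical placeholders for a real model call and a real search API):

```python
import math

def softmax(logits):
    """Turn raw answer-option logits into probabilities."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def uncertainty_guided_answer(question, answer_fn, search_fn,
                              threshold=0.8, max_search_rounds=2):
    """Answer a question; if the probability of the chosen option is below
    `threshold`, retrieve extra web context and ask again.

    answer_fn(question, context) -> (answer, option_logits)   # placeholder model call
    search_fn(question)          -> string of search results  # placeholder search call
    """
    context = ""
    answer, logits = answer_fn(question, context)
    confidence = max(softmax(logits))
    for _ in range(max_search_rounds):
        if confidence >= threshold:
            break                                   # confident enough: submit this answer
        context += "\n" + search_fn(question)       # uncertain: fetch more evidence
        answer, logits = answer_fn(question, context)
        confidence = max(softmax(logits))
    return answer, confidence
```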

But that was just one innovation. What about this fine-tuning loop? To oversimplify, they get the model to output answers again, using the help of search, and then the outputs that had correct answers were used to fine-tune the models.
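A stripped-down sketch of that fine-tuning data loop might look like the following; again, this is my own illustration with hypothetical placeholder callables, not Google's actual pipeline:

```python
def build_self_training_set(questions, generate_with_search, is_correct,
                            samples_per_question=4):
    """Collect search-assisted answers that turned out to be correct,
    for use as fine-tuning data.

    generate_with_search(question) -> model output string (produced with web search)
    is_correct(question, output)   -> bool, e.g. checks against a known gold answer
    Both callables are hypothetical stand-ins for the real model and grader.
    """
    finetune_examples = []
    for question in questions:
        for _ in range(samples_per_question):
            output = generate_with_search(question)
            if is_correct(question, output):
                finetune_examples.append({"prompt": question, "completion": output})
                break  # keep one good trace per question in this sketch
    return finetune_examples

# The collected examples would then feed a standard fine-tuning job.
```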

Now that's not perfect, of course, because sometimes you can get the right answer with the wrong logic, but it worked, up to a certain point at least. Just last week, by the way, on Patreon, I described how this reinforced in-context learning can be applied to multiple domains. Other innovations come from the incredible long-context abilities of the Gemini 1.5 series of models.

With that family of models, you can trawl through a 700,000-word electronic health record. Now imagine a human doctor trying to do the same thing. I remember, on the night of Gemini 1.5's release, calling it the biggest news of the day, even more significant than Sora, and I still stand by that.

So what were the results? Well, of course, state-of-the-art performance on MedQA, which assesses the ability to diagnose diseases. The doctor pass rate, by the way, is around 60%. And how about this for a mini-theme of the video: when they carefully analyzed the questions in the benchmark, they found that 7.4% of them have quality issues.

Things like lacking key information, incorrect answers, or multiple plausible interpretations. So just in this video alone, we've seen multiple benchmark issues, and I collected a thread of other benchmark issues on Twitter. The positive news, though, is just how good these models are getting at things like medical note summarization and clinical referral letter generation.

But I don't want to detract from the headline, which is just how good these models are getting at diagnosis. Here you can see MedGemini with search way outperforming expert clinicians with search. By the way, when errors from the test were taken out, performance bumped up to around 93%. And the authors can't wait to augment their models with additional data.

Things like data from consumer wearables, genomic information, nutritional data, and environmental factors. And as quite an amusing aside, it seems like Google and Microsoft are in a tussle to throw shade at each other's methods, in a positive spirit. Google contrasts their approach with MedPrompt from Microsoft, saying that their approach is principled and can be easily extended to more complex scenarios beyond MedQA.

Now you might say that's harsh, but Microsoft earlier had said that their MedPrompt approach shows GPT-4's ability to outperform Google's model that was fine-tuned specifically for medical applications. It outperforms on the same benchmarks by a significant margin. Well, Google have obviously one-upped them by reaching new state-of-the-art performances on 10 of 14 benchmarks.

Microsoft had also said that their approach has simple prompting and doesn't need more sophisticated and expensive methods. Google shot back saying they don't need complex, specialized prompting and their approach is best. Honestly, this is competition that I would encourage. May they long compete for glory in this medical arena.

In case you're wondering, because Gemini is a multimodal model, it can see images too. It can interact with patients and ask them to provide images. The model can also interact with primary care physicians and ask for things like X-rays. And most surprisingly to me, it can also interact with surgeons to help boost performance.

Yes, that's video assistance during live surgery. Of course, they haven't yet deployed this, for ethical and safety reasons, but Gemini is already capable of assessing a video scene and helping with surgery, for example, by answering whether the critical view of safety criteria are being met. Do you have a clear view of the gallbladder, for example?

And MedGemini could potentially guide surgeons in real time during these complex procedures, not only for improved accuracy but also for better patient outcomes. Notice the nuance of the response from Gemini: the lower third of the gallbladder is not dissected off the cystic plate. And the authors list many improvements that they could have made; they just didn't have time to make them.

For example, the models were searching the wild web, and one option would be to restrict the search results to more authoritative medical sources. The catch, though, is that this model is not open-sourced and isn't widely available, due to, they say, the safety implications of unmonitored use. I suspect the commercial implications of open-sourcing Gemini also had something to do with it.

But here's the question I would set for you. We know that hundreds of thousands or even millions of people die due to medical mistakes around the world. So if and when MedGemini 2, 3 or 4 becomes unambiguously better than all clinicians at diagnosing diseases, then at what point is it unethical not to at least deploy them in assisting clinicians?

That's definitely something at least to think about. Overall, this is exciting and excellent work. So many congratulations to the team. And in a world that is seeing some stark misuses of AI, as well as increasingly autonomous deployment of AI, like this autonomous tank, the fact that we can get breakthroughs like this is genuinely uplifting.

Thank you so much for watching to the end and have a wonderful day.