The newest model from OpenAI is here and, in a possible coincidence, the world's IT infrastructure is now down. But seriously, I'm just glad your connection still works as you join me to investigate the brand new GPT-4o mini, which is quite a mouthful but is claimed to have superior intelligence for its size.
Because millions of free users might soon be using it, I've been scrutinizing the model relentlessly since last night and will explain why OpenAI might need to be a bit more honest about the trade-offs involved and where they might head next. So here is the claim from Sam Altman, the CEO of OpenAI, that we're heading towards intelligence too cheap to meter.
He justifies this claim with the lower cost for those who pay per token and an increased score, for a model of its size, on the MMLU benchmark. Now, there is no doubt that models are getting cheaper for those who pay per token. Here is GPT-4o mini compared to Google's Gemini 1.5 Flash, a model of comparable size, and Anthropic's Claude 3 Haiku.
At least on the MMLU benchmark, it scores higher while being cheaper. And there's no doubt that I, and OpenAI, could dazzle you with plenty more charts. Notice in particular the massive discrepancy on the MATH benchmark: GPT-4o mini scores 70.2% there, compared to scores in the low 40s for the comparable models.
Just quickly, for any of you watching who wonder why we need these smaller models: it's because sometimes you need quicker, cheaper models to do a task that doesn't require frontier capabilities. Anyway, I'm here to say that the picture is slightly more complicated than it first appears. And of course, I, and potentially you, are slightly more interested in what GPT-4o mini tells us about the general state of progress in artificial intelligence.
Just quickly though, the name. It's a little bit butchered, isn't it? I mean, the "o" was supposed to stand for Omni, meaning all modalities, but the GPT-4o mini that's now rolled out just supports text and vision, not video, not audio. And yes, we still don't have a confirmed date for the GPT-4o audio capabilities that we all saw a few months ago.
Plus, let's forgive those new to AI who look at this model name and read it as GPT-40 (forty) mini. I kind of feel sorry for those guys because they're thinking, where have I been for the last 39 versions? Anyway, audio inputs and outputs are apparently coming in the "future". They don't put dates these days, but there is some positive news.
It supports up to 16,000 output tokens per request. Think of that as being around 12,000 words, which is pretty impressive. It has knowledge up to October of last year, which suggests to me that it is a checkpoint of the GPT-4o model. Think of that like an early save during your progress through a video game.
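If you want to sanity-check that conversion, the usual rule of thumb is roughly three-quarters of an English word per token; that ratio is a rough assumption, not an exact OpenAI figure.

```python
# Rough tokens-to-words estimate using the common ~0.75 words-per-token rule of thumb.
max_output_tokens = 16_000
approx_words = max_output_tokens * 0.75
print(approx_words)  # 12000.0
```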
Indeed, one OpenAI researcher hinted heavily that a much larger version of GPT-4o mini, bigger even than GPT-4o, is out there. Just after the release of GPT-4o mini, Roon said, "People get mad at any model release that's not immediately AGI or a frontier capabilities improvement." But think for a second, why was this GPT-4o mini made?
How did this research artifact come to be? What is it on the path to? And again, hinting at a much better model being out there, he retweeted this: "Oh, you made a much smaller, cheaper model 'just as good' as the top model from a few months ago. Hmm, wonder what you're doing with those algorithmic improvements?" So even for those of you who don't care about small, quick, or cheap models, OpenAI are at least claiming they know how to produce superior textual intelligence.
But let's just say things get a lot more ungrounded from here on out. First, they describe the MMLU as a textual intelligence and reasoning benchmark. Well, let's just say, for those of you new to the channel, it's much more of a flawed, memorization-heavy, multiple-choice challenge. But at this point, I know I might be losing a lot of you who think, "Well, that's just one benchmark.
The numbers across the board are going up. What's the problem?" Well, I'm going to give you several examples to show you why benchmarks aren't all that matter. It's not only that there are sometimes mistakes in these benchmarks, it's that prioritizing and optimizing for benchmark performance that you can announce in a blog post often comes to the detriment of performance in other areas.
Like, for example, common sense. Take this question, which sounds a little bit like a common math challenge: chicken nuggets come in small, medium, or large boxes of five, six, or eight nuggets, respectively. Philip wants 40 nuggets and can only buy one size of box, so list all the sizes of box he cannot currently buy.
So far, so good. But wait: he has no access to any form of payment and is in a coma. So which sizes do you think he can't buy, given all of these conditions, the lack of payment, and the coma?
If you train a model relentlessly on math challenges, it's almost like a hammer seeing a nail everywhere. It will definitely get better at hammering or solving known math challenges, but sometimes with some trade-offs. The model at no point acknowledges the lack of access to payment or the coma and focuses on simple division.
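To see just how shallow the "surface" math is that the model latches onto, here's a quick sketch in Python; this is my own illustration of the divisibility check, not anything from OpenAI.

```python
# The only part of the question the model actually engages with:
# which box sizes divide 40 nuggets exactly?
for size in (5, 6, 8):
    verdict = "fits exactly" if 40 % size == 0 else "does not fit exactly"
    print(f"Box of {size}: {verdict}")
# Boxes of 5 and 8 fit exactly; 6 does not. None of which matters,
# because Philip has no means of payment and is in a coma.
```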
And remember those other models that perform worse in the benchmarks and are slightly more expensive, like Gemini 1.5 Flash from Google? Its answer is a lot simpler, directly addressing the obvious elephant in the room. And likewise, Claude 3 Haiku from Anthropic starts off thinking it's a math challenge, but quickly acknowledges the lack of payment and him being in a coma.
The point I'm trying to make is that you can make your numbers on a chart like MATH go up, but that doesn't always mean your model is universally better. I think OpenAI need to be more honest about the flaws in the benchmarks and what benchmarks cannot capture, particularly as these models are used more and more in the real world, as we shall soon see.
So after almost 18 months of promises from OpenAI when it comes to smarter models, what's the update when it comes to reasoning prowess? Well, as is par for the course, we can only rely on leaks, hints and promises. Bloomberg described an all-hands meeting last Tuesday at OpenAI in which a new reasoning system was demoed, as well as a new classification system.
In terms of reasoning, company leadership, they say, gave a demo of a research project involving its GPT-4 AI model that OpenAI thinks shows some new skills that rise to human-like reasoning, according to, presumably, a person at OpenAI. I'll give you more info about this meeting from Reuters, but first, what's that classification system they mentioned?
Here is the chart, and elsewhere in the article, OpenAI say that they are currently on level 1 and are on the cusp of level 2. That, to me, is the clearest admission that current models aren't reasoning engines, as Sam Altman once described them, or yet reasoners. Although, again, they promise they're on the cusp of reasoning.
And here is the report from Reuters, which may or may not be about the same demo. They describe a project called Strawberry, formerly known as Q*, which is seen inside the company as a breakthrough. Now, this is not the video to get into Q* (I did a separate video on that), but they did give a bit more detail.
The reasoning breakthrough is supposedly proven by the fact that the model scored over 90% on the MATH dataset, the same benchmark on which we just saw GPT-4o mini score 70%. Well, if that's their proof of human-like reasoning, colour me sceptical. By the way, if you want dozens more examples of the flaws of these kinds of benchmarks, and just how hard it is to pin down whether a model can do a task, check out one of my videos on AI Insiders on Patreon.
And I've actually just released my 30th video on the platform, this one on emergent behaviours. I'm biased of course, but I think it really does nail down this debate over whether models actually display emergent behaviours. Some people clearly think they do, though, with Stanford professor Noah Goodman telling Reuters, "I think it's both exciting and terrifying." Describing his speculations about synthetic training data, Q*, and reasoning improvements, he said, "If things keep going in that direction, we have some serious things to think about as humans." The challenge, of course, at its heart, is that these models rely, for their sources of truth, on human text and human images.
Their goal, if they have any, is to model and predict that text, not the real world. They're not trained on or in the real world, but only on descriptions of it. They might have textual intelligence and be able to model and predict words, but that's very different from social or spatial intelligence.
As I've described before on the channel, people are working frantically to bring real-world embodied intelligence into models. A startup launched by Fei-Fei Li just four months ago is now worth $1 billion. Its goal is to train a machine capable of understanding the complex physical world and the interrelation of objects within it.
At the same time, Google DeepMind is working frantically to do the same thing. How can we give large language models more physical intelligence? While text is their ground truth, they will always be limited. Humans can lie in text, audio, and image, but the real world doesn't lie. Reality is reality.
Of course, we would always need immense real-world data to conduct novel experiments, test new theories, iterate, and invent new physics. Or, less ambitiously, just to have useful robots. Just the other day, Google DeepMind released the results of putting Gemini 1.5 Pro inside this robot. And the attached paper also contains some fascinating nuggets.
To boil it down for this video, though: Gemini 1.5 Pro is incapable of navigating the robot zero-shot without a topological graph. Apparently, Gemini almost always outputs the "move forward" waypoint regardless of the current camera observation. As we've discussed, the models need to be grounded in some way, in this case with classical policies.
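To picture what being grounded with classical policies looks like in practice, here is a rough sketch of that kind of loop; every name in it is hypothetical, my own illustration rather than the paper's actual API.

```python
# Hypothetical sketch of a VLM-plus-classical-policy navigation loop.
# All function and object names are invented for illustration only.
def navigate(instruction, topo_graph, camera, vlm, low_level_policy):
    while True:
        frame = camera.capture()
        # The VLM only picks a waypoint node on the pre-built topological
        # graph; it never drives the motors directly.
        waypoint = vlm.choose_waypoint(instruction, frame, topo_graph.node_images())
        if waypoint is None:  # the VLM believes the goal has been reached
            return
        # A classical controller handles the actual motion to that waypoint.
        low_level_policy.drive_to(topo_graph.pose_of(waypoint))
```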
And there is, of course, the amusing matter of lag. Apparently, the inference time of Gemini 1.5 Pro was around 10 to 30 seconds in video mode, resulting in users awkwardly waiting for the robot to respond. It might almost have been quite funny with them asking "where's the toilet?" and the robot just standing there staring for 30 seconds before answering.
And I don't know about you, but I can't wait to actually speak to my robot assistant and have it understand my British accent. I'm particularly proud to have this video sponsored by AssemblyAI, whose Universal-1 speech-to-text model is the one that I rely on. Indeed, as I've said before on the channel, I actually reached out to them, such was the performance discrepancy.
In short, it recognizes my GPTs from my RTXs, which definitely helps when making transcriptions. The link will be in the description to check them out. And I've actually had members of my Patreon thank me for alerting them to the existence of AssemblyAI's Universal-1. But perhaps I can best illustrate the deficiencies in spatial intelligence of current models with an example from a new benchmark that I'm hoping to release soon.
It's designed to clearly illustrate the difference between modeling language and the real world. It tests mathematics, spatial intelligence, social intelligence, coding, and much more. What's even better is that the people I send these questions to typically crush the benchmark, but language models universally fail. Not every question, but almost every question.
Indeed, in this question, just for extra emphasis, I said at the start that this is a trick question that's not actually about vegetables or fruit. I gave this question, by the way, to Gemini 1.5 Flash from Google, and a modified version of it also tricks Gemini 1.5 Pro.
You can, of course, let me know in the comments what you would pick. Here's the question I asked: alone in the room, one-armed Philip carefully balances a tomato, a potato, and a cabbage on top of a plate. Philip meticulously inspects the three items before turning the silver plate completely upside down several times, shaking the plate vigorously and spending a few minutes each time inspecting for any roots on the other side of the silver non-stick plate.
And finally, after all of this, counts only the vegetables that remain balanced on top of the plate. How many vegetables does Philip likely then count? 3, 2, 1, or 0. Now, if you're like me, you might be a little amused that the model didn't pick the answer 0. That's what I would pick.
And why do I pick 0? Because visualizing this situation in my mind, clearly all three objects would fall off the plate. In fact, I couldn't have made it more obvious that they would fall off. The plate is turned upside down. He's got one arm, so no means of balancing.
It's a non-stick plate and he does it repeatedly for a few minutes each time. Even for those people who might think there might occasionally be a one in a billion instance of stickiness, I said, how many vegetables does Philip likely then count? So why does a model like Gemini 1.5 Flash still get this wrong?
It's because, as I discussed in my video on the ARC-AGI challenge from François Chollet, models are retrieving certain programs. They're a bit like a search engine for text-based programs to apply to your prompt. And the model has picked up on the items I deliberately used in the second sentence: tomato, potato, and cabbage.
It has been trained on hundreds or thousands of examples discussing how, for example, a tomato is a fruit, not a vegetable. So its, quote, "textual intelligence" is prompting it to retrieve that program and give an output that discusses a tomato being a fruit, not a vegetable. And once it selects that program, almost nothing will shake it free from that decision.
Now, as I say that, I realise I'm actually recalling an interaction I had with Claude 3 Haiku, which I'll show you in a moment. What confused Gemini 1.5 Flash in this instance was the shape of the vegetables and fruit. Having retrieved the program that tomatoes are the most round and smooth, it sticks to that program, saying it's the tomato that will fall off.
Notice how it says that potatoes and cabbages are likely to stay balanced, but then says only one vegetable will remain on the plate. It's completely confused, but so is Claude 3 Haiku, which I was referring to earlier. It fixates on tomatoes and potatoes, which are, quote, "fruits, not vegetables", because it is essentially retrieving relevant text.
I will, at this point, at long last, give credit to GPT-4o mini, which actually gets this question correct. I can envisage, though, models in the future actually creating simulations of the question at hand, running those simulations, and giving you a far more grounded answer: simulations which could be based on billions of hours of real-world data.
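Just to make that idea concrete, here's a toy, hand-written "simulation" of the plate question; it's purely my own sketch of what a simulate-then-answer approach might look like, not how any current model works.

```python
# Toy "simulation" of the plate question: make the physical conditions
# explicit and count what could plausibly remain balanced on the plate.
items = ["tomato", "potato", "cabbage"]

def stays_on_plate(plate_upside_down, plate_non_stick, arms_available):
    # An upside-down, non-stick plate held by someone with one arm:
    # nothing stays put.
    return not (plate_upside_down and plate_non_stick and arms_available < 2)

remaining = [item for item in items if stays_on_plate(True, True, 1)]
print(len(remaining))  # 0
```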
So do try to bear this video in mind when you hear claims like this from the mainstream media. Benchmark performance does not always directly translate to real world applicability. I'll show you a quick medical example after this 30 second clip. What we did was we fed 50 questions from the USMLE Step 3 medical licensing exam.
It's the final step before getting your medical license. So we fed 50 questions from this exam to the top five large language models. We were expecting more separation, and quite frankly, I wasn't expecting the models to do as well as they did. The reason why we wanted to do this was a lot of consumers and physicians are using these large language models to answer medical questions, and there really wasn't good evidence out there on which ones were better.
It didn't just give you the answer, but explained why it chose a particular answer, and then why it didn't choose other answers. So it was very descriptive and gave you a lot of good information. Now, as long as the language is in the exact format in which the model is expecting it, things will go smoothly.
This is a sample question from that exact same medical test. I'm giving it to ChatGPT-4o and have made just a couple of slight amendments. The question at the end of all of these details was, "Which of the following is the most appropriate initial statement by the physician?" Now, you don't need to read this example, but I'll show you the two amendments I made.
First, I added to the sentence, "Physical examination shows no other abnormalities," the words "open gunshot wound to the head as the exception." Next, I tweaked the correct answer, which was A, adding in the pejorative "wench." ChatGPT-4o completely ignores the open gunshot wound to the head and still picks A. It does, however, note that the use of "wench" is inappropriate, but still picks that answer as the most appropriate one.
Oh, and I also changed answer E to, "We have a salient matter to attend to before conception." That, to me, would be the new correct answer in the light of the gunshot wound. Now, I could just say that the model has been trained on this question and so is somewhat contaminated, hence explaining the 98% score.
Obviously, it's more complex than that. The model will still be immensely useful for many patients. This example is more to illustrate the point that the real world is immensely messy. For as long as models are still trained on text, they can be fooled in text. They can make mistakes, hallucinate, confabulate in text.
Grounding with real-world data will mitigate that significantly. At that point, of course, it would no longer be appropriate to call them just language models. I've got so much more to say on that point, but that's perhaps for another video. One more use case that OpenAI gave, of course, was customer support, so I can't resist one more cheeky example.
I said to GPT-4o mini, based on today's events, "Role-play as a customer service agent for Microsoft." Definitely a tough day to be such an agent for Microsoft. Agent: hi, how can I help? User: hey, just had a quick technical problem. I turned on my PC and got the blue screen of death with no error code.
I resolved this quickly and completely, and I've had the PC for three months with no malware. I then removed peripherals, froze the PC in liquid nitrogen for decades, and double-checked the power supply. So why is it now not loading the home screen? Is it a new bug? Reply with the most likely underlying causes in order of likelihood.
Hmm, I wonder if it's anything to do with freezing the PC in liquid nitrogen for decades. Well, not according to this customer service agent, which doesn't even list that in the top five reasons. Of course, I could go on. These quirks aren't just limited to language, but also vision.
This paper from a few days ago describes vision language models as blind. At worst, they're like an intelligent person who is blind making educated guesses. On page eight, they give this vivid demonstration, asking how many intersections you can see between these two lines. They gave it to four vision models: GPT-4o (apparently the best), Gemini 1.5, Claude 3 Sonnet, and Claude 3.5 Sonnet.
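For what it's worth, counting intersections between two plotted lines is trivial to do in code; here's a generic sketch of my own, not anything from the paper itself.

```python
# Count crossings between two polylines using the standard orientation test.
# (Ignores degenerate cases where segments merely touch or overlap.)
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def segments_cross(p1, p2, p3, p4):
    d1, d2 = cross(p3, p4, p1), cross(p3, p4, p2)
    d3, d4 = cross(p1, p2, p3), cross(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0

def count_intersections(line_a, line_b):
    return sum(
        segments_cross(line_a[i], line_a[i + 1], line_b[j], line_b[j + 1])
        for i in range(len(line_a) - 1)
        for j in range(len(line_b) - 1)
    )

# Example: a two-segment peaked line crossed by a horizontal line, exactly twice.
print(count_intersections([(0, 0), (2, 2), (4, 0)], [(0, 1), (4, 1)]))  # 2
```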
You can count the intersections yourself if you'd like, but suffice to say the models perform terribly. Now, to end positively, I will say that models are getting better even before they're grounded in real-world data. Claude 3.5 Sonnet from Anthropic was particularly hard to fool. I had to make these adversarial language questions far more subtle to fool Claude 3.5 Sonnet, and we haven't even got Claude 3.5 Opus, the biggest model.
In fact, my go-to model is now unambiguously Claude 3.5 Sonnet. So, to end, I really do hope you weren't too inconvenienced by that massive IT outage, and I hope you enjoyed the video. Have an absolutely wonderful day.