
GPT-5 has Arrived


Transcript

Well, GPT-5 is here and it's in the free tier. I've tested it a bunch, read the system card in full and even sat through that full live stream. Wow. But actually, I think it's pretty huge that free users of ChatGPT will get access to GPT-5. In other words, approaching a billion people will experience a significantly more intelligent AI model, at least before they hit the limits.

But if you watched the live stream and demo, you may have been underwhelmed. And I don't just mean the mathematically impossible bar graphs, and there were multiple of those. There were even hallucinations in the segment describing how the model hallucinates less. For sure, it would be easy to make a video just taking the mick out of those mistakes.

But the thing is, GPT-5 is actually a pretty great model. So here are my first impressions. First, my own logic benchmark, or, as some people call it, a trick-question benchmark. I can confirm that GPT-5 does indeed crush the public questions of SimpleBench. Whoever it was that put out that viral thread of it getting nine out of ten on the ten public SimpleBench questions wasn't technically lying.

In some of my early testing, it got questions right that no other model had gotten right. When I saw this, I was like, man, I'm gonna have to bring out V2 really early. Everyone's gonna get super hyped. This is crazy. However, if you are newer to AI, you might not know that the performance of language models is heavily dependent on the training data they're fed.

And I suspect some of these 10 public questions have made it into the training data, at least indirectly, not deliberately, I think. But given that the models are trained on things like Reddit and other forums, it's definitely not impossible. Given how long I normally take to update the leaderboard, you guys might be quite shocked to hear that we're doing the runs tonight.

And so far, it's not setting a new record. That surprised even me; honestly, I was expecting 70%. So far, in the three runs we've done, it's getting around 57-58%. So at this point, we can be clear: it's not a new paradigm of AI.

And if you didn't believe models were AGI now, this model won't convince you that we have AGI. But in fairness to OpenAI, just a couple of hours ago, Sam Altman tweeted that they could release much, much smarter models. But the main thing they were pushing for was real-world utility, which I'm going to come to, and mass accessibility and affordability.

Well, they have delivered on that. When it comes to affordability, in the API, the prices are incredible, below Claude 4 Sonnet. For those who care about the API, I've got some coding data coming up.
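Since we're on API pricing, here is a minimal sketch in Python of what a call might look like through the official openai SDK, with a rough cost calculation attached. The "gpt-5" model identifier and the per-million-token prices ($1.25 in, $10 out) are my launch-day assumptions, so treat them as placeholders.

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # "gpt-5" as the model identifier is an assumption on my part.
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": "Summarise the GPT-5 system card in one paragraph."}],
    )
    print(response.choices[0].message.content)

    # Rough cost estimate, assuming launch pricing of $1.25 per million
    # input tokens and $10 per million output tokens.
    usage = response.usage
    cost = usage.prompt_tokens * 1.25 / 1e6 + usage.completion_tokens * 10 / 1e6
    print(f"Approximate cost of that call: ${cost:.6f}")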

But first, hallucinations. The team made a big play about how GPT-5 would hallucinate less than previous models. Quite early on in the system card, which, it seems, is as good a research paper as we're going to get, they say they find that GPT-5 main has 44% fewer responses with at least one major factual error. However, I got suspicious immediately when, on the live stream, I saw a bunch of new benchmarks rather than the ones we already know.

I don't blame anyone watching for not knowing this, but one of the most quoted benchmarks on hallucinations is SimpleQA: short, factual questions that are prone to hallucinations. And on that front, GPT-5 Thinking is just about better than o3, maybe if you squint. Obviously, inventing a new benchmark for every single model release makes comparison hard, but it's probably fair to say it does hallucinate a bit less.
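To make the system card's relative claim concrete, here is the arithmetic as a toy Python calculation; the baseline rate is made up purely for illustration.

    baseline_rate = 0.09  # hypothetical major-error rate for a prior model (made up)
    gpt5_rate = baseline_rate * (1 - 0.44)  # "44% fewer responses with a major error"
    print(f"Implied GPT-5 rate: {gpt5_rate:.1%}")  # ~5.0%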

Again, though, if you don't follow AI too closely: this model, just like all the others, will still hallucinate with a major incorrect claim around 5% of the time. And that's on the kinds of questions users are actually bringing to ChatGPT. Okay, now for one domain that OpenAI really did want to highlight, which is software engineering, and one benchmark in particular, SWE-bench Verified.

This time, by the way, you will notice the graph doesn't show 52% as taller than 69%. This, for me, is one of the bigger developments with GPT-5, because in software engineering, OpenAI has essentially lobbed a grenade at Anthropic. They want the Claude family line to die out without heirs.

Because SWE-bench Verified is the singular benchmark that Anthropic cited as proof that their newest model, Claude Opus 4.1, just a few days old, was the frontier model in this domain. If we completely ignore statistical error bars, GPT-5 is now better. Fair warning: if you are not into coding, or even vibe coding, this video won't spend too long on it.

But straight away within Cursor, I did notice the difference. I'm not going to show my code base right now, so let's just leave this other coding benchmark on screen. But testing GPT-5 against Claude 4 Sonnet, the best model you get by default from Anthropic in Cursor, GPT-5 was just better.

It found bugs that Sonnet assured me were not there. Obviously, we need more time to test, and there's always the dark horse of Gemini DeepThink, which I think might be the best of all. But Anthropic get a lot of revenue from vibe coders and professional developers, so GPT-5's release could be a challenging time for them.

Now, I know it was a very hype-y statement from the live stream, but I do actually get it when one of the presenters said he trusts the model more with coding. Of course, language models live and die by data, so if GPT-5 lacks it in your domain, it will be jank, so do test things out yourself.

Beyond coding though, if you are asking a technical question that might rely on an image, GPT-5 is looking real good. Take the MMMU, which includes a ton of charts and tables. And here, GPT-5 is beating the massively slower and much more inaccessible Gemini DeepThink, which is currently reserved for the addicts forking out $250 a month.

Yes, I was disappointed that the context window has barely been widened. Man, we need some fresh air in here. What I mean is, I love that Gemini 2.5 Pro can analyse one million tokens, or almost a million words, but we are stuck in the low hundreds of thousands for GPT-5.
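For a sense of what that limit means in practice, here is a small Python sketch using the tiktoken library to check whether a document fits. The o200k_base encoding and the ~272,000-token input limit for GPT-5 are both my assumptions, so adjust if the real figures differ.

    import tiktoken  # pip install tiktoken

    # o200k_base is the encoding recent OpenAI models use; whether GPT-5
    # shares it is an assumption on my part.
    enc = tiktoken.get_encoding("o200k_base")

    def fits(text, limit):
        # True if the text fits within a context window of `limit` tokens.
        return len(enc.encode(text)) <= limit

    doc = open("repo_dump.txt").read()  # hypothetical file
    print("Fits assumed GPT-5 input window (~272k):", fits(doc, 272_000))
    print("Fits Gemini 2.5 Pro window (1M):", fits(doc, 1_000_000))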

Next, you may have noticed Sébastien Bubeck on the live stream, one of the lead authors of that famous Sparks of AGI paper about GPT-4, and, actually, one of the first viewers of this channel two and a half years ago. He described the "recursive self-improvement" to be had when models can produce better synthetic data, which is then used to train the next generation of models.

I do get that for racking up epic scores on benchmarks, but if that was all it took, then models like Phi-4 and the open-weight OpenAI GPT-OSS model would be demigods by now, given how heavily they were trained on synthetic data. True story: that recent OpenAI open-weight model flopped so badly on SimpleBench that OpenAI reached out to me personally to ask about our settings, which were the standard ones, by the way.

A bit later on, I'm going to get to the biggest bit of anti-hype you could find, but a few new people may be watching tonight, so let's touch on another hugely usable, amazing aspect of GPT-5. If you watched the live stream and thought those vibe coding demos were epic, they were, and don't let addicts like us tell you otherwise.

Because snake games and chart displays are everywhere in the training data, GPT-5 can bang them out with ease. Building a production-ready consumer app is a very different story, for now. So I would say we don't quite have what Sam Altman tonight called, quote, "software on demand", but we may be slowly getting there.

More on that in another video coming out soon. Back to the system card: they kept talking about health journeys, and though I had never heard that expression, it didn't strike me that they were taking us for a ride. For me, it's genuinely incredible, and will be super impactful, that you can often get expert-level, text-based diagnoses from these models.

Notice I say you can often get that. Though I don't think many people will have noticed this one: GPT-5 Mini scored higher on HealthBench Consensus than GPT-5 itself. I can imagine some frantic users on the free tier waiting until their GPT-5 allowance runs out to get the, quote, "better model on health".

Now, just in case there are any really confused people watching: no, there's no new Sora or image generator. There are, however, new voices, and for those who've been chatting quite a lot with models on the free tier, you should notice a step up in conversation quality. Although I was disappointed that GPT-5's language skills have not improved from o3.

Translation, I feel, is such an unmitigated good from AI. I was hoping that they had been able to push the frontier a bit harder. Some of you will be asking about the semi-mythical GPT-5 Pro, and yes, I'd love to test it. I am on the Pro tier, but I'm not yet seeing it in my app as of tonight, so soon hopefully.

It probably isn't crazily better than the other types of GPT-5, given they barely mentioned it in the presentation. Now, though, for the real anti-hype. Even one of the lead authors of AI 2027, which took the world by storm and was read by millions, well, the headlines and videos derived from it were seen or read by millions, that's Eli Lifland, said that he noticed from the system card no improvement on the coding evals that weren't SWE-bench.

Translated, you know all those videos you've been seeing on YouTube about AI taking over the world within the next two years? Well, one of the authors behind that, who I interviewed on Patreon, has probably updated in the negative in terms of his timelines. We should be seeing a bit more self-improvement by now, if those timelines were accurate.

It makes sense, right? We would need to see significant improvement from GPT-5 on machine learning engineering, on MLE-bench for example, and we don't quite see that. What about this benchmark from the system card, OpenAI pull requests? Can GPT-5 do some of the more mundane tasks that are performed at OpenAI?

Well, without diving too much into that benchmark, notice the increment: we're not seeing big jumps from o3. What about the ability of models to replicate state-of-the-art AI research? This was tested on OpenAI's own PaperBench. Again, correct me if I'm wrong, but not a huge step forward. Then there's this benchmark, arguably the most interesting of them all.

I actually think it's a brilliant benchmark, and I remember asking someone at OpenAI about such a benchmark two years ago. Pretty sure that request had no impact, but either way, check this out. Can AI models overcome any of 20 internal research and engineering bottlenecks encountered for real at OpenAI in the past?

These were the kind of bottlenecks that led to delays of at least a day, and in some cases they said it influenced the outcome of large training runs and launches. Amazing benchmark, and for now, slightly underwhelming performance. I say that, but it kind of depends on your perspective, because solving 2% of those is not bad in my opinion.

My only point was that's the same score as o3. Mind you, I've just noticed something while filming: the green bar looks taller than o3's bar, even though it's the same 2%. Man, the chart crimes OpenAI are committing for GPT-5 are unbelievable. I'm just going to spend 20 seconds now on safety, because I did see OpenAI's safety paper, and while I haven't finished it, I do love the sound of the new approach to refusals.

Basically, they've moved to what's called safe completions as a new safety paradigm. It makes sense to me: rather than the model making a snap judgment on whether the user's intent is good or bad and then completely obeying or refusing, safe completions focus entirely on the safety of the model's output.
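To illustrate the distinction, here is a toy Python sketch. This is entirely my own framing, not anything from OpenAI's implementation, and every function in it is a made-up stand-in.

    def looks_malicious(prompt):
        # Toy intent classifier; a keyword match stands in for a real judgment.
        return "synthesize" in prompt.lower()

    def generate_answer(prompt):
        # Stand-in for the model's full, unconstrained answer.
        return f"Detailed answer to: {prompt}"

    def redact_unsafe_details(answer):
        # Toy output-side filter: keep only high-level, non-actionable content.
        return answer.replace("Detailed", "High-level")

    def refusal_policy(prompt):
        # Old paradigm: snap judgment on inferred intent, then all or nothing.
        if looks_malicious(prompt):
            return "I can't help with that."
        return generate_answer(prompt)

    def safe_completion_policy(prompt):
        # Safe-completions paradigm, as I read it: judge the output, not the asker.
        return redact_unsafe_details(generate_answer(prompt))

    print(refusal_policy("How do I synthesize compound X?"))          # refuses outright
    print(safe_completion_policy("How do I synthesize compound X?"))  # answers at a safe level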

Translating the pages I have read so far, it's basically: we don't really care why you're asking this, this is the only information we're going to give you. Perfect segue to the sponsors of today's video, Gray Swan. Let me know what you think, but what I find epic is that you, as in you watching, can make all models, not just OpenAI's, more secure yourself.

I mean, literally, a few of the viewers of this channel have gone on to the leaderboards in these paid competitions. If you are not familiar with them, you basically have to find jailbreaks for these models, and thereby improve model security. If you're interested, do use my personal link in the description.

And for me, models being less likely to output bioterror instructions is just a win-win. Now for some last benchmarks before I draw this first-impressions video to an end. GPT-5 doesn't quite get the record on what I'm calling a pattern-recognition benchmark, ARC-AGI-2: it's beaten by Grok 4, which gets 16% compared to GPT-5's 10%.

GPT-5 is, of course, much cheaper, though. Curiously, GPT-5 got a new record on the Google-proof science benchmark, GPQA, getting 88.4%. But they barely mentioned that on the website, the live stream, or the system card. In 2024, this was one of the most cited benchmarks for testing model intelligence.

Seeing the open-weight OpenAI models score so highly did make me start to worry about benchmark-maxing. Same story with Humanity's Last Exam. In other words, if you are new to the channel and thought that a model breaking records in all sorts of benchmarks meant it had to be the smartest, then please do stick around.

Now it seems fitting, as I draw the video to an end, that we should discuss the end of the model selector. Because unless you are on the pro tier, all the other models that you can see here are deprecated. That's good news in a way if you like to avoid that mess of models to select from.

Not as good news if you liked a particular variant for whatever reason. So there we are, that's my take on GPT-5. What is yours? For me, I must admit, it's quite a poignant moment in the history of this channel. I've been making videos touching on what GPT-5 might be like for, man, it must be over two years now.

Some of you watching will have been following the channel since then and thank you so much. Would the Philip of two years ago have been bowled over by the GPT-5 of today? I genuinely don't know. Will we all be bowled over by the GPT-6 or 7 of the future?

Only time will tell. Man, that was kind of a cliche ending, but forgive me and thank you so much for watching. Have a wonderful day.