GPT-5 has Arrived

Well, GPT-5 is here, and it's in the free tier. I've tested it a bunch, read the system card in full, and even sat through that full live stream. Wow. But actually, I think it's pretty huge that free users of ChatGPT will get access to GPT-5. In other words, approaching a billion people will experience a significantly more intelligent AI model, at least before they hit the limits. But if you watched the live stream and demo, you may have been underwhelmed. And I don't just mean the mathematically impossible bar graphs, and there were multiple of those. There were even hallucinations in the segment describing how the model hallucinates less. For sure, it would be easy to make a video just taking the mick out of those mistakes. But the thing is, GPT-5 is actually a pretty great model. So here are my first impressions.
First, my own logic benchmark, or what some people call a trick-question benchmark. I can confirm that GPT-5 does indeed crush the public questions of SimpleBench. Whoever it was who put out that viral thread of it getting nine out of ten on those ten public SimpleBench questions wasn't lying, technically. In some of my early testing, it got questions right that no other model had gotten right. When I saw this, I was like, man, I'm going to have to bring out V2 really early; everyone's going to get super hyped; this is crazy. However, if you are newer to AI, you might not know that the performance of language models is heavily dependent on the training data they're fed, and I suspect some of these ten public questions have made it into the training data, at least indirectly. Not deliberately, I think, but given that the models are trained on things like Reddit and other forums, it's definitely not impossible. Given how long I normally take to update the leaderboard, you guys might be quite shocked to hear that we're doing the runs tonight, and so far it's not setting a new record. That surprised even me, actually; I was expecting, honestly, 70%. I'll be honest with you guys: so far, in the three runs we've done, it's getting around 57-58%. So at this point we can be clear: it's not a new paradigm of AI, and if you didn't already believe models were AGI, this model won't convince you that we have AGI.
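As an aside, that "around 57-58%" figure is just the mean accuracy across the runs. Here's a minimal sketch of how run aggregation like that works, with a hypothetical `ask_model` stub and made-up questions standing in for the actual SimpleBench harness, which isn't public:

```python
import random
from statistics import mean, stdev

def ask_model(model: str, prompt: str) -> str:
    # Stub: in reality this would be an API call to the model under test.
    return random.choice(["A", "B", "C"])

def run_benchmark(model: str, questions: list[dict]) -> float:
    # One full pass over the question set; returns accuracy for that run.
    correct = sum(ask_model(model, q["prompt"]).strip() == q["gold"] for q in questions)
    return correct / len(questions)

# Hypothetical questions; most of the real SimpleBench set is private.
questions = [{"prompt": "...", "gold": "A"}, {"prompt": "...", "gold": "B"}]

scores = [run_benchmark("gpt-5", questions) for _ in range(3)]
print(f"mean accuracy: {mean(scores):.1%} (spread: {stdev(scores):.1%} across runs)")
```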
But in fairness to OpenAI, just a couple of hours ago Sam Altman tweeted that they could release much, much smarter models, but that the main things they were pushing for were real-world utility, which I'm going to come to, and mass accessibility and affordability. Well, they have delivered on that. When it comes to affordability, the prices in the API are incredible: below Claude 4 Sonnet's. For those who care about the API, I've got some coding data coming up.
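To make that affordability claim concrete, here's a back-of-the-envelope cost comparison. The per-million-token prices below are my understanding of the launch-day list prices; treat them as assumptions and check the official pricing pages:

```python
# Assumed list prices in USD per million tokens (launch-day figures; verify
# against the official pricing pages before relying on them).
PRICES = {
    "gpt-5":           {"input": 1.25, "output": 10.00},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example workload: 2M input tokens, 500k output tokens.
for model in PRICES:
    print(f"{model}: ${cost(model, 2_000_000, 500_000):.2f}")
# gpt-5: $7.50   claude-sonnet-4: $13.50
```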
But first, hallucinations. The team made a big play about how GPT-5 would hallucinate less than previous models. Quite early on in the system card, which is as good a research paper as we're going to get, it seems, they say that "we find that GPT-5 main has 44% fewer responses with at least one major factual error". However, I got suspicious immediately when, on the live stream, I saw a bunch of new benchmarks rather than the ones we already know. I don't blame anyone watching for not knowing this, but one of the most quoted benchmarks on hallucinations is SimpleQA: short, factual questions that are prone to eliciting hallucinations. And on that front, GPT-5 Thinking is just about better than o3, maybe, if you squint. Obviously, it makes comparison hard if we invent a new benchmark for every single model release, but it's probably fair to say it does hallucinate a bit less. Again, though, if you don't follow AI too closely: this model, just like all the others, will still hallucinate, producing a major incorrect claim around 5% of the time. And that's on the kinds of questions, by the way, that users are actually asking ChatGPT.
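For context on how a number like that 5% gets measured on short-form factual QA, here's a minimal sketch of a SimpleQA-style scorer. The naive string match below stands in for the grader model the real eval uses, so this shows only the shape of the computation:

```python
def normalise(text: str) -> str:
    # Crude normalisation: lowercase, trim, drop a trailing full stop.
    return " ".join(text.lower().strip().rstrip(".").split())

def score(records: list[dict]) -> dict:
    # Each record holds the model's 'answer' (None if it abstained) and the 'gold' answer.
    attempted = [r for r in records if r["answer"] is not None]
    correct = sum(normalise(r["answer"]) == normalise(r["gold"]) for r in attempted)
    return {
        "accuracy": correct / len(records),
        # The hallucination-style number: confidently wrong answers over all questions.
        "incorrect_rate": (len(attempted) - correct) / len(records),
        "abstain_rate": (len(records) - len(attempted)) / len(records),
    }

records = [
    {"answer": "Paris", "gold": "Paris"},   # correct
    {"answer": "1912",  "gold": "1911"},    # confidently wrong
    {"answer": None,    "gold": "Haber"},   # abstained
]
print(score(records))  # accuracy 0.33..., incorrect_rate 0.33..., abstain_rate 0.33...
```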
Okay, now for one domain that OpenAI really did want to highlight, which is software engineering, and one benchmark in particular: SWE-bench Verified. This time, by the way, you will notice that the graph doesn't show 52% as being higher than 69%. This, for me, is one of the bigger developments with GPT-5, because in software engineering, essentially, OpenAI lobbed a grenade at Anthropic. They want the Claude family line to die out without heirs, because SWE-bench Verified is the singular benchmark that Anthropic cited as proof that their newest model, Claude Opus 4.1, just a few days old, was the frontier model in this domain. If we completely ignore statistical error bars, GPT-5 is now better.
Fair warning: if you are not into coding, or even vibe coding, this video won't spend too long on it. But straight away, within Cursor, I did notice the difference. I'm not going to show my code base right now, so let's just leave this other coding benchmark on screen. But I tested GPT-5 against Claude 4 Sonnet, the best model you get by default from Anthropic in Cursor, and GPT-5 was just better, finding bugs that Sonnet assured me were not there. Obviously, we need more time to test, and there's always the dark horse of Gemini DeepThink, which I think might be the best of all. But Anthropic get a lot of revenue from vibe coders and professional developers, so GPT-5's release could be a challenging time for them. Now, I know it was a very hype-y statement from the live stream, but I do actually get it when one of the presenters said he trusts the model more with coding. Of course, language models live and die by data, so if GPT-5 lacks it in your domain, it will be janky; do test things out yourself.
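If you want to run the same kind of bug hunt outside of Cursor, here's a minimal sketch using the OpenAI Python SDK. The prompt, the file name, and the idea of feeding it a diff are my own choices for illustration, not a recipe from OpenAI:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

diff = open("my_change.diff").read()  # hypothetical file: the change you want reviewed

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": "You are a meticulous code reviewer."},
        {"role": "user", "content": "Find real bugs in this diff, if any. "
                                    "Only say 'no bugs found' if you are confident:\n\n" + diff},
    ],
)
print(response.choices[0].message.content)
```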
Beyond coding, though, if you are asking a technical question that might rely on an image, GPT-5 is looking really good. Take MMMU, which includes a ton of charts and tables. Here, GPT-5 is beating the massively slower and much less accessible Gemini DeepThink, which is currently reserved for the addicts forking out $250 a month. Yes, I was disappointed that the context window has barely been widened. Man, we need some fresh air in here. What I mean is, I love that Gemini 2.5 Pro can analyse one million tokens, on the order of 750,000 words, but we are stuck in the low hundreds of thousands of tokens for GPT-5.
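As a rule of thumb, one token is roughly three-quarters of an English word. If you want to check on your own text, here's a quick sketch with tiktoken, assuming the `o200k_base` encoding is a reasonable stand-in for GPT-5's tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed stand-in for GPT-5's tokenizer
text = open("my_document.txt").read()      # hypothetical file

tokens = len(enc.encode(text))
words = len(text.split())
print(f"{tokens} tokens for {words} words ({words / tokens:.2f} words per token)")
```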
Next, you may have noticed Sébastien Bubeck on the live stream, one of the lead authors of that famous Sparks of AGI paper about GPT-4, and, actually, one of the first viewers of this channel two and a half years ago. He described the "recursive self-improvement" to be had when models can produce better synthetic data that is then used to train the next generation of models. I do get that for racking up epic scores on benchmarks, but if that was all it took, then models like Phi-4 and the open-weight OpenAI GPT-OSS model would be demigods by now, given how heavily they were trained on synthetic data. True story: that recent OpenAI open-weight model flopped so badly on SimpleBench that OpenAI reached out to me personally to ask about our settings, which were the standard ones, by the way.
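To make the synthetic-data idea concrete, the loop Bubeck describes is: have a strong model generate training examples, then fine-tune the next model on them. Here's a minimal sketch of the generation half, with the prompt, topics, and file format all being my own assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()
topics = ["probability puzzles", "unit conversions", "basic chemistry"]  # assumed topics

with open("synthetic_train.jsonl", "w") as f:
    for topic in topics:
        completion = client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content":
                f"Write one {topic} question followed by a worked answer."}],
        )
        answer = completion.choices[0].message.content
        # One chat-format training example per line, the shape fine-tuning APIs expect.
        f.write(json.dumps({"messages": [
            {"role": "user", "content": f"Give me a {topic} question with a worked answer."},
            {"role": "assistant", "content": answer},
        ]}) + "\n")
```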
A bit later on, I'm going to get to the biggest bit of anti-hype you could find, but a few new people may be watching tonight, so let's touch on another amazing, highly usable aspect of GPT-5. If you watched the live stream and thought those vibe-coding demos were epic, they were, and don't let addicts like us tell you otherwise. But snake games and chart displays are everywhere in the training data; that's why GPT-5 can bang them out with ease. Building a production-ready consumer app is a very different story, for now. So I would say we don't quite have what Sam Altman called "software on demand" tonight, but we may be slowly getting there. More on that in another video coming out soon.
Back to the system card, where they kept talking about "health journeys". And though I had never heard that expression, it didn't strike me that they were taking us for a ride. For me, it's genuinely incredible, and it will be super impactful, that you can often get expert-level text-based diagnoses from these models. Notice I say you can often get that. I don't think many people will have noticed this one, though: GPT-5 Mini scored higher on HealthBench Consensus than GPT-5 itself. I can imagine some frantic users on the free tier waiting until their GPT-5 allowance runs out to get the, quote, "better model on health".
Now, just in case there are any really confused people watching: if you're wondering, there's no new Sora or image generator. There are, however, new voices, and for those who've been chatting with models on the free tier quite a lot, you should notice a step up in conversation quality. I was disappointed, though, that GPT-5's language skills have not improved over o3's. Translation, I feel, is such an unmitigated good from AI; I was hoping they had been able to push that frontier a bit harder. Some of you will be asking about the semi-mythical GPT-5 Pro, and yes, I'd love to test it. I am on the Pro tier, but I'm not yet seeing it in my app as of tonight, so soon, hopefully. It probably isn't crazily better than the other types of GPT-5, given they barely mentioned it in the presentation.
Now, though, for the real anti-hype. Even one of the lead authors of AI 2027, which took the world by storm and was read by millions, well, whose headlines, or videos derived from it, were seen or read by millions, that's Eli Lifland, said that he noticed no improvement in the system card on the coding evals that weren't SWE-bench. Translated: you know all those videos you've been seeing on YouTube about AI taking over the world within the next two years? Well, one of the authors behind that forecast, who I interviewed on Patreon, has probably updated in the negative on his timelines. We should be seeing a bit more self-improvement by now if those timelines were accurate. It makes sense, right? We would need to see significant improvement from GPT-5 on machine-learning engineering, for example on MLE-bench, and we don't quite see that.
What about this benchmark from the system card, OpenAI pull requests? Can GPT-5 do some of the more mundane tasks that are performed at OpenAI? Well, without diving too much into that benchmark, notice the increment: we're not seeing big jumps from o3. What about the ability of models to replicate state-of-the-art AI research? This was tested on OpenAI's own PaperBench. Again, correct me if I'm wrong, but not a huge step forward. Then there's this benchmark, arguably the most interesting of them all. I actually think it's a brilliant benchmark, and I remember asking someone at OpenAI about such a benchmark two years ago. I'm pretty sure that request had no impact, but either way, check this out: can AI models overcome any of 20 internal research and engineering bottlenecks encountered for real at OpenAI in the past? These were the kinds of bottlenecks that led to delays of at least a day, and in some cases, they said, influenced the outcome of large training runs and launches. An amazing benchmark, and, for now, slightly underwhelming performance. I say that, but it kind of depends on your perspective, because solving 2% of those is not bad, in my opinion; my only point was that it's the same score as o3. Mind you, I've just noticed something while filming: the green bar looks taller than o3's bar, even though it's the same 2%. Man, the chart crimes that OpenAI are doing for GPT-5 are unbelievable.
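For what it's worth, the fix is trivial: bars should rise from a zero baseline with height proportional to value. Here's a minimal matplotlib sketch using the 52% and 69% figures mentioned earlier, with generic labels since the exact chart varies:

```python
import matplotlib.pyplot as plt

# The only point: bar height must be proportional to value, from a zero baseline.
models = ["Model A", "Model B"]  # generic labels
scores = [52, 69]                # the figures quoted above

fig, ax = plt.subplots()
ax.bar(models, scores)
ax.set_ylim(0, 100)  # zero baseline, fixed scale
ax.set_ylabel("Score (%)")
for x, s in enumerate(scores):
    ax.text(x, s + 1, f"{s}%", ha="center")
plt.show()
```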
I'm just going to spend twenty seconds now on safety, because I did see OpenAI's safety paper, and while I haven't finished it, I do love the sound of the new approach to refusals. Basically, they've moved to what's called "safe completions" as a new safety paradigm. It makes sense to me: rather than the model making a snap judgment about whether the user's intent is good or bad, then completely obeying or refusing, safe completions focuses entirely on the safety of the model's output. Translating the pages I have read so far, it's basically: we don't really care why you're asking this; this is the only information we're going to give you.
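Here's a minimal sketch of the difference as I read it; this is my own pseudocode interpretation of the system card, with stubbed classifiers, not OpenAI's implementation:

```python
def looks_malicious(prompt: str) -> bool:
    # Stub intent classifier; the real thing would be a trained model.
    return "synthesise" in prompt.lower()

def generate(prompt: str) -> str:
    return f"[model answer to: {prompt}]"  # stub generation call

def redact_hazardous_detail(text: str) -> str:
    # Stub output-side filter: keep the high-level answer, strip operational detail.
    return text + " [operational specifics withheld]"

def refusal_policy(prompt: str) -> str:
    # Old paradigm: snap judgment on inferred intent, then obey or refuse outright.
    return "I can't help with that." if looks_malicious(prompt) else generate(prompt)

def safe_completion_policy(prompt: str) -> str:
    # New paradigm: judge the output, not the asker; answer at a safe level of detail.
    return redact_hazardous_detail(generate(prompt))

print(refusal_policy("How do I synthesise compound X?"))
print(safe_completion_policy("How do I synthesise compound X?"))
```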
Perfect segue to the sponsor of today's video, Gray Swan. Let me know what you think, but what I find epic is that you can make all models, not just OpenAI's, more secure yourself, as in you watching. I mean it literally: a few viewers of this channel have gone on to the leaderboards in these paid competitions. If you are not familiar with them, you basically have to find jailbreaks for these models, and thereby improve model security. If you're interested, do use my personal link in the description. And for me, models being less likely to output bioterror instructions is just a win-win.
Now for some last benchmarks before I draw this first-impressions video to a close. GPT-5 doesn't quite get the record on what I'm calling a pattern-recognition benchmark, ARC-AGI-2: it's beaten by Grok 4, which gets 16% compared to GPT-5's 10%. GPT-5 is, of course, much cheaper, though. Curiously, GPT-5 got a new record on the Google-proof science benchmark, GPQA, scoring 88.4%, but they barely mentioned that on the website, the live stream or the system card. In 2024, this was one of the most cited benchmarks for testing model intelligence. Seeing the open-weight OpenAI models score so highly on it did make me start to worry about benchmark-maxing. Same story with Humanity's Last Exam. In other words, if you are new to the channel and thought that a model breaking records in all sorts of benchmarks meant it had to be the smartest, then do please stick around.
Now it seems fitting, as I draw the video to an end, that we should discuss the end of the model selector. Unless you are on the Pro tier, all the other models that you can see here are deprecated. That's good news, in a way, if you like to avoid that mess of models to select from; not such good news if you liked a particular variant for whatever reason. So there we are, that's my take on GPT-5. What is yours? For me, I must admit, it's quite a poignant moment in the history of this channel. I've been making videos touching on what GPT-5 might be like for, man, it must be over two years now. Some of you watching will have been following the channel since then, and thank you so much. Would the Philip of two years ago have been bowled over by the GPT-5 of today? I genuinely don't know. Will we all be bowled over by the GPT-6 or 7 of the future? Only time will tell. Man, that was kind of a cliché ending, but forgive me, and thank you so much for watching. Have a wonderful day.