
GPT-5 has Arrived



00:00:00.000 | Well, GPT-5 is here and it's in the free tier. I've tested it a bunch, read the system card in
00:00:07.460 | full and even sat through that full live stream. Wow. But actually, I think it's pretty huge
00:00:13.540 | that free users of ChatGPT will get access to GPT-5. In other words, approaching a billion people
00:00:21.340 | will experience a significantly more intelligent AI model, at least before they hit the limits.
00:00:27.660 | But if you watched the live stream and demo, you may have been underwhelmed. And I don't just mean
00:00:33.600 | the mathematically impossible bar graphs. And there were multiple of those. There were even
00:00:39.500 | hallucinations in the segment describing how the model hallucinates less. For sure, it would be
00:00:45.500 | easy to make a video just taking the mick out of those mistakes. But the thing is, GPT-5 is actually a
00:00:52.380 | pretty great model. So here are my first impressions. First, my own logic benchmark,
00:00:57.600 | or some people call it a trick question benchmark. I can confirm that GPT-5 indeed does crush the
00:01:06.060 | public questions of SimpleBench. Whoever it was that came out with that viral thread of it getting
00:01:12.680 | nine out of 10 on those 10 public questions from SimpleBench wasn't technically lying. In some of my
00:01:19.600 | early testing, it got questions right that no other model had gotten right. When I saw this,
00:01:25.860 | I was like, man, I'm gonna have to bring out V2 really early. Everyone's gonna get super hyped.
00:01:30.480 | This is crazy. However, if you are newer to AI, you might not know that the performance of language
00:01:36.580 | models is heavily dependent on the training data they're fed. And I suspect some of these 10 public
00:01:43.640 | questions have made it into the training data, at least indirectly, not deliberately, I think. But
00:01:49.400 | given that the models are trained on things like Reddit and other forums, it's definitely not impossible.
00:01:55.080 | Given how long I normally take to update the leaderboard, you guys might be quite shocked to
00:02:00.440 | hear that we're doing the runs tonight. And so far, it's not setting a new record. That surprised even me,
00:02:07.740 | actually. I was expecting, honestly, 70%. I'll be honest with you guys. So far, in the three runs we've
00:02:14.960 | done, it's getting around 57-58%. So at this point, we can be clear, it's not a new paradigm of AI. And if
00:02:24.280 | you didn't believe models were AGI now, this model won't convince you that we have AGI. But in fairness
00:02:30.600 | to OpenAI, just a couple of hours ago, Sam Altman tweeted that they could release much, much smarter
00:02:36.500 | models. But the main thing they were pushing for was real-world utility, which I'm going to come to,
00:02:41.260 | and mass accessibility and affordability. Well, they have delivered on that. When it comes to
00:02:46.580 | affordability, in the API, the prices are incredible. Below Claude 4 Sonnet.
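
To put rough numbers on that, here's a back-of-envelope cost comparison. The per-million-token prices below are my assumptions based on approximate launch list prices and may well change, so treat this as a sketch rather than a source of truth:

    # Back-of-envelope API cost comparison. The per-1M-token prices are
    # assumptions (approximate launch list prices) -- check the official
    # pricing pages before relying on them.
    PRICES_PER_MILLION = {
        "gpt-5": {"input": 1.25, "output": 10.00},
        "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    }

    def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        """USD cost of one request at the assumed rates."""
        p = PRICES_PER_MILLION[model]
        return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

    # Example: a 20k-token prompt with a 2k-token reply.
    for model in PRICES_PER_MILLION:
        print(model, round(request_cost(model, 20_000, 2_000), 4))
    # gpt-5 0.045
    # claude-sonnet-4 0.09

On those assumed rates, GPT-5 comes in at half the cost for an identical request.
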
00:02:52.440 | For those who care about the API, I've got some coding data coming up. But first, hallucinations. The team made a big play about how
00:02:58.680 | GPT-5 would hallucinate less than previous models. Quite early on in the system card, which is as good
00:03:05.240 | a research paper as we're going to get, it seems, they say that, quote, we find that GPT-5 main has 44% fewer
00:03:13.380 | responses with at least one major factual error.
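
Worth unpacking: "44% fewer" is a relative reduction, not an absolute error rate. A quick sketch with a made-up baseline, just to show the arithmetic:

    # "44% fewer" is a relative reduction. With a hypothetical baseline
    # (NOT a figure from the system card), the implied rate would be:
    baseline_error_rate = 0.10                       # assumed: 10% of responses had a major error
    gpt5_error_rate = baseline_error_rate * (1 - 0.44)
    print(gpt5_error_rate)                           # 0.056 -> roughly 5-6%

That kind of arithmetic is consistent with the roughly 5% figure mentioned in a moment.
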
00:03:20.360 | However, I got suspicious immediately when, on the live stream, I saw a bunch of new benchmarks rather than the ones we already know. I don't blame anyone
00:03:26.080 | watching for not knowing this, but one of the most quoted benchmarks on hallucinations is SimpleQA.
00:03:31.800 | These are short, factual questions that can be prone to hallucinations. And on that front,
00:03:38.280 | GPT-5 Thinking is just about better than o3, maybe if you squint. Obviously, it makes it hard to compare
00:03:45.980 | if we invent a new benchmark for every single model release, but it's probably fair to say it does
00:03:50.860 | hallucinate a bit less. Again, though, if you don't follow AI too closely, this model, just like all the
00:03:56.700 | others, will still hallucinate with a major incorrect claim around 5% of the time. These are on the questions,
00:04:03.580 | by the way, that users are actually using ChatGPT for. Okay, now for one domain that OpenAI really did want
00:04:10.700 | to highlight, which is software engineering and one benchmark in particular, SWE-bench Verified. This
00:04:16.300 | time, by the way, you will notice that the graph doesn't show 52% as being higher than 69%. This,
00:04:22.540 | for me, then, is one of the bigger developments with GPT-5. Because in software engineering, essentially,
00:04:28.380 | OpenAI lobbed a grenade at Anthropic. They want the Claude family line to die out without heirs. Because SWE-bench
00:04:37.740 | Verified is the singular benchmark that Anthropic cited as proof that their newest model, Claude Opus 4.1,
00:04:44.700 | just a few days old, was the frontier model in this domain. If we completely ignore statistical error
00:04:50.940 | bars, GPT-5 is now better.
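
To gesture at those error bars: SWE-bench Verified has 500 tasks, so treating each task as an independent pass/fail trial gives a quick sense of the noise. A back-of-envelope calculation, not OpenAI's or Anthropic's methodology:

    import math

    # Standard error of a pass rate measured on n = 500 tasks,
    # around the ~75% region where frontier models currently sit.
    n, p = 500, 0.75
    standard_error = math.sqrt(p * (1 - p) / n)
    print(round(standard_error, 3))  # ~0.019, i.e. about +/- 2 percentage points

So two models separated by less than a couple of points on this benchmark are arguably within the noise.
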
00:04:57.580 | Fair warning, if you are not into coding or even vibe coding, this video won't spend too long on it. But straight away within Cursor, I did notice the difference. I'm not going to
00:05:02.940 | show my code base right now, so let's just leave this other coding benchmark on screen. But testing
00:05:08.540 | GPT-5 versus Claude Sonnet 4, the best model you get by default from Anthropic in Cursor, GPT-5 was just
00:05:15.020 | better, finding bugs that Sonnet assured me were not there. Obviously, we need more time to test,
00:05:20.780 | and there's always the dark horse of Gemini DeepThink, which I think might be the best of all. But
00:05:25.580 | Anthropic get a lot of revenue from vibe coders and professional developers. So, GPT-5's release
00:05:31.740 | could be a challenging time for Anthropic. Now, I know it was a very hype-y statement from the live
00:05:37.180 | stream, but I do actually get it when one of the presenters said he trusts the model more with coding.
00:05:42.060 | Of course, language models live and die by data, so if GPT-5 lacks it in your domain, it will be jank,
00:05:47.660 | so do test things out yourself. Beyond coding though, if you are asking a technical question
00:05:52.700 | that might rely on an image, GPT-5 is looking real good. Take the MMMU, which includes a ton of charts
00:06:00.860 | and tables. And here, GPT-5 is beating the massively slower and much more inaccessible Gemini DeepThink,
00:06:09.580 | which is currently reserved for the addicts forking out $250 a month. Yes, I was disappointed that the
00:06:15.580 | context window has been barely widened. Man, we need some fresh air in here. What I mean is,
00:06:21.340 | I love that Gemini 2.5 Pro can analyse one million tokens, or roughly three-quarters of a million words, but we are stuck in
00:06:28.380 | the low hundreds of thousands for GPT-5.
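
For a rough sense of scale, a common rule of thumb is about 0.75 English words per token, though the ratio varies with language and tokenizer. The 200k figure below is purely illustrative of "low hundreds of thousands":

    # Rough tokens-to-words conversion. The 0.75 words-per-token ratio is a
    # rule of thumb for English text, not an exact tokenizer property.
    WORDS_PER_TOKEN = 0.75

    for tokens in (1_000_000, 200_000):
        print(tokens, "tokens ~", int(tokens * WORDS_PER_TOKEN), "words")
    # 1000000 tokens ~ 750000 words   (a Gemini 2.5 Pro-sized window)
    # 200000 tokens ~ 150000 words    (an illustrative GPT-5-sized window)
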
00:06:33.420 | Next, you may have noticed on the live stream Sébastien Bubeck, one of the lead authors of that famous Sparks of AGI paper about GPT-4,
00:06:40.140 | and one of the first viewers of this channel, actually, two and a half years ago. He described
00:06:45.180 | the "recursive self-improvement" to be had when models can produce better synthetic data
00:06:51.740 | that is then used to train the next generation of models.
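
The loop he's describing looks something like this in miniature. Everything here is a hypothetical placeholder, a schematic of the idea rather than anyone's actual training pipeline:

    # Schematic of the synthetic-data flywheel: a strong model generates
    # training examples, low-quality ones are filtered out, and the rest
    # train the next model. All objects and methods here are hypothetical.
    def synthetic_data_flywheel(model, prompts, quality_filter, rounds=3):
        for _ in range(rounds):
            candidates = [model.generate(p) for p in prompts]
            # The filter (a verifier, reward model, or human check) is the
            # step that decides whether this loop improves anything at all.
            dataset = [ex for ex in candidates if quality_filter(ex)]
            model = model.finetune(dataset)  # next generation
        return model
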
00:06:57.180 | I do get that for getting epic scores on benchmarks, but if that was all it took, then models like Phi-4 and the open-weight OpenAI GPT-OSS model
00:07:06.060 | would be demigods by now, given how heavily they were trained on synthetic data. True story,
00:07:11.340 | that recent OpenAI open-weight model flopped so badly on SimpleBench that OpenAI reached out to
00:07:16.380 | me personally to ask about our settings, which were the standard ones, by the way. A bit later on,
00:07:21.580 | I'm going to go to the biggest bit of anti-hype you could find, but a few new people may be watching
00:07:27.660 | tonight, so let's touch on another highly usable, amazing aspect of GPT-5. If you watched the
00:07:33.420 | live stream and thought those vibe coding demos were epic, they were, and don't let addicts like us tell
00:07:39.580 | you otherwise. Snake games and chart displays are everywhere in the training data; that's why GPT-5 can
00:07:47.100 | bang them out with ease. Building a production-ready consumer app is a very different story for now.
00:07:53.740 | So I would say we don't quite have what Sam Altman quoted tonight, software on demand, but we may be
00:07:59.980 | slowly getting there. More on that in another video coming out soon. Back to the system card, and they
00:08:05.500 | kept talking about health journeys. And though I had never heard that expression, it didn't strike me that
00:08:11.580 | they were taking us for a ride. For me, it's genuinely incredible. And it will be super impactful that you
00:08:17.820 | can often get expert-level text-based diagnoses from these models. Notice I say you can often get that.
00:08:24.460 | Though I don't think many people will have noticed this one: GPT-5 Mini scored higher on HealthBench
00:08:31.660 | Consensus than GPT-5 itself. I can imagine some frantic users on the free tier waiting until their GPT-5
00:08:38.940 | allowance is out to get the, quote, "better model on health". Now just in case there are any really confused
00:08:43.980 | people watching, if you're wondering, there's no new Sora or image generator. There are however new voices,
00:08:50.460 | and for those who've been chatting on the free tier with models quite a lot, you should notice a step up
00:08:55.820 | in conversation quality. Although I was disappointed that GPT-5's language skills have not improved from o3.
00:09:03.820 | Translation, I feel, is such an unmitigated good from AI. I was hoping that they had been able to push the
00:09:10.300 | frontier a bit harder. Some of you will be asking about the semi-mythical GPT-5 Pro, and yes, I'd love
00:09:16.540 | to test it. I am on the Pro tier, but I'm not yet seeing it in my app as of tonight, so soon hopefully.
00:09:23.740 | It probably isn't crazily better than the other types of GPT-5, given they barely mentioned it in the
00:09:29.260 | presentation. Now though, for the real anti-hype, because even one of the lead authors of AI 2027,
00:09:36.140 | which took the world by storm and was read by millions, well, the headlines or videos derived from it
00:09:41.660 | were seen or read by millions, that's Eli Lifland, said that he noticed from the system card no
00:09:48.780 | improvement on the coding evals that weren't SWE-bench. Translated: you know all those videos you've been
00:09:54.460 | seeing on YouTube about AI taking over the world within the next two years? Well, one of the authors
00:10:00.380 | behind that, who I interviewed on Patreon, has probably updated in the negative in terms of his
00:10:06.620 | timelines. We should be seeing a bit more self-improvement by now, if those timelines were
00:10:11.980 | accurate. It makes sense, right? We would need to see significant improvement from GPT-5 on machine
00:10:17.900 | learning engineering, for example on MLE-bench, and we don't quite see that.
00:10:23.180 | What about this benchmark from the system card? OpenAI pull requests. Can GPT-5 do some of the more
00:10:28.860 | mundane tasks that are performed at OpenAI? Well, without diving too much into that benchmark,
00:10:34.220 | notice the increment. We're not seeing big jumps from o3. What about the ability of models to replicate
00:10:41.020 | state-of-the-art AI research? This was tested on OpenAI's own PaperBench. Again, correct me if I'm
00:10:47.500 | wrong, but not a huge step forward. Then there's this benchmark, arguably the most interesting of them
00:10:53.420 | all. I actually think it's a brilliant benchmark, and I remember asking someone at OpenAI about such
00:10:59.020 | a benchmark two years ago. Pretty sure that request had no impact, but either way, check this out. Can AI
00:11:04.220 | models overcome any of 20 internal research and engineering bottlenecks encountered for real at
00:11:10.940 | OpenAI in the past? These were the kind of bottlenecks that led to delays of at least a day, and in some
00:11:15.660 | cases they said it influenced the outcome of large training runs and launches. Amazing benchmark, and for
00:11:21.820 | now, slightly underwhelming performance. I say that, but it kind of depends on your perspective, because
00:11:27.820 | solving 2% of those is not bad in my opinion. My only point was that's the same score as o3. Mind you,
00:11:35.100 | I've just noticed something while filming. The green bar looks taller than o3's bar, even though it's the
00:11:41.980 | same 2%. Man, the chart crimes that OpenAI are doing for GPT-5 are unbelievable. I'm just going to spend 20
00:11:49.340 | seconds now on safety, because I did see OpenAI's safety paper, and while I haven't finished it, I do love the
00:11:55.340 | sound of the new approach on refusals. Basically, they've moved to what's called safe completions as
00:12:01.420 | a new safety paradigm. It makes sense to me: rather than the model just making a snap judgment on
00:12:06.460 | whether the user's intent is good or bad, then completely obeying or refusing, safe completions
00:12:12.620 | focus entirely on the safety of the model's output. Translating the pages I have read so far,
00:12:19.420 | it's basically: we don't really care why you're asking this, this is the only information we're going to
00:12:24.620 | give you.
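
In pseudocode terms, my reading of the difference is something like the sketch below. Both policy objects are hypothetical stand-ins, not OpenAI's implementation:

    # Schematic contrast between the two safety paradigms.
    # Both checks are hypothetical stand-ins, not OpenAI's actual code.
    def hard_refusal(prompt, model, intent_classifier):
        # Old style: a snap judgment on the *asker's intent*, then all-or-nothing.
        if intent_classifier.looks_malicious(prompt):
            return "Sorry, I can't help with that."
        return model.generate(prompt)

    def safe_completion(prompt, model, output_policy):
        # New style: judge the *output*. Share what is safe to share
        # (e.g. high-level information) and withhold operational detail.
        draft = model.generate(prompt)
        return output_policy.redact_unsafe_content(draft)
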
00:12:30.220 | Perfect segue to the sponsors of today's video, Gray Swan. Let me know what you think, but what I find epic is that you can make all models, not just OpenAI's, more secure yourself,
00:12:36.940 | as in you watching. I mean, literally, a few of the viewers of this channel have gone on to the
00:12:42.380 | leaderboards in these paid competitions. If you are not familiar with them, you basically have to find
00:12:48.940 | jailbreaks for these models, and thereby improve model security. If you're interested, do use my
00:12:54.860 | personal link in the description. And for me, models being less likely to output bioterror
00:13:00.140 | instructions is just a win-win. Now for some last benchmarks before I draw this first impressions video
00:13:06.220 | to an end. GPT-5 doesn't quite get the record for what I'm calling a pattern recognition benchmark,
00:13:11.820 | ARC-AGI-2. It's beaten by Grok 4, which gets 16% compared to GPT-5's 10%. GPT-5 is of course much cheaper,
00:13:19.580 | though. Curiously, GPT-5 got a new record on the Google-proof science benchmark,
00:13:25.820 | called GPQA, getting 88.4%. But they barely mentioned that on the website or the live stream
00:13:33.820 | or the system card. In 2024, this was one of the most cited benchmarks for testing model intelligence.
00:13:39.820 | Seeing the open-weight OpenAI models score so highly did make me start to worry about benchmark
00:13:45.340 | maxing. Same story with Humanity's Last Exam. In other words, if you are new to the channel and thought
00:13:51.020 | a model breaking records in all sorts of benchmarks meant it had to be the smartest, then do please
00:13:56.460 | stick around. Now it seems fitting, as I draw the video to an end, that we should discuss the end of the
00:14:02.220 | model selector. Because unless you are on the pro tier, all the other models that you can see here are
00:14:08.460 | deprecated. That's good news in a way if you like to avoid that mess of models to select from. Not as good news
00:14:16.700 | if you liked a particular variant for whatever reason. So there we are, that's my take on GPT-5.
00:14:23.420 | What is yours? For me, I must admit, it's quite a poignant moment in the history of this channel.
00:14:29.420 | I've been making videos touching on what GPT-5 might be like for, man, it must be over two years now.
00:14:36.940 | Some of you watching will have been following the channel since then and thank you so much.
00:14:41.660 | Would the Philip of two years ago have been bowled over by the GPT-5 of today? I genuinely don't know.
00:14:48.940 | Will we all be bowled over by the GPT-6 or 7 of the future? Only time will tell.
00:14:55.340 | Man, that was kind of a cliche ending, but forgive me and thank you so much for watching. Have a wonderful day.