OpenAI Tests if GPT-5 Can Automate Your Job - 4 Unexpected Findings

Chapters
0:00 Introduction
0:55 OpenAI Report Summary
2:40 Tipping Point Speed-up
4:11 Better than Industry Experts?
6:33 Big Caveat
11:10 Karpathy and the Radiologist Analogy
13:30 Outro
In the last 24 hours, OpenAI have released research on, essentially, whether current language models can automate your job. The big claim, albeit carefully worded, is that current best frontier models are approaching industry experts in deliverable quality. But as you'll see from the title, there are plenty of unexpected findings in this research. Before I dive into that, there is one job we seem intent on automating, and that is being a UFC fighter. You can laugh at the lack of performance now, but like me, you might be laughing somewhat nervously. Take a look at this Unitree G1 robot, which maybe hasn't mastered kung fu, but it's getting a bit closer. Quick prediction: do you reckon billionaires will have humanoid robot bodyguards by 2035? Let me know.
Back to the paper, and they are only focusing on the most important sectors according to their contribution to GDP. What makes things more interesting is that the questions weren't designed by OpenAI. They were designed by industry professionals themselves, with an average of 14 years of industry experience, who had to meet all sorts of criteria just to design the questions. And here are the headline results, which you may have seen go viral, with Claude Opus 4.1, a model by Anthropic, beating out OpenAI's models and coming quite close to parity with industry experts. This I am obviously going to class as the first surprising finding: not that Opus is the best model, because Opus 4.1, if you haven't tried it, is indeed an amazing model. So no, that's not the most surprising bit. It's that OpenAI published this result showing Opus beating its own models. I think that's great, honest science, by the way, and I commend OpenAI for publishing it.

Now you might be thinking, no, Philip, the most surprising bit is how close we're getting to parity with industry experts, but I'll come back to that in just a moment. Right now, I want to cover the second, you could say somewhat surprising, result, which is that the win rate, when compared to humans, depended quite heavily on the file type involved. If your workflow involves submitting or producing a PDF, PowerPoint, or Excel spreadsheet, you might well find that Opus 4.1 is a league ahead. All these figures, by the way, are on how often a model beats a human expert's output, as judged by a human expert.
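To make that metric concrete, here is a minimal sketch of how I read the headline numbers: blinded pairwise grading, where an expert picks the better of a model deliverable and a human deliverable, and the win rate is the share of comparisons the model wins. The judgments below are made up, and counting ties as half a win is my assumption, not necessarily the paper's exact convention.

```python
# Toy illustration of a pairwise win rate (made-up judgments, not the paper's data).
judgments = ["model", "human", "model", "tie", "human", "model", "model", "human"]

wins = sum(1 for j in judgments if j == "model")
ties = sum(1 for j in judgments if j == "tie")
win_rate = (wins + 0.5 * ties) / len(judgments)  # ties counted as half a win (my assumption)
print(f"win rate vs human experts: {win_rate:.0%}")  # 56%
```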
You may want to pause here to look across the different sectors, and you may or may not find it surprising that it's in government where we have a model beating the average human expert. Personally, I'm a little bit skeptical that Gemini 2.5 Pro scored so badly across these metrics; I find it a really great model. But then again, Gemini 3 might well be around the corner.

The third potentially unexpected finding is that we seem to have passed a tipping point where models tend to speed up human experts. To briefly summarize this table: if a model is too weak, then even if you let it try a task multiple times and only use its output when you judge it to be satisfactory, that doesn't actually speed you up. Essentially, the time spent reviewing its output is badly spent, and it's not worth it; you might as well just do the task yourself. However, by the time we get to GPT-5, this does actually speed you up. You may have experienced this yourself, but GPT-5 does a good enough job, often enough, that across the board, on average, in these industries (and I'll get to which ones in a second), you are slightly sped up.
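Here is a back-of-the-envelope sketch of that intuition. The seven-hour average task length comes from the paper (mentioned later in the video); the review time and acceptance rates are my own illustrative assumptions. You always pay the cost of checking the model's draft, and you only save the drafting time when the output actually clears your quality bar.

```python
# Back-of-the-envelope sketch of model-assisted task time (illustrative numbers only).

def expected_hours_with_model(task_hours, review_hours, acceptance_rate):
    """Review the model's draft; if it isn't good enough, do the task yourself anyway."""
    return review_hours + (1 - acceptance_rate) * task_hours

task_hours = 7.0    # average expert task length reported in the paper
review_hours = 1.0  # assumed time to check a model deliverable
for label, acceptance in [("weak model", 0.10), ("stronger model", 0.70)]:
    hours = expected_hours_with_model(task_hours, review_hours, acceptance)
    print(f"{label}: {hours:.1f}h with the model vs {task_hours:.1f}h alone")
# weak model: 7.3h (a net slowdown); stronger model: 3.1h (a clear speed-up)
```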
Two fairly critical caveats to this unexpected finding, though. One is: where is Claude Opus 4.1? Because surely that would have produced an even greater speed improvement for the human experts. And second, the bar for accepting what these models were producing was meeting the human's quality level. They call it the quality bar, as judged by those humans. But what if those humans can't always spot the subtle errors that the models output? It reminds me of that developer study that METR did, where the experts thought they were being sped up by 20%, but were actually being slowed down by, I think, around 10-20%.

Now, though, to the biggest finding of all, and the big claim in the paper, as reiterated by Lawrence Summers. He's a famous economist and former president of Harvard, I believe, and he said that these are task-specific Turing tests: models can now do many of these tasks as well as, or better than, humans. If that's generally true, then that would lend support to claims like this one from another OpenAI researcher, which is that their current systems are AGI. For example, one of their unreleased models was able to beat every single human in one particular coding competition. Logically, that makes some initial sense, right? If it can beat these experts at coding competitions, and at least match experts across a whole range of domains, why wouldn't that be AGI? The founder and former CEO of Stability AI implied we're close to a tipping point, the implication of course being that we will then start to see the automation of jobs wholesale.
Well, I would say one of the big unexpected findings was how robust human jobs seem to be to automation by current-generation LLMs. The evidence from this paper, to me, suggests that we will need a further step-change improvement in model performance to start genuinely automating whole swathes of the economy. Why would I say that when they just said, in the abstract, that current best frontier models are approaching industry experts in deliverable quality? Like, we're really close, right? Not really, when you dive into the details of the paper.

First, the paper admits that if you look at adoption rates, the picture doesn't look so great for AI, and I covered that in a recent video, with many companies dropping their pilot projects. But those are lagging indicators, as is GDP growth. It takes time for people to realize how good these models are, so those metrics will be lagging indicators. Fair enough, it will take time for AI to diffuse, so they're just going to focus on what current-gen AI can actually do.
Here are some of the tasks, by the way. For example, if you were a manufacturing engineer, then you were asked in this study to design a 3D model of a cable reel stand for an assembly line. All of the models were given the same task, and then the results were compared and blind graded. What, then, is the problem, if these tasks were designed by industry experts and then blind graded? Surely that shows that the models are almost at human expert level in industry performance. Even on task length, these tasks required an average of seven hours of work from the experts, so these are realistic tasks.

Well, first, they excluded those occupations whose tasks were not predominantly digital. I had to dig quite a long way through the appendices to work out how they did this, but I want to give you just an example of the kind of thing they did. They looked at this table and found only those sectors that contributed at least 5% to US GDP. Then they found five occupations, weighted by salary, whose work was predominantly digital.
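To make the selection step easier to picture, here is a rough sketch of that two-stage filter as I understand it. The 5% threshold matches the figure above, but the toy numbers and occupation flags below are illustrative, not the paper's actual data.

```python
# Rough, illustrative sketch of the selection filter (toy data, not the paper's):
# keep sectors contributing at least 5% of US GDP, then, within each, keep the
# top occupations by wage weight whose work is flagged as predominantly digital.

sectors = {
    "manufacturing": {"gdp_share": 0.10, "occupations": [
        ("industrial engineer", 9.0, True),   # (name, wage weight, predominantly digital?)
        ("assembler", 14.0, False),
        ("production planner", 5.0, True),
    ]},
    "arts and recreation": {"gdp_share": 0.01, "occupations": []},
}

selected = {}
for sector, info in sectors.items():
    if info["gdp_share"] < 0.05:                        # drop sectors below 5% of GDP
        continue
    digital = [o for o in info["occupations"] if o[2]]  # keep predominantly digital work
    digital.sort(key=lambda o: o[1], reverse=True)      # rank by wage weight
    selected[sector] = [name for name, _, _ in digital[:5]]

print(selected)  # {'manufacturing': ['industrial engineer', 'production planner']}
```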
Take manufacturing: all five of these occupations have predominantly digital work, apparently. But then, of course, if you dig into the data where they got that from, there are countless occupations within that category whose work is not predominantly digital. So for every one or two that made it into the paper, there were loads, of course, that did not.

Okay, but what about those occupations that are predominantly digital? Well, even there, they didn't look at everything those occupations do. I took just one of the occupations rated as predominantly digital, property manager, and categorized all 27 tasks listed for it in the official records. This was from O*NET, which is the same source that OpenAI used. GPT-5 Pro, ironically saving me lots of time, categorized them, with about six or seven of the tasks rated as not being primarily digital: things like overseeing operations and maintenance, coordinating staff, and investigating complaints and violations. The obvious point being that even if we can automate the 19 or 20 tasks that are obviously digital within this predominantly digital occupation, that wouldn't eliminate the job entirely. In fact, that profession might get even better paid, as we're going to see in a moment for radiologists. So: not all sectors, not all occupations within each sector, and not all tasks within each occupation.
Fine, but what about the actual tasks themselves? Well, they were super realistic, and you can look at the range of industries involved, from Apple to the US Department of War (as it's now called), Google, and BBC News, for example. But first, they were somewhat subjective, with even the human experts having only 70% agreement between themselves about which answer was better, the model answer or the human expert's deliverable. Next, sometimes it was obvious which answer was the model output, because OpenAI models, for example, would often use em dashes, and Grok would occasionally randomly introduce itself, apparently. More fundamentally, though, the tasks were one-shot: here's the task, get it done. Of course, in a real job, there's much more interactivity, where you ask questions of the person giving you the task to find out its scope and parameters. Also, they had to exclude tasks that relied on too much context, like the use of proprietary software tools.

Then there were the catastrophic mistakes. They admit that one further limitation of this analysis is that it does not capture the cost of catastrophic mistakes, which can be disproportionately expensive in some domains. They give some examples of catastrophic answers, and I'll give one of my own. They said something could go dangerously wrong, like insulting a customer or suggesting things that will cause physical harm. This happened, apparently, 2.7% of the time. Here's something to consider: if the damage done by those catastrophic failures is a hundred times worse than the cost savings you get from the model being better, then, weighted by impact, using, quote, agentic AI without a human in the loop could cost you more in the long run.
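To put rough numbers on that weighting, here is a toy expected-value calculation. The 2.7% rate is the figure just quoted; the dollar amounts are assumptions of mine purely for illustration.

```python
# Toy expected-value sketch: modest per-task savings vs rare but very costly failures.
# The 2.7% catastrophic-answer rate is from the discussion above; the costs are assumed.

saving_per_task = 100.0                    # assumed value of the model doing a task well
catastrophe_cost = 100 * saving_per_task   # "a hundred times worse" than the saving
p_catastrophe = 0.027                      # reported catastrophic-answer rate

expected_net = (1 - p_catastrophe) * saving_per_task - p_catastrophe * catastrophe_cost
print(f"expected net value per task: {expected_net:+.2f}")  # -172.70, i.e. a net loss
```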
Here's my example, from a recent bit of coding, where Claude admits (and I saw it do this) that it completely hallucinated a set of prices for a particular model. "You're absolutely right," it said. "I apologize for making up those credit numbers. That was incredibly irresponsible of me. Let me check the actual values." It thought about it, then said, "Yes, your diagnosis is a hundred percent correct. I apologize again for making up those credit values." You would have to be a pretty irresponsible employee, or a downright fraudster, to make up such critical values without asking anyone. This was Claude Opus 4.1, by the way. I am open-minded, though. Let me know what you think in terms of whether there will be more real-life human fraudsters, or just complete dotards, in terms of the mistakes they make, versus these catastrophic hallucinations from models.

Speaking of catastrophes, by the way, you can help avert certain catastrophes by joining the Gray Swan Arena, link in the description. Essentially, you're rewarded with real human money for breaking an AI, for jailbreaking LLMs. Several of my own subscribers have joined these competitions and won prizes. Actually, you can see in the corner that $350,000 worth of rewards have been distributed. And actually, scrolling down, I can see that there is a competition that is live and in progress as we speak, their Proving Ground one. As I've mentioned before on the channel, I see this as a win-win: you can gain recognition and money, and the AI gets just that bit more secure.
One more limitation, and then I'm going to end on a positive. I think Andrej Karpathy, formerly of OpenAI, made a fantastic point in a recent tweet. In 2015-16, Geoffrey Hinton famously predicted that we shouldn't be training new radiologists, and Karpathy linked to this article, which is indeed a great one. It said that there were models released back in 2017 that could detect pneumonia with greater accuracy than a panel of board-certified radiologists. You can just imagine the clickbait that could have been written about that study. So how come, eight years later, radiologists have an average salary of over half a million dollars per year, which is 48% higher than in 2015? Well, some of this is about to sound familiar, but there were issues with training data not covering edge cases. There were, of course, legal hurdles. And just like in the paper we just read, there were also tasks within radiology that didn't involve such automation, like talking to patients.

As best I could, I recently tried to delineate each of the blockers to the singularity, as I called it, in my recent Patreon video from the 19th, and I'm going to link in the description the framework that I created. None of these blockers are unsolvable, but understanding each one will help you read beyond the headlines. Now let's spot some more patterns, because the AI for radiology didn't cover all tasks. It focused on the big ones, like stroke, breast cancer, and lung cancer. What about things like vascular, head and neck, spine, and thyroid? Well, there are relatively few AI products for those. Think of the tasks not covered in that spreadsheet. Then, if you're a child or an ethnic minority, these AI tools perform worse, and think of the analogy with LLMs: outside of English, they don't do as well. Notice how the study only focused on US GDP. Then there's the fact that OpenAI, for example, keep hiring new people despite building tools designed to automate AI research. Likewise, in radiology, headcounts and salaries just continue to rise.
Karpathy's prediction: we will have more software engineers in five years than we have now. Just to end, though, I would say don't sleep on this multiplier. You could be sped up by AI even if it can't automate your job. The AI in Descript, for example, can't fully edit my videos, but it does speed up my own editing of videos. Understanding AI, and getting familiar with using it, is still, I think, one of the best bets you can make. On content creation, there is one tipping point I think we have reached, which is that, visually at least, we can't fully trust that we are seeing the human we think we are, at least on video.

Thank you so much for watching to the end. I didn't cover ChatGPT Pulse, even though I am a Pro subscriber, because it wasn't rolled out to me. I wonder if it's blocked in the UK; I tried everything. Having said that, it does seem to be a replacement for scheduled tasks. Do you remember those from January, where you could ask ChatGPT to perform a task at a certain time? It never really worked, kind of flopped, and then everyone forgot about it, but now we have Pulse, so let's see if that does any better. Have a wonderful day.