
OpenAI Tests if GPT-5 Can Automate Your Job - 4 Unexpected Findings


Chapters

0:00 Introduction
0:55 OpenAI Report Summary
2:40 Tipping Point Speed-up
4:11 Better than Industry Experts?
6:33 Big Caveat
11:10 Karpathy and the Radiologist Analogy
13:30 Outro

Whisper Transcript

00:00:00.000 | In the last 24 hours, OpenAI have released research on essentially whether current language
00:00:06.640 | models can automate your job. The big claim, albeit carefully worded, is that current best
00:00:12.560 | frontier models are approaching industry experts in deliverable quality. But as you'll see from the
00:00:18.500 | title, there are plenty of unexpected findings in this research. Before I dive into that,
00:00:24.680 | there is one job we seem intent on automating and that is being a UFC fighter. You can laugh at
00:00:33.140 | the lack of performance now, but like me, you might be laughing somewhat nervously. Take a look at this
00:00:39.280 | Unitree G1 robot, which maybe hasn't mastered Kung Fu, but he's getting a bit closer. Quick predictions,
00:00:45.700 | do you reckon billionaires will have robot humanoid bodyguards by 2035? Let me know. Back to the paper,
00:00:52.240 | and they are only focusing on the most important sectors according to their contribution to GDP.
00:00:57.920 | What makes things more interesting is that the questions weren't designed by OpenAI. They were
00:01:03.080 | designed by industry professionals themselves with an average of 14 years of industry experience. They
00:01:08.800 | had to meet all sorts of criteria just to design the questions. And here are the headline results,
00:01:13.440 | which you may have seen go viral with Claude Opus 4.1, a model by Anthropic, beating out OpenAI's
00:01:20.460 | models and coming quite close to parity with industry experts. This, I am obviously going to
00:01:25.560 | class as the first surprising finding, not that Opus is the best model because Opus 4.1, if you haven't
00:01:31.260 | tried it, is indeed an amazing model. So no, that's not the most surprising bit. It's that OpenAI published
00:01:35.920 | this result showing Opus beating its own models. I think that's great, honest science, by the way,
00:01:41.080 | and I commend OpenAI for publishing this. Now you might be thinking, no, Philip, the most surprising
00:01:45.240 | bit is how close we're getting to parity with industry experts, but I'll come back to that in
00:01:50.040 | just a moment. Right now, I want to cover this second, you could say, somewhat surprising result,
00:01:55.220 | which is that the win rate, when compared to humans, depended quite heavily on the file type
00:02:02.180 | involved. If your workflow involves submitting or producing a PDF, PowerPoint, or Excel spreadsheet,
00:02:09.240 | you might well find that Opus 4.1 is a league ahead. All these figures, by the way, show how
00:02:15.500 | often a model's output beats the human expert's, as judged by a human expert. You may want to pause this one to
00:02:22.140 | look across the different sectors, and you may or may not find it surprising that it's in government
00:02:27.540 | where we have a model beating the average human expert. Personally, I'm a little bit skeptical that
00:02:33.280 | Gemini 2.5 Pro scored so badly across these metrics. I find it a really great model. But then again,
00:02:38.680 | Gemini 3 might well be around the corner. The third potentially unexpected finding is that we seem to
00:02:44.900 | have passed a tipping point where models tend to speed up human experts. To briefly summarize this
00:02:51.000 | table, if a model is too weak, then even if you let it try a task multiple times and only use its output
00:02:57.100 | when you judge it to be satisfactory, that doesn't actually speed you up. Essentially, the time spent
00:03:02.560 | reviewing its output outweighs the time saved; you might as well just do the task
00:03:06.960 | yourself. However, by the time we get to GPT-5, this does actually speed you up. You guys may have
00:03:13.480 | experienced this yourself, but GPT-5 does a good enough job, often enough, that across the board,
00:03:19.980 | on average, in these industries, and I'll get to which ones in a second, you are slightly sped up. Two fairly
00:03:25.380 | critical caveats to this unexpected finding though. One: where is Claude Opus 4.1? Because
00:03:31.540 | surely that would have produced an even greater speed improvement for the human experts. And second,
00:03:35.920 | the bar for acceptance for what these models were producing was meeting the human's quality level.
00:03:43.640 | They call it the quality bar, as judged by those humans. But what if those humans can't always spot
00:03:49.760 | the subtle errors in the models' output? Reminds me of that developer study that METR did,
00:03:55.360 | where the experts thought that they were being sped up by 20%, but they were actually being slowed
00:04:00.500 | down by, I think, around 10-20%. Now, though, to the biggest finding of all, and the big claim in the
00:04:06.840 | paper, as articulated by Lawrence Summers. He's a famous economist and former president of Harvard,
00:04:12.700 | I believe, and he said that these are task-specific Turing tests. Models can now do many of these tasks
00:04:19.800 | as well, or better than humans. If that's generally true, then that would lend support to claims like
00:04:25.960 | this one from another OpenAI researcher, which is that their current systems are AGI. For example,
00:04:31.900 | one of their unreleased models was able to beat every single human in one particular coding competition.
00:04:37.060 | Logically, that makes some initial sense, right? If it can beat these experts at coding competitions,
00:04:42.320 | and at least match experts across a whole range of domains, why wouldn't that be AGI? The founder
00:04:48.040 | and former CEO of Stability AI implied we're close to a tipping point. The implication, of course, being
00:04:52.880 | that then we will start to see the automation of jobs wholesale. Well, I would say one of the big
00:04:58.300 | unexpected findings was how robust human jobs seem to be to automation by current-generation LLMs.
00:05:06.360 | The evidence from this paper, to me, suggests that we will need a further step change improvement in
00:05:12.080 | model performance to start genuinely automating whole swathes of the economy. Why would I say that
00:05:17.400 | when they just said, in the abstract, current best frontier models are approaching industry experts
00:05:22.800 | in deliverable quality? Like, we're really close, right? Not really when you dive into the details of
00:05:27.540 | the paper. First, the paper admits if you look at adoption rates, the picture doesn't look so great
00:05:31.960 | for AI. And I covered that in a recent video with many companies dropping their pilot projects. But
00:05:36.480 | those are lagging indicators, as is GDP growth. It takes time for people to realize how good these
00:05:42.040 | models are, so those metrics will lag. Fair enough, it will take time for AI
00:05:47.340 | to diffuse. So they're just going to focus on what current gen AI can actually do. Here are some of the
00:05:52.260 | tasks, by the way. For example, if you are a manufacturing engineer, then you were asked in this study to
00:05:57.000 | design a 3D model of a cable reel stand for an assembly line. All the models were given
00:06:02.180 | the same task, and then the results were compared and blind-graded. What then is the problem if these
00:06:07.840 | tasks were designed by industry experts and then blind graded? Surely that shows that the models are
00:06:13.440 | almost at the level of human industry experts. Even on task length, these tasks required an average
00:06:19.360 | of seven hours of work for the experts. So these are realistic tasks. Well, first, they excluded those
00:06:24.820 | occupations whose tasks were not predominantly digital. I had to dig quite a long way through
00:06:30.360 | the appendices to work out how they did this, but I want to give you just an example of the kind of
00:06:35.280 | thing they did. They looked at this table and found only those sectors that contributed at least 5%
00:06:40.500 | to the US GDP. Then they found five occupations weighted by salary whose work was predominantly digital.
00:06:47.540 | Take manufacturing. All of these five occupations have predominantly digital work, apparently. But then,
00:06:53.660 | of course, if you dig into the data where they got that from, there are countless occupations within
00:06:58.980 | that category whose work is not predominantly digital. So for every one or two that made it into
00:07:04.480 | the paper, there were loads, of course, that did not. Okay, but what about those occupations that are
00:07:09.760 | predominantly digital? Well, even there, they didn't look at all of what those occupations did. I took just
00:07:16.320 | one of the occupations rated as predominantly digital, property manager, and categorized all 27 tasks
00:07:23.600 | listed for it in the official records. This was from O*NET, which is the same source
00:07:28.820 | that OpenAI used. GPT-5 Pro, ironically saving me lots of time, did the categorization, with about six or
00:07:36.840 | seven of the tasks rated as not being primarily digital. Things like overseeing operations and maintenance,
00:07:42.940 | coordinating staff, investigating complaints and violations. The obvious point being that even if
00:07:47.660 | we can automate the 19 or 20 tasks that are obviously digital within this predominantly digital
00:07:53.820 | occupation, that wouldn't eliminate the job entirely. In fact, that profession might get even better
00:07:59.340 | paid, as we're going to see in a moment for radiologists. So not all sectors, not all occupations within
00:08:04.980 | that sector and not all tasks within each occupation. Fine, but what about the actual tasks themselves?
00:08:11.060 | Well, they were super realistic, and you can look at the range of organizations involved, from Apple to
00:08:16.540 | what is now the US Department of War, Google, and BBC News, for example. But first, they were somewhat subjective, with
00:08:23.040 | even the human experts having only 70% agreement between themselves about which answer was better,
00:08:29.260 | the model answer or the human's gold deliverable. Next, sometimes it was obvious which answer was the model output
00:08:34.700 | because OpenAI models, for example, would often use em dashes. Grok would occasionally randomly introduce itself
00:08:40.640 | apparently. More fundamentally though, the tasks were one shot. Here's the task, get it done. Of course, in a real job,
00:08:48.060 | there's much more interactivity where you ask questions of the person giving you the task to find out the scope and
00:08:54.020 | parameters of the task. Also, they had to exclude tasks that relied on too much context, like the use of
00:08:59.140 | proprietary software tools. Then there were the catastrophic mistakes. They admit one further limitation of this
00:09:05.340 | analysis is that it does not capture the cost of catastrophic mistakes, which can be disproportionately
00:09:10.580 | expensive in some domains. They give some examples of catastrophic answers and I'll give one of my own. They said
00:09:16.500 | something could go dangerously wrong, like insulting a customer or suggesting things that will cause
00:09:21.300 | physical harm. This happened apparently 2.7% of the time. Here's something to consider. If the damage done
00:09:28.420 | by those catastrophic failures is a hundred times worse than the cost savings you get from the model
00:09:35.700 | being better, then, weighted by impact, using "agentic" AI without a human in the loop could cost you
00:09:42.420 | more in the long run. Here's my example from a recent bit of coding where Claude admits, and I saw it do this,
00:09:48.340 | that it completely hallucinated a price set for a particular model. You're absolutely right. It said,
00:09:53.940 | I apologize for making up those credit numbers. That was incredibly irresponsible of me. Let me check
00:09:58.740 | the actual values. It thought about it, then said, yes, your diagnosis is a hundred percent correct. I apologize
00:10:04.180 | again for making up those credit values. You would have to be a pretty irresponsible employee or a downright
00:10:09.380 | fraudster to make up such critical values without asking anyone. This was Claude Opus 4.1, by the way.
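
To put the earlier weighting argument into rough numbers, here is a back-of-the-envelope sketch. The 2.7% catastrophic-failure rate is the figure quoted above; the per-task saving and the hundred-times damage multiplier are assumptions for illustration only.

    # Back-of-the-envelope expected-value check for the catastrophic-mistakes argument.
    # The 2.7% rate is quoted above; the saving and the 100x multiplier are assumptions.
    saving_per_task = 100.0      # value gained per task by using the model (arbitrary units)
    catastrophe_rate = 0.027     # fraction of tasks that go catastrophically wrong
    damage_multiplier = 100      # assumed: one catastrophe costs 100x the per-task saving

    expected_loss = catastrophe_rate * damage_multiplier * saving_per_task
    net_per_task = saving_per_task - expected_loss
    print(f"Expected net value per task: {net_per_task:+.1f}")   # 100 - 270 = -170, a net loss

On those assumptions the break-even multiplier is only about 37x (1 divided by 0.027), so the damage does not even need to be a hundred times worse for the human-in-the-loop point to bite.
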
00:10:15.460 | I am open-minded though. Let me know what you think in terms of whether there will be more real-life
00:10:21.060 | human fraudsters or just complete dotards in terms of the mistakes they make versus these catastrophic
00:10:26.980 | hallucinations from models. Speaking of catastrophes, by the way, you can help avert certain catastrophes by
00:10:32.580 | joining in the Gray Swan Arena. Link in the description. Essentially, you're rewarded with real
00:10:38.180 | human money for breaking an AI, for jailbreaking LLMs. Several of my own subscribers have joined in these
00:10:45.700 | competitions and won prizes. Actually, you can see in the corner, $350,000 worth of rewards have been
00:10:52.260 | distributed. And actually, scrolling down, I can see that there is a competition that is live and in
00:10:57.860 | progress as we speak: their Proving Ground one. As I've mentioned before on the channel,
00:11:01.860 | I see this as a win-win. You can gain recognition and money and the AI gets just that bit more secure.
00:11:08.980 | One more limitation and then I'm going to end on a positive. I think Andrej Karpathy, formerly of OpenAI,
00:11:15.060 | made a fantastic point in this recent tweet. In 2015-16, Geoffrey Hinton famously predicted
00:11:21.860 | that we shouldn't be training new radiologists. And Karpathy linked to this article, which is indeed
00:11:27.620 | a great one. It said that there were models released back in 2017 that could detect pneumonia
00:11:34.180 | with greater accuracy than a panel of board-certified radiologists. You can just imagine the clickbait
00:11:39.380 | that could have been written about that study. So how come eight years later, radiologists have an
00:11:44.020 | average salary of over half a million dollars per year, which is 48% higher than in 2015? Well,
00:11:50.820 | some of this is about to sound familiar, but there were issues with training data not covering edge cases.
00:11:56.820 | There were, of course, legal hurdles. And just like in the paper we just read,
00:12:01.380 | there were also tasks within radiology that didn't involve such automation, like talking to patients.
00:12:07.060 | As best I could, I recently tried to delineate each of the blockers to the singularity, as I called it
00:12:13.380 | in my recent Patreon video from the 19th. And I'm going to link in the description this framework that I
00:12:19.300 | created. None of these are unsolvable, but understanding each one will help you read beyond the headlines.
00:12:25.300 | Now let's spot some more patterns because the AI for radiology didn't cover all tasks. It focused on the
00:12:31.220 | big ones like stroke, breast cancer, and lung cancer. What about things like vascular, head and neck,
00:12:36.420 | spine and thyroid? Well, relatively few AI products. Think of those tasks not covered in that spreadsheet.
00:12:42.260 | Then, if you're a child or an ethnic minority, these AI tools perform worse. And think of the analogy
00:12:47.700 | with LLMs. Outside of English, they don't do as well. Notice how the study only focused on the US GDP.
00:12:53.780 | Then there's the fact that OpenAI, for example, keep hiring new people despite building tools that are
00:12:58.820 | designed to automate AI research. Likewise, in radiology, headcounts and salaries just continue to rise.
00:13:05.060 | Karpathy's prediction: we will have more software engineers in five years than we have now. Just to end
00:13:10.100 | though, I would say don't sleep on this multiplier. You could be sped up by AI even if it can't automate
00:13:16.580 | your job. The AI in Descript, for example, can't fully edit my videos, but it does speed up my own
00:13:23.060 | editing. Understanding AI and getting familiar with using it is still, I think, one of
00:13:28.340 | the best bets you can make. On content creation, there is one tipping point I think we have reached,
00:13:33.140 | which is that, visually at least, we can't fully trust that we are seeing the human we think we are
00:13:39.860 | on video. Thank you so much for watching to the end. I didn't cover ChatGPT Pulse,
00:13:44.580 | even though I am a pro subscriber, because it wasn't rolled out to me. I wonder if it's blocked in the UK.
00:13:49.620 | I tried everything. Having said that, it does seem to be a replacement for scheduled tasks.
00:13:54.260 | Do you guys remember that from January, where you could ask ChatGPT to perform a task at a certain
00:13:58.820 | time? It never worked, kind of flopped, and then everyone forgot about it, but now we have Pulse,
00:14:02.980 | so let's see if that does any better. Have a wonderful day.