
In the last 24 hours, OpenAI have released research on essentially whether current language models can automate your job. The big claim, albeit carefully worded, is that current best frontier models are approaching industry experts in deliverable quality. But as you'll see from the title, there are plenty of unexpected findings in this research.
Before I dive into that, there is one job we seem intent on automating, and that is being a UFC fighter. You can laugh at the lack of performance now, but like me, you might be laughing somewhat nervously. Take a look at this Unitree G1 robot, which maybe hasn't mastered kung fu yet, but it's getting a bit closer.
Quick prediction: do you reckon billionaires will have humanoid robot bodyguards by 2035? Let me know. Back to the paper, and they are focusing only on the most important sectors by their contribution to GDP. What makes things more interesting is that the questions weren't designed by OpenAI. They were designed by industry professionals themselves, with an average of 14 years of industry experience.
They had to meet all sorts of criteria just to design the questions. And here are the headline results, which you may have seen go viral, with Claude Opus 4.1, a model by Anthropic, beating out OpenAI's models and coming quite close to parity with industry experts. This I am obviously going to class as the first surprising finding. Not that Opus is the best model; Opus 4.1, if you haven't tried it, is indeed an amazing model.
So no, that's not the most surprising bit. It's that OpenAI published this result showing Opus beating its own models. I think that's great, honest science, by the way, and I commend OpenAI for publishing this. Now you might be thinking, no, Philip, the most surprising bit is how close we're getting to parity with industry experts, but I'll come back to that in just a moment.
Right now, I want to cover this second, you could say, somewhat surprising result, which is that the win rate, when compared to humans, depended quite heavily on the file type involved. If your workflow involves submitting or producing a PDF, PowerPoint, or Excel spreadsheet, you might well find that Opus 4.1 is a league ahead.
All these figures, by the way, are of how often a model's output beats a human expert's output, as judged by another human expert. You may want to pause this one to look across the different sectors, and you may or may not find it surprising that it's in government where we have a model beating the average human expert.
Personally, I'm a little bit skeptical that Gemini 2.5 Pro scored so badly across these metrics. I find it a really great model. But then again, Gemini 3 might well be around the corner. The third potentially unexpected finding is that we seem to have passed a tipping point where models tend to speed up human experts.
To briefly summarize this table: if a model is too weak, then even if you let it try a task multiple times and only use its output when you judge it to be satisfactory, that doesn't actually speed you up. Essentially, the time spent reviewing its output is just badly spent, and it's not worth it.
You might as well just do it yourself alone. However, by the time we get to GPT-5, this does actually speed you up. You guys may have experienced this yourself, but GPT-5 does a good enough job, often enough, that across the board, on average, in these industries, and I'll get to which ones in a second, you are slightly sped up.
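As a rough illustration of that trade-off, here's a minimal back-of-the-envelope sketch. The seven-hour average task length is from the paper, but the review time and acceptance rates are purely hypothetical numbers chosen to show the shape of the effect, not figures from the study.

```python
# Hypothetical sketch of "review the model's output, fall back to doing it
# yourself if it isn't good enough" versus just doing the task yourself.

def expected_hours_with_model(p_accept, review_hours, diy_hours):
    """Expected time if you always review the model's deliverable and only
    keep it when it clears your quality bar; otherwise you redo the task."""
    return review_hours + (1 - p_accept) * diy_hours

diy_hours = 7.0      # average expert task length reported in the paper
review_hours = 1.0   # hypothetical time to check one model deliverable

for p_accept in (0.1, 0.4, 0.7):   # hypothetical acceptance rates, weak to strong models
    t = expected_hours_with_model(p_accept, review_hours, diy_hours)
    print(f"acceptance {p_accept:.0%}: {t:.1f}h with the model vs {diy_hours:.1f}h alone")
```

On those made-up numbers, a model whose output is acceptable only 10% of the time leaves you slightly slower than working alone, while one that clears the bar 70% of the time cuts your expected hours by more than half, which is the tipping-point behaviour that table is describing.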
Two fairly critical caveats to this unexpected finding, though. One is: where is Claude Opus 4.1? Because surely that would have produced an even greater speed improvement for the human experts. And second, the bar for acceptance for what these models were producing was meeting the human's quality level.
They call it the quality bar, as judged by those humans. But what if those humans can't always spot the subtle errors that the models output? It reminds me of that developer study that METR did, where the developers thought they were being sped up by around 20%, but were actually being slowed down by around 19%.
Now, though, to the biggest finding of all, and the big claim in the paper, as articulated by Lawrence Summers. He's a famous economist and former president of Harvard, I believe, and he said that these are task-specific Turing tests. Models can now do many of these tasks as well as, or better than, humans.
If that's generally true, then that would lend support to claims like this one from another OpenAI researcher, which is that their current systems are AGI. For example, one of their unreleased models was able to beat every single human in one particular coding competition. Logically, that makes some initial sense, right?
If it can beat these experts at coding competitions, and at least match experts across a whole range of domains, why wouldn't that be AGI? The founder and former CEO of Stability AI implied we're close to a tipping point, the implication of course being that we will then start to see the automation of jobs wholesale.
Well, I would say one of the big unexpected findings was how robust human jobs seem to be to automation by current-generation LLMs. The evidence from this paper, to me, suggests that we will need a further step-change improvement in model performance to start genuinely automating whole swathes of the economy.
Why would I say that when they just said, in the abstract, that current best frontier models are approaching industry experts in deliverable quality? Like, we're really close, right? Not really, when you dive into the details of the paper. First, the paper admits that if you look at adoption rates, the picture doesn't look so great for AI.
And I covered that in a recent video, with many companies dropping their pilot projects. But those adoption rates are lagging indicators, as is GDP growth: it takes time for people to realize how good these models are. Fair enough, it will take time for AI to diffuse.
So they're just going to focus on what current-generation AI can actually do. Here are some of the tasks, by the way. For example, if you were a manufacturing engineer, then you were asked in this study to design a 3D model of a cable reel stand for an assembly line.
All the models were given the same task, and then the results were compared and blind graded. What, then, is the problem, if these tasks were designed by industry experts and then blind graded? Surely that shows that the models are almost at human-expert-level performance in industry. Even on task length: these tasks required an average of seven hours of work for the experts.
So these are realistic tasks. Well, first, they excluded those occupations whose tasks were not predominantly digital. I had to dig quite a long way through the appendices to work out how they did this, but I want to give you just an example of the kind of thing they did.
They looked at this table and found only those sectors that contributed at least 5% to US GDP. Then, for each sector, they found five occupations, weighted by salary, whose work was predominantly digital. Take manufacturing. All five of these occupations have predominantly digital work, apparently. But then, of course, if you dig into the data where they got that from, there are countless occupations within that category whose work is not predominantly digital.
So for every one or two that made it into the paper, there were loads, of course, that did not. Okay, but what about those occupations that are predominantly digital? Well, even there, they didn't look at all of what those occupations do. I took just one of the occupations rated as predominantly digital, property manager, and categorized all 27 of the tasks listed for it in the official records.
This was from O*NET, which is the same source that OpenAI used. GPT-5 Pro, ironically saving me lots of time, categorized them, with about six or seven of the tasks rated as not being primarily digital. Things like overseeing operations and maintenance, coordinating staff, investigating complaints and violations. The obvious point being that even if we can automate the 19 or 20 tasks that are obviously digital within this predominantly digital occupation, that wouldn't eliminate the job entirely.
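To put rough numbers on that, here's the trivial arithmetic, assuming GPT-5 Pro's split of those 27 O*NET tasks (roughly seven not primarily digital) is about right; that split is my own estimate, not a figure from the paper.

```python
# Property-manager example: share of O*NET tasks that are plausibly automatable.
total_tasks = 27
non_digital = 7                       # e.g. overseeing maintenance, coordinating staff
digital = total_tasks - non_digital   # the remaining, predominantly digital tasks

print(f"digital: {digital}/{total_tasks} = {digital / total_tasks:.0%}")
print(f"non-digital: {non_digital}/{total_tasks} = {non_digital / total_tasks:.0%}")
# Roughly 74% of the listed tasks are digital, so even perfect automation of
# those would still leave about a quarter of the role needing a human.
```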
In fact, that profession might get even better paid, as we're going to see in a moment for radiologists. So: not all sectors, not all occupations within those sectors, and not all tasks within each occupation. Fine, but what about the actual tasks themselves? Well, they were super realistic, and you can look at the range of industries involved, from Apple to the US Department of War, as it's now called, to Google and BBC News, for example.
But first, they were somewhat subjective, with even the human experts having only around 70% agreement between themselves about which answer was better: the model's answer or the human expert's "gold" deliverable. Next, sometimes it was obvious which answer was the model output, because OpenAI models, for example, would often use em dashes.
Grok would occasionally just randomly introduce itself, apparently. More fundamentally, though, the tasks were one-shot: here's the task, get it done. Of course, in a real job there's much more interactivity, where you ask questions of the person giving you the task to pin down its scope and parameters.
Also, they had to exclude tasks that relied on too much context, like the use of proprietary software tools. Then there were the catastrophic mistakes. They admit one further limitation of this analysis is that it does not capture the cost of catastrophic mistakes, which can be disproportionately expensive in some domains.
They give some examples of catastrophic answers, and I'll give one of my own. They said something could go dangerously wrong, like insulting a customer or suggesting things that will cause physical harm. This happened, apparently, 2.7% of the time. Here's something to consider: if the damage done by those catastrophic failures is a hundred times worse than the cost savings you get from the model being better, then, weighted by impact, using "agentic" AI without a human in the loop could cost you more in the long run.
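Here's a hypothetical expected-value sketch of that weighting. The 2.7% figure is the rate reported in the paper; every cost number is made up purely to illustrate the "hundred times worse" scenario.

```python
# Expected value per task when rare catastrophic failures are weighted by impact.
p_catastrophic = 0.027                    # catastrophic-answer rate reported in the paper
saving_per_task = 100.0                   # hypothetical $ saved when the output is fine
catastrophe_cost = 100 * saving_per_task  # "a hundred times worse than the saving"

expected_value = (1 - p_catastrophic) * saving_per_task - p_catastrophic * catastrophe_cost
print(f"expected value per task: ${expected_value:,.2f}")
# 0.973 * 100 - 0.027 * 10,000 = 97.30 - 270.00 = -$172.70 per task:
# a net loss, even though the model helps the overwhelming majority of the time.
```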
Here's my example from a recent bit of coding, where Claude admits, and I saw it do this, that it completely hallucinated a set of prices for a particular model. "You're absolutely right," it said. "I apologize for making up those credit numbers. That was incredibly irresponsible of me. Let me check the actual values."
It thought about it, then said: "Yes, your diagnosis is a hundred percent correct. I apologize again for making up those credit values." You would have to be a pretty irresponsible employee, or a downright fraudster, to make up such critical values without asking anyone. This was Claude Opus 4.1, by the way.
I am open-minded, though. Let me know what you think: will real-life human fraudsters, or just complete dotards, make more of these kinds of mistakes than the models with their catastrophic hallucinations? Speaking of catastrophes, by the way, you can help avert certain catastrophes by joining in the Gray Swan Arena.
Link in the description. Essentially, you're rewarded with real human money for breaking an AI, for jailbreaking LLMs. Several of my own subscribers have joined in these competitions and won prizes. You can see in the corner that $350,000 worth of rewards has been distributed. And, scrolling down, I can see that there is a competition that is live and in progress as we speak.
It's their Proving Ground one. As I've mentioned before on the channel, I see this as a win-win: you can gain recognition and money, and the AI gets just that bit more secure. One more limitation, and then I'm going to end on a positive. I think Andrej Karpathy, formerly of OpenAI, made a fantastic point in this recent tweet.
In 2016, Geoffrey Hinton famously predicted that we shouldn't be training new radiologists. And Karpathy linked to this article, which is indeed a great one. It said that there were models released back in 2017 that could detect pneumonia with greater accuracy than a panel of board-certified radiologists. You can just imagine the clickbait that could have been written about that study.
So how come eight years later, radiologists have an average salary of over half a million dollars per year, which is 48% higher than in 2015? Well, some of this is about to sound familiar, but there were issues with training data not covering edge cases. There were, of course, legal hurdles.
And just like in the paper we just read, there were also tasks within radiology that didn't involve such automation, like talking to patients. As best I could, I recently tried to delineate each of the blockers to the singularity, as I called it in my recent Patreon video from the 19th.
And I'm going to link in the description this framework that I created. None of these are unsolvable, but understanding each one will help you read beyond the headlines. Now let's spot some more patterns because the AI for radiology didn't cover all tasks. It focused on the big ones like stroke, breast cancer, and lung cancer.
What about things like vascular, head and neck, spine, and thyroid? Well, relatively few AI products. Think of those tasks not covered in that spreadsheet. Then, if you're a child or from an ethnic minority, these AI tools perform worse. And think of the analogy with LLMs: outside of English, they don't do as well.
Notice how the study only focused on US GDP. Then there's the fact that OpenAI, for example, keep hiring new people despite building tools designed to automate AI research. Likewise, in radiology, headcounts and salaries just continue to rise. Karpathy's prediction: we will have more software engineers in five years than we have now.
Just to end though, I would say don't sleep on this multiplier. You could be sped up by AI even if it can't automate your job. The AI, for example, in Descript can't fully edit my videos, but it does speed up my own editing of videos. Understanding AI and getting familiar with using it is still, I think, one of the best bets you can make.
On content creation, there is one tipping point I think we have reached, which is that, visually at least, we can't fully trust that we are seeing the human we think we are on video. Thank you so much for watching to the end. I didn't cover ChatGPT Pulse, even though I am a Pro subscriber, because it wasn't rolled out to me.
I wonder if it's blocked in the UK. I tried everything. Having said that, it does seem to be a replacement for scheduled tasks. Do you guys remember that from January, where you could ask ChatGPT to perform a task at a certain time? It never worked, kind of flopped, and then everyone forgot about it, but now we have Pulse, so let's see if that does any better.
Have a wonderful day.