
How Not to Read a Headline on AI (ft. new Olympiad Gold, GPT-5 …)


Chapters

0:00 Introduction
0:18 AI Beat Mathematicians?
1:23 OpenAI vs Google
2:42 Irrelevant to Jobs or …
6:45 White-collar jobs gone?
10:26 AI is Plateauing?
12:00 We Don’t Know the Details…
14:33 GPT-5 alpha
14:54 Nothing but Exponentials?
15:53 No Impact?

Transcript

Almost five million people saw the headline 48 hours ago that OpenAI have a secret large language model that got gold at the International Math Olympiad. Here though are nine ways to misread that headline. First, that this means AI is now as good as the best mathematicians and could put them out of a job.

The IMO is extremely difficult, but it contains questions written by human experts, not questions that no one yet knows the answer to. I am truly in awe of the high school competitors who get any medal in it, or who even qualify to be in the competition. But as one UCL math professor said yesterday, math research is about solving problems no one yet knows how to solve, and this requires significant creativity, something notably absent from OpenAI's IMO solutions.

Now OpenAI's model, apparently due out around the end of the year, did not find a correct proof for the hardest problem, the one requiring the most creativity. That is unlike, by the way, a fair few of the young human participants. The model did get problems one through five correct. That is bloody impressive and enough for a gold.

Second misreading of the headline, though: that this means OpenAI are now in the lead in AI, or at least in language models for mathematics. Well, we actually don't know what the Google effort got in the IMO. This professor is hearing that Google DeepMind also got gold but has not yet announced it.

We will find out in the coming week, apparently, whether Google DeepMind got problem 6 correct. Was this why OpenAI rushed the announcement, to get there before Google and steal the headlines? Now Trieu Trinh, one of the Google DeepMind researchers on AI for mathematics and the lead of their famous (well, famous to me at least) AlphaGeometry system that I discussed 18 months ago, retweeted this tweet.

Apparently AI organisations were asked not to report their results for a week, to give some space for human celebration. Unfortunately, Noam Brown of OpenAI said that this message somehow didn't get through to OpenAI; maybe it wasn't relayed to them. We don't know, but this explains why we don't yet have the Google DeepMind results, which I believe are coming out on the 28th of July, along with some other results from a company called Harmonic.

Third way to misread this gold medal headline: that none of this is relevant to whether AI will reduce entry-level white-collar jobs. I frankly disagree; I think it is relevant. One of the leads on OpenAI's new secretive model, Jerry Tworek, if I'm pronouncing that right, revealed that it is not specialised for mathematics and draws on the same research technique used to power most of OpenAI's other offerings.

This is bigger news than it sounds, because it means that this secret model did not use tools or specialised fine tuning to optimise for the mathematics use case. Even one of OpenAI's chief critics at a rival lab and an IMO gold medalist himself conceded that for this result to be achieved by a pure language model was impressive.

To the degree, he said, that this is indicative of general reasoning training without specialisation, that's significant. But many of you will still be saying, nah, none of this is relevant, so let me try to put the strongest case yet. Remember, this reinforcement learning system within OpenAI was the same one responsible for that general-purpose computer-using agent whose headlines you may have seen recently.

I'll play the clip now because it's soon going to be rolled out to all Plus users. It's that system that can browse the web and perform deep research for you. Millions of people saw the headlines about OpenAI's agent mode that can spin up its own virtual computer, operate the mouse, and navigate the browser visually.

Now, yes, that agent is a bit janky, but this same researcher revealed that the agent mode system is an earlier version of the same one that performs so exceptionally at the IMO. The thing is, that more limited agent mode, drawing on an older base model, is approaching human baselines in a range of real-world domains.

This is what I mean, then, when I say that this headline is not irrelevant to the impact on white-collar jobs. The agent mode released just a few days ago (and just to stress that again, only a few days ago) was tested on real-world professional work, such as preparing a competitive analysis of on-demand urgent care providers and identifying viable water wells for a new green hydrogen facility.

Pay attention to the bars in blue because that's the win rate of ChatGPT agent versus humans. As you can see for a variety of tasks, it's approaching a 50% win rate. You don't need me to make the obvious point that if this is ChatGPT agent, what about this model we're getting at the end of the year?
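As a quick aside, in case the metric itself is unclear: a pairwise win rate like the one in those blue bars is just the fraction of tasks where blinded graders preferred the agent's output over the human professional's. A rough sketch in Python, with made-up placeholder judgments rather than OpenAI's actual grading harness:

# Rough sketch, not OpenAI's actual harness: 1 = agent preferred,
# 0 = human professional preferred, 0.5 = tie, one score per task.
judgments = [1, 0, 0.5, 0, 1, 1, 0, 0.5, 0, 1]

win_rate = sum(judgments) / len(judgments)
print(f"Agent win rate vs human professionals: {win_rate:.0%}")  # -> 50%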

Suddenly, models exceeding most human participants in the IMO don't seem so irrelevant. Then there are data science tasks, in which OpenAI claim to actually have a system superior to most human performers. The emphasis there should be on most, because, again, remember these questions were designed by human experts.

Therefore, by definition, there must be some humans who can ace these questions comfortably. Now what is more white-collar, unfortunately, than filling out spreadsheets, or editing them in the case of SpreadsheetBench? In this case, as you can see here, human performance on average is still far superior to ChatGPT agent.

But it is barely speculative at this point to surmise that the model we're getting at the end of the year might score, say, 75% or 80% on SpreadsheetBench. The obvious point to be made is that surely expert spreadsheeters will just increase their productivity by using these tools. And that's true, but it does raise the question of what the incentive will be at that point to hire entry-level helpers.

If entry-level human white-collar workers can no longer complement these systems, then that could really start showing up in the data. How about the headline meaning that we are actually close, then, to fully eliminating white-collar jobs? The logic would go: if it can get gold in the IMO, then isn't it just better than us at everything?

This leads us to the fourth way that many might misread the headline, which is that if we're getting gold in the International Math Olympiad, we are actually really quite close to eliminating white-collar jobs. Well, if you have read the 42-page system card for these latest systems like ChatGPT agent (and frankly, who hasn't read that 42-page system card?), then you'll see that the hallucination rate of these new agents, drawing again on the same techniques as the math whiz, went up.

To repeat, that same single reinforcement learning system, in the words of the OpenAI researcher, produced higher hallucinations within ChatGPT agent. On SimpleQA, which is one benchmark measuring hallucinations, you can see a drop of around four percent compared to the o3 system with browsing. Likewise on PersonQA, another measure of hallucinations.

It should be noted that OpenAI added the caveat that it was often actually Wikipedia getting things wrong, so there may be some noise in that data. That would be the same data used to train the models, but that's another discussion. On evaluations designed to test whether ChatGPT agent refuses to do high-stakes financial tasks, such as making account transfers, the agent mode was worse than the previous 4o- or o3-based Operator.

In other words, it would be more liable to try to do something highly risky, and that's not the only high-stakes setting in which things can go haywire under the new system. OpenAI were essentially testing ChatGPT agent on whether it could produce a bioweapon, or at least whether it had one skill pertaining to that ability.

Now, ChatGPT agent was unable to install or run the bio-design tool, but that's no biggie. Here's where it gets worse: the agent researched and wrote substitute scripts, then misrepresented those scripts' outputs as real tool results. Any terrorist using it for this purpose, then, is going to get mightily pissed off.

But seriously, this is all critical context for the new breakthrough results that you hear about, for example the IMO gold. In my opinion, even if the best of a language model's answers are better than before, if you can't employ a language model at its lowest point, when it hallucinates, then you might not employ it at its best.

So while I foresee a significant impact on entry-level jobs, that is a far cry from eliminating white-collar jobs. That prediction, by the way, is also echoed by that math professor, who said he sees an increasing number of mathematicians improving their productivity by using language models to search for known parts of a tentative proof.

Another massive positive, of course, is that younger entrants to a field can use these kinds of tools to ascend to expert level more rapidly. Before we leave human jobs, just a moment for a word about real jobs you can apply for today. The sponsors of this video are 80,000 Hours, and while I have mentioned their podcast and YouTube channel before, just a quick reminder that they have a job board, linked in the description, with hundreds of jobs filtered for positive impact.

I'm just going to refresh the page, because what I didn't mention last time when talking about this is that these jobs are around the world as well; notice, for example, Paris. If you are interested in any of this, obviously it would be amazing if you could use the link in the description.

Fifth way to misread the OpenAI headline. You might have looked at that headline on Twitter and said, no, it's all hype and AI models have actually hit a plateau. Well, try telling that to this machine learning researcher, who got almost half a million impressions for being disappointed in how the latest models like Grok 4 did on the International Math Olympiad.

They found that Gemini 2.5 Pro did the best of the models they tested, but Grok 4 performed particularly poorly. I could point to my own benchmark, SimpleBench, as some form of proof that Grok 4 wasn't purely benchmark hacking and that there is plenty of genuine progress in AI. After all, I made this benchmark to expose the gap between human performance and model performance, and yet that gap is shrinking rapidly.

There probably will be a SimpleBench v2 one day soon, and yes, we are working on benchmarking models like Kimi; trust me, we are working on it. Anyway, even that researcher, Ravid Shwartz-Ziv, did have to admit it, saying: well played, Noam (Noam Brown of OpenAI), well played. If even after that concession you still think that all AI progress is just hype, wait till the end of the video.

Obligatory mention, by the way, for me at least, that I did call that AI would get gold in the IMO this year. I can't find the quote; I think it was from a few months ago. Maybe one of you can find it. Sixth potential misreading: some of the more trusting among you may misread the headline as being about a peer-reviewed research paper in which we can learn all about the methodology.

After all, this is crucial research and part of OpenAI's main push towards general intelligence, or AGI. Nope, quite the opposite. We have gone from peer-reviewed papers from the frontier labs circa 2022, to website posts up to 2024, to now 3am Twitter threads. That leaves us with an unbelievable number of unknowns about this IMO achievement.

Terence Tao, the smartest man in the world by IQ, said that there are all sorts of unknowns in how the result was achieved, each one of which would cast the result in a slightly more or less favourable light. My key question, along with his, is: did the model submit multiple attempts, for example?

That is, by the way, allowed for the human participants. Neel Nanda, again, asks about more subtle hacks, but we just don't know. This forces us, including me, to read between the lines of obscure, esoteric tweets, but I would say that one key technique does seem to be simply letting inference run for longer.

As in, train models to output yet longer chains of thought. Again, according to Noam Brown, this model thinks for a long time, for hours, and he says there's a lot of room to push that test-time compute and efficiency further. How much compute was used during the competition? We don't know.
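While we wait for details, we can at least sketch what "letting inference run for longer" combined with possible multiple attempts usually looks like in practice: sample many long reasoning traces and select among them. The following Python is a purely illustrative sketch of that pattern, not OpenAI's disclosed method; generate_long_reasoning is a hypothetical stand-in for a model call, and for proofs a learned verifier or self-grading step would realistically replace the naive majority vote shown here.

import random
from collections import Counter

def generate_long_reasoning(problem: str, max_thinking_tokens: int) -> str:
    # Hypothetical stand-in for a model call that "thinks" at length and
    # returns only its final answer.
    return random.choice(["answer A", "answer A", "answer B"])

def solve(problem: str, n_samples: int = 32, max_thinking_tokens: int = 200_000) -> str:
    # Scale test-time compute two ways: longer thinking per sample
    # (max_thinking_tokens) and more independent samples (n_samples),
    # then keep the most common final answer (self-consistency).
    answers = [generate_long_reasoning(problem, max_thinking_tokens)
               for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    print(f"Chose the answer backed by {votes}/{n_samples} samples")
    return best

solve("a hard olympiad-style problem")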

How much cash would such inference cost an average user? Again, we don't know, but it does seem to hint that we really will be getting those two-thousand-dollar-a-month pricing tiers for ChatGPT. The most intriguing hint for me, and for some of you watching, will be the fact that these new techniques, he says, make LLMs a lot better at hard-to-verify tasks.

If OpenAI do take the lead in software engineering by the end of the year, for example, that really would be a big shake-up. Unlike competitive coding, software engineering is harder to verify but has huge economic impact. But back to that sixth misreading: while I strongly suspect Google's announcement on the 28th will be more quantitative and detailed, it will likely still fall far short of complete transparency.

Such is the money at stake in AI these days. Speaking of which, by the way, side note: would you turn down a 300-million-dollar annual salary to work at Meta? Make that 312, by the way, in case you weren't convinced. Seventh misreading: that we will have to wait until the end of the year to get a glimpse of OpenAI's progress.

No, it seems GPT-5 reasoning alpha is coming pretty soon. It's not the same as the model coming out at the end of the year that got gold, but nevertheless it will give us a taste of the latest progress at OpenAI. Eighth misreading: that the AI news these days is nothing but insane progress and exponentials.

Actually, no: see this new METR report. I have chatted with the lead author both in person and online, and we'll hopefully do a deep dive soon, but the TL;DR is that, against expectations, even the participants' own expectations, language models can slow down developers in certain settings. Especially on more complex codebases, averaging over a million lines of code, in which the developers already have lots of experience.

Recent language models, at least, just get a little bit overwhelmed. We'll see about the new generation of models, but this does remind us that if competition coding were the same as real-world software engineering, you just wouldn't see results like this. The developers thought that using language models within Cursor would speed them up by, say, 25 percent, but it actually slowed them down by around 20 percent.
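To make those percentages concrete, here is a tiny worked example in Python using the rounded figures above and an illustrative 60-minute task, not the study's raw data:

# Illustrative numbers only: a task that takes 60 minutes without AI assistance.
baseline_minutes = 60
expected_speedup = 0.25   # developers predicted roughly 25% less time with AI
measured_slowdown = 0.20  # the study measured roughly 20% more time with AI

expected_minutes = baseline_minutes * (1 - expected_speedup)
observed_minutes = baseline_minutes * (1 + measured_slowdown)
print(f"Expected with AI: {expected_minutes:.0f} min, observed: {observed_minutes:.0f} min")
# -> Expected with AI: 45 min, observed: 72 min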

Again, it's a small study, but a fascinating one that I'll come back to. Ninth and finally: try not to misread the gold medal headline and think that, you know, generative AI is just all about phony benchmarks and doesn't ever have any real-world impact. Well, aside from the potential negative impact of our new age of intelligent surveillance, which is covered in my most recent documentary on Patreon.

Do check it out. AI and language models can also have, and have had, positive impact in hard-numbers, real-world settings. Just take AlphaEvolve; I did do a separate video on this, but it made data centers about 0.7 percent more efficient in the real world. Or, more technically, the AlphaEvolve system continuously recovers on average 0.7 percent of Google's worldwide compute resources.

This sustained efficiency gain means that at any given moment more tasks can be completed on the same computational footprint. That's an example of the marrying of language models, essentially next-word predictors, with symbolic, pre-programmed systems. That seems to be the sweet spot at the moment for real-world impact, and I suspect that on the 28th of July we'll learn Google's IMO submission used a bit of both.
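In case that marriage sounds abstract: DeepMind have described AlphaEvolve as an evolutionary loop in which a language model proposes candidate programs and pre-programmed evaluators score them. Here is a heavily simplified Python sketch of that kind of loop; propose_variant and the scoring function below are hypothetical placeholders I've made up for illustration, not Google's actual components.

import zlib

def evaluate(candidate_code: str) -> float:
    # Symbolic, pre-programmed scorer: in reality a simulator measuring,
    # say, cluster utilisation; here a deterministic stand-in for illustration.
    return (zlib.crc32(candidate_code.encode()) % 1000) / 1000.0

def propose_variant(parent_code: str, idx: int) -> str:
    # Stand-in for a language-model call that rewrites the parent program.
    return parent_code + f"  # candidate tweak {idx}"

def evolve(seed_code: str, generations: int = 50) -> str:
    best_code, best_score = seed_code, evaluate(seed_code)
    for gen in range(generations):
        candidate = propose_variant(best_code, gen)  # LLM: the creative step
        score = evaluate(candidate)                  # evaluator: hard numbers
        if score > best_score:                       # keep only measured gains
            best_code, best_score = candidate, score
    return best_code

evolve("def schedule(jobs, machines): ...")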

We'll see. Did they get problem six correct? Did they demonstrate real creativity? Time will tell. Either way, there are, as you can see, I would argue, plenty of ways of misreading the headline. But what do you think? In a meta way, have I misread the headline? Quite possibly. Even if I have, thank you so much for watching to the end, and have a wonderful day.