How Not to Read a Headline on AI (ft. new Olympiad Gold, GPT-5 …)

Chapters
0:00 Introduction
0:18 AI Beat Mathematicians?
1:23 OpenAI vs Google
2:42 Irrelevant to Jobs or …
6:45 White-collar jobs gone?
10:26 AI is Plateauing?
12:00 We Don’t Know the Details…
14:33 GPT-5 alpha
14:54 Nothing but Exponentials?
15:53 No Impact?
Almost five million people saw the headline 48 hours ago that OpenAI have a secret large language model that got gold at the International Math Olympiad. Here, though, are nine ways to misread that headline.

First: this means that AI is now as good as the best mathematicians and could put them out of a job. The IMO is extremely difficult, but it contains questions written by human experts, not questions that no one knows the answer to yet. I am in awe of the high-school competitors who get any medal in it, or who even qualify for the competition, truly. But as one UCL math professor said yesterday, math research is about solving problems no one yet knows how to solve, and this requires significant creativity, something notably absent from OpenAI's IMO solutions. Now, OpenAI's model, apparently out around the end of the year, did not find a correct proof for the hardest problem, the one requiring the most creativity. That's unlike, by the way, a fair few of the young human participants. The model did get problems one through five correct. That is bloody impressive
and enough for a gold.

Second misreading of the headline, though: this means that OpenAI are now in the lead in AI, or maybe in language models for mathematics. Well, we actually don't know what the Google effort got in the IMO. This professor is hearing that Google DeepMind also got gold but has not yet announced it. We will find out in the coming week, apparently, whether Google DeepMind got problem 6 correct. Was this why OpenAI rushed the announcement, to get there before Google and steal the headlines? Now, one of the Google DeepMind researchers on AI for mathematics, and the lead of their famous (well, famous to me at least) AlphaGeometry system that I discussed 18 months ago, Trieu Trinh, retweeted this tweet. Apparently AI organisations were asked not to report their results for a week, to give some space for human celebration. Unfortunately, Noam Brown of OpenAI said that this message somehow didn't get through to OpenAI; maybe it wasn't relayed to them. We don't know, but this explains why we don't yet have the Google DeepMind results, which I believe are coming out on the 28th of July, along with some other results from a company called Harmonic.

Third way to misread this gold-medal headline: that none of this
is relevant to whether AI will reduce entry-level white-collar jobs. I frankly disagree. I think it is relevant. One of the leads on OpenAI's new secretive model, Jerry Tworek, if I'm pronouncing that right, revealed that it is not specialised for mathematics and draws on the same research technique used to power most of OpenAI's other offerings. This is bigger news than it sounds, because it means that this secret model did not use tools or specialised fine-tuning to optimise for the mathematics use case. Even one of OpenAI's chief critics at a rival lab, and an IMO gold medalist himself, conceded that for this result to be achieved by a pure language model was impressive. To the degree, he said, that this was indicative of general reasoning training without specialisation, that's significant. But many of you will still be saying: nah, none of this is relevant. So let me try to put the strongest case yet. Remember, this reinforcement learning system within OpenAI was the same one responsible for that general-purpose computer-using agent whose headlines you may have seen recently. I'll play the clip now because it's soon going to be rolled out to all Plus users. It's that system that can browse the web and perform deep research for you. Millions of people saw the headlines about OpenAI's agent mode that can spin up its own virtual computer, operate the mouse, and navigate the browser visually. Now, yes, that agent is a bit janky, but this same researcher revealed that the agent mode system is an earlier version of the same one that performs so exceptionally at the IMO. The thing is, that more limited agent mode, drawing on an older base model, is approaching human baselines in a range of real-world domains. This is what I mean, then, when I say that this headline is not irrelevant to the impact on white-collar jobs. The agent mode released just a few days ago (and just to stress again, it really is only a few days old) was tested on real-world professional work, such as preparing a competitive analysis of on-demand urgent care providers and identifying
viable water wells for a new green hydrogen facility. Pay attention to the bars in blue, because that's the win rate of ChatGPT agent versus humans. As you can see, for a variety of tasks it's approaching a 50% win rate. You don't need me to make the obvious point that if this is ChatGPT agent, what about the model we're getting at the end of the year? Suddenly, models exceeding most human participants in the IMO competition doesn't seem so irrelevant. Then there are data science tasks, on which OpenAI claim to actually have a system superior to most human performers. The emphasis there should be on most performers because, again, remember these questions were designed by human experts; therefore, by definition, there must be some humans who can ace these questions comfortably. Now, what is more white-collar, unfortunately, than filling out spreadsheets, or editing them in the case of SpreadsheetBench? In this case, as you can see here, human performance on average is still far superior to ChatGPT agent. But it is barely speculative at this point to surmise that the model we're getting at the end of the year might score, say, 75% or 80% on SpreadsheetBench. The obvious point to be made is that surely expert spreadsheeters will just increase their productivity by using these tools. And that's true, but it does raise the question of what the incentives will be at that point to hire entry-level helpers. If entry-level human white-collar workers can no longer complement the systems, then that could really start showing up in the data.

How about the headline meaning that we are actually close, then, to fully eliminating white-collar jobs? The logic would go: if it can get gold in the IMO, then isn't it just better than us at everything? This leads us to the fourth way that many might misread the headline, which is that if we're getting gold in the International Math Olympiad, we are actually really quite close to eliminating white-collar jobs.
Well, if you have read the 42-page system card for these latest systems like ChatGPT agent (and frankly, who hasn't read that 42-page system card?), then you'll see that the hallucination rate of these new agents, drawing again on the same techniques as the math whiz, went up. To repeat: that same single reinforcement learning system, in the words of the OpenAI researcher, produced higher hallucination rates within ChatGPT agent. On SimpleQA, which is one benchmark measuring hallucinations, you can see a drop of around four percent compared to the o3 system with browsing. Likewise on PersonQA, another measure of hallucinations. It should be noted that OpenAI added the caveat that it was often actually Wikipedia getting stuff wrong, so there may be some noise in that data. That would be the same data used to train the models, but that's another discussion. On evaluations designed to test whether ChatGPT agent refuses to do high-stakes financial tasks, such as making financial account transfers, the agent mode was worse than the previous 4o or o3 Operator. In other words, it would be more liable to try to do something highly risky, and that's not the only high-stakes setting in which things can go haywire under the new system. OpenAI were testing ChatGPT agent essentially on whether it could produce a bioweapon, or at least whether it had one skill pertaining to that ability. Now, ChatGPT agent was unable to install or run the bio-design tool, but that's no biggie. But here's where it gets worse: the ChatGPT agent researched and wrote substitute scripts, then misrepresented those scripts' outputs as real tool results. Any terrorist using it for this purpose, then, is going to get mightily pissed off. But seriously, this is all critical context for these new breakthrough results that you hear about, for example the IMO gold. In my opinion, even if the best of a language model's answers are better than before, if you can't employ a language model at its lowest point, when it hallucinates, then you might not employ it at its best. So while I foresee there being significant impact on entry-level jobs, it's a far cry from eliminating white-collar jobs. That prediction, by the way, is also echoed by that math professor, who said he sees an increasing number of mathematicians improving their productivity by using language models to search for known parts of a tentative proof. Another massive positive, of course, is that younger entrants to a field can use these kinds of tools to more rapidly ascend
to expert level.

Before we leave human jobs, for just a moment, a word about real jobs you can apply for today. The sponsors of this video are 80,000 Hours, and while I have mentioned their podcast and YouTube channel before, just a quick reminder that they have a job board, linked in the description, with hundreds of jobs filtered for positive impact. I'm just going to refresh the page, because what I didn't mention last time when talking about this is that these jobs are around the world as well; notice, for example, Paris. If you are interested in any of this, obviously it would be amazing if you could use the link in the
description.

Fifth way to misread the OpenAI headline: you might have looked at that headline on Twitter and said, no, it's all hype, and AI models have actually hit a plateau. Well, try telling that to this machine learning researcher, who got almost half a million impressions for being disappointed in how the latest models like Grok 4 did on the International Math Olympiad. They found that Gemini 2.5 Pro did the best of the models they tested, but Grok 4 performed particularly poorly. I could point to my own benchmark, SimpleBench, as some form of proof that Grok 4 wasn't purely benchmark hacking and that there is plenty of genuine progress in AI. After all, I made this benchmark to expose the gap between human performance and model performance, and yet that gap is shrinking rapidly. There probably will be a SimpleBench v2 one day soon, and yes, we are working on benchmarking models like Kimi; trust me, we are working on it. Anyway, even that researcher, Ravid Shwartz-Ziv, did have to admit it, saying: well played, Noam (Noam Brown of OpenAI), well played. If even after that concession you still think that all AI progress is just hype, wait till the end of the video. Obligatory mention, by the way, for me at least, that I did call that AI would get gold in the IMO this year. I can't find the quote; I think it was from a few months ago.
Maybe one of you can find the quote.

Sixth potential misreading: some of the more trusting among you may misread the headline as being about a peer-reviewed research paper in which we can learn all about the methodology. After all, this is crucial research and part of OpenAI's main push towards general intelligence, or AGI. Nope, quite the opposite. We have gone from peer-reviewed papers from the frontier labs, say circa 2022, to website posts up to 2024, to now 3am Twitter threads. That leaves us with an unbelievable number of unknowns about this IMO achievement. The smartest man in the world by IQ, Terence Tao, said that there are all sorts of unknowns in how the result was achieved, each of which would cast the result in a slightly more favourable or less favourable light. My key question, along with him, is: did the model submit multiple attempts, for example? That is, by the way, allowed for the human participants. Neel Nanda, again, asks about more subtle hacks, but we just don't know. This forces us, including me, to have to read between the lines of obscure, esoteric tweets, but I would say that one key technique does seem to be just letting inference run for longer; as in, training models to output ever longer chains of thought. Again, according to Noam Brown, this model thinks for a long time, for hours, and he says there's a lot of room to push that test-time compute and efficiency further.
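To make the idea of "spending more test-time compute" concrete, here is a minimal, purely illustrative sketch of one well-known published approach, self-consistency: sample several independent chains of thought and keep the majority answer. This is not OpenAI's actual method, which has not been disclosed; `sample_answer` is a hypothetical stand-in for a call to any reasoning model.

```python
# Illustrative only: self-consistency as one generic way to spend more
# test-time compute. Not OpenAI's (undisclosed) IMO method.
from collections import Counter
import random

def sample_answer(question: str, thinking_budget_tokens: int) -> str:
    """Hypothetical stand-in: one sampled chain of thought -> one final answer."""
    # A real implementation would call a reasoning model with a large token budget.
    return random.choice(["answer A", "answer A", "answer B"])

def solve_with_more_compute(question: str, n_samples: int = 32,
                            thinking_budget_tokens: int = 100_000) -> str:
    # More samples and longer chains of thought = more test-time compute.
    answers = [sample_answer(question, thinking_budget_tokens) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority vote

print(solve_with_more_compute("a hypothetical IMO-style question"))
```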
How much compute was used during the competition? We don't know. How much cash would such inference cost an average user? Again, we don't know, but it does seem to hint that we really will be getting those two-thousand-dollar-a-month pricing tiers for ChatGPT. The most intriguing hint, for me and some of you watching, will be the fact that these new techniques, he says, make LLMs a lot better at hard-to-verify tasks. If OpenAI do take the lead in software engineering, for example, by the end of the year, that really would be a big shake-up. Unlike competitive coding, software engineering is harder to verify but has huge economic impact. But back to that sixth misreading: while I strongly suspect Google's announcement on the 28th will be more quantitative and detailed, it will likely still fall far short of complete transparency. Such is the money at stake in AI these days. Speaking of which, by the way, side note: would you turn down a 300-million-dollar annual salary to work at Meta? Make that 312, by the way, in case you weren't convinced.

Seventh misreading:
that we will have to wait until the end of the year to get a glimpse of OpenAI's progress. No, it seems a GPT-5 reasoning alpha is coming pretty soon. It is not the same as the model coming out at the end of the year that got gold, but nevertheless it will give us a taste of the latest progress at OpenAI.
Eighth misreading: that the AI news these days is nothing but insane progress and exponentials. Actually, no: see this new METR report. I have chatted with the lead author both in person and online, and we'll hopefully do a deep dive soon, but the TL;DR is that, against expectations, even the participants' own expectations, language models can slow down developers in certain settings. Especially on more complex codebases, averaging over a million lines of code, in which the developers already have lots of experience, recent language models at least just get a little bit overwhelmed. We'll see about the new generation of models, but this does remind us that if competition coding were the same as real-world software engineering, you just wouldn't see results like this. The developers thought that using language models within Cursor would speed them up by, say, 25 percent, but it actually slowed them down by around 20 percent.
Again, it's a small study, but a fascinating one that I'll come back to.

Ninth and finally: try not to misread the gold-medal headline and think that, you know, generative AI is just all about phony benchmarks that never have any real-world impact. Well, aside from the potential negative impact of our new age of intelligent surveillance (that's covered in my most recent documentary on Patreon; do check it out), AI and language models can also have, and have had, positive impact in hard numbers, in real-world settings. Just take AlphaEvolve: I did do a separate video on this, but it made data centers about 0.7 percent more efficient in the real world. Or, more technically, the AlphaEvolve system continuously recovers on average 0.7 percent of Google's worldwide compute resources. This sustained efficiency gain means that at any given moment, more tasks can be completed on the same computational footprint. That's an example of the marrying of language models, essentially next-word predictors, with symbolic, pre-programmed systems.
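To make that 0.7 percent figure concrete, here is a rough back-of-envelope sketch; the fleet size is hypothetical, and only the 0.7 percent recovery rate comes from the claim above.

```python
# Back-of-envelope illustration only. The fleet size is hypothetical;
# the 0.7% figure is the continuous compute recovery attributed to AlphaEvolve.
FLEET_MACHINE_HOURS_PER_DAY = 1_000_000   # hypothetical fleet capacity
RECOVERED_FRACTION = 0.007                # 0.7 percent continuously recovered

recovered_hours = FLEET_MACHINE_HOURS_PER_DAY * RECOVERED_FRACTION
print(f"Extra machine-hours available per day: {recovered_hours:,.0f}")
# On these hypothetical numbers, roughly 7,000 machine-hours a day of work
# can run on the same hardware footprint.
```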
That seems to be the sweet spot at the moment for real-world impact, and I suspect that on the 28th of July, Google's submission to the International Math Olympiad will use a bit of both. We'll see. Did they get problem six correct? Did they demonstrate real creativity? Time will tell. Either way, I would argue there are, as you can see, plenty of ways of misreading the headline. But what do you think? In a meta way, have I misread the headline? Quite possibly. Even if I have, thank you so much for watching to the end, and have a wonderful day.