
How Not to Read a Headline on AI (ft. new Olympiad Gold, GPT-5 …)


Chapters

0:00 Introduction
0:18 AI Beat Mathematicians?
1:23 OpenAI vs Google
2:42 Irrelevant to Jobs or …
6:45 White-collar jobs gone?
10:26 AI is Plateauing?
12:00 We Don’t Know the Details…
14:33 GPT-5 alpha
14:54 Nothing but Exponentials?
15:53 No Impact?


00:00:00.000 | Almost five million people saw the headline 48 hours ago that OpenAI have a secret large language
00:00:07.300 | model that got gold at the International Math Olympiad. Here though are nine ways to misread
00:00:15.780 | that headline. First, this means that AI is now as good as the best mathematicians and could put
00:00:23.440 | them out of a job. The IMO is extremely difficult but contains questions written by human experts, not
00:00:30.640 | questions that no one knows the answer to yet. I am in awe of the high school competitors who get
00:00:37.500 | any medal in it or even qualify to be in the competition, truly. But as one UCL math professor
00:00:43.900 | said yesterday, math research is about solving problems no one yet knows how to solve and this
00:00:49.980 | requires significant creativity, something notably absent from OpenAI's IMO solutions.
00:00:57.060 | Now OpenAI's model, apparently out around the end of the year, did not find a correct proof for the
00:01:04.700 | hardest problem, the one requiring the most creativity. That's unlike, by the way, a fair few of the young
00:01:12.100 | human participants. The model did get problems one through five correct. That is bloody impressive
00:01:19.380 | and enough for a gold. Second misreading of the headline though. This means that OpenAI are now
00:01:26.680 | in the lead in AI or maybe language models for mathematics. Well, we actually don't know what
00:01:33.340 | the Google effort got in the IMO. This professor is hearing that Google DeepMind also got gold but has
00:01:41.680 | not yet announced it. We will find out in the coming week apparently whether Google DeepMind got problem
00:01:47.600 | 6 correct. Was this why OpenAI rushed the announcement to get there before Google and steal the headlines?
00:01:54.900 | Now one of the Google DeepMind researchers on AI for mathematics, and the lead of their famous,
00:02:01.980 | well, famous to me at least, AlphaGeometry system that I discussed 18 months ago,
00:02:07.120 | Trieu Trinh, retweeted this tweet. Apparently AI organisations were asked not to report their results for a week
00:02:15.620 | to give some space for human celebration. Unfortunately, Noam Brown of OpenAI said that this message
00:02:21.600 | somehow didn't get through to OpenAI; maybe it wasn't relayed to them. We don't know, but this explains why
00:02:28.140 | we don't yet have the Google DeepMind results which I believe are coming out on the 28th of July and some
00:02:34.600 | other results from a company called Harmonic. Third way to misread this gold medal headline: that none of this
00:02:42.400 | is relevant to whether AI will reduce entry-level white-collar jobs. I frankly disagree. I think it is relevant.
00:02:50.620 | One of the leads on OpenAI's new secretive model, Jerry Tworek, if I'm pronouncing that right, revealed that it is
00:02:56.880 | not specialised for mathematics and draws on the same research technique used to power most of OpenAI's other offerings.
00:03:04.740 | This is bigger news than it sounds, because it means that this secret model did not use tools or specialised fine-tuning
00:03:11.280 | to optimise for the mathematics use case. Even one of OpenAI's chief critics at a rival lab and an IMO gold medalist himself
00:03:19.560 | conceded that for this result to be achieved by a pure language model was impressive. To the degree, he said,
00:03:28.400 | that this was indicative of general reasoning training without specialisation, that's significant. But many of you will still be
00:03:33.920 | saying, nah, none of this is relevant, but let me try to put the strongest case yet. Remember, this reinforcement
00:03:40.900 | learning system within OpenAI was the same one responsible for that general-purpose computer-using agent whose
00:03:48.400 | headlines you may have seen recently. I'll play the clip now because it's soon going to be rolled out to all
00:03:54.160 | Plus users. But it's that system that can browse the web and perform deep research for you. Millions of
00:04:01.440 | people saw the headlines about OpenAI's agent mode that can spin up its own virtual computer, operate the
00:04:06.640 | mouse, navigate the browser visually. Now, yes, that agent is a bit jank, but this same researcher revealed
00:04:13.680 | that the agent mode system is an earlier version of the same one that performs so exceptionally at the IMO.
00:04:22.160 | The thing is, that more limited agent mode, drawing on an older base model, is approaching human baselines in
00:04:29.360 | a range of real-world domains. This is what I mean, then, when I say that this headline is not irrelevant
00:04:37.760 | to the impact on white-collar jobs. The agent mode released just a few days ago, and to stress again,
00:04:44.080 | it was released only a few days ago, was tested on real-world professional
00:04:50.640 | work, such as preparing a competitive analysis of on-demand urgent care providers and identifying
00:04:57.440 | viable water wells for a new green hydrogen facility. Pay attention to the bars in blue because that's the
00:05:05.200 | win rate of ChatGPT agent versus humans. As you can see for a variety of tasks, it's approaching a 50%
00:05:14.240 | win rate. You don't need me to make the obvious point that if this is ChatGPT agent, what about this
00:05:19.920 | model we're getting at the end of the year? Suddenly a model exceeding most human participants in the IMO
00:05:26.080 | competition doesn't seem so irrelevant. Then there are data science tasks in which OpenAI claim to actually
00:05:33.520 | have a superior system to most human performers. The emphasis there should be on most performers because
00:05:40.240 | again remember these questions were designed by human experts. Therefore there must be some humans by
00:05:46.800 | definition who can ace these questions comfortably. Now what is more white collar, unfortunately, than
00:05:51.920 | filling out spreadsheets, or editing them in the case of SpreadsheetBench? In this case, as you can see
00:05:58.320 | here, human performance on average is still far superior to ChatGPT agent. But it is barely speculative
00:06:05.920 | at this point to surmise that the model we're getting at the end of the year might score say 75% or 80%
00:06:12.880 | on SpreadsheetBench. The obvious point to be made is that surely expert spreadsheeters will just
00:06:20.160 | increase their productivity by using these tools. And that's true, but it does raise the question of what
00:06:25.840 | the incentives will be at that point to hire entry-level helpers. If entry-level human white-collar
00:06:32.960 | workers can no longer complement the systems then that could really start showing up in the data. How
00:06:39.680 | about the headline meaning that we are actually close, then, to fully eliminating white-collar jobs? The logic
00:06:46.160 | would go: if it can get gold in the IMO, then isn't it just better than us at everything? This leads us
00:06:52.960 | to the fourth way that many might misread the headline, which is that if we're getting gold in the
00:06:59.040 | International Math Olympiad, we are actually really quite close to eliminating white-collar jobs.
00:07:03.840 | Well, if you have read the 42-page system card for these latest systems like ChatGPT agent, and frankly who
00:07:11.120 | hasn't read that 42-page system card, then you'll see that the hallucination rate of these new agents,
00:07:18.800 | drawing again on the same techniques as the math whiz, went up. To repeat, that same single reinforcement
00:07:25.840 | learning system, in the words of the OpenAI researcher, produced a higher hallucination rate within ChatGPT agent.
00:07:33.120 | On SimpleQA, which is one benchmark measuring hallucinations, you can see a drop of around four
00:07:39.440 | percent compared to the o3 system with browsing. Likewise on another measure of hallucinations, PersonQA.
00:07:46.880 | It should be noted that OpenAI added the caveat that it was actually Wikipedia getting stuff wrong often.
00:07:54.000 | So there may be some noise in that data. That would be the same data used to train the models but that's
00:07:59.760 | another discussion. On evaluations designed to test whether ChatGPT agent refuses to do high-stakes
00:08:06.880 | financial tasks, such as making financial account transfers, the agent mode was worse than the previous
00:08:14.400 | 4o or o3 Operator. In other words, it would be more liable to try to do something highly risky, and that's
00:08:21.840 | not the only high-stakes setting in which things can go haywire under the new system. OpenAI were testing
00:08:28.640 | ChatGPT agent essentially on whether it could produce a bioweapon or at least whether it had one skill
00:08:34.400 | pertaining to that ability. Now ChatGPT agent was unable to install or run the bio design tool but that's
00:08:41.760 | no biggie. But here's where it gets worse. The ChatGPT agent researched and wrote substitute scripts, then it
00:08:49.280 | misrepresented those scripts' outputs as real tool results. Any terrorist using it for this purpose, then,
00:08:56.320 | is going to get mightily pissed off. But seriously, this is all critical context for these new breakthrough
00:09:03.120 | results that you hear, for example the IMO gold. In my opinion, even if the best of a language model's
00:09:09.520 | answers are better than before, if you can't employ a language model at its lowest point, when it
00:09:15.120 | hallucinates, then you might not employ it at its best. So while I foresee there being significant impact on
00:09:21.840 | entry-level jobs, it's a far cry from eliminating white-collar jobs. That prediction, by the way, is also
00:09:29.200 | echoed by that math professor who said he sees an increasing number of mathematicians improving their
00:09:34.560 | productivity by using language models to search for known parts of a tentative proof. Another massive
00:09:39.520 | positive, of course, is that younger entrants to a field can use these kinds of tools to more rapidly ascend
00:09:45.760 | to expert level. Before we leave human jobs, for just a moment, a word about real jobs you can apply for
00:09:53.040 | today. The sponsors of this video are 80,000 Hours, and while I have mentioned their podcast and YouTube
00:09:59.520 | channel before, just a quick reminder that they have a job board link in the description with hundreds of
00:10:05.760 | jobs filtered for positive impact. I'm just going to refresh the page because what I didn't mention
00:10:11.440 | last time when talking about this is that these jobs are around the world as well, notice for example
00:10:17.440 | Paris. If you are interested in any of this, obviously it would be amazing if you could use the link in the
00:10:23.680 | description. Fifth way not to misread the OpenAI headline. You might have looked at that headline
00:10:29.520 | on Twitter and said no, it's all hype, and AI models have actually hit a plateau. Well, try telling that to
00:10:37.440 | this machine learning researcher, who got almost half a million impressions for being disappointed in how the
00:10:44.640 | latest models like Grok 4 did on the International Math Olympiad. They found that Gemini 2.5 Pro did the
00:10:51.760 | best of the models they tested but Grok 4 performed particularly poorly. I could point to my own benchmark
00:10:58.640 | SimpleBench as some form of proof that Grok 4 wasn't purely benchmark hacking and that there is plenty of
00:11:06.960 | genuine progress in AI. After all, I made this benchmark to expose the gap between human performance and
00:11:14.560 | model performance, and yet that gap is shrinking rapidly. There probably will be a SimpleBench v2
00:11:20.880 | one day soon and yes, we are working on benchmarking models like Kimi, trust me, we are working on it.
00:11:26.000 | Anyway, even that researcher, Ravid Schwartz, did have to admit it, saying well played Noam,
00:11:32.320 | Noam Brown of OpenAI, well played. If even after that concession you still think that all AI progress is
00:11:39.680 | just hype, wait till the end of the video. Obligatory mention, by the way, for me at least, that I did call
00:11:45.760 | that AI would get gold in the IMO this year. I can't find the quote; I think it was from a few months ago.
00:11:51.440 | Maybe one of you can find the quote. Sixth potential misreading. Some of the more trusting among you may
00:11:57.920 | misread the headline as being about a peer-reviewed research paper in which we can learn all about the
00:12:04.160 | methodology. After all, this is crucial research and part of OpenAI's main push towards general
00:12:10.800 | intelligence, or AGI. Nope, quite the opposite. We have gone from peer-reviewed papers from the frontier
00:12:17.600 | labs, say circa 2022, to website posts up to 2024, to now 3am Twitter threads. That leaves us with an
00:12:26.960 | unbelievable amount of unknowns about this IMO achievement. The smartest man in the world by
00:12:32.960 | IQ, Terence Tao, said that there are all sorts of unknowns in how the result was achieved, each one
00:12:39.120 | of which would cast the result in a slightly more favourable or less favourable light. My key question,
00:12:44.880 | along with him, is: did the model submit multiple attempts, for example? That is, by the way, allowed for
00:12:50.800 | the human participants. Neel Nanda again asks about more subtle hacks, but we just don't know. This
00:12:56.720 | forces us, including me, to have to read between the lines of obscure, esoteric tweets, but I would say that
00:13:04.640 | one key technique does seem to be just to let inference run for longer, as in, training models to output yet
00:13:11.760 | longer chains of thought. Again, according to Noam Brown, this model thinks for a long time, for hours, and
00:13:20.000 | he says there's a lot of room to push that test-time compute and efficiency further. How much compute was
00:13:26.320 | used during the competition? We don't know. How much cash would such inference cost an average user? Again
00:13:32.160 | we don't know, but it does seem to hint that we really will be getting those two-thousand-dollar-a-month
00:13:37.840 | pricing tiers for ChatGPT. The most intriguing hint, for me and some of you watching, will be the fact
00:13:44.480 | that these new techniques, he says, make LLMs a lot better at hard-to-verify tasks. If OpenAI do take
00:13:52.160 | the lead in software engineering, for example, by the end of the year, that really would be a big shake-up.
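
To make the "let inference run for longer" idea concrete, here is a minimal sketch of one publicly known test-time-compute recipe: sample several long chains of thought and let a scorer pick the winner. The model call and the scorer below are invented stand-ins for illustration only; OpenAI has not disclosed its actual method.

```python
# Toy sketch of test-time compute scaling ("let inference run for longer"):
# sample several long chains of thought, then keep the best-scoring answer.
# `toy_generate` and `toy_score` are made-up stand-ins, not OpenAI's
# undisclosed system.
import random
from typing import Callable, List, Tuple


def toy_generate(problem: str, thinking_budget: int) -> Tuple[str, str]:
    """Pretend to 'think' for up to `thinking_budget` tokens, then answer."""
    reasoning = f"<{thinking_budget}-token reasoning trace for {problem!r}>"
    answer = str(random.randint(0, 10))  # placeholder final answer
    return reasoning, answer


def toy_score(problem: str, answer: str) -> float:
    """Stand-in for a verifier or reward model ranking candidate answers."""
    return random.random()


def best_of_n(problem: str,
              generate: Callable[[str, int], Tuple[str, str]] = toy_generate,
              score: Callable[[str, str], float] = toy_score,
              n_samples: int = 8,
              thinking_budget: int = 100_000) -> str:
    """More samples and a longer thinking budget = more test-time compute."""
    candidates: List[Tuple[float, str]] = []
    for _ in range(n_samples):
        _reasoning, answer = generate(problem, thinking_budget)
        candidates.append((score(problem, answer), answer))
    return max(candidates)[1]  # highest-scoring answer wins


if __name__ == "__main__":
    print(best_of_n("a hard competition problem"))
```

Recipes like this lean on having something reliable to score against, which is part of why hard-to-verify tasks are the interesting frontier.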
00:13:58.960 | Unlike competitive coding, software engineering is harder to verify but has huge economic impact. But back to
00:14:06.400 | that sixth misreading: while I strongly suspect Google's announcement on the 28th will be more quantitative
00:14:11.760 | and detailed, it will likely still fall far short of complete transparency. Such is the money
00:14:18.480 | at stake in AI these days. Speaking of which, by the way, side note: would you turn down a 300 million dollar
00:14:25.920 | annual salary to work at Meta? Make that 312, by the way, in case you weren't convinced. Seventh misreading:
00:14:32.960 | that we will have to wait till the end of the year to get a glimpse of OpenAI's progress.
00:14:38.480 | No, it seems GPT-5 reasoning alpha is coming pretty soon. It's not the same as the model coming out at the
00:14:46.880 | end of the year that got gold, but nevertheless it will give us a taste of the latest progress at OpenAI.
00:14:54.480 | Eighth misreading: that the AI news these days is nothing but insane progress and exponentials.
00:15:00.400 | Actually no, see this new METR report. I have chatted with the lead author both in person and online and
00:15:06.720 | we'll hopefully do a deep dive soon, but the TLDR is that, against expectations, even the participants'
00:15:13.840 | expectations, language models can slow down developers in certain settings, especially on more complex
00:15:20.400 | code bases, averaging over a million lines of code, in which the developers already have lots of experience.
00:15:25.760 | Recent language models, at least, just get a little bit overwhelmed. We'll see about the new generation of
00:15:30.880 | models, but this does remind us that if competition coding were the same as real-world software
00:15:35.920 | engineering, you just wouldn't see results like this. The developers thought that using language models
00:15:41.200 | within Cursor would speed them up by, say, 25 percent, but it actually slowed them down by around 20 percent.
00:15:47.680 | Again, it's a small study, but a fascinating one that I'll come back to. Ninth and finally, try not to
00:15:53.040 | misread the gold medal headline and think that, you know, generative AI is just all about phony benchmarks,
00:15:59.600 | that it doesn't ever have any real-world impact. Well, aside from a potential negative impact, that of our new age
00:16:06.240 | of intelligent surveillance, which is covered in my most recent documentary on Patreon, do check it out,
00:16:11.680 | AI and language models can also have, and have had, positive impact in hard-numbers, real-world settings.
00:16:19.520 | Just take AlphaEvolve, and I did do a separate video on this, but it made data centers about 0.7 percent
00:16:26.480 | more efficient in the real world. Or, more technically, the AlphaEvolve system continuously recovers on
00:16:32.560 | average 0.7 percent of Google's worldwide compute resources. This sustained efficiency gain means that
00:16:38.560 | at any given moment more tasks can be completed on the same computational footprint. That's an example of
00:16:44.000 | the marrying of language models, essentially next-word predictors, with symbolic, pre-programmed systems.
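
For a rough picture of what that pairing looks like, here is a toy sketch in the spirit of an AlphaEvolve-style loop: a language model proposes candidate programs and a deterministic, pre-programmed evaluator scores them, with the best candidate seeding the next round. The `llm_propose` and `symbolic_evaluate` functions are invented stand-ins for illustration, not DeepMind's actual system.

```python
# Toy sketch of the LLM-plus-symbolic-evaluator pattern (in the spirit of
# AlphaEvolve): an LLM proposes candidate heuristics, a deterministic,
# pre-programmed evaluator scores them, and the best candidate seeds the
# next generation. Both functions are invented stand-ins, not DeepMind's
# actual system.
import random
from typing import Tuple


def llm_propose(parent: str) -> str:
    """Stand-in for a language model mutating/rewriting a candidate heuristic."""
    return f"{parent}+tweak{random.randint(0, 99)}"


def symbolic_evaluate(candidate: str) -> float:
    """Stand-in for a hard-coded evaluator, e.g. a scheduler simulator that
    measures how much compute a candidate heuristic recovers."""
    return random.random()  # pretend efficiency score


def evolve(seed: str, generations: int = 5, population: int = 8) -> Tuple[str, float]:
    """Keep whichever candidate the symbolic evaluator scores highest."""
    best = (seed, symbolic_evaluate(seed))
    for _ in range(generations):
        children = [llm_propose(best[0]) for _ in range(population)]
        scored = [(child, symbolic_evaluate(child)) for child in children]
        best = max(scored + [best], key=lambda pair: pair[1])
    return best


if __name__ == "__main__":
    print(evolve("baseline_scheduler"))
```

The appeal of the split is that the language model only has to be creative, while the hard-coded evaluator decides what actually counts as better.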
00:16:50.160 | That seems to be the sweet spot at the moment for real-world impact, and I suspect on the 28th of July
00:16:55.680 | Google's submission to the International Math Olympiad will use a bit of both. We'll see. Did they get
00:17:00.560 | problem six correct? Did they demonstrate real creativity? Time will tell. Either way, there are, I would argue,
00:17:06.240 | as you can see, plenty of ways of misreading the headline. But what do you think? In a meta way, have I misread the
00:17:12.800 | headline? Quite possible. Even if I have, thank you so much for watching to the end, and have a wonderful day.