
How Not to Read a Headline on AI (ft. new Olympiad Gold, GPT-5 …)


Chapters

0:00 Introduction
0:18 AI Beat Mathematicians?
1:23 OpenAI vs Google
2:42 Irrelevant to Jobs or …
6:45 White-collar jobs gone?
10:26 AI is Plateauing?
12:00 We Don’t Know the Details…
14:33 GPT-5 alpha
14:54 Nothing but Exponentials?
15:53 No Impact?


00:00:00.000 | Almost five million people saw the headline 48 hours ago that OpenAI have a secret large language
00:00:07.300 | model that got gold at the International Math Olympiad. Here though are nine ways to misread
00:00:15.780 | that headline. First, this means that AI is now as good as the best mathematicians and could put
00:00:23.440 | them out of a job. The IMO is extremely difficult but contains questions written by human experts, not
00:00:30.640 | questions that no one knows the answer to yet. I am in awe of the high school competitors who get
00:00:37.500 | any medal in it or even qualify to be in the competition, truly. But as one UCL math professor
00:00:43.900 | said yesterday, math research is about solving problems no one yet knows how to solve and this
00:00:49.980 | requires significant creativity, something notably absent from OpenAI's IMO solutions.
00:00:57.060 | Now OpenAI's model, apparently out around the end of the year, did not find a correct proof for the
00:01:04.700 | hardest problem, the one requiring the most creativity. That's unlike, by the way, a fair few of the young
00:01:12.100 | human participants. The model did get problems one through five correct. That is bloody impressive
00:01:19.380 | and enough for a gold. Second misreading of the headline though. This means that OpenAI are now
00:01:26.680 | in the lead in AI or maybe language models for mathematics. Well, we actually don't know what
00:01:33.340 | the Google effort got in the IMO. This professor is hearing that Google DeepMind also got gold but has
00:01:41.680 | not yet announced it. We will find out in the coming week apparently whether Google DeepMind got problem
00:01:47.600 | 6 correct. Was this why OpenAI rushed the announcement to get there before Google and steal the headlines?
00:01:54.900 | Now one of the Google DeepMind researchers on AI for mathematics, and the lead of their famous,
00:02:01.980 | well, famous to me at least, AlphaGeometry system that I discussed 18 months ago,
00:02:07.120 | Trieu Trinh, retweeted this tweet. Apparently AI organisations were asked not to report their results for a week
00:02:15.620 | to give some space for human celebration. Unfortunately, Noam Brown of OpenAI said that this message
00:02:21.600 | somehow didn't get through to OpenAI; maybe it wasn't relayed to them. We don't know, but this explains why
00:02:28.140 | we don't yet have the Google DeepMind results which I believe are coming out on the 28th of July and some
00:02:34.600 | other results from a company called Harmonic. Third way to misread this gold medal headline: that none of this
00:02:42.400 | is relevant to whether AI will reduce entry-level white-collar jobs. I frankly disagree. I think it is relevant.
00:02:50.620 | One of the leads on OpenAI's new secretive model, Jerry Tworek, if I'm pronouncing that right, revealed that it is
00:02:56.880 | not specialised for mathematics and draws on the same research technique used to power most of OpenAI's other offerings.
00:03:04.740 | This is bigger news than it sounds, because it means that this secret model did not use tools or specialised fine-tuning
00:03:11.280 | to optimise for the mathematics use case. Even one of OpenAI's chief critics at a rival lab and an IMO gold medalist himself
00:03:19.560 | conceded that for this result to be achieved by a pure language model was impressive. To the degree, he said,
00:03:28.400 | that this was indicative of general reasoning training without specialisation, that's significant. But many of you will still be
00:03:33.920 | saying, nah, none of this is relevant, but let me try to put the strongest case yet. Remember, this reinforcement
00:03:40.900 | learning system within OpenAI was the same one responsible for that general-purpose computer-using agent whose
00:03:48.400 | headlines you may have seen recently. I'll play the clip now because it's soon going to be rolled out to all
00:03:54.160 | Plus users. But it's that system that can browse the web and perform deep research for you. Millions of
00:04:01.440 | people saw the headlines about OpenAI's agent mode that can spin up its own virtual computer, operate the
00:04:06.640 | mouse, navigate the browser visually. Now, yes, that agent is a bit jank, but this same researcher revealed
00:04:13.680 | that the agent mode system is an earlier version of the same one that performs so exceptionally at the IMO.
00:04:22.160 | The thing is, that more limited agent mode, drawing on an older base model, is approaching human baselines in
00:04:29.360 | a range of real-world domains. This is what I mean, then, when I say that this headline is not irrelevant
00:04:37.760 | to the impact on white-collar jobs. The agent mode released just a few days ago, and to stress again,
00:04:44.080 | it was released only a few days ago, was tested on real-world professional
00:04:50.640 | work, such as preparing a competitive analysis of on-demand urgent care providers and identifying
00:04:57.440 | viable water wells for a new green hydrogen facility. Pay attention to the bars in blue because that's the
00:05:05.200 | win rate of ChatGPT agent versus humans. As you can see for a variety of tasks, it's approaching a 50%
00:05:14.240 | win rate. You don't need me to make the obvious point that if this is ChatGPT agent, what about this
00:05:19.920 | model we're getting at the end of the year? Suddenly a model exceeding most human participants in the IMO
00:05:26.080 | competition doesn't seem so irrelevant. Then there are data science tasks in which OpenAI claim to actually
00:05:33.520 | have a superior system to most human performers. The emphasis there should be on most performers because
00:05:40.240 | again remember these questions were designed by human experts. Therefore there must be some humans by
00:05:46.800 | definition who can ace these questions comfortably. Now what is more white collar, unfortunately, than
00:05:51.920 | filling out spreadsheets, or editing them in the case of SpreadsheetBench? In this case, as you can see
00:05:58.320 | here, human performance on average is still far superior to ChatGPT agent. But it is barely speculative
00:06:05.920 | at this point to surmise that the model we're getting at the end of the year might score say 75% or 80%
00:06:12.880 | on SpreadsheetBench. The obvious point to be made is that surely expert spreadsheeters will just
00:06:20.160 | increase their productivity by using these tools. And that's true, but it does raise the question of what
00:06:25.840 | the incentives will be at that point to hire entry-level helpers. If entry-level human white-collar
00:06:32.960 | workers can no longer complement the systems then that could really start showing up in the data. How
00:06:39.680 | about the headline meaning that we are actually close, then, to fully eliminating white-collar jobs? The logic
00:06:46.160 | would go: if it can get gold in the IMO, then isn't it just better than us at everything? This leads us
00:06:52.960 | to the fourth way that many might misread the headline, which is that if we're getting gold in the
00:06:59.040 | International Math Olympiad, we are actually really quite close to eliminating white-collar jobs.
00:07:03.840 | Well, if you have read the 42-page system card for these latest systems like ChatGPT agent, and frankly who
00:07:11.120 | hasn't read that 42-page system card, then you'll see that the hallucination rate of these new agents,
00:07:18.800 | drawing again on the same techniques as the math whiz, went up. To repeat, that same single reinforcement
00:07:25.840 | learning system, in the words of the OpenAI researcher, produced a higher hallucination rate within ChatGPT agent.
00:07:33.120 | On SimpleQA, which is one benchmark measuring hallucinations, you can see a drop of around four
00:07:39.440 | percent compared to the o3 system with browsing. Likewise on another measure of hallucinations, PersonQA.
00:07:46.880 | It should be noted that OpenAI added the caveat that it was actually Wikipedia getting stuff wrong often.
00:07:54.000 | So there may be some noise in that data. That would be the same data used to train the models but that's
00:07:59.760 | another discussion. On evaluations designed to test whether ChatGPT agent refuses to do high-stakes
00:08:06.880 | financial tasks, such as making financial account transfers, the agent mode was worse than the previous
00:08:14.400 | 4o or o3 Operator. In other words, it would be more liable to try to do something highly risky, and that's
00:08:21.840 | not the only high-stakes setting in which things can go haywire under the new system. OpenAI were testing
00:08:28.640 | ChatGPT agent essentially on whether it could produce a bioweapon or at least whether it had one skill
00:08:34.400 | pertaining to that ability. Now ChatGPT agent was unable to install or run the bio design tool but that's
00:08:41.760 | no biggie. But here's where it gets worse. The ChatGPT agent researched and wrote substitute scripts, then it
00:08:49.280 | misrepresented those scripts' outputs as real tool results. Any terrorist using it for this purpose, then,
00:08:56.320 | is going to get mightily pissed off. But seriously, this is all critical context for these new breakthrough
00:09:03.120 | results that you hear, for example the IMO gold. In my opinion, even if the best of a language model's
00:09:09.520 | answers are better than before, if you can't employ a language model at its lowest point, when it
00:09:15.120 | hallucinates, then you might not employ it at its best. So while I foresee there being significant impact on
00:09:21.840 | entry-level jobs, it's a far cry from eliminating white-collar jobs. That prediction, by the way, is also
00:09:29.200 | echoed by that math professor who said he sees an increasing number of mathematicians improving their
00:09:34.560 | productivity by using language models to search for known parts of a tentative proof. Another massive
00:09:39.520 | positive, of course, is that younger entrants to a field can use these kinds of tools to more rapidly ascend
00:09:45.760 | to expert level. Before we leave human jobs, for just a moment, a word about real jobs you can apply for
00:09:53.040 | today. The sponsors of this video are 80,000 Hours, and while I have mentioned their podcast and YouTube
00:09:59.520 | channel before, just a quick reminder that they have a job board link in the description with hundreds of
00:10:05.760 | jobs filtered for positive impact. I'm just going to refresh the page because what I didn't mention
00:10:11.440 | last time when talking about this is that these jobs are around the world as well, notice for example
00:10:17.440 | Paris. If you are interested in any of this, obviously it would be amazing if you could use the link in the
00:10:23.680 | description. Fifth way not to misread the OpenAI headline. You might have looked at that headline
00:10:29.520 | on Twitter and said no, it's all hype, and AI models have actually hit a plateau. Well, try telling that to
00:10:37.440 | this machine learning researcher, who got almost half a million impressions for being disappointed in how the
00:10:44.640 | latest models like Grok 4 did on the International Math Olympiad. They found that Gemini 2.5 Pro did the
00:10:51.760 | best of the models they tested but Grok 4 performed particularly poorly. I could point to my own benchmark
00:10:58.640 | SimpleBench as some form of proof that Grok 4 wasn't purely benchmark hacking and that there is plenty of
00:11:06.960 | genuine progress in AI. After all, I made this benchmark to expose the gap between human performance and
00:11:14.560 | model performance, and yet that gap is shrinking rapidly. There probably will be a SimpleBench v2
00:11:20.880 | one day soon and yes, we are working on benchmarking models like Kimi, trust me, we are working on it.
00:11:26.000 | Anyway, even that researcher, Ravid Schwartz, did have to admit it, saying well played Noam,
00:11:32.320 | Noam Brown of OpenAI, well played. If even after that concession you still think that all AI progress is
00:11:39.680 | just hype, wait till the end of the video. Obligatory mention, by the way, for me at least, that I did call
00:11:45.760 | that AI would get gold in the IMO this year. I can't find the quote; I think it was from a few months ago.
00:11:51.440 | Maybe one of you can find the quote. Sixth potential misreading. Some of the more trusting among you may
00:11:57.920 | misread the headline as being about a peer-reviewed research paper in which we can learn all about the
00:12:04.160 | methodology. After all, this is crucial research and part of OpenAI's main push towards general
00:12:10.800 | intelligence, or AGI. Nope, quite the opposite. We have gone from peer-reviewed papers from the frontier
00:12:17.600 | labs, say circa 2022, to website posts up to 2024, to now 3am Twitter threads. That leaves us with an
00:12:26.960 | unbelievable amount of unknowns about this IMO achievement. The smartest man in the world by
00:12:32.960 | IQ, Terence Tao, said that there are all sorts of unknowns in how the result was achieved, each one
00:12:39.120 | of which would cast the result in a slightly more favourable or less favourable light. My key question,
00:12:44.880 | along with him, is: did the model submit multiple attempts, for example? That is, by the way, allowed for
00:12:50.800 | the human participants. Neel Nanda again asks about more subtle hacks, but we just don't know. This
00:12:56.720 | forces us, including me, to have to read between the lines of obscure, esoteric tweets, but I would say that
00:13:04.640 | one key technique does seem to be just to let inference run for longer, as in, training models to output yet
00:13:11.760 | longer chains of thought. Again, according to Noam Brown, this model thinks for a long time, for hours, and
00:13:20.000 | he says there's a lot of room to push that test-time compute and efficiency further. How much compute was
00:13:26.320 | used during the competition? We don't know. How much cash would such inference cost an average user? Again
00:13:32.160 | we don't know, but it does seem to hint that we really will be getting those two-thousand-dollar-a-month
00:13:37.840 | pricing tiers for ChatGPT. The most intriguing hint, for me and some of you watching, will be the fact
00:13:44.480 | that these new techniques, he says, make LLMs a lot better at hard-to-verify tasks. If OpenAI do take
00:13:52.160 | the lead in software engineering, for example, by the end of the year, that really would be a big shake-up.
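
To make the "let inference run for longer" idea concrete, here is a minimal sketch of one publicly known test-time-compute recipe: sample several long chains of thought and let a scorer pick the winner. The model call and the scorer below are invented stand-ins for illustration only; OpenAI has not disclosed its actual method.

```python
# Toy sketch of test-time compute scaling ("let inference run for longer"):
# sample several long chains of thought, then keep the best-scoring answer.
# `toy_generate` and `toy_score` are made-up stand-ins, not OpenAI's
# undisclosed system.
import random
from typing import Callable, List, Tuple


def toy_generate(problem: str, thinking_budget: int) -> Tuple[str, str]:
    """Pretend to 'think' for up to `thinking_budget` tokens, then answer."""
    reasoning = f"<{thinking_budget}-token reasoning trace for {problem!r}>"
    answer = str(random.randint(0, 10))  # placeholder final answer
    return reasoning, answer


def toy_score(problem: str, answer: str) -> float:
    """Stand-in for a verifier or reward model ranking candidate answers."""
    return random.random()


def best_of_n(problem: str,
              generate: Callable[[str, int], Tuple[str, str]] = toy_generate,
              score: Callable[[str, str], float] = toy_score,
              n_samples: int = 8,
              thinking_budget: int = 100_000) -> str:
    """More samples and a longer thinking budget = more test-time compute."""
    candidates: List[Tuple[float, str]] = []
    for _ in range(n_samples):
        _reasoning, answer = generate(problem, thinking_budget)
        candidates.append((score(problem, answer), answer))
    return max(candidates)[1]  # highest-scoring answer wins


if __name__ == "__main__":
    print(best_of_n("a hard competition problem"))
```

Recipes like this lean on having something reliable to score against, which is part of why hard-to-verify tasks are the interesting frontier.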
00:13:58.960 | Unlike competitive coding, software engineering is harder to verify but has huge economic impact. But back to
00:14:06.400 | that sixth misreading: while I strongly suspect Google's announcement on the 28th will be more quantitative
00:14:11.760 | and detailed, it will likely still fall far short of complete transparency. Such is the money
00:14:18.480 | at stake in AI these days. Speaking of which, by the way, side note: would you turn down a 300 million dollar
00:14:25.920 | annual salary to work at Meta? Make that 312, by the way, in case you weren't convinced. Seventh misreading:
00:14:32.960 | that we will have to wait till the end of the year to get a glimpse of OpenAI's progress.
00:14:38.480 | No, it seems GPT-5 reasoning alpha is coming pretty soon. It's not the same as the model coming out at the
00:14:46.880 | end of the year that got gold, but nevertheless it will give us a taste of the latest progress at OpenAI.
00:14:54.480 | Eighth misreading: that the AI news these days is nothing but insane progress and exponentials.
00:15:00.400 | Actually no, see this new METR report. I have chatted with the lead author both in person and online and
00:15:06.720 | we'll hopefully do a deep dive soon, but the TLDR is that, against expectations, even the participants'
00:15:13.840 | expectations, language models can slow down developers in certain settings, especially on more complex
00:15:20.400 | code bases, averaging over a million lines of code, in which the developers already have lots of experience.
00:15:25.760 | Recent language models, at least, just get a little bit overwhelmed. We'll see about the new generation of
00:15:30.880 | models, but this does remind us that if competition coding were the same as real-world software
00:15:35.920 | engineering, you just wouldn't see results like this. The developers thought that using language models
00:15:41.200 | within Cursor would speed them up by, say, 25 percent, but it actually slowed them down by around 20 percent.
00:15:47.680 | Again, it's a small study, but a fascinating one that I'll come back to. Ninth and finally, try not to
00:15:53.040 | misread the gold medal headline and think that, you know, generative AI is just all about phony benchmarks,
00:15:59.600 | that it doesn't ever have any real-world impact. Well, aside from a potential negative impact, that of our new age
00:16:06.240 | of intelligent surveillance, which is covered in my most recent documentary on Patreon, do check it out,
00:16:11.680 | AI and language models can also have, and have had, positive impact in hard-numbers, real-world settings.
00:16:19.520 | Just take AlphaEvolve, and I did do a separate video on this, but it made data centers about 0.7 percent
00:16:26.480 | more efficient in the real world. Or, more technically, the AlphaEvolve system continuously recovers on
00:16:32.560 | average 0.7 percent of Google's worldwide compute resources. This sustained efficiency gain means that
00:16:38.560 | at any given moment more tasks can be completed on the same computational footprint. That's an example of
00:16:44.000 | the marrying of language models, essentially next-word predictors, with symbolic, pre-programmed systems.
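
For a rough picture of what that pairing looks like, here is a toy sketch in the spirit of an AlphaEvolve-style loop: a language model proposes candidate programs and a deterministic, pre-programmed evaluator scores them, with the best candidate seeding the next round. The `llm_propose` and `symbolic_evaluate` functions are invented stand-ins for illustration, not DeepMind's actual system.

```python
# Toy sketch of the LLM-plus-symbolic-evaluator pattern (in the spirit of
# AlphaEvolve): an LLM proposes candidate heuristics, a deterministic,
# pre-programmed evaluator scores them, and the best candidate seeds the
# next generation. Both functions are invented stand-ins, not DeepMind's
# actual system.
import random
from typing import Tuple


def llm_propose(parent: str) -> str:
    """Stand-in for a language model mutating/rewriting a candidate heuristic."""
    return f"{parent}+tweak{random.randint(0, 99)}"


def symbolic_evaluate(candidate: str) -> float:
    """Stand-in for a hard-coded evaluator, e.g. a scheduler simulator that
    measures how much compute a candidate heuristic recovers."""
    return random.random()  # pretend efficiency score


def evolve(seed: str, generations: int = 5, population: int = 8) -> Tuple[str, float]:
    """Keep whichever candidate the symbolic evaluator scores highest."""
    best = (seed, symbolic_evaluate(seed))
    for _ in range(generations):
        children = [llm_propose(best[0]) for _ in range(population)]
        scored = [(child, symbolic_evaluate(child)) for child in children]
        best = max(scored + [best], key=lambda pair: pair[1])
    return best


if __name__ == "__main__":
    print(evolve("baseline_scheduler"))
```

The appeal of the split is that the language model only has to be creative, while the hard-coded evaluator decides what actually counts as better.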
00:16:50.160 | That seems to be the sweet spot at the moment for real-world impact, and I suspect on the 28th of July
00:16:55.680 | Google's submission to the International Math Olympiad will use a bit of both. We'll see. Did they get
00:17:00.560 | problem six correct? Did they demonstrate real creativity? Time will tell. Either way, there are, I would argue,
00:17:06.240 | as you can see, plenty of ways of misreading the headline. But what do you think? In a meta way, have I misread the
00:17:12.800 | headline? Quite possible. Even if I have, thank you so much for watching to the end, and have a wonderful day.