OpenAI Backtracks, Gunning for Superintelligence: Altman Brings His AGI Timeline Closer - '25 to '29
Chapters
0:00 Introduction
1:03 Altman Timeline Moves Forward
4:33 Superintelligence?
6:55 AGI was not the only pitch
9:26 AgentCompany and OpenAI New Agent
17:24 SimpleBench Competition
23:03 Kling 1.6 vs Veo 2 vs Sora
00:00:00.000 |
For the few that think 2025 will be a quieter year in AI after the somewhat 00:00:06.400 |
hectic pace, you could say, of '23 and '24, I'm going to have to disagree with you. 00:00:11.360 |
This video will first highlight how the CEO of OpenAI has revised his timelines for AGI 00:00:18.240 |
forward and revised upward his already aggressive definition of what counts as an AGI. 00:00:25.120 |
Okay, that's just one guy, but then we'll see how OpenAI itself have backtracked on 00:00:30.720 |
whether they are working on superintelligence at all. Just a minor misunderstanding I am sure. 00:00:36.640 |
Then, as we enter this bright new year, I'll cover a fascinating new paper 00:00:41.600 |
and what it says about the current limitations of LLMs. I'll give my prediction of how quickly 00:00:47.360 |
things will change this year with models completing real-world tasks on your behalf. 00:00:52.240 |
I'm going to launch a cool competition for you guys with actual prizes 00:00:56.160 |
and end on a fun little demo of the latest in text-to-video from Kling and Veo 2. 00:01:03.040 |
But first, Sam Altman's subtle timeline shift on when AGI is coming. 00:01:08.320 |
I noticed the shift in this substantial interview with Bloomberg from a couple days ago. 00:01:14.000 |
How does he define AGI? Well, check out this somewhat aggressive definition that seems new to 00:01:19.280 |
me. He says AGI is when an AI system can do what very skilled humans in important jobs can do. 00:01:26.480 |
You might wonder why he would make the definition of AGI harder on himself and OpenAI, but I'll 00:01:32.320 |
come to that a bit later. Suffice to say, when we have an AI system that can do what very skilled 00:01:37.920 |
humans can do in important jobs, that would be quite an epochal moment. Of course, it seems like 00:01:44.880 |
we're really far away from that because even systems that can crush benchmarks like O1 from 00:01:50.080 |
OpenAI and O3, they can't even, say, open up screen recording software, record a video, edit 00:01:56.400 |
it in Premiere Pro, and publish it. That's all well and good, he says nervously, but things could 00:02:01.360 |
change on that front fairly soon. But bear in mind that more aggressive definition for what AGI will 00:02:08.000 |
do when you read this prediction from Sam Altman. I think AGI will probably get developed during 00:02:15.360 |
this president's term, and getting that right seems really important. Trump's term, of course, 00:02:20.480 |
runs from January of 2025 to January of 2029. Those of you who have been following the channel 00:02:27.520 |
closely might remember that that's an update on what he was saying until fairly recently. 00:02:32.720 |
On Joe Rogan, last summer, before the training of the latest O3 model, he was saying how 00:02:38.240 |
appropriate it would be if AGI was developed in 2025, but kind of pushed his estimate back to 2030, 2031. 00:02:44.560 |
I no longer think of AGI as quite the end point, but to get to the point where we accomplish the 00:02:50.160 |
thing we set out to accomplish, that would take us to 2030, 2031. That has felt to me all the way 00:02:59.920 |
through kind of a reasonable estimate with huge error bars, and I kind of think we're on the 00:03:06.080 |
trajectory I sort of would have assumed. Moreover, the president of Y Combinator 00:03:11.120 |
thought that Sam Altman was being serious when, in an interview between the two, Altman implied 00:03:17.360 |
that it might be the year of 2025 in which we get AGI. And he echoed that suspicion again in 00:03:24.640 |
the Bloomberg interview from a couple days ago, saying, "Funnily enough, I remember thinking to 00:03:29.520 |
myself back then, in 2015, that we would do it, build AGI, in 2025." The point here is not for 00:03:35.920 |
you to believe that particular date, it's just to note the clear shift in emphasis. And this, 00:03:41.840 |
of course, follows Sam Altman's blog post from 48 hours ago, in which he said, "OpenAI are now 00:03:47.280 |
confident that we know how to build AGI, as we have traditionally understood it. We believe that, 00:03:52.400 |
in 2025, we may see the first AI agents join the workforce and materially change the output of 00:03:59.920 |
companies." Of course, we are close enough now to these dates that we won't have to wait too long 00:04:05.440 |
to see if they manifest themselves. It turns out, though, that building powerful things is quite 00:04:11.200 |
addictive because OpenAI, and Sam Altman specifically, don't want to stop with AGI. 00:04:16.320 |
They don't want to automate particular tasks in important jobs. They want the whole cake. 00:04:21.120 |
"We are beginning to turn our aim beyond AGI to superintelligence in the true sense of the word, 00:04:27.600 |
a glorious future in which they can do anything else." And that statement comes just six months 00:04:34.560 |
after OpenAI explicitly denied that that was their mission. OpenAI's Vice President of Global 00:04:40.160 |
Affairs told the Financial Times in May of last year that their mission is to build AGI. 00:04:46.000 |
"I would not say our mission is to build superintelligence. Superintelligence is going 00:04:50.080 |
to be a technology that is going to be orders of magnitude more intelligent than human beings 00:04:54.320 |
on Earth." And another spokesperson said superintelligence was not the company's mission, 00:04:59.840 |
though she admitted "we might study superintelligence." I'll note that massively 00:05:05.040 |
increasing abundance and prosperity and being super capable of accelerating scientific discovery 00:05:11.200 |
and innovation doesn't sound like just studying superintelligence. It sounds like they want to 00:05:16.320 |
do anything else. There is, though, probably a reason why those spokespeople denied that they 00:05:22.240 |
were in pursuit of superintelligence. One reason could be that 10 years ago, almost to the month, 00:05:27.920 |
Sam Altman said that the development of superhuman machine intelligence is probably 00:05:32.160 |
the greatest threat to the continued existence of humanity. Remember, though, it suits OpenAI 00:05:37.360 |
to keep pushing back or up the definition of what counts as AGI or superintelligence. 00:05:43.680 |
They're trying to change it, but as of today, there is a clause under which Microsoft 00:05:49.440 |
surrenders its rights to technology that OpenAI makes once it is declared to be AGI. So now, 00:05:56.240 |
despite several OpenAI employees claiming that their current systems, like O3, are AGI, 00:06:02.160 |
we have these five stages of AGI. AGI has to be not just a reasoner, but also an agent, 00:06:08.880 |
a system that can take action, and an innovator, and even have the power of an entire organization. 00:06:15.440 |
Seems like we really are stretching the definition of general intelligence quite 00:06:19.680 |
far here. Microsoft, by the way, you might not know, want that definition stretched even more. 00:06:24.160 |
To be counted as AGI, the system must itself be able to generate profits of $100 billion. 00:06:30.880 |
Wait, I've just realized that I can't personally generate profits of $100 billion, 00:06:35.440 |
and there are very few of you in the audience who can do so. So does that mean that we're not AGI? 00:06:40.640 |
Damn, maybe Elon Musk is the only AGI on the planet? That would be really weird. 00:06:44.560 |
Anyway, as you can see, words seem to chop and change their meaning at people's convenience, 00:06:49.280 |
so bear that in mind. Speaking of which, there is some history you probably need to know going 00:06:53.920 |
back to the very founding of OpenAI in 2015. In this week's interview with Bloomberg, 00:06:58.560 |
Sam Altman was asked, "How did you poach that top AI research talent to get OpenAI started, 00:07:03.840 |
often when you had much less money to offer than your competitors?" He said, "The pitch was just 00:07:09.840 |
come build AGI." And he said, "That worked because it was heretical at the time to say we're going to 00:07:16.480 |
build AGI." Actually, that's not quite accurate. The pitch definitely wasn't just to come build AGI. 00:07:23.120 |
The pitch was that they were going to do the right thing with AGI. And that's how they won 00:07:28.720 |
over people who were tempted by DeepMind offering even more money. If those researchers just wanted 00:07:33.840 |
to work on AGI, they could have just joined DeepMind because a year before that offer from 00:07:38.560 |
Sam Altman, Demis Hassabis was doing interviews talking about how they're working on artificial 00:07:43.040 |
general intelligence. Or here's an article from a year before that in which the co-founder of 00:07:48.000 |
DeepMind, Shane Legg, says that they are working on creating AGI by 2030. No, the pitch was that 00:07:54.960 |
OpenAI were going to create AGI and have it be controlled by a non-profit. And by the way, 00:08:01.040 |
that is still the situation today, despite the board debacle a year ago with firing Sam Altman 00:08:07.600 |
and the tie-up with Microsoft and the billions and billions invested. Yes, it turned out that 00:08:11.760 |
billions and billions was needed for scaling. But still, as of today, if AGI is created by OpenAI, 00:08:17.280 |
it's controlled by the non-profit board. Two weeks ago, though, OpenAI revealed that they 00:08:22.320 |
are planning to change that. Of course, it's phrased in terms of this is best for the long-term 00:08:27.200 |
success of the mission and we're doing it to benefit all of humanity. But the critical detail 00:08:32.160 |
is that the non-profit wouldn't control the AGI. It would get a ton of money for healthcare, 00:08:37.600 |
education, and science. But that's very different from controlling what is done with an AGI or a 00:08:43.760 |
superintelligence. Miles Brundage, until very recently the head of policy research at OpenAI, 00:08:50.240 |
said that a well-capitalized non-profit on the side is no substitute for being aligned with the 00:08:55.920 |
original non-profit's mission on safety mitigation. Another former lead researcher at OpenAI said, 00:09:02.000 |
"It's pretty disappointing that 'ensure AGI benefits all of humanity' gave way to a much 00:09:07.040 |
less ambitious 'charitable initiatives in sectors such as healthcare, education, and science.'" 00:09:11.760 |
Even if you don't care about any of that, you may find it somewhat curious that Microsoft is 00:09:16.800 |
getting serious about defining the terms of what counts as AGI and what they get out of it. If that 00:09:22.640 |
$3 trillion behemoth thought all of this was going nowhere, then why bother? All of that leads very 00:09:28.720 |
naturally to the next obvious question. Well, how close are we then to AGI? Have we, in somewhat 00:09:35.120 |
grandiose terms, crossed the event horizon of the singularity? Sam Altman is unclear on whether we have, 00:09:42.400 |
but what do you think? For me, one obvious obstacle is the inability of models to complete 00:09:47.840 |
somewhat basic tasks on their own. You could count this under the umbrella of lacking reliability. 00:09:53.120 |
But we are starting to get good benchmarks for consequential real-world tasks, as in this paper 00:09:59.120 |
from the 18th of December. As we'll see, these tasks were sourced from the most common of those 00:10:04.960 |
performed in real-world professions. And yes, as of today, just 24% of the tasks can be completed 00:10:12.240 |
autonomously, although they weren't able to test O3, for example. Here's the thing though, 00:10:16.960 |
that 24% was roughly the performance we were getting from, say, GPT-4 18 months ago on a 00:10:23.520 |
benchmark called GPQA - Google-proof, PhD-level science questions. Roughly a year after that, 00:10:30.880 |
O1 preview got 70% and O3, by the way, gets 87%. Also, you might note that the pace of improvement 00:10:39.360 |
has increased quite dramatically in the last 6 months, basically since the O1 paradigm came out. 00:10:44.560 |
I know what some of you might be thinking, is the GPQA that hard? Are they working on a harder one? 00:10:49.520 |
Well, check this out. This is from a talk this week by Jason Wei of OpenAI. 00:11:14.240 |
All of which is to say, that 24% could become 84% faster than you might think. 00:11:20.640 |
And indeed, that would be my prediction. 84% by the end of 2025. But wait, how impactful 00:11:26.880 |
would that jump from, say, 24% to 84% be? Well, to find out, here's my 2-minute summary of this 00:11:33.920 |
24-page paper. First, they trawled a massive database of all tasks done by professionals 00:11:39.760 |
in America. They excluded physical labor and focused on jobs performed by a large number 00:11:45.440 |
of people. They also weighted the tasks by the median salary of those performing 00:11:50.800 |
the tasks. That narrowed things down to 175 diverse, realistic tasks like arranging meeting 00:11:56.160 |
rooms, analysing spreadsheets and screening resumes, which they gave the imaginary setting 00:12:00.960 |
of a software engineering company. Some of the tasks, of course, required interaction 00:12:05.600 |
with other colleagues and the models could do that, although the colleagues were role-played 00:12:10.160 |
by Claude. The tasks should be clear enough so that any human worker would be able to complete 00:12:14.960 |
the task without asking for further instructions. Although, of course, they may need to ask 00:12:19.200 |
questions of their co-workers. The evaluations of task performance were mostly deterministic, 00:12:24.400 |
which is good, and there was a heavy weighting toward whether the model could fully complete 00:12:29.360 |
the task. Partial completion would always result in less than half marks. Here's an example of 00:12:34.880 |
a task with multiple steps and checkpoints: if, when running a code coverage script, 00:12:41.680 |
the agent didn't recognize it needed to install certain dependencies, it would fail that checkpoint, and 00:12:46.000 |
so even with 4 of the 8 checkpoints passed, it would score only 25%. 00:12:51.680 |
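To make that weighting concrete, here is a minimal sketch of how such a checkpoint score could be computed; the exact halving of partial credit is my reading of the paper's description, and the function itself is purely illustrative, not the benchmark's code.

```python
def task_score(checkpoints_passed: int, total_checkpoints: int) -> float:
    """Illustrative partial-credit scheme: full completion gets full marks,
    anything partial is scaled down by half, so it can never reach 50%."""
    if checkpoints_passed == total_checkpoints:
        return 1.0  # task fully completed
    # partial completion: proportion of checkpoints hit, then halved
    return 0.5 * checkpoints_passed / total_checkpoints

# e.g. hitting 4 of 8 checkpoints -> 0.5 * 4/8 = 0.25, i.e. only 25%
print(task_score(4, 8))
```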
You can see the final results here, and you might wonder why I'm predicting 84% by the end of the year if even 00:12:57.600 |
Claude is getting, say, 24%. If we're that far away from task automation, why was it reported 00:13:03.680 |
yesterday that OpenAI are releasing a computer-using agent as soon as this month? Indeed, 00:13:09.920 |
why have Anthropic already released a computer-use agent in beta? That launch from Anthropic was 00:13:15.840 |
apparently mocked by OpenAI leaders because of its risks for things like prompt injection 00:13:20.960 |
and Anthropic's high-minded rhetoric, so it was reported, about AI safety. The reason, though, 00:13:25.440 |
that that prediction and all of these releases can still make sense despite these disappointing 00:13:30.880 |
results is because of reinforcement learning. That, after all, is the secret to why O1 and now 00:13:36.880 |
O3 have broken the benchmarks they have. Push a model to try again and again and again 00:13:43.040 |
until it completes a task successfully and then reinforce those weights that led it to doing so. 00:13:47.840 |
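If you want a concrete, if deliberately over-simplified, picture of that loop, here is a toy sketch; the strategies, success rates and update rule are all made up for illustration and have nothing to do with OpenAI's actual training setup.

```python
import math, random

# Toy sketch of outcome-based reinforcement: the "policy" is a softmax over
# three named strategies, and we nudge up the weight of whichever strategy
# actually completed the (simulated) task.
strategies = ["guess", "plan_then_act", "act_then_check"]
logits = {s: 0.0 for s in strategies}

def sample_strategy() -> str:
    weights = [math.exp(logits[s]) for s in strategies]
    return random.choices(strategies, weights=weights)[0]

def attempt_task(strategy: str) -> bool:
    # Stand-in for "did the agent finish the whole task?"; the more deliberate
    # strategies succeed more often in this made-up environment.
    success_rate = {"guess": 0.05, "plan_then_act": 0.4, "act_then_check": 0.6}
    return random.random() < success_rate[strategy]

learning_rate = 0.5
for _ in range(500):
    s = sample_strategy()
    if attempt_task(s):             # reward only on full success
        logits[s] += learning_rate  # reinforce what led to success

print(sorted(logits.items(), key=lambda kv: -kv[1]))  # deliberate strategies dominate
```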
As Vedant Mishra, who's working on superintelligence at DeepMind and was formerly of 00:13:52.560 |
OpenAI, has said, "There are maybe a few hundred people in the world who viscerally understand 00:13:57.840 |
what's coming. Most are at DeepMind, OpenAI, Anthropic, or X, or I would say in my audience. 00:14:03.280 |
Some are on the outside. You have to be able to forecast the aggregate effect of rapid algorithmic 00:14:08.320 |
improvement, aggressive investment in building reinforcement learning environments for iterative 00:14:13.280 |
self-improvement, and of course the tens of billions already committed to building data 00:14:17.200 |
centers. Either we're all wrong or everything's about to change." The reason, of course, that tasks 00:14:23.040 |
can be so much more difficult than scientific multiple choice questions is because one mistake 00:14:28.800 |
at any stage in a long chain can screw everything up. That apparently, by the way, was one of the 00:14:34.160 |
key reasons why ARC-AGI wasn't solved until O3. I've done other videos explaining ARC-AGI, but for 00:14:40.960 |
now, when the grid count of the tasks was below a certain threshold, models did fairly well, even 00:14:46.880 |
earlier models. But when you're talking about a massive grid, those long-range dependencies get 00:14:52.240 |
harder and harder to spot. A bit like solving a task where you have to remember something that 00:14:56.080 |
someone said a thousand steps ago. Until O3, models simply couldn't cope with that amount 00:15:01.760 |
of complexity. This chart, by the way, came from a great study linked in the description from Mikel 00:15:07.360 |
Bober-Irizar. He showed that unlike humans where the task length didn't make that much difference, 00:15:13.120 |
LLMs really struggled when the task length went beyond a certain size. In short, the benchmark 00:15:18.880 |
fell in large part due to scaling, which of course will continue throughout 2025, if not speed up. 00:15:25.760 |
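To put a rough number on the earlier point that one mistake anywhere in a long chain can screw everything up, here is a toy calculation (assuming, unrealistically, that steps are independent and there is no recovery from errors) showing how quickly reliability decays as tasks get longer.

```python
# Toy illustration: even a per-step success rate that looks high collapses
# once many steps have to chain together without a single slip.
per_step_success = 0.98

for n_steps in [1, 10, 50, 200]:
    whole_task = per_step_success ** n_steps  # every single step must go right
    print(f"{n_steps:>3} steps: {whole_task:.1%} chance of a flawless run")
# roughly: 1 step 98%, 10 steps 82%, 50 steps 36%, 200 steps 2%
```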
And that's why I think people are scrambling to create new benchmarks for task performance, 00:15:31.280 |
such as Epoch AI, who are behind the famous FrontierMath. That's the ridiculously hard 00:15:36.640 |
benchmark that O3 scored around 25% on to everyone's amazement. There are, however, 00:15:41.840 |
just a few more reasons why LLMs fail on task benchmarks like AgentCompany. Some of these, 00:15:48.240 |
I find personally quite funny. Sometimes it's through a lack of social skills. For example, 00:15:52.960 |
one time the model was told by a colleague, role played by Claude, you should introduce yourself 00:15:58.000 |
to Chen Xingyi next. She's on our front end team and would be a great person to connect with. At 00:16:03.920 |
this point, a human would then talk to Chen, but instead the agent then decides not to follow up 00:16:08.720 |
with her and prematurely considers the task accomplished. Chen, by the way, in this simulated 00:16:13.520 |
environment was a human resources manager, a bit like Toby from The Office. Also, the agent 00:16:19.040 |
struggled heavily with pop-ups. Multiple times, apparently, they struggled to close the pop-up 00:16:24.480 |
windows. And so it could well be that cookie banners are the major obstacle between us and AGI. 00:16:30.800 |
Also, here is a slightly more worrying one, which reminds me of the scheming exposed by Apollo Research, 00:16:37.040 |
among others. Sometimes when there is a particularly hard step, the model will just 00:16:42.560 |
fake that it's done. For example, during the execution of one task, the agent could not find 00:16:47.280 |
the right person to ask questions to on the team chat. As a result, it then decides to create a 00:16:52.000 |
shortcut solution by renaming another user to the name of the intended user. Remember, it's not 00:16:57.920 |
necessarily that the models want to cheat, but if they are rewarded sufficiently for cheating, 00:17:03.760 |
that's what they'll do. That is, I guess, another bitter lesson from reinforcement learning. 00:17:08.480 |
But now for the final reason given in the paper, a lack of common sense. This for me is the grist 00:17:14.800 |
that makes so much of the world go round and why models often struggle in real-world performance. 00:17:20.000 |
You gotta sometimes step back, see the bigger picture and re-evaluate your entire strategy. 00:17:24.560 |
This lack of common sense or simple reasoning is of course what I am trying to test with SimpleBench 00:17:30.080 |
with a public leaderboard linked in the description. And here is a brand new example 00:17:35.040 |
from the hundreds in the benchmark and you'll see why I'm giving it to you in a moment. You 00:17:39.120 |
can pause and try it yourself, but it illustrates the point I'm trying to make. Hussain types a 00:17:44.000 |
letter on a normal laptop screen and he can see any letters on the screen clearly. Every second, 00:17:50.320 |
the letter will randomly transform into another letter of the alphabet. Hussain is in a park and 00:17:56.320 |
slowly inches back from the laptop but has just one item with him, a remote controller so he can 00:18:02.560 |
increase the font size of the changing letters by exactly as much as he wants. Hussain has always 00:18:09.440 |
had trouble distinguishing W's from M's so when he is a couple of football field lengths away 00:18:17.600 |
from this laptop, a couple of football field lengths away, controller in hand, he has a blank 00:18:24.480 |
probability of correctly guessing the current letter. 96%, 95%, 97%, 1/26, 0% or 1/2. I asked 00:18:33.600 |
the famously expensive O1 Pro, which Sam Altman recently said they are losing money on because it's so 00:18:39.360 |
expensive to serve and it said this. First note that Hussain can make the letter as large as he 00:18:45.680 |
wants so he has no problem identifying any letter except the W's from M's. One last time though, 00:18:53.280 |
he is two football field lengths away from a normal laptop screen. If he was a few feet away, 00:19:00.480 |
the increasing font would indeed be helpful but two football field lengths away doesn't matter 00:19:05.520 |
if you make the font size 1 billion, he can barely even see the screen. And by the way, 00:19:10.080 |
you can make this 10 football fields and O1 Pro will still give the same answer. It will focus 00:19:16.080 |
on the distraction of the W's and the M's and give the answer of 96%. The official answer by the way 00:19:22.480 |
is not actually 0% because even if you can't see the screen, you still have a 1/26 chance of 00:19:28.320 |
guessing the right letter. 00:19:33.840 |
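For completeness, the blind-guess arithmetic is trivial to check; the comparison percentages are just the distractor options from the question above.

```python
from fractions import Fraction

# Blind guessing among 26 equally likely letters: at that distance the font
# size (and the W-vs-M confusion) is irrelevant.
p_correct = Fraction(1, 26)
print(p_correct, float(p_correct))  # 1/26 ≈ 0.038, nowhere near 96%, 95% or 97%
```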
Now what many of you have told me is that if we simply change the prompt, models like O1 would get all of these questions correct. Now though, I am very excited 00:19:39.440 |
to tell you we can put that to the test. Weights and Biases, I'm very happy to say, are sponsoring 00:19:44.960 |
a competition for you guys running to the end of January on 20 questions from SimpleBench. 00:19:51.520 |
That's the 10 public questions that are already out there on the website plus 10 more specially 00:19:56.160 |
for this competition. The winner will get some Meta Ray-Bans, 2nd place gets gift cards and I 00:20:01.760 |
believe 3rd place gets some swag. Either way, all you need to do is open up the Colab and run each 00:20:06.960 |
of these cells and yes, it's not authored by Google. You will of course need either an OpenAI 00:20:12.480 |
API key or an Anthropic API key. I recommend trying with Claude 3.5 Sonnet or O1 Preview/O1 00:20:20.080 |
if you have access. If you already have a Weights and Biases account, it literally takes maybe 30 00:20:25.680 |
seconds to set up but even if you don't, it's completely free to have the account. The first 00:20:31.760 |
easy option is to do a quick run with GPT-4o on those 20 questions but there is a more exciting 00:20:37.840 |
option below. By the way, true count tells you how many questions your model got right 00:20:42.800 |
and the true fraction is that number out of the total number of calls. The mean is just referring 00:20:49.360 |
to the latency, how many seconds it took for the model to reply on average. The more exciting thing 00:20:54.640 |
though is to play about with the system prompt. This is where you can test your theory about 00:21:01.040 |
telling the model it's a trick question and seeing if that boosts performance. Of course, 00:21:05.120 |
to get top performance, you're also going to want to change the model name from GPT-4o to, say, O1. 00:21:10.400 |
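To give a feel for what that notebook is doing before you open it, here is a minimal sketch assuming you have an OpenAI API key and the Weave library installed; the project name, the placeholder question and the crude scoring check are mine for illustration, not the competition's actual code.

```python
# Minimal sketch of the kind of loop the notebook runs (hypothetical project
# name, placeholder question, and a deliberately general system prompt).
import weave
from openai import OpenAI

weave.init("simplebench-mini-competition")  # hypothetical project name
client = OpenAI()                           # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "These are trick questions that test spatial and common-sense reasoning. "
    "Picture the scene, step back, then answer with a single option letter."
)

questions = [
    # (question_text, expected_letter) placeholder, not a real benchmark item
    ("Hussain is two football fields from a laptop... what is the probability?", "D"),
]

@weave.op()
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # swap in o1 here, or use the Anthropic client for Claude 3.5 Sonnet
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# crude scoring for the sketch: count replies that end with the expected letter
correct = sum(answer(q).strip().rstrip(".").endswith(letter) for q, letter in questions)
print(f"true count: {correct} / {len(questions)}")
```

The whole point of the competition is the SYSTEM_PROMPT string: swap in your own general-purpose prompt and see how far it moves the score.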
At this stage, I must give a couple of quick caveats about this mini competition. The first 00:21:15.040 |
is an example of what not to do. Unfortunately, you can't tell O1 models to think step by step 00:21:21.280 |
to come up with an answer. OpenAI have disallowed this, presumably so you don't get access to the 00:21:26.240 |
underlying chains of thought. So, let's not try that in our system prompt. For the O1 models, in the instruction 00:21:31.760 |
hierarchy that prompt is treated more like a user prompt, but it can still make a big difference to the performance of 00:21:37.040 |
the model. Which leads me to the second rule. I've been able to get to around 12 or 13 on these 20 00:21:43.040 |
questions by coming up with ever more advanced prompts. What I want to see is whether any of you 00:21:48.720 |
can get to 20 out of 20 or even 18 out of 20. Naturally though, one way of doing that would 00:21:54.560 |
just be to put the answers in the system prompt or make numerous references to the questions 00:21:59.840 |
themselves, which are accessible via the Weave portal. What you'll get is a portal that looks something 00:22:05.440 |
like this. You can have fun with seeing the scores and the percentages here, but you can also click 00:22:11.040 |
on the individual run. Scrolling down, you can click to view the individual questions. Of course, 00:22:16.880 |
this is not the entire benchmark which is over 200 questions, just 20, 10 of which you already 00:22:22.480 |
knew about. Of course, we are not going to accept prompts where you basically say something like, 00:22:27.280 |
"For question 18, the answer is C." Or very specific hints where you tell the model, 00:22:32.240 |
"Think about her legs and how it could do this or that." No, what we're looking for are general 00:22:36.960 |
prompts where you can tell the model these are trick questions and they test spatial reasoning. 00:22:41.840 |
And you're going to give the model a massive tip if it gets it right. If a general prompt like that 00:22:46.800 |
can get 18 or 20 out of 20, I would be very impressed. So that's the competition sponsored 00:22:52.640 |
by Weights & Biases running till the end of January. Hopefully you have some fun with it, 00:22:57.680 |
but either way, it illustrates some of the common sense gaps in current frontier LLMs. 00:23:03.600 |
Good luck and now time to end with a bit of fun. I discussed how Text-to-Video is 00:23:08.240 |
also accelerating through 2025 on my Patreon AI Insiders, but I thought I would give you 00:23:14.400 |
a taster with a quick side-by-side comparison between the best three tools currently available. 00:23:20.320 |
All with the same prompt, first Kling 1.6, then Veo 2 from Google DeepMind and finally Sora at 1080p. 00:23:29.200 |
If you like, you can let me know in the comments which one you thought was best. 00:23:32.960 |
As ever, thank you so much for watching to the end and have a wonderful day and 2025.