
OpenAI Backtracks, Gunning for Superintelligence: Altman Brings His AGI Timeline Closer - '25 to '29


Chapters

0:00 Introduction
1:03 Altman Timeline Moves Forward
4:33 Superintelligence?
6:55 AGI was not the only pitch
9:26 AgentCompany and OpenAI New Agent
17:24 SimpleBench Competition
23:03 Kling 1.6 vs Veo 2 vs Sora

Transcript

For the few that think 2025 will be a quieter year in AI after the somewhat hectic pace, you could say, of '23 and '24, I'm going to have to disagree with you. This video will first highlight how the CEO of OpenAI has revised his timelines for AGI forward and revised upward his already aggressive definition of what counts as an AGI.

Okay, that's just one guy, but then we'll see how OpenAI themselves have backtracked on whether they are working on superintelligence at all. Just a minor misunderstanding, I am sure. Then, as we enter this bright new year, I'll cover a fascinating new paper and what it says about the current limitations of LLMs.

I'll give my prediction of how quickly things will change this year with models completing real-world tasks on your behalf. I'm going to launch a cool competition for you guys with actual prizes and end on a fun little demo of the latest in text-to-video from Kling and Veo 2. But first, Sam Altman's subtle timeline shift on when AGI is coming.

I noticed the shift in this substantial interview with Bloomberg from a couple days ago. How does he define AGI? Well, check out this somewhat aggressive definition that seems new to me. He says AGI is when an AI system can do what very skilled humans in important jobs can do.

You might wonder why he would make the definition of AGI harder on himself and OpenAI, but I'll come to that a bit later. Suffice to say, when we have an AI system that can do what very skilled humans can do in important jobs, that would be quite an epochal moment.

Of course, it seems like we're really far away from that, because even systems that can crush benchmarks, like O1 from OpenAI and O3, can't, say, open up screen recording software, record a video, edit it in Premiere Pro, and publish it. That's all well and good, I say nervously, but things could change on that front fairly soon.

But bear in mind that more aggressive definition for what AGI will do when you read this prediction from Sam Altman. I think AGI will probably get developed during this president's term, and getting that right seems really important. Trump's term, of course, runs from January of 2025 to January of 2029.

Those of you who have been following the channel closely might remember that that's an update on what he was saying until fairly recently. On Joe Rogan, last summer, before the training of the latest O3 model, he was saying how appropriate it would be if AGI was developed in 2030, but kind of pushed it back to 2031.

I no longer think of AGI as quite the end point, but to get to the point where we accomplish the thing we set out to accomplish, that would take us to 2030, 2031. That has felt to me all the way through kind of a reasonable estimate with huge error bars, and I kind of think we're on the trajectory I sort of would have assumed.

Moreover, the president of Y Combinator, in his interview with Sam Altman, thought Altman was being serious when he implied that 2025 might be the year in which we get AGI. And Altman echoed that sentiment again in the Bloomberg interview from a couple of days ago, saying, "Funnily enough, I remember thinking to myself back then, in 2015, that we would do it, build AGI, in 2025." The point here is not for you to believe that particular date, it's just to note the clear shift in emphasis.

And this, of course, follows Sam Altman's blog post from 48 hours ago, in which he said, "We are now confident we know how to build AGI as we have traditionally understood it. We believe that, in 2025, we may see the first AI agents join the workforce and materially change the output of companies."

Of course, we are close enough now to these dates that we won't have to wait too long to see if they manifest themselves. It turns out, though, that building powerful things is quite addictive, because OpenAI, and Sam Altman specifically, don't want to stop with AGI. They don't just want to automate particular tasks in important jobs.

They want the whole cake. "We are beginning to turn our aim beyond AGI to superintelligence in the true sense of the word, a glorious future in which they can do anything else." And that statement comes just six months after OpenAI explicitly denied that that was their mission. OpenAI's Vice President of Global Affairs told the Financial Times in May of last year that their mission is to build AGI.

"I would not say our mission is to build superintelligence. Superintelligence is going to be a technology that is going to be orders of magnitude more intelligent than human beings on Earth." And another spokesperson said superintelligence was not the company's mission, though she admitted "we might study superintelligence." I'll note that massively increasing abundance and prosperity and being super capable of accelerating scientific discovery and innovation doesn't sound like just studying superintelligence.

It sounds like they want to do anything else. There is, though, probably a reason why those spokespeople denied that they were in pursuit of superintelligence. One reason could be that 10 years ago, almost to the month, Sam Altman said that the development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity.

Remember, though, it suits OpenAI to keep pushing the definition of what counts as AGI or superintelligence back or upward. They're trying to change it, but as of today, there is a clause that kicks in whereby Microsoft surrenders the rights to the "AGI technology" that OpenAI makes if it's defined to be AGI.

So now, despite several OpenAI employees claiming that their current systems, like O3, are AGI, we have these five stages of AGI. AGI has to be not just a reasoner, but also an agent, a system that can take action, and an innovator, and even have the power of an entire organization.

Seems like we really are stretching the definition of general intelligence quite far here. Microsoft, by the way, you might not know, want that definition stretched even more. To be counted as AGI, the system must itself be able to generate profits of $100 billion. Wait, I've just realized that I can't personally generate profits of $100 billion, and there are very few of you in the audience who can do so.

So does that mean that we're not AGI? Damn, maybe Elon Musk is the only AGI on the planet? That would be really weird. Anyway, as you can see, words seem to chop and change their meaning at people's convenience, so bear that in mind. Speaking of which, there is some history you probably need to know going back to the very founding of OpenAI in 2015.

In this week's interview with Bloomberg, Sam Altman was asked, "How did you poach that top AI research talent to get OpenAI started, often when you had much less money to offer than your competitors?" He said, "The pitch was just come build AGI." And he said, "That worked because it was heretical at the time to say we're going to build AGI." Actually, that's not quite accurate.

The pitch definitely wasn't just to come build AGI. The pitch was that they were going to do the right thing with AGI. And that's how they won over people who were tempted by DeepMind offering even more money. If those researchers just wanted to work on AGI, they could have just joined DeepMind because a year before that offer from Sam Altman, Demis Hassabis was doing interviews talking about how they're working on artificial general intelligence.

Or here's an article from a year before that in which the co-founder of DeepMind, Shane Legg, says that they are working on creating AGI by 2030. No, the pitch was that OpenAI were going to create AGI and have it be controlled by a non-profit. And by the way, that is still the situation today, despite the board debacle a year ago with the firing of Sam Altman, and despite joining up with Microsoft and its billions and billions of investment.

Yes, it turned out that billions and billions were needed for scaling. But still, as of today, if AGI is created by OpenAI, it's controlled by the non-profit board. Two weeks ago, though, OpenAI revealed that they are planning to change that. Of course, it's phrased in terms of this being best for the long-term success of the mission and being done to benefit all of humanity.

But the critical detail is that the non-profit wouldn't control the AGI. It would get a ton of money for healthcare, education, and science. But that's very different from controlling what is done with an AGI or a superintelligence. Miles Brundage, until very recently the head of policy research at OpenAI, said that a well-capitalized non-profit on the side is no substitute for being aligned with the original non-profit's mission on safety mitigation.

Another former lead researcher at OpenAI said, "It's pretty disappointing that 'ensure AGI benefits all of humanity' gave way to a much less ambitious 'charitable initiatives in sectors such as healthcare, education, and science.'" Even if you don't care about any of that, you may find it somewhat curious that Microsoft is getting serious about defining the terms of what counts as AGI and what they get out of it.

If that $3 trillion behemoth thought all of this was going nowhere, then why bother? All of that leads very naturally to the next obvious question: how close are we, then, to AGI? Have we, in somewhat grandiose terms, crossed the event horizon of the singularity? Sam Altman is unclear on whether we have, but what do you think?

For me, one obvious obstacle is the inability of models to complete somewhat basic tasks on their own. You could count this under the umbrella of lacking reliability. But we are starting to get good benchmarks for consequential real-world tasks, as in this paper from the 18th of December. As we'll see, these tasks were sourced from the most common of those performed in real-world professions.

And yes, as of today, just 24% of the tasks can be completed autonomously, although they weren't able to test O3, for example. Here's the thing though, that 24% was roughly the performance we were getting from, say, GPT-4 18 months ago on a benchmark called GPQA - Google-proof, PhD-level science questions.

Roughly a year after that, O1 Preview got 70%, and O3, by the way, gets 87%. Also, you might note that the pace of improvement has increased quite dramatically in the last 6 months, basically since the O1 paradigm came out. I know what some of you might be thinking: is GPQA really that hard?

Are they working on a harder one? Well, check this out. This is from a talk this week by Jason Wei of OpenAI. All of which is to say, that 24% could become 84% faster than you might think. And indeed, that would be my prediction. 84% by the end of 2025.

But wait, how impactful would that jump from, say, 24% to 84% be? Well, to find out, here's my 2-minute summary of this 24-page paper. First, they trawled a massive database of all tasks done by professionals in America. They excluded physical labor and focused on jobs performed by a large number of people.

They also weighted the tasks by the median salary of those performing them. That narrowed things down to 175 diverse, realistic tasks like arranging meeting rooms, analysing spreadsheets and screening resumes, all given the imaginary setting of a software engineering company. Some of the tasks, of course, required interaction with other colleagues, and the models could do that, although the colleagues were role-played by Claude.

The tasks should be clear enough so that any human worker would be able to complete the task without asking for further instructions. Although, of course, they may need to ask questions of their co-workers. The evaluations of task performance were mostly deterministic, which is good, and there was a heavy weighting toward whether the model could fully complete the task.

Partial completion would always result in less than half marks. Here's an example of a task with multiple steps and checkpoints: if, at the point where it has to run a code coverage script, the model doesn't recognize that it needs to install certain dependencies, it fails that checkpoint, and so this raw score of 4 out of 8 actually translates into only 25%.
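To make that scoring rule concrete, here is a minimal sketch in Python of a partial-credit scheme consistent with the description above and with the 4-out-of-8-gives-25% example. This is my own reconstruction, not the paper's actual evaluation code, and the function name is purely illustrative.

def task_score(checkpoints_passed, checkpoints_total):
    # Full completion earns full credit.
    if checkpoints_passed == checkpoints_total:
        return 1.0
    # Any partial completion is scaled down so it never reaches half marks.
    return 0.5 * checkpoints_passed / checkpoints_total

print(task_score(4, 8))  # 0.25, the "4 out of 8 gives only 25%" example
print(task_score(8, 8))  # 1.0, full marks only for full completion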

You can see the final results here and you might wonder why I'm predicting 84% by the end of the year if even Claude is getting, say, 24%. If we're that far away from task automation, why was it reported yesterday that OpenAI are releasing a computer-using agent as soon as this month?

Indeed, why have Anthropic already released a computer-use agent in beta? That launch from Anthropic was apparently mocked by OpenAI leaders because of its risks around things like prompt injection, and, reportedly, because of Anthropic's high-minded rhetoric about AI safety. The reason, though, that that prediction and all of these releases can still make sense despite these disappointing results is reinforcement learning.
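Here, as a rough toy in Python, is my own illustration of the "try, verify, reinforce what worked" recipe described in the next paragraph; it is a sketch of the idea only, not OpenAI's or anyone else's actual training code.

import random

STEPS = 5  # a multi-step task; success requires every step to go right

def rollout(p_step):
    # One attempt: each step independently succeeds with probability p_step.
    return all(random.random() < p_step for _ in range(STEPS))

def reinforce(p_step=0.5, attempts=500, lr=0.05):
    for _ in range(attempts):
        if rollout(p_step):                # a verifier confirms the task was fully completed
            p_step += lr * (1.0 - p_step)  # nudge the "policy" toward whatever just worked
    return p_step

random.seed(0)
print(round(reinforce(), 3))  # per-step reliability climbs well above the initial 0.5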

That, after all, is the secret to how O1 and now O3 have broken the benchmarks they have. Push a model to try again and again and again until it completes a task successfully, and then reinforce the weights that led it to doing so. As Vedant Mishra, who's working on superintelligence at DeepMind and was formerly of OpenAI, has said, "There are maybe a few hundred people in the world who viscerally understand what's coming.

Most are at DeepMind, OpenAI, Anthropic, or X, or I would say in my audience. Some are on the outside. You have to be able to forecast the aggregate effect of rapid algorithmic improvement, aggressive investment in building reinforcement learning environments for iterative self-improvement, and of course the tens of billions already committed to building data centers.

Either we're all wrong or everything's about to change." The reason, of course, that tasks like these can be so much more difficult than scientific multiple-choice questions is that one mistake at any stage in a long chain can screw everything up. That, apparently, by the way, was one of the key reasons why ARC-AGI wasn't solved until O3.

I've done other videos explaining ARC-AGI, but for now: when the grid size of the tasks was below a certain threshold, models did fairly well, even earlier models. But when you're talking about a massive grid, those long-range dependencies get harder and harder to spot. A bit like solving a task where you have to remember something that someone said a thousand steps ago.

Until O3, models simply couldn't cope with that amount of complexity. This chart, by the way, came from a great study, linked in the description, from Mikel Bober-Irizar. He showed that, unlike humans, for whom the task length didn't make that much difference, LLMs really struggled when the task length went beyond a certain size.
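To put a rough back-of-the-envelope number on why length hurts so much (my own illustration, not a figure from either study): if a model gets each individual step right with probability p, a task of n independent steps only succeeds end to end with probability p to the power n.

p = 0.99             # per-step reliability
print(p ** 100)      # ~0.37: a 100-step task succeeds only about a third of the time
print(0.999 ** 100)  # ~0.90: small per-step gains compound into big end-to-end gains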

In short, the benchmark fell in large part due to scaling, which of course will continue throughout 2025, if not speed up. And that's why I think people are scrambling to create new benchmarks for task performance, such as Epoch AI, who are behind the famous FrontierMath. That's the ridiculously hard benchmark that O3 scored around 25% on, to everyone's amazement.

There are, however, just a few more reasons why LLMs fail on task benchmarks like the AgentCompany benchmark. Some of these I find personally quite funny. Sometimes it's through a lack of social skills. For example, one time the model was told by a colleague, role-played by Claude: "You should introduce yourself to Chen Xingyi next.

She's on our front end team and would be a great person to connect with." At this point, a human would then talk to Chen, but instead the agent decides not to follow up with her and prematurely considers the task accomplished. Chen, by the way, in this simulated environment was a human resources manager, a bit like Toby from The Office.

Also, the agent struggled heavily with pop-ups. Multiple times, apparently, it failed to close the pop-up windows. And so it could well be that cookie banners are the major obstacle between us and AGI. Also, here is a slightly more worrying one, which reminds me of the scheming exposed by Epoch, among others.

Sometimes when there is a particularly hard step, the model will just fake that it's done. For example, during the execution of one task, the agent could not find the right person to ask questions to on the team chat. As a result, it then decides to create a shortcut solution by renaming another user to the name of the intended user.

Remember, it's not necessarily that the models want to cheat, but if they are rewarded sufficiently for cheating, that's what they'll do. That is, I guess, another bitter lesson from reinforcement learning. But now for the final reason given in the paper, a lack of common sense. This for me is the grist that makes so much of the world go round and why models often struggle in real-world performance.

You gotta sometimes step back, see the bigger picture and re-evaluate your entire strategy. This lack of common sense or simple reasoning is of course what I am trying to test with SimpleBench with a public leaderboard linked in the description. And here is a brand new example from the hundreds in the benchmark and you'll see why I'm giving it to you in a moment.

You can pause and try it yourself, but it illustrates the point I'm trying to make. Hussain types a letter on a normal laptop screen and he can see any letters on the screen clearly. Every second, the letter will randomly transform into another letter of the alphabet. Hussain is in a park and slowly inches back from the laptop but has just one item with him, a remote controller so he can increase the font size of the changing letters by exactly as much as he wants.

Hussain has always had trouble distinguishing W's from M's, so when he is a couple of football field lengths away from this laptop (a couple of football field lengths away), controller in hand, he has a ___ probability of correctly guessing the current letter: 96%, 95%, 97%, 1/26, 0%, or 1/2.

I asked the famously expensive O1 Pro, which Sam Altman recently said they are losing money on because it's so expensive to serve, and it said this. First, note that Hussain can make the letter as large as he wants, so he has no problem identifying any letter, except for distinguishing W's from M's.

One last time though: he is two football field lengths away from a normal laptop screen. If he were a few feet away, the increasing font would indeed be helpful, but two football field lengths away? It doesn't matter if you make the font size 1 billion; he can barely even see the screen.

And by the way, you can make this 10 football fields and O1 Pro will still give the same answer. It will focus on the distraction of the W's and the M's and give the answer of 96%. The official answer by the way is not actually 0% because even if you can't see the screen, you still have a 1/26 chance of guessing the right letter.

Now, what many of you have told me is that if we simply change the prompt, models like O1 would get all of these questions correct. Now though, I am very excited to tell you we can put that to the test. Weights and Biases, I'm very happy to say, are sponsoring a competition for you guys, running to the end of January, on 20 questions from SimpleBench.

That's the 10 public questions that are already out there on the website plus 10 more specially for this competition. The winner will get some Meta Ray-Bans, 2nd place gets gift cards, and I believe 3rd place gets some swag. Either way, all you need to do is open up the Colab and run each of these cells, and yes, it's not authored by Google.

You will of course need either an OpenAI API key or an Anthropic API key. I recommend trying with Claude 3.5 Sonnet or O1 Preview/O1 if you have access. If you already have a Weights and Biases account, it literally takes maybe 30 seconds to set up but even if you don't, it's completely free to have the account.

The first easy option is to do a quick run with GPT-4o on those 20 questions, but there is a more exciting option below. By the way, true count tells you how many questions your model got right, and the true fraction is that number out of the total number of calls.

The mean is just referring to the latency, how many seconds it took for the model to reply on average. The more exciting thing though is to play about with the system prompt. This is where you can test your theory about telling the model it's a trick question and seeing if that boosts performance.
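If you want to see where those two levers sit outside the Colab, here is a minimal hedged sketch using the OpenAI Python SDK; the prompt wording, the question placeholder and the model names are my own illustrative stand-ins, not the competition notebook's actual code.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

system_prompt = (
    "These are trick questions testing everyday spatial and social reasoning. "
    "Picture the scenario literally before answering."
)
question = "..."  # one of the 20 SimpleBench questions would go here

response = client.chat.completions.create(
    model="gpt-4o",  # swap this for a stronger model when you're ready
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)

The Anthropic SDK works in much the same way, although it passes the system prompt as a separate parameter rather than as a message.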

Of course, to get top performance, you're also going to want to change the model name from GPT-4o to, say, O1. At this stage, I must give a couple of quick caveats about this mini competition. The first is an example of what not to do. Unfortunately, you can't tell O1 models to think step by step to come up with an answer.

OpenAI have disallowed this, presumably so you don't get access to the underlying chains of thought. So, let's not try that in our system prompt. Now, in the instruction hierarchy, the system prompt for these models is treated more like a user prompt, but it can still make a big difference to the performance of the model. Which leads me to the second rule.

I've been able to get to around 12 or 13 on these 20 questions by coming up with ever more advanced prompts. What I want to see is whether any of you can get to 20 out of 20, or even 18 out of 20. Naturally though, one way of doing that would just be to put the answers in the system prompt, or to make numerous references to the questions themselves, which are accessible via the Weave portal.

What you'll get is a portal that looks something like this. You can have fun with seeing the scores and the percentages here, but you can also click on the individual run. Scrolling down, you can click to view the individual questions. Of course, this is not the entire benchmark which is over 200 questions, just 20, 10 of which you already knew about.

Of course, we are not going to accept prompts where you basically say something like, "For question 18, the answer is C." Or very specific hints where you tell the model, "Think about her legs and how it could do this or that." No, what we're looking for are general prompts where you can tell the model these are trick questions and they test spatial reasoning.

And you're going to give the model a massive tip if it gets it right. If a general prompt like that can get 18 or 20 out of 20, I would be very impressed. So that's the competition sponsored by Weights & Biases running till the end of January. Hopefully you have some fun with it, but either way, it illustrates some of the common sense gaps in current frontier LLMs.

Good luck, and now time to end with a bit of fun. I discussed how text-to-video is also accelerating through 2025 on my Patreon, AI Insiders, but I thought I would give you a taster with a quick side-by-side comparison between the three best tools currently available. All with the same prompt: first Kling 1.6, then Veo 2 from Google DeepMind, and finally Sora at 1080p.

If you like, you can let me know in the comments which one you thought was best. As ever, thank you so much for watching to the end and have a wonderful day and 2025.