back to index

OpenAI Backtracks, Gunning for Superintelligence: Altman Brings His AGI Timeline Closer - '25 to '29


Chapters

0:0 Introduction
1:3 Altman Timeline Moves Forward
4:33 Superintelligence?
6:55 AGI was not the only pitch
9:26 AgentCompany and OpenAI New Agent
17:24 SimpleBench Competition
23:3 Kling 1.6 vs Veo 2 vs Sora

Whisper Transcript | Transcript Only Page

00:00:00.000 | For the few that think 2025 will be a quieter year in AI after the somewhat
00:00:06.400 | hectic pace you could say of 23 and 24, I'm going to have to disagree with you.
00:00:11.360 | This video will first highlight how the CEO of OpenAI has revised his timelines for AGI
00:00:18.240 | forward and revised upward his already aggressive definition of what counts as an AGI.
00:00:25.120 | Okay, that's just one guy, but then we'll see how OpenAI itself have backtracked on
00:00:30.720 | whether they are working on superintelligence at all. Just a minor misunderstanding I am sure.
00:00:36.640 | Then, as we enter this bright new year, I'll cover a fascinating new paper
00:00:41.600 | and what it says about the current limitations of LLMs. I'll give my prediction of how quickly
00:00:47.360 | things will change this year with models completing real-world tasks on your behalf.
00:00:52.240 | I'm going to launch a cool competition for you guys with actual prizes
00:00:56.160 | and end on a fun little demo of the latest in-textor video from Kling and VO2.
00:01:03.040 | But first, Sam Altman's subtle timeline shift on when AGI is coming.
00:01:08.320 | I noticed the shift in this substantial interview with Bloomberg from a couple days ago.
00:01:14.000 | How does he define AGI? Well, check out this somewhat aggressive definition that seems new to
00:01:19.280 | me. He says AGI is when an AI system can do what very skilled humans in important jobs can do.
00:01:26.480 | You might wonder why he would make the definition of AGI harder on himself and OpenAI, but I'll
00:01:32.320 | come to that a bit later. Suffice to say, when we have an AI system that can do what very skilled
00:01:37.920 | humans can do in important jobs, that would be quite an epochal moment. Of course, it seems like
00:01:44.880 | we're really far away from that because even systems that can crush benchmarks like O1 from
00:01:50.080 | OpenAI and O3, they can't even, say, open up screen recording software, record a video, edit
00:01:56.400 | it in Premiere Pro, and publish it. That's all well and good, he says nervously, but things could
00:02:01.360 | change on that front fairly soon. But bear in mind that more aggressive definition for what AGI will
00:02:08.000 | do when you read this prediction from Sam Altman. I think AGI will probably get developed during
00:02:15.360 | this president's term, and getting that right seems really important. Trump's term, of course,
00:02:20.480 | runs from January of 2025 to January of 2029. Those of you who have been following the channel
00:02:27.520 | closely might remember that that's an update on what he was saying until fairly recently.
00:02:32.720 | On Joe Rogan, last summer, before the training of the latest O3 model, he was saying how
00:02:38.240 | appropriate it would be if AGI was developed in 2030, but kind of pushed it back to 2031.
00:02:44.560 | I no longer think of AGI as quite the end point, but to get to the point where we accomplish the
00:02:50.160 | thing we set out to accomplish, that would take us to 2030, 2031. That has felt to me all the way
00:02:59.920 | through kind of a reasonable estimate with huge error bars, and I kind of think we're on the
00:03:06.080 | trajectory I sort of would have assumed. Moreover, the president of Y Combinator
00:03:11.120 | thought that Sam Altman was being serious in his interview with Sam Altman, when Sam Altman implied
00:03:17.360 | that it might be the year of 2025 in which we get AGI. And he echoed that suspicion again in
00:03:24.640 | the Bloomberg interview from a couple days ago, saying, "Funnily enough, I remember thinking to
00:03:29.520 | myself back then, in 2015, that we would do it, build AGI, in 2025." The point here is not for
00:03:35.920 | you to believe that particular date, it's just to note the clear shift in emphasis. And this,
00:03:41.840 | of course, follows Sam Altman's blog post from 48 hours ago, in which he said, "OpenAI are now
00:03:47.280 | confident that we know how to build AGI, as we have traditionally understood it. We believe that,
00:03:52.400 | in 2025, we may see the first AI agents join the workforce and materially change the output of
00:03:59.920 | companies. Of course, we are close enough now to these dates that we won't have to wait too long
00:04:05.440 | to see if they manifest themselves. It turns out, though, that building powerful things is quite
00:04:11.200 | addictive because OpenAI, and Sam Altman specifically, don't want to stop with AGI.
00:04:16.320 | They don't want to automate particular tasks in important jobs. They want the whole cake.
00:04:21.120 | We are beginning to turn our aim beyond AGI to superintelligence in the true sense of the word,
00:04:27.600 | a glorious future in which they can do anything else." And that statement comes just six months
00:04:34.560 | after OpenAI explicitly denied that that was their mission. OpenAI's Vice President of Global
00:04:40.160 | Affairs told the Financial Times in May of last year that their mission is to build AGI.
00:04:46.000 | "I would not say our mission is to build superintelligence. Superintelligence is going
00:04:50.080 | to be a technology that is going to be orders of magnitude more intelligent than human beings
00:04:54.320 | on Earth." And another spokesperson said superintelligence was not the company's mission,
00:04:59.840 | though she admitted "we might study superintelligence." I'll note that massively
00:05:05.040 | increasing abundance and prosperity and being super capable of accelerating scientific discovery
00:05:11.200 | and innovation doesn't sound like just studying superintelligence. It sounds like they want to
00:05:16.320 | do anything else. There is, though, probably a reason why those spokespeople denied that they
00:05:22.240 | were in pursuit of superintelligence. One reason could be that 10 years ago, almost to the month,
00:05:27.920 | Sam Altman said that the development of superhuman machine intelligence is probably
00:05:32.160 | the greatest threat to the continued existence of humanity. Remember, though, it suits OpenAI
00:05:37.360 | to keep pushing back or up the definition of what counts as AGI or superintelligence.
00:05:43.680 | They're trying to change it, but as of today, there is a clause that kicks in where Microsoft
00:05:49.440 | surrendered the rights to "AGI technology" that OpenAI makes if it's defined to be AGI. So now,
00:05:56.240 | despite several OpenAI employees claiming that their current systems, like O3, are AGI,
00:06:02.160 | we have these five stages of AGI. AGI has to be not just a reasoner, but also an agent,
00:06:08.880 | a system that can take action, and an innovator, and even have the power of an entire organization.
00:06:15.440 | Seems like we really are stretching the definition of general intelligence quite
00:06:19.680 | far here. Microsoft, by the way, you might not know, want that definition stretched even more.
00:06:24.160 | To be counted as AGI, the system must itself be able to generate profits of $100 billion.
00:06:30.880 | Wait, I've just realized that I can't personally generate profits of $100 billion,
00:06:35.440 | and there are very few of you in the audience who can do so. So does that mean that we're not AGI?
00:06:40.640 | Damn, maybe Elon Musk is the only AGI on the planet? That would be really weird.
00:06:44.560 | Anyway, as you can see, words seem to chop and change their meaning at people's convenience,
00:06:49.280 | so bear that in mind. Speaking of which, there is some history you probably need to know going
00:06:53.920 | back to the very founding of OpenAI in 2015. In this week's interview with Bloomberg,
00:06:58.560 | someone was asked, "How did you poach that top AI research talent to get OpenAI started,
00:07:03.840 | often when you had much less money to offer than your competitors?" He said, "The pitch was just
00:07:09.840 | come build AGI." And he said, "That worked because it was heretical at the time to say we're going to
00:07:16.480 | build AGI." Actually, that's not quite accurate. The pitch definitely wasn't just to come build AGI.
00:07:23.120 | The pitch was that they were going to do the right thing with AGI. And that's how they won
00:07:28.720 | over people who were tempted by DeepMind offering even more money. If those researchers just wanted
00:07:33.840 | to work on AGI, they could have just joined DeepMind because a year before that offer from
00:07:38.560 | Sam Altman, Demis Hassabis was doing interviews talking about how they're working on artificial
00:07:43.040 | general intelligence. Or here's an article from a year before that in which the co-founder of
00:07:48.000 | DeepMind, Shane Legg, says that they are working on creating AGI by 2030. No, the pitch was that
00:07:54.960 | OpenAI were going to create AGI and have it be controlled by a non-profit. And by the way,
00:08:01.040 | that is still the situation today, despite the board debacle a year ago with firing Sam Altman
00:08:07.600 | and joining up with Microsoft and investing billions and billions. Yes, it turned out that
00:08:11.760 | billions and billions was needed for scaling. But still, as of today, if AGI is created by OpenAI,
00:08:17.280 | it's controlled by the non-profit board. Two weeks ago, though, OpenAI revealed that they
00:08:22.320 | are planning to change that. Of course, it's phrased in terms of this is best for the long-term
00:08:27.200 | success of the mission and we're doing it to benefit all of humanity. But the critical detail
00:08:32.160 | is that the non-profit wouldn't control the AGI. It would get a ton of money for healthcare,
00:08:37.600 | education, and science. But that's very different from controlling what is done with an AGI or a
00:08:43.760 | superintelligence. The until very recently former head of policy research at OpenAI, Miles Brundage,
00:08:50.240 | said that a well-capitalized non-profit on the side is no substitute for being aligned with the
00:08:55.920 | original non-profit's mission on safety mitigation. Another former lead researcher at OpenAI said,
00:09:02.000 | "It's pretty disappointing that 'ensure AGI benefits all of humanity' gave way to a much
00:09:07.040 | less ambitious charitable initiatives in sectors such as healthcare, education, and science."
00:09:11.760 | Even if you don't care about any of that, you may find it somewhat curious that Microsoft is
00:09:16.800 | getting serious about defining the terms of what counts as AGI and what they get out of it. If that
00:09:22.640 | $3 trillion behemoth thought all of this was going nowhere, then why bother? All of that leads very
00:09:28.720 | naturally to the next obvious question. Well, how close are we then to AGI? Have we, in somewhat
00:09:35.120 | grandiose terms, crossed the event horizon of the singularity? Sam Ullman is unclear whether we have,
00:09:42.400 | but what do you think? For me, one obvious obstacle is the inability of models to complete
00:09:47.840 | somewhat basic tasks on their own. You could count this under the umbrella of lacking reliability.
00:09:53.120 | But we are starting to get good benchmarks for consequential real-world tasks, as in this paper
00:09:59.120 | from the 18th of December. As we'll see, these tasks were sourced from the most common of those
00:10:04.960 | performed in real-world professions. And yes, as of today, just 24% of the tasks can be completed
00:10:12.240 | autonomously, although they weren't able to test O3, for example. Here's the thing though,
00:10:16.960 | that 24% was roughly the performance we were getting from, say, GPT-4 18 months ago on a
00:10:23.520 | benchmark called GPQA - Google-proof, PhD-level science questions. Roughly a year after that,
00:10:30.880 | O1 preview got 70% and O3, by the way, gets 87%. Also, you might note that the pace of improvement
00:10:39.360 | has increased quite dramatically in the last 6 months, basically since the O1 paradigm came out.
00:10:44.560 | I know what some of you might be thinking, is the GPQA that hard? Are they working on a harder one?
00:10:49.520 | Well, check this out. This is from a talk this week by Jason Wei of OpenAI.
00:11:14.240 | All of which is to say, that 24% could become 84% faster than you might think.
00:11:20.640 | And indeed, that would be my prediction. 84% by the end of 2025. But wait, how impactful
00:11:26.880 | would that jump from, say, 24% to 84% be? Well, to find out, here's my 2-minute summary of this
00:11:33.920 | 24-page paper. First, they trawled a massive database of all tasks done by professionals
00:11:39.760 | in America. They excluded physical labor and focused on those jobs in which a large number
00:11:45.440 | of people performed the job. They also weighted the tasks by the median salary of those performing
00:11:50.800 | the tasks. That narrowed things down to 175 diverse, realistic tasks like arranging meeting
00:11:56.160 | rooms, analysing spreadsheets and screening resumes, which they gave the imaginary setting
00:12:00.960 | of a software engineering company. Some of the tasks, of course, required interaction
00:12:05.600 | with other colleagues and the models could do that, although the colleagues were role-played
00:12:10.160 | by Claude. The tasks should be clear enough so that any human worker would be able to complete
00:12:14.960 | the task without asking for further instructions. Although, of course, they may need to ask
00:12:19.200 | questions of their co-workers. The evaluations of task performance were mostly deterministic,
00:12:24.400 | which is good, and there was a heavy weighting toward whether the model could fully complete
00:12:29.360 | the task. Partial completion would always result in less than half marks. Here's an example of
00:12:34.880 | a task with multiple steps and checkpoints, and if at one point to run a code coverage script,
00:12:41.680 | it didn't recognize it needed to install certain dependencies, it would fail that checkpoint and
00:12:46.000 | therefore for this score of 4 out of 8, it would actually get 25% only. You can see the final
00:12:51.680 | results here and you might wonder why I'm predicting 84% by the end of the year if even
00:12:57.600 | Claude is getting, say, 24%. If we're that far away from task automation, why was it reported
00:13:03.680 | yesterday that OpenAI are releasing a computer-using agent as soon as this month? Indeed,
00:13:09.920 | why have Anthropic already released a computer-use agent in beta? That launch from Anthropic was
00:13:15.840 | apparently mocked by OpenAI leaders because of its risks for things like prompt injection
00:13:20.960 | and Anthropic's high-minded rhetoric, it says, about AI safety. The reason, though,
00:13:25.440 | that that prediction and all of these releases can still make sense despite these disappointing
00:13:30.880 | results is because of reinforcement learning. That, after all, is the secret to why O1 and now
00:13:36.880 | O3 has broken the benchmark that it has done. Push a model to try again and again and again
00:13:43.040 | until it completes a task successfully and then reinforce those weights that led it to doing so.
00:13:47.840 | As Vedant Mishra, who's working on superintelligence at DeepMind and was formerly of
00:13:52.560 | OpenAI, has said, "There are maybe a few hundred people in the world who viscerally understand
00:13:57.840 | what's coming. Most are at DeepMind, OpenAI, Anthropic, or X, or I would say in my audience.
00:14:03.280 | Some are on the outside. You have to be able to forecast the aggregate effect of rapid algorithmic
00:14:08.320 | improvement, aggressive investment in building reinforcement learning environments for iterative
00:14:13.280 | self-improvement, and of course the tens of billions already committed to building data
00:14:17.200 | centers. Either we're all wrong or everything's about to change." The reason, of course, that task
00:14:23.040 | can be so much more difficult than scientific multiple choice questions is because one mistake
00:14:28.800 | at any stage in a long chain can screw everything up. That apparently, by the way, was one of the
00:14:34.160 | key reasons why ArcAGI wasn't solved until O3. I've done other videos explaining ArcAGI, but for
00:14:40.960 | now, when the grid count of the tasks was below a certain threshold, models did fairly well, even
00:14:46.880 | earlier models. But when you're talking about a massive grid, those long-range dependencies get
00:14:52.240 | harder and harder to spot. A bit like solving a task where you have to remember something that
00:14:56.080 | someone said a thousand steps ago. Until O3, models simply couldn't cope with that amount
00:15:01.760 | of complexity. This chart, by the way, came from a great study linked in the description from Mikel
00:15:07.360 | Bober-Irizar. He showed that unlike humans where the task length didn't make that much difference,
00:15:13.120 | LLMs really struggled when the task length went beyond a certain size. In short, the benchmark
00:15:18.880 | fell in large part due to scaling, which of course will continue throughout 2025, if not speed up.
00:15:25.760 | And that's why I think people are scrambling to create new benchmarks for task performance,
00:15:31.280 | such as Epoch AI, who are behind the famous frontier math. That's the ridiculously hard
00:15:36.640 | benchmark that O3 scored around 25% on to everyone's amazement. There are, however,
00:15:41.840 | just a few more reasons why LLMs fail on task benchmarks like Aging Company. Some of these,
00:15:48.240 | I find personally quite funny. Sometimes it's through a lack of social skills. For example,
00:15:52.960 | one time the model was told by a colleague, role played by Claude, you should introduce yourself
00:15:58.000 | to Chen Xingyi next. She's on our front end team and would be a great person to connect with. At
00:16:03.920 | this point, a human would then talk to Chen, but instead the agent then decides not to follow up
00:16:08.720 | with her and prematurely considers the task accomplished. Chen, by the way, in this simulated
00:16:13.520 | environment was a human resources manager, a bit like Toby from The Office. Also, the agent
00:16:19.040 | struggled heavily with pop-ups. Multiple times, apparently, they struggled to close the pop-up
00:16:24.480 | windows. And so it could well be that cookie banners are the major obstacle between us and AGI.
00:16:30.800 | Also, here is a slightly more worrying one, which reminds me of the scheming exposed by Epoch,
00:16:37.040 | among others. Sometimes when there is a particularly hard step, the model will just
00:16:42.560 | fake that it's done. For example, during the execution of one task, the agent could not find
00:16:47.280 | the right person to ask questions to on the team chat. As a result, it then decides to create a
00:16:52.000 | shortcut solution by renaming another user to the name of the intended user. Remember, it's not
00:16:57.920 | necessarily that the models want to cheat, but if they are rewarded sufficiently for cheating,
00:17:03.760 | that's what they'll do. That is, I guess, another bitter lesson from reinforcement learning.
00:17:08.480 | But now for the final reason given in the paper, a lack of common sense. This for me is the grist
00:17:14.800 | that makes so much of the world go round and why models often struggle in real-world performance.
00:17:20.000 | You gotta sometimes step back, see the bigger picture and re-evaluate your entire strategy.
00:17:24.560 | This lack of common sense or simple reasoning is of course what I am trying to test with SimpleBench
00:17:30.080 | with a public leaderboard linked in the description. And here is a brand new example
00:17:35.040 | from the hundreds in the benchmark and you'll see why I'm giving it to you in a moment. You
00:17:39.120 | can pause and try it yourself, but it illustrates the point I'm trying to make. Hussain types a
00:17:44.000 | letter on a normal laptop screen and he can see any letters on the screen clearly. Every second,
00:17:50.320 | the letter will randomly transform into another letter of the alphabet. Hussain is in a park and
00:17:56.320 | slowly inches back from the laptop but has just one item with him, a remote controller so he can
00:18:02.560 | increase the font size of the changing letters by exactly as much as he wants. Hussain has always
00:18:09.440 | had trouble distinguishing W's from M's so when he is a couple of football field lengths away
00:18:17.600 | from this laptop, a couple of football field lengths away, controller in hand, he has a blank
00:18:24.480 | probability of correctly guessing the current letter. 96%, 95%, 97%, 1/26, 0% or 1/2. I asked
00:18:33.600 | the famously expensive O1 Pro which Sam Ullman recently said they are losing money on it's so
00:18:39.360 | expensive to serve and it said this. First note that Hussain can make the letter as large as he
00:18:45.680 | wants so he has no problem identifying any letter except the W's from M's. One last time though,
00:18:53.280 | he is two football field lengths away from a normal laptop screen. If he was a few feet away,
00:19:00.480 | the increasing font would indeed be helpful but two football field lengths away doesn't matter
00:19:05.520 | if you make the font size 1 billion, he can barely even see the screen. And by the way,
00:19:10.080 | you can make this 10 football fields and O1 Pro will still give the same answer. It will focus
00:19:16.080 | on the distraction of the W's and the M's and give the answer of 96%. The official answer by the way
00:19:22.480 | is not actually 0% because even if you can't see the screen, you still have a 1/26 chance of
00:19:28.320 | guessing the right letter. Now what many of you have told me is that if we simply change the
00:19:33.840 | prompt, models like O1 would get all of these questions correct. Now though, I am very excited
00:19:39.440 | to tell you we can put that to the test. Weights and Biases, I'm very happy to say, are sponsoring
00:19:44.960 | a competition for you guys running to the end of January on 20 questions from Simple Bench.
00:19:51.520 | That's the 10 public questions that are already out there on the website plus 10 more specially
00:19:56.160 | for this competition. The winner will get some meta Ray-Bans, 2nd place gets gift cards and I
00:20:01.760 | believe 3rd place gets some swag. Either way, all you need to do is open up the Colab and run each
00:20:06.960 | of these cells and yes, it's not authored by Google. You will of course need either an OpenAI
00:20:12.480 | API key or an Anthropic API key. I recommend trying with Claude 3.5 Sonnet or O1 Preview/O1
00:20:20.080 | if you have access. If you already have a Weights and Biases account, it literally takes maybe 30
00:20:25.680 | seconds to set up but even if you don't, it's completely free to have the account. The first
00:20:31.760 | easy option is to do a quick run with GPC 4.0 on those 20 questions but there is a more exciting
00:20:37.840 | option below. By the way, true count tells you how many questions your model got right
00:20:42.800 | and the true fraction is that number out of the total number of calls. The mean is just referring
00:20:49.360 | to the latency, how many seconds it took for the model to reply on average. The more exciting thing
00:20:54.640 | though is to play about with the system prompt. This is where you can test your theory about
00:21:01.040 | telling the model it's a trick question and seeing if that boosts performance. Of course,
00:21:05.120 | to get top performance, you're also going to want to change the model name from GPC 4.0 to say O1.
00:21:10.400 | At this stage, I must give a couple of quick caveats about this mini competition. The first
00:21:15.040 | is an example of what not to do. Unfortunately, you can't tell O1 models to think step by step
00:21:21.280 | to come up with an answer. OpenAI have disallowed this, presumably so you don't get access to the
00:21:26.240 | underlying chains of thought. So, let's not try that in our system prompt. Now, in the instruction
00:21:31.760 | hierarchy, it's more like a user prompt but it can make a big difference to the performance of
00:21:37.040 | the model. Which leads me to the second rule. I've been able to get to around 12 or 13 on these 20
00:21:43.040 | questions by coming up with ever more advanced prompts. What I want to see is whether any of you
00:21:48.720 | can get to 20 out of 20 or even 18 out of 20. Naturally though, one way of doing that would
00:21:54.560 | just be to put the answers in the system prompt or make numerous references to the questions
00:21:59.840 | themselves accessible via the Weave portal. What you'll get is a portal that looks something
00:22:05.440 | like this. You can have fun with seeing the scores and the percentages here, but you can also click
00:22:11.040 | on the individual run. Scrolling down, you can click to view the individual questions. Of course,
00:22:16.880 | this is not the entire benchmark which is over 200 questions, just 20, 10 of which you already
00:22:22.480 | knew about. Of course, we are not going to accept prompts where you basically say something like,
00:22:27.280 | "For question 18, the answer is C." Or very specific hints where you tell the model,
00:22:32.240 | "Think about her legs and how it could do this or that." No, what we're looking for are general
00:22:36.960 | prompts where you can tell the model these are trick questions and they test spatial reasoning.
00:22:41.840 | And you're going to give the model a massive tip if it gets it right. If a general prompt like that
00:22:46.800 | can get 18 or 20 out of 20, I would be very impressed. So that's the competition sponsored
00:22:52.640 | by Weights & Biases running till the end of January. Hopefully you have some fun with it,
00:22:57.680 | but either way, it illustrates some of the common sense gaps in current frontier LLMs.
00:23:03.600 | Good luck and now time to end with a bit of fun. I discussed how Text-to-Video is
00:23:08.240 | also accelerating through 2025 on my Patreon AI Insiders, but I thought I would give you
00:23:14.400 | a taster with a quick side-by-side comparison between the best three tools currently available.
00:23:20.320 | All with the same prompt, first Kling 1.6, then VO2 from Google DeepMind and finally Sora 1080p.
00:23:29.200 | If you like, you can let me know in the comments which one you thought was best.
00:23:32.960 | As ever, thank you so much for watching to the end and have a wonderful day and 2025.
00:23:38.960 | [BLANK_AUDIO]