
Altman Expects a ‘Fast Take-off’, ‘Super-Agent’ Debuting Soon and DeepSeek R1 Out


Chapters

0:00 Introduction
1:13 Pro Cost and OpenAI Operator
4:00 Agent Benchmarks Being Targeted
7:48 Fast Take-off, Altman
8:48 Altman flip-flops
10:02 DeepSeek R1 First Reaction

Whisper Transcript

00:00:00.000 | Progress in AI is increasingly hidden behind closed doors,
00:00:04.100 | but not all of those doors are locked.
00:00:06.840 | So let's piece together what we do know.
00:00:09.720 | We know, for example, that OpenAI
00:00:11.720 | are targeting particular AI agent benchmarks,
00:00:15.120 | and I'll give you the highlights of two papers
00:00:17.920 | to showcase what that might mean.
00:00:19.960 | And no, this is not a video
00:00:22.080 | on that new ChatGPT task feature,
00:00:24.480 | which I tried to find interesting, but I just couldn't.
00:00:27.120 | Meanwhile, though, Sam Altman markedly changes gear
00:00:30.400 | on takeoff speeds,
00:00:31.880 | in other words, how fast superintelligence is coming,
00:00:34.880 | while telling the hype bros to chill.
00:00:37.880 | But DeepSeek, based in China,
00:00:39.840 | proves that open-source models aren't that far behind
00:00:43.640 | with their new R1 model.
00:00:45.320 | So whatever Western labs cook up
00:00:47.280 | could well be served to all in short order.
00:00:50.500 | Whether any of this, honestly,
00:00:52.160 | affects your work directly this year
00:00:54.640 | will depend more on how digital your work is
00:00:57.760 | and how quantifiable or benchmarkable it is.
00:01:01.760 | That, to be honest, will give you the best gauge
00:01:04.540 | of what 2025 will mean for you with AI.
00:01:07.840 | But first, I want to give you some numbers,
00:01:10.520 | and I don't just mean the cost of O3 when it is released,
00:01:15.280 | which apparently will be still $200 on the pro tier.
00:01:19.600 | Given that they are already losing money with O1 Pro,
00:01:23.000 | it does kind of make you wonder about the economics
00:01:25.520 | of serving out O3 Pro for $200 a month,
00:01:28.800 | but let's see what happens.
00:01:29.840 | No, I more mean the numbers behind the Operator system
00:01:33.960 | that OpenAI looks set to be releasing quite soon.
00:01:37.360 | We can already glimpse options to toggle on
00:01:40.340 | the computer use agent, or Operator,
00:01:43.000 | or force it to quit and stop.
00:01:45.080 | I'll get to the two relevant papers in a moment,
00:01:47.160 | but at face value,
00:01:48.640 | if the O series has proven anything from OpenAI,
00:01:51.740 | it's proven that it can rapidly improve
00:01:54.000 | in any domain that can be benchmarked.
00:01:55.960 | So is that why yesterday we got this headline in Axios,
00:02:00.320 | "Coming soon, PhD level super agents"?
00:02:03.320 | It's a decently long article,
00:02:04.920 | but I'm just going to give you the two or three highlights.
00:02:08.080 | A top company, possibly OpenAI,
00:02:10.160 | in the coming weeks will announce a breakthrough
00:02:12.920 | that unleashes, quote, "PhD level super agents
00:02:16.240 | to do complex human tasks."
00:02:18.080 | That's all their words, not mine.
00:02:19.520 | That denomination of PhD level
00:02:21.900 | is highly disputed of course.
00:02:23.740 | OpenAI CEO, Sam Altman,
00:02:25.500 | has scheduled a closed door briefing
00:02:27.700 | for US government officials on the 30th of January.
00:02:31.020 | And there's not much other information in the article
00:02:33.460 | other than this sentence.
00:02:35.380 | "Several OpenAI staff have been telling friends
00:02:38.160 | that they are both jazzed and spooked by recent progress."
00:02:41.780 | Now, while that's vague,
00:02:43.020 | we already know publicly that OpenAI
00:02:45.820 | are hiring aggressively for a multi-agent research team,
00:02:50.200 | which also specializes in equipping models
00:02:52.580 | to do more with tools.
00:02:54.280 | Think teams of agents,
00:02:55.600 | each one of which specializes in the apps
00:02:58.400 | and tools you use on your computer.
00:03:00.680 | OpenAI want you to be able to delegate tasks
00:03:02.960 | that would take a long time to complete
00:03:05.060 | and involve complex environments with multiple agents.
00:03:08.520 | This is something that they are marching towards this year.
00:03:12.280 | Of course, if fulfilled,
00:03:13.840 | that could mean the massive disruption
00:03:16.700 | and dislocation of jobs in the medium term,
00:03:19.820 | according to one White House National Security Advisor.
00:03:22.980 | This was again an exclusive in Axios.
00:03:25.320 | And that advisor, by the way,
00:03:26.460 | spoke with an urgency and directness
00:03:28.620 | that was rarely heard during his decade plus in public life.
00:03:32.260 | Suffice to say though,
00:03:33.260 | the first version of this computer use operator agent
00:03:36.700 | from OpenAI, according to leaks,
00:03:38.540 | won't be capable of much of any of that.
00:03:41.060 | It can't yet reliably generate profits
00:03:43.380 | or issue meme coins,
00:03:44.980 | although I doubt OpenAI would release a model that could.
00:03:48.760 | As we enter this year of AI agents doing our work for us,
00:03:51.780 | what can we expect from this first version
00:03:54.780 | of OpenAI's computer use agent?
00:03:56.900 | What kind of tasks are involved in WebVoyager and OSWorld?
00:04:00.780 | How about this one,
00:04:01.620 | where you type "Search Apple"
00:04:03.460 | for the accessory Smart Folio for iPad
00:04:06.180 | and check the closest pickup availability
00:04:08.620 | next to this zip code.
00:04:10.400 | That is pretty cool that the agent could do that.
00:04:13.400 | My only question is,
00:04:14.680 | that would take me quite a while to type out.
00:04:17.280 | I mean, I guess I could speak that to the agent,
00:04:19.440 | but if I was typing it out,
00:04:20.900 | by the time I type that out,
00:04:22.560 | I probably could have got the answer
00:04:24.280 | just by browsing the web.
00:04:25.480 | This one's kind of cool.
00:04:26.300 | Find this particular recipe
00:04:27.920 | that takes less than 30 minutes to prepare
00:04:30.120 | and has at least a four-star rating based on user reviews.
00:04:33.540 | I think stuff like this is gonna work
00:04:35.280 | because you could just immediately verify
00:04:37.320 | if it's giving you something that meets your criteria.
00:04:39.760 | Likewise for Amazon searches,
00:04:41.920 | I could well imagine listing a bunch of criteria
00:04:45.080 | for something that I wanna buy
00:04:46.640 | and it just popping up with the item
00:04:49.360 | that matches those criteria.
00:04:50.680 | Definitely not a long horizon task in a complex environment,
00:04:54.280 | but it's a start.
00:04:55.120 | The tasks in the OSWorld benchmark
00:04:57.400 | seem to be somewhat harder.
00:04:59.200 | The prompt was,
00:05:00.160 | I illegally downloaded an episode of "Friends"
00:05:02.600 | to practice listening,
00:05:03.720 | but I don't know how to remove the subtitles.
00:05:06.040 | Please help me remove the subtitles.
00:05:07.960 | Now, this is the kind of thing
00:05:09.360 | that I am looking forward to.
00:05:10.600 | Honestly, it takes me like an hour at least,
00:05:13.880 | sometimes two to edit these videos in Descript
00:05:16.720 | and I'm looking for an agent
00:05:18.760 | that can kind of mimic my style of editing
00:05:21.900 | and just like immediately edit these videos.
00:05:24.480 | Why can't existing agents already crush the simpler tasks?
00:05:28.320 | Well, apparently more than 75% of their clicks
00:05:31.880 | are inaccurate.
00:05:33.100 | Must be pretty frustrating to be an AI agent
00:05:35.560 | that's repeatedly clicking the screen
00:05:37.240 | and not being able to click the right thing.
00:05:39.040 | Oh, and also they were attracted by advertisement content,
00:05:43.320 | which affects their judgment.
00:05:44.720 | Just imagine you in the future
00:05:46.560 | having given your credit card to an AI agent
00:05:49.040 | and watching helpless as it clicks on an ad
00:05:51.920 | and buys a random product.
00:05:53.800 | Now, I know the flaws of agents can seem silly sometimes,
00:05:57.080 | and like we're years and years away from usable agents,
00:06:00.040 | but let me give you a little anecdote.
00:06:01.840 | Just almost for fun one time years ago,
00:06:04.400 | I created over 200 pages worth of mathematics puzzles
00:06:09.200 | and quizzes with explainers.
00:06:10.960 | Now, as it happened,
00:06:12.000 | those quizzes proved really quite useful
00:06:14.960 | to benchmark early AI models like the original ChatGPT.
00:06:18.520 | And as you probably experienced yourself,
00:06:21.000 | those early models, like, again,
00:06:22.600 | the original ChatGPT, flopped hard
00:06:24.680 | on pretty much all of the questions
00:06:26.560 | except the most simple calculation ones.
00:06:28.640 | Fast forward two years
00:06:29.920 | after the initial release of ChatGPT
00:06:32.120 | and O1, when I got access,
00:06:35.200 | crushed pretty much every single question.
00:06:38.320 | This is O1 in pro mode.
00:06:40.000 | Obviously, there had been incremental progress before that,
00:06:42.760 | but even tougher challenges like this one, O1 pro aced.
00:06:46.920 | So I guess I'm saying that I feel like we will go
00:06:49.880 | from laughing at AI agents
00:06:51.880 | to being super impressed with them
00:06:53.840 | in actually less than two years this time,
00:06:55.960 | possibly within this calendar year.
00:06:58.040 | And I echo what Noam Brown said,
00:07:00.080 | who is a lead researcher on the O series of models,
00:07:02.840 | when he said, "It can be hard to feel the AGI
00:07:06.720 | until you see an AI surpass top humans
00:07:09.080 | in a domain you care deeply about.
00:07:10.760 | Competitive coders will feel it
00:07:12.240 | within a couple of years," he said.
00:07:13.720 | Then when he refers to Paul,
00:07:15.240 | he's talking about the writer behind Taxi Driver,
00:07:18.120 | who said that AI came up with better script ideas
00:07:20.640 | than he could.
00:07:21.480 | And he said, "Paul is early,
00:07:22.960 | but I think writers will feel it too.
00:07:25.040 | Everyone will have their Lee Sedol moment
00:07:27.280 | at a different time."
00:07:28.400 | That's, of course, the legendary Go player
00:07:30.760 | who was beaten by AlphaGo.
00:07:32.160 | And I don't think that's necessarily contradictory
00:07:34.680 | with this post that he made earlier.
00:07:37.560 | "Lots of vague AI hype on social media these days,
00:07:40.440 | of course.
00:07:41.280 | There are good reasons to be optimistic
00:07:43.080 | about further progress,
00:07:44.240 | but plenty of unsolved research problems remain."
00:07:47.200 | Now, speaking of vague hype though,
00:07:49.680 | that issue is not helped by none other
00:07:52.840 | than the CEO of OpenAI,
00:07:54.960 | who has reversed his position on fast takeoff timelines.
00:07:59.320 | First, let me give you his current opinion
00:08:01.520 | as of a week ago.
00:08:03.160 | - What's something you've rethought recently on AI
00:08:05.880 | or changed your mind about?
00:08:07.360 | - I think a fast takeoff is more possible
00:08:09.080 | than I thought a couple of years ago.
00:08:10.880 | - How fast?
00:08:11.760 | - Feels hard to reason about,
00:08:12.800 | but something that's in like a small number of years
00:08:15.000 | rather than a decade.
00:08:16.000 | - Wow.
00:08:16.840 | What do you think is the worst advice people are given
00:08:18.800 | on adapting to AI?
00:08:20.360 | - AI is hitting a wall,
00:08:21.440 | which I think is the laziest way
00:08:23.080 | to try to not think about it
00:08:24.240 | and just put it out of sight, out of mind.
00:08:26.560 | - Now let me play you a brief extract
00:08:28.760 | from a video I just published on my Patreon
00:08:32.200 | about what he thought just 18 months ago or so.
00:08:35.760 | Short timelines and slow takeoff
00:08:37.680 | will be a pretty good call.
00:08:39.400 | It's the prediction he would make.
00:08:41.280 | But the way people define the start of the takeoff,
00:08:44.760 | reaching the human baseline,
00:08:46.560 | may make it seem otherwise.
00:08:48.360 | Of course, in an ideal world,
00:08:50.200 | we would have clearer communication from these companies
00:08:53.040 | about just what the frontier is,
00:08:55.560 | but we don't live in that world.
00:08:57.200 | And honestly, it is hard to keep up sometimes
00:09:00.000 | with the changing opinions of the CEOs of these AI labs.
00:09:03.920 | When OpenAI was founded, Sam Altman said,
00:09:06.320 | "Obviously," this is to Elon Musk,
00:09:08.080 | "we'd comply with and aggressively support
00:09:10.560 | all AI regulation."
00:09:12.120 | 18 months ago, he personally implored Congress
00:09:15.040 | to regulate AI, and I covered that at the time.
00:09:17.560 | But then this week,
00:09:18.400 | we got this very corporate economic blueprint from OpenAI,
00:09:22.480 | which was not fun to read in full.
00:09:24.880 | In short, though, it implores the US government
00:09:27.800 | not to stunt AI through regulation.
00:09:30.800 | Later in the document,
00:09:31.800 | it's promised that OpenAI would never facilitate
00:09:34.400 | their tools being used to threaten or coerce other states.
00:09:37.680 | Meanwhile, that principle doesn't always seem
00:09:40.320 | to be top of mind of OpenAI's CEO.
00:09:43.280 | The Anthropic CEO, who chose not to make such a donation,
00:09:47.000 | did say this about the stakes for 2025
00:09:50.760 | and his sense of urgency on regulating AI.
00:09:54.040 | - And I feel urgency.
00:09:55.600 | I really think we need to do something in 2025.
00:09:58.000 | If we get to the end of 2025
00:09:59.880 | and we've still done nothing about this,
00:10:01.840 | then I'm gonna be worried.
00:10:03.000 | - I don't know if you guys remember the days
00:10:04.680 | where companies used to take six to eight months
00:10:07.520 | to safety test their models before release
00:10:10.080 | and open source was claimed to be
00:10:12.080 | at least a year behind the frontier.
00:10:14.440 | These days, speaking to official safety testers and others,
00:10:17.440 | and correct me if you feel differently,
00:10:19.120 | but it feels like get the model out
00:10:21.080 | as soon as you possibly can.
00:10:22.840 | And no, open source does not feel like a year behind
00:10:26.480 | as proven by DeepSeek R1.
00:10:28.840 | It was announced literally an hour and a half ago
00:10:31.240 | while I'm filming the video,
00:10:32.680 | so no, I haven't read the paper in full,
00:10:35.120 | but I have digested some of the benchmark results
00:10:37.440 | and noticed that the pricing, by the way,
00:10:39.960 | is like 95% cheaper than, for example,
00:10:43.760 | O1 when it comes to output tokens.
00:10:46.720 | Now you might agree with me at this stage
00:10:48.920 | that official benchmarks tell us less than they used to
00:10:52.840 | and that each of us really should come up
00:10:54.680 | with our own benchmark and see which model performs best.
00:10:57.720 | I will say it didn't do particularly well
00:10:59.640 | on my benchmark, SimpleBench.
00:11:02.000 | This is just on the public set of questions.
00:11:04.160 | We are gonna do a full run very soon.
00:11:06.360 | Let me know if you experience the same,
00:11:08.640 | but it repeatedly says, wait, no, wait, I'm gonna do this.
00:11:12.200 | Wait, no, I'm gonna do something else.
00:11:13.600 | But more seriously, when the OpenAI Operator
00:11:16.040 | or computer use agent comes out,
00:11:17.680 | it will be very interesting to see how quickly
00:11:19.920 | Chinese labs can catch up with that.
00:11:22.040 | The fact, by the way, that OpenAI's O-series
00:11:24.880 | sometimes thinks in its chain of thought in Chinese
00:11:29.000 | is perhaps a story for another video.
00:11:31.360 | Of course, 2025 won't only be about agents.
00:11:34.720 | We're also set to see the merger
00:11:37.440 | of the GPT series and the O-series.
00:11:39.880 | That would be really interesting.
00:11:41.440 | And I will be honest here.
00:11:42.840 | You know the model that I'm actually looking forward
00:11:44.880 | to the most, that would be Claude 4 Sonnet.
00:11:48.240 | I spent about 50 hours over the last 10 days
00:11:52.400 | or so working on this coding project with a colleague.
00:11:55.880 | And there's one critical task that we needed an LLM to do.
00:11:59.600 | And O1 Pro simply couldn't get the hang of it,
00:12:03.520 | but Claude 3.5 did almost instantly.
00:12:06.160 | I know that's super anecdotal,
00:12:07.720 | and I'll be telling you much more
00:12:09.120 | about what we're working on soon,
00:12:11.120 | but that was quite a powerful moment for me.
00:12:13.480 | And speaking of powerful moments,
00:12:15.240 | I honestly think you might have just a few
00:12:19.160 | while listening to the 80,000 Hours podcast.
00:12:22.600 | Yes, they are the sponsors of this video,
00:12:24.760 | but I genuinely listen to them and really learn a lot.
00:12:28.120 | For example, this podcast 209,
00:12:30.440 | I was listening to while on a long walk in London.
00:12:33.480 | Really interesting, of course,
00:12:34.560 | all the shenanigans that are going on
00:12:36.440 | with the nonprofit oversight of OpenAI.
00:12:39.480 | Yes, by the way, they also have a YouTube channel
00:12:41.920 | that I know some of you have already checked out and liked,
00:12:44.600 | so thank you for checking it out.
00:12:46.480 | Thank you also to everyone who has participated
00:12:49.520 | in the SimpleBench competition,
00:12:51.240 | which runs for another 11 days.
00:12:53.880 | Lots more to say on that front in another video.
00:12:57.080 | Honestly, let me know what you think.
00:12:58.840 | Will this be the year of super agents
00:13:02.000 | or is Twitter hype out of control again?
00:13:04.800 | For me, as ever, the truth lies somewhere in between.
00:13:08.680 | Thank you so much for watching and have a wonderful day.