Altman Expects a ‘Fast Take-off’, ‘Super-Agent’ Debuting Soon and DeepSeek R1 Out

00:00:00.000 | Progress in AI is increasingly hidden behind closed doors,

00:00:04.100 | but not all of those doors are locked.

00:00:06.840 | So let's piece together what we do know.

00:00:09.720 | We know, for example, that OpenAI

00:00:11.720 | are targeting particular AI agent benchmarks,

00:00:15.120 | and I'll give you the highlights of two papers

00:00:17.920 | to showcase what that might mean.

00:00:19.960 | And no, this is not a video

00:00:22.080 | on that new ChatGPT task feature,

00:00:24.480 | which I tried to find interesting, but I just couldn't.

00:00:27.120 | Meanwhile, though, Sam Altman markedly changes gear

00:00:30.400 | on takeoff speeds,

00:00:31.880 | in other words, how fast superintelligence is coming,

00:00:34.880 | while telling the hype bros to chill.

00:00:37.880 | But DeepSeek, based in China,

00:00:39.840 | proves that open-source models aren't that far behind

00:00:43.640 | with their new R1 model.

00:00:45.320 | So whatever Western labs cook up

00:00:47.280 | could well be served to all in short order.

00:00:50.500 | Whether any of this, honestly,

00:00:52.160 | affects your work directly this year

00:00:54.640 | will depend more on how digital your work is

00:00:57.760 | and how quantifiable or benchmarkable it is.

00:01:01.760 | That, to be honest, will give you the best gauge

00:01:04.540 | of what 2025 will mean for you with AI.

00:01:07.840 | But first, I want to give you some numbers,

00:01:10.520 | and I don't just mean the cost of O3 when it is released,

00:01:15.280 | which apparently will be still $200 on the pro tier.

00:01:19.600 | Given that they are already losing money with O1 Pro,

00:01:23.000 | it does kind of make you wonder about the economics

00:01:25.520 | of serving out O3 Pro for $200 a month,

00:01:28.800 | but let's see what happens.

00:01:29.840 | No, I more mean the numbers behind the operator system

00:01:33.960 | that OpenAI looks set to be releasing quite soon.

00:01:37.360 | We can already glimpse options to toggle on

00:01:40.340 | the computer use agent or operator,

00:01:43.000 | or force it to quit and stop.

00:01:45.080 | I'll get to the two relevant papers in a moment,

00:01:47.160 | but at face value,

00:01:48.640 | if the O series has proven anything from OpenAI,

00:01:51.740 | it's proven that it can rapidly improve

00:01:54.000 | in any domain that can be benchmarked.

00:01:55.960 | So is that why yesterday we got this headline in Axios,

00:02:00.320 | "Coming soon, PhD level super agents."

00:02:03.320 | It's a decently long article,

00:02:04.920 | but I'm just going to give you the two or three highlights.

00:02:08.080 | A top company, possibly OpenAI,

00:02:10.160 | in the coming weeks will announce a breakthrough

00:02:12.920 | that unleashes, quote, "PhD level super agents

00:02:16.240 | "to do complex human tasks."

00:02:18.080 | That's all their words, not mine.

00:02:19.520 | That denomination of PhD level

00:02:21.900 | is highly disputed of course.

00:02:23.740 | OpenAI CEO, Sam Altman,

00:02:25.500 | has scheduled a closed door briefing

00:02:27.700 | for US government officials on the 30th of January.

00:02:31.020 | And there's not much other information in the article

00:02:33.460 | other than this sentence.

00:02:35.380 | "Several OpenAI staff have been telling friends

00:02:38.160 | "that they are both jazzed and spooked by recent progress."

00:02:41.780 | Now, while that's vague,

00:02:43.020 | we already know publicly that OpenAI

00:02:45.820 | are hiring aggressively for a multi-agent research team,

00:02:50.200 | which also specializes in equipping models

00:02:52.580 | to do more with tools.

00:02:54.280 | Think teams of agents,

00:02:55.600 | each one of which is specialized in the apps

00:02:58.400 | and tools you use on your computer.

00:03:00.680 | OpenAI want you to be able to delegate tasks

00:03:02.960 | that would take a long time to complete

00:03:05.060 | and involve complex environments with multiple agents.

00:03:08.520 | This is something that they are marching towards this year.

00:03:12.280 | Of course, if fulfilled,

00:03:13.840 | that could mean the massive disruption

00:03:16.700 | and dislocation of jobs in the medium term,

00:03:19.820 | according to one White House National Security Advisor.

00:03:22.980 | This was again an exclusive in Axios.

00:03:25.320 | And that advisor, by the way,

00:03:26.460 | spoke with an urgency and directness

00:03:28.620 | that was rarely heard during his decade plus in public life.

00:03:32.260 | Suffice to say though,

00:03:33.260 | the first version of this computer use operator agent

00:03:36.700 | from OpenAI, according to leaks,

00:03:38.540 | won't be capable of much of any of that.

00:03:41.060 | It can't yet reliably generate profits

00:03:43.380 | or issue meme coins,

00:03:44.980 | although I doubt OpenAI would release a model that could.

00:03:48.760 | As we enter this year of AI agents doing our work for us,

00:03:51.780 | what can we expect from this first version

00:03:54.780 | of OpenAI's computer use agent?

00:03:56.900 | What kind of tasks are involved in WebVoyager and OSWorld?

00:04:00.780 | How about this one,

00:04:01.620 | where you type, "Search Apple

00:04:03.460 | for the accessory Smart Folio for iPad

00:04:06.180 | and check the closest pickup availability

00:04:08.620 | next to this zip code."

00:04:10.400 | That is pretty cool that the agent could do that.

00:04:13.400 | My only question is,

00:04:14.680 | that would take me quite a while to type out.

00:04:17.280 | I mean, I guess I could speak that to the agent,

00:04:19.440 | but if I was typing it out,

00:04:20.900 | by the time I type that out,

00:04:22.560 | I probably could have got the answer

00:04:24.280 | just by browsing the web.

00:04:25.480 | This one's kind of cool.

00:04:26.300 | Find this particular recipe

00:04:27.920 | that takes less than 30 minutes to prepare

00:04:30.120 | and has at least a four-star rating based on user reviews.

00:04:33.540 | I think stuff like this is gonna work

00:04:35.280 | because you could just immediately verify

00:04:37.320 | if it's giving you something that meets your criteria.

00:04:39.760 | Likewise for Amazon searches,

00:04:41.920 | I could well imagine listing a bunch of criteria

00:04:45.080 | for something that I wanna buy

00:04:46.640 | and it just popping up with the item

00:04:49.360 | that matches those criteria.

00:04:50.680 | Definitely not a long horizon task in a complex environment,

00:04:54.280 | but it's a start.

00:04:55.120 | The tasks in the OSWorld benchmark

00:04:57.400 | seem to be somewhat harder.

00:04:59.200 | The prompt was,

00:05:00.160 | I illegally downloaded an episode of "Friends"

00:05:02.600 | to practice listening,

00:05:03.720 | but I don't know how to remove the subtitles.

00:05:06.040 | Please help me remove the subtitles.

00:05:07.960 | Now, this is the kind of thing

00:05:09.360 | that I am looking forward to.

00:05:10.600 | Honestly, it takes me like an hour at least,

00:05:13.880 | sometimes two to edit these videos in Descript

00:05:16.720 | and I'm looking for an agent

00:05:18.760 | that can kind of mimic my style of editing

00:05:21.900 | and just like immediately edit these videos.

00:05:24.480 | Why can't existing agents already crush the simpler tasks?

00:05:28.320 | Well, apparently more than 75% of their clicks

00:05:31.880 | are inaccurate.

00:05:33.100 | Must be pretty frustrating to be an AI agent

00:05:35.560 | that's repeatedly clicking the screen

00:05:37.240 | and not being able to click the right thing.

00:05:39.040 | Oh, and also they were attracted by advertisement content,

00:05:43.320 | which affects their judgment.

00:05:44.720 | Just imagine you in the future

00:05:46.560 | having given your credit card to an AI agent

00:05:49.040 | and watching helpless as it clicks on an ad

00:05:51.920 | and buys a random product.

00:05:53.800 | Now, I know the flaws of agents can seem silly sometimes,

00:05:57.080 | and like we're years and years away from usable agents,

00:06:00.040 | but let me give you a little anecdote.

00:06:01.840 | Just almost for fun one time years ago,

00:06:04.400 | I created over 200 pages worth of mathematics puzzles

00:06:09.200 | and quizzes with explainers.

00:06:10.960 | Now, as it happened,

00:06:12.000 | those quizzes proved really quite useful

00:06:14.960 | to benchmark early AI models like the original ChatGPT.

00:06:18.520 | And as you probably experienced yourself,

00:06:21.000 | those early models like again,

00:06:22.600 | the original ChatGPT flopped hard

00:06:24.680 | on pretty much all of the questions

00:06:26.560 | except the most simple calculation ones.

00:06:28.640 | Fast forward two years

00:06:29.920 | after the initial release of ChatGPT

00:06:32.120 | and O1, when I got access,

00:06:35.200 | crushed pretty much every single question.

00:06:38.320 | This is O1 in pro mode.

00:06:40.000 | Obviously, there had been incremental progress before that,

00:06:42.760 | but even tougher challenges like this one, O1 pro aced.

00:06:46.920 | So I guess I'm saying that I feel like we will go

00:06:49.880 | from laughing at AI agents

00:06:51.880 | to being super impressed with them

00:06:53.840 | in actually less than two years this time,

00:06:55.960 | possibly within this calendar year.

00:06:58.040 | And I echo what Noam Brown said,

00:07:00.080 | who is a lead researcher on the O series of models,

00:07:02.840 | when he said, "It can be hard to feel the AGI

00:07:06.720 | until you see an AI surpass top humans

00:07:09.080 | in a domain you care deeply about.

00:07:10.760 | Competitive coders will feel it

00:07:12.240 | within a couple of years," he said.

00:07:13.720 | Then when he refers to Paul,

00:07:15.240 | he's talking about the writer behind Taxi Driver,

00:07:18.120 | who said that AI came up with better script ideas

00:07:20.640 | than he could.

00:07:21.480 | And he said, "Paul is early,

00:07:22.960 | but I think writers will feel it too.

00:07:25.040 | Everyone will have their Lee Sedol moment

00:07:27.280 | at a different time."

00:07:28.400 | Of course, the legendary player at Go

00:07:30.760 | who was beaten by AlphaGo.

00:07:32.160 | And I don't think that's necessarily contradictory

00:07:34.680 | with this post that he said earlier.

00:07:37.560 | "Lots of vague AI hype on social media these days,

00:07:40.440 | of course.

00:07:41.280 | There are good reasons to be optimistic

00:07:43.080 | about further progress,

00:07:44.240 | but plenty of unsolved research problems remain."

00:07:47.200 | Now, speaking of vague hype though,

00:07:49.680 | that issue is not helped by none other

00:07:52.840 | than the CEO of OpenAI,

00:07:54.960 | who has reversed his position on fast takeoff timelines.

00:07:59.320 | First, let me give you his current opinion

00:08:01.520 | as of a week ago.

00:08:03.160 | - What's something you've rethought recently on AI

00:08:05.880 | or changed your mind about?

00:08:07.360 | - I think a fast takeoff is more possible

00:08:09.080 | than I thought a couple of years ago.

00:08:10.880 | - How fast?

00:08:11.760 | - Feels hard to reason about,

00:08:12.800 | but something that's in like a small number of years

00:08:15.000 | rather than a decade.

00:08:16.000 | - Wow.

00:08:16.840 | What do you think is the worst advice people are given

00:08:18.800 | on adapting to AI?

00:08:20.360 | - AI is hitting a wall,

00:08:21.440 | which I think is the laziest way

00:08:23.080 | to try to not think about it

00:08:24.240 | and just put it out of sight, out of mind.

00:08:26.560 | - Now let me play you a brief extract

00:08:28.760 | from a video I just published on my Patreon

00:08:32.200 | about what he thought just 18 months ago or so.

00:08:35.760 | Short timelines and slow takeoff

00:08:37.680 | will be a pretty good call.

00:08:39.400 | It's the prediction he would make.

00:08:41.280 | But the way people define the start of the takeoff,

00:08:44.760 | reaching the human baseline,

00:08:46.560 | may make it seem otherwise.

00:08:48.360 | Of course, in an ideal world,

00:08:50.200 | we would have clearer communication from these companies

00:08:53.040 | about just what the frontier is,

00:08:55.560 | but we don't live in that world.

00:08:57.200 | And honestly, it is hard to keep up sometimes

00:09:00.000 | with the changing opinions of the CEOs of these AI labs.

00:09:03.920 | When OpenAI was founded, Sam Altman said,

00:09:06.320 | "Obviously," this is to Elon Musk,

00:09:08.080 | "we'd comply with and aggressively support

00:09:10.560 | "all AI regulation."

00:09:12.120 | 18 months ago, he personally implored Congress

00:09:15.040 | to regulate AI, and I covered that at the time.

00:09:17.560 | But then this week,

00:09:18.400 | we got this very corporate economic blueprint from OpenAI,

00:09:22.480 | which was not fun to read in full.

00:09:24.880 | In short, though, it implores the US government

00:09:27.800 | not to stunt AI through regulation.

00:09:30.800 | Later in the document,

00:09:31.800 | it's promised that OpenAI would never facilitate

00:09:34.400 | their tools being used to threaten or coerce other states.

00:09:37.680 | Meanwhile, that principle doesn't always seem

00:09:40.320 | to be top of mind of OpenAI's CEO.

00:09:43.280 | The Anthropic CEO who chose not to make such a donation

00:09:47.000 | did say this about the stakes for 2025

00:09:50.760 | and his sense of urgency on regulating AI.

00:09:54.040 | - And I feel urgency.

00:09:55.600 | I really think we need to do something in 2025.

00:09:58.000 | If we get to the end of 2025

00:09:59.880 | and we've still done nothing about this,

00:10:01.840 | then I'm gonna be worried.

00:10:03.000 | - I don't know if you guys remember the days

00:10:04.680 | where companies used to take six to eight months

00:10:07.520 | to safety test their models before release

00:10:10.080 | and open source was claimed to be

00:10:12.080 | at least a year behind the frontier.

00:10:14.440 | These days, speaking to official safety testers and others,

00:10:17.440 | and correct me if you feel differently,

00:10:19.120 | but it feels like get the model out

00:10:21.080 | as soon as you possibly can.

00:10:22.840 | And no, open source does not feel like a year behind

00:10:26.480 | as proven by DeepSeek R1.

00:10:28.840 | It was announced literally an hour and a half ago

00:10:31.240 | while I'm filming the video,

00:10:32.680 | so no, I haven't read the paper in full,

00:10:35.120 | but I have digested some of the benchmark results

00:10:37.440 | and noticed that the pricing, by the way,

00:10:39.960 | is like 95% cheaper than for example,

00:10:43.760 | O1 when it comes to output tokens.
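
For anyone who wants to check a "95% cheaper" style claim against a price list, here is a minimal sketch of the arithmetic. The function names `percent_cheaper` and `output_cost` and both prices are placeholders chosen for illustration; they are not the real prices of either model and do not come from the video.

```python
def percent_cheaper(reference_price: float, candidate_price: float) -> float:
    """Percentage saving of the candidate relative to the reference.

    Both prices must be in the same unit, e.g. USD per million output tokens.
    """
    if reference_price <= 0:
        raise ValueError("reference_price must be positive")
    return 100 * (reference_price - candidate_price) / reference_price


def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost of generating `tokens` output tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


# Placeholder prices (assumptions, NOT real price-list figures).
REFERENCE_PRICE = 100.0  # hypothetical reference model, USD per million output tokens
CANDIDATE_PRICE = 5.0    # hypothetical cheaper model, same unit

if __name__ == "__main__":
    saving = percent_cheaper(REFERENCE_PRICE, CANDIDATE_PRICE)
    print(f"Candidate is {saving:.0f}% cheaper per output token")
    print(f"Cost of 2M output tokens: {output_cost(2_000_000, CANDIDATE_PRICE):.2f} USD")
```

Swapping in the published per-million-token output prices for the two models you care about gives the actual saving.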

00:10:46.720 | Now you might agree with me at this stage

00:10:48.920 | that official benchmarks tell us less than they used to

00:10:52.840 | and that each of us really should come up

00:10:54.680 | with our own benchmark and see which model performs best.
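
One way to act on that suggestion is a small personal benchmark harness. The sketch below is an illustration under assumptions, not how SimpleBench is run: each "model" is just a Python function that takes a prompt and returns a string (in practice you would wrap whichever API client you use), and answers are graded by case-insensitive exact match. The names `score_model`, `rank_models`, and the toy questions are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A question is a (prompt, expected answer) pair that you write yourself.
Question = Tuple[str, str]
Model = Callable[[str], str]


def score_model(ask: Model, questions: List[Question]) -> float:
    """Return the fraction of questions answered with an exact (case-insensitive) match."""
    if not questions:
        return 0.0
    correct = sum(
        1
        for prompt, expected in questions
        if ask(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(questions)


def rank_models(models: Dict[str, Model], questions: List[Question]) -> List[Tuple[str, float]]:
    """Score every model on the same questions and sort from best to worst."""
    scores = [(name, score_model(ask, questions)) for name, ask in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Stand-in "models": plain functions, not real API clients.
def always_four(prompt: str) -> str:
    return "4"


def echo(prompt: str) -> str:
    return prompt


QUESTIONS: List[Question] = [
    ("What is 2 + 2?", "4"),
    ("What is 3 + 1?", "4"),
    ("Say hello", "hello"),
]

if __name__ == "__main__":
    for name, score in rank_models({"always_four": always_four, "echo": echo}, QUESTIONS):
        print(f"{name}: {score:.0%}")
```

Exact-match grading is the simplest choice and only suits questions with short, unambiguous answers; multiple-choice questions fit it well.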

00:10:57.720 | I will say it didn't do particularly well

00:10:59.640 | on my benchmark, SimpleBench.

00:11:02.000 | This is just on the public set of questions.

00:11:04.160 | We are gonna do a full run very soon.

00:11:06.360 | Let me know if you experience the same,

00:11:08.640 | but it repeatedly says, wait, no, wait, I'm gonna do this.

00:11:12.200 | Wait, no, I'm gonna do something else.

00:11:13.600 | But more seriously, when the OpenAI operator

00:11:16.040 | or computer use agent comes out,

00:11:17.680 | it will be very interesting to see how quickly

00:11:19.920 | Chinese labs can catch up with that.

00:11:22.040 | The fact, by the way, that OpenAI's O-series

00:11:24.880 | sometimes thinks in its chain of thought in Chinese

00:11:29.000 | is perhaps a story for another video.

00:11:31.360 | Of course, 2025 won't only be about agents.

00:11:34.720 | We're also set to see the merger

00:11:37.440 | of the GPT series and the O-series.

00:11:39.880 | That would be really interesting.

00:11:41.440 | And I will be honest here.

00:11:42.840 | You know the model that I'm actually looking forward

00:11:44.880 | to the most, that would be Claude 4 Sonnet.

00:11:48.240 | I was spending about 50 hours the last 10 days

00:11:52.400 | or so working on this coding project with a colleague.

00:11:55.880 | And there's one critical task that we needed an LLM to do.

00:11:59.600 | And O1 Pro simply couldn't get the hang of it,

00:12:03.520 | but Claude 3.5 did almost instantly.

00:12:06.160 | I know that's super anecdotal,

00:12:07.720 | and I'll be telling you much more

00:12:09.120 | about what we're working on soon,

00:12:11.120 | but that was quite a powerful moment for me.

00:12:13.480 | And speaking of powerful moments,

00:12:15.240 | I honestly think you might have just a few

00:12:19.160 | while listening to the 80,000 Hours podcast.

00:12:22.600 | Yes, they are the sponsors of this video,

00:12:24.760 | but I genuinely listen to them and really learn a lot.

00:12:28.120 | For example, this podcast 209,

00:12:30.440 | I was listening to while on a long walk in London.

00:12:33.480 | Really interesting, of course,

00:12:34.560 | all the shenanigans that are going on

00:12:36.440 | with the nonprofit oversight of OpenAI.

00:12:39.480 | Yes, by the way, they also have a YouTube channel

00:12:41.920 | that I know some of you have already checked out and like,

00:12:44.600 | so thank you for checking it out.

00:12:46.480 | Thank you also to everyone who has participated

00:12:49.520 | in the SimpleBench competition,

00:12:51.240 | which runs for another 11 days.

00:12:53.880 | Lots more to say on that front in another video.

00:12:57.080 | Honestly, let me know what you think.

00:12:58.840 | Will this be the year of super agents

00:13:02.000 | or is Twitter hype out of control again?

00:13:04.800 | For me, as ever, the truth lies somewhere in between.

00:13:08.680 | Thank you so much for watching and have a wonderful day.
