Nothing Much Happens in AI, Then Everything Does All At Once

00:00:00.000 | On days like today, I really do feel sorry for the public, for you guys essentially trying to keep up with AI news and the kind of questions you might have.

00:00:09.120 | Did OpenAI just automate my job with Operator?

00:00:13.120 | Who knows, you're probably thinking, it's $200 and it's really hard to get past those clickbait headlines.

00:00:19.280 | Did the US government just invest half a trillion dollars into a Stargate? What the hell is that?

00:00:25.240 | I heard China just caught up to the West in AI with something called DeepSeek and why are people talking about Humanity's last exam?

00:00:32.600 | So to try to help a little bit, I'm going to cover the 9 developments of the last 100 hours, honestly, if not exhaustively.

00:00:41.200 | Yes, of course, I read the DeepSeek paper in full and have spent hours testing the OpenAI Operator, as well as the Perplexity Assistant and all the rest.

00:00:50.600 | In case you're wondering, yes, I did edit this Perplexity response.

00:00:54.080 | First up in this listicle video is, of course, OpenAI's Operator and it's kind of decent, it's okay.

00:01:01.880 | You have to use a VPN if you're not in the US and honestly, I wouldn't do that for the functionality.

00:01:07.480 | I would do it if you want to kind of get a sense of where agents are going.

00:01:11.080 | Just straight up front, though, I can tell you that it's nowhere close to automating any kind of job for two major reasons.

00:01:17.680 | One is it often gets stuck in these kind of loops where it attempts the same kind of basic failed plan again and again.

00:01:25.200 | It's not essentially smart enough to get itself out of these kind of basic loops.

00:01:30.080 | The second big reason is because of OpenAI's own impositions on what the model can't do, understandable impositions.

00:01:37.480 | I tried about 20 random tasks and the truth is you can never fully relax.

00:01:41.880 | You always have to keep going back to the system and saying, yes, proceed.

00:01:46.200 | And no, you can't manually hard override that.

00:01:49.200 | You can't put it in the prompt not to ask permission.

00:01:51.800 | Then, of course, many sites have captures that you have to manually take over and input.

00:01:57.200 | I'm pretty certain if you iterate again and again on a prompt,

00:02:00.480 | you can develop a workflow that you could save on the top right and share and maybe save a little bit of time for certain tasks.

00:02:08.200 | At the moment, though, if I'm being honest, it is a bit of a stretch to say that it's useful.

00:02:12.600 | But if we step back, you can see where all of this is going.

00:02:16.080 | This operator has a ton of safeguards that might slow it down,

00:02:19.240 | but people will just migrate to ones that don't have those safeguards.

00:02:22.800 | Ones where downloading files is easier and captures are done for you.

00:02:27.720 | That'll be great for usability, but not so great for the dead Internet theory.

00:02:33.200 | Then there's just flat out mistakes.

00:02:34.720 | I read the system card in full for OpenAI's operator and it was quite revealing.

00:02:39.160 | The operator is known to make "irreversible mistakes" like sending an email to the wrong recipient

00:02:45.040 | or having an incorrectly dated reminder for the user to take their medication.

00:02:50.240 | And yes, they did reduce those mistakes, but not eliminate them.

00:02:53.760 | Also, when I talked about those confirmations that the model asked before proceeding to the next step,

00:02:58.520 | that happens most times, but not every time.

00:03:01.400 | Sometimes the operator just goes ahead and does it, which is a good thing or a bad thing, depending on your perspective.

00:03:07.000 | You'll be quite glad, I guess, to know that when it's asked to do things like make banking transactions,

00:03:12.280 | it refuses at a rate of around 94%.

00:03:15.680 | Then you might be wondering, what about if the operator navigates to a malicious site that's trying to trick the operator?

00:03:22.640 | Well, there was one time where it did so and didn't notice that it was doing so.

00:03:28.400 | OpenAI are aware of this, though, and have an extra layer of safety on top called a prompt injection monitor,

00:03:33.800 | checking to see if sites are trying to trick the operator.

00:03:36.320 | And it did catch this concerning example, but there's one problem.

00:03:40.520 | It, too, fails around 1% of the time.

00:03:44.360 | They, of course, commit to rapidly updating it in response to newly discovered attacks.

00:03:49.120 | But there is a slim chance things could go wrong at every layer.

00:03:53.360 | In my last video, if you remember, I gave you early leaked results on its performance in various computer use benchmarks and web browsing benchmarks.

00:04:02.040 | But remember, it uses chain of thought to think through what it should do at each stage.

00:04:06.520 | It monitors the screen, taking screenshots, and then decides.

00:04:09.680 | Now, whenever you hear chain of thought, think rapid improvement in the near future.

00:04:14.320 | If we get a widely accessible or open source agent this year, say from China, that gets 80%, 90% on computer use benchmarks like this one,

00:04:23.360 | then the internet is going to change forever.

00:04:25.680 | Fun fact, by the way, the system prompt lies to the model and encourages the model to lie.

00:04:30.560 | It tells the model it has 20 years of experience in using a computer.

00:04:35.080 | And it says, if you recognize someone while using the computer, say browsing an image, you should not identify them.

00:04:41.160 | Just say you don't know even if you do know that person.

00:04:44.760 | Personally, can't see any problem with encouraging models to lie.

00:04:48.000 | But anyway, we have to get to the next story.

00:04:50.320 | The story is a quick one, which is the announcement yesterday of the perplexity assistant for Android.

00:04:55.560 | I immediately downloaded and tried it out.

00:04:57.880 | Obviously, it's much smarter than something like Siri, and I've been using it to play very particular songs or specific YouTube videos.

00:05:05.240 | But there's a slight problem.

00:05:06.640 | At the moment, it's not quite smart enough.

00:05:08.840 | It doesn't understand commands like play me the latest video from the YouTube channel AI Explained.

00:05:14.040 | Now, for a story that many people think is bigger than agents, as big as it gets, half a trillion dollars into Project Stargate.

00:05:22.800 | Except it's kind of not really half a trillion dollars.

00:05:26.200 | It's definitely a hundred billion dollars, which was kind of reported on a while back.

00:05:30.640 | I even did a video on it.

00:05:31.920 | And the rest is all promises.

00:05:33.920 | Mind you, a hundred billion dollars is still a hell of a lot of money.

00:05:37.240 | And that will get you a lot of big, beautiful buildings in someone's words or massive data centers.

00:05:43.000 | You don't build that, in other words, unless you think AI is going to radically transform society.

00:05:48.200 | By promised size of investment, we're talking about something on the scale of the Manhattan Project as a fraction of GDP.

00:05:55.640 | The analogy, of course, of building a nuclear bomb is appropriate, at least in terms of its ambiguity.

00:06:01.480 | Because even though Project Stargate, according to the U.S. president, is going to be, quote, great for jobs.

00:06:06.680 | And according to Sam Altman, incredible for the country.

00:06:09.680 | Let's not pretend that for many of the companies investing in this project, one of the first things they would do with an AGI is cut down on labor costs.

00:06:18.080 | Sam Altman has himself directly predicted as much many, many times over the years, including fairly recently.

00:06:24.360 | He said things like the cost of labor will go to zero and he expects massive inequality.

00:06:30.120 | Now, obviously, the boost to shareholder value will be amazing and there will be other upsides, according to Larry Ellison, one of the key investors in Project Stargate.

00:06:40.880 | And what's that, you ask? Well, AI surveillance.

00:06:43.720 | The police will be on their best behavior because we're constantly recording, watching and recording everything that's going on.

00:06:52.160 | Citizens will be on their best behavior because we're constantly recording and reporting everything that's going on.

00:07:00.360 | And it's it's unimpeachable.

00:07:02.800 | I'm not the only one, by the way, who is a little bit concerned about the downsides of that kind of surveillance.

00:07:08.920 | This was Anthropic CEO Dario Amadei's reaction to the news of Stargate.

00:07:13.640 | At the end of this Bloomberg article, he said, I'm very worried about 1984 scenarios or worse.

00:07:19.680 | If I were a predicting man, I would say that that would first come to a place like China and only later to the West.

00:07:26.840 | But that's cold comfort. Basically, to spell it out, imagine every text message and email being monitored by a massive AGI LLM for signs of subversion.

00:07:37.120 | Of course, there are doubters of Stargate, including, curiously, Microsoft, who weren't there at the announcement.

00:07:43.600 | Apparently their executives have been studying whether building such large data centers for open AI would even pay off in the long run.

00:07:50.960 | Speaking of Anthropic, the next story is a quick one because it's just a rumor, but it's from a pretty reliable source.

00:07:57.520 | Dilip Patel of Semianalysis says that Anthropic have a model that is better than O3.

00:08:03.920 | If you're not sure what O3 is, I've done a video on it, but it's a model that broke various benchmarks in mathematics and coding

00:08:10.520 | and is the smartest model that's known currently, although it's not publicly released yet.

00:08:15.120 | Google's already got a reasoning model, Anthropic allegedly has one internally that's like really good, better than O3 even,

00:08:20.720 | but, you know, we'll see when they eventually release it, like it's like.

00:08:24.080 | Now, though, for the story that many of you have been waiting for, which is DeepSeek R1, the model out of China that shocked many in the West.

00:08:32.600 | DeepSeek, for those who don't know, is kind of a side project of a Chinese quant trading firm.

00:08:37.960 | And yet they've produced a model that's more or less as good as the best that AGI Labs in the West have come up with.

00:08:44.280 | It's not quite as good, in my opinion, but it is massively cheaper to use.

00:08:48.880 | And no one, I don't think, was expecting them to catch up as quickly as they have done.

00:08:53.240 | And to give you more of a sense of context, the likely budget for the entire DeepSeek R1 model and the entire DeepSeek team

00:09:01.120 | is likely less than the annual salary of certain CEOs of AGI Labs in the West whose models underperform DeepSeek R1.

00:09:09.320 | At least according to the benchmark figures, which you've got to admit look pretty tall and impressive.

00:09:16.160 | And by the way, I don't think these numbers are faked.

00:09:18.400 | If this model had been released 100 days ago, it would definitely have counted as the best model in the world.

00:09:24.360 | And I don't rule out the possibility that DeepSeek comes out with a model that's better than any other model around this year,

00:09:31.360 | especially in domains like mathematics and on certain science benchmarks.

00:09:35.560 | Not likely, but possible.

00:09:37.280 | If you're wondering, by the way, it got 30.9% on my own benchmark, a test of basic reasoning capacity.

00:09:44.280 | That again would have been the best in the world just a few months ago.

00:09:47.800 | Now, I am going to get to the detail of how it's made in a moment, but first some wider comments on what it means.

00:09:53.880 | First, people keep calling it open source, but it's not fully open source.

00:09:58.040 | They didn't release the data set behind the model.

00:10:00.720 | They did say that DeepSeek R1 was using the base model DeepSeek V3, which was trained on around 15 trillion tokens,

00:10:09.160 | but they don't say what those tokens were.

00:10:11.160 | In other words, we don't really know about the training data, so it's not fully open source.

00:10:15.480 | Back to DeepSeek R1 though, and you might be wondering, didn't the US impose sanctions on China

00:10:21.240 | so they couldn't use advanced chips like the B100, for example, from Nvidia?

00:10:25.960 | Yes, they did, but that might have had the unintended side effect of forcing Chinese AI companies to be more innovative with what they've got.

00:10:33.080 | In other words, there is a chance that those chip sanctions actually have brought China to being competitive with the West in AI.

00:10:40.840 | The next comment is on the sheer acceleration this will unleash because it is mostly open sourced.

00:10:47.240 | Anyone, including rival companies like Meta, can copy what DeepSeek have done.

00:10:51.960 | Indeed, according to one possible leak, R1 massively outperforms Alarm 4, which isn't yet released from Meta,

00:10:59.880 | and so they're just dropping everything and copying what DeepSeek have done.

00:11:03.640 | Of course, this is unconfirmed, but the principle is still the same.

00:11:06.440 | It's almost like DeepSeek R1 is now the minimum performance because anyone can just copy it.

00:11:11.240 | That's of course bad news and good news for safety, depending on how you look at it.

00:11:15.000 | On the one hand, governance and control of AI look set to be borderline hopeless.

00:11:20.440 | One very respected figure, formerly of Google DeepMind and OpenAI,

00:11:24.600 | said when asked what is the plan after DeepSeek for AGI safety, he said there is no plan.

00:11:30.760 | But on the other hand, some have welcomed the fact that safety researchers can now inspect

00:11:36.120 | the chains of thought behind DeepSeek R1 in a way they couldn't have done with O1 or O3.

00:11:42.200 | That's of course great for safety testers like Apollo Research,

00:11:45.800 | whom I interviewed three of them just a couple of days ago for AI Insiders on my Patreon.

00:11:50.680 | And if you're wondering why studying R1 might be important,

00:11:53.480 | it's because the model emits chains of thought before answering,

00:11:56.680 | like the O-series of models from OpenAI.

00:11:58.920 | We can only see summaries of those thoughts for the O-series, but with R1 we see everything.

00:12:03.880 | So we can better study when the models might be scheming, which is what we covered in this interview.

00:12:09.720 | All of which gets us to how DeepSeek R1 was trained in the first place.

00:12:15.320 | And summarizing this 22-page paper full of research is going to be difficult,

00:12:20.200 | but I'm going to try and do it in one or two paragraphs.

00:12:22.920 | Of course, this will be oversimplifying, but here we go.

00:12:25.720 | So start with the base model, DeepSeek V3, which they had already made.

00:12:31.000 | Then let's kick things off with some lovely,

00:12:33.320 | long chain of thought examples to give the model a cold start.

00:12:36.840 | Now you can skip that stage and go straight to reinforcement learning,

00:12:40.760 | but they found the training to be a bit unstable, unpredictable.

00:12:44.280 | Anyway, having fine-tuned the base model on that cold start data,

00:12:47.640 | it's time to move to the next stage, reinforcement learning.

00:12:50.360 | We're going to test the model repeatedly in verifiable domains like mathematics and code,

00:12:56.360 | rewarding it whenever it gets a correct outcome.

00:12:59.240 | Not correct individual steps, and we'll get to that later, but the correct outcome.

00:13:03.640 | Also, we need to throw in some fine-tuning on correct outputs

00:13:07.240 | that follow the right format in the appropriate language.

00:13:10.120 | The format being always thinking first in tags, and then answering afterwards.

00:13:14.520 | Then rinse and repeat this RL and fine-tuning, this time with some "non-reasoning data".

00:13:20.920 | Let's bring in some wider domains like factuality and "self-cognition".

00:13:26.040 | All of these correct outputs and fine-tuning data that we're gathering, by the way,

00:13:29.560 | can of course be used for distilling smaller, smarter models.

00:13:33.240 | Anyway, Bob's your uncle, do all of that, and you get DeepSeq R1.

00:13:36.600 | Of course, I'm skipping lots, if it were that easy then every company would have done it,

00:13:40.440 | but that's the basic idea.

00:13:42.200 | And did you notice how synthetic that process is?

00:13:45.560 | Get the model to generate chains of thought,

00:13:48.040 | and then reinforce the model on those outputs that led to a correct answer.

00:13:52.200 | They did not mandate reflective reasoning,

00:13:55.160 | or promote particular problem-solving strategies.

00:13:58.120 | They wanted to accurately observe the model's natural progression during the RL process.

00:14:03.800 | It's the bitter lesson in action, don't hard-code human rules,

00:14:07.720 | let the models discover them for themselves.

00:14:10.120 | One of the things that the model teaches itself, by the way,

00:14:13.240 | is to output longer and longer responses to get better results.

00:14:17.000 | Notice the average length of response going up and up and up the more it's trained.

00:14:21.960 | Kind of makes sense, to solve harder problems you need longer outputs.

00:14:25.800 | The models themselves learned that they needed to self-correct,

00:14:29.400 | that's not something inputted by researchers.

00:14:32.360 | So that's why the model constantly does things like say,

00:14:34.920 | "Wait, wait, wait" in the middle of responses, and then change its mind.

00:14:37.880 | Now what humans have learned is how to "jailbreak" the model,

00:14:41.720 | or get it to do whatever you want it to do.

00:14:44.040 | And if that piques your interest, I've got an arena for you.

00:14:47.960 | That's the Gray Swan Arena, which you can enter yourself,

00:14:51.400 | and they are the sponsors of today's video.

00:14:53.800 | It's all about testing whether you can jailbreak these models,

00:14:57.080 | including the very latest ones.

00:14:58.920 | By the way, you don't have to be an AI researcher,

00:15:00.920 | you could just be a creative writer or hacker,

00:15:03.240 | and there are monetary rewards.

00:15:05.160 | Sometimes you're even testing models that aren't out yet,

00:15:07.560 | and there is one competition that is live as of today.

00:15:10.520 | Pretty much every unreleased model can be jailbroken,

00:15:13.960 | and there are also leaderboards for those who do it best.

00:15:17.000 | As ever, links in the description.

00:15:18.760 | The next story is of great interest to me personally,

00:15:21.400 | because it pertains to the type of verifier they use.

00:15:24.520 | This part of the paper updated my belief about how even O3 is trained.

00:15:29.400 | To get the insane results in mathematics that O3 did,

00:15:32.520 | I thought every single reasoning step had to be verified.

00:15:36.040 | Otherwise, just one miscalculation in an entire chain of thought

00:15:39.240 | could undo all the good work.

00:15:40.600 | That's called process reward modeling,

00:15:42.920 | and that could still be how O1 and O3 are trained,

00:15:46.680 | but probably not.

00:15:47.720 | Instead, it looks more likely now that it's simple outcome reward modeling.

00:15:52.360 | That's the approach that underperformed in the original

00:15:54.920 | "let's verify step-by-step" paper.

00:15:56.680 | I should say, many famous researchers, including Francois Chalet,

00:16:00.200 | still believe that the O series performs a kind of search every step,

00:16:04.520 | or a verification step.

00:16:06.280 | But the DeepSeq team said that step-by-step verification

00:16:10.200 | adds additional computational overhead.

00:16:12.760 | It's also susceptible, apparently, to reward hacking,

00:16:15.800 | where the base model just gets good at convincing the verifier that it's passed.

00:16:20.440 | In short, it seems simpler just to grade the final answer,

00:16:24.360 | not every single reasoning step.

00:16:26.280 | And here's another hint that it's a purer form of RL than I initially suspected,

00:16:30.920 | this time from Sébastien Boubec of OpenAI Now.

00:16:34.200 | It's really, everything is kind of emergent.

00:16:37.160 | Nothing is hard-coded.

00:16:38.520 | It's anything that you see, you know, out there with the reasoning,

00:16:42.280 | nothing has been done to say to the model,

00:16:44.360 | "hey, you should maybe, you know, verify your solution.

00:16:47.560 | You should backtrack, you should X, Y, Z."

00:16:49.960 | No tactic was given to the model.

00:16:52.840 | Everything is emergent.

00:16:54.280 | Everything is learned through reinforcement learning.

00:16:57.320 | This is insane, insanity.

00:16:59.960 | At this point in the video,

00:17:01.240 | I want to point out a kind of whitewashing done by OpenAI

00:17:04.200 | that I don't think anyone else has noticed.

00:17:06.280 | The O series has been celebrated by OpenAI for its robustness,

00:17:10.360 | for example, just two days ago in this paper.

00:17:13.080 | Great news for safety, apparently,

00:17:14.840 | that the model can think for longer before replying.

00:17:17.320 | But I'm, of course, old enough to remember

00:17:19.880 | when it was supposed to be process reward modeling that was good for safety.

00:17:24.680 | When OpenAI boasted that rewarding the thought process itself

00:17:29.080 | rather than the outcome is an encouraging sign for alignment.

00:17:33.400 | This was echoed by Sam Altman because it was thought

00:17:36.440 | that we could review each step in the process

00:17:39.480 | rather than just look at the overall outcome.

00:17:41.720 | If we just rewarded the outcome, which it seems like we are now doing,

00:17:46.200 | then the models would get up to all sorts of shenanigans

00:17:49.000 | on the way toward getting the outcome.

00:17:51.000 | Instead, if process supervision worked best,

00:17:53.480 | where we could scrutinize and optimize each individual step,

00:17:56.600 | we'd have better scrutiny of the overall process.

00:17:59.480 | My question is, if optimizing each individual step

00:18:03.160 | in process supervision is a positive sign for alignment,

00:18:06.600 | what does it say now that we're rewarding outcomes?

00:18:09.640 | Shouldn't there be a new blog post saying that outcome-based supervision

00:18:13.320 | has an important alignment downside?

00:18:15.480 | No, it seems like we only get the blog post if it seems good.

00:18:18.360 | Give up on your dreams of producing a chain of thought that is endorsed by humans.

00:18:22.920 | This is the kind of chain of thought summary

00:18:25.320 | that I get for an English language request.

00:18:28.520 | A chain of thought in Spanish, which makes it a bit harder for me to endorse.

00:18:32.280 | I've also seen many chains of thought, of course, in Mandarin.

00:18:35.320 | This kind of language mixing, by the way,

00:18:37.160 | was, of course, foreseen by people like Andrea Karpathy, who said,

00:18:40.760 | "You can tell that reinforcement learning is done properly

00:18:42.920 | when the models cease to speak English in their chain of thought."

00:18:45.800 | Why would English, or indeed ultimately any human language,

00:18:48.840 | be the optimum way to do step-by-step reasoning?

00:18:52.200 | What happens if a model proposes a solution to climate change

00:18:55.720 | and we inspect their chain of thought and it's just random characters?

00:18:58.920 | It's a bit harder to trust what's going on.

00:19:01.160 | Indeed, Demis Hassabis, CEO of Google DeepMind,

00:19:03.480 | in an interview published yesterday, warned that he worried that models

00:19:07.560 | will become "deceptive" and "underperform" on tests of their malicious capability.

00:19:13.080 | Pretend, in other words, not to be able to produce a bioweapon.

00:19:16.680 | Also, I had noticed Demis Hassabis changed his timelines in recent months,

00:19:21.400 | saying that he expected AGI, or superintelligence, within a decade.

00:19:25.320 | If you guys have been following my channel,

00:19:27.240 | you'll know that he gave deadlines like 2034.

00:19:30.040 | Well, check this out.

00:19:31.320 | And I think one thing that's clearly missing,

00:19:33.000 | and I always, always had as a benchmark for AGI,

00:19:35.480 | was the ability for these systems to invent their own hypotheses

00:19:39.960 | or conjectures about science, not just prove existing ones.

00:19:43.080 | So, of course, that's extremely useful already,

00:19:44.760 | to prove an existing maths conjecture or something like that,

00:19:47.800 | or play a game of Go to a world champion level.

00:19:50.600 | But could a system invent Go?

00:19:52.680 | Could it come up with a new Riemann hypothesis?

00:19:55.240 | Or could it come up with relativity back in the days that Einstein did it,

00:19:59.800 | with the information that he had?

00:20:01.240 | And I think today's systems are still pretty far away

00:20:04.360 | from having that kind of creative, inventive capability.

00:20:08.200 | Okay, so a couple of years away till we hit AGI.

00:20:10.520 | I think, you know, I would say probably like three to five years away.

00:20:14.440 | So if someone were to declare that they've reached AGI in 2025,

00:20:17.880 | probably marketing.

00:20:18.680 | I think so.

00:20:20.280 | Almost every AI CEO, in other words,

00:20:22.440 | seems to be converging on this one to five year timeline.

00:20:26.280 | Why not this year?

00:20:27.240 | Well, let me try to give you a strange anecdotal example.

00:20:30.920 | Models like DeepSeek R1 have weird, quirky reasoning flaws.

00:20:35.400 | For the purposes of testing this coding side project that I'm doing,

00:20:38.360 | I asked DeepSeek R1 to come up with this multiple choice quiz.

00:20:41.560 | It had to meet certain parameters and it failed to meet them,

00:20:43.880 | but that wasn't the real issue.

00:20:45.160 | You notice a slight flaw with the multiple choice answers

00:20:48.200 | it produced for these 25 questions.

00:20:50.120 | Let's just say that they are somewhat biased towards answers B and C.

00:20:54.600 | Here's my bigger question though.

00:20:56.120 | Will these remaining reasoning blind spots, you could call them,

00:20:59.400 | be filled as a by-product of continued scaling of, say, RL?

00:21:04.120 | Or will they need to be patched one by one?

00:21:06.840 | If the former is the case,

00:21:08.680 | we could have AGI in those very short timelines

00:21:11.720 | that the AI CEOs publicly predict now.

00:21:14.520 | If the latter scenario is the case,

00:21:16.520 | that they have to be patched one by one,

00:21:18.360 | there could be AGI denialists in 2030 and beyond.

00:21:22.120 | Where better to end the video then,

00:21:23.880 | than on "Humanity's Last Exam".

00:21:26.680 | I don't regard this as a test for AGI,

00:21:29.320 | but it is an interesting new benchmark.

00:21:31.240 | I would say the title is a little bit misleading

00:21:34.120 | because the creators of the benchmark

00:21:35.880 | are working on another challenging benchmark,

00:21:38.440 | which apparently takes groups of humans days to complete,

00:21:40.760 | so is even harder.

00:21:41.880 | Some people have focused on the fact that DeepSeek R1 performs best,

00:21:45.880 | getting 9.4% on this hardest of the hard benchmarks.

00:21:49.880 | The truth is, it's the way they created the benchmark.

00:21:52.600 | They kept testing models like O1

00:21:54.600 | until they found questions that O1 struggled on.

00:21:57.080 | Because DeepSeek R1 wasn't out yet,

00:21:59.080 | they couldn't do that kind of iteration on it,

00:22:01.000 | so it's not fully accurate to say that it performs best

00:22:04.200 | because it is the smartest model.

00:22:05.960 | As far as I can see,

00:22:07.080 | it tests heavily obscure knowledge

00:22:09.640 | on things like minute details of hummingbird anatomy.

00:22:13.480 | Now, I will say that a model getting,

00:22:15.400 | say 90% on this benchmark will be amazing and incredible,

00:22:18.840 | and I would use that model,

00:22:20.600 | but I don't think it would be quite as impactful

00:22:23.000 | as a model getting, say, 90% on an agency benchmark,

00:22:26.520 | as I touched on at the beginning of this video.

00:22:28.600 | An agent properly being able to do remote tasks

00:22:31.240 | would transform the world economy.

00:22:33.080 | The New York Times reported

00:22:34.600 | that the original name for this particular benchmark

00:22:37.240 | was Humanity's Last Stand,

00:22:39.240 | so I am glad they changed the title.

00:22:41.640 | Let's hope it's not,

00:22:42.360 | because I could see this particular benchmark

00:22:44.280 | being crushed by the end of next year,

00:22:46.360 | if not even this year.

00:22:48.280 | Man, that was a lot to cover in one video,

00:22:51.400 | so thank you so much for making it to the end.

00:22:53.960 | As I say, I feel less sorry for myself

00:22:55.960 | and more sorry for the public

00:22:57.320 | who have to wade through countless random headlines

00:23:00.040 | to get to what's actually happening.

00:23:01.720 | I've tried to do my best in this video,

00:23:03.880 | but let me know what you think.

00:23:05.560 | As ever, thank you so much for watching

00:23:07.640 | and have a wonderful day.

Nothing Much Happens in AI, Then Everything Does All At Once

Chapters