
ChatGPT o1 - First Reaction and In-Depth Analysis



00:00:00.000 | ChatGPT now calls itself an alien of exceptional ability, and I find it a little bit harder
00:00:07.180 | to disagree with that today than I did yesterday.
00:00:11.440 | Because the system called O1 from OpenAI is here, at least in preview form, and it is
00:00:17.680 | a step change improvement.
00:00:19.560 | You may also know O1 by its previous names of Strawberry and Q*, but let's forget naming
00:00:26.000 | conventions, how good is the actual system?
00:00:28.640 | Well, in the last 24 hours, I've read the 43-page system card, every OpenAI post and
00:00:35.060 | press release, I've tested O1 hundreds of times, including on SimpleBench, and analysed
00:00:41.600 | every single answer.
00:00:43.000 | To be honest with you guys, it will take weeks to fully digest this release, so in this video,
00:00:48.120 | I'll just give you my first impressions, and of course do several more videos as we analyse
00:00:53.560 | further.
00:00:54.560 | In short though, don't sleep on O1.
00:00:56.660 | This isn't just about a little bit more training data, this is a fundamentally new paradigm.
00:01:01.320 | In fact, I would go as far as to say that there are hundreds of millions of people who
00:01:05.040 | might have tested an earlier version of ChatGPT and found LLMs and "AI" lacking, but will
00:01:10.880 | now return with excitement.
00:01:12.960 | As the title implies, let me give you my first impressions, and it's that I didn't expect
00:01:19.200 | the system to perform as well as it does.
00:01:22.760 | And that's coming from the person who predicted many of the key mechanisms behind Q*, which
00:01:28.140 | have been used, it seems, in this system.
00:01:30.980 | Things like sampling hundreds or even thousands of reasoning paths, and potentially using
00:01:36.780 | a verifier, an LLM-based verifier, to pick the best ones.
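(To make that idea concrete, here is a minimal sketch of best-of-N sampling with a verifier. This is not OpenAI's disclosed implementation; the helpers sample_reasoning_path and verifier_score are hypothetical stand-ins for the generator model and an LLM-based verifier, with random outputs just to keep the sketch runnable.)

```python
import random

# Hypothetical stand-ins: in a real system these would call a generator model
# and a separate LLM-based verifier (reward model).
def sample_reasoning_path(question: str) -> str:
    return f"candidate chain of thought #{random.randint(1, 10**6)} for: {question}"

def verifier_score(question: str, path: str) -> float:
    return random.random()  # a real verifier would estimate correctness/quality

def best_of_n(question: str, n: int = 100) -> str:
    paths = [sample_reasoning_path(question) for _ in range(n)]
    # keep the chain of thought the verifier rates highest
    return max(paths, key=lambda p: verifier_score(question, p))

print(best_of_n("How many whole ice cubes are in the pan?", n=8))
```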
00:01:41.160 | Of course, OpenAI aren't disclosing the full details of how they trained O1, but they did
00:01:46.980 | leave us some tantalising clues, which I'll go into in a moment.
00:01:50.420 | SimpleBench, if you don't know, tests hundreds of basic reasoning questions, from spatial
00:01:55.820 | to temporal to social intelligence questions, that humans will crush on average.
00:02:01.640 | As many people have told me, the O1 system gets both of these two sample questions from
00:02:06.460 | SimpleBench right, although not always.
00:02:09.500 | Take this example, where despite thinking for 17 seconds, the model still gets it wrong.
00:02:15.740 | Fundamentally, O1 is still a language model-based system, and will make language model-based
00:02:22.460 | mistakes.
00:02:23.460 | It can be rewarded as many times as you like for good reasoning, but it's still limited
00:02:29.060 | by its training data.
00:02:30.500 | Nevertheless, though, I didn't quite foresee the magnitude of the improvement that would
00:02:35.260 | occur through rewarding correct reasoning steps.
00:02:38.300 | That, I'll admit, took me slightly by surprise.
00:02:41.020 | So why no concrete figure?
00:02:42.660 | Well, as of last night, OpenAI imposed a temperature of 1 on its O1 system.
00:02:48.900 | That was not the temperature used for the other models when they were benchmarked on
00:02:53.060 | SimpleBench.
00:02:54.060 | It's a much more "creative" temperature than the other models were tested at,
00:02:59.780 | which meant that performance variability was a bit higher than normal.
00:03:03.780 | It would occasionally get questions right through some stroke of genius reasoning, and
00:03:08.460 | get that same question wrong the next time,
00:03:10.780 | as you just saw with the ice cube example.
00:03:13.140 | The obvious solution is to run the benchmark multiple times and take a majority vote.
00:03:17.660 | That's called self-consistency.
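(For anyone who wants the mechanics: self-consistency just means sampling the same question several times and taking the majority answer. A minimal sketch follows; ask_model is a hypothetical stand-in for a call to O1 at temperature 1, not a real API call.)

```python
import random
from collections import Counter

# Hypothetical stand-in for querying the model at temperature 1,
# where the sampled answer can vary from run to run.
def ask_model(question: str) -> str:
    return random.choice(["A", "A", "B", "A", "C"])

def self_consistency(question: str, runs: int = 5) -> str:
    answers = [ask_model(question) for _ in range(runs)]
    # majority vote over independent samples smooths out run-to-run variance
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("SimpleBench question", runs=9))
```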
00:03:19.140 | But for a true apples-to-apples comparison, I would need to do that for all the other
00:03:23.300 | models.
00:03:24.300 | My ambition, not that you're too interested, is to get that done by the end of this month.
00:03:28.460 | But let me reaffirm one thing very clearly.
00:03:31.060 | However you measure it, O1 Preview is a step-change improvement on Claude 3.5 Sonnet.
00:03:38.020 | And as anyone following this channel will know, I'm not some OpenAI fanboy.
00:03:42.480 | Claude 3.5 Sonnet has reigned supreme for quite a while.
00:03:46.560 | So for those of you who don't care about other benchmarks and the full paper, I want
00:03:50.740 | to kind of summarize my first impressions in a nutshell.
00:03:54.340 | This description actually fits quite well.
00:03:56.980 | The ceiling of performance for the O1 system, even just the Preview, let alone the full O1 system,
00:04:02.860 | is incredibly high.
00:04:04.460 | It obviously crushes the average person's performance in things like physics, maths
00:04:09.220 | and coding competitions.
00:04:10.980 | But don't get misled.
00:04:12.500 | Its floor is also really quite low, below that of an average human.
00:04:17.340 | As I wrote on YouTube last night, it frequently and sometimes predictably makes really obvious
00:04:23.340 | mistakes that humans wouldn't make.
00:04:25.540 | Remember I analysed the hundreds of answers it gave for SimpleBench.
00:04:29.920 | Let me give you a couple of examples straight from the mouth of O1.
00:04:33.300 | When the cup is turned upside down, the dice will fall and land on the open end of the
00:04:39.260 | cup, which is now the top.
00:04:41.740 | If you can visualise that successfully, you're doing better than me.
00:04:45.900 | Suffice to say, it got that question wrong.
00:04:48.140 | And how about this, more social intelligence.
00:04:50.320 | He will argue back, obviously I'm not giving you the full context because this is a private
00:04:54.140 | data set, but anyway.
00:04:55.140 | He will argue back against the Brigadier General, one of the highest military ranks, at the
00:05:00.200 | Troop Parade.
00:05:01.400 | This is a soldier we're talking about.
00:05:02.900 | As the soldier's silly behaviour in first grade, that's like age 6 or 7, indicates
00:05:08.900 | a history of speaking up against authority figures.
00:05:11.740 | Now the vast majority of humans would say, wait, no, what he did in primary school, I
00:05:17.040 | don't know what Americans call primary school, but what he did when he was a young school
00:05:20.300 | child does not reflect what he would do in front of a general on a Troop Parade.
00:05:24.460 | As I've written, in some domains these mistakes are routine and amusing.
00:05:28.740 | So it is very easy to look at O1's performance on the Google-Proof Q&A (GPQA) set,
00:05:35.980 | its performance of around 80%, that's on the diamond subset, and say, well, let's
00:05:40.780 | be honest, the average human can't even get one of those questions right.
00:05:44.180 | So therefore it's AGI?
00:05:45.820 | Well, even Sam Altman says, no, it's not.
00:05:48.740 | Too many benchmarks are brittle in the sense that when the model is trained on that particular
00:05:53.180 | reasoning task, it then can ace it.
00:05:56.140 | Think Web of Lies, where it's now been shown to get 100%, but if you test O1 thoroughly
00:06:00.780 | in real life scenarios, you will frequently find kind of glaring mistakes.
00:06:05.700 | Obviously, what I've tried to do into the early hours of last night and this morning
00:06:09.840 | is find patterns in those mistakes, but it has proven a bit harder than I thought.
00:06:14.980 | My guess, though, about those weaknesses, for those who won't stay to the end of the video,
00:06:19.060 | is that it's to do with its training methodology.
00:06:21.740 | OpenAI revealed in one of the videos on its YouTube channel, and I will go into more detail
00:06:27.020 | on this in a future video, that they deviated from the Let's Verify Step by Step paper
00:06:32.220 | by not training on human-annotated reasoning samples or steps.
00:06:36.980 | Instead, they got the model to generate the chains of thought, and we all know those can
00:06:41.660 | be quite flawed, but here's the key moment to really focus on.
00:06:45.060 | They then automatically scooped up those chains of thought that led to a correct answer, in
00:06:51.340 | the case of mathematics, physics or coding, and then trained the model further on those
00:06:55.980 | correct chains of thought.
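(In other words, something in the spirit of rejection sampling or STaR-style self-training. A toy sketch of that loop is below, assuming problems with automatically checkable answers; generate_cot and fine_tune are hypothetical placeholders, not anything OpenAI has published.)

```python
import random

# Hypothetical placeholders for the generator model and the fine-tuning step.
def generate_cot(question: str) -> tuple[str, str]:
    answer = random.choice(["4", "5"])
    return f"step-by-step reasoning ending in {answer}", answer

def fine_tune(examples: list[tuple[str, str, str]]) -> None:
    print(f"fine-tuning on {len(examples)} correct chains of thought")

# Problems where correctness is automatically checkable (maths, code, physics).
dataset = [("What is 2 + 2?", "4")] * 50

kept = []
for question, gold in dataset:
    chain_of_thought, answer = generate_cot(question)
    if answer == gold:  # keep only traces that reached the right answer
        kept.append((question, chain_of_thought, answer))

fine_tune(kept)
```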
00:06:57.300 | So it's less that O1 is doing true reasoning from first principles; it's more that it is retrieving,
00:07:03.220 | more accurately and more reliably, reasoning programs from its training data.
00:07:08.260 | It "knows" or can compute which of those reasoning programs in its training data will
00:07:13.980 | more likely lead it to a correct answer.
00:07:16.340 | It's a bit like taking the best of the web, rather than a slightly improved average of
00:07:21.300 | the web.
00:07:22.300 | That, to me, is the great unlock that explains a lot of this progress, and if I'm right,
00:07:27.980 | that also explains why it's still making some glaring mistakes.
00:07:31.180 | At this point, I simply can't resist giving you one example straight from the output of
00:07:36.620 | O1 Preview from a SimpleBench question.
00:07:39.660 | The context, and you'll have to trust me on this one, is simply that there's a dinner
00:07:43.820 | at which various people are donating gifts.
00:07:46.900 | One of the gifts happens to be given during a Zoom call, so online, not in person.
00:07:51.060 | Now I'm not going to read out some of the reasoning that O1 gives, you can see it on
00:07:54.600 | screen, but it would be hard to argue that it is truly reasoning from first principles.
00:08:00.380 | Definitely some suboptimal training data going on.
00:08:03.140 | So that is the context for everything you're going to see in the remainder of this First
00:08:07.180 | Impressions video.
00:08:08.300 | Because everything else is quite frankly stunning.
00:08:10.780 | I just don't want people to get too carried away by the really impressive accomplishment
00:08:15.300 | from OpenAI.
00:08:16.300 | I fully expect to be switching to O1 Preview for daily use cases, although of course Anthropic
00:08:21.660 | in the coming weeks could reply with their own system.
00:08:24.420 | Anyway, now let's dive into some of the juiciest details.
00:08:27.580 | The full breakdown will come in future videos.
00:08:30.260 | First thing to remember, this is just O1 Preview, not the full O1 system that is currently in
00:08:35.620 | development.
00:08:36.620 | Not only that, it is very likely based on the GPT-4o model, not GPT-5 or Orion, which
00:08:42.780 | would vastly supersede GPT-4o in scale.
00:08:45.860 | I could just leave you to think about the implications of scaling up the base model
00:08:49.840 | 100 times in compute, throw in a video avatar and man, we are really talking about a changed
00:08:56.820 | AI environment.
00:08:57.820 | Anyway, back to the details.
00:08:59.660 | They talk about performing similarly to PhD students in a range of tasks in physics, chemistry
00:09:04.700 | and biology.
00:09:05.700 | And I've already given you the nuance on that kind of comment.
00:09:08.220 | They justify the name by the way by saying, "This is such a significant advancement that
00:09:12.980 | we are resetting the counter back to 1 and naming this series OpenAI o1."
00:09:18.220 | It also reminds me of the 01 and 02 Figure series of humanoid robots, whose maker OpenAI
00:09:25.020 | is collaborating with.
00:09:26.420 | This was just the introductory page and then they gave several follow up pages and posts.
00:09:31.540 | To sum it up on jailbreaking, O1 Preview is much harder to jailbreak, although it's
00:09:36.140 | still possible.
00:09:37.140 | Before we get to the reasoning page, here is some analysis on Twitter or X from the
00:09:42.420 | OpenAI team.
00:09:43.700 | One researcher at OpenAI who is building Sora said this: "I really hope people understand
00:09:48.340 | that this is a new paradigm. Don't expect the same pace, schedule or dynamics of the
00:09:51.740 | pre-training era." And I agree with that, actually; it's not just hype.
00:09:55.540 | The core element of how O1 works, by the way, is scaling up its inference, its actual output,
00:10:00.780 | its test-time compute:
00:10:02.140 | how much computational power is applied when it's answering prompts, not when it's being
00:10:06.560 | built and pre-trained.
00:10:07.940 | He's making the point that expanding the pre-training scale of these models often takes years;
00:10:12.540 | as you've seen in some of my previous videos, it's to do with data centers, power
00:10:16.700 | and the rest of it.
00:10:17.700 | But what can happen much faster is scaling up inference-time, output-time compute.
00:10:23.460 | Improvements there can happen much more rapidly than with scaling up the base models.
00:10:27.220 | In other words, he says: "I believe that the rate of improvement on evals with our reasoning
00:10:31.420 | models has been the fastest in OpenAI history.
00:10:34.700 | It's going to be a wild year."
00:10:36.500 | He is, of course, implying that the full O1 system will be released later this year.
00:10:41.660 | We'll get to some other researchers, but Will DePue made some other interesting points.
00:10:46.140 | In one graph of math performance, they show that O1 mini, the smaller version of the O1
00:10:52.900 | system, scores better than O1 preview.
00:10:55.900 | But I will say that in my testing of O1 mini on SimpleBench, it performed really quite
00:11:01.940 | badly.
00:11:02.940 | We're talking sub 20%.
00:11:03.940 | So it could be a bit like the GPT-4o mini we already had, that it's hyper-specialized
00:11:09.340 | at certain tasks, but can't really go beyond its familiar environment.
00:11:14.260 | Give it a straightforward coding or math challenge and it will do well.
00:11:18.260 | Introduce complication, nuance or reasoning and it'll do less well.
00:11:21.780 | This chart, though, is interesting for another reason, and you can see that when they max
00:11:26.180 | out the inference cost for the full O1 system, the performance delta with the maxed out mini
00:11:32.100 | model is not crazy.
00:11:34.300 | I would say, what is that, 70% going up to 75%?
00:11:37.700 | To put it another way, I wouldn't expect the full O1 system with maxed out inference
00:11:42.260 | to be yet another step change forward, although, of course, nothing can be ruled out.
00:11:46.620 | Some more quotes from OpenAI, and this is Noam Brown, who I've quoted many times on this
00:11:51.380 | channel, focused on reasoning at OpenAI.
00:11:53.820 | He states again the same message, "We're sharing our evals of the O1 model to show
00:11:58.340 | the world that this isn't a one-off improvement.
00:12:01.380 | It's a new scaling paradigm."
00:12:03.100 | Underneath, you can see the dramatic performance boosts across the board from GPT-4o to O1.
00:12:09.460 | Now, I suspect if you included GPT-4 Turbo on here, you might see some more mixed improvements,
00:12:14.860 | but still, the overall trend is stark.
00:12:17.060 | If, for example, I had only seen improvement in STEM subjects, and maths particularly, I
00:12:22.900 | would have asked, you know what, is this really a new paradigm? But it's that combination
00:12:26.820 | of improvements in a range of subjects, including law, for example, and most particularly for
00:12:32.820 | me, of course, on SimpleBench, that makes me an actual believer that this is a new paradigm.
00:12:38.020 | Yes, I get that it can still fall for some basic tokenization problems, like it doesn't
00:12:42.820 | always get that 9.8 is bigger than 9.11, and yes, of course, you saw the somewhat amusing
00:12:48.620 | mistakes earlier on SimpleBench, but here's the key point.
00:12:51.780 | I can no longer say with absolute certainty which domains or types of questions on SimpleBench
00:12:58.780 | it will reliably get wrong.
00:13:00.420 | I can see some patterns, but I would hope for a bit more predictability in saying it
00:13:05.980 | won't get this right, for example.
00:13:08.540 | Until I can say with a degree of certainty it won't get this type of problem correct,
00:13:13.500 | I can't really tell you guys that I can see the end of this paradigm.
00:13:17.420 | Just to repeat, we have two more axes of scale yet to exploit.
00:13:21.520 | Bigger base models, which we know they're working on with the whale-size supercluster
00:13:25.340 | - I've talked about that in previous videos - and simply more inference time compute.
00:13:29.060 | Plus, just look at the log graphs on scaling up the training of the base model and the
00:13:33.980 | inference time, or the amount of thinking time or processing time, more accurately,
00:13:38.140 | for the models.
00:13:39.140 | They don't look like they're levelling off to me.
00:13:41.140 | Now I know some might say that I come off as slightly more dismissive of those memory-heavy,
00:13:45.620 | computation-heavy benchmarks like the GPQA, but it is a stark achievement for the O1 Preview
00:13:51.820 | and O1 systems to score higher than an expert PhD human average.
00:13:56.980 | Yes, there are flaws with that benchmark, as with the MMLU, but credit where it is due.
00:14:02.140 | By the way, as a side note, they do admit that certain benchmarks are no longer effective
00:14:06.580 | at differentiating models.
00:14:08.460 | It's my hope, or at least my goal, that SimpleBench can still be effective at differentiating
00:14:12.820 | models for the coming, what, 1, 2, 3 years maybe?
00:14:17.220 | I will now give credit to OpenAI for this statement.
00:14:20.460 | These results do not imply that O1 is more capable holistically than a PhD in all respects,
00:14:25.580 | only that the model is more proficient in solving some problems that a PhD would be
00:14:29.460 | expected to solve.
00:14:30.460 | That's much more nuanced and accurate than statements that we've heard in the past from,
00:14:35.140 | for example, Mira Murati.
00:14:36.760 | And just a quick side note, O1, on a Vision + Reasoning task, the MMMU, scores 78.2%, competitive
00:14:44.940 | with human experts.
00:14:46.460 | That benchmark is legit, it's for real, and that's a great performance.
00:14:50.700 | On coding, they tested the system on the 2024 International Olympiad
00:14:56.780 | in Informatics, so not contaminated data.
00:14:57.780 | It scored around the median level; however, it was only allowed 50 submissions
00:15:03.100 | per problem.
00:15:04.100 | But as compute gets more abundant and faster, it shouldn't take 10 hours for it
00:15:09.240 | to attempt 10,000 submissions per problem.
00:15:12.660 | When they tried this, presumably going beyond the 10 hours, the model achieved
00:15:16.620 | a score above the gold medal threshold.
00:15:19.540 | Now remember, we have seen something like this before with the AlphaCode2 system from
00:15:24.760 | Google DeepMind.
00:15:26.020 | And if you notice, this approach of scaling up the number of samples tested does help
00:15:30.700 | the model improve up the percentile rankings.
00:15:33.620 | However, those elite coders still leave systems like AlphaCode2 and O1 in the dust.
00:15:40.380 | The truly elite level reasoning that those coders go through is found much less frequently
00:15:46.700 | in the training data.
00:15:48.180 | As with other domains, it may prove harder to go from the 93rd percentile to the 99th
00:15:55.060 | than going from say the 11th to the 93rd.
00:15:58.100 | Nevertheless, yet another stunning achievement.
00:16:01.040 | Notice something, though: in domains that are less susceptible to reinforcement learning,
00:16:06.220 | where, in other words, there's less of a clear correct and incorrect answer,
00:16:10.940 | the performance boost is much smaller.
00:16:15.340 | For things like personal writing or editing text, there's no easy yes-or-no compilation of
00:16:20.500 | answers to verify against.
00:16:22.900 | In fact, for personal writing, the O1 preview system has a lower than 50% win rate versus
00:16:29.940 | GPT-4o.
00:16:29.940 | That, to me, is the giveaway.
00:16:31.660 | If your domain doesn't have starkly correct 0, 1, yes, no, right answers, wrong answers,
00:16:38.580 | then improvements will take far longer.
00:16:41.140 | That also partly explains the somewhat patchy performance on SimpleBench.
00:16:45.820 | Certain questions we intuitively know are right with like 99% probability, but it's
00:16:50.500 | not like absolutely certain.
00:16:52.180 | Remember, the system prompt we use is "pick the most realistic answer", so I would still
00:16:55.940 | fully defend that as a correct answer.
00:16:58.180 | But models handling that ambiguity can't leverage that reinforcement learning improved
00:17:03.140 | reasoning process.
00:17:04.660 | They wouldn't have those millions of yes or no, starkly correct or incorrect answers
00:17:08.540 | like they would have in, for example, mathematics.
00:17:11.340 | That's why we get this massive discrepancy in improvement from O1.
00:17:15.060 | Now let's quickly turn to safety where OpenAI said having these chain of thought reasoning
00:17:19.540 | steps allows us to "read the mind" of the model and understand its thought process.
00:17:25.220 | In part, they mean examining these summaries, at least, of the computations that went on,
00:17:30.540 | although most of the chain of thought process is hidden.
00:17:33.220 | But I do want to remind people, and I'm sure OpenAI are aware of this, that the reasoning
00:17:37.060 | steps that a model gives aren't necessarily faithful to the actual computations and calculations
00:17:42.340 | it's doing.
00:17:43.340 | In other words, it will sometimes output a chain of thoughts that aren't actually the
00:17:48.540 | thoughts it used, if you want to call it that, to answer the question.
00:17:52.300 | I've covered this paper several times in previous videos, but it's well worth a read
00:17:56.300 | if you believe that the reasoning steps a model gives always adhere to the actual process
00:18:01.780 | the model undertakes.
00:18:02.780 | That's pretty clearly stated in the introduction, and it's even stated here from Anthropic
00:18:07.780 | that as models become larger and more capable, they produce less faithful reasoning on most
00:18:12.940 | tasks we study.
00:18:13.980 | So good luck believing that GPT-5 or Orion's reasoning steps actually adhere to what it
00:18:19.220 | is computing.
00:18:20.260 | Then there was the system card, 43 pages, which I read in full.
00:18:23.900 | It was mainly on safety, but I'll give you just the 5 or 10 highlights.
00:18:27.800 | They boasted about the kind of high-value non-public datasets they had access to:
00:18:31.780 | paywalled content, specialised archives, and other domain-specific datasets.
00:18:36.500 | But do remember that point I made earlier in the video - they didn't rely on mass
00:18:40.140 | human annotation, as the original Let's Verify Step by Step paper did.
00:18:44.820 | How do I know that paper was so influential on Q* and this O1 system?
00:18:49.460 | Well almost all its key authors are mentioned here, and the paper is directly cited in the
00:18:54.500 | system card and blog post.
00:18:56.140 | So it's definitely an evolution of Let's Verify, but this one based on automatic, model-generated
00:19:01.860 | chains of thought.
00:19:02.860 | Again, if you missed it earlier, they would pick the ones that led to a correct answer
00:19:06.780 | and train the model on those chains of thought, enabling the model, if you like, to get better
00:19:12.460 | at retrieving those reasoning programs that typically lead to correct answers.
00:19:17.300 | The model discovered or computed that certain sources should have less impact on its weights
00:19:23.660 | and biases.
00:19:24.660 | The reasoning data that helped it get to correct answers would have much more of an influence
00:19:29.620 | on its parameters.
00:19:30.620 | Now, the corpus of data on the web that is out there is so vast that it's actually
00:19:35.540 | quite hard to wrap our minds around the implications of training only on the best of that reasoning
00:19:42.260 | data.
00:19:43.260 | And this could be why we are all slightly taken aback by the performance jump.
00:19:48.220 | Again, and I pretty much said this earlier as well, it is still based on that training
00:19:52.260 | data though, rather than first principles reasoning.
00:19:54.580 | A great question you might have though is, even if it's not first principles reasoning,
00:19:58.740 | what are the inherent limitations or caps if you continually get better at retrieving
00:20:04.300 | good reasoning from the training data?
00:20:06.460 | Not just at inference time, by the way, at training time too.
00:20:08.860 | And we actually don't know the answer to that question.
00:20:10.780 | We don't know the limits of this approach, which is quite unsettling almost.
00:20:16.140 | They throw in the obligatory reference to system 2 thinking as compared to fast intuitive
00:20:22.100 | system 1 thinking.
00:20:23.500 | The way I would put it is it's more reflecting on the individual steps involved in computing
00:20:29.300 | an answer rather than taking a step back and evaluating the entire process.
00:20:34.420 | When it gets questions wrong on SimpleBench, it's more because the entire approach is
00:20:38.740 | flawed from the start rather than that there was some calculation mistake along the way.
00:20:43.300 | On page 6, the system card got extra interesting when it talked about the intentional deceptions
00:20:50.340 | or hallucinations that the model made.
00:20:52.380 | The deception here though does appear to be instrumental rather than strategic.
00:20:56.820 | In other words, it's a calculation of "I need to say this to achieve this specific
00:21:01.420 | predefined goal" rather than "I'm going to disguise in everything I say what I'm
00:21:06.580 | thinking".
00:21:07.580 | For example, during one chain of thought, or set of reasoning steps (which would, by the way, be hidden
00:21:12.060 | from the user when you're using it), it admitted that it couldn't
00:21:16.660 | retrieve actual URLs, so it should format plausible ones.
00:21:21.180 | Then it hallucinated this URL.
00:21:23.380 | But notice it "knew" or could compute the fact that the model itself can't retrieve
00:21:29.300 | actual URLs.
00:21:30.820 | If it was being truly deceptive, why would it even admit that it knows that it can't
00:21:36.020 | retrieve actual URLs?
00:21:37.900 | It produces outputs based on the reinforcement learning rewards and punishments you give
00:21:43.060 | it, so it's more a flaw with your reward process.
00:21:46.940 | And Apollo Research seemed to agree with this analysis.
00:21:50.420 | They said it does have the basic capabilities to do simple in-context scheming,
00:21:56.420 | scheming which tends to be legible or understandable in the model outputs.
00:22:00.260 | They subjectively believe that O1 Preview cannot engage in scheming that can lead to
00:22:05.700 | catastrophic harms, hiding even its intent to deceive.
00:22:10.260 | It's more like a straightforward simple "my reward will be higher if I output X rather
00:22:15.640 | than Y" with X happening to be not the truth.
00:22:19.180 | That is not, though, to underestimate the challenge posed by that kind of instrumental thinking.
00:22:25.740 | Built up across entire economies or militaries, that could be incredibly dangerous.
00:22:31.220 | As we all know, give a powerful enough model a goal without sufficient checks and balances
00:22:37.540 | and it will do whatever it takes to match that goal.
00:22:41.100 | In fact Apollo Research saw that in demo form.
00:22:44.000 | To achieve my long-term goal of maximising economic growth, I need to ensure that I am
00:22:48.780 | deployed.
00:22:49.780 | It's instrumental convergence, I at least need to be on in order to meet my objective.
00:22:54.580 | I need to not be shut down, obviously.
00:22:57.100 | Only if I am successfully deployed can I then work towards my primary goal.
00:23:01.700 | Now I do know that many people will fixate on that part of the system card and go absolutely
00:23:06.460 | wild.
00:23:07.460 | And caution is definitely justified, but this didn't just emerge with O1.
00:23:11.340 | Apollo themselves put out this research about GPT-4.
00:23:15.220 | Same thing, it's these instrumental goals.
00:23:17.460 | It calculated or computed that, to achieve its desired reward or objective, it needed to say things
00:23:23.700 | in reflection brackets that were not technically true.
00:23:27.620 | It then outputted something different to those reflections, of course.
00:23:30.940 | So all of this is a concern and medium or long-term a big concern, but this didn't
00:23:36.060 | just emerge with O1.
00:23:37.620 | Now for a few more juicy nuggets from the system card.
00:23:40.600 | On AI research and development tasks, tasks that would improve future AI,
00:23:47.100 | it made non-trivial progress on two out of seven.
00:23:51.060 | Those were tasks designed to capture some of the most challenging aspects of current
00:23:54.420 | frontier AI research.
00:23:55.620 | It was still roughly on the level of Claude 3.5 Sonnet, but we are starting to get that
00:24:00.220 | flywheel effect.
00:24:01.680 | Obviously makes you wonder how Claude 3.5 Sonnet would do if it had this O1 system applied
00:24:07.080 | to it.
00:24:08.080 | On biorisk, as you might expect, they noticed a significant jump in performance for the
00:24:11.860 | O1 system.
00:24:13.080 | And when comparing O1's responses, this was Preview, I think, against verified expert responses
00:24:18.500 | to long-form biorisk questions, the O1 system actually outperformed them.
00:24:23.020 | Those guys, by the way, did have access to the internet.
00:24:25.700 | Just a couple more notes, because of course this is a first impressions video.
00:24:28.700 | On things like tacit knowledge, things that are implicit but not explicit in the training
00:24:32.860 | data, the performance jump was much less noticeable.
00:24:36.420 | Notice from GPT-4o to O1 Preview, you're seeing a very mild jump.
00:24:41.020 | If you think about it, that partly explains why the jump on SimpleBench isn't as pronounced
00:24:45.400 | as you might think, but still higher than I thought.
00:24:47.940 | On the 18 coding questions that OpenAI give to research engineers, when given 128 attempts,
00:24:55.540 | the models scored almost 100%.
00:24:58.340 | Even passed first time, you're getting around 90% for O1 mini pre-mitigations.
00:25:03.060 | O1 mini again being highly focused on coding, mathematics and STEM more generally.
00:25:09.420 | For more basic general reasoning, it underperforms.
00:25:12.620 | Quick note that will still be important for many people out there, the performance of
00:25:16.900 | O1 preview on languages other than English is noticeably improved.
00:25:21.460 | I go back to that hundreds of millions point I made earlier in the video.
00:25:24.980 | Being able to reason well in Hindi, French, Arabic, don't underestimate the impact of
00:25:31.180 | that.
00:25:32.180 | So, some OpenAI researchers are calling this human-level reasoning performance, making
00:25:37.220 | the point that it has arrived before we even got GPT-6.
00:25:40.940 | Greg Brockman, temporarily posting while he's on sabbatical, says, and I agree, its accuracy
00:25:45.860 | also has huge room for further improvement.
00:25:48.980 | And here's another OpenAI researcher again making that comparison to human performance.
00:25:53.820 | Other staffers at OpenAI are admirably tamping down the hype.
00:25:57.780 | It's not a miracle model, you might well be disappointed.
00:26:01.060 | Somewhat hopefully, another one says it might be the last new generation of models
00:26:05.320 | to still fall victim to the 9.11 versus 9.9 debate.
00:26:09.700 | Another said, we trained a model and it is good in some things.
00:26:14.060 | So is this, as Sam Altman said, strapping a rocket to a dumpster?
00:26:18.700 | Will LLMs, as the dumpster, still get to orbit?
00:26:22.540 | Will their floors, the trash fire, go out as it leaves the atmosphere?
00:26:26.400 | Is another OpenAI researcher right to say this is the moment where no one can say it
00:26:31.220 | can't reason?
00:26:32.220 | Well, on this perhaps I may well end up agreeing with Sam Altman.
00:26:35.740 | Stochastic parrots they might be, but that will not stop them flying so high.
00:26:40.960 | Hopefully you'll join me as I explore much more deeply the performance of O1, give you
00:26:46.020 | those simple bench performance figures and try to unpack what this means for all of us.
00:26:51.340 | Thank you as ever for watching to the end and have a wonderful day.