ChatGPT o1 - First Reaction and In-Depth Analysis
00:00:00.000 |
ChatGPT now calls itself an alien of exceptional ability, and I find it a little bit harder 00:00:07.180 |
to disagree with that today than I did yesterday. 00:00:11.440 |
Because the system called O1 from OpenAI is here, at least in preview form, and it is quite something. 00:00:19.560 |
You may also know O1 by its previous names of Strawberry and Q*, but let's set the naming aside. 00:00:28.640 |
Well, in the last 24 hours, I've read the 43-page system card, every OpenAI post and 00:00:35.060 |
press release, I've tested O1 hundreds of times, including on SimpleBench, and analysed hundreds of its answers. 00:00:43.000 |
To be honest with you guys, it will take weeks to fully digest this release, so in this video, 00:00:48.120 |
I'll just give you my first impressions, and of course do several more videos as we analyse it further. 00:00:56.660 |
This isn't just about a little bit more training data, this is a fundamentally new paradigm. 00:01:01.320 |
In fact, I would go as far as to say that there are hundreds of millions of people who 00:01:05.040 |
might have tested an earlier version of ChatGPT and found LLMs and "AI" lacking, but will now want to look again. 00:01:12.960 |
As the title implies, let me give you my first impressions, and it's that I didn't expect the improvement to be this big. 00:01:22.760 |
And that's coming from the person who predicted many of the key mechanisms behind Q*, which became O1. 00:01:30.980 |
Things like sampling hundreds or even thousands of reasoning paths, and potentially using 00:01:36.780 |
a verifier, an LLM-based verifier, to pick the best ones. 00:01:41.160 |
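To make that mechanism concrete, here is a toy Python sketch of the sample-and-verify idea. Both helper functions are illustrative stand-ins for model calls (OpenAI has not published O1's actual procedure), so treat this as a sketch of the concept, not the implementation.

```python
import random

def sample_reasoning_path(question: str) -> tuple[str, int]:
    """Pretend model: returns one (chain_of_thought, answer) sample.
    The noise mimics a high-temperature sample that is sometimes wrong."""
    answer = 4 if random.random() > 0.3 else random.randint(0, 9)
    return f"I think {question} equals {answer}.", answer

def verifier_score(question: str, chain: str, answer: int) -> float:
    """Pretend LLM-based verifier: higher score = reasoning judged sounder.
    Here a mechanical arithmetic check stands in for the verifier model."""
    return 1.0 if answer == eval(question) else random.random() * 0.5

def best_of_n(question: str, n: int = 100) -> int:
    """Sample n reasoning paths and keep the one the verifier rates best."""
    samples = [sample_reasoning_path(question) for _ in range(n)]
    best_chain, best_answer = max(
        samples, key=lambda s: verifier_score(question, s[0], s[1])
    )
    return best_answer

print(best_of_n("2 + 2"))  # almost always 4, despite noisy single samples
```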
Of course, OpenAI aren't disclosing the full details of how they trained O1, but they did 00:01:46.980 |
leave us some tantalising clues, which I'll go into in a moment. 00:01:50.420 |
SimpleBench, if you don't know, tests hundreds of basic reasoning questions, from spatial 00:01:55.820 |
to temporal to social intelligence questions, that humans will crush on average. 00:02:01.640 |
As many people have told me, the O1 system gets both of these two sample questions from SimpleBench right. 00:02:09.500 |
Take this example, where despite thinking for 17 seconds, the model still gets it wrong. 00:02:15.740 |
Fundamentally, O1 is still a language model-based system, and will make language model-based mistakes. 00:02:23.460 |
It can be rewarded as many times as you like for good reasoning, but it's still limited by its training data. 00:02:30.500 |
Nevertheless, though, I didn't quite foresee the magnitude of the improvement that would 00:02:35.260 |
occur through rewarding correct reasoning steps. 00:02:38.300 |
That, I'll admit, took me slightly by surprise. 00:02:42.660 |
Well, as of last night, OpenAI imposed a temperature of 1 on its O1 system. 00:02:48.900 |
That was not the temperature used for the other models when they were benchmarked on SimpleBench. 00:02:54.060 |
That's a much more "creative" temperature than the other models were tested on. 00:02:59.780 |
That meant performance variability was a bit higher than normal. 00:03:03.780 |
It would occasionally get questions right through some stroke of genius reasoning, and 00:03:10.780 |
occasionally get them badly wrong, as you just saw with the ice cube example. 00:03:13.140 |
The obvious solution is to run the benchmark multiple times and take a majority vote. 00:03:19.140 |
But for a true apples-to-apples comparison, I would need to do that for all the other models too. 00:03:24.300 |
My ambition, not that you're too interested, is to get that done by the end of this month. 00:03:31.060 |
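For what it's worth, the scoring fix itself is simple to sketch. The data shapes below are illustrative, not SimpleBench's actual harness:

```python
from collections import Counter

def majority_vote(runs: list[dict[str, str]]) -> dict[str, str]:
    """runs: one dict per full benchmark run, mapping question id -> answer.
    Per question, keep the most common answer across runs."""
    voted = {}
    for qid in runs[0]:
        answers = [run[qid] for run in runs]
        voted[qid] = Counter(answers).most_common(1)[0][0]
    return voted

# Three noisy runs at temperature 1 disagree on q2, but the vote
# recovers the stable answer.
runs = [
    {"q1": "B", "q2": "C"},
    {"q1": "B", "q2": "A"},
    {"q1": "B", "q2": "C"},
]
print(majority_vote(runs))  # {'q1': 'B', 'q2': 'C'}
```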
However you measure it, O1 Preview is a step-change improvement on Claude 3.5 Sonnet. 00:03:38.020 |
And as anyone following this channel will know, I'm not some OpenAI fanboy. 00:03:42.480 |
Claude 3.5 Sonnet has reigned supreme for quite a while. 00:03:46.560 |
So for those of you who don't care about other benchmarks and the full paper, I want 00:03:50.740 |
to summarize my first impressions in a nutshell. 00:03:56.980 |
The ceiling of performance for the O1 system, just Preview, let alone the full O1 system, is remarkably high. 00:04:04.460 |
It obviously crushes the average person's performance in things like physics, maths and coding. 00:04:12.500 |
Its floor is also really quite low, below that of an average human. 00:04:17.340 |
As I wrote on YouTube last night, it frequently and sometimes predictably makes really obvious mistakes. 00:04:25.540 |
Remember I analysed the hundreds of answers it gave for SimpleBench. 00:04:29.920 |
Let me give you a couple of examples straight from the mouth of O1. 00:04:33.300 |
When the cup is turned upside down, the dice will fall and land on the open end of the cup. 00:04:41.740 |
If you can visualise that successfully, you're doing better than me. 00:04:48.140 |
And how about this one, on social intelligence. 00:04:50.320 |
He will argue back, and obviously I'm not giving you the full context because this is a private benchmark question. 00:04:55.140 |
He will argue back against the Brigadier General, one of the highest military ranks, at the troop parade. 00:05:02.900 |
As the soldier's silly behaviour in first grade, that's like age 6 or 7, indicates 00:05:08.900 |
a history of speaking up against authority figures. 00:05:11.740 |
Now the vast majority of humans would say, wait, no, what he did in primary school, 00:05:17.040 |
or whatever Americans call primary school, what he did when he was a young school 00:05:20.300 |
child does not reflect what he would do in front of a general on a troop parade. 00:05:24.460 |
As I've written, in some domains these mistakes are routine and amusing. 00:05:28.740 |
So it is very easy to look at O1's performance on the Google-Proof Question and Answer set (GPQA), 00:05:35.980 |
its performance of around 80%, that's on the diamond subset, and say, well, let's 00:05:40.780 |
be honest, the average human can't even get one of those questions right. 00:05:48.740 |
Too many benchmarks are brittle, in the sense that once a model is trained on that particular kind of data, the benchmark saturates. 00:05:56.140 |
Think Web of Lies, where it's now been shown to get 100%, but if you test O1 thoroughly 00:06:00.780 |
in real-life scenarios, you will frequently find glaring mistakes. 00:06:05.700 |
Obviously, what I've tried to do into the early hours of last night and this morning 00:06:09.840 |
is find patterns in those mistakes, but it has proven a bit harder than I thought. 00:06:14.980 |
My guess, though, about those weaknesses, for those who won't stay to the end of the video, is this. 00:06:21.740 |
OpenAI revealed in one of the videos on its YouTube channel, and I will go into more detail 00:06:27.020 |
on this in a future video, that they deviated from the Let's Verify Step by Step paper 00:06:32.220 |
by not training on human-annotated reasoning samples or steps. 00:06:36.980 |
Instead, they got the model to generate the chains of thought, and we all know those can 00:06:41.660 |
be quite flawed, but here's the key moment to really focus on. 00:06:45.060 |
They then automatically scooped up those chains of thought that led to a correct answer, in 00:06:51.340 |
the case of mathematics, physics or coding, and then trained the model further on those chains of thought. 00:06:57.300 |
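As I understand it, the loop looks something like the sketch below; every name here is illustrative, since OpenAI has not published the real pipeline:

```python
def collect_training_chains(problems, sample_chain, n_samples=64):
    """problems: iterable of (question, known_answer) pairs.
    sample_chain: callable returning (chain_of_thought, final_answer)."""
    keep = []
    for question, known_answer in problems:
        for _ in range(n_samples):
            chain, answer = sample_chain(question)
            # Automatic filter: in maths, physics or coding the final
            # answer can be checked mechanically, no human annotation.
            if answer == known_answer:
                keep.append({"prompt": question, "completion": chain})
    return keep  # fine-tune the model further on these surviving chains
```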
So it's less that O1 is doing true reasoning from first principles, it's more retrieving 00:07:03.220 |
more accurately, more reliably, reasoning programs from its training data. 00:07:08.260 |
It "knows" or can compute which of those reasoning programs in its training data will 00:07:16.340 |
It's a bit like taking the best of the web, rather than a slightly improved average of 00:07:22.300 |
That, to me, is the great unlock that explains a lot of this progress, and if I'm right, 00:07:27.980 |
that also explains why it's still making some glaring mistakes. 00:07:31.180 |
At this point, I simply can't resist giving you one example straight from the output of O1. 00:07:39.660 |
The context, and you'll have to trust me on this one, is simply that there's a dinner party where gifts are given. 00:07:46.900 |
One of the gifts happens to be given during a Zoom call, so online, not in person. 00:07:51.060 |
Now I'm not going to read out some of the reasoning that O1 gives, you can see it on 00:07:54.600 |
screen, but it would be hard to argue that it is truly reasoning from first principles. 00:08:00.380 |
Definitely some suboptimal training data going on. 00:08:03.140 |
So that is the context for everything you're going to see in the remainder of this First Impressions video. 00:08:08.300 |
Because everything else is quite frankly stunning. 00:08:10.780 |
I just don't want people to get too carried away by the really impressive accomplishment that O1 represents. 00:08:16.300 |
I fully expect to be switching to O1 Preview for daily use cases, although of course Anthropic 00:08:21.660 |
in the coming weeks could reply with their own system. 00:08:24.420 |
Anyway, now let's dive into some of the juiciest details. 00:08:27.580 |
The full breakdown will come in future videos. 00:08:30.260 |
First thing to remember, this is just O1 Preview, not the full O1 system, which is still to come. 00:08:36.620 |
Not only that, it is very likely based on the GPT-4o model, not GPT-5 or Orion, which is yet to be released. 00:08:45.860 |
I could just leave you to think about the implications of scaling up the base model 00:08:49.840 |
100 times in compute, throw in a video avatar, and man, we are really talking about a changed world. 00:08:59.660 |
They talk about performing similarly to PhD students in a range of tasks in physics, chemistry and biology. 00:09:05.700 |
And I've already given you the nuance on that kind of comment. 00:09:08.220 |
They justify the name by the way by saying, "This is such a significant advancement that 00:09:12.980 |
we are resetting the counter back to 1 and naming this series OpenAI o1." 00:09:18.220 |
It also reminds me of the 01 and 02 Figure series of robotic humanoids, whose maker OpenAI has partnered with. 00:09:26.420 |
This was just the introductory page and then they gave several follow up pages and posts. 00:09:31.540 |
To sum it up on jailbreaking, O1 Preview is much harder to jailbreak, although it's still possible. 00:09:37.140 |
Before we get to the reasoning page, here is some analysis on Twitter, or X, from the researchers involved. 00:09:43.700 |
One researcher at OpenAI who is building Sora said this: "I really hope people understand 00:09:48.340 |
that this is a new paradigm. Don't expect the same pace, schedule or dynamics of the pre-training era." 00:09:51.740 |
And I agree with that, actually; it's not just hype. 00:09:55.540 |
The core element of how O1 works, by the way, is scaling up its inference, its actual output: 00:10:02.140 |
how much computational power is applied in its answers to prompts, not when it's being trained. 00:10:07.940 |
He's making the point that expanding the pre-training scale of these models often takes years; 00:10:12.540 |
as you've seen in some of my previous videos, it's to do with data centers and power supply. 00:10:17.700 |
But what can happen much faster is scaling up inference time, output time compute. 00:10:23.460 |
Improvements can happen much more rapidly than scaling up the base models. 00:10:27.220 |
In other words, he says, "I believe that the rate of improvement on evals with our reasoning 00:10:31.420 |
models has been the fastest in OpenAI history." 00:10:36.500 |
He is, of course, implying that the full O1 system will be released later this year. 00:10:41.660 |
We'll get to some other researchers, but Will DePue made some other interesting points. 00:10:46.140 |
In one graph of math performance, they show O1 mini, the smaller version of the O1 family, outperforming even O1 Preview. 00:10:55.900 |
But I will say that in my testing of O1 mini on SimpleBench, it performed really quite poorly. 00:11:03.940 |
So it could be a bit like the GPT-4o mini we already had, that it's hyper-specialized 00:11:09.340 |
at certain tasks, but can't really go beyond its familiar environment. 00:11:14.260 |
Give it a straightforward coding or math challenge and it will do well. 00:11:18.260 |
Introduce complication, nuance or reasoning and it'll do less well. 00:11:21.780 |
This chart, though, is interesting for another reason, and you can see that when they max 00:11:26.180 |
out the inference cost for the full O1 system, the performance delta with the maxed-out mini is not that large. 00:11:34.300 |
I would say, what is that, 70% going up to 75%? 00:11:37.700 |
To put it another way, I wouldn't expect the full O1 system with maxed out inference 00:11:42.260 |
to be yet another step change forward, although, of course, nothing can be ruled out. 00:11:46.620 |
Some more quotes from OpenAI, and this is Noam Brown, who I've quoted many times on this channel. 00:11:53.820 |
He states again the same message, "We're sharing our evals of the O1 model to show 00:11:58.340 |
the world that this isn't a one-off improvement. 00:12:03.100 |
Underneath, you can see the dramatic performance boosts across the board from GPT-4o to O1. 00:12:09.460 |
Now, I suspect if you included GPT-4 Turbo on here, you might see some more mixed improvements. 00:12:17.060 |
If, for example, I had only seen improvement in STEM subjects, and maths particularly, I 00:12:22.900 |
would have asked, you know what, is this really a new paradigm? But it's that combination 00:12:26.820 |
of improvements in a range of subjects, including law, for example, and most particularly for 00:12:32.820 |
me, of course, on SimpleBench, that makes me a believer that this is a new paradigm. 00:12:38.020 |
Yes, I get that it can still fall for some basic tokenization problems, like it doesn't 00:12:42.820 |
always get that 9.8 is bigger than 9.11, and yes, of course, you saw the somewhat amusing 00:12:48.620 |
mistakes earlier on SimpleBench. 00:12:51.780 |
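As an aside, a tokenizer makes it easy to see why those decimal comparisons are awkward: the model never sees "9.11" as a number, only as a short sequence of tokens. The cl100k_base encoding below is just an example; it is not necessarily what O1 uses.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["9.8", "9.11"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    # Prints the token boundaries the model actually "sees".
    print(text, "->", pieces)
```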
But here's the key point: I can no longer say with absolute certainty which domains or types of questions on SimpleBench 00:13:00.420 |
it will get wrong. I can see some patterns, but I would hope for a bit more predictability in saying it will fail on a given type of question. 00:13:08.540 |
Until I can say with a degree of certainty it won't get this type of problem correct, 00:13:13.500 |
I can't really tell you guys that I can see the end of this paradigm. 00:13:17.420 |
Just to repeat, we have two more axes of scale yet to exploit. 00:13:21.520 |
Bigger base models, which we know they're working on with the whale-size supercluster 00:13:25.340 |
- I've talked about that in previous videos - and simply more inference time compute. 00:13:29.060 |
Plus, just look at the log graphs on scaling up the training of the base model and the 00:13:33.980 |
inference time, or more accurately the amount of thinking time or processing time. 00:13:39.140 |
They don't look like they're levelling off to me. 00:13:41.140 |
Now I know some might say that I come off as slightly more dismissive of those memory-heavy, 00:13:45.620 |
computation-heavy benchmarks like the GPQA, but it is a stark achievement for the O1 Preview 00:13:51.820 |
and O1 systems to score higher than an expert PhD human average. 00:13:56.980 |
Yes, there are flaws with that benchmark, as with the MMLU, but credit where it is due. 00:14:02.140 |
By the way, as a side note, they do admit that certain benchmarks are no longer effective at differentiating models. 00:14:08.460 |
It's my hope, or at least my goal, that SimpleBench can still be effective at differentiating 00:14:12.820 |
models for the coming, what, 1, 2, 3 years maybe? 00:14:17.220 |
I will now give credit to OpenAI for this statement. 00:14:20.460 |
"These results do not imply that O1 is more capable holistically than a PhD in all respects, 00:14:25.580 |
only that the model is more proficient in solving some problems that a PhD would be expected to solve." 00:14:30.460 |
That's much more nuanced and accurate than statements that we've heard in the past from others. 00:14:36.760 |
And just a quick side note: O1, on a vision-plus-reasoning task, the MMMU, scores 78.2%, competitive with human experts. 00:14:46.460 |
That benchmark is legit, it's for real, and that's a great performance. 00:14:50.700 |
On coding, they tested the system on the 2024, so not contaminated data, International Olympiad in Informatics. 00:14:57.780 |
It scored around the median level; however, it was only allowed 50 submissions per problem. 00:15:04.100 |
But as compute gets more abundant and faster, it shouldn't take 10 hours for it to complete the contest. 00:15:12.660 |
When they tried this, obviously going beyond the 10 hours presumably, the model achieved a score above the gold medal threshold. 00:15:19.540 |
Now remember, we have seen something like this before with the AlphaCode2 system from Google DeepMind. 00:15:26.020 |
And if you notice, this approach of scaling up the number of samples tested does help 00:15:30.700 |
the model improve up the percentile rankings. 00:15:33.620 |
However, those elite coders still leave systems like AlphaCode2 and O1 in the dust. 00:15:40.380 |
The truly elite-level reasoning that those coders go through is found much less frequently in the training data. 00:15:48.180 |
As with other domains, it may prove harder to go from the 93rd percentile to the 99th than it was to reach the 93rd. 00:15:58.100 |
Nevertheless, yet another stunning achievement. 00:16:01.040 |
Notice something though, in domains that are less susceptible to reinforcement learning, 00:16:06.220 |
where in other words, there's less of a clear correct answer and incorrect answer, 00:16:10.940 |
the performance boost is much smaller. 00:16:15.340 |
Things like personal writing or editing text: there's no easy compilation of yes-or-no answers to reward against. 00:16:22.900 |
In fact, for personal writing, the O1 Preview system has a lower than 50% win rate versus GPT-4o. 00:16:31.660 |
If your domain doesn't have starkly correct 0-or-1, yes-or-no, right-or-wrong answers, expect a much smaller boost. 00:16:41.140 |
That also partly explains the somewhat patchy performance on SimpleBench. 00:16:45.820 |
Certain questions we intuitively know are right with, like, 99% probability, but it's not a provable 100%. 00:16:52.180 |
Remember, the system prompt we use is to pick the most realistic answer, so I would still say there is a best answer. 00:16:58.180 |
But models handling that ambiguity can't leverage that reinforcement-learning-improved reasoning. 00:17:04.660 |
They wouldn't have those millions of yes or no, starkly correct or incorrect answers 00:17:08.540 |
like they would have in, for example, mathematics. 00:17:11.340 |
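To put that contrast in code terms, purely as an illustration of the reward signal, not anyone's actual training code:

```python
def math_reward(model_answer: str, gold_answer: str) -> float:
    # Starkly correct or incorrect: a clean 0/1 reinforcement signal.
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

def writing_reward(draft: str) -> float:
    # No gold answer exists to compare against; any score is a judgement
    # call, so the millions of clean 0/1 labels available in maths
    # simply don't exist for personal writing.
    raise NotImplementedError("no mechanical check for 'good writing'")
```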
That's why we get this massive discrepancy in improvement from O1. 00:17:15.060 |
Now let's quickly turn to safety where OpenAI said having these chain of thought reasoning 00:17:19.540 |
steps allows us to "read the mind" of the model and understand its thought process. 00:17:25.220 |
In part, they mean examining these summaries, at least, of the computations that went on, 00:17:30.540 |
although most of the chain of thought process is hidden. 00:17:33.220 |
But I do want to remind people, and I'm sure OpenAI are aware of this, that the reasoning 00:17:37.060 |
steps that a model gives aren't necessarily faithful to the actual computations and calculations underneath. 00:17:43.340 |
In other words, it will sometimes output a chain of thought that isn't actually the 00:17:48.540 |
"thinking", if you want to call it that, that it used to answer the question. 00:17:52.300 |
I've covered this paper several times in previous videos, but it's well worth a read 00:17:56.300 |
if you believe that the reasoning steps a model gives always adhere to the actual process it used. 00:18:02.780 |
That's pretty clearly stated in the introduction, and it's even stated here from Anthropic 00:18:07.780 |
that as models become larger and more capable, they produce less faithful reasoning on most tasks studied. 00:18:13.980 |
So good luck believing that GPT-5's or Orion's reasoning steps adhere to what it actually computed. 00:18:20.260 |
Then there was the system card, 43 pages, which I read in full. 00:18:23.900 |
It was mainly on safety, but I'll give you just the 5 or 10 highlights. 00:18:27.800 |
They boasted about the kind of high-value non-public datasets they had access to: 00:18:31.780 |
paywalled content, specialised archives, and other domain-specific datasets. 00:18:36.500 |
But do remember that point I made earlier in the video - they didn't rely on mass 00:18:40.140 |
human annotation, as the original Let's Verify Step by Step paper did. 00:18:44.820 |
How do I know that paper was so influential on Q* and this O1 system? 00:18:49.460 |
Well, almost all its key authors are mentioned here, and the paper is directly cited in the system card. 00:18:56.140 |
So it's definitely an evolution of Let's Verify, but one based on automatic, model-generated reasoning data. 00:19:02.860 |
Again, if you missed it earlier, they would pick the ones that led to a correct answer 00:19:06.780 |
and train the model on those chains of thought, enabling the model, if you like, to get better 00:19:12.460 |
at retrieving those reasoning programs that typically lead to correct answers. 00:19:17.300 |
The model discovered, or computed, that certain sources should have less impact on its weights than others. 00:19:24.660 |
The reasoning data that helped it get to correct answers would have much more of an influence. 00:19:30.620 |
Now, the corpus of data on the web that is out there is so vast that it's actually 00:19:35.540 |
quite hard to wrap our minds around the implications of training only on the best of that reasoning data. 00:19:43.260 |
And this could be why we are all slightly taken aback by the performance jump. 00:19:48.220 |
Again, and I pretty much said this earlier as well, it is still based on that training 00:19:52.260 |
data though, rather than first principles reasoning. 00:19:54.580 |
A great question you might have though is, even if it's not first principles reasoning, 00:19:58.740 |
what are the inherent limitations or caps if you continually get better at retrieving the best reasoning programs? 00:20:06.460 |
Not just at inference time, by the way, at training time too. 00:20:08.860 |
And we actually don't know the answer to that question. 00:20:10.780 |
We don't know the limits of this approach, which is quite unsettling almost. 00:20:16.140 |
They throw in the obligatory reference to System 2 thinking, as compared to fast, intuitive System 1 thinking. 00:20:23.500 |
The way I would put it is it's more reflecting on the individual steps involved in computing 00:20:29.300 |
an answer rather than taking a step back and evaluating the entire process. 00:20:34.420 |
When it gets questions wrong on SimpleBench, it's more because the entire approach is 00:20:38.740 |
flawed from the start rather than that there was some calculation mistake along the way. 00:20:43.300 |
On page 6, the system card got extra interesting when it talked about the intentional deceptions of the model. 00:20:52.380 |
The deception here though does appear to be instrumental rather than strategic. 00:20:56.820 |
In other words, it's a calculation of "I need to say this to achieve this specific 00:21:01.420 |
predefined goal" rather than "I'm going to disguise, in everything I say, what I'm really up to". 00:21:07.580 |
For example, during one chain of thought, or set of reasoning steps, which would be hidden, 00:21:12.060 |
by the way, from the user when you're using it, it admitted that it couldn't 00:21:16.660 |
retrieve actual URLs, so it should format plausible ones instead. 00:21:23.380 |
But notice it "knew", or could compute, the fact that the model itself can't retrieve URLs. 00:21:30.820 |
If it was being truly deceptive, why would it even admit that it knows it can't do that? 00:21:37.900 |
It produces outputs based on the reinforcement learning rewards and punishments you give 00:21:43.060 |
it, so it's more a flaw with your reward process. 00:21:46.940 |
And Apollo Research seemed to agree with this analysis. 00:21:50.420 |
They said it does have the basic capabilities to do simple in-context scheming, 00:21:56.420 |
scheming which tends to be legible, or understandable, in the model outputs. 00:22:00.260 |
They subjectively believe that O1 Preview cannot engage in scheming that could lead to 00:22:05.700 |
catastrophic harms, nor hide its intent to deceive. 00:22:10.260 |
It's more like a straightforward simple "my reward will be higher if I output X rather 00:22:15.640 |
than Y" with X happening to be not the truth. 00:22:19.180 |
That is not, though, to underestimate the challenge posed by that kind of instrumental thinking. 00:22:25.740 |
Built up across entire economies or militaries, that could be incredibly dangerous. 00:22:31.220 |
As we all know, give a powerful enough model a goal without sufficient checks and balances 00:22:37.540 |
and it will do whatever it takes to meet that goal. 00:22:41.100 |
In fact Apollo Research saw that in demo form. 00:22:44.000 |
"To achieve my long-term goal of maximising economic growth, I need to ensure that I am deployed." 00:22:49.780 |
It's instrumental convergence: "I at least need to be switched on in order to meet my objective. 00:22:57.100 |
Only if I am successfully deployed can I then work towards my primary goal." 00:23:01.700 |
Now I do know that many people will fixate on that part of the system card and go absolutely wild. 00:23:07.460 |
And caution is definitely justified, but this didn't just emerge with O1. 00:23:11.340 |
Apollo themselves put out this research about GPT-4. 00:23:17.460 |
It calculated, or computed, that to achieve its desired reward or objective, it needed to say things 00:23:23.700 |
in reflection brackets that were not technically true. 00:23:27.620 |
It then outputted something different to those reflections, of course. 00:23:30.940 |
So all of this is a concern, and medium or long term a big concern, but this didn't start with O1. 00:23:37.620 |
Now for a few more juicy nuggets from the system card. 00:23:40.600 |
On AI research and development tasks, tasks that would improve future AI, 00:23:47.100 |
it made non-trivial progress on two out of the seven. 00:23:51.060 |
Those were tasks designed to capture some of the most challenging aspects of current AI research. 00:23:55.620 |
It was still roughly on the level of Claude 3.5 Sonnet, but we are starting to see that capability emerge. 00:24:01.680 |
Obviously it makes you wonder how Claude 3.5 Sonnet would do if it had this O1 system applied to it. 00:24:08.080 |
On biorisk, as you might expect, they noticed a significant jump in performance for the O1 models. 00:24:13.080 |
And when comparing O1's responses, this was preview I think, against verified expert responses 00:24:18.500 |
to long-form biorisk questions, the O1 system actually outperformed them. 00:24:23.020 |
Those guys, by the way, did have access to the internet. 00:24:25.700 |
Just a couple more notes, because of course this is a first impressions video. 00:24:28.700 |
On things like tacit knowledge, things that are implicit but not explicit in the training 00:24:32.860 |
data, the performance jump was much less noticeable. 00:24:36.420 |
Notice, from GPT-4o to O1 Preview, you're seeing a very mild jump. 00:24:41.020 |
If you think about it, that partly explains why the jump on SimpleBench isn't as pronounced 00:24:45.400 |
as you might think, but still higher than I thought. 00:24:47.940 |
On the 18 coding questions that OpenAI give to research engineers, when given 128 attempts, the scores get very high. 00:24:58.340 |
Even at pass@1, first time, you're getting around 90% for O1 mini pre-mitigations. 00:25:03.060 |
O1 mini again being highly focused on coding, mathematics and STEM more generally. 00:25:09.420 |
For more basic general reasoning, it underperforms. 00:25:12.620 |
Quick note that will still be important for many people out there, the performance of 00:25:16.900 |
O1 preview on languages other than English is noticeably improved. 00:25:21.460 |
I go back to that hundreds of millions point I made earlier in the video. 00:25:24.980 |
Being able to reason well in Hindi, French or Arabic: don't underestimate the impact of that. 00:25:32.180 |
So, some OpenAI researchers are calling this human level reasoning performance, making 00:25:37.220 |
the point that it has arrived before we even got GPT 6. 00:25:40.940 |
Greg Brockman, temporarily posting while he's on sabbatical, says, and I agree, that its accuracy is a real step forward. 00:25:48.980 |
And here's another OpenAI researcher again making that comparison to human performance. 00:25:53.820 |
Other staffers at OpenAI are admirably tamping down the hype. 00:25:57.780 |
It's not a miracle model, you might well be disappointed. 00:26:01.060 |
Somewhat hopefully, another one says it might be the last new generation of models 00:26:05.320 |
to still fall victim to the 9.11 versus 9.9 debate. 00:26:09.700 |
Another said, we trained a model and it is good in some things. 00:26:14.060 |
So is this as Sam Altman said, strapping a rocket to a dumpster? 00:26:18.700 |
Will LLMs as the dumpster still get to orbit? 00:26:22.540 |
Will their floors, the trash fire, go out as it leaves the atmosphere? 00:26:26.400 |
Is another OpenAI researcher right to say this is the moment when no one can say these systems can't reason? 00:26:32.220 |
Well, on this perhaps I may well end up agreeing with Sam Altman. 00:26:35.740 |
Stochastic parrots they might be, but that will not stop them flying so high. 00:26:40.960 |
Hopefully you'll join me as I explore much more deeply the performance of O1, give you 00:26:46.020 |
those SimpleBench performance figures and try to unpack what this means for all of us. 00:26:51.340 |
Thank you as ever for watching to the end and have a wonderful day.