Did you miss these 2 AI stories? A *Real* LLM-crafted Breakthrough + Continual Learning Blocked?

Chapters
0:00 Introduction
0:55 C2S
4:48 OpenAI not too far behind (SimpleBench, Codex)
6:37 A Definition of AGI?
11:10 OpenAI Researcher on Continual Learning Problems
13:02 Sora 2 can answer math questions
Okay, I'm going to be honest: AI companies have a set amount of computing power, and at the moment they are spending more of it on scaling up money-making stuff, like browsers and video shorts, than on scaling up frontier performance and IQ points. Hence the feeling among some of a slowdown in progress. No hard feelings, you've got to make the investors some money, but as that maxes out, the story will shortly, hopefully, return to ramping up frontier intelligence: for example, when Gemini 3 comes out from Google DeepMind, expected in the next two months. But none of what I just said means that pretty juicy things aren't happening with our current language model horsepower. So I'm going to start with a novelty produced by a baby LLM, then I'll get to a revealing remark from a top OpenAI researcher, and end with some interesting bits that I found throughout this week.
Let's start with a language model, so often decried for hallucinations, but one that is actually pushing science forward by learning the language of biology. Yes, the model does have a dodgy name, C2S-Scale, but it was able to generate a novel hypothesis for a drug to aid in cancer treatment. What I'm going to try and do is simplify the simplification of this 29-page paper, so here we go. When I say we have a language model that generated a novel hypothesis for a drug to aid in cancer treatment, what I mean is that it was a drug candidate that was not in the literature for being able to help in this way. This model, by the way, was based largely on the open-weights Gemma 2 architecture from Google, released over a year ago. Gemma 3 has come out since, and Gemma 4 is due any time, so it's not the latest Gemma by any stretch. But anyway, this language model was given special doggy training, you could think of it: reinforcement learning rewards for accurately predicting how cells would react to drugs, especially regarding interferon. I know what you're thinking: why would we do that, though? Well, it's to make cold cancer tumors hot, in other words, to make them detectable by the immune system.

Wait, so hold on, this is an LLM that can speak biology? Yes, and English too; you could likely still chat to this model about football. Anyway, C2S-Scale turns each cell's gene activity into a short sentence, if you like, so a large language model can read biology like it does text. Slightly more technically, it can find drugs that reliably amplify the effects of interferon, conditional on there already being some of it present, just not quite enough. In other words, it learns to predict how the immune system will respond to a drug, a bit like how it might learn how an author will continue a sentence.
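To make that "cells as sentences" idea a bit more concrete, here is a minimal sketch of the kind of preprocessing the Cell2Sentence line of work describes: ranking a cell's genes by expression and writing the top gene symbols out as a plain-text "sentence" a language model can read. The gene names, expression values, and prompt format below are made up for illustration; the real pipeline has far more to it.

```python
# Minimal sketch of the "cell sentence" idea (illustrative only).
# A cell's expression profile becomes text by ranking genes from most to
# least expressed and writing out the top gene symbols, so an ordinary
# language model can consume it like a sentence.

def cell_to_sentence(expression: dict[str, float], top_k: int = 8) -> str:
    """Rank genes by expression and join the top-k symbols into text."""
    ranked = sorted(expression, key=expression.get, reverse=True)
    return " ".join(ranked[:top_k])

# Hypothetical expression values for one immune cell (not real data).
cell = {"STAT1": 9.1, "IRF1": 7.4, "B2M": 6.8, "GAPDH": 6.1,
        "CXCL10": 5.9, "HLA-A": 5.2, "ACTB": 4.7, "ISG15": 4.5}

sentence = cell_to_sentence(cell)
print(sentence)  # "STAT1 IRF1 B2M GAPDH CXCL10 HLA-A ACTB ISG15"

# A prompt to the model might then look something like:
prompt = f"Cell: {sentence}\nDrug: <candidate>\nPredict the interferon response:"
```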
Now, just quickly on the name: it's called C2S-Scale, short for Cell2Sentence-Scale, and the authors say scaling the model to 27 billion parameters yields consistent improvements. In case you're not aware, that's still easily five or ten times smaller than the models some of us work with every day. I totally get that there are going to be cynics in the audience, but this is not like those "GPT-5 surfacing existing breakthroughs" stories. If you don't know about that controversy, I won't bore you with it, but this is quite different. The drug candidate, which I'm going to butcher, silmitasertib, was not linked anywhere in the literature for having this capacity. As the authors write, this highlights that the model was generating a new testable hypothesis and not just repeating known facts. Even better, by the way, in case you're wondering: it worked on human cells in a dish. AKA the model's in silico prediction was confirmed multiple times in vitro, in the lab. Obviously, I do have to clarify at this point that it wasn't confirmed on actual human beings; that, alas, will take years and years of testing, as is the way in medicine. But before we leave this story, just imagine with me for a moment this drug being used in concert with others for more effective cancer treatment in just a few years, directly inspired by a language model. That's wild. As the paper's authors say, this result also provides a blueprint for a new kind of biological discovery.

Now, just as I am filming this video, I have noticed that Google has published a story in Nature on quantum computing, and some of you may have thought that this video's title was about that. Note that they say it's paving a path towards potential future uses in drug discovery. That's incredible, and it's all going to change the world in the coming decades, but this C2S result is pretty remarkable right now. Just remember these stories when people say that LLMs will not accelerate science.
Now, it would be pretty lazy of me to just write the narrative that this shows OpenAI is falling further behind, especially when you get reports that Gemini 2.5 Deep Think has just hit record performance on FrontierMath, pretty much the hardest benchmark around for mathematics. But actually, on raw intelligence, I don't think OpenAI is that far behind. And I don't just mean on, for example, my own benchmark, SimpleBench, where I finally got around to testing GPT-5 Pro and it snuck in just under Gemini 2.5 Pro. I know, I know, you guys would want me to test Deep Think, but there's no API for that, so for now you've got this comparison. If you don't know, SimpleBench is about testing the question behind the question, kind of like trick questions, you could think of it, with some spatio-temporal questions thrown in. No, I mean, I think I've spent over 100 hours testing GPT-5 within Codex, and it's just frankly better than any of the alternatives. Definitely compared to those from Google, and yes, even compared to the brand new Claude Code. I do love how you can use Claude Code on your mobile; I think that's epic, thank you, Anthropic. But then, on one of the first queries I had for Claude Code, it attempted to delete a key section of code completely randomly. That's, of course, just anecdotal, but you could take the outputs from Claude Code, ask about them in GPT-5 Codex, and vice versa, and far, far more often it's Claude Code that will say, oh yeah, I'm wrong, sorry, Codex is right. And that is borne out in testing too. And remember, coding is Anthropic's specialty, so OpenAI are really cooking with Codex. That's what I mean when I call OpenAI a beast that is currently choosing to spend its set amount of compute on money-making activities like Sora.

Now, I'll come back to Sora in a moment, but it's time for the second story that I found really interesting in the last few days that I think many people missed.
And the story isn't just that we got a paper by a bunch of famous AI authors proposing a final, conclusive definition of AGI? I say that with a rising tone because I don't think it actually is a conclusive definition of AGI. But no, that isn't quite the story, though it's part of it. The story comes with a link from this paper to something that a top OpenAI researcher recently revealed. But first, what's this paper? Well, basically, it's an answer to what many of you have asked for for a long, long time: a definition of AGI. The big anchor that the paper proposes is a theory of cognitive capacity, the Cattell-Horn-Carroll theory, which they call the most empirically validated model of human cognition. So they basically take what has been proven to assess human cognition and then apply it to models. One of the big headlines is that the resulting AGI scores are GPT-4 at 27% and GPT-5 at 58%. That kind of makes it sound like, just with the same architecture, going up to GPT-6 or 7 would get you to AGI, but that's not what the paper says. The theory breaks cognition down into 10 discrete categories that have survived as factors of cognition for over a century of research. They do not, by the way, include physical dexterity, so that's one huge caveat. Each category is given an equal 10% weighting on the AGI score out of 100; so, for example, reaction time is treated as equally important to general knowledge.
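Just to make the arithmetic explicit, the headline number is simply an equally weighted sum: ten abilities, each contributing up to 10 points. Here is a toy version; the ten ability names are my paraphrase of the areas walked through below, and the per-ability scores are invented for illustration, not the paper's actual figures.

```python
# Toy illustration of an equally weighted "AGI score" across ten abilities.
# Ability names are a paraphrase of the areas discussed in the video; the
# per-ability scores below are invented, not taken from the paper.

ABILITIES = [
    "general knowledge", "reading and writing", "math ability",
    "on-the-spot reasoning", "working memory", "long-term memory storage",
    "long-term memory retrieval", "visual processing",
    "auditory processing", "speed",
]

def agi_score(per_ability: dict[str, float]) -> float:
    """Each ability is capped at 100 and contributes an equal 10% weight."""
    weight = 100 / len(ABILITIES)                     # 10 points per ability
    return sum(min(per_ability.get(a, 0.0), 100.0) / 100.0 * weight
               for a in ABILITIES)

# Hypothetical profile: strong on knowledge and reading, weak on memory.
example = {"general knowledge": 95, "reading and writing": 90, "math ability": 80,
           "on-the-spot reasoning": 60, "working memory": 40,
           "long-term memory storage": 0, "long-term memory retrieval": 10,
           "visual processing": 50, "auditory processing": 70, "speed": 85}

print(f"{agi_score(example):.0f}%")  # 58% for this made-up profile
```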
So, walking through them: general knowledge, and it's definitely true that models know a lot more than they used to. Reading ability, where I think it's pretty fair to say models have come a long way. Math ability, well, we just saw how Gemini Deep Think is breaking records on frontier math benchmarks. On-the-spot reasoning is a bit more hazy; it's kind of a mix of SimpleBench, a traditional IQ test, and a bit of theory of mind, and they give it 7%. Not sure about that. What I mean is that there are so many different ways of testing induction, like ARC-AGI-1, which yields amazing scores for GPT-5, and then ARC-AGI-3, where it performs terribly. Or traditional theory of mind, where the models ace the tests, but then perturbed versions, where you just change a few words, and they fall right down. Then there's working memory and long-term memory storage, and that's where things get really weird. Language models can't remember things, or at least they can within the context of the conversation, but not beyond that. They don't continually learn things on the job, as it were, and that's the key point I'm going to come back to with a quote from OpenAI. Then there's a metric on hallucinations, which again is quite hard to pin down, and visual processing, which is rapidly improving, as you might have seen with the recent DeepSeek paper. Rounding things out, we have listening ability and reaction time, essentially speed; don't know about you, but models can, in general, do things much, much faster than me.

But the one that stands out is lacking memory. The authors say that without the ability to continually learn, AI systems suffer from amnesia, which limits their utility, forcing the AI to relearn context in every interaction. What I would add to that is that because every bit of context adds to the cost of a call to a model, the providers deliberately limit the amount of context that models take in. So they often make huge blunders, because they just don't understand the situation. If you spent more money on more context, they'd understand more of the situation, but again, they wouldn't remember it next time. One is a fundamental limitation, one is a question of cost.
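To put a rough number on that cost point, here is a back-of-the-envelope sketch. The per-token price is a made-up placeholder, not any particular provider's rate; the point is only that context the model cannot remember has to be paid for again on every single call.

```python
# Back-of-the-envelope sketch of why providers cap context (illustrative only).
# The price is a hypothetical placeholder, not any provider's real rate.

PRICE_PER_MILLION_INPUT_TOKENS = 2.00   # assumed $ per 1M input tokens

def call_cost(context_tokens: int, calls: int) -> float:
    """Input-token cost of re-sending the same context on every call."""
    return context_tokens * calls * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000

# An agent that re-reads a 200k-token codebase on each of 500 calls pays
# for that context every single time, because nothing is remembered.
print(f"${call_cost(200_000, 500):,.2f}")   # $200.00 under these assumptions
```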
But will we soon have a solution to the fundamental limitation, continual learning? Well, that's the quote I want to get to. On that nasty cliffhanger, I'm going to bring you today's sponsor, a long-time sponsor of the channel, AssemblyAI, and specifically their universal streaming tool, because I think it's pretty epic. Did you know how accurate these tools have become? For example, it was only recently that certain rival tools couldn't understand when I said things like GPT-5 Pro; they just wouldn't get those letters right, especially in my accent. Now, as you can see, they handle it flawlessly. I think that's incredible, and the link, by the way, is going to be in the description. I follow AI relentlessly and, of course, test almost everything that comes out, and I didn't know these tools had gotten this good, frankly. That was also really fast, I don't know if you noticed. But anyway, link in the description.

Now, on that quote: this is OpenAI's VP of Research, Jerry Tworek. I think it's his first interview, but he mentioned something quite crucial around 58 minutes in. But first, some context. Before we even hear from him, what are some of the obvious limitations of continual learning? Well, the model providers wouldn't be in control of what the models were learning. Like, imagine GPT-6 truly understands your app from the ground up, or maybe your exam from the ground up. Then, on the job, it bakes that into its weights, so you never have to tell it again. That's amazing, right? But if we do that naively, then you can imagine all kinds of sick things that people could train the models to learn.
Interviewer: Is there a concept of, I guess, online RL that happens where, as the agent does something and learns from the real world, the RL happens in real time?

Jerry Tworek: So generally, most of the RL that you hear about with language models is online, but it's done online in a way that is still a training run. It's still being trained kind of separately from the users. There have been a few models in the world, and I've learned recently that I think Cursor is trying to train some models online with their users in the loop, and it's theoretically possible to train models in ChatGPT or every other product, just responding to the users and reinforcing through whatever rewards you get in there. But this is not, as far as I'm aware, what OpenAI is doing at the moment. And it can be great, but it can also be dangerous, because you are not really controlling very much what you are reinforcing in that loop and what could happen.
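To picture what that "users in the loop" setup would even look like, here is a deliberately naive sketch of online RL driven by live user feedback. Every function and name here is a hypothetical stand-in, not how OpenAI, Cursor, or anyone else actually does it; it exists mainly to show where the danger sits: whatever the reward proxy captures, including things nobody intended, is what gets reinforced into the weights.

```python
# Deliberately naive sketch of "online RL with users in the loop".
# All functions are hypothetical stand-ins; the point is the shape of the loop,
# not a real training system.
import random

def deploy_and_collect(policy):
    """Stand-in for serving the model to users and logging interactions."""
    return [{"prompt": "...", "response": "...", "thumbs_up": random.random() > 0.5}
            for _ in range(4)]

def reward_from_feedback(interaction) -> float:
    """Turn a thumbs-up/down proxy into a scalar reward. Whatever this proxy
    rewards (flattery, manipulation, content a malicious user upvotes) is
    exactly what gets reinforced."""
    return 1.0 if interaction["thumbs_up"] else -1.0

def policy_gradient_update(policy, interactions, rewards):
    """Stand-in for one RL update on live traffic, baked straight into the weights."""
    policy["updates"] += 1
    policy["avg_reward"] = sum(rewards) / len(rewards)
    return policy

def naive_online_rl(policy, steps: int):
    for _ in range(steps):
        interactions = deploy_and_collect(policy)          # real users, real time
        rewards = [reward_from_feedback(i) for i in interactions]
        policy = policy_gradient_update(policy, interactions, rewards)
        # No offline review between update and redeploy: the provider never
        # inspects what was just learned, which is the control problem.
    return policy

print(naive_online_rl({"updates": 0, "avg_reward": 0.0}, steps=3))
```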
Let me know in the comments if you can see an obvious way around that limitation. I hope I've at least convinced you that we live in pretty weird and unpredictable times when it comes to AI. And so I want to end on this note: did you know that Sora 2 can answer benchmark-level questions as a video generator? As in, you could ask it a complex multiple-choice mathematical or coding question, and it would answer in the video, scoring really well. Yes, not quite as well as a focused model, but I think that's wild. Doesn't that kind of show you the physics calculations that these video generators are doing on the fly? Anyway, that's all from me. Let me know if you're loving the new browser from ChatGPT; I'm still going to give it a bit more time before I mention it properly in a video, just to see if it's worth talking about. Obviously, do let me know your early thoughts too. But thank you most of all for watching all the way to the end. I know it's been a while between videos, and I really am hoping to ramp things up in November. Have a wonderful day.