Did you miss these 2 AI stories? A *Real* LLM-crafted Breakthrough + Continual Learning Blocked?


Chapters

0:00 Introduction
0:55 C2S
4:48 OpenAI not too far behind (Simple, Codex)
6:37 A Definition of AGI?
11:10 OpenAI Researcher on Continual Learning Problems
13:02 Sora 2 can answer math questions

Whisper Transcript

00:00:00.000 | Okay, I'm going to be honest, AI companies have a set amount of computing power, and at the moment
00:00:05.240 | they are spending more of it on scaling up money-making stuff, like browsers and video
00:00:10.720 | shorts, than on scaling up frontier performance and IQ points. Hence the feeling among some of a
00:00:17.780 | slowdown in progress, and no hard feelings, you've got to make the investors some money, but as that
00:00:22.980 | maxes out, the story will shortly, hopefully, return to ramping up frontier intelligence. For
00:00:29.980 | example, when Gemini 3 comes out from Google DeepMind, expected in the next two months. But
00:00:35.140 | none of what I just said means that pretty juicy things aren't happening with our current language
00:00:40.820 | model horsepower. So I'm going to start with a novelty produced by a baby LLM, then I'll get to
00:00:47.500 | a revealing remark from a top OpenAI researcher, and end with some interesting bits that I found
00:00:53.980 | throughout this week. Let's start with a language model, the kind so often decried for hallucinations,
00:00:59.120 | but one that is actually pushing science forward by learning the language of biology. Yes, the model
00:01:06.580 | does have a dodgy name, C2S-Scale, but it was able to generate a novel hypothesis for a drug to aid in
00:01:14.540 | cancer treatment. What I'm going to try and do is simplify the simplification of this 29-page
00:01:20.780 | paper. So here we go. When I say we have a language model that generated a novel hypothesis for a drug
00:01:27.140 | to aid in cancer treatment, what I mean is it was a drug candidate that was not in the literature for
00:01:33.480 | being able to help in this way. This model, by the way, was based largely on the open-weights Gemma 2
00:01:39.160 | architecture from Google, released over a year ago. Gemma 3 has come out since. Gemma 4 is due any time.
00:01:46.480 | So it's not the latest Gemma by any stretch. But anyway, this language model was given what you could
00:01:52.460 | think of as dog training: reinforcement learning rewards for accurately predicting how cells
00:01:57.800 | would react to drugs, especially regarding interferon. I know what you're thinking, but why would we do that
00:02:03.820 | though? Well, it's to make cold cancer tumors hot, in other words, to make them detectable by the immune
00:02:10.520 | system. Wait, so hold on. This is an LLM that can speak biology. Yes, and English too, like you could
00:02:18.260 | most likely still chat to this model about football. Anyway, C2S-Scale turns each cell's gene
00:02:24.800 | activity into a short sentence, if you like. So a large language model can read biology like it does
00:02:32.120 | text. Slightly more technically, it can find drugs that reliably amplify the effects of interferon,
00:02:38.300 | conditional on there already being some of it present, but just not quite enough. In other words,
00:02:43.300 | it learns to predict how the immune system will respond to a drug, a bit like how it might learn
00:02:49.020 | how an author will continue a sentence. Now, just quickly on the name, it's called C2S-Scale,
00:02:54.200 | and the authors say scaling the model to 27 billion parameters yields consistent improvements. In case
00:03:00.940 | you're not aware, that's still easily five or ten times smaller than the model some of us will work
00:03:06.280 | with every day. I totally get that there are going to be cynics in the audience, but this is not like
00:03:11.340 | those stories about GPT-5 merely surfacing existing breakthroughs. If you don't know about that controversy,
00:03:17.120 | I won't bore you with it, but this is quite different. The drug candidate, which I'm going
00:03:21.200 | to butcher, silmitasertib, was not linked anywhere in the literature for having this capacity. As the
00:03:27.940 | authors write, this highlights that the model was generating a new testable hypothesis and not just
00:03:33.580 | repeating known facts. Even better, by the way, you're probably wondering, it worked on human cells in a
00:03:38.480 | dish. AKA, the model's in silico prediction was confirmed multiple times in vitro, in the lab.
00:03:45.420 | Obviously, I do have to clarify at this point, though, that it wasn't confirmed on actual human
00:03:51.260 | beings. That, alas, will take years and years of testing, as is the way in medicine. But before we
00:03:57.560 | leave this story, just imagine with me for a moment this drug being used in concert with others for more
00:04:03.800 | effective cancer treatment in just a few years, directly inspired by a language model. That's wild.
00:04:10.160 | As the paper's authors say, this result also provides a blueprint for a new kind of biological
00:04:17.120 | discovery. Now, just as I am filming this video, I have noticed that Google has published this story
00:04:24.020 | in Nature on quantum computing. And some of you may have thought that the video's title was about that.
00:04:30.280 | Note they say that it's paving a path towards potential future uses in drug discovery. That's
00:04:36.340 | incredible. And that's all going to change the world in the coming decades. But this C2S-Scale result is pretty
00:04:42.280 | remarkable right now. Just remember these stories when people say that LLMs will not accelerate science.
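If you want to picture what a "cell sentence" actually is, here is a rough sketch of the idea: rank a cell's genes by how strongly they are expressed and write out the top gene names as plain text, which an ordinary language model can then read and be trained on. To be clear, this is only a minimal illustration of the Cell2Sentence concept; the gene names, counts and prompt format below are invented, and the real C2S-Scale pipeline and prompts differ.

```python
# Minimal sketch of the "cell sentence" idea: rank a cell's genes by expression
# and emit the top gene names as a plain-text string an LLM can read.
# The gene names, counts and prompt below are invented for illustration; this is
# not the actual C2S-Scale pipeline.

def cell_to_sentence(expression: dict[str, float], top_k: int = 10) -> str:
    """Turn one cell's gene-expression profile into a rank-ordered 'sentence'."""
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(gene for gene, count in ranked[:top_k] if count > 0)

# One made-up immune cell, as raw transcript counts per gene.
cell = {"CD74": 1200, "B2M": 950, "HLA-A": 610, "ISG15": 420, "STAT1": 300, "ACTB": 120}

sentence = cell_to_sentence(cell, top_k=5)
print(sentence)  # "CD74 B2M HLA-A ISG15 STAT1"

# A drug-response query might then be posed as ordinary text, something like:
prompt = (
    f"Cell: {sentence}\n"
    "Drug: silmitasertib\n"
    "Low-dose interferon present: yes\n"
    "Predict the change in antigen presentation:"
)
```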
00:04:47.960 | Now, it would be pretty lazy of me to just write the narrative that this shows that OpenAI is falling
00:04:53.000 | further behind, especially when you get reports that Gemini 2.5 DeepThink has just hit record
00:04:58.600 | performance on Frontier Math. That's pretty much the hardest benchmark around for mathematics. But
00:05:04.000 | actually, on raw intelligence, I don't think OpenAI is that far behind. And I don't just mean on,
00:05:09.980 | for example, my own benchmark, SimpleBench, where I finally got around to testing GPT-5 Pro and it snuck
00:05:16.280 | in just under Gemini 2.5 Pro. I know, I know you guys would want me to test DeepThink,
00:05:21.180 | but there's no API for that. So for now, you've got this comparison. If you don't know, SimpleBench
00:05:25.920 | is about testing the question behind the question, kind of like trick questions, you could think of
00:05:30.500 | it as, with some spatio-temporal questions thrown in. Now, I think I've spent over 100 hours
00:05:35.420 | testing GPT-5 within Codex, and it's just frankly better than any of the alternatives. Definitely compared
00:05:42.400 | to those from Google, and yes, even compared to the brand new Claude Code. I do love how you can use
00:05:47.960 | Claude Code on your mobile. I think that's epic. Thank you, Anthropic. But it wasn't just that: on
00:05:52.600 | one of the first queries I had for Claude Code, it attempted to delete a key section of code completely
00:05:58.800 | at random. That's, of course, just anecdotal. I mean, you could get the outputs from Claude Code,
00:06:03.860 | ask about it in GPT-5 Codex, and vice versa, and far, far more often, Claude Code will say,
00:06:11.020 | oh yeah, I'm wrong, sorry, Codex is right. And that is borne out in testing too. And remember,
00:06:15.140 | coding is Anthropic's specialty, so OpenAI are really cooking with Codex. That's what I mean when
00:06:20.320 | I call OpenAI a beast that is currently choosing to spend its set amount of compute on money-making
00:06:26.980 | activities like Sora. Now, I'll come back to Sora in a moment, but it's time for the second story that
00:06:33.140 | I found really interesting in the last few days that I think many people missed. And the story isn't just
00:06:38.020 | that we got a paper by a bunch of famous AI authors proposing a final conclusive definition of AGI.
00:06:46.040 | I say with a rising tone because I don't think it is actually a conclusive definition of AGI.
00:06:50.980 | But no, that isn't quite the story, though, it's part of it. The story comes with a link from this
00:06:55.800 | paper to something that a top OpenAI researcher recently revealed. But first, what's this paper?
00:07:01.460 | Well, basically, it's an answer to what many of you have asked for for a long, long time,
00:07:07.000 | which is a definition of AGI. And the big anchor that the paper proposes is this theory of
00:07:15.400 | cognitive capacity, the Cattell-Horn-Carroll theory. It's called the most empirically validated model
00:07:20.880 | of human cognition. So they basically take what has been proven to assess human cognition and then
00:07:25.960 | apply it to models. One of the big headlines is that the resulting AGI scores are GPT-4 at 27%,
00:07:33.820 | GPT-5 at 58%. Kind of makes it sound like just with the same architecture going up to GPT-6 or 7,
00:07:41.100 | you'd get to AGI. But that's not what the paper says. The theory breaks cognition down into 10 discrete
00:07:47.580 | categories that have survived as factors of cognition for over a century of research. They do not, by the way,
00:07:54.620 | include physical dexterity. So that's one huge caveat. Each category is given an equal 10% weighting on the AGI score
00:08:02.900 | out of 100. So for example, reaction time is treated as equally important to general knowledge. So we have general
00:08:09.480 | knowledge, and it's definitely true that models know a lot more than they used to. Reading ability, I think it's pretty
00:08:15.260 | fair to say models have come a long way. Math ability, well, we just saw how Gemini DeepThink is breaking records on
00:08:21.660 | frontier math benchmarks. On-the-spot reasoning is a bit more hazy. It's kind of a bit like a mix of
00:08:27.500 | SimpleBench and a traditional IQ test and a bit of theory of mind, and they give it 7%. Not sure about that.
00:08:33.500 | What I mean by that is that there are so many different ways of testing induction, like ARC AGI 1, which yields amazing
00:08:41.260 | scores for GPT-5, and then ARC AGI 3, where it performs terribly. Or traditional theory of mind tests, which the models
00:08:48.860 | ace, versus perturbed versions, where you just change a few words and they fall right down. Then there's working memory and
00:08:55.820 | long-term memory storage, and that's where things get really weird. Language models can't remember things, or at least
00:09:02.380 | they can within the context of the conversation, but not beyond it. They don't
00:09:07.260 | continually learn things on the job, as it were, and that's the key point I'm going to come back to with
00:09:12.620 | a quote from OpenAI. Then there's a metric on hallucinations, which again is quite hard to pin down,
00:09:18.220 | and visual processing. That is rapidly improving, as you might have seen with the recent DeepSeek paper.
00:09:23.980 | Rounding things out, we have listening ability and reaction time, essentially speed. Don't know about you,
00:09:29.180 | but models can, in general, do things much, much faster than me. But the one that stands out is
00:09:34.860 | the lack of memory. The authors say, without the ability to continually learn, AI systems suffer from amnesia,
00:09:41.260 | which limits their utility, forcing the AI to relearn context in every interaction. What I would add to
00:09:46.940 | that is that because every bit of context adds to the cost of a call to a model, these providers
00:09:53.980 | deliberately limit the amount of context that models take in. So they often make huge blunders, because they
00:09:59.660 | just don't understand the situation. If you spent more money on more context, they'd understand more
00:10:05.820 | of the situation, but again, they wouldn't remember it next time. One is a fundamental limitation, one is a
00:10:11.820 | question of cost. But will we soon have a solution to the fundamental limitation, continual learning?
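Quick aside before that quote: the paper's headline scores are, at their core, just a sum of ten equally weighted domains, each worth 10 points out of 100. Here's a rough sketch of that arithmetic; the domain labels are paraphrased from the categories above, and the per-domain numbers are invented for illustration, not the paper's reported GPT-5 figures.

```python
# Rough sketch of the paper's scoring scheme: ten CHC-derived domains, each
# weighted equally at 10% of an overall "AGI score" out of 100.
# Domain labels are paraphrased; the per-domain values below are invented for
# illustration and are not the paper's reported numbers.

DOMAINS = [
    "general knowledge", "reading and writing", "math ability",
    "on-the-spot reasoning", "working memory", "long-term memory storage",
    "long-term memory retrieval", "visual processing", "auditory processing",
    "speed",
]

def agi_score(domain_scores: dict[str, float]) -> float:
    """Each domain contributes up to 10 points; equal weights, total out of 100."""
    assert set(domain_scores) == set(DOMAINS)
    return sum(min(max(score, 0.0), 10.0) for score in domain_scores.values())

example = {d: 8.0 for d in DOMAINS}
example["long-term memory storage"] = 0.0  # no continual learning
example["on-the-spot reasoning"] = 7.0     # the 7% mentioned above
print(agi_score(example))  # 71.0 under these made-up inputs
```

Anyway, back to whether continual learning gets solved.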
00:10:19.500 | Well, that's the quote I want to get to. On that nasty cliffhanger, I'm going to bring you today's
00:10:25.260 | sponsor, a long time sponsor of the channel, Assembly AI, and specifically, their universal streaming tool.
00:10:32.700 | Because I think it's pretty epic. Did you know how accurate these tools have become? For example,
00:10:39.820 | it was only recently that certain rival tools couldn't understand when I said things like
00:10:45.580 | GPT-5 Pro. They just wouldn't get those letters right, especially in my accent. Now, as you can see,
00:10:51.420 | they handle it flawlessly. I think that's incredible. And the link to this, by the way,
00:10:57.340 | is going to be in the description. I follow AI relentlessly and, of course, test almost everything
00:11:03.100 | that comes out. And I didn't know these tools had gotten this good, frankly. That was also really
00:11:08.060 | fast. I don't know if you noticed that. But anyway, link in the description. Now, on that quote,
00:11:12.300 | this is OpenAI's VP of Research, Jerry Tworek. I think it's his first interview, but he mentioned
00:11:20.140 | something quite crucial around the 58-minute mark. But first, some context. Before we even hear from him,
00:11:27.180 | what are some of the obvious limitations of continual learning? Well, the model providers
00:11:32.140 | wouldn't be in control of what the models were learning. Like, imagine GPT-6 truly understands
00:11:37.420 | your app from the ground up, or maybe your exam from the ground up. Then, on the job,
00:11:42.220 | it bakes that into its weights, so you never have to tell it again. That's amazing, right? But if we do
00:11:47.180 | that naively, then you can imagine all kinds of sick things that people train the models to learn.
00:11:53.980 | Is there a concept of, I guess, online RL that happens where, as the agent does something and
00:12:02.700 | learns from the real world, the RL happens in real time?
00:12:07.740 | So generally, all of RL is happening. Most of the RL that you hear talked about with language models is online,
00:12:16.780 | but it's done online in a way that is still a training run. It's still being trained kind of
00:12:23.020 | separately from the users. There have been a few models in the world, and I've learned recently that I
00:12:28.220 | think Cursor is trying to train some models online with their users in the loop, and it's theoretically
00:12:36.300 | possible to train models in ChatGPT or every other product, just responding to the users and reinforce
00:12:42.780 | through whatever rewards you get in there. But this is not what I am aware of, at least not what
00:12:48.860 | OpenAI is doing at the moment. And it can be great, but it can also be dangerous because you are
00:12:56.060 | not really very much controlling what you are reinforcing in that loop and what could happen.
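To make the risk he's describing concrete, here is a deliberately naive sketch of what "online RL with users in the loop" could look like: every interaction immediately becomes a training example, and the only reward is a thumbs-up or thumbs-down. The names and the update step are placeholders for illustration, not a description of anything OpenAI, Cursor or anyone else has actually built.

```python
# Deliberately naive sketch of "online RL with users in the loop": the model is
# updated from live user feedback, with thumbs-up/down as the only reward.
# Everything here (names, update rule) is a placeholder to show the failure
# mode, not a description of any real system.

import random

def policy_update(params, reward, lr=1e-2):
    # Stand-in for a policy-gradient-style step (e.g. REINFORCE/PPO): nudge the
    # model toward whatever just earned positive reward.
    return {k: v + lr * reward * random.uniform(-1, 1) for k, v in params.items()}

def generate(params, prompt):
    return f"stubbed response to: {prompt}"  # stand-in for the model's output

params = {"w0": 0.0, "w1": 0.0}  # stand-in for billions of weights

# The online loop: every user interaction becomes a training example immediately.
for prompt, thumbs_up in [("fix my bug", True), ("flatter me", True), ("is this safe?", False)]:
    response = generate(params, prompt)
    reward = 1.0 if thumbs_up else -1.0
    params = policy_update(params, reward)
    # Failure mode: if users reward flattery or confident-sounding wrong answers,
    # the loop reinforces exactly that, and nobody is reviewing what gets learned.
```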
00:13:02.300 | Let me know in the comments if you can see an obvious way around that limitation. I hope I've
00:13:17.100 | at least convinced you that we live in pretty weird and unpredictable times when it comes to AI. And so
00:13:24.060 | I want to end on this note. Did you know that Sora 2 can answer benchmark-level questions as a video
00:13:31.340 | generator? As in, you could ask it a complex multiple-choice mathematical or coding question,
00:13:36.060 | and it would answer in the video, scoring really well. Yes, not quite as well as a focused model,
00:13:42.860 | but I think that's wild. Doesn't that kind of show you the physics calculations that these video
00:13:47.260 | generators are doing on the fly? Anyway, that's all from me. Let me know if you're loving the new browser
00:13:53.020 | from ChatGPT. I'm still going to give it a bit more time before I mention it properly in a video,
00:13:57.820 | just to see if it's worth talking about. Obviously, do let me know your early thoughts too.
00:14:02.460 | But thank you most of all for watching all the way to the end. And I know it's been a while
00:14:07.180 | between videos. I really am hoping to ramp things up in November. Have a wonderful day.