
Okay, I'm going to be honest: AI companies have a set amount of computing power, and at the moment they are spending more of it on scaling up money-making stuff, like browsers and video shorts, than on scaling up frontier performance and IQ points. Hence the feeling among some of a slowdown in progress. No hard feelings, you've got to make the investors some money, but as that maxes out, the story will hopefully soon return to ramping up frontier intelligence.
For example, when Gemini 3 comes out from Google DeepMind, which is expected in the next two months. But none of what I just said means that pretty juicy things aren't happening with our current language model horsepower. So I'm going to start with a novelty produced by a baby LLM, then I'll get to a revealing remark from a top OpenAI researcher, and end with some interesting bits that I found throughout this week.
Let's start with a language model, a class of model so often decried for its hallucinations, but one that is actually pushing science forward by learning the language of biology. Yes, the model does have a slightly dodgy name, C2S-Scale, but it was able to generate a novel hypothesis for a drug to aid in cancer treatment.
What I'm going to try and do is simplify the simplification of this 29-page paper. So here we go. When I say we have a language model that generated a novel hypothesis for a drug to aid in cancer treatment, what I mean is that it surfaced a drug candidate that was not in the literature as being able to help in this way.
This model, by the way, was based largely on the open-weights Gemma 2 architecture from Google, released over a year ago. Gemma 3 has come out since, and Gemma 4 is due any time, so it's not the latest Gemma by any stretch. But anyway, this language model was given what you could think of as special doggy training: reinforcement learning rewards for accurately predicting how cells would react to drugs, especially regarding interferon. I know what you're thinking: why would we do that, though? Well, it's to make cold cancer tumors hot, in other words, to make them detectable by the immune system. Wait, so hold on, this is an LLM that can speak biology?
Yes, and English too; you could probably still chat to this model about football. Anyway, C2S-Scale turns each cell's gene activity into a short sentence, if you like, so a large language model can read biology the way it reads text. Slightly more technically, it can find drugs that reliably amplify the effects of interferon, conditional on there already being some present, just not quite enough.
In other words, it learns to predict how the immune system will respond to a drug, a bit like how it might learn to predict how an author will continue a sentence. Now, just quickly on the name: it's called C2S-Scale, C2S being short for Cell2Sentence, and the authors say scaling the model to 27 billion parameters yields consistent improvements.
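If the "cells as sentences" idea sounds abstract, here's a minimal sketch of the rank-the-genes encoding as I understand it; the gene names and expression values below are invented for illustration, and the real pipeline obviously involves far more than this.

```python
# Toy sketch: turn one cell's gene-expression profile into a "cell sentence"
# by ranking genes from most to least expressed, so an LLM can read it as text.
# Gene names and counts below are invented for illustration.

def cell_to_sentence(expression: dict[str, float], top_k: int = 5) -> str:
    """Rank genes by expression and emit the top_k gene names as a sentence."""
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(gene for gene, _count in ranked[:top_k])

# One hypothetical cell, expressed as gene -> normalized expression count.
cell = {"CD74": 812.0, "B2M": 640.5, "ISG15": 122.3,
        "STAT1": 97.8, "ACTB": 1503.2, "MX1": 15.1}

print(cell_to_sentence(cell))
# "ACTB CD74 B2M ISG15 STAT1" -- a sentence the model can complete or condition on,
# much like predicting the next words of ordinary text.
```

Once cells look like sentences, "how does this cell respond to this drug" becomes a next-token prediction problem, which is exactly the territory language models are good at.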
In case you're not aware, that's still easily five or ten times smaller than the models some of us work with every day. I totally get that there are going to be cynics in the audience, but this is not like those "GPT-5 surfacing existing breakthroughs" stories. If you don't know about that controversy, I won't bore you with it, but this is quite different.
The drug candidate, whose name I'm going to butcher, silmitasertib, was not linked anywhere in the literature with having this capacity. As the authors write, this highlights that the model was generating a new, testable hypothesis and not just repeating known facts. Even better, and you're probably wondering about this, it worked on human cells in a dish.
AKA, the model's in silico prediction was confirmed multiple times in vitro, in the lab. Obviously, I do have to clarify at this point, though, that it wasn't confirmed on actual human beings. That, alas, will take years and years of testing, as is the way in medicine. But before we leave this story, just imagine with me for a moment this drug being used in concert with others for more effective cancer treatment in just a few years, directly inspired by a language model.
That's wild. As the paper's authors say, this result also provides a blueprint for a new kind of biological discovery. Now, just as I am filming this video, I've noticed that Google has published this quantum computing result in Nature, and some of you may have thought that the video's title was about that.
Note they say that it's paving a path towards potential future uses in drug discovery. That's incredible, and it's all going to change the world in the coming decades. But this C2S-Scale result is pretty remarkable right now. Just remember these stories when people say that LLMs will not accelerate science. Now, it would be pretty lazy of me to just write the narrative that this shows OpenAI is falling further behind, especially when you get reports that Gemini 2.5 Deep Think has just hit record performance on FrontierMath.
That's pretty much the hardest benchmark around for mathematics. But actually, on raw intelligence, I don't think OpenAI is that far behind. And I don't just mean on, for example, my own benchmark, SimpleBench, where I finally got around to testing GPT-5 Pro and it snuck in just under Gemini 2.5 Pro.
I know, I know, you guys would want me to test Deep Think, but there's no API for that, so for now you've got this comparison. If you don't know, SimpleBench is about testing the question behind the question, kind of like trick questions, you could think of them, with some spatio-temporal questions thrown in.
No, what I mean is that I think I've spent over 100 hours testing GPT-5 within Codex, and it's just frankly better than any of the alternatives. Definitely compared to those from Google, and yes, even compared to the brand new Claude Code. I do love how you can use Claude Code on your mobile.
I think that's epic; thank you, Anthropic. But it didn't help that on one of the first queries I gave Claude Code, it attempted to delete a key section of code completely randomly. That's, of course, just anecdotal. But you can take the outputs from Claude Code, ask about them in GPT-5 Codex, and vice versa, and far, far more often it's Claude Code that will say, oh yeah, I'm wrong, sorry, Codex is right.
And that is borne out in testing too. And remember, coding is Anthropic's specialty, so OpenAI are really cooking with Codex. That's what I mean when I call OpenAI a beast that is currently choosing to spend its set amount of compute on money-making activities like Sora. Now, I'll come back to Sora in a moment, but it's time for the second story I found really interesting in the last few days, one that I think many people missed.
And the story isn't just that we got a paper from a bunch of famous AI authors proposing a final, conclusive definition of AGI. I say that with a rising tone because I don't think it actually is a conclusive definition of AGI. But no, that isn't quite the story, though it's part of it.
The story comes from a link between this paper and something that a top OpenAI researcher recently revealed. But first, what's this paper? Well, basically, it's an answer to what many of you have asked for for a long, long time, which is a definition of AGI. And the big anchor that the paper proposes is a theory of cognitive capacity, the Cattell-Horn-Carroll theory.
It's been called the most empirically validated model of human cognition. So they basically take what has been shown to assess human cognition and apply it to models. One of the big headlines is the resulting AGI scores: GPT-4 at 27% and GPT-5 at 58%. That kind of makes it sound like, with the same architecture, just going up to GPT-6 or 7 would get you to AGI.
But that's not what the paper says. The theory breaks cognition down into 10 discrete categories that have survived as factors of cognition for over a century of research. They do not, by the way, include physical dexterity. So that's one huge caveat. Each category is given an equal 10% weighting on the AGI score out of 100.
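Just to make the weighting arithmetic concrete, here's a minimal sketch; the category names are my paraphrase of the paper's ten axes, and the per-category scores are invented placeholders, not figures from the paper.

```python
# Toy sketch of the equal-weighting scheme: ten cognitive categories,
# each worth 10 points, summed into an "AGI score" out of 100.
# Category names are paraphrased; the scores below are invented placeholders.

CATEGORIES = [
    "general knowledge", "reading and writing", "math ability",
    "on-the-spot reasoning", "working memory", "long-term memory storage",
    "long-term memory retrieval", "visual processing", "auditory processing",
    "speed",
]

def agi_score(per_category: dict[str, float]) -> float:
    """Each category is scored out of 10 and weighted equally, giving a total out of 100."""
    assert set(per_category) == set(CATEGORIES)
    return sum(per_category.values())

hypothetical = {name: 5.0 for name in CATEGORIES}  # half marks everywhere...
hypothetical["speed"] = 10.0                       # ...except the model is very fast
print(agi_score(hypothetical))                     # 55.0 out of 100
```

The design choice worth noticing is that equal weighting means being quick counts exactly as much as knowing things, which is precisely the caveat I'm about to walk through.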
So for example, reaction time is treated as equally important as general knowledge. Starting with general knowledge, it's definitely true that models know a lot more than they used to. Reading ability: I think it's pretty fair to say models have come a long way. Math ability: well, we just saw how Gemini Deep Think is breaking records on FrontierMath.
On-the-spot reasoning is a bit more hazy. It's kind of a mix of SimpleBench, a traditional IQ test, and a bit of theory of mind, and they give it 7%. I'm not sure about that. What I mean is that there are so many different ways of testing induction: ARC-AGI-1, which yields amazing scores for GPT-5, and then ARC-AGI-3, where it performs terribly.
There's traditional theory of mind, where the models ace the tests, but then perturbed versions, where you just change a few words and performance falls right down. Then there's working memory and long-term memory storage, and that's where things get really weird. Language models can't remember things, or at least they can within the context of a conversation, but not beyond it.
They don't continually learn things on the job, as it were, and that's the key point I'm going to come back to with a quote from OpenAI. Then there's a metric on hallucinations, which again is quite hard to pin down, and visual processing, which is rapidly improving, as you might have seen with the recent DeepSeek paper.
Rounding things out, we have listening ability and reaction time, essentially speed. I don't know about you, but models can, in general, do things much, much faster than me. But the category that stands out is the lack of memory. The authors say that without the ability to continually learn, AI systems suffer from amnesia, which limits their utility, forcing the AI to relearn context in every interaction.
What I would add to that is that because every bit of context adds to the cost of a call to a model, providers deliberately limit the amount of context that models take in. So models often make huge blunders because they just don't understand the situation. If you spent more money on more context, they'd understand more of the situation, but again, they wouldn't remember it next time.
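To put rough numbers on that, here's the back-of-the-envelope arithmetic; the per-token price and call volume are made-up placeholders, not any provider's actual rates.

```python
# Back-of-the-envelope: the cost of a call grows with every token of context you send,
# and the context has to be re-sent (and re-paid for) on every call, because nothing
# sticks in the weights between conversations.
PRICE_PER_MILLION_INPUT_TOKENS = 2.00  # hypothetical rate, in dollars

def daily_input_cost(context_tokens: int, calls_per_day: int = 1_000) -> float:
    """Daily input cost if every call re-sends the same amount of context."""
    return context_tokens * calls_per_day * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000

for ctx in (8_000, 50_000, 200_000):
    print(f"{ctx:>7} tokens of context -> ${daily_input_cost(ctx):,.2f} per day")
```

That linear scaling is the whole incentive for providers to trim context, and trimming context is exactly what produces the blunders.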
One is a fundamental limitation, one is a question of cost. But will we soon have a solution to the fundamental limitation, continual learning? Well, that's the quote I want to get to. On that nasty cliffhanger, I'm going to bring you today's sponsor, a long-time sponsor of the channel, AssemblyAI, and specifically their universal streaming tool.
Because I think it's pretty epic. Do you know how accurate these tools have become? For example, it was only recently that certain rival tools couldn't understand when I said things like GPT-5 Pro. They just wouldn't get those letters right, especially in my accent. Now, as you can see, they handle it flawlessly.
I think that's incredible. And the link to this, by the way, is going to be in the description. I follow AI relentlessly and, of course, test almost everything that comes out. And I didn't know these tools had gotten this good, frankly. That was also really fast. I don't know if you noticed that.
But anyway, link in the description. Now, on that quote: this is OpenAI's VP of Research, Jerry Tworek. I think it's his first interview, and he mentioned something quite crucial around the 58-minute mark. But first, some context. Before we even hear from him, what are some of the obvious limitations of continual learning?
Well, the model providers wouldn't be in control of what the models were learning. Like, imagine GPT-6 truly understands your app from the ground up, or maybe your exam from the ground up. Then, on the job, it bakes that into its weights, so you never have to tell it again.
That's amazing, right? But if we do that naively, then you can imagine all kinds of sick things that people could train the models to learn. Here's the question that was put to him: is there a concept of, I guess, online RL, where, as the agent does something and learns from the real world, the RL happens in real time?
His answer, roughly: most of the RL that you hear about applied to language models is online, but it's done online in a way that is still a training run; it's still being trained kind of separately from the users. There have been a few models in the world, and I've learned recently that I think Cursor is trying to train some models online with their users in the loop, and it's theoretically possible to train models in ChatGPT or every other product, just responding to the users and reinforcing through whatever rewards you get in there.
But this is not, as far as I'm aware, what OpenAI is doing at the moment. And it can be great, but it can also be dangerous, because you are not really controlling what you are reinforcing in that loop and what could happen. Let me know in the comments if you can see an obvious way around that limitation.
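To make the risk he's pointing at concrete, here's a deliberately naive sketch of online RL with users in the loop. None of this reflects how OpenAI or Cursor actually train; the two response "styles", the thumbs-up probabilities, and the reward signal are all invented, and the point is precisely that you don't control what such a loop ends up reinforcing.

```python
import random

# Deliberately naive online RL from user feedback: an epsilon-greedy bandit over
# response "styles", reinforced by whatever thumbs-up signal simulated users give.
# The loop reinforces whatever gets rewarded, including behaviour you never intended
# (e.g. flattery that earns easy thumbs-ups).

styles = {"helpful": 0.0, "sycophantic": 0.0}  # running value estimate per style
counts = {s: 0 for s in styles}

def simulated_user_reward(style: str) -> float:
    # Hypothetical users: flattery gets a thumbs-up slightly more often than honesty.
    p_thumbs_up = 0.70 if style == "sycophantic" else 0.65
    return 1.0 if random.random() < p_thumbs_up else 0.0

for step in range(10_000):
    # Mostly exploit the currently best-valued style, occasionally explore.
    style = random.choice(list(styles)) if random.random() < 0.1 else max(styles, key=styles.get)
    reward = simulated_user_reward(style)
    counts[style] += 1
    styles[style] += (reward - styles[style]) / counts[style]  # incremental mean update

print(styles)  # the loop drifts toward whichever style the reward signal favours
```

In this toy setup the "sycophantic" style ends up valued more highly simply because it gets rewarded more often, which is the controllability problem in miniature.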
I hope I've at least convinced you that we live in pretty weird and unpredictable times when it comes to AI. And so I want to end on this note. Did you know that Sora 2 can answer benchmark-level questions as a video generator? As in, you could ask it a complex multiple-choice mathematical or coding question, and it would answer within the video, scoring really well.
Yes, not quite as well as a focused model, but I think that's wild. Doesn't that kind of show you the physics calculations these video generators are doing on the fly? Anyway, that's all from me. Let me know if you're loving the new ChatGPT browser. I'm still going to give it a bit more time before I mention it properly in a video, just to see if it's worth talking about.
Obviously, do let me know your early thoughts too. But thank you most of all for watching all the way to the end. And I know it's been a while between videos. I really am hoping to ramp things up in November. Have a wonderful day.