Progress in AI can feel incremental until you step back and think in terms of weeks and months rather than just days. So this video won't just be about the release of GPT-4.1 in the last 48 hours, or Kling 2.0, or a sneak peek at the next OpenAI model, o3, or even about DolphinGemma, the hopeful new language model from Google.
It will be about all of this in a wider context: how seven such stories show where we are in AI and what's happening. I want to start with something super practical, just for those who don't care much about incremental advances in intelligence and just want cool tools to use.
For those people: in the last few days, Kling have released Kling 2.0, and here is a workflow that I recommend. Generate an image with ChatGPT, because it has incredible text fidelity. Not perfect, but really good. Now, you shouldn't really explain a joke, but as you can tell, the background image is somewhat taking the piss out of OpenAI's model names.
Pro tip: ChatGPT will generate curse words, but Kling won't work with an image that contains them, so I had to use the version without "GPT WTF". Anyway, the only point I wanted to make is that Kling 2.0, for me, is the state of the art at generating smooth, realistic scenes.
Of course, it's not perfect with regard to physics. And yes, I have compared it directly to Veo 2 and also to Sora video generation. And no, I'm not going to belabor the point, because it's not perfect, but sometimes incremental progress, when you step back, can add up to something quite significant.
Speaking of which, of course, in the last 48 hours we got GPT-4.1 from OpenAI. That's their first model that can process up to a million tokens, or think of that as around 750,000 words. Aside from being a bit less verbose than the notoriously talkative Claude 3.7 Sonnet or the famously garrulous Gemini 2.5 Pro, I don't actually think GPT-4.1 is that much of a step forward.
So I'm not going to spend a huge amount of time on it, but just a few words on its background. GPT-4.1, like GPT-4.5, is not a reasoning model: it doesn't output those long chains of thought before giving you an answer. That raises the question of why release 4.1 when we already have GPT-4o and GPT-4.5, which are also non-reasoning models.
It seems like demand for GPT-4.5 wasn't as great as OpenAI hoped, and that could be because of how expensive GPT-4.5 was, or how good Gemini 2.5 Pro is. So OpenAI wanted to release a non-reasoning model that answers more quickly, performs better than GPT-4o, and isn't as expensive as GPT-4.5.
And for those of you at the back with your hands up, why do we even need non-reasoning models when the reasoning models do so well? Because if you can improve these, quote, base models at things like software engineering, then when reasoning is applied on top of those better base models, you get a better end result.
But there is a slight marketing problem for OpenAI if Google can serve reasoning models that perform better, at a lower price, than OpenAI can serve non-reasoning models. Take Aider's Polyglot coding benchmark, which is popular and well regarded. You don't have to remember any of these numbers, just focus on relative performance, with GPT-4.1 getting 52% at a cost of around $10.
If you bear in mind those numbers, we can then scroll up and see Gemini 2.5 Pro getting 73% correct at a cost of $6. On my own benchmark, SimpleBench, you can see a clustering effect now for those base models, those non-reasoning models: GPT-4.1 gets 27%, which is very similar to Llama 4 Maverick, Claude 3.5 Sonnet and the new DeepSeek V3.
Worth noting, though, that we could finally benchmark Grok 3 because the API was released, and it scored 36.1%, comparing directly to the original GPT-4.5 at around 34%. Now, I know you guys are noticing Gemini 2.5 Pro way out in the lead, but more on that in a moment.
What about that 1-million-token context window? Doesn't that really stand out? Well, yes, except Gemini 2.5 Pro also has a 1-million-token context window. And when you sprinkle a bunch of clues across a long fictional story, as this amazing benchmark does, you can see which models actually pick up on those clues and use that long context best, piecing together plot threads across many diverse chapters.
Well, as you can possibly see if you zoom in, Gemini 2.5 Pro can do this even across novel-length, 100,000-word pieces of fiction. On this benchmark, GPT-4.1 falls far behind, as does pretty much every model other than Gemini 2.5 Pro. So when you see OpenAI tout needle-in-a-haystack charts like this one, remember that they're selectively picking the benchmarks that make their models look best.
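For anyone curious what a needle-in-a-haystack test actually involves mechanically, here is a minimal sketch: bury one "needle" fact at a random depth in filler text, ask the model to retrieve it, and record simple recall across context lengths. This is an illustration, not the code behind OpenAI's charts or the fiction benchmark; the needle and filler strings are invented, and while "gpt-4.1" follows OpenAI's API naming, treat the whole harness as an assumption-laden example.

```python
import random
from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set

NEEDLE = "The number Professor Alvarez wrote on the whiteboard was 7421."
QUESTION = "What number did Professor Alvarez write on the whiteboard? Reply with the number only."
FILLER = "The ship drifted on. The crew mended sails and told the same old stories. "

def build_haystack(total_chars: int, depth: float) -> str:
    """Repeat filler text to roughly total_chars, then splice the needle in at fractional depth."""
    body = (FILLER * (total_chars // len(FILLER) + 1))[:total_chars]
    pos = int(len(body) * depth)
    return body[:pos] + "\n" + NEEDLE + "\n" + body[pos:]

def needle_recovered(client: OpenAI, model: str, total_chars: int, depth: float) -> bool:
    """One trial: does the model retrieve the buried fact from the long context?"""
    context = build_haystack(total_chars, depth)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": context + "\n\n" + QUESTION}],
    )
    return "7421" in (resp.choices[0].message.content or "")

if __name__ == "__main__":
    client = OpenAI()
    random.seed(0)
    for chars in (50_000, 200_000, 800_000):  # roughly 12k to 200k tokens
        hits = sum(needle_recovered(client, "gpt-4.1", chars, random.random()) for _ in range(5))
        print(f"{chars:>7} chars: {hits}/5 needles recovered")
```

Note that this measures only the easy, verbatim-retrieval case; the fiction benchmark scatters clues that must be combined across chapters, which is the harder version GPT-4.1 struggles with, and exactly the kind of distinction selective benchmark-picking hides.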
Llama 4, you may remember, did a very similar thing. And of course, don't get any of us started on LMArena, which can be gamed heavily, and was by Meta. Now, you might say that's a little harsh, given that OpenAI just open-sourced a brand-new benchmark, an eval on long context called OpenAI MRCR.
The only problem is that we've already had a benchmark like that, from Google, for over a year, and with that benchmark we could compare across model families. It turns out that if you're in the lead, you're much more inclined to compare your model to other model families. Now, of course, I am aware that by tonight, or at least within the next week, we are likely getting o3 from OpenAI.
According to The Information, that's the model that can really help with science: it can connect the dots between concepts from different fields to suggest new types of experiments, involving anything from nuclear fusion to pathogen detection. This is according to people who have tested the model. And apparently we're not just going to get o3, but o4-mini too.
Now, as the title promised, I do have to critically analyse this announcement, even before it actually happens. First, obviously, it would have to be extremely good to justify a $20,000-per-month price, as The Information reports. Second, models can perform well on benchmarks but not actually understand the real world, or perform effectively when conducting science.
And yes, that includes Gemini 2.5, as one researcher recently found. Any of you following SimpleBench or doing your own tests probably saw this coming, but look what happened when he created a benchmark on manufacturing a simple brass part: all models except Gemini 2.5 failed at the first hurdle, simply because they had horrible visual abilities.
But even Gemini 2.5 had terrible physical reasoning. Its machining plans had multiple critical errors that a beginner machinist would spot; it could parrot textbook terms but lacked practical understanding. And that's Gemini 2.5. Remember that AI co-scientist, powered by Gemini 2, that Google touted in February? I'm not saying that Gemini 2.5 or o3 can't suggest interesting new research directions.
It's just that they don't have some mystical understanding of science that no human has. Not yet, at least. And I say not yet for two reasons: one I'll give at the end of the video, and one here. Models are incrementally getting better even at physical-reasoning and spatial-reasoning questions.
I haven't personally tested o3, but I have analysed a bunch of its answers through someone I know. No, they don't work at OpenAI. The model does still make basic errors, but it's the only one to get certain questions right that I have never seen any other model get right even once.
That's pretty much all I can say at this point without risking getting sued, but it does back up my incremental-improvement point. Of course, o3, especially on any high setting, is likely to be slower and more expensive than Gemini 2.5 Pro, but that might not matter as much as you think.
Especially if, in the words of Satya Nadella, and now Sam Altman this week, OpenAI is moving from being a model company to being a product company. ChatGPT is like a standard user. The model capability is very smart, but we have to build a great product, not just a great model.
And so there will be a lot of people with great models, and we will try to build the best product. The focus on this channel is much more on the state of the art in model intelligence than on product features, but I have noticed a trend: more and more product features are now being copied and shared across all the different model providers.
Anthropic, with the Claude series, now has web search and is soon going to have a voice assistant, just like OpenAI. And now Anthropic have joined the deep-research party with their own research mode. That comes after Gemini updated their own deep-research tool to Gemini 2.5 Pro, meaning that, hardly surprisingly, the Gemini tool is now arguably the best one.
I recently switched from defaulting to OpenAI's deep research to Gemini's, simply because it's faster and, on average, slightly better. I'm not as keen on how, even for a simple query, it outputs a massive volume of text, but still, its accuracy is slightly higher for me. Of course, that means it's getting slightly harder to justify paying for OpenAI's $200 Pro tier, but let's see what they release in the next week.
Speaking of deep research, if you want to see how all LLMs lie when they're trying to justify their reasons for an answer, do check out the latest video on my Patreon. One somewhat amusing test I gave was to see which deep-research tool would make up a report on an African author I had invented entirely from my imagination.
One deep-research tool does spectacularly well; the other, not so much. If you are a little overwhelmed by all the product offerings, Safe Superintelligence, from Ilya Sutskever, has got your back: they offer precisely zero products. That has not stopped them being valued at $32 billion, apparently. This is not a made-up figure; people are giving them billions of dollars, this time $2 billion, at that valuation.
That's what they value the company at, and I have precisely zero extra details to give you. It's just the obvious question: what on earth are they up to? One product that is here right now is from the sponsor of today's video, Emergent Mind, where you can see which AI papers have caught fire online.
Now, while I might have the time to go off and read those papers, what you can do, if you wish, is use Gemini 2.5 Pro to summarize them. Or say you're just interested in a topic like reward hacking: you can search for it and get a summary of all the relevant papers from Gemini 2.5 Pro.
And I actually know the creator of Emergent Mind, and I asked for this feature where you can then click directly on the PDF, and boom, here it is. As I've said before, I also love the social section at the bottom, where you can see the social media reaction to a given paper.
And I'm on the pro tier, which is free, by the way, for students who are currently enrolled in college or university. The links, as ever, will be in the description for emergentmind.com. Now, as with many of you, my attention was caught by DolphinGemma from Google: how Google AI is "helping decode dolphin communication".
That is a grand title and it got millions of views on social media. But when you dig in, I don't know, there's just a bit less than meets the eye. Don't get me wrong, I think it's incredible that we are trying to do this and I am so enthusiastic about it.
I absolutely love animals. It's just that the hype headlines you've seen across YouTube and Twitter make it sound like we already have a model that can do that. The announcement, though, was more about progress: how they had accumulated an incredible dataset, and how they had an ultimate goal of doing certain things.
See down here: the ultimate goal of this research is to understand the structure and potential meaning within the natural sounds of dolphins, seeking patterns and rules that might indicate language. If you watch the accompanying video, one of the researchers says, "We don't know if they have words." Obviously I, and pretty much everyone watching, hope they have a language.
Because being able to decode it would be insane. Just don't be fooled by any hype headlines: we don't actually know if they have a coherent language. We know certain sound types correlate with certain behaviours, like whistles that seem to function as unique names, or the special sounds, as many animals have, that they emit during fights,
or buzzing during courtship. But that's different from the more abstract rules, as they say, that might indicate language. Again, they're looking for potential meanings in the sounds dolphins make, using a 400-million-parameter model that can fit on a Pixel 9 phone. They then go on to tout one obvious benefit of fitting on a phone,
which is their goal to "speak dolphin". Of course, once you've decoded certain sounds, you can get your phone to emit those sounds and essentially communicate with a dolphin. That would be incredible. It would, of course, establish a simpler shared vocabulary; the researchers' hope is that naturally curious dolphins will learn to mimic the whistles to request certain items, for example.
Again, absolutely incredible research, and I really do hope they succeed; I would be their number one fan. But I just wanted to give you a sense of where we actually are currently. By the way, I suspect dolphins do have a proto-language, so fingers crossed for this mission. Now, I could have ended the video there, but as I outlined at the beginning, I wanted to step back and give you guys a sense of context, of where we are.
You may have gathered from various media reports over the last few years that we are compute constrained. That the only thing limiting progress is a lack of, for example, NVIDIA GPUs. Even that, of course, would be a simplification because Google announced a 7th generation TPU not reliant on NVIDIA.
But even if you've just imbibed that general narrative that it's all about compute, well, this video from OpenAI on pre-training GPT-4.5 might have a few answers for you. The truth is, it's actually much more about data constraints now than compute constraints. It's very interesting because, I think, up until this rough point in time, if you look even through GPT-4, we were largely just in a compute-constrained environment.
So that was kind of where all the research was going. But now we're in a very different kind of regime, starting with 4.5, where, for some aspects of the data, we are much more data-bound, and there's now a lot more excitement about this research. It is a crazy update that I don't think the world has really grokked yet.
Or I should pick a different word, one the world hasn't understood yet: that we're no longer compute-constrained on the best models we can produce. That was just the world we lived in for so long. And what's the most useful kind of data? Evaluations, or benchmarks. The Chief Product Officer at OpenAI explained it well this week.
You made a kind of comment along these same lines around evals, that AI is almost capped in how amazing it can be by how good we are at evals. Does that resonate? Any more thoughts along those lines? These models are intelligences, and intelligence is so fundamentally multidimensional.
So you can talk about a model being amazing at competitive coding, which may not be the same as that model being great at front-end coding or back-end coding, or at taking a whole bunch of code that's written in COBOL and turning it into Python. And that's just within the software-engineering world.
Still, most of the world's data, knowledge and process is not public; it's behind the walls of companies or governments or other things. In the same way, if you were going to join a company, you would spend your first two weeks onboarding: you'd be learning the company-specific processes, you'd get access to company-specific data.
You can teach these models. The models are smart enough; you can teach them anything, but they need the raw data to learn from. And so there's a sense in which I think the future is really going to be incredibly smart, broad-based models tailored with company-specific or use-case-specific data, so that they perform really well on company-specific or use-case-specific things.
You're going to measure that with custom evals. And so what I was referring to is just that these models are really smart, but you still need to teach them things if the data is not in their training set. And there's a huge number of use cases that are not going to be in their training set, because they're relevant to one industry or one company.
That's why OpenAI want to work with anyone they can, through their OpenAI Pioneer Program, to get domain-specific evals. Having niche evaluations for your models doesn't just help you separate the good data from the bad and improve the data efficiency of your model; it also helps you identify the best new data to improve your model.
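To make "custom eval" concrete, here is a minimal, hypothetical sketch of what a domain-specific eval can look like: a handful of prompts for one niche task, each paired with a grading function, scored as a simple pass rate. The machining-flavoured cases and the naive string-matching graders are invented for illustration; this is not the Pioneer Program's actual format.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    grade: Callable[[str], bool]  # True if the model's answer passes

# A tiny, invented machining eval in the spirit of the brass-part benchmark discussed earlier.
# Grading here is naive string matching; a real eval would grade far more carefully.
CASES = [
    EvalCase(
        prompt="What hand tool cuts internal threads in a drilled hole? One word.",
        grade=lambda ans: "tap" in ans.lower(),
    ),
    EvalCase(
        prompt="Brass stock is 25.4 mm in diameter. What is that in inches? Number only.",
        grade=lambda ans: ans.strip().startswith("1"),
    ),
]

def run_eval(ask_model: Callable[[str], str]) -> float:
    """Score any model (a callable from prompt to answer) against the cases; returns the pass rate."""
    passed = sum(case.grade(ask_model(case.prompt)) for case in CASES)
    return passed / len(CASES)

if __name__ == "__main__":
    # Stand-in "model" with canned answers so the sketch runs without an API key;
    # swap in a real client call to evaluate an actual model.
    answers = iter(["A tap, turned with a tap wrench.", "1"])
    print(f"pass rate: {run_eval(lambda prompt: next(answers)):.0%}")
```

The point of the sketch is the shape, not the content: once a niche task is pinned down as cases plus graders, you can compare models, catch regressions, and, as described above, decide which new data actually moves the score.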
If that new data contains information, which you can think of as functions or programs, that helps models perform better during reinforcement learning, then it will be prioritized. And that, to sum up, is why, among many other reasons, I think Google has taken the lead, and may even have an enduring lead.
I'm not saying the new o3 won't pip it on a few benchmarks; I'm talking about a long-term trend over the next year or two. Google can source almost unlimited data: think Google Search, Android, Chrome, Gmail, Google Maps, YouTube, Waymo self-driving, even Calico life extension. And to wrap things up where we started, remember the lack of performance sometimes on SimpleBench, or on that brass-manufacturing benchmark?
Well, just a week or so ago, Google announced Geospatial Reasoning, one of their first attempts to integrate Gemini with a bunch of these spatial-reasoning tools. I'll let their one-minute promo video speak for itself. From maps and trends to weather, floods and wildfires, Google has studied the geospatial world for decades.
And we've made that information accessible through AI models and real-time services. But synthesizing across these models and combining your data with ours can be challenging and expensive. That's why we're introducing Geospatial Reasoning. We now bring your data and models together with Google's geospatial tools for easier analysis using Gemini's reasoning ability.
"Gemini plans and enacts a custom program, searching over data and gathering inferences from multiple models to unlock powerful insights all through a simple conversational interface. Geospatial reasoning can be a critical tool for advancing public health. Geospatial reasoning can be a critical tool for emerging technologies. Geospatial reasoning can be a critical tool for emerging technologies.
Geospatial reasoning can be a critical tool for emerging technologies. Geospatial reasoning can be a critical tool for emerging technologies. Geospatial reasoning can be a critical tool for emerging technologies. Geospatial reasoning can be a critical tool for emerging technologies. Geospatial reasoning can be a critical tool for emerging technologies. Google taking what could be a permanent lead must be a bitter decade-long sting for Musk and Altman in particular.
I'm going to end with a 45-second extract from a recent documentary I put on Patreon about how OpenAI was founded a decade ago, almost to the month, to stop Google creating AGI. Leaked emails from a later lawsuit between Musk and Altman revealed a May email exchange about stopping Google.
And here it is: been thinking a lot about whether it's possible to stop humanity from developing AI. To stop humanity from developing AI! This is Sam Altman in an email to Musk. I think the answer is almost definitely not. If it's going to happen anyway, it seems like it would be good for someone other than Google to do it first.
Any thoughts on whether it would be good for Y Combinator to start a Manhattan project for AI? My sense is we could get many of the top 50 to work on it and we could structure it so that the tech belongs to the world via some sort of non-profit.
Thank you so much for watching to the end. I can't wait for this OpenAI researcher to add o4-mini to this long whiteboard list. Have an absolutely wonderful day.