
‘Speaking Dolphin’ to AI Data Dominance, 4.1 + Kling 2.0: 7 Updates Critically Analysed


Chapters

0:00 Introduction
0:30 Kling 2.0
1:35 GPT 4.1
5:25 o3 Build-up
7:37 ‘Product Company’
9:31 Safe Superintelligence
10:54 DolphinGemma
13:16 Data Dominance?


00:00:00.000 | Progress in AI can feel incremental until you step back and think in terms of weeks and months
00:00:06.140 | rather than just days. So this video won't just be about the release then of GPT 4.1 in the last
00:00:13.680 | 48 hours, or Kling 2.0, or a sneak peek at the next OpenAI model o3, or even about DolphinGemma,
00:00:21.020 | the hopeful new language model from Google. It will be about all of this in the wider context
00:00:27.140 | and how seven such stories contextualize where we are in AI and what's happening. I want to start
00:00:33.160 | with something super practical just for those who don't care much about the incremental advance in
00:00:37.900 | intelligence and just want cool tools to use. For those people, in the last few days, Kling have
00:00:43.440 | released Kling 2.0. And here is a workflow that I recommend. Generate an image with ChatGPT because
00:00:50.360 | it has incredible text fidelity. Not perfect, but really good. Now you shouldn't really explain a
00:00:55.920 | joke, but as you can tell, the background image is somewhat taking the piss out of OpenAI's model
00:01:00.840 | names. Pro tip, if you have any curse words, ChatGPT will generate that, but Kling won't work with that
00:01:06.740 | within the image. So I had to use the version without GPT WTF. Anyway, the only point I wanted to make is
00:01:12.940 | that Kling 2.0 for me is the state of the art at generating smooth, realistic scenes. Of course,
00:01:18.780 | not perfect with regard to physics. And yes, I have compared it directly to Veo 2 and also to Sora video
00:01:26.420 | generation. And no, I'm not going to belabor the point because it's not perfect, but sometimes
00:01:30.760 | incremental progress when you step back can add up to something quite significant. Speaking of which,
00:01:36.160 | of course, in the last 48 hours, we got GPT 4.1 from OpenAI. That's their first model that can process up to a
00:01:43.300 | million tokens, or think of that as around 750,000 words. Aside from being a bit less verbose than the
00:01:50.300 | notoriously talkative Claude 3.7 Sonnet or the famously garrulous Gemini 2.5 Pro, I don't actually
00:01:57.620 | think GPT 4.1 is that much of a step forward. So I'm not going to spend a huge amount of time on it,
00:02:03.140 | but just a few words on its background. GPT 4.1, like GPT 4.5, is not a reasoning model. It doesn't
00:02:09.220 | output those long chains of thought before giving you an answer. That then begs the question of why
00:02:14.360 | release 4.1 when we already have GPT-4o and GPT 4.5, which are also non-reasoning models. It seems like
00:02:21.940 | demand for GPT 4.5 wasn't as great as OpenAI hoped, and that could be because of how expensive GPT 4.5 was
00:02:29.100 | or how great Gemini 2.5 Pro is. So OpenAI wanted to release a non-reasoning model that answers more
00:02:36.140 | quickly that was better than GPT-4o, but not as expensive as GPT 4.5. And for those of you at the
00:02:43.140 | back with your hands up, why do we even need non-reasoning models when the reasoning models do
00:02:47.240 | so well? If you can improve these "base models" on software engineering, then when reasoning is
00:02:53.360 | applied to those better base models, you will get a better end result. But there is a slight marketing
00:02:58.360 | problem for OpenAI if Google can somehow serve reasoning models that perform better for a lower price
00:03:05.320 | than OpenAI can serve non-reasoning models. Take Aider's Polyglot coding benchmark, which is a popular
00:03:10.900 | and well-regarded benchmark. You don't have to remember any of these numbers, just focus on
00:03:14.640 | relative performance, with GPT 4.1 getting 52% at a cost of around $10. If you bear in mind those
00:03:20.460 | numbers, we can then scroll up and see Gemini 2.5 Pro getting 73% correct at a cost of $6. On my own
00:03:28.720 | benchmark, SimpleBench, you can see a clustering effect now for those base models, those non-reasoning
00:03:34.700 | models. GPT 4.1 gets 27%, which is very similar to Llama 4 Maverick, Claude 3.5 Sonnet and the new
00:03:41.760 | DeepSeek V3. Worth noting though that we could finally benchmark Grok 3 because the API was released
00:03:47.780 | and that scored 36.1% comparing directly to the original GPT 4.5 at around 34%. Now, I know you
00:03:55.880 | guys are noticing Gemini 2.5 Pro way out in the lead, but more on that in a moment. What about that 1 million
00:04:01.480 | token context window? Doesn't that really stand out? Well, yes, except Gemini 2.5 Pro also has a 1
00:04:07.500 | million token context window. And when you sprinkle in a bunch of clues across a long fiction story,
00:04:13.620 | as in this amazing benchmark, you can see which models actually pick up on those clues and utilize
00:04:19.480 | that long context the best, piecing together plot narratives across many diverse chapters.
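
To make that kind of eval concrete, here is a minimal sketch of the general shape, not the actual benchmark used in the video: scatter a few clue sentences across a long block of filler text, ask one question that requires combining clues from distant chapters, and score the answer. `call_model` is a hypothetical stand-in for whichever chat API you use, and the clues and question are invented for illustration.

```python
# Minimal sketch of a long-context "clues scattered across fiction" eval.
# call_model() is a hypothetical stand-in for a chat API; the filler text,
# clues, and question below are invented for illustration only.

def build_prompt(chapters: list[str], clues: dict[int, str], question: str) -> str:
    """Append clue sentences to chosen chapters, then ask one question
    that can only be answered by combining clues from distant chapters."""
    chapters = list(chapters)  # copy so the caller's list is untouched
    for position, clue in clues.items():
        chapters[position] += " " + clue
    story = "\n\n".join(chapters)
    return f"{story}\n\nQuestion: {question}\nAnswer concisely."

def score(answer: str, required_facts: list[str]) -> float:
    """Crude check: fraction of required facts mentioned in the answer."""
    answer = answer.lower()
    return sum(fact.lower() in answer for fact in required_facts) / len(required_facts)

# Illustrative usage:
filler = [f"Chapter {i}. " + "The rain kept falling on the harbour. " * 200 for i in range(50)]
clues = {3: "Mara hid the brass key inside the lighthouse.",
         41: "Only the lighthouse key opens the old archive."}
question = "What does Mara need to open the archive, and where is it?"

prompt = build_prompt(filler, clues, question)
# answer = call_model(prompt)                       # hypothetical API call
# print(score(answer, ["brass key", "lighthouse"]))
```

The real benchmarks are far more involved, but the signal being measured is the same: whether the model can combine widely separated details, not just retrieve a single planted sentence.
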
00:04:24.540 | Well, as you can possibly see if you zoom in, Gemini 2.5 Pro can do this even across novel-length,
00:04:31.080 | 100,000-word pieces of fiction. On this benchmark, GPT 4.1 falls far behind,
00:04:36.860 | as does pretty much every other model than Gemini 2.5 Pro. So when you see OpenAI tout needle in a
00:04:42.780 | haystack charts like this one, remember they're selectively picking the benchmarks that make their
00:04:47.640 | models look the best. Llama 4, you may remember, did a very similar thing. And of course, don't get any of us
00:04:53.920 | started on LM Arena, which can be gamed heavily, and was by Meta. Now you might say that's a little harsh given
00:04:59.900 | that OpenAI just open sourced a brand new benchmark, an eval on long context called OpenAI MRCR. The only
00:05:06.860 | problem is that we've already had a benchmark like that for over a year from Google. And with this
00:05:12.280 | benchmark, we could compare across model families. It turns out that if you're in the lead, you're much
00:05:16.920 | more inclined to compare your model to other model families. Now, of course, I am aware that by tonight, or at least
00:05:23.300 | within the next week, we are likely getting o3 from OpenAI. According to The Information,
00:05:29.020 | that's the model that can really help with science. It can connect the dots between concepts from
00:05:34.660 | different fields to suggest new types of experiments involving anything from nuclear fusion to pathogen
00:05:39.880 | detection. This is according to people who have tested the model. And apparently, we're not just
00:05:44.180 | going to get o3, but also o4-mini. Now, as the title promised, I do have to critically analyze this
00:05:49.920 | announcement, though, even before it actually happens. First, obviously, it would have to be
00:05:54.180 | extremely good to justify a $20,000 per month price, as the information reports. Second, models can perform
00:06:01.360 | well in benchmarks, but not actually understand the real world or perform effectively when conducting
00:06:07.220 | science. And yes, that includes Gemini 2.5, as one researcher recently found. Any of you following
00:06:12.780 | simple bench or doing your own tests probably saw this coming, but look what happened when he created
00:06:18.020 | a benchmark on manufacturing this simple brass part. All models except Gemini 2.5 failed at the
00:06:25.020 | first hurdle just because they had horrible visual abilities. But even Gemini 2.5 had terrible physical
00:06:31.640 | reasoning. Its machining plans had multiple critical errors that a beginner machinist would spot. It could
00:06:38.440 | parrot textbook terms but lacked practical understanding. And that's Gemini 2.5. Remember that AI co-scientist
00:06:46.080 | powered by Gemini 2 that Google touted in February? I'm not saying that Gemini 2.5 or o3 can't suggest
00:06:53.260 | interesting new research directions. It's just that they don't have this mystical understanding of science
00:06:58.260 | that no human has. Not yet, at least. And I say not yet for two reasons. One I'll give at the end of
00:07:03.500 | the video and one here. Models are incrementally getting better even at physical reasoning, spatial
00:07:10.100 | reasoning type of questions. I haven't personally tested o3, but I have analysed a bunch of its answers
00:07:16.500 | through someone I know. No, they don't work at OpenAI. The model does still make basic errors, but it's the only
00:07:23.780 | one to get certain questions right that I have never seen any other model get right even once. That's pretty much
00:07:29.780 | all I can say at this point without a risk of getting sued, but it does back up my incremental improvement
00:07:35.220 | Of course, o3, especially on any high setting, is likely to be slower and more expensive than Gemini 2.5 Pro, but that might not matter as much as you think. Especially if, in the words of Satya Nadella,
00:07:46.140 | and now Sam Altman from this week, OpenAI is moving from being a model company to being a product company.
00:07:53.420 | ChatGPT is like a standard user. The model capability is very smart, but we have to build a great product, not just a great model. And so there will be a lot of people with great models, and we will try to build the best product.
00:08:05.740 | The focus on this channel is much more on the state of the art in model intelligence rather than in product features, but I have noticed a trend. More and more product features are now being copied
00:08:15.780 | and shared across all the different model providers. Anthropic with the Claude series now has web search and is soon going to have a voice assistant just
00:08:23.060 | like OpenAI. And now Anthropic have joined the deep research party with their own research mode. That comes after Gemini updating their own deep research tool with Gemini 2.5 Pro, meaning
00:08:35.060 | that, and it's hardly surprising, the Gemini tool is now arguably the best one. I recently switched over from defaulting to OpenAI's deep
00:08:42.340 | research to Gemini's one simply because it's faster and, on average, slightly better. I'm not as keen on how, even for a simple query, it outputs this massive
00:08:52.340 | volume of text, but still its accuracy is slightly higher for me. Of course, that means it's getting slightly harder to justify
00:08:59.620 | paying for the $200 pro tier for OpenAI, but let's see what they release in the next week. Speaking of deep research, if you want to see how all LLMs lie when they're
00:09:09.620 | trying to justify their reasons for an answer, do check out the latest video on my Patreon. One somewhat amusing test I gave was to see which deep research would make up a report on an African author that I entirely
00:09:21.980 | made up from my imagination. One deep research tool does spectacularly well, the other not so much. If you are a little overwhelmed with all the product offerings,
00:09:30.860 | Safe Superintelligence from Ilya Sutskever has got your back. They offer precisely zero products. That has not stopped them now being valued at $32 billion, apparently.
00:09:42.820 | This is not a made up figure, people are giving them billions of dollars, this time $2 billion, at that valuation. That's what they value the company at.
00:09:51.620 | I have precisely zero extra details to give you. It's just the obvious question, what on earth are they up to?
00:09:58.500 | One product that is here right now is the sponsors of today's video, Emergent Mind, where you can see which AI papers have caught fire online.
00:10:08.180 | Now while I might have the time to go off and then read those papers, what you can do if you wish is use Gemini 2.5 Pro to summarize those papers.
00:10:17.060 | Or say you're just interested in a topic like reward hacking, you can search for it and get a summary of all the relevant papers from Gemini 2.5 Pro.
00:10:25.060 | And I actually know the creator of Emergent Mind. And I asked for this feature where you could directly then click on the PDF and boom, here it is.
00:10:31.940 | As I've said before, I also love the social section at the bottom where you can see the social media reaction to a certain paper.
00:10:39.940 | And I'm on the pro tier, which is free, by the way, for students who are currently enrolled in college or university.
00:10:46.820 | The links as ever will be in the description for emergentmind.com.
00:10:50.820 | Now as with many of you, my attention was caught by DolphinGemma from Google.
00:10:55.860 | How Google AI is "helping decode dolphin communication".
00:11:00.340 | That is a grand title and it got millions of views on social media.
00:11:04.100 | But when you dig in, I don't know, there's just a bit less to it than meets the eye.
00:11:08.660 | Don't get me wrong, I think it's incredible that we are trying to do this and I am so enthusiastic about it.
00:11:14.580 | I absolutely love animals.
00:11:16.260 | It's just that if you analyse any of the hype headlines you've seen across YouTube and Twitter,
00:11:21.540 | it would make it sound like we already have a model that can do that.
00:11:25.060 | The announcement though was more about progress.
00:11:28.020 | How they had accumulated an incredible dataset.
00:11:31.460 | And how they had an ultimate goal of doing certain things.
00:11:35.860 | See down here, the ultimate goal of this research is to understand the structure and potential meaning
00:11:41.700 | within the natural sounds of dolphins.
00:11:43.780 | Seeking patterns and rules that might indicate language.
00:11:46.740 | If you watch the accompanying video, one of the researchers says "We don't know if they have words".
00:11:52.900 | Obviously I, and pretty much everyone watching,
00:11:55.860 | HOPES they have a language.
00:11:57.220 | Because that would be insane to be able to decode it.
00:11:59.940 | It's just don't be fooled by any hype headlines.
00:12:02.340 | We don't actually know if they have a coherent language.
00:12:05.860 | We know certain sound types correlate with certain behaviours.
00:12:09.540 | Like whistles that seem like they are unique names.
00:12:12.740 | Special sounds, as many animals have, that they release during fights.
00:12:17.140 | Or buzzing during courtship.
00:12:19.060 | But that's different from more abstract rules, as they say, that might indicate language.
00:12:23.620 | Again, they're looking for potential meanings in the sounds coming from dolphins.
00:12:27.860 | Using a 400 million parameter model that can fit on a Pixel 9 phone.
00:12:32.260 | They then go on to tout one obvious benefit of fitting on a phone.
00:12:36.820 | Which is their goal to "speak dolphin".
00:12:39.860 | Of course, once you've decoded certain sounds, you can get your phone to emit those sounds and
00:12:44.340 | essentially communicate with a dolphin.
00:12:46.340 | That would be incredible.
00:12:47.620 | It would, of course, establish a simpler shared vocabulary.
00:12:51.380 | That is the hope of researchers.
00:12:55.220 | That naturally curious dolphins will learn to mimic the whistles to request certain items, for example.
00:13:00.100 | Again, absolutely incredible research and I really do hope they succeed.
00:13:03.780 | I would be their number one fan.
00:13:05.620 | But I just wanted to give you a sense of where we actually are currently.
00:13:09.860 | By the way, I suspect dolphins do have a proto-language.
00:13:13.060 | So I'm fingers crossed for this mission.
00:13:15.300 | Now, I could have ended the video there, but as I outlined at the beginning, I wanted to step back
00:13:20.340 | and give you guys a sense of context of where we are.
00:13:22.820 | You may have gathered from various media reports over the last few years that we are compute constrained.
00:13:27.620 | That the only thing limiting progress is a lack of, for example, NVIDIA GPUs.
00:13:32.420 | Even that, of course, would be a simplification because Google announced
00:13:35.940 | a 7th generation TPU not reliant on NVIDIA.
00:13:39.540 | But even if you've just imbibed that general narrative that it's all about compute.
00:13:43.620 | Well, this video from OpenAI, pre-training GPT 4.5, might have a few answers for you.
00:13:49.460 | The truth is, it's actually much more about data constraints now rather than compute constraints.
00:13:54.820 | It's very interesting because I think up until this rough point in time,
00:13:58.580 | like if you look even through GPT-4, we were largely just in a compute constrained environment.
00:14:02.900 | So that was kind of where all the research was going into.
00:14:06.020 | But now we're in a very different kind of regime, starting with 4.5 for some aspects of the data,
00:14:12.580 | where we are much more data bound.
00:14:14.100 | So there's now a lot more excitement about this research.
00:14:17.300 | It is a crazy update that I don't think the world has really grokked yet.
00:14:20.340 | I should pick a different word, one that the world has understood.
00:14:23.300 | That we're no longer compute constrained on the best models we can produce.
00:14:27.380 | That's just like, that was so the world we lived in for so long.
00:14:30.420 | And what's the most useful kind of data?
00:14:32.420 | Evaluations or benchmarks.
00:14:34.980 | The Chief Product Officer at OpenAI explained it well this week.
00:14:38.900 | You made a kind of a comment along these same lines around evals, that AI is almost like capped in how
00:14:45.060 | amazing it can be by how good we are at evals.
00:14:48.420 | Does that resonate?
00:14:49.780 | Any more thoughts along those lines?
00:14:51.220 | These models are intelligences and intelligence is so fundamentally multidimensional.
00:14:59.380 | So you can talk about a model being amazing at competitive coding, which may not be the same as
00:15:05.060 | that model being great at front end coding or back end coding or taking a whole bunch of code
00:15:11.540 | that's written in COBOL and turning it into Python, you know, like and that's just within the software
00:15:16.100 | engineering world.
00:15:17.380 | Still, most of the world's data, knowledge, and processes are not public.
00:15:24.100 | It's behind the walls of companies or governments or other things.
00:15:27.220 | And in the same way, if you were going to join a company, you would spend your first two weeks onboarding.
00:15:32.100 | You'd be learning the company specific processes.
00:15:34.180 | You'd get access to company specific data.
00:15:36.100 | You can teach these models.
00:15:39.300 | The models are smart enough.
00:15:40.180 | You can teach them anything, but they need to have the raw data to learn from.
00:15:48.580 | And so there's a sense in which I think the future is really going to be
00:15:53.060 | incredibly smart, broad based models tailored with company specific or use case specific data
00:16:02.420 | so that they perform really well on company specific or use case specific things.
00:16:07.940 | You're going to measure that with custom evals.
00:16:10.180 | And so, you know, what I was referring to is just like these models are really smart.
00:16:14.500 | You need to still teach them things if the data is not in their training set.
00:16:18.100 | And there's a huge amount of use cases that are not going to be in their training set because they're relevant to one industry or one company.
00:16:23.700 | That's why OpenAI want to work with anyone they can through their OpenAI Pioneers Program to get domain-specific evals.
00:16:31.460 | Having niche evaluations for your models doesn't just help you extract the good data from the bad and improve the data efficiency of your model.
00:16:39.220 | It also helps you identify the best new data to improve your model.
00:16:43.620 | If that new data contains information or you can think of it as functions or programs that help models perform better during reinforcement learning, then it will be prioritized.
00:16:53.300 | And that is why, to sum up, among many other reasons, I think Google has taken the lead and may even have an enduring lead.
00:17:01.220 | I'm not saying the new o3 won't pip it on a few benchmarks.
00:17:04.340 | I'm talking about a long term trend over the next year or two.
00:17:07.540 | Google can source almost unlimited data.
00:17:10.500 | Think Google Search, Android, Chrome, Gmail, Google Maps, YouTube, Waymo Self-Driving, even Calico Life Extension.
00:17:18.180 | And to wrap things up where we started, remember the lack of performance sometimes in SimpleBench or on that brass manufacturing benchmark?
00:17:25.780 | Well, just a week ago or so, Google announced Geospatial Reasoning.
00:17:29.860 | One of their first attempts to integrate Gemini with a bunch of these spatial reasoning tools.
00:17:34.820 | I'll let their one minute promo video speak for itself.
00:17:37.380 | From maps and trends to weather, floods and wildfires, Google has studied the geospatial world for decades.
00:17:46.820 | And we've made that information accessible through AI models and real-time services.
00:17:52.340 | But synthesizing across these models and combining your data with ours can be challenging and expensive.
00:17:58.820 | That's why we're introducing Geospatial Reasoning.
00:18:03.780 | We now bring your data and models together with Google's geospatial tools for easier analysis using Gemini's reasoning ability.
00:18:12.740 | Gemini plans and enacts a custom program, searching over data and gathering inferences from multiple models to unlock powerful insights all through a simple conversational interface.
00:18:29.700 | Geospatial reasoning can be a critical tool for advancing public health.
00:18:38.660 | Geospatial reasoning can be a critical tool for emerging technologies.
00:18:58.580 | Google taking what could be a permanent lead must be a bitter decade-long sting for Musk and Altman in particular.
00:19:06.580 | I'm going to end with a 45 second extract from a recent documentary I put on Patreon about how OpenAI was founded a decade ago, almost to the month, to stop Google creating AGI.
00:19:19.540 | Leaked emails from a later lawsuit between Musk and Altman revealed a May email exchange about stopping Google.
00:19:25.540 | And here it is, been thinking a lot about whether it's possible to stop humanity from developing AI.
00:19:31.540 | To stop humanity from developing AI.
00:19:33.540 | This is Sam Altman in an email to Musk.
00:19:35.540 | I think the answer is almost definitely not.
00:19:38.900 | If it's going to happen anyway, it seems like it would be good for someone other than Google to do it first.
00:19:44.980 | Any thoughts on whether it would be good for Y Combinator to start a Manhattan project for AI?
00:19:51.140 | My sense is we could get many of the top 50 to work on it and we could structure it so that the tech belongs to the world via some sort of non-profit.
00:20:00.020 | Thank you so much for watching to the end.
00:20:01.860 | I can't wait for this OpenAI researcher to add o4-mini to this long whiteboard list and have an absolutely wonderful day.