Back to Index

Did AI Just Get Commoditized? Gemini 2.5, New DeepSeek V3, & Microsoft vs OpenAI


Chapters

0:00 Introduction
1:15 Gemini 2.5 Benchmarks
5:46 Long Context, SimpleBench
7:08 New DeepSeek V3-0324
9:11 Microsoft MAI
11:48 90% of code but new Claude jobs

Transcript

Just like buses, on the same day that we got GPT-4o Image Gen and a new DeepSeek V3, we also got Gemini 2.5 Pro. There is no Ultra or Nano version, but we're still calling it Pro. But it is out, and some at Google are claiming that it is the best AI language model out there.

I've been testing it out a fair bit myself, and yes, this is normally where I'd say, and I've read the paper, but unlike the new open-weight DeepSeek V3, Gemini 2.5's secret sauce is being kept secret. Beyond the benchmarks, though, there is a bigger story here of wider interest. Just recently, the CEO of Microsoft has claimed that models are being commoditized, with greater performance bought, like commodities, and that labs like OpenAI are merely, quote, product companies, selling an experience.

He's saying that they no more have the secret to, quote, AGI than anyone else. Let's set aside for a moment the fact that the new GPT-4o Image Gen from OpenAI is very much a model, as well as a product. Is the bigger point correct? Does today's Gemini 2.5 and DeepSeek V3 news prove that there is no secret to AI anymore, at least when it comes to intelligence?

To answer that, let's start with the brand new Gemini 2.5, Google's, quote, most intelligent AI model. Notice that this title is a little more humble, saying it's their most intelligent model, rather than flat out the smartest model around. I'll be honest, I didn't see this model coming, and I don't think too many people did either, coming so soon after Gemini 2, but here it is.

All the benchmark figures can be a little bit overwhelming, so I'm going to try and help you understand what some of them mean. The somewhat amusingly misconceived title of Humanity's Last Exam is an incredibly knowledge-intensive benchmark. Of course, you do need to reason and calculate for some of the questions, but most of all, it's testing incredibly obscure trivia, difficult Latin translations, and abstruse butterfly physiology.

But credit where credit is due, Gemini 2.5 seems to know the most. One thing to bear in mind, of course, is that the full O3 hasn't been released yet. This is just O3 Mini from OpenAI, so one would suspect that the full O3 will score more highly than this.

One hint at that is that the deep research system, albeit using tools, gets around 27% on this benchmark. For knowledge without searching the web, Gemini 2.5 Pro knows the most, you could say. What about incredibly difficult science questions? Google-proof PhD-level questions. Again, on that front, notice the convergence, with 2.5 Pro being more or less level with Claude 3.7 Sonnet, just a shade under when it uses extended thinking, and likewise for Grok 3.

Now, if you want to be harsh, you could note the fact that they didn't include O3, even though we know the score of that model using majority voting, which is 87.7%. But Google subtly call out OpenAI there, saying our scores are without things like majority voting, which expend even more compute.
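To make that concrete, here is a minimal sketch of what majority voting (sometimes called self-consistency) looks like; query_model is a hypothetical stand-in for one sampled answer from any model, since none of the labs publish their exact evaluation harnesses.

```python
# Minimal sketch of majority voting over sampled answers (a general illustration
# of the technique, not any lab's actual benchmark harness).
from collections import Counter

def majority_vote(question, query_model, k=64):
    # Sample the model k separate times; each sample is another full model call,
    # which is why majority-voted scores "expend even more compute".
    answers = [query_model(question) for _ in range(k)]
    # Return the most frequently produced final answer.
    return Counter(answers).most_common(1)[0][0]
```

The single-shot figures Google is reporting correspond to k equal to 1, which is part of why the headline numbers aren't directly comparable.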

But in a way, I've got to say, that's kind of the point of this video, which is this word leads. Does anyone really lead anymore? As someone who analyzes AI, I find that direct comparisons are increasingly, and you could say deliberately, made more difficult. Some companies like OpenAI use majority voting for their benchmark scores.

Others just don't report the benchmarks they perform worse at. Some figures you see include the use of tools, others don't. But even with all of that said, the scores are slightly converging. For a given amount of compute, you get roughly around a certain level of performance in, say, mathematics, or science, or knowledge trivia, or even coding.

That's not to say that anyone watching can't have a clear favorite model. I would argue probably O3 might be just about the best all around, and I use deep research all the time. Many people love the personality and writing style of Claude 3.7 Sonnet, and use it extensively for coding in things like Cursor.

DeepSeek, and we'll come on to them, are absolutely pushing the limits when it comes to cost efficiency. The best bang for your buck, if that's your main criterion. But if you keep compute expenditures level, performance is really starting to converge. And that's the big revelation of tonight for me.

It is very much worth pointing out that that does not preclude the fact that models are improving. It's just that they're improving together. I love the fact that models can now read tables and charts in, for example, the MMMU benchmark better than ever. In this case, Gemini 2.5 Pro is literally state-of-the-art.

And its excellence in that particular category was reinforced by the Vista benchmark from Scale AI. Unlike the MMMU, which is multiple choice, the Vista benchmark requires a free-form response from the model. And as makes sense for visual understanding, it tests things like, can you count how many things there are in an image?

Can you use logic? Can you extract information? And not only is Gemini 2.5 Pro a huge step above even Claude 3.7 Sonnet in this capacity, it's actually the first model to get within touching distance of human performance in this benchmark. And those humans, by the way, were able to browse the web and take their time to answer.

So that performance is pretty impressive. One quick benchmark I should include, even though it's slightly problematic, is the huge jump on the language model arena. Yes, it's community-voted and it can be slightly gamed, but there is a huge delta now between number one and number two. The eagle-eyed among you, though, may be noticing one big red flag about this benchmark, which is, where on earth is Claude 3.7 Sonnet?

I have got to admit that in one area, Gemini 2.5 Pro is just head and shoulders above other models, and that is in long context. It can handle a million tokens, that's roughly, let's say, 750,000 words, when the other models in this chart can't even handle a quarter of that.

But I would still argue that with one or two exceptions, performance across the board is converging between diverse model families. It's currently free, by the way, in Google's AI Studio, but that, of course, won't last. And it is able to search, which is great, except that ChatGPT has been able to do that for quite a while, and so too will Claude very soon.
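For anyone who wants to try it beyond the AI Studio web interface, here is a minimal sketch of calling it through the Gemini API with the google-generativeai Python SDK; the model ID below is my assumption of the preview name, so check AI Studio for whatever identifier is current.

```python
# Minimal sketch of querying Gemini 2.5 Pro via the google-generativeai SDK.
# The model ID is an assumption; look up the current one in Google AI Studio.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key from AI Studio, free for now

model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")
response = model.generate_content(
    "Here is a very long document to summarise: ..."  # the ~1M-token window is the headline feature
)
print(response.text)
```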

Just quickly, how does it do on my benchmark, SimpleBench, a test of common sense reasoning, or some would say trick questions? We're hoping, sometime very late tonight (I mean, what is it now? Approaching 10:30), to get it done and on the leaderboard. But I couldn't resist testing it on the 10 public questions.

Gemini 2 Pro, even Gemini 2 Thinking, only got 1 out of 10. Gemini 2.5 Pro, and you can test yourself on the questions if you like, gets 5 out of 10, which is a huge jump. The only thing is, that's what Claude 3.7 gets. And O1. That's, again, kind of the point of the video.

Not that progress isn't being made, and we can talk about that all day. More that model performance is converging. But now we must come to the new DeepSeek V3 announced this morning. And that's not the new reasoning model. That's not R2. You can think of it as a new base model for a reasoning model.

The best analogy is comparing it to GPT-4.5, which should be the base model for GPT-5, which will be a reasoning model. Just like GPT-4o is kind of the base model behind the O series of models. And let me zoom into the chart, because this perhaps is the starkest evidence that performance is converging across teams.

Today's DeepSeek V3 is in stripes, and that is what will surely be the base model for the R2 model coming probably in the next few weeks. Now, not only do I want you to focus on the improvement from the original DeepSeek V3, which is the underlying model of R1, but also focus on the comparison with GPT-4.5 from OpenAI.

It seems starkly better in mathematics, as you can see, and arguably so in coding, just underperforming slightly for science questions and general knowledge. But remember, OpenAI was supposed to be 6 or 12 months or more ahead of Chinese companies. Now, the base models, the new DeepSeek V3 and GPT-4.5, are kind of on par.

4.5, don't forget, is the model with which Sam Altman said many people would feel the AGI. Now, on my Patreon, I'm going to be imminently releasing a documentary on the behind-the-scenes of DeepSeek and Liang Wenfeng, but I doubt he would say he's going to feel the AGI with this new V3.

And yes, of course I'm aware that unless they're backed by the Chinese state, and even then, DeepSeek might struggle to match the sheer compute of Anthropic or OpenAI. But nevertheless, as things currently stand, there is, for reasoning models, no clear moat. Before we finish then, two more cheeky bits of evidence for the commoditization of AI.

Commoditization here being the argument that the only real differentiator at the moment is how much money you can funnel into getting more compute, which then increments your benchmark performance. Now, we have already talked about how Satya Nadella has directly said that AI models are getting commoditized. And as anyone who's watched this channel for a while will know, that's a huge vibe shift from two years ago when he was celebrating the fact that he had this special partnership with OpenAI.

But a couple weeks back, we got this insider report in The Information, which I couldn't find a chance to talk about on the channel, but it's really interesting. There is a unit within Microsoft called Microsoft AI, and it's run by this guy, Mustafa Suleyman, formerly the head of Inflection AI, and before that, the co-founder of Google DeepMind, along with Demis Hassabis.

As you might expect, the Microsoft AI unit is also trying to fashion AGI. But then, last September, they noticed O1 from OpenAI leap ahead, as, of course, we all did. Mustafa Suleyman apparently called up and got very angry when OpenAI wouldn't tell him how they'd made it. He started to raise his voice at Mira Murati, apparently.

You're not holding up your end of the deal, he said. The call ended abruptly. Give us documentation, he demanded, about how you had programmed O1 to think about user queries before answering. Now, obviously, that's a somewhat juicy story of the failing relationship between Microsoft and OpenAI, but the real nugget comes later.

According to Microsoft, at least, they have figured out how to do this kind of reasoning, like Gemini and R1 and the rest of it. Grok 3 and O3 and everyone seems to be reasoning now, but so are Microsoft, apparently. And they claim that their MAI models now perform nearly as well as leading models from OpenAI and Anthropic on benchmarks.

Their models, of course, also do this thinking before answering. Is that why Satya Nadella so confidently said that models are being commoditized? If his own team is able to even partially replicate the performance of, say, O3, then that would indeed give him that confidence to say that. Now, it's not like Microsoft needs to have the best model.

They are making a ton of money off AI in all sorts of directions. For example, did you know, according to +972 Magazine, sales of the company's cloud and AI services to the Israeli army have skyrocketed since the beginning of its onslaught on Gaza? And, of course, we'll have to wait and see if the MAI models live up to the hype.

But nevertheless, the statement alone is noteworthy. Do you feel OpenAI is just a product company? I think it's more than that, but time will tell. The final bit of cheeky evidence comes from, again, comparing DeepSeek V3, the new one from today, with Claude 3.7 Sonnet. As we know, these are just five benchmarks out of the probably hundred or so out there now.

But just on, say, LiveCodeBench, it outperforms 3.7 Sonnet. And that comes just days after the CEO of Anthropic made this comment: What we are finding is we are not far from the world, I think we'll be there in three to six months, where AI is writing 90% of the code.

And then in 12 months, we may be in a world where AI is writing essentially all of the code. But I must confess to noticing, after I saw that rather dramatic and hypey comment, that Anthropic are still advertising for software engineering roles in my hometown of London. Not only that, they are advertising annual salaries of very generous proportions.

But wait, if Claude 4 or 5 within 12 months is going to do, quote, all the coding, then why even advertise an annual salary? Logically, these people would be out of a job within a few months. Yes, I'm being somewhat facetious because, of course, engineering is much more than just coding.

But still, the words and the prediction don't quite match the recruitment intensity. And for a family of models that is set to do, quote, all the coding, it sure is struggling with a primary-school-age game, in this case playing Pokemon, endlessly getting stuck and resorting to hilarious means to progress and never quite doing so.

So that is my take on the new Gemini. An amazing model, but more proof of convergence than exceptionality. Of course, very awkward for the Gemini team that it came on the same night as ImageGen from OpenAI. I decided to do two separate videos, so let me know if you like that approach.

I thought it would just make it easier for you guys to share it with friends if they're interested in one topic or the other. But as it is approaching 11 o'clock here, I must bid you good night. Thank you so much for watching and have a wonderful day.