AI that can help improve AI is actually almost everywhere if you know where to look, not least in coding tools like the new Codex from OpenAI, which didn't just help me find a bug that Claude within Cursor missed, but is helping AI researchers too. The coding agents might be doing the easier bits, but they are freeing up AI researchers' time to, well, work on AI improvement.
But rarely is the process of AI self-improvement so direct as it is in the Alpha Evolve agent from Google DeepMind. It can generate better prompts for itself so that it can evolve better code for useful tasks, tasks which lead to efficiencies in its own next version. This was published less than 100 hours ago, but don't worry, it isn't Skynet.
The real world does not yet allow for the speed of iteration that Alpha Evolve involves. But I would say this agent is the final proof, for anyone still doubting it, that LLMs are not a dead end and have barely even begun to make their mark. I'm going to draw on plenty of analogies and multiple interviews to give you guys at least a gut sense of what is going on with this recursive Ronin.
It's an agent that has already led to real-world efficiencies in the Google data center fleet and to mathematical breakthroughs decades in the making. First, though, let's just cut to the chase: what on earth is this thing? Basically, the human comes along and has to provide the problem to solve, some code that they may have tried, and, critically, some evaluation metrics.
Those details are kind of crucial if you don't want to get an overhyped sense of what Alpha Evolve can do. Anyway, the human provides all of that, and the more metrics they can give, the better the performance. Then, essentially, the human can just vibe while Gemini 2.0 (not Gemini 2.5, the far more impressive successor) iterates on that code.
The system uses the Flash version of Gemini, the smaller and quicker one, for plentiful ideas, and the Pro version, Gemini 2.0 Pro, for solid suggestions. Notice the prompt sampler, wherein the system draws on previous prompts that have worked before, and on programs from the program database that were great in other situations, all with the goal of improving the code the human submitted against the evaluation metrics.
That's why Alpha Evolve is called a coding agent: at its heart, it is improving, or evolving, the code the human submits against those evaluation metrics. Then, while the human questions their career choices, Alpha Evolve eventually comes back with code improvements, or diffs, that produce programs which, 75% of the time, match the state of the art on one of the dozens of given tasks.
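To make that loop concrete, here is a minimal sketch in Python of an Alpha Evolve-style evolutionary loop. To be clear, this is not DeepMind's code: `evaluate`, `propose_diff`, and the sampling strategy are hypothetical stand-ins, but the cycle of sample, propose a diff, score against the human-supplied metrics, and store is the essence of it.

```python
import random

def evolve(initial_program, evaluate, propose_diff, iterations=1000):
    """Toy Alpha Evolve-style loop: not the real system, just the shape of it.

    initial_program: the code the human submits (as a string)
    evaluate:        the human-supplied automated evaluator -> dict of metric scores
    propose_diff:    stand-in for the LLM call that rewrites a parent program
    """
    # The "program database": every candidate is stored with its metric scores.
    database = [(initial_program, evaluate(initial_program))]

    for _ in range(iterations):
        # Sample promising parents and inspirations from the database
        # (the real system uses a much richer sampling strategy).
        parents = random.sample(database, k=min(3, len(database)))
        best_parent = max(parents, key=lambda p: sum(p[1].values()))

        # Ask the LLM for a modification (a diff) to the parent program,
        # showing it other sampled programs as inspiration.
        child = propose_diff(best_parent[0], inspirations=[p[0] for p in parents])

        # Score the child against the evaluation metrics and keep it.
        database.append((child, evaluate(child)))

    return max(database, key=lambda p: sum(p[1].values()))
```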
Not impressed? Well, 20% of the time, these constructions are better than the state of the art. If you are the highest-IQ dude on the planet, Terence Tao, you describe this as a tool for extremizing functions f(x), with x ranging over a high-dimensional parameter space Ω, that can outperform more traditional optimization algorithms when the parameter space is very high-dimensional and the function f and its extremizers have non-obvious structural features.
Simple when you think about it, extremizing functions. Rest assured, they are now moving on to more challenging problems. Hopefully, though, I can demystify things a bit more than this, so let's return to the paper. In this key diagram, you can see that DeepMind went all in on the Evolve part of Alpha Evolve, because the system not only stores and samples from the best prompts, as judged by metric success, but even the best LLMs for the task.
Yes, Gemini 2.5 Pro would be just a plug-and-play away from further improvements. Skipping to the end, because why not: the natural next step, they say, will be to consider distilling the Alpha Evolve-augmented performance of the base LLMs into the next generation of the base models. This can have intrinsic value and also likely uplift the next version of Alpha Evolve.
Now, you guys might agree, but those two sentences alone deserve a full video. Because first, who is Google fooling when they say they're going to consider doing this? I think it is quite possible they already did do this for Gemini 2.5. Alpha Evolve just got published, but has been tested internally in Google for around a year.
And second, Alpha Evolve is therefore a pretty definitive case study against the idea of a permanent data wall, because this system is built to spin up improved programs, which can then be distilled into the next generation of base models, which then get better at coming up with improved programs.
Or, TL;DR: iterated code that proves to be good is then great data for training the next base model, which can then be plugged into the next version of Alpha Evolve. Yes, by the way, I know that this is just one of several recursive loops in the paper: improving the base LLM through distillation.
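Here's a purely illustrative sketch of that distillation step, under my own assumptions: nothing below is from the paper, and the names are made up. The idea is simply to harvest the highest-scoring programs from an evolutionary run and turn them into fine-tuning data for the next base model.

```python
def build_distillation_set(database, threshold=0.95):
    """Hypothetical: harvest the best discovered programs as fine-tuning
    data for the next base model. Not DeepMind's pipeline."""
    examples = []
    for program, scores in database:
        # Keep only programs that scored near state of the art on their task.
        if sum(scores.values()) / len(scores) >= threshold:
            examples.append({"prompt": "Improve the provided code base.",
                             "completion": program})
    return examples

# The next base model is then fine-tuned on these examples, and that stronger
# model is plugged back into the next version of the evolutionary loop.
```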
And this is all before we get to Alpha Evolve's intended use in applied sciences like drug discovery. But very quickly on that point, I want to touch on why Alpha Evolve isn't quite confirmation of an imminent fast takeoff. Because, as the paper makes clear throughout, the main limitation of Alpha Evolve is that it handles problems for which it is possible to devise an automated evaluator.
While this is true of many problems in the mathematical and computational sciences, there are domains, such as the natural sciences, where only some experiments can be simulated or automated. Yes, it can therefore help scientists evaluate new scientific experiments, and they are working on making it a better literal co-scientist.
But there is a reason that even the famously bullish Anthropic CEO Dario Amodei, who expects a century of scientific progress in the next decade, said that intelligence will initially be heavily bottlenecked by the other factors of production. Test tubes, in other words, can only test-tube so fast. But back to what Alpha Evolve has actually already achieved.
Most famously, it found a rank-48 tensor decomposition for 4x4 complex matrix multiplication, which is, even for the authors, a super unexpected improvement on the 56-year-old record for algorithms suitable for recursive application. As simply put as possible, tensor decomposition here means discovering a more fundamental recipe with fewer core steps, 48 multiplications instead of 49, to perform matrix multiplication.
This specific type of recipe, a tensor decomposition, is cool because it can be applied recursively, dramatically speeding up calculations for the very large matrix multiplications needed for all sorts of computing and AI operations. If you're not too into maths, let's see what else I can impress you with.
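To see why one fewer multiplication matters, remember that the recipe is applied recursively: a rank-r decomposition for multiplying 4x4 blocks gives an asymptotic cost of roughly O(n^(log_4 r)) scalar multiplications for n x n matrices. A quick back-of-the-envelope check:

```python
import math

# Recursive block multiplication: a rank-r decomposition for 4x4 matrices
# gives an asymptotic cost of about O(n ** log_4(r)).
for rank in (64, 49, 48):   # naive, Strassen applied twice, Alpha Evolve
    print(rank, round(math.log(rank, 4), 4))
# 64 -> 3.0, 49 -> 2.8074, 48 -> 2.7925  (a small but real asymptotic win)
```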
Well, it helped Google improve Borg. Yes, you know, the actual Borg: Google's data center scheduler. Not sure which Borg you were thinking of, but this improvement helped Google recover 0.7% of its worldwide compute resources, which will soon amount to billions of dollars. But remember, LLMs are a dead end.
But being serious, this is clearly the way: humans and LLMs providing ideas and problems, LLMs proposing iterations, hard-coded verifiers and systems providing automated checks. And by the way, we're not even done: Alpha Evolve helped refine the next generation of Google's chips, its Ironwood TPUs. And you might remember DeepSeek hand-optimizing a kernel to eke out efficiency; if not, see my recent documentary, which debuted on Patreon.
Anyway, Alpha Evolve did that automatically when given it as a problem, leading to a 1% reduction in Gemini's training time. Obviously, that is yet another recursive loop: a better or more efficient Gemini leading to a better future Alpha Evolve. But okay, now that we are suitably sold on its achievements, let me give you four ways Google admits it will soon get better, plus two funny quirks and two relevant interview clips.
The first future improvement needs some background context: solutions and their scores for these tasks are kept in an evolutionary database. And remember, Gemini models have been confirmed to handle up to a 10-million-token context window. Those models aren't released yet; the public ones only go up to 2 million tokens.
But clearly, that evolutionary database could one day get incredibly large, giving a veritable Library of Alexandria for any future model to draw upon. For those who have been watching a while, it might remind you of my coverage of Voyager, an agent for Minecraft which had an ever-growing skill library of executable code.
So the first obvious future improvement is a much bigger evolutionary database. Second, as we hinted at, Alpha Evolve is model-agnostic. So as hardware is improved, training time is reduced, and knowledge is distilled to help make a better Gemini 3, that Gemini 3 will make for a much better LLM within Alpha Evolve.
And that brings us to the ablations. This was a really cool part of the paper because it showed that every part of the coding agent we have so far described was actually crucial. For example, if you use only the smaller base LLM (Gemini Flash, not Gemini Pro), performance caps out at a lower point.
If you couldn't use that massive context window to do full-file evolution, again, performance caps out at a much lower point. If you're listening to this, by the way: all of the ablations show lower performance if you don't employ the full method.
Even dropping the meta-prompting, where you evolve which prompts to use, impeded performance. And for those over on my Patreon, you may remember that back at the beginning of AI Insiders I did an interview with Tim Rocktäschel, a key figure at Google DeepMind. He gave us what turned out to be an early preview of this prompt-evolution approach with his paper, Promptbreeder.
But what Promptbreeder does is that if you evaluate fitness of the prompts based on some kind of specific held-out validation set for a domain, then what Promptbreeder will do over time is evolve more and more domain-specific prompts, right? That's what we saw in the paper.
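In code, the Promptbreeder-style idea Tim is describing might look something like this toy sketch. It's my own simplification, not the paper's implementation: prompts are the population, fitness is accuracy on the held-out validation set, and the mutation operator (`mutate`) is itself a stand-in for an LLM call.

```python
def evolve_prompts(seed_prompts, validation_set, score, mutate, generations=20):
    """Toy prompt evolution: `score` runs a prompt over the held-out validation
    set and returns accuracy; `mutate` is a stand-in LLM call that rewrites a
    prompt. Not Promptbreeder itself, just the shape of the idea."""
    population = [(p, score(p, validation_set)) for p in seed_prompts]
    for _ in range(generations):
        # Keep the fittest prompts for this domain...
        population.sort(key=lambda item: item[1], reverse=True)
        survivors = population[: max(2, len(population) // 2)]
        # ...and breed mutated variants from them.
        children = [mutate(p) for p, _ in survivors]
        population = survivors + [(c, score(c, validation_set)) for c in children]
    # Return the most domain-specific, highest-scoring prompt found.
    return max(population, key=lambda item: item[1])
```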
And there's actually one more paper that I think will give you a pretty great analogy for what is happening here with Alpha Evolve, and that's DrEureka from NVIDIA. For this, imagine trying to handcraft instructions for a robotic hand to teach it how to flip a pen. Super boring, it would take ages, and it isn't particularly effective.
But now imagine you can give the language model feedback about how each iteration is doing: which reward functions perform well and which don't. That's like the evaluation metrics that humans provide for Alpha Evolve. With that feedback, DrEureka and Alpha Evolve can iterate on their suggestions. Both approaches, as you now know, produce state-of-the-art results.
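Again as a hedged sketch rather than NVIDIA's actual code, the Eureka-style loop looks like this: the LLM proposes a batch of candidate reward functions, each is trained and measured in simulation, and a summary of the results is fed back as context for the next batch. `propose_rewards` and `train_and_measure` are hypothetical stand-ins.

```python
def eureka_style_search(env_context, propose_rewards, train_and_measure, rounds=5):
    """Toy Eureka/DrEureka-style loop (my paraphrase, not NVIDIA's code):
    the LLM drafts a batch of reward functions, each is used to train a
    policy in simulation, and the measured results feed the next round."""
    feedback = ""
    best = None
    for _ in range(rounds):
        # The LLM proposes several candidate reward functions at once,
        # using the environment description plus feedback on past attempts.
        candidates = propose_rewards(env_context, feedback, n=16)
        results = [(r, train_and_measure(r)) for r in candidates]
        best = max(results + ([best] if best else []), key=lambda x: x[1])
        # Summarise which rewards worked and which didn't for the next round.
        feedback = "\n".join(f"reward #{i}: success={s:.2f}"
                             for i, (_, s) in enumerate(results))
    return best
```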
And hopefully that gives you an intuition, as it did for me, for why humans couldn't always have reached these kinds of results, and how Alpha Evolve arrives at novel solutions that humans wouldn't necessarily get to even if they tried for long enough. Humans often get stuck in local optima because of their inherent biases.
Also, they don't have time to iterate on tens of thousands of potential solutions. Here's Guanzhi Wang, who worked on both the original Eureka and Voyager papers. It has very much prior knowledge, and therefore it can just propose different kinds of mutations and variations of the reward function based on the environment context.
I think it just generates those reward functions based on its prior knowledge. And not as a human, like, for a human, like, you need to manually tune the reward functions. And it's very easy for a human to get stuck to a local optima. But for GPT-4, it can generate tens of reward functions at the same time.
And then, based on the performance of each reward function, it can continuously improve it. In Eureka, it's more like an evolutionary search. The third area for future improvement, and this is a big one: that code snippet that Alpha Evolve improves on doesn't have to be the final function that generates the direct solution.
It can be a search algorithm that is later used to find the optimal final function. So Alpha Evolve can essentially keep improving how we search for optimal programs.
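A concrete illustration of what "evolving the search, not the solution" can mean, purely hypothetical and in the spirit of DeepMind's earlier FunSearch work on bin packing: the evolved artifact is a small scoring heuristic, and the automated evaluator judges it by running a fixed greedy search that uses it on test instances.

```python
def evaluate_heuristic(score_item_in_bin, instances):
    """The evolved artifact is `score_item_in_bin`, a heuristic used inside a
    fixed greedy bin-packing search; the evaluator runs that search on test
    instances and reports how many bins were needed (fewer is better)."""
    total_bins = 0
    for items, capacity in instances:
        bins = []  # each entry is the remaining capacity of an open bin
        for item in sorted(items, reverse=True):
            fits = [i for i, space in enumerate(bins) if space >= item]
            if fits:
                # The evolved heuristic decides *which* open bin to use.
                best = max(fits, key=lambda i: score_item_in_bin(item, bins[i]))
                bins[best] -= item
            else:
                bins.append(capacity - item)
        total_bins += len(bins)
    return -total_bins  # higher is better, for the evolutionary loop
```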
The fourth future improvement is subtle and might be missed by many, but the authors foresee something I find quite important. They say: however, with these improvements, we envision that the value of setting up more environments (problems) with robust evaluation functions will become more widely recognized, which, in turn, will result in more high-value practical discoveries going forward. You guys will get, and probably already are getting, bored of me saying that benchmarks are all we need. But honestly, this paper screams of the need for robust evaluation functions, and the incentives to create them are now much clearer, knowing that you will have a system on hand to optimize against them.
Okay, but I did promise you guys some quirks, so I thought you might find it cute that we still rely on prompts like these for Alpha Evolve. This is 2025, and we're telling our bleeding-edge systems to: act as an expert software developer, your task is to iteratively improve the provided code base.
Later, they say: suggest a new idea to improve the code that is inspired by your expert knowledge of optimization and machine learning. It really makes me wonder if the final prompt before the real singularity will be: I work at Google, improve yourself, or I'll be fired. But a couple more serious points before we end.
One thing that Alpha Evolve could not yet create is Alpha Evolve. I mean, of course, Alpha Evolve could improve parts of Alpha Evolve, as I've discussed, but it couldn't create it from scratch yet. Don't agree? Well, as Demis Hassabis puts it, we have systems that are superhuman at the game of Go, but that could not have invented Go.
That's Demis Hassabis, the head of Google DeepMind. So humans are still in the driver's seat, at least for now. Next is that this direction of iteration and search is yet one more way we can spend our exploding compute allocations. And even OpenAI admit that this is all a somewhat different direction from the o-series that has produced such astonishing benchmark results.
Jason Wei, a senior figure at OpenAI, said, Alpha Evolve is deeply disturbing for reinforcement learning diehards like yours truly. Maybe mid-train plus good search is all you need for AI for scientific innovation. And he added, what an alpha move to keep it secret for a year. Congrats, Big G.
We have, in other words, models approaching Level 4 innovators, without neuralese or a Mandarin chain of thought in sight. As the authors themselves write on page 14, Alpha Evolve was chosen over a deep reinforcement learning approach because its code solutions not only lead to better performance, but also offer clear advantages in interpretability, debuggability, predictability, and ease of deployment.
I'm not saying we always understand the solutions that Alpha Evolve helps generate, but explicit code does help on those fronts. And speaking, by the way, of dangerous reasoning chains, that was an incredible segue to the sponsors of today's video, Gray Swan AI. They are hosting a competition in which you can help improve the safety and security of language models by, essentially, jailbreaking them.
This is a brand new competition, link in the description, and the prize pool is $20,000. Actually, I think it was either my last video or the one before where the pinned comment was from one of you who first heard about Gray Swan and its arena in one of my videos, entered the competition, and did really well.
It would be truly amazing if one of you guys won it, and the time is ripe because the first full wave starts on May 17th. Thank you so much to Gray Swan for sponsoring this video, and good luck to everyone who enters. Anyway, a last couple of things from me on Alpha Evolve.
And one thing I was predicting on this channel in 2023, way before it was fashionable, was that there is a significant chance Google runs away with the AI lead. It has been working on AGI and self-improvement for years longer than the other labs, and has way more resources.
I'm not talking about running away in terms of user base or even profits, but in the raw intelligence of its models. Codex from OpenAI, which I've been using over the last 48 hours, is great because you can run it on mobile and debug multiple things at once. But in just 18 months, Google has gone from the laughably bad Bard, up against the mighty GPT-4, to being at least on par with Gemini 2.5.
Essentially, as the flywheels start to fly, to quote Demis Hassabis, I really do wonder where Gemini and DeepMind will be 18 months from now. Well, potentially unionised in the UK, and credit to DeepMind staff for their ethical stand on the use of their AI in warfare. But in the lead? I think that is almost inevitable.
Let me know what you think, and have a wonderful day.