
Alpha Everywhere: AlphaGeometry, AlphaCodium and the Future of LLMs


Transcript

24 hours ago, Google DeepMind released AlphaGeometry, and while their leaders are calling it a step toward AGI, the team itself is warning everyone not to overhype it. I've read the paper in Nature, the press releases and the associated interviews, and I feel that hitting gold-level performance for geometry in the International Math Olympiad is significant more for what it signifies about the growing alliance between language models and search, between idea generation and brute force.

In that same vein, we'll also take a quick peek at AlphaCodium, the brand-new, open-source rival to AlphaCode from Google DeepMind. But let's start all the way down at the day-to-day way AI is now being used for math education. If you think this is the way to get kids interested, let me know.

It's bordered off by these two values. So in this case, the integral would be the area of this shape here. But what about this other stuff here? Let me take it from here, Kim. That tall swirly symbol on the left is an S, which stands for sum. What are we summing?

We're summing the area of these strips: a tiny distance, dx, multiplied by the height, which is the value of the function. But these are way too thick, Taylor. dx is actually really, really tiny.
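As an aside, the idea Kim and Taylor are describing, the area under the curve as a sum of ever-thinner strips, is just the standard Riemann-sum definition of the definite integral, with a and b standing for the two bounding values mentioned earlier:

```latex
\int_{a}^{b} f(x)\,dx \;=\; \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i)\,\Delta x,
\qquad \Delta x = \frac{b-a}{n}.
```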

For those who don't know, the International Math Olympiad is the most prestigious math competition in the world. I remember competing in challenges just to get into it. Spoiler: I didn't get in. But I would say I never studied that hard. Anyway, this new system, AlphaGeometry, scores almost as highly as the average IMO gold medalist, but only on a subset of geometry problems.

Not algebra or number theory; we're talking just geometry. So it's not like AlphaGeometry sat a full IMO paper. It just did 30 IMO geometry questions. Nevertheless, getting an overall gold medal at the IMO has long been one of the holy grails of machine learning. That's maybe why one of the co-founders of DeepMind said AGI keeps getting closer.

And even Demis Hassabis, the leader of DeepMind and one of the other co-founders, said this: "Congrats to the team. This represents another step on the road to AGI." He later edited out that last sentence, possibly because he read that the team said not to overhype it, or because he had read some of the caveats in the paper itself.

Of course, I'll get to the paper, but first I want to set the stage. There is now a grand prize of $5 million and an overall prize pool of $10 million for getting gold in the IMO. Two years ago, the forecast on Metaculus for an AI getting a gold medal was 2037.

And what is it as of tonight? 2027. And of course, you don't need me to tell you that's just three and a half years away. So how does it work? Well, AlphaGeometry is a neuro-symbolic system: a combination of a neural network and an old-fashioned, pre-programmed symbolic system. And in fact, that alliance between large language models, which are neural networks, and old-fashioned pre-programmed systems is going to be the theme of this video.

Idea generation, which you could call creativity, plus brute force and search: that alliance, I predict, will eventually yield AGI. Here is a simple example of how it works. Imagine you're trying to prove that two angles are equal in an isosceles triangle. A key part of that proof is to drop a perpendicular line down from A to hit the midpoint of BC.
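Just to spell out why that single construction does almost all of the work, here is the standard argument in outline. Note that I'm introducing M as a label for the midpoint of BC (the foot of that perpendicular); it isn't named in the video.

```latex
% Triangle ABC with AB = AC. Construct M, the midpoint of BC, and draw AM.
% Triangles ABM and ACM are congruent by SSS, so their corresponding angles
% are equal:
\[
AB = AC,\qquad BM = CM,\qquad AM = AM
\;\Longrightarrow\;
\triangle ABM \cong \triangle ACM
\;\Longrightarrow\;
\angle ABC = \angle ACB .
\]
```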

The thing is, symbolic systems aren't designed to propose those kinds of constructions. Idea generation isn't their forte. That's where a language model comes in. The language model in this case had only 151 million parameters, and it was trained on purely synthetic data. That synthetic training data was all about getting the model to produce proofs for various geometric statements.

In 91 million of those samples, brute force was enough: just step-by-step deduction using known rules. But in 9 million cases, you needed one of these auxiliary constructions, what the authors call pulling rabbits out of the hat. And the language model was fine-tuned on those examples; it paid particular attention to them.

Basically, it got really good at suggesting such constructions. Going back to this example, the moment you posit that line, an old-fashioned symbolic deducer can solve the rest. It can mechanically produce the proof that these two angles, the angle at B and the angle at C, are equal.

If, by the way, the deducer couldn't solve the problem, it would send it back to the language model to suggest further constructions. While most of that training data involved basic proofs, apparently one sample involved two constructions and a proof 247 deduction steps long. I can start to see why AlphaGeometry outperformed all but the best humans.
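To make that loop concrete, here is a minimal sketch of the propose-then-deduce structure. To be clear, this is my paraphrase, not DeepMind's released code; `language_model`, `symbolic_deducer` and the other names are hypothetical stand-ins for the real components.

```python
def alphageometry_style_solve(premises, goal, language_model, symbolic_deducer,
                              max_rounds=16):
    """Sketch of an AlphaGeometry-style loop (illustrative, not the real system).

    The symbolic engine handles the mechanical deduction; the language model is
    consulted only when that engine gets stuck, to propose an auxiliary
    construction (the "rabbit out of the hat").
    """
    constructions = []
    for _ in range(max_rounds):
        # Exhaustive, rule-based deduction from the premises plus any auxiliary
        # points or lines proposed so far.
        proof = symbolic_deducer(premises + constructions, goal)
        if proof is not None:
            return proof  # the mechanical solver closed the remaining gap
        # Stuck: ask the language model for another construction, conditioned on
        # the problem and on what has already been tried.
        new_construction = language_model.propose(premises, goal, constructions)
        constructions.append(new_construction)
    return None  # no proof found within the budget
```

In the real system the language model proposes a whole beam of candidate constructions at each step rather than a single one, which is where the search budget discussed below comes in.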

A bit further down, somewhat sheepishly, the authors admit that these solutions tend not to be symmetrical like human-discovered theorems, as they are not biased towards any aesthetic standard. As in, these solutions don't look good, they look like trash, but they work. The lead author of the paper put this really well in a video on his own YouTube channel, and pointed out that the approach isn't fully novel.

- The general observation here is that given a hard problem, we usually have to come up with one or more rabbits in order to transform the problem into a more mechanical state, in such a way that the symbolic engine or the mechanical solver can just take the problem and then solve it.

But if the solver fails to solve the problem, then we can always come back and ask for more rabbits, and we keep doing this in a loop until we find the solution. And so with this observation, our solver pretty much reflects the structure of this observation, where we built a neural language model that is trained to propose magic constructions.

And then we built a symbolic engine that is tasked with handling all the mechanical cases and the mechanical deduction in geometry. And then we put these two components into a loop, so that we obtain a neuro-symbolic solver named AlphaGeometry. Let me point to an important fact: this neuro-symbolic structure is not a novel observation made in our work.

In fact, in 2020, Polu and Sutskever had already pointed out that a major limitation of theorem proving compared to humans is in fact the ability to generate original mathematical terms, and that this limitation might be addressable via generation from language models. - Geometry, it seems, might be particularly amenable to this approach.

As one IMO gold medalist and Fields Medalist put it, "Finding solutions for IMO geometry problems works a little bit like chess in the sense that we have a rather small number of sensible moves at each step." Nevertheless, he says he was stunned that they could make it work. They even cheekily compare their system, trained on a hundred million proofs, with GPT-4.

GPT-4 apparently had a success rate of 0%, often making syntactic and semantic errors. Of course, deciding which of the many possible constructions to use is a question of search and compute budget. But they noticed that, using less than 2% of that search budget, examining eight candidate constructions at each step rather than the 512 used at test time, the system could still solve 21 problems.

That would still put it just below silver-medalist level and way above the previous state of the art. Speaking of search and compute budget though, I couldn't help but notice this: they used NVIDIA V100 GPUs and said, somewhat modestly, "Scaling up these factors to examine a larger fraction of the search space might improve AlphaGeometry results even further." I think, frankly, that's an understatement, because the V100 was replaced in 2020 by the A100, which was itself recently replaced by the H100.

And yes, I know I pronounce my H's in a Cockney way. Even the H100 from NVIDIA is gonna be replaced this year with the B100 and next year with the X100. I've almost lost count of how many generations out of date the V100 is. So the fact that they used V100s is incredibly impressive.

I feel like the bitter lesson is gonna strike again soon, and IMO geometry is gonna be all but solved by next year. I must caution, though, that this had been foreseen, including by Paul Christiano, former head of the alignment team at OpenAI and an IMO participant when he was younger. He predicted that AI would soon solve most geometry problems essentially for free.

DeepMind, in their blog post, go a bit further though. They describe this as demonstrating AI's growing ability to reason logically and to discover and verify new knowledge. I feel like there might be years more of debate over whether it's appropriate to use the word reason for what's happening here.

But in the end, it might just come down to semantics. Nevertheless, Google are open-sourcing the AlphaGeometry code and model. Within a year, they hope it will be inside Google's Gemini. Remember, Google also promised that AlphaCode 2 will be put inside Gemini. So that's a lot of Alphas to go around.

Of course, many of you might be wondering if this is an example of mathematics falling first, which would then lead to a torrent of results impacting everything in theoretical science, as one machine learning professor put it. Well, we simply don't know. As a co-founder of xAI and former Googler put it, it leaves a lot of questions open.

He said it's not easily generalizable to other domains and other areas of math. That's not gonna stop the lead author from attempting to generalize the system across mathematical fields and beyond. But speaking of AlphaCode and open-sourcing, we now have AlphaCodium. It's open source, runs with a single click, and is claimed to beat AlphaCode 2 without fine-tuning.

All the relevant links will be in the description. But there's another reason why I bring it up in this video: not just that it's brand new and state of the art, but that it follows that same theme of LLMs proposing solutions and iterating based on feedback from the environment, in this case code unit tests.

As Andrej Karpathy puts it, we are moving away from the naive prompt followed by an autoregressive, token-by-token answer, where LLMs like GPT-4 are forced to put out immediate solutions. It's becoming more like a conversation between LLMs and their environment. In my own tests for SmartGPT 2.0, I'm discovering the same thing as the authors when they say this: "Try to avoid direct questions and leave room for exploration." The way I would translate that is that if you force an LLM into an immediate answer, it will pick an answer and then stick to it.

It values fluency over accuracy. So what's the answer? Try to avoid those direct questions and encourage reflection. That's probably why chain of thought works so well. Here's a great summary from Santiago on Twitter. First, AlphaCodium, which is model-agnostic, gets the LLM to reason about the problem: describe it using bullet points and focus on the goal, inputs, outputs, rules, et cetera.

Then make the model reason about the tests it would need. Generate potential solutions and rank them in order of correctness, simplicity, and robustness. Now generate more diverse tests for edge cases. And here's the key step: pick a solution, generate the code, and run it on a few test cases.

If the tests fail, improve the code and repeat the process. I can't help but notice that this is eerily reminiscent of some of the prep work I did for SmartGPT. I won't go through it now, but it involved commanding the model not to output a solution immediately.

In fact, I wanted it to generate the mistakes that students might make. Then I would force it to come up with test cases. The rest of the steps I might cover in another video, but it was that same approach, that same idea: don't get the model to output an immediate answer, delay that as long as possible, and first generate test cases.
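Sketched as code, that kind of flow looks something like the snippet below. This is my own rough paraphrase of the AlphaCodium-style recipe, not its actual implementation; `llm` and `run_tests` are hypothetical callables you would have to supply yourself.

```python
def alphacodium_style_solve(problem, llm, run_tests, max_iterations=5):
    """Sketch of an AlphaCodium-style flow (illustrative, not the real repo):
    reason about the problem first, generate tests, then iterate on a
    candidate solution using feedback from those tests.
    """
    # 1. Reason about the problem before writing any code.
    analysis = llm("Describe this problem in bullet points: goal, inputs, "
                   "outputs, rules.\n\n" + problem)
    # 2. Ask for test cases, including extra edge cases.
    tests = llm("Write input/output test cases for this problem, including "
                "edge cases.\n\n" + analysis)
    # 3. Generate candidate solutions and rank them.
    ranked = llm("Propose a few possible solutions and rank them by "
                 "correctness, simplicity and robustness.\n\n" + analysis)
    code = llm("Write code for the top-ranked solution.\n\n" + ranked)
    # 4. Run the tests; on failure, feed the report back and try again.
    for _ in range(max_iterations):
        passed, details = run_tests(code, tests)  # assumed to return (bool, str)
        if passed:
            return code
        code = llm("These tests failed:\n" + details + "\n\nFix this code:\n" + code)
    return code  # best effort once the iteration budget runs out
```

The key design choice is that last loop: the model is never asked for a one-shot final answer; it always gets another pass after the environment, here the unit tests, pushes back.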

It's almost like you're forcing it to reason logically. And yes, in case you're wondering, this works amazingly well for mathematics. Here are some of the results of AlphaCodium compared to direct prompting across a range of models. So you might mention this video to anyone who thinks LLMs have peaked. The theme of using them for idea generation and then external experimentation just keeps occurring again and again in the literature.

We saw it with Eureka, and if you haven't seen my video on that, do check it out. The LLM, GPT-4, would propose reward functions; these would be tested in a simulated environment, and the reflections fed back in. And even the notorious LLM skeptic Professor Rao, who I interviewed for AI Insiders, updated his original paper on the planning abilities of LLMs in November, tweaking the ending to say this: "We demonstrate that LLM-generated plans can improve the search process for underlying sound planners and additionally show that external verifiers can help provide feedback on the generated plans and back-prompt the LLM for better plan generation." Coming from him, that's borderline euphoric.

And yes, I can't help but mention that I go into more detail on this topic on AI Insiders on Patreon; the link is in the description. And that's not just in this video on its implications for embodiment and robotics. I also interviewed Professor Rao for this video on reasoning as the holy grail of artificial intelligence.

While we're here though, I can't resist mentioning that I also released this video tonight on AI Insiders. Basically, it's my attempt, through analyzing five papers, to answer the question of whether LLMs boost worker productivity. And no, unfortunately the ad is not yet over, because today I also released this video from Donato Capitella.

He's an AI Insider himself and one of the benefits is that members can submit explainers for other insiders to watch. The best of these I'll talk about on the main channel, which is what I'm doing right now. This was a fantastic video from Donato who is a cybersecurity consultant based in London.

In fact, I fairly recently met up with him, again, proving that I am not GPT-5. I'm even gonna go one step further and recommend his YouTube channel. I think it is criminally underrated. He creates, partly with AI admittedly, these amazing detailed diagrams to explain certain topics. If you wanna know what I mean, check out his channel.

So no, in summary, LLMs are not peaking. But here's another quick example. Just 48 hours ago, we heard about Google laying off a thousand workers. But what about the workers at Google DeepMind? No, Google is spending hundreds of thousands to millions of dollars to keep those workers.

That's because OpenAI has hired at least six of Google's Gemini contributors since October. Indeed, money-wise, I would say things are heating up rather than slowing down. I imagine Samsung have signed a multi-billion dollar contract to get access to Google Gemini models in their smartphones. And apparently, Samsung will be among the first partners to test Gemini Ultra.

So no, AlphaGeometry and AlphaCodium are definitely not AGI, but neither is the race to AGI slowing down anytime soon. Thank you so much for watching, and have a wonderful day.