back to index

Alpha Everywhere: AlphaGeometry, AlphaCodium and the Future of LLMs


Whisper Transcript | Transcript Only Page

00:00:00.000 | 24 hours ago, Google DeepMind released Alpha Geometry,
00:00:04.360 | and while their leaders are calling it a step toward AGI,
00:00:08.600 | the team itself is warning everyone not to overhype it.
00:00:12.720 | I've read the paper in Nature, the press releases,
00:00:15.120 | and associated interviews, and feel
00:00:16.880 | that hitting gold for geometry in the International Math
00:00:20.320 | Olympiad is significant more so for what
00:00:22.880 | it signifies about the growing alliance between language
00:00:26.280 | models and search, idea generation, and brute force.
00:00:30.520 | In that same vein, we'll also take a quick peek
00:00:33.600 | at Alpha Codium, the brand new open-sourced rival to Alpha
00:00:39.000 | Code from Google DeepMind.
00:00:40.600 | But let's start all the way down in the day-to-day way
00:00:44.740 | AI is now being used for math education.
00:00:48.200 | If you think this is the way to go to get kids interested,
00:00:51.240 | let me know.
00:00:52.000 | It's bordered off by these two values.
00:00:54.240 | So in this case, the integral would be
00:00:56.200 | the area of this shape here.
00:00:57.680 | But what about this other stuff here?
00:00:59.360 | Let me take it from here, Kim.
00:01:00.920 | That tall swirly symbol on the left
00:01:02.960 | is an S, which stands for sum.
00:01:04.880 | What are we summing?
00:01:05.720 | We're summing the area of these strips.
00:01:07.640 | A tiny distance, dx, multiplied by the height, which
00:01:11.480 | is the value of the function.
00:01:12.800 | But these are way too thick, Taylor.
00:01:14.560 | dx is actually really, really tiny.
00:01:17.080 | For those who don't know, the International Math Olympiad
00:01:20.360 | is the most prestigious math competition in the world.
00:01:24.240 | I remember competing in challenges
00:01:26.000 | just to get into the International Math Olympiad.
00:01:29.160 | Spoiler, I didn't get in.
00:01:30.320 | But I would say I never studied that hard.
00:01:32.600 | Anyway, this new system, Alpha Geometry,
00:01:34.560 | scores almost as highly as the average IMO gold medalist,
00:01:38.920 | but specifically for a subset of geometry problems only.
00:01:43.200 | Not algebra or number theory.
00:01:44.960 | We're talking just geometry.
00:01:46.720 | So it's not like Alpha Geometry did an IMO test.
00:01:49.840 | It just did 30 geometry IMO questions.
00:01:53.080 | Nevertheless, getting a gold medal overall in the IMO
00:01:56.680 | has long been one of the holy grails of machine learning.
00:02:00.600 | That's maybe why one of the co-founders of DeepMind
00:02:03.000 | said AGI keeps getting closer.
00:02:05.000 | And even Demis Hassabis, the leader of DeepMind
00:02:07.760 | and one of the other co-founders, said this.
00:02:10.120 | Congrats to the team.
00:02:11.280 | This represents another step on the road to AGI.
00:02:15.320 | He later edited out that last sentence,
00:02:17.680 | possibly because he read that the team said
00:02:19.680 | not to overhype it,
00:02:21.080 | but also he might have read some of the caveats
00:02:23.240 | in the paper itself.
00:02:24.520 | Of course, I'll get to the paper,
00:02:25.800 | but first I want to set the stage.
00:02:27.680 | There is now a grand prize of 5 million
00:02:30.360 | and an overall prize pool of 10 million
00:02:32.880 | for getting gold in the IMO.
00:02:35.160 | Two years ago, the forecast on Metaculous
00:02:37.480 | for an AI getting a gold medal was 2037.
00:02:41.320 | And what is it as of tonight?
00:02:43.520 | 2027.
00:02:44.880 | And of course, you don't need me to tell you
00:02:46.680 | that's just three and a half years away.
00:02:48.760 | So how does it work?
00:02:49.880 | Well, alpha geometry is a neuro-symbolic system,
00:02:53.080 | a combination of a neural network
00:02:55.280 | and the old-fashioned symbolic pre-programmed systems.
00:02:58.560 | And in fact, that alliance between large language models,
00:03:01.120 | neural networks, and old-fashioned pre-programmed systems
00:03:04.760 | is going to be the theme of this video.
00:03:06.560 | Idea generation, and you could call it creativity,
00:03:09.400 | plus brute force and search.
00:03:11.680 | That alliance, I predict in the future, will yield AGI.
00:03:15.400 | Here is a simple example of how it works.
00:03:18.040 | Imagine you're trying to prove
00:03:19.440 | that two angles are equal in an isosceles triangle.
00:03:23.080 | A key part of that proof
00:03:24.480 | is to drop a perpendicular line down from A
00:03:27.640 | to hit the midpoint of B and C.
00:03:30.040 | The thing is, symbolic systems aren't designed
00:03:32.120 | to propose those kinds of constructs.
00:03:34.440 | Idea generation isn't their forte.
00:03:36.640 | That's where a language model came in.
00:03:38.720 | The language model in this case was only 151 million parameters
00:03:43.360 | and it was trained on a purely synthetic data.
00:03:46.400 | The purely synthetic training data
00:03:48.400 | was all about getting the model to provide proofs
00:03:51.480 | for various geometric statements.
00:03:53.600 | In 91 million of those samples, brute force would be enough,
00:03:57.400 | just step-by-step deduction using known rules.
00:04:00.040 | But in 9 million cases,
00:04:01.560 | you would need one of these constructs.
00:04:03.960 | The authors call them pulling rabbits out of the hat.
00:04:06.280 | And the language model was fine-tuned on those examples.
00:04:09.480 | It paid particular attention to those examples.
00:04:12.440 | Basically, it got really good at suggesting such constructs.
00:04:16.040 | Going back to this example, the moment you posit that line,
00:04:19.240 | an old-fashioned symbolic deducer could then solve the rest.
00:04:23.000 | It could mechanically produce the proof
00:04:25.360 | that these two angles are equal.
00:04:27.280 | Basically, the angle at B and the angle at C.
00:04:29.640 | If, by the way, the deducer couldn't solve the problem,
00:04:32.560 | it would send it back to the language model
00:04:34.640 | to suggest other constructs.
00:04:36.280 | While most of that training data involved basic proofs,
00:04:39.160 | apparently one involved two constructs
00:04:41.840 | and a proof length of 247 deduction steps.
00:04:46.200 | I can start to see why Alpha Geometry
00:04:48.240 | outperformed all but the best humans.
00:04:50.480 | A bit below, somewhat sheepishly,
00:04:52.040 | the authors admit that these solutions
00:04:54.160 | tend not to be symmetrical like human-discovered theorems,
00:04:57.840 | as they are not biased towards any aesthetic standard.
00:05:01.400 | As in, these solutions don't look good,
00:05:03.720 | they look like trash, but they work.
00:05:05.480 | The lead author of the paper put this really well
00:05:08.120 | in a video on his own YouTube channel
00:05:10.720 | and pointed out that the approach isn't fully novel.
00:05:13.760 | - The general observation here is that given a hard problem,
00:05:16.720 | we usually have to come up with one or more rabbits
00:05:19.720 | in order to transform the problem
00:05:21.160 | into a more mechanical state,
00:05:23.280 | in such a way that the symbolic engine
00:05:25.120 | or the mechanical solver can just take the problem
00:05:27.280 | and then solve it.
00:05:28.400 | But if the solver failed to solve the problem,
00:05:31.120 | then we can always come back and ask for more rabbits.
00:05:33.680 | And then we keep doing this in a loop
00:05:34.960 | until we find the solution.
00:05:36.360 | And so with this observation,
00:05:37.680 | our solver here pretty much reflect
00:05:40.120 | the structure of this observation,
00:05:41.840 | where we built a neural language model
00:05:44.280 | that is trained to propose magic instruction.
00:05:46.720 | And then we built a symbolic engine
00:05:48.560 | that is tasked with handling all the mechanical cases
00:05:51.120 | and the mechanical deduction in geometry.
00:05:53.200 | And then we put these two components into a loop
00:05:55.840 | so that we obtain a neural symbolic solver
00:05:58.960 | named alpha geometry.
00:06:00.560 | Let me point to an important fact,
00:06:02.800 | that is the observation of neural symbolic structure
00:06:06.000 | is not a novel observation that is made in our work.
00:06:09.040 | In fact, in 2020, Polu and Suskever
00:06:11.480 | have already pointed out that a major limitation
00:06:13.840 | of theorem proving compared to human
00:06:15.920 | is in fact the ability to generate
00:06:17.680 | original mathematical terms.
00:06:19.040 | And this limitation might be addressable
00:06:21.120 | via the generation from language models.
00:06:23.600 | - Geometry, it seems, might be particularly amenable
00:06:26.680 | to this approach.
00:06:27.800 | As one IMO gold medalist and fields medalist put it,
00:06:31.200 | "Finding solutions for IMO geometry problems
00:06:34.200 | "works a little bit like chess
00:06:35.880 | "in the sense that we have a rather small number
00:06:37.760 | "of sensible moves at each step."
00:06:39.680 | Nevertheless, he says he was stunned
00:06:41.640 | that they could make it work.
00:06:42.920 | They even cheekily compare their system
00:06:44.920 | trained on a hundred million proofs with GPT-4.
00:06:48.520 | It apparently had a success rate of 0%,
00:06:51.640 | often making syntactic and semantic errors.
00:06:54.360 | Of course, deciding which of the many constructs to use
00:06:57.520 | is a question of search and compute budget.
00:07:00.240 | But they noticed that using less than 2%
00:07:03.160 | of that search budget,
00:07:04.720 | analyzing eight constructs each time
00:07:06.800 | versus 512 during test time,
00:07:09.280 | it could still solve 21 problems.
00:07:11.360 | That would still put it at just below
00:07:13.080 | the silver medalist level
00:07:14.600 | and way above the previous state of the art.
00:07:17.280 | Speaking of search and compute budget though,
00:07:19.400 | I couldn't help but notice this.
00:07:21.080 | They use NVIDIA's V100 GPUs and said somewhat modestly,
00:07:25.240 | "Scaling up these factors to examine a larger fraction
00:07:28.680 | "of the search space
00:07:30.200 | "might improve alpha geometry results even further."
00:07:33.240 | I think frankly, that's an understatement
00:07:35.280 | because the V100 was replaced in 2020 with the A100,
00:07:40.160 | recently replaced by the H100.
00:07:42.920 | And yes, I know I pronounce my H's in a Cockney way.
00:07:45.800 | Even the H100 from NVIDIA
00:07:47.760 | is gonna be replaced this year with the B100
00:07:50.800 | and next year with the X100.
00:07:53.080 | I've almost lost count now
00:07:54.080 | of how many generations behind the V100 is.
00:07:56.440 | So the fact that they use V100s is incredibly impressive.
00:07:59.560 | I feel like the bitter lesson is gonna strike again soon
00:08:02.400 | and IMO geometry is gonna be all but solved by next year.
00:08:06.320 | I must caution though that this had been foreseen,
00:08:09.000 | including by Paul Cristiano,
00:08:10.640 | former head of alignment at OpenAI
00:08:13.160 | and IMO participant when he was younger.
00:08:15.960 | He predicted that AI would soon solve
00:08:18.320 | most geometry problems essentially for free.
00:08:20.920 | DeepMind in their blog post go a bit further though.
00:08:23.520 | They described this as demonstrating AI's growing ability
00:08:26.800 | to reason logically
00:08:28.400 | and to discover and verify new knowledge.
00:08:30.960 | I feel like there might be years more of debate
00:08:33.040 | over whether it's appropriate to use that word reason
00:08:35.840 | for what's happening here.
00:08:37.280 | But in the end, it might end up being semantics.
00:08:39.920 | Nevertheless, Google are open sourcing
00:08:41.960 | the alpha geometry code and model.
00:08:44.080 | Within a year, they hope it will be inside Google's Gemini.
00:08:47.840 | Remember, Google also promised that that alpha code too
00:08:50.680 | will be put inside Gemini.
00:08:52.320 | So that's a lot of alphas to go around.
00:08:54.320 | Of course, many of you might be wondering
00:08:56.000 | if this is an example of mathematics falling first,
00:08:59.220 | which would then lead to a torrent of results
00:09:01.560 | that will impact everything in theoretical science
00:09:04.480 | as one machine learning professor put it.
00:09:06.480 | Well, we simply don't know.
00:09:07.920 | As the co-founder of XAI and former Googler put it,
00:09:11.440 | it leaves a lot of questions open.
00:09:13.400 | He said it's not easily generalizable to other domains
00:09:16.680 | and other areas of math.
00:09:18.420 | That's not gonna stop the lead author
00:09:20.160 | attempting to generalize the system
00:09:22.080 | across mathematical fields and beyond.
00:09:24.080 | But speaking of alpha code and open sourcing,
00:09:27.120 | we now have AlphaCodeum.
00:09:29.520 | It's open source, single click,
00:09:31.280 | and is claimed to beat AlphaCode2 without fine tuning.
00:09:35.000 | All the relevant links will be in the description.
00:09:37.520 | But there's another reason why I bring it up in this video,
00:09:40.000 | not just that it's brand new and state of the art,
00:09:42.560 | but it's also that same theme of LLM's proposing solutions
00:09:46.380 | and iterating based on feedback from the environment.
00:09:49.720 | In this case, code unit tests.
00:09:51.720 | As Andrej Karpathy puts it,
00:09:53.360 | we are moving away from that naive prompt
00:09:56.760 | to auto-regressive token by token answer,
00:09:59.960 | where LLMs like GPT-4
00:10:01.760 | are forced to put out immediate solutions.
00:10:04.540 | It's becoming more like a conversation
00:10:06.680 | between LLMs and their environment.
00:10:08.840 | In my own tests for SmartGPT 2.0,
00:10:11.680 | I'm discovering the same thing as the authors
00:10:14.160 | when they say this,
00:10:15.080 | "Try to avoid direct questions
00:10:17.120 | and leave room for exploration."
00:10:19.280 | Or the way I would translate that
00:10:20.920 | is that if you force an LLM into an immediate answer,
00:10:24.140 | it will then pick an answer and then stick to it.
00:10:26.840 | It values fluency over accuracy.
00:10:29.200 | So what's the answer?
00:10:30.040 | Try to avoid those direct questions.
00:10:32.240 | Encourage reflection.
00:10:33.560 | That's probably why chain of thought works so well.
00:10:36.040 | Here's a great summary from Santiago on Twitter.
00:10:39.160 | "First, AlphaCodeum gets the LLM and its model agnostic
00:10:42.320 | to reason about the problem.
00:10:44.120 | Describe it using bullet points
00:10:45.760 | and focus on the goal, inputs, outputs, rules, et cetera.
00:10:48.600 | Then make the model reason about the tests it would need.
00:10:51.360 | Generate potential solutions and rank them
00:10:53.560 | in order of correctness, simplicity, and robustness.
00:10:56.340 | Now generate more diverse tests for edge cases."
00:10:59.160 | And here's the key step.
00:11:00.440 | Pick a solution, generate the code,
00:11:02.560 | and run it on a few test cases.
00:11:04.840 | If the tests fail, improve the code and repeat the process.
00:11:08.160 | I can't help but notice that this is eerily reminiscent
00:11:10.760 | of some of the prep work I did for SmartGPT.
00:11:13.600 | I won't go through it now,
00:11:14.480 | but what it involved was commanding the model
00:11:16.880 | to not output a solution immediately.
00:11:19.320 | In fact, I wanted it to generate mistakes
00:11:21.400 | that students might make.
00:11:22.580 | Then I would force it to come up with test cases.
00:11:25.200 | And the rest of the steps I might cover in another video,
00:11:27.520 | but it was that same approach.
00:11:29.120 | The same idea, don't get the model
00:11:30.760 | to output an immediate answer,
00:11:32.520 | delay that as long as possible,
00:11:33.920 | and first generate test cases.
00:11:35.920 | It's almost like you're forcing it to reason logically.
00:11:39.020 | And yes, in case you're wondering,
00:11:40.300 | this works amazingly for mathematics.
00:11:42.520 | Here are some of the results of AlphaCodeum
00:11:44.640 | compared to direct prompting across a range of models.
00:11:48.040 | So you might mention this video
00:11:49.560 | to anyone who thinks LLMs have peaked.
00:11:52.280 | The theme of using them for idea generation
00:11:54.840 | and then external experimentation
00:11:56.880 | just keeps occurring again and again in the literature.
00:11:59.860 | We saw it with Eureka.
00:12:01.200 | And if you haven't seen my video on that, do check it out.
00:12:03.400 | The LLM GPT-4 would propose reward functions
00:12:06.260 | and these would be tested in a simulated environment
00:12:09.120 | and the reflection fed back in.
00:12:10.840 | And even the notorious LLM skeptic, Professor Ralph,
00:12:14.000 | who I interviewed for AI Insiders,
00:12:16.080 | updated in November his original paper
00:12:18.480 | on the planning abilities of LLMs,
00:12:20.740 | tweaking the ending to say this,
00:12:22.800 | "We demonstrate that LLM generated plans
00:12:25.520 | can improve the search process
00:12:27.560 | for underlying sound planners
00:12:29.400 | and additionally show that external verifiers
00:12:31.480 | can help provide feedback on the generated plans
00:12:34.260 | and back prompt the LLM for better plan generation."
00:12:37.600 | Coming from him, that's borderline euphoric.
00:12:40.120 | And yes, I can't help but mention
00:12:41.640 | that I go into more detail on this topic
00:12:43.760 | on AI Insiders on Patreon.
00:12:45.740 | The link is in the description.
00:12:47.440 | And that's not just for this video
00:12:49.040 | on its implications for embodiment and robotics.
00:12:51.680 | I also interviewed Professor Rao for this video
00:12:54.640 | on reasoning as the Holy Grail for artificial intelligence.
00:12:58.320 | While we're here though, I can't resist mentioning
00:13:00.800 | that I also released this video tonight on AI Insiders.
00:13:04.640 | Basically, it's my attempt through analyzing five papers
00:13:07.880 | to answer the question as to whether LLMs
00:13:10.640 | boost worker productivity.
00:13:12.260 | And no, unfortunately the ad is not yet over
00:13:14.920 | because today I also released this video
00:13:17.360 | from Donato Capitella.
00:13:19.080 | He's an AI Insider himself and one of the benefits
00:13:22.040 | is that members can submit explainers
00:13:24.240 | for other insiders to watch.
00:13:25.920 | The best of these I'll talk about on the main channel,
00:13:28.320 | which is what I'm doing right now.
00:13:29.680 | This was a fantastic video from Donato
00:13:32.120 | who is a cybersecurity consultant based in London.
00:13:35.040 | In fact, I fairly recently met up with him,
00:13:36.960 | again, proving that I am not GPT-5.
00:13:39.480 | I'm even gonna go one step further
00:13:41.080 | and recommend his YouTube channel.
00:13:43.100 | I think it is criminally underrated.
00:13:45.420 | He creates, partly with AI admittedly,
00:13:47.820 | these amazing detailed diagrams to explain certain topics.
00:13:51.700 | If you wanna know what I mean, check out his channel.
00:13:53.860 | So no, in summary, LLMs are not peaking.
00:13:57.460 | But here's another quick example.
00:13:58.660 | Just 48 hours ago, we heard about Google
00:14:00.780 | laying off a thousand workers.
00:14:02.540 | But what about their workers at Google DeepMind?
00:14:04.780 | No, those workers, they are spending hundreds of thousands
00:14:07.660 | to millions of dollars to keep them at Google.
00:14:10.540 | That's because OpenAI has hired at least six
00:14:13.500 | of Google's Gemini contributors since October.
00:14:16.340 | Indeed, money-wise, I would say things are heating up
00:14:19.260 | rather than slowing down.
00:14:20.700 | I imagine Samsung have signed a multi-billion dollar contract
00:14:24.180 | to get access to Google Gemini models in their smartphones.
00:14:27.900 | And apparently, Samsung will be among the first partners
00:14:30.480 | to test Gemini Ultra.
00:14:32.100 | So no, Alpha Geometry and Alpha Codium
00:14:34.980 | are definitely not AGI, but neither is the race to AGI
00:14:39.820 | slowing down anytime soon.
00:14:41.820 | Thank you so much for watching and have a wonderful day.