Hi, everyone. I'm Aakanksha Chowdhery. I was at Google for more than six years, where I led the research for PaLM and was a lead researcher on Gemini. These days, I'm working on pushing the frontier of autonomous coding with reinforcement learning. So let me recap the arc of how we have progressed in large language models, and why autonomous coding, and why now.
I think everyone here remembers, but for those of you who don't: in 2020, there was a breakthrough paper on scaling laws for large language models. The 30-second recap is that it showed a power-law relationship between the test loss of large language models and the compute, data, and parameters used to train them.
So if you use more compute, more data, and put more parameters in your machine learning model, which is a transformer model, you will get more performant models. And it will not be performant just in the domain in which you are training the model. It will actually be performant, and it will generalize to many other domains.
And the generalization was pretty much a feature in this particular case. So as the large language models got bigger, we saw continuous improvement across benchmarks to the point that they're starting to get saturated now. And the other interesting thing was that we saw emergent behavior where capabilities were emerging in large language models that were not present in smaller models.
And this is a classic slide that I show for the work that we did in PaLM. Typically, when you try to get a model to solve math problems, you give it some examples: on the left, you have a worked math problem about tennis balls, and then when you give the model a second problem, its output comes out wrong.

But what PaLM and the subsequent set of papers showed was that if you ask the model to output its reasoning chain, which has become a very common concept now, but remember, this was 2021, four years ago, then the answer actually comes out correct.
So basically, by getting the model to output its chain of thought, or reasoning chain, the model's performance improves. And this capability emerged particularly in large language models. These are all the models: LaMDA and PaLM were the state-of-the-art models about three years ago, and what I'm showing on the x-axis is the increasing number of parameters.

PaLM was scaled all the way up to 540 billion parameters. No one actually publishes parameter counts these days, so you have to live with graphs from three years ago or with the open-source models coming out, like DeepSeek and Qwen. What the y-axis shows is that the solve rate on middle-school math word problems was increasing with the number of parameters in the models.

And it was increasing mainly when you prompt the models and ask them to show their chain of thought. This led to all kinds of prompting techniques: you ask the model to think step by step, you even bribe the model, you ask it nicely, or not.
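To make the idea concrete, here is a minimal sketch of the difference between standard few-shot prompting and chain-of-thought prompting. The `call_model` function is a hypothetical stand-in for whatever completion API you use, and the exact exemplar wording is only illustrative.

```python
# Minimal sketch of standard vs. chain-of-thought few-shot prompting.
# `call_model` is a hypothetical stand-in for any LLM completion API.

FEW_SHOT_STANDARD = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.

Q: {question}
A:"""

FEW_SHOT_COT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical LLM call; replace with your provider's completion API."""
    raise NotImplementedError

question = ("The cafeteria had 23 apples. If they used 20 to make lunch "
            "and bought 6 more, how many apples do they have?")

# With the chain-of-thought exemplar, larger models tend to emit their own
# reasoning steps before the final answer, which is where the gains show up.
# answer = call_model(FEW_SHOT_COT.format(question=question))
```

The only change between the two prompts is that the worked example includes the intermediate reasoning, which large enough models then imitate before giving their final answer.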
So this was all kinds of fun stuff. And I think the thing that really stood out from that generation of models a few years ago was that this capability was not limited to math problems. It generalized across a whole bunch of domains, anywhere from question answering in other languages to puzzle problems to multitask natural language understanding.
And what this led to next was that now that these models could reason, we could get them to follow instructions. So the first set of applications that became possible with these large language models were chatbot applications. So everyone remembers that ChatGPT and now Gemini and various other chatbots have become extremely popular.
All of us use them all the time. But what made them really possible was that when you give instructions to the model to go do something, it's actually able to do it. And the way it learns that is actually based on reinforcement learning. And the reinforcement learning data that we're giving to the model in this particular case is essentially data based on human feedback.
So you're basically saying: here is a set of questions, and if I gave one of them to a human along with two candidate answers, which one would the human prefer? If you have enough of this data and you train your model on it, you end up with better performance, because you've taught the model which responses to prefer.
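As a rough sketch of how that preference data is typically turned into a training signal, here is the pairwise, Bradley-Terry style loss commonly used to train a reward model. The `reward_model` mentioned in the comments is a hypothetical language model with a scalar head, not anything specific to the systems discussed in this talk.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the scalar reward of the human-preferred
    response above the reward of the rejected response."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical usage with a reward_model(prompt, response) -> scalar score:
#   r_chosen   = reward_model(prompt, chosen_response)
#   r_rejected = reward_model(prompt, rejected_response)
#   loss = preference_loss(r_chosen, r_rejected)
# The trained reward model then provides the feedback signal for RLHF.
```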
And this doesn't only work in chatbot applications; it also works in code. On the bottom right, I'm showing that even if you do this for coding applications, you start to see some performance improvements. Now, of course, last year there was a whole bunch of debate about whether we are hitting a wall in the performance of large language models and whether pre-training is no longer giving gains; all of these questions were on the horizon.
So what is next? One of the key things to remember in all of this is that when you pre-train these models, you end up spending a lot of money on training, possibly tens of millions of dollars, while inference on the models is extremely cheap.

These numbers are not endorsed by any of the companies I worked at; they come from public sources. Going back to the main point I want to make here: training is extremely costly. So if constantly scaling up the model size is not giving performance gains, can we instead get performance gains at inference time, since inference calls are so cheap?
A key idea that turned out to be extremely useful here was to get the model to generate multiple responses and then do majority voting. In the example above, the specific prompt isn't important; the idea is that you've given a mathematical problem to the large language model and you're asking it to generate three answers independently.

Then you do some voting on top of those answers: if two answers match, that's the majority vote. Or, if I asked a question in this room and all of you said yes, that would be a majority vote. Similarly, if you can get a large language model to generate many, many samples and many of those answers agree, you can take the consistent one. This notion of majority voting, or self-consistency, had already shown gains.
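A minimal sketch of that sampling-plus-voting loop might look like the following; `call_model` and `extract_final_answer` are hypothetical helpers, one wrapping an LLM sampling API and the other parsing out the final answer (for example, the text after "The answer is").

```python
from collections import Counter

def self_consistency(question: str, n_samples: int = 16,
                     temperature: float = 0.7) -> str:
    """Sample several independent reasoning chains and return the most common
    final answer (majority voting / self-consistency)."""
    answers = []
    for _ in range(n_samples):
        completion = call_model(question, temperature=temperature)  # hypothetical LLM call
        answers.append(extract_final_answer(completion))            # hypothetical parser
    # Majority vote: the answer the largest number of independent samples agree on.
    return Counter(answers).most_common(1)[0][0]
```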
So this kind of scaling of compute at inference time was clearly one avenue to push on. Another avenue that emerged and showed substantial value was to sequentially revise your previous response. As humans, we often write a first answer, then evaluate it and realize there's some mistake.

It doesn't quite match, and then we go fix it. So can we get LLMs to do the same kind of revision, looking at their previous attempts? That was the second avenue: having longer chains of thought and getting the model to improve consistently at inference time based on them.

And these kinds of techniques, where you can verify the correct answer, as in math, or in programming where you have unit tests, showed very clear gains. What I'm showing you here is an example from one of my colleagues' work at Stanford, which is a publicly published paper.
On the y-axis, we have pass@k, or the coverage score, and on the x-axis, the number of samples. As you draw more samples along the x-axis, your coverage improves: with an open-source DeepSeek model and simply taking more samples, you get a very high score on SWE-bench Verified, compared even to the state of the art at the end of 2024.
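For reference, pass@k, the coverage metric on that y-axis, is usually computed with the unbiased estimator below: given n sampled solutions of which c pass verification, it is the probability that at least one of k randomly chosen samples passes. The specific numbers in the example are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them verified correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 15 of which pass the tests.
# pass@1 is only 0.075, but pass@100 is already ~1.0, which is why coverage
# keeps climbing as you spend more inference-time compute on sampling.
print(pass_at_k(200, 15, 1), pass_at_k(200, 15, 100))
```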
Of course, by now all of these SWE-bench scores have been pushed up, and we are roughly somewhere around 80% already. But the takeaway here is that these lines of work showed that inference-time compute predictably gives us gains, especially in domains where we can verify.

If we know how to verify the answers, then we know how to translate that into intelligence. And going back to my talk title, coding is one of those domains where we do have the capability to verify, and that gives us a tremendous advantage in terms of building superintelligence on top of autonomous coding.
Of course, now you ask: what does automated verification mean here? For inference-time scaling to work, you need some way to say that an output is correct. In math, there's a very simple example: if the input is a mathematical equation to solve, you can do the same calculation on a calculator and verify that the solution is correct.

Similarly, in math you have formal proofs, so you can verify that things are correct. In coding, you have unit tests and compilers: you can actually run the generated code and use the compiler or interpreter as a verifier. And in fact, in domains where you don't have this kind of verification, there's a large gap.

If you generate a lot of solutions and then do majority voting without a verifier, you don't get as much gain. So what this roughly meant was that inference-time scaling works in scenarios where I have automated verification, but that alone doesn't quite solve the problem of having real-world impact.
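To make the coding case concrete, here is a minimal sketch of execution-based verification: write a candidate solution and its unit tests to a scratch directory, run them, and treat a passing test suite as "verified correct". This assumes pytest is available and, unlike a real harness, does no sandboxing, so it should not be used on untrusted model output as-is.

```python
import os
import subprocess
import tempfile

def verify_with_tests(candidate_code: str, test_code: str, timeout_s: int = 30) -> bool:
    """Return True if the candidate solution passes the given unit tests."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "solution.py"), "w") as f:
            f.write(candidate_code)
        with open(os.path.join(tmp, "test_solution.py"), "w") as f:
            f.write(test_code)
        try:
            # Run the tests; a zero exit code means every test passed.
            result = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0
```

The same pass/fail signal can either filter inference-time samples or serve as the reward for reinforcement learning.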
The reason automated verification alone isn't enough is shown in this graph. Across multiple different models on GSM8K, which is middle-school math word problems, and on another math benchmark, if you sort the problems by the fraction of correct generations, you see that you have to sample a lot.

The correct generations can be very rare. Who has time to sample 10,000 times and then find the correct solution? You would be sitting there waiting, unless you can actually figure out which generation is the correct one. So scaling inference-time compute with just majority voting or longer reasoning chains is great in the sense that some correct solutions are in there somewhere, but it doesn't work well across the board.
So what will get these models to learn to generate correctly during training? Well, in the chatbot application scenario, we saw that RL with human feedback did work. So can we apply the same principle here and get the model to generate correctly in settings where we can automatically verify the outputs?
So our belief at Reflection is that the next frontier for scaling is reinforcement learning. We already have proof points from some of the frontier labs as well. And as David Silver and Rich Sutton published recently, they agree, or rather, they are the pioneers of reinforcement learning.

They say we are entering the era of experience. Starting from AlphaGo and AlphaZero, you had the era of simulation; the large language model era that followed was where you scaled up with RL using human data. But the era that begins this year is really the era of experience, which will lead us to superintelligence.
So reinforcement learning will be a fundamental component in building superintelligent systems, especially in areas where we have automated verification. A proof point for why this makes sense comes from math. These are results from o1, but over several papers we have already seen that if you give the model more test-time compute, shown on the right-hand plot (test-time compute is the same idea as inference-time scaling), accuracy goes up.

And as you repeat this process with reinforcement learning, increasing training-time compute on the x-axis also improves accuracy on the y-axis for a challenging math benchmark called AIME. Most of these benchmarks saturate within a year, as you have probably learned by now.

So this benchmark is already saturated. Now that I've hopefully convinced you that reinforcement learning, and scaling reinforcement learning, is the next frontier, you might ask: why isn't everyone doing it? What's so challenging about it? Having built large language models before, I can say that a big part of building these systems is that the machine learning plus systems stack is itself very challenging.
Here is an example of why scaling up reinforcement learning is challenging. If you are doing reinforcement learning with PPO, which is one of the algorithms used for RL with human feedback, before the field moved to alternatives like DPO, you have to keep four copies of different models in memory.

So if you imagine a really large model and you have to keep four copies of it, you then have to arrange them somewhere on the GPUs of your large cluster. You can have some fun figuring out the exact layout, and it's a fun and interesting problem, but it's hard in the sense that getting maximum utilization out of these systems and arranging everything the right way, just building that system, is extremely hard.

And DeepSeek actually showed with DeepSeekMath that GRPO gets rid of the value model, so you only keep three copies of the model, but that's still a very challenging problem. So scaling up RL is even more challenging than scaling up LLMs, because you have multiple copies of the model and you have both a training loop and an inference loop.
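For a sense of what GRPO changes mechanically, here is a minimal sketch of its group-relative advantage computation, following the DeepSeekMath formulation as I understand it: instead of a learned value model, each completion's reward is normalized against the other completions sampled for the same prompt.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each completion's scalar reward by the
    mean and standard deviation of its group (completions for the same prompt).
    This baseline replaces the separate value model that PPO would require."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions sampled for one prompt, rewarded 1.0 if the unit tests
# pass and 0.0 otherwise. Passing completions get positive advantage, failing
# ones negative, with no value network involved.
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
```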
Then, on the reinforcement learning side, you also suffer a lot from reward hacking if the thing deciding what counts as a correct answer is a neural reward model. But as we discussed before, in autonomous coding applications you do have the ability to verify your output, which roughly means you can decide whether an answer is correct or not.

That's how SWE-bench Verified scores work today: you have execution feedback, you have unit tests. All of these possibilities, and this is an ongoing list, mean that you can design better reward functions. Okay, so this means that autonomous coding is a great domain for scaling up RL.
Then the question becomes: how does this have real-world impact? In software engineering applications, generation of code is only one part of the system. If you look at end-to-end workflows for software engineering, there are many more parts to that system. How do you scale up your system to generalize across all of those?
So that's the problem we are trying to solve at Reflection. Our mission is to build superintelligence, and we are starting with autonomous coding as the root-node problem for that mission. We have a team of about 35 people who have pioneered various legendary works in LLMs and reinforcement learning.

So if you're excited about this mission, you can reach out to one of us, or to my email, which is my last name at reflection.ai, and we would love to work with you. And with that, I can take questions. All right, same protocol as last time.
If you have a question, please come up to one of the three microphones we have distributed throughout the room. We can probably take one or two questions, so if you want to ask something, feel free. I guess I'll do the first one while people are coming up.
So I'm curious: it seems like the foundation model labs are trying to build one model and deploy it across everything. With the work you're doing right now, do you have an opinion on whether that's the right approach, or do you think there will be more specialization across different programming languages, or even individual code bases? Or do you feel the best approach is just one model trained across the greatest diversity of tasks possible?
I'll answer your question this way: building coding agents does require multiple capabilities, and to get there you will definitely need multiple LLM calls. Whether that's one model or multiple models is, I think, the secret sauce right now for most people.
Fair enough. All right. Please.

Hi. I'm wondering about the slide with the chart of the era of simulation, the era of human data, and the era of experience. They put AlphaGo on it, and in the previous era they also played StarCraft, and those all used MCTS, which, maybe it's just my unfamiliarity with them, but that is also simulated data.

So we're using synthetic data for the era of experience as well. So why is that called simulation, and why is what we're doing right now not called simulation? What's the overlap between simulation and experience? How do you think about that?

I could ask Dave that question, you know, but I think the better way to answer it is roughly what Greg covered in the last talk. His point was that in gaming, you can envision what scenarios might happen next, and you're basically using that to drive your reinforcement learning.
So you're doing rollouts and you're building on top of that. In the real world, in most scenarios, you have imperfect rollouts: you don't have full knowledge of how the system works. Simulation is possible in certain domains where you can build a world model, which is closer to robotics and all the work happening in the physical AI space.

But in the real-world applications we're targeting, you will have imperfect information. So you have to actually experience the real world and collect data, and that data is not going to be complete in any way, nor will it completely cover the exponential search space that could exist.