Meta Superintelligence: Self-Improving Bootstrapping, Compute as Teacher, ARE: Agent Research Envs

Chapters
0:00 Introduction to the three papers
0:33 Compute as Teacher: Turning Inference Compute into Reference-Free Supervision
4:21 ARE: Scaling Up Agent Environments and Evaluations
10:20 Bootstrapping Task Spaces for Self-Improvement
41:14 ARE: Scaling Up Agent Environments and Evaluations (revisited)
56:19 Details on the dual-agent model in the Compute as Teacher paper
57:43 Analysis of different models and APIs in the ARE paper
00:00:00.420 |
So, the new Meta Superintelligence lab has three papers out; 00:00:14.760 |
I think we'll actually only have time for two of them. 00:00:17.900 |
They're basically really focused on better RL approaches. 00:00:22.600 |
And here, I'm actually just sharing my screen right now. 00:00:25.780 |
If someone else wants to share the links, that would be sick. 00:00:32.400 |
Calendar invite has a bunch of like tweet threads, but yeah, you can find them in there. 00:00:39.800 |
So you should be able to see my screen. Also, obviously this is super interactive, so anyone that 00:00:45.440 |
has joined the paper club, you know, just interrupt me whenever, it's not that deep. 00:00:51.500 |
Um, so three papers. Uh, Compute as Teacher, this is like, uh, a better, you know, RL optimization approach. 00:01:01.780 |
So basically can we use the spare, can we add like a little bit of overhead to our inference 00:01:07.420 |
time compute during RL to get better training data? 00:01:10.720 |
And basically we, we do a lot of inference during RL, right? 00:01:16.320 |
Uh, you're no longer just doing pre like next token prediction. 00:01:20.900 |
Instead of just training, you're actually doing a lot of inference. 00:01:25.700 |
And we use some of that inference, add a little bit of overhead and then improve our RL. 00:01:34.280 |
So they're, uh, kind of looking at stuff that's not super verifiable. 00:01:39.820 |
Uh, can we, can we get some signals out of some of these rollouts and, and improve our RL? 00:01:46.240 |
So the very quick approach here, if I'm remembering this correctly: basically, um, for the rollouts 00:01:54.120 |
that exist... so it builds on GRPO, and in GRPO you basically have your policy. 00:02:01.420 |
You do inference and you generate a bunch of rollouts, like a bunch of answers to your question. 00:02:06.380 |
Let's say you generate eight or 16 different, uh, responses. 00:02:10.300 |
Then you, you prefer towards the one that's the most correct. 00:02:14.680 |
So what this does is let's add a second little inference step where we take all of those outputs. 00:02:21.400 |
Uh, we, we pass all of that into a model without the question. 00:02:26.680 |
And we basically say, okay, can you improve this? 00:02:31.540 |
Can you iterate on them, improve them, and make, like, a better, uh, answer based on all of them? 00:02:37.780 |
Then we use that, we have like a rubric to grade it, and we can move the policy slightly towards it. 00:02:42.940 |
So with a little bit of overhead and inference, uh, can we improve our RL and then they, they 00:02:51.500 |
do a little bit of testing and they're like, yes, we can, it works guys. 00:02:59.400 |
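(To make that flow concrete, here's a minimal sketch of the idea in Python; `policy_generate`, `anchor_synthesize`, and `grade` are hypothetical stand-ins for whatever models and graders you'd plug in, not anything from the paper's code.)

```python
def cat_step(question, policy_generate, anchor_synthesize, grade, num_rollouts=8):
    """One Compute-as-Teacher-style step on top of GRPO rollouts (sketch only)."""
    # 1. The usual GRPO inference: sample a group of parallel rollouts for this prompt.
    rollouts = [policy_generate(question) for _ in range(num_rollouts)]

    # 2. The extra bit of overhead: a frozen "anchor" model sees only the rollouts
    #    (question-blind) and reconciles them into one estimated reference answer.
    reference = anchor_synthesize(rollouts)

    # 3. Grade each rollout against that reference (programmatic check if the task
    #    is verifiable, rubric + LLM judge if not), then hand the rewards to the
    #    normal GRPO update so the policy moves slightly towards the better ones.
    rewards = [grade(rollout, reference) for rollout in rollouts]
    return rollouts, reference, rewards
```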
It shows some fun stuff that Meta is working on. 00:03:02.760 |
It's like, okay, for RL, uh, there's only so much verified data, right? 00:03:08.460 |
Are we just going to throw all of our money at Mercor or Scale? 00:03:11.780 |
Or are we going to like, you know, generate some environments, have some like verification? 00:03:21.820 |
Then we did really big pre-training and we're like, oh shit, we can't really pre-train much further. 00:03:29.740 |
It's not like we're out of tokens, but there's a heavy decay in the value we get from training on more of them. 00:03:36.780 |
So we started doing RL, uh, we, we did it on a lot of like verifiable domains. 00:03:43.200 |
So math and code, then there's some stuff like the deep research stuff, but you know, we don't have 00:03:48.420 |
that much verifiable data. They, they talk about issues with stuff like, um, LLM-as-a- 00:03:55.560 |
judge based verification, and then they're like, okay, at some point we need to get other forms 00:04:00.480 |
of verified data, um, or we need to be able to maximize our value out of the data that we have. 00:04:08.240 |
So, um, yeah, they're, they're like: here's a little paper on how we can be more efficient with that data. 00:04:23.580 |
So, um, has anyone here seen Verifiers from Prime Intellect? 00:04:28.320 |
It's like an open-source library for RL environments. 00:04:31.320 |
Uh, Will Brown and the Prime Intellect team have basically become a 24/7 shill for this library, like: we 00:04:40.020 |
need environments. And no one's ever using it; 00:04:42.900 |
there aren't that many great environments yet, but you know, I liked the, I liked the idea. 00:04:48.280 |
Uh, yeah, the OpenPipe people have also been doing RL stuff; they've been acquired by CoreWeave. 00:04:56.320 |
So maybe CoreWeave is AGI in this, but basically, um, you know, this is another similar library, 00:05:06.320 |
a similar sort of library tool to scale up agent environments. 00:05:12.820 |
And they're like, there's a few premises here. 00:05:19.520 |
Like current, uh, RL environments are not realistic; they don't have, like, uh... 00:05:30.020 |
basically the issue with stuff like SWE-bench and tau-bench is that the environment is static. 00:05:39.820 |
But in reality, um, environments are dynamic, right? 00:05:43.780 |
Even while your agent is doing work, the environment is changing. 00:05:50.100 |
They have, um, random data points that get added in. 00:05:54.040 |
And then there's obviously a hit for taking long. 00:05:56.820 |
Some of the stuff that they measure is the time trade-off for how long agents take. 00:06:02.240 |
And they try to normalize this around stuff like, um, API requests and rate limits. 00:06:08.480 |
So like, it's, it's a bit... you know, they, they do pause the system during those calls so that doesn't get taken 00:06:12.960 |
into account. But high level, it's like a way to create environments. 00:06:21.260 |
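(One plausible way to implement that "pause the clock during provider calls" normalization is a simulated clock the environment owns; this is a guess at the mechanism for illustration, not the paper's actual implementation, and `llm_client.complete` is a made-up call.)

```python
import time
from contextlib import contextmanager

class SimClock:
    """Environment time that only accrues while the agent is actually 'acting'."""
    def __init__(self):
        self.elapsed = 0.0          # seconds charged to the agent
        self._paused = False
        self._last = time.monotonic()

    def tick(self):
        now = time.monotonic()
        if not self._paused:
            self.elapsed += now - self._last
        self._last = now

    @contextmanager
    def paused(self):
        # Wrap model API calls with this so network latency and rate limits
        # don't count against the agent's time budget.
        self.tick()
        self._paused = True
        try:
            yield
        finally:
            self._last = time.monotonic()
            self._paused = False

clock = SimClock()
with clock.paused():
    pass  # e.g. response = llm_client.complete(prompt)  # hypothetical API call
clock.tick()
print(f"time charged to the agent: {clock.elapsed:.3f}s")
```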
It's, it's a build off of an environment they made in 2023. 00:06:25.620 |
That was GAIA; I think it stands for, like, General AI Assistants or something. 00:06:33.180 |
So they, they expanded on it and they, they show how it's different, what they trained on. 00:06:37.440 |
Then it's like a whole bunch of, okay, what goes into making an environment? 00:06:42.120 |
Um, you need, like, you know, apps; you need the ability to access the external world. 00:06:50.180 |
The one that they do, it's basically a mobile operating system. 00:06:54.060 |
And then there's apps like calendar, email, this and that there's tasks to do. 00:06:58.380 |
Then you talk about how our agents interacting with it. 00:07:04.100 |
And these are basically like, okay, notifications are even while your 00:07:07.960 |
agent is doing something, uh, shit will happen, right? 00:07:10.760 |
You might get a random text, a random email information might change. 00:07:14.360 |
And then how does the agent respond to all this? 00:07:24.480 |
Then they have a whole GUI for how to, um, look at this tool stuff, environment usage. 00:07:30.660 |
This was honestly quite a long one, but it's kind of fun to go through for people that 00:07:41.240 |
Like I didn't expect it to be so long, but it was all a pretty fun read, honestly. 00:07:48.200 |
It's just like, okay, let's learn about how environments work. 00:07:51.800 |
What, like what you really need to do to build one. 00:07:53.960 |
And I'm like, oh shit, these aren't that simple. 00:07:56.000 |
And I say this having put out like an open source environment a while ago. 00:08:00.380 |
It was like the whole, um, enterprise suite of, okay. 00:08:04.700 |
Send an email, edit a spreadsheet and let's test models. 00:08:07.520 |
But, um, then, then, you know, the, the big thing here they want to show off is like, okay. 00:08:13.400 |
Time really affects stuff and so does budget, right? 00:08:16.660 |
So there's like a, there is a proper, um, correlation between performance as you 00:08:22.520 |
give a higher thinking budget or more time to performance. 00:08:29.600 |
Then in the, the results section, they have some interesting stuff. 00:08:33.220 |
Like, um, the hybrid models from Claude are pretty chill with this, you know? 00:08:39.520 |
Um, they don't really have that much of a trade-off. 00:08:43.780 |
So Kimi K2 was the best, um, open-source one; GPT-5 with high reasoning was the best closed one, but 00:08:50.500 |
stuff like Claude 4 Sonnet is, um, faster. 00:08:55.060 |
It combines, you know, it doesn't, it doesn't waste time reasoning when it doesn't need to. 00:09:05.000 |
Uh, third paper was actually just very, very different. 00:09:08.800 |
Um, Bootstrapping Task Spaces for Self-Improvement. 00:09:11.680 |
Um, let me pause and check the comments real quick. 00:09:16.060 |
Wait, so basically they open-sourced their RL environment... well, not their RL environment. 00:09:25.960 |
It's a generic agent environment, and it's specifically an environment with, like, a thousand scenarios. 00:09:38.360 |
So they, they talk about what's wrong with verifiers, right? 00:09:45.020 |
We want stuff where time changes and the environment changes while you're thinking. 00:09:48.140 |
Basically in other environments, there's no time delay. 00:09:52.160 |
So, like in SWE-bench, you can spend six hours solving one task, but in reality, your time matters. 00:10:09.880 |
And then I think they want people to work on their thing, but they do, they do cite Verifiers. 00:10:17.620 |
It's just slight improvements to RL with task spaces for self-improvement. 00:10:23.560 |
Basically in GRPO, uh, you have like step-based changes. 00:10:27.600 |
They're like, can we instead change the objective to be sequential? 00:10:32.740 |
So based on like your thought data and the last, in the last step, have a bigger impact. 00:10:38.920 |
And then it's like just a lot of, um, RL policy optimization stuff. 00:10:44.480 |
I don't know if we'll get to this one since it's very different and the environment one is pretty long, but. 00:10:49.180 |
That is a 15 minute intro of what meta super intelligence has been up to. 00:10:55.820 |
This one's pretty short compute as a teacher. 00:10:59.680 |
It has some plays on LLM as a judge and rubrics being AGI. 00:11:07.120 |
Um... I'm gonna pause for questions; also, if anyone wants to unmute and chime in, if you read it, you know, feel free. 00:11:15.460 |
"Do you think environments should dynamically change, or just reward allocation along the way?" 00:11:20.920 |
Uh, there's both steps and then here in their, in their testing, they, they do have a time penalty, right? 00:11:26.800 |
So, um, I think it's not just that there's reward allocation along the way. 00:11:32.620 |
One of the rewards is for being able to deal with like useless information, right? 00:11:40.060 |
So if you get a random text that deters you from your objective, uh, can you handle this? 00:11:46.360 |
Or can you not handle it? Uh, static versus dynamic: 00:11:51.120 |
the distinction is basically, like, as you're working, um, environmental changes are happening. 00:11:57.980 |
So in a static environment, like SWE-bench, uh, you have tasks and you just do them. In something like this, 00:12:05.060 |
Uh, the code base is being changed externally, regardless of what you're working on. 00:12:10.960 |
And then you have calls like, you know, get state, and then you can see how state has changed and you have to be able to deal with that. 00:12:16.220 |
There were some flaws to this paper that I didn't like, um, basically in the evaluation, they have like cutoffs for. 00:12:23.540 |
When the task is failed and part of that is like, okay, you take X many steps or your context window fills up. 00:12:30.200 |
And I'm like, what do you mean your context window fills up? 00:12:32.180 |
And then they're like, a great thing to be added would be like a tool, tool, um, a code interpreter tool, because that lets you do a lot of, uh, thinking without filling up your context. 00:12:43.160 |
And they were like, well, add it soon, but we haven't added it yet. 00:12:45.560 |
The little flaws, but anyway, I think the bigger value in this is like one, they care about time to solve tasks. 00:12:53.520 |
Two, dynamic environments; and three... um, I've blanked on the third. Uh, let's see, more questions. 00:13:01.020 |
Is there a reason these papers came out now did compute changes? 00:13:04.200 |
These are like efficiency and RL, so, um, you know, we only have so much RL data, right? 00:13:09.000 |
So much verifiable code, so much verifiable, like labeled environments. 00:13:14.820 |
And instead of paying Mercor, the fastest-growing, 500 million ARR one, what if it can be more, more efficient? 00:13:22.500 |
And also most of these authors were, uh, they weren't really at meta super intelligence when they started. 00:13:28.860 |
So for example, for a lot of these, like, you'll see them using, um, here they use Qwen 2.5, here they use Llama 3.1 or 3.2. 00:13:39.940 |
And this is like building on a benchmark from 2023. 00:13:44.100 |
So, you know, it's been in the work for awhile. 00:13:46.080 |
Um, some of these guys are now at Anthropic, like two of them are at Anthropic, but anyway, you know, it's, it's still, it's still relevant stuff overall. 00:13:59.220 |
The only reason we're covering these papers is because meta super intelligence, uh, otherwise they're, they're, they're interesting papers. 00:14:06.900 |
I mean, I didn't think that like stuff like this compute as a teacher, it's like better performance in RL that you can kind of just drop in and it doesn't really have too much of a negative impact. 00:14:17.880 |
So will stuff like this be standard? Potentially; the only problem is that no one's really doing RL, right? 00:14:24.240 |
Uh, the people that do do RL are like big labs and we don't know if they're doing this stuff. 00:14:31.260 |
So maybe, maybe one of the open models, maybe Mistral or something, drops a paper and they're like, oh yeah, we've been doing this. 00:14:38.160 |
It helps a lot, but you know, maybe if you're into some mid training custom RL work, that'd be cool. 00:14:47.380 |
Compute as a teacher, uh, turning inference compute into reference free supervision. 00:14:57.400 |
Where does supervision come from when there's no ground truth in post-training? Uh: converting the model's own exploration at inference time into reference-free supervision by synthesizing a single reference from a group of parallel rollouts, then optimizing towards it. 00:15:17.720 |
What if we can synthesize a single reference from all of these and then optimize towards it? 00:15:25.520 |
Like, aren't you going to be limited at the quality of what's in the rollouts? 00:15:32.120 |
turns out no, with a bunch of like little information, you can actually get slight improvements. 00:15:40.140 |
Like you can, you can get, uh, synthesized results when they differ from all rollouts. 00:15:46.700 |
So like in 1% of cases, uh, the result is better, disagrees with every rollout, but is still slightly correct. 00:15:53.200 |
And then you judge these based on a rubric grading, which is kind of, uh, unique as well. 00:15:57.760 |
So concretely: the current policy produces a group of rollouts, and a frozen anchor, uh, reconciles omissions and contradictions to estimate a reference. 00:16:08.920 |
So this is basic GRPO, uh, plus there's two regimes. 00:16:12.020 |
There's verifiable tasks where you have a programmatic way to check the answer. 00:16:16.480 |
So basically for math, like you can use a little, um, a little software to just check, like, okay, is this math correct? 00:16:24.720 |
Use a calculator or code, you know, does the code compile? 00:16:28.480 |
Then there's non-verifiable tasks with self-proposed rubrics: binary, auditable criteria scored by an independent LLM-as-a-judge, with reward given by the fraction of criteria satisfied by the output. 00:16:43.540 |
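(Written out, the non-verifiable reward is just the satisfied fraction of the K self-proposed binary criteria; my notation, not the paper's:)

```latex
r(y) \;=\; \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left[\text{judge says criterion } c_k \text{ is satisfied by output } y\right]
```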
Unlike the current method synthesis may disagree with the majority and be correct. 00:16:49.720 |
Even when all the rollouts are wrong, this was kind of interesting. 00:16:52.300 |
So currently there's, like, a few other approaches; someone mentioned the tree-based ones. 00:16:57.100 |
So, um, you know, if you do best-of-N and just move the policy towards that, or majority voting, or have a judge pick the best output, um, the difference here is that this synthesized result can 00:17:10.600 |
disagree with everything that's in the rollout, but still be better. 00:17:16.180 |
They tested a bunch of four B's and eight B's and number went up, went up with general training and with RL. 00:17:24.220 |
Uh, I thought this was a pretty, not so great diagram, but it exists. 00:17:29.380 |
Um, so: synthesize an estimated reference, convert that supervision into reward; it can be applied at test time for inference-time gains or inside RL. 00:17:40.660 |
So, um, for specialized skills, why do we need... okay, let's look at figure one. 00:17:52.480 |
This figure one was bad, but there's a better figure two somewhere in this paper. 00:18:00.040 |
Post-training typically relies on supervised fine-tuning with labels, or verifiable rewards and programmatic checkers; valuable tasks often lack both. 00:18:11.760 |
So in non-verifiable settings, uh, like, you know, clinical tests or lifestyle, there's also the issue of labeling being ambiguous. 00:18:22.680 |
So annotation pipelines are hard to scale and even then we don't, we don't often, um, we don't often like have experts agreeing. 00:18:33.180 |
So in ambiguous stuff, like experts can have different thoughts on what's concise writing. 00:18:41.100 |
They can like, you know, there's just, there's just issues with that. 00:18:44.160 |
And then LLM as judges, they, they can have issues here. 00:18:47.880 |
So, uh, verbosity bias, they love long outputs. 00:18:53.760 |
So one main question, uh, can inference compute substitute for missing supervision for non-verifiable stuff or ambiguous, uh, verification. 00:19:06.720 |
Can inference compute do better than stuff like LLM as a judge? 00:19:16.280 |
Compute as Teacher, CaT (not chain of thought), uh, Compute as Teacher. 00:19:21.120 |
It converts the model's own exploration into reference free supervision. 00:19:25.200 |
For each prompt, the current policy generates a set of parallel rollouts, uh, that's GRPO. 00:19:31.400 |
Then, conditioned on the rollouts, it synthesizes a single estimated reference by, by reconciling omissions and contradictions across them. 00:19:40.840 |
Uh, this is once again, basically they throw all these rollouts into a model. 00:19:47.800 |
We'll get into the prompt that they use for this, which is kind of interesting. 00:19:52.600 |
A key thing is they don't give the question anymore. 00:19:57.400 |
Um, CaT reuses the group rollout compute budget already common in RL, adding little overhead beyond the compute already spent, right? 00:20:10.160 |
So there's like multiple test times in RL, right? 00:20:14.160 |
In RL, when you're doing GRPO, you do a bunch of rollouts and some of these are already off policy. 00:20:21.920 |
So there's a lot of wasted inference time, right? 00:20:25.160 |
Um, there's tricks you can do to optimize this, but this is just a way of doing a little bit more overhead there. 00:20:31.360 |
So two domains for cat, uh, one is verifiable where it's obvious, right? 00:20:36.680 |
If it's verifiable, just use the same verifier. 00:20:40.920 |
So let's say you have a math question, all your rollouts are shit, but you can answer it with a synthetic, um, output from cat. 00:20:50.800 |
Well, then you can check it there and that's good. 00:20:52.760 |
If it's non-verifiable, uh, they have this concept of rubrics. 00:21:00.120 |
You say: here's a question; uh, what are some rubric criteria, hard ones that should not be easy to satisfy, that these outputs can be graded on? 00:21:07.960 |
And then a judge follows that rubric and the output, and then has yes or no, um, criteria. 00:21:14.320 |
Uh, the big thing here is this is synthesis, not selection, right? 00:21:18.840 |
So, uh, you're, you're generating a new answer that can one, disagree with the majority and be correct even when rollouts are wrong. 00:21:28.720 |
Uh, so empirically we observe the synthesis disagreeing with the majority of rollouts on 14% of questions. 00:21:35.720 |
So basically, you know, it disagrees fairly often, um, and it disagrees with all rollouts 00:21:42.080 |
1% of the time. Performance scales with the number of rollouts as expected, right? 00:21:48.080 |
So if you give it one rollout, it's not going to do much. 00:21:51.080 |
Uh, if you give it two, four, eight, 16, 32, it starts to do better when it has more information. 00:21:57.080 |
They've got, they have, uh, plotted a scaling curve for this and then the intuition around why this works. 00:22:03.440 |
So basically, um, parallel rollouts diversify generations, surfacing like different sub-factor solutions, right? 00:22:12.800 |
So, uh, you don't just do rollouts such that they have the same output. 00:22:18.800 |
They're, they're somewhat diverse and they go down different thinking paths, right? 00:22:22.800 |
Uh, this was shown in other pre-training reasoning work where in the GRPO policy, you want to optimize for diversity, right? 00:22:31.800 |
So, uh, the last RL paper that we covered, they basically completely got rid of, uh, one of the constraints that really allowed for a lot of diversity in samples and that allowed for a higher hit rate towards being, um, correct and verifiable. 00:22:47.800 |
So in this case, you know, GRPO is already doing different rollouts that are in different, uh, you 00:22:55.400 |
know, different diverse groups and then conditioning this, um, it's similar to ensembling, right? 00:23:03.200 |
So basically if you have a synthetic data generator that has a bunch of different, um, you know, different 00:23:11.800 |
trees that it could have gone down, you have, you have the ability to pick out, okay, this, this 00:23:17.880 |
A bunch of these seemed off, um, maybe this is the right way and in non-verifiable domains, rewards 00:23:23.720 |
transform match later into discrete, dah, dah, dah, dah. 00:23:28.840 |
Basically, um, you're just, you're just looking at a bunch of potential solutions and then you're like, 00:23:38.920 |
Um, let's, let's synthesize a better response from this. 00:23:42.840 |
And then you have a bunch of different things. 00:23:48.120 |
You don't need human labels, no specific verification or anything. 00:23:51.800 |
You use the same, um, you use the same verification techniques as before. 00:23:59.240 |
If it's a non-verified domain, you have this rubric thing. 00:24:03.640 |
Uh, and then, yeah, they, they actually tested it. 00:24:07.160 |
It improves three model families of 4B to 8B models: Gemma, Qwen, and Llama. 00:24:13.240 |
So why are we talking about meta super intelligence now? 00:24:15.800 |
Uh, this work is not that recent, you know, it's Llama 3.1 8B. 00:24:19.880 |
They could have done Llama 3.2 1B or 3B, but it's, it's old models. 00:24:28.040 |
So several other bridges of work around this, right? 00:24:31.320 |
So, um, learned from model generated supervision, but derives the target 00:24:36.760 |
by reconciling multiple samples rather than trusting a single self-generated label. 00:24:42.840 |
So there's distillation, um, there's self-training, there's majority voting. 00:24:51.000 |
There is, these are basically like interesting. 00:24:54.840 |
This section's a bunch of other examples that you can, um, 00:24:58.440 |
look at if you're interested in like better, um, self-generation for RL stuff. 00:25:07.480 |
Um, the uniqueness here is it constructs new answers that can depart from 00:25:13.000 |
the consensus, um, different than LLM as a judge. 00:25:17.480 |
That's different than majority voting, right? 00:25:19.080 |
So majority voting picks the majority best from the rollout. 00:25:23.480 |
This can deviate from that different than LLM as a judge. 00:25:27.640 |
Um, it has specific criteria that, you know, it mitigates like instability. 00:25:36.280 |
So the problem with LLM as a judge is that there's biases like, 00:25:43.160 |
I prefer, um, you know, like you have preferences across model family. 00:25:52.840 |
Then there's, um, you know, programmatic verification. 00:26:00.120 |
Once they do, uh, cat their, you know, compute as a teacher. 00:26:05.000 |
And then this rubric-score judging, uh, it avoids human references and reduces judge biases. 00:26:16.600 |
They test it on MATH and HealthBench, and then they, they compare it and all that. 00:26:22.280 |
So reference-free fine tuning, um, constitutional AI from Anthropic, self-instruct for training 00:26:32.440 |
All these are like different approaches to the same thing. 00:26:37.880 |
There's reference-free RL, uh, so test time RL, reference-free LLM training by RL, absolute zero, 00:26:46.760 |
self-play, um, all these are like interesting approaches. 00:26:52.040 |
So compared to reference-free fine tuning, their approach can holistically improve outputs for 00:26:59.720 |
That sounds like buzzwords, uh, compared to reference-free RL, they're able to construct 00:27:05.400 |
and synthesize answers outside of the explored distribution, and extend beyond verifiable 00:27:10.040 |
to non-verifiable domains. 00:27:14.920 |
I think this is the big takeaway of the paper. 00:27:16.760 |
So, um, the second half is pretty key, right? 00:27:19.640 |
If you have non-verifiable domains, uh, you can, you can still work in those, and you can go outside of your rollout distribution. 00:27:29.480 |
Then, um, other work in non-verifiable RL: VeriFree, JEPO, RLPR, um, all this stuff. 00:27:38.280 |
In contrast, they use rubrics as rewards and more general approach that constructs rubrics from 00:27:44.760 |
reference answers, which are then judged by LLMs to compute a score. 00:27:49.240 |
Unlike all these methods, theirs does not require a reference answer, right? 00:27:54.120 |
So all the other work that they cite, they need reference answers. 00:27:57.720 |
Um, their approach doesn't, which is, you know, it's unique. 00:28:05.720 |
Basically, uh, it's GRPO and then you have this, uh, synthesis step with, um, uh, yeah. 00:28:14.040 |
You have a synthesis step that's generated from rollout and then rubric criteria is done later. 00:28:20.040 |
This was a better sort of diagram, but I think we still need to make reports on it. 00:28:23.960 |
Uh, estimating a reference by synthesizing rollouts to estimate a reference response. 00:28:32.040 |
Um, so at each GRPO step, the current policy generates a bunch of rollouts. 00:28:41.160 |
I don't know how many people are super up to date with GRPO. 00:28:45.960 |
Uh: "Can someone explain why we need to keep it question-blind, to prevent it from acting as just another rollout?" 00:28:58.840 |
Uh, someone asked, why do we need to keep this? 00:29:01.800 |
So for the synthesis step, um, why do we need to keep it question blind? 00:29:08.680 |
I'll explain that right after explaining what GRPO is. 00:29:11.480 |
So, um, GRPO is basically what's used in current RL where, um, you basically have, 00:29:20.200 |
instead of having multiple models and all this, you, you do a bunch of rollout. 00:29:23.960 |
So at your current step, you generate a bunch of different outputs. 00:29:28.520 |
So let's say for a different, for a given query, like a math question, like some integral question, 00:29:33.960 |
you generate X amount of different answers with chain of thought reasoning and whatnot. 00:29:43.400 |
So like you, you optimize towards what's the best in this group of outputs. 00:29:49.000 |
It's like a memory-efficient version of PPO, since you don't need multiple models and this and that. 00:29:53.400 |
So, uh, the, the very, very basic TLDR is just, you generate multiple outputs. 00:29:59.400 |
Like let's say you generate four, eight or 16 outputs for a question. 00:30:03.800 |
And then you optimize towards the one that's the most correct. 00:30:06.840 |
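(For reference, the "optimize towards the best in the group" part of GRPO boils down to normalizing each rollout's reward against its own group; a rough sketch of that advantage computation, which is standard GRPO rather than anything CaT-specific:)

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: how much better each rollout is than its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# e.g. 8 rollouts for one math question, reward 1.0 when the final answer checks out
print(group_relative_advantages([0, 0, 1, 0, 1, 0, 0, 0]))
```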
Now, what, uh, CAT is doing is you pass in all of these rollouts to a synthesizer model, 00:30:14.360 |
a synthesizer step, and then you have it kind of improve these. 00:30:18.040 |
Now you say, okay, here's chain of thought one. 00:30:23.240 |
Uh, can you give a more concise, better approach towards solving this? 00:30:27.400 |
And sometimes, you know, given all these different rollouts and different, uh, 00:30:32.520 |
techniques and attempts at the question, um, you know, you get a better result by just 00:30:39.960 |
throwing all this in one more verification, synthesization step. 00:30:43.320 |
Uh, the reason that we don't give it the question is because we don't want bias, right? 00:30:47.640 |
We don't just want like a 17th rollout, right? 00:30:49.960 |
If you're already doing 16 shots on goal, giving it the question is just going to make it answer the question again. 00:30:57.240 |
And if you supply it a bunch of incorrect information and ask it the question again, it'll just get biased by that. 00:31:05.000 |
Instead, what you're doing is you're just giving it the rollouts, just the chains of thought. 00:31:11.880 |
Now this reference will often still have, like, you know, a "here's an answer" at the end. 00:31:19.800 |
What they show is that oftentimes it is a lot better. 00:31:22.440 |
Uh, for synthesis, do they use the same policy model? 00:31:25.720 |
Uh, they use an LLM as a judge and they don't, it shouldn't really matter what model you use. 00:31:34.520 |
Uh, we'll, we'll double check, but they, the, the fun thing here is actually in the appendix, 00:31:39.960 |
they show all the prompts for this, which is more useful. 00:31:46.680 |
So for verifiable stuff like that, they just use the verifier. 00:31:50.280 |
So like they use the little, uh, software to check if it compiles. 00:31:54.040 |
For non-verifiable, they, they have rubrics using rubric prompts. 00:31:58.440 |
And then they use an LLM as a judge to just clarify this rubric. 00:32:02.360 |
Uh, then they can input this back into the GRPO. 00:32:05.800 |
And then if the CaT reference is better than the rollouts, let's instead, um, start to, you know, move our policy towards it. 00:32:18.120 |
So when you only have one rollout, the synthesis is not gonna do much, but it grows as you have more of them. 00:32:26.280 |
This is, you know, what they expect and it's true. 00:32:33.160 |
The prompt seems like it's creating a unified response. 00:32:37.880 |
Um, so the synthesis step is different than the rubric step. 00:32:42.680 |
And the rubric step is not always used, right? 00:32:44.600 |
Synthesis step is used for verifiable stuff and non-verifiable. 00:32:48.440 |
The rubric is used when you need to do non-verifiable domains. 00:32:59.080 |
Basically, they, they evaluate it in two models. 00:33:02.680 |
They do... sorry, in a few models: Gemma, Qwen, Llama. 00:33:05.640 |
Um, yeah, the TLDR is, it kind of works, you know? 00:33:09.800 |
So it improves, it improves 30% relative to the initial policy. 00:33:18.360 |
So, like, Llama doing CaT with RL had a 30% bump on HealthBench and on math. 00:33:26.760 |
Uh, so, you know, it, it works quite, quite, uh, well. 00:33:47.240 |
Then you use an LLM to judge the synthesis answer on the rubric. 00:33:51.720 |
Um, that's, unfortunately, I wish that was like section four. 00:33:56.600 |
We have to go through the damn appendix to find that stuff. 00:33:59.800 |
So that's why I'm going to like skip through this pretty fast. 00:34:02.440 |
Uh, it would have been interesting to ablate not using the question versus using it. 00:34:08.760 |
It might be something in, um, in the appendix. 00:34:15.800 |
You know, this, this stuff works on small models. 00:34:18.600 |
Versus model-as-a-judge, which instead of checking individual rubric criteria just scores the output directly, 00:34:23.400 |
the rubric approach consistently outperforms model-as-a-judge. 00:34:27.000 |
Rubrics provide fine-grained assessment criteria that are easier to verify. 00:34:34.120 |
Um, RL with self-proposed rubrics is better than SFT. 00:34:41.400 |
Uh, SFT did not do that much, but rubrics did a lot. 00:34:47.640 |
Um, this produces better reference estimates than single-sample and selection baselines. 00:34:55.400 |
Uh, the strongest reference estimates versus, you know, best-of-N and the other baselines. And okay. 00:35:02.200 |
I think we can skip through a lot of these pretty quick. 00:35:08.680 |
So if you're doing a lot of rollouts, this will do a lot better. 00:35:11.960 |
Um, you'll get more benefit from throwing more in there. 00:35:15.720 |
It reasons about prior rollouts rather than acting as another one. 00:35:19.160 |
So this is, you know... someone asked a question about, um, why not pass in the question? 00:35:41.960 |
So, um, CAT with a single rollout in context only performs slightly better than a rollout. 00:35:49.320 |
This suggests that the additional generation step of synthesizing is not just acting as a 00:35:57.720 |
new rollout that self-conditions on the past context. 00:36:10.040 |
Okay, um: it reconciles rather than selects, and it's able to disagree. 00:36:18.120 |
So most of the time it will, um, not just disagree. 00:36:21.960 |
It wants, it wants to synthesize better approaches. 00:36:29.080 |
Um, it depends on the initial policy model to be baseline good. 00:36:39.160 |
So, CaT turns inference compute into supervision via a frozen anchor policy. 00:36:55.560 |
Yeah, I mean, it's basically useful, free, better performance. 00:37:05.080 |
So this is an example where the CAT disagrees with all rollouts. 00:37:23.000 |
You're tasked with combining multiple responses into a single cohesive response. 00:37:44.200 |
And then, you know, box your answer and stuff. 00:37:48.440 |
The key thing you'll note here is, you know, the synthesis prompt, they're not giving the question. 00:37:52.440 |
They're not saying like in this question, let F be this all complex numbers. 00:37:59.640 |
You're just being told, you know, your job is to combine responses into a single cohesive response. 00:38:09.320 |
So like, look at a bunch of long and like, for example, think of like deep research, right? 00:38:14.120 |
You have like 16 different shots of like, tens of thousands of tokens of web searches. 00:38:20.440 |
And you want to like ignore stuff that's bad, keep stuff that's good. 00:38:23.480 |
And then, you know, the output response they find does better. 00:38:34.840 |
Your goal is to identify this, synthesize this. 00:38:50.440 |
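(Wiring that up is basically one extra chat call per group; the prompt below is a paraphrase of what's being read out here, not the paper's verbatim appendix prompt, and `chat` is a hypothetical client.)

```python
SYNTHESIS_PROMPT = (
    "You are given several independent responses to the same task. You are NOT "
    "given the task itself. Combine them into a single cohesive response: keep "
    "reasoning and facts the responses agree on or support well, drop anything "
    "that looks mistaken or contradictory, and end with a clearly boxed final answer."
)

def synthesize_reference(chat, rollouts):
    # Only the rollouts go in; the original question is deliberately withheld so
    # the anchor reconciles them instead of just answering an (N+1)-th time.
    numbered = "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(rollouts))
    return chat(system=SYNTHESIS_PROMPT, user=numbered)
```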
So the self rubric is a separate prompt, right? 00:38:53.160 |
This is that step for how do you get, how do you get quality rubrics. 00:39:12.040 |
Carefully read the response and develop a reference 00:39:17.640 |
Task. Develop a detailed rubric for this specific response. 00:39:21.160 |
Create a detailed rubric for this specific response that describes what high-quality responses 00:39:26.840 |
look like with respect to accuracy, verifiable supporting evidence, logic, structure, 00:39:32.440 |
et cetera. Provide five or more rubric criteria that can be verified with yes or no. 00:39:39.480 |
Ensure these are very specific and be verified. 00:39:42.520 |
Make it extremely difficult to achieve a high rating. 00:39:45.880 |
A high quality answer should be very hard to achieve. 00:39:48.280 |
It's rare that any question would achieve high quality. 00:39:50.920 |
You may use a reference answer if you see fit and then do it in XML. 00:39:56.200 |
But this is kind of how they create a rubric. 00:40:07.240 |
You're an expert judge that determines whether an answer satisfies a rubric. 00:40:15.240 |
If there's no answer provided, please answer no. 00:40:21.240 |
Only determine if the answer satisfies the rubric. 00:40:29.800 |
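(Putting the two prompts together, the non-verifiable reward ends up looking roughly like this; again a sketch with a hypothetical `chat` client and paraphrased prompt text, and the one-criterion-per-line parsing is my assumption, not the paper's format.)

```python
def propose_rubric(chat, response, n_criteria=5):
    prompt = (
        "Read the response below and write a detailed rubric for it: at least "
        f"{n_criteria} very specific criteria covering accuracy, supporting evidence, "
        "and logical structure, each answerable strictly yes or no, and strict enough "
        "that a high score is rare.\n\nResponse:\n" + response
    )
    return [line for line in chat(user=prompt).splitlines() if line.strip()]

def rubric_reward(chat, answer, criteria):
    satisfied = 0
    for criterion in criteria:
        verdict = chat(user=(
            "You are an expert judge. Only determine whether the answer satisfies "
            f"this rubric criterion; reply yes or no.\nCriterion: {criterion}\nAnswer: {answer}"
        ))
        satisfied += verdict.strip().lower().startswith("yes")
    return satisfied / max(len(criteria), 1)   # fraction of satisfied criteria
```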
I think it would be interesting for anyone doing like basic mid training to try it out. 00:40:38.360 |
Like it's a better way to use all your rollout data, right? 00:40:44.040 |
You're doing so much inference with little overhead. 00:40:47.880 |
I'm going to take like a two minute pause to see if anyone has any questions on this. 00:41:01.480 |
Yeah, I think it would still work though, right? 00:41:05.240 |
They tested on small models on like verifiable, like basic math and stuff. 00:41:09.880 |
It shows a higher jump, but at the same time, like it's not harmful in any sense, right? 00:41:19.240 |
It's also just like basic compute budget, right? 00:41:30.600 |
For all we know, this is an optimization step that's already being done. 00:41:34.200 |
The bigger fun thing, I guess, was like this rubric-based response for me working better. 00:41:41.080 |
And, you know, it being a way to check non-verifiable stuff better than LLM as a judge. 00:41:47.320 |
Like there's this fantasy that people like to have that you can just do RFT on any model and just do LLM as a judge and that'll just work. 00:41:59.160 |
So it's nice to see a change where, you know, this is like slightly better than that. 00:42:08.360 |
I'm going to go into this comment: "a base model may fail to produce reasonable responses." 00:42:15.400 |
So someone asked if this will scale; they only tested this on small models. 00:42:23.320 |
Someone disagrees, because a weak base model may fail to produce meaningful responses, 00:42:31.240 |
whereas the bigger models would already be stronger. 00:42:38.760 |
Like this might not work on like a 1B model because the base models suck, but it should work on bigger models. 00:42:46.040 |
I'm gonna switch to the agent thing they did. 00:42:51.960 |
So we already did a bit of an intro about this, but basically this is their like verifiers competitor of RL agent environments. 00:43:11.320 |
So how long does it take your agent to solve stuff because the environments are not static, right? 00:43:17.640 |
Current benchmarks are too static and that's an issue. 00:43:21.800 |
So as you do nothing, the world still changes around you. 00:43:25.560 |
And then we need to measure, you know, how can your agent handle like ambiguities, noise? 00:43:33.480 |
And then it runs async, surfacing new failure modes. 00:43:38.680 |
Okay, I'm gonna go through this pretty quick. 00:43:42.200 |
Our experiments show that no system dominates across the intelligence spectrum. 00:43:50.680 |
So they have like six or seven different checks. 00:43:53.160 |
Stuff like ability to handle ambiguity, conciseness, and the models that do well in different domains 00:44:02.680 |
differ from ones that do well in other checks, which is kind of interesting. 00:44:22.840 |
The hybrid models don't reason unnecessarily. 00:44:28.920 |
The interesting thing here is that nothing dominates across the spectrum. 00:44:35.560 |
Like if you reason more, well, now you're taking more time, right? 00:44:39.480 |
And then the unique thing is that these curves all plateau, right? 00:44:50.840 |
The fun thing here is they define like that this is not an AGI benchmark. 00:45:13.720 |
The problem is that web environments change, right? 00:45:18.840 |
Like if you have an Amazon-based benchmark, like a clone of an Amazon website for shopping, it's static. 00:45:27.800 |
Like you're not getting new products randomly added. 00:45:31.240 |
But on realamazon.com, search results change all the time. 00:45:37.880 |
So there's a lot of write operations that can change environment that happened randomly. 00:45:45.320 |
As of the time of this writing, few open-source and flexible 00:45:54.040 |
libraries exist for developing and studying correctable LLM agents. 00:45:58.440 |
This is citing Will Brown from verifiers that we need more environments. 00:46:06.600 |
The interesting thing here is when they test all these models, they basically do a very simple ReAct loop. 00:46:15.720 |
They don't do other multi-agent orchestration scaffolds and stuff. 00:46:22.280 |
Okay, they want running creation of environments. 00:46:42.920 |
So it has mobile apps like email, messaging, calendar, the associated content with them. 00:46:49.000 |
Evaluations for agents beyond pure search and execution. 00:46:54.920 |
I'll cover this in 10 minutes and push the next one next thing. 00:47:01.320 |
So, one: they differ from most agent benchmarks. 00:47:05.320 |
For one, there's more realistic interactions between agents and an environment that runs asynchronously. 00:47:11.560 |
Scenarios spanning arbitrary periods of time. 00:47:16.120 |
So this is what makes their benchmark different, right? 00:47:19.720 |
Environment time passes independent on whether agent acts or not. 00:47:23.720 |
The environment state is continuously updated with random or scheduled events, 00:47:27.880 |
such as friends replying to messages sent by user or agent. 00:47:30.840 |
They have robust verification system into RL. 00:47:35.000 |
Basically, comparing agent write actions only to annotated oracle write actions. 00:47:42.360 |
While today's frontier models are far from solving this, 00:47:46.280 |
we do not consider it to be an AGI-level benchmark in the LLM RL arena. 00:47:53.000 |
It's just like a start of like their first environment that's dynamic. 00:48:00.920 |
This is a much better diagram of what's going on. 00:48:03.720 |
We expect that this will require modeling effort beyond test-time scaling. 00:48:16.520 |
So basically, they have multi-agent interaction, time-based stuff. 00:48:22.680 |
They're like, basically, if we want efficiency in models and RL, we need to do modeling effort beyond 00:48:31.720 |
test time scaling, which I kind of disagree with, right? 00:48:34.440 |
So they're like, you can't just scale for more reasoning. 00:48:37.320 |
I think you have routers, you have low, medium, high thinking models, right? 00:48:42.920 |
So like in their charts on the performance, they basically show how there's like a clear 00:48:49.240 |
performance scale between GPT-5 on low, medium, high thinking. 00:48:59.640 |
But, you know, they're like, we need to do model based. 00:49:03.400 |
So here, so like GPT-5 with minimal thinking versus low versus high. 00:49:10.040 |
The more thinking does better, but it takes more time. 00:49:13.400 |
They're like, we need to do better architecture changes to solve this problem. 00:49:20.920 |
But in reality, you know, a simple router or an auto thinking mode would also solve this 00:49:26.280 |
because we've already dynamically been able to think better. 00:49:29.560 |
And then the hybrid reasoners also do really good on this. 00:49:45.080 |
There's apps, which are basically stateful APIs. 00:49:48.440 |
So email messaging, all that stuff is an app. 00:49:50.920 |
It's an API that you can interact with as an agent. 00:49:54.280 |
Environments are collections of apps, data, governing rules. 00:49:58.280 |
Events are anything that happens in the environments. 00:50:03.400 |
Notifications or messages from the environment. 00:50:08.040 |
Scenarios are basically initialized states and different events that you can schedule. 00:50:20.040 |
Apps have their own state and then they store state internally. 00:50:25.960 |
So basically, if you want to use the environment, 00:50:28.280 |
it's all a Python class that executes, creates stuff. 00:50:36.040 |
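(So an "app" ends up being roughly a stateful Python class whose methods are the tools the agent can call; a toy version for illustration, my own sketch rather than the ARE API:)

```python
from dataclasses import dataclass, field

@dataclass
class Email:
    sender: str
    to: str
    subject: str
    body: str
    read: bool = False

@dataclass
class EmailApp:
    """A stateful 'app': internal state plus tool-like methods the agent can call."""
    inbox: list = field(default_factory=list)
    sent: list = field(default_factory=list)

    def list_unread(self):                      # read operation
        return [e for e in self.inbox if not e.read]

    def send_email(self, to, subject, body):    # write operation (what verifiers check)
        self.sent.append(Email(sender="me", to=to, subject=subject, body=body))

    def receive(self, email):                   # called by scheduled events, not the agent
        self.inbox.append(email)
```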
Environment is stuff that the agent can't control, right? 00:50:41.560 |
The agent doesn't have control over this user. 00:50:50.120 |
So MCP to externally interact with the world. 00:50:54.440 |
So if you want to dynamically change events based on like real world data, like 00:50:59.080 |
yesterday, someone did a SF map of parking tickets coming in live. 00:51:04.360 |
If you want to hook that up to your environment, you can do it pretty easily, right? 00:51:13.800 |
So agent interfaces and then there's system stuff. 00:51:16.760 |
System core apps are basically like, okay: what's the time, and can I wait for stuff to change? 00:51:23.080 |
So basically I can make a change, then I can wait for time to pass. 00:51:25.800 |
Stuff that allows for things that would take, like, real time. 00:51:37.400 |
Let's go a little fast since we're short on time. 00:51:42.360 |
There's validation events, agent events, notification. 00:51:45.960 |
At each notification step, events trigger notifications. 00:51:52.600 |
They're not the only way for agents to observe them. 00:51:57.400 |
So outside of just like your phone getting a notification, 00:52:00.920 |
the agent can make a call to its notification inbox. 00:52:06.840 |
These are basically like specific scenarios, right? 00:52:10.280 |
Like, once I receive this, do that; you can configure those scenario hints. 00:52:23.720 |
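(Scheduled and conditional events are what keep the world moving whether or not the agent acts; a minimal scheduler along those lines, my own sketch with hypothetical names, reusing the toy `EmailApp`/`Email` from above:)

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable

@dataclass(order=True)
class ScheduledEvent:
    fire_at: float                              # environment time, in seconds
    apply: Callable = field(compare=False)      # mutates some app's state when it fires

class EventQueue:
    def __init__(self):
        self._heap = []

    def schedule(self, fire_at, apply):
        heapq.heappush(self._heap, ScheduledEvent(fire_at, apply))

    def advance_to(self, now):
        """Fire everything due by `now`, regardless of what the agent is doing."""
        fired = []
        while self._heap and self._heap[0].fire_at <= now:
            event = heapq.heappop(self._heap)
            event.apply()
            fired.append(event)
        return fired                            # e.g. turned into notifications

# e.g. a friend replies two simulated minutes in, whether or not the agent has acted:
# events.schedule(120, lambda: email_app.receive(Email("friend", "me", "re: pie", "recipe attached")))
```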
To generate all these environments, they use Llama 3.3 70B, and they need, like, consistency. 00:52:31.960 |
They basically have a hierarchical, like, top-down approach of synthetically generating all of this data. 00:52:38.200 |
So like if you have like, okay, a French physics professor that's in your context or a Chinese 00:52:43.720 |
professional athlete, you need to have like, you know, 400,000 tokens of 00:52:49.560 |
information about them, all these chats, messages, then like a whole schema for how they do it. 00:52:57.080 |
Scenario creation, implementing environments. 00:53:02.200 |
The whole point of this is basically it's open source, right? 00:53:04.840 |
So if you want to do your own, here's kind of their walkthrough of here's how we did it for all 00:53:13.560 |
Initial verifiers: verifiable rewards are crucial for RL, right? 00:53:17.400 |
So they have rubric-like verification, mobile checking agent to check write operations. 00:53:24.520 |
So like there's verification in these environments, verification mechanisms, basically you read stuff. 00:53:33.240 |
As you would expect, hard check is like, okay, let's specifically check over like a regex string. 00:53:47.960 |
There's a time delay for actions that takes a hit, verifying verifiers. 00:54:00.120 |
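(A rough picture of those checks, scoped to the agent's write actions versus annotated oracle write actions; the regex hard check is literal, the soft check goes through a judge model, and `judge` here is a hypothetical callable, not the paper's verifier.)

```python
import re

def hard_check(agent_action: dict, oracle_action: dict) -> bool:
    """Programmatic check: same tool, and the content matches a required pattern."""
    return (
        agent_action["tool"] == oracle_action["tool"]
        and re.search(oracle_action["content_pattern"], agent_action["content"]) is not None
    )

def soft_check(judge, agent_action: dict, oracle_action: dict) -> bool:
    """Fuzzy check: let an LLM judge decide whether the write action is equivalent."""
    verdict = judge(
        f"Oracle write action: {oracle_action}\nAgent write action: {agent_action}\n"
        "Do these accomplish the same thing? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```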
So simple react loop that has a pre-step and a post-step. 00:54:06.200 |
So check for turn-based terminations... sorry, check for notifications, check for the current state. 00:54:13.400 |
Post-step is after you finish whatever you're doing, you can do a little post-step, right? 00:54:22.680 |
Like if someone says like, hey, can you send me a recipe for apple pie? 00:54:28.600 |
Then you start working on it after you're like, okay, I need to, 00:54:32.680 |
I need to text my mom to get the recipe from her. 00:54:36.040 |
Once you send that text, there's a post-step in your loop, which is like, okay, 00:54:41.560 |
And maybe, maybe your brother has texted you. 00:54:46.040 |
And then, you know, that, that affects the time wasted in this. 00:54:49.640 |
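(Stitched together, the loop they describe is roughly a ReAct loop with a pre-step and a post-step wrapped around each action; a compressed sketch reusing the toy clock and event queue from above, with `agent_decide` and `execute` as hypothetical callables and a crude character count standing in for the token-based context limit.)

```python
def run_agent(agent_decide, execute, clock, events, max_steps=200, max_context_chars=512_000):
    context, notifications = [], []
    for _ in range(max_steps):
        # Pre-step: advance the world and collect whatever happened while we were "working".
        clock.tick()
        notifications.extend(events.advance_to(clock.elapsed))
        context.extend(notifications)
        notifications.clear()

        action = agent_decide(context)          # think, then pick a tool call (or stop)
        if action is None:
            return "terminated"
        context.append(execute(action))         # tool call reads or mutates app state

        # Post-step: e.g. check whether a reply to the text we just sent already arrived.
        clock.tick()
        notifications.extend(events.advance_to(clock.elapsed))

        if sum(len(str(item)) for item in context) > max_context_chars:
            return "failed: context limit exceeded"
    return "failed: step limit exceeded"
```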
They have a, they have a UI for visualizing all this and for evaluating it. 00:54:59.400 |
Here's their, their one that they did, the mobile thing. 00:55:05.320 |
Gaia2 is their, their sort of Android-OS-style thing. 00:55:13.880 |
Each scenario, they have like a mini version. 00:55:16.760 |
So they have dynamic environments where world state changes. 00:55:19.800 |
They contrast it with Vending-Bench, which is a fun one. 00:55:27.640 |
Scenarios explicitly incorporate time dimension requiring agents to handle temporal constraints. 00:55:43.800 |
So instead of apps just being like API based SDKs, there's also agents that you can fit in. 00:55:52.200 |
Um, agent capabilities, search, execution, adaptability, time, ambiguity, agent to agent. 00:56:03.640 |
Basically like to use an app, you need to go through another agent, which you can control. 00:56:08.120 |
Um, having that delegated to an agent often helped some of the small models. 00:56:14.600 |
So API changes services, uh, how do the models handle this? 00:56:20.760 |
So you can, you can dynamically have some apps, the robust APIs, some of them be agents 00:56:28.680 |
So like, um, one of your apps could actually just be like, talk to a friend, right? 00:56:33.720 |
And it's handled as an agent to agent communication. 00:56:41.720 |
I think we got to skip through a bunch of this since we only have a minute left. 00:56:50.600 |
They're all tested at long context, temperature generation of 16K per turn. 00:56:56.200 |
If the context length of 128K is exceeded, it's an automatic failure. 00:57:01.080 |
Agent loop runs continuously until one of two termination conditions is hit. 00:57:07.560 |
So termination is 200 steps or, um, you run out of context. 00:57:12.840 |
Environment-wise, um, all scenarios are verified with their verifier using Llama 3.3 70B. 00:57:22.840 |
I think we are out of time, but I'll go through the quick stuff. 00:57:26.920 |
Core results, um, time split was interesting. 00:57:29.960 |
These are some like just fun takeaways of how models were, um, how models serve, right? 00:57:35.000 |
So, um, execution and search were the easiest splits. 00:57:39.000 |
These are kind of the splits on which they evaluate domains, right? 00:57:45.080 |
How well can you handle ambiguity, adaptability, time, noise, and then use other agents. 00:57:51.480 |
And then what's your, what's your, um, performance of these? 00:57:55.800 |
So for example, Gemini 2.5 is very, very good across the board. 00:58:00.120 |
It doesn't suffer any major penalties where stuff like Grok has a really bad time penalty. 00:58:05.560 |
GPT-5 high thinking, even low thinking has bad time penalties. 00:58:11.640 |
So like the weaker models: uh, Llama 3 doesn't handle noise well. 00:58:20.920 |
But stuff like Llama 4 Maverick, it, it did really well. 00:58:25.000 |
It got a lot of benefit from agent to agent communication. 00:58:28.040 |
Um, but you know, fun stuff to just go through. 00:58:30.520 |
So Grok 4 really good at search, uh, stuff that has a deep research product, right? 00:58:36.360 |
So OpenAI, Claude and Grok, they have deep research products. 00:58:41.160 |
Uh, ambiguity, ambiguity remains challenging, except, uh, Claude 4 Sonnet and GPT-5 were really good at this. 00:58:49.160 |
Uh, on time, only Gemini 2.5 Pro and Claude 4 Sonnet had really good trade-offs with time. 00:58:57.080 |
Uh, latency trade-off, noise robustness lags. 00:59:11.160 |
Uh, Claude 4 Sonnet was 3x more expensive than GPT-5 low, but much faster. 00:59:21.320 |
Kimi was a pretty good trade-off in the middle. 00:59:28.920 |
I want to leave like the two minutes for anyone that's still here. 00:59:31.800 |
If anyone has questions, anything they want to dig into here, um, you know, any thoughts, questions? 00:59:46.440 |
They talk about tool calling, uh, how they want other agent frameworks. 00:59:55.480 |
Like if you want to make a fun hype post, just, just benchmark max this thing, right? 01:00:01.320 |
Like, don't overfit, but it's very easy to do a basic agent that's better than ReAct and kind 01:00:06.520 |
of set up and beat this, beat this benchmark, you know? 01:00:18.200 |
And so it's kind of the paper, the whole library is out there. 01:00:24.120 |
And then I guess we can do the next one next time. 01:00:31.720 |
Next week, I think we have a volunteer, unless anyone wants to volunteer a paper. 01:00:53.560 |
I guess we have a bio RL paper volunteered for next week.