Back to Index

Meta Superintelligence: Self-Improving Bootstrapping, Compute as Teacher, ARE: Agent Research Envs


Chapters

0:00 Introduction to the three papers
0:33 Compute as Teacher: Turning Inference Compute into Reference-Free Supervision
4:21 ARE: Scaling Up Agent Environments and Evaluations
10:20 Bootstrapping Task Spaces for Self-Improvement
41:14 ARE: Scaling Up Agent Environments and Evaluations (revisited)
56:19 Details on the dual-agent model in the Compute as Teacher paper
57:43 Analysis of different models and APIs in the ARE paper

Transcript

So, not FAIR, but the new Meta Superintelligence lab — two of the papers are somewhat similar, and the third one is pretty different. It's okay if you haven't pre-read; I think we'll actually only have time for two of them. They're basically really focused on better RL approaches. I'm just sharing my screen right now.

If someone else wants to share the links, that would be sick. They should be in the calendar invite — it has a bunch of tweet threads, but you can find them in there. Okay, so you should be able to see my screen. Also, obviously, this is super interactive for anyone that has joined paper club before — just interrupt me whenever, it's not that deep.

We should make it a discussion; I didn't write the papers. So, three papers. The first is Compute as Teacher, which is a better RL optimization technique. Basically: can we add a little bit of overhead to our inference-time compute during RL to get better training data?

Basically, we already do a lot of inference during RL, right? You have a bunch of rollouts — you're no longer just doing next-token prediction. Instead of just training, you're actually doing a lot of inference. So this paper asks: okay, can we use some of that inference, add a little bit of overhead, and improve our RL?

And it's kind of interesting. They're looking at stuff that's not super verifiable: can we get some signal out of these rollouts and improve our RL? The very quick approach here, if I'm remembering it correctly: it builds on GRPO, and in GRPO you have your policy.

You do inference and generate a bunch of rollouts — a bunch of answers to your question. Let's say you generate eight or 16 different responses. Then you prefer towards the ones that are most correct, right? What this paper does is add a second little inference step where we take all of those outputs.

We pass all of that into a model, without the question, and basically say: okay, can you improve this? Here's a bunch of responses — can you iterate on them, improve them, and make a better answer based on all of these rollouts? Then we use that, grade it with a rubric, and move the policy slightly towards it.

So with a little bit of overhead at inference, can we improve our RL? They do a bit of testing and they're like: yes, we can, it works, guys, we did it. It's a pretty interesting approach, and it shows some fun stuff that Meta is working on.

It's like: okay, for RL there's only so much verified data, right? Are we just going to throw all of our money at Mercor or Scale? Or are we going to generate some environments and have some verification? We're pretty limited there. Before, we were limited in compute.

Then we did really big pre-training and we're like, oh shit, we can't really pre-train anymore, we're cooked — we're out of useful tokens. It's not that we're literally out of tokens, but there's a heavy decay in the value we get from training on, say, 35 versus 30 billion tokens.

So we started doing RL on a lot of verifiable domains — math and code — and then some stuff like deep research. But we don't have that much verifiable data. They talk about issues with things like LLM-as-a-judge based verification, and then they're like: okay, at some point we need other forms of verified data, or we need to maximize the value we get out of the data and verification we do have.

So yeah, they're like: here's a little paper on how we can be more efficient with that. The second paper — oh shit, I'm very zoomed in — is about agentic environments. Has anyone here seen Verifiers from Prime Intellect? It's an open-source library for RL environments.

Will Brown and the Prime Intellect team have basically become a 24/7 shill for this library — we need open-source RL environments — but no one's really using it, and there aren't that many great environments. Still, I liked the initiative; it's a cool thing.

Yeah, the OpenPipe people have also been doing RL — they've been acquired by CoreWeave, so maybe CoreWeave is AGI in this. But basically, this is another, similar library, and they do cite Verifiers. It's a similar sort of library and tool for scaling up agent environments.

And they lay out a few premises. One is that current abstractions — current RL environments — are not realistic: they don't have a live environment. They're too static, not dynamic. Basically, the issue with stuff like SWE-bench and τ-bench is that the environment doesn't change while you do stuff.

But in reality, environments are dynamic: even while your agent is doing work, the environment is changing. So they have this time factor, and random data points that get added in. And then there's obviously a hit for taking too long, so some of the stuff they measure is the time trade-off for how long agents take.

They try to normalize this around stuff like API requests and rate limits — they pause the clock for those so they're not counted. But at a high level, it's a way to create environments, and they did create one.

It's called Gaia2. It builds off an environment they made in 2023 — GAIA, which I think stands for something like General AI Assistants benchmark. They expanded on it, and they show how it's different and what they trained on.

Then it's a whole bunch of: okay, what goes into making an environment? You need apps, the ability to access the external environment, events, notifications. The one they build is basically a mobile operating system, with apps like calendar and email and so on, and tasks to do.

Then they talk about how agents interact with it. There are scenarios, and there are notifications. Notifications are basically: even while your agent is doing something, stuff will happen — you might get a random text or a random email, information might change. And then how does the agent respond to all of this?

So it changes as stuff happens live. Then what else is in here? Different apps, different state. They have a whole GUI for looking at tool and environment usage. This was honestly quite a long one, but it's fun to go through for people that haven't gone through environments.

Like, I didn't expect it to be so long, but it was a pretty fun read, honestly. It's nothing that technical — it's just: let's learn how environments work and what you really need to do to build one. And I'm like, oh shit, these aren't that simple.

And I say this having put out an open-source environment a while ago — it was a whole enterprise suite of: okay, send an email, edit a spreadsheet, and let's test models. But the big thing they want to show off is this:

Time really affects stuff, and so does budget. There is a proper correlation between performance and a higher thinking budget or more time, but you don't want that trade-off getting out of hand. Then in the results section they have some interesting stuff.

Like, the hybrid models from Claude are pretty chill with this — they don't really have much of a trade-off. Kimi K2 is the best open-source model, GPT-5 with high reasoning was the best closed one, but stuff like Claude 4 Sonnet is faster.

It combines the two — it doesn't waste time reasoning when it doesn't need to. So, interesting little takeaways. Then they have a whole deep dive. The third paper is actually very different: Bootstrapping Task Spaces for Self-Improvement. Let me pause and check the comments real quick.

Wait, so — did they open source their RL environment? Not their whole RL environment; it's one RL environment. But yes, they did open source it. It's the second version of this: a generic agent environment, and specifically one with about a thousand tasks on mobile use.

So email and this and that. Why do that — why not just use Verifiers? They talk about what's wrong with Verifiers: it's static. We want dynamic environments, stuff where time passes and the environment changes while you're thinking. Basically, in other environments there's no time dimension.

So in SWE-bench you can spend six hours solving one task, but in reality your code base is still getting other changes — other people are working on it, noise comes in. You want stuff to be dynamic. That's why not just use the other ones.

And then I think they want people to work on their thing, but they do cite Verifiers. The third paper is just kind of different — it's slight improvements to RL, with task spaces for self-improvement. Basically, in GRPO you have step-based changes.

They're like: can we instead change the objective to be sequential, so your thought data and the last step have a bigger impact? And then it's a lot of RL policy-optimization stuff. I don't know if we'll get to this one, since it's very different and the environments one is pretty long.

That is a 15-minute intro to what Meta Superintelligence has been up to. I think we should just go into them. This one is pretty short — Compute as Teacher. It's kind of fun; it has some plays on LLM-as-a-judge and rubrics being AGI. So we'll start with this one.

Then we'll do a second one. I'm going to pause for questions — also, if anyone wants to unmute and chime in, if you read it, feel free. Okay: 'Do you think environments should dynamically change, or just reward allocation along the way?' There's both — and in their testing they do have a time penalty, right?

So I think it's not just reward allocation along the way. One of the rewards is for being able to deal with useless information: if you get a random text that deters you from your objective, can you handle it or not? 'Static versus dynamic — what is the distinction here?'

The distinction is basically: as you're working, environmental changes are happening. So in a static environment like SWE-bench, you have tasks and you just do them. In something like this, the code base is being changed externally, regardless of what you're working on.

And then you have calls like get-state, so you can see how the state has changed, and you have to be able to deal with that. There were some flaws to this paper that I didn't like. Basically, in the evaluation they have cutoffs for when a task counts as failed, and part of that is: you take X many steps, or your context window fills up.

And I'm like, what do you mean your context window fills up? And then they say: a great thing to add would be a code-interpreter tool, because that lets you do a lot of thinking without filling up your context. And they're like, we'll add it soon, but we haven't added it yet.

Little flaws, but anyway. I think the bigger value in this is: one, they care about time to solve tasks; two, dynamic environments; and three — I've blanked on the third. Let's see, more questions. 'Is there a reason these papers came out now? Did compute change?' Yeah — these are about efficiency in RL. We only have so much RL data, right?

Only so much verifiable code, so many verifiable, labeled environments. And instead of paying Mercor — the fastest-growing 500-million-ARR company — what if we can be more efficient? Also, most of these authors weren't really at Meta Superintelligence when they started. For a lot of these you'll see them using Qwen 2.5 here, Llama 3.1 or 3.2 there.

And this builds on a benchmark from 2023, so it's been in the works for a while. Some of these authors are now at Anthropic — like, two of them are at Anthropic — but anyway, it's still relevant stuff overall. And, honestly, part of it is just the hype.

The only reason we're covering these papers is that it's Meta Superintelligence; otherwise — well, they're still interesting papers. I mean, stuff like this Compute as Teacher is better performance in RL that you can kind of just drop in, and it doesn't really have much of a negative impact.

So will stuff like this be standard? Potentially. The only problem is that no one's really doing RL, right? The people that do RL are the big labs, and we don't know if they're doing this stuff. So maybe one of the open-model labs — maybe Mistral or someone — drops a paper and says, oh yeah, we've been doing this.

It helps a lot. But if you're into some mid-training or custom RL work, that'd be cool, and you can build on this. Okay: Compute as Teacher — turning inference compute into reference-free supervision. It's free, we love it. Where do learning signals come from when there's no ground truth in post-training?

The idea: convert the model's own exploration at inference time into reference-free supervision by synthesizing a single reference from a group of parallel rollouts, then optimizing towards it. Kind of interesting. You're already doing a bunch of rollouts, a bunch of inference.

What if we can synthesize a single reference from all of these and then optimize towards it? Seems fake, seems like it shouldn't work, right? Aren't you going to be limited by the quality of what's in the rollouts? Turns out no — with a bunch of little pieces of information, you can actually get slight improvements.

And there are even disagreements: you can get synthesized results that differ from all of the rollouts. In about 1% of cases, the result disagrees with every rollout and is still correct. And then you judge these based on rubric grading, which is kind of unique as well.

So, concretely: the current policy produces a group of rollouts, and a frozen anchor reconciles omissions and contradictions across them to estimate a reference. This sits on top of basic GRPO. There are two regimes. One is verifiable tasks, where you have a programmatic way to check the answer — for math you can use a little software check: is this math correct? Use a calculator or code — does the code compile?

Then there are non-verifiable tasks with self-proposed rubrics: binary, auditable criteria scored by an independent LLM as a judge, with reward given by the fraction of criteria satisfied. Unlike selection-based methods, synthesis may disagree with the majority and be correct,

even when all the rollouts are wrong — this was kind of interesting. Currently there are a few other approaches — someone mentioned the tree-based ones — like best-of-N where you just move towards that, or majority voting, or a judge picking the best output. The difference here is that the synthesized result can disagree with everything in the rollouts and still be better.

And then, yes, numbers go up: they tested a bunch of 4B and 8B models and the numbers went up, both at test time and with RL. I thought figure one was not so great, but it exists. The synthesized, estimated reference converts that supervision into reward, and it can be applied at test time for inference-time gains or inside RL.

So, for specialized skills... how do we read figure one? Yeah, figure one was bad, but there's a better figure two somewhere in this paper; we'll get to it eventually. Okay. Post-training typically relies on supervised fine-tuning with labels, or on verifiable rewards with programmatic checkers — and many valuable tasks lack both.

In non-verifiable settings — like clinical or lifestyle stuff — there's also the issue of labeling being ambiguous. Annotation pipelines are hard to scale, and even then experts often don't agree. In ambiguous areas, experts can have different opinions on, say, what concise writing is. There are just issues with that.

And then LLMs as judges have their own issues: verbosity bias — they love long outputs — and they're inconsistent. So, one main question: can inference compute substitute for missing supervision on non-verifiable stuff, or on stuff with ambiguous verification?

Can inference compute do better than stuff like LLM-as-a-judge? Yes — that's their answer; they did it, guys. Compute as Teacher — not chain of thought; CaT here is Compute as Teacher. It converts the model's own exploration into reference-free supervision. For each prompt, the current policy generates a set of parallel rollouts — that's GRPO.

Then, conditioned on the rollouts, it synthesizes a single estimated reference by reconciling what's missing and what's contradictory across them. Once again: basically, they throw all these rollouts into a model and have it synthesize a better response. We'll get into the prompt they use for this, which is kind of interesting.

A key thing is they don't give it the question anymore — just the rollouts. CaT reuses the group-rollout compute budget already common in RL, adding little overhead beyond the compute already being spent. Is CaT test-time? Kind of — it happens during rollout, and there are multiple 'test times' in RL, right?

When you're doing GRPO, you do a bunch of rollouts; some of them end up off-policy, some are longer and shorter, so there's a lot of wasted inference time. There are tricks to optimize this, but this is just a way of adding a little more overhead there.

So, two domains for CaT. One is verifiable, where it's obvious: just use the same verifier, and if it works, it works. Say you have a math question and all your rollouts are bad, but the synthesized output from CaT can answer it —

well, then you can check it there and that's good. If it's non-verifiable, they have this concept of rubrics. Basically, you use another model and say: here's a question — what are rubric criteria, hard ones that shouldn't be easy to satisfy, that responses can be graded on?

Then a judge follows that rubric against the output and gives yes-or-no verdicts per criterion. The big thing here is that this is synthesis, not selection: you're generating a new answer that can, one, disagree with the majority and be correct even when the rollouts are wrong.

That's kind of interesting, right? Empirically they observe that the synthesis disagrees with the majority of rollouts on about 14% of questions — so it disagrees fairly often — and it disagrees with all rollouts about 1% of the time. Performance scales with the number of rollouts, as expected.

If you give it one rollout, it's not going to do much. If you give it two, four, eight, 16, 32, it starts to do better as it has more information — they plot a scaling curve for this. Then there's the intuition for why this works.

Basically, parallel rollouts diversify generations, surfacing different partial solutions. You don't do rollouts so that they all have the same output — they're somewhat diverse and go down different thinking paths. This was shown in other reasoning work too: in the GRPO policy, you want to optimize for diversity.

The last RL paper we covered basically removed one of the constraints, which allowed for a lot more diversity in samples, and that allowed a higher hit rate towards being correct and verifiable. So in this case, GRPO is already producing different, diverse rollouts, and conditioning on all of them is similar to ensembling.

Basically, if you have a generator that could have gone down a bunch of different trees, you have the ability to pick out: okay, this part looks correct, a bunch of these seem off, maybe this is the right way. And in non-verifiable domains, the rewards turn the rubric match into discrete criteria, and so on.

Yeah — so the auditor is question-blind. You're just looking at a bunch of potential solutions and going: okay, a few of them go down this path, this is probably relevant, let's synthesize a better response from it. And then you have a bunch of different things to draw from.

So that's why it works. Practically, it's just drop-in: you don't need human labels or any special verification. You use the same verification techniques as before, and if it's a non-verifiable domain you have this rubric thing. And then, yeah, they actually tested it.

It improves 4B-to-8B model families — Gemma, Qwen, and Llama. This is, once again, interesting: why are we talking about Meta Superintelligence now? This work is not that recent — it's Llama 3.1 8B. They could have used a newer Llama, but these are old models.

But anyway. Okay, so there are several other bridges of related work: learning from model-generated supervision, but deriving the target by reconciling multiple samples rather than trusting a single self-generated label. They talk about other stuff too — distillation, self-training, majority voting,

LLM-as-a-judge. This section is basically a bunch of other examples you can look at if you're interested in better self-generated supervision for RL. Okay — the uniqueness here is that it constructs new answers that can depart from the consensus, which is different from LLM-as-a-judge

and different from majority voting: majority voting picks the majority or best answer from the rollouts, and this can deviate from that. Versus LLM-as-a-judge, it has specific criteria, which mitigates instability — the problem with LLM-as-a-judge is that there are biases, like preferring verbosity

or preferences across model families. This doesn't do that; it's just: follow these specific rubric criteria. Then there's programmatic verification. Okay, contributions: one, they do CaT — Compute as Teacher — and then this rubric-scored judging, which avoids human references and reduces reliance on brittle judge-only scores.

Then we do have a case study: they test it on MATH and HealthBench and compare it, and all that. Oh, this was the related-work section: reference-free fine-tuning — Constitutional AI from Anthropic, Self-Instruct for instruction-following training, Quiet-STaR.

All of these are different approaches to the same thing: better reference-free fine-tuning. There's also reference-free RL — test-time RL, reference-free LLM training by RL, Absolute Zero, self-play — all interesting approaches. Compared to reference-free fine-tuning, their approach can holistically improve outputs for arbitrary specialized tasks.

That sounds like buzzwords. Compared to reference-free RL, they're able to construct and synthesize answers outside the explored distribution, and to extend beyond verifiable to non-verifiable domains. This is pretty unique — I think it's the big takeaway of the paper. The second half of that is pretty key:

if you have non-verifiable domains, you can still work in them, and you can go outside your rollout distribution. Then there's other work in non-verifiable RL — VeriFree, JEPO, RLPR, all this stuff — and, in contrast, rubrics-as-rewards work, a more general approach that constructs rubrics from reference answers, which are then judged by LLMs to compute a score.

Unlike all those methods, theirs does not require a reference answer. All the other work they cite needs reference answers; their approach doesn't, which is unique. I thought this paper was somewhat impactful. Okay — a bunch of notation stuff. Basically, it's GRPO, plus this synthesis step, and, yeah —

you have a synthesis step generated from the rollouts, and then the rubric criteria come later. This was a better sort of diagram, but I think we still need to walk through it. Estimating a reference: to estimate a reference response by synthesizing rollouts, they introduce a synthesis step.

At each GRPO step, the current policy generates a group of rollouts. Okay, actually, I'll explain what GRPO is — I don't know how many people are super up to date on it. 'Can someone explain why we need to keep the question blind, to prevent it from acting as just another rollout?'

Yeah, I'll explain that question next. Someone asked: for the synthesis step, why do we keep it question-blind? Why not give it the question? I'll explain that right after explaining what GRPO is.

So, GRPO is basically what's used in current RL. Instead of having multiple models and all that, you do a bunch of rollouts: at your current step, you generate a bunch of different outputs. For a given query — a math question, say some integral — you generate X different answers with chain-of-thought reasoning and whatnot.

Then you have a group-relative policy update: you optimize towards what's best within this group of outputs, and you move your policy towards that. It's like a memory-efficient version of PPO, since you don't need the extra models and so on. So the very basic TL;DR is: you generate multiple outputs —

say four, eight, or 16 outputs for a question — and then you optimize towards the ones that are most correct. Now, what CaT does is pass all of these rollouts into a synthesizer model, a synthesis step, and have it improve on them.

You say: okay, here's chain of thought one, here's chain of thought two, here's chain of thought three — can you give a more concise, better approach to solving this? And sometimes, given all these different rollouts and attempts at the question, you get a better result just by throwing them all into one more synthesis step.

The reason they don't give it the question is that we don't want bias — we don't just want a 17th rollout. If you're already taking 16 shots on goal, giving it the question is just going to make it answer the question again with more information.

And if you supply it a bunch of incorrect information and ask it the question again, you're not going to get any deviation — it will just use that. Instead, you give it only the rollouts, just the chains of thought, and it produces a reference.

Now, this reference will still usually end with an answer, and then that gets verified. Sometimes it's correct, sometimes it's not; what they show is that, often, it's a lot better. 'For synthesis, do they use the same policy model?' There's an anchor model for this, and it shouldn't really matter much which model you use.

I think they do use the same policy model — we'll double-check — but the fun thing is actually in the appendix, where they show all the prompts for this, which is more useful. Then, how do they verify this? Two things. For verifiable stuff, they just use the verifier —

like the little software check for whether it compiles. For non-verifiable stuff, they have rubrics, built with rubric prompts, and then an LLM-as-a-judge scores against that rubric. That can then feed back into GRPO: if the CaT reference is better than the rollouts, we start moving the policy and changing the weights towards it.
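To make the mechanics concrete, here's a minimal sketch of one CaT-style step sitting on top of a GRPO loop, under my reading of the method. Everything here is illustrative: `policy.sample`, `anchor.complete`, `policy.update`, `verifier`, and `rubric_reward` are hypothetical stand-ins, not the authors' actual code, and the rubric path is sketched separately further down.

```python
# Minimal sketch of one CaT-style GRPO step (illustrative only: `policy.sample`,
# `anchor.complete`, `policy.update`, `verifier`, and `rubric_reward` are hypothetical
# stand-ins, and the reward routing reflects one reading of the method).

def cat_grpo_step(policy, anchor, question, verifier=None, rubric_reward=None, n_rollouts=8):
    # 1. Standard GRPO exploration: sample a group of rollouts for the prompt.
    rollouts = [policy.sample(question) for _ in range(n_rollouts)]

    # 2. Synthesis: the frozen anchor sees ONLY the rollouts (question-blind)
    #    and reconciles them into a single estimated reference.
    synthesis_prompt = (
        "Combine the following responses into a single cohesive response:\n\n"
        + "\n\n---\n\n".join(rollouts)
    )
    reference = anchor.complete(synthesis_prompt)

    # 3. Turn the estimated reference into rewards for the rollouts.
    if verifier is not None:
        # Verifiable regime: programmatic check (e.g. compare boxed final answers).
        rewards = [float(verifier(rollout, reference)) for rollout in rollouts]
    else:
        # Non-verifiable regime: self-proposed rubric scored by an LLM judge
        # (see the rubric sketch later in this section).
        rewards = [rubric_reward(reference, rollout) for rollout in rollouts]

    # 4. Group-relative advantages, as in GRPO: normalize rewards within the group.
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean_r) / std_r for r in rewards]

    # 5. Policy-gradient update nudging the policy towards high-advantage rollouts.
    policy.update(question, rollouts, advantages)
```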

Okay, remarks. When you only have one rollout, the synthesis isn't going to do much, but the benefit grows as you add more rollouts — which is what they expect, and it holds; they plot it later. 'Is this just to create a rubric? The prompt seems like it's creating a unified response.'

There are two separate things: the synthesis step is different from the rubric step, and the rubric step isn't always used. The synthesis step is used for both verifiable and non-verifiable tasks; the rubric is used when you're in non-verifiable domains, where you don't have labeled data —

then you can use a rubric. Okay, let's talk about experiments. They evaluate on a few model families — Gemma, Qwen, Llama. The TL;DR is it kind of works: it improves roughly 30% relative to the initial policy.

And these are small models. Adding RL adds a lot more — in Llama, CaT with RL had a 30% bump on HealthBench, and on MATH it has a 33% bump. So it works quite well. Ooh, someone is sharing DeepWiki.

Very nice. 'Where do the rubrics come from — are they AI-generated or explicitly written?' They are AI-generated. It's a two-step approach: one, use an LLM to generate a rubric — we'll go over the prompt for that; two, use an LLM to judge the synthesized answer against that rubric.

Unfortunately, I wish that were section four — we have to go through the damn appendix to find that stuff, which is why I'm skipping through this part pretty fast. 'It would have been interesting to ablate using the question versus not.' They might have, actually — it might be somewhere in the appendix.

Okay, results are results — this stuff works on small models. Versus model-as-a-judge — where, instead of checking individual rubric criteria, you just check the response overall — their approach consistently outperforms model-as-a-judge. Rubrics provide fine-grained assessment criteria that are easier to verify. So the rubric thing checks out.

Rubrics are useful: RL with self-proposed rubrics is better than SFT — SFT did not do that much, but rubrics did a lot. What else? It produces better reference estimates than single-sample and selection baselines — the strongest reference estimates versus best-of-N and so on. Okay,

I think we can skip through a lot of these pretty quick. It scales with the number of rollouts, so if you're doing a lot of rollouts this does a lot better — you get more benefit from throwing more in. And it reasons about prior rollouts rather than acting as just another one.

This goes back to the earlier question about why they don't pass in the question. I think my thing froze — we're cooked. My Zotero is slightly frozen. Okay, we're good. So: CaT with a single rollout in context performs only slightly better than a single rollout.

This suggests the additional generation step of synthesizing is not just acting as another rollout that self-conditions on past context — it only helps slightly there. Yeah, these are just results: it does better than a single rollout. Okay — it reconciles rather than selects, and it's allowed to disagree. Most of the time it won't just disagree for the sake of it;

it wants to synthesize better approaches. It exceeds the performance of majority voting, and it depends on the initial policy model being reasonably good as a baseline. What else? Conclusion: CaT turns inference compute into supervision — an anchor policy synthesizes from the parallel rollouts, the estimated reference is converted into rewards, and it delivers gains of up to around 33%.

Yeah — it's basically useful: free, better performance. What else do we have here? Okay, let's look at some of these prompts. Here's an example where CaT disagrees with all rollouts — it's basic math, nothing fun there. 'When does it stop learning?' Oh — I want the fun prompts.

Very fun. Okay, the synthesis prompt: 'You are tasked with combining multiple responses into a single cohesive response. Below, I will provide several responses. Your goal is to identify common themes, reconcile differences, and combine the information into a unified response. Preserve all key insights. Ensure the final response is logical and coherent.' Then all the rollouts go in,

and then, you know, box your answer and so on — your response should not be much longer. The key thing you'll notice is that the synthesis prompt doesn't give the question. They're not saying: in this question, let f be defined on the complex numbers; given this, find that.

It's just being told: your job is to combine responses into a single cohesive response. So it's looking at a bunch of long outputs — think of something like deep research, right? You have 16 different shots, each with tens of thousands of tokens of web searches,

and you want to ignore the stuff that's bad and keep the stuff that's good. And the output response, they find, does better — it's not just another rollout. Then the reasoning-synthesis prompt: 'Your goal is to identify..., synthesize... Be sure to preserve key insights. Avoid discarding unique insights; highlight and address them where possible.'

'Here's your summary prompt; here's how to output it.' The rubric is another fun one. The self-rubric is a separate prompt — this is the step for how you get quality RL training data out of non-verifiable tasks. So here's the rubric prompt:

'You are given a reference response. Carefully read the response and develop a reference-response evaluation rubric as follows. Task: develop a detailed rubric for this specific response that describes what high-quality responses look like with respect to accuracy, verifiable supporting evidence, logical structure, and so on. Provide five or more rubric criteria that can be verified with yes or no.

Ensure these are very specific and verifiable. Make it extremely difficult to achieve a high rating — a high-quality answer should be very hard to achieve; it's rare that any response would achieve high quality. You may use the reference answer if you see fit.' And then output it in XML.

So that's how they create a rubric. Then they have an LLM-as-a-judge check it. Here's their judge template: 'You are an expert judge that determines whether an answer satisfies a rubric. Here's the rubric; here's the answer. Determine whether the answer satisfies it. If there's no answer provided, it fails. Be strict and unbiased — only determine whether the answer satisfies the rubric.'
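As a concrete picture of that rubric path, here's a small sketch of how self-proposed rubrics plus binary judge verdicts could be turned into the 'fraction of criteria satisfied' reward described earlier. The `llm(prompt)` callable and the exact prompt wording are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch of the rubric-scoring path: propose binary criteria from the
# synthesized reference, judge each one, and reward the fraction satisfied.
# `llm(prompt) -> str` is a hypothetical callable; prompts paraphrase the ones above.

def propose_rubric(llm, reference_response, n_criteria=5):
    """Ask a model for strict, binary, auditable criteria derived from the reference."""
    prompt = (
        "You are given a reference response. Develop an evaluation rubric with at least "
        f"{n_criteria} criteria that can each be verified with yes or no. Make a high "
        "rating extremely difficult to achieve. Return one criterion per line.\n\n"
        f"Reference response:\n{reference_response}"
    )
    lines = llm(prompt).splitlines()
    return [line.strip("-* ").strip() for line in lines if line.strip()]

def judge_criterion(llm, criterion, answer):
    """Strict yes/no check of a single rubric criterion; no answer means failure."""
    prompt = (
        "You are an expert judge. Determine whether the answer satisfies the rubric "
        "criterion. If no answer is provided, it fails. Reply only YES or NO.\n\n"
        f"Criterion: {criterion}\nAnswer: {answer}"
    )
    return llm(prompt).strip().upper().startswith("YES")

def rubric_reward(llm, reference_response, answer):
    """Reward = fraction of satisfied criteria, as in the non-verifiable regime."""
    rubric = propose_rubric(llm, reference_response)
    if not rubric:
        return 0.0
    return sum(judge_criterion(llm, c, answer) for c in rubric) / len(rubric)
```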

So yeah, those are the prompts. I think it's a cool approach, and it would be interesting for anyone doing basic mid-training to try out. Yes — it's basically more efficient RL.

It's a better way to use all your rollout data, right? You're already doing so much inference; with a little overhead, it looks like you can do slightly better. I'm going to take a two-minute pause to see if anyone has questions on this; otherwise we'll move on. 'They only tested this with small models.'

Yeah — I think it would still work, though; it doesn't make much of a difference. They tested on small models on verifiable stuff like basic math, where it shows a bigger jump, but at the same time it's not harmful in any sense, right? So why not test it out?

It's also just compute budget: it's faster to test on small stuff, they train faster, and you can see meaningful changes. So yeah, why not test it on bigger models? For all we know, this is an optimization step that's already being done. The bigger fun thing for me, I guess, was the rubric-based scoring working better,

and it being a way to check non-verifiable stuff better than LLM-as-a-judge. There's this fantasy people like to have that you can just do RFT on any model with LLM-as-a-judge and it'll just work — so it's nice to see something that's slightly better than that.

Cool. Okay. 'A weak base model may fail to produce reasonable responses.' So someone asked whether this would work on bigger models, since they only tested small ones, and someone disagrees because a weak base model may fail to produce meaningful responses. I think that concern goes the other way, right?

Bigger models would already be stronger, so the worry would be more that this might not work on, like, a 1B model because the base model is too weak — but it should work on bigger models. Anyway, okay, I'm going to switch to the agent thing they did. We already did a bit of an intro, but basically this is their Verifiers competitor for RL agent environments.

And once again, they really want two things. They want time to be a factor — how long does it take your agent to solve stuff — because environments are not static; current benchmarks are too static, and that's an issue. And they want dynamic environments: even as you do nothing, the world still changes around you.

And then they measure how your agent handles ambiguity and noise, and how it adapts to these changes. And it runs async, surfacing new failure modes. Okay, I'm going to go through this pretty quick. Their experiments show that no system dominates across the intelligence spectrum — they have six or seven different checks,

stuff like the ability to handle ambiguity, conciseness, and so on — and the models that do well in some domains differ from the ones that do well on other checks, which is kind of interesting. Here's the trade-off between models. Once again, this is slightly old: they're on Claude 4 Sonnet, not Claude 4 Opus or 4.1.

Kimi K2 is the best open-source model; Gemini was a really good mix; the hybrid models don't reason unnecessarily. Then there's a cost trade-off. The interesting thing here is that nothing dominates across the spectrum — they all have trade-offs. Like, if you reason more, you're now taking more time, right?

And the unique thing is that these curves all plateau — standard scaffolds miss key ingredients, and we don't want plateaus. The fun thing is they say explicitly that this is not an AGI benchmark; it'll get saturated. Yeah — Meta can't even benchmark Opus, too expensive. So, deployment and production.

The web is a great environment; the problem is that web environments change. Like, if you have an Amazon-based benchmark — a clone of an Amazon shopping site — it's too static: you're not getting new products randomly added. But on the real amazon.com, search results change all the time and new products are added.

Descriptions change. So there are a lot of write operations that can change the environment, and they happen randomly. 'As of the time of this writing, few open-source and flexible libraries exist for developing and studying correctable LLM agents' — this is them citing Will Brown's Verifiers and the point that we need more environments.

Okay, where was I? The interesting thing here is that when they test all these models, they use a very simple ReAct framework — so they're basically saying: come do better, try other model and agent orchestration scaffolds. Shit, we only have 10 minutes; I'm going to go kind of quick.

Okay: they want to enable running and creating environments, they need to handle time, and purely simulated stuff is not realistic. Mobile: the one they build is Gaia2. It's got a thousand verifiable scenarios in a mobile environment — mobile apps like email, messaging, calendar, and the associated content.

Evaluations for agents beyond pure search and execution; verifiable tasks that are simple. I'll do this one in 10 minutes and push the next one to next time — we'll skip the third paper for now. So, one: they differ from most agent benchmarks in that there are more realistic interactions between agents and the environment, running asynchronously.

Scenarios span arbitrary periods of time, and environment time passes. This is what makes their benchmark different — one of the key things: environment time passes independent of whether the agent acts or not. The environment state is continuously updated with random or scheduled events, such as friends replying to messages sent by the user or agent.

They also have a robust verification system built in for RL, basically comparing the agent's write actions against annotated oracle write actions. 'While today's frontier models are far from solving this, we do not consider it to be an AGI-level benchmark in the LLM RL arena, and we expect rapid hill-climbing.'

It's just the start — their first environment that's dynamic — and they expect progress to be made. This is a much better diagram of what's going on. 'We expect that this will require modeling effort beyond test-time scaling.' So they have multi-agent interaction and time-based stuff, and they're basically saying: if we want efficiency in models and RL, we need modeling effort beyond test-time scaling — which I kind of disagree with, right?

They're saying you can't just scale up reasoning. But I think you have routers, and low, medium, and high thinking modes, right? In their performance charts they show a clear scaling between GPT-5 on low, medium, and high thinking. But they're arguing we need model-level changes.

So here: GPT-5 with minimal versus low versus high thinking — more thinking does better, but it takes more time. They're saying we need better architecture changes to solve this, but in reality a simple router or an auto-thinking mode would also help, because we can already dynamically decide how much to think.

And the hybrid reasoners also do really well on this. But continuing — time is a real factor here. Okay, the foundations of their environment: everything is an event. There are apps, which are basically stateful APIs — email, messaging, all that stuff is an app: an API that the agent can interact with.

Environments are collections of apps, data, and governing rules. Events are anything that happens in the environment, and everything is logged — they have a UI for seeing this too. Notifications are messages from the environment. Scenarios are basically initialized states plus the events you schedule. Okay, what are apps? Apps are basically tools with APIs.

So stuff like send email, delete email. Apps have their own state and store it internally. Tool creation: if you want to use the environment, it's all Python classes that execute and create stuff. And there are scopes to these — agent, user, environment. Environment scope is stuff the agent can't control, right?

Like a text coming in from the world — the agent doesn't have control over that. The user scope is what the user can interact with. Extensibility: you can hook in external APIs, like MCP, to interact with the outside world. So if you want to dynamically change events based on real-world data — like yesterday someone did an SF map of parking tickets coming in live —

you can hook that up to your environment pretty easily, because it's real-time and dynamic. Core apps: there's basic interaction — agent interfaces — and then system stuff. System core apps are things like: what's the time, and can I wait for stuff to change?

So basically I can make a change and then wait for time to pass — stuff that would take hours in the real world can happen fast. Environment events: an event can be an agent action. Let's go a little fast since we're short on time.

Event types: there are validation events, agent events, and notifications. At each step, events can trigger notifications, and agents can receive them — but that's not the only way for agents to observe them; they can proactively check. So beyond your phone getting a notification, the agent can make a call to its notification inbox.
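To make those pieces concrete, here's a toy sketch of stateful apps, scheduled events, and a pollable notification inbox. The class and method names are made up for illustration — this is not the actual ARE or Gaia2 API.

```python
# Toy sketch of the concepts above: apps as stateful tools, scheduled events that fire
# as time passes, and a notification inbox the agent can poll. Names are illustrative,
# not the real ARE / Gaia2 interface.
import heapq
import itertools

class EmailApp:
    """An 'app' is a stateful tool the agent (or the environment) can call."""
    def __init__(self):
        self.inbox = []

    def send_email(self, to, body):          # agent-scoped tool
        return f"sent to {to}: {body}"

    def receive_email(self, sender, body):   # environment-scoped: the agent can't trigger this
        self.inbox.append((sender, body))
        return f"new email from {sender}"

class Environment:
    def __init__(self, apps):
        self.apps = apps
        self.clock = 0.0
        self._order = itertools.count()      # tie-breaker so same-time events don't collide
        self.scheduled = []                  # heap of (time, order, callback)
        self.notifications = []

    def schedule(self, at, callback):
        heapq.heappush(self.scheduled, (at, next(self._order), callback))

    def advance(self, dt):
        """Time passes whether or not the agent acts; due events fire and may notify."""
        self.clock += dt
        while self.scheduled and self.scheduled[0][0] <= self.clock:
            _, _, callback = heapq.heappop(self.scheduled)
            self.notifications.append(callback())

    def poll_notifications(self):
        """Agents aren't force-fed everything; they can proactively check their inbox."""
        out, self.notifications = self.notifications, []
        return out

# Usage: a reply arrives 30 "minutes" in, regardless of what the agent is doing.
env = Environment({"email": EmailApp()})
env.schedule(30, lambda: env.apps["email"].receive_email("mom", "here's the recipe"))
env.advance(45)
print(env.poll_notifications())   # ['new email from mom']
```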

Scenarios: these are specific setups — like, 'once I receive this, do that' — and you can configure scenario hints. Mobile: it's meant to be a mobile environment. Turn rules. Okay, we've got to go fast through this. This part is kind of interesting: to generate all these environments they use Llama 3.3 70B, and they need consistency.

So how do they handle diversity? They basically have a hierarchical, top-down approach to synthetically generating context. So if you have, say, a French physics professor or a Chinese professional athlete in your universe, you need something like 400,000 tokens of information about them — all their chats and messages — and then a whole schema for how they generate it.
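Roughly, the top-down idea might look like the sketch below: fix the high-level persona first, then condition every app's content on it so nothing contradicts. The `llm(prompt)` callable and the prompts are assumptions for illustration, not their actual pipeline.

```python
# Rough sketch of hierarchical, top-down universe generation: persona first, then
# per-app content conditioned on that persona for consistency. `llm` is a hypothetical
# text-generation callable, not the authors' pipeline.

def generate_universe(llm, seed_description):
    # Level 1: the persona / world facts everything else must agree with.
    persona = llm(
        f"Write a detailed profile (job, city, friends, daily habits) for: {seed_description}"
    )

    universe = {"persona": persona}
    # Level 2: per-app content conditioned on the same persona, so emails, chats,
    # and calendar entries don't contradict each other.
    for app in ["email", "messaging", "calendar", "contacts"]:
        universe[app] = llm(
            f"Given this profile:\n{persona}\n\n"
            f"Generate realistic {app} content consistent with it (names, dates, tone)."
        )
    return universe

# e.g. generate_universe(llm, "a French physics professor")
```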

Scenario creation, implementing environments — the whole point is that it's open source, right? So if you want to do your own, here's their walkthrough of how they did it for all of these; go figure it out for yourself. Initial verifiers: verifiable rewards are crucial for RL, right?

So they have rubric-like verification, and a checking agent to check write operations — there's verification built into these environments. Verification mechanisms: there are soft checks and hard checks. As you'd expect, a hard check is something like matching against a regex string — does the output sync with this?

A soft check is more LLM-judge vibey. This part was kind of advanced; I think we can skip it. Timing: there's a time delay for actions, and taking too long takes a hit; there's also verifying the verifiers. Okay — they have a default agent orchestration: a simple ReAct loop that has a pre-step and a post-step.

The pre-step gets the agent context: check for turn-based terminations — sorry, check for notifications and the current state of the world. The post-step runs after you finish whatever you're doing: has the world changed? Is this still relevant? Like, if someone says, hey, can you send me a recipe for apple pie,

you start working on it — okay, I need to text my mom to get the recipe from her. Once you send that text, there's a post-step in your loop: let me check if anything changed. And maybe your brother has already texted you,

'I don't need the recipe anymore' — and you can terminate right there, and that affects the time spent. They have a UI for visualizing all this and for evaluating it — kind of cool, and pretty robust.
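Here's a small sketch of that default orchestration — a ReAct-style loop wrapped with a pre-step that pulls in notifications and a post-step that re-checks the world and can terminate early. The names (`agent.decide`, `makes_task_obsolete`, the action fields) are illustrative stand-ins, not the library's actual interface.

```python
# Sketch of a ReAct loop with pre- and post-steps, in the spirit of the default
# orchestration described above. All names are illustrative; pairs with the toy
# Environment sketched earlier.

def run_agent(agent, env, task, max_steps=200):
    history = [("task", task)]
    for _ in range(max_steps):
        # Pre-step: surface anything that changed while we weren't looking.
        for note in env.poll_notifications():
            history.append(("notification", note))

        # ReAct step: think, pick a tool call on an app, observe the result.
        action = agent.decide(history)
        if action.kind == "finish":
            return action.result
        tool = getattr(env.apps[action.app], action.tool)
        history.append(("observation", tool(**action.args)))

        # Post-step: did the world change in a way that makes the task moot?
        for note in env.poll_notifications():
            history.append(("notification", note))
            if agent.makes_task_obsolete(note, task):
                return "terminated early: task no longer relevant"
    return "failed: step limit reached"
```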

Okay, here's the one they built — the mobile thing. 'Do they implement a wait tool?' Yes, they have wait tools. Gaia2 is their Android-OS-style environment. They have a bunch of scenarios, each with a mini version. So they have dynamic environments where world state changes — they contrast it with Vending-Bench, which is a fun one.

Time: time flows continuously, and scenarios explicitly incorporate a time dimension, requiring agents to handle temporal constraints — temporal awareness is essential for this stuff. Agent-to-agent collaboration is another interesting one: you have multiple agents, so instead of apps just being API-based SDKs, there are also agents you can slot in.

Agent capabilities: search, execution, adaptability, time, ambiguity, agent-to-agent. In the agent-to-agent split, the main agent can no longer access apps directly — to use an app, you have to go through another agent, which you control. Having that delegated to an agent often helped some of the smaller models. Then noise is another one:

APIs and services change — how do the models handle that? This is their agent-to-agent setup: you can dynamically have some apps be robust APIs and some be agents you communicate with. So one of your 'apps' could actually just be 'talk to a friend,' and that's handled as agent-to-agent communication.

Okay — environment events, data collection; I think we have to skip through a bunch of this since we only have a minute left. Basically they tested a bunch of stuff. These were some of my issues with it: everything is tested at long context, with a 16K-token generation limit per turn.

If the 128K context length is exceeded, it's an automatic failure. The agent loop runs continuously until one of two termination conditions is hit: 200 steps, or you run out of context. And all scenarios are verified with their verifier, using Llama 3.3 70B.
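Just to spell out those two cutoffs, something like the check below — with the caveat that the real harness's accounting will differ and `count_tokens` here is a naive stand-in.

```python
# The two failure cutoffs mentioned above, as a tiny illustrative helper.
MAX_STEPS = 200
MAX_CONTEXT_TOKENS = 128_000

def count_tokens(history):
    # Naive whitespace proxy for a real tokenizer.
    return sum(len(str(item).split()) for item in history)

def should_terminate(step, history):
    if step >= MAX_STEPS:
        return "failed: step limit"
    if count_tokens(history) > MAX_CONTEXT_TOKENS:
        return "failed: context window exceeded"
    return None
```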

Okay — I think we're out of time, but I'll go through the quick stuff. Core results: the time split was interesting. Some fun takeaways on how models performed. Execution and search were the easiest splits — these are the capability splits they evaluate on:

how well can you do execution and search, how well can you handle ambiguity, adaptability, time, noise, and using other agents — and what's your performance on each. For example, Gemini 2.5 is very good across the board; it doesn't suffer any major penalties, where stuff like Grok has a really bad time penalty.

GPT-5 high thinking — even low thinking — has bad time penalties. Some models don't handle noise: among the weaker ones, Llama 3 doesn't handle noise, Llama 4 struggles with it, and GPT-4 is always pretty bad with noise. But stuff like Llama 4 Maverick did really well in one respect:

it got a lot of benefit from agent-to-agent communication. But, you know, fun stuff to just go through. Grok 4 is really good at search — the labs that have a deep research product, so OpenAI, Claude, and Grok, do really well at search.

Ambiguity remained challenging across the board, except Claude 4 Sonnet and GPT-5 were really good at it. On time, only Gemini 2.5 Pro and Claude 4 Sonnet had really good trade-offs. There's a latency trade-off, and noise robustness lags. Agent-to-agent benefited the weaker models. GPT-5 performed the best on the benchmark overall.

Kimi K2 was the best open-source model. Cost was another axis: Claude 4 Sonnet was 3x more expensive than GPT-5 low but much faster, whereas GPT-4 was worst in both senses. Kimi was a pretty good trade-off in the middle. What other fun stuff? I think that's pretty much it.

I want to leave the last two minutes for anyone that's still here — if anyone has questions, anything they want to dig into, any thoughts, go ahead. Otherwise, yeah: memory was a fun thing, they talk about tool calling, and they want other agent frameworks. I think that's pretty easy, right?

Like, if you want to make a fun hype post, just benchmark-max this thing. Don't overfit, but it's very easy to build a basic agent that's better than ReAct, set it up, and beat this benchmark. And that's about it.

Seems interesting — it's the first benchmark that's not static. The paper and the whole library are out there; it's interesting. And I guess we can do the next one next time. Okay, cool. Fun stuff, guys. Next week I think we have a volunteer, unless anyone wants to volunteer a paper.

Oh, someone volunteered. RJ, do you know what they volunteered? Okay — seems like a bio RL paper. Cool, I guess we have a bio RL paper volunteered for next week. Awesome. We'll share more in the Discord. Thanks for coming, guys — we'll see you next time.