
Meta Superintelligence: Self-Improving Bootstrapping, Compute as Teacher, ARE: Agent Research Envs


Chapters

0:00 Introduction to the three papers
0:33 Compute as Teacher: Turning Inference Compute into Reference-Free Supervision
4:21 ARE: Scaling Up Agent Environments and Evaluations
10:20 Bootstrapping Task Spaces for Self-Improvement
41:14 ARE: Scaling Up Agent Environments and Evaluations (revisited)
56:19 Details on the dual-agent model in the Compute as Teacher paper
57:43 Analysis of different models and APIs in the ARE paper

Whisper Transcript | Transcript Only Page

00:00:00.420 | So, three papers from the new Meta Superintelligence lab. Two of them
00:00:07.060 | are somewhat similar.
00:00:08.220 | The third one is pretty different.
00:00:11.160 | It's okay if we haven't pre-read.
00:00:14.760 | I think we'll actually only have time for two of them.
00:00:17.900 | They're basically really focused on better RL approaches.
00:00:22.600 | And here, I'm actually just sharing my screen right now.
00:00:25.780 | If someone else wants to share the links, that would be sick.
00:00:29.780 | Um, they should be in the calendar invite.
00:00:32.400 | Calendar invite has a bunch of like tweet threads, but yeah, you can find them in there.
00:00:37.820 | Um, okay.
00:00:39.800 | So we should be able to see my screen also, obviously super interactive for anyone that
00:00:45.440 | has a joint paper club, you know, just interrupt me whenever it's not that deep.
00:00:48.660 | We should make it a discussion.
00:00:50.300 | I didn't write the paper.
00:00:51.500 | Um, so three papers. Uh, Compute as Teacher, this is like, uh, a better, you know, RL-optimized
00:00:59.560 | agentic technique.
00:01:01.780 | So basically can we use the spare, can we add like a little bit of overhead to our inference
00:01:07.420 | time compute during RL to get better training data?
00:01:10.720 | And basically we, we do a lot of inference during RL, right?
00:01:14.620 | So you have a bunch of rollouts.
00:01:16.320 | Uh, you're no longer just doing, like, next-token prediction pre-training.
00:01:20.900 | Instead of just training, you're actually doing a lot of inference.
00:01:24.340 | So this paper is like, okay,
00:01:25.700 | can we use some of that inference, add a little bit of overhead, and then improve our RL?
00:01:32.800 | And it's, it's kind of interesting.
00:01:34.280 | So they're like, uh, kind of trying to look at stuff that's not super verifiable.
00:01:39.820 | Uh, can we, can we get some signals out of some of these rollouts and, and improve our RL?
00:01:46.240 | So the very quick approach here, if I'm remembering this correctly and basically, um, for the rollouts
00:01:54.120 | that exist, um, let's say, you know, so it builds on GRPO and GRPO basically you have your policy.
00:02:01.420 | You do inference and you generate a bunch of rollouts, like a bunch of answers to your question.
00:02:06.380 | Let's say you generate eight or 16 different, uh, responses.
00:02:10.300 | Then you, you prefer towards the one that's the most correct.
00:02:14.380 | Right?
00:02:14.680 | So what this does is let's add a second little inference step where we take all of those outputs.
00:02:21.400 | Uh, we, we pass all of that into a model without the question.
00:02:26.680 | And we basically say, okay, can you improve this?
00:02:29.500 | Uh, here's a bunch of responses.
00:02:31.540 | Can you iterate on them, improve them and make like a better, uh, answer based on all
00:02:36.820 | of these rollouts.
00:02:37.780 | Then we use that and we have like a rubric to grade it and we can move the policy slightly
00:02:42.440 | towards that.
00:02:42.940 | So with a little bit of overhead and inference, uh, can we improve our RL and then they, they
00:02:51.500 | do a little bit of testing and they're like, yes, we can, it works guys.
00:02:56.180 | We did it.
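A minimal sketch of the loop being described, with the policy, frozen anchor, and grader passed in as plain callables. The function and helper names are hypothetical stand-ins, not the paper's actual implementation; this is just the shape of "add one synthesis call on top of the GRPO rollouts you're already paying for."

```python
from typing import Callable, List, Tuple

def cat_step(
    policy: Callable[[str], str],        # current policy: question -> rollout text
    anchor: Callable[[str], str],        # frozen anchor: synthesis prompt -> reference
    grade: Callable[[str, str], float],  # grader: (question, answer) -> reward in [0, 1]
    question: str,
    num_rollouts: int = 8,
) -> Tuple[List[str], str, float]:
    # 1. The parallel rollouts GRPO is already generating.
    rollouts = [policy(question) for _ in range(num_rollouts)]

    # 2. One extra inference call: the anchor sees only the rollouts
    #    (question-blind) and reconciles them into a single reference.
    synthesis_prompt = "Combine these responses into one cohesive response:\n\n" + \
        "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(rollouts))
    reference = anchor(synthesis_prompt)

    # 3. Grade the reference (programmatic verifier or rubric judge) and use
    #    that as the supervision signal when updating the policy.
    reward = grade(question, reference)
    return rollouts, reference, reward

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    toy_policy = lambda q: "2 + 2 = 4"
    toy_anchor = lambda p: "The responses agree: the answer is 4."
    toy_grade = lambda q, a: 1.0 if "4" in a else 0.0
    print(cat_step(toy_policy, toy_anchor, toy_grade, "What is 2 + 2?"))
```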
00:02:56.740 | Um, this was a pretty interesting approach.
00:02:59.400 | It shows some fun stuff that Meta is working on.
00:03:02.760 | It's like, okay, for RL, uh, there's only so much verified data, right?
00:03:08.460 | Are we just going to throw all of our money at Mercor or Scale?
00:03:11.780 | Or are we going to like, you know, generate some environments, have some like verification?
00:03:17.140 | We're pretty limited in that, right?
00:03:19.620 | Before we were limited in compute.
00:03:21.820 | Then we did really big pre-training and we're like, oh shit, we can't really pre-train
00:03:25.380 | anymore, we're cooked.
00:03:26.640 | Like we're out of useful tokens.
00:03:28.480 | It's very much decaying.
00:03:29.740 | It's not like we're out of tokens, but there's a heavy decay in the value we get for training
00:03:33.880 | on like 35 versus 30 billion tokens.
00:03:36.620 | Right?
00:03:36.780 | So we started doing RL, uh, we, we did it on a lot of like verifiable domains.
00:03:42.960 | Right?
00:03:43.200 | So math and code, then there's some stuff like the deep research, but you know, we don't have
00:03:48.420 | that much verifiable data. Or, they, they talk about issues with stuff like, um, LLM-as-a-
00:03:55.560 | judge-based verification, and then they're like, okay, we need at some point to get other forms
00:04:00.480 | of verified data, um, or we need to be able to maximize our value out of the data that we
00:04:06.200 | do have and the, the verification stuff.
00:04:08.240 | So, um, yeah, they're, they're like, here's, here's a little paper on how we can be more efficient
00:04:15.240 | with that.
00:04:16.080 | Second paper is, oh shit.
00:04:17.700 | I'm very zoomed in.
00:04:18.700 | It's about, um, agentic environments.
00:04:23.580 | So, um, has anyone here seen Verifiers from Prime Intellect?
00:04:28.320 | It's like an open source library for RL environments.
00:04:31.320 | Uh, Will Brown and the PI team have basically become 24/7 shills for this library, of we
00:04:37.920 | need open source RL environments.
00:04:40.020 | We need environments and no one's ever using it.
00:04:45.820 | There aren't that many great environments, but you know, I liked the, I liked the
00:04:45.820 | initiative it's, it's a cool, cool thing.
00:04:48.280 | Uh, yeah, the OpenPipe people have also been shilling RL; they've been acquired by Core-
00:04:55.820 | Weave.
00:04:56.320 | So maybe CoreWeave is AGI in this, but basically, um, you know, this is another library, similar,
00:05:03.940 | and they, they do cite, uh, Verifiers.
00:05:06.320 | It's a similar library sort of tool to scale up agent environments.
00:05:12.820 | And they're like, there's a few premises here.
00:05:15.560 | Uh, one is that current abstractions.
00:05:19.520 | Like current, uh, RL environments are not realistic where they don't have, like, uh, they
00:05:26.360 | don't have a live environment.
00:05:27.980 | Right.
00:05:28.320 | So they're too static.
00:05:29.180 | They're not dynamic.
00:05:30.020 | Basically, the issue with stuff like SWE-bench and tau-bench and stuff is that the environment
00:05:36.740 | doesn't change while you do stuff.
00:05:39.320 | Right.
00:05:39.820 | But in reality, um, environments are dynamic, right?
00:05:43.780 | Even while your agent is doing work, the environment is changing.
00:05:47.160 | Right.
00:05:47.480 | So they have this sort of time factor.
00:05:50.100 | They have, um, random data points that get added in.
00:05:53.700 | Right.
00:05:54.040 | And then there's obviously a hit for taking long.
00:05:56.820 | Some of the stuff that they measure is the time trade-off for how long agents take.
00:06:02.240 | And they try to normalize this around stuff like, um, API requests and rate limits.
00:06:08.480 | So like, it's, it's a bit, you know, they, they do pause the system for that to not take
00:06:12.960 | that into account, but high level, it's like a way to create environments.
00:06:18.020 | And they, they did create one.
00:06:20.020 | It's called Gaia 2.
00:06:21.260 | It's, it's built off of an environment they made in 2023.
00:06:25.620 | That was like, I think it stands for General AI Assistants, uh, something
00:06:30.840 | like that, a benchmark.
00:06:32.940 | Right.
00:06:33.180 | So they, they expanded on it and they, they show how it's different, what they trained on.
00:06:37.440 | Then it's like a whole bunch of, okay, what goes into making an environment?
00:06:42.120 | Um, you need like, you know, apps, you need the ability to access the external
00:06:48.060 | environment, events, notifications.
00:06:50.180 | The one that they do, it's basically a mobile operating system.
00:06:54.060 | And then there's apps like calendar, email, this and that, and there's tasks to do.
00:06:58.380 | Then they talk about how agents interact with it.
00:07:01.560 | So there's scenarios, there's notifications.
00:07:04.100 | And these are basically like, okay, notifications are even while your
00:07:07.960 | agent is doing something, uh, shit will happen, right?
00:07:10.760 | You might get a random text, a random email information might change.
00:07:14.360 | And then how does the agent respond to all this?
00:07:17.340 | Right.
00:07:17.520 | So it changes as stuff happens live.
00:07:19.340 | Um, then what else is in here?
00:07:22.920 | Different apps, different state.
00:07:24.480 | Then they have a whole GUI for how to, um, look at this tool stuff, environment usage.
00:07:30.660 | This was honestly quite a long one, but it's kind of fun to go through for people that
00:07:36.440 | haven't gone through environments.
00:07:39.180 | Um, it's kind of fun.
00:07:41.240 | Like I didn't expect it to be so long, but it was all a pretty fun read, honestly.
00:07:46.880 | It's nothing that technical.
00:07:48.200 | It's just like, okay, let's learn about how environments work.
00:07:51.800 | What, like what you really need to do to build one.
00:07:53.960 | And I'm like, oh shit, these aren't that simple.
00:07:56.000 | And I say this having put out like an open source environment a while ago.
00:08:00.380 | It was like the whole, um, enterprise suite of, okay.
00:08:04.700 | Send an email, edit a spreadsheet and let's test models.
00:08:07.520 | But, um, then, then, you know, the, the big thing here they want to show off is like, okay.
00:08:13.400 | Time really affects stuff and so does budget, right?
00:08:16.660 | So there's like a, there is a proper, um, correlation between performance and how much
00:08:22.520 | thinking budget or time you give.
00:08:26.720 | But, you know, you don't want that issue.
00:08:29.600 | Then in the, the results section, they have some interesting stuff.
00:08:33.220 | Like, um, the hybrid models from Claude are pretty chill with this, you know?
00:08:39.520 | Um, they don't really have that much of a trade-off.
00:08:43.780 | So Kimi K2 is the best, um, open source; GPT-5 with high reasoning was the best closed, but
00:08:50.500 | stuff like Claude 4 Sonnet is, um, faster.
00:08:55.060 | It combines, you know, it doesn't, it doesn't waste time reasoning when it doesn't need to.
00:09:00.580 | So interesting little takeaways.
00:09:03.060 | Then they have like a whole deep dive.
00:09:05.000 | Uh, third paper was actually just very, very different.
00:09:08.800 | Um, Bootstrapping Task Spaces for Self-Improvement.
00:09:11.680 | Um, let me pause and check the comments real quick.
00:09:16.060 | Wait, so basically they open source their RL environment, not their RL environment.
00:09:19.960 | It's one RL environment.
00:09:21.400 | Uh, but yes, they, they did open source it.
00:09:23.740 | It's, it's the second version of this.
00:09:25.960 | It's a generic agent environment and it's specifically an environment with like a thousand
00:09:30.560 | tasks on, um, mobile use.
00:09:33.380 | So email and this and that, uh, why do that?
00:09:37.340 | Why not just use verifiers?
00:09:38.360 | So they, they talk about what's wrong with verifiers, right?
00:09:41.060 | So that's static.
00:09:42.800 | We want, we want dynamic environments.
00:09:45.020 | We want stuff where time changes and the environment changes while you're thinking.
00:09:48.140 | Basically in other environments, there's no time delay.
00:09:51.860 | Right?
00:09:52.160 | So you can, so like in SWE-bench, you can spend six hours solving one task, but in reality, your
00:09:59.180 | code base is still getting other changes.
00:10:01.140 | Other people are working on it.
00:10:02.560 | There's noise that comes in.
00:10:04.100 | So, you know, you want stuff to be dynamic.
00:10:06.500 | Uh, that's why not.
00:10:07.720 | That's why not just use other ones.
00:10:09.880 | And then I think they want people to work on their thing, but they do, they do cite Verifiers.
00:10:15.100 | Their paper is just kind of different.
00:10:17.620 | It's just slight improvements to RL with task spaces for self-improvement.
00:10:23.560 | Basically in GRPO, uh, you have like step-based changes.
00:10:27.600 | They're like, can we instead change the objective to be sequential,
00:10:32.740 | so, based on like your thought data, the last step has a bigger impact?
00:10:38.920 | And then it's like just a lot of, um, RL policy optimization stuff.
00:10:44.480 | I don't know if we'll get to this one since it's very different and the environment one is pretty long, but.
00:10:49.180 | That is a 15 minute intro of what meta super intelligence has been up to.
00:10:54.200 | I think we should just go into them.
00:10:55.820 | This one's pretty short compute as a teacher.
00:10:58.900 | It's kind of fun.
00:10:59.680 | It has some plays on LLM as a judge and rubrics being AGI.
00:11:03.720 | So we'll, we'll kind of start with this one.
00:11:06.280 | Then we'll do a second one.
00:11:07.120 | Um, do you think I'm gonna pause for questions also, if anyone wants to unmute and chime in, if you read it, you know, feel free.
00:11:14.800 | Okay.
00:11:15.460 | Do you think environment should dynamically change or just reward allocation along the way?
00:11:20.920 | Uh, there's both steps and then here in their, in their testing, they, they do have a time penalty, right?
00:11:26.800 | So, um, I think it's not just that there's reward allocation along the way.
00:11:32.620 | One of the rewards is for being able to deal with like useless information, right?
00:11:40.060 | So if you get a random text that deters you from your objective, uh, can you handle this?
00:11:46.360 | Or can you not handle this, uh, static versus dynamic?
00:11:49.860 | What is the distinction here?
00:11:51.120 | The distinction is basically like, as you're working, um, environmental changes are happening.
00:11:57.720 | Right.
00:11:57.980 | So in a static environment, like SWE-bench, uh, you have tasks, then you just do them. In something like this,
00:12:05.060 | uh, the code base is being changed externally, regardless of what you're working on.
00:12:10.960 | And then you have calls like, you know, get state, and then you can see how state has changed and you have to be able to deal with that.
00:12:16.220 | There were some flaws to this paper that I didn't like, um, basically in the evaluation, they have like cutoffs for.
00:12:23.540 | When the task is failed and part of that is like, okay, you take X many steps or your context window fills up.
00:12:30.200 | And I'm like, what do you mean your context window fills up?
00:12:32.180 | And then they're like, a great thing to add would be like a tool, um, a code interpreter tool, because that lets you do a lot of, uh, thinking without filling up your context.
00:12:43.160 | And they were like, we'll add it soon, but we haven't added it yet.
00:12:45.560 | Little flaws, but anyway, I think the bigger value in this is, like, one, they care about time to solve tasks,
00:12:53.520 | two, dynamic environments, and three, um, I've blanked on the third. Uh, let's see, more questions.
00:13:01.020 | Is there a reason these papers came out now did compute changes?
00:13:04.040 | Yeah.
00:13:04.200 | These are like efficiency and RL, so, um, you know, we only have so much RL data, right?
00:13:09.000 | So much verifiable code, so much verifiable, like labeled environments.
00:13:14.820 | And instead of paying Mercor, the fastest-growing, 500 million ARR company, what if it can be more, more efficient?
00:13:22.500 | And also most of these authors were, uh, they weren't really at meta super intelligence when they started.
00:13:28.860 | So for example, for a lot of these, like you'll see them using, um, here they use Qwen 2.5, here they use Llama 3.1 or 3.2.
00:13:39.940 | And this is like building on a benchmark from 2023.
00:13:44.100 | So, you know, it's been in the works for a while.
00:13:46.080 | Um, some of these guys are now at Anthropic, like two of them are at Anthropic, but anyway, you know, it's, it's still, it's still relevant stuff overall.
00:13:57.660 | And it's just for the hype.
00:13:59.220 | The only reason we're covering these papers is because meta super intelligence, uh, otherwise they're, they're, they're interesting papers.
00:14:06.900 | I mean, I didn't think that like stuff like this compute as a teacher, it's like better performance in RL that you can kind of just drop in and it doesn't really have too much of a negative impact.
00:14:17.880 | So we'll stuff like this be standard, potentially the only problem is that no one's really doing RL, right?
00:14:24.240 | Uh, the people that do do RL are like big labs and we don't know if they're doing this stuff.
00:14:31.260 | So maybe, maybe one of the open models, maybe Mistral or something, drops a paper and they're like, oh yeah, we've been doing this.
00:14:38.160 | It helps a lot, but you know, maybe if you're into some mid training custom RL work, that'd be cool.
00:14:44.560 | And you can build on this.
00:14:47.080 | Okay.
00:14:47.380 | Compute as a teacher, uh, turning inference compute into reference free supervision.
00:14:53.740 | It's free.
00:14:54.460 | We love it.
00:14:55.060 | Uh, where do learning signals come from?
00:14:57.400 | When there's no ground truth in post-training, uh, you convert the model's own exploration at inference time into reference-free supervision by synthesizing a single reference from a group of parallel rollouts, then optimizing towards it.
00:15:13.660 | Kind of interesting.
00:15:14.560 | So you're already doing a bunch of rollouts.
00:15:16.400 | You're doing a bunch of inference.
00:15:17.720 | What if we can synthesize a single reference from all of these and then optimize towards it?
00:15:22.900 | Seems fake.
00:15:23.880 | Seems like it shouldn't work.
00:15:25.160 | Right?
00:15:25.520 | Like, aren't you going to be limited at the quality of what's in the rollouts?
00:15:32.120 | turns out no, with a bunch of like little information, you can actually get slight improvements.
00:15:38.340 | And then there's even disagreements.
00:15:40.140 | Like you can, you can get, uh, synthesized results when they differ from all rollouts.
00:15:46.700 | So like in 1% of cases, uh, the synthesized result disagrees with every rollout, but is still correct.
00:15:53.200 | And then you judge these based on a rubric grading, which is kind of, uh, unique as well.
00:15:57.760 | So, concurrently, the current policy produces a group of rollouts, and a frozen anchor, uh, reconciles omissions and contradictions to estimate a reference.
00:16:08.920 | So this is basic GRPO. Uh, there's two regimes.
00:16:12.020 | There's verifiable tasks where you have a programmatic way to check the answer.
00:16:16.480 | So basically for math, like you can use a little, um, a little software to just check, like, okay, is this math correct?
00:16:24.720 | Use a calculator or code, you know, does the code compile?
00:16:28.480 | Then there's non-verifiable tasks with self-proposed rubrics: binary, auditable criteria scored by an independent LLM judge, with reward given by the fraction of criteria satisfied.
00:16:43.540 | Unlike selection methods, synthesis may disagree with the majority and be correct,
00:16:49.720 | even when all the rollouts are wrong. This was kind of interesting.
00:16:52.300 | So currently there's a few other approaches; like, someone mentioned the tree-based approaches.
00:16:57.100 | So, um, you know, if you do best-of-N and just move the policy towards that, or majority voting, or have a judge pick the best output, um, the difference here is that this synthesized result can
00:17:10.600 | disagree with everything that's in the rollout, but still be better.
00:17:13.900 | And then, uh, yes, numbers go up.
00:17:16.180 | They tested a bunch of four B's and eight B's and number went up, went up with general training and with RL.
00:17:24.220 | Uh, I thought this was a pretty, not so great diagram, but it exists.
00:17:29.380 | Um, so: synthesize an estimated reference, convert that supervision into reward; it can be applied at test time for inference-time gains or inside RL.
00:17:40.660 | So, um, for specialized skills, why do we need, we need a, how to do figure one.
00:17:52.360 | Yeah.
00:17:52.480 | This figure one was bad, but there's a better figure two somewhere in this paper.
00:17:57.460 | We'll get to it eventually.
00:17:58.580 | Okay.
00:18:00.040 | Post-training typically relies on supervised fine-tuning with labels, or verifiable rewards with programmatic checkers; valuable tasks lack both.
00:18:11.440 | Right.
00:18:11.760 | So in non-verifiable settings, uh, like, you know, clinical or lifestyle advice, there's also the issue of labeling being ambiguous.
00:18:22.120 | Right.
00:18:22.680 | So annotation pipelines are hard to scale and even then we don't, we don't often, um, we don't often like have experts agreeing.
00:18:32.940 | Right.
00:18:33.180 | So in ambiguous stuff, like experts can have different thoughts on what's concise writing.
00:18:41.100 | They can like, you know, there's just, there's just issues with that.
00:18:44.160 | And then LLM as judges, they, they can have issues here.
00:18:47.880 | So, uh, verbosity bias, they love long outputs.
00:18:51.600 | They're inconsistent.
00:18:52.680 | They like, or we're happy.
00:18:53.760 | So one main question, uh, can inference compute substitute for missing supervision for non-verifiable stuff or ambiguous, uh, verification.
00:19:06.720 | Can inference compute do better than stuff like LLM as a judge?
00:19:11.320 | Uh, yes, this is their answer.
00:19:15.280 | They did it guys.
00:19:16.280 | Compute as Teacher. Not chain of thought: C-A-T here is Compute as Teacher.
00:19:21.120 | It converts the model's own exploration into reference free supervision.
00:19:25.200 | For each prompt, the current policy generates a set of parallel rollouts, uh, that's GRPO.
00:19:31.400 | Then, conditioned on the rollouts, it synthesizes a single estimated reference by, by reconciling omissions and contradictions across them.
00:19:40.840 | Uh, this is once again, basically they throw all these rollouts into a model.
00:19:45.760 | They have it synthesized a better response.
00:19:47.800 | We'll get into the prompt that they use for this, which is kind of interesting.
00:19:52.600 | A key thing is they don't give the question anymore.
00:19:55.680 | They just give rollouts.
00:19:57.400 | Um, CaT reuses the group rollout compute budget already common in RL, adding little overhead beyond the compute already spent, right?
00:20:06.240 | Uh, is it, is CaT test time?
00:20:09.320 | Kind of, it's during rollout.
00:20:10.160 | So there's like multiple test times in RL, right?
00:20:14.160 | In RL, when you're doing GRPO, you do a bunch of rollouts and some of these are already off policy.
00:20:20.160 | Some of these are longer and shorter.
00:20:21.920 | So there's a lot of wasted inference time, right?
00:20:25.160 | Um, there's tricks you can do to optimize this, but this is just a way of doing a little bit more overhead there.
00:20:31.360 | So two domains for cat, uh, one is verifiable where it's obvious, right?
00:20:36.680 | If it's verifiable, just use the same verifier.
00:20:39.320 | And if it works, it works, right?
00:20:40.920 | So let's say you have a math question, all your rollouts are shit, but you can answer it with a synthetic, um, output from cat.
00:20:50.800 | Well, then you can check it there and that's good.
00:20:52.760 | If it's non-verifiable, uh, they have this concept of rubrics.
00:20:57.520 | Basically, uh, you use another model.
00:21:00.120 | You say: here's a question, uh, what are rubrics, hard criteria that shouldn't be easy to satisfy, that these can be graded on?
00:21:07.960 | And then a judge follows that rubric and the output, and then has yes or no, um, criteria.
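A small, hedged sketch of those two reward regimes: a programmatic check when the task is verifiable, and a fraction-of-satisfied-rubric-criteria score from a judge when it isn't. The checker, rubric items, and judge below are toy placeholders, not the paper's components.

```python
from typing import Callable, List

def verifiable_reward(answer: str, check: Callable[[str], bool]) -> float:
    # Verifiable regime: just reuse the existing programmatic verifier
    # (math checker, "does the code compile", etc.).
    return 1.0 if check(answer) else 0.0

def rubric_reward(answer: str, rubric: List[str],
                  judge: Callable[[str, str], bool]) -> float:
    # Non-verifiable regime: each rubric item is a binary, auditable criterion;
    # the reward is the fraction of criteria the judge marks as satisfied.
    if not rubric:
        return 0.0
    satisfied = sum(judge(criterion, answer) for criterion in rubric)
    return satisfied / len(rubric)

if __name__ == "__main__":
    # Toy judge: checks whether the criterion's key word shows up in the answer.
    toy_judge = lambda criterion, ans: criterion.split()[-1] in ans.lower()
    rubric = ["mentions dosage", "cites a guideline", "warns about interactions"]
    answer = "The standard dosage is X; watch out for drug interactions."
    print(rubric_reward(answer, rubric, toy_judge))            # 2 of 3 -> ~0.67
    print(verifiable_reward("4", lambda a: a.strip() == "4"))  # 1.0
```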
00:21:14.320 | Uh, the big thing here is this is synthesis, not selection, right?
00:21:18.840 | So, uh, you're, you're generating a new answer that can one, disagree with the majority and be correct even when rollouts are wrong.
00:21:27.400 | This is kind of interesting, right?
00:21:28.720 | Uh, so empirically we observe behaviors and disagreement with majority of rollouts on 14% of questions.
00:21:35.720 | So basically, you know, it disagrees fairly often, um, and it disagrees with all rollouts.
00:21:42.080 | 1% of the time performance scales with the number of rollouts as expected, right?
00:21:48.080 | So if you give it one rollout, it's not going to do much.
00:21:51.080 | Uh, if you give it two, four, eight, 16, 32, it starts to do better when it has more information.
00:21:57.080 | They've got, they have, uh, plotted a scaling curve for this and then the intuition around why this works.
00:22:03.440 | So basically, um, parallel rollouts diversify generations, surfacing like different sub-factor solutions, right?
00:22:12.800 | So, uh, you don't just do rollouts such that they have the same output.
00:22:18.800 | They're, they're somewhat diverse and they go down different thinking paths, right?
00:22:22.800 | Uh, this was shown in other pre-training reasoning work where in the GRPO policy, you want to optimize for diversity, right?
00:22:31.800 | So, uh, the last RL paper that we covered, they basically completely got rid of, uh, one of the constraints that really allowed for a lot of diversity in samples and that allowed for a higher hit rate towards being, um, correct and verifiable.
00:22:47.800 | So in this case, you know, GRPO is already doing different rollouts that are in different, uh, you
00:22:55.400 | know, different diverse groups and then conditioning this, um, it's similar to ensembling, right?
00:23:03.200 | So basically if you have a synthetic data generator that has a bunch of different, um, you know, different
00:23:11.800 | trees that it could have gone down, you have, you have the ability to pick out, okay, this, this
00:23:17.240 | look correct.
00:23:17.880 | A bunch of these seemed off, um, maybe this is the right way and in non-verifiable domains, rewards
00:23:23.720 | transform match later into discrete, dah, dah, dah, dah.
00:23:26.760 | Yeah.
00:23:27.000 | So the auditor is question blind, right?
00:23:28.840 | Basically, um, you're just, you're just looking at a bunch of potential solutions and then you're like,
00:23:34.840 | okay, a few of them go down this path.
00:23:36.680 | This is probably relevant.
00:23:38.920 | Um, let's, let's synthesize a better response from this.
00:23:42.840 | And then you have a bunch of different things.
00:23:44.600 | So that's why it works.
00:23:45.560 | Uh, practically it's just drop in, right?
00:23:48.120 | You don't need human labels, no specific verification or anything.
00:23:51.800 | You use the same, um, you use the same verification techniques as before.
00:23:59.240 | If it's a non-verified domain, you have this rubric thing.
00:24:03.640 | Uh, and then, yeah, they, they actually tested it.
00:24:07.160 | It improves 4B-to-8B models across three model families: Gemma, Qwen, and Llama.
00:24:11.880 | This is once again, interesting, right?
00:24:13.240 | So why are we talking about meta super intelligence now?
00:24:15.800 | Uh, this work is not that recent, you know, it's Llama 3.1 8B.
00:24:19.880 | They could have done Llama 3.2 1B or 3B, but it's, it's old models.
00:24:25.320 | But anyway, uh, yeah.
00:24:26.840 | Okay.
00:24:28.040 | So, several other bodies of work around this, right?
00:24:31.320 | So, um, it learns from model-generated supervision, but derives the target
00:24:36.760 | by reconciling multiple samples rather than trusting a single self-generated label.
00:24:40.920 | Uh, they talk about other stuff, right?
00:24:42.840 | So there's distillation, um, there's self-training, there's majority voting.
00:24:48.200 | There's LLM as a judge.
00:24:51.000 | There is, these are basically like interesting.
00:24:54.840 | This section's a bunch of other examples that you can, um,
00:24:58.440 | look at if you're interested in like better, um, self-generation for RL stuff.
00:25:06.360 | Okay.
00:25:07.480 | Um, the uniqueness here is it constructs new answers that can depart from
00:25:13.000 | the consensus, um, different than LLM as a judge.
00:25:17.480 | That's different than majority voting, right?
00:25:19.080 | So majority voting picks the majority best from the rollout.
00:25:23.480 | This can deviate from that different than LLM as a judge.
00:25:27.640 | Um, it has specific criteria that, you know, it mitigates like instability.
00:25:36.280 | So the problem with LLM as a judge is that there's biases like,
00:25:41.160 | okay, I prefer verbosity.
00:25:43.160 | I prefer, um, you know, like you have preferences across model family.
00:25:47.800 | Uh, this doesn't, this doesn't do that.
00:25:49.880 | It's just specific.
00:25:51.240 | Follow this rubric criteria.
00:25:52.840 | Then there's, um, you know, programmatic verification.
00:25:58.680 | Okay.
00:25:59.240 | Contributions.
00:26:00.120 | One, they do, uh, CaT, their, you know, Compute as Teacher.
00:26:05.000 | And then this rubric-score judging, uh, it avoids human references and reduces
00:26:10.760 | reliance on brittle judge-only scores.
00:26:12.920 | Uh, then we do have a case study here.
00:26:15.640 | They actually do it.
00:26:16.600 | They test it on MATH and HealthBench, and then they, they compare it and all that.
00:26:20.520 | Oh, this was the related work section.
00:26:22.280 | So reference-free fine tuning, um, Constitutional AI from Anthropic, Self-Instruct for training
00:26:30.120 | with instruction following, Quiet-STaR.
00:26:32.440 | All these are like different approaches to the same thing.
00:26:35.720 | Better, uh, reference-free fine tuning.
00:26:37.880 | There's reference-free RL, uh, so test-time RL, reference-free LLM training by RL, Absolute Zero,
00:26:46.760 | self-play, um, all these are like interesting approaches.
00:26:52.040 | So compared to reference-free fine tuning, their approach can holistically improve outputs for
00:26:57.800 | arbitrary specialized tasks.
00:26:59.720 | That sounds like buzzwords. Uh, compared to reference-free RL, they're able to construct
00:27:05.400 | and synthesize answers outside of the explored distribution, and extend beyond verifiable
00:27:10.040 | to non-verifiable domains.
00:27:13.880 | This is pretty unique, right?
00:27:14.920 | I think this is the big takeaway of the paper.
00:27:16.760 | So, um, the second half is pretty key, right?
00:27:19.640 | If you have non-verifiable domains, uh, you can, you can still work in those and you can go out of your
00:27:26.920 | rollout policy.
00:27:29.480 | Then, um, other work in non-verifiable RL: VeriFree, JEPO, RLPR, um, all this stuff.
00:27:38.280 | In contrast, rubrics-as-rewards is a more general approach that constructs rubrics from
00:27:44.760 | reference answers, which are then judged by LLMs to compute a score.
00:27:49.240 | Unlike all these methods, theirs does not require a reference answer, right?
00:27:54.120 | So all the other work that they cite, they need reference answers.
00:27:57.720 | Um, their approach doesn't, which is, you know, it's unique.
00:28:01.400 | I thought this paper was somewhat impactful.
00:28:03.080 | Okay.
00:28:04.200 | A bunch of notation bullshit.
00:28:05.720 | Basically, uh, it's GRPO, and then you have this, uh, synthesis step with, um, uh, yeah.
00:28:14.040 | You have a synthesis step that's generated from the rollouts, and then the rubric criteria are done later.
00:28:20.040 | This was a better sort of diagram, but I think we still need to make reports on it.
00:28:23.960 | Uh, estimating a reference by synthesizing rollouts to estimate a reference response.
00:28:29.880 | We introduce a synthesis step.
00:28:32.040 | Um, so at each GRPO step, the current policy generates a bunch of rollouts.
00:28:38.200 | Okay.
00:28:38.440 | Actually, uh, I'll explain what GRPO is.
00:28:41.160 | I don't know how many people are super up to date with GRPO.
00:28:45.960 | Uh: can someone explain why we need to keep the question blind, to prevent it from acting
00:28:54.040 | like it's just another rollout?
00:28:55.160 | Yeah.
00:28:55.640 | I'll, I'll explain this question next.
00:28:58.840 | Uh, someone asked, why do we need to keep this?
00:29:01.800 | So for the synthesis step, um, why do we need to keep it question blind?
00:29:06.920 | Why do we not give it the question?
00:29:08.680 | I'll explain that right after explaining what GRPO is.
00:29:11.480 | So, um, GRPO is basically what's used in current RL, where, um, you basically have,
00:29:20.200 | instead of having multiple models and all this, you, you do a bunch of rollout.
00:29:23.960 | So at your current step, you generate a bunch of different outputs.
00:29:28.520 | So let's say for a different, for a given query, like a math question, like some integral question,
00:29:33.960 | you generate X amount of different answers with chain of thought reasoning and whatnot.
00:29:39.720 | And then you have a group relative policy.
00:29:43.400 | So like you, you optimize towards what's the best in this group of outputs.
00:29:47.160 | And then you move your policy towards that.
00:29:49.000 | It's like a memory-efficient version of PPO, since you don't need multiple models and this and that.
00:29:53.400 | So, uh, the, the very, very basic TLDR is just, you generate multiple outputs.
00:29:59.400 | Like let's say you generate four, eight or 16 outputs for a question.
00:30:03.800 | And then you optimize towards the one that's the most correct.
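Since GRPO keeps coming up, here's the group-relative part as one tiny, self-contained function: each rollout's reward is normalized against the group's own mean and spread, and those advantages weight the update. A simplified illustration of the idea, not any particular library's implementation.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # GRPO's trick: no separate value model. A rollout's advantage is just its
    # reward relative to the other rollouts sampled for the SAME question.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # Eight rollouts for one question; only two got the right answer.
    rewards = [0, 0, 1, 0, 0, 0, 1, 0]
    print(group_relative_advantages(rewards))
    # The two correct rollouts get positive advantages, the rest go slightly
    # negative, so the policy is nudged toward the group's best answers.
```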
00:30:06.840 | Now, what, uh, CAT is doing is you pass in all of these rollouts to a synthesizer model,
00:30:14.360 | a synthesizer step, and then you have it kind of improve these.
00:30:18.040 | Now you say, okay, here's chain of thought one.
00:30:21.160 | Here's chain of thought two.
00:30:22.040 | Here's chain of thought three.
00:30:23.240 | Uh, can you give a more concise, better approach towards solving this?
00:30:27.400 | And sometimes, you know, given all these different rollouts and different, uh,
00:30:32.520 | techniques and attempts at the question, um, you know, you get a better result by just
00:30:39.960 | throwing all this in one more verification, synthesization step.
00:30:43.320 | Uh, the reason that we don't give it the question is because we don't want bias, right?
00:30:47.640 | We don't just want like a 17th rollout, right?
00:30:49.960 | If you're already doing 16 shots on goal, giving it the question is just going to make it answer the
00:30:55.240 | question again with more information.
00:30:57.240 | Now, if you supply it a bunch of incorrect information and ask it the question again,
00:31:01.720 | you're not going to get any deviation, right?
00:31:04.040 | You're going to use that.
00:31:05.000 | Instead, what you're doing is you're just giving it the rollout, just the chain of thoughts.
00:31:10.200 | And then you're giving it a reference.
00:31:11.880 | Now this reference will often still have like, you know, here's an answer and then it will be
00:31:17.160 | verified.
00:31:17.720 | And sometimes it's correct.
00:31:18.680 | Sometimes it's not.
00:31:19.800 | What they show is that oftentimes it is a lot better.
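To make the question-blind point concrete, this is roughly what building that synthesis prompt looks like: the rollouts go in, the question deliberately does not. The wording paraphrases the prompt quoted later in the session; treat it as an illustration rather than the exact template.

```python
from typing import List

def build_synthesis_prompt(rollouts: List[str]) -> str:
    # Note what is absent: the original question. The anchor only sees the
    # chains of thought, so it reconciles them instead of taking a 17th shot.
    header = (
        "You are tasked with combining multiple responses into a single, "
        "cohesive response. Identify common themes, reconcile differences, "
        "and preserve key insights.\n\n"
    )
    body = "\n\n".join(
        f"Response {i + 1}:\n{text}" for i, text in enumerate(rollouts)
    )
    return header + body + "\n\nWrite the unified response below."

if __name__ == "__main__":
    print(build_synthesis_prompt(["CoT 1: ... so x = 3.", "CoT 2: ... so x = 5."]))
```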
00:31:22.440 | Uh, for synthesis, do they use the same policy model?
00:31:25.720 | Uh, they use an LLM as a judge and they don't, it shouldn't really matter what model you use.
00:31:32.520 | I think they do use the same policy model.
00:31:34.520 | Uh, we'll, we'll double check, but they, the, the fun thing here is actually in the appendix,
00:31:39.960 | they show all the prompts for this, which is more useful.
00:31:42.120 | Um, then how do they verify this?
00:31:46.200 | Two things.
00:31:46.680 | So for verifiable stuff like that, they just use the verifier.
00:31:50.280 | So like they use the little, uh, software to check if it compiles.
00:31:54.040 | For non-verifiable, they, they have rubrics using rubric prompts.
00:31:58.440 | And then they use an LLM as a judge to just score against this rubric.
00:32:02.360 | Uh, then they can input this back into the GRPO.
00:32:05.800 | And then if the CAT is better than the rollouts, let's instead, um, start to, you know, move our
00:32:13.320 | policy and change our weights towards that.
00:32:16.440 | Okay.
00:32:16.760 | Uh, remarks.
00:32:18.120 | So when you only have one rollout, the synthesis is not gonna do much, but it grows as you have
00:32:25.320 | more rollouts.
00:32:26.280 | This is, you know, what they expect and it's true.
00:32:28.360 | And then they plot it later.
00:32:29.960 | Um, is this just to create a rubric?
00:32:33.160 | The prompt seems like it's creating a unified response.
00:32:36.280 | So there's two things.
00:32:37.080 | There's separation.
00:32:37.880 | Um, so the synthesis step is different than the rubric step.
00:32:42.680 | And the rubric step is not always used, right?
00:32:44.600 | Synthesis step is used for verifiable stuff and non-verifiable.
00:32:48.440 | The rubric is used when you need to do non-verifiable domains.
00:32:51.480 | So you don't have labeled data for that.
00:32:53.720 | Then you can use a rubric.
00:32:54.920 | Um, okay.
00:32:56.600 | Let's talk about experiments.
00:32:59.080 | Basically, they, they evaluate it in two models.
00:33:02.680 | They do, sorry, in a few models: Gemma, Qwen, Llama.
00:33:05.640 | Um, yeah, the TLDR is, it kind of works, you know?
00:33:09.800 | So it improves, it improves 30% relative to the initial policy.
00:33:14.440 | And these are small models.
00:33:15.720 | Uh, adding RL adds a lot more.
00:33:18.360 | So, like, in Llama, doing CaT with RL had a 30% bump on HealthBench; on MATH,
00:33:25.080 | it has a 33% bump.
00:33:26.760 | Uh, so, you know, it, it works quite, quite, uh, well.
00:33:31.800 | Ooh, someone is sharing deep wiki.
00:33:33.560 | Very nice.
00:33:34.600 | Um, where do our rubrics come from?
00:33:37.720 | Are they AI generated, explicit written?
00:33:40.120 | Yes, they are AI generated.
00:33:41.560 | So two-step approach.
00:33:43.400 | One is you use LLM to generate a rubric.
00:33:45.800 | We'll go over the prompt for that.
00:33:47.240 | Then you use an LLM to judge the synthesis answer on the rubric.
00:33:51.720 | Um, that's, unfortunately, I wish that was like section four.
00:33:56.600 | We have to go through the damn appendix to find that stuff.
00:33:59.800 | So that's why I'm going to like skip through this pretty fast.
00:34:02.440 | Uh, would have been interesting to ablate question versus no question.
00:34:07.720 | They might have actually.
00:34:08.760 | It might be something in, um, in the appendix.
00:34:14.520 | Okay.
00:34:14.840 | Results are results.
00:34:15.800 | You know, this, this stuff works on small models.
00:34:18.600 | In model-as-a-judge, instead of checking individual rubric criteria,
00:34:22.040 | you check whether, uh, da, da, da.
00:34:23.400 | Their approach consistently outperforms model-as-a-judge.
00:34:27.000 | Rubrics provides fine-grained assessment criteria easier to verify.
00:34:31.320 | So I do check their rubric thing.
00:34:33.160 | Rubric is useful.
00:34:34.120 | Um, RL with self-proposed rubrics is better than SFT.
00:34:41.400 | Uh, SFT did not do that much, but rubrics did a lot.
00:34:45.480 | Uh, what else?
00:34:47.640 | Um, this produces better reference estimates than single-sample and selection baselines.
00:34:55.400 | Uh, strongest reference estimates versus, you know, best-of-N and all that. Okay.
00:35:02.200 | I think we can skip through a lot of these pretty quick.
00:35:05.480 | Um, it scales with the number of rollouts.
00:35:08.680 | So if you're doing a lot of rollouts, this will do a lot better.
00:35:11.960 | Um, you'll get more benefit from throwing more in there.
00:35:15.720 | It reasons about prior rollouts rather than acting as another one.
00:35:19.160 | So this is, you know, the earlier question about, um, why not pass in the question?
00:35:27.320 | I think my thing is froze.
00:35:29.400 | We're cooked.
00:35:32.200 | Um, my Zotero is slightly frozen.
00:35:40.840 | Okay, we're good.
00:35:41.640 | We're good.
00:35:41.960 | So, um, CaT with a single rollout in context only performs slightly better than a rollout.
00:35:49.320 | This suggests that the additional generation step of synthesizing is not only acting as a
00:35:57.720 | new rollout that self-conditions on this past context.
00:36:00.360 | Uh, only slightly better, though.
00:36:02.520 | Yeah, I mean, these are just results.
00:36:06.680 | It does better than being a single rollout.
00:36:10.040 | Okay, um: it reconciles rather than selects, and it's willing to disagree.
00:36:18.120 | So most of the time it will, um, not just disagree;
00:36:21.960 | it wants, it wants to synthesize better approaches.
00:36:26.360 | It exceeds performance of majority voting.
00:36:29.080 | Um, it depends on the initial policy model to be baseline good.
00:36:36.120 | What else, what else?
00:36:38.520 | Conclusion.
00:36:39.160 | So, CaT turns inference compute into supervision via a frozen anchor policy
00:36:46.840 | synthesizing a reference from the current policy's parallel rollouts,
00:36:51.080 | converting that estimated reference into rewards,
00:36:52.520 | and delivering gains of up to 33%, as discussed.
00:36:55.560 | Yeah, I mean, it's basically useful, free, better performance.
00:36:59.720 | Um, what else do we have here?
00:37:01.960 | Okay, let's look at some of these prompts.
00:37:05.080 | So this is an example where the CAT disagrees with all rollouts.
00:37:15.000 | It's basic math.
00:37:15.880 | There's nothing fun there.
00:37:16.840 | When does it stop learning?
00:37:18.200 | Oh, I want that fun prompts.
00:37:20.920 | Very fun.
00:37:21.560 | Okay.
00:37:21.960 | So synthesis prompt.
00:37:23.000 | You're tasked with combining multiple responses into a single cohesive response.
00:37:29.160 | Below, I will provide several responses.
00:37:31.400 | Your goal is to identify common themes.
00:37:33.400 | Reconcile differences.
00:37:34.600 | Combine the information into a unified response.
00:37:37.480 | Preserve all key insights.
00:37:38.920 | Ensure final response is logical, coherent.
00:37:42.840 | Here's all the rollouts.
00:37:44.200 | And then, you know, box your answer and stuff.
00:37:46.200 | Your response should not be much longer.
00:37:48.440 | The key thing you'll note here is, you know, the synthesis prompt, they're not giving the question.
00:37:52.440 | They're not saying, like: in this question, let f be defined over all complex numbers.
00:37:57.960 | Given this, find this.
00:37:59.640 | You're just being told, you know, your job is to combine responses into a single cohesive response.
00:38:09.320 | So like, look at a bunch of long and like, for example, think of like deep research, right?
00:38:14.120 | You have like 16 different shots of like, tens of thousands of tokens of web searches.
00:38:20.440 | And you want to like ignore stuff that's bad, keep stuff that's good.
00:38:23.480 | And then, you know, the output response they find does better.
00:38:29.480 | It's not just another rollout.
00:38:30.920 | Reasoning synthesis.
00:38:34.840 | Your goal is to identify this, synthesize this.
00:38:37.240 | Be sure to preserve key insights.
00:38:39.320 | Avoid discarding unique insights.
00:38:42.280 | Highlight, address them where possible.
00:38:44.440 | Here's your summary prompt.
00:38:46.360 | Here's how to output it.
00:38:47.400 | Rubric.
00:38:48.920 | This is another fun one.
00:38:50.440 | So the self rubric is a separate prompt, right?
00:38:53.160 | This is that step for: how do you get, how do you get quality
00:38:56.680 | RL training data out of
00:39:01.080 | non-verifiable, like, data tasks.
00:39:07.480 | So here's the rubric thing.
00:39:09.800 | You are given a reference response.
00:39:12.040 | Carefully read the response and develop a reference
00:39:14.600 | response evaluation rubric as follows.
00:39:17.640 | Task. Develop a detailed rubric for this specific response.
00:39:21.160 | Create a detailed rubric for this specific response that describes what high quality responses
00:39:26.840 | look like to it with respect to accuracy, verifiable supporting evidence, logic structure,
00:39:32.440 | yada, yada. Provide five or more rubric criteria that can be verified with yes or no.
00:39:39.480 | Ensure these are very specific and can be verified.
00:39:42.520 | Make it extremely difficult to achieve a high rating.
00:39:45.880 | A high quality answer should be very hard to achieve.
00:39:48.280 | It's rare that any question would achieve high quality.
00:39:50.920 | You may use a reference answer if you see fit and then do it in XML.
00:39:56.200 | But this is kind of how they create a rubric.
00:40:01.400 | Then they have LLM as a judge.
00:40:03.240 | Check this.
00:40:04.600 | Here's their judge template.
00:40:07.240 | You're an expert judge that determines whether an answer satisfies a rubric.
00:40:10.920 | Here's the rubric.
00:40:12.040 | Here's the answer.
00:40:12.920 | Tell if it answers.
00:40:15.240 | If there's no answer provided,
00:40:17.640 | it fails.
00:40:18.120 | You know, be strict and unbiased.
00:40:21.240 | Only determine if the answer satisfies the rubric.
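Putting those two quoted prompts together, the non-verifiable scoring pipeline is roughly: one call writes a strict yes/no rubric from the reference, then a judge scores an answer criterion by criterion, and the reward is the satisfied fraction. The prompt strings below are paraphrases of what was just read out, and the `llm` / `judge` arguments are placeholders for whatever models you would call.

```python
from typing import Callable, List

def propose_rubric(reference: str, llm: Callable[[str], List[str]]) -> List[str]:
    # First call: turn the reference response into five or more binary,
    # deliberately hard-to-satisfy criteria (paraphrasing the rubric prompt).
    prompt = (
        "You are given a reference response. Develop a detailed evaluation "
        "rubric: five or more criteria that can each be verified with yes or "
        "no. Make a high rating extremely difficult to achieve.\n\n"
        f"Reference response:\n{reference}"
    )
    return llm(prompt)

def judge_against_rubric(answer: str, rubric: List[str],
                         judge: Callable[[str], bool]) -> float:
    # Second call(s): strict yes/no per criterion; reward is the fraction met.
    if not answer.strip() or not rubric:
        return 0.0  # no answer provided counts as a fail
    verdicts = [
        judge(f"Criterion: {c}\nAnswer: {answer}\nDoes the answer satisfy it?")
        for c in rubric
    ]
    return sum(verdicts) / len(rubric)

if __name__ == "__main__":
    fake_rubric_llm = lambda p: ["states the final dose", "cites evidence",
                                 "flags contraindications", "stays concise",
                                 "avoids speculation"]
    fake_judge = lambda p: "dose" in p.lower()   # toy stand-in for an LLM judge
    rubric = propose_rubric("Take 200mg twice daily with food.", fake_rubric_llm)
    print(judge_against_rubric("The dose is 200mg twice daily.", rubric, fake_judge))
```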
00:40:24.040 | So yeah, those are kind of examples.
00:40:28.120 | I think it's a cool approach.
00:40:29.800 | I think it would be interesting for anyone doing like basic mid training to try it out.
00:40:35.320 | Yes, we have more efficient RL.
00:40:38.360 | Like it's a better way to use all your rollout data, right?
00:40:44.040 | You're doing so much inference with little overhead.
00:40:46.280 | It looks like you can do slightly better.
00:40:47.880 | I'm going to take like a two minute pause to see if anyone has any questions on this.
00:40:54.120 | Otherwise, we'll move on.
00:40:56.440 | They only tested this with small models.
00:41:01.480 | Yeah, I think it would still work though, right?
00:41:03.080 | Like it doesn't make much of a difference.
00:41:05.240 | They tested on small models on like verifiable, like basic math and stuff.
00:41:09.880 | It shows a higher jump, but at the same time, like it's not harmful in any sense, right?
00:41:16.200 | So why not test it out?
00:41:19.240 | It's also just like basic compute budget, right?
00:41:21.400 | Why it's faster to test on small stuff.
00:41:23.960 | They train faster.
00:41:24.840 | You can see meaningful changes.
00:41:27.080 | So yeah, why not test it on bigger models?
00:41:30.600 | For all we know, this is an optimization step that's already being done.
00:41:34.200 | The bigger fun thing, I guess, was like this rubric-based response for me working better.
00:41:41.080 | And, you know, it being a way to check non-verifiable stuff better than LLM as a judge.
00:41:47.320 | Like there's this fantasy that people like to have that you can just do RFT on any model and just do LLM as a judge and that'll just work.
00:41:59.160 | So it's nice to see a change where, you know, this is like slightly better than that.
00:42:05.240 | Cool. Okay.
00:42:08.360 | I'm going to go into a base model may fail to produce reasonable responses.
00:42:15.400 | So someone asked if this will, they only tested this on small models.
00:42:21.240 | Would it work on bigger models?
00:42:23.320 | Someone disagrees because the weak base model may fail to produce meaningful responses.
00:42:29.640 | I think that would go the other way, right?
00:42:31.240 | If the bigger models would already be stronger.
00:42:36.760 | So it would be like a concern.
00:42:38.760 | Like this might not work on like a 1B model because the base models suck, but it should work on bigger models.
00:42:44.360 | But anyway, okay.
00:42:46.040 | I'm gonna switch to the agent thing they did.
00:42:51.960 | So we already did a bit of an intro about this, but basically this is their like verifiers competitor of RL agent environments.
00:43:03.320 | And once again, they really want two things.
00:43:06.120 | They want time to be a factor.
00:43:11.320 | So how long does it take your agent to solve stuff because the environments are not static, right?
00:43:17.640 | Current benchmarks are too static and that's an issue.
00:43:20.200 | We want dynamic environments.
00:43:21.800 | So as you do nothing, the world still changes around you.
00:43:25.560 | And then we need to measure, you know, how can your agent handle like ambiguities, noise?
00:43:31.720 | How can they adapt to these changes?
00:43:33.480 | And then it runs async, surfacing new failure modes.
00:43:38.680 | Okay, I'm gonna go through this pretty quick.
00:43:42.200 | Our experiments show that no system dominates across intelligent spectrum.
00:43:50.680 | So they have like six or seven different checks.
00:43:53.160 | Stuff like ability to handle ambiguity, conciseness, and the models that do well in different domains
00:44:02.680 | differ from ones that do well in other checks, which is kind of interesting.
00:44:07.320 | Here's kind of a trade-off between models.
00:44:10.280 | Once again, this is slightly old, right?
00:44:12.600 | We're on Claude 4 Sonnet.
00:44:13.800 | We're not on Claude 4 Opus or 4.1.
00:44:18.520 | Kimi K2 is the best open source.
00:44:21.160 | Gemini was a really good mix.
00:44:22.840 | The hybrid models don't reason unnecessarily.
00:44:25.960 | Then there's a cost trade-off.
00:44:28.920 | The interesting thing here is that nothing dominates across the spectrum.
00:44:34.280 | They all have trade-offs.
00:44:35.560 | Like if you reason more, well, now you're taking more time, right?
00:44:39.480 | And then the unique thing is that these curves all plateau, right?
00:44:44.680 | So standard scaffolds miss key ingredients.
00:44:49.400 | We don't want plateaus.
00:44:50.840 | The fun thing here is they define like that this is not an AGI benchmark.
00:44:56.440 | It'll get saturated.
00:44:57.720 | Yeah.
00:45:00.680 | Meta can't even benchmark Opus.
00:45:04.200 | Too expensive.
00:45:07.560 | So deployment and production.
00:45:11.000 | Web is a great environment.
00:45:13.720 | The problem is that web environments change, right?
00:45:18.840 | Like if you have an Amazon-based benchmark, like a clone of an Amazon website for shopping,
00:45:25.880 | it's too static, right?
00:45:27.800 | Like you're not getting new products randomly added.
00:45:31.240 | But on realamazon.com, search results change all the time.
00:45:35.800 | New products are added.
00:45:36.840 | Descriptions change.
00:45:37.880 | So there's a lot of write operations that can change environment that happened randomly.
00:45:45.320 | As of the time of this writing, few open source and flexible
00:45:54.040 | libraries exist for developing and studying such LLM agents.
00:45:58.440 | This is citing Will Brown from verifiers that we need more environments.
00:46:02.600 | Okay, where did I go?
00:46:06.600 | The interesting thing here is when they test all these models, they basically do a very simple
00:46:13.320 | ReAct framework.
00:46:13.320 | So they're like, you know, come do better.
00:46:15.720 | Do other model agent orchestration scaffolds and stuff.
00:46:19.320 | Shit, we only have 10 minutes.
00:46:20.360 | I'm going to go kind of quick.
00:46:22.280 | Okay, they want running creation of environments.
00:46:27.560 | They need to handle time.
00:46:30.280 | Simulated stuff is not realistic.
00:46:32.520 | Mobile.
00:46:35.320 | So the one that they do is Gaia 2.
00:46:38.360 | It's got a thousand verifiable scenarios.
00:46:41.640 | It's a mobile environment.
00:46:42.920 | So it has mobile apps like email, messaging, calendar, the associated content with them.
00:46:49.000 | Evaluations for agents beyond peer search and execution.
00:46:53.240 | Verifiable tasks that are simple.
00:46:54.920 | I'll cover this in 10 minutes and push the next one next thing.
00:46:58.440 | Yeah.
00:46:58.600 | Okay.
00:46:58.840 | I'll do this one in 10 minutes.
00:46:59.720 | We'll skip the third paper for now.
00:47:01.320 | So one, they differ than most agent benchmarks.
00:47:05.320 | And one, there's more realistic interactions between agents and environment that run asynchronously.
00:47:11.560 | Scenarios spanning arbitrary periods of time.
00:47:14.680 | Environments time passes.
00:47:16.120 | So this is what makes their benchmark different, right?
00:47:18.280 | This is one of the key things.
00:47:19.720 | Environment time passes independent on whether agent acts or not.
00:47:23.720 | The environment state is continuously updated with random or scheduled events,
00:47:27.880 | such as friends replying to messages sent by user or agent.
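That "time keeps passing" point maps pretty directly onto an async event loop: the environment schedules events that mutate state whether or not the agent has finished thinking. A toy asyncio sketch with made-up class and event names; nothing here is the actual ARE API.

```python
import asyncio

class ToyEnvironment:
    def __init__(self) -> None:
        self.inbox = []

    async def scheduled_event(self, delay_s: float, message: str) -> None:
        # Scheduled (or random) world events fire on their own clock.
        await asyncio.sleep(delay_s)
        self.inbox.append(message)
        print(f"[env]   new message arrived: {message!r}")

async def slow_agent(env: ToyEnvironment) -> None:
    print("[agent] thinking...")
    await asyncio.sleep(2.0)  # a long reasoning step or tool call
    print(f"[agent] done thinking; inbox is now: {env.inbox}")

async def main() -> None:
    env = ToyEnvironment()
    # The friend's reply is scheduled regardless of what the agent is doing.
    await asyncio.gather(
        env.scheduled_event(1.0, "Friend: sounds good, see you at 7!"),
        slow_agent(env),
    )

if __name__ == "__main__":
    asyncio.run(main())
```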
00:47:30.840 | They have robust verification system into RL.
00:47:35.000 | Basically, comparing agent write actions only to annotated oracle write actions, for each write action.
00:47:42.360 | While today's frontier models are far from solving this,
00:47:46.280 | we do not consider it to be an AGI-level benchmark in the LLM RL arena.
00:47:51.000 | And we expect rapid hill climbing.
00:47:53.000 | It's just like a start of like their first environment that's dynamic.
00:47:57.880 | They expect progress to be done.
00:48:00.920 | This is a much better diagram of what's going on.
00:48:03.720 | We expect that this will require modeling effort beyond test-time scaling.
00:48:16.520 | So basically, they have multi-agent interaction, time-based stuff.
00:48:22.680 | They're like, basically, if we want efficiency and models and RL, we need to do modeling effort beyond
00:48:31.720 | test time scaling, which I kind of disagree with, right?
00:48:34.440 | So they're like, you can't just scale for more reasoning.
00:48:37.320 | I think you have routers, you have low, medium, high thinking models, right?
00:48:42.920 | So like in their charts on the performance, they basically show how there's like a clear
00:48:49.240 | performance scale between GPT-5 on low, medium, high thinking.
00:48:59.640 | But, you know, they're like, we need to do model based.
00:49:03.400 | So here, so like GPT-5 with minimal thinking versus low versus high.
00:49:10.040 | The more thinking does better, but it takes more time.
00:49:13.400 | They're like, we need to do better architecture changes to solve this problem.
00:49:20.920 | But in reality, you know, a simple router or an auto thinking mode would also solve this
00:49:26.280 | because we've already dynamically been able to think better.
00:49:29.560 | And then the hybrid reasoners also do really good on this.
00:49:34.280 | But continuing through time is a thing.
00:49:41.320 | Okay, so foundations of their environment.
00:49:43.960 | Everything is an event.
00:49:45.080 | There's apps, which are basically stateful APIs.
00:49:48.440 | So email messaging, all that stuff is an app.
00:49:50.920 | It's an API that you can interact with as an agent.
00:49:54.280 | Environments are collections of apps, data, governing rules.
00:49:58.280 | Events are anything that happens in the environments.
00:50:00.760 | Everything is logged.
00:50:01.720 | They have like a UI for seeing this too.
00:50:03.400 | Notifications or messages from the environment.
00:50:08.040 | Scenarios are basically initialized states and different events that you can schedule.
00:50:12.600 | Okay, what are apps?
00:50:15.000 | Apps are basically tools with APIs.
00:50:17.240 | So stuff like send email, delete email.
00:50:20.040 | Apps have their own state and then they store state internally.
00:50:24.440 | Tool creation.
00:50:25.960 | So basically, if you want to use the environment,
00:50:28.280 | it's all a Python class that executes, creates stuff.
00:50:31.720 | There are scopes to these:
00:50:33.640 | agent, user, and environment.
00:50:36.040 | Environment scope is stuff that the agent can't control, right?
00:50:39.240 | So like a text comes in from the world.
00:50:41.560 | The agent doesn't have control over that.
00:50:43.640 | User scope is stuff the user can interact with.
00:50:45.160 | Extensibility, you can have external APIs.
00:50:50.120 | So MCP to externally interact with the world.
00:50:54.440 | So if you want to dynamically change events based on like real world data, like
00:50:59.080 | yesterday, someone did a SF map of parking tickets coming in live.
00:51:04.360 | If you want to hook that up to your environment, you can do it pretty easily, right?
00:51:07.720 | Because that's real time dynamic.
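To make the app idea concrete, here's a minimal sketch of a stateful app with scoped tools, loosely following the description above. The names (Scope, EmailApp, send_email) and the SCOPES table are illustrative assumptions, not ARE's actual API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Scope(Enum):
    AGENT = "agent"        # tools the agent may call
    USER = "user"          # actions only the (simulated) user can take
    ENVIRONMENT = "env"    # events fired by the world, outside agent control


@dataclass
class Email:
    sender: str
    subject: str
    body: str


@dataclass
class EmailApp:
    """An app owns its own state and exposes tools; each tool has a scope."""
    inbox: list[Email] = field(default_factory=list)
    sent: list[Email] = field(default_factory=list)

    SCOPES = {"send_email": Scope.AGENT,
              "receive_email": Scope.ENVIRONMENT,
              "list_inbox": Scope.AGENT}

    def send_email(self, to: str, subject: str, body: str) -> str:
        # A write action; a verifier can later compare it to an oracle write action.
        self.sent.append(Email(sender="me", subject=subject, body=body))
        return f"sent '{subject}' to {to}"

    def receive_email(self, email: Email) -> None:
        # Environment-scoped: mail arrives whether or not the agent acts.
        self.inbox.append(email)

    def list_inbox(self) -> list[str]:
        # A read action; reads typically aren't part of verification.
        return [f"{e.sender}: {e.subject}" for e in self.inbox]
```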
00:51:09.240 | Core apps.
00:51:10.680 | So there are core apps:
00:51:12.280 | there's basic interaction,
00:51:13.800 | so agent interfaces, and then there's system stuff.
00:51:16.760 | System core apps are basically like: okay, what's the time, and can I wait for stuff to change?
00:51:23.080 | So basically I can wait for time to pass.
00:51:25.800 | That allows stuff that would take real time,
00:51:30.440 | hours in the real world, to happen fast.
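As a rough sketch of what a system core app like that could look like, assuming a simulated clock with scheduled callbacks (again, hypothetical names, not the library's real interface):

```python
import heapq
import itertools
from typing import Callable


class EnvironmentClock:
    """Environment time advances on its own; scheduled events fire as it does."""
    def __init__(self, start: float = 0.0):
        self.now = start
        self._counter = itertools.count()   # tie-breaker for same-time events
        self._scheduled: list = []          # heap of (fire_time, n, callback)

    def schedule(self, delay: float, callback: Callable[[], None]) -> None:
        heapq.heappush(self._scheduled, (self.now + delay, next(self._counter), callback))

    def advance(self, dt: float) -> None:
        # Simulated hours can pass in milliseconds of wall-clock time.
        self.now += dt
        while self._scheduled and self._scheduled[0][0] <= self.now:
            _, _, cb = heapq.heappop(self._scheduled)
            cb()


class SystemApp:
    """Agent-facing system tools: read the clock, or wait for the world to change."""
    def __init__(self, clock: EnvironmentClock):
        self.clock = clock

    def get_time(self) -> float:
        return self.clock.now

    def wait(self, seconds: float) -> str:
        self.clock.advance(seconds)   # any events scheduled in this window fire now
        return f"waited {seconds}s, environment time is now {self.clock.now}"
```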
00:51:32.760 | Environment events.
00:51:34.760 | An event can be, for example, an agent action.
00:51:37.400 | Let's go a little fast since we're short on time.
00:51:40.600 | Event types.
00:51:42.360 | There's validation events, agent events, notification.
00:51:45.960 | At each step, events can trigger notifications.
00:51:50.520 | Agents can get them,
00:51:52.600 | but notifications aren't the only way for agents to observe events.
00:51:55.880 | They can proactively check.
00:51:57.400 | So outside of just like your phone getting a notification,
00:52:00.920 | the agent can make a call to its notification inbox.
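Here's a minimal sketch of that event/notification flow: events carry a type and payload, and the agent can either receive a push or drain its inbox with a tool call. The Event and NotificationInbox classes are assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    time: float      # environment time at which it happened
    kind: str        # e.g. "agent", "environment", "validation", "notification"
    payload: dict


@dataclass
class NotificationInbox:
    pending: list[Event] = field(default_factory=list)

    def push(self, event: Event) -> None:
        self.pending.append(event)

    def poll(self) -> list[Event]:
        """Agent-side tool call: drain whatever arrived since the last check."""
        out, self.pending = self.pending, []
        return out


# Example: a friend replies 30 simulated minutes after the agent's message.
inbox = NotificationInbox()
inbox.push(Event(time=1800.0, kind="environment",
                 payload={"app": "messaging", "from": "friend", "text": "sure!"}))
print(inbox.poll())   # the agent proactively checks instead of waiting to be pinged
```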
00:52:05.560 | Scenarios.
00:52:06.840 | These are basically like specific scenarios, right?
00:52:10.280 | Like, once I receive this, do that; and you can configure scenario hints.
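A scenario, then, is roughly an initial world state plus events scheduled on the timeline plus optional hints. A minimal sketch of such a spec, with made-up field names rather than ARE's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class ScheduledEvent:
    at: float      # seconds of environment time after scenario start
    app: str
    action: str
    args: dict


@dataclass
class Scenario:
    task: str
    initial_state: dict                                   # pre-populated emails, contacts, calendar...
    events: list[ScheduledEvent] = field(default_factory=list)
    hints: list[str] = field(default_factory=list)


scenario = Scenario(
    task="Reply to mom once she sends the apple pie recipe.",
    initial_state={"contacts": ["mom", "brother"]},
    events=[ScheduledEvent(at=600, app="messaging", action="receive",
                           args={"from": "mom", "text": "here's the recipe..."})],
    hints=["The recipe arrives a few minutes in; don't reply before you have it."],
)
```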
00:52:13.560 | Mobile.
00:52:16.120 | It's meant to be a mobile environment.
00:52:19.000 | Turn rules.
00:52:19.800 | Okay.
00:52:20.120 | We got to go fast through this.
00:52:21.400 | This is kind of interesting.
00:52:23.720 | To generate all these environments, they use Llama 3.3 70B, and they need consistency.
00:52:29.560 | So how do they handle diversity?
00:52:31.960 | They basically have a hierarchical, top-down approach to synthetically generating
00:52:37.480 | context.
00:52:38.200 | So if you have, say, a French physics professor in your context, or a Chinese
00:52:43.720 | professional athlete, you need to have something like 400,000 tokens of
00:52:49.560 | information about them, all these chats and messages, and then a whole schema for how they generate it.
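A rough sketch of that hierarchical, top-down generation idea: decide the high-level persona first, then condition every lower-level artifact on it so the long context stays self-consistent. The `generate` stub stands in for a call to an LLM such as Llama 3.3 70B; the prompts and field names are made up for illustration.

```python
import random


def generate(prompt: str) -> str:
    # Stand-in for an LLM call (e.g. Llama 3.3 70B); returns a placeholder here.
    return f"<generated from: {prompt[:60]}...>"


def build_universe(seed: int) -> dict:
    """Top level decides the persona; every lower level is conditioned on it."""
    rng = random.Random(seed)
    persona = {
        "occupation": rng.choice(["physics professor", "professional athlete"]),
        "nationality": rng.choice(["French", "Chinese", "Brazilian"]),
    }
    # Conditioning each layer on everything above it is what keeps ~400K tokens of
    # chats, emails, and contacts from contradicting the persona or each other.
    contacts = generate(f"List 20 plausible contacts for a {persona['nationality']} "
                        f"{persona['occupation']}, as JSON.")
    threads = generate(f"Persona: {persona}. Contacts: {contacts}. "
                       f"Write realistic message threads with each contact.")
    return {"persona": persona, "contacts": contacts, "threads": threads}
```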
00:52:57.080 | Scenario creation, implementing environments.
00:53:02.200 | The whole point of this is basically it's open source, right?
00:53:04.840 | So if you want to do your own, here's their walkthrough of how they did it for all
00:53:11.320 | of these.
00:53:12.040 | Go figure it out for yourself.
00:53:13.560 | Initial verifiers: verifiable rewards are crucial for RL, right?
00:53:17.400 | So they have rubric-like verification, with a checker agent to check write operations.
00:53:24.520 | So like there's verification in these environments, verification mechanisms, basically you read stuff.
00:53:31.400 | There's soft checks, hard checks.
00:53:33.240 | As you would expect, a hard check is like, okay, let's specifically check against a regex string:
00:53:38.920 | does the output match this?
00:53:40.040 | A soft check is more LLM-judge vibey.
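A minimal sketch of hard versus soft checks over an agent's write actions. The function names and the judge prompt are assumptions; per the description above, the real verifier compares agent write actions against annotated oracle write actions.

```python
import re


def hard_check(agent_action: str, pattern: str) -> bool:
    """Deterministic check: does the write action match an exact/regex spec?"""
    return re.fullmatch(pattern, agent_action) is not None


def soft_check(agent_action: str, oracle_action: str, llm_judge) -> bool:
    """LLM-judge-style check: semantically equivalent even if not textually identical."""
    verdict = llm_judge(
        f"Oracle action: {oracle_action}\nAgent action: {agent_action}\n"
        f"Do these accomplish the same thing? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")


# Example: the recipient must match exactly (hard); a message body would only need to
# convey the same content (soft, via the judge).
assert hard_check("send_email(to='mom@example.com')",
                  r"send_email\(to='mom@example\.com'\)")
```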
00:53:42.760 | This was kind of advanced.
00:53:45.880 | I think we can skip it.
00:53:47.400 | Timing.
00:53:47.960 | There's a time delay for actions, which takes a hit, and then there's verifying the verifiers.
00:53:53.160 | Okay.
00:53:53.480 | They have default agent orchestrations.
00:53:57.480 | Basically they do a ReAct loop.
00:54:00.120 | So a simple ReAct loop that has a pre-step and a post-step.
00:54:03.720 | Pre-step is to get agent context.
00:54:06.200 | So check for turn-based terminations, sorry, check for notifications, check for the current
00:54:12.040 | state of the world.
00:54:13.400 | Post-step is after you finish whatever you're doing, you can do a little post-step, right?
00:54:19.640 | So like, has the world changed?
00:54:20.920 | Is this no longer relevant?
00:54:22.680 | Like if someone says like, hey, can you send me a recipe for apple pie?
00:54:28.600 | Then you start working on it, and you're like, okay,
00:54:32.680 | I need to text my mom to get the recipe from her.
00:54:36.040 | Once you send that text, there's a post-step in your loop, which is like, okay,
00:54:40.200 | let me check if anything changed.
00:54:41.560 | And maybe your brother has texted you,
00:54:43.640 | "I don't need the recipe anymore."
00:54:44.840 | You can terminate right there.
00:54:46.040 | And then, you know, that affects the time wasted.
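A sketch of that default orchestration: a plain ReAct loop wrapped with a pre-step that pulls notifications and world state into context, and a post-step that re-checks the world and may terminate early. The method names (`poll_notifications`, `should_terminate`, etc.) are illustrative assumptions, not the library's real interface.

```python
def run_agent(agent, env, max_steps: int = 200):
    """Hypothetical ReAct loop with a pre-step and post-step around each action."""
    for _ in range(max_steps):
        # Pre-step: refresh context before deciding anything.
        context = agent.update_context(env.poll_notifications(), env.current_time())

        # Standard ReAct: reason, pick a tool call, observe the result.
        action = agent.act(context)
        observation = env.execute(action)
        agent.observe(observation)

        # Post-step: the world may have moved on while we were acting.
        new_events = env.poll_notifications()
        if agent.should_terminate(new_events):   # e.g. "don't need the recipe anymore"
            break
```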
00:54:49.640 | They have a UI for visualizing all this and for evaluating it.
00:54:54.920 | Kind of cool.
00:54:56.280 | It's pretty like robust.
00:54:57.720 | Okay.
00:54:59.400 | Here's the one that they did, the mobile thing.
00:55:02.680 | Do they implement wait tool?
00:55:04.040 | Yes, they have wait tools.
00:55:05.320 | Gaia2 is their sort of Android-OS-style thing.
00:55:12.120 | They have a bunch of scenarios.
00:55:13.880 | Each scenario, they have like a mini version.
00:55:16.760 | So they have dynamic environments where world state changes.
00:55:19.800 | They contrast it with Vending-Bench, which is a fun one.
00:55:23.560 | Time.
00:55:25.720 | So time flows continuously.
00:55:27.640 | Scenarios explicitly incorporate a time dimension, requiring agents to handle temporal constraints.
00:55:33.560 | Temporal awareness is essential for stuff.
00:55:35.800 | Uh, da, da, da, da.
00:55:38.040 | Agent to agent collaboration.
00:55:39.480 | So this is another interesting one.
00:55:41.480 | Um, you have multi agents.
00:55:43.800 | So instead of apps just being like API based SDKs, there's also agents that you can fit in.
00:55:52.200 | Um, agent capabilities: search, execution, adaptability, time, ambiguity, agent-to-agent.
00:55:59.720 | Uh, the main agent can no longer access apps directly.
00:56:03.640 | Basically, to use an app, you need to go through another agent, which you can control.
00:56:08.120 | Um, having that delegated to an agent often helped some of the small models.
00:56:13.080 | Then noise is another one, right?
00:56:14.600 | So API changes, service issues, uh, how do the models handle this?
00:56:18.680 | Uh, this is kind of their agent to agent.
00:56:20.760 | So you can dynamically have some apps be robust APIs, and some of them be agents
00:56:27.320 | that you communicate with, right?
00:56:28.680 | So like, um, one of your apps could actually just be like, talk to a friend, right?
00:56:33.720 | And it's handled as an agent to agent communication.
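A sketch of that agent-to-agent split: instead of calling an app's API directly, the main agent sends natural-language requests to a sub-agent that owns the app. The class names and interfaces here are illustrative assumptions, not the paper's.

```python
class AppAgent:
    """A small agent wrapping one app; the main agent can only talk to it in text."""
    def __init__(self, app, llm):
        self.app = app
        self.llm = llm

    def handle(self, request: str) -> str:
        # The sub-agent decides which of its own tools to call for this request.
        plan = self.llm(f"You control the {type(self.app).__name__} app. "
                        f"Request: {request}. Which tool do you call, with what args?")
        return f"executed: {plan}"


class MainAgent:
    """To the main agent, a sub-agent looks like just another tool endpoint."""
    def __init__(self, delegates: dict[str, AppAgent]):
        self.delegates = delegates

    def use_app(self, app_name: str, request: str) -> str:
        # No direct API access in this split; everything goes through conversation.
        return self.delegates[app_name].handle(request)
```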
00:56:36.280 | Uh, everything.
00:56:38.360 | Okay.
00:56:38.760 | Environment events, data collection.
00:56:41.720 | I think we got to skip through a bunch of this since we only have a minute left.
00:56:45.160 | Um, basically they tested a bunch of stuff.
00:56:48.760 | These were some of my issues with it.
00:56:50.600 | They're all tested at long context, with generation of up to 16K tokens per turn.
00:56:56.200 | If the context length of 128K is exceeded, it's an automatic failure.
00:57:01.080 | Agent loop runs continuously until one of two termination conditions is hit.
00:57:07.560 | So termination is 200 steps or, um, you run out of context.
00:57:12.840 | Environment, um, all scenarios are verified with their verifier using Llama 3.3 70B.
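Putting the evaluation rules just described into a small sketch: hard failure if the 128K context budget is exceeded, otherwise stop at 200 steps, with up to 16K generated tokens per turn. The numbers come from the talk; the function and variable names are illustrative.

```python
MAX_STEPS = 200                  # hard cap on agent steps
MAX_CONTEXT_TOKENS = 128_000     # exceeding this is counted as an automatic failure
MAX_GEN_TOKENS_PER_TURN = 16_000


def evaluate(agent, env, count_tokens) -> dict:
    for step in range(MAX_STEPS):
        if count_tokens(agent.context) > MAX_CONTEXT_TOKENS:
            return {"status": "failed", "reason": "context_overflow", "steps": step}
        done = agent.step(env, max_new_tokens=MAX_GEN_TOKENS_PER_TURN)
        if done:
            return {"status": "completed", "steps": step + 1}
    return {"status": "terminated", "reason": "max_steps", "steps": MAX_STEPS}
```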
00:57:20.440 | Okay.
00:57:22.840 | I think we are out of time, but I'll go through the quick stuff.
00:57:26.600 | Okay.
00:57:26.920 | Core results, um, time split was interesting.
00:57:29.960 | These are some just fun takeaways of how models performed, right?
00:57:35.000 | So, um, execution and search were the easiest splits.
00:57:39.000 | These are kind of the splits on which they evaluate domains, right?
00:57:42.520 | So how well can you do execution search?
00:57:45.080 | How well can you handle ambiguity, adaptability, time, noise, and then use other agents.
00:57:51.480 | And then what's your performance on each of these?
00:57:55.800 | So for example, Gemini 2.5 is very, very good across the board.
00:58:00.120 | It doesn't suffer any major penalties, whereas stuff like Grok has a really bad time penalty.
00:58:05.560 | GPT-5 high thinking, even low thinking has bad time penalties.
00:58:09.640 | Uh, some of them don't handle noise.
00:58:11.640 | So like the weaker models: Llama 3 doesn't handle noise,
00:58:16.760 | Llama 4 struggles with noise,
00:58:18.280 | GPT-4 is always pretty bad with noise.
00:58:20.920 | But stuff like Llama 4 Maverick did really well.
00:58:25.000 | It got a lot of benefit from agent to agent communication.
00:58:28.040 | Um, but you know, fun stuff to just go through.
00:58:30.520 | So Grok 4 is really good at search; stuff that has a deep research product, right?
00:58:36.360 | So OpenAI, Claude, and Grok, they have deep research products.
00:58:39.800 | Those do really well at search.
00:58:41.160 | Uh, ambiguity remained challenging, except, uh, Claude 4 Sonnet and GPT-5 were really good at this.
00:58:49.160 | Uh, for time, only Gemini 2.5 Pro and Claude 4 Sonnet had really good trade-offs with time.
00:58:57.080 | Uh, latency trade-off, noise robustness lags.
00:59:00.760 | Agent to agent benefited weaker models.
00:59:03.720 | GPT-5 performed the best on the benchmark.
00:59:06.680 | Kimi K2 was the best open source.
00:59:09.720 | Costs were another one.
00:59:11.160 | Uh, Claude 4 Sonnet was 3x more expensive than GPT-5 low, but much faster.
00:59:17.960 | Whereas GPT-4 was worst in both senses.
00:59:21.320 | Kimi was a pretty good trade-off in the middle.
00:59:23.480 | Um, what other fun stuff?
00:59:26.120 | I think that's pretty much it.
00:59:28.920 | I want to leave like the two minutes for anyone that's still here.
00:59:31.800 | If anyone has questions, anything they want to dig into here, um, you know, any thoughts, questions?
00:59:37.480 | Otherwise, yeah.
00:59:45.000 | Uh, memory was a fun thing.
00:59:46.440 | They talk about tool calling, uh, how they want to support other agent frameworks.
00:59:53.880 | I think it's like pretty easy, right?
00:59:55.480 | Like if you want to make a fun hype post, just benchmark-max this thing, right?
01:00:01.320 | Like don't overfit, but it's very easy to do a basic agent setup that's better than ReAct and kind
01:00:06.520 | of beat this benchmark, you know?
01:00:09.960 | And like that's about it.
01:00:11.720 | Seems interesting.
01:00:12.520 | It's the first benchmark
01:00:13.560 | that's not static.
01:00:14.760 | Um, yeah.
01:00:18.200 | And so that's kind of the paper; the whole library is out there.
01:00:21.720 | Um, it's, it's interesting.
01:00:24.120 | And then I guess we can do the next one next time.
01:00:26.840 | Okay.
01:00:28.280 | Cool.
01:00:28.520 | Um, fun stuff, guys.
01:00:31.720 | Next week, I think we have a volunteer, unless anyone wants to volunteer a paper.
01:00:36.520 | Oh, someone volunteered.
01:00:42.040 | Uh, RJ, do you know what they volunteered?
01:00:50.120 | Okay.
01:00:50.760 | Seems, uh, bio RL paper.
01:00:53.240 | Cool.
01:00:53.560 | I guess we have a bio RL paper volunteered for next week.
01:00:57.960 | Um, awesome.
01:01:00.440 | We'll share more in the discord, but cool.
01:01:03.640 | Thanks for, thanks for coming guys.
01:01:05.320 | We'll see you guys next time.