Kimi-K2 Tech Report (Full Breakdown w/ Vibhu Sapra)

00:00:00.320 |
the luma said okay, give me k2. um, last week we went over the original muon paper, so basically 00:00:09.280 |
the really cool thing, if anyone has seen it or trained models, is figure three — 00:00:14.560 |
they have a very very smooth loss curve uh normally when you train there's big spikes 00:00:20.400 |
big spikes are not good because you have to kind of like intervene see like okay is the model kind 00:00:24.640 |
of cooked do we have to roll back do we have to go to a new checkpoint do we have to fix stuff 00:00:30.080 |
that's like kind of where you have to stop your training and like your gpu utilization gets kind 00:00:35.440 |
of cooked what they did is they have a new optimizer that gives us really really smooth loss curves with 00:00:40.880 |
like no spikes throughout the entire training process this is kind of crazy um i won't go 00:00:46.560 |
super into detail on the optimizer since last week we we basically covered it so i think the recording 00:00:54.320 |
is already up on youtube uh if you're interested watch it but okay give me k2 who's uh who's used 00:01:00.800 |
it who's heard of it who doesn't know what it is um if anyone wants you know to just interrupt me whenever 00:01:06.080 |
and we can we can answer questions so basically this is like state of the art open source model it's very 00:01:14.480 |
very big um i'm gonna check chat real quick okay nothing nothing about this so basically this is a 00:01:21.680 |
trillion parameters um it's not a thinking model but it's like deep seek v2 moment uh like the second deep 00:01:28.640 |
seek moment um what they did is basically it's like better than claude on all coding benchmarks better 00:01:35.040 |
than like anything it's very very state of the art it's not thinking and it's a it's a very good uh 00:01:41.520 |
non-reasoning model. it's very open source — they shared this optimizer that's basically 00:01:48.720 |
like a better adamw, and, um, this week the actual tech report paper just came out. it's decent, 00:01:56.640 |
honestly um it's like the first highlight of once again how do you train a very very large model 00:02:04.160 |
so it's like kind of a pre-training post-training how do you train a large model and then the 00:02:09.040 |
interesting note here for me was you get to see the problems of using china chips — like they have the 00:02:15.360 |
h800s, right, because there was an export restriction, so these chips are different in terms of, you know, 00:02:22.240 |
interconnect between nodes, and what do they do to get around that. i think like two weeks ago we 00:02:28.560 |
covered the other reasoning paper um magistral the reasoning paper from mistral and they show how 00:02:33.920 |
they can just like skip some of these issues for full gpu efficiency. um, but yeah, here's kimi k2, so 00:02:41.760 |
uh trillion parameter huge huge trillion parameter model with 32 billion active parameters very very 00:02:49.360 |
sparse uh they talk about the sparsity they talk about like experiments they did for sparseness ratios 00:02:56.160 |
and all this stuff it's pre-trained on 15 trillion tokens they use a muon clip optimizer this is basically 00:03:03.680 |
like all the hot news they they put out a separate paper about this before um what this basically 00:03:09.600 |
leads to is very very good training stability. so, uh, they have this qk-clip technique where they're 00:03:17.440 |
kind of clipping the attention logits when they spike, and yeah, it's just very smooth training. 00:03:23.920 |
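roughly what qk-clip does, per the muonclip description: if a head's max attention logit exceeds a threshold tau, that head's query/key projections get rescaled so the logit comes back under the cap. a toy numpy sketch — tau and the shapes are made up, this is not kimi's actual implementation:

```python
import numpy as np

def qk_clip_step(W_q, W_k, X, tau=100.0):
    """toy per-head qk-clip: if the largest attention logit seen this step exceeds tau,
    rescale the query/key projections so the max logit comes back down to tau.

    W_q, W_k: (d_model, d_head) projections for one head
    X:        (seq_len, d_model) activations from the current batch
    tau:      logit cap (value here is made up)
    """
    q = X @ W_q
    k = X @ W_k
    logits = (q @ k.T) / np.sqrt(W_q.shape[1])
    s_max = logits.max()
    if s_max > tau:
        gamma = tau / s_max
        W_q *= np.sqrt(gamma)   # split the factor between q and k so their product scales by gamma
        W_k *= np.sqrt(gamma)
    return s_max
```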
trained on 15 trillion tokens uh trillion total parameters 32 billion active parameters very sparse 00:03:30.400 |
um, and then the big thing here is zero loss spikes, which is kind of crazy. um, during post-training there's a multi-stage 00:03:38.320 |
process — they do a lot of synthetic data work. another fun thing i would recommend about this 00:03:44.160 |
paper for people is they do good citations into other work, so if you're interested in different 00:03:50.560 |
um like synthetic data generation pipelines they they cite quite a few cool papers there's a few in there 00:03:57.440 |
that are like good citations i'm gonna follow up and read uh estimates on the compute used in training 00:04:03.280 |
this yeah i think they actually say how much compute they used how much how much training compute uh how 00:04:08.240 |
many gpus and stuff they had now it doesn't that doesn't account for a lot of the experiments they have a lot of 00:04:13.840 |
like ablations and stuff they have a lot of data generation i don't know if that necessarily ties 00:04:20.320 |
into how much they used for it right they share a decent amount of numbers but they don't share everything 00:04:25.600 |
um, yeah — so large synthetic data synthesis pipeline and a joint rl stage. honestly i found this 00:04:33.200 |
model a little confusing because everyone calls it not a reasoning model and like sure it's not reasoning 00:04:41.040 |
per se where it's like grpo like rollouts just like proper rl to think but they do have a lot of like 00:04:49.120 |
rl to think so i don't know where the line blurs and i think this is a discussion we can have later and 00:04:54.800 |
i'd love to hear people's thoughts but like i would semi call this a reasoning model like it it reasons 00:05:00.240 |
and it's trying to reason with a like similar grpo with with some um with some changes but yeah it's a 00:05:08.000 |
state-of-the-art very very good coding model you can plug it into cloud code and stuff um 00:05:13.120 |
so swe-bench verified 65.8 — it's like right up there with opus, o3, all that stuff 00:05:19.600 |
um and it's very sparse you know it's it's sparse it should be cheap for people that 00:05:24.640 |
host it for inference um on everything else it's pretty good it's very open source they give base 00:05:30.080 |
model and post train checkpoints um it's nice they do base model evals as well what's the difference 00:05:36.640 |
between a reasoning and a thinking model? yeah, that's a good question — mostly we don't call stuff 00:05:42.880 |
thinking models, we just have reasoning models and non-reasoning models. in this case i would consider 00:05:48.560 |
this thing to be like a reasoning model, but yeah, i guess there's just think tokens — they just take out 00:05:53.680 |
the think tokens and they just let it kind of reason um performance pretty good i think we can cover 00:06:01.920 |
charts later let's just get through the paper for now but if anyone's interested in diving on any of 00:06:06.800 |
these we can we can kind of look at it um but yeah you know like coding benchmark basically up there with 00:06:14.080 |
opus um the one that just beat this recently so like the the cool thing with this paper is 00:06:19.440 |
it beats deepseek across the board on everything. um, as of a week later, qwen — qwen reasoning, 00:06:29.200 |
sorry, qwen coder — beats this at coding, which is cool, so you know, stuff is moving fast. uh, they basically 00:06:36.960 |
reran a bunch of benchmarks, except for opus — for opus they just took their numbers. i think openhands 00:06:43.360 |
helped them with swe-bench verified, so shout out to them at the end. but, um, opus was expensive to 00:06:49.360 |
benchmark it's it's crazy how expensive benchmarking is so but good for them to do it uh basically in 00:06:55.120 |
the introduction they go a little like um you know broader and they want to explain what are these terms 00:07:01.600 |
so they have this um they have this like vision of agentic intelligence right uh models are no longer 00:07:08.800 |
static like they don't just do next token prediction they're they need to be agents so they need to like 00:07:15.040 |
evolve from static understanding of what they know and they need to be able to plan reason act in 00:07:21.280 |
environments use tools and stuff so a lot of what they want to do and like you know dynamically adapt to 00:07:27.360 |
these, um, environments. so basically they want to train a model that's really really good at tool 00:07:35.600 |
use, so, um, they train it on a lot of environments and agentic tool use — they generate a 00:07:43.840 |
lot of these tools — and they want it to acquire skills beyond the training distribution and adapt behavior through 00:07:49.200 |
experiences so like this is kind of like the motivation for why this exists uh you know this thing should be 00:07:55.360 |
be able to pick up tools really good and like when i mentioned there's good citations of papers for 00:08:00.160 |
like synthetic data gen they also have stuff that's like here's what we were influenced by in tool use 00:08:06.720 |
papers here's like four papers that do really good tool use and here's how we build upon that so 00:08:11.520 |
you know if you're interested in tool use stuff you should probably read the other four papers 00:08:15.120 |
um okay: achieving agentic intelligence introduces challenges in both pre-training and post-training. 00:08:23.200 |
um, pre-training must endow models with general-purpose priors and token-efficient learning — effective learning signal 00:08:32.240 |
per token; post-training must transform those priors into actionable behaviors. yet agentic 00:08:42.000 |
capabilities such as multi-step reasoning long-term planning and tool use are rare in natural data and 00:08:47.920 |
costly to scale so what this is saying is basically um you need a decent enough base model to do all 00:08:55.520 |
this stuff and for post-training like this data is not really existent and it's like hard to scale right 00:09:02.800 |
so like we don't really just have a lot of multi-step reasoning long-term planning tool use data so um you 00:09:09.120 |
know they do a lot of scalable synthetic data gen uh okay paper uh model big big model trillion parameter 00:09:16.640 |
moe with 32 billion active parameters there's there's a few major major things here uh muon clip which is 00:09:23.600 |
their new novel optimizer — it's basically better than, um, adamw. this was introduced quite a while ago, but it 00:09:33.840 |
was trained at small scale and it wasn't really scaling very well, so in this paper they show 00:09:41.040 |
scaling up this optimizer, and they were having, uh, exploding attention logit issues. so the way they solve 00:09:48.000 |
it is this kind of, um, stability-enhancing qk-clip, where they're clipping the attention 00:09:56.400 |
logits — rescaling the query/key weights when logits get too large. so from there, um, it's kind of crazy, they don't have a single loss spike — 00:10:02.880 |
very very beautiful training curve okay then they have a lot of stuff on their data gen so a large 00:10:10.320 |
scale agentic data synthesis pipeline i thought this was like kind of basic i'm surprised how much it just 00:10:17.440 |
works but um pretty cool stuff on synthetic data gen basically they generate a lot of tools tool 00:10:24.880 |
demonstrations tool descriptions and then they have environments for it to use this stuff and they have 00:10:30.640 |
like verifiers and in the middle step verifiers but it's it's like a cool pipeline that they use for 00:10:36.800 |
all this uh data gen why why do you think it was basic i mean we'll we'll go into how it is later and how 00:10:43.840 |
it's evaluated um i i thought it's like like the verification of some of this stuff isn't that crazy 00:10:51.280 |
you know yeah it's a it's a lot of rubric based i'm surprised it just kind of works you know um especially 00:10:59.120 |
at the scale and like how they achieve diversity like from the techniques they explain i wouldn't expect 00:11:06.640 |
the amount of diversity that makes this actually work out 00:11:11.200 |
but let's we'll dive into it in a bit you know yeah a consistent question i have for folks is that how 00:11:16.640 |
do you evaluate on non-verifiable reward like essentially things that are not code or software 00:11:21.040 |
and everyone's just saying rubric-based would work, and i think it's like an example of hey, you know, rubric- 00:11:25.360 |
based actually does work. uh yeah, i'm surprised that that's all you need. they talk about this 00:11:32.960 |
specifically so for stuff like creative writing and stuff um they you know they share how they how they 00:11:39.360 |
generate this pipeline is it true they used mcp tool architecture outputs yeah they basically have a 00:11:45.680 |
whole bunch of mcp stuff so they scrape like a thousand mcp tools and then they they generate 00:11:50.880 |
it even more so mcp is all you need in this um we'll get into it later and like actually they have a 00:11:56.400 |
scatter plot of different tools — um, see, a t-sne visualization of real mcp tools, um, colored by 00:12:06.480 |
their original sources so thousands of mcp tools of course they have crypto but um you know this is 00:12:13.600 |
their mcp tools then they have synthetic tools which are also you know in different domains i love how they 00:12:21.280 |
just have an unknown domain but they have crypto where it's it's some of the mcp tools but um 00:12:27.600 |
that's part of it okay okay continuing on um where were we 00:12:32.720 |
okay they also talk about their rl framework um it's rl with verifiable rewards this is what eugene 00:12:42.000 |
was talking about — it's not just code and math, they have it on, um, like 00:12:50.560 |
general domain so like creative writing and stuff okay so kind of interesting they talk a lot about 00:12:57.920 |
critiquing outputs instead of like grpo where you do rollouts and reward the best output um they just 00:13:06.160 |
have a lot of critiquing here which is kind of interesting i found their explanation of how they 00:13:11.680 |
set up this whole loop to be a little confusing so like their training loop it's a little odd but um 00:13:17.440 |
i don't know maybe someone else had a better intuition while reading it or broke it down or 00:13:20.960 |
looked into it more but uh they they do share a lot about that okay pre-training 15 trillion tokens 00:13:27.280 |
um they talk a lot about token efficiency and how it's important basically rl is very token efficient 00:13:33.280 |
right so guess what do rl um they they talk about pre-training techniques so uh uh effectively 00:13:41.440 |
this muon optimizer it's very very much a token efficiency thing as well then they have synthetic 00:13:47.200 |
data gen um of course their ultra sparse moe is efficient uh it's mla so multi-head latent attention 00:13:55.440 |
which was derived from deep seek and then they make some modifications they actually show scaling laws for 00:14:02.240 |
how sparse you want moes there's charts of this later uh muon clip so this is the optimizer and 00:14:09.280 |
what they do um i was thinking we actually just skip this for now since we basically did this last week 00:14:17.200 |
but if there's time i think we come back um let's see if there's anything basic to say about it so you 00:14:24.000 |
you know outperforms adam w better token efficiency um leads to training stability 00:14:31.440 |
constrain attention logits — so attention was having issues, we constrain it: apply clipping to unshared 00:14:39.920 |
attention heads. yeah, i think we go over this later — let's just go through the paper first, unless anyone 00:14:45.280 |
wants to ask a specific question or eugene do you want to say anything about it okay let's just move on since we 00:14:52.560 |
basically did it. so if you're interested, um, we'll maybe cover it at the end, or check last week's 00:14:58.160 |
thing okay uh pre-training data improving token utility with rephrasing it's interesting how much 00:15:04.240 |
they just did rephrasing like they show cool little experiments here of um rephrasing versus uh repeating 00:15:12.240 |
for multiple epochs and how their their strategy is kind of better so token efficiency refers to how much 00:15:19.440 |
performance improvement you can get out of each token consumed during training increasing token 00:15:25.040 |
utility the effective learning signal for each token contributes is pretty important right so that's 00:15:31.520 |
something that they want to focus on naive approach is basically just do more epochs right um eugene 00:15:39.040 |
question, comment? you're muted. oh sorry, not a question — i think you briefly glanced at the table; we will go 00:15:47.200 |
through it. uh, when we go through the table — i think the table is absolutely insane, where the delta between 00:15:51.760 |
the second row and the first row essentially explains almost all of the gain. um, but we can talk about it 00:15:57.200 |
when you get here yeah yeah it's interesting um so basically like you guys know what multiple epochs are right so 00:16:04.560 |
you you you take your let's say you have a thousand samples you train for 10 epochs you train on the 00:16:09.200 |
same data 10 times um so that's one naive approach another so they compare to that um you know you do 00:16:17.920 |
get some gains from doing multiple epochs and it's somewhat common for some some data like for your base 00:16:23.920 |
pre-training we don't do multiple epochs anymore — we used to. um, for some higher quality data, like some 00:16:30.160 |
high quality reasoning, we'll do it for multiple epochs and do the same data multiple 00:16:35.120 |
times but here's kind of what they do um a key advancement is synthetic data gen to increase token 00:16:42.560 |
utility so they have this core rephrasing pipeline um basically instead of just train on the same same 00:16:49.680 |
data multiple times let's kind of rephrase it a little bit there's two two domain specialized 00:16:54.880 |
rephrasing techniques that they use for knowledge and math so knowledge data rephrasing a single epoch 00:17:01.600 |
is insufficient for comprehensive knowledge absorption while multi-epoch repetition yields diminishing 00:17:06.880 |
returns in the risk of and increases the risk of overfitting so um we'll get to a chart in a bit but 00:17:12.240 |
let's talk about what they do so first is style and perspective diversity prompting so uh prompting guide 00:17:18.880 |
is where you take an llm to generate, uh, faithful rephrasings of the original text in various 00:17:25.840 |
styles from different perspectives. then there's, uh, chunk-wise autoregressive generation — basically, if 00:17:32.160 |
stuff is very long, just do it in chunks: rephrase segment by segment, with an autoregressive 00:17:39.200 |
chunked fix-up. so divide text into segments, rephrase them individually, stitch them back 00:17:45.440 |
together and fix that there's there's like a little pipeline on there's a little diagram on this later 00:17:51.680 |
um, to ensure consistency between original and rewritten content, there's fidelity checks that compare the 00:17:59.360 |
semantic alignment of the rephrased passage with its source. um, yeah, cool — so basically they do all this. 00:18:05.440 |
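a minimal sketch of that rephrase-then-verify loop — `call_llm` and `fidelity_judge` are stand-ins for whatever model and check they actually use, and the styles and threshold are made up:

```python
STYLES = ["as a lecture note", "as a q&a", "from a practitioner's perspective"]

def rephrase_with_fidelity_check(passage, call_llm, fidelity_judge, min_score=0.8):
    """generate stylistically varied rephrasings of a passage and keep only the ones
    whose semantic alignment with the source passes a fidelity check."""
    kept = []
    for style in STYLES:
        rewrite = call_llm(f"rephrase the following text faithfully, {style}:\n\n{passage}")
        if fidelity_judge(passage, rewrite) >= min_score:   # 0-1 semantic-alignment score
            kept.append(rewrite)
    return kept
```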
then they run a little experiment so let's compare how this performs in the set in the sense of token 00:18:11.920 |
efficiency so there's three three things that they do one is you take your original data set you train 00:18:17.600 |
on it for 10 epochs uh risks here are you know you might overfit your data so you might overfit and it 00:18:25.120 |
might just you're not really adding more quality you're just doing the same thing multiple times 00:18:28.560 |
two is rephrase the data once and repeat it for 10 epochs so basically you take all your data 00:18:35.680 |
rephrase it in a different sense and then you train for 10 epochs uh three is rephrase the data 10 times 00:18:42.640 |
with a single pass of training and here's kind of the results um we extend this method towards larger 00:18:49.120 |
stuff and you know each corpora is rephrased at most twice so 10 epochs on same data they they test it on 00:18:56.880 |
simple qa they get 23. just rephrasing it once and training it for uh sorry rephrasing it once and 00:19:03.440 |
training it for 10 epochs gets you quite a bit of improvement then there's rephrasing it 10 times and 00:19:09.120 |
doing one epoch you still get a little bit more improvement but you know the main difference like 00:19:13.280 |
eugene said is just here rephrasing makes quite a difference am i reading this right when actually the 00:19:19.760 |
first row and second row each of them are 10 epochs but the difference between the second from the first 00:19:23.840 |
row is essentially the second row actually does data cleanup by rephrasing it it's not data cleanup it's 00:19:29.680 |
data augmentation so i think they keep both the original data and rephrase oh i see so they keep 00:19:36.960 |
both okay i was reading it wrong i thought the first one was raw 10 epochs the second one is only rephrased 00:19:44.080 |
10 epochs um we could you could be right here yeah because yeah i read it so what eugene is saying is 00:19:54.080 |
you know in the the the benefit just comes from rephrasing what i read this as is keeping the original 00:20:01.280 |
and rephrasing it i think i'm right because if you look at the paragraph it says rephrasing the data once 00:20:07.040 |
repeating it which i think as it as the rephrase data if you look at the third line which is uh so 00:20:14.480 |
that is actually cleaning up the data to be better 00:20:17.840 |
that's all the juice uh so that's pretty that's pretty mind-blowing to me that someone they did this 00:20:24.000 |
ablation so that's really nice yeah beautiful little um ablation basically rephrasing you get so much data 00:20:32.480 |
quality improvement but the reason i interpret it the way that i did is because um you know i thought 00:20:39.280 |
that 10 epochs of the same data is basically just the same data, um, and adding in another look and 00:20:47.440 |
doing it for ten epochs gets you better performance. but i think we'll dig into it later — somewhere at the 00:20:52.000 |
bottom they have um they have a lot of this in the appendix i wouldn't be surprised if they talk more 00:20:57.680 |
about it well we'll have to check um this is our kind of chunking for long context diagram basically 00:21:05.520 |
you know chunk it rewrite and then merge it and check but nothing crazy there i had a question on 00:21:11.520 |
this actually this diagram um i'm thinking a lot of time but uh in the third column let's say we 00:21:20.800 |
rephrase partial input one what does it mean to be auto regressive 00:21:24.800 |
oh like you're adding in the first chunk and the second chunk like you're using the first chunk 00:21:34.880 |
to write the second chunk i think so um here chunk based auto regressive strategy texts are divided into 00:21:42.320 |
segments rephrased individually then stitched back together to form complete passages this method 00:21:48.640 |
mitigates implicit output length limitations um the auto regressive part i think is let's see 00:21:55.440 |
so so when i read the paragraph i read it as okay we've split in the paragraphs you rephrase it and we 00:22:00.800 |
just smash it back together but i realized that the first few phrases makes a lot of difference to 00:22:07.200 |
maintain global coherence i think what they're doing here is they're rephrasing the first paragraph 00:22:14.240 |
then they take the new first paragraph and the raw second paragraph and 00:22:14.240 |
ask it to rewrite the second paragraph and i think by doing this it keeps the style of the first paragraph 00:22:24.480 |
but i want to make sure i'm understanding this right. that's how i'm interpreting it too — uh, the input 00:22:30.080 |
is split into small chunks with preserved context, rewritten sequentially, then concatenated into a fully 00:22:35.760 |
rewritten passage. but, um, it could be the other way too, which wouldn't be as good — if it's the other way then it 00:22:41.760 |
doesn't have to be sequential right if it's there if you just do everything you can just be parallel the 00:22:47.920 |
fact that is i yeah anyone also the graph the graph also shows they are feeding the the previous segment 00:22:56.800 |
into the rewrite step so so like the second rewrite model has like two inputs it has the like original 00:23:04.240 |
segment and then also the rewritten previous segment i think so so that's it if we are rewriting the last 00:23:11.520 |
chunk do we pass in everything or do we just need to pass in the last chunk minus one and this is just 00:23:18.320 |
last minus one right you can see it yeah it looks like partial except two right yeah so okay i think 00:23:24.240 |
one one one's not going into three but one is kind of already seen this is extremely valuable uh this 00:23:31.120 |
this diagram itself says a lot about how you can process long context data uh so i think you could 00:23:38.080 |
do better, really, basically. yeah, that's just like — yeah, you could pass in multiple chunks, you could do a 00:23:43.680 |
better long context rewrite; like, this is not super efficient. yeah, you're right, it's not super efficient. 00:23:50.400 |
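that reading of the diagram, in code: each chunk is rewritten conditioned only on the previously rewritten chunk, then everything is stitched back together. `call_llm` is a placeholder — this is a sketch of the idea, not their pipeline:

```python
def chunkwise_autoregressive_rewrite(chunks, call_llm):
    """rewrite a long document chunk by chunk, conditioning each rewrite on the
    previously rewritten chunk (only the previous one) to keep the style coherent,
    then stitch everything back together."""
    rewritten, prev = [], ""
    for chunk in chunks:
        prompt = ("you are rephrasing a long document one chunk at a time.\n"
                  f"previously rewritten chunk (match its style):\n{prev}\n\n"
                  f"rewrite this chunk faithfully:\n{chunk}")
        prev = call_llm(prompt)
        rewritten.append(prev)
    return "\n".join(rewritten)
```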
yeah yeah it also depends on how long context documents you have right so like at a certain 00:23:56.000 |
yeah so i mean there's interesting stuff there but the the interesting note is the majority of the 00:24:02.240 |
training was done at pretty short context actually um exactly i was so surprised 4k context only so but 00:24:08.800 |
okay i'll let you get it there i'll stop interrupting i mean no no it makes sense right like it's a 00:24:13.520 |
trillion tokens it's a trillion sorry trillion parameters it's a big model like um that context 00:24:20.240 |
really adds up so they do context length extension later i also just don't think that we have 15 00:24:26.720 |
trillion like even trillions of tokens of really long context right like that just doesn't exist um and 00:24:33.280 |
it's so much more exponentially expensive to train at long context um okay so uh basically they rewrite 00:24:42.000 |
high quality math into learning note style so instead of just question answers you know here's a learning 00:24:48.320 |
note of how i would solve this. uh, translate high quality math into other languages — this is standard, 00:24:54.240 |
seen in other work as well okay overall they have 15 trillion tokens i thought this was interesting 00:24:59.680 |
they have web text code math and knowledge um i don't know what knowledge is but they have knowledge 00:25:04.720 |
for each domain we perform rigorous correctness and quality evals you know standard filtration stuff 00:25:12.560 |
a lot of rewriting though. okay, 32 billion active parameters, similar to deepseek v3, they use mla 00:25:18.560 |
um, as their attention, and then there's the hidden dimension and all of this. okay, so scaling law analysis reveals that 00:25:27.040 |
continued increase in sparsity yields substantial performance improvements which allowed us which 00:25:32.160 |
motivated us to increase the number of experts to 384 instead of 256 in v3 to reduce computational overhead 00:25:38.960 |
during inference we can we cut the attention heads to 64 instead of 128 so they do scaling laws to reveal that 00:25:47.360 |
um you know based on the number of active parameters you want increasing sparsity so um taking the same 00:25:56.880 |
number of active parameters but increasing total experts is very um is better performance uh and 00:26:04.960 |
they um so you know they increase experts and they cut the attention heads in half this helps with their 00:26:13.840 |
training stability so their new optimizer um had issues and they they cut attention heads in half 00:26:21.520 |
okay this is um you know similar to deep seek so more total parameters less active parameters more 00:26:28.800 |
experts less attention heads only one dense layer which they don't talk much about here's their sparsity 00:26:35.120 |
scaling law. um — oh, one sec, i'll be right back. 00:26:42.240 |
oh no oh no oh no that's mochi need to get out of the room or that's what you need to get into the room 00:26:47.920 |
what he has always has needs um mochi's fomo from paper club oh mochi's here 00:26:55.120 |
no she just needs papers what you can tell with us all right okay sparsity is the ratio of um total experts 00:27:05.520 |
compared to active experts uh 48 is their magic number but um basically what is this they have 00:27:23.120 |
and let's read through this. okay, sparsity is the ratio of the total number of experts compared to the number of 00:27:30.320 |
active experts. so this is the scaling law, basically, that they find: under a fixed number of active 00:27:36.000 |
parameters uh basically in this case the 32b active parameters that they want to set um increasing the 00:27:44.000 |
total number of experts — basically making it more sparse — consistently lowers both the 00:27:51.280 |
training and validation losses, thereby enhancing overall model performance. um, they do these experiments. 00:27:59.200 |
um increasing sparsity leads to better performance it also adds infrastructure complexity um there was a 00:28:09.760 |
semi-analysis paper on like why llama 4 failed and they talk a lot about moe stuff and their expert 00:28:17.920 |
choice routing versus token choice routing and all these stability issues and infrastructure and how they 00:28:23.120 |
get and how they changed halfway through but yeah when you make stuff more sparse and you have like 00:28:28.480 |
uh, stuff like that, then you need to, you know, deal with infrastructure complexity. 00:28:34.240 |
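to make the sparsity number they land on concrete (8 active out of 384 experts, i.e. sparsity 48, as they say next), here's a toy top-k softmax gating sketch — generic moe routing with made-up dimensions, not kimi's router:

```python
import numpy as np

N_EXPERTS, TOP_K = 384, 8                      # the numbers quoted for k2
print("sparsity =", N_EXPERTS // TOP_K)        # 384 / 8 = 48

def route_token(hidden, gate_w):
    """generic top-k softmax gating: score all experts, but only run the top k ffns."""
    scores = hidden @ gate_w                   # (n_experts,)
    top = np.argsort(scores)[-TOP_K:]          # indices of the 8 selected experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    return top, weights

rng = np.random.default_rng(0)
expert_ids, mix = route_token(rng.normal(size=16), rng.normal(size=(16, N_EXPERTS)))
print(expert_ids, mix.round(3))                # only these experts' parameters are "active" for this token
```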
so they have a sparsity of 48 achieving um activating eight out of 384 experts per forward pass um here's 00:28:44.320 |
kind of their scaling laws: increased sparsity, better performance. as for doubling the number of attention heads, 00:28:51.040 |
they also checked it didn't take much of a hit i think this is the next section uh frankie you have hand up 00:28:57.120 |
oh yeah i just want to ask um maybe more of an interpretive kind of uh question here so they 00:29:04.080 |
found that increasing moe is and decreasing number of heads um it is better so could the interpretation be 00:29:12.320 |
that, um, the attention heads are essentially kind of like mixing stuff from your context, and the moes are kind 00:29:18.800 |
of figuring out what you do with that mix so i think maybe it's saying that the expressivity of what you 00:29:25.760 |
should do with the mix is more important than the mixing itself that is you can kind of like 00:29:30.640 |
get raw stuff from your neighbors but you should be smarter about what you do with it 00:29:35.680 |
would that be a correct interpretation and it's kind of hard but what do you guys think 00:29:41.760 |
um two things so one the attention head is not actually it's not necessarily better to cut them 00:29:50.800 |
they did it for training stability so the the dots here are lower loss but double the attention head so 00:29:57.360 |
um more attention heads actually helps performance it's just not much like they say it's a doubling the 00:30:05.200 |
number of attention heads led to a reduction in validation loss of 0.5 to 1.2 percent so it's still 00:30:11.600 |
better it's not worse to have more attention heads it's just a lot more um it's a lot more complexity so 00:30:18.000 |
so here's a here's a cool quote so with the sequence length of 128k doubling the attention 00:30:23.680 |
heads from 64 to 128 while keeping the total expert count fixed leads to an 83 percent increase in 00:30:29.840 |
inference flops so you're nearly doubling the compute required for inference by doubling your attention heads 00:30:38.720 |
but the performance increase is very small and uh it it had instability issues with their optimizer 00:30:47.040 |
this doesn't mean that um adam w with more attention heads isn't stable but you know it's 00:30:53.200 |
it's a big uh compute cost so something to note there regarding the analogy i think i need to read uh hear 00:31:01.520 |
it again i don't know if someone else wants to comment or maybe you want to post it in the chat 00:31:06.320 |
but i i didn't really digest and if someone else wants to comment you know pop in i wanted to ask 00:31:12.640 |
is it more of the experts or the attention heads that we're you know being optimized for here 00:31:20.560 |
this is both um so they're keeping it sparse by uh having a lot of experts but only a few active 00:31:30.400 |
that's one part of the equation that's in their scaling laws better performance so 00:31:34.320 |
if you have a set of if you have a fixed inference budget like let's say i want to run this thing at 32 00:31:40.480 |
billion active parameters uh you'll do better by having more total experts so that's part of it 00:31:47.040 |
the that's for performance right that will get you better overall performance even though it has 00:31:53.360 |
training complexity. the attention heads is, uh, an inference optimization — so you know, when you cut attention heads 00:32:01.920 |
um, you have less kv, right? each attention head basically has its own k and v, and over long sequences 00:32:10.320 |
attention is quadratic scaling and that's a lot of gpu utilization. if you can cut the heads in half you can kind of 00:32:17.760 |
cut half of your kv cache, which is a lot of gpu usage. 00:32:24.480 |
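a back-of-the-envelope version of that argument, using plain multi-head-attention kv-cache arithmetic with made-up layer/head dimensions — k2 actually uses mla, so its cache layout differs, but the scaling point is the same:

```python
def kv_cache_gib(n_layers, n_heads, head_dim, seq_len, bytes_per_val=2):
    """plain multi-head-attention kv cache: K and V per layer, per head, per token (bf16)."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_val / 2**30

for heads in (128, 64):
    print(f"{heads} heads -> {kv_cache_gib(61, heads, 128, 128_000):.0f} GiB per 128k-token sequence")
# halving the head count halves the cache (and a big chunk of attention flops) at a given context length
```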
uh, eugene, it seems like you have good intuition — you wanna intuit? no, i don't know if i have good intuition, but my thinking is that 00:32:30.320 |
attention heads are just like an ensemble, and beyond a certain number — they found that, you know, from 00:32:36.480 |
64 to 128 — they say that yes, while it does improve validation loss, um, like it's over there, uh, only 00:32:44.240 |
validation loss improved by half a percent to 1.2 percent — and okay, given that fixed amount of compute, they'd rather 00:32:50.800 |
not double; they'd rather halve the attention heads and have more experts. so it's a compute trade-off, which 00:32:57.120 |
makes sense to me i think i think that makes sense no no i think that makes sense the difference here is 00:33:02.480 |
also just um the architecture right so we're operating on trillion parameter and 384 experts now does this 00:33:12.800 |
scale to dense models does this scale to a small 24b who knows i think that's um you know we'll have to 00:33:20.400 |
figure that out but um it's it's verified in their case i don't know if you can take this at face value 00:33:27.680 |
for for other architectures and other sizes um but you know i highlighted this in green it's important 00:33:35.040 |
very very useful cutting attention head saves a lot of compute at long context but once again attention is 00:33:41.680 |
a problem at long context this is like my one concern for like uh transformer alternatives and stuff 00:33:50.000 |
uh you guys really need to train at long context right because the benefit of your like getting rid 00:33:56.480 |
of attention and swapping it in with something like recurrence states that's cool but it really only makes 00:34:02.320 |
a difference at long context at short context it's not that deep and you know train at long context even 00:34:09.040 |
though you're not very gpu rich do it anyway okay uh training infrastructure this was cool if you care 00:34:16.560 |
about trade restrictions so basically nvidia didn't allow h100 h200 exports but china had the custom h800 00:34:25.440 |
gpu. um, then this is their training node — so, uh, each node in the h800 cluster contains two terabytes of 00:34:34.720 |
ram and eight gpus connected with nvlink and nvswitch within nodes; across different nodes, 00:34:41.840 |
um, 8×400 gbps roce interconnect. they talk about stuff with like pcie offloading, saving to disk, uh, weight 00:34:52.480 |
optimization so basically the thing that magistral did was really efficient to just keep rollouts 24/7 00:35:00.640 |
uh with this scale of trillion parameters they they did a lot more offloading to dram and stuff um 00:35:06.880 |
apparently h200s are coming to china soon nvidia is winning um actually like there was a post about 00:35:14.480 |
this where everyone was like i don't know why deep seek had such a big impact on like nvidia stock because 00:35:22.800 |
guess what h800 is still nvidia like it's it's all nvidia at the end of the day but um what's she 00:35:30.240 |
trying to jump off table but you stay this is mochi mochi's the non-expert doggo but she's learning to be 00:35:39.120 |
expert okay um parallelism for model scaling i don't know if we have that much time left for this so maybe i go 00:35:48.640 |
through this section quickly and we come back if we have time um i like this i nerded out over this in 00:35:56.560 |
the magistral paper uh do the rl part i think the rl and data gen stuff is more fun watch you come here 00:36:03.520 |
okay um i was gonna read a lot of this but let's let's skip it for now um basically there's three things 00:36:12.000 |
they do uh combination of 16-way pipeline parallelism 16-way expert parallelism and zero one data parallel 00:36:20.640 |
a lot of this is just like if you're ever interested in how big models are trained you should read this 00:36:25.680 |
section, it's kind of cool. here's what they have to go over: activation recomputation (stuff like swiglu), okay, fp8 for 00:36:32.560 |
the more intensive ops, but what else — cpu offloading. training recipe — this was interesting. so, training 00:36:38.880 |
recipe um they mostly train at 4 000 token context window using their new optimizer um total of 15 00:36:48.800 |
trillion tokens the first 10 trillion tokens were you know standard learning rate small warm-up then the 00:36:55.680 |
last 5.5 trillion tokens they had cosine decay; weight decay was set; global batch size was pretty large — 67 million 00:37:04.080 |
tokens per batch, crazy. uh, figure four — is there a figure four that i skipped over? these figures were very odd. 00:37:14.000 |
um the rewrite takes both previous and the current segment okay that's that's relevant right now okay 00:37:19.520 |
so uh most of the training was done at short context then towards the end they had an annealing phase 00:37:25.760 |
followed by a long context activation stage — batch size still 67 million tokens; uh, learning rate was 00:37:32.960 |
decayed quite a bit more in this phase we trained 400 billion tokens with a 4k sequence length followed 00:37:39.600 |
by an additional 60 billion tokens on 32k sequence then they extend with yarn so basically like you know 00:37:47.360 |
60 billion out of 15 trillion tokens are at 32k sequence um very very much like the majority of this training is 00:37:55.680 |
just done at 4k context post training um muon is still used in post training if anyone's fine tuning 00:38:02.560 |
kimi k2, you should also still use this muon optimizer — don't use adamw, it's cooked. um, i think there's 00:38:10.480 |
a chance — so like, one of the co-founders of xai is jimmy ba, who worked on the adam optimizer, it's like his big 00:38:17.360 |
thing. um, what are the odds that these people get poached to xai because they like optimizers? um, who 00:38:25.280 |
else — elon is trying to poach karpathy again, but that's a side tangent. okay, uh, sft. so, um, there's an instruction 00:38:36.960 |
tuning data set, maximizing the probability of the target response, data gen pipelines for different tasks in 00:38:43.120 |
different domains. so they use kimi k1.5 and their in-house domain-specialized expert models to generate 00:38:49.680 |
candidate responses for various tasks followed by llm or human judges to perform automated query 00:38:55.120 |
evaluation and filtering uh for agentic data we create a data synthesis pipeline to teach models 00:39:01.440 |
tool use capabilities through multi-step interactive reasoning. this stuff is pretty interesting — um, this 00:39:06.720 |
is like really the nice meat of this paper i think um okay large-scale agentic data synthesis for 00:39:14.400 |
tool use um basically we we need llms to be able to autonomously figure out how to use unfamiliar tools 00:39:23.360 |
uh interact with external environments and iteratively refine their actions to reasoning 00:39:28.800 |
execution and error correction so here are some papers that you should check out about how to train uh tool 00:39:35.200 |
use and tool use evaluation — so toolllm, acebench — and then same thing with, um, synthetic data. so, um, 00:39:46.240 |
real world environments are rich but difficult to construct at scale — complexity, privacy, accessibility. recent work on 00:39:57.760 |
synthetic data gen agent instruct self-instruct and stable tool bench and zero search i think we covered 00:40:03.120 |
agent instruct and self-instruct uh two new ones that they talk about here that if you're interested in 00:40:10.080 |
synthetic data gen you should probably check out uh stable tool bench and zero search they show promise in 00:40:16.800 |
creating large-scale data training without relying on real-world interactions it's all synthetic they have a pipeline that simulates 00:40:26.720 |
real-world tool use scenarios at scale uh enabling the generation of tens of thousands of diverse and 00:40:33.280 |
high-quality training examples okay three stages in uh data synthesis pipeline tool spec generation so step 00:40:40.720 |
one is you know first construct a large repository of tools from both real-world tools and lm uh synthetic tools 00:40:49.120 |
agent and task generation: for each tool set sampled from the tool repository — so they have like thousands of tools, by the way — 00:40:56.080 |
generate an agent to use that tool set and some corresponding tasks. then trajectory generation: 00:41:03.760 |
for each agent and task generate trajectories where the agent finishes a task by invoking tools 00:41:10.320 |
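a skeleton of those three stages as i read them — every callable here (`call_llm`, `rollout`, `judge`) is a placeholder and the prompts are invented; it's a sketch of the shape of the pipeline, not their code:

```python
import random
from dataclasses import dataclass, field

@dataclass
class ToolSpec:
    name: str
    description: str
    source: str                      # "mcp" (scraped) or "synthetic" (evolved from a domain)

@dataclass
class AgentConfig:
    system_prompt: str
    tools: list                      # list of ToolSpec
    tasks: list = field(default_factory=list)   # each task dict carries a rubric

def sample_tool_subsets(repo, k=3, n=100, seed=0):
    rng = random.Random(seed)
    return [rng.sample(repo, k) for _ in range(n)]

def build_agentic_dataset(tool_repo, call_llm, rollout, judge):
    """stage 1 (the tool repo) is assumed already built: real mcp tools + synthetic tools.
    stage 2: sample tool subsets, generate an agent plus rubric-bearing tasks.
    stage 3: roll out trajectories in a simulated environment, keep only judge-approved ones."""
    dataset = []
    for tools in sample_tool_subsets(tool_repo):
        names = [t.name for t in tools]
        agent = AgentConfig(
            system_prompt=call_llm(f"write a system prompt for an agent that uses these tools: {names}"),
            tools=tools)
        agent.tasks = [{
            "task":   call_llm(f"generate a task solvable with tools {names}"),
            "rubric": call_llm("write explicit success criteria and expected tool-use pattern for that task"),
        }]
        for task in agent.tasks:
            trajectory = rollout(agent, task)          # multi-turn sim: user persona + tool simulator
            if judge(trajectory, task["rubric"]):      # llm judge against the rubric
                dataset.append(trajectory)
    return dataset
```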
um, here are some of those tools. so they have a bunch of real mcp tools, and they kind of distinguish them — 00:41:18.640 |
they kind of filter them out by category: some search tools, database tools, uh, communication, r&d, 00:41:25.760 |
then here's their synthetic tools we'll go over this soon um first we directly fetched 3000 plus real mcp 00:41:37.120 |
tools from github repositories leveraging existing high quality tool specs uh the interesting thing with mcp 00:41:43.280 |
tools is like they don't really say how they verify that these are high quality tools right you can have 00:41:48.880 |
non-high quality mcps there's no like great standard over this but they don't mention much filtration over 00:41:55.120 |
this uh second we systematically evolve synthetic tools through a hierarchical domain generation process 00:42:02.560 |
we begin with, uh, key categories — so they basically have a categorizer 00:42:11.440 |
that categorizes different tools into different domains; they start with that. so, uh, then we evolve 00:42:18.640 |
multiple specific application domains within each category, and specialized tools are then synthesized 00:42:23.840 |
for each domain with clear interfaces description operational semantics so get a bunch of tools classify 00:42:30.720 |
them in categories um then evolve multiple specific application domains with each category then tools are 00:42:37.520 |
synthesized they produce over 20 000 synthetic tools across different domains um both mcp and synthetic 00:42:47.280 |
tools cover complementary regions okay agent diversification seems like generating all these 00:42:53.520 |
synthetic tool workflows would take a ton of compute yeah there's a lot of inference that goes under this 00:42:57.920 |
stuff um okay agent diversification so we just casually generate thousands of distinct agents they 00:43:05.680 |
synthesize various system prompts and equip them with different combinations of tools from our repository 00:43:11.360 |
when i said that this is kind of basic eugene this is kind of what i meant like i didn't expect stuff 00:43:18.480 |
like this to just work like sure you can get a bunch of tools um generate different categories and expand 00:43:25.840 |
them out i didn't expect diversity in these 20 000 tools but this is what i really didn't understand 00:43:31.840 |
they just generate thousands of distinct agents with different combinations of tools from their repository 00:43:37.840 |
and you know from this they have rubrics of different stuff that they should be able to solve and all 00:43:42.640 |
this kind of just work they have varied capabilities areas of expertise behavior patterns ensuring a broad 00:43:48.240 |
range of uh potential use cases and then before you know this is the last section for each agent 00:43:54.480 |
config we generate tasks that range from simple to complex operations each task is paired with an 00:44:00.400 |
explicit rubric that specifies success criteria expected tool use patterns i don't know how they just 00:44:06.640 |
generate tool use patterns because if you can do inference to you know figure out what are the tool uses 00:44:13.440 |
you kind of don't have the best quality data right because like you can kind of solve the task i guess 00:44:18.880 |
but like i get matching a task with tools but not tool use pattern rubrics uh and then evaluation 00:44:26.800 |
checkpoints rubrics ensure consistent and objective evaluation of agentic performance um this is kind of 00:44:33.840 |
interesting you know yeah i get that i think um i think of this as like a huge synthetic data etl pipeline 00:44:43.840 |
where you could be creating a hundred traces and through your quality evaluation and filtering 00:44:48.960 |
we don't know to what extent how strict they do it but they could be filtering only to the best 10 00:44:53.680 |
or the best five even they talk about their filtration actually a bit like in some stuff like in in 00:45:00.720 |
magistral, they talked about how they get rid of stuff that mistral large could solve. they have 00:45:06.640 |
something similar here — yeah, they have something whereby they actually use an sft 00:45:11.440 |
model, and only when the sft model is not able to solve it do they keep it. 00:45:16.320 |
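a sketch of that difficulty filter as eugene describes it — `sft_solve` and `verify` are placeholders:

```python
def keep_only_hard_examples(candidates, sft_solve, verify):
    """difficulty filter: drop anything the current sft model already gets right,
    so further training only sees examples that still carry signal."""
    kept = []
    for ex in candidates:
        attempt = sft_solve(ex["prompt"])      # let the sft checkpoint try the task
        if not verify(attempt, ex):            # it fails -> the example is still worth keeping
            kept.append(ex)
    return kept
```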
uh so it's really interesting the data cleaning and data generation that they do yeah it's like 00:45:22.560 |
it's cool it's there i think this paper could be a bit worded better but it's still cool it's still like 00:45:28.240 |
nice that they that they explain all this okay so uh multi-turn trajectory generation they simulate 00:45:35.200 |
real tool use scenarios user simulation so basically um you know lms are personas with communication 00:45:42.560 |
styles like one might be moody one might be authoritative one might be stupid one might be like 00:45:47.680 |
phd level smart um and preferences they they have multi-turn dialogues with agents creating naturalistic 00:45:55.120 |
interaction patterns uh tool execution environments so there's a tool simulator so it's kind of cool 00:46:01.040 |
they talk about like their sandbox here um a sophisticated tool simulator functionally equivalent 00:46:08.000 |
to a world model — they have agi, i guess; they have a world model. they execute tools and provide realistic 00:46:13.280 |
feedback. the simulator maintains and updates state after each tool call, enabling complex multi-step interactions. so this 00:46:20.640 |
is like if you think of like web arena and stuff like that where you kind of have like a fake website and 00:46:25.120 |
you're shopping through uh instead you have a simulator that might have like tasks and as you make as you 00:46:32.160 |
interact and do stuff and use a tool, the simulator keeps state as you go. this is kind of cool — i wish 00:46:38.480 |
they had talked a bit more about this. um, it introduces controlled stochasticity to produce 00:46:43.120 |
varied outcomes, including success, partial results, and edge cases — so you know, you're actually changing an environment. 00:46:48.400 |
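a toy version of that kind of stateful, stochastic tool simulator — purely illustrative (the rates and return shapes are made up), not their world-model code:

```python
import random

class ToyToolSimulator:
    """keeps state across tool calls and injects controlled stochasticity,
    so rollouts see successes, partial results, and failures."""
    def __init__(self, seed=0, failure_rate=0.1, partial_rate=0.2):
        self.rng = random.Random(seed)
        self.state = {}                                   # persists across the whole trajectory
        self.failure_rate, self.partial_rate = failure_rate, partial_rate

    def call_tool(self, tool_name, args):
        roll = self.rng.random()
        if roll < self.failure_rate:                      # edge case: the tool just fails
            return {"status": "error", "message": f"{tool_name} failed"}
        self.state.setdefault(tool_name, []).append(args) # update state after each call
        if roll < self.failure_rate + self.partial_rate:  # partial success
            return {"status": "partial", "calls_so_far": len(self.state[tool_name])}
        return {"status": "ok", "calls_so_far": len(self.state[tool_name])}
```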
a lot of this is like gym motivated style um but you know very cool like they they have all of this um 00:46:54.720 |
an lm-based judge evaluates trajectories against task rubrics; only trajectories that meet success criteria 00:47:02.960 |
are retained for training, ensuring high quality data while still allowing natural variation in task 00:47:08.480 |
completion. um, interesting little section there. yeah — sorry, i jumped the gun; i thought the 00:47:16.400 |
hybrid approach uh which is what you're going to talk about nick is amazingly cool because 00:47:24.000 |
they fine-tune the judge on verifiable code, and then what they say is that from this verifiable 00:47:32.560 |
code the model is able to learn subjective eval actually wait am i looking at the right section 00:47:41.360 |
sorry guys what is confusion is this the section yeah i i had a note somewhere uh give me a second let me 00:47:49.440 |
check it i'm gonna cover this for now so uh yeah go ahead there's inherent limitation to simulation 00:47:55.840 |
fidelity we complement our simulated environments with real execution sandbox they talk about scraping a 00:48:01.760 |
lot of like real github issues — uh, they want real authentic stuff, right, so in coding and software 00:48:07.600 |
engineering, uh, they have execution sandboxes — these execution sandboxes execute actual code, interact with 00:48:14.720 |
genuine development environments, and provide ground truth feedback for objective metrics and test 00:48:18.880 |
suite pass rates. uh, this allows them to learn from the diversity of simulated scenarios and the authenticity of 00:48:25.200 |
real execution, significantly strengthening the practical agent capabilities. so in this hybrid 00:48:30.880 |
uh pipeline i don't think this is the section you're talking about eugene but you're right sorry it's the 00:48:35.520 |
section on the closed loop. yeah, we get to that when we get to that. okay, i think i'm gonna go a little 00:48:41.680 |
fast because we're running short on time but basically uh you know rejection sampling so get rid of the 00:48:47.520 |
ones that are easy — quality filtration. sft has demonstrated improvements for tool use. okay, rl: 00:48:54.160 |
um, better token efficiency than sft, this is understood. uh, they have a gym-like extensible framework that 00:49:02.560 |
facilitates rl across a wide range of scenarios uh tasks with verifiable rewards subjective preferences 00:49:10.160 |
such as creative writing open-ended question answering we introduce a self-critique reward in which models perform 00:49:16.240 |
pairwise comparisons to judge its own outputs it's like similar to grpo right grpo you output a bunch of 00:49:22.960 |
outputs and then you do preference over the group of outputs to what's the best uh this instead is you 00:49:30.640 |
know you have a self-critique over your output and you judge your own uh output so kind of interesting 00:49:37.600 |
on how they do rl for creative writing and open-ended questions uh verifiable rewards and gym so i think we can 00:49:43.840 |
skip some of this for math and stem you know pretty straightforward complex instruction following is 00:49:49.360 |
this what you're talking about eugene oh no slightly below okay uh we will get it yeah okay so for hybrid 00:49:57.680 |
rule verification there's um you know evaluation of code interpreters lm is a judge okay multi-sourced 00:50:04.640 |
instruction generation complex prompts faithfulness uh this was kind of cool they have this um there's 00:50:13.040 |
this framework of facts grounding — there's a sentence-level classifier judge that's prompted, that's, uh, 00:50:20.480 |
useful to perform automated verification. basically it detects sentences that make factual claims without 00:50:27.440 |
supporting evidence in the context, and they have this as a bit of the reward in rl. so, faithfulness — 00:50:34.400 |
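a minimal sketch of a sentence-level faithfulness check in that spirit — split the response into sentences and ask a judge whether each claim is supported by the context; `judge_llm`, the prompt, and the scoring are made up:

```python
import re

def faithfulness_reward(response, context, judge_llm):
    """fraction of sentences the judge considers supported by the context;
    unsupported claims pull the reward down."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sent in sentences:
        verdict = judge_llm(
            f"context:\n{context}\n\nsentence:\n{sent}\n\n"
            "does this sentence make a factual claim that is NOT supported by the context? "
            "answer 'unsupported' or 'ok'.")
        if "unsupported" not in verdict.lower():
            supported += 1
    return supported / len(sentences)
```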
and then coding and software engineering. um, again, this part is cool, right, in the sense that they rephrase the question but 00:50:44.400 |
they still care so much about, uh, factual verification of the responses — they are 00:50:50.880 |
putting so much effort into this, as well as the coding, as well as adversarial prompts for their 00:50:56.240 |
safety aspects as well safety and faithfulness yeah it's so interesting to see what are the key 00:51:03.200 |
things and there are some points where they say that the key challenge is faithfulness reducing 00:51:07.200 |
hallucinations and scaling this so essentially they actually write very clearly these are the 00:51:11.760 |
challenges we are facing right now um if you if you read between the lines yeah um this was cool 00:51:19.840 |
basically e2b style or daytona style sandboxing very very robust sandbox infrastructure uh so pr's and 00:51:26.880 |
issues from github to build software development environments that consist of user prompt issues and 00:51:31.680 |
executable unit tests — so real github stuff that has unit tests. uh, this environment was built on a 00:51:38.560 |
robust sandbox infrastructure powered by kubernetes for scalability and security it supports over 10 000 00:51:44.080 |
concurrent sandbox instances while with stable performance they just they just cranked that 00:51:50.080 |
out real quick safety they have uh you know attack model that generates adversarial prompts a target 00:51:56.880 |
model that produces responses and then a judge that determines if they could bypass the uh safety 00:52:03.280 |
mechanism each iteration is assessed with a rubric and then it's used in the reward uh 10k concurrent agents 00:52:11.040 |
it's crazy right okay beyond verification self-critique rubric reward so self-critique rubric reward is what 00:52:17.120 |
they did uh general rl with self-critique it evaluates its own outputs to generate preference signals 00:52:23.200 |
um it initializes critique ability in the sft stage since we're kind of out of time i'm gonna like really 00:52:30.960 |
really quickly do the rest of this and then and then we'll stay for discussion but i guess if anyone has 00:52:35.760 |
any questions um i guess now's your chance if you have to help or if eugene you want to add in maybe 00:52:40.800 |
i'll just focus on the last paragraph on page 12 here. um, if you read this, right, the critic model is 00:52:48.640 |
refined with verifiable signals — so they fine-tune the critic model with the verifiable signals, and they continuously 00:52:54.240 |
update this critic. this transfer learning process grounds subjective judgments, 00:53:04.160 |
allowing the performance gains from verifiable tasks to enhance the critic's judgment on complex tasks lacking 00:53:10.000 |
explicit rewards. that's crazy, in the sense that you can fine-tune — essentially you can bootstrap yourself 00:53:15.680 |
on verifiable reward to train a reward model for non-verifiable reward um how that transfers i don't 00:53:22.400 |
know i don't know what the intuition is on that but i think there's a lot in this paragraph here 00:53:26.400 |
on reward modeling and evaluation, and the critic just keeps getting better and better, right — they 00:53:33.520 |
update the critic as well, and you know, it recalibrates its evaluation standards in lockstep with the 00:53:41.280 |
evolving policy. so if you look at this, right — right now we are solving verifiable reward like code 00:53:45.920 |
and software, right, and the question we always have is: would this generalize to non-verifiable reward? i think 00:53:51.120 |
this paragraph here says that maybe it would, uh, in the sense that at least verifiable reward is 00:53:58.160 |
helping non-verifiable reward evaluation, so we start from there. 00:54:03.680 |
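roughly how i read the self-critique setup: sample several responses, have the model (as its own critic, guided by rubrics) do pairwise comparisons between them, and turn win-rates into the preference signal. `critic_llm` is a placeholder and this is a sketch, not their training loop:

```python
from itertools import combinations

def self_critique_scores(prompt, responses, critic_llm, rubric):
    """pairwise self-judging: each sampled response is scored by its win-rate
    against the other samples under the rubric; win-rates become the preference signal."""
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        verdict = critic_llm(
            f"rubric: {rubric}\nprompt: {prompt}\n\n"
            f"response A:\n{responses[i]}\n\nresponse B:\n{responses[j]}\n\n"
            "which response is better? answer 'A' or 'B'.")
        if verdict.strip().upper().startswith("A"):
            wins[i] += 1
        else:
            wins[j] += 1
    n_opponents = max(len(responses) - 1, 1)
    return [w / n_opponents for w in wins]
```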
very interesting. uh, the rl has slight modifications to the policy optimization — so they have their muon optimizer 00:54:17.040 |
several additions budget control so they say um properly distributed inference budget we enforce 00:54:25.280 |
so, um, they kind of shit on reasoning models — rl often results in a substantial increase in the length of 00:54:31.440 |
model-generated outputs. um, now even though that gives improvement in performance, these benefits often do 00:54:40.160 |
not justify its inference costs in non-reasoning domains so for general stuff you don't want this to 00:54:45.520 |
encourage the model to properly distribute its inference budget. first, we enforce a per-sample maximum 00:54:52.880 |
token budget throughout rl training, and the budget is determined based on the type of task — so 00:54:59.600 |
for some tasks you can reason a little more. responses that exceed this token budget are truncated and assigned a 00:55:05.520 |
penalty; this is all done in the formula there, check it out. um, it incentivizes the model to generate solutions within the 00:55:14.080 |
limit and enhances its token efficiency — concise and effective solutions. 00:55:21.760 |
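a tiny sketch of the shape of that budget penalty — the real thing is in their rl objective; the budgets and penalty value here are made up:

```python
BUDGETS = {"math": 4096, "general_chat": 1024}    # made-up numbers; the point is the budget depends on task type

def apply_token_budget(task_type, tokens, raw_reward, overrun_penalty=1.0):
    """truncate responses that blow their per-sample budget and subtract a penalty
    from the reward, so the policy learns to stay within the limit."""
    budget = BUDGETS.get(task_type, 1024)
    if len(tokens) <= budget:
        return tokens, raw_reward
    return tokens[:budget], raw_reward - overrun_penalty
```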
ptx loss — this just keeps it grounded in previous stuff. temperature decay: when it starts out you want high temperature so it can 00:55:27.840 |
explore different stuff then uh you tone it down rl infrastructure um i don't think we have time for a lot 00:55:36.560 |
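a rough sketch of those two tweaks (the budget numbers, penalty value, and decay schedule below are assumptions, not the paper's; the actual formula is in the report):

```python
# illustrative per-task token budgets with a truncation penalty, plus a temperature schedule
TOKEN_BUDGETS = {"chat": 512, "math": 2048, "agentic": 4096}     # per task type (made-up values)

def budgeted_reward(task_type: str, response_tokens: int,
                    base_reward: float, penalty: float = 1.0) -> float:
    # responses over budget are truncated and penalized, pushing the policy
    # toward concise solutions that land inside the limit
    if response_tokens > TOKEN_BUDGETS[task_type]:
        return base_reward - penalty
    return base_reward

def sampling_temperature(step: int, total_steps: int,
                         t_start: float = 1.0, t_end: float = 0.6) -> float:
    # high temperature early for exploration, decayed linearly as training proceeds
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + (t_end - t_start) * frac
```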
rl infrastructure: i don't think we have time for a lot of this. efficient engine swapping, efficient system startup, agentic rollout. the rl infrastructure supports 00:55:44.640 |
training on long-horizon multi-turn agentic tasks, which present distinct challenges, and they describe two strategies 00:55:51.760 |
to deploy heavy environments. okay, evals. it looks like all the evals are done at 4k context. yeah, sota, very 00:56:05.200 |
sota. i think that's enough. red teaming: they do some red teaming here. limitations: when dealing 00:56:13.280 |
with hard reasoning tasks or unclear tool definitions the model may generate excessive tokens, sometimes 00:56:18.160 |
leading to truncated outputs or incomplete tool calls. basically they have it in their rl that if it's 00:56:24.800 |
too long they just truncate and it's cooked, and that can lead to it wasting its context and then 00:56:32.080 |
cutting off mid tool call. additionally, performance may decline on certain tasks if tool use is 00:56:37.680 |
unnecessarily enabled. and when building a complete software project, the success rate of one-shot 00:56:41.920 |
prompting is not as good as using k2 in their agentic coding setup. yeah, that's the paper, sorry. 00:56:48.320 |
is anyone familiar with their rl formula? what rl variant is it based on? 00:56:54.800 |
let me check, i thought i knew this. i think they said it's based on kimi 1.5; i 00:57:01.040 |
haven't had a chance to go through that. okay yeah, it is 1.5, but what is kimi 1.5 based on? 00:57:09.440 |
it's similar: in the reward the formula takes the reward minus the mean reward, so you can go back to grpo. 00:57:17.280 |
yes, it has grpo in it, but it doesn't go through the partial sequence, right? grpo actually 00:57:23.680 |
goes through the partial sequence and this one doesn't, just the final reward. okay, thank you, it looked grpo-esque. 00:57:32.400 |
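for reference, the group-mean baseline being described looks roughly like this (a sketch of the general GRPO-style idea with only a final, sequence-level reward per rollout; not the exact K1.5/K2 objective):

```python
# advantage = reward minus the group mean, with no partial-sequence credit
from statistics import mean, pstdev

def group_advantages(rewards: list[float], normalize: bool = True) -> list[float]:
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize:
        std = pstdev(rewards) or 1.0        # avoid dividing by zero when all rewards match
        advantages = [a / std for a in advantages]
    return advantages

# e.g. four rollouts for one prompt, scored by a verifier or the critic
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```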
any other thoughts, questions, comments? so i was wondering, for the 10,000 agents, it seems a bit 00:57:42.800 |
excessive. is it the fact that all these agents are doing something a little bit different? 00:57:47.760 |
like, how does that work? do you mean the sandbox execution? no, given that you 00:57:55.920 |
have so many tools, right. oh, it's 10,000 concurrent sandboxes. oh okay, sorry, which i extrapolated to 00:58:02.480 |
10,000 agents running; it may be the same agent, maybe. i think no, i think this is just: when 00:58:07.920 |
you're training at a batch size of like 67 million or whatever, you need a way to run your verification 00:58:16.080 |
and stuff, right. you need to be able to let this thing call tools, and when you're doing 00:58:23.360 |
rollouts you're testing how well it can use tools and interact with the environment, so you have to keep 00:58:28.640 |
state for these environments. now you need a bunch of environments running, so this is 00:58:34.800 |
for that. it's not 10,000 agents swarming out to solve stuff. maybe more like figure 00:58:42.560 |
nine, right, where each dot is a different tool, is that correct? so basically the 00:58:48.800 |
synthetic tools, right. yeah, you just had it right there, the one with a bunch of colors. yeah, right there, so figure 00:58:54.880 |
nine. yes, figure nine. so in figure nine they create a lot of synthetic tools, and i was wondering, 00:59:00.720 |
does that diversity of different tools really help out that much? like, are some of these doing 00:59:07.840 |
redundant stuff? that's my question. yes, they're definitely doing redundant stuff, because i mean, 00:59:14.080 |
how many categories do you have here? you have 10,000 tools, and how many colors do we see? i see maybe 30 or 40 00:59:20.080 |
colors. so yes, tools are doing similar categories of stuff, but it's still useful: even if you have a tool 00:59:29.440 |
in the same category, it has different descriptions, different ways to use it, different 00:59:35.200 |
steps. and their rubric isn't only on full task completion, so they have 00:59:41.920 |
checks in the middle for partial completion; as long as it's using the right 00:59:47.600 |
tool and the right steps, that's still a good output. but i think their sandbox is basically just 00:59:56.320 |
keeping their environment online, allowing these tool calls and letting these different environments keep state. 01:00:02.800 |
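as a toy illustration of why you'd want thousands of concurrent sandboxes (purely a sketch; the environment and rollout APIs here are made-up names, not Kimi's infrastructure): each rollout needs its own stateful environment kept alive for the whole multi-turn episode.

```python
# toy sketch: many stateful tool-use rollouts in flight at once, one sandbox per rollout
import asyncio

class Sandbox:
    def __init__(self, env_id: int) -> None:
        self.env_id = env_id
        self.state: dict = {}                       # persists across tool calls in one episode

    async def call_tool(self, name: str, args: dict) -> str:
        await asyncio.sleep(0.01)                   # stand-in for real execution latency
        self.state[name] = args
        return f"result of {name}"

async def rollout(policy, task: dict, env_id: int) -> float:
    sandbox = Sandbox(env_id)
    for _ in range(task["max_turns"]):
        tool, args = policy.next_action(task, sandbox.state)
        await sandbox.call_tool(tool, args)
    return task["rubric"](sandbox.state)            # score the whole episode against its rubric

async def run_batch(policy, tasks: list[dict]) -> list[float]:
    # thousands of these can run concurrently, each keeping its own environment state
    return await asyncio.gather(*(rollout(policy, t, i) for i, t in enumerate(tasks)))
```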
i guess my question is, did they specifically want to add so much redundancy 01:00:10.160 |
because you want the model to better generalize to tool usage? because if i have this 01:00:16.880 |
diversity of stuff which does similar things, how do i know which tool to even pick? so i'm trying to 01:00:22.080 |
understand the reasoning there. yeah, they're predetermined. in their data gen 01:00:27.600 |
pipeline they create agents that have a specified set of tools they're allowed to use. so i think, one, yes, 01:00:36.480 |
you want diversity and being able to generalize and use different tools, but two, when they generate the 01:00:42.480 |
agents themselves they're given different subsets. yeah, i'll try to find the quote. you're 01:00:52.640 |
saying that they only present a certain number of tools, right, so that makes sense, they're not given 01:00:57.280 |
everything. yeah, gotcha. so, quoting the paper: we generate thousands of distinct agents by synthesizing various 01:01:04.000 |
system prompts and equipping them with different combinations of tools from our repository. so it's not 01:01:09.760 |
like they have access to all 10,000; you probably have an agent that does something and has access to 01:01:13.760 |
like four tools, with varied capabilities, areas of expertise, and behavioral patterns. 01:01:22.080 |
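a minimal sketch of that agent-synthesis step (illustrative only; the tool repository, sample sizes, and rubric fields are assumptions, not the report's exact pipeline):

```python
# each synthetic agent gets a system prompt plus a small sampled subset of tools,
# and tasks with explicit rubrics are generated for it
import random

TOOL_REPOSITORY = [f"tool_{i}" for i in range(10_000)]          # stand-in for the tool registry

def synthesize_agent(domain: str, n_tools: int = 4) -> dict:
    tools = random.sample(TOOL_REPOSITORY, n_tools)             # each agent sees only a few tools
    return {
        "system_prompt": f"You are a {domain} assistant with tools: {', '.join(tools)}.",
        "tools": tools,
    }

def synthesize_task(agent: dict, difficulty: str) -> dict:
    return {
        "agent": agent,
        "difficulty": difficulty,                               # tasks range from simple to complex
        "rubric": {
            "success_criteria": "final state satisfies the user request",
            "expected_tool_use": agent["tools"][:2],
            "evaluation_checkpoints": ["right tool selected", "intermediate result valid"],
        },
    }

agents = [synthesize_agent(d) for d in ("finance", "travel", "devops")]
tasks = [synthesize_task(a, diff) for a in agents for diff in ("simple", "complex")]
```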
yeah, thanks a lot for the amazing walkthrough. i just wanted to go back to the point about whether 01:01:32.160 |
this is a reasoning or non-reasoning model. i think this one is non-reasoning. one of the 01:01:39.120 |
gripes i have with the recent reasoning models is they take too long to give you the answer, 01:01:44.320 |
and there is an argument about whether we can justify that increase in cost and latency, 01:01:48.880 |
whether the increase in performance can justify the increased cost and latency. but for this model, if 01:01:54.400 |
you ask it to translate a document from english to french it will give you the french 01:01:59.920 |
translation right away; it will not think for 500 or a thousand tokens. 01:02:03.360 |
i think that's part of it. the other approach there is hybrid reasoning and non-reasoning models, 01:02:12.800 |
right, where you can turn reasoning on and off. so the claude models have thinking and 01:02:19.760 |
non-thinking modes, and smollm3 is a specific recipe to train a hybrid reasoning model: basically 01:02:27.920 |
in their rl stage with grpo they mix in a subset of reasoning and non-reasoning data 01:02:34.800 |
and reward them equally. but i do agree, i really think 01:02:42.320 |
they're kind of pushing the cost onto you; for basic stuff i don't want to pay extra tokens. and 01:02:48.560 |
the proper gauge for models right now is no longer price alone. before, 01:02:57.280 |
it used to be the price-to-intelligence ratio, how much do i pay for this level of intelligence; 01:03:02.720 |
it's very skewed now, right, a model can have intelligence that's very high but it could take 01:03:09.360 |
twice the cost in terms of the number of tokens to get there, the number of tokens to get that intelligence. 01:03:14.080 |
it's variable. what would be interesting now, i think, is if some benchmarks could include 01:03:20.880 |
how many tokens were used in reasoning for specific questions. i don't know a great way to 01:03:27.520 |
measure this, i haven't really thought about it too much, but you know what i mean: cost-to-performance 01:03:34.800 |
is no longer as relevant because your performance involves a variable amount of thinking for different 01:03:40.320 |
models. some models are very verbose in their thinking; the deepseek update basically got it 01:03:45.280 |
to think for twice as long, which could also mean charging you twice as much. but 01:03:51.360 |
on the other hand it's a significant performance increase, right, so are you 01:03:58.000 |
really gonna make that trade-off? i think in general, even though it was forced on people, 01:04:01.840 |
o1 was the first reasoning model and you didn't have a choice, right, it's not hybrid: 01:04:05.840 |
if you want reasoning you're gonna use full reasoning, and people were fine waiting an extra minute for 01:04:10.880 |
performance, and the costs always come down. but that's just high-level thinking, in my opinion; 01:04:18.640 |
i haven't given it much more thought. 01:04:24.720 |
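to make the point about price-to-intelligence being skewed by token count concrete, a tiny illustrative calculation (every price and token count below is invented, not real model pricing):

```python
# effective cost per answer depends on tokens spent, not just the per-token price
def cost_per_query(output_tokens: int, price_per_million_tokens: float) -> float:
    return output_tokens / 1_000_000 * price_per_million_tokens

concise_model  = cost_per_query(output_tokens=500,   price_per_million_tokens=2.0)   # $0.001
thinking_model = cost_per_query(output_tokens=3_000, price_per_million_tokens=2.0)   # $0.006

# same nominal price per token, but the thinking model costs 6x per answer here,
# which is why benchmarks arguably should also report tokens spent per question
print(concise_model, thinking_model)
```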
yeah, that's a good argument about cost. any other thoughts, comments, questions, anything i 01:04:32.400 |
missed in this paper? sorry, we yapped too much in the beginning, i didn't realize it would take so 01:04:37.280 |
long, i thought it would be a quick one. it's quite a dense paper, lots of stuff to go through. 01:04:46.480 |
yeah, i think their muonclip stuff is pretty interesting, but there are only a few 01:04:54.160 |
takeaways. they do some cool scaling laws, and basically they do individual attention-head-level 01:05:00.880 |
clipping during optimization, and that leads to really cool scaling. the bigger takeaway of 01:05:07.760 |
this is that it's very expensive for labs to do big training runs, and loss spikes are a big detriment, 01:05:17.200 |
because if something gets cooked all your expensive infrastructure is sitting useless and you 01:05:23.120 |
might have to roll back, restart, and reload weights, and that's a huge delay on a lot of gpus. so 01:05:29.760 |
stability in training is quite huge; that's the big takeaway. and check 01:05:38.480 |
out last week's paper, we kind of dove into the optimizer. frankie, you have your hand up? oh yeah, i think 01:05:44.240 |
just following up on the comment from eugene on that paragraph about the self-critique closed loop. i'm wondering, 01:05:51.280 |
again, how does it learn from giving subjective rewards? so i'm wondering, does it 01:05:58.400 |
have to do with the fact that they're critiquing on the verifiable signals? 01:06:03.440 |
there are things that are definitely verifiable, but because you have things which are not verifiable, 01:06:08.960 |
do you think that the model is kind of learning from the steps that were taken, kind 01:06:14.320 |
of like the chain of thought or whatever? learning from the steps, like if it looks like you're 01:06:18.640 |
doing good steps then i can reward you better, because you 01:06:24.800 |
only have a subjective reward. i just wanted to see how you guys thought about 01:06:29.200 |
that. yeah, kind of. in their rubrics they have criteria for potential steps, so given 01:06:42.800 |
this, let me try to find it, given this prompt or question and this set of tools, here, 01:06:53.200 |
i think this is it: combining rlvr with the self-critique rubric reward, the model learns not only 01:06:59.920 |
from externally defined tasks but also from evaluating its own outputs, extending alignment beyond verifiable domains. basically, 01:07:06.640 |
in the rubric, if you're using the right tools and getting the right steps, you're 01:07:12.800 |
rewarded for that too. but i think we can do a bit of a follow-up on this. for each agent we generate 01:07:19.040 |
tasks that range from simple to complex, and each task is paired with an explicit rubric that specifies the 01:07:24.400 |
success criteria, expected tool use, and evaluation checkpoints. this rubric-based approach 01:07:31.040 |
ensures consistent, objective evaluation of agent performance. where else is this mentioned? 01:07:35.840 |
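a rough sketch of how such a rubric could be scored with partial credit for intermediate checkpoints (purely illustrative; the field names and weights are assumptions, not the paper's schema):

```python
# episodes earn reward for expected tool use and intermediate checkpoints,
# not only for full task completion
def score_episode(rubric: dict, episode: dict) -> float:
    score = 0.5 if episode["succeeded"] else 0.0                     # full completion
    expected = set(rubric["expected_tool_use"])
    if expected:
        score += 0.25 * len(set(episode["tools_used"]) & expected) / len(expected)
    checkpoints = rubric["evaluation_checkpoints"]
    if checkpoints:
        hit = sum(cp in episode["checkpoints_hit"] for cp in checkpoints)
        score += 0.25 * hit / len(checkpoints)
    return score                                                     # in [0, 1], used as the episode reward

rubric = {"expected_tool_use": ["search", "calculator"],
          "evaluation_checkpoints": ["query formed", "result parsed"]}
episode = {"succeeded": False, "tools_used": ["search"], "checkpoints_hit": ["query formed"]}
print(score_episode(rubric, episode))                                # partial credit without full success
```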
expert-crafted conditional prompts and rubrics developed by our data team. 01:07:44.880 |
oh, this is kind of interesting, how they developed this agentic instruction augmentation for fine-tuning. 01:07:51.120 |
okay: each iteration is assessed using a task-specific rubric, enabling the judge model to provide binary 01:07:58.880 |
success/failure signals, so that's for success. then beyond verification, self-critique: 01:08:04.800 |
the self-critique reward, where the model evaluates its own outputs; the initial critique ability is built in the sft stage. 01:08:13.520 |
eliminating reward hacking, clarity, relevance: so this is some of it, the core rubrics for rl. clarity and 01:08:23.120 |
relevance, so this is the rubric used in the rl rollouts, 01:08:29.760 |
assesses the extent to which the response is succinct while fully addressing the user's 01:08:36.400 |
intent, with a focus on eliminating unnecessary details, staying aligned with the central query, and 01:08:43.600 |
efficient formats such as brief paragraphs or compact lists. then conversational fluency and 01:08:51.040 |
engagement, whether it's coherent. so basically they have an rl reward for stuff like clarity, 01:08:59.200 |
conciseness, staying on topic, objective groundedness, like are you objectively answering the question, perspectives, and so on. 01:09:05.840 |
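a small sketch of turning those general-RL rubric dimensions into one scalar reward per rollout (dimension names and weights are illustrative; the judge is a placeholder), which then feeds the group-relative comparison discussed next:

```python
# combine rubric-dimension scores from the self-critique judge into one reward
RUBRIC_WEIGHTS = {
    "clarity_and_relevance": 0.4,
    "conversational_fluency": 0.3,
    "objective_groundedness": 0.3,
}

def judge_dimension(prompt: str, response: str, dimension: str) -> float:
    # placeholder for the self-critique judge scoring one dimension in [0, 1]
    return 0.5

def rubric_reward(prompt: str, response: str) -> float:
    return sum(w * judge_dimension(prompt, response, d) for d, w in RUBRIC_WEIGHTS.items())

# rewards for a group of rollouts on the same non-verifiable prompt
rewards = [rubric_reward("rewrite this email politely", r) for r in ("draft a", "draft b", "draft c")]
```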
so you think that these rubrics kind of guide it, help it decide what's 01:09:14.800 |
good or bad, is that what you're saying? yeah, within the group of outputs, because it's based on grpo, right, 01:09:23.840 |
you're still giving a reward even though you're not verifying. well, in a sense you are verifying, 01:09:29.360 |
because in a group of rollouts you're giving reward to the ones that match these rubrics. so you give a 01:09:36.560 |
reward for stuff that's clear, a reward for stuff that's fluent, a reward for stuff that's 01:09:42.000 |
objective and justified, and you're explicitly adding that to your rl policy. so 01:09:50.080 |
it's not verifying on stuff like, okay, the output is a verified boxed math answer or code that 01:09:58.240 |
compiles, but you do have a rubric that verifies for this. i think the interesting thing is that they have a 01:10:03.920 |
policy model that iteratively learns better critiquing and can self-critique, and that makes this possible, 01:10:13.360 |
because otherwise this is essentially just an llm as a judge as your rl reward on outputs, 01:10:19.760 |
and we know that rl is very, what's the term i'm looking for, 01:10:31.120 |
i'm blanking on the word, but the signal has to be concise, 01:10:40.240 |
right, you can't have noise in your rl signal otherwise you won't generalize. there's a term for 01:10:46.320 |
this that i'm blanking on. and you would assume that just having a basic llm as a judge on your outputs 01:10:53.200 |
for non-verifiable stuff, like is this good, is this concise, wouldn't work, but this kind of shows that if 01:10:59.760 |
you keep your... okay, come here, okay, yes, read this section. okay, but what 01:11:13.760 |
makes you think that it is kind of improving its critique capability as it progresses? it says it 01:11:21.040 |
in one section; that section that eugene highlighted is literally about keeping it 01:11:30.320 |
as a self-closed loop, right. yeah, the closed-loop thing, so that's how i know. 01:11:38.880 |
okay, thanks. cool, any other stuff? yeah, the closed loop: our critic model is refined using verifiable signals, 01:11:47.040 |
on-policy rollouts generated from verifiable rewards continuously update the 01:11:53.040 |
critic, a crucial step that distills objective performance signals from rlvr directly. they even 01:11:58.400 |
talk about this in the infrastructure section, about loading and moving these weights around. this closed- 01:12:05.040 |
loop process ensures that the critic continuously recalibrates its evaluation standards in lockstep with 01:12:11.280 |
policy evaluation. so this is what eugene was saying, right, this closed-loop stuff lets you do rl beyond 01:12:16.720 |
verifiable rewards: basically you have a good rubric and you iteratively get better. thank you. 01:12:24.240 |
cool, guys, we're 20 minutes over, so i think that's where i call it. if anyone wants to volunteer for 01:12:31.440 |
next week that would be sick; find a cool paper. we share a bunch of papers in discord, i have like 01:12:38.400 |
four or five that i'd recommend, but if anyone wants to present, that would be sick. 01:12:51.280 |
thanks. you're speaking but muted. all right, next week guys, i'll see you.