Kimi-K2 Tech Report (Full Breakdown w/ Vibhu Sapra)

00:00:00.320 |
the luma said okay, give me k2. um, last week we went over the original muon paper, so basically 00:00:09.280 |
the really cool thing, if anyone has seen it or trained models, is figure three — 00:00:14.560 |
they have a very very smooth loss curve uh normally when you train there's big spikes 00:00:20.400 |
big spikes are not good because you have to kind of like intervene see like okay is the model kind 00:00:24.640 |
of cooked do we have to roll back do we have to go to a new checkpoint do we have to fix stuff 00:00:30.080 |
that's like kind of where you have to stop your training and like your gpu utilization gets kind 00:00:35.440 |
of cooked what they did is they have a new optimizer that gives us really really smooth loss curves with 00:00:40.880 |
like no spikes throughout the entire training process this is kind of crazy um i won't go 00:00:46.560 |
super into detail on the optimizer since last week we we basically covered it so i think the recording 00:00:54.320 |
is already up on youtube uh if you're interested watch it but okay give me k2 who's uh who's used 00:01:00.800 |
it who's heard of it who doesn't know what it is um if anyone wants you know to just interrupt me whenever 00:01:06.080 |
and we can we can answer questions so basically this is like state of the art open source model it's very 00:01:14.480 |
very big um i'm gonna check chat real quick okay nothing nothing about this so basically this is a 00:01:21.680 |
trillion parameters um it's not a thinking model but it's like deep seek v2 moment uh like the second deep 00:01:28.640 |
seek moment um what they did is basically it's like better than claude on all coding benchmarks better 00:01:35.040 |
than like anything it's very very state of the art it's not thinking and it's a it's a very good uh 00:01:41.520 |
non-reasoning model. it's very open source — they shared this optimizer that's basically 00:01:48.720 |
like a better adamw, and, um, this week the actual tech report paper just came out. it's decent, 00:01:56.640 |
honestly um it's like the first highlight of once again how do you train a very very large model 00:02:04.160 |
so it's like kind of a pre-training post-training how do you train a large model and then the 00:02:09.040 |
interesting note here for me was you get to see the problems of using china chips — like they have the 00:02:15.360 |
h800s, right, because there was an export restriction, so these chips are different in terms of, you know, 00:02:22.240 |
interconnect between nodes, and what do they do to get around that. i think like two weeks ago we 00:02:28.560 |
covered the other reasoning paper um magistral the reasoning paper from mistral and they show how 00:02:33.920 |
they can just like skip some of these issues for full gpu efficiency. um, but yeah, here's kimi k2, so 00:02:41.760 |
uh trillion parameter huge huge trillion parameter model with 32 billion active parameters very very 00:02:49.360 |
sparse uh they talk about the sparsity they talk about like experiments they did for sparseness ratios 00:02:56.160 |
and all this stuff it's pre-trained on 15 trillion tokens they use a muon clip optimizer this is basically 00:03:03.680 |
like all the hot news they they put out a separate paper about this before um what this basically 00:03:09.600 |
leads to is very very good training stability. so, uh, they have this qk-clip technique where they're 00:03:17.440 |
kind of clipping the attention logits when they spike, and yeah, it's just very smooth training. 00:03:23.920 |
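roughly what qk-clip does, per the muonclip description: if a head's max attention logit exceeds a threshold tau, that head's query/key projections get rescaled so the logit comes back under the cap. a toy numpy sketch — tau and the shapes are made up, this is not kimi's actual implementation:

```python
import numpy as np

def qk_clip_step(W_q, W_k, X, tau=100.0):
    """toy per-head qk-clip: if the largest attention logit seen this step exceeds tau,
    rescale the query/key projections so the max logit comes back down to tau.

    W_q, W_k: (d_model, d_head) projections for one head
    X:        (seq_len, d_model) activations from the current batch
    tau:      logit cap (value here is made up)
    """
    q = X @ W_q
    k = X @ W_k
    logits = (q @ k.T) / np.sqrt(W_q.shape[1])
    s_max = logits.max()
    if s_max > tau:
        gamma = tau / s_max
        W_q *= np.sqrt(gamma)   # split the factor between q and k so their product scales by gamma
        W_k *= np.sqrt(gamma)
    return s_max
```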
trained on 15 trillion tokens uh trillion total parameters 32 billion active parameters very sparse 00:03:30.400 |
um, and then the big thing here is zero loss spikes, which is kind of crazy. um, during post-training there's a multi-stage 00:03:38.320 |
process — they do a lot of synthetic data work. another fun thing i would recommend about this 00:03:44.160 |
paper for people is they do good citations into other work, so if you're interested in different 00:03:50.560 |
um like synthetic data generation pipelines they they cite quite a few cool papers there's a few in there 00:03:57.440 |
that are like good citations i'm gonna follow up and read uh estimates on the compute used in training 00:04:03.280 |
this yeah i think they actually say how much compute they used how much how much training compute uh how 00:04:08.240 |
many gpus and stuff they had now it doesn't that doesn't account for a lot of the experiments they have a lot of 00:04:13.840 |
like ablations and stuff they have a lot of data generation i don't know if that necessarily ties 00:04:20.320 |
into how much they used for it right they share a decent amount of numbers but they don't share everything 00:04:25.600 |
um, yeah — so large synthetic data synthesis pipeline and a joint rl stage. honestly i found this 00:04:33.200 |
model a little confusing because everyone calls it not a reasoning model and like sure it's not reasoning 00:04:41.040 |
per se where it's like grpo like rollouts just like proper rl to think but they do have a lot of like 00:04:49.120 |
rl to think so i don't know where the line blurs and i think this is a discussion we can have later and 00:04:54.800 |
i'd love to hear people's thoughts but like i would semi call this a reasoning model like it it reasons 00:05:00.240 |
and it's trying to reason with a like similar grpo with with some um with some changes but yeah it's a 00:05:08.000 |
state-of-the-art very very good coding model you can plug it into cloud code and stuff um 00:05:13.120 |
so swe-bench verified 65.8 — it's like right up there with opus, o3, all that stuff 00:05:19.600 |
um and it's very sparse you know it's it's sparse it should be cheap for people that 00:05:24.640 |
host it for inference um on everything else it's pretty good it's very open source they give base 00:05:30.080 |
model and post train checkpoints um it's nice they do base model evals as well what's the difference 00:05:36.640 |
between a reasoning and a thinking model? yeah, that's a good question — mostly we don't call stuff 00:05:42.880 |
thinking models, we just have reasoning models and non-reasoning models. in this case i would consider 00:05:48.560 |
this thing to be like a reasoning model, but yeah, i guess there's just think tokens — they just take out 00:05:53.680 |
the think tokens and they just let it kind of reason um performance pretty good i think we can cover 00:06:01.920 |
charts later let's just get through the paper for now but if anyone's interested in diving on any of 00:06:06.800 |
these we can we can kind of look at it um but yeah you know like coding benchmark basically up there with 00:06:14.080 |
opus um the one that just beat this recently so like the the cool thing with this paper is 00:06:19.440 |
it beats deepseek across the board on everything. um, as of a week later, qwen — qwen reasoning, 00:06:29.200 |
sorry, qwen coder — beats this at coding, which is cool, so you know, stuff is moving fast. uh, they basically 00:06:36.960 |
reran a bunch of benchmarks, except for opus — for opus they just took their numbers. i think openhands 00:06:43.360 |
helped them with swe-bench verified, so shout out to them at the end. but, um, opus was expensive to 00:06:49.360 |
benchmark it's it's crazy how expensive benchmarking is so but good for them to do it uh basically in 00:06:55.120 |
the introduction they go a little like um you know broader and they want to explain what are these terms 00:07:01.600 |
so they have this um they have this like vision of agentic intelligence right uh models are no longer 00:07:08.800 |
static like they don't just do next token prediction they're they need to be agents so they need to like 00:07:15.040 |
evolve from static understanding of what they know and they need to be able to plan reason act in 00:07:21.280 |
environments use tools and stuff so a lot of what they want to do and like you know dynamically adapt to 00:07:27.360 |
these, um, environments. so basically they want to train a model that's really really good at tool 00:07:35.600 |
use, so, um, they train it on a lot of environments and agentic tool use — they generate a 00:07:43.840 |
lot of these tools — and they want it to acquire skills beyond the training distribution and adapt behavior through 00:07:49.200 |
experiences so like this is kind of like the motivation for why this exists uh you know this thing should be 00:07:55.360 |
be able to pick up tools really good and like when i mentioned there's good citations of papers for 00:08:00.160 |
like synthetic data gen they also have stuff that's like here's what we were influenced by in tool use 00:08:06.720 |
papers here's like four papers that do really good tool use and here's how we build upon that so 00:08:11.520 |
you know if you're interested in tool use stuff you should probably read the other four papers 00:08:15.120 |
um okay: achieving agentic intelligence introduces challenges in both pre-training and post-training. 00:08:23.200 |
um, pre-training must endow models with general-purpose priors and token-efficient learning — effective learning signal 00:08:32.240 |
per token; post-training must transform those priors into actionable behaviors. yet agentic 00:08:42.000 |
capabilities such as multi-step reasoning long-term planning and tool use are rare in natural data and 00:08:47.920 |
costly to scale so what this is saying is basically um you need a decent enough base model to do all 00:08:55.520 |
this stuff and for post-training like this data is not really existent and it's like hard to scale right 00:09:02.800 |
so like we don't really just have a lot of multi-step reasoning long-term planning tool use data so um you 00:09:09.120 |
know they do a lot of scalable synthetic data gen uh okay paper uh model big big model trillion parameter 00:09:16.640 |
moe with 32 billion active parameters there's there's a few major major things here uh muon clip which is 00:09:23.600 |
their new novel optimizer — it's basically better than, um, adamw. this was introduced quite a while ago, but it 00:09:33.840 |
was trained at small scale and it wasn't really scaling very well, so in this paper they show 00:09:41.040 |
scaling up this optimizer, and they were having, uh, exploding attention logit issues. so the way they solve 00:09:48.000 |
it is this kind of, um, stability-enhancing qk-clip, where they're clipping the attention 00:09:56.400 |
logits — rescaling the query/key weights when logits get too large. so from there, um, it's kind of crazy, they don't have a single loss spike — 00:10:02.880 |
very very beautiful training curve okay then they have a lot of stuff on their data gen so a large 00:10:10.320 |
scale agentic data synthesis pipeline i thought this was like kind of basic i'm surprised how much it just 00:10:17.440 |
works but um pretty cool stuff on synthetic data gen basically they generate a lot of tools tool 00:10:24.880 |
demonstrations tool descriptions and then they have environments for it to use this stuff and they have 00:10:30.640 |
like verifiers and in the middle step verifiers but it's it's like a cool pipeline that they use for 00:10:36.800 |
all this uh data gen why why do you think it was basic i mean we'll we'll go into how it is later and how 00:10:43.840 |
it's evaluated um i i thought it's like like the verification of some of this stuff isn't that crazy 00:10:51.280 |
you know yeah it's a it's a lot of rubric based i'm surprised it just kind of works you know um especially 00:10:59.120 |
at the scale and like how they achieve diversity like from the techniques they explain i wouldn't expect 00:11:06.640 |
the amount of diversity that makes this actually work out 00:11:11.200 |
but let's we'll dive into it in a bit you know yeah a consistent question i have for folks is that how 00:11:16.640 |
do you evaluate on non-verifiable reward like essentially things that are not code or software 00:11:21.040 |
and everyone's just saying rubric-based would work, and i think it's like an example of hey, you know, rubric- 00:11:25.360 |
based actually does work. uh yeah, i'm surprised that that's all you need. they talk about this 00:11:32.960 |
specifically so for stuff like creative writing and stuff um they you know they share how they how they 00:11:39.360 |
generate this pipeline is it true they used mcp tool architecture outputs yeah they basically have a 00:11:45.680 |
whole bunch of mcp stuff so they scrape like a thousand mcp tools and then they they generate 00:11:50.880 |
it even more so mcp is all you need in this um we'll get into it later and like actually they have a 00:11:56.400 |
scatter plot of different tools — um, see, a t-sne visualization of real mcp tools, um, colored by 00:12:06.480 |
their original sources so thousands of mcp tools of course they have crypto but um you know this is 00:12:13.600 |
their mcp tools then they have synthetic tools which are also you know in different domains i love how they 00:12:21.280 |
just have an unknown domain but they have crypto where it's it's some of the mcp tools but um 00:12:27.600 |
that's part of it okay okay continuing on um where were we 00:12:32.720 |
okay they also talk about their rl framework um it's rl with verifiable rewards this is what eugene 00:12:42.000 |
was talking about — it's not just code and math, they have it on, um, like 00:12:50.560 |
general domain so like creative writing and stuff okay so kind of interesting they talk a lot about 00:12:57.920 |
critiquing outputs instead of like grpo where you do rollouts and reward the best output um they just 00:13:06.160 |
have a lot of critiquing here which is kind of interesting i found their explanation of how they 00:13:11.680 |
set up this whole loop to be a little confusing so like their training loop it's a little odd but um 00:13:17.440 |
i don't know maybe someone else had a better intuition while reading it or broke it down or 00:13:20.960 |
looked into it more but uh they they do share a lot about that okay pre-training 15 trillion tokens 00:13:27.280 |
um they talk a lot about token efficiency and how it's important basically rl is very token efficient 00:13:33.280 |
right so guess what do rl um they they talk about pre-training techniques so uh uh effectively 00:13:41.440 |
this muon optimizer it's very very much a token efficiency thing as well then they have synthetic 00:13:47.200 |
data gen um of course their ultra sparse moe is efficient uh it's mla so multi-head latent attention 00:13:55.440 |
which was derived from deep seek and then they make some modifications they actually show scaling laws for 00:14:02.240 |
how sparse you want moes there's charts of this later uh muon clip so this is the optimizer and 00:14:09.280 |
what they do um i was thinking we actually just skip this for now since we basically did this last week 00:14:17.200 |
but if there's time i think we come back um let's see if there's anything basic to say about it so you 00:14:24.000 |
you know outperforms adam w better token efficiency um leads to training stability 00:14:31.440 |
constrain attention logits — so attention was having issues, we constrain it: apply clipping to unshared 00:14:39.920 |
attention heads. yeah, i think we go over this later — let's just go through the paper first, unless anyone 00:14:45.280 |
wants to ask a specific question or eugene do you want to say anything about it okay let's just move on since we 00:14:52.560 |
basically did it. so if you're interested, um, we'll maybe cover it at the end, or check last week's 00:14:58.160 |
thing okay uh pre-training data improving token utility with rephrasing it's interesting how much 00:15:04.240 |
they just did rephrasing like they show cool little experiments here of um rephrasing versus uh repeating 00:15:12.240 |
for multiple epochs and how their their strategy is kind of better so token efficiency refers to how much 00:15:19.440 |
performance improvement you can get out of each token consumed during training increasing token 00:15:25.040 |
utility the effective learning signal for each token contributes is pretty important right so that's 00:15:31.520 |
something that they want to focus on naive approach is basically just do more epochs right um eugene 00:15:39.040 |
question, comment? you're muted. oh sorry, not a question — i think you briefly glanced at the table; we will go 00:15:47.200 |
through it. uh, when we go through the table — i think the table is absolutely insane, where the delta between 00:15:51.760 |
the second row and the first row essentially explains almost all of the gain. um, but we can talk about it 00:15:57.200 |
when you get here yeah yeah it's interesting um so basically like you guys know what multiple epochs are right so 00:16:04.560 |
you you you take your let's say you have a thousand samples you train for 10 epochs you train on the 00:16:09.200 |
same data 10 times um so that's one naive approach another so they compare to that um you know you do 00:16:17.920 |
get some gains from doing multiple epochs and it's somewhat common for some some data like for your base 00:16:23.920 |
pre-training we don't do multiple epochs anymore — we used to. um, for some higher quality data, like some 00:16:30.160 |
high quality reasoning, we'll do it for multiple epochs and do the same data multiple 00:16:35.120 |
times but here's kind of what they do um a key advancement is synthetic data gen to increase token 00:16:42.560 |
utility so they have this core rephrasing pipeline um basically instead of just train on the same same 00:16:49.680 |
data multiple times let's kind of rephrase it a little bit there's two two domain specialized 00:16:54.880 |
rephrasing techniques that they use for knowledge and math so knowledge data rephrasing a single epoch 00:17:01.600 |
is insufficient for comprehensive knowledge absorption while multi-epoch repetition yields diminishing 00:17:06.880 |
returns in the risk of and increases the risk of overfitting so um we'll get to a chart in a bit but 00:17:12.240 |
let's talk about what they do so first is style and perspective diversity prompting so uh prompting guide 00:17:18.880 |
is where you take an llm to generate, uh, faithful rephrasings of the original text in various 00:17:25.840 |
styles from different perspectives. then there's, uh, chunk-wise autoregressive generation — basically, if 00:17:32.160 |
stuff is very long, just do it in chunks: rephrase segment by segment, with an autoregressive 00:17:39.200 |
chunked fix-up. so divide text into segments, rephrase them individually, stitch them back 00:17:45.440 |
together and fix that there's there's like a little pipeline on there's a little diagram on this later 00:17:51.680 |
um, to ensure consistency between original and rewritten content, there's fidelity checks that compare the 00:17:59.360 |
semantic alignment of the rephrased passage with its source. um, yeah, cool — so basically they do all this. 00:18:05.440 |
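a minimal sketch of that rephrase-then-verify loop — `call_llm` and `fidelity_judge` are stand-ins for whatever model and check they actually use, and the styles and threshold are made up:

```python
STYLES = ["as a lecture note", "as a q&a", "from a practitioner's perspective"]

def rephrase_with_fidelity_check(passage, call_llm, fidelity_judge, min_score=0.8):
    """generate stylistically varied rephrasings of a passage and keep only the ones
    whose semantic alignment with the source passes a fidelity check."""
    kept = []
    for style in STYLES:
        rewrite = call_llm(f"rephrase the following text faithfully, {style}:\n\n{passage}")
        if fidelity_judge(passage, rewrite) >= min_score:   # 0-1 semantic-alignment score
            kept.append(rewrite)
    return kept
```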
then they run a little experiment so let's compare how this performs in the set in the sense of token 00:18:11.920 |
efficiency so there's three three things that they do one is you take your original data set you train 00:18:17.600 |
on it for 10 epochs uh risks here are you know you might overfit your data so you might overfit and it 00:18:25.120 |
might just you're not really adding more quality you're just doing the same thing multiple times 00:18:28.560 |
two is rephrase the data once and repeat it for 10 epochs so basically you take all your data 00:18:35.680 |
rephrase it in a different sense and then you train for 10 epochs uh three is rephrase the data 10 times 00:18:42.640 |
with a single pass of training and here's kind of the results um we extend this method towards larger 00:18:49.120 |
stuff and you know each corpora is rephrased at most twice so 10 epochs on same data they they test it on 00:18:56.880 |
simple qa they get 23. just rephrasing it once and training it for uh sorry rephrasing it once and 00:19:03.440 |
training it for 10 epochs gets you quite a bit of improvement then there's rephrasing it 10 times and 00:19:09.120 |
doing one epoch you still get a little bit more improvement but you know the main difference like 00:19:13.280 |
eugene said is just here rephrasing makes quite a difference am i reading this right when actually the 00:19:19.760 |
first row and second row each of them are 10 epochs but the difference between the second from the first 00:19:23.840 |
row is essentially the second row actually does data cleanup by rephrasing it it's not data cleanup it's 00:19:29.680 |
data augmentation so i think they keep both the original data and rephrase oh i see so they keep 00:19:36.960 |
both okay i was reading it wrong i thought the first one was raw 10 epochs the second one is only rephrased 00:19:44.080 |
10 epochs um we could you could be right here yeah because yeah i read it so what eugene is saying is 00:19:54.080 |
you know in the the the benefit just comes from rephrasing what i read this as is keeping the original 00:20:01.280 |
and rephrasing it i think i'm right because if you look at the paragraph it says rephrasing the data once 00:20:07.040 |
repeating it which i think as it as the rephrase data if you look at the third line which is uh so 00:20:14.480 |
that is actually cleaning up the data to be better 00:20:17.840 |
that's all the juice uh so that's pretty that's pretty mind-blowing to me that someone they did this 00:20:24.000 |
ablation so that's really nice yeah beautiful little um ablation basically rephrasing you get so much data 00:20:32.480 |
quality improvement but the reason i interpret it the way that i did is because um you know i thought 00:20:39.280 |
that 10 epochs of the same data is basically just the same data, um, and adding in another look and 00:20:47.440 |
doing it for ten epochs gets you better performance. but i think we'll dig into it later — somewhere at the 00:20:52.000 |
bottom they have um they have a lot of this in the appendix i wouldn't be surprised if they talk more 00:20:57.680 |
about it well we'll have to check um this is our kind of chunking for long context diagram basically 00:21:05.520 |
you know chunk it rewrite and then merge it and check but nothing crazy there i had a question on 00:21:11.520 |
this actually this diagram um i'm thinking a lot of time but uh in the third column let's say we 00:21:20.800 |
rephrase partial input one what does it mean to be auto regressive 00:21:24.800 |
oh like you're adding in the first chunk and the second chunk like you're using the first chunk 00:21:34.880 |
to write the second chunk i think so um here chunk based auto regressive strategy texts are divided into 00:21:42.320 |
segments rephrased individually then stitched back together to form complete passages this method 00:21:48.640 |
mitigates implicit output length limitations um the auto regressive part i think is let's see 00:21:55.440 |
so so when i read the paragraph i read it as okay we've split in the paragraphs you rephrase it and we 00:22:00.800 |
just smash it back together but i realized that the first few phrases makes a lot of difference to 00:22:07.200 |
maintain global coherence i think what they're doing here is they're rephrasing the first paragraph 00:22:14.240 |
then they take the new first paragraph and the raw second paragraph and 00:22:14.240 |
ask it to rewrite the second paragraph and i think by doing this it keeps the style of the first paragraph 00:22:24.480 |
but i want to make sure i'm understanding this right. that's how i'm interpreting it too — uh, the input 00:22:30.080 |
is split into small chunks with preserved context, rewritten sequentially, then concatenated into a fully 00:22:35.760 |
rewritten passage. but, um, it could be the other way too, which wouldn't be as good — if it's the other way then it 00:22:41.760 |
doesn't have to be sequential right if it's there if you just do everything you can just be parallel the 00:22:47.920 |
fact that is i yeah anyone also the graph the graph also shows they are feeding the the previous segment 00:22:56.800 |
into the rewrite step so so like the second rewrite model has like two inputs it has the like original 00:23:04.240 |
segment and then also the rewritten previous segment i think so so that's it if we are rewriting the last 00:23:11.520 |
chunk do we pass in everything or do we just need to pass in the last chunk minus one and this is just 00:23:18.320 |
last minus one right you can see it yeah it looks like partial except two right yeah so okay i think 00:23:24.240 |
one one one's not going into three but one is kind of already seen this is extremely valuable uh this 00:23:31.120 |
this diagram itself says a lot about how you can process long context data uh so i think you could 00:23:38.080 |
do better, really, basically. yeah, that's just like — yeah, you could pass in multiple chunks, you could do a 00:23:43.680 |
better long context rewrite; like, this is not super efficient. yeah, you're right, it's not super efficient. 00:23:50.400 |
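that reading of the diagram, in code: each chunk is rewritten conditioned only on the previously rewritten chunk, then everything is stitched back together. `call_llm` is a placeholder — this is a sketch of the idea, not their pipeline:

```python
def chunkwise_autoregressive_rewrite(chunks, call_llm):
    """rewrite a long document chunk by chunk, conditioning each rewrite on the
    previously rewritten chunk (only the previous one) to keep the style coherent,
    then stitch everything back together."""
    rewritten, prev = [], ""
    for chunk in chunks:
        prompt = ("you are rephrasing a long document one chunk at a time.\n"
                  f"previously rewritten chunk (match its style):\n{prev}\n\n"
                  f"rewrite this chunk faithfully:\n{chunk}")
        prev = call_llm(prompt)
        rewritten.append(prev)
    return "\n".join(rewritten)
```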
yeah yeah it also depends on how long context documents you have right so like at a certain 00:23:56.000 |
yeah so i mean there's interesting stuff there but the the interesting note is the majority of the 00:24:02.240 |
training was done at pretty short context actually um exactly i was so surprised 4k context only so but 00:24:08.800 |
okay i'll let you get it there i'll stop interrupting i mean no no it makes sense right like it's a 00:24:13.520 |
trillion tokens it's a trillion sorry trillion parameters it's a big model like um that context 00:24:20.240 |
really adds up so they do context length extension later i also just don't think that we have 15 00:24:26.720 |
trillion like even trillions of tokens of really long context right like that just doesn't exist um and 00:24:33.280 |
it's so much more exponentially expensive to train at long context um okay so uh basically they rewrite 00:24:42.000 |
high quality math into learning note style so instead of just question answers you know here's a learning 00:24:48.320 |
note of how i would solve this. uh, translate high quality math into other languages — this is standard, 00:24:54.240 |
seen in other work as well okay overall they have 15 trillion tokens i thought this was interesting 00:24:59.680 |
they have web text code math and knowledge um i don't know what knowledge is but they have knowledge 00:25:04.720 |
for each domain we perform rigorous correctness and quality evals you know standard filtration stuff 00:25:12.560 |
a lot of rewriting though. okay, 32 billion active parameters, similar to deepseek v3, they use mla 00:25:18.560 |
um, as their attention, and then there's the hidden dimension and all of this. okay, so scaling law analysis reveals that 00:25:27.040 |
continued increase in sparsity yields substantial performance improvements which allowed us which 00:25:32.160 |
motivated us to increase the number of experts to 384 instead of 256 in v3 to reduce computational overhead 00:25:38.960 |
during inference we can we cut the attention heads to 64 instead of 128 so they do scaling laws to reveal that 00:25:47.360 |
um you know based on the number of active parameters you want increasing sparsity so um taking the same 00:25:56.880 |
number of active parameters but increasing total experts is very um is better performance uh and 00:26:04.960 |
they um so you know they increase experts and they cut the attention heads in half this helps with their 00:26:13.840 |
training stability so their new optimizer um had issues and they they cut attention heads in half 00:26:21.520 |
okay this is um you know similar to deep seek so more total parameters less active parameters more 00:26:28.800 |
experts less attention heads only one dense layer which they don't talk much about here's their sparsity 00:26:35.120 |
scaling law. um — oh, one sec, i'll be right back. 00:26:42.240 |
oh no oh no oh no that's mochi need to get out of the room or that's what you need to get into the room 00:26:47.920 |
what he has always has needs um mochi's fomo from paper club oh mochi's here 00:26:55.120 |
no she just needs papers what you can tell with us all right okay sparsity is the ratio of um total experts 00:27:05.520 |
compared to active experts uh 48 is their magic number but um basically what is this they have 00:27:23.120 |
and let's read through this. okay, sparsity is the ratio of the total number of experts compared to the number of 00:27:30.320 |
active experts. so this is the scaling law, basically, that they find: under a fixed number of active 00:27:36.000 |
parameters uh basically in this case the 32b active parameters that they want to set um increasing the 00:27:44.000 |
total number of experts — basically making it more sparse — consistently lowers both the 00:27:51.280 |
training and validation losses, thereby enhancing overall model performance. um, they do these experiments. 00:27:59.200 |
um increasing sparsity leads to better performance it also adds infrastructure complexity um there was a 00:28:09.760 |
semi-analysis paper on like why llama 4 failed and they talk a lot about moe stuff and their expert 00:28:17.920 |
choice routing versus token choice routing and all these stability issues and infrastructure and how they 00:28:23.120 |
get and how they changed halfway through but yeah when you make stuff more sparse and you have like 00:28:28.480 |
uh, stuff like that, then you need to, you know, deal with infrastructure complexity. 00:28:34.240 |
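to make the sparsity number they land on concrete (8 active out of 384 experts, i.e. sparsity 48, as they say next), here's a toy top-k softmax gating sketch — generic moe routing with made-up dimensions, not kimi's router:

```python
import numpy as np

N_EXPERTS, TOP_K = 384, 8                      # the numbers quoted for k2
print("sparsity =", N_EXPERTS // TOP_K)        # 384 / 8 = 48

def route_token(hidden, gate_w):
    """generic top-k softmax gating: score all experts, but only run the top k ffns."""
    scores = hidden @ gate_w                   # (n_experts,)
    top = np.argsort(scores)[-TOP_K:]          # indices of the 8 selected experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    return top, weights

rng = np.random.default_rng(0)
expert_ids, mix = route_token(rng.normal(size=16), rng.normal(size=(16, N_EXPERTS)))
print(expert_ids, mix.round(3))                # only these experts' parameters are "active" for this token
```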
so they have a sparsity of 48 achieving um activating eight out of 384 experts per forward pass um here's 00:28:44.320 |
kind of their scaling laws: increased sparsity, better performance. as for doubling the number of attention heads, 00:28:51.040 |
they also checked it didn't take much of a hit i think this is the next section uh frankie you have hand up 00:28:57.120 |
oh yeah i just want to ask um maybe more of an interpretive kind of uh question here so they 00:29:04.080 |
found that increasing moe is and decreasing number of heads um it is better so could the interpretation be 00:29:12.320 |
that, um, the attention heads are essentially kind of like mixing stuff from your context, and the moes are kind 00:29:18.800 |
of figuring out what you do with that mix so i think maybe it's saying that the expressivity of what you 00:29:25.760 |
should do with the mix is more important than the mixing itself that is you can kind of like 00:29:30.640 |
get raw stuff from your neighbors but you should be smarter about what you do with it 00:29:35.680 |
would that be a correct interpretation and it's kind of hard but what do you guys think 00:29:41.760 |
um two things so one the attention head is not actually it's not necessarily better to cut them 00:29:50.800 |
they did it for training stability so the the dots here are lower loss but double the attention head so 00:29:57.360 |
um more attention heads actually helps performance it's just not much like they say it's a doubling the 00:30:05.200 |
number of attention heads led to a reduction in validation loss of 0.5 to 1.2 percent so it's still 00:30:11.600 |
better it's not worse to have more attention heads it's just a lot more um it's a lot more complexity so 00:30:18.000 |
so here's a here's a cool quote so with the sequence length of 128k doubling the attention 00:30:23.680 |
heads from 64 to 128 while keeping the total expert count fixed leads to an 83 percent increase in 00:30:29.840 |
inference flops so you're nearly doubling the compute required for inference by doubling your attention heads 00:30:38.720 |
but the performance increase is very small and uh it it had instability issues with their optimizer 00:30:47.040 |
this doesn't mean that um adam w with more attention heads isn't stable but you know it's 00:30:53.200 |
it's a big uh compute cost so something to note there regarding the analogy i think i need to read uh hear 00:31:01.520 |
it again i don't know if someone else wants to comment or maybe you want to post it in the chat 00:31:06.320 |
but i i didn't really digest and if someone else wants to comment you know pop in i wanted to ask 00:31:12.640 |
is it more of the experts or the attention heads that we're you know being optimized for here 00:31:20.560 |
this is both um so they're keeping it sparse by uh having a lot of experts but only a few active 00:31:30.400 |
that's one part of the equation that's in their scaling laws better performance so 00:31:34.320 |
if you have a set of if you have a fixed inference budget like let's say i want to run this thing at 32 00:31:40.480 |
billion active parameters uh you'll do better by having more total experts so that's part of it 00:31:47.040 |
the that's for performance right that will get you better overall performance even though it has 00:31:53.360 |
training complexity. the attention heads is, uh, an inference optimization — so you know, when you cut attention heads 00:32:01.920 |
um, you have less kv, right? each attention head basically has its own k and v, and over long sequences 00:32:10.320 |
attention is quadratic scaling and that's a lot of gpu utilization. if you can cut the heads in half you can kind of 00:32:17.760 |
cut half of your kv cache, which is a lot of gpu usage. 00:32:24.480 |
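a back-of-the-envelope version of that argument, using plain multi-head-attention kv-cache arithmetic with made-up layer/head dimensions — k2 actually uses mla, so its cache layout differs, but the scaling point is the same:

```python
def kv_cache_gib(n_layers, n_heads, head_dim, seq_len, bytes_per_val=2):
    """plain multi-head-attention kv cache: K and V per layer, per head, per token (bf16)."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_val / 2**30

for heads in (128, 64):
    print(f"{heads} heads -> {kv_cache_gib(61, heads, 128, 128_000):.0f} GiB per 128k-token sequence")
# halving the head count halves the cache (and a big chunk of attention flops) at a given context length
```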
uh, eugene, it seems like you have good intuition — you wanna intuit? no, i don't know if i have good intuition, but my thinking is that 00:32:30.320 |
attention heads are just like an ensemble, and beyond a certain number — they found that, you know, from 00:32:36.480 |
64 to 128 — they say that yes, while it does improve validation loss, um, like it's over there, uh, only 00:32:44.240 |
validation loss improved by half a percent to 1.2 percent — and okay, given that fixed amount of compute, they'd rather 00:32:50.800 |
not double; they'd rather halve the attention heads and have more experts. so it's a compute trade-off, which 00:32:57.120 |
makes sense to me i think i think that makes sense no no i think that makes sense the difference here is 00:33:02.480 |
also just um the architecture right so we're operating on trillion parameter and 384 experts now does this 00:33:12.800 |
scale to dense models does this scale to a small 24b who knows i think that's um you know we'll have to 00:33:20.400 |
figure that out but um it's it's verified in their case i don't know if you can take this at face value 00:33:27.680 |
for for other architectures and other sizes um but you know i highlighted this in green it's important 00:33:35.040 |
very very useful cutting attention head saves a lot of compute at long context but once again attention is 00:33:41.680 |
a problem at long context this is like my one concern for like uh transformer alternatives and stuff 00:33:50.000 |
uh you guys really need to train at long context right because the benefit of your like getting rid 00:33:56.480 |
of attention and swapping it in with something like recurrence states that's cool but it really only makes 00:34:02.320 |
a difference at long context at short context it's not that deep and you know train at long context even 00:34:09.040 |
though you're not very gpu rich do it anyway okay uh training infrastructure this was cool if you care 00:34:16.560 |
about trade restrictions so basically nvidia didn't allow h100 h200 exports but china had the custom h800 00:34:25.440 |
gpu. um, then this is their training node — so, uh, each node in the h800 cluster contains two terabytes of 00:34:34.720 |
ram and eight gpus connected with nvlink and nvswitch within nodes; across different nodes, 00:34:41.840 |
um, 8×400 gbps roce interconnect. they talk about stuff with like pcie offloading, saving to disk, uh, weight 00:34:52.480 |
optimization so basically the thing that magistral did was really efficient to just keep rollouts 24/7 00:35:00.640 |
uh with this scale of trillion parameters they they did a lot more offloading to dram and stuff um 00:35:06.880 |
apparently h200s are coming to china soon nvidia is winning um actually like there was a post about 00:35:14.480 |
this where everyone was like i don't know why deep seek had such a big impact on like nvidia stock because 00:35:22.800 |
guess what h800 is still nvidia like it's it's all nvidia at the end of the day but um what's she 00:35:30.240 |
trying to jump off table but you stay this is mochi mochi's the non-expert doggo but she's learning to be 00:35:39.120 |
expert okay um parallelism for model scaling i don't know if we have that much time left for this so maybe i go 00:35:48.640 |
through this section quickly and we come back if we have time um i like this i nerded out over this in 00:35:56.560 |
the magistral paper uh do the rl part i think the rl and data gen stuff is more fun watch you come here 00:36:03.520 |
okay um i was gonna read a lot of this but let's let's skip it for now um basically there's three things 00:36:12.000 |
they do uh combination of 16-way pipeline parallelism 16-way expert parallelism and zero one data parallel 00:36:20.640 |
a lot of this is just like if you're ever interested in how big models are trained you should read this 00:36:25.680 |
section, it's kind of cool. here's what they have to go over: activation recomputation (stuff like swiglu), okay, fp8 for 00:36:32.560 |
the more intensive ops, but what else — cpu offloading. training recipe — this was interesting. so, training 00:36:38.880 |
recipe um they mostly train at 4 000 token context window using their new optimizer um total of 15 00:36:48.800 |
trillion tokens the first 10 trillion tokens were you know standard learning rate small warm-up then the 00:36:55.680 |
last 5.5 trillion tokens they had cosine decay; weight decay was set; global batch size was pretty large — 67 million 00:37:04.080 |
tokens per batch, crazy. uh, figure four — is there a figure four that i skipped over? these figures were very odd. 00:37:14.000 |
um the rewrite takes both previous and the current segment okay that's that's relevant right now okay 00:37:19.520 |
so uh most of the training was done at short context then towards the end they had an annealing phase 00:37:25.760 |
followed by a long context activation stage — batch size still 67 million tokens; uh, learning rate was 00:37:32.960 |
decayed quite a bit more in this phase we trained 400 billion tokens with a 4k sequence length followed 00:37:39.600 |
by an additional 60 billion tokens on 32k sequence then they extend with yarn so basically like you know 00:37:47.360 |
60 billion out of 15 trillion tokens are at 32k sequence um very very much like the majority of this training is 00:37:55.680 |
just done at 4k context post training um muon is still used in post training if anyone's fine tuning 00:38:02.560 |
kimi k2, you should also still use this muon optimizer — don't use adamw, it's cooked. um, i think there's 00:38:10.480 |
a chance — so like, one of the co-founders of xai is jimmy ba, who worked on the adam optimizer, it's like his big 00:38:17.360 |
thing. um, what are the odds that these people get poached to xai because they like optimizers? um, who 00:38:25.280 |
else — elon is trying to poach karpathy again, but that's a side tangent. okay, uh, sft. so, um, there's an instruction 00:38:36.960 |
tuning data set, maximizing the probability of the target response, data gen pipelines for different tasks in 00:38:43.120 |
different domains. so they use kimi k1.5 and their in-house domain-specialized expert models to generate 00:38:49.680 |
candidate responses for various tasks followed by llm or human judges to perform automated query 00:38:55.120 |
evaluation and filtering uh for agentic data we create a data synthesis pipeline to teach models 00:39:01.440 |
tool use capabilities through multi-step interactive reasoning. this stuff is pretty interesting — um, this 00:39:06.720 |
is like really the nice meat of this paper i think um okay large-scale agentic data synthesis for 00:39:14.400 |
tool use um basically we we need llms to be able to autonomously figure out how to use unfamiliar tools 00:39:23.360 |
uh interact with external environments and iteratively refine their actions to reasoning 00:39:28.800 |
execution and error correction so here are some papers that you should check out about how to train uh tool 00:39:35.200 |
use and tool use evaluation — so toolllm, acebench — and then same thing with, um, synthetic data. so, um, 00:39:46.240 |
real world environments are rich but difficult to construct at scale — complexity, privacy, accessibility. recent work on 00:39:57.760 |
synthetic data gen agent instruct self-instruct and stable tool bench and zero search i think we covered 00:40:03.120 |
agent instruct and self-instruct uh two new ones that they talk about here that if you're interested in 00:40:10.080 |
synthetic data gen you should probably check out uh stable tool bench and zero search they show promise in 00:40:16.800 |
creating large-scale data training without relying on real-world interactions it's all synthetic they have a pipeline that simulates 00:40:26.720 |
real-world tool use scenarios at scale uh enabling the generation of tens of thousands of diverse and 00:40:33.280 |
high-quality training examples okay three stages in uh data synthesis pipeline tool spec generation so step 00:40:40.720 |
one is you know first construct a large repository of tools from both real-world tools and lm uh synthetic tools 00:40:49.120 |
agent and task generation: for each tool set sampled from the tool repository — so they have like thousands of tools, by the way — 00:40:56.080 |
generate an agent to use that tool set and some corresponding tasks. then trajectory generation: 00:41:03.760 |
for each agent and task generate trajectories where the agent finishes a task by invoking tools 00:41:10.320 |
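a skeleton of those three stages as i read them — every callable here (`call_llm`, `rollout`, `judge`) is a placeholder and the prompts are invented; it's a sketch of the shape of the pipeline, not their code:

```python
import random
from dataclasses import dataclass, field

@dataclass
class ToolSpec:
    name: str
    description: str
    source: str                      # "mcp" (scraped) or "synthetic" (evolved from a domain)

@dataclass
class AgentConfig:
    system_prompt: str
    tools: list                      # list of ToolSpec
    tasks: list = field(default_factory=list)   # each task dict carries a rubric

def sample_tool_subsets(repo, k=3, n=100, seed=0):
    rng = random.Random(seed)
    return [rng.sample(repo, k) for _ in range(n)]

def build_agentic_dataset(tool_repo, call_llm, rollout, judge):
    """stage 1 (the tool repo) is assumed already built: real mcp tools + synthetic tools.
    stage 2: sample tool subsets, generate an agent plus rubric-bearing tasks.
    stage 3: roll out trajectories in a simulated environment, keep only judge-approved ones."""
    dataset = []
    for tools in sample_tool_subsets(tool_repo):
        names = [t.name for t in tools]
        agent = AgentConfig(
            system_prompt=call_llm(f"write a system prompt for an agent that uses these tools: {names}"),
            tools=tools)
        agent.tasks = [{
            "task":   call_llm(f"generate a task solvable with tools {names}"),
            "rubric": call_llm("write explicit success criteria and expected tool-use pattern for that task"),
        }]
        for task in agent.tasks:
            trajectory = rollout(agent, task)          # multi-turn sim: user persona + tool simulator
            if judge(trajectory, task["rubric"]):      # llm judge against the rubric
                dataset.append(trajectory)
    return dataset
```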
um, here are some of those tools. so they have a bunch of real mcp tools, and they kind of distinguish them — 00:41:18.640 |
they kind of filter them out by category: some search tools, database tools, uh, communication, r&d, 00:41:25.760 |
then here's their synthetic tools we'll go over this soon um first we directly fetched 3000 plus real mcp 00:41:37.120 |
tools from github repositories leveraging existing high quality tool specs uh the interesting thing with mcp 00:41:43.280 |
tools is like they don't really say how they verify that these are high quality tools right you can have 00:41:48.880 |
non-high quality mcps there's no like great standard over this but they don't mention much filtration over 00:41:55.120 |
this uh second we systematically evolve synthetic tools through a hierarchical domain generation process 00:42:02.560 |
we begin with, uh, key categories — so they basically have a categorizer 00:42:11.440 |
that categorizes different tools into different domains; they start with that. so, uh, then we evolve 00:42:18.640 |
multiple specific application domains within each category, and specialized tools are then synthesized 00:42:23.840 |
for each domain with clear interfaces description operational semantics so get a bunch of tools classify 00:42:30.720 |
them in categories um then evolve multiple specific application domains with each category then tools are 00:42:37.520 |
synthesized they produce over 20 000 synthetic tools across different domains um both mcp and synthetic 00:42:47.280 |
tools cover complementary regions okay agent diversification seems like generating all these 00:42:53.520 |
synthetic tool workflows would take a ton of compute yeah there's a lot of inference that goes under this 00:42:57.920 |
stuff um okay agent diversification so we just casually generate thousands of distinct agents they 00:43:05.680 |
synthesize various system prompts and equip them with different combinations of tools from our repository 00:43:11.360 |
when i said that this is kind of basic eugene this is kind of what i meant like i didn't expect stuff 00:43:18.480 |
like this to just work like sure you can get a bunch of tools um generate different categories and expand 00:43:25.840 |
them out i didn't expect diversity in these 20 000 tools but this is what i really didn't understand 00:43:31.840 |
they just generate thousands of distinct agents with different combinations of tools from their repository 00:43:37.840 |
and you know from this they have rubrics of different stuff that they should be able to solve and all 00:43:42.640 |
this kind of just work they have varied capabilities areas of expertise behavior patterns ensuring a broad 00:43:48.240 |
range of uh potential use cases and then before you know this is the last section for each agent 00:43:54.480 |
config we generate tasks that range from simple to complex operations each task is paired with an 00:44:00.400 |
explicit rubric that specifies success criteria expected tool use patterns i don't know how they just 00:44:06.640 |
generate tool use patterns because if you can do inference to you know figure out what are the tool uses 00:44:13.440 |
you kind of don't have the best quality data right because like you can kind of solve the task i guess 00:44:18.880 |
but like i get matching a task with tools but not tool use pattern rubrics uh and then evaluation 00:44:26.800 |
checkpoints rubrics ensure consistent and objective evaluation of agentic performance um this is kind of 00:44:33.840 |
interesting you know yeah i get that i think um i think of this as like a huge synthetic data etl pipeline 00:44:43.840 |
where you could be creating a hundred traces and through your quality evaluation and filtering 00:44:48.960 |
we don't know to what extent how strict they do it but they could be filtering only to the best 10 00:44:53.680 |
or the best five even they talk about their filtration actually a bit like in some stuff like in in 00:45:00.720 |
magistral, they talked about how they get rid of stuff that mistral large could solve. they have 00:45:06.640 |
something similar here — yeah, they have something whereby they actually use an sft 00:45:11.440 |
model, and only when the sft model is not able to solve it do they keep it. 00:45:16.320 |
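a sketch of that difficulty filter as eugene describes it — `sft_solve` and `verify` are placeholders:

```python
def keep_only_hard_examples(candidates, sft_solve, verify):
    """difficulty filter: drop anything the current sft model already gets right,
    so further training only sees examples that still carry signal."""
    kept = []
    for ex in candidates:
        attempt = sft_solve(ex["prompt"])      # let the sft checkpoint try the task
        if not verify(attempt, ex):            # it fails -> the example is still worth keeping
            kept.append(ex)
    return kept
```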
uh so it's really interesting the data cleaning and data generation that they do yeah it's like 00:45:22.560 |
it's cool it's there i think this paper could be a bit worded better but it's still cool it's still like 00:45:28.240 |
nice that they that they explain all this okay so uh multi-turn trajectory generation they simulate 00:45:35.200 |
real tool use scenarios user simulation so basically um you know lms are personas with communication 00:45:42.560 |
styles like one might be moody one might be authoritative one might be stupid one might be like 00:45:47.680 |
phd level smart um and preferences they they have multi-turn dialogues with agents creating naturalistic 00:45:55.120 |
interaction patterns uh tool execution environments so there's a tool simulator so it's kind of cool 00:46:01.040 |
they talk about like their sandbox here um a sophisticated tool simulator functionally equivalent 00:46:08.000 |
to a world model — they have agi, i guess; they have a world model. they execute tools and provide realistic 00:46:13.280 |
feedback. the simulator maintains and updates state after each tool call, enabling complex multi-step interactions. so this 00:46:20.640 |
is like if you think of like web arena and stuff like that where you kind of have like a fake website and 00:46:25.120 |
you're shopping through uh instead you have a simulator that might have like tasks and as you make as you 00:46:32.160 |
interact and do stuff and use a tool, the simulator keeps state as you go. this is kind of cool — i wish 00:46:38.480 |
they had talked a bit more about this. um, it introduces controlled stochasticity to produce 00:46:43.120 |
varied outcomes, including success, partial results, and edge cases — so you know, you're actually changing an environment. 00:46:48.400 |
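a toy version of that kind of stateful, stochastic tool simulator — purely illustrative (the rates and return shapes are made up), not their world-model code:

```python
import random

class ToyToolSimulator:
    """keeps state across tool calls and injects controlled stochasticity,
    so rollouts see successes, partial results, and failures."""
    def __init__(self, seed=0, failure_rate=0.1, partial_rate=0.2):
        self.rng = random.Random(seed)
        self.state = {}                                   # persists across the whole trajectory
        self.failure_rate, self.partial_rate = failure_rate, partial_rate

    def call_tool(self, tool_name, args):
        roll = self.rng.random()
        if roll < self.failure_rate:                      # edge case: the tool just fails
            return {"status": "error", "message": f"{tool_name} failed"}
        self.state.setdefault(tool_name, []).append(args) # update state after each call
        if roll < self.failure_rate + self.partial_rate:  # partial success
            return {"status": "partial", "calls_so_far": len(self.state[tool_name])}
        return {"status": "ok", "calls_so_far": len(self.state[tool_name])}
```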
a lot of this is like gym motivated style um but you know very cool like they they have all of this um 00:46:54.720 |
an lm-based judge evaluates trajectories against task rubrics; only trajectories that meet success criteria 00:47:02.960 |
are retained for training, ensuring high quality data while still allowing natural variation in task 00:47:08.480 |
completion. um, interesting little section there. yeah — sorry, i jumped the gun; i thought the 00:47:16.400 |
hybrid approach uh which is what you're going to talk about nick is amazingly cool because 00:47:24.000 |
they fine-tune the judge on verifiable code, and then what they say is that from this verifiable 00:47:32.560 |
code the model is able to learn subjective eval actually wait am i looking at the right section 00:47:41.360 |
sorry guys what is confusion is this the section yeah i i had a note somewhere uh give me a second let me 00:47:49.440 |
check it i'm gonna cover this for now so uh yeah go ahead there's inherent limitation to simulation 00:47:55.840 |
fidelity we complement our simulated environments with real execution sandbox they talk about scraping a 00:48:01.760 |
lot of like real github issues — uh, they want real authentic stuff, right, so in coding and software 00:48:07.600 |
engineering, uh, they have execution sandboxes — these execution sandboxes execute actual code, interact with 00:48:14.720 |
genuine development environments, and provide ground truth feedback for objective metrics and test 00:48:18.880 |
suite pass rates. uh, this allows them to learn from the diversity of simulated scenarios and the authenticity of 00:48:25.200 |
real execution, significantly strengthening the practical agent capabilities. so in this hybrid 00:48:30.880 |
uh pipeline i don't think this is the section you're talking about eugene but you're right sorry it's the 00:48:35.520 |
section on the closed loop. yeah, we get to that when we get to that. okay, i think i'm gonna go a little 00:48:41.680 |
fast because we're running short on time but basically uh you know rejection sampling so get rid of the 00:48:47.520 |
ones that are easy — quality filtration. sft has demonstrated improvements for tool use. okay, rl: 00:48:54.160 |
um, better token efficiency than sft, this is understood. uh, they have a gym-like extensible framework that 00:49:02.560 |
facilitates rl across a wide range of scenarios uh tasks with verifiable rewards subjective preferences 00:49:10.160 |
such as creative writing open-ended question answering we introduce a self-critique reward in which models perform 00:49:16.240 |
pairwise comparisons to judge its own outputs it's like similar to grpo right grpo you output a bunch of 00:49:22.960 |
outputs and then you do preference over the group of outputs to what's the best uh this instead is you 00:49:30.640 |
know you have a self-critique over your output and you judge your own uh output so kind of interesting 00:49:37.600 |
on how they do rl for creative writing and open-ended questions uh verifiable rewards and gym so i think we can 00:49:43.840 |
skip some of this for math and stem you know pretty straightforward complex instruction following is 00:49:49.360 |
this what you're talking about eugene oh no slightly below okay uh we will get it yeah okay so for hybrid 00:49:57.680 |
rule verification there's um you know evaluation of code interpreters lm is a judge okay multi-sourced 00:50:04.640 |
instruction generation complex prompts faithfulness uh this was kind of cool they have this um there's 00:50:13.040 |
this framework of facts grounding — there's a sentence-level classifier judge that's prompted, that's, uh, 00:50:20.480 |
useful to perform automated verification. basically it detects sentences that make factual claims without 00:50:27.440 |
supporting evidence in the context, and they have this as a bit of the reward in rl. so, faithfulness — 00:50:34.400 |
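a minimal sketch of a sentence-level faithfulness check in that spirit — split the response into sentences and ask a judge whether each claim is supported by the context; `judge_llm`, the prompt, and the scoring are made up:

```python
import re

def faithfulness_reward(response, context, judge_llm):
    """fraction of sentences the judge considers supported by the context;
    unsupported claims pull the reward down."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sent in sentences:
        verdict = judge_llm(
            f"context:\n{context}\n\nsentence:\n{sent}\n\n"
            "does this sentence make a factual claim that is NOT supported by the context? "
            "answer 'unsupported' or 'ok'.")
        if "unsupported" not in verdict.lower():
            supported += 1
    return supported / len(sentences)
```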
and then coding and software engineering. um, again, this part is cool, right, in the sense that they rephrase the question but 00:50:44.400 |
they still care so much about, uh, factual verification of the responses — they are 00:50:50.880 |
putting so much effort into this, as well as the coding, as well as adversarial prompts for their 00:50:56.240 |
safety aspects as well safety and faithfulness yeah it's so interesting to see what are the key 00:51:03.200 |
things and there are some points where they say that the key challenge is faithfulness reducing 00:51:07.200 |
hallucinations and scaling this so essentially they actually write very clearly these are the 00:51:11.760 |
challenges we are facing right now um if you if you read between the lines yeah um this was cool 00:51:19.840 |
basically e2b style or daytona style sandboxing very very robust sandbox infrastructure uh so pr's and 00:51:26.880 |
issues from github to build software development environments that consist of user prompt issues and 00:51:31.680 |
executable unit tests — so real github stuff that has unit tests. uh, this environment was built on a 00:51:38.560 |
robust sandbox infrastructure powered by kubernetes for scalability and security it supports over 10 000 00:51:44.080 |
concurrent sandbox instances while with stable performance they just they just cranked that 00:51:50.080 |
out real quick safety they have uh you know attack model that generates adversarial prompts a target 00:51:56.880 |
model that produces responses and then a judge that determines if they could bypass the uh safety 00:52:03.280 |
mechanism each iteration is assessed with a rubric and then it's used in the reward uh 10k concurrent agents 00:52:11.040 |
it's crazy right okay beyond verification self-critique rubric reward so self-critique rubric reward is what 00:52:17.120 |
they did uh general rl with self-critique it evaluates its own outputs to generate preference signals 00:52:23.200 |
um it initializes critique ability in the sft stage since we're kind of out of time i'm gonna like really 00:52:30.960 |
really quickly do the rest of this and then and then we'll stay for discussion but i guess if anyone has 00:52:35.760 |
any questions um i guess now's your chance if you have to help or if eugene you want to add in maybe 00:52:40.800 |
i'll just focus on the last paragraph on page 12 here. um, if you read this, right, the critic model is 00:52:48.640 |
refined with verifiable signals — so they fine-tune the critic model with the verifiable signals, and they continuously 00:52:54.240 |
update this critic. this transfer learning process grounds subjective judgments, 00:53:04.160 |
allowing the performance gains from verifiable tasks to enhance the critic's judgment on complex tasks lacking 00:53:10.000 |
explicit rewards. that's crazy, in the sense that you can fine-tune — essentially you can bootstrap yourself 00:53:15.680 |
on verifiable reward to train a reward model for non-verifiable reward um how that transfers i don't 00:53:22.400 |
know i don't know what the intuition is on that but i think there's a lot in this paragraph here 00:53:26.400 |
on reward modeling and evaluation, and the critic just keeps getting better and better, right — they 00:53:33.520 |
update the critic as well, and you know, it recalibrates its evaluation standards in lockstep with the 00:53:41.280 |
evolving policy. so if you look at this, right — right now we are solving verifiable reward like code 00:53:45.920 |
and software, right, and the question we always have is: would this generalize to non-verifiable reward? i think 00:53:51.120 |
this paragraph here says that maybe it would, uh, in the sense that at least verifiable reward is 00:53:58.160 |
helping non-verifiable reward evaluation, so we start from there. 00:54:03.680 |
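roughly how i read the self-critique setup: sample several responses, have the model (as its own critic, guided by rubrics) do pairwise comparisons between them, and turn win-rates into the preference signal. `critic_llm` is a placeholder and this is a sketch, not their training loop:

```python
from itertools import combinations

def self_critique_scores(prompt, responses, critic_llm, rubric):
    """pairwise self-judging: each sampled response is scored by its win-rate
    against the other samples under the rubric; win-rates become the preference signal."""
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        verdict = critic_llm(
            f"rubric: {rubric}\nprompt: {prompt}\n\n"
            f"response A:\n{responses[i]}\n\nresponse B:\n{responses[j]}\n\n"
            "which response is better? answer 'A' or 'B'.")
        if verdict.strip().upper().startswith("A"):
            wins[i] += 1
        else:
            wins[j] += 1
    n_opponents = max(len(responses) - 1, 1)
    return [w / n_opponents for w in wins]
```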
very interesting. uh, the rl has slight modifications to the policy optimization — so they have their muon optimizer 00:54:17.040 |
several additions budget control so they say um properly distributed inference budget we enforce 00:54:25.280 |
so, um, they kind of shit on reasoning models — rl often results in a substantial increase in the length of 00:54:31.440 |
model-generated outputs. um, now even though that gives improvement in performance, these benefits often do 00:54:40.160 |
not justify its inference costs in non-reasoning domains so for general stuff you don't want this to 00:54:45.520 |
encourage the model to properly distribute its inference budget. first, we enforce a per-sample maximum 00:54:52.880 |
token budget throughout rl training, and the budget is determined based on the type of task — so 00:54:59.600 |
for some tasks you can reason a little more. responses that exceed this token budget are truncated and assigned a 00:55:05.520 |
penalty; this is all done in the formula there, check it out. um, it incentivizes the model to generate solutions within the 00:55:14.080 |
limit and enhances its token efficiency — concise and effective solutions. 00:55:21.760 |
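a tiny sketch of the shape of that budget penalty — the real thing is in their rl objective; the budgets and penalty value here are made up:

```python
BUDGETS = {"math": 4096, "general_chat": 1024}    # made-up numbers; the point is the budget depends on task type

def apply_token_budget(task_type, tokens, raw_reward, overrun_penalty=1.0):
    """truncate responses that blow their per-sample budget and subtract a penalty
    from the reward, so the policy learns to stay within the limit."""
    budget = BUDGETS.get(task_type, 1024)
    if len(tokens) <= budget:
        return tokens, raw_reward
    return tokens[:budget], raw_reward - overrun_penalty
```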
ptx loss — this just keeps it grounded in previous stuff. temperature decay: when it starts out you want high temperature so it can 00:55:27.840 |
explore different stuff then uh you tone it down rl infrastructure um i don't think we have time for a lot 00:55:36.560 |
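a rough sketch of those two tweaks (the budget numbers, penalty value, and decay schedule below are assumptions, not the paper's; the actual formula is in the report):

```python
# illustrative per-task token budgets with a truncation penalty, plus a temperature schedule
TOKEN_BUDGETS = {"chat": 512, "math": 2048, "agentic": 4096}     # per task type (made-up values)

def budgeted_reward(task_type: str, response_tokens: int,
                    base_reward: float, penalty: float = 1.0) -> float:
    # responses over budget are truncated and penalized, pushing the policy
    # toward concise solutions that land inside the limit
    if response_tokens > TOKEN_BUDGETS[task_type]:
        return base_reward - penalty
    return base_reward

def sampling_temperature(step: int, total_steps: int,
                         t_start: float = 1.0, t_end: float = 0.6) -> float:
    # high temperature early for exploration, decayed linearly as training proceeds
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + (t_end - t_start) * frac
```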
rl infrastructure: i don't think we have time for a lot of this. efficient engine swapping, efficient system startup, agentic rollout. the rl infrastructure supports 00:55:44.640 |
training on long-horizon multi-turn agentic tasks, which present distinct challenges, and they describe two strategies 00:55:51.760 |
to deploy heavy environments. okay, evals. it looks like all the evals are done at 4k context. yeah, sota, very 00:56:05.200 |
sota. i think that's enough. red teaming: they do some red teaming here. limitations: when dealing 00:56:13.280 |
with hard reasoning tasks or unclear tool definitions the model may generate excessive tokens, sometimes 00:56:18.160 |
leading to truncated outputs or incomplete tool calls. basically they have it in their rl that if it's 00:56:24.800 |
too long they just truncate and it's cooked, and that can lead to it wasting its context and then 00:56:32.080 |
cutting off mid tool call. additionally, performance may decline on certain tasks if tool use is 00:56:37.680 |
unnecessarily enabled. and when building a complete software project, the success rate of one-shot 00:56:41.920 |
prompting is not as good as using k2 in their agentic coding setup. yeah, that's the paper, sorry. 00:56:48.320 |
is anyone familiar with their rl formula? what rl variant is it based on? 00:56:54.800 |
let me check, i thought i knew this. i think they said it's based on kimi 1.5; i 00:57:01.040 |
haven't had a chance to go through that. okay yeah, it is 1.5, but what is kimi 1.5 based on? 00:57:09.440 |
it's similar: in the reward the formula takes the reward minus the mean reward, so you can go back to grpo. 00:57:17.280 |
yes, it has grpo in it, but it doesn't go through the partial sequence, right? grpo actually 00:57:23.680 |
goes through the partial sequence and this one doesn't, just the final reward. okay, thank you, it looked grpo-esque. 00:57:32.400 |
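for reference, the group-mean baseline being described looks roughly like this (a sketch of the general GRPO-style idea with only a final, sequence-level reward per rollout; not the exact K1.5/K2 objective):

```python
# advantage = reward minus the group mean, with no partial-sequence credit
from statistics import mean, pstdev

def group_advantages(rewards: list[float], normalize: bool = True) -> list[float]:
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    if normalize:
        std = pstdev(rewards) or 1.0        # avoid dividing by zero when all rewards match
        advantages = [a / std for a in advantages]
    return advantages

# e.g. four rollouts for one prompt, scored by a verifier or the critic
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```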
any other thoughts, questions, comments? so i was wondering, for the 10,000 agents, it seems a bit 00:57:42.800 |
excessive. is it the fact that all these agents are doing something a little bit different? 00:57:47.760 |
like, how does that work? do you mean the sandbox execution? no, given that you 00:57:55.920 |
have so many tools, right. oh, it's 10,000 concurrent sandboxes. oh okay, sorry, which i extrapolated to 00:58:02.480 |
10,000 agents running; it may be the same agent, maybe. i think no, i think this is just: when 00:58:07.920 |
you're training at a batch size of like 67 million or whatever, you need a way to run your verification 00:58:16.080 |
and stuff, right. you need to be able to let this thing call tools, and when you're doing 00:58:23.360 |
rollouts you're testing how well it can use tools and interact with the environment, so you have to keep 00:58:28.640 |
state for these environments. now you need a bunch of environments running, so this is 00:58:34.800 |
for that. it's not 10,000 agents swarming out to solve stuff. maybe more like figure 00:58:42.560 |
nine, right, where each dot is a different tool, is that correct? so basically the 00:58:48.800 |
synthetic tools, right. yeah, you just had it right there, the one with a bunch of colors. yeah, right there, so figure 00:58:54.880 |
nine. yes, figure nine. so in figure nine they create a lot of synthetic tools, and i was wondering, 00:59:00.720 |
does that diversity of different tools really help out that much? like, are some of these doing 00:59:07.840 |
redundant stuff? that's my question. yes, they're definitely doing redundant stuff, because i mean, 00:59:14.080 |
how many categories do you have here? you have 10,000 tools, and how many colors do we see? i see maybe 30 or 40 00:59:20.080 |
colors. so yes, tools are doing similar categories of stuff, but it's still useful: even if you have a tool 00:59:29.440 |
in the same category, it has different descriptions, different ways to use it, different 00:59:35.200 |
steps. and their rubric isn't only on full task completion, so they have 00:59:41.920 |
checks in the middle for partial completion; as long as it's using the right 00:59:47.600 |
tool and the right steps, that's still a good output. but i think their sandbox is basically just 00:59:56.320 |
keeping their environment online, allowing these tool calls and letting these different environments keep state. 01:00:02.800 |
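as a toy illustration of why you'd want thousands of concurrent sandboxes (purely a sketch; the environment and rollout APIs here are made-up names, not Kimi's infrastructure): each rollout needs its own stateful environment kept alive for the whole multi-turn episode.

```python
# toy sketch: many stateful tool-use rollouts in flight at once, one sandbox per rollout
import asyncio

class Sandbox:
    def __init__(self, env_id: int) -> None:
        self.env_id = env_id
        self.state: dict = {}                       # persists across tool calls in one episode

    async def call_tool(self, name: str, args: dict) -> str:
        await asyncio.sleep(0.01)                   # stand-in for real execution latency
        self.state[name] = args
        return f"result of {name}"

async def rollout(policy, task: dict, env_id: int) -> float:
    sandbox = Sandbox(env_id)
    for _ in range(task["max_turns"]):
        tool, args = policy.next_action(task, sandbox.state)
        await sandbox.call_tool(tool, args)
    return task["rubric"](sandbox.state)            # score the whole episode against its rubric

async def run_batch(policy, tasks: list[dict]) -> list[float]:
    # thousands of these can run concurrently, each keeping its own environment state
    return await asyncio.gather(*(rollout(policy, t, i) for i, t in enumerate(tasks)))
```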
i guess my question is, did they specifically want to add so much redundancy 01:00:10.160 |
because you want the model to better generalize to tool usage? because if i have this 01:00:16.880 |
diversity of stuff which does similar things, how do i know which tool to even pick? so i'm trying to 01:00:22.080 |
understand the reasoning there. yeah, they're predetermined. in their data gen 01:00:27.600 |
pipeline they create agents that have a specified set of tools they're allowed to use. so i think, one, yes, 01:00:36.480 |
you want diversity and being able to generalize and use different tools, but two, when they generate the 01:00:42.480 |
agents themselves they're given different subsets. yeah, i'll try to find the quote. you're 01:00:52.640 |
saying that they only present a certain number of tools, right, so that makes sense, they're not given 01:00:57.280 |
everything. yeah, gotcha. so, quoting the paper: we generate thousands of distinct agents by synthesizing various 01:01:04.000 |
system prompts and equipping them with different combinations of tools from our repository. so it's not 01:01:09.760 |
like they have access to all 10,000; you probably have an agent that does something and has access to 01:01:13.760 |
like four tools, with varied capabilities, areas of expertise, and behavioral patterns. 01:01:22.080 |
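a minimal sketch of that agent-synthesis step (illustrative only; the tool repository, sample sizes, and rubric fields are assumptions, not the report's exact pipeline):

```python
# each synthetic agent gets a system prompt plus a small sampled subset of tools,
# and tasks with explicit rubrics are generated for it
import random

TOOL_REPOSITORY = [f"tool_{i}" for i in range(10_000)]          # stand-in for the tool registry

def synthesize_agent(domain: str, n_tools: int = 4) -> dict:
    tools = random.sample(TOOL_REPOSITORY, n_tools)             # each agent sees only a few tools
    return {
        "system_prompt": f"You are a {domain} assistant with tools: {', '.join(tools)}.",
        "tools": tools,
    }

def synthesize_task(agent: dict, difficulty: str) -> dict:
    return {
        "agent": agent,
        "difficulty": difficulty,                               # tasks range from simple to complex
        "rubric": {
            "success_criteria": "final state satisfies the user request",
            "expected_tool_use": agent["tools"][:2],
            "evaluation_checkpoints": ["right tool selected", "intermediate result valid"],
        },
    }

agents = [synthesize_agent(d) for d in ("finance", "travel", "devops")]
tasks = [synthesize_task(a, diff) for a in agents for diff in ("simple", "complex")]
```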
yeah, thanks a lot for the amazing walkthrough. i just wanted to go back to the point about whether 01:01:32.160 |
this is a reasoning or non-reasoning model. i think this one is non-reasoning. one of the 01:01:39.120 |
gripes i have with the recent reasoning models is they take too long to give you the answer, 01:01:44.320 |
and there is an argument about whether we can justify that increase in cost and latency, 01:01:48.880 |
whether the increase in performance can justify the increased cost and latency. but for this model, if 01:01:54.400 |
you ask it to translate a document from english to french it will give you the french 01:01:59.920 |
translation right away; it will not think for 500 or a thousand tokens. 01:02:03.360 |
i think that's part of it. the other approach there is hybrid reasoning and non-reasoning models, 01:02:12.800 |
right, where you can turn reasoning on and off. so the claude models have thinking and 01:02:19.760 |
non-thinking modes, and smollm3 is a specific recipe to train a hybrid reasoning model: basically 01:02:27.920 |
in their rl stage with grpo they mix in a subset of reasoning and non-reasoning data 01:02:34.800 |
and reward them equally. but i do agree, i really think 01:02:42.320 |
they're kind of pushing the cost onto you; for basic stuff i don't want to pay extra tokens. and 01:02:48.560 |
the proper gauge for models right now is no longer price alone. before, 01:02:57.280 |
it used to be the price-to-intelligence ratio, how much do i pay for this level of intelligence; 01:03:02.720 |
it's very skewed now, right, a model can have intelligence that's very high but it could take 01:03:09.360 |
twice the cost in terms of the number of tokens to get there, the number of tokens to get that intelligence. 01:03:14.080 |
it's variable. what would be interesting now, i think, is if some benchmarks could include 01:03:20.880 |
how many tokens were used in reasoning for specific questions. i don't know a great way to 01:03:27.520 |
measure this, i haven't really thought about it too much, but you know what i mean: cost-to-performance 01:03:34.800 |
is no longer as relevant because your performance involves a variable amount of thinking for different 01:03:40.320 |
models. some models are very verbose in their thinking; the deepseek update basically got it 01:03:45.280 |
to think for twice as long, which could also mean charging you twice as much. but 01:03:51.360 |
on the other hand it's a significant performance increase, right, so are you 01:03:58.000 |
really gonna make that trade-off? i think in general, even though it was forced on people, 01:04:01.840 |
o1 was the first reasoning model and you didn't have a choice, right, it's not hybrid: 01:04:05.840 |
if you want reasoning you're gonna use full reasoning, and people were fine waiting an extra minute for 01:04:10.880 |
performance, and the costs always come down. but that's just high-level thinking, in my opinion; 01:04:18.640 |
i haven't given it much more thought. 01:04:24.720 |
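to make the point about price-to-intelligence being skewed by token count concrete, a tiny illustrative calculation (every price and token count below is invented, not real model pricing):

```python
# effective cost per answer depends on tokens spent, not just the per-token price
def cost_per_query(output_tokens: int, price_per_million_tokens: float) -> float:
    return output_tokens / 1_000_000 * price_per_million_tokens

concise_model  = cost_per_query(output_tokens=500,   price_per_million_tokens=2.0)   # $0.001
thinking_model = cost_per_query(output_tokens=3_000, price_per_million_tokens=2.0)   # $0.006

# same nominal price per token, but the thinking model costs 6x per answer here,
# which is why benchmarks arguably should also report tokens spent per question
print(concise_model, thinking_model)
```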
yeah, that's a good argument about cost. any other thoughts, comments, questions, anything i 01:04:32.400 |
missed in this paper? sorry, we yapped too much in the beginning, i didn't realize it would take so 01:04:37.280 |
long, i thought it would be a quick one. it's quite a dense paper, lots of stuff to go through. 01:04:46.480 |
yeah, i think their muonclip stuff is pretty interesting, but there are only a few 01:04:54.160 |
takeaways. they do some cool scaling laws, and basically they do individual attention-head-level 01:05:00.880 |
clipping during optimization, and that leads to really cool scaling. the bigger takeaway of 01:05:07.760 |
this is that it's very expensive for labs to do big training runs, and loss spikes are a big detriment, 01:05:17.200 |
because if something gets cooked all your expensive infrastructure is sitting useless and you 01:05:23.120 |
might have to roll back, restart, and reload weights, and that's a huge delay on a lot of gpus. so 01:05:29.760 |
stability in training is quite huge; that's the big takeaway. and check 01:05:38.480 |
out last week's paper, we kind of dove into the optimizer. frankie, you have your hand up? oh yeah, i think 01:05:44.240 |
just following up on the comment from eugene on that paragraph about the self-critique closed loop. i'm wondering, 01:05:51.280 |
again, how does it learn from giving subjective rewards? so i'm wondering, does it 01:05:58.400 |
have to do with the fact that they're critiquing on the verifiable signals? 01:06:03.440 |
there are things that are definitely verifiable, but because you have things which are not verifiable, 01:06:08.960 |
do you think that the model is kind of learning from the steps that were taken, kind 01:06:14.320 |
of like the chain of thought or whatever? learning from the steps, like if it looks like you're 01:06:18.640 |
doing good steps then i can reward you better, because you 01:06:24.800 |
only have a subjective reward. i just wanted to see how you guys thought about 01:06:29.200 |
that. yeah, kind of. in their rubrics they have criteria for potential steps, so given 01:06:42.800 |
this, let me try to find it, given this prompt or question and this set of tools, here, 01:06:53.200 |
i think this is it: combining rlvr with the self-critique rubric reward, the model learns not only 01:06:59.920 |
from externally defined tasks but also from evaluating its own outputs, extending alignment beyond verifiable domains. basically, 01:07:06.640 |
in the rubric, if you're using the right tools and getting the right steps, you're 01:07:12.800 |
rewarded for that too. but i think we can do a bit of a follow-up on this. for each agent we generate 01:07:19.040 |
tasks that range from simple to complex, and each task is paired with an explicit rubric that specifies the 01:07:24.400 |
success criteria, expected tool use, and evaluation checkpoints. this rubric-based approach 01:07:31.040 |
ensures consistent, objective evaluation of agent performance. where else is this mentioned? 01:07:35.840 |
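a rough sketch of how such a rubric could be scored with partial credit for intermediate checkpoints (purely illustrative; the field names and weights are assumptions, not the paper's schema):

```python
# episodes earn reward for expected tool use and intermediate checkpoints,
# not only for full task completion
def score_episode(rubric: dict, episode: dict) -> float:
    score = 0.5 if episode["succeeded"] else 0.0                     # full completion
    expected = set(rubric["expected_tool_use"])
    if expected:
        score += 0.25 * len(set(episode["tools_used"]) & expected) / len(expected)
    checkpoints = rubric["evaluation_checkpoints"]
    if checkpoints:
        hit = sum(cp in episode["checkpoints_hit"] for cp in checkpoints)
        score += 0.25 * hit / len(checkpoints)
    return score                                                     # in [0, 1], used as the episode reward

rubric = {"expected_tool_use": ["search", "calculator"],
          "evaluation_checkpoints": ["query formed", "result parsed"]}
episode = {"succeeded": False, "tools_used": ["search"], "checkpoints_hit": ["query formed"]}
print(score_episode(rubric, episode))                                # partial credit without full success
```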
expert-crafted conditional prompts and rubrics developed by our data team. 01:07:44.880 |
oh, this is kind of interesting, how they developed this agentic instruction augmentation for fine-tuning. 01:07:51.120 |
okay: each iteration is assessed using a task-specific rubric, enabling the judge model to provide binary 01:07:58.880 |
success/failure signals, so that's for success. then beyond verification, self-critique: 01:08:04.800 |
the self-critique reward, where the model evaluates its own outputs; the initial critique ability is built in the sft stage. 01:08:13.520 |
eliminating reward hacking, clarity, relevance: so this is some of it, the core rubrics for rl. clarity and 01:08:23.120 |
relevance, so this is the rubric used in the rl rollouts, 01:08:29.760 |
assesses the extent to which the response is succinct while fully addressing the user's 01:08:36.400 |
intent, with a focus on eliminating unnecessary details, staying aligned with the central query, and 01:08:43.600 |
efficient formats such as brief paragraphs or compact lists. then conversational fluency and 01:08:51.040 |
engagement, whether it's coherent. so basically they have an rl reward for stuff like clarity, 01:08:59.200 |
conciseness, staying on topic, objective groundedness, like are you objectively answering the question, perspectives, and so on. 01:09:05.840 |
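a small sketch of turning those general-RL rubric dimensions into one scalar reward per rollout (dimension names and weights are illustrative; the judge is a placeholder), which then feeds the group-relative comparison discussed next:

```python
# combine rubric-dimension scores from the self-critique judge into one reward
RUBRIC_WEIGHTS = {
    "clarity_and_relevance": 0.4,
    "conversational_fluency": 0.3,
    "objective_groundedness": 0.3,
}

def judge_dimension(prompt: str, response: str, dimension: str) -> float:
    # placeholder for the self-critique judge scoring one dimension in [0, 1]
    return 0.5

def rubric_reward(prompt: str, response: str) -> float:
    return sum(w * judge_dimension(prompt, response, d) for d, w in RUBRIC_WEIGHTS.items())

# rewards for a group of rollouts on the same non-verifiable prompt
rewards = [rubric_reward("rewrite this email politely", r) for r in ("draft a", "draft b", "draft c")]
```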
so you think that these rubrics kind of guide it, help it decide what's 01:09:14.800 |
good or bad, is that what you're saying? yeah, within the group of outputs, because it's based on grpo, right, 01:09:23.840 |
you're still giving a reward even though you're not verifying. well, in a sense you are verifying, 01:09:29.360 |
because in a group of rollouts you're giving reward to the ones that match these rubrics. so you give a 01:09:36.560 |
reward for stuff that's clear, a reward for stuff that's fluent, a reward for stuff that's 01:09:42.000 |
objective and justified, and you're explicitly adding that to your rl policy. so 01:09:50.080 |
it's not verifying on stuff like, okay, the output is a verified boxed math answer or code that 01:09:58.240 |
compiles, but you do have a rubric that verifies for this. i think the interesting thing is that they have a 01:10:03.920 |
policy model that iteratively learns better critiquing and can self-critique, and that makes this possible, 01:10:13.360 |
because otherwise this is essentially just an llm as a judge as your rl reward on outputs, 01:10:19.760 |
and we know that rl is very, what's the term i'm looking for, 01:10:31.120 |
i'm blanking on the word, but the signal has to be concise, 01:10:40.240 |
right, you can't have noise in your rl signal otherwise you won't generalize. there's a term for 01:10:46.320 |
this that i'm blanking on. and you would assume that just having a basic llm as a judge on your outputs 01:10:53.200 |
for non-verifiable stuff, like is this good, is this concise, wouldn't work, but this kind of shows that if 01:10:59.760 |
you keep your... okay, come here, okay, yes, read this section. okay, but what 01:11:13.760 |
makes you think that it is kind of improving its critique capability as it progresses? it says it 01:11:21.040 |
in one section; that section that eugene highlighted is literally about keeping it 01:11:30.320 |
as a self-closed loop, right. yeah, the closed-loop thing, so that's how i know. 01:11:38.880 |
okay, thanks. cool, any other stuff? yeah, the closed loop: our critic model is refined using verifiable signals, 01:11:47.040 |
on-policy rollouts generated from verifiable rewards continuously update the 01:11:53.040 |
critic, a crucial step that distills objective performance signals from rlvr directly. they even 01:11:58.400 |
talk about this in the infrastructure section, about loading and moving these weights around. this closed- 01:12:05.040 |
loop process ensures that the critic continuously recalibrates its evaluation standards in lockstep with 01:12:11.280 |
policy evaluation. so this is what eugene was saying, right, this closed-loop stuff lets you do rl beyond 01:12:16.720 |
verifiable rewards: basically you have a good rubric and you iteratively get better. thank you. 01:12:24.240 |
cool, guys, we're 20 minutes over, so i think that's where i call it. if anyone wants to volunteer for 01:12:31.440 |
next week that would be sick; find a cool paper. we share a bunch of papers in discord, i have like 01:12:38.400 |
four or five that i'd recommend, but if anyone wants to present, that would be sick. 01:12:51.280 |
thanks. you're speaking but muted. all right, next week guys, i'll see you.