GLM 4.5: Agentic Coding & Reasoning Foundation Model

Chapters
0:00 Introduction to the presenter and the GLM 4.5 paper
2:11 Overview of GLM 4.5 as a Mixture of Experts model
4:06 Defining ARC: Agentic, Reasoning, and Coding capabilities
5:41 Pre-training architecture and the Muon optimizer
6:54 Data collection, de-duplication, and context window extension
11:31 Mid-training with repository-level code and synthetic data
25:08 Post-training: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)
31:22 Tool calling and agentic data synthesis
36:52 Reasoning RL with curriculum learning
39:58 Dynamic temperature and token-weighted mean loss
49:12 Self-distillation process for unifying the model
59:16 Reward function calculation and handling infinite loops
61:12 Emergent capabilities: surprising translation performance
00:00:07.200 |
um, can you see my screen? i can see your screen. the audio could be better if you can switch 00:00:18.000 |
your zoom audio. don't use airpods for microphones, airpods are very bad microphones. 00:00:24.480 |
okay. even though it looks like i'm using my airpods in my ear, i actually always 00:00:30.480 |
switch the zoom to the laptop microphone, it's much better. it should be in the zoom audio picker. 00:00:41.760 |
no, i don't even like airpods max, i've tried it, it's not great. 00:00:46.480 |
is it any better now? yeah, much better. oh sweet, sweet. 00:00:53.920 |
okay, yeah. just introduce yourself, because you're a new presenter, 00:00:57.600 |
let's get to know about you and why you're interested. 00:01:00.240 |
i was working as a back-end engineer. my name is venki, by the way, nice to meet you all. 00:01:08.160 |
i took a sabbatical to understand the current ecosystem and to 00:01:15.120 |
experiment more with the ai tools and also understand the fundamentals, so i took a break. 00:01:22.160 |
i'm trying to read papers, but so far the self-study isn't working out for me, 00:01:28.720 |
but i'm still trying to make sure i get something positive out of this sabbatical. 00:01:34.000 |
this is my first time presenting a paper, i'm not sure how it will go, but 00:01:39.680 |
i feel like i put a lot of attention into understanding the details because i'm presenting it, 00:01:45.360 |
otherwise i would usually skip through them. yeah, so it's a good 00:01:52.480 |
decision for me to volunteer for this. agreed. okay, if you want you can make your paper full 00:01:59.680 |
screen and then we can get into it. okay, yeah, so should i give an overview first? yeah, yeah. 00:02:13.040 |
so this looks like a mixture-of-experts transformer for the base model, and 00:02:20.400 |
what they have done differently is they want to make sure this model has 00:02:25.040 |
arc capabilities: agentic, reasoning and coding abilities. so what they have done is, in 00:02:30.880 |
post-training they used a two-stage approach to train this model: in stage one 00:02:38.080 |
they train an expert model for each specific task, agentic, reasoning and 00:02:44.800 |
coding, and then once they have those expert models they distill these capabilities into one 00:02:51.520 |
unified model. that's why this model seems to have all three capabilities, and 00:02:59.520 |
this model is able to switch between modes, like a long thinking mode 00:03:06.640 |
and a quick response mode. i haven't actually used the model much, but i asked claude to 00:03:14.000 |
generate some tasks and prompts to verify these capabilities, so maybe after this talk we can see 00:03:21.280 |
if this model is actually looking good with those tasks. but basically this is a 00:03:29.120 |
mixture of experts, and these are some metrics they provided for the arc capabilities. 00:03:43.600 |
i'll move to the second yeah i feel like arc is a little bit um co-opted by arcgi but arc has a 00:03:53.360 |
definite definition which i think you're showing on screen right now right uh oh yeah the arc challenge 00:04:00.720 |
true so they these people like they say arc uh to define like uh if the model has agentic or reasoning or 00:04:08.160 |
coding capabilities all together in one uh one model instead of using multiple models to achieve uh 00:04:16.160 |
different performances uh so they clearly define the ambition here 00:04:20.800 |
yeah, this seems like the main drive behind this model training, to have 00:04:33.600 |
all these capabilities in a single model. the paper mostly quickly goes through 00:04:44.400 |
what they have done, and then also the rl training pipeline and how they approach data 00:04:52.480 |
synthesis for training, etc. these are a bunch of metrics they've mentioned here, and 00:04:59.040 |
they have a smaller model as well, but they haven't mentioned much about the smaller model apart 00:05:04.880 |
from, i'm assuming, it's a distillation. these are a bunch of metrics for how it performed 00:05:14.000 |
on various benchmarks. i tried to see if they made the training loop available; i don't think 00:05:25.440 |
they've made the training loop available as open source, i think only the weights are available. 00:05:30.400 |
so yeah, architecture: the fundamental architecture is a mixture of experts. i feel like the only thing 00:05:48.240 |
they changed is they used the muon optimizer instead of adam. i'm not really sure if adam is the 00:05:54.880 |
standard, but they're using muon, they mentioned this down somewhere here, yeah. 00:06:02.880 |
so they used the muon optimizer, and then what else have they changed? 00:06:13.760 |
yeah, so for context: basically for the past few years the adam and adamw optimizers have been 00:06:21.200 |
very much the standard defaults, there hasn't been much progress, like adam came, then adamw. 00:06:26.880 |
the recent kimi k2 paper, the big model that came out after deepseek, 00:06:35.840 |
they switched to a muonclip optimizer, it's like better training stability, they have a very pretty 00:06:41.520 |
loss curve. so that's kind of the background on the optimizer. 00:06:44.960 |
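For a concrete picture of what switching from AdamW to a Muon-style optimizer means, here is a minimal sketch of the public Muon recipe (orthogonalized momentum via a Newton-Schulz iteration); the coefficients and hyperparameters below are illustrative and not taken from the GLM 4.5 paper.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # quintic newton-schulz iteration that approximately orthogonalizes a 2d matrix
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    # muon-style update: sgd momentum, then orthogonalize the 2d update matrix
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    param.data.add_(update, alpha=-lr)
```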
and one thing they have done, this is the general pattern they used: they 00:06:55.200 |
started with a 4k context window, then they increased the context window to 32k, and afterwards to 128k. 00:07:03.200 |
and they mentioned how they collected their data. so basically they used 00:07:10.640 |
web crawling to get english and chinese web pages, and then they used a quality metric. 00:07:17.440 |
here they mention they divide the crawled web pages into buckets with different quality scores, 00:07:24.320 |
and then they're upsampling the documents, so that the base data is of 00:07:30.640 |
much higher quality. and then they were using semdedup to deduplicate 00:07:40.560 |
the data, to remove duplicate data. this section basically describes what they have done with the data pipeline. 00:07:52.160 |
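As a rough illustration of what a SemDedup-style pass does (a toy version of the general idea, not the paper's actual pipeline): embed the documents, cluster the embeddings, and drop near-duplicates within each cluster by cosine similarity.

```python
import numpy as np
from sklearn.cluster import KMeans

def semdedup(embeddings, n_clusters=4, threshold=0.95):
    # toy semdedup-style pass: cluster document embeddings, then drop
    # near-duplicates within each cluster by cosine similarity
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embs)
    keep = []
    for cluster in range(n_clusters):
        members = np.where(labels == cluster)[0]
        kept = []
        for i in members:
            # keep a document only if it is not too close to one we already kept
            if all(float(embs[i] @ embs[j]) < threshold for j in kept):
                kept.append(i)
        keep.extend(kept)
    return sorted(keep)
```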
and one thing they said is they used grouped-query attention with partial rope in the 00:07:58.960 |
attention block, and they also mentioned they use more attention heads. one of the claims was 00:08:05.920 |
that having more attention heads, and going deeper instead of wider, deeper 00:08:14.320 |
over wider, increases the reasoning capabilities. 00:08:22.720 |
and venki, before you go forward, you said something at the beginning 00:08:28.960 |
of this discussion, i just want to make sure i got this right: did they train three 00:08:38.240 |
separate models, one each for agentic, reasoning and coding, and then distill all those three 00:08:47.120 |
models into one model, is that how they approached this? so for the base model they used the same 00:08:53.280 |
pre-training; once they have the pre-trained base model, in the post-training they 00:09:00.160 |
created three different models. and did they then distill those into one? 00:09:10.960 |
yeah, so in stage one they used expert training and then in stage two they unified. 00:09:25.200 |
on the previous note about attention heads and wide versus deep, i think it's been 00:09:31.280 |
interesting with the other few recent papers. so, once again, the kimi paper talks about 00:09:37.040 |
how they use fewer attention heads because at very long context it's like half the 00:09:43.200 |
memory bandwidth that's required, and then openai's open-source model talks about how they 00:09:50.320 |
go for a wider moe as opposed to a deeper one and that helped with training stability and 00:09:55.680 |
stuff as well. so it's interesting, these little notes that normally you just take at face 00:10:00.320 |
value, you're starting to see some differences between models and why they pick what they pick. 00:10:13.040 |
so in pre-training they have multiple stages, and they use different data and also 00:10:19.360 |
different context sizes. one thing they have done with code: they mostly use github 00:10:25.680 |
for code, and one thing i observed is they combined files from the same 00:10:36.480 |
repo to get a better understanding of the relations between different files in the code base. 00:10:41.440 |
i thought that's interesting, to have that semantic meaning embedded. 00:10:47.280 |
and also, most of the time, once they crawl the data they were using some 00:10:55.520 |
kind of model to predict the quality of a particular page, and then based on the quality 00:11:02.560 |
they sampled the high-quality data more, that seems like a common pattern they used for all the data. 00:11:08.720 |
yeah, so they mention upsampling a lot based on the quality of the content. 00:11:21.360 |
so in pre-training i believe they're trying to build a base model with high- 00:11:31.280 |
quality data, and then in the second phase, the mid-training, they were using repo-level code training. 00:11:37.360 |
basically they increased the context and they added more data, long-context and 00:11:44.800 |
agent training, and here i feel like they use a lot of synthetic data generation, 00:11:52.000 |
especially for agent tasks and for reasoning tasks. 00:12:05.200 |
they included these as well to improve the coding abilities. yeah, and they mention they 00:12:12.960 |
used a diff format, i think it helps to generate better edits, to understand or to review changes. 00:12:30.640 |
and here they mention they boosted context length from 32k to 128k, 00:12:37.200 |
and they upsample only long documents, i mean, that makes sense. 00:12:43.520 |
yeah, it's the same thing, from 4k to 32k and then to 128k. 00:12:54.000 |
i'm not sure i understand this. oh yeah, so during pre-training they didn't use, i 00:13:03.440 |
think, this best-fit packing; it basically truncates the context to a given length, so in one 00:13:10.320 |
of the stages they didn't use that, so that it becomes a good data augmentation strategy. 00:13:17.200 |
and then for mid-training they applied best-fit packing, so that, i think, they compress the 00:13:22.320 |
context into a given context window, so that it's like a summary, i think 00:13:29.280 |
some kind of summarization. frankie, you have a question? yeah, could you explain a little 00:13:37.440 |
bit what they mean by upsample? i think, yeah, i'm assuming it's like downsampling and up- 00:13:53.680 |
sampling, right: if you have multiple buckets you want to get 00:13:58.080 |
more data from a particular bucket rather than less data from other buckets. suppose you have 00:14:03.680 |
an equal distribution, if you draw a thousand samples the distribution 00:14:09.040 |
would be even, right? with upsampling i believe you put more weight on certain buckets to get the 00:14:14.880 |
high-quality data. vibhu, do you want to add anything? 00:14:19.440 |
typically that's what upsampling is. it's not just more weight to different buckets, sometimes they'll just 00:14:27.760 |
do multiple epochs, right, so for some long-context or some high-quality data it's not uncommon that 00:14:35.920 |
they do multiple passes over the same data. and downsampling as well, you know, if you have 00:14:43.360 |
a lot of generic text, filtering is a form of downsampling, right. yeah, sometimes that's just 00:14:49.680 |
the way it's done. 00:14:56.320 |
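To make the upsampling idea concrete, here is a toy sketch of quality-weighted sampling over buckets; the bucket names and weights are made up for illustration and are not taken from the paper.

```python
import random

def build_training_mix(buckets, weights, n_samples, seed=0):
    # toy quality-based upsampling: draw documents so that higher-quality
    # buckets are seen more often (possibly repeating the same docs, i.e.
    # multiple epochs of the best data)
    rng = random.Random(seed)
    tiers = list(buckets)
    tier_weights = [weights[t] for t in tiers]
    mix = []
    for _ in range(n_samples):
        tier = rng.choices(tiers, weights=tier_weights, k=1)[0]
        mix.append(rng.choice(buckets[tier]))
    return mix

# e.g. high-quality docs drawn ~6x more often than low-quality ones
example_mix = build_training_mix(
    buckets={"high": ["doc_h1", "doc_h2"], "mid": ["doc_m1"], "low": ["doc_l1"]},
    weights={"high": 6, "mid": 3, "low": 1},
    n_samples=10,
)
```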
and the best-fit packing, what does that actually do? based on my understanding, i think they're 00:15:04.400 |
trying to summarize so that it actually fits the context; i haven't verified that, i assumed it. 00:15:15.200 |
oh sorry, what's the basic procedure? i don't quite understand what they're doing. 00:15:20.160 |
yeah, i haven't actually verified what exactly they've been doing, like i haven't seen the 00:15:30.480 |
algorithm, i just assumed they're probably summarizing the 00:15:36.160 |
context to fit better. suppose you have 128k, or a lot of data, 00:15:45.680 |
it's better to have a summary so that you have the main points in the limited 00:15:51.200 |
context, so that the llms can focus better on what's going on 00:15:58.640 |
rather than the whole context. okay, and if anyone has a better explanation please chip in. 00:16:14.080 |
sorry, i have a separate question, venki. so what is the rationale, i think they're saying 00:16:20.160 |
in pre-training you begin with something smaller and then in mid-training you introduce longer context. is this kind 00:16:26.880 |
of like, you know, what do they call it, curriculum training? is that kind 00:16:33.280 |
of the thought process here, or what's the reason for doing it this way? 00:16:38.160 |
the curriculum training, i think they use it in the context of rl, right, to get better 00:16:46.240 |
rewards, but i think here it's simply extending the context so they can put 00:16:55.680 |
a lot of data into the context. i feel like at 4k they have a good 00:17:03.120 |
base, the weights are set in a better way, and then once we increase the context i believe 00:17:09.680 |
it can generalize well and the attention can be spread without 00:17:16.560 |
losing the actual value, or that's what i think they've been doing. does it have any data showing, 00:17:23.760 |
i don't know, ablations showing that if you do it mixed together it's like this, or 00:17:30.000 |
separate it like this it does something differently oh uh i can i can chime in here actually uh there's 00:17:36.800 |
two parts of this mid training right one is context length extension which is typically not what you 00:17:43.040 |
consider mid training a context extension can just happen at the very end of all your rl your post 00:17:49.120 |
training right you can just do more training on long context and that kind of works now what this is is 00:17:55.840 |
the reason it's more mid training is because you're going from regular pre-training just next token 00:18:01.120 |
prediction to some form of specialized data right so repo level code this is long context 00:18:06.720 |
full code base uh synthetic reasoning data long context agentic data right now the the fact of 00:18:14.000 |
the matter is it's not like exclusive that they're just doing this but they're also doing long context 00:18:19.280 |
but it's pretty common right you don't see like reasoning data at short context right because oftentimes 00:18:26.400 |
the queries that go into these like the system prompts themselves are already pretty long so same 00:18:31.920 |
thing with the reasoning traces right we just don't do reasoning at short context because 00:18:36.480 |
you need context length for actually doing reasoning right you're not going to do 00:18:41.600 |
reasoning at 4k context so it's mid training more so because of the type of training they're doing 00:18:47.760 |
not because of context but yeah they didn't just do it as a separate step which is like okay too like 00:18:53.040 |
long context is not super like unknown right all you do is just train on long context for like 00:19:00.560 |
a couple hundred thousand or a couple million samples and you're kind of chilling 00:19:04.400 |
and with that like you know you can just mix it in wherever but in this case it's mid training because 00:19:11.680 |
the type of data and then um that's like the first question on the other one about um best fit packing so 00:19:19.520 |
one thing you can do if you have like long context let's say you're training on like 32k context right 00:19:25.920 |
and you have a bunch of sequences that are short, like 4,000 sequence length, what you do in 00:19:31.920 |
terms of packing is you can concatenate multiple samples into one sample, right, so if i have 00:19:39.120 |
eight queries that are 4,000 sequence length, i can batch them all together and make one sequence of 32k, 00:19:45.040 |
and that's that's best fit packing that's more efficient on gpus or you could not do that and you 00:19:51.120 |
could just train normally right so it's you could do best fit packing by packing multiple short sequences 00:19:57.360 |
into one sequence, or you could not do it, and they talk about what they do here. but 00:20:02.880 |
that's what best-fit packing is, it's where you pack short sequences into longer sequences, and 00:20:08.080 |
there's pros and cons of both i think for them they say in pre-training they don't do it so it's not as 00:20:12.960 |
efficient on the gpus they just do random truncation right so sometimes different sequences and batches 00:20:18.880 |
will be different lengths and that's okay it is what it is and they talk about why they do this 00:20:23.680 |
and then in mid training they do do it because context matters right you don't want half of a 00:20:30.160 |
reasoning trace to be cut off because you're out of sequence length right you want stuff to be the 00:20:35.520 |
same length so in mid training and reasoning they do do it like you don't want to chop off a code base 00:20:40.960 |
just to be efficient in gpu and pre-training they're like it's good it's data augmentation even if you 00:20:45.760 |
don't have half of the stuff it's fine, it's more efficient on the big pre-training run. but sorry, that's 00:20:50.960 |
like two topics in one, you know, i hope that made sense. thank you. so but when they kind of 00:20:58.480 |
concatenate this, say 8k into 32k, you're not masking anything, right, on the 00:21:07.440 |
attention? the stuff you concatenate can still attend to the previous stuff, right? yeah, yeah. 00:21:12.560 |
it's weird like that. there are different techniques to do packing, 00:21:18.640 |
there are techniques of adding tokens to forget stuff before, but i don't think 00:21:25.600 |
they're doing any of this. so the real point of packing is just you keep sequences similar and you don't 00:21:32.400 |
have contamination in the training data, right? so with reasoning i don't want, in 00:21:37.760 |
the same training batch, code about one library mixed with something completely irrelevant to 00:21:44.320 |
think about; it's just beneficial if you keep it consistent in training. but if it's just regular 00:21:51.840 |
storybooks, right, let's say you have a harry potter book and half of it is added in with a 00:21:57.200 |
lord of the rings book and you're just predicting next tokens, it's fine, you know, it's 00:22:01.920 |
common. but i think, you know, that's high level what this stuff is, someone else should 00:22:08.080 |
do a little bit more research into it. i just know that it's typically done in pre-training where the 00:22:12.960 |
type of data is not as important, and you don't want to do it in reasoning. thank you. and there are 00:22:19.600 |
approaches to how you can better process stuff, and there's some paper that talks about 00:22:24.720 |
adding special tokens, like how encoder-decoders have a cls token. 00:22:47.680 |
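Here is a minimal sketch of the greedy best-fit packing idea discussed above: pack short samples into fixed-length training sequences by always choosing the open pack with the tightest remaining fit. The 32k window is just an example value.

```python
def best_fit_pack(seq_lengths, max_len=32_768):
    # greedy best-fit packing: place each sequence into the open pack whose
    # remaining space is the tightest fit, so short samples share one long
    # training sequence instead of being padded or truncated
    packs = []        # each pack is a list of sequence lengths
    remaining = []    # remaining room in each pack
    for length in sorted(seq_lengths, reverse=True):
        best = None
        for i, room in enumerate(remaining):
            if length <= room and (best is None or room < remaining[best]):
                best = i
        if best is None:
            packs.append([length])
            remaining.append(max_len - length)
        else:
            packs[best].append(length)
            remaining[best] -= length
    return packs

# e.g. eight 4k-token samples pack into a single 32k sequence
print(best_fit_pack([4_000] * 8))
```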
thank you. and also, i don't think they mention a lot of ablation studies or 00:22:55.680 |
why they have done what they did in the pre-training stage; most of their focus is on making sure the rl works, 00:23:02.160 |
rl is effective, and that they're able to improve the capabilities. so in this paper, a lot of the time 00:23:09.120 |
the charts or the graphs are basically talking about the rl pipeline, not the pre-training. 00:23:22.080 |
so they mention the hyperparameters. like i already mentioned, they used the muon 00:23:32.240 |
optimizer. i couldn't understand some of these things, 00:23:51.440 |
to be honest i haven't understood this part completely. 00:23:57.280 |
so this one is a warm-stable-decay schedule, they're changing the learning rate, 00:24:04.000 |
they mention how they change it and why, and then they also use a batch-size warm-up strategy, 00:24:11.120 |
so the batch size was increased gradually from 16 million to 64 million tokens during training. 00:24:17.440 |
and they use the muon optimizer because it seems to accelerate convergence. 00:24:30.560 |
yeah, if anyone has a better understanding, please help me out. 00:24:49.200 |
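As a rough sketch of what a warmup-stable-decay learning-rate schedule and a batch-size warmup look like; the exact fractions and endpoints below are illustrative guesses, not the paper's values.

```python
def lr_warmup_stable_decay(step, total_steps, peak_lr,
                           warmup_frac=0.01, decay_frac=0.1, min_lr_ratio=0.1):
    # toy warmup-stable-decay schedule: linear warmup, long flat plateau,
    # then a decay to a fraction of the peak for the final steps
    warmup = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    if step < decay_start:
        return peak_lr
    frac = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr * (1 - (1 - min_lr_ratio) * frac)

def batch_size_warmup(step, total_steps, start_tokens=16_000_000,
                      end_tokens=64_000_000, ramp_frac=0.2):
    # toy batch-size warmup: grow the global batch from ~16m to ~64m tokens
    # over the first part of training, then hold it constant
    ramp = int(total_steps * ramp_frac)
    if step >= ramp:
        return end_tokens
    return int(start_tokens + (end_tokens - start_tokens) * step / max(ramp, 1))
```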
usually it's just, you know, one paragraph, but they spent like three months on it. 00:25:03.680 |
i feel like this is where the majority of the work is, i think it's very carefully 00:25:12.720 |
crafted and created. as i mentioned earlier, they use a two-stage approach: in stage one 00:25:20.080 |
they created expert models for each specific task, and then in stage two they 00:25:26.320 |
use distillation techniques to integrate these multiple experts. 00:25:31.120 |
and then they mention supervised fine-tuning. 00:25:39.280 |
so they do this sft for both stage one and stage two, and in expert training this is 00:25:50.480 |
basically making sure that the model has basic capabilities, to generate or to respond 00:25:57.920 |
properly; other than that it doesn't have any specific tooling for agentic tasks. 00:26:12.160 |
yeah, in sft for stage two they're distilling capabilities from the different expert 00:26:19.120 |
models and making sure that it gets the capabilities from the three different experts. 00:26:34.720 |
yeah, same thing. so here they mention how they perform these tasks. 00:26:59.760 |
one thing they have done differently is for code. i feel like in regular models, 00:27:24.480 |
for text data they have to use a lot of escape 00:27:29.680 |
characters, and they say that's fine for a model specifically 00:27:36.080 |
designed for coding tasks, but they want general capability for this model, right? so 00:27:40.800 |
instead of using a lot of escape characters, they decided to format the data into this xml format, 00:27:48.400 |
so that they don't have to use escape characters. that's one thing they have done differently. 00:27:59.600 |
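To illustrate the difference being described (the tag names below are made up, not GLM's actual schema): code embedded in a JSON string has to be escaped, while an XML-style wrapper lets the code appear verbatim in the training text.

```python
# code embedded in a json string has to be escaped:
json_style = '{"language": "python", "code": "def greet(name):\\n    print(f\\"hi {name}\\")"}'

# an xml-ish wrapper (tag names invented here) lets the code sit in the text verbatim:
xml_style = """<code_block language="python">
def greet(name):
    print(f"hi {name}")
</code_block>"""
```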
so, i couldn't hear you, i'm sorry, you're really soft, rj, i don't know what 00:28:15.200 |
happened to your audio. nope. you know what, rj, you could type, i guess, 00:28:27.120 |
i'll be rj's voice, but also he's a veteran. i don't know what's screwed up there. 00:28:35.840 |
rj asks: is this a new token or just a tag? i'm not sure there's a difference. yeah, i'm not sure, i think 00:28:54.640 |
maybe they added these, like end-of-text tokens, as part of the vocabulary, so that, 00:29:04.640 |
is it that? i'm not sure i understand the question itself, let me see. 00:29:21.440 |
yeah, i think it's a new token, a new token in the vocabulary, but they haven't mentioned it. 00:29:30.720 |
usually for this kind of stuff they do add new tokens. deepseek v3.1, which dropped 00:29:38.880 |
yesterday, you can see an example of how they add tokens for this kind of stuff, i would imagine glm 00:29:45.280 |
to be the same. does anyone know what glm stands for? no idea, you could probably look it up. 00:29:54.560 |
so i think here they also mention that by distilling from the output of distinct experts, the model 00:30:06.080 |
learns to quickly apply long-context cot reasoning for each task, and then it can 00:30:14.320 |
switch between thinking long or thinking short without any explicit instruction. 00:30:23.440 |
and the third one, rejection sampling: while sampling, they want to make sure 00:30:33.040 |
they sample high-quality data, and with this rejection sampling they basically mention how they have done 00:30:38.960 |
that. they want to make sure the output is in a valid format, and then there's verification against the objective, 00:30:47.680 |
like for math or science you can have clear objective answers, even for code, so they did 00:30:54.320 |
that part as well. and then they have reward models for subjective questions, it 00:31:03.200 |
can be either rlhf or rlaif, and they use these signals to do the rejection 00:31:14.160 |
sampling. and i feel like the tool-calling scenario is very interesting, they go deeper in 00:31:23.440 |
later sections as well into how they orchestrated the tool calling. yeah, i'm going to talk about that in a later 00:31:34.320 |
section. and it also seems they selected a few prompts to improve results, or rather 00:31:43.280 |
they deleted some of the prompts which are not conducive to better results, they mention that here, 00:31:52.320 |
and they state that it improved the reasoning capabilities. 00:31:57.520 |
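A minimal sketch of a rejection-sampling filter along the lines described above (the function names and threshold are placeholders, not the paper's actual pipeline): keep a candidate response only if it parses into the expected format and either passes an objective verifier or clears a reward-model score.

```python
def rejection_sample(prompt, candidates, parse_fn, verify_fn=None,
                     reward_model=None, min_reward=0.5):
    # toy rejection-sampling filter for sft data: keep only responses that
    # parse into the expected format, pass an objective verifier when one
    # exists (math answer, unit tests), or clear a reward-model threshold
    kept = []
    for response in candidates:
        parsed = parse_fn(response)
        if parsed is None:              # malformed output: drop
            continue
        if verify_fn is not None:       # objectively checkable task
            if verify_fn(prompt, parsed):
                kept.append(response)
        elif reward_model is not None:  # subjective task: score with a reward model
            if reward_model(prompt, response) >= min_reward:
                kept.append(response)
    return kept
```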
so here they mention how they constructed the agentic sft data. 00:32:05.920 |
i feel like they used existing frameworks, like mcp servers, to get the basic 00:32:15.760 |
features, and then they did task synthesis. it seems like they used whatever 00:32:22.080 |
documentation they have for the mcp servers to create tasks. 00:32:30.160 |
so they used llms to comprehend the functionalities and then generate relevant queries and tasks, 00:32:39.840 |
and for the other set they followed a similar approach, but first they 00:32:48.240 |
select a representative subset and then they construct the tasks based on that. 00:32:54.320 |
yeah, single-step and multi-step tool calling: i believe this basically says that in 00:33:03.280 |
the reasoning process you may have to use one call or multiple calls at different stages, 00:33:12.560 |
and then hybrid. hey venki, sorry, just a quick question, maybe a stupid question, but i just 00:33:22.960 |
wanted to confirm: the fact that they're training specialized experts, is this different to the other 00:33:31.680 |
models, do you know? like, is this one of the distinctive parts of this model? i mean, i know a lot 00:33:40.240 |
of times other llms do distillation, but i'm not sure, i haven't read every paper, maybe kimi. 00:33:48.960 |
yeah, i was curious too about whether, can you guys hear me now by the way? yes. yeah, i had my head- 00:34:01.520 |
set microphone on, sorry about that. i was also curious about what the motivation was for 00:34:07.360 |
using distinct experts versus just using good rollouts, and the conclusion i came 00:34:17.760 |
to for myself was that they have this multi-stage curriculum where, they mention in a 00:34:29.440 |
couple of places, they keep sort of doing rollouts of their own data, and so 00:34:37.920 |
maybe in the early phases it's easier to generate distinct experts than it is to generate one model 00:34:45.760 |
that's good at all three, and that's why, in the process of generating a generic model, 00:34:54.880 |
it's easier to do specific, basically fine-tuned models that are good at one 00:35:01.600 |
thing and then distill them all into the same model. that was my take, i'm not sure i've ever seen 00:35:07.200 |
this before myself. i agree. in the rl setup they mentioned they needed a different 00:35:15.920 |
architecture for different types of reasoning tasks or reinforcement learning tasks, so maybe that's 00:35:21.760 |
why. so they actually change the architecture for each expert? i think so. interesting. 00:35:33.040 |
yeah, they have slime for rl, and they mention that sometimes, for software engineering tasks, 00:35:39.200 |
they had to have docker containers and other stuff, they mention that a bit later 00:35:47.520 |
on the page, but i'm not sure, this is the training architecture, not the model architecture, 00:35:52.880 |
right? oh i see, okay, yeah, no, but that makes sense why you would also want to split that. okay, that's actually 00:35:58.800 |
really interesting. right, so here they use llms, they used existing frameworks, and then 00:36:19.840 |
they did task synthesis based on these frameworks. so i feel like a lot of the time 00:36:28.480 |
they do this synthesis to get the data required for training, and they also 00:36:34.320 |
use quality filtering to make sure that only high-quality data goes into the 00:36:39.440 |
training. yeah, and then they use llms to generate these multi-step tool-call trajectories. 00:36:52.880 |
yeah, we'll move on to reasoning rl now. one thing, i think they use grpo, i feel like 00:37:00.960 |
this is a common framework, even deepseek uses this, i'm not exactly sure about all the details. 00:37:12.640 |
so one thing i understand based on this, if i understand rl correctly: 00:37:26.080 |
for each task you get a reward and based on the reward you change the policy, you take an 00:37:31.360 |
action based on the existing policy, you get a reward and then change it. so here, 00:37:37.120 |
if the task is too complicated the reward would always be zero, so we don't have any 00:37:44.480 |
gradient to train the model in a better way. so they want to make sure that at each particular stage we get 00:37:51.120 |
an appropriate reward, or at least some reward, so that we can train more or train better. 00:37:58.240 |
so in stage one they use moderately difficult data, and then after a 00:38:04.720 |
certain point, after certain training steps, they use extremely difficult data, so that they 00:38:10.000 |
have a better gradient, or the model is well trained enough to produce a gradient or produce a reward. 00:38:20.480 |
i think that's what i understand from this curriculum-based learning. 00:38:24.320 |
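A toy sketch of that two-stage difficulty curriculum (the stage mix ratios here are invented for illustration): sample mostly moderate-difficulty prompts first so rewards are non-zero, then shift the mix toward extreme-difficulty prompts once performance plateaus.

```python
import random

def sample_rl_batch(pool, stage, batch_size=8, seed=None):
    # toy two-stage curriculum: stage 1 leans on moderately difficult prompts
    # (non-zero reward, so there is a gradient signal); stage 2 shifts the mix
    # toward extremely difficult prompts once accuracy plateaus
    rng = random.Random(seed)
    weights = {"moderate": 0.9, "extreme": 0.1} if stage == 1 else {"moderate": 0.2, "extreme": 0.8}
    levels = list(weights)
    probs = [weights[level] for level in levels]
    batch = []
    for _ in range(batch_size):
        level = rng.choices(levels, weights=probs, k=1)[0]
        batch.append(rng.choice(pool[level]))
    return batch

# pool = {"moderate": [...prompts...], "extreme": [...prompts...]}
```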
yeah, and they tried different output context lengths, like 16k, 32k, 48k and 64k. 00:38:36.320 |
i believe at the end they decided it's better to use a 64k context window; there is no 00:38:43.280 |
additional advantage to starting with a small context window and then increasing the 00:38:48.640 |
context window to 64k, rather they can do 64k training from the beginning. 00:39:05.360 |
yeah, they also mention this two-stage approach, increasing the difficulty produces better results. 00:39:20.720 |
also they mention a single-stage rl at 64k output length. 00:39:44.720 |
they mention they used a different 00:40:04.160 |
loss function; this chart is basically about using a different loss function for coding tasks, 00:40:14.720 |
i think that comes in this paragraph, but we can talk about this first, and then, 00:40:21.840 |
so, temperature: i believe it's the usual temperature we see in llms, like if the temperature is 00:40:29.600 |
high, it becomes more exploratory, 00:40:38.080 |
if the temperature is too low it becomes very deterministic. 00:40:41.200 |
right, so instead of using a fixed standard temperature they try to use a dynamic 00:40:47.840 |
temperature, and then they have a certain validation procedure to make sure that the temperature they 00:40:54.720 |
have chosen is working fine. yeah, so here they mention how 00:41:07.920 |
they have chosen the temperature based on the validation set, 00:41:14.240 |
and they mention here why they have taken that approach. 00:41:22.720 |
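To make the idea concrete, here is a toy version of picking a rollout temperature from a validation set; this is an illustration of the general notion (explore as much as possible while staying in a healthy accuracy band), not the paper's exact rule, and the band and candidate values are invented.

```python
def pick_rollout_temperature(evaluate_fn, candidates=(0.6, 0.8, 1.0, 1.2),
                             target_band=(0.2, 0.8)):
    # toy dynamic-temperature selection: try a few temperatures and keep the
    # highest one whose validation pass rate stays inside a target band
    # (hot enough to explore, not so hot that generations collapse)
    best = candidates[0]
    for temp in candidates:
        pass_rate = evaluate_fn(temp)   # fraction of validation prompts solved at this temperature
        if target_band[0] <= pass_rate <= target_band[1]:
            best = temp
    return best
```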
so here they mention that for code rl the loss calculation is critical for training 00:41:34.880 |
efficiency, and they used a token-weighted mean loss 00:41:43.360 |
rather than the conventional sequence mean loss. i don't exactly know what's going on here, 00:41:50.240 |
i'm taking them at their word, and this chart seems to show that the token-weighted 00:41:57.680 |
mean is working better for the rl training compared to the sequence mean. 00:42:03.840 |
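The distinction reads as two ways of averaging a per-token loss over a batch; here is a minimal sketch of my reading of the terms, not code from the paper.

```python
import torch

def sequence_mean_loss(token_losses, mask):
    # per-sequence mean first, then average over sequences:
    # every rollout counts equally regardless of how long it is
    per_seq = (token_losses * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_seq.mean()

def token_weighted_mean_loss(token_losses, mask):
    # single mean over all valid tokens in the batch:
    # long rollouts contribute proportionally more tokens to the gradient
    return (token_losses * mask).sum() / mask.sum().clamp(min=1)

# token_losses, mask: (batch, seq_len) tensors; mask is 1 on real tokens, 0 on padding
```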
and they also mention what they have done for science rl. i 00:42:16.080 |
feel like a lot of the time they have verified answers, so it becomes easier for 00:42:22.000 |
training to get the reward exactly, whether the reasoning steps are accurate or not. 00:42:31.520 |
for agents, i feel like here they use both humans and ai to get the feedback. 00:43:00.480 |
so basically they generated multiple samples and then 00:43:12.960 |
they're filtering; this is just the training loop. 00:43:30.160 |
vibhu, would you be able to comment on this? 00:43:41.760 |
yeah, i mean, am i reading this right, someone else should 00:43:48.960 |
comment here, maybe, but this is grpo without the kl divergence, right? or is that just a reiteration 00:43:57.840 |
of the statement from above? yeah. 00:44:05.440 |
you know, did they mention why they removed the kl divergence, 00:44:17.520 |
the last term? i'm not sure, i don't think i came across that reasoning. yeah, it seems a little bit 00:44:25.520 |
yeah, i'm not even sure, i'm not super confident that i understand this part, this is a lot 00:44:43.200 |
of rl-heavy stuff. wait, sometimes i understand. there are two things to dropping the kl 00:44:50.800 |
divergence penalty. in other work, i don't remember which paper it was, but in one of 00:44:59.600 |
these they significantly dropped the penalty for kl divergence and then they decayed it so it 00:45:05.040 |
didn't have an effect later on. so kl divergence is to keep the model from going too far off track, right, but 00:45:14.640 |
lowering the penalty allows the model to explore the random space a bit more. so the paper basically 00:45:22.160 |
dropped the penalty for kl divergence because it allowed the model to explore a broader 00:45:29.520 |
set of generated tokens, right, it's similar to temperature, it allows it to explore more freely, 00:45:37.760 |
and in return, going down different paths, it eventually converges to an answer. in this case 00:45:43.360 |
they're dropping kl divergence because they're doing this on math and code, right, stuff that is 00:45:49.920 |
objectively verifiable. as long as you can verify that the answer is correct, you're 00:45:57.520 |
fine, right, you're able to get reward as long as you're able to start solving the 00:46:04.160 |
questions. so in more broad terms, like rlhf with preference feedback and dpo and stuff, 00:46:12.960 |
it's still a concern, right, because you can get slight reward for preference-based outcomes, but in stuff 00:46:21.280 |
like math and code, as long as you end up at that verifiable answer you're kind of chilling, right, 00:46:26.320 |
you don't really need a kl anchor to supervise the policy reward, it's okay. and then there was 00:46:32.640 |
another paper we covered recently that echoed this. yeah, thank you, that's really clear. 00:46:40.000 |
i think it was the kimi paper, right, that they did this in, maybe. sounds about 00:46:46.800 |
right. yeah, that's really good insight, i appreciate that. and this is also just me spewing 00:46:55.120 |
what i think about it, you know, could be wrong, but seems legit to me. i recall that too, so that sounds right to me. 00:47:01.840 |
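For reference, here is a minimal sketch of what a GRPO-style objective with the KL penalty dropped can look like; this is a generic illustration of the discussion above, not the paper's exact loss.

```python
import torch

def grpo_loss_no_kl(logprobs_new, logprobs_old, rewards, eps_clip=0.2):
    # logprobs_*: (G, T) per-token log-probs for G rollouts of one prompt
    # rewards:    (G,) scalar reward per rollout (e.g. tests passed, answer correct)
    # group-relative advantage: normalize rewards within the group of rollouts
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = torch.exp(logprobs_new - logprobs_old)          # (G, T) importance ratios
    adv = advantages.unsqueeze(-1)                          # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
    # no kl(policy || reference) penalty term here: with verifiable rewards
    # the anchor is dropped so the policy can explore more freely
    return -torch.min(unclipped, clipped).mean()
```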
good enough for me. in this paragraph they mention how they calculated the 00:47:10.000 |
reward for each specific task. so for the web search task they use the final answer 00:47:17.760 |
as a reward, and for coding agents they use verifiable test cases for rewards, and 00:47:28.960 |
for tool calling they check the format of the particular trajectory, 00:47:36.720 |
and then they basically either proceed further or it receives a zero reward. 00:47:43.920 |
and then, basically, the cold-start sft model has a 00:47:53.040 |
basic understanding of what's going on, how to respond, and then once 00:48:01.200 |
training has reached certain steps they started applying self-distillation, 00:48:06.080 |
by replacing the data generated by the cold-start sft with data from the rl- 00:48:16.320 |
trained model, i think this is the expert model. yeah, this section, sorry i'm taking so much of 00:48:26.400 |
this talk, but this section really made me nervous, because i think a long 00:48:34.400 |
time ago i remember going over some paper on synthetic data and sort of methods 00:48:42.880 |
where self-improvement kind of hits a limit pretty fast. so it's interesting 00:48:52.080 |
that they really emphasize the self-improvement aspect here, and it just seems like 00:48:59.760 |
you might squeeze the most out of the existing data set but you're maybe going to over- 00:49:06.320 |
fit or something, i don't know. yeah, i mean, i was worried about that, because they use a lot 00:49:17.200 |
of synthetic data, but i think even though it's synthetic data, the reward function probably captured 00:49:23.760 |
enough reasoning capability that it generalized well, that's what i was hoping for. yeah, yeah. 00:49:31.600 |
sorry, could i ask a question about what is meant by self-distillation, maybe i didn't quite catch it. 00:49:48.480 |
i think they were using the expert model to generate the response, and then 00:49:55.520 |
they were using the non-expert model to learn from that expert-model-generated reasoning. 00:50:02.320 |
i think they call it self-distillation, but it doesn't quite make sense to me. 00:50:07.040 |
reading from this, it feels like what they're saying is that they use the rl- 00:50:14.880 |
trained model and feed that response into an sft run, but that's kind of weird to do, 00:50:25.200 |
right, so i don't quite understand why somebody would do that. so this is common, they 00:50:32.560 |
did this in magistral: basically, when you run out of a good verifier and training data, they 00:50:41.440 |
trained a very good math model, so basically in magistral medium they trained a reasoning model 00:50:49.360 |
just on math and then they used that model to generate math data. so in this case they do the 00:50:55.120 |
same thing, right: in stage one you train experts, like a reasoning agent, a tool-use agent, a general 00:51:04.960 |
chat agent, then you do this unified training with distillation. so the agents generate 00:51:11.760 |
responses and reasoning and tool calls, and then the unified model is trained on this 00:51:18.880 |
synthetic data. it's like self-distillation because it kind of came from itself, right, they branched 00:51:25.120 |
off, trained experts, and then they train on that output, so that's the distillation. oh, so before, you're saying 00:51:36.480 |
that they were trained on specific tasks, and because they were specific, 00:51:45.440 |
then for the unified one you kind of use those specific rl-trained ones to generalize to 00:51:53.360 |
other tasks, or at least you put all the, i guess you're calling them experts, on different 00:52:00.080 |
tasks and try to combine them into another sft run. yeah, so they use the experts to 00:52:09.040 |
generate response, reasoning and tool-call traces, right, that's the distillation. so they 00:52:15.520 |
basically train experts to do output rollouts and then they train on those. this is like 00:52:23.680 |
how magistral got reasoning, they trained a good math model then they used it for data, 00:52:30.960 |
it's a very clear example of this idea of training another model 00:52:39.920 |
to generate some of this data. i'd recommend that paper, or the paper club session on it, but yeah, makes sense. 00:52:49.040 |
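As a rough sketch of the stage-2 loop being described (all of the function and method names here are placeholders, not an actual API): RL-trained experts generate rollouts, the rollouts are filtered, the surviving traces replace the cold-start SFT data for the unified model, and RL continues from there.

```python
def self_distillation_round(expert_models, prompts, keep_fn, sft_train_fn, rl_train_fn, base_model):
    # toy version of the stage-2 loop: rl-trained experts generate rollouts,
    # low-quality ones are filtered out, the surviving traces replace the
    # original cold-start sft data, and rl continues from the unified model
    distilled = []
    for expert in expert_models:                   # e.g. reasoning / agent / chat experts
        for prompt in prompts:
            rollout = expert.generate(prompt)      # placeholder .generate() api
            if keep_fn(prompt, rollout):           # rejection sampling / verification
                distilled.append((prompt, rollout))
    unified = sft_train_fn(base_model, distilled)  # sft on the expert outputs
    return rl_train_fn(unified)                    # further rl on the unified model
```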
this is like a general comment: it seems like increasing the interaction turns with the 00:53:04.640 |
environment has obviously improved the test-time compute. 00:53:11.920 |
i think this is a generic statement, it doesn't have, i mean it has value, but it doesn't give 00:53:22.960 |
any specific insight. and for general rl they use rlhf and rlaif, they use 00:53:32.480 |
model-based feedback as well; depending on the data and the task they're kind of 00:53:37.520 |
mixing all these three approaches, plus rule-based feedback. 00:53:43.360 |
i think, depending on the task, one of these would be better, but i feel like even 00:53:52.320 |
models can give better judgment for a lot of tasks. yeah. 00:54:05.040 |
yeah, this is the browsecomp accuracy. 00:54:12.880 |
browsecomp, very interesting benchmark, this is from openai, it's like their web search type thing 00:54:24.880 |
that they released with deep research, i think they updated it recently, i think swyx has played with it a bit. 00:54:30.240 |
the interesting charts i found were the two in section three, actually, there's two charts 00:54:37.920 |
about training, about multi-stage, and, what do these red and blue lines mean, yeah, they're multi-stage training. 00:54:50.800 |
they have these two charts that kind of break down multi-stage versus single-stage training and 00:54:57.040 |
breaking off of difficulty of problems and plateauing, these two are pretty cool. 00:55:03.760 |
honestly, i think since we're running short on time maybe we just don't look at the charts, but they had good charts, 00:55:10.480 |
i recommend the charts. that's true. and on that last point, by the way, with the distillation: 00:55:20.320 |
the high-level process is, first you train the three experts, right, reasoner, agent and 00:55:26.880 |
general chat; you train them with a cold start of sft traces, then you do rl, 00:55:34.160 |
you make them experts. then the big thing is you don't just train on their rollouts flat out, 00:55:40.480 |
right, they do a lot of filtration. so for the distillation from the three experts they do rejection 00:55:46.480 |
sampling, correctness checking, reward-model filtering, tool-call verification, they actually 00:55:52.480 |
verify all this stuff. then for the unified model they do that sft distillation sort of stuff, and 00:56:01.120 |
you know, that's a deep level of filtration. i think that's one of the interesting things, 00:56:06.640 |
right, it's not like they just do three models and then merge them into one, like some weird model 00:56:11.760 |
merge or some ensemble, it's just that they use them for synthetic data and then they distill down, 00:56:17.440 |
and it's a bunch of: sft kickstart, do rl, okay, generate data, filter data, let's do it again, 00:56:27.520 |
let's do sft as a kickstart off of these, and of course strip out some stuff, because you want 00:56:33.600 |
hybrid reasoning, so you want to keep some reasoning traces and keep some non-reasoning stuff, and 00:56:39.440 |
in the end you get one model that outperforms the three. 00:56:42.320 |
oh sorry, thank you, go ahead. i just had one more quick question, it says here, 00:56:55.920 |
once training has reached a certain step count or plateaued, is it normal that they loop the 00:57:00.080 |
distillation process? that they what? the distillation process, like they're doing many, many 00:57:09.360 |
rounds of the self-distillation. "once training has reached a certain step count or plateaued, we apply 00:57:19.840 |
self-distillation by substituting the original cold-start data with responses generated by the rl model, 00:57:25.200 |
thus creating a superior sft; we then conduct further rl training to enhance this model." this is 00:57:32.000 |
talking about training the experts, right, then conducting further rl on this. no, no, this is about 00:57:39.600 |
increasing difficulty, this strategy allows us to push the limits of rl-trained models. sorry, what's 00:57:47.840 |
the question here? this seems like a somewhat standard approach, but i think i 00:57:52.000 |
misinterpreted what your question was about. no, it's okay, i might be misunderstanding, i just got 00:57:57.120 |
the impression that something they're doing differently is that it's a self-iteration 00:58:02.960 |
loop, rather than just running the distillation process once, which is what i thought the other models were 00:58:09.600 |
doing, applying self-distillation by substituting that. i think the loop is just, you know, after 00:58:20.160 |
you do easy questions you do hard questions, that's their whole multi-stage thing, right, they 00:58:25.600 |
at different stages do harder and harder questions, and they're mixing it in with context length. they 00:58:31.760 |
show this in one of the figures, i think it's figure five, about moderate and extreme difficulty, 00:58:38.320 |
where, as one plateaus, they switch to stage two with extreme difficulty and then progress 00:58:44.160 |
continues. so like that blue line, if you keep it stagnant with the same difficulty, that's 00:58:51.120 |
where it starts to plateau and you're not improving accuracy, so the next cycle is just extreme 00:58:57.440 |
difficulty instead. i think that's what they're saying here, it's just tiers of 00:59:03.760 |
harder and harder questions. i could also be misinterpreting this, you know. 00:59:13.200 |
they mention how they created a reward function for each specific type of task. 00:59:23.840 |
they were checking format, they were checking the trajectory, and then they were also checking, 00:59:29.200 |
sometimes, this stepwise reward, which is basically checking, did we call the exact tool 00:59:37.680 |
at the particular step, and sometimes they're using that, and they're also trying to see if you actually 00:59:44.560 |
completed the task, rather than just calling the function at a particular step. so they were using these signals. 00:59:50.000 |
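A toy sketch of a reward along those lines (the attribute names and weights below are placeholders invented for illustration): zero out malformed trajectories, give partial credit for calling the expected tool at the expected step, and weight the final task outcome most heavily.

```python
def agentic_reward(trajectory, expected_tools=None, task_check=None,
                   format_ok=None, stepwise_weight=0.3, outcome_weight=0.7):
    # toy reward for a tool-use rollout: malformed trajectories get zero,
    # a stepwise term credits calling the expected tool at the expected step,
    # and an outcome term checks whether the task was actually completed
    if format_ok is not None and not format_ok(trajectory):
        return 0.0
    reward = 0.0
    if expected_tools:
        hits = sum(
            1
            for step, call in enumerate(trajectory.tool_calls)   # placeholder attribute
            if step < len(expected_tools) and call.name == expected_tools[step]
        )
        reward += stepwise_weight * hits / len(expected_tools)
    if task_check is not None:
        reward += outcome_weight * float(task_check(trajectory))
    return reward
```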
and then, sometimes these rl rollouts get into infinite loops, so instead 00:59:58.640 |
of adding a blanket penalty to the task, they sampled a lot of prompts where the rl 01:00:05.360 |
tends to get into infinite loops and then created a penalty function based on that, instead of 01:00:12.240 |
using a standard penalty, which is more sample efficient. they also mention how 01:00:19.840 |
they want to use the gpus in better ways: if some rl rollout is taking 01:00:27.200 |
a lot of time to finish, instead of waiting on that they created a different platform to 01:00:33.760 |
optimize gpu usage better, you can read the details later. and then, yeah, these evaluation 01:00:41.200 |
metrics, how they compare against existing models. this seems very straightforward to me, 01:00:48.720 |
i think it's cool, yeah, but a lot of the time it's still second to 01:00:56.480 |
claude or others, maybe the model size is very small compared to them, but still, no new insight here, 01:01:04.240 |
it's just metrics talking about how it compares to other models. i feel like one 01:01:10.880 |
interesting thing is that the model has a lot of reasoning 01:01:15.200 |
capabilities built in, so it gets an emergent phenomenon where translation 01:01:22.640 |
surprisingly works well compared to existing models, so that's like a side effect of having a lot of 01:01:29.840 |
reasoning tokens. yeah, i think that's all i wanted to cover, 01:01:38.480 |
but i feel like the rl setup and the things they mentioned are very cool. 01:01:53.840 |
well, thank you so much for presenting. if anyone wants to volunteer for next week, there's all 01:02:02.640 |
these fun papers, i posted a bunch in discord, or if there's any topics you want to cover we can 01:02:08.960 |
cover them together. but, you know, big shout out, thank you so much for presenting, it's always fun 01:02:14.240 |
to get a different perspective on this stuff. thank you. yeah, it was well done, thanks for 01:02:22.400 |
volunteering. all right, many more volunteers, more papers, somebody's going to do seedance, that's another one. 01:02:28.880 |
oh, china models, man, i really want to do a china tour at some point, maybe early next year. 01:02:37.040 |
oh my god, you just have access to go see all the labs here. i can definitely 01:02:42.960 |
get qwen, we have inroads to deepseek, stepfun, minimax for sure i can get, and then whoever does glm, 01:02:53.520 |
i mean i don't think they're hard to get, but i don't currently have contacts. 01:02:58.400 |
okay, other papers though: there's meta's dinov3, there's a genie clone that came out of tencent, 01:03:05.760 |
what else, there's a survey on parallel text generation, there is a lot of stuff, there's a 01:03:14.400 |
few autoregressive image generation papers. hrm, tyler, thanks for volunteering. yes, i haven't read that 01:03:22.400 |
one and i need to read it, so hrm is on the docket in the next two to three weeks. i might reach 01:03:29.040 |
out to jay alammar for his gpt-oss thing, i'll share it in discord if you guys haven't seen it, he did a 01:03:37.920 |
really good blog post, like the illustrated transformer, illustrated for gpt-2; he put out one 01:03:46.320 |
yesterday, i think, on the illustrated gpt-oss, so it's a good visual breakdown of what's happening 01:03:52.880 |
in state-of-the-art moe models, and i'll share it in discord and try to get him to come 01:04:01.040 |
on. otherwise, you know, if you have a paper, it's always better if someone else volunteers. 01:04:10.080 |
okay, tyler, you're locked in for september 3rd, glm, sorry, hrm, and then we need to watch you