
GLM 4.5: Agentic Coding & Reasoning Foundation Model


Chapters

0:00 Introduction to the presenter and the GLM 4.5 paper
2:11 Overview of GLM 4.5 as a Mixture of Experts model
4:06 Defining ARC: Agentic, Reasoning, and Coding capabilities
5:41 Pre-training architecture and the Muon optimizer
6:54 Data collection, de-duplication, and context window extension
11:31 Mid-training with repository-level code and synthetic data
25:08 Post-training: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)
31:22 Tool calling and agentic data synthesis
36:52 Reasoning RL with curriculum learning
39:58 Dynamic temperature and token-weighted mean loss
49:12 Self-distillation process for unifying the model
59:16 Reward function calculation and handling infinite loops
61:12 Emergent capabilities: surprising translation performance

Transcript

Yeah, go ahead. Can you see my screen?

I can see your screen. The audio could be better. If you can, switch your Zoom audio; don't use AirPods as microphones, they're very bad microphones. Even though it looks like I'm using my AirPods, I actually always switch Zoom to the laptop microphone, which is much better. It should be in the Zoom audio picker. I don't even like the AirPods Max; I've tried it, it's not great.

Is it any better now?

Yeah, much better.

Sweet. Okay, just introduce yourself, since you're a new presenter. Let's get to know you and why you're interested.

I was working as a back-end engineer. My name is Venki, by the way, nice to meet you all. I took a sabbatical to understand the current ecosystem, to experiment more with AI tools, and to understand the fundamentals better. I've been trying to read papers, but so far the self-study hasn't worked out that well, so I'm trying to make sure I get something positive out of this sabbatical. This is my first time presenting a paper, and I'm not sure how it will go, but I feel like I paid a lot more attention to the details because I'm presenting; otherwise I would usually skip through them. So it was a good decision to volunteer for this.

Agreed. Okay, if you want, you can make the paper full screen and we can get into it.

Okay. Should I give an overview first? This is a Mixture of Experts transformer as the base model. What they have done differently is make sure the model has ARC capabilities: agentic, reasoning, and coding abilities. In post-training they used two stages: in stage one they train an expert model for each specific task, agentic, reasoning, and coding, and once they have those expert models they distill the capabilities into one unified model. That's why this model ends up with all three capabilities, and it's able to switch between modes: a long thinking mode and a quick response mode. I haven't actually used the model much, but I asked Claude to generate some tasks and prompts to verify these capabilities, so maybe after this talk we can see how the model does on them. But basically it's a Mixture of Experts, and these are some of the metrics they provide for the ARC capabilities.

I feel like "ARC" has been a little bit co-opted by ARC-AGI, but here ARC has a definite definition, which I think you're showing on screen right now, right?

Oh yeah, the ARC challenge. These authors use ARC to mean that the model has agentic, reasoning, and coding capabilities all together in one model, instead of using multiple models to achieve different kinds of performance. They clearly define that ambition: it seems to be the main drive behind the training, having all these capabilities in a single model.
The paper mostly moves quickly through what they have done, the RL training pipeline, and how they approached data synthesis for training. These are a bunch of the metrics they report. They now have a smaller model as well, but they haven't said much about it; I'm assuming it's a distillation. I tried to see if the training loop is available; I don't think they've open-sourced it, I think only the weights are available in the repo.

On architecture, the fundamental architecture is a Mixture of Experts. The main thing they changed is that they used the Muon optimizer instead of Adam; I'm not really sure whether Adam is the standard, but they're using Muon. They mention it further down.

For context: for the past few years, Adam and AdamW have been very much the standard defaults, and there hasn't been much progress there, Adam came and then AdamW. The recent Kimi K2 paper, the big model that came out after DeepSeek, switched to the MuonClip optimizer; it gives better training stability and they have a very pretty loss curve. So that's the background on the optimizer.

One general pattern they used: they started with a 4K context window, then increased it to 32K, and afterwards to 128K. They also describe how they collected their data. They used web crawling for English and Chinese web pages, and then a quality metric: they divide the crawled pages into buckets with different quality scores and upsample documents from the higher-quality buckets, so the base data is of much higher quality. They used SemDeDup to deduplicate the data. That basically covers what they did with the pre-training data.

They also say they used grouped-query attention with partial RoPE in the attention block, and more attention heads. One of the claims is that having more attention heads and going wider rather than deeper increases reasoning capability.

Hey Venki, before you go forward, you said something at the beginning of this discussion and I just want to make sure I got it right. Did they train three separate models, one each for agentic, reasoning, and coding, and then distill those three models into one model? Is that how they approached this?

For the base model they used the same pre-training. Once they have the pre-trained base model, in post-training they created three different models.

And then did they distill those into one other model?

Yeah. In stage one they do expert training, and in stage two they unify.

Okay, thanks. On the previous note about attention heads and wide versus deep, I think it's been interesting across the other recent papers. Once again, the Kimi paper talks about how they use fewer attention heads, because at very long context that's half the memory bandwidth required.
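To make that memory-bandwidth point concrete, here is a back-of-the-envelope sketch of how the KV cache scales with the number of key/value heads at long context. All the numbers below are illustrative assumptions, not the actual GLM-4.5 or Kimi K2 configurations.

```python
# Back-of-the-envelope KV-cache size: shows why fewer KV heads cut memory traffic
# at long context. All numbers here are illustrative, not real model configs.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for the key and value tensors, per layer, per token (fp16/bf16 = 2 bytes).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 128_000           # long-context inference
layers, head_dim = 60, 128  # hypothetical depth and head size

for n_kv in (128, 64, 8):   # full multi-head vs. halved vs. aggressive GQA
    gib = kv_cache_bytes(layers, n_kv, head_dim, seq_len) / 2**30
    print(f"{n_kv:>3} KV heads -> {gib:6.1f} GiB of KV cache per sequence")
```

Halving the number of KV heads halves the cache that has to be streamed from memory on every decoding step, which is the bandwidth point being made above.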
Then OpenAI's open-source model talks about how they go for a wider MoE as opposed to a deeper one, and that this helped with training stability as well. So it's interesting: these little notes that you would normally just take at face value, you're starting to see real differences between models and why they pick what they pick.

So in pre-training they have multiple stages, with different data and different context sizes. For code they mostly used GitHub, and one thing I observed is that they combined files from the same repo, to get a better understanding of the relations between different files in the code base. I thought that was interesting, embedding that semantic meaning. Also, most of the time, once they crawl the data they use some kind of model to predict the quality of a particular page, and based on that quality they sample the high-quality data; that seems to be a common pattern across all the data, and they mention upsampling a lot based on the quality of the content. So pre-training, I believe, is about getting a base model trained on high-quality data.

In the second phase, mid-training, they do repo-level code training: they increase the context and add more data, long-context and agent training. Here I feel they use a lot of synthetic data generation, especially for agent tasks and reasoning tasks, and they include those to improve the coding abilities as well. They also mention using a diff format, which I think helps the model generate, understand, and review code better. They boosted the context length from 32K to 128K and upsample only long documents, which makes sense; it's the same idea as going from 4K to 32K and then to 128K.

Then there's best-fit packing. During pre-training they didn't use it; best-fit packing basically fits sequences to a given length, and in one of the stages they skip it so that the truncation becomes a good data augmentation strategy. For mid-training they do apply best-fit packing, which I think compresses the data into the given context window, something like summary generation.

Venki, could you explain a little what they mean by upsample?

I'm assuming it's up-sampling versus down-sampling: if you have multiple buckets, you want to get more data from a particular bucket rather than from the others. If you had an equal distribution and drew a thousand samples, the distribution would be even; with upsampling you put more weight on certain buckets to get the high-quality data. Vibhu, do you want to add anything?

Typically that's what upsampling is. And it's not just more weight to different buckets; sometimes they just do multiple epochs, so for some long-context or high-quality data it's not uncommon to do multiple passes over the same data. And down-sampling as well: if you have a lot of generic text, filtering is a form of down-sampling.
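Here is a minimal sketch of the bucket-weighted upsampling just described: documents are bucketed by a quality score and higher buckets are drawn from more often (roughly equivalent to giving them extra epochs). The bucket names and weights are made up for illustration; this is not the paper's actual data recipe.

```python
import random

# Quality-bucket upsampling sketch: higher-quality buckets get larger sampling
# weights, so they dominate the training mixture. Buckets/weights are illustrative.
buckets = {
    "high":   {"docs": ["doc_h1", "doc_h2"], "weight": 4.0},  # ~4 epochs worth
    "medium": {"docs": ["doc_m1", "doc_m2"], "weight": 1.0},
    "low":    {"docs": ["doc_l1", "doc_l2"], "weight": 0.2},  # mostly filtered out
}

def sample_pretraining_docs(n):
    names = list(buckets)
    weights = [buckets[b]["weight"] * len(buckets[b]["docs"]) for b in names]
    out = []
    for _ in range(n):
        b = random.choices(names, weights=weights, k=1)[0]
        out.append(random.choice(buckets[b]["docs"]))
    return out

print(sample_pretraining_docs(10))  # high-quality docs dominate the draw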
And the best-fit packing, what does that actually do?

Based on my understanding, I think they're trying to summarize so that the data fits the context. I haven't actually verified that; I assumed it.

Sorry, what's the basic procedure? I don't quite understand what they're doing.

I haven't verified exactly what they're doing; I haven't seen the algorithm. I just assumed they're probably summarizing the content to fit better. If you have 128K of context, or a lot of data, it seems better to have a summary so the main points fit in the limited context and the LLM can focus on what's going on rather than the whole context. If anyone has a better explanation, please chime in.

Sorry, I have a separate question, Venki. What is the rationale here? They're saying pre-training begins with something smaller and then mid-training introduces longer context. Is this like, I forget what they call it, curriculum training? Is that the thought process, or why do it this way?

Curriculum training I think they use in the context of RL, to get better rewards. Here I think it's simply extending the context so they can put a lot of data in. At 4K they get a good base, the weights are set well, and once they increase the context I believe it can generalize well and the attention can be spread out without losing the actual value. That's what I think they're doing.

Do they have any data, any ablations, showing that doing it mixed together versus separately makes a difference?

I can chime in here. There are two parts to this mid-training. One is context-length extension, which is typically not what you'd consider mid-training; context extension can just happen at the very end of all your RL, your post-training. You can just do more training on long context and that kind of works. The reason this is mid-training is that you're going from regular pre-training, just next-token prediction, to some form of specialized data: repo-level code, which is long-context full code bases, synthetic reasoning data, long-context agentic data. It's not that this is exclusive, they're also doing long context, but it's pretty common; you don't see reasoning data at short context, because the system prompts themselves are already pretty long, and the same goes for the reasoning traces. We just don't do reasoning at short context, because you need context length to actually do reasoning; you're not going to do reasoning at 4K context. So it's mid-training more because of the type of training they're doing, not because of the context length.
But yeah, they didn't just do it as a separate step, which would be okay too; long context isn't some big unknown. All you do is train on long context for a couple hundred thousand or a couple million samples and you're kind of chilling, and you can mix that in wherever. In this case it's mid-training because of the type of data. That's the first question.

On the other one, about best-fit packing: if you have long context, say you're training at 32K context, and you have a bunch of short sequences, like 4,000 tokens of sequence length, what you do in terms of packing is concatenate multiple samples into one sample. If I have eight queries that are 4,000 tokens each, I can batch them all together and make one sequence of 32K. That's best-fit packing, and it's more efficient on GPUs. Or you could not do that and just train normally. So you can either pack multiple short sequences into one sequence or not, and they talk about what they do here, but that's what best-fit packing is, and there are pros and cons to both. For them, they say that in pre-training they don't do it, so it's not as efficient on the GPUs; they just do random truncation, so sometimes different sequences in a batch will be different lengths, and that's okay, it is what it is, and they talk about why they do this. Then in mid-training they do do it, because context matters: you don't want half of a reasoning trace cut off because you ran out of sequence length, and you don't want to chop off a code base just to be efficient on GPU. In pre-training they're fine with it; it acts as data augmentation, and even if you lose half of something it's fine and it's more efficient for the majority of the training run. Sorry, that was two topics in one, but I hope that made sense.

Thank you. But when they concatenate, say, 8K sequences into 32K, you're not masking anything on the attention, right? The stuff you concatenate can still attend to the previous stuff?

Yeah, it's weird like that. There are different techniques for packing; there are techniques of adding tokens to "forget" what came before, but I don't think they're doing any of that. The real point of packing is just that you keep sequence lengths similar and you don't get contamination in the training data. With reasoning, I don't want code about one library in the same training sequence as something completely irrelevant to think about. But if it's just regular storybooks, say you have a Harry Potter book and half of it is packed in with a Lord of the Rings book and you're just predicting next tokens, it's fine; it's common. That's the high level of what this stuff is; someone else should do a bit more research into it. I just know that it's typically done in pre-training, where the type of data is not as important, and you don't want to do it for reasoning.

Thank you.
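Here is a minimal sketch of packing short sequences into fixed-length training sequences as described above. It uses a simple first-fit greedy strategy for clarity, so it is a simplification, not the exact best-fit packing algorithm from the paper.

```python
# Greedy sketch of sequence packing: group short samples so each packed training
# sample approaches the max context length. Simplified; not the paper's algorithm.

def pack_sequences(seq_lengths, max_len=32_768):
    bins = []  # each bin holds the lengths of sequences sharing one training sample
    for n in sorted(seq_lengths, reverse=True):
        n = min(n, max_len)                 # anything longer would be truncated
        for b in bins:
            if sum(b) + n <= max_len:       # fits into an existing packed sample
                b.append(n)
                break
        else:
            bins.append([n])                # start a new packed sample
    return bins

packed = pack_sequences([4_000] * 8 + [30_000, 12_000, 20_000])
for i, b in enumerate(packed):
    print(f"sample {i}: {b} -> {sum(b)} of {32_768} tokens")
```

In a real training stack each packed sample would also carry per-segment position IDs (and, depending on the setup, an attention mask) controlling whether segments attend to each other, which is exactly the masking question raised above.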
And there are approaches to process this better; there's some paper that talks about adding special tokens, the way encoder-decoder models have a CLS token.

Thank you. Also, I don't think they mention many ablation studies for the pre-training stage, or why they made these choices; most of their focus is on making sure the RL works, that RL is effective and improves the capabilities. A lot of the charts in this paper are about the RL pipeline, not the pre-training.

They do mention the hyperparameters. As I said, they used the Muon optimizer. To be honest, I haven't understood all of this completely. They use a warmup-stable-decay learning-rate schedule, and they describe how and why they change the learning rate. They also use a batch-size warm-up strategy: the batch size is increased gradually from 16 million to 64 million tokens during training, and they use the Muon optimizer because it accelerates convergence and can tolerate large batch sizes. If anyone has a better understanding, please help me out.

I don't know where to get these numbers from either. Usually it's just one paragraph, but they probably spent three months tuning those parameters.

I'll move on to post-training. I feel this is where the majority of the work went; it's very carefully crafted. As I mentioned earlier, they use a two-stage approach: in stage one they create expert models for each specific task, and in stage two they use distillation techniques to integrate those multiple experts. They do supervised fine-tuning (SFT) in both stage one and stage two. In expert training, the SFT is basically about making sure the model has the basic capability to generate and respond properly; beyond that it doesn't yet have specific agentic or reasoning capabilities. In the stage-two SFT they distill capabilities from the different expert models, making sure the model picks up the capabilities of all three experts.

They also mention how they format these tasks. One thing they do differently is for code: for text data you normally end up using a lot of escape characters. They say that's fine for a model specifically designed for coding tasks, but they want this model to have general capability, so instead of using a lot of escape characters they decided to format the data with XML-style tags, so they don't need the escapes. That's one thing they did differently.

I couldn't hear you, I'm sorry. You're really soft, RJ, I don't know what happened to your audio. You could type; I guess I'll be RJ's voice, and he's a veteran, so I don't know what's screwed up there. RJ asks: is this a new token or just a tag? I'm not sure there's a difference.
Yeah, I'm not sure. I think maybe they added these end-of-text tokens as part of the vocabulary so that they don't have to be parsed, but I'm not really sure; I'm not sure I understand the question itself. Let me see... yeah, I think it's a new token, a new token in the vocabulary, but they haven't said so explicitly.

Right. Usually for this kind of stuff they do add new tokens. DeepSeek V3.1, which dropped yesterday, is an example of how tokens get added for this kind of thing; I would imagine GLM to be the same. Does anyone know what GLM stands for?

No idea. You could probably look it up.

I think they also mention here that by distilling from the outputs of distinct experts, the model quickly learns to apply long chain-of-thought reasoning for each task, and it can switch between thinking long and thinking short without any explicit instruction. The third piece is rejection sampling. While sampling, they want to make sure they keep only high-quality data, and they describe how they do that: they check that the output is in a valid format; they verify against objective answers where those exist, for math and science and even for code; and they use reward models for subjective questions, which can be RLHF- or RLAIF-style. They use those signals to reject samples. The tool-calling scenario is very interesting; they go deeper in later sections on how they orchestrate the tool calling, and I'll talk about that later. It also seems they selected certain prompts, or rather deleted some prompts that were not conducive to better results; they mention that here and state that it improved the reasoning capabilities.

Here they describe how they constructed the agentic SFT data. They used existing frameworks, like MCP servers, to get the basic tool features, and then they did task synthesis: it seems they used whatever documentation the MCP servers have to create tasks, which is an interesting approach. They used LLMs to comprehend the functionalities and then generate relevant queries and tasks. For more fragmented sources they follow a similar approach, but they first select a representative subset and then construct tasks based on that. Then there's single-step and multi-step tool calling, which basically means that in the reasoning process you may have to make one call or multiple calls at different stages, and hybrid setups.

Hey Venki, sorry, a quick and maybe stupid question, but I just wanted to confirm: the fact that they train specialized experts, is this different from other models? Is this one of the distinguishing parts of this model?

I know a lot of other LLMs do distillation, but I'm not sure; I haven't read a paper that does exactly this. Maybe Kimi or other papers do it, I don't know.

Yeah, I was curious too. Can you guys hear me now, by the way?

Yes.

I had my headset microphone on, sorry about that.
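As a rough illustration of the task-synthesis idea described above, the sketch below prompts an LLM with a tool's documentation and asks it to invent user tasks that require the tool. The prompt, the `call_llm` stand-in, and the filter are all hypothetical; this is not GLM-4.5's actual pipeline.

```python
# Sketch of agentic task synthesis: feed tool documentation to an LLM and ask it
# to generate candidate tasks. `call_llm` is a hypothetical stand-in for any model API.

SYNTHESIS_PROMPT = """You are given the documentation of a tool exposed by an MCP server.
Write {n} realistic, self-contained user tasks that can only be solved by calling this tool
(possibly multiple times). Return one task per line.

Tool documentation:
{doc}
"""

def synthesize_tasks(tool_doc: str, call_llm, n: int = 5) -> list[str]:
    raw = call_llm(SYNTHESIS_PROMPT.format(n=n, doc=tool_doc))
    tasks = [line.strip() for line in raw.splitlines() if line.strip()]
    # Crude quality filter: drop tasks that share no words with the documentation.
    keywords = set(tool_doc.lower().split())
    return [t for t in tasks if keywords & set(t.lower().split())]

# Example usage with a dummy "LLM":
doc = "weather.get_forecast(city, days): returns a daily forecast for a city."
print(synthesize_tasks(doc, lambda p: "What will the weather be in Paris for the next 3 days?"))
```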
I was also curious about the motivation for using distinct experts versus just using good rollouts. The conclusion I came to for myself is that they have this multi-stage curriculum where, in a couple of places, they mention they keep doing rollouts on their own data, so maybe in the early phases it's easier to generate distinct experts than to generate one model that's good at all three. In the process of building a generic model, it's easier to build specifically fine-tuned models that are each good at one thing and then distill them all into the same model. That was my take; I'm not sure I've seen this before myself.

I agree. In the RL setup they mention they needed different architectures for different types of reasoning or reinforcement learning tasks, so maybe that's why.

So they actually changed the architecture for each expert?

I think so. They have slime, their RL infrastructure, and they mention that for software engineering tasks they had to have Docker containers and other things; they describe it a bit later in the paper. But I think this is the training architecture, not the model architecture.

Oh, I see. Okay, that makes sense, why you would also want to split that. That's actually really interesting.

Right, so here they used LLMs and existing frameworks and did task synthesis based on those frameworks. A lot of the time they do this synthesis to get the data required for training, and they also use quality filtering to make sure only high-quality data goes into training. Then they use LLMs to generate the multi-step tool-call trajectories.

We'll move on to reasoning RL now. They use GRPO, which is a common framework; even DeepSeek uses it. I'm not exactly sure what this loss term is. What I understand is this: if I understand RL correctly, for each task you get a reward, and based on the reward you change the policy; you take an action based on the existing policy, you get a reward, and then you update. The issue is that if the task is too complicated, the reward will always be zero, so there's no gradient to train the model well. They want to make sure that at each particular stage they actually get a reward, so the training keeps improving. So in stage one they use moderately difficult data, and after a certain number of training steps they switch to extremely difficult data, so that there's a useful gradient, or rather the model is well trained enough to actually produce a reward. That's what I understand from this curriculum-based learning. They also tried different output lengths: 16K, 32K, 48K, and 64K.
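Here is a toy sketch of the two-stage difficulty curriculum just described: run RL on a moderately difficult pool until the reward plateaus, then switch to an extremely difficult pool. The pools, thresholds, and `train_step` callback are made up for illustration; this is not the paper's actual schedule.

```python
# Toy two-stage difficulty curriculum: stage 1 uses a moderate pool; once reward
# plateaus, switch to the hard pool. Everything here is illustrative.

def run_curriculum(train_step, moderate_pool, hard_pool,
                   max_steps=10_000, patience=200, min_delta=1e-3):
    pool, stage = moderate_pool, 1
    best, since_best = float("-inf"), 0
    for step in range(max_steps):
        reward = train_step(pool)            # one RL update on a batch from `pool`
        if reward > best + min_delta:
            best, since_best = reward, 0
        else:
            since_best += 1
        if stage == 1 and since_best >= patience:
            pool, stage = hard_pool, 2       # reward plateaued: move to hard data
            best, since_best = float("-inf"), 0
            print(f"step {step}: switching to extreme-difficulty data")
    return stage
```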

I believe in the end they decided it's better to just use a 64K context window; there was no additional advantage to starting with a small context window and then increasing it to 64K, so they can do 64K training from the beginning. They also mention that the two-stage approach of increasing difficulty produces better performance, and they mention single-stage RL at 64K output length. In the second paragraph there's a chart where they describe using a different loss function for coding tasks; that comes up in this paragraph, but let's talk about this part first.

Temperature is the usual temperature we see in LLMs: if the temperature is high, sampling becomes more exploratory; if it's too low, it becomes very deterministic. Instead of using a fixed temperature they use a dynamic temperature, and they have a validation procedure to make sure the temperature they've chosen is working well. Here they describe how they chose the temperature based on a validation set, and why they took that approach.

They also say that for code RL the loss calculation is critical for training efficiency, and that they used a token-weighted mean loss rather than the conventional sequence-level mean loss. I don't know exactly what's going on here, so I'm taking them at their word, and this chart seems to show that the token-weighted mean works better for their RL training compared to the sequence mean.

For science RL, a lot of the time they have verified answers, so it becomes easier during training to get an exact reward for whether the answer is accurate or not. For the agentic side, I think they use both humans and AI to get feedback. And this part is basically the training loop: they generate multiple samples and then... Vibhu, would you be able to comment on this?

Am I reading this right? Someone else should comment here, but is this GRPO without the KL divergence? Or is that just a reiteration of the statement from above?

Right. Did they mention why they removed the KL divergence from the loss?

I'm not sure; I don't think I came across that reasoning. To be honest, I'm not super confident I understand this part; it's pretty RL-heavy, though some of it I follow.
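To make the two objects under discussion concrete, here is a schematic GRPO-style loss with the KL-to-reference penalty simply omitted, plus the token-weighted mean versus per-sequence mean aggregation mentioned above. Shapes and names are illustrative; this is not the paper's implementation.

```python
import torch

# Schematic GRPO-style loss: group-normalized advantages, clipped importance
# ratios, no KL term, and a choice between token-weighted and sequence-mean
# aggregation. Illustrative only.

def grpo_loss(logp_new, logp_old, rewards, mask, clip_eps=0.2, token_weighted=True):
    # logp_new, logp_old: (G, T) per-token log-probs for G rollouts of one prompt
    # rewards: (G,) scalar reward per rollout; mask: (G, T) 1 for valid tokens
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # group-normalized
    ratio = torch.exp(logp_new - logp_old)                      # importance ratio
    unclipped = ratio * adv[:, None]
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv[:, None]
    per_token = -torch.min(unclipped, clipped) * mask           # note: no KL term
    if token_weighted:
        # every valid token counts equally, so long rollouts carry more weight
        return per_token.sum() / mask.sum()
    # conventional sequence mean: average within each rollout, then across rollouts
    return (per_token.sum(dim=1) / mask.sum(dim=1)).mean()
```

With `token_weighted=True`, a 2,000-token code rollout is not down-weighted relative to a 50-token one the way it is under a per-sequence mean, which appears to be the comparison the chart above is making.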
There are two things to dropping the KL divergence penalty. In other work, I don't remember which paper it was, they significantly dropped the penalty for KL divergence and then decayed it so it had no effect later on. The KL divergence is there to keep the model from going too far off track, but lowering the penalty allows the model to explore the space a bit more. That paper basically dropped the KL penalty because it let the model explore a broader set of generated tokens; it's similar to temperature in that it lets it explore more freely, and in return, by going down different paths, it eventually converges to an answer. In this case they're dropping the KL divergence because they're doing this on math and code, stuff that is objectively verifiable. As long as you can verify that the answer is correct, you're fine; you're able to get reward as long as you start solving the questions. In broader settings, like RLHF with preference feedback and DPO, it's still a concern, because you can get slight reward for preference-based outcomes, but for math and code, as long as you end up at that verifiable answer you're kind of chilling; you don't really need a KL anchor to supervise the policy reward. And there was another paper we covered recently that echoed this.

Thank you, that's really clear. I think it was the Kimi paper that did this, maybe.

Sounds about right. That's really good insight, I appreciate it.

And this is also just me spewing what I think about it; I could be wrong, but it seems legit to me.

I recall that too, so that sounds right to me.

Good enough for me.

In this paragraph they mention how they calculated the reward for each specific task. For web search tasks they use the final answer as the reward; for coding agents they use verifiable test cases for the reward; and for tool calling they check the format of the trajectory and then either proceed further or assign a zero reward. And then, beyond the cold-start SFT model, which has a basic understanding of what's going on and how to respond, once training has reached a certain number of steps they start applying self-distillation, replacing the data generated by the cold-start SFT with data from the RL-trained model; I think that's the expert model.

Sorry, I'm taking up so much of this talk, but this section made me nervous. A long time ago I remember going over some paper on synthetic data, and methods based on self-improvement kind of hit a limit pretty fast. It's interesting that they really emphasize the self-improvement aspect here; it seems like you might squeeze the most out of the existing dataset, but you might end up overfitting or something.

Yeah, I was worried about that too, because they use a lot of synthetic data. But even though it's synthetic data, the reward function probably captured enough reasoning capability that it generalizes well. That's what I was hoping for.
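As a tiny sketch of the format-gated outcome rewards described above for tool-calling rollouts: a malformed trajectory gets zero reward, and otherwise the reward comes from an outcome check, optionally with step-wise credit. The format and helpers here are made up for illustration; this is not the paper's reward code.

```python
import json

# Format-gated reward sketch: invalid tool-call format -> 0; otherwise reward the
# outcome (task solved / tests passed), or give step-wise credit per expected tool.

def parse_tool_calls(trajectory: str):
    """Expect one JSON object per line like {"tool": "search", "args": {...}}."""
    calls = []
    for line in trajectory.strip().splitlines():
        try:
            calls.append(json.loads(line)["tool"])
        except (json.JSONDecodeError, KeyError):
            return None                          # invalid format
    return calls

def trajectory_reward(trajectory, expected_tools, task_solved, stepwise=False):
    calls = parse_tool_calls(trajectory)
    if calls is None:
        return 0.0                               # format failure: zero reward
    if not stepwise:
        return 1.0 if task_solved else 0.0       # outcome-only reward
    hits = sum(c == e for c, e in zip(calls, expected_tools))
    return hits / max(len(expected_tools), 1)    # right tool at the right step

good = '{"tool": "search", "args": {"q": "GLM 4.5"}}\n{"tool": "open", "args": {"url": "..."}}'
print(trajectory_reward(good, ["search", "open"], task_solved=True))                       # 1.0
print(trajectory_reward("not json", ["search"], task_solved=True))                         # 0.0
print(trajectory_reward(good, ["search", "summarize"], task_solved=False, stepwise=True))  # 0.5
```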
Sorry, could I ask a question about what is meant by self-distillation? Maybe I didn't quite catch what they're actually doing here.

I think they were using the expert model to generate the response, and then the non-expert model uses that expert-generated reasoning. I think they call it self-distillation, but it doesn't quite make sense to me.

Reading this, it feels like what they're saying is that they take the RL-trained model and feed its responses into an SFT run. But that's kind of a weird thing to do, right? I don't quite understand why somebody would do that.

So this is common; they did this in Magistral. Basically, when you run out of good verifiers and training data: in Magistral Medium they trained a very good reasoning model just on math, and then they used that model to generate math data. In this case they do the same thing. In stage one you train experts, a reasoning expert, a tool-use agent, a general chat model, then you do this unified training with distillation. The experts generate responses, reasoning, and tool calls, and the unified model is trained on that synthetic data. It's "self-distillation" because the data kind of came from itself: they branched off, trained experts, and then trained on that output.

Oh, so you're saying they trained models on specific tasks, and because they were specific, for the unified one you use those specialized RL-trained models to generalize, or at least you take all the experts on different tasks and combine them into another SFT run?

Yeah. They use the experts to generate responses, reasoning, and tool-call traces; that's the distillation. They train experts to do output rollouts and then they train on those. This is how Magistral got reasoning: they trained a good math model and then used it for data. It's a very clear example of this idea of training another model to generate some of the data. I'd recommend that paper, or a paper club on it.

Makes sense.

A general comment: it seems like increasing the number of interaction turns with the environment has obviously improved the test-time compute. I think it's a generic statement; it has some value but no specific insight. And for general RL they use RLHF and RLAIF, model-based feedback as well; depending on the data and the task they mix these approaches, including rule-based feedback. Depending on the task, one of these will be better, but I feel models can give good judgments for a lot of tasks. What is this chart?

That's BrowseComp accuracy. BrowseComp is a very interesting benchmark from OpenAI; it's their web-search evaluation released with Deep Research, and I think they updated it recently; swyx has played with it a bit. The interesting charts I found were the two in section three, about multi-step training.

What are these red and blue lines?

They're the multi-stage training runs. They have two charts that break down multi-stage versus single-stage training, branching off at different difficulty levels and plateauing; those two are pretty cool.
Honestly, I think since we're running short on time, maybe we don't go through every chart, but they had good charts; I recommend the charts.

And on that last point, by the way, about the distillation: the high-level process is that first you train the three experts, the reasoner, the agent, and general chat. You train them with a cold start of SFT traces, then you do RL and make them experts. The big thing is that you don't just train on their rollouts flat out; they do a lot of filtration. For the distillation from the three experts they do rejection sampling, correctness checking, reward-model filtering, and tool-call verification; they actually verify all this stuff. Then the unified model gets that SFT distillation, so there's a deep level of filtering. I think that's one of the interesting things: it's not like they just build three models and merge them into one, some weird model merge or ensemble. They use them for synthetic data and then distill down, and it's a cycle of: SFT as a kickstart, do RL, generate data, filter data, do it again, SFT kickstart off of those. And of course you strip out some stuff, because you want hybrid reasoning, so you keep some reasoning and some non-reasoning data, and in the end you get one model that outperforms the three.

Sorry, just one more quick question. It says here, "once training has reached a certain step count or plateaued"; is it normal that they loop the distillation process?

That they what?

The distillation process; they're doing many rounds of the self-distillation. "Once training has reached a certain step count or plateaued, we apply self-distillation by substituting the original cold-start data with responses generated by the RL model, thus creating a superior SFT model. We then conduct further RL training to enhance this model." This is talking about training the experts, right, and then conducting further RL on that?

This is the part about increasing difficulty: "this strategy allows us to push the limits of RL-trained models." Sorry, what's the question here? This seems like a fairly standard approach, but I think I misinterpreted what your question was about.

No, it's okay, I might be misunderstanding. I just got the impression that something they're doing differently is a self-iteration loop, rather than running the distillation process once, which is what I thought other models were doing: "applying self-distillation by substituting..."

I think the loop is just that after you do easy questions you do hard questions; that's their whole multi-stage thing. At different stages they do harder and harder questions and mix in the context length. They show this in one of the figures, I think it's figure five, about moderate and extreme difficulty: where one plateaus, they switch to stage two with extreme difficulty and progress continues. That blue line, if you keep it stagnant at the same difficulty, is where it starts to plateau and accuracy stops improving, so the next cycle just uses extreme difficulty instead. I think that's what they're saying here: tiers of harder and harder questions. I could also be misinterpreting this.

That seems to be my understanding as well.
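Here is a schematic sketch of the filter-then-distill loop described above: expert models generate rollouts, and each candidate has to survive a stack of checks before it enters the unified SFT set. All names are placeholders, not GLM-4.5's actual code.

```python
# Filter-then-distill sketch: rejection sampling on format, correctness checks
# where verifiable, tool-call verification, and a reward-model gate, before the
# surviving samples become SFT data for the unified model. Names are placeholders.

def build_unified_sft_set(experts, prompts, checks, n_samples=8, min_rm_score=0.7):
    sft_data = []
    for prompt in prompts:
        expert = experts[prompt.domain]                  # reasoning / agent / chat
        candidates = [expert.generate(prompt) for _ in range(n_samples)]
        for cand in candidates:
            if not checks.valid_format(cand):
                continue                                 # rejection sampling
            if prompt.verifiable and not checks.is_correct(prompt, cand):
                continue                                 # math/code answer check
            if cand.tool_calls and not checks.tools_ok(cand):
                continue                                 # tool-call verification
            if checks.reward_model(prompt, cand) < min_rm_score:
                continue                                 # subjective quality gate
            sft_data.append((prompt, cand))
            break                                        # keep one good sample per prompt
    return sft_data                                      # SFT data for the unified model
```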
They also mention how they created a reward function for each type of task. They check the format, they check the trajectory, and sometimes they do step-wise checking: did we call exactly the right tool at that particular step? They also check whether the task was actually completed, rather than just whether the function was called at the right time. And sometimes these RL runs get into infinite loops; instead of just adding a standard penalty, they sampled a lot of prompts on which the RL tends to get into infinite loops and created a penalty function based on those, which is more sample-efficient. They also mention how they use GPUs more efficiently: if some RL rollout is taking a long time to finish, instead of waiting on it they built separate infrastructure to optimize GPU usage; you can read the details there.

Then there are the evaluation metrics, how the model compares against existing models. This part seems straightforward to me; I think it's cool, but a lot of the time it's still second to Claude or other models, though maybe the model size is small compared to them. Nothing new there, just metrics comparing against other models. One interesting thing: the model has a lot of reasoning capability built in, so you get an emergent phenomenon where translation surprisingly works well compared to existing models; that's a side effect of having a lot of reasoning tokens. I think that's all I wanted to cover, but I feel the RL setup and the things they mention are very cool.

Well, thank you so much for presenting. If anyone wants to volunteer for next week, there are all these fun papers, I posted a bunch in Discord, or if there are topics you want to cover we can cover them together. Big shout-out, thank you so much for presenting; it's always fun to get a different perspective on this stuff.

Yeah, well done. Thanks for volunteering.

All right, we need more volunteers and more papers. Somebody's going to do Seedance, that's another one. Oh, China models, man; I really want to do a China tour at some point, maybe early next year. You'd just have access to go see all the labs there. I can definitely get Qwen, we have inroads to DeepSeek, StepFun, MiniMax for sure. The lab behind GLM, I don't think they're hard to get, but I don't currently have contacts. Other papers, though: there's Meta's DINOv3, there's a Genie clone that came out of Tencent, there's a survey on parallel text generation, there's a lot of stuff, a few autoregressive image generation papers. And HRM. Tyler, you're volunteering? Yes, I haven't read that one and I need to read it, so HRM is on the docket in the next two to three weeks. I might also reach out to Jay Alammar for his GPT-OSS thing; I'll share it in Discord if you guys haven't seen it.
He did really good blog posts, The Illustrated Transformer and the illustrated GPT-2, and he put one out yesterday, I think, on the illustrated GPT-OSS, which is a good visual breakdown of what's happening in state-of-the-art MoE models. I'll share it in Discord and try to get him to come on; otherwise, if you have a paper, it's always better if someone else volunteers. Okay, Tyler, you're locked in for September 3rd for GLM, sorry, HRM, and then we need a volunteer for next week. Thanks guys, bye all, bye.