
GLM 4.5: Agentic Coding & Reasoning Foundation Model


Chapters

0:00 Introduction to the presenter and the GLM 4.5 paper
2:11 Overview of GLM 4.5 as a Mixture of Experts model
4:06 Defining ARC: Agentic, Reasoning, and Coding capabilities
5:41 Pre-training architecture and the Muon optimizer
6:54 Data collection, de-duplication, and context window extension
11:31 Mid-training with repository-level code and synthetic data
25:08 Post-training: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)
31:22 Tool calling and agentic data synthesis
36:52 Reasoning RL with curriculum learning
39:58 Dynamic temperature and token-weighted mean loss
49:12 Self-distillation process for unifying the model
59:16 Reward function calculation and handling infinite loops
61:12 Emergent capabilities: surprising translation performance

Whisper Transcript

00:00:00.000 | yeah go ahead
00:00:07.200 | um can you see my screen i can see your screen the audio audio could be better if you can switch
00:00:18.000 | your zoom audio uh don't use airpods for for uh for microphones airpods are very bad microphones
00:00:24.480 | okay even though it looks like i'm using my airpods in my in my ear i actually always
00:00:30.480 | switch the zoom to laptop microphone is much better should be in like the zoom clicker thing
00:00:41.760 | no i don't even like airpods max i've tried it it's not great
00:00:46.480 | is it any better now yeah much better oh sweet sweet
00:00:53.920 | okay yeah um you know just introduce yourself because you're because you're your new presenter
00:00:57.600 | uh let's get to know about you and why you're interested
00:01:00.240 | i i was working as a back-end engineer um my name is venki by the way nice to meet you all
00:01:08.160 | i took a sabbatical to do uh to understand like the current ecosystem and to
00:01:15.120 | experiment more uh with the ai tools and also understand the fundamentals a lot so i took a break
00:01:22.160 | uh i was um and then i'm trying to read papers but so far it's not working out for me the self-study
00:01:28.720 | but i'm still like uh trying to make sure like i get something positive out of this sabbatical
00:01:34.000 | uh this is my this is my first time presenting the paper i'm i'm not sure like how it goes but
00:01:39.680 | i'm hope uh i feel like i i put a lot of attention to understand the details because i'm presenting it
00:01:45.360 | otherwise i would usually skip through them skip through them yeah so it's a good good uh a good
00:01:52.480 | like a decision for me to volunteer for this agreed okay uh if you want you can make your paper full
00:01:59.680 | screen and then we can get into it okay yeah so uh should i should i give an overview first uh yeah yeah
00:02:13.040 | so this is a uh it looks like a mixture of expert a transformer uh on the base model and then uh
00:02:20.400 | what they have done differently is like uh they want to make sure like this model has a
00:02:25.040 | arc capabilities agentic coding and reasoning abilities so what they have done they uh
00:02:30.880 | post training they they they've used like two different ways to train this model like they stage one
they uh they train an expert model uh for each specific task like agentic, reasoning and
coding and then once they have like a sub expert model they distill these capabilities into one
uh unified model so that's why uh this model seems to have like all three capabilities uh and
00:02:59.520 | this model can is able to like uh uh have like a switch between modes like thinking long thinking modes
00:03:06.640 | and then uh quick response modes uh i haven't actually uh used the model much but i asked claude to
00:03:14.000 | generate some uh some task and prompt to verify these capabilities so maybe post this talk we can see
00:03:21.280 | if we can go through uh if this model is actually looking good um with those tasks but basically this is a
00:03:29.120 | mixture of experts um and yeah um and these are like some metrics they provided uh with arc capabilities
i'll move to the second yeah i feel like arc is a little bit um co-opted by arc-agi but arc has a
00:03:53.360 | definite definition which i think you're showing on screen right now right uh oh yeah the arc challenge
00:04:00.720 | true so they these people like they say arc uh to define like uh if the model has agentic or reasoning or
00:04:08.160 | coding capabilities all together in one uh one model instead of using multiple models to achieve uh
00:04:16.160 | different performances uh so they clearly define the ambition here
00:04:20.800 | so how often um like yeah this seems like the main uh drive behind the this model training uh to have
00:04:33.600 | all these capabilities in a single model and then uh the paper mostly quickly goes through uh like uh
00:04:44.400 | what they have done uh and then also the rl training pipeline and how they approach like a data
00:04:52.480 | synthesis for training etc but this uh these are like a bunch of metrics they've mentioned here and
00:04:59.040 | they have now a smaller model as well but they haven't mentioned much about the smaller model apart
00:05:04.880 | from i'm assuming it's a distillation yeah these are like a bunch of metrics how we uh how we performed
00:05:14.000 | on a various metrics uh i i try to see if they have like uh the training loop made available i don't think
00:05:25.440 | they've made the training loop available for open source i think the only weights are available in the
00:05:29.840 | git repo
00:05:30.400 | uh so yeah architecture the fundamental architecture is like a mixture of experts i feel like the only thing
00:05:48.240 | they changed is uh uh they used a muon optimizer instead of adam i'm not really sure if adam is like
standard but they're using muon uh they mentioned this down somewhere here yeah
so they used the muon optimizer uh and then what else they've changed
00:06:13.760 | yeah so for for context basically for the past few years the adam and adam w optimizers they've been
00:06:21.200 | very much the standard defaults there hasn't been much progress like adam came then adam w
uh the recent kimi k2 paper it's like the big the big model that came out after um after deepseek
they switched to a muonclip optimizer it's like better training stability they have a very pretty
00:06:41.520 | loss curve so that's kind of the background optimizer
00:06:44.960 | um and one thing they have done uh they have this is like the general pattern they used like uh they
started with 4k context window and then they increased the context window to 32k and uh and afterwards like 128k
00:07:03.200 | and uh and they mentioned like how they collected their data uh so basically uh yeah they used like
00:07:10.640 | web crawling uh to have like english and chinese web pages and then uh they used like a quality metric
00:07:17.440 | so here they mentioned like uh they divide the crawl web pages into buckets with different quality scores
00:07:24.320 | and then they're up sampling the documents uh so that like they uh the initial like the base data is much
much higher quality um and then they they were using like uh this uh semdedup uh to basically uh deduplicate
like uh the data i mean remove the duplicate data uh this basically mentions like what they have done with the
00:07:48.480 | pre-training data
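As a rough illustration of what SemDedup-style semantic de-duplication does (a minimal sketch, not the paper's actual pipeline): embed documents, cluster the embeddings, and within each cluster drop documents that are too similar to one already kept. The cluster count and threshold below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def semdedup(doc_embeddings: np.ndarray, n_clusters: int = 100, threshold: float = 0.95):
    """Return indices of documents to keep after semantic de-duplication.

    doc_embeddings: (num_docs, dim) array of document embeddings, assumed L2-normalized.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_embeddings)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        kept_in_cluster = []
        for i in idx:
            # Drop a doc if it is too similar to something already kept in this cluster.
            sims = [float(doc_embeddings[i] @ doc_embeddings[j]) for j in kept_in_cluster]
            if not sims or max(sims) < threshold:
                kept_in_cluster.append(i)
        keep.extend(kept_in_cluster)
    return sorted(keep)
```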
00:07:52.160 | and one thing they uh they said like uh they used like group query attention with partial rope in the
attention block and then also they mentioned like uh more attention heads so uh one of the claims was
like having more attention heads and uh sorry having a deeper model
over a wider one like it increases the reasoning capabilities
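For reference, "partial RoPE" just means the rotary position embedding is applied to only a slice of each attention head's dimensions, with the rest passed through untouched. A minimal sketch (the dimensions and helper are illustrative, not the model's exact configuration):

```python
import torch

def rotate_half(x):
    # Standard rotary helper: swap and negate the two halves of the last dim.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_partial_rope(q, cos, sin, rope_dim: int):
    """Apply rotary embeddings to the first `rope_dim` dims of each head only.

    q:        (batch, heads, seq, head_dim)
    cos, sin: (seq, rope_dim) rotary tables for the rotated slice.
    """
    q_rot, q_pass = q[..., :rope_dim], q[..., rope_dim:]
    q_rot = q_rot * cos + rotate_half(q_rot) * sin
    return torch.cat([q_rot, q_pass], dim=-1)
```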
and venki before you go forward you said something at the beginning of uh
00:08:28.960 | of this discussion uh so did i i just want to make sure that i got this right so did they train three
00:08:38.240 | separate models one each for agentic reasoning and coding and then they distilled all those three
00:08:47.120 | models into one model is that how they approached this so for the base model they used like the same
00:08:53.280 | pre-training for the base model once they have the uh pre-trained base model in the post-training they
00:09:00.160 | uh used like a t they created like three different models and then did they then distill into one other
00:09:08.640 | model yeah okay
00:09:10.960 | um yeah so in stage one they used expert training and then stage two they unified
00:09:20.000 | okay thanks
00:09:25.200 | on the on the previous note here about attention heads and wide versus deep i think it's been
00:09:31.280 | interesting with the other few late papers so like uh once again the kimmy paper talks about
00:09:37.040 | how they do less attention heads because you know a very long a very long context it's like half the
00:09:43.200 | memory bandwidth that's required and then open ai's open source model they talk about how they
00:09:50.320 | go for a wider moe as opposed to a deeper one and that that helped with like training stability and
00:09:55.680 | stuff as well so it's like interesting these little notes that you know normally you just take at face
00:10:00.320 | value you're starting to see some other differences with different models and why they pick what
00:10:13.040 | um so in pre-training they have like multiple stages uh and like they use different data and then also
00:10:19.360 | different context sizes um one thing they have done with the code like they use like a mostly github
for code and one thing they i observed like uh they used uh uh they combined like a files from the same
repo to get a better understanding of the relation between different files in in the code base
00:10:41.440 | um i i thought that's interesting uh to have that semantic meaning embedded
00:10:47.280 | um and also most of the times like once they have once they crawl the data they were using uh some
00:10:55.520 | kind of model to predict the quality of a particular page and then based on the quality uh that seems like
00:11:02.560 | they have sampled the particular high quality data that seems like a common pattern they used for all the data
00:11:08.720 | uh yeah so uh they like they mentioned up sampling a lot based on the quality of the content
00:11:21.360 | um so the pre-training i believe like they're trying to have like a base model uh that i mean with high
00:11:31.280 | quality data and then second phase like in the mid training they were using repo level code training uh
00:11:37.360 | it's basically they increased the context and they added like a more data like a long context and
00:11:44.800 | agent training and here in i feel like they use a lot of synthetic uh data generations and synthetic data
00:11:52.000 | uh especially for agent tasks and uh for reasoning tasks uh one thing so
00:12:05.200 | they included these as well to improve the coding abilities um yeah and they mentioned like they
used a diff formatting uh i think it helps like uh to generate better uh to understand or to review
better uh having this diff format
and uh here they mentioned like uh they boosted like context length from 32k to 128k
00:12:37.200 | and uh they up sample only long documents um i mean that makes sense
yeah it's the same thing for 4k to 32k and then 128k
00:12:54.000 | uh i'm not sure i understand oh yeah so if during pre-training uh uh they didn't use like a so i
00:13:03.440 | think this basically best fit packing it basically truncates the context to given length uh so in one
00:13:10.320 | of the stages they didn't use that uh so that like uh uh it becomes like a good data augmentation strategy
and then for mid training uh they applied best fit packing so that like i think they compress the
context into a given context window so that like it has uh it's like a summary generation i think
uh frankie you have a question uh yeah could you could you explain a little
00:13:37.440 | bit what they mean by up sample uh i i think they uh yeah um i'm assuming it's like down sampling up
00:13:53.680 | sampling right like they will want to have like a uh if you have like multiple buckets you want to get
00:13:58.080 | like more data from a particular bucket rather than less data from other buckets suppose if you have
00:14:03.680 | like equal distribution you probably get like a uh if you do thousands samples like the distribution
would be even right with up sampling i believe like you give more weight to certain buckets to get the
high quality data uh vibhu do you want to add anything
typically that's what up sampling is um it's not just more weight to different buckets sometimes they'll just
do multiple epochs right so for some long context or for some like high quality it's not uncommon that
they do multiple passes of the same data and down sampling as well you know like if you have
a lot of generic text it's like filtering is a form of down sampling right uh yeah sometimes that's just the way it's done
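To make the upsampling idea concrete, here is a toy sketch: documents are drawn with per-bucket weights, so higher-quality buckets are effectively seen more often (the bucket names and weights below are made up, not the paper's values).

```python
import random

# Hypothetical quality buckets with relative sampling weights.
buckets = {
    "high_quality":   {"docs": ["h1", "h2", "h3"], "weight": 4.0},  # seen ~4x as often
    "medium_quality": {"docs": ["m1", "m2", "m3"], "weight": 1.5},
    "low_quality":    {"docs": ["l1", "l2", "l3"], "weight": 0.5},  # effectively downsampled
}

def sample_document():
    names = list(buckets)
    weights = [buckets[n]["weight"] for n in names]
    bucket = random.choices(names, weights=weights, k=1)[0]
    return random.choice(buckets[bucket]["docs"])

# Over many draws, high-quality docs appear far more often than under a uniform mix.
counts = {}
for _ in range(10_000):
    doc = sample_document()
    counts[doc] = counts.get(doc, 0) + 1
```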
00:14:56.320 | the best fit packing what does that actually do uh based on my understanding i think they're
00:15:04.400 | trying to summarize uh so that it fits actually to the context i haven't actually uh verified that i assumed it
00:15:15.200 | oh sorry i what's a what's a basic procedure i don't quite understand what are they doing
um i yeah i haven't actually verified uh what exactly they've been doing like i haven't seen the
algorithm i just assumed that like they're probably uh like summarizing the
context to fit uh better suppose like if you have like 128k or like a lot of data
00:15:45.680 | it's better to have like a summary uh so that like you have like main points in the uh limited
00:15:51.200 | context so that like uh the queries can i mean the llms can focus better on the uh what's going on
00:15:58.640 | rather than like whole context okay and then if anyone has a better explanation please uh chip in
sorry i have a separate question venki um so uh what is the rationale so i think they're saying
00:16:20.160 | pre-training you begin with something smaller and then mid-training you introduce longer um is this kind
00:16:26.880 | of like uh um you know like going going i forgot what they call the curriculum training is that kind
00:16:33.280 | of like thinking the thought process here uh or why what's the reason for doing it this way
00:16:38.160 | uh the curriculum training i think they use it in the context of rl right like to get like a better
00:16:46.240 | reward models uh but i think this here it's simply extending the context so they can uh i mean we can put
00:16:55.680 | like a lot of data into the context there's just like a uh i feel like a 4k they have like a a good
00:17:03.120 | base like the weights are uh set in a better way here and then once we increase the context i believe
00:17:09.680 | like uh it can generalize well and then it can have like a uh the attention uh can be spread without
00:17:16.560 | losing the actual value or that's what i think here they've been doing does it have any data showing
00:17:23.760 | like um i don't know ablations showing that if you do it mixed together it's like this or here
00:17:30.000 | separate it like this it does something differently oh uh i can i can chime in here actually uh there's
00:17:36.800 | two parts of this mid training right one is context length extension which is typically not what you
00:17:43.040 | consider mid training a context extension can just happen at the very end of all your rl your post
00:17:49.120 | training right you can just do more training on long context and that kind of works now what this is is
00:17:55.840 | the reason it's more mid training is because you're going from regular pre-training just next token
00:18:01.120 | prediction to some form of specialized data right so repo level code this is long context
00:18:06.720 | full code base uh synthetic reasoning data long context agentic data right now the the fact of
00:18:14.000 | the matter is it's not like exclusive that they're just doing this but they're also doing long context
00:18:19.280 | but it's pretty common right you don't see like reasoning data at short context right because oftentimes
00:18:26.400 | the queries that go into these like the system prompts themselves are already pretty long so same
00:18:31.920 | thing with the reasoning traces right we just don't do reasoning at short context because
00:18:36.480 | you need context length for actually doing reasoning right you're not going to do
00:18:41.600 | reasoning at 4k context so it's mid training more so because of the type of training they're doing
00:18:47.760 | not because of context but yeah they didn't just do it as a separate step which is like okay too like
00:18:53.040 | long context is not super like unknown right all you do is just train on long context for like
a couple hundred thousand or a couple million samples and you're kind of chilling
00:19:04.400 | and with that like you know you can just mix it in wherever but in this case it's mid training because
00:19:11.680 | the type of data and then um that's like the first question on the other one about um best fit packing so
00:19:19.520 | one thing you can do if you have like long context let's say you're training on like 32k context right
and you have a bunch of sequences that are like short like 4,000 sequence length uh what you do in
terms of packing is you can concatenate multiple samples into one sample right so like if i have
eight queries that are 4,000 sequence length i can batch them all together and make one sequence of 32k
00:19:45.040 | and that's that's best fit packing that's more efficient on gpus or you could not do that and you
00:19:51.120 | could just train normally right so it's you could do best fit packing by packing multiple short sequences
00:19:57.360 | into one sequence or you could not do it and then you know they they talk about what they do here but
00:20:02.880 | that's what best fit packing is it's where you pack short sequences into a lot of sequences and then
00:20:08.080 | there's pros and cons of both i think for them they say in pre-training they don't do it so it's not as
00:20:12.960 | efficient on the gpus they just do random truncation right so sometimes different sequences and batches
00:20:18.880 | will be different lengths and that's okay it is what it is and they talk about why they do this
00:20:23.680 | and then in mid training they do do it because context matters right you don't want half of a
00:20:30.160 | reasoning trace to be cut off because you're out of sequence length right you want stuff to be the
00:20:35.520 | same length so in mid training and reasoning they do do it like you don't want to chop off a code base
00:20:40.960 | just to be efficient in gpu and pre-training they're like it's good it's data augmentation even if you
00:20:45.760 | don't have half of the stuff it's fine it's more efficient on our majority train run but sorry that's
00:20:50.960 | like two two two topics in one um you know i hope that made sense thank you so but when they kind of
concatenate this like say 8k into 32k uh you you know you're not like masking anything right on the
attention so the stuff you concatenate can still see yeah the previous stuff right yeah yeah yeah
00:21:12.560 | it's weird like that uh there's there's different techniques to do uh packing but like you know like
there are techniques of adding tokens to forget stuff before but i i mean that's like i don't think
00:21:25.600 | they're doing any of this so the real point of packing is just you keep sequence similar and it doesn't
00:21:32.400 | have like contamination and training data right so with reasoning i don't want like i don't want in
00:21:37.760 | the same training batch code about like you know one library with something completely irrelevant to
00:21:44.320 | think about but um it's it's just beneficial if you keep it training but then if it's like just regular
00:21:51.840 | like storybooks right let's say you have like a harry potter book and then half of it is added in with a
00:21:57.200 | lord of the ring book and you're just predicting next tokens it's fine you know like it's just
00:22:01.920 | common but i think i think uh you know that's that's high level what this stuff is someone else should
00:22:08.080 | do a little bit more research into it i just know that it's typically done in pre-training where the
00:22:12.960 | type of data is not as important and you don't want to do it in in reasoning thank you and there are
00:22:19.600 | approaches to how you can how you can better process stuff and there's like some paper that talks about
00:22:24.720 | like adding special tokens like how encoder decoders just have like cls token
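A rough sketch of the packing idea discussed above: greedily concatenating short sequences into fixed-length training samples instead of padding or truncating each one. This is a simplification (real best-fit packing also has to handle document boundaries and attention masking), and the numbers are only illustrative.

```python
def pack_sequences(sequences, max_len=32_768):
    """Greedy first-fit-decreasing packing: put each sequence into the first bin with room."""
    bins = []  # each bin is a list of sequences whose total length <= max_len
    for seq in sorted(sequences, key=len, reverse=True):
        for b in bins:
            if sum(len(s) for s in b) + len(seq) <= max_len:
                b.append(seq)
                break
        else:
            bins.append([seq])
    # Each packed training sample is the concatenation of its bin's sequences.
    return [[tok for s in b for tok in s] for b in bins]

# e.g. eight ~4k-token sequences pack into a single 32k training sample.
dummy = [list(range(4_000)) for _ in range(8)]
packed = pack_sequences(dummy)
assert len(packed) == 1 and len(packed[0]) == 32_000
```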
00:22:47.680 | thank you and uh and also i don't think they mentioned a lot of ablation studies or
00:22:55.680 | why they have done on the pre-training stage most of their focus on like uh making sure the rl works
00:23:02.160 | rl is effective and then they're able to improve the capabilities so this paper like a lot of the times
00:23:09.120 | the charts or the graphs are basically talking about rl pipeline not the pre-training
00:23:15.760 | um yeah
uh so they mentioned uh about like hyper parameters like i already mentioned like they used the muon
optimizer uh i couldn't understand like some of these things but uh primer for c i feel like uh
00:23:51.440 | um yeah i i to be honest like i haven't understood like this completely uh
00:23:57.280 | um so this one's like a warm stable decay schedule like uh they're changing the learning rate on
00:24:04.000 | they mentioned like how they change it why uh and then they also use like a batch size warm-up strategy
uh so batch size was increased gradually from 16 million to 64 million tokens during training
uh and they use like the muon optimizer because it's uh it feels like it accelerates convergence and like it
can tolerate large batch sizes
00:24:30.560 | yeah uh if anyone have a better understanding uh please help me out
00:24:43.280 | yeah i don't know where to get these numbers
00:24:49.200 | from either but usually it's just you know it's one paragraph but they spent like three months updating
00:24:55.440 | different parameters yeah
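As a reference for the warmup-stable-decay schedule and the batch-size warm-up mentioned above, here is a toy version; the step fractions and rates are placeholders, not the paper's actual values.

```python
def wsd_learning_rate(step, total_steps, peak_lr=2e-4, warmup_frac=0.01, decay_frac=0.1):
    """Warmup-Stable-Decay: linear warmup, long flat plateau, then linear decay toward zero."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / max(total_steps - decay_start, 1))

def batch_size_tokens(step, total_steps, start=16_000_000, end=64_000_000, ramp_frac=0.2):
    """Batch-size warm-up: ramp tokens-per-batch from 16M to 64M over the first part of training."""
    ramp_steps = int(total_steps * ramp_frac)
    if step >= ramp_steps:
        return end
    return int(start + (end - start) * step / ramp_steps)
```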
00:24:56.560 | i'll move on to the post training
00:25:03.680 | i feel like this is uh this is where the majority of the work uh that's i think it's very carefully
00:25:12.720 | crafted and created uh as i mentioned earlier like they use like a two-stage approach like in stage one
00:25:20.080 | uh they created expert models for each uh specific task and then stage two they
00:25:26.320 | use distillation techniques to integrate these multiple experts
uh um and then they mentioned like uh supervised fine-tuning uh
uh so uh they do this sft for both stage one and stage two and in expert training like uh this is
basically like making sure that the model has basic capabilities uh to either to generate or to respond
properly uh other than that it doesn't have like any uh specific like agentic or
reasoning uh capabilities
00:26:12.160 | yeah uh in sft for stage two um it's like a their distilling capabilities from different expert
00:26:19.120 | models and then making sure that like it gets the uh capabilities from uh three different experts
00:26:34.720 | yeah um like same thing um so here they mentioned like how they uh perform these tasks
00:26:59.760 | uh one thing they have done differently for uh code so uh i feel like in regular models it seems like uh
00:27:24.480 | they have to uh like for text data they have to use like a lot of escape
00:27:29.680 | characters so uh this is for this they say that like that's fine for like a model specifically
00:27:36.080 | designed for coding task but they want to have like a general capability for this model right so
00:27:40.800 | instead of using a lot of uh escape characters they decided to format the data into this uh xml format
00:27:48.400 | uh so that they don't have to use like escape characters that's the one thing they have done differently
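To illustrate the formatting point: embedding code inside a JSON string forces escaping of quotes and newlines, while an XML-style wrapper can carry the code verbatim. The tag names below are invented for illustration and are not the paper's actual schema.

```python
import json

code = 'def greet(name):\n    print(f"hello, {name}")\n'

# JSON-style tool call: the code must be escaped into one long string.
json_call = json.dumps({"tool": "write_file", "args": {"path": "greet.py", "content": code}})
# -> "content": "def greet(name):\n    print(f\"hello, {name}\")\n"

# XML-style formatting: the code block stays literal, no escape characters needed.
xml_call = (
    "<tool_call>\n"
    "  <name>write_file</name>\n"
    "  <arg name=\"path\">greet.py</arg>\n"
    f"  <arg name=\"content\">\n{code}  </arg>\n"
    "</tool_call>"
)
print(json_call)
print(xml_call)
```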
00:27:59.600 | um so um um so in um i couldn't hear you i'm sorry sorry you're really soft rj i don't know what
00:28:15.200 | happened to your audio nope you know what happened rj you uh you could type i guess
00:28:27.120 | i'll be rj's voice but also he's a veteran i don't know what's um what's screwed up there
00:28:35.840 | this rj says is this a new token or just a tag i'm not sure there's a difference um yeah i'm not sure i think
00:28:54.640 | it's maybe they added these uh like end of text uh tokens as part of our vocabulary so that like
00:29:02.480 | we don't pass them but i'm not really sure
00:29:04.640 | is it that i'm not sure i understand the question itself let me see the
00:29:21.440 | yeah i think it's a new token yeah new token in the vocabulary but they haven't mentioned it
00:29:26.400 | all right right
usually for this kind of stuff they do add new tokens uh deepseek v3.1 which dropped
00:29:38.880 | yesterday uh you can see an example of how they add tokens for this kind of stuff i would imagine glm
00:29:45.280 | to be the same does anyone know what does glm stand for no idea you could probably look it up
00:29:54.560 | so i think here they also mentioned like uh by distilling from output of distinct experts the model
00:30:06.080 | uh learns uh learns quickly applying long context of like cot reasoning for each task and then it can
00:30:14.320 | switch from uh thinking long or also thinking short without any explicit uh without any explicit suggestion
00:30:23.440 | and third one uh the rejection sampling i feel like uh so while sampling uh they want to make sure that like
00:30:33.040 | uh they sample high quality data uh this rejection sampling they basically mentioned how they have done
00:30:38.960 | that uh they want to make sure that it's in the valid format and then um verification with the objective
like i think the math or science uh you can have clear objective answers even the code uh so they did that
uh that part as well and then uh and then uh they have like uh reward models for subjective questions uh it
can be either uh rlhf or rlaif uh uh and then uh they use this to filter i
mean um reject the samples and then i feel like the tool calling scenario is very interesting um they go deeper in
later stages as well how they orchestrated like the tool calling or uh yeah i'm gonna talk about that later
00:31:34.320 | section and then also uh seems like they uh they selected few prompts to improve the uh uh or like
00:31:43.280 | they deleted like some of the prompts which are not conducive to better results um they mentioned that here
00:31:52.320 | and they state that like it improved uh improved the reasoning capabilities
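A minimal sketch of the rejection-sampling idea described a moment ago: sample several responses, keep only those that pass a format check plus either a verifier (objective tasks like math and code) or a reward-model threshold (subjective tasks). Every helper here is a stand-in, not the paper's actual pipeline.

```python
def rejection_sample(prompt, generate, is_valid_format, verify_answer=None,
                     reward_model=None, n_samples=8, reward_threshold=0.7):
    """Return the responses worth keeping as SFT data for this prompt."""
    kept = []
    for _ in range(n_samples):
        response = generate(prompt)           # rollout from the expert model
        if not is_valid_format(response):     # e.g. well-formed tool calls / answer tags
            continue
        if verify_answer is not None:         # objective tasks: math checkers, code tests
            if verify_answer(prompt, response):
                kept.append(response)
        elif reward_model is not None:        # subjective tasks: keep high-reward responses
            if reward_model(prompt, response) >= reward_threshold:
                kept.append(response)
    return kept
```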
00:31:57.520 | so here they mentioned like how they uh constructed the agentic sft data
00:32:05.920 | so uh i feel like they they used like existing frameworks like mcp servers uh to uh get like a basic
00:32:15.760 | features and then they created uh they did task synthesis it seems like uh they used like whatever
00:32:22.080 | the documentation they have for mcp servers to create tasks uh that seems like a very uh yeah it's
00:32:29.280 | interesting approach
00:32:30.160 | so they used llms to comprehend functionalities and then generate relevant queries and tasks
00:32:39.840 | and for uh fragmented like uh they they did follow similar approach but they uh like uh first they
00:32:48.240 | select like a representative subset and then they employ the construct the task based on that
00:32:54.320 | um yeah single step and multi-step tool calling i i believe uh uh this basically says that like uh in
the reasoning process you may have to use like a uh like one call or multiple calls at different stages
and then uh uh hybrid hey venki sorry just a quick kind of maybe might be a stupid question but i just
00:33:22.960 | wanted to confirm is it the actual like them training specialized experts is this different to the other
00:33:31.680 | models do you know like is this one of the different parts of of um of this model um i mean i know a lot
00:33:40.240 | of times we uh other other llms do distillation but i'm not sure i haven't read any paper maybe kimmy
00:33:47.600 | or other papers does it i don't know
00:33:48.960 | yeah i was i was curious too about whether can you guys hear me now by the way yes yeah i had my head
00:34:01.520 | set microphone on sorry about that uh i was also curious about what what was the motivation for
00:34:07.360 | using distinct experts versus just using good rollouts and i i kind of the conclusion i came
00:34:17.760 | to for myself was that they have this like multi-stage uh curriculum where they're they keep they mention a
00:34:29.440 | couple places where they keep uh you know sort of doing like rollouts of their own data and so that
00:34:37.920 | maybe in the early phases it's easier to generate uh distinct experts than it is to generate one model
00:34:45.760 | that's good at all three and then and that's why so they like in the process of generating a a generic model
00:34:54.880 | it's easier to to do like specific like fine basically fine-tuned models that are good at one
00:35:01.600 | thing and then distill them all into the same model that's that was my take i'm not sure i've ever seen
this my before myself i agree in in the rl setup like they mentioned this uh they needed like different
00:35:15.920 | architecture for different types of uh reasoning tasks or reinforcement learning tasks so maybe that's
00:35:21.760 | why they so they actually change the architecture for each expert uh i think so um interesting
yeah they have slime rl and then uh they mentioned sometimes like uh for software engineering task like
00:35:39.200 | they had to have like docker containers and uh other stuff like uh so they mentioned data like a bit later
00:35:47.520 | in the page but i'm not sure like this is like the training architecture not the model architecture
00:35:52.880 | right oh i see okay yeah no but that makes sense why you would also want to split that okay that's actually
00:35:58.800 | really interesting um right so here uh they use like llms uh they used existing frameworks and then
00:36:19.840 | they did task synthesis based on uh based on these frameworks and then uh so i feel like a lot of times
00:36:28.480 | they uh they uh they do this synthesis to get the uh data uh required for training and then also
00:36:34.320 | they use like quality filtering uh to make sure that they have like a high quality data goes into the
00:36:39.440 | training yeah and then they use llms to generate this uh multi-step tool called uh into trajectories
00:36:52.880 | uh yeah we'll move on to reasoning rl now one thing they uh i think uh they use like a grpo i feel like
00:37:00.960 | this is like a common framework even deep seek uses this i'm not sure uh i'm not exactly sure what is a
00:37:08.000 | this uh this loss term um yeah
00:37:12.640 | so one thing i understand based on this like uh so yeah uh if i understand like aurel correctly like
00:37:26.080 | for each task like you you get a reward and based on the reward you change the policy uh you take an
00:37:31.360 | option based on the existing policy you get a reward and then change it so here they uh they want to have like
00:37:37.120 | uh if if the task is like too complicated like the reward would be always zero so we don't have any
00:37:44.480 | gradient to train the model better way so they want to make sure that like uh at particular stage we get
00:37:51.120 | like appropriate reward or like we get a reward so that like we can train more or train better so uh
00:37:58.240 | so in stage one they use like moderate examples moderate difficult data uh and then after like
00:38:04.720 | certain point like after certain training steps they use like extremely difficult data so that like they
00:38:10.000 | have a better gradient or the model is well trained enough to produce a gradient or produce a reward um
i think that's what i understand from this uh curriculum based learning
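A toy sketch of that two-stage curriculum idea: train on moderate-difficulty problems first so rewards (and therefore gradients) stay non-zero, and switch to the extreme-difficulty pool once accuracy plateaus. The thresholds below are invented for illustration.

```python
def pick_training_pool(step, recent_accuracy, moderate_pool, extreme_pool,
                       plateau_acc=0.8, min_steps_stage1=1_000):
    """Stage 1: moderate problems keep the reward signal informative.
    Stage 2: once the model mostly solves them, move to extreme difficulty."""
    if step < min_steps_stage1 or recent_accuracy < plateau_acc:
        return moderate_pool
    return extreme_pool
```

The point is just what the speaker describes above: if every rollout fails, every reward is zero and there is nothing to learn from, so difficulty is ramped only once the model can earn reward.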
yeah uh and they tried to use like a different context uh like uh like 16k 32k 48k and 64k.
00:38:36.320 | i i believe at the end they decided like it's better to use like a 64k context window uh there is no
00:38:43.280 | additional advantage with the uh starting with the small context window and then increasing the uh
00:38:48.640 | context window to 64k rather they can do uh 64k training from the beginning
00:39:05.360 | uh yeah they also mentioned like uh this two stage approach uh increasing the difficulty produces uh
00:39:17.120 | some better performances better performance
yeah also they mentioned like a single stage rl at 64k output length uh this is a for that uh in
00:39:32.560 | second paragraph like this chart
00:39:44.720 | uh they uh they mentioned like they used a different uh
00:40:02.320 | loss function uh
00:40:04.160 | and this this basically this chart like uh using different uh loss function uh for like a coding tasks
00:40:14.720 | uh i think uh that comes in this paragraph but we can talk about this first and then
00:40:21.840 | so uh temperature i believe it's like a common uh temperature we see in llms like if the temperature is
00:40:29.600 | a high uh it becomes like a high uh it becomes like a more exploratory uh
00:40:38.080 | if temperature is too low it becomes like a very deterministic
00:40:41.200 | right so instead of using a uh like a standard temperature they try to use like a dynamic
00:40:47.840 | temperature um and then they have like a certain validation to make sure that like the temperature they
00:40:54.720 | have chosen uh is working fine uh is working fine uh yeah so here they mentioned how they uh how
00:41:07.920 | they have chosen temperature based on the validation uh validation set
00:41:14.240 | uh and they mentioned here like why they have uh taken that approach
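A rough sketch of picking a sampling temperature from a validation set rather than fixing it: sweep candidate temperatures, score rollouts on held-out prompts, keep the best one. This is a guess at the general mechanism, not the paper's exact dynamic-temperature procedure, and all the helpers are placeholders.

```python
def choose_temperature(candidates, rollout_fn, validation_prompts, score_fn):
    """Pick the temperature whose rollouts score best on a held-out validation set.

    rollout_fn(prompt, temperature) -> response; score_fn(prompt, response) -> float.
    """
    best_t, best_score = None, float("-inf")
    for t in candidates:  # e.g. [0.6, 0.8, 1.0, 1.2]
        scores = [score_fn(p, rollout_fn(p, t)) for p in validation_prompts]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_t, best_score = t, avg
    return best_t
```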
uh so here they mentioned like for code rl uh the loss calculation is critical for training
00:41:34.880 | efficiency and then instead of using or they used like a token weighted mean loss uh
rather than conventional uh sequence mean loss uh i don't exactly know what's going on here but
i'm taking them by their word and also this chart seems to make the case like token weighted average uh i mean
token weighted mean uh is working better for rl training compared to the sequence mean
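To make the distinction concrete: a per-sequence mean averages the loss inside each sequence first, so a token in a short response counts for more than a token in a long one, while a token-weighted mean averages over all tokens in the batch. A small sketch, assuming per-token losses and masks are already computed:

```python
import torch

def sequence_mean_loss(token_loss, mask):
    """token_loss, mask: (batch, seq). Average per sequence first, then across sequences."""
    per_seq = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_seq.mean()

def token_weighted_mean_loss(token_loss, mask):
    """Average over all valid tokens in the batch, so long sequences contribute proportionally more."""
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)
```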
00:42:03.840 | uh um and they also mentioned like what they have done uh i feel like uh uh so uh for science rl i
00:42:16.080 | feel like uh a lot of the times like they have like a verified answers so i it becomes easier for
00:42:22.000 | training uh to get like exactly uh to get the reward uh if it's uh accurate or not the reasoning steps
00:42:31.520 | um for agent so uh i feel like here uh they use like both humans and ai uh to get the feedback um
00:42:50.480 | i'm not sure what um
00:43:00.480 | uh so this basically like they generated multiple uh uh like they generated multiple samples and then
00:43:12.960 | they're trying to like this this is just like a training loop
00:43:30.160 | uh we would uh will you be able to uh comment on this
00:43:41.760 | yeah i mean this is it am i reading this right someone else should
00:43:48.960 | comment here maybe this is grpo without the kl divergence right is that this is just a reiteration
00:43:57.840 | of the question of that statement from above or not yeah
00:44:00.720 | right
00:44:05.440 | um you know they don't they don't seem to did they mention what like why they removed the kl divergence
00:44:17.520 | from the last term i'm not sure i don't think i come across that reasoning yeah it seems a little bit
yeah um i'm not even sure i mean i'm so not super confident like i understand this part uh this is a lot
of like rl heavy stuff uh wait sometimes i understand um there's there's two things to dropping kl
divergence penalty stuff um in in other work they also i don't remember which paper it was but one of
these they significantly dropped the penalty for kl divergence and then they like decayed it so it
didn't have an effect later on so kl divergence is to keep the model from going too off track right but
lowering the penalty allows the model to explore like the random space a bit more so the paper basically
really dropped the penalty for kl divergence because it allowed the model to explore a broader
00:45:29.520 | broader set of generated tokens right it's like similar to temperature it allows it to explore more freely
00:45:37.760 | and you know in return going down different paths it eventually converges to an answer in this case
they're dropping kl divergence because they're doing this on math and code right so stuff that is
00:45:49.920 | objectively verifiable um you know as long as you can verify that that answer is correct you're you're
00:45:57.520 | fine right like you're you're you're able to get reward as long as you're able to start solving the
00:46:04.160 | questions so in in more broad terms like rlhf with like you know preference feedback and dpo and stuff
00:46:12.960 | it's still a concern right because you can get slight reward for preference based outcomes but in stuff
00:46:21.280 | like math and code you know as long as you end up at that verifiable answer you're kind of chilling right
you don't really need a kl anchor to supervise the policy reward like it's it's okay and then there was
00:46:32.640 | another paper we covered recently that like echoed this yeah thank you yeah that's that's really clear
that um i think it was the kimi paper right that they did this in maybe uh that sounds about
right yeah that's really good insight i appreciate that and this is also just like me spewing my understanding
of what i think about it you know could be wrong but seems legit to me i recall that too so that sounds
right to me good enough for me
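For reference, a stripped-down sketch of a GRPO-style objective with the KL term simply left out: group several rollouts per prompt, normalize their rewards into group-relative advantages, and weight the policy log-probabilities by those advantages. This is a simplification of the general recipe being discussed, not the paper's exact loss (real GRPO also uses per-token importance ratios and clipping).

```python
import torch

def grpo_loss_no_kl(logprobs, rewards, eps=1e-6):
    """logprobs: (group,) summed log-probs of each sampled response under the current policy.
    rewards:  (group,) scalar rewards for the same responses (one group = one prompt)."""
    rewards = rewards.float()
    # Group-relative advantage: normalize rewards within the group of rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Policy-gradient style objective, with no KL penalty term added back in.
    return -(adv.detach() * logprobs).mean()
```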
um in this paragraph like they mentioned how they calculated the reward uh for the given specific task uh so for web search task like uh they use like a final answer
00:47:17.760 | as a reward and then for coding agents uh they use like verifiable test cases for uh rewards and then
00:47:28.960 | yeah um and for uh tool calling uh they try to see the uh format of the uh particular uh trajectory
00:47:36.720 | and then they uh basically like either uh proceed further or like uh receive a zero reward
um and then uh basically uh the cold start sft model like uh it has like a
basic understanding of what's going on uh how to reply back and then uh they did like a uh like once uh
uh training has reached like certain steps like uh they started applying self distillation
uh like uh by replacing uh like the data generated by the cold start sft with data from the rl
trained model i think this is the like expert model um yeah this this section sorry i'm taking so much of
00:48:26.400 | this talk but this section really made me nervous right because i feel like i mean i think that a long
00:48:34.400 | time ago i remember going over one of some paper on synthetic data and and and sort of like methods
00:48:42.880 | where like self-improvement um kind of hits a hits a limit pretty fast so i'm um it's interesting
00:48:52.080 | they're they really like sort of emphasize the self-improvement aspect here and that it just seems like
00:48:59.760 | you're you might like squeeze the most out of the existing data set but you're gonna maybe be over
00:49:06.320 | fit or something i don't know yeah i mean the i was worried about that uh because like they use a lot
00:49:17.200 | of synthetic data but i think uh even though it's like synthetic data the reward function probably captured
00:49:23.760 | like uh enough reasoning capabilities so generalized well um that's that's what i was hoping for yeah yeah
00:49:31.600 | sorry could i ask a question about what is meant by self distillation maybe you didn't quite catch
00:49:40.000 | what they're actually doing here
00:49:48.480 | i i i think they were they were using like expert model uh to generate the response and then
00:49:55.520 | they were using non-expert model uh to use that expert model generated reasoning
00:50:02.320 | i think they call it self distillation but it doesn't make sense
it feels like reading from this it feels like what they're saying is that they use the rl
trained model uh and feed that response into an sft um so but that's kind of like weird to do that
00:50:25.200 | right so i don't quite understand why somebody would do do that i know so this is um this is common they
00:50:32.560 | did this in magistral basically when you run out of um like a good verifier and like training data they
00:50:41.440 | they trained a very good math model so basically in magistral medium they retrained a reasoning model
00:50:49.360 | just on math and then they use that model to generate math data so in this case they do the
00:50:55.120 | same thing right so stage one you train experts right um like reasoning agent um tool use agent general
00:51:04.960 | chat agent then you do this unified training with distillation so right so the the agents generate
00:51:11.760 | responses and reasoning and tool calls and you know that's then the the unified model is trained on this
00:51:18.880 | synthetic data it's like self-distillation because it kind of came from itself right they they branched
00:51:25.120 | off trained experts and then they train on that output so distillation person oh so um before you're saying
00:51:36.480 | that um they were trained the the trained on specific tasks and then use because they were specific
00:51:45.440 | then for the unified one you kind of like use that specific rl trained one to kind of like generalize to
00:51:53.360 | other tasks or at least you kind of um put all the ex i guess you're calling experts right on different
00:52:00.080 | tasks try to use the combined into uh another sft run yeah yeah so they they they use the experts to
generate response reasoning tool call traces right that's that's the distillation so they they they
00:52:15.520 | basically get they train experts to to do output rollouts and then they train on those this is like
00:52:23.680 | how how magistral got reasoning they trained a good math model then they used it for data
00:52:30.960 | it's like a very clear example of like this idea of training another model
to to generate some of this uh data i would recommend that paper or a paper club on it but yeah makes sense
00:52:49.040 | this is like a general comment it seems like a uh increasing the interaction terms with the
00:53:04.640 | environment has obviously improved the uh test time compute um
00:53:11.920 | i think this is like a generic statement it doesn't have any value i mean it has value but it doesn't
like any specific insight and and for general rl they use like rlhf and then rlaif uh they use like a
00:53:32.480 | like a model based feedback as well uh uh depends on the data depends on the task like the kind of
00:53:37.520 | mixing or uh use uh mixing all these three approaches like a rule-based feedback
00:53:43.360 | i think depends on the task like uh one of these would be uh better better um but i feel like uh even
00:53:52.320 | models can give like a better uh better judgment for a lot of tasks um yeah
what is this chart
um yeah this is like a browsecomp uh accuracy
browsecomp very interesting benchmark this is from openai it's like their web search type thing
00:54:24.880 | that they released with deep research i think they updated it recently i think six has played with it a bit
00:54:30.240 | um the the interesting chart i found were like the two in section three actually there's two charts
00:54:37.920 | about training about multi-step and um what do these red and blue lines um yeah they're multi-step training
00:54:50.800 | they have these two charts that kind of break down multi-stage versus single stage training and
00:54:57.040 | breaking off of um difficulty problems and plateauing these two are pretty cool
00:55:03.760 | honestly i think since we're running short of time maybe we don't just look at chart but they had good charts
00:55:10.480 | i recommend charts that's true um and on that last point by the way with the distillation
00:55:20.320 | uh so like the high level process is like first you train the three experts right reasoner agent and
general chat then you know you train them with a cold start of sft traces then you do rl
00:55:34.160 | you make them experts then the big thing is like you don't just train on their rollouts like flat out
00:55:40.480 | right they do a lot of filtration so like for the distillation from three experts they they do rejection
00:55:46.480 | sampling correctness checking uh reward model filtering tool call verification like they actually
00:55:52.480 | verify all this stuff then the unified model they they do that sft distillation sort of stuff and that you
00:56:01.120 | you know that's like a deep level of like filtration stuff i think that's one of the interesting things
00:56:06.640 | right it's not like they just do three models and then they merge them into one like some weird model
00:56:11.760 | merge or some uh ensemble it's just that they use them for synthetic data and then they distill down
00:56:17.440 | and it's like a bunch of just sft kickstart do rl okay generate data filter data let's do it again
00:56:27.520 | let's do sft as a kick start off of these and of course you know strip out some stuff like you want
00:56:33.600 | you want hybrid reasoning so you want to keep some reasoners keep some non-reasoning stuff and you know
00:56:39.440 | in the end you get one model that outperforms the three
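Putting that description into code-shaped form, the loop being described looks roughly like this; every function here is a placeholder for a whole training or filtering stage, not a real API, and the structure is only a sketch of the two-stage recipe.

```python
def build_unified_model(base_model, expert_configs, prompts,
                        sft_train, rl_train, generate_rollouts, passes_filters):
    """Stage 1: train one expert per domain (reasoning, agent, general chat).
    Stage 2: distill their filtered rollouts into a single unified model."""
    experts = {}
    for name, cfg in expert_configs.items():
        model = sft_train(base_model, cfg["cold_start_data"])   # cold-start SFT
        experts[name] = rl_train(model, cfg["rl_tasks"])        # domain-specific RL

    # Self-distillation data: expert rollouts, heavily filtered
    # (rejection sampling, correctness checks, reward-model / tool-call verification).
    distill_data = []
    for name, expert in experts.items():
        for rollout in generate_rollouts(expert, prompts[name]):
            if passes_filters(rollout):
                distill_data.append(rollout)

    unified = sft_train(base_model, distill_data)   # replaces the original cold-start data
    return rl_train(unified, "joint_tasks")          # further RL on the unified model
```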
00:56:42.320 | is it also oh sorry thank you uh go ahead bro i just asked one more quick question um it says here like
00:56:55.920 | once training has reached a certain step count or plateaued is it normal that they loop the
00:57:00.080 | distillation process that they what the distillation process like they're doing they're doing many many
00:57:09.360 | rounds of the self-distillation um once training has reached a certain step count or plateaued we apply
00:57:19.840 | self-distillation by substituting the original cold start data with responses generated by the rl
uh thus creating a superior sft dataset we then conduct further rl training to enhance this model this is
talking about training the experts right then conduct further rl on this no no this is then
increasing difficulty this strategy allows us to push the limits of rl trained models oh wait sorry what's the
00:57:47.840 | what's the what's the question here this seems somewhat standard approach i but i think i
00:57:52.000 | misinterpreted what your question was about this no it's okay i might be misunderstanding i just got
00:57:57.120 | the impression that something that they're doing differently is um it's a self-iteration
00:58:02.960 | loop rather than just running the distillation process once which is what i thought the other models were
00:58:09.600 | doing applying self-distillation by substituting that i think the loop is just like you know after
00:58:20.160 | you do easy questions you do hard questions like that's their whole multi-step thing right like they
00:58:25.600 | at different stages do harder and harder questions and they're mixing it in with context line they they
00:58:31.760 | show this in one of the like i think it's is it figure five about moderate and extreme difficulty
00:58:38.320 | whereas one plateaus they they switch to stage two with extreme difficulty and then you know progress
00:58:44.160 | continues so like that blue line if you keep it stagnant with the same difficulty uh you know that's
00:58:51.120 | where it starts to plateau and you're not improving accuracy so you know next cycle is just how extreme
00:58:57.440 | difficulty instead i think that's like i think that's what they're saying here it's just tears of
00:59:03.760 | harder and harder questions um i could also be misinterpreting this you know
00:59:08.480 | that seems to be my understanding as well
00:59:13.200 | um they mentioned like uh how they created like a reward function for each type of specific task
00:59:23.840 | uh they were checking format they were checking like trajectory uh and then they were also checking
00:59:29.200 | uh if sometimes like uh so this stepwise uh it's basically checking like uh did we call exact tool
00:59:37.680 | at the particular time uh and sometimes they're using that and also they're trying to see if you actually
00:59:44.560 | completed the task other than like just calling the function at particular time so they were using these
to uh and then uh sometimes these rl rollouts they get into infinite loops so and then instead of uh uh instead
of adding a penalty to the task like they uh sampled a lot of uh like a lot of prompts where uh the rl
tends to get into infinite loops and then uh they uh created like a penalty function based on that instead of uh
using a standard penalty which is like more sample efficient
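A small sketch of the kind of agentic reward just described: check whether the right tools were called, check task completion, and subtract a penalty when the rollout looks like it has entered a repetition loop. The weights and the loop heuristic are invented for illustration, not the paper's formula.

```python
def agentic_reward(trajectory, expected_tools, task_completed,
                   loop_window=3, loop_penalty=1.0):
    """trajectory: list of tool-call names in the order the agent issued them."""
    # Stepwise check: did the agent call the expected tools in the expected order?
    stepwise = float(trajectory[:len(expected_tools)] == expected_tools)
    # Outcome check: did it actually finish the task, not just call tools?
    outcome = float(task_completed)
    # Crude loop detection: the same short window of calls repeated back-to-back.
    looping = any(
        trajectory[i:i + loop_window] == trajectory[i + loop_window:i + 2 * loop_window]
        for i in range(max(0, len(trajectory) - 2 * loop_window + 1))
    )
    reward = 0.3 * stepwise + 0.7 * outcome
    return reward - (loop_penalty if looping else 0.0)
```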
uh they mentioned how they want to use like the gpus in better ways uh if uh some uh some rl rollout is taking
a lot of time to finish like instead of waiting on that they created like a different platform uh to
optimize like the gpu usage better uh you can read the details later and then uh yeah these evaluation
metrics how they compare against uh existing models um i don't like this seems like very straightforward to me
01:00:48.720 | uh it's not like a uh i think it's cool uh yeah but it's still a lot of times it's still second to
claude or others maybe the model sizes are like very small compared to them but still um no new insight
01:01:04.240 | it's just like a metrics they're talking how it compares to other models um i feel like uh one
01:01:10.880 | interesting thing uh they were able to have like uh i think the model has like a lot of reasoning
a lot of reasoning capabilities inbuilt uh so it gets like an emergent phenomenon like uh the translation
surprisingly works well compared to uh existing models so that's like a like a side effect of having a lot of
01:01:29.840 | reasoning tokens yeah i think um that's all i wanted to cover
but i feel like the rl setup or uh the things they mentioned is very cool
01:01:53.840 | well thank you so much for presenting um if anyone wants you know next week volunteer there's all
01:02:02.640 | these fun papers i posted a bunch of in discord or if there's any topics you want to cover we can
01:02:08.960 | cover together you know but uh you know big shout out thank you so much for presenting it's always fun
01:02:14.240 | to get different perspective on this stuff you know thank you yeah it's well done um thanks for uh
volunteering all right many more volunteers uh more papers somebody's gonna do seedance that's another one
01:02:28.880 | oh china models man we i really want to do a china tour at some point maybe maybe uh early next year
01:02:37.040 | uh oh my god you just have access to go see all the labs here they're very i can i can definitely
get qwen we have inroads to deepseek uh stepfun minimax for sure i can't get uh the glm one
i mean i don't think they're hard to get but i don't currently have contacts
okay other papers though there's um meta's dinov3 there's a genie clone that came out of tencent
01:03:05.760 | uh what else what else there's a survey on parallel text generation there is a lot of stuff there's a
few auto regressive image generation papers um hrm tyler's volunteering uh yes i haven't read that
01:03:22.400 | one and i need to read it so hrm is yes on the docket in the next two to three weeks i might reach
out to jay alammar for his gpt oss thing uh i'll share in discord if you guys haven't seen it he did a
01:03:37.920 | really good blog post called like the illustrated transformer illustrated for gpt2 he put out one
01:03:46.320 | yesterday i think on the illustrated gpt oss so it's like a good visual breakdown of what's happening
01:03:52.880 | in state-of-the-art moe models and um i will i'll share it in in discord and try to get them to come
01:04:01.040 | up on otherwise you know if you have paper it's always better someone else someone else uh volunteer
okay tyler you're locked in for uh september 3rd glm uh uh sorry hrm uh and then we need a volunteer
for next week oh thanks guys bye all bye