
GLM 4.5: Agentic Coding & Reasoning Foundation Model


Chapters

0:00 Introduction to the presenter and the GLM 4.5 paper
2:11 Overview of GLM 4.5 as a Mixture of Experts model
4:06 Defining ARC: Agentic, Reasoning, and Coding capabilities
5:41 Pre-training architecture and the Muon optimizer
6:54 Data collection, de-duplication, and context window extension
11:31 Mid-training with repository-level code and synthetic data
25:08 Post-training: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)
31:22 Tool calling and agentic data synthesis
36:52 Reasoning RL with curriculum learning
39:58 Dynamic temperature and token-weighted mean loss
49:12 Self-distillation process for unifying the model
59:16 Reward function calculation and handling infinite loops
61:12 Emergent capabilities: surprising translation performance

Whisper Transcript

00:00:00.000 | yeah go ahead
00:00:07.200 | um can you see my screen i can see your screen the audio audio could be better if you can switch
00:00:18.000 | your zoom audio uh don't use airpods for for uh for microphones airpods are very bad microphones
00:00:24.480 | okay even though it looks like i'm using my airpods in my in my ear i actually always
00:00:30.480 | switch the zoom to laptop microphone is much better should be in like the zoom clicker thing
00:00:41.760 | no i don't even like airpods max i've tried it it's not great
00:00:46.480 | is it any better now yeah much better oh sweet sweet
00:00:53.920 | okay yeah um you know just introduce yourself because you're because you're your new presenter
00:00:57.600 | uh let's get to know about you and why you're interested
00:01:00.240 | i i was working as a back-end engineer um my name is venki by the way nice to meet you all
00:01:08.160 | i took a sabbatical to do uh to understand like the current ecosystem and to
00:01:15.120 | experiment more uh with the ai tools and also understand the fundamentals a lot so i took a break
00:01:22.160 | uh i was um and then i'm trying to read papers but so far it's not working out for me the self-study
00:01:28.720 | but i'm still like uh trying to make sure like i get something positive out of this sabbatical
00:01:34.000 | uh this is my this is my first time presenting the paper i'm i'm not sure like how it goes but
00:01:39.680 | i'm hope uh i feel like i i put a lot of attention to understand the details because i'm presenting it
00:01:45.360 | otherwise i would usually skip through them skip through them yeah so it's a good good uh a good
00:01:52.480 | like a decision for me to volunteer for this agreed okay uh if you want you can make your paper full
00:01:59.680 | screen and then we can get into it okay yeah so uh should i should i give an overview first uh yeah yeah
00:02:13.040 | so this is a uh it looks like a mixture of expert a transformer uh on the base model and then uh
00:02:20.400 | what they have done differently is like uh they want to make sure like this model has a
00:02:25.040 | arc capabilities agentic coding and reasoning abilities so what they have done they uh
00:02:30.880 | post training they they they've used like two different ways to train this model like they stage one
they uh they train an expert model uh for each specific task like agentic, reasoning and
coding and then once they have like a sub expert model they distill these capabilities into one
uh unified model so that's why uh this model seems to have like all three capabilities uh and
00:02:59.520 | this model can is able to like uh uh have like a switch between modes like thinking long thinking modes
00:03:06.640 | and then uh quick response modes uh i haven't actually uh used the model much but i asked claude to
00:03:14.000 | generate some uh some task and prompt to verify these capabilities so maybe post this talk we can see
00:03:21.280 | if we can go through uh if this model is actually looking good um with those tasks but basically this is a
00:03:29.120 | mixture of experts um and yeah um and these are like some metrics they provided uh with arc capabilities
i'll move to the second yeah i feel like arc is a little bit um co-opted by arc-agi but arc has a
00:03:53.360 | definite definition which i think you're showing on screen right now right uh oh yeah the arc challenge
00:04:00.720 | true so they these people like they say arc uh to define like uh if the model has agentic or reasoning or
00:04:08.160 | coding capabilities all together in one uh one model instead of using multiple models to achieve uh
00:04:16.160 | different performances uh so they clearly define the ambition here
00:04:20.800 | so how often um like yeah this seems like the main uh drive behind the this model training uh to have
00:04:33.600 | all these capabilities in a single model and then uh the paper mostly quickly goes through uh like uh
00:04:44.400 | what they have done uh and then also the rl training pipeline and how they approach like a data
00:04:52.480 | synthesis for training etc but this uh these are like a bunch of metrics they've mentioned here and
00:04:59.040 | they have now a smaller model as well but they haven't mentioned much about the smaller model apart
00:05:04.880 | from i'm assuming it's a distillation yeah these are like a bunch of metrics how we uh how we performed
00:05:14.000 | on a various metrics uh i i try to see if they have like uh the training loop made available i don't think
00:05:25.440 | they've made the training loop available for open source i think the only weights are available in the
00:05:29.840 | git repo
00:05:30.400 | uh so yeah architecture the fundamental architecture is like a mixture of experts i feel like the only thing
00:05:48.240 | they changed is uh uh they used a muon optimizer instead of adam i'm not really sure if adam is like
standard but they're using muon uh they mentioned this down somewhere here yeah
so they used the muon optimizer uh and then what else they've changed
00:06:13.760 | yeah so for for context basically for the past few years the adam and adam w optimizers they've been
00:06:21.200 | very much the standard defaults there hasn't been much progress like adam came then adam w
uh the recent kimi k2 paper it's like the big the big model that came out after um after deepseek
they switched to a muonclip optimizer it's like better training stability they have a very pretty
00:06:41.520 | loss curve so that's kind of the background optimizer
00:06:44.960 | um and one thing they have done uh they have this is like the general pattern they used like uh they
started with 4k context window and then they increased the context window to 32k and uh and afterwards like 128k
00:07:03.200 | and uh and they mentioned like how they collected their data uh so basically uh yeah they used like
00:07:10.640 | web crawling uh to have like english and chinese web pages and then uh they used like a quality metric
00:07:17.440 | so here they mentioned like uh they divide the crawl web pages into buckets with different quality scores
00:07:24.320 | and then they're up sampling the documents uh so that like they uh the initial like the base data is much
much higher quality um and then they they were using like uh this uh semdedup uh to basically uh deduplicate
like uh the data i mean remove the duplicate data uh this basically mentions like what they have done with the
00:07:48.480 | pre-training data
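As a rough illustration of what SemDedup-style semantic de-duplication does (a minimal sketch, not the paper's actual pipeline): embed documents, cluster the embeddings, and within each cluster drop documents that are too similar to one already kept. The cluster count and threshold below are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def semdedup(doc_embeddings: np.ndarray, n_clusters: int = 100, threshold: float = 0.95):
    """Return indices of documents to keep after semantic de-duplication.

    doc_embeddings: (num_docs, dim) array of document embeddings, assumed L2-normalized.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_embeddings)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        kept_in_cluster = []
        for i in idx:
            # Drop a doc if it is too similar to something already kept in this cluster.
            sims = [float(doc_embeddings[i] @ doc_embeddings[j]) for j in kept_in_cluster]
            if not sims or max(sims) < threshold:
                kept_in_cluster.append(i)
        keep.extend(kept_in_cluster)
    return sorted(keep)
```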
00:07:52.160 | and one thing they uh they said like uh they used like group query attention with partial rope in the
attention block and then also they mentioned like uh more attention heads so uh one of the claims was
like having more attention heads and uh sorry having a deeper model
over a wider one like it increases the reasoning capabilities
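For reference, "partial RoPE" just means the rotary position embedding is applied to only a slice of each attention head's dimensions, with the rest passed through untouched. A minimal sketch (the dimensions and helper are illustrative, not the model's exact configuration):

```python
import torch

def rotate_half(x):
    # Standard rotary helper: swap and negate the two halves of the last dim.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_partial_rope(q, cos, sin, rope_dim: int):
    """Apply rotary embeddings to the first `rope_dim` dims of each head only.

    q:        (batch, heads, seq, head_dim)
    cos, sin: (seq, rope_dim) rotary tables for the rotated slice.
    """
    q_rot, q_pass = q[..., :rope_dim], q[..., rope_dim:]
    q_rot = q_rot * cos + rotate_half(q_rot) * sin
    return torch.cat([q_rot, q_pass], dim=-1)
```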
and venki before you go forward you said something at the beginning of uh
00:08:28.960 | of this discussion uh so did i i just want to make sure that i got this right so did they train three
00:08:38.240 | separate models one each for agentic reasoning and coding and then they distilled all those three
00:08:47.120 | models into one model is that how they approached this so for the base model they used like the same
00:08:53.280 | pre-training for the base model once they have the uh pre-trained base model in the post-training they
00:09:00.160 | uh used like a t they created like three different models and then did they then distill into one other
00:09:08.640 | model yeah okay
00:09:10.960 | um yeah so in stage one they used expert training and then stage two they unified
00:09:20.000 | okay thanks
00:09:25.200 | on the on the previous note here about attention heads and wide versus deep i think it's been
00:09:31.280 | interesting with the other few late papers so like uh once again the kimmy paper talks about
00:09:37.040 | how they do less attention heads because you know a very long a very long context it's like half the
00:09:43.200 | memory bandwidth that's required and then open ai's open source model they talk about how they
00:09:50.320 | go for a wider moe as opposed to a deeper one and that that helped with like training stability and
00:09:55.680 | stuff as well so it's like interesting these little notes that you know normally you just take at face
00:10:00.320 | value you're starting to see some other differences with different models and why they pick what
00:10:13.040 | um so in pre-training they have like multiple stages uh and like they use different data and then also
00:10:19.360 | different context sizes um one thing they have done with the code like they use like a mostly github
for code and one thing they i observed like uh they used uh uh they combined like a files from the same
repo to get a better understanding of the relation between different files in in the code base
00:10:41.440 | um i i thought that's interesting uh to have that semantic meaning embedded
00:10:47.280 | um and also most of the times like once they have once they crawl the data they were using uh some
00:10:55.520 | kind of model to predict the quality of a particular page and then based on the quality uh that seems like
00:11:02.560 | they have sampled the particular high quality data that seems like a common pattern they used for all the data
00:11:08.720 | uh yeah so uh they like they mentioned up sampling a lot based on the quality of the content
00:11:21.360 | um so the pre-training i believe like they're trying to have like a base model uh that i mean with high
00:11:31.280 | quality data and then second phase like in the mid training they were using repo level code training uh
00:11:37.360 | it's basically they increased the context and they added like a more data like a long context and
00:11:44.800 | agent training and here in i feel like they use a lot of synthetic uh data generations and synthetic data
00:11:52.000 | uh especially for agent tasks and uh for reasoning tasks uh one thing so
00:12:05.200 | they included these as well to improve the coding abilities um yeah and they mentioned like they
used a diff formatting uh i think it helps like uh to generate better uh to understand or to review
better uh having this diff format
and uh here they mentioned like uh they boosted like context length from 32k to 128k
00:12:37.200 | and uh they up sample only long documents um i mean that makes sense
yeah it's the same thing for 4k to 32k and then 128k
00:12:54.000 | uh i'm not sure i understand oh yeah so if during pre-training uh uh they didn't use like a so i
00:13:03.440 | think this basically best fit packing it basically truncates the context to given length uh so in one
00:13:10.320 | of the stages they didn't use that uh so that like uh uh it becomes like a good data augmentation strategy
and then for mid training uh they applied best fit packing so that like i think they compress the
context into a given context window so that like it has uh it's like a summary generation i think
uh frankie you have a question uh yeah could you could you explain a little
00:13:37.440 | bit what they mean by up sample uh i i think they uh yeah um i'm assuming it's like down sampling up
00:13:53.680 | sampling right like they will want to have like a uh if you have like multiple buckets you want to get
00:13:58.080 | like more data from a particular bucket rather than less data from other buckets suppose if you have
00:14:03.680 | like equal distribution you probably get like a uh if you do thousands samples like the distribution
would be even right with up sampling i believe like you give more weight to certain buckets to get the
high quality data uh vibhu do you want to add anything
typically that's what up sampling is um it's not just more weight to different buckets sometimes they'll just
do multiple epochs right so for some long context or for some like high quality it's not uncommon that
they do multiple passes of the same data and down sampling as well you know like if you have
a lot of generic text it's like filtering is a form of down sampling right uh yeah sometimes that's just the way it's done
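To make the upsampling idea concrete, here is a toy sketch: documents are drawn with per-bucket weights, so higher-quality buckets are effectively seen more often (the bucket names and weights below are made up, not the paper's values).

```python
import random

# Hypothetical quality buckets with relative sampling weights.
buckets = {
    "high_quality":   {"docs": ["h1", "h2", "h3"], "weight": 4.0},  # seen ~4x as often
    "medium_quality": {"docs": ["m1", "m2", "m3"], "weight": 1.5},
    "low_quality":    {"docs": ["l1", "l2", "l3"], "weight": 0.5},  # effectively downsampled
}

def sample_document():
    names = list(buckets)
    weights = [buckets[n]["weight"] for n in names]
    bucket = random.choices(names, weights=weights, k=1)[0]
    return random.choice(buckets[bucket]["docs"])

# Over many draws, high-quality docs appear far more often than under a uniform mix.
counts = {}
for _ in range(10_000):
    doc = sample_document()
    counts[doc] = counts.get(doc, 0) + 1
```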
00:14:56.320 | the best fit packing what does that actually do uh based on my understanding i think they're
00:15:04.400 | trying to summarize uh so that it fits actually to the context i haven't actually uh verified that i assumed it
00:15:15.200 | oh sorry i what's a what's a basic procedure i don't quite understand what are they doing
um i yeah i haven't actually verified uh what exactly they've been doing like i haven't seen the
algorithm i just assumed that like they're probably uh like summarizing the
context to fit uh better suppose like if you have like 128k or like a lot of data
00:15:45.680 | it's better to have like a summary uh so that like you have like main points in the uh limited
00:15:51.200 | context so that like uh the queries can i mean the llms can focus better on the uh what's going on
00:15:58.640 | rather than like whole context okay and then if anyone has a better explanation please uh chip in
sorry i have a separate question venki um so uh what is the rationale so i think they're saying
00:16:20.160 | pre-training you begin with something smaller and then mid-training you introduce longer um is this kind
00:16:26.880 | of like uh um you know like going going i forgot what they call the curriculum training is that kind
00:16:33.280 | of like thinking the thought process here uh or why what's the reason for doing it this way
00:16:38.160 | uh the curriculum training i think they use it in the context of rl right like to get like a better
00:16:46.240 | reward models uh but i think this here it's simply extending the context so they can uh i mean we can put
00:16:55.680 | like a lot of data into the context there's just like a uh i feel like a 4k they have like a a good
00:17:03.120 | base like the weights are uh set in a better way here and then once we increase the context i believe
00:17:09.680 | like uh it can generalize well and then it can have like a uh the attention uh can be spread without
00:17:16.560 | losing the actual value or that's what i think here they've been doing does it have any data showing
00:17:23.760 | like um i don't know ablations showing that if you do it mixed together it's like this or here
00:17:30.000 | separate it like this it does something differently oh uh i can i can chime in here actually uh there's
00:17:36.800 | two parts of this mid training right one is context length extension which is typically not what you
00:17:43.040 | consider mid training a context extension can just happen at the very end of all your rl your post
00:17:49.120 | training right you can just do more training on long context and that kind of works now what this is is
00:17:55.840 | the reason it's more mid training is because you're going from regular pre-training just next token
00:18:01.120 | prediction to some form of specialized data right so repo level code this is long context
00:18:06.720 | full code base uh synthetic reasoning data long context agentic data right now the the fact of
00:18:14.000 | the matter is it's not like exclusive that they're just doing this but they're also doing long context
00:18:19.280 | but it's pretty common right you don't see like reasoning data at short context right because oftentimes
00:18:26.400 | the queries that go into these like the system prompts themselves are already pretty long so same
00:18:31.920 | thing with the reasoning traces right we just don't do reasoning at short context because
00:18:36.480 | you need context length for actually doing reasoning right you're not going to do
00:18:41.600 | reasoning at 4k context so it's mid training more so because of the type of training they're doing
00:18:47.760 | not because of context but yeah they didn't just do it as a separate step which is like okay too like
00:18:53.040 | long context is not super like unknown right all you do is just train on long context for like
a couple hundred thousand or a couple million samples and you're kind of chilling
00:19:04.400 | and with that like you know you can just mix it in wherever but in this case it's mid training because
00:19:11.680 | the type of data and then um that's like the first question on the other one about um best fit packing so
00:19:19.520 | one thing you can do if you have like long context let's say you're training on like 32k context right
and you have a bunch of sequences that are like short like 4,000 sequence length uh what you do in
terms of packing is you can concatenate multiple samples into one sample right so like if i have
eight queries that are 4,000 sequence length i can batch them all together and make one sequence of 32k
00:19:45.040 | and that's that's best fit packing that's more efficient on gpus or you could not do that and you
00:19:51.120 | could just train normally right so it's you could do best fit packing by packing multiple short sequences
00:19:57.360 | into one sequence or you could not do it and then you know they they talk about what they do here but
00:20:02.880 | that's what best fit packing is it's where you pack short sequences into a lot of sequences and then
00:20:08.080 | there's pros and cons of both i think for them they say in pre-training they don't do it so it's not as
00:20:12.960 | efficient on the gpus they just do random truncation right so sometimes different sequences and batches
00:20:18.880 | will be different lengths and that's okay it is what it is and they talk about why they do this
00:20:23.680 | and then in mid training they do do it because context matters right you don't want half of a
00:20:30.160 | reasoning trace to be cut off because you're out of sequence length right you want stuff to be the
00:20:35.520 | same length so in mid training and reasoning they do do it like you don't want to chop off a code base
00:20:40.960 | just to be efficient in gpu and pre-training they're like it's good it's data augmentation even if you
00:20:45.760 | don't have half of the stuff it's fine it's more efficient on our majority train run but sorry that's
00:20:50.960 | like two two two topics in one um you know i hope that made sense thank you so but when they kind of
concatenate this like say 8k into 32k uh you you know you're not like masking anything right on the
attention so the stuff you concatenate can still see yeah the previous stuff right yeah yeah yeah
00:21:12.560 | it's weird like that uh there's there's different techniques to do uh packing but like you know like
there are techniques of adding tokens to forget stuff before but i i mean that's like i don't think
00:21:25.600 | they're doing any of this so the real point of packing is just you keep sequence similar and it doesn't
00:21:32.400 | have like contamination and training data right so with reasoning i don't want like i don't want in
00:21:37.760 | the same training batch code about like you know one library with something completely irrelevant to
00:21:44.320 | think about but um it's it's just beneficial if you keep it training but then if it's like just regular
00:21:51.840 | like storybooks right let's say you have like a harry potter book and then half of it is added in with a
00:21:57.200 | lord of the ring book and you're just predicting next tokens it's fine you know like it's just
00:22:01.920 | common but i think i think uh you know that's that's high level what this stuff is someone else should
00:22:08.080 | do a little bit more research into it i just know that it's typically done in pre-training where the
00:22:12.960 | type of data is not as important and you don't want to do it in in reasoning thank you and there are
00:22:19.600 | approaches to how you can how you can better process stuff and there's like some paper that talks about
00:22:24.720 | like adding special tokens like how encoder decoders just have like cls token
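A rough sketch of the packing idea discussed above: greedily concatenating short sequences into fixed-length training samples instead of padding or truncating each one. This is a simplification (real best-fit packing also has to handle document boundaries and attention masking), and the numbers are only illustrative.

```python
def pack_sequences(sequences, max_len=32_768):
    """Greedy first-fit-decreasing packing: put each sequence into the first bin with room."""
    bins = []  # each bin is a list of sequences whose total length <= max_len
    for seq in sorted(sequences, key=len, reverse=True):
        for b in bins:
            if sum(len(s) for s in b) + len(seq) <= max_len:
                b.append(seq)
                break
        else:
            bins.append([seq])
    # Each packed training sample is the concatenation of its bin's sequences.
    return [[tok for s in b for tok in s] for b in bins]

# e.g. eight ~4k-token sequences pack into a single 32k training sample.
dummy = [list(range(4_000)) for _ in range(8)]
packed = pack_sequences(dummy)
assert len(packed) == 1 and len(packed[0]) == 32_000
```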
00:22:47.680 | thank you and uh and also i don't think they mentioned a lot of ablation studies or
00:22:55.680 | why they have done on the pre-training stage most of their focus on like uh making sure the rl works
00:23:02.160 | rl is effective and then they're able to improve the capabilities so this paper like a lot of the times
00:23:09.120 | the charts or the graphs are basically talking about rl pipeline not the pre-training
00:23:15.760 | um yeah
uh so they mentioned uh about like hyper parameters like i already mentioned like they used the muon
optimizer uh i couldn't understand like some of these things but uh primer for c i feel like uh
00:23:51.440 | um yeah i i to be honest like i haven't understood like this completely uh
00:23:57.280 | um so this one's like a warm stable decay schedule like uh they're changing the learning rate on
00:24:04.000 | they mentioned like how they change it why uh and then they also use like a batch size warm-up strategy
uh so batch size was increased gradually from 16 million to 64 million tokens during training
uh and they use like the muon optimizer because it's uh it feels like it accelerates convergence and like it
can tolerate large batch sizes
00:24:30.560 | yeah uh if anyone have a better understanding uh please help me out
00:24:43.280 | yeah i don't know where to get these numbers
00:24:49.200 | from either but usually it's just you know it's one paragraph but they spent like three months updating
00:24:55.440 | different parameters yeah
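As a reference for the warmup-stable-decay schedule and the batch-size warm-up mentioned above, here is a toy version; the step fractions and rates are placeholders, not the paper's actual values.

```python
def wsd_learning_rate(step, total_steps, peak_lr=2e-4, warmup_frac=0.01, decay_frac=0.1):
    """Warmup-Stable-Decay: linear warmup, long flat plateau, then linear decay toward zero."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / max(total_steps - decay_start, 1))

def batch_size_tokens(step, total_steps, start=16_000_000, end=64_000_000, ramp_frac=0.2):
    """Batch-size warm-up: ramp tokens-per-batch from 16M to 64M over the first part of training."""
    ramp_steps = int(total_steps * ramp_frac)
    if step >= ramp_steps:
        return end
    return int(start + (end - start) * step / ramp_steps)
```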
00:24:56.560 | i'll move on to the post training
00:25:03.680 | i feel like this is uh this is where the majority of the work uh that's i think it's very carefully
00:25:12.720 | crafted and created uh as i mentioned earlier like they use like a two-stage approach like in stage one
00:25:20.080 | uh they created expert models for each uh specific task and then stage two they
00:25:26.320 | use distillation techniques to integrate these multiple experts
uh um and then they mentioned like uh supervised fine-tuning uh
uh so uh they do this sft for both stage one and stage two and in expert training like uh this is
basically like making sure that the model has basic capabilities uh to either to generate or to respond
properly uh other than that it doesn't have like any uh specific like agentic or
reasoning uh capabilities
00:26:12.160 | yeah uh in sft for stage two um it's like a their distilling capabilities from different expert
00:26:19.120 | models and then making sure that like it gets the uh capabilities from uh three different experts
00:26:34.720 | yeah um like same thing um so here they mentioned like how they uh perform these tasks
00:26:59.760 | uh one thing they have done differently for uh code so uh i feel like in regular models it seems like uh
00:27:24.480 | they have to uh like for text data they have to use like a lot of escape
00:27:29.680 | characters so uh this is for this they say that like that's fine for like a model specifically
00:27:36.080 | designed for coding task but they want to have like a general capability for this model right so
00:27:40.800 | instead of using a lot of uh escape characters they decided to format the data into this uh xml format
00:27:48.400 | uh so that they don't have to use like escape characters that's the one thing they have done differently
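To illustrate the formatting point: embedding code inside a JSON string forces escaping of quotes and newlines, while an XML-style wrapper can carry the code verbatim. The tag names below are invented for illustration and are not the paper's actual schema.

```python
import json

code = 'def greet(name):\n    print(f"hello, {name}")\n'

# JSON-style tool call: the code must be escaped into one long string.
json_call = json.dumps({"tool": "write_file", "args": {"path": "greet.py", "content": code}})
# -> "content": "def greet(name):\n    print(f\"hello, {name}\")\n"

# XML-style formatting: the code block stays literal, no escape characters needed.
xml_call = (
    "<tool_call>\n"
    "  <name>write_file</name>\n"
    "  <arg name=\"path\">greet.py</arg>\n"
    f"  <arg name=\"content\">\n{code}  </arg>\n"
    "</tool_call>"
)
print(json_call)
print(xml_call)
```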
00:27:59.600 | um so um um so in um i couldn't hear you i'm sorry sorry you're really soft rj i don't know what
00:28:15.200 | happened to your audio nope you know what happened rj you uh you could type i guess
00:28:27.120 | i'll be rj's voice but also he's a veteran i don't know what's um what's screwed up there
00:28:35.840 | this rj says is this a new token or just a tag i'm not sure there's a difference um yeah i'm not sure i think
00:28:54.640 | it's maybe they added these uh like end of text uh tokens as part of our vocabulary so that like
00:29:02.480 | we don't pass them but i'm not really sure
00:29:04.640 | is it that i'm not sure i understand the question itself let me see the
00:29:21.440 | yeah i think it's a new token yeah new token in the vocabulary but they haven't mentioned it
00:29:26.400 | all right right
usually for this kind of stuff they do add new tokens uh deepseek v3.1 which dropped
00:29:38.880 | yesterday uh you can see an example of how they add tokens for this kind of stuff i would imagine glm
00:29:45.280 | to be the same does anyone know what does glm stand for no idea you could probably look it up
00:29:54.560 | so i think here they also mentioned like uh by distilling from output of distinct experts the model
00:30:06.080 | uh learns uh learns quickly applying long context of like cot reasoning for each task and then it can
00:30:14.320 | switch from uh thinking long or also thinking short without any explicit uh without any explicit suggestion
00:30:23.440 | and third one uh the rejection sampling i feel like uh so while sampling uh they want to make sure that like
00:30:33.040 | uh they sample high quality data uh this rejection sampling they basically mentioned how they have done
00:30:38.960 | that uh they want to make sure that it's in the valid format and then um verification with the objective
like i think the math or science uh you can have clear objective answers even the code uh so they did that
uh that part as well and then uh and then uh they have like uh reward models for subjective questions uh it
can be either uh rlhf or rlaif uh uh and then uh they use this to filter i
mean um reject the samples and then i feel like the tool calling scenario is very interesting um they go deeper in
later stages as well how they orchestrated like the tool calling or uh yeah i'm gonna talk about that later
00:31:34.320 | section and then also uh seems like they uh they selected few prompts to improve the uh uh or like
00:31:43.280 | they deleted like some of the prompts which are not conducive to better results um they mentioned that here
00:31:52.320 | and they state that like it improved uh improved the reasoning capabilities
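A minimal sketch of the rejection-sampling idea described a moment ago: sample several responses, keep only those that pass a format check plus either a verifier (objective tasks like math and code) or a reward-model threshold (subjective tasks). Every helper here is a stand-in, not the paper's actual pipeline.

```python
def rejection_sample(prompt, generate, is_valid_format, verify_answer=None,
                     reward_model=None, n_samples=8, reward_threshold=0.7):
    """Return the responses worth keeping as SFT data for this prompt."""
    kept = []
    for _ in range(n_samples):
        response = generate(prompt)           # rollout from the expert model
        if not is_valid_format(response):     # e.g. well-formed tool calls / answer tags
            continue
        if verify_answer is not None:         # objective tasks: math checkers, code tests
            if verify_answer(prompt, response):
                kept.append(response)
        elif reward_model is not None:        # subjective tasks: keep high-reward responses
            if reward_model(prompt, response) >= reward_threshold:
                kept.append(response)
    return kept
```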
00:31:57.520 | so here they mentioned like how they uh constructed the agentic sft data
00:32:05.920 | so uh i feel like they they used like existing frameworks like mcp servers uh to uh get like a basic
00:32:15.760 | features and then they created uh they did task synthesis it seems like uh they used like whatever
00:32:22.080 | the documentation they have for mcp servers to create tasks uh that seems like a very uh yeah it's
00:32:29.280 | interesting approach
00:32:30.160 | so they used llms to comprehend functionalities and then generate relevant queries and tasks
00:32:39.840 | and for uh fragmented like uh they they did follow similar approach but they uh like uh first they
00:32:48.240 | select like a representative subset and then they employ the construct the task based on that
00:32:54.320 | um yeah single step and multi-step tool calling i i believe uh uh this basically says that like uh in
the reasoning process you may have to use like a uh like one call or multiple calls at different stages
and then uh uh hybrid hey venki sorry just a quick kind of maybe might be a stupid question but i just
00:33:22.960 | wanted to confirm is it the actual like them training specialized experts is this different to the other
00:33:31.680 | models do you know like is this one of the different parts of of um of this model um i mean i know a lot
00:33:40.240 | of times we uh other other llms do distillation but i'm not sure i haven't read any paper maybe kimmy
00:33:47.600 | or other papers does it i don't know
00:33:48.960 | yeah i was i was curious too about whether can you guys hear me now by the way yes yeah i had my head
00:34:01.520 | set microphone on sorry about that uh i was also curious about what what was the motivation for
00:34:07.360 | using distinct experts versus just using good rollouts and i i kind of the conclusion i came
00:34:17.760 | to for myself was that they have this like multi-stage uh curriculum where they're they keep they mention a
00:34:29.440 | couple places where they keep uh you know sort of doing like rollouts of their own data and so that
00:34:37.920 | maybe in the early phases it's easier to generate uh distinct experts than it is to generate one model
00:34:45.760 | that's good at all three and then and that's why so they like in the process of generating a a generic model
00:34:54.880 | it's easier to to do like specific like fine basically fine-tuned models that are good at one
00:35:01.600 | thing and then distill them all into the same model that's that was my take i'm not sure i've ever seen
this my before myself i agree in in the rl setup like they mentioned this uh they needed like different
00:35:15.920 | architecture for different types of uh reasoning tasks or reinforcement learning tasks so maybe that's
00:35:21.760 | why they so they actually change the architecture for each expert uh i think so um interesting
yeah they have slime rl and then uh they mentioned sometimes like uh for software engineering task like
00:35:39.200 | they had to have like docker containers and uh other stuff like uh so they mentioned data like a bit later
00:35:47.520 | in the page but i'm not sure like this is like the training architecture not the model architecture
00:35:52.880 | right oh i see okay yeah no but that makes sense why you would also want to split that okay that's actually
00:35:58.800 | really interesting um right so here uh they use like llms uh they used existing frameworks and then
00:36:19.840 | they did task synthesis based on uh based on these frameworks and then uh so i feel like a lot of times
00:36:28.480 | they uh they uh they do this synthesis to get the uh data uh required for training and then also
00:36:34.320 | they use like quality filtering uh to make sure that they have like a high quality data goes into the
00:36:39.440 | training yeah and then they use llms to generate this uh multi-step tool called uh into trajectories
00:36:52.880 | uh yeah we'll move on to reasoning rl now one thing they uh i think uh they use like a grpo i feel like
00:37:00.960 | this is like a common framework even deep seek uses this i'm not sure uh i'm not exactly sure what is a
00:37:08.000 | this uh this loss term um yeah
00:37:12.640 | so one thing i understand based on this like uh so yeah uh if i understand like aurel correctly like
00:37:26.080 | for each task like you you get a reward and based on the reward you change the policy uh you take an
00:37:31.360 | option based on the existing policy you get a reward and then change it so here they uh they want to have like
00:37:37.120 | uh if if the task is like too complicated like the reward would be always zero so we don't have any
00:37:44.480 | gradient to train the model better way so they want to make sure that like uh at particular stage we get
00:37:51.120 | like appropriate reward or like we get a reward so that like we can train more or train better so uh
00:37:58.240 | so in stage one they use like moderate examples moderate difficult data uh and then after like
00:38:04.720 | certain point like after certain training steps they use like extremely difficult data so that like they
00:38:10.000 | have a better gradient or the model is well trained enough to produce a gradient or produce a reward um
i think that's what i understand from this uh curriculum based learning
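A toy sketch of that two-stage curriculum idea: train on moderate-difficulty problems first so rewards (and therefore gradients) stay non-zero, and switch to the extreme-difficulty pool once accuracy plateaus. The thresholds below are invented for illustration.

```python
def pick_training_pool(step, recent_accuracy, moderate_pool, extreme_pool,
                       plateau_acc=0.8, min_steps_stage1=1_000):
    """Stage 1: moderate problems keep the reward signal informative.
    Stage 2: once the model mostly solves them, move to extreme difficulty."""
    if step < min_steps_stage1 or recent_accuracy < plateau_acc:
        return moderate_pool
    return extreme_pool
```

The point is just what the speaker describes above: if every rollout fails, every reward is zero and there is nothing to learn from, so difficulty is ramped only once the model can earn reward.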
yeah uh and they tried to use like a different context uh like uh like 16k 32k 48k and 64k.
00:38:36.320 | i i believe at the end they decided like it's better to use like a 64k context window uh there is no
00:38:43.280 | additional advantage with the uh starting with the small context window and then increasing the uh
00:38:48.640 | context window to 64k rather they can do uh 64k training from the beginning
00:39:05.360 | uh yeah they also mentioned like uh this two stage approach uh increasing the difficulty produces uh
00:39:17.120 | some better performances better performance
yeah also they mentioned like a single stage rl at 64k output length uh this is a for that uh in
00:39:32.560 | second paragraph like this chart
00:39:44.720 | uh they uh they mentioned like they used a different uh
00:40:02.320 | loss function uh
00:40:04.160 | and this this basically this chart like uh using different uh loss function uh for like a coding tasks
00:40:14.720 | uh i think uh that comes in this paragraph but we can talk about this first and then
00:40:21.840 | so uh temperature i believe it's like a common uh temperature we see in llms like if the temperature is
00:40:29.600 | a high uh it becomes like a high uh it becomes like a more exploratory uh
00:40:38.080 | if temperature is too low it becomes like a very deterministic
00:40:41.200 | right so instead of using a uh like a standard temperature they try to use like a dynamic
00:40:47.840 | temperature um and then they have like a certain validation to make sure that like the temperature they
00:40:54.720 | have chosen uh is working fine uh is working fine uh yeah so here they mentioned how they uh how
00:41:07.920 | they have chosen temperature based on the validation uh validation set
00:41:14.240 | uh and they mentioned here like why they have uh taken that approach
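A rough sketch of picking a sampling temperature from a validation set rather than fixing it: sweep candidate temperatures, score rollouts on held-out prompts, keep the best one. This is a guess at the general mechanism, not the paper's exact dynamic-temperature procedure, and all the helpers are placeholders.

```python
def choose_temperature(candidates, rollout_fn, validation_prompts, score_fn):
    """Pick the temperature whose rollouts score best on a held-out validation set.

    rollout_fn(prompt, temperature) -> response; score_fn(prompt, response) -> float.
    """
    best_t, best_score = None, float("-inf")
    for t in candidates:  # e.g. [0.6, 0.8, 1.0, 1.2]
        scores = [score_fn(p, rollout_fn(p, t)) for p in validation_prompts]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_t, best_score = t, avg
    return best_t
```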
uh so here they mentioned like for code rl uh the loss calculation is critical for training
00:41:34.880 | efficiency and then instead of using or they used like a token weighted mean loss uh
rather than conventional uh sequence mean loss uh i don't exactly know what's going on here but
i'm taking them by their word and also this chart seems to make the case like token weighted average uh i mean
token weighted mean uh is working better for rl training compared to the sequence mean
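To make the distinction concrete: a per-sequence mean averages the loss inside each sequence first, so a token in a short response counts for more than a token in a long one, while a token-weighted mean averages over all tokens in the batch. A small sketch, assuming per-token losses and masks are already computed:

```python
import torch

def sequence_mean_loss(token_loss, mask):
    """token_loss, mask: (batch, seq). Average per sequence first, then across sequences."""
    per_seq = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_seq.mean()

def token_weighted_mean_loss(token_loss, mask):
    """Average over all valid tokens in the batch, so long sequences contribute proportionally more."""
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)
```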
00:42:03.840 | uh um and they also mentioned like what they have done uh i feel like uh uh so uh for science rl i
00:42:16.080 | feel like uh a lot of the times like they have like a verified answers so i it becomes easier for
00:42:22.000 | training uh to get like exactly uh to get the reward uh if it's uh accurate or not the reasoning steps
00:42:31.520 | um for agent so uh i feel like here uh they use like both humans and ai uh to get the feedback um
00:42:50.480 | i'm not sure what um
00:43:00.480 | uh so this basically like they generated multiple uh uh like they generated multiple samples and then
00:43:12.960 | they're trying to like this this is just like a training loop
00:43:30.160 | uh we would uh will you be able to uh comment on this
00:43:41.760 | yeah i mean this is it am i reading this right someone else should
00:43:48.960 | comment here maybe this is grpo without the kl divergence right is that this is just a reiteration
00:43:57.840 | of the question of that statement from above or not yeah
00:44:00.720 | right
00:44:05.440 | um you know they don't they don't seem to did they mention what like why they removed the kl divergence
00:44:17.520 | from the last term i'm not sure i don't think i come across that reasoning yeah it seems a little bit
yeah um i'm not even sure i mean i'm so not super confident like i understand this part uh this is a lot
of like rl heavy stuff uh wait sometimes i understand um there's there's two things to dropping kl
divergence penalty stuff um in in other work they also i don't remember which paper it was but one of
these they significantly dropped the penalty for kl divergence and then they like decayed it so it
didn't have an effect later on so kl divergence is to keep the model from going too off track right but
lowering the penalty allows the model to explore like the random space a bit more so the paper basically
really dropped the penalty for kl divergence because it allowed the model to explore a broader
00:45:29.520 | broader set of generated tokens right it's like similar to temperature it allows it to explore more freely
00:45:37.760 | and you know in return going down different paths it eventually converges to an answer in this case
they're dropping kl divergence because they're doing this on math and code right so stuff that is
00:45:49.920 | objectively verifiable um you know as long as you can verify that that answer is correct you're you're
00:45:57.520 | fine right like you're you're you're able to get reward as long as you're able to start solving the
00:46:04.160 | questions so in in more broad terms like rlhf with like you know preference feedback and dpo and stuff
00:46:12.960 | it's still a concern right because you can get slight reward for preference based outcomes but in stuff
00:46:21.280 | like math and code you know as long as you end up at that verifiable answer you're kind of chilling right
you don't really need a kl anchor to supervise the policy reward like it's it's okay and then there was
00:46:32.640 | another paper we covered recently that like echoed this yeah thank you yeah that's that's really clear
that um i think it was the kimi paper right that they did this in maybe uh that sounds about
right yeah that's really good insight i appreciate that and this is also just like me spewing my understanding
of what i think about it you know could be wrong but seems legit to me i recall that too so that sounds
right to me good enough for me
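For reference, a stripped-down sketch of a GRPO-style objective with the KL term simply left out: group several rollouts per prompt, normalize their rewards into group-relative advantages, and weight the policy log-probabilities by those advantages. This is a simplification of the general recipe being discussed, not the paper's exact loss (real GRPO also uses per-token importance ratios and clipping).

```python
import torch

def grpo_loss_no_kl(logprobs, rewards, eps=1e-6):
    """logprobs: (group,) summed log-probs of each sampled response under the current policy.
    rewards:  (group,) scalar rewards for the same responses (one group = one prompt)."""
    rewards = rewards.float()
    # Group-relative advantage: normalize rewards within the group of rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Policy-gradient style objective, with no KL penalty term added back in.
    return -(adv.detach() * logprobs).mean()
```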
um in this paragraph like they mentioned how they calculated the reward uh for the given specific task uh so for web search task like uh they use like a final answer
00:47:17.760 | as a reward and then for coding agents uh they use like verifiable test cases for uh rewards and then
00:47:28.960 | yeah um and for uh tool calling uh they try to see the uh format of the uh particular uh trajectory
00:47:36.720 | and then they uh basically like either uh proceed further or like uh receive a zero reward
um and then uh basically uh the cold start sft model like uh it has like a
basic understanding of what's going on uh how to reply back and then uh they did like a uh like once uh
uh training has reached like certain steps like uh they started applying self distillation
uh like uh by replacing uh like the data generated by the cold start sft with data from the rl
trained model i think this is the like expert model um yeah this this section sorry i'm taking so much of
00:48:26.400 | this talk but this section really made me nervous right because i feel like i mean i think that a long
00:48:34.400 | time ago i remember going over one of some paper on synthetic data and and and sort of like methods
00:48:42.880 | where like self-improvement um kind of hits a hits a limit pretty fast so i'm um it's interesting
00:48:52.080 | they're they really like sort of emphasize the self-improvement aspect here and that it just seems like
00:48:59.760 | you're you might like squeeze the most out of the existing data set but you're gonna maybe be over
00:49:06.320 | fit or something i don't know yeah i mean the i was worried about that uh because like they use a lot
00:49:17.200 | of synthetic data but i think uh even though it's like synthetic data the reward function probably captured
00:49:23.760 | like uh enough reasoning capabilities so generalized well um that's that's what i was hoping for yeah yeah
00:49:31.600 | sorry could i ask a question about what is meant by self distillation maybe you didn't quite catch
00:49:40.000 | what they're actually doing here
00:49:48.480 | i i i think they were they were using like expert model uh to generate the response and then
00:49:55.520 | they were using non-expert model uh to use that expert model generated reasoning
00:50:02.320 | i think they call it self distillation but it doesn't make sense
it feels like reading from this it feels like what they're saying is that they use the rl
trained model uh and feed that response into an sft um so but that's kind of like weird to do that
00:50:25.200 | right so i don't quite understand why somebody would do do that i know so this is um this is common they
00:50:32.560 | did this in magistral basically when you run out of um like a good verifier and like training data they
00:50:41.440 | they trained a very good math model so basically in magistral medium they retrained a reasoning model
00:50:49.360 | just on math and then they use that model to generate math data so in this case they do the
00:50:55.120 | same thing right so stage one you train experts right um like reasoning agent um tool use agent general
00:51:04.960 | chat agent then you do this unified training with distillation so right so the the agents generate
00:51:11.760 | responses and reasoning and tool calls and you know that's then the the unified model is trained on this
00:51:18.880 | synthetic data it's like self-distillation because it kind of came from itself right they they branched
00:51:25.120 | off trained experts and then they train on that output so distillation person oh so um before you're saying
00:51:36.480 | that um they were trained the the trained on specific tasks and then use because they were specific
00:51:45.440 | then for the unified one you kind of like use that specific rl trained one to kind of like generalize to
00:51:53.360 | other tasks or at least you kind of um put all the ex i guess you're calling experts right on different
00:52:00.080 | tasks try to use the combined into uh another sft run yeah yeah so they they they use the experts to
generate response reasoning tool call traces right that's that's the distillation so they they they
00:52:15.520 | basically get they train experts to to do output rollouts and then they train on those this is like
00:52:23.680 | how how magistral got reasoning they trained a good math model then they used it for data
00:52:30.960 | it's like a very clear example of like this idea of training another model
to to generate some of this uh data i would recommend that paper or a paper club on it but yeah makes sense
00:52:49.040 | this is like a general comment it seems like a uh increasing the interaction terms with the
00:53:04.640 | environment has obviously improved the uh test time compute um
00:53:11.920 | i think this is like a generic statement it doesn't have any value i mean it has value but it doesn't
like any specific insight and and for general rl they use like rlhf and then rlaif uh they use like a
00:53:32.480 | like a model based feedback as well uh uh depends on the data depends on the task like the kind of
00:53:37.520 | mixing or uh use uh mixing all these three approaches like a rule-based feedback
00:53:43.360 | i think depends on the task like uh one of these would be uh better better um but i feel like uh even
00:53:52.320 | models can give like a better uh better judgment for a lot of tasks um yeah
what is this chart
um yeah this is like a browsecomp uh accuracy
browsecomp very interesting benchmark this is from openai it's like their web search type thing
00:54:24.880 | that they released with deep research i think they updated it recently i think six has played with it a bit
00:54:30.240 | um the the interesting chart i found were like the two in section three actually there's two charts
00:54:37.920 | about training about multi-step and um what do these red and blue lines um yeah they're multi-step training
00:54:50.800 | they have these two charts that kind of break down multi-stage versus single stage training and
00:54:57.040 | breaking off of um difficulty problems and plateauing these two are pretty cool
00:55:03.760 | honestly i think since we're running short of time maybe we don't just look at chart but they had good charts
00:55:10.480 | i recommend charts that's true um and on that last point by the way with the distillation
00:55:20.320 | uh so like the high level process is like first you train the three experts right reasoner agent and
general chat then you know you train them with a cold start of sft traces then you do rl
00:55:34.160 | you make them experts then the big thing is like you don't just train on their rollouts like flat out
00:55:40.480 | right they do a lot of filtration so like for the distillation from three experts they they do rejection
00:55:46.480 | sampling correctness checking uh reward model filtering tool call verification like they actually
00:55:52.480 | verify all this stuff then the unified model they they do that sft distillation sort of stuff and that you
00:56:01.120 | you know that's like a deep level of like filtration stuff i think that's one of the interesting things
00:56:06.640 | right it's not like they just do three models and then they merge them into one like some weird model
00:56:11.760 | merge or some uh ensemble it's just that they use them for synthetic data and then they distill down
00:56:17.440 | and it's like a bunch of just sft kickstart do rl okay generate data filter data let's do it again
00:56:27.520 | let's do sft as a kick start off of these and of course you know strip out some stuff like you want
00:56:33.600 | you want hybrid reasoning so you want to keep some reasoners keep some non-reasoning stuff and you know
00:56:39.440 | in the end you get one model that outperforms the three
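Putting that description into code-shaped form, the loop being described looks roughly like this; every function here is a placeholder for a whole training or filtering stage, not a real API, and the structure is only a sketch of the two-stage recipe.

```python
def build_unified_model(base_model, expert_configs, prompts,
                        sft_train, rl_train, generate_rollouts, passes_filters):
    """Stage 1: train one expert per domain (reasoning, agent, general chat).
    Stage 2: distill their filtered rollouts into a single unified model."""
    experts = {}
    for name, cfg in expert_configs.items():
        model = sft_train(base_model, cfg["cold_start_data"])   # cold-start SFT
        experts[name] = rl_train(model, cfg["rl_tasks"])        # domain-specific RL

    # Self-distillation data: expert rollouts, heavily filtered
    # (rejection sampling, correctness checks, reward-model / tool-call verification).
    distill_data = []
    for name, expert in experts.items():
        for rollout in generate_rollouts(expert, prompts[name]):
            if passes_filters(rollout):
                distill_data.append(rollout)

    unified = sft_train(base_model, distill_data)   # replaces the original cold-start data
    return rl_train(unified, "joint_tasks")          # further RL on the unified model
```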
00:56:42.320 | is it also oh sorry thank you uh go ahead bro i just asked one more quick question um it says here like
00:56:55.920 | once training has reached a certain step count or plateaued is it normal that they loop the
00:57:00.080 | distillation process that they what the distillation process like they're doing they're doing many many
00:57:09.360 | rounds of the self-distillation um once training has reached a certain step count or plateaued we apply
00:57:19.840 | self-distillation by substituting the original cold start data with responses generated by the rl
uh thus creating a superior sft dataset we then conduct further rl training to enhance this model this is
talking about training the experts right then conduct further rl on this no no this is then
increasing difficulty this strategy allows us to push the limits of rl trained models oh wait sorry what's the
00:57:47.840 | what's the what's the question here this seems somewhat standard approach i but i think i
00:57:52.000 | misinterpreted what your question was about this no it's okay i might be misunderstanding i just got
00:57:57.120 | the impression that something that they're doing differently is um it's a self-iteration
00:58:02.960 | loop rather than just running the distillation process once which is what i thought the other models were
00:58:09.600 | doing applying self-distillation by substituting that i think the loop is just like you know after
00:58:20.160 | you do easy questions you do hard questions like that's their whole multi-step thing right like they
00:58:25.600 | at different stages do harder and harder questions and they're mixing it in with context line they they
00:58:31.760 | show this in one of the like i think it's is it figure five about moderate and extreme difficulty
00:58:38.320 | whereas one plateaus they they switch to stage two with extreme difficulty and then you know progress
00:58:44.160 | continues so like that blue line if you keep it stagnant with the same difficulty uh you know that's
00:58:51.120 | where it starts to plateau and you're not improving accuracy so you know next cycle is just how extreme
00:58:57.440 | difficulty instead i think that's like i think that's what they're saying here it's just tears of
00:59:03.760 | harder and harder questions um i could also be misinterpreting this you know
00:59:08.480 | that seems to be my understanding as well
00:59:13.200 | um they mentioned like uh how they created like a reward function for each type of specific task
00:59:23.840 | uh they were checking format they were checking like trajectory uh and then they were also checking
00:59:29.200 | uh if sometimes like uh so this stepwise uh it's basically checking like uh did we call exact tool
00:59:37.680 | at the particular time uh and sometimes they're using that and also they're trying to see if you actually
00:59:44.560 | completed the task other than like just calling the function at particular time so they were using these
to uh and then uh sometimes these rl rollouts they get into infinite loops so and then instead of uh uh instead
of adding a penalty to the task like they uh sampled a lot of uh like a lot of prompts where uh the rl
tends to get into infinite loops and then uh they uh created like a penalty function based on that instead of uh
using a standard penalty which is like more sample efficient
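A small sketch of the kind of agentic reward just described: check whether the right tools were called, check task completion, and subtract a penalty when the rollout looks like it has entered a repetition loop. The weights and the loop heuristic are invented for illustration, not the paper's formula.

```python
def agentic_reward(trajectory, expected_tools, task_completed,
                   loop_window=3, loop_penalty=1.0):
    """trajectory: list of tool-call names in the order the agent issued them."""
    # Stepwise check: did the agent call the expected tools in the expected order?
    stepwise = float(trajectory[:len(expected_tools)] == expected_tools)
    # Outcome check: did it actually finish the task, not just call tools?
    outcome = float(task_completed)
    # Crude loop detection: the same short window of calls repeated back-to-back.
    looping = any(
        trajectory[i:i + loop_window] == trajectory[i + loop_window:i + 2 * loop_window]
        for i in range(max(0, len(trajectory) - 2 * loop_window + 1))
    )
    reward = 0.3 * stepwise + 0.7 * outcome
    return reward - (loop_penalty if looping else 0.0)
```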
uh they mentioned how they want to use like the gpus in better ways uh if uh some uh some rl rollout is taking
a lot of time to finish like instead of waiting on that they created like a different platform uh to
optimize like the gpu usage better uh you can read the details later and then uh yeah these evaluation
metrics how they compare against uh existing models um i don't like this seems like very straightforward to me
01:00:48.720 | uh it's not like a uh i think it's cool uh yeah but it's still a lot of times it's still second to
claude or others maybe the model sizes are like very small compared to them but still um no new insight
01:01:04.240 | it's just like a metrics they're talking how it compares to other models um i feel like uh one
01:01:10.880 | interesting thing uh they were able to have like uh i think the model has like a lot of reasoning
a lot of reasoning capabilities inbuilt uh so it gets like an emergent phenomenon like uh the translation
surprisingly works well compared to uh existing models so that's like a like a side effect of having a lot of
01:01:29.840 | reasoning tokens yeah i think um that's all i wanted to cover
but i feel like the rl setup or uh the things they mentioned is very cool
01:01:53.840 | well thank you so much for presenting um if anyone wants you know next week volunteer there's all
01:02:02.640 | these fun papers i posted a bunch of in discord or if there's any topics you want to cover we can
01:02:08.960 | cover together you know but uh you know big shout out thank you so much for presenting it's always fun
01:02:14.240 | to get different perspective on this stuff you know thank you yeah it's well done um thanks for uh
volunteering all right many more volunteers uh more papers somebody's gonna do seedance that's another one
01:02:28.880 | oh china models man we i really want to do a china tour at some point maybe maybe uh early next year
01:02:37.040 | uh oh my god you just have access to go see all the labs here they're very i can i can definitely
get qwen we have inroads to deepseek uh stepfun minimax for sure i can't get uh the glm one
i mean i don't think they're hard to get but i don't currently have contacts
okay other papers though there's um meta's dinov3 there's a genie clone that came out of tencent
01:03:05.760 | uh what else what else there's a survey on parallel text generation there is a lot of stuff there's a
few auto regressive image generation papers um hrm tyler's volunteering uh yes i haven't read that
01:03:22.400 | one and i need to read it so hrm is yes on the docket in the next two to three weeks i might reach
out to jay alammar for his gpt oss thing uh i'll share in discord if you guys haven't seen it he did a
01:03:37.920 | really good blog post called like the illustrated transformer illustrated for gpt2 he put out one
01:03:46.320 | yesterday i think on the illustrated gpt oss so it's like a good visual breakdown of what's happening
01:03:52.880 | in state-of-the-art moe models and um i will i'll share it in in discord and try to get them to come
01:04:01.040 | up on otherwise you know if you have paper it's always better someone else someone else uh volunteer
okay tyler you're locked in for uh september 3rd glm uh uh sorry hrm uh and then we need a volunteer
for next week oh thanks guys bye all bye