
Muon and Kimi K2 (Moonshot AI)


Whisper Transcript

00:00:00.000 | Very well. So, I think...
00:00:05.400 | You went over this plot last time, or was it two weeks ago? I can't remember. Basically, to summarize what you just said: they're really just
00:00:18.360 | doing better than the competition for the number of FLOPs that you spend, and this scaling curve is another way of saying that.
00:00:31.440 | So then the question, of course, is why does it
00:00:37.980 | do so much better, and this technical report is about how they accomplish this, and a little bit about why. I found that
00:00:46.860 | there's a lot of implicit knowledge for people who follow optimization that I didn't have, so I dug in a little deeper on some of the details here.
00:01:01.200 | The questions that I had are maybe not the same questions that other people have, so I want to make sure I'm answering other people's questions.
00:01:15.360 | I'm not keeping an eye on the chat, but please interrupt if you have other questions, or if I explain something poorly or get something wrong. I definitely could, because this material is not super familiar to me, so I may have misunderstood something or gotten a detail wrong.
00:01:33.360 | Okay, so without further ado. The point of this paper: there were some early results with Muon that were really promising, but the performance of the optimization
00:01:51.360 | diminished as the models scaled. And so this technical report says, okay, we can do some tweaks and actually scale to large models. But I think
00:02:09.360 | a lot of the meat in understanding what's going on here is actually just in understanding Muon itself,
00:02:17.360 | which, I don't know that there's actually a paper on it, I think there's a blog post on it. And it's, you know, a sort of evolution of an optimization technique called Shampoo, which I have not studied.
00:02:31.360 | And then Adam is also one of the dominant optimizers. And in case you're really new: these are all optimizers. So you're doing backpropagation to train your neural network,
00:02:49.360 | and every time you do a forward pass you have some loss, meaning the error between what the neural network predicted and what it should have been.
00:03:00.360 | And then that loss is used to calculate a gradient on the parameters of the neural network, so that I know, in order to get closer to where I'm supposed to be, what the difference is between the parameter values I have currently set and what they should have been to do better.
00:03:23.360 | And so I do this a whole bunch of times with different chunks of my data, different batches of data.
00:03:29.360 | And then eventually that's what's called stochastic gradient descent toward the, you know, sort of optimal setting of those parameters.
00:03:38.360 | And of course this is not a convex problem, meaning that when I'm doing gradient descent, when I'm going down the hill, I might get into a little local valley and get stuck there.
00:03:51.360 | But some of the techniques that we'll talk about are there to avoid that.
00:03:55.360 | Okay, that's my really high-level introduction to what this algorithm does in terms of the ecosystem of neural network stuff.
00:04:08.360 | Okay, so, a quick note.
00:04:13.360 | Amgad has joined as your second cohost. He's joining now; sorry, I had trouble getting him the Zoom link.
00:04:17.360 | So he'll help with the presentation, I guess, on the background of optimizers.
00:04:23.360 | So like you mentioned, there was regular gradient descent, which is: for all parameters, what's the change, and update them all the same general way.
00:04:31.360 | Then the iteration loop, for people that haven't really followed, is that from there we went to adaptive per-parameter methods,
00:04:38.360 | so Adagrad and RMSProp, where, okay, for specific parameters maybe we change them differently than the whole network.
00:04:45.360 | Then Adam, the Adam optimizer, was pretty big around 2014.
00:04:50.360 | That was, okay, let's have this concept of momentum, right?
00:04:53.360 | If we're moving in the right direction, gradually increase, and have adaptive scaling.
00:04:58.360 | AdamW was, okay, we like momentum, but let's separate out weight decay,
00:05:03.360 | because we want the weight decay to be separate from the gradient step.
00:05:07.360 | Then from there is where we have stuff like this, and this stuff no one has really scaled yet, but that's the little bit of middle ground.
00:05:18.360 | Yeah.
00:05:19.360 | Okay. Yeah. Good. Thank you for that.
00:05:23.360 | Okay. So, let's see.
00:05:26.360 | So, like Vivo was just saying, weight decay here is important, but they created a technique to set
00:05:46.360 | the weight-decay-related parameter automatically.
00:05:50.360 | So that's one of the things here.
00:05:53.360 | They also talk about a distributed version that works.
00:05:59.360 | Oops. Sorry.
00:06:00.360 | Uh, let me undo that.
00:06:02.360 | ...that works with ZeRO. What does ZeRO stand for again?
00:06:10.360 | Vivo, can you help me?
00:06:13.360 | So ZeRO is, you know, sort of fully parallelized across multiple dimensions,
00:06:22.360 | multi-GPU training and stuff.
00:06:25.360 | Yeah.
00:06:26.360 | ZeRO-1, 2, 3: different levels of distributed training.
00:06:31.360 | Right.
00:06:32.360 | Right.
00:06:33.360 | So it's an efficiency thing, right?
00:06:34.360 | Instead of having all weights on all GPUs, there are different degrees to which you can do ZeRO optimization.
00:06:41.360 | Yeah.
00:06:42.360 | And this is compatible with ZeRO.
00:06:44.360 | Yeah.
00:06:45.360 | So previously they had some problems distributing it,
00:06:49.360 | and this is: how do we do that?
00:06:51.360 | So this is basically essential for actually scaling to a large model.
00:06:56.360 | And then they look at the scaling laws and compare to an Adam baseline.
00:07:03.360 | And it's, you know, way better.
00:07:06.360 | So, uh, right there.
00:07:09.360 | Okay.
00:07:10.360 | So, okay, the thing that I actually got out of the paper was mostly just Muon itself, and not these three points here, which to me are kind of more engineering.
00:07:23.360 | So definitely other people will have different interests here, but I thought the thing I could do is really understand what's going on here as best I can, and try to explain it to you guys.
00:07:37.360 | So this is the sort of core algorithm, right?
00:07:42.360 | You have this momentum, this M here, and the momentum starts at zero, all zeros.
00:07:51.360 | So if this is M, if M at t minus one is zero before step one, then when I form my first momentum it's just going to be the loss,
00:08:02.360 | the gradient of the loss function with respect to the weights.
00:08:05.360 | And then at every step I'm taking a trade-off: this momentum coefficient kind of trades off between the momentum and the gradient of the loss.
00:08:16.360 | And so this determines how much I pay attention to the past versus the current gradient.
00:08:23.360 | Okay, and then this O step is kind of the core of what this algorithm is doing.
00:08:30.360 | And this is what they call orthogonalizing the matrix.
00:08:36.360 | And I actually didn't understand what they meant by that at first, because, and again,
00:08:47.360 | I'll try to play a middle ground between people who understand linear algebra well and those who don't; I'm somewhere in the middle.
00:08:56.360 | I'm not great at linear algebra, but I have a reasonable understanding.
00:08:59.360 | So if you look at singular value decomposition, which is how you would calculate an exact eigendecomposition or singular value decomposition of a matrix,
00:09:11.360 | that's a very expensive thing to do.
00:09:21.360 | And this Newton-Schulz algorithm is sort of a way to approximate that.
00:09:28.360 | But not quite, because when you do a singular value decomposition you're getting a specific set of outputs:
00:09:37.360 | the singular values, which are like the amounts by which my transformation scales things in different directions,
00:09:49.360 | and then the singular vectors, which point in those directions.
00:09:54.360 | So I was a little confused about what they mean by orthogonalizing a matrix.
00:10:00.360 | So I kind of had to do this little diagram here.
00:10:06.360 | Sorry if it's a little scratchy, but you can think about it like this: let's just look at this
00:10:12.360 | N-by-D matrix here, where I just picked arbitrary values.
00:10:17.360 | And so D, in our case, just to keep things concrete to what we're talking about here,
00:10:23.360 | D is pretty much the hidden dimension of the neural net, or of a given matrix.
00:10:30.360 | So this is done per weight matrix, not for all the weights at once.
00:10:34.360 | So it's basically the hidden dimension.
00:10:37.360 | Depending on which matrix, it could be some multiple of the hidden dimension or whatever, but it's generally that order of magnitude.
00:10:44.360 | So we're talking like 4096 or something like that.
00:10:48.360 | And then the N here is the batch size times the sequence length.
00:10:53.360 | So batch size is, I don't know, somewhere between 256 and probably a couple of thousand.
00:11:02.360 | And then the sequence length is what we all talk about as the context window length.
00:11:06.360 | Right. So that could be, I think they mentioned in the paper, for the trainings that they did,
00:11:13.360 | they were at 4,000 and 8,000.
00:11:15.360 | So these N values are in the millions, basically.
00:11:17.360 | So much higher dimension.
00:11:20.360 | So my N is much more than my D.
00:11:22.360 | Right.
00:11:23.360 | And so one of the key observations here is that this is low rank, a reduced-rank matrix,
00:11:30.360 | meaning I'm on a subspace of the total number of dimensions.
00:11:35.360 | So for example, in this case it's three dimensions.
00:11:40.360 | So I have these three blue lines that sort of try to draw this three-dimensional space.
00:11:45.360 | And then I have these two black lines, which try to draw
00:11:48.360 | the plane that is defined by these two vectors, right?
00:11:54.360 | These two three-dimensional vectors.
00:11:56.360 | Those two define a plane.
00:11:59.360 | And then what I'm doing by orthogonalizing is I'm just finding
00:12:03.360 | two unit-length vectors that are on that plane
00:12:11.360 | and are orthogonal to each other;
00:12:15.360 | they're perpendicular to each other.
00:12:17.360 | And you'll notice there are an infinite number of these,
00:12:22.360 | so it's not a unique solution, but okay,
00:12:26.360 | we'll talk about why, but that's what's going on here; that's what this Newton-Schulz thing does.
00:12:34.360 | And this is actually not on the weights.
00:12:38.360 | This is on the gradients of the weights, right?
00:12:42.360 | So what's going on here?
00:12:45.360 | They talk about the intuition, and I'll give a little bit of it right now.
00:12:48.360 | There's this momentum term, and the momentum is: I'm sort of keeping track of where I was going before and making sure that I can continue in that direction, sort of.
00:13:02.360 | And I'm doing that so that, if I have a little valley, if you can imagine yourself being a ball rolling down a hill and you get into a little valley, then maybe you don't have enough
00:13:14.360 | and you end up stuck in that valley.
00:13:18.360 | But if your ball is heavy enough, then maybe it'll roll up the other side and keep going down.
00:13:23.360 | Right.
00:13:24.360 | So that's sort of the intuition for why they call it momentum.
00:13:26.360 | And what this is doing is saying: rather than a lot of my momentum going in this one direction and only a little in this direction,
00:13:35.360 | I'm stretching that momentum matrix and changing it so that
00:13:41.360 | the momentum can go roughly equally in all these different directions that are defined by the data.
00:13:48.360 | Okay, that's very vague, I know, but that's the intuition.
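To make the step being described concrete, here is a rough sketch of one Muon update for a single weight matrix, assuming the Newton-Schulz coefficients from the publicly released Muon code; the learning rate, momentum value, and function names are illustrative rather than the paper's exact settings.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately map G = U S V^T to U V^T (equal "momentum" in every
    # direction G spans) without computing an SVD. The quintic coefficients
    # below follow the public Muon implementation; treat them as an assumption.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, G, M, lr=0.02, momentum=0.95):
    # One Muon step for one weight matrix: momentum accumulation,
    # orthogonalization, then the weight update.
    M.mul_(momentum).add_(G)              # M_t = mu * M_{t-1} + G_t
    O = newton_schulz_orthogonalize(M)    # O_t ~ orthogonalized momentum
    W.add_(O, alpha=-lr)                  # W_t = W_{t-1} - lr * O_t
    return W, M
```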
00:13:53.360 | Yeah, maybe I should stop here because that's like, I just said a lot.
00:13:59.360 | What do you guys, are there any questions in the chat, or does anyone want to
00:14:04.360 | sort of ask about this?
00:14:07.360 | Um, uh, okay.
00:14:11.360 | If not, I will continue.
00:14:13.360 | Um, so I, I'm actually gonna skip back and forth here.
00:14:18.360 | Cause I think that the, um, sort of, I will talk about that a little more.
00:14:26.360 | So they actually have a, an interesting section here.
00:14:31.360 | Where is it?
00:14:33.360 | Um, where they validate this concept.
00:14:40.360 | So, this "dynamics of the singular spectrum" section.
00:14:47.360 | Okay, so what they did was they looked at, okay,
00:14:57.360 | there's a concept of entropy. Again,
00:15:00.360 | if you're not familiar with the math, that's fine.
00:15:04.360 | What the entropy measures is sort of the amount of variation.
00:15:12.360 | It's not quite variance,
00:15:13.360 | it's not variance, but technically it's the amount of information,
00:15:16.360 | the amount of information in a distribution.
00:15:19.360 | But, um, if you can imagine, uh, you know, like a probability distribution, you have, uh,
00:15:26.360 | you could have like a really, really sharp distribution and it always picks one value.
00:15:32.360 | And that would be very little information, right?
00:15:34.360 | Cause I always pick that value.
00:15:35.360 | So I know I can't express anything really with that.
00:15:40.360 | I can't even express a single bit of information.
00:15:42.360 | If I pick one of two values and I can express, you know, some more.
00:15:47.360 | And if I have a uniform distribution, then I can express lots of different values.
00:15:51.360 | So those are, that's like higher entropy.
00:15:54.360 | So it's like a measure of the amount of information you can express and the sort of randomness of a data set.
00:16:04.360 | And so what they did was they sort of abused this, based on another paper, which I didn't read.
00:16:12.360 | But you get this entropy equation.
00:16:15.360 | And if you've seen entropy before, you'll recognize that this is basically just: this is my distribution.
00:16:21.360 | So I'm taking my singular values,
00:16:26.360 | I'm squaring them and dividing by the sum of them so that it becomes a probability.
00:16:33.360 | And then I take the log.
00:16:35.360 | And so it becomes an entropy, right?
00:16:37.360 | And so multiply by the log and it becomes an entropy.
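As a minimal sketch of the entropy measure being described, assuming it is computed over the singular values of a weight matrix (my paraphrase of the quantity; the paper's exact normalization may differ):

```python
import torch

def singular_spectrum_entropy(W, eps=1e-12):
    # Normalize the squared singular values into a distribution p_i,
    # then take the Shannon entropy H = -sum_i p_i * log(p_i).
    # High entropy ~ energy spread across many directions (high effective
    # rank); low entropy ~ a few directions dominate.
    s = torch.linalg.svdvals(W)
    p = s.square() / s.square().sum()
    return -(p * (p + eps).log()).sum()
```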
00:16:40.360 | So what they're doing here is saying: if my weight matrices are pretty spread out and flat,
00:17:02.360 | if I'm tending to cause my weight matrices to be, quote unquote, orthogonal, so that there are lots of different directions in them,
00:17:14.360 | then my rank is high, or, you know, the dimensionality of the embedding space, maybe, is a better way to put that,
00:17:25.360 | then I get a high entropy here.
00:17:31.360 | And so what they found was that compared to Adam, they do a lot better.
00:17:36.360 | They have, where is it?
00:17:38.360 | This right here.
00:17:42.360 | So basically this is empirical:
00:17:49.360 | they had higher entropy in the weight matrices than the equivalent models trained by Adam.
00:18:02.360 | And so what they're saying is that they were able to explore dimensions more effectively. And so this is kind of the why, and this is really fascinating to me.
00:18:14.360 | Um, is that like, and so that they talk about this a little bit more.
00:18:19.360 | Um, let me find it.
00:18:22.360 | Um, let me find it.
00:18:23.360 | Right.
00:18:24.360 | Uh, right.
00:18:29.360 | Yeah.
00:18:30.360 | So, um, uh, sorry.
00:18:34.360 | They talk about viewing Muon versus Adam as using different norm constraints, meaning they're basically
00:18:49.360 | regularizing in a different way.
00:18:52.360 | And so they're saying, whereas Adam is sort of, so let me, and I think I have this right,
00:19:00.360 | so if somebody knows better, please correct me.
00:19:02.360 | What Adam is kind of doing is: I'm looking at these vectors,
00:19:14.360 | or these dimensions, and I'm taking the maximum size of those,
00:19:21.360 | and I'm constraining my momentum by that.
00:19:26.360 | Whereas the Muon optimizer is instead looking at the largest singular value and constraining by that.
00:19:40.360 | Okay, so that's sort of the fundamental difference.
00:19:43.360 | And that makes sense, because things are happening in this spectral space, not in these coordinate dimensions, right?
00:19:54.360 | Because you can rotate things around in the embedding space,
00:19:59.360 | and so it doesn't make sense to constrain your optimization by the coordinate directions,
00:20:06.360 | because you might have one direction that is really large because of all these different directions that are summing up.
00:20:13.360 | Okay. So that's a really mumbo jumbo explanation.
00:20:17.360 | I can go into more.
00:20:19.360 | Um, uh, I can go into more, but I think I've probably talked about this much.
00:20:24.360 | Does anyone want to ask a question or should I move on?
00:20:29.360 | Okay. Um, okay.
00:20:31.360 | So I'm gonna, I wanna make sure that the other speakers have some time.
00:20:35.360 | So I'm gonna like sort of like brush over all this stuff and then we can, we can go into, uh, any detail either at the end or, uh, as you can interrupt me.
00:20:49.360 | And I'm happy to, uh, I'm happy to, um, I'm going to details about these things.
00:20:55.360 | So, okay, what are the key things from this paper, now that I've made a maybe half-effective attempt to explain the core concept here?
00:21:05.360 | So, I think, yeah, good question.
00:21:09.360 | Yikes has asked in the chat if I can repeat what Adam looks like versus this looking at the largest singular value, like in the math, my interpretation of the math,
00:21:20.360 | comparing these. Oh, okay. So let's see. Yikes, you wanted this kind of explanation again.
00:21:28.360 | Sorry. Okay.
00:21:31.360 | Yeah. So basically, and again, hopefully somebody who's more knowledgeable can correct me if I'm misunderstanding,
00:21:45.360 | each one of these vectors has a size, right?
00:21:51.360 | I'm not going to go into the math of why, but effectively the norm, the sort of constraint that Adam is using, is in effect the biggest one of these vectors: you're normalizing by the length of that vector.
00:22:10.360 | Right. So you have a whole bunch of them, this D is in reality big,
00:22:15.360 | and so you have all these vectors and you're kind of normalizing by the biggest one of those.
00:22:20.360 | And the problem with that is that you might have a direction, like, you could rotate,
00:22:29.360 | so that you're squishing in some direction that is a combination of these different vectors.
00:22:38.360 | And that would be the largest singular value.
00:22:42.360 | Right. So what you're effectively doing with Muon is: you're in this spectral space, which is like the singular value decomposition space,
00:22:58.360 | and you're constraining the size of the jumps, the steps, the sort of strength of the momentum, by the largest direction in that spectral space instead.
00:23:14.360 | Right. So you look at what the principal components, or sorry, the singular vectors, of this are, look at the size of those, look at what the largest singular value is, and divide by that, basically
00:23:33.360 | kind of normalizing via that. Does that make any sense?
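A tiny sketch of the two notions of "size" being contrasted here, as I read it; the Adam-side quantity is a paraphrase of the coordinate-wise view described above, not a formal statement from the paper:

```python
import torch

G = torch.randn(4096, 1024)   # stand-in for a momentum / gradient matrix

# Coordinate-wise view (the Adam-flavored constraint as described above):
# the largest individual entry or per-vector length dominates the scaling.
entrywise_max = G.abs().max()
max_row_norm = G.norm(dim=1).max()

# Spectral view (the Muon-flavored constraint): the largest singular value,
# i.e. the biggest stretch over any direction, even a direction that is a
# combination of many coordinates.
spectral_norm = torch.linalg.matrix_norm(G, ord=2)
```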
00:23:37.360 | Okay. Um, I'm happy to like discuss this and I, you know, I think I have a reasonable grasp on it, but I may not.
00:23:47.360 | And so like happy to be corrected and happy to discuss this.
00:23:51.360 | And it's super fascinating to me and happy to discuss it in great detail.
00:23:55.360 | Um, let me, let me just touch on all these other points and then we can talk about Kimi.
00:24:00.360 | I think that I don't want to take up all the time.
00:24:02.360 | So this weight decay, the way I read it was:
00:24:08.360 | if you look at the difference between this weight update here and this one here, they're just adding this lambda times W at t minus one.
00:24:21.360 | So what they're saying is: I have a controllable parameter that allows me to subtract off some fraction of the weights every step,
00:24:35.360 | so that I tend to favor smaller weights, basically the same reason we do weight decay in all other algorithms.
00:24:46.360 | And it doesn't look any different there.
00:24:49.360 | So they have this algorithm, and this should basically be familiar if you've ever looked at weight decay before.
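In update form, the difference being described is just a decoupled weight-decay term folded into the step. A minimal sketch, where O is the orthogonalized momentum from the earlier sketch and the lr and wd values are placeholders:

```python
def muon_update_with_weight_decay(W, O, lr=0.02, wd=0.1):
    # W_t = W_{t-1} - lr * (O_t + wd * W_{t-1}):
    # the wd * W term shrinks the weights a little every step,
    # independently of the gradient direction.
    return W - lr * (O + wd * W)
```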
00:24:58.360 | And then what they're saying here is that the RMS of the update,
00:25:11.360 | the one you get with Adam, doesn't actually carry over well to Muon.
00:25:22.360 | And I think this is one of their primary contributions here,
00:25:25.360 | in this paper, that is.
00:25:27.360 | It's that it actually depends on the shape of the matrix that you're
00:25:36.360 | doing the update on, that you're normalizing around.
00:25:44.360 | So with A and B, the two dimensions, it's the max:
00:25:48.360 | the update RMS depends on the max of these two dimensions, or, maybe I
00:25:55.360 | highlighted the wrong place, but when that dimension is large you're effectively dividing by the large dimension,
00:26:04.360 | and when it's small you're not normalizing enough.
00:26:08.360 | And so you either get, you know, exploding updates, or vanishing gradients,
00:26:19.360 | if you're not careful.
00:26:21.360 | And I think they hypothesize that this is one of the primary reasons why larger models were running into trouble.
00:26:27.360 | So because of that, they had a few different mechanisms for updating,
00:26:37.360 | and they did some ablations down here, but I think they came up with just multiplying it out like this.
00:26:47.360 | Right.
00:26:48.360 | Right.
00:26:49.360 | So now this equation here has changed to this.
00:26:54.360 | So they're just adding, and they have this 0.2, to match what Adam does.
00:26:59.360 | And, you know, that's a minor detail that you probably don't care about, but they do this normalization so that
00:27:11.360 | they're now independent of the size of the matrix that they're doing the update on.
00:27:20.360 | And so this is one of the key things that allows them to be stable with larger training runs.
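A sketch of the shape-aware rescaling being described; the 0.2 constant and the sqrt(max(A, B)) factor reflect my reading of the report and should be treated as assumptions:

```python
import math

def shape_matched_muon_update(W, O, lr=0.02, wd=0.1):
    # Scale the orthogonalized update by 0.2 * sqrt(max(A, B)), where
    # A x B is the weight matrix shape, so the update RMS stays roughly
    # at Adam's ~0.2 level regardless of how wide or tall the matrix is.
    A, B = W.shape
    scale = 0.2 * math.sqrt(max(A, B))
    return W - lr * (scale * O + wd * W)
```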
00:27:27.360 | Um, okay.
00:27:30.360 | And then what are, what are the other things?
00:27:32.360 | So then they also like, and we kind of touched on this, the zero, um, and, um, you know, sort of.
00:27:40.360 | Um, you know, sort of Megatron LM, which is, uh, you know, like even more complicated, uh, parallel strategies.
00:27:49.360 | So they basically have a, an algorithm here that they describe.
00:27:53.360 | I, I don't think that there's, um, a whole lot here.
00:27:58.360 | Um, but, you know, like you can go through it.
00:28:02.360 | It's just like, they just gather some more.
00:28:05.360 | They, they, you know, they've kind of changed how you do the updates to gather.
00:28:09.360 | Um, information.
00:28:10.360 | The one thing that I found was interesting was this line.
00:28:13.360 | So they, they actually discard.
00:28:15.360 | They, they, so I think what they're saying is that on each node, they're actually calculating the, and they have to do like the entire.
00:28:23.360 | Uh, they have to calculate the entire momentum matrix and then only apply it to the parameters that they care about.
00:28:30.360 | So this is a little bit, um, inefficient, but they, you know, they, they talk about how the, despite that inefficiency, they, um, because they're only using one momentum buffer instead of two that like, okay.
00:28:45.360 | From a space standpoint, that doesn't matter too much.
00:28:49.360 | The communication overhead, um, is like slightly higher, but, um, because like, as you scale, it matters less and less.
00:28:58.360 | And then sort of like that they found that in practice that, um, you know, the sort of, uh, the end to end latency increase was pretty negligible.
00:29:08.360 | So, um, so like, and then finally, this was like evaluation of some different, um, strategies for doing consistent, um, update, um, in the weights or sorry, in the, uh, momentum term.
00:29:27.360 | Um, they, like I said, they landed on this one.
00:29:30.360 | Um, you can read more detail if, if that's interesting.
00:29:34.360 | And then they, um, you know, they had some interesting, um, scaling laws here.
00:29:40.360 | You can see that.
00:29:41.360 | Um, you know, like sort of Adam does not surprise, or sorry, moon, not surprisingly does better.
00:29:47.360 | And they fit.
00:29:48.360 | I didn't find their curve very convincing that this is actually the, the, um, the right fit, but, um, but, you know, I didn't.
00:29:58.360 | Whatever.
00:29:59.360 | It's clearly much better than Adam.
00:30:01.360 | So I'm willing to take it.
00:30:03.360 | Um, and then let's see what else.
00:30:06.360 | I'm sorry.
00:30:07.360 | This is so rush.
00:30:08.360 | I really didn't have a time to prepare super well.
00:30:12.360 | Um, but, uh, oh yeah, this, this section.
00:30:16.360 | Um, I thought was really interesting as well.
00:30:19.360 | So this, I talked about the entropy measure that, um, is right here.
00:30:24.360 | I talked about that previously.
00:30:26.360 | I think this is a really interesting set of plots, where the weight matrices, throughout training,
00:30:44.360 | so if you look, Adam is red,
00:30:47.360 | so this is Adam,
00:30:49.360 | and then Muon is the blue one.
00:30:51.360 | And these here are different parts of the model
00:31:00.360 | and different training iterations, and the entropy in the weight matrices tends to be much higher, meaning the weight matrices are expressing
00:31:13.360 | different things more effectively than when you're trained by Adam.
00:31:20.360 | So they're using the dimensionality of the weight matrices more effectively is, I think, the summary of that.
00:31:28.360 | Um, and one, so I'll, I'll, I'll conclude now.
00:31:33.360 | I think, but one like interesting thing that they, um, so they found.
00:31:38.360 | So I found these two sections together quite interesting.
00:31:42.360 | Whoops.
00:31:43.360 | These two sections together quite interesting.
00:31:45.360 | So if you pre-train using Muon, then everything's better, basically.
00:31:53.360 | Kind of sort of.
00:31:54.360 | Um, and they talk about like different, like for, um, um, you know, sort of like pre-training plus fine tuning.
00:32:04.360 | Oh, sorry.
00:32:05.360 | If you pre-train and fine-tune using Muon, they do better.
00:32:09.360 | Um, however, they found that if we take public models and we, they're obviously already pre-trained on whatever optimizer.
00:32:20.360 | They, um, do the fine tuning using, um, Adam, oops, sorry.
00:32:25.360 | Adam versus Muon,
00:32:29.360 | Adam does better.
00:32:31.360 | Right.
00:32:32.360 | Consistently.
00:32:33.360 | And so that is one of the primary questions that they had:
00:32:41.360 | why does that happen?
00:32:42.360 | And I don't really think they had a good explanation for that at all.
00:32:47.360 | Um, okay, so, um, yeah, so I, I think that's probably enough.
00:32:53.360 | I want to make sure that other people have time to spend, but I, well, maybe we should, if there are any more questions, I'd be happy to answer those.
00:33:00.360 | And then I'll hand over the, the reins.
00:33:02.360 | Any, any questions?
00:33:06.360 | I know this was like really rushed and maybe not super intuitive.
00:33:10.360 | Um, I apologize for that.
00:33:13.360 | Not at all.
00:33:14.360 | This is as good as we can hope for, for explaining this kind of optimizer work.
00:33:21.360 | I think Amgad had some commentary.
00:33:25.360 | Uh, Eugene has actually used more, um, either of them.
00:33:31.360 | It looks like Amgad is coming on.
00:33:38.360 | Uh, can you hear me?
00:33:40.360 | Yeah.
00:33:41.360 | Okay.
00:33:42.360 | Cool.
00:33:43.360 | Yeah.
00:33:44.360 | I did have some remarks about how they even scaled Muon further in their training of Kimi K2.
00:33:52.360 | I, I can share like the notes and kind of quickly go over them.
00:33:56.360 | If there are no other questions.
00:33:57.360 | Yeah.
00:33:57.360 | This is the clip stuff.
00:33:58.360 | Yeah.
00:33:59.360 | Uh, yeah, I think, uh, RJ, you're gonna have to stop the share so that I'm going to do it or whatever.
00:34:09.360 | However, this works.
00:34:10.360 | Yeah.
00:34:11.360 | Okay.
00:34:13.360 | Give me one second.
00:34:14.360 | Let me figure that out.
00:34:15.360 | Uh, here we go.
00:34:16.360 | Uh, yeah.
00:34:17.360 | Zach's is Matthew and Dan says, so I need to attend this.
00:34:20.360 | I actually like thinking about this in terms of a theory of learning.
00:34:26.360 | I think there are a lot of ways in which you can apply these algorithms to your own personal learning as well.
00:34:34.360 | So the concept of momentum, to me, is something that I tell people about a lot: when you are wrong,
00:34:39.360 | and you are wrong,
00:34:40.360 | you should not only update on what you were wrong about, but also have momentum on further updating.
00:34:46.360 | Amgad, are you sharing your screen?
00:34:50.360 | Um, I'm about to, uh, should be, should be visible now.
00:34:58.360 | Yeah.
00:34:59.360 | Cool.
00:35:03.360 | All visible.
00:35:04.360 | Amgad,
00:35:05.360 | I don't know if you're speaking.
00:35:06.360 | He is not speaking.
00:35:07.360 | He has killed his voice.
00:35:08.360 | Okay.
00:35:09.360 | Someone else explain his.
00:35:10.360 | Okay.
00:35:11.360 | Okay.
00:35:11.360 | Is the screen visible now?
00:35:13.360 | Yeah.
00:35:18.360 | You're good.
00:35:18.360 | You're good.
00:35:19.360 | Yeah.
00:35:19.360 | You can go.
00:35:20.360 | Okay, cool.
00:35:27.360 | So hi everyone.
00:35:28.360 | My name is Amgad.
00:35:29.360 | I'll, I'll be sharing some of the notes I took, uh, when researching the K2 model.
00:35:36.360 | Uh, most of this comes from the blog posts they published.
00:35:41.360 | They said they will be, uh, sharing a very detailed technical report, but, you know, uh,
00:35:48.360 | we can never know.
00:35:49.360 | So I'll just go ahead and share whatever I know right now.
00:35:51.360 | So let's, let's start with this.
00:35:53.360 | Yeah.
00:35:54.360 | Yeah.
00:35:55.360 | So let's, let's start with the architecture.
00:35:57.360 | Uh, basically how they designed the model.
00:36:01.360 | Uh, they had a few principles, uh, they needed to follow.
00:36:04.360 | Uh, and they did a lot of scaling experiments before the training even started.
00:36:09.360 | So there are a lot of scaling experiments, uh, a lot of the details that you can do, or
00:36:15.360 | a lot of the changes you can do to the architecture.
00:36:17.360 | And they found out that every single detail, uh, every single change they made, it didn't
00:36:22.360 | improve the model at all.
00:36:23.360 | And sometimes it was just worse.
00:36:25.360 | So basically they decided to stick with the DeepSeek-V3 architecture instead of just changing
00:36:31.360 | for the sake of change.
00:36:32.360 | And I think this is a very good, uh, approach that we don't see a lot.
00:36:37.360 | Like people want to try different stuff because they're just different without having any,
00:36:43.360 | uh, uh, justification for using them.
00:36:46.360 | So yeah, their answer was no, we should just use DeepSeek-V3.
00:36:49.360 | And then maybe you can make some even further scaling on top of it rather than, uh, making
00:36:55.360 | new changes to the architecture because they already have a significant change, which is the optimizer.
00:37:00.360 | They are now using Muon instead of AdamW, which is the industry standard.
00:37:05.360 | Uh, so they had a few constraints when, when, when trying to scale this up.
00:37:10.360 | The first constraint is it should be as close as possible to DeepSeek-V3.
00:37:16.360 | The second constraint is the cost ceiling.
00:37:20.360 | They are a small company and they don't have a lot of money.
00:37:23.360 | So they want to keep training and inference costs as low as possible.
00:37:28.360 | So they did a lot of experiments and they found that DeepSeek-V3
00:37:33.360 | is at the upper limit of what they can afford.
00:37:35.360 | So K2 must be similar to DeepSeek-V3 in terms of cost.
00:37:40.360 | So with these constraints, they arrived at the conclusion that
00:37:47.360 | they should use the DeepSeek-V3 skeleton and then find parameters that keep the training
00:37:52.360 | and inference cost the same, while reducing the loss significantly.
00:38:00.360 | Um, so they made a few changes.
00:38:02.360 | We can go in over them now.
00:38:05.360 | The first change comes from something called the sparsity scaling law.
00:38:09.360 | This is a, an internal experiment or like an internal series of experiments that was done
00:38:15.360 | by their pre-training team.
00:38:17.360 | They found out that with fixed activated parameters,
00:38:21.360 | if you increase the total parameters of a mixture of experts, it still obeys the scaling law:
00:38:26.360 | there is no overfitting and the loss keeps going down.
00:38:30.360 | And this is internal research that has not been published yet,
00:38:33.360 | so we'll just take their word for it for now.
00:38:36.360 | So based on these observations, they decided to increase the total number
00:38:41.360 | of experts to 384, compared to what DeepSeek-V3 is using, which is 256.
00:38:50.360 | The main issue with increasing the total model size is, assuming that you have a fixed
00:38:57.360 | number of nodes, a fixed number of GPUs,
00:39:00.360 | each GPU is now consuming more and more memory because you have more total experts.
00:39:09.360 | So in the new K2, you have three routed experts per node
00:39:13.360 | plus the one shared expert, which consumes about 10 gigabytes, compared to DeepSeek-V3, which used two routed experts and one shared one and used about 7.5 gigabytes.
00:39:25.360 | So right now the, the model is consuming more memory.
00:39:28.360 | I think it's almost, they say it's 50% more.
00:39:33.360 | I didn't double-check the math,
00:39:36.360 | so we can just take this for granted, but they are saying that now more memory is consumed.
00:39:42.360 | We want to claw back this memory because memory is valuable and compute and bandwidth are like, uh, limited resources for them.
00:39:49.360 | So they, they made the second change, which is reducing the number of attention heads.
00:39:55.360 | DeepSeek-V3 initially doubled the number of attention heads, and they justified this because they said it maximizes the bandwidth utilization.
00:40:04.360 | So this is something that is a bit more specific to the DeepSeek-V3 infrastructure.
00:40:10.360 | So the Kimi K2 authors, uh, decided to reduce the number of attention heads to 64.
00:40:16.360 | And this results in two main points.
00:40:19.360 | The first one: we're cutting the heads in half.
00:40:22.360 | This means we're getting huge wins for long context.
00:40:26.360 | And this is important for K2 because it's going to be used for agents and vibe coding, and consuming tens of thousands of dollars
00:40:33.360 | to do vibe coding stuff.
00:40:36.360 | So this is quite important already.
00:40:38.360 | The second result is that the query, key, value, and output projection parameters are reduced by half because we're using half the heads.
00:40:49.360 | So this goes down from 10 billion parameters to 5 billion.
00:40:52.360 | So we are cutting the FLOPs used by a significant portion again, and they have done a lot of ablations internally that show that the negative impact
00:41:02.360 | on the loss from reducing the number of heads is quite minimal.
00:41:07.360 | And again, we will have to take their word for it on this.
00:41:11.360 | The third change they made in the architecture is that they are no longer using expert grouping.
00:41:16.360 | Uh, so grouping helps when you have more than one expert on a GPU, uh, because this tends to balance the work at the device level.
00:41:25.360 | They said that they have a certain requirement for scaling, and that they are actually using one expert, or even less than one expert, on each GPU, because they need to serve
00:41:36.360 | many, many concurrent requests.
00:41:38.360 | So at that scale, if you have one expert per GPU, you don't need to group the experts, because you're just routing to each node.
00:41:48.360 | So they said we no longer need any expert grouping.
00:41:53.360 | We can just use the router and give it more freedom.
00:41:56.360 | And this allows them to explore more of the combinatorial space and arguably have better model quality.
00:42:05.360 | So these are the main changes in architecture compared to DeepSeek-V3.
00:42:09.360 | The summary is: it is a sparse mixture-of-experts model.
00:42:13.360 | It has 1 trillion total parameters, sorry, not tokens, but only 32 billion of which are activated in each forward pass.
00:42:21.360 | It is quite similar to the DeepSeek-V3 and R1 architecture because it's basically an evolution of it.
00:42:28.360 | It's using fewer attention heads but has more total experts, and it has a bigger vocabulary size.
00:42:36.360 | It's now using 160,000 tokens instead of 129,000 tokens in the vocabulary.
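Pulling the headline numbers together as quoted in this talk (approximate, not checked against the official configs):

```python
# Numbers as stated in the talk; treat them as approximate.
kimi_k2 = {
    "total_params": 1_000_000_000_000,   # ~1T
    "active_params": 32_000_000_000,     # ~32B per forward pass
    "routed_experts": 384,
    "attention_heads": 64,
    "vocab_size": 160_000,
}
deepseek_v3 = {
    "total_params": 671_000_000_000,     # "670-ish B" in the talk
    "active_params": 37_000_000_000,     # "36-ish B" in the talk; 37B in DeepSeek's description
    "routed_experts": 256,
    "attention_heads": 128,
    "vocab_size": 129_000,
}
```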
00:42:45.360 | And this is a nice graph comparing the two architectures by Sebastian Raschka,
00:42:50.360 | so please go and check this.
00:42:53.360 | I'll be sharing the notes in the Discord server after the meeting.
00:42:59.360 | So this is regarding the architecture.
00:43:02.360 | Any questions before we cover, uh, some of the, uh, training approaches and, and other, uh, contributions?
00:43:10.360 | Uh, can you guys even hear me?
00:43:25.360 | Can just, can just someone confirm that we're all good.
00:43:28.360 | Yeah, we can hear, we can hear.
00:43:31.360 | Uh, I do see one question in chat.
00:43:33.360 | Uh, I'm not able to digest how to find the best momentum and weight decay of the muon optimizer among different attention heads among all the Emily.
00:43:48.360 | Maybe I can, if I understand that question correctly:
00:43:54.360 | the momentum matrix, there is a different momentum term per parameter matrix, kind of sort of.
00:44:08.360 | So each attention head matrix and each expert matrix will have its own momentum direction.
00:44:21.360 | Does that make sense?
00:44:22.360 | They're all calculated separately is kind of another way to put it.
00:44:26.360 | Does that answer the question?
00:44:27.360 | I'm not sure if it does or not.
00:44:28.360 | That was from Osama, I guess.
00:44:37.360 | Okay.
00:44:38.360 | Good.
00:44:39.360 | Awesome.
00:44:40.360 | Yeah.
00:44:41.360 | Sorry.
00:44:42.360 | I disconnected, uh, for a while.
00:44:45.360 | Um, no worries before disconnect.
00:44:50.360 | Yeah.
00:44:51.360 | I answered the question anyway.
00:44:53.360 | Perfect.
00:44:54.360 | Okay.
00:44:55.360 | I was asking if like, there are any questions regarding the, uh, architecture, but I think we have no questions for no further questions.
00:45:02.360 | Right.
00:45:03.360 | Um, I actually do.
00:45:05.360 | Uh, I, I heard that it was, it was 1 trillion tokens and not 1 trillion parameters.
00:45:10.360 | Um, okay.
00:45:11.360 | Oh, is it, so it's 1 trillion.
00:45:13.360 | Is it?
00:45:14.360 | Yeah.
00:45:15.360 | What, uh, how many tokens and how many parameters?
00:45:17.360 | Yeah.
00:45:18.360 | Uh, so the total number of parameters, though, their size, let me just correct this quickly.
00:45:24.360 | The total number of parameters is 1 trillion, but only 32 billion are active.
00:45:34.360 | And this is compared to DeepSeek-V3, which has, I think, 670-ish billion total parameters, but only 36-ish billion active.
00:45:44.360 | Okay.
00:45:45.360 | So yeah, this is regarding the total size.
00:45:47.360 | So if you were to download this model to disk, and assuming two bytes per parameter, you're going to need like two terabytes
00:45:55.360 | of storage just to download it.
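A quick sanity check on that figure, assuming two bytes per parameter (bf16/fp16):

```python
total_params = 1_000_000_000_000      # ~1T parameters
bytes_per_param = 2                   # bf16 / fp16 storage
size_tb = total_params * bytes_per_param / 1e12
print(size_tb)                        # -> 2.0, i.e. ~2 TB just for the weights
```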
00:45:58.360 | Yeah.
00:45:59.360 | Yeah.
00:46:00.360 | I was, I was mostly wondering about the distinction between, between, uh, parameters and tokens, but it sounds like 15 trillion total tokens that it was trained on.
00:46:08.360 | And then one trillion parameters, uh, is what I'm, what I'm hearing.
00:46:11.360 | Exactly.
00:46:12.360 | Yeah.
00:46:13.360 | All right.
00:46:14.360 | Cool.
00:46:15.360 | Sweet.
00:46:17.360 | Yeah.
00:46:18.360 | So yeah, this was regarding the architecture.
00:46:20.360 | Uh, we can now go into details about some of the motivations and the principles behind their approach to train this model.
00:46:27.360 | So they, they have two big, uh, principles.
00:46:30.360 | The first one is the agentic intelligence.
00:46:33.360 | Uh, they believe that LLMs should be agentic and they should be intelligent.
00:46:38.360 | They should be able to use tools in an intelligent way: tools to write down summaries, insert new information into memory, access MCP tools, and all of these amazing agentic capabilities.
00:46:52.360 | And they also think that the model should be intelligent enough to know which tool to use and how to use it.
00:46:58.360 | Uh, basically this principle comes from reinforcement learning and that to do a really good job at reinforcement learning, you need three components.
00:47:08.360 | You know, the, the algorithm, uh, and the environment and the priors.
00:47:12.360 | Uh, the main point is without good priors, the agent is just going to randomly guess what action to take, uh, when interacting with the environment.
00:47:21.360 | And this was going to result in a low reward, uh, because you're just randomly guessing, uh, which, which tool should I use or which, which action should I dig.
00:47:31.360 | And you're not getting, gaining a lot of rewards and you're not improving.
00:47:34.360 | This is going to make the, uh, training, the, the training process of RL is going to be very, very hard.
00:47:41.360 | It's going to take a lot of time because you're getting a very little or like very weak feedback signal.
00:47:46.360 | But in the last two years, the pre-training of large language models has made them, uh, almost universal models.
00:47:54.360 | They have great world knowledge.
00:47:56.360 | They know a lot of stuff, a lot of, uh, about a lot of domains.
00:48:00.360 | So the pre-training of modern LLMs is the crucial foundation for establishing the priors, uh, that are needed to develop strong RL models.
00:48:09.360 | Uh, we have one caveat though, is that the human data is a finite fossil fuel.
00:48:14.360 | This is a saying by Ilya Sutskever, that the data is not catching up to the progress of compute.
00:48:22.360 | We're getting more and more powerful GPUs every year, but the data is limited, at least the natural human data.
00:48:28.360 | So this means we need to be token efficient, and token efficiency means: given a fixed-size dataset,
00:48:35.360 | how can we develop more agentic, smarter, more intelligent models given that fixed amount of data?
00:48:45.360 | And we're going to give it like a quick hint that we should maybe look into better optimizers.
00:48:51.360 | Uh, so optimizers that can achieve a lower loss given a fixed size dataset.
00:48:57.360 | And obviously they ended up using MuonClip, which RJ explained well.
00:49:06.360 | The second motivation or principle is the post training.
00:49:09.360 | So, uh, we're basically in the, what, what people call the era of experience.
00:49:14.360 | LLMs are increasingly learning from their own self generated interactions.
00:49:18.360 | They're not just relying on the data that we create, the instructions that we write for them on how to solve the problem.
00:49:24.360 | They're actually learning from their attempted solutions at the problem.
00:49:27.360 | Uh, and there are like two examples of this.
00:49:31.360 | The first one is AlphaProof.
00:49:32.360 | So this model was initially trained on like around a hundred thousand, uh, proofs written by human experts.
00:49:38.360 | But then it went on to generate around a hundred million more by interaction with the, or the environment and receiving rewards using a formal proving system.
00:49:48.360 | This allowed the model to basically improve significantly and become way more capable compared to previous models.
00:49:56.360 | The second one is the tried and tested DeepSeek-R1.
00:49:59.360 | It uses verifiable problems like coding and math.
00:50:03.360 | Uh, basically we give the model the problem and it makes a lot of attempts to solve the problem.
00:50:08.360 | And we just judge or, uh, grade their solution.
00:50:12.360 | We can say, okay, you're getting close.
00:50:14.360 | You're getting not close.
00:50:15.360 | You're doing good.
00:50:16.360 | You're doing bad.
00:50:16.360 | And this allows the model to learn from its own solution without having to write detailed instructions of how to solve the problem.
00:50:23.360 | Actually.
00:50:24.360 | So given these two principles, they made a few contributions to achieve these goals.
00:50:29.360 | The first one is the MuonClip optimizer.
00:50:32.360 | Basically: how can we become more token efficient during training?
00:50:36.360 | So as RJ explained, AdamW has been the dominant optimization algorithm since 2014, 2015,
00:50:45.360 | and since then people have been trying nonstop to improve upon it.
00:50:48.360 | But I think it was only recently that a more prominent approach was demonstrated by Keller Jordan,
00:50:58.360 | basically the Muon approach to train nanoGPT to a SOTA level.
00:51:03.360 | So I just want to give a quick overview of how Muon went on to become MuonClip.
00:51:11.360 | So basically, before even K2, the Moonshot team was working on something called Moonlight,
00:51:17.360 | which was an attempt to scale Muon to an actual LLM,
00:51:21.360 | a modern LLM, because nanoGPT was, I think, a hundred million parameters.
00:51:25.360 | So they wanted to scale this to something that is 3 billion parameters or even more.
00:51:29.360 | And they successfully demonstrated that you can actually scale the Muon optimizer to a modern LLM,
00:51:36.360 | and they trained what they call Moonlight, which is the model trained with the Muon optimizer.
00:51:42.360 | Now they wanted to scale this algorithm even further, to something the size of DeepSeek-V3 or even more.
00:51:50.360 | Now we want to train our trillion parameter model.
00:51:52.360 | This is quite huge.
00:51:53.360 | So when they were trying to train this new model with the algorithm, they ran into a few issues, and they came up with clever solutions for them.
00:52:02.360 | They decided to reduce the number of heads for long-context efficiency,
00:52:10.360 | and, I think, increased the sparsity, but they encountered
00:52:13.360 | a lot of training instability, due mainly to exploding attention logits.
00:52:18.360 | And this occurred more frequently with Muon than with AdamW.
00:52:22.360 | So, uh, this is a big problem.
00:52:25.360 | They tried,
00:52:26.360 | they tried some of the existing solutions, like soft-capping the logits and normalizing the query and key parameters, but they didn't help a lot.
00:52:35.360 | So they came up with a clever way to rescale the query and key parameters.
00:52:43.360 | They call it the QK-Clip technique.
00:52:45.360 | Basically, it tries to stabilize the training by rescaling the weight matrices of the query and the key projections after the updates.
00:52:53.360 | We can take a look at the math here.
00:52:55.360 | So after we take the inputs and multiply them by the weight matrices, we scale the result by a certain factor for the query,
00:53:01.360 | and we also scale it by a similar factor for the key.
00:53:04.360 | If you multiply these together, you get the query multiplied by the key, and both of them are now scaled by a factor.
00:53:14.360 | And this scaling factor is called an adaptive factor;
00:53:19.360 | I think the symbol is eta.
00:53:22.360 | Uh, basically it has a, a way to calculate it by using the maximum attention logit in the previous step.
00:53:31.360 | I'm not going to go into details about the math.
00:53:35.360 | You can take a look at the equations and try to understand the intuition a bit, but basically they're trying to have a soft cap on the
00:53:43.360 | values of the query multiplied by the key.
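A rough sketch of the rescaling idea as described here; the threshold value, the per-head bookkeeping, and the exact placement of the factors are assumptions on my part, not the report's exact recipe:

```python
def qk_clip_(W_q, W_k, max_logit, tau=100.0):
    # If the largest attention logit observed for this head exceeded tau,
    # shrink the query and key projections so their product is scaled by
    # eta = tau / max_logit; splitting the factor as sqrt(eta) on each
    # side keeps Q and K balanced. Applied after the optimizer update.
    if max_logit > tau:
        eta = tau / max_logit
        W_q.mul_(eta ** 0.5)   # assumes torch-style in-place tensor ops
        W_k.mul_(eta ** 0.5)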
00:53:47.360 | And they obviously did a lot of training and, uh, training experiments.
00:53:51.360 | And they showed that MuonClip can effectively prevent the explosions while maintaining downstream task performance.
00:53:58.360 | And this is, I think, the highlight of the Kimi K2 model.
00:54:02.360 | They show us the loss curve while training the model.
00:54:06.360 | So this very big model was trained on 15 trillion tokens.
00:54:09.360 | So this is a massive training run.
00:54:11.360 | And they used MuonClip,
00:54:12.360 | and they say that they didn't encounter any training spikes.
00:54:16.360 | And this demonstrates that MuonClip can be stable and robust for training state-of-the-art large language models.
00:54:25.360 | And even more, you can notice a second dip in, in loss at around 11 trillion tokens.
00:54:30.360 | And this is quite awesome.
00:54:34.360 | So yeah, the first contribution by the team is scaling Muon, by developing MuonClip and training the model on 15 trillion tokens in a stable and robust way.
00:54:44.360 | Uh, I had a question, um, with regards to the, um, uh, the soft cap logits, uh, technique that they addressed.
00:54:55.360 | Is that, uh, what's the relationship to that with soft max or are they the, essentially the same thing?
00:55:01.360 | Are they describing the same operation?
00:55:03.360 | I don't, no, I don't think it's related to soft max.
00:55:07.360 | I think they're just trying to, uh, put a limit on, on the value of the query.
00:55:13.360 | And key because they noticed that during the training, the, the, the value of multiplying query and key kept going up and up and up.
00:55:21.360 | And this exceeded the, the range that can be represented by BF 16, uh, values.
00:55:28.360 | I don't think it's related to soft max.
00:55:30.360 | It's just like a way for them to, uh, control the values without explosion.
00:55:35.360 | I think.
00:55:36.360 | Okay, cool.
00:55:37.360 | Yeah.
00:55:38.360 | Good.
00:55:38.360 | Thank you.
00:55:39.360 | Okay.
00:55:41.360 | Okay.
00:55:42.360 | So yeah, this was their first contribution, MuonClip, which allowed them to do a very successful pre-training run.
00:55:49.360 | The other contributions are related to the RL side of things.
00:55:53.360 | And they did two contributions.
00:55:55.360 | The first one is they generated a large scale agentic data, agentic synthetic data set.
00:56:01.360 | So basically how do we teach the model sophisticated tool, tool use capabilities?
00:56:05.360 | And they, their approach was to develop a very comprehensive pipeline that generates synthetic data to train the model.
00:56:11.360 | So they start by evolving hundreds of domains with thousands of tools, and "evolving" here comes, I think, from the Evol-Instruct algorithm by the WizardLM team.
00:56:23.360 | So you're trying to create more and more domains and trying to make them more complex, uh, starting with a, I think fixed seed of domains.
00:56:29.360 | And then once you have these domains, you can generate hundreds of agents with diverse tool sets, and then you can create a simulated environment and user agents to interact with these agents.
00:56:42.360 | And I, I think user agents here just means agents that act as the user.
00:56:46.360 | So submitting queries and questions.
00:56:48.360 | And then they use tasks that are rubric based, basically tasks that are easy to gauge or easy to judge.
00:56:55.360 | And they use this to, uh, basically judge and evaluate the generated data set.
00:57:03.360 | And then they start the simulations.
00:57:06.360 | They run multi-turn tool use scenarios with their agents, uh, interacting with the environment and the user agents.
00:57:11.360 | And then they use LLM as a judge to evaluate the results and then use this, uh, judgment by the LLM to filter for high quality training data.
00:57:21.360 | Uh, this is a diagram from their blog post as well.
00:57:24.360 | So basically you start with a list of domains that have been evolved.
00:57:29.360 | You evolve the domains to generate tools, and then maybe use some MCP tools and then give these tools to agents, which interacts with, uh, user agents.
00:57:40.360 | And the environment.
00:57:41.360 | And then you have these tasks that you can evaluate easily, uh, and then use the LLM as a judge to filter and continue the generation and evolve more and more.
00:57:51.360 | So this is quite a, uh, highly iterative loop that allows you to generate high quality data.
00:57:57.360 | So basically eval driven training, I think, or I should say eval driven, uh, synthetic data generation.
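As a rough illustration of the filter step described above, here is a minimal sketch of rubric-based LLM-as-a-judge filtering. The `llm_judge` callable, the rubric format, and the score threshold are all hypothetical, inferred from the blog post rather than taken from released code.

```python
# Hypothetical sketch of rubric-based filtering with an LLM judge.
# `llm_judge` is assumed to be a callable that takes a prompt string and
# returns the judge model's text response.

def judge_rollout(llm_judge, rollout: str, rubric: list[str], threshold: float = 0.8) -> bool:
    """Score a simulated multi-turn tool-use rollout against a task rubric
    and decide whether to keep it as training data."""
    prompt = (
        "You are grading an agent's tool-use transcript.\n"
        f"Transcript:\n{rollout}\n\n"
        "Rubric (score each criterion 0 or 1):\n"
        + "\n".join(f"- {c}" for c in rubric)
        + "\nReply with the scores as a comma-separated list."
    )
    scores = [float(s) for s in llm_judge(prompt).split(",")]
    return sum(scores) / len(scores) >= threshold


def filter_rollouts(llm_judge, rollouts, rubrics):
    """Keep only the rollouts that pass the rubric check."""
    return [r for r, rub in zip(rollouts, rubrics) if judge_rollout(llm_judge, r, rub)]
```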
00:58:07.360 | Uh, so this is the first approach regarding reinforcement learning.
00:58:10.360 | The second one is general reinforcement learning.
00:58:14.360 | So we already have RL for, for verifiable tasks.
00:58:17.360 | Uh, thanks to DeepSeek R1.
00:58:19.360 | They basically trained the model on code and math where the model, uh, generates a solution.
00:58:25.360 | And then they, uh, run unit tests or, uh, evaluate the math solution and use this as a signal to the model.
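For the coding case, the verifiable reward can be as simple as running the unit tests and checking the exit code. This is a generic sketch of that idea, not the team's actual harness; the file layout and timeout are assumptions.

```python
import os
import subprocess
import tempfile

def code_reward(solution_code: str, test_code: str, timeout_s: int = 30) -> float:
    """Verifiable reward: 1.0 if the generated solution passes its unit tests, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w") as f:
            # Concatenate the model's solution with the tests into one script.
            f.write(solution_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                ["python", path], capture_output=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0
```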
00:58:34.360 | Now the key challenge is how can we expand RL into other types of tasks?
00:58:38.360 | Uh, especially the ones that are not easily verifiable.
00:58:41.360 | Examples of verifiable tasks are like math and competition coding.
00:58:45.360 | While writing a research board or a summarization is viewed as a non-verifiable.
00:58:50.360 | So they created a general RL system using a self judge mechanism.
00:58:55.360 | Basically they use the model to act as its own critic, uh, providing scalable rubric based feedback for non-verifiable tasks.
00:59:03.360 | And one cool, uh, detail here is that they use the on policy rollout, uh, with verifiable rewards to continuously update the critic.
00:59:12.360 | So basically we have some tasks that are verifiable.
00:59:16.360 | You can track the performance of the model on these verifiable rewards and see if the model is still getting better and better.
00:59:26.360 | And this makes sure that the critic model keeps improving, uh, its evaluation accuracy.
00:59:32.360 | So this can be viewed as a way of using verifiable rewards to improve the estimation of non-verifiable rewards.
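Here is how I read that loop, written as a conceptual sketch. The `policy` and `critic` interfaces (`generate`, `check`, `update`, `score`, `rl_update`) are hypothetical placeholders, since the team only released weights and inference code.

```python
# Conceptual sketch of the self-judging RL loop, as inferred from the blog post.
# All object interfaces here are hypothetical placeholders.

def general_rl_step(policy, critic, verifiable_tasks, open_ended_tasks):
    # 1) On-policy rollouts on verifiable tasks (math/code) give objective rewards.
    verifiable_rollouts = [policy.generate(t.prompt) for t in verifiable_tasks]
    true_rewards = [t.check(r) for t, r in zip(verifiable_tasks, verifiable_rollouts)]

    # 2) Use those objective outcomes to keep the self-judge calibrated:
    #    the critic is trained so its scores track the verifiable rewards.
    critic.update(verifiable_rollouts, true_rewards)

    # 3) For non-verifiable tasks, the model-as-critic scores rollouts against a rubric.
    open_rollouts = [policy.generate(t.prompt) for t in open_ended_tasks]
    judged_rewards = [critic.score(r, t.rubric) for t, r in zip(open_ended_tasks, open_rollouts)]

    # 4) Update the policy on both reward streams.
    policy.rl_update(verifiable_rollouts + open_rollouts, true_rewards + judged_rewards)
```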
00:59:39.360 | And these are like the three main contributions by the team.
00:59:44.360 | Uh, the first one is the MuonClip optimizer, which allowed them to train on up to 15 trillion tokens and achieve, uh, lower loss while keeping the training stable and robust.
00:59:56.360 | And the second one is synthetic data generation for tool use.
01:00:00.360 | And the third one is to generalize the RL into non-verifiable rewards.
01:00:05.360 | Uh, so I think that's, that's all I have to share.
01:00:09.360 | Uh, this is all the information extracted from the blog post.
01:00:12.360 | Just one fun fact here regarding the Muon optimizer.
01:00:15.360 | Uh, this is Keller Jordan, the main author behind the optimizer.
01:00:20.360 | And he said, when he was interviewing with OpenAI and xAI, the xAI team told him that they didn't think the idea of developing Muon was going to work.
01:00:29.360 | Uh, so I think, yeah, sometimes you, you're like, you get rejected until you can prove that the algorithm scales.
01:00:36.360 | Um, these are like some references.
01:00:38.360 | I used the blog post and the tweet of Sebastian Raschka and the translated post from one of the engineers about the reasoning for modifying the architecture of the model.
01:00:49.360 | So, yeah, that's, that's all I have to share with you guys.
01:00:53.360 | If, if anyone has any questions, please go ahead.
01:01:05.360 | Um, yeah, I asked it in chat, but, uh, they mentioned that they have the, the, the sort of, uh, self-judging RL framework.
01:01:13.360 | Did they, did they like release the code for that anywhere or, um, uh, was it just like something you found in a blog?
01:01:20.360 | Uh, I don't think they shared anything besides the weights and the inference code, uh, unfortunately.
01:01:25.360 | So yeah, this is just inferred from the blog.
01:01:28.360 | Sure.
01:01:29.360 | I think that's it for time.
01:01:42.360 | Amazing.
01:01:43.360 | We actually covered Kimi.
01:01:44.360 | I didn't, I just put it in there because I was like, maybe there's not enough on, on just Muon.
01:01:49.360 | Um, fun facts.
01:01:53.360 | Kimi is now powering all the LLM responses inside of the Discord.
01:01:57.360 | And so whenever you enter a link, you see a summary from, um, the bot, and it's, it's Kimi.
01:02:04.360 | Cause Gemini is.
01:02:06.360 | Yeah.
01:02:07.360 | She's on the free tier on OpenRouter.
01:02:09.360 | So figured we'd give her a shot.
01:02:10.360 | Yeah.
01:02:11.360 | Yeah.
01:02:12.360 | Yeah.
01:02:13.360 | Yeah.
01:02:14.360 | I think we have a thread on the discord server about which providers are, are serving Kimi
01:02:18.360 | K2.
01:02:19.360 | Yeah.
01:02:20.360 | I mean, you know, given that it's free tier, we're not using any tokens.
01:02:24.360 | We might as well just like enable that bot to chat more.
01:02:28.360 | Um, cause right now it only summarizes links, but, uh, it's pretty useful.
01:02:32.360 | Yeah.
01:02:33.360 | You can add her like she'll, she'll talk.
01:02:35.360 | I'll see.
01:02:36.360 | I'll double check.
01:02:37.360 | See if she's got, um, I don't know if she's got, I think I should be able to, I'll double,
01:02:42.360 | I'll double check her chat model, but yeah, you can add her and she'll chat.
01:02:45.360 | Yeah.
01:02:46.360 | Cool.
01:02:47.360 | All right.
01:02:48.360 | Okay.
01:02:49.360 | Uh, lots of people need to go.
01:02:50.360 | I need to go.
01:02:51.360 | Uh, thank you everyone.
01:02:52.360 | Thanks.
01:02:53.360 | I'm glad.
01:02:54.360 | Um, yeah.
01:02:55.360 | People want your, uh, your write up.
01:02:56.360 | Yeah.
01:02:57.360 | I'll share it in the discord server.
01:02:58.360 | I mean, like a few minutes.
01:03:00.360 | Okay.
01:03:01.360 | Bye everyone.
01:03:02.360 | Thank you so much.
01:03:03.360 | Thanks RJ.
01:03:04.360 | Okay.