I actually want to share my entire screen for a second. So Magistral is kind of Mistral's reasoning model. The interesting thing they do here is they don't want to do any distillation. Basically, they want to see how far can they get with like native RL without distilling from a big model.
And then they also do a little bit of distillation. So they train one, which is Magistral Medium. It's trained on Mistral Medium as the base model and they just do pure RL and they kind of talk about how they do it, how they get the data set. The very interesting thing here is this is like a very large scale RL run.
So what we don't see is how do you actually efficiently do large scale RL? So what are the challenges in that? And how do you set up your compute and data center to do this? Because with RL, you need like two instances of the model. You need to generate rollouts of your thinking.
You need to verify them, which takes some time. The rollouts will have different lengths, right? When you have a rollout that's like 3,000 tokens versus like 30,000, you have a delta as they're being generated. Do you just waste time and waste GPU hours? And then as you batch up and do your gradient updates, as the weights change, how do you resync all this stuff?
So Mistral is like, you know, we actually have to be a real research lab since we've raised billions of dollars. Uh, so let's do some research instead of just basic shit. And then they kind of outline like, okay, here's how we do a really, really good training setup for all this stuff.
So that's Magistral. The other paper is SmolLM3. SmolLM3 just came out yesterday, I think, from Hugging Face. It's kind of the opposite. It's OLMo-style, fully open, where they give you everything: training code, full repo, datasets, model weights, checkpoints, base model, and instruct model. And they kind of talk about distillation.
So their dataset is distilled, but it's a small model that they get to reason. And then of course they have their fun charts, you know, they're always in the top left or top right corner where they want to be. And the interesting thing here is two cool takeaways.
One is this is a paper that talks about hybrid reasoning. So they have thinking and non-thinking mode. And then there is a section on the bottom, which I thought was interesting where they talk about model merging. You don't see much about model merging, but, uh, spoiler when they, when they did their model merging, what did they do this for again?
So they, they have this kind of preference optimization for reasoning and non-reasoning. And then as they start to do this training, uh, turns out that they kind of nerf out the long context capability. So what they did is they took some of this reasoning, non-reasoning checkpoint, and then they took some of the long context checkpoint that didn't get screwed.
And then boom, they just combined them together, and now both problems are solved, which is pretty crazy. I didn't know model merging could just do shit like that, but that's pretty cool. Um, so they do that, but then they talk about everything that goes into it.
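To make the model-merging idea concrete, here's a minimal sketch assuming simple linear weight interpolation between two checkpoints of the same architecture; the actual SmolLM3 merge recipe and mixing ratio may differ, and the 0.5 here is purely illustrative.

```python
import torch

def merge_checkpoints(state_dict_a, state_dict_b, alpha=0.5):
    """Linearly interpolate two checkpoints of the same architecture.

    alpha=0.5 is an arbitrary illustrative mix; SmolLM3's actual merge
    recipe and weights may differ.
    """
    merged = {}
    for name, weight_a in state_dict_a.items():
        weight_b = state_dict_b[name]
        merged[name] = alpha * weight_a + (1.0 - alpha) * weight_b
    return merged

# e.g. combine the reasoning/APO checkpoint with the long-context checkpoint:
# merged = merge_checkpoints(apo_model.state_dict(), long_ctx_model.state_dict())
```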
Um, they also do their pre-training and post-training, but they use a lot of other datasets. Some stuff that I found very weird here: when they generate synthetic SFT data, so they're generating reasoning datasets, they use Qwen3 32B in reasoning mode. If anyone has intuition as to why they would do this instead of (my Zoom is overlaying my tabs) instead of using the big Qwen3, I would be curious to know, because in Qwen3 we also have the big 235B.
So why not use the 235B with 22B active? They do use some DeepSeek, but interestingly enough, in their synthetic data gen they use the 32B. But both very good papers. Maybe the answer is compute, because this is actually a pretty expensive paper. In their blog post, they talk about how many GPU hours this is.
So this is pretty chunky. This is not a little research preview. This model was trained on 48 nodes of eight H100s for 24 days. That's about 220k GPU hours. If you assume like $2 to $3 per GPU hour, that's roughly $450k to $660k just for that run.
Not to mention a bunch of ablations, testing, synthetic data gen, salaries, all the resources that go into this. And the training itself: this was trained on trillions of tokens, plus generating all those synthetic tokens. So the final train run alone is 220,000 H100 hours, and the whole effort easily pushes past a million dollars.
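Quick sanity check on that arithmetic (my numbers, based on the figures above):

```python
gpus = 48 * 8                    # 48 nodes x 8 H100s = 384 GPUs
hours = 24 * 24                  # 24 days of wall-clock time
gpu_hours = gpus * hours         # 221,184 ~= 220k GPU-hours
low, high = gpu_hours * 2, gpu_hours * 3   # at $2-3 per GPU-hour
print(gpu_hours, low, high)      # 221184, ~$442k, ~$663k for the final run alone
```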
So not, not cheap, not cheap. Um, but yeah, cool. It's, it's a small LM that's a hybrid reasoner. So this paper is about hybrid reasoning. Um, this paper is about native reasoning without distillation. Okay. I think I'll do this one first since it's, uh, there's more substance to it.
The other one you can kind of just read through. It's kind of a blog post, there's not much to it. Um, okay, I will get started. If there's any questions, anything, you know, pop in and just interrupt me whenever, it's not that deep. Okay. So Magistral: they really just lay out their RL pipeline.
Uh, the, one of them is open source. It's Apache 2, but they don't give us as much. They don't give us training data. They give us like numbers around how much data they use, but they obviously don't give it all to us. But you know, this is still a pretty good paper for Mistral.
Like I'm, I'm pretty, I think they're going in the right direction, learning how to do this from scratch as opposed to some of the other work, but you know, they are bigger. So what do we expect? Um, basically there's two things. So they want to show, oh, also the other takeaway of this is that RL on text alone works pretty well.
So it's still like all this stuff generalizes. They do good ablations in this. They test: if we just do RL on code, does it work on code and math? If we just do math, does it work on math and code? But can we do pure RL on text, without distillation, on a small model?
Yes, we can. It's interesting what we think of as small models, right? Because Mistral calls 24B a small model. Phi-4 calls 14B a small model (I guess they had a little 3B section in there with SFT), and Hugging Face calls 3 or 4B a small model.
Okay. Uh, let me make sure my screen share is sharing the right screen. It is. Okay. So, uh, basically they, they want to detail these few things. So one, you can do no distillation from existing models. Uh, they talk about the infrastructure, their design choices, which is kind of an interesting section, right?
You don't really get anyone giving you this much detail about it. Hugging face gave us a little, but honestly, this gave us a much better recipe. Um, then, uh, effective way to make it multilingual. So this was another interesting little tidbit where they noticed that originally, uh, normally the model would start reasoning or outputting its response in different languages.
Uh, interestingly enough, as much as Mistral is a French company, they didn't have French output. They had a lot of Chinese, Russian and stuff. But they can basically adjust their RL policy. They basically do GRPO with some modifications, and one of those modifications is giving a slight reward increase when the model keeps its reasoning and output in the same language as the prompt.
And then they even add this in the system prompt. It's like pretty interesting how you can just solve this problem of reasoning and outputting different languages by a little 0.1 increase in RL, uh, reward. Like it just shows how strong RL is, right? You just give slight, slight reward if you keep your language consistent and guess what?
Language is now consistent. Interesting note there. Other little takeaways I noticed before we go deep: towards the end of this paper, they noticed something about multimodal that we might not get time to cover. So, eating the multimodal free lunch: basically, all they did was train on a pre-trained model that's already multimodal, right?
So Mistral Small 3 and Mistral Medium 3 are multimodal models that have vision encoders, right? So they encode images into the same embedding latent space as text. Then during the RL, they're only trained on text data. So they're like, our intuition is that one might expect multimodal performance to degrade, right?
But no, actually, it didn't degrade; multimodal actually got even better. Testing on stuff like MMMU and MMMU-Pro, there's no regression in most benchmarks and notable improvements in some. So, since the models were natively multimodal, the reasoning sort of transferred over; the model transfers its extended thinking across all modalities.
Very interesting. And then this is part of their future work directions, some of the questions they want to work on later. Looking ahead, they're also excited to push the boundaries of RL across a whole range of applications, including tool use, integrated multimodality, and agents. So they want to do multimodal reasoning down the line, plus agents and tool use.
But interesting little takeaway, right? Without any multimodal data, they were able to get better multimodal performance just by having text reasoning, because it's all in the same latent space at the end of the day. Then of course, they release the small model weights as open source.
Then, prior to this, there was this kind of notion that for small models it's not worth doing RL: you would always want a big model and distill. They want to challenge that claim. So they want to see whether RL can improve on the distillation/SFT baseline for small models.
So basically they show that, yes, RL is very competitive, in some cases better, but harder to do. For those that don't stick around through the whole thing, I guess the other takeaway with this model is how much data filtration they do. A distinction between the two papers: SmolLM3 uses a lot of preexisting datasets.
Mistral basically parses through and data filters everything, everything. So I only have, um, purple highlighting for data set stuff, but basically for math and code, is this just math? Yeah. So for the, for the math data set, they start with 700,000 samples. They have like two stages of filtering we're getting to.
Getting rid of formatting stuff, they cut down 100,000. In the end with their difficulty filtering, they only have 5% of the data left. So they cut from 700,000 to 38,000. They cut 95% out. Same thing with code. Um, how much do they start with? I don't know how much they start with, but they cut a lot.
They only use 35k code samples. So Mistral basically like, I don't know if all this stuff transfers over like their claims or it's just, Hey, we can do really, really good data set filtration, but okay. Enough, enough background yapping. Let's get into these papers. Um, I don't really like covering benchmarks in my paper yaps.
Cause you know, you can read them; number always go up. Benchmarks are a scam. Interesting stuff compared to DeepSeek: they don't need to do a cold start. Cold start is basically, before your RL, you take reasoning chain-of-thought traces and do SFT, so the model starts to understand reasoning structure.
Then you do RL. They're like, forget that, we just do straight RL, and it works. Someone asks: isn't Qwen3 4B fully open source though? Wait. Yeah, Qwen is open, but we don't have training code. We don't have the dataset. We don't have all the ablations. There's a lot of that stuff missing.
Someone in chat: we need benchmarks shown for the models that are not shown, because they don't look favorable. Yes, I think that's interesting. That's my takeaway here as well: they look at 3Bs and 4Bs, but I want to see some more. I want to see 7Bs, 8Bs.
I want to see the other ones that aren't on here. I want to see the 0.5 Bs. Um, yeah, that's, that's always useful. Okay. So RL, infra, performance. This is basically all the sections. So guess what? RL is back. They do GRPO. Uh, GRPO for those that don't know is group relative policy optimization.
It's basically what DeepSeek said we can do instead of all the fancy reward modeling and value/critic model training, all that stuff. What we basically do is generate a bunch of samples per prompt and then reward based on what does best relative to the rest of the group.
So: use the average reward for multiple generations per prompt from the policy to compute a baseline for the advantage calculation. Generate a bunch of samples, group-relative, so what's relatively the better outcome in that group, and then reward that. Here's GRPO in fancy math definitions. I think it's pretty useful to screenshot this section into ChatGPT or Claude (I use GPT) and just have it walk you through the math formula.
It's kind of useful to follow through. All these are like pretty easy functions. Once you understand what they all do, there's KL divergence penalties, all that stuff. They kind of get rid of that. So here's what they did. Here are the modifications to GRPO that they do. One, get rid of KL divergence.
The KL divergence penalty constrains the policy from deviating too far from a reference policy, so you don't venture too far and just generate noise. They find that with GRPO the policy drifts substantially anyway, so they get rid of it, because the KL computation incurs compute cost that's unjustified, right?
So get rid of that, not for training quality but just for efficiency of GPU usage. Okay. Loss normalization to avoid length bias: they normalize the loss by generation length across the group, so you're not penalized for being too short or too long.
Makes sense. Advantage normalization, as you would expect. Okay. Relaxing the trust region upper bound: this is basically raising the upper clipping bound, again allowing for more variability and more exploration. They do a lot of that. Eliminating non-diverse groups: they filter out all groups with zero advantage.
Basically, if all answers in a group are correct or all are wrong, there's no signal to learn from, the advantage is zero, so they just filter those groups out. Then reward shaping during training, dah, dah, dah. Okay, so those are the four or five main changes to GRPO.
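Here's a rough sketch of the group-relative advantage plus those tweaks (no KL term, normalized advantages, dropping zero-advantage groups, a relaxed upper clip); this is my illustration of the idea, not Mistral's code, and the epsilon values are just plausible placeholders.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4):
    """rewards: (num_groups, group_size) scalar rewards, one per rollout.

    Returns normalized group-relative advantages and a mask of groups to keep.
    Groups where every rollout got the same reward (all right or all wrong)
    carry no signal, so they are dropped entirely.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    keep = std.squeeze(1) > eps                 # eliminate non-diverse groups
    adv = (rewards - mean) / (std + eps)        # advantage relative to the group
    return adv, keep

def grpo_loss(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.28):
    """Clipped policy-gradient loss with no KL penalty.

    All inputs are flattened per-token tensors of the same shape (advantages
    already broadcast to tokens). eps_high > eps_low is the relaxed upper
    trust-region bound; both values here are illustrative placeholders.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
    per_token = -torch.minimum(unclipped, clipped)
    return per_token.sum() / per_token.numel()  # normalize by token count to avoid length bias
```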
Then it comes to reward shaping. How do they do this? Okay, so we have formatting. Choosing the appropriate reward is crucial when you do RL, and we want our stuff to be verifiable, right? All the training they do is with verifiable rewards. So how do we verify that the output is correct?
One, we have think tags, and they make sure there's exactly one set of think tags, and then they look at the thinking in the middle. So your thinking should be between <think> and </think>, and the response must start with a think tag. For math, you must end your answer in a \boxed{} block.
For code, this follows the think tag: you must have one markdown code block with the language specified. And then, you know, if you don't do any of these, if your formatting is wrong, instant reward zero, you're cooked. Otherwise you get a slight reward bonus.
So you learn to use it. Okay, that's formatting. They give the whole system prompt and they talk about it quite a bit. Okay, correctness. For math correctness they use a rule-based verifier, and a reward of 0.9 is given if the answer is correct, making the total reward 1.0. As you would expect, this is the biggest portion of the reward, right?
If your answer is correct, you're good. If your answer is also in the right format, you get slightly more reward. If it's in the wrong format, you're completely cooked, straight to zero. Code correctness, same thing. They check if it compiles and runs with a timeout of 10 seconds.
They talk about this later in the dataset selection: they want code that has tests. If it doesn't have tests, they make tests. If it has tests that seem wrong, they fix the tests. Then they randomly select from the available tests to test against.
And then, you know, if it's correct, you get reward. Basic RL, right? So far we're at two things. Formatting is non-negotiable: if your formatting is wrong, your reward is directly zero; if it's correct, you get a slight reward. Answer correctness: if your answer is correct, you get the big reward.
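A toy version of that format-gate-plus-correctness reward for math could look like this; the 0.9 correctness value is from the paper as described above, while the 0.1 format bonus and the exact tag checks are my assumptions.

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    # Formatting gate: response starts with <think>, exactly one think block,
    # and a \boxed{...} final answer after it. Wrong format => reward 0.
    if not completion.lstrip().startswith("<think>"):
        return 0.0
    if completion.count("<think>") != 1 or completion.count("</think>") != 1:
        return 0.0
    boxed = re.search(r"\\boxed\{(.+?)\}", completion.split("</think>")[-1])
    if boxed is None:
        return 0.0

    reward = 0.1                                   # small bonus for correct format (assumed value)
    if boxed.group(1).strip() == gold_answer.strip():
        reward += 0.9                              # correctness dominates: total 1.0
    return reward
```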
That's kind of the main thing. Okay, length penalty. They have soft length penalties, and as you would expect, if you're way out of distribution on length, you're cooked. Okay, language consistency. This is one of those key points that they wanted to mention, right?
So, um, one of the, one of the main things they did here is we present a simple yet effective strategy to make the model multilingual, where both the chain of thought and the final response are in the user language. So, how do they do this? Um, duh, duh, duh, duh, duh.
Okay. So, a core design principle is for the model to reason in the same language as the user. They frequently observed outputs mixing English, Chinese, and Russian. They were coherent, but they're not what we want, right? They're undesirable. If I ask something in Chinese, I don't want it to think in Russian.
If I ask for English, I still want the thought traces, and I want to be able to read them, and I want them in English, right? I don't want them in Chinese. So, to prevent this, um, they translated their problems into different languages. So, they take 10% of the data set, translate it into French, Spanish, Italian, German, Chinese, Russian.
Then, when calculating the reward, they have a classifier that basically checks if all three parts are in the same language, the three parts being problem, thought, and answer. So: do your problem, thought, and answer match in language? They check this by normalizing the text and removing LaTeX and code blocks.
Then they run a fastText classifier. So they just check: are all three in the same language? If they are, you get like a 0.1 reward bonus, I think. Let's double check that. da, da, da. Classifier. Where is it? Yeah.
So, um, if the classifier indicates that all three parts are in the same language, so the problem, the thinking, and the answer, you get a little bit more reward. Similar to if your answer is boxed, you got a little bit more reward. Now, this is not as impactful as getting the right answer, but it is slight reward.
And with enough data, um, the model starts to understand that, you know, I'll get rewarded for staying consistent and that just solves the problem. So simple solution solves reasoning and thinking in the right language. Um, little note here, they only translated English problems. They didn't translate like French problems to English.
They didn't do a crazy mixture. Very simple solution: just take 10% of your English data and translate it to a few languages. The model generalizes that if it stays consistent, it gets rewarded, so now it stays consistent. Very straightforward, simple solution, but elegant, and it works. And this is kind of the power of RL, right?
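And the language-consistency bonus is conceptually just a classifier check over the three parts. A hedged sketch using fastText's public language-ID model (the 0.1 bonus follows the description above; the preprocessing is simplified and the model path is illustrative):

```python
import re
import fasttext

# lid.176.bin is fastText's public language-ID model; the path here is illustrative.
lang_id = fasttext.load_model("lid.176.bin")

def detect_lang(text: str) -> str:
    cleaned = re.sub(r"\$.*?\$", " ", text)            # strip inline LaTeX (real pipeline also strips code blocks)
    labels, _probs = lang_id.predict(cleaned.replace("\n", " "))
    return labels[0].replace("__label__", "")

def language_bonus(problem: str, thoughts: str, answer: str) -> float:
    langs = {detect_lang(problem), detect_lang(thoughts), detect_lang(answer)}
    return 0.1 if len(langs) == 1 else 0.0             # all three parts in the same language
```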
Little reward for doing what you want and you don't need to go crazy and it just works. If this was like traditional pre-training or SFT, you would have to do all the languages and all these mixtures and keep it consistent. And like, you know, traditional ML, you have to like have good distribution spread, but no, it's RL.
Just give a little reward for consistency and it generalizes to being consistent. RL very strong. Okay, system prompt. They note: does this cause a drop in reasoning performance, like it did for R1? No, it doesn't; performance stays consistent. I think they have this in section six in the ablations.
I'll, I'll bring it up later if you can remind me. Okay. Next section, system prompt. Um, they find that RL is quite sensitive to their system prompt. And then I think their system prompt is a little too overfit, but what do I know? I'm not mistral. So they add stuff like be as casual as you want.
This increases entropy, meaning it's allowed to explore a little bit more, and yeah, it improves exploration. Here is the system prompt. So, you know, your thinking process must follow the template: think your thoughts, then provide a concise summary of your reasoning and the final answer.
Oh, is it "be casual" or "be causal"? Oh, causal? I see. Um, okay. Wait, it's casual. It's casual, you're actually right, I just couldn't believe it. I think causal would be funnier. Casual just means be chill, bro.
But yeah, "be chill" made it reason better. It's interesting if that's what you want. And not even just that: they even remind it to, what is it, write your thoughts in the same language as the user. Right? So we thought RL was enough to get it to stick to one language in its reasoning.
No, we must also prompt it again. Uh, you know, keep your thoughts in the same language and then yeah, be casual. That's cool. So, okay. Um, a user will ask you to solve a task. You should first draft your thinking process, your inner monologue. It tells it that it's an inner monologue.
I think, uh, you know, in the long scheme of things on billions of tokens, do we need to remind it that thinking is an inner monologue? I think not, but I'm clearly not a prompt engineer like them. Uh, until you have drawn the final answer afterwards, write a summary of your thoughts.
So yeah, I think we should actually read the whole thing, I shouldn't skip through this. "You should use Markdown and LaTeX to format your response. Your thinking should be contained in think tags; your thoughts are a draft." Also, I thought this was interesting, I don't know if this is just European:
normally we see "and/or", not "or/and", but you know, this model has seen "or/and" much more. Interesting little stuff. "Be casual and as long as you want until you are confident to generate a correct answer." I thought this was a bit much, but what do I know?
Okay, after thinking. Yeah, I think the way they put "Problem:" in there and stuff, I find it a little weird. Shouldn't the model learn that the problem comes after the tag for the user role? I mean, this is just pretty common, I don't think it's crazy.
Most system prompts have something like this. I guess it didn't have to be "Problem:", it could be something else, but I don't know, this is what they do. The other paper uses ChatML, a more standard prompt format, but I didn't dwell as much there.
If anyone else has thoughts, let's discuss, or just take it to chat. Okay, sorry, I'm going to go a little quick because this is quite a long paper, and there's still fun stuff in here. Same system prompt for math and coding. The other interesting note here is that their reasoning training is only math and coding.
Um, yeah, that's, that's what it is. Um, but it's all text. Okay. So infrastructure, this is a very fun section that I don't think any other paper really covers. Here's how to do like very large scale, um, RL on a lot of GPUs. Sorry. I've clicked a citation. Let me come back.
Okay, RL. So they adopt distributed RL training similar to prior works (you should read those works if you're interested) that coordinates three types of workers. Trainers maintain a copy of the model weights and perform gradient updates. Generators do the rollouts: they use the latest policy to return completions with log-probs for the training prompts.
So basically generators are doing a bunch of rollouts for GRPO: here's, you know, 10 outputs per prompt. Generators are doing inference. Trainers keep the model weights and do the gradient updates. Verifiers are the ones verifying the output; these are the classifiers and checkers that look at the output, check language consistency, check if code compiles, check if tests pass, check if the answer is boxed.
There are verifiers as well. So these are kind of the three kinds of workers that they need all in sync. And this is not like little, you know, okay, I'm running this on like one node. I'm okay with inefficiency. This is like, I'm training on tens of billions of tokens.
And I don't want to waste like multiple, multiple nodes of, um, very expensive GPUs. So how do we optimize all this? It's very interesting. They're completely fine with stuff being like out of sync and shit just generalizes. Okay. So challenges with distributed RL. Generators are a significant part of the total compute and the part that's unique to online RL, right?
Online RL being that the model is kind of doing this self-play, right? It's generating and then it's being, it's being graded on the output. So you're actually doing a lot of inference. Generators are kind of like the inference boxes, right? Uh, their workload is highly heterogeneous and hard to predict as the distribution of sequence lengths is highly skewed, right?
Some outputs might be short and concise, some might be very long. And they do have stuff in the RL policy like a length penalty to keep things within a certain distribution, but you also want it to explore long lengths. Point being, these lengths are skewed, right?
So if you have something waiting on an extra 20,000 tokens to be generated, are you just stalling the generator, or are you throwing it into some batch? How are you verifying stuff as it comes in? One of the main constraints is to introduce no bias on sequence length, right?
You can't do this all sequentially because you'll have so much delay. So how do we do this? A competing goal is to update the generator weights as soon as possible, right? As soon as one inference is done, just do another one. We want the generators to be as on-policy as possible, but we also want them to operate without waiting for trainers.
So you don't want to wait, but you want to stay on-policy, which is kind of at odds, so there's this lag tradeoff. I think I should have drawn a diagram for this, but GG. So what they do is async generators. Basically, we process batches sequentially.
You start the generators on a batch, wait for all sequences to complete, update the model weights. No, sorry. This is not what they do. This is what you could do. And this is why it's slow. So what you would expect is, okay, basically you batch out all inferences. So do all your rollouts, wait for them all to complete, update all model weights on both the trainers and generators, and then repeat.
So here's kind of the lag. As the first generator is done, as your first rollout that short finishes, you're now waiting for, that's kind of sitting idle for the other rollouts to complete. Once they're all done, we send them all to verifiers to verify the output. Generators are all sitting idle.
And of course, the trainers are all just storing weights, doing nothing. Once the verification is done, then we do all the backprop and compute the weight updates. Then we update the trainers. After trainers are updated, we have to push those weights back to the generators, even more lag.
So we have to re-update the weights on the generators, then do all of this again, sequentially. Very, very bad. You have idle generators and low pipeline efficiency. This is fine for a little bit of post-training, it's not that deep. But for an entire train run, you're cooked; it's just not efficient.
They want to be efficient. So we operate generators continuously at maximum throughput without even waiting for trainers. Basically, this means you're always doing rollouts and you'll always gather small groups from generators, verify them, update trainers. After these updates, trainers send new weights via NCCL. This is like interconnect InfiniBand GPU to GPU weight updates and they don't even disregard in-flight sequences that are being generated.
So there's like this push of always keep rollouts happening and then always update those weights. Even if you're off policy and you're using old weights, just keep using them and you might be slightly off. Your KVs, your KV caches, so like your old previously generated tokens will be on old policy, but it's okay.
It doesn't matter. Just use it. It doesn't make much of a difference. I think everyone should like really read into this section quite a bit. If you care about distributed, like large-scale RL training, it's a very elegant solution that just works that they tried. Yeah. So they just kind of keep stuff always happening.
Here's kind of a, here's kind of a good diagram of what's happening here. So generators are rollouts, right? So you generate a bunch of sequences of different lengths. This one very long, this one very short. Do they open source the code? No, they don't. They don't open source shit.
They don't even tell us what data this is trained on. SmolLM3 does, but SmolLM3 doesn't do this same fancy stuff. Okay, and they actually don't do native RL from scratch, they're doing distillation, but that's not as relevant here. Anyway: long output, short output.
Generators just keep shitting out generations. Verifiers check completions when they can. Here are the batches that form, and you compute your backprop and your weight updates on the fly as soon as these little mini-batches get filled. Then over the interconnect (NCCL over InfiniBand), you update the generator weights.
And since the generators are still chunking out stuff, you might have sequences that are a little bit off-policy, from previous steps and previous weights, but it's okay. Just keep it running and it'll all work out in the end. A very efficient, fast pipeline, steps one through four. Let's read through this again, since I think this is interesting.
One: generators continuously output completions from prompts, so your generators are always doing rollouts. Two: when a completion is finished, it's sent to a verifier, which checks if it's boxed, checks the output, runs the code compile tests. Whether it passes or not, you do step three: send it to a batch for updates.
Each sequence is sent to a different data parallel group and all that. Four: you do gradient steps, so you do your GRPO and compute what's relatively best in the group. Then, and this is very interesting, weights are replaced mid-generation, which means in-flight generations continue with a slightly outdated KV cache that is not refreshed.
It just keeps going. Since the model resides on both trainers and generators, you basically have to double up the weights, right? Which is all very expensive. It's not just hold the weights and do next-token prediction; you need two sets of weights actually held in memory.
So, expensive GPU shit. But since you have the full weights on both your trainers and your generators, you can use NCCL, NVIDIA's GPU-to-GPU collective transfer over NVLink/InfiniBand, which is much faster than most other ways of moving data. You can use that, it's very fast, it works. Okay.
As a solution is generated for a single prompt, it may experience multiple updates from the weights. So even in a single prompt, as you're doing next token generation, you'll have multiple weight updates to that model reflecting latest time improvements. Very, very, very, very fast stuff. I think I'm too hyped on this training pipeline.
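To make the flow concrete, here's a heavily simplified, single-process sketch of that generator → verifier → trainer loop. The real system is distributed across many nodes with NCCL weight broadcasts, and `policy`, `compute_reward`, and `grpo_step` here are placeholders, not their actual APIs.

```python
import queue

completion_q = queue.Queue()   # finished rollouts from generators
verified_q = queue.Queue()     # scored rollouts waiting to be batched

def generator_loop(policy, prompts):
    # Generators never idle: keep sampling rollouts with whatever weights they
    # currently hold, even if those are a step or two stale.
    while True:
        completion_q.put(policy.generate(next(prompts)))

def verifier_loop(compute_reward):
    # Verifiers score completions as they arrive (format, boxed answer,
    # run code tests with a timeout, language check, ...).
    while True:
        rollout = completion_q.get()
        verified_q.put((rollout, compute_reward(rollout)))

def trainer_loop(trainer_model, generators, batch_size=256):
    while True:
        batch = [verified_q.get() for _ in range(batch_size)]
        trainer_model.grpo_step(batch)            # gradient update on the trainers
        # Broadcast new weights to generators (NCCL in the real setup) without
        # pausing them; in-flight sequences keep generating and simply pick up
        # the new weights with a slightly stale KV cache.
        for g in generators:
            g.load_weights(trainer_model.state_dict())

# In reality each loop runs on its own fleet of workers (threads/processes/nodes),
# not sequentially in one script.
```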
So I'll just continue in case people are getting lost. Yup. That's that section. I think if you care more, read section three of this paper, follow up in Discord. Okay. Data curation. They only want to use verifiable solutions. Oh, actually, I'll take a pause. Any questions on this stuff?
Or should we move on? Okay. No questions. I hope people followed. I like that section. Never mind. No question. Okay. Data curation. So data curation, we, they only want to use verifiable solution. Oh, people, sorry. Can I ask a quick question? Yes, yes, yes. Did they mention what the, like the sort of utilization level they got on their GPUs?
No, no, no. They don't talk about any of this. I think actually there's, I don't know if I'm blanking. There might be, uh, a note that I'm not remembering it, but I'm like 80% sure they don't. But, uh, it seems like the generators are a hundred percent utilization. Verifiers will never be full utilization.
And a lot of that doesn't have to even be on GPU. Cause you have to wait for sequences, even with your batching. Yeah. Um, trainers doing backprop. That's just efficiency of how well they updated this GRPO stuff. Um, but no, they, they don't talk about this level of efficiency.
Somebody in the chat also asked about the learning rate, and about open-source code for the distributed RL; two questions I think related to this section. Yeah, I answered the code one: no, Mistral doesn't give any of it, but the SmolLM3 release does. They give it all.
They give all the checkpoints, all the, not all the checkpoints. They give some checkpoints. They give base model, instruct model, training data, all that stuff. Good resource to learn a more distributed RL training. Yeah. I have a, I have a link of papers I would recommend. I'll, I'll share it in discord.
Just remind me cause I'll probably forget. But, um, that's the fun thing about this paper, but let's, let's just spend the next five minutes covering the rest of it pretty quick. Okay. So, um, they want verifiable stuff. They basically want math and code. Ooh, I didn't know I could do this.
Very fun. They do a lot of filtration; here's how they filter. This is something I brought up earlier: math started with 700,000 samples and they cut 95%. This is actually pretty interesting: they train a model with RL to help do the filtration. That's the TLDR of this. So, as you would expect, they get rid of the basic shit first, right?
Filter out stuff that has wrong answers. They do a little bit of rewriting: they want final answers that are verifiable with a rule-based system, so they filter out proof-based and multi-part problems where it's difficult to verify correctness. Get rid of shit that's hard to verify.
They reformulate multiple choice into statement-based problems for more robust verification. So some rewriting, then a two-stage filtration process. First, they want stuff that's Goldilocks difficulty, right? Not too easy, not too hard. So if Mistral Large can answer all of it, throw it away.
If it can't answer any of it, throw it away too. So: sample 16 solutions for each problem, removing the ones that are either never solved or solved with a high success rate. Then this initial set is used to train a 24B model with this RL pipeline, getting a small model that's better than Mistral Large at math, right?
So step one, use your best model to filter out the junk. Step two, do a bit of RL on a model to make it good at math. Then, in the second stage, use this model to once again attempt the stuff that was hard; answer everything, filter, just do inference, right?
Get rid of the easy stuff again. Then they filter out potentially incorrect problems where a majority of samples agree on the same final answer but it disagrees with the ground truth. So you can kind of bring back some stuff that looked too hard but actually wasn't. This two-stage approach is good because a single pass with the initial, weaker model would have been insufficient.
Its limited reasoning capabilities would have caused it to discard many difficult problems as unsolvable, but the new intermediate model can actually solve them. Kind of interesting, right? Use your big model, get rid of everything too easy and too hard, train a reasoning model, then use that reasoning model to filter again. You can bring back some hard stuff that your first model couldn't solve.
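A crude sketch of that difficulty filter; the 16 samples per problem match the description above, but the keep thresholds and the `model.solve` helper (and problem fields) are my own illustrative stand-ins.

```python
def difficulty_filter(problems, model, n_samples=16, min_solved=1, max_solved=12):
    """Keep problems in the Goldilocks zone: solved sometimes, but not always."""
    kept = []
    for p in problems:
        answers = [model.solve(p.question) for _ in range(n_samples)]
        n_correct = sum(a == p.gold_answer for a in answers)
        if min_solved <= n_correct <= max_solved:   # not impossible, not trivial
            kept.append(p)
    return kept

# Stage 1: filter with the strongest existing model (e.g. Mistral Large),
# then RL-train a 24B reasoner on what's left.
# Stage 2: re-run the same filter with that RL'd model, which can now solve
# (and therefore rescue) harder problems the first pass threw away.
```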
Now you have more data, but holy shit, they filtered a lot. They got rid of 95% of the stuff. Code, same beans, right? They want problems with a large number of correct tests. First, remove stuff without solutions and without enough tests. This was the interesting part.
Each solution is then tested, and they disregard tests with insufficient agreement. And yeah, these two paragraphs are a ton of work, right? You have to train a whole intermediate model. And I mean, you get rid of 95% of your data. Kind of crazy.
On code: okay, what do we do for tests with sufficient agreement but no successful solution? We assume the test is incorrect and update it. That's some next-level shit, right? I don't know if it's that next level, but basically: your test is wrong?
Fuck you, we'll rewrite the test. In cases where the code lacks tests, they just generate tests, if they're confident they can. This paragraph is also a good bit of work, right? Finally, where applicable, problem statements are duplicated into Python and C++ versions.
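For the code side, the verification logic is roughly "run candidate solutions against the tests and trust agreement." A hedged sketch, with the 10-second timeout from the description above and everything else (helper names, agreement threshold) made up for illustration:

```python
import subprocess

def run_solution(solution_file: str, test_input: str, timeout_s: int = 10) -> str | None:
    """Run a candidate solution on one test input; None on crash or timeout."""
    try:
        out = subprocess.run(
            ["python", solution_file], input=test_input,
            capture_output=True, text=True, timeout=timeout_s,
        )
        return out.stdout.strip() if out.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None

def keep_test(test, solutions, min_agreement=0.5):
    """Keep a test only if enough known-good solutions agree with its expected output."""
    outputs = [run_solution(s, test.input) for s in solutions]
    agree = sum(o == test.expected for o in outputs if o is not None)
    return agree / max(len(solutions), 1) >= min_agreement
```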
Then we end up with 35k code samples. Here's where they're like, hey, we can't tell you everything: they don't tell us how many they started with, just that they end with 35k. Okay, experimentation. That's the data section. I think it's pretty interesting; the math stuff is pretty cool.
Okay. How far can one get with pure RL given a strong teacher model? Very basic diagram, right? Data filtering for math: phase one, phase two, phase three. This is kind of useless for code. Then the training overview: from Mistral Medium 3, for Magistral Medium, it's just pure RL.
From Mistral Small 3, you do some SFT on rollouts, then you do RL, and you've got Magistral Small over here. Then here are their RL stages: when length plateaus, they increase the allowed completion length. This is more about the RL schedule itself, right? They do multiple stages: start with a low length limit, then add more length and more challenging data as you start to hit the limit.
So it's like similar to pre-training, post-training where the analogy you can think of is for context length. First you do the majority of your training at like 4K context, then you do higher context, then you do higher context. Or for post-training, you do the majority of your SFT on regular data.
Then you do a little bit of hard math and reasoning. Then you do SFT like the last 5% on all of your like super hard data. They do the same shit with RL. Um, duh, duh, duh, duh, duh. We, oh, for all their eval, um, they do temperature 0.7.
Top-p of one for math, which basically includes the whole output distribution. For GPQA, top-p of 0.95, which means you only sample from the smallest set of tokens that covers 95% of the probability mass. And a maximum generation length of, I think, 40k for the evals.
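For reference, in something like vLLM those eval settings would map to roughly this; the mapping is mine, and I'm not certain 40k is the exact max length they used.

```python
from vllm import SamplingParams

# Illustrative eval sampling settings based on the values discussed above.
math_params = SamplingParams(temperature=0.7, top_p=1.0, max_tokens=40_000)
gpqa_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=40_000)
```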
I guess this, maybe I read a little bit more into this multi-stage. I don't know if they talked about it. I thought they did, but oh, here they do, they do. Um, okay. Evals. I don't care about evals on this, uh, model evals are fake news. You should really care about your, um, your other, your other evals.
Okay, your actual system evals. Okay, training a model without the cold-start problem. This is what DeepSeek said: you need a cold start. Turns out that's fake news. Phi showed you can do it other ways, and this also shows you can skip it.
So, as model performance increases, they add harder data. Crazy. Harder data splits are constructed from more complicated data that was filtered out at an earlier stage. We saw that in the two-stage math filtering, right? Basically that second stage, where they use their smarter model to recover stuff that Mistral Large couldn't solve but the RL'd model could; that's what gets used in the second stage, or it just gets completely removed. Length doesn't stop growing: they increase the allowed completion length and the maximal completion length over time. This ties into their RL length penalty, the one for being too far out of distribution.
As training progresses, they start to, um, they start to increase this. This is all on magistral medium. So just pure RL from, uh, SFT model. So basically from a useful assistant that went from next token predictor to useful chat bot, um, pure RL, they, they start to add harder and harder data.
They start to allow for longer output. Um, generally as generation length increases, the memory associated with KV cache increases. So we scale down the total number of concurrent requests. Uh, this is something about RJ or like efficiency. They, they need to do better, um, better RL stuff, but they don't talk much about the training.
I think I'm just inferring a lot from their stuff. I could be completely off base and wrong, but that's my interpretation. Then the fun one, the Apache 2 model: RL with SFT bootstrapping. They do a cold start on SFT traces from Magistral Medium. So they take their big model,
which is a reasoner (they don't use DeepSeek), and they basically pass problems through it to get traces. They keep a mix of difficulties, dah dah dah, avoiding bias towards easier problems. So a mix of easy and hard, early chain of thought. Then they do SFT.
So: generate responses from the reasoning model on a diverse set of prompts. They use open datasets here, OpenThoughts and a code subset of OpenR1. This gives a reasoning dataset with mixed difficulty. They also include 10% general instruction-tuning data to preserve non-reasoning capabilities.
They do four epochs of training; they tell us how many samples somewhere here. Then they take this SFT checkpoint and do their RL thing, now that it kind of knows how to do this reasoning. Where'd it go? Temperature of this, dah, dah, dah.
Okay, benchmarks of bullshit. Benchmarks exist. Multilingual benchmarks: what is this? We see that the model performs lower on multilingual compared to English, probably because the reasoning is constrained to the user's language instead of English. Okay, and the gap is similar to that of the base model. I guess someone did bring up multilingual.
No, someone brought up multimodal, not multilingual. But yeah: multimodal is slightly better, which is good, and multilingual is slightly cooked. I don't want to spend much time on benchmarks, so I will skip this. I don't think many people use these models, but very good, very good paper. Sorry for skipping, but it's good.
The ablations are pretty fun. Cross-domain generalization: I talked about this. If you only do RL on math, can it do well on code? Yes, it can. If you only do RL on code, can it do well on math? Yes, it can. So baseline was 32 on math, 22 on code.
Only do RL on math. Guess what? Code goes up. Uh, baseline was, I'm cooked. Baseline was 32 on math. Only do RL on code. Guess what? Math goes up. Um, so shit kind of generalizes, but we only have two styles of reasoning verifiable data. So I don't know how much this matters, but it's cool.
I guess it matters, I disagree, it matters. What else? Mistral Small 3 with pure RL achieves similar performance to the distilled version, which suggests the benefits of RL are not exclusive to larger base models. This is one of those four points that they make: they contribute insights that challenge existing claims about whether RL can improve upon a distillation/SFT baseline.
So if you remember, what DeepSeek also put out was distills. They used like actual logit-based distillation losses and they distilled with SFT; they distilled the models on reasoning rollouts with SFT, and it had very, very good benchmark numbers. Mistral is like, nah, fuck that.
Our RL actually works pretty well too. And to show that, they did a Mistral Small 3 with pure RL. Everything we talked about here was Mistral Small 3 with SFT as a cold start, then RL. But they also trained one that's pure RL, and it's pretty good.
Um, actually, yeah, I think you're right. I don't think DeepSeek's was logit-based distillation; I don't think they could do that since the students aren't the same model family. It was just SFT-based distillation, my bad. They're probably talking about the Qwen logit-based distillation.
So same point, just Qwen. DeepSeek was just SFT on rollouts, but Qwen should be logit-based because the Qwen distills are the same family. But I could also be wrong there. Okay, batching stuff, skip it, not enough time, but if you're interested, read it. Analysis: increasing completion length is the main lever that improves performance.
This is kind of interesting, right? So you must have longer, uh, completion length over time. Uh, the multimodal thing that I talked about for those that missed it was very interesting. That kind of, is this, did I skip this free lunch? Free lunch on multimodal. Um, yeah, that was interesting.
Uh, low-dimensional space, more fine-grained rewards on code tasks, completion, dah, dah, dah. Okay, a length direction: they do some PCA on outputs and find a direction along which length wants to go up, which is good. I think you can read that on your own. Multimodal, I'll bring this up again.
So models are natively multimodal. The base models that they train on, they have a vision encoder that encodes images in the same latent space. Reasoning happens only on text, but it's all in the same latent space. So one might expect that the performance of multimodal would degrade, but no, not only does it remain good, but it actually gets better.
So MMMU, MMMU-Pro, and the other vision benchmarks all go up. Very, very cool stuff. What other impacts? Function calling: the model maintains and even improves tool calling and instruction following. Partial rewards for code data, this was an interesting one, under unsuccessful approaches. Basically you need verification on the output, right?
So they thought, okay, what if, uh, we do partial stuff. So what if we can't verify the output and we want to give it a slight reward bonus for stuff where it's on the right path, but it gets the wrong output. So basically if you start thinking, uh, you fuck up at the end, but you're on the right path, can we give you a little bit of reward?
No, we can't because that starts to give it false signals. Basically like, let's say you're thinking about how to answer a problem, like a math question, and you have good reasoning traces and you're following the stuff. But if you like, forget a theorem, right? Like let's say you forget central limit theorem and you just go down this complete opposite path, or you forget you can't divide by zero, or you forget like how to plot, how to plot an exponential and you're, you're doing everything right.
You're thinking, you're reasoning, but you're on the completely wrong path. You're giving false signals on incorrect solutions, and yeah, it's actually worse. So much for little proportional rewards, which would have been great; that would give us a bunch more training data, right?
Cause we already have to cut 95% of our data, but it turns out this didn't work. I'm a little bit more bullish on this. I think that their approach to GRPO and how much reward they gave and the filtration, like, I think you need a better verifier on this and you can squeeze out some more performance, right?
But then you need a good verifier for it. So I think there's still stuff here; I think in the future we'll have partial output verification. You can quote me on this, but you need very good verifiers. But like, it's the right approach, right?
You basically gave it a false signal, you're cooked. Entropy targeting: you want it to stay exploratory. RL over open-source reasoning traces: they use OpenThoughts and the code subset of OpenR1, and both improve generalization, da, da, da. Applying RL on top of an SFT checkpoint is good.
Good. Good. Good. You can do pure RL without SFT. What else? Conclusion. Okay. Uh, we look forward to the next research. What loss and optimization algorithms are most appropriate? How much gain can be unlocked by bootstrapping with its own reasoning traces or how to scale the next generation, next order of magnitude of compute?
How do we scale this up? So basically I really liked that section on infrastructure. How do we scale this up another order of magnitude? That's another follow-up, another open question. Then the fun one: they want to push the boundaries of RL across a whole range of applications.
They want to do RL with tool use, integrated multimodality, and agents. Multimodality: think about verification of images and questions like that. And then agents; agents are the future everyone's hyped about. So that's the conclusion. Damn, I really want to go over the second paper, but I was too hyped on this one.
Okay, any questions? What questions do we have? Someone interrupt me with their questions, otherwise I'm going to do a four-minute injustice to SmolLM3. No questions? Okay, SmolLM3. This came out of Hugging Face yesterday.
Chonky model, chunky dataset. But you know, very good, I must shout out Hugging Face, it's actual research. They did a lot of experiments, ablations, a roughly million-dollar training effort. They train a 3B that's a hybrid reasoner; the first release that really shows us how to do good hybrid reasoning.
Pre-training, post-training, it's all there. 11 trillion tokens. SOTA at 3B scale, competitive with 4Bs. I used to think this was stupid, just run a 4B, but I guess you're shaving off a lot of parameters, like 30% of the weights, which is important for edge stuff. But who really uses edge stuff?
Instruct model with dual-mode reasoning, that's new. Basically, to do this they train on a mixture of thinking and no-thinking data. (Is there a good paper on agentic behavior RL? I don't know.) Multilingual, long context. Great paper, great paper. Well, not really a paper, it's just a blog post.
I expect better. Grouped-query attention, NoPE, intra-document masking. Okay, forget all that, here are their fun nodes: they use a whole lot of GPUs, 48 nodes of H100s for 220,000 GPU hours. Distributed... oh no, I've opened stuff. Yeah, distributed data parallel and tensor parallel, checkpoint saving.
Okay. Uh, data mixture three phases of training. First phase, all web, then more math and code. Then even more math and code. Crazy, crazy. Uh, what data sets go into here. They talk about it. Uh, mid training, this concept of mid training context length extension. I think the more fun stuff is reasoning.
That's when reasoning comes in, during mid-training. So after extending context length, they do reasoning mid-training. They use other datasets here: OpenThoughts and the NVIDIA Nemotron post-training reasoning data (that one is distilled from R1, and OpenThoughts largely is as well), about 35 billion tokens. They use the ChatML template.
Da, da, da, da. Post-training, mid-training, post-training. APO. APO was interesting. Building the chat template, doing SFT. Okay, I don't have time for this paper, it's too long. Model merging: I talked about model merging, this was a fun one. APO is kind of a better DPO.
It's a play on DPO that gets them good reasoning preference training, but it nerfs long context, so then they model merge with a long-context checkpoint to bring that back. Performance, whatever; but I want to talk about this dual instruct/reasoning mode, that's very interesting. I think I shouldn't do this paper harm by covering it in one minute.
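Since we won't get deep into it, here's a hedged sketch of what dual-mode prompting generally looks like in this family of models. I haven't verified the exact SmolLM3 template; the /think and /no_think flags and the ChatML framing are assumptions, so check the model card for the real convention.

```python
def build_prompt(user_msg: str, thinking: bool) -> str:
    # Hypothetical ChatML-style template with a reasoning toggle in the system turn.
    mode_flag = "/think" if thinking else "/no_think"
    return (
        f"<|im_start|>system\nYou are a helpful assistant. {mode_flag}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_prompt("What is 17 * 24?", thinking=True))
```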
Maybe we get someone from Hugging Face to do it next time, or we just do this one. It's kind of short, but yeah, you should read this. Good paper, good paper. Ooh, a SmolLM podcast episode. Never mind, why do paper club when we have the podcast? Okay, I guess they're coming on the podcast.
Someone says paper club is still important for learning. Okay, well, maybe next time I'll do 20 minutes on this, or we'll have someone join us and talk about it. TBD. Okay, I hope my ramble got across why I thought this infrastructure work was cool. It's very interesting to me, actually, and I'm not much of an infra guy.
I don't really care about this stuff normally. But it's different when you do pre- and post-training and you just have GPU go brrr. This is not GPU go brrr. This is: oh shit, we'll have a lot of overhang and dead-GPU inefficiency if we do everything sequentially.
So here's how they do it. Next time we can peek into the training data. Yeah, I think training data is one aspect, but these datasets are open and people have already done a lot of exploratory analysis and breakdowns of them. The fun thing we can actually dig into is the training code.
So, you know, all this stuff is open. I will probably, oh no, we're cooked. I'll probably just throw all this into Claude Code and we can yap around with it if we want. But yeah, I don't want to keep you guys for too long.
If there are any fun questions or anything, let me know. Someone says: congrats on a fantastic presentation, always love these deep dives, especially when it's a really good paper. I actually didn't read it; I don't know why I didn't read it.
I was just busy that day, so thank you for highlighting it. Okay, thanks guys, see you next week. Thanks everyone. I need volunteers. Volunteer, volunteer, volunteer. If someone wants to volunteer for a paper, let us know; potentially SmolLM3 next week, but anytime in the future, if anyone wants to volunteer, let me know.
Grok 4. Grok 4 will not be a paper. For, for, for fun context, for those that don't know, I, I used to run a very aggressive shitposting alt on Twitter. And, um, when Grok 3 came out, I poked, I poked the bear. I really shat on their charts. I got into Twitter beef.
Elon got involved, and then the next day my account was terminated. So if you shitpost on Twitter, don't go after Grok. I learned my lesson, but yeah, Grok 4 is today. So interestingly enough, they open sourced Grok 1 when Grok 2 came out, but they didn't open source Grok 2 when Grok 3 came out.
Maybe they open source Grok 3. Who knows? Um, we'll maybe do a watch party. Uh, who knows? Paper highlights and Zotero? No, I'm not a Zotero guy. I don't know how it works. Too hard, too hard for me, but I'll share my paper highlights and maybe someone can throw in Zotero.
Um, for Timeless Paper Club, I think I'll share a post later this week. I have a bunch of topics papers. There's some domains where I'll cover them. If anyone wants to volunteer, that would be useful too. So, um, some stuff, like if you think you're good, like if you know diffusion pretty well, if you know optimizers, if you know, like inference optimization, like flash attention, if there's shit that you're passionate about, if there's stuff that you want accountability to learn and, and read over, maybe we do it together or you take it.
And, uh, yeah, that'll be fun. I'll give you a list of papers, or, you know, feel free to add some, and then we'll go through a second paper club. Uh, launch event for OpenAI's open source model next week? Yeah, someone said that it's next week, but I heard a date that's slightly later.
Where is this? Where is this new open source model? So someone has said that it comes out, um, next Thursday. I heard later, but we'll see. We'll see. Um, I dunno, maybe we'll do an event. Maybe, maybe we won't. It's not dev day in November. It's, it's this month, but maybe not next week.
Okay. Anyway, enough yap. Thanks for listening to my yap guys. Two weeks. Okay. Not next week, two weeks, but we'll see.