Back to Index

The Phi-4 Reasoning Technical Report — Vibhu Sapra


Transcript

Let me also pull up chat real quick. But for those that don't know, there's the Phi series of models, right? So this started out way back when Microsoft was trying TinyStories. TinyStories was like a one million to a few million parameter model, so not even a billion, not even tens of millions of parameters.

They just wanted to see, can we make a really small one million parameter model learn text, and can it produce any coherent words? Turns out it can. Then they had Textbooks Are All You Need, the idea that if you train on textbook-quality data, you can get some level of coherent language.

Then they had this Phi series. These were really, really small language models, so like up to 1B. Then we went from super small, like million parameter, to billion, to now we're at Phi-4, which is a 14 billion parameter model. So they've gotten chunky. I think the sweet spot they were hitting was kind of that 3B range where no one else was really doing this, right?

Like we had Llama, which was 7B, we had Falcon, we had Mistral. These are all 7B models. At the time, Qwen wasn't really doing anything tiny, tiny. So these were like the on-device, coherent, somewhat decent models. The history of the Phi models is that they would do well on benchmarks.

And then people would say, damn, these models suck. They were accused of training on the benchmarks and inflating their scores. So Phi-3 was the last one that had a pretty big splash. Phi-3 was like, okay, here's a whole section on how we're doing decontamination of training data.

So they basically said, hey, we're serious about this. We're not going to train on benchmarks. Here's how we filter all our data. Here's how we train our big foundation models. They were kind of decent. Some people used them. Then recently we had Phi-4; that paper came out in December 2024.

From there, as of this week, we've got two more papers: Phi-4 reasoning, Phi-4 mini reasoning, and Phi-4 reasoning plus. So let's do a quick little refresher on Phi-4, the one that came out in December, because it's the foundation model that the other two build on.

So if anyone has comments or questions, feel free to pop in wherever. I feel like we covered this in one of the earlier sessions, but anyway, we'll go over it again real quick. So Phi-4 is a 14 billion parameter model, and they do a bunch of high quality filtering.

They use LLMs to filter a bunch of stuff, and then they have a bunch of synthetic data. They want to show how distillation can do pretty well. So they train on high quality data. They have the same architecture as Phi-3, but now they're focused on reasoning benchmarks.

So this came out pre reasoning models, but this is the time when we started to realize, oh shit, reasoning data is pretty high quality, right? A bunch of the training data here is synthetic. They have stuff like multi-agent prompting, self-revision workflows, instruction reversal.

They have a section on rejection sampling where you take examples, you have chain of thought, you see what was wrong, and you keep only the good ones. They do DPO. They start to introduce this concept of mid-training here; I think they're the ones that really start to push on it.

And then of course, in their new thinking models, mid-training is like their whole thing, mid-training is all you need. But this one was basically: synthetic data, SFT, DPO, and we can get pretty good models. Here's kind of the benchmarks of how it performed at the time.

So, as they always claim, it's kind of at 4o mini level, similar to the 70B models, similar to 4o. This is a little old and kind of out of date now, but still good to know where the overall thing sits. So step zero is this data generation: they have this whole method of seeding data, expanding it out, post-training, stuff like that.

And then performance is on par with even the 405B in places, but they've been known to not have the best track record of how they actually perform off the benchmarks. But from what people say, Phi-4 is where things start to turn around, especially because they have this whole decontamination setup.

So they really kick off the paper by addressing overfitting and data contamination, where one of their original pitfalls was that their models had learned to overfit. So they improved their data decontamination process. Basically, they have good ways to take benchmark-adjacent stuff out of data sets. It's what you would expect: they can use LLMs to filter this out, and they avoid training on variants of benchmark data.

They have contamination-proof benchmarks that they're using. Okay, what else have we got here? The purpose of synthetic data: synthetic data is pretty good, right? It's a form of distillation with SFT, structured gradual learning, and chain of thought data is in there. So synthetic data for pre-training and mid-training.

Here's kind of how they do it. It begins with high quality seeds from multiple domains. Basically, they start seeding data from big domains, right? They'll do web scrapes, then this seeding and filtration process: essentially, they take big data sets, use an LLM, and start filtering.

They want to identify the highest quality parts of this data, then they generate synthetic examples and use those as more training data. They talk quite a bit about multiple epochs for different sub-data sets, but yeah, they created 50 types of synthetic data sets.

So web and code-based seeds, for example. They have a two-stage filtration process: first, identifying pages with strong educational content, and second, segmenting the selected pages into passages and scoring each for its factual and reasoning content. Basically, they're trying to filter down to what is good reasoning data.

For question data sets, they discard questions where all the sampled answers agreed, meaning the question is too easy, or where the answers were inconsistent, meaning it's probably not verifiable. So a lot of data filtration: deduction chains, logical reasoning steps, rewriting and augmenting data to be more efficient. Seeds are then transformed into synthetic data through multi-step prompting workflows; there's a rough sketch of the question filter right below.
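As a rough illustration of that agreement-based filter (a sketch under assumptions: the paper doesn't publish code for this, and `sample_answers` plus the thresholds here are hypothetical):

```python
from collections import Counter

def keep_question(question: str, sample_answers, n_samples: int = 8,
                  easy_threshold: float = 1.0, consistency_threshold: float = 0.5) -> bool:
    """Heuristic filter for synthetic question seeds.

    sample_answers(question, n) is assumed to call an LLM n times and return
    the final answers. We drop questions that are too easy (every sample
    agrees) or too inconsistent (no clear majority answer).
    """
    answers = sample_answers(question, n_samples)
    counts = Counter(answers)
    top_frac = counts.most_common(1)[0][1] / n_samples

    if top_frac >= easy_threshold:        # all answers identical: too easy
        return False
    if top_frac < consistency_threshold:  # no majority: likely ambiguous or unverifiable
        return False
    return True
```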

Those workflows include rewriting the most useful content in a passage into exercises, discussions, and structured reasoning. So once you have a lot of text, you subset the best part of it: what are the actual hard questions here? From there, rewrite it into exercises, create discussions around it, and add structured reasoning for how we got to that answer.

That's kind of how they do that. Then self-revision and instruction reversal for code: you have high quality code, let's reverse it, let's recover the instruction and the generation steps that would produce it. Filtering web dumps was another one. They have small non-LLM classifiers trained on annotations to filter web dumps.

They have subsets of non-filtered and filtered stuff. They over-index on STEM content; they want to get that high quality stuff. They also have multilinguality in this, 176 languages, plus more extraction and more cleaning. Okay, post-training: post-training has SFT and DPO. And then we go into their training mix.

I'll take like five minutes to cover what they're doing here. They start off with 4K context length and extend it out to 16K during mid-training. They have a new vocab of a hundred thousand tokens, with some unused tokens. Turns out they just left that note in the paper in December.

These unused tokens are what the reasoning tokens end up becoming, right? So they carve the think tokens out of the unused tokens, and that's how they're able to do it. Previously they had sliding window attention in Phi-3; now they do full attention over the 4K context.
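As a minimal sketch of what registering those reserved slots as think tokens could look like (assuming a Hugging Face-style tokenizer; this is an illustration, not the paper's actual code):

```python
from transformers import AutoTokenizer

# Phi-4 checkpoint used for illustration; its tokenizer ships with ~100k
# vocab entries including placeholder/unused ids.
tok = AutoTokenizer.from_pretrained("microsoft/phi-4")

# Register the reasoning delimiters as special tokens so they are never
# split by the tokenizer. If the strings already map to reserved ids in the
# vocab (as described in the talk), the vocab size is unchanged and no
# embedding resize is needed; otherwise new ids get appended and the model's
# embeddings would need resizing before fine-tuning.
num_added = tok.add_special_tokens(
    {"additional_special_tokens": ["<think>", "</think>"]}
)
print("new ids appended:", num_added)
print("<think> id:", tok.convert_tokens_to_ids("<think>"))

text = "<think>work through the steps here</think> final answer: 42"
print(tok.tokenize(text)[:6])
```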

This model is trained from scratch on 10 trillion tokens. They give you the regular learning rate, weight decay, batch size, all that stuff. But pre-trained from scratch, 10 trillion tokens, a lot of them synthetic. Then after that, they have a mid-training stage where they do this context length extension from 4K to 16K.

The interesting thing with all their models, even the reasoning ones, is how they lay out the stages; in the reasoning models they have four stages of training. One of the things they noted was, as we do GRPO, we run into vanishing gradient problems, and we have issues with response lengths.

So at each stage, as they fix these problems, they show how performance is doing: how many more points do we get on different benchmarks? It's a pretty interesting way to lay it out, and there's a lot of learning in it. Okay, so Phi-3 was two stages. Phase one was largely filtered web text.

Phase two was basically a smaller subset of reasoning-style tokens, the high quality code and math. In this case, they show that web data sets had only small benefits on reasoning-heavy benchmarks, while models trained with only synthetic data underperformed on knowledge-heavy benchmarks. So we need to balance these things.

TLDR, here's their pre-training data mixture, where they search over different allocations of tokens coming from various sources: mainly synthetic, web rewrites, and filtered web, divided into reasoning-heavy and knowledge-heavy portions, plus targeted acquisitions and organic data. So math, books, forums, code data, stuff like that. And then there's a correlation between 7B and 14B models.

Basically, they noticed that the scaling was pretty consistent, so they did a lot of their pre-training data mixture experiments on a 7B model and then transferred that understanding over to the 14B. Let's go a little bit faster. Here's the breakdown of the final training mix over their 10 trillion tokens in the pre-training stage.

Or maybe it's not exactly 10 trillion here, because they still do post-training after this; let's say probably 95% of the training is in pre-training. The interesting thing is they run a lot of epochs for different data. So synthetic data, even though there's only around 290 billion unique tokens of it, is actually about 40% of the total pre-training, because they go over it with roughly 14 epochs.
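A quick sanity check on that epoch math, just back-of-the-envelope arithmetic on the numbers quoted above:

```python
unique_synthetic_tokens = 290e9    # ~290B unique synthetic tokens (as quoted)
epochs = 14                        # repeated roughly 14 times
total_pretraining_tokens = 10e12   # ~10T token pre-training budget

effective = unique_synthetic_tokens * epochs     # ~4.06e12 tokens actually seen
share = effective / total_pretraining_tokens     # ~0.41
print(f"effective synthetic tokens: {effective:.2e}")
print(f"share of pre-training: {share:.0%}")     # comes out to roughly 40%
```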

But yeah, to keep everything general: 30% of the training tokens are web and web rewrites, namely about 1.3 trillion tokens of web scraping plus web rewrites at around 290 billion unique tokens with multiple epochs on top. Then the remaining tokens are largely synthetic data, which is that 40%.

20% of this is pure code, and then acquired sources are things like the textbooks. That's most of the pre-training. Then they have their mid-training, where they extend context length. Here they keep high quality non-synthetic data sets separate.

They filter the pre-training sets for which samples have long context: what's over 8,000 tokens, what's over 16,000? Then they upweight the subsets that are long, and they also create synthetic data sets with more than 4K tokens. The data set for this long context extension is 30% newly curated long data and 70% previous stuff from the original pre-training.
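A minimal sketch of that kind of length-based selection and mixing (the helper names, the 8K cutoff used here, and the exact mixing logic are assumptions; the actual pipeline isn't published):

```python
def build_long_context_mix(pretraining_docs, new_long_docs, tokenizer,
                           min_len=8_000, new_share=0.30):
    """Upweight long documents from pre-training and mix in newly curated
    long data at roughly a 30/70 new/old split, as described in the talk.
    pretraining_docs / new_long_docs are assumed to be lists of strings."""
    long_old = [d for d in pretraining_docs
                if len(tokenizer.encode(d)) >= min_len]

    # Size the old portion so newly curated data ends up at ~new_share of the mix.
    n_new = len(new_long_docs)
    n_old = int(n_new * (1 - new_share) / new_share)
    return new_long_docs + long_old[:n_old]
```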

Then here's how it performs on different long context evals: recall, RAG, re-ranking, summarization. They show pretty good numbers on these. And they explain, for people that don't know long context benchmarks, what the common ones are.

Recall is basically retrieving the corresponding value from a randomly generated long file. RAG is answering questions against retrieved documents, a subset of these data sets. Re-ranking is: you're given 10 documents and a query, and you want to re-rank them. QA is question answering over a long document.

Oh, my thing has frozen. Okay, never mind, I just broke my PDF reader a little bit. Okay, after that they have post-training. They do DPO; this is basically, let's make it a chat-tuned model instead of a base model. Interesting thing to note: in this week's paper on the thinking model, one of the approaches they tried was, instead of using the actual Phi-4 model, what if we take the base model and not the instruction-tuned model?

And what if we do all of our thinking training from there? And they're like, actually, it does pretty well, but not quite as well. So, spoiler, they end up just using their instruction-tuned model. But they're very open about how they do all this. They use ChatML, user and assistant turns, they have DPO, they have SFT, they have multilingual data.

This is pretty interesting: they only need 8 billion tokens, out of their 10 trillion total, for SFT and post-training. DPO is the alignment piece; SFT is for instruction following. I think that's basically enough for a high-level view of what's going on in Phi-4. Performance is as you'd expect for a 14B.

There aren't many 14Bs anymore. They have sections on red teaming and weaknesses. There's a lot of just open research in here. But yeah, that's kind of where they sit: a 14B model, a lot of filtration, pre-training, this mid-training stage, SFT.

And DPO. It performs pretty well, on par with what you'd expect. It came out after Qwen 2.5 14B Instruct and is slightly better. Better than Phi-3, of course. When you look at stuff like 4o mini or 4o, in some cases it's better; on math and GPQA they say it's better.

HumanEval, so coding stuff, slightly worse. SimpleQA too. One thing they note as a problem with these models is factual recall: small models just struggle with this, right? A small model can't memorize a bunch of facts, so on factual QA it's not the best.

A big model will always do better there. But yeah, that's what Phi-4 is at a high level. It's cool. It's not better than Llama 3.3 70B overall, but in some areas it is, and they got there with reasoning-flavored data, SFT, and a light DPO pass, with just filtration and a lot of synthetic data.

But okay, I'll pause here since that's mostly Phi-4. From there, we'll go on to the two reasoning papers. Any questions, thoughts, comments, concerns? Has anyone used it? Probably not. They get some use now; they're actually hosted by most providers, and they're there through Azure and such, of course.

But some people have started using them. Anyway, any questions, any thoughts? Okay, someone in chat has said something: it's interesting to me that Phi-4 reasoning is distilled from o3-mini and Phi-4 mini reasoning is distilled from DeepSeek; why do they use completely different curated SFT data sets and also kind of different SFT strategies?

Yeah, it's an interesting note. So, the papers they released this week, let me actually change my screen share real quick. Phi-4 came out in December, and now they've released two models, kind of three, actually. Phi-4 mini reasoning builds on Phi-4 mini.

So it's actually a three-point-something B model, let me check real quick, 3.8B. And they show that there's a whole different formula to getting a small, small model to do reasoning than there is for a large model.

So one of the interesting notes that someone in chat pointed out is that for Phi-4 mini reasoning, they do their distillation from DeepSeek R1. But for the large Phi-4 reasoning, the 14B one and the plus version, they do it from o3-mini, which is interesting.

Who knows why. But there are so many little gems in this Phi-4 mini reasoning paper. For example, they cite a bunch of recent work on getting small models to do reasoning. Basically, they start with DeepSeek R1, right? DeepSeek R1 had distills, the Qwen and Llama distills.

And they show how they can do better. After that, there was a bunch of stuff, like work from OpenHands, work from S1, and they make little notes on all of it; there are other things like OpenThinker and Bespoke Labs. But one thing they noticed involves the S1 and LIMO datasets. S1 was basically a thousand samples of reasoning data trained into Qwen 32B, I believe, and it made a really good reasoning model, right?

So a thousand samples is all you need. They show that if they do that same thousand-sample SFT on their mini 3.8B model, it actually performs worse. So even though it worked well for S1, S1 did it at 32B: they took an instruct model.

They did a thousand high quality reasoning samples, they got reasoning, it worked very well, benchmarks shot up like crazy. The S1 paper was pretty cool, but pretty basic. They show that if we do the exact same thing on Phi-4 mini, so we take a 3.8B model that's competent, good, instruction-tuned, and we do a thousand samples of SFT on a really good reasoning data set, our scores actually go down a lot.

So the base model had scores of 10, 78, and 37, and it shot down to 3, 47, and 26. So TLDR, that actually doesn't work. And, I don't remember, I was trying to quote something here, but they keep saying we need to explore a better recipe for how to do reasoning in tiny models.

This paper is basically a blueprint for that. But yes, it is based on DeepSeek. To answer the question exactly, I would assume the reason they use DeepSeek is because they're specifically comparing their model to the R1 distills, the Llama 8B distill and the Qwen distill. So they're comparing against both distills, and, you know, why not distill from DeepSeek?

But anyway, before we move on to that and then the large one, any questions on regular Phi-4? Okay, we move on. Someone asked about the risk of training on synthetic data and the effect of hallucinations. If you look deeper into their synthetic data processing, they have heavy, heavy filtration. So for stuff that's questionable, or stuff that has good chain of thought but not the right final output, they deal with that.

Basically, they have verifiable math outputs and verifiable code; they have really good pipelines to test whether outputs are correct or not. And a lot more of this is shown in the Phi-3 paper as well. Next question: is there any justification for why they did so many epochs on synthetic data, any rationale behind that?

So in Phi-4, basically they're just saying that their synthetic data is higher quality, right? Sorry, this on screen is the reasoning paper. Basically, they start with: the web scrape is decent, we can filter it down and then expand it, and that gives us better quality data.

And in Phi-3 they show some of this too. But the number of epochs is kind of interesting, specifically when they show the context length extension, something they talk about in a lot of these papers. They also mention it later, towards the end, but I don't think it's worth covering since those are more reasoning benchmarks.

Some of the interesting things they note: they use this method of packing and unpacking prompts, and they also repeat long context examples that were already used in pre-training. So even their epoch counts don't necessarily reflect the exact number of unique samples, although I'm sure the difference is minute. In this pre-training, for example, the web data has 1.3 trillion tokens and is run for 1.2 epochs.

In the mid-training, where they do context length extension, they once again reuse the previous samples that had more than 8K context. So they do create synthetic data, but they also reuse. So the accounting is a little skewed, but I don't know, they just show that synthetic data is good.

The other nice thing about these papers is that they basically cite everything. A lot, a lot of citations. But what they show here is like a bunch of synthetic data, model go good. As you would expect, the reasoning models kind of follow that same trajectory, right? A lot of synthetic data, model can reason, model get good.

But let's start with the mini reasoning model, Phi-4 mini reasoning. This is the 3.8B reasoning model. Okay, so improving small language model reasoning remains challenging due to their limited capacity. Also, are we seeing my entire screen? Okay, we are; just making sure we're on the right paper.

They bring up R1 distillation, right? Distillation from LLM-generated synthetic data can improve reasoning, and they show that even plain SFT works. So this work is trying to get a systematic training recipe for small language models. They have four steps. Step one is large-scale mid-training on diverse distilled chain of thought data.

Step two is SFT on high quality long chain of thought. Step three is rollout DPO to leverage a carefully curated preference data set. Step four is RL with verifiable reward. So four steps: here's how we do reasoning on a tiny model. And at the end, they show that their compact 3.8B reasoner can outperform the R1-Distill-Qwen 7B and R1-Distill-Llama 8B by a little bit on math.

They're also trying to show they're not overfitting here. They have sections on things like calendar planning where they say, this is something we didn't explicitly train on at all; we filtered those data sets out of our training, but our model starts to generalize and does really well on this task.

And this is something that wasn't trained on at all. Basically, they validate that a carefully designed training recipe with large-scale, high quality chain of thought data is effective to unlock strong reasoning capabilities, even in resource-constrained small models. I don't know what their team is up to with chart colors, but here they go all pink and purple.

So here's AIME 24, MATH-500, GPQA Diamond. The original Phi-4 is this shade of purple. The R1 distills, you can see, are actually pretty good; the Llama 8B and Qwen distills were pretty good. But hey, their tiny model at half the parameters can do even better.

They're winning. I thought this was interesting: equal contribution for everyone except the first and last two authors. Screw the others; they're just listed first, but not equal contribution. Those charts are pink, these charts are colorful and different, but okay, back to this. I feel like me sitting here reading through charts is useless.

But, oh no, my reader is broken. Okay. If you're interested in how they bucket small models: it's interesting how they put 4o mini under the small model section instead of the large models, with 4o over here. We don't really know how big 4o mini is, but it's there nonetheless.

Oh, wrong chart, my bad. Let me go back. Okay, sorry, wrong chart. Basically they start off with their intro: chain of thought is a way to do reasoning steps, it's cool. They say that small language models struggle with chain of thought.

I thought that's kind of interesting. Enhancing reasoning capabilities is easier for large language models due to their extensive capacity; it remains challenging for small models. DeepSeek R1 shows that non-logit-level distillation, basically just SFT on synthetic data, gives good reasoning performance. Then they cite all the other stuff.

Then they show the other work: Bespoke Labs' 7B, OpenThinker 7B, which show you can do this with SFT. Some people suggest GRPO, like DeepScaleR, and S1 and LIMO show that small sample counts, even a thousand samples, can give good reasoning.

Okay. Rather than focusing on isolated techniques, they explore training paradigms specifically tailored for small language models. They have two stages of distillation, followed by rollout-based preference learning that reuses wrong LLM-generated samples, and they conclude with RL with verifiable reward. Initially, they employ distillation as a mid-training mechanism to embed the foundation model with reasoning capabilities.

Then they apply distillation again in fine-tuning to further improve generalization. Then there's LLM rollout sampling for distillation. Incorrect outputs are typically discarded, but they want to still use them, so they have a way of doing that. They take a sort of RL-flavored approach where they also optimize for length.

So if the answer is incorrect, the model should still think a lot on those examples, and they want conciseness on correct examples; they talk about this later. Then they fine-tune it with RL for final correctness outcomes. Okay, more background: the optimal distillation strategy for small models is still unexplored.

They keep repeating this. Data diversity: they showed that data diversity and quality are very important. Applying isolated techniques degrades performance; they tried S1 on Phi-4 mini and scores went down, which I thought was pretty crazy. Their goal, once again, is a comprehensive, efficient training recipe for small language models.

So non-reasoning models require a mid-training stage to absorb a large volume of reasoning trajectories before additional techniques are applied. Once again: if the model is small, we need mid-training. They love this concept of mid-training. So the quick questions to ask are: how much mid-training do they need?

And what do we do after mid-training: careful distillation, preference learning, RL, what should come next? Once again, they systematically address these questions and propose a recipe for building small reasoning models. Mid-training is basically after pre-training and before post-training.

That's why it's mid-training. You have SFT for post-training, but before you do that, when you have the base model, is there something we can do to instill this sort of reasoning, before the specific RL and DPO, before that last step on the most high quality stuff?

Basically, if you take an instruction model like Phi-4 mini and you just do a direct 1,000 samples of high quality reasoning data, you're cooked, you're not getting reasoning. Small models cannot pick that up from so little data. So they're saying mid-training is where you need this stage of: let's do some chain of thought, extended synthetic reasoning data.

Let's roll out reasoning traces, let's have thinking steps, let's do a lot of those. Then we go back once again to our core high quality examples. That's kind of what it is. Basically: train on chain of thought before your RL.

Okay, multi-stage continual training for reasoning. Multi-stage continual training is good: first we train on the curated chain of thought reasoning data set, then we run RL with verifiable rewards. So, can distillation be used as mid-training? Yes, they want to use distillation. They train the base model with next token prediction on an extensive corpus of synthetic chain of thought data.

The chains of thought are generated from DeepSeek R1. They apply rejection sampling to only keep correct answers; they talk about this later in section four. They pair questions with corresponding correct chain of thought answers and train the base model using a standard causal language modeling objective.

They have a packing mode. Basically, what they're doing here is not just SFT on reasoning; they're packing multiple examples into one training sequence. Multiple examples are packed into the same input sequence to increase token efficiency. So we're not specifically trying to learn a single input-output mapping.

We're not doing SFT to just learn question, reasoning steps, answer. We're doing multiple examples packed together, so you could have seven examples of input, chain of thought, output, then input, chain of thought, output, and in mid-training we're just trying to learn, here's how we do chain of thought; see the packing sketch below.
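A minimal sketch of that kind of sequence packing (a sketch under assumptions: field names like `question`/`cot`/`answer` and the delimiters are made up for illustration, not the paper's format):

```python
def pack_examples(examples, tokenizer, max_len=4096, eos_id=None):
    """Greedily pack (question, chain-of-thought, answer) triples into
    fixed-length token sequences for mid-training with next-token prediction."""
    eos_id = eos_id if eos_id is not None else tokenizer.eos_token_id
    sequences, current = [], []

    for ex in examples:
        text = f"{ex['question']}\n<think>{ex['cot']}</think>\n{ex['answer']}"
        ids = tokenizer.encode(text) + [eos_id]
        if current and len(current) + len(ids) > max_len:
            sequences.append(current)      # flush the packed sequence
            current = []
        current.extend(ids[:max_len])      # truncate pathologically long examples
    if current:
        sequences.append(current)
    return sequences
```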

It's not just learning good answer generation. This is an effective way to let mid-training iteratively consume as much chain of thought data as possible, until the model starts to perform well on a validation set. Then we have distillation for SFT.

Basically, after it has started to learn how to do this chain of thought, they do fine-tuning, or just continual training, in a non-packing way, and that's where you teach the model where to stop generating text. Okay. After that, rollout preference learning. The previous two stages trained only on accepted generations.

They've done filtration, taken out all incorrect examples, and only trained on positive chain of thought: here's a question, here's the chain of thought, here's the answer, packed and then unpacked. The model now knows how to think and where to stop, but now they want to use the rejected rollouts to enhance performance.

Basically, this is how they ensure you have diversity, enough thinking, and conciseness. Incorrect responses with minor errors are compared to their correct counterparts, and they provide effective candidates for constructing informative preference pairs. So you can do preferences like: here's the right thinking approach, but the answer was wrong.

And here's the correct way to do it. This preference data set is constructed by using correct answers as preferred rollouts and incorrect ones as dispreferred rollouts. So now we're going to do DPO, and we have: here's a chain of thought with a correct answer.

Here's a chain of thought where you kind of messed up a little bit; let's train on that pair. Then we have RL with verifiable reward. So now we've done some alignment with DPO, we have preferences over these pairs of examples, but now we want RL on the distilled and preference-trained model.
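A minimal sketch of that rollout preference-pair construction (assumed field names and a hypothetical `is_correct` verifier; this is not the paper's code):

```python
def build_preference_pairs(rollouts_by_question, is_correct):
    """For each question, pair a correct rollout (chosen) with an incorrect
    one (rejected) to form DPO training examples.

    rollouts_by_question: dict mapping a question string to a list of
    sampled chain-of-thought completions.
    is_correct(question, rollout) -> bool is an assumed answer checker.
    """
    pairs = []
    for question, rollouts in rollouts_by_question.items():
        good = [r for r in rollouts if is_correct(question, r)]
        bad = [r for r in rollouts if not is_correct(question, r)]
        # Pair each rejected rollout against a correct one, once each.
        for chosen, rejected in zip(good, bad):
            pairs.append({"prompt": question,
                          "chosen": chosen,
                          "rejected": rejected})
    return pairs
```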

In the following, they describe the RL algorithms they implemented. PPO uses a clipped surrogate objective to limit the policy update so it stays close to the previous policy; the clipping gives you stabilization. Okay, let's see.

We have PPO and GRPO. We love our GRPO: comparing rewards within a group of multiple responses. For each question, you sample a set of responses under the old policy, compute their rewards, and normalize against the group average. There's a verifiable reward for getting the right answer. Standard RL stuff these days.
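A minimal sketch of the GRPO-style group-normalized advantage (just the reward-to-advantage step; the policy update itself is omitted, and the rewards shown are made-up exact-match scores):

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean/std of its own group (all samples for one question)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# For one question: sample G responses, score with a verifiable reward.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]   # e.g. exact-match checks
adv = group_advantages(rewards)
print(adv.round(2))   # correct answers get positive advantage, wrong negative

# Note: if every reward in the group is identical, the std is ~0 and all
# advantages collapse toward 0, which is the vanishing-signal issue
# discussed right below.
```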

What else? High variance in response lengths. In their pilot study applying GRPO to train the base model, they observed three issues that affected the stability and effectiveness of training. First, high variance in response lengths: although the base model after mid-training is already able to generate chain of thought responses, they observed substantial variability in response lengths within the same GRPO sampling group.

So when you generate multiple outputs for the same example, there's a lot of variability in how long they are. For some questions, positively rewarded responses ranged from 12,000 to 20,000 tokens, which is pretty large, and optimizing the model with standard GRPO led to instability.

Okay, vanishing gradients. This is what you would expect: with so much diversity and so many different lengths, you often end up with identical rewards within a group, so you get a vanishing gradient problem from zero variance in the returns. And the model is sensitive to intra-group length discrepancies, which required extending the GRPO batch size to 128.

So bigger batches in GRPO, and that worked. They hypothesize that these issues become more prominent for small language models, where RL stability is likely to be more fragile compared to large models. Okay, moving on quick since we're short on time. What else? Synthetic chain of thought data generation.

They construct a large-scale reasoning data set composed of LLM-generated synthetic reasoning trajectories. They use a bunch of pre-made data sets; basically all of these are open data sets, listed with their sizes and whether they have reasoning traces or not. For data sets that already had reasoning traces, they directly use the annotations.

So basically Bespoke Labs, OpenR1-Math from Hugging Face, and OpenThoughts. That's roughly, what is that, like 350,000 samples with reasoning annotations, and they use those directly. For the other ones, all these six or seven data sets lacking such trajectories, they retain only the math questions and generate new chain of thought answers using R1, the big R1, 671B.

For each question, they sample approximately eight rollouts, and in total they collect 10 million rollouts across 1.6 million questions. So for the other data sets they use big R1, math only, eight rollouts per question, and now they have 10 million rollouts.
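A minimal sketch of that rollout generation plus rejection sampling (hypothetical `teacher_generate` and `answers_match` helpers standing in for the R1 calls and the math verifier):

```python
def distill_with_rejection(questions, teacher_generate, answers_match,
                           n_rollouts=8):
    """Sample several teacher chains of thought per question and keep only
    rollouts whose final answer matches the reference answer."""
    kept = []
    for q in questions:                          # q: {"prompt": ..., "answer": ...}
        rollouts = teacher_generate(q["prompt"], n=n_rollouts)
        for r in rollouts:
            if answers_match(r["final_answer"], q["answer"]):
                kept.append({"prompt": q["prompt"], "cot": r["text"]})
    return kept
```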

All the previous training steps, the mid-training and so on, use this data set; this is not a new training stage, it's just explaining where the data for the steps above comes from. Okay, this is kind of interesting as well. For math questions that are verifiable, they first apply math verification tools to assess correctness.

Some auto-verification fails for complex solutions, so they additionally employ 4o mini to re-verify rollouts initially flagged as incorrect. I thought this was interesting. We're in mid 2025, we have o3, we have o1, we have 2.5 Flash, we have Sonnet, we have all these big models, but let's verify our math with 4o mini.

Let's think about that for a second. We have benchmarks, we have tables, we have everything, but for checking whether our math was flagged as incorrect properly, let's use 4o mini. Why 4o mini? That's for Microsoft to not tell us. Could they have used a bigger model?

I think so, but anyway, they do that. Okay, experiments and evaluations: they evaluate the model on three math and reasoning benchmarks. Possibly they're just obsessed with token cost, hence the 4o mini verification. I think the 4o mini choice is also: it's a mini model, let's compare against mini. 4o mini was the comparison point for the big Phi-4, the 14B model.

And based on estimates, people say the active parameters of 4o mini are small, so that's a fair comparison on token cost. I mean, I don't know. These are not cheap models to train: this is a 14B model trained on 10 trillion tokens, and that's a lot of compute.

That's in the millions of dollars of compute. One thing I highlighted that I might've skipped over was how cheap training Phi-4 mini reasoning was. I'm not going to make a "DeepSeek R1 cost $5 million" type claim, and there's obviously a lot of filtration, a lot of time, a lot of synthetic data generation, a lot of inference that goes into this, but they trained this on, I believe, 32 H100s for like two and a half days.

So maybe 64 H100s, but very few nodes were used. This is something someone could do themselves if they really wanted. If someone can pull up the Phi-4 mini reasoning model card on Hugging Face, it lists the GPUs used; just share the link in chat.

And we'll go over it real quick. That was something I wanted to know: the cost of training this reasoning model, outside of generating the data, and honestly even the data mostly came from a lot of DeepSeek R1 calls and open data sets.

This thing didn't cost that much to train: a couple nodes of H100s for two and a half days. That's on the scale of tens to hundreds of thousands of dollars, so not that crazy. But yeah, what else? On to their results.

Once again, they're very cautious about overfitting on benchmarks. They do three runs and report the averages, and they have a whole section on how we need to better evaluate reasoning models, plus a section on reasoning models versus non-reasoning models, where these evals sit, baselines and such.

But yeah, they show that the thing's pretty good. Training strategies: this is very straightforward. In the distillation stages, what's the batch size, learning rate, how many epochs, warmup ratio, sequence length, packing or not packing. If you really care about that, go ahead and read it.

We're running short on time, so I won't read it out. On scores, here's where they sit. Once again, they show all of their stages: basic Phi-4 mini sucked at these math and reasoning benchmarks. o1-mini, good. Distills, pretty good.

The other baselines, not so good. Base Llama, not good. Adding distillation mid-training, ooh, a big jump. Adding distillation fine-tuning after, even better. Adding rollout DPO, even better. RL with GRPO, wow, we're good. It beats everything. Actually, it doesn't beat everything, but pretty good.

It beats R1-Distill-Qwen 7B and R1-Distill-Llama 8B, which, when DeepSeek came out, a lot of people were using for local inference, especially Distill-Qwen 7B. Tyler has linked the Hugging Face page, let's check it out real quick. Context length, what model is this?

Phi-4 mini reasoning. Okay, model quality, GPUs: 128 H100s for two days of training. So, not that much. Assuming $2 per H100-hour, 128 GPUs times 48 hours times $2 is about $12,000. The math checks out, but it also doesn't really: you can't assume $2 per H100-hour.

Because you're doing this on actual training nodes. But yeah, at list H100 pricing you can say roughly 12K; at node-level pricing, maybe double it, maybe triple it. Point being, tens of thousands of dollars for this, and hundreds of billions of tokens, not trillions.
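The back-of-the-envelope cost math, written out (the $2/hour rate and the 2-3x node markup are the assumptions from the discussion, not published numbers):

```python
gpus = 128                 # H100s listed on the model card
hours = 48                 # roughly two days of training
rate_per_gpu_hour = 2.0    # assumed marketplace price, USD

base_cost = gpus * hours * rate_per_gpu_hour
print(f"naive estimate: ${base_cost:,.0f}")                       # ~$12,288
print(f"with 2-3x node overhead: ${2*base_cost:,.0f} to ${3*base_cost:,.0f}")
```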

And this is stuff that, low key, could have been done earlier. But yeah, they lay it out stage by stage. Thank you, Tyler, for the link and the quick math. They show stage-by-stage performance, ablations, fun charts, safety. We need safety: theirs stays consistent.

Others go down. What else? Conclusion. They love their line: small models, when trained with deliberate data selection and training strategies, can match or even exceed the capabilities of larger models. They present this work as a blueprint for developing efficient, high performing models under resource constraints. Now, one of the very interesting things is some of the other work referenced here, like the distillation models: distillation with SFT is cool.

DeepSeek showed it really worked, right? Let me see if this chart actually compares to base Llama; I believe they have it at the bottom. So, let's change color real quick, for regular Llama 3.2. Oh, they did the 3B. Goddamn, never mind. I was trying to compare Llama 8B to 8B.

But okay, let's just look at Phi-4 mini, which does better than the 8B on these reasoning data sets anyway; the base models did really, really badly here. What DeepSeek showed is that with basic SFT on reasoning data you can get pretty good performance. What this paper shows is that with a well thought out process, designed specifically for small models, you can do even better than what people were very amazed by in the past.

So this thought-out recipe is really good. What else? They basically present this as a blueprint for small-model thinking and RL. What was interesting is that stuff doesn't directly transfer over. S1, someone correct me if I'm wrong, I think was done on Qwen 32B, right?

Or one of the larger ones, where you take a thousand samples and suddenly you have really good reasoning. They tried that exact recipe with their small model: they took Phi-4 mini, did the S1-style training, and not only did it not improve as much, it actually got worse.

So this stuff doesn't transfer over at face value. For regular foundation model stuff it does: in the regular Phi-4 14B, they did a lot of their pre-training data experiments on a 7B and then transferred over to the 14B for the actual training run.

Where is this? It's somewhere in here. Yeah, right here: they observe a high rank correlation between the performance of 7B and 14B models on different data mixtures, given a large enough distance between the two mixtures. This allowed them to conduct experiments at 7B scale and transfer the findings to Phi-4, which is 14B.

Is the key difference between the 32B and this one mid-training? No; the regular Phi-4 is a 14B which also had mid-training. The small reasoning recipe is basically a mixture of a bunch of things: a whole lot of let's-learn-chain-of-thought, packed, then unpacked to know where to stop, then let's do RL.

That's the better approach for a mini reasoning model. We have questions and discussion, but this is not the only paper that dropped. I, in fact, was bamboozled an hour ago when I decided to read these; turns out there's, low key, two and a half papers. So we needed to know Phi-4 base to see what they did.

Then there's Phi-4 mini reasoning, which is their 3.8B. Then there's also just Phi-4 reasoning. Phi-4 reasoning is not one model, it's actually two models. Phi-4 reasoning is where they take the 14B model and make a reasoning model using o3-mini traces. And they have another one.

They have Phi-4 reasoning plus, where they do the same thing as before but now add RL specifically. And now they're comparing it to the 70B distills, and guess what? It does well, it outperforms the 70Bs on the same reasoning benchmarks. They show that the benefit of careful data curation and SFT extends to reasoning models.

And this can be further amplified with RL. Similar to how there's o3-mini low and high, o1 pro, all that stuff, they can also do tiers. So they have Phi-4 reasoning and Phi-4 reasoning plus. Could this paper have just launched Phi-4 reasoning plus and called it Phi-4 reasoning?

Yes, they didn't have to do two. This is very similar to how, in the mini reasoning paper (oh, this is the big one, one sec), they show every stage of training and how it performed.

Right? They could have also had Phi-4 mini reasoning and then Phi-4 mini reasoning plus and shipped their whole recipe in stages, since it keeps improving stage by stage. But in this one, I guess the point they're trying to make is that an additional RL stage takes you from Phi-4 reasoning to Phi-4 reasoning plus.

Yeah, TLDR: they're getting o1-mini and o3-mini level performance, beating the R1 70B distills. Okay, I'm going to try to go through this in three minutes; sorry for the bad use of time. How they do this: it's Phi-4, a 14B model, supervised fine-tuned.

So they do SFT. Then they have reasoning plus, which has a further round of RL. The SFT uses 1.4 million prompts with high quality answers containing long reasoning traces generated using o3-mini. Prompts are filtered to cover a range of difficulty, and they want it to be stuff the regular model can't answer.

Question from chat: is there any traction on these? Yeah, the Phi models have decent traction; the reasoning one just came out, so I don't know about traction on that. So they basically filter out stuff that the regular Phi model can already solve; they only want stuff it can't solve.

My highlighting got weird. So the data set used in SFT includes STEM topics, coding, and safety-focused tasks. The reasoning plus model is trained using RL on a small set of around 6,000 high quality, math-focused problems with verifiable solutions. Kind of interesting, right? They get so much performance gain from just 6,000 samples of RL.

They talk about how RL has high variance, but also high impact if done correctly; and since the RL is just math, LiveCodeBench went down a little bit, which is an interesting thing to note. Basically the whole training pipeline matches what they've done in previous Phi models.

They once again want to show that good data curation and synthetic data let small models be good. A small model performs better than o1-mini and the 70B models. They also say it outperforms Claude 3.7 Sonnet thinking on all tasks except GPQA and calendar planning. Okay, performance is cool.

This was kind of an interesting one. Both of them present improvements over the base model, including on math specifically, notably an improvement of 50 percentage points there. Surprisingly, these models also improved by 30 to 60 percentage points on algorithmic and planning problems like calendar planning, which demonstrates increased generalizability of reasoning skills to domains that were not targeted directly during SFT or RL.

So on stuff they didn't target and didn't train on, this still generalizes. Very cool. Improvement on general benchmarks: of course it improves, here's a wall of numbers. Thinking effort versus accuracy trade-off, this was interesting: the reasoning plus model, the one with the extra RL that does better, takes approximately 1.5 times more tokens than the other one.

This difference is less pronounced on other reasoning domains; that 1.5x is an average. Some domains, like coding, planning, and spatial tasks, are all avenues to improve the RL further. Okay, keep going through quick: Phi-4 reasoning demonstrates better responses, and they show some examples the reasoning model got that the base one couldn't.

So there's a word-play riddle, and a question like: I have some coin tosses, what's the chance I see exactly 1.2 heads? Phi-4 would give you an actual math calculation. The reasoning model is like, you can't get exactly 1.2 heads, so the probability is zero. Pretty cool.

Pretty cool. More stuff: planning, games. Crazy, it can do games. Data stuff: the seed database. They specifically target seeds situated at the edge of Phi-4's current ability. Additionally, to maximize the focus on reasoning skills in the data set, they prioritize prompts that demand complex multi-step reasoning, as opposed to those testing factual knowledge.

So they want reasoning stuff, teachable things, synthetic seeds. Sorry, I'm going quick, we're at like one minute left, so I'm going to go through this quick. Phi-4 reasoning is basically SFT on Phi-4 with specific tokens: they have reasoning tokens, using two of those empty placeholder tokens they had.

They add in the think and end-think tags. Increased token length: they extend context to 32K and synthetically generate examples of long chain of thought over that. The SFT data set is 1.4 million prompt-response pairs, totaling 8.3 billion unique tokens.

So that many pairs, spread over these domains. Here's the training, here's the SFT steps, here's the improvement. During the exploration stage, they studied the effects of various design choices: the role of SFT seeds and hyperparameters, and the role of the system message. Of course the system message is useful. Basically, to promote chain of thought, they tell the model: you're a thinking model, put your reasoning inside think tags. They tried partially removing or replacing the system message, and it still sort of worked, but not as well.

Here's roughly the exact thing: you're a language model trained by Microsoft, your role as an assistant involves thoroughly exploring questions, please structure responses into two sections, Thought and Solution, using the specified format, with the thought section inside think tags and the solution after it; in the Thought section, detail your reasoning steps, where each step can include analysis, summarizing, brainstorming; the Solution section should be logical, accurate, and concise. Then: now solve the following question using these guidelines. So that's the system message.
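A minimal sketch of what that chat format looks like when you put it together (the system text is paraphrased from the talk and the tag names are the ones discussed above; treat the exact wording and the toy example as assumptions):

```python
SYSTEM = (
    "Your role as an assistant involves thoroughly exploring questions "
    "through a systematic thinking process. Structure your response into "
    "two sections: a Thought section inside <think></think> tags, then a "
    "logical, accurate, concise Solution section."
)

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "How many primes are there below 20?"},
    # Target completion used for SFT: reasoning inside the think tags,
    # followed by the final answer outside them.
    {"role": "assistant", "content":
        "<think>List them: 2, 3, 5, 7, 11, 13, 17, 19. Count = 8.</think>\n"
        "There are 8 primes below 20."},
]
```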

It helped. Next, the base model question: they considered using the base model before SFT, like we talked about, so the 14B before instruction tuning. Both of them worked pretty well, but the one with SFT and instruction tuning did slightly better.

So they use it. They attribute this to the addition of safety-focused post-training. Wow, safety is all you need. Scaling: the final model is trained on about 16 billion tokens in this SFT stage. Okay, real quick, the plus. So they did roughly 6,000 samples of RL; they applied outcome-based RL to enhance reasoning capabilities.

They start from about 72,000 math problems and subset that down to a small set of around 6,400. So no coding in the RL. The reward function basically incentivizes correctness, penalizes undesirable behavior such as repetition and excessive length, and encourages proper response formatting.

They encourage the model to generate concise outputs when it's correct and to think more when it's incorrect; there's a rough sketch of that kind of reward shaping below. Here's how they do it with GRPO plus a repetition penalty, then training details. Oh, this was the one with a batch size of 64 over 32 H100s. Sorry, the mini we talked about was 128 H100s, but this reasoning plus run was only 32 H100s with a batch size of 64.
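A minimal, assumed sketch of that length-aware verifiable reward (the weights and the exact shaping here are invented for illustration; the paper's actual reward is more involved):

```python
def shaped_reward(is_correct: bool, n_tokens: int, has_repetition: bool,
                  well_formatted: bool, target_len: int = 8_000) -> float:
    """Outcome reward with length and format shaping: reward correctness,
    prefer concise correct answers, nudge incorrect ones to think longer,
    and penalize repetition and bad formatting."""
    reward = 1.0 if is_correct else -1.0

    # Length shaping: shorter is better when correct, longer when incorrect.
    length_ratio = min(n_tokens / target_len, 2.0)
    reward += (-0.2 * length_ratio) if is_correct else (0.2 * length_ratio - 0.2)

    if has_repetition:
        reward -= 0.5           # discourage degenerate repetition loops
    if not well_formatted:
        reward -= 0.25          # e.g. missing closing </think> tag
    return reward
```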

That RL run is like a couple hundred hours of compute. They also do context length extension here, but I think that's enough; I don't want to go too long over. They have evals, of course they have evals, but even though we're over time, I want to give it two minutes for questions.

They have a whole main findings and takeaways section, stuff like that. I'm going to go over chat real quick, but if anyone wants to pop in and share, please interrupt me now. Yeah, so someone notes they call this SFT instead of DeepSeek's use of "distill". Distillation usually implies a distillation loss, right?

Where you compare output logits and force the student toward what the big model said, actual distillation. They do note that this is basically just SFT on big-model outputs. Next question: what's the difference between Phi-4 reasoning and reasoning mini? Oh, Phi-4 reasoning is on their big 14B model.

So Phi-4 14B gets trained to do reasoning; it's on par with o1-mini, o1, and the 70B distills. The reasoning mini model is done on a 3.8B. So for reasoning mini they do post-training on a 3.8B and show they can match the 7B distills, and they show that if you take the distillation recipe that worked on big models and directly apply it to a small model, it doesn't work.

So they try the S1 data set on a small model, it does worse, but they have a blueprint for how to do this for small, small models. Next question: is there anything interesting in the mini ablation section? Maybe which part they deem most important for small models; following their narrative it would be mid-training, or do they imply the combination is the key?

Let's see. They say their distillation pipeline serves as an effective approach, but I don't think there's much here that's super important. Basically the takeaway is: you need to do some mid-training. And they pose some open questions.

Yeah, cool. Okay, thank you everyone. Sorry for the last minute paper change; I low-key read this like an hour and a half ago, but hopefully it was useful. I'm sure some of this is wrong, but yeah. Next week we should have Professor Tom, who does the by-hand illustrations, talk about the differences in Llama 1, 2, 3, and 4 architecture.

I'll share more in Discord. He was going to just do Llama 4, but we pushed it a week so he has time to prep a better comparison of what's actually changing across the series. So that should be next. We always need volunteers, so if anyone wants to volunteer a paper, let us know and we'll slot you in.

So Flo says, I can't wait till GPT 5.4 drops, and I can't tell if people are saying "five four" as in GPT 5.4 or as in Phi-4. People will be saying GPT 5.4, because no one talks about the Phi models. These are cool, but really not many people talk about them.

Not many people use them, but it's always nice to have; it's open research, and they do a lot of work. But yeah, nobody really talks about these. The only people that talk about them are people that kind of shit on them for training on benchmarks.

If I'm not mistaken, they made one of these multimodal, and Phi-4 or Phi-3 with audio had the best transcription word error rate. Someone fact-check me on this, and I'll follow up somewhere, but I'm pretty sure the multimodal version of this had the best word error rate of any model.

Now, that's not saying a lot, because speech and transcription models are usually small, low latency, efficiency optimized, and this is a fat model. But yeah, multimodal large models get good, what can you say? Okay, I feel like that's enough. Thanks guys. I would end the meeting, but swyx is host, so he will end it.
