The Phi-4 Reasoning Technical Report — Vibhu Sapra

Let me also pull up chat real quick. For those that don't know, there's the Phi series of models. This started way back when Microsoft was trying TinyStories. TinyStories was on the order of one to a few million parameters, so not even a billion, not even tens of millions of parameters. They just wanted to see: can a really small, one-million-parameter model learn text and produce any coherent words? Turns out it can. Then they had "Textbooks Are All You Need": if you train on textbooks, you can get some level of coherence. Then came the Phi series. These were really, really small language models, up to around a billion parameters. So we went from super small, millions of parameters, to billions, and now we're at Phi-4, which is a 14-billion-parameter model. They've gotten chunky. The sweet spot they were hitting was that roughly 3B range where no one else was really playing. We had Llama at 7B, we had Falcon, we had Mistral, all 7B models. At the time, Qwen wasn't really doing anything tiny, so the Phi models were the on-device, coherent, somewhat decent option. The history of the Phi models is that they would do well on benchmarks, and then people would say, damn, these models suck, and they were accused of training on the benchmarks and inflating their scores. Phi-3 was the last one that made a pretty big splash. Phi-3 was like, okay, here's a whole section on how we do decontamination of training data. They basically said: hey, we're serious about this, we're not going to train on benchmarks, here's how we filter all our data, here's how we train our big foundation models. They were kind of decent, and some people used them. Then recently we had Phi-4; that paper came out in December 2024.
From there, as of this week, we've got two more papers covering Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning. So let's do a quick refresher on Phi-4, the one that came out in December, because it's the foundation model the other two build on top of. If anyone has comments or questions, feel free to pop in whenever. I feel like we covered this in one of the earlier paper sessions, but we'll go over it again real quick. Phi-4 is a 14-billion-parameter model, and they do a bunch of high-quality filtering: they use LLMs to filter a lot of stuff, and then they generate a bunch of synthetic data. They want to show how far distillation can get you. So they train on high-quality data, with the same architecture as Phi-3, but now they're focused on reasoning benchmarks. This came out pre-reasoning-models, but it was the time when we started to realize: oh, reasoning data is pretty high quality. So a bunch of the training data here is synthetic.
They have techniques like multi-agent prompting, self-revision workflows, and instruction reversal. There's a section on rejection sampling where you take examples with chain of thought, check which ones are wrong, and keep only the good ones. They do DPO, and they start to introduce this concept of mid-training here; I think they're the ones who really started pushing on it. And of course, in their new thinking models, mid-training is basically the whole story, "mid-training is all you need." But this paper was essentially: synthetic data, SFT, DPO, and you can get pretty good models. Here are the benchmarks for how it performed at the time. As they always claim, it's roughly GPT-4o-mini level, similar to the 70B models, similar to GPT-4o in places. This is a little out of date now, but still good context for the overall picture.
Step zero is data generation: they have this whole method of seeding data, expanding it out, post-training, and so on. Performance is claimed to be on par with even much larger models, the 405B class in places, but they've been known to not have the best track record of how they actually perform off-benchmark. From what people say, Phi-4 is where things started to turn around, especially because of the decontamination work. They really kick off the paper by addressing overfitting and data contamination, since one of their original pitfalls was that earlier models had learned to overfit. So they improved their data decontamination process. Basically, they have good ways to take contaminated samples out of datasets; it's what you'd expect: they use LLMs to filter, they avoid training on variants of benchmark data, and they evaluate on contamination-proof benchmarks. Okay, what else have we got here?
The purpose of synthetic data: it's pretty good, right? It's a form of distillation with SFT, it gives structured, gradual learning, and chain-of-thought data is in there. So: synthetic data for pre-training and mid-training. Here's roughly how they do it. It begins with high-quality seeds from multiple domains. They start seeding data from big sources, web scrapes, and run this seed-filtration process: take big datasets, use an LLM, and start filtering to identify the highest-quality slice of the data. Then they generate synthetic examples from those seeds and use them as more training data. They talk quite a bit about running multiple epochs over different sub-datasets, and they created around 50 types of synthetic datasets. For web- and code-based seeds, for example, there's a two-stage filtration process: first, identify pages with strong educational content; second, segment the selected pages into passages and score each one for its factual and reasoning content. Basically, they're trying to filter for what counts as good reasoning data. For question datasets, they discard questions where all sampled answers agree (i.e., the question is too easy) or where the answers are inconsistent.
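Here's a minimal sketch of that kind of agreement-based difficulty filter; `sample_answers` and the thresholds are hypothetical helpers, not the paper's pipeline:

```python
from collections import Counter

def keep_question(question, sample_answers, n=8, low=1, high=7):
    """Agreement-based difficulty filter (illustrative only).

    sample_answers(question, n) is a hypothetical helper that asks an LLM
    for n independent answers. We keep a question only when the answers
    partially agree: full agreement means it is too easy, and no agreement
    at all suggests the answers are unreliable/inconsistent.
    """
    answers = sample_answers(question, n)
    top_count = Counter(answers).most_common(1)[0][1]
    return low < top_count < high  # keep "medium difficulty" questions

# filtered = [q for q in raw_questions if keep_question(q, sample_answers)]
```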
So: a lot of data filtration, plus things like deduction chains and explicit logical reasoning steps to make the data more useful. Then they rewrite and augment data: seeds are transformed into synthetic data through multi-step prompting workflows, which includes rewriting the most useful content in a passage into exercises, discussions, and structured reasoning. Once you have a lot of text, you subset the best part of it, the actual hard questions, and from there you rewrite it into exercises, create discussions around it, and add structured reasoning for how you get to the answer. That's how they do that. There's also self-revision, and instruction reversal for code: you have high-quality code, so reverse it and recover the instructions and generation steps that would produce it.
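A minimal sketch of what instruction reversal could look like in practice; the `llm` callable and the prompt wording are hypothetical, not the paper's implementation:

```python
def reverse_instruction(code_snippet, llm):
    """Instruction reversal (illustrative): given high-quality code, ask a
    model to write the task description that would produce it, then keep
    (instruction, code) as a synthetic training pair.
    `llm` is any text-completion callable you supply."""
    prompt = (
        "Here is a piece of code:\n\n"
        f"{code_snippet}\n\n"
        "Write the programming task or instruction this code solves, "
        "as if asking someone to implement it from scratch."
    )
    instruction = llm(prompt)
    return {"instruction": instruction, "response": code_snippet}
```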
Filtering web dumps was another approach: they train small, non-LLM classifiers on annotated examples to decide what to keep from web dumps. They keep subsets of filtered and unfiltered data, and they over-index on STEM-heavy content; they just want that high-signal material.
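Here's a minimal sketch of the kind of small non-LLM quality classifier described, using scikit-learn; the toy labels and features are mine, not the paper's setup:

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy annotated pages: 1 = keep (educational / STEM-heavy), 0 = drop.
pages = [
    "Proof: by induction on n, the base case n=1 holds because ...",
    "BUY NOW!!! limited offer, click here for free coins",
    "The derivative of sin(x) is cos(x); we verify this from the limit definition.",
    "celebrity gossip latest photos trending",
]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(pages, labels)

print(clf.predict(["We compute eigenvalues by solving det(A - tI) = 0."]))
```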
They also have multilinguality in the mix, sorry, not multi-modality: 176 languages, plus more extraction and cleaning. Okay, post-training: post-training has SFT and DPO. Then we get into their training mix; maybe five minutes to cover what they're doing here. They start with a 4K context length and extend it to 16K during mid-training. They have a new 100K-token vocabulary that includes unused tokens. Turns out they just left that note in the December paper, and those unused tokens are what the reasoning tokens end up becoming: in the reasoning models they repurpose the unused slots as think tokens, and that's how they're able to do it. Previously, Phi-3 used sliding-window attention; now they do full attention over the 4K context. The model is trained from scratch on 10 trillion tokens, and they give you the usual details: learning rate, weight decay, batch size, all that. But the headline is: pre-trained from scratch on 10 trillion tokens, a lot of them synthetic.
After that, they have a mid-training stage where they do the context-length extension from 4K to 16K. The interesting thing across all their models, including the reasoning ones, is the staging: in the reasoning work they have four stages of training. One of the things they noted there was that when they run GRPO they hit vanishing-gradient problems related to response lengths. So at each stage they report how they fixed the problem, how performance is doing, and how many points they gained on different benchmarks. It's a pretty interesting way to lay it out, with a lot of lessons learned. Okay, so Phi-3 was two phases: phase one was largely filtered web text, phase two was basically a small subset of reasoning-heavy tokens, the high-quality code and math. Here they show that web datasets gave only small benefits on reasoning-heavy benchmarks, while models trained on only synthetic data underperformed on knowledge-heavy benchmarks. So they needed to fix both.
TL;DR, here's their pre-training data mixture: they search over different allocations of tokens coming from various sources, mainly synthetic data, web rewrites, and filtered web (divided into reasoning-heavy and knowledge-heavy portions), plus targeted acquisitions and organic data, so math, books, forums, code data, stuff like that. And there's a correlation between 7B and 14B models: they noticed the scaling behavior was pretty consistent, so they ran a lot of their data-mixture experiments on a 7B model and then transferred that understanding over to the 14B. Let's go a little faster. Here's the final mixture over their roughly 10 trillion pre-training tokens; probably 95% of the training is here in pre-training, with the rest in mid- and post-training.
The interesting thing is the number of epochs over different data. Synthetic data, for example, is only 290 billion unique tokens, but it ends up being about 40% of total pre-training because they go over it for roughly 14 epochs. To keep the rest general: about 30% of the training tokens are web and web rewrites, namely 1.3 trillion tokens of web scrape plus 290 billion tokens of web rewrites, with multiple epochs on the rewrites. The remaining tokens are largely synthetic data (that 40%), about 20% is code, and then the acquired sources are things like textbooks. A quick back-of-the-envelope check of those epoch numbers is below. That's most of the pre-training.
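Quick sanity check on that epoch math (my arithmetic on the numbers quoted above, not a table from the paper):

```python
total_pretrain = 10e12      # ~10T tokens of pre-training
synthetic_unique = 290e9    # ~290B unique synthetic tokens
epochs = 14                 # rough epoch count mentioned above

effective = synthetic_unique * epochs
print(f"effective synthetic tokens: {effective / 1e12:.2f}T")
print(f"share of pre-training:     {effective / total_pretrain:.0%}")
# -> about 4.06T effective tokens, i.e. roughly 40% of the 10T total
```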
Then comes mid-training, where they extend the context length. Here they keep the high-quality non-synthetic datasets separate, and they filter the pre-training sets for samples that are genuinely long-context: what's over 8,000 tokens, what's over 16,000. They upweight those long subsets, and they also generate synthetic datasets that go beyond 4K. The data mix for this long-context extension is 30% newly curated and 70% reused from the original pre-training.
Then here's how it performs on different long-context evals: recall, RAG, re-ranking, summarization. They show pretty good numbers on these. And they explain, for people who don't know long-context benchmarks, what the standard ones are: recall is basically retrieving a corresponding value from a randomly generated long file; RAG is answering questions over retrieved documents, a subset of QA datasets; re-ranking is being given 10 documents and a query and having to re-rank them; QA is question answering over a long document. Oh, my PDF reader has frozen. Okay, never mind, I just broke it a little bit. That's cool.
After that they have post-training: they do DPO, basically turning it into a chat-tuned model instead of a base model. An interesting thing to note from this week's thinking-model paper: one approach they tried was, instead of building on the instruction-tuned Phi-4, what if we take the base model and do all the thinking training from there? They found it actually does pretty well, but not well enough; spoiler, they end up using the instruction-tuned model. But they're very open about how they tested this. So: they use ChatML with user/assistant turns, they do DPO, they do SFT, they include multilingual data. This part is pretty interesting: they only need about 8 billion tokens for SFT and post-training, versus the 10 trillion for pre-training. DPO handles alignment; SFT handles instruction following. I think that's enough for a high-level view of Phi-4.
Performance is about what you'd expect for a 14B, and there aren't many 14Bs anymore. They have sections on red teaming and weaknesses; there's a lot of genuinely open research in there. But that's where they sit: a 14B model, a lot of filtration, pre-training, this mid-training stage, SFT, DPO, and it performs pretty well, on par with what you'd expect. It came out after Qwen2.5-14B-Instruct and is slightly better, and better than Phi-3, of course. Against GPT-4o mini or GPT-4o, it's better in some cases: on math and GPQA they say it's better; on HumanEval and coding tasks it's slightly worse; SimpleQA is a weak spot. One thing they note is that small models just struggle with factual recall: a small model can't memorize that many facts, so on knowledge QA it's not the best, and a big model will always do better there. But that's Phi-4 at a high level. It's cool. It's not better than Llama 3.3 70B overall, though in some areas it is, and they got that reasoning-flavored quality through SFT and a bit of DPO with heavy filtration and a lot of synthetic data. Okay, I'll pause here since that's mostly Phi-4. From there we'll go on to the two reasoning papers. Any questions, thoughts, comments, concerns? Has anyone used it? Probably not. They do get some use now; they're actually hosted by most providers, and they're on Azure and so on, of course. Some people have started using them. But anyway, any questions, any thoughts?
Okay, someone in chat has a question: "It's interesting to me that Phi-4-reasoning is distilled from o3 and Phi-4-mini-reasoning is distilled from DeepSeek. Why do they use completely different curated SFT datasets, and also somewhat different SFT strategies?" Yeah, it's an interesting note. So for the papers they released this week, let me change my screen share real quick. The base Phi-4 paper came out in December; now they've released two papers covering, kind of, three models. Phi-4-mini-reasoning builds on Phi-4-mini, so it's a three-point-something-B model, let me check real quick, roughly 3B, sorry, 3.8B. And they show there's a whole different formula for getting a small, small model to do reasoning than there is for a large model. One of the things someone in chat pointed out is that for Phi-4-mini-reasoning they do their distillation from DeepSeek, but for the large Phi-4-reasoning, the 14B and the plus variant, they distill from o3-mini, which is interesting. Who knows why.
But there are so many little gems in this Phi-4-mini-reasoning paper. For example, they cite a bunch of recent work on getting small models to reason. It basically starts with DeepSeek-R1: R1 had distills, the Qwen and Llama distills, and Microsoft shows how they can do better. After that there was a bunch of follow-up, stuff from OpenHands, stuff from s1, and they make little notes on each; there's also OpenThinker and Bespoke Labs. One thing they noticed: take the s1 and LIMO datasets. s1 was basically a thousand samples of high-quality reasoning data SFT'd into Qwen 32B, I believe, which produced a really good reasoning model, "a thousand samples is all you need." They show that if they run that same thousand-sample SFT on their 3.8B mini model, it actually performs worse. Even though it worked for s1, s1 did it at 32B: take a competent instruct model, SFT on a thousand really high-quality reasoning samples, get reasoning, and the benchmarks shot up like crazy. The s1 paper was cool, but pretty basic. They show that doing the exact same thing on Phi-4-mini, a competent, instruction-tuned 3.8B, with a thousand samples of SFT on a really good reasoning dataset, makes the scores go down a lot: the base model scored roughly 10, 78, and 37 on their three benchmarks, and it dropped to about 3, 47, and 26. So TL;DR, that recipe doesn't transfer, and they keep saying we need to explore a better recipe for reasoning in tiny models. This paper is basically a blueprint for that. It is based on DeepSeek, though; to answer the chat question directly, I'd assume they use DeepSeek because they're specifically comparing against the R1 Llama-8B and Qwen-7B distills, so they're training from the same teacher they're comparing to, and why not use DeepSeek? But anyway, before we move on to that and then the large one, any questions on regular Phi-4?
Okay, we move on. Someone asked about the risk of training on synthetic data and the effect of hallucinations. If you look deeper into the synthetic data processing, there's heavy, heavy filtration. For questionable samples, or samples with a good chain of thought but the wrong final output, they handle those cases explicitly: they have verifiable math outputs, verifiable code, really good pipelines to test whether outputs are correct or not. A lot of this is shown in the Phi-3 paper as well. Is there any justification for doing so many epochs on synthetic data, or any rationale behind it? In Phi-4 they basically argue their synthetic data is just higher quality: start from a decent web scrape, filter it down, then expand it, and the result is better-quality data, sorry, I had the reasoning paper open for a second. Phi-3 shows some of this too. The number of epochs is interesting, though, especially around the context-length extension. They also come back to it towards the end, but I don't think it's worth covering there since those are more about reasoning benchmarks.
Some of the interesting things they note: they use this method of packing and unpacking prompts, and they also repeat long-context examples that were already seen in pre-training. So their epoch counts don't necessarily reflect exact numbers of distinct samples, though I'm sure the effect is minor. In pre-training, for example, the web data is 1.3 trillion tokens at about 1.2 epochs. In the mid-training context-length extension, they again reuse the previous samples that were over 8K context, plus newly created synthetic data. So the accounting is a little skewed, but the takeaway is simply that synthetic data is good. The other nice thing about these papers is that they cite basically everything, a lot of citations. What they show here is: lots of synthetic data, model go good. And as you'd expect, the reasoning models follow the same trajectory: lots of synthetic data, model can reason, model get good. But let's get into the mini reasoning model.
Phi-4-mini-reasoning. This is the 3.8B reasoning model. The framing: improving small-language-model reasoning remains challenging due to limited model capacity. Also, are we seeing my entire screen? Okay, we are, just making sure we're on the right paper. They bring up R1-style distillation: distillation from LLM-generated synthetic data can improve reasoning, and even plain SFT works. This work is trying to establish a systematic training recipe for small language models. There are four steps. Step one is large-scale mid-training on diverse distilled chain-of-thought data. Step two is SFT on high-quality long chain of thought. Step three is rollout DPO, leveraging a carefully curated preference dataset. Step four is RL with verifiable reward. So four steps: here's how you get reasoning into a tiny model. At the end they show that their compact 3.8B reasoner can outperform DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B by a bit on math. And they're being careful not to overfit here: they have sections on things like calendar planning, where they say this is something we didn't explicitly train on at all, we filtered those datasets out of training, but the model generalizes and does really well on the task. Basically, they validate that a carefully designed training recipe with large-scale, high-quality chain-of-thought data is effective at unlocking strong reasoning capabilities even in capacity-constrained small models.
I don't know what their team is up to with chart colors, but here they go all pink and purple. Anyway, here's AIME 24, MATH-500, GPQA Diamond. The original Phi-4 is this shade of purple. The R1 distills, you can see, are actually pretty good, the Llama 8B and Qwen ones, but their tiny model at half the parameters does even better. They're winning. I thought this was funny: equal contribution for everyone except the first author and the last two authors. Screw the others; they're just listed first, but not equal contribution. Those charts are pink, these charts are colorful and different, but okay, back to this. I feel like me sitting here reading charts out loud is useless. Oh no, my reader is broken again. Okay. If you're interested in how they compare small models: it's interesting that they put GPT-4o mini under the small-model section instead of the large-model one, with GPT-4o over here. We don't really know how big GPT-4o mini is, but it's there nonetheless. Oh, wrong chart, my bad, let me go back.
Okay, sorry, wrong chart. Basically, they start with their intro: chain of thought is a way to write out reasoning steps, it's cool, and they say small language models struggle with chain of thought, which I thought was interesting. Enhancing reasoning capabilities is easier for large language models because of their capacity; it remains challenging for small models. DeepSeek-R1 showed that non-logit-level distillation, basically just SFT on synthetic data, gets you good reasoning performance. Then they cite the other work: Bespoke Labs' 7B, OpenThinker-7B, showing you can do this with SFT; some people suggest GRPO, like DeepScaleR; and s1 and LIMO show that even small sample counts, a thousand samples, can give good reasoning.
Rather than focusing on isolated techniques, they explore training paradigms specifically tailored for small language models. Concretely: two stages of distillation, followed by rollout-based preference learning that reuses the wrong LLM-generated samples, concluding with RL with verifiable reward. Initially they employ distillation as a mid-training mechanism to give the foundation model reasoning capabilities, then they apply distillation again in fine-tuning to further improve generalization. Then there's LLM rollout sampling for the preference stage: incorrect outputs are typically discarded, but they want to still use them, so they build a preference set out of them. Roughly, if the answer is incorrect the model should still have thought hard about it, and on correct examples they want conciseness; they talk about this later. Finally they fine-tune with RL for final-answer correctness. Okay, more background.
The optimal distillation strategy for small models is still unexplored; they keep repeating this. They also show that data diversity and quality are very important, and that applying techniques in isolation degrades performance: they tried the s1 recipe on Phi-4-mini and performance went down, which I thought was pretty wild. Their goal, once again, is a comprehensive, efficient training recipe for small language models. Non-reasoning models need a mid-training stage to absorb a large volume of reasoning trajectories before any additional techniques are applied; in other words, if the model is small, you need mid-training. They love this concept of mid-training. The obvious questions, then, are: how much mid-training do you need, and what do you do after it, careful distillation, preference learning, RL, what comes next? They systematically work through these questions and propose a recipe for building small reasoning models.
Mid-training is basically what happens after pre-training and before post-training. You have SFT in post-training, but before that, when you just have the base model, is there something you can do to instill this kind of reasoning, before that last stage of DPO and RL on the most carefully curated data? Basically, if you take an instruction model like Phi-4-mini and directly hit it with 1,000 samples of high-quality reasoning SFT, you're cooked; you're not getting reasoning. Small models can't pick it up that quickly. So mid-training is where you add a stage of chain-of-thought and extended synthetic reasoning data: roll out reasoning traces, include thinking steps, do a lot of them, and only then move on to the core high-quality examples. That's basically what it is: train on chain of thought before your RL.
Multi-stage continual training for reasoning: first they train on a curated chain-of-thought reasoning dataset, then they run RL with verifiable rewards. So, can distillation be used as mid-training? Yes. They train the base model with next-token prediction on an extensive corpus of synthetic chain-of-thought data. The chains of thought are generated by DeepSeek-R1, and they apply rejection sampling so that only correct answers are kept (more on this in section 4). They pair questions with their corresponding correct chain-of-thought answers and train the base model with the standard causal language modeling objective.
They also use a packing mode. What they're doing here is not just SFT on single reasoning examples: multiple examples are packed into the same input sequence to increase token efficiency. So they're not specifically trying to learn one input-to-output mapping of reasoning steps then answer; instead, several examples of input, chain of thought, output get concatenated into one sequence, and in mid-training the model is just soaking up "here's what chain of thought looks like" rather than clean answer generation. It's an effective way to let mid-training iterate over as much chain-of-thought data as possible, until the model starts to perform well on a validation set. A quick sketch of what packing looks like is below.
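A minimal sketch of greedy sequence packing, assuming a hypothetical `tokenizer` with an `encode` method and a fixed sequence length; this shows the generic technique, not the paper's exact implementation:

```python
def pack_examples(examples, tokenizer, max_len=4096, eos_id=2):
    """Greedily pack tokenized (question + chain-of-thought + answer)
    examples into fixed-length sequences to reduce padding waste."""
    sequences, current = [], []
    for text in examples:
        ids = tokenizer.encode(text) + [eos_id]   # separate examples with EOS
        if current and len(current) + len(ids) > max_len:
            sequences.append(current)
            current = []
        current.extend(ids[:max_len])             # truncate pathological cases
    if current:
        sequences.append(current)
    return sequences  # each entry is trained with the usual causal LM loss
```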
Then there's distillation for SFT. After the model has started to learn how to produce chain of thought, they do fine-tuning, really just continued training, in a non-packing format, and that's where you teach the model where to stop generating. After that comes rollout preference learning. The previous two stages train only on accepted generations: they've filtered out all the incorrect examples and used only positive chains of thought, question, chain of thought, answer, packed and then unpacked, so the model knows how to think and where to stop. Now they want to use the rejected rollouts too, to squeeze out more performance. This is how they get diversity, enough thinking, and conciseness. Basically, incorrect responses that differ from the correct ones by only minor nuances provide very effective candidates for constructing informative preference pairs. The preference dataset is built by using correct rollouts as preferred and incorrect rollouts as dispreferred. So now we do DPO: here's a chain of thought with the correct answer, here's a chain of thought where it messed up a little bit, let's learn from that.
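A minimal sketch of how such rollout preference pairs could be assembled in the standard DPO (prompt, chosen, rejected) format; the `is_correct` verifier is a placeholder you'd supply:

```python
import random

def build_preference_pairs(question, rollouts, is_correct):
    """Pair each correct rollout with an incorrect rollout for the same
    question. `rollouts` is a list of model generations; `is_correct(r)`
    is any verifier (e.g. exact match on the final boxed answer)."""
    good = [r for r in rollouts if is_correct(r)]
    bad = [r for r in rollouts if not is_correct(r)]
    pairs = []
    for chosen in good:
        if not bad:
            break
        rejected = random.choice(bad)   # incorrect rollout as dispreferred
        pairs.append({"prompt": question, "chosen": chosen, "rejected": rejected})
    return pairs
```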
Then comes RL with verifiable reward. We've done some alignment with DPO, so the model prefers the kind of pairs we showed it; now they run RL on top of the distilled and preference-trained model. They describe the RL algorithms they implemented. PPO uses a clipped surrogate objective to limit the policy update so it stays close to the previous policy, which stabilizes training. And then GRPO, everybody loves GRPO, which compares rewards within a group of multiple responses: for each question you sample a set of responses under the old policy, compute their rewards, and measure each one against the group average, with a verifiable reward for the correct answers. Standard RL stuff these days.
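For reference, the standard PPO clipped objective and GRPO group-normalized advantage being described look like this (standard formulations from the RL literature, my notation rather than the paper's):

```latex
% PPO clipped surrogate, with probability ratio
%   r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t):
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]

% GRPO: sample G responses to the same prompt, score them with the
% verifiable reward, and normalize each reward against the group:
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}
                 {\operatorname{std}(r_1,\dots,r_G)}
```

Note that if every response in a group gets the same reward, all the group-normalized advantages are zero, which is exactly the vanishing-gradient issue they describe next.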
Next, the issues. In a pilot study applying GRPO to train the base model, they observed three problems that affected the stability and effectiveness of training. First, high variance in response lengths: although the base model after mid-training can already generate chain-of-thought responses, there's substantial variability in response length within the same GRPO sampling group. For some prompts, positively rewarded responses ranged from 12,000 to 20,000 tokens, which is a big spread, and optimizing with standard GRPO led to instability. Second, vanishing gradients, which is what you'd expect: when the sampled responses in a group all end up with identical rewards, there's zero variance in the returns and the gradient vanishes. Third, the model is sensitive to intra-group length discrepancies, which required extending the GRPO batch size to 128; with more samples per batch it worked. They hypothesize these issues are more prominent for small language models, where RL stability is likely more fragile than for large models.
Okay, moving on quickly since we're short on time. What else? Synthetic chain-of-thought data generation. They construct a large-scale reasoning dataset composed of LLM-generated synthetic reasoning trajectories, built on a bunch of pre-existing open datasets; they list each one, its size, and whether it already has reasoning traces. For datasets that already had reasoning traces, they use the annotations directly, so Bespoke Labs, OpenR1-Math from Hugging Face, and OpenThoughts, which is roughly 350,000 samples with reasoning annotations. For the other six or seven datasets, the ones lacking such trajectories, they retain only the math questions and generate new chain-of-thought answers using the big R1, the 671B model, sampling approximately eight rollouts per question. In total they collect 10 million rollouts across 1.6 million samples. So: big R1 as the teacher, math only, about eight rollouts each, 10 million rollouts total. And to be clear, all the previous training steps, the mid-training and so on, use this dataset; this section isn't a new training stage, it's just explaining the data behind the steps above. Okay. This next part is kind of interesting too.
For math questions that are verifiable, they first apply math-verification tools to assess correctness. Some automatic verification fails on complex solutions, so they additionally employ GPT-4o mini to re-verify rollouts that were initially flagged as incorrect. I thought that was funny. We're in mid-2025, we have o3, we have o1, we have Gemini 2.5 Flash, we have Sonnet, all these big models, but let's verify our math with GPT-4o mini. Think about that for a second: we have benchmarks, we have tables, we have everything, but for checking whether a rollout was correctly flagged as wrong, they reach for GPT-4o mini. Why GPT-4o mini? That's for Microsoft to know and not tell us.
Could they have used a bigger model? I think so. But anyway, that's what they do. Okay, experiments and evaluations: they evaluate the model on three mathematical benchmarks, and they seem a bit obsessed with token cost, hence the GPT-4o mini comparisons. I think the GPT-4o mini comparison is just mini model versus mini model; GPT-4o mini also gets compared against the big 14B Phi-4, and based on outside estimates people say GPT-4o mini's active parameter count is small, so it's a fair comparison for token cost. I mean, I don't know. These are not cheap models to train: a 14B model trained on 10 trillion tokens is a lot of compute, millions of dollars' worth.
One thing I highlighted, which I might have skipped over, is how cheap training Phi-4-mini-reasoning was. I'm not going to make a "DeepSeek-R1 cost $5 million" kind of claim, and obviously a lot of filtration, time, synthetic-data generation, and inference went into it, but they trained this on, I believe, 32 H100s for around two and a half days, or maybe 64 H100s, either way very few nodes. This is something someone could realistically do themselves if they wanted to. If someone can pull up the Phi-4-mini-reasoning model card on Hugging Face, it lists the GPUs used; link it in chat and we'll go over it real quick. That's something I wanted to check: training the reasoning model itself, outside of generating the data, which honestly was mostly a lot of DeepSeek-R1 inference plus open datasets, didn't cost that much. A couple of nodes of H100s for two and a half days is on the scale of tens to hundreds of thousands of dollars. So not that crazy.
What else? On results: once again, they're very cautious about overfitting on benchmarks. They do three runs and report the averages, and they have a whole section arguing we need to better evaluate reasoning models, plus a section on reasoning versus non-reasoning models, where these evals sit, baselines and so on. The upshot is that the thing's pretty good. Training strategies are very straightforward: for the distillation stages they give the batch size, learning rate, number of epochs, warmup ratio, sequence length, packing versus non-packing. If you really care about that, go read it; we're running short on time, so I won't read it out.
On scores, here's where they sit, and once again they show every stage. Basic Phi-4-mini was bad at these math and reasoning benchmarks. o1-mini: good. The R1 distills: pretty good. The other small-model efforts: not so good. Base Llama: not good. Add distillation mid-training: a big jump. Add distillation fine-tuning on top: even better. Add rollout DPO: better again. Add RL with GRPO: wow, we're good, it beats everything. Well, it doesn't literally beat everything, but it's strong: it beats R1-Distill-Qwen-7B and R1-Distill-Llama-8B, which, when DeepSeek came out, is what a lot of people were running for local inference.
Tyler has linked the Hugging Face page, let's check it out real quick. Context length, okay, and which model is this? Phi-4-mini-reasoning. Model quality, and there it is: 128 H100 GPUs for two days of training. So not that much. Assuming $2 per H100-hour, 128 GPUs times 48 hours is roughly $12K. The math checks out, except it doesn't quite: you can't really assume $2 per H100-hour when you're renting whole nodes for training, so take that $12K and maybe double or triple it. But the point stands: tens of thousands of dollars for this, and hundreds of billions of training tokens rather than trillions. This is the kind of run that, low-key, could have been done earlier; quick cost math below.
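The back-of-envelope cost math spelled out; the $2/GPU-hour rate is an assumption from the discussion, not a quoted price:

```python
gpus = 128            # H100s listed on the model card
hours = 48            # ~2 days of training
usd_per_gpu_hour = 2  # optimistic marketplace rate (assumption)

base_cost = gpus * hours * usd_per_gpu_hour
print(f"naive estimate: ${base_cost:,}")                       # $12,288
print(f"node-level pricing, 2-3x: ${2*base_cost:,}-${3*base_cost:,}")
```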
Thank you, Tyler, for the link and the quick math. They lay out stage-by-stage performance, ablations, fun charts, and safety, we need safety: theirs stays consistent while others go down. What else? Conclusion. They love their line: small models, when trained with deliberate data selection and training strategies, can match or even exceed the capabilities of larger models. They present this work as a blueprint for developing efficient, high-performing reasoning models under resource constraints. Now, one of the interesting threads is the other work referenced here, the distillation models. Distillation with SFT is cool, and DeepSeek showed it really works. Let me see if this chart compares against base Llama; I believe it's in there at the bottom. Let me change highlight color real quick, so for regular Llama 3.2, oh, they used the 3B, never mind, I was trying to compare 8B to 8B. But okay, just look at Phi-4-mini in the same table, which holds up against the 8B on these reasoning datasets: as a base model it still did really, really badly. What DeepSeek showed is that with basic SFT on reasoning data you can get pretty good performance. What they show here is that with a well-thought-out process designed specifically for small models, you can do even better than what people were so amazed by before. So this deliberate recipe really works, and they're presenting it as a blueprint for small-model thinking plus RL.
What was interesting is that this stuff doesn't directly transfer over. s1, and someone correct me if I'm wrong, was done on Qwen 32B, one of the larger models, where you take a thousand samples and suddenly you have really good reasoning. They tried that exact recipe on their small model: they took Phi-4-mini, did s1-style training, and not only did it not improve as much, it actually got worse. So these tricks don't transfer at face value. For regular foundation-model work it does transfer: with the regular 14B Phi-4, they ran a lot of their pre-training data-mixture experiments on a 7B and then carried the findings over to the 14B for the actual training run. Where is this... somewhere in here... yeah, right here: "we observe a high rank correlation between the performance of 7B and 14B models on different data mixtures, given a large enough distance between the two data mixtures; this allowed us to conduct experiments at the 7B scale and transfer findings to Phi-4," which is the 14B. Someone asked whether the key difference between the 32B case and this one is mid-training. Not exactly: the big one here is a 14B, which also had mid-training. For the small reasoning model it's really a mixture of things, there's a whole set of them, but the gist is: learn chain of thought, packed and then unpacked so it knows where to stop, then do the preference learning and RL.
That's the better recipe for a mini reasoning model. There are questions-and-discussion sections, but this is not the only paper that dropped. In fact, I was bamboozled an hour ago when I sat down to read these and realized there are, low-key, two and a half papers. We needed Phi-4 base to see what they did. Then there's Phi-4-mini-reasoning, the 3.8B. Then there's also Phi-4-reasoning, and Phi-4-reasoning isn't one model, it's actually two. Phi-4-reasoning takes the 14B model and turns it into a reasoning model using o3-mini traces. Then there's Phi-4-reasoning-plus, where they do the same thing but add RL on top. And now they're comparing against the 70B distills, and guess what? It does well. It outperforms the 70Bs.
Same benchmarks, reasoning benchmarks. They write that the benefit of careful data curation and SFT extends to reasoning models, and that it can be further amplified with RL. So, similar to how OpenAI has o3-mini low/high, o1 pro, all that, they can offer tiers too: Phi-4-reasoning and Phi-4-reasoning-plus. Could this paper have just launched Phi-4-reasoning-plus and called it Phi-4-reasoning? Yes, they didn't have to ship two. It's very similar to how, in the mini-reasoning paper, hold on, this is the big one, one sec, in the mini-reasoning paper they show every stage of training and how it performed; they could just as well have shipped a Phi-4-mini-reasoning and a Phi-4-mini-reasoning-plus, since each stage keeps improving. In this one, I guess the point they're making is that one additional RL stage takes you from Phi-4-reasoning to Phi-4-reasoning-plus. TL;DR: they're getting o1-mini / o3-mini-level performance and beating the R1 70B distill.
Okay, I'm going to try to get through this in three minutes; sorry for the bad use of time. How they do it: it's the 14B Phi-4, supervised fine-tuned, so they do SFT; then reasoning-plus gets a further round of RL. The SFT uses 1.4 million prompts paired with high-quality answers containing long reasoning traces generated by o3-mini. Prompts are filtered to cover a range of difficulty, and they specifically want problems the regular model can't answer. Someone asked whether there's any traction on these: the Phi models have decent traction; the reasoning ones just came out, so I don't know yet. Anyway, they filter out anything the regular Phi model can already solve and keep only what it can't. My highlighting got weird, sorry. The dataset used in SFT covers STEM topics, coding, and safety-focused tasks. The reasoning-plus model is then trained with RL on a small set of about 6,000 high-quality, math-focused problems with verifiable solutions. Kind of interesting, right? They get that much performance gain from just ~6,000 RL samples. They discuss how RL has high variance but also high impact when done correctly; but since it's math-only, LiveCodeBench went down a little bit. Interesting thing to note.
Basically, the whole training pipeline matches what they've done in previous Phi models: once again they want to show that good data curation and synthetic data make small models good. The model performs better than o1-mini and the 70B distills, and they also outperform Claude 3.7 Sonnet Thinking on all tasks except GPQA and calendar planning. Performance is cool. This part was interesting: both models show improvements over the base model, including on math-specific benchmarks, notably an improvement of around 50 percentage points in places. Surprisingly, the models also improved by 30 to 60 percentage points on algorithmic and planning problems like calendar planning, which demonstrates increased generalizability of reasoning skills to domains that weren't directly targeted during SFT or RL. So on stuff they didn't target and didn't train on, it still generalizes. Very cool. There's also improvement on general benchmarks, of course. Then a wall of numbers, and the thinking-effort-versus-accuracy trade-off, which was interesting: the reasoning-plus model, the one with the extra RL that scores better, uses approximately 1.5 times more tokens than plain Phi-4-reasoning, though the difference is less pronounced on other reasoning domains; that's the average. Some domains, like coding, planning, and spatial tasks, are still open avenues for improving the RL.
Uh, these are all avenues to improve RL. Okay. Keep going through quick five, four demonstrates 00:52:57.840 |
reasoning responses. They show some examples that the reasoning one did that the other one couldn't. So 00:53:03.600 |
this is a word play riddle. This is a question, you know, I have coin tosses. What's the chance I see 00:53:11.440 |
exactly 1.2 heads five, four would give you, you know, an actual math calculation. The reasoning model 00:53:17.840 |
is like, you can't get exactly 0.2. So probability is zero. Um, pretty cool. Pretty cool. More stuff, planning, 00:53:25.840 |
games. Crazy. It can do games, data stuff, uh, seed database. We specifically target seeds situated at 00:53:35.040 |
the edge of five four's current ability. Additionally, to maximize the focus of reasoning skills and data 00:53:40.160 |
set, we prioritize prompts that demand complex multi-step reasoning, as opposed to those, uh, testing 00:53:46.880 |
factual knowledge. So they want reasoning stuff. They want teachable things, synthetic seeds. Sorry, 00:53:52.240 |
I'm going quick. We're at like one minute left. So I'm going to go through this quick, quick, uh, 00:53:58.480 |
five, four reasoning, basically SFT on five, four with specific tokens. They have reasoning tokens. 00:54:05.200 |
They use two of those empty tokens that they have. They have think and, um, you know, think tags that they 00:54:10.720 |
add in, uh, increased token length. So they, they extend the 32 K synthetically generate examples of long 00:54:18.640 |
chain of thought over this. Our SFT data set is 1.4 million prompt response pairs, um, totaling 8.3 00:54:27.920 |
billion unique token samples. So this many extended out over these domains. Here's training. Here's SFT 00:54:34.800 |
steps. Here's improvement. Uh, during exploration stage, we studied effects of various design choices. 00:54:41.760 |
Here's that role of SFT seed hyperparameters, uh, role of system message. Of course, system message is useful. So, 00:54:50.880 |
um, basically to promote chain of thought, they tell the thing, uh, you're a system, you're a thinking 00:54:58.400 |
model, have your thinking and think tags, um, partially removing or replacing system messages was cool, but it 00:55:05.760 |
didn't work as well. Here's roughly the exact thing: you're a large language model trained by Microsoft, your role as an assistant 00:55:11.920 |
involves thoroughly exploring questions; please structure responses into two sections, 00:55:18.240 |
Thought and Solution, using the specified format, with the Thought section inside think tags and the Solution after; 00:55:23.600 |
in the Thought section, detail your reasoning steps, where each step should include analyzing, summarizing, 00:55:28.720 |
brainstorming; the Solution section should be logical, accurate, and concise; now try to answer the following with these guidelines. So that's the system message, and it helped. 00:55:35.520 |
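To make the SFT format concrete, here's a minimal sketch of what one of these training examples could plausibly look like, assuming a generic chat template. The think tags and the Thought/Solution structure come from the paper as described above; the role markers, the helper name, and the shortened system message text are illustrative assumptions, not the actual template.

```python
# Hypothetical sketch of a Phi-4-reasoning-style SFT example: a
# reasoning-promoting system message, the user prompt, and a teacher
# response whose chain of thought is wrapped in <think> tags. The role
# markers and shortened system text are illustrative, not the paper's code.

SYSTEM_MESSAGE = (
    "You are a large language model trained by Microsoft. Structure your "
    "response into two sections: a Thought section inside <think> tags, "
    "then a concise, accurate Solution section."
)

def build_sft_example(prompt: str, chain_of_thought: str, solution: str) -> str:
    """Format one prompt/response pair for supervised fine-tuning."""
    response = f"<think>\n{chain_of_thought}\n</think>\n{solution}"
    return (
        f"<|system|>{SYSTEM_MESSAGE}<|end|>\n"
        f"<|user|>{prompt}<|end|>\n"
        f"<|assistant|>{response}<|end|>"
    )

example = build_sft_example(
    prompt="I toss a fair coin 5 times. What is the probability of exactly 1.2 heads?",
    chain_of_thought="The number of heads is an integer, so 1.2 heads is impossible.",
    solution="The probability is 0.",
)
print(example)
```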
On the choice of base model: they considered 00:55:43.600 |
starting from the 14B checkpoint from before its own SFT and instruction tuning, like we talked about. Both starting points 00:55:51.200 |
worked pretty well, but the one that already had SFT and instruction tuning did slightly better, 00:55:58.320 |
so they use it. They attribute this to the addition of safety-focused post-training. Wow, safety is all 00:56:05.200 |
you need. On scaling: the final model is trained on roughly 16 billion tokens in this stage. Okay. 00:56:14.160 |
Real quick, the plus. They did about 6,000 samples of RL: outcome-based RL to enhance reasoning 00:56:20.240 |
capabilities. They start from roughly 72,000 math problems as seeds 00:56:32.480 |
and use a small subset of about 6,400 of those. Note, no coding, just math. The reward function basically 00:56:42.400 |
incentivizes correctness, penalizes undesirable behavior such as repetition and excessive length, 00:56:48.960 |
and encourages proper response formatting. It pushes the model to generate concise outputs 00:56:54.160 |
when it's correct and to think more when it's incorrect. They do that with GRPO plus a repetition penalty. 00:56:59.280 |
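Here's a rough sketch of the kind of outcome-based, length-aware reward they're describing: reward correctness, prefer shorter outputs when correct and more thinking when incorrect, and penalize repetition. The weights, thresholds, and helper functions below are made up for illustration; the paper's actual reward and the GRPO machinery around it are more involved.

```python
# Rough sketch (not the paper's exact formula) of a length-aware outcome
# reward: reward correctness, prefer shorter correct answers and longer
# thinking on incorrect ones, and penalize n-gram repetition.
# All thresholds and weights below are made up for illustration.

def ngram_repetition_rate(tokens: list[str], n: int = 4) -> float:
    """Fraction of n-grams that are duplicates (0 means no repetition)."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

def outcome_reward(is_correct: bool, tokens: list[str],
                   target_len: int = 4096, max_len: int = 32768) -> float:
    length_frac = min(len(tokens) / max_len, 1.0)
    if is_correct:
        # Full credit for being right, minus a small penalty for rambling.
        reward = 1.0 - 0.5 * max(0.0, (len(tokens) - target_len) / max_len)
    else:
        # Wrong answers score low, slightly less low if the model spent more
        # of its budget thinking (nudges it toward longer retries when wrong).
        reward = -1.0 + 0.25 * length_frac
    # Discourage degenerate repetition in either case.
    reward -= 0.5 * ngram_repetition_rate(tokens)
    return reward
```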
Training details: this was the one with a batch size of 64 over 00:57:08.960 |
32 H100s. So, to compare, the mini that we talked about used 128 H100s, but this reasoning-plus run 00:57:19.520 |
was only 32 H100s at batch size 64, for something like a couple hundred hours. They also do context 00:57:26.720 |
length extension here, but I think that's enough. I don't want to go too long over. They have evals. 00:57:31.680 |
Of course they have evals, but I want to, you know, even though we're over time, I want to give it two 00:57:36.240 |
minutes for questions. They have a whole main-findings and takeaways section, stuff like that. I'm gonna go over 00:57:42.480 |
chat real quick, but if anyone wants to pop in and share, please interrupt me now. 00:57:47.040 |
Yeah. So they call this SFT instead of DeepSeek's use of the word distill. Distillation has a notion 00:58:00.480 |
of a distillation loss, right? Where you compare output logits and push the student toward whatever the 00:58:05.680 |
big model says; that's actual distillation. They do note that what they're doing is basically just 00:58:11.520 |
SFT on the big model's outputs. That's the distinction. 00:58:16.480 |
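To make that distinction concrete, here's a small sketch contrasting a classic distillation loss (a KL term between student and teacher logits) with what's actually being done here, plain cross-entropy SFT on text the teacher generated. Shapes and the temperature value are illustrative.

```python
# Classic distillation matches the student's token distribution to the
# teacher's logits via a KL term; SFT "distillation" is just cross-entropy
# on tokens the teacher generated. Temperature and shapes are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def sft_on_teacher_outputs(student_logits, teacher_token_ids):
    """Plain cross-entropy against tokens sampled from the teacher."""
    vocab = student_logits.size(-1)
    return F.cross_entropy(student_logits.view(-1, vocab), teacher_token_ids.view(-1))
```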
Next question: what's the difference between Phi-4-reasoning and reasoning mini? Phi-4-reasoning is on their big 14B model. So Phi-4 14B 00:58:24.560 |
gets trained to do reasoning, and it's on par with o1-mini, o1, and the 70B distills. The reasoning 00:58:32.240 |
mini model is done on the small 3.8B model. So for reasoning mini, they do post-training on the 3.8B and show 00:58:40.240 |
that they can match the 7B distills. They also show that if you take the distillation recipe that was 00:58:47.760 |
used on big models and directly apply it to a small model, it doesn't work: they try the s1 data 00:58:53.280 |
set on a small model and it does worse. But they have a blueprint for how to do this for really small models. 00:59:01.200 |
Question: is there anything interesting in the mini ablation section? Maybe which part do they deem most important for 00:59:08.560 |
small models: following their narrative, is it mid-training, or do they imply the combination is the key? 00:59:14.000 |
Let's see, let's see. They say their distillation pipeline serves as an effective 00:59:20.720 |
approach to getting reasoning into small models. I don't think there's much in the ablations that's super 00:59:29.520 |
important. Basically the takeaway is that you need to do some mid-training, and they pose some open 00:59:37.040 |
questions. Yeah, cool. Okay, thank you everyone. Sorry for the last-minute paper change. 00:59:46.160 |
I low-key read this like an hour and a half ago, but hopefully it was useful. I'm sure some of this is 00:59:54.480 |
wrong, but yeah. Next week we should have Professor Tom, who does the by-hand illustrations, 01:00:05.440 |
talk about the differences in the Llama 1, 2, 3, and 4 architectures. I'll share more in Discord. He was 01:00:12.320 |
gonna just do Llama 4, but we pushed it a week so he has time to prep a better comparison of, 01:00:18.080 |
you know, what's actually changing between the series. So that should be next. We always need 01:00:22.480 |
volunteers. So if anyone wants to volunteer a paper, uh, let us know and we'll, we'll slot you in. 01:00:29.360 |
Flo says: I can't wait till GPT 5.4 drops. And I can't tell if people will be saying 01:00:38.960 |
"five point four" meaning GPT 5.4 or meaning Phi-4. People will be saying GPT 5.4, because no one talks about the Phi 01:00:46.400 |
models. These are cool, but really not many people talk about them, and not many people use 01:00:52.400 |
them, but it's, it's always nice to have, you know, it's, it's open research. They, they do a lot of work, 01:00:58.320 |
but yeah, nobody, nobody really talks about these. The only people that talk about them are people that 01:01:02.640 |
kind of shit on them for, um, training on benchmarks. If I'm not mistaken, they made 01:01:10.080 |
one of these multimodal, and Phi-4 (or Phi-3) with audio had the best transcription word 01:01:20.000 |
error rate. Someone fact-check me on this, and I'll follow up somewhere, but I'm pretty sure the 01:01:27.040 |
multimodal version of this had the best word error rate of any model. Now, that's not saying a lot, 01:01:32.240 |
because speech and transcription models are usually small, low-latency, and efficiency-optimized, and this is a fat 01:01:38.320 |
model. But yeah, multimodal large models get good, what can you say? Okay, I feel like 01:01:48.000 |
that's enough. Thanks guys. I would end the meeting, but Suix is host, so he will end it. 01:02:13.280 |