
The Phi-4 Reasoning Technical Report — Vibhu Sapra



00:00:00.000 | Let me also pull up chat real quick. But for those that don't know, there's the Phi series
00:00:10.940 | of models, right? So this started out way back when, when Microsoft was trying TinyStories.
00:00:17.860 | TinyStories was like a one-million to a few-million parameter model. So not even a billion, not even
00:00:23.280 | tens of millions of parameters. They just wanted to see, can we make a really small one
00:00:28.880 | million parameter model learn text, and can it produce any coherent words? Turns out it can.
00:00:35.100 | Then they had Textbooks Are All You Need. Or, you know, if you train on textbooks, you can do
00:00:39.980 | some level of words. Then they had this Phi series. These were really, really small language models,
00:00:44.800 | so like up to 1B. Then we went from super small, like million parameter, to billion, to now we're at
00:00:51.680 | Phi-4, which is a 14 billion parameter model. So they've gotten chunky. I think, you know,
00:00:57.760 | the sweet spot that they were hitting was kind of that three B range where no one else was really
00:01:03.460 | doing this, right? Like we had Llama, which was 7B, we had Falcon, we had Mistral. These are
00:01:09.140 | all 7B models. At the time, Qwen wasn't really doing anything tiny, tiny. So these were like the
00:01:14.740 | on device, coherent, somewhat decent models. The history of the Phi models is that they would do
00:01:21.660 | well on benchmarks. And then people would say, damn, these models suck. And it turns out that,
00:01:26.600 | you know, they were like accused of training on the benchmarks and they were inflating their scores.
00:01:33.000 | So Phi-3 was the, you know, last one that had a pretty big splash. Phi-3 was like, okay,
00:01:39.040 | here's a whole section on how we're doing like decontamination of training data. So
00:01:44.880 | they basically said, Hey, we're serious about this. We're not going to train on benchmarks. Here's how
00:01:50.940 | we like filter all our data. Here's how we train our big foundation models. They were kind of decent.
00:01:56.020 | Some people use them. Then recently we had Phi-4; this paper came out December 2024.
00:02:03.580 | From there, we, as of this week, have got two more papers called Phi-4-reasoning, reasoning
00:02:10.700 | mini, and reasoning plus. So let's do a quick little refresher on Phi-4, the one that came out,
00:02:18.220 | because this is kind of the foundation model that builds on top that the other two build on. So anyone
00:02:24.140 | has comments, questions, feel free, you know, pop in wherever. I feel like we covered this in one of the
00:02:30.780 | papers, but anyway, we'll go over it again real quick. So Phi-4 now is a 14 billion parameter
00:02:37.200 | model, and they do basically a bunch of high quality filtering. They use LLMs to filter a bunch
00:02:44.240 | of stuff, and then they have a bunch of synthetic data. So they want to show how like distillation can
00:02:51.100 | do pretty good. So they train on high quality data. They have the same architecture as Phi-3,
00:02:58.240 | but you know, now they're focused on reasoning benchmarks. So this came out like pre reasoning
00:03:03.360 | models. But this is the time when we started to realize, oh, shit, reasoning models are pretty,
00:03:09.200 | reasoning data is pretty high quality, right? So a bunch of the training data here is synthetic data.
00:03:15.600 | They have stuff like multi agent prompting, self revision workflows, instruction reversal. They have a
00:03:21.760 | section on rejection sampling where, you know, you take examples, you have chain of thought, you see what was
00:03:26.960 | wrong, they have rejection sampling in there, they do DPO, they start to introduce this concept of
00:03:33.200 | mid training here, I think they're the ones that start to like, you know, really push on it. And then of
00:03:36.720 | course, in their new thinking models, mid-training is like their whole thing; mid-training is all you
00:03:43.120 | need, you know. But this one was basically: synthetic data, SFT, DPO, we can get pretty good models.
00:03:49.920 | Mid-training is all they need. Here's kind of the benchmarks of how it performed at the time.
00:03:55.440 | So, you know, as they always claim, it's kind of GPT-4o mini level, it's similar to the 70B models, it's similar to GPT-4o.
00:04:05.440 | But you know, this is a little old, it's kind of out of date now, but still good to know what the overall thing is on.
00:04:11.440 | So step zero is this data generation, they have this whole method of seeding data, expanding it out, post training,
00:04:19.360 | stuff like that. And then you know, performance is on par with even GPT-4o, but they've been known to
00:04:28.160 | not have the best, like track record of how they actually perform on benchmarks. But from what,
00:04:35.920 | you know, people say for Phi-4, this is where things start to turn around, especially because they have
00:04:41.440 | this whole decontamination set. So like they really kick off the paper with addressing overfitting and data
00:04:48.000 | contamination where, you know, one of their original pitfalls was that the models had learned to overfit.
00:04:55.040 | So they improved their data decontamination process. Basically, they have good ways to take out
00:05:04.400 | stuff from data sets, you know, it's what you would expect, they can use LLMs to filter this out,
00:05:08.800 | they can avoid training on variants of benchmark data. They have contamination-proof benchmarks that they're
00:05:16.640 | doing. Okay, what else have we got here? So purpose of synthetic data, synthetic data,
00:05:23.360 | pretty good, right? It's a form of distillation with SFT, structured, gradual learning, what else?
00:05:29.360 | Chain of thought data is in there. So synthetic data for pre-training and mid-training. Here's kind
00:05:35.200 | of how they do it. So it begins with high quality seeds from multiple domains. Basically, they start seeding
00:05:41.440 | data from big domains, right? So they'll do web scrapes, they'll do this seeding filtration process,
00:05:46.640 | right? So essentially, they'll take big data sets, they'll use an LLM and they'll start filtering it. So
00:05:52.480 | they want to identify what's the highest quality of this data, then they'll generate synthetic examples
00:05:58.080 | and they'll use these examples as more training data. They talk quite a bit about multiple epochs for different
00:06:04.160 | sub data sets, but yeah, you know, they created 50 types of synthetic data sets. So web and code,
00:06:11.840 | code-based seed, right? For example, so they have a two-stage filtration process. First, identifying
00:06:16.800 | pages with strong educational content and second, segmenting selected passages into pages into
00:06:22.800 | passages, scoring each for its factual and reasoning content. Basically, they're trying to filter out what
00:06:27.600 | is good reasoning data here. In this, for question data sets, you know, they discard questions where all
00:06:33.520 | the answers agreed, meaning the questions were too easy, or where the answers were inconsistent. So
00:06:38.960 | a lot of data filtration; deduction chains, logical reasoning steps, made kind of more efficient. Next,
00:06:46.800 | rewrite and augment data. So seeds are transformed into synthetic data through multi-step prompting
00:06:53.760 | workflows. That includes rewriting the most useful content in a passage into exercises,
00:06:58.640 | discussions, the structured reasoning. So once you have a lot of text, you subset the best part of it,
00:07:04.160 | you know, what are the actual hard questions here? From there, let's rewrite it into exercises,
00:07:09.040 | let's create discussions around it, let's have structured reasoning to how we got to that example.
00:07:13.200 | That's kind of how they do that. Self-revision, instruction reversal for code. So, you know,
00:07:19.520 | you have high quality code, let's reverse it. Let's get the path out of it. Let's get the,
00:07:24.240 | you know, generation steps out of it. Filtering web dumps was another one. They have small non-LLM
00:07:31.680 | classifiers trained on annotations on how to filter out web dumps. They have subsets of non-filtered and
00:07:39.600 | filtered stuff. You know, they over-index on STEM words. They want to just get that high-level stuff.
00:07:46.240 | They have multi-modality into this, multi-linguality, sorry. So 176 languages. More extraction, more
00:07:54.240 | cleaning. Okay. Post-training. Post-training has SFT and DPO. And then we kind of go into their training
00:08:01.840 | mix. I think like five minutes to cover what they're doing here. But you know, they start off with 4K context
00:08:08.160 | length. They extend it out to 16K during mid-training. They have a new vocab of a hundred thousand with
00:08:15.120 | unused tokens. Turns out they just left this note in the paper in December. These unused tokens are what
00:08:21.520 | the reasoning tokens end up becoming, right? So they add in these think tokens out of the unused tokens,
00:08:26.640 | and that's how they're able to do it. Previously, they had sliding window attention
00:08:32.320 | in Phi-3. Now they do full attention over the 4K context. This model is trained from scratch on 10
00:08:39.520 | trillion tokens. They kind of give you, you know, here's the regular learning rate, weight decay,
00:08:44.480 | batch size, all that stuff. But pre-trained from scratch, 10 trillion tokens, a lot of synthetic tokens.
00:08:50.400 | Then after that, they have a mid-training stage where they do this context length extension from 4K to 16K.
00:08:58.000 | The interesting thing with all their models, even their reasoning models, like in the reasoning models,
00:09:03.440 | they have four stages of training. So one of the things that they noted was like, you know,
00:09:09.920 | as we do GRPO, we run into like vanishing gradient problems where, you know, at all lengths we have
00:09:16.880 | issues. So at each stage, as they fix these problems, they show how performance is doing:
00:09:23.440 | How many more points do we get in different benchmarks? Pretty interesting way how they lay
00:09:28.240 | these out. It's just like really a lot of learning. Okay, so Phi-3 was two stages.
00:09:36.320 | Phase one was largely filtered web text. Phase two was basically, you know, small subset of reasoning
00:09:44.000 | tokens, basically like the high quality code and math. In this case, they show that web data sets had
00:09:51.520 | small benefits on reasoning heavy benchmarks. Models trained with only synthetic data underperformed on
00:09:57.280 | knowledge heavy benchmarks as well. So, you know, we need to fix these things.
00:10:01.840 | TLDR, here's their data pre-training mixture, where we search over different allocations of tokens coming
00:10:09.600 | from various sources, mainly synthetic, web rewrites, web filtration, divided into reasoning and knowledge
00:10:16.560 | heavy portions, targeted acquisitions and organic data. So math, books, stuff like that, forums, code data.
00:10:24.000 | And then there's a correlation between 7B and 14B models. Basically, they noticed that the scaling was
00:10:31.520 | pretty consistent. So they did a lot of their data pre-training mixtures on a 7B model, and then they
00:10:37.200 | transferred that understanding over to the 14B. Let's go a little bit faster. Here's their subset of
00:10:45.840 | the final training of their 10T tokens in the pre-training stage. Or I don't know if it's all 10T,
00:10:51.520 | because they're still post-training, you know. So like, let's see, probably 95% of the training is
00:10:56.880 | here in pre-training. The interesting thing here is, you know, they have a lot of epochs for different
00:11:02.320 | data. So like synthetic data, even though there's only 290 billion tokens, it's actually 40% of the total
00:11:09.600 | pre-training because they go over it with 14 epochs, basically. But yeah, so to keep everything
00:11:17.120 | general, 30% of the training tokens are for web and web rewrites, namely 1.3 trillion tokens of just
00:11:24.080 | web scraping. Web rewrites are 290 billion tokens, but you know, we do multiple epochs on that. Then
00:11:30.320 | the remaining tokens are largely from synthetic data, which is 40% of that. 20% of this is pure code,
00:11:37.280 | and then acquired sources are, you know, like the stuff that like the textbooks and things like that.
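Quick back-of-envelope on those mixture numbers, using only the figures quoted above (approximate, as read off the paper during the talk):

```python
# ~10T total pre-training tokens; synthetic is ~40% of the budget despite only
# ~290B unique tokens because it is repeated for many epochs; web is ~1.3T unique
# tokens seen ~1.2 times. Numbers are the rough ones mentioned above.
TOTAL = 10e12                      # ~10T pre-training tokens

synthetic_unique = 290e9           # unique synthetic tokens
synthetic_share  = 0.40            # fraction of the total budget
synthetic_epochs = synthetic_share * TOTAL / synthetic_unique
print(f"synthetic: {synthetic_share*TOTAL/1e12:.1f}T effective tokens "
      f"over {synthetic_unique/1e9:.0f}B unique -> ~{synthetic_epochs:.1f} epochs")
# -> ~13.8 epochs, i.e. the "~14 epochs" mentioned above

web_unique, web_epochs = 1.3e12, 1.2
print(f"web scrape: {web_unique*web_epochs/1e12:.2f}T effective tokens "
      f"= {web_unique*web_epochs/TOTAL:.0%} of the budget")
# web rewrites (another ~290B unique tokens, multiple epochs) bring the web-ish
# share to roughly 30%, with code ~20% and acquired sources making up the rest.
```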
00:11:42.720 | That's most of the pre-training. Then they have their mid-training where they extend context length.
00:11:49.680 | Here is, you know, where they keep high quality non-synthetic data sets separate. So they filter out,
00:11:58.160 | in these pre-training sets, which of these samples have long context, right? So in their pre-training,
00:12:05.920 | what is over 8,000 tokens, what is 16,000? Then they upweight subsets that are long, right? Then they create
00:12:13.440 | synthetic data sets that have more than 4K. The data set of this long context extension is 30% newly curated
00:12:22.800 | and 70% of previous stuff from the original pre-training. Then, you know, here's how it performs on different long
00:12:32.640 | context evals. So recall, rag, re-ranking, summarization. Here's how the models actually
00:12:39.120 | perform. And they show pretty good benchmarks on this, you know? And they kind of explain
00:12:46.000 | for people that don't understand what are long context benchmarks. Here's kind of the ones that
00:12:50.640 | exist, right? So recall, this is basically retrieving corresponding value from randomly generated long
00:12:56.240 | files. Rag, answering questions with QA. You know, this is a subset of these data sets.
00:13:02.800 | Re-ranking is, you know, you're given 10 documents and a query. You want to re-rank them. QA is question
00:13:09.840 | answer over long document. Oh, my thing has frozen. Okay, never mind. I just decided to break my PDF reader a
00:13:19.680 | little bit. That's cool. Okay, after that, they have post-training. They do DPO. You know, this is
00:13:25.760 | basically, let's make it a chat-tuned model instead of a base model. Interesting thing to note in the paper
00:13:32.240 | this week in the thinking model, one of the approaches that they tried was instead of using the actual Phi-4
00:13:38.800 | model, what if we take the base model and not the instruction-tuned model? And what if we do all of our
00:13:45.360 | thinking from there? And they're like, actually, it does pretty well, but not good enough. So, you know,
00:13:51.680 | spoiler, they end up just using their instruction-tuned model. But very open how they do this.
00:13:56.800 | So, you know, they use ChatML, you know, user assistant, they have DPO, they have SFT, they have
00:14:03.360 | multilingual data. This is pretty interesting. They only need 8 billion tokens out of their 10 trillion
00:14:08.880 | during their SFT and post-training, you know. So, DPO is, you know, alignment. SFT is for following
00:14:16.880 | that completion. I think that's basically enough for high-level what's going on in Phi-4. Performances are
00:14:25.680 | as you'd expect, you know, it's a 14B. There aren't many 14Bs anymore. They have sections on red teaming,
00:14:32.240 | weakness. They have a lot of, like, just open research. But yeah, that's kind of where they
00:14:37.200 | sit, you know. They have 14B models. They do a lot of filtration. They do pre-training. They have this
00:14:43.440 | set of mid-training. They have SFT. They have DPO. It performs pretty good. It's on par with what you'd
00:14:50.480 | expect, you know. It came out after Qwen 2.5 14B Instruct, so it's slightly better. Better than Phi-3,
00:14:57.360 | of course. When you look at stuff like GPT-4o mini or GPT-4o, you know, in some cases, it's better. On math
00:15:04.960 | and GPQA, they say it's better. HumanEval, so coding stuff like this, slightly worse. SimpleQA.
00:15:10.320 | One thing that they note, you know, like, in some of the problems with these models is that when you
00:15:17.600 | have factual, like, recollection, they do note that small models just struggle with this, right? A
00:15:23.520 | small model can't remember a bunch of facts, so on QA, it's not the best. A big model will always do
00:15:28.640 | better. But yeah, that's kind of what Phi-4 is at a high level. It's cool, you know. It's, like, not
00:15:36.880 | better than Llama 3.3 70B, but in some areas it is, where, you know, they tried to get this reasoning
00:15:44.320 | data through SFT and a light DPO with just filtration and a lot of synthetic data. But okay, I will pause
00:15:51.360 | here since that's mostly Phi-4. From there, we'll go on to the two reasoning papers. But any questions,
00:15:59.120 | any thoughts, questions, comments, concerns? Has anyone used it? Probably not. They get some use now,
00:16:06.160 | they're, like, actually hosted by most people. They're on the, they're there through Azure and
00:16:12.080 | stuff, of course. But some people have started using them, you know. But anyway, any questions,
00:16:17.040 | any thoughts? Okay, chat, someone has said something. It's interesting to me, Phi reasoning distilled from
00:16:21.840 | O3 and Phi-4 mini distilled from DeepSeek. Why they use completely different SFT curated data sets,
00:16:27.600 | also kind of different SFT strategy. Yeah, it's an interesting note that they make. So the papers that
00:16:33.520 | they released this week, let me actually change my screen share real quick. So this came out in
00:16:39.760 | December. They have two models, two models that they released, kind of three, actually. So Phi-4 mini
00:16:45.920 | reasoning, this builds on Phi-4-mini. So it's actually a three point something B model. Let me check real
00:16:53.920 | quick. This is roughly a 3B model. And they show that, you know, there's a whole different set to getting,
00:17:03.360 | sorry, 3.8 B. There's a whole different formula to getting a small, small model to do reasoning
00:17:09.680 | than there is to getting a large model to do reasoning. So one of the interesting notes that
00:17:15.360 | someone in chat pointed out is that when they do Phi-4-mini-reasoning, they do their distillation from
00:17:22.640 | DeepSeek. But when they do the Phi-4 large one, the 14B and the plus, they do it from O1.
00:17:30.880 | Sorry, from O3-mini, which is interesting. Who knows why. But there's so many little gems in this
00:17:38.720 | Phi-4-mini-reasoning paper. Like, for example, they cite a bunch of recent sources of getting
00:17:46.320 | small models to do reasoning. Basically, they start with DeepSeek R1, right? So DeepSeek R1 had distills.
00:17:53.600 | They had Qwen and Llama distills. And they show how they can do better. After that, there was a bunch of
00:18:00.880 | stuff like stuff from OpenHands, stuff from S1. And they make little notes here, right? So like,
00:18:08.240 | there's, there's other things like there's OpenThinker, Bespoke Labs. But one thing that they
00:18:13.520 | noticed was like, if they take the S1 and LIMO datasets, S1 was basically, you know, a thousand
00:18:19.280 | samples of reasoning data trained into Qwen 32B, I believe, and had a really good reasoning model, right?
00:18:27.200 | So a thousand samples is all you need. They show that if they do this SFT on a thousand samples of
00:18:34.160 | their mini 3.8B model, it actually performs worse. So even though it performed well for S1, S1 is where
00:18:43.440 | they did it at 32B. They took a base instruct model. They did a thousand reasoning, high, high quality
00:18:48.640 | reasoning samples. They got reasoning. It worked very well. Benchmarks shot up like crazy. S1 paper
00:18:53.680 | was pretty cool, but pretty basic. They show that if we do the exact same thing on Phi-4-mini, so we take a
00:18:59.040 | 3.8B model that's competent, good, instruction-tuned, we do a thousand samples of SFT on really good
00:19:06.080 | reasoning data, our scores actually go down a lot. So the base model had scores of 10, 78, and 37.
00:19:13.520 | It shot down to 3, 47, and 26. So TLDR is, you know, that actually doesn't work. And they, I don't remember,
00:19:22.000 | I was like trying to quote something in this, but they keep saying like, we need to explore a better
00:19:26.240 | recipe for how to do reasoning in tiny models. This paper is basically a blueprint for that. But you know,
00:19:32.080 | it is based on DeepSeek. I would assume the reason that they use DeepSeek, to answer the question exactly, is
00:19:39.760 | because they're specifically comparing their model to the Llama 8B distill and the Qwen distill. So
00:19:49.600 | they're comparing against both distills and, you know, why not use DeepSeek?
00:19:55.440 | But anyway, before we move on to that and then the large one, any questions on regular Phi-4?
00:20:06.160 | Okay, we move on. Someone asked about training on synthetic data risk, the effect of hallucinations.
00:20:12.960 | So if you look deeper in there, like synthetic data processing, they have heavy, heavy filtration. So
00:20:18.560 | for some stuff that's questionable, for some stuff that has like good chain of thought, but not the
00:20:26.160 | right final output, they deal with that. But basically, you know, they have verifiable math output,
00:20:30.640 | they have verifiable code. They basically have really good pipelines to test whether outputs are
00:20:37.120 | correct or not. And then this is a lot more shown in the Phi-3 paper as well. Is there any justification
00:20:43.680 | on why they did so many epochs on synthetic data or any rationale behind that? So in Phi-4, yeah, basically
00:20:49.840 | they're just saying that all their synthetic data is just higher quality, right? So they start here,
00:20:55.440 | sorry, this is the reasoning paper. Basically, they start with: the web scrape is decent, we can filter
00:21:00.400 | it down and then expand it, and it's all better quality data. And then in Phi-3, they show some of
00:21:05.120 | this too. But the number of epochs is kind of interesting. Specifically, like when they show in
00:21:12.000 | context length extension, something they talk about in a lot of these, and they also do actually mention
00:21:18.400 | this in later, like towards the end, I just don't think it's worth covering since these are more so
00:21:23.680 | reasoning benchmarks. Some of the interesting things that they note is they do like this method of
00:21:29.760 | packing and unpacking prompts. They also show how they'll repeat long context examples that were
00:21:39.920 | done in pre-training. So even their method of multiple epochs is like not necessarily reflecting
00:21:48.800 | the exact number of samples. Although I'm sure it's very minute, in this pre-training, like for example,
00:22:00.320 | the web data has 1.3 trillion tokens and is 1.2 epochs. In the mid-training where they do context length
00:22:08.560 | extension, they once again use the previous samples that had more than 8k context. So they do create
00:22:16.960 | synthetic, but they also reuse. So like, you know, this is a little skewed, but I don't know, they just
00:22:22.160 | show that synthetic data is good. The other nice thing about these papers is that they basically cite
00:22:27.040 | everything. A lot, a lot of citations. But what they show here is like a bunch of synthetic data, model go
00:22:34.400 | good. As you would expect, the reasoning models kind of follow that same trajectory, right? A lot of synthetic
00:22:41.600 | data, model can reason, model get good. But let's start with the mini reasoning model. So
00:22:46.160 | Phi-4-mini-reasoning. This is the 3.8B reasoning model. Okay, so improving small language model reasoning
00:22:57.520 | remains challenging due to their limited model capability. Also, are we seeing my entire screen?
00:23:02.960 | Okay, we are making sure we're on the right paper. So they kind of bring up R1 distillation, right? So
00:23:10.400 | distillation from LLM generated synthetic data can improve reasoning and they show that
00:23:15.200 | even doing SFT works. So this paper, this work is trying to get a systematic training recipe for
00:23:23.040 | small language models. They have four steps. Step one is large scale mid training on diverse distilled
00:23:29.200 | chain of thought data. Step two is SFT on high quality long chain of thought. Step three is rollout DPO to
00:23:36.560 | leverage carefully curated preference data set. Step four is RL with verifiable reward. So four steps.
00:23:44.400 | Here's how we do reasoning on a tiny model. And at the end, they show that, you know, their compact 3.8B
00:23:51.600 | reasoner can outperform the distilled reasoning Qwen 7B and the distilled reasoning Llama 8B by, you know, a little
00:23:59.120 | bit on math. They're starting to do even more and not overfitting here. So they have sections on like
00:24:06.320 | calendar planning and they're like, this is something we didn't explicitly train on at all. We filtered out
00:24:11.920 | data sets in our training, but our models start to generalize. Like they do really good on this task.
00:24:17.520 | And this is something that wasn't trained on at all. Um, basically they validate a carefully designed
00:24:23.440 | training recipe with large scale, high quality chain of thought data is effective to unlock strong
00:24:28.240 | reasoning capabilities, even in resource-constrained small models. I don't know what their like team is up
00:24:34.400 | to with chart colors, but here they go all pink and purple. So, um, here's, you know, AIME 24,
00:24:40.880 | MATH-500, GPQA Diamond. Um, the original Phi-4 is this shade of purple. Um, the R1 distills,
00:24:49.200 | you know, you can see how they're actually pretty good. So Llama 8B and Qwen were pretty good, but hey,
00:24:54.240 | their tiny model at half the parameters can do even better. They're winning. Um, I thought this was
00:24:59.200 | interesting, uh, equal contribution for everyone except the first and last two authors. Screw the others.
00:25:05.680 | They, they, they're just listed first, but not equal contribution. Um, those charts are pink.
00:25:10.960 | These charts are colorful and different, but okay. Back to this. Um, I feel like me sitting here and
00:25:18.720 | reading through charts is useless. Um, but you know, oh no, my reader is broken. Okay. If interested,
00:25:26.000 | of how small models perform, it's interesting how they put, uh, GPT-4o mini kind of under the
00:25:33.680 | small model section instead of large models, uh, and GPT-4o here. We don't really know how big GPT-4o mini is,
00:25:42.160 | but, um, it's there nonetheless. Um, Oh, wrong, wrong chart. My bad. Let me go back to this. Okay.
00:25:52.080 | Okay. Sorry. Wrong chart. Okay. Um, basically they start off with their intro. They're like, uh,
00:25:57.520 | chain of thought is a way to do reasoning steps. It's cool. Uh, they say that small language models
00:26:03.760 | struggle with chain of thought. I thought that's kind of interesting. Um, enhancing reasoning capabilities
00:26:09.600 | is easier for language models for large language models due to extensive capability. It remains
00:26:15.120 | challenging for small reasoning models. Um, DeepSeek R1 shows that non-logit level distillation,
00:26:21.440 | basically just SFT with synthetic data makes a good reasoning performance. Then they cite all the other
00:26:28.640 | stuff. Um, then they show the other people. So, you know, um, Bespoke Labs' Bespoke-Stratos 7B, OpenThinker
00:26:36.560 | 7B. Uh, they show that, um, they can do this with SFT. Um, some people suggest GRPO, which DeepScaleR does. S1 and LIMO,
00:26:48.480 | um, show that, you know, small samples, even a thousand samples, can do good reasoning.
00:26:55.360 | Okay. Um, rather than focusing on isolated techniques, we explore training paradigms specifically tailored for
00:27:04.400 | small language models. So, um, they have two stages of distillation followed by rollout-based learning
00:27:12.800 | that reuses wrong LLM-generated samples and concludes with RL with verifiable reward. Initially, we employ
00:27:19.840 | a distillation as mid-training mechanism to embed foundation models with reasoning capabilities.
00:27:25.280 | Then they apply distillation again in fine tuning to further improve generalization. Uh, then there's LLM rollout
00:27:32.320 | out sampling for distillation. Incorrect outputs are typically discarded. However, they want to still
00:27:37.680 | use them. So they have this way of doing this, um, set. They, they take, uh, a sort of RL approach where
00:27:45.760 | they have it optimized, um, long. So if the answer is incorrect, it should still think a lot for incorrect
00:27:53.840 | examples. And then they want conciseness on correct examples and stuff. They talk about this later. Then they
00:27:59.840 | fine tune it with RL for final correctness, um, outcomes, five, four reasoning. Okay. More background, um,
00:28:07.760 | optimal distillation strategy for small models still unexplored. They keep repeating this,
00:28:13.120 | um, you know, data diversity. Uh, so they showed that, you know, um, data diversity and quality
00:28:22.160 | is very important. Applying isolated techniques degrades performance. So they tried S1 on Phi-4-mini and it went down,
00:28:30.400 | which I thought is pretty crazy. Um, their goal is to once again, comprehensive efficient training recipe
00:28:37.680 | for small language models. So, um, non-reasoning models require a pre, uh, mid training stage to
00:28:44.400 | absorb a large volume of reasoning trajectories before additional techniques are applied. So once
00:28:49.840 | again, you know, if small, we need mid training. They love this concept of mid training. So quick questions
00:28:55.680 | to ask is how much mid training do they need? And then what do we do after mid training? Like careful
00:29:00.880 | distillation, preference learning RL, what should we do next? Uh, once again, systematically address these
00:29:06.880 | conditions and propose a recipe for building small reasoning models. Mid training is basically after
00:29:12.080 | pre training and after pre training and before post training. It's mid training. Um, it's like, you know, you
00:29:20.960 | have SFT for post training, but before you do that, when you have base model, is there something that we can
00:29:26.400 | do to do this sort of reasoning or before this specific RL DPO before that last step of most high
00:29:33.440 | quality stuff? Basically, uh, you know, if you take an instruction model, like Phi-4-mini, and you
00:29:39.280 | just do direct, um, 1000 samples of high quality data, like you're cooked, you're not getting, um, reasoning.
00:29:48.400 | The models, the small models cannot pick that up in time. So they're saying mid training is where, you know,
00:29:53.280 | you need this stage of, let's do some chain of thought, extended synthetic reasoning data. Let's
00:29:59.360 | roll out reasoning, tracing steps. Let's have thinking steps. Let's do a lot of those. Then we
00:30:04.320 | go once again to our core examples. Um, that's kind of what, that's kind of what it is. And yeah, it's like, uh,
00:30:11.360 | basically train us train on chain of thought before your RL. Okay. Multi-stage, um, continual training for
00:30:19.200 | reasoning. So, um, multi-stage continual training is good. First we do that. We train curated chain
00:30:25.440 | of thought reasoning data set. Then we run RL with verifiable rewards. Okay. So can distillation be used
00:30:32.320 | as mid training? Yes. Uh, we want to use distillation. So we train base models with next token prediction
00:30:39.280 | on an extensive corpus of synthetic chain of thought data. Uh, chains of thought are generated from Deep-
00:30:46.560 | Seek R1. We apply rejection sampling, only keeping the correct answers. They talk about this later in
00:30:52.000 | section four. We pair questions with correct chain of thought, uh, with corresponding correct chain of
00:30:56.640 | thought answers, train a base model using standard causal language modeling objective. Uh, they have a
00:31:03.280 | packing mode. So basically what they're doing here is not just SFT on reasoning. They're packing in multiple
00:31:11.120 | examples into one training set, right? So, um, multiple examples are packed into the same input sequence
00:31:17.600 | to increase token efficiency. So we're not specifically trying to learn what is input output. We're not doing
00:31:24.800 | SFT to just do reasoning step answer. We're doing multiple examples. So like they're packing in a bunch of
00:31:30.560 | them. So like you could have seven examples of input chain of thought output, input chain of thought output.
00:31:38.000 | And we're just trying to learn and mid training, you know, here's how we do chain of thought. It's not
00:31:42.800 | just good answer generation. Um, this is like an effective way to use mid training, uh, you know, effective
00:31:50.960 | way to allow mid training to iteratively use as much chain of thought data as possible until the model
00:31:57.360 | starts to perform well on a validation set. Then we have, uh, distillation for SFT.
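For concreteness, here's a minimal sketch of the packing idea from that mid-training stage, with a placeholder tokenizer and made-up token names (not the paper's code):

```python
# Several (question, chain-of-thought, answer) examples get concatenated into one
# fixed-length sequence and trained with plain next-token prediction, so no tokens
# are wasted on padding. The tokenizer and markers here are illustrative only.
from typing import Iterator

SEQ_LEN = 4096
EOS = "<|endoftext|>"          # illustrative end-of-example marker

def render(example: dict) -> str:
    return f"Q: {example['question']}\n<think>{example['cot']}</think>\n{example['answer']}{EOS}"

def pack(examples: list[dict], tokenize) -> Iterator[list[int]]:
    """Greedily pack rendered examples into SEQ_LEN-token training rows."""
    buffer: list[int] = []
    for ex in examples:
        buffer.extend(tokenize(render(ex)))
        while len(buffer) >= SEQ_LEN:
            yield buffer[:SEQ_LEN]      # one packed row, may span several examples
            buffer = buffer[SEQ_LEN:]
    # In the later (un-packed) SFT stage, each row would instead hold one example on
    # its own, which is what teaches the model where to stop.

# Toy usage with a whitespace "tokenizer" just to show the shapes:
fake_tokenize = lambda s: list(range(len(s.split())))
rows = list(pack([{"question": "2+2?", "cot": "2 plus 2 is 4", "answer": "4"}] * 2000,
                 fake_tokenize))
print(len(rows), "packed sequences of", SEQ_LEN, "tokens")
```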
00:32:03.360 | Basically after it started to learn how to do this chain of thought, um, you know, we do
00:32:10.640 | fine tuning or just continual training in a non-packing way. And then that's where you teach the model where
00:32:16.880 | to stop generating text. Okay. After that, uh, rollout preference learning, uh, this is where, you know,
00:32:24.480 | the previous two stages is trained only on accepted generation. So they've done filtration.
00:32:30.080 | They've taken out all incorrect examples. They're only doing positive chain of thought. Here's question.
00:32:36.800 | Here's chain of thought. Here's answer packet, unpack it. You now know how to think, you now know where to
00:32:42.000 | stop, but now they want to use rejected rollouts to enhance performance. Basically, um, this is how they
00:32:50.000 | ensure that you have, uh, diversity, you have enough thinking, you have, you know, conciseness.
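A rough sketch of that rollout-based preference construction, which they spell out just below: verified-correct rollouts become the preferred side of a DPO pair and incorrect rollouts for the same question become the dispreferred side. The verifier here is a stand-in for whatever checker is actually used.

```python
import itertools

def build_preference_pairs(question: str, rollouts: list[str], gold: str, check_answer):
    correct = [r for r in rollouts if check_answer(r, gold)]
    incorrect = [r for r in rollouts if not check_answer(r, gold)]
    # Pair each correct rollout with an incorrect one for the same prompt.
    return [{"prompt": question, "chosen": c, "rejected": w}
            for c, w in itertools.product(correct, incorrect)]

# Toy usage with a trivial string-matching "verifier":
pairs = build_preference_pairs(
    "What is 7 * 8?",
    ["<think>7*8 = 56</think> 56", "<think>7*8 = 54</think> 54"],
    gold="56",
    check_answer=lambda rollout, gold: rollout.strip().endswith(gold),
)
print(len(pairs), "preference pairs")  # -> 1
```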
00:32:58.640 | So basically incorrect responses with minor nuances are compared to their correct answers and they provide
00:33:06.400 | positive, effective candidates for constructing informative preference pairs. Uh, so you can do preferences of, you know,
00:33:13.040 | uh, here's the right, uh, thinking approach, but the answer was wrong. And then here's the correct way to
00:33:18.400 | do it. Uh, this preference data set is constructed by using correct, um, correct answers as preferred
00:33:24.640 | rollouts and incorrect as dispreferred rollout. So now, uh, you know, we have DPO, we're going to do DPO and
00:33:31.280 | we have, here's chain of thought with correct answer. Here's chain of thought where you kind of messed up a
00:33:35.680 | little bit. Let's, let's do that. Then we have, um, RL with verifiable reward. So now we've done some
00:33:42.800 | alignment with DPO. We want it to, you know, we want to have preference towards these pair of examples,
00:33:49.440 | but now we want, um, RL on the distilled and preference train model. So, um, in the following,
00:33:56.880 | we describe RL algorithms that we've implemented. So PPO. PPO uses a clipped surrogate objective to limit
00:34:04.320 | the policy update, so it stays close to the previous policy. Clipping is doing this, dah, dah, dah, dah.
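For reference, here's what the PPO clipped surrogate and the GRPO group-relative advantage being discussed look like in plain NumPy; this is the textbook formulation, not their implementation:

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate: keeps the policy ratio within [1-eps, 1+eps] of the old policy."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()   # maximize this

def grpo_advantages(group_rewards):
    """GRPO: no value network; each sampled response's advantage is its reward
    normalized against the other responses sampled for the same question."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

# If every rollout in a group gets the same (verifiable) reward, the advantages are
# all ~0 and the gradient vanishes -- the failure mode described below.
print(grpo_advantages([1.0, 0.0, 1.0, 1.0]))
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))   # -> all zeros
```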
00:34:10.000 | It has stabilization. Basically this is where they want it to, um, you know, okay, let's see. We have
00:34:18.400 | PPO and GRPO. We love our GRPO, um, comparing rewards in a batch with multiple responses. So
00:34:26.560 | for each question, we sample a set of responses under the old policy, compute their rewards,
00:34:32.080 | average out what we want. There's a verifiable reward for the right one. Standard RL stuff these
00:34:36.960 | days. Um, what else? High variance in response lengths. So, uh, in our pilot study applying
00:34:44.080 | GRPO to train the base model, we observed three issues that affected stability and effectiveness of
00:34:49.520 | model training. High variance in response lengths: so although the base model after mid training is
00:34:56.080 | already able to generate chain of thought responses, we observed substantial variability in response
00:35:01.520 | lengths within the same GRPO sampling group. So, uh, when you generate multiple outputs for the same example,
00:35:09.600 | uh, there's a lot of variability in how long they are. So for some stuff, positively rewarded responses
00:35:16.320 | range from 12,000 to 20,000 tokens, which is, you know, pretty large. Optimizing the model with standard
00:35:22.880 | GRPO. Um, it led to instability, right? Okay. Vanishing gradient. This is what you would expect. Um, you know,
00:35:33.680 | so as they did, um, a bunch of these, as they had so much diversity, so many different, uh, lengths,
00:35:41.520 | you kind of had identical rewards within a group. So, uh, vanishing gradient problem, zero variance in the returns. Um,
00:35:50.160 | the model is sensitive to intra-group length discrepancies, requiring the GRPO batch size to be extended to 128. So
00:35:58.320 | bigger batches in GRPO. While it worked, um, we hypothesized that these issues become more
00:36:04.000 | prominent for small language models where RL stability is likely to be more fragile compared
00:36:08.480 | to large models. Okay. Uh, moving on quick, since we're short on time, what else? Synthetic data,
00:36:16.160 | synthetic chain of thought data generation. So we construct a large-scale reasoning data set composed of LLM-
00:36:22.640 | generated synthetic reasoning trajectories. Uh, they use a bunch of pre-made
00:36:30.080 | data sets. So basically all these are open data sets, with their sizes and whether they have
00:36:35.920 | reasoning or not. Um, for data sets that already had reasoning traces, we directly use their annotation.
00:36:43.360 | So basically, uh, Bespoke Labs, OpenR1-Math from Hugging Face, and OpenThoughts from the OpenThoughts team.
00:36:49.920 | That's roughly what is that like 350,000 samples. It has reasoning annotations. They use that, uh,
00:36:56.560 | for the other ones. So all these six or seven different data sets, um,
00:37:01.360 | we retain only, sorry, for data sets lacking such trajectories, we retain only math questions
00:37:09.520 | and generate new chain of thought answers using R1, the big R1, 671B. For each question, we sample
00:37:16.000 | approximately eight rollouts. And, um, in total, we collect 10 million rollouts across 1.6 million
00:37:24.880 | samples. So, uh, they use all these for the other ones. They use big R1. They only use math. They get
00:37:31.280 | eight rollouts per question, so now they have 10 million rollouts. All the previous, uh, training steps, so the mid-training
00:37:37.280 | and stuff that's used with this data set, this is not like a new training stage. This is just explaining
00:37:43.120 | the previous steps above. Okay. Um, this is kind of interesting as well. For math questions that are
00:37:50.320 | verifiable, we first apply math verification tools to assess correctness. Uh, some auto-verification
00:37:56.800 | fails for complex solutions. And then it says, we additionally employ 4o-mini to re-verify rollouts
00:38:03.280 | initially flagged as incorrect. I thought this was interesting. Like, um, we're in
00:38:09.760 | like mid 2025, we have O3, we have O1, we have 2.5 Flash, we have Sonnet, we have all these big models,
00:38:17.360 | but let's verify our math with 4o-mini. Like, let's think about that for a second. We have benchmarks,
00:38:25.360 | we have tables, we have everything. But for verifying whether our math was flagged as incorrect properly,
00:38:31.920 | let's use 4o-mini. Why do we use 4o-mini? That's for Microsoft to know and, you know, not tell us.
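Roughly, that verification flow reads like this: a cheap rule-based math check first, and an LLM judge only for rollouts the checker flags as incorrect. A sketch with sympy and a stubbed-out judge (the judge call is a placeholder, not their pipeline):

```python
from sympy import simplify, sympify

def math_equivalent(predicted: str, gold: str) -> bool:
    """Rule-based check: do the two expressions simplify to the same thing?"""
    try:
        return simplify(sympify(predicted) - sympify(gold)) == 0
    except Exception:
        return False   # unparsable -> let the LLM judge decide

def keep_rollout(predicted: str, gold: str, llm_judge) -> bool:
    if math_equivalent(predicted, gold):
        return True
    # Auto-verification sometimes fails on complex solutions, so only rollouts
    # initially flagged as incorrect get re-verified by the judge model.
    return llm_judge(predicted, gold)

# Toy usage with a stub judge that is never reached here:
print(keep_rollout("2*(x+1)", "2*x + 2", llm_judge=lambda p, g: False))  # True
```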
00:38:40.640 | Uh, could they have used a bigger model? I think so. But anyway, um, they do that. Okay, experiments,
00:38:48.400 | evaluations, um, you know, we evaluate our model on three mathematical, uh, benchmarks. Possibly they're obsessed
00:38:56.960 | with token cost, the 4o-mini validation. I think the 4o-mini validation is just, you know, it's a mini model, let's
00:39:02.800 | compare it to mini. Um, 4o-mini was compared to the big Phi-4, the 14B model. And based on
00:39:11.200 | estimates, you know, people say the active parameters of 4o-mini are small. So that's fair for token costs.
00:39:18.480 | I mean, I don't know. These are not cheap models to train. You know, this is a 14 B model trained on
00:39:26.640 | 10 trillion tokens. Um, that's a lot of compute. That's in the millions of dollars of compute.
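As a sanity check on that claim, a back-of-envelope with the usual 6 * params * tokens FLOPs rule; the H100 throughput, utilization, and price here are assumptions, not numbers from the paper:

```python
# Rough compute-cost estimate for the Phi-4 pre-train (assumed GPU specs and prices).
params  = 14e9           # Phi-4 parameters
tokens  = 10e12          # pre-training tokens
flops   = 6 * params * tokens            # ~8.4e23 FLOPs

h100_peak_flops = 989e12                 # BF16 dense peak, per GPU (assumed)
utilization     = 0.4                    # assumed model FLOPs utilization
usd_per_gpu_hr  = 2.0                    # assumed rental price

gpu_hours = flops / (h100_peak_flops * utilization) / 3600
print(f"~{gpu_hours/1e3:.0f}k H100-hours, ~${gpu_hours * usd_per_gpu_hr / 1e6:.1f}M")
# -> on the order of 600k GPU-hours, i.e. roughly a million dollars and up,
#    consistent with the "millions of dollars of compute" ballpark.
```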
00:39:32.400 | One thing I highlighted here that I might've skipped over was how cheap training Phi-4-mini-reasoning
00:39:38.320 | was. Um, like, I'm not going to say it's like DeepSeek R1 cost $5 million, but, um, you know,
00:39:46.720 | there's obviously a lot of filtration, a lot of time, a lot of synthetic data generation, a lot of inference
00:39:52.400 | that goes into this, but they trained this on, I believe, 32 H100s for like two and a half days.
00:40:00.480 | So like maybe 64 H100s, but, uh, very few nodes were used, you know? So like, this is something
00:40:08.240 | that someone could do themselves if they really wanted. If someone can actually pull up the Phi-4-
00:40:14.240 | mini-reasoning model on Hugging Face, it explains the, um, GPUs used; link it, just share it in chat.
00:40:21.520 | And we'll go over it real quick. But, um, that was something I wanted to know, you know, um, training
00:40:26.880 | this thing, the reasoning model outside of generating that data, which honestly, even in and of itself,
00:40:32.960 | they just used, um, a lot of DeepSeek R1, and they used open data sets. Like this thing didn't
00:40:40.320 | cost that much to train: two nodes of H100s for two and a half days. Like that's on the scale of
00:40:46.960 | tens to hundreds of thousands of dollars. So not that crazy, you know? Um, but yeah, uh, what else, uh, to,
00:40:56.240 | to show their results. Once again, you know, they're very cautious of overfitting on benchmarks.
00:41:02.080 | Uh, they, they do three runs and they, they report the averages and they're like,
00:41:06.800 | they have a whole section on how we need to better evaluate reasoning models.
00:41:12.320 | Uh, they have a section on like reasoning models versus non-reasoning where, where these evals sit,
00:41:19.520 | uh, baselines and stuff. But yeah, they kind of show that, you know, the thing's pretty good.
00:41:25.440 | Training strategies. This is very straightforward, right? So what do they do? Distillation, uh,
00:41:30.880 | distill, uh, sorry, in the distillation stages, what's the batch size learning rate, how many epochs,
00:41:36.960 | warmup ratio, sequence length, packing, not packing. Um, this is all just, you know,
00:41:42.400 | if you really care about that, go ahead and read it. We're running short on time, so I will not read it
00:41:47.520 | for us. Um, on scores, here's kind of where they sit. So once again, like they show all of their stages,
00:41:56.240 | right? So basic Phi-4-mini sucked at these math and reasoning benchmarks.
00:42:02.320 | O1-mini, good. Distills, pretty good. The other people, not so good. Base Llama, not good.
00:42:08.800 | Adding distillation mid-training, ooh, a lot of bonuses, a lot of pretty good stuff. Adding
00:42:15.520 | distillation fine-tuning after, even better. Uh, adding rollout DPO, even better.
00:42:21.600 | RL with GRPO: wow, we're good. It beats everything. Actually, it doesn't beat everything, but you know,
00:42:27.280 | um, pretty good. Beats, uh, R1 Distill Qwen 7B and R1 Distill Llama 8B, which, you know,
00:42:36.320 | when DeepSeek came out, a lot of people for local inference were using, um, Distill Qwen 7B. Uh,
00:42:43.840 | Tyler has linked the hugging face thing. Let's, let's check it out real quick. Um, context length, what model is this?
00:42:54.880 | Phi-4-mini-reasoning. Okay. Model quality, H100 GPU time: 128 H100s for two days of training. So, you know,
00:43:04.800 | not that much. Assuming $2 per hour per H100, 128 GPUs times 48 hours is about $12,000. Um, math checks out,
00:43:14.080 | but math doesn't really check out. Um, you can't assume $2 per H100 hour. Cause you know, you're doing
00:43:20.000 | this on actual nodes of training, but yeah, you know, at $2 per H100-hour, you can say $12K. Now that
00:43:26.800 | you're doing it on node level, maybe double it up, maybe triple it up. But point being, you know, tens of
00:43:32.400 | thousands of dollars for this, hundreds of billions of tokens, not trillions of tokens. And this is stuff
00:43:38.080 | that like low key could have been done earlier, but, um, yeah, they show it out stage by stage.
00:43:45.120 | Thank you, Tyler, for the link and, um, quick math. Um, they show stage by stage performance,
00:43:52.960 | ablations, um, fun charts, safety. We need safety. Ours stay consistent. Others go down.
00:44:02.560 | What else? Conclusion. Okay. They love their line, uh, small models when trained with deliberate data
00:44:09.440 | selection and training strategies can match or even exceed capabilities of larger models.
00:44:15.360 | They want to show this work as a blueprint for developing efficient, high performing models
00:44:20.320 | under resource constraints. Now, one of the very interesting things is, you know, um,
00:44:26.240 | some of the other work that's referenced here, like the distillation models, uh, distillation with SFT is
00:44:32.240 | cool. Deep seek showed it really worked, right? Let me see if this chart actually compares to base
00:44:37.600 | Llama. I believe they have it in there at the bottom. So basically, let's change color real
00:44:43.040 | quick, for regular Llama 3.2. Oh, they did 3B. Goddamn. Never mind. I was trying to compare Llama 8B to 8B.
00:44:52.560 | Uh, but you know, okay, let's just say for Phi-4, which is better than the 8B on reasoning data sets.
00:44:59.440 | It did really, really bad. Now, what DeepSeek showed is with basic SFT on reasoning data, um, you can get pretty
00:45:08.960 | good performance. Uh, what they show is if you have a well thought out process to do this specifically for, um,
00:45:18.560 | small models, you can do even better than what people were very amazed by in the past. So, um,
00:45:26.480 | this thought out way to do this is really good. Um, what else, what else? Uh, so they basically have
00:45:33.040 | this as a blueprint for small model thinking RL. Um, what was interesting was just, yeah, stuff doesn't
00:45:40.800 | directly transfer over, you know? So, um, S1, if someone correct me if I'm wrong, I think was done on
00:45:48.160 | Qwen 32B, right? Or one of the large ones where, you know, you take a thousand samples, you have a small
00:45:55.600 | model and now you have really good reasoning. They tried that exact recipe with their small model.
00:46:01.760 | They took Phi-4-mini, they did S1 training, and not only did it not improve as much, it actually got
00:46:09.840 | worse. So, uh, you know, this stuff doesn't transfer over at face value. For regular, um,
00:46:17.520 | foundation model stuff it does: in the regular Phi-4 14B, um, they did a lot of their experiments,
00:46:25.200 | a lot of their pre-training data experiments, on a 7B
00:46:32.080 | and then they transferred over to a 14B for actual train run. Where is this? It's somewhere in here.
00:46:39.360 | Um, yeah, right here, right here. So we observe a high rank correlation between the performance of
00:46:46.080 | 7B and 14B models on different data mixtures. Given large enough distance between the two data mixtures,
00:46:52.160 | uh, this allowed us to conduct experiments at the 7B scale and transfer findings to Phi-4, which is 14B.
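To illustrate what that rank-correlation argument means in practice, a tiny sketch with made-up benchmark scores:

```python
# Score several candidate data mixtures at 7B and 14B and check that the *ordering*
# agrees; that is what lets mixture search happen at the cheaper scale. The scores
# below are hypothetical.
from scipy.stats import spearmanr

mixtures  = ["mix_A", "mix_B", "mix_C", "mix_D", "mix_E"]
score_7b  = [61.2, 58.4, 64.0, 55.1, 62.7]   # hypothetical benchmark aggregates
score_14b = [68.0, 65.9, 70.3, 63.2, 69.1]

rho, p = spearmanr(score_7b, score_14b)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
# rho ~ 1 here: the best mixture at 7B is also the best at 14B, so the winning
# mixture found at 7B gets transferred to the 14B training run.
```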
00:46:58.080 | Uh, the key difference between 32B and this one is mid-training? No, so this one is a 14B,
00:47:05.120 | which also had mid-training. This is basically a mixture of a bunch of stuff, the small reasoning.
00:47:10.480 | Um, there's a whole bunch of sets of this, but here, yeah, basically they have a whole bunch of,
00:47:16.880 | um, let's learn chain of thought, packed, unpacked to know where to stop, then let's do RL.
00:47:22.560 | That's a better approach for mini reasoning model. Uh, we have questions and discussions, but this is
00:47:29.600 | not the only paper that dropped. I, in fact, was bamboozled an hour ago when I decided to
00:47:34.880 | read these; turns out there's like low key two and a half papers. So we needed to know Phi-4 base to see
00:47:40.480 | what they did. Then there's Phi-4-mini-reasoning, which is their 3.8B. Then there's also just Phi-4-reasoning.
00:47:46.720 | Phi-4-reasoning is not one model. It's actually two models. Um, Phi-4-reasoning is where they take the 14B model.
00:47:55.040 | They make a reasoning model using O3-mini traces. Uh, and they have another one. They have
00:48:03.120 | Phi-4-reasoning-plus, where they basically do the same thing as before, but now they add RL specifically. And,
00:48:08.720 | uh, now they're comparing it to the 70B distills. And guess what? It does well. It outperforms the 70Bs.
00:48:16.800 | So same benchmarks, reasoning benchmarks. Um, we show that the benefit of carefully,
00:48:22.080 | of careful data curation and SFT extends to reasoning models. This can be further amplified
00:48:27.760 | with RL. So similar to how there's like, uh, O3 mini, low, high, O1 pro, all that stuff.
00:48:34.640 | They can also do this. So they have, um, uh, Phi-4-reasoning and Phi-4-reasoning-plus. Could this paper
00:48:40.880 | have basically just launched Phi-4-reasoning-plus and called it Phi-4-reasoning? Yes. They didn't have to
00:48:47.440 | do two. This is very similar to how in the, uh, reasoning mini paper they have... oh, this is the
00:48:55.200 | big one. One sec. In the mini reasoning paper,
00:49:01.280 | they show every stage of training and how it performed. Right? Like
00:49:07.360 | they could have also had Phi-4-mini-reasoning and then Phi-4-mini-reasoning-plus and done their whole recipe.
00:49:15.520 | Like it's starting to do better as it is. But, um, in this one, they, I guess the point they're trying
00:49:21.200 | to make is that an additional RL stage takes you from Phi-4-reasoning to Phi-4-reasoning-plus. Um,
00:49:30.400 | yeah. TLDR is, um, they're getting O1-mini, O3-mini level performance, beating the R1 70B distills.
00:49:37.040 | Uh, yeah. Okay. I'm going to try to go through this in three minutes. Sorry on bad use of time. Um,
00:49:44.960 | how they do this. So 14B model, it's Phi-4. Uh, they have a 14B model, supervised fine-tuned. So they do SFT. Then they have reasoning-plus, which has a further
00:49:59.520 | round of RL. Uh, 1.4 million prompts with high quality answers containing long reasoning traces generated using
00:50:07.040 | O3 mini. Prompts are filtered to cover a range of difficulty. They want it to be stuff that the regular
00:50:13.200 | model can't answer. So if Phi-4... Is there any traction on these? Yeah. Uh, the Phi models have decent traction.
00:50:19.680 | The reasoning one just came out. I don't know about traction on this. Um, so
00:50:24.480 | they basically want to filter out stuff that the regular, um, Phi model can solve. They only want
00:50:32.960 | to do stuff that it can't solve. Um, my highlighting got weird. So the data set used in SFT includes STEM
00:50:39.120 | topics, coding, safety focused, uh, tasks. The reasoning plus model is trained using RL on a
00:50:45.840 | small set of 6,000 high quality, math-focused problems with verifiable solutions.
00:50:51.360 | Kind of interesting, right? They get so much performance gain from just 6,000 samples of RL.
00:50:56.960 | Um, they talk about how RL, um, RL is like, you know, it has high variance, but it also has high
00:51:04.720 | impact if done correctly. But then it's just math, so, you know, LiveCodeBench went down a little
00:51:09.440 | bit. Uh, interesting thing to note. Okay. Basically it's a data, like the whole training pipeline,
00:51:17.680 | it matches what they've done in previous Phi models. Uh, they once again
00:51:22.880 | want to show that good data curation and synthetic data make small models good. A small model performs
00:51:29.600 | better than O1-mini and the 70B models. They also outperform Claude 3.7 Sonnet thinking on all
00:51:36.880 | tasks, um, except GPQA and calendar planning. Okay. Performance is cool.
00:51:44.080 | Um, this was kind of an interesting one. Um, both of them present improvements over the base models,
00:51:51.600 | including math specifically, notably an improvement of 50 percentage points here. Surprisingly,
00:51:57.520 | these models also improved by 30 to 60 percentage points on, um, algorithmic and planning problems
00:52:05.680 | like calendar planning, which demonstrate increased generalizability of reasoning skills to domains
00:52:12.640 | that we did not target directly during SFT or RL. So, uh, on stuff they didn't target and didn't train,
00:52:19.360 | this shit still generalizes. Uh, very cool. Very cool. Improvement on general benchmarks.
00:52:24.800 | Of course it improves. Here's numbers of hell thinking effort versus accuracy trade-off. This was
00:52:30.480 | interesting. So the reasoning-plus model, the one with the extra RL that does better, takes approximately
00:52:36.480 | 1.5 times more tokens than the other one. Uh, this difference is less pronounced on other reasoning
00:52:42.560 | domains. So, um, that's the average. So some domains, like coding, planning, spatial tasks,
00:52:50.880 | uh, these are all avenues to improve RL. Okay. Keep going through quick. Phi-4 demonstrates
00:52:57.840 | reasoning responses. They show some examples that the reasoning one did that the other one couldn't. So
00:53:03.600 | this is a word play riddle. This is a question, you know, I have coin tosses. What's the chance I see
00:53:11.440 | exactly 1.2 heads? Phi-4 would give you, you know, an actual math calculation. The reasoning model
00:53:17.840 | is like, you can't get exactly 0.2. So probability is zero. Um, pretty cool. Pretty cool. More stuff, planning,
00:53:25.840 | games. Crazy. It can do games, data stuff, uh, seed database. We specifically target seeds situated at
00:53:35.040 | the edge of five four's current ability. Additionally, to maximize the focus of reasoning skills and data
00:53:40.160 | set, we prioritize prompts that demand complex multi-step reasoning, as opposed to those, uh, testing
00:53:46.880 | factual knowledge. So they want reasoning stuff. They want teachable things, synthetic seeds. Sorry,
00:53:52.240 | I'm going quick. We're at like one minute left. So I'm going to go through this quick, quick, uh,
00:53:58.480 | Phi-4-reasoning, basically SFT on Phi-4 with specific tokens. They have reasoning tokens.
00:54:05.200 | They use two of those empty tokens that they have. They have think and, um, you know, think tags that they
00:54:10.720 | add in. Uh, increased token length: so they extend to 32K and synthetically generate examples of long
00:54:18.640 | chain of thought over this. Our SFT data set is 1.4 million prompt response pairs, um, totaling 8.3
00:54:27.920 | billion unique tokens. So this many, extended out over these domains. Here's training. Here's SFT
00:54:34.800 | Here are the training details, the SFT steps, the improvements. During the exploration stage, they studied the effects of various design choices:
00:54:41.760 | the role of the SFT seeds, hyperparameters, and the role of the system message. Of course, the system message is useful. So,
00:54:50.880 | basically, to promote chain of thought, they tell the model it's a thinking
00:54:58.400 | model and to put its thinking inside think tags. Partially removing or replacing the system message was an interesting ablation, but it
00:55:05.760 | didn't work as well. Here's the gist of the prompt: you're a large language model trained by Microsoft; your role as an assistant
00:55:11.920 | involves thoroughly exploring questions; please structure responses into two sections,
00:55:18.240 | Thought and Solution, using the prescribed format: the Thought section inside think tags, then the Solution
00:55:23.600 | section. In the Thought section, detail your reasoning steps; each step should include analysis, summarizing,
00:55:28.720 | brainstorming. The Solution section should be logical, accurate, and concise. Now try to solve the question with those
00:55:35.520 | guidelines. So that's their system message, and it helped.
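Here's a rough sketch of how a system message like that could be wired into a chat request; the wording is paraphrased from the talk rather than quoted from the report, and the message schema is just the generic role/content convention.

```python
# Hypothetical chat payload with a reasoning-style system message.
# The prompt text below is a paraphrase, not the report's verbatim prompt.
SYSTEM_MESSAGE = (
    "You are a language model trained by Microsoft. Thoroughly explore the "
    "question before answering. Structure your response in two sections: a "
    "Thought section wrapped in <think>...</think> tags that details your "
    "reasoning step by step (analysis, summarizing, brainstorming), followed "
    "by a Solution section that is logical, accurate, and concise."
)

def build_messages(user_question: str) -> list:
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user_question},
    ]

print(build_messages("What is the chance of seeing exactly 1.2 heads in a few coin tosses?"))
```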
00:55:43.600 | For the base model, they considered using the 14B model before post-training, like we talked about. Both options
00:55:51.200 | worked pretty well, but the one that had already been through SFT and instruction tuning did slightly better,
00:55:58.320 | so they use it. They attribute this to the addition of safety-focused post-training. Wow, safety is all
00:56:05.200 | you need. For scaling, the final model is trained on 16 billion tokens in this stage. Okay.
00:56:14.160 | Real quick, the plus model. They applied outcome-based RL to enhance reasoning
00:56:20.240 | capabilities, on the order of 6,000 samples. They start from about 72,000 math problems and subset
00:56:32.480 | that down to a small set of roughly 6,400 problems; no coding in this stage. The reward function basically
00:56:42.400 | incentivizes correctness, penalizes undesirable behavior such as repetition and excessive length,
00:56:48.960 | and encourages proper response formatting. They encourage the model to generate concise outputs
00:56:54.160 | when it's correct and to think more when it's incorrect.
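Here's an illustrative sketch of that kind of outcome-based reward shaping; the weights, length cap, and repetition heuristic are assumptions, not the report's exact reward function.

```python
# Illustrative outcome-based reward shaping (not the report's exact function).
import re

def repetition_fraction(text: str, n: int = 4) -> float:
    """Fraction of n-grams that are repeats; a crude repetition signal."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def outcome_reward(response: str, is_correct: bool, max_len: int = 4096) -> float:
    reward = 1.0 if is_correct else -1.0
    # Encourage concise answers when correct; tolerate longer thinking when wrong.
    length = len(response.split())
    if is_correct and length > max_len:
        reward -= 0.5 * min((length - max_len) / max_len, 1.0)
    # Penalize degenerate repetition.
    reward -= 0.5 * repetition_fraction(response)
    # Encourage proper formatting: a closed think block before the answer.
    if re.search(r"<think>.*</think>", response, flags=re.DOTALL):
        reward += 0.1
    return reward
```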
00:56:59.280 | They implement this with GRPO plus a repetition penalty. Training details: this was the one with a batch size of 64 over
00:57:08.960 | 32 H100s. Sorry, the mini that we talked about used 128 H100s, but this reasoning-plus run
00:57:19.520 | was only 32 H100s with a batch size of 64, for something like a couple hundred hours. They also do context
00:57:26.720 | length extension here, but I think that's enough. I don't want to go too long over. They have evals.
00:57:31.680 | Of course they have evals, but even though we're over time, I want to give it two
00:57:36.240 | minutes for questions. They have a whole main-findings and takeaways section, stuff like that. I'm going to go over the
00:57:42.480 | chat real quick, but if anyone wants to pop in and share, please interrupt me now.
00:57:47.040 | Yeah. So why do they call this SFT instead of DeepSeek's use of the term distillation? Distillation has an actual
00:58:00.480 | distillation loss term, right, where you compare output logits and push the student toward what the
00:58:05.680 | big model said; that's true distillation. They note that what they do here is basically just
00:58:11.520 | SFT on the big model's outputs.
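To make that distinction concrete, here's a minimal sketch assuming a teacher and student that share a tokenizer; the temperature, toy shapes, and function names are placeholders, not anyone's actual training code. True distillation adds a KL term on the teacher's logits, while SFT on teacher outputs is plain cross-entropy on teacher-generated tokens.

```python
# Sketch: distillation loss (match teacher logits) vs. SFT on teacher outputs.
# Shapes: logits are (batch, seq, vocab); labels are (batch, seq) token ids.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened teacher and student distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def sft_loss(student_logits, teacher_token_ids):
    """Plain cross-entropy on teacher-generated tokens -- what the report calls SFT."""
    vocab = student_logits.size(-1)
    return F.cross_entropy(student_logits.reshape(-1, vocab),
                           teacher_token_ids.reshape(-1))

# Toy tensors just to show the call shapes.
student = torch.randn(2, 8, 100)
teacher = torch.randn(2, 8, 100)
labels = torch.randint(0, 100, (2, 8))
print(distillation_loss(student, teacher).item(), sft_loss(student, labels).item())
```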
00:58:16.480 | Next question: what's the difference between Phi-4-reasoning and reasoning-mini? Phi-4-reasoning is built on their big 14B model, so Phi-4 14B
00:58:24.560 | gets trained to do reasoning, and it's on par with o1-mini, o1, and the 70B distills. The reasoning-
00:58:32.240 | mini model is done on a small model: they do post-training on a 3.8B and they show
00:58:40.240 | that it can match the distilled 7Bs. They also show that if you take the distillation recipe that was
00:58:47.760 | applied to big models and apply it directly to a small model, it doesn't work: they try the S1 data
00:58:53.280 | set on a small model and it does worse. But they have a blueprint for how to do this for really small models.
00:59:01.200 | Is there anything interesting in the mini ablation section? Maybe which part do they deem most important for
00:59:08.560 | small models: following their narrative, is it mid-training, or do they imply the combination is the key?
00:59:14.000 | Let's see. They say their distillation pipeline serves as an effective
00:59:20.720 | approach to improve reasoning. I don't think there's much here that's super
00:59:29.520 | important; basically they just say you need to do some mid-training, and they pose some open
00:59:37.040 | questions. Yeah, cool. Okay, thank you everyone. Sorry for the last-minute paper change.
00:59:46.160 | I low-key read this like an hour and a half ago, but hopefully it was useful. I'm sure some of this is
00:59:54.480 | wrong, but yeah. Next week we should have Professor Tom, who does the by-hand illustrations,
01:00:05.440 | talk about the differences in the Llama 1, 2, 3, and 4 architectures. I'll share more in Discord. He was
01:00:12.320 | going to just do Llama 4, but we pushed it a week so he has time to prep a better comparison of
01:00:18.080 | what's actually changing between the series. So that should be next. We always need
01:00:22.480 | volunteers, so if anyone wants to volunteer a paper, let us know and we'll slot you in.
01:00:29.360 | So Flo says: I can't wait till GPT 5.4 drops. And I can't tell if people are saying Phi-4
01:00:38.960 | or GPT 5.4. People will be saying GPT 5.4, because no one talks about the Phi
01:00:46.400 | models. These are cool, but really not many people talk about them, and not many people use
01:00:52.400 | them. Still, it's always nice to have; it's open research and they do a lot of work.
01:00:58.320 | But yeah, nobody really talks about these. The only people that talk about them are people that
01:01:02.640 | kind of shit on them for training on benchmarks. If I'm not mistaken, they made
01:01:10.080 | one of these multimodal, and Phi-4 or Phi-3 with audio had the best transcription word
01:01:20.000 | error rate. Someone fact-check me on this; I'll follow up somewhere. I'm pretty sure the
01:01:27.040 | multimodal version of this had the best word error rate out of any model. Now, that's not saying a lot,
01:01:32.240 | because speech and transcription models are usually small, low-latency, and efficiency-optimized, while this is a fat
01:01:38.320 | model. But yeah, multimodal large models get good; what can you say? Okay, I feel like
01:01:48.000 | that's enough. Thanks guys. I would end the meeting, but swyx is the host, so he will end it.