Cool. I will kick off by straight-up disagreeing with Sean there. V3 is not required reading for R1. V3 is just a model. It's like any other model. You know, like all the models, you just train on next token; it's a next-token model. It's a good model, but up until 15 minutes ago I'd forgotten about the slides for V3.
So, you know, that's how useless it is. But anyway, high level, I'm just going to go through the more applicable parts of the paper. Like, why do we care? If anyone has questions, comments, thoughts, interrupt me. It's more discussion than it is me yapping, you know. If you want yapping, go read the paper.
So, outline, high level, high level. So, let's talk about the two models, mostly the second model. What is inference time scaling? What's this test time compute? They talk about what previous approaches are, so we'll kind of discuss that a little bit. Also, if people are talking in chat, I'm not super seeing it.
If anything's, like, important. We're just talking about V3 and ignoring you. Okay, okay. Gross, V3. We had a slide, don't worry. So, then we'll talk about R1-0. So, DeepSeek R1 is not really just one model. There's two models. There's DeepSeek R1-0 and then there's DeepSeek R1. They actually put out both.
They're both pretty good. It's a different approach to how they did both, but yeah, kind of what they are, the training template, reward models, how they determine all this emergence, reflection, aha moments. And then we'll talk about what R1 really is. R1 is taking a reasoning model and then turning into a chat model again.
So, now we've got more than just a base model and chat model. Now we have base model, reasoning model that's not a good chat model, and then reasoning models that can chat again. Most of this paper and most of the slides are actually on R1-0. It's a very interesting one.
Apparently, an hour ago, the ARC benchmark guys, they put out news that R1-0 is better than R1. So, it's better on some stuff. You know, you take smart model, you make a chat model, it's going to become dumber. People are dumb. We like to chat. Then we'll talk about performance evals.
Then a really cool thing they did with distillation. So, they distilled R1 into Llama-3 and Qwen models. And they just kind of, you know, we're going to drop this real quick. They broke the US economy. They're like, not only do you get R1, you also get Llama-R1, you get Qwen-R1, you get R1-0, which isn't a chat model, you get everything.
And then they have future work. Some people have done reproductions of the work. Someone yesterday, from Bespoke Labs I think, put out a reasoning-style data set. So, we'll just, you know, yap and discuss about that. But, yeah. So, high level, the point of all these reasoning models, the reason why everyone cares, is that they're kind of changing the scaling curve: from let's just train bigger and bigger models, throw more compute at the problem so inference gets a little bit better, to let's start trading that upfront cost for inference-time compute.
And yeah, that's kind of what they did. Before, OpenAI was the only one to do it with O1 and O1-mini. And then, you know, give it a few months and DeepSeek, out of nowhere, has just put out a really good model. It's like on par with OpenAI's O1. It's like better than all the other models.
Completely open sourced it. The paper's not that good. They don't really talk much about the training data. Like Sean mentioned earlier, it's a pretty basic approach to what they did. So, there's not much in the paper. They don't talk much about the data. They don't talk much about a lot.
But anyway, it's a paper. Weights are there. It's MIT licensed, which is pretty good. And you know, it's still a V1 of this. Just like OpenAI has O1 and there'll be O3, there'll be an R2. There'll be other companies that do this. Mistral might do it if they still exist.
Llama will do it. So, there'll be others and then we'll only see improvements from here. One of the quotes from their paper, their goal, so like they say, "Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process." So, TLDR, what they really found out is just throw RL at the problem and you can get a really, really good model.
People kind of, you know, set aside RL for a while, but yeah, it turns out you can just throw RL at the problem and it's pretty good. But this is a kind of interesting note, right? So, they wanted to develop reasoning capabilities without any supervised data. Without any supervised data means, you know, we don't go out, we don't label a bunch of, "Hey, here's chain of thought.
Here's the ideal reasoning." A lot of people would think about, "Okay, if you have an agent, if you have a coding problem, there's 10, 20, 30 different approaches to get to the same answer. How do we optimize this stuff?" But no, they're not doing supervised data. Their goal was to do it without any supervised data, just self-evolution and RL.
And they did some really good RL. And they did put out good math; they explained this RL. And yeah, that kind of, you know, blew up. So, what they do is they post-train the base DeepSeek V3 model, which I don't think is that important, it's just, you know, a big model, with this GRPO.
GRPO is their type of RL. We'll go into it in a bit. And then they start to notice these emergent capabilities that come out. There's great reasoning that starts to come out. So, you know, you don't train on reasoning data, but dang, you get reasoning. You just train on hard questions, hard data.
Then reflection starts to become a thing. So, you know, the model starts to reflect, like think on its actions, thinks on its steps. It has these aha moments where it's like, "Oh, shoot, that's crazy. This is what the right step is." And then it continues. And you get like O1 level performance.
Then from this, you know, zero model, R1-0, they train the actual DeepSeek R1. They have a four-stage approach for training it. And that becomes a really good reasoning and chat model. So, four stages; they have like this cold start to make sure things don't go crazy at the beginning.
And guess what? They do SFT. It's not just RL. Then they do RL. Then they do rejection sampling. Then they do RL again. So, not one RL. There are two RL stages at the problem, you know, so double the RL. But yeah, so high level, those are the two models.
R1-0, it's a great reasoning-only model. It's trained on unlabeled chain of thought, you know, with RL. It's not good as a general model. Then R1, it's created from outputs from R1-0 and that four-stage training method. It's a really good model. It's like O1. Then the other half of the paper, not half, but like, you know, they have a section, they have like a paragraph on, "Hey, by the way, we just distill our outputs into Qwen and Llama." They do very, very well.
So, that, they're not doing native RL training. They're doing proper distillation. So, they take their big model. They train it with a distillation loss. Well, they don't say what type of distillation, but you know, standard distillation is distillation loss. Then they compare it to the base models and it performs very well.
A little note, they do try, they make a note like, "Okay, what if we did RL?" So, they take Qwen 32B, they do like 10K steps of RL. They compare that to distillation and they find distillation much better. They make a small claim like, "Yeah, you know, if you do RL, it's very compute expensive.
Like, it's hard to do. It doesn't work as well as just distilling. So, maybe in the future we still need these big base models." Being 2025, you know, no one talks about any data. They don't talk about where it came from. They just say, you know, get good quality data.
Performance is very good. Models are fully open source with MIT license. They don't give training data. They don't give training code either. They host the model themselves on their own API. Something interesting to note is as much as people are raving about how good this thing is, DeepSeek themselves are also serving it very cheaply and very fast.
So, 3 to 10x faster and also cheaper than other infra providers. But, you know, if you use the DeepSeek API, they clearly state that the data goes to servers in China. So, use at your own risk, but very, very cheap model, very, very good model. Their API is a lot faster and cheaper.
Part of that is because, you know, they know everything about how to optimize this thing. They built it and it just came out. The other providers that are hosting it, well, you know, they just have the model and they're trying to run it. But, yeah, from there, let's go into DeepSeek V3 real quick.
This is my one slider. So, we say it's important. We'll stop after this and discuss it a little, but basically it's just a regular LLM. It's a pretty large model. It's chunky. It's 671 billion parameters, but 37 billion active parameters, which is pretty interesting. You know, it's a lot of experts in there, but effective parameters are pretty small.
It's basically a 30B-class model at inference time, fully open source. It's GPT-4o level. It's not the reasoning one; this is just a standard big MoE model. They made this little claim, you know, that training this thing took about $5.5 million. They had a few steps to this. So they trained it, then they did two stages of context length extension: first they trained the thing as a base model, then they do 32K and then 128K context length extension. It's trained on about 15 trillion tokens; very standard, you know: train it, do SFT, do RL. The model is pretty good.
They have this concept of multi-head latent attention. It's pretty cool. If anything, that would be like the next slide if I had to have three slides, but, you know, they have fancy attention. They do multi-token prediction. We covered the paper from Meta a few months ago that talks about this, where, you know, it's more sample efficient.
You can do multi-token prediction. Meta put out a paper. They're like, "Oh shit, this works. It's pretty good. People should do it." And then not many people did it, and then they did it, and it helps. Came out a month ago. People are very hyped. The other day, it kind of, you know, broke America real quick.
NVIDIA dropped $600 million because they said they'd train this in $5 million. So, yeah, I'll take a little pause. That's high level of the, you know, what they've released, how it works, what's going on under the hood. This is DeepSeek v3. It's their big MOE. It's got 37 billion active parameters.
They say it was cheap to train. They trained it on 15 trillion tokens, but yeah, this is the, you know, step zero. This is the base model that the reasoning model is built on. This is very similar to models like Mixtral or GPT-4o. It's just a big MoE model.
Oh, $600 billion, not $600 million. NVIDIA dropped heavy. Big, big drop. America blew up real quick. But yeah, so all the reasoning models are built on top of this as a base model. But yeah, if we want to pause here, anyone have thoughts, points, anything that they loved about this DeepSeek v3?
Which, in and of itself, is a good model. It's cheap. A thing to note at a, you know, high-level AI-engineering view: using reasoning models is cool, but they're also kind of slow, right? Like, if you need the full completion, thinking is cool, but sometimes I just want output, right? Models are pretty good at compressing. Sometimes I want fast speed. This is like GPT-4o level and very fast, and it's only 37B active parameters. So a lot of the time people would probably want to use this. You don't need a reasoning model for everything, right? If you run a chatbot, that's cool.
You can probably just run this. Later in the conclusion, there's a slide that shows what future work they want to do on the reasoning model and they show how v3 is actually better at something. So not, not to undermine this, you know, it's still very good, very smart, fast, cheap.
It's a good model, but it's just an MoE. But yeah, anyone want to chime in, any questions, anything interesting in chat? A lot of questions. Sorry. There's a lot of questions. I don't know which one to focus on. Okay. I'm going to see the first one. So how are active parameters different from total parameters? So total parameters, you know, you still have to load all of this into memory. So 671 billion parameters: you need lots and lots of GPUs to load this thing. But at inference time, it's only using a fraction of these, right? About 5% of them, roughly 37 billion parameters. So realistically, it's a lot more efficient per token. It's going to be faster, it's going to be cheaper. But it's not something that you can just host yourself, right? Like, your laptop might be able to host a 30-billion-parameter model: you load all those weights into memory and you use it. This is kind of like that at inference time, but it needs all the weights loaded up.
I think the point a lot of people miss is that you do save memory at scale. You might not save memory if you're having one chat on your laptop, because every token may use a different subset of the parameters, so you need them all loaded. But if you're doing batch inference, like DeepSeek themselves are doing, then they can route things to each different GPU however they want and saturate them a bit more. Yep. At batch, it's just very efficient. And that also means it's faster too.
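To make that active-versus-total distinction concrete, here's a toy mixture-of-experts layer in PyTorch. This is an illustrative sketch with made-up sizes, not DeepSeek's architecture: every expert has to sit in memory, but only the routed top-k experts actually run for a given token.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy MoE layer: every expert lives in memory, but each token only
    runs through the top_k experts the router picks for it."""
    def __init__(self, d_model=64, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                      # x: (n_tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # only the routed experts do any work
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

layer = ToyMoELayer()
total = sum(p.numel() for p in layer.parameters())
active = (layer.top_k * sum(p.numel() for p in layer.experts[0].parameters())
          + sum(p.numel() for p in layer.router.parameters()))
print(f"total params: {total:,}  ~active per token: {active:,}")
out = layer(torch.randn(8, 64))                # 8 tokens; all 16 experts stay loaded
```

Same idea at DeepSeek V3 scale: all 671B parameters have to be resident, but each token only touches roughly 37B of them, which is where the throughput and cost win comes from, especially with batching.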
Okay. What is FP8 training? So, mixed-precision training. You know, before we used to train in full precision, then we realized, oh shoot, we can do half precision, FP16. Now they cut precision again, down to FP8. It's just an interesting thing. I think they're the first ones that have done it. On the inference side, this is like quantization, right? You can run a model in four-bit or half precision and there's a slight degradation in quality. But on the training side, we typically need as much precision as possible. In this case, they can do FP8 training.
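Just as a rough back-of-the-envelope for why precision matters at this scale (weights only; optimizer state, gradients, and activations make the real picture much bigger):

```python
params = 671e9  # DeepSeek V3 total parameter count
for name, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("FP8", 1)]:
    print(f"{name:>9}: {params * bytes_per_param / 1e12:.2f} TB just to hold the weights")
```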
Well, they did it. Also, yeah, another interesting key component was that they're training this without an auxiliary loss. So if you know about MoEs, that's a pretty interesting piece there. But, okay: can we trust them on the $5 million cost claim at face value? You can take it both ways.
People have gone into token economics of how much it would cost to train this many tokens at this scale, and it can be around here. But realistically, this is like, you know, maybe the final train run cost around this, but this doesn't include any of the R&D, any of the other experiments.
Like, it's probably more than that, but either way, you know, it's out, it's open source, it's good, it's small. It was cheap. I think Dario from Anthropic, their co-founder, mentioned something about this recently, like how Claude 3.5 was also trained in the tens of millions.
It's nothing crazy. The other interesting thing was the whole GPU restriction, right? So people say that they have to say this because they're not allowed GPUs. And if they say they had GPUs, then, you know, they lose their little supplier, but they have GPUs. You know, who knows? Yeah, I wanted to add, because there's a lot of speculation about whether this is real or not, right?
But like, one of the best things about the open weights is that it's in the code and we can validate these things. So it is definitely an MoE model. It definitely has the number of experts that was stated, and the community has already essentially made calculators for how much it costs to create an MoE model. And if you work it out backwards, it's between five and ten million. So maybe the exact number is off, but I think a lot of people are missing the point that it's in that ballpark. And for contrast, Llama 3 405B was around 50 million. And that is based on the amount of compute time.
So, apples to apples, it's much cheaper. Yeah. Okay. I think these other questions have good discussion in chat, so I'm going to let them continue. Can I also just add one thing? I think one thing that DeepSeek V3 did very differently from V2 is, crap, it just escaped me. Yeah. Auxiliary-loss-free MoE training, without an auxiliary loss. So I thought that was pretty interesting and it really simplified a lot of things. And that was a big step up from V2. So, I mean, if you have the paper open, just go through it. They make a big deal out of it.
I'm not sure how much of a difference it makes though. Yeah. Okay. I have the important, I have the R1 paper open, not the, not the other one. Also, I think I'm only sharing my Chrome screen, so we don't get my paper this time. My beautiful highlights. It's okay.
I took some screenshots of charts, but, okay, let's move on to the fun one. This was a chart that I was going to pull more charts from. This is from Jay Alammar's blog. It's a good blog, I recommend checking it out. Actually, there was a better chart posted in Discord, like, let me pull it up.
It was just posted about an hour ago. So this is like a better overview of the training pipeline, but this is also kind of what's happening here. So they've got the DeepSeek V3 base. They do SFT on reasoning data examples. They have this SFT checkpoint. Then they do fine-tuning with RL to get DeepSeek R1. So we're going to kind of look at this middle step here, which is DeepSeek R1-0. So this is where they apply pure RL directly to the V3 base model without any SFT data. They use GRPO for RL, which was introduced a little while ago; it actually came out in the DeepSeekMath paper.
So there's a few different ways that the model is rewarded during this RL process. Someone's got their hand up. You want to just ask the question? We haven't gone that deep yet. Sachin, you want to? Yeah. So you can hear me, right? Yeah. So one of the things, so I haven't been following what OpenAI and all the other guys are doing, but what prevents them from doing this, because they have the training data, their own, like, process? And if they run this and verify, because you have the code and all, then they can compare their existing way of doing things versus the new way of doing things, right? So, do you know how long that would take for these guys? Because now they say, okay, this is the new way of doing things, and everybody accepts it. I don't know; if you don't know what data this was trained on and all, this will definitely shake that out. But we are not the guys who basically have the money to train and actually verify this, right? Has anybody done that? Like, these are numbers that only the big guys can tell us, right?
In terms of training, there's not much to verify, right? So like for V3, for Llama models, for the base models, there's verification that people can do, right? You know how many tokens they're trained on. We know what it's like to train these models. For this model, like for R1, we don't have the training code.
We don't have the data, but that doesn't mean that people can't do this. There's a section later about companies that are trying to reproduce this. They also show stuff that we can do, right? So they distill outputs from this into Llama models; that stuff is very attainable, you know, that stuff is now in the hundreds to thousands of dollars. Now that stuff regular people can do. There's a company, I don't remember who, that already put out a fine-tune on DeepSeek-R1-style data. So we can discuss this in a bit, but yeah, there are people that are starting to work on this. But anyway, back to what they're doing here.
So R1-0 is kind of one of the two models, right? So a while ago they put out this paper, the DeepSeekMath paper, where they explained this new GRPO RL algorithm. It's kind of where you have a reward that's based on accuracy, on responses that are verifiably correct. So verifiably correct means you train on data that can be verified: math questions, you know, math that checks out; LeetCode-style code, where you can have something that compiles and runs it to check if it's correct. And then they have like a little format reward. So they want to do this RL that nudges the model to also follow a format, right?
In this case, the format reward is making sure that there's think tags between reasoning and then there's an output at the end. So there's kind of like three things they're testing for here, right? One is the model is being rewarded to one, put thinking traces, right? So it needs to think.
So it's going to put thinking stuff between thinking tags. It needs an answer. So there's going to be an answer and the answer has to be correct. And then that correct answer has to verifiably check out. So then there's kind of this RL algorithm that's, that's applied around all this.
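As a rough sketch of what a rule-based reward in this spirit could look like (the tags match the training template described next; the exact weights and checks here are made up, the paper doesn't spell them out):

```python
import re

THINK_ANSWER = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.S)

def format_reward(completion: str) -> float:
    """Reward emitting reasoning inside <think> tags followed by an <answer> block."""
    return 1.0 if THINK_ANSWER.search(completion) else 0.0

def accuracy_reward(completion: str, reference: str, verifier) -> float:
    """Reward verifiably correct answers. `verifier` is whatever checks the domain:
    exact match for math, a compile-and-run-tests harness for code, etc."""
    m = THINK_ANSWER.search(completion)
    if m is None:
        return 0.0
    return 1.0 if verifier(m.group(2).strip(), reference) else 0.0

def total_reward(completion, reference, verifier, w_fmt=0.1, w_acc=1.0):
    return w_fmt * format_reward(completion) + w_acc * accuracy_reward(completion, reference, verifier)

# Toy example with an exact-match "verifier" for a math answer.
print(total_reward("<think>2 + 2 is 4</think><answer>4</answer>", "4",
                   lambda ans, ref: ans == ref))
```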
This is kind of what the prompt looks like that they train with. So this is the template prompt, right? So conversation between user and assistant, user asks a question, the assistant solves it. The assistant first thinks about the reasoning process in the mind, in its mind, then provides the user with an answer.
The reasoning process and answer are enclosed within think tags and answer tags, respectively. So think goes here, answer goes here, then the assistant. So now, you know, when you answer, the model is prompted to answer with, okay, here's my thinking, here's my reasoning process.
Here's the end of my think tag. Here's an answer tag. Here's my answer. Here's the end of the answer. Then you do a bunch of training with just pure RL. Here's kind of the formula for all this, here's a cool little chart. GRPO. So what is this?
So compared to traditional RL, there's no critic model here. It uses groups of sampled generations to estimate rewards against, and this kind of helps cut the compute cost down, right? You don't need to train a separate critic model. In this case, there are group-based rewards. So this is basically where you score outputs compared with a sampled group, to reward relative performance. So instead of generating one output, you generate a group of outputs, score them, and then, you know, reward the ones that do best out of the group. Then there's of course stability and stuff. A big thing with RL is you have, like, you know, KL divergence: you don't want the model to randomly make drastic changes, because it probably won't get back. So there's a penalty, you know: if something in the group is really good but diverges a lot from the reference, then yeah, we also penalize that. So it's just, this is good RL. We could spend a lot of time on this, but honestly, I think this is good for Discord discussion.
So I'm sure someone will create a thread on GRPO instead of 200 people sitting here thinking about what RL is. High level: there's no critic model; it's, you know, judging and rewarding outputs based on sampled group outputs, and then there's a stability term in there to make sure that we don't diverge if samples go crazy.
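Here's a stripped-down sketch of that group-relative idea, in the spirit of the DeepSeekMath formulation: sample a group of completions per prompt, score them, and use each completion's reward normalized against its own group as the advantage, with a clipped ratio and a KL penalty toward a frozen reference model for stability. Treat the numbers and the sequence-level simplification as illustrative, not as DeepSeek's training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (n_prompts, group_size) scalar rewards per sampled completion.
    No learned critic: the baseline is the group's own mean reward."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, advantages, kl_to_ref, clip_eps=0.2, beta=0.04):
    """PPO-style clipped surrogate plus a KL penalty toward the reference model.
    Log-probs are per completion here; real implementations work token by token."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean() + beta * kl_to_ref.mean()

# Toy numbers: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.1, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))   # correct answers get positive advantage vs their group
```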
Now R1-0, how does it perform? So it performs really well, and there was no labeled SFT training data. It's just a base model trained with this RL to output correct responses and add some thinking. It does well, and with majority voting it does even better. So here are kind of the benchmarks.
If we look at it compared to O1 and O1-mini, R1-0 does, you know, pretty good on most benchmarks: on math, on LiveCodeBench, on Codeforces, it's pretty good, up there. And then when you do majority voting, which is, you know, you generate a bunch of samples and take the most common answer, it does significantly better.
The key thing here was they actually trained this thing on very, very hard questions. So just good training quality questions and yeah, it does pretty well. You generate a bunch of samples, you pick the one that's the best out of a group of them, you kind of nudge it towards doing better there.
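Majority voting itself is dead simple: sample a bunch of answers at nonzero temperature and take the most common one. A minimal sketch, with `generate` as an assumed stand-in that returns the parsed final answer:

```python
import random
from collections import Counter

def majority_vote(generate, prompt: str, k: int = 16) -> str:
    """Sample k completions and return the most common final answer."""
    answers = [generate(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a sampled model that is right most of the time.
fake_generate = lambda _: random.choice(["4", "4", "4", "5"])
print(majority_vote(fake_generate, "What is 2 + 2?"))
```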
Yeah, the next few things were very interesting charts that came out of here. So these are some charts that show how their inference time is correlated with eval performance. This is kind of what you start to see at scale. When they started to train this thing, it didn't work really well, right?
This is just like, okay, why does this work now? This is basic RL. But at scale, we start to see these emergent capabilities, right? As you train for more and more steps, we see that accuracy starts to go up with steps too, right? So for each question, we sample 16 responses, calculate the average, we start to see how it performs.
For more steps, the more kind of steps that you take, the better performance is. Another one here, this was a very interesting one. The average response length of the model also starts to increase. So the longer you train it, the more reasoning steps it starts to take, which means that basically the TLDR of this paper was just this RL thing just kind of works.
And you can see this in the charts, right? The more that we're training this, the model is starting to learn to reason more and more, because the more it reasons, the better the performance is. And throughout more steps, the average length of the response is starting to get longer and longer.
So here's a little quote here: "The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time." So yeah, it's starting to do that. It naturally acquires the ability to solve complex tasks by extending test-time compute, ranging from hundreds to thousands of reasoning tokens, and you get the emergence of interesting behaviors as test-time compute increases. So this was another interesting one. So as you increase test-time compute, they started to notice the emergence of interesting behaviors. Some of these were reflections and aha moments. Reflections are where the model starts to revisit and reevaluate previous steps and explore alternatives.
So as it's doing its thinking, it would start to reflect and be like, "Huh, a few steps ago, I went down this path. Maybe I should look at this again." Aha moments are where it starts to take more time and reevaluate an original approach. So in this example, and this also shows the quality, the example of questions that they're training on.
And you can see more of these as well. If you look at some of the stuff that they trained on, you can look at those data sets and look at the type of questions. But here it's being told to answer this question. It's like, "Okay, here's basic math. I can square both sides.
I can isolate this term." And then it's like, "Wait, wait, wait. That's an aha moment I can flag here." And they start to notice these little emergent capabilities where it's starting to find these aha moments and it's starting to re-reason. It's starting to have these reflections. And there was this kind of interesting quote that I found in the paper.
So this is from the DeepSeek team. They make this section. They say, "This moment is not only an aha moment for the model, but it's also for the researchers observing its behavior. It underscores the power and the beauty of reinforcement learning. Rather than explicitly teaching a model how to solve a problem, we simply provide it with the right incentives and it autonomously develops advanced problem-solving strategies.
The aha moment serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future." But basically, rather than explicitly teaching the model how to solve problems, they train it with RL, and based on those incentives it autonomously starts to develop these problem-solving strategies.
Through its thinking steps here, it starts to realize it has these aha moments, and it also starts to have reflections in its thinking. So that was kind of another interesting thing that came there. So what about DeepSeek R1? What are the kind of problems with R1-0? R1-0 had poor readability.
It also had a lot of language mixing. I'll make a note of this later in the section, and I think we should discuss it. They kept talking about language mixing. They keep talking about how a problem with this model is that it mixes up languages. It goes between English and Chinese, and it's not good at sticking to the language it should be in.
And this is also a problem with the real R1. They weren't able to solve this too well. Now, some of this is due to RL, but yeah, it's just a little interesting note that more than four or five times in the paper, they had mentioned how this thing struggles with language mixing.
So that's kind of where R1-0 had its issues. It wasn't very readable. It was giving out weird thinking steps, aha moments. In the RL objective it's trained with, there's no safety, there's no conciseness, there's no "be a good assistant," no "be a good chat model," no "be fun to chat with." There's nothing like that. So they take R1-0, and then they make R1, which is: let's take this reasoning model that we know we can make and actually turn it into a proper LLM assistant. So we'll make a reasoning chat model, which is what R1 becomes. But I'll take a little pause here.
I know a lot of people pre-read the paper too. Is there anything we want to dive deeper into in R1-0? We could talk about the RL policy itself, the model, how it performs, any of these charts. Yeah, I had a question. There was a step that I must have missed, which is the RL scoring function.
In other words, when the models return these 16 different answers in English, how is it scoring? How is it deciding which one was that? So there's a few things there. There's the verifiably correct, which is part of it. So these questions, if they're LeetCode style questions, you can run a compiler and you can verifiably see what's correct.
If it's a math question, you can verify that the answer matches what the answer should be. And that's the level of distinction they go at. There's a few different categories, but they can verify the answers to make sure that's correct. Then you have the other little parts of the policy, right?
Like you want it to output these think and answer tokens, so you can verify that it did that. If it didn't do that, then it's going to be penalized, right? So another part of this is following this prompt template. If it doesn't output think tokens and answer tokens, or if it outputs them but doesn't give any reasoning, that's not good.
Now, this is just a reasoning model. Some of the changes in the actual R1 are for when you don't need to reason, right? For a question like "hello," you can skip the reasoning. But basically, that's some of the stuff that they-- So just so I understand, the basic concept here is that even doing these kinds of reasoning, whether very simple forms or very complicated ones like mathematics, the idea is that that learning then transfers onto other kinds of responses and reasoning that people want.
Because it was very interesting to me, like one of my test questions is, what is the population below Central Park? And R1 and all of them just fall on their ass. They can't answer this, which any third grader can reason through and come up with a reasonable answer, right?
And a reasonable answer is not a number that's larger in the population of Manhattan or less than 100,000. They just fall to pieces, because they can't seem to reason about this. And the reason I'm asking this is because is the assumption here that if they can solve these math questions and coding questions, that they can then reason about other things?
Is that one of the fundamental assumptions here? Yeah, so in my interpretation, the goal isn't to get it to reason about different things. It's to get it to just output thinking and reasoning, right? So you want it to be able to output its thought process to come to a verifiably correct answer.
And then as there are harder and harder questions, it outputs more or less of its thought process. And you reward it for being right at the end or wrong at the end. And in that, you kind of distill this down to: yeah, for harder questions, it will do more thinking before it answers.
For simple questions, it won't. And that's kind of what I feel like they're going for here. If anyone else has other answers to this or other takes on this-- I just wonder whether-- yeah, I wonder whether, like, is it-- what you're saying is it's not learning new reasoning. It's just that in the fine-tuning step, we're teaching it to actually reason, even though the base model was capable of that before.
So I'm not-- I don't know whether that's true or not. It might be. Yeah, the base model is also very smart, right? But you're teaching it to give out its thought process. And it's also graded against 16 other versions of thought processes to an answer. And it needs to do pretty good on this.
It needs a good thought process. It needs to also learn to mimic this template and everything. But to be clear, it's not being judged on the thought process, only on the-- in other words, the RL reward is on the correctness of the answer, and then some very basic mechanical stuff, like, did you have the word "think" and so on?
And was there something that we're going to call a reasoning process? But we're not going into that and parsing it and trying to understand the reasoning process. All we care about is-- from a reward standpoint, the reward is given if the answer is correct. And these tokens and so on are present.
Actually, if you don't mind if I jump in here, because this is kind of related to my question I have on GRPO, which is Group Relative Policy Optimization. I couldn't find anything on GRPO on YouTube, which I was hoping to get, because I don't want to read this fucking 53-page math paper.
Forgive me. But I found something on direct preference optimization, DPO, and what that was telling me was they removed the reward function from the objective, or loss function. Sorry, I know, another sin. But removing that explicit reward term seemed to be an important part of DPO, along with Kullback-Leibler divergence, in order to have that kind of-- I don't want to say memory, because there is also a stability term with a clip function with epsilons.
But I kind of had that memory as well. So I feel like this GRPO, along with the multi-head latent attention, which was a great pull (thank you so much for that, I really appreciate that; I'm going to look into that soon, from the Meta paper), but it helps with this kind of batch learning. When I hear people talk on Bloomberg about this model, I hear them say, oh, they trained in fewer batches, or the batches were more optimal. And when I hear that, I'm thinking in my head, this is GRPO. But I also have to look into multi-head latent attention before I do that, and probably read this whole 53-page math paper.
I have a little bit of a different use case, I'll tell you. And people should help me out over here. So I look at quite a lot of medical cases, very, very deep things, for which right now I use Perplexity, OpenAI, Claude, and all. And some of these things just blow up in your face, right?
And the answers they give, I go back to literature and verify. And it's a very deep process where I need to-- and then I'm reasoning out with neurosurgeons and guys, and like, why, when I'm questioning them. And some of the data they know, some of they don't know. What ends up happening is OpenAI sometimes puts out just garbage.
I mean, I look at all sorts of reasoning, what it is showing. But the things I'm learning, and I know what the objective is. And my reward process is sometimes, like, so deep inside, is that I know that, hey, this biochemistry thing with this, this, this, whatever, is probably causing this neurological symptom.
Now, what they did over here with the RL part of it, when you said there's a number of candidates and you try to pick from them: the problem in RL is that sometimes there might be one loner where, for the action it takes, the reward might be way down the line. You cannot take the majority of the candidates right up front and say this is the right way it's supposed to be done. And my take is, I haven't looked at this, but this probably works well for smaller domains. But when you try to chain domains and domains together, it's probably going to have a lot of issues, because now we have a combinatorial problem over there. It's like a Go game, you know, whatever it is. So if you have thoughts, let me know. So they have the KL divergence piece, where, you know, they account for that: if responses are significantly different from the rest of the group, we don't take big steps.
But I think at a high level, that's, that's enough on, on the RL. We could go on for that for the rest of the hour. But let's take that to offline discussion. Let's, let's get through what actual R1 is. So that was just R1-0. I'm going to spend the next quick five minutes and then we'll do 10 minutes of discussion, you know.
So, okay, what is DeepSeek R1? So there are four stages to making this thing a good reasoning and chat model. One of them is we have this cold start. So the cold start is, you know, let's start with some strong SFT. Let's not have this thing go crazy at first. They mention they use some human annotators here; they just drop one line on it, you know, kind of interesting if anyone wants to look into that. But so first we'll cold start the training, then we'll do RL, then we'll do rejection sampling for generation, then we'll do RL again.
Okay, quick high level. What are these four stages? So stage one, cold start. You have that DeepSeek V3; cold start the training with strong SFT. SFT on what you want, you know; this prevents the model from going unstable. Use long chain-of-thought, few-shot example prompts to, you know, generate some good, detailed examples of what we want. So generate some good reasoning examples, some reflection, verification, generated from R1-0. Post-process these, have human annotators look at them. This is on the order of thousands of samples, you know, so nothing crazy, but this just starts the model off, you know. So let's get the base model to do some basic SFT. Normally we take base models, we do instruction fine-tuning, we turn them into chat models, right? We do SFT. So here we're going to take the base model and generate some examples of chain-of-thought, few-shot-prompted examples. So this looks like: you take R1-0, or whatever model, and you tell it to generate some examples where you give the formatting you want, you give your thinking steps, a lot of thinking steps, start it off strong. Then they generate a couple thousand examples, post-process them, have human annotators look at them (I don't know, they just put like a line on this), then they do some regular SFT on base DeepSeek V3. That's stage one.
Stage two is they basically do the same exact RL, but they add this language consistency reward. Like we mentioned, you know, they're struggling with language mixing, so another part of this RL is that now we want the language to be consistent. Okay, so we did SFT on really good data, then we do a bunch of RL. They don't explain the data set, where it came from, or how many samples, but they do RL.
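The paper describes the language consistency reward as roughly the proportion of the chain of thought that's in the target language. A crude sketch of that idea; a real implementation would use proper language identification rather than this script check, so treat it as illustrative only:

```python
def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Crude proxy: fraction of whitespace-separated tokens in the target script."""
    words = cot.split()
    if not words:
        return 0.0
    if target_lang == "en":
        in_target = [w for w in words if all(ord(c) < 128 for c in w)]
    else:  # e.g. "zh": count words containing any CJK character
        in_target = [w for w in words if any("\u4e00" <= c <= "\u9fff" for c in w)]
    return len(in_target) / len(words)

print(language_consistency_reward("First, 计算 both sides, then isolate x."))  # ~0.86
```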
Stage three, rejection sampling. Rejection sampling is pretty common. Llama 3 did it, many others have done it. They do this, and this is the first time they talk about how much data; this is kind of like the end of the post-training lifecycle, you know. So they had a big base model, which was V3, did SFT, did a big stage of RL, and now they do rejection sampling. This helps turn it from, you know, the R1-0-style issues we had, to let's start to fix this: let's generate completions and rank them with reward models, LLM-as-a-judge style. So generate outputs, have an LLM judge them, have a reward model judge these outputs and reject some samples, and fine-tune the model on what survives the rejection sampling.
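At a high level, rejection sampling for SFT data looks something like the sketch below: generate several completions per prompt, score them with a reward model or LLM judge, keep only the best ones above some bar, and fine-tune on the survivors. `generate` and `score` here are assumed stand-ins, not anything DeepSeek published:

```python
def rejection_sample(prompts, generate, score, n_samples=8, min_score=0.7):
    """Build an SFT dataset by keeping only high-scoring completions.
    generate(prompt) samples one completion; score(prompt, completion) is a
    reward model / LLM judge returning a scalar in [0, 1]."""
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        best = max(candidates, key=lambda c: score(prompt, c))
        if score(prompt, best) >= min_score:
            kept.append({"prompt": prompt, "completion": best})
    return kept  # feed this straight into ordinary SFT

# Toy usage with dummy generate/score functions.
data = rejection_sample(["2 + 2 = ?"],
                        lambda p: "<think>...</think><answer>4</answer>",
                        lambda p, c: 1.0 if "<answer>4</answer>" in c else 0.0)
print(data)
```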
High level, that's what's happening. Stage four: let's do more RL, you know, throw RL at the problem. So make the model helpful and harmless while keeping the reasoning good; that's kind of the objective here. They do R1-style questions, but they also mix in general chat human preference, nuanced scenarios. You know, we want it to still give a good output, but now we also want it to, you know, give a summary at the end; don't just give an answer, give a summary. So this is kind of that last step. So this is what makes DeepSeek R1 instead of just a reasoning model: you've got to kickstart the thing with SFT so it doesn't go crazy, you've got to do some rejection sampling, and you've got to do this last stage of RL for general use.
Now, yeah, the model is pretty good. It's a normal chat model, it gives thinking steps, it gives little summaries at the end, and it performs pretty well. It still struggles with some language swaps, but you know, across the benchmarks it's better than O1-mini, and it's either better than O1 or sits in between the two. So on some things O1 might beat it, but it still beats O1-mini; on others it's better than both. But it's very good, and it's 37B active parameters. We don't really know the model sizes or active parameters of O1, but this thing's good, it's cheap, it's fast, and DeepSeek has very good inference.
MIT license, it's fully out there. That's kind of R1: four stages. The new parts are kind of, hey, you do this cold start with SFT from a base model, and then these last two stages. You know, you have rejection sampling, where you want to kind of fine-tune it, which is pretty common; they did this on about 800,000 samples, close to a million, and then you do some fine-tuning at the end.
And yeah, it does very, very well. Then the last part of this paper is kind of the distillation step. I'll talk very, very quickly on this. So distillation is where you take a big model, you train it, you generate outputs, and then you mimic those into a small model.
So you can either just take input/output and just do continual post-training and now you've distilled a model, or you can do this distillation loss where you try to get a small model to match the output logit, so not just match the output, match the output distribution, you know, really match the thinking process per se of what the big model is doing.
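For the logit-matching flavor of distillation being described here, the textbook formulation is a temperature-softened KL divergence between the teacher's and the student's output distributions. A sketch of that standard loss (the paper, as noted, doesn't say this is what they used):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Classic knowledge-distillation loss: push the student's output distribution
    toward the teacher's, softened by a temperature. Shapes: (batch, seq, vocab)."""
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as in the original Hinton et al. recipe.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * temperature ** 2

student = torch.randn(2, 5, 100)   # toy logits
teacher = torch.randn(2, 5, 100)
print(distillation_loss(student, teacher))
```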
They do distillation on about 800,000 reasoning samples, and they distill R1 into Llama and Qwen models. So they take the two families and they do distillation. This is not RL. They just do basic SFT distillation on about 800,000 samples, and now the models do very well.
And not only do they do well, they output all of their thinking steps, you know; they become a bit of reasoning models and the performance jumps a lot. So they not only compare them to the base models, which they're all better than, they also compare them to, you know, like GPT-4o, to Claude Sonnet, to O1-mini. And you can see, like, the Qwen 32B distill is doing better than all these models. Like, their distillation work is very good. They open-sourced all these models. They dropped them all. These will run locally, you know. So this is like Llama 8B, Qwen 7B, Qwen 32B. These are models that you can run locally on your laptop.
They're very, very strong models. They'll run pretty quick. And then, just as a bit of extra, they had a lot of ablations, but the one that I found interesting is the question that comes up of, "Hey, what if we just do RL on the other base models?" And they're like, "Okay, let's test it. Let's take Qwen 32B. Let's do our same RL for 10k steps." Well, it does better, but it does nowhere near as well as their distillation. So basically, their takeaway is, "Hey, RL takes a lot of compute. It's hard to do. And it doesn't get the same performance as this distillation.
So maybe in the future, we still need these big models." But yeah, distillation worked very, very well compared to their RL on a 32B. I'm sure other people will go deeper into this. Other people will try it. And they'll, you know, update us on how it does. Okay, future work.
R1 is still worse than V3 at some things. So as much as I shat on V3 not being important, it's still good. It's fast, faster than R1. R1 is worse at function calling, multi-turn, complex role play, and JSON output. Those are just a few things. R1 struggles with language mixing.
I don't know why they note this, but V3 doesn't. So maybe their RL has Chinese data. Maybe I misread something there. Maybe they both do. But R1 struggles with language mixing. R1 is sensitive to prompting. Few-shot prompts degrade the performance. So if you guys are using R1, don't few-shot prompt it.
Don't tell the model how to think. Tell it what you want it to do and let it reason. It will do better. And it's not better at a lot of engineering tasks than V3. They explained why and they explained, you know, what they think they can do to fix them.
But this is just some sort of future work. But high, high level, that's kind of the paper. We have seven minutes left. One option is if there's like one or two quick questions. Otherwise we can talk about future stuff, questions, and what people have. So people are trying to recreate R1.
DeepSeek didn't talk much about the data. Not many specifics about what went on there. Hugging Face has a project to reproduce it. They have a whole chart of how they're going to do it. Bespoke Labs put out a data set. I think it's a great effort. It's not great, in my opinion. I looked at, like, their prompt templating. They heavily try to prompt it into responding in a certain way. But anyway, they have, I think, a hundred-thousand-ish samples of R1 data. They fine-tuned a 7B. They show results. But yeah, what other hot takes have we seen?
Eugene Yan, you have your hand up. You want to join in? Hey, Vibhu. I just wanted to ask a question. Sorry, not a hot take, but a question. So in stage three, right, they actually retrained it based on the DeepSeek V3 base model. So I guess the question I have is why did they not carry on from stage two?
I know Eugene Chia had a take on this on Discord, but I wonder if anyone here would have intuition on why they did this. I had a similar question. So I noticed in the chart as well, same thing going on. Exactly. And the paper actually calls it out specifically as a single sentence on its own.
So it's pretty unique. Yeah. Basically here, right, they restart from the V3 stage instead of continuing from their whole prior process. Why they did this, I wish we knew. Eugene Chia, do you want to share your take? So my speculation is, after they got the first round data set, they trained from the base model from before annealing. And this is just to get a better final outcome. I don't fully agree with this idea, but a lot of fine-tuners swear that you want to fine-tune on pre-annealed models instead of annealed models. Yeah. And we can probably go into a side track on that. But yeah, that's my guess.
I mean, I don't really have a great mathematical or machine learning background on this, but I teach right now, and I feel like I do a lot of reinforcement learning when I'm just trying to get students to get very small reasoning steps, CoT steps, correct. Like, I want you to write down this exponent in the same place I did, because I put it in a different color.
And if you don't do it, I'm going to take off points and this and this and so forth. But what I want them to really understand and technically explain in a nuanced way, that fine tuning of having that discussion, having that expert review is so much more helpful than just a cookbook chain of thought.
So in my mind, fine tuning, but I mean, it's not really from an ML perspective, kind of just from, I talk to people a lot. Oh, thank you. I also forgot to mention what annealing is. So annealing is a process where you essentially flatten out and lower the learning rate.
Yeah. Sam? Hey, did the paper cover, when they create examples for SFT using rejection sampling on the RL checkpoint, what method they use to select the good examples? Sorry, I didn't hear that too well.
When they created SFT samples to do what? Yeah. What method did they use to select the good examples when they're doing rejection sampling SFT on the RL checkpoint? Oh, on the rejection sampling? Yeah. They share a little bit about the rejection sampling, but when they generate the samples, they had this whole section about it; I mean, honestly, the whole paper is very, very vague on all these specifics. They just mention little notes. Like in the cold start, you know, how do we generate it? It says human annotators. It doesn't say what they did. It says, like, you know, slight use of human annotators, we do post-processing. That's cool, you do post-processing, but post-process what? But at some level, they tell you for rejection sampling it's kind of what you would expect, right? So what are they trying to do? What's their goal? They probably want to take away examples that don't have, like, summaries at the end, right?
But they don't go into too much detail. Is language mixing a feature or a bug? Isn't it good if the model can find efficient reasoning techniques? Feature or bug depends on how you see it, right? So language mixing in this case, I believe what they meant is how it responds. Actually, no, they do specifically explain that in some cases, like, the question is asked in Chinese and it responds in English. So it seems more like a bug, right? You don't want that. Cool. Any other quick questions? Any other thoughts? Any other concerns? I'm sure there'll be a lot of discussion continuing on Discord about this, you know.
I'm curious about how people are using these models now. Because one of my uses for thinking models is when I'm trying to brainstorm some creative topic. And yesterday I put side by side Gemini 2.0 Thinking Experimental, whatever, the 01-21 one, and DeepSeek R1, and then the 70-billion Llama distill that's hosted on Groq.
And the answer I got from Gemini was much better than the one I got from the other two. And I was, I don't know, I haven't tried a whole bunch of examples, but I'm curious about whether, what people are using day to day when they want to code and when they want to think.
I gave up. I just stick with Claude. I'm all Claude now. I'm going to wait until there's a little more settling. At the moment I'm kind of wasting my time, which is fine, but I'd rather waste my time by not being as efficient than by hunting for which of the new models suits my uses best here.
But that's a personal opinion. The thing Rahim is talking about is actually very critical, in the sense that different models, all the big guys, they fail, what do you call it, brilliantly, or they just blow up at certain reasoning, and then at other things they do very nicely. Until you've looked at the whole gamut across things, you cannot stick to one thing. I mean, if you have some money, you need to push across all of them and look at the responses. Gemini gives one answer. That's why I want to see whether Perplexity puts out a blog post on this, since they probably have the largest data on what people are trying to do and how they try to use it. So that would be my, what do you call it, next step: go and see what they think of DeepSeek once they've done their own internal thing. But that is, I think, about getting beyond the evals; the evals are just saying, okay, this is our baseline, and from here let's go off to the races, kind of thing.
I mean, specifically for your thing, if we're done, I'll leave it for Discord. So at a quick level, the other thing to note with how people are using these is they're very cheap, right? So if you look at the cost of O1 versus R1, it's like more than 10x cheaper. So stuff that was expensive to do is now not as expensive, right? So they're just good, cheap, fast reasoning models. And the other part is they're open source. So if you want to self-deploy it, you can self-deploy it. If you want to use one of the reasoning models locally, like the Qwen 32B one or the 7B one, they're pretty good locally on your laptop.
So that's how some people use them. I don't know, there's also the availability side, right? Like, I've been playing with that and I've kind of had a little bit of mixed use, and I've had a friend who's been trying to get it to do some gnarly React refactorings and he's been frustrated a little bit. I don't know if that's just not knowing how to prompt it properly for those kinds of very specific and more involved tasks, but also Cursor might end up having a big enough data set on how people are using it specifically for coding. And maybe, I'm hoping, they might publish something on that.
But yeah, it's a good model. Anyway, guys, thanks. Next week we will have Eric Ness facilitating the Titans paper. Discussion will continue in Discord; I'll throw the slides in there. But yeah, see you guys next week. Thanks, everyone. Thanks, Vibhu. Thank you, Vibhu. Thanks, everyone. Thank you, Kevin.
Bye-bye. Bye-bye.