
Latent Space Paper Club: AIEWF Special Edition (Test of Time, DeepSeek R1/V3) — Vibhu Sapra


Chapters

0:00 Paper Club Year in Review & Future Plans
8:00 DeepSeek Paper Discussion
9:10 DeepSeek R1 (May 28th Update)
12:40 DeepSeek Distillation
16:51 Original DeepSeek Model Overview (DeepSeek V3 and R1)
21:15 Development of reasoning capabilities through a pure RL process
24:46 DeepSeek R1-Zero
39:05 DeepSeek R1 four-stage training pipeline
44:15 Distillation Strategy
52:34 Community and Call to Action

Transcript

Okay, so PaperClub year in review. We've gone over a year, like a year and a half, no missed weeks, we've always done PaperClub. And it's pretty interesting, you know, I don't think any of us expected it to get this far, but every week for the past year and a half, we have always done a PaperClub Wednesday at noon.

We've had a bunch of authors come share their work, people from Nvidia, Meta, Allen AI, Amazon Together, Ryder, bunch of authors come, we get direct one-hour sessions with them, they share their work, we get nice feedback, and you know, some of these have started to do pretty well. On average, we get about 100 people in here every session, and just for, you know, for information, that's like on a Wednesday, workday at noon, we have 100 people joined to discuss a random Paper.

And some of the big ones, DeepSeek V3 had 300 live people sitting, listening to me yap about DeepSeek. Of course, all the other speakers do great, this is all built on volunteers, right? And along the way, we made many friends, and yeah, PaperClub went much further than we expected.

Now, the launch, you know, we have to ship something, we basically have to launch at World's Fair. So, we're launching our test of time PaperClub. This is going to be a V2 second PaperClub. So, the one that we do now are Wednesday at noon, this is still sticking around, it will still be the same thing.

Every week, we'll kind of have a paper, something that's trending, cover it, have an author, have our Q&A session, you know, 30 minutes of paper presentation, whether that's highlights or that's slides. And then we'll continue that with some discussion. But test of time PaperClub is going to be a little bit different, you know?

So, it's more curriculum based. Basically, we take this idea of, what's everything that you would need to know to be a good AI engineer? What are the themes? What are the core papers that don't change? So, stuff like, you know, what is attention? How does sequential text generation work?

So, what's going on in GPT-2? How do, like, optimizers work? What about, like, the key inference techniques, right? So, stuff like speculative decoding, flash attention, stable diffusion, whisper, the key papers that, you know, are the foundation of what's being built. We're going to kind of group those together, and session by session, we'll go over them.

So, we're going to kick off in July and run till December. That leaves us six months. Six months is about four weeks a month, you know, 24 weeks. Every week, we plan to go through two to four pre-presented papers. So, you kind of have a bit more structure to it.

There'll be presentations, but, you know, in six months, we can get through, like, 50 to 100 papers, and we can cover the core concepts. And every week will be kind of different, right? So, like, one week, we might be talking about whisper, and you're interested, right? In one week, you can get the fundamentals of speech, speech-to-text, text-to-speech, all that stuff.

Another day, it might be something like image generation. So, you know, we'll go over clip, stable diffusion, how does all this stuff work? Extend it out to video. Basically, we'll segment the core topics that we should need to know, and then we'll cover it every week. The exciting announcement here is we're also, for the first time, having SF section.

So, we're gonna do in-person SF Clubs and a remote section. I am gonna mute someone real quick. We've got some crazy echo, someone needs to be muted. Oh, crazy. I'll just mute myself. Easy. I'll mute my own laptop. Okay, continuing on. If someone needs to interrupt, just say it in the Zoom chat, because I don't have my speakers going.

But yeah, we'll have an in-person session in San Francisco every week, and of course, we'll keep the remote thing. So, we'll keep it going, and Original Paper Club will still be its own. Every week something comes out, we'll cover that, and we won't really deviate too much on the schedule for test of time.

It's gonna be foundational papers, plus some blogs. So, what are the topics? You know, this is still up for us to all decide. So, just listing out some of the few ones here. You know, we have foundations of deep learning. Something like attention, optimization, ReLU, gradient descent, what is basic RL, how do these things work?

So, you know, we'll have a section on foundations of deep learning. Another one, LLM foundations, right? So, pre-LLMs, we have like RNNs, LSTMs, bi-directional RNNs, all this stuff. Then we have like the very, very foundational models, right? So like BERT from Google, GPT-2, stuff like that. We'll have a day of everything kind of pre-LLM, and go over those.

Then after that, we'll have, you know, the actual generative LLMs. I'm sure that's missing here, but you know, like Llama 3, DeepSeek, the core LLMs that you would expect to hear about, we'll go over those. We'll have a day of pre-, mid-, and post-training, right? So what are the scaling law papers?

What is chinchilla? How does distillation work? What are the kind of key papers that you would want to know for training? And once again, this will be like a one to two sessions. So in one week, we'll go over three to four papers. We'll have someone present. So come prepared.

We'll go over, you know, what are the fundamentals in scaling laws, distillation? What is Chinchilla scaling? What is overtraining? What are the Llama scaling laws? What are small model scaling laws? So, like, what the Phi team has really done, you know, for small language models: how should we do proper RL? What is post-training for small models versus big models?

Well, we'll have someone cover all these, and that'll be a one to two-week section. Then generative models, so you know, CLIP, Segment Anything, diffusion; some of the key agent papers; fine-tuning, we'll go over LoRA, QLoRA, DPO, RL, GRPO; voice, we'll have Whisper; an optimization stage, so you know, speculative decoding, flash attention; we'll have eval tracks, RecSys, Eugene Yan, you know, he hosts our original paper club, he'll fill in good ones there; much, much more.

So yeah, you know, these categories are still up for debate. So I have a form here later, fill it out on stuff that you would want to add, any papers that you recommend, any topics, it's very straightforward. You know, also join our discord, we have the paper club channel in discord.

So join that, add in topics, anything, if you want to cover, you know, if you want to volunteer to be a speaker, over the next like week, I'll flesh out a rough schedule. And then from there, we'll kind of take it, you know, start covering it, make sure that every session we have speakers, we'll figure out logistics.

So SF will have a venue, we have a few in mind, and remote will be the same Zoom thing. But yeah, we want to fragment this out, get people interested, and we'll obviously still bring in speakers, you know, so some of these key authors, we know them.

So we'll invite them to speak on these papers, and it'll be good, we'll have discussion sessions. These sessions might be a little bit more than an hour, since we now have three to four papers, and we still want to go deep, right? We don't want to just do a TLDR of the paper, we want to cover the foundations, the fundamentals, and then go a little bit deeper.

So stay in touch on Discord, very, very basic Google form. But yeah, now we'll have curriculum-based stuff. You know, like Flo here might talk about music generation, Eugene will talk about state space stuff. So key, same members, but yeah, a second paper club. Now, the other part of this is, you know, if you kind of have an area of interest, this will be like your go-to session of, you know, here are the five to 10 papers, here's a presentation on them, here's discussion, and it will stay live on YouTube. You can kind of step in at any time and be caught up to date; you'll kind of go from fundamentals to what you need to know for every session.

And of course, we'll have the same lively discussion. Okay, test of time, paper club, very, very hype. But today is still paper club, we can't not do a paper. So you know, this is just mochi picture because everyone at the conference is having dogs in their slides. So I need them too.

So we can't just have an old paper club back here, we need our OG. So today's paper is going to be Deep Seek. So Deep Seek was obviously a popular paper. I know a lot of people haven't had a chance to actually go through the paper. And frankly, I didn't have much time to prep.

So you know, I get to reuse my slides, I'm smart like that. This is also being recorded for the broader AI engineer conference workshops and speakers. So it's another point, you know, this is why we do paper club, we get these discussions on papers out, and then you know, people can find them later, like our original Deep Seek paper reading, basically, like, you know, that was just last minute, let's highlight the paper, make some basic slides, have some discussion.

But guess what, 300 people joined live, there are over 1,000 views on a YouTube video of us just reading through a paper. So it's a pretty key paper. It's like one of the big open source models that kind of marked a big transition. So let's just go over it again.

And of course, there is new stuff. So as of this week, or last week, we have DeepSeek R1-0528, basically the May 28th update. Now, you know, there have been rumors that okay, DeepSeek R2 is coming out, it's coming out, and then people start launching stuff.

Guess what, they didn't do it. They just did the same model, they called it a minor update. But it's actually not that small of an update. So let's dig into what it basically is. It's not an R2, like a revision two; it's the same naming scheme, but it's significantly better, actually.

So, yep, Simon Willison in his keynote mentioned how we don't have good naming for models. This is basically quite a step up, but they've kept the name. So also plugging Simon's talk: he gave a keynote on the past six months and what the 30 models were. He launched Pelican Bench.

It's his own benchmark of how he judges how good foundation models are. He needs to check the new Deep Seek model. I don't think he did. But basically, here's what they did. They did better post training on Deep Seek v3. And now guess what, it got better. So some of this stuff, they put out very little information.

But when you dig in, you see that one of the key things is it's much better at reasoning. So the AIME 2025 score went from 70% to 87.5%. That basically means, you know, this is a good reasoning benchmark. From before, we now have R1 matching the performance of O3 and Gemini 2.5 level on math, coding and reasoning, which is quite a step up.

You know, it used to be like, okay, we have O1 level intelligence and open source. Yeah, all we needed was a little bit more training. And now we have O3 and 2.5 level intelligence. And this isn't even like a new model from them. This is just like, let's do a little bit better post training.

Let's do a little bit more, and we can get significant, significant performance increases. Pretty wild, right? Like 18% improvement on benchmarks. And no one's really talking about this. DeepSeek got a lot better. One of the stats they give is that the original DeepSeek R1 would take 12,000 tokens on average to reason through the benchmark.

For an AIME, it would take 12,000 tokens of reasoning. They basically did more RL. Now it reasons more. On average, it reasons for 25,000 tokens. So they got the model to do double the reasoning. One thing that we talk a lot about is scaling laws, right? So before we would do optimal scaling for base models.

Now in Llama, we kind of did overfit our training for inference time, right? Let's really over train so we can do fast basic inference. And then now in this world of test time compute, where we do more inference time compute, we can also scale even more in that dimension.

So the original DeepSeek, on average, would reason for 12,000 tokens; the new model can double that. So in this domain, we've doubled the amount of reasoning it can do. And a lot of benchmarks went up, you know, 18% on AIME, and it's a lot better at coding.

So in this, they intentionally wanted to do better JSON output, function calling, and more reasoning. And yeah, they just dropped it like that. Here's kind of our benchmark chart. On most paper clubs, we don't do benchmarks. But yeah, it's actually kind of up there with O3 and Gemini 2.5.

So, you know, the darkest color is our new revision of Deep Seek R1. And yeah, it's actually very good. On most benchmarks, it's like significantly better than the original Deep Seek R1. And you can see this, you know, humanity's last exam, it basically went from not being able to do anything to, okay, now this thing can do well.

What they did is now it can basically reason for twice as long. Okay, that's not the only drop. They also launched another distillation. This is kind of the interesting one. Now, not many people really talked about this at all on Twitter. But if we remember, in the original DeepSeek paper, what they did was, outside of their original DeepSeek model, they released a set of distillation models.

They distilled the Qwen series and the Llama series. And they showed how distilling from the big model, distilling on these reasoning traces, we can get really, really good performance on small models. Well, they did it again. They took Qwen3 8B, and they did another distillation with their new reasoning model.

And they show that, you know, the new model, this new basic distillation basically like kills the old one. So basically, when you look at their old distill versus their new distillation, they get another 10% performance boost by just doing distillation from a better reasoning model. So in a few months time, they were able to get Deep Seek to do more chain of thought, more reasoning, use that to distill down to an 8B.

And now we have an even better 8B. So very, very interesting little note, right? Not only do we get a 10% improvement over the last distill, their 8B distillation is actually matching the performance of Qwen3's 235 billion parameter, 22B active thinking model. Horrible naming, I know, but let's take a second to think about that, right?

A Qwen3 8B dense model, so a small 8B model, is matching the performance of the Qwen3 235B thinking model. And this is not a native thinking model, right? This is a base distillation of an 8 billion parameter model. Just on distillation, so logit-matching distillation from a big model, we're matching the performance of their 235 billion MoE thinking model.

That's pretty wild. They didn't do this to the 32B or the 70B, but yeah, it's pretty crazy, right? An under-discussed release, that the new 8B does very, very well. How do we see this? We can see that the chain of thought improvements distill down really, really well. This was one of the key findings in the original paper: a better recipe for training small models is to distill down from big models, and reasoning models make this even more efficient.

And then this is just kind of their follow-up, right? We don't have a paper, we don't have too much on this, but these are benchmarks that show it. And of course, the model is open source, open weight, everything. So those are kind of the overviews. Let's see, let's see. Now let's go over the actual, the original DeepSeek paper.

So that's kind of ending where we had the new releases. So we have two models to recap. We have a new DeepSeek version. So for May 28th, we have a new DeepSeek. It's now on par with OpenAI's O3 and Gemini 2.5. It's significantly better, and it reasons for twice as long.

We also took that model, we distilled it down to Qwen3 8B, and we have a much, much better small 8B reasoning model. This shows that, you know, reasoning models distill down very, very efficiently, and there's still a lot of juice to be squeezed out there. Now, from here, for those that haven't seen it, we're going to take a two second pause, see if there's any interesting questions.

There's a Hugging Face link. If these benchmarks are public, won't models be trained to score better on these benchmarks? Yeah, benchmarks obviously have their cons, their flaws, but you know, there's ways to see what models are overfit on them. How do they do in general performance? In general, the DeepSeek models are actually doing very, very well.

So from there, let's go into the original DeepSeek model. So okay, DeepSeek V3, hypest paper of the year, 300 people joined us live. Let's do a quick recap. This is basically me using my old slides because I can, but let's talk about what happened in DeepSeek. So we're going to kick off with a high level model overview.

So what are the models they release? When they release this, it's not just DeepSeek V3, right? They also have R1. What is inference time scaling? What is test time compute? What makes reasoning models different? So if you guys don't remember, this was the first test time scaling open, you know, open model, right?

This was the first one that got good. OpenAI released O1. Claude thinking came much later. Gemini thinking came much later. We didn't really understand what was happening. We thought there was like MCTS. There was a lot of, you know, Monte Carlo tree search. What's going on? There was internal, let's generate chain of thought.

Let's train it to do chain of thought. But it turns out DeepSeek comes out with this paper. They do a great model and they're like, yo, RL, RL works. So two models were released. DeepSeek R1-Zero. Basically, they take a base model. They do a lot of GRPO RL. They have training templates, reward models.

They have this emergent capability, reflection, aha moments. Then they do R1, which is basically a four stage pipeline. They have cold start, a reasoning RL stage, rejection sampling and SFT. And then, you know, of course, the little RL round two to get it to really, really reason. From there, we'll talk about performance and evals of how the original R1 does. How is DeepSeek?

How is DeepSeq? Then the original distillation models, future work, reproductions and whatnot. We'll kind of skip over the base DeepSeq evals and performance because, you know, we already covered the new one. But okay, continuing on. So, high level overview. For those that understand, that don't really follow, you know, we keep hearing this term of test time scaling.

What is test time scaling? What are thinking models and what does this mean? So basically, we got to a point where we started to overtrain our models. We basically hit a scaling limit on how much we can train models. Originally, back in the day, we used to do sort of these chinchilla scaling laws, right?

We had a fixed compute budget, we had a fixed amount of data set, we would design a model around that. So how many parameters should it be based on how much data we have? Let's fit a model to our data, let's fit how many GPUs we have, and then let's train it to be kind of chinchilla optimal.
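
(To make the Chinchilla idea concrete, here is a rough back-of-the-envelope sketch, using the common rule-of-thumb approximations that training compute is about C ≈ 6·N·D FLOPs and that the Chinchilla-optimal data budget is about 20 tokens per parameter; the constants and the 70B/45T comparison below are illustrative, not from the talk.)

```python
# Rough Chinchilla-style allocation sketch (rule-of-thumb constants, not exact fits):
# training compute C ~ 6 * N * D FLOPs, and Chinchilla-optimal D ~ 20 * N tokens.
def chinchilla_optimal(compute_flops: float):
    n_params = (compute_flops / 120.0) ** 0.5  # solve C = 6 * N * (20 * N) = 120 * N^2
    n_tokens = 20.0 * n_params
    return n_params, n_tokens

for c in (1e23, 1e24, 1e25):
    n, d = chinchilla_optimal(c)
    print(f"C={c:.0e} FLOPs -> ~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")

# By contrast, a 70B model trained on ~45T tokens (the figure mentioned later in
# the talk) is ~640 tokens per parameter, roughly 32x past the 20-tokens/param rule:
# deliberate overtraining aimed at cheap inference, not Chinchilla-optimal training.
```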

From there, we kind of realized that, okay, this isn't really what we want. And we started really, really scaling up our training. So we had stuff like, you know, Mistral, Llama 1, Llama 2, Llama 3, we started training these models from billions of tokens to trillions of tokens. So you know, we had originally like a 1B on 1 trillion tokens, then Llama 3 was 15 trillion tokens.

Now they're up to like 45 trillion tokens. So what we shifted was, instead of training for, you know, model chinchilla optimal, let's start training for this sort of inference optimal training regime. So instead of, you know, thinking about what we have now, let's think about inference time. As we scale this, we want a model to be as densely packed as smart as possible.

So Llama is like, okay, basically, if we continue training, we don't really see degradation. But the problem with this is, it gets very, very expensive, right? As you train more and more, yeah, it's very, very compute intensive. And you know, how many times can you scale this up?

Like how much data do we have? How much can we really fit in? If we're at 45 trillion tokens for like a 70B, can we scale that up 10x again? You know, are we going to do 450 trillion tokens? What if we want to 10x it again? We're basically hitting the compute scale for training, right?

We can't continually just keep scaling up our train runs because it's no longer like cost efficient, right? We're spending millions and millions and hundreds of millions of dollars on these train runs. You can only scale so much. So a lot of hypothesis was, you know, there's going to be a sort of plateau and open models will start to catch up because, you know, we're already all scaling so much.

But that's where, you know, we need to unlock another dimension, which in this case was reasoning or test time training. So this is where we basically started to do reasoning capabilities without any supervised data, right? There were approaches to try to do, you know, let's generate a bunch of chain of thought reasoning style data.

Let's do post training on it. And yeah, our models do a little better, but this didn't scale. What we needed was pure RL. Can we do pure RL to get reasoning? So this is a quote from the paper. Basically, the DeepSeek team says, "Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process." So what they're going to do is post train the DeepSeek V3 base with GRPO, which is, you know, basically pure RL.

And then as they do this, they start to notice emergence of great reasoning, reflection, aha moment, and they start to match O1. Now, fast forward a few months, as we can see today, they can basically continue this. They do a little bit more RL. They do RL on longer traces.

They can now get the model to not just match O1, but match O3, and double its amount of reasoning tokens. Okay, here's kind of the four step approach to how they train this R1 model. They start with this sort of cold start where, you know, they jump start with some SFT, then they do RL for reasoning, then they have this key step of rejection sampling for data generation.

You know, rejection sampling is something we'll talk about later. Then once again, they do the fourth stage of basic RL polishing. Okay, that's a very high level overview of what DeepSeek did to make R1. So to recap, you know, we needed to shift from next token predictors to scaling on another axis.

To do this, instead of spending the same compute for every token generated, we want to dynamically spend more compute on different queries. So we train the models now with pure RL to reason through their questions. So instead, you know, they're now trained to do RL with verifiable outputs on a lot of code and math data that can be verified, you know, whether it compiles, whether it's factually correct, whether the math is logically correct, and now we can basically do native RL.
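
(As a concrete illustration of what "verifiable" means for code, here is a hedged sketch: run the candidate solution against unit tests and give reward 1 only if everything passes. The function names and the bare subprocess call are simplifications; a real pipeline would sandbox execution and use a proper test harness.)

```python
# Sketch of a verifiable reward for code: does the candidate run and pass the tests?
import os
import subprocess
import sys
import tempfile

def code_reward(candidate_solution: str, test_code: str, timeout_s: float = 5.0) -> float:
    program = candidate_solution + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return 1.0 if proc.returncode == 0 else 0.0  # ran cleanly and passed every assert
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.unlink(path)

solution = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(code_reward(solution, tests))  # 1.0
```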

Doing this, we notice emergence, reflection, aha moments, and now we basically have another domain in which to scale models. Instead of scaling up a 10x order of magnitude of, you know, instead of 45 trillion tokens, let's train on 450 trillion tokens, and then, you know, scale that up and up.

We're kind of at the limit there. We start with a really good base model that's a good general next token predictor. We do RL, and now we can scale in the reasoning domain. So a few months ago, they showed that, you know, they can get O1 level performance, and then fast forward to now, we have O3 level performance with just more RL.

Tangentially, this was kind of the paper to kick it all off, right? DeepSeek showed two things. One, you can do RL from base models, and we can get reasoning. Two, distillation really works, and it's a much better approach for small models. Following this, we've now had a lot more papers.

So shout out to, like, the Phi-4 models, right? Phi-4 showed how effective RL can be. They took Phi-4 mini, and they made it a reasoning variant. Basically, they took about 6,000 samples, and in 6,000 samples of RL, they could take a small base model, a small base next token predictor, and do really, really good reasoning inference on it.

So this was the one to kick it off. And now, you know, the Qwen models have done this. So there are Qwen thinking models, the DeepSeek thinking models. And then there are families like the Phi models that show how to do this in small models. Okay, let's continue on from my high level overview.

So what did they release? They released two models early on. There's R1-Zero, which is a great reasoning model, trained only with RL on unraveling its chain of thought. But it's not a great general model. R1-Zero is what you get when you only do RL on chain of thought; you know, it doesn't really turn into general performance.

So R1 is trained as the second model. It uses outputs of R1-Zero in this four stage training. And then, you know, now we start to do RL back on human tasks, back on chat, and we have a good, good reasoning model. The second thing they release is their distillations, right?

So they take these thinking traces, and they take models in the Qwen and Llama families, and then they distill them down. So these are not natively trained with RL, they're distillations from the base models in the Qwen and Llama families, and then they show how this works really well. Okay, now, of course, you know, it's 2025, we don't get real papers anymore.

So they don't talk about data, how many tokens, where the data comes from. But you know, they still do share a lot about how this stuff works. Models, of course, fully open source, MIT license, no training data, no code, they have a Deep Seek API, which at the time, you know, it was much faster than anyone, it was much cheaper than anyone, turns out this was fake news.

This API is very unreliable, it's barely ever up. But you know, now, a few months later, after this has come out, we can see stuff like from open router, you know, Deep Seek actually takes a significant chunk of the pie, they take about 10% of API usage that goes through them.

So this model actually sparked a lot of adoption, it reopened the race of, you know, okay, we're not done scaling, models are still getting a lot better. And yeah, you know, once again, it's been like a few months, and as of last week, we have another update to this.

And then the Qwen team followed along. What was fake news? Fake news was the DeepSeek API. When it launched, it was 10x faster and 10x cheaper than other inference providers. But turns out it was super unreliable. No one could make API keys, it was almost always down. But it's okay, you know, it exists if you still want to try it.

It's a really good model. A lot of people use it. Okay, let's dig deep into these topics. Inference time scaling. So what is inference time scaling? We have O1 versus GPT-4o. Now there's O3, O4-mini, all these things. But basically, what these do is, you increase the chain of thought reasoning process, you allow models to spend more time thinking before they respond.

Now, it's very common, we've used a lot of these models, right? They're starting to become even a little bit more agentic with RL. That's kind of what's progressed since we last covered this. So, pre-training LLMs, spending hundreds of millions of dollars, starts to get exponentially more expensive, right?

We don't want to go from hundreds of millions of dollars for pre-training an LLM to spending billions on training runs, right? So we needed another axis. So instead, we shifted to this paradigm of inference time scaling, where you take really hard questions, you do inference time chain of thought with RL, and we have a new dimension on which we can scale.

So previously, people tried to do process based reward modeling. So we would try RL, we would do beam search, we would do MCTS, we would do all these inference time tricks, you know, let's predict multiple tokens, go down these branches. They were all hacks, but nothing was really close to O1, right?

We would see Twitter demos, we would see some fundraising that even came out of, you know, okay, I have much better performance than Llama 3 because I go down 10 trees of thought, and you know, I'm doing all this stuff on the back end using a bunch of tokens and gluing it together.

But this is really not the right approach, as we see. What really worked is just native, pure, beautiful, scaled up RL. So once again, this is what DeepSeek did to make V3. Here's the one slide: if you want to know what DeepSeek V3 is, it's open source GPT-4o. Oh, sorry, this is DeepSeek V3.

This is the precursor to R1. So GPT-4o quality, 37 billion active parameters. This is the, you know, regular MoE non-reasoning model. This is what they build R1 off of. So it's an MoE, 671 billion parameters, 37 billion active. They launched this, it was a good model. It's just really chunky, right?

You can't run it on your laptop. No one can run a 700 B model. They made this whole, you know, we're better than everyone else where, you know, we could train this model in $5 million. And I think that this is actually true, right? Looking back at it, a lot of what we see is Deep Seek and Chinese labs were really able to catch up to the United States because of the constraints, right?

We put a lot of trade restrictions, we couldn't give them GPUs. So they had to get clever and smart with what they got. And basically, you know, they did very, very strong inference optimization, they made the most of what they had, right? We could have continued scaling, right? We could have thrown this 14.8 trillion tokens into 150 trillion tokens, but China didn't have GPUs for that, right?

The DeepSeek lab was like, we realize that we can't scale this in the same dimension. So they had to get creative and think about, okay, what if we do RL? And that's basically what they did. So V3 was, you know, an MoE with 37 billion active; they introduced this concept of multi-head latent attention; 15 trillion tokens; they did SFT, and then, you know, traditional RL, so RLHF, to make it a chat model; and multi-token prediction, which came out of Meta, because they needed to be sample efficient with their 15 trillion tokens.

So multi-token prediction for a little bit more sample efficiency, trained in FP8, and they did some long context extension, basically first trained at 32K, then they extended this out to 128K. It came out a month before R1. After that, R1 came out, people got mad hyped. Now we have R1 V2, basically.
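
(For flavor, here is a toy sketch of the multi-token prediction idea, an auxiliary head that predicts one extra token ahead so each sequence gives more training signal; this is a conceptual simplification, not DeepSeek-V3's actual MTP module or loss weighting.)

```python
# Toy multi-token-prediction sketch: main head predicts token t+1, an auxiliary
# head predicts token t+2, and its loss is mixed in with a small weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPHead(nn.Module):
    def __init__(self, d_model: int, vocab: int):
        super().__init__()
        self.next_head = nn.Linear(d_model, vocab)  # predicts token t+1
        self.mtp_head = nn.Linear(d_model, vocab)   # predicts token t+2

    def forward(self, hidden, targets, mtp_weight: float = 0.3):
        # hidden: (B, T, d_model) transformer states, targets: (B, T) token ids
        logits1 = self.next_head(hidden[:, :-1])    # aligned with targets[:, 1:]
        logits2 = self.mtp_head(hidden[:, :-2])     # aligned with targets[:, 2:]
        loss1 = F.cross_entropy(logits1.reshape(-1, logits1.size(-1)),
                                targets[:, 1:].reshape(-1))
        loss2 = F.cross_entropy(logits2.reshape(-1, logits2.size(-1)),
                                targets[:, 2:].reshape(-1))
        return loss1 + mtp_weight * loss2           # auxiliary MTP loss term

head = ToyMTPHead(d_model=64, vocab=1000)
print(head(torch.randn(2, 16, 64), torch.randint(0, 1000, (2, 16))))
```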

So these are kind of, you know, their fancy diagrams. We have the DeepSeek V3 base, we do SFT and RL, you have an SFT checkpoint, RL with RLHF to get, sorry, fine tune with RL to get DeepSeek R1. Okay, what is DeepSeek R1-Zero? R1-Zero is where you don't do any SFT, you take a pure base model.

For those that don't remember, base models are the models that come out when you do your pre-training. So we train models to predict the next token, right? So these are not models like GPT-4o or O1. These are pure base models. All they do is predict the next token. So they're kind of completion models, right?

You can't normally chat with these. All they do is complete your sentence, complete your word. So they take the DeepSeek V3 base model, they apply pure RL, they don't do any SFT, they don't train it as a user assistant chat model, they don't do any of that. It uses GRPO for RL, which they actually introduced quite a while back in DeepSeekMath.

The reward is based on both accuracy and format. Responses must be verifiably correct, right? So what are they doing here? They take a base non-chat model, they use GRPO style RL on math and code, and they require verifiable outputs, right?

They need their output to be verifiably correct, right? So for math, you have a correct output to your math question, right? Is the answer correct or not correct? If it's correct, that's good. We can RL on that. For code, does your code compile? If it compiles, that's good. We can do RL on that.

And, you know, these are LeetCode style questions. So for LeetCode style questions, we know the answer, we know if the answer is correct. Then, basically, you know, in the minute details, they format the rewards: the model has to output its thinking between think tags. So we take our base model, we do RL to do a bunch of thinking, to get its chain of thought thinking process.
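
(Here is a minimal sketch of that kind of rule-based reward, one term for format, thinking wrapped in think tags, and one for verifiable accuracy; the exact tags, weights, and string matching are illustrative assumptions, not DeepSeek's code.)

```python
# Rule-based reward sketch: format reward (think/answer tags) + accuracy reward.
import re

def format_reward(completion: str) -> float:
    ok = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                        completion, flags=re.DOTALL))
    return 1.0 if ok else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if not m:
        return 0.0
    return 1.0 if m.group(1).strip() == reference.strip() else 0.0

def reward(completion: str, reference: str) -> float:
    # Accuracy dominates; the format term keeps outputs parseable.
    return accuracy_reward(completion, reference) + 0.1 * format_reward(completion)

print(reward("<think>2 + 2, carry nothing...</think><answer>4</answer>", "4"))  # 1.1
```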

And from there, we kind of have DeepSeek R1-Zero. It's a reasoning, thinking model that's good at outputting thinking and answers, but it hasn't really been trained to be a useful assistant. It's not R1 yet. This is just a good thinking model that can unravel its thought process to generate answers.

We'll probably skip this slide. This is GRPO. This is kind of the RL algorithm that they use. It's, you know, comparing PPO and GRPO. This is how they do the RL. We'll kind of skip it. The key things to note: there's no critic model. You have group based rewards that are scored.
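
(Since the slide is being skipped, here is a minimal sketch of GRPO's central trick, group-relative advantages: sample a group of completions per prompt, score them with the rule-based reward, and normalize within the group instead of training a separate critic; this is a simplification of the idea, not DeepSeek's implementation.)

```python
# Group-relative advantages: no learned value/critic model, just normalize
# each completion's reward against the others sampled for the same prompt.
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: scores for one prompt's G sampled completions."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Toy example: 6 sampled answers to one math prompt, scored by the rule-based
# reward above (correct answers near 1, wrong answers 0).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.1, 0.0, 1.0]))

# The policy update then applies a PPO-style clipped surrogate per token using
# these group-relative advantages, with no separate critic network at all.
```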

It has stability updates. There's, you know, a KL divergence term. But yeah, they do GRPO. We're going to skip it in this talk for now. So, okay. DeepSeek R1-Zero. We took a base model. We did RL. We now have a thinking model. How does it perform? Pretty good. AIME, you know, it passes O1 Mini at the time.

Math: it passes MATH-500, it passes O1 Mini. It's like on par with, but slightly worse than, O1. And then, you know, the charts show that we're able to do this inference time scaling by doing RL on really hard questions. And it kind of works. It works pretty well.

So, yes. Chart. Number goes up. Number goes up even more the more you train. This thing is starting to stably learn. What else do we learn? So, it naturally gains the ability to solve these complex tasks by extending test time compute, right? Here we know that the original R1-Zero ranges from hundreds to thousands of reasoning tokens.

Turns out that this was a pretty key factor. In the update that DeepSeek released last week, they scaled from thousands of reasoning tokens to double that. So now we go from 12,000 tokens on average to about 24,000 tokens. And once again, we shift from getting O1 level to O3 level performance in the new DeepSeek model.

So other stuff, you know, there's this emergence of interesting behaviors as test time compute increases. So as we're able to increase our test time training, as we can reason for more and more time, we start to see this emergence of interesting behaviors. Basically, as models learn to reason for longer and longer, as you get to thousands of steps of reasoning, models are able to start to have these reflection moments.

So, you know, the more reasoning you do, models start to learn, okay, I'm actually not forced to just output my next token. I'm not forced to do 100 tokens of thinking. I'm not forced to do 1,000. I can continue down this thinking path. And we noticed that, you know, as they start to reason for longer, models start to do this sort of reflection phase, models start to revisit and reevaluate their previous steps, they start to go down alternative paths, and you know, this behavior kind of arises spontaneously, this spontaneous emergence, right?

So spontaneously, you know, models will be like, okay, I tried this, this, this, it's not working, I don't have to answer right now. Let me try this new thing. And guess what, it works. And then we also have these aha moments. So a very core takeaway of the paper, you know, this is kind of what got the DeepSeek authors to realize, okay, RL actually works.

The more time that we think for, the further along the thinking traces, we start to get these models to have these aha moments. So here's a quote, again, from the paper: this moment is not only an aha moment for the model, but also for the researchers observing its behavior.

It underscores the power and beauty of reinforcement learning. Rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem solving strategies. The aha moment serves as a powerful reminder of the potential of reinforcement learning to unlock new levels of intelligence in AI systems.

So basically, instead of telling models how to reason, with what traces, instead of doing SFT on chain of thought traces, we just give it this sparse incentive of, you know, reason as much as you want. The more time it takes, the more we start to notice this emergence of, aha, I finally see what it should be.

I tried these six steps, and this seventh step is working. And you know, this shows the power of RL. And this has been kind of passed down into other papers. Microsoft has put out really good scaling laws on doing RL on small models. The Qwen team has done really good thinking models.

They have an online talk here, so please, please watch the Qwen reasoning talk; a speaker talks about how they do that. But okay, back to DeepSeek, back to DeepSeek. This is an example of an aha moment. So, the question is a basic math question, you know, not that basic, actually, I can't answer this.

If a is greater than one, then the sum of the real solutions of the square root of a minus the square root of a plus x, equal to x, is equal to something. Well, as the model starts to think, you know, it realizes, oh, there's an x squared, right? If I square this, I can get x squared, then I can isolate this out, right?

It's doing reasoning. It's thinking through this math problem, step by step, and then it says this in its own generation: wait, wait, wait, that's an aha moment I can flag here. Let's reevaluate this. And like, you know, very interesting aha moments start to pop up.

So yeah, that's kind of an overview of what R1-Zero was. R1-Zero is basically where we take a base model, we train it with pure RL on math and code, we do RL on thinking steps. And now we have a reasoning model, and it works. But it's not great. It doesn't actually have great readability.

It starts to reason in multiple languages, whoa, it reasoned in Chinese, who would have guessed? So we want to make it, you know, we can't just skip RLHF, right? We want to make this thing a good chat model. So next section: instead of R1-Zero, how do we make DeepSeek R1?

How do we take DeepSeek R1 from being just a reasoning model to a reasoning, useful assistant that's good at, you know, actually being a helpful chat assistant? So, key solution, giving it to you straight: cold start. Instead of taking the base model straight to RL, take a base model, do some regular SFT.

SFT is kind of, you know, here's some prompt answer, here's chat assistant, get it to, you know, do a cold start, get it to understand you're still a useful assistant. After that, do some RL. Do this core RL on very hard code math, get it to start to understand how to think, how to reason.

Scale it out until it starts to see these aha moments, until it can do proper reasoning. From there, let's do some rejection sampling. We want to throw out examples that don't work, right? Stuff where it's going wrong, where it has negative behavior. We do rejection sampling on it, and then, once again, our last stage of training, stage four, which is another round of RL.

So, going through these, stage one, cold start with strong SFT, prevents the model from getting unstable. So, basically, you take a base model, you have a long chain of thought style data set, so, you know, prompt answer pair, so user assistant, think through your process of how to solve this, Give your chain of thought, we take a base model, train it on this chain of thought, then from there, you know, we do our cold start with strong, strong SFT.

They don't just do synthetic data, they have human annotators, we want better readability, right? So, do your thinking and think tags, and then this is on the scale of a couple thousand examples. So, base model, thousand examples, a couple thousand examples of SFT, then our main RL stage, you know?

So, same RL process as R1-Zero: do a lot of RL on verifiable hard questions, so math, LeetCode style coding, stuff that we can verify has the right answer. Do a lot of RL, and then, you know, stage three, rejection sampling. So, generate completions, rank them with a reward model, fine tune the original model.

So, this was standard; Llama 3 showed us this concept of rejection sampling, and, you know, DeepSeek takes it further. We do rejection sampling on samples that we don't like. There were 800,000 samples that were generated: 600,000 reasoning, 200,000 general chat. We rank them, and then we do our rejection sampling training.
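
(Roughly, the rejection-sampling stage can be pictured like this sketch; the sample_fn/score_fn interfaces are assumptions, and the real pipeline mixes rule checks, a reward model, and human filtering.)

```python
# Rejection sampling sketch: sample several completions per prompt, keep only
# the best-scoring ones, and use the surviving pairs as new SFT data.
from typing import Callable, List, Tuple

def rejection_sample(prompts: List[str],
                     sample_fn: Callable[[str, int], List[str]],
                     score_fn: Callable[[str, str], float],
                     n_candidates: int = 8,
                     min_score: float = 0.5) -> List[Tuple[str, str]]:
    sft_pairs = []
    for prompt in prompts:
        candidates = sample_fn(prompt, n_candidates)
        scored = sorted(((score_fn(prompt, c), c) for c in candidates), reverse=True)
        best_score, best = scored[0]
        if best_score >= min_score:  # reject prompts with no acceptable sample
            sft_pairs.append((prompt, best))
    return sft_pairs

# Toy usage with stub functions; the kept pairs then feed another SFT round.
demo = rejection_sample(
    ["What is 2+2?"],
    sample_fn=lambda p, n: ["5", "4", "4, because 2+2=4"][:n],
    score_fn=lambda p, c: 1.0 if "4" in c else 0.0,
)
print(demo)
```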

Then, final RL stage for general use. This is very similar to, you know, similar to RLHF. You want to make the model helpful, harmless, and reason good. So, for reasoning, we use, you know, we want to keep it a good reasoner. Add in that reasoning for hard math questions, code questions.

For general chat, capture human preference and nuance situations. So, you know, this question should have a very verbose, detailed answer. This is a basic summary, keep it short, but keep your reasoning there. So, final stage is, you know, this final stage of RL, and now, guess what? Model is good at being a chat model.

The model is good at thinking. The model has emergent aha behaviors, and the model is just a good chat model. Performance and evals, we're going to skip this. Basically, when we launched DeepSeek, it was O1 level, but forget that. We launched an update two weeks ago. So, new model, DeepSeek R1, launched May 28th.

And instead of being O1 level, the new DeepSeek R1 model now reasons for twice as long. It's as good as O3 and Gemini 2.5 Pro. Much better at math. Much better at coding. Much better at reasoning. It now has support for native function calling, JSON outputs, and it no longer hallucinates as much.

Second model drop. And of course, you know, performance charts. All the regular benchmarks you would expect. The model is performing as well as O3 and Gemini 2.5 now. All that was done: do more RL. AIME jumped, you know, 17.5 points, and they doubled the reasoning tokens. Basically, we doubled our reasoning axis. We now reason for twice as long.

So, double the reasoning effort. And yeah, now we're at O3 and Gemini 2.5 level. The other drop: we have a new distillation. We take our new model that reasons for twice as long, we distill this down to Qwen3 8B, and we do a distillation loss on this reasoning. And we get performance matching the Qwen 235 billion reasoning model.

So, our dense 8B model, not a natively trained reasoner, distilled from the new DeepSeek model, is as good as Qwen3 235B, which is an MoE reasoning model, which is pretty crazy, you know? The implications of this show that, you know, long, detailed, good reasoning really has a deep impact. Once again, check out the Microsoft work for good distillation scaling laws on this.

Okay, okay. Back to our paper, the original DeepSeek; that was kind of the performance of where we're at now. Distillation. Let's talk about these distillation models. So, what we did was distill R1 down into Llama and Qwen models. This is not RL. This is basic SFT. We have these models that reason for 25,000 to 30,000 tokens.

We take these traces, do SFT-style distillation. So, proper distillation. Match your logits. And guess what? This showed so much performance, but we know RL can do better. So, you know, all open source, all traces are open. Someone do RL-based distillation from the big models. No one has done this, as far as I know.
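
(As a rough sketch of SFT-style distillation on reasoning traces, assuming a Hugging Face-style causal LM and tokenizer: the student is simply fine-tuned with next-token cross-entropy on teacher-generated traces; any explicit logit-matching term would be an extra design choice on top of this.)

```python
# Distillation-as-SFT sketch: train the small student to reproduce the big
# reasoning model's full <think>...</think> + answer trace, token by token.
import torch
import torch.nn.functional as F

def distill_step(student, tokenizer, prompt: str, teacher_trace: str, optimizer):
    text = prompt + teacher_trace  # teacher_trace: long chain of thought + answer
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = student(input_ids=ids).logits
    # Standard causal LM loss: predict token t+1 from tokens up to t.
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```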

But, you know, we're able to get such good performance and there's still so much left to be done. But, let's go over what we distilled out. So, these are the family of models. This is, once again, pure SFT-style distillation. Oh shit, my slides are gone. They're back. Okay, so, we distilled Qwen 1.5B, Qwen 7B, 14B, 32B, Llama 8B, and 70B.

Performance-wise, they killed the models they came from. So, you take the model itself, you look at our distillation, and it worked. The numbers went up like crazy. Really, really good performance. All our distills are now basically on par with GPT-4o. And our new one, our new 8B distill, is much, much better. It's way better than 4o.

It's way better than 4O. And, this is just, once again, RL on long chain of thought. Take that, do SFT-style distillation. Question, what if we just did RL on the base model, right? We tried it. We tried RL on Quen 32B for 10K steps. It's actually worse than distillation.

So, you know, for small models, we don't want to just start with native RL. We saw this in our own model, right? We needed this cold start. We need to kick off something from base models. From R1-Zero to actual R1, we had to do this SFT cold start. So, it actually performed significantly worse than distillation.

And, of course, the Qwen team at the time, they had their own reasoning model, right? They had QwQ-32B. And, you know, the pure RL run on the base model did worse than what Qwen did, but kind of on par, you know, it's probably similar to what they did. Our distillation on the chat model, on the base chat model, actually performed so much better.

We were able to do so much better than the Qwen reasoning model. And, of course, now we've taken this a step further with our new model. Okay, future work. R1 is worse than V3 at function calling, multi-turn, and JSON mode. Guess what? Two weeks ago, there's a new DeepSeek model.

We now do native function calling. We have JSON mode, we fixed it, it works. R1 struggles with language mixing. We don't know how we do on the new one. It's sensitive to prompting. I think this is something that the industry needs to figure out, right? Spoiler alert for some Latent Space podcast fans, you know, we've talked to some reasoning experts.

So, like, some of the stuff that we're seeing, you know, researchers at OpenAI, they're saying, you know, if you're still doing scaffolding with reasoning models, we're failing as labs. So, you know, we need to learn how to prompt these things better. And it's not much better at software engineering tasks than V3.

That's all fake. This is old news. This was our old DeepSeek. The new DeepSeek is a lot better. Open recreations, we want to promote open research, right? So, there were people trying to recreate this. Hugging Face has a version of this. Bespoke Labs was doing this. I don't know how they're doing.

There's quite a few people now that have done this. But, yeah, that's kind of an overview. I know a lot more people have joined. We now have, like, 100, 200 people in the audience. So, I'm going to do a quick recap in our last 10 minutes. So, first, we are launching a second paper club.

Every week, we do our normal paper club where we take the latest paper. We have 100 people that join every week, 300 for DeepSeek. We're turning this into a test of time paper club. If you're interested, sign up. We're going to run this in SF and we're going to do it remotely.

Over the next six months, we're going to cover 50 to 100 papers. We're going to break up what you would need to know as an AI engineer into different buckets. So, stuff like, you know, what are the foundations of deep learning? Attention, RL, optimizers, Adam, gradient descent; foundation models that you should know about, GPT-2, BERT, RNNs, LSTMs; pre-training, post-training, mid-training.

So, scaling laws, chinchilla, distillation, we'll cover days of diffusion, optimization, voice, fine-tuning. We basically have a paper club where every week, we're going to split up these core concepts into a few papers. We'll have a presentation of three to four papers. Everyone is welcome to join. We're going to have a presentation on every core concept and then open discussion.

This is not a course, courses are good. You know, you have active workshop, you build stuff, you actually like do active learning. This is still a paper club, you know? This is, if you want to know the foundations of what's going on under the hood, these are the key papers to know.

We'll invite a lot of speakers, we'll have people present, we'll have good discussions. But yeah, test of time paper club coming in June, scan QR code. Let us know if you want to be involved, if you want to recommend a paper, share a paper. We'll share curriculum soon, join the Latent Space Discord.

We already have a list of top 2025 papers. It's a paper every week that you can go through; we're going to build off of that. And once again, final recap, what did we talk about today? Today we talked about the new DeepSeek model. So, two weeks ago, DeepSeek R1 May 28th came out.

Basically, DeepSeek took the last DeepSeek model that was as good as O1, they trained it to reason for twice as long, and we got significantly better performance. DeepSeek R1 May 28th can now do standard structured JSON output, native function calling, hallucinates less, reasons for twice as long, and is a much, much better jump in performance.

From being O1 level, DeepSeek is now on par with OpenAI's O3 model and Gemini 2.5. Basically, across the board on all benchmarks, we are now as good as Gemini 2.5 and OpenAI's O3. The other model that was released is the DeepSeek R1 Qwen3 8B distill. So, we took Qwen3 8B, distilled it down into a reasoning model based on our longer traces, and we killed it.

So, it's a small 8B where we do post-training via distillation, SFT on reasoning traces, and the model got really good. You take base Qwen3 8B, and you take our Qwen3 8B, and it's very good. Looking at the benchmarks here, you know, our open source, on-device runnable Qwen3 8B distill is on par with Gemini 2.5 Flash Thinking and O3 Mini Medium, better than Phi-4, significantly better than base Qwen3 8B; our 8B reasoner is better than Qwen 32B, and it's on par with Qwen's 235B reasoning model.

So, you know, two major updates: a new reasoning model from DeepSeek that is as good as O3 and Gemini 2.5, and a new mini 8B model that is as good as 2.5 Flash Thinking and O3 Mini. Of course, all open source, runs on your laptop, just as good, you know, and these are not even R2; this is not DeepSeek R2, this is a minor refresh with a new date.

Yeah, so high level, that's what we talked about, you know. We see aha moments; instead of training on next token prediction, we scale this out to inference time scaling. So, we now train models to reason and think for longer, we get aha moments, and that's kind of the update with our new DeepSeek models.

So, thanks to everyone for coming, thanks for listening, join PaperClub. Oh yeah, so a lot of our regulars that help make PaperClub are here, Eugene, Ara, RJ, Flo is here. If anyone else is a regular in PaperClub, you know, come up, every week we have our weekly PaperClub, these are the homies that make it possible.

We're going to have our second PaperClub, test of time will be there soon. But yeah, you know, major shout out, this is not me, this is not SWIX, this is volunteers and more on Zoom. Every week on a weekday at noon, hundreds of you join in to discuss the Paper, so you know, big shout out to everyone here, not possible without us.

You know, all the authors as well that have come, all the authors that have been able to share, let's just give it up for everyone that makes PaperClub possible: Eugene Yan, who is over there running his track; Sam, I know, is speaking right now; and swyx, who is putting all this on.

Yeah, of course, once again, I'll leave our QR code here. If you're interested in test of time, volunteering for a paper, recommending a paper, fill it out, you know, help us out, this is our PaperClub selfie. But yeah, thanks for coming out, everyone. Enjoy the rest of the conference.

We'll see you next time.