DeepSeek DeepDive (R1, V3, Math, GRPO)

00:00:00.000 |
Cool. I will kick off by straight disagreeing with Sean there. V3 not required reading for 00:00:09.920 |
R1. V3 is just model. It's like any other model. You know, all the models you just train 00:00:14.780 |
on Next Token, it's a Next Token model. It's a good model, but up until 15 minutes ago 00:00:20.740 |
I forgot about the slides for V3. So, you know, that's how useless it is. But anyway, 00:00:27.980 |
high level. I'm just going to go through more applicable parts of paper. Like, why do we 00:00:32.680 |
care? If anyone has questions, comments, thoughts, interrupt me. It's more discussion than it 00:00:38.740 |
is me yapping, you know. If you want yapping, go read the paper. So, outline, high level, 00:00:45.180 |
high level. So, let's talk about the two models, mostly the second model. What is inference 00:00:50.620 |
time scaling? What's this test time compute? They talk about what previous approaches are, 00:00:55.260 |
so we'll kind of discuss that a little bit. Also, if people are talking in chat, I'm not 00:00:59.820 |
super seeing it. If anything's, like, important. We're just talking about V3 and ignoring you. 00:01:04.920 |
Okay, okay. Gross, V3. We had a slide, don't worry. So, then we'll talk about R1-0. So, 00:01:14.000 |
DeepSeek R1 is not really just one model. There's two models. There's DeepSeek R1-0 00:01:19.180 |
and then there's DeepSeek R1. They actually put out both. They're both pretty good. It's 00:01:24.100 |
a different approach to how they did both, but yeah, kind of what they are, the training 00:01:29.100 |
template, reward models, how they determine all this emergence, reflection, aha moments. 00:01:35.300 |
And then we'll talk about what R1 really is. R1 is taking a reasoning model and then turning 00:01:39.880 |
into a chat model again. So, now we've got more than just a base model and chat model. 00:01:45.620 |
Now we have base model, reasoning model that's not a good chat model, and then reasoning 00:01:50.500 |
models that can chat again. Most of this paper and most of the slides are actually on R1-0. 00:01:55.140 |
It's a very interesting one. Apparently, an hour ago, the ARC benchmark guys, they put 00:02:02.660 |
out news that R1-0 is better than R1. So, it's better on some stuff. You know, you take 00:02:09.580 |
smart model, you make a chat model, it's going to become dumber. People are dumb. We like 00:02:13.020 |
to chat. Then we'll talk about performance evals. Then a really cool thing they did with 00:02:18.180 |
distillation. So, they distilled Llama-3 and Qwen models into R1 models. And they just kind 00:02:26.260 |
of, you know, we're going to drop this real quick. They broke the US economy. They're 00:02:30.020 |
like, not only do you get R1, you also get Llama-R1, you get Qwen-R1, you get R1-0, which 00:02:36.460 |
isn't a chat model, you get everything. And then they have future work. Some people have 00:02:41.100 |
done reproductions of the work. Someone yesterday from, I think, Future Labs put out a reasoning 00:02:46.980 |
style data set. So, we'll just, you know, yap and discuss about that. But, yeah. So, 00:02:53.980 |
high level, the point of all these reasoning models, the reason why everyone cares is because 00:02:58.020 |
they're kind of changing the scaling curve from let's just train bigger and bigger models 00:03:02.300 |
that can like, you know, we throw more compute at the problem and like the inference becomes 00:03:07.020 |
a little bit better to let's start trading that upfront cost for inference time compute. 00:03:13.740 |
And yeah, that's kind of what they did. Before, OpenAI was the only one to do it with O1, 00:03:19.180 |
O1-mini. And then, you know, give it a few months and DeepSeek, out of nowhere, has just 00:03:22.860 |
put out a really good model. It's like on par with OpenAI's O1. It's like better than 00:03:27.860 |
all the other models. Completely open sourced it. The paper's not that good. They don't 00:03:32.300 |
really talk much about the training data. Like Sean mentioned earlier, it's a pretty 00:03:38.020 |
basic approach to what they did. So, there's not much in the paper. They don't talk much 00:03:42.100 |
about the data. They don't talk much about a lot. But anyway, it's a paper. Weights are 00:03:46.660 |
there. It's MIT licensed, which is pretty good. And you know, it's still a V1 of this. 00:03:52.860 |
It's like just how OpenAI has O1, there'll be O3. So, you know, there'll be R2. There'll 00:03:58.380 |
be other companies that do this. Mistral might do it if they still exist. Llama will do it. 00:04:03.380 |
So, there'll be others and then we'll only see improvements from here. One of the quotes 00:04:08.140 |
from their paper, their goal, so like they say, "Our goal is to explore the potential 00:04:12.820 |
of LLMs to develop reasoning capabilities without any supervised data, focusing on their 00:04:17.940 |
evolution through a pure RL process." So, TLDR, what they really found out is just throw 00:04:24.100 |
RL at the problem and you can get a really, really good model. People kind of, you know, 00:04:29.060 |
set aside RL for a while, but yeah, it turns out you can just throw RL at the problem and 00:04:33.180 |
it's pretty good. But this is a kind of interesting note, right? So, they wanted to develop reasoning 00:04:38.820 |
capabilities without any supervised data. Without any supervised data means, you know, 00:04:43.580 |
we don't go out, we don't label a bunch of, "Hey, here's chain of thought. Here's the 00:04:48.280 |
ideal reasoning." A lot of people would think about, "Okay, if you have an agent, if you 00:04:52.300 |
have a coding problem, there's 10, 20, 30 different approaches to get to the same answer. 00:04:57.300 |
How do we optimize this stuff?" But no, they're not doing supervised data. Their goal was 00:05:02.020 |
to do it without any supervised data, just self-evolution and RL. And they did some really 00:05:07.580 |
good RL. And they did put out good math. They explained this RL. And yeah, that kind of, 00:05:12.840 |
you know, blew up. So, what they do is they post-train the base DeepSeek V3 model, which 00:05:19.620 |
I don't think is that important. It's just, you know, big model with this GRPO. GRPO is 00:05:24.620 |
their type of RL. We'll go into it in a bit. And then they start to notice these emergent 00:05:29.540 |
capabilities that come out. There's great reasoning that starts to come out. So, you 00:05:33.500 |
know, you don't train on reasoning data, but dang, you get reasoning. You just train on 00:05:38.380 |
hard questions, hard data. Then reflection starts to become a thing. So, you know, the 00:05:43.900 |
model starts to reflect, like think on its actions, thinks on its steps. It has these 00:05:49.020 |
aha moments where it's like, "Oh, shoot, that's crazy. This is what the right step is." And 00:05:53.100 |
then it continues. And you get like O1 level performance. Then from this, you know, v3 00:06:01.020 |
like light or whatever zero model, R1-0, they train the actual DeepSeek R1. They have a four-stage 00:06:07.940 |
approach for training it. And that becomes a really good reasoning and chat model. So, 00:06:12.420 |
four stages, they have like this cold start to make sure things don't go crazy at the 00:06:15.840 |
beginning. And guess what? They do SFT. It's not just RL. Then they do RL. Then they do 00:06:22.660 |
rejection sampling. Then they do RL again. So, not one RL. There are two RL stages at 00:06:27.660 |
the problem, you know, so double the RL. But yeah, so high level, those are the two models. 00:06:33.740 |
R1-0, it's a great reasoning-only model. It's trained on unlabeled chain of thought, you 00:06:39.380 |
know, with RL. It's not good as a general model. Then R1, it's created from outputs 00:06:45.180 |
from R1-0 and that four-stage training method. It's a really good model. It's like O1. Then 00:06:51.340 |
the other half of the paper, not half, but like, you know, they have a section, they 00:06:55.020 |
have like a paragraph on, "Hey, by the way, we just distill our outputs into Qwen and Llama." 00:07:01.220 |
It does very, very good. So, that, they're not doing native RL training. They're doing 00:07:05.920 |
proper distillation. So, they take their big model. They train it with a distillation loss. 00:07:10.420 |
Well, they don't say what type of distillation, but you know, standard distillation is distillation 00:07:15.020 |
loss. Then they compare it to the base models and it performs very well. A little note, 00:07:22.940 |
they do try, they make a note like, "Okay, what if we did RL?" So, they take Qwen 32B, they 00:07:28.300 |
do like 10K steps of RL. They compare that to distillation and they find distillation 00:07:33.860 |
much better. They make a small claim like, "Yeah, you know, if you do RL, it's very compute 00:07:39.180 |
expensive. Like, it's hard to do. It doesn't work as well as just distilling. So, maybe 00:07:44.140 |
in the future we still need these big base models." 00:07:48.540 |
Being 2025, you know, no one talks about any data. They don't talk about where it came 00:07:53.380 |
from. They just say, you know, get good quality data. Performance is very good. Models are 00:07:58.300 |
fully open source with MIT license. They don't give training data. They don't give training 00:08:03.940 |
code either. They host the model themselves on their own API. Something interesting to 00:08:08.900 |
note is as much as people are raving about how good this thing is, DeepSeek themselves 00:08:14.620 |
are also serving it very cheaply and very fast. So, 3x faster, 3 to 10x faster and also 00:08:20.740 |
cheaper than other infra providers. But, you know, if you use the DeepSeek API, they clearly 00:08:26.820 |
state that the, you know, data goes to a China server. So, use at your own risk, but very, 00:08:32.980 |
very cheap model, very, very good model. Their API is a lot faster and cheaper. Part of that 00:08:38.220 |
is because, you know, they know everything about how to optimize this thing. They built 00:08:42.340 |
it and it just came out. The other providers that are hosting it, well, you know, they 00:08:46.740 |
just have model and they're trying to run it. But, yeah, from there, let's go into DeepSeek 00:08:52.220 |
v3 real quick. This is my one slider. So, we say it's important. We'll stop after this 00:08:57.940 |
and discuss it a little, but basically it's just a regular LLM. It's a pretty large model. 00:09:03.940 |
It's chunky. It's 671 billion parameters, but 37 billion active parameters, which is 00:09:09.220 |
pretty interesting. You know, it's a lot of experts in there, but effective parameters 00:09:13.960 |
are pretty small. It's basically a 30B model at inference time, fully open source. It's 00:09:19.940 |
GPT-4o level. It's not the reasoning one. This is just a standard big MOE model. They 00:09:25.940 |
made this little claim, you know, training this thing took $5.5 million. They had 00:09:31.020 |
like a few steps to this, so they trained it. Then they did two stages of context length 00:09:36.140 |
extension. They did, first, they trained the thing as a base model. Then they do some 32K 00:09:40.660 |
and 128K context length extension, trained it on about 15 trillion tokens, do very standard, 00:09:47.580 |
you know, train it, do SFT, do RL. The model is pretty good. They have this concept of 00:09:52.900 |
multi-head latent attention. It's pretty cool. If anything, that would be like the next slide 00:09:58.420 |
if I had to have three slides, but, you know, they have fancy attention. They do multi-token 00:10:04.140 |
prediction. We covered the paper from Meta a few months ago that talks about this, where, 00:10:09.540 |
you know, it's more sample efficient. You can do multi-token prediction. Meta put out 00:10:13.540 |
a paper. They're like, "Oh shit, this works. It's pretty good. People should do it." And 00:10:18.180 |
then not many people did it, and then they did it, and it helps. Came out a month ago. 00:10:24.000 |
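To make the multi-token prediction idea concrete, here is a minimal sketch of the parallel-heads variant: extra heads predict tokens further ahead and their losses are summed. This is illustrative only; it is not DeepSeek V3's actual MTP module (which is sequential and shares the embedding/output layers), and all shapes here are toy numbers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """Toy multi-token prediction: k heads over a shared trunk (illustrative only)."""
    def __init__(self, d_model: int, vocab_size: int, k: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(k))

    def forward(self, hidden, targets):
        # hidden: [batch, seq, d_model] trunk output; targets: [batch, seq] token ids
        loss = 0.0
        for depth, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-depth])   # head `depth` predicts the token `depth` steps ahead
            labels = targets[:, depth:]         # labels shifted by `depth`
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
            )
        return loss / len(self.heads)

# usage sketch with fake data
hidden = torch.randn(2, 16, 64)
targets = torch.randint(0, 1000, (2, 16))
mtp = MultiTokenHeads(d_model=64, vocab_size=1000, k=2)
print(mtp(hidden, targets))
```

The claimed benefit is sample efficiency: each position contributes more than one prediction target per training step.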
People are very hyped. The other day, it kind of, you know, broke America real quick. NVIDIA 00:10:28.300 |
dropped $600 million because they said they trained this for $5 million. So, yeah, I'll take 00:10:36.380 |
a little pause. That's high level of the, you know, what they've released, how it works, 00:10:41.260 |
what's going on under the hood. This is DeepSeek v3. It's their big MOE. It's got 37 billion 00:10:46.940 |
active parameters. They say it was cheap to train. They trained it on 15 trillion tokens, 00:10:53.780 |
but yeah, this is the, you know, step zero. This is the base model that the reasoning 00:10:57.820 |
model is built on. This is very similar to models like Mixtral or GPT-4o. It's just 00:11:04.300 |
a big MOE model. Oh, $600 billion, not $600 million. NVIDIA dropped heavy. Big, big drop. 00:11:12.700 |
America blew up real quick. But yeah, so all the reasoning models are built on top of this 00:11:17.720 |
as a base model. But yeah, if we want to pause here, anyone have thoughts, points, anything 00:11:22.840 |
that they loved about this DeepSeek v3? Which in and of itself is a good model. It's cheap. 00:11:29.660 |
Things to note at a, you know, high-level AI engineering view: using reasoning 00:11:35.100 |
models is cool, but also they're kind of slow, right? Like if you need the full completion, 00:11:40.180 |
thinking is cool, but like sometimes I just want output, right? Models are pretty good 00:11:43.940 |
at compressing. Sometimes I want fast speed. This is like GPT-4o level and very fast. 00:11:50.920 |
So it's only 37B active parameter. So a lot of the times people would probably want to 00:11:56.580 |
use this. You don't need a reasoning model for everything, right? If you run a chatbot, 00:12:01.020 |
that's cool. You can probably just run this. Later in the conclusion, there's a slide that 00:12:06.660 |
shows what future work they want to do on the reasoning model and they show how v3 is 00:12:11.700 |
actually better at something. So not, not to undermine this, you know, it's still very 00:12:16.060 |
good, very smart, fast, cheap. It's a good model, but it's just an MOE. But yeah, anyone 00:12:23.540 |
want to chime in, any questions, anything interesting in chat? 00:12:27.060 |
A lot of questions. Sorry. There's a lot of questions. I don't know which one to focus 00:12:34.080 |
Okay. I'm going to see the first one. So how is effective active parameters different from 00:12:38.260 |
total parameters? So total parameters, you know, you still have to load all this in memory. 00:12:43.980 |
So 671 billion parameters, you need lots and lots of GPUs to load this thing. But at inference 00:12:49.780 |
time, it's only using a fraction of these, right? So about 5% of these, it's using 37 billion 00:12:54.820 |
parameters. So realistically, like, it's more efficient per token. It's, it's 00:13:01.300 |
going to be faster, it's going to be cheaper. But it's not something that you can just host 00:13:06.580 |
yourself, right? Like your laptop might be able to host a 30 billion parameter model, 00:13:11.940 |
you load all those weights of memory and you use it. This is like, kind of like that at 00:13:17.460 |
inference time, but it needs all the weights loaded up. 00:13:21.540 |
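A toy sketch of why "active" is smaller than "total": every expert's weights must be resident in memory, but each token's forward pass only runs through the top-k experts the router picks. The sizes and routing details below are made up for illustration, not V3's real configuration.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts layer (toy sizes, not V3's config)."""
    def __init__(self, d_model=64, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: [tokens, d_model]
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)      # each token picks its top-k experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e                    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(8, 64)).shape)
total = sum(p.numel() for p in layer.parameters())
# rough "active per token" count: router + the 2 experts actually used (all experts same size here)
active = sum(p.numel() for p in layer.router.parameters()) + 2 * sum(p.numel() for p in layer.experts[0].parameters())
print(f"total params: {total:,}, params touched per token: ~{active:,}")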
I think the point being a lot of people miss is that, like, you do save, you do save memory 00:13:26.020 |
at scale, like you might not save memory if you're having one chat on your laptop, because 00:13:29.340 |
all, because every token may use a different subset of the parameters, so you need them 00:13:33.740 |
all loaded. But if you're doing batch inference, like DeepSeek themselves are doing, 00:13:39.660 |
then they can route things to each different GPU how they want. 00:13:47.220 |
Yep. Yep. At batch, it's, it's just very efficient. And that also means it's, it's faster too. 00:13:52.980 |
Okay. What is FP8 training? So mixed precision training, you know, before we used to train 00:13:57.860 |
in full precision, then half precision, then we started, oh shoot, we can do FP16. Now 00:14:03.220 |
we cut precision again. It's just an interesting thing. I think they're the first ones that 00:14:08.060 |
have done it. Typically, you can do inference. This is like a quantization, right? You can 00:14:12.580 |
run a model in four bit and half precision, and there's a slight degradation in quality. 00:14:18.700 |
But on the training side, we typically need as much, like, precision as possible. In this 00:14:23.980 |
case, they, they can do FP8 training. They did it, guys. They also, yeah, another interesting 00:14:31.700 |
key component was that they're training this without an auxiliary loss. So if you know 00:14:36.900 |
about MOEs, that's a, that's a pretty interesting piece there. But, okay. Can we trust them 00:14:44.260 |
on $5 million cost claim at face value? You can take it both ways. People have gone into 00:14:51.060 |
token economics of how much it would cost to train this many tokens at this scale, and 00:14:56.580 |
it can be around here. But realistically, this is like, you know, maybe the final train 00:15:01.060 |
run cost around this, but this doesn't include any of the R&D, any of the other experiments. 00:15:06.140 |
Like, it's more than that, but either way, you know, it's out, it's open source, 00:15:11.780 |
it's good, it's small. It was cheap. I think Dario from Anthropic, their co-founder, mentioned 00:15:18.940 |
something about this recently of like, they, you know, how Anthropic was, or Claude 3.5 00:15:26.420 |
was also trained in the tens of millions. It's nothing crazy. The other interesting 00:15:32.180 |
thing was the whole GPU restriction, right? So people say that they have to say this because 00:15:38.700 |
they're not allowed GPUs. And if they say they had GPUs, then, you know, they lose their 00:15:42.940 |
little supplier, but they have GPUs. You know, who knows? 00:15:49.300 |
Yeah, I wanted to add, because there's a lot of speculation of whether, whether is this 00:15:55.100 |
real or not, right? But like, one of the best things about the open weights and open code is that 00:16:00.540 |
it's in the code and we can validate these things. So it is definitely an MOE model. 00:16:06.180 |
It has, it has definitely that amount of experts that was stated and, and the community has 00:16:11.820 |
already essentially made calculators for how much does it cost to create an MOE model. 00:16:17.500 |
And if you work it out backwards, it's between five to 10 million. So maybe the exact number 00:16:22.580 |
is off, but I think, I think a lot of people are missing the point that it's at that ballpark 00:16:28.340 |
and for contrast, Llama 3 405B was 50 mil. And that is based on the amount of compute time. 00:16:39.340 |
Yeah. Okay. I think these other questions have good discussion in chat, so I'm going 00:16:47.900 |
Can I also just add one thing? I think one thing that DeepSeek V3 did very differently from 00:16:53.060 |
v2 is, crap, it just escaped me. Yeah. Auxiliary-loss-free MOE training, without an auxiliary loss. 00:17:03.940 |
So I thought that was pretty interesting and it really simplified a lot of things. And 00:17:08.220 |
that was a big step up from v2. So, I mean, if you have the paper open, I mean, just go 00:17:13.100 |
through it. They make a big deal out of it. I'm not sure how much of a difference it makes 00:17:18.460 |
Okay. I have the important, I have the R1 paper open, not the, not the other one. Also, 00:17:25.580 |
I think I'm only sharing my Chrome screen, so we don't get my paper this time. My beautiful 00:17:32.060 |
highlights. It's okay. I took some screenshots of charts, but, okay, let's move on to the 00:17:38.580 |
fun one. This was a chart that I was going to pull more charts from. This is from Jay 00:17:43.580 |
Alammar's blog. It's a good blog. I recommend checking it out. Actually, there was a better 00:17:49.700 |
posted chart in Discord, like, let me pull it up. It was just posted about an hour ago. 00:17:58.860 |
So this is like a better overview of the training pipeline, but this is also kind of what's 00:18:06.340 |
happening here. So they've got the DeepSeek V3 base. They do SFT reasoning data examples. 00:18:15.460 |
They have this SFT checkpoint. Then we do fine tuning with RL to get DeepSeek R1. So 00:18:22.260 |
we're going to kind of look in this middle step here, which is DeepSeek R1-0. So this 00:18:27.940 |
is where they apply pure RL directly to V3, the V3 base model without any SFT data. They 00:18:33.740 |
use GRPO for RL, which was introduced a little while ago. Actually, this came out 00:18:38.360 |
in the DeepSeek math paper. So there's a few different ways that the model is rewarded 00:18:43.940 |
during this RL process. Someone's got their hand up. You want to just ask the question? 00:18:49.980 |
We haven't gone that deep yet. Sachin, you want to? 00:18:53.060 |
Yeah. So you can hear me, right? Yeah. So one of the things, so I haven't been following 00:18:58.300 |
what OpenAI and all the other guys are doing, but what prevents them from because they have 00:19:04.660 |
the training data, their own like process. And if they run this and verify because you 00:19:10.860 |
have the code and all, and then they can compare like what their existing way of doing versus 00:19:16.780 |
the new way of doing, right? So, do you know like how long that would take for these guys? 00:19:24.720 |
But now they say, okay, this is the new way of doing things. Everybody accepts. I don't 00:19:28.840 |
know if, if you don't know what this data was trained and all, this will definitely 00:19:34.080 |
shake it out. But we are not the guys who can basically have the money to train and 00:19:38.680 |
actually verify this, right? Has anybody done that? Like, these are numbers that only the 00:19:45.640 |
In terms of training, there's not much to verify, right? So like for V3, for Llama 00:19:52.960 |
models for the base models, there's verification that people can do, right? You know how many 00:19:58.480 |
tokens they are trained on. We know what it takes to train these models. For this model, like 00:20:04.200 |
for R1, we don't have training code. We don't have the data, but that doesn't mean that 00:20:09.440 |
people can't do this. There's a section later about companies that are trying to reproduce 00:20:13.960 |
this. They also show stuff that we can do, right? So they distill outputs from this into 00:20:18.640 |
Llama models, that stuff that's very attainable, you know, that stuff is now in the hundreds 00:20:23.120 |
to thousands of dollars. Now that stuff regular people can do. There's a company, I don't 00:20:28.960 |
remember who, that already put out a fine-tune on DeepSeek-style R1 data. So we can discuss 00:20:37.120 |
this in a bit, but yeah, there's, there's people that are starting to work on this, 00:20:42.320 |
but anyway back, back to what they're doing here. So R1-0 is kind of one of the two models, 00:20:48.240 |
right? So they, a while ago, they put out this paper that was the DeepSeek Math paper. 00:20:53.760 |
They explained this new GRPO RL algorithm. It's kind of where you have a reward that's 00:20:59.860 |
based on accuracy and responses that are verifiably correct. So verifiably correct means you train 00:21:07.000 |
on data that can be verified. So math questions, you know, math that checks out; LeetCode, 00:21:12.080 |
where you can have something that compiles it to check if something is correct. And then they 00:21:15.800 |
have like a little format reward. So they, they want to do this RL that nudges the model 00:21:20.440 |
to also follow format, right? In this case, the format reward is making sure that there's 00:21:24.920 |
think tags between reasoning and then there's an output at the end. So there's kind of like 00:21:30.520 |
three things they're testing for here, right? One is the model is being rewarded to one, 00:21:36.240 |
put thinking traces, right? So it needs to think. So it's going to put thinking stuff 00:21:41.560 |
between thinking tags. It needs an answer. So there's going to be an answer and the answer 00:21:46.200 |
has to be correct. And then that correct answer has to verifiably check out. So then there's 00:21:52.080 |
kind of this RL algorithm that's, that's applied around all this. This is kind of what the 00:21:56.680 |
prompt looks like that they train with. So this is the template prompt, right? So conversation 00:22:01.760 |
between user and assistant, user asks a question, the assistant solves it. The assistant first 00:22:06.720 |
thinks about the reasoning process in the mind, in its mind, then provides the user 00:22:11.080 |
with an answer. The reasoning process and answers are included, are enclosed within 00:22:15.640 |
think tags and answers within answer tags respectively. So think goes here, answer goes 00:22:21.320 |
here, then the assistant. So now, you know, when you answer, the model is prompted to 00:22:25.960 |
now answer with, okay, here's my thinking. Here's my reasoning process. Here's the end 00:22:30.040 |
of my think tag. Here's an answer tag. Here's my answer. Here's the end of the answer. Then 00:22:34.600 |
you do a bunch of training with just pure RL. Here's kind of the formula for all this. 00:22:38.840 |
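As a rough sketch of the template and the two rule-based rewards just described: the template string is a paraphrase, and the checking logic (regex tag check, exact-match answer check) is an assumption for illustration, since the paper does not publish its reward code. Code questions would instead be verified by compiling and running tests.

```python
import re

# Paraphrase of the R1-0-style training template described above.
TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively.\nUser: {question}\nAssistant:"
)

def format_reward(completion: str) -> float:
    """Small reward for exactly one <think> block followed by one <answer> block."""
    pattern = r"^\s*<think>.+?</think>\s*<answer>.+?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Verifiable-correctness reward, here simplified to exact match on the final answer."""
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == reference_answer.strip() else 0.0

completion = "<think> 7 * 6 = 42 </think> <answer> 42 </answer>"
print(format_reward(completion) + accuracy_reward(completion, "42"))  # 2.0
```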
Here's, here's a cool little chart. GRPO. So what is this? So compared to traditional 00:22:45.120 |
RL, there's no critic model here. It uses groups of sampled generations to estimate the baseline, 00:22:50.680 |
and this kind of helps with cutting the compute cost down, right? You don't need 00:22:54.560 |
to train a separate critic model. In this case, there's group-based rewards. So this 00:22:59.680 |
is basically where you score outputs when they're compared with a sampled group to reward 00:23:05.120 |
relative performance. So instead of generating one output, you generate a group of outputs, score them, and 00:23:10.960 |
then, you know, reward each one relative to how well it does within the group. Then there's of course 00:23:15.320 |
stability and stuff. A big thing with RL is you have like, you know, KL divergence. You 00:23:20.200 |
don't want models to randomly drastically make big changes because they probably won't 00:23:24.880 |
get back. So there's a penalty, you know, if in the group something is really good, 00:23:30.840 |
but you know, it diverges a lot from the sample, then yeah, we also penalize that. So it's 00:23:36.000 |
just, this is good RL. We could spend a lot of time on this, but honestly, I think this 00:23:41.960 |
is good for Discord discussion. So I'm sure someone will create a thread of GRPO instead 00:23:47.600 |
of 200 people sitting here thinking about what RL is. High level, there's no critic 00:23:53.040 |
model. It's, you know, it's judging and rewarding outputs based on sampled group outputs, and 00:24:00.720 |
then there's stability in here to make sure that we don't diverge if samples go crazy. 00:24:05.440 |
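A minimal sketch of the group-relative idea: standardize each sampled completion's reward against its own group (no critic), then apply a PPO-style clipped update with a KL penalty toward a reference model. The hyperparameters and the simple KL estimate below are assumptions for illustration, not DeepSeek's published implementation, and real code works token by token rather than on whole-sequence log-probs.

```python
import torch

def grpo_loss(logprobs_new, logprobs_old, logprobs_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """GRPO-style loss for one group of G sampled completions to the same prompt.

    logprobs_*: [G] summed log-probs of each completion under the current, old (sampling),
    and frozen reference policies. rewards: [G] scalar rewards. Sketch only.
    """
    # Group-relative advantage: no learned critic, just standardize within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the probability ratio vs. the sampling policy.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()

    # KL penalty toward the reference model keeps updates from drifting too far.
    kl = (logprobs_new - logprobs_ref).mean()
    return policy_loss + kl_coef * kl

G = 16
rewards = torch.randint(0, 2, (G,)).float()      # e.g. 1 if verifiably correct, else 0
lp_new = torch.randn(G, requires_grad=True)
loss = grpo_loss(lp_new, lp_new.detach(), lp_new.detach() - 0.1, rewards)
loss.backward()
```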
Now R1-0, how does it perform? So it performs really well and there was no labeled SFT training 00:24:12.920 |
data. It's just a base model trained with this RL to output the correct responses and 00:24:17.600 |
add some thinking. It does well; with this majority voting, it does even better. So here's 00:24:23.800 |
kind of benchmarks. If we look at it compared to O1 and O1-mini, R1-0 does, you know, pretty 00:24:29.800 |
good like on most benchmarks: on math, on LiveCodeBench, Codeforces, it's pretty good up there. 00:24:37.000 |
And then when you do majority voting, which is, you know, you generate a couple examples 00:24:41.420 |
and you take the answer that comes up the most, it does significantly better. The key thing here 00:24:46.440 |
was they actually trained this thing on very, very hard questions. So just good training 00:24:51.440 |
quality questions and yeah, it does pretty well. You generate a bunch of samples, you 00:24:56.760 |
pick the one that's the best out of a group of them, you kind of nudge it towards doing 00:25:00.400 |
better there. Yeah, the next few things were very interesting 00:25:05.920 |
charts that came out of here. So these are some charts that show how their inference 00:25:10.800 |
time is correlated with eval performance. This is kind of what you start to see at scale. 00:25:15.840 |
When they started to train this thing, it didn't work really well, right? This is just 00:25:19.920 |
like, okay, why does this work now? This is basic RL. But at scale, we start to see these 00:25:24.780 |
emergent capabilities, right? As you train for more and more steps, we see that accuracy 00:25:29.520 |
starts to go up with steps too, right? So for each question, we sample 16 responses, 00:25:34.800 |
calculate the average, we start to see how it performs. For more steps, the more kind 00:25:38.720 |
of steps that you take, the better performance is. 00:25:42.520 |
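For what that evaluation means in practice, a tiny sketch of sampling k responses per question and reporting both the averaged accuracy and the majority-vote answer; `generate` and `is_correct` are stand-ins for a real model call and a real verifier.

```python
import random
from collections import Counter

def generate(question: str) -> str:
    # Stand-in for sampling one completion from the model at some temperature.
    return random.choice(["42", "42", "41"])

def is_correct(answer: str, reference: str) -> bool:
    return answer.strip() == reference.strip()

def evaluate(question: str, reference: str, k: int = 16):
    samples = [generate(question) for _ in range(k)]
    avg_at_k = sum(is_correct(s, reference) for s in samples) / k   # what the chart averages
    majority = Counter(samples).most_common(1)[0][0]                # majority voting / cons@k
    return avg_at_k, is_correct(majority, reference)

print(evaluate("What is 6 * 7?", "42"))
```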
Another one here, this was a very interesting one. The average response length of the model 00:25:48.280 |
also starts to increase. So the longer you train it, the more reasoning steps it starts 00:25:53.640 |
to take, which means that basically the TLDR of this paper was just this RL thing just 00:26:00.440 |
kind of works. And you can see this in the charts, right? The more that we're training 00:26:04.640 |
this, the model is starting to learn to reason more and more, because the more it reasons, 00:26:09.360 |
the better the performance is. And throughout more steps, the average length of the response 00:26:13.560 |
is starting to get longer and longer. So here's a little quote here. "The average response 00:26:20.520 |
length of R1-0 on the training set during the RL process. DeepSeek R1-0 naturally learns to solve reasoning 00:26:27.000 |
tasks with more thinking time." So yeah, it's starting to do that. It "naturally acquires the ability 00:26:34.480 |
to solve complex tasks by extending test time compute. This ranges from hundreds to thousands 00:26:40.980 |
of reasoning tokens. The emergence of interesting behaviors as test time compute increases." 00:26:47.000 |
So this was another interesting one. So as you increase test time compute, they started 00:26:51.480 |
to notice emergence of interesting behaviors. So some of these were reflections and aha 00:26:58.080 |
moments. Reflections are where the model started to revisit and reevaluate previous steps and 00:27:04.280 |
explore alternatives. So as it's doing its thinking, it would start to reflect and be 00:27:10.680 |
like, "Huh, a few steps ago, I went down this path. Maybe I should look at this again." 00:27:17.160 |
Aha moments are where it starts to take more time and reevaluate an original approach. 00:27:23.000 |
So in this example, and this also shows the quality, the example of questions that they're 00:27:29.080 |
training on. And you can see more of these as well. If you look at some of the stuff 00:27:32.560 |
that they trained on, you can look at those data sets and look at the type of questions. 00:27:36.780 |
But here it's being told to answer this question. It's like, "Okay, here's basic math. I can 00:27:41.480 |
square both sides. I can isolate this term." And then it's like, "Wait, wait, wait. That's 00:27:46.600 |
an aha moment I can flag here." And they start to notice these little emergent capabilities 00:27:50.960 |
where it's starting to find these aha moments and it's starting to re-reason. It's starting 00:27:55.760 |
to have these reflections. And there was this kind of interesting quote that I found in 00:28:00.080 |
the paper. So this is from the DeepSeek team. They make this section. They say, "This moment 00:28:06.640 |
is not only an aha moment for the model, but it's also for the researchers observing its 00:28:11.320 |
behavior. It underscores the power and the beauty of reinforcement learning. Rather than 00:28:16.240 |
explicitly teaching a model how to solve a problem, we simply provide it with the right 00:28:21.640 |
incentives and it autonomously develops advanced problem-solving strategies. The aha moment 00:28:28.480 |
serves as a powerful reminder of the potential of RL to unlock new levels of intelligence 00:28:34.280 |
in artificial systems, paving the way for more autonomous and adaptive models in the future." 00:28:39.640 |
But basically, rather than explicitly teaching the model how to solve problems, they train 00:28:45.360 |
it with RL, which incentivizes it based on those incentives, and it autonomously starts 00:28:50.860 |
to understand these problem-solving strategies. Through its thinking steps here, it starts 00:28:58.520 |
to realize it has these aha moments, and it also starts to have reflections in its thinking. 00:29:05.020 |
So that was kind of another interesting thing that came there. 00:29:08.840 |
So what about DeepSeek R1? What are the kind of problems with R1-0? R1-0 had poor readability. 00:29:16.480 |
It also had a lot of language mixing. I'll make a note of this later in the section, 00:29:20.920 |
and I think we should discuss it. They kept talking about language mixing. They keep talking 00:29:25.720 |
about how problems with this model are that it mixes up languages. It goes between English 00:29:31.360 |
to Chinese, and it's not good at fixing what language it should be. And this is also a 00:29:37.200 |
problem with the real R1. They weren't able to solve this too well. Now, some of this 00:29:42.080 |
is due to RL, but yeah, it's just a little interesting note that more than four or five 00:29:47.560 |
times in the paper, they had mentioned how this thing struggles with language mixing. 00:29:53.560 |
So that's kind of where R1-0 had its issues. It wasn't very readable. This is giving out 00:30:02.000 |
weird thinking steps, aha moments. The RL objective it's trained with has no safety. There's 00:30:10.320 |
no conciseness. There's no be a good assistant. There's no be a good 00:30:16.280 |
chat model. There's no be fun to chat with. There's nothing like that. So they take R1-0, 00:30:22.480 |
and then they make R1, which is: let's take this reasoning model that we built. Let's 00:30:27.920 |
actually make a proper LLM assistant out of it. So we'll make a reasoning chat model, 00:30:33.960 |
which is what R1 becomes. But I'll take a little pause here. I know a lot of people 00:30:39.320 |
pre-read the paper too. Is there anything we want to dive deeper into in R1-0? We could 00:30:45.000 |
talk about the RL policy itself, the model, how it performs, any of these charts. 00:30:53.280 |
Yeah, I had a question. There was a step that I must have missed, which is the RL scoring 00:31:00.040 |
function. In other words, when the models return these 16 different answers in English, 00:31:08.520 |
how is it scoring? How is it deciding which one was that? 00:31:12.040 |
So there's a few things there. There's the verifiably correct, which is part of it. So 00:31:16.840 |
these questions, if they're LeetCode style questions, you can run a compiler and you 00:31:22.920 |
can verifiably see what's correct. If it's a math question, you can verify that the answer 00:31:27.840 |
matches what the answer should be. And that's the level of distinction they go at. There's 00:31:32.480 |
a few different categories, but they can verify the answers to make sure that's correct. Then 00:31:38.240 |
you have the other little parts of the policy, right? Like you want it to output these think 00:31:44.160 |
and answer tokens, so you can verify that it did that. If it didn't do that, then it's 00:31:49.600 |
going to be penalized, right? So another part of this is following this prompt template. 00:31:55.120 |
If it doesn't output think tokens and answer tokens, or if it outputs them but doesn't 00:32:00.600 |
give any reasoning, that's not good. Now, this is just a reasoning model. Some of the 00:32:05.640 |
changes in the actual R1 are that sometimes you don't need to reason, right? For a question 00:32:11.120 |
like hello, you can not reason. But basically, that's some of the stuff that they-- 00:32:16.680 |
So just so I understand, so the basic concept here is that even doing these kinds of very 00:32:23.760 |
simple forms of reasoning, or they're very complicated, like mathematics, the idea is 00:32:28.440 |
that that learning then transfers onto other kinds of responses and reasoning that people 00:32:36.280 |
want. Because it was very interesting to me, like one of my test questions is, what is 00:32:40.760 |
the population below Central Park? And R1 and all of them just fall on their ass. They 00:32:46.600 |
can't answer this, which any third grader can reason through and come up with a reasonable 00:32:50.760 |
answer, right? And a reasonable answer is not a number that's larger in the population 00:32:56.400 |
of Manhattan or less than 100,000. They just fall to pieces, because they can't seem to 00:33:02.680 |
reason about this. And the reason I'm asking this is because is the assumption here that 00:33:09.200 |
if they can solve these math questions and coding questions, that they can then reason 00:33:13.560 |
about other things? Is that one of the fundamental assumptions here? 00:33:18.520 |
Yeah, so in my interpretation, the goal isn't to get it to reason about different things. 00:33:28.120 |
It's to get it to just output thinking and reasoning, right? So you want it to be able 00:33:33.320 |
to output its thought process to come to a verifiably correct answer. And then as there's 00:33:40.280 |
harder and harder questions, it does more or less output of what its thought process 00:33:46.240 |
is. And you reward it for being right at the end or wrong at the end. And in that, you 00:33:53.280 |
kind of distill this down to, yeah, for harder questions, it will do more thinking before 00:33:57.840 |
it answers. For simple questions, it won't. And that's kind of what I feel like they're 00:34:02.360 |
going for here. If anyone else has other answers to this or other takes on this-- 00:34:05.560 |
I just wonder whether-- yeah, I wonder whether, like, is it-- what you're saying is it's not 00:34:10.120 |
learning new reasoning. It's just that in the fine-tuning step, we're teaching it to 00:34:14.960 |
actually reason, even though the base model was capable of that before. So I'm not-- I 00:34:20.600 |
don't know whether that's true or not. It might be. 00:34:23.240 |
Yeah, the base model is also very smart, right? But you're teaching it to give out its thought 00:34:28.320 |
process. And it's also graded against 16 other versions of thought processes to an answer. 00:34:33.720 |
And it needs to do pretty good on this. It needs a good thought process. It needs to 00:34:38.240 |
also learn to mimic this template and everything. 00:34:42.360 |
But to be clear, it's not being judged on the thought process, only on the-- in other 00:34:48.480 |
words, the RL reward is on the correctness of the answer, and then some very basic mechanical 00:34:54.520 |
stuff, like, did you have the word "think" and so on? And was there something that we're 00:34:57.800 |
going to call a reasoning process? But we're not going into that and parsing it and trying 00:35:01.640 |
to understand the reasoning process. All we care about is-- from a reward standpoint, 00:35:06.320 |
the reward is given if the answer is correct. And these tokens and so on are present. 00:35:11.640 |
Actually, if you don't mind if I jump in here, because this is kind of related to my question 00:35:14.840 |
I have on GRPO, which is Group Relative Policy Optimization. I couldn't find anything on 00:35:18.640 |
GRPO on YouTube, which I was hoping to get, because I don't want to read this fucking 00:35:21.600 |
53-page math paper. Forgive me. But I found something on direct policy optimization, and 00:35:26.840 |
what that was telling me was they removed the reward function from the objective, or 00:35:31.800 |
loss function. Sorry, I know, another sin. But removing that reward-- that explicit reward 00:35:37.680 |
term seemed to be an important part of DPO, along with Kullback-Leibler divergence in 00:35:41.240 |
order to have that kind of-- I don't want to say memory, because there is also a stability 00:35:45.160 |
term with a clip function with epsilons. But I kind of had that memory as well. 00:35:50.120 |
So I feel like this GRPO, along with the multi-head latent attention, which is a great poll-- thank 00:35:54.080 |
you so much for that. I really appreciate that. I'm going to look into that soon from 00:35:57.680 |
the meta paper. But it helps with this kind of batch learning. When I hear people talk 00:36:02.000 |
on Bloomberg about this model, I hear them say, oh, they trained in fewer batches, or 00:36:06.280 |
the batches were more optimal. And when I hear that, I'm thinking in my head, this is 00:36:09.360 |
GRPO. But I also got to look into multi-head latent attention before I do that, and probably 00:36:19.000 |
I have a little bit different use case, I will tell. And people should help me out over 00:36:23.800 |
here. So I look at quite a lot of medical cases, very, very deep things, which I-- right 00:36:29.400 |
now, I use Perplexity, OpenAI, Claude, and all. And some of these things just blow up 00:36:34.800 |
in your face, right? And the answers they give, I go back to literature and verify. 00:36:39.520 |
And it's a very deep process where I need to-- and then I'm reasoning out with neurosurgeons 00:36:44.840 |
and guys, and like, why, when I'm questioning them. And some of the data they know, some 00:36:50.560 |
What ends up happening is OpenAI sometimes puts out just garbage. I mean, I look at all 00:36:54.880 |
sorts of reasoning, what it is showing. But the things I'm learning, and I know what the 00:36:59.560 |
objective is. And my reward process is sometimes, like, so deep inside, is that I know that, 00:37:06.880 |
hey, this biochemistry thing with this, this, this, whatever, is probably causing this neurological 00:37:13.840 |
Now, what you did over here with the RL part of it, when you said there's a number of guys 00:37:18.680 |
where you try to pick them up, the problem in RL is that sometimes there might be one 00:37:23.520 |
loner that the action that they take, the reward might be way down the line. You cannot 00:37:29.680 |
take majority of the guys right up front and say, this is the right way it is supposed 00:37:33.320 |
to be done. And my take is, I haven't looked at this, but this probably works well for 00:37:39.000 |
smaller domains. But when you try to chain domains and domains together, it is probably 00:37:44.240 |
going to have a lot of issues, because now we have a combinatorial problem over there. 00:37:48.000 |
It's like a go game, you know, like whatever it is. So if you have thoughts, let me know. 00:37:52.560 |
So they have a way to handle this with KL divergence where, you know, they account for if, if responses 00:38:01.160 |
are significantly different than the rest of the group, we don't take big steps. But 00:38:06.400 |
I think at a high level, that's, that's enough on, on the RL. We could go on for that for 00:38:12.200 |
the rest of the hour. But let's take that to offline discussion. Let's, let's get through 00:38:16.640 |
what actual R1 is. So that was just R1-0. I'm going to spend the next quick five minutes 00:38:21.920 |
and then we'll do 10 minutes of discussion, you know. So, okay, what is DeepSeek R1? So 00:38:27.040 |
there's, there's four stages to making this thing a good reasoning and chat model. So 00:38:31.600 |
one of them is we have this cold start. So cold start is, you know, let's, let's start 00:38:38.280 |
with some strong SFT. Let's not have this thing go crazy at first. They mentioned 00:38:42.520 |
they use some human annotators here. They just drop one line, you know, kind of interesting 00:38:46.120 |
if you want, anyone wants to look into that. But so first we'll cold start the training, 00:38:50.640 |
then we'll do RL, then we'll do a rejection sampling for generation, then we'll do RL 00:38:55.280 |
again. Okay, quick level. What are these four stages? So stage one, cold start. You have 00:38:59.440 |
that DeepSeek V3, cold start the training with strong SFT. SFT on what you want, you 00:39:05.640 |
know, this prevents the model from going unstable. Use a long chain of thought, few shot example 00:39:10.800 |
prompt to, you know, generate some good detailed examples of what we want. So generate some 00:39:16.620 |
good reasoning examples, some reflection, verification, generated from R1-0. Post-process 00:39:23.560 |
these, have human annotators look at them. This is on the order of thousands of samples, 00:39:27.800 |
you know, so nothing crazy, but this just starts the model off, you know, so let's get 00:39:33.160 |
the base model to do some basic SFT. So normally we take base models, we do instruction fine 00:39:40.200 |
tuning, we turn them into chat models, right? We do, we do SFT. So we're going to take the 00:39:44.120 |
base model, we're going to generate some examples of chain of thought, few shot prompted examples, 00:39:50.200 |
you know, so this looks like you, you take R1-0, you tell it, or you take whatever model 00:39:55.840 |
you tell it to generate some examples where you give the formatting you want, you give 00:40:00.760 |
good, give your thinking steps, give a lot of thinking steps, start it off strong, and 00:40:05.760 |
then they generate a couple thousand examples, post-process them, have human annotators, 00:40:09.880 |
I don't know, they just put like a line on this, then they do some regular SFT on base 00:40:15.220 |
DeepSeek V3, that's stage one. Stage two is they basically do the same exact RL, they 00:40:22.200 |
add this language consistency reward, like we mentioned, you know, they're struggling 00:40:26.240 |
with language mixing, so another part of this RL is now we want language to be consistent. 00:40:31.880 |
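The language-consistency reward is described as roughly the proportion of the chain of thought that stays in the target language; the paper doesn't give the implementation, so this character-level heuristic is purely illustrative.

```python
def language_consistency_reward(chain_of_thought: str, target_lang: str = "en") -> float:
    """Crude heuristic: fraction of word tokens whose characters match the target script.
    Purely illustrative; the actual implementation is not published."""
    words = chain_of_thought.split()
    if not words:
        return 0.0
    def is_english(word: str) -> bool:
        return all(ord(ch) < 128 for ch in word)
    def is_chinese(word: str) -> bool:
        return any("\u4e00" <= ch <= "\u9fff" for ch in word)
    check = is_english if target_lang == "en" else is_chinese
    return sum(check(w) for w in words) / len(words)

print(language_consistency_reward("First, 先 square both sides, then isolate x."))  # < 1.0
```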
Okay, so we did SFT on really good data, then we do a bunch and bunch of RL, they don't 00:40:38.160 |
explain the data set, they don't explain where it came from, how it came, how many samples, 00:40:43.000 |
but they do RL. Stage three, rejection sampling. Rejection sampling is pretty common. Llama 00:40:48.640 |
three did it, many others have done it, it's kind of new, they do this, this is the first 00:40:52.640 |
time they talk about how much data, this is on the order, this is kind of like end of 00:40:56.560 |
post-training lifecycle, you know, so they had big base model, which was v3, did SFT, 00:41:02.560 |
did big stage of RL, now they do rejection sampling. This helps it turn from like, you 00:41:07.680 |
know, we had R1-0 style issues to let's start to fix this, let's generate completions, rank 00:41:15.400 |
them with reward models, this is like LLM as a judge, so generate output, have an LLM 00:41:20.920 |
judge it, have a reward model, judge these outputs and reject some samples, fine-tune 00:41:25.840 |
this model with rejection sampling. High level, that's what's happening. Stage four, let's 00:41:30.200 |
do more RL, you know, throw RL at the problem. So make the model helpful and harmless while 00:41:36.440 |
making reasoning good, that's kind of the objective here. They do R1 style questions, 00:41:42.120 |
but they also mix in general chat human preference, so nuanced scenarios, you know, we want it 00:41:48.400 |
to still give a good output, but now we want it to, you know, give a summary at the end, 00:41:53.480 |
don't just give an answer, give a summary. So this is kind of that last step. So this 00:41:57.960 |
is what makes DeepSeek R1, instead of just reasoning model, you got to do some rejection 00:42:03.320 |
sampling, you got to kickstart the thing so it doesn't go crazy with SFT, and you got 00:42:08.080 |
to do this last stage of some RL for general use. Now, yeah, the model is pretty good. 00:42:15.080 |
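Pulling the four stages together as pseudocode, just to keep the ordering straight: every function here is a hypothetical stub, not an API from the paper, and the sample counts are the rough figures mentioned above.

```python
# Hypothetical stubs so the outline below executes; none of this is the paper's actual code.
def collect_long_cot_examples(n_samples):
    return [f"long-cot example {i}" for i in range(n_samples)]

def sft(model, data):
    return f"{model} -> sft({len(data)} samples)"

def grpo_rl(model, rewards):
    return f"{model} -> rl({'+'.join(rewards)})"

def rejection_sample(model, n_kept):
    return [f"accepted sample {i}" for i in range(min(n_kept, 8))]

def train_deepseek_r1(v3_base="deepseek-v3-base"):
    # Stage 1: cold start. A few thousand long chain-of-thought SFT examples
    # (few-shot prompted, partly generated from R1-0, post-processed / human-annotated)
    # so the RL that follows doesn't start from an unstable model.
    model = sft(v3_base, collect_long_cot_examples(n_samples=1000))

    # Stage 2: reasoning-oriented RL (GRPO) with verifiable rewards,
    # plus the language-consistency reward added to curb language mixing.
    model = grpo_rl(model, ["accuracy", "format", "language_consistency"])

    # Stage 3: rejection sampling. Generate completions, keep the ones a reward model /
    # LLM judge accepts, mix in general SFT data (~800K samples total), then fine-tune
    # starting again from the V3 base rather than the stage-2 checkpoint.
    sft_data = rejection_sample(model, n_kept=800_000)
    model = sft(v3_base, sft_data)

    # Stage 4: RL again, now for helpfulness and harmlessness on general chat prompts
    # while keeping the reasoning rewards.
    return grpo_rl(model, ["accuracy", "format", "helpfulness", "harmlessness"])

print(train_deepseek_r1())
```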
It's a normal chat model, it gives thinking steps, it gives little summaries at the end, 00:42:21.000 |
and it performs pretty well. It still struggles with some language swaps, but you know, on 00:42:26.160 |
most benchmarks, it's either better than both O1 and O1 Mini, 00:42:34.000 |
or in between the two. So you know, O1 might beat it on some things, but it's better than O1 Mini, 00:42:39.720 |
or it's better than both. But it's 00:42:43.920 |
very good, it's 37B active parameters. We don't really know the model sizes or active 00:42:49.680 |
parameters of O1, but this thing's good, it's cheap, it's fast, DeepSeek has very good inference. 00:42:55.760 |
MIT license, it's fully out there. That's kind of R1: four stages. The new ones 00:43:02.560 |
are kind of, hey, you do this cold start with SFT from a base model, and then these last 00:43:07.240 |
two stages. You know, you have rejection sampling, you want to kind of fine-tune it, this is 00:43:12.880 |
pretty common, you do this for about a million samples, 800,000 is what they did, and then 00:43:17.200 |
you do some fine-tuning at the end. And yeah, it does very, very well. Then the last part 00:43:23.760 |
of this paper is kind of the distillation step. I'll talk very, very quickly on this. 00:43:28.240 |
So distillation is where you take a big model, you train it, you generate outputs, and then 00:43:33.400 |
you mimic those into a small model. So you can either just take input/output and just 00:43:38.600 |
do continual post-training and now you've distilled a model, or you can do this distillation 00:43:42.920 |
loss where you try to get a small model to match the output logit, so not just match 00:43:47.920 |
the output, match the output distribution, you know, really match the thinking process 00:43:53.020 |
per se of what the big model is doing. They do distillation on just a million samples, 00:43:58.800 |
so they have 800,000 reasoning samples and they distill R1 into Llama and Qwen models. 00:44:04.240 |
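For reference, the "standard distillation loss" mentioned earlier is usually a KL term between teacher and student logits at a temperature, mixed with the ordinary cross-entropy. This is the textbook version only; as noted below, the released distills appear to come from plain SFT on teacher-generated samples, so treat this as a generic sketch, not DeepSeek's recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Textbook knowledge distillation: soft-label KL to the teacher plus hard-label CE.
    student_logits/teacher_logits: [batch, vocab]; labels: [batch] token ids."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)   # ordinary next-token loss on the sample
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(4, 1000, requires_grad=True)
teacher = torch.randn(4, 1000)
labels = torch.randint(0, 1000, (4,))
print(distillation_loss(student, teacher, labels))
```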
So they take the two families, they do distillation. This is not RL. They just do basic SFT distillation 00:44:10.680 |
on about 800,000 samples, and now the model does very well. And not only does it do well, 00:44:17.400 |
it outputs all of its thinking steps, you know, it becomes a bit of a reasoning model 00:44:21.680 |
and the performance jumps a lot. So they not only compare it to the base models, which 00:44:26.720 |
they're all better than, they also compare it to, you know, like GPT-4o, they compare 00:44:30.920 |
it to Claude Sonnet, to O1-mini. And you can see, like, Qwen 32B is doing better than all 00:44:36.600 |
these models. Like, their distillation work is very good. They open-sourced all these 00:44:42.480 |
models. They dropped them all. These will run locally, you know. So this is like Llama 8B, 00:44:48.240 |
Qwen 7B, Qwen 32B. These are models that you can run locally on your laptop. They're very, 00:44:54.520 |
very strong models. They'll run pretty quick. And then just as a bit of, like, they had 00:45:01.280 |
a lot of ablations, but the one that I found interesting is the question comes up of, "Hey, 00:45:05.760 |
what if we just do RL on the other base models?" And they're like, "Okay, let's test it. Let's 00:45:09.920 |
take Qwen 32B. Let's do our same RL for 10k steps." Well, it does better, but it does 00:45:16.920 |
nowhere near as good as our distillation. So basically, their takeaway is, "Hey, RL, 00:45:24.320 |
like, takes a lot of compute. It's hard to do. And it doesn't get the same performance 00:45:28.940 |
as this distillation. So maybe in the future, we still need these big models." But yeah, 00:45:34.360 |
distillation worked very, very well compared to their RL on a 32B. I'm sure other people 00:45:39.680 |
will go deeper into this. Other people will try it. And they'll, you know, update us on 00:45:43.600 |
how it does. Okay, future work. R1 is still worse than V3 at some things. So as much as 00:45:50.960 |
I shat on V3 not being important, it's still good. It's fast, faster than R1. R1 is worse 00:45:58.220 |
at function calling, multi-turn, complex role play, and JSON output. Those are just a few 00:46:04.320 |
things. R1 struggles with language mixing. I don't know why they note this, but V3 doesn't. 00:46:10.900 |
So maybe their RL has Chinese data. Maybe I misread something there. Maybe they both 00:46:15.880 |
do. But R1 struggles with language mixing. R1 is sensitive to prompting. Few-shot prompts 00:46:21.920 |
degrade the performance. So if you guys are using R1, don't few-shot prompt it. Don't 00:46:28.040 |
tell the model how to think. Tell it what you want it to do and let it reason. It will 00:46:32.560 |
do better. And it's not better at a lot of engineering tasks than V3. They explained 00:46:38.100 |
why and they explained, you know, what they think they can do to fix them. But this is 00:46:42.960 |
just some sort of future work. But high, high level, that's kind of the paper. We have seven 00:46:49.360 |
minutes left. One option is if there's like one or two quick questions. Otherwise we can 00:46:59.560 |
talk about future stuff, questions, and what people have. So people are trying to recreate 00:47:04.480 |
R1. DeepSeek didn't talk much about the data. Not many specifics about what went on there. 00:47:11.520 |
Hugging Face has a thing to reproduce it. They have a whole chart of how they're going to 00:47:14.720 |
do it. Bespoke Labs put out a data set. I think it's a great effort. It's not great 00:47:20.120 |
in my opinion. I looked at like their prompt templating. They heavily try to prompt it 00:47:24.800 |
into responding in a certain way. But anyway, they have over, I think they have a hundred 00:47:29.960 |
thousand-ish samples of R1 data. They fine-tuned a 7B. They show results. But yeah, what 00:47:35.920 |
other hot takes have we seen? Eugene Yan, you have your hand up. You want to join in? 00:47:39.880 |
Hey, Vibhu. I just wanted to ask a question. Sorry, not a hot take, but a question. In 00:47:45.040 |
R1, in stage three, right, they actually retrained it based on the DeepSeek V3 base model. So 00:47:53.200 |
I guess the question I have is why did they not carry on from stage two? I know Eugene 00:47:59.780 |
Chia had a take on this on Discord, but I wonder if anyone here would have intuition 00:48:06.760 |
on why they did this. I had a similar question. So I noticed in the chart as well, same thing 00:48:14.640 |
going on. Exactly. And the paper actually calls it out specifically as a single sentence 00:48:19.200 |
on its own. So it's pretty unique. Yeah. Basically here, right, they restart from V3 stage instead 00:48:31.040 |
of continuing down their whole pipeline. Why they did this, I wish we knew. Eugene 00:48:37.960 |
Chia, do you want to share your take? So my speculation is after they got the first 00:48:43.160 |
round data set, is they trained from the base model before annealing. And this is just to 00:48:49.140 |
get a better final outcome. I don't fully agree with this idea, but a lot of fine tuners 00:48:55.520 |
swear that you want to fine-tune on pre-annealed models instead of annealed models. Yeah. And 00:49:03.680 |
we can probably go into a side track on that. But yeah, that's my guess. I mean, I don't 00:49:08.840 |
really have a great mathematical or machine learning background on this, but I teach right 00:49:13.600 |
now and I feel like I do a lot of reinforcement learning when I'm just like trying to get 00:49:17.560 |
students to do very little reasoning, COT steps correct. Like I want you to write down 00:49:22.960 |
this exponent in the same place I did because I put it in a different color. And if you don't 00:49:27.800 |
do it, I'm going to take off points and this and this and so forth. But what I want them 00:49:31.480 |
to really understand and technically explain in a nuanced way, that fine tuning of having 00:49:36.080 |
that discussion, having that expert review is so much more helpful than just a cookbook 00:49:40.280 |
chain of thought. So in my mind, fine tuning, but I mean, it's not really from an ML perspective, 00:49:45.640 |
kind of just from, I talk to people a lot. Oh, thank you. I also forgot to mention what 00:49:51.960 |
annealing is. So annealing is a process where you essentially flatten out and lower the 00:49:55.800 |
learning rate. And, and that's a one-time thing you do at the end of training, typically. 00:50:01.520 |
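To make "annealing" concrete, here's a generic learning-rate schedule with a mostly constant rate and a decay over the final stretch of training; the shape and numbers are illustrative, not any particular model's config.

```python
def lr_at_step(step, total_steps, base_lr=3e-4, anneal_frac=0.1, min_lr=3e-5):
    """Constant LR for most of training, then linearly annealed over the final fraction."""
    anneal_start = int(total_steps * (1 - anneal_frac))
    if step < anneal_start:
        return base_lr
    progress = (step - anneal_start) / max(1, total_steps - anneal_start)
    return base_lr + (min_lr - base_lr) * progress

total = 10_000
for s in (0, 8_999, 9_000, 9_500, 10_000):
    print(s, round(lr_at_step(s, total), 6))
```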
Yeah. Sam. Hey, did we, did the paper cover how, when they create examples for SFT using 00:50:11.760 |
rejection sampling on the RL checkpoint, what method they use to select the good examples? 00:50:16.200 |
Sorry, I didn't hear that too much. When they created SFT samples to what? Yeah. What method 00:50:23.680 |
did they use to select the good examples when they're doing rejection sampling SFT on the 00:50:29.360 |
RL checkpoint? Oh, on the rejection sampling? Yeah. Yeah. They, they share a little bit 00:50:39.440 |
about the rejection sampling, but when they generate the samples, they, they had this 00:50:47.120 |
whole section about, I mean, honestly, the whole paper is very, very vague in all these 00:50:52.160 |
specifics. They, they just mentioned like little, little notes, like in the cold start, 00:50:58.860 |
you know, how do we generate it? Right. It says human annotators. It doesn't say what 00:51:02.720 |
they did. Yeah. It doesn't say what they did. It says like, you know, slight, slight use 00:51:06.840 |
of human annotators. We, we do post-processing. That's cool. You do, you do post-processing 00:51:13.160 |
to post-process what? But at some level they tell you for rejection sampling, it's kind 00:51:18.460 |
of what you would expect, right? So what are they trying to do? What's their goal? Now, 00:51:24.640 |
they probably want to take away examples that don't have like summaries at the end, right? 00:51:29.760 |
But they, they don't, they don't go into too much detail. Is language mixing a feature 00:51:35.560 |
or a bug? Isn't it good if the model can find efficient reasoning techniques? Feature 00:51:41.780 |
or bug depends on how you see it, right? So language mixing in this case, I believe they 00:51:46.000 |
meant is it responds. Actually, no, they, they actually do specifically explain in some 00:51:52.920 |
cases, like the question is asked in Chinese and it responds in English. So seems more 00:51:59.720 |
like a bug, right? You don't, you don't want that. Cool. Any, any other quick questions? 00:52:07.840 |
Any other thoughts? Any other concerns? And I'm sure there'll be a lot of discussion on 00:52:14.600 |
discord continuing about this, you know. I'm curious about people, how people are using 00:52:19.760 |
these models now. Cause one of the things like my uses for thinking models is when I'm 00:52:24.600 |
trying to brainstorm some creative topic. And yesterday I put side by side Gemini 2.0 00:52:33.240 |
thinking experimental, whatever 0121 and DeepSeek R1, and then the 70 billion llama distill 00:52:41.560 |
from that, that's hosted on Groq. And the answer I got from Gemini was much better than the 00:52:46.020 |
one I got from the other two. And I was, I don't know, I haven't tried a whole bunch 00:52:49.360 |
of examples, but I'm curious about whether, what people are using day to day when they 00:53:02.220 |
I gave up. I just stick with Claude. I'm on Claude now. I'm going to wait until there's 00:53:07.600 |
a little more settling at the moment. I'm kind of wasting my time, which is fine, but 00:53:12.080 |
I'd rather waste my time by not being as efficient than looking for which of the new models suits 00:53:17.440 |
my uses best here. But that's a personal opinion. 00:53:20.160 |
The thing Rahim is talking is actually very critical in the sense, different models and 00:53:27.280 |
all the big guys, they fail, what do you call it, brilliantly, or they just blow up at certain 00:53:34.420 |
reasoning and then at other things they do very nice. Until you have looked at all the 00:53:38.960 |
gamut of across the things, you cannot stick to one thing. I mean, if you have some money, 00:53:43.920 |
you need to like push across all of them and look at responses. Gemini gives one, like, 00:53:50.360 |
that's why I want to see if Perplexity puts out a blog post on this, where 00:53:54.520 |
they have probably the largest data of people, what people are trying to, how they try to 00:53:58.160 |
use it. So that would be my, what do you call next steps to go and see what do they think 00:54:03.200 |
of DeepSeek, if they've done their own internal thing. But that is, I think, where getting 00:54:09.560 |
beyond like the evals, the evals are just like saying, okay, this is our baseline and 00:54:14.080 |
from here, let's go and go to the races kind of a stuff. 00:54:18.600 |
I mean, specifically for your thing, if we're done, I'll leave it for Discord. 00:54:24.640 |
So at a quick level, the other thing to note with how people are using these is they're 00:54:28.800 |
very cheap, right? So if you look at the cost of O1 versus R1, it's like more than 10X cheaper. 00:54:38.040 |
So stuff that was expensive to do is now not as expensive, right? So they're just good, 00:54:45.560 |
cheap reasoning models are fast. And the other part is they're open source. So if you want 00:54:50.800 |
to self-deploy it, you can self-deploy it. If you want to use one of the reasoning models 00:54:55.360 |
locally, like the Qwen 32B one, the 7B one, they're pretty good locally on your laptop. So that's 00:55:05.160 |
I don't know if also that availability, right? Like I've been playing with that and I've 00:55:08.840 |
kind of had a little bit of mixed use and I've had a friend who's been trying to get 00:55:12.920 |
him to do some gnarly React refactorings and he's been frustrated a little bit. I don't 00:55:17.920 |
know if that's just not learning how to be prompted properly for those kinds of very 00:55:22.560 |
specific and more involved tasks, but also Cursor might end up having enough data set 00:55:28.560 |
on how people are using it specifically for coding. And maybe I'm hoping they might publish 00:55:34.560 |
But yeah, it's a good model. Anyway, guys, thanks. Next week we will have Eric Ness facilitating 00:55:43.000 |
Titan model. Discussion will continue in Discord. I'll throw slides in there. But yeah, see