
DeepSeek DeepDive (R1, V3, Math, GRPO)



00:00:00.000 | Cool. I will kick off by straight disagreeing with Sean there. V3 is not required reading for
00:00:09.920 | R1. V3 is just a model. It's like any other model. You know, all these models, you just train
00:00:14.780 | them on next-token prediction, it's a next-token model. It's a good model, but up until 15 minutes ago
00:00:20.740 | I forgot about the slides for V3. So, you know, that's how useless it is. But anyway,
00:00:27.980 | high level. I'm just going to go through more applicable parts of paper. Like, why do we
00:00:32.680 | care? If anyone has questions, comments, thoughts, interrupt me. It's more discussion than it
00:00:38.740 | is me yapping, you know. If you want yapping, go read the paper. So, outline, high level,
00:00:45.180 | high level. So, let's talk about the two models, mostly the second model. What is inference
00:00:50.620 | time scaling? What's this test time compute? They talk about what previous approaches are,
00:00:55.260 | so we'll kind of discuss that a little bit. Also, if people are talking in chat, I'm not
00:00:59.820 | super seeing it. If anything's, like, important. We're just talking about V3 and ignoring you.
00:01:04.920 | Okay, okay. Gross, V3. We had a slide, don't worry. So, then we'll talk about R1-0. So,
00:01:14.000 | DeepSeek R1 is not really just one model. There's two models. There's DeepSeek R1-0
00:01:19.180 | and then there's DeepSeek R1. They actually put out both. They're both pretty good. It's
00:01:24.100 | a different approach to how they did both, but yeah, kind of what they are, the training
00:01:29.100 | template, reward models, how they determine all this emergence, reflection, aha moments.
00:01:35.300 | And then we'll talk about what R1 really is. R1 is taking a reasoning model and then turning
00:01:39.880 | into a chat model again. So, now we've got more than just a base model and chat model.
00:01:45.620 | Now we have base model, reasoning model that's not a good chat model, and then reasoning
00:01:50.500 | models that can chat again. Most of this paper and most of the slides are actually on R1-0.
00:01:55.140 | It's a very interesting one. Apparently, an hour ago, the ARC benchmark guys, they put
00:02:02.660 | out news that R1-0 is better than R1. So, it's better on some stuff. You know, you take
00:02:09.580 | smart model, you make a chat model, it's going to become dumber. People are dumb. We like
00:02:13.020 | to chat. Then we'll talk about performance evals. Then a really cool thing they did with
00:02:18.180 | distillation. So, they distilled R1 into Llama 3 and Qwen models. And they just kind
00:02:26.260 | of, you know, we're going to drop this real quick. They broke the US economy. They're
00:02:30.020 | like, not only do you get R1, you also get Llama-R1, you get Qwen-R1, you get R1-0, which
00:02:36.460 | isn't a chat model, you get everything. And then they have future work. Some people have
00:02:41.100 | done reproductions of the work. Someone yesterday from, I think, Bespoke Labs put out a reasoning
00:02:46.980 | style data set. So, we'll just, you know, yap and discuss about that. But, yeah. So,
00:02:53.980 | high level, the point of all these reasoning models, the reason why everyone cares is because
00:02:58.020 | they're kind of changing the scaling curve from let's just train bigger and bigger models
00:03:02.300 | that can like, you know, we throw more compute at the problem and like the inference becomes
00:03:07.020 | a little bit better to let's start trading that upfront cost for inference time compute.
00:03:13.740 | And yeah, that's kind of what they did. Before, OpenAI was the only one to do it with O1,
00:03:19.180 | O1-mini. And then, you know, give it a few months and DeepSeek has just
00:03:22.860 | put out a really good model. It's like on par with OpenAI's O1. It's like better than
00:03:27.860 | all the other models. Completely open sourced it. The paper's not that good. They don't
00:03:32.300 | really talk much about the training data. Like Sean mentioned earlier, it's a pretty
00:03:38.020 | basic approach to what they did. So, there's not much in the paper. They don't talk much
00:03:42.100 | about the data. They don't talk much about a lot. But anyway, it's a paper. Weights are
00:03:46.660 | there. It's MIT licensed, which is pretty good. And you know, it's still a V1 of this.
00:03:52.860 | It's just like how OpenAI has O1 and there'll be O3. So, you know, there'll be R2. There'll
00:03:58.380 | be other companies that do this. Mistral might do it if they still exist. Llama will do it.
00:04:03.380 | So, there'll be others and then we'll only see improvements from here. One of the quotes
00:04:08.140 | from their paper, their goal, so like they say, "Our goal is to explore the potential
00:04:12.820 | of LLMs to develop reasoning capabilities without any supervised data, focusing on their
00:04:17.940 | evolution through a pure RL process." So, TLDR, what they really found out is just throw
00:04:24.100 | RL at the problem and you can get a really, really good model. People kind of, you know,
00:04:29.060 | set aside RL for a while, but yeah, it turns out you can just throw RL at the problem and
00:04:33.180 | it's pretty good. But this is a kind of interesting note, right? So, they wanted to develop reasoning
00:04:38.820 | capabilities without any supervised data. Without any supervised data means, you know,
00:04:43.580 | we don't go out, we don't label a bunch of, "Hey, here's chain of thought. Here's the
00:04:48.280 | ideal reasoning." A lot of people would think about, "Okay, if you have an agent, if you
00:04:52.300 | have a coding problem, there's 10, 20, 30 different approaches to get to the same answer.
00:04:57.300 | How do we optimize this stuff?" But no, they're not doing supervised data. Their goal was
00:05:02.020 | to do it without any supervised data, just self-evolution and RL. And they did some really
00:05:07.580 | good RL. And they did put out good math. They explained this RL. And yeah, that kind of,
00:05:12.840 | you know, blew up. So, what they do is they post-train the base DeepSeek v3 model, which
00:05:19.620 | I don't think is that important. It's just, you know, big model with this GRPO. GRPO is
00:05:24.620 | their type of RL. We'll go into it in a bit. And then they start to notice these emergent
00:05:29.540 | capabilities that come out. There's great reasoning that starts to come out. So, you
00:05:33.500 | know, you don't train on reasoning data, but dang, you get reasoning. You just train on
00:05:38.380 | hard questions, hard data. Then reflection starts to become a thing. So, you know, the
00:05:43.900 | model starts to reflect, like think on its actions, thinks on its steps. It has these
00:05:49.020 | aha moments where it's like, "Oh, shoot, that's crazy. This is what the right step is." And
00:05:53.100 | then it continues. And you get like O1-level performance. Then from this, you know, V3-based
00:06:01.020 | zero model, R1-0, they train the actual DeepSeek R1. They have a four-stage
00:06:07.940 | approach for training it. And that becomes a really good reasoning and chat model. So,
00:06:12.420 | four stages, they have like this cold start to make sure things don't go crazy at the
00:06:15.840 | beginning. And guess what? They do SFT. It's not just RL. Then they do RL. Then they do
00:06:22.660 | rejection sampling. Then they do RL again. So, not one RL. There are two RL stages at
00:06:27.660 | the problem, you know, so double the RL. But yeah, so high level, those are the two models.
00:06:33.740 | R1-0, it's a great reasoning-only model. It's trained with RL, without labeled chain of thought, you
00:06:39.380 | know. It's not good as a general model. Then R1, it's created from outputs
00:06:45.180 | from R1-0 and that four-stage training method. It's a really good model. It's like O1. Then
00:06:51.340 | the other half of the paper, not half, but like, you know, they have a section, they
00:06:55.020 | have like a paragraph on, "Hey, by the way, we just distill our outputs into Qwen and Llama."
00:07:01.220 | It does very, very good. So, that, they're not doing native RL training. They're doing
00:07:05.920 | proper distillation. So, they take their big model. They train it with a distillation loss.
00:07:10.420 | Well, they don't say what type of distillation, but you know, standard distillation is distillation
00:07:15.020 | loss. Then they compare it to the base models and it performs very well. A little note,
00:07:22.940 | they do try, they make a note like, "Okay, what if we did RL?" So, they take Qwen 32B, they
00:07:28.300 | do like 10K steps of RL. They compare that to distillation and they find distillation
00:07:33.860 | much better. They make a small claim like, "Yeah, you know, if you do RL, it's very compute
00:07:39.180 | expensive. Like, it's hard to do. It doesn't work as well as just distilling. So, maybe
00:07:44.140 | in the future we still need these big base models."
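For reference, a minimal sketch of the two flavors of distillation being contrasted here, assuming a generic Hugging Face-style teacher and student in PyTorch; this is illustrative, not DeepSeek's actual code (their distilled models appear to use the first option, plain SFT on teacher-generated samples):

```python
# Illustrative only, not DeepSeek's code. Assumes HF-style models whose forward
# pass returns .logits. Option 1 is SFT on teacher-generated samples; Option 2 is
# the classic logit-matching distillation loss.
import torch
import torch.nn.functional as F

def sft_on_teacher_outputs(student, input_ids, labels):
    """Option 1: treat teacher generations as ordinary next-token SFT targets."""
    logits = student(input_ids).logits                   # (batch, seq, vocab)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),     # predict token t+1 from t
        labels[:, 1:].reshape(-1),
        ignore_index=-100,                               # mask out prompt tokens
    )

def logit_distillation(student, teacher, input_ids, temperature=2.0):
    """Option 2: match the teacher's full output distribution with a KL loss."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits / temperature
    student_logits = student(input_ids).logits / temperature
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    ) * temperature**2
```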
00:07:48.540 | Being 2025, you know, no one talks about any data. They don't talk about where it came
00:07:53.380 | from. They just say, you know, get good quality data. Performance is very good. Models are
00:07:58.300 | fully open source with MIT license. They don't give training data. They don't give training
00:08:03.940 | code either. They host the model themselves on their own API. Something interesting to
00:08:08.900 | note is as much as people are raving about how good this thing is, DeepSeek themselves
00:08:14.620 | are also serving it very cheaply and very fast. So, 3 to 10x faster and also
00:08:20.740 | cheaper than other infra providers. But, you know, if you use the DeepSeek API, they clearly
00:08:26.820 | state that the, you know, data goes to servers in China. So, use at your own risk, but very,
00:08:32.980 | very cheap model, very, very good model. Their API is a lot faster and cheaper. Part of that
00:08:38.220 | is because, you know, they know everything about how to optimize this thing. They built
00:08:42.340 | it and it just came out. The other providers that are hosting it, well, you know, they
00:08:46.740 | just have the model and they're trying to run it. But, yeah, from there, let's go into DeepSeek
00:08:52.220 | v3 real quick. This is my one slider. So, we say it's important. We'll stop after this
00:08:57.940 | and discuss it a little, but basically it's just a regular LLM. It's a pretty large model.
00:09:03.940 | It's chunky. It's 671 billion parameters, but 37 billion active parameters, which is
00:09:09.220 | pretty interesting. You know, it's a lot of experts in there, but effective parameters
00:09:13.960 | are pretty small. It's basically a 30B model at inference time, fully open source. It's
00:09:19.940 | GPT-4o level. It's not the reasoning one. This is just a standard big MOE model. They
00:09:25.940 | made this little claim, you know, training this thing took $5 million, 5.5. They had
00:09:31.020 | like a few steps to this, so they trained it. Then they did two stages of context length
00:09:36.140 | extension. They did, first, they trained the thing as a base model. Then they do some 32K
00:09:40.660 | and 128K context length extension, trained it on about 15 trillion tokens, do very standard,
00:09:47.580 | you know, train it, do SFT, do RL. The model is pretty good. They have this concept of
00:09:52.900 | multi-head latent attention. It's pretty cool. If anything, that would be like the next slide
00:09:58.420 | if I had to have three slides, but, you know, they have fancy attention. They do multi-token
00:10:04.140 | prediction. We covered the paper from Meta a few months ago that talks about this, where,
00:10:09.540 | you know, it's more sample efficient. You can do multi-token prediction. Meta put out
00:10:13.540 | a paper. They're like, "Oh shit, this works. It's pretty good. People should do it." And
00:10:18.180 | then not many people did it, and then they did it, and it helps. Came out a month ago.
00:10:24.000 | People are very hyped. The other day, it kind of, you know, broke America real quick. NVIDIA
00:10:28.300 | dropped $600 million because they said they'd train this in $5 million. So, yeah, I'll take
00:10:36.380 | a little pause. That's high level of the, you know, what they've released, how it works,
00:10:41.260 | what's going on under the hood. This is DeepSeek v3. It's their big MOE. It's got 37 billion
00:10:46.940 | active parameters. They say it was cheap to train. They trained it on 15 trillion tokens,
00:10:53.780 | but yeah, this is the, you know, step zero. This is the base model that the reasoning
00:10:57.820 | model is built on. This is very similar to models like Mixtral or GPT-4o. It's just
00:11:04.300 | a big MOE model. Oh, $600 billion, not $600 million. NVIDIA dropped heavy. Big, big drop.
00:11:12.700 | America blew up real quick. But yeah, so all the reasoning models are built on top of this
00:11:17.720 | as a base model. But yeah, if we want to pause here, anyone have thoughts, points, anything
00:11:22.840 | that they loved about this DeepSeek v3? Which in and of itself is a good model. It's cheap.
00:11:29.660 | Things to note at a, you know, high level AI engineering, like view is using reasoning
00:11:35.100 | models is cool, but also they're kind of slow, right? Like if you need total completion,
00:11:40.180 | thinking is cool, but like sometimes I just want output, right? Models are pretty good
00:11:43.940 | at compressing. Sometimes I just want fast speed. This is GPT-4o level and very fast.
00:11:50.920 | So it's only 37B active parameter. So a lot of the times people would probably want to
00:11:56.580 | use this. You don't need a reasoning model for everything, right? If you run a chatbot,
00:12:01.020 | that's cool. You can probably just run this. Later in the conclusion, there's a slide that
00:12:06.660 | shows what future work they want to do on the reasoning model and they show how v3 is
00:12:11.700 | actually better at something. So not, not to undermine this, you know, it's still very
00:12:16.060 | good, very smart, fast, cheap. It's a good model, but it's just an MOE. But yeah, anyone
00:12:23.540 | want to chime in, any questions, anything interesting in chat?
00:12:27.060 | A lot of questions. Sorry. There's a lot of questions. I don't know which one to focus
00:12:34.080 | Okay. I'm going to see the first one. So how is effective active parameters different from
00:12:38.260 | total parameters? So total parameters, you know, you still have to load all this in memory.
00:12:43.980 | So 671 billion parameters, you need lots and lots of GPUs to load this thing. But at inference
00:12:49.780 | time, it's only using a fraction of these, right? About 5% of these; it's using 37 billion
00:12:54.820 | parameters. So realistically, like, per token it's more efficient. It's, it's
00:13:01.300 | going to be faster, it's going to be cheaper. But it's not something that you can just host
00:13:06.580 | yourself, right? Like your laptop might be able to host a 30 billion parameter model,
00:13:11.940 | you load all those weights into memory and you use it. This is kind of like that at
00:13:17.460 | inference time, but it needs all the weights loaded up.
00:13:21.540 | I think the point being a lot of people miss is that, like, you do save, you do save memory
00:13:26.020 | at scale, like you might not save memory if you're having one chat on your laptop, because
00:13:29.340 | all, because every token may use a different subset of the parameters, so you need them
00:13:33.740 | all loaded. But if you're doing batch inference, like DeepSeek themselves are doing,
00:13:39.660 | then they can route things to each different GPU how they want
00:13:44.580 | and saturate them a bit more.
00:13:47.220 | Yep. Yep. At batch, it's, it's just very efficient. And that also means it's, it's faster too.
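As a rough back-of-the-envelope on why "37B active out of 671B total" matters, assuming roughly 1 byte per weight at FP8 and the usual ~2 FLOPs per active parameter per generated token (ballpark figures, not official numbers):

```python
# Ballpark arithmetic only: total parameters set the memory bill, active
# parameters set the per-token compute bill.
TOTAL_PARAMS = 671e9      # all experts must be resident in GPU memory
ACTIVE_PARAMS = 37e9      # parameters actually used per token

weight_memory_gb = TOTAL_PARAMS * 1 / 1e9      # ~671 GB at ~1 byte/param (FP8)
flops_per_token = 2 * ACTIVE_PARAMS            # ~74 GFLOPs/token, like a dense ~37B model

print(f"weights in memory: ~{weight_memory_gb:.0f} GB (multi-GPU node territory)")
print(f"compute per token: ~{flops_per_token / 1e9:.0f} GFLOPs (dense-37B-ish latency)")
```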
00:13:52.980 | Okay. What is FP8 training? So mixed precision training, you know, before we used to train
00:13:57.860 | in full precision, then half precision, you know, FP16 and BF16. Now
00:14:03.220 | we cut precision again. It's just an interesting thing. I think they're the first ones that
00:14:08.060 | have done it at this scale. Typically, you do this for inference. This is like quantization, right? You can
00:14:12.580 | run a model in four bit and half precision, and there's a slight degradation in quality.
00:14:18.700 | But on the training side, we typically need as much, like, precision as possible. In this
00:14:23.980 | case, they, they can do FP8 training. They did it, guys. They also, yeah, another interesting
00:14:31.700 | key component was that they're training this without an auxiliary loss. So if you know
00:14:36.900 | about MOEs, that's a, that's a pretty interesting piece there. But, okay. Can we trust them
00:14:44.260 | on the $5 million cost claim at face value? You can take it both ways. People have gone into
00:14:51.060 | token economics of how much it would cost to train this many tokens at this scale, and
00:14:56.580 | it can be around here. But realistically, this is like, you know, maybe the final train
00:15:01.060 | run cost around this, but this doesn't include any of the R&D, any of the other experiments.
00:15:06.140 | Like, the real total is more than that, but either way, you know, it's out, it's open source,
00:15:11.780 | it's good, it's small. It was cheap. I think Dario from Anthropic, their co-founder, mentioned
00:15:18.940 | something about this recently, of like, you know, how Anthropic's Claude 3.5
00:15:26.420 | was also trained in the tens of millions. It's nothing crazy. The other interesting
00:15:32.180 | thing was the whole GPU restriction, right? So people say that they have to say this because
00:15:38.700 | they're not allowed GPUs. And if they say they had GPUs, then, you know, they lose their
00:15:42.940 | little supplier, but they have GPUs. You know, who knows?
00:15:49.300 | Yeah, I wanted to add, because there's a lot of speculation of whether, whether is this
00:15:55.100 | real or not, right? But like, one of the best things about the open weights and open code is that
00:16:00.540 | we can validate these things. So it is definitely an MOE model.
00:16:06.180 | It has, it has definitely that amount of experts that was stated and, and the community has
00:16:11.820 | already essentially made calculators for how much does it cost to create an MOE model.
00:16:17.500 | And if you work it out backwards, it's between five to 10 million. So maybe the exact number
00:16:22.580 | is off, but I think, I think a lot of people are missing the point that it's at that ballpark
00:16:28.340 | and for contrast, Llama 3 405B was 50 mil. And that is based on the amount of compute time.
00:16:34.380 | So you, apples to apples, it's much cheaper.
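To sanity-check the ballpark, here is the usual back-of-the-envelope, using the roughly 2.79M H800 GPU-hours and ~$2 per GPU-hour figures people quote from the V3 technical report, plus the standard 6*N*D FLOPs rule of thumb; treat all of these as approximations:

```python
# Rough sanity check of the "$5-6M final run" number; all inputs are approximate.
ACTIVE_PARAMS = 37e9           # N: active parameters per token
TOKENS = 14.8e12               # D: ~15T pre-training tokens

train_flops = 6 * ACTIVE_PARAMS * TOKENS       # ~3.3e24 FLOPs for the final run
gpu_hours = 2.79e6                             # quoted H800 GPU-hours
cost_usd = gpu_hours * 2.0                     # ~$5.6M at ~$2/GPU-hour

print(f"training compute: ~{train_flops:.1e} FLOPs")
print(f"final-run cost:   ~${cost_usd / 1e6:.1f}M (excludes R&D and failed runs)")
```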
00:16:39.340 | Yeah. Okay. I think these other questions have good discussion in chat, so I'm going
00:16:46.900 | to let them continue.
00:16:47.900 | Can I also just add one thing? I think one thing that DeepSeek V3 did very differently from
00:16:53.060 | V2 is... crap, it just escaped me. Yeah, auxiliary-loss-free MoE training, without an auxiliary loss.
00:17:03.940 | So I thought that was pretty interesting and it really simplified a lot of things. And
00:17:08.220 | that was a big step up from v2. So, I mean, if you have the paper open, I mean, just go
00:17:13.100 | through it. They make a big deal out of it. I'm not sure how much of a difference it makes
00:17:16.460 | though. Yeah.
00:17:18.460 | Okay. I have the important, I have the R1 paper open, not the, not the other one. Also,
00:17:25.580 | I think I'm only sharing my Chrome screen, so we don't get my paper this time. My beautiful
00:17:32.060 | highlights. It's okay. I took some screenshots of charts, but, okay, let's move on to the
00:17:38.580 | fun one. This was a chart that I was going to pull more charts from. This is from Jay
00:17:43.580 | Alammar's blog. It's a good blog. I recommend checking it out. Actually, there was a better
00:17:49.700 | chart posted in Discord, like, let me pull it up. It was just posted about an hour ago.
00:17:58.860 | So this is like a better overview of the training pipeline, but this is also kind of what's
00:18:06.340 | happening here. So they've got the DeepSeek V3 base. They do SFT reasoning data examples.
00:18:15.460 | They have this SFT checkpoint. Then we do fine tuning with RL to get DeepSeek R1. So
00:18:22.260 | we're going to kind of look in this middle step here, which is DeepSeek R10. So this
00:18:27.940 | is where they apply pure RL directly to V3, the V3 base model without any SFT data. They
00:18:33.740 | use a GRPO for RL, which was introduced a little while ago. Actually, this came out
00:18:38.360 | in the DeepSeek math paper. So there's a few different ways that the model is rewarded
00:18:43.940 | during this RL process. Someone's got their hand up. You want to just ask the question?
00:18:49.980 | We haven't gone that deep yet. Sachin, you want to?
00:18:53.060 | Yeah. So you can hear me, right? Yeah. So one of the things, so I haven't been following
00:18:58.300 | what OpenAI and all the other guys are doing, but what prevents them from because they have
00:19:04.660 | the training data, their own like process. And if they run this and verify because you
00:19:10.860 | have the code and all, and then they can compare like what their existing way of doing versus
00:19:16.780 | the new way of doing, right? So, do you know like how long that would take for these guys?
00:19:24.720 | But now they say, okay, this is the new way of doing things. Everybody accepts. I don't
00:19:28.840 | know if, if you don't know what this data was trained and all, this will definitely
00:19:34.080 | shake it out. But we are not the guys who can basically have the money to train and
00:19:38.680 | actually verify this, right? Has anybody done that? Like, these are numbers that only the
00:19:42.800 | big guys can tell us, right?
00:19:45.640 | In terms of training, there's not much to verify, right? So like for V3, for Llama
00:19:52.960 | models for the base models, there's verification that people can do, right? You know how many
00:19:58.480 | tokens are trained on. We know how it is like to train these models. For this model, like
00:20:04.200 | for R1, we don't have training code. We don't have the data, but that doesn't mean that
00:20:09.440 | people can't do this. There's a section later about companies that are trying to reproduce
00:20:13.960 | this. They also show stuff that we can do, right? So they distill outputs from this into
00:20:18.640 | Lama models, that stuff that's very attainable, you know, that stuff is now in the hundreds
00:20:23.120 | to thousands of dollars. Now that stuff regular people can do. There's a company, I don't
00:20:28.960 | remember who, that already put out a fine-tune on DeepSeek-style R1 data. So we can discuss
00:20:37.120 | this in a bit, but yeah, there's, there's people that are starting to work on this,
00:20:42.320 | but anyway back, back to what they're doing here. So R1-0 is kind of one of the two models,
00:20:48.240 | right? So they, a while ago, they put out this paper that was the DeepSeekMath paper.
00:20:53.760 | They explained this new GRPO RL algorithm. It's kind of where you have a reward that's
00:20:59.860 | based on accuracy and responses that are verifiably correct. So verifiably correct means you train
00:21:07.000 | on data that can be verified. So math questions, you know, math answers that check out; LeetCode,
00:21:12.080 | you can have something that compiles it to check if something is correct. And then they
00:21:15.800 | have like a little format reward. So they, they want to do this RL that nudges the model
00:21:20.440 | to also follow format, right? In this case, the format reward is making sure that there's
00:21:24.920 | think tags between reasoning and then there's an output at the end. So there's kind of like
00:21:30.520 | three things they're testing for here, right? One is the model is being rewarded to one,
00:21:36.240 | put thinking traces, right? So it needs to think. So it's going to put thinking stuff
00:21:41.560 | between thinking tags. It needs an answer. So there's going to be an answer and the answer
00:21:46.200 | has to be correct. And then that correct answer has to verifiably check out. So then there's
00:21:52.080 | kind of this RL algorithm that's, that's applied around all this. This is kind of what the
00:21:56.680 | prompt looks like that they train with. So this is the template prompt, right? So conversation
00:22:01.760 | between user and assistant, user asks a question, the assistant solves it. The assistant first
00:22:06.720 | thinks about the reasoning process in the mind, in its mind, then provides the user
00:22:11.080 | with an answer. The reasoning process and answers are included, are enclosed within
00:22:15.640 | think tags and answers within answer tags respectively. So think goes here, answer goes
00:22:21.320 | here, then the assistant. So now, you know, when you answer, the model is prompted to
00:22:25.960 | now answer with, okay, here's my thinking. Here's my reasoning process. Here's the end
00:22:30.040 | of my think tag. Here's an answer tag. Here's my answer. Here's the end of the answer. Then
00:22:34.600 | you do a bunch of training with just pure RL. Here's kind of the formula for all this.
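Since the slide itself isn't in the transcript, here is a reconstruction of the GRPO objective as written in the DeepSeekMath/R1 papers (notation lightly simplified):

```latex
% For each question q, sample a group of G outputs o_1..o_G from the old policy,
% score them with the rule-based reward r_i, and normalize within the group:
A_i = \frac{r_i - \mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}

\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G}
\Big( \min\big( \rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\, A_i \big)
- \beta\, \mathbb{D}_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) \Big) \right]
```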
00:22:38.840 | Here's, here's a cool little chart. GRPO. So what is this? So compared to traditional
00:22:45.120 | RL, there's no critic model here. It uses groups of sample generation to estimate rewards
00:22:50.680 | with, and this kind of helps with cutting the compute cost down, right? You don't need
00:22:54.560 | to train a separate critic model. In this case, there's group-based rewards. So this
00:22:59.680 | is basically where you score outputs against a sampled group of outputs to reward
00:23:05.120 | relative performance. So instead of generating one output, you generate a group, score them, and
00:23:10.960 | then, you know, reward each one relative to how the group did on average. Then there's of course
00:23:15.320 | stability and stuff. A big thing with RL is you have like, you know, KL divergence. You
00:23:20.200 | don't want models to randomly drastically make big changes because they probably won't
00:23:24.880 | get back. So there's a penalty, you know, if in the group something is really good,
00:23:30.840 | but you know, it diverges a lot from the sample, then yeah, we also penalize that. So it's
00:23:36.000 | just, this is good RL. We could spend a lot of time on this, but honestly, I think this
00:23:41.960 | is good for Discord discussion. So I'm sure someone will create a thread of GRPO instead
00:23:47.600 | of 200 people sitting here thinking about what RL is. High level, there's no critique
00:23:53.040 | model. It's, you know, it's judging and rewarding outputs based on sampled group outputs, and
00:24:00.720 | then there's stability in here to make sure that we don't diverge if samples go crazy.
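A minimal sketch of just the group-relative part, assuming some reward function like the one discussed later in the Q&A (illustrative, not the training code):

```python
# Minimal sketch of GRPO's group-relative advantages (no learned critic/value model).
# The reward values would come from the rule-based verifier plus format checks.
import statistics

def group_advantages(rewards, eps=1e-6):
    """Score each sampled output relative to its own group of samples."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or eps      # avoid divide-by-zero if all equal
    return [(r - mean_r) / std_r for r in rewards]

# e.g. 16 samples for one question, only two verifiably correct:
rewards = [1.0 if i in (3, 11) else 0.0 for i in range(16)]
print(group_advantages(rewards))   # correct samples get positive advantage, the rest negative
```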
00:24:05.440 | Now R1-0, how does it perform? So it performs really well and there was no labeled SFT training
00:24:12.920 | data. It's just a base model trained with this RL to output the correct responses and
00:24:17.600 | add some thinking. It does well with this majority voting, it does even better. So here's
00:24:23.800 | kind of benchmarks. If we look at it compared to O1 and O1-mini, R1-0 does, you know, pretty
00:24:29.800 | good like on most benchmarks: on math, on LiveCodeBench, Codeforces, it's pretty good up there.
00:24:37.000 | And then when you do majority voting, which is, you know, you generate a bunch of samples
00:24:41.420 | and take the most common final answer, it does significantly better. The key thing here
00:24:46.440 | was they actually trained this thing on very, very hard questions. So just good training
00:24:51.440 | quality questions and yeah, it does pretty well. You generate a bunch of samples, you
00:24:56.760 | pick the one that's the best out of a group of them, you kind of nudge it towards doing
00:25:00.400 | better there. Yeah, the next few things were very interesting
00:25:05.920 | charts that came out of here. So these are some charts that show how their inference
00:25:10.800 | time is correlated with eval performance. This is kind of what you start to see at scale.
00:25:15.840 | When they started to train this thing, it didn't work really well, right? This is just
00:25:19.920 | like, okay, why does this work now? This is basic RL. But at scale, we start to see these
00:25:24.780 | emergent capabilities, right? As you train for more and more steps, we see that accuracy
00:25:29.520 | starts to go up with steps too, right? So for each question, we sample 16 responses,
00:25:34.800 | calculate the average, we start to see how it performs. For more steps, the more kind
00:25:38.720 | of steps that you take, the better performance is.
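The eval behind that curve is roughly the following, a hedged sketch where generate and is_correct are hypothetical stand-ins for the model's sampler and the answer verifier:

```python
# Sketch of "sample 16 responses per question and average correctness" (a pass@1
# estimate). generate() and is_correct() are hypothetical helpers, not real APIs.
def average_accuracy(model, questions, k=16):
    per_question = []
    for q in questions:
        samples = [generate(model, q) for _ in range(k)]
        per_question.append(sum(is_correct(q, s) for s in samples) / k)
    return sum(per_question) / len(per_question)   # the number plotted against RL steps
```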
00:25:42.520 | Another one here, this was a very interesting one. The average response length of the model
00:25:48.280 | also starts to increase. So the longer you train it, the more reasoning steps it starts
00:25:53.640 | to take, which means that basically the TLDR of this paper was just this RL thing just
00:26:00.440 | kind of works. And you can see this in the charts, right? The more that we're training
00:26:04.640 | this, the model is starting to learn to reason more and more, because the more it reasons,
00:26:09.360 | the better the performance is. And throughout more steps, the average length of the response
00:26:13.560 | is starting to get longer and longer. So here's a little quote here. "The average response
00:26:20.520 | length of R1-0 on the training set during the RL process. DeepSeek R1-0 naturally learns to solve reasoning
00:26:27.000 | tasks with more thinking time." So yeah, it's starting to do that. It "naturally acquires the ability
00:26:34.480 | to solve increasingly complex tasks by extending test time compute. This ranges from hundreds to thousands
00:26:40.980 | of reasoning tokens. The emergence of interesting behaviors as test time compute increases."
00:26:47.000 | So this was another interesting one. So as you increase test time compute, they started
00:26:51.480 | to notice emergence of interesting behaviors. So some of these were reflections and aha
00:26:58.080 | moments. Reflections are where the model started to revisit and reevaluate previous steps and
00:27:04.280 | explore alternatives. So as it's doing its thinking, it would start to reflect and be
00:27:10.680 | like, "Huh, a few steps ago, I went down this path. Maybe I should look at this again."
00:27:17.160 | Aha moments are where it starts to take more time and reevaluate an original approach.
00:27:23.000 | So in this example, and this also shows the quality, the example of questions that they're
00:27:29.080 | training on. And you can see more of these as well. If you look at some of the stuff
00:27:32.560 | that they trained on, you can look at those data sets and look at the type of questions.
00:27:36.780 | But here it's being told to answer this question. It's like, "Okay, here's basic math. I can
00:27:41.480 | square both sides. I can isolate this term." And then it's like, "Wait, wait, wait. That's
00:27:46.600 | an aha moment I can flag here." And they start to notice these little emergent capabilities
00:27:50.960 | where it's starting to find these aha moments and it's starting to re-reason. It's starting
00:27:55.760 | to have these reflections. And there was this kind of interesting quote that I found in
00:28:00.080 | the paper. So this is from the DeepSeek team. They make this section. They say, "This moment
00:28:06.640 | is not only an aha moment for the model, but it's also for the researchers observing its
00:28:11.320 | behavior. It underscores the power and the beauty of reinforcement learning. Rather than
00:28:16.240 | explicitly teaching a model how to solve a problem, we simply provide it with the right
00:28:21.640 | incentives and it autonomously develops advanced problem-solving strategies. The aha moment
00:28:28.480 | serves as a powerful reminder of the potential of RL to unlock new levels of intelligence
00:28:34.280 | in artificial systems, paving way for more autonomous and adaptive models in the future."
00:28:39.640 | But basically, rather than explicitly teaching the model how to solve problems, they train
00:28:45.360 | it with RL, which incentivizes it based on those incentives, and it autonomously starts
00:28:50.860 | to understand these problem-solving strategies. Through its thinking steps here, it starts
00:28:58.520 | to realize it has these aha moments, and it also starts to have reflections in its thinking.
00:29:05.020 | So that was kind of another interesting thing that came there.
00:29:08.840 | So what about DeepSeek R1? What are the kind of problems with R1-0? R1-0 had poor readability.
00:29:16.480 | It also had a lot of language mixing. I'll make a note of this later in the section,
00:29:20.920 | and I think we should discuss it. They kept talking about language mixing. They keep talking
00:29:25.720 | about how problems with this model are that it mixes up languages. It goes between English
00:29:31.360 | to Chinese, and it's not good at fixing what language it should be. And this is also a
00:29:37.200 | problem with the real R1. They weren't able to solve this too well. Now, some of this
00:29:42.080 | is due to RL, but yeah, it's just a little interesting note that more than four or five
00:29:47.560 | times in the paper, they had mentioned how this thing struggles with language mixing.
00:29:53.560 | So that's kind of where R1-0 had its issues. It wasn't very readable. This is giving out
00:30:02.000 | weird thinking steps, aha moments. It's not being trained with the RL objective of there's
00:30:10.320 | no safety. There's no conciseness. There's no be a good assistant. There's no be a good
00:30:16.280 | chat model. There's no be fun to chat with. There's nothing like that. So they take R1-0,
00:30:22.480 | and then they make R1, which is let's take this reasoning model that we can do. Let's
00:30:27.920 | actually make a proper LLM assistant that we can. So we'll make a reasoning chat model,
00:30:33.960 | which is what R1 becomes. But I'll take a little pause here. I know a lot of people
00:30:39.320 | pre-read the paper too. Is there anything we want to dive deeper into in R1-0? We could
00:30:45.000 | talk about the RL policy itself, the model, how it performs, any of these charts.
00:30:53.280 | Yeah, I had a question. There was a step that I must have missed, which is the RL scoring
00:31:00.040 | function. In other words, when the models return these 16 different answers in English,
00:31:08.520 | how is it scoring? How is it deciding which one was best?
00:31:12.040 | So there's a few things there. There's the verifiably correct, which is part of it. So
00:31:16.840 | these questions, if they're LeetCode style questions, you can run a compiler and you
00:31:22.920 | can verifiably see what's correct. If it's a math question, you can verify that the answer
00:31:27.840 | matches what the answer should be. And that's the level of distinction they go at. There's
00:31:32.480 | a few different categories, but they can verify the answers to make sure that's correct. Then
00:31:38.240 | you have the other little parts of the policy, right? Like you want it to output these think
00:31:44.160 | and answer tokens, so you can verify that it did that. If it didn't do that, then it's
00:31:49.600 | going to be penalized, right? So another part of this is following this prompt template.
00:31:55.120 | If it doesn't output think tokens and answer tokens, or if it outputs them but doesn't
00:32:00.600 | give any reasoning, that's not good. Now, this is just a reasoning model. Some of the
00:32:05.640 | changes in the actual R1 is for sometimes you don't need to reason, right? For a question
00:32:11.120 | like hello, you can not reason. But basically, that's some of the stuff that they--
00:32:16.680 | So just so I understand, so the basic concept here is that even doing these kinds of very
00:32:23.760 | simple forms of reasoning, or they're very complicated, like mathematics, the idea is
00:32:28.440 | that that learning then transfers onto other kinds of responses and reasoning that people
00:32:36.280 | want. Because it was very interesting to me, like one of my test questions is, what is
00:32:40.760 | the population below Central Park? And R1 and all of them just fall on their ass. They
00:32:46.600 | can't answer this, which any third grader can reason through and come up with a reasonable
00:32:50.760 | answer, right? And a reasonable answer is not a number that's larger in the population
00:32:56.400 | of Manhattan or less than 100,000. They just fall to pieces, because they can't seem to
00:33:02.680 | reason about this. And the reason I'm asking this is because is the assumption here that
00:33:09.200 | if they can solve these math questions and coding questions, that they can then reason
00:33:13.560 | about other things? Is that one of the fundamental assumptions here?
00:33:18.520 | Yeah, so in my interpretation, the goal isn't to get it to reason about different things.
00:33:28.120 | It's to get it to just output thinking and reasoning, right? So you want it to be able
00:33:33.320 | to output its thought process to come to a verifiably correct answer. And then as there's
00:33:40.280 | harder and harder questions, it does more or less output of what its thought process
00:33:46.240 | is. And you reward it for being right at the end or wrong at the end. And in that, you
00:33:53.280 | kind of distill this down to, yeah, for harder questions, it will do more thinking before
00:33:57.840 | it answers. For simple questions, it won't. And that's kind of what I feel like they're
00:34:02.360 | going for here. If anyone else has other answers to this or other takes on this--
00:34:05.560 | I just wonder whether-- yeah, I wonder whether, like, is it-- what you're saying is it's not
00:34:10.120 | learning new reasoning. It's just that in the fine-tuning step, we're teaching it to
00:34:14.960 | actually reason, even though the base model was capable of that before. So I'm not-- I
00:34:20.600 | don't know whether that's true or not. It might be.
00:34:23.240 | Yeah, the base model is also very smart, right? But you're teaching it to give out its thought
00:34:28.320 | process. And it's also graded against 16 other versions of thought processes to an answer.
00:34:33.720 | And it needs to do pretty good on this. It needs a good thought process. It needs to
00:34:38.240 | also learn to mimic this template and everything.
00:34:42.360 | But to be clear, it's not being judged on the thought process, only on the-- in other
00:34:48.480 | words, the RL reward is on the correctness of the answer, and then some very basic mechanical
00:34:54.520 | stuff, like, did you have the word "think" and so on? And was there something that we're
00:34:57.800 | going to call a reasoning process? But we're not going into that and parsing it and trying
00:35:01.640 | to understand the reasoning process. All we care about is-- from a reward standpoint,
00:35:06.320 | the reward is given if the answer is correct. And these tokens and so on are present.
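To make that concrete, a toy version of the rule-based reward just described; the think/answer tags follow the paper's template, but the weights and the exact-match check are illustrative assumptions:

```python
# Toy rule-based reward: a format check for the think/answer template plus a
# verifiable-correctness check (exact match here; code tasks would compile and run
# tests). The weights are made up for illustration.
import re

TEMPLATE_RE = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def reward(response: str, reference_answer: str) -> float:
    match = TEMPLATE_RE.search(response)
    if match is None:
        return 0.0                        # no reward without the required format
    format_reward = 0.1                   # small nudge for following the template
    answer = match.group(2).strip()
    accuracy_reward = 1.0 if answer == reference_answer.strip() else 0.0
    return format_reward + accuracy_reward

print(reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))   # 1.1
print(reward("the answer is 4", "4"))                               # 0.0
```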
00:35:11.640 | Actually, if you don't mind if I jump in here, because this is kind of related to my question
00:35:14.840 | I have on GRPO, which is Group Relative Policy Optimization. I couldn't find anything on
00:35:18.640 | GRPO on YouTube, which I was hoping to get, because I don't want to read this fucking
00:35:21.600 | 53-page math paper. Forgive me. But I found something on Direct Preference Optimization, and
00:35:26.840 | what that was telling me was they removed the reward function from the objective, or
00:35:31.800 | loss function. Sorry, I know, another sin. But removing that reward-- that explicit reward
00:35:37.680 | term seemed to be an important part of DPO, along with Kullback-Leibler divergence in
00:35:41.240 | order to have that kind of-- I don't want to say memory, because there is also a stability
00:35:45.160 | term with a clip function with epsilons. But I kind of had that memory as well.
00:35:50.120 | So I feel like this GRPO, along with the multi-head latent attention, which is a great pull-- thank
00:35:54.080 | you so much for that. I really appreciate that. I'm going to look into that soon from
00:35:57.680 | the meta paper. But it helps with this kind of batch learning. When I hear people talk
00:36:02.000 | on Bloomberg about this model, I hear them say, oh, they trained in fewer batches, or
00:36:06.280 | the batches were more optimal. And when I hear that, I'm thinking in my head, this is
00:36:09.360 | GRPO. But I also got to look into multi-head latent attention before I do that, and probably
00:36:13.200 | read this whole 53-page math paper.
00:36:19.000 | I have a little bit different use case, I will tell. And people should help me out over
00:36:23.800 | here. So I look at quite a lot of medical cases, very, very deep things, which I-- right
00:36:29.400 | now, I use Perplexity, OpenAI, Claude, and all. And some of these things just blow up
00:36:34.800 | in your face, right? And the answers they give, I go back to literature and verify.
00:36:39.520 | And it's a very deep process where I need to-- and then I'm reasoning out with neurosurgeons
00:36:44.840 | and guys, and like, why, when I'm questioning them. And some of the data they know, some
00:36:48.800 | of they don't know.
00:36:50.560 | What ends up happening is OpenAI sometimes puts out just garbage. I mean, I look at all
00:36:54.880 | sorts of reasoning, what it is showing. But the things I'm learning, and I know what the
00:36:59.560 | objective is. And my reward process is sometimes, like, so deep inside, is that I know that,
00:37:06.880 | hey, this biochemistry thing with this, this, this, whatever, is probably causing this neurological
00:37:12.840 | symptom.
00:37:13.840 | Now, what you did over here with the RL part of it, when you said there's a number of guys
00:37:18.680 | where you try to pick them up, the problem in RL is that sometimes there might be one
00:37:23.520 | loner that the action that they take, the reward might be way down the line. You cannot
00:37:29.680 | take majority of the guys right up front and say, this is the right way it is supposed
00:37:33.320 | to be done. And my take is, I haven't looked at this, but this is probably works good for
00:37:39.000 | smaller domains. But when you try to chain domains and domains together, it is probably
00:37:44.240 | going to have a lot of issues, because now we have a combinatorial problem over there.
00:37:48.000 | It's like a go game, you know, like whatever it is. So if you have thoughts, let me know.
00:37:52.560 | So they have a way to solve this KL divergence where, you know, they account for if, if responses
00:38:01.160 | are significantly different than the rest of the group, we don't take big steps. But
00:38:06.400 | I think at a high level, that's, that's enough on, on the RL. We could go on for that for
00:38:12.200 | the rest of the hour. But let's take that to offline discussion. Let's, let's get through
00:38:16.640 | what actual R1 is. So that was just R1-0. I'm going to spend the next quick five minutes
00:38:21.920 | and then we'll do 10 minutes of discussion, you know. So, okay, what is DeepSeek R1? So
00:38:27.040 | there's, there's four stages to making this thing a good reasoning and chat model. So
00:38:31.600 | one of them is we have this cold start. So cold start is, you know, let's, let's start
00:38:38.280 | with some as strong as SFT. Let's not have this thing go crazy at first. They mentioned
00:38:42.520 | they use some human annotators here. They just drop one line, you know, kind of interesting
00:38:46.120 | if you want, anyone wants to look into that. But so first we'll cold start the training,
00:38:50.640 | then we'll do RL, then we'll do a rejection sampling for generation, then we'll do RL
00:38:55.280 | again. Okay, quick level. What are these four stages? So stage one, cold start. You have
00:38:59.440 | that DeepSeek V3, cold start the training with strong SFT. SFT on what you want, you
00:39:05.640 | know, this prevents the model from going unstable. Use a long chain of thought, few shot example
00:39:10.800 | prompt to, you know, generate some good detailed examples of what we want. So generate some
00:39:16.620 | good reasoning examples, some reflection, verification, generate from R1-0. Post-process
00:39:23.560 | these have human annotators, look at them. This is on the order of thousands of samples,
00:39:27.800 | you know, so nothing crazy, but this just starts the model off, you know, so let's get
00:39:33.160 | the base model to do some basic SFT. So normally we take base models, we do instruction fine
00:39:40.200 | tuning, we turn them into chat models, right? We do, we do SFT. So we're going to take the
00:39:44.120 | base model, we're going to generate some examples of chain of thought, few shot prompted examples,
00:39:50.200 | you know, so this looks like you, you take R1-0, you tell it, or you take whatever model
00:39:55.840 | you tell it to generate some examples where you give the formatting you want, you give
00:40:00.760 | good, give your thinking steps, give a lot of thinking steps, start it off strong, and
00:40:05.760 | then they generate a couple thousand examples, post-process them, have human annotators,
00:40:09.880 | I don't know, they just put like a line on this, then they do some regular SFT on base
00:40:15.220 | DeepSeq v3, that's stage one. Stage two is they basically do the same exact RL, they
00:40:22.200 | add this language consistency reward, like we mentioned, you know, they're struggling
00:40:26.240 | with language mixing, so another part of this RL is now we want language to be consistent.
00:40:31.880 | Okay, so we did SFT on really good data, then we do a bunch and bunch of RL, they don't
00:40:38.160 | explain the data set, they don't explain where it came from, how it came, how many samples,
00:40:43.000 | but they do RL. Stage three, rejection sampling. Rejection sampling is pretty common. Llama
00:40:48.640 | 3 did it, many others have done it, it's kind of new, they do this, this is the first
00:40:52.640 | time they talk about how much data, this is on the order, this is kind of like end of
00:40:56.560 | post-training lifecycle, you know, so they had big base model, which was v3, did SFT,
00:41:02.560 | did big stage of RL, now they do rejection sampling. This helps it turn from like, you
00:41:07.680 | know, we had R1-0 style issues to let's start to fix this, let's generate completions, rank
00:41:15.400 | them with reward models, this is like LLM as a judge, so generate output, have an LLM
00:41:20.920 | judge it, have a reward model, judge these outputs and reject some samples, fine-tune
00:41:25.840 | this model with rejection sampling. High level, that's what's happening. Stage four, let's
00:41:30.200 | do more RL, you know, throw RL at the problem. So make the model helpful and harmless while
00:41:36.440 | making reasoning good, that's kind of the objective here. They do R1 style questions,
00:41:42.120 | but they also mix in general chat human preference, so nuanced scenarios, you know, we want it
00:41:48.400 | to still give a good output, but now we want it to, you know, give a summary at the end,
00:41:53.480 | don't just give an answer, give a summary. So this is kind of that last step. So this
00:41:57.960 | is what makes DeepSeek R1, instead of just reasoning model, you got to do some rejection
00:42:03.320 | sampling, you got to kickstart the thing so it doesn't go crazy with SFT, and you got
00:42:08.080 | to do this last stage of some RL for general use. Now, yeah, the model is pretty good.
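As a concrete sketch of the stage-three rejection sampling described above: sample several completions per prompt, keep only the ones a judge or reward model scores highly, then SFT on the survivors. The helpers here (generate, judge_score) are hypothetical, since the paper gives no code:

```python
# Hedged sketch of rejection sampling for SFT data; generate() and judge_score()
# are hypothetical stand-ins for the sampler and the reward model / LLM judge.
def rejection_sample(model, prompts, n_samples=8, keep_threshold=0.8):
    kept = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(n_samples)]
        scored = [(judge_score(prompt, c), c) for c in candidates]
        best_score, best = max(scored)                 # keep the top-rated completion
        if best_score >= keep_threshold:               # drop prompts with no good sample
            kept.append({"prompt": prompt, "completion": best})
    return kept                                        # then run ordinary SFT on `kept`
```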
00:42:15.080 | It's a normal chat model, it gives thinking steps, it gives little summaries at the end,
00:42:21.000 | and it performs pretty well. It still struggles with some language swaps, but you know, on
00:42:26.160 | most benchmarks, it's better than O1-mini, and it's either better than O1
00:42:34.000 | or in between the two. So you know, O1 might beat it on some, but it's still better than O1-mini,
00:42:39.720 | and on others it's better than both. But it's
00:42:43.920 | very good, it's 37B active parameters. We don't really know the model sizes or active
00:42:49.680 | parameters of O1, but this thing's good, it's cheap, it's fast, DeepSeek has very good inference.
00:42:55.760 | MIT license, it's fully out there. That's kind of R1: four stages. The new ones
00:43:02.560 | are kind of, hey, you do this cold start with SFT from a base model, and then these last
00:43:07.240 | two stages. You know, you have rejection sampling, you want to kind of fine-tune it, this is
00:43:12.880 | pretty common, you do this for about a million samples, 800,000 is what they did, and then
00:43:17.200 | you do some fine-tuning at the end. And yeah, it does very, very well. Then the last part
00:43:23.760 | of this paper is kind of the distillation step. I'll talk very, very quickly on this.
00:43:28.240 | So distillation is where you take a big model, you train it, you generate outputs, and then
00:43:33.400 | you mimic those into a small model. So you can either just take input/output and just
00:43:38.600 | do continual post-training and now you've distilled a model, or you can do this distillation
00:43:42.920 | loss where you try to get a small model to match the output logit, so not just match
00:43:47.920 | the output, match the output distribution, you know, really match the thinking process
00:43:53.020 | per se of what the big model is doing. They do distillation on just a million samples,
00:43:58.800 | so they have 800,000 reasoning samples and they distill R1 into Llama and Qwen models.
00:44:04.240 | So they take the two families, they do distillation. This is not RL. They just do basic SFT distillation
00:44:10.680 | on about 800,000 samples, and now the model does very well. And not only does it do well,
00:44:17.400 | it outputs all of its thinking steps, you know, it becomes a bit of a reasoning model
00:44:21.680 | and the performance jumps a lot. So they not only compare it to the base models, which
00:44:26.720 | they're all better than, they also compare it to, you know, like GPT-4o, they compare
00:44:30.920 | it to Claude Sonnet, to O1-mini. And you can see, like, the Qwen 32B distill is doing better than all
00:44:36.600 | these models. Like, their distillation work is very good. They open-sourced all these
00:44:42.480 | models. They dropped them all. These will run locally, you know. So this is like Llama 8B,
00:44:48.240 | Qwen 7B, Qwen 32B. These are models that you can run locally on your laptop. They're very,
00:44:54.520 | very strong models. They'll run pretty quick. And then just as a bit of, like, they had
00:45:01.280 | a lot of ablations, but the one that I found interesting is the question comes up of, "Hey,
00:45:05.760 | what if we just do RL on the other base models?" And they're like, "Okay, let's test it. Let's
00:45:09.920 | take Qwen 32B. Let's do our same RL for 10k steps." Well, it does better, but it does
00:45:16.920 | nowhere near as good as our distillation. So basically, their takeaway is, "Hey, RL,
00:45:24.320 | like, takes a lot of compute. It's hard to do. And it doesn't get the same performance
00:45:28.940 | as this distillation. So maybe in the future, we still need these big models." But yeah,
00:45:34.360 | distillation worked very, very well compared to their RL on a 32B. I'm sure other people
00:45:39.680 | will go deeper into this. Other people will try it. And they'll, you know, update us on
00:45:43.600 | how it does. Okay, future work. R1 is still worse than V3 at some things. So as much as
00:45:50.960 | I shat on V3 not being important, it's still good. It's fast, faster than R1. R1 is worse
00:45:58.220 | at function calling, multi-turn, complex role play, and JSON output. Those are just a few
00:46:04.320 | things. R1 struggles with language mixing. I don't know why they note this, but V3 doesn't.
00:46:10.900 | So maybe their RL has Chinese data. Maybe I misread something there. Maybe they both
00:46:15.880 | do. But R1 struggles with language mixing. R1 is sensitive to prompting. Few-shot prompts
00:46:21.920 | degrade the performance. So if you guys are using R1, don't few-shot prompt it. Don't
00:46:28.040 | tell the model how to think. Tell it what you want it to do and let it reason. It will
00:46:32.560 | do better. And it's not better at a lot of engineering tasks than V3. They explained
00:46:38.100 | why and they explained, you know, what they think they can do to fix them. But this is
00:46:42.960 | just some sort of future work. But high, high level, that's kind of the paper. We have seven
00:46:49.360 | minutes left. One option is if there's like one or two quick questions. Otherwise we can
00:46:59.560 | talk about future stuff, questions, and what people have. So people are trying to recreate
00:47:04.480 | R1. DeepSeek didn't talk much about the data. Not many specifics about what went on there.
00:47:11.520 | Hugging Face has a thing to reproduce it. They have a whole chart of how they're going to
00:47:14.720 | do it. Bespoke Labs put out a data set. I think it's a great effort. It's not great
00:47:20.120 | in my opinion. I looked at like their prompt templating. They heavily try to prompt it
00:47:24.800 | into responding in a certain way. But anyway, they have over, I think they have a hundred
00:47:29.960 | thousand-ish samples of R1 data. They fine-tune a 7B. They show results. But yeah, what
00:47:35.920 | other hot takes have we seen? Eugene Yan, you have your hand up. You want to join in?
00:47:39.880 | Hey, Vibhu. I just wanted to ask a question. Sorry, not a hot take, but a question. In
00:47:45.040 | R1, in stage three, right, they actually retrained it based on the DeepSeek V3 base model. So
00:47:53.200 | I guess the question I have is why did they not carry on from stage two? I know Eugene
00:47:59.780 | Chia had a take on this on Discord, but I wonder if anyone here would have intuition
00:48:06.760 | on why they did this. I had a similar question. So I noticed in the chart as well, same thing
00:48:14.640 | going on. Exactly. And the paper actually calls it out specifically as a single sentence
00:48:19.200 | on its own. So it's pretty unique. Yeah. Basically here, right, they restart from V3 stage instead
00:48:31.040 | of continuing down their whole pipeline. Why they did this, I wish we knew. Eugene
00:48:37.960 | Chia, do you want to share your take? So my speculation is after they got the first
00:48:43.160 | round data set, they trained from the base model before annealing. And this is just to
00:48:49.140 | get a better final outcome. I don't fully agree with this idea, but a lot of fine tuners
00:48:55.520 | swear that you want to fine-tune on pre-annealed models instead of annealed models. Yeah. And
00:49:03.680 | we can probably go into a side track on that. But yeah, that's my guess. I mean, I don't
00:49:08.840 | really have a great mathematical or machine learning background on this, but I teach right
00:49:13.600 | now and I feel like I do a lot of reinforcement learning when I'm just like trying to get
00:49:17.560 | students to do very little reasoning, COT steps correct. Like I want you to write down
00:49:22.960 | this exponent in the same place I did because I put it in a different color. And if you don't
00:49:27.800 | do it, I'm going to take off points and this and this and so forth. But what I want them
00:49:31.480 | to really understand and technically explain in a nuanced way, that fine tuning of having
00:49:36.080 | that discussion, having that expert review is so much more helpful than just a cookbook
00:49:40.280 | chain of thought. So in my mind, fine tuning, but I mean, it's not really from an ML perspective,
00:49:45.640 | kind of just from, I talk to people a lot. Oh, thank you. I also forgot to mention what
00:49:51.960 | annealing is. So annealing is a process where you essentially flatten out and lower the
00:50:01.520 | learning rate. And that's a one-time thing you do at the end of training, typically.
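A tiny illustration of what annealing means in this context, a constant learning rate with a decay phase tacked onto the end; the shape is generic, not DeepSeek's actual schedule:

```python
# Illustration of an "anneal at the end" learning-rate schedule (generic, not DeepSeek's).
def lr_at(step, total_steps, peak_lr=3e-4, final_lr=3e-5, anneal_frac=0.1):
    anneal_start = int(total_steps * (1 - anneal_frac))
    if step < anneal_start:
        return peak_lr                                     # stable phase
    progress = (step - anneal_start) / (total_steps - anneal_start)
    return peak_lr + progress * (final_lr - peak_lr)       # linear decay at the very end

print(lr_at(50_000, 100_000), lr_at(95_000, 100_000), lr_at(100_000, 100_000))
```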
00:50:01.520 | Yeah. Sam. Hey, did we, did the paper cover how, when they create examples for SFT using
00:50:11.760 | rejection sampling on the RL checkpoint, what method they use to select the good examples?
00:50:16.200 | Sorry, I didn't hear that too much. When they created SFT samples to what? Yeah. What method
00:50:23.680 | did they use to select the good examples when they're doing rejection sampling SFT on the
00:50:29.360 | RL checkpoint? Oh, on the rejection sampling? Yeah. Yeah. They, they share a little bit
00:50:39.440 | about the rejection sampling, but when they generate the samples, they, they had this
00:50:47.120 | whole section about, I mean, honestly, the whole paper is very, very vague in all these
00:50:52.160 | specifics. They, they just mentioned like little, little notes, like in the cold start,
00:50:58.860 | you know, how do we generate it? Right. It says human annotators. It doesn't say what
00:51:02.720 | they did. Yeah. It doesn't say what they did. It says like, you know, slight, slight use
00:51:06.840 | of human annotators. We, we do post-processing. That's cool. You do, you do post-processing
00:51:13.160 | to post-process what? But at some level they tell you for rejection sampling, it's kind
00:51:18.460 | of what you would expect, right? So what are they trying to do? What's their goal? Now,
00:51:24.640 | they probably want to take away examples that don't have like summaries at the end, right?
00:51:29.760 | But they, they don't, they don't go into too much detail. Is language mixing a feature
00:51:35.560 | or a bug? Isn't it good if the model can find an efficient reasoning techniques? Feature
00:51:41.780 | or bug depends on how you see it, right? So language mixing in this case, I believe they
00:51:46.000 | meant is it responds. Actually, no, they, they actually do specifically explain in some
00:51:52.920 | cases, like the question is asked in Chinese and it responds in English. So seems more
00:51:59.720 | like a bug, right? You don't, you don't want that. Cool. Any, any other quick questions?
00:52:07.840 | Any other thoughts? Any other concerns? And I'm sure there'll be a lot of discussion on
00:52:14.600 | discord continuing about this, you know. I'm curious about people, how people are using
00:52:19.760 | these models now. Cause one of the things like my uses for thinking models is when I'm
00:52:24.600 | trying to brainstorm some creative topic. And yesterday I put side by side Gemini 2.0
00:52:33.240 | thinking experimental, whatever 01-21, and DeepSeek R1, and then the 70 billion Llama distill
00:52:41.560 | from that, that's hosted on Groq. And the answer I got from Gemini was much better than the
00:52:46.020 | one I got from the other two. And I was, I don't know, I haven't tried a whole bunch
00:52:49.360 | of examples, but I'm curious about whether, what people are using day to day when they
00:52:52.660 | want to code and when they want to think.
00:53:02.220 | I gave up. I just stick with Claude. I'm mainly on Claude now. I'm going to wait until there's
00:53:07.600 | a little more settling at the moment. I'm kind of wasting my time, which is fine, but
00:53:12.080 | I'd rather waste my time by not being as efficient than looking for which of the new models suits
00:53:17.440 | my uses best here. But that's a personal opinion.
00:53:20.160 | The thing Rahim is talking about is actually very critical, in the sense that different models and
00:53:27.280 | all the big guys, they fail, what do you call it, brilliantly, or they just blow up at certain
00:53:34.420 | reasoning, and then at other things they do very nicely. Until you have looked across the whole
00:53:38.960 | gamut of things, you cannot stick to one thing. I mean, if you have some money,
00:53:43.920 | you need to, like, push across all of them and look at the responses. Gemini gives one. Like,
00:53:50.360 | that's why I want to see if Perplexity puts out a blog post on this, where
00:53:54.520 | they probably have the largest data on what people are trying to do and how they try to
00:53:58.160 | use it. So that would be my, what do you call it, next step: to go and see what they think
00:54:03.200 | of DeepSeek, if they've done their own internal thing. But that is, I think, where getting
00:54:09.560 | beyond, like, the evals... the evals are just like saying, okay, this is our baseline and
00:54:14.080 | from here, let's go off to the races, kind of a thing.
00:54:18.600 | I mean, specifically for your thing, if we're done, I'll leave it for Discord.
00:54:24.640 | So at a quick level, the other thing to note with how people are using these is they're
00:54:28.800 | very cheap, right? So if you look at the cost of O1 versus R1, it's like more than 10X cheaper.
00:54:38.040 | So stuff that was expensive to do is now not as expensive, right? So they're just good,
00:54:45.560 | cheap reasoning models that are fast. And the other part is they're open source. So if you want
00:54:50.800 | to self-deploy it, you can self-deploy it. If you want to use one of the reasoning models
00:54:55.360 | locally, like the Qwen 32B one, the 7B one, they're pretty good locally on your laptop. So that's
00:55:04.160 | how some people use them.
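Putting rough numbers on the "more than 10x cheaper" point, using the launch list prices that were being quoted at the time (approximate, and subject to change):

```python
# Approximate launch list prices, USD per 1M tokens; treat as ballpark only.
O1 = {"input": 15.00, "output": 60.00}
R1 = {"input": 0.55, "output": 2.19}          # DeepSeek API, cache-miss input price

for kind in ("input", "output"):
    ratio = O1[kind] / R1[kind]
    print(f"{kind}: o1 ${O1[kind]:.2f} vs R1 ${R1[kind]:.2f}  (~{ratio:.0f}x cheaper)")
```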
00:55:05.160 | I don't know, it's also that availability, right? Like I've been playing with that and I've
00:55:08.840 | kind of had a little bit of mixed use and I've had a friend who's been trying to get
00:55:12.920 | him to do some gnarly React refactorings and he's been frustrated a little bit. I don't
00:55:17.920 | know if that's just not knowing how to prompt it properly for those kinds of very
00:55:22.560 | specific and more involved tasks, but also Cursor might end up having enough data set
00:55:28.560 | on how people are using it specifically for coding. And maybe I'm hoping they might publish
00:55:32.960 | something on that.
00:55:34.560 | But yeah, it's a good model. Anyway, guys, thanks. Next week we will have Eric Ness facilitating
00:55:43.000 | the Titans paper. Discussion will continue in Discord. I'll throw slides in there. But yeah, see
00:55:48.240 | you guys next week.
00:55:49.240 | Thanks, everyone. Thanks, Vibhu.
00:55:50.240 | Thanks.
00:55:51.240 | Thank you, Vibhu.
00:55:52.240 | Thanks, everyone.
00:55:53.240 | Thank you, Kevin.
00:55:56.240 | Thank you, Kevin.
00:55:58.240 | Bye-bye.
00:55:59.240 | Bye-bye.