OpenThoughts: Data Recipes for Reasoning Models — Ryan Marten, Bespoke Labs

Chapters
0:00 Introduction to the problem of open-source reasoning in AI models.
1:09 The effectiveness of Supervised Fine-Tuning (SFT) for reasoning.
3:38 Introduction to OpenThoughts 3 and its performance.
7:52 Key learnings from the data recipe development.
11:34 Guidance on adapting the dataset recipe to specific domains.
15:15 Call for open collaboration and where to find the project's resources.
00:00:00.000 |
I'm Ryan. I'm a founding engineer at Bespoke Labs, and today I'm going to talk to you about 00:00:19.740 |
OpenThoughts, which is our project to create the best open source reasoning data sets. I'll be 00:00:26.880 |
switching tack a little bit from our earlier discussions on reasoning and RL and focus on 00:00:33.100 |
the reasoning part, and you'll see why. So just so we're on the same page, we've talked 00:00:38.700 |
a lot about reasoning, but what's actually going on here? So I like this graph from Jason, 00:00:43.880 |
which shows this incredible performance that's happened in the last several months where models 00:00:49.900 |
are getting much, much, much better on certain benchmarks. And if you look at that, this is 00:00:54.980 |
reasoning. This is test-time scaling. I think everyone here is quite familiar with this, and 00:00:58.600 |
it seems that certain tasks like AIME, which are competitive math problems, really respond 00:01:05.080 |
when models are able to think step by step and produce these long chains of thought. 00:01:12.100 |
So let's go back to DeepSeek R1. Now, DeepSeek R1 was really impressive for a lot of people 00:01:17.800 |
for a lot of reasons, and RL was a big part of that. But I was also particularly interested 00:01:23.480 |
because DeepSeek R1, at the end of the day, is an SFT model. So the final weights that they 00:01:29.980 |
released are actually from DeepSeek-V3 Base, which is fine-tuned on 800k SFT examples, 600k of which are 00:01:39.340 |
reasoning. Of course, you can see here that RL was a big part of it, and RL was used heavily to create 00:01:46.220 |
the model which generated this data. But at the end, it was SFT and a little bit of RL for alignment. 00:01:52.860 |
So this was really interesting and surprising. And the other thing that was really interesting 00:01:56.940 |
and surprising to us was these small reasoning models that DeepSeek released, which were incredibly 00:02:02.700 |
strong. And this, for us, was a huge motivation to try to do this ourselves. And why is that interesting? 00:02:12.780 |
Because if we go back to here, no additional detail was really given on these data sets here. So if you 00:02:20.220 |
want to create strong reasoning models, we now sort of have a training recipe, but we don't have the 00:02:25.420 |
data recipe. That's the missing link. Okay. I want to also include a slide here on why it's interesting 00:02:32.940 |
to train your own reasoning models. So I'm partially taking this from Amir's talk yesterday on open source 00:02:39.660 |
and enterprise, which I really liked. But there are these main points: performance, privacy, speed and cost, 00:02:45.260 |
and then ownership and destiny. I think using reasoning is a great tool to solve a problem. 00:02:52.700 |
And you shouldn't limit yourself in your toolbox if you're trying to solve a specific domain task. 00:02:59.420 |
So as we talked about before, RL is a great tool in this toolbox to tackle reasoning tasks. But we're 00:03:05.900 |
going to see here that SFT is, as Nathan put it this morning, extremely easy and extremely effective. 00:03:11.180 |
Okay, great. Now, the missing link. How do we actually solve for this reasoning data recipe? 00:03:18.380 |
There's all these questions that we had when we started. How much data do you really need? 00:03:23.260 |
What data curation steps are necessary? What are the optimal choices for each step in that data creation pipeline? 00:03:31.820 |
And then, how do you even go about figuring all this out? And this is the meat of the OpenThoughts project. 00:03:38.060 |
So today, we're excited to announce OpenThoughts 3, which is hot off the presses, just came out two hours 00:03:44.220 |
ago, which is our latest and greatest version of our reasoning data sets. And... 00:03:52.060 |
Thank you. And now, this is the state-of-the-art reasoning data set recipe. 00:03:59.740 |
So you can see here, these graphs are showing accuracy on three of these reasoning benchmarks. 00:04:06.540 |
AIME, which is competitive math. LiveCodeBench, which is competitive code. And GPQA Diamond, which is our 00:04:12.140 |
science questions. On the y-axis, you see accuracy is going up. On the x-axis, you see the data 00:04:19.100 |
scale is going up. So we heard before that scaling is difficult, particularly difficult with RL. The 00:04:24.860 |
good news is that for SFT, scaling is quite a bit easier. You can see here, we compare to other open reasoning 00:04:31.500 |
datasets. So Nemotron Nano: NVIDIA released this great model, Nemotron Nano. It's an 8B model, 00:04:36.940 |
and they also released the dataset used to train it. So we compared directly by training the same base 00:04:41.580 |
model on our dataset, which is our dataset recipe, and on the Nemotron Nano data, which is the NVIDIA 00:04:48.060 |
recipe. And you can see here, there's a significant gap. So we've shifted this scaling curve upwards. 00:04:53.420 |
Great. So yeah, this is the state-of-the-art 7B open-data reasoning model. You can see 00:05:01.420 |
we have measured across the domains of interest of science, code, and math, and then a couple of held-out benchmarks. 00:05:09.420 |
So our original goal was to reproduce, to find the missing link for, the DeepSeek 00:05:14.700 |
Distill models. And you can see here, we've crushed that goal. So we're significantly outperforming 00:05:20.860 |
the DeepSeek-R1-Distill-Qwen-7B model, which we started off trying to reproduce. And then compared to the Nemotron 00:05:28.460 |
Nano model, which is trained on a different base model, we are also outperforming on some benchmarks, 00:05:34.700 |
and similarly competitive on some others. So okay, let's actually talk about how we achieve this. This 00:05:39.500 |
is the interesting part for you. So we go back to the scaling graph. You can see, once again, on the x 00:05:47.180 |
axis, we're scaling dataset size. So this is a huge lever for increasing accuracy. And the thing here is 00:05:57.580 |
it gets more and more expensive, exponentially more expensive as you keep going. 00:06:01.100 |
And then vertically, you can see that we've shifted the scaling curve up. So this is what I was talking 00:06:08.780 |
about before. This is improving the dataset recipe. So given a fixed dataset recipe, you can always 00:06:13.900 |
scale it larger and you can always have higher performance. But if you want to push your 00:06:18.700 |
performance to the absolute maximum, the real question is, how do I create the best dataset? And 00:06:24.140 |
therefore, what is the best recipe for the dataset? Okay, so enough teasing here. Let's go into the meat of 00:06:31.820 |
it. So this is how we approached this problem. We broke down the dataset pipeline into sourcing questions, 00:06:39.900 |
mixing different sources of questions, filtering those questions to keep only the highest-quality 00:06:45.020 |
questions, generating answers with a teacher model, so that's distillation, and then filtering out bad 00:06:51.180 |
answers. And lastly, at the end of this entire experimentation, we looked at which are the best 00:06:57.260 |
teacher models: which teacher model should we select? So through this entire pipeline, we've come down to 00:07:02.380 |
this final dataset recipe. 00:07:08.780 |
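To make those stages concrete, here is a minimal sketch of how such a pipeline might be wired together. The callables (difficulty rater, teacher-model call, answer filter) are hypothetical placeholders, not the actual OpenThoughts code.

```python
from typing import Callable, Iterable

def build_reasoning_dataset(
    questions: Iterable[str],
    rate_difficulty: Callable[[str], float],   # e.g. an LLM-judge difficulty score
    generate_trace: Callable[[str], str],      # teacher-model call (distillation)
    is_good_answer: Callable[[str], bool],     # answer filter
    n_questions: int = 30_000,
    answers_per_question: int = 1,
) -> list[dict]:
    """Sketch of an OpenThoughts-style pipeline: filter questions, then distill."""
    # Keep only the hardest questions, up to the question budget.
    ranked = sorted(questions, key=rate_difficulty, reverse=True)[:n_questions]

    # Generate one or more reasoning traces per question with the teacher model,
    # and filter out bad answers.
    examples = []
    for q in ranked:
        for _ in range(answers_per_question):
            trace = generate_trace(q)
            if is_good_answer(trace):
                examples.append({"question": q, "answer": trace})
    return examples
```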
Now, this was a ton of work. This is a screenshot of our Hugging Face page. You can see we've created over 5,000 datasets and almost 3,000 models. For this project, it was only around 1,000 00:07:16.380 |
experiments. But just to give you an idea of how rigorously we looked at the different decisions in each of 00:07:22.140 |
these steps of the pipeline. And also, I think this is interesting because it peels back the curtain a little 00:07:26.780 |
bit on maybe what the frontier labs are doing: finding signal at the smallest scale possible, 00:07:33.500 |
and trying out as many things as possible, and empirically choosing the best, and then scaling. 00:07:38.780 |
And sometimes when you scale, you see that what was best at the small scale doesn't 00:07:43.580 |
actually work. But if you're lucky, and you've done good science, then your YOLO run will be the best 00:07:50.540 |
possible, right? Okay. So these are the key learnings that we had from our dataset recipe. And this is 00:07:59.900 |
what you can take away. So the first thing, which is pretty surprising, is that sampling multiple answers, so 00:08:07.180 |
multiple reasoning traces per question in your dataset, works really, really well. Performance does not go 00:08:15.260 |
down at a fixed scale. Take a fixed budget of examples, say 30k. 00:08:23.260 |
If you use 30k unique questions and only sample once per question, that performs 00:08:30.700 |
pretty similarly to taking 1/16 of the questions, so 30k over 16, and sampling 16 times from each, which is 00:08:40.860 |
quite cool. This is really cool because it allows you to scale your dataset by 16x, which is 00:08:45.500 |
more than an order of magnitude. And if you remember the graph from before, that corresponds to a pretty 00:08:50.300 |
large increase in accuracy. 00:08:57.900 |
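As a rough sketch of that tradeoff; the distill helper and the teacher call are illustrative placeholders, and the 30k/16 split is just the example budget from above:

```python
# Two ways to spend the same 30k-example budget; per the finding above, they
# perform similarly, but the second needs 16x fewer unique questions.
# generate_trace() stands in for a call to the teacher model.

def distill(questions: list[str], answers_per_question: int, generate_trace) -> list[dict]:
    return [
        {"question": q, "answer": generate_trace(q)}
        for q in questions
        for _ in range(answers_per_question)
    ]

# Option A: 30,000 unique questions, one reasoning trace each.
# dataset_a = distill(questions[:30_000], answers_per_question=1, generate_trace=teacher)

# Option B: 1,875 unique questions (30k / 16), sixteen traces each.
# dataset_b = distill(questions[:1_875], answers_per_question=16, generate_trace=teacher)
```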
The other surprising thing that we found was that a better model, in terms of its own performance on evaluation benchmarks, is not necessarily a better teacher model. 00:09:03.420 |
I think a good way to think about this is a brilliant researcher who's maybe a terrible lecturer, right? 00:09:08.860 |
We found specifically that QwQ-32B was a stronger teacher model than DeepSeek R1. So we switched to 00:09:17.180 |
that in our recipe, even though previously everyone had been using R1. 00:09:21.340 |
We also found that the sources of data that had synthetic questions were actually quite good. Some of the top 00:09:30.620 |
sources that we selected were entirely synthetic, and better than sources that, say, were scraped from 00:09:35.740 |
forums or had humans manually write things. And this is also really good news because synthetic 00:09:41.900 |
question generation is scalable. So once again, we go back to the x-axis and we can push even further. 00:09:49.340 |
So question filtering also works well. Here we filtered questions by asking a language model, 00:10:00.700 |
how difficult is this question, and then taking only the hardest questions. 00:10:04.060 |
We also had a language model try to answer that question and looked at the length of that answer. 00:10:11.100 |
So these are sort of proxies for the same thing. You can imagine that if a problem is a lot harder, 00:10:15.900 |
then a language model will think more and it will produce more text. So its answer will be longer. 00:10:21.100 |
And these things worked better than embeddings-based approaches or FastText classifiers, 00:10:27.260 |
which is interesting inasmuch as those approaches were typical for pre-training. So it 00:10:33.340 |
seems that data filtering for post-training is quite different than for pre-training. 00:10:39.180 |
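Here is a minimal sketch of those two filtering proxies, assuming a generic chat-completion helper (llm_complete) rather than any specific API:

```python
# Sketch of the two question-filtering proxies described above.
# llm_complete(prompt) is an assumed helper that returns the model's text response.

def difficulty_score(question: str, llm_complete) -> int:
    """Ask a language model to rate difficulty; keep only the hardest questions."""
    prompt = (
        "Rate the difficulty of this question on a scale of 1 (easy) to 10 (hard). "
        "Reply with a single integer.\n\n" + question
    )
    reply = llm_complete(prompt)
    digits = "".join(ch for ch in reply if ch.isdigit())
    return int(digits) if digits else 0

def response_length_score(question: str, llm_complete) -> int:
    """Have a model attempt the question; longer answers are a proxy for harder questions."""
    return len(llm_complete(question))

def filter_hardest(questions, llm_complete, keep: int, score_fn) -> list[str]:
    """Keep the `keep` highest-scoring (hardest) questions under the chosen proxy."""
    ranked = sorted(questions, key=lambda q: score_fn(q, llm_complete), reverse=True)
    return ranked[:keep]
```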
Okay, some things that didn't work that were also quite interesting. Through our experiments, 00:10:43.100 |
we saw that choosing a smaller number of high-quality sources was much better than trying to optimize 00:10:47.580 |
for diversity by going for a larger number of sources. That's very counterintuitive, right? You'd think, 00:10:52.540 |
okay, I'm always going to go for higher diversity, but this is actually not what we saw. 00:10:56.140 |
The last thing that was interesting is that people talk a lot about verification, which is obviously very 00:11:02.060 |
important for RL. And we actually see for SFT and distillation, it didn't seem that filtering based 00:11:08.540 |
off of the answer or verifying the answer really helped at all. This is quite surprising. And I think 00:11:14.700 |
there's some good research in the literature about maybe why this is: for the hardest 00:11:21.660 |
problems, it might still be helpful to keep an example in, even if its answer is incorrect, 00:11:27.020 |
so you see how the teacher model attempts it. It's not just the final output that matters. 00:11:32.460 |
Okay, great. So those are all the amazing learnings that we had for OpenThoughts 3, which we're 00:11:39.100 |
super excited to share. But now you're probably thinking, okay, they've done a thousand experiments. 00:11:44.060 |
I don't want to do a thousand experiments. I still want to create reasoning models. How do I adapt this 00:11:49.180 |
if I want to create specialized reasoning models? So I guess the first thing I would say is, be aware 00:11:55.980 |
that based off of your domain, these exact choices might be a little bit different. I would suggest, 00:12:00.860 |
okay, start with our recipe and then iterate on it. If you have capacity and compute, try a couple 00:12:06.460 |
different choices for each step in the pipeline. And I think a good example of this is we studied each step 00:12:11.820 |
in the pipeline differently by domain. So we studied it distinctly for code, science, and math. And we saw, 00:12:18.380 |
for example, in the question filtering, which I talked about before, 00:12:21.180 |
using difficulty labels worked well for code questions. But for math and science, it was response 00:12:28.940 |
length. And if you think about that for a second, it makes sense, because response lengths behave very differently for coding 00:12:35.020 |
questions, right? For AIME math, the answer is literally just a number between zero and 999. 00:12:41.500 |
So the answer doesn't account for a large portion of the length. But you can imagine there are very simple 00:12:47.500 |
coding questions in which the answer is still a lot of lines of code. So yeah, this is one thing to be 00:12:52.940 |
aware of. The other thing, which I talked about previously, is synthetic question generation. Because 00:12:58.060 |
it works so well, if in your specialized domain you don't have a lot of data for your 00:13:04.220 |
particular problem, then go ahead: transform that existing data into questions, expand it, throw those 00:13:10.940 |
in as in-context examples, and just generate more data. So yeah, we built an open source library for 00:13:16.380 |
this. It's called Curator, and you can try that out. 00:13:21.660 |
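As a rough illustration of that idea (this is not Curator's actual API; the prompt wording, llm_complete helper, and seed examples are hypothetical):

```python
# Sketch: expand a small pool of domain questions by asking an LLM to write new ones,
# using existing questions as in-context examples. llm_complete(prompt) is an assumed helper.
import random

def generate_synthetic_questions(seed_questions: list[str], llm_complete,
                                 n_new: int = 100, n_examples: int = 3) -> list[str]:
    new_questions = []
    for _ in range(n_new):
        # Sample a few existing questions as in-context examples for the generator.
        examples = random.sample(seed_questions, k=min(n_examples, len(seed_questions)))
        prompt = (
            "Here are some example questions from my domain:\n\n"
            + "\n\n".join(examples)
            + "\n\nWrite one new, distinct question in the same style and domain."
        )
        new_questions.append(llm_complete(prompt).strip())
    return new_questions
```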
And then lastly, I feel like everyone says this, but it can't be said enough: evaluation is paramount. If you don't know how well your models are 00:13:27.980 |
doing or improving, then you cannot make good principled decisions about your data set recipe. 00:13:33.580 |
We spent a lot of time on this. We also have this open source library on GitHub called Evalchemy, 00:13:38.700 |
which takes care of this and also takes care of the sharding and parallelism. And the key thing 00:13:46.060 |
here is that for very small evaluation sets, if you only have a handful of questions, you should run 00:13:51.340 |
your model on those evaluation sets many times and average. So going back again to AIME competitive math 00:13:57.740 |
questions, there's only 30 per year. So for our evaluations, we gave the model those 30 questions 00:14:05.900 |
10 times, and then we averaged to get the final signal to determine which data strategies were working 00:14:13.020 |
better than others, because otherwise, there's too much noise. 00:14:18.140 |
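A minimal sketch of that repeated-evaluation averaging; the solve and is_correct callables are placeholder stand-ins for your model and grader:

```python
# Sketch: for a tiny benchmark like AIME (30 questions per year), run it several
# times and average the accuracy; a single pass is too noisy to compare recipes.
from statistics import mean

def repeated_accuracy(questions, answers, solve, is_correct, n_repeats: int = 10) -> float:
    run_accuracies = []
    for _ in range(n_repeats):
        correct = sum(is_correct(solve(q), a) for q, a in zip(questions, answers))
        run_accuracies.append(correct / len(questions))
    return mean(run_accuracies)
```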
Okay, this is also very, very interesting and surprising, and promising for you if you're specializing. It seems that you can actually surpass 00:14:24.620 |
the teacher in some domains with distillation. This is super cool. Usually you think 00:14:28.860 |
that only RL can push the frontier, and distillation is just about catching up to the teacher. But no, 00:14:33.980 |
that's not the case. So we have an example. It's in our paper, where we looked at the legal reasoning 00:14:39.740 |
domain: the problem of classifying Supreme Court decisions. What we did is we took 2k unique questions, 00:14:47.980 |
we sampled five answers per question, and then we did do verification here, which did matter. 00:14:55.900 |
So we threw away any answers that were incorrect. And when you fine-tune the 7B model, 00:15:02.780 |
it surpasses R1, which is a very strong reasoning model and also a huge reasoning model. So this 00:15:08.620 |
is very exciting. There's a lot more research and also application to be done here. 00:15:13.340 |
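Here is a sketch of that specialized recipe (sample several traces per question, verify against known labels, keep only correct ones); the generate_trace call and the label-matching check are illustrative assumptions, not the exact setup from the paper:

```python
# Sketch: distillation with verification for a classification-style domain
# (e.g. labeling Supreme Court decisions). generate_trace(question) is an assumed
# teacher-model call that returns a reasoning trace ending in a predicted label.

def build_verified_sft_data(labeled_questions: list[tuple[str, str]],
                            generate_trace,
                            samples_per_question: int = 5) -> list[dict]:
    examples = []
    for question, gold_label in labeled_questions:
        for _ in range(samples_per_question):
            trace = generate_trace(question)
            # Keep the trace only if its final answer matches the known label.
            if trace.strip().endswith(gold_label):
                examples.append({"question": question, "answer": trace})
    return examples
```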
Okay, cool. So everything's open. It's OpenThoughts, and OpenThoughts means open. Go out and build. We 00:15:22.940 |
have our detailed paper, which is just out this morning. We've got the weights and the dataset. We have a ton of 00:15:29.500 |
repos with code for data generation, for evaluation, and for synthetic data. So check those out. This is the 00:15:39.100 |
team. It was a huge group of people, a lot of work over many months. I think we're all very proud of what 00:15:44.940 |
we did. But there's lots of people to recognize here. If you scan that QR code, it goes to the tweet, 00:15:50.540 |
and everything about the OpenThoughts project is linked from there. Yeah. Thank you. 00:15:55.420 |
All right. Thank you so much, Ryan. That was fascinating. It looks like we're already getting, 00:16:07.020 |
we have at least one question lined up. Again, we have time for maybe a couple of questions. 00:16:11.020 |
So if you have questions, please line up and we'll do it. Actually, before we get to those questions, 00:16:16.940 |
I will say as people are leaving, we are going to be back here at two o'clock. We've got an 00:16:23.500 |
excellent afternoon planned on this track. We've got Nathan Lambert. We've got Christian 00:16:28.460 |
Szegedy, who's a co-founder of xAI. And it's going to be a really great track at two o'clock back in this 00:16:33.100 |
room. Also, one more thing, if you do have questions for any of the speakers from this morning, hopefully 00:16:38.620 |
they're going to be able to stick around. Don't let them go to lunch. They're going to be there. They're 00:16:41.420 |
sitting up here at the front. So swarm them as soon as we're done. But for now, let's get a couple 00:16:44.620 |
of questions. Go ahead. Yes, over there. Thank you. Great talk. So, two questions. 00:16:50.380 |
One is, if you're just using SFT on this data, what's the difference between this and regular SFT? 00:16:56.220 |
This is just regular SFT. Oh, okay. So then how is regular SFT able to make the models 00:17:03.260 |
think longer? Because I thought for the reasoning models, they have a thinking block and they think, 00:17:08.140 |
you know, for minutes or hours. Exactly. So how does SFT make it think for 00:17:13.100 |
hours? So you're doing supervised fine-tuning on the questions, and the answers also contain 00:17:19.100 |
the thinking. So the model learns to use its context window and produce these long thinking traces. So it 00:17:25.180 |
can do this. People call SFT imitation, but it can learn this format in the same way. 00:17:31.980 |
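To illustrate what such a training example might look like (the tags and wording here are a hypothetical format, not the exact OpenThoughts schema):

```python
# Hypothetical shape of one SFT example: the target completion contains the full
# reasoning trace inside a thinking block, followed by the final answer, so the
# model learns to imitate long chains of thought.
sft_example = {
    "prompt": "Find the number of ordered pairs (a, b) of integers such that ...",
    "completion": (
        "<think>\n"
        "Let me work through this step by step. First, consider ...\n"
        "... (thousands of tokens of reasoning) ...\n"
        "</think>\n"
        "The answer is 42."
    ),
}
```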
Yeah. Thanks. All right. We'll take one from this side. Great presentation, Ryan. One question: 00:17:39.340 |
why do you think a smaller model like QwQ-32B was a better teacher than DeepSeek R1? What was your 00:17:47.340 |
insight in figuring out that, like, a good professor makes a bad lecturer? Yeah, that's a great question. 00:17:53.420 |
I think this is something we need to investigate more, but you can see that when you look at 00:17:58.940 |
charts of the length of reasoning traces, the distributions are different. So it might 00:18:04.940 |
be the case that you're using more of your context window, using more tokens, more steps. It also might 00:18:09.180 |
be the case that you just have a better-formatted response, better output. This is another 00:18:16.060 |
great open research question. Interesting. I'll also say on this point, 00:18:19.500 |
we also tried Claude as a teacher, which is a good, strong model. And it was just a 00:18:24.060 |
terrible teacher. So yeah, it's interesting what actually 00:18:29.180 |
creates a good teacher. Yeah. All right. We'll take one more very brief question from this side. 00:18:33.900 |
And then those of you still waiting on questions: after we have closed this up, swarm the speakers. 00:18:39.740 |
So, great talk, Ryan. We're doing a similar kind of thing, but I just had a question. Do you guys 00:18:44.780 |
have any pattern map as to, in the reasoning chain of thought, when things don't work, at what 00:18:51.260 |
level, you know, in the eval, do you find out that things are not working or it's not reasoning correctly? 00:18:57.580 |
Is there a pattern map or something that you have in your open source? 00:19:00.220 |
Sorry, I didn't catch that. So if there are five steps of reasoning to reach a final conclusion? 00:19:08.700 |
Yeah, this is a great question. We don't do this fine-grained analysis, 00:19:12.620 |
but there is a ton in the literature about this, where, yeah, there's a sort of critical step where 00:19:18.140 |
it gets things wrong. We did the simplest thing possible, right? You could also go in and 00:19:24.220 |
try to do more complicated things: at evaluation time, where you're doing interventions 00:19:30.540 |
to maybe detect steps that have gone awry and change them, or you can do this 00:19:37.340 |
when you're creating the dataset. So you could potentially rewrite things, 00:19:40.300 |
but everything that we tried in terms of messing with the reasoning trace wasn't helpful. 00:19:45.500 |
So yeah, I think there's still more to explore there. 00:19:49.980 |
This is really just the start of everything in reasoning.