I'm Ryan. I'm a founding engineer at Bespoke Labs, and today I'm going to talk to you about OpenThoughts, which is our project to create the best open source reasoning data sets. I'll be switching tack a little bit from our earlier discussions on reasoning and RL and focus on the reasoning part, and you'll see why.
So just so we're on the same page, we've talked a lot about reasoning, but what's actually going on here? I like this graph from Jason, which shows the incredible jump that's happened in the last several months, where models are getting much, much better on certain benchmarks. And if you look at what's driving that, it's reasoning.
This is test-time scaling. I think everyone here is quite familiar with this, and it seems that certain tasks like AIME, which are competitive math problems, really respond when models are able to think step by step and produce these long chains of thought. So let's go back to DeepSeek R1.
Now, DeepSeek R1 was really impressive to a lot of people for a lot of reasons, and RL was a big part of that. But I was also particularly interested because DeepSeek R1, at the end of the day, is an SFT model. The final weights that they released are actually DeepSeek-V3 base fine-tuned on 800k SFT examples, 600k of which are reasoning.
Of course, you can see here that RL was a big part of it; RL was used heavily to create the model that generated this data. But at the end, it was SFT plus a little bit of RL for alignment. So this was really interesting and surprising. And the other thing that was really interesting and surprising to us was these small reasoning models that DeepSeek released, which were incredibly strong.
And this, for us, was a huge motivation to try to do this ourselves. Why is that interesting? Because if we go back here, no additional detail was really given on these datasets. So if you want to create strong reasoning models, we now sort of have a training recipe, but we don't have the data recipe.
That's the missing link. Okay. I also want to include a slide here on why it's interesting to train your own reasoning models. I'm partially taking this from Amir's talk yesterday on open source and enterprise, which I really liked. But there are these main points: performance, privacy, speed and cost, and then ownership and destiny.
I think using reasoning is a great tool to solve a problem, and you shouldn't limit your toolbox if you're trying to solve a specific domain task. As we talked about before, RL is a great tool in this toolbox for tackling reasoning tasks. But we're going to see here that SFT is, as Nathan put it this morning, extremely easy and extremely effective.
Okay, great. Now, the missing link: how do we actually solve for this reasoning data recipe? There were all these questions we had when we started. How much data do you really need? What data curation steps are necessary? What are the optimal choices for each step in the data creation pipeline?
And then, how do you even go about figuring all this out? That's the meat of the OpenThoughts project. So today, we're excited to announce OpenThoughts3, which is hot off the presses (it just came out two hours ago) and is our latest and greatest version of our reasoning datasets.
And... thank you, thank you. This is now the state-of-the-art reasoning dataset recipe. You can see here, these graphs show accuracy on three reasoning benchmarks: AIME, which is competitive math; LiveCodeBench, which is competitive coding; and GPQA Diamond, which is science questions. On the y-axis, accuracy is going up.
On the x-axis, the data scale is going up. So we heard before that scaling is difficult, particularly with RL. The good news is that for SFT, scaling is much easier. You can see here, we compare to other open reasoning datasets. So Nemotron Nano: NVIDIA released this great model, Nemotron Nano.
It's an 8B model, and they also released the dataset it was trained on. So we compared directly, training the same base model on our dataset, which is our data recipe, versus the Nemotron Nano data, which is the NVIDIA recipe. And you can see here, there's a significant gap.
So we've shifted the scaling curve upwards. Great. So yeah, this is the state-of-the-art 7B open-data reasoning model. You can see we've measured across the domains of interest, science, code, and math, plus a couple of held-out benchmarks. Our original goal was to reproduce, to find the missing link for, the DeepSeek distilled models.
And you can see here, we've crushed that goal. We're significantly outperforming the DeepSeek-R1-Distill-Qwen-7B model, which we started off trying to reproduce. And compared to the Nemotron Nano model, which is trained on a different base model, we're also outperforming on some benchmarks and similarly competitive on others.
So okay, let's actually talk about how we achieved this. This is the interesting part for you. We go back to the scaling graph. You can see, once again, on the x-axis we're scaling dataset size. Scaling is a huge lever for accuracy, but it gets more and more expensive, exponentially more expensive, as you keep going.
And then vertically, you can see that we've shifted the scaling curve up. This is what I was talking about before: improving the dataset recipe. Given a fixed dataset recipe, you can always scale it larger and get higher performance. But if you want to push performance to the absolute maximum, the real question is: how do I create the best dataset?
And therefore, what is the best recipe for the dataset? Okay, enough teasing. Let's get into the meat of it. This is how we approached the problem. We broke the dataset pipeline down into: sourcing questions, mixing different sources of questions, filtering down to the highest-quality questions, generating answers with a teacher model (that's the distillation step), and then filtering out bad answers. And lastly, at the end of this entire experimentation, we looked at which teacher model to select. Through this entire pipeline, we arrived at our final dataset recipe.
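To make the stages concrete, here is a rough, hypothetical sketch of the pipeline in Python. The `sources`, `judge_llm`, and `teacher_llm` objects and their methods are illustrative placeholders, not the actual OpenThoughts code or APIs.

```python
# Hypothetical sketch of the pipeline stages described above.
# The client objects and their .complete() / .load_questions() calls
# are placeholders, not the real OpenThoughts implementation.

def build_reasoning_dataset(sources, judge_llm, teacher_llm, target_size):
    # 1. Source and mix questions from many places (forums, exams, synthetic).
    questions = [q for source in sources for q in source.load_questions()]

    # 2. Filter questions, e.g. keep the ones a judge model rates as hardest.
    def difficulty(question):
        reply = judge_llm.complete(
            f"Rate the difficulty of this question from 1 to 10:\n{question}"
        )
        return int(reply.strip().split()[0])

    questions = sorted(questions, key=difficulty, reverse=True)[:target_size]

    # 3. Generate answers (long reasoning traces) with a teacher model.
    #    This is the distillation step.
    examples = [{"question": q, "answer": teacher_llm.complete(q)} for q in questions]

    # 4. Optionally filter out bad answers before SFT.
    return examples
```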
Now, this was a ton of work. This is a screenshot of our Hugging Face page: you can see we've created over 5,000 datasets and almost 3,000 models. For this project specifically, it was only around 1,000 experiments, but that gives you an idea of how rigorously we looked at the decisions in each step of the pipeline.
And also, I think this is interesting because it peels back the curtain a little bit on what the frontier labs are probably doing: finding signal at the smallest scale possible, trying as many things as possible, empirically choosing the best, and then scaling up. And sometimes when you scale, what was best at the small scale doesn't actually hold up. But if you're lucky, and you've done good science, then your YOLO run will be the best possible, right? Okay. So these are the key learnings from our dataset recipe, and this is what you can take away. The first thing, which is pretty surprising, is that sampling multiple answers, so multiple reasoning traces per question in your dataset, works really, really well.
Performance does not go down at a fixed scale. Take a fixed budget of, say, 30k examples. If you use 30k unique questions and sample one answer per question, that performs pretty similarly to taking 1/16 as many questions, so roughly 30k divided by 16, and sampling 16 answers for each, which is quite cool.
This is really useful, because it means you can scale your dataset by 16x, more than an order of magnitude, from the same pool of questions. And if you remember the graph from before, that corresponds to a pretty large increase in accuracy.
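As a rough illustration of that expansion (the `teacher.complete` call is a hypothetical stand-in, not our actual code):

```python
# Sketch: expand a small question pool by sampling many reasoning traces
# per question. `teacher.complete` is a hypothetical API stand-in.
NUM_SAMPLES = 16

def expand_by_resampling(questions, teacher, num_samples=NUM_SAMPLES):
    dataset = []
    for q in questions:
        for _ in range(num_samples):
            # Nonzero temperature so the traces for a question actually differ.
            trace = teacher.complete(q, temperature=0.7)
            dataset.append({"question": q, "answer": trace})
    return dataset

# ~1.9k unique questions * 16 samples ~= 30k examples, performing similarly
# to 30k unique questions sampled once each.
```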
The other surprising thing we found was that a model that scores better on evaluation benchmarks is not necessarily a better teacher. A good way to think about this is a brilliant researcher who is a terrible lecturer, right? Specifically, we found QwQ-32B was a stronger teacher than DeepSeek R1, so we switched to it in our recipe, even though previously everyone had been using R1. We also found that sources of synthetic questions were actually quite good.
Some of the top sources we selected were entirely synthetic, and better than sources that were scraped from forums or written manually by humans. This is also really good news, because synthetic question generation is scalable. So once again, we can go back to the x-axis and push even further, which means another accuracy boost.
Question filtering also works well. Here we filtered questions by asking a language model how difficult each question is and keeping only the hardest ones. We also had a language model try to answer each question and looked at the length of that answer. These are both proxies for the same thing.
You can imagine that if a problem is a lot harder, a language model will think more and produce more text, so its answer will be longer. These approaches worked better than embedding-based approaches or fastText classifiers, which is interesting because those are the approaches typically used for pre-training data. So it seems that data filtering for post-training is quite different from pre-training.
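As a minimal sketch of those two filtering proxies (the prompt and the `llm.complete` call are illustrative assumptions, not our exact setup):

```python
# Sketch of the two question-filtering proxies described above.
# `llm.complete` is an illustrative placeholder.

def difficulty_score(question, llm):
    prompt = (
        "On a scale of 1 to 10, how difficult is this problem? "
        "Answer with a single number.\n\n" + question
    )
    return int(llm.complete(prompt).strip())

def response_length_score(question, llm):
    # Harder problems tend to elicit longer attempted answers.
    attempt = llm.complete(question)
    return len(attempt.split())

def keep_hardest(questions, llm, scorer=difficulty_score, keep_fraction=0.3):
    ranked = sorted(questions, key=lambda q: scorer(q, llm), reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]
```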
Okay, some things that didn't work, which were also quite interesting. Through our experiments, we saw that choosing a smaller number of high-quality sources was much better than trying to optimize for diversity by going to a larger number of sources.
That's very counterintuitive, right? You'd think, okay, I'm always going to go for higher diversity, but this is not what we saw. The last interesting thing is that people talk a lot about verification, which is obviously very important for RL. For SFT and distillation, though, filtering based on the answer, verifying the answer, didn't seem to help at all.
This is quite surprising. And I think there's some good research in the literature about why this might be: for the hardest problems, it can still be helpful to keep an example in even if the teacher's answer is incorrect, because you still see how the teacher model attempts the problem.
It's not just the final output that matters. Okay, great. So those are all the amazing learnings from OpenThoughts3, which we're super excited to share. But now you're probably thinking, okay, they've done a thousand experiments; I don't want to do a thousand experiments, but I still want to create reasoning models.
How do I adapt this if I want to create specialized reasoning models? The first thing I would say is: be aware that depending on your domain, the exact choices might be a little different. I would suggest starting with our recipe and then iterating on it.
If you have the capacity and compute, try a couple of different choices for each step in the pipeline. A good example of this is that we studied each step in the pipeline separately by domain, so distinctly for code, science, and math. And we saw, for example, in the question filtering I talked about before, that using difficulty labels worked well for code questions,
but for math and science, response length worked better. If you think about that for a second, it makes sense, because response lengths for coding questions are very different, right? For AIME math, the answer is literally just a number between zero and a thousand, so the answer itself contributes almost nothing to the length.
But you can imagine a very simple coding question whose answer is still many lines of code. So yeah, that's one thing to be aware of. The other thing, which I talked about previously, is synthetic question generation. Because it works so well, if you don't have a lot of data for your particular problem in your specialized domain, go ahead and transform the existing data you do have into questions: use them as in-context examples, expand them, and just generate more data.
So yeah, we built an open-source library for this. It's called curator, and you can try that out.
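Here is a minimal sketch of the idea: seed a generator model with a few of your existing questions as in-context examples and ask for new ones. The prompt and the `llm` client are assumptions for illustration; curator provides real tooling for running this kind of pipeline at scale.

```python
import random

# Sketch of few-shot synthetic question generation for a specialized domain.
# `llm.complete` is an illustrative placeholder, not curator's actual API.

def generate_synthetic_questions(seed_questions, llm, n_new=1000, n_shots=3):
    synthetic = []
    for _ in range(n_new):
        shots = random.sample(seed_questions, k=n_shots)
        prompt = (
            "Here are example questions from my domain:\n\n"
            + "\n\n".join(shots)
            + "\n\nWrite one new question in the same style and of similar difficulty."
        )
        synthetic.append(llm.complete(prompt, temperature=1.0))
    return synthetic
```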
And then lastly, I feel like everyone says this, but it can't be said enough: evaluation is paramount. If you don't know how well your models are doing or improving, you cannot make good, principled decisions about your dataset recipe. We spent a lot of time on this. We also have an open-source library on GitHub called Evalchemy, which takes care of this, including the sharding and parallelism. And the key thing here is that for very small evaluation sets, if you only have a handful of questions, you should run your model on them many times and average.
Going back again to AIME competitive math questions: there are only 30 per year. So for our evaluations, we gave the model those 30 questions 10 times and averaged to get the final signal for which data strategies were working better than others, because otherwise there's too much noise.
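A tiny sketch of that repeated-run averaging (the `evaluate_once` function stands in for whatever evaluation harness you use):

```python
from statistics import mean, stdev

# Sketch: reduce noise on small benchmarks (e.g. 30 AIME questions) by
# running the full question set several times and averaging accuracy.

def evaluate_with_repeats(model, questions, evaluate_once, n_repeats=10):
    accuracies = [evaluate_once(model, questions) for _ in range(n_repeats)]
    return mean(accuracies), stdev(accuracies)
```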
Okay, this is also very interesting, surprising, and promising for you if you're specializing: it seems that you can actually surpass the teacher in some domains with distillation. This is super cool. Usually you think only RL can push the frontier, and distillation is just about catching up to the teacher.
But no, that's not the case. We have an example in our paper where we looked at the legal reasoning domain, the problem of classifying Supreme Court decisions. What we did is we took 2k unique questions, sampled five answers per question, and then we did do verification here, which did matter in this case.
So we threw away any answers that were incorrect. And when you fine-tune a 7B model on this, it surpasses R1, which is both a very strong and a very large reasoning model. So this is very exciting, and there's a lot more research and also application to be done here.
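A sketch of the verification step used there: keep only traces whose final answer matches the known label. The trace format and helper names are illustrative assumptions, not the exact code from the paper.

```python
# Sketch of answer verification for the legal-classification distillation data.
# Assumes each example carries the teacher's trace and a gold label.

def extract_final_label(trace: str) -> str:
    # Illustrative convention: treat the last non-empty line as the final answer.
    lines = [line.strip() for line in trace.splitlines() if line.strip()]
    return lines[-1].lower() if lines else ""

def filter_verified(examples):
    """Keep only examples whose teacher trace ends with the correct label."""
    return [
        ex for ex in examples
        if extract_final_label(ex["trace"]) == ex["gold_label"].strip().lower()
    ]
```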
Okay, cool. So everything's open. It's OpenThoughts, and OpenThoughts means open. Go out and build. We have our detailed paper, which just came out this morning; we've got the weights and the dataset; and we have a bunch of repos with code for data generation, evaluation, and synthetic data.
So check those out. This is the team. It was a huge group of people and a lot of work over many months, and I think we're all very proud of what we did. There are lots of people to recognize here. If you scan that QR code, it goes to the tweet, and everything about the OpenThoughts project is linked from there.
Yeah. Thank you. All right. Thank you so much, Ryan. That was fascinating. It looks like we already have at least one question lined up. We have time for maybe a couple of questions, so if you have questions, please line up. Actually, before we get to those questions, I will say, as people are leaving, that we are going to be back here at two o'clock.
We've got an excellent afternoon planned on this track. We've got Nathan Lambert, and we've got Christian Szegedy, who's a co-founder of xAI. It's going to be a really great track at two o'clock, back in this room. Also, one more thing: if you do have questions for any of the speakers from this morning, hopefully they're going to be able to stick around.
Don't let them go to lunch. They're going to be sitting up here at the front, so swarm them as soon as we're done. But for now, let's get a couple of questions. Go ahead, yes, over there. Thank you. Great talk. So, two questions.
One is, if you're just using SFT on this data, what's the difference between this and regular SFT? This is just regular SFT. Oh, okay. So then how is regular SFT able to make the models think longer? Because I thought that for the reasoning models, they have a thinking block and they think for, you know, minutes or hours.
Exactly. So how does SFT make it think for hours? So you're doing supervised fine-tuning on the questions, and the answers also contain the thinking. The model learns to use its context window and produce these long thinking traces, so it can do this. People call SFT imitation, but it can learn this format in the same way.
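A minimal sketch of what such a training example can look like, assuming a `<think>`-style delimiter around the reasoning (the exact format is model-specific):

```python
# One SFT example where the target output includes the reasoning trace.
# The <think> tags are one common convention; delimiters vary by model family.
example = {
    "prompt": "How many positive integers less than 100 are divisible by 6 or 15?",
    "completion": (
        "<think>\n"
        "Multiples of 6 below 100: 16. Multiples of 15 below 100: 6. "
        "Multiples of lcm(6, 15) = 30 below 100: 3. "
        "By inclusion-exclusion: 16 + 6 - 3 = 19.\n"
        "</think>\n"
        "The answer is 19."
    ),
}
```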
Yeah. Thanks. All right. We'll take one from this side. Great presentation, Ryan. One question: why do you think a smaller model like QwQ-32B was a better teacher than DeepSeek R1?
What was your insight in figuring out that, like, a good professor makes a bad lecturer? Yeah, that's a great question. I think this is something we need to investigate more, but when you look at charts of the length of the reasoning traces, you can see the distributions are different.
So it might be the case that you're using more of your context window, more tokens, more steps. It also might be the case that you just have a better-formatted response, better output. This is another great open research question. Interesting. I'll also say on this point, we also tried Claude as a teacher, which is a very good, strong model.
And it was just a terrible teacher. So yeah, it's interesting what actually makes a good teacher. All right. We'll take one more very brief question from this side. And for those of you still waiting on questions, after we've closed this up, go swarm him.
So, great talk, Ryan. We're doing a similar kind of thing, but I just had a question. Do you have any kind of pattern map for the reasoning chain of thought: when things don't work, at what level in the eval do you find out that things are not working or it's not reasoning correctly?
Is there a pattern map or something like that in your open source? Sorry, I didn't catch that. So if there are five steps of reasoning to reach a final conclusion, at what step does the reasoning go awry? Yeah, this is a great question.
We don't do this fine-grained analysis, but there is a ton in the literature about this, where, yeah, there's a sort of critical step where it gets things wrong. We did the simplest thing possible, right? You could also go in and try to do more complicated things,
either at evaluation time, where you do interventions to detect steps that have gone awry and change them, or when you're creating the dataset, where you could potentially rewrite things. But everything we tried in terms of messing with the reasoning trace wasn't helpful.
So yeah, I think there's still more to explore there. This is really just the start of everything in reasoning.