OpenThoughts: Data Recipes for Reasoning Models — Ryan Marten, Bespoke Labs

Chapters
0:00 Introduction to the problem of open-source reasoning in AI models.
1:09 The effectiveness of Supervised Fine-Tuning (SFT) for reasoning.
3:38 Introduction to OpenThoughts 3 and its performance.
7:52 Key learnings from the data recipe development.
11:34 Guidance on adapting the dataset recipe to specific domains.
15:15 Call for open collaboration and where to find the project's resources.
00:00:00.000 |
I'm Ryan. I'm a founding engineer at Bespoke Labs, and today I'm going to talk to you about 00:00:19.740 |
OpenThoughts, which is our project to create the best open source reasoning data sets. I'll be 00:00:26.880 |
switching tack a little bit from our earlier discussions on reasoning and RL and focus on 00:00:33.100 |
the reasoning part, and you'll see why. So just so we're on the same page, we've talked 00:00:38.700 |
a lot about reasoning, but what's actually going on here? So I like this graph from Jason, 00:00:43.880 |
which shows this incredible performance that's happened in the last several months where models 00:00:49.900 |
are getting much, much, much better on certain benchmarks. And if you look at that, this is 00:00:54.980 |
reasoning. This is test-time scaling. I think everyone here is quite familiar with this, and 00:00:58.600 |
it seems that certain tasks like AIME, which are competitive math problems, really respond 00:01:05.080 |
when models are able to think step by step and produce these long chains of thought. 00:01:12.100 |
So let's go back to DeepSeek R1. Now, DeepSeek R1 was really impressive for a lot of people 00:01:17.800 |
for a lot of reasons, and RL was a big part of that. But I was also particularly interested 00:01:23.480 |
because DeepSeek R1, at the end of the day, is an SFT model. So the final weights that they 00:01:29.980 |
released are actually from DeepSeek-V3 Base, which is fine-tuned on 800k SFT examples, 600k of which are 00:01:39.340 |
reasoning. Of course, you can see here that RL was a big part of it, and RL was used heavily to create 00:01:46.220 |
the model which generated this data. But at the end, it was SFT and a little bit of RL for alignment. 00:01:52.860 |
So this was really interesting and surprising. And the other thing that was really interesting 00:01:56.940 |
and surprising to us was these small reasoning models that DeepSeek released, which were incredibly 00:02:02.700 |
strong. And this, for us, was a huge motivation to try to do this ourselves. And why is that interesting? 00:02:12.780 |
Because if we go back to here, no additional detail was really given on these data sets here. So if you 00:02:20.220 |
want to create strong reasoning models, we now sort of have a training recipe, but we don't have the 00:02:25.420 |
data recipe. That's the missing link. Okay. I want to also include a slide here on why it's interesting 00:02:32.940 |
to train your own reasoning models. So I'm partially taking this from Amir's talk yesterday on open source 00:02:39.660 |
and enterprise, which I really liked. But there are these main points: performance, privacy, speed and cost, 00:02:45.260 |
and then ownership and destiny. I think using reasoning is a great tool to solve a problem. 00:02:52.700 |
And you shouldn't limit yourself in your toolbox if you're trying to solve a specific domain task. 00:02:59.420 |
So as we talked about before, RL is a great tool in this toolbox to tackle reasoning tasks. But we're 00:03:05.900 |
going to see here that SFT is, as Nathan put it this morning, extremely easy and extremely effective. 00:03:11.180 |
Okay, great. Now, the missing link. How do we actually solve for this reasoning data recipe? 00:03:18.380 |
There's all these questions that we had when we started. How much data do you really need? 00:03:23.260 |
What data curation steps are necessary? What are the optimal choices for each step in that data creation pipeline? 00:03:31.820 |
And then, how do you even go about figuring all this out? And this is the meat of the OpenThoughts project. 00:03:38.060 |
So today, we're excited to announce OpenThoughts 3, which is hot off the presses, just came out two hours 00:03:44.220 |
ago, which is our latest and greatest version of our reasoning data sets. And... 00:03:52.060 |
Thank you. And now, this is the state-of-the-art reasoning data set recipe. 00:03:59.740 |
So you can see here, these graphs are showing accuracy on three of these reasoning benchmarks. 00:04:06.540 |
AIME, which is competitive math. LiveCodeBench, which is competitive code. And GPQA Diamond, which is our 00:04:12.140 |
science questions. On the y-axis, you see accuracy is going up. On the x-axis, you see the data 00:04:19.100 |
scale is going up. So we heard before that scaling is difficult, particularly difficult with RL. The 00:04:24.860 |
good news is that for SFT, scaling is quite a bit easier. You can see here, we compare to other open reasoning 00:04:31.500 |
datasets. So Nemotron Nano: NVIDIA released this great model, Nemotron Nano. It's an 8B model, 00:04:36.940 |
and they also released the dataset used to train it. So we compared directly by training the same base 00:04:41.580 |
model on our dataset, which is our dataset recipe, and on the Nemotron Nano data, which is the NVIDIA 00:04:48.060 |
recipe. And you can see here, there's a significant gap. So we've shifted this scaling curve upwards. 00:04:53.420 |
Great. So yeah, this is the state-of-the-art 7B open-data reasoning model. You can see 00:05:01.420 |
we have measured across the domains of interest of science, code, and math, and then a couple of held-out benchmarks. 00:05:09.420 |
So our original goal was to reproduce, to find the missing link for, the DeepSeek 00:05:14.700 |
Distill models. And you can see here, we've crushed that goal. So we're significantly outperforming 00:05:20.860 |
the DeepSeek-R1-Distill-Qwen-7B model, which we started off trying to reproduce. And then compared to the Nemotron 00:05:28.460 |
Nano model, which is trained on a different base model, we are also outperforming on some benchmarks, 00:05:34.700 |
and similarly competitive on some others. So okay, let's actually talk about how we achieve this. This 00:05:39.500 |
is the interesting part for you. So we go back to the scaling graph. You can see, once again, on the x 00:05:47.180 |
axis, we're scaling dataset size. So this is a huge lever for increasing accuracy. And the thing here is 00:05:57.580 |
it gets more and more expensive, exponentially more expensive as you keep going. 00:06:01.100 |
And then vertically, you can see that we've shifted the scaling curve up. So this is what I was talking 00:06:08.780 |
about before. This is improving the dataset recipe. So given a fixed dataset recipe, you can always 00:06:13.900 |
scale it larger and you can always have higher performance. But if you want to push your 00:06:18.700 |
performance to the absolute maximum, the real question is, how do I create the best dataset? And 00:06:24.140 |
therefore, what is the best recipe for the dataset? Okay, so enough teasing here. Let's go into the meat of 00:06:31.820 |
it. So this is how we approached this problem. We broke down the dataset pipeline into sourcing questions, 00:06:39.900 |
mixing different sources of questions, filtering those questions to keep only the highest-quality 00:06:45.020 |
questions, generating answers with a teacher model, so that's distillation, and then filtering out bad 00:06:51.180 |
answers. And lastly, at the end of this entire experimentation, we looked at which are the best 00:06:57.260 |
teacher models: which teacher model should we select? So through this entire pipeline, we've come down to 00:07:02.380 |
this final dataset recipe. 00:07:08.780 |
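To make those stages concrete, here is a minimal sketch of how such a pipeline might be wired together. The callables (difficulty rater, teacher-model call, answer filter) are hypothetical placeholders, not the actual OpenThoughts code.

```python
from typing import Callable, Iterable

def build_reasoning_dataset(
    questions: Iterable[str],
    rate_difficulty: Callable[[str], float],   # e.g. an LLM-judge difficulty score
    generate_trace: Callable[[str], str],      # teacher-model call (distillation)
    is_good_answer: Callable[[str], bool],     # answer filter
    n_questions: int = 30_000,
    answers_per_question: int = 1,
) -> list[dict]:
    """Sketch of an OpenThoughts-style pipeline: filter questions, then distill."""
    # Keep only the hardest questions, up to the question budget.
    ranked = sorted(questions, key=rate_difficulty, reverse=True)[:n_questions]

    # Generate one or more reasoning traces per question with the teacher model,
    # and filter out bad answers.
    examples = []
    for q in ranked:
        for _ in range(answers_per_question):
            trace = generate_trace(q)
            if is_good_answer(trace):
                examples.append({"question": q, "answer": trace})
    return examples
```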
Now, this was a ton of work. This is a screenshot of our Hugging Face page. You can see we've created over 5,000 datasets and almost 3,000 models. For this project, it was only around 1,000 00:07:16.380 |
experiments. But just to give you an idea of how rigorously we looked at the different decisions in each of 00:07:22.140 |
these steps of the pipeline. And also, I think this is interesting because it peels back the curtain a little 00:07:26.780 |
bit on maybe what the frontier labs are doing: finding signal at the smallest scale possible, 00:07:33.500 |
and trying out as many things as possible, and empirically choosing the best, and then scaling. 00:07:38.780 |
And sometimes when you scale, you see that what was best at the small scale doesn't 00:07:43.580 |
actually work. But if you're lucky, and you've done good science, then your YOLO run will be the best 00:07:50.540 |
possible, right? Okay. So these are the key learnings that we had from our dataset recipe. And this is 00:07:59.900 |
what you can take away. So the first thing, which is pretty surprising, is that sampling multiple answers, so 00:08:07.180 |
multiple reasoning traces per question in your dataset, works really, really well. Performance does not go 00:08:15.260 |
down at a fixed scale. Take a fixed budget of examples, say 30k. 00:08:23.260 |
If you use 30k unique questions and only sample once per question, that performs 00:08:30.700 |
pretty similarly to taking 1/16 of the questions, so 30k over 16, and sampling 16 times from each, which is 00:08:40.860 |
quite cool. This is really cool because it allows you to scale your dataset by 16x, which is 00:08:45.500 |
more than an order of magnitude. And if you remember the graph from before, that corresponds to a pretty 00:08:50.300 |
large increase in accuracy. 00:08:57.900 |
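As a rough sketch of that tradeoff; the distill helper and the teacher call are illustrative placeholders, and the 30k/16 split is just the example budget from above:

```python
# Two ways to spend the same 30k-example budget; per the finding above, they
# perform similarly, but the second needs 16x fewer unique questions.
# generate_trace() stands in for a call to the teacher model.

def distill(questions: list[str], answers_per_question: int, generate_trace) -> list[dict]:
    return [
        {"question": q, "answer": generate_trace(q)}
        for q in questions
        for _ in range(answers_per_question)
    ]

# Option A: 30,000 unique questions, one reasoning trace each.
# dataset_a = distill(questions[:30_000], answers_per_question=1, generate_trace=teacher)

# Option B: 1,875 unique questions (30k / 16), sixteen traces each.
# dataset_b = distill(questions[:1_875], answers_per_question=16, generate_trace=teacher)
```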
The other surprising thing that we found was that a better model, in terms of its own performance on evaluation benchmarks, is not necessarily a better teacher model. 00:09:03.420 |
I think a good way to think about this is a brilliant researcher who's maybe a terrible lecturer, right? 00:09:08.860 |
We found specifically that QwQ-32B was a stronger teacher model than DeepSeek R1. So we switched to 00:09:17.180 |
that in our recipe, even though previously everyone had been using R1. 00:09:21.340 |
We also found that the sources of data that had synthetic questions were actually quite good. Some of the top 00:09:30.620 |
sources that we selected were entirely synthetic, and better than sources that, say, were scraped from 00:09:35.740 |
forums or had humans manually write things. And this is also really good news because synthetic 00:09:41.900 |
question generation is scalable. So once again, we go back to the x-axis and we can push even further. 00:09:49.340 |
So question filtering also works well. Here we filtered questions by asking a language model, 00:10:00.700 |
how difficult is this question, and then taking only the hardest questions. 00:10:04.060 |
We also had a language model try to answer that question and looked at the length of that answer. 00:10:11.100 |
So these are sort of proxies for the same thing. You can imagine that if a problem is a lot harder, 00:10:15.900 |
then a language model will think more and it will produce more text. So its answer will be longer. 00:10:21.100 |
And these things worked better than embeddings-based approaches or FastText classifiers, 00:10:27.260 |
which is interesting inasmuch as those approaches were typical for pre-training. So it 00:10:33.340 |
seems that data filtering for post-training is quite different than for pre-training. 00:10:39.180 |
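Here is a minimal sketch of those two filtering proxies, assuming a generic chat-completion helper (llm_complete) rather than any specific API:

```python
# Sketch of the two question-filtering proxies described above.
# llm_complete(prompt) is an assumed helper that returns the model's text response.

def difficulty_score(question: str, llm_complete) -> int:
    """Ask a language model to rate difficulty; keep only the hardest questions."""
    prompt = (
        "Rate the difficulty of this question on a scale of 1 (easy) to 10 (hard). "
        "Reply with a single integer.\n\n" + question
    )
    reply = llm_complete(prompt)
    digits = "".join(ch for ch in reply if ch.isdigit())
    return int(digits) if digits else 0

def response_length_score(question: str, llm_complete) -> int:
    """Have a model attempt the question; longer answers are a proxy for harder questions."""
    return len(llm_complete(question))

def filter_hardest(questions, llm_complete, keep: int, score_fn) -> list[str]:
    """Keep the `keep` highest-scoring (hardest) questions under the chosen proxy."""
    ranked = sorted(questions, key=lambda q: score_fn(q, llm_complete), reverse=True)
    return ranked[:keep]
```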
Okay, some things that didn't work that were also quite interesting. Through our experiments, 00:10:43.100 |
we saw that choosing a smaller number of high-quality sources was much better than trying to optimize 00:10:47.580 |
for diversity by going for a larger number of sources. That's very counterintuitive, right? You'd think, 00:10:52.540 |
okay, I'm always going to go for higher diversity, but this is actually not what we saw. 00:10:56.140 |
The last thing that was interesting is that people talk a lot about verification, which is obviously very 00:11:02.060 |
important for RL. And we actually see for SFT and distillation, it didn't seem that filtering based 00:11:08.540 |
off of the answer or verifying the answer really helped at all. This is quite surprising. And I think 00:11:14.700 |
there's some good research in the literature about maybe why this is: for the hardest 00:11:21.660 |
problems, it might still be helpful to keep an example in, even if its answer is incorrect, 00:11:27.020 |
so you see how the teacher model attempts it. It's not just the final output that matters. 00:11:32.460 |
Okay, great. So those are all the amazing learnings that we had for OpenThoughts 3, which we're 00:11:39.100 |
super excited to share. But now you're probably thinking, okay, they've done a thousand experiments. 00:11:44.060 |
I don't want to do a thousand experiments. I still want to create reasoning models. How do I adapt this 00:11:49.180 |
if I want to create specialized reasoning models? So I guess the first thing I would say is, be aware 00:11:55.980 |
that based off of your domain, these exact choices might be a little bit different. I would suggest, 00:12:00.860 |
okay, start with our recipe and then iterate on it. If you have capacity and compute, try a couple 00:12:06.460 |
different choices for each step in the pipeline. And I think a good example of this is we studied each step 00:12:11.820 |
in the pipeline differently by domain. So we studied it distinctly for code, science, and math. And we saw, 00:12:18.380 |
for example, in the question filtering, which I talked about before, 00:12:21.180 |
using difficulty labels worked well for code questions. But for math and science, it was response 00:12:28.940 |
length. And if you think about that for a second, it makes sense, because response lengths behave very differently for coding 00:12:35.020 |
questions, right? For AIME math, the answer is literally just a number between zero and 999. 00:12:41.500 |
So the answer doesn't account for a large portion of the length. But you can imagine there are very simple 00:12:47.500 |
coding questions in which the answer is still a lot of lines of code. So yeah, this is one thing to be 00:12:52.940 |
aware of. The other thing, which I talked about previously, is synthetic question generation. Because 00:12:58.060 |
it works so well, if in your specialized domain you don't have a lot of data for your 00:13:04.220 |
particular problem, then go ahead: transform that existing data into questions, expand it, throw those 00:13:10.940 |
in as in-context examples, and just generate more data. So yeah, we built an open source library for 00:13:16.380 |
this. It's called Curator, and you can try that out. 00:13:21.660 |
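As a rough illustration of that idea (this is not Curator's actual API; the prompt wording, llm_complete helper, and seed examples are hypothetical):

```python
# Sketch: expand a small pool of domain questions by asking an LLM to write new ones,
# using existing questions as in-context examples. llm_complete(prompt) is an assumed helper.
import random

def generate_synthetic_questions(seed_questions: list[str], llm_complete,
                                 n_new: int = 100, n_examples: int = 3) -> list[str]:
    new_questions = []
    for _ in range(n_new):
        # Sample a few existing questions as in-context examples for the generator.
        examples = random.sample(seed_questions, k=min(n_examples, len(seed_questions)))
        prompt = (
            "Here are some example questions from my domain:\n\n"
            + "\n\n".join(examples)
            + "\n\nWrite one new, distinct question in the same style and domain."
        )
        new_questions.append(llm_complete(prompt).strip())
    return new_questions
```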
And then lastly, I feel like everyone says this, but it can't be said enough: evaluation is paramount. If you don't know how well your models are 00:13:27.980 |
doing or improving, then you cannot make good principled decisions about your data set recipe. 00:13:33.580 |
We spent a lot of time on this. We also have this open source library on GitHub called Evalchemy, 00:13:38.700 |
which takes care of this and also takes care of the sharding and parallelism. And the key thing 00:13:46.060 |
here is that for very small evaluation sets, if you only have a handful of questions, you should run 00:13:51.340 |
your model on those evaluation sets many times and average. So going back again to AIME competitive math 00:13:57.740 |
questions, there's only 30 per year. So for our evaluations, we gave the model those 30 questions 00:14:05.900 |
10 times, and then we averaged to get the final signal to determine which data strategies were working 00:14:13.020 |
better than others, because otherwise, there's too much noise. 00:14:18.140 |
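A minimal sketch of that repeated-evaluation averaging; the solve and is_correct callables are placeholder stand-ins for your model and grader:

```python
# Sketch: for a tiny benchmark like AIME (30 questions per year), run it several
# times and average the accuracy; a single pass is too noisy to compare recipes.
from statistics import mean

def repeated_accuracy(questions, answers, solve, is_correct, n_repeats: int = 10) -> float:
    run_accuracies = []
    for _ in range(n_repeats):
        correct = sum(is_correct(solve(q), a) for q, a in zip(questions, answers))
        run_accuracies.append(correct / len(questions))
    return mean(run_accuracies)
```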
Okay, this is also very, very interesting and surprising, and promising for you if you're specializing. It seems that you can actually surpass 00:14:24.620 |
the teacher in some domains with distillation. This is super cool. Usually you think 00:14:28.860 |
that only RL can push the frontier, and distillation is just about catching up to the teacher. But no, 00:14:33.980 |
that's not the case. So we have an example. It's in our paper, where we looked at the legal reasoning 00:14:39.740 |
domain: the problem of classifying Supreme Court decisions. What we did is we took 2k unique questions, 00:14:47.980 |
we sampled five answers per question, and then we did do verification here, which did matter. 00:14:55.900 |
So we threw away any answers that were incorrect. And when you fine-tune the 7B model, 00:15:02.780 |
it surpasses R1, which is a very strong reasoning model and also a huge reasoning model. So this 00:15:08.620 |
is very exciting. There's a lot more research and also application to be done here. 00:15:13.340 |
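Here is a sketch of that specialized recipe (sample several traces per question, verify against known labels, keep only correct ones); the generate_trace call and the label-matching check are illustrative assumptions, not the exact setup from the paper:

```python
# Sketch: distillation with verification for a classification-style domain
# (e.g. labeling Supreme Court decisions). generate_trace(question) is an assumed
# teacher-model call that returns a reasoning trace ending in a predicted label.

def build_verified_sft_data(labeled_questions: list[tuple[str, str]],
                            generate_trace,
                            samples_per_question: int = 5) -> list[dict]:
    examples = []
    for question, gold_label in labeled_questions:
        for _ in range(samples_per_question):
            trace = generate_trace(question)
            # Keep the trace only if its final answer matches the known label.
            if trace.strip().endswith(gold_label):
                examples.append({"question": question, "answer": trace})
    return examples
```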
Okay, cool. So everything's open. It's OpenThoughts, and OpenThoughts means open. Go out and build. We 00:15:22.940 |
have our detailed paper, which is just out this morning. We've got the weights and the dataset. We have a ton of 00:15:29.500 |
repos with code for data generation, for evaluation, and for synthetic data. So check those out. This is the 00:15:39.100 |
team. It was a huge group of people, a lot of work over many months. I think we're all very proud of what 00:15:44.940 |
we did. But there's lots of people to recognize here. If you scan that QR code, it goes to the tweet, 00:15:50.540 |
and everything about the OpenThoughts project is linked from there. Yeah. Thank you. 00:15:55.420 |
All right. Thank you so much, Ryan. That was fascinating. It looks like we're already getting, 00:16:07.020 |
we have at least one question lined up. Again, we have time for maybe a couple of questions. 00:16:11.020 |
So if you have questions, please line up and we'll do it. Actually, before we get to those questions, 00:16:16.940 |
I will say as people are leaving, we are going to be back here at two o'clock. We've got an 00:16:23.500 |
excellent afternoon planned on this track. We've got Nathan Lambert. We've got Christian 00:16:28.460 |
Szegedy, who's a co-founder of xAI. And it's going to be a really great track at two o'clock back in this 00:16:33.100 |
room. Also, one more thing, if you do have questions for any of the speakers from this morning, hopefully 00:16:38.620 |
they're going to be able to stick around. Don't let them go to lunch. They're going to be there. They're 00:16:41.420 |
sitting up here at the front. So swarm them as soon as we're done. But for now, let's get a couple 00:16:44.620 |
of questions. Go ahead. Yes, over there. Thank you. Great talk. So, two questions. 00:16:50.380 |
One is, if you're just using SFT on this data, what's the difference between this and regular SFT? 00:16:56.220 |
This is just regular SFT. Oh, okay. So then how is regular SFT able to make the models 00:17:03.260 |
think longer? Because I thought for the reasoning models, they have a thinking block and they think, 00:17:08.140 |
you know, for minutes or hours. Exactly. So how does SFT make it think for 00:17:13.100 |
hours? So you're doing supervised fine-tuning on the questions, and the answers also contain 00:17:19.100 |
the thinking. So the model learns to use its context window and produce these long thinking traces. So it 00:17:25.180 |
can do this. People call SFT imitation, but it can learn this format in the same way. 00:17:31.980 |
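To illustrate what such a training example might look like (the tags and wording here are a hypothetical format, not the exact OpenThoughts schema):

```python
# Hypothetical shape of one SFT example: the target completion contains the full
# reasoning trace inside a thinking block, followed by the final answer, so the
# model learns to imitate long chains of thought.
sft_example = {
    "prompt": "Find the number of ordered pairs (a, b) of integers such that ...",
    "completion": (
        "<think>\n"
        "Let me work through this step by step. First, consider ...\n"
        "... (thousands of tokens of reasoning) ...\n"
        "</think>\n"
        "The answer is 42."
    ),
}
```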
Yeah. Thanks. All right. We'll take one from this side. Great presentation, Ryan. One question: 00:17:39.340 |
why do you think a smaller model like QwQ-32B was a better teacher than DeepSeek R1? What was your 00:17:47.340 |
insight in figuring out that, like, a good professor makes a bad lecturer? Yeah, that's a great question. 00:17:53.420 |
I think this is something we need to investigate more, but you can see that when you look at 00:17:58.940 |
charts of the length of reasoning traces, the distributions are different. So it might 00:18:04.940 |
be the case that you're using more of your context window, using more tokens, more steps. It also might 00:18:09.180 |
be the case that you just have a better-formatted response, better output. This is another 00:18:16.060 |
great open research question. Interesting. I'll also say on this point, 00:18:19.500 |
we also tried Claude as a teacher, which is a good, strong model. And it was just a 00:18:24.060 |
terrible teacher. So yeah, it's interesting what actually 00:18:29.180 |
creates a good teacher. Yeah. All right. We'll take one more very brief question from this side. 00:18:33.900 |
And then those of you still waiting on questions: after we have closed this up, swarm the speakers. 00:18:39.740 |
So, great talk, Ryan. We're doing a similar kind of thing, but I just had a question. Do you guys 00:18:44.780 |
have any pattern map as to, in the reasoning chain of thought, when things don't work, at what 00:18:51.260 |
level, you know, in the eval, do you find out that things are not working or it's not reasoning correctly? 00:18:57.580 |
Is there a pattern map or something that you have in your open source? 00:19:00.220 |
Sorry, I didn't catch that. So if there are five steps of reasoning to reach a final conclusion? 00:19:08.700 |
Yeah, this is a great question. We don't do this fine-grained analysis, 00:19:12.620 |
but there is a ton in the literature about this, where, yeah, there's a sort of critical step where 00:19:18.140 |
it gets things wrong. We did the simplest thing possible, right? You could also go in and 00:19:24.220 |
try to do more complicated things: at evaluation time, where you're doing interventions 00:19:30.540 |
to maybe detect steps that have gone awry and change them, or you can do this 00:19:37.340 |
when you're creating the dataset. So you could potentially rewrite things, 00:19:40.300 |
but everything that we tried in terms of messing with the reasoning trace wasn't helpful. 00:19:45.500 |
So yeah, I think there's still more to explore there. 00:19:49.980 |
This is really just the start of everything in reasoning.