You'll find it okay, he's gone. For context, swyx is at the DMV, he's in the process of getting his license, so he's out for this one. But let's give it a few minutes for people to join in and then we'll get started. These are pretty good papers, so nobody should be too stressed out. Okay, so we'll give it a few minutes for people to join. Do we want to do a quick intro of what's going on? He's left and hasn't muted people upon entering, let me fix that real quick. Okay, so Rafael, you're ready to present and share? Yeah, sure. Perfect, perfect. Let's give it another two minutes. You want to do a quick intro?

Yeah, I'll start. So I'm Rafael, I'm working mostly on LLMs. I'm working on creating an LLM for the Georgian language, a country in Europe, which is a low-resource language. We already developed our first model. We started with basically zero data, zero compute, and zero talent, because only a few people work on LLMs in Georgia, but we managed to develop a model that outperforms larger models like GPT and Claude for Georgian. It's built for enterprise use cases and supports function calling, JSON/structured output mode, RAG, and so on. So if you want to test it, I can give you free API access. Very sick, very sick. Share the info in Discord, I'm sure a lot of people are interested. I know Eugene Cheah, he presents quite a bit and he's very interested in low-resource language models.

So today we basically have two papers, s1 and the scaling up test-time compute paper. I'll let you go through them however you want. Chat normally gets pretty active, so if you're monitoring that's cool. If something is very confusing I'll probably interrupt, but you know, take it as you want. Yeah, sure. Okay, actually let's give it another minute, some people are needing passcodes, they're just not using the right link. Once we hit 50 we should be good. Okay, we've got a new link posted, and then we're good to start. Can you see the screen? Yep.

Okay, so today I want to cover two papers, and both of them are about test-time scaling, the latest hot topic. The idea is that we want to build LLMs with better performance, and we already know from the latest models from OpenAI, Gemini, and Anthropic that reasoning works, that thinking works. So the idea is to replicate that, and this paper from Stanford is trying to replicate the reasoning step with only a couple hundred dollars or so, only a small budget. The idea is that they collected 1k samples in a particular way, which I'll get to in a bit, they trained Qwen 2.5 32B on them, and it actually outperforms o1-preview on math and reasoning datasets. That's the idea, and the technique they created is pretty simple, I think. Mostly it's for people like me who don't have a lot of budget and resources to train models but still want the ability to train reasoning LLMs. So the objective is to create the simplest approach to achieve test-time scaling.

To dive into the methodology, let me scroll a bit. The first thing is how they created those 1,000 examples. What they did was take some math datasets like NuminaMATH, AGIEval, and so on, generate reasoning traces using Gemini 2.0 Flash Thinking, and then run a filtering pipeline on top of that.
First they collected almost 60k examples. Then they filtered on quality, which means dropping examples based on formatting issues and ASCII errors, as I remember, and after that they had about 54k. Then they filtered based on difficulty. What they did is take two models, Qwen 2.5 7B and 32B (I'm not sure if you can see the image perfectly), and ask each question to both models. If either model responds correctly, they just drop that example. The idea is that they want to keep the hardest questions, the ones neither model can answer, and they evaluate correctness using Claude 3.5 Sonnet. After that they had around 25k samples. They also looked at the length of the reasoning trace, so you can say that if the trace is longer, it's probably a more challenging problem, something like that.

After that, the key point: since the idea is to train the model on a small number of samples, they have to do some sampling. So they used an LLM to cluster those 25k examples by topic, and then sampled across clusters to collect 1k examples, picking them based on reasoning trace length. So after this pipeline they have a 1k-example dataset, which contains the question, the reasoning trace, and the answer. If there are any questions about that I can pause for a few seconds before I continue, because it's important. It's the simplest approach, I mean, but it's a pretty good way to collect and curate data.
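To make that difficulty filter concrete, here is a minimal sketch of the idea in Python. The function signatures, the prompt wording, and the grading call are assumptions for illustration, not the authors' actual code:

```python
# Minimal sketch of s1-style difficulty filtering (assumed interfaces):
# keep only questions that NEITHER base model answers correctly.
from typing import Callable, Dict, List

def is_correct(question: str, attempt: str, reference: str,
               grader: Callable[[str], str]) -> bool:
    # Ask a grader LLM (the paper uses Claude 3.5 Sonnet) whether the
    # attempt matches the reference answer.
    verdict = grader(
        f"Question: {question}\nReference answer: {reference}\n"
        f"Model answer: {attempt}\nDoes the model answer match? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def difficulty_filter(examples: List[Dict],
                      small_model: Callable[[str], str],
                      large_model: Callable[[str], str],
                      grader: Callable[[str], str]) -> List[Dict]:
    # examples: dicts with "question" and "answer".
    # small_model / large_model: e.g. Qwen 2.5 7B and 32B, prompt -> answer text.
    hard = []
    for ex in examples:
        attempts = [small_model(ex["question"]), large_model(ex["question"])]
        # If either base model already solves it, the question is too easy: drop it.
        if any(is_correct(ex["question"], a, ex["answer"], grader) for a in attempts):
            continue
        hard.append(ex)
    return hard
```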
Okay, after that they trained Qwen 2.5 32B, and as you can imagine they got some good results. Another thing I want to explain before I show the results is that, after creating the dataset, they came up with a technique to increase test-time scaling. The name is budget forcing, as I remember. The idea is that if you want to generate more tokens and more thinking steps, you append the word "Wait" to the output and the model understands to continue the generation; and if you want to stop the thinking process and force the answer, you append a final-answer cue and it stops the process. That's budget forcing: you have a maximum number of thinking tokens, a minimum number of tokens, and you can control the length of the traces with those simple words. That's the idea.

About the results: the first thing is that it outperforms o1-preview, and it's also comparable to the r1-distill from DeepSeek (the Qwen reasoning distill) and to Sky-T1, which came out a few weeks ago. Another interesting thing is that if you increase the thinking time, so if you generate more tokens, the performance increases on datasets like MATH and GPQA. Here's an example of how it works: let's say, how many r's are in "raspberry"? As you can see it got the wrong answer, then they append the word "Wait", which continues the generation, and after that it corrects itself and returns the right answer. Another thing is that after some point the results converge: if you generate a lot of tokens, or add a lot of "Wait" words to the generation, after a while it plateaus.

They also compared budget forcing against other methods like majority voting and found that the test-time scaling curves for those techniques are different. As you can see, with budget forcing the results keep going up, but for majority voting it only improves a little bit.

Here's another interesting result: this is their final model, and they compare against OpenAI, Gemini, Qwen, and other open-weight and open-data models. As mentioned, the model, the code, and the data are all open source, so you can check and test them. As you can see, it outperforms o1-preview on the math datasets, on MATH and AIME, but not on GPQA, and the same holds against o1-mini. For Gemini they only have one result; for the other two, as I remember, they couldn't test it because Gemini blocks some API requests, and AIME is only around 30 problems, so they tested it manually in Google AI Studio, which is why we only have those numbers. It's nearly comparable to Gemini, and the idea is that they essentially distilled Gemini, right? With only 1,000 examples they got results comparable to Gemini itself. Another interesting thing is the base model they started from: the performance gain is large, something like 30 points on AIME, which is huge. Some results from DeepSeek are comparable as well, but the point they make is that they only used 1,000 examples, whereas DeepSeek distilled with 800k examples to train their models.

After that they did some ablations and found some interesting things, I think. The first one is how you choose the 1,000 examples from the 60k. They tried choosing at random, choosing for diversity based on the clusters, cutting the 1k based on the length of the traces, and just training on the full dataset. The full dataset outperforms all of the 1k selections, but it's comparable; full data is only a little bit better, and the point is that with just 1k examples, rather than 60k, you get mostly the same results.

Yeah, I found that last section pretty interesting. Basically these results show that if they didn't go through their four-step approach of filtering down to a thousand samples, their results would be 30 to 40% worse, right? So their whole thing of, step one, see if the questions are easy, step two, cluster and take diverse samples, that alone, to get the thousand high-quality examples versus just a thousand random samples, is about a 30% net difference. So their little filtration pipeline matters a lot. Yeah, exactly. The pipeline is pretty simple, I think, but maybe the most valuable part is filtering based on quality and on the model responses; the formatting part, I think, doesn't matter as much. Yeah, the model-response filter, if I'm not mistaken, was basically just throwing out a lot of the data: when they used the Qwen 7B and Qwen 32B base models, what they did was run their 60,000 questions through both, and if either base model could answer correctly, they threw out the sample as too easy. So step one already filters a good bit, just to get hard questions. But yeah, at a high level their whole pipeline was pretty interesting; that seems to be like half of the value, a lot of it was just getting good at picking 1,000 samples.
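Since the selection step carries so much of the value in that ablation, here is a rough sketch of the spirit of it: cluster labels from an LLM classifier, then sample across domains with a bias toward longer reasoning traces. The field names and the exact weighting are assumptions about the idea, not the paper's algorithm:

```python
import random
from collections import defaultdict

def select_diverse_subset(examples, target=1000, seed=0):
    # examples: dicts with a "domain" label (e.g. from an LLM topic classifier)
    # and a "reasoning" trace string. Returns ~target examples spread across
    # domains, favoring longer traces within each domain.
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for ex in examples:
        by_domain[ex["domain"]].append(ex)

    selected = []
    while len(selected) < target and by_domain:
        domain = rng.choice(list(by_domain))              # pick a topic uniformly
        pool = by_domain[domain]
        weights = [len(ex["reasoning"]) for ex in pool]   # bias toward longer traces
        ex = rng.choices(pool, weights=weights, k=1)[0]
        selected.append(ex)
        pool.remove(ex)
        if not pool:
            del by_domain[domain]                         # domain exhausted
    return selected
```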
The other half was the budget forcing. Do you want to explain a bit more about how budget forcing works and what it's doing? I thought it was pretty unique. Yeah, just to summarize, those are the two important things: the data sampling pipeline and budget forcing. On the data side, I actually tested this idea for my own use cases, not for the reasoning part, but I collect and mix datasets based on this technique, and it really works, because you can choose high-quality data that way. About budget forcing: as I mentioned, you have the ability to control the size of the reasoning. Let's say you have a minimum number of tokens and a maximum number of tokens that you want to generate; you can control the length with wording. If you want to generate more, you append a word like "Wait", as you can see here, and the model continues the generation and produces more tokens, and you can do that several times, two or three times, if you want more. If you want to stop the generation, you append a final-answer cue and it ends the thinking. So you can control the maximum number of tokens that way. That's the idea; I'm not sure if any other work is doing something like that.

So after this came out, some other people started doing it. Essentially, what people are also saying is that this may be the distinction between o3-mini low and o3-mini high: it lets you control how much thinking you do. As we generate tokens autoregressively, eventually we generate an end-of-thinking or end-of-text token and then we stop. What they're doing is, any time the model's next likely token is that end token, we replace it with a thinking token, you know, "hmm, what should I do", and it continues thinking. So at inference time you can force more thinking, or, if it's rambling too much and has already given an answer in the outputs, you can force it to end, or tell it "Final Answer" and have it start boxing its answer. The interesting thing is that it has a real effect: they show their scores with and without budget forcing. But it's essentially an inference-time hack to control the amount of thinking. If you think of it as an API consumer: if I want thinking effort at, say, 30, maybe I swap in three extra think tokens instead of end-of-sequence; if I want 60, I swap in six. That's very high level, just to understand conceptually what's going on, but that's what budget forcing is here.
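Here's a minimal sketch of that budget-forcing loop: suppress the end of thinking and append "Wait" to extend, or append a final-answer cue to cut it off. The `model.generate` interface, the delimiter strings, and the token budgets are generic assumptions, not the paper's actual inference stack:

```python
def n_tokens(tokenizer, text: str) -> int:
    return len(tokenizer.encode(text))

def generate_with_budget_forcing(model, tokenizer, prompt: str,
                                 max_extensions: int = 2,
                                 max_think_tokens: int = 8000):
    # Generate a thinking trace; each time the model tries to stop thinking,
    # append "Wait" and let it keep going, up to max_extensions times.
    think = model.generate(prompt, stop=["</think>"], max_tokens=max_think_tokens)
    for _ in range(max_extensions):
        budget_left = max_think_tokens - n_tokens(tokenizer, think)
        if budget_left <= 0:
            break
        think += "\nWait"  # push the model to reconsider instead of stopping
        think += model.generate(prompt + think, stop=["</think>"],
                                max_tokens=budget_left)
    # Force the end of thinking and cue the answer.
    answer = model.generate(prompt + think + "\n</think>\nFinal Answer:",
                            max_tokens=512)
    return think, answer
```

Raising `max_extensions` is the "more thinking" knob; setting it to zero and forcing the final-answer cue immediately is the "less thinking" knob.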
Another comment RJ made is that it only had a minor improvement over the base Qwen model. Do you want to go to the performance benchmarks? This is the base, and the base is pretty bad. The base AIME score of Qwen 2.5 32B is 26.7, the s1 score is 50, and 56 with budget forcing. So that's the result, and this one is with budget forcing. Oh right, I was misreading; I was looking at QwQ. Okay, got it. Yeah, QwQ is the Qwen reasoning model preview; they haven't released the final version yet. I think they have a second update they put out this week; someone on the Qwen team on Twitter keeps teasing it and saying he's working on it. But that's Qwen's internal reasoning model, it's not just a base instruct model, and it's also just not out yet. Yeah, that's what surprised me: that with only 1,000 examples you can just run fine-tuning with SFT and get a really big improvement over the base model.

Yeah, it's not necessarily just fine-tuning with SFT, right, it's also distillation, because they're distilling Gemini's thinking traces. It's not a distillation loss, but it's SFT to match Gemini's thinking output, plus a little hack of "think a little more if you're not on the right path". The interesting thing is that the 800,000-sample approach also does this very well with distillation. QwQ is Qwen's internal, natively RL-based reasoning model, and if we look at it, the performance is pretty similar, right? The model is not out yet, but performance-wise it's pretty much just as good. The difference with s1 is probably that they're also around the 800,000-sample scale; I don't think they've said, but I think it's a fair assumption that they're in the hundreds of thousands of samples. Yeah, exactly, and they only spent, I don't remember exactly, 26 minutes on 16 H100s to run the training. A thousand samples, not much. But it's very cool work to see, and it also connects to the rumor of "here's what OpenAI does for more-reasoning and less-reasoning thinking models", plus a little hack for how you can do it yourself.

They also compare budget forcing against other conditional control techniques, like token-conditional, step-conditional, and class-conditional control, which is where you say something like "let's think for 100 words" or "let's think in three steps". They ran the evaluation and found budget forcing works best out of those techniques. They also did ablations on the word itself: they compared "Wait" against alternatives like "Alternatively", or no string at all, and "Wait" works the best, which is also interesting. Some of the other little local experiments that came out after this paper: instead of just appending the "Wait" word, people started slapping in "aha moments", and people have started testing quite a bit of this themselves. So instead of the end token, what they found works best is to give it "Wait", which is basically "wait, let me reconsider", and that's pretty cool, an interesting idea. People locally have tried fifty other variations, and one thing I was hoping someone would do is train a classifier to pick the best swap word to put in; it does slightly better, but now you have to run a classifier. Fun little stuff.

So the idea and motivation of this paper is that this is the simplest technique, I think, that gives you a big improvement in results over the base model, roughly matching Gemini and so on. That's the main idea. What else? I think that's it, a pretty small paper but interesting. And all the results are on math-related datasets. Yeah, I wonder if there's some performance degradation on other datasets that require reasoning but aren't math based. That's a good question, but for this paper they mostly test on math. Well, GPQA is not just math, and if you go into their dataset curation, the majority, like the 50,000-odd samples out of the 60,000 they started subsetting from, includes quite a few non-math sources, and part of their clustering is to not just sample math.
Their Sonnet clustering does diversity beyond just math: they have standardized tests like the LSAT in there, the hardest LSAT questions, they have chemistry, biology, and physics, and other stuff, and then the GPQA benchmark shows that it's not just math; the other two benchmarks are math. At a high level, one thing we've seen with reasoning models is that math and verified code quality kind of converge with quality on other general tasks, but one of their benchmarks here is not math.

Did anyone try replicating this with some of the same models that the R1 model was distilled to? Because I'd be interested to see: if we use this technique versus distillation from the R1 "god model", what happens, how does that compare? It's a little bit apples to oranges to compare to a distill, right? I mean this R1 distill, is that the same size, which one is that? So DeepSeek themselves put out a Qwen 32B R1 distill, a Qwen 32B reasoning model, so they did the exact same model. In the opposite sense, s1 did this on the same model that DeepSeek did it on, so we have a direct comparison, I think. And I could be completely wrong here, but I don't think DeepSeek told us how much data they used for distillation; they just said "we distilled the big one, here you go, they're good, here's how they perform". We don't know whether they did 500 samples or 50,000. We don't know what they did for distillation, except I think they did a proper distillation loss, not just SFT on outputs. And it's a little different too, because this is trained on outputs from Gemini, and that would be on outputs of R1, so there is some stuff that's different. But directly to your question, DeepSeek did do the same model, so you can compare the two. And I don't know if someone wants to fact-check or make a note, but I don't think the DeepSeek paper said how many samples were used for their distillation, right? Yeah, but is that result in this paper? Did anyone figure out, is that R1 distill the equivalent Qwen model, the 32B? Because regardless of how much data, I mean, obviously, to me this is an amazing result, right, you use minimal compute and get something very close to the same performance. Even if it's not up to par, I'd just be curious to see how it compares.

So, two things. One, I think what they're going after here is sample efficiency: they want to show you can do it with a couple thousand samples. And two, I think we can look at the R1 distill. Apparently R1 distill is better, according to my quick "let's ask ChatGPT", but it could have hallucinated, so I don't fully trust it. Okay, so it looks like they did give numbers, I remember this now; also thanks, Leah, for pointing this out. For R1, the original DeepSeek paper used 800,000 samples: roughly 600,000 reasoning traces plus 200,000 non-reasoning SFT samples. And they do put it in the paper here; they have benchmarks of the R1 distill right there, it's the third row. So sample efficiency is the difference, but also the distillation technique was different, right, and that one used the largest dataset. Yeah, totally, totally get it, this is going after something different, it's just an interesting comparison. So that distill, presumably, is the 32B Qwen? Yes, it is, 800,000 samples.
Well, do we want to move on to the second paper, or any other closing thoughts on this one? I'm sure if there's a discussion in chat we can always pop back in, but we do have a second paper. Yeah, so the second paper is "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach". The topic is the same, we want to build a reasoning model, but the idea is that most of the models we already know generate their reasoning and thinking tokens in the output, right? In this paper they create an architecture which doesn't explicitly produce those tokens in the generation; it thinks and reasons inside the architecture itself. That's the main difference. Now, why do we need that kind of model? Because, as you can see, the latest models generate thousands of tokens in their thinking and reasoning, so it takes a lot of time, and you need a big context window, which doesn't work well for small models. Also, as they mention, there are some reasoning abstractions that you can't explicitly express in words, so you want a high-dimensional latent space where the model reasons inside the architecture itself. That's the idea.

About the methodology: they create an architecture where they add a recurrent block inside the transformer, and the idea is that when you train the model you run that block several times in a loop. As I remember, they sampled random loop counts, so sometimes you run it three times, maybe five times, and so on; that's most of the idea. They took a 3.5B-parameter model and trained it on 800 billion tokens, and as I mentioned, at test time it can scale up to compute equivalent to a 50-billion-parameter model, so the result is pretty impressive as well. As you can see in the results, as you run more loops at test time, performance increases; for some benchmarks it converges quickly, but for GSM8K there's a lot of improvement.
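As a toy illustration of that architecture, a prelude that embeds tokens into latent space, a core block that is looped a variable number of times, and a coda that decodes back out, here is a PyTorch-style sketch. The layer sizes, the random initial state, and the way the embedding is re-injected each step are simplifications of the idea, not the paper's exact design:

```python
import torch
import torch.nn as nn

def block(d_model):
    return nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

class RecurrentDepthLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512,
                 n_prelude=2, n_core=4, n_coda=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.prelude = nn.ModuleList(block(d_model) for _ in range(n_prelude))
        self.core = nn.ModuleList(block(d_model) for _ in range(n_core))  # shared, looped
        self.coda = nn.ModuleList(block(d_model) for _ in range(n_coda))
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, num_loops: int):
        e = self.embed(tokens)
        for layer in self.prelude:       # embed into latent space
            e = layer(e)
        s = torch.randn_like(e)          # random initial latent state
        for _ in range(num_loops):       # latent "thinking" happens here
            x = s + e                    # re-inject the input each iteration
            for layer in self.core:
                x = layer(x)
            s = x
        for layer in self.coda:          # decode the final latent state
            s = layer(s)
        return self.lm_head(s)

# At test time you trade compute for quality by raising num_loops, e.g.:
# logits = RecurrentDepthLM()(torch.randint(0, 32000, (1, 16)), num_loops=32)
```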
So why train models this way? As I mentioned, in this case you don't need specialized reasoning data for training. Also, this technique takes less memory; the idea is that you use a small model but more compute to increase performance. It takes more FLOPs, but it's more efficient: since the model is small, you don't pay as much in communication or interconnect cost, and in that sense it's faster. That's the architecture. There are some sections on why they use the embeddings the way they do, and some efficiency improvements, because they trained this model on AMD hardware, on those specialized GPUs, as I remember, and they needed some tricks to make training fast. And, as with the previous paper, they published their pre-training code, the model, and the data.

For the data, since they don't have a lot of money, they just took some already-published datasets and created a mix, and mostly they wanted to test the model on math and reasoning tasks, so most of the data comes from coding and math datasets. As I mentioned, the mixing strategy is maybe not the best, but the idea is just to test whether this model architecture works. They also published the hyperparameters, they did some packing, and so on.

One of the problems they had in training was that, since it's a different architecture, the hyperparameters and the setup really matter. In the first run, this is the orange curve, they don't get good performance, but after some tweaking of the learning rate and other hyperparameters they get good results with the blue one, which is the latest, main model, and the loss goes down. I'm not sure about the number of loops used there, but maybe we'll see it in the results.

When I look at the results, they don't look as strong as the previous paper's, but we can see improvements from the loop, I think. If you check the results, it doesn't outperform most models; they compare against the OLMo models and Pythia, which is already pretty old, and the numbers are not that great. But the main thing is that performance increases as you add loops. As you can see, 49 versus 69 on ARC-Easy, and they test on GSM8K as well. One other thing is that they check whether this architecture still has the emergent abilities we know from other LLMs, like chain of thought and in-context skills. They test those and see that this architecture also works with context: if you give it few-shot examples, for instance, it improves performance, just to confirm nothing goes wrong for the other abilities. They publish some early checkpoints as well, and as you can see, performance goes up over training. They also compare against a baseline: the baseline takes the same dataset and trains with the standard technique, without the recurrent architecture, and performance is much better with the recurrence, for example 46 versus 69. I think this is a good result, because we know that with the same dataset and, let's say, the same compute, you get much better performance. They also compare with r = 1, and that actually performs worse than the baseline, which is another interesting thing. They also checked few-shot examples: with zero-shot it converges quickly, but with more shots performance keeps going up, which is good. And they check that the model needs more loops in the architecture when the problem is complex: for example, high-school mathematics needs around five loops, but logical or moral-scenario questions need much more, say 20 or 25.

Some of the inference-time loop behavior is kind of interesting. RJ asked about the recurrent backprop: they have a little section, I think in section four, on truncated backprop, where they truncate everything except roughly the last eight iterations. What that allows is that you can still have somewhat efficient training, but at inference time the model can still roll out its internal thought for as long as it wants. So efficiency-wise, training-wise, that helps, right? But I think it said they truncated it to eight loops or whatever; doesn't that mean that for those eight loops you can't parallelize, you have to do one after another? Yeah, I think so. I understand that if you truncate to some number then you're limiting the impact of that, but still, I'd be curious to see the overall impact on the training efficiency or speed of those recurrent layers.
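On that truncated-backprop point, here is a rough sketch of the usual trick: unroll the full recurrence, but only keep the autograd graph for the last k iterations. The `no_grad`/`detach` pattern is a standard PyTorch idiom assumed here, not code from the paper:

```python
import torch

def unroll_truncated(core_step, e, s0, num_loops: int, backprop_k: int = 8):
    # core_step(s, e) -> next latent state. Unrolls num_loops iterations, but
    # gradients only flow through the final backprop_k of them.
    s = s0
    skip = max(num_loops - backprop_k, 0)
    with torch.no_grad():
        for _ in range(skip):            # early iterations: forward only, no graph kept
            s = core_step(s, e)
    s = s.detach()                       # cut the graph at the truncation point
    for _ in range(num_loops - skip):    # last k iterations carry gradients
        s = core_step(s, e)
    return s
```

This also illustrates the sequential-cost point raised above: each step consumes the previous state, so the tracked iterations run one after another rather than in parallel.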
Yeah, it was eight. Is there any other question, or is it my internet? Oh sorry, when people join they're just unmuted, let me fix that. There is another question in chat though: how does it choose the depth, and would it make sense to recur until a stop token or signal is output? Do you want to explain what this internal thinking is doing and how it works? So in training they sampled random loop counts. You can't really make it a fixed parameter, because you don't know the complexity of the data, right, so you can't say the exact number of loops ahead of time in pre-training. But it actually works. As I mentioned, the recurrent part is not a novel idea; we know a lot of papers that did the same kind of thing, but some ideas should be rediscovered sometimes. I like the idea, especially for small models, where you don't have a lot of context size, so it's interesting at least.

I think at a high level we also have to remember what they're doing: they're training a 3B model from scratch on 800 billion tokens. It's very much a proof of concept. There are ways to synchronize recurrent depth and parallelize this stuff, but this is more so a proof of concept to show that it works and whether we should scale it up. The other thing we've seen with a lot of recurrent networks and state-space models is that you can do conversions: even if they're not peak efficient to train from scratch, you can convert Qwen to this style by replacing layers and adding stuff, so the optimization work kind of comes a bit later. There is stuff like this in CUDA, I know, but they probably got some AMD compute sponsorship to train this on MI300Xs, which is cool. I don't know if they would write all of that in the paper because it's not applicable for everyone, but that's a bit of the high level too.

Okay, any other questions or comments? Actually, they mention it's equivalent to 50 billion parameters, but I can't see that in the results, because in the results they only compare against the OLMo models, which are 7 billion. So okay, I think that's it. There were a lot of other sections, but I think the idea was to create a different technique rather than explicitly generating the output tokens, which is interesting at least as a different kind of architecture. I think it's worth testing; maybe we'll see other papers doing the same. That's why I like the paper, just to understand other kinds of techniques beyond generation and so on.

Awesome, well, thank you so much for volunteering to present. Do we have any other questions on these two? We still have a little bit of time. So scaling up test-time compute was more about how we do internal reasoning: if we don't want to externally spit out the reasoning, can we have a recurrent state piece in our model, which is kind of what recurrent networks used to do, and can we train it to internally think? And s1 was basically, can we get really good sample efficiency for reasoning tokens? Hopefully they both show pretty good promise. Yeah, actually they're pretty different, but the idea is the same. Okay, how does it decide to stop iterating on a token at inference time? I'm not sure, actually; maybe you control the loop yourself, it doesn't decide on its own.
No, I think it does, because if you look on page 31, in the appendix there's one of these heat maps, and it shows, for different tokens, how many steps it's inferencing for. So basically it's trained to keep doing this; we only backprop on eight steps, but that doesn't mean it can't unroll for however many iterations it wants, it's internal to the model. Or maybe I'm reading this wrong, maybe this is 64 steps and it's just... oh yeah, this is a log scale, it's the convergence time. Okay, so maybe you just configure it to reason for a certain number of steps, whatever that is. Yeah, that makes more sense, okay, got it.

Okay, very cool papers, pretty interesting to see other reasoning stuff. We have a few more minutes; anything else anyone wants to bring up, or should we talk about next week's paper? Okay, well, we'll end a little early and go straight into next week's stuff. Thank you so much for presenting, by the way. We always like volunteers; if anyone wants to volunteer, I think next week I'm doing one, then the week after we should have an opening, so if there's anything interesting anyone wants to do, feel free to volunteer and we can get that worked out. But yeah, thank you so much for presenting. Last minute we kind of switched up papers, so sorry for people that didn't get a chance to pre-read. It's always nice to be able to pop in, but you always get the most value if you do some basic pre-reading, come with your questions, and then we can discuss them; that's kind of the whole point.

For next week, Hugging Face put out this pre-training guide. It's kind of everything about scaling up LLMs, and they go very in depth, so it's a very long one. I'm hoping we can get it done in an hour. It starts with how do we train on one GPU, then multiple GPUs, how do we do MoEs, what training configs look like; basically everything you would want to know about training and scaling up models. I will go through it and condense it down: it's probably an eight-hour read, so how can we condense it into a high-level overview, 45 minutes of chat, and then five minutes of Q&A? It's a very, very good resource, I'll share it back in Zoom. Come prepared and read over the sections you find interesting; if you have any questions, come with questions. I'm sure this will be one where the Zoom chat is very active, since it's basically a book, but that's the paper for next week. Yeah, I started reading it, but it's huge, it's huge. Thomas Wolf, one of the co-founders at Hugging Face, apparently also did a walkthrough on YouTube, so if you like side content and want to listen to it, there's a YouTube version where he goes through it for quite a while. But yeah, next week, maybe just brush up on that; I'll try to have something in an hour that we can go through, and then it'll be good for Q&A. We'll have about 100 people here that are all somewhat versed on the topic. But yeah, cool, thanks everyone. Thank you, take care.