Scaling up Test-Time Compute: s1, Recurrent Depths

00:00:00.000 |
you will find it okay he's gone for context uh Swyx is at the DMV he is in the process 00:00:10.160 |
of getting his license so he's uh up for this one but let's let's give it a few minutes 00:00:18.080 |
for people to join in um and then we'll we'll get started these are pretty good papers um 00:00:25.720 |
nobody was so stressed out okay so we'll give it a few minutes for people to join um do 00:00:44.960 |
we want to get a quick intro of what's going on 00:00:58.480 |
he's left and hasn't muted people upon entering let me fix real quick 00:01:20.840 |
okay so Rafael you're kind of ready to present and share yeah sure perfect perfect um let's give 00:01:49.580 |
it another two minutes you want to do a quick intro yeah i'll start so um i'm Rafael uh 00:01:57.640 |
i'm working mostly on LLMs uh i'm working to create an LLM for the Georgian language uh Georgia 00:02:04.960 |
is a country in Europe and Georgian is a low resource language and uh yeah we already developed 00:02:09.720 |
our first model uh we started with like zero data zero compute and zero talent because uh only 00:02:17.360 |
a few people work on LLMs in Georgia so we managed to develop a model which like 00:02:23.880 |
outperforms uh larger models like GPT and Claude for the Georgian language and also it's 00:02:30.340 |
built for enterprise use cases and it supports function calling uh JSON structured output mode 00:02:35.960 |
and RAG and so on so if you want to test it i can um give you free uh API access and yeah 00:02:43.600 |
that's that's uh very sick very sick yeah share the info in Discord i'm sure a lot 00:02:49.240 |
of people are interested i know uh Eugene Cheah he presents quite a bit he's very interested 00:02:54.720 |
in low resource language models but um yeah so today we basically have the two papers s1 and 00:03:02.520 |
scaling up test-time compute i'll let you go through them however you want uh chat 00:03:07.080 |
normally gets pretty active so if you're monitoring that's cool if something is very confusing 00:03:13.120 |
i'll probably interrupt but you know take it as you want yeah sure sure okay 00:03:27.440 |
actually let's let's give it another minute some people are needing passcodes they're 00:03:31.760 |
just not using the right link but okay once we hit 50 we should be good 00:03:40.960 |
okay we've got a new link posted and then we're good to start 00:03:47.600 |
uh okay so today uh i want to cover two papers but both of them are about test 00:03:56.560 |
time scaling which is the latest uh hot topic the idea is that we want to build an LLM where uh 00:04:04.560 |
so we want to increase the performance so we know from the latest models like OpenAI Gemini 00:04:10.120 |
and Anthropic already that reasoning works the uh the thinking works so uh the idea is we 00:04:17.440 |
want to replicate that work and this paper from Stanford is trying to replicate the reasoning 00:04:22.960 |
step with uh like um only 200 or something like that so only a small budget the idea 00:04:29.840 |
is that uh so uh they collected 1k samples uh in some way i'll talk about it in a bit 00:04:38.560 |
and uh yeah the idea is that they train Qwen 2.5 uh 32B and it outperforms o1-preview actually 00:04:48.000 |
on the math and reasoning data sets that's the idea and um they uh they created some 00:04:56.000 |
kind of technique which is pretty simple i think and yeah mostly it's for the people 00:05:02.160 |
like me who don't have a lot of budget and resources to train the model and still have 00:05:06.840 |
the ability to train reasoning LLMs so uh yeah the objective is to create the simplest 00:05:15.400 |
approach to achieve test-time scaling actually so uh to deep dive into the methodology um 00:05:21.520 |
let me scroll a bit um yeah the first thing is that they somehow created uh 1000 examples 00:05:33.600 |
and uh what they did actually is took some math data sets like NuminaMATH uh AGIEval 00:05:40.480 |
and so on and they generated uh reasoning traces using Gemini 2.0 Flash Thinking and 00:05:48.680 |
they did some uh pipeline after that so first is that uh they collected uh almost like 60k 00:05:56.440 |
examples then they filtered on quality which means dropping some examples based on 00:06:01.960 |
the formatting some uh ASCII errors as i remember and uh yeah after that they got 54k uh and 00:06:11.160 |
yeah then they filter uh based on the difficulty and what they did is they took two models like 00:06:17.320 |
i'm not sure if you can see the image perfectly but they took uh two models Qwen 2.5 7B and 32B 00:06:26.440 |
and they asked the question to both models and if either of the models uh responds correctly 00:06:32.400 |
they just filter out those examples so the idea is that they want to choose the hardest 00:06:37.960 |
questions that neither of the models can answer and they evaluate the correctness using 00:06:43.600 |
Claude 3.5 Sonnet and after that they got uh 25k uh samples uh and also they uh took 00:06:53.720 |
the length of the reasoning so uh the reasoning trace so you can say that if the trace 00:07:00.120 |
is longer maybe it's a more challenging problem or something like that 00:07:05.400 |
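To make that difficulty filter concrete, here is a rough sketch of what it might look like (the solver and grader callables are hypothetical stand-ins wrapping the two Qwen models and the Claude grader, not the paper's actual code):

```python
def difficulty_filter(examples, solvers, is_correct):
    """Keep only the questions that neither reference model solves.

    `solvers` would be two callables wrapping Qwen2.5-7B and Qwen2.5-32B
    generation, and `is_correct` would grade a model answer against the
    reference answer (the paper uses Claude 3.5 Sonnet as the grader).
    All three interfaces are assumptions made for this sketch.
    """
    hard = []
    for ex in examples:
        attempts = [solve(ex["question"]) for solve in solvers]
        if not any(is_correct(a, ex["answer"]) for a in attempts):
            hard.append(ex)  # neither model solved it -> keep as a hard example
    return hard
```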
after that uh the key point is that since the idea is we want to uh run and train 00:07:11.720 |
the model on a uh small number of uh samples they have to do some uh sampling right so they took 00:07:20.200 |
uh some LLM and clustered those 25k examples by domain and after that they uh just do some 00:07:29.880 |
sampling to collect 1k examples and uh took them based on the reasoning trace length actually 00:07:37.160 |
so yeah after those pipeline steps they collected a 1k example data set which contains like 00:07:43.720 |
the question the reasoning and the answer uh so if there are any questions about that 00:07:50.400 |
I can stop a few seconds and then continue because it's important to understand that 00:07:55.600 |
it's the simplest approach I mean but yeah it's a pretty good way to collect and uh take some data 00:08:02.080 |
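And a toy version of that final selection step, just to show the shape of it (the field names and the uniform-domain/longest-trace heuristic are simplifications of the paper's actual length-weighted sampling):

```python
import random
from collections import defaultdict

def select_1k(examples, k=1000, seed=0):
    """Group the remaining hard examples by domain/cluster label, then
    repeatedly pick a random domain and take its longest remaining
    reasoning trace, until k examples are selected."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for ex in examples:
        by_domain[ex["domain"]].append(ex)
    for bucket in by_domain.values():
        bucket.sort(key=lambda ex: len(ex["reasoning"]), reverse=True)  # longest first
    picked = []
    while len(picked) < k and by_domain:
        domain = rng.choice(list(by_domain))
        picked.append(by_domain[domain].pop(0))
        if not by_domain[domain]:
            del by_domain[domain]  # domain exhausted
    return picked
```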
okay uh after that they trained Qwen 2.5 32B on it and yeah as you can uh imagine 00:08:13.080 |
they got some good results so yeah another thing that I want to explain before 00:08:23.040 |
I show the results is that uh after creating the data set they um created uh a technique 00:08:30.440 |
uh to increase the test time scaling the name is budget forcing or something like that 00:08:36.520 |
as I remember and the idea is that if you want to generate more tokens and more thinking 00:08:42.480 |
steps you add the word wait in the output and the model understands to continue the generation 00:08:50.120 |
and if you want to stop the thinking process and force the final answer you add the words 00:08:56.680 |
final answer in the generation and it stops the um thinking process yeah that's the budget 00:09:03.560 |
forcing here so they have a maximum number of tokens and a minimum number of tokens and 00:09:08.200 |
you have the ability to control the length of the traces based on those simple words that's the idea 00:09:13.520 |
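A string-level sketch of that control loop might look like this (the `generate_until` helper, the word-count proxy for tokens, and the exact cue strings are assumptions for illustration, not the authors' implementation):

```python
def budget_forced_generate(generate_until, prompt, min_think_tokens=500, max_think_tokens=4000):
    """Sketch of budget forcing: keep appending 'Wait' to extend the thinking
    phase until a minimum budget is spent, then force the final answer.

    `generate_until(text, max_new_tokens)` is assumed to return newly
    generated thinking text, stopping at the end-of-thinking delimiter or
    at the token cap; word count stands in for token count here.
    """
    thinking = ""
    while len(thinking.split()) < min_think_tokens:
        budget_left = max_think_tokens - len(thinking.split())
        if budget_left <= 0:
            break
        thinking += generate_until(prompt + thinking, max_new_tokens=budget_left)
        if len(thinking.split()) < min_think_tokens:
            thinking += "\nWait,"  # suppress stopping, make the model think more
    # append the final-answer cue to end the thinking phase
    return thinking + generate_until(prompt + thinking + "\nFinal Answer:", max_new_tokens=512)
```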
okay and about the results the first thing is that 00:09:24.080 |
it outperforms o1-preview and also it's comparable to the R1 distill from 00:09:31.360 |
DeepSeek the Qwen reasoning one and also Sky-T1 which came out a few weeks ago um yeah another 00:09:41.580 |
interesting thing that we can see um is that uh if you increase uh the thinking time 00:09:49.680 |
so if you generate more tokens the performance increases for some uh data sets like MATH 00:09:57.080 |
and GPQA uh okay yeah here's the example if you want to see how it works so let's say 00:10:09.120 |
how many R's are in raspberry uh and as you can see it got the wrong answer and they 00:10:15.280 |
add the word wait which continues the generation and after that it corrects itself and 00:10:22.400 |
yeah returns the correct answer okay um yeah another uh good thing is that uh as you can 00:10:33.920 |
see after some time the results uh converge so if you generate a lot of tokens or if you 00:10:41.880 |
add a lot of wait words in the generation after some time it converges that's another 00:10:48.920 |
uh result and also they compare this method to other uh methods like majority voting 00:10:57.520 |
and others and found that the test-time scaling curves for those uh techniques 00:11:05.560 |
are different so as you can see uh for um for this budget forcing uh the results go up 00:11:12.400 |
but for majority voting yeah only a little bit um yeah yeah that's also an interesting result 00:11:23.960 |
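For contrast, majority voting is the parallel way to spend test-time compute: sample several independent answers and take the most common one. A minimal sketch (the `sample_answer` callable is a hypothetical stand-in for one full generation):

```python
from collections import Counter

def majority_vote(sample_answer, question, n_samples=16):
    """Parallel test-time scaling: draw n independent answers and return the
    most frequent one. Budget forcing instead extends a single sequential
    chain of thought, which is why their scaling curves look different."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```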
so as you can see this is their model the final one and they compare the results to 00:11:30.760 |
OpenAI Gemini and Qwen uh and other open weights and open data models also as we mentioned 00:11:38.600 |
the data and the model all of these are open source so you can check it uh yeah and test 00:11:46.560 |
the data as well okay so yeah as you can see it outperforms uh o1-preview 00:11:54.040 |
on the MATH data set and on AIME but not on GPQA and similar results versus o1-mini uh for Gemini 00:12:03.320 |
we only have one result uh for the other two as i remember they couldn't test it because 00:12:09.640 |
Gemini blocks some uh blocks some uh API requests and for AIME this is only 30 or 40 examples 00:12:18.000 |
so they tested it manually in Google AI Studio and that's why we have only those results 00:12:23.440 |
and it's nearly comparable to Gemini and the idea is that they kind of distill 00:12:30.720 |
uh Gemini right and uh with only 1000 examples they got comparable results to Gemini actually 00:12:40.320 |
another interesting thing is that uh this is the base model that they took uh and yeah 00:12:45.800 |
the performance gain is a lot like 30 points uh on AIME and so on which is huge yeah 00:12:55.960 |
and uh some results from uh DeepSeek are comparable as well but the point they mentioned is 00:13:03.200 |
that uh in this case they only took 1000 examples while for DeepSeek they distilled with 00:13:11.080 |
uh 800k examples to train their models okay 00:13:19.400 |
okay after that they did some ablations uh and uh found some interesting things 00:13:33.360 |
i think uh so uh the first one is about how you choose the uh 1000 examples from the 60k and yeah 00:13:41.920 |
they choose for example at random or for diversity like based on the clusters 00:13:47.200 |
or cut to 1k based on the lengths of the traces or just train on the full data set 00:13:53.960 |
and as you can see the full data set outperforms all the 1k variants but it's comparable so yeah full data outperforms 00:14:01.680 |
a little bit but the idea is that we only use 1k examples rather than the 60k and got comparable results 00:14:12.240 |
uh yeah i found that last section pretty interesting basically uh these results show that 00:14:19.760 |
if they didn't go through their like four step approach of filtering down to a thousand samples 00:14:24.960 |
their results are 30 to 40 percent worse right so their whole thing of like step one see if the questions 00:14:32.320 |
are easy step two cluster and take like diverse samples that alone to like get the thousand 00:14:39.120 |
high quality versus just a thousand random samples is a 30 percent net difference so pretty big importance 00:14:48.560 |
yeah exactly so the pipeline is pretty simple i think but maybe the most valuable thing is 00:14:54.900 |
to filter based on the quality and uh based on the model responses i think rather 00:15:03.580 |
than that like the formatting i think doesn't matter that much actually 00:15:07.520 |
yeah the model response filter if i'm not mistaken from the two base Qwen models was basically 00:15:12.240 |
just throwing out a lot of it so like when they use the Qwen 7B and Qwen 32B base 00:15:17.960 |
models basically what they did was they ran their 60,000 questions through these and if the 00:15:23.980 |
two base models could answer it correctly then we just throw out the sample it's too 00:15:27.760 |
easy so step one you know that filters a good bit just to get hard questions but yeah high 00:15:33.560 |
level their whole pipeline was pretty interesting um yeah that seems to be like half of the 00:15:40.480 |
value so yeah a lot of it was just getting a good set of 1000 samples the other half was kind 00:15:47.840 |
of the budget forcing right so do you want to explain a bit more about how the budget 00:15:53.640 |
forcing is working and what it's doing i thought it was pretty unique yeah just to summarize 00:15:59.160 |
those are the important things so the data sampling that data pipeline and budget forcing actually 00:16:05.880 |
for the data uh side actually i tested this idea for my use cases not for the uh reasoning 00:16:12.720 |
uh thing but uh i collected and mixed a data set based on this technique and yeah it really 00:16:18.580 |
works because you can choose high quality data based on that and yeah about the 00:16:24.080 |
budget forcing um as i mentioned you uh have the ability to control the size of 00:16:32.200 |
the reasoning so let's say you have the minimum number of tokens that you want to generate 00:16:37.520 |
and the maximum number of tokens that you want to generate right and you can control 00:16:41.680 |
uh the length based on the wording so if you want to generate more you add a word like 00:16:47.200 |
wait like as you can see here and the model uh continues the generation and generates more 00:16:53.840 |
tokens and you can do that several times like two or three times if you want to generate 00:16:58.800 |
more and if you want to stop the generation you can add the final answer words and it stops 00:17:04.280 |
the generation so you can control the maximum number of tokens based on that yeah that's 00:17:10.880 |
the idea i'm not sure if any other work is uh doing something like that um but yeah so 00:17:19.320 |
after this came out some other people started doing it essentially what people are also 00:17:24.080 |
saying is it's a way some people are saying this is the distinction between o3-mini low 00:17:29.080 |
and o3-mini high and it allows you to control how much thinking you do right so as we generate 00:17:35.320 |
tokens autoregressively eventually we generate an end-of-sequence or end-of-text token 00:17:41.840 |
and then we stop generation so what they're doing is anytime the model's next likely 00:17:45.840 |
token is that end-of-sequence token we replace it with a thinking token you know 00:17:50.480 |
so hmm what should i do and like it continues thinking and now at inference time you can 00:17:56.120 |
kind of just force more thinking or if it's rambling too much you can you know force it 00:18:01.400 |
to end once it's given an answer if you look at the outputs or you can tell it you know 00:18:06.000 |
like final answer and start boxing your answer but the the interesting thing here is like 00:18:11.880 |
yeah it also had quite an effect so they show their scores um with and without budget 00:18:18.320 |
forcing but it's essentially like an inference time hack to control the amount of thinking 00:18:23.280 |
right so if you think of it like as an API consumer if i want thinking to be like 30 maybe i add 00:18:30.080 |
in three extra think tokens instead of the end-of-sequence token if i want like 60 i add in six 00:18:35.680 |
very high level just to understand conceptually what's going on but that's kind of what the budget forcing here was 00:18:40.640 |
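At the token level, one way to picture that trick is a logits processor that masks the stop/end-of-thinking token until a minimum thinking budget has been spent; this is a generic Hugging Face transformers-style sketch of the concept (the token id and thresholds are placeholders, and it is not the s1 repository's code):

```python
import torch
from transformers import LogitsProcessor

class ForceThinkingProcessor(LogitsProcessor):
    """Block the end-of-thinking token until `min_think_tokens` have been
    generated, so the model has to keep reasoning. This captures the spirit
    of budget forcing; the paper's concrete mechanism is appending 'Wait'."""

    def __init__(self, end_think_token_id: int, prompt_len: int, min_think_tokens: int):
        self.end_think_token_id = end_think_token_id  # placeholder id
        self.prompt_len = prompt_len
        self.min_think_tokens = min_think_tokens

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        generated = input_ids.shape[1] - self.prompt_len
        if generated < self.min_think_tokens:
            scores[:, self.end_think_token_id] = float("-inf")  # can't stop thinking yet
        return scores
```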
um another comment RJ made is that it only had a minor improvement 00:18:47.960 |
over the base Qwen model it seems pretty high do you want to go to the performance benchmarks 00:18:53.680 |
like compared to the base yeah this is the base the base is pretty bad so uh the base AIME score 00:18:59.320 |
of Qwen 32B is 26.7 the s1 score is 50 and 56 with budget forcing so yeah this is the result 00:19:09.400 |
and this is all with budget forcing oh right i was misreading i was looking at the 00:19:13.800 |
QwQ yeah okay got it yeah QwQ is the Qwen reasoning model preview they haven't released 00:19:21.960 |
it yet i think they have a second update that they put out this week someone on the Qwen 00:19:26.840 |
team on twitter is cooking about this and keeps saying he's doing it but that's Qwen's 00:19:31.320 |
internal reasoning model it's not just a base instruct it's also just um not 00:19:35.520 |
out yet yeah that's what surprised me that with only 1000 examples you can just run fine 00:19:43.320 |
tuning with SFT and get a really big improvement over the base model that's uh yeah yeah 00:19:50.080 |
it's kind of it's not necessarily just fine tuning with SFT right it's also distillation 00:19:56.240 |
because they're distilling Gemini thinking traces so it's not like a distillation loss 00:20:01.420 |
but it's SFT to match like um Gemini's thinking output plus a little hack of think a little 00:20:09.000 |
more if you're not on the right path but um the interesting thing there is still that the 00:20:13.920 |
800,000 samples does this very well with distillation uh so QwQ is Qwen's internal native RL based 00:20:22.400 |
reasoning model and if we look at it the performance is pretty similar right like the model is 00:20:27.320 |
not out yet but um performance wise it's pretty much just as good the difference here 00:20:35.000 |
with s1 is probably that you know they're similar to the 800,000 samples like i don't 00:20:41.000 |
think they said but i think it's a fair assumption that they're very yeah they're in the hundreds 00:20:45.740 |
of thousands of samples yeah exactly and they just spent uh i don't remember 26 minutes 00:20:54.800 |
on 16 H100s to run the training yeah a thousand samples not much but it's very cool work 00:21:02.040 |
to see and then this also leads to here's a rumor of what OpenAI does for more 00:21:07.480 |
and less reasoning thinking models and also here's a little hack for how you 00:21:11.960 |
can do it yeah yeah and also they compare this budget forcing technique to other 00:21:17.480 |
uh conditional control techniques like token-conditional step-conditional and class-conditional control 00:21:24.280 |
which is like uh you say let's uh think for 100 words or something like that or let's 00:21:30.000 |
think in three steps and uh yeah they did some evaluation and they found budget forcing 00:21:36.440 |
is the best uh rather than those techniques yeah so also they did ablations about the 00:21:45.360 |
word actually so uh they compared like wait versus alternatives like alternatively or without any 00:21:54.680 |
string or something like that and yeah wait works the best it's also interesting yeah 00:22:02.200 |
uh some of the other like little local stuff that has come out after this paper is 00:22:09.640 |
instead of just um appending the wait word or something people started slapping in aha 00:22:15.960 |
moments and like people themselves have started testing quite a bit of this right so instead 00:22:21.640 |
of the end token what they found best was to you know give it wait which is like wait 00:22:26.520 |
i've figured this out and it's like that's pretty cool it's an interesting 00:22:30.600 |
idea people locally have tried like you know 50 other things and like uh one thing i was 00:22:36.520 |
hoping someone would do is train a classifier to see what's the best swap word to put in 00:22:41.480 |
and like maybe it does slightly better but now you have to run a classifier but it's fun little 00:22:46.560 |
stuff yeah yeah so the idea of this paper and the motivation is that this is the simplest technique 00:22:53.800 |
i think which um gives you a big improvement in terms of results like relative to the 00:23:01.240 |
base model yeah and it's mostly the same as Gemini and so on and that's the main idea i think yeah 00:23:09.000 |
uh yeah uh what else i think that's it pretty small paper but interesting and yeah 00:23:23.000 |
all the results are on math related uh data sets yeah i wonder if there's a 00:23:32.280 |
certain degree of performance degradation on other 00:23:39.640 |
data sets that require reasoning but maybe are not math based yeah that's a good question 00:23:48.680 |
but yeah for this paper they just only test on the math well no GPQA is not just math and they go 00:23:56.680 |
into their data set curation uh the majority like the 50,000 samples out of the 60,000 they 00:24:03.320 |
started uh subsetting from that whole data set include quite a few other non-math-based benchmarks 00:24:11.560 |
and then part of their clustering is to um like not just sample math so their Sonnet um clustering 00:24:21.080 |
does diversity past just math so they have standardized tests like the LSAT in there they have the 00:24:26.200 |
hardest questions of the LSAT they have chem bio physics they have other stuff in there and 00:24:31.000 |
then the GPQA benchmark shows that it's not just math the other two are math but at a high level 00:24:38.200 |
one thing that we've seen with reasoning models is like math and verified code kind of transfer to 00:24:44.200 |
quality on other uh just general tasks but one of their um benchmarks here is not math 00:24:51.160 |
yeah people um did you see uh did anyone try like replicating this with some of the same 00:25:02.840 |
uh models that the um R1 model was distilled to because i would be interested to see 00:25:10.040 |
like okay if we use this technique versus distillation from like the R1 like god model 00:25:17.000 |
what happens like how does that compare because it's a little bit apples to oranges to compare to 00:25:23.000 |
like a distill right i mean this R1 distill but is that like the same size or like 00:25:29.080 |
which one is that uh so R1 like DeepSeek themselves put out um a Qwen 32B R1 like a Qwen 00:25:37.560 |
32B reasoning model so they did the exact same model so in the opposite sense s1 did this on 00:25:44.200 |
the same model that DeepSeek did it on so we have the direct comparison i think and i could be completely 00:25:50.040 |
wrong here i don't think DeepSeek told us how much data they use for distillation they just 00:25:54.520 |
said we distilled the big ones here you go they're good here's how they perform but like 00:25:58.840 |
you know we don't know did they do 500 samples did they do 50,000 we don't know what they did 00:26:04.200 |
for distillation except for i think they did a proper distillation loss not just SFT on outputs 00:26:10.040 |
and it's a little different too because like this is trained on um outputs from Gemini that would 00:26:17.400 |
be on outputs of R1 so like there's a little stuff different but directly to your question 00:26:23.080 |
DeepSeek did do the same model you can compare the two and i don't know if someone wants to 00:26:28.440 |
fact check or make a note i don't think the DeepSeek paper said how many samples are used 00:26:34.440 |
for their distillation right yeah but but is that result in this paper did anyone figure out 00:26:42.040 |
is that the same is that R1 distill the like equivalent Qwen model the 32B 00:26:50.040 |
because like regardless of how much i mean obviously it's like i mean to me this is an 00:26:57.080 |
amazing result right like you use minimal compute and get something very close to the same 00:27:02.920 |
performance even if it's not up to par i'd just be curious to see how it compares 00:27:08.840 |
so um two things one is i think what they're going after here is sample efficiency right so 00:27:18.360 |
they want to show how you can do it in a couple thousand samples and two i think we can look at 00:27:25.400 |
the um R1 distill so apparently R1 distill is better according to my quick let's-ask-ChatGPT 00:27:35.240 |
but it could have hallucinated so yeah i just did it and it says R1 distill is better i don't 00:27:45.080 |
trust it okay so it looks like um they did give numbers i remember this so they used um 600,000 00:27:57.720 |
reasoning SFT samples also thanks um Leah for pointing this out so for R1 the original DeepSeek paper they used 00:28:06.360 |
800,000 samples roughly 600,000 reasoning traces plus 200,000 non-reasoning SFT samples and i think they actually yeah they do put 00:28:13.000 |
it in the paper here so uh they have benchmarks of the R1 distill right there 00:28:18.920 |
it's the third row but um sample efficiency you know that's the difference but also the 00:28:24.840 |
technique for distillation was different right so that is the largest yeah totally i totally get 00:28:31.560 |
uh like this is it's going after something different it's just an interesting 00:28:36.520 |
comparison but i guess yeah so that distill presumably is the 32B Qwen yes it is uh 800,000 00:28:49.560 |
um well um do we want to move on to second paper or any other closing thoughts on this one i'm 00:29:05.240 |
sure if there's a discussion in chat we can always pop back in but we have a second paper 00:29:15.800 |
yeah so the second paper is scaling up test-time compute with latent reasoning 00:29:24.520 |
uh a recurrent depth approach so the topic is the same we want to build a reasoning model but 00:29:30.440 |
the idea is that most of the models that we already know generate the reasoning and thinking tokens 00:29:37.400 |
in the output right and in this paper they create an architecture which doesn't explicitly produce 00:29:45.240 |
thinking tokens in the generation it thinks and reasons in the architecture itself that's the 00:29:49.800 |
main difference so now why do we need that kind of model because uh if you want to generate like 00:29:58.040 |
as you can see the latest models generate thousands of tokens in their thinking and reasoning 00:30:02.920 |
so it takes a lot of time also you need to have a big context window which doesn't work for 00:30:10.360 |
small models and also as i mentioned there are some reasoning abstractions that you can't explicitly 00:30:21.080 |
say in words so you need to have some high dimensional latent space where the model 00:30:28.200 |
reasons within the architecture itself yeah that's the idea and about the methodology so they create 00:30:36.840 |
the architecture so they add a recurrent block in the transformer and the idea is that when you 00:30:44.200 |
train the model you run it several times in a loop so as i remember they took random numbers so 00:30:53.320 |
sometimes you run three times maybe five times and so on and yeah that's most of the idea 00:31:01.000 |
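As a rough mental model of the recurrent-depth idea being described, a PyTorch sketch might look like the following (the layer sizes, the random latent initialization, and the way the input embedding is re-injected every loop are simplifications, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Minimal sketch: a prelude embeds the tokens, a shared core block is
    looped over a latent state, and a coda maps the final state to logits."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8):
        super().__init__()
        self.prelude = nn.Embedding(vocab_size, d_model)
        # the recurrent core: one transformer block reused on every iteration
        self.core = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.coda = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids, num_loops=8):
        e = self.prelude(input_ids)      # token embeddings
        s = torch.randn_like(e)          # random initial latent state
        for _ in range(num_loops):       # more loops = more latent "thinking"
            s = self.core(s + e)         # state updated, conditioned on the input
        return self.coda(s)              # logits

# at test time you trade extra FLOPs for (hopefully) better answers by looping more
model = RecurrentDepthLM()
logits = model(torch.randint(0, 32000, (1, 16)), num_loops=32)
```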
so they took a 3.5B model and they trained it on 800 billion tokens and as i mentioned it is equivalent to 00:31:10.680 |
like a 50 billion parameter model so the result is pretty impressive as well 00:31:15.480 |
yeah and as you can see in the results as you run more loops at test time the 00:31:28.440 |
performance increases and for some benchmarks it converges quickly but for GSM8K it keeps 00:31:37.720 |
improving actually okay um yeah so uh why did they train the model this way as i mentioned 00:31:49.080 |
in this case you don't need to have uh specialized data for the 00:31:55.400 |
training uh also this technique uh takes less memory so the idea is that you use a small 00:32:02.920 |
model but more compute to increase the performance and yeah as i mentioned it takes more FLOPs 00:32:10.360 |
but it's more efficient since you have small models you don't need to have the 00:32:16.920 |
communication cost of the interconnect and it's much faster as i mentioned 00:32:25.560 |
yeah that's the architecture uh yeah there are some uh sections on uh topics like why they use 00:32:34.360 |
the embeddings the way they do and uh yeah some uh improvements in efficiency because 00:32:40.040 |
they train this model on AMD hardware uh with specialized GPUs as i remember and as i mentioned 00:32:48.440 |
they uh needed to create some kind of tricks uh to make the training fast 00:32:54.360 |
uh yeah so and like the previous paper they also published their pre-training code 00:33:03.320 |
and uh the model and the data okay so uh what they did actually is that uh 00:33:12.040 |
since they don't have a lot of money they just took some already 00:33:16.840 |
published data sets and just created some kind of mix and uh yeah mostly they wanted to test the 00:33:25.080 |
model on math and reasoning tasks so most of the data comes from coding and math data sets 00:33:32.200 |
and as i mentioned the mixing strategy maybe is not the best but the idea is 00:33:38.360 |
just to test if this model architecture works so um yeah um and they did some uh 00:33:47.160 |
they published the parameters the hyperparameters uh they did some packing and so on one of the 00:33:54.440 |
problems that they had in the training was that uh since it's a different architecture 00:34:01.080 |
they mentioned that the hyperparameters and the setup uh matter actually so the first run this is 00:34:09.000 |
the orange one didn't have good performance but after some tweaking like the learning rate and 00:34:15.480 |
some hyperparameters they got some good results the blue one which is the latest the main model 00:34:23.160 |
and yeah the loss goes down um and yeah they uh i'm not sure about the number of loops 00:34:36.520 |
and yeah so when i see the results they don't look uh as good compared to the previous paper 00:34:47.240 |
but we can see some uh improvements from the loop i think so yeah if you check 00:34:53.400 |
the results it uh doesn't outperform most of the models actually so they compare against 00:35:01.160 |
the OLMo models and Pythia which is pretty old i think already uh and yeah the result is uh 00:35:09.160 |
not so good but uh the main thing is that performance increases if you include the loop as 00:35:16.280 |
well uh but yeah as you can see 49 versus 69 for ARC-Easy um yeah and uh 00:35:43.800 |
and yeah also one thing is that they check whether uh this architecture has emergent abilities 00:35:52.840 |
like for example chain of thought and some skills that we know from other LLMs right 00:35:57.960 |
so they test those ideas and see that this architecture also works uh with context 00:36:05.000 |
so if you give it for example few-shot examples it improves the performance so just to test the 00:36:11.640 |
idea that it doesn't go wrong for other abilities uh yeah they publish some early checkpoints 00:36:20.440 |
as well and uh yeah as you can see performance goes up uh there was yeah also they compare to a 00:36:34.200 |
baseline so the baseline takes the same data set uh and runs just uh with the standard training 00:36:42.920 |
technique without the recurrent architecture and uh yeah as you can see performance uh is much better 00:36:49.240 |
as well so for example 46 versus 69 i think this is a good result uh because we know that with the 00:36:57.480 |
same data set with the um same compute let's say you get much better performance and also they 00:37:07.400 |
compare with r equals one and yeah actually it has less performance than the baseline 00:37:14.360 |
that's another uh interesting thing as well yeah they uh checked few-shot examples and as you 00:37:22.280 |
can see if you use zero-shot it converges quickly but if you use uh more shots yeah 00:37:31.080 |
yeah so in this case they uh check that uh the model needs more uh loops in the architecture 00:37:47.560 |
uh when the problem is complex let's say so for example high school mathematics 00:37:54.840 |
needs uh five loops but uh logical or moral scenarios need much more let's say 20 or 25 00:38:04.360 |
uh yeah some of the inference time loops are kind of interesting so RJ asked about the recurrent 00:38:14.600 |
backprop so they have this little section here in section four on uh truncated 00:38:21.160 |
backprop where i think they truncate everything except for like the last eight iterations or something and 00:38:28.680 |
then so what that allows is you can still have somewhat efficient training but at inference 00:38:34.600 |
time you know the thing can still like roll out its internal thought for as long as it wants 00:38:41.080 |
but they have a little section somewhere i think it's in section four on truncated 00:38:45.320 |
backprop but um efficiency-wise for training that helps 00:38:51.560 |
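My reading of that truncated-backprop trick, as a sketch (reusing the toy `core(state + inputs)` convention from the earlier model sketch; not the paper's code):

```python
import torch

def unrolled_forward(core, state, inputs, num_loops, backprop_last_k=8):
    """Run `num_loops` iterations of the shared core, but detach the state on
    all but the last `backprop_last_k` iterations, so gradients only flow
    through the tail of the loop while inference can unroll as far as it wants."""
    for i in range(num_loops):
        if i < num_loops - backprop_last_k:
            state = core(state + inputs).detach()  # no gradient through these steps
        else:
            state = core(state + inputs)           # gradients flow through the last k
    return state
```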
right but but like i think it was saying they truncated it to like eight 00:39:01.240 |
loops or whatever but doesn't that mean that like for those eight loops you can't 00:39:07.720 |
parallelize right like you have to do one after another 00:39:11.160 |
yeah i think which i mean i understand if you truncate it at some number then you're limiting 00:39:19.560 |
the impact of that but still uh i'd be curious to see like the 00:39:27.000 |
impact overall of those layers that are recurrent um what's the impact on the training efficiency or 00:39:41.000 |
is there any question or my internet oh sorry when people join they're just unmuted so 00:40:04.380 |
yeah um there is another question in chat though how does it choose the depth would it make sense 00:40:13.720 |
to recur until the stop token slash signal is output do you want to kind of explain what this 00:40:19.000 |
internal thinking is doing and how it works uh so actually in the training they sample some random 00:40:28.760 |
numbers i'm not sure why they did that because actually it could be 00:40:33.320 |
a learnable parameter because you don't know the complexity of the 00:40:37.720 |
data right so you cannot say the exact number of the loops in the pre-training 00:40:43.800 |
but uh yeah actually it works and uh yeah i'm not sure 00:40:55.080 |
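For the random loop count itself, the paper (as I recall) draws the per-step recurrence from a log-normal Poisson distribution around a target mean; a toy stand-in with made-up parameters:

```python
import torch

def sample_num_loops(mean_loops=32, sigma=0.5):
    """Sample a random recurrence depth for one training step: draw a rate
    from a log-normal around `mean_loops`, then a Poisson count from that
    rate. The exact parameters here are illustrative, not the paper's."""
    rate = torch.exp(torch.randn(1) * sigma + torch.log(torch.tensor(float(mean_loops))))
    return int(torch.poisson(rate).clamp(min=1).item())

# each training step would unroll the recurrent core a different number of times
num_loops = sample_num_loops()
```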
yeah as i mentioned this uh recurrent work is not a novel idea we know a lot of papers 00:41:01.560 |
that did the same kind of things but yeah as i mentioned some ideas should be rediscovered 00:41:07.720 |
sometimes so yeah but i like that idea especially for small models i mean you 00:41:15.240 |
don't have a lot of context size and uh yeah that's interesting at least so 00:41:20.680 |
i think at a high level as well we have to remember what they're doing right they're 00:41:24.600 |
training a 3B from scratch on 800 billion tokens uh it's very proof of 00:41:34.040 |
concept right so there are uh ways where you can synchronize recurrent depth and parallelize this 00:41:41.480 |
stuff but this is more so a 00:41:46.600 |
proof of concept to show that this works and whether we should 00:41:51.000 |
scale it up the other thing we've seen with a lot of um recurrent networks and state space models is 00:41:57.560 |
you can do conversions so even if they're not peak efficient you can convert Qwen to this style 00:42:05.000 |
by you know replacing layers and adding stuff so all the optimization stuff kind of comes a little bit 00:42:11.560 |
later but um there is stuff like this i know in CUDA but they probably got some AMD 00:42:18.280 |
compute sponsorship to train this on MI300Xs which is cool now i don't know if they would 00:42:24.040 |
write all that in the paper because it's not applicable for everyone but just that's a little 00:42:28.280 |
bit of high level too yeah okay any other questions or comments 00:42:41.880 |
actually they mentioned that it's equivalent to 50 billion parameters but i can't see that in the results 00:42:53.960 |
because in the results they compare only against uh 00:43:28.120 |
okay i think that's it yeah there were a lot of other sections but i think the idea 00:43:36.680 |
was to create a different technique rather than explicitly generating the output tokens 00:43:43.000 |
which is interesting at least a different kind of architecture so i think it's worth testing 00:43:50.680 |
maybe we'll see other papers that do the same so that's why i like that paper just 00:43:59.800 |
to understand other kinds of techniques rather than the generation and so on so awesome well 00:44:07.720 |
thank you so much for volunteering to present do we have any other questions on these two we still 00:44:13.160 |
have a little bit of time um so scaling up test time compute was more so how we do internal reasoning 00:44:20.200 |
if we don't want to externally spit out this reasoning can we have a recurrent state piece 00:44:26.760 |
in our model that's kind of what recurrent networks used to do can we can we train this 00:44:32.120 |
to internally think and then s1 was basically can we get really good sample efficiency for 00:44:38.760 |
reasoning tokens and yeah hopefully they show pretty good promise 00:44:46.280 |
yeah actually this is pretty different but yeah the idea is the same 00:44:50.840 |
okay um how does it decide to stop iterating on a token um at inference right 00:45:12.840 |
maybe you control the loop no it doesn't decide itself no i think it does because if you look on 00:45:25.080 |
page 31 in the appendix there's one of these heat maps um and it shows that for 00:45:34.520 |
different tokens it's inferencing for different numbers of steps yeah for different tokens 00:45:42.920 |
yeah so basically it's trained to keep doing this but we only backprop on eight steps but that 00:45:53.000 |
doesn't mean that it can't unroll for however many iterations it wants to do it's internal to 00:45:58.600 |
the model or maybe i'm reading this wrong maybe this is 64 steps and it's just oh yeah this is 00:46:04.280 |
the log like the convergence time oh yeah okay so maybe it doesn't you just configure 00:46:11.720 |
it to reason for a certain number of steps and whatever that is yeah that makes more 00:46:17.320 |
sense okay got it okay um very very cool papers pretty pretty interesting to see other reasoning 00:46:38.200 |
stuff um so we have we have a few more minutes um anything else anyone wants to bring up or 00:46:44.600 |
should we talk about next week's paper okay well we'll end a little early and go straight into 00:46:58.520 |
next week's stuff uh thank you so much for presenting by the way we always like 00:47:03.160 |
volunteers if anyone wants to volunteer i think um next week i'm doing one then the week after 00:47:10.920 |
we should have an opening so if there's anything interesting anyone wants to do feel free to you 00:47:16.200 |
know volunteer and then we can we can get that worked out but yeah thank you so much for presenting 00:47:22.360 |
last minute we kind of switched up uh paper so sorry for people that didn't get a chance to 00:47:29.480 |
pre-read it's always nice to be able to pop in but you know you always get most value if you 00:47:34.600 |
do some basic pre-reading come with your questions and then we can discuss them 00:47:38.920 |
so that's kind of the whole point of it um for next week Hugging Face put out this um 00:47:45.640 |
pre-training guide so it's kind of everything about scaling up LLMs and they go very in depth so 00:47:53.160 |
it's a very long one i'm hoping we can get it done in an hour but you know it starts 00:47:59.080 |
with how do we train on one GPU multiple GPUs um how do we do MoEs like what are training configs 00:48:07.880 |
basically everything you would want to know about training and scaling up models so i will go 00:48:14.520 |
through this and condense it down to what's a TL;DR of this probably like eight hour read how can we 00:48:22.680 |
condense this down into like you know a high level overview 45 minutes of chat and then five minutes 00:48:29.720 |
of Q&A but a very very good resource i will share it back in Zoom um i think you know come prepared 00:48:38.920 |
and just read over the sections you find interesting if you have any questions come with 00:48:44.200 |
questions i'm sure this will be one where the Zoom chat will be very very active since it's basically 00:48:49.880 |
a book but um that's the paper for next week yeah i started reading it but it's huge it's 00:48:58.120 |
huge uh Thomas Wolf one of the co-founders of Hugging Face apparently also did a walkthrough 00:49:03.000 |
on YouTube so if you guys like side content want to listen to it there's a YouTube version where 00:49:08.760 |
he goes through this for quite a while but yeah next week um maybe just brush up on that i'll 00:49:15.160 |
try to have something in an hour that we can go through and then it'll be good for Q&A you know 00:49:19.080 |
well we'll have about like 100 people here that are all somewhat versed on the topic but yeah 00:49:26.840 |
cool thanks everyone yeah run thank you take care