RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

I just wanted to make sure that the recording was going to work properly. Okay. I did not prepare much for this, and honestly there are a lot of other papers going on as well, so I don't have to take up the full time; there are always people interested in other things. Which other papers are people interested in?

I mean, just the stuff that was dropped in the channel.

Okay. It's useful to mute people; there's a button on the bottom right to mute all, or to mute people on entry, because people don't realize they're unmuted.

Okay, where's the mute-all-on-entry setting? I can't seem to find that button either.

Bottom right, there's a little drop-down.

Okay, got it: mute participants on entry. RJ, were you saying something? I can't hear you at all.

Sorry, I was not muted, but no, I wasn't saying anything. I'm at a cafe, so there's a lot of background noise.

Got it. Okay, I guess I'll go through RADLADS and then the large language diffusion model, since — I don't know whether you all saw the tweets recently — Google apparently just made a decent diffusion model, so AI is apparently going to get much faster today.
Okay. So today is just going to be a collection of papers. I will kick off with my own, which is RADLADS, where we show how to take an existing Qwen 72B model and convert it to another architecture altogether. After that I will go through the diffusion model paper, and then we'll pick up any other suggestions from there.

So, the paper that we are covering — I'm just going to put the link here. I think a few of you may have already heard me mention this before, but this was essentially the basis for our Qwerky 72B, where we took an existing Qwen 72B model — let me find a better picture — and were able to train it with a new attention mechanism while retaining most of the performance. The paper we subsequently published is more to document that process at a high level, and also, in particular, what we found did not work. The idea here is that we want to encourage research into new attention mechanisms, and this technique doesn't apply just to RWKV: the approach unlocks new possibilities not only for RWKV but also for state space models (Mamba and the like), xLSTM, and so on.
What we essentially realized is that the majority of the feed-forward network layers can be recycled. There's actually a similar paper that came out recently — something along the lines of MLPs being similar across architectures, I can't remember the exact name — where a team compared the Llama model, the Mistral model, and I believe one other, and found that the MLP layers' weight values end up approximately similar to each other after optimization, despite being trained on different datasets. So there's a hypothesis floating around right now that if you train on the same large web data, your MLP layers are going to converge to roughly the same thing. We also reaffirmed that hypothesis: you can freeze those layers and just swap out the attention layer.
So, what we did subsequently — and to be clear, this is not the first method for converting from one architecture to another. We did outline a comparison against existing methods, acknowledging the previous work accordingly. But I think what stood out for us is that we are probably one of the first teams to go through this process while training on only around 500 million tokens and still see some benchmarks actually improve. In previous conversion attempts you typically see a dramatic drop in performance — which, in my opinion, is not really the biggest issue, because when you convert from a quadratic to a linear model, even if you drop, say, 20% in overall performance, the inference savings are substantial enough that this could still be a viable path towards a more scalable model. But with the updated techniques in RADLADS we're able to do this much more rapidly. I forgot to explain the name: it's Rapid Attention Distillation to Linear Attention Decoders at Scale — we are able to rapidly change one attention mechanism into another. The "linear" part is just because our examples are linear architectures; like I said, this can be done with other attention mechanisms too. Right now we're actually helping two universities, and talking to another two, whose teams want to test variations of the transformer attention model using the same technique.
So what we did was: we take the existing model, we freeze the feed-forward network layers, and we swap out the attention layer. Then we distill the information in between. I know a lot of people get confused about the distillation since I started conveying the idea, and it's covered in the paper: when we say distillation, we mean distillation from the same model — from the original model — and it's not just logits distillation. In the first few steps of the process we distill between the layers. For example, you take a single block, and as the forward inference passes through it, we take the incoming embedding and the outgoing embedding and use those for the distillation, not the logits themselves. So it's distillation between the layers, because in this case we are replacing the attention mechanism. Then subsequently you fine-tune, to stabilize things and weave everything together.
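To make that concrete, here is a rough sketch of the per-layer distillation step. This is not our actual training code, just an illustration of the idea, and the names (`teacher_attn`, `student_attn`, `hidden_in`) are made up for the example:

```python
import torch
import torch.nn.functional as F

def layer_distill_loss(teacher_attn, student_attn, hidden_in):
    """Match the replacement attention block's output to the original block's
    output for the same incoming hidden states, instead of matching final logits."""
    with torch.no_grad():
        target = teacher_attn(hidden_in)   # outgoing embedding from the frozen original attention
    pred = student_attn(hidden_in)         # outgoing embedding from the new (e.g. linear) attention
    return F.mse_loss(pred, target)

# At this stage only the swapped-in attention blocks receive gradients;
# the feed-forward (MLP) layers stay frozen.
```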
Since we released this paper — background: I covered the high level there — what a lot of people pointed out and really liked about it, after the logits distillation experiments and the benchmarks, is the segment I wanted to highlight: what did not work. With this conversion method there are many ways you can get it wrong, and these are some of the things we actually iterated through, so we documented what did not work. We would also like to clarify that when we say something did not work, we mean very specifically that it did not work for RWKV. It's possible that as we experiment with other attention mechanisms, some of the things we flagged as not working will work elsewhere — in fact, even though we listed them here, we're already breaking those rules in one of the experiments we're currently doing. For example, one of the shortcuts we tried, which we thought would work, was to take the existing attention module weights and copy them over to RWKV — since at the end of the day the QKV is still matrix multiplication, the weights can be copied over, so we figured maybe it would work. It did not. We ablated it and tested against starting the attention mechanism from scratch, and it was pretty much the same score, so we ended up skipping that.
Even though I mentioned freezing model weights, one thing to point out is that you only want to freeze them at stage one. You go through several stages: in the first stage you replace the attention layer and do a quick training run, and then subsequently you will want to unfreeze the MLP layers. The reason we found this was needed is that otherwise the performance will drop, especially over longer context lengths. Batch sizing is also extremely tricky — I think that's the case for any large model, but it's particularly so in this process. And to be clear — and this was further confirmed with one of the upcoming reasoning models that we converted — you will need a dataset that's reflective of the data that went into the original model. For example, we recently converted a reasoning model — spoiler: it's the latest Qwen 3 model — and we didn't release it, because we did the conversion without a reasoning dataset, so when it tried to produce reasoning tokens and so on, it just went crazy. We are now redoing the process with reasoning tokens, and we believe it will perform much better. So your dataset should be approximately the same as the original model's dataset, which is a challenge, because we have absolutely no idea what Qwen's dataset is. That's one thing to take note of. Another thing we tried, for example, was doing the conversion LoRA-based; it just made things a lot worse.
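To make the staging I described above a bit more concrete, here is a rough sketch of which parameters we let train at each stage. Again, this is not our actual code — the stage numbering and the name-matching convention are invented for the illustration — and, as I mention later in the Q&A, the final stages also switch the objective from hidden-state matching to logit distillation:

```python
def set_trainable(model, stage: int):
    """Hypothetical helper: choose which parameters receive gradients per stage.
    Stage 1: only the swapped-in attention blocks train; everything else is frozen.
    Stage 2+: the MLP / feed-forward layers are unfrozen as well, which we found
    was needed for performance to hold up over longer context lengths."""
    for name, param in model.named_parameters():
        is_attn = "attention" in name   # made-up naming convention
        is_mlp = "mlp" in name
        param.requires_grad = is_attn if stage == 1 else (is_attn or is_mlp)
```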
The reason we decided to document all the things that did not work is that, once again, there are a lot of benefits to this approach for new attention mechanisms, not just RWKV, and we wanted to outline some of the key things.

For those who don't know what RWKV is — and I think most people in this group already do — it's a linear attention mechanism that my group has been building and scaling up. With this, we have now built the world's largest 72B model that doesn't use the transformer attention mechanism; the benefit is that it scales linearly, with over a thousand times cheaper inference cost. But what I find more exciting about covering this paper is that it's not really about my architecture. I'm on record saying that I don't think our architecture is final and that there's so much more improvement to be made. If anything, I actually think there needs to be more research into other ways to build AI architectures. One of the things I said at ICLR, where I presented an earlier version of this, is that the largest models to date are still not reliable enough to do cashier duty or day-to-day work tasks. Scaling does help, but if you follow the scaling laws — roughly, scale 10x to get a 10% improvement — at some point the entire energy output of the Earth would be required to train the model. So at some point we do need a better attention mechanism: to lower the energy cost, or to make things more reliable, or simply to make things scale better.
Previously, most of the existing labs avoided doing this, because to validate any new architecture — and this is from our experience — you need to iterate around 20 to 80 times before you find an architecture with a 10% improvement; you're going to get more failures than successes. And when each try can cost up to five million dollars in training, you're looking at tens of millions of dollars to experiment on a new architecture — and those tens of millions could instead just go to training on more tokens. But this changes things: if it's now $2,000 to $20,000 to test a new attention mechanism, it unlocks the door for literally anyone who is willing to spin up a RunPod instance, for example. That's why we expect more experiments along these lines.
Which links very well to the very recent news about the Gemini diffusion model. I think I've also said that diffusion models are interesting because there are two things I want to find out. One is that diffusion models, at least for images, have been highly resistant to — sorry, not catastrophic forgetting — to overfitting, in the sense that you may have heard of large language models overfitting after they see the data three or four times, or even two times, and then starting to go crazy or degrade, whereas diffusion models, especially in the image space, are typically trained over a thousand passes on the same data and do not go crazy. I've been saying that I want to see more success in text diffusion models because I want to actually test the hypothesis of why diffusion models are more resilient. And — well, this one is not open source — Google Gemini has recently released a diffusion model as a demo, and it is blazingly fast. Linking back to that, on the open-source side there is proof that diffusion models can actually work, potentially even as replacements for my architecture and for the transformer architecture. This is the paper covering large language diffusion models: this is the team that trained LLaDA 8B, which they then benchmarked across a spectrum of tasks.
I think the easiest way to view a text diffusion model is that, instead of pixels on a screen, you replace those values with the token values. That essentially means the same process by which you generate an image — where the values get slowly updated — applies here. I'll just show this demo as an example: as you can see, it slowly updates, token by token, character by character. The downside is that you won't get that streaming-token effect, and your time to first token might be much longer. The benefit of a diffusion model is that if, say, you're generating a thousand tokens, you might be able to generate all thousand of them in, say, a hundred steps, which means it's faster overall.
I think what's interesting about this paper in particular is that — as they put it, and to be clear, in my opinion 8B is too early to draw firm conclusions — the model was able to outperform Llama in in-context learning and in reversal reasoning. Reversal reasoning is the issue where, given statements like "A is related to B, B is related to C," the question is whether the model can logically deduce, in a single hop, that C is related to A. This is partially mitigated now with chain-of-thought reasoning — deep thinking, o1-style reasoning, whatever you want to call it — but it was previously a major hurdle for a lot of transformer models: they're unable to do the association backwards from what they were trained on. And it's been shown that diffusion models are able to overcome that. I think that's in part because they kind of work in a chain-of-thought pattern: every time the model generates a part of the text, it gets to see it and regenerate it.
They also cover how they trained it. Essentially, they mask tokens individually — there are different ways to approach this, but masking tokens individually means they train the loss on the various masked tokens, relative to the other tokens, because diffusion models no longer run strictly left to right: they're able to see to the left and to the right at the same time. The other thing they did was a more prompt-completion style of training, where the prompt is kept fixed (not masked) and the response is what gets trained. This is very similar to how people train image diffusion models, where either they add noise and train on everything from there, or they freeze part of the image and ask the model to regenerate another part — the inpainting example, where you delete the center of the image and train the model to regenerate it. That is still a prompt and a response, if you structure it similarly.
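Very roughly — and this is my paraphrase of the recipe rather than the authors' code, with every name here invented for the example, and the hypothetical `model` assumed to return per-position logits — the masked training objective looks something like this:

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, prompt_ids, response_ids, mask_id, mask_prob):
    """Randomly mask response tokens, keep the prompt visible, and train the model
    to predict the original tokens at the masked positions only."""
    ids = torch.cat([prompt_ids, response_ids], dim=-1)
    mask = torch.rand_like(ids, dtype=torch.float) < mask_prob
    mask[..., : prompt_ids.shape[-1]] = False       # the prompt is never masked
    noised = ids.masked_fill(mask, mask_id)

    logits = model(noised)                          # bidirectional: sees left and right context
    return F.cross_entropy(logits[mask], ids[mask]) # loss only on the masked positions
```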
Beyond that, there wasn't much else other than what they observed and, as I mentioned, how the number of inference steps can be traded off to get higher overall performance. The only other things are the math, and then they basically show the model's performance over the course of training against an autoregressive baseline — basically a transformer — and it shows that it's able to keep up with transformers in general, pointing to the potential for more research down this path. Then they shared their benchmarks, and, as I mentioned, what's interesting is that, for example, ARC (if I recall correctly) is higher than Llama while MMLU is lower. That's kind of funny, because we saw the same pattern with RWKV: RWKV having higher ARC and GSM8K while having slightly lower math and MMLU scores. Which still feeds into one of the current hypotheses, that different attention mechanisms may favor different kinds of tasks.

And I think that's about it — I already showed the sampling process through the animation here — so, yeah, does anyone have any questions?
I had a quick question, mostly from the chat: since we have language diffusion now, are there any language flow-matching models?

Oh — not that I know of, because I think this is just — yeah, I don't see why not down the line, but I'm not sure.

I still haven't read this paper, but we actually covered this a little bit previously too. This is really similar to BERT, right? It feels to me like we're basically just training an encoder-only model, but with multiple steps. And it's unclear to me — flow-matching models require the Gaussian assumption, and I don't know whether that Gaussian assumption is here or not. I haven't read the paper, though.

The Gaussian part — let's just do a quick check for that.
From what I understand, what the large language diffusion model paper says is that they use the same sampling techniques for the individual tokens as existing transformer approaches. So I don't think they're using Gaussian sampling; at each individual token slot these are just logits, and they sample from there. I assume what's also happening is that there's probably either a low temperature or a top-p setting — did they write top-p? No, they didn't mention top-p. I think one of the things they did to make it more stable is probably either a lower temperature or a top-p setting, so that the individual tokens end up representing their 90-plus-percent-confidence choice. The idea is that once you put the output back into the input for the next diffusion step, if the model is confident about that same input token, it will, with high probability, output the same token again, and everything eventually stabilizes. So it's not "diffusion" in that sense — it's really just standard sampling; that's what they say.
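As a rough illustration of that loop — this is just how I picture it, not the paper's actual sampler, and every name here is invented for the example (the hypothetical `model` returns per-position logits):

```python
import torch

def iterative_unmask(model, ids, mask_id, num_steps, confidence=0.9):
    """Start from a (partially) masked sequence and repeatedly: predict every
    position, write in the tokens the model is confident about, keep the rest masked."""
    for _ in range(num_steps):
        logits = model(ids)                     # one bidirectional forward pass
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)          # most likely token per slot, plus its probability
        still_masked = ids == mask_id
        accept = still_masked & (conf >= confidence)
        ids = torch.where(accept, pred, ids)    # confident predictions stabilize in place
        if not (ids == mask_id).any():          # nothing left to fill in
            break
    return ids
```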
I see a question: "One question I hope you answer at some point in the talk is what is left to prove with regards to others using RWKV as a base instead of attention." I think right now the biggest proof point that is not 100% proven is million-token context and above — and I feel like this is a sliding-goalposts problem. RWKV right now, with the latest v7 paper, is able to hit 32k context length with perfect memory on needle-in-a-haystack, and that's with a 3B model. The converted models are trained at 4k, so they do not survive a 32k needle-in-a-haystack; to do 32k needle-in-a-haystack we would have needed way more VRAM for the conversion and training process — at least four nodes of MI300s to do 32k and 64k training and above — and you can imagine the numbers just get more ridiculous if you want to do one-million-token lengths. So that's what's left to prove. But what we point to as a hypothesis in the RWKV-7 paper is this: we showed that the 1.5B model sits slightly below that mark on needle-in-a-haystack, and that at 3B it handles more than 32k, so as we increase the parameter size and the state size, the context length the model can reliably memorize improves. The hypothesis, therefore, is that as we scale to 72B, when fully trained, it should be able to handle much larger context lengths. Like I said, this is a hypothesis — we have not trained a 512k-context-length model, because we do not run a supercluster. That being said, we are training a larger-context-length model for Qwerky 2. We already have an iteration on the existing model which uses a few things from our previous papers — like GoldFinch's compressed KV caches — to reduce the memory requirement for longer context lengths, together with RWKV, as a hybrid model. We are probably going to try to push the training to 64k, then 128k, then 256k and above, which would bring it into the category that is very usable for most use cases.
"So activations are what is distilled into the target model?" Yes, that is for stages one and two; eventually, at the later stages, it's distillation from the logits. That part is extremely important: a lot of previous papers distilled from the logits, and that didn't train the attention mechanism well enough for it to stabilize. What you want is to train all the layers, as fast as possible and with as few tokens as possible, to mimic the original attention layers' capability. That's why we distill between layers: during the backpropagation process, there's no need to guess "if I backprop through all these inputs and outputs, which layers do I need to update?" There is no guessing — it's literally just the attention block you want to update. So that part is actually important.

"Is RWKV an RNN?" No — well, it was inspired by LSTM, but at this point I would say it's closer to a transformer than to an LSTM. We've been rewriting this for something like five years and we're now on version seven, so a lot of things have changed.
"Have you done a compute-equivalent performance comparison, such as duplicated context versus the original model with a single copy of the context?" Ah — this is the same thing the state space model team has shown, and we have tested it as well and can confirm it: yes, if you duplicate the context twice, the model does perform better. This applies to all linear models, not just RWKV — state space models, and I believe xLSTM as well, though I may need to verify that; personally I have verified it for state space models. There was a paper on this — repeat-twice for linear models — darn, I can't remember the paper name; I'll try to search it up. Okay, so we have the repeat-twice paper, and then, subsequently, what was the other one that I mentioned — the MLP layers being the same across different architectures.
Another thing to highlight is what's exciting about Qwerky 72B — and I assume this would apply to state space models as well, if they went through the same process — which is that while we have similar performance to transformers at the 72B scale, when running on the same GPU we are able to get more tokens per second in aggregate than the transformer architecture. In terms of the tokens-per-second comparison for the 72B model: we're able to run something like batch size 60 or 70, whereas if you run a transformer 72B on an H100 you're only able to do, say, batch size 8 or 16; to run similar batch sizes you would probably need to run a transformer 8B instead. That speed-up in overall tokens per second is probably the bigger impact, because at the end of the day, production really cares about how much compute, how many tokens, and how much intelligence you can extract.

RJ — ah yes, "Just Read Twice: closing the recall gap" — that is the paper, thank you for finding it. I'll repost it in here for those who want it; if someone can find that MLP paper, that would be great as well. So this is the Just Read Twice paper. The TL;DR — where were the benchmarks — okay, I just realized this paper is actually quite hard to read at an immediate glance, but the TL;DR was that linear models perform better when you just repeat the context twice.
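In practice the trick is as simple as it sounds — something like the toy sketch below, which is my illustration rather than anything from the paper:

```python
def read_twice_prompt(context: str, question: str) -> str:
    """Give a linear-attention or state space model two passes over the context
    before the question, so its state has already 'seen' the context once."""
    return f"{context}\n\n{context}\n\n{question}"
```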
Someone is pretty curious about whether there is language flow matching — not too sure about that. "What do you think about it compared to Mamba?" As covered previously, we believe linear attention works better, because one of the problems we observe in Mamba, especially in the smaller models, is that information is merged from left to right. Part of the reason repeating twice works is that if, say, you have an instruction and then some information, the information ends up merged into a single state before the model gets to see the instruction. Mamba merges information in a cascading pattern, left to right, side by side, before it generates the next token, and sometimes the information gets lost before it reads the instruction, so it's unable to perform the task. That is the disadvantage. But I'm also going to say that the disadvantage could be academic — and when I say this, this is the part where I jump between academia and practicality — in production, theoretical academic limitations are just that. I would argue that for state space models, if the state size is big enough that it's able to compress all this information together, and eventually, when it merges, there is no information loss, then the disadvantage is not really a disadvantage. For RWKV, if you put the information in, it will not lose that information, and it's able to decide what to keep from the following information as it processes through. In both cases, apparently, repeating twice mitigates the issue — and if the inference cost is worth it, then it becomes more a question of how we mature the technology. Sorry, okay — I'm just speaking through all the questions in the chat.
Since nobody else is talking, I just want to dig in a little bit more on the repeat-twice thing. I think you answered the gist of the question, but I wanted to know whether you or anyone else has done a study of the compute-equivalent case. You built this model with the RWKV attention mechanism in it, and you're saying we sort of get the same quality out of the model — so let's pretend it's exactly the same. If you just repeat the context twice, where do you land on compute? Are you still more compute-efficient? And what happens if you repeat three times, or however many, to get to the compute-equivalent point and compare quality there?

Actually, that's a good question. You're talking about inference-time compute, right? — Exactly, yeah. — Well, the only way to know is to try it and find out. We have not tried repeating it that many times. I think you also have to remember that part of the bigger issue — and this applies to both state space models and RWKV — is that even though we are more stable over longer contexts, the prerequisite is that we need to train the model to support those lengthy context windows, and right now our longest is 32k — well, we do have a 128k model, so we could repeat the experiment within, say, that 128k window. Beyond that, we definitely want to research longer contexts, but it's an interesting thought exercise, like you said. If you repeat more than twice, I suspect you're going to hit diminishing returns at some point — I think even three or four times might be diminishing — but it's an interesting question, I would say.

Or maybe the equivalent way to do the experiment is: you take your 72B model, then take the equivalent model for the same amount of compute with the original architecture — an 8B model or whatever it is, for a certain context length — and ask what the quality difference is.

Yeah — in terms of tokens-per-second output, the equivalent would be a 32B or an 8B model, somewhere in that range; the number changes with context length, but that's the approximate equivalent if you control by tokens per second. The thing is, if you have a 72B RWKV model, you still need 72 gigabytes of VRAM, assuming floating-point eight, to load the model, so your minimum hardware requirement doesn't actually go down; it just means that with the same GPU you can handle more tokens.
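(Just to spell out that back-of-the-envelope number, assuming one byte per parameter at FP8 and ignoring activation and state memory: 72 × 10^9 parameters × 1 byte/parameter ≈ 72 GB of VRAM.)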
Okay. And likewise, I would say I don't see why this would be any different for Mamba, because we both scale linearly in the same way in terms of compute cost.
Flo, do you have any questions? I thought I saw you say something.

No, I was just talking to myself about the 72 gigabytes of VRAM still being required, from that last comment.

That being said, the latest Qwen model is 32 gigs, and it's looking very nice.

Yeah, I'm excited about a lot of the latest releases and being able to run that stuff at home. I've been messing with a lot of voice stuff; I haven't done as much of the language model stuff.

How many people run the Qwen 32B at home, or on their own hardware? Okay, okay. — Oh wait, I run it sometimes; I used to, but sadly I'm very GPU-poor. — Okay, so a few have just started — okay, promising. I think it's nice that we now have a model that's kind of better than the frontier models of a year or so ago and that you can run on your laptop. So if you just follow that trendline and look at the best model now, in two years' time you're going to run that on your laptop.
Okay, then, since we went through all of that: I'll ask whether anyone wants any papers to be covered for next week or the week after. Also, there is the AI Engineer World's Fair, in case anyone does not know about it.

From looking at the channels, it looks like next week someone was talking about doing — well, the AI Engineer World's Fair is not next week but the following week, right? — Oh, the following week, oops. — So it looks like next week someone was going to do large language diffusion models, if I'm not mistaken from what I saw in the channel, and he was talking about splitting the time. I'd be interested in maybe doing a paper that talks about image generation models and diffusion as a lead-in to what he's going to talk about. Maybe that would be interesting, as a way to split the time: I could do maybe the first 10 or 15 minutes and let him take the rest. I am interested in Google's large language diffusion model — I haven't looked into it at all, but it seems super interesting; I don't know where the benchmarks are and so on, but I was planning on tuning in next week for his talk; I forget who it was. When I think of diffusion models, off the top of my head it's image generation models, and I've also seen lately that quite a few of the music generation models are diffusion-based too, so I thought it would be nice to do a little roundup before he gets into the Google stuff or language models. I don't know if that makes any sense, but I thought it would be interesting.

All right, so basically we're going to do image diffusion as a lead-in to language diffusion — next week will be a diffusion-model pipeline. And, promoting the AI Engineer World's Fair: if you haven't bought your tickets to come to SF and meet us, please do so as soon as possible before they all run out. After that week we will probably organize, once again, an on-site, in-person paper reading club for the World's Fair. I guess Flo will be there, I'll be there, RJ will be there — so do suggest papers that you want to cover. I'm looking forward to the in-person paper club; the exact where and when I may need to figure out with swyx for the World's Fair, but we'll figure it out.
It's a good time to mention: last year, because there are all these after-events and you need to get to a lot of them, we ended up getting split up a lot. So I just want to put out there — and we can talk about this on Discord — that we should try to reach some sort of consensus on which events everybody wants to go to and try to get tickets for them together, so that we can hang out together at those events.

Okay, sounds good, sounds good. And if anyone wants to organize a specific Latent Space event in the area, we'll probably be able to help with that as well — maybe before, after, or during the event. And I'm looking forward to seeing you, Sam, as well — GraphRAG, right? Okay, thanks.
Sam, if you're speaking at the GraphRAG thing, do you have any pre-reading that we can do? I'm interested in understanding it — GraphRAG is a topic that keeps coming up, and I've never really gone deep on it. I'd be curious to hear what's new and what we can pre-read.

Well — I don't know if you were there when Wasim did his paper club session last year, but I'm essentially taking what he did, updating it, and adding to it. So if you've watched that, you've basically seen it. We've published a couple of things on it; it'll be about the specialized LLM for building graphs, the retrieval-aware compression, and the fusion-in-decoder stuff that we're doing. This group has seen a lot of it before — but it's fine, we all need to go. That will be — unless the schedule changes, which it might — on the 4th of June at 11:40 a.m.

Thank you, thank you. So make sure you all go there to support Sam. Who else is speaking there? Nathan Lambert? In which category? — I have no idea; I think I saw in his blog that he was going to be there and talking, but maybe he was just saying he's going to be there. All right, okay.
Okay — and we've got ten more minutes somehow. So: how would you all want the paper club to be run, especially with regard to papers and topics? Is there a particular area of interest, or do we just keep things as they are? I'm trying to figure out how to fill in the pipeline more and avoid the situation of "hey, I'm going to cover my own paper unless people ask for something else."

I mean, I actually personally like hearing people talk about their own papers. I'd rather hear someone who's not the world's most prominent AI expert talk about their own paper than just read somebody else's paper, because I feel like I get a lot out of hearing the author's take on it. So I actually kind of like that, personally.

Okay — then what I would request is this: if you know any interesting papers, or if you know the authors of those interesting papers and feel it would be worth inviting them, just let me know, or let Eugene Yan or even swyx know, and we'll be willing to do the reach-out. Like the other time — which one was it — the scaling one, the Pythia paper, for example: we got Quentin on board to help cover it. Quentin is probably more prominent, but it doesn't need to be anyone prominent in any shape or form. So we'd love to hear suggestions, especially if you know the person or a particular personality.
Okay — "I want the authors to present here." Yes, please. And if you want me to reach out to them, I can.

Since no one else is talking: I also feel like — so there's been a discussion of having a "test of time" version and then a current-papers version. I actually think that, inasmuch as there's demand, there are several curriculums I would be interested in: an architecture curriculum, an agents curriculum, probably diffusion or image models — a bunch of different ones we could put curriculums together for. I obviously wouldn't have time to do them all, but I think that would also be helpful for pre-reading, because then you already know what's next after each paper. One criticism I have of how we do things now is that the paper usually gets announced a day or two before the discussion, and it's hard to pre-read; if I know a week in advance, it's a lot easier to find time over the weekend or something like that.

Okay. I do have one related question: how many people in this Discord, or in this paper club, come from a background without coding experience? The reason I'm asking is that I've been exploring — and this is not a commitment; it's something I'm still figuring out the materials for — doing an entire series about AI architecture under the assumption that you do not know how to code. My rationale is that, while I speak highly of Andrej Karpathy's series and the fast.ai series on AI architectures, all of them pretty much assume you know how to code, and there are lots of people trying to understand AI better — not necessarily to build AI architectures, just to understand it better. So I'm just trying to see how many people here don't code, before jumping in.

Okay — well, I guess most people here know how to code. I know there's a sampling bias in this Discord, so, yeah, I'm just laying out the idea.
Okay, then, if there are no other paper suggestions, perhaps we should do a test-of-time week instead. I think what we can do is start doing themed weeks to make this easier — so let's try to do a test-of-time week after the AI Engineer conference, and since it seems like next week we can turn it into diffusion week, we'll just try to find papers around that, which makes it easier to pre-plan the papers in advance. Okay then — if that's the case, we'll try to make sure that happens. I'll end today's session slightly early; if you have papers you want to suggest, feel free to put them in the Discord. Okay, take care everyone — see you again next week.