
RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale



00:00:00.000 | I just wanted to make sure that the recording was going to work properly. Okay.
00:00:06.960 | I did not prepare for this, and honestly there are a lot of other
00:00:14.400 | papers going on as well, so you don't have to take up the full time; there are always people
00:00:18.560 | that are interested in other things. Which other papers are people interested in?
00:00:25.280 | I mean, just stuff that was dropped in the channel. Okay.
00:00:31.120 | It's useful to mute; there's a button on the bottom right to mute all, or,
00:00:42.960 | yeah, mute people on entry, so that it helps, because people don't realize they're unmuted.
00:00:51.360 | Settings... okay, where's the mute-all-on-entry? I can't seem to find that button either.
00:01:00.800 | Bottom right, there's a little drop-down.
00:01:12.480 | Okay, yeah, fine: mute participants on entry, got it. RJ, were you saying something? Because I can't hear you at all.
00:01:23.120 | Sorry, I was not muted... no, I wasn't saying anything. Okay, okay. I'm at a cafe, so there's a lot of noise.
00:01:34.000 | Got it, got it. Okay, I guess I'll go through RADLADS, and I'll go through
00:01:43.120 | the large language diffusion model paper, since, I don't know whether you all saw the tweets
00:01:49.600 | recently as well, but apparently Google has now just made a decent diffusion
00:02:00.160 | text model, which is quite rad.
00:02:04.240 | So AI is apparently going to get much faster today.
00:02:25.520 | Okay, anyone? Okay, so today is just going to be a collection of papers. I will kick
00:02:33.920 | off with my own, which is RADLADS, where we show how to essentially take an existing
00:02:40.960 | 72B Qwen model and convert it to another architecture altogether, and then after that I will go
00:02:47.440 | through the diffusion model paper, and then we'll pick up any
00:02:55.840 | other suggestions from there. So, for the paper that we are covering, I'm just going to put the link here
00:03:00.560 | and share my screen.
00:03:02.880 | Yep. So I think a few of you may have already heard me mentioning this overall (oops),
00:03:14.640 | but this was essentially the basis for our Qwerky 72B, which is where we took
00:03:22.480 | an existing Qwen 72B model... let me find a picture which is better, yeah,
00:03:29.360 | where we took an existing 72B model and were essentially able to train it using a new
00:03:38.320 | attention mechanism, and we were able to retain most of the performance. The
00:03:44.160 | subsequent paper that we published is more to document that process from a high
00:03:49.200 | level, and then subsequently also what we found did not work in particular. So the idea
00:03:56.640 | here is that we want to encourage research into new attention mechanisms, and this
00:04:03.520 | technique doesn't apply just to RWKV; this approach actually unlocks new possibilities
00:04:11.520 | not just for RWKV but also for state space models like Mamba, xLSTM, and so on. And essentially what we
00:04:20.400 | realized is that the majority of the feed-forward network layers can be recycled. There is
00:04:28.240 | actually a similar paper that came out recently as well, about MLPs being similar across architectures; I can't
00:04:38.880 | remember the exact name, but basically there was a recent paper that came
00:04:47.760 | out where a team compared the Llama model and the Mistral model, and I believe
00:04:55.920 | one other model, and they found that the MLP layers' weight values are approximately similar to each other,
00:05:05.920 | despite being trained on different data sets, once they are optimized. So there's kind of
00:05:12.480 | a hypothesis floating around right now that if you're going to train on the same large
00:05:18.720 | web data, your MLP layers are going to converge to roughly the same thing, and we also reaffirmed
00:05:25.120 | that hypothesis: you can freeze that layer and just swap out the attention layer. So what we did
00:05:33.040 | subsequently... and so the thing is, as we show, this is not the first conversion method per se
00:05:39.600 | from one architecture to another, so we did outline
00:05:47.360 | the comparison to existing methods, acknowledging some of the previous work that
00:05:54.080 | was done accordingly. But I think what stood out for us is that we are probably one of the first
00:05:59.920 | teams where, going through this process and only training on 500 million tokens, we actually
00:06:04.000 | even had some benchmarks improve, whereas in
00:06:12.080 | previous conversion attempts you would see a dramatic
00:06:17.360 | drop in performance, which is not really the biggest issue in my opinion, because when you
00:06:22.320 | convert from a quadratic to a linear model, even if, let's say, you drop by 20% performance overall,
00:06:29.360 | the inference saving is actually substantial enough that this could potentially be a
00:06:35.280 | viable path towards making a more scalable model. But with the updated techniques
00:06:41.360 | in RADLADS, it's able to more rapidly... I forgot to talk about
00:06:47.360 | the name: it's Rapid Attention Distillation to Linear Attention Decoders at Scale, so we are able to
00:06:53.600 | rapidly change one attention mechanism to another. The "linear" part is just because of the example,
00:06:59.120 | because we are a linear architecture, but like I said, this can be done with other attention mechanisms.
00:07:06.160 | Right now we are actually helping two universities, and we are talking to another two,
00:07:11.280 | and their teams, to test variations of the transformer attention model using the same technique.
00:07:17.760 | So what we did was take the existing model, freeze the feed-forward network
00:07:26.320 | layers, and swap out the attention layer, and then subsequently
00:07:35.840 | we distilled the information in between. I know a lot of people get confused about
00:07:41.200 | the distillation since I've conveyed the idea, and it is covered in this paper:
00:07:47.040 | when we say distillation, we mean distillation from the same model, from the
00:07:55.040 | original model, and it's not just logits distillation. In the first few steps of the
00:08:00.800 | process, we distill in between the layers. So, for example, you can take a single block,
00:08:07.120 | and as you do the forward inference, as it goes through
00:08:13.600 | each layer respectively, we take the incoming embedding and the outgoing embedding and use those
00:08:20.000 | for distillation, not the logits themselves. So yeah, distillation between the layers, because in
00:08:26.560 | this case we are replacing the attention mechanism. And then subsequently you fine-tune; you
00:08:33.040 | do the fine-tuning to just stabilize and weave everything together.
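To make the between-layer distillation concrete, here is a minimal PyTorch sketch (not the actual RADLADS training code) of matching a frozen teacher block's outgoing embedding with the new attention replacement's output for the same incoming embedding; the module names teacher_block and student_mixer and the plain MSE objective are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def layerwise_distill_step(teacher_block, student_mixer, hidden_in, optimizer):
    """One distillation step for a single layer: match activations, not logits.

    teacher_block: frozen attention block from the original transformer.
    student_mixer: the new linear-attention replacement being trained.
    hidden_in:     incoming embeddings for this layer, taken from a forward
                   pass of the original model, shape (batch, seq, dim).
    """
    with torch.no_grad():
        target = teacher_block(hidden_in)      # teacher's outgoing embedding

    student_out = student_mixer(hidden_in)     # student sees the same incoming embedding
    loss = F.mse_loss(student_out, target)

    optimizer.zero_grad()
    loss.backward()                            # gradient only has to explain this one block
    optimizer.step()
    return loss.item()
```

Because the target is the teacher's own activation at the same depth, the gradient does not have to guess which layer is responsible, which is the locality point that comes up again in the Q&A later.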
00:08:40.160 | So, since we released this paper (background: I covered the high level there), what a lot of
00:08:51.520 | people pointed out and really liked about it, after the logits distillation experiments and
00:08:58.800 | benchmarks, is this segment: what did not work. Because the thing is
00:09:08.000 | that with this conversion method there are many ways you can get it wrong, and
00:09:13.360 | these are some of the things that we actually iterated through,
00:09:23.920 | and so we documented what did not work. We would also like to clarify that when we say it did not
00:09:29.760 | work, we mean very specifically that it did not work for RWKV. It is possible that as we experiment with
00:09:37.360 | other attention mechanisms, some of the things that we flagged as "did not work" will
00:09:41.680 | work with other attention mechanisms. So even though we listed them out here, we are
00:09:45.600 | already breaking the rules for one of the experiments that we're doing. So,
00:09:51.760 | for example, one of the things that did not work, which we thought would,
00:09:57.440 | is one of the shortcuts that we tried: we took the
00:10:01.920 | existing attention module weights and copied them over to RWKV, because at the end of the day it
00:10:10.400 | is still matrix multiplication, the weights can be copied over, and we just thought
00:10:18.080 | maybe it would work. It did not. [Another participant, accidentally unmuted:] "...commented on his
00:10:27.760 | leadership skills, so we made a pushback on their decision today, so we will hear, most probably
00:10:35.760 | tomorrow, how the direction will go; if that is not the case we are asking to consider him for a different
00:10:42.560 | role instead." Sorry, yeah, that didn't sound like a question, right? Yeah. So we then
00:10:53.680 | subsequently, like I said, ablated this and tested it against
00:11:01.360 | starting the attention mechanism from scratch: pretty much the same score, so we ended up
00:11:08.560 | skipping that. Even though I mentioned freezing model weights, I think one of the
00:11:16.160 | things to point out is that you want to freeze only at stage one. So you go through several stages: the
00:11:21.440 | first stage is you replace the attention layer and then do a quick training run, and then subsequently
00:11:27.440 | you will want to unfreeze the MLP layers. The reason why we found that this was needed is
00:11:33.760 | that, especially over a longer context length, the performance will drop otherwise.
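As a rough illustration of that staging (the module-name matching below is an assumption about how a converted model labels its parameters, not the team's actual script), a freeze schedule might look like this:

```python
def apply_stage(model, stage: int):
    """Stage 1: train only the swapped-in attention/time-mixing blocks.
    Stage 2+: also unfreeze the recycled MLP / feed-forward layers, which,
    per the discussion above, helps avoid quality loss at longer context."""
    for name, param in model.named_parameters():
        if "att" in name:                      # the new attention replacement
            param.requires_grad = True
        elif "ffn" in name or "mlp" in name:   # recycled feed-forward weights
            param.requires_grad = stage >= 2
        else:                                  # embeddings, norms, head: frozen in this simplified sketch
            param.requires_grad = False
```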
00:11:39.520 | Things like batch sizing are extremely tricky, but I think this is the case for any large model;
00:11:47.200 | it's just particularly so within this process. And to be clear also, and this was actually further
00:11:53.760 | confirmed with one of the upcoming reasoning models that we did: you
00:12:01.520 | will need to have a data set that is reflective of the data that is in the model. So, for example, we
00:12:06.480 | recently did a reasoning model, we converted it; spoiler, it's the latest Qwen 3 model,
00:12:13.920 | and we didn't release it, because when we did the conversion we didn't convert with a
00:12:20.400 | reasoning data set, so when it tried to do the reasoning tokens and things like that, it just went
00:12:26.080 | crazy. We are now redoing the process with reasoning tokens, and subsequently,
00:12:34.880 | we believe, this will perform much better. So your data set should be approximately the same
00:12:40.000 | as your original model's data set, which is a challenge because we have absolutely no idea what
00:12:45.120 | Qwen's data set is. So that's one thing to take note of. Another thing that we tried, for example,
00:12:52.160 | was doing LoRA-based training for the conversion process; it just made things a lot worse.
00:13:01.360 | Yeah, so the reason why we decided to document all the things that did not work is that,
00:13:07.520 | once again, there are a lot of benefits to this approach for new
00:13:14.480 | attention mechanisms, not just RWKV, and we just wanted to outline some of the key things there.
00:13:20.080 | So yep. And for those who don't know what RWKV is, and I think most people in this group
00:13:26.880 | already know, it's a linear attention mechanism that my group has been building and
00:13:31.600 | scaling up. With this, we have now built the world's largest 72B model that doesn't have
00:13:39.600 | the transformer attention mechanism. The benefit of it is that it scales linearly, with over a thousand times cheaper inference
00:13:45.280 | cost. But what I find more exciting about us covering this paper
00:13:55.120 | is that it's not really so much about my architecture, because I'm also on record saying that I don't think
00:14:01.280 | our architecture is final, and that there's so much more improvement to be
00:14:08.240 | made. So if anything, I actually do think there needs to be more research into other
00:14:14.640 | ways to build AI architectures, because one of the things that I said at ICLR, where I presented the
00:14:21.040 | earlier version of this, was that right now the largest models to date are not
00:14:30.160 | reliable enough to do cashier duty or day-to-day work tasks, and, well, scaling does help
00:14:39.120 | with that, but if we scale seven times more, because of the scaling laws, where you scale 10x to get a 10% improvement,
00:14:45.440 | that would be the entire energy of the earth required to train the model. So at some point we
00:14:53.280 | do need a better attention mechanism, to lower the energy cost, or to just
00:15:00.560 | make things more reliable, or to make things scale better. And previously most of the existing labs
00:15:06.560 | avoided doing this, because to validate any new architecture, and this is from our experience,
00:15:13.920 | you need to iterate around 20 to 80 times before you can find an architecture that has a 10% improvement;
00:15:21.120 | you're going to get more failures than successes, and when each try can cost up to five million dollars
00:15:27.360 | in that training process, you're looking at tens of millions of dollars to
00:15:33.600 | experiment on a new architecture, and that ten million could instead just go to training more tokens. But this changes
00:15:40.080 | things: if it's now $2,000 to $20,000 to test a new attention mechanism, it unlocks the door for literally
00:15:48.240 | anyone who is willing to spin up a RunPod instance, for example,
00:15:55.680 | and that's why we expect more experiments. Which links very well to the most recent
00:16:03.360 | thing, the Gemini diffusion model. I think I also said that diffusion models are probably interesting
00:16:09.200 | because there are two things that I want to find out. One is that diffusion models, at least
00:16:18.160 | for images, have been highly resistant to catastrophic... sorry, to overfitting, in the sense
00:16:25.840 | that you may have heard of large language models overfitting after they see the data three or four
00:16:32.720 | times, for example, or even two times, and they start going crazy, whereas diffusion models,
00:16:39.280 | especially in the image space, are typically trained over a thousand times on the same data and they
00:16:44.480 | do not go crazy. And I've been saying that I want to see more success in text diffusion models
00:16:51.520 | because I want to actually find out and test the hypothesis about why diffusion models are
00:16:57.760 | more resilient. And, well, this one is not open source, but Google Gemini has recently
00:17:05.840 | released a diffusion model, as a demo, that is blazingly fast. Linking that:
00:17:12.480 | on the open source side, this proves that diffusion
00:17:16.800 | models can actually work, potentially even as replacements for my architecture and the transformer architecture.
00:17:23.200 | This is the paper covering large language diffusion models, and this is the team
00:17:30.320 | that trained LLaDA 8B,
00:17:38.480 | and then they benchmarked it across a spectrum of tasks. So I think what is interesting for
00:17:46.640 | diffusion models, for this one in particular, is that the easiest way to view a
00:17:55.440 | text diffusion model is that instead of pixels on the screen, you replace
00:18:04.880 | that with just the token values, and that essentially means that the same
00:18:12.320 | process of how you generate an image, where it slowly updates the values,
00:18:19.120 | I think I'm just going to show this as a demo example, as you can see, it slowly updates
00:18:24.320 | token by token, character by character. The downside of this is that you
00:18:34.240 | won't get that streaming-token effect,
00:18:41.280 | and your time to first token might be much longer. The benefit for diffusion models is that if,
00:18:48.800 | let's say, you're generating a thousand tokens, you might be able to generate all thousand tokens in, let's
00:18:53.120 | say, a hundred steps, which means it's faster overall. I think what is interesting about
00:19:01.280 | this paper in particular as well is that, the way they put
00:19:09.840 | it, one thing is that the model itself, for diffusion models, and to be clear, this
00:19:16.560 | is too early to say in my opinion, 8B is too early to even draw conclusions, was able to outperform
00:19:22.480 | Llama in in-context learning and reversal reasoning. Reversal reasoning is
00:19:31.840 | the issue where, let's say, A is related to B and B is related to C, and you have a
00:19:39.360 | statement like that: is the model able to deduce logically that C is related to A in
00:19:50.480 | a single hop? This is partially mitigated now with chain-of-thought reasoning, or deep
00:19:57.920 | thinking, or whatever you want to call it, or o1-style reasoning, but this was previously a
00:20:04.960 | major hurdle for a lot of transformer models; they're unable to do the association backwards
00:20:10.160 | from what they were trained on, and it has been shown that diffusion models are able to overcome that. I think it's also
00:20:17.760 | in part because they kind of work in a chain-of-thought pattern, because
00:20:23.200 | every time the model generates a part of the text, it gets to see it and regenerate it. So they cover
00:20:31.200 | how they train it respectively. Essentially they mask
00:20:39.520 | the tokens individually; there are different ways to approach it, and masking the tokens individually
00:20:45.360 | means what they did was train the loss over the various different
00:20:49.360 | token sets and masks respectively, compared to the other tokens, because since diffusion models
00:20:57.120 | are no longer running linearly, they're able to see to the left and to the right of a token all at
00:21:02.000 | the same time. The other thing that they did was geared more towards prompt-
00:21:08.160 | completion training, where basically the prompt is kept fixed and unmasked and then the response is trained
00:21:14.160 | respectively. This is very similar to how people train image diffusion models, where
00:21:20.080 | either they add noise and then just train everything from there, or they actually freeze
00:21:27.440 | a part of the image and then ask the model to regenerate another part of the image. That's the inpainting example:
00:21:33.040 | for example, you delete the center of the image and you train
00:21:38.400 | the model to regenerate that part of the image. So that maps to prompt and response if you structure it similarly.
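A minimal sketch of that objective, in the spirit of LLaDA-style masked training: the prompt is left untouched, a random fraction of response tokens is masked, and the cross-entropy is computed only on the masked positions. The mask_id argument, the uniform masking ratio, and the rough 1/t reweighting are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, prompt_ids, response_ids, mask_id):
    """Prompt stays clean; some response tokens are replaced by a mask id,
    and the model is trained to recover only those positions."""
    batch = response_ids.size(0)
    t = torch.rand(batch, 1, device=response_ids.device)                     # per-sample masking ratio
    masked = torch.rand(response_ids.shape, device=response_ids.device) < t  # which response tokens to hide
    noisy = torch.where(masked, torch.full_like(response_ids, mask_id), response_ids)

    logits = model(torch.cat([prompt_ids, noisy], dim=1))                    # bidirectional: sees left and right
    resp_logits = logits[:, prompt_ids.size(1):]

    loss = F.cross_entropy(resp_logits[masked], response_ids[masked])        # loss only on masked slots
    return loss / t.mean()                                                   # rough 1/t reweighting
```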
00:21:45.440 | Yeah, then, as they covered... what else did they cover? There wasn't really much other than what
00:21:55.920 | they observed, and, as I mentioned, how the inference can trade off
00:22:04.560 | to lead to higher performance overall. I think the only other thing is the math, and then they
00:22:11.760 | basically showed the model's performance over the training run, and
00:22:18.400 | it shows that, against the autoregressive baseline, basically transformers, it is
00:22:24.960 | able to keep up in general with transformers, leading to potential for
00:22:31.040 | more research down this path. So yeah, and then they shared their benchmarks,
00:22:37.760 | where, as I mentioned previously, what is of interest is that, for example, ARC
00:22:46.400 | is higher than Llama while MMLU is lower, which is kind of funny, because
00:22:54.800 | we saw the same issues with RWKV, with RWKV having the higher ARC and GSM8K while having
00:23:03.760 | slightly lower MATH and MMLU, which kind of still leads into one of the current
00:23:09.440 | hypotheses, which is that a different attention mechanism may favor different kinds of tasks.
00:23:14.240 | Yeah, and I think that's about it. I already showed the sampling process through the animation here, so yeah.
00:23:23.680 | Yep. Ah yeah, so that's it. Does anyone have any questions?
00:23:29.840 | I had a quick question, mostly from the chat: since we have language diffusion now, are there any
00:23:46.960 | language flow-matching models? Oh, not that I know of, because I think this is just
00:23:55.840 | fresh out of the oven.
00:24:01.680 | Yeah, but I don't see why not, down the line. I'm a little bit... so I'm not sure
00:24:12.640 | here, I haven't read this paper yet, but we actually covered this a little
00:24:17.680 | bit previously too. I mean, this is really similar to BERT, right? And it feels to me like we're
00:24:24.640 | basically just training it like an encoder-only model, but with multiple steps,
00:24:32.240 | and it's unclear to me: flow-matching models require the Gaussian assumption, and I don't know whether that
00:24:39.840 | Gaussian assumption is here or not. So yeah, I haven't read the paper though.
00:24:48.240 | The Gaussian function... let's just do a quick check for that.
00:24:55.120 | From what I understand, for the large language diffusion model, what they said here is
00:25:04.240 | that they use the same sampling techniques for the individual tokens as the existing
00:25:12.160 | approaches for transformers, so I don't think they're using Gaussian
00:25:19.360 | sampling; they're just literally, at the individual token slots, taking the logits
00:25:27.600 | and sampling from there. And I assume what's happening as well is that there's probably
00:25:34.480 | either a low temperature or a top-p setting. Did they write top-p?
00:25:40.080 | No, they didn't say top-p. Yeah, I think one of the things that they did to make it more stable
00:25:46.320 | is probably either a lower temperature or a top-p setting, where essentially the model,
00:25:54.000 | or the individual tokens, will eventually converge to their 90-plus percentile representation,
00:26:01.200 | and the idea is that once you put the output back into the input for the diffusion,
00:26:05.680 | if it's confident in that same input token, it will, with high probability, output the same token,
00:26:14.960 | and everything will then eventually stabilize. So it's not Gaussian-style diffusion;
00:26:24.160 | it's really just standard sampling. That's what they say.
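A rough sketch of that kind of loop: each pass, the model proposes tokens for every masked slot from its logits, keeps the slots it is most confident about, and re-masks the rest so later passes can revise them. The confidence-based keep rule, the temperature, and the fixed step count are illustrative assumptions, not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def diffusion_sample(model, prompt_ids, gen_len, mask_id, steps=100, temperature=0.7):
    """Generate gen_len tokens in `steps` refinement passes instead of one token at a time."""
    response = torch.full((1, gen_len), mask_id, device=prompt_ids.device)    # start fully masked

    for step in range(steps):
        logits = model(torch.cat([prompt_ids, response], dim=1))[:, prompt_ids.size(1):]
        probs = torch.softmax(logits / temperature, dim=-1)
        conf, pred = probs.max(dim=-1)                                        # per-slot confidence and best token

        response = torch.where(response.eq(mask_id), pred, response)          # fill every masked slot

        # Re-mask the least confident slots; a growing fraction stays fixed each pass.
        num_to_remask = int(gen_len * (1.0 - (step + 1) / steps))
        if num_to_remask > 0:
            _, worst = conf[0].topk(num_to_remask, largest=False)
            response[0, worst] = mask_id

    return response
```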
00:26:34.640 | I see, great. I see the question: "One question I hope you answer at some point in the talk is what
00:26:42.080 | is left to prove with regards to others building on, or using, RWKV as a base instead of
00:26:49.040 | attention?" I think right now one of the biggest proof points that is not 100% absolutely proven is million-
00:26:55.680 | token context and above, and I feel like this is a sliding-goalpost problem.
00:27:04.160 | So RWKV right now, with the latest v7 paper, is able to hit 32k context length with perfect
00:27:10.880 | memory on needle-in-a-haystack, and that's with a 3B model. The converted models are trained at 4k, so
00:27:17.200 | they do not survive a 32k needle-in-a-haystack; to do a 32k needle-in-a-haystack we needed way more VRAM to
00:27:24.480 | do the conversion and training process, and that would have required at least four nodes of MI300
00:27:33.360 | to do 64k and 32k training and above, and you can imagine the numbers just get more ridiculous
00:27:41.760 | if you want to do 1 million token length. Yeah, so that is what's left to prove. But what we point
00:27:49.920 | to as a hypothesis in the RWKV-7 paper is that we have shown that 1.5B is slightly below this on
00:27:58.000 | needle-in-a-haystack, and we showed that 3B handles more than 32k, so we have shown that as we increase param size
00:28:06.320 | and increase state size, the context length actually improves in terms of the model's ability to
00:28:12.800 | memorize it, and therefore the hypothesis is that as we scale to 72B, when fully trained,
00:28:20.720 | it should be able to handle much larger context lengths. Like I say, this is a hypothesis; we have
00:28:26.720 | not trained a 512k context length model because we do not run a super cluster. That being
00:28:35.200 | said, we are training a larger context length model for Qwerky 2. We already have an
00:28:45.840 | iteration on the existing model which uses a few things that, if
00:28:51.760 | you look at our previous papers, like GoldFinch, compressed KV caches and stuff like that, we are able to
00:28:56.960 | reduce the amount of memory required for longer context lengths, together with RWKV, so it's a hybrid
00:29:02.880 | model, and we are probably going to try to push this to train to 64k, then 128k, and
00:29:09.440 | then 256k and above, so that would bring it into the category that is very usable for
00:29:16.000 | most use cases. "So activations are what is distilled into the target model?" Yes, that is for stages one and
00:29:24.000 | two; eventually, at the later stages, it will be distillation from logits. That part is extremely
00:29:31.760 | important: a lot of previous papers distilled from the logits, and that didn't train the attention
00:29:37.840 | mechanism well enough to be stabilized. What you want to do is train, as fast as possible, with
00:29:46.240 | as few tokens as possible, all the layers to mimic the original attention layers' capability,
00:29:53.040 | and that's why we distill between layers, because then, during the backpropagation
00:29:57.520 | process, it doesn't need to guess: "hey, if I backprop all these inputs and outputs,
00:30:05.440 | which layers do I need to update?" There is no guessing in that; it's just literally
00:30:12.000 | which attention block you want to update. So that part is actually important.
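For contrast with the between-layer loss sketched earlier, the later-stage logits distillation mentioned here would look roughly like a standard temperature-scaled KL term between the frozen teacher's and the converted student's output distributions; the temperature and scaling are assumptions, not the exact recipe.

```python
import torch.nn.functional as F

def logits_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions,
    applied once the new attention blocks are already roughly aligned."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)
```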
00:30:17.280 | "Is RWKV an RNN?" No... well, it's inspired by LSTM. At this point I would say it's closer to
00:30:27.680 | a transformer than an LSTM; we've been rewriting this for about five years, we're now on version seven, so
00:30:34.800 | a lot of things have changed. "Have you done a compute-equivalent performance test, such as duplicating the
00:30:39.200 | context versus the original model with a single version of the context?" Ah, so this is the same thing
00:30:44.800 | the state space model team has shown, and we have tested this as well and can confirm it:
00:30:50.080 | yes, if you duplicate the context twice, the model does perform better. This applies to all
00:30:57.600 | linear models, not just RWKV: state space, and I believe xLSTM as well, though I may need to
00:31:04.720 | verify that; personally I have verified it for state space models. There was a paper,
00:31:12.640 | "repeat twice" for linear models... crap, I can't remember the paper name. I'll try to search
00:31:22.880 | for it. Okay, so we have the repeat-twice paper, and then, subsequently, what was the other one that I mentioned?
00:31:32.320 | The MLP layers being the same across different architectures. So yeah.
00:31:42.000 | Another thing to highlight as well is what is exciting, for example, about
00:31:48.160 | Qwerky 72B, and I assume this will apply to state space models as well if they go through the same
00:31:53.600 | process: while we have similar performance to transformers at 72B scale, when running on the same GPU
00:32:04.640 | we are able to get more tokens per second in aggregate than the transformer architecture. So we
00:32:12.800 | are talking about the tokens-per-second comparison for the 72B model:
00:32:17.360 | we're able to do, say, batch size 60 or 70, whereas if you run a 72B transformer on an H100 you're only able to
00:32:25.840 | do, let's say, batch size 8 or 16, and to run similar batch sizes you would probably
00:32:32.640 | need to run a transformer 8B instead. That speed-up in overall tokens per second is
00:32:39.360 | probably the bigger impact, because at the end of the day production really cares about how much compute,
00:32:44.880 | how many tokens, and how much intelligence you can extract. RJ, do you... ah yeah, "Just Read Twice: closing the
00:32:51.440 | recall gap", yeah, that is the paper, thank you for finding that. I will repost this in here for
00:32:59.680 | those who want it. If someone can find that MLP paper, that would be great as well. Yeah, so this is
00:33:08.400 | the "Just Read Twice" paper. The TL;DR is... where was it, the benchmarks...
00:33:20.400 | okay, I just realized that this paper is actually quite hard to read at an immediate glance,
00:33:25.120 | but the TL;DR was that linear models perform better when you just repeat the context twice.
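The trick itself is just prompt construction, roughly like the sketch below; the template wording is an assumption, and the point is only that the document appears twice before the question, so a recurrent model gets a second pass over information it may have compressed away the first time.

```python
def read_twice_prompt(context: str, question: str) -> str:
    """Repeat the long context verbatim before asking the question."""
    return f"{context}\n\n{context}\n\nQuestion: {question}\nAnswer:"
```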
00:33:35.120 | Yeah. "Pretty curious about whether there are language flow-matching models": not too sure about that. "What do you think about it
00:33:43.840 | compared to Mamba?" As covered previously, we believe that linear attention
00:33:55.840 | works better, because one of the problems that we observe in Mamba,
00:34:02.720 | especially in the smaller models, is that information is merged from left to right,
00:34:07.200 | and part of the reason why repeating twice works is that if, let's say, you have an instruction
00:34:13.840 | and you have information, the information here will end up being merged into a single state
00:34:19.680 | before the model gets to relate it to the instruction. So Mamba merges information in a cascading pattern,
00:34:26.880 | from left to right, side by side, before it generates the next token, and sometimes
00:34:32.560 | what happens is that the information gets lost before it reads the instruction, and therefore
00:34:38.320 | it's unable to perform the task. That is the disadvantage. But I'm also going to say that that disadvantage
00:34:44.800 | could be academic, and when I say this, this is the part where, because I jump between academia and
00:34:50.880 | practicality: in production, theoretical academic limitations are just
00:34:57.520 | that. I would argue that for state space, if the state size is big enough and it's able to compress
00:35:06.160 | all this information together, and eventually when it merges together there's no information loss,
00:35:11.440 | then the disadvantage is not really a disadvantage. For RWKV, if you put the information in, it will
00:35:17.200 | not lose that information, and it's able to decide on the following information as it processes through.
00:35:22.880 | Yeah, and in both cases, apparently, repeating twice kind of mitigates this issue, so
00:35:28.080 | yeah, and if the inference cost is worth it, then it becomes more a question of
00:35:33.680 | how we mature the technology as well. Yeah, sorry, okay.
00:35:42.320 | I just went through all the questions in the
00:35:47.200 | chat. Yeah, does anyone have any other questions?
00:35:58.000 | Since nobody is, I just want to dig in a little bit more on the repeat-twice thing. So I'm wondering,
00:36:03.440 | you answered the gist of the question, but I wanted to know if you or anyone else has done a study of
00:36:10.720 | the compute equivalent. Because you guys built this model that has the RWKV attention
00:36:20.720 | mechanism in it, and you're saying that, kind of, sort of, we
00:36:27.200 | get the same quality out of the model, so let's just pretend it's exactly the same. Then, can
00:36:33.120 | the quality... like, if you just repeat twice, where do you land with compute? And if you
00:36:41.280 | just repeat your context, are you still more compute-efficient? And what happens if you repeat
00:36:46.400 | three times, or whatever, to get to the point of compute equivalence, and look at quality there?
00:36:53.760 | Actually, that's a good question. You're talking about inference-time compute, right? Exactly, yeah.
00:37:01.520 | Well, the only way is to try it and find out.
00:37:12.000 | Yeah, we have not tried repeating it that many times. I think you also have to remember that part of the
00:37:21.920 | bigger issue is that, and this applies to both state space and RWKV, even though we are more stable over
00:37:27.440 | longer context, the prerequisite for this is that we need to train the model to support
00:37:33.120 | those lengthy context lengths, and right now our longest is 32k... we do have a 128k model, so we could repeat
00:37:44.240 | the experiment within, let's say, that 128k window itself. I think beyond that,
00:37:56.960 | yeah, we definitely want to be able to research more into longer context, but I think
00:38:03.280 | it's an interesting thought exercise, like you said: what if you don't just repeat twice? I suspect you're
00:38:08.080 | going to get a diminishing point of return at some point; I think even three or four might be diminishing,
00:38:14.560 | but it's an interesting question, I would say. Or maybe the equivalent way to do the experiment is,
00:38:22.880 | if you look at it: okay, you have your 72B model, and then what is the equivalent model, for the same
00:38:30.800 | amount of compute, with the original model, so like an 8B model or whatever it is, for a certain
00:38:36.880 | context length, and what's the quality difference, something like that. So yeah. Yeah, sorry. No, no, I think that's
00:38:47.200 | something. Yeah, the equivalent, in terms of tokens-per-second output, would
00:38:54.080 | be a 32B or an 8B model, somewhere in between that range; the number changes according to context
00:39:01.280 | length, but that's the approximate equivalent if you control by tokens per second. The thing is that
00:39:09.920 | if you have a 72B RWKV model, you still need 72 gigs, assuming floating point 8, of VRAM to
00:39:19.280 | load the model, so your minimum hardware requirement doesn't actually go down; it just means that
00:39:24.720 | with the same GPU you can handle more tokens. Yeah. Okay. Yeah.
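That 72 GB floor is just parameter-count arithmetic; the only assumption is one byte per weight at FP8, before activations and the recurrent state:

$$72 \times 10^{9}\ \text{parameters} \times 1\ \text{byte/parameter} \approx 72\ \text{GB of weights.}$$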
00:39:33.200 | And likewise, I would say that I don't see why this would be any different for Mamba, because we both scale linearly
00:39:38.320 | in the same way in terms of compute cost. Flo, do you have any questions? Because I thought I saw you say
00:39:54.000 | something. Yeah, no, I was just talking to myself about the 72 gigabytes of VRAM still required, on that last
00:40:02.880 | commentary. That being said, the latest Qwen model is 32 gigs and it's looking very nice.
00:40:14.080 | Yeah, I'm excited about a lot of the latest releases, and being able to run that stuff at home.
00:40:20.320 | I've been messing with a lot of voice stuff; I haven't done as much of the language model stuff
00:40:23.920 | locally yet, or recently, but yeah, I'm excited.
00:40:26.720 | How many people run the Qwen 32B at home, or on their own hardware?
00:40:45.360 | Okay, okay. Oh wait, I run it locally sometimes; I used to, but I got out of it. Sadly I'm very GPU poor.
00:40:54.960 | Okay, yeah, so just starting to... okay, okay, promising. Yeah, but I think it's nice that,
00:41:10.320 | if we now have a model that's kind of better than the frontier models of the last year and a half,
00:41:18.880 | and you can run it on your laptop, then if you just follow that trendline and look at the best model now,
00:41:25.040 | in two years' time you're going to run that on your laptop.
00:41:31.120 | Yeah. Okay, then, since we went through all that, well, I'm asking whether anyone wants any papers
00:41:42.640 | to be covered for next week, or the week after. There is the AI Engineer Summit, in case
00:41:51.680 | anyone does not know about it... sorry, the AI Engineer
00:41:55.440 | World's Fair.
00:41:58.320 | So, from looking at the channels, it looks like next week someone was talking about doing...
00:42:07.520 | because the AI Engineer World's Fair is not next week but the following week, right?
00:42:12.560 | Oh, the following week, oops. Yeah, so it looks like next week someone was going to do
00:42:19.920 | large language diffusion models, if I'm not mistaken; that's what I saw on the channel,
00:42:23.680 | and he was talking about splitting the time. I'd be interested in maybe possibly doing a paper that
00:42:29.520 | talks about image generation models and diffusion, as a lead-in to what he's going to talk about.
00:42:35.760 | Maybe that would be interesting, as a way to split the time: I could do maybe the first 10
00:42:40.320 | or 15 minutes and let him take the rest of the time. But I am interested in Google's large language
00:42:45.280 | diffusion model; I haven't looked into it at all, but it seems to be super interesting. I don't know
00:42:49.200 | where the benchmarks are and stuff like that, but I was planning on tuning in next week
00:42:54.560 | for his talk, I forgot who it was. But when I think of diffusion models, off the top of my
00:43:00.400 | head it's image generation models, and I've also seen lately that there are quite a few
00:43:05.760 | music generation models that are also diffusion-based, so I just thought it would be nice to do a
00:43:12.560 | little roundup, you know, before he gets into the Google stuff, or language models, maybe.
00:43:22.480 | I don't know if that makes any sense, but I thought it'd be
00:43:26.400 | interesting. All right, so basically we're going to do image diffusion as a lead-in to language
00:43:33.440 | diffusion, so tomorrow... I mean, next week will be a diffusion model pipeline.
00:43:41.520 | And, promoting the AI Engineer World's Fair: if you haven't bought your tickets to come to SF
00:43:47.920 | and to meet us, please do so as soon as possible, before it all runs out.
00:43:54.720 | After that week we will probably, once again, organize an on-site, in-person paper
00:44:02.720 | reading club for the World's Fair. And yep, I guess Flo, you'll be there, I'll be there, RJ will be
00:44:12.640 | there. Yeah, so do suggest papers that you want to cover, and subsequently, yeah,
00:44:22.560 | I'm looking forward to the in-person paper club. Where and when, the exact location, I may
00:44:31.680 | need to figure out with swyx for the actual World's Fair, but yeah, we'll figure it out.
00:44:38.880 | Okay, it is a good time, too, to just mention: last year, because there are all these after-
00:44:46.480 | events and you need to get to a lot of them, we ended up getting split up a lot. So I just want to put
00:44:52.400 | out there, and we can talk about this on Discord, but I want to put out there that let's try to get
00:44:58.560 | some sort of consensus on what events everybody wants to go to, and try
00:45:05.200 | to get tickets for them together, so that we can go hang out together at these events. Okay, sounds good,
00:45:10.880 | sounds good, yeah. If anyone wants to also organize a specific Latent Space event in the
00:45:18.320 | area, we'll probably also be able to help with that as well, maybe before, after, or during the
00:45:23.600 | event. And yeah, I'm looking forward to seeing you, Sam, as well, at the GraphRAG talk. Right, okay, thanks.
00:45:32.480 | Sam, if you're speaking at the GraphRAG thing, do you have any pre-reading that we
00:45:44.160 | can do? I'm interested in understanding; GraphRAG has been a topic that keeps coming up and I've
00:45:51.120 | never really gone into it. I'd be curious to hear what new things are happening and what
00:45:58.080 | we can pre-read for that. Well, essentially, I don't know if you were there when Waseem did this paper club
00:46:06.080 | last year, but I'm essentially taking what he did and sort of updating it and adding to it and stuff,
00:46:14.640 | so if you've watched that, then you've basically seen it. But yeah, we've published
00:46:21.120 | a couple of things on it. It'll be about the specialized LLM for building graphs, and the
00:46:27.920 | retrieval-aware compression, and the fusion-in-decoder stuff that we're doing. So yeah, this group
00:46:34.960 | has seen a lot of it before. It's fine, we all need to go there. So that will be, unless the schedule
00:46:43.600 | changes, which it might, the 4th of June, 11:40 a.m. Thank you, thank you. So make sure you all go there to
00:46:53.600 | support Sam, a fellow member. Who else is talking there? Who else... Nathan Lambert, Nato?
00:47:06.880 | You know, in which category? I have no idea, I just think I saw on his blog that he was going to be
00:47:18.080 | here and talking, but maybe he was just saying he's going to be there. All right, all right, okay.
00:47:27.520 | Okay, and we've got 10 more minutes somehow, so yeah.
00:47:34.000 | Is there any... how would you all want the paper club to be run, especially with papers
00:47:47.200 | and topics? Because, in particular, what is the bigger area of
00:47:55.440 | interest, or do we just keep things as they are? Because I'm trying to
00:48:00.560 | figure out how to fill in the pipeline more, and to avoid the situation of, "hey, I'm going
00:48:09.440 | to cover my own paper unless people ask for something else." I mean, I actually personally like hearing people
00:48:16.960 | talk about their own papers. I'd rather hear someone who's not the world's most
00:48:23.920 | prominent AI expert talk about their own paper than just read somebody else's paper, because I
00:48:30.960 | feel like I get a lot out of hearing the author's take on it. So I actually kind of like that, personally.
00:48:39.120 | Okay, okay. Then what I would request is that, if possible, if you all know any
00:48:47.200 | interesting papers, or if you know any authors of interesting papers, and you feel like
00:48:55.040 | it would be worth inviting them, just let me know, or let me, Eugene Yan, or even swyx know, and
00:49:03.440 | sometimes we'll be willing to do the reach-out. Like the other
00:49:09.360 | time... which paper was it that I did? The previous one was the scaling one,
00:49:15.280 | the Pythia paper, for example; we got Quentin on board to help cover it, though
00:49:22.720 | Quentin is probably more prominent. But it doesn't need to be prominent in any shape or form. Yeah, so we'd love to
00:49:28.640 | hear suggestions, especially if you know the person, particularly the personality. Yeah.
00:49:32.800 | Okay: "I want the authors to present here." Yes, please. And if you want me to reach out to them to
00:49:44.240 | apply more pressure, just let me know.
00:49:52.000 | I also, since no one's talking, I also feel like, so
00:49:59.680 | there's a discussion of having the test-of-time version and then the current-papers version.
00:50:06.800 | I actually think that, in as much as there's demand, there are definitely several curriculums
00:50:13.840 | that I would be interested in: there's an architecture curriculum, the agents curriculum,
00:50:18.640 | all sorts of different ones, probably diffusion, or image models; there's a bunch of
00:50:26.320 | different ones that I feel like we could put together curriculums for that I would be interested
00:50:32.160 | in. I obviously wouldn't have time to do them all, but I think that's also helpful for
00:50:37.280 | pre-reading, because then you already know what's next to each paper. And so one
00:50:42.800 | criticism I have of how we do things now is that usually the paper gets announced
00:50:49.520 | a day or two before the discussion, and it's hard to pre-read, so if I know a week in advance it's
00:50:54.640 | a lot easier for me to find time over the weekend, or something like that.
00:50:57.840 | Okay, I do have one related question: how many people in this Discord, or in this paper club,
00:51:09.360 | come from a background without coding experience? The reason why I'm asking this is that
00:51:16.800 | I've been exploring, and it is not a commitment, this is something where I'm still figuring out the
00:51:21.680 | materials, doing an entire series about AI architecture with the assumption that you do not know
00:51:28.960 | coding. And my rationale behind that is that, well, while I do speak highly of
00:51:36.400 | Andrej Karpathy's series and the fast.ai series about AI architectures and stuff, all of them pretty
00:51:42.800 | much assume you know coding, and there are lots of people trying to understand AI better,
00:51:48.960 | not necessarily to build AI architectures, but just to understand it better, and
00:51:56.480 | yeah, I'm just trying to see how many people here don't know code, before jumping in.
00:52:03.920 | Okay. Oh well, I guess most people know code. I know there's a sampling bias in this Discord, so
00:52:21.360 | it's like, yeah, just laying out the idea.
00:52:23.600 | The silent majority is very silent.
00:52:28.480 | Okay, then I think I shall just... if there are no other paper suggestions, then, yeah, perhaps we
00:52:39.600 | should do a test-of-time week instead. As in, I think what we can do is start
00:52:46.400 | doing themed weeks to make it easier, so I would say let's try to do a test-of-time week after the AI
00:52:55.840 | Engineer conference, and then, because it seems like next week we can try
00:53:03.440 | to turn it into diffusion week, we'll just try to find papers around that, to make it easier to
00:53:08.960 | pre-plan the papers in advance. Yeah, okay then, yeah, if that's the case,
00:53:16.480 | yeah, okay, yeah, then we'll try to make sure that happens. Yeah, then I'll just end today's
00:53:29.920 | session slightly early, and if you have papers that you want to suggest, just feel free to
00:53:34.960 | put them in the Discord. Okay, okay, take care everyone, see you again next week.
00:53:41.040 | Thank you.