i just wanted to make sure the recording was going to work properly. Okay. I did not prepare much for this, and honestly there are a lot of other papers going on as well, so we don't have to take up the full time — there are always people interested in other things. Which other papers are people interested in? Mostly just stuff that was dropped in the channel, okay.

It's useful to mute — there's a button on the top/bottom right to mute all, or to mute people on entry, because people don't realize they're unmuted. Where's the mute-all-on-entry setting? Can't seem to find that button. Bottom right, there's a little dropdown. Okay — mute participants on entry, got it. RJ, were you saying something? I can't hear you at all. Sorry, I was muted — no, I wasn't saying anything. Okay. I'm at a cafe so there's a lot of background noise. Got it.

Okay, I guess I'll go through RADLADS and then the large language diffusion model paper. I don't know whether you all saw the tweets recently, but apparently Google just made a pretty decent diffusion-based text model, so AI is apparently going to get much faster. So today is going to be a collection of papers: I'll kick off with my own, which is RADLADS, where we show how to take an existing Qwen 72B model and convert it to another architecture altogether, then I'll go through the diffusion model paper, and we'll pick up any other suggestions from there.

The paper we're covering — I'm just going to put the link here and share my screen. A few of you may have already heard me mention this before, but this was essentially the basis for our Qwerky-72B, where we took an existing Qwen 72B model (let me find a better picture), retrained it with a new attention mechanism, and retained most of the performance. The paper we subsequently published documents that process at a high level, and also what we found did not work in particular.

The idea here is that we want to encourage research into new attention mechanisms, and this technique doesn't apply just to RWKV — it unlocks new possibilities for state space models (Mamba), xLSTM, and so on. What we realized is that the majority of the feed-forward network layers can be recycled. There's actually a similar paper that came out recently — something along the lines of "MLPs are similar across architectures", I can't remember the exact name — where a team compared the Llama model, the Mistral model, and I believe one other, and found that the MLP layers' weight values end up approximately similar to each other despite being trained on different datasets.
So there's a hypothesis floating around right now that if you train on the same large web data, your MLP layers are going to converge to roughly the same thing, and we reaffirm that hypothesis: you can freeze those layers and just swap out the attention layer.

To be clear, this is not the first conversion method from one architecture to another, so we did outline a comparison to existing methods, acknowledging the previous work accordingly. What stood out for us is that we're probably one of the first teams where, going through this process with only 500 million training tokens, some benchmarks actually even improved. Previous conversion attempts typically show a dramatic drop in performance — which is not really the biggest issue in my opinion, because when you convert from a quadratic to a linear model, even if you drop, say, 20% in overall performance, the inference savings are substantial enough that this could still be a viable path towards a more scalable model. But with the updated techniques in — I forgot to give the name — RADLADS, Rapid Attention Distillation to Linear Attention Decoders, we're able to rapidly change one attention mechanism into another. The "linear" part is just because ours happens to be a linear architecture; like I said, this can be done with other attention mechanisms. Right now we're helping two universities, and talking to another two, to test variations of the transformer attention model using the same technique.

What we did was take the existing model, freeze the feed-forward network layers, swap out the attention layer, and then distill the information in between. I know a lot of people get confused when I convey the distillation idea, and it's covered in the paper: when we say distillation, we mean distillation from the same model — from the original model — and it's not just logits distillation. In the first few steps of the process we distill between the layers: for example, you take a single block, and as you do the forward inference through it, we take the incoming embedding and the outgoing embedding and use those for the distillation, not the logits. So it's distillation between the layers themselves, because in this case we're replacing the attention mechanism. Then subsequently you do further fine-tuning to stabilize everything and weave it all together.

Since we released this paper — that covers the background at a high level — what a lot of people pointed out and really liked, after the logits distillation experiments and benchmarks, is the segment I want to highlight: what did not work.
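To make the between-layer distillation concrete, here is a minimal sketch assuming a PyTorch-style setup; the function and module names are illustrative and not taken from the RADLADS code.

```python
import torch
import torch.nn.functional as F

def block_distill_loss(teacher_block, student_block, hidden_in):
    """Align one converted block with its original counterpart.

    hidden_in is the incoming embedding for this layer, taken from a forward
    pass of the original model. The teacher block (original attention + FFN)
    is frozen; only the swapped-in attention parameters of the student block
    receive gradients. The supervision is on hidden states, not logits.
    """
    with torch.no_grad():
        target = teacher_block(hidden_in)   # outgoing embedding of the original block
    prediction = student_block(hidden_in)   # same input, new attention mechanism inside
    return F.mse_loss(prediction, target)
```

An L2 loss between hidden states is an assumption here; the point is simply that the signal comes from inside the original model, layer by layer, rather than only from its final output distribution.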
So the thing is, with this conversion method there are many ways you can get it wrong, and these are some things we actually iterated through, so we documented what did not work. We'd also like to clarify that when we say something did not work, we mean very specifically that it did not work for RWKV. It's possible that, as we experiment with other attention mechanisms, some of the things we flagged as not working will work for them — so even though we listed them here, we're already breaking those rules in one of the experiments we're running.

For example, one shortcut we tried, which we thought would work: we took the existing attention module weights and copied them over to RWKV — because at the end of the day RWKV is still matrix multiplications, so the weights can be copied over — figuring maybe it would work. It did not.

[Brief cross-talk from an unmuted participant about an unrelated work matter.] Sorry — that didn't sound like a question, right? Yeah.

So we ablated this and tested starting the attention mechanism from scratch instead: pretty much the same score, so we ended up skipping the weight copying. Also, even though I mentioned freezing model weights, one thing to point out is that you want to freeze only at stage one. You go through several stages: the first stage is to replace the attention layer and do a quick training run, and then subsequently you will want to unfreeze the MLP layers. The reason we found this was needed is that, especially over longer context lengths, the performance drops otherwise. Batch sizing is also extremely tricky — that's true for any large model, but it's particularly so in this process.

And to be clear — this was further confirmed with one of the upcoming reasoning models we did — you need a dataset that's reflective of the data the original model was trained on. For example, we recently converted a reasoning model — spoiler, it's the latest Qwen 3 model — and we didn't release it, because when we did the conversion we didn't convert with a reasoning dataset, so when it tried to produce the reasoning tokens it just went crazy. We're now redoing the process with reasoning tokens, and we believe it will perform much better. So your dataset should be approximately the same as the original model's dataset, which is a challenge because we have absolutely no idea what Qwen's dataset is. That's one thing to take note of. Another thing we tried was a LoRA-based approach for the conversion process; it just made things a lot worse.
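The staged freezing described above could look something like the sketch below; the stage boundaries and the parameter-name matching are assumptions for illustration, not the actual RADLADS training script.

```python
import torch.nn as nn

def apply_stage(model: nn.Module, stage: int) -> None:
    """Illustrative freeze/unfreeze schedule for the conversion stages.

    Stage 1: train only the newly swapped-in attention (time-mix) blocks;
             embeddings, MLPs, norms, and the LM head all stay frozen.
    Stage 2+: additionally unfreeze the MLP layers, which was needed to keep
              performance from dropping at longer context lengths.
    """
    for name, param in model.named_parameters():
        if "attn" in name or "time_mix" in name:   # the replacement attention blocks
            param.requires_grad = True
        elif "mlp" in name or "ffn" in name:       # feed-forward / channel-mix layers
            param.requires_grad = stage >= 2
        else:                                      # embeddings, norms, lm_head
            param.requires_grad = False
```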
The reason we decided to document all of the things that did not work is that, once again, there are a lot of benefits to this approach for new attention mechanisms, not just RWKV, and we wanted to outline some of the key things. For those who don't know what RWKV is — and I think most people in this group already do — it's a linear attention mechanism that my group has been building and scaling up. With this we have now built the world's largest 72B model that doesn't use the transformer attention mechanism; the benefit is that it scales linearly, with over a thousand times cheaper inference cost.

But what I find more exciting about covering this paper is that it's not really about my architecture. I'm on record saying I don't think our architecture is final and that there's so much more improvement to be made. If anything, I think there needs to be more research into other ways to build AI architectures. One of the things I said at ICLR, where I presented an earlier version of this, is that the largest models to date are not reliable enough to do cashier duty or day-to-day work tasks. Scaling does help, but with scaling laws along the lines of "scale 10x to get a 10% improvement", scaling several times further would take something like the entire energy output of the Earth to train the model. So at some point we do need better attention mechanisms — to lower the energy cost, to make things more reliable, or simply to scale better.

Previously most of the existing labs avoided doing this, because to validate any new architecture — and this is from our experience — you need to iterate around 20 to 80 times before you find an architecture with a 10% improvement. You're going to get more failures than successes, and when each try can cost up to five million dollars of training, you're looking at tens of millions of dollars to experiment on new architectures — money that could instead go to training more tokens. But this changes things: if it's now $2,000 to $20,000 to test a new attention mechanism, it unlocks the door to literally anyone willing to spin up a RunPod instance, and that's why we expect a lot more experiments.

Which links very well to the very recent news about the Gemini diffusion model. I've also said that diffusion models are interesting because there are two things I want to find out. One is that diffusion models, at least for images, have been highly resistant to catastrophic — sorry, to overfitting. You may have heard of large language models overfitting after they see the data three or four times, or even twice, and starting to go crazy, whereas image diffusion models are typically trained over a thousand passes on the same data and do not go crazy. I've been saying I want to see more success in text diffusion models because I want to actually test the hypothesis of why diffusion models are more resilient.
And while it's not open source, Google Gemini has recently released a diffusion model as a demo, and it is blazingly fast. Linking that back: on the open-source side, this shows that diffusion models can actually work, potentially even as replacements for my architecture and for the transformer architecture. This is the paper covering large language diffusion models — the team that trained LLaDA 8B — and they benchmark it across a spectrum of tasks.

I think the easiest way to view a text diffusion model is that instead of pixels on a screen, you replace them with token values, and the same process applies: just as when generating an image, it slowly updates the values. I'll show this demo as an example — as you can see, it slowly updates token by token, character by character. The downside is that you won't get that streaming-token effect, and your time to first token might be much longer. The benefit is that if you're generating, say, a thousand tokens, you might be able to generate all thousand in around a hundred steps, which means it's faster overall.

What's interesting about this paper in particular — and to be clear, in my opinion 8B is too early to draw firm conclusions — is that the model was able to outperform Llama in in-context learning and reversal reasoning. Reversal reasoning is the issue where, say, A is related to B and B is related to C, and you ask whether the model can deduce that C is related to A in a single hop. This is partially mitigated now with chain-of-thought reasoning, or deep thinking, or o1-style reasoning, whatever you want to call it, but it was previously a major hurdle for a lot of transformer models: they're unable to do the association backwards from what they were trained on. Diffusion models are shown to be able to overcome that, and I think it's partly because they kind of work in a chain-of-thought pattern — every time the model generates part of the text, it gets to see it and regenerate it.

They then cover how they train it: they mask tokens individually — there are different ways to approach this — meaning they train the loss over various masked token sets, conditioned on the remaining unmasked tokens, because diffusion models no longer run strictly left to right, so they're able to see to the left and to the right of a position at the same time.
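Roughly, the masked training objective works like the sketch below; this is a simplified reading of the LLaDA-style loss in PyTorch, not the authors' code, and the 1/t reweighting and the prompt handling are my assumptions about the setup.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id, prompt_len=0):
    """Simplified masked-diffusion training step.

    Sample a masking ratio t, independently replace each maskable token with
    the mask token with probability t, and train the model to predict the
    original tokens at the masked positions only, conditioned on everything
    left unmasked. Setting prompt_len > 0 keeps the prompt unmasked (the
    prompt-completion variant discussed next).
    """
    batch, seq_len = tokens.shape
    device = tokens.device
    t = torch.rand(batch, 1, device=device).clamp_min(1e-3)            # masking ratio
    maskable = torch.arange(seq_len, device=device).unsqueeze(0) >= prompt_len
    masked = (torch.rand(batch, seq_len, device=device) < t) & maskable
    noisy = torch.where(masked, torch.full_like(tokens, mask_id), tokens)

    logits = model(noisy)                                  # bidirectional: sees left and right
    loss = F.cross_entropy(logits[masked], tokens[masked], reduction="none")
    return (loss / t.expand_as(tokens)[masked]).mean()     # reweight by 1/t (LLaDA-style)
```

Pre-training would call this with the model's mask-token id and `prompt_len=0`; the supervised variant sets `prompt_len` to the length of the (never-masked) prompt.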
The other thing they did goes towards prompt-completion training, where basically the prompt is kept unmasked and the response is what gets trained. This is very similar to how people train image diffusion models, where either they add noise and train everything from there, or they freeze part of the image and ask the model to regenerate another part — for example, you delete the center of the image and train the model to regenerate it. That's still prompt-and-response if you structure it similarly.

What else did they cover? Not much beyond what they observed, and, as I mentioned, how the number of inference steps trades off against overall performance. The only other thing is the math, and then they basically show the model's performance over the course of training against an autoregressive baseline — basically a transformer — and it's able to keep up in general with transformers, which points to potential for more research down this path. They also shared their benchmarks, and as I mentioned, what's of interest is that, for example, ARC is higher than Llama while MMLU is lower. Which is kind of funny, because we saw the same pattern with RWKV: higher ARC and GSM8K, but slightly lower math and MMLU. That still feeds into one of the current hypotheses, that different attention mechanisms may favor different kinds of tasks. And I think that's about it — I already showed the sampling process through the animation. Does anyone have any questions?

I had a quick question, mostly from the chat: since we have language diffusion now, are there any language flow-matching models? Not that I know of — I think this is just fresh out of the oven — but I don't see why not down the line.

I'm not sure here — I still haven't read this paper, but we actually covered this a little previously. This is really similar to BERT, right? It feels to me like we're basically training an encoder-only model, just with multiple steps. And it's unclear to me: flow-matching models require the Gaussian assumption, and I don't know whether that Gaussian assumption is here or not. I haven't read the paper, though.

On the Gaussian question, let's do a quick check. From what I understand, the large language diffusion model uses the same sampling techniques for the individual tokens as existing transformer approaches, so I don't think they're using Gaussian sampling — at each individual token slot there are logits and they sample from them. I assume what's also happening is that there's probably either a low temperature or a top-p setting — did they write top-p? No, they didn't say top-p. I think one of the things they did to make it more stable is probably either a lower temperature or a top-p setting.
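For intuition, an iterative demasking loop of the kind being described might look like the sketch below; this is my own simplification (greedy, confidence-ordered committing of tokens is an assumption, not necessarily the exact remasking schedule LLaDA or Gemini Diffusion use).

```python
import torch

@torch.no_grad()
def diffusion_generate(model, prompt_ids, gen_len, mask_id, steps=64):
    """Start from an all-masked response and progressively commit tokens.

    Each step: predict every still-masked position, keep the most confident
    predictions, and leave the rest masked for the next step. After `steps`
    iterations every position has been filled in.
    """
    response = torch.full((1, gen_len), mask_id, dtype=torch.long,
                          device=prompt_ids.device)
    for step in range(steps):
        still_masked = response == mask_id
        remaining = int(still_masked.sum())
        if remaining == 0:
            break
        seq = torch.cat([prompt_ids, response], dim=1)
        logits = model(seq)[:, prompt_ids.shape[1]:, :]   # logits for the response slots
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        conf = torch.where(still_masked, conf, torch.full_like(conf, -1.0))
        # commit roughly an equal share of the remaining masked tokens each step
        k = max(1, remaining // (steps - step))
        top = torch.topk(conf, k, dim=-1).indices[0]
        response[0, top] = pred[0, top]
    return torch.cat([prompt_ids, response], dim=1)
```

The stabilizing behavior described next falls out of this loop: tokens the model is already confident about get reproduced on every subsequent pass.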
The idea is that the individual tokens will eventually settle at their 90-plus-percent-confidence predictions, and once you put the output back into the input for the next diffusion step, if the model is confident about a given input token it will, with high probability, output that same token again, and everything eventually stabilizes. So it's not Gaussian diffusion — it's really just standard sampling, which is what they say. I see, great.

I see a question in the chat: one question I hope you answer at some point is what is left to prove with regards to others using RWKV as a base instead of transformer attention. I think right now the biggest proof point that isn't 100% proven is million-token context and above, and I feel like this is a sliding-goalpost problem. RWKV right now, with the latest v7 paper, is able to hit 32k context length with essentially perfect memory on needle-in-a-haystack, and that's with a 3B model. The converted models are trained at 4k context, so they do not survive a 32k needle-in-a-haystack. To do 32k needle-in-a-haystack we'd need way more VRAM for the conversion and training process — that would have required at least four nodes of MI300s to do 32k and 64k training and above — and you can imagine the numbers get more ridiculous if you want to go to one-million-token lengths. So that's what's left to prove. But what we point to as a hypothesis in the RWKV-7 paper is that the 1.5B model sits slightly below that on needle-in-a-haystack, and at 3B it handles more than 32k — so as we increase the parameter size and the state size, the context length the model is able to memorize improves. Therefore the hypothesis is that as we scale to 72B, when fully trained, it should be able to handle much larger context lengths. Like I said, this is a hypothesis; we have not trained a 512k-context-length model because we do not run a super-cluster.

That being said, we are training a larger-context model for Qwerky 2. We already have an iteration on the existing model which uses a few things from our previous papers — like GoldFinch, with compressed KV caches and so on — so we're able to reduce the memory requirement for longer context lengths together with RWKV; it's a hybrid model. We're probably going to try to push the training to 64k, then 128k, then 256k and above, which would bring it into the category that's very usable for most use cases.

"So activations are what is distilled into the target model?" Yes — that's for stages one and two; at the later stages it becomes distillation from the logits. That part is extremely important: a lot of previous papers distilled from the logits only, and that didn't train the attention mechanism well enough for it to stabilize. What you want is to train all the layers to mimic the original attention layers' capability as fast as possible, with as few tokens as possible. That's why we distill between the layers: during the backpropagation process it doesn't need to guess, given these inputs and outputs, which layers to update — there's no guessing, it's literally just the attention block you want to update.
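For the later stages, the logits distillation mentioned above is the usual teacher-student KL objective; a minimal sketch follows, where the temperature and reduction choices are assumptions rather than details from the paper.

```python
import torch.nn.functional as F

def logits_distill_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between the original (teacher) model's next-token
    distribution and the converted (student) model's, applied once the
    swapped-in attention has already been aligned layer by layer."""
    t = temperature
    teacher_logp = F.log_softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student); the t^2 factor is the conventional distillation scaling
    return F.kl_div(student_logp, teacher_logp,
                    log_target=True, reduction="batchmean") * (t * t)
```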
So that part is actually important.

Is RWKV an RNN? No — well, it's inspired by the LSTM, but at this point I would say it's closer to a transformer than to an LSTM. We've been rewriting this for about five years and we're now on version 7, so a lot of things have changed.

Have you done a compute-equivalent performance comparison, such as duplicated context versus the original model with a single copy of the context? So, this is the same thing the state space model team has shown, and we have tested it as well and can confirm it: yes, if you duplicate the context twice, the model does perform better. This applies to all linear models, not just RWKV — state space models, and I believe xLSTM as well, though I may need to verify that; personally I have verified it for state space models. There was a paper on repeating the context twice for linear models — I can't remember the paper name, I'll try to search for it. So we have the repeat-twice paper, and then the other one I mentioned, about the MLP layers being the same across different architectures.

Another thing to highlight: what's also exciting about the Qwerky 72B — and I assume this will apply to state space models as well if they go through the same process — is that while we have similar performance to transformers at 72B scale, when running on the same GPU we're able to get more tokens per second in aggregate than the transformer architecture. In terms of the tokens-per-second comparison, with the 72B model we're able to run something like batch size 60 or 70.
If you run a transformer 72B on an H100, you're only able to do, say, batch size 8 or 16, and to run similar batch sizes you'd probably need to run a transformer 8B instead. That speed-up in overall tokens per second is probably the bigger impact, because at the end of the day production really cares about how much compute, how many tokens, and how much intelligence you can extract.

RJ, do you — ah yes, "Just Read Twice: closing the recall gap" — that is the paper, thank you for finding it. I'll repost it here, and if someone can find that MLP paper, that would be great as well. So this is the Just Read Twice paper. The TL;DR — where was it, the benchmarks — okay, I just realized this paper is actually quite hard to read at an immediate glance, but the TL;DR is that linear models perform better if you just repeat the context twice. (From the chat: pretty curious about whether there's language flow matching — not too sure about that.)

What do you think about this compared to Mamba? As covered previously, we believe linear attention works better, because one of the problems we observe in Mamba, especially in the smaller models, is that information is merged from left to right. Part of the reason repeating twice works is that if, say, the instruction and the information are ordered such that the information ends up merged into a single state before the model gets to see the instruction, you have a problem. Mamba merges information in a cascading pattern, left to right, before generating the next token, and sometimes the information gets lost before it reads the instruction, so it's unable to perform the task — that's the disadvantage. But I'm also going to say that this disadvantage could be academic — and I say this as someone who jumps between academia and practicality — because in production, theoretical limitations are just that. I would argue that for state space models, if the state size is big enough to compress all of that information together without loss, then the disadvantage isn't really a disadvantage. For RWKV, if you put the information in, it will not lose that information, and it's able to decide what to do with the information that follows as it processes through. In both cases, apparently, repeating twice mitigates the issue — if the inference cost is worth it — and then it becomes more a question of how we mature the technology.

Sorry — okay, I just sped through all the questions in the chat. Does anyone have any other questions?
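In practice the "just read twice" trick is nothing more than duplicating the context in the prompt before the question; a trivial sketch, with an illustrative prompt format:

```python
def read_twice_prompt(context: str, question: str, repeats: int = 2) -> str:
    """Repeat the context before asking the question.

    For recurrent / linear-attention models, the second pass over the context
    happens after the model has already seen everything once, so it can decide
    what to keep in its state with the full picture in mind. The cost is a
    roughly 2x longer prompt (prefill), which is the trade-off discussed above.
    """
    repeated = "\n\n".join([context] * repeats)
    return f"{repeated}\n\nQuestion: {question}\nAnswer:"
```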
Since nobody else is jumping in, I just want to dig in a little more on the repeat-twice thing. You sort of answered the gist of the question, but I wanted to know if you or anyone else has done a compute-equivalent study. You built this model with the RWKV attention mechanism in it, and you're saying that we more or less get the same quality out of the model — so let's pretend it's exactly the same. If you just repeat the context twice, where do you land on compute? Are you still more compute efficient? And what happens if you repeat three times, or however many times it takes, to get to compute-equivalent, and look at quality there?

Actually, that's a good question — you're talking about inference-time compute, right? Exactly. Well, the only way is to try it and find out; we have not tried repeating it that many times. You also have to remember that part of the bigger issue — and this applies to both state space models and RWKV — is that even though we're more stable over longer contexts, the prerequisite is that we need to train the model to support those lengthy context windows, and right now our longest is 32k — well, we do have a 128k model, so we could repeat the experiment within that 128k window. Beyond that, we definitely want to research longer context more, but it's an interesting thought exercise. I suspect you're going to hit diminishing returns at some point — I think even three or four repeats might be diminishing — but it's an interesting question.

Or maybe the equivalent way to do the experiment is: you have your 72B model, so what is the compute-equivalent model with the original architecture — an 8B or whatever it is — for a certain context length, and what's the quality difference there? Yeah — in terms of tokens-per-second output, the equivalent would be somewhere between a 32B and an 8B model; the number changes with context length, but that's the approximate equivalent if you control by tokens per second. The thing is, if you have a 72B RWKV model, you still need 72 gigs of VRAM, assuming 8-bit floating point, to load the model, so your minimum hardware requirement doesn't actually go down — it just means that with the same GPU you can handle more tokens. And likewise, I don't see why this would be any different for Mamba, because we both scale linearly in the same way in terms of compute cost.

Flo, do you have any questions? I thought I saw you say something. No, I was just talking to myself about the 72 gigabytes of VRAM still being required, on that last commentary. That being said, the latest Qwen model is 32 gigs and it's looking very nice. Yeah, I'm excited about a lot of the latest releases and being able to run that stuff at home — I've been messing with a lot of voice stuff, I haven't done as much of the language model stuff locally recently, but I'm excited.

How many people here run the Qwen 32B at home or on their own? I run it sometimes. I used to, but sadly I'm very GPU poor. Just started to. Okay, promising. I think it's nice that we now have models that are better than last year's frontier models and that you can run on your laptop — if you just follow that trendline, the best model now is something you'll be running on your laptop in two years' time.
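As a back-of-the-envelope check on the VRAM point above (weights only; this ignores the KV cache or recurrent state, activations, and runtime overhead, so real usage is higher):

```python
def weight_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weights-only memory footprint of a model at a given precision,
    using 1 GB = 1e9 bytes: (params_billions * 1e9 * bytes) / 1e9."""
    return params_billions * bytes_per_param

print(weight_vram_gb(72, 1.0))   # fp8  -> 72.0 GB, the figure mentioned above
print(weight_vram_gb(72, 2.0))   # fp16 -> 144.0 GB
print(weight_vram_gb(32, 1.0))   # a 32B model at fp8 -> 32.0 GB
```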
Okay, since we went through all that: does anyone want specific papers covered for next week, or the week after? There is the AI Engineer Summit, in case anyone doesn't know about it — sorry, the AI Engineer World's Fair.

From looking at the channels, it looks like next week someone was talking about doing — because the AI Engineer World's Fair is not next week but the following week, right? Following, yes. So it looks like next week someone was going to do large language diffusion models, if I'm not mistaken, from what I saw in the channel, and he was talking about splitting the time. I'd be interested in maybe doing a paper that talks about image generation models and diffusion as a lead-in to what he's going to talk about — as a way to split the time: I could do the first 10 or 15 minutes and let him take the rest. I am interested in Google's large language diffusion model — I haven't looked into it at all, I don't know where the benchmarks are and so on, but it seems super interesting, and I was planning on tuning in next week for his talk; I forget who it was. When I think of diffusion models, off the top of my head it's image generation models, and I've also seen lately that quite a few of the music generation models are diffusion based, so I thought it would be nice to do a little roundup before he gets into the Google stuff or the language models. I don't know if that makes any sense, but I thought it would be interesting.

All right, so basically we'll do image diffusion as a lead-in to language diffusion — next week will be a diffusion-model pipeline. And a plug for the AI Engineer World's Fair: if you haven't bought your tickets to come to SF and meet us, please do so as soon as possible before they run out. After that week we will probably organize, once again, an on-site in-person paper reading club for the World's Fair. I guess Flo will be there, I'll be there, RJ will be there. So do suggest papers you want to cover, and I'm looking forward to the in-person paper club — the exact where and when I may need to figure out with swyx for the World's Fair, but we'll figure it out.

This is a good time to mention: last year, because there are all these after-events and you need to get to a lot of them, we ended up getting split up a lot. So I just want to put out there — and we can talk about this on Discord — that we try to get some sort of consensus on which events everybody wants to go to, and try to get tickets for them together, so we can hang out together at those events. Sounds good. And if anyone wants to organize a specific Latent Space event in the area, we can probably help with that as well — before, after, or during the event. I'm looking forward to seeing you too, Sam — GraphRAG, right? Okay, thanks. Sam, if you're speaking at the GraphRAG thing, do you have any pre-reading we can do? I'm interested in understanding it — GraphRAG is a topic that keeps coming up,
and I've never really dug into it, so I'd be curious to hear what's new and whether there's anything we can pre-read.

Well — I don't know if you were there when Wasim did his paper club session last year, but I'm essentially taking what he did and updating it and adding to it, so if you've watched that, you've basically seen it. We've published a couple of things — it'll be about the specialized LLM for building graphs, the retrieval-aware compression, and the fusion-in-decoder stuff we're doing. This group has seen a lot of it before. That's fine, we all need to go there anyway. So that will be — unless the schedule changes, which it might — June 4th, 11:40 a.m. Thank you — so make sure you all go there to support Sam.

Who else from the group is speaking? Nathan Lambert — Nato. Do you know which track? I have no idea — I think I saw on his blog that he was going to be there and talking, but maybe he just meant he was going to attend.

All right, and we've somehow got ten more minutes. How would you all want the paper club to be run, especially with regard to papers and topics? Is there a particular area of interest, or do we just keep things as they are? I'm trying to figure out how to fill the pipeline more and avoid the situation of "hey, I'm going to cover my own paper unless people ask for something else." I actually personally like hearing people talk about their own papers — I'd rather hear someone who isn't the world's most prominent AI expert talk about their own paper than just read somebody else's paper, because I feel like I get a lot out of hearing the author's take on it. Okay — then what I would request is: if you know any interesting papers, or if you know the authors of interesting papers and feel it would be worth inviting them, just let me know, or let Eugene Yan or swyx know, and we'll be willing to do the reach-out. Like that other time — which paper was it — the scaling one, the Pythia paper for example, we got Quentin on board to help cover it. Quentin is probably on the more prominent side, but it doesn't need to be someone prominent in any shape or form. So we'd love suggestions, especially if you know the person or the personality. I want the authors to present here, yes please. And if you want me to reach out to them to apply more pressure, just let me know.

Since no one else is talking: there's been discussion of having a test-of-time version and then a current-papers version. I actually think that, in as much as there's demand, there are several curriculums I'd be interested in — an architecture curriculum, an agents curriculum, probably diffusion or image models — a bunch of different ones that I feel we could put curriculums together for, though I obviously wouldn't have time to do them all.
I also think that would be helpful for pre-reading, because then you already know what's coming after each paper. One criticism I have of how we do things now is that the paper usually gets announced a day or two before the discussion, and it's hard to pre-read; if I know a week in advance, it's a lot easier to find time over the weekend.

Okay. I do have one related question: how many people in this Discord, or in this paper club, come from a background without coding experience? The reason I'm asking is that I've been exploring — and this is not a commitment, it's something I'm still figuring out the materials for — doing an entire series about AI architecture with the assumption that you do not know how to code. My rationale is that, while I speak highly of Andrej Karpathy's series and the fast.ai series on AI architectures, all of them pretty much assume you know how to code, and there are lots of people trying to understand AI better — not necessarily to build AI architectures, just to understand them better. So I'm just trying to see how many people here don't code before jumping in. Okay — well, I guess most people here know how to code; I know there's a sampling bias in this Discord. Just laying out the idea — the silent majority is very silent.

Okay then. If there are no other paper suggestions, perhaps we should do a test-of-time week at some point — I think what we can do is start doing themed weeks to make it easier. Let's try to do a test-of-time week after the AI Engineer conference, and since next week looks like it's turning into diffusion week, we'll try to find papers around that theme; it makes it easier to pre-plan the papers in advance. Okay — then we'll try to make sure that happens. I'll end today's session slightly early. If you have papers you want to suggest, feel free to put them in the Discord. Take care everyone — see you again next week. Thank you.