
Well, yes, I would, but I was waiting for the screen share to start. Oh sorry — okay, the screen is sharing and the thing is recording.

So we have a few things that came out. About a week ago Kimi dropped Kimi Linear — a smaller model with a different hybrid attention setup; Ted will go over that in a bit. And then this week Kimi K2 Thinking came out. It's basically an extension of Kimi K2, but now with reasoning and thinking, and it's pretty cool because it popped the benchmarks in basically every stat — a big jump — and it's open weight like before. The notable ones were things like Humanity's Last Exam.

I'll give a quick high-level overview before we dive deep. Honestly, this Twitter post by Artificial Analysis is pretty good. It's open source; they use INT4 this time instead of FP8, so it's quantized more aggressively; they doubled the context length; it's a lot more sparse; it's really good on Humanity's Last Exam; and, similar to GPT-OSS, it's now trained to do tool calls and reasoning. It's pretty much state-of-the-art, and it's still a trillion parameters — still huge. It's very, very verbose, though. There are two endpoints, a turbo and a regular one. One thing they note that you wouldn't otherwise see: it's about the most expensive and most verbose — the tokens it commits to thinking are the highest number they've ever seen in their eval harness, where they test the model across all benchmarks. It's nice that they actually shipped INT4 and tested INT4, but it's very verbose; it thinks a lot. That said, it's still very good. There's a Heavy version that uses even more tokens and does even better on benchmarks. It's very much state-of-the-art: on Humanity's Last Exam it beats GPT-5 High, it beats Grok 4 on other things — it's up there and on par.

I of course used GPT-5 to give a bit of a summary, and I think it honestly does pretty well since there's no paper. What's more interesting is the commentary: Nathan from Interconnects did a pretty good "five things that matter about it" post, someone posted about their inference setup on Reddit, and there's a little Q&A session about the quantization. So let's dig in pretty fast.

So, K2 Thinking — I don't know if anyone has thoughts or has actually used it. The other aspect is that their servers are super on fire. They talk about the speeds they get here; performance is good, but the base endpoint is very slow — about eight output tokens per second — while turbo is slightly faster. That's not the worst case: it's still open weights, and other providers will support it. So it's slow from Kimi, but Baseten is serving it pretty fast and still pretty cheap. Will many people use it? Who knows. It's still MIT-licensed, with a clause or two — I think one is that if you make more than $20 million a month, or you have a bunch of users, you have to prominently say you're using Kimi K2.

So now it reasons; it can plan and tool-call between its thinking. It's trained with quantization-aware training in INT4 during post-training — we'll talk about that a bit later; it really helps speed and RL. They doubled the context length, which is pretty exciting. It's quite state-of-the-art stuff — some things are slightly behind, but in general it's quite good. It's a little midpoint where open source was able to take quite a win again.

With it being INT4 compared to K2, we'll talk about quantization-aware training — someone's asking in the chat. The last K2 was FP8, I believe, so this is quantized to take roughly half the memory. There's a deployment guide with vLLM somewhere here; basically you can now run it on a single node, and it's optimized for older hardware — it's not FP4. So cool stuff like that: it's basically a lot faster and better. Here's the license clause: 100 million monthly active users or $20 million monthly revenue.

What else? There's the official blog post, which you can go through — it's mostly evals and hype, there's not much there. Agentic reasoning — oh, my mouse is frozen — okay, we're good. The API is there; it's slow on Kimi's side, but other people host it. It builds a lot on the old K2, it's more sparse, and there's a deployment guide. It can reason and tool-call a lot — between 200 and 300 sequential tool calls. INT4 via quantization-aware training, which we'll get to through the Reddit post in a bit, and longer context. Similar to how OpenAI open-sourced Harmony for GPT-OSS, you want to make sure you follow their format parsing so you can properly parse these tool calls — there's a whole section on this. I don't know if anyone here is actually renting a node to deploy it; most of us will just consume APIs, and that's pretty much the norm right now.

Architecture-wise it's still a trillion parameters, 32B active, 384 experts with top-8 routing and one shared expert, and the context length has doubled. At 32B active this thing can get fast — it's just that Kimi's servers are quite cooked and on fire right now. "Sparse" in this context refers to the sparsity ratio — the number of active versus total experts. This is something Kimi has really been pushing. Is the sparsity ratio going down or up? I thought it's more sparse than before — am I mistaken? We'll double-check that in a sec; maybe someone else can check.

Benchmarks — I don't know if there's much point in going through them. The interesting thing to note is that there is a Heavy equivalent. The benchmarks on Hugging Face are laid out quite well, depending on what you care about. I don't know if many people are actually going to end up using it, but it is quite a bit cheaper, and if you care about Humanity's Last Exam it's pretty state-of-the-art. Artificial Analysis has their aggregated benchmark, which still shows it right there after OpenAI. The only thing to be aware of that doesn't show up in benchmarks is that it is very, very verbose — it uses a lot of tokens to think, and that's not optimal. How to use it — we don't need any of this. Design and training notes: MoE with MLA, a very sparse MoE, 32B active, and it still uses MuonClip — they didn't share much more. INT4 quantization-aware training — they talk about this a bit in the system report and on Reddit, and we'll go over that. They have caching, a standard endpoint, and a turbo endpoint. Okay, I think that's enough of a high-level view of what came out.
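Since the sparsity-ratio question came up, here's a quick back-of-the-envelope in Python using the config figures quoted above (1T total / ~32B active, 384 routed experts, top-8 plus one shared expert). The numbers are just the reported ones and the arithmetic is illustrative, not an official breakdown:

```python
# Rough MoE sparsity arithmetic from the reported K2 Thinking config.
total_params = 1_000e9    # ~1 trillion parameters total
active_params = 32e9      # ~32B parameters active per token

routed_experts = 384
experts_per_token = 8     # top-8 routing
shared_experts = 1        # always active

sparsity_ratio = routed_experts / experts_per_token  # experts available vs. used

print(f"fraction of params active per token: {active_params / total_params:.1%}")
print(f"experts active per token: {experts_per_token + shared_experts} of {routed_experts + shared_experts}")
print(f"sparsity ratio (routed / activated): {sparsity_ratio:.0f}x")
```

With these figures the ratio works out to 48 routed experts for every activated one, and only about 3% of the parameters touched per token.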
Nathan from Interconnects put out a pretty good post about how releases come out from the Chinese labs: benchmarks first, user behavior second, China on the rise — they basically release stuff after they see people in the US doing it. What is new is the interleaved thinking with tool calls, which is really good for agents; not many open models did this except GPT-OSS, so I think this is the first one after that. And of course it puts pressure on the American labs. I don't think there's much more to go over from Artificial Analysis.

Okay, so what is this whole "quantization is not a compromise, it's the next paradigm" thing? An infra engineer shares insights on why this choice matters. The key idea: quantization is no longer a trade-off; it's actually a net benefit. In traditional inference there are two goals of optimization. First, high throughput: as you quantize, you get better utilization of your GPU. This mattered a lot for early dense models on local hardware — whether you have a Mac or a 3090/4090/5090, you could do a 4-bit quant of a 32B model, fit it in RAM, and use it. Second, of course, lower latency: the more quantized it is, the lower the latency you can get.

For their setup, with such a sparse MoE, decoding is actually memory-bound — the smaller the model weights, the faster decoding goes. Previously, in FP8, the model took about a terabyte of memory, so they were at the limit of what a single GPU node with high-speed interconnect can handle. Now they have weights at 4-bit and activations at 16-bit — an interesting kind of quantization — latency drops quite a bit, it maintains quality, and it's in a sense lossless.

Why quantization-aware training over post-training quantization? Post-training quantization worked fine for shorter generations, but it didn't work for long chains. Another big thing they did here: one, they doubled the context; two, they do tool-call reasoning — this thing is more verbose in its thinking than any other model, and it interleaves tool calls in its thinking. Post-training quantization didn't hold up: errors accumulated and precision degraded over long contexts, but quantization-aware training didn't have that problem. There's also the dependence on calibration expertise — another issue with post-training quantization. So K2 Thinking adopted quantization-aware training for minimal loss and more stable long-context reasoning. How it works: it uses weight-only quantization-aware training with fake quantization plus a straight-through estimator.

A few more details: INT4 is the hidden advantage in RL — a few people mentioned this, and it's what most of the blog posts and tweets highlight. INT4 doesn't just speed up inference, it accelerates RL training itself. If you think about RL and all the verification, you have a lot of rollouts being generated, and if you can speed those up you're essentially accelerating all of your RL. RL rollouts suffer from long-tail inefficiency, but this low-latency profile makes them much faster, and you probably have less of the overhead where you're effectively off-policy while rollouts are syncing and new weights are being updated. This is pretty interesting — again, the title of the post is "quantization is not a compromise, it's the next paradigm." They say that in practice each RL iteration runs 10 to 20 percent faster end to end with this INT4 quantization-aware training.
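To make the "fake quantization plus straight-through estimator" idea concrete, here's a minimal PyTorch-style sketch of weight-only QAT. It's a generic illustration of the technique, not Kimi's actual recipe: the group size of 32 and the symmetric INT4 range are assumptions for the example, and real kernels handle padding, per-channel layouts, and so on.

```python
import torch

def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Quantize weights to signed INT4 per group of `group_size` values,
    then immediately dequantize. The forward pass therefore "sees" the
    rounding error that real INT4 inference will have."""
    g = w.reshape(-1, group_size)                      # assumes numel % group_size == 0
    scale = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(g / scale), -8, 7)     # INT4 range
    return (q * scale).reshape(w.shape)

class QATLinear(torch.nn.Linear):
    """Linear layer trained with weight-only quantization-aware training."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quant_int4(self.weight)
        # Straight-through estimator: forward uses the quantized weights,
        # backward treats quantization as the identity, so gradients flow
        # to the underlying full-precision weights.
        w_ste = self.weight + (w_q - self.weight).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)
```

The point is simply that the model learns to be robust to INT4 rounding during training, so nothing degrades when you actually serve the INT4 weights.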
Moreover, they claim quantized RL brings stability: a smaller representational space reduces accumulation error, improving learning robustness. So why INT4 versus MXFP4 or any of the fancier quantization formats? Basically, all the new fancy stuff like FP4 is new to Blackwell and won't work on older hardware. More generality is what INT4 gives you: INT4 works across older hardware, and that's a trade-off they're willing to make. At a quant scale of 1 per 32 values, INT4 matches the FP4 formats; it's more hardware-adaptable even though it's not the latest thing, because it works on non-Blackwell GPUs.

Okay, that's a quick 10-to-15-minute overview of K2 Thinking. I'll pause here for questions and comments. I see the chat is quite active and I honestly haven't had a chance to look through it — if anyone wants to interrupt or go over anything in more detail, feel free to unmute; otherwise we can pass to Ted and do Q&A toward the end and keep it in the Zoom chat.

"Import restrictions force creativity" — yeah, a lot of people say that. The Chinese labs do really good stuff, like MuonClip; a lot of DeepSeek's innovations came about because they don't have the best GPUs, so they're forced to get creative.

How many 5090s do you need to run this? A lot of 5090s. You need the PewDiePie rig, you need the Chinese 48-gig 5090s, or there's always the RTX 6000 Pro, where you pay twice as much for more VRAM — 5090-level performance with 96 gigs. Put four or six of those in a box and you can run this thing locally, and it'll be slow. Or you could just use the cloud, where this thing costs a couple of dollars per million tokens. The Kimi Linear model, though — that's 48 billion parameters if I'm not mistaken, so there's a version of that you can run. But it's more of an experimental model; it's not really a state-of-the-art model yet.

Quirks: "if you ask it for code, it will generate code in its thinking, check whether it's correct, and produce the same final output." I think Ankit is making an interesting point here, but I don't think you can treat thinking tokens as quirks. Thinking tokens are basically noise in latent space — models, even DeepSeek and Kimi, will do random stuff in their thinking: they'll switch languages, they'll write code while working on poetry and poetry while working on code, and then output a really reasonable answer. There's a reason you don't show the raw chain-of-thought to the end user. People were salty at OpenAI's first o1 model for not giving chain-of-thought tokens, but what we see now is that there's so much noise in there that summarizing it is probably the better UX. You don't want to ask for code and get poetry in your thinking trace; it doesn't really help.

Any other questions? Otherwise I'll pass over to Ted and keep answering in the Zoom chat. "It's better than MiniMax" — the trade-offs are there; they don't really talk about them. "Locally it will be 10 tokens per second" — no, it'll probably be worse than 10 tokens per second. Right now the Kimi servers are serving it at eight tokens per second, so 10 tokens per second would be great on your 5090 — these things are pretty raw.

Okay, a quick question: you mentioned a 10 to 20 percent end-to-end speedup on RL training. That doesn't seem like a very large number, and I was wondering — everything else in this article is about the bandwidth limit, right? Can we correlate those two things: when you went to INT4, how much more bandwidth headroom do you get for GPU transfers, and how does that correlate with the speedup?

Yeah, I think there's a high-level way to work through it. Quantization makes things faster, and you're also taking half the RAM. There's actually a full breakdown of this somewhere — I can share a link in the Zoom chat at some point. You break down the factors: it's faster, and you need half the VRAM; they did a scaling comparison against Chinchilla somewhere. But in my opinion, 10 to 20 percent is quite a bit. If you compound it across your experiments plus your final training run — the final run to train a trillion-parameter model is in the millions of dollars — then being 20 percent faster means roughly 20 percent fewer GPU-hours. I don't know if that's exactly a fair way to put it, but 10-to-20-percent speedups are pretty uncommon, so I think that alone is impactful. Beyond that, this is very rollout-heavy, and if you look at where the actual bottlenecks were — bandwidth across their parallelism, the pipeline parallelism and GPU sharding — fitting more on fewer GPUs really helps. I feel like I saw a good post about this; I'll try to find it and share it.

"You could spend more" — it's not just about spending more, though. This optimization also applies at inference time, so you should see speedups during plain inference. And forget about other providers — one of the big benefits of open models and open weights is that you can serve them yourself. If you want to use K2, you need to find a terabyte of VRAM; with K2 Thinking you can do it in half the memory. That alone is worth more than a 10-to-20-percent speedup — it's also half the GPUs required.

The other thing they mention is that going to INT4 makes RL training more robust, and I don't quite understand the last sentence of that section — it says a smaller representational space reduces accumulation error. Why would that be? It seems like being more quantized would make errors accumulate even worse. What's the reasoning there?

I think someone talked about it in the comments here — there's an answer someone gave that I'll share in the chat in a bit — but I think we should dig into it separately. I remember reading an analysis somewhere arguing that RL updates tend to be roughly rank-one, so INT4 shouldn't be a problem for RL at all: if it works for regular training, it'll definitely work for RL. So I don't know if that answers your question, Frankie — maybe the honest answer is just that it's faster and it's just as good. Okay: "it brings stability; a smaller representational space reduces accumulation error, improving learning robustness." I'm sure there's more we can dig into here, but it's a fair question. Maybe what they're saying is that the quantization-aware training is what makes it better, versus post-training quantization. Okay — I hope I didn't take up too much of your time, Ted; there wasn't that much in this one.

No, that was great. So let me talk about the Kimi Linear paper — let me share my screen. I'm always impressed with how much information Vibhu can cover so fast; I don't know that I can match that. So, the Kimi Linear paper — quick highlights. It's a hybrid linear attention architecture: one quarter of the layers are regular attention, and three quarters are their new Kimi linear thing, which they call Kimi Delta Attention, KDA. We'll get into it, but they describe it as an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, plus a bespoke chunkwise algorithm that achieves high hardware efficiency — remind me if I forget to talk about that, because it's an important element for any linear model.

They pre-trained this at 3 billion active, 48 billion total parameters, so it's not quite state-of-the-art size, and I think some of the experiments in the paper were done even smaller. So treat this as a proof of concept more than the model you're going to use daily. The key thing is that because the linear layers need so much less memory, if only a quarter of the layers are regular attention, you're reducing your KV cache size by 75 percent — and if Vibhu says 10 to 20 percent is a good increase, 75 percent is a very notable decrease in memory use. They have some decoding throughput numbers I didn't fully pin down — the graphs differ, it's something like 2.9x, 3x, 6x — but basically, because decoding is memory-bandwidth-bound, when you cut the KV cache by three quarters, decoding speeds up; that's where the gain comes from, whatever the exact number. And the cool thing is they open-sourced the KDA kernel and the vLLM implementation.

Okay, let me jump ahead to what this model looks like. They have layers that do KDA instead of regular attention, and layers that do regular attention, and the TL;DR is they ran ablations and found that three KDA layers for every one normal attention layer gave the best bang for the buck. The attention layers use latent attention (MLA) like DeepSeek, but that's not particularly noteworthy, and it's a mixture-of-experts model where the MoE is completely standard. So the really interesting thing to talk about in Kimi Linear is the new KDA layer that replaces attention. It's doing a sort of local attention: it's not there to do in-context retrieval of a token from 10,000 or 100,000 tokens ago — that's what the quarter of the layers with full attention are for — but if you need something like the previous token or two tokens ago, this layer takes care of it, and it's actually really strong and can do more than that.
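A quick back-of-the-envelope on that 75 percent KV-cache number — a hybrid where only 1 in 4 layers keeps a per-token KV cache shrinks the cache by roughly that factor. The layer count, head dimensions, and context length below are made-up illustrative values, not Kimi Linear's actual configuration, and this ignores both MLA's own KV compression and the small fixed-size state the KDA layers carry:

```python
# Toy KV-cache sizing for a full-attention stack vs. a 1:4 hybrid.
def kv_cache_gb(n_caching_layers, kv_dim, seq_len, bytes_per_elem=2):
    # keys + values, per cached layer, per token, in FP16/BF16
    return n_caching_layers * 2 * kv_dim * seq_len * bytes_per_elem / 1e9

layers, kv_dim, seq = 48, 4096, 128_000          # hypothetical model + context
full = kv_cache_gb(layers, kv_dim, seq)          # every layer caches K/V
hybrid = kv_cache_gb(layers // 4, kv_dim, seq)   # only the full-attention layers do

print(f"all full attention: {full:.0f} GB")
print(f"1:4 hybrid:         {hybrid:.0f} GB  ({1 - hybrid / full:.0%} smaller)")
```

Since long-context decoding is dominated by streaming this cache from memory, the same ratio shows up in decoding throughput.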
All right, I'm trying to follow the chat, but you guys are typing faster than I can keep up with, so shout out if there's anything I'm going too fast on or that you want me to answer.

I thought that since we're going to focus on this KDA thing, it makes sense to start with their related work section, down in section seven, and then get into the details — I think we'll have just enough time to talk a little bit about the math. It starts with linear attention, and the idea is that attention, as you know, is quadratic, so linear attention "reformulates the quadratic attention map into kernelized feature interactions, replacing the softmax with a positive feature map so that attention can be computed through two associative matrix products." This is a bit like the kernel trick you see in SVMs, which can take something non-linear and turn it into something linearly separable — you don't strictly need to know that. You get rid of the explicit quadratic attention similarity matrix, because you're hoping the model learns a pattern that this kernel makes linearly separable, and then you're left with purely linear calculations.

Then: "subsequent work strengthens vanilla linear attention significantly through more refined memory control, shifting from data-independent decay to more adaptive, data-dependent mechanisms." For example, Mamba just had a fixed decay, and then in Mamba-2 they made the decay a function of the current token — that's what they mean by data-dependent decay, and I'll talk more about how this decay stuff works. And: "refining the decay granularity from coarse headwise to precise channel-wise decay." That's one of the big things here: if my model dimension is 2048, instead of decaying all 2048 of those floats by the same factor, say 0.9, I'm going to choose, channel by channel, which ones to keep and which ones to decay. "Gated linear attention generalizes these approaches with diagonal channel-wise gates" — that's the decay thing. Table 7 summarizes it: collectively, these methods cast attention as a compact recurrent memory updated with parallel prefix-scan operators and fused matrix multiplies, aligning with modern accelerators. That's a mouthful. Basically, what linear attention, KDA, Mamba, RWKV, and Gated DeltaNet all have in common is that at decode time they act like an RNN: you push one thing in, it updates the state, you get one thing out.

That alone is not enough. To have a good linear attention you need three things. The easy one is that decoding is fast. The second is that training and/or prefill is fast — that's where the business of parallel prefix scans and fused matrix multiplies comes in; the chunked operator in this paper is the thing that makes the prefill or training pass fast. Otherwise, with a vanilla RNN, if you hand it a big pile of code — a hundred thousand tokens — it would have to process those hundred thousand tokens sequentially and be really slow: decoding tokens-per-second would be fast, but prefill would suck. And the third thing you need is for the attention to actually be smart — able to learn very complex patterns.

So what I'm mostly going to talk about is this complementary view that connects linear attention to the idea of fast-weight memory, where the state is a low-capacity associative table — that's the view I find most intuitive for understanding these things. They also talk about linear attention with a gating mechanism, and it's really the same idea; they say the primary distinction — sorry, someone had a question that fits right here: can you go into prefill a bit more — what makes prefill fast, and what is prefill in general?

Sure. When we train models — take early pre-training — you start with a context length of, say, four thousand. You present all four thousand tokens with teacher forcing, which means that no matter what the model predicts, you always feed in the four thousand tokens from your ground-truth corpus, and then you calculate cross-entropy loss in parallel across all four thousand token positions. You're doing essentially the same thing in prefill. If somebody asks a very short question like "what is the capital of France," it's eight tokens — no big deal whether you process them in parallel or sequentially. But when I'm working on code and I give you files totalling a hundred thousand tokens, then before you can start decoding — whether you're thinking, calling tools, whatever — you first need to process all hundred thousand of them. We've come to call that portion of the process "prefill" because its characteristics are very different from the one-token-at-a-time decoding that happens afterwards.

During prefill we have all the data, all hundred thousand tokens, and we don't care about any of the predictions, so we can do it in a highly parallel fashion. For regular attention you can do this all at once, and your GPUs will be compute-bound because they're doing so many matrix multiplies — computing keys and queries, multiplying them together, plus the MLPs. During decoding you only get one new token, one new query, which you multiply against all your stored keys before computing attention, and once your context gets long, loading those keys and values into the GPU becomes the bottleneck — you're memory-bandwidth-limited, because the matrix multiplies themselves are relatively fast and tiny.

Now, an old-fashioned RNN is great for decoding: you have a small state, you put in one token, do a little computation, and get the next token out. But there is no way to do prefill in parallel with an old-fashioned RNN — you'd have to process tokens one at a time, and if I give you a hundred-thousand-token codebase, you're kind of screwed. So if you want a competitive RNN-style linear mechanism that saves memory, you also need a second mode where you can process those hundred thousand tokens in parallel rather than one at a time. There has to be some trick, and that's what the chunked method in the KDA paper — and similar machinery in other papers, Mamba and so on — provides: a way of transforming the math so that you get the same answer computed in parallel as you would processing every token one at a time.
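Here's a tiny NumPy illustration of that sequential-vs-parallel equivalence for the simplest possible linear attention (no decay, no normalization — so it's a toy, not the actual chunked KDA kernel, which works on blocks rather than materializing a state per token):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 4
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))

# "Decode" view: one token at a time, carrying a fixed-size state S.
S = np.zeros((d_k, d_v))
out_seq = []
for t in range(T):
    S = S + np.outer(K[t], V[t])    # write k_t v_t^T into the state
    out_seq.append(S.T @ Q[t])      # read it back with the current query
out_seq = np.stack(out_seq)

# "Prefill" view: the same outputs from batched ops and a prefix sum over
# the rank-one updates -- this is the structure chunked kernels parallelize.
states = np.cumsum(K[:, :, None] * V[:, None, :], axis=0)   # (T, d_k, d_v)
out_par = np.einsum('tkv,tk->tv', states, Q)

assert np.allclose(out_seq, out_par)
```

Same numbers both ways; the second form just exposes enough parallelism to keep a GPU busy during prefill or training.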
Did that answer the question for whoever asked? It already scrolled off my screen — all right, cool, thank you.

Let me just read this real quick: "the primary distinction among various gated linear attention mechanisms lies in the parameterization of the forget gate." We haven't talked about this yet, so reading it here is a little out of order. For instance, RetNet uses a data-independent scalar decay alpha, and Mamba-2 employs a data-dependent scalar; gated linear attention uses a diagonalized, fine-grained matrix, offering an effective trade-off between efficiency and performance — and that's what we're going to get into. Note also that aside from linear attention, another way to speed things up is sparse attention — the new DeepSeek attention thing, whatever it's called, is doing a sparse-style approach where they don't look at all the tokens at once when computing attention. But this paper is a linear form, so I'll focus on that.

Any questions? If not, let's jump into the actual linear attention material. Section 2.2 is the key section for understanding what they did in KDA. They don't say this explicitly, but what they're actually doing across these subsections is a history of linear attention, building up from the simplest form to the most complex — their latest thing, Kimi Delta Attention. So none of the stuff in section 2.2 is literally what's used in KDA; it's the history that builds up to "we built KDA by making a slight tweak on the last thing here."

To start, look at the first equation — basic linear attention — and let me scribble a little so you get the idea of the associative memory. Do we have whiteboards turned on here? Okay — this is new, this is fancy, and apparently everyone can draw. Please don't draw on this, because I need to draw on it; if you want me to show you something you need to stop messing around. How do I erase? All right, thanks guys. Oh — I just lost the whiteboard. I think that was me: I thought I closed it for myself but apparently I did it for everybody, sorry. I'll start a new one — nobody mess with it. Do you guys see it, or do I need to share my screen? We see it — okay.

So let's start with a regular KV cache. It looks something like this: along one axis is our sequence length, and along the other we have our key dimension d_k and our value dimension d_v. Hopefully this looks familiar: you have keys and values, and the cache is as long as your sequence. For every token you've ever seen, you store a copy of the key and value you got by running that token through the projection matrices W_k and W_v. You don't have to store them — you could recompute them, and that's the original way of thinking about attention — but it's actually better to think of it in terms of the KV cache.

Now, when we train the model, it learns different patterns that it cares about and gains the ability to recall them on demand. Say the LLM wants to know whether the current token is a noun. One of the heads could put a one in a certain position of the key whenever the current token is a noun; then, if the query also has a one in that position, the dot product of query and key is large, and you recall those entries. For simplicity, imagine you just divided up your key dimensions: a one here any time it's a noun, a one there any time it's a verb. That's a way to build this key-value memory. Any time a head wants to match on prior words that are nouns, it passes in a query that looks like (1, 0, 0, 0, 0); every word that isn't a noun has a zero in that slot, so its score is zero, and you only get matches on nouns. If everything is zero except the verb slot, you pass (0, 1, 0, 0, 0) and get all the verbs. Obviously you can learn whatever pattern you want, but this is the idea: we've built an associative memory. A regular array in Python is accessed by index — "give me index 23" — but here you can say "give me the nouns" without knowing what positions they're in or how many there are. And if there's only one noun, querying with (1, 0, 0, ...) gives you exactly that entry.

The problem — where it gets expensive — is that in this formulation, every time you decode a new token you have to lengthen the cache: another column, and another, and it just keeps growing. (My columns aren't lining up, but you get the gist.) By the time you reach a hundred thousand tokens, this thing is very large. The promise of linear attention is: let's not let it grow without bound — let's assign it a fixed amount of memory. If you give it a fixed size of 1024, then for the first 1024 tokens — or the first 1024 concepts you want to store — you're great: you can one-hot encode them, a one in the first row for the first thing, a one in the second row for the second, and so on. But what happens at the 1025th? If your update rule just cycles through the one-hot encodings, you'll wrap around to a one in the first position and write something that collides with what's already there — this is where the literature talks about collisions. So you have to find a way to solve that. One option is to erase the entry with the one in the first row to make room for the 1025th thing. Ultimately you need a rule for what to erase when you want to write something new, and that's where the more sophisticated linear models get their decay rules: the erase rules for when to get rid of stuff in your now fixed-size KV cache.

Next, a little linear algebra. Yes, I could put a one in the first row and a one in the second row, which gives a nice orthogonal basis — in this case orthonormal, since they're length one — the principal basis, the one that's easy on the eyes. But if the space is d-dimensional, I can pick any d orthogonal vectors and they work perfectly well; they don't have to be ones and zeros. The first vector could be all ones, and the next one can be anything as long as it's orthogonal to the first. That's why, if you look at the keys in an LLM, you won't see (1, 0, 0, ...) and (0, 1, 0, ...), but they may very well be close to orthogonal to each other. And we benefit from the fact that in high dimensions, random vectors are very nearly orthogonal: the dot product of two random vectors is likely to be close to zero. So in practice we don't use one-hot vectors, we use learned vectors.

So the first thing to understand — let me toggle back to the equation at the top of section 2.2. It says: here's my hidden state S, which is my new rectangle — we initialize it to all zeros, and when we process a token we add in k times v transpose. That's an outer product: if you're used to dot products, kᵀv would be a scalar, but k vᵀ gives us a rank-one matrix — and if k and v have the same dimensionality, a rank-one square matrix. So the update is S_t = S_{t-1} + k_t v_tᵀ. And the way you get your output — the equivalent of your softmax attention output — is you take the transpose of this matrix and multiply it by your query: o_t = S_tᵀ q_t.

I want to show you quickly how this ends up being a really clean associative memory. Dropping the t subscripts and doing one time step: we start with S all zeros, and then S = k vᵀ. How do we read the output? The output is o = Sᵀ q. Rewriting S using the equation above and transposing it — remembering how the transpose of a matrix product works — we get o = (k vᵀ)ᵀ q = v kᵀ q. And because matrix multiplication is associative, we can regroup the parentheses: o = v (kᵀ q). If you're familiar with the softmax attention formula, you have this kᵀq term — or qᵀk, written the other way around — you multiply your queries and keys, and then ultimately multiply that by your values, with a scaling factor of the square root of d and so on. So in this formulation we've basically got the same functionality: the query-times-key business, just like in regular softmax attention, feeding a weighting of the values.

I'll go a little fast for the sake of time, but hopefully this gives you at least some intuition that we can store these things and recall them, and it behaves very similarly to regular attention. The difference is that in linear attention we don't have the two growing boxes of keys and values. We have one box, initialized to zeros; when the first key and value arrive we take their outer product — a rank-one square matrix — and add it in, and zero plus that is just the first update. The key fact from linear algebra is that if the second key is orthogonal to the first and you do this again — regardless of the value — adding that rank-one matrix in gives you a rank-two matrix, and when you multiply by either key in the readout, you get back the corresponding value exactly, with no loss whatsoever, if the two keys are perfectly orthogonal. If they're only mostly orthogonal, you get 98-99 percent recall, which is good enough. And to the extent that your query is a mix of different keys, you get a weighted, blended average of the corresponding values — which is exactly what softmax attention gives you: a weighted sum of values based on your attention scores.

The shape is also a little different: the KV cache stores keys and values separately, so its height is 2d (assuming d_k = d_v), while this state's height is only d. So it's actually an even more efficient representation — by multiplying keys and values together and exploiting orthogonality, we don't have to store the keys and the values in separate sections; we slam them all into one matrix. Any questions before I move on to the rest of section 2.2?
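Here's that associative-memory behavior in a few lines of NumPy — the noun/verb keys and the stored values are made up for the illustration:

```python
import numpy as np

d = 8
k_noun, k_verb = np.eye(d)[0], np.eye(d)[1]          # two orthonormal keys
v_noun = np.array([1., 2., 3., 4., 0., 0., 0., 0.])  # arbitrary stored values
v_verb = np.array([0., 0., 0., 0., 5., 6., 7., 8.])

# Fixed-size state: a sum of rank-one outer products k v^T.
S = np.outer(k_noun, v_noun) + np.outer(k_verb, v_verb)

print(S.T @ k_noun)                        # exactly v_noun
print(S.T @ k_verb)                        # exactly v_verb
print(S.T @ (0.7 * k_noun + 0.3 * k_verb)) # mixed query -> blended values

# A random extra key interferes a little, because in only 8 dimensions
# random vectors are not that close to orthogonal. In the thousands of
# dimensions a real model uses, this interference shrinks toward zero.
rng = np.random.default_rng(1)
k3 = rng.standard_normal(d); k3 /= np.linalg.norm(k3)
S = S + np.outer(k3, np.ones(d))
print(S.T @ k_noun)                        # close to v_noun, plus some noise
```

Orthogonal keys give exact recall, near-orthogonal keys give approximate recall, and mixed queries give the same weighted blending that softmax attention produces.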
All right, cool — let me share my screen. It says the whiteboard is going to go away; I don't know if it can be saved and brought back. One sec, let me save it — is there an option to export it as a PDF or something? Yeah, there's a way to save them. I got it — I can save the whiteboard as a template and also export it. A PDF, if you would be so kind. Yep, I got a PNG, I'll also do a PDF, and then we can figure out how to share it. It's not that beautiful, but all right. It's still there — okay, go ahead and get rid of it; I'll bring it back if anyone needs it. Hopefully you can see my screen now.

So we just talked about this first equation — the super simple way of storing things — where S is simply a sum of rank-one outer products of approximately orthogonal keys. This thing dies as soon as you have one more item than the dimensionality: if it's 1024-dimensional, it works great for the first 1024 things and then stops working when you try to put anything else in. One thing you can do is this delta rule, which basically says: I'm going to subtract a little bit out in order to make room for the new thing I'm adding — that's what the extra term in the next equation is about; I won't go into super detail.

Ultimately, what people evolved to is an idea that will probably be more familiar: weight decay. Imagine that square matrix is our memory store, and every time you get a new token you multiply the whole matrix by, say, 0.95. Then anything very old has been multiplied by 0.95 over and over again and eventually gets close to zeroed out, without you having to do anything explicit. So you have a scalar alpha that acts like the weight decay on our weight matrices when we do gradient descent with AdamW. The update after a new token is: the identity matrix times the old state (which is just the old state), minus this particular term — the making-room, forgetting component — plus the brand-new key-value outer product, which is the thing we're adding into memory; and then the whole old part gets decayed by some value like 0.95. This actually worked pretty well and was pretty stable: once you got past a thousand tokens and went to 1,500, 2,000, 3,000, it mostly remembered the last thousand tokens, so it behaves a lot like sliding-window attention.

I saw the comment earlier in the chat — why not just do sliding window? It works a lot like that, but it's a bit smarter, because the model can selectively decide that if a concept is super important, it will keep it even though it's more than a thousand tokens old. Hypothetically, if you train a model on stories and the first sentence says "our main character is named Bob," it could decide, based on context, to never forget that, and really make sure Bob stays in the memory because it's super important. With a sliding window it's a hard-coded rule — only the last thousand tokens — so you can't have that selectivity.
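As a sketch of what such an update step looks like, here's a simplified gated delta-rule step in NumPy. The exact placement and parameterization of the gates differs between the published variants (RetNet, Mamba-2, Gated DeltaNet, KDA), so treat the shapes and names here as illustrative; the per-channel decay vector is the refinement described just below.

```python
import numpy as np

def gated_delta_step(S, k, v, beta, decay):
    """One recurrent step of a gated delta-rule update (simplified).
    S     : (d_k, d_v) associative state, read out later as S.T @ q
    k, v  : current key / value vectors
    beta  : write strength in [0, 1]
    decay : scalar -> Mamba-2-style decay of the whole state;
            length-d_k vector -> channel-wise gating, where each key
            channel keeps or forgets on its own (the KDA-style refinement).
    The (I - beta k k^T) factor is the delta rule: erase whatever the state
    holds along direction k before writing the new k v^T association."""
    d_k = S.shape[0]
    D = np.diag(np.broadcast_to(np.asarray(decay, dtype=float), (d_k,)))
    erase = np.eye(d_k) - beta * np.outer(k, k)
    return D @ erase @ S + beta * np.outer(k, v)

rng = np.random.default_rng(0)
S = np.zeros((8, 8))
decay = np.full(8, 0.95); decay[:2] = 1.0   # never forget the first two channels
for _ in range(4):
    k = rng.standard_normal(8); k /= np.linalg.norm(k)
    S = gated_delta_step(S, k, rng.standard_normal(8), beta=0.9, decay=decay)
```

The scalar-decay case is the sliding-window-like behavior described above; the vector case is what lets the model protect "Bob" while forgetting everything else.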
Whether your model is smart enough to learn that rule — to know it should remember Bob — is a whole separate question, but that's why they train these models the way they do. Finally: with plain weight decay we decay every entry simultaneously and equally; the next refinement is to decay based on what the current token is; and the final thing they do in Kimi Linear's KDA is to say, this entry is a vector of, say, 2048 floats, and I'm only going to decay the specific channels I want — I can choose selectively. So instead of a scalar like 0.95 up there multiplying all the weights, it's now a vector — they make it a diagonal matrix — which effectively says, per component, what to do: maybe the first 10 floats get 0.95 and the rest get 1.0 and aren't decayed at all. At the end of the day, that's the trick.

Then they say: we have this efficient chunked algorithm for the parallel case — prefill or training, when you have all the tokens at once — and it's an RNN, so it's always fast when you're doing one token at a time. And then they compare against the other linear mechanisms that came before: you may remember Mamba, and Gated DeltaNet, which is more recent. If you look at these curves, Mamba-2 is the orange line, and on accuracy KDA smokes it; Gated DeltaNet is newer and is similar but a little bit better than Mamba-2.

Then they get into — I know we're out of time — some simple experiments showing that their linear attention alone, in a toy two-layer model, outperforms the older approaches: it beats Linformer, Mamba, Mamba-2, RWKV, and Gated DeltaNet on a few things that linear models tend to struggle with, like reversing a string and multi-query recall. So, now that we're confident we have a better linear component, let's actually stick it in a model and run ablations on how much full attention we need. They land on one quarter full attention: more than a quarter gives almost identical performance at more cost, and less than a quarter — like one eighth full attention — drops performance. That's how they settled on the 1:4 ratio.

Now, I'm not sure they did their ablations completely right — in particular I'm not sure how they handled RoPE for the baselines — but what they found was that if they took away positional embeddings for the quarter of the layers that are full attention, it actually performed better than giving those layers positional information. Their argument is that regular attention is so good at positional information that it trumps the KDA and says "I'll take care of the short-term dependencies." When they took RoPE away from the full-attention layers, those layers could no longer do positional information at all, so they fully delegated the responsibility for "previous token, two tokens ago" to the KDA layers. That let the regular attention layers focus one hundred percent on long-context semantic tasks — give me the nouns, give me the last time I saw the name Bob — and not spend any of their capacity on short-range things. There may be other dynamics going on here, but I'm of the opinion that plain RoPE is kind of crude — we should be using at least truncated RoPE. MLA separates out the RoPE part so it isn't adding noise: with normal RoPE you multiply it into your token signal, so you're noising the signal, whereas MLA concatenates it, so you're not. My personal opinion is that we're going to learn some things here: RoPE is awesome, it beats the other positional embeddings, but there is a cost to it, and we can be a little smarter about it — and that's what they surfaced by taking it away from the regular attention layers. For me that was the most interesting part.

The rest of it is "we ran some experiments and did really well." They make strong claims that they outperformed a model with the same training recipe and the same number of tokens but with all full-attention layers, instead of one quarter full and three quarters KDA. I'd believe it, but I think it's possible they messed up their ablations. And if they do in fact beat it, my personal take is that it's not because KDA is so awesome — it's because RoPE has a cost, and they found a way to measure that cost: by removing RoPE from the full-attention layers, they let attention do what it does really well, without the noise. That's just Ted's hot take.

I can stay longer if people have questions or want to talk about it, but out of respect for people who may need to go, I think that's the most important thing to know. There are other models out there doing this kind of hybrid — I saw some stuff in the chat — and yes: three quarters of the layers being something faster and one quarter full attention means a three-quarters-smaller KV cache and much, much faster decoding, because decoding is memory-bound, and as long as it doesn't totally kill you on in-context learning, it's sort of an overall win.

Thank you — I understand a lot more about KDA after your step-by-step walkthrough; just wanted to say thanks. Any other questions? All right — I don't know if that's a good sign that it was clear, or I just completely lost everybody.

If you go back to this diagram — they didn't label it, but on the left these are the queries and the keys. As usual, you take your token and run it through a weight matrix, W_q, W_k. If you remember Mamba, they run things through some 1D convolutions, and those are helpful — a 1D convolution can very easily pick up simple patterns. There are normalization layers everywhere — this one is QK-norm, like Qwen has. These are the values — again standard, except they also run through a convolution.

One trick they use on their alpha: something akin to LoRA, where they take two rectangular low-rank matrices and multiply them together to make the square matrix, because they don't really need all d-squared parameters there, and they do a similar trick on the final output gate. Then this is their beta, the forget factor. Everything here is channel-aware — nothing applies equally to all 2048 floats; whatever they do, adding, subtracting, or multiplying, it's always at least dimensionality d. And this is the gate you see in gated attention — I don't remember which other models have it now; the latest Qwen models do. Normally you compute your attention output and just add it to the residual stream; here there's a sigmoid gate that can say, "hey, if you're layer three and for some reason less important, I'll multiply your output by 0.5," so you make a quieter addition to the residual stream, while a super important layer gets the full 1.0.

If you know about attention sinks and massive activations — one of the things researchers seem to be fighting — the thinking is that adding these normalizations and gates avoids such large spiking activations, which helps when you're trying to go to smaller and smaller quantization, because you no longer have such extreme values to support. The holy grail is to figure out normalizations and gatings such that all your activations stay in a nice little Gaussian distribution — then you can totally nail it with something like FP4 or INT4. Right now, if your activations occasionally spike a hundred times larger, it's just very difficult to support that in low precision. That's why in these releases the weights are small but the activations are still 16-bit: it's because of those outlier activations. If you really quash them, you can go down to 8-bit activations and maybe eventually 4-bit. So that's a bit more detail on what the layers actually look like — you can see there are just a few extra normalizations and gates in here.
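To make the low-rank gate idea concrete, here's a hypothetical sketch (the class name, rank, and dimensions are mine, not the paper's): two skinny matrices stand in for a full d-by-d projection, and a sigmoid squashes the result into a per-channel gate that scales the layer's contribution to the residual stream.

```python
import torch

class LowRankGate(torch.nn.Module):
    """Per-channel gate in (0, 1) produced from the token representation,
    parameterized LoRA-style with two low-rank matrices instead of a full
    d x d projection (2 * d * r parameters instead of d^2)."""
    def __init__(self, d_model: int, rank: int = 32):
        super().__init__()
        self.down = torch.nn.Linear(d_model, rank, bias=False)  # d -> r
        self.up = torch.nn.Linear(rank, d_model, bias=False)    # r -> d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 0 ~ mute this channel's contribution, 1 ~ pass it through fully.
        return torch.sigmoid(self.up(self.down(x)))

d = 2048
gate = LowRankGate(d)
x = torch.randn(1, d)            # token representation entering the layer
layer_out = torch.randn(1, d)    # whatever the KDA / attention layer produced
residual_update = gate(x) * layer_out   # gated before joining the residual stream
```

At d = 2048 and rank 32 that's about 131k parameters for the gate instead of roughly 4.2M for a dense square projection, which is the whole point of the trick.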
And I had a question: at the beginning you drew a parallel between linear attention and kernel methods. I think I understand kernel methods reasonably well — I just wanted to understand the parallel you're drawing.

I don't know if I can do a great job explaining it. I gave the associative-memory analogy for how to understand these things. The other way of thinking about it is this: you have this n-by-n attention pattern — lower triangular because of causal attention, but still — and the pattern can look like anything; the high values and the low values are not necessarily linearly separable. If the high and low values form something like an XOR pattern — high, low, high — then when, instead of the full n-squared computation with a softmax, you try to squash it down to a linear, low-dimensional representation, you can only express linearly separable patterns, where the highs all land on one side and the lows on the other. That's the somewhat mathematical interpretation of why linear attention — Linformer or Performer or Mamba-1 — cannot be as expressive, why it hits a glass ceiling that full n-squared softmax attention doesn't hit once you start training really large models.

Now imagine you take those attention patterns and run them through some other function. The classic example — I can pull it up if you want: you have high and low values centered on zero, and if you apply x squared and turn it into a parabola, the high values are all up at the top and the low values sit at the bottom of the parabola — now they're linearly separable in the higher dimension. That's one interpretation of these linear attentions: they apply some transform to move into a different dimension in which things are linearly separable, and if you could do that perfectly, you'd theoretically get the same performance as full n-squared softmax attention. I don't know the math well enough to say whether the problem is that you fundamentally can't, but I think the practical problem is that we're not learning super complex transforms — these methods use transforms that are very fast to compute, and a fast-to-compute transform might not cover every possible scenario. In some scenarios, even though the model gets to learn the queries and keys to fit its purposes, it still can't make everything it wants to do linearly separable — there are so many millions of concepts and only so many heads — so you hit a glass ceiling.

And the non-linearity it's using is basically a polynomial one, through the recurrence relation — is that the idea?

The math is getting beyond me pretty quickly. The original "Transformers are RNNs" paper — it's about four years old — is, I think, the best reference: you'll see they have this phi function there for the transforms. It's like support vector machines: you have a kernel that transforms the data, but you want something that's super fast to compute, and the kernel trick lets you compute those dot products fast without ever explicitly doing the transform. They make a similar argument: yes, there are all sorts of transforms you could use; we're just going to pick this one because it's fast. I think that's ultimately where you get bottlenecked — you lose some expressivity, and you're also not training that function bespoke for whatever the LLM is trying to do. If you somehow had a way of picking the function per head, per layer, maybe you could get these linear models to perform even better.
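For reference, that feature-map formulation looks roughly like the sketch below — a causal linear attention where softmax(q·k) is replaced by phi(q)·phi(k) for a positive elementwise map phi (elu(x)+1 is the choice from the paper discussed here; everything else is a toy setup):

```python
import numpy as np

def linear_attn(Q, K, V, phi=lambda x: np.where(x > 0, x + 1.0, np.exp(x))):
    """Causal linear attention with a feature map phi (here elu(x) + 1).
    Because phi(q) . phi(k) factorizes, the sums over past tokens can be
    carried as a fixed-size running state instead of a T x T matrix."""
    T, d_v = V.shape
    Qf, Kf = phi(Q), phi(K)
    d_f = Qf.shape[1]
    S = np.zeros((d_f, d_v))   # running sum of phi(k_s) v_s^T
    z = np.zeros(d_f)          # running sum of phi(k_s), for normalization
    out = np.zeros((T, d_v))
    for t in range(T):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (S.T @ Qf[t]) / (z @ Qf[t] + 1e-9)
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 5, 4))   # toy shapes: T=5 tokens, d=4
print(linear_attn(Q, K, V))
```

The fixed, hand-picked phi is exactly the expressivity bottleneck being discussed: it stands in for the learned, per-head transform you'd ideally want.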
The good news is that since that paper — through Mamba-1, Mamba-2, RWKV — the Kimi folks are saying their linear attention, with slightly different update rules and slightly different forget rules, outperforms all prior linear attention. It's still not good enough on its own: notice they didn't build the model as 100 percent KDA; they still keep one quarter regular attention. But it's still better than doing three quarters Mamba layers, or three quarters Gated DeltaNet layers.

Okay, got it — I'm going to check out that paper. You said it was "Transformers are RNNs"? Yeah, something like that — let me see if I can find the exact thing. Let's see — ICML 2020. There are multiple papers, so I'm getting them confused; the one I found is "Fast Autoregressive Transformers with Linear Attention." That's what I found too. Yeah, maybe that's the one I was thinking of — let me take a quick peek at the paper. Yep, that's the one: you see these phi functions they talk about for the transforms — they wrap a function phi around all of their query and key terms. All right, awesome, thank you.

The silence is scaring me — I think I lost people — but hopefully this gives you a little better insight into KDA. Thanks, Ted, this was amazing, really appreciate it.

I have one last question. I was wondering whether this means inference for linear attention might be better — does this mean we're going to move toward distilling linear-attention models from models trained with full attention? I was just reading some papers on that, and it seemed to make sense given what you're saying.

So the Kimi people are saying this can be a full drop-in replacement: it will train faster — I'm not sure about prefill — and it will decode faster than regular full attention. So according to the paper's claims, you can just completely scrap your regular-attention models and replace them with these one-quarter-attention hybrids with KDA. I think that claim is a bit strong, and time will tell, but no, you don't necessarily need to do anything fancy with distillation — you can just train this instead of a regular full-attention model, is what they're claiming.

All right — I just realized my chat wasn't auto-scrolling, so I thought there were no messages, and there are like 80 bazillion messages I didn't read. On the other side: if anyone wants to volunteer for next week, I think a few papers were posted in Discord; otherwise we'll continue the discussion down there. Next week is AI Engineer New York — if any of you are in New York, come to the conference and we'll do something Latent Space in person. Where's the Discord? You can find it on the Luma, or search for the Latent Space Discord; there's a paper-club channel in there. The Luma is lu.ma/ls. All right, thanks Ted — and same for the AI Engineer New York stuff, there's a channel in the Discord where people are talking about it. Cool, take care, guys.