2024 in Post-Transformer Architectures: State Space Models, RWKV [Latent Space LIVE! @ NeurIPS 2024]

- Yeah, so thanks so much for having us. So this is gonna be a little bit of a two-part presentation. My name is Dan, I'm at Together AI, and I'll be joining UCSD as faculty in about a year. And Eugene, you wanna introduce yourself? - I'm Eugene, I lead the RWKV team, and I'm CEO and co-founder of Featherless, and we both work on this new post-transformer architecture space.

- Yeah, so today, we're really excited to talk to you a little bit about that. So first, I'm gonna give a broad overview of kind of the last few years of progress in post-transformer architectures, and then afterwards, Eugene will tell us a little bit about the latest and greatest frontier models in this space.

So the story starts with scaling. So this is probably a figure or something like this that you've seen very recently. Over the last five to six years, we've seen models really scale up in parameter size, and that's brought with it a bunch of new capabilities, like the ability to talk to you and tell you sometimes how to use your Colab and your AWS screens.

But another place where we've seen scaling, especially recently, is scaling in context length. So this can mean just having more text inputs for your models, but it can also mean things like taking a lot of visual token inputs, image inputs to your models, or generating lots of outputs. And one thing that's been really exciting over the last few months or so is that we're seeing scaling, not only during training time, but also during test time.

So this is the iconic image from the OpenAI o1 release. Not only are we starting to scale train time compute, but we're also starting to scale test time compute. Now, if you're familiar with the attention in our transformer architectures today, this graph on the right might look a little bit scary.

And one of the reasons is that the implications are a little bit interesting. So what does it mean if we want to continue having smarter and smarter models? Do we just need to start building bigger and bigger data centers, spending more flops? Is this little DALL-E 3 "we need more flops" guy gonna be the future of all of AI?

Or is there a better way, another path forward? Maybe we can get the same capabilities that we've gotten used to, but for a lot less compute, a lot less flops. And one of the things that we're gonna talk about today is specifically looking at that core attention operator in some of these models.

And the reason is that, so this is just some basic scaling curves, but attention has compute that scales quadratically in the context length. So that means that if you're doing something like test time compute, and you want to spend a bunch of tokens thinking about what comes next, the longer that that goes, the more tokens you spend on that, that compute grows quadratically in that.

One of the questions that we're interested in is, can we take that basic sequence model, the basic sequence primitive at the bottom and get it to scale better? Can we scale as, let's say, N to the three halves or N log N? And so in the first part of the talk, so we just went over the introduction, what I'm gonna do over the next few slides is just talk about some of the key advances and ideas from the past few years, since maybe early 2020 to now, that have shown promise that this might actually be possible, that you can actually get potentially the same quality that we want while scaling better.

So to do that, basically the story that we're gonna look at is, so this is a basic graph of just the past couple of years of progress of perplexity where that blue line, that dotted blue line is attention, it's your basic transformer, full dense attention.

And then the dots coming down are some of the methods that you'll see in this presentation today. We're gonna turn the clock back all the way to 2020. So this question of, can we make attention sub-quadratic? Basically, as soon as we said, attention is all you need, people started asking this question.

So we have this quadratic attention operator, can we do better? I'll briefly talk about why attention is quadratic. And the basic thing that happens if you're not familiar is that you have these inputs, these keys and queries, and what you do in this attention matrix, this S matrix over here is that you're using, you're comparing every token in your input to every other token.

So when I try to do something like upload a whole book to Gemini, what happens behind the, or maybe not Gemini, 'cause we don't necessarily know what architecture it is, but let's say we upload it to Llama, what happens behind the scenes is that it's gonna take every single word in that book and compare it to every other word.

And this has been a really, it's led to some pretty impressive things, but it's kind of a brute forcing of the way that you would try to interpret something. What attention does afterwards is that instead of always operating in this quadratic thing, it takes a row-wise softmax over this matrix and then multiplies it by this values matrix.
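As a minimal sketch of the operation just described, here is single-head, non-causal softmax attention in NumPy. The function name, the shapes, and the 1/sqrt(d) scaling are assumptions added for illustration (the scaling is the usual convention, not something stated in the talk).

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Single-head, non-causal self-attention (illustrative sketch).

    Q, K: (N, d) queries and keys; V: (N, d_v) values.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                # (N, N): every token scored against every other
    S = S - S.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)   # row-wise softmax
    return P @ V                            # (N, d_v): one output row per input token

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = softmax_attention(Q, K, V)
```

The (N, N) matrix `S` is where the quadratic cost in the sequence length comes from.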

So one of the key points to notice is that the output size is always gonna be the same as the inputs, at least in standard self-attention. So one of the first things that folks tried to do around 2020 is this thing called linear attention, which is just noticing that if we take out this softmax from here, if we take out this non-linearity in the middle of the attention operation, and then if you compute the keys and the values operation first, you actually never hit this quadratic bottleneck.
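The reordering trick can be sketched in a few lines. This is a bare illustration under simplifying assumptions (no softmax, no feature map, no normalization, no causal mask), and the function names are made up for the example: once the non-linearity is gone, matrix multiplication is associative, so the keys-times-values product can be formed first.

```python
import numpy as np

def quadratic_form(Q, K, V):
    # (Q K^T) V: materializes an N x N matrix in the middle
    return (Q @ K.T) @ V

def linear_form(Q, K, V):
    # Q (K^T V): the intermediate is only d x d_v, independent of N
    return Q @ (K.T @ V)

rng = np.random.default_rng(0)
N, d = 512, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
same = np.allclose(quadratic_form(Q, K, V), linear_form(Q, K, V))
```

Both orderings give the same result; the second costs on the order of N·d² operations instead of N²·d.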

So that's potentially a way to get a lot more computationally efficient. And there are various ways to do this by basically using feature maps or trying to approximate this overall attention computation. But some of this work sort of started to hit a wall in 2020 and there were two basic challenges.

So one was quality. Back then it was kind of hard to get good quality with these linear attention operators. The other one was actually hardware efficiency. So this feature map that's shown, simplified, here actually ends up being quite computationally expensive if you just implement it naively.

So you started having these operators where not only are you not really sure if they have the same quality, but also they're actually just wall clock slower. So you kind of end up getting the worst of both worlds. So that kind of sets the stage for four years ago.

Keep this in mind because linear attention is actually gonna come back in a few years once we have a better understanding. But one of the works that started kicking off this mini revolution in post-transformer architectures was this idea called state-space models. So here the seminal work is Albert Gu's work in 2022.

And this piece of work really brought together a few ideas from some long running research lines of work. The first one was, and this is really one of the keys to closing the gap in quality, was just using things that if you talk to an electrical engineer off the street, they might know like the back of their hand, but taking some of those properties with how we model dynamical systems in signal processing and then using those ideas to model the inputs, the text tokens in, for example, a transformer-like next token prediction architecture.

So some of those early state-space model papers were looking at this relatively simple recurrent update model that comes from maybe chapter one of a signal processing class, but then using some principled theory about how you should do that recurrent update in order to really get the most that you can out of your hidden state, out of your sequence.

So that was one key idea for quality. And when this was eventually realized, you started to see a bunch of benchmarks that were pretty sticky for a few years, things like Long Range Arena, some long sequence evaluation benchmarks, there was stuff in time series analysis. You started to see the quality tick up in meaningful ways.

But the other key thing that was so influential about these state-space models is that they also had a key idea about how you can compute these things efficiently. So if you go back to your machine learning 101 class, where you learned about RNNs, one thing that you may have learned is that they don't parallelize as well as attention, because if you just run them naively, you have to do this kind of sequential update to process new tokens.

Whereas in attention, you can process all the tokens in parallel at one time. One of the key insights behind the S4 paper was that these recurrent models, you could take them and you could also formulate them as a convolution. And in particular, with a convolution, you could, instead of using a PyTorch conv1d operation, you can compute that with the FFT.

And that would give you N log N compute in the sequence length N with an operator that was relatively well optimized for modern hardware. So those are really, I'd say, the two key ideas in 2022 that started allowing these breakthroughs to happen in these non-transformer architectures. So these ideas about how to model the recurrent updates of a sequence in a principled way, and also these key ideas on how you can compute it efficiently by turning it into a convolution and then scaling it up with the FFT.
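To make the recurrence-to-convolution idea concrete, here is a small sketch with a generic discrete linear recurrence h_t = A h_{t-1} + B u_t, y_t = C h_t. The matrices, sizes, and function names are assumptions for illustration, and the kernel is built naively here; this is not the S4 parameterization or its kernel algorithm, only the equivalence it relies on.

```python
import numpy as np

def ssm_recurrent(A, B, C, u):
    """Sequential form: one state update per input step."""
    h = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        h = A @ h + B * u_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_kernel(A, B, C, N):
    """Convolution kernel k_j = C A^j B for j = 0..N-1 (built naively)."""
    k, x = [], B.copy()
    for _ in range(N):
        k.append(C @ x)
        x = A @ x
    return np.array(k)

def ssm_fft_conv(A, B, C, u):
    """Same outputs as ssm_recurrent, computed as a causal convolution via FFT."""
    N = len(u)
    k = ssm_kernel(A, B, C, N)
    L = 2 * N  # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(k, n=L) * np.fft.rfft(u, n=L), n=L)
    return y[:N]

rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.1, 0.9, size=4))  # stable diagonal state matrix (assumed)
B = rng.standard_normal(4)
C = rng.standard_normal(4)
u = rng.standard_normal(32)
match = np.allclose(ssm_recurrent(A, B, C, u), ssm_fft_conv(A, B, C, u))
```

The FFT step is the N log N part; the sequential loop in `ssm_recurrent` is what it replaces during training.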

Along those same lines, so afterwards, we started putting out some work on specialized kernels. So just like we have FlashAttention for transformers, we also have works like FlashFFTConv, and if you look at these lines of work, oftentimes whenever you see a new architecture, you see a new primitive, one of the table stakes now is, do you have an efficient kernel so that you can actually get wall clock speed up?

So by 2022, 2023, we were starting to have these models that had promising quality primitives and also promising wall clocks. So you could actually see regimes where they were better than transformers in meaningful ways. That being said, there was still sometimes a quality gap, particularly for language modeling. And because language is so core to what we do in sequence modeling these days, the next key idea that I'm gonna talk about is this idea of selection mechanisms.

And this is basically an idea of, so you have this recurrent state that you're keeping around that just summarizes everything that came before, and to get a good sequence model, one of the things that you really need to be able to do is have the model learn what's the best way to pick out pieces from that recurrent state.

So one of the major ideas here, in a line of work called H3, Hungry Hungry Hippos, and also these Hyena models, was that one way you can do this is by just adding some simple element-wise gates. So versions of these ideas have been around for decades. If you squint at the LSTM paper, you can probably find this gating mechanism.

But turns out you can take those old ideas, add them into these new state-space models, and then you can see quality start to pick up. If you've heard of the Mamba model, this also takes the selection to the next level by actually making some changes in that fundamental recurrent state space.

So it's not only just this gating that happens around the SSM layer, but also you can actually make the ABCD matrices of your state-space model, you can make them data-dependent, which will allow you to even better select out different pieces from your hidden state, depending on what you're seeing.
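A toy sketch of what "data-dependent" means for a recurrence: the amount of state kept at each step is computed from the current input. The update rule, the parameter names `w` and `b`, and the sigmoid gate are all assumptions for illustration; this is not the Mamba or H3 parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_recurrence(x, w, b):
    """h_t = a_t * h_{t-1} + (1 - a_t) * x_t, with a_t = sigmoid(w * x_t + b).

    x: (N, d) inputs; w: (d,) gate weights; b: scalar or (d,) gate bias.
    """
    h = np.zeros(x.shape[1])
    out = []
    for x_t in x:
        a_t = sigmoid(w * x_t + b)       # input-dependent "keep" gate, per channel
        h = a_t * h + (1.0 - a_t) * x_t  # keep old state vs. write the new input
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 3))
overwrite = gated_recurrence(x, w=np.zeros(3), b=-50.0)  # gate ~0: state tracks the input
hold = gated_recurrence(x, w=np.zeros(3), b=50.0)        # gate ~1: state barely changes
```

With a learned `w`, the model can decide per token and per channel whether to overwrite or hold its state, which is the selection behavior described above.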

I'll also point out, if you look at the bottom right of this figure, there's this little triangle with a GPU SRAM, GPU HBM, and this is just continuing that trend of when you have a new architecture, you also release it with a kernel to show that it is hardware efficient, that it can be hardware efficient on modern hardware.

One of the next cool things that happened is once we had this understanding of these are the basic pieces, these are the basic principles behind some of the sequence models, linear attention actually started to come back. So earlier this year, there's a model called BASED from Simran Arora and some other folks that combined a more principled version of linear attention, basically the two-second summary is that it used a Taylor approximation of the softmax attention, combined that with a simple sliding window attention and was starting to be able to expand the Pareto frontier of how much data can you recall from your sequence versus how small is your recurrent state size.
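A sketch of the Taylor idea, assuming a second-order expansion exp(s) ≈ 1 + s + s²/2 of the unnormalized softmax weight. The feature map below is constructed so that the dot product of two mapped vectors reproduces that polynomial exactly; the function name and the example vectors are made up, and the scaling details of the actual BASED model may differ.

```python
import numpy as np

def taylor_feature_map(x):
    """Map x in R^d to [1, x, vec(x x^T) / sqrt(2)], so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2."""
    outer = np.outer(x, x).ravel() / np.sqrt(2.0)
    return np.concatenate(([1.0], x, outer))

q = np.array([0.1, -0.2, 0.3, 0.0])
k = np.array([0.2, 0.1, -0.1, 0.4])
s = q @ k                                        # raw attention score
approx = taylor_feature_map(q) @ taylor_feature_map(k)
exact_poly = 1.0 + s + s**2 / 2.0                # what the feature map reproduces
target = np.exp(s)                               # what softmax attention would use
```

Because the weight is now a plain dot product of feature vectors, the keys-times-values product can be formed first, as in linear attention, at the cost of a feature dimension that grows as d².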

So those orange dots at the top there are just showing smaller sequences that can recall more memory. And the last major idea, I think, that has been influential on this line of work and is relatively late breaking, just a few months ago, is just the basic idea that when you have these models that are fundamentally more efficient in the sequence length, you maybe don't want to prompt them or use them in exactly the same way.

So this was a really cool paper called Just Read Twice, also from Simran, that basically said, hey, all these efficient models can process tokens so much more efficiently than transformers, that they can sometimes have unfair advantages compared to a simple transformer model. So take, for example, the standard use case of you have some long document, you're gonna pass it in as input and then you're gonna ask some question about it.

One problem you might imagine for a recurrent model where you have a fixed state size is, let's say that your article is very long and you're trying to ask about some really niche thing. You can imagine it might be hard for the model to know ahead of time what information to put into the hidden state.

But these models are so much more efficient that you can do something really stupid, like you can just put the document, write down the document, write down the question, write down the document again and then write down the question again. And then this time, the second time that you go over that document, you know exactly what to look for.
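The prompting trick described here is simple enough to sketch directly. The helper name, the separator, and the example strings are hypothetical; the exact template used in the Just Read Twice paper may differ.

```python
def just_read_twice_prompt(document: str, question: str) -> str:
    """Build a prompt of the form: document, question, document, question.

    On the second pass over the document, a fixed-state model has already
    seen the question, so it can store what the question needs.
    """
    parts = [document, question, document, question]
    return "\n\n".join(parts)

doc = "The sky is blue."
q = "What color is the sky?"
prompt = just_read_twice_prompt(doc, q)
```

The prompt is roughly twice as long, which is a modest cost for a model whose compute grows linearly in the sequence length.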

And the cool thing about this is, so this results in better quality, especially on these recall intensive tasks. But the other interesting thing is, it really takes advantage of the more efficient architectures that we're having here. So one of the other, I think, influential ideas in this line of work is, if you change the fundamental compute capabilities of your model and the way that it scales, you can actually start to query it at test time differently.

And this actually, of course, goes back to those slides on test time compute. So while everybody's looking at, say, test time compute for big transformer models, I think potentially a really interesting research question is how can you take those and how does it change with this new next generation of models?

So I'll just briefly summarize what some of those key ideas were and then show you briefly kind of what the state of the art is today. So the four key ideas are, instead of just doing a simple linear attention approximation, take ideas that we know from other fields, like signal processing, do a more principled approach to your modeling of the sequence.

Another key idea throughout all these lines of work is you really want hardware and kernel support from day one. So even if your model is theoretically more efficient, if somebody goes and runs it and it's two times slower, one of the things that we've learned is that if you're in that situation, it's just gonna be dead on arrival.

So you want to be designing your architectures with the hardware in mind. One of the key machine learning ideas that has been important for the quality is just making sure that you encode different ways that you can select from your hidden state and really focus on that as a key decider of quality.

And finally, I think one of the emerging new things for this line of work and something that's quite interesting is what are the right test time paradigms for these models? How do they change relative to what you might do for a standard transformer? I'll briefly end this section. So I've labeled this slide where we are yesterday because Eugene is gonna talk about some new models that he released literally this morning.

But as of yesterday, some of the really cool results out of these efficient alternative models were, so AI21 trained this hybrid MoE called Jamba that is currently the state-of-the-art for these non-transformer architectures. NVIDIA and MIT put out this new diffusion model called SANA recently, where one of their key observations is that you can take a standard diffusion, transformer diffusion model, replace the layers with linear attention, and then that lets you scale to much larger images, much larger sequences more efficiently.

And one thing that I don't think anybody would have called a few years ago is that one of those gated SSMs, gated state-space models, ended up on the cover of Science because a great group of folks went and trained some DNA models. So that's Michael Poli, Eric Nguyen from Stanford and the Arc Institute.

So we're really at an exciting time in 2024 where these non-transformer, post-transformer architectures are showing promise across a wide range of modalities, of applications, and of tasks. And with that, I'll pass it on to Eugene who can tell you a little bit about the latest and greatest with RWKV.

- Yeah, so that's useful. Yeah. - You're talking to here. - Oh, I'm talking to here, okay. So yeah, two streams. Yeah, so I think one common question that we tend to get asked, right, is what's the difference between RWKV and state space? So I think one of the key things to really understand, right, the difference between the two groups, right, is that we are actually more like an open-source, random internet meets academia kind of situation.

Like most of us never wrote any paper, but we basically looked at RNNs and linear attention when Attention Is All You Need came out. And then we decided to like, "Hey, there is a quadratic scaling problem. Why don't we try fixing that instead?" So we ended up developing our own branch, but we end up sharing ideas back and forth.

And we do all this actively in Discord, GitHub, et cetera. This was so bad for a few years, right, that basically the average group's H-index was so close to zero, right, that EleutherAI actually came in and helped us write our first paper. Great, now our H-index is three, apparently.

So, but the thing is like, a lot of these experiments led to results. And essentially, we took the same ideas from linear attention and we built on it. So to take a step back into like, how does RWKV handle its own attention mechanic and achieve the same goals of like O(n) compute, in line with our overall goal to make AI accessible to everyone, regardless of language, nation, or compute.

That's our open-source goal. We actually train our models primarily on over a hundred languages, which is another topic altogether. And our goal is to train on even 200 languages to cover all languages in the world. But at the same time, we work on this architecture to lower the compute cost so that people can run it on Raspberry Pis and on anything.

So how did RWKV break the dependency of LSTM token flow? Because I think to understand the architecture, right, it's probably easier to understand it from the RNN lens, because that's what we built on. Whereas state space kind of tried to start anew and took lessons from that, so there's a little bit of divergence there.

And, AKA, this is our version of linear attention. So to take a step back, all foundation models, be it transformers or non-transformers, at a very high level, right, take in a token, I mean, turn things into embeddings, go through a lot of layers, generate a lot of internal states, whether QKV cache or RNN states or RWKV states, and output an embedding, layer norm and something, and we just take more layers and more embeddings, and somehow that magically works.

So if you remember your ancient RNN lessons, which we call blessing these days, the general idea is that you have the embedding information from all the way up, and you take that information and you flow it back down, and then you process it as part of your LSTM layers.

So this is how it generally works. Karpathy is quoted saying that RNNs are actually unreasonably effective. The problem is this is not scalable. To start doing work on the second token, you need to wait for the first token, and likewise for the third token and fourth token, yada, yada.

That is CPU land, not GPU land. So you can have an H100, and you can't even use 1% of it. So that's kind of why RNNs didn't really take off in the direction that we wanted, like billions of parameters, when it comes to training. So what did RWKV version zero do?

We just did the dumbest, lamest thing. Sorry, this is the bottleneck for RNN. We did the dumb thing of removing that line, and it kind of worked. It trained, it sucked, but it kind of worked. Then we were like, hey, then no one cared because the loss was crap, but how do we improve that?

And that's essentially where we move forward because if you see this kind of flow, you can actually get your GPU saturated quickly where it essentially cascades respectively. So I'm just waiting for this to loop again. So it's like once you get your first layer, your token to be computed finish, you start to cascade your compute all the way until you're, hey, I'm using 100% of GPU.
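One way to picture the cascade is as a wavefront over a grid of (layer, token) cells. The sketch below assumes that each cell needs only the layer below at the same token and the same layer at the previous token; that dependency pattern is a reading of the slide being described, not a statement of how RWKV kernels are actually scheduled, and the function name is made up.

```python
def wavefront_schedule(num_layers, num_tokens):
    """Group (layer, token) cells into steps that can run in parallel.

    Assumed dependencies: cell (l, t) needs (l - 1, t) and (l, t - 1),
    so all cells with the same l + t are independent of each other.
    """
    steps = []
    for d in range(num_layers + num_tokens - 1):
        step = [(l, d - l) for l in range(num_layers) if 0 <= d - l < num_tokens]
        steps.append(step)
    return steps

steps = wavefront_schedule(num_layers=3, num_tokens=4)
for i, step in enumerate(steps):
    print(f"step {i}: {step}")
```

The first step holds a single cell, and the number of cells per step grows until it reaches min(num_layers, num_tokens), which is the ramp-up to a saturated GPU that the talk describes.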

So we worked on it and we started going along the principle of that as long as we keep this general architecture where we can cascade and be highly efficient with our architecture, nothing is sacred in our architecture. And we have done some crazy ideas. In fact, if you ask me to explain some things in the paper, right, officially in the paper, I'll say we had this idea and we wrote it this way.

The reality is someone came with the code, we tested it, it worked, and then we rationalized it. So the general idea behind RWKV is that we generally have two major blocks that we do. We call them TimeMix and ChannelMix. And TimeMix generally handles long-term memory states, where essentially we apply matrix multiplications and SiLU activation functions in processing an input embedding into an output embedding.

I'm oversimplifying it because this calculation changed every version and we have version seven right now. ChannelMix is similar to BASED in the sense that it does shorter-term attention, where it just looks at the sister token, or the token before it, 'cause there's a shift in the token shift matrix.

I don't really want to go too much into the papers themselves because we do have three papers on this. Basically: RWKV: Reinventing RNNs for the Transformer Era; Eagle and Finch: RWKV with Matrix-Valued States, which is the updated version five and version six; and GoldFinch, which is our hybrid model. We are already writing the paper for V7, RWKV-7, codename Goose. All our architectures are codenamed after a bird.

And I'm going to cover as well Q-RWKV and MAMA-RWK and RWKVMU. So where did that lead to? Wait, because we are all GPU poor, and to be clear, most of this research is done only on a handful of H100s, which one Google researcher told me was his experiment budget for a single researcher.

So our entire organization has less compute than a single researcher in Google. One of the things that we explored was how do we convert transformer models instead? Because someone already paid that million dollars for training, so why don't we take advantage of those weights? And I believe Together AI worked on the locus for the MAMA side of things, and we took some ideas from there as well, and we essentially did that for RWKV.

And that led to Q-RWKV6, which we just dropped today, a 32B instruct preview model, where we took the Qwen 32B instruct model, froze the feedforward layer, removed the QKV attention layer, and replaced it with RWKV linear layers. So to be clear, this means we do not have the RWKV channel mix layer, we only have the time mix layer.

But once we do that, we train the RWKV layer. Important is that the feedforward layer needs to be frozen, so the new attention can be learned. And then we unfreeze the feedforward layer and train all the layers together with a custom learning rate schedule so that they can learn how to work together.

The end result, surprisingly, and to be honest, to the frustration of the RWKV MoE team, which ended up releasing their model on the same day, was that with just a few hours of training on two nodes, we managed to get it to be kind of on par with the original Qwen 32B model.

So in fact, the first run completely confused us, and I was telling Daniel Goldstein (Smerky), who kind of leads most of our research coordination, when you pitched me this idea, you told me at best you would get the same level of performance. But you didn't tell me the challenge and the score would shoot up.

I don't know what's happening there. But it did. MMLU score dropping, that was expected, because if you think about it, when we were training all the layers, we were essentially like Frankensteining this thing, and we did brain damage to the feedforward network layer with the new RWKV layers. But 76%, hey, some of it is retained, and we can probably further train this.

We didn't even spend three days training this, so there's a lot more that can be done, hence the preview. But this brings up a big question, because we are already now in the process of converting the 70B. This is actually an extremely compute-efficient way to test our attention mechanic. It's like, it becomes a shortcut.

We are already planning to do our version seven and our hybrid architecture for it, because we don't train from scratch, and we get a really good model out of it. And the other thing that is uncomfortable to say, because we are doing the 70B right now, is that if this scales correctly to 128k context length, I'm not even talking about a million, 128k, the majority of enterprise workload today is just on 70B at under 32k context length.

That means if this works and the benchmark matches it, it means we can replace the vast majority of current AI workload, unless you want super long context. And then, sorry, can someone give us more GPUs, because we do need the VRAM for super long context, sadly. So yeah, that's what we are working on.

And essentially, we are excited about this to just push it further. And this conversion process, to be clear, I don't think it's going to be exclusive to RWKV, but it probably will work for Mamba as well. I don't see why not. And we will probably see more ideas, or more experiments, or more hybrids.

Like, yeah, one of the weirdest things that I wanted to say outright, and I confirmed this with the BlackMamba team and the Jamba team, because we did the GoldFinch hybrid model, is that none of us understand why a hybrid of a state-based model, be it RWKV or state space, and a transformer performs better than the baseline of both.

It's like, when you train one, you expect, and then you replace, you expect the same results. That's our pitch. That's our claim. But somehow, when we jam both together, it outperforms both. And that's one area of evolution that, like, we only have four experiments, plus four teams, that a lot more needs to be done.

But these are things that excite me, essentially, because that is what, potentially, we can move ahead for, which brings us to what comes next. - So this part is kind of just some, where we'll talk a little bit about stuff that we're excited about, maybe have some wild speculation on what's coming next.

And, of course, this is also the part that will be more open to questions. So a couple of things that I'm excited about is continued hardware model co-design for these models. So one of the things that we've put out recently is this library called ThunderKittens. It's a CUDA library.

And one of the things that we found frustrating is every time that we built one of these new architectures, and I'm sure you had the exact same experience, we'd have to go and spend two months in CUDA land, like, writing these new, efficient things. And if we decided to change one thing in PyTorch, like, one line of PyTorch code is like a week of CUDA code, at least.

So one of our goals with a library like ThunderKittens, so we just broke down what are the key principles, what are the key hardware things, what are the key compute pieces that you get from the hardware. So, for example, on H100, everything really revolves around a warp group matrix multiply operation.

So you really want your operation to be able to split into a relatively small matrix-matrix multiply operation. So, like, multiplying two 64 by 64 matrices, for example. And so if you know that ahead of time when you're designing your model, that probably gives you some information about how you set the state sizes, how you set the update, how you set the update function.

So with ThunderKittens, we basically built a whole library just around this basic idea that your basic compute primitive should not be a float, but it should be a matrix, and everything should just be matrix compute. And we've been using that to try to both re-implement some existing architectures and also start to design some new ones that are really designed with a tensor core primitive in mind.
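To illustrate the "everything is a small matrix multiply" idea, here is a NumPy sketch that computes a large product purely out of 64 by 64 tile multiplies. It runs on the CPU and only shows the decomposition; it does not use or resemble the ThunderKittens API, and the function name and tile handling are assumptions for the example.

```python
import numpy as np

TILE = 64  # tile edge, matching the 64 x 64 example from the talk

def tiled_matmul(A, B, tile=TILE):
    """Compute A @ B using only tile x tile matrix multiplies.

    Assumes every dimension is a multiple of `tile` (no padding logic here).
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and K % tile == 0 and N % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # One small matrix-matrix multiply: the unit of work a tensor core handles well
                C[i:i + tile, j:j + tile] += A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 192))
B = rng.standard_normal((192, 64))
ok = np.allclose(tiled_matmul(A, B), A @ B)
```

Choosing state sizes and update functions whose dimensions are multiples of the tile size is what lets an architecture map cleanly onto this kind of primitive.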

Another thing that we're, at least I'm excited about, is we, over the last four or five years, we've really been looking at language models as the next thing. But if you've been paying attention to Twitter, there's been a bunch of new next generation models that are coming out. So there are video generation models that can run real time, that are supported by your mouse and your keyboard, that I'm told if you play with them, that they only have a few seconds of memory.

Can we take that model? Can we give it a very long context length so that you could actually maybe generate an entire game state at a time? What does that look like for the model? You're certainly not gonna do a giant quadratic attention computation to try to run that.

Maybe use some of these new models or some of these new video generation models that came out. So Sora came out, I don't know, two days ago now, but with super long queue times and super long generation times. So that's probably a quadratic attention operation at the bottom of it.

What if we could remove that and get the same quality, but a lot faster generation time? Or some of the demos that we saw from Paige earlier today. If I have a super long conversation with my Gemini bot, what if I wanted it to remember everything that it's seen in the last week?

I mean, maybe you don't for personal reasons, but what if I did? What does that mean for the architecture? And I think that's certainly something I'm pretty excited about. I'm sure you're excited about it too. I think we were supposed to have some hot takes, but I honestly don't remember what our hot takes were.

- Yeah. - Hot take, yes. These are our hot takes. - I think the big one on Twitter that we saw, that we shared was, the question is like, is RAG relevant in the case of like the future of like state space models? - Let's see. I haven't played too much with RAG, but when I have, I'll say I found it was a little bit challenging to do research on it because we had this experience over and over again where you could have an embedding model of any quality.

So you could have a really, really bad embedding model or you could have a really, really good one by any measure of good. And for the final RAG application, it kind of didn't matter. That's what I'll say about RAG while being recorded. I know it doesn't actually answer the question, but.

- Yeah. So I think a lot of folks are extremely excited about the idea of RWKV or state space models potentially having infinite context. But I think the reality is that when we say infinite context, we just mean a different kind of infinite context or, as was covered previously, you need to test the model differently.

So think of it more along the lines of a human. Like, I don't remember what I ate for breakfast yesterday. Yeah, that's the statement that I'll say. And we humans are not quadratic transformers. If we were, if, let's say, we increased our brain size for every second we live, we would have exploded by the time we are five years old or something like that.

And I think fundamentally for us, right, regardless of whether it's RWKV, state space, xLSTM, et cetera, our general idea is that instead of that expanding state, that increase in computational cost, what if we have a fixed state size? And information theory dictates that that fixed state size will have a limit.

Just how big of a limit is a question. Like, RWKV is running at 40 megabytes for a state. Its future version might run into 400 megabytes. That is like millions of tokens, if you're talking about the mathematical maximum. It's just that I guess we are all more inefficient about it.
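
As a back-of-envelope check on the "millions of tokens" figure, here is a minimal sketch that treats the fixed state as raw storage. It assumes a 65,536-entry vocabulary (so 16 bits per token) and decimal megabytes; neither number comes from the talk, and the function name is hypothetical. It is only an upper bound on verbatim storage, not a claim about what a trained model actually retains.

```python
import math

def max_tokens_storable(state_bytes: int, vocab_size: int) -> int:
    """Upper bound on tokens a fixed-size state could hold verbatim.

    Assumes every token costs log2(vocab_size) bits and that the state
    is used as perfectly efficient raw storage (an idealization).
    """
    bits_per_token = math.log2(vocab_size)
    return int((state_bytes * 8) // bits_per_token)

VOCAB_SIZE = 65_536  # assumed vocabulary size, not stated in the talk

print(max_tokens_storable(40 * 10**6, VOCAB_SIZE))   # 20000000
print(max_tokens_storable(400 * 10**6, VOCAB_SIZE))  # 200000000
```

Under these assumptions a 40 MB state tops out around 20 million tokens, which is consistent with the gap described here between the theoretical ceiling and the roughly 100,000 tokens reached in practice.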

So maybe you would hit 100,000, and that's kind of the work we are doing, trying to push it and maximize it. And that's where the models will start differing, because it will choose to forget things, it will choose to remember things. And that's why I think that there might be some element of RAG, but it may not be the same RAG.

It may be that the model learns things. And it's like, hmm, I can't remember that article. Let me do a database search. Just like us humans, when we can't remember the article in a company, we do a search on Notion. - Yeah, I think something that would be really interesting is if you could have facts that are, so right now the one intuition about language models is that all those parameters are around just to store random facts about the world.

And this intuition comes from the observation that if you take a really small language model, it can do things like talk to you, or it can learn the style of conversation. But where it will usually fall over compared to a much larger one is it'll just be a lot less factual about things that it knows or that it can do.

But that suggests that all those weights that we're spending, all that SGD that we're spending to train these models, is just being used to store facts. And we have things like databases that are pretty good at storing facts. So I think one thing that would be really interesting is if we could actually have some sort of outside data store that a language model can look at, that maybe has some sort of gradient descent in it. That would be quite interesting.

And then maybe you could edit it, delete facts, change who's president so that it doesn't get lost. - Can we open up Q&A and hot takes to the audience? I have a hot take Q&A. Do these scale? When 405B state space model? RAG exists, no one does long context, who's throwing in 2 million token questions? Hot takes?

- The who's throwing in 2 million token question I think is a really good question. So I actually, I was gonna offer that as a hot take. I mean, my hot take was gonna be that long context doesn't matter. I know I just gave a whole talk about it.

You know, what's the point of doing research if you can't play both sides? But I think for both of us, the reason that we first got into this was just from the first-principles question: there's this quadratic thing. Clearly intelligence doesn't need to be quadratic.

What is going on? Can we understand it better? You know, since then it's kind of turned into a race, which has been exciting to watch like how much context you can take in. But I think it's right. Nobody is actually putting in a 2 million context prompt into these models.

And, you know, if they are, maybe we can go, you know, design a better model to do that particular thing. - Yeah, what do you think about that? So you've also been working on this. Do you think long context matters? - So I'm gonna burn a bit. How many of you remember the news of Google Gemini supporting 3 million context, right?

Raise your hand. Yeah. - 2 million. - Oh, it's 2 million. - Yeah. How many of you actually tried that? See? - I use it a lot. - You, you're off of Mind's TV. (laughs) - I use it a lot. All right. So for some people it is used, and this is where my opinion starts to differ, because I think the big labs may have a bigger role in this. Even for RWKV, when we train long context, the reason why I say VRAM is a problem is that, because we need to backprop against the states, we actually need to maintain the state in between the tokens, across the whole token length.

So that means we need to actually roll out the whole 1 million context if we are actually training on 1 million, which is the same for transformers actually, but it just means we don't magically reduce the VRAM consumption at training time. So that is the one, the VRAM bottleneck, and I'm neither OpenAI nor Google, so donate GPUs if you have too many of them.

But then, putting it back to another paradigm, right, I think o1-style reasoning might be actually pushing that direction downwards. In my opinion, this is my partial hot take: let's say you have a super big 400B model, and let's say you have a 70B model that may take double the tokens, but gets the same result.

Strictly speaking, a 70B, and this is even for transformer or non-transformer, right, will take fewer resources than that 400B model, even if it did double the amount of thinking. And if that's the case, and we're still all trying to figure this out, maybe the direction for us is really getting the sub-200B models to be as fast and as efficient as possible, with a very efficient architecture that some folks happen to be working on, to just reason it out over larger and larger context lengths.
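
The 70B-versus-400B comparison can be sketched with the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per token. That rule is an assumption brought in here, not something stated in the talk, and it ignores the attention cost that grows with context length, so treat the result as a rough illustration only. The function name and token counts are hypothetical.

```python
def inference_flops(num_params: float, num_tokens: int) -> float:
    """Rough forward-pass cost: ~2 FLOPs per parameter per generated token.

    This is an approximation that leaves out context-length-dependent
    attention cost, so it understates the cost of long transformer contexts.
    """
    return 2.0 * num_params * num_tokens

BASE_TOKENS = 1_000  # hypothetical length of the large model's answer

big = inference_flops(400e9, BASE_TOKENS)        # 400B model
small = inference_flops(70e9, 2 * BASE_TOKENS)   # 70B model, double the tokens

print(f"small / big cost ratio: {small / big:.2f}")  # 0.35
```

Under this approximation the 70B model thinking twice as long still uses about 35% of the compute of the 400B model, which is the shape of the argument made here.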

Yeah. - One thing I'm super interested in is models that can watch forever. Obviously you cannot train something on infinite context length. How are y'all thinking about that, where you run on a much longer context length than is possible to train on? - Yeah, it's a great question. So I think when, I think you guys probably had tweets along these lines too.

When we first started doing these things, because these are all recurrent models, in theory, you could just run it forever. And at the very least it won't error out or crash. There's another question of whether it can actually use what it's seen in that infinite context.
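
To make the "run it forever" point concrete, here is a toy sketch, not any real architecture: a fixed-size state updated by exponential decay, next to a cache that keeps every input the way an attention KV cache would. The update rule, state size, and decay value are all illustrative assumptions.

```python
from typing import Iterable, List

def run_recurrent(stream: Iterable[float], state_size: int = 4,
                  decay: float = 0.9) -> List[float]:
    """Consume a stream of any length while keeping a fixed-size state.

    Each step blends the new input into every state slot and decays the
    old contents, so memory stays constant and older inputs fade away.
    """
    state = [0.0] * state_size
    for x in stream:
        state = [decay * s + (1.0 - decay) * x * (i + 1)
                 for i, s in enumerate(state)]
    return state

def cache_everything(stream: Iterable[float]) -> List[float]:
    """Keep every input, as a growing cache would; memory grows with length."""
    cache: List[float] = []
    for x in stream:
        cache.append(x)
    return cache

state = run_recurrent(range(10_000))
cache = cache_everything(range(10_000))
print(len(state), len(cache))  # 4 10000
```

The recurrent loop never runs out of room, but the decay also means early inputs contribute almost nothing to the final state, which is the separate question raised here of whether the model can actually use what it has seen.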

And I think there, one place where the research on architectures probably ran faster than other research is actually the benchmarks for long context. So you turn it on forever, you wanna do everything or watch everything. What is it that you actually want it to do? Can we actually build some benchmarks for that, then measure what's happening, and then ask the question, can the models do it?

Is there something else that they need? Yeah, I think that if I were to turn back the clock to 2022, that's probably one of the things I would have done differently, which would have been to actually get some long context benchmarks out at the same time as we started pushing context length on all these models.

- I will also say the use case. So like, I think we both agree that there's no infinite memory and the model needs to be able to learn inside. What we have observed, and I think this also fits the state space model, is that one of the key advantages of this alternate attention mechanism that is not based on token position is that the model doesn't suddenly go crazy when you go past the 8K training context length or a million context length.

It's actually still stable. It's still able to run, it's still able to rationalize. It just starts forgetting things. But some of these things are still there in latent memory. Some of these things are still somewhat there. That's the whole point of why reading twice works, things like that.

And one of the biggest pushes in this direction is that I think both state space and RWKV have separate papers by other researchers where they use this architecture for time series data, weather modeling. So you're not asking what was the weather five days ago. You're asking what's the weather tomorrow, based on an infinite length, as long as this earth and the computer keep running.

And they found that it is better than existing architectures, like transformers, in modeling this weather data, controlled for the param size and such. I'm quite sure there are people with larger models. So in this case, right, there are future applications if your question is just what's next and not what happened 10 years ago.

- Thanks so much for having us.
