
2024 in Post-Transformer Architectures: State Space Models, RWKV [Latent Space LIVE! @ NeurIPS 2024]


Whisper Transcript | Transcript Only Page

00:00:00.000 | (upbeat music)
00:00:02.580 | - Yeah, so thanks so much for having us.
00:00:08.620 | So this is gonna be a little bit of a two-part presentation.
00:00:11.600 | My name is Dan, I'm at Together AI,
00:00:14.240 | and I'll be joining UCSD as faculty in about a year.
00:00:17.840 | And Eugene, you wanna introduce yourself?
00:00:19.960 | - I'm Eugene, I lead the RWKV team,
00:00:22.000 | and I'm CEO and co-founder of Featherless,
00:00:25.120 | and we both work on this new
00:00:27.480 | post-transformer architecture space.
00:00:29.740 | - Yeah, so today, we're really excited
00:00:33.120 | to talk to you a little bit about that.
00:00:35.600 | So first, I'm gonna give a broad overview
00:00:37.800 | of kind of the last few years of progress
00:00:40.380 | in post-transformer architectures,
00:00:42.900 | and then afterwards, Eugene will tell us a little bit
00:00:45.960 | about the latest and greatest
00:00:47.440 | frontier models in this space.
00:00:50.620 | So the story starts with scaling.
00:00:54.280 | So this is probably a figure or something like this
00:00:56.880 | that you've seen very recently.
00:00:59.120 | Over the last five to six years,
00:01:00.920 | we've seen models really scale up in parameter size,
00:01:03.640 | and that's brought with it a bunch of new capabilities,
00:01:05.640 | like the ability to talk to you and tell you sometimes
00:01:09.700 | how to use your Colab and your AWS screens.
00:01:12.800 | But another place where we've seen scaling,
00:01:15.480 | especially recently, is scaling in context length.
00:01:18.680 | So this can mean just having more text inputs
00:01:21.920 | for your models, but it can also mean things
00:01:23.880 | like taking a lot of visual token inputs,
00:01:28.000 | image inputs to your models, or generating lots of outputs.
00:01:31.520 | And one thing that's been really exciting
00:01:34.080 | over the last few months or so is that we're seeing scaling,
00:01:37.680 | not only during training time, but also during test time.
00:01:39.920 | So this is one of the, this is the iconic image
00:01:43.240 | from the OpenAI o1 release.
00:01:45.280 | Not only are we starting to scale train time compute,
00:01:47.840 | but we're also starting to scale test time compute.
00:01:51.160 | Now, if you're familiar with our attention
00:01:53.880 | in our transformer architectures today,
00:01:55.820 | this graph on the right might look a little bit scary.
00:01:58.640 | And one of the reasons is that the implications
00:02:02.280 | are a little bit interesting.
00:02:04.920 | So what does it mean if we want to continue
00:02:06.960 | having smarter and smarter models?
00:02:08.680 | Do we just need to start building
00:02:09.960 | bigger, bigger data centers, spending more flops?
00:02:13.320 | Is this, this little DALL-E 3 "we need more flops" guy,
00:02:16.800 | is this gonna be the future of all of AI?
00:02:20.720 | Or is there a better way, another path forward?
00:02:23.960 | Maybe we can get the same capabilities
00:02:26.300 | that we've gotten used to,
00:02:27.900 | but for a lot less compute, a lot less flops.
00:02:30.160 | And one of the things that we're gonna talk about today
00:02:33.300 | is specifically looking at that core attention operator
00:02:37.140 | in some of these models.
00:02:38.820 | And the reason is that,
00:02:40.340 | so this is just some basic scaling curves,
00:02:44.100 | but attention has compute that scales quadratically
00:02:47.140 | in the context length.
00:02:48.480 | So that means that if you're doing something
00:02:50.020 | like test time compute, and you want to spend a bunch
00:02:52.580 | of tokens thinking about what comes next,
00:02:54.820 | the longer that that goes,
00:02:56.900 | the more tokens you spend on that,
00:03:00.160 | that compute grows quadratically in that.
00:03:02.380 | One of the questions that we're interested in is,
00:03:04.860 | can we take that basic sequence model,
00:03:06.900 | the basic sequence primitive at the bottom
00:03:08.900 | and get it to scale better?
00:03:10.260 | Can we scale at, let's say, N to the three halves
00:03:12.580 | or N log N?
00:03:13.660 | And so in the first part of the talk,
00:03:17.220 | so we just went over the introduction,
00:03:18.940 | what I'm gonna do over the next few slides
00:03:20.560 | is just talk about some of the key advances
00:03:22.860 | and ideas that have shown up over the past few years
00:03:25.840 | since maybe early 2020 to now,
00:03:29.140 | that have shown promise that this might actually be possible,
00:03:32.080 | that you can actually get potentially the same quality
00:03:34.460 | that we want while scaling better.
00:03:37.480 | So to do that, and basically the story
00:03:42.480 | that we're gonna look at is, we're gonna start to see how,
00:03:45.020 | so this is a basic graph of just the past couple of years
00:03:48.220 | of progress of perplexity where that blue line,
00:03:51.100 | that dotted blue line is attention,
00:03:52.580 | it's your basic transformer, full dense attention.
00:03:55.100 | And then the dots coming down are some of the methods
00:03:58.480 | that you'll see in this presentation today.
00:04:00.640 | We're gonna turn the clock back all the way to 2020.
00:04:04.460 | So this question of, can we make attention sub-quadratic?
00:04:09.180 | Basically, as soon as we said, attention is all you need,
00:04:11.940 | people started asking this question.
00:04:13.420 | So we have this quadratic attention operator,
00:04:16.260 | can we do better?
00:04:17.460 | I'll briefly talk about why attention is quadratic.
00:04:19.860 | And the basic thing that happens if you're not familiar
00:04:23.020 | is that you have these inputs, these keys and queries,
00:04:25.860 | and what you do in this attention matrix,
00:04:27.740 | this S matrix over here is that you're using,
00:04:30.320 | you're comparing every token in your input
00:04:32.500 | to every other token.
00:04:33.860 | So when I try to do something like upload a whole book
00:04:36.340 | to Gemini, what happens beyond the, or maybe not Gemini,
00:04:39.620 | 'cause we don't necessarily know what architecture it is,
00:04:41.680 | but let's say we upload it to Llama,
00:04:43.580 | what happens behind the scenes is that it's gonna take
00:04:46.980 | every single word in that book
00:04:48.300 | and compare it to every other word.
00:04:50.060 | And this has been a really,
00:04:51.500 | it's led to some pretty impressive things,
00:04:53.940 | but it's kind of a brute forcing of the way
00:04:56.060 | that you would try to interpret something.
00:04:59.620 | And what attention does in particular is the,
00:05:03.020 | and then what attention, sorry, don't wanna, okay.
00:05:06.780 | No, no laser pointer.
00:05:08.020 | What attention does afterwards is that
00:05:10.180 | instead of always operating in this quadratic thing,
00:05:13.060 | it takes a row-wise softmax over this matrix
00:05:15.700 | and then multiplies it by this values matrix.
00:05:17.660 | So one of the key points to notice is that the output size
00:05:21.100 | is always gonna be the same as the inputs,
00:05:23.360 | at least in standard self-attention.
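
A minimal NumPy sketch of the standard softmax attention being described (single head, no causal masking; shapes and scaling follow the usual convention and are illustrative, not specific to any one model). The point is that the S matrix is N x N, which is where the quadratic cost comes from:

```python
import numpy as np

def softmax_attention(Q, K, V):
    # S has shape (N, N): every token is compared against every other token.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    # Row-wise softmax over the attention matrix.
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P = P / P.sum(axis=-1, keepdims=True)
    # Output has the same shape as the input values, as noted above.
    return P @ V  # (N, d)

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = softmax_attention(Q, K, V)  # materializes an N x N matrix along the way
```
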
00:05:26.340 | So one of the first things that folks tried to do
00:05:28.340 | around 2020 is this thing called linear attention,
00:05:30.500 | which is just noticing that if we take out this softmax
00:05:34.100 | from here, if we take out this non-linearity
00:05:36.100 | in the middle of the attention operation,
00:05:37.900 | and then if you compute the keys
00:05:39.400 | and the values operation first,
00:05:41.160 | you actually never hit this quadratic bottleneck.
00:05:44.060 | So that's potentially a way
00:05:46.300 | to get a lot more computationally efficient.
00:05:50.540 | And there are various ways to do this
00:05:52.420 | by basically using feature maps
00:05:54.060 | or try to approximate this overall attention computation.
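
And here is the reordering trick in the same notation: drop the softmax, apply some feature map phi to queries and keys, and compute phi(K)^T V first, so the N x N matrix never appears. The feature map below (ELU + 1) is just one illustrative choice, not the specific map from any particular paper:

```python
import numpy as np

def phi(x):
    # One common illustrative feature map: ELU(x) + 1, kept positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)                # (N, d)
    KV = Kf.T @ V                          # (d, d)  -- O(N d^2), no N x N matrix
    Z = Kf.sum(axis=0)                     # (d,)    -- normalizer terms
    return (Qf @ KV) / (Qf @ Z)[:, None]   # (N, d), linear in N

N, d = 1024, 64
Q, K, V = (np.random.randn(N, d) for _ in range(3))
out = linear_attention(Q, K, V)
```
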
00:05:57.620 | But some of this work sort of started to hit a wall in 2020
00:06:02.060 | and the basic challenges were two.
00:06:04.220 | So one was quality.
00:06:05.600 | Back then it was kind of hard to get good quality
00:06:09.580 | with these linear attention operators.
00:06:11.620 | The other one was actually hardware efficiency.
00:06:13.460 | So this feature map that was just shown here
00:06:18.260 | actually ends up being quite computationally expensive
00:06:20.980 | if you just implement it naively.
00:06:22.980 | So you started having these operators
00:06:24.440 | that not only you're not really sure
00:06:27.300 | if they have the same quality,
00:06:28.820 | but also they're actually just wall clock slower.
00:06:30.740 | So you kind of end up getting the worst of both worlds.
00:06:34.340 | So this was the stage.
00:06:36.620 | So that kind of sets the stage for four years ago.
00:06:38.900 | Keep this in mind because linear attention
00:06:40.460 | is actually gonna come back in a few years
00:06:43.020 | once we have a better understanding.
00:06:46.260 | But one of the works that started kicking off
00:06:48.340 | this mini revolution in post-transformer architectures
00:06:52.740 | was this idea called state-space model.
00:06:54.680 | So here the seminal work is the S4 work from 2022.
00:06:59.500 | And this piece of work really brought together a few ideas
00:07:03.420 | from some long running research lines of work.
00:07:10.460 | The first one was, and this is really one of the keys
00:07:13.660 | to closing the gap in quality,
00:07:16.420 | was just using things that if you talk
00:07:19.820 | to an electrical engineer off the street,
00:07:23.100 | they might know it like the back of their hand,
00:07:26.980 | but taking some of those properties
00:07:28.360 | with how we model dynamical systems in signal processing
00:07:32.660 | and then using those ideas to model the inputs,
00:07:36.020 | the text tokens in, for example,
00:07:39.060 | a transformer-like next token prediction architecture.
00:07:42.300 | So some of those early state-space model papers
00:07:44.780 | were looking at this relatively simple recurrent update
00:07:49.140 | model that comes from maybe chapter one
00:07:50.960 | of a signal processing class,
00:07:53.100 | but then using some principled theory
00:07:55.680 | about how you should do that recurrent update
00:07:58.340 | in order to really get the most that you can
00:08:01.560 | out of your hidden state, out of your sequence.
00:08:05.860 | So that was one key idea for quality.
00:08:07.900 | And when this was eventually realized,
00:08:11.300 | you started to see a bunch of benchmarks
00:08:13.060 | that were pretty sticky for a few years,
00:08:15.220 | things like long range arena,
00:08:16.620 | some long sequence evaluation benchmarks,
00:08:19.980 | there was stuff in time series, time series analysis.
00:08:24.420 | You started to see the quality tick up in meaningful ways.
00:08:29.300 | But the other key thing that was so influential
00:08:33.820 | about these state-space models
00:08:34.940 | is that they also had a key idea
00:08:36.780 | about how you can compute these things efficiently.
00:08:41.020 | So if you go back to your machine learning 101 class,
00:08:43.460 | where you learned about RNNs,
00:08:44.780 | one thing that you may have learned
00:08:46.020 | is that they don't parallelize as well as attention,
00:08:48.980 | because if you just run them naively,
00:08:51.060 | you have to do this kind of sequential update
00:08:54.140 | to process new tokens.
00:08:55.620 | Whereas in attention,
00:08:56.840 | you can process all the tokens in parallel at one time.
00:09:00.120 | One of the key insights behind the S4 paper
00:09:02.460 | was that these recurrent models,
00:09:04.220 | you could take them and you could also formulate them
00:09:07.060 | as a convolution.
00:09:08.420 | And in particular, with a convolution,
00:09:09.780 | you could, instead of using a PyTorch conv1d operation,
00:09:12.540 | you can compute that with the FFT.
00:09:15.060 | And that would give you N log N compute
00:09:17.380 | in the sequence length N with a operator
00:09:20.820 | that was relatively well optimized for modern hardware.
00:09:24.460 | So those are really, I'd say the two key ideas in 2022
00:09:28.380 | that started allowing these breakthroughs to happen
00:09:31.740 | in these non-transformer architectures.
00:09:33.700 | So these ideas about how to principally model,
00:09:36.780 | sorry, how to model the recurrent updates
00:09:39.020 | of a sequence in a principled way,
00:09:42.500 | and also these key ideas
00:09:43.740 | and how you can compute it efficiently
00:09:45.780 | by turning it into a convolution
00:09:47.860 | and then scaling it up with the FFT.
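
A toy version of that recurrence-as-convolution trick, with a diagonal state space model (the parameters here are arbitrary, not S4's structured HiPPO initialization): unroll x_t = A x_{t-1} + B u_t, y_t = C x_t into a convolution with kernel Kbar_t = C A^t B, and compute it with an FFT in O(N log N):

```python
import numpy as np

N, state = 1024, 16
A = np.exp(-np.linspace(0.1, 1.0, state))    # stable diagonal transition (illustrative)
B = np.ones(state)
C = np.random.randn(state)
u = np.random.randn(N)                        # 1-D input sequence

# Convolution kernel: Kbar[t] = sum_j C_j * A_j**t * B_j, for t = 0..N-1
t = np.arange(N)
Kbar = (C * B) @ (A[:, None] ** t)            # (N,)

# FFT-based convolution of u with Kbar, zero-padded so it is a linear convolution.
L = 2 * N
y = np.fft.irfft(np.fft.rfft(u, L) * np.fft.rfft(Kbar, L), L)[:N]

# Same result from the naive token-by-token recurrent rollout, as a sanity check.
x = np.zeros(state)
y_rec = np.empty(N)
for i in range(N):
    x = A * x + B * u[i]
    y_rec[i] = C @ x
assert np.allclose(y, y_rec)
```
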
00:09:50.040 | Along those same lines,
00:09:53.580 | so afterwards, we started putting out some work
00:09:57.500 | on specialized kernels.
00:09:58.700 | So just like we have flash attention for transformers,
00:10:01.320 | we also have works like flash FFT conv,
00:10:03.500 | and if you look at these lines of work,
00:10:05.620 | oftentimes whenever you see a new architecture,
00:10:07.940 | you see a new primitive,
00:10:09.540 | one of the table stakes now is,
00:10:11.620 | do you have an efficient kernel
00:10:12.740 | so that you can actually get wall clock speed up?
00:10:14.780 | So by 2022, 2023, we were starting to have these models
00:10:18.460 | that had promising quality primitives
00:10:21.380 | and also promising wall clocks.
00:10:23.180 | So you could actually see regimes
00:10:24.980 | where they were better than transformers in meaningful ways.
00:10:27.980 | That being said, there were still sometimes a quality gap,
00:10:31.860 | particularly for language modeling.
00:10:33.580 | And because language is so core to what we do
00:10:36.540 | in sequence modeling these days,
00:10:38.220 | the next key idea that I'm gonna talk about
00:10:41.500 | is this idea of selection mechanisms.
00:10:43.860 | And this is basically an idea of,
00:10:45.940 | so you have this recurrent state that you're keeping around
00:10:48.600 | that just summarizes everything that came before,
00:10:52.140 | and to get a good sequence model,
00:10:53.620 | one of the things that you really need to be able to do
00:10:56.060 | is have the model learn
00:10:58.020 | what's the best way to pick out pieces
00:11:00.300 | from that recurrent state.
00:11:02.100 | So one of the major ideas here
00:11:04.800 | in a line of work called H3, Hungry, Hungry Hippos,
00:11:07.860 | and also these hyena models were,
00:11:11.060 | one way you can do this is by just adding
00:11:12.860 | some simple element-wise gates.
00:11:15.580 | So versions of these ideas have been around for decades.
00:11:18.780 | If you squint at the LSTM paper,
00:11:21.580 | you can probably find this gating mechanism.
00:11:24.740 | But turns out you can take those old ideas,
00:11:26.540 | add them into these new state-space models,
00:11:29.420 | and then you can see quality start to pick up.
00:11:32.660 | If you've heard of the Mamba model,
00:11:35.940 | this also takes the selection to the next level
00:11:39.200 | by actually making some changes
00:11:40.660 | in that fundamental recurrent state space.
00:11:43.700 | So it's not only just this gating
00:11:45.540 | that happens around the SSM layer,
00:11:47.620 | but also you can actually make the A, B, C, D matrices
00:11:51.840 | of your state-space model,
00:11:53.100 | you can make them data-dependent,
00:11:54.860 | which will allow you to even better select out
00:11:57.420 | different pieces from your hidden state,
00:11:59.120 | depending on what you're seeing.
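
A toy sketch of what a selection mechanism means in practice: the decay and write gates of the recurrent update are computed from the current token, so the model decides per-input what to keep in, and read out of, its hidden state. This is a simplified per-channel gated recurrence, not Mamba's exact discretized SSM parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N, d = 256, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((N, d))          # input token embeddings
W_a, W_b = rng.standard_normal((2, d, d)) * 0.1

h = np.zeros(d)
ys = []
for t in range(N):
    a = sigmoid(x[t] @ W_a)              # data-dependent "how much to forget"
    b = sigmoid(x[t] @ W_b)              # data-dependent "how much to write"
    h = a * h + b * x[t]                 # selective recurrent update
    ys.append(h)
y = np.stack(ys)                         # (N, d)
```
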
00:12:00.720 | I'll also point out,
00:12:02.420 | if you look at the bottom right of this figure,
00:12:03.980 | there's this little triangle with a GPU SRAM, GPU HBM,
00:12:07.380 | and this is just continuing that trend
00:12:09.460 | of when you have a new architecture,
00:12:12.140 | you also release it with a kernel
00:12:14.580 | to show that it is hardware efficient,
00:12:16.940 | that it can be hardware efficient on modern hardware.
00:12:20.480 | One of the next cool things that happened
00:12:26.500 | is once we had this understanding
00:12:28.340 | of these are the basic pieces,
00:12:30.380 | these are the basic principles
00:12:31.800 | behind some of the sequence models,
00:12:34.320 | linear attention actually started to come back.
00:12:36.160 | So in earlier this year,
00:12:38.120 | there's a model called BASED from Simran Arora
00:12:41.800 | and some other folks that combined
00:12:44.600 | a more principled version of linear attention
00:12:46.920 | that basically, the two-second summary is
00:12:50.680 | that it used a Taylor approximation
00:12:52.860 | of the softmax attention,
00:12:54.600 | combined that with a simple sliding window attention
00:12:57.200 | and was starting to be able to expand the Pareto frontier
00:13:01.540 | of how much data can you recall from your sequence
00:13:04.820 | versus how small is your recurrent state size.
00:13:04.820 | So those orange dots at the top there
00:13:09.500 | are just showing smaller recurrent states
00:13:12.020 | that can recall more memory.
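
The two-second summary of that Taylor trick, spelled out: a second-order expansion exp(s) ≈ 1 + s + s²/2 of the softmax's exponential can be written as an inner product of feature maps, which is exactly the form linear attention needs. Dimensions and scaling below are illustrative:

```python
import numpy as np

def taylor_feature_map(x):
    # phi(x) = [1, x, outer(x, x)/sqrt(2)] flattened, so that
    # phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2
    return np.concatenate([[1.0], x, np.outer(x, x).ravel() / np.sqrt(2.0)])

d = 16
rng = np.random.default_rng(1)
q, k = rng.standard_normal((2, d)) / d**0.25   # keep q.k modest so the Taylor fit is decent

s = q @ k
approx = taylor_feature_map(q) @ taylor_feature_map(k)
assert np.isclose(approx, 1 + s + s**2 / 2)
print(np.exp(s), approx)   # exact exp(q.k) vs. its 2nd-order approximation
```
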
00:13:14.160 | And the last major idea,
00:13:18.460 | I think that has been influential
00:13:19.980 | on this line of work
00:13:20.820 | and is relatively late breaking,
00:13:22.660 | just a few months ago,
00:13:24.360 | is just the basic idea
00:13:25.660 | that when you have these models
00:13:27.380 | that are fundamentally more efficient
00:13:29.900 | in the sequence length,
00:13:31.420 | you maybe don't want to prompt them
00:13:32.980 | or use them in exactly the same way.
00:13:35.020 | So this was a really cool paper called Just Read Twice
00:13:37.740 | also from Simran that basically said,
00:13:40.620 | hey, all these efficient models
00:13:42.540 | can process tokens so much more efficiently
00:13:44.500 | than transformers,
00:13:45.700 | that they can sometimes have unfair advantages
00:13:48.480 | compared to a simple transformer token.
00:13:51.540 | So, sorry, a simple transformer model.
00:13:53.500 | So take, for example, the standard use case
00:13:57.100 | of you have some long document,
00:13:58.740 | you're gonna pass it in as input
00:14:00.060 | and then you're gonna ask some question about it.
00:14:03.060 | One problem you might imagine for a recurrent model
00:14:06.740 | where you have a fixed state size is,
00:14:08.660 | let's say that your article is very long
00:14:11.580 | and you're trying to ask about some really niche thing.
00:14:14.900 | You can imagine it might be hard for the model
00:14:16.660 | to know ahead of time
00:14:17.540 | what information to put into the hidden state.
00:14:20.520 | But these models are so much more efficient
00:14:23.020 | that you can do something really stupid,
00:14:24.520 | like you can just put the document,
00:14:26.940 | write down the document, write down the question,
00:14:29.060 | write down the document again
00:14:30.300 | and then write down the question again.
00:14:31.900 | And then this time, the second time
00:14:33.300 | that you go over that document,
00:14:34.500 | you know exactly what to look for.
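
A minimal sketch of that prompting pattern (the template wording is just illustrative): because each extra token costs these models roughly linear compute, repeating the document after the question is cheap, and on the second pass the model already knows what to look for:

```python
def read_twice_prompt(document: str, question: str) -> str:
    return (
        f"{document}\n\n"
        f"Question: {question}\n\n"
        f"Here is the document again:\n{document}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(read_twice_prompt("<long article text>", "What niche detail was mentioned?"))
```
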
00:14:36.920 | And the cool thing about this is,
00:14:38.300 | so this results in better quality,
00:14:41.020 | especially on these recall intensive tasks.
00:14:43.700 | But the other interesting thing is,
00:14:45.660 | it really takes advantage
00:14:47.140 | of the more efficient architectures that we're having here.
00:14:50.680 | So one of the other, I think, influential ideas
00:14:53.100 | in this line of work is,
00:14:54.580 | if you change the fundamental compute capabilities
00:14:58.260 | of your model and the way that it scales,
00:15:00.260 | you can actually start to query it at test time differently.
00:15:03.100 | And this actually, of course,
00:15:04.260 | goes back to those slides on test time compute.
00:15:07.160 | So while everybody's looking at, say,
00:15:09.020 | test time compute for big transformer models,
00:15:12.340 | I think potentially a really interesting research question
00:15:14.660 | is how can you take those
00:15:16.040 | and how does it change
00:15:17.300 | with this new next generation of models?
00:15:20.560 | So I'll just briefly summarize
00:15:23.620 | what some of those key ideas were
00:15:25.780 | and then talk and then show you briefly
00:15:27.720 | kind of what the state of the art is today.
00:15:30.440 | So the four key ideas are,
00:15:32.120 | instead of just doing
00:15:33.160 | a simple linear attention approximation,
00:15:35.800 | instead, take ideas that we know from other fields,
00:15:39.120 | like signal processing,
00:15:40.480 | do a more principled approach
00:15:42.600 | to your modeling of the sequence.
00:15:44.760 | Another key idea throughout all these lines of work
00:15:47.240 | is you really want hardware and kernel support from day one.
00:15:51.160 | So even if your model is theoretically more efficient,
00:15:54.960 | if somebody goes and runs it and it's two times slower,
00:15:58.120 | one of the things that we've learned
00:15:59.420 | is that if you're in that situation,
00:16:01.120 | it's just gonna be dead on arrival.
00:16:03.520 | So you want to be designing your architectures
00:16:06.200 | with the hardware in mind.
00:16:07.840 | One of the key machine learning ideas
00:16:11.980 | that has been important for the quality
00:16:13.840 | is just making sure that you encode different ways
00:16:16.440 | that you can select from your hidden state
00:16:18.720 | and really focus on that as a key decider of quality.
00:16:22.200 | And finally, I think one of the emerging new things
00:16:26.600 | for this line of work
00:16:27.960 | and something that's quite interesting
00:16:29.560 | is what are the right test time paradigms for these models?
00:16:32.960 | How do they change relative to what you might do
00:16:37.960 | for a standard transformer?
00:16:39.360 | I'll briefly end this section.
00:16:41.880 | So I've labeled this slide where we are yesterday
00:16:45.440 | because Eugene is gonna talk about some new models
00:16:47.440 | that he released literally this morning.
00:16:49.840 | But as of yesterday, some of the really cool results
00:16:52.080 | out of these efficient alternative models were,
00:16:56.480 | so AI21 trained this hybrid MoE called Jamba
00:16:59.600 | that is currently the state-of-the-art
00:17:03.120 | for these non-transformer architectures.
00:17:06.320 | There's this NVIDIA and MIT
00:17:08.720 | put out this new diffusion model called SANA recently
00:17:12.640 | that one of their key observations
00:17:15.760 | is that you can take a standard diffusion,
00:17:18.240 | transformer diffusion model,
00:17:19.760 | replace the layers with linear attention,
00:17:21.800 | and then that lets you scale to much larger images,
00:17:25.960 | much larger sequences more efficiently.
00:17:30.720 | And one thing that I don't think anybody would have called
00:17:34.360 | a few years ago
00:17:36.320 | is that one of those gated SSM, gated state-space models
00:17:41.840 | ended up on the cover of science
00:17:43.480 | because a great group of folks went
00:17:45.880 | and trained some DNA models.
00:17:47.200 | So that's Michael Poli, Eric Nguyen
00:17:49.640 | from Stanford and the Arc Institute.
00:17:53.200 | So we're really at an exciting time in 2024
00:17:56.920 | where these non-transformer, post-transformer architectures
00:18:00.240 | are showing promise across a wide range,
00:18:03.280 | across a wide range of modalities,
00:18:07.360 | of applications, and of tasks.
00:18:10.760 | And with that, I'll pass it on to Eugene
00:18:12.280 | who can tell you a little bit
00:18:13.760 | about the latest and greatest with RWKV.
00:18:16.720 | - Yeah, so that's useful.
00:18:19.120 | Yeah. - You're talking to here.
00:18:19.960 | - Oh, I'm talking to here, okay.
00:18:21.280 | So yeah, two streams.
00:18:23.240 | Yeah, so I think one common questions
00:18:25.040 | that we tend to get asked, right,
00:18:26.920 | is what's the difference between RWKV and state space?
00:18:30.200 | So I think one of the key things to really understand,
00:18:33.560 | right, the difference between the two groups, right,
00:18:36.440 | is that we are actually more like
00:18:38.680 | an open-source, random-internet-meets-academia
00:18:41.080 | kind of situation.
00:18:42.200 | Like most of us never wrote any paper,
00:18:45.040 | but we basically looked at RNNs and linear attention
00:18:48.960 | when Attention Is All You Need came out.
00:18:50.680 | And then we decided to like,
00:18:51.600 | "Hey, there is a quadratic scaling problem.
00:18:54.480 | "Why don't we try fixing that instead?"
00:18:57.160 | So we end up developing our own branch,
00:19:00.120 | but we end up sharing ideas back and forth.
00:19:02.600 | And we do all this actively in Discord, GitHub, et cetera.
00:19:07.760 | This was so bad for a few years, right,
00:19:10.080 | that basically the average group's H-index
00:19:12.520 | was so close to zero, right,
00:19:13.760 | ILLUTR-AI actually came in
00:19:15.480 | and helped us write our first paper.
00:19:17.360 | Great, now our H-index is now three, apparently.
00:19:19.600 | So, but the thing is like,
00:19:22.400 | a lot of these experiments led to results.
00:19:25.320 | And essentially, we took the same ideas
00:19:30.280 | from linear attention and we built on it.
00:19:33.320 | So to take a step back into like,
00:19:35.000 | how does RWKV handle its own attention mechanic
00:19:38.520 | and achieve the same goals of like O(n) compute,
00:19:41.600 | respectively, and in focus of our overall goal
00:19:45.720 | to make AI accessible to everyone,
00:19:47.120 | regardless of language, nation, or compute.
00:19:48.880 | That's our open-source goal.
00:19:50.560 | We actually train our models primarily
00:19:52.640 | on over a hundred language,
00:19:54.160 | which is another topic altogether.
00:19:56.120 | And our goal is to train to even 200 languages
00:19:58.240 | to cover all languages in the world.
00:20:00.040 | But at the same time, we work on this architecture
00:20:03.360 | to lower the compute cost so that people
00:20:05.440 | can run in Raspberry Pis and on anything.
00:20:08.600 | So how did RWKV break the dependency of the LSTM token flow?
00:20:13.600 | Because I think to understand architecture, right,
00:20:16.120 | it's probably easier to understand it from the RNN lens,
00:20:19.760 | because that's where we built on.
00:20:21.680 | We all state space kind of like try to start anew
00:20:25.040 | and took lessons from that and say,
00:20:26.200 | so there's a little bit of divergence there.
00:20:28.200 | And AKA, this is our version of linear intention.
00:20:31.320 | So to take a step back, all foundation models,
00:20:35.000 | be it transformers or non-transformers,
00:20:37.440 | at a very high level, right, comes in a token,
00:20:40.400 | I mean, takes things into embeddings
00:20:42.480 | and goes through a lot of layers,
00:20:44.240 | generate a lot of internal states,
00:20:45.800 | whether KV cache or RNN states or RWKV states,
00:20:50.360 | and outputs an embedding layer norm in something,
00:20:52.680 | and we just take more layers and more embeddings,
00:20:54.360 | and somehow that magically works.
00:20:57.400 | So if you remember your ancient RNN lessons,
00:21:02.040 | which we call blessing these days,
00:21:07.000 | the general idea is that you have the embedding information
00:21:09.360 | from all the way up, and you take that information
00:21:13.080 | and you flow it back down,
00:21:13.920 | and then you process it as part of your LSTM layers.
00:21:16.480 | So this is how it generally works.
00:21:19.040 | Karpathy is quoted saying that RNNs
00:21:20.760 | are actually unreasonably effective.
00:21:22.640 | The problem is this is not scalable.
00:21:25.160 | To start doing work on the second token,
00:21:27.160 | you need to wait for the first token,
00:21:28.640 | and then you need to,
00:21:29.480 | and likewise for the third token
00:21:30.320 | and fourth token, yada, yada.
00:21:31.960 | That is CPU land, not GPU land.
00:21:34.360 | So you can have a H100, and you can't even use 1% of it.
00:21:38.280 | So that's kind of why RNNs didn't really take off
00:21:41.400 | in the direction that when you wanted
00:21:42.640 | like billions of parameters when it comes to training.
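
For concreteness, this is the serial dependency being described: in a classic stacked RNN/LSTM, the hidden state at token t depends on the state at token t-1, so the loop below cannot be parallelized across tokens the way attention can (weights and sizes here are arbitrary):

```python
import numpy as np

def rnn_layer_step(h_prev, x_t, W, U):
    # One recurrent step: the new state needs the previous token's state.
    return np.tanh(x_t @ W + h_prev @ U)

N, d = 128, 64
rng = np.random.default_rng(2)
x = rng.standard_normal((N, d))
W, U = rng.standard_normal((2, d, d)) * 0.1

h = np.zeros(d)
for t in range(N):            # token t must wait for token t-1: CPU land, not GPU land
    h = rnn_layer_step(h, x[t], W, U)
```
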
00:21:44.880 | So what did RWKV version zero do?
00:21:47.560 | We just did the dumbest, lamest thing.
00:21:49.920 | Sorry, this is the bottleneck for RNN.
00:21:52.120 | We did the dumb thing of removing that line,
00:21:54.680 | and it kind of worked.
00:21:56.360 | It trained, it sucked, but it kind of worked.
00:21:59.920 | Then we were like, hey, then no one cared
00:22:02.800 | because the loss was crap, but how do we improve that?
00:22:07.000 | And that's essentially where we move forward
00:22:09.640 | because if you see this kind of flow,
00:22:12.080 | you can actually get your GPU saturated quickly
00:22:15.880 | where it essentially cascades respectively.
00:22:17.920 | So I'm just waiting for this to loop again.
00:22:20.160 | So it's like once you get your first layer,
00:22:21.840 | your token to be computed finish,
00:22:24.200 | you start to cascade your compute all the way
00:22:26.360 | until you're, hey, I'm using 100% of GPU.
00:22:28.760 | So we worked on it and we started going along
00:22:33.000 | the principle of that as long as we keep
00:22:34.960 | this general architecture where we can cascade
00:22:38.040 | and be highly efficient with our architecture,
00:22:40.960 | nothing is sacred in our architecture.
00:22:43.120 | And we have done some crazy ideas.
00:22:45.680 | In fact, if you ask me to explain some things
00:22:48.920 | in the paper, right, officially in the paper,
00:22:51.160 | I'll say we had this idea and we wrote it this way.
00:22:53.640 | The reality is someone came with a code,
00:22:55.760 | we tested it, it worked, and then we rationalized it.
00:22:58.440 | So the general idea behind RWKV is that
00:23:03.200 | we generally have two major blocks that we do.
00:23:06.520 | We call it TimeMix and ChannelMix.
00:23:08.080 | And TimeMix generally handles long-term memory states
00:23:12.520 | where essentially where we apply the matrix multiplication
00:23:17.520 | and SILU activation functions into processing
00:23:19.520 | an input embedding and an output embedding.
00:23:22.200 | I'm oversimplifying it because this calculation
00:23:25.120 | changed every version and we have version seven right now.
00:23:29.080 | ChannelMix is similar to BASED in the sense
00:23:31.680 | that it does shorter-term attention,
00:23:33.840 | where it just looks at the sister token,
00:23:36.680 | or the token before it, 'cause there's a shift
00:23:38.560 | in the token shift matrix.
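
A hedged sketch of that token-shift idea in ChannelMix: each position blends its own embedding with the token right before it, then passes through a small gated feed-forward. The exact formulas change between RWKV versions, so treat this as the spirit rather than the spec:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_mix(x, mu, W_r, W_k, W_v):
    # x: (N, d). Token shift: mix each token with the one directly before it.
    x_prev = np.vstack([np.zeros_like(x[:1]), x[:-1]])
    xx = mu * x + (1.0 - mu) * x_prev
    r = sigmoid(xx @ W_r)                     # receptance gate
    k = np.square(np.maximum(xx @ W_k, 0.0))  # squared-ReLU "key"
    return r * (k @ W_v)

N, d = 64, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((N, d))
mu = rng.uniform(size=d)
W_r, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = channel_mix(x, mu, W_r, W_k, W_v)       # (N, d)
```
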
00:23:41.480 | I don't really want to go too much into the papers itself
00:23:43.800 | because we do have three papers on this.
00:23:46.240 | Basically, RWKV: RNN for the Transformer Era,
00:23:49.840 | and Eagle and Finch: RWKV with Matrix-Valued States.
00:23:52.040 | This is the updated version five, version six.
00:23:54.680 | And Goldfinch is our hybrid model, respectively.
00:23:59.680 | We are writing the paper already for V7,
00:24:03.680 | and which is for RWKV7, codename Goose,
00:24:08.480 | all our architectures are codenamed after a bird.
00:24:11.000 | And I'm going to cover as well Q-RWKV
00:24:13.680 | and MAMA-RWK and RWKVMU.
00:24:16.920 | So where did that lead to?
00:24:18.560 | Wait, because we are all GPU poor,
00:24:21.760 | and to be clear, most of this research is done
00:24:24.200 | only on a handful of H100s,
00:24:25.880 | which I had one Google researcher told me
00:24:28.000 | that was his experiment budget for a single researcher.
00:24:31.520 | So our entire organization has less compute
00:24:34.360 | than a single researcher in Google.
00:24:36.680 | One of the things that we explored into
00:24:40.120 | was how do we convert transformer models instead?
00:24:43.440 | Because someone already paid that million dollars
00:24:46.200 | onto training, so why don't we take advantage
00:24:47.840 | of those weights?
00:24:49.560 | And I believe Together AI worked on LoLCATs
00:24:52.840 | for the Llama side of things,
00:24:55.480 | and we took some ideas from there as well,
00:24:57.480 | and we essentially did that for RWKV.
00:24:59.920 | And that led to Q-RWKV6, which we just dropped today,
00:25:05.360 | a 32B instruct preview model,
00:25:07.400 | where we took the current Qwen 32B instruct model,
00:25:10.600 | freeze the feedforward layer,
00:25:12.080 | remove the QKV attention layer,
00:25:15.200 | and replace it with RWKV linear layers.
00:25:17.840 | So to be clear, this means we do not have
00:25:21.000 | the RWKV channel mixed layer,
00:25:22.720 | we only have the time mixed layer.
00:25:24.440 | But once we do that, we train the RWKV layer.
00:25:28.600 | Important is that the feedforward layer needs to be frozen,
00:25:30.920 | so the new attention can be learned.
00:25:33.240 | And then we unfreeze the feedforward layer
00:25:35.880 | and train all the layers together
00:25:37.000 | with a custom learning rate schedule
00:25:38.280 | so that they can learn how to work together.
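
In code, that recipe looks roughly like the two-stage sketch below. It is written against a hypothetical model whose blocks expose `.attn` submodules (real Qwen module names differ), with a placeholder time-mix layer standing in for the actual RWKV kernel and a simple lower learning rate standing in for the custom schedule:

```python
import torch
import torch.nn as nn

class TimeMixStub(nn.Module):
    """Placeholder for an RWKV-style time-mix layer (not the real kernel)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)

def train(model, opt, data_iter, steps):
    for _ in range(steps):
        x, y = next(data_iter)
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

def convert_and_train(model, blocks, dim, data_iter, stage1_steps, stage2_steps):
    # Swap each attention layer for a freshly initialized time-mix layer.
    for block in blocks:
        block.attn = TimeMixStub(dim)

    # Stage 1: freeze everything (including the feed-forward layers) so only
    # the new recurrent layers learn to stand in for the attention they replaced.
    for p in model.parameters():
        p.requires_grad = False
    for block in blocks:
        for p in block.attn.parameters():
            p.requires_grad = True
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4)
    train(model, opt, data_iter, stage1_steps)

    # Stage 2: unfreeze everything and train all layers together,
    # with a smaller learning rate so old and new layers can co-adapt.
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    train(model, opt, data_iter, stage2_steps)
```
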
00:25:41.040 | The end result, surprisingly, and to be honest,
00:25:44.320 | to the frustration of the RWKV MoE team,
00:25:46.760 | which ended up releasing the model on the same day,
00:25:49.240 | was that with just a few hours of training on two nodes,
00:25:54.240 | we managed to get it to be on par
00:25:56.360 | kind of with the original Qwen 32B model.
00:25:59.080 | So in fact, the first run
00:26:01.200 | completely confused us,
00:26:02.800 | and I was telling Daniel Goldstein, Smerky,
00:26:06.640 | who kind of leads most of our research coordination,
00:26:09.480 | when you pitched me this idea,
00:26:10.640 | you told me at best you would get
00:26:12.120 | the same level of performance.
00:26:13.040 | But you didn't tell me that the benchmark
00:26:15.040 | scores would shoot up.
00:26:19.160 | I don't know what's happening there.
00:26:21.240 | But it did.
00:26:22.080 | MMLU score dropping, that was expected,
00:26:25.160 | because if you think about it,
00:26:26.560 | when we were training all the layers,
00:26:28.680 | we were essentially like Frankensteining this thing,
00:26:31.440 | and we did brain damage to the feedforward network layer
00:26:34.320 | with the new RWKV layers.
00:26:36.040 | But 76%, hey, some of it is retained,
00:26:38.520 | and we can probably further train this.
00:26:40.760 | We didn't even spend three days training this,
00:26:42.600 | so there's a lot more that can be done,
00:26:44.720 | hence the preview.
00:26:46.240 | But this brings up a big question,
00:26:49.800 | because we are already now in the process
00:26:51.960 | of converting the 72B.
00:26:54.160 | This is actually extremely compute efficient
00:26:56.480 | to test our attention mechanic.
00:26:59.080 | It's like, it becomes a shortcut.
00:27:01.000 | We are already planning to do our version seven
00:27:02.920 | and our hybrid architecture for it,
00:27:04.920 | because we don't train from scratch,
00:27:06.400 | and we get a really good model out of it.
00:27:08.720 | And the other thing that is uncomfortable to say
00:27:12.080 | is that, because we are doing right now the 72B,
00:27:14.920 | is that if this scales correctly to 128k context length,
00:27:19.480 | I'm not even talking about a million, 128k,
00:27:22.840 | majority of enterprise workload today
00:27:26.160 | is just at under 32k context length.
00:27:30.360 | That means if this works and the benchmark matches it,
00:27:34.040 | it means we can replace the vast majority
00:27:36.240 | of current AI workload,
00:27:37.720 | unless you want super long context.
00:27:39.240 | And then, sorry, can someone give us more GPUs,
00:27:41.560 | because we don't have the VRAM for super long context, sadly.
00:27:44.720 | So yeah, that's what we are working on.
00:27:47.960 | And essentially, we are excited about this
00:27:50.280 | to just push it further.
00:27:51.480 | And this conversion process, to be clear,
00:27:54.320 | I don't think it's going to be exclusive to RWKV,
00:27:56.680 | but it probably will work for Mamba as well.
00:27:59.760 | I don't see why not.
00:28:00.840 | And we will probably see more ideas,
00:28:03.000 | or more experiments, or more hybrids.
00:28:05.080 | Like, yeah, one of the weirdest things
00:28:07.400 | that I wanted to say outright,
00:28:09.080 | and I confirm this with the Black Mamba team
00:28:10.840 | and the Jamba team,
00:28:12.520 | because we did the Goldfinch hybrid model,
00:28:14.600 | is that none of us understand why a hybrid
00:28:18.520 | with a state-based model, be it RWKV or state space,
00:28:20.960 | and a transformer, performs better
00:28:24.040 | than the baseline of both.
00:28:26.600 | It's like, when you train one,
00:28:29.040 | you expect, and then you replace,
00:28:30.120 | you expect the same results.
00:28:31.040 | That's our pitch.
00:28:31.880 | That's our claim.
00:28:32.760 | But somehow, when we jam both together,
00:28:34.960 | it outperforms both.
00:28:36.240 | And that's one area of evolution
00:28:38.200 | that, like, we only have four experiments
00:28:40.160 | across four teams, and a lot more needs to be done.
00:28:42.760 | But these are things that excite me, essentially,
00:28:45.320 | because that is what, potentially,
00:28:47.360 | we can move ahead for,
00:28:48.960 | which brings us to what comes next.
00:28:51.200 | - So this part is kind of just some,
00:28:55.920 | where we'll talk a little bit about stuff
00:28:57.480 | that we're excited about,
00:28:59.800 | maybe have some wild speculation
00:29:02.200 | on what's coming next.
00:29:05.760 | And, of course, this is also the part
00:29:07.920 | that will be more open to questions.
00:29:10.560 | So a couple of things that I'm excited about
00:29:12.800 | is continued hardware model co-design for these models.
00:29:17.800 | So one of the things that we've put out recently
00:29:22.040 | is this library called ThunderKittens.
00:29:23.600 | It's a CUDA library.
00:29:25.320 | And one of the things that we found frustrating
00:29:27.760 | is every time that we built one of these new architectures,
00:29:30.280 | and I'm sure you had the exact same experience,
00:29:32.680 | we'd have to go and spend two months in CUDA land,
00:29:35.160 | like, writing these new, efficient things.
00:29:37.640 | And if we decided to change one thing in PyTorch,
00:29:40.920 | like, one line of PyTorch code
00:29:42.320 | is like a week of CUDA code, at least.
00:29:45.000 | So one of our goals with a library like ThunderKittens,
00:29:48.440 | so we just broke down what are the key principles,
00:29:52.000 | what are the key hardware things,
00:29:54.640 | what are the key compute pieces
00:29:56.600 | that you get from the hardware.
00:29:57.520 | So, for example, on H100,
00:29:59.640 | everything really revolves around
00:30:02.320 | a warp group matrix multiply operation.
00:30:05.840 | So you really want your operation to be able to split
00:30:08.760 | into a relatively small matrix-matrix multiply operation.
00:30:13.560 | So, like, multiplying two 64 by 64 matrices, for example.
00:30:18.240 | And so if you know that ahead of time
00:30:19.920 | when you're designing your model,
00:30:21.240 | that probably gives you some information
00:30:24.600 | about how you set the state sizes,
00:30:25.880 | how you set the update, how you set the update function.
00:30:28.960 | So with ThunderKittens,
00:30:30.320 | we basically built a whole library
00:30:31.880 | just around this basic idea
00:30:33.840 | that all your basic compute primitives
00:30:36.280 | should not be a float, but it should be a matrix,
00:30:38.800 | and everything should just be matrix compute.
00:30:41.280 | And we've been using that to try to both re-implement
00:30:44.160 | some existing architectures and also start to design
00:30:47.200 | some new ones that are really designed
00:30:48.880 | with this core, with a tensor core primitive in mind.
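
A toy NumPy illustration of that "everything is a small matrix multiply" framing: a big matmul broken into 64 x 64 tiles, which is roughly the granularity the warp-group tensor-core instructions want. NumPy here just stands in for what the CUDA library actually schedules on-chip:

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Each inner op is a 64x64 @ 64x64 matrix multiply.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.randn(256, 256).astype(np.float32)
B = np.random.randn(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```
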
00:30:52.720 | Another thing that we're, at least I'm excited about,
00:30:57.720 | is we, over the last four or five years,
00:31:00.800 | we've really been looking at language models
00:31:02.600 | as the next thing.
00:31:03.640 | But if you've been paying attention to Twitter,
00:31:06.000 | there's been a bunch of new next generation models
00:31:08.600 | that are coming out.
00:31:09.640 | So there are video generation models
00:31:13.880 | that can run real time,
00:31:16.080 | that are supported by your mouse and your keyboard,
00:31:19.280 | that I'm told if you play with them,
00:31:21.600 | that they only have a few seconds of memory.
00:31:24.680 | Can we take that model?
00:31:25.600 | Can we give it a very long context length
00:31:27.400 | so that you could actually maybe generate
00:31:29.360 | an entire game state at a time?
00:31:31.400 | What does that look like for the model?
00:31:33.040 | You're certainly not gonna do
00:31:34.440 | a giant quadratic attention computation
00:31:37.240 | to try to run that.
00:31:38.920 | Maybe use some of these new models
00:31:41.320 | or some of these new video generation models that came out.
00:31:43.680 | So Sora came out, I don't know, two days ago now,
00:31:47.800 | but with super long queue times
00:31:49.080 | and super long generation times.
00:31:51.040 | So that's probably a quadratic attention operation
00:31:53.440 | at the bottom of it.
00:31:55.120 | What if we could remove that and get the same quality,
00:31:57.160 | but a lot faster generation time?
00:32:00.320 | Or some of the demos that we saw from Paige earlier today.
00:32:04.040 | If I have a super long conversation with my Gemini bot,
00:32:09.040 | what if I wanted to remember everything
00:32:12.480 | that it's seen in the last week?
00:32:14.120 | I mean, maybe you don't for personal reasons,
00:32:17.160 | but what if I did?
00:32:18.440 | What does that mean for the architecture?
00:32:21.000 | And I think that's certainly something
00:32:22.680 | I'm pretty excited about.
00:32:24.200 | I'm sure you're excited about it too.
00:32:26.040 | I think we were supposed to have some hot takes,
00:32:28.480 | but I honestly don't remember what our hot takes were.
00:32:30.960 | - Yeah.
00:32:31.800 | - Hot take, yes.
00:32:34.360 | These are our hot takes.
00:32:35.480 | - I think the big one on Twitter that we saw,
00:32:41.080 | that we shared was, the question is like,
00:32:42.960 | is RAG relevant in the case of like
00:32:46.000 | the future of, like, state space models?
00:32:48.200 | - Let's see.
00:32:50.280 | I haven't played too much with RAG,
00:32:54.640 | but when I have,
00:32:56.960 | I'll say I found it was a little bit challenging
00:33:01.200 | to do research on it
00:33:02.480 | because we had this experience over and over again
00:33:06.240 | where you could have an embedding model of any quality.
00:33:10.760 | So you could have a really, really bad embedding model
00:33:12.680 | or you could have a really, really good one
00:33:14.560 | by any measure of good.
00:33:16.800 | And for the final RAG application,
00:33:18.960 | it kind of didn't matter.
00:33:20.440 | That's what I'll say about RAG.
00:33:23.800 | Well, being recorded.
00:33:25.360 | I know it doesn't actually answer the question, but.
00:33:28.760 | - Yeah.
00:33:29.600 | So I think a lot of folks are like extremely excited
00:33:33.240 | of the idea of RWKV or state space
00:33:35.760 | potentially having infinite context.
00:33:37.680 | But I think the reality is that
00:33:40.680 | when we say infinite context,
00:33:41.760 | we just mean a different kind of infinite context
00:33:45.240 | or as it's previously covered,
00:33:46.520 | you need to test the model differently.
00:33:48.480 | So think of it more along the lines of the human.
00:33:51.160 | Like, I don't remember what I ate for breakfast
00:33:53.680 | yesterday.
00:33:54.840 | Yeah, that's the statement that I'll say.
00:33:57.440 | And we humans are not quadratic transformers.
00:34:01.600 | If we did, if let's say we increased our brain size
00:34:04.840 | for every second we live,
00:34:06.360 | we would have exploded by the time we are five years old
00:34:08.320 | or something like that.
00:34:09.440 | And I think basically fundamentally for us, right,
00:34:13.160 | be it whether we, regardless of whether RWKB,
00:34:15.720 | state-space, XLSTM, et cetera,
00:34:18.560 | our general idea is that instead of that expanding state,
00:34:21.600 | that increase in computational cost,
00:34:23.520 | what if we have a fixed state size?
00:34:26.240 | And information theory detects that
00:34:29.000 | that fixed state size will have a limit.
00:34:31.320 | Just how big of a limit is a question.
00:34:31.320 | Like, RWKV is running at 40 megabytes for a state.
00:34:39.120 | Its future version might run into 400 megabytes.
00:34:41.760 | That is like millions of tokens in,
00:34:45.680 | if you're talking about mathematically,
00:34:47.040 | the maximum possibility.
00:34:49.280 | It's just that I guess we are all more inefficient about it.
00:34:51.760 | So maybe you would hit 100,000
00:34:53.560 | and that's kind of like the work we are doing
00:34:55.080 | trying to like push it and maximize it.
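
A back-of-envelope version of that point, with very rough assumptions (about 4 characters per token and roughly a byte of information per character of English text): a 40 MB state could, at the information-theoretic limit, encode on the order of millions of tokens, which is why the practical ~100k figure reflects model inefficiency rather than a hard cap:

```python
state_bytes = 40 * 1024 * 1024   # ~40 MB recurrent state (RWKV's current size)
bytes_per_token = 4              # assumption: ~4 chars/token, ~1 byte of info/char
theoretical_tokens = state_bytes / bytes_per_token
print(f"information-theoretic ceiling: ~{theoretical_tokens / 1e6:.0f}M tokens")
```
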
00:34:57.760 | And that's where the models will start differing
00:35:00.680 | because it will choose to forget things,
00:35:02.680 | it will choose to remember things.
00:35:04.240 | And that's why I think that there might be
00:35:06.280 | some element of RAG, but it may not be the same RAG.
00:35:08.480 | It may be that the model learns things.
00:35:09.920 | And it's like, hmm, I can't remember that article.
00:35:12.760 | Let me do a database search to search.
00:35:14.920 | Just like us humans,
00:35:16.360 | when we can't remember the article in a company,
00:35:18.360 | we do a search on Notion.
00:35:19.800 | - Yeah, I think something that would be really interesting
00:35:22.480 | is if you could have facts that are,
00:35:25.680 | so right now the one intuition about language models
00:35:29.520 | is that all those parameters are around
00:35:31.360 | just to store random facts about the world.
00:35:33.640 | And this intuition comes from the observation
00:35:35.840 | that if you take a really small language model,
00:35:38.280 | it can do things like talk to you
00:35:39.800 | or it kind of has like the style of conversation
00:35:44.000 | it can learn that.
00:35:44.960 | But where it will usually fall over
00:35:46.600 | compared to a much larger one
00:35:47.760 | is it'll just be a lot less factual
00:35:49.640 | about things that it knows or that it can do.
00:35:52.960 | But that points to all those weights that we're spending,
00:35:57.360 | all that SGD that we're spending to train these models
00:35:59.800 | are just being used to store facts.
00:36:01.760 | And we have things like databases
00:36:03.080 | that are pretty good at storing facts.
00:36:04.720 | So I think one thing that would be really interesting
00:36:06.560 | is if we could actually have some sort of outside data store
00:36:10.520 | that a language model can look at
00:36:13.600 | that maybe has some sort of gradient descent in it,
00:36:19.040 | but would be quite interesting.
00:36:21.600 | And then maybe you could edit it, delete facts,
00:36:23.680 | change who's president so that it doesn't get lost.
00:36:28.440 | - Can we open up Q&A and hot takes to the audience?
00:36:31.640 | I have hot take Q&A.
00:36:35.440 | Do these scale?
00:36:36.640 | When 405 being state space model,
00:36:40.680 | RAG exists, no one does long context,
00:36:43.320 | who's throwing in 2 million token questions, what takes?
00:36:48.120 | - The who's throwing in 2 million token question
00:36:50.440 | I think is a really good question.
00:36:52.400 | So I actually, I was gonna offer that as a hot take.
00:36:55.680 | I mean, my hot take was gonna be
00:36:56.800 | that long context doesn't matter.
00:36:58.560 | I know I just gave a whole talk about it.
00:37:00.600 | You know, what's the point of doing research
00:37:04.480 | if you can't play both sides?
00:37:06.680 | But I think one of the, so I think for both of us,
00:37:11.320 | the reason that we first got into this
00:37:12.960 | was just from the first principle of questions
00:37:15.680 | of there's this quadratic thing.
00:37:18.920 | Clearly intelligence doesn't need to be quadratic.
00:37:21.240 | What is going on?
00:37:22.080 | Can we understand it better?
00:37:23.440 | You know, since then it's kind of turned into a race,
00:37:28.280 | which has been exciting to watch
00:37:29.440 | like how much context you can take in.
00:37:31.720 | But I think it's right.
00:37:32.560 | Nobody is actually putting in a 2 million context prompt
00:37:35.320 | into these models.
00:37:37.120 | And, you know, if they are, maybe we can go, you know,
00:37:41.400 | design a better model to do that particular thing.
00:37:45.280 | - Yeah, what do you think about that?
00:37:46.440 | So you've also been working on this.
00:37:48.000 | Do you think long context matters?
00:37:49.880 | - So I'm gonna burn a bit.
00:37:51.840 | How many of you remember the news of Google Gemini
00:37:54.760 | is supporting 3 million context, right?
00:37:56.720 | Raise your hand.
00:37:57.560 | Yeah. - 2 million.
00:37:59.760 | - Oh, it's 2 million.
00:38:00.800 | - Yeah.
00:38:06.640 | How many of you actually tried that?
00:38:09.360 | See? - I use it a lot.
00:38:11.240 | - You, you're off of Mind's TV.
00:38:13.200 | (laughs)
00:38:14.040 | - I use it a lot.
00:38:15.560 | All right.
00:38:16.400 | So for some people that is used,
00:38:18.600 | and I think that's the,
00:38:20.800 | that might be like,
00:38:23.040 | this is where my opinion starts to differ
00:38:24.560 | because I think the big labs may have a bigger role in this
00:38:27.360 | because like, even for RWKB,
00:38:29.560 | even when we train long context,
00:38:30.640 | the reason why I say VRAM is a problem
00:38:32.400 | is that because when we did the,
00:38:33.840 | we need to back prop against the states,
00:38:35.960 | we actually need to maintain the state
00:38:37.840 | in between the tokens by the token length.
00:38:40.520 | So that means we need to actually roll out
00:38:42.800 | the whole 1 million context
00:38:44.600 | if we are actually training 1 million,
00:38:46.360 | which is the same for transformers actually,
00:38:48.520 | but it just means we don't magically
00:38:50.880 | reuse the VRAM consumption in the training time space.
00:38:53.920 | So that is the one, the VRAM bottlenecks,
00:38:56.040 | and I'm neither OpenAI nor Google,
00:38:58.440 | so donate GPUs if you have too much of them.
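
To make that VRAM point concrete, here is a rough, purely illustrative calculation (the layer count and state size are made-up numbers, and real training uses chunking and activation checkpointing to cut this down): keeping per-token states around for backprop through a million-token window adds up fast:

```python
tokens       = 1_000_000        # context length being trained on
layers       = 60               # assumed depth
state_floats = 4096 * 64        # assumed recurrent state elements per layer
bytes_each   = 2                # bf16

gb = tokens * layers * state_floats * bytes_each / 1e9
print(f"~{gb:,.0f} GB of states to backprop through, before any tricks")
```
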
00:39:01.000 | But then putting it back to another paradigm, right,
00:39:05.640 | is that I think O1 style reasoning
00:39:08.760 | might be actually pushing that direction downwards.
00:39:12.120 | In my opinion, this is my partial hot take,
00:39:14.520 | is that if, let's say you have a super big 400B model,
00:39:18.960 | and let's say you have a 70B model
00:39:20.960 | that may take double the tokens,
00:39:23.680 | but gets the same result.
00:39:25.440 | Strictly speaking, a 70B,
00:39:28.200 | and this is even for transformer or non-transformer, right,
00:39:31.080 | will take less resources than that 400B model,
00:39:35.920 | even if it did double the amount of thinking.
00:39:38.480 | And if that's the case,
00:39:39.320 | and we're still all trying to figure this out,
00:39:41.600 | maybe the direction for us
00:39:42.640 | is really getting the sub-200B
00:39:44.560 | to be as fast, as efficient as possible,
00:39:46.400 | with a very efficient architecture
00:39:48.240 | that some folks happen to be working on,
00:39:50.520 | to just reason it out over larger and larger context length.
00:39:55.360 | Yeah.
00:39:56.200 | - One thing I'm super interested in
00:39:57.200 | is models that can watch forever.
00:40:00.560 | Obviously you cannot train something
00:40:03.880 | on infinite context length.
00:40:06.080 | How are y'all thinking about that,
00:40:08.560 | where you run on a much longer context length
00:40:11.160 | than is possible to train on?
00:40:14.080 | - Yeah, it's a great question.
00:40:17.080 | So I think when,
00:40:20.120 | I think you guys probably had tweets along these lines too.
00:40:23.040 | When we first started doing these things,
00:40:25.720 | because these are all recurrent models,
00:40:28.200 | in theory, you could just run it forever.
00:40:29.880 | You could just run it forever.
00:40:31.560 | And at the very least it won't, like, error out or crash.
00:40:35.200 | There's another question of whether it can actually use
00:40:38.440 | what it's seen in that infinite context.
00:40:40.840 | And I think there,
00:40:42.200 | so one place where probably the research
00:40:44.600 | on architectures ran ahead of the rest of the research
00:40:47.880 | is actually the benchmarks for long context.
00:40:49.840 | So you turn it on forever,
00:40:51.960 | you wanna do everything or watch everything.
00:40:54.240 | What is it that you actually wanted to do?
00:40:56.080 | Can we actually build some benchmarks for that,
00:40:58.280 | then measure what's happening,
00:40:59.720 | and then ask the question, can the models do it?
00:41:02.320 | Is there something else that they need?
00:41:05.000 | Yeah, I think that if I were to turn back the clock to 2022,
00:41:09.000 | that's probably one of the things
00:41:10.320 | I would have done differently,
00:41:11.200 | which would have been actually get some long context
00:41:14.080 | benchmarks out at the same time
00:41:16.920 | as we started pushing context length on all these models.
00:41:20.040 | - I will also say the use case.
00:41:21.520 | So like, I think we both agree
00:41:22.920 | that there's no infinite memory
00:41:25.640 | and the model needs to be able to learn inside.
00:41:27.600 | I think what we have observed for,
00:41:28.880 | I think this also fits the state space model,
00:41:30.640 | is that one of the key advantage
00:41:32.320 | of this alternate attention mechanic
00:41:34.000 | that is not based on token position
00:41:36.240 | is that the model doesn't suddenly become crazy
00:41:38.280 | when you go past the 8K training context length
00:41:40.960 | or a million context length.
00:41:44.280 | It's actually still stable.
00:41:45.520 | It's still able to run, it's still able to rationalize.
00:41:47.800 | It just starts forgetting things.
00:41:50.000 | But some of these things are still there in latent memory.
00:41:53.120 | Some of these things are still somewhat there.
00:41:54.520 | That's the whole point of why reading twice works,
00:41:57.720 | things like that.
00:41:58.680 | And one of the biggest push in this direction
00:42:00.960 | is that I think both state space and RWKV
00:42:03.280 | have separate papers by other researchers
00:42:05.920 | where they use this architecture for time series data,
00:42:08.480 | weather modeling.
00:42:09.640 | So you're not asking what was the weather five days ago.
00:42:13.520 | You're asking what's the weather tomorrow
00:42:15.160 | based on the effectively infinite length of data that we have,
00:42:18.600 | as the earth and the computer keep running.
00:42:21.320 | So, and they found that it is better than existing
00:42:26.320 | transformer architectures
00:42:29.120 | at modeling this weather data.
00:42:30.880 | Controlled for the param size and stuff.
00:42:32.320 | I'm quite sure there are people with larger models.
00:42:33.920 | So there are things that in this case, right,
00:42:37.920 | there are future applications
00:42:39.360 | if your question is just what's next
00:42:41.120 | and not what's 10 years ago.
00:42:42.880 | - Thanks so much for having us.