2024 in Post-Transformer Architectures: State Space Models, RWKV [Latent Space LIVE! @ NeurIPS 2024]

00:00:08.620 |
So this is gonna be a little bit of a two-part presentation. 00:00:14.240 |
and I'll be joining UCSD as faculty in about a year. 00:00:42.900 |
and then afterwards, Eugene will tell us a little bit 00:00:47.440 |
and the latest frontier models in this space. 00:00:54.280 |
So this is probably a figure or something like this 00:01:00.920 |
we've seen models really scale up in parameter size, 00:01:03.640 |
and that's brought with it a bunch of new capabilities, 00:01:05.640 |
like the ability to talk to you and tell you sometimes 00:01:15.480 |
especially recently, is scaling in context length. 00:01:18.680 |
So this can mean just having more text inputs 00:01:28.000 |
image inputs to your models, or generating lots of outputs. 00:01:34.080 |
over the last few months or so is that we're seeing scaling, 00:01:37.680 |
not only during training time, but also during test time. 00:01:39.920 |
So this is the iconic image 00:01:45.280 |
Not only are we starting to scale train time compute, 00:01:47.840 |
but we're also starting to scale test time compute. 00:01:55.820 |
this graph on the right might look a little bit scary. 00:01:58.640 |
And one of the reasons is that the implications 00:02:09.960 |
bigger, bigger data centers, spending more flops? 00:02:13.320 |
Is this little DALL-E 3 "we need more flops" guy, 00:02:20.720 |
Or is there a better way, another path forward? 00:02:27.900 |
but for a lot less compute, a lot less flops. 00:02:30.160 |
And one of the things that we're gonna talk about today 00:02:33.300 |
is specifically looking at that core attention operator 00:02:44.100 |
but attention has compute that scales quadratically 00:02:50.020 |
like test time compute, and you want to spend a bunch 00:03:02.380 |
One of the questions that we're interested in is, 00:03:10.260 |
Can we scale and let's say N to the three halves 00:03:22.860 |
and ideas that have shown promise over the past few years 00:03:29.140 |
that this might actually be possible, 00:03:32.080 |
that you can actually get potentially the same quality 00:03:42.480 |
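As a rough back-of-the-envelope illustration (not from the talk), here is what that kind of sub-quadratic scaling would buy at long context; the sequence lengths below are arbitrary examples:

```python
# Rough arithmetic (hypothetical sequence lengths) showing what moving from
# N^2 to N^(3/2) scaling buys: the savings factor works out to sqrt(N).
for n in [8_192, 131_072, 1_048_576]:
    print(f"N={n:>9,}  N^2 / N^1.5 = {n**2 / n**1.5:,.0f}x less work")
```

At a million tokens of context, that ratio is roughly a thousandfold.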
that we're gonna look is we're gonna start to see how, 00:03:45.020 |
so this is a basic graph of just the past couple of years 00:03:48.220 |
of progress in perplexity, where that blue line 00:03:52.580 |
is your basic transformer, full dense attention. 00:03:55.100 |
And then the dots coming down are some of the methods 00:04:00.640 |
We're gonna turn the clock back all the way to 2020. 00:04:04.460 |
So this question of, can we make attention sub-quadratic? 00:04:09.180 |
Basically, as soon as we said, attention is all you need, 00:04:13.420 |
So we have this quadratic attention operator, 00:04:17.460 |
I'll briefly talk about why attention is quadratic. 00:04:19.860 |
And the basic thing that happens if you're not familiar 00:04:23.020 |
is that you have these inputs, these keys and queries, 00:04:27.740 |
this S matrix over here is that you're using, 00:04:33.860 |
So when I try to do something like upload a whole book 00:04:36.340 |
to Gemini, what happens beyond the, or maybe not Gemini, 00:04:39.620 |
'cause we don't necessarily know what the architecture is, 00:04:43.580 |
what happens behind the scenes is that it's gonna take 00:04:59.620 |
And what attention does in particular is the, 00:05:03.020 |
and then what attention, sorry, don't wanna, okay. 00:05:10.180 |
instead of always operating in this quadratic thing, 00:05:15.700 |
and then multiplies it by this values matrix. 00:05:17.660 |
So one of the key points to notice is that the output size is much smaller than that intermediate N-by-N matrix. 00:05:26.340 |
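To make the quadratic cost concrete, here is a minimal, non-causal sketch of the attention computation being described (dimensions are arbitrary; causal masking and multiple heads are omitted):

```python
import torch

def full_attention(q, k, v):
    # q, k, v: (batch, seq_len N, dim d)
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5  # S matrix: (batch, N, N) -- quadratic in N
    probs = torch.softmax(scores, dim=-1)
    return probs @ v                                       # output: (batch, N, d) -- only linear in N

out = full_attention(*(torch.randn(1, 1024, 64) for _ in range(3)))
```

The N-by-N score matrix is the part that blows up with context length, while the final output stays N-by-d.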
So one of the first things that folks tried to do 00:05:28.340 |
around 2020 is this thing called linear attention, 00:05:30.500 |
which is just noticing that if we take out this softmax 00:05:41.160 |
you actually never hit this quadratic bottleneck. 00:05:54.060 |
or try to approximate this overall attention computation. 00:05:57.620 |
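A minimal sketch of that reordering trick (non-causal, with one common choice of feature map; the exact feature map varies across papers):

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # Drop the softmax, apply a cheap elementwise feature map, and regroup:
    #   (phi(Q) phi(K)^T) V  ==  phi(Q) (phi(K)^T V)
    # The right-hand grouping never materializes the N x N matrix.
    q, k = F.elu(q) + 1, F.elu(k) + 1                    # phi(x) = elu(x) + 1, one common choice
    kv = k.transpose(-1, -2) @ v                          # (batch, d, d): independent of N
    normalizer = q @ k.sum(dim=1, keepdim=True).transpose(-1, -2) + 1e-6  # (batch, N, 1)
    return (q @ kv) / normalizer

out = linear_attention(*(torch.randn(1, 1024, 64) for _ in range(3)))
```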
But some of this work sort of started to hit a wall in 2020 00:06:05.600 |
Back then it was kind of hard to get good quality 00:06:11.620 |
The other one was actually hardware efficiency. 00:06:13.460 |
So this feature map that was just shown, simplified here, 00:06:18.260 |
actually ends up being quite computationally expensive 00:06:28.820 |
but also they're actually just wall clock slower. 00:06:30.740 |
So you kind of end up getting the worst of both worlds. 00:06:36.620 |
So that kind of sets the stage for four years ago. 00:06:46.260 |
But one of the works that started kicking off 00:06:48.340 |
this mini revolution in post-transformer architectures 00:06:54.680 |
So here the seminal work is one of ours from 2022. 00:06:59.500 |
And this piece of work really brought together a few ideas 00:07:03.420 |
from some long running research lines of work. 00:07:10.460 |
The first one was, and this is really one of the keys 00:07:28.360 |
with how we model dynamical systems in signal processing 00:07:32.660 |
and then using those ideas to model the inputs, 00:07:39.060 |
a transformer-like next token prediction architecture. 00:07:42.300 |
So some of those early state-space model papers 00:07:44.780 |
were looking at this relatively simple recurrent update 00:07:55.680 |
about how you should do that recurrent update 00:08:01.560 |
out of your hidden state, out of your sequence. 00:08:19.980 |
there was stuff in time series analysis. 00:08:24.420 |
You started to see the quality tick up in meaningful ways. 00:08:29.300 |
But the other key thing that was so influential 00:08:36.780 |
about how you can compute these things efficiently. 00:08:41.020 |
So if you go back to your machine learning 101 class, 00:08:46.020 |
is that they don't parallelize as well as attention, 00:08:51.060 |
you have to do this kind of sequential update 00:08:56.840 |
you can process all the tokens in parallel at one time. 00:09:04.220 |
you could take them and you could also formulate them 00:09:09.780 |
you could, instead of using a PyTorch conv1d operation, 00:09:20.820 |
that was relatively well optimized for modern hardware. 00:09:24.460 |
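A tiny numerical sketch (sizes are arbitrary) of those two equivalent views of a linear state space layer: the step-by-step recurrence and the convolution whose kernel is built from powers of A. In practice the convolution is applied with an FFT rather than this naive loop:

```python
import torch

N, T = 16, 128                       # state size, sequence length (hypothetical)
A = torch.randn(N, N) * 0.05         # small so powers of A stay stable
B, C = torch.randn(N, 1), torch.randn(1, N)
x = torch.randn(T)

# 1) Recurrent view: h_t = A h_{t-1} + B x_t,  y_t = C h_t
h = torch.zeros(N, 1)
y_rec = []
for t in range(T):
    h = A @ h + B * x[t]
    y_rec.append((C @ h).item())

# 2) Convolutional view: y_t = sum_k (C A^k B) x_{t-k}, i.e. a long 1D convolution
K = torch.stack([(C @ torch.matrix_power(A, k) @ B).squeeze() for k in range(T)])
y_conv = [sum(K[k] * x[t - k] for k in range(t + 1)).item() for t in range(T)]

print(torch.allclose(torch.tensor(y_rec), torch.tensor(y_conv), atol=1e-3))  # True
```

The recurrent form is what you run at inference; the convolutional form is what lets you train over all tokens in parallel on modern hardware.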
So those are really, I'd say the two key ideas in 2022 00:09:28.380 |
that started allowing these breakthroughs to happen 00:09:33.700 |
So these ideas about how to model in a principled way, 00:09:53.580 |
so afterwards, we started putting out some work 00:09:58.700 |
So just like we have flash attention for transformers, 00:10:05.620 |
oftentimes whenever you see a new architecture, 00:10:12.740 |
so that you can actually get wall clock speed up? 00:10:14.780 |
So by 2022, 2023, we were starting to have these models 00:10:24.980 |
where they were better than transformers in meaningful ways. 00:10:27.980 |
That being said, there was still sometimes a quality gap, 00:10:33.580 |
And because language is so core to what we do 00:10:45.940 |
so you have this recurrent state that you're keeping around 00:10:48.600 |
that just summarizes everything that came before, 00:10:53.620 |
one of the things that you really need to be able to do 00:11:04.800 |
in a line of work called H3, Hungry, Hungry Hippos, 00:11:15.580 |
So versions of these ideas have been around for decades. 00:11:29.420 |
and then you can see quality start to pick up. 00:11:35.940 |
this also takes the selection to the next level 00:11:47.620 |
but also you can actually make the ABCD matrices 00:11:54.860 |
which will allow you to even better select out 00:12:02.420 |
if you look at the bottom right of this figure, 00:12:03.980 |
there's this little triangle with a GPU SRAM, GPU HBM, 00:12:16.940 |
that it can be hardware efficient on modern hardware. 00:12:34.320 |
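A heavily simplified sketch of the selection idea (this is not the actual Mamba code: the real model uses a carefully parameterized diagonal A, a proper discretization, and a fused hardware-aware scan kernel, but the input-dependent B, C, and step size are the point here):

```python
import torch, torch.nn as nn

class TinySelectiveSSM(nn.Module):
    # Hypothetical, loop-based sketch of input-dependent ("selective") state updates.
    def __init__(self, d_model=32, d_state=16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_state))        # diagonal, negative for stability
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.to_dt = nn.Linear(d_model, 1)

    def forward(self, x):                                  # x: (T, d_model)
        h = torch.zeros(self.A.shape[0], x.shape[-1])      # (d_state, d_model)
        ys = []
        for t in range(x.shape[0]):
            dt = torch.nn.functional.softplus(self.to_dt(x[t]))  # input-dependent step size
            A_bar = torch.exp(dt * self.A)                        # discretized decay, (d_state,)
            B_t, C_t = self.to_B(x[t]), self.to_C(x[t])           # selection: depends on x[t]
            h = A_bar[:, None] * h + B_t[:, None] * x[t][None, :]
            ys.append(C_t @ h)                                    # read out, (d_model,)
        return torch.stack(ys)

y = TinySelectiveSSM()(torch.randn(64, 32))
```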
linear attention actually started to come back. 00:12:38.120 |
there's a model called BASED from Simran Arora 00:12:44.600 |
a more principled version of linear attention 00:12:54.600 |
combined that with a simple sliding window attention 00:12:57.200 |
and was starting to be able to expand the Pareto frontier 00:13:01.540 |
of how much data can you recall from your sequence 00:13:04.820 |
versus how small is your recurrent state size. 00:13:35.020 |
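For a sense of what "a more principled version of linear attention" looks like, here is a sketch of a second-order Taylor feature map of the kind BASED builds on (exact scaling constants in the paper may differ): phi(q)·phi(k) approximates exp(q·k), so it can be dropped into the linear-attention grouping shown earlier and interleaved with small sliding-window attention layers.

```python
import torch

def taylor_feature_map(x):
    # phi(x) = [1, x, vec(x x^T)/sqrt(2)], so phi(q).phi(k) ~= 1 + q.k + (q.k)^2 / 2,
    # a 2nd-order Taylor approximation of exp(q.k).
    outer = (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(start_dim=-2) / 2 ** 0.5
    return torch.cat([torch.ones(*x.shape[:-1], 1), x, outer], dim=-1)

q, k = torch.randn(4, 8) * 0.1, torch.randn(4, 8) * 0.1
approx = taylor_feature_map(q) @ taylor_feature_map(k).T
exact = torch.exp(q @ k.T)
print((approx - exact).abs().max())   # small for small-norm inputs
```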
So this was a really cool paper called Just Read Twice 00:13:45.700 |
that they can sometimes have unfair advantages 00:14:00.060 |
and then you're gonna ask some question about it. 00:14:03.060 |
One problem you might imagine for a recurrent model 00:14:11.580 |
and you're trying to ask about some really niche thing. 00:14:14.900 |
You can imagine it might be hard for the model 00:14:17.540 |
what information to put into the hidden state. 00:14:26.940 |
write down the document, write down the question, 00:14:47.140 |
of the more efficient architectures that we're having here. 00:14:50.680 |
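A toy illustration (the exact prompt format in the paper may differ) of the "just read twice" idea: repeat the document so that, on the second pass, the recurrent model already knows which details it needs to keep in its state.

```python
def just_read_twice_prompt(document: str, question: str) -> str:
    # Hypothetical template, for illustration only.
    return (
        f"Document:\n{document}\n\n"
        f"Question: {question}\n\n"
        f"Here is the document again:\n{document}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(just_read_twice_prompt(
    "The meeting was moved to Room 204 at 3pm on Thursday.",
    "Which room is the meeting in?",
))
```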
So one of the other, I think, influential ideas 00:14:54.580 |
if you change the fundamental compute capabilities 00:15:00.260 |
you can actually start to query it at test time differently. 00:15:04.260 |
goes back to those slides on test time compute. 00:15:09.020 |
test time compute for big transformer models, 00:15:12.340 |
I think potentially a really interesting research question 00:15:35.800 |
instead, take ideas that we know from other fields, 00:15:44.760 |
Another key idea throughout all these lines of work 00:15:47.240 |
is you really want hardware and kernel support from day one. 00:15:51.160 |
So even if your model is theoretically more efficient, 00:15:54.960 |
if somebody goes and runs it and it's two times slower, 00:16:03.520 |
So you want to be designing your architectures 00:16:13.840 |
is just making sure that you encode different ways 00:16:18.720 |
and really focus on that as a key decider of quality. 00:16:22.200 |
And finally, I think one of the emerging new things 00:16:29.560 |
is what are the right test time paradigms for these models? 00:16:32.960 |
How do they change relative to what you might do 00:16:41.880 |
So I've labeled this slide where we are yesterday 00:16:45.440 |
because Eugene is gonna talk about some new models 00:16:49.840 |
But as of yesterday, some of the really cool results 00:16:52.080 |
out of these efficient alternative models were, 00:17:08.720 |
put out this new diffusion model called SANA recently 00:17:21.800 |
and then that lets you scale to much larger images, 00:17:30.720 |
And one thing that I don't think anybody would have called 00:17:36.320 |
is that one of those gated SSMs, gated state space models 00:17:56.920 |
where these non-transformer, post-transformer architectures 00:18:26.920 |
is what's the difference between RWKV and state space? 00:18:30.200 |
So I think one of the key things to really understand, 00:18:33.560 |
right, the difference between the two groups, right, 00:18:38.680 |
an open-source, random-internet-meets-academia 00:18:45.040 |
but we basically look at RNNs and linear attention 00:19:02.600 |
And we do all this actively in Discord, GitHub, et cetera. 00:19:17.360 |
Great, our H-index is now three, apparently. 00:19:35.000 |
how does RWKV handle its own attention mechanism 00:19:38.520 |
and achieve the same goals of like O(n) compute, 00:19:41.600 |
respectively, in the context of our overall goal 00:19:56.120 |
And our goal is to train on even 200 languages 00:20:00.040 |
But at the same time, we work on this architecture 00:20:08.600 |
So how did RWKV break the dependency of the LSTM token flow? 00:20:13.600 |
Because I think to understand architecture, right, 00:20:16.120 |
it's probably easier to understand it from the RNN lens, 00:20:21.680 |
Whereas state space kind of, like, tried to start anew. 00:20:28.200 |
And, AKA, this is our version of linear attention. 00:20:31.320 |
So to take a step back, all foundation models, 00:20:37.440 |
at a very high level, right, comes in a token, 00:20:45.800 |
whether KV cache or RNN states or RWKV states, 00:20:50.360 |
and outputs an embedding, a layer norm, and so on, 00:20:52.680 |
and we just take more layers and more embeddings, 00:21:07.000 |
the general idea is that you have the embedding information 00:21:09.360 |
from all the way up, and you take that information 00:21:13.920 |
and then you process it as part of your LSTM layers. 00:21:34.360 |
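A bare-bones sketch of that high-level picture (using an off-the-shelf GRU cell purely as a stand-in for whatever the per-layer mixer actually is): a token comes in, each layer updates its own carried state, and the result is layer-normed and passed up the stack.

```python
import torch, torch.nn as nn

class TinyRecurrentLM(nn.Module):
    # Hypothetical sketch: per-layer recurrent state is the RNN analogue of a KV cache.
    def __init__(self, vocab=256, d=64, layers=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.cells = nn.ModuleList(nn.GRUCell(d, d) for _ in range(layers))
        self.norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(layers))
        self.head = nn.Linear(d, vocab)

    def step(self, token, states):
        x = self.emb(token)
        new_states = []
        for cell, norm, h in zip(self.cells, self.norms, states):
            h = cell(x, h)
            x = norm(x + h)          # residual + layer norm, then feed the next layer
            new_states.append(h)
        return self.head(x), new_states

model = TinyRecurrentLM()
states = [torch.zeros(1, 64) for _ in range(4)]
logits, states = model.step(torch.tensor([5]), states)
```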
So you can have an H100, and you can't even use 1% of it. 00:21:38.280 |
So that's kind of why RNNs didn't really take off 00:21:42.640 |
like billions of parameters when it comes to training. 00:21:56.360 |
It trained, it sucked, but it kind of worked. 00:22:02.800 |
because the loss was crap, but how do we improve that? 00:22:12.080 |
you can actually get your GPU saturated quickly 00:22:24.200 |
you start to cascade your compute all the way 00:22:28.760 |
So we worked on it and we started going along 00:22:34.960 |
this general architecture where we can cascade 00:22:38.040 |
and be highly efficient with our architecture, 00:22:45.680 |
In fact, if you ask me to explain some things 00:22:48.920 |
in the paper, right, officially in the paper, 00:22:51.160 |
I'll say we had this idea and we wrote it this way. 00:22:55.760 |
we tested it, it worked, and then we rationalized it. 00:23:03.200 |
we generally have two major blocks that we do. 00:23:08.080 |
And TimeMix generally handles long-term memory states 00:23:12.520 |
where essentially where we apply the matrix multiplication 00:23:17.520 |
and SILU activation functions into processing 00:23:22.200 |
I'm oversimplifying it because this calculation 00:23:25.120 |
changed every version and we have version seven right now. 00:23:36.680 |
or the token before it, 'cause there's a shift 00:23:41.480 |
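A heavily simplified sketch of that token-shift idea (this is not the full TimeMix computation, which changes between RWKV versions): each position works on a learned blend of the current token's channels and the previous token's channels.

```python
import torch, torch.nn as nn

class TokenShiftMix(nn.Module):
    # Hypothetical, stripped-down illustration of RWKV-style token shift + mixing.
    def __init__(self, d=64):
        super().__init__()
        self.mu = nn.Parameter(torch.rand(d))   # per-channel interpolation weights
        self.proj = nn.Linear(d, d)

    def forward(self, x):                        # x: (T, d)
        x_prev = torch.cat([torch.zeros(1, x.shape[-1]), x[:-1]], dim=0)  # shift by one token
        mixed = self.mu * x + (1 - self.mu) * x_prev
        return torch.nn.functional.silu(self.proj(mixed))

y = TokenShiftMix()(torch.randn(32, 64))
```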
I don't really want to go too much into the papers itself 00:23:46.240 |
Basically, "RWKV: Reinventing RNNs for the Transformer Era," 00:23:52.040 |
This is the updated version five, version six. 00:23:54.680 |
And GoldFinch is our hybrid model, respectively. 00:24:08.480 |
all our architectures are codenamed after a bird. 00:24:21.760 |
and to be clear, most of this research is done 00:24:28.000 |
that was his experiment budget for a single researcher. 00:24:40.120 |
was how do we convert transformer models instead? 00:24:43.440 |
Because someone already paid that million dollars 00:24:46.200 |
onto training, so why don't we take advantage 00:24:49.560 |
And I believe Together AI worked on the LoLCATs 00:24:59.920 |
And that led to Q-RWKV6, which we just dropped today, 00:25:07.400 |
where we took the Qwen 32B Instruct model, 00:25:24.440 |
But once we do that, we train the RWKV layer. 00:25:28.600 |
What's important is that the feedforward layer needs to be frozen, 00:25:41.040 |
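A hypothetical sketch (class and attribute names here are stand-ins, not a real API) of the conversion recipe being described: swap each attention block for an RWKV-style mixer, freeze the feedforward layers and everything else inherited from the donor transformer, and train only the new mixers.

```python
import torch.nn as nn

def convert_and_freeze(model: nn.Module, make_mixer) -> nn.Module:
    # `model.layers`, `layer.attention`, and `layer.hidden_size` are illustrative
    # stand-ins for whatever the donor transformer actually exposes.
    for layer in model.layers:
        layer.attention = make_mixer(layer.hidden_size)   # replace attention with the new mixer
    for name, param in model.named_parameters():
        param.requires_grad = ".attention." in name       # only freshly inserted mixers train
    return model

# usage (names are illustrative):
# model = convert_and_freeze(PretrainedTransformer.load("qwen-32b-instruct"), RWKVTimeMix)
# optimizer = torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)
```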
The end result, surprisingly, and to be honest, 00:25:46.760 |
which ended up releasing the model on the same day, 00:25:49.240 |
was that with just a few hours of training on two nodes, 00:26:06.640 |
who kind of leads most of our research coordination, 00:26:28.680 |
we were essentially like Frankensteining this thing, 00:26:31.440 |
and we did brain damage to the feedforward network layer 00:26:40.760 |
We didn't even spend three days training this, 00:27:01.000 |
We are already planning to do our version seven 00:27:08.720 |
And the other thing that is uncomfortable to say 00:27:12.080 |
is that, because we are doing right now the SMPB, 00:27:14.920 |
is that if this scales correctly to 128k context length, 00:27:30.360 |
That means if this works and the benchmark matches it, 00:27:39.240 |
And then, sorry, can someone give us more GPUs, 00:27:41.560 |
because we don't need the VRAM for super long context, sadly. 00:27:54.320 |
I don't think it's going to be exclusive to RWKV, 00:28:18.520 |
with a state-based model, be it RWKV or state space, 00:28:18.520 |
plus four teams, that a lot more needs to be done. 00:28:42.760 |
But these are things that excite me, essentially, 00:29:12.800 |
is continued hardware-model co-design for these models. 00:29:12.800 |
So one of the things that we've put out recently 00:29:25.320 |
And one of the things that we found frustrating 00:29:27.760 |
is every time that we built one of these new architectures, 00:29:30.280 |
and I'm sure you had the exact same experience, 00:29:32.680 |
we'd have to go and spend two months in CUDA land, 00:29:37.640 |
And if we decided to change one thing in PyTorch, 00:29:45.000 |
So one of our goals with a library like ThunderKittens, 00:29:48.440 |
so we just broke down what are the key principles, 00:30:05.840 |
So you really want your operation to be able to split 00:30:08.760 |
into a relatively small matrix-matrix multiply operation. 00:30:13.560 |
So, like, multiplying two 64 by 64 matrices, for example. 00:30:25.880 |
how you set the update function. 00:30:36.280 |
should not be a float, but it should be a matrix, 00:30:38.800 |
and everything should just be matrix compute. 00:30:41.280 |
And we've been using that to try to both re-implement 00:30:44.160 |
some existing architectures and also start to design 00:30:48.880 |
with this core, with a tensor core primitive in mind. 00:30:52.720 |
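To illustrate the tile-centric principle (in plain PyTorch, not the ThunderKittens API), here is a matrix multiply written as nothing but small 64-by-64 fragment multiplies, the shape tensor cores are built to consume; the dimensions are assumed to divide evenly by the tile size.

```python
import torch

def tiled_matmul(a, b, tile=64):
    # Decompose a large matmul into many small tile-by-tile matrix multiplies.
    M, K = a.shape
    _, N = b.shape
    out = torch.zeros(M, N)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = torch.zeros(tile, tile)
            for k in range(0, K, tile):
                acc += a[i:i + tile, k:k + tile] @ b[k:k + tile, j:j + tile]  # 64x64 fragment
            out[i:i + tile, j:j + tile] = acc
    return out

a, b = torch.randn(256, 256), torch.randn(256, 256)
assert torch.allclose(tiled_matmul(a, b), a @ b, atol=1e-2)
```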
Another thing that we're, at least I'm excited about, 00:31:03.640 |
But if you've been paying attention to Twitter, 00:31:06.000 |
there's been a bunch of new next generation models 00:31:16.080 |
that are supported by your mouse and your keyboard, 00:31:41.320 |
or some of these new video generation models that came out. 00:31:43.680 |
So Sora came out, I don't know, two days ago now, 00:31:51.040 |
So that's probably a quadratic attention operation 00:31:55.120 |
What if we could remove that and get the same quality, 00:32:00.320 |
Or some of the demos that we saw from Paige earlier today. 00:32:04.040 |
If I have a super long conversation with my Gemini bot, 00:32:14.120 |
I mean, maybe you don't for personal reasons, 00:32:26.040 |
I think we were supposed to have some hot takes, 00:32:28.480 |
but I honestly don't remember what our hot takes were. 00:32:35.480 |
- I think the big one on Twitter that we saw, 00:32:56.960 |
I'll say I found it was a little bit challenging 00:33:02.480 |
because we had this experience over and over again 00:33:06.240 |
where you could have an embedding model of any quality. 00:33:10.760 |
So you could have a really, really bad embedding model 00:33:25.360 |
I know it doesn't actually answer the question, but. 00:33:29.600 |
So I think a lot of folks are like extremely excited 00:33:41.760 |
we just mean a different kind of infinite context 00:33:48.480 |
So think of it more along the lines of the human. 00:33:51.160 |
Like, I don't remember what I ate for breakfast 00:33:57.440 |
And we humans are not quadratic transformers. 00:34:01.600 |
If we did, if, let's say, we increased our brain size 00:34:06.360 |
we would have exploded by the time we were five years old 00:34:09.440 |
And I think basically fundamentally for us, right, 00:34:13.160 |
be it, regardless of whether it's RWKV, 00:34:18.560 |
our general idea is that instead of that expanding state, 00:34:34.120 |
Like, RWKV is running at 40 megabytes for a state. 00:34:39.120 |
Its future version might run into 400 megabytes. 00:34:49.280 |
It's just that I guess we are all more inefficient about it. 00:34:53.560 |
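Some rough, illustrative arithmetic (the transformer configuration below is hypothetical, not a measurement of any specific model; the ~40 MB figure is the one quoted in the talk) comparing how a KV cache grows with context versus a fixed recurrent state:

```python
# Illustrative KV-cache sizing: 32 layers, 8 KV heads of dim 128, fp16 (2 bytes).
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2

def kv_cache_mb(tokens):
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per / 1e6  # K and V

for t in [8_192, 131_072, 1_048_576]:
    print(f"{t:>9,} tokens: KV cache ~{kv_cache_mb(t):>9,.0f} MB vs fixed ~40 MB state")
```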
and that's kind of like the work we are doing 00:34:57.760 |
And that's where the models will start differing: 00:35:06.280 |
some element of it, right, but it may not be the same, right? 00:35:09.920 |
And it's like, hmm, I can't remember that article. 00:35:16.360 |
when we can't remember the article in a company, 00:35:19.800 |
- Yeah, I think something that would be really interesting 00:35:25.680 |
so right now the one intuition about language models 00:35:33.640 |
And this intuition comes from the observation 00:35:35.840 |
that if you take a really small language model, 00:35:39.800 |
or it kind of has like the style of conversation 00:35:49.640 |
about things that it knows or that it can do. 00:35:52.960 |
But that points to all those weights that we're spending, 00:35:57.360 |
all that SGD that we're spending to train these models 00:36:04.720 |
So I think one thing that would be really interesting 00:36:06.560 |
is if we could actually have some sort of outside data store 00:36:13.600 |
that maybe has some sort of gradient descent in it, 00:36:21.600 |
And then maybe you could edit it, delete facts, 00:36:23.680 |
change who's president so that it doesn't get lost. 00:36:28.440 |
- Can we open up Q&A and hot takes to the audience? 00:36:43.320 |
who's throwing in 2 million token questions, hot takes? 00:36:43.320 |
- The who's throwing in 2 million token question 00:36:52.400 |
So I actually, I was gonna offer that as a hot take. 00:37:06.680 |
But I think one of the, so I think for both of us, 00:37:12.960 |
was just from the first principle of questions 00:37:18.920 |
Clearly intelligence doesn't need to be quadratic. 00:37:23.440 |
You know, since then it's kind of turned into a race, 00:37:32.560 |
Nobody is actually putting in a 2 million context prompt 00:37:37.120 |
And, you know, if they are, maybe we can go, you know, 00:37:41.400 |
design a better model to do that particular thing. 00:37:51.840 |
How many of you remember the news of Google Gemini 00:38:24.560 |
because I think the big labs may have a bigger role in this 00:38:50.880 |
reuse the VRAM consumption in the training time space. 00:39:01.000 |
But then putting it back to another paradigm, right, 00:39:08.760 |
might be actually pushing that direction downwards. 00:39:14.520 |
is that if, let's say you have a super big 400B model, 00:39:28.200 |
and this is even for transformer or non-transformer, right, 00:39:31.080 |
will take less resources than that 400B model, 00:39:35.920 |
even if it did double the amount of thinking. 00:39:39.320 |
and we're still all trying to figure this out, 00:39:50.520 |
to just reason it out over larger and larger context length. 00:40:08.560 |
where you run on a much longer context length 00:40:20.120 |
I think you guys probably had tweets along these lines too. 00:40:31.560 |
And at the very least it won't, like, error out or crash on you. 00:40:31.560 |
There's another question of whether it can actually use 00:40:44.600 |
and architectures ran faster than other research 00:40:56.080 |
Can we actually build some benchmarks for that, 00:40:59.720 |
and then ask the question, can the models do it? 00:41:05.000 |
Yeah, I think that if I were to turn back the clock to 2022, 00:41:11.200 |
which would have been to actually get some long context 00:41:11.200 |
as we started pushing context length on all these models. 00:41:25.640 |
and the model needs to be able to learn inside. 00:41:28.880 |
I think this also fits the state space model, 00:41:36.240 |
is that the model doesn't suddenly become crazy 00:41:38.280 |
when you go past the 8K training context length 00:41:45.520 |
It's still able to run, it's still able to rationalize. 00:41:50.000 |
But some of these things are still there in latent memory. 00:41:53.120 |
Some of these things are still somewhat there. 00:41:54.520 |
That's the whole point of why reading twice works, 00:41:58.680 |
And one of the biggest push in this direction 00:42:05.920 |
where they use this architecture for time series data, 00:42:09.640 |
So you're not asking what was the weather five days ago. 00:42:18.600 |
as long as this earth and the computer will keep running. 00:42:21.320 |
So, and they found that it is better than existing, 00:42:32.320 |
I'm quite sure there are people with larger models. 00:42:33.920 |
So there are things that in this case, right,