
Google Titans: Learning to Memorize at Test Time


Transcript

Welcome, everyone. Let me share my window here. So pretty, by the way. Yeah. I was playing around with this new presentation tool, gamma.ai. And I can't go back to PowerPoint ever again. That's huge praise. Yeah. A friend of mine actually quit to work for this company. And I was like, another AI Slides company?

You know, these never work. Shows what I know. It's really impressive. So I'm not paid by the company in any way, but I do recommend people try it. So this week, we're going to talk about a new paper out of Google Research, Titans. The main thing that this new architecture presents is giving the model some memory, especially at inference time, so that it can leverage that to make better inferences and, you know, improve benchmark scores.

And I'm generally not going to be looking at the chat while I present. There's a lot to go through, but I'm going to leave plenty of time at the end for discussion where we can go back over questions. So yeah. So do feel free to drop posts in the chat, and we will get to them.

So let's see. Okay. I'm trying to figure out how to -- wait a minute. There we go. Okay. So what are some of the problems that this Titans paper is trying to address? The first of which is the quadratic complexity nature of attention. And this has been one of the things that typically was, up until, say, the last year or so, really restricting the context window size of models.

So if I remember correctly, when GPT-4 came out, it first had a context window of 8K or 32K, something like that, but nothing compared to the 1 million token context window of Gemini. And the reason for that was that in typical attention, every token is compared to every other token, so that if you, say, have a context window size of 1,000, then the complexity is 1,000 squared.

And if you have a, you know, window size of 10,000, then it's 10,000 squared. And so in order to increase the size of your context window, it becomes, like, very computationally expensive very quickly. And there are ways around this that are mentioned in the paper, but that's one of the problems with, let's say, classical transformers, which leads right into the limited context window, again, based on the quadratic aspect.
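To make the quadratic scaling concrete, here is a minimal sketch of a vanilla attention step in NumPy; the sizes are made up for illustration. The point is just that the score matrix is n by n, so doubling the context length quadruples the compute and memory for this step.

```python
import numpy as np

n, d = 1_000, 64                        # hypothetical context length and head dimension
Q = np.random.randn(n, d)               # queries
K = np.random.randn(n, d)               # keys
V = np.random.randn(n, d)               # values

scores = Q @ K.T / np.sqrt(d)           # shape (n, n): the quadratic term
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                       # every token mixes information from every other token
```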

And then the other thing is that the limited context window creates a bottleneck for very long sequences of data. So transformers are great, and, you know, there have been tons of advances in them since the Attention Is All You Need paper, but, you know, there are limitations in the architecture.

And so what does Titans do to try to address that at a very high level? The first thing it does is to emulate human memory, meaning that human memory has different -- you can think of them as modules or types of memory. There's short-term memory and long-term memory. However, the typical transformer model, as we'll see, only has, like, short-term memory, because it has access to the tokens it's currently processing, but anything that happened before that, it doesn't have memory of.

And as you do AI engineering, you probably are well aware that every time you make an API call to OpenAI or Claude, it's, like, starting all over again. And then the second part is the enhanced processing. So the memory allows very long, like, extremely long sequences to be handled, so that -- and we'll see that as the tokens are processed, the model keeps a representation of just the most salient or important, let's say, facts.

It's a representation of facts, but let's say facts from the tokens it's seen so far so that it can use those to improve predictions on future tokens. So there was some work by other groups, and this group as well. So this isn't, like, a sudden thing. This paper, Test-Time Training with Self-Supervision, goes back to 2020.

So there are a few things that this paper presented. The first is turning the unlabeled test data into a self-supervised learning problem so that the model can update its own parameters before making a prediction. And they noticed performance improvements with this. So that's one paper that previously came out. And I didn't -- I'm going to just mention two papers.

I didn't read either of them in depth. I just kind of read the abstract and got the gist of them. But if you're very interested in this, both of these are good sources. And then the other one is this Learning to (Learn at Test Time): RNNs with Expressive Hidden States paper.

So this is -- and we'll see this in the Titans paper as well, that it does bring an RNN or recurrent neural network technique back into transformers, where there is a state inside the -- well, it's the hidden state. And as the tokens are processed, that hidden state is updated and then allowed to affect future outputs.

So this is the paper that brought that into the research community. You can also see the -- one of the main points from this paper is the linear complexity. So using this to get around the quadratic complexity of the typical transformer. So as I mentioned, Titans is inspired by human memory.

There's core or short-term memory. So in the Titans models, this is the equivalent of just a normal attention mechanism where you have the query, keys, and values sets, you know, all interacting in the transformer component. So this is the short-term memory. So they don't really introduce anything about short-term memory.

They just reuse what is already out there. Long-term memory is where they learn to memorize historical context, encode that context into abstractions so that it can improve future tokens. This is the main thing that they talk about in the paper is this long-term memory module and how exactly it operates.

And then the persistent memory is not a, like, inference-dependent thing. So this is -- you can think of this as, like, a set of knowledge or a set of rules that always exists in the model. It's hard-coded in there. And it's always brought into the -- into the sequence.

We'll see how that happens. They don't talk a whole lot about persistent memory. So it's -- at least for me, it was a little unclear where -- how exactly this is created. But it's also not the main point of it. It's just kind of a set of parameters that are always fed into the sequence that's going to be predicted against.

Okay. So let's take some time digging into this long-term memory module. So this is the -- kind of the main breakthrough that this paper is presenting. There's a few different things you can see here as far as what this module does. The first is recurrent processing. So like I mentioned, it maintains a state as it's doing inference.

And then that part of the output is then fed back into that state so that it can continue to capture new information as it goes along. This helps it to kind of learn the task that it's doing. And also to do those needle in a haystack things where maybe some information early on in the sequence is relevant to something much later.

So it can keep that in memory. It's also a memory component. So this is the thing that is responsible for generating representations for, you know, the information that is coming through it at inference time. And then let's see. The last bullet point about sequential data handling. Yeah. So as I've been mentioning, like, we're assuming this is all happening in a sequence of tokens in the case of a large language model where it's processing one token after another.

It's not all at once. That's pretty -- pretty taken for granted for any large language model. So let's keep going. So we'll dig into a few different aspects now of the long-term memory module. So here you can see some points. Let me pull out a couple of them. One is these weight adjustments, which is, I think, one of the most interesting things about the Titans architecture.

So for, like, almost all LLMs, or at least the ones that I'm aware of, like a GPT-4, a Sonnet, Claude Sonnet, whatever, they do a lot of pre-training. They do, you know, instruction tuning, all of that. But once they're done with that, then the weights are just the weights.

And even if you think of, like, the DeepSeek release last week, like, it was just a bunch of weights. So these don't -- these don't adjust in memory, no matter what you put through the model. Whereas this Titans model, at least in the context of a single inference, can adjust the weights of the long-term memory module, which I think makes it, like, a very interesting new approach.

And for the continuous adaptation, so that kind of plays off the weight adjustments. So that's something that, you know, as it's going through a long sequence of perhaps even millions of tokens, it can continue to actually learn about, like, the data that it's processing. And that's another thing that they really emphasize in the paper, is creating a model of -- when I say model, I don't mean, like, a machine learning model.

I mean, like, an abstraction for learning. And what does that mean for a model to be able to learn? So we've talked about how the model updates its weights based on the data that comes through. And so how does the model know, like, what is interesting, like, what is worth keeping, and what isn't?

Because it doesn't have enough parameters for long sequences to capture everything that comes into it. So it's going to have to compress whatever data is coming through. So how does it decide, like, what to remember and what not to? So this surprise mechanism is the main way that it does that.

So information that's considered surprising is for humans more memorable. And so they took the leap that for a model, it's also going to be more memorable. Or more important to remember. I'm just looking through the slides. So, okay. So the point is, well, how do we know if information is important or not?

The main way they decide that is with a gradient. So you can see that -- hopefully you can see my mouse. But you can see that down here. The gradient of the loss on the incoming input data with respect to the memory from the previous time step gives us the amount of surprise.

And it -- you can see it gives preferential treatment to this surprising information. The more surprising it is, the more likely it is to be stored in memory. So you can imagine for, like, a long sequence of maybe documents, maybe there's, I don't know, 100 documents that are all legal briefings.

And then it comes to something that is maybe a bill of lading or, you know, a product description or something like that. So then it's going to be like, oh, this is completely different. I should remember something about this. So it's going to prioritize keeping what it considers surprising or information that it hasn't seen before.

And then it stores this in key-value pairs in that long-term memory module. So in addition to surprise, there's also forgetting. So I should have mentioned in the last slide that this theta parameter here controls, like, the impact of surprise. So it's a tunable parameter that can be increased or decreased to either remember more surprising information or forget it.

The momentum and dynamic forgetting mean that it can forget things, so that it discards irrelevant information as it goes along. So you can see this one minus alpha term, where it is slowly decaying the older memory. So as things get older and older, they fade out. I heard someone come off mute.

Is there a question? >> Yeah. That M_t parameter, is that the output of the memory module or the model itself? >> That is essentially the state of the memory module. >> Okay. And so the surprise has the same dimensions, the same form as the memory itself? >> Why don't we look in the paper after the presentation and we can, or if you want to look at it.

I can't answer that right off the bat. >> I would assume so because it's just a simple addition to it. And so kind of curious, I guess the origin of this is that, you know, they show different structures for how the memory was included and it's, like, added in different parts of the architecture stack, if you will.

I'm kind of curious if that would stay the same across their form of the architecture or if it changes depending on where the memory is included at. >> Yeah. That's an interesting question. I'm going to go through the different architectures they introduce. So hopefully, I don't know if that will answer the question, but at least it will give us enough context for discussion.

I'm going to leave plenty of time for discussion, so this would be a good thing to talk about. And then it also has this momentum that, where it combines past surprise with a decay factor. So that's this eta t and I guess the previous surprise. So somehow this provides a momentum mechanism to the model.
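To pull the surprise, momentum, and forgetting pieces together, here is a minimal sketch of how that update rule reads to me, using a simple linear memory so the gradient has a closed form. The dimensions, learning rates, and projection matrices are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

d = 64
M = np.zeros((d, d))                       # long-term memory state M_t
S = np.zeros((d, d))                       # surprise / momentum state S_t
W_K = np.random.randn(d, d) / np.sqrt(d)   # assumed key projection
W_V = np.random.randn(d, d) / np.sqrt(d)   # assumed value projection

def memory_step(M, S, x, eta=0.9, theta=0.1, alpha=0.01):
    """One token step: measure surprise, apply momentum, decay, and write."""
    k, v = W_K @ x, W_V @ x
    err = M @ k - v                        # how wrong the memory's prediction is
    grad = 2.0 * np.outer(err, k)          # gradient of ||M k - v||^2 w.r.t. M (the "surprise")
    S = eta * S - theta * grad             # momentum over past surprise
    M = (1.0 - alpha) * M + S              # forget a little, then write the update
    return M, S

for x in np.random.randn(16, d):           # a toy stream of 16 token embeddings
    M, S = memory_step(M, S, x)

q = np.random.randn(d)                     # at inference, retrieval just queries the memory
retrieved = M @ q                          # no weight update on retrieval
```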

Okay. So those are kind of the components that go into a Titans model. Now, let's look at the different ways they combine them, because there was this open question of, like, okay, we have this module that we think would be useful for the model. Like, let's see what's the best way to then incorporate it.

And they come up with three different options: memory as context, memory as gate, and memory as layer. And each one of these has its strengths and weaknesses. So let's first look at memory as context. And I'll just walk through this diagram a little bit. So we can start looking over here at the -- this is the input sequence.

So this is essentially your prompt or what you are sending into the model. This is then used as a key or a query -- I guess a query into the memory module that then returns, like, the most relevant information from the memory as a sequence of tokens. And then at the same time step, you can see this is the persistent memory.

So this is kind of the hard-coded information that just always has the same token that adds to the sequence. So you can see that before we go to attention, we're adding these extra sequences. So in my previous example of, like, legal documents and maybe a product description, the retrieval could say, like, okay, this sequence looks like -- or this next part of the sequence looks like a legal document.

So let's grab any information we have about the legal documents, put it here. And then this persistent memory might just be about how to generally handle incoming requests, something like that. It goes through attention. So this is the quadratic thing where every token pays attention to every other token.

It then, based on the output, sends an update to the memory, so it updates the actual weights here with whatever relevant information came out. Then I'm not sure what this is -- some kind of combination operator, shown as a circled symbol in the diagram, that combines these and then gives you the final output. And so this is what the attention masks look like for memory as context.

You can see that each one of these is treated as, like, a small context window. So it's, like, kind of little batches of tokens being paid attention to together. Then we add the long-term memory onto that. And then finally the persistent memory. And so this kind of gives you an idea of, like, inside the attention mechanism, what exactly it's seeing.
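For the memory-as-context flow described above, a rough sketch looks like the following; the attention stub and all sizes are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def attn_stub(x):
    """Stand-in for a standard attention block (the quadratic part)."""
    s = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ x

d, seg_len, n_persist, n_mem = 64, 32, 4, 4
persistent = np.random.randn(n_persist, d)   # learned, input-independent tokens
segment = np.random.randn(seg_len, d)        # current chunk of the prompt
retrieved = np.random.randn(n_mem, d)        # stand-in for querying the memory with the segment

# Memory as context: prepend persistent tokens and the retrieval to the segment,
# attend over the concatenation, and keep the outputs for the real tokens.
extended = np.concatenate([persistent, retrieved, segment])
out = attn_stub(extended)[-seg_len:]
# The attention output is then used both for the final prediction and to write
# an update back into the memory module.
```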

And next, memory as gate. So this is a different approach. It does not, like, query the memory before sending it into the attention mechanism. So it sends a query to the memory at the same time as going through the attention mechanism and then combines them afterwards. And so the memory -- like, whatever is stored in the memory does not go through the attention mechanism at all.

Yeah. So this is -- this uses the memory as a gating mechanism. I guess right here. Which I guess would allow -- I'm assuming allow certain tokens through and some tokens not. We can discuss this in the discussion about exactly how this works. And then here's the attention masks for the memory as gate.

You can see it's quite different. Here there's a sliding window where, you know, each token pays attention to a certain number of tokens on each side of it. You can see the long-term memory, how the memory is strong next to the input sequence, and then it gradually fades out as the tokens get, you know, further in the future, if you will.

And then finally, the persistent memory just, like, adds to the very front of the sequence. This is what the attention masks look like in the memory as gate. And then finally, memory as layer. So this is in some ways similar to what some transformer models do, where they have the attention mechanism and then there's a feedforward mechanism as well.

So this is, like, the closest thing to that, where it just goes right into the memory, and whatever comes out of the memory goes into the attention. Yeah, and it says it can take advantage of the complementary data processing of attention and the neural memory module, so there's no combining step, either before or after.

It's just, like, one more step in the architecture. And then finally, they also, in their experimentation, look at memory without attention. So this is essentially just a long-term memory module by itself. There's no attention mechanism. And they just look at, like, how does this perform? Just purely a long-term memory module without attention or anything.
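For contrast, here is a similarly rough sketch of the other two variants plus the memory-only ablation; both branch functions are stand-ins, and the sigmoid gate in the gated variant is my own assumption about how the combination might look.

```python
import numpy as np

rng = np.random.default_rng(0)

def attn_stub(x):
    """Stand-in for a sliding-window attention block over a segment."""
    s = x @ x.T / np.sqrt(x.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ x

def memory_stub(x):
    """Stand-in for the neural long-term memory branch."""
    W = rng.standard_normal((x.shape[-1], x.shape[-1])) / 8.0
    return np.tanh(x @ W)

segment = rng.standard_normal((32, 64))

# Memory as gate: run both branches side by side, then let the memory branch
# gate the attention branch elementwise.
gate = 1.0 / (1.0 + np.exp(-memory_stub(segment)))
mag_out = gate * attn_stub(segment)

# Memory as layer: stack them instead -- memory first, attention after.
mal_out = attn_stub(memory_stub(segment))

# Memory only: the ablation with no attention mechanism at all.
lmm_out = memory_stub(segment)
```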

Okay. So now we're getting into the last part of the paper, the experimental setup and results. So they test all four of these variants and they test them at different sizes. And I thought one thing that was interesting is that these are pretty small sizes, at least compared to, you know, your current, your modern LLM.

So, like, a Llama 7B is considered a smaller model. And if you look at, like, you know, a GPT-4o, that probably has hundreds of billions of parameters. So these sizes are pretty tiny compared to, like, the models we use day to day. So they gave a big table of their results for language modeling.

And they also threw some common sense reasoning benchmarks in here. So you can see all the benchmark names across the top. And then the best performing ones are highlighted. The tan highlights are for hybrids. And then the blue highlights are for pure, or just, like, normal, models. So you can see that basically Titans wins at everything here versus Mamba, DeltaNet.

This is test-time training -- TTT. Anyway, this is one of the previous works on memory at test time. So you can see that it wins at everything. This is for the 340 million parameter model. And you can also see the number of tokens they train on is pretty small.

Modern LLMs train on, like, low trillions of tokens. So this is not much data at all. Then -- so you can see, like, language modeling, it does quite well. And then here's their needle-in-a-haystack results. So for this test, they had some information early on in a sequence, and then a bunch of, like, filler tokens.

And then some sequence that needed those very early tokens. Like, to understand what was happening. And so you can -- the Titans is the red stars here. So you can see that they are maintaining their performance quite well. Even out to -- here, if we go to the fine-tuning setup, 10 to the 7th.

So even out to 10 million tokens. I mean, they did take a performance hit. But still doing much better than every other model. And so I think this is, for me, one of the most interesting charts that just shows that as this becomes more productionalized, there's going to be the opportunity to have longer and longer context windows where maybe you can feed in, like, a bunch of YouTube videos, like, hundreds of pages of PDFs, and all this stuff, and still have it be -- give you, like, relevant output.

So with that, that is the end of my presentation. So let me pop open the chat here. See what's going on. And if anyone wants to speak up, make comments, make any corrections to what I said, like, I'm not an expert in this, and I'm also not an AI researcher.

So if you have insights, would love to hear those. And also, if anyone has answers to questions, feel free to chime in. Again, like, I'm not the expert on this paper. Like, I read it and understand it. But I know there's a bunch of very smart people on this call.

So with that, I'll open up the floor. >> I can also help with questions, but I'm struggling with the chat window. >> Yeah, I mean, I want to validate Cosmin's frustration with this paper. Yeah, I mean, look, like, I think they did try to illustrate the memory mechanisms somewhat, but not super -- it's not super clear.

And I always wish that these things came with code. I really like the name of Papers with Code, because this one needed code. And, you know, maybe they released it, but I didn't see it inside the paper. >> They said at the very bottom of the paper that they're planning to release code soon.

>> How Chinese of them? >> Whatever that means. >> Yeah, I didn't understand if the diagrams refer to one step or multiple steps. Like, I think -- and also, the diagrams are 3D, which makes it a bit more confusing. Like, they say when you update the memory, you do memory of query, and then you get, like, vector as output.

What does that mean? Is that k-nearest neighbor lookup? Is it some attention? Maybe we can go to the first slide, where they update the memory equation. I don't know, Eugene, if you got time to -- got any time to look, but I would be interested in just one of those operations, how does it actually happen?

Like, I understand at the high level, we have a memory module, we update, it's nice to forget, they somehow figured it out, but, like, what layers or what do they actually do? Like, even lower, you have some -- if you go a bit lower. Yeah, you see -- >> Retrieving memory.

>> Yeah, what does that -- maybe I didn't read or, like, even this figure one, I didn't understand at all what's going on. >> Yeah, I also skipped this, because I didn't understand what was going on. >> I can explain the in-context learning one with -- and I can -- with the parallels of what happens in RWKV-7 and Titans, based on what I understood from the paper.

So, think of it as this way. >> Which figure? >> This whole segment, the long-term memory training and the surprise formula. It's actually a lot easier if you explain it using simplified -- >> Do you mind saying which? So, like, we read while you explain. No, this is great, like, having an expert explain it, but which figure is it or which formula?

>> So, you scroll up. I'm trying to, like, figure out the page numbers as well. So, we are talking about, very specifically, section 3.1 and the surprise metric there, downwards, that whole memory architecture. So, one way to view it, right, is that -- let's just say a standard problem.

Let's say the quick brown fox, correct? And this is -- this is a piece of text that exists in so many training corpora that all LLMs will probably memorize the phrase itself, the quick brown fox. There is no surprise there. So, because the surprise score is zero, there is no -- there's no backpropagation required, per se, to update the memories.

If you view the memories as, let's just say, a 4,000 -- I don't know what the dimensions are here, because they never disclose them, but let's just say a 4096 by 4096 matrix of floating point values; in our case, it's BF16. If, let's say, we said the quick brown deer, then the model is like, hey, I wasn't expecting deer, I'm expecting fox.

There's a difference there. That difference there -- I'm oversimplifying the math, because this is not accurate -- can be converted to a loss score that you can backpropagate on. So, it's when the model sees differences that you update this memory, this memory that's being shared between tokens.

Now, where it differs between the Google paper, Titans, and RWKV, which we are testing as well, is that the -- >> Eugene, so you have a separate key value store that you attend to at the same time as you attend to the current token. So, that's your budget, right? And you attend to the whole thing, okay?

And does that help or doesn't it help? And when it's surprising, you need to update. So I wonder how they send the updates to the key value store. Like, what actually happens? Like, what's the loss between the key value store and the current token? Go ahead. >> So if you want to view it as simplified code, which is not how they implement it -- you can view your model weights during the forward pass, right, as frozen, just view them as frozen.

And the -- so, the quick brown, let's say the quick brown, those tokens, right, generate a state. You take that state, and then let's say instead of -- let's say we see fox, and then we say deer instead, right? There's a difference in expectation. The model expected fox, it got deer instead.

So because the model weights are frozen, if you do the forward pass and then you want to correct the model's thinking and you do the backwards pass, the only way to update it is this state value. So you do the backwards pass, you update the state values, then you take that state and you go -- you process the next token.

So you continue your sentence completion. And that essentially is what the surprise mechanic is about. It's about, hey, it didn't give the output we required, and then we take that difference and then we convert it into a score. Where this differs from RWKV is that we don't use a surprise mechanism.

We are currently using something closer to standard gradient descent. So the difference here is that in a surprise mechanism, if, let's say, you expected fox and you get fox, for example, there's no -- the loss is essentially zero. There's no backprop. But in RWKV's case, right, if, let's say, it expected fox and it actually got fox, since the way logits work, there's always a zero point -- let's just say a zero point something percent difference, there's still a loss score being calculated there.

So even though it was not a surprise, we still do the backpropagation process. In practice, is this better or worse? I have absolutely no idea. This is something we need to test and evaluate on. But that's the key difference in how we handle the memory segment. It's all about, like, every token you forward, you backprop and then you update the weights.

This is -- >> You just -- instead, you propagate the signal through the frozen weights and just update the keys and values. One question that I got was how do they manage a fixed-size key value store? Basically, how do they decide what to drop and stuff? And they have this -- I didn't understand their gating here.

But basically, yeah, over time, you'll see lots and lots and lots of things. So you kind of need to figure out, like, if you're efficiently using your memory, then you solve the problem, basically. >> So this goes to the AI black box. We fix the key value store to a specific size.

That's part of the architecture design. This is the same thing as the model dimensions as to how the model decides what to keep and drop, right? That is specifically decided by the model. So I think one -- there's another segment where it highlights the decay, right? Let me find that segment.

>> Could you scroll a bit down? >> Sorry if I'm not looking at the screen. I'm looking at the paper to just find the decay. I think it was mentioned -- >> Yeah, it's a forgetting mechanism. Sorry. It's equation, yeah, 13. >> Yeah, so the idea behind the decay mechanism, right, and this part is consistent with other theories, is that by default, every token you move forward, you are slowly forgetting.

And the forget rate is something that's trained in the model. So let's just say the value is by default, everything you will forget in 32K tokens. So if you forward 32K, you should forget it. This makes it sound similar to sliding window attention in that sense, but the decay mechanism is supposed to work together with the -- with, let's say, the surprise mechanism or basically the model's -- this is a bit more gray, but basically as the -- by default, you decay.

You let the model compute against the state itself. And the model may just decide, hey, this is important to memorize, so I'm going to reinforce that number. So as every step it takes, right, in a way, internally the model just, like, keeps backpropagating against the state and think, hey, do I need to reinforce this floating point value?

If I stop reinforcing, it will eventually decay, but if I want to reinforce it, I can just, like, keep increasing the value and then keeping it within bounds. And that's how it keeps -- kind of, like, by default, it will slowly forget, but if it thinks it's important, it will keep trying to remember it over larger context time.

To be clear, this is -- this part is really theorycrafting, because even for RWKV, the highest we ever push a model to is 32K, and we are now experimenting at 64K. This theorycrafting is supposed to, like, extend to, like, 1 million tokens per se. So it's something that we definitely need to test.

But the idea behind decay is so that by default, things will expire, and so the model is able to clear up space to memorize new things, and it's able to also decide for itself to keep things in memory. Not much different from how they describe it in Titans. Yeah.
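As a toy illustration of the decay-plus-reinforcement dynamic Eugene describes (my own simplification, not the paper's math): a memory value fades a little on every token unless a surprise signal keeps boosting it.

```python
decay = 0.99                                   # assumed per-token forget rate
state = 1.0
surprises = [0.0] * 50 + [0.5] + [0.0] * 50    # one surprising token mid-stream
trace = []
for s in surprises:
    state = decay * state + s                  # forget by default, reinforce on surprise
    trace.append(state)
```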

>> Thanks a lot. This helped me quite a bit. >> Eugene, I have a question. So when you're referring to by default it decays, so in terms of the surprise here, can we assume that the surprise by default is low for most of the tokens? >> I think view it as -- let's just say -- let's just view it as a RAG scenario.

A RAG scenario. Let's just say your company is the most generic company on Earth, and then I just put your company document there. There is no surprise. Then it just moves on. But let's just say your company has some very proprietary, never-heard-before stuff. That surprise will then be what is stored.

So it's about the difference in information in the fixed model weights compared to this floating-point state. Does that make sense? >> Is there another -- I'm trying to read the questions from the chat. >> Yeah. There's another one from Cosmin earlier about whether needle-in-the-haystack tests are interesting tasks.

I mean, I think they could be in real use cases. There's probably a lot where they're not. But depending on your use case, I could see how if you just want to throw a bunch of tokens into the model and not worry about the order or anything, that it would be useful to have.

>> I think it's interesting also academically to just understand the maximum limit the model can memorize in the worst-case scenario. That's the way I view needle in the haystack. In practical scenarios, that was one of the challenges about RWKV benchmarking this as well, and the same thing for Titans, is that if, let's say, we train on the Harry Potter books and you just put the whole Harry Potter book as the RAG context, there's no surprise there.

And essentially, it can pass the RAG test with, what, 300k context length. But that's not a correct test, per se. So needle in the haystack is meant to represent the worst-case scenario. That's how I feel. And a lot of companies are a lot more generic than they think they are.

Also, I think at NeurIPS, in the joint presentation that we did with Dan Fu, both he and I agreed that with other techniques, such as repeating the context twice, which works well for both Mamba and RWKV, we may want to re-evaluate how we benchmark these things, per se, when it comes to practical RAG situations.

Because if, let's say, the inference cost for Mamba and RWKV is 100,000x cheaper, and its RAG performance triples or quadruples just by repeating the context twice, then we just repeat the context twice. That was one of the arguments. And I kind of agree, but it's also a very different test.

I see there's another question, Eugene, that you'd be good at answering: what is the gating mechanism in modern RNNs? What do you mean by modern RNNs? I don't know. Aditya, do you want to clarify? Yeah, this sentence is in the paper right in front of us. It says Dao and Gu, 2024, or Orvieto et al., 2023.

It's a comparison with the weight decay mechanism. So I just have no idea what RNNs normally do. That's relatable. Oh, okay. Later in the section, we show that the weight decay mechanism is closely related to the gating mechanism in RNNs, Dao and Gu, citing, I presume, the Mamba authors?

Yeah, okay, the state space model, yeah. It's, I think, the state space model. This would be similar to the weight decay that I explained earlier, which is basically, by default, it's for gating. My understanding of Mamba is that they run these RNNs without gates, so that they can run something like Fast Fourier Transform or some parallel scan on GPUs and run it fast.

And then they have some multiplicative process on each token. And when you do multiplications, basically, you can use them as gates for how much of the signal you propagate. So they add the nonlinearity and gating at the token level, while old school LSTMs had a memory gate, forget gate, and input gate at each step, and they were running it one token at a time.

So it might be, that might be one thing, like these token-level multiplicative things that people have in state space models. Yeah, if it's about that, then it's really more about how we restructure. So for both Mamba and RWKV and Titans, with the way we restructure the formulation, we do not need to wait for one token after another, unlike the old LSTM.

So yeah, I think that makes sense. Yeah, actually I think your explanation makes more sense. Basically, if you contrast it with the old LSTM RNNs, all the newer gates, even though they are designed differently, are designed in a way that doesn't have this bottleneck, where you need to wait for one token after another.

Whether it's through math hacks, which is what the state space models did -- which is really impressive, honestly, that kind of math, but that's what they're good at -- in our case, it's really more of, like, how we optimize things in the CUDA forward pass. It achieves the same result.

It's able to train in parallel, unlike old RNNs. And I'm quite sure Google has their own optimizations when it comes to training as well. I have another question, Eugene, for you. I don't know how to compare baselines with this model. Are you impressed with a one-point perplexity win on wiki, or the other wins?

Or is it kind of that any new model usually shows that kind of win? So to me, the wins seem large. So I think, is this something super strong, or is it OK, not great? To be honest, I classify this as promising, but we need to test further, because even in our experience with RWKV, anything below 1.5B may not necessarily hold at the 7B scale.

So we had reverted experimental changes that we made on 0.5B models, which is kind of what's being tested here -- here it's 340 to 760 million parameters -- where the perplexity loss was great, it dropped much lower. And then when we scaled it to 1.5B, it sucked. So it's promising, but I want to test to find out more, because I think the true test is testing on 1.5B and then 7B, which, to be honest, I'm quite sure the Google folks have done.

They are not compute bound. It takes more effort for them to write this paper than to train that 1.5B and 7B model. Yeah, I'm a bit skeptical, because there's some gossip that Googlers aren't allowed to publish super impactful stuff. So it's interesting. Yeah. I also would like to know what was your results for the larger models?

Yeah, I mean, I guess that's one interesting thing, is that back in 2017, the Attention Is All You Need days, Google researchers could publish anything they wanted, because there was really no competitive advantage they were giving away. But these days, you wonder, if they have a really big breakthrough, are they going to publish that, or are they just going to keep it to themselves?

I have another question. Sorry, if someone could explain. The memory module is described as a meta-in-context model. In quasi-layman's terms, what would a meta-in-context model mean? Is there, like, a sort of parallel small model running in there, as if it was trained very quickly? So that's the part where I explained that every time the tokens look different from what you expected, it does the backpropagation to the memory modules.

So you can think of it as a -- I'm oversimplifying the math by calling it backpropagation. You can think of it as training the memory modules. It's inefficient to do it that way, because we use matrix multiplication math dedicated for this. But in theory, you could implement it as your standard backprop gradient descent, at least in RWKV's case.

I'd have to double-check for Google's case. But the important thing is it runs at inference time. So it's like it always backpropagates something, which is very different from other models that are trained in a big pre-training run, and then nothing changes. You just run inference. So it's kind of meta-learning in that at inference time, it still does some additional update.

Yeah, the long-term hope and goal, if we can get this process to be stable, and the memory module is, let's just say, a gigabyte in size in memory, is that this is what will represent short- to mid-term memories for an AGI kind of model. The issue for any super long-context training we're talking about at AGI scale is that we don't really have the means to figure out how to train this memory module in a structured, guided way.

And right now, the hope is that if we train it, let's just say, at 4k, 8k, or 32k, it generalizes several times beyond that. And if, let's say, we train at 1 million, it generalizes to 10 million. So if we train it at 10 million, it generalizes to an even longer context length.

The problem with this approach is, even for us right now, and this is something that maybe Titans may have more tests on, is that when we train on 512, it generalizes up to 4k, and then it dies out there. Then, if we train up to 4k, it generalizes up to 16k.

So the generalization doesn't go on ad infinitum, unlike humans, arguably. But then again, maybe that's why we go senile at the age of 100. Maybe that's the reason. That's our context length. Yes. So I see we're at time. I don't know, Swyx, I'll turn the floor back over to you.

Any comments or thoughts about next week or announcements? Ishan is doing DQ2 on his spreadsheet. Is that Ishan? I don't know. He likes spreadsheets. That's all I know. So that will be next week. Okay, well, yeah, I mean. Okay, I think that was a yes. Cool. Yeah, we have run out of our context length for this session.

Thank you, Eric, for a great presentation. Yeah, very topical paper. I've just been chatting with Vibhu, and we're basically kind of thinking about, you know, doing a second paper club and somewhat splitting between timeless papers or timeless survey papers and then hot individual papers -- that's kind of the split that we're thinking about.

And then also maybe doing it at a different time, so it's not, like, during the day, during the workday for most people in the US. So yeah, people are interested in timeless papers and then also hot papers. So I think those are the two spiky things, and maybe we can have a different vibe for those as well.

Yeah, let's discuss in Discord, but otherwise have a wonderful day. Bye.