Back to Index

Stanford CS25: V2 I Neuroscience-Inspired Artificial Intelligence


Transcript

It's fun to be here. So the work I'm presenting today, the title of it is Attention Approximates Sparse Distributed Memory. And this was done in collaboration with Cengiz Pehlevan, and my PhD advisor is Gabriel Kreiman. So why should you care about this work? We show that the heuristic attention operation can be implemented with simple properties of high-dimensional vectors in a biologically plausible fashion.

So the transformer and attention, as you know, are incredibly powerful, but they were heuristically developed. And the softmax operation in attention is particularly important, but also heuristic. And so we show that the intersection of hyperspheres that is used in sparse distributed memory closely approximates the softmax and attention more broadly, both in theory and with some experiments on trained transformers.

So you can see SDM, sparse distributed memory, as preempting attention by approximately 30 years. It was developed back in 1988. And what's exciting about this is that it meets a high bar for biological plausibility. Hopefully I have time to actually get into the wiring of the cerebellum and how you can map each operation to part of the circuit there.

So first I'm going to give an overview of sparse distributed memory. Then I have a transformer attention summary, but I assume that you guys already know all of that. We can get there and then decide how deep we want to go into it. I'll then talk about how attention actually approximates SDM, interpret the transformer more broadly, and then hopefully there's time to go into SDM's biological plausibility.

Also I'm going to keep everything high level visual intuition and then go into the math, but stop me and please ask questions, literally whenever. So sparse distributed memory is motivated by the question of how the brain can read and write memories in order to later retrieve the correct one.

And some considerations that it takes into account are high memory capacity, robustness to query noise, biological plausibility, and some notion of fault tolerance. SDM is different from other associative memory models that you may be familiar with, like Hopfield networks, insomuch as it's sparse. So it operates in a very high dimensional vector space, and the neurons that exist in this space only occupy a very small portion of possible locations.

It's also distributed, so all read and write operations apply to all nearby neurons. As a side note, Hopfield networks, if you're familiar with them, are actually a special case of sparse distributed memory. I'm not going to go deep into that now, but I have a blog post on it.

Okay. So first we're going to look at the write operation for sparse distributed memory. We're in this high dimensional binary vector space. We're using Hamming distance as our metric for now, and we'll move to continuous vectors later. And we have this green pattern, which is represented by the solid dot, and the hollow circles are these hypothetical neurons.

So think of everything quite abstractly, and then we'll map to biology later. So this pattern has a write radius, which is some Hamming distance. It activates all of the neurons within that Hamming distance, and then here I just note that each of those neurons is now storing that green pattern, and the green pattern has disappeared.

So I'm keeping track of this location with this kind of fuzzy hollow circle. That'll be relevant later. So we're writing in another pattern, this orange one. And note here that neurons can store multiple patterns inside of them, and formally this is actually a superposition, or just a summation, of these high dimensional vectors.

Because they're high dimensional, you don't have that much cross talk, so you can get away with it. But for now, you can just think of it as a neuron can store multiple patterns. Finally, we have a third pattern, this blue one. We're writing it in another location, and yeah.

So again, we're keeping track of the original pattern locations, but they can be triangulated from the nearby neurons that are storing them. And so we've written in three patterns, now we want to read from the system. So I have this pink star, the query Xi. It's represented by a given vector, which has a given location in space, and it activates nearby neurons again.

But now the neurons output the patterns that they stored previously. And so you can see that based upon its location, it's getting four blue patterns, two orange and one green. And it then does a majority rule operation, where it updates towards whatever pattern it's seeing the most of. So in this case, because blue is actually a majority, it's just going to update completely towards blue.

Again, I'll formalize this more in a bit, but this is really to give you intuition for the core operations of SDM. So the key thing to relate this back to attention is actually to abstract away the neurons that are operating under the hood, and just consider the circle intersection.

And so what each of these intersections between the pink read circle and each of the write circles means is that intersection is the neurons that both store the pattern that was written in, and are now being read from by the query. And the size of that intersection corresponds to how many copies of that pattern the query is then going to read.

And so formally, we define the weight for each pattern as the cardinality of this circle intersection, that is, the number of neurons that are in both the pattern's write circle and the query's read circle. Okay, are there any questions, like at a high level before I get more into the math? I don't know if I can check, is it easy for me to check Zoom?

Nah, sorry, Zoom people, I'm not going to check. Okay. So the neurons, they're randomly distributed in this space? Yes, yeah, yeah. And there's more recent work where they can learn and update their locations to better tile the manifold. But in this, you can assume that they're randomly initialized binary high dimensional vectors.

Okay, so this is the full SDM update rule. I'm going to break it down. So the first thing that you do, and this is for reading, to be clear, so you've already written patterns into your neurons, is you weight each pattern by the size of its circle intersection.

So that's the circle intersection there for each pattern. Then you sum over all of the patterns that have been written into this space. So you're just doing a weighted summation of them. And then there's this normalization by the total number of intersections that you have. And finally, because at least for now we're working in this binary space, you map back to binary, just seeing if each of the values is greater than a half.
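To make that concrete, here is a minimal NumPy sketch of the binary write and read operations just described. The sizes, the write radius, and all of the variable names are illustrative choices of mine, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_neurons, radius = 64, 10_000, 24      # illustrative sizes, not the paper's settings

# Randomly placed binary neuron addresses (the hollow circles in the figures).
neuron_addr = rng.integers(0, 2, size=(n_neurons, n_dims))
neuron_store = np.zeros((n_neurons, n_dims))    # running sum (superposition) of written patterns
neuron_count = np.zeros(n_neurons)              # how many patterns each neuron has stored

def hamming(a, B):
    """Hamming distance from vector a to every row of B."""
    return (a != B).sum(axis=1)

def write(pattern):
    # Activate every neuron within the write radius and add the pattern into its storage.
    active = hamming(pattern, neuron_addr) <= radius
    neuron_store[active] += pattern
    neuron_count[active] += 1

def read(query):
    # Activated neurons dump their stored superpositions; the majority rule then
    # thresholds each bit at half the total number of pattern copies read out.
    active = hamming(query, neuron_addr) <= radius
    summed = neuron_store[active].sum(axis=0)
    total = neuron_count[active].sum()
    return (summed > total / 2).astype(int)

p = rng.integers(0, 2, size=n_dims)
write(p)
for _ in range(3):                              # a few distractor patterns for crosstalk
    write(rng.integers(0, 2, size=n_dims))

q = p.copy()
q[:5] ^= 1                                      # a noisy query: flip a few bits of the pattern
print((read(q) == p).mean())                    # should be at or near 1.0
```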

Okay, how familiar are people with attention? I looked at like the previous talks you've had, they seem quite high level. Like, can you guys write the attention equation for me? Is that like, can I get thumbs up if you can do that? Yeah, okay, I'm not like, I'll go through this, but I'll probably go through it faster than otherwise.

So when I first made this presentation, like, this was the state of the art for transformers, which was like AlphaFold. And so it's kind of funny, like how far things have come now, I don't need to tell you that transformers are important. So yeah, I'm going to work with this example.

Well, okay, I'm going to work with this example here, the cat sat on the blank. And so we're in this setting, we're predicting the next token, which hypothetically is the word mat. And so there are kind of four things that the attention operation is doing. The first one up here is it's generating what are called keys, values and queries.

And again, I'll get into the math in a second, I'm just trying to keep it high level first. And then we're going to compare our query with each of the keys. So the word the, which is closest to the word we're next predicting, is our query, and we're seeing how similar it is to each of the key vectors.

We then, based upon that similarity, do this softmax normalization, so that all of the attention weights sum to one, and then we sum together their value vectors to propagate to the next layer or use as our prediction. And so at a high level, you can think of this as like the query word the is looking for nouns and their associated verbs.

And so hypothetically, it has a high similarity with words like cat and sat, or their keys. So this then gives large weight to the cat and sat value vectors, which get moved to the next part of the network. And the cat value vector hypothetically, contains a superposition of other animals like mice, and maybe words that rhyme with mat.

And the sat vector also contains things that are sat on, including mat. And so what you actually get from the value vectors of paying attention to cat and sat is like three times mat plus one times mouse plus one times sofa. This is again, like a totally hypothetical example, but I'm trying to make the point that you can extract from your value vectors things useful for predicting the next token by paying attention to specific keys.

And I guess, yeah, another thing here is like what you pay attention to, so cat and sat, might be different from what you're actually extracting. You're paying attention to your keys, but you're getting your value vectors out. Okay, so here is the full attention equation. The top line, I'm separating out the projection matrices, W subscript V, K, and Q, and in the second one, I've just collapsed them into the key, query, and value vectors. And yeah, so breaking this apart.

The first step here is we compare, we do a dot product between our query vector and our keys, to get a notion of similarity. We then apply the softmax operation, which is an exponential over a sum of exponentials.

The way to think of the softmax is it just makes large values larger, and this will be important for the relations here, so I'll spend a minute on it. At the top here, I have like some hypothetical items indexed from zero to nine, and then the values for each of those items.

In the second row, I just do like a normal normalization of them, and so the top item goes to about a 30% weight, but if I instead do a softmax with a beta coefficient of one, that value becomes about 0.6. So it just makes your distribution peakier, is kind of one way of thinking of it, and this is useful for attention because you only want to pay attention to the most important things, or the things that are nearby, and kind of ignore stuff further away.
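As a quick illustration of that peakiness point, here is a toy comparison. The values are made up for illustration; they just happen to land near the roughly 0.3 and 0.6 figures mentioned above.

```python
import numpy as np

vals = np.array([4.0, 2.5, 2.0, 1.5, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5])  # ten hypothetical items

linear = vals / vals.sum()                    # plain normalization
softmax = np.exp(vals) / np.exp(vals).sum()   # softmax with beta = 1

print(round(linear[0], 2))    # ~0.29: the top item gets a modest share
print(round(softmax[0], 2))   # ~0.60: the softmax concentrates weight on the top item
```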

And so once we've applied our softmax, we then just do a weighted summation of our value vectors, which actually get extracted and propagate to the next layer. Okay, so here's the full equation, I went through that a little bit quickly, I'm happy to answer questions on it, but I think half of you know it, half of you don't.
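Putting those steps together, a bare-bones single-query, single-head version of the operation looks something like this. The dimensions are arbitrary and the default 1/sqrt(d) scaling is just the usual convention standing in for the beta coefficient discussed later.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                       # subtract the max for numerical stability
    return np.exp(x) / np.exp(x).sum()

def attention(query, keys, values, beta=None):
    """Single query, single head: softmax(beta * q.K^T) V."""
    d = query.shape[-1]
    beta = beta if beta is not None else 1.0 / np.sqrt(d)   # usual 1/sqrt(d_k) scaling
    scores = keys @ query                 # dot-product similarity with each key
    weights = softmax(beta * scores)      # attention weights, summing to one
    return weights @ values               # weighted sum of the value vectors

rng = np.random.default_rng(0)
T, d = 5, 64                              # e.g. the five tokens "the cat sat on the"
keys, values = rng.normal(size=(T, d)), rng.normal(size=(T, d))
query = rng.normal(size=d)
out = attention(query, keys, values)      # the vector passed on to the next layer
```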

Okay, so how does transformer attention approximate sparse distributed memory, this 30-year-old thing that I've said is biologically plausible? So are we supposed to accept that SDM is biologically plausible? So I'm going to get to that at the end, yeah. Attention is also old, in the sense that attention goes back before this, it's not like it was invented five years ago.

I think the attention equation I'm showing here was developed, I mean Attention Is All You Need was the highlight, but Bengio has a paper from 2015 where it was actually first written in this way, correct me if I'm wrong, but I'm pretty sure. Yeah, I mean I guess like this particular one, that's why I was asking the question, because like… No, it's a good question.

It's like, you show that two different methods that could both be classified as attention proposals are roughly the same, but then you still have to show that one of them is biologically plausible. Yes, exactly, so I'll show that SDM has really nice mappings to a circuit in the cerebellum at the neuronal level, and then right now it's this link to attention, and I guess you make a good point that there are other attention mechanisms.

This is the one that has been dominant, but I don't think that's just a coincidence, like there's been a bunch of… Computing your softmax is expensive, and there's been a bunch of work like the Linformer, etc., etc., that tries to get rid of the softmax operation, and it just does really badly.

Like there's a bunch of jokes on Twitter now that it's like a black hole for people that like try and get rid of Softmax and you can't, and so it seems like this, and like other versions of it, transformers just don't scale as well in the same way, and so there's something important about this particular attention equation.

But like that goes the other way, right, which is like if this is really important, then like SDM is like actually like this. So the thing that I think is important is that you have this exponential weighting, where you're really paying attention to the things that matter, and you're ignoring everything else, and that is what SDM approximates.

There might be better equations, but the point I was just trying to make there is like the softmax does seem to be important, and this equation does seem to be very successful, and we haven't come up with better formulations for it. Yeah, no, that's a great question. Yeah, so it turns out that in sparse distributed memory, as you move your query and your pattern away from each other, so you pull these read and write circles apart, the number of neurons in that intersection, in a sufficiently high dimensional space, decays approximately exponentially. So on this right plot here, the x-axis is me pulling apart the blue and the pink circles, and the y-axis, on a log scale, is the number of neurons in the intersection. To the extent that this is a linear plot on a log scale, it's exponential. And this is for a particular setting where I have 64-dimensional vectors, which is the per-head dimension used in GPT-2, and it holds across a lot of different settings, particularly higher dimensions, which are now used for bigger transformers.
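If you want to sanity-check that near-exponential decay yourself, a quick Monte Carlo estimate of the circle intersection does the job. The dimension, neuron count, and radius below are arbitrary illustrative choices of mine, not the exact settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_neurons, radius = 64, 200_000, 22   # illustrative settings

neurons = rng.integers(0, 2, size=(n_neurons, n_dims), dtype=np.int8)
pattern = rng.integers(0, 2, size=n_dims, dtype=np.int8)

def intersection_size(dist):
    # Build a query exactly `dist` bit flips away from the pattern, then count the
    # neurons that lie within `radius` of both (the circle intersection).
    query = pattern.copy()
    query[:dist] ^= 1
    in_p = (neurons != pattern).sum(axis=1) <= radius
    in_q = (neurons != query).sum(axis=1) <= radius
    return int((in_p & in_q).sum())

dists = np.arange(0, 13, 2)
sizes = np.array([intersection_size(d) for d in dists])

# On a log scale the decay is close to linear, i.e. roughly exponential; a
# log-linear fit gives an effective decay constant (the role beta plays later).
slope, intercept = np.polyfit(dists, np.log(sizes + 1), 1)  # +1 guards against zero counts
print(sizes)
print(slope)   # negative: the intersection shrinks roughly exponentially with distance
```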

Okay, so I have this shorthand for the circle intersection equation, and what I'll show is how the circle intersection is approximately exponential, so we can write it with two constants, c1 and c2, with c1 sitting outside the exponential. Because the softmax normalizes an exponential over a sum of exponentials, c1 cancels; the thing that matters is c2, and you can approximate that nicely with the beta coefficient that's used in the softmax.

And so, yeah, I guess as well, I'll focus first on the binary original version of SDM, but then we also develop a continuous version. Okay, so yeah, the two things that you need for the circle intersection and the exponential decay to map to attention are, first, some notion of continuous space, and so you can use this equation here to map Hamming distances to discretized cosine similarity values, where the hats over the vectors are L2 normalizations. You can then write the circle intersection equation on the left as this exponential with these two constants that you need to fit, and then, converting c2, you can rewrite this as a beta coefficient.
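Spelled out, the mapping being described looks roughly like the following. This is a sketch assuming bipolar plus/minus-one vectors of dimension n, with hats denoting L2 normalization, O(·) the set of neurons inside a vector's read or write circle, and c1, c2 the fitted constants:

$$
d_{\text{Hamming}}(x, y) \;=\; \frac{n}{2}\bigl(1 - \hat{x}^\top \hat{y}\bigr),
\qquad
\bigl|O(x) \cap O(y)\bigr| \;\approx\; c_1 \exp\!\bigl(c_2\, \hat{x}^\top \hat{y}\bigr),
$$

so under the softmax normalization the constant $c_1$ cancels, and $c_2$ plays the role of attention's $\beta$.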

Let me get to some plots, yeah, so you need the correct beta coefficient, but you can fit this with a log-linear regression in closed form. I want to show a plot here, yeah, okay, so in the blue is our circle intersection for two different Hamming distances, both using 64-dimensional vectors, and the orange is our actual softmax attention operation, where we fit the beta coefficient so that the effective Hamming distance used by attention is equivalent to the Hamming distance used by SDM. The main plot is the normalized weights, so they sum to one, and then I have log plots here, and you can see that in non-log space the curves agree quite nicely.

You can see that for the larger Hamming distance, in the log plot you see this drop-off here, where the circle intersection stops being exponential. But it turns out this actually isn't a problem, because at the point where the exponential breaks down, you're at approximately 0.20 here, and you're basically paying negligible attention to any of those points, and so in the regime where the exponential really matters, this approximation holds true. Yeah, yeah, no, I just wanted to actually show a figure to get some intuition first. So all we're doing here is we're in a binary space with the original SDM, we're just using this mapping to cosine similarity, and then what you need to do is just have the beta coefficient fit. You can view your beta coefficient in attention as determining how peaky things are, and this relates directly to the Hamming distance of the circles that you're using for the read and write operations.

And so yeah, to mathematically show this now, on this slide I'm not using any tricks, I'm just rewriting attention using the SDM notation of patterns and queries, so this little box down here is doing that mapping. And this is the money slide where we're updating our query: on the left we have our attention equation written in SDM notation, we expand our softmax, and then the main statement is that this is closely approximated if we swap out the exponential with the SDM circle intersection.
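In symbols, the approximation on that slide is roughly the following. This is my loose paraphrase of the paper's notation, with $p^{a}_{\mu}$ the key-like address patterns, $p^{v}_{\mu}$ the value-like patterns, $\hat{\xi}$ the L2-normalized query, and $O(\cdot)$ the set of neurons inside a circle:

$$
\sum_{\mu} p^{v}_{\mu}\,
\frac{\exp\!\bigl(\beta\, \hat{p}^{a\,\top}_{\mu} \hat{\xi}\bigr)}
     {\sum_{\nu} \exp\!\bigl(\beta\, \hat{p}^{a\,\top}_{\nu} \hat{\xi}\bigr)}
\;\approx\;
\sum_{\mu} p^{v}_{\mu}\,
\frac{\bigl|O(p^{a}_{\mu}) \cap O(\xi)\bigr|}
     {\sum_{\nu} \bigl|O(p^{a}_{\nu}) \cap O(\xi)\bigr|},
$$

with the softmax attention weights on the left and SDM's normalized circle intersections on the right.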

So again, the two things that you need for this to work are, one, your attention vectors, your keys and queries, need to be L2 normalized, so I have hats on them, and then, if you decide on a given Hamming distance for SDM, and I'll get into what Hamming distances are good for different things, you need to have a beta coefficient that relates to it.

But again, that's just how many things are you trying to pay attention to. So yeah, just as a quick side note, you can write SDM using continuous vectors and then not need this mapping to understand similarity. And so here I have the plots again, but with this, and I've added the - the orange and the green have split, but I've added the continuous approximation here too.

And what's nice about the continuous version is you can actually then write sparse distributed memory as a multilayered perceptron with slightly different assumptions, and I'm not going to talk about that now, but this is featured in Sparse Distributed Memory as a Continual Learner, which was added to the additional readings, and it'll be in - sorry, this shouldn't say ICML, this should say ICLR.

It's just been accepted to ICLR for this year. Okay, so do trained transformers use these beta coefficients that I've said are similar to those for SDM? And so it shouldn't be surprising that depending on the Hamming distance you set, SDM is better for certain things. For example, maybe you just want to store as many memories as possible and you're assuming that your queries aren't noisy, or you're assuming your queries are really noisy, so you can't store as much, but you can retrieve from a longer distance.

And if attention of the transformer is implementing sparse distributed memory, we should expect to see that the beta coefficients that the transformer uses correspond to these good instances of SDM. And so we have some weak evidence that that's the case. So this is the key query normalized variant of attention, where you actually learn your beta coefficient.

Normally in transformers, you don't learn it, but you also don't L2 norm your vectors, and so you kind of have this effective beta coefficient. So in this case, it's a cleaner instance where we're actually learning beta. And this one's trained on a number of different translation tasks. We take the learned beta coefficients across layers and across tasks, and plot them in a histogram.

And the red dotted lines correspond to three different variants of sparse distributed memory that are optimal for different things. And again, this is weak evidence insomuch as, to derive the optimal SDM beta coefficients, or corresponding Hamming distances, we need to assume random patterns in this high-dimensional space, and obviously real-world data isn't random.

However, it is nice to see, one, all the beta coefficients fall within the bounds, and two, they skew towards the max query noise variant, which makes more sense if you're dealing with complicated real-world data, where the next data points you see might be out of distribution based on what you've seen in the past.

The max memory capacity variant assumes no query noise at all. And so it's like, how many things can I pack in, assuming that the questions I'm asking the system are perfectly formed? OK. Just talking a little bit about transformer components more broadly. So I've mentioned that you can write the feed-forward layer as a version of SDM that has a notion of longer-term memory.

There's also layer norm, which is crucial in transformers. And it's not quite the same, but it can be related to the L2 normalization that's required by SDM. There's also the key query normalization variant that explicitly does this L2 normalization. And it does get slightly better performance, at least on the small tests that they did.

I don't know if this would scale to larger models. And so I guess this work is interesting in so much as the biological plausibility, which I'm about to get to, and then the links to transformers. It hasn't to date improved transformer architectures. But that doesn't mean that this lens couldn't be used or be useful in some way.

So yeah, I list a few other things that SDM is related to that could be worth looking into. So in the new work where SDM is a continual learner, we expand the cerebellar circuit, look at components of it, particularly inhibitory interneurons, implement those in a deep learning model, and it then becomes much better at continual learning.

So that was a fun way of actually using this link to get better bottom-line performance. So a summary of this section is basically just the intersection between two hyperspheres approximates an exponential, and this allows SDM's read and write operations to approximate attention both in theory and our limited tests.

And so kind of like big picture research questions that could come out of this is, first, is the transformer so successful because it's performing some key cognitive operation? The cerebellum is a very old brain region used by most organisms, including fruit flies, maybe even cephalopods, through divergent but now convergent evolution.

And then given that the transformer has been so successful empirically, is SDM actually the correct theory for cerebellar function? And that's still an open question. As we learn more and more about the cerebellum, there's nothing that yet disproves SDM as working there. And I think it's-- I'll go out on a limb and say it's one of the more compelling theories for how the cerebellum is actually working.

And so I think this work kind of motivates looking at more of these questions-- both of these questions more seriously. OK. Do we have time? Cool. So here's the circuit that implements SDM. At the bottom, we have patterns coming in for either reading or writing. And they're going to-- actually, I break down each of these slides.

Yeah. Yeah. So first, we have patterns that come in. And every neuron here, these are the dendrites of each neuron. And they're deciding whether or not they're going to fire for the input that comes in. Then if the neuron does fire, and you're writing in that pattern, then you simultaneously-- and I'm going to explain this, because you might hear it and think this is crazy, the brain doesn't do this, and then I'm going to hopefully break that down.

You not only need to have the thing, the pattern, that activates neurons, but you need to have a separate line that tells the neuron what to store. And just like you have this difference between keys and values, where they can be different vectors representing different things, here you can have a key that comes in and tells the neuron whether to activate, and the value for what it should actually store and then output later on.

This is called a hetero-associative mapping. And then once you're reading from the system, you also have your query come in here, activate neurons, and those neurons then output whatever they store. And the neuron's storage is this particular column here. And as a reminder, it's storing patterns in superposition.

And then it will dump whatever it's stored across these output lines. And then you have this g majority bit operation that converts to a 0 or 1, deciding if the neuron's going to fire or not. And so here is the same circuit, but where I overlay the cell types in the cerebellum.

And so I'll come back to this slide, because most people probably aren't familiar with cerebellar circuitry. Let me just get some water. OK. So the cerebellum is pretty homogeneous in that it follows the same pattern throughout. Also, fun fact, 70% of all neurons in the brain are in the cerebellum.

They're small, so you wouldn't know it. But the cerebellum is very underappreciated, and there's a bunch of evidence that it has closed-loop systems with most higher-order processing regions. If your cerebellum's damaged, you are more likely to have autism, et cetera, et cetera. So it does a lot more than just fine motor coordination, which is what a lot of people have assumed in the past.

OK. So inputs come in through the mossy fibers here. They interface with granule cells. This is a major up-projection, where you have tons and tons of granule cells. Each granule cell has what are called parallel fibers, which are these incredibly long and thin axons that branch out in this T structure.

These then hit the Purkinje cells, which will receive up to 100,000 parallel fiber inputs. It's the highest connectivity of any neuron in the brain. And then the Purkinje cell will decide whether or not to fire and send its output downwards here. So that's the whole system where patterns come in, the neurons decide whether they fire or not, and they then send their outputs onward.

You then have a separate write line, which is the climbing fibers here. So the climbing fibers come up, and they're pretty amazing. These connections here you can kind of ignore, they're not very strong. But the one that really matters goes up and wraps around individual Purkinje cells.

And the mapping is close to one-to-one between climbing fibers and Purkinje cells, and it gives a very strong action potential. And so-- And they're connected to what, here? In these-- Yeah, as in the stuff off this line. Yeah, right. It's two lines. They're connected to one. Oh, so they're separate neurons coming from-- it's separate.

There it is. Purkinje cells here go into the cerebellar nuclei, kind of in the core of the cerebellum. And that then feeds into thalamus, like back to higher-order brain regions, or like down the muscle movement, et cetera. A lot of people think of the cerebellum as kind of like a fine-tuning look-up table, where, like, you've already decided the muscle movement you want to do, but the cerebellum will then, like, do a bunch of different things.

So it's, like, much more accurate. But it seems like this also applies to, like, next-word prediction. Like, we have fMRI data for this. A neuroscientist once said to me that, like, a dirty little secret of fMRI is that the cerebellum lights up for everything. So OK. Going back to this circuit here, then, yeah?

What timescales are these operating at? I mean, how long is the information stored and retrieved? Do we have any idea about this? Like, is this, like, a couple of milliseconds, or, like, is this information more persistent? So the main theory is that you have updating through spike-timing-dependent plasticity, where your climbing fiber, which carries what you want to write in, will fire either just before or just after your granule cells fire, and that then updates the Purkinje cell synapses with long-term depression or potentiation.

So whatever timescale that's happening on. The climbing fiber makes very large action potentials, or at least a very large action potential in the Purkinje cell. And so I do think you could get pretty fast synaptic updates. And they're also persistent for a long time? I think so, yeah. The synapses can stay that way for, like, the rest of your life.

So what's really unique about this circuit is the fact that you have these two orthogonal lines, where you have the mossy fibers bringing information in to decide if the neuron's going to fire or not, but then the totally separate climbing fiber lines that can update what specific neurons are storing and will later output.

And then the Purkinje cell is so important, it's kind of doing this pooling across every single neuron. And each neuron, remember, is storing its vector this way. And so the Purkinje cell is doing element-wise summation and then deciding whether it fires or not. And this allows you to store your vectors in superposition and then later denoise them.

The theory of SDM maps quite well to the Marr and Albus theories of cerebellar function, which are still quite dominant, if anyone's familiar or wants to talk about this. Yeah, so in the analogy, the neurons in SDM that you introduced, is each one a Purkinje cell, or-- Each neuron is a granule cell.

OK. And then, yeah. So the location of the neuron, those hollow circles, corresponds to the granule cell dendrites here, where the patterns that come in correspond to the activations of the mossy fibers. And then the efferent post-synaptic connections are with the Purkinje cell. So what it's storing is actually in the synaptic connections with the Purkinje cells at that interface.

And then the Purkinje cell does the majority bit operation in deciding whether or not to fire. Yeah, I think we're basically into question time. So yeah, thanks a lot. I have a question. I don't know anything about SDM, but it seems, as I understood it, it's very good for long-term memory. And I am curious, what's your hypothesis of what we should be doing for short-term memory?

Because it seems that-- so if you have this link of transformers having long-term memory, what's good for short-term memory? Because for me, it seems like we are doing this in the prompt context right now. But how could we incorporate these into the architecture? Yeah, yeah. So this work actually focuses more on the short-term memory, where it relates to the attention operation.

But you can rewrite SDM. It's almost more natural to interpret it as a multilayer perceptron that does a softmax activation across its-- or a top-k activation across its neurons. It's a little bit more complicated than that. So yeah, the most interesting thing here is the fact that I just have a bunch of neurons.

And in activating nearby neurons in this high-dimensional space, you get this exponential weighting, which is the softmax. And then because it's an associative memory, where you have keys and values, it is attention. And yeah, I guess the thing I most want to drive home from this is it's actually surprisingly easy for the brain to implement the attention operation, the attention equation, just using high-dimensional vectors and activating nearby neurons.

So it's good for short-term memory? Yes. So if you were to actually use SDM for attention-- yeah, so let me go all the way back real quick. This is important. There are kind of two ways of viewing SDM. And I don't think you were here for the talk. I think I saw you come in a bit later, which is totally fine.

But-- I was listening, but maybe not closely enough to remember. Oh, cool, cool, cool. Yeah, yeah, yeah. OK, so there are two ways of looking at SDM. There's the neuron perspective, which is this one here. And this is actually what's going on in the brain, of course. And so the only thing that is actually constant is the neuron; the patterns are ephemeral.

And then there's the pattern-based perspective, which is actually what attention is doing. And so here, you're abstracting away the neurons, or assuming they're operating under the hood. But what you're actually computing is the distance between the true location of your pattern and the query. And there are pros and cons to both of these.

The pro to this is you get much higher fidelity distances, if you know exactly how far the query is from the original patterns. And that's really important when you're deciding what to update towards. You really want to know what is closest and what is further away, and be able to apply the exponential weighting correctly.

The problem is you need to store all of your pattern information in memory. And so this is why transformers have limited context windows. The other perspective is this long-term memory one, where you forget about the patterns. And you just look at where-- you just have your neurons that store a bunch of patterns in them in this noisy superposition.

And so you can't really recover exact distances-- you kind of can, but it's all much noisier. But you can store tons of patterns, and you're not constrained by a context window. Or you can think of an MLP layer as storing the entire data set in a noisy superposition of states. Yeah, hopefully that kind of answers your question.
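A rough sketch of that neuron-based, long-term-memory view as a feed-forward layer is below. This is my paraphrase of the idea, not the exact model from the continual-learning paper: the first weight matrix holds the neuron addresses, a top-k (or softmax) activation picks the nearby neurons, and the second weight matrix holds what they have stored in superposition.

```python
import numpy as np

def sdm_ffn(query, addresses, contents, k=32):
    """Neuron view of SDM as a two-layer feed-forward block (illustrative only).

    addresses: (n_neurons, d), where each neuron 'lives' in the space
    contents:  (n_neurons, d), superposition of patterns each neuron has stored
    """
    sims = addresses @ query                    # how close the query is to each neuron
    active = np.argsort(sims)[-k:]              # top-k activation = neurons inside the circle
    weights = np.zeros(len(sims))
    weights[active] = 1.0                       # could also be a softmax over sims
    return weights @ contents / weights.sum()   # read out the stored superpositions

rng = np.random.default_rng(0)
n_neurons, d = 10_000, 64
addresses = rng.normal(size=(n_neurons, d))
contents = rng.normal(size=(n_neurons, d))      # stands in for whatever was written in
print(sdm_ffn(rng.normal(size=d), addresses, contents).shape)   # (64,)
```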

I think there was one here first, and then-- yeah? So I guess my question is-- so I guess you kind of have shown that the modern self-attention mechanism maps onto this SDM mechanism that seems plausible and might seem like the modern contemporary theories of how the brain could implement SDM.

And I guess my question is, to what degree has that been experimentally verified versus-- you were mentioning earlier that it might actually be easier to have done this using an MLP layer in some sense than mapping that onto these mechanisms. And so how do experimentalists actually distinguish between competing hypotheses?

For instance, one thing that I wasn't entirely clear about is even if the brain could do attention or SDM, that doesn't actually mean it would, because maybe it can't do backprop. So how does this get actually tested? Totally. Yeah, yeah, yeah. So on the backprop point, you wouldn't have to do it here because you have the climbing fibers that can directly give training signals to what the neurons can store.

So in this case, it's like a supervised learning task where the climbing fiber knows what it wants to write in or how it should be updated in the Purkinje cell synapses. But for your broader point, you basically need to test this. You need to be able to do real-time learning.

The Drosophila mushroom body is basically identical to the cerebellum. And the hemibrain data set has mapped most of the individual neuron connectivity there. So what you would really want to do is in vitro, real-time, super, super high frames per second calcium imaging, and be able to see how synapses change over time.

And so for an associative learning task, like hear a sound move left, hear another sound move right, or smells, or whatever, present one of those, trace, like figure out the small subset of neurons that fire, which we know is a small subset, so that already fits with the sparse activation story.

See how the synapses here update and how the outputs of it correspond to changes in motor action. And then extinguish that memory, so write in a new one, and then watch it go away again. And our cameras are getting fast enough, and our calcium and voltage indicators are getting to be really good.

So hopefully in the next three to five years, we can do some of those tests. But I think that would be very definitive. - Do we have any other questions? - I think there was one more, and then I should hand over to Will. - In terms of how you map the neuron-based SDM onto the biological implementation, what is the range of the circle that you're mapping around?

Is that like the multi-headedness, or can you do that kind of thing? I'm just trying to understand how that must be. - Yeah, so I wouldn't get confused with multi-headedness, because that's different attention heads all doing their own attention operation. It's funny, though, the cerebellum has microzones, which you can think of as like separate attention heads in a way.

I don't want to take that analogy too far, but it is somewhat interesting. So the way you relate this is, in attention, you have your beta coefficient. That is an effective beta coefficient, because the vector norms of your keys and queries aren't constrained. That corresponds to a Hamming distance, and here that corresponds to the number of neurons that are on for any given input.

And the Hamming distance you want, I had that slide before, the Hamming distance you want depends upon what you're actually trying to do. And if you're not trying to store that many memories, for example, you're going to have a higher Hamming distance, because you can get a higher fidelity calculation for the number of neurons in that noisy intersection.

Cool. Yeah, thanks a lot. - Excellent. Let's give our speaker another round of applause. So as a disclaimer, before I introduce our next speaker, the person who was scheduled, unfortunately, had to cancel last minute due to faculty interviews. So our next speaker has very graciously agreed to present at the very last minute, but we are very grateful to him.

So I'd like to introduce everybody to Will. So Will is a computational neuroscience and machine learning PhD student at University College London, at their Gatsby unit. So I don't know if anybody has heard about the Gatsby unit. I'm a bit of a history buff or history nerd, depending on how you phrase it.

The Gatsby unit was actually this incredible powerhouse in the 1990s and 2000s. So Hinton used to be there. Zoubin Ghahramani used to be there. He's now in charge of Google Research. I think they've done a tremendous amount of good work. Anyways, and now I'd like to invite Will to talk about how to build a cognitive map.

Did you want to share your screen? - Yeah. - Okay. Can you stand in front of... Here, let me stop sharing. - Okay. So I'm going to be presenting this work. It's all about how a model that people in the group I work with built to study the hippocampal entorhinal system, completely independently, turned out to look a bit like a transformer.

So this paper that I'm going to talk about is describing that link. So the paper that built this link is by these three people. James is a postdoc, half at Stanford, Tim's a professor at Oxford and in London, and Joe's a PhD student in London. So this is the problem that this model of the hippocampal entorhinal system, which we'll talk more about, is supposed to solve.

It's basically the observation there's a lot of structure in the world, and generally we should use it in order to generalize quickly between tasks. So the kind of thing I mean by that is you know how 2D space works because of your long experience living in the world. And so if you start at this green house and step north to this orange one, then to this red one, then this pink one, because of the structure of 2D space, you can think to yourself, "Oh, what will happen if I step left?" And you know that you'll end up back at the green one because loops of this type close in 2D space, okay?

And this is, you know, perhaps this is a new city you've just arrived in. This is like a zero-shot generalization because you somehow realize that the structure applies more broadly and use it in a new context. Yeah, and there's generally a lot of these kinds of situations where there's structures that like reappear in the world.

So there can be lots of instances where the same structure will be useful to doing these zero-shot generalizations to predict what you're going to see next. Okay, and so you may be able to see how we're already going to start mapping this onto some kind of sequence prediction task that feels a bit Transformer-esque, which is you receive this sequence of observations and, in this case, actions, movements in space, and your job is, given a new action, step left here, you have to try and predict what you're going to see.

So that's a kind of sequence prediction version of it. And the way we're going to try and solve this is based on factorization. It's like, you can't go into one environment and just learn from the experiences in that one environment. You have to separate out the structure and the experiences you're having so that you can reuse the structural part, which appears very often in the world.

Okay, and so, yeah, separating memories from structure. And so, you know, here's our separation of the two. We have our dude wandering around this, like, 2D grid world. And you want to separate out the fact that there's 2D space, and it's 2D space that has these rules underlying it.

And in a particular instance, in the environment you're in, you need to be able to recall which objects are at which locations in the environment. Okay, so in this case, it's like, oh, this position has an orange house, this position doesn't. That's green, sorry, orange, red, and pink. And so you have to bind those two, you have to be like, whenever you realize that you're back in this position, recall that that is the observation you're going to see there.

Okay, so this model that we're going to build is some model that tries to achieve this. And so imagine you enter a new environment with the same structure: you wander around and realize it's the same structure, all you have to do is bind the new things that you see to the locations, and then you're done, you know how the world works.

So this is what neuroscientists mean by a cognitive map, is this idea of, like, separating out and understanding the structure that you can reuse in new situations. And yeah, this model that was built in the lab is a model of this process happening, of the separation between the two of them and how you use them to do new inferences.

And this is the bit that's supposed to look like a transformer. So that's the general introduction, and then we'll dive into it a little more now. Does the broad picture make sense? Good. Silence, I'll assume, is good. So we'll start off with some brain stuff. So there's a long stream of evidence from spatial navigation that the brain is doing something like this.

I mean, I think you can probably imagine how you yourself are doing this already when you go to a new city, or you're trying to understand a new task that has some structure you recognize from previously. You can see how this is something you're probably doing. But spatial navigation is an area in neuroscience which had a huge stream of discoveries over the last 50 years, and a lot of evidence of the neural basis of this computation.

So we're going to talk through some of those examples. The earliest of these are psychologists like Tolman, who were showing that rats, in this case, can do this kind of path integration structure. So the way this worked is they got put at a start position here, down at the bottom, S, and they got trained that this route up here got you a reward.

So this is the maze we had to run around. Then they were asked, they were put in this new... the same thing, but they blocked off this path that takes this long, winding route, and given instead a selection of all these arms to go down. And they look at which path the rat goes down.

And the finding is that the rat goes down the one that corresponds to heading off in this direction. So the rat has somehow not just learned... like, you know, one option of this is it's like blind memorization of actions that I need to take in order to route around.

Instead, no, it's actually embedding the reward in its understanding of 2D space and taking a direct route there, even though it's never taken it before. There's evidence that rats are doing this as well as us. And then a series of, like, neural discoveries about the basis of this.

So John O'Keefe stuck an electrode in the hippocampus, which is a brain area we'll talk more about, and found these things called place cells. So what I'm plotting here is each of these columns is a single neuron. And the mouse or rat, I can't remember, is running around a square environment.

The black lines are the path the rodent traces out through time. And you put a red dot down every time you see this individual neuron spike. And then the bottom plot of this is just a smooth version of that spike rate, so that firing rate, which you can think of as, like, the activity of a neuron in a neural network.

That's the analogy that people usually draw. And so these ones are called place cells because they're neurons that respond in a particular position in space. And in the '70s, this was, like, huge excitement, you know, people had been studying mainly, like, sensory systems and motor output. And suddenly, here's a deep cognitive variable like place, something you never directly sense-- you don't have a GPS signal, but somehow there's this, like, signal for what looks like position in the brain, in very, like, understandable ways.

The next step in-- the biggest step, I guess, in this chain of discovery is the Moser Lab, which is a group in Norway. They stuck an electrode in a different area of the brain. The medial entorhinal cortex. And so this is the hippocampal entorhinal system we're going to be talking about.

And they found this neuron called a grid cell. So again, the same plot structure that I'm showing here, but instead, these neurons respond not in one position in a room, but in, like, a hexagonal lattice of positions in a room. Okay, so this-- these two, I guess, I'm showing to you because they, like, really motivate the underlying neural basis of this kind of, like, spatial cognition, embodying the structure of this space in some way.

Okay, and it's a very surprising finding, why neurons would choose to represent things with this hexagonal lattice. It's, like, yeah, provoked a lot of research. And broadly, there's been, like, many more discoveries in this area. So there's place cells I've talked to you about, grid cells, cells that respond based on the location of not yourself, but another animal, cells that respond when your head is facing a particular direction, cells that respond to when you're a particular distance away from an object.

So, like, I'm one step south of an object, that kind of cell. Cells that respond to reward positions, cells that respond to vectors to boundaries, cells that respond to-- so, like, all sorts, all kinds of structure that this pair of brain structures, the hippocampus here, this red area, and the entorhinal cortex, this blue area here, which is, yeah, conserved across a lot of species, are represented.

There's also finally one finding in this that's fun, is they did an fMRI experiment on London taxicab drivers. And I don't know if you know this, but the London taxicab drivers, they do a thing called the Knowledge, which is a two-year-long test where they have to learn every street in London.

And the idea is the test goes something like, "Oh, there's a traffic jam here and a roadwork here, and I need to get from, like, Camden Town down to Wandsworth in the quickest way possible. What route would you go?" And they have to tell you which route they're going to be able to take through all the roads and, like, how they would replan if they found a stop there, those kind of things.

So, it's, like, intense-- you see them, like, driving around sometimes, learning all of these, like, routes with little maps. They're being made a little bit obsolete by Google Maps, but, you know, luckily, they got them before that-- this experiment was done before that was true. And so, what they've got here is a measure of the size of your hippocampus using fMRI versus how long you've been a taxicab driver in months.

And the claim is basically the longer you're a taxicab driver, the bigger your hippocampus, because the more you're having to do this kind of spatial reasoning. So, that's a big set of evidence that these brain areas are doing something to do with space. But there's a lot of evidence that there's something more than that, something non-spatial going on in these areas, okay?

And we're going to build these together to make the broader claim about this, like, underlying structural inference. And so, I'm going to talk through a couple of those. The first one of these is a guy called Patient HM. This is the most studied patient in, like, medical history. He had epilepsy, and to cure intractable epilepsy, you have to cut out the brain region that's causing these, like, seizure-like events in your brain.

And in this case, the epilepsy was coming from the guy's hippocampus, so they bilaterally lesioned his hippocampus. They, like, cut out both of his hippocampi. And it turned out that this guy then had terrible amnesia. He never formed another memory again, and he could only recall memories from a long time before the surgery happened, okay?

But he, you know, so experiments showed a lot of this stuff about how we understand the neural basis of memory, things like he could learn to do motor tasks. So, somehow, the motor tasks are being done. For example, they gave him some very difficult motor coordination tasks that people can't generally do, but can with a lot of practice.

And he got very good at this eventually, and was as good as other people at learning to do that. He had no recollection of ever doing the task. So, he'd go in to do this new task and be like, "I've never seen this before. I have no idea what you're asking me to do." And he'd do it amazingly.

He'd be like, "Yeah, sorry." There's some evidence there that the hippocampus is involved in at least some parts of memory there, which seems a bit separate to this stuff about space that I've been talking to you about. The second of these is imagining things. So, this is actually a paper by Demis Hassabis, who, before he was DeepMindHead, was a neuroscientist.

And here, maybe you can't read that. I'll read some of these out. You're asked to imagine you're lying on a white sandy beach in a beautiful tropical bay. And so, the control, this bottom one, says things like, "It's very hot and the sun is beating down on me. The sand underneath me is almost unbearably hot.

I can hear the sounds of small wavelets lapping on the beach. The sea is a gorgeous aquamarine color." You know, like, so a nice, lucid description of this beautiful scene. Whereas the person with hippocampal damage says, "As for seeing, I can't really, apart from just the sky. I can hear the sound of seagulls and of the sea.

I can feel the grain of sand beneath my fingers." And then, like, yeah, struggles, basically. Really struggles to do this imagination scenario. Some of the things written in these are, like, very surprising. So, the last of these is this transitive inference task. So, transitive inference, A is greater than B, B is greater than C, therefore, A is greater than C.

And the way they convert this into a rodent experiment is you get given two pots of food that have different smells. And your job is to go to the pot of food. You learn which pot of food has, sorry, which pot with the smell has the food. And so, these are colored by the two pots by their smell, A and B.

And the rodent has to learn to go to a particular pot, in this case, the one that smells like A. And they do two of these. They do A has the food when it's presented in a pair with B, and B has the food when it's presented in a pair with C.

And then they test what the mice do when presented with A and C, a completely new situation. If they have a hippocampus, they'll go for A over C. They'll do transitive inference. If they don't have one, they can't. And so, there's a much more broad set. This is like, oh, I've shown you how hippocampus is used for this spatial stuff that people have been excited about.

But there's also all of this kind of relational stuff, imagining new situations, some slightly more complex story here. The last thing I'm going to show is how the entorhinal cortex is doing some broader stuff as well. So, if you remember, the hippocampus was these place cells, and the entorhinal cortex was these grid cells, and it turns out the entorhinal cortex also appears to be doing some broader, non-spatial things.

This is all motivation for the model, just trying to build all of these things together. So, in this one, this is called the Stretchy Birds Task, okay? So, you put people in an fMRI machine and you make them navigate, but navigate in bird space. And what bird space means is it's a two-dimensional space of images.

And each image is one of these birds. And as you vary along the X dimension, the bird's legs get longer and shorter. And as you vary along the Y direction, the bird's neck gets longer and shorter, okay? And the patients sit there, or subjects sit there, and just watch the bird images change so that it traces out some part in 2D space.

But they never see the 2D space. They just see the images, okay? And the claim is basically, and then they're asked to do some navigational tasks. They're like, "Oh, whenever you're in this place in 2D space, you're shown a Santa Claus next to the bird." And so, the participants have to pin that particular bird image, that particular place in 2D space, to the Santa Claus.

And you're asked to go and find the Santa Claus again using some non-directional controller. And they navigate their way back. And the claim is that these people use grid cells. So, the entorhinal cortex is active in how these people are navigating this abstract cognitive bird space. And the way you test that claim is you look at the fMRI signal in the entorhinal cortex as the participants head at some particular angle in bird space.

And because of the six-fold symmetry of the hexagonal lattice, you get this six-fold symmetric waving up and down of the entorhinal cortex activity as you head in particular directions in 2D space. So, it's like evidence that this system is being used not just for navigation in 2D space, but any cognitive task with some underlying structure that you can extract, you use it to do these tasks.

- Is there significance to bird space also being 2D here? - Yes, yes. - Like, have people tried this with multiple dimensions of variability? - People haven't done that experiment, but people have done things like look at how grid cells... Ah, have they even done that? They've done things like 3D space, but not like cognitive 3D space.

They've done, like, literally like make... They've done it in bats. They stick electrodes in bats and make the bats fly around the room and look at how their grid cells respond. Yeah, but definitely, I think they've done it... Ah, they've done it in sequence space. So, in this case, you hear a sequence of sounds with hierarchical structure.

So, it's like how there's months, weeks, days, and meals, something like that. So, like, weeks have a periodic structure, months have a periodic structure, days have a periodic structure, and meals have a periodic structure. And so, you hear a sequence of sounds with exactly the same kind of structure as that hierarchy of sequences, and you look at the representation in the entorhinal cortex through fMRI, and you see exactly the same thing, that the structure is all represented there.

Even more than that, you actually see in the entorhinal cortex an array of length scales. So, at one end of the entorhinal cortex, you've got very large length-scale grid cells that are, like, responding to large variations in space. The other end, you've got very small ones, and you see the same thing recapitulated there.

The, like, meals cycle that cycles a lot quicker is represented at one end of the entorhinal cortex in fMRI, and the months cycle is at the other end, with, like, a scale in between. So, there's some, yeah, evidence to that end. All right. So, I've been talking about MEC, the medial entorhinal cortex.

Another brain area that people don't look at as much is the LEC, the lateral entorhinal cortex, but it will be important for this model. And basically, the only bit that you should be aware of before we get to the model is that the similarity structure in the lateral entorhinal cortex seems to be, like, a very high-level semantic one.

For example, you present some images, and in the visual cortex things are represented more similarly if they look similar, but by the time you get to the lateral entorhinal cortex, things are represented more similarly based on their usage. For example, an ironing board and an iron will be represented similarly, even though they look very different, because they're somehow semantically linked.

Okay, so that's the role that the LEC is gonna play in this model. So yeah, basically, the claim is that this is for more than just 2D space. So there's a neural implementation of this cognitive map, not only for 2D space, which this cartoon is supposed to represent, but also for any other structure.

So for some structures, like transitive inference, this one is faster than that, and that's faster than that; or family trees, like this person is my mother's brother and is therefore my uncle, those kinds of things. These are the broader structural inferences that you'll want to be able to reuse in many situations with basically the same structure.

Great, that was a load of neuroscience. Now we're gonna get onto the model that tries to summarize all of these things, and that's gonna be the model that will end up looking like a transformer. So yeah, we basically want this separation. These diagrams here are supposed to represent a particular environment that you're wandering around.

It has an underlying grid structure, and you see a set of stimuli at each point on the grid, which are these little cartoon bits. And you wanna try and create a thing that separates out this 2D structural grid from the actual experiences you're seeing. And the mapping to the things I've been showing you is that this grid-like code is the grid cells in the medial entorhinal cortex, which are somehow abstracting out the structure.

The lateral entorhinal cortex, encoding these semantically meaningful similarities, will be the objects that you're seeing. So it's just like, this is what I'm seeing in the world. And the combination of the two of them will be the hippocampus. So yeah, in more diagrams, we've got G, the structural code, the grid code, in the MEC, and then the LEC.

Oh, is someone asking a question? - Since morning, so now it's lunchtime. Yeah. - Sorry, I can't hear you if you're asking a question. How do I mute someone if they're... Maybe type it in the chat if there is one. (audience laughing) Nice. So yeah, we got the hippocampus in the middle, which is gonna be our binding of the two of them together.

Okay. So I'm gonna step through each of these three parts on their own, and how they do the job that I've assigned to them, and then come back together and show the full model. So the lateral entorhinal cortex encodes what you're seeing. So this is like these images, or the houses we were looking at before, and that would just be some vector X of t.

So a random vector, different for every symbol. The medial entorhinal cortex is the one that tells you where you are in space, and it has the job of path integrating. Okay? So this means receiving a sequence of actions that you've taken in space, for example, I went north, east, then south, and telling you where in 2D space you are.

So it's somehow the bit that embeds the structure of the world, and the way that we'll do that is that this G of t, this vector of activities in this brain area, will be updated by a matrix that depends on the action you've taken. Okay? So if you step north, you update the representation with the step-north matrix.

Okay? And those matrices are gonna have to obey some rules. For example, if you step north, then step south, you haven't moved. And so the step north matrix and the step south matrix have to be inverses of one another so that the activity stays the same and represents the structure of the world somehow.
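
(As a toy illustration of that inverse constraint, not something from the talk itself: one simple family of action matrices that satisfies it is rotation-like orthogonal matrices, where the opposite action is just the transpose. TEM learns its action matrices rather than fixing them like this; the names and dimensions below are mine.)

```python
import numpy as np

# Toy path integration: the positional code g is updated by an action-dependent
# matrix, and opposite actions must be inverses so "north then south" is a no-op.
d = 8
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # random orthogonal (rotation-like) matrix
W_north, W_south = Q, Q.T                       # orthogonal => inverse is the transpose

g = rng.normal(size=d)                          # current positional code
print(np.allclose(g, W_south @ (W_north @ g)))  # True: stepping north then south returns g
```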

Okay. So that's the world structure part. Finally, the memory, because we have to memorize which things we found at which positions, is gonna happen in the hippocampus, and that's gonna be through a version of these things called Hopfield networks that you heard mentioned in the last talk. So this is a content-addressable memory, and the claim is that it's biologically plausible.

The way it works is you have a set of activities, P, which are the activities of all these neurons, and it just recurrently updates itself. So there's some weight matrix in here, W, and some non-linearity, and you run it forward in time, and as a dynamical system it settles into some attractor state.

And the way you make it do memory is through the weight matrix. Okay? So you make it a sum of outer products of these chi mu, where each chi mu is some memory, some pattern you wanna record. Okay? And then, yeah, that's just writing it in there, and the update rule is like that.

And the claim is basically that if P, the activity of the hippocampal neurons, is close to some memory, say chi mu, then this dot product will be much larger than all of the other dot products with all the other memories. So this sum over all of them will basically be dominated by this one term, chi mu.

And so your attractor network will basically settle into that one chi mu. And maybe to preempt some of the stuff that's gonna come later, you can see how this pairwise similarity between points, and then adding the memories up weighted by that pairwise similarity, is the bit that's gonna turn out looking a bit like attention.
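
(To write down what's being described, in my own notation rather than the slide's: the weights are a sum of outer products of the stored patterns, and one recurrent update weights each stored pattern by its dot-product similarity to the current state.)

```latex
W = \sum_{\mu} \chi_{\mu} \chi_{\mu}^{\top},
\qquad
p \;\leftarrow\; f\!\left(W p\right)
  \;=\; f\!\Big(\sum_{\mu} \chi_{\mu}\,\big(\chi_{\mu}^{\top} p\big)\Big),
```

so if p is already close to one pattern, that pattern's dot product dominates the sum and the dynamics settle onto it.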

And so some cool things you can do with these systems: here's a set of images that someone's encoded in a Hopfield network, and then someone's presented this image to the network and asked it to just run to its dynamical attractor, its minimum, and it recreates the whole memory that it's got stored in it.

So it, like, completes the rest of the image. So that's our system. Yeah. - Sorry, I'm not trying to call you out, I'm just trying to assess where we're getting to. I've heard of this interpretation, like the modern interpretation of the Hopfield network. - This one is actually... which interpretation, sorry?

- I bet that it's effectively... is that it? - Ah, yeah. Yeah, yeah. The link to transformers will basically only be through that fact: there's classic Hopfield networks and then there's modern ones that were made in around 2016. And the link between attention and the modern ones is precise; the link with the classic ones is not as precise.

I mean, yeah, yeah. - The modern Hopfield network is the continuous version, but the original Hopfield network is flatter. - With a change in the non-linearity, right? 'Cause then you have to do the exponentiation thing. We'll maybe get to it later, you can tell me.

- Sorry, I'm not gonna take too long. - No, no, no, it's all right. More questions are good. We'll get there, yeah, maybe. - There's a separate energy function, and I think the exponential is in that. - Mm. - If that, yeah. - Okay, thanks. - No worries.

So that's basically how our system's gonna work. And this is the Tolman-Eichenbaum machine, that's the name of this thing. And so the patterns you wanna store in the hippocampus, these memories that we wanna embed, are a combination of the position and the input. And like half of Homer's face here: if you've decided you wanna end up at a particular position, you can recall the stimulus that you saw there and predict that as your next observation.

Or vice versa. If you see a new thing, you can infer, oh, I path integrated wrong, I must actually be here, assuming there's usually more than one thing in the world that might be in a different position, so. Yeah, that's the whole system. Does the whole Tolman-Eichenbaum machine make sense, roughly, what it's doing?

Okay, cool. And basically, this last bit is saying it's really good. So what I'm showing here, this is on the 2D navigation task. So it's a big grid. I think they use like, I don't know, 11 by 11 or something, and it's like wandering around and has to predict what it's gonna see in some new environment.

On the x-axis here is the number of nodes in that graph that you've visited, and on the y-axis is how much you correctly predict. And each of these lines is based on how many of those types of environments I've seen before: how quickly do I learn? And the basic phenomenon it's showing is that, over time, as you see more and more of these environments, you learn to learn.

So you learn the structure of the world and are eventually able to quickly generalize to the new situation and predict what you're gonna see. And this scales not with the number of edges that you've visited, which would be the learn-everything option, 'cause if you're trying to predict which state I'm gonna see, given my current state and action, in a dumb way, you just need to see all state-action pairs, so all edges.

But this thing is able to do it much more cleverly, 'cause it only needs to visit all nodes and just memorize what is at each position. And you can see that its learning curve follows the number-of-nodes-visited learning curve. So it's doing well. For neuroscience, this is also exciting: the neural patterns of response in these model regions match the ones observed in the brain.

So in the hippocampal section, you get place cell-like activities. This hexagon is the grid of the environment that it's exploring, and plotted is the firing rate of that neuron, whereas the ones in the medial entorhinal cortex show this grid-like firing pattern. Yeah. - This comparison operates on this abstract 2D space in the middle.

Do you have any thoughts about how that transfers, like, do you think it was a real-world thing? Do you think that it was just, like, math representation of, like, a very nicely distributed space? Or do you think there's something more complicated going on? - Yeah, I imagine there's something more complicated going on.

I guess this... - So this is, like, a super high-tech-- - No, no, no, yeah. Maybe you can make-- (laughs) Maybe you can make arguments that, as I was saying, there are these different modules that operate at different scales. You can see this already here: grid cells at one scale, grid cells at another scale.

And so you could imagine how that could be useful: one of them operates at the highest level, one of them operates at the lowest level, and you mix those. You know, and, yeah, they seem adaptable; they seem to scale up or down depending on your environment. And so an adaptable set of length scales that you can use seems like it could be pretty good.

So that's quite speculative, so. (laughs) Okay, sorry, yeah. - So, to make sure I understand, if you go up. - Okay, one more. - Yeah, so you have your... What's the key and what's the value? - Yeah, so the... - And Hopfield networks are always auto-associative instead of hetero-associative. So how are you...

- The memories that we're gonna put in, so the patterns, let's say chi mu, are gonna be some outer product of the position at a given time and the sensory input, flattened. So we, yeah, take the outer product of those. So every element in X gets to see every element of G, flatten those out, and that's your vector that you're gonna embed.

Does that make sense? - Yeah. - Sorry, I should put that up. Yeah, and then you do the same operation, except you flatten with an identity in the X slot. Let's say you're at a position and you wanna predict what you're gonna see. You set X to be the identity, you do this operation that creates a very big vector from G, you put that in and you let it run its dynamics, and it recalls the pattern, and you, like, learn a network that reads the X back out from that.
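
(Here's a minimal numerical sketch of that store-and-recall idea as I understood it; the variable names, the simplified linear read-out, and the dimensions are mine, not TEM's actual code, which runs the Hopfield dynamics instead.)

```python
import numpy as np

rng = np.random.default_rng(1)
d_g, d_x = 5, 4                                   # sizes of the positional (g) and sensory (x) codes

# Store: each memory is the flattened outer product of where I am and what I saw.
g1, x1 = rng.normal(size=d_g), rng.normal(size=d_x)
g2, x2 = rng.normal(size=d_g), rng.normal(size=d_x)
memories = np.stack([np.outer(g, x).ravel() for g, x in [(g1, x1), (g2, x2)]])

def recall_x(g_query, memories):
    """Recall a stimulus from a position query: weight each stored x by how
    similar its g-part is to g_query, then sum (a simplified linear read-out)."""
    mems = memories.reshape(len(memories), d_g, d_x)   # un-flatten each memory
    weighted = np.einsum('g,mgx->mx', g_query, mems)   # (g_query . g_m) * x_m for each memory m
    return weighted.sum(axis=0)

x_hat = recall_x(g1, memories)
print(np.corrcoef(x_hat, x1)[0, 1])   # high, as long as g1 and g2 aren't too similar
```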

- And the figures you show, if you go down a bit? - Yeah. - It's hard to see, but what's on the x-axis, and what... like, are you training a Hopfield network with this flattened outer product of the... okay. - Yeah, the actual training that's going on is more in the structure of the world, 'cause it has to learn those matrices.

All it gets told is which action type it's taken, and it has to learn the fact that stepping east is the opposite of stepping west. So all of the learning is in those matrices, learning to get the right structure. - Okay. - There's also, I mean, the Hopfield network learning, but the Hopfield network's, like, re-initialized every environment, and you're just, like, shoving memories in.

So that's less the bit that's doing the learning. It's causing this, certainly, but it's not causing this, like, shift up, which is: as training progresses in many different environments, you get better at the task, because it's learning the structure of the task. - Okay. And the link to, I mean, this is all just modern Hopfield networks.

- The initial paper was actually classic Hopfield networks, but, yeah, now the new versions of it are modern Hopfield networks, yeah. - Right. And then, insomuch as modern Hopfield networks equal attention. - That's the claim, yeah. - Right. But then you're, okay, and then you have some results, there are some results looking at activations.

Well, these are recordings in the brain. - These are, no, these are actually in TEM. So the left ones are neurons in the G section, in the medial entorhinal cortex part of TEM, as you vary position, yeah. And we're gonna get, yeah, my last section is about how TEM is like a transformer.

But we'll get to, hopefully it'll be clear, the link between the two after that. Okay, we're happy with that now, hopefully. Cool: TEM is approximately equal to a transformer, yeah. So you seem to know all of this already, but at least I can clarify my notation. You've got your data, which is maybe like tokens coming in, and you've got your positional embedding, and the positional embedding will play a very big role here.

That's the E, and together they make this vector H. Okay, and these arrive over time. Yeah, and you've got your attention updates, where you compute some similarity between the key and the query, and then you add up the values weighted by those similarities. We're all happy with that. And here's the stepped version.
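
(In symbols, and purely to fix notation for the comparison that follows, the stepped/causal attention update being described is roughly:)

```latex
\mathrm{attn}(h_t) \;=\; \sum_{\tau \le t}
  \frac{\exp\!\big(q_t \cdot k_\tau / \sqrt{d}\big)}
       {\sum_{\tau' \le t} \exp\!\big(q_t \cdot k_{\tau'} / \sqrt{d}\big)}\; v_\tau,
\qquad
q_t = W_Q h_t, \quad k_\tau = W_K h_\tau, \quad v_\tau = W_V h_\tau .
```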

So the basic intuition about how these parts map onto each other is that the G is the positional encoding, as you may have been able to predict. The X are the input tokens. This part, where you put in the memory and you try and recall which memory it's most similar to, that's the attention part.

And yeah, you attend: you compare the current GT to all of the previous GTs, you recall the ones with high similarity, and return the corresponding X. We've still got 10 minutes, right? Might as well, yeah, okay. So the differences, which I'm gonna go through, between the normal transformer and how to make it map onto this, are the following.

So the first of these is that the keys and the queries are the same at all time points. So there's no difference in the matrix that maps from tokens to keys and tokens to queries, same matrix. And it only depends on the positional encoding. Okay, so you only recall memories based on how similar their positions are.

So yeah, this is: key at time tau equals query at time tau equals some matrix applied only to the positional embedding at time tau. Then the values depend only on this X part, so it's some factorization of the two: the value at time tau is some value matrix applied only to that X part.

So that's the only bit you want to recall, I guess. Is that right? I think that's right. And then it's a causal transformer, in that you only do attention over things that have arrived at time points in the past. Make sense? And finally, the perhaps weird and interesting difference is that there's this path integration going on in the positional encodings.

So these E are the equivalent of the grid cells, the G from the previous bit, and they're going to be updated through these matrices that depend on the actions you're taking in the world. Yeah, so that's basically the correspondence. I'm going to go through a little bit about how the Hopfield network is approximately doing attention over previous tokens.
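
(Putting those differences together, here's a rough illustrative sketch of the constrained attention step being described; the dimensions, random weights, and action set are mine, and the real TEM-transformer has more machinery around this.)

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
d_e, d_x, T = 16, 8, 6

# In the correspondence: keys and queries share ONE matrix and see only the
# positional code e; values see only the sensory input x; and e is path-
# integrated by action-dependent matrices whose opposites invert each other.
W_qk = rng.normal(size=(d_e, d_e)) / np.sqrt(d_e)
W_v  = rng.normal(size=(d_x, d_x)) / np.sqrt(d_x)
W_a  = {a: np.linalg.qr(rng.normal(size=(d_e, d_e)))[0] for a in ['N', 'E', 'S', 'W']}
W_a['S'], W_a['W'] = W_a['N'].T, W_a['E'].T        # opposite actions are inverses

actions = ['N', 'E', 'S', 'W', 'N']                # action taken between steps t and t+1
xs = [rng.normal(size=d_x) for _ in range(T)]      # stimulus seen at each step

es, outputs = [rng.normal(size=d_e)], []
for t in range(T):
    if t > 0:
        es.append(W_a[actions[t - 1]] @ es[-1])    # path-integrate the positional code
    q  = W_qk @ es[t]                              # query: position only
    ks = np.stack([W_qk @ e for e in es])          # keys: past positions only (causal)
    vs = np.stack([W_v @ x for x in xs[:t + 1]])   # values: past stimuli only
    attn = softmax(ks @ q / np.sqrt(d_e))
    outputs.append(attn @ vs)                      # similarity-weighted recall of stimuli

print(len(outputs), outputs[-1].shape)             # 6 steps, each output of size d_x
```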

So yeah, I was describing to you before the classic Hopfield network, which, if you remove the non-linearity, looks like this. And the mapping, I guess, is that the hippocampal activity, the current neural activity, is the query. The set of memories themselves are the keys. You're doing this dot product to get the similarity between the query and the keys, and then you're summing up the memories, which are the values, weighted by that dot product.

So that's the simple version. But actually, these Hopfield networks are quite bad. They, in some senses, tend to fail; they have a low memory capacity. For N neurons, they can only embed something like 0.14N memories, which is a big result from statistical physics in the '80s.

But it's okay, people have improved this. The reason that they're bad seems to be basically that the overlap between your query and the memories is too big for too many memories. You basically look too similar to too many things. So how do you fix that? You sharpen your similarity function.

Okay, and the way we're gonna sharpen it is through this function, and this function is gonna be a softmax. So it's gonna be like: how similar am I to this particular pattern, exponentiated, over how similar am I to all the other ones. That's our new measure of similarity, and that's the non-linearity of the modern Hopfield network.
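
(A minimal sketch of that sharpened, softmax-style retrieval, with an arbitrary inverse temperature beta and made-up pattern counts; one such update is essentially the attention form, query against keys, softmax, weighted sum of values.)

```python
import numpy as np

def modern_hopfield_update(p, memories, beta=16.0):
    """One retrieval step: softmax over the similarity of the current state p to
    every stored pattern, then return the similarity-weighted sum of the patterns.
    memories has shape (num_patterns, dim)."""
    sims = memories @ p                      # dot-product similarity to each pattern
    w = np.exp(beta * (sims - sims.max()))   # exponentiate (stably) ...
    w = w / w.sum()                          # ... and normalize: the softmax sharpening
    return memories.T @ w

rng = np.random.default_rng(3)
patterns = rng.choice([-1.0, 1.0], size=(50, 200)) / np.sqrt(200)   # 50 unit-norm patterns
target = patterns[7]
query = np.where(rng.random(200) < 0.2, -target, target)            # corrupt ~20% of the bits

recalled = modern_hopfield_update(query, patterns)
print(np.corrcoef(recalled, target)[0, 1])   # close to 1: the right memory dominates the softmax
```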

Yeah, yeah. (laughs) And then you can see how this thing, yeah, is basically doing the attention mechanism. And it's also biologically plausible. We'll quickly run through that: you have some set of activity, P at time t, this neural activity, and you're gonna compare that to each chi mu, and that's through these memory neurons.

So there's a set of memory neurons, one for each pattern mu that you've memorized. The weights into each memory neuron are its chi mu, so the activity of that neuron is this dot product. Then you do divisive normalization across these neurons, so they compete with one another, and only the memories that are most similar, the most activated according to this softmax operation, get recalled. Then they project back to P and produce the output by summing up the memories, weighted by this thing, times chi mu, which is the weights.

So the weights out to the memory neurons and back to the hippocampus are both chi mu. Okay, and so that's how you can biologically plausibly run this modern Hopfield network. And so, sorry, yeah? - Do you have any thoughts on what the memories that get put in are? Like, do you attend over every memory you've ever made? Probably not.

- Probably not, yeah. I guess somehow you have to have knowledge, and you know, in this case it works nicely 'cause we like wipe this poor agent's memory every time and only memorize things from the environment, and so you need something that like gates it so that it only looks for things in the current environment somehow.

How that happens, I'm not sure. There are claims that there's this like just shift over time. The claim is basically that like somehow as time passes, the representations just slowly like rotate or something. And then they're also embedding something like a time similarity as well, 'cause the closer in time you are, the more you're like in the same rotated thing.

So maybe that's a mechanism to like, oh, you know, past a certain time, you don't recall things. But the evidence and debate a lot around that, so yeah, expect to see. Other mechanisms like it, I'm sure. Maybe context is another one. Actually, we'll briefly talk about that. You know, if you know you're in the same context, then you can send a signal like somehow in the prefabricated context like work out what kind of setting am I in?

You can send that signal back and be like, oh, make sure you attend to these ones that are in the same context. So yeah, there we go. The TEM-transformer, that's the job. It path integrates its positional encodings, which is kind of fun. It computes similarity using these positional encodings, and it only compares to past memories, but otherwise it looks a bit like a transformer setup.

And here's the setup: MEC, LEC, hippocampus, and place cells. So, yeah, briefly, the last thing I think I'm gonna say is that this extends TEM nicely. Previously you had to do this outer product and flatten, and the dimensionality of that scales terribly: for example, if you wanna combine position, what I saw, and a context signal, suddenly I'm taking the outer product of three vectors and flattening that, which is much, much bigger, you're scaling like N cubed, right?

Whereas what you'd like is to scale more like 3N. And this version of TEM, with this new modern Hopfield network, does scale nicely to adding a context input as just another input to what was previously the modern Hopfield network. So yeah, our conclusions: it proved somewhat interesting as a two-way relationship. From the AI to the neuroscience, we use this new memory model, this modern Hopfield network, where all of this bit is supposed to be in the hippocampus, whereas previously we just had these memory bits in the classic Hopfield network in the hippocampus.

So it makes kind of interesting predictions about different place cell structures in the hippocampus, and it just sped up the code a lot, right? From the neuro to the AI, maybe there are a few things that are slightly different, like this learnable recurrent positional encoding. People do some of this, I think: they take positional encodings and learn an RNN that updates them. But maybe this is some motivation to try it with weight matrices, because these weight matrices, being generally invertible and things like that, are very biased towards representing these very clean structures like 2D space.

So, I mean, you know, interesting there. The other thing is that this is only one attention layer. And so somehow, by using these nice extra foundations, making the task very easy in terms of processing X, and using the right positional encoding, you've got it to solve the task with just one of these.

Also kind of nice, and maybe a nice interpretation angle, is that you can go in and really probe what these neurons are doing in this network and really understand it, you know? We know that the positional encoding looks like grid cells, and we have a very deep understanding of why grid cells are a useful thing to have if you're doing this path integration.

So it's like, hopefully this helps interpret all these things. Oh yeah, and if there was time, I was going to tell you all about grid cells, which are my hobby horse, but I don't think there's time. So I'll stop there. - Excellent. Questions? Go ahead. - A very big question.

So in the very beginning, those grids, are they linked to one neuron or a population of neurons? - These ones? Those ones. Yeah, that's one neuron's response. - Just one neuron? - Yeah. Wild, it's wild. Let me tell you more about the grid cell system. - And how can you know that, how can you know it's one neuron?

How do you measure it? - Because you've got electrodes stuck in there, right? And the classic measuring technique is a tetrode, which is four wires, okay? And they receive these spikes, which are electrical fluctuations as a result of a neuron firing. And you can triangulate, from the pattern of activity on the four wires, that a particular spike you measured has to have come from one position.

So they can work out which neurons sent that particular spike. Yeah. But there's, so there's a set of neurons that have grid cell patterns. Lots of neurons have patterns that are just translated versions of one another. So the same grid, like shifted in space. That's called a module. And then there are sets of modules, which are the same types of neurons, but with a lattice that's much bigger or much smaller.

And in rats, there's roughly seven. So it's a very surprising crystalline structure of these seven modules, but within each module, each neuron is just a translated version of another. And, yeah, there's a lot of theory work about why that's a very sensible thing to do if you want to do path integration and work out where you are in the environment based on your velocity signals.

Nice. - Cool. So just this thing that you said, this was really fascinating. Is this a product of evolution or a product of learning? - Evolution. It emerges around 10 days into a baby rat's life after it's born. So certainly, oh, certainly that structure seems to be very biased towards being created.

Unclear, you know, we were talking about how it was being co-opted to encode other things. And so it's debatable how flexible it is or how hardwired it is. But it seemed, you know, the fMRI evidence suggests that there's some like more flexibility in the system. Unclear quite how it's coding it, but it'd be cool to get neural recordings of it.

I wanna see. - Cool. Actually, let's give our speaker another round of applause. (laughing)