Stanford CS25: V2 | Neuroscience-Inspired Artificial Intelligence
00:00:11.620 |
So the work I'm presenting today, the title of it is Attention Approximates Sparse Distributed Memory. 00:00:17.660 |
And this was done in collaboration with Cengiz Pehlevan, and my PhD advisor is Gabriel Kreiman. 00:00:27.300 |
We show that the heuristic attention operation can be implemented with simple properties 00:00:31.260 |
of high dimensional vectors in a biologically plausible fashion. 00:00:34.860 |
So the transformer and attention, as you know, are incredibly powerful, but they were heuristically developed. 00:00:41.740 |
And the softmax operation in attention is particularly important, but also heuristic. 00:00:47.500 |
And so we show that the intersection of hyperspheres that is used in sparse distributed memory closely 00:00:53.540 |
approximates the softmax and attention more broadly, both in theory and with some experiments. 00:01:01.540 |
So you can see SDM, sparse distributed memory, as preempting attention by approximately 30 years. 00:01:10.220 |
And what's exciting about this is that it meets a high bar for biological plausibility. 00:01:14.220 |
Hopefully I have time to actually get into the wiring of the cerebellum and how you can 00:01:17.040 |
map each operation to part of the circuit there. 00:01:21.940 |
So first I'm going to give an overview of sparse distributed memory. 00:01:25.940 |
Then I have a transformer attention summary, but I assume that you guys already know all about that. 00:01:31.740 |
We can get there and then decide how deep we want to go into it. 00:01:35.300 |
I'll then talk about how actually attention approximates SDM, interpret the transformer 00:01:40.920 |
more broadly, and then hopefully there's time to go into SDM's biological plausibility. 00:01:47.020 |
Also I'm going to keep everything high level visual intuition and then go into the math, 00:01:51.860 |
but stop me and please ask questions, literally whenever. 00:01:56.940 |
So sparse distributed memory is motivated by the question of how the brain can read 00:02:01.140 |
and write memories in order to later retrieve the correct one. 00:02:05.320 |
And some considerations that it takes into account are high memory capacity, robustness 00:02:08.940 |
to query noise, biological plausibility, and some notion of fault tolerance. 00:02:14.180 |
SDM is unique from other associative memory models that you may be familiar with, like 00:02:18.740 |
Hopfield networks, in so much as it's sparse. 00:02:23.160 |
So it operates in a very high dimensional vector space, and the neurons that exist in 00:02:27.740 |
this space only occupy a very small portion of possible locations. 00:02:32.700 |
It's also distributed, so all read and write operations apply to all nearby neurons. 00:02:40.220 |
As a side note, Hopfield networks, if you're familiar with them, are actually a special case of SDM. 00:02:46.180 |
I'm not going to go deep into that now, but I have a blog post on it. 00:02:52.740 |
So first we're going to look at the write operation for sparse distributed memory. 00:02:57.380 |
We're in this high dimensional binary vector space. 00:03:01.000 |
We're using Hamming distance as our metric for now, and we'll move to continuous vectors later. 00:03:06.140 |
And we have this green pattern, which is represented by the solid dot, and the hollow circles are neurons. 00:03:15.260 |
So think of everything quite abstractly, and then we'll map to biology later. 00:03:20.300 |
So this pattern has a write radius, which is some Hamming distance. 00:03:24.260 |
It activates all of the neurons within that Hamming distance, and then here I just note 00:03:30.500 |
that each of those neurons is now storing that green pattern, and the green pattern's original location 00:03:36.380 |
is what I'm keeping track of with this kind of fuzzy hollow circle. 00:03:41.060 |
So we're writing in another pattern, this orange one. 00:03:44.940 |
And note here that neurons can store multiple patterns inside of them, and formally this 00:03:49.660 |
is actually a superposition, or just a summation, of these high dimensional vectors. 00:03:53.460 |
Because they're high dimensional, you don't have that much cross talk, so you can get 00:03:58.100 |
But for now, you can just think of it as a neuron can store multiple patterns. 00:04:01.020 |
Finally, we have a third pattern, this blue one. 00:04:04.740 |
We're writing it in another location, and yeah. 00:04:08.340 |
So again, we're keeping track of the original pattern locations, but they can be triangulated 00:04:13.540 |
from the nearby neurons that are storing them. 00:04:17.580 |
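To make the write operation concrete, here is a minimal numpy sketch of what was just described: randomly placed binary neuron addresses, a Hamming-distance write radius, and superposition storage. All names and sizes below are illustrative choices, not settings from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_neurons, write_radius = 64, 1000, 26

# Randomly placed binary neuron addresses (the hollow circles).
neuron_addresses = rng.integers(0, 2, size=(n_neurons, n_dims))
# Each neuron accumulates a superposition (sum) of the patterns written near it.
neuron_contents = np.zeros((n_neurons, n_dims))
neuron_write_counts = np.zeros(n_neurons)

def write(pattern):
    """Store `pattern` in every neuron within the Hamming write radius."""
    dists = np.sum(neuron_addresses != pattern, axis=1)   # Hamming distances
    active = dists <= write_radius                        # neurons inside the write circle
    neuron_contents[active] += pattern                    # superposition storage
    neuron_write_counts[active] += 1

# Write three random patterns (the green, orange, and blue dots).
patterns = rng.integers(0, 2, size=(3, n_dims))
for p in patterns:
    write(p)
```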
And so we've written in three patterns, now we want to read from the system. 00:04:20.580 |
So I have this pink star, the query ξ. It's represented by a given 00:04:27.020 |
vector, which has a given location in space, and it activates nearby neurons again. 00:04:31.980 |
But now the neurons output the patterns that they stored previously. 00:04:37.700 |
And so you can see that based upon its location, it's getting four blue patterns, two orange 00:04:44.180 |
And it then does a majority rule operation, where it updates towards whatever pattern is in the majority. 00:04:51.420 |
So in this case, because blue is actually a majority, it's just going to update completely towards blue. 00:04:57.420 |
Again, I'll formalize this more in a bit, but this is really to give you intuition for 00:05:05.620 |
So the key thing to relate this back to attention is actually to abstract away the neurons that 00:05:12.780 |
are operating under the hood, and just consider the circle intersection. 00:05:16.660 |
And so what each of these intersections between the pink read circle and each of the write 00:05:22.300 |
circles means is that the intersection is the set of neurons that both store the pattern that 00:05:27.860 |
was written in, and are now being read from by the query. 00:05:32.060 |
And the size of that intersection corresponds to how many copies of that pattern the query is then going to retrieve. 00:05:38.940 |
And so formally, we define this circle intersection as the 00:05:48.420 |
cardinality of the intersection between the set of neurons activated by the pattern's write circle and the set activated by the query's read circle. 00:05:55.940 |
Okay, are there any questions, like at a high level before I get more into the math? 00:06:03.380 |
I don't know if I can check, is it easy for me to check Zoom? 00:06:07.980 |
Nah, sorry, Zoom people, I'm not going to check. 00:06:11.700 |
So the neurons, they're randomly distributed in this space? 00:06:15.620 |
There's more recent work showing that they can learn and update their locations. 00:06:22.060 |
But in this, you can assume that they're randomly initialized binary high dimensional vectors. 00:06:36.220 |
So the first thing that you do, so this is for reading, to be clear, so you've already written patterns in. 00:06:42.860 |
So the first thing you do is you weight each pattern by the size of its circle intersection. 00:06:48.280 |
So the circle intersection there for each pattern. 00:06:53.740 |
Then you sum over all of the patterns that have been written into this space. 00:06:58.260 |
So you're just doing a weighted summation of them. 00:07:01.900 |
And then there's this normalization by the total number of intersections that you have. 00:07:09.620 |
And finally, because at least for now, we're working in this binary space, you map back 00:07:13.700 |
to binary, just seeing if each of the values is greater than a half. 00:07:21.100 |
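Continuing the write sketch above (it reuses `neuron_addresses`, `neuron_contents`, `neuron_write_counts`, and `patterns` from there), the read operation just described - weight each pattern by its circle intersection, sum, normalize, and threshold at a half - might look like this:

```python
def read(query, read_radius=26):
    """Read by pooling the contents of all neurons near the query (majority rule)."""
    dists = np.sum(neuron_addresses != query, axis=1)
    active = dists <= read_radius                          # neurons inside the read circle
    total = neuron_write_counts[active].sum()              # sum of all circle intersections
    if total == 0:
        return query                                       # nothing stored nearby
    pooled = neuron_contents[active].sum(axis=0) / total   # normalized weighted sum of patterns
    return (pooled > 0.5).astype(int)                      # map back to binary

# A noisy version of the first written pattern should be cleaned up towards it.
noisy = patterns[0].copy()
flip = rng.choice(n_dims, size=5, replace=False)
noisy[flip] = 1 - noisy[flip]
retrieved = read(noisy)
print(np.mean(retrieved == patterns[0]))                   # should be at or near 1.0
```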
Okay, how familiar are people with attention? 00:07:24.380 |
I looked at like the previous talks you've had, they seem quite high level. 00:07:28.380 |
Like, can you guys write the attention equation for me? 00:07:31.740 |
Is that like, can I get thumbs up if you can do that? 00:07:35.860 |
Yeah, okay, I'm not like, I'll go through this, but I'll probably go through it faster 00:07:44.180 |
So when I first made this presentation, like, this was the state of the art for transformers, 00:07:52.140 |
And so it's kind of funny, like how far things have come now, I don't need to tell you that 00:07:58.440 |
So yeah, I'm going to work with this example. 00:08:01.780 |
Well, okay, I'm going to work with this example here, the cat sat on the blank. 00:08:07.260 |
And so we're in this setting, we're predicting the next token, which hypothetically is the 00:08:14.220 |
And so there are kind of four things that the attention operation is doing. 00:08:19.100 |
The first one up here is it's generating what are called keys, values and queries. 00:08:22.580 |
And again, I'll get into the math in a second, I'm just trying to keep it high level first. 00:08:26.980 |
And then we're going to compare our query with each of the keys. 00:08:32.140 |
So the word "the", which is closest to the word we're next predicting, is our query, and we're 00:08:37.140 |
seeing how similar it is to each of the key vectors. 00:08:43.380 |
We then based upon that similarity, do this softmax normalization, so that all of the 00:08:50.060 |
attention weights sum to one, and then we sum together their value vectors to 00:08:57.220 |
propagate to like the next layer or use as our prediction. 00:09:02.140 |
And so at a high level, you can think of this as like the query word "the" is looking for words that are relevant to predicting what comes next. 00:09:10.400 |
And so hypothetically, it has a high similarity with words like cat and sat, or their keys. 00:09:17.300 |
So this then gives large weight to the cat and sat value vectors, which get moved to the next layer. 00:09:24.260 |
And the cat value vector hypothetically, contains a superposition of other animals like mice, 00:09:32.180 |
And the sat vector also contains things that are sat on including mat. 00:09:37.180 |
And so what you actually get from the value vectors of paying attention to cat and sat 00:09:43.180 |
are like three times mat plus one times mouse plus one time sofa. 00:09:49.220 |
This is again, like a totally hypothetical example, but I'm trying to make the point 00:09:54.260 |
that you can extract from your value vectors, things useful for predicting the next token 00:10:07.340 |
And I guess, yeah, another thing here is like what you pay attention to, so cat and sat 00:10:11.580 |
might be different from what you're actually extracting. 00:10:13.820 |
You're paying attention to your keys, but you're getting your value vectors out. 00:10:16.900 |
Okay, so here is the full attention equation. 00:10:20.820 |
The top line, I'm separating out the projection matrices, W subscript V, K, and Q, and in 00:10:28.180 |
the second one, I've just collapsed them into like the [inaudible] and yeah, so breaking this down. 00:10:34.820 |
The first step here is we compare, we do a dot product between our query vector and our key vectors. 00:10:41.380 |
This should actually be a small, you know, [inaudible] and so yeah, we're doing this 00:10:47.740 |
dot product between them to get a notion of similarity. 00:10:52.860 |
We then apply the softmax operation, which is an exponential over a sum of exponentials. 00:10:58.620 |
The way to think of the softmax is it just makes large values larger, and this will be 00:11:04.300 |
important for the relations here, so I'll spend a minute on it. 00:11:09.020 |
At the top here, I have like some hypothetical items indexed from zero to nine, and their values. 00:11:17.820 |
In the second row, I just do like a normal normalization of them, and so the top item 00:11:22.660 |
goes to a 30% value, but if I instead do a softmax, with a beta coefficient of one in the 00:11:28.580 |
softmax, that value becomes 0.6. So it just makes your distributions peakier, 00:11:34.900 |
which is kind of one way of thinking of it, and this is useful for attention because you only 00:11:38.700 |
want to pay attention to the most important things, or the things that are nearby and relevant. 00:11:47.140 |
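The numbers on the slide are hypothetical, but the peakiness effect is easy to reproduce with arbitrary values:

```python
import numpy as np

scores = np.array([3.0, 2.0, 1.5, 1.0, 1.0, 0.5, 0.5, 0.2, 0.1, 0.1])

plain = scores / scores.sum()                    # ordinary normalization
softmax = np.exp(scores) / np.exp(scores).sum()  # softmax (beta = 1)
print(plain[0], softmax[0])                      # softmax gives the top item a much larger share

# A larger beta coefficient makes the distribution peakier still.
beta = 3.0
peaky = np.exp(beta * scores) / np.exp(beta * scores).sum()
print(peaky[0])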
And so once we've applied our softmax, we then just do a weighted summation of our value 00:11:55.060 |
vectors, which actually get extracted and propagate to the next layer. 00:12:04.140 |
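For reference, a bare-bones version of that operation for a single query; the standard transformer choice of beta is 1/sqrt(d_k), but it's left explicit here since the beta coefficient matters later:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                       # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attention(query, keys, values, beta=1.0):
    """softmax(beta * q . K^T) V for a single query vector."""
    scores = keys @ query                 # dot-product similarity with each key
    weights = softmax(beta * scores)      # attention weights sum to one
    return weights @ values               # weighted sum of the value vectors
```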
Okay, so here's the full equation, I went through that a little bit quickly, I'm happy 00:12:12.100 |
to answer questions on it, but I think half of you know it, half of you don't. 00:12:19.180 |
Okay, so how does transformer attention approximate sparse distributed memory, this 30-year-old 00:12:25.300 |
thing that I've said is biologically plausible. 00:12:30.140 |
So are we supposed to accept that SDM is biologically plausible? 00:12:36.540 |
So I'm going to get to that at the end, yeah. 00:12:37.540 |
Attention was also formed heuristically, though, like in the sense that all attention traces back to this one? 00:12:39.540 |
I think the attention equation I'm showing here was developed - I mean, Attention Is All 00:12:50.700 |
You Need was the highlight, but Bengio has a paper from 2015 where it was actually first 00:12:57.020 |
written in this way, correct me if I'm wrong, but I'm pretty sure. 00:13:00.660 |
Yeah, I mean I guess like this particular one, that's why I was asking the question, because 00:13:07.700 |
it's like, you show that two different methods that could be classified as, like, attention 00:13:13.060 |
proposals are, like, the same, but then you still have to show that one of them is biologically plausible. 00:13:21.740 |
Yes, exactly, so I'll show that SDM has really nice mappings to a circuit in the cerebellum 00:13:25.980 |
at the neuronal level, and then right now it's this link to attention, and I guess you 00:13:32.780 |
make a good point that there are other attention mechanisms. 00:13:35.500 |
This is the one that has been dominant, but I don't think that's just a coincidence, like 00:13:40.940 |
Computing your Softmax is expensive, and there's been a bunch of work like the Linformer, etc., 00:13:44.900 |
etc., that tries to get rid of the Softmax operation, and it's just done really badly. 00:13:49.700 |
Like there's a bunch of jokes on Twitter now that it's like a black hole for people that 00:13:53.060 |
like try and get rid of Softmax and you can't, and so it seems like this, and like other 00:13:58.140 |
versions of it, transformers just don't scale as well in the same way, and so there's something 00:14:03.020 |
important about this particular attention equation. 00:14:05.780 |
But like that goes the other way, right, which is like if this is really important, then 00:14:17.980 |
So the thing that I think is important is that you have this exponential weighting, 00:14:22.060 |
where you're really paying attention to the things that matter, and you're ignoring everything 00:14:29.940 |
There might be better equations, but the point I was just trying to make there is like the 00:14:34.660 |
Softmax does seem to be important, and this equation does seem to be very successful, 00:14:39.440 |
and we haven't come up with better formulations for it. 00:14:46.380 |
Yeah, so it turns out that sparse distributed memory, as you move your query and your pattern 00:14:54.020 |
away from each other, so you pull these circles apart, the read and write circles, the number 00:14:59.300 |
of neurons that are in this intersection in a sufficiently high dimensional space decays 00:15:04.260 |
approximately exponentially, and so on this right plot here, I'm pulling apart, the x-axis 00:15:10.420 |
is me pulling apart the blue and the pink circles, and the y-axis is on a log scale 00:15:17.420 |
the number of neurons that are in the intersection, and so to the extent that this is a linear 00:15:23.180 |
plot on a log scale, it's exponential, and this is for a particular setting where I have 00:15:31.020 |
my 64 dimensional vectors, which is like used in GPT-2, it holds across a lot of different 00:15:37.660 |
settings, particularly higher dimensions, which are now used for bigger transformers. 00:15:42.780 |
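You can check that approximately exponential decay yourself with a quick Monte Carlo estimate; the radius and neuron count below are arbitrary illustrations, not the settings from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_neurons, radius = 64, 200_000, 26          # arbitrary illustration

neurons = rng.integers(0, 2, size=(n_neurons, n_dims), dtype=np.uint8)
pattern = rng.integers(0, 2, size=n_dims, dtype=np.uint8)
dist_to_pattern = np.sum(neurons != pattern, axis=1)

for d in range(0, 21, 4):
    query = pattern.copy()
    query[:d] = 1 - query[:d]                        # a query exactly d bit-flips away
    dist_to_query = np.sum(neurons != query, axis=1)
    intersection = np.sum((dist_to_pattern <= radius) & (dist_to_query <= radius))
    print(d, intersection)                           # roughly linear on a log scale => ~exponential decay
```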
Okay, so I have this shorthand for the circle intersection equation, and what I'll show 00:15:51.900 |
is how the circle intersection is approximately exponential, so we can write it with two constants, 00:15:58.220 |
c subscript 1 and c subscript 2, with c1 out front. Because the softmax normalizes an 00:16:05.780 |
exponential over a sum of exponentials, c1 would cancel; the thing that matters is c2, and 00:16:10.460 |
you can approximate that nicely with the beta coefficient that's used in the softmax. 00:16:18.020 |
And so, yeah, I guess as well, I'll focus first on the binary original version of SDM, 00:16:24.020 |
but then we also develop a continuous version. Okay, so yeah, the thing that you need 00:16:30.460 |
for the circle intersection and the exponential decay to work - to map it to 00:16:35.820 |
attention - is some notion of continuous space, and so you can use this equation here 00:16:40.420 |
to map Hamming distances to discretized cosine similarity values, where the hats over the 00:16:48.020 |
vectors are L2 normalizations. You can then write the circle intersection equation 00:16:55.880 |
on the left as this exponential with these two constants that you need to fit, and 00:17:05.160 |
then, by converting c2, you can rewrite this as a beta coefficient. 00:17:12.540 |
Let me get to some plots, yeah, so you need the correct beta coefficient, but you can 00:17:16.620 |
fit this with a log-linear regression in a closed form. 00:17:22.220 |
I want to show a plot here, yeah, okay, so in the blue is our circle intersection for 00:17:28.940 |
two different Hamming distances both using 64-dimensional vectors, and the orange is 00:17:34.660 |
our actual softmax attention operation, where we fit the beta coefficient so that 00:17:41.060 |
the Hamming distance used by attention is equivalent to the Hamming distance used by 00:17:45.460 |
SDM. The main plot is the normalized weights, so they're summed 00:17:53.980 |
and divided so they sum to one, and then I have log plots here. 00:18:04.860 |
You can see that for the higher dimensional - sorry, the larger Hamming distance, the 00:18:09.540 |
log plot, you see this drop off here, where the circle intersection stops being exponential, 00:18:14.940 |
but it turns out this actually isn't a problem, because the point at which the drop - the 00:18:20.100 |
exponential breaks down, you're at approximately 0.20 here, and you're basically paying negligible 00:18:27.580 |
attention to any of those points, and so in the regime where the exponential really matters, 00:18:34.340 |
this approximation holds true. Yeah, no, I just wanted to actually 00:18:46.780 |
show a figure to get some intuition first. Yeah, so all we're doing here is 00:18:53.780 |
we're in a binary space with the original SDM, and we're just using this mapping to cosine similarity, 00:19:00.900 |
and then what you need to do is just have the beta coefficient fit, and you can view 00:19:06.460 |
your beta coefficient in attention as determining how peaky things are, and this relates directly 00:19:10.940 |
to the Hamming distance of the circles that you're using for read and write operations. 00:19:19.020 |
And so yeah, to like mathematically show this now, on this slide I'm not using any tricks, 00:19:23.860 |
I'm just rewriting attention using the SDM notation of patterns and queries. 00:19:37.620 |
And this is the money slide where we're updating our query: on the left we have our attention 00:19:45.180 |
equation written in SDM notation, we expand our softmax, and then the main statement is 00:19:51.980 |
that this is closely approximated if we swap out our exponential for the SDM circle intersection. 00:20:10.460 |
So and again, the two things that you need for this to work are, one, your attention 00:20:15.980 |
vectors, your keys and queries, need to be L2 normalized, so I have hats on them, and then, 00:20:22.860 |
two, if you decide a given Hamming distance for SDM - and I'll get into what Hamming distances 00:20:28.900 |
are good for different things - then you need to have a beta coefficient that relates to that Hamming distance. 00:20:35.500 |
But again, that's just how many things are you trying to pay attention to. 00:20:44.580 |
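Putting that statement into one line - this is my paraphrase of the slide as described, not a copy of its notation - with hatted vectors L2-normalized, p_a the pattern (key) addresses, p_a^v their value vectors, and C(x) the set of neurons within Hamming distance d of x:

```latex
% Hats are L2 normalization; beta is the softmax coefficient that corresponds
% to the chosen Hamming distance d.
\xi_{\text{new}}
  = \sum_{a} p^{v}_{a}\,
    \frac{\exp\!\left(\beta\,\hat{\xi}^{\top}\hat{p}_{a}\right)}
         {\sum_{a'}\exp\!\left(\beta\,\hat{\xi}^{\top}\hat{p}_{a'}\right)}
  \;\approx\; \sum_{a} p^{v}_{a}\,
    \frac{\left|C(\hat{p}_{a}) \cap C(\hat{\xi})\right|}
         {\sum_{a'}\left|C(\hat{p}_{a'}) \cap C(\hat{\xi})\right|}
```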
So yeah, just as a quick side note, you can write SDM using continuous vectors and then 00:20:49.980 |
not need this mapping to understand similarity. 00:20:54.140 |
And so here I have the plots again, but now the orange and 00:21:02.260 |
the green have split, because I've added the continuous approximation here too. 00:21:10.140 |
And what's nice about the continuous version is you can actually then write sparse distributed 00:21:13.900 |
memory as a multilayered perceptron with slightly different assumptions, and I'm not going to 00:21:19.020 |
talk about that now, but this is featured in Sparse Distributed Memory as a Continual 00:21:24.100 |
Learner, which was added to the additional readings - sorry, this shouldn't say that anymore. 00:21:31.940 |
It's just been accepted to ICLR for this year. 00:21:35.060 |
Okay, so do trained transformers use these beta coefficients that I've said correspond to good SDM Hamming distances? 00:21:45.140 |
And so it shouldn't be surprising that depending on the Hamming distance you set, SDM is better at different things. 00:21:52.340 |
For example, you just want to store as many memories as possible, and you're assuming 00:21:56.220 |
that your queries aren't noisy, or you're assuming your queries are really noisy, so 00:22:01.100 |
you can't store as much, but you can retrieve from a long distance. 00:22:05.540 |
And if attention in the transformer is implementing sparse distributed memory, we should expect 00:22:10.300 |
to see that the beta coefficients that the transformer uses correspond to these good Hamming distances. 00:22:17.380 |
And so we have some weak evidence that that's the case. 00:22:21.860 |
So this is the key query normalized variant of attention, where you actually learn your beta coefficient. 00:22:27.500 |
Normally in transformers, you don't, but you also don't L2 norm your vectors, and so you can 00:22:32.740 |
kind of have this like effective beta coefficient. 00:22:35.140 |
So in this case, it's a cleaner instance where we're actually learning beta. 00:22:39.260 |
And this one's trained on a number of different translation tasks. 00:22:43.220 |
We take the learned beta coefficients across layers and across tasks, and plot them here. 00:22:48.980 |
And the red dotted lines correspond to three different notions of an optimal SDM beta coefficient. 00:22:56.960 |
And again, this is weak evidence insomuch as, to derive the optimal SDM beta coefficients, 00:23:07.120 |
we need to assume random patterns in this high-dimensional space, and obviously real-world data isn't random. 00:23:13.320 |
However, it is nice to see, one, all the beta coefficients fall within the bounds, and two, 00:23:20.160 |
they skew towards the max query noise variant, which makes more sense if you're dealing with complicated 00:23:26.200 |
real-world data, where the next data points you see might be out of distribution. 00:23:32.400 |
The max memory capacity variant assumes no query noise at all. 00:23:36.080 |
And so it's like, how many things can I pack in, assuming that the queries I'm asking with aren't noisy at all. 00:23:48.680 |
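As a side note on that "effective" beta: since q . k = ||q|| ||k|| cos(q, k), the unconstrained norms in standard attention play the role of an inverse temperature on the cosine similarity. A crude, purely illustrative way one might summarize it:

```python
import numpy as np

def effective_beta(queries, keys):
    """Rough effective inverse temperature when q and k are not L2-normalized:
    q . k = ||q|| ||k|| cos(q, k), so the vector norms act like a beta
    coefficient applied to the cosine similarity."""
    q_norms = np.linalg.norm(queries, axis=-1)
    k_norms = np.linalg.norm(keys, axis=-1)
    return q_norms.mean() * k_norms.mean()

# In the key-query-normalized variant, the vectors are unit length and a single
# beta is learned directly instead of being implied by the norms.
```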
Just talking a little bit about transformer components more broadly. 00:23:53.040 |
So I've mentioned that you can write the feed-forward layer as a version of SDM that has a notion of longer-term memory. 00:24:03.660 |
There's also layer norm, which is crucial in transformers. 00:24:06.800 |
And it's not quite the same, but it can be related to the L2 normalization that's required by SDM. 00:24:12.920 |
There's also the key query normalization variant that explicitly does this L2 normalization. 00:24:17.320 |
And it does get slightly better performance, at least on the small tests that they did. 00:24:22.000 |
I don't know if this would scale to larger models. 00:24:27.160 |
And so I guess this work is interesting in so much as the biological plausibility, which 00:24:31.720 |
I'm about to get to, and then the links to transformers. 00:24:35.560 |
It hasn't to date improved transformer architectures. 00:24:38.880 |
But that doesn't mean that this lens couldn't be used or be useful in some way. 00:24:43.600 |
So yeah, I list a few other things that SDM is related to that could be used to funnel in new ideas. 00:24:49.560 |
So in the new work where SDM is a continual learner, we expand the cerebellar circuit, 00:24:54.640 |
look at components of it, particularly inhibitory interneurons, implement those in a deep learning 00:24:59.080 |
model, and it then becomes much better at continual learning. 00:25:02.400 |
So that was a fun way of actually using this link to get better bottom-line performance. 00:25:10.220 |
So a summary of this section is basically just the intersection between two hyperspheres 00:25:16.040 |
approximates an exponential, and this allows SDM's read and write operations to approximate 00:25:22.840 |
attention both in theory and our limited tests. 00:25:27.400 |
And so kind of like big picture research questions that could come out of this is, first, is 00:25:32.080 |
the transformer so successful because it's performing some key cognitive operation? 00:25:37.480 |
The cerebellum is a very old brain region used by most organisms, including fruit flies, 00:25:44.120 |
maybe even cephalopods, through what looks like convergent evolution. 00:25:50.160 |
And then, given that the transformer has been so successful empirically, is SDM actually how the cerebellum works? 00:26:03.080 |
As we learn more and more about the cerebellum, there's nothing that yet disproves SDM as a theory of it. 00:26:08.080 |
And I think it's-- I'll go out on a limb and say it's one of the more compelling theories of cerebellar function out there. 00:26:14.640 |
And so I think this work kind of motivates looking at more of these questions-- both 00:26:31.080 |
At the bottom, we have patterns coming in for either reading or writing. 00:26:36.880 |
And they're going to-- actually, I'll break this down over each of these slides. 00:26:43.000 |
And every neuron here, these are the dendrites of each neuron. 00:26:47.040 |
And they're deciding whether or not they're going to fire for the input that comes in. 00:26:54.520 |
Then if the neuron does fire, and you're writing in that pattern, then you simultaneously-- 00:27:00.180 |
and I'm going to explain-- if you're hearing this and think it's crazy, that the brain doesn't do this, 00:27:05.180 |
then I'm going to hopefully talk you down. 00:27:06.180 |
You not only need to have the pattern that activates neurons, but you need to have 00:27:11.580 |
a separate line that tells the neuron what to store. 00:27:15.740 |
And just like you have this difference between keys and values, where they can be different 00:27:19.020 |
vectors representing different things, here you can have a key that comes in and tells 00:27:24.300 |
the neuron whether to activate, and the value for what it should actually store and later output. 00:27:34.220 |
And then once you're reading from the system, you also have your query come in here, activate 00:27:40.940 |
neurons, and those neurons then output whatever they store. 00:27:45.340 |
And the neuron's storage vector is this particular column here. 00:27:50.860 |
And as a reminder, it's storing patterns in superposition. 00:27:55.380 |
And then it will dump whatever it's stored across these output lines. 00:28:00.940 |
And then you have this g majority bit operation that converts to a 0 or 1, deciding if each output bit should be on or off. 00:28:10.180 |
And so here is the same circuit, but where I overlay cell types and the cerebellum. 00:28:20.100 |
And so I'll come back to this slide, because most people probably aren't familiar with the cerebellum. 00:28:32.940 |
The cerebellum is pretty homogeneous, in that it follows this same circuit pattern throughout. 00:28:39.100 |
Also, fun fact, 70% of all neurons in the brain are in the cerebellum. 00:28:45.060 |
But the cerebellum is very underappreciated, and there's a bunch of evidence that it has 00:28:48.380 |
closed-loop systems with most higher-order processing now. 00:28:51.980 |
If your cerebellum's damaged, you are more likely to have autism, et cetera, et cetera. 00:28:55.660 |
So it does a lot more than just fine motor coordination, which is what a lot of people have assumed. 00:29:02.100 |
So inputs come in through the mossy fibers here. 00:29:06.580 |
This is a major up-projection, where you have tons and tons of granule cells. 00:29:11.300 |
Each granule cell has what are called parallel fibers, which are these incredibly long and 00:29:16.060 |
thin axons that branch out in this T structure. 00:29:20.140 |
Then they hit the Purkinje cells, which will receive up to 100,000 parallel fiber inputs. 00:29:29.260 |
It's the highest connectivity of any neuron in the brain. 00:29:33.420 |
And then the Purkinje cell will decide whether or not to fire and send its output downwards 00:29:39.980 |
So that's the whole system where patterns come in and the neurons decide whether they 00:29:43.780 |
fire or not, and then output what they've stored. 00:29:47.580 |
You then have a separate write line, which is the climbing fibers here. 00:29:51.060 |
So the climbing fibers come up, and they're pretty amazing. They make some connections here, 00:29:56.780 |
but those aren't very strong. 00:29:57.780 |
The one that really matters is that a climbing fiber goes up and wraps around an individual Purkinje cell. 00:30:03.940 |
And the mapping is close to one-to-one between climbing fibers and Purkinje cells, at least in adults. 00:30:17.820 |
Oh, so are they separate neurons coming from somewhere else? Yes, it's separate. 00:30:19.820 |
Purkinje cells here go into the cerebellar nuclei, kind of in the core of the cerebellum. 00:30:20.820 |
And that then feeds into the thalamus, like back to higher-order brain regions, or like down to motor outputs. 00:30:22.820 |
A lot of people think of the cerebellum as kind of like a fine-tuning look-up table, 00:30:23.820 |
where, like, you've already decided the muscle movement you want to do, but the cerebellum 00:30:24.820 |
will then, like, do a bunch of different things. 00:30:49.760 |
But it seems like this also applies to, like, next-word prediction. 00:30:54.620 |
A neuroscientist once said to me that, like, a dirty little secret of fMRI is that the cerebellum lights up for almost every task. 00:31:11.620 |
I mean, how long is the information stored and retrieved? 00:31:17.820 |
Like, is this, like, a couple of milliseconds, or, like, is this information more persistent? 00:31:23.720 |
So the main theory is that you have updating through spike-timing-dependent plasticity, where your 00:31:33.700 |
climbing fiber—which is carrying what you want to write in—will fire either 00:31:38.260 |
just before or just after your granule cells fire, and so that then updates the Purkinje 00:31:45.740 |
cell synapses through long-term depression or potentiation. 00:31:51.420 |
The climbing fiber makes very large action potentials, or at least a very large response in the Purkinje cell. 00:31:57.780 |
And so I do think you could get pretty fast synaptic updates. 00:32:05.380 |
And the synapses can then stay that way for, like, the rest of your life. 00:32:10.500 |
So what's really unique about this circuit is the fact that you have these two orthogonal 00:32:16.740 |
input lines, where you have the mossy fibers bringing information in to decide if the neuron's going 00:32:21.540 |
to fire or not, but then the totally separate climbing fiber lines that can update what specific 00:32:26.260 |
neurons are storing and will later output. 00:32:30.580 |
And then the Purkinje cell is so important; it's kind of doing this pooling across every granule cell. 00:32:36.460 |
And each neuron, remember, is storing its vector this way, down its column. 00:32:39.740 |
And so the Purkinje cell is doing element-wise summation and then deciding whether it fires 00:32:46.700 |
And this allows for you to store your vectors in superposition and then later denoise them. 00:32:56.780 |
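A tiny sketch of that superposition-then-denoise idea, with pattern counts chosen so that one pattern holds a clear majority (illustrative, not the talk's exact figure):

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims = 1000
blue, orange, green = rng.integers(0, 2, size=(3, n_dims))

# What the query's active neurons dump out: several copies of the blue pattern
# plus a couple of other patterns, all superposed by element-wise summation.
pooled = 5 * blue + 2 * orange + 2 * green
total = 5 + 2 + 2

recovered = (pooled / total > 0.5).astype(int)   # element-wise majority / threshold
print(np.mean(recovered == blue))                # 1.0 here: blue wins every element-wise vote
```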
The theory of SDM maps quite well to the Marr and Albus theories of cerebellar function, 00:33:01.460 |
which are still quite dominant, if anyone's familiar or wants to talk about this. 00:33:06.020 |
Yeah, so for the analogy of the neuron in SDM that you introduced, does each neuron 00:33:10.300 |
just basically map onto a Purkinje cell, or-- 00:33:17.460 |
So the location of the neuron, those hollow circles, corresponds to the granule cell 00:33:22.220 |
dendrites here, where the patterns that come in correspond to the activations of the mossy fibers. 00:33:27.900 |
And then the efferent post-synaptic connections are with the Purkinje cell. 00:33:34.900 |
So what the neuron is storing is actually in its synaptic connections with the Purkinje cell. 00:33:44.260 |
And then the Purkinje cell does the majority operation when it decides whether or not to fire. 00:33:59.460 |
Yeah, I think we're basically into question time. 00:34:03.860 |
I don't know anything about SDM, but it seems, as I understood it, it's very good for long-term memory. 00:34:20.100 |
And I am curious, what's your hypothesis of what we should be doing for short-term memory? 00:34:30.020 |
Because it seems that-- so if you have this link of transformers having long-term memory, what about short-term memory? 00:34:38.620 |
Because for me, it seems like we are doing this in the prompt context right now. 00:34:43.740 |
But how could we incorporate this into the architecture? 00:34:48.340 |
So this work actually focuses more on the short-term memory, where it relates to the attention operation. 00:34:57.340 |
It's almost more natural to interpret it as a multilayer perceptron that does a softmax 00:35:01.660 |
activation across its-- or a top-k activation across its neurons. 00:35:05.380 |
It's a little bit more complicated than that. 00:35:10.260 |
So yeah, the most interesting thing here is the fact that I just have a bunch of neurons. 00:35:16.900 |
And in activating nearby neurons in this high-dimensional space, you get this exponential weighting, 00:35:23.260 |
And then because it's an associative memory, where you have keys and values, it is attention. 00:35:30.420 |
And yeah, I guess the thing I most want to drive home from this is it's actually surprisingly 00:35:35.580 |
easy for the brain to implement the attention operation, the attention equation, just using 00:35:42.180 |
high-dimensional vectors and activating nearby neurons. 00:35:49.860 |
So if you were to actually use SDM for attention-- yeah, so let me go all the way back real quick. 00:35:59.900 |
And I don't think you were here for the talk. 00:36:01.500 |
I think I saw you come in a bit later, which is totally fine. 00:36:04.540 |
I was listening, but maybe I don't remember everything. 00:36:11.620 |
There's the neuron perspective, which is this one here. 00:36:14.860 |
And this is actually what's going on in the brain, of course. 00:36:19.060 |
And so the only thing that is actually constant is the neurons; the patterns are ephemeral. 00:36:24.620 |
And then there's the pattern-based perspective, which is actually what attention is doing. 00:36:29.740 |
And so here, you're abstracting away the neurons, or assuming they're operating under the hood. 00:36:33.820 |
But what you're actually computing is the distance between the true location of your query and each pattern. 00:36:41.140 |
And there are pros and cons to both of these. 00:36:44.420 |
The pro to this is you get much higher fidelity distances, because you know exactly how far the query is from each pattern. 00:36:53.380 |
And that's really important when you're deciding what to update towards. 00:36:56.140 |
You really want to know what is closest and what is further away, and be able to weight them accordingly. 00:37:03.460 |
The problem is you need to store all of your pattern information in memory. 00:37:07.220 |
And so this is why transformers have limited context windows. 00:37:11.380 |
The other perspective is this long-term memory one, where you forget about the patterns. 00:37:15.780 |
And you just look at where-- you just have your neurons that store a bunch of patterns in superposition. 00:37:22.660 |
And so you can't really-- you can kind of [INAUDIBLE] and it's all much noisier. 00:37:30.140 |
But you can store tons of patterns, and you're not constrained by a context window. 00:37:33.300 |
Or you can think of any MLP layer as storing the entire data set in a noisy superposition. 00:37:41.660 |
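Here is a rough toy contrasting those two perspectives, with made-up sizes and thresholds (this is a sketch of the idea, not the paper's construction): explicit attention over stored key/value patterns versus a fixed bank of neuron addresses whose outgoing weights hold the values in a noisy superposition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patterns, n_neurons, thresh = 64, 32, 4096, 0.25

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

keys = unit(rng.standard_normal((n_patterns, d)))
values = rng.standard_normal((n_patterns, d))
query = unit(keys[7] + 0.3 * rng.standard_normal(d))      # noisy version of key 7

# Pattern perspective (attention): keep every key/value around explicitly.
beta = 10.0
w = np.exp(beta * (keys @ query)); w /= w.sum()
attn_out = w @ values

# Neuron perspective (MLP-like long-term memory): fixed random neuron addresses;
# each neuron's outgoing weights hold a superposition of the values whose keys reached it.
addresses = unit(rng.standard_normal((n_neurons, d)))
writes = (keys @ addresses.T) > thresh                    # which neurons each key activates
W_out = writes.T.astype(float) @ values                   # neuron -> superposed stored values
active = (addresses @ query) > thresh                     # neurons the query activates
mlp_out = active.astype(float) @ W_out / max(active.sum(), 1)

cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(cos(attn_out, values[7]), cos(mlp_out, values[7]))  # both should be high; the neuron route is noisier
```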
Yeah, hopefully that kind of answers your question. 00:37:45.140 |
I think there was one here first, and then-- yeah? 00:37:49.020 |
So I guess my question is-- so I guess you kind of have shown that the modern self-attention 00:37:58.700 |
mechanism maps onto this SDM mechanism that seems plausible and matches the 00:38:07.820 |
contemporary theories of how the brain could implement SDM. 00:38:14.020 |
And I guess my question is, to what degree has that been experimentally verified versus-- 00:38:21.260 |
you were mentioning earlier that it might actually be easier to have done this using 00:38:25.780 |
an MLP layer in some sense than mapping that onto these mechanisms. 00:38:31.620 |
And so how do experimentalists actually distinguish between competing hypotheses? 00:38:37.820 |
For instance, one thing that I wasn't entirely clear about is even if the brain could do 00:38:46.580 |
attention or SDM, that doesn't actually mean it would, because maybe it can't do backprop. 00:38:58.780 |
So on the backprop point, you wouldn't have to do it here because you 00:39:04.820 |
have the climbing fibers that can directly give training signals to what the neurons can store. 00:39:12.780 |
So in this case, it's like a supervised learning task where the climbing fiber knows what it wants 00:39:17.860 |
to write in or how it should be updated in the Purkinje cell synapses. 00:39:22.580 |
But for your broader point, you basically need to test this. 00:39:28.020 |
You need to be able to do real-time learning. 00:39:31.300 |
The Drosophila mushroom body is basically identical to the cerebellum. 00:39:35.460 |
And for the fly, brain data sets have mapped most of the individual neuron connectivity. 00:39:39.980 |
So what you would really want to do is in vitro, real-time, super, super high frames 00:39:48.020 |
per second calcium imaging, and be able to see how synapses change over time. 00:39:56.380 |
And so for an associative learning task, like hear a sound move left, hear another sound move right, 00:40:03.180 |
or smells, or whatever, present one of those, and trace, like figure out the small subset of neurons 00:40:10.220 |
that fire, which we know is a small subset, so that already fits with the sparse activation story. 00:40:15.620 |
See how the synapses here update and how the outputs of it correspond to changes in motor action. 00:40:22.780 |
And then extinguish that memory, so write in a new one, and then watch it go away again. 00:40:29.860 |
And our cameras are getting fast enough, and our calcium and voltage indicators are getting to be really good. 00:40:36.860 |
So hopefully in the next three to five years, we can do some of those tests. 00:40:49.460 |
- I think there was one more, and then I should go over to Will. 00:40:52.780 |
- In terms of how you map the neuron-based SDM onto the biological implementation, 00:41:03.820 |
what is the range of the circle that you're mapping around? 00:41:08.220 |
Is that like the multi-headedness, or can you do that kind of thing? 00:41:14.340 |
I'm just trying to understand how that must be. 00:41:17.220 |
- Yeah, so I wouldn't get confused with multi-headedness, 00:41:21.940 |
because that's different attention heads all doing their own attention operation. 00:41:26.300 |
It's funny, though, the cerebellum has microzones, which you can think of as like separate attention heads in a way. 00:41:32.180 |
I don't want to take that analogy too far, but it is somewhat interesting. 00:41:37.940 |
So the way you relate this is, in attention, you have your beta coefficient. 00:41:44.580 |
That is an effective beta coefficient, because the vector norms of your keys and queries aren't constrained. 00:41:53.420 |
That corresponds to a Hamming distance, and here that corresponds to the number of neurons that are on for any given input. 00:42:04.340 |
And the Hamming distance you want, I had that slide before, 00:42:09.220 |
the Hamming distance you want depends upon what you're actually trying to do. 00:42:12.900 |
And if you're not trying to store that many memories, for example, you're going to have a higher Hamming distance, 00:42:16.980 |
because you can get a higher fidelity calculation for the number of neurons in that noisy intersection. 00:42:28.060 |
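For random binary neuron addresses, the "number of neurons that are on" for a given Hamming radius is just a binomial tail; a quick sketch (the radii here are chosen arbitrarily for illustration):

```python
from math import comb

def activation_fraction(n_dims, hamming_radius):
    """Expected fraction of randomly placed binary neurons whose address lies
    within the given Hamming radius of a random input (exact binomial tail)."""
    total = 2 ** n_dims
    return sum(comb(n_dims, k) for k in range(hamming_radius + 1)) / total

for r in (11, 15, 20, 26):
    print(r, activation_fraction(64, r))   # larger radius => more neurons on per input
```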
- Excellent. Let's give our speaker another round of applause. 00:42:33.100 |
So as a disclaimer, before I introduce our next speaker, 00:42:39.780 |
the person who was scheduled, unfortunately, had to cancel last minute due to faculty interviews. 00:42:43.180 |
So our next speaker has very graciously agreed to present at the very last minute, but we are very grateful to him. 00:42:50.380 |
So Will is a computational neuroscience and machine learning PhD student at University College London, at their Gatsby unit. 00:42:56.740 |
So I don't know if anybody has heard about the Gatsby unit. 00:42:58.740 |
I'm a bit of a history buff or history nerd, depending on how you phrase it. 00:43:02.020 |
The Gatsby unit was actually this incredible powerhouse in the 1990s and 2000s. 00:43:05.900 |
So Hinton used to be there. Zoubin Ghahramani used to be there. 00:43:10.340 |
I think they've done a tremendous amount of good work. 00:43:13.060 |
Anyways, and now I'd like to invite Will to talk about how to build a cognitive map. 00:43:19.180 |
- Okay. Can you stand in front of... Here, let me stop sharing. 00:43:23.100 |
- Okay. So I'm going to be presenting this work. 00:43:26.780 |
It's all about how a model that people in the group I work with built to study the hippocampal-entorhinal system 00:43:35.460 |
turned out, completely independently, to look a bit like a transformer. 00:43:39.260 |
So this paper that I'm going to talk about is describing that link. 00:43:42.740 |
So the paper that built this link is by these three people. 00:43:46.260 |
James is a postdoc, half at Stanford, Tim's a professor at Oxford and in London, and Joe's a PhD student in London. 00:43:55.180 |
So this is the problem that this model of the hippocampal entorhinal system, 00:44:01.700 |
which we'll talk more about, is supposed to solve. 00:44:03.820 |
It's basically the observation there's a lot of structure in the world, 00:44:06.420 |
and generally we should use it in order to generalize quickly between tasks. 00:44:10.060 |
So the kind of thing I mean by that is you know how 2D space works 00:44:13.780 |
because of your long experience living in the world. 00:44:16.020 |
And so if you start at this greenhouse and step north to this orange one, then to this red one, then this pink one, 00:44:21.140 |
because of the structure of 2D space, you can think to yourself, "Oh, what will happen if I step left?" 00:44:26.500 |
And you know that you'll end up back at the green one because loops of this type close in 2D space, okay? 00:44:31.820 |
And this is, you know, perhaps this is a new city you've just arrived in. 00:44:35.580 |
This is like a zero-shot generalization because you somehow realize that the structure applies more broadly and use it in a new context. 00:44:43.620 |
Yeah, and there's generally a lot of these kinds of situations where there's structures that like reappear in the world. 00:44:48.460 |
So there can be lots of instances where the same structure will be useful to doing these zero-shot generalizations 00:44:55.580 |
Okay, and so you may be able to see how we're already going to start mapping this onto some kind of sequence prediction task 00:45:04.140 |
which is you receive this sequence of observations and, in this case, actions, movements in space, 00:45:10.860 |
and your job is, given a new action, step left here, you have to try and predict what you're going to see. 00:45:15.780 |
So that's a kind of sequence prediction version of it. 00:45:18.940 |
And the way we're going to try and solve this is based on factorization. 00:45:22.700 |
It's like, you can't go into one environment and just learn from the experiences in that one environment. 00:45:27.220 |
You have to separate out the structure and the experiences you're having 00:45:29.860 |
so that you can reuse the structural part, which appears very often in the world. 00:45:33.780 |
Okay, and so, yeah, separating memories from structure. 00:45:37.260 |
And so, you know, here's our separation of the two. 00:45:40.700 |
We have our dude wandering around this, like, 2D grid world. 00:45:44.700 |
And you want to separate out the fact that there's 2D space, 00:45:48.140 |
and it's 2D space that has these rules underlying it. 00:45:51.140 |
And in a particular instance, in the environment you're in, 00:45:53.740 |
you need to be able to recall which objects are at which locations in the environment. 00:45:58.060 |
Okay, so in this case, it's like, oh, this position has an orange house, this position doesn't. 00:46:04.220 |
And so you have to bind those two, you have to be like, whenever you realize that you're back in this position, 00:46:07.940 |
recall that that is the observation you're going to see there. 00:46:11.380 |
Okay, so this model that we're going to build is some model that tries to achieve this. 00:46:17.660 |
Yeah, and so when you enter it - imagine you enter a new environment with the same structure - 00:46:17.660 |
you wander around and realize it's the same structure, 00:46:21.620 |
all you have to do is bind the new things that you see to the locations, 00:46:23.700 |
and then you're done: you know how the world works. 00:46:26.700 |
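This is not the model the talk goes on to describe; it's just a toy numpy sketch of the factorization idea as stated so far - a reusable structural code updated by actions, plus a fast associative memory binding that code to the observations of a particular environment - with all names, sizes, and the commuting-rotation trick invented here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, d_x = 16, 16
d_g = 2 * n_blocks

def rot(angles):
    """Block-diagonal 2x2 rotations. These all commute, so N-E-S-W loops close exactly."""
    M = np.zeros((d_g, d_g))
    for i, a in enumerate(angles):
        c, s = np.cos(a), np.sin(a)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return M

theta = rng.uniform(0, 2 * np.pi, n_blocks)          # "north/south" frequencies
phi = rng.uniform(0, 2 * np.pi, n_blocks)            # "east/west" frequencies
W_a = {'N': rot(theta), 'S': rot(-theta), 'E': rot(phi), 'W': rot(-phi)}

g = np.ones(d_g) / np.sqrt(d_g)                      # structural ("grid-like") code
M = np.zeros((d_g, d_x))                             # fast Hebbian memory binding g to observations

def step(action, observation=None):
    """Path-integrate the structural code; optionally bind an observation, then recall."""
    global g, M
    g = W_a[action] @ g
    if observation is not None:
        M += np.outer(g, observation)                # write: bind this location to what you saw
    return g @ M                                     # read: recall what was bound here before

green, orange, red, pink = rng.standard_normal((4, d_x))
M += np.outer(g, green)                              # bind the green house to the start location
step('N', orange); step('E', red); step('S', pink)   # walk and bind along the way
prediction = step('W')                               # the loop closes: back at the start
print(np.corrcoef(prediction, green)[0, 1])          # should be high: we predict the green house again
```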
So this is what neuroscientists mean by a cognitive map, 00:46:33.820 |
is this idea of, like, separating out and understanding the structure that you can reuse in new situations. 00:46:38.340 |
And yeah, this model that was built in the lab is a model of this process happening, 00:46:44.020 |
of the separation between the two of them and how you use them to do new inferences. 00:46:47.340 |
And this is the bit that's supposed to look like a transformer. 00:46:50.380 |
So that's the general introduction, and then we'll dive into it a little more now. 00:47:03.180 |
So there's a long stream of evidence from spatial navigation 00:47:08.780 |
I mean, I think you can probably imagine how you yourself are doing this 00:47:14.460 |
or you're trying to understand a new task that has some structure you recognize from previously. 00:47:17.900 |
You can see how this is something you're probably doing. 00:47:20.580 |
But spatial navigation is an area in neuroscience 00:47:22.580 |
which had a huge stream of discoveries over the last 50 years, 00:47:26.060 |
and a lot of evidence of the neural basis of this computation. 00:47:29.860 |
So we're going to talk through some of those examples. 00:47:31.860 |
The earliest of these are psychologists like Tolman, who showed that rats 00:47:37.620 |
can do this kind of path integration and structural inference. 00:47:39.980 |
So the way this worked is they got put at a start position here, 00:47:43.660 |
and they got trained that this route up here got you a reward. 00:47:49.020 |
Then they were put in this new version of 00:47:51.420 |
the same thing, but they blocked off this path that takes this long, winding route, 00:47:55.500 |
and given instead a selection of all these arms to go down. 00:47:58.220 |
And they look at which path the rat goes down. 00:48:00.620 |
And the finding is that the rat goes down the one 00:48:02.900 |
that corresponds to heading off in this direction. 00:48:07.100 |
like, you know, one option of this is it's like blind memorization of actions 00:48:10.300 |
that I need to take in order to route around. 00:48:12.300 |
Instead, no, it's actually 00:48:14.180 |
embedding the reward in its understanding of 2D space 00:48:16.820 |
and taking a direct route there, even though it's never taken it before. 00:48:19.620 |
There's evidence that rats are doing this as well as us. 00:48:22.220 |
And then a series of, like, neural discoveries about the basis of this. 00:48:26.420 |
So John O'Keefe stuck an electrode in the hippocampus and recorded from neurons there. 00:48:33.700 |
So what I'm plotting here is each of these columns is a single neuron. 00:48:38.660 |
And the mouse or rat, I can't remember, is running around a square environment. 00:48:42.900 |
The black lines are the path the rodent traces out through time. 00:48:47.420 |
And you put a red dot down every time you see this individual neuron spike. 00:48:51.100 |
And then the bottom plot of this is just a smooth version of that spike rate, 00:48:54.420 |
so that firing rate, which you can think of as, like, 00:48:56.380 |
the activity of a neuron in a neural network. 00:49:00.140 |
And so these ones are called place cells because they're neurons 00:49:02.060 |
that respond in a particular position in space. 00:49:04.380 |
And in the '70s, this was, like, huge excitement, you know, 00:49:06.340 |
and people have been studying mainly, like, sensory systems and motor output. 00:49:09.300 |
And suddenly, a deep cognitive variable, place, something you never expected-- 00:49:11.980 |
you don't have a GPS sensor, but somehow there's this, like, 00:49:14.540 |
signal for what looks like position in the brain, in very, like, understandable ways. 00:49:19.940 |
The next step in-- the biggest step, I guess, in this chain of discovery 00:49:23.740 |
is the Moser Lab, which is a group in Norway. 00:49:27.180 |
They stuck an electrode in a different area of the brain. 00:49:30.540 |
And so this is the hippocampal entorhinal system we're going to be talking about. 00:49:33.300 |
And they found this neuron called a grid cell. 00:49:35.140 |
So again, the same plot structure that I'm showing here, 00:49:37.420 |
but instead, these neurons respond not in one position in a room, 00:49:39.860 |
but in, like, a hexagonal lattice of positions in a room. 00:49:43.980 |
Okay, so this-- these two, I guess, I'm showing to you because they, like, 00:49:48.260 |
really motivate the underlying neural basis of this kind of, like, 00:49:52.060 |
spatial cognition, embodying the structure of this space in some way. 00:49:56.260 |
Okay, and it's a very surprising finding - why would neurons choose to represent things this way? 00:50:00.660 |
It's, like, yeah, provoked a lot of research. 00:50:03.700 |
And broadly, there's been, like, many more discoveries in this area. 00:50:06.700 |
So there's place cells I've talked to you about, grid cells, 00:50:09.860 |
cells that respond based on the location of not yourself, but another animal, 00:50:13.860 |
cells that respond when your head is facing a particular direction, 00:50:16.740 |
cells that respond to when you're a particular distance away from an object. 00:50:20.660 |
So, like, I'm one step south of an object, that kind of cell. 00:50:28.900 |
cells that respond to-- so, like, all sorts, all kinds of structure 00:50:32.100 |
are represented in this pair of brain structures, the hippocampus here, this red area, 00:50:37.420 |
and the entorhinal cortex, this blue area here, 00:50:40.260 |
which is, yeah, conserved across a lot of species. 00:50:45.260 |
And finally, there's one finding in this area that's fun: 00:50:47.380 |
is they did an fMRI experiment on London taxicab drivers. 00:50:51.500 |
And I don't know if you know this, but the London taxicab drivers, 00:50:54.660 |
they do a thing called the Knowledge, which is a two-year-long test 00:50:58.300 |
where they have to learn every street in London. 00:51:00.500 |
And the idea is the test goes something like, 00:51:02.740 |
"Oh, there's a traffic jam here and a roadwork here, 00:51:05.060 |
and I need to get from, like, Camden Town down to Wandsworth 00:51:08.780 |
in the quickest way possible. What route would you go?" 00:51:10.700 |
And they have to tell you which route they're going to be able to take 00:51:12.260 |
through all the roads and, like, how they would replan 00:51:14.300 |
if they found a stop there, those kind of things. 00:51:16.220 |
So, it's, like, intense-- you see them, like, driving around sometimes, 00:51:18.540 |
learning all of these, like, routes with little maps. 00:51:21.220 |
They're being made a little bit obsolete by Google Maps, 00:51:24.100 |
but, you know, luckily, they got them before that-- 00:51:26.460 |
this experiment was done before that was true. 00:51:28.260 |
And so, what they've got here is a measure of the size of your hippocampus 00:51:31.740 |
using fMRI versus how long you've been a taxicab driver in months. 00:51:35.140 |
And the claim is basically the longer you're a taxicab driver, 00:51:37.020 |
the bigger your hippocampus, because of the more spatial memory work you're having to do. 00:51:40.780 |
So, that's a big set of evidence that these brain areas 00:51:46.260 |
But there's a lot of evidence that there's something more than that, 00:51:49.300 |
something non-spatial going on in these areas, okay? 00:51:52.180 |
And we're going to build these together to make the broader claim 00:51:55.100 |
about this, like, underlying structural inference. 00:51:57.580 |
And so, I'm going to talk through a couple of those. 00:52:00.820 |
The first one of these is a guy called Patient HM. 00:52:03.860 |
This is the most studied patient in, like, medical history. 00:52:07.460 |
He had epilepsy, and to cure intractable epilepsy, 00:52:11.980 |
you have to cut out the brain region that's causing these, like, seizures. 00:52:17.060 |
And in this case, the epilepsy was coming from the guy's hippocampus, 00:52:20.700 |
so they bilaterally lesioned his hippocampus. 00:52:25.100 |
And it turned out that this guy then had terrible amnesia. 00:52:28.220 |
He never formed another memory again, and he could only recall memories 00:52:30.940 |
from a long time before the surgery happened, okay? 00:52:34.580 |
But, you know, experiments on him showed a lot of this stuff 00:52:38.300 |
about how we now understand the neural basis of memory, 00:52:41.140 |
things like he could learn to do motor tasks. 00:52:45.500 |
For example, they gave him some very difficult motor coordination tasks 00:52:48.020 |
that people can't generally do, but can with a lot of practice. 00:52:52.580 |
He practiced, and was as good as other people at learning to do that. 00:52:54.580 |
He had no recollection of ever doing the task. 00:52:56.340 |
So, he'd go in to do this new task and be like, 00:52:58.220 |
"I've never seen this before. I have no idea what you're asking me to do." 00:53:03.820 |
There's some evidence there that the hippocampus is involved in memory, 00:53:07.020 |
which seems a bit separate to this stuff about space. 00:53:13.220 |
So, this is actually a paper by Demis Hassabis, 00:53:15.220 |
who, before he was DeepMindHead, was a neuroscientist. 00:53:19.500 |
And here, maybe you can't read that. I'll read some of these out. 00:53:23.220 |
You're asked to imagine you're lying on a white sandy beach in a beautiful tropical bay. 00:53:27.220 |
And so, the control, this bottom one, says things like, 00:53:28.940 |
"It's very hot and the sun is beating down on me. 00:53:30.500 |
The sand underneath me is almost unbearably hot. 00:53:33.020 |
I can hear the sounds of small wavelets lapping on the beach. 00:53:36.940 |
You know, like, so a nice, lucid description of this beauty scene. 00:53:41.220 |
Whereas the person with a hippocampal damage says, 00:53:44.820 |
"As for seeing, I can't really, apart from just the sky. 00:53:47.020 |
I can hear the sound of seagulls and of the sea. 00:53:50.300 |
I can feel the grain of sand beneath my fingers." 00:53:56.020 |
They really struggle to do this imagination scenario. 00:53:58.220 |
Some of the things written in these are, like, very surprising. 00:54:00.540 |
So, the last of these is this transitive inference task. 00:54:06.260 |
So, transitive inference, A is greater than B, 00:54:08.380 |
B is greater than C, therefore, A is greater than C. 00:54:11.660 |
And the way they convert this into a rodent experiment 00:54:14.180 |
is you get given two pots of food that have different smells. 00:54:23.980 |
And so, these are colored by the two pots by their smell, A and B. 00:54:28.380 |
And the rodent has to learn to go to a particular pot based on its smell. 00:54:33.420 |
They learn that A has the food when it's presented in a pair with B, 00:54:36.460 |
and B has the food when it's presented in a pair with C. 00:54:40.700 |
Then the question is what they do when presented with A and C, a completely new situation. 00:54:43.260 |
If they have a hippocampus, they'll go for A over C. 00:54:51.740 |
This is like, oh, I've shown you how hippocampus is used 00:54:53.540 |
for this spatial stuff that people have been excited about. 00:54:55.900 |
But there's also all of this kind of relational stuff, 00:55:15.660 |
and we're just trying to build all of these things together. 00:55:19.460 |
this is called the Stretchy Birds Task, okay? 00:55:24.980 |
You take human subjects and you make them navigate, but navigate in bird space. 00:55:40.100 |
the bird's neck gets longer and shorter, okay? 00:55:42.900 |
And the patients sit there, or subjects sit there, 00:55:55.260 |
and then they're asked to do some navigational tasks. 00:55:57.260 |
They're like, "Oh, whenever you're in this place 00:55:59.860 |
"in 2D space, you show Santa Claus next to the bird." 00:56:06.660 |
So they learn to associate that particular place in 2D space with the Santa Claus. 00:56:08.940 |
And you're asked to go and find the Santa Claus again 00:56:15.380 |
And the claim is that these people use grid cells. 00:56:25.460 |
And the way they show that is you look at the fMRI signal in the entorhinal cortex: 00:56:37.100 |
you get this six-fold symmetric signal, waving up and down 00:56:41.860 |
as you head in particular directions in this 2D space. 00:56:44.700 |
So, it's like evidence that this system is being used not just for physical space, 00:56:48.500 |
but for any cognitive task with some underlying structure 00:56:51.180 |
that you can extract; you use it to do these tasks. 00:56:54.620 |
- Is there significance to bird space also being 2D here? 00:57:04.380 |
but people have done things like look at how grid cells... 00:57:22.060 |
Yeah, but definitely, I think they've done it... 00:57:29.820 |
So, in this case, you hear a sequence of sounds 00:57:33.260 |
So, it's like how there's months, weeks, days, and meals, 00:57:50.820 |
that the structure is all represented hierarchically. 00:57:59.460 |
you've got very large length-scale grid cells 00:58:01.220 |
that are, like, responding to large variations in space. 00:58:05.260 |
and you see the same thing recapitulated there. 00:58:07.100 |
The, like, meals cycle, which cycles a lot quicker, 00:58:09.660 |
is represented in one end of the entorhinal cortex 00:58:11.540 |
in fMRI, and the months cycle is at the other end, 00:58:15.260 |
So, there's some, yeah, evidence to that end. 00:58:22.620 |
Another brain area that people don't look at as much is the lateral entorhinal cortex. 00:58:28.580 |
And basically, the main bit that you should be aware of 00:58:31.980 |
is that it seems to represent things at a very high level: 00:58:34.060 |
the similarity structure in the lateral entorhinal cortex 00:58:36.660 |
seems to be, like, a very high-level semantic one. 00:58:40.860 |
and you look at how, you know, in the visual cortex, 00:58:42.900 |
things are more similarly represented if they look similar, 00:58:45.540 |
but by the time you get to the lateral entorhinal cortex, 00:58:47.340 |
things look more similar based on their usage. 00:58:49.060 |
For example, like, an ironing board and an iron are represented similarly 00:58:54.020 |
because they're somehow, like, semantically linked. 00:59:04.940 |
So the neural implementation of this cognitive map, 00:59:12.580 |
So for some structures, like transitive inference, 00:59:15.300 |
this one is faster than that, and it's faster than that, 00:59:17.300 |
or family trees, like this person is my mother's brother 00:59:20.820 |
and is therefore my uncle, those kinds of things. 00:59:24.740 |
These are abstract structures that you'll want to be able to use in many situations. 00:59:42.020 |
These diagrams here are supposed to represent 00:59:44.060 |
a particular environment that you're wandering around. 00:59:47.460 |
and you see a set of stimuli at each point on the grid, 00:59:53.180 |
and the model separates out this, like, 2D structural grid from the particular stimuli. 00:59:57.100 |
And the mapping to the things I've been showing you 00:59:58.860 |
is that this grid-like code is actually the grid cells, 01:00:07.020 |
and the sensory code is the one encoding these semantically meaningful similarities, 01:00:11.100 |
so it's just, like, this is what I'm seeing in the world. 01:00:16.900 |
So yeah, in more diagrams, we've got G, the structural code, 01:00:30.460 |
- Sorry, I can't hear you if you're asking a question. 01:00:51.300 |
So yeah, we got the hippocampus in the middle, 01:01:00.300 |
So I'm gonna step through each of these three parts 01:01:05.300 |
and then come back together and show the full model. 01:01:08.500 |
So lateral entorhinal cortex encodes what you're seeing. 01:01:15.620 |
and that would just be some vector XT that's different. 01:01:17.620 |
So a random vector, different for every symbol. 01:01:28.660 |
So this means receiving a sequence of actions 01:01:35.300 |
So it's somehow the bit that embeds the structure of the world 01:01:38.140 |
and the way that we'll do that is this g of t, 01:01:40.620 |
this vector of activities in this brain area, which gets updated by an action-dependent matrix. 01:01:48.340 |
So if you step north, you update the representation with the "step north" matrix. 01:01:52.700 |
And those matrices are gonna have to obey some rules. 01:01:54.540 |
For example, if you step north, then step south, you should end up back where you started, 01:01:57.380 |
and so the step north matrix and the step south matrix have to be inverses of one another. 01:02:03.060 |
And that set of matrices represents the structure of the world somehow. 01:02:16.900 |
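(To make those two input streams concrete, here is a minimal sketch in Python. It is my own toy construction, not the model's actual learned parameterization: the sensory code is just a fixed random vector per made-up symbol, and the action matrices are hand-built from rotation blocks at made-up frequencies, chosen so that stepping north and then south provably returns the positional code to where it started.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Sensory code (LEC-like): a fixed random vector for every symbol you might see.
n_x = 32
symbols = ["tree", "rock", "pond", "bench"]                  # made-up stimuli
x_code = {s: rng.standard_normal(n_x) for s in symbols}

# Structural code (MEC-like): path integration with action-dependent matrices.
# g is built from 2x2 rotation blocks ("modules" at made-up frequencies); east/west
# rotate one set of blocks and north/south the other, so step-south is exactly the
# inverse of step-north, which is the constraint mentioned in the talk.
freqs = [0.2, 0.5, 1.0, 2.0]
n_g = 4 * len(freqs)                                         # 2 dims per (axis, frequency) block

def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def action_matrix(dx, dy):
    """Block-diagonal matrix that path-integrates one step (dx, dy)."""
    blocks = [rot(f * dx) for f in freqs] + [rot(f * dy) for f in freqs]
    M = np.zeros((n_g, n_g))
    for i, b in enumerate(blocks):
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = b
    return M

W = {"north": action_matrix(0, 1), "south": action_matrix(0, -1),
     "east":  action_matrix(1, 0), "west":  action_matrix(-1, 0)}

g = np.tile([1.0, 0.0], n_g // 2)                            # starting positional code g_0
g_roundtrip = W["south"] @ (W["north"] @ g)                  # step north, then step south
assert np.allclose(g_roundtrip, g)                           # back where you started
```

In the real model these matrices are learned from the action sequence alone, as comes up in the questions later; the rotation blocks here just make the inverse-actions constraint easy to see.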
Then the hippocampus does the memory part, and that's gonna be through a version of these things called Hopfield networks. 01:02:27.540 |
The way it works is you have a set of activities, P, 01:02:30.340 |
which are the activities of all these neurons. 01:02:38.860 |
You've got a weight matrix storing the memories and some non-linearity, and you run it forward in time, 01:02:40.860 |
and it just, like, settles into some dynamical attractor. 01:02:57.020 |
And then, yeah, this is just writing that down. 01:03:06.460 |
If the current activity of the hippocampal neurons is close to some memory, 01:03:08.660 |
say chi mu, then this dot product will be much larger than for the other memories. 01:03:15.260 |
So this sum over all of them will basically be dominated by that closest memory. 01:03:27.500 |
you can see how this like similarity between points 01:03:38.460 |
And so some cool things you can do with these systems: here's an image 01:03:44.820 |
that someone's encoded in a Hopfield network, 01:03:46.780 |
and then someone's presented a corrupted version of this image to the network 01:03:49.340 |
and asked it to just run to its, like, dynamical attractor, and it fills the image back in. 01:04:12.660 |
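(A minimal sketch of that pattern-completion demo, assuming a standard binary Hopfield network with Hebbian storage; the "image" here is just a random binary pattern, but the mechanics are the same.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary Hopfield network: Hebbian storage plus attractor dynamics.
N, n_patterns = 200, 5
patterns = rng.choice([-1, 1], size=(n_patterns, N))        # the stored memories chi^mu
W = (patterns.T @ patterns) / N                             # Hebbian outer-product rule
np.fill_diagonal(W, 0)

corrupted = patterns[0].copy()
corrupted[:60] *= -1                                        # flip 30% of the bits

state = corrupted
for _ in range(20):                                         # run the dynamics to the attractor
    state = np.sign(W @ state)
    state[state == 0] = 1

print("bits recovered:", int((state == patterns[0]).sum()), "out of", N)
```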
- Is this like the modern interpretation, the Hopfield network? 01:04:16.060 |
- This one is actually... which interpretation, sorry? 01:04:32.140 |
And the link between attention and modern is precise, 01:04:35.740 |
the link with like classic is not as precise. 01:04:38.700 |
I mean, yeah, yeah. 01:04:42.300 |
- The modern Hopfield network is the continuous version, 01:04:45.340 |
but with the original Hopfield network it's flatter, 01:04:51.500 |
'cause then you have to do the exponentiation thing. 01:04:56.060 |
We'll maybe get to it later, you can tell me. 01:05:11.820 |
So that's basically how our system's gonna work. 01:05:18.780 |
And so the patterns you wanna store in the hippocampus, 01:05:24.980 |
are a combination of the position and the input. 01:05:32.340 |
That way, if you come back to a position, you can recall the stimulus that you saw there, 01:05:40.300 |
or, seeing a familiar stimulus, you can correct yourself: oh, I path integrated wrong, I must actually be here. 01:05:42.900 |
Assuming there's usually more than one thing in the world 01:05:51.380 |
Does the whole Tolman-Eichenbaum Machine make sense, 01:05:58.780 |
And basically, this last bit is saying it's really good. 01:06:18.180 |
And on the y-axis is how much you correctly predict, 01:06:22.660 |
and on the x-axis, how many of those types of environments you've seen before. 01:06:28.100 |
Over time, as you see more and more of these environments, 01:06:35.300 |
you generalize to the new situation and predict what you're gonna see. 01:06:52.260 |
But this thing is able to do it much more cleverly, 01:06:58.460 |
and follows the number-of-nodes-visited learning curve. 01:07:20.860 |
and plotted is the firing rate of that neuron, 01:07:23.340 |
whereas the ones in the medial entorhinal cortex 01:07:32.380 |
operates on this free bed space in the center. 01:07:58.180 |
as I was saying, there are these different modules of grid cells at different length scales. 01:08:04.820 |
And so you could imagine how that could be useful 01:08:16.140 |
And so, like, an adaptable set of length scales 01:08:56.660 |
So we, yeah, take the outer product of those. 01:09:02.540 |
So every element in X gets to see every element of G, 01:09:13.780 |
except, to recall, you flatten with an identity in the x slot: 01:09:18.180 |
you set x to be the identity, you do this operation, 01:09:22.020 |
you put that in and you let it run its dynamics, 01:09:25.900 |
and the learned network, like, traces out the x from that. 01:09:30.020 |
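(Here is a toy sketch of that conjunctive memory. Everything here is illustrative: the sizes are made up, the all-ones query stands in for the "set x to the identity" trick just described, and retrieval uses a single softmax step rather than actually running attractor dynamics.)

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_g, n_mem = 20, 32, 8

X = rng.choice([0.0, 1.0], size=(n_mem, n_x))          # sensory codes x^mu (binary here)
G = rng.standard_normal((n_mem, n_g))                   # positional codes g^mu
G /= np.linalg.norm(G, axis=1, keepdims=True)

# Hippocampal memories: flattened outer products, so every element of x sees every element of g.
H = np.stack([np.outer(X[m], G[m]).ravel() for m in range(n_mem)])

# Recall what you saw at position g_t: query with the x slot "blanked out" (all ones here),
# so similarity is dominated by the positional part, then unbind x from the retrieved matrix.
g_t = G[3]                                              # path integration says we're at memory 3's location
query = np.outer(np.ones(n_x), g_t).ravel()
scores = 10.0 * (H @ query)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                                # softmax-style competition between memories
M = (weights @ H).reshape(n_x, n_g)                     # retrieved conjunction, back in matrix form
x_hat = (M @ g_t > 0.5).astype(float)                   # unbind: multiply out the positional code

print("recovered the stored stimulus:", bool(np.array_equal(x_hat, X[3])))
```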
- And the figures you show, if you go down a bit? 01:09:34.260 |
- Yeah, it's hard to see, but what's on the X axis 01:09:37.500 |
and what, like, what, are you training a Hopfield network 01:09:42.500 |
with this flattened outer product of the, okay. 01:09:46.980 |
- Yeah, the actual, the training that's going on 01:09:53.180 |
All it gets told is which action type it's taken, 01:09:55.580 |
and it has to learn the fact that stepping east then stepping west brings you back. 01:09:58.860 |
So all of the learning of the structure is in those matrices, 01:10:04.180 |
- There's also, I mean, 'cause the Hopfield network learning, 01:10:10.300 |
So it's less, like, that's less the bit that's true. 01:10:20.820 |
because it's learning the structure of the task. 01:10:29.740 |
- The initial paper was actually classic Hopfield networks, 01:10:37.780 |
and then, in so much as modern Hopfield networks are attention, the later version uses those. 01:10:45.460 |
But then you're, okay, and then you have some results, 01:10:48.940 |
there are some results looking at activations. 01:10:57.700 |
- So this is, the left ones are neurons in the G section, 01:11:15.740 |
Cool, TEM is approximately equal to transformer, yeah. 01:11:21.980 |
But I guess my notation, at least we can clarify that. 01:11:29.260 |
So in a transformer you've got the token embedding, and the positional embedding will play a very big role here. 01:11:32.180 |
That's the E, and together they make this vector H. 01:11:38.820 |
And attention says that you compute some similarity between the key and the query, 01:11:42.220 |
and then you add up the values, weighted by those similarities. 01:11:53.820 |
And the way these map onto each other is that the g is the positional encoding, 01:12:04.260 |
and you try and recall which memory is most similar to your current state: 01:12:13.380 |
you compare the current GT to all of the previous GTs, 01:12:17.220 |
and you recall the ones with high similarity structure, 01:12:32.660 |
So the changes relative to standard attention, and how to make it map onto this, are the following. 01:12:35.620 |
The first of these is that the keys and the queries, 01:12:41.740 |
or rather the matrix that maps from tokens to keys and tokens to queries, 01:12:45.540 |
only depends on the positional encoding. 01:12:59.580 |
So the key at time tau is, like, some key matrix applied only to the positional embedding at time tau. 01:13:09.020 |
The second is that the value at time tau is, like, some value matrix applied only to the stimulus at time tau. 01:13:12.860 |
So that's the only bit you want to, like, recall, I guess: 01:13:22.420 |
the stimuli that have arrived at time points along the path. 01:13:27.460 |
And finally, the perhaps, like, weird and interesting 01:13:29.540 |
difference is that there's this path integration going on for the positional encodings. 01:13:34.100 |
So these E are the equivalent of the grid cells, 01:13:37.300 |
and they're going to be updated through these matrices 01:13:39.020 |
that depend on the actions you're taking in the world. 01:13:41.300 |
Yeah, so that's basically the correspondence. 01:13:47.500 |
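(One way to write that correspondence down as code; this is my own paraphrase of the mapping, with invented variable names, not the paper's implementation.)

```python
import numpy as np

def tem_style_attention(G_past, X_past, g_now, beta=8.0):
    """Recall a stimulus from memory the way the correspondence above describes it.

    G_past : (T, n_g) positional codes at past time steps  -> the keys
    X_past : (T, n_x) stimuli observed at those time steps -> the values
    g_now  : (n_g,)   current path-integrated positional code -> the query

    Keys and queries depend only on the positional code; values depend only on
    the sensory inputs; the stimuli never enter the key/query comparison at all.
    """
    scores = beta * (G_past @ g_now)          # similarity of current position to past positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over past time steps
    return weights @ X_past                   # weighted recall of the stimuli seen there
```

With path-integrated positional codes like the ones sketched earlier, revisiting a location makes g_now nearly identical to one stored key, so the softmax puts almost all of its weight on that single memory.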
So I talked about how the Hopfield network is approximately doing attention, 01:13:57.300 |
which, if you remove the non-linearity, looks like this. 01:14:00.300 |
And the mapping, I guess, is, like, the hippocampal activity plays the role of the query: 01:14:10.300 |
you're doing this dot product to get the current similarity to each stored memory. 01:14:21.980 |
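(Written out in the notation used here, and only as a paraphrase of what is on the slide: the linearized retrieval and the modern softmax retrieval differ just in the separation function applied to the similarities.)

$$p_{t+1} \;=\; \sum_{\mu} \chi^{\mu}\,\big(\chi^{\mu}\cdot p_t\big) \qquad \text{(non-linearity removed)}$$

$$p_{t+1} \;=\; \sum_{\mu} \chi^{\mu}\,\frac{\exp\!\big(\beta\,\chi^{\mu}\cdot p_t\big)}{\sum_{\nu}\exp\!\big(\beta\,\chi^{\nu}\cdot p_t\big)} \qquad \text{(modern, softmax version)}$$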
But actually, these classic Hopfield networks are quite bad. 01:14:24.580 |
They, like, in some senses, they tend to fail once you store too many memories; 01:14:35.020 |
that's, like, a big result from statistical physics in the '80s (the capacity is only around 0.14N patterns for N neurons). 01:14:48.940 |
You basically, like, look too similar to too many things. 01:15:00.020 |
The fix is to instead ask, oh, how similar am I to this particular pattern, 01:15:04.700 |
over how similar am I to all the other ones, 01:15:08.660 |
and that normalized similarity is the key move in the modern Hopfield one. 01:15:16.300 |
yeah, it's basically doing the attention mechanism. 01:15:29.700 |
So you've got a query, and you're gonna compare that to each chi mu, 01:15:34.700 |
with one memory neuron for each pattern that you've memorized, mu, 01:15:37.060 |
and the weights into this memory neuron will be this chi mu, 01:15:44.100 |
and then you're gonna do divisive normalization 01:15:49.940 |
so like to make them compete with one another 01:15:51.820 |
and only recall the memories that are most similar 01:16:10.660 |
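(A minimal sketch of that two-layer circuit, with made-up sizes: the first layer's weights are the stored patterns, so each memory neuron's drive is a dot product with the query; divisive normalization makes them compete; the second layer projects back out through the same patterns.)

```python
import numpy as np

def modern_hopfield_recall(memories, query, beta=4.0):
    """Two-layer circuit version of the recall just described.

    Layer 1: one memory neuron per stored pattern chi^mu, whose input weights are
             chi^mu, so its drive is the dot product chi^mu . query.
    The divisive normalization makes the memory neurons compete (a softmax).
    Layer 2: project back out through the same patterns to reconstruct the memory.
    """
    drive = memories @ query                        # chi^mu . query for every memory neuron
    act = np.exp(beta * (drive - drive.max()))
    act /= act.sum()                                # divisive normalization
    return act @ memories                           # weighted recall of the stored patterns

rng = np.random.default_rng(3)
memories = rng.choice([-1.0, 1.0], size=(10, 64))                # ten stored patterns
noisy = memories[0] * rng.choice([1, 1, 1, -1], size=64)         # pattern 0 with ~25% of bits flipped
recalled = modern_hopfield_recall(memories, noisy)
print("overlap with the true memory:", round(float(recalled @ memories[0]) / 64, 3))
```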
So that's a circuit that could biologically plausibly run this modern Hopfield network. 01:16:16.740 |
- Do you have any thoughts on what the memories 01:16:20.700 |
Like if you attend over every memory you've ever made, 01:16:29.100 |
'cause we like wipe this poor agent's memory every time 01:16:31.740 |
and only memorize things from the environment, 01:16:46.700 |
The claim is basically that like somehow as time passes, 01:16:50.420 |
the representations just slowly, like, rotate or something, 01:16:57.300 |
so the closer two events are in time, the more you're, like, in the same rotated thing. 01:17:04.940 |
But there's evidence and a lot of debate around that. 01:17:16.420 |
You know, if you know you're in the same context, 01:17:34.260 |
It computes similarity using these positional encodings, 01:17:39.220 |
but otherwise it looks a bit like a transformer setup. 01:17:42.540 |
And here's the setup: MEC, LEC, hippocampus, and place cells. 01:17:49.580 |
the last thing I think I'm gonna say is that, like, 01:17:54.700 |
previously you had to do this outer product and flatten, 01:17:56.940 |
and that has, like, terrible scaling with dimensionality. 01:17:59.180 |
For example, if you wanna do position and context and the sensory input, 01:18:03.180 |
suddenly I'm telling you to outer product three vectors 01:18:04.980 |
and flatten that, and that's a much, much bigger vector, 01:18:07.860 |
rather than what you'd like, which is just, like, 3N. 01:18:13.580 |
Whereas this version with the new modern Hopfield network does scale nicely: 01:18:15.900 |
you add a context input as just another input 01:18:18.220 |
into what was previously this, like, modern Hopfield network. 01:18:25.180 |
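(To put rough, purely hypothetical numbers on that scaling point: with 100-dimensional sensory, positional, and context vectors, the flattened three-way outer product has 100 x 100 x 100 = 1,000,000 dimensions, whereas feeding them in as three separate inputs costs only 100 + 100 + 100 = 300.)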
And this has proved somewhat interesting as a two-way relationship: 01:18:31.620 |
the hippocampus now looks like this modern Hopfield network that has all of this attention structure, 01:18:36.860 |
whereas previously we just had these, like, memory bits 01:18:38.900 |
in the classic Hopfield network in the hippocampus. 01:18:41.820 |
about different place cell structures in the hippocampus 01:18:48.900 |
maybe there's a few things that are slightly different, 01:18:51.100 |
like this learnable recurrent positional encoding. 01:18:57.700 |
but maybe this is like some motivation to try, 01:19:00.780 |
for example, they don't do it with weight matrices 01:19:02.420 |
and these weight matrices are very biased towards, 01:19:12.620 |
The other thing is this is like one attention layer only. 01:19:14.900 |
And so, like, somehow by building in this nice extra structure, 01:19:19.100 |
making the task very easy in terms of like processing X 01:19:23.260 |
you've got it to solve the task with just one of these. 01:19:33.100 |
We know that the position encoding looks like grid cells. 01:19:43.660 |
I was going to tell you all about grid cells, 01:19:44.500 |
which are my hobby horse, but I don't think there's time. 01:20:14.820 |
Let me tell you more about the grid cell system. 01:20:21.060 |
- Because you got electrodes stuck in here, right? 01:20:25.740 |
the classic measuring technique is a tetrode, 01:20:34.740 |
and you can tell that that particular spike that they measured came from a particular cell 01:20:36.180 |
because of the pattern of activity on the four wires. 01:20:49.540 |
Nearby neurons have firing lattices that are just translated versions of one another, 01:20:57.740 |
and other modules do the same thing but with a lattice that's much bigger or much smaller. 01:21:01.340 |
So this is a very surprising crystalline structure, 01:21:05.740 |
where each neuron's response is just a translated version of another one's. 01:21:13.380 |
and work out where you are in the environment, 01:21:22.780 |
this was, like, what's really fascinating about the whole thing. 01:21:25.060 |
Is this a product of evolution or a product of learning? 01:21:40.180 |
It seems to be, like, very biased towards being created by development, 01:21:46.660 |
and then we saw how it was being co-opted to encode other things. 01:21:55.820 |
The fMRI stuff suggests that there's some, like, more flexibility, 01:21:59.300 |
but it'd be cool to get neural recordings of it. 01:22:04.580 |
Actually, let's give our speaker another round of applause.