
Stanford CS25: V2 I Neuroscience-Inspired Artificial Intelligence


Whisper Transcript | Transcript Only Page

00:00:00.000 | It's fun to be here.
00:00:11.620 | So the work I'm presenting today, title of it is Attention Approximates Sparse Distributed
00:00:16.660 | Memory.
00:00:17.660 | And this was done in collaboration with Cengiz Pehlevan, and my PhD advisor is Gabriel Kreiman.
00:00:24.860 | So why should you care about this work?
00:00:27.300 | We show that the heuristic attention operation can be implemented with simple properties
00:00:31.260 | of high dimensional vectors in a biologically plausible fashion.
00:00:34.860 | So the transformer and attention, as you know, are incredibly powerful, but they were heuristically
00:00:40.740 | developed.
00:00:41.740 | And the softmax operation and attention is particularly important, but also heuristic.
00:00:47.500 | And so we show that the intersection of hyperspheres that is used in sparse distributed memory closely
00:00:53.540 | approximates the softmax and attention more broadly, both in theory and with some experiments
00:00:59.500 | on trained transformers.
00:01:01.540 | So you can see SDM, sparse distributed memory, as preempting attention by approximately 30
00:01:07.100 | years.
00:01:08.100 | It was developed back in 1988.
00:01:10.220 | And what's exciting about this is that it meets a high bar for biological plausibility.
00:01:14.220 | Hopefully I have time to actually get into the wiring of the cerebellum and how you can
00:01:17.040 | map each operation to part of the circuit there.
00:01:21.940 | So first I'm going to give an overview of sparse distributed memory.
00:01:25.940 | Then I have a transformer attention summary, but I assume that you guys already know all
00:01:30.740 | of that.
00:01:31.740 | We can get there and then decide how deep we want to go into it.
00:01:35.300 | I'll then talk about how actually attention approximates SDM, interpret the transformer
00:01:40.920 | more broadly, and then hopefully there's time to go into SDM's biological plausibility.
00:01:47.020 | Also I'm going to keep everything high level visual intuition and then go into the math,
00:01:51.860 | but stop me and please ask questions, literally whenever.
00:01:56.940 | So sparse distributed memory is motivated by the question of how the brain can read
00:02:01.140 | and write memories in order to later retrieve the correct one.
00:02:05.320 | And some considerations that it takes into account are high memory capacity, robustness
00:02:08.940 | to query noise, biological plausibility, and some notion of fault tolerance.
00:02:14.180 | SDM is unique from other associative memory models that you may be familiar with, like
00:02:18.740 | Hopfield networks, in so much as it's sparse.
00:02:23.160 | So it operates in a very high dimensional vector space, and the neurons that exist in
00:02:27.740 | this space only occupy a very small portion of possible locations.
00:02:32.700 | It's also distributed, so all read and write operations apply to all nearby neurons.
00:02:40.220 | It is actually, as a side note, Hopfield networks, if you're familiar with them, are a special
00:02:44.260 | case of sparse distributed memory.
00:02:46.180 | I'm not going to go deep into that now, but I have a blog post on it.
00:02:51.660 | Okay.
00:02:52.740 | So first we're going to look at the write operation for sparse distributed memory.
00:02:57.380 | We're in this high dimensional binary vector space.
00:03:01.000 | We're using Hamming distance as our metric for now, and we'll move to continuous values later.
00:03:06.140 | And we have this green pattern, which is represented by the solid dot, and the hollow circles are
00:03:12.780 | these hypothetical neurons.
00:03:15.260 | So think of everything quite abstractly, and then we'll map to biology later.
00:03:20.300 | So this pattern has a write radius, which is some Hamming distance.
00:03:24.260 | It activates all of the neurons within that Hamming distance, and then here I just note
00:03:30.500 | that each of those neurons are now storing that green pattern, and the green pattern
00:03:35.380 | has disappeared.
00:03:36.380 | So I'm keeping track of this location with this kind of fuzzy hollow circle.
00:03:38.940 | That'll be relevant later.
00:03:41.060 | So we're writing in another pattern, this orange one.
00:03:44.940 | And note here that neurons can store multiple patterns inside of them, and formally this
00:03:49.660 | is actually a superposition, or just a summation, of these high dimensional vectors.
00:03:53.460 | Because they're high dimensional, you don't have that much cross talk, so you can get
00:03:57.100 | away with it.
00:03:58.100 | But for now, you can just think of it as a neuron can store multiple patterns.
00:04:01.020 | Finally, we have a third pattern, this blue one.
00:04:04.740 | We're writing it in another location, and yeah.
00:04:08.340 | So again, we're keeping track of the original pattern locations, but they can be triangulated
00:04:13.540 | from the nearby neurons that are storing them.
00:04:17.580 | And so we've written in three patterns, now we want to read from the system.
00:04:20.580 | So I have this pink star, Xi, it appears it has a given-- it's represented by a given
00:04:27.020 | vector, which has a given location in space, it activates nearby neurons again.
00:04:31.980 | But now the neurons output the patterns that they stored previously.
00:04:37.700 | And so you can see that based upon its location, it's getting four blue patterns, two orange
00:04:43.180 | and one green.
00:04:44.180 | And it then does a majority rule operation, where it updates towards whatever pattern
00:04:51.420 | it's seeing the most of.
00:04:51.420 | So in this case, because blue is actually a majority, it's just going to update completely
00:04:56.420 | towards blue.
00:04:57.420 | Again, I'll formalize this more in a bit, but this is really to give you intuition for
00:05:01.820 | the core operations of SDM.
00:05:05.620 | So the key thing to relate this back to attention is actually to abstract away the neurons that
00:05:12.780 | are operating under the hood, and just consider the circle intersection.
00:05:16.660 | And so what each of these intersections between the pink read circles and each of the write
00:05:22.300 | circles means is that intersection is the neurons that both store that pattern that
00:05:27.860 | was written in, and are now being read from by the query.
00:05:32.060 | And the size of that intersection corresponds to how many patterns the query is then going
00:05:37.940 | to read.
00:05:38.940 | And so formally, we define the number of neurons in this circle intersection as the
00:05:55.940 | cardinality of the intersection between the set of neurons activated by the pattern and the set activated by the query.
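
As a rough sketch of the write and read operations just described (the dimensions, neuron count, and radius below are my own toy choices, not the speaker's code):

```python
# Toy binary SDM: random neuron addresses, superposition storage, majority-rule readout.
import numpy as np

rng = np.random.default_rng(0)
n, m, radius = 64, 10_000, 22          # vector length, neuron count, Hamming read/write radius

neuron_addresses = rng.integers(0, 2, size=(m, n))   # fixed random neuron locations
neuron_contents  = np.zeros((m, n))                  # each neuron stores a sum (superposition) of patterns
neuron_counts    = np.zeros(m)                       # how many patterns each neuron has stored

def hamming(a, B):
    # Hamming distance from vector a to every row of B
    return (a[None, :] != B).sum(axis=1)

def write(pattern):
    # activate every neuron within the write radius and add the pattern into its store
    active = hamming(pattern, neuron_addresses) <= radius
    neuron_contents[active] += pattern
    neuron_counts[active] += 1

def read(query):
    # pool the stored patterns of all neurons within the read radius,
    # then majority-rule the pooled bit counts back to a binary vector
    active = hamming(query, neuron_addresses) <= radius
    pooled = neuron_contents[active].sum(axis=0)
    total = neuron_counts[active].sum()
    return (pooled > total / 2).astype(int)

patterns = rng.integers(0, 2, size=(3, n))           # write in three random patterns
for p in patterns:
    write(p)

noisy = patterns[0].copy()
noisy[:5] ^= 1                                       # flip 5 bits to make a noisy query
print((read(noisy) == patterns[0]).mean())           # should usually print 1.0: the query is cleaned up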
00:05:55.940 | Okay, are there any questions, like at a high level before I get more into the math?
00:06:03.380 | I don't know if I can check, is it easy for me to check Zoom?
00:06:07.980 | Nah, sorry, Zoom people, I'm not going to check.
00:06:10.700 | Okay.
00:06:11.700 | So the neurons, they're randomly distributed in this space?
00:06:14.620 | Yes, yeah, yeah.
00:06:15.620 | And there's later, there's more recent work that they can learn and update their location
00:06:20.460 | to tile the manifold.
00:06:22.060 | But in this, you can assume that they're randomly initialized binary high dimensional vectors.
00:06:27.580 | Okay, so this is the full SDM update rule.
00:06:34.260 | I'm going to break it down.
00:06:36.220 | So the first thing that you do, so this is this is for reading, to be clear, so you've
00:06:40.500 | already written patterns into your neurons.
00:06:42.860 | So the first thing you do is you weight each pattern by the size of its circle intersection.
00:06:48.280 | So the circle intersection there for each pattern.
00:06:53.740 | Then you sum over all of the patterns that have been written into this space.
00:06:58.260 | So you're just doing a weighted summation of them.
00:07:01.900 | And then there's this normalization by the total number of intersections that you have.
00:07:09.620 | And finally, because at least for now, we're working in this binary space, you map back
00:07:13.700 | to binary, just seeing if each of the values is greater than a half.
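
One way to write that full read rule down, in my own notation (not necessarily the slide's): let $O(\mathbf{x}, d)$ be the set of neurons within Hamming distance $d$ of address $\mathbf{x}$, let $\mathbf{p}^a_\mu$ and $\mathbf{p}^p_\mu$ be the address and stored content of the $\mu$-th written pattern, and let $\boldsymbol{\xi}$ be the query:

$$
\boldsymbol{\xi}^{\text{new}} \;=\; g\!\left(\frac{\sum_{\mu} \big|O(\mathbf{p}^a_\mu, d)\cap O(\boldsymbol{\xi}, d)\big|\;\mathbf{p}^p_\mu}{\sum_{\mu} \big|O(\mathbf{p}^a_\mu, d)\cap O(\boldsymbol{\xi}, d)\big|}\right),
\qquad
g(z)_i = \begin{cases}1 & \text{if } z_i > \tfrac{1}{2}\\ 0 & \text{otherwise,}\end{cases}
$$

that is, weight each stored pattern by its circle intersection with the query, normalize, and binarize.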
00:07:21.100 | Okay, how familiar are people with attention?
00:07:24.380 | I looked at like the previous talks you've had, they seem quite high level.
00:07:28.380 | Like, can you guys write the attention equation for me?
00:07:31.740 | Is that like, can I get thumbs up if you can do that?
00:07:35.860 | Yeah, okay, I'm not like, I'll go through this, but I'll probably go through it faster
00:07:42.620 | than otherwise.
00:07:44.180 | So when I first made this presentation, like, this was the state of the art for transformers,
00:07:48.180 | which was like AlphaFold.
00:07:52.140 | And so it's kind of funny, like how far things have come now, I don't need to tell you that
00:07:55.220 | transformers are important.
00:07:58.440 | So yeah, I'm going to work with this example.
00:08:01.780 | Well, okay, I'm going to work with this example here, the cat sat on the blank.
00:08:07.260 | And so we're in this setting, we're predicting the next token, which hypothetically is the
00:08:13.220 | word mat.
00:08:14.220 | And so there are kind of four things that the attention operation is doing.
00:08:19.100 | The first one up here is it's generating what are called keys, values and queries.
00:08:22.580 | And again, I'll get into the math in a second, I'm just trying to keep it high level first.
00:08:26.980 | And then we're going to compare our query with each of the keys.
00:08:32.140 | So the word the, which is closest to the word we're next predicting is our query, and we're
00:08:37.140 | seeing how similar it is each of the key vectors.
00:08:43.380 | We then based upon that similarity, do this softmax normalization, so that all of the
00:08:50.060 | attention weights sum to one, and then we sum together their value vectors to use to
00:08:57.220 | propagate to like the next layer or use as our prediction.
00:09:02.140 | And so at a high level, you can think of this as like the query word the is looking for
00:09:08.100 | nouns and their associated verbs.
00:09:10.400 | And so hypothetically, it has a high similarity with words like cat and sat, or their keys.
00:09:17.300 | So this then gives large weight to the cat and sat value vectors, which get moved to
00:09:22.380 | the next part of the network.
00:09:24.260 | And the cat value vector hypothetically, contains a superposition of other animals like mice,
00:09:30.140 | and maybe words that rhyme with mat.
00:09:32.180 | And the sat vector also contains things that are sat on including mat.
00:09:37.180 | And so what you actually get from the value vectors of paying attention to cat and sat
00:09:43.180 | are like three times mat plus one times mouse plus one times sofa.
00:09:49.220 | This is again, like a totally hypothetical example, but I'm trying to make the point
00:09:54.260 | that you can extract from your value vectors, things useful for predicting the next token
00:10:00.940 | by paying attention to specific keys.
00:10:07.340 | And I guess, yeah, another thing here is like what you pay attention to, so cat and sat
00:10:11.580 | might be different from what you're actually extracting.
00:10:13.820 | You're paying attention to your keys, but you're getting your value vectors out.
00:10:16.900 | Okay, so here is the full attention equation.
00:10:20.820 | The top line, I'm separating out the projection matrices, W subscript V, K, and Q, and in
00:10:28.180 | the second one, I've just collapsed them into like the [inaudible] and yeah, so breaking
00:10:33.820 | this apart.
00:10:34.820 | The first step here is we compare, we do a dot product between our query vector and our
00:10:40.380 | keys.
00:10:41.380 | This should actually be a small, you know, [inaudible] and so yeah, we're doing this
00:10:47.740 | dot product between them to see, get a notion of similarity.
00:10:52.860 | We then apply the softmax operation, which is an exponential over a sum of exponentials.
00:10:58.620 | The way to think of the softmax is it just makes large values larger, and this will be
00:11:04.300 | important for the relations here, so I'll spend a minute on it.
00:11:09.020 | At the top here, I have like some hypothetical items indexed from zero to nine, and then
00:11:15.260 | the values for each of those items.
00:11:17.820 | In the second row, I just do like a normal normalization of them, and so the top item
00:11:22.660 | goes to a 30% value, but if I instead do a softmax with a beta coefficient of one,
00:11:28.580 | the value becomes 0.6, so it just, it makes your distributions peakier
00:11:34.900 | is kind of one way of thinking of it, and this is useful for attention because you only
00:11:38.700 | want to pay attention to the most important things, or the things that are nearby and
00:11:42.820 | kind of ignore stuff further away.
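
A tiny numeric check of that peakiness point, with made-up values (so the exact numbers differ from the slide's 0.3 and 0.6):

```python
# Compare a plain normalization with a softmax on the same toy scores.
import numpy as np

values = np.array([3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.3, 0.2])

linear  = values / values.sum()                      # plain normalization
softmax = np.exp(values) / np.exp(values).sum()      # exponential over a sum of exponentials

print(round(linear[0], 2))    # ~0.30: the top item keeps about 30% of the weight
print(round(softmax[0], 2))   # ~0.49: the softmax concentrates most of the weight on the top item
```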
00:11:47.140 | And so once we've applied our softmax, we then just do a weighted summation of our value
00:11:55.060 | vectors, which actually get extracted and propagate to the next layer.
00:12:04.140 | Okay, so here's the full equation, I went through that a little bit quickly, I'm happy
00:12:12.100 | to answer questions on it, but I think half of you know it, half of you don't.
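
For reference, a minimal single-head version of that equation, as a hedged sketch (the dimensions and the 1/sqrt(d) scaling are the standard choices, not necessarily the slide's exact setup):

```python
# Single-head attention: project to queries/keys/values, softmax the query-key similarities,
# then take the weighted sum of value vectors.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_head, T = 512, 64, 5                      # e.g. the five tokens "the cat sat on the"

X   = rng.standard_normal((T, d_model))              # token embeddings
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d_head))         # query/key dot products, then softmax
output  = weights @ V                                # weighted sum of the value vectors
```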
00:12:19.180 | Okay, so how does transformer attention approximate sparse distributed memory, this 30-year-old
00:12:25.300 | thing that I've said is biologically plausible.
00:12:30.140 | So are we supposed to accept that SDM is biologically plausible?
00:12:36.540 | So I'm going to get to that at the end, yeah.
00:12:37.540 | Attention is also old, like in the sense of all the attention work before this one, it's
00:12:38.540 | not like it was invented five years ago.
00:12:39.540 | I think the attention equation I'm showing here was developed, I mean attention is all
00:12:50.700 | you need was the highlight, but Bengio has a paper from 2015 where it was actually first
00:12:57.020 | written in this way, correct me if I'm wrong, but I'm pretty sure.
00:13:00.660 | Yeah, I mean I guess like this particular one, that's why I was asking a question, because
00:13:05.700 | like…
00:13:06.700 | No, it's a good question.
00:13:07.700 | It's like you show that like two different methods that could be classified as like attention
00:13:13.060 | proposals, right, or like the same, then like you still have to show that like one of them
00:13:21.740 | is like biologically plausible.
00:13:21.740 | Yes, exactly, so I'll show that SDM has really nice mappings to a circuit in the cerebellum
00:13:25.980 | at the neuronal level, and then right now it's this link to attention, and I guess you
00:13:32.780 | make a good point that there are other attention mechanisms.
00:13:35.500 | This is the one that has been dominant, but I don't think that's just a coincidence, like
00:13:39.420 | there's been a bunch of…
00:13:40.940 | Computing your Softmax is expensive, and there's been a bunch of work like the Linformer, etc.,
00:13:44.900 | etc., that tries to get rid of the Softmax operation, and it's just done really badly.
00:13:49.700 | Like there's a bunch of jokes on Twitter now that it's like a black hole for people that
00:13:53.060 | like try and get rid of Softmax and you can't, and so it seems like this, and like other
00:13:58.140 | versions of it, transformers just don't scale as well in the same way, and so there's something
00:14:03.020 | important about this particular attention equation.
00:14:05.780 | But like that goes the other way, right, which is like if this is really important, then
00:14:11.980 | like SDM is like actually like this.
00:14:17.980 | So the thing that I think is important is that you have this exponential weighting,
00:14:22.060 | where you're really paying attention to the things that matter, and you're ignoring everything
00:14:25.260 | else, and that is what SDM approximates.
00:14:29.940 | There might be better equations, but the point I was just trying to make there is like the
00:14:34.660 | Softmax does seem to be important, and this equation does seem to be very successful,
00:14:39.440 | and we haven't come up with better formulations for it.
00:14:42.300 | Yeah, no, that's a great question.
00:14:46.380 | Yeah, so it turns out that sparse distributed memory, as you move your query and your pattern
00:14:54.020 | away from each other, so you pull these circles apart, the read and write circles, the number
00:14:59.300 | of neurons that are in this intersection in a sufficiently high dimensional space decays
00:15:04.260 | approximately exponentially, and so on this right plot here, I'm pulling apart, the x-axis
00:15:10.420 | is me pulling apart the blue and the pink circles, and the y-axis is on a log scale
00:15:17.420 | the number of neurons that are in the intersection, and so to the extent that this is a linear
00:15:23.180 | plot on a log scale, it's exponential, and this is for a particular setting where I have
00:15:31.020 | my 64 dimensional vectors, which is like used in GPT-2, it holds across a lot of different
00:15:37.660 | settings, particularly higher dimensions, which are now used for bigger transformers.
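
A quick way to sanity-check that decay numerically, as a sketch with a made-up dimension and radius (not the paper's exact settings): sample random binary "neurons", count how many fall inside both Hamming balls as the pattern and query are pulled apart, and look at the trend of the counts.

```python
# Monte Carlo estimate of the circle-intersection size versus query-pattern distance.
import numpy as np

rng = np.random.default_rng(0)
n, m, radius = 64, 200_000, 24                       # dimension, sampled neurons, read/write radius

neurons = rng.integers(0, 2, size=(m, n), dtype=np.uint8)
pattern = rng.integers(0, 2, size=n, dtype=np.uint8)
dist_to_pattern = (neurons != pattern).sum(axis=1)

for dist in range(0, 33, 4):
    query = pattern.copy()
    flip = rng.choice(n, size=dist, replace=False)
    query[flip] ^= 1                                 # query at Hamming distance `dist` from the pattern
    dist_to_query = (neurons != query).sum(axis=1)
    in_both = (dist_to_pattern <= radius) & (dist_to_query <= radius)
    print(dist, in_both.sum())                       # counts fall off roughly exponentially with dist
```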
00:15:42.780 | Okay, so I have this shorthand for the circle intersection equation, and what I'll show
00:15:51.900 | is how the circle intersection is approximately exponential, so we can write it with two constants
00:15:58.220 | c, subscript 1 and subscript 2, with c1 on the outside; because the softmax normalizes an
00:16:05.780 | exponential over a sum of exponentials, c1 would cancel, so the thing that matters is c2, and
00:16:10.460 | you can approximate that nicely with the beta coefficient that's used in the Softmax.
00:16:18.020 | And so, yeah, I guess as well, I'll focus first on the binary original version of SDM,
00:16:24.020 | but then we also develop a continuous version, okay, so yeah, the two things that you need
00:16:30.460 | for the circle intersection and the exponential decay to work, are you need - to map it to
00:16:35.820 | attention is you need some notion of continuous space, and so you can use this equation here
00:16:40.420 | to map Hamming distances to discretized cosine similarity values, where the hats over the
00:16:48.020 | vectors are L2 normalizations, and you can then write the circle intersection equation
00:16:55.880 | on the left as this exponential with these two constants that you need to learn, and
00:17:05.160 | then rewrite this by converting c2 and c, you can write this as a beta coefficient.
00:17:12.540 | Let me get to some plots, yeah, so you need the correct beta coefficient, but you can
00:17:16.620 | fit this with a log-linear regression in a closed form.
00:17:22.220 | I want to show a plot here, yeah, okay, so in the blue is our circle intersection for
00:17:28.940 | two different Hamming distances both using 64-dimensional vectors, and the orange is
00:17:34.660 | our actual Softmax attention operation, where we fit the beta coefficient, so that it will
00:17:41.060 | - the Hamming distance used by attention is equivalent to the Hamming distance used by
00:17:45.460 | SDM, and you can see so that the main plot is the normalized weights, so just summed
00:17:53.980 | up in undivided systems, one, and then I have log plots here, and you can see that in not
00:18:00.420 | log space, the curves agree quite nicely.
00:18:04.860 | You can see that for the higher dimensional - sorry, the larger Hamming distance, the
00:18:09.540 | log plot, you see this drop off here, where the circle intersection stops being exponential,
00:18:14.940 | but it turns out this actually isn't a problem, because the point at which the drop - the
00:18:20.100 | exponential breaks down, you're at approximately 0.20 here, and you're basically paying negligible
00:18:27.580 | attention to any of those points, and so in the regime where the exponential really matters,
00:18:34.340 | this approximation holds true, yeah, yeah, yeah, yeah, no I just wanted to actually like
00:18:46.780 | show a figure to get some intuition before, yeah, so all we're doing here is we're just
00:18:53.780 | - we're in a binary space with original axiom, and we're just using this mapping to co-sign
00:19:00.900 | and then what you need to do is just have the beta coefficient fit, and you can view
00:19:06.460 | your beta coefficient and attention as determining how peaky things are, and this relates directly
00:19:10.940 | to the Hamming distance of your circles that you're using for read and write operations.
00:19:19.020 | And so yeah, to like mathematically show this now, on this slide I'm not using any tricks,
00:19:23.860 | I'm just rewriting attention using the SDM notation of patterns and queries, so this
00:19:29.860 | little box down here is doing that mapping.
00:19:37.620 | And this is the money slide where we're updating our query, and on the left we have our attention
00:19:45.180 | equation written in SDM notation, we expand our softmax, and then the main statement is
00:19:51.980 | - this is closely approximated by if we swap out our exponential with the SDM equation.
00:20:10.460 | So and again, the two things that you need for this to work are, one, your attention
00:20:15.980 | vectors, your keys and queries need to be L2 normalized, so I have hats on them, and then
00:20:22.860 | you want - if you decide a given Hamming distance for SDM, and I'll get into what Hamming distances
00:20:28.900 | are good for different things, then you need to have a beta coefficient that relates to it.
00:20:35.500 | But again, that's just how many things are you trying to pay attention to.
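
One hedged way to write that correspondence out, in my notation (keys as pattern addresses, values as stored contents, the query as the read address, and $\beta$ fit to the Hamming radius $d$):

$$
\sum_{\mu}\frac{\exp\!\big(\beta\,\hat{\mathbf{q}}^{\top}\hat{\mathbf{k}}_\mu\big)}{\sum_{\nu}\exp\!\big(\beta\,\hat{\mathbf{q}}^{\top}\hat{\mathbf{k}}_\nu\big)}\,\mathbf{v}_\mu
\;\approx\;
\sum_{\mu}\frac{\big|O(\mathbf{p}^a_\mu, d)\cap O(\boldsymbol{\xi}, d)\big|}{\sum_{\nu}\big|O(\mathbf{p}^a_\nu, d)\cap O(\boldsymbol{\xi}, d)\big|}\,\mathbf{p}^p_\mu ,
$$

i.e. the softmax-weighted sum of value vectors on the left is approximated by the circle-intersection-weighted sum of stored patterns on the right.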
00:20:44.580 | So yeah, just as a quick side note, you can write SDM using continuous vectors and then
00:20:49.980 | not need this mapping to understand similarity.
00:20:54.140 | And so here I have the plots again, but with this, and I've added the - the orange and
00:21:02.260 | the green have split, but I've added the continuous approximation here too.
00:21:10.140 | And what's nice about the continuous version is you can actually then write sparse distributed
00:21:13.900 | memory as a multilayered perceptron with slightly different assumptions, and I'm not going to
00:21:19.020 | talk about that now, but this is featured in Sparse Distributed Memory as a Continual
00:21:24.100 | Learner, which was added to the additional readings, and it'll be in - sorry, this shouldn't
00:21:28.460 | say ICML, this should say ICLR.
00:21:31.940 | It's just been accepted to ICLR for this year.
00:21:35.060 | Okay, so do trained transformers use these beta coefficients that I've said are similar
00:21:42.460 | to those for SDM?
00:21:45.140 | And so it shouldn't be surprising that depending on the Hamming distance you set, SDM is better
00:21:50.340 | for certain things.
00:21:52.340 | For example, you just want to store as many memories as possible, and you're assuming
00:21:56.220 | that your queries aren't noisy, or you're assuming your queries are really noisy, so
00:22:01.100 | you can't store as much, but you can retrieve from a long distance.
00:22:05.540 | And if attention of the transformer is implementing sparse distributed memory, we should expect
00:22:10.300 | to see that the beta coefficients that the transformer uses correspond to these good
00:22:14.900 | instances of SDM.
00:22:17.380 | And so we have some weak evidence that that's the case.
00:22:21.860 | So this is the key query normalized variant of attention, where you actually learn your
00:22:26.500 | beta coefficient.
00:22:27.500 | Normally in transformers, you don't learn it, and you don't L2 norm your vectors, and so you can
00:22:32.740 | kind of have this like effective beta coefficient.
00:22:35.140 | So in this case, it's a cleaner instance where we're actually learning beta.
00:22:39.260 | And this one's trained on a number of different translation tasks.
00:22:43.220 | We take the learned beta coefficients across layers and across tasks, and plot them as
00:22:48.980 | a histogram.
00:22:48.980 | And the red dotted lines correspond to three different notions of sparse distributed memory
00:22:54.380 | that are optimal for different things.
00:22:56.960 | And again, this is weak evidence insomuch as to derive the optimal SDM beta coefficients
00:23:07.120 | or corresponding Hamming distances.
00:23:07.120 | We need to assume random patterns in this high-dimensional space, and obviously real-world
00:23:13.320 | data isn't random.
00:23:13.320 | However, it is nice to see, one, all the beta coefficients fall within the bounds, and two,
00:23:20.160 | they skew towards the max query noise, which makes more sense if you're dealing with complicated
00:23:26.200 | real-world data, where the next data points you'd see might be out of distribution based
00:23:30.240 | on what you've seen in the past.
00:23:32.400 | The max memory capacity variant assumes no query noise at all.
00:23:36.080 | And so it's like, how many things can I pack in, assuming that the questions I'm asking
00:23:40.520 | the system are perfectly formed?
00:23:48.680 | Just talking a little bit about transformer components more broadly.
00:23:53.040 | So I've mentioned that you can write the feed-forward layer as a version of SDM that has a notion
00:23:59.320 | of longer-term memory.
00:24:03.660 | There's also layer norm, which is crucial in transformers.
00:24:06.800 | And it's not quite the same, but it can be related to the L2 normalization that's required
00:24:11.100 | by SDM.
00:24:12.920 | There's also the key query normalization variant that explicitly does this L2 normalization.
00:24:17.320 | And it does get slightly better performance, at least on the small tests that they did.
00:24:22.000 | I don't know if this would scale to larger models.
00:24:27.160 | And so I guess this work is interesting in so much as the biological plausibility, which
00:24:31.720 | I'm about to get to, and then the links to transformers.
00:24:35.560 | It hasn't to date improved transformer architectures.
00:24:38.880 | But that doesn't mean that this lens couldn't be used or be useful in some way.
00:24:43.600 | So yeah, I list a few other things that SDM is related to that could be useful going forward.
00:24:49.560 | So in the new work where SDM is a continual learner, we expand the cerebellar circuit,
00:24:54.640 | look at components of it, particularly inhibitory interneurons, implement those in a deep learning
00:24:59.080 | model, and it then becomes much better at continual learning.
00:25:02.400 | So that was a fun way of actually using this link to get better bottom-line performance.
00:25:10.220 | So a summary of this section is basically just the intersection between two hyperspheres
00:25:16.040 | approximates an exponential, and this allows SDM's read and write operations to approximate
00:25:22.840 | attention both in theory and our limited tests.
00:25:27.400 | And so kind of like big picture research questions that could come out of this is, first, is
00:25:32.080 | the transformer so successful because it's performing some key cognitive operation?
00:25:37.480 | The cerebellum is a very old brain region used by most organisms, including fruit flies,
00:25:44.120 | maybe even cephalopods, through divergent but now convergent evolution.
00:25:50.160 | And then given that the transformer has been so successful empirically, is SDM actually
00:25:55.720 | the correct theory for cerebellar function?
00:25:59.280 | And that's still an open question.
00:26:03.080 | As we learn more and more about the cerebellum, there's nothing that yet disproves SDM as
00:26:07.080 | working there.
00:26:08.080 | And I think it's-- I'll go out on a limb and say it's one of the more compelling theories
00:26:10.980 | for how the cerebellum is actually working.
00:26:14.640 | And so I think this work kind of motivates looking at more of these questions-- both
00:26:18.440 | of these questions more seriously.
00:26:22.400 | Do we have time?
00:26:24.720 | Cool.
00:26:26.320 | So here's the circuit that implements SDM.
00:26:31.080 | At the bottom, we have patterns coming in for either reading or writing.
00:26:36.880 | And they're going to-- actually, I break down each of these slides.
00:26:40.000 | Yeah.
00:26:41.000 | Yeah.
00:26:42.000 | So first, we have patterns that come in.
00:26:43.000 | And every neuron here, these are the dendrites of each neuron.
00:26:47.040 | And they're deciding whether or not they're going to fire for the input that comes in.
00:26:54.520 | Then if the neuron does fire, and you're writing in that pattern, then you simultaneously--
00:27:00.180 | and I'm going to explain-- in case you hear this and think it's crazy, the brain doesn't do this,
00:27:05.180 | and then I'm going to hopefully break that down.
00:27:06.180 | You not only need to have the thing, the pattern that activates neurons, but you need to have
00:27:11.580 | a separate line that tells the neuron what to sort.
00:27:15.740 | And just like you have this difference between keys and values, where they can be different
00:27:19.020 | vectors representing different things, here you can have a key that comes in and tells
00:27:24.300 | the neuron whether to activate, and the value for what it should actually store, and then
00:27:29.740 | output later on.
00:27:30.740 | This is called a hetero-associative mapping.
00:27:34.220 | And then once you're reading from the system, you also have your query come in here, activate
00:27:40.940 | neurons, and those neurons then output whatever they store.
00:27:45.340 | And the neuron's vector is this particular column that it stores.
00:27:50.860 | And as a reminder, it's storing patterns in superposition.
00:27:55.380 | And then it will dump whatever it's stored across these output lines.
00:28:00.940 | And then you have this G majority bit operation that converts to a 0 or 1, deciding if the
00:28:07.860 | neuron's going to fire or not.
00:28:10.180 | And so here is the same circuit, but where I overlay cell types in the cerebellum.
00:28:20.100 | And so I'll come back to this slide, because most people probably aren't familiar with
00:28:24.860 | cerebellar circuitry.
00:28:25.860 | Let me just get some water.
00:28:32.940 | So the cerebellum is pretty homogeneous in that it follows the same circuit pattern throughout.
00:28:39.100 | Also, fun fact, 70% of all neurons in the brain are in the cerebellum.
00:28:42.780 | They're small, so you wouldn't know it.
00:28:45.060 | But the cerebellum is very underappreciated, and there's a bunch of evidence that it has
00:28:48.380 | closed-loop systems with most higher-order processing regions now.
00:28:51.980 | If your cerebellum's damaged, you are more likely to have autism, et cetera, et cetera.
00:28:55.660 | So it does a lot more than just fine motor coordination, which a lot of people have assumed
00:29:00.100 | in the past.
00:29:02.100 | So inputs come in through the mossy fibers here.
00:29:04.420 | They interface with granule cells.
00:29:06.580 | This is a major up-projection, where you have tons and tons of granule cells.
00:29:11.300 | Each granule cell has what are called parallel fibers, which are these incredibly long and
00:29:16.060 | thin axons that branch out in this T structure.
00:29:20.140 | Then they're hit by the Purkinje cells, which will receive up to 100,000 parallel fiber
00:29:28.260 | inputs.
00:29:29.260 | It's the highest connectivity of any neuron in the brain.
00:29:33.420 | And then the Purkinje cell will decide whether or not to fire and send its output downwards
00:29:38.980 | here.
00:29:39.980 | So that's the whole system where patterns come in and the neurons decide whether they
00:29:43.780 | fire or not, and the way that they then output their output.
00:29:47.580 | You then have a separate write line, which is the climbing fibers here.
00:29:51.060 | So the climbing fibers come up, and they're pretty amazing. These connections along the way here
00:29:54.780 | you can kind of ignore.
00:29:55.780 | They're not as important.
00:29:56.780 | They're not very strong.
00:29:57.780 | But the one that really matters is it goes up and it wraps around individual Purkinje cells.
00:30:03.940 | And the mapping is close to one-to-one between climbing fibers and Purkinje cells, and it elicits
00:30:08.700 | a very strong action potential.
00:30:10.820 | And so--
00:30:11.820 | And they're connected to what, here?
00:30:12.820 | In these--
00:30:13.820 | Yeah, as in the stuff off this line.
00:30:14.820 | Yeah, right.
00:30:15.820 | It's two lines.
00:30:16.820 | They're connected to one.
00:30:17.820 | Oh, so they're separate neurons coming from-- it's separate.
00:30:18.820 | There it is.
00:30:19.820 | Purkinje cells here go into the cerebellar nuclei, kind of in the core of the cerebellum.
00:30:20.820 | And that then feeds into thalamus, like back to higher-order brain regions, or like down
00:30:21.820 | the muscle movement, et cetera.
00:30:22.820 | A lot of people think of the cerebellum as kind of like a fine-tuning look-up table,
00:30:23.820 | where, like, you've already decided the muscle movement you want to do, but the cerebellum
00:30:24.820 | will then, like, do a bunch of different things.
00:30:25.820 | So it's, like, much more accurate.
00:30:49.760 | But it seems like this also applies to, like, next-word prediction.
00:30:52.420 | Like, we have fMRI data for this.
00:30:54.620 | A neuroscientist once said to me that, like, a dirty little secret of fMRI is that the
00:30:59.500 | cerebellum lights up for everything.
00:31:02.140 | So OK.
00:31:05.620 | Going back to this circuit here, then, yeah?
00:31:09.480 | Timescales, are these operating at?
00:31:11.620 | I mean, how long is the information stored and retrieved?
00:31:14.980 | Do we have any idea about this?
00:31:17.820 | Like, is this, like, a couple of milliseconds, or, like, is this information more persistent?
00:31:23.720 | So the main theory is that you have updating through time-dependent plasticity, where your
00:31:33.700 | climbing fiber will either—which is doing what you want to write in—will fire either
00:31:38.260 | just before or just after your granule cells fire, and so that then updates the propingy
00:31:45.740 | cell synapses for long-term depression or contentiation.
00:31:49.420 | So whatever timescale that's happening on.
00:31:51.420 | The climbing fiber makes very large action potentials, or at least a very large action
00:31:56.300 | potential when the propingy cell.
00:31:57.780 | And so I do think you could get pretty fast synaptic updates.
00:32:01.260 | And they're also persistent for a long time?
00:32:04.380 | I think so, yeah.
00:32:05.380 | The synapses can stay that way for, like, the rest of your life.
00:32:10.500 | So what's really unique about this circuit is the fact that you have these two orthogonal
00:32:16.740 | lines where you have the mossy fibers bringing information in to decide if the neuron's going
00:32:21.540 | to fire or not, but then the totally separate climbing fiber lines that can update specific
00:32:26.260 | neurons and what they're storing and will later output.
00:32:30.580 | And then the Purkinje cell is so important, it's kind of doing this pooling across every
00:32:35.460 | single neuron.
00:32:36.460 | And each neuron, remember, it's storing the vector this way.
00:32:39.740 | And so the Purkinje cell is doing element-wise summation and then deciding whether it fires
00:32:45.700 | or not.
00:32:46.700 | And this allows for you to store your vectors in superposition and then later denoise them.
00:32:56.780 | The theory of SDM maps quite well to the Marr and Albus theories of cerebellar function,
00:33:01.460 | which are still quite dominant, if anyone's familiar or wants to talk about this.
00:33:06.020 | Yeah, so the analogy of the neuron in the SDM you introduced, and then I kind of
00:33:10.300 | just basically-- is each neuron the Purkinje cell, or--
00:33:12.340 | Each neuron is a granule cell.
00:33:16.460 | And then, yeah.
00:33:17.460 | So the location of the neuron, those hollow circles, corresponds to the granule cell
00:33:22.220 | dendrites here, where the patterns that come in correspond to the activations of the mossy fibers.
00:33:27.900 | And then the efferent post-synaptic connections are with the Purkinje cell.
00:33:34.900 | So that's actually-- what it's storing is in the synaptic connections with the Purkinje
00:33:40.980 | cells at that interface.
00:33:44.260 | And then the Purkinje cell does the majority operation and decides whether to fire.
00:33:59.460 | Yeah, I think we're basically into question time.
00:34:01.860 | So yeah, thanks a lot.
00:34:02.860 | I have a question.
00:34:03.860 | I don't know anything about SDM, but it seems, as understood, it's very good for long-term
00:34:19.100 | memory.
00:34:20.100 | And I am curious, what's your hypothesis of what we should be doing for short-term memory?
00:34:30.020 | Because it seems that-- so if you have this link of transformers having long-term memory,
00:34:37.060 | what's good for short-term memory?
00:34:38.620 | Because for me, it seems like we are doing this in the prompt context right now.
00:34:43.740 | But how could we incorporate these to the record?
00:34:47.260 | Yeah, yeah.
00:34:48.340 | So this work actually focuses more on the short-term memory, where it relates to the
00:34:52.940 | attention operation.
00:34:54.100 | But you can rewrite SDM.
00:34:57.340 | It's almost more natural to interpret it as a multilayer perceptron that does a softmax
00:35:01.660 | activation across its-- or a top-k activation across its neurons.
00:35:05.380 | It's a little bit more complicated than that.
00:35:10.260 | So yeah, the most interesting thing here is the fact that I just have a bunch of neurons.
00:35:16.900 | And in activating nearby neurons in this high-dimensional space, you get this exponential weighting,
00:35:22.260 | which is the softmax.
00:35:23.260 | And then because it's an associative memory, where you have keys and values, it is attention.
00:35:30.420 | And yeah, I guess the thing I most want to drive home from this is it's actually surprisingly
00:35:35.580 | easy for the brain to implement the attention operation, the attention equation, just using
00:35:42.180 | high-dimensional vectors and activating nearby neurons.
00:35:46.300 | So it's good for short-term memory?
00:35:49.860 | So if you were to actually use SDM for attention-- yeah, so let me go all the way back real quick.
00:35:57.260 | This is important.
00:35:58.260 | There are kind of two ways of viewing SDM.
00:35:59.900 | And I don't think you were here for the talk.
00:36:01.500 | I think I saw you come in a bit later, which is totally fine.
00:36:03.540 | But--
00:36:04.540 | I was listening, but maybe they were going to remember.
00:36:06.500 | Oh, cool, cool, cool.
00:36:07.500 | Yeah, yeah, yeah.
00:36:08.500 | OK, so there are two ways of looking at SDM.
00:36:11.620 | There's the neuron perspective, which is this one here.
00:36:14.860 | And this is actually what's going on in the brain, of course.
00:36:19.060 | And so the only thing that is actually constant is the neuron, the patterns are ethereal.
00:36:24.620 | And then there's the pattern-based perspective, which is actually what attention is doing.
00:36:29.740 | And so here, you're abstracting away the neurons, or assuming they're operating under the hood.
00:36:33.820 | But what you're actually computing is the distance between the true location of your
00:36:39.100 | pattern and the query.
00:36:41.140 | And there are pros and cons to both of these.
00:36:44.420 | The pro to this is you get much higher fidelity distances, if you know exactly how far the
00:36:50.700 | query is from the original patterns.
00:36:53.380 | And that's really important when you're deciding what to update towards.
00:36:56.140 | You really want to know what is closest and what is further away, and be able to apply
00:37:00.780 | the exponential weighting correctly.
00:37:03.460 | The problem is you need to store all of your pattern information in memory.
00:37:07.220 | And so this is why transformers have limited context windows.
00:37:11.380 | The other perspective is this long-term memory one, where you forget about the patterns.
00:37:15.780 | And you just look at where-- you just have your neurons that store a bunch of patterns
00:37:20.500 | in them in this noisy superposition.
00:37:22.660 | And so you can't really-- you can kind of [INAUDIBLE] and it's all much noisier.
00:37:30.140 | But you can store tons of patterns, and you're not constrained by a context window.
00:37:33.300 | Or you can think of any MLP layer as storing the entire data set in a noisy superposition
00:37:39.220 | of states.
00:37:41.660 | Yeah, hopefully that kind of answers your question.
00:37:45.140 | I think there was one here first, and then-- yeah?
00:37:49.020 | So I guess my question is-- so I guess you kind of have shown that the modern self-attention
00:37:58.700 | mechanism maps onto this SDM mechanism that seems plausible and might seem like the modern
00:38:07.820 | contemporary theories of how the brain could implement SDM.
00:38:14.020 | And I guess my question is, to what degree has that been experimentally verified versus--
00:38:21.260 | you were mentioning earlier that it might actually be easier to have done this using
00:38:25.780 | an MLP layer in some sense than mapping that onto these mechanisms.
00:38:31.620 | And so how do experimentalists actually distinguish between competing hypotheses?
00:38:37.820 | For instance, one thing that I wasn't entirely clear about is even if the brain could do
00:38:46.580 | attention or SDM, that doesn't actually mean it would, because maybe it can't do backprop.
00:38:54.060 | So how does this get actually tested?
00:38:57.580 | Totally.
00:38:58.060 | Yeah, yeah, yeah.
00:38:58.780 | So on the backprop point, you wouldn't have to do it here because you
00:39:04.820 | have the climbing fibers that can directly give training signals to what the neurons can store.
00:39:12.780 | So in this case, it's like a supervised learning task where the climbing fiber knows what it wants
00:39:17.860 | to write in or how it should be updated in the Purkinje cell synapses.
00:39:22.580 | But for your broader point, you basically need to test this.
00:39:28.020 | You need to be able to do real-time learning.
00:39:31.300 | The Drosophila mushroom body is basically identical to the cerebellum.
00:39:35.460 | And that's why any brain data set has done most of the individual neuron connectivity.
00:39:39.980 | So what you would really want to do is in vitro, real-time, super, super high frames
00:39:48.020 | per second calcium imaging, and be able to see how synapses change over time.
00:39:56.380 | And so for an associative learning task, like hear a sound move left, hear another sound move right,
00:40:03.180 | or smells, or whatever, present one of those, trace, like figure out the small subset of neurons
00:40:10.220 | that fire, which we know is a small subset, so that already fits with the handling of sensitization.
00:40:15.620 | See how the synapses here update and how the outputs of it correspond to changes in motor action.
00:40:22.780 | And then extinguish that memory, so write in a new one, and then watch it go away again.
00:40:29.860 | And our cameras are getting fast enough, and our calcium and voltage indicators are getting to be really good.
00:40:36.860 | So hopefully in the next three to five years, we can do some of those tests.
00:40:40.980 | But I think that would be very definitive.
00:40:47.980 | - Do we have any other questions?
00:40:49.460 | - I think there was one more, and then I should go over to Will.
00:40:52.780 | - In terms of how you map the neuron-based SDM onto the biological implementation,
00:41:03.820 | what is the range of your circle that you're mapping around?
00:41:08.220 | Is that like the multi-headedness, or can you do that kind of thing?
00:41:14.340 | I'm just trying to understand how that maps.
00:41:17.220 | - Yeah, so I wouldn't get confused with multi-headedness,
00:41:21.940 | because that's different attention heads all doing their own attention operation.
00:41:26.300 | It's funny, though, the cerebellum has microzones, which you can think of as like separate attention heads in a way.
00:41:32.180 | I don't want to take that analogy too far, but it is somewhat interesting.
00:41:37.940 | So the way you relate this is, in attention, you have your beta coefficient.
00:41:44.580 | That is an effective beta coefficient, because the vector norms of your keys and queries aren't constrained.
00:41:53.420 | That corresponds to a hemming distance, and here that corresponds to the number of neurons that are on for any given input.
00:42:04.340 | And the hemming distance you want, I had that slide before,
00:42:09.220 | the hemming distance you want depends upon what you're actually trying to do.
00:42:12.900 | And if you're not trying to store that many memories, for example, you're going to have a higher hemming distance,
00:42:16.980 | because you can get a higher fidelity calculation for the number of neurons in that noisy intersection.
00:42:26.500 | Cool. Yeah, thanks a lot.
00:42:28.060 | - Excellent. Let's give our speaker another round of applause.
00:42:33.100 | So as a disclaimer, before I introduce our next speaker,
00:42:39.780 | the person who was scheduled, unfortunately, had to cancel last minute due to faculty interviews.
00:42:43.180 | So our next speaker has very graciously agreed to present at the very last minute, but we are very grateful to him.
00:42:48.540 | So I'd like to introduce everybody to Will.
00:42:50.380 | So Will is a computational neuroscience machine learning PhD student at University College London at their Gatsby unit.
00:42:56.740 | So I don't know if anybody has heard about the Gatsby unit.
00:42:58.740 | I'm a bit of a history buff or history nerd, depending on how you phrase it.
00:43:02.020 | The Gatsby unit was actually this incredible powerhouse in the 1990s and 2000s.
00:43:05.900 | So Hinton used to be there. Zoubin Ghahramani used to be there.
00:43:08.300 | He's now in charge of Google Research.
00:43:10.340 | I think they've done a tremendous amount of good work.
00:43:13.060 | Anyways, and now I'd like to invite Will to talk about how to build a cognitive map.
00:43:17.700 | Did you want to share your screen? - Yeah.
00:43:19.180 | - Okay. Can you stand in front of... Here, let me stop sharing.
00:43:23.100 | - Okay. So I'm going to be presenting this work.
00:43:26.780 | It's all about how a model that people in the group that I work with to study the hippocampal entorhinal system,
00:43:35.460 | completely independently, turned out to look a bit like a transformer.
00:43:39.260 | So this paper that I'm going to talk about is describing that link.
00:43:42.740 | So the paper that built this link is by these three people.
00:43:46.260 | James is a postdoc, half at Stanford, Tim's a professor at Oxford and in London, and Joe's a PhD student in London.
00:43:55.180 | So this is the problem that this model of the hippocampal entorhinal system,
00:44:01.700 | which we'll talk more about, is supposed to solve.
00:44:03.820 | It's basically the observation there's a lot of structure in the world,
00:44:06.420 | and generally we should use it in order to generalize quickly between tasks.
00:44:10.060 | So the kind of thing I mean by that is you know how 2D space works
00:44:13.780 | because of your long experience living in the world.
00:44:16.020 | And so if you start at this greenhouse and step north to this orange one, then to this red one, then this pink one,
00:44:21.140 | because of the structure of 2D space, you can think to yourself, "Oh, what will happen if I step left?"
00:44:26.500 | And you know that you'll end up back at the green one because loops of this type close in 2D space, okay?
00:44:31.820 | And this is, you know, perhaps this is a new city you've just arrived in.
00:44:35.580 | This is like a zero-shot generalization because you somehow realize that the structure applies more broadly and use it in a new context.
00:44:43.620 | Yeah, and there's generally a lot of these kinds of situations where there's structures that like reappear in the world.
00:44:48.460 | So there can be lots of instances where the same structure will be useful to doing these zero-shot generalizations
00:44:53.580 | to predict what you're going to see next.
00:44:55.580 | Okay, and so you may be able to see how we're already going to start mapping this onto some kind of sequence prediction task
00:45:02.140 | that feels a bit Transformer-esque,
00:45:04.140 | which is you receive this sequence of observations and, in this case, actions, movements in space,
00:45:10.860 | and your job is, given a new action, step left here, you have to try and predict what you're going to see.
00:45:15.780 | So that's a kind of sequence prediction version of it.
00:45:18.940 | And the way we're going to try and solve this is based on factorization.
00:45:22.700 | It's like, you can't go into one environment and just learn from the experiences in that one environment.
00:45:27.220 | You have to separate out the structure and the experiences you're having
00:45:29.860 | so that you can reuse the structural part, which appears very often in the world.
00:45:33.780 | Okay, and so, yeah, separating memories from structure.
00:45:37.260 | And so, you know, here's our separation of the two.
00:45:40.700 | We have our dude wandering around this, like, 2D grid world.
00:45:44.700 | And you want to separate out the fact that there's 2D space,
00:45:48.140 | and it's 2D space that has these rules underlying it.
00:45:51.140 | And in a particular instance, in the environment you're in,
00:45:53.740 | you need to be able to recall which objects are at which locations in the environment.
00:45:58.060 | Okay, so in this case, it's like, oh, this position has an orange house, this position doesn't.
00:46:01.620 | That's green, sorry, orange, red, and pink.
00:46:04.220 | And so you have to bind those two, you have to be like, whenever you realize that you're back in this position,
00:46:07.940 | recall that that is the observation you're going to see there.
00:46:11.380 | Okay, so this model that we're going to build is some model that tries to achieve this.
00:46:17.660 | Yeah, new stars, and so when you enter it, imagine you enter a new environment with the same structure,
00:46:21.620 | you wander around and realize it's the same structure,
00:46:23.700 | all you have to do is bind the new things that you see to the locations,
00:46:26.700 | and then you're done, passed up, you know how the world works.
00:46:30.500 | So this is what neuroscientists mean by a cognitive map,
00:46:33.820 | is this idea of, like, separating out and understanding the structure that you can reuse in new situations.
00:46:38.340 | And yeah, this model that was built in the lab is a model of this process happening,
00:46:44.020 | of the separation between the two of them and how you use them to do new inferences.
00:46:47.340 | And this is the bit that's supposed to look like a transport.
00:46:50.380 | So that's the general introduction, and then we'll dive into it a little more now.
00:46:54.180 | Make sense of the broad picture?
00:46:56.220 | Good.
00:46:57.460 | Silence, I'll assume, is good.
00:46:59.460 | So we'll start off with some brain stuff.
00:47:03.180 | So there's a long stream of evidence from spatial navigation
00:47:06.780 | that the brain is doing something like this.
00:47:08.780 | I mean, I think you can probably imagine how you yourself are doing this
00:47:12.460 | already when you go to a new city,
00:47:14.460 | or you're trying to understand a new task that has some structure you recognize from previously.
00:47:17.900 | You can see how this is something you're probably doing.
00:47:20.580 | But spatial navigation is an area in neuroscience
00:47:22.580 | which had a huge stream of discoveries over the last 50 years,
00:47:26.060 | and a lot of evidence of the neural basis of this computation.
00:47:29.860 | So we're going to talk through some of those examples.
00:47:31.860 | The earliest of these are psychologists like Tolman,
00:47:34.820 | who were showing that rats, in this case,
00:47:37.620 | can do this kind of path integration structure.
00:47:39.980 | So the way this worked is they got put at a start position here,
00:47:42.460 | down at the bottom, S,
00:47:43.660 | and they got trained that this route up here got you a reward.
00:47:47.020 | So this is the maze we had to run around.
00:47:49.020 | Then they were asked, they were put in this new...
00:47:51.420 | the same thing, but they blocked off this path that takes this long, winding route,
00:47:55.500 | and given instead a selection of all these arms to go down.
00:47:58.220 | And they look at which path the rat goes down.
00:48:00.620 | And the finding is that the rat goes down the one
00:48:02.900 | that corresponds to heading off in this direction.
00:48:05.100 | So the rat has somehow not just learned...
00:48:07.100 | like, you know, one option of this is it's like blind memorization of actions
00:48:10.300 | that I need to take in order to route around.
00:48:12.300 | Instead, no, it's learning actually that the...
00:48:14.180 | embedding the reward in its understanding of 2D space
00:48:16.820 | and taking a direct route there, even though it's never taken it before.
00:48:19.620 | There's evidence that rats are doing this as well as us.
00:48:22.220 | And then a series of, like, neural discoveries about the basis of this.
00:48:26.420 | So John O'Keefe stuck an electrode in the hippocampus,
00:48:30.300 | which is a brain area we'll talk more about,
00:48:32.020 | and found these things called place cells.
00:48:33.700 | So what I'm plotting here is each of these columns is a single neuron.
00:48:38.660 | And the mouse or rat, I can't remember, is running around a square environment.
00:48:42.900 | The black lines are the path the rodent traces out through time.
00:48:47.420 | And you put a red dot down every time you see this individual neuron spike.
00:48:51.100 | And then the bottom plot of this is just a smooth version of that spike rate,
00:48:54.420 | so that firing rate, which you can think of as, like,
00:48:56.380 | the activity of a neuron in a neural network.
00:48:58.020 | That's the analogy that people usually draw.
00:49:00.140 | And so these ones are called place cells because they're neurons
00:49:02.060 | that respond in a particular position in space.
00:49:04.380 | And in the '70s, this was, like, huge excitement, you know,
00:49:06.340 | and people have been studying mainly, like, sensory systems and motor output.
00:49:09.300 | And suddenly, a deep cognitive variable plays something you never--
00:49:11.980 | you don't have a GPS signaler, but somehow there's this, like,
00:49:14.540 | signal for what looks like position in the brain in very, like, understandable ways.
00:49:19.940 | The next step in-- the biggest step, I guess, in this chain of discovery
00:49:23.740 | is the Moser Lab, which is a group in Norway.
00:49:27.180 | They stuck an electrode in a different area of the brain.
00:49:29.460 | The medial entorhinal cortex.
00:49:30.540 | And so this is the hippocampal entorhinal system we're going to be talking about.
00:49:33.300 | And they found this neuron called a grid cell.
00:49:35.140 | So again, the same plot structure that I'm showing here,
00:49:37.420 | but instead, these neurons respond not in one position in a room,
00:49:39.860 | but in, like, a hexagonal lattice of positions in a room.
00:49:43.980 | Okay, so this-- these two, I guess, I'm showing to you because they, like,
00:49:48.260 | really motivate the underlying neural basis of this kind of, like,
00:49:52.060 | spatial cognition, embodying the structure of this space in some way.
00:49:56.260 | Okay, and it's very surprising finding why neurons choosing to represent things
00:49:59.620 | with this hexagonal lattice.
00:50:00.660 | It's, like, yeah, provoked a lot of research.
00:50:03.700 | And broadly, there's been, like, many more discoveries in this area.
00:50:06.700 | So there's place cells I've talked to you about, grid cells,
00:50:09.860 | cells that respond based on the location of not yourself, but another animal,
00:50:13.860 | cells that respond when your head is facing a particular direction,
00:50:16.740 | cells that respond to when you're a particular distance away from an object.
00:50:20.660 | So, like, I'm one step south of an object, that kind of cell.
00:50:24.780 | Cells that respond to reward positions,
00:50:26.540 | cells that respond to vectors to boundaries,
00:50:28.900 | cells that respond to-- so, like, all sorts, all kinds of structure
00:50:32.100 | that this pair of brain structures, the hippocampus here, this red area,
00:50:37.420 | and the entorhinal cortex, this blue area here,
00:50:40.260 | which is, yeah, conserved across a lot of species, are represented.
00:50:45.260 | There's also finally one finding in this that's fun,
00:50:47.380 | is they did an fMRI experiment on London taxicab drivers.
00:50:51.500 | And I don't know if you know this, but the London taxicab drivers,
00:50:54.660 | they do a thing called the Knowledge, which is a two-year-long test
00:50:58.300 | where they have to learn every street in London.
00:51:00.500 | And the idea is the test goes something like,
00:51:02.740 | "Oh, there's a traffic jam here and a roadwork here,
00:51:05.060 | and I need to get from, like, Camden Town down to Wandsworth
00:51:08.780 | in the quickest way possible. What route would you go?"
00:51:10.700 | And they have to tell you which route they're going to be able to take
00:51:12.260 | through all the roads and, like, how they would replan
00:51:14.300 | if they found a stop there, those kind of things.
00:51:16.220 | So, it's, like, intense-- you see them, like, driving around sometimes,
00:51:18.540 | learning all of these, like, routes with little maps.
00:51:21.220 | They're being made a little bit obsolete by Google Maps,
00:51:24.100 | but, you know, luckily, they got them before that--
00:51:26.460 | this experiment was done before that was true.
00:51:28.260 | And so, what they've got here is a measure of the size of your hippocampus
00:51:31.740 | using fMRI versus how long you've been a taxicab driver in months.
00:51:35.140 | And the claim is basically the longer you're a taxicab driver,
00:51:37.020 | the bigger your hippocampus, because the more you're having to do
00:51:38.780 | this kind of spatial reasoning.
00:51:40.780 | So, that's a big set of evidence that these brain areas
00:51:44.140 | are doing something to do with space.
00:51:46.260 | But there's a lot of evidence that there's something more than that,
00:51:49.300 | something non-spatial going on in these areas, okay?
00:51:52.180 | And we're going to build these together to make the broader claim
00:51:55.100 | about this, like, underlying structural inference.
00:51:57.580 | And so, I'm going to talk through a couple of those.
00:52:00.820 | The first one of these is a guy called Patient HM.
00:52:03.860 | This is the most studied patient in, like, medical history.
00:52:07.460 | He had epilepsy, and to cure intractable epilepsy,
00:52:11.980 | you have to cut out the brain region that's causing these, like,
00:52:14.860 | seizure-like events in your brain.
00:52:17.060 | And in this case, the epilepsy was coming from the guy's hippocampus,
00:52:20.700 | so they bilaterally lesioned his hippocampus.
00:52:22.660 | They, like, cut out both of his hippocampi.
00:52:25.100 | And it turned out that this guy then had terrible amnesia.
00:52:28.220 | He never formed another memory again, and he could only recall memories
00:52:30.940 | from a long time before the surgery happened, okay?
00:52:34.580 | But, you know, experiments on him showed a lot of this stuff
00:52:38.300 | about how we understand the neural basis of memory,
00:52:41.140 | things like he could still learn to do motor tasks.
00:52:43.740 | So, somehow, the motor tasks are being done.
00:52:45.500 | For example, they gave him some very difficult motor coordination tasks
00:52:48.020 | that people can't generally do, but can with a lot of practice.
00:52:50.500 | And he got very good at this eventually,
00:52:52.580 | and was as good as other people at learning to do that.
00:52:54.580 | He had no recollection of ever doing the task.
00:52:56.340 | So, he'd go in to do this new task and be like,
00:52:58.220 | "I've never seen this before. I have no idea what you're asking me to do."
00:53:00.300 | And he'd do it amazingly.
00:53:01.460 | He'd be like, "Yeah, sorry."
00:53:03.820 | There's some evidence there that the hippocampus is involved
00:53:05.660 | in at least some parts of memory there,
00:53:07.020 | which seems a bit separate to this stuff about space
00:53:08.900 | that I've been talking to you about.
00:53:10.860 | The second of these is imagining things.
00:53:13.220 | So, this is actually a paper by Demis Hassabis,
00:53:15.220 | who, before he was head of DeepMind, was a neuroscientist.
00:53:19.500 | And here, maybe you can't read that. I'll read some of these out.
00:53:23.220 | You're asked to imagine you're lying on a white sandy beach
00:53:25.860 | in a beautiful tropical bay.
00:53:27.220 | And so, the control, this bottom one, says things like,
00:53:28.940 | "It's very hot and the sun is beating down on me.
00:53:30.500 | The sand underneath me is almost unbearably hot.
00:53:33.020 | I can hear the sounds of small wavelets lapping on the beach.
00:53:35.220 | The sea is gorgeous aquamarine color."
00:53:36.940 | You know, like, so a nice, lucid description of this beach scene.
00:53:41.220 | Whereas the person with a hippocampal damage says,
00:53:44.820 | "As for seeing, I can't really, apart from just the sky.
00:53:47.020 | I can hear the sound of seagulls and of the sea.
00:53:50.300 | I can feel the grain of sand beneath my fingers."
00:53:53.780 | And then, like, yeah, struggles, basically.
00:53:56.020 | Really struggles to do this imagination scenario.
00:53:58.220 | Some of the things written in these are, like, very surprising.
00:54:00.540 | So, the last of these is this transitive inference task.
00:54:06.260 | So, transitive inference, A is greater than B,
00:54:08.380 | B is greater than C, therefore, A is greater than C.
00:54:11.660 | And the way they convert this into a rodent experiment
00:54:14.180 | is you get given two pots of food that have different smells.
00:54:17.220 | And your job is to go to the pot of food.
00:54:19.180 | You learn which pot of food has, sorry,
00:54:21.100 | which pot with the smell has the food.
00:54:23.980 | And so, these are colored by the two pots by their smell, A and B.
00:54:28.380 | And the rodent has to learn to go to a particular pot,
00:54:30.700 | in this case, the one that smells like A.
00:54:32.500 | And they do two of these.
00:54:33.420 | They do A has the food when it's presented in a pair with B,
00:54:36.460 | and B has the food when it's presented in a pair with C.
00:54:39.060 | And then they test what the mice do
00:54:40.700 | when presented with A and C, a completely new situation.
00:54:43.260 | If they have a hippocampus, they'll go for A over C.
00:54:45.980 | They'll do transitive inference.
00:54:47.500 | If they don't have one, they can't.
00:54:49.620 | And so, there's a much broader set of findings.
00:54:51.740 | This is like, oh, I've shown you how hippocampus is used
00:54:53.540 | for this spatial stuff that people have been excited about.
00:54:55.900 | But there's also all of this kind of relational stuff,
00:54:58.540 | imagining new situations,
00:55:00.020 | some slightly more complex story here.
00:55:02.780 | The last thing I'm going to do
00:55:04.100 | is the entorhinal cortex as well.
00:55:05.820 | So, if you remember,
00:55:07.100 | the hippocampus had those place cells,
00:55:08.580 | and the entorhinal cortex had these grid cells.
00:55:10.460 | Well, the entorhinal cortex was appearing
00:55:12.020 | to do some broader stuff as well.
00:55:14.300 | This is all motivation for the model,
00:55:15.660 | just trying to build all of these things together.
00:55:18.180 | So, in this one,
00:55:19.460 | this is called the Stretchy Birds Task, okay?
00:55:22.500 | So, you put people in an fMRI machine
00:55:24.980 | and you make them navigate, but navigate in bird space.
00:55:28.020 | And what bird space means
00:55:29.540 | is it's a two-dimensional space of images.
00:55:32.860 | And each image is one of these birds.
00:55:34.860 | And as you vary along the X dimension,
00:55:36.820 | the bird's legs get longer and shorter.
00:55:38.740 | And as you vary along the Y direction,
00:55:40.100 | the bird's neck gets longer and shorter, okay?
00:55:42.900 | And the patients sit there, or subjects sit there,
00:55:46.900 | and just watch the bird images change
00:55:48.700 | so that it traces out some part in 2D space.
00:55:50.780 | But they never see the 2D space.
00:55:52.060 | They just see the images, okay?
00:55:53.980 | And the claim is basically,
00:55:55.260 | and then they're asked to do some navigational tasks.
00:55:57.260 | They're like, "Oh, whenever you're in this place
00:55:59.860 | "in 2D space, you show Santa Claus next to the bird."
00:56:03.300 | And so, the participants have to pin
00:56:05.620 | that particular bird image,
00:56:06.660 | that particular place in 2D space to the Santa Claus.
00:56:08.940 | And you're asked to go and find the Santa Claus again
00:56:12.020 | using some non-directional controller.
00:56:13.500 | And they navigate their way back.
00:56:15.380 | And the claim is that these people use grid cells.
00:56:18.260 | So, the entorhinal cortex is active
00:56:20.060 | in how these people are navigating
00:56:21.780 | this abstract cognitive bird space.
00:56:24.100 | And the way you test that claim
00:56:25.460 | is you look at the fMRI signal in the entorhinal cortex
00:56:30.180 | as the participants head
00:56:31.460 | at some particular angle in bird space.
00:56:34.260 | And because of the six-fold symmetry
00:56:35.780 | of the hexagonal lattice,
00:56:37.100 | you get this six-fold symmetric waving up and down
00:56:39.620 | of the entorhinal cortex activity
00:56:41.860 | as you head in particular directions in 2D space.
00:56:44.700 | So, it's like evidence that this system is being used
00:56:46.540 | not just for navigation in 2D space,
00:56:48.500 | but any cognitive task with some underlying structure
00:56:51.180 | that you can extract, you use it to do these tasks.
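A rough sketch of the kind of six-fold-symmetry analysis described above, on synthetic data rather than the study's actual fMRI pipeline (all names and numbers here are invented for illustration): a hexagonally modulated signal should vary with heading angle roughly as cos(6(theta - phi)), so you can regress the measured signal onto six-fold sinusoids and check that this beats control symmetries such as four-, five-, or seven-fold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "entorhinal" signal: six-fold modulation with heading angle theta,
# plus noise. Purely an illustrative stand-in for real fMRI data.
theta = rng.uniform(0, 2 * np.pi, size=500)       # heading directions in bird space
phase = 0.4                                       # unknown grid orientation
signal = 1.0 + 0.5 * np.cos(6 * (theta - phase)) + 0.2 * rng.standard_normal(500)

def symmetry_amplitude(theta, signal, k):
    """Regress signal onto cos(k*theta) and sin(k*theta); return the modulation amplitude."""
    X = np.column_stack([np.cos(k * theta), np.sin(k * theta), np.ones_like(theta)])
    coef, *_ = np.linalg.lstsq(X, signal, rcond=None)
    return np.hypot(coef[0], coef[1])

# Six-fold modulation should stand out relative to the control symmetries.
for k in (4, 5, 6, 7):
    print(f"{k}-fold amplitude: {symmetry_amplitude(theta, signal, k):.3f}")
```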
00:56:54.620 | - Is there significance to bird space also being 2D here?
00:56:58.060 | - Yes, yes.
00:56:59.100 | - Like, have people tried this
00:57:00.060 | with multiple dimensions of variability?
00:57:02.780 | - People haven't done that experiment,
00:57:04.380 | but people have done things like look at how grid cells...
00:57:09.220 | Ah, have they even done that?
00:57:11.740 | They've done things like 3D space,
00:57:13.740 | but not like cognitive 3D space.
00:57:15.380 | They've done, like, literally like make...
00:57:17.340 | They've done it in bats.
00:57:18.180 | They stick electrodes in bats
00:57:19.020 | and make the bats fly around the room
00:57:20.260 | and look at how their grid cells respond.
00:57:22.060 | Yeah, but definitely, I think they've done it...
00:57:25.940 | Ah, they've done it in sequence space.
00:57:29.820 | So, in this case, you hear a sequence of sounds
00:57:31.660 | with hierarchical structure.
00:57:33.260 | So, it's like how there's months, weeks, days, and meals,
00:57:36.380 | something like that.
00:57:37.220 | So, like, weeks have a periodic structure,
00:57:39.060 | months have a periodic structure,
00:57:40.140 | days have a periodic structure,
00:57:41.220 | and meals have a periodic structure.
00:57:43.020 | And so, you hear a sequence of sounds
00:57:44.420 | with exactly the same kind of structure
00:57:45.580 | as that hierarchy of sequences,
00:57:47.180 | and you look at the representation
00:57:48.020 | in the entorhinal cortex through fMRI,
00:57:49.980 | and you see exactly the same thing,
00:57:50.820 | that the structure is all represented there.
00:57:53.140 | Even more than that,
00:57:54.060 | you actually see in the entorhinal cortex
00:57:55.980 | an array of length scales.
00:57:58.140 | So, at one end of the entorhinal cortex,
00:57:59.460 | you've got very large length-scale grid cells
00:58:01.220 | that are, like, responding to large variations in space.
00:58:03.780 | The other end, you've got very small ones,
00:58:05.260 | and you see the same thing recapitulated there.
00:58:07.100 | The, like, meals cycle, which cycles a lot quicker,
00:58:09.660 | is represented in one end of the entorhinal cortex
00:58:11.540 | in fMRI, and the months cycle is at the other end,
00:58:13.980 | with, like, a scale in between.
00:58:15.260 | So, there's some, yeah, evidence to that end.
00:58:17.500 | All right.
00:58:20.100 | So, I've been talking about MEC,
00:58:21.380 | the medial entorhinal cortex.
00:58:22.620 | Another brain area that people don't look at as much
00:58:24.820 | is the LEC, the lateral entorhinal cortex,
00:58:26.940 | but it will be important for this model.
00:58:28.580 | And basically, the only bit that you should be aware of
00:58:31.060 | before we get to the model
00:58:31.980 | is that it seems to represent very high-level things:
00:58:34.060 | the similarity structure in the lateral entorhinal cortex
00:58:36.660 | seems to be, like, a very high-level semantic one.
00:58:38.780 | For example, you present some images,
00:58:40.860 | and you look at how, you know, in the visual cortex,
00:58:42.900 | things are more similarly represented if they look similar,
00:58:45.540 | but by the time you get to the lateral entorhinal cortex,
00:58:47.340 | things look more similar based on their usage.
00:58:49.060 | For example, like, an ironing board and an iron
00:58:51.340 | will be represented similarly,
00:58:52.980 | even though they look very different
00:58:54.020 | because they're somehow, like, semantically linked.
00:58:56.340 | Okay, so that's the role that the LEC
00:58:57.980 | is gonna play in this model.
00:59:00.420 | So yeah, basically, the claim is
00:59:02.980 | this is for more than just 2D space.
00:59:04.940 | So the neural implementation of this cognitive map,
00:59:07.220 | which is for not only 2D space,
00:59:08.500 | which this cartoon is supposed to represent,
00:59:10.340 | but also things, any other structure.
00:59:12.580 | So for some structures, like transitive inference,
00:59:15.300 | this one is faster than that, and it's faster than that,
00:59:17.300 | or family trees, like this person is my mother's brother
00:59:20.820 | and is therefore my uncle, those kind of things.
00:59:23.140 | These, like, broader structural inferences
00:59:24.740 | that you'll want to be able to use in many situations
00:59:26.420 | with basically the same problem.
00:59:28.900 | Great, that was a load of neuroscience.
00:59:31.180 | Now we're gonna get onto the model
00:59:32.420 | that tries to summarize all of these things,
00:59:34.700 | and that's gonna be the model
00:59:35.540 | that will end up looking like a transformer.
00:59:37.940 | So yeah, we basically want this separation.
00:59:42.020 | These diagrams here are supposed to represent
00:59:44.060 | a particular environment that you're wandering around.
00:59:45.980 | It has an underlying grid structure,
00:59:47.460 | and you see a set of stimuli at each point on the grid,
00:59:49.780 | which are these little cartoon bits.
00:59:51.420 | And you wanna try and create a thing
00:59:53.180 | that separates out this, like, 2D structural grid
00:59:55.300 | from the actual experiences you're seeing.
00:59:57.100 | And the mapping to the things I've been showing you
00:59:58.860 | is that this grid-like code is actually the grid cells
01:00:02.140 | in the medial entorhinal cortex
01:00:03.300 | are somehow abstracting the structure.
01:00:05.300 | The lateral entorhinal cortex,
01:00:07.020 | encoding these semantically meaningful similarities,
01:00:09.740 | will be the objects that you're seeing.
01:00:11.100 | So it's just like, this is what I'm seeing in the world.
01:00:13.460 | And the combination of the two of them
01:00:15.700 | will be the hippocampus.
01:00:16.900 | So yeah, in more diagrams, we've got G, the structural code,
01:00:21.900 | the grid code, and MEC, and LEC.
01:00:24.740 | Oh, is someone asking a question?
01:00:26.580 | - Since morning, so now it's lunchtime.
01:00:29.620 | Yeah.
01:00:30.460 | - Sorry, I can't hear you if you're asking a question.
01:00:35.540 | How do I mute someone if they're...
01:00:38.500 | Maybe type it in the chat if there is one.
01:00:45.620 | (audience laughing)
01:00:50.460 | Nice.
01:00:51.300 | So yeah, we got the hippocampus in the middle,
01:00:54.700 | which is gonna be our binding
01:00:56.060 | of the two of them together.
01:00:57.460 | Okay.
01:01:00.300 | So I'm gonna step through each of these three parts
01:01:02.220 | on their own and how they do the job
01:01:03.580 | that I've assigned to them,
01:01:05.300 | and then come back together and show the full model.
01:01:08.500 | So lateral entorhinal cortex encodes what you're seeing.
01:01:11.820 | So this is like these images
01:01:13.940 | or the houses we were looking at before,
01:01:15.620 | and that would just be some vector XT that's different.
01:01:17.620 | So a random vector, different for every symbol.
01:01:19.940 | The medial entorhinal cortex is the one
01:01:23.660 | that tells you where you are in space.
01:01:25.740 | And it has the job of path integrating.
01:01:27.820 | Okay?
01:01:28.660 | So this means receiving a sequence of actions
01:01:30.380 | that you've taken in space.
01:01:31.340 | For example, I went north, east, and south,
01:01:33.340 | and tell you where in 2D space that you are.
01:01:35.300 | So it's somehow the bit that embeds the structure of the world
01:01:38.140 | and the way that we'll do that is this G of T,
01:01:40.620 | this vector of activities in this brain area,
01:01:43.300 | will be updated by a matrix
01:01:45.660 | that depends on the actions you've taken.
01:01:47.500 | Okay?
01:01:48.340 | So if you step north, you update the representation
01:01:50.140 | with the step north matrix.
01:01:51.820 | Okay?
01:01:52.700 | And those matrices are gonna have to obey some rules.
01:01:54.540 | For example, if you step north, then step south,
01:01:56.540 | you haven't moved.
01:01:57.380 | And so the step north matrix and the step south matrix
01:01:59.420 | have to be inverses of one another
01:02:00.860 | so that the activity stays the same
01:02:03.060 | and represents the structure of the world somehow.
01:02:05.460 | Okay.
01:02:07.500 | So that's the world structure part.
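As a toy illustration of that path-integration constraint (these are hand-built rotation matrices standing in for the action updates, not TEM's learned matrices): each action multiplies the structural code by an invertible matrix, opposite actions use inverse matrices, and so any closed loop through 2D space brings the representation back to where it started.

```python
import numpy as np

def rotation(angle, dim=8):
    """Block-diagonal rotation matrix: invertible, and rotation(-angle) is its inverse."""
    R = np.eye(dim)
    for i in range(0, dim, 2):
        c, s = np.cos(angle * (i + 1)), np.sin(angle * (i + 1))
        R[i:i + 2, i:i + 2] = [[c, -s], [s, c]]
    return R

# One update matrix per action; stepping south undoes stepping north, etc.
actions = {"north": rotation(0.3), "south": rotation(-0.3),
           "east": rotation(0.7), "west": rotation(-0.7)}

g0 = np.ones(8) / np.sqrt(8)                     # initial structural code g(0)
g = g0.copy()
for a in ["north", "east", "south", "west"]:     # a closed loop in 2D space
    g = actions[a] @ g                           # g(t+1) = W_a g(t)

print(np.allclose(g, g0))                        # True: back where we started
```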
01:02:09.820 | Finally, the memory,
01:02:11.980 | because we have to memorize which things
01:02:14.180 | we found at which positions,
01:02:15.740 | it's gonna happen in the hippocampus,
01:02:16.900 | and that's gonna be through a version of these things
01:02:19.060 | called the Hopfield networks
01:02:20.340 | that you heard mentioned in the last talk.
01:02:22.940 | So this is like a content addressable memory
01:02:24.780 | and the claim is it's biologically plausible.
01:02:27.540 | The way it works is you have a set of activities, P,
01:02:30.340 | which are the activities of all these neurons.
01:02:32.420 | And when it receives an input,
01:02:34.100 | it just, like, recurrently updates itself.
01:02:36.900 | So there's some weight matrix in here, W,
01:02:38.860 | and some non-linearity, and you run it forward in time,
01:02:40.860 | and it's just, like, a dynamical system
01:02:43.060 | that settles into some attractor state.
01:02:45.460 | And the way you make it do memory
01:02:47.020 | is through the weight matrix.
01:02:48.540 | Okay?
01:02:49.380 | So you make it like a sum of outer products
01:02:51.060 | of these chi mu, each chi is some memory,
01:02:54.020 | some pattern you wanna record.
01:02:56.060 | Okay?
01:02:57.020 | And then it's, yeah, this is just writing it in there.
01:02:59.140 | The update pattern is like that.
01:03:00.980 | And the claim is basically that if P,
01:03:03.580 | the memory, the activity of the neurons,
01:03:06.460 | the hippocampal neurons is close to some memory,
01:03:08.660 | say chi mu, then this dot product will be much larger
01:03:12.340 | than all of the other dot products
01:03:13.980 | with all the other memories.
01:03:15.260 | So this sum over all of them will basically be dominated
01:03:17.820 | by this one term, chi mu.
01:03:20.620 | And so your attractor network
01:03:22.540 | will basically settle into that one chi mu.
01:03:25.220 | And maybe to preempt some of the stuff
01:03:26.060 | that's gonna come later,
01:03:27.500 | you can see how this like similarity between points
01:03:30.300 | is, yeah, pairwise similarity,
01:03:32.420 | and then adding some, adding them up,
01:03:33.980 | weighted by this pairwise similarity
01:03:35.980 | is the bit that's gonna turn out
01:03:37.060 | looking a bit like attention.
01:03:38.460 | And so some cool things you can do with these systems
01:03:42.740 | is like, here's a set of images
01:03:44.820 | that someone's encoded in a Hopfield network,
01:03:46.780 | and then someone's presented this image to the network
01:03:49.340 | and asked it to just run to its like dynamical attractor,
01:03:52.660 | minima, and it recreates all of the memory
01:03:54.980 | that it's got stored in it.
01:03:55.900 | So like completes the rest of the image.
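A minimal sketch of that kind of classic, binary Hopfield network, with the weight matrix written as a sum of outer products of the stored patterns and a sign non-linearity for the recurrent update. The sizes and seed are arbitrary; this is an illustration of the idea, not anyone's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 200, 5                                    # neurons, stored patterns

# Store M random +1/-1 patterns chi_mu via a sum of outer products.
chi = rng.choice([-1.0, 1.0], size=(M, N))
W = chi.T @ chi / N                              # W = (1/N) * sum_mu chi_mu chi_mu^T
np.fill_diagonal(W, 0.0)                         # no self-connections

def recall(p, steps=10):
    """Run the recurrent dynamics p <- sign(W p) until they settle."""
    for _ in range(steps):
        p = np.sign(W @ p)
    return p

# Query with a corrupted copy of pattern 0 (flip 20% of its bits)...
query = chi[0].copy()
flip = rng.choice(N, size=N // 5, replace=False)
query[flip] *= -1

# ...and the attractor dynamics complete the rest of the "image".
recovered = recall(query)
print("overlap with stored pattern:", (recovered @ chi[0]) / N)   # typically 1.0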
01:03:57.900 | So that's our system.
01:04:01.140 | Yeah.
01:04:03.740 | - I'm sorry if I'm trying to call you out,
01:04:05.300 | I'm just trying to assess like where is the,
01:04:08.100 | like where we're getting to like,
01:04:10.020 | I've heard that like this interpretation,
01:04:12.660 | like the modern interpretation, Hopfield network.
01:04:16.060 | - This one is actually, which interpretation, sorry.
01:04:18.700 | - I bet that like it's effectively, is that.
01:04:21.980 | - Ah, yeah.
01:04:22.820 | Yeah, yeah, yeah.
01:04:24.900 | It's only the link to transformers
01:04:26.460 | will basically only be through the fact,
01:04:28.300 | there's classic Hopfield networks
01:04:29.380 | and then there's modern ones
01:04:30.220 | that were made in like 2016.
01:04:32.140 | And the link between attention and modern is precise,
01:04:35.740 | the link with like classic is not as precise.
01:04:38.700 | I mean, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah.
01:04:42.300 | - The modern Hopfield network is to continue with virtual,
01:04:45.340 | but with the original Hopfield network it's flatter.
01:04:48.420 | - With a change in the non-linearity, right?
01:04:51.500 | 'Cause then you have to do the exponentiation thing.
01:04:56.060 | We'll maybe get to it later, you can tell me.
01:04:57.380 | - I'm sorry, I'm not gonna take too long.
01:04:58.820 | - No, no, no, no, all right.
01:04:59.900 | More questions is good.
01:05:01.100 | We'll get, yeah, maybe.
01:05:03.340 | - We have a separate energy function
01:05:04.940 | and I think the exponential is in that.
01:05:07.660 | - Mm.
01:05:08.500 | - If that, yeah.
01:05:10.140 | - Okay, thanks.
01:05:10.980 | - No worries.
01:05:11.820 | So that's basically how our system's gonna work.
01:05:15.420 | But this Tolman-Eichenbaum Machine,
01:05:17.100 | that's what the name of this thing is.
01:05:18.780 | And so the patterns you wanna store in the hippocampus,
01:05:23.220 | so these memories that we wanna embed,
01:05:24.980 | are a combination of the position and the input.
01:05:27.460 | And like half of Homer's face here,
01:05:29.700 | if you then have decided you wanna end up
01:05:31.340 | at a particular position,
01:05:32.340 | you can recall the stimulus that you saw there
01:05:34.460 | and predict that as your next observation.
01:05:37.580 | Or vice versa.
01:05:38.420 | If you see a new thing, you can infer,
01:05:40.300 | oh, I path integrated wrong, I must actually be here.
01:05:42.900 | Assuming there's usually more than one thing in the world
01:05:46.420 | that might be in a different position, so.
01:05:48.180 | Yeah, that's the whole system.
01:05:54.380 | Does the whole Tolman-Eichenbaum Machine make sense,
01:05:54.380 | roughly, what it's doing?
01:05:55.620 | Okay, cool.
01:05:58.780 | And basically, this last bit is saying it's really good.
01:06:02.500 | So what I'm showing here,
01:06:04.380 | this is on the 2D navigation task.
01:06:06.420 | So it's a big grid.
01:06:07.460 | I think they use like, I don't know,
01:06:09.180 | 11 by 11 or something,
01:06:10.020 | and it's like wandering around
01:06:10.860 | and has to predict what it's gonna see
01:06:11.900 | in some new environment.
01:06:13.340 | And on here, this is the number of nodes
01:06:16.420 | in that graph that you've visited.
01:06:18.180 | And on the y-axis is how much you correctly predict.
01:06:20.780 | And each of these lines is based on
01:06:22.660 | how many of those type of environments I've seen before,
01:06:25.340 | how quickly do I learn?
01:06:26.660 | And the basic phenomena it's showing is,
01:06:28.100 | over time, as you see more and more of these environments,
01:06:31.140 | you learn to learn.
01:06:31.980 | So you learn the structure of the world
01:06:33.900 | and eventually able to quickly generalize
01:06:35.300 | to the new situation and predict what you're gonna see.
01:06:38.100 | And this scales not with the number of edges
01:06:40.420 | that you've visited,
01:06:41.260 | which would be the learn-everything option,
01:06:43.860 | 'cause if you're trying to predict
01:06:45.740 | which state I'm gonna see,
01:06:46.660 | given my current state and action,
01:06:48.380 | in a dumb way,
01:06:49.220 | you just need to see all state-action pairs,
01:06:50.660 | so all edges.
01:06:52.260 | But this thing is able to do it much more cleverly
01:06:53.980 | 'cause it only needs to visit all nodes
01:06:55.420 | and just memorize what is at each position.
01:06:57.220 | And you can see that its learning curve
01:06:58.460 | follows the number of nodes visited learning curve.
01:07:01.940 | So it's doing well.
01:07:03.300 | For neuroscience, this is also exciting,
01:07:05.500 | is that the neural patterns of response
01:07:08.220 | in these model regions
01:07:10.500 | match the ones observed in the brain.
01:07:12.700 | So in the hippocampal section,
01:07:15.180 | you get place cell-like activities.
01:07:17.780 | This hexagon is the grid of the environment
01:07:19.860 | that it's exploring,
01:07:20.860 | and plotted is the firing rate of that neuron,
01:07:26.020 | whereas the ones in the medial entorhinal cortex
01:07:26.020 | show this grid-like firing pattern.
01:07:27.820 | Yeah.
01:07:30.460 | - This, like, exactly compared,
01:07:32.380 | operates on this free bed space in the center.
01:07:35.140 | Do you have any thoughts about
01:07:36.380 | how that transfers, like,
01:07:37.620 | do you think it was a real-world thing?
01:07:39.060 | Do you think that it was just, like,
01:07:40.340 | math representation of, like,
01:07:41.860 | a very nicely distributed space?
01:07:44.300 | Or do you think there's something
01:07:45.220 | more complicated going on?
01:07:47.220 | - Yeah, I imagine there's something
01:07:48.260 | more complicated going on.
01:07:49.380 | I guess this...
01:07:50.220 | - So this is, like, a super high-tech--
01:07:54.340 | - No, no, no, no, yeah.
01:07:55.180 | Maybe you can make--
01:07:56.020 | (laughs)
01:07:56.860 | Maybe you can make arguments that they're,
01:07:58.180 | as I was saying, there's these different modules
01:07:59.860 | that operate at different scales.
01:08:01.100 | You can see this already here.
01:08:01.940 | Like, grid cells at one scale,
01:08:03.460 | grid cells at another scale.
01:08:04.820 | And so you could imagine how that could be useful
01:08:07.260 | for, like, one of them operates
01:08:09.220 | in the highest level and mix those.
01:08:10.620 | One of them operates at the lowest level.
01:08:11.900 | You know, and, like, yeah, adaptable.
01:08:13.500 | They seem to scale up or down
01:08:14.940 | depending on your environment.
01:08:16.140 | And so, like, an adaptable set of length scales
01:08:18.060 | that you can use to be pretty good.
01:08:19.260 | So that's quite speculative, so.
01:08:21.020 | (laughs)
01:08:23.420 | Okay, sorry, yeah.
01:08:24.700 | - So make sure I understand if you go up.
01:08:26.740 | - Okay, one more.
01:08:30.340 | - Yeah, so you have your...
01:08:33.260 | What's the key and what's the value?
01:08:35.220 | - Yeah, so the...
01:08:37.860 | - And Hopfield networks are always auto-associative
01:08:41.420 | instead of hetero-associative.
01:08:42.900 | So how are you...
01:08:44.380 | - The memories that we're gonna put in,
01:08:46.020 | so the patterns, let's say chi mu,
01:08:48.500 | is gonna be some, like, outer product
01:08:50.660 | of the position at a given time
01:08:52.980 | and the sensory input, flattened.
01:08:56.660 | So we, yeah, take the outer product of those.
01:09:02.540 | So every element in X gets to see every element of G,
01:09:05.060 | flatten those out, and that's your vector
01:09:06.260 | that you're gonna embed.
01:09:07.380 | Does that make sense?
01:09:08.220 | - Yeah. - Sorry, I should put that on.
01:09:09.820 | Yeah, and then you do the same operation
01:09:13.780 | except you flatten with an identity in the...
01:09:16.100 | Let's say you're at a position
01:09:17.140 | you wanna predict what you're gonna see.
01:09:18.180 | You set X to be identity, you do this operation
01:09:20.460 | that creates a very big vector from G,
01:09:22.020 | you put that in and you let it run its dynamics,
01:09:23.820 | and it recalls the pattern, and you, like,
01:09:25.900 | learn a network that, like, traces out the X from that.
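A very simplified sketch of that binding-and-recall idea, replacing the full attractor dynamics with a single linear read-out; it only illustrates the outer-product memory, not the Tolman-Eichenbaum Machine's actual implementation, and all dimensions are invented. Each memory binds an observation x_t to its position code g_t with an outer product, and querying with a position code approximately recovers the observation stored there.

```python
import numpy as np

rng = np.random.default_rng(2)
d_g, d_x, T = 128, 16, 10                        # position dim, observation dim, steps

# Random unit position codes g_t and +1/-1 observation vectors x_t for a short walk.
G = rng.standard_normal((T, d_g))
G /= np.linalg.norm(G, axis=1, keepdims=True)
X = rng.choice([-1.0, 1.0], size=(T, d_x))

# "Hippocampal" memory: a sum of outer products binding each x_t to its g_t.
M = sum(np.outer(X[t], G[t]) for t in range(T))

# Query: I think I'm back at position 3; what did I see there?
g_query = G[3]
x_hat = M @ g_query                              # unbind: approximately x_3 plus crosstalk

print("fraction of bits recovered:", np.mean(np.sign(x_hat) == X[3]))   # typically 1.0
```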
01:09:30.020 | - And the figures you show, if you go down a bit?
01:09:33.420 | - Yeah.
01:09:34.260 | - Yeah, it's hard to see, but what's on the X axis
01:09:37.500 | and what, like, what, are you training a Hopfield network
01:09:42.500 | with this flattened outer product of the, okay.
01:09:46.980 | - Yeah, the actual, the training that's going on
01:09:49.340 | is more in the structure of the world,
01:09:51.580 | 'cause it has to learn those matrices.
01:09:53.180 | All it gets told is which action type it's taken,
01:09:55.580 | and it has to learn the fact that stepping east
01:09:57.340 | is the opposite of stepping west.
01:09:58.860 | So all of the learning of stuff is in those matrices,
01:10:01.020 | learning to get the right structure.
01:10:03.340 | - Okay.
01:10:04.180 | - There's also, I mean, 'cause the Hopfield network learning,
01:10:06.020 | the Hopfield network's, like,
01:10:06.860 | re-initialized every environment,
01:10:08.380 | and you're just, like, shoving memories in.
01:10:10.300 | So it's less, like, that's less the bit that's true.
01:10:12.860 | It's causing this, certainly,
01:10:14.420 | but it's not causing this, like, shift up,
01:10:16.500 | which is, as training progresses
01:10:18.100 | in many different environments,
01:10:19.700 | you get better at the task,
01:10:20.820 | because it's learning the structure of the task.
01:10:23.420 | - Okay.
01:10:24.260 | And the link to, I mean, this is all
01:10:28.020 | just modern Hopfield networks.
01:10:29.740 | - The initial paper was actually classic Hopfield networks,
01:10:33.660 | but, yeah, now the new versions of it
01:10:35.620 | are modern Hopfield networks, yeah.
01:10:36.780 | - Right.
01:10:37.780 | And then, in so much as modern Hopfield networks
01:10:40.420 | equal attention.
01:10:41.740 | - This is a track book.
01:10:44.300 | - Right.
01:10:45.460 | But then you're, okay, and then you have some results,
01:10:48.940 | there are some results looking at activations.
01:10:52.900 | Well, these are recordings in the brain.
01:10:55.460 | - These are, no, these are actually in TEM.
01:10:57.700 | - So this is, the left ones are neurons in the G section,
01:11:00.780 | in the medial entorhinal cortex part of TEM.
01:11:03.700 | As you vary position, yeah.
01:11:06.540 | And we're gonna get, yeah, my last section
01:11:07.780 | is about how TEM is like a transformer.
01:11:10.260 | But we'll get to, hopefully it'll be clear,
01:11:11.860 | the link between the two after that.
01:11:13.740 | Okay, we're happy with that now, hopefully.
01:11:15.740 | Cool, TEM is approximately equal to a transformer, yeah.
01:11:17.900 | So you seem to, you know all of this.
01:11:21.980 | But I guess my notation, at least we can clarify that.
01:11:25.660 | You've got your data,
01:11:26.500 | which is maybe like tokens coming in,
01:11:28.020 | and you've got your positional embedding,
01:11:29.260 | and the positional embedding will play a very big role here.
01:11:32.180 | That's the E, and together they make this vector H.
01:11:34.980 | Okay, and these arrive over time.
01:11:37.700 | Yeah, and you've got your attention updates
01:11:38.820 | that you see some similarity between the key and the query,
01:11:42.220 | and then you add up weighted values with those similarities.
01:11:45.860 | We're all happy with that.
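For reference, a minimal single-query version of that attention update in the same spirit (a generic sketch, not the speaker's code): compute the similarity of the query to every key, softmax those similarities, and return the weighted sum of the values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_update(query, keys, values):
    """Return sum_t softmax(q . k_t) * v_t for a single query vector."""
    sims = keys @ query                          # dot-product similarity to each key
    weights = softmax(sims)                      # normalised attention weights
    return weights @ values                      # similarity-weighted sum of values

# Tiny example: 5 past tokens, 8-dim keys/queries, 4-dim values.
rng = np.random.default_rng(3)
keys, values = rng.standard_normal((5, 8)), rng.standard_normal((5, 4))
query = rng.standard_normal(8)
print(attention_update(query, keys, values).shape)   # (4,)
```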
01:11:47.780 | And here's the stepped version.
01:11:49.300 | So the basic intuition about how these parts
01:11:53.820 | map onto each other is that the G is the positional encoding,
01:11:57.460 | as you may have been able to predict.
01:11:58.780 | The X are the input tokens.
01:12:00.660 | This guy, when you put in the memory,
01:12:04.260 | and you try and recall which memory is most similar to,
01:12:06.940 | that's the attention part.
01:12:08.220 | And maybe some, yeah, you attend,
01:12:13.380 | you compare the current GT to all of the previous GTs,
01:12:17.220 | and you recall the ones with high similarity structure,
01:12:20.100 | and return the corresponding X.
01:12:22.180 | We've still got 10 minutes, right?
01:12:23.300 | Might as well, yeah, okay.
01:12:24.580 | Maybe some differences,
01:12:27.620 | so I think I'm gonna go through this,
01:12:28.780 | between how you would maybe,
01:12:31.100 | like the normal transformer,
01:12:32.660 | and how to make it map onto this, are the following.
01:12:35.620 | So the first of these is that the keys and the queries
01:12:37.900 | are the same at all time points.
01:12:39.980 | So there's no difference in the matrix
01:12:41.740 | that maps from tokens to keys and tokens to queries,
01:12:44.460 | same matrix.
01:12:45.540 | And it only depends on the positional encoding.
01:12:47.900 | Okay, so you only recall memories
01:12:50.780 | based on how similar their positions are.
01:12:52.820 | So yeah, this is key at time tau equals,
01:12:57.820 | query at time tau equals some matrix
01:12:59.580 | applied only to the positional embedding at time tau.
01:13:02.180 | Then the values depend only on this X part,
01:13:06.660 | so it's some like factorization of the two,
01:13:09.020 | which is the value at time tau is like some value matrix,
01:13:11.380 | only applied to that X part.
01:13:12.860 | So that's the only bit you want to like recall, I guess.
01:13:16.140 | Is that right?
01:13:16.980 | I think that's right.
01:13:18.220 | And then it's a causal transformer
01:13:19.900 | in that you only do attention at things
01:13:22.420 | that have arrived at time points in the path.
01:13:24.580 | Make sense?
01:13:27.460 | And finally, the perhaps like weird and interesting
01:13:29.540 | difference is that there's this path integration going on
01:13:32.340 | in the positional encodings.
01:13:34.100 | So these E are the equivalent of the grid cells,
01:13:35.940 | the G from the previous bit,
01:13:37.300 | and they're going to be updated through these matrices
01:13:39.020 | that depend on the actions you're taking in the world.
01:13:41.300 | Yeah, so that's basically the correspondence.
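A small sketch that pulls those four differences together (the matrices, dimensions, and action set here are all made up; this is the shape of the correspondence, not the actual model): keys and queries come from one matrix applied only to path-integrated positional encodings, values come only from the inputs x, and the attention is causal, comparing the current positional query to the keys of past time steps.

```python
import numpy as np

rng = np.random.default_rng(4)
d_e, d_x = 16, 12

W_kq = rng.standard_normal((d_e, d_e))           # one matrix for both keys and queries
W_v = rng.standard_normal((d_x, d_x))            # values depend only on the input x
A = {"N": np.linalg.qr(rng.standard_normal((d_e, d_e)))[0]}   # orthogonal "step north"
A["S"] = A["N"].T                                # stepping south undoes stepping north

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Walk around, path-integrating the positional encoding and storing (key, value) pairs.
e = rng.standard_normal(d_e)
keys, values = [], []
for action in ["N", "N", "S", "N", "N", "S"]:
    x = rng.standard_normal(d_x)                 # the observation at this step
    e = A[action] @ e                            # e_t = W_a e_{t-1}  (path integration)
    keys.append(W_kq @ e)                        # key_t = query_t = W e_t (position only)
    values.append(W_v @ x)                       # value_t = W_v x_t  (input only)

# Causal attention: compare the current positional query to all past keys,
# then return the similarity-weighted sum of the past values.
query = W_kq @ e
weights = softmax(np.stack(keys) @ query)
print((weights @ np.stack(values)).shape)        # (12,)
```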
01:13:46.660 | I'm going to go through a little bit
01:13:47.500 | about how the Hopfield network is approximately
01:13:50.060 | like doing attention over previous tokens.
01:13:53.380 | So yeah, I was describing to you
01:13:55.100 | before the classic Hopfield network,
01:13:57.300 | which if you remove the non-linearity, looks like this.
01:14:00.300 | And the mapping, I guess, is like the hippocampal activity,
01:14:03.420 | the current neural activity is the query.
01:14:06.140 | The set of memories themselves are the key.
01:14:10.300 | You're doing this dot product to get the current similarity
01:14:12.380 | between the query and the key,
01:14:13.780 | and then you're summing them up,
01:14:14.780 | weighted by that dot product,
01:14:16.980 | all of the memories that are values.
01:14:18.780 | So that's the simple version.
01:14:21.980 | But actually, these Hopfield networks are quite bad.
01:14:24.580 | They like, in some senses, they tend to fail,
01:14:27.340 | they have a like low memory capacity.
01:14:29.740 | For N neurons, they have something,
01:14:31.620 | they can only embed like 0.14 N memories,
01:14:35.020 | just like a big result from statistical physics in the '80s.
01:14:38.740 | But it's okay, people have improved this.
01:14:41.260 | The reason that they're bad,
01:14:42.180 | it seems to be basically that the overlap
01:14:44.100 | between your query and the memories
01:14:46.420 | is too big for too many memories.
01:14:48.940 | You basically like look too similar to too many things.
01:14:51.420 | So how do you do that?
01:14:52.300 | You like sharpen your similarity function.
01:14:54.900 | Okay, and the way we're gonna sharpen it
01:14:56.180 | is through this function,
01:14:57.380 | and this function is gonna be soft.
01:14:59.180 | So it's gonna be like,
01:15:00.020 | oh, how similar am I to this particular pattern,
01:15:02.340 | weighted, exponentiated,
01:15:04.700 | and then over how similar am I to all the other ones.
01:15:07.100 | That's our new measure of similarity,
01:15:08.660 | and that's the minor setting of the modern Hopfield one.
01:15:11.820 | Yeah, yeah. (laughs)
01:15:14.420 | And then you can see how this thing,
01:15:16.300 | yeah, it's basically doing the attention mechanism.
01:15:18.580 | And it's also biologically plausible.
01:15:24.180 | We'll quickly run through that,
01:15:25.140 | is that you have some set of activity, PT,
01:15:27.220 | this like neural activity,
01:15:29.700 | and you're gonna compare that to each chi mu,
01:15:32.140 | and that's through these memory neurons.
01:15:33.540 | So there's a set of memory neurons,
01:15:34.700 | one for each pattern that you've memorized, mu,
01:15:37.060 | and the weights to this memory neuron will be this chi mu,
01:15:41.100 | and then the activity of this neuron
01:15:42.500 | will be this dot product,
01:15:44.100 | and then you're gonna do divisive normalization
01:15:46.620 | to run this operation between these neurons,
01:15:49.940 | so like to make them compete with one another
01:15:51.820 | and only recall the memories that are most similar
01:15:53.740 | through the most activated
01:15:55.100 | according to this like softmax operation,
01:15:56.900 | and then they'll project back to the PT
01:15:58.780 | and produce the output by summing up
01:16:00.500 | the memories weighted by this thing
01:16:01.580 | times the chi mu, which is the weights.
01:16:03.460 | So then weights out to the memory neurons
01:16:05.180 | and back to the hippocampus are both chi mu.
01:16:08.700 | Okay, and so that's how you can like
01:16:10.660 | biologically plausibly run this modern Hopfield network.
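A compact sketch of that circuit, under the assumption that the stored patterns chi_mu form both the weights into the memory neurons and the weights back out (sizes and the sharpness parameter are invented): each memory neuron's activation is a dot product with the current hippocampal activity, the divisive normalization plays the role of the softmax competition, and the output projects the winning memories back.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, beta = 64, 8, 4.0                          # hippocampal units, memory neurons, sharpness

chi = rng.choice([-1.0, 1.0], size=(M, N))       # stored patterns = weights in AND out

def modern_hopfield_step(p):
    h = chi @ p / np.sqrt(N)                     # memory-neuron activations: dot products
    a = np.exp(beta * (h - h.max()))
    a /= a.sum()                                 # divisive normalisation ~ softmax competition
    return chi.T @ a                             # project back: weighted sum of the memories

# Query with a noisy version of pattern 2; one step is usually enough to clean it up.
p = chi[2] + 0.8 * rng.standard_normal(N)
out = modern_hopfield_step(p)
cos = out @ chi[2] / (np.linalg.norm(out) * np.linalg.norm(chi[2]))
print("cosine with stored pattern:", round(float(cos), 3))   # close to 1.0
```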
01:16:13.460 | And so, sorry, yeah?
01:16:16.740 | - Do you have any thoughts into what the memories
01:16:18.980 | that are inputted are?
01:16:20.700 | Like if you attend over every memory you've ever made,
01:16:22.980 | probably not.
01:16:23.900 | - Probably not, yeah.
01:16:25.900 | I guess somehow you have to have knowledge,
01:16:27.620 | and you know, in this case it works nicely
01:16:29.100 | 'cause we like wipe this poor agent's memory every time
01:16:31.740 | and only memorize things from the environment,
01:16:34.220 | and so you need something that like gates it
01:16:35.940 | so that it only looks for things
01:16:37.380 | in the current environment somehow.
01:16:40.100 | How that happens, I'm not sure.
01:16:42.180 | There are claims that there's this like
01:16:44.340 | just shift over time.
01:16:46.700 | The claim is basically that like somehow as time passes,
01:16:50.420 | the representations just slowly like rotate or something.
01:16:53.620 | And then they're also embedding something
01:16:54.860 | like a time similarity as well,
01:16:56.300 | 'cause the closer in time you are,
01:16:57.300 | the more you're like in the same rotated thing.
01:17:00.220 | So maybe that's a mechanism to like,
01:17:02.020 | oh, you know, past a certain time,
01:17:03.420 | you don't recall things.
01:17:07.340 | But there's a lot of evidence and debate around that,
01:17:07.340 | so yeah, expect to see.
01:17:09.060 | Other mechanisms like it, I'm sure.
01:17:10.820 | Maybe context is another one.
01:17:14.580 | Actually, we'll briefly talk about that.
01:17:16.420 | You know, if you know you're in the same context,
01:17:19.660 | then you can send a signal. Like, somehow
01:17:21.540 | the prefrontal cortex, say, works out
01:17:22.700 | what kind of setting am I in?
01:17:22.700 | You can send that signal back and be like,
01:17:24.100 | oh, make sure you attend to these ones
01:17:25.580 | that are in the same context.
01:17:27.020 | So yeah, there we go.
01:17:29.060 | TEM the transformer, that's the job.
01:17:31.140 | It path integrates its positional encodings,
01:17:32.940 | which is kind of fun.
01:17:34.260 | It computes similarity using these positional encodings,
01:17:37.380 | and it only compares to past memories,
01:17:39.220 | but otherwise it looks a bit like a transformer setup.
01:17:45.540 | And here's the setup: MEC, LEC, hippocampus, and place cells.
01:17:45.540 | Some, yeah, so here's a brief,
01:17:49.580 | the last thing I think I'm gonna say is that like,
01:17:52.580 | this extends TEM nicely 'cause it allows it,
01:17:54.700 | previously you had to do this outer product and flatten,
01:17:56.940 | that's a very, dimensionality is like terrible scaling
01:17:59.180 | with like, for example, if you wanna do position,
01:18:01.380 | what I saw and the context signal,
01:18:03.180 | suddenly I'm taking, like, the outer product of three vectors
01:18:04.980 | and flatten that, that's a much, much bigger,
01:18:06.300 | you're scaling like N cube, right?
01:18:07.860 | Rather than what you'd like to do is just like 3N.
01:18:11.380 | And so this version of TEM
01:18:13.580 | with this new modern Hopfield network does scale nicely
01:18:15.900 | to adding a context input as just another input
01:18:18.220 | in what was previously this like modern Hopfield network.
01:18:20.500 | So yeah, there's some.
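A tiny illustration of that scaling point (numbers invented): conjunctively binding three N-dimensional signals by a flattened outer product costs N^3 components, whereas handing them to the attention-style memory as separate inputs costs roughly 3N.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100
g, x, c = rng.standard_normal((3, N))            # position, observation, context signals

# Conjunctive binding: flatten(g outer x outer c) -> N**3 components.
conjunctive = np.einsum("i,j,k->ijk", g, x, c).ravel()

# Attention-style memory: keep the parts as separate inputs -> about 3N components.
factorised = np.concatenate([g, x, c])

print(conjunctive.size, factorised.size)         # 1000000 vs 300
```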
01:18:22.940 | So yeah, our conclusion is there's, like,
01:18:25.180 | hopefully a somewhat interesting two-way relationship here.
01:18:27.900 | From the AI to the neuroscience,
01:18:30.260 | we use this new memory model,
01:18:31.620 | this modern Hopfield network that has all of,
01:18:34.380 | you know, all of this bit
01:18:35.460 | is supposed to be in the hippocampus,
01:18:36.860 | whereas previously we just had these like memory bits
01:18:38.900 | in the classic Hopfield network in the hippocampus.
01:18:40.580 | So it makes kind of interesting predictions
01:18:41.820 | about different place cell structures in the hippocampus
01:18:44.460 | and it just sped up the code a lot, right?
01:18:46.540 | From the neuro to AI,
01:18:48.900 | maybe there's a few things that are slightly different,
01:18:51.100 | this like learnable recurrent positional encoding.
01:18:53.420 | So people do some of this,
01:18:54.380 | I think they take, like, positional encodings
01:18:55.740 | and a learned RNN updates them,
01:18:57.700 | but maybe this is like some motivation to try,
01:19:00.780 | for example, they don't do it with weight matrices
01:19:02.420 | and these weight matrices are very biased towards,
01:19:04.820 | because they're invertible generally
01:19:06.340 | and things like that,
01:19:07.180 | they're very biased towards representing
01:19:08.140 | these very clean structures like 2D space.
01:19:10.100 | So I mean, you know, interesting there.
01:19:12.620 | The other thing is this is like one attention layer only.
01:19:14.900 | And so like somehow by using nice extra foundations,
01:19:19.100 | making the task very easy in terms of like processing X
01:19:21.700 | and using the right positional encoding,
01:19:23.260 | you've got it to solve the task with just one of these.
01:19:25.980 | Also kind of nice,
01:19:27.220 | and maybe it's like a nice interpretation
01:19:28.860 | is that you can go in and really probe
01:19:30.380 | what these neurons are doing in this network
01:19:31.740 | and really understand it, you know?
01:19:33.100 | We know that the position encoding looks like grid cells.
01:19:35.460 | We have a very deep understanding
01:19:36.460 | of why grid cells are a useful thing to have
01:19:38.140 | if you're doing this path integration.
01:19:39.660 | So it's like, hopefully helps like interpret
01:19:41.980 | all these things.
01:19:42.820 | Oh yeah, and if there was time,
01:19:43.660 | I was going to tell you all about grid cells,
01:19:44.500 | which are my hobby horse, but I don't think there's time.
01:19:46.340 | So I'll stop there.
01:19:47.700 | - Excellent.
01:19:48.540 | Questions?
01:19:54.340 | Go ahead.
01:20:00.220 | - A very big question.
01:20:01.420 | So in the very beginning,
01:20:02.820 | those grids are linked into one neuron
01:20:05.980 | or a population of neurons?
01:20:08.060 | - These ones.
01:20:08.900 | Those ones.
01:20:11.460 | Yeah, that's one neuron's response.
01:20:12.300 | - Just one neuron?
01:20:13.140 | - Yeah.
01:20:13.980 | Wild, it's wild.
01:20:14.820 | Let me tell you more about the grid cell system.
01:20:16.100 | - And you can't know that,
01:20:17.260 | how can you know it's the one neuron?
01:20:19.500 | How do you measure it?
01:20:21.060 | - Because you got electrodes stuck in here, right?
01:20:24.420 | And they generally have like,
01:20:25.740 | the classic measuring technique is a tetrode,
01:20:27.460 | which is four wires, okay?
01:20:29.220 | And they receive these spikes,
01:20:30.380 | which are like electrical fluctuations
01:20:32.180 | as a result of a neuron firing.
01:20:33.660 | And they can like triangulate
01:20:34.740 | that that particular spike that they measured
01:20:36.180 | because of the pattern of activity on the four wires
01:20:38.020 | has to have only come from one position.
01:20:39.620 | So they can work out which neurons
01:20:41.100 | sent that particular spike.
01:20:43.060 | Yeah.
01:20:43.900 | But there's, so there's a set of neurons
01:20:46.020 | that have grid cell patterns.
01:20:48.100 | Lots of neurons have patterns
01:20:49.540 | that are just translated versions of one another.
01:20:51.220 | So the same grid, like shifted in space.
01:20:53.180 | That's called a module.
01:20:54.460 | And then there are sets of modules,
01:20:55.860 | which are the same types of neurons,
01:20:57.740 | but with a lattice that's much bigger or much smaller.
01:21:00.020 | And in rats, there's roughly seven.
01:21:01.340 | So this is a very surprising crystalline structure
01:21:03.860 | of these seven modules,
01:21:04.900 | but in each module,
01:21:05.740 | each neuron is just translated by another one.
01:21:07.780 | Which, yeah, there's a lot of theory work
01:21:09.860 | about why that's a very sensible thing to do
01:21:12.020 | if you want to do path integration
01:21:13.380 | and work out where you are in the environment,
01:21:15.140 | based on your like velocity signals.
01:21:16.940 | Nice.
01:21:19.100 | - Cool.
01:21:20.940 | So just this thing that you said,
01:21:22.780 | this was like really fascinating about the friendly thing.
01:21:25.060 | Is this a product of evolution or a product of learning?
01:21:30.980 | - Evolution.
01:21:31.820 | It's like, it emerges like 10 days after,
01:21:35.220 | in a baby rat's life, after being born.
01:21:37.860 | So certainly, oh, certainly that structure
01:21:40.180 | seems to be like very biased to being created.
01:21:43.220 | Unclear, you know, we were talking about
01:21:46.660 | how it was being co-opted to encode other things.
01:21:49.340 | And so it's debatable how flexible it is
01:21:52.580 | or how hardwired it is.
01:21:53.460 | But it seemed, you know, the fMRI evidence
01:21:55.820 | suggests that there's some like more flexibility
01:21:57.420 | in the system.
01:21:58.260 | Unclear quite how it's coding it,
01:21:59.300 | but it'd be cool to get neural recordings of it.
01:22:00.900 | I wanna see.
01:22:01.740 | - Cool.
01:22:04.580 | Actually, let's give our speaker another round of applause.
01:22:06.420 | (laughing)