
Stanford CS25: V2 I Represent part-whole hierarchies in a neural network, Geoff Hinton


Chapters

1:55 Why it is hard to make real neural networks learn part-whole hierarchies • Each image has a different parse tree. • Real neural networks cannot dynamically allocate neurons to represent nodes in a parse tree. • What a neuron does is determined by the weights on its connections, and the weights change slowly.
20:45 A brief introduction to contrastive learning of visual representations • Contrastive self-supervised learning uses the similarity between activity vectors produced from different patches of the same image as the objective
43:39 How to implement multimodal predictions in the joint space of identity and pose • Each neuron in the embedding vector for the object is a basis function that represents a vague distribution in the log probability space

Transcript

(silence) Before we start, I gave the same talk at Stanford quite recently. I suggested to the people inviting me, I could just give one talk and both audiences come, but they would prefer it as two separate talks. So if you went to that talk recently, I suggest you leave now.

You won't learn anything new. Okay. What I'm gonna do is combine some recent ideas in neural networks to try to explain how a neural network could represent part-whole hierarchies without violating any of the basic principles of how neurons work. And I'm gonna explain these ideas in terms of an imaginary system.

I started writing a design document for a system and in the end, I decided the design document by itself was quite interesting. So this is just vaporware, it's stuff that doesn't exist. Little bits of it now exist, but somehow I find it easy to explain the ideas in the context of an imaginary system.

So most people now studying neural networks are doing engineering and they don't really care if it's exactly how the brain works. They're not trying to understand how the brain works, they're trying to make cool technology. And so a hundred layers is fine in a ResNet, weight sharing is fine in a convolutional neural net.

Some researchers, particularly computational neuroscientists investigate neural networks, artificial neural networks in an attempt to understand how the brain might actually work. I think we still got a lot to learn from the brain and I think it's worth remembering that for about half a century, the only thing that kept research on neural networks going was the belief that it must be possible to make these things learn complicated things 'cause the brain does.

So every image has a different parse tree, that is, the structure of the wholes and the parts in the image. And in a real neural network, you can't dynamically allocate, you can't just grab a bunch of neurons and say, "Okay, you now represent this," because you don't have random access memory.

You can't just set the weights of the neurons to be whatever you like. What a neuron does is determined by its connections and they only change slowly, at least probably, mostly they change slowly. So the question is, if you can't change what neurons do quickly, how can you represent a dynamic parse tree?

In symbolic AI, it's not a problem. You just grab a piece of memory, that's what it normally amounts to, and say, "This is gonna represent a node in the parse tree and I'm gonna give it pointers to other nodes, other bits of memory that represent other nodes." So there's no problem.

For about five years, I played with a theory called capsules, where you say, "Because you can't allocate neurons on the fly, you're gonna allocate them in advance." So we're gonna take groups of neurons and we're gonna allocate them to different possible nodes in a parse tree. And most of these groups of neurons for most images are gonna be silent, a few are gonna be active.

And then the ones that are active, we have to dynamically hook them up into a parse tree. So we have to have a way of routing between these groups of neurons. So that was the capsules theory. And I had some very competent people working with me who actually made it work, but it was tough going.

My view is that some ideas want to work and some ideas don't want to work. And capsules was sort of in between. Things like backpropagation just wanna work, you try them and they work. There's other ideas I've had that just don't wanna work. Capsules was sort of in between and we got it working.

But I now have a new theory that could be seen as a funny kind of capsules model in which each capsule is universal. That is, instead of a capsule being dedicated to a particular kind of thing, each capsule can represent any kind of thing. But hardware still comes in capsules, which are also sometimes called embeddings.

So the imaginary system I'll talk about is called Glom. And in Glom, hardware gets allocated to columns and each column contains multiple levels of representation of what's happening in a small patch of the image. So within a column, you might have a lower level representation that says it's a nostril and the next level up might say it's a nose and the next level up might say a face, the next level up a person and the top level might say it's a party.

That's what the whole scene is. And the idea for representing part-whole hierarchies is to use islands of agreement between the embeddings at these different levels. So at the scene level, at the top level, you'd like the same embedding for every patch of the image 'cause that patch is a patch of the same scene everywhere.

At the object level, you'd like the embeddings of all the different patches that belong to the object to be the same. So as you go up this hierarchy, you're trying to make things more and more the same and that's how you're squeezing redundancy out. The embedding vectors are the things that act like pointers and the embedding vectors are dynamic.

They're neural activations rather than neural weights. So it's fine to have different embedding vectors for every image. So here's a little picture. If you had a one-dimensional row of patches, these are the columns for the patches and you'd have something like a convolutional neural net as a front end.

And then after the front end, you produce your lowest level embeddings that say what's going on in each particular patch. And so that bottom layer of black arrows, they're all different. Of course, these embeddings are thousands of dimensions, maybe hundreds of thousands in your brain. And so a two-dimensional vector isn't right, but at least I can represent where two vectors are the same by using the orientation.

So the lowest level, all the patches will have different representations. But the next level up, the first two patches, they might be part of a nostril, for example. And so, yeah, they'll have the same embedding. But the next level up, the first three patches might be part of a nose.

And so they'll all have the same embedding. Notice that even though what's in the image is quite different, at the part level, those three red vectors are all meant to be the same. So what we're doing is we're getting the same representation for things that are superficially very different.

We're finding spatial coherence in an image by giving the same representation to different things. And at the object level, you might have a nose and then a mouth, and they're the same face. They're part of the same face. And so all those vectors are the same. And this network hasn't yet settled down to produce something at the scene level.

So the islands of agreement are what capture the parse tree. Now, they're a bit more powerful than a parse tree. They can capture things like "shut the heck up". "Shut" and "up" can be different vectors at one level, but at a higher level, "shut" and "up" can have exactly the same vector, namely the vector for "shut up", even though they're disconnected in the sentence.

So you can do things a bit more powerful than a context-free grammar here, but basically it's a parse tree. If you're a physicist, you can think of each of these levels as an Ising model with real-valued vectors rather than binary spins. And you can think of there being coordinate transforms between levels, which makes it much more complicated.

And then this is a kind of multi-level Ising model, but with complicated interactions between the levels, because for example, between the red arrows and the black arrows above them, you need the coordinate transform between a nose and a face, but we'll come to that later. If you're not a physicist, ignore all that 'cause it won't help.

(keyboard clicking) So I want to start, and this, I guess, is particularly relevant for a natural language course where some of you are not vision people, by trying to prove to you that coordinate systems are not just something invented by Descartes. Coordinate systems were invented by the brain a long time ago, and we use coordinate systems in understanding what's going on in an image.

I also want to demonstrate the psychological reality of parse trees for an image. So I'm gonna do this with a task that I invented a long time ago in the 1970s, when I was a grad student, in fact. And you have to do this task to get the full benefit from it.

So I want you to imagine on the tabletop in front of you, there's a wireframe cube, and it's in the standard orientation for a cube, resting on the tabletop. And from your point of view, there's a front bottom right-hand corner, and a top back left-hand corner. Here we go.

Okay. The front bottom right-hand corner is resting on the tabletop, along with three other corners. And the top back left-hand corner is at the other end of a diagonal that goes through the center of the cube. So hold one fingertip at each end of that diagonal. Okay, so far so good. Now what we're gonna do is rotate the cube so that the bottom finger stays on the tabletop, and the other finger is vertically above it, like that.

This finger shouldn't have moved. Okay. So now we've got the cube in an orientation where that thing that was a body diagonal is now vertical. And all you've gotta do is take the bottom finger, 'cause that's still on the tabletop, and point with the bottom finger to where the other corners of the cube are.

So I want you to actually do it. Off you go. Take your bottom finger, hold your top finger at the other end of that diagonal that's now been made vertical, and just point to where the other corners are. And luckily, it's Zoom, so most of you, other people, won't be able to see what you did.

And I can see that some of you aren't pointing, and that's very bad. So most people point out four other corners, and the most common response is to say they're here, here, here, and here. They point out four corners in a square halfway up that axis. That's wrong, as you might imagine.

And it's easy to see that it's wrong, 'cause if you imagine the cube in the normal orientation and count the corners, there's eight of them. And your two fingertips were at two of them, and you pointed to four more. So where did the other two corners go? So one theory is that when you rotated the cube, the centrifugal forces made them fly off into your unconscious.

That's not a very good theory. So what's happening here is you have no idea where the other corners are, unless you're something like a crystallographer. You can sort of imagine bits of the cube, but you just can't imagine this structure of the other corners, what structure they form. And this common response that people give, of four corners in a square, is doing something very weird.

It's trying to, it's saying, well, okay, I don't know where the bits of a cube are, but I know something about cubes. I know the corners come in fours. I know a cube has this four-fold rotational symmetry, or two planes of bilateral symmetry at right angles to one another.

And so what people do is they preserve the symmetries of the cube in their response. They give four corners in a square. Now, what they've actually pointed out if they do that is two pyramids, each of which has a square base. One's upside down, and they're stuck base to base.

So you can visualize that quite easily. It's a square base pyramid with another one stuck underneath it. And so now you get your two fingers as the vertices of those two pyramids. And what's interesting about that is you've preserved the symmetries of the cube at the cost of doing something pretty radical, which is changing faces to vertices and vertices to faces.

The thing you pointed out if you did that was an octahedron. It has eight faces and six vertices. A cube has six faces and eight vertices. So in order to preserve the symmetries you know about of the cube, if you did that, you've done something really radical, which is changed faces for vertices and vertices for faces.

I should show you what the answer looks like. So I'm gonna step back and try and get enough light, and maybe you can see this cube. So this is a cube, and you can see that the other edges form a kind of zigzag ring around the middle. So I've got a picture of it.

So the colored rods here are the other edges of the cube, the ones that don't touch your fingertips. And your top finger's connected to the three vertices of those flaps, and your bottom finger's connected to the lowest three vertices there. And that's what a cube looks like. It's something you had no idea about.

This is just a completely different model of a cube. It's so different, I'll give it a different name. I'll call it a hexahedron. And the thing to notice is a hexahedron and a cube are just conceptually utterly different. You wouldn't even know one was the same as the other if you think about one as a hexahedron and one as a cube.

It's like the ambiguity between a tilted square and an upright diamond, but more powerful 'cause you're not familiar with it. And that's my demonstration that people really do use coordinate systems. And if you use a different coordinate system to describe things, and here I forced you to use a different coordinate system by making the diagonal be vertical and asking you to describe it relative to that vertical axis, then familiar things become completely unfamiliar.

And when you do see them relative to this new frame, they're just a completely different thing. Notice that things like convolutional neural nets don't have that. They can't look at something and have two utterly different internal representations of the very same thing. I'm also showing you that you do parsing.

So here I've colored it so you parse it into what I call the crown, which is three triangular flaps that slope upwards and outwards. Here's a different parsing. The same green flap sloping upwards and outwards. Now we have a red flap sloping downwards and outwards, and we have a central rectangle, and you just have the two ends of the rectangle.

And if you perceive this and then close your eyes, and I ask you whether there were any parallel edges there, you're very well aware that those two blue edges were parallel, and you're typically not aware of any other parallel edges, even though you know by symmetry there must be other pairs.

Similarly with the crown, if you see the crown, and then I ask you to close your eyes and ask you, were there parallel edges? You don't see any parallel edges. And that's because the coordinate systems you're using for those flaps don't line up with the edges. And you only notice parallel edges if they line up with the coordinate system you're using.

So here for the rectangle, the parallel edges align with the coordinate system. For the flaps, they don't. So you're aware that those two blue edges are parallel, but you're not aware that one of the green edges and one of the red edges are parallel. So this isn't like the Necker cube ambiguity, where when it flips, you think that what's out there in reality is different, that things are at a different depth.

This is like, next weekend, we shall be visiting relatives. So if you take the sentence, "Next weekend, we shall be visiting relatives," it can mean, next weekend, what we will be doing is visiting relatives, or it can mean, next weekend, what we will be is visiting relatives. Now, those are completely different senses.

They happen to have the same truth conditions. They mean the same thing in the sense of truth conditions, 'cause if you're visiting relatives, what you are is visiting relatives. And it's that kind of ambiguity. No disagreement about what's going on in the world, but two completely different ways of seeing the sentence.

So this was drawn in the 1970s. This is what AI was like in the 1970s. This is a sort of structural description of the crown interpretation. So you have nodes for all the various parts in the hierarchy. I've also put something on the arcs. That RWX is the relationship between the crown and the flap, and that can be represented by a matrix.

It's really the relationship between the intrinsic frame of reference of the crown and the intrinsic frame of reference of the flap. And notice that if I change my viewpoint, that doesn't change at all. So that kind of relationship will be a good thing to put in the weights of a neural network, 'cause you'd like a neural network to be able to recognize shapes independently of viewpoint.

And that RWX is knowledge about the shape that's independent of viewpoint. Here's the zigzag interpretation. And here's something else where I've added the things in the heavy blue boxes. They're the relationship between a node and the viewer. That is to be more explicit. The coordinate transformation between the intrinsic frame of reference of the crown and the intrinsic frame of reference of the viewer, your eyeball, is that RWV.

And that's a different kind of thing altogether, 'cause as you change viewpoint, that changes. In fact, as you change viewpoint, all those things in blue boxes all change together in a consistent way. And there's a simple relationship, which is that if you take RWV, then you multiply it by RWX, you get RXV.
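As a concrete illustration of that composition rule (a minimal sketch, not from the talk: the 4x4 homogeneous matrices and the specific numbers are made up), the viewpoint-invariant relation RWX can be stored once, and a change of viewpoint only changes RWV:

```python
# Propagating viewpoint over a structural description with homogeneous transforms.
# RWX (whole-to-part relation) is viewpoint-invariant; RWV (whole-to-viewer
# relation) changes with viewpoint. All numbers are illustrative.
import numpy as np

def rot_z(theta):
    """4x4 homogeneous rotation about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0, 0],
                     [s,  c, 0, 0],
                     [0,  0, 1, 0],
                     [0,  0, 0, 1]])

def translate(x, y, z):
    """4x4 homogeneous translation."""
    m = np.eye(4)
    m[:3, 3] = [x, y, z]
    return m

# Viewpoint-invariant knowledge: the flap's frame relative to the crown's frame.
RWX = translate(0.0, 1.0, 0.0) @ rot_z(np.pi / 6)

# Viewpoint-dependent: the crown's frame relative to the viewer.
RWV = translate(0.0, 0.0, 5.0) @ rot_z(np.pi / 4)

# Composing the two gives the flap's frame relative to the viewer: RXV = RWV @ RWX.
RXV = RWV @ RWX

# If the viewpoint changes, only RWV changes; RWX stays fixed in the "weights",
# and the new RXV follows by the same multiplication.
new_RWV = translate(1.0, 0.0, 5.0) @ rot_z(np.pi / 3)
new_RXV = new_RWV @ RWX
print(RXV.round(2))
print(new_RXV.round(2))
```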

So you can easily propagate viewpoint information over a structural description. And that's what I think a mental image is. Rather than a bunch of pixels, it's a structural description with associated viewpoint information. That makes sense of a lot of properties of mental images. Like if you want to do any reasoning with things like RWX, you form a mental image.

That is you fill in, you choose a viewpoint. And I want to do one more demo to convince you you always choose a viewpoint when you're solving mental imagery problems. So I'm gonna give you another very simple mental imagery problem at the risk of running over time. Imagine that you're at a particular point and you travel a mile East, and then you travel a mile North, and then you travel a mile East again.

What's your direction back to your starting point? This isn't a very hard problem. It's sort of a bit South and quite a lot West, right? It's not exactly Southwest, but it's sort of Southwest. Now, when you did that task, what you imagined from your point of view is you went a mile East, and then you went a mile North, and then you went a mile East again.
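As an aside, a quick arithmetic check of that answer, assuming a flat plane and ignoring the curvature of the Earth (this sketch is not part of the talk):

```python
# Net displacement after the three legs: 2 miles East, 1 mile North.
# The way back is the opposite vector: 2 miles West and 1 mile South.
import math

east, north = 2.0, 1.0
angle = math.degrees(math.atan2(north, east))   # angle below due West for the return leg
print(f"about {angle:.1f} degrees South of due West")   # roughly 26.6: mostly West, a bit South
```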

I'll tell you what you didn't imagine. You didn't imagine that you went a mile East, and then you went a mile North, and then you went a mile East again. You could have solved the problem perfectly well with North not being up, but you had North up. You also didn't imagine this.

You go a mile East, and then a mile North, and then a mile East again. And you didn't imagine this. You go a mile East, and then a mile North, and so on. You imagined it at a particular scale, in a particular orientation, and in a particular position. And you can answer questions about roughly how big it was and so on.

So that's evidence that to solve these tasks that involve using relationships between things, you form a mental image. Okay, enough on mental imagery. So I'm now gonna give you a very brief introduction to contrastive learning. So this is a complete disconnect in the talk, but it'll come back together soon.

So in contrastive self-supervised learning, what we try and do is make two different crops of an image have the same representation. There's a paper a long time ago by Becker and Hinton where we were doing this to discover low-level coherence in an image, like the continuity of surfaces or the depth of surfaces.

It's been improved a lot since then, and it's been used for doing things like classification. That is, you take an image that has one prominent object in it, and you say, "If I take a crop of the image "that contains sort of any part of that object, "it should have the same representation "as some other crop of the image "containing a part of that object." And this has been developed a lot in the last few years.

I'm gonna talk about a model developed a couple of years ago by my group in Toronto called SimCLR, but there's lots of other models. And since then, things have improved. So in SimCLR, you take an image X, you take two different crops, and you also do colour distortion of the crops, different colour distortions of each crop.

And that's to prevent it from using colour histograms to say they're the same. So you mess with the colour, so it can't use colour in a simple way. And that gives you Xi tilde and Xj tilde. You then put those through the same neural network, F, and you get a representation, H.

And then you take the representation, H, and you put it through another neural network, which compresses it a bit. It goes to low dimensionality. That's an extra complexity I'm not gonna explain, but it makes it work a bit better. You can do it without doing that. And you get two embeddings, Zi and Zj.

And your aim is to maximise the agreement between those vectors. And so you start off doing that and you say, okay, let's start off with random neural networks, random weights in the neural networks, and let's take two patches and let's put them through these transformations. Let's try and make Zi be the same as Zj.

So let's back propagate the squared difference between components of Zi and components of Zj. And hey, presto, what you discover is everything collapses. For every image, it will always produce the same Zi and Zj. And then you realise, well, that's not what I meant by agreement. I mean, they should be the same when you get two crops of the same image and different when you get two crops of different images.

Otherwise, it's not really agreement, right? So you have to have negative examples. You have to show it crops from different images and say those should be different. If they're already different, you don't try and make them a lot more different. It's very easy to make things very different, but that's not what you want.

You just wanna be sure they're different enough. So crops from different images aren't taken to be from the same image. So if they happen to be very similar, you push them apart. And that stops your representations collapsing. That's called contrastive learning. And it works very well. So what you can do is do unsupervised learning by trying to maximise agreement between the representations you get from two image patches from the same image.
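As a rough sketch of what that objective looks like in code (a simplified NT-Xent-style loss in PyTorch; the temperature, batch size and exact details are illustrative assumptions, not the exact SimCLR implementation):

```python
# Contrastive loss for two augmented crops per image: pull matched pairs
# together, use every crop of every other image in the batch as a negative.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, D) embeddings of two different crops of the same N images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, D), unit length
    sim = (z @ z.t()) / temperature                           # pairwise similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))  # no self-pairs
    # The positive for crop k of image i is the other crop of the same image.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))      # stand-in embeddings
```

Because the other images in the batch act as negatives, the representations can't all collapse to the same vector, which is exactly the failure mode described above.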

And after you've done that, you just take your representation of the image patch and you feed it to a linear classifier, a bunch of weights. So you multiply the representation by a weight matrix, put it through a softmax and get class labels. And then you train that by gradient descent.
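A minimal sketch of that linear-evaluation step (the encoder below is a stand-in for the pretrained network F; the shapes, learning rate and random batch are placeholder assumptions):

```python
# Linear evaluation: freeze the contrastively pretrained encoder and train
# only a single linear classifier on its features.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))  # stand-in for F
for p in encoder.parameters():
    p.requires_grad = False                       # the representation stays frozen

classifier = nn.Linear(512, 10)                   # the only part trained with labels
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)

images = torch.randn(16, 3, 32, 32)               # stand-in labelled batch
labels = torch.randint(0, 10, (16,))

with torch.no_grad():
    features = encoder(images)                    # features from unlabelled pretraining
logits = classifier(features)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```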

And what you discover is that that's just about as good as training on labelled data. So now the only thing you've trained on labelled data is that last linear classifier. The previous layers were trained on unlabelled data and you've managed to train your representations without needing labels. Now, there's a problem with this.

It works very nicely, but it's really confounding objects and whole scenes. So it makes sense to say two different patches from the same scene should get the same vector label at the scene level 'cause they're from the same scene. But what if one of the patches contains bits of objects A and B and another patch contains bits of objects A and C?

You don't really want those two patches to have the same representation at the object level. So we have to distinguish these different levels of representation. And for contrastive learning, if you don't use any kind of gating or attention, then what's happening is you're really doing learning at the scene level.

What we'd like is that the representations you get at the object level should be the same if both patches are patches from object A, but should be different if one patch is from object A and one patch is from object B. And to do that, we're gonna need some form of attention to decide whether they really come from the same thing.

And so GLOM is designed to do that. It's designed to take contrastive learning and to introduce attention of the kind you get in transformers in order not to try and say things are the same when they're not. I should mention at this point that most of you will be familiar with BERT, and you could think of the word fragments that are fed into BERT as like the image patches I'm using here.

And in BERT, you have that whole column of representations of the same word fragment. In BERT, what's happening presumably as you go up is you're getting semantically richer representations. But in BERT, there's no attempt to get representations of larger things like whole phrases. This, what I'm gonna talk about will be a way to modify BERT.

So as you go up, you get bigger and bigger islands of agreement. So for example, after a couple of levels, then things like New and York will have the different fragments of York, suppose it's got two different fragments, will have exactly the same representation if it was done in the GLOM-like way.

And then as you go up another level, the fragments of New (well, New's probably a thing in its own right) and the fragments of York would all have exactly the same representation, forming this island of agreement. And that will be a representation of a compound thing. And as you go up, you're gonna get these islands of agreement that represent bigger and bigger things.

And that's gonna be a much more useful kind of BERT 'cause instead of taking vectors that represent word fragments and then sort of munging them together by taking the max of each component, for example, which is just a crazy thing to do, you'd explicitly, as you're learning, form representations of larger parts in the part-whole hierarchy.

Okay. So what we're going after in GLOM is a particular kind of spatial coherence that's more complicated than the spatial coherence caused by the fact that surfaces tend to be at the same depth and same orientation in nearby patches of an image. We're going after the spatial coherence that says that if you find a mouth in an image and you find a nose in an image, and they're in the right spatial relationship to make a face, then that's a particular kind of coherence.

And we want to go after that unsupervised and we want to discover that kind of coherence in images. So before I go into more details about GLOM, I want to give a disclaimer. For years, computer vision treated vision as: you've got a static image at uniform resolution, and you want to say what's in it.

That's not how vision works in the real world. In the real world, this is actually a loop where you decide where to look if you're a person or a robot. You better do that intelligently. And that gives you a sample of the optic array. It turns the optic array, the incoming light, into a retinal image.

And on your retina, you have high resolution in the middle and low resolution around the edges. And so you're focusing on particular details and you never ever process the whole image at uniform resolution. You're always focusing on something and processing where you're fixating at high resolution and everything else at much lower resolution, particularly around the edges.

So I'm going to ignore all the complexity of how you decide where to look and all the complexity of how you put together the information you get from different fixations by saying, let's just talk about the very first fixation on a novel image. So you look somewhere and now what happens on that first fixation?

We know that the same hardware in the brain is going to be reused for the next fixation, but let's just think about the first fixation. So finally, here's a picture of the architecture and this is the architecture for a single location. So like for a single word fragment in BERT and it shows you what's happening for multiple frames.

So Glom is really designed for video, but I only talk about applying it to static images. Then you should think of a static image as a very boring video in which the frames are all the same as each other. So I'm showing you three adjacent levels in the hierarchy and I'm showing you what happens over time.

So if you look at the middle level, maybe that's the sort of major part level and look at that box that says level L and that's at frame four. So the right-hand level L box and let's ask how the state of that box, the state of that embedding is determined.

So inside the box, we're gonna get an embedding and the embedding is gonna be the representation of what's going on at the major part level for that little patch of the image. And level L, in this diagram, all of these embeddings will always be devoted to the same patch of the retinal image.

Okay. The level L embedding on the right-hand side, you can see there's three things determining it there. This is a green arrow and for static images, the green arrow is rather boring. It's just saying you should sort of be similar to the previous state of level L. So it's just doing temporal integration.

The blue arrow is actually a neural net with a couple of hidden layers in it. I'm just showing you the embeddings here, not all the layers of the neural net. We need a couple of hidden layers to do the coordinate transforms that are required. And the blue arrow is basically taking information at the level below of the previous time step.

So level L minus one on frame three might be representing that I think I might be a nostril. Well, if you think you might be a nostril, what you predict at the next level up is a nose. What's more, if you have a coordinate frame for the nostril, you can predict the coordinate frame for the nose.

Maybe not perfectly, but you have a pretty good idea of the orientation position scale of the nose. So that bottom up neural net is a net that can take any kind of part of level L minus one. You can take a nostril, but it could also take a steering wheel and predict the car from the steering wheel and predict what you've got at the next level up.

The red arrow is a top-down neural net. So the red arrow is predicting the nose from the whole face. And again, it has a couple of hidden layers to do coordinate transforms. 'Cause if you know the coordinate frame of the face and you know the relationship between a face and a nose, and that's gonna be in the weights of that top-down neural net, then you can predict that it's a nose and what the pose of the nose is.

And that's all gonna be in activities in that embedding vector. Okay, now all of that is what's going on in one column of hardware. That's all about a specific patch of the image. So that's very, very like what's going on for one word fragment in BERT. You have all these levels of representation.

It's a bit confusing exactly what the relation of this is to BERT. And I'll give you the reference to a long arXiv paper at the end that has a whole section on how this relates to BERT. But it's confusing 'cause this has time steps. And that makes it all more complicated.

Okay. So those are three things that determine the level L embedding. But there's a fourth thing, which is in black at the bottom there. And that's the only way in which different locations interact. And that's a very simplified form of a transformer. You take a transformer, as in BERT, and you say, let's make the embeddings and the keys and the queries and the values all be the same as each other.

We just have this one vector. So now all you're trying to do is make the level L embedding in one column be the same as the level L embedding in nearby columns. But it's gonna be gated. You're only gonna try and make it be the same if it's already quite similar.

So here's how the attention works. You take the level L embedding in location X, that's LX. And you take the level L embedding in the nearby location Y, that's LY. You take the scalar product. You exponentiate and you normalize. In other words, you do a softmax. And that gives you the weight to use in your desire to make LX be the same as LY.
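Here is a minimal sketch of that simplified attention, together with the other three contributions to the level L update described above (the equal weighting of the four contributions, the MLP sizes, and attending to all columns rather than only nearby ones are assumptions for illustration, not details from the talk):

```python
# One synchronous update of the level-L embeddings for a 1-D row of GLOM columns.
# Keys, queries and values are all just the embedding itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 64                                   # embedding dimension (illustrative)

def mlp(d):
    # "a couple of hidden layers" to implement the coordinate transforms
    return nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                         nn.Linear(d, d), nn.ReLU(),
                         nn.Linear(d, d))

bottom_up = mlp(D)                       # blue arrow: predicts level L from level L-1
top_down  = mlp(D)                       # red arrow: predicts level L from level L+1
                                         # (the location input described later is omitted here)

def attention_term(x):
    """x: (num_columns, D) level-L embeddings. Attention-weighted average of other columns."""
    logits = x @ x.t()                   # scalar products between columns
    weights = F.softmax(logits, dim=1)   # exponentiate and normalise
    return weights @ x                   # pull each column toward already-similar columns

def update_level_L(prev_L, prev_below, prev_above):
    """All inputs: (num_columns, D) embeddings from the previous time step."""
    contributions = (
        prev_L,                          # green arrow: temporal integration
        bottom_up(prev_below),           # blue arrow: prediction from the level below
        top_down(prev_above),            # red arrow: prediction from the level above
        attention_term(prev_L),          # black arrow: agreement with nearby columns
    )
    return sum(contributions) / len(contributions)   # assumed equal weighting

cols = 6
new_L = update_level_L(torch.randn(cols, D), torch.randn(cols, D), torch.randn(cols, D))
```

Note that if neighbouring columns already have identical embeddings, the attention term just reproduces that embedding, which is what lets islands of agreement persist once they form.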

So the input produced by this from neighbors is an attention-weighted average of the level L embeddings of nearby columns. And that's an extra input that you get. It's trying to make you agree with nearby things. And that's what's gonna cause you to get these islands of agreement. (mouse clicking) So back to this picture.

I think, yeah. This is what we'd like to see. And the reason we get those, that big island of agreement at the object level is 'cause we're trying to get agreement there. We're trying to learn the coordinate transform from the red arrows to the level above and from the green arrows to the level above such that we get agreement.

Okay. Now, one thing we need to worry about is that the difficult thing in perception, it's not so bad in language, it's probably worse in visual perception, is that there's a lot of ambiguity. If I'm looking at a line drawing, for example, I see a circle. Well, a circle could be the right eye of a face or it could be the left eye of a face or it could be the front wheel of a car or the back wheel of a car.

There's all sorts of things that circle could be. And we'd like to disambiguate the circle. And there's a long line of work using things like Markov random fields. Here we need a variational Markov random field which I call a transformational random field because the interaction between, for example, something that might be an eye and something that might be a mouth needs to be gated by coordinate transforms.

For the, let's take a nose and a mouth 'cause that's my standard thing. If you take something that might be a nose and you want to ask, does anybody out there support the idea I'm a nose? Well, what you'd like to do is send to everything nearby a message saying, do you have the right kind of pose and the right kind of identity to support the idea that I'm a nose?

And so you'd like, for example, to send out a message from the nose. You'd send out a message to all nearby locations saying, does anybody have a mouth with the pose that I predict by taking the pose of the nose, multiplying by the coordinate transform between a nose and a mouth?

And now I can predict the pose of the mouth. Is there anybody out there with that pose who thinks it might be a mouth? And I think you can see, you're gonna have to send out a lot of different messages. For each kind of other thing that might support you, you're gonna need to send a different message.

So you're gonna need a multi-headed transformer and it's gonna be doing these coordinate transforms and you have to do a coordinate transform, the inverse transform on the way back. 'Cause if the mouth supports you, what it needs to support is a nose, not with the pose of the mouth, but with the appropriate pose.

So that's gonna get very complicated. You're gonna have N-squared interactions all with coordinate transforms. There's another way of doing it that's much simpler. That's called a Hough transform. At least it's much simpler if you have a way of representing ambiguity. So instead of these direct interactions between parts like a nose and a mouth, what you're gonna do is you're gonna make each of the parts predict the whole.

So the nose can predict the face and it can predict the pose of the face. And the mouth can also predict the face. Now these will be in different columns of Glom, but in one column of Glom, you'll have a nose predicting face. In a nearby column, you'll have a mouth predicting face.

And those two faces should be the same if this really is a face. So when you do this attention weighted averaging with nearby things, what you're doing is you're getting confirmation that the support for the hypothesis you've got, I mean, suppose in one column you make the hypothesis it's a face with this pose.

That gets supported by nearby columns that derive the very same embedding from quite different data. One derived it from the nose and one derived it from the mouth. And this doesn't require any dynamic routing because the embeddings are always referring to what's going on in the same small patch of the image.

Within a column, there's no routing. And between columns, there's something a bit like routing, but it's just the standard transformer kind of attention. You're just trying to agree with things that are similar. And, okay, so that's how GLOM is meant to work. And the big problem is that if I see a circle, it might be a left eye, it might be a right eye, it might be a front wheel of a car, it might be the back wheel of a car.

Because my embedding for a particular patch at a particular level has to be able to represent anything, when I get an ambiguous thing, I have to deal with all these possibilities of what whole it might be part of. So instead of trying to resolve ambiguity at the part level, what I can do is jump to the next level up and resolve the ambiguity there, just by seeing if things are the same, which is an easier way to resolve ambiguity.

But the cost of that is I have to be able to represent all the ambiguity I get at the next level up. Now, it turns out you can do that. We've done a little toy example where you can actually preserve this ambiguity, but it's difficult. It's the kind of thing neural nets are good at.

So if you think about the embedding at the next level up, you've got a whole bunch of neurons whose activities are that embedding, and you wanna represent a highly multimodal distribution, like it might be a car with this pose, or a car with that pose, or a face with this pose, or a face with that pose.

All of these are possible predictions for finding a circle. And so you have to represent all that. And the question is, can neural nets do that? And I think the way they must be doing it is each neuron in the embedding stands for an unnormalized log probability distribution over this huge space of possible identities and possible poses, the sort of cross product of identities and poses.

And so the neuron is this rather vague log probability distribution over that space. And when you activate the neuron, what it's saying is, add in that log probability distribution to what you've already got. And so now if you have a whole bunch of log probability distributions, and you add them all together, you can get a much more peaky log probability distribution.

And when you exponentiate to get a probability distribution, it gets very peaky. And so very vague basis functions in this joint space of pose and identity, basis functions in the log probability space, can be combined to produce sharp conclusions. So I think that's how neurons are representing things.
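A tiny numeric illustration of that claim (the one-dimensional grid standing in for the identity-and-pose space and the Gaussian-shaped bumps are purely illustrative assumptions):

```python
# Summing a few vague unnormalised log-probability bumps and exponentiating
# gives a much sharper distribution than any single bump.
import numpy as np

poses = np.linspace(0, 1, 100)              # crude stand-in for one slice of the pose space

def vague_log_prob(centre, width=0.3):
    """One neuron's contribution: a broad, vague unnormalised log-probability bump."""
    return -((poses - centre) ** 2) / (2 * width ** 2)

# Three active neurons, each vague on its own but roughly consistent with each other.
log_p = vague_log_prob(0.45) + vague_log_prob(0.55) + vague_log_prob(0.50)

p = np.exp(log_p - log_p.max())             # exponentiate (shifted for numerical stability)
p /= p.sum()                                # normalise to a probability distribution

single = np.exp(vague_log_prob(0.50) - vague_log_prob(0.50).max())
print("fraction of pose space above half-max, one neuron:", (single > 0.5).mean())
print("fraction of pose space above half-max, combined:  ", (p > p.max() / 2).mean())
```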

Most people think about neurons as, they think about the thing that they're representing. But obviously in perception, you have to deal with uncertainty. And so neurons have to be good at representing multimodal distributions. And this is the only way I can think of that's good at doing it. That's a rather weak argument.

I mean, it's the argument that led Chomsky to believe that language wasn't learned because he couldn't think of how it was learned. My view is neurons must be using this representation 'cause I can't think of any other way of doing it. Okay. I just said all that 'cause I got ahead of myself 'cause I got excited.

Okay. Now, the reason you can get away with this, the reason you have these very vague distributions in the unnormalized log probability space is because these neurons are all dedicated to a small patch of image, and they're all trying to represent the thing that's happening in that patch of image.

So you're only trying to represent one thing. You're not trying to represent some set of possible objects. If you're trying to represent some set of possible objects, you'd have a horrible binding problem, and you couldn't use these very vague distributions. But so long as you know that all of these neurons, all of the active neurons refer to the same thing, then you can do the intersection.

You can add the log probability distribution together and intersect the sets of things they represent. Okay. I'm getting near the end. How would you train a system like this? Well, obviously you could train it the way you train BERT. You could do deep end-to-end training. And for GLOM, what that would consist of, and the way we trained a toy example, is you take an image, you leave out some patches of the image, you then let GLOM settle down for about 10 iterations, and it's trying to fill in the lowest level representation of what's in the image, the lowest level embedding, and it fills them in wrong.

And so you now back propagate that error, and you're back propagating it through time in this network. So it'll also back propagate up and down through the levels. So you're basically just doing back propagation through time of the error due to filling in things incorrectly. That's basically how BERT is trained.

And you could train GLOM the same way. But I also want to include an extra bit in the training to encourage islands. We want to encourage big islands of identical vectors at high levels. And you can do that by using contrastive learning. So if you think how an embedding is determined at the next time step, it's determined by combining a whole bunch of different factors: what was going on at the previous time step at this level of representation in this location; what was going on at the previous time step in this location, but at the next level down and at the next level up; and also what was going on at the previous time step at nearby locations at the same level.

And the weighted average of all those things I'll call the consensus embedding. And that's what you use for the next embedding. And I think you can see that if we try and make the bottom up neural net and the top down neural net, if we try and make the predictions agree with the consensus, the consensus has folded in information from nearby locations that already roughly agree because of the attention weighting.

And so by trying to make the top down and bottom up neural networks agree with the consensus, you're trying to make them agree with what's going on in nearby locations that are similar. And so you'll be training it to form islands. This is more interesting to neuroscientists than to people who do natural language, so I'm gonna ignore that.
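Before moving on, here is a minimal sketch of that island-encouraging objective (treating the consensus as a fixed regression target via detach, and using a squared-error loss, are assumptions rather than details from the talk):

```python
# Train the bottom-up and top-down predictions to agree with the consensus
# embedding, which has already folded in the attention-weighted neighbours.
import torch
import torch.nn.functional as F

def island_loss(bottom_up_pred, top_down_pred, prev_state, neighbour_avg):
    """All arguments: (num_columns, D) tensors for one level at one time step."""
    consensus = (bottom_up_pred + top_down_pred + prev_state + neighbour_avg) / 4
    target = consensus.detach()   # assumption: the predictions chase the consensus, not vice versa
    return F.mse_loss(bottom_up_pred, target) + F.mse_loss(top_down_pred, target)
```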

You might think it's wasteful to be replicating all these embeddings at the object level. So the idea is at the object level, there'll be a large number of patches that all have exactly the same vector representation. And that seems like a waste, but actually biology is full of things like that.

All your cells have exactly the same DNA and all the parts of an organ have pretty much the same protein expression. So there's lots of replication that goes on to keep things local. And it's the same here. And actually that replication is very useful when you're settling on an interpretation, because before you settle down, you don't know which things should be the same as which other things.

So having separate vectors in each location to represent what's going on there at the object level gives you the flexibility to gradually segment things as you settle down in a sensible way. It allows you to hedge your bets. And what you're doing is not quite like clustering. You're creating clusters of identical vectors rather than discovering clusters in fixed data.

So clustering, you're given the data and it's fixed and you find the clusters. Here, the embeddings at every level, they vary over time. They're determined by the top-down and bottom-up inputs and by inputs coming from nearby locations. So what you're doing is forming clusters rather than discovering them in fixed data.

And that's got a somewhat different flavor and can settle down faster. And one other advantage to this replication is what you don't want is to have much more work in your transformer as you go to higher levels. But you do need longer range interactions at higher levels. Presumably for the lowest levels, you want fairly short range interactions in your transformer, and they could be dense.

As you go to higher levels, you want much longer range interactions. So you could make them sparse, and people have done things like that for BERT-like systems. Here, it's easy to make them sparse 'cause you're expecting big islands. So all you need to do is see one patch of a big island to know what the vector representation of that island is.

And so sparse representations will work much better if you have these big islands of agreement as you go up. So the idea is you have longer range and sparser connections as you go up. So the amount of computation is the same in every level. And just to summarize, I showed how to combine three important advances of neural networks in Glom.

I didn't actually talk about neural fields, and that's important for the top-down network. Maybe since I've got two minutes to spare, I'm gonna go back and mention neural fields very briefly. Yeah, when I train that top-down neural network, I have a problem. And the problem is, if you look at those red arrows and those green arrows, they're quite different.

But if you look at the level above, the object level, all those vectors are the same. And of course, in an engineered system, I want to replicate the neural nets in every location. So I use exactly the same top-down and bottom-up neural nets everywhere. And so the question is, how can the same neural net be given a black arrow and sometimes produce a red arrow and sometimes produce a green arrow, which have quite different orientations?

How can it produce a nose where there's nose and a mouth where there's mouth, even though the face vector is the same everywhere? And the answer is, the top-down neural network doesn't just get the face vector, it also gets the location of the patch for which it's producing the part vector.

So the three patches that should get the red vector are different locations from the three patches that should get the green vector. So if I use a neural network that gets the location as input as well, here's what it can do. It can take the pose that's encoded in that black vector, the pose of the face.

It can take the location in the image for which it's predicting the vector of the level below. And the pose is relative to the image too. So knowing the location in the image and knowing the pose of the whole face, it can figure out which bit of the face it needs to predict at that location.

And so in one location, it can predict, okay, there should be nose there and it gives you the red vector. In another location, it can predict from where that image patch is, there should be mouth there, so it can give you the green arrow. So you can get the same vector at the level above to predict different vectors in different places at the level below by giving it the place that it's predicting for.
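A minimal sketch of such a location-conditioned top-down network (the architecture and dimensions are assumptions; only the idea of feeding the patch location in alongside the level-above vector comes from the talk):

```python
# The same top-down network, given the same object-level vector, predicts
# different part-level vectors at different locations because the location
# of the target patch is part of its input.
import torch
import torch.nn as nn

D = 64

class TopDown(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D + 2, D), nn.ReLU(),   # +2 for the (x, y) of the target patch
            nn.Linear(D, D), nn.ReLU(),
            nn.Linear(D, D),
        )

    def forward(self, object_vec, location):
        """object_vec: (N, D) level-above embeddings; location: (N, 2) patch coordinates."""
        return self.net(torch.cat([object_vec, location], dim=1))

top_down = TopDown()
face = torch.randn(1, D).repeat(3, 1)                 # identical face vector in three columns
locs = torch.tensor([[0.2, 0.5], [0.5, 0.5], [0.8, 0.5]])
parts = top_down(face, locs)                          # three different part-level predictions
```

Calling it with the same face vector at three different locations gives three different part-level predictions, which is the red-arrow versus green-arrow behaviour described above.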

And that's what's going on in neural fields. Okay. Now, this was quite a complicated talk. There's a long paper about it on arXiv that goes into much more detail. And you could view this talk as just an encouragement to read that paper. And I'm done, exactly on time. - Okay.

- Thank you. - Thanks a lot. - Yeah. (upbeat music)