
Stanford CS25: V2 | Represent part-whole hierarchies in a neural network, Geoff Hinton


Chapters

0:00
1:55 Why it is hard to make real neural networks learn part-whole hierarchies • Each image has a different parse tree. • Real neural networks cannot dynamically allocate neurons to represent nodes in a parse tree. - What a neuron does is determined by the weights on its connections and the weights
20:45 A brief introduction to contrastive learning of visual representations • Contrastive self-supervised learning uses the similarity between activity vectors produced from different patches of the same image as the objective
43:39 How to implement multimodal predictions in the joint space of identity and pose • Each neuron in the embedding vector for the object is a basis function that represents a vague distribution in the log probability space

Whisper Transcript | Transcript Only Page

00:00:00.000 | (silence)
00:00:02.160 | Before we start, I gave the same talk at Stanford
00:00:09.880 | quite recently.
00:00:10.900 | I suggested to the people inviting me,
00:00:13.440 | I could just give one talk and both audiences come,
00:00:15.680 | but they would prefer it as two separate talks.
00:00:18.320 | So if you went to this talk recently,
00:00:20.620 | I suggest you leave now.
00:00:22.400 | You won't learn anything new.
00:00:24.280 | Okay.
00:00:28.440 | What I'm gonna do is combine some recent ideas
00:00:30.920 | in neural networks to try to explain
00:00:33.960 | how a neural network could represent part-whole hierarchies
00:00:36.880 | without violating any of the basic principles
00:00:41.040 | of how neurons work.
00:00:42.220 | And I'm gonna explain these ideas
00:00:47.640 | in terms of an imaginary system.
00:00:50.340 | I started writing a design document for a system
00:00:52.440 | and in the end, I decided the design document
00:00:54.400 | by itself was quite interesting.
00:00:56.280 | So this is just vaporware, it's stuff that doesn't exist.
00:00:58.840 | Little bits of it now exist,
00:00:59.960 | but somehow I find it easy to explain the ideas
00:01:04.060 | in the context of an imaginary system.
00:01:05.960 | So most people now studying neural networks
00:01:14.400 | are doing engineering and they don't really care
00:01:16.800 | if it's exactly how the brain works.
00:01:18.800 | They're not trying to understand how the brain works,
00:01:20.280 | they're trying to make cool technology.
00:01:22.760 | And so a hundred layers is fine in a ResNet,
00:01:25.380 | weight sharing is fine in a convolutional neural net.
00:01:27.980 | Some researchers, particularly computational neuroscientists
00:01:32.760 | investigate neural networks, artificial neural networks
00:01:35.680 | in an attempt to understand
00:01:36.680 | how the brain might actually work.
00:01:38.800 | I think we still got a lot to learn from the brain
00:01:41.640 | and I think it's worth remembering
00:01:43.720 | that for about half a century,
00:01:45.200 | the only thing that kept research on neural networks going
00:01:48.200 | was the belief that it must be possible
00:01:49.880 | to make these things learn complicated things
00:01:51.640 | 'cause the brain does.
00:01:53.500 | So every image has a different parse tree,
00:01:58.500 | that is, the structure of the wholes
00:02:02.680 | and the parts in the image.
00:02:03.980 | And in a real neural network,
00:02:07.760 | you can't dynamically allocate,
00:02:09.560 | you can't just grab a bunch of neurons and say,
00:02:11.320 | "Okay, you now represent this,"
00:02:15.000 | because you don't have random access memory.
00:02:16.800 | You can't just set the weights of the neurons
00:02:19.280 | to be whatever you like.
00:02:20.520 | What a neuron does is determined by its connections
00:02:22.900 | and they only change slowly, at least probably,
00:02:26.320 | mostly they change slowly.
00:02:27.620 | So the question is,
00:02:30.500 | if you can't change what neurons do quickly,
00:02:33.260 | how can you represent a dynamic parse tree?
00:02:35.660 | In symbolic AI, it's not a problem.
00:02:42.080 | You just grab a piece of memory,
00:02:44.060 | that's what it normally amounts to,
00:02:45.700 | and say, "This is gonna represent a node in the parse tree
00:02:48.740 | and I'm gonna give it pointers to other nodes,
00:02:50.860 | other bits of memory that represent other nodes."
00:02:53.020 | So there's no problem.
00:02:54.660 | For about five years,
00:02:55.900 | I played with a theory called capsules,
00:02:58.620 | where you say,
00:03:00.620 | "Because you can't allocate neurons on the fly,
00:03:03.340 | you're gonna allocate them in advance."
00:03:05.240 | So we're gonna take groups of neurons
00:03:06.700 | and we're gonna allocate them
00:03:07.740 | to different possible nodes in a parse tree.
00:03:10.600 | And most of these groups of neurons for most images
00:03:13.700 | are gonna be silent, a few are gonna be active.
00:03:17.300 | And then the ones that are active,
00:03:18.500 | we have to dynamically hook them up into a parse tree.
00:03:21.340 | So we have to have a way of routing
00:03:22.940 | between these groups of neurons.
00:03:25.240 | So that was the capsules theory.
00:03:28.260 | And I had some very competent people working with me
00:03:31.780 | who actually made it work,
00:03:33.340 | but it was tough going.
00:03:36.960 | My view is that some ideas want to work
00:03:38.640 | and some ideas don't want to work.
00:03:40.140 | And capsules was sort of in between.
00:03:42.000 | Things like backpropagation just wanna work,
00:03:43.740 | you try them and they work.
00:03:46.020 | There's other ideas I've had that just don't wanna work.
00:03:49.020 | Capsules was sort of in between and we got it working.
00:03:52.060 | But I now have a new theory
00:03:53.980 | that could be seen as a funny kind of capsules model
00:03:57.740 | in which each capsule is universal.
00:03:59.860 | That is instead of a capsule being dedicated
00:04:02.540 | to a particular kind of thing,
00:04:04.640 | each capsule can represent any kind of thing.
00:04:07.100 | But hardware still comes in capsules,
00:04:11.500 | which are also sometimes called embeddings.
00:04:14.640 | So the imaginary system I'll talk about is called GLOM.
00:04:19.640 | And in GLOM, hardware gets allocated to columns
00:04:25.240 | and each column contains multiple levels of representation
00:04:29.920 | of what's happening in a small patch of the image.
00:04:32.400 | So within a column,
00:04:35.240 | you might have a lower level representation
00:04:38.120 | that says it's a nostril
00:04:39.480 | and the next level up might say it's a nose
00:04:42.280 | and the next level up might say a face,
00:04:43.680 | the next level up a person
00:04:45.120 | and the top level might say it's a party.
00:04:47.120 | That's what the whole scene is.
00:04:48.680 | And the idea for representing part-whole hierarchies
00:04:52.760 | is to use islands of agreement
00:04:54.860 | between the embeddings at these different levels.
00:04:57.320 | So at the scene level, at the top level,
00:05:00.760 | you'd like the same embedding for every patch of the image
00:05:03.760 | 'cause that patch is a patch of the same scene everywhere.
00:05:07.200 | At the object level,
00:05:08.400 | you'd like the embeddings of all the different patches
00:05:11.400 | that belong to the object to be the same.
00:05:14.600 | So as you go up this hierarchy,
00:05:15.960 | you're trying to make things more and more the same
00:05:18.360 | and that's how you're squeezing redundancy out.
00:05:21.140 | The embedding vectors are the things that act like pointers
00:05:25.400 | and the embedding vectors are dynamic.
00:05:28.600 | They're neural activations rather than neural weights.
00:05:31.280 | So it's fine to have different embedding vectors
00:05:33.880 | for every image.
00:05:34.820 | So here's a little picture.
00:05:40.800 | If you had a one-dimensional row of patches,
00:05:44.520 | these are the columns for the patches
00:05:47.160 | and you'd have something like a convolutional neural net
00:05:51.720 | as a front end.
00:05:52.660 | And then after the front end,
00:05:56.000 | you produce your lowest level embeddings
00:05:58.240 | that say what's going on in each particular patch.
00:06:01.080 | And so that bottom layer of black arrows,
00:06:03.280 | they're all different.
00:06:04.600 | Of course, these embeddings are thousands of dimensions,
00:06:07.840 | maybe hundreds of thousands in your brain.
00:06:10.180 | And so a two-dimensional vector isn't right,
00:06:14.160 | but at least I can represent
00:06:15.480 | where the two vectors are the same by using the orientation.
00:06:19.160 | So the lowest level,
00:06:20.680 | all the patches will have different representations.
00:06:23.960 | But the next level up, the first two patches,
00:06:27.360 | they might be part of a nostril, for example.
00:06:31.160 | And so, yeah, they'll have the same embedding.
00:06:37.680 | But the next level up,
00:06:39.460 | the first three patches might be part of a nose.
00:06:42.900 | And so they'll all have the same embedding.
00:06:45.100 | Notice that even though what's in the image
00:06:47.260 | is quite different, at the part level,
00:06:51.660 | those three red vectors are all meant to be the same.
00:06:55.180 | So what we're doing is we're getting the same representation
00:06:58.100 | for things that are superficially very different.
00:07:00.860 | We're finding spatial coherence in an image
00:07:03.740 | by giving the same representation to different things.
00:07:06.860 | And at the object level,
00:07:09.200 | you might have a nose and then a mouth,
00:07:12.100 | and they're the same face.
00:07:13.540 | They're part of the same face.
00:07:15.220 | And so all those vectors are the same.
00:07:17.060 | And this network hasn't yet settled down
00:07:18.900 | to produce something at the scene level.
00:07:20.600 | So the islands of agreement are what capture the parse tree.
00:07:27.240 | Now, they're a bit more powerful than a parse tree.
00:07:29.740 | They can capture things like "shut the heck up."
00:07:33.980 | You can have "shut" and "up" be different vectors
00:07:36.860 | at one level, but at a higher level,
00:07:38.660 | "shut" and "up" can have exactly the same vector,
00:07:41.820 | namely the vector for "shut up," and they can be disconnected.
00:07:44.900 | So you can do things a bit more powerful
00:07:46.300 | than a context-free grammar here,
00:07:48.180 | but basically it's a parse tree.
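
To make the islands concrete, here is a minimal sketch, not from the talk, of how islands of agreement could be read off one level for a one-dimensional row of patches: adjacent patches whose embedding vectors are nearly identical are grouped together. The `islands` function and the toy vectors are purely illustrative, not part of GLOM.

```python
import numpy as np

def islands(embeddings, threshold=0.99):
    """Group a 1-D row of patch embeddings into islands of near-identical vectors.

    embeddings: array of shape (num_patches, dim)
    Returns a list of index lists, one per island.
    """
    # Normalize so patches can be compared by cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    groups = [[0]]
    for i in range(1, len(unit)):
        if unit[i] @ unit[i - 1] > threshold:   # same island as the previous patch
            groups[-1].append(i)
        else:                                    # start a new island
            groups.append([i])
    return groups

# Toy example: five patches, the middle three agree at this level.
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.7, 0.7]])
print(islands(vecs))   # [[0], [1, 2, 3], [4]]
```
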
00:07:51.520 | If you're a physicist,
00:07:53.140 | you can think of each of these levels as an Ising model
00:07:58.620 | with real-valued vectors rather than binary spins.
00:08:02.980 | And you can think of there being coordinate transforms
00:08:05.100 | between levels, which makes it much more complicated.
00:08:08.380 | And then this is a kind of multi-level Ising model,
00:08:12.460 | but with complicated interactions between the levels,
00:08:15.700 | because for example, between the red arrows
00:08:17.660 | and the black arrows above them,
00:08:19.460 | you need the coordinate transform
00:08:21.020 | between a nose and a face, but we'll come to that later.
00:08:24.580 | If you're not a physicist,
00:08:27.740 | ignore all that 'cause it won't help.
00:08:29.580 | (keyboard clicking)
00:08:33.340 | So I want to start, and this is, I guess,
00:08:35.780 | is particularly relevant for a natural language course
00:08:38.260 | where some of you are not vision people,
00:08:40.740 | by trying to prove to you that coordinate systems
00:08:45.020 | are not just something invented by Descartes.
00:08:47.220 | Coordinate systems were invented by the brain
00:08:52.060 | a long time ago, and we use coordinate systems
00:08:54.940 | in understanding what's going on in an image.
00:08:58.080 | I also want to demonstrate the psychological reality
00:09:00.500 | of parse trees for an image.
00:09:02.820 | So I'm gonna do this with a task
00:09:05.680 | that I invented a long time ago in the 1970s,
00:09:09.500 | when I was a grad student, in fact.
00:09:11.260 | And you have to do this task
00:09:14.460 | to get the full benefit from it.
00:09:16.180 | So I want you to imagine on the tabletop in front of you,
00:09:23.020 | there's a wireframe cube,
00:09:25.060 | and it's in the standard orientation for a cube,
00:09:27.680 | resting on the tabletop.
00:09:29.660 | And from your point of view,
00:09:31.920 | there's a front bottom right-hand corner,
00:09:34.940 | and a top back left-hand corner.
00:09:37.660 | Here we go.
00:09:38.880 | Okay.
00:09:39.720 | The front bottom right-hand corner
00:09:42.500 | is resting on the tabletop,
00:09:44.100 | along with the three other corners of the bottom face.
00:09:46.340 | And the top back left-hand corner
00:09:48.660 | is at the other end of a diagonal
00:09:50.380 | that goes through the center of the cube.
00:09:53.060 | Okay, so far so good.
00:09:55.040 | Now what we're gonna do is rotate the cube
00:09:56.900 | so that this finger stays on the tabletop,
00:09:59.660 | and the other finger is vertically above it, like that.
00:10:02.420 | This finger shouldn't have moved.
00:10:05.620 | Okay.
00:10:06.740 | So now we've got the cube in an orientation
00:10:08.700 | where that thing that was a body diagonal is now vertical.
00:10:12.140 | And all you've gotta do is take the bottom finger,
00:10:15.340 | 'cause that's still on the tabletop,
00:10:17.180 | and point with the bottom finger
00:10:18.500 | to where the other corners of the cube are.
00:10:21.020 | So I want you to actually do it.
00:10:22.020 | Off you go.
00:10:22.860 | Take your bottom finger,
00:10:24.300 | hold your top finger at the other end of that diagonal
00:10:27.540 | that's now been made vertical,
00:10:28.780 | and just point to where the other corners are.
00:10:31.040 | And luckily, it's Zoom,
00:10:35.460 | so most of you, other people,
00:10:38.140 | won't be able to see what you did.
00:10:39.460 | And I can see that some of you aren't pointing,
00:10:41.100 | and that's very bad.
00:10:42.100 | So most people point out four other corners,
00:10:47.420 | and the most common response
00:10:48.820 | is to say they're here, here, here, and here.
00:10:51.320 | They point out four corners in a square
00:10:52.980 | halfway up that axis.
00:10:54.460 | That's wrong, as you might imagine.
00:11:00.060 | And it's easy to see that it's wrong,
00:11:01.820 | 'cause if you imagine the cube in the normal orientation
00:11:04.820 | and count the corners, there's eight of them.
00:11:07.220 | And these were two corners.
00:11:10.140 | So where did the other two corners go?
00:11:12.040 | So one theory is that when you rotated the cube,
00:11:15.700 | the centrifugal forces
00:11:16.900 | made them fly off into your unconscious.
00:11:19.500 | That's not a very good theory.
00:11:21.220 | So what's happening here
00:11:24.060 | is you have no idea where the other corners are,
00:11:26.660 | unless you're something like a crystallographer.
00:11:29.460 | You can sort of imagine bits of the cube,
00:11:31.180 | but you just can't imagine this structure
00:11:32.900 | of the other corners, what structure they form.
00:11:35.200 | And this common response that people give,
00:11:38.700 | of four corners in a square,
00:11:41.560 | is doing something very weird.
00:11:43.860 | It's trying to, it's saying,
00:11:45.340 | well, okay, I don't know where the bits of a cube are,
00:11:48.900 | but I know something about cubes.
00:11:50.300 | I know the corners come in fours.
00:11:52.580 | I know a cube has this four-fold rotational symmetry,
00:11:55.940 | or two planes of bilateral symmetry,
00:11:58.180 | at right angles to one another.
00:12:00.140 | And so what people do is they preserve the symmetries
00:12:02.980 | of the cube in their response.
00:12:05.060 | They give four corners in a square.
00:12:07.040 | Now, what they've actually pointed out if they do that
00:12:11.620 | is two pyramids, each of which has a square base.
00:12:16.180 | One's upside down, and they're stuck base to base.
00:12:19.680 | So you can visualize that quite easily.
00:12:21.780 | It's a square base pyramid
00:12:22.780 | with another one stuck underneath it.
00:12:25.300 | And so now you get your two fingers
00:12:26.700 | as the vertices of those two pyramids.
00:12:29.460 | And what's interesting about that is
00:12:33.460 | you've preserved the symmetries of the cube
00:12:36.340 | at the cost of doing something pretty radical,
00:12:38.980 | which is changing faces to vertices and vertices to faces.
00:12:43.980 | The thing you pointed out if you did that was an octahedron.
00:12:48.840 | It has eight faces and six vertices.
00:12:51.500 | A cube has six faces and eight vertices.
00:12:54.340 | So in order to preserve the symmetries
00:12:56.260 | you know about of the cube, if you did that,
00:13:00.840 | you've done something really radical,
00:13:03.020 | which is changed faces for vertices and vertices for faces.
00:13:05.980 | I should show you what the answer looks like.
00:13:10.120 | So I'm gonna step back and try and get enough light,
00:13:13.500 | and maybe you can see this cube.
00:13:18.740 | So this is a cube,
00:13:21.460 | and you can see that the other edges
00:13:26.460 | form a kind of zigzag ring around the middle.
00:13:28.900 | So I've got a picture of it.
00:13:31.580 | So the colored rods here are the other edges of the cube,
00:13:37.900 | the ones that don't touch your fingertips.
00:13:40.460 | And your top finger's connected
00:13:41.780 | to the three vertices of those flaps,
00:13:44.380 | and your bottom finger's connected
00:13:46.540 | to the lowest three vertices there.
00:13:48.640 | And that's what a cube looks like.
00:13:51.100 | It's something you had no idea about.
00:13:53.420 | This is just a completely different model of a cube.
00:13:55.940 | It's so different, I'll give it a different name.
00:13:57.540 | I'll call it a hexahedron.
00:13:58.860 | And the thing to notice is a hexahedron and a cube
00:14:04.940 | are just conceptually utterly different.
00:14:07.540 | You wouldn't even know one was the same as the other
00:14:09.860 | if you think about one as a hexahedron and one as a cube.
00:14:12.620 | It's like the ambiguity between a tilted square
00:14:15.540 | and an upright diamond, but more powerful
00:14:17.780 | 'cause you're not familiar with it.
00:14:19.480 | And that's my demonstration
00:14:22.340 | that people really do use coordinate systems.
00:14:24.580 | And if you use a different coordinate system
00:14:26.260 | to describe things, and here I forced you
00:14:28.500 | to use a different coordinate system
00:14:29.900 | by making the diagonal be vertical
00:14:32.300 | and asking you to describe it relative to that vertical axis,
00:14:35.340 | then familiar things become completely unfamiliar.
00:14:38.260 | And when you do see them relative to this new frame,
00:14:41.560 | they're just a completely different thing.
00:14:44.300 | Notice that things like convolutional neural nets
00:14:46.460 | don't have that.
00:14:47.540 | They can't look at something
00:14:48.580 | and have two utterly different internal representations
00:14:50.900 | of the very same thing.
00:14:52.060 | I'm also showing you that you do parsing.
00:14:55.980 | So here I've colored it so you parse it
00:14:57.780 | into what I call the crown,
00:14:59.660 | which is three triangular flaps
00:15:01.580 | that slope upwards and outwards.
00:15:03.180 | Here's a different parsing.
00:15:06.260 | The same green flap sloping upwards and outwards.
00:15:08.980 | Now we have a red flap sloping downwards and outwards,
00:15:12.580 | and we have a central rectangle,
00:15:14.460 | and your fingertips are at the two ends of the rectangle.
00:15:17.180 | And if you perceive this and now close your eyes
00:15:21.220 | and I ask you, were there any parallel edges there?
00:15:24.540 | You're very well aware that those two blue edges
00:15:27.260 | were parallel, and you're typically not aware
00:15:29.620 | of any other parallel edges,
00:15:31.300 | even though you know by symmetry there must be other pairs.
00:15:34.380 | Similarly with the crown, if you see the crown,
00:15:37.340 | and then I ask you to close your eyes
00:15:38.940 | and ask you, were there parallel edges?
00:15:41.020 | You don't see any parallel edges.
00:15:43.420 | And that's because the coordinate systems
00:15:45.060 | you're using for those flaps don't line up with the edges.
00:15:48.860 | And you only notice parallel edges
00:15:50.620 | if they line up with the coordinate system you're using.
00:15:53.460 | So here for the rectangle,
00:15:55.020 | the parallel edges align with the coordinate system.
00:15:57.380 | For the flaps, they don't.
00:15:59.020 | So you're aware that those two blue edges are parallel,
00:16:01.300 | but you're not aware that one of the green edges
00:16:03.500 | and one of the red edges are parallel.
00:16:10.460 | So this isn't like the Necker cube ambiguity,
00:16:12.580 | where when it flips, you think that what's out there
00:16:15.220 | in reality is different, things are at a different depth.
00:16:18.340 | This is like, next weekend, we shall be visiting relatives.
00:16:22.780 | So if you take the sentence,
00:16:23.620 | "Next weekend, we shall be visiting relatives,"
00:16:26.020 | it can mean, next weekend,
00:16:28.780 | what we will be doing is visiting relatives,
00:16:31.620 | or it can mean, next weekend,
00:16:33.580 | what we will be is visiting relatives.
00:16:36.180 | Now, those are completely different senses.
00:16:39.620 | They happen to have the same truth conditions.
00:16:41.660 | They mean the same thing in the sense of truth conditions,
00:16:45.100 | 'cause if you're visiting relatives,
00:16:46.980 | what you are is visiting relatives.
00:16:49.100 | And it's that kind of ambiguity.
00:16:50.740 | No disagreement about what's going on in the world,
00:16:52.500 | but two completely different ways of seeing the sentence.
00:16:55.260 | So this was drawn in the 1970s.
00:17:03.900 | This is what AI was like in the 1970s.
00:17:08.460 | This is a sort of structural description
00:17:10.500 | of the crown interpretation.
00:17:12.380 | So you have nodes for all the various parts in the hierarchy.
00:17:17.820 | I've also put something on the arcs.
00:17:20.020 | That RWX is the relationship
00:17:24.140 | between the crown and the flap,
00:17:26.420 | and that can be represented by a matrix.
00:17:28.580 | It's really the relationship
00:17:29.740 | between the intrinsic frame of reference of the crown
00:17:32.860 | and the intrinsic frame of reference of the flap.
00:17:35.700 | And notice that if I change my viewpoint,
00:17:39.020 | that doesn't change at all.
00:17:40.420 | So that kind of relationship will be a good thing
00:17:43.540 | to put in the weights of a neural network,
00:17:45.900 | 'cause you'd like a neural network
00:17:47.100 | to be able to recognize shapes independently of viewpoint.
00:17:50.460 | And that RWX is knowledge about the shape
00:17:53.580 | that's independent of viewpoint.
00:17:55.180 | Here's the zigzag interpretation.
00:17:59.620 | And here's something else
00:18:01.820 | where I've added the things in the heavy blue boxes.
00:18:06.460 | They're the relationship between a node and the viewer.
00:18:11.460 | That is to be more explicit.
00:18:13.900 | The coordinate transformation
00:18:15.220 | between the intrinsic frame of reference of the crown
00:18:18.260 | and the intrinsic frame of reference of the viewer,
00:18:20.940 | your eyeball, is that RWV.
00:18:23.580 | And that's a different kind of thing altogether,
00:18:26.980 | 'cause as you change viewpoint, that changes.
00:18:29.580 | In fact, as you change viewpoint,
00:18:31.060 | all those things in blue boxes all change together
00:18:34.260 | in a consistent way.
00:18:35.380 | And there's a simple relationship,
00:18:37.780 | which is that if you take RWV,
00:18:40.220 | then you multiply it by RWX, you get RXV.
00:18:44.420 | So you can easily propagate viewpoint information
00:18:47.620 | over a structural description.
00:18:49.940 | And that's what I think a mental image is.
00:18:52.500 | Rather than a bunch of pixels,
00:18:55.020 | it's a structural description
00:18:56.820 | with associated viewpoint information.
00:18:59.980 | That makes sense of a lot of properties of mental images.
00:19:04.980 | Like if you want to do any reasoning with things like RWX,
00:19:09.780 | you form a mental image.
00:19:12.460 | That is you fill in, you choose a viewpoint.
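
As a small illustration of that propagation (my own conventions and matrix names, not Hinton's exact RWX/RWV notation), composing a viewpoint-independent part-to-whole transform with a viewpoint-dependent whole-to-viewer transform gives the part's pose relative to the viewer:

```python
import numpy as np

def transform(angle_deg, tx, ty):
    """Homogeneous 2-D rotation-plus-translation matrix."""
    a = np.radians(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), tx],
                     [np.sin(a),  np.cos(a), ty],
                     [0.0,        0.0,       1.0]])

# Viewpoint-independent knowledge (lives in the weights):
# the flap's frame relative to the crown's frame.
T_part_to_whole = transform(30.0, 1.0, 0.0)

# Viewpoint-dependent information (changes as you move):
# the crown's frame relative to the viewer.
T_whole_to_viewer = transform(-10.0, 0.0, 5.0)

# Propagating viewpoint information over the structural description:
# the flap's frame relative to the viewer is just the composition.
T_part_to_viewer = T_whole_to_viewer @ T_part_to_whole
print(np.round(T_part_to_viewer, 3))
```
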
00:19:14.900 | And I want to do one more demo to convince you
00:19:17.460 | you always choose a viewpoint
00:19:18.980 | when you're solving mental imagery problems.
00:19:21.380 | So I'm gonna give you another very simple
00:19:22.900 | mental imagery problem at the risk of running over time.
00:19:27.660 | Imagine that you're at a particular point
00:19:31.660 | and you travel a mile East,
00:19:33.420 | and then you travel a mile North,
00:19:35.220 | and then you travel a mile East again.
00:19:37.540 | What's your direction back to your starting point?
00:19:40.820 | This isn't a very hard problem.
00:19:42.940 | It's sort of a bit South and quite a lot West, right?
00:19:46.340 | It's not exactly Southwest, but it's sort of Southwest.
00:19:49.100 | Now, when you did that task,
00:19:52.980 | what you imagined from your point of view
00:19:56.220 | is you went a mile East, and then you went a mile North,
00:19:58.740 | and then you went a mile East again.
00:20:00.540 | I'll tell you what you didn't imagine.
00:20:03.300 | You didn't imagine that you went a mile East,
00:20:05.180 | and then you went a mile North,
00:20:06.180 | and then you went a mile East again.
00:20:07.940 | You could have solved the problem perfectly well
00:20:09.620 | with North not being up, but you had North up.
00:20:13.100 | You also didn't imagine this.
00:20:14.740 | You go a mile East, and then a mile North,
00:20:16.300 | and then a mile East again.
00:20:17.980 | And you didn't imagine this.
00:20:18.820 | You go a mile East, and then a mile North, and so on.
00:20:21.340 | You imagined it at a particular scale,
00:20:23.340 | in a particular orientation, and in a particular position.
00:20:26.180 | And you can answer questions
00:20:29.940 | about roughly how big it was and so on.
00:20:32.020 | So that's evidence that to solve these tasks
00:20:35.540 | that involve using relationships between things,
00:20:39.140 | you form a mental image.
00:20:40.820 | Okay, enough on mental imagery.
00:20:43.780 | So I'm now gonna give you a very brief introduction
00:20:48.300 | to contrastive learning.
00:20:50.060 | So this is a complete disconnect in the talk,
00:20:54.140 | but it'll come back together soon.
00:20:55.820 | So in contrastive self-supervised learning,
00:21:02.140 | what we try and do is make two different crops of an image
00:21:06.660 | have the same representation.
00:21:08.300 | There's a paper a long time ago by Becker and Hinton
00:21:14.460 | where we were doing this to discover
00:21:16.140 | low-level coherence in an image,
00:21:18.620 | like the continuity of surfaces or the depth of surfaces.
00:21:23.620 | It's been improved a lot since then,
00:21:27.300 | and it's been used for doing things like classification.
00:21:30.180 | That is, you take an image
00:21:34.100 | that has one prominent object in it,
00:21:36.540 | and you say, "If I take a crop of the image
00:21:40.100 | "that contains sort of any part of that object,
00:21:42.460 | "it should have the same representation
00:21:45.380 | "as some other crop of the image
00:21:46.540 | "containing a part of that object."
00:21:49.260 | And this has been developed a lot in the last few years.
00:21:54.140 | I'm gonna talk about a model developed a couple of years ago
00:21:57.140 | by my group in Toronto called SimCLR,
00:21:59.100 | but there's lots of other models.
00:22:00.780 | And since then, things have improved.
00:22:02.580 | So in SimCLR, you take an image X,
00:22:08.860 | you take two different crops,
00:22:11.900 | and you also do colour distortion of the crops,
00:22:14.660 | different colour distortions of each crop.
00:22:16.980 | And that's to prevent it from using colour histograms
00:22:19.340 | to say they're the same.
00:22:20.540 | So you mess with the colour,
00:22:22.740 | so it can't use colour in a simple way.
00:22:25.340 | And that gives you Xi tilde and Xj tilde.
00:22:32.140 | You then put those through the same neural network, F,
00:22:36.500 | and you get a representation, H.
00:22:38.380 | And then you take the representation, H,
00:22:41.140 | and you put it through another neural network,
00:22:43.020 | which compresses it a bit.
00:22:44.820 | It goes to low dimensionality.
00:22:47.260 | That's an extra complexity I'm not gonna explain,
00:22:49.420 | but it makes it work a bit better.
00:22:51.700 | You can do it without doing that.
00:22:53.540 | And you get two embeddings, Zi and Zj.
00:22:56.100 | And your aim is to maximise the agreement
00:22:59.540 | between those vectors.
00:23:00.820 | And so you start off doing that and you say,
00:23:03.820 | okay, let's start off with random neural networks,
00:23:07.220 | random weights in the neural networks,
00:23:09.180 | and let's take two patches
00:23:10.460 | and let's put them through these transformations.
00:23:13.060 | Let's try and make Zi be the same as Zj.
00:23:15.660 | So let's back propagate the squared difference
00:23:17.980 | between components of Zi and components of Zj.
00:23:21.140 | And hey, presto, what you discover is everything collapses.
00:23:25.540 | For every image, it will always produce the same Zi and Zj.
00:23:32.260 | And then you realise, well,
00:23:33.260 | that's not what I meant by agreement.
00:23:35.020 | I mean, they should be the same
00:23:37.300 | when you get two crops of the same image
00:23:39.220 | and different when you get two crops of different images.
00:23:42.620 | Otherwise, it's not really agreement, right?
00:23:45.420 | So you have to have negative examples.
00:23:50.420 | You have to show it crops from different images
00:23:53.380 | and say those should be different.
00:23:55.620 | If they're already different,
00:23:57.500 | you don't try and make them a lot more different.
00:23:59.980 | It's very easy to make things very different,
00:24:02.300 | but that's not what you want.
00:24:03.300 | You just wanna be sure they're different enough.
00:24:04.860 | So crops from different images
00:24:07.220 | aren't taken to be from the same image.
00:24:09.420 | So if they happen to be very similar, you push them apart.
00:24:12.300 | And that stops your representations collapsing.
00:24:14.220 | That's called contrastive learning.
00:24:16.140 | And it works very well.
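
Here is a simplified sketch of that kind of contrastive objective, loosely in the style of SimCLR's loss but not the actual implementation: the other crop of the same image is the positive, and all crops of other images in the batch act as negatives, which is what stops the representations collapsing.

```python
import numpy as np

def contrastive_loss(z_i, z_j, temperature=0.5):
    """z_i, z_j: (batch, dim) embeddings of two crops of the same batch of images."""
    z = np.concatenate([z_i, z_j], axis=0)                 # 2N embeddings
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # compare by cosine similarity
    sim = z @ z.T / temperature                            # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                         # never contrast with yourself
    n = len(z_i)
    # The positive for crop k is the other crop of the same image.
    pos_index = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(2 * n), pos_index].mean()

rng = np.random.default_rng(0)
z_i, z_j = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(contrastive_loss(z_i, z_j))
```
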
00:24:17.340 | So what you can do is do unsupervised learning
00:24:23.540 | by trying to maximise agreement
00:24:25.100 | between the representations you get
00:24:28.180 | from two image patches from the same image.
00:24:30.380 | And after you've done that,
00:24:32.620 | you just take your representation of the image patch
00:24:36.300 | and you feed it to a linear classifier,
00:24:38.620 | a bunch of weights.
00:24:39.460 | So you multiply the representation by a weight matrix,
00:24:42.140 | put it through a softmax and get class labels.
00:24:45.700 | And then you train that by gradient descent.
00:24:48.780 | And what you discover is that that's just about as good
00:24:53.700 | as training on labelled data.
00:24:56.020 | So now the only thing you've trained on labelled data
00:24:58.060 | is that last linear classifier.
00:25:00.380 | The previous layers were trained on unlabelled data
00:25:03.540 | and you've managed to train your representations
00:25:07.100 | without needing labels.
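
A minimal sketch of that linear-evaluation step (illustrative only): the learned representations H are frozen, and the only weights that ever see labelled data are a single softmax classifier trained by gradient descent.

```python
import numpy as np

def linear_probe_step(H, labels, W, lr=0.1):
    """One gradient step for a linear softmax classifier on frozen features.

    H: (batch, dim) frozen representations, labels: (batch,) int class ids,
    W: (dim, num_classes) the only weights trained on labelled data.
    """
    logits = H @ W
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    probs[np.arange(len(labels)), labels] -= 1.0             # softmax cross-entropy gradient
    return W - lr * H.T @ probs / len(labels)

rng = np.random.default_rng(0)
H, labels = rng.normal(size=(16, 8)), rng.integers(0, 3, size=16)
W = np.zeros((8, 3))
for _ in range(100):
    W = linear_probe_step(H, labels, W)
```
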
00:25:12.020 | Now, there's a problem with this.
00:25:13.860 | It works very nicely,
00:25:17.020 | but it's really confounding objects and whole scenes.
00:25:20.500 | So it makes sense to say two different patches
00:25:23.700 | from the same scene should get the same vector label
00:25:28.700 | at the scene level 'cause they're from the same scene.
00:25:31.780 | But what if one of the patches
00:25:34.260 | contains bits of objects A and B
00:25:35.660 | and another patch contains bits of objects A and C?
00:25:38.180 | You don't really want those two patches
00:25:39.540 | to have the same representation at the object level.
00:25:42.580 | So we have to distinguish
00:25:43.820 | these different levels of representation.
00:25:46.700 | And for contrastive learning,
00:25:49.300 | if you don't use any kind of gating or attention,
00:25:53.180 | then what's happening is you're really doing learning
00:25:55.540 | at the scene level.
00:25:56.660 | What we'd like is that the representations
00:26:02.060 | you get at the object level should be the same
00:26:05.700 | if both patches are patches from object A,
00:26:09.100 | but should be different if one patch is from object A
00:26:11.340 | and one patch is from object B.
00:26:13.020 | And to do that, we're gonna need some form of attention
00:26:14.940 | to decide whether they really come from the same thing.
00:26:17.860 | And so GLOM is designed to do that.
00:26:19.500 | It's designed to take contrastive learning
00:26:21.980 | and to introduce attention
00:26:23.820 | of the kind you get in transformers
00:26:25.940 | in order not to try and say things are the same
00:26:28.380 | when they're not.
00:26:29.220 | I should mention at this point
00:26:32.620 | that most of you will be familiar with BERT,
00:26:35.980 | and you could think of the word fragments
00:26:37.740 | that are fed into BERT
00:26:39.340 | as like the image patches I'm using here.
00:26:42.420 | And in BERT, you have that whole column of representations
00:26:45.140 | of the same word fragment.
00:26:46.700 | In BERT, what's happening presumably as you go up
00:26:50.740 | is you're getting semantically richer representations.
00:26:55.740 | But in BERT, there's no attempt to get representations
00:27:00.420 | of larger things like whole phrases.
00:27:02.500 | This, what I'm gonna talk about
00:27:06.580 | will be a way to modify BERT.
00:27:08.020 | So as you go up,
00:27:09.540 | you get bigger and bigger islands of agreement.
00:27:12.100 | So for example, after a couple of levels,
00:27:15.580 | then things like New and York,
00:27:18.780 | or rather the different fragments of York,
00:27:21.660 | suppose it's got two different fragments,
00:27:23.580 | will have exactly the same representation
00:27:25.380 | if it was done in the GLOM-like way.
00:27:27.580 | And then as you go up another level,
00:27:29.860 | the fragments of New,
00:27:31.220 | well, New's probably a thing in its own right,
00:27:32.700 | and the fragments of York
00:27:35.020 | would all have exactly the same representation,
00:27:37.380 | they'd have this island of agreement.
00:27:40.740 | And that will be a representation of a compound thing.
00:27:44.660 | And as you go up,
00:27:45.500 | you're gonna get these islands of agreement
00:27:46.980 | that represent bigger and bigger things.
00:27:49.020 | And that's gonna be a much more useful kind of BERT
00:27:51.700 | 'cause instead of taking vectors
00:27:54.940 | that represent word fragments
00:27:56.700 | and then sort of munging them together
00:27:58.780 | by taking the max of each component, for example,
00:28:02.500 | which is just a crazy thing to do,
00:28:05.020 | you'd explicitly, as you're learning,
00:28:07.060 | form representations of larger parts
00:28:09.660 | in the part-whole hierarchy.
00:28:11.060 | Okay.
00:28:13.060 | So what we're going after in GLOM
00:28:18.300 | is a particular kind of spatial coherence
00:28:20.900 | that's more complicated than the spatial coherence
00:28:23.300 | caused by the fact that surfaces
00:28:25.260 | tend to be at the same depth and same orientation
00:28:27.780 | in nearby patches of an image.
00:28:30.180 | We're going after the spatial coherence
00:28:32.540 | that says that if you find a mouth in an image
00:28:37.220 | and you find a nose in an image
00:28:38.620 | and then the right spatial relationship to make a face,
00:28:41.460 | then that's a particular kind of coherence.
00:28:44.260 | And we want to go after that unsupervised
00:28:46.940 | and we want to discover that kind of coherence in images.
00:28:50.740 | So before I go into more details about GLOM,
00:28:56.740 | I want a disclaimer.
00:28:58.700 | For years, computer vision treated vision
00:29:02.900 | as you've got a static image, a uniform resolution,
00:29:05.940 | and you want to say what's in it.
00:29:08.140 | That's not how vision works in the real world.
00:29:10.260 | In the real world, this is actually a loop
00:29:12.140 | where you decide where to look
00:29:13.660 | if you're a person or a robot.
00:29:16.140 | You better do that intelligently.
00:29:20.260 | And that gives you a sample of the optic array.
00:29:25.060 | It turns the optic array, the incoming light,
00:29:27.420 | into a retinal image.
00:29:30.580 | And on your retina, you have high resolution in the middle
00:29:32.780 | and low resolution around the edges.
00:29:35.660 | And so you're focusing on particular details
00:29:39.900 | and you never ever process the whole image
00:29:43.100 | at uniform resolution.
00:29:44.580 | You're always focusing on something
00:29:46.020 | and processing where you're fixating at high resolution
00:29:49.140 | and everything else at much lower resolution,
00:29:51.100 | particularly around the edges.
00:29:53.380 | So I'm going to ignore all the complexity
00:29:56.140 | of how you decide where to look
00:29:57.940 | and all the complexity of how you put together
00:30:00.740 | the information you get from different fixations
00:30:03.340 | by saying, let's just talk about the very first fixation
00:30:06.580 | or a novel image.
00:30:07.940 | So you look somewhere
00:30:09.260 | and now what happens on that first fixation?
00:30:11.860 | We know that the same hardware in the brain
00:30:13.580 | is going to be reused for the next fixation,
00:30:16.500 | but let's just think about the first fixation.
00:30:20.740 | So finally, here's a picture of the architecture
00:30:23.220 | and this is the architecture for a single location.
00:30:29.780 | So like for a single word fragment in BERT
00:30:32.500 | and it shows you what's happening for multiple frames.
00:30:38.700 | So Glom is really designed for video,
00:30:40.820 | but I only talk about applying it to static images.
00:30:43.660 | Then you should think of a static image
00:30:45.580 | as a very boring video
00:30:47.500 | in which the frames are all the same as each other.
00:30:50.900 | So I'm showing you three adjacent levels in the hierarchy
00:30:54.740 | and I'm showing you what happens over time.
00:30:59.300 | So if you look at the middle level,
00:31:02.580 | maybe that's the sort of major part level
00:31:04.740 | and look at that box that says level L
00:31:08.180 | and that's at frame four.
00:31:11.740 | So take the right-hand level L box,
00:31:14.780 | and let's ask how the state of that box,
00:31:17.660 | the state of that embedding, is determined.
00:31:20.500 | So inside the box, we're gonna get an embedding
00:31:22.860 | and the embedding is gonna be the representation
00:31:27.700 | of what's going on at the major part level
00:31:31.140 | for that little patch of the image.
00:31:32.940 | And level L, in this diagram,
00:31:37.900 | all of these embeddings will always be devoted
00:31:41.780 | to the same patch of the retinal image.
00:31:44.580 | Okay.
00:31:47.580 | The level L embedding on the right-hand side,
00:31:51.860 | you can see there's three things determining it there.
00:31:55.500 | There's a green arrow, and for static images,
00:31:58.700 | the green arrow is rather boring.
00:31:59.940 | It's just saying you should sort of be similar
00:32:02.020 | to the previous state of level L.
00:32:03.740 | So it's just doing temporal integration.
00:32:05.740 | The blue arrow is actually a neural net
00:32:12.020 | with a couple of hidden layers in it.
00:32:14.540 | I'm just showing you the embeddings here,
00:32:15.900 | not all the layers of the neural net.
00:32:18.260 | We need a couple of hidden layers
00:32:19.420 | to do the coordinate transforms that are required.
00:32:22.380 | And the blue arrow is basically taking information
00:32:26.340 | at the level below of the previous time step.
00:32:30.060 | So level L minus one on frame three
00:32:32.660 | might be representing that I think I might be a nostril.
00:32:36.260 | Well, if you think you might be a nostril,
00:32:38.260 | what you predict at the next level up is a nose.
00:32:42.060 | What's more, if you have a coordinate frame
00:32:44.300 | for the nostril, you can predict the coordinate frame
00:32:46.700 | for the nose.
00:32:47.780 | Maybe not perfectly, but you have a pretty good idea
00:32:50.100 | of the orientation position scale of the nose.
00:32:53.100 | So that bottom up neural net is a net
00:32:58.100 | that can take any kind of part of level L minus one.
00:33:01.420 | You can take a nostril,
00:33:02.660 | but it could also take a steering wheel
00:33:04.660 | and predict the car from the steering wheel
00:33:06.820 | and predict what you've got at the next level up.
00:33:12.980 | The red arrow is a top-down neural net.
00:33:16.860 | So the red arrow is predicting the nose from the whole face.
00:33:22.860 | And again, it has a couple of hidden layers
00:33:27.300 | to do coordinate transforms.
00:33:29.220 | 'Cause if you know the coordinate frame of the face
00:33:32.260 | and you know the relationship between a face and a nose,
00:33:35.340 | and that's gonna be in the weights
00:33:36.500 | of that top-down neural net,
00:33:38.500 | then you can predict that it's a nose
00:33:41.340 | and what the pose of the nose is.
00:33:43.500 | And that's all gonna be in activities
00:33:45.580 | in that embedding vector.
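
As a rough sketch (my simplification, not the GLOM implementation), the blue and red arrows can be thought of as small multi-layer nets: a bottom-up net mapping the level L-1 embedding of a patch to a prediction for level L, and a top-down net mapping the level L+1 embedding back down, each with a couple of hidden layers so it can implement coordinate transforms. The fourth input, attention to nearby columns, is sketched a little further on.

```python
import numpy as np

def mlp(dims, rng):
    """Random weights for a small fully connected net with ReLU hidden layers."""
    return [rng.normal(scale=0.1, size=(d_in, d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def run(net, x):
    for i, W in enumerate(net):
        x = x @ W
        if i < len(net) - 1:
            x = np.maximum(x, 0.0)          # ReLU on the hidden layers only
    return x

rng = np.random.default_rng(0)
D = 16                                       # embedding dimensionality at each level
bottom_up = mlp([D, 32, 32, D], rng)         # e.g. "nostril at this pose" -> "nose at that pose"
top_down  = mlp([D, 32, 32, D], rng)         # e.g. "face at this pose" -> "nose at that pose"

prev_L, below, above = rng.normal(size=(3, D))
# Three of the four inputs to the next level-L embedding for this patch.
next_L = prev_L + run(bottom_up, below) + run(top_down, above)
```
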
00:33:46.820 | Okay, now all of that is what's going on
00:33:51.900 | in one column of hardware.
00:33:54.020 | That's all about a specific patch of the image.
00:33:57.100 | So that's very, very like what's going on
00:33:59.980 | for one word fragment in BERT.
00:34:02.140 | You have all these levels of representation.
00:34:04.340 | It's a bit confusing exactly what the relation of this
00:34:11.020 | is to BERT.
00:34:11.860 | And I'll give you the reference
00:34:13.020 | to a long arXiv paper at the end
00:34:14.420 | that has a whole section on how this relates to BERT.
00:34:17.500 | But it's confusing 'cause this has time steps.
00:34:19.940 | And that makes it all more complicated.
00:34:24.540 | Okay.
00:34:25.380 | So those are three things
00:34:28.060 | that determine the level L embedding.
00:34:30.260 | But there's a fourth thing,
00:34:31.780 | which is in black at the bottom there.
00:34:34.500 | And that's the only way
00:34:36.860 | in which different locations interact.
00:34:39.540 | And that's a very simplified form of a transformer.
00:34:43.500 | If you take a transformer, as in BERT,
00:34:46.220 | and you say, let's make the embeddings
00:34:48.660 | and the keys and the queries and the values
00:34:50.940 | all be the same as each other.
00:34:53.020 | We just have this one vector.
00:34:54.460 | So now all you're trying to do
00:34:58.740 | is make the level L embedding in one column
00:35:02.100 | be the same as the level L embedding in nearby columns.
00:35:05.460 | But it's gonna be gated.
00:35:08.300 | You're only gonna try and make it be the same
00:35:10.700 | if it's already quite similar.
00:35:13.140 | So here's how the attention works.
00:35:17.300 | You take the level L embedding in location X, that's LX.
00:35:22.940 | And you take the level L embedding
00:35:24.220 | in the nearby location Y, that's LY.
00:35:27.340 | You take the scalar product.
00:35:28.740 | You exponentiate and you normalize.
00:35:32.900 | In other words, you do a softmax.
00:35:35.340 | And that gives you the weight to use
00:35:38.260 | in your desire to make LX be the same as LY.
00:35:43.260 | So the input produced by this from neighbors
00:35:50.580 | is an attention-weighted average
00:35:54.580 | of the level L embeddings of nearby columns.
00:35:57.580 | And that's an extra input that you get.
00:35:59.540 | It's trying to make you agree with nearby things.
00:36:02.620 | And that's what's gonna cause you
00:36:03.860 | to get these islands of agreement.
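
A minimal sketch of that simplified attention (illustrative; a real version would restrict the average to nearby columns and combine it with the other three inputs): the embedding itself plays the role of query, key and value, so each column takes a softmax-weighted average of the level-L embeddings of the other columns.

```python
import numpy as np

def neighbour_consensus(level_L):
    """level_L: (num_columns, dim) embeddings at one level, one per image patch."""
    sim = level_L @ level_L.T                              # scalar products L_x . L_y
    weights = np.exp(sim - sim.max(axis=1, keepdims=True)) # exponentiate...
    weights /= weights.sum(axis=1, keepdims=True)          # ...and normalize (a softmax)
    return weights @ level_L                               # attention-weighted average

rng = np.random.default_rng(0)
level_L = rng.normal(size=(6, 16))             # six columns, 16-dimensional embeddings
attention_input = neighbour_consensus(level_L) # the extra input to each column's update
```
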
00:36:06.060 | (mouse clicking)
00:36:08.820 | So back to this picture.
00:36:12.020 | I think, yeah.
00:36:17.820 | This is what we'd like to see.
00:36:21.260 | And the reason we get those,
00:36:25.340 | that big island of agreement at the object level
00:36:28.460 | is 'cause we're trying to get agreement there.
00:36:30.820 | We're trying to learn the coordinate transform
00:36:34.180 | from the red arrows to the level above
00:36:36.460 | and from the green arrows to the level above
00:36:38.540 | such that we get agreement.
00:36:40.140 | Okay.
00:36:42.380 | Now, one thing we need to worry about
00:36:47.980 | is that the difficult thing in perception,
00:36:50.420 | it's not so bad in language,
00:36:54.300 | it's probably worse in visual perception,
00:36:56.340 | is that there's a lot of ambiguity.
00:36:58.700 | If I'm looking at a line drawing, for example,
00:37:00.420 | I see a circle.
00:37:02.300 | Well, a circle could be the right eye of a face
00:37:04.860 | or it could be the left eye of a face
00:37:06.260 | or it could be the front wheel of a car
00:37:07.540 | or the back wheel of a car.
00:37:08.820 | There's all sorts of things that circle could be.
00:37:11.620 | And we'd like to disambiguate the circle.
00:37:13.900 | And there's a long line of work
00:37:16.420 | using things like Markov random fields.
00:37:19.940 | Here we need a variational Markov random field
00:37:22.020 | which I call a transformational random field
00:37:24.540 | because the interaction between, for example,
00:37:27.340 | something that might be an eye
00:37:28.780 | and something that might be a mouth
00:37:31.220 | needs to be gated by coordinate transforms.
00:37:33.660 | For the, let's take a nose and a mouth
00:37:36.260 | 'cause that's my standard thing.
00:37:37.940 | If you take something that might be a nose
00:37:39.780 | and you want to ask,
00:37:40.980 | does anybody out there support the idea I'm a nose?
00:37:43.580 | Well, what you'd like to do is send to everything nearby
00:37:49.100 | a message saying,
00:37:50.340 | do you have the right kind of pose
00:37:55.460 | and the right kind of identity
00:37:56.660 | to support the idea that I'm a nose?
00:37:58.460 | And so you'd like, for example,
00:38:00.860 | to send out a message from the nose.
00:38:04.260 | You'd send out a message to all nearby locations saying,
00:38:07.300 | does anybody have a mouth with the pose that I predict
00:38:10.180 | by taking the pose of the nose,
00:38:12.620 | multiplying by the coordinate transform
00:38:14.420 | between a nose and a mouth?
00:38:16.100 | And now I can predict the pose of the mouth.
00:38:18.020 | Is there anybody out there with that pose
00:38:20.060 | who thinks it might be a mouth?
00:38:22.140 | And I think you can see,
00:38:22.980 | you're gonna have to send out a lot of different messages.
00:38:25.620 | For each kind of other thing that might support you,
00:38:28.540 | you're gonna need to send a different message.
00:38:29.860 | So you're gonna need a multi-headed transformer
00:38:34.860 | and it's gonna be doing these coordinate transforms
00:38:38.060 | and you have to do a coordinate transform,
00:38:40.100 | the inverse transform on the way back.
00:38:42.180 | 'Cause if the mouth supports you,
00:38:43.900 | what it needs to support is a nose,
00:38:45.700 | not with the pose of the mouth,
00:38:47.860 | but with the appropriate pose.
00:38:49.980 | So that's gonna get very complicated.
00:38:51.380 | You're gonna have N-squared interactions
00:38:52.700 | all with coordinate transforms.
00:38:54.620 | There's another way of doing it that's much simpler.
00:38:56.820 | That's called a Hough transform.
00:38:58.980 | At least it's much simpler
00:39:00.580 | if you have a way of representing ambiguity.
00:39:02.940 | So instead of these direct interactions
00:39:08.580 | between parts like a nose and a mouth,
00:39:10.460 | what you're gonna do
00:39:12.500 | is you're gonna make each of the parts predict the whole.
00:39:16.420 | So the nose can predict the face
00:39:19.980 | and it can predict the pose of the face.
00:39:22.020 | And the mouth can also predict the face.
00:39:24.580 | Now these will be in different columns of GLOM,
00:39:26.940 | but in one column of GLOM,
00:39:28.020 | you'll have a nose predicting face.
00:39:30.060 | In a nearby column,
00:39:31.060 | you'll have a mouth predicting face.
00:39:33.300 | And those two faces should be the same
00:39:35.900 | if this really is a face.
00:39:37.300 | So when you do this attention weighted averaging
00:39:41.100 | with nearby things,
00:39:42.460 | what you're doing is you're getting confirmation
00:39:45.260 | that the support for the hypothesis you've got,
00:39:49.900 | I mean, suppose in one column you make the hypothesis
00:39:51.780 | it's a face with this pose.
00:39:53.860 | That gets supported by nearby columns
00:39:56.300 | that derive the very same embedding
00:39:59.100 | from quite different data.
00:40:00.820 | One derived it from the nose
00:40:02.020 | and one derived it from the mouth.
00:40:03.700 | And this doesn't require any dynamic routing
00:40:07.460 | because the embeddings are always referring
00:40:11.140 | to what's going on in the same small patch of the image.
00:40:14.380 | Within a column, there's no routing.
00:40:16.580 | And between columns,
00:40:18.900 | there's something a bit like routing,
00:40:20.100 | but it's just the standard transformer kind of attention.
00:40:22.940 | You're just trying to agree with things that are similar.
00:40:26.460 | And, okay, so that's how GLOM is meant to work.
00:40:31.180 | And the big problem is that if I see a circle,
00:40:36.820 | it might be a left eye, it might be a right eye,
00:40:38.500 | it might be a front wheel of a car,
00:40:40.700 | it might be the back wheel of a car.
00:40:42.420 | Because my embedding for a particular patch
00:40:44.500 | at a particular level has to be able to represent anything,
00:40:47.660 | when I get an ambiguous thing,
00:40:49.540 | I have to deal with all these possibilities
00:40:51.420 | of what whole it might be part of.
00:40:54.220 | So instead of trying to resolve ambiguity at the part level,
00:40:57.700 | what I can do is jump to the next level up
00:40:59.460 | and resolve the ambiguity there,
00:41:01.100 | just by seeing if things are the same,
00:41:03.180 | which is an easier way to resolve ambiguity.
00:41:05.660 | But the cost of that is I have to be able to represent
00:41:09.380 | all the ambiguity I get at the next level up.
00:41:11.660 | Now, it turns out you can do that.
00:41:14.660 | We've done a little toy example
00:41:17.180 | where you can actually preserve this ambiguity,
00:41:20.500 | but it's difficult.
00:41:23.340 | It's the kind of thing neural nets are good at.
00:41:25.700 | So if you think about the embedding at the next level up,
00:41:30.260 | you've got a whole bunch of neurons
00:41:33.180 | whose activities are that embedding,
00:41:35.460 | and you wanna represent a highly multimodal distribution,
00:41:40.140 | like it might be a car with this pose,
00:41:41.700 | or a car with that pose, or a face with this pose,
00:41:43.620 | or a face with that pose.
00:41:45.420 | All of these are possible predictions for finding a circle.
00:41:48.380 | And so you have to represent all that.
00:41:52.620 | And the question is, can neural nets do that?
00:41:55.060 | And I think the way they must be doing it is
00:41:57.940 | each neuron in the embedding
00:42:00.060 | stands for an unnormalized log probability distribution
00:42:05.700 | over this huge space of possible identities
00:42:07.980 | and possible poses,
00:42:09.540 | the sort of cross product of identities and poses.
00:42:11.980 | And so the neuron is this rather vague
00:42:16.940 | log probability distribution over that space.
00:42:19.180 | And when you activate the neuron,
00:42:22.020 | what it's saying is,
00:42:23.500 | add in that log probability distribution
00:42:25.780 | to what you've already got.
00:42:27.740 | And so now if you have a whole bunch
00:42:28.940 | of log probability distributions,
00:42:31.300 | and you add them all together,
00:42:32.820 | you can get a much more peaky log probability distribution.
00:42:37.780 | And when you exponentiate to get a probability distribution,
00:42:41.260 | it gets very peaky.
00:42:42.980 | And so very vague basis functions
00:42:46.380 | in this joint space of pose and identity,
00:42:50.260 | that is, basis functions in the log probability over that space,
00:42:53.660 | can be combined to produce sharp conclusions.
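
A tiny numerical illustration of that point (mine, not from the talk): each active neuron contributes a broad bump in unnormalized log-probability space over a one-dimensional stand-in for the pose/identity space, and summing a few such bumps and then exponentiating gives a sharply peaked distribution.

```python
import numpy as np

poses = np.linspace(-3.0, 3.0, 301)          # 1-D stand-in for the pose/identity space

def vague_log_prob(center, width=2.0):
    """A broad bump in unnormalized log-probability space, centred near some pose."""
    return -((poses - center) ** 2) / (2.0 * width ** 2)

# Each active neuron contributes one vague basis function; their centres only roughly agree.
total_log_prob = (vague_log_prob(0.1) + vague_log_prob(-0.2)
                  + vague_log_prob(0.3) + vague_log_prob(0.0))

prob = np.exp(total_log_prob - total_log_prob.max())
prob /= prob.sum()
print(poses[prob.argmax()])                  # the combined distribution is sharply peaked near 0
```
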
00:42:56.180 | So I think that's how neurons are representing things.
00:43:02.340 | Most people think about neurons as,
00:43:04.940 | they think about the thing that they're representing.
00:43:07.620 | But obviously in perception,
00:43:10.540 | you have to deal with uncertainty.
00:43:12.980 | And so neurons have to be good at representing
00:43:15.300 | multimodal distributions.
00:43:18.780 | And this is the only way I can think of
00:43:20.420 | that's good at doing it.
00:43:21.660 | That's a rather weak argument.
00:43:25.220 | I mean, it's the argument that led Chomsky
00:43:26.740 | to believe that language wasn't learned
00:43:28.820 | because he couldn't think of how it was learned.
00:43:30.820 | My view is neurons must be using this representation
00:43:34.500 | 'cause I can't think of any other way of doing it.
00:43:36.540 | Okay.
00:43:41.540 | I just said all that 'cause I got ahead of myself
00:43:43.340 | 'cause I got excited.
00:43:44.500 | Okay.
00:43:45.340 | Now, the reason you can get away with this,
00:43:50.540 | the reason you have these very vague distributions
00:43:52.860 | in the unnormalized log probability space
00:43:55.380 | is because these neurons are all dedicated
00:43:58.980 | to a small patch of image,
00:44:00.860 | and they're all trying to represent the thing
00:44:03.180 | that's happening in that patch of image.
00:44:05.460 | So you're only trying to represent one thing.
00:44:07.700 | You're not trying to represent
00:44:09.100 | some set of possible objects.
00:44:11.380 | If you're trying to represent
00:44:12.220 | some set of possible objects,
00:44:13.260 | you'd have a horrible binding problem,
00:44:15.060 | and you couldn't use these very vague distributions.
00:44:17.340 | But so long as you know that all of these neurons,
00:44:20.940 | all of the active neurons refer to the same thing,
00:44:24.700 | then you can do the intersection.
00:44:26.780 | You can add the log probability distribution together
00:44:28.980 | and intersect the sets of things they represent.
00:44:31.500 | Okay.
00:44:35.260 | I'm getting near the end.
00:44:36.700 | How would you train a system like this?
00:44:39.100 | Well, obviously you could train it the way you train BERT.
00:44:41.620 | You could do deep end-to-end training.
00:44:43.900 | And for GLOM, what that would consist of,
00:44:48.660 | and the way we trained a toy example,
00:44:50.740 | is you take an image,
00:44:55.780 | you leave out some patches of the image,
00:44:58.500 | you then let GLOM settle down for about 10 iterations,
00:45:04.020 | and it's trying to fill in
00:45:07.900 | the lowest level representation of what's in the image,
00:45:11.540 | the lowest level embedding,
00:45:14.540 | and it fills them in wrong.
00:45:17.260 | And so you now back propagate that error,
00:45:19.580 | and you're back propagating it through time in this network.
00:45:22.700 | So it'll also back propagate up and down through the levels.
00:45:25.700 | So you're basically just doing back propagation through time
00:45:30.180 | of the error due to filling in things incorrectly.
00:45:34.220 | That's basically how BERT is trained.
00:45:36.980 | And you could train GLOM the same way.
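
A schematic sketch of that training loop, with heavy assumptions: `glom_step` here is a hypothetical stand-in for one settling iteration (which in the real model would combine the bottom-up, top-down and attention inputs sketched above), and the front-end patch embedder is left untrained. The point is just the shape of the procedure: mask some patches, settle for about ten iterations, and backpropagate the fill-in error through time.

```python
import torch

# Hypothetical stand-ins, not the real GLOM modules: a linear front end that
# embeds 8-dimensional patches, and a single shared "settling" step.
embed_patches = torch.nn.Linear(8, 16)
glom_step = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Tanh())
optimizer = torch.optim.SGD(glom_step.parameters(), lr=0.01)

def train_step(patches, mask_frac=0.25, iters=10):
    targets = embed_patches(patches).detach()           # lowest-level embeddings to fill in
    mask = torch.rand(len(patches)) < mask_frac
    if not mask.any():
        mask[0] = True                                  # make sure something is left out
    state = targets.clone()
    state[mask] = 0.0                                   # leave out some patches
    for _ in range(iters):                              # let the network settle
        state = glom_step(state)
    loss = ((state[mask] - targets[mask]) ** 2).mean()  # error on the filled-in embeddings
    optimizer.zero_grad()
    loss.backward()                                     # backprop through the settling iterations
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(32, 8)))
```
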
00:45:39.580 | But I also want to include an extra bit in the training
00:45:43.180 | to encourage islands.
00:45:44.540 | We want to encourage big islands
00:45:51.740 | of identical vectors at high levels.
00:45:54.580 | And you can do that by using contrastive learning.
00:45:57.100 | So if you think about how an embedding
00:46:03.220 | at the next time step
00:46:04.700 | is determined,
00:46:07.660 | it's determined by combining
00:46:10.620 | a whole bunch of different factors:
00:46:12.860 | what was going on at the previous time step
00:46:15.380 | at this level of representation in this location,
00:46:18.460 | what was going on at the previous time step
00:46:21.220 | in this location
00:46:22.580 | but at the next level down
00:46:24.380 | and at the next level up,
00:46:26.100 | and also what was going on at the previous time step
00:46:29.060 | at nearby locations at the same level.
00:46:33.380 | And the weighted average of all those things
00:46:36.220 | I'll call the consensus embedding.
00:46:37.620 | And that's what you use for the next embedding.
00:46:40.580 | And I think you can see that
00:46:42.540 | if we try and make the bottom up neural net
00:46:45.100 | and the top down neural net,
00:46:46.900 | if we try and make the predictions agree with the consensus,
00:46:50.900 | the consensus has folded in information
00:46:54.180 | from nearby locations that already roughly agree
00:46:58.820 | because of the attention weighting.
00:47:01.260 | And so by trying to make the top down
00:47:02.700 | and bottom up neural networks agree with the consensus,
00:47:06.060 | you're trying to make them agree
00:47:07.380 | with what's going on in nearby locations that are similar.
00:47:10.460 | And so you'll be training it to form islands.
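A rough sketch of what the consensus embedding and the island-encouraging objective could look like, assuming the embeddings at one level form a `(locations, d)` tensor and that `bottom_up` and `top_down` are hypothetical networks; the attention temperature, the equal weighting of the four contributions, and the use of a simple agreement (regression) loss instead of a full contrastive loss with negatives are all my assumptions.

```python
import torch
import torch.nn.functional as F

def consensus_embedding(prev_same, pred_from_below, pred_from_above, temperature=1.0):
    """Combine: the previous state at this level and location, the bottom-up prediction,
    the top-down prediction, and attention-weighted nearby locations at the same level."""
    # Attention over locations at the same level: weights depend on how similar the
    # embeddings already are, so locations that already roughly agree contribute most.
    sim = prev_same @ prev_same.t() / temperature    # (locations, locations)
    lateral = F.softmax(sim, dim=-1) @ prev_same     # attention-weighted neighbors
    # Equal weighting of the four contributions (the real weighting is a design choice).
    return (prev_same + pred_from_below + pred_from_above + lateral) / 4.0

def island_loss(bottom_up, top_down, state_below, state_above, prev_same):
    """Train the bottom-up and top-down nets to agree with the consensus; since the consensus
    folds in nearby locations that already roughly agree, this encourages islands."""
    pred_up = bottom_up(state_below)
    pred_down = top_down(state_above)
    consensus = consensus_embedding(prev_same, pred_up, pred_down).detach()
    return F.mse_loss(pred_up, consensus) + F.mse_loss(pred_down, consensus)
```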
00:47:12.860 | This is more interesting to neuroscientists
00:47:19.820 | than to people who do natural language,
00:47:21.700 | so I'm gonna ignore that.
00:47:22.940 | You might think it's wasteful to be replicating
00:47:29.740 | all these embeddings at the object level.
00:47:31.780 | So the idea is at the object level,
00:47:33.580 | there'll be a large number of patches
00:47:35.020 | that all have exactly the same vector representation.
00:47:38.260 | And that seems like a waste,
00:47:39.860 | but actually biology is full of things like that.
00:47:41.940 | All your cells have exactly the same DNA
00:47:44.620 | and all the parts of an organ
00:47:47.060 | have pretty much the same vector of protein expressions.
00:47:49.980 | So there's lots of replication going on
00:47:51.980 | to keep things local.
00:47:55.380 | And it's the same here.
00:47:56.220 | And actually that replication is very useful
00:47:59.700 | when you're settling on an interpretation,
00:48:02.020 | because before you settle down,
00:48:04.140 | you don't know which things should be the same
00:48:05.700 | as which other things.
00:48:06.860 | So having separate vectors in each location
00:48:08.700 | to represent what's going on there at the object level
00:48:11.340 | gives you the flexibility to gradually segment things
00:48:13.860 | as you settle down in a sensible way.
00:48:16.060 | It allows you to hedge your bets.
00:48:19.740 | And what you're doing is not quite like clustering.
00:48:22.340 | You're creating clusters of identical vectors
00:48:25.580 | rather than discovering clusters in fixed data.
00:48:28.260 | So in clustering, you're given the data, it's fixed,
00:48:30.780 | and you find the clusters.
00:48:31.820 | Here, the embeddings at every level vary over time.
00:48:36.820 | They're determined by the top-down and bottom-up inputs
00:48:40.460 | and by inputs coming from nearby locations.
00:48:42.580 | So what you're doing is forming clusters
00:48:45.020 | rather than discovering them in fixed data.
00:48:47.500 | And that's got a somewhat different flavor
00:48:49.620 | and can settle down faster.
00:48:55.740 | And one other advantage of this replication:
00:48:58.380 | what you don't want is to have much more work
00:49:02.580 | in your transformer as you go to higher levels.
00:49:06.220 | But you do need longer range interactions at higher levels.
00:49:09.220 | Presumably for the lowest levels,
00:49:10.380 | you want fairly short range interactions
00:49:12.140 | in your transformer, and they could be dense.
00:49:14.860 | As you go to higher levels,
00:49:15.820 | you want much longer range interactions.
00:49:17.900 | So you could make them sparse,
00:49:20.380 | and people have done things like that for BERT-like systems.
00:49:24.940 | Here, it's easy to make them sparse
00:49:28.340 | 'cause you're expecting big islands.
00:49:30.460 | So all you need to do is see one patch of a big island
00:49:34.980 | to know what the vector representation of that island is.
00:49:37.980 | And so sparse representations will work much better
00:49:40.660 | if you have these big islands of agreement as you go up.
00:49:43.060 | So the idea is you have longer range
00:49:44.580 | and sparser connections as you go up.
00:49:46.260 | So the amount of computation is the same in every level.
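A small sketch of that level-dependent sparsity, assuming a 1-D row of locations where each level doubles the spacing between attended neighbors; the stride-doubling scheme and the fixed neighbor count are illustrative assumptions, chosen so that longer-range but sparser attention keeps the work per location constant across levels.

```python
def attention_neighbors(location, n_locations, level, n_neighbors=8):
    """Indices that a location attends to at a given level: dense and short-range at
    level 0, longer-range but sparser higher up, so the work per location is constant."""
    stride = 2 ** level
    offsets = range(-(n_neighbors // 2), n_neighbors // 2 + 1)
    neighbors = [location + o * stride for o in offsets]
    return [i for i in neighbors if 0 <= i < n_locations and i != location]

# Example: what location 32 attends to in a row of 64 patches, level by level.
for level in range(4):
    print(level, attention_neighbors(32, 64, level))
```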
00:49:50.620 | And just to summarize,
00:49:51.780 | I showed how GLOM combines
00:49:56.820 | three important advances in neural networks.
00:49:58.380 | I didn't actually talk about neural fields,
00:50:01.260 | and that's important for the top-down network.
00:50:04.260 | Maybe since I've got two minutes to spare,
00:50:05.820 | I'm gonna go back and mention neural fields very briefly.
00:50:08.660 | Yeah, when I train that top-down neural network,
00:50:17.060 | I have a problem.
00:50:18.740 | And the problem is,
00:50:20.180 | if you look at those red arrows and those green arrows,
00:50:24.940 | they're quite different.
00:50:28.140 | But if you look at the level above, the object level,
00:50:31.180 | all those vectors are the same.
00:50:32.740 | And of course, in an engineered system,
00:50:36.980 | I want to replicate the neural nets in every location.
00:50:39.580 | So I use exactly the same
00:50:40.660 | top-down and bottom-up neural nets everywhere.
00:50:43.820 | And so the question is,
00:50:44.660 | how can the same neural net be given a black arrow
00:50:48.500 | and sometimes produce a red arrow
00:50:51.020 | and sometimes produce a green arrow,
00:50:52.300 | which have quite different orientations?
00:50:54.580 | How can it produce a nose where there's nose
00:50:57.620 | and a mouth where there's mouth,
00:50:59.220 | even though the face vector is the same everywhere?
00:51:02.340 | And the answer is,
00:51:03.940 | the top-down neural network doesn't just get
00:51:07.420 | the face vector,
00:51:09.340 | it also gets the location of the patch
00:51:12.380 | for which it's producing the part vector.
00:51:15.020 | So the three patches that should get the red vector
00:51:18.380 | are in different locations
00:51:20.980 | from the three patches that should get the green vector.
00:51:23.740 | So if I use a neural network
00:51:25.180 | that gets the location as input as well,
00:51:27.420 | here's what it can do.
00:51:28.940 | It can take the pose that's encoded in that black vector,
00:51:32.020 | the pose of the face.
00:51:33.180 | It can take the location in the image
00:51:38.060 | for which it's predicting the vector of the level below.
00:51:41.860 | And the pose is relative to the image too.
00:51:44.220 | So knowing the location in the image
00:51:45.620 | and knowing the pose of the whole face,
00:51:47.260 | it can figure out which bit of the face
00:51:49.020 | it needs to predict at that location.
00:51:52.980 | And so in one location, it can predict,
00:51:54.660 | okay, there should be nose there
00:51:56.460 | and it gives you the red vector.
00:51:57.900 | In another location, it can predict
00:51:59.900 | from where that image patch is,
00:52:01.220 | there should be mouth there,
00:52:02.620 | so it can give you the green arrow.
00:52:04.700 | So you can get the same vector at the level above
00:52:06.940 | to predict different vectors
00:52:08.260 | in different places at the level below
00:52:10.100 | by giving it the place that it's predicting for.
00:52:12.340 | And that's what's going on in neural fields.
00:52:14.540 | Okay.
00:52:19.940 | Now, this was quite a complicated talk.
00:52:23.420 | There's a long paper about it on arXiv
00:52:25.700 | that goes into much more detail.
00:52:27.300 | And you could view this talk as just an encouragement
00:52:30.860 | to read that paper.
00:52:32.620 | And I'm done, exactly on time.
00:52:39.140 | - Okay.
00:52:39.980 | - Thank you.
00:52:40.820 | - Thanks a lot.
00:52:41.660 | - Yeah.
00:52:42.500 | (upbeat music)
00:52:45.100 | [BLANK_AUDIO]