
Stanford CS25: V2 | Represent part-whole hierarchies in a neural network, Geoff Hinton


Chapters

0:00
1:55 Why it is hard to make real neural networks learn part-whole hierarchies • Each image has a different parse tree. • Real neural networks cannot dynamically allocate neurons to represent nodes in a parse tree. - What a neuron does is determined by the weights on its connections and the weights
20:45 A brief introduction to contrastive learning of visual representations • Contrastive self-supervised learning uses the similarity between activity vectors produced from different patches of the same image as the objective
43:39 How to implement multimodal predictions in the joint space of identity and pose • Each neuron in the embedding vector for the object is a basis function that represents a vague distribution in the log probability space

Whisper Transcript | Transcript Only Page

00:00:00.000 | (silence)
00:00:02.160 | Before we start, I gave the same talk at Stanford
00:00:09.880 | quite recently.
00:00:10.900 | I suggested to the people inviting me,
00:00:13.440 | I could just give one talk and both audiences come,
00:00:15.680 | but they would prefer it as two separate talks.
00:00:18.320 | So if you went to this talk recently,
00:00:20.620 | I suggest you leave now.
00:00:22.400 | You won't learn anything new.
00:00:24.280 | Okay.
00:00:28.440 | What I'm gonna do is combine some recent ideas
00:00:30.920 | in neural networks to try to explain
00:00:33.960 | how a neural network could represent part-whole hierarchies
00:00:36.880 | without violating any of the basic principles
00:00:41.040 | of how neurons work.
00:00:42.220 | And I'm gonna explain these ideas
00:00:47.640 | in terms of an imaginary system.
00:00:50.340 | I started writing a design document for a system
00:00:52.440 | and in the end, I decided the design document
00:00:54.400 | by itself was quite interesting.
00:00:56.280 | So this is just vaporware, it's stuff that doesn't exist.
00:00:58.840 | Little bits of it now exist,
00:00:59.960 | but somehow I find it easy to explain the ideas
00:01:04.060 | in the context of an imaginary system.
00:01:05.960 | So most people now studying neural networks
00:01:14.400 | are doing engineering and they don't really care
00:01:16.800 | if it's exactly how the brain works.
00:01:18.800 | They're not trying to understand how the brain works,
00:01:20.280 | they're trying to make cool technology.
00:01:22.760 | And so a hundred layers is fine in a ResNet,
00:01:25.380 | weight sharing is fine in a convolutional neural net.
00:01:27.980 | Some researchers, particularly computational neuroscientists
00:01:32.760 | investigate neural networks, artificial neural networks
00:01:35.680 | in an attempt to understand
00:01:36.680 | how the brain might actually work.
00:01:38.800 | I think we still got a lot to learn from the brain
00:01:41.640 | and I think it's worth remembering
00:01:43.720 | that for about half a century,
00:01:45.200 | the only thing that kept research on neural networks going
00:01:48.200 | was the belief that it must be possible
00:01:49.880 | to make these things learn complicated things
00:01:51.640 | 'cause the brain does.
00:01:53.500 | So every image has a different parse tree,
00:01:58.500 | that is, the structure of the wholes
00:02:02.680 | and the parts in the image.
00:02:03.980 | And in a real neural network,
00:02:07.760 | you can't dynamically allocate,
00:02:09.560 | you can't just grab a bunch of neurons and say,
00:02:11.320 | "Okay, you now represent this,"
00:02:15.000 | because you don't have random access memory.
00:02:16.800 | You can't just set the weights of the neurons
00:02:19.280 | to be whatever you like.
00:02:20.520 | What a neuron does is determined by its connections
00:02:22.900 | and they only change slowly, at least probably,
00:02:26.320 | mostly they change slowly.
00:02:27.620 | So the question is,
00:02:30.500 | if you can't change what neurons do quickly,
00:02:33.260 | how can you represent a dynamic parse tree?
00:02:35.660 | In symbolic AI, it's not a problem.
00:02:42.080 | You just grab a piece of memory,
00:02:44.060 | that's what it normally amounts to,
00:02:45.700 | and say, "This is gonna represent a node in the parse tree
00:02:48.740 | and I'm gonna give it pointers to other nodes,
00:02:50.860 | other bits of memory that represent other nodes."
00:02:53.020 | So there's no problem.
00:02:54.660 | For about five years,
00:02:55.900 | I played with a theory called capsules,
00:02:58.620 | where you say,
00:03:00.620 | "Because you can't allocate neurons on the fly,
00:03:03.340 | you're gonna allocate them in advance."
00:03:05.240 | So we're gonna take groups of neurons
00:03:06.700 | and we're gonna allocate them
00:03:07.740 | to different possible nodes in a parse tree.
00:03:10.600 | And most of these groups of neurons for most images
00:03:13.700 | are gonna be silent, a few are gonna be active.
00:03:17.300 | And then the ones that are active,
00:03:18.500 | we have to dynamically hook them up into a parse tree.
00:03:21.340 | So we have to have a way of routing
00:03:22.940 | between these groups of neurons.
00:03:25.240 | So that was the capsules theory.
00:03:28.260 | And I had some very competent people working with me
00:03:31.780 | who actually made it work,
00:03:33.340 | but it was tough going.
00:03:36.960 | My view is that some ideas want to work
00:03:38.640 | and some ideas don't want to work.
00:03:40.140 | And capsules was sort of in between.
00:03:42.000 | Things like backpropagation just wanna work,
00:03:43.740 | you try them and they work.
00:03:46.020 | There's other ideas I've had that just don't wanna work.
00:03:49.020 | Capsules was sort of in between and we got it working.
00:03:52.060 | But I now have a new theory
00:03:53.980 | that could be seen as a funny kind of capsules model
00:03:57.740 | in which each capsule is universal.
00:03:59.860 | That is instead of a capsule being dedicated
00:04:02.540 | to a particular kind of thing,
00:04:04.640 | each capsule can represent any kind of thing.
00:04:07.100 | But hardware still comes in capsules,
00:04:11.500 | which are also sometimes called embeddings.
00:04:14.640 | So the imaginary system I'll talk about is called GLOM.
00:04:19.640 | And in GLOM, hardware gets allocated to columns
00:04:25.240 | and each column contains multiple levels of representation
00:04:29.920 | of what's happening in a small patch of the image.
00:04:32.400 | So within a column,
00:04:35.240 | you might have a lower level representation
00:04:38.120 | that says it's a nostril
00:04:39.480 | and the next level up might say it's a nose
00:04:42.280 | and the next level up might say a face,
00:04:43.680 | the next level up a person
00:04:45.120 | and the top level might say it's a party.
00:04:47.120 | That's what the whole scene is.
00:04:48.680 | And the idea for representing part-whole hierarchies
00:04:52.760 | is to use islands of agreement
00:04:54.860 | between the embeddings at these different levels.
00:04:57.320 | So at the scene level, at the top level,
00:05:00.760 | you'd like the same embedding for every patch of the image
00:05:03.760 | 'cause that patch is a patch of the same scene everywhere.
00:05:07.200 | At the object level,
00:05:08.400 | you'd like the embeddings of all the different patches
00:05:11.400 | that belong to the object to be the same.
00:05:14.600 | So as you go up this hierarchy,
00:05:15.960 | you're trying to make things more and more the same
00:05:18.360 | and that's how you're squeezing redundancy out.
00:05:21.140 | The embedding vectors are the things that act like pointers
00:05:25.400 | and the embedding vectors are dynamic.
00:05:28.600 | They're neural activations rather than neural weights.
00:05:31.280 | So it's fine to have different embedding vectors
00:05:33.880 | for every image.
00:05:34.820 | So here's a little picture.
00:05:40.800 | If you had a one-dimensional row of patches,
00:05:44.520 | these are the columns for the patches
00:05:47.160 | and you'd have something like a convolutional neural net
00:05:51.720 | as a front end.
00:05:52.660 | And then after the front end,
00:05:56.000 | you produce your lowest level embeddings
00:05:58.240 | that say what's going on in each particular patch.
00:06:01.080 | And so that bottom layer of black arrows,
00:06:03.280 | they're all different.
00:06:04.600 | Of course, these embeddings are thousands of dimensions,
00:06:07.840 | maybe hundreds of thousands in your brain.
00:06:10.180 | And so a two-dimensional vector isn't right,
00:06:14.160 | but at least I can represent
00:06:15.480 | where the two vectors are the same by using the orientation.
00:06:19.160 | So the lowest level,
00:06:20.680 | all the patches will have different representations.
00:06:23.960 | But the next level up, the first two patches,
00:06:27.360 | they might be part of a nostril, for example.
00:06:31.160 | And so, yeah, they'll have the same embedding.
00:06:37.680 | But the next level up,
00:06:39.460 | the first three patches might be part of a nose.
00:06:42.900 | And so they'll all have the same embedding.
00:06:45.100 | Notice that even though what's in the image
00:06:47.260 | is quite different, at the part level,
00:06:51.660 | those three red vectors are all meant to be the same.
00:06:55.180 | So what we're doing is we're getting the same representation
00:06:58.100 | for things that are superficially very different.
00:07:00.860 | We're finding spatial coherence in an image
00:07:03.740 | by giving the same representation to different things.
00:07:06.860 | And at the object level,
00:07:09.200 | you might have a nose and then a mouth,
00:07:12.100 | and they're the same face.
00:07:13.540 | They're part of the same face.
00:07:15.220 | And so all those vectors are the same.
00:07:17.060 | And this network hasn't yet settled down
00:07:18.900 | to produce something at the scene level.
00:07:20.600 | So the islands of agreement are what capture the parse tree.
00:07:27.240 | Now, they're a bit more powerful than a parse tree.
00:07:29.740 | They can capture things like "shut the heck up."
00:07:33.980 | You can have "shut" and "up" be different vectors
00:07:36.860 | at one level, but at a higher level,
00:07:38.660 | "shut" and "up" can have exactly the same vector,
00:07:41.820 | namely the vector for "shut up," and they can be disconnected.
00:07:44.900 | So you can do things a bit more powerful
00:07:46.300 | than a context-free grammar here,
00:07:48.180 | but basically it's a parse tree.
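
To make the islands concrete, here is a minimal sketch, not from the talk, of how islands of agreement could be read off one level for a one-dimensional row of patches: adjacent patches whose embedding vectors are nearly identical are grouped together. The `islands` function and the toy vectors are purely illustrative, not part of GLOM.

```python
import numpy as np

def islands(embeddings, threshold=0.99):
    """Group a 1-D row of patch embeddings into islands of near-identical vectors.

    embeddings: array of shape (num_patches, dim)
    Returns a list of index lists, one per island.
    """
    # Normalize so patches can be compared by cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    groups = [[0]]
    for i in range(1, len(unit)):
        if unit[i] @ unit[i - 1] > threshold:   # same island as the previous patch
            groups[-1].append(i)
        else:                                    # start a new island
            groups.append([i])
    return groups

# Toy example: five patches, the middle three agree at this level.
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.7, 0.7]])
print(islands(vecs))   # [[0], [1, 2, 3], [4]]
```
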
00:07:51.520 | If you're a physicist,
00:07:53.140 | you can think of each of these levels as an Ising model
00:07:58.620 | with real-valued vectors rather than binary spins.
00:08:02.980 | And you can think of there being coordinate transforms
00:08:05.100 | between levels, which makes it much more complicated.
00:08:08.380 | And then this is a kind of multi-level Ising model,
00:08:12.460 | but with complicated interactions between the levels,
00:08:15.700 | because for example, between the red arrows
00:08:17.660 | and the black arrows above them,
00:08:19.460 | you need the coordinate transform
00:08:21.020 | between a nose and a face, but we'll come to that later.
00:08:24.580 | If you're not a physicist,
00:08:27.740 | ignore all that 'cause it won't help.
00:08:29.580 | (keyboard clicking)
00:08:33.340 | So I want to start, and this is, I guess,
00:08:35.780 | is particularly relevant for a natural language course
00:08:38.260 | where some of you are not vision people,
00:08:40.740 | by trying to prove to you that coordinate systems
00:08:45.020 | are not just something invented by Descartes.
00:08:47.220 | Coordinate systems were invented by the brain
00:08:52.060 | a long time ago, and we use coordinate systems
00:08:54.940 | in understanding what's going on in an image.
00:08:58.080 | I also want to demonstrate the psychological reality
00:09:00.500 | of parse trees for an image.
00:09:02.820 | So I'm gonna do this with a task
00:09:05.680 | that I invented a long time ago in the 1970s,
00:09:09.500 | when I was a grad student, in fact.
00:09:11.260 | And you have to do this task
00:09:14.460 | to get the full benefit from it.
00:09:16.180 | So I want you to imagine on the tabletop in front of you,
00:09:23.020 | there's a wireframe cube,
00:09:25.060 | and it's in the standard orientation for a cube,
00:09:27.680 | resting on the tabletop.
00:09:29.660 | And from your point of view,
00:09:31.920 | there's a front bottom right-hand corner,
00:09:34.940 | and a top back left-hand corner.
00:09:37.660 | Here we go.
00:09:38.880 | Okay.
00:09:39.720 | The front bottom right-hand corner
00:09:42.500 | is resting on the tabletop,
00:09:44.100 | along with the three other corners of the bottom face.
00:09:46.340 | And the top back left-hand corner
00:09:48.660 | is at the other end of a diagonal
00:09:50.380 | that goes through the center of the cube.
00:09:53.060 | Okay, so far so good.
00:09:55.040 | Now what we're gonna do is rotate the cube
00:09:56.900 | so that this finger stays on the tabletop,
00:09:59.660 | and the other finger is vertically above it, like that.
00:10:02.420 | This finger shouldn't have moved.
00:10:05.620 | Okay.
00:10:06.740 | So now we've got the cube in an orientation
00:10:08.700 | where that thing that was a body diagonal is now vertical.
00:10:12.140 | And all you've gotta do is take the bottom finger,
00:10:15.340 | 'cause that's still on the tabletop,
00:10:17.180 | and point with the bottom finger
00:10:18.500 | to where the other corners of the cube are.
00:10:21.020 | So I want you to actually do it.
00:10:22.020 | Off you go.
00:10:22.860 | Take your bottom finger,
00:10:24.300 | hold your top finger at the other end of that diagonal
00:10:27.540 | that's now been made vertical,
00:10:28.780 | and just point to where the other corners are.
00:10:31.040 | And luckily, it's Zoom,
00:10:35.460 | so most of you, other people,
00:10:38.140 | won't be able to see what you did.
00:10:39.460 | And I can see that some of you aren't pointing,
00:10:41.100 | and that's very bad.
00:10:42.100 | So most people point out four other corners,
00:10:47.420 | and the most common response
00:10:48.820 | is to say they're here, here, here, and here.
00:10:51.320 | They point out four corners in a square
00:10:52.980 | halfway up that axis.
00:10:54.460 | That's wrong, as you might imagine.
00:11:00.060 | And it's easy to see that it's wrong,
00:11:01.820 | 'cause if you imagine the cube in the normal orientation
00:11:04.820 | and count the corners, there's eight of them.
00:11:07.220 | And these were two corners.
00:11:10.140 | So where did the other two corners go?
00:11:12.040 | So one theory is that when you rotated the cube,
00:11:15.700 | the centrifugal forces
00:11:16.900 | made them fly off into your unconscious.
00:11:19.500 | That's not a very good theory.
00:11:21.220 | So what's happening here
00:11:24.060 | is you have no idea where the other corners are,
00:11:26.660 | unless you're something like a crystallographer.
00:11:29.460 | You can sort of imagine bits of the cube,
00:11:31.180 | but you just can't imagine this structure
00:11:32.900 | of the other corners, what structure they form.
00:11:35.200 | And this common response that people give,
00:11:38.700 | of four corners in a square,
00:11:41.560 | is doing something very weird.
00:11:43.860 | It's trying to, it's saying,
00:11:45.340 | well, okay, I don't know where the bits of a cube are,
00:11:48.900 | but I know something about cubes.
00:11:50.300 | I know the corners come in fours.
00:11:52.580 | I know a cube has this four-fold rotational symmetry,
00:11:55.940 | or two planes of bilateral symmetry,
00:11:58.180 | at right angles to one another.
00:12:00.140 | And so what people do is they preserve the symmetries
00:12:02.980 | of the cube in their response.
00:12:05.060 | They give four corners in a square.
00:12:07.040 | Now, what they've actually pointed out if they do that
00:12:11.620 | is two pyramids, each of which has a square base.
00:12:16.180 | One's upside down, and they're stuck base to base.
00:12:19.680 | So you can visualize that quite easily.
00:12:21.780 | It's a square base pyramid
00:12:22.780 | with another one stuck underneath it.
00:12:25.300 | And so now you get your two fingers
00:12:26.700 | as the vertices of those two pyramids.
00:12:29.460 | And what's interesting about that is
00:12:33.460 | you've preserved the symmetries of the cube
00:12:36.340 | at the cost of doing something pretty radical,
00:12:38.980 | which is changing faces to vertices and vertices to faces.
00:12:43.980 | The thing you pointed out if you did that was an octahedron.
00:12:48.840 | It has eight faces and six vertices.
00:12:51.500 | A cube has six faces and eight vertices.
00:12:54.340 | So in order to preserve the symmetries
00:12:56.260 | you know about of the cube, if you did that,
00:13:00.840 | you've done something really radical,
00:13:03.020 | which is changed faces for vertices and vertices for faces.
00:13:05.980 | I should show you what the answer looks like.
00:13:10.120 | So I'm gonna step back and try and get enough light,
00:13:13.500 | and maybe you can see this cube.
00:13:18.740 | So this is a cube,
00:13:21.460 | and you can see that the other edges
00:13:26.460 | form a kind of zigzag ring around the middle.
00:13:28.900 | So I've got a picture of it.
00:13:31.580 | So the colored rods here are the other edges of the cube,
00:13:37.900 | the ones that don't touch your fingertips.
00:13:40.460 | And your top finger's connected
00:13:41.780 | to the three vertices of those flaps,
00:13:44.380 | and your bottom finger's connected
00:13:46.540 | to the lowest three vertices there.
00:13:48.640 | And that's what a cube looks like.
00:13:51.100 | It's something you had no idea about.
00:13:53.420 | This is just a completely different model of a cube.
00:13:55.940 | It's so different, I'll give it a different name.
00:13:57.540 | I'll call it a hexahedron.
00:13:58.860 | And the thing to notice is a hexahedron and a cube
00:14:04.940 | are just conceptually utterly different.
00:14:07.540 | You wouldn't even know one was the same as the other
00:14:09.860 | if you think about one as a hexahedron and one as a cube.
00:14:12.620 | It's like the ambiguity between a tilted square
00:14:15.540 | and an upright diamond, but more powerful
00:14:17.780 | 'cause you're not familiar with it.
00:14:19.480 | And that's my demonstration
00:14:22.340 | that people really do use coordinate systems.
00:14:24.580 | And if you use a different coordinate system
00:14:26.260 | to describe things, and here I forced you
00:14:28.500 | to use a different coordinate system
00:14:29.900 | by making the diagonal be vertical
00:14:32.300 | and asking you to describe it relative to that vertical axis,
00:14:35.340 | then familiar things become completely unfamiliar.
00:14:38.260 | And when you do see them relative to this new frame,
00:14:41.560 | they're just a completely different thing.
00:14:44.300 | Notice that things like convolutional neural nets
00:14:46.460 | don't have that.
00:14:47.540 | They can't look at something
00:14:48.580 | and have two utterly different internal representations
00:14:50.900 | of the very same thing.
00:14:52.060 | I'm also showing you that you do parsing.
00:14:55.980 | So here I've colored it so you parse it
00:14:57.780 | into what I call the crown,
00:14:59.660 | which is three triangular flaps
00:15:01.580 | that slope upwards and outwards.
00:15:03.180 | Here's a different parsing.
00:15:06.260 | The same green flap sloping upwards and outwards.
00:15:08.980 | Now we have a red flap sloping downwards and outwards,
00:15:12.580 | and we have a central rectangle,
00:15:14.460 | and your fingertips are at the two ends of the rectangle.
00:15:17.180 | And if you perceive this and now close your eyes
00:15:21.220 | and I ask you, were there any parallel edges there?
00:15:24.540 | You're very well aware that those two blue edges
00:15:27.260 | were parallel, and you're typically not aware
00:15:29.620 | of any other parallel edges,
00:15:31.300 | even though you know by symmetry there must be other pairs.
00:15:34.380 | Similarly with the crown, if you see the crown,
00:15:37.340 | and then I ask you to close your eyes
00:15:38.940 | and ask you, were there parallel edges?
00:15:41.020 | You don't see any parallel edges.
00:15:43.420 | And that's because the coordinate systems
00:15:45.060 | you're using for those flaps don't line up with the edges.
00:15:48.860 | And you only notice parallel edges
00:15:50.620 | if they line up with the coordinate system you're using.
00:15:53.460 | So here for the rectangle,
00:15:55.020 | the parallel edges align with the coordinate system.
00:15:57.380 | For the flaps, they don't.
00:15:59.020 | So you're aware that those two blue edges are parallel,
00:16:01.300 | but you're not aware that one of the green edges
00:16:03.500 | and one of the red edges are parallel.
00:16:10.460 | So this isn't like the Necker cube ambiguity,
00:16:12.580 | where when it flips, you think that what's out there
00:16:15.220 | in reality is different, things are at a different depth.
00:16:18.340 | This is like, next weekend, we shall be visiting relatives.
00:16:22.780 | So if you take the sentence,
00:16:23.620 | "Next weekend, we shall be visiting relatives,"
00:16:26.020 | it can mean, next weekend,
00:16:28.780 | what we will be doing is visiting relatives,
00:16:31.620 | or it can mean, next weekend,
00:16:33.580 | what we will be is visiting relatives.
00:16:36.180 | Now, those are completely different senses.
00:16:39.620 | They happen to have the same truth conditions.
00:16:41.660 | They mean the same thing in the sense of truth conditions,
00:16:45.100 | 'cause if you're visiting relatives,
00:16:46.980 | what you are is visiting relatives.
00:16:49.100 | And it's that kind of ambiguity.
00:16:50.740 | No disagreement about what's going on in the world,
00:16:52.500 | but two completely different ways of seeing the sentence.
00:16:55.260 | So this was drawn in the 1970s.
00:17:03.900 | This is what AI was like in the 1970s.
00:17:08.460 | This is a sort of structural description
00:17:10.500 | of the crown interpretation.
00:17:12.380 | So you have nodes for all the various parts in the hierarchy.
00:17:17.820 | I've also put something on the arcs.
00:17:20.020 | That RWX is the relationship
00:17:24.140 | between the crown and the flap,
00:17:26.420 | and that can be represented by a matrix.
00:17:28.580 | It's really the relationship
00:17:29.740 | between the intrinsic frame of reference of the crown
00:17:32.860 | and the intrinsic frame of reference of the flap.
00:17:35.700 | And notice that if I change my viewpoint,
00:17:39.020 | that doesn't change at all.
00:17:40.420 | So that kind of relationship will be a good thing
00:17:43.540 | to put in the weights of a neural network,
00:17:45.900 | 'cause you'd like a neural network
00:17:47.100 | to be able to recognize shapes independently of viewpoint.
00:17:50.460 | And that RWX is knowledge about the shape
00:17:53.580 | that's independent of viewpoint.
00:17:55.180 | Here's the zigzag interpretation.
00:17:59.620 | And here's something else
00:18:01.820 | where I've added the things in the heavy blue boxes.
00:18:06.460 | They're the relationship between a node and the viewer.
00:18:11.460 | That is to be more explicit.
00:18:13.900 | The coordinate transformation
00:18:15.220 | between the intrinsic frame of reference of the crown
00:18:18.260 | and the intrinsic frame of reference of the viewer,
00:18:20.940 | your eyeball, is that RWV.
00:18:23.580 | And that's a different kind of thing altogether,
00:18:26.980 | 'cause as you change viewpoint, that changes.
00:18:29.580 | In fact, as you change viewpoint,
00:18:31.060 | all those things in blue boxes all change together
00:18:34.260 | in a consistent way.
00:18:35.380 | And there's a simple relationship,
00:18:37.780 | which is that if you take RWV,
00:18:40.220 | then you multiply it by RWX, you get RXV.
00:18:44.420 | So you can easily propagate viewpoint information
00:18:47.620 | over a structural description.
00:18:49.940 | And that's what I think a mental image is.
00:18:52.500 | Rather than a bunch of pixels,
00:18:55.020 | it's a structural description
00:18:56.820 | with associated viewpoint information.
00:18:59.980 | That makes sense of a lot of properties of mental images.
00:19:04.980 | Like if you want to do any reasoning with things like RWX,
00:19:09.780 | you form a mental image.
00:19:12.460 | That is you fill in, you choose a viewpoint.
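
As a small illustration of that propagation (my own conventions and matrix names, not Hinton's exact RWX/RWV notation), composing a viewpoint-independent part-to-whole transform with a viewpoint-dependent whole-to-viewer transform gives the part's pose relative to the viewer:

```python
import numpy as np

def transform(angle_deg, tx, ty):
    """Homogeneous 2-D rotation-plus-translation matrix."""
    a = np.radians(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), tx],
                     [np.sin(a),  np.cos(a), ty],
                     [0.0,        0.0,       1.0]])

# Viewpoint-independent knowledge (lives in the weights):
# the flap's frame relative to the crown's frame.
T_part_to_whole = transform(30.0, 1.0, 0.0)

# Viewpoint-dependent information (changes as you move):
# the crown's frame relative to the viewer.
T_whole_to_viewer = transform(-10.0, 0.0, 5.0)

# Propagating viewpoint information over the structural description:
# the flap's frame relative to the viewer is just the composition.
T_part_to_viewer = T_whole_to_viewer @ T_part_to_whole
print(np.round(T_part_to_viewer, 3))
```
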
00:19:14.900 | And I want to do one more demo to convince you
00:19:17.460 | you always choose a viewpoint
00:19:18.980 | when you're solving mental imagery problems.
00:19:21.380 | So I'm gonna give you another very simple
00:19:22.900 | mental imagery problem at the risk of running over time.
00:19:27.660 | Imagine that you're at a particular point
00:19:31.660 | and you travel a mile East,
00:19:33.420 | and then you travel a mile North,
00:19:35.220 | and then you travel a mile East again.
00:19:37.540 | What's your direction back to your starting point?
00:19:40.820 | This isn't a very hard problem.
00:19:42.940 | It's sort of a bit South and quite a lot West, right?
00:19:46.340 | It's not exactly Southwest, but it's sort of Southwest.
00:19:49.100 | Now, when you did that task,
00:19:52.980 | what you imagined from your point of view
00:19:56.220 | is you went a mile East, and then you went a mile North,
00:19:58.740 | and then you went a mile East again.
00:20:00.540 | I'll tell you what you didn't imagine.
00:20:03.300 | You didn't imagine that you went a mile East,
00:20:05.180 | and then you went a mile North,
00:20:06.180 | and then you went a mile East again.
00:20:07.940 | You could have solved the problem perfectly well
00:20:09.620 | with North not being up, but you had North up.
00:20:13.100 | You also didn't imagine this.
00:20:14.740 | You go a mile East, and then a mile North,
00:20:16.300 | and then a mile East again.
00:20:17.980 | And you didn't imagine this.
00:20:18.820 | You go a mile East, and then a mile North, and so on.
00:20:21.340 | You imagined it at a particular scale,
00:20:23.340 | in a particular orientation, and in a particular position.
00:20:26.180 | And you can answer questions
00:20:29.940 | about roughly how big it was and so on.
00:20:32.020 | So that's evidence that to solve these tasks
00:20:35.540 | that involve using relationships between things,
00:20:39.140 | you form a mental image.
00:20:40.820 | Okay, enough on mental imagery.
00:20:43.780 | So I'm now gonna give you a very brief introduction
00:20:48.300 | to contrastive learning.
00:20:50.060 | So this is a complete disconnect in the talk,
00:20:54.140 | but it'll come back together soon.
00:20:55.820 | So in contrastive self-supervised learning,
00:21:02.140 | what we try and do is make two different crops of an image
00:21:06.660 | have the same representation.
00:21:08.300 | There's a paper a long time ago by Becker and Hinton
00:21:14.460 | where we were doing this to discover
00:21:16.140 | low-level coherence in an image,
00:21:18.620 | like the continuity of surfaces or the depth of surfaces.
00:21:23.620 | It's been improved a lot since then,
00:21:27.300 | and it's been used for doing things like classification.
00:21:30.180 | That is, you take an image
00:21:34.100 | that has one prominent object in it,
00:21:36.540 | and you say, "If I take a crop of the image
00:21:40.100 | "that contains sort of any part of that object,
00:21:42.460 | "it should have the same representation
00:21:45.380 | "as some other crop of the image
00:21:46.540 | "containing a part of that object."
00:21:49.260 | And this has been developed a lot in the last few years.
00:21:54.140 | I'm gonna talk about a model developed a couple of years ago
00:21:57.140 | by my group in Toronto called SimCLR,
00:21:59.100 | but there's lots of other models.
00:22:00.780 | And since then, things have improved.
00:22:02.580 | So in SimCLR, you take an image X,
00:22:08.860 | you take two different crops,
00:22:11.900 | and you also do colour distortion of the crops,
00:22:14.660 | different colour distortions of each crop.
00:22:16.980 | And that's to prevent it from using colour histograms
00:22:19.340 | to say they're the same.
00:22:20.540 | So you mess with the colour,
00:22:22.740 | so it can't use colour in a simple way.
00:22:25.340 | And that gives you Xi tilde and Xj tilde.
00:22:32.140 | You then put those through the same neural network, F,
00:22:36.500 | and you get a representation, H.
00:22:38.380 | And then you take the representation, H,
00:22:41.140 | and you put it through another neural network,
00:22:43.020 | which compresses it a bit.
00:22:44.820 | It goes to low dimensionality.
00:22:47.260 | That's an extra complexity I'm not gonna explain,
00:22:49.420 | but it makes it work a bit better.
00:22:51.700 | You can do it without doing that.
00:22:53.540 | And you get two embeddings, Zi and Zj.
00:22:56.100 | And your aim is to maximise the agreement
00:22:59.540 | between those vectors.
00:23:00.820 | And so you start off doing that and you say,
00:23:03.820 | okay, let's start off with random neural networks,
00:23:07.220 | random weights in the neural networks,
00:23:09.180 | and let's take two patches
00:23:10.460 | and let's put them through these transformations.
00:23:13.060 | Let's try and make Zi be the same as Zj.
00:23:15.660 | So let's back propagate the squared difference
00:23:17.980 | between components of Zi and components of Zj.
00:23:21.140 | And hey, presto, what you discover is everything collapses.
00:23:25.540 | For every image, it will always produce the same Zi and Zj.
00:23:32.260 | And then you realise, well,
00:23:33.260 | that's not what I meant by agreement.
00:23:35.020 | I mean, they should be the same
00:23:37.300 | when you get two crops of the same image
00:23:39.220 | and different when you get two crops of different images.
00:23:42.620 | Otherwise, it's not really agreement, right?
00:23:45.420 | So you have to have negative examples.
00:23:50.420 | You have to show it crops from different images
00:23:53.380 | and say those should be different.
00:23:55.620 | If they're already different,
00:23:57.500 | you don't try and make them a lot more different.
00:23:59.980 | It's very easy to make things very different,
00:24:02.300 | but that's not what you want.
00:24:03.300 | You just wanna be sure they're different enough.
00:24:04.860 | So crops from different images
00:24:07.220 | aren't taken to be from the same image.
00:24:09.420 | So if they happen to be very similar, you push them apart.
00:24:12.300 | And that stops your representations collapsing.
00:24:14.220 | That's called contrastive learning.
00:24:16.140 | And it works very well.
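
Here is a simplified sketch of that kind of contrastive objective, loosely in the style of SimCLR's loss but not the actual implementation: the other crop of the same image is the positive, and all crops of other images in the batch act as negatives, which is what stops the representations collapsing.

```python
import numpy as np

def contrastive_loss(z_i, z_j, temperature=0.5):
    """z_i, z_j: (batch, dim) embeddings of two crops of the same batch of images."""
    z = np.concatenate([z_i, z_j], axis=0)                 # 2N embeddings
    z = z / np.linalg.norm(z, axis=1, keepdims=True)       # compare by cosine similarity
    sim = z @ z.T / temperature                            # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                         # never contrast with yourself
    n = len(z_i)
    # The positive for crop k is the other crop of the same image.
    pos_index = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(2 * n), pos_index].mean()

rng = np.random.default_rng(0)
z_i, z_j = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(contrastive_loss(z_i, z_j))
```
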
00:24:17.340 | So what you can do is do unsupervised learning
00:24:23.540 | by trying to maximise agreement
00:24:25.100 | between the representations you get
00:24:28.180 | from two image patches from the same image.
00:24:30.380 | And after you've done that,
00:24:32.620 | you just take your representation of the image patch
00:24:36.300 | and you feed it to a linear classifier,
00:24:38.620 | a bunch of weights.
00:24:39.460 | So you multiply the representation by a weight matrix,
00:24:42.140 | put it through a softmax and get class labels.
00:24:45.700 | And then you train that by gradient descent.
00:24:48.780 | And what you discover is that that's just about as good
00:24:53.700 | as training on labelled data.
00:24:56.020 | So now the only thing you've trained on labelled data
00:24:58.060 | is that last linear classifier.
00:25:00.380 | The previous layers were trained on unlabelled data
00:25:03.540 | and you've managed to train your representations
00:25:07.100 | without needing labels.
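
A minimal sketch of that linear-evaluation step (illustrative only): the learned representations H are frozen, and the only weights that ever see labelled data are a single softmax classifier trained by gradient descent.

```python
import numpy as np

def linear_probe_step(H, labels, W, lr=0.1):
    """One gradient step for a linear softmax classifier on frozen features.

    H: (batch, dim) frozen representations, labels: (batch,) int class ids,
    W: (dim, num_classes) the only weights trained on labelled data.
    """
    logits = H @ W
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    probs[np.arange(len(labels)), labels] -= 1.0             # softmax cross-entropy gradient
    return W - lr * H.T @ probs / len(labels)

rng = np.random.default_rng(0)
H, labels = rng.normal(size=(16, 8)), rng.integers(0, 3, size=16)
W = np.zeros((8, 3))
for _ in range(100):
    W = linear_probe_step(H, labels, W)
```
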
00:25:12.020 | Now, there's a problem with this.
00:25:13.860 | It works very nicely,
00:25:17.020 | but it's really confounding objects and whole scenes.
00:25:20.500 | So it makes sense to say two different patches
00:25:23.700 | from the same scene should get the same vector label
00:25:28.700 | at the scene level 'cause they're from the same scene.
00:25:31.780 | But what if one of the patches
00:25:34.260 | contains bits of objects A and B
00:25:35.660 | and another patch contains bits of objects A and C?
00:25:38.180 | You don't really want those two patches
00:25:39.540 | to have the same representation at the object level.
00:25:42.580 | So we have to distinguish
00:25:43.820 | these different levels of representation.
00:25:46.700 | And for contrastive learning,
00:25:49.300 | if you don't use any kind of gating or attention,
00:25:53.180 | then what's happening is you're really doing learning
00:25:55.540 | at the scene level.
00:25:56.660 | What we'd like is that the representations
00:26:02.060 | you get at the object level should be the same
00:26:05.700 | if both patches are patches from object A,
00:26:09.100 | but should be different if one patch is from object A
00:26:11.340 | and one patch is from object B.
00:26:13.020 | And to do that, we're gonna need some form of attention
00:26:14.940 | to decide whether they really come from the same thing.
00:26:17.860 | And so GLOM is designed to do that.
00:26:19.500 | It's designed to take contrastive learning
00:26:21.980 | and to introduce attention
00:26:23.820 | of the kind you get in transformers
00:26:25.940 | in order not to try and say things are the same
00:26:28.380 | when they're not.
00:26:29.220 | I should mention at this point
00:26:32.620 | that most of you will be familiar with BERT,
00:26:35.980 | and you could think of the word fragments
00:26:37.740 | that are fed into BERT
00:26:39.340 | as like the image patches I'm using here.
00:26:42.420 | And in BERT, you have that whole column of representations
00:26:45.140 | of the same word fragment.
00:26:46.700 | In BERT, what's happening presumably as you go up
00:26:50.740 | is you're getting semantically richer representations.
00:26:55.740 | But in BERT, there's no attempt to get representations
00:27:00.420 | of larger things like whole phrases.
00:27:02.500 | This, what I'm gonna talk about
00:27:06.580 | will be a way to modify BERT.
00:27:08.020 | So as you go up,
00:27:09.540 | you get bigger and bigger islands of agreement.
00:27:12.100 | So for example, after a couple of levels,
00:27:15.580 | then things like New and York,
00:27:18.780 | or rather the different fragments of York,
00:27:21.660 | suppose it's got two different fragments,
00:27:23.580 | will have exactly the same representation
00:27:25.380 | if it was done in the GLOM-like way.
00:27:27.580 | And then as you go up another level,
00:27:29.860 | the fragments of New,
00:27:31.220 | well, New's probably a thing in its own right,
00:27:32.700 | and the fragments of York
00:27:35.020 | would all have exactly the same representation,
00:27:37.380 | they'd have this island of agreement.
00:27:40.740 | And that will be a representation of a compound thing.
00:27:44.660 | And as you go up,
00:27:45.500 | you're gonna get these islands of agreement
00:27:46.980 | that represent bigger and bigger things.
00:27:49.020 | And that's gonna be a much more useful kind of BERT
00:27:51.700 | 'cause instead of taking vectors
00:27:54.940 | that represent word fragments
00:27:56.700 | and then sort of munging them together
00:27:58.780 | by taking the max of each component, for example,
00:28:02.500 | which is just a crazy thing to do,
00:28:05.020 | you'd explicitly, as you're learning,
00:28:07.060 | form representations of larger parts
00:28:09.660 | in the part-whole hierarchy.
00:28:11.060 | Okay.
00:28:13.060 | So what we're going after in GLOM
00:28:18.300 | is a particular kind of spatial coherence
00:28:20.900 | that's more complicated than the spatial coherence
00:28:23.300 | caused by the fact that surfaces
00:28:25.260 | tend to be at the same depth and same orientation
00:28:27.780 | in nearby patches of an image.
00:28:30.180 | We're going after the spatial coherence
00:28:32.540 | that says that if you find a mouth in an image
00:28:37.220 | and you find a nose in an image
00:28:38.620 | and then the right spatial relationship to make a face,
00:28:41.460 | then that's a particular kind of coherence.
00:28:44.260 | And we want to go after that unsupervised
00:28:46.940 | and we want to discover that kind of coherence in images.
00:28:50.740 | So before I go into more details about GLOM,
00:28:56.740 | I want a disclaimer.
00:28:58.700 | For years, computer vision treated vision
00:29:02.900 | as you've got a static image, a uniform resolution,
00:29:05.940 | and you want to say what's in it.
00:29:08.140 | That's not how vision works in the real world.
00:29:10.260 | In the real world, this is actually a loop
00:29:12.140 | where you decide where to look
00:29:13.660 | if you're a person or a robot.
00:29:16.140 | You better do that intelligently.
00:29:20.260 | And that gives you a sample of the optic array.
00:29:25.060 | It turns the optic array, the incoming light,
00:29:27.420 | into a retinal image.
00:29:30.580 | And on your retina, you have high resolution in the middle
00:29:32.780 | and low resolution around the edges.
00:29:35.660 | And so you're focusing on particular details
00:29:39.900 | and you never ever process the whole image
00:29:43.100 | at uniform resolution.
00:29:44.580 | You're always focusing on something
00:29:46.020 | and processing where you're fixating at high resolution
00:29:49.140 | and everything else at much lower resolution,
00:29:51.100 | particularly around the edges.
00:29:53.380 | So I'm going to ignore all the complexity
00:29:56.140 | of how you decide where to look
00:29:57.940 | and all the complexity of how you put together
00:30:00.740 | the information you get from different fixations
00:30:03.340 | by saying, let's just talk about the very first fixation
00:30:06.580 | or a novel image.
00:30:07.940 | So you look somewhere
00:30:09.260 | and now what happens on that first fixation?
00:30:11.860 | We know that the same hardware in the brain
00:30:13.580 | is going to be reused for the next fixation,
00:30:16.500 | but let's just think about the first fixation.
00:30:20.740 | So finally, here's a picture of the architecture
00:30:23.220 | and this is the architecture for a single location.
00:30:29.780 | So like for a single word fragment in BERT
00:30:32.500 | and it shows you what's happening for multiple frames.
00:30:38.700 | So Glom is really designed for video,
00:30:40.820 | but I only talk about applying it to static images.
00:30:43.660 | Then you should think of a static image
00:30:45.580 | as a very boring video
00:30:47.500 | in which the frames are all the same as each other.
00:30:50.900 | So I'm showing you three adjacent levels in the hierarchy
00:30:54.740 | and I'm showing you what happens over time.
00:30:59.300 | So if you look at the middle level,
00:31:02.580 | maybe that's the sort of major part level
00:31:04.740 | and look at that box that says level L
00:31:08.180 | and that's at frame four.
00:31:11.740 | So take the right-hand level L box,
00:31:14.780 | and let's ask how the state of that box,
00:31:17.660 | the state of that embedding, is determined.
00:31:20.500 | So inside the box, we're gonna get an embedding
00:31:22.860 | and the embedding is gonna be the representation
00:31:27.700 | of what's going on at the major part level
00:31:31.140 | for that little patch of the image.
00:31:32.940 | And level L, in this diagram,
00:31:37.900 | all of these embeddings will always be devoted
00:31:41.780 | to the same patch of the retinal image.
00:31:44.580 | Okay.
00:31:47.580 | The level L embedding on the right-hand side,
00:31:51.860 | you can see there's three things determining it there.
00:31:55.500 | There's a green arrow, and for static images,
00:31:58.700 | the green arrow is rather boring.
00:31:59.940 | It's just saying you should sort of be similar
00:32:02.020 | to the previous state of level L.
00:32:03.740 | So it's just doing temporal integration.
00:32:05.740 | The blue arrow is actually a neural net
00:32:12.020 | with a couple of hidden layers in it.
00:32:14.540 | I'm just showing you the embeddings here,
00:32:15.900 | not all the layers of the neural net.
00:32:18.260 | We need a couple of hidden layers
00:32:19.420 | to do the coordinate transforms that are required.
00:32:22.380 | And the blue arrow is basically taking information
00:32:26.340 | at the level below of the previous time step.
00:32:30.060 | So level L minus one on frame three
00:32:32.660 | might be representing that I think I might be a nostril.
00:32:36.260 | Well, if you think you might be a nostril,
00:32:38.260 | what you predict at the next level up is a nose.
00:32:42.060 | What's more, if you have a coordinate frame
00:32:44.300 | for the nostril, you can predict the coordinate frame
00:32:46.700 | for the nose.
00:32:47.780 | Maybe not perfectly, but you have a pretty good idea
00:32:50.100 | of the orientation position scale of the nose.
00:32:53.100 | So that bottom up neural net is a net
00:32:58.100 | that can take any kind of part of level L minus one.
00:33:01.420 | You can take a nostril,
00:33:02.660 | but it could also take a steering wheel
00:33:04.660 | and predict the car from the steering wheel
00:33:06.820 | and predict what you've got at the next level up.
00:33:12.980 | The red arrow is a top-down neural net.
00:33:16.860 | So the red arrow is predicting the nose from the whole face.
00:33:22.860 | And again, it has a couple of hidden layers
00:33:27.300 | to do coordinate transforms.
00:33:29.220 | 'Cause if you know the coordinate frame of the face
00:33:32.260 | and you know the relationship between a face and a nose,
00:33:35.340 | and that's gonna be in the weights
00:33:36.500 | of that top-down neural net,
00:33:38.500 | then you can predict that it's a nose
00:33:41.340 | and what the pose of the nose is.
00:33:43.500 | And that's all gonna be in activities
00:33:45.580 | in that embedding vector.
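
As a rough sketch (my simplification, not the GLOM implementation), the blue and red arrows can be thought of as small multi-layer nets: a bottom-up net mapping the level L-1 embedding of a patch to a prediction for level L, and a top-down net mapping the level L+1 embedding back down, each with a couple of hidden layers so it can implement coordinate transforms. The fourth input, attention to nearby columns, is sketched a little further on.

```python
import numpy as np

def mlp(dims, rng):
    """Random weights for a small fully connected net with ReLU hidden layers."""
    return [rng.normal(scale=0.1, size=(d_in, d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def run(net, x):
    for i, W in enumerate(net):
        x = x @ W
        if i < len(net) - 1:
            x = np.maximum(x, 0.0)          # ReLU on the hidden layers only
    return x

rng = np.random.default_rng(0)
D = 16                                       # embedding dimensionality at each level
bottom_up = mlp([D, 32, 32, D], rng)         # e.g. "nostril at this pose" -> "nose at that pose"
top_down  = mlp([D, 32, 32, D], rng)         # e.g. "face at this pose" -> "nose at that pose"

prev_L, below, above = rng.normal(size=(3, D))
# Three of the four inputs to the next level-L embedding for this patch.
next_L = prev_L + run(bottom_up, below) + run(top_down, above)
```
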
00:33:46.820 | Okay, now all of that is what's going on
00:33:51.900 | in one column of hardware.
00:33:54.020 | That's all about a specific patch of the image.
00:33:57.100 | So that's very, very like what's going on
00:33:59.980 | for one word fragment in BERT.
00:34:02.140 | You have all these levels of representation.
00:34:04.340 | It's a bit confusing exactly what the relation of this
00:34:11.020 | is to BERT.
00:34:11.860 | And I'll give you the reference
00:34:13.020 | to a long arXiv paper at the end
00:34:14.420 | that has a whole section on how this relates to BERT.
00:34:17.500 | But it's confusing 'cause this has time steps.
00:34:19.940 | And that makes it all more complicated.
00:34:24.540 | Okay.
00:34:25.380 | So those are three things
00:34:28.060 | that determine the level L embedding.
00:34:30.260 | But there's a fourth thing,
00:34:31.780 | which is in black at the bottom there.
00:34:34.500 | And that's the only way
00:34:36.860 | in which different locations interact.
00:34:39.540 | And that's a very simplified form of a transformer.
00:34:43.500 | If you take a transformer, as in BERT,
00:34:46.220 | and you say, let's make the embeddings
00:34:48.660 | and the keys and the queries and the values
00:34:50.940 | all be the same as each other.
00:34:53.020 | We just have this one vector.
00:34:54.460 | So now all you're trying to do
00:34:58.740 | is make the level L embedding in one column
00:35:02.100 | be the same as the level L embedding in nearby columns.
00:35:05.460 | But it's gonna be gated.
00:35:08.300 | You're only gonna try and make it be the same
00:35:10.700 | if it's already quite similar.
00:35:13.140 | So here's how the attention works.
00:35:17.300 | You take the level L embedding in location X, that's LX.
00:35:22.940 | And you take the level L embedding
00:35:24.220 | in the nearby location Y, that's LY.
00:35:27.340 | You take the scalar product.
00:35:28.740 | You exponentiate and you normalize.
00:35:32.900 | In other words, you do a softmax.
00:35:35.340 | And that gives you the weight to use
00:35:38.260 | in your desire to make LX be the same as LY.
00:35:43.260 | So the input produced by this from neighbors
00:35:50.580 | is an attention-weighted average
00:35:54.580 | of the level L embeddings of nearby columns.
00:35:57.580 | And that's an extra input that you get.
00:35:59.540 | It's trying to make you agree with nearby things.
00:36:02.620 | And that's what's gonna cause you
00:36:03.860 | to get these islands of agreement.
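
A minimal sketch of that simplified attention (illustrative; a real version would restrict the average to nearby columns and combine it with the other three inputs): the embedding itself plays the role of query, key and value, so each column takes a softmax-weighted average of the level-L embeddings of the other columns.

```python
import numpy as np

def neighbour_consensus(level_L):
    """level_L: (num_columns, dim) embeddings at one level, one per image patch."""
    sim = level_L @ level_L.T                              # scalar products L_x . L_y
    weights = np.exp(sim - sim.max(axis=1, keepdims=True)) # exponentiate...
    weights /= weights.sum(axis=1, keepdims=True)          # ...and normalize (a softmax)
    return weights @ level_L                               # attention-weighted average

rng = np.random.default_rng(0)
level_L = rng.normal(size=(6, 16))             # six columns, 16-dimensional embeddings
attention_input = neighbour_consensus(level_L) # the extra input to each column's update
```
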
00:36:06.060 | (mouse clicking)
00:36:08.820 | So back to this picture.
00:36:12.020 | I think, yeah.
00:36:17.820 | This is what we'd like to see.
00:36:21.260 | And the reason we get those,
00:36:25.340 | that big island of agreement at the object level
00:36:28.460 | is 'cause we're trying to get agreement there.
00:36:30.820 | We're trying to learn the coordinate transform
00:36:34.180 | from the red arrows to the level above
00:36:36.460 | and from the green arrows to the level above
00:36:38.540 | such that we get agreement.
00:36:40.140 | Okay.
00:36:42.380 | Now, one thing we need to worry about
00:36:47.980 | is that the difficult thing in perception,
00:36:50.420 | it's not so bad in language,
00:36:54.300 | it's probably worse in visual perception,
00:36:56.340 | is that there's a lot of ambiguity.
00:36:58.700 | If I'm looking at a line drawing, for example,
00:37:00.420 | I see a circle.
00:37:02.300 | Well, a circle could be the right eye of a face
00:37:04.860 | or it could be the left eye of a face
00:37:06.260 | or it could be the front wheel of a car
00:37:07.540 | or the back wheel of a car.
00:37:08.820 | There's all sorts of things that circle could be.
00:37:11.620 | And we'd like to disambiguate the circle.
00:37:13.900 | And there's a long line of work
00:37:16.420 | using things like Markov random fields.
00:37:19.940 | Here we need a variational Markov random field
00:37:22.020 | which I call a transformational random field
00:37:24.540 | because the interaction between, for example,
00:37:27.340 | something that might be an eye
00:37:28.780 | and something that might be a mouth
00:37:31.220 | needs to be gated by coordinate transforms.
00:37:33.660 | For the, let's take a nose and a mouth
00:37:36.260 | 'cause that's my standard thing.
00:37:37.940 | If you take something that might be a nose
00:37:39.780 | and you want to ask,
00:37:40.980 | does anybody out there support the idea I'm a nose?
00:37:43.580 | Well, what you'd like to do is send to everything nearby
00:37:49.100 | a message saying,
00:37:50.340 | do you have the right kind of pose
00:37:55.460 | and the right kind of identity
00:37:56.660 | to support the idea that I'm a nose?
00:37:58.460 | And so you'd like, for example,
00:38:00.860 | to send out a message from the nose.
00:38:04.260 | You'd send out a message to all nearby locations saying,
00:38:07.300 | does anybody have a mouth with the pose that I predict
00:38:10.180 | by taking the pose of the nose,
00:38:12.620 | multiplying by the coordinate transform
00:38:14.420 | between a nose and a mouth?
00:38:16.100 | And now I can predict the pose of the mouth.
00:38:18.020 | Is there anybody out there with that pose
00:38:20.060 | who thinks it might be a mouth?
00:38:22.140 | And I think you can see,
00:38:22.980 | you're gonna have to send out a lot of different messages.
00:38:25.620 | For each kind of other thing that might support you,
00:38:28.540 | you're gonna need to send a different message.
00:38:29.860 | So you're gonna need a multi-headed transformer
00:38:34.860 | and it's gonna be doing these coordinate transforms
00:38:38.060 | and you have to do a coordinate transform,
00:38:40.100 | the inverse transform on the way back.
00:38:42.180 | 'Cause if the mouth supports you,
00:38:43.900 | what it needs to support is a nose,
00:38:45.700 | not with the pose of the mouth,
00:38:47.860 | but with the appropriate pose.
00:38:49.980 | So that's gonna get very complicated.
00:38:51.380 | You're gonna have N-squared interactions
00:38:52.700 | all with coordinate transforms.
00:38:54.620 | There's another way of doing it that's much simpler.
00:38:56.820 | That's called a Hough transform.
00:38:58.980 | At least it's much simpler
00:39:00.580 | if you have a way of representing ambiguity.
00:39:02.940 | So instead of these direct interactions
00:39:08.580 | between parts like a nose and a mouth,
00:39:10.460 | what you're gonna do
00:39:12.500 | is you're gonna make each of the parts predict the whole.
00:39:16.420 | So the nose can predict the face
00:39:19.980 | and it can predict the pose of the face.
00:39:22.020 | And the mouth can also predict the face.
00:39:24.580 | Now these will be in different columns of GLOM,
00:39:26.940 | but in one column of GLOM,
00:39:28.020 | you'll have a nose predicting face.
00:39:30.060 | In a nearby column,
00:39:31.060 | you'll have a mouth predicting face.
00:39:33.300 | And those two faces should be the same
00:39:35.900 | if this really is a face.
00:39:37.300 | So when you do this attention weighted averaging
00:39:41.100 | with nearby things,
00:39:42.460 | what you're doing is you're getting confirmation
00:39:45.260 | that the support for the hypothesis you've got,
00:39:49.900 | I mean, suppose in one column you make the hypothesis
00:39:51.780 | it's a face with this pose.
00:39:53.860 | That gets supported by nearby columns
00:39:56.300 | that derive the very same embedding
00:39:59.100 | from quite different data.
00:40:00.820 | One derived it from the nose
00:40:02.020 | and one derived it from the mouth.
00:40:03.700 | And this doesn't require any dynamic routing
00:40:07.460 | because the embeddings are always referring
00:40:11.140 | to what's going on in the same small patch of the image.
00:40:14.380 | Within a column, there's no routing.
00:40:16.580 | And between columns,
00:40:18.900 | there's something a bit like routing,
00:40:20.100 | but it's just the standard transformer kind of attention.
00:40:22.940 | You're just trying to agree with things that are similar.
00:40:26.460 | And, okay, so that's how GLOM is meant to work.
00:40:31.180 | And the big problem is that if I see a circle,
00:40:36.820 | it might be a left eye, it might be a right eye,
00:40:38.500 | it might be a front wheel of a car,
00:40:40.700 | it might be the back wheel of a car.
00:40:42.420 | Because my embedding for a particular patch
00:40:44.500 | at a particular level has to be able to represent anything,
00:40:47.660 | when I get an ambiguous thing,
00:40:49.540 | I have to deal with all these possibilities
00:40:51.420 | of what whole it might be part of.
00:40:54.220 | So instead of trying to resolve ambiguity at the part level,
00:40:57.700 | what I can do is jump to the next level up
00:40:59.460 | and resolve the ambiguity there,
00:41:01.100 | just by seeing if things are the same,
00:41:03.180 | which is an easier way to resolve ambiguity.
00:41:05.660 | But the cost of that is I have to be able to represent
00:41:09.380 | all the ambiguity I get at the next level up.
00:41:11.660 | Now, it turns out you can do that.
00:41:14.660 | We've done a little toy example
00:41:17.180 | where you can actually preserve this ambiguity,
00:41:20.500 | but it's difficult.
00:41:23.340 | It's the kind of thing neural nets are good at.
00:41:25.700 | So if you think about the embedding at the next level up,
00:41:30.260 | you've got a whole bunch of neurons
00:41:33.180 | whose activities are that embedding,
00:41:35.460 | and you wanna represent a highly multimodal distribution,
00:41:40.140 | like it might be a car with this pose,
00:41:41.700 | or a car with that pose, or a face with this pose,
00:41:43.620 | or a face with that pose.
00:41:45.420 | All of these are possible predictions for finding a circle.
00:41:48.380 | And so you have to represent all that.
00:41:52.620 | And the question is, can neural nets do that?
00:41:55.060 | And I think the way they must be doing it is
00:41:57.940 | each neuron in the embedding
00:42:00.060 | stands for an unnormalized log probability distribution
00:42:05.700 | over this huge space of possible identities
00:42:07.980 | and possible poses,
00:42:09.540 | the sort of cross product of identities and poses.
00:42:11.980 | And so the neuron is this rather vague
00:42:16.940 | log probability distribution over that space.
00:42:19.180 | And when you activate the neuron,
00:42:22.020 | what it's saying is,
00:42:23.500 | add in that log probability distribution
00:42:25.780 | to what you've already got.
00:42:27.740 | And so now if you have a whole bunch
00:42:28.940 | of log probability distributions,
00:42:31.300 | and you add them all together,
00:42:32.820 | you can get a much more peaky log probability distribution.
00:42:37.780 | And when you exponentiate to get a probability distribution,
00:42:41.260 | it gets very peaky.
00:42:42.980 | And so very vague basis functions
00:42:46.380 | in this joint space of pose and identity,
00:42:50.260 | that is, basis functions in the log probability over that space,
00:42:53.660 | can be combined to produce sharp conclusions.
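
A tiny numerical illustration of that point (mine, not from the talk): each active neuron contributes a broad bump in unnormalized log-probability space over a one-dimensional stand-in for the pose/identity space, and summing a few such bumps and then exponentiating gives a sharply peaked distribution.

```python
import numpy as np

poses = np.linspace(-3.0, 3.0, 301)          # 1-D stand-in for the pose/identity space

def vague_log_prob(center, width=2.0):
    """A broad bump in unnormalized log-probability space, centred near some pose."""
    return -((poses - center) ** 2) / (2.0 * width ** 2)

# Each active neuron contributes one vague basis function; their centres only roughly agree.
total_log_prob = (vague_log_prob(0.1) + vague_log_prob(-0.2)
                  + vague_log_prob(0.3) + vague_log_prob(0.0))

prob = np.exp(total_log_prob - total_log_prob.max())
prob /= prob.sum()
print(poses[prob.argmax()])                  # the combined distribution is sharply peaked near 0
```
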
00:42:56.180 | So I think that's how neurons are representing things.
00:43:02.340 | Most people think about neurons as,
00:43:04.940 | they think about the thing that they're representing.
00:43:07.620 | But obviously in perception,
00:43:10.540 | you have to deal with uncertainty.
00:43:12.980 | And so neurons have to be good at representing
00:43:15.300 | multimodal distributions.
00:43:18.780 | And this is the only way I can think of
00:43:20.420 | that's good at doing it.
00:43:21.660 | That's a rather weak argument.
00:43:25.220 | I mean, it's the argument that led Chomsky
00:43:26.740 | to believe that language wasn't learned
00:43:28.820 | because he couldn't think of how it was learned.
00:43:30.820 | My view is neurons must be using this representation
00:43:34.500 | 'cause I can't think of any other way of doing it.
00:43:36.540 | Okay.
00:43:41.540 | I just said all that 'cause I got ahead of myself
00:43:43.340 | 'cause I got excited.
00:43:44.500 | Okay.
00:43:45.340 | Now, the reason you can get away with this,
00:43:50.540 | the reason you have these very vague distributions
00:43:52.860 | in the unnormalized log probability space
00:43:55.380 | is because these neurons are all dedicated
00:43:58.980 | to a small patch of image,
00:44:00.860 | and they're all trying to represent the thing
00:44:03.180 | that's happening in that patch of image.
00:44:05.460 | So you're only trying to represent one thing.
00:44:07.700 | You're not trying to represent
00:44:09.100 | some set of possible objects.
00:44:11.380 | If you're trying to represent
00:44:12.220 | some set of possible objects,
00:44:13.260 | you'd have a horrible binding problem,
00:44:15.060 | and you couldn't use these very vague distributions.
00:44:17.340 | But so long as you know that all of these neurons,
00:44:20.940 | all of the active neurons refer to the same thing,
00:44:24.700 | then you can do the intersection.
00:44:26.780 | You can add the log probability distribution together
00:44:28.980 | and intersect the sets of things they represent.
00:44:31.500 | Okay.
00:44:35.260 | I'm getting near the end.
00:44:36.700 | How would you train a system like this?
00:44:39.100 | Well, obviously you could train it the way you train BERT.
00:44:41.620 | You could do deep end-to-end training.
00:44:43.900 | And for GLOM, what that would consist of,
00:44:48.660 | and the way we trained a toy example,
00:44:50.740 | is you take an image,
00:44:55.780 | you leave out some patches of the image,
00:44:58.500 | you then let GLOM settle down for about 10 iterations,
00:45:04.020 | and it's trying to fill in
00:45:07.900 | the lowest level representation of what's in the image,
00:45:11.540 | the lowest level embedding,
00:45:14.540 | and it fills them in wrong.
00:45:17.260 | And so you now back propagate that error,
00:45:19.580 | and you're back propagating it through time in this network.
00:45:22.700 | So it'll also back propagate up and down through the levels.
00:45:25.700 | So you're basically just doing back propagation through time
00:45:30.180 | of the error due to filling in things incorrectly.
00:45:34.220 | That's basically how BERT is trained.
00:45:36.980 | And you could train GLOM the same way.
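
A schematic sketch of that training loop, with heavy assumptions: `glom_step` here is a hypothetical stand-in for one settling iteration (which in the real model would combine the bottom-up, top-down and attention inputs sketched above), and the front-end patch embedder is left untrained. The point is just the shape of the procedure: mask some patches, settle for about ten iterations, and backpropagate the fill-in error through time.

```python
import torch

# Hypothetical stand-ins, not the real GLOM modules: a linear front end that
# embeds 8-dimensional patches, and a single shared "settling" step.
embed_patches = torch.nn.Linear(8, 16)
glom_step = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Tanh())
optimizer = torch.optim.SGD(glom_step.parameters(), lr=0.01)

def train_step(patches, mask_frac=0.25, iters=10):
    targets = embed_patches(patches).detach()           # lowest-level embeddings to fill in
    mask = torch.rand(len(patches)) < mask_frac
    if not mask.any():
        mask[0] = True                                  # make sure something is left out
    state = targets.clone()
    state[mask] = 0.0                                   # leave out some patches
    for _ in range(iters):                              # let the network settle
        state = glom_step(state)
    loss = ((state[mask] - targets[mask]) ** 2).mean()  # error on the filled-in embeddings
    optimizer.zero_grad()
    loss.backward()                                     # backprop through the settling iterations
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(32, 8)))
```
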
00:45:39.580 | But I also want to include an extra bit in the training
00:45:43.180 | to encourage islands.
00:45:44.540 | We want to encourage big islands
00:45:51.740 | of identical vectors at high levels.
00:45:54.580 | And you can do that by using contrastive learning.
00:45:57.100 | So if you think about how an embedding
00:46:03.220 | at the next time step
00:46:04.700 | is determined,
00:46:07.660 | it's determined by combining
00:46:10.620 | a whole bunch of different factors:
00:46:12.860 | what was going on at the previous time step
00:46:15.380 | at this level of representation in this location,
00:46:18.460 | what was going on at the previous time step
00:46:21.220 | in this location
00:46:22.580 | but at the next level down
00:46:24.380 | and at the next level up,
00:46:26.100 | and also what was going on at the previous time step
00:46:29.060 | at nearby locations at the same level.
00:46:33.380 | And the weighted average of all those things
00:46:36.220 | I'll call the consensus embedding.
00:46:37.620 | And that's what you use for the next embedding.
00:46:40.580 | And I think you can see that
00:46:42.540 | if we try and make the bottom up neural net
00:46:45.100 | and the top down neural net,
00:46:46.900 | if we try and make the predictions agree with the consensus,
00:46:50.900 | the consensus has folded in information
00:46:54.180 | from nearby locations that already roughly agree
00:46:58.820 | because of the attention weighting.
00:47:01.260 | And so by trying to make the top down
00:47:02.700 | and bottom up neural networks agree with the consensus,
00:47:06.060 | you're trying to make them agree
00:47:07.380 | with what's going on in nearby locations that are similar.
00:47:10.460 | And so you'll be training it to form islands.
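A rough sketch of what the consensus embedding and the island-encouraging objective could look like, assuming the embeddings at one level form a `(locations, d)` tensor and that `bottom_up` and `top_down` are hypothetical networks; the attention temperature, the equal weighting of the four contributions, and the use of a simple agreement (regression) loss instead of a full contrastive loss with negatives are all my assumptions.

```python
import torch
import torch.nn.functional as F

def consensus_embedding(prev_same, pred_from_below, pred_from_above, temperature=1.0):
    """Combine: the previous state at this level and location, the bottom-up prediction,
    the top-down prediction, and attention-weighted nearby locations at the same level."""
    # Attention over locations at the same level: weights depend on how similar the
    # embeddings already are, so locations that already roughly agree contribute most.
    sim = prev_same @ prev_same.t() / temperature    # (locations, locations)
    lateral = F.softmax(sim, dim=-1) @ prev_same     # attention-weighted neighbors
    # Equal weighting of the four contributions (the real weighting is a design choice).
    return (prev_same + pred_from_below + pred_from_above + lateral) / 4.0

def island_loss(bottom_up, top_down, state_below, state_above, prev_same):
    """Train the bottom-up and top-down nets to agree with the consensus; since the consensus
    folds in nearby locations that already roughly agree, this encourages islands."""
    pred_up = bottom_up(state_below)
    pred_down = top_down(state_above)
    consensus = consensus_embedding(prev_same, pred_up, pred_down).detach()
    return F.mse_loss(pred_up, consensus) + F.mse_loss(pred_down, consensus)
```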
00:47:12.860 | This is more interesting to neuroscientists
00:47:19.820 | than to people who do natural language,
00:47:21.700 | so I'm gonna ignore that.
00:47:22.940 | You might think it's wasteful to be replicating
00:47:29.740 | all these embeddings at the object level.
00:47:31.780 | So the idea is at the object level,
00:47:33.580 | there'll be a large number of patches
00:47:35.020 | that all have exactly the same vector representation.
00:47:38.260 | And that seems like a waste,
00:47:39.860 | but actually biology is full of things like that.
00:47:41.940 | All your cells have exactly the same DNA
00:47:44.620 | and all the parts of an organ
00:47:47.060 | have pretty much the same vector of protein expressions.
00:47:49.980 | So there's lots of replication going on
00:47:51.980 | to keep things local.
00:47:55.380 | And it's the same here.
00:47:56.220 | And actually that replication is very useful
00:47:59.700 | when you're settling on an interpretation,
00:48:02.020 | because before you settle down,
00:48:04.140 | you don't know which things should be the same
00:48:05.700 | as which other things.
00:48:06.860 | So having separate vectors in each location
00:48:08.700 | to represent what's going on there at the object level
00:48:11.340 | gives you the flexibility to gradually segment things
00:48:13.860 | as you settle down in a sensible way.
00:48:16.060 | It allows you to hedge your bets.
00:48:19.740 | And what you're doing is not quite like clustering.
00:48:22.340 | You're creating clusters of identical vectors
00:48:25.580 | rather than discovering clusters in fixed data.
00:48:28.260 | So in clustering, you're given the data, it's fixed,
00:48:30.780 | and you find the clusters.
00:48:31.820 | Here, the embeddings at every level vary over time.
00:48:36.820 | They're determined by the top-down and bottom-up inputs
00:48:40.460 | and by inputs coming from nearby locations.
00:48:42.580 | So what you're doing is forming clusters
00:48:45.020 | rather than discovering them in fixed data.
00:48:47.500 | And that's got a somewhat different flavor
00:48:49.620 | and can settle down faster.
00:48:55.740 | And one other advantage of this replication:
00:48:58.380 | what you don't want is to have much more work
00:49:02.580 | in your transformer as you go to higher levels.
00:49:06.220 | But you do need longer range interactions at higher levels.
00:49:09.220 | Presumably for the lowest levels,
00:49:10.380 | you want fairly short range interactions
00:49:12.140 | in your transformer, and they could be dense.
00:49:14.860 | As you go to higher levels,
00:49:15.820 | you want much longer range interactions.
00:49:17.900 | So you could make them sparse,
00:49:20.380 | and people have done things like that for BERT-like systems.
00:49:24.940 | Here, it's easy to make them sparse
00:49:28.340 | 'cause you're expecting big islands.
00:49:30.460 | So all you need to do is see one patch of a big island
00:49:34.980 | to know what the vector representation of that island is.
00:49:37.980 | And so sparse representations will work much better
00:49:40.660 | if you have these big islands of agreement as you go up.
00:49:43.060 | So the idea is you have longer range
00:49:44.580 | and sparser connections as you go up.
00:49:46.260 | So the amount of computation is the same in every level.
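A small sketch of that level-dependent sparsity, assuming a 1-D row of locations where each level doubles the spacing between attended neighbors; the stride-doubling scheme and the fixed neighbor count are illustrative assumptions, chosen so that longer-range but sparser attention keeps the work per location constant across levels.

```python
def attention_neighbors(location, n_locations, level, n_neighbors=8):
    """Indices that a location attends to at a given level: dense and short-range at
    level 0, longer-range but sparser higher up, so the work per location is constant."""
    stride = 2 ** level
    offsets = range(-(n_neighbors // 2), n_neighbors // 2 + 1)
    neighbors = [location + o * stride for o in offsets]
    return [i for i in neighbors if 0 <= i < n_locations and i != location]

# Example: what location 32 attends to in a row of 64 patches, level by level.
for level in range(4):
    print(level, attention_neighbors(32, 64, level))
```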
00:49:50.620 | And just to summarize,
00:49:51.780 | I showed how GLOM combines
00:49:56.820 | three important advances in neural networks.
00:49:58.380 | I didn't actually talk about neural fields,
00:50:01.260 | and that's important for the top-down network.
00:50:04.260 | Maybe since I've got two minutes to spare,
00:50:05.820 | I'm gonna go back and mention neural fields very briefly.
00:50:08.660 | Yeah, when I train that top-down neural network,
00:50:17.060 | I have a problem.
00:50:18.740 | And the problem is,
00:50:20.180 | if you look at those red arrows and those green arrows,
00:50:24.940 | they're quite different.
00:50:28.140 | But if you look at the level above, the object level,
00:50:31.180 | all those vectors are the same.
00:50:32.740 | And of course, in an engineered system,
00:50:36.980 | I want to replicate the neural nets in every location.
00:50:39.580 | So I use exactly the same
00:50:40.660 | top-down and bottom-up neural nets everywhere.
00:50:43.820 | And so the question is,
00:50:44.660 | how can the same neural net be given a black arrow
00:50:48.500 | and sometimes produce a red arrow
00:50:51.020 | and sometimes produce a green arrow,
00:50:52.300 | which have quite different orientations?
00:50:54.580 | How can it produce a nose where there's nose
00:50:57.620 | and a mouth where there's mouth,
00:50:59.220 | even though the face vector is the same everywhere?
00:51:02.340 | And the answer is,
00:51:03.940 | the top-down neural network doesn't just get
00:51:07.420 | the face vector,
00:51:09.340 | it also gets the location of the patch
00:51:12.380 | for which it's producing the part vector.
00:51:15.020 | So the three patches that should get the red vector
00:51:18.380 | are in different locations
00:51:20.980 | from the three patches that should get the green vector.
00:51:23.740 | So if I use a neural network
00:51:25.180 | that gets the location as input as well,
00:51:27.420 | here's what it can do.
00:51:28.940 | It can take the pose that's encoded in that black vector,
00:51:32.020 | the pose of the face.
00:51:33.180 | It can take the location in the image
00:51:38.060 | for which it's predicting the vector of the level below.
00:51:41.860 | And the pose is relative to the image too.
00:51:44.220 | So knowing the location in the image
00:51:45.620 | and knowing the pose of the whole face,
00:51:47.260 | it can figure out which bit of the face
00:51:49.020 | it needs to predict at that location.
00:51:52.980 | And so in one location, it can predict,
00:51:54.660 | okay, there should be nose there
00:51:56.460 | and it gives you the red vector.
00:51:57.900 | In another location, it can predict
00:51:59.900 | from where that image patch is,
00:52:01.220 | there should be mouth there,
00:52:02.620 | so it can give you the green arrow.
00:52:04.700 | So you can get the same vector at the level above
00:52:06.940 | to predict different vectors
00:52:08.260 | in different places at the level below
00:52:10.100 | by giving it the place that it's predicting for.
00:52:12.340 | And that's what's going on in neural fields.
00:52:14.540 | Okay.
00:52:19.940 | Now, this was quite a complicated talk.
00:52:23.420 | There's a long paper about it on arXiv
00:52:25.700 | that goes into much more detail.
00:52:27.300 | And you could view this talk as just an encouragement
00:52:30.860 | to read that paper.
00:52:32.620 | And I'm done, exactly on time.
00:52:39.140 | - Okay.
00:52:39.980 | - Thank you.
00:52:40.820 | - Thanks a lot.
00:52:41.660 | - Yeah.
00:52:42.500 | (upbeat music)
00:52:45.100 | [BLANK_AUDIO]