Stanford CS25: V2 | Represent part-whole hierarchies in a neural network, Geoff Hinton
Chapters
0:00
1:55 Why it is hard to make real neural networks learn part-whole hierarchies • Each image has a different parse tree. • Real neural networks cannot dynamically allocate neurons to represent nodes in a parse tree: what a neuron does is determined by the weights on its connections, and the weights change only slowly.
20:45 A brief introduction to contrastive learning of visual representations • Contrastive self-supervised learning uses the similarity between activity vectors produced from different patches of the same image as the objective
43:39 How to implement multimodal predictions in the joint space of identity and pose • Each neuron in the embedding vector for the object is a basis function that represents a vague distribution in the log probability space
Before we start, I gave the same talk at Stanford 00:00:13.440 |
I could just give one talk and both audiences come, 00:00:15.680 |
but they will prefer it as two separate talks. 00:00:28.440 |
What I'm gonna do is combine some recent ideas 00:00:33.960 |
how a neural network could represent part-whole hierarchies 00:00:36.880 |
without violating any of the basic principles 00:00:50.340 |
I started writing a design document for a system 00:00:52.440 |
and in the end, I decided the design document 00:00:56.280 |
So this is just vaporware, it's stuff that doesn't exist. 00:00:59.960 |
but somehow I find it easy to explain the ideas 00:01:14.400 |
are doing engineering and they don't really care 00:01:18.800 |
They're not trying to understand how the brain works, 00:01:25.380 |
weight sharing is fine in a convolutional neural net. 00:01:27.980 |
Some researchers, particularly computational neuroscientists 00:01:32.760 |
investigate neural networks, artificial neural networks 00:01:38.800 |
I think we still got a lot to learn from the brain 00:01:45.200 |
the only thing that kept research on neural networks going 00:01:49.880 |
to make these things learn complicated things 00:02:09.560 |
you can't just grab a bunch of neurons and say, 00:02:16.800 |
You can't just set the weights of the neurons 00:02:20.520 |
What a neuron does is determined by its connections 00:02:22.900 |
and they only change slowly, at least probably, 00:02:45.700 |
and say, "This is gonna represent a node in the parse tree 00:02:48.740 |
and I'm gonna give it pointers to other nodes, 00:02:50.860 |
other bits of memory that represent other nodes." 00:03:00.620 |
"Because you can't allocate neurons on the fly, 00:03:10.600 |
And most of these groups of neurons for most images 00:03:13.700 |
are gonna be silent, a few are gonna be active. 00:03:18.500 |
we have to dynamically hook them up into a parse tree. 00:03:28.260 |
And I had some very competent people working with me 00:03:46.020 |
There's other ideas I've had that just don't wanna work. 00:03:49.020 |
Capsules was sort of in between and we got it working. 00:03:53.980 |
that could be seen as a funny kind of capsules model 00:04:04.640 |
each capsule can represent any kind of thing. 00:04:14.640 |
So the imaginary system I'll talk about is called GLOM. 00:04:19.640 |
And in GLOM, hardware gets allocated to columns 00:04:25.240 |
and each column contains multiple levels of representation 00:04:29.920 |
of what's happening in a small patch of the image. 00:04:48.680 |
And the idea for representing part-whole hierarchies 00:04:54.860 |
between the embeddings at these different levels. 00:05:00.760 |
you'd like the same embedding for every patch of the image 00:05:03.760 |
'cause that patch is a patch of the same scene everywhere. 00:05:08.400 |
you'd like the embeddings of all the different patches 00:05:15.960 |
you're trying to make things more and more the same 00:05:18.360 |
and that's how you're squeezing redundancy out. 00:05:21.140 |
The embedding vectors are the things that act like pointers 00:05:28.600 |
They're neural activations rather than neural weights. 00:05:31.280 |
So it's fine to have different embedding vectors 00:05:47.160 |
and you'd have something like a convolutional neural net 00:05:58.240 |
that say what's going on in each particular patch. 00:06:04.600 |
Of course, these embeddings are thousands of dimensions, 00:06:15.480 |
where the two vectors are the same by using the orientation. 00:06:20.680 |
all the patches will have different representations. 00:06:23.960 |
But the next level up, the first two patches, 00:06:27.360 |
they might be part of a nostril, for example. 00:06:31.160 |
And so, yeah, they'll have the same embedding. 00:06:39.460 |
the first three patches might be part of a nose. 00:06:51.660 |
those three red vectors are all meant to be the same. 00:06:55.180 |
So what we're doing is we're getting the same representation 00:06:58.100 |
for things that are superficially very different. 00:07:03.740 |
by giving the same representation to different things. 00:07:20.600 |
So the islands of agreement are what capture the parse tree. 00:07:27.240 |
Now, they're a bit more powerful than a parse tree. 00:07:29.740 |
They can capture things like shut the heck up. 00:07:33.980 |
You can have 'shut' and 'up' be different vectors at one level, 00:07:38.660 |
while at a higher level 'shut' and 'up' have exactly the same vector, 00:07:41.820 |
namely the vector for 'shut up', and they can be disconnected. 00:07:53.140 |
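The islands-of-agreement idea can be made concrete with a small sketch: at a given level, an island is just a maximal run of adjacent patches whose vectors (nearly) coincide. The sketch below is only illustrative; it assumes a one-dimensional strip of patches, and the helper name `find_islands`, the cosine-similarity test, and the 0.99 threshold are editorial choices rather than anything specified in the talk.

```python
import numpy as np

def find_islands(embeddings, threshold=0.99):
    """Group adjacent patches whose vectors at one level (nearly) agree.

    embeddings: (num_patches, dim) array holding the level-L vector of
    each column, listed in spatial order along a 1-D strip of patches.
    Returns (start, end) index ranges; each range is one "island" of
    patches sharing essentially the same vector.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    islands, start = [], 0
    for i in range(1, len(unit)):
        # A new island begins wherever adjacent patches stop agreeing.
        if unit[i] @ unit[i - 1] < threshold:
            islands.append((start, i))
            start = i
    islands.append((start, len(unit)))
    return islands

# Toy example: five patches; the middle three share one higher-level vector.
v_a, v_b = np.random.randn(2, 16)
patches = np.stack([v_a, v_b, v_b, v_b, v_a])
print(find_islands(patches))   # -> [(0, 1), (1, 4), (4, 5)]
```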
you can think of each of these levels as an Ising model 00:07:58.620 |
with real valued vectors rather than binary spins. 00:08:02.980 |
And you can think of them being coordinate transforms 00:08:05.100 |
between levels, which makes it much more complicated. 00:08:08.380 |
And then this is a kind of multi-level Ising model, 00:08:12.460 |
but with complicated interactions between the levels, 00:08:21.020 |
between a nose and a face, but we'll come to that later. 00:08:35.780 |
is particularly relevant for a natural language course 00:08:40.740 |
by trying to prove to you that coordinate systems 00:08:45.020 |
are not just something invented by Descartes. 00:08:47.220 |
Coordinate systems were invented by the brain 00:08:52.060 |
a long time ago, and we use coordinate systems 00:08:54.940 |
in understanding what's going on in an image. 00:08:58.080 |
I also want to demonstrate the psychological reality 00:09:05.680 |
that I invented a long time ago in the 1970s, 00:09:16.180 |
So I want you to imagine on the tabletop in front of you, 00:09:25.060 |
and it's in the standard orientation for a cube, 00:09:59.660 |
and the other finger is vertically above it, like that. 00:10:08.700 |
where that thing that was a body diagonal is now vertical. 00:10:12.140 |
And all you've gotta do is take the bottom finger, 00:10:24.300 |
hold your top finger at the other end of that diagonal 00:10:28.780 |
and just point to where the other corners are. 00:10:39.460 |
And I can see that some of you aren't pointing, 00:10:48.820 |
is to say they're here, here, here, and here. 00:11:01.820 |
'cause if you imagine the cube in the normal orientation 00:11:04.820 |
and count the corners, there's eight of them. 00:11:12.040 |
So one theory is that when you rotated the cube, 00:11:24.060 |
is you have no idea where the other corners are, 00:11:26.660 |
unless you're something like a crystallographer. 00:11:32.900 |
of the other corners, what structure they form. 00:11:45.340 |
well, okay, I don't know where the bits of a cube are, 00:11:52.580 |
I know a cube has this four-fold rotational symmetry, 00:12:00.140 |
And so what people do is they preserve the symmetries 00:12:07.040 |
Now, what they've actually pointed out if they do that 00:12:11.620 |
is two pyramids, each of which has a square base. 00:12:16.180 |
One's upside down, and they're stuck base to base. 00:12:36.340 |
at the cost of doing something pretty radical, 00:12:38.980 |
which is changing faces to vertices and vertices to faces. 00:12:43.980 |
The thing you pointed out if you did that was an octahedron. 00:13:03.020 |
which is changed faces for vertices and vertices for faces. 00:13:05.980 |
I should show you what the answer looks like. 00:13:10.120 |
So I'm gonna step back and try and get enough light, 00:13:26.460 |
form a kind of zigzag ring around the middle. 00:13:31.580 |
So the colored rods here are the other edges of the cube, 00:13:53.420 |
This is just a completely different model of a cube. 00:13:55.940 |
It's so different, I'll give it a different name. 00:13:58.860 |
And the thing to notice is a hexahedron and a cube 00:14:07.540 |
You wouldn't even know one was the same as the other 00:14:09.860 |
if you think about one as a hexahedron and one as a cube. 00:14:12.620 |
It's like the ambiguity between a tilted square 00:14:22.340 |
that people really do use coordinate systems. 00:14:32.300 |
and asking you to describe it relative to that vertical axis, 00:14:35.340 |
then familiar things become completely unfamiliar. 00:14:38.260 |
And when you do see them relative to this new frame, 00:14:44.300 |
Notice that things like convolutional neural nets 00:14:48.580 |
and have two utterly different internal representations 00:15:06.260 |
The same green flap sloping upwards and outwards. 00:15:08.980 |
Now we have a red flap sloping downwards and outwards, 00:15:14.460 |
and you just have the two ends of the rectangle. 00:15:17.180 |
And if you perceive this and now close your eyes 00:15:21.220 |
and ask you, were there any parallel edges there? 00:15:24.540 |
You're very well aware that those two blue edges 00:15:27.260 |
were parallel, and you're typically not aware 00:15:31.300 |
even though you know by symmetry there must be other pairs. 00:15:34.380 |
Similarly with the crown, if you see the crown, 00:15:45.060 |
you're using for those flaps don't line up with the edges. 00:15:50.620 |
if they line up with the coordinate system you're using. 00:15:55.020 |
the parallel edges align with the coordinate system. 00:15:59.020 |
So you're aware that those two blue edges are parallel, 00:16:01.300 |
but you're not aware that one of the green edges 00:16:12.580 |
where when it flips, you think that what's out there 00:16:15.220 |
in reality is different, things are at a different depth. 00:16:18.340 |
This is like, next weekend, we shall be visiting relatives. 00:16:23.620 |
"Next weekend, we shall be visiting relatives," 00:16:39.620 |
They happen to have the same truth conditions. 00:16:41.660 |
They mean the same thing in the sense of truth conditions, 00:16:50.740 |
No disagreement about what's going on in the world, 00:16:52.500 |
but two completely different ways of seeing the sentence. 00:17:12.380 |
So you have nodes for all the various parts in the hierarchy. 00:17:29.740 |
between the intrinsic frame of reference of the crown 00:17:32.860 |
and the intrinsic frame of reference of the flap. 00:17:40.420 |
So that kind of relationship will be a good thing 00:17:47.100 |
to be able to recognize shapes independently of viewpoint. 00:18:01.820 |
where I've added the things in the heavy blue boxes. 00:18:06.460 |
They're the relationship between a node and the viewer. 00:18:15.220 |
between the intrinsic frame of reference of the crown 00:18:18.260 |
and the intrinsic frame of reference of the viewer, 00:18:23.580 |
And that's a different kind of thing altogether, 00:18:26.980 |
'cause as you change viewpoint, that changes. 00:18:31.060 |
all those things in blue boxes all change together 00:18:44.420 |
So you can easily propagate viewpoint information 00:18:59.980 |
That makes sense of a lot of properties of mental images. 00:19:04.980 |
Like if you want to do any reasoning with things like RWX, 00:19:14.900 |
And I want to do one more demo to convince you 00:19:22.900 |
mental imagery problem at the risk of running over time. 00:19:37.540 |
What's your direction back to your starting point? 00:19:42.940 |
It's sort of a bit South and quite a lot West, right? 00:19:46.340 |
It's not exactly Southwest, but it's sort of Southwest. 00:19:56.220 |
is you went a mile East, and then you went a mile North, 00:20:03.300 |
You didn't imagine that you went a mile East, 00:20:07.940 |
You could have solved the problem perfectly well 00:20:09.620 |
with North not being up, but you had North up. 00:20:18.820 |
You go a mile East, and then a mile North, and so on. 00:20:23.340 |
in a particular orientation, and in a particular position. 00:20:35.540 |
that involve using relationships between things, 00:20:43.780 |
So I'm now gonna give you a very brief introduction 00:20:50.060 |
So this is a complete disconnect in the talk, 00:21:02.140 |
what we try and do is make two different crops of an image 00:21:08.300 |
There's a paper a long time ago by Becker and Hinton 00:21:18.620 |
like the continuity of surfaces or the depth of surfaces. 00:21:27.300 |
and it's been used for doing things like classification. 00:21:40.100 |
"that contains sort of any part of that object, 00:21:49.260 |
And this has been developed a lot in the last few years. 00:21:54.140 |
I'm gonna talk about a model developed a couple of years ago 00:22:11.900 |
and you also do colour distortion of the crops, 00:22:16.980 |
And that's to prevent it from using colour histograms 00:22:32.140 |
You then put those through the same neural network, F, 00:22:41.140 |
and you put it through another neural network, 00:22:47.260 |
That's an extra complexity I'm not gonna explain, 00:23:03.820 |
okay, let's start off with random neural networks, 00:23:10.460 |
and let's put them through these transformations. 00:23:15.660 |
So let's back propagate the squared difference 00:23:21.140 |
And hey, presto, what you discover is everything collapses. 00:23:25.540 |
For every image, it will always produce the same Zi and Zj. 00:23:39.220 |
and different when you get two crops of different images. 00:23:50.420 |
You have to show it crops from different images 00:23:57.500 |
you don't try and make them a lot more different. 00:23:59.980 |
It's very easy to make things very different, 00:24:03.300 |
You just wanna be sure they're different enough. 00:24:09.420 |
So if they happen to be very similar, you push them apart. 00:24:12.300 |
And that stops your representations collapsing. 00:24:17.340 |
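What is being described here is essentially the SimCLR-style contrastive objective: pull the projections of two crops of the same image together, and use crops of the other images in the batch as negatives so that the representations cannot all collapse onto one point. A minimal sketch, with the function name `nt_xent_loss`, the temperature, and the batch size as illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """Contrastive loss over a batch of paired crop projections.

    z_i, z_j: (batch, dim) projections of two crops of the same images.
    Crops of the same image are pulled together; crops of different
    images act as negatives, which is what stops everything collapsing.
    """
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)        # (2B, d)
    sim = z @ z.t() / temperature                                # all-pairs similarity
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))                   # ignore self-similarity
    batch = z_i.shape[0]
    # The positive for each row is the other crop of the same image.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)

# Toy usage with random stand-ins for the projection-head outputs.
print(nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128)).item())
```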
So what you can do is do unsupervised learning 00:24:32.620 |
you just take your representation of the image patch 00:24:39.460 |
So you multiply the representation by a weight matrix, 00:24:42.140 |
put it through a softmax and get class labels. 00:24:48.780 |
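A tiny sketch of that linear read-out: the representations from the unsupervised encoder are frozen, and the only thing fitted on labelled data is one weight matrix followed by a softmax. The shapes, learning rate, and number of steps below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

features = torch.randn(1000, 512)         # frozen representations of labelled images
labels = torch.randint(0, 10, (1000,))    # their class labels

probe = nn.Linear(512, 10)                # the only weights trained on labels
opt = torch.optim.SGD(probe.parameters(), lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = F.cross_entropy(probe(features), labels)   # softmax + cross-entropy
    loss.backward()
    opt.step()
```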
And what you discover is that that's just about as good 00:24:56.020 |
So now the only thing you've trained on labelled data 00:25:00.380 |
The previous layers were trained on unlabelled data 00:25:03.540 |
and you've managed to train your representations 00:25:17.020 |
but it's really confounding objects and whole scenes. 00:25:20.500 |
So it makes sense to say two different patches 00:25:23.700 |
from the same scene should get the same vector label 00:25:28.700 |
at the scene level 'cause they're from the same scene. 00:25:35.660 |
and another patch contain bits of objects A and C? 00:25:39.540 |
to have the same representation at the object level. 00:25:49.300 |
if you don't use any kind of gating or attention, 00:25:53.180 |
then what's happening is you're really doing learning 00:26:02.060 |
you get at the object level should be the same 00:26:09.100 |
but should be different if one patch is from object A 00:26:13.020 |
And to do that, we're gonna need some form of attention 00:26:14.940 |
to decide whether they really come from the same thing. 00:26:25.940 |
in order not to try and say things are the same 00:26:42.420 |
And in BERT, you have that whole column of representations 00:26:46.700 |
In BERT, what's happening presumably as you go up 00:26:50.740 |
is you're getting semantically richer representations. 00:26:55.740 |
But in BERT, there's no attempt to get representations 00:27:09.540 |
you get bigger and bigger islands of agreement. 00:27:31.220 |
well, 'New' is probably a thing in its own right, 00:27:31.220 |
would all have exactly the same representation 00:27:40.740 |
And that will be a representation of a compound thing. 00:27:49.020 |
And that's gonna be a much more useful kind of BERT 00:28:20.900 |
that's more complicated than the spatial coherence 00:28:25.260 |
tend to be at the same depth and same orientation 00:28:32.540 |
that says that if you find a mouth in an image 00:28:38.620 |
and then the right spatial relationship to make a face, 00:28:46.940 |
and we want to discover that kind of coherence in images. 00:29:02.900 |
as you've got a static image, a uniform resolution, 00:29:08.140 |
That's not how vision works in the real world. 00:29:20.260 |
And that gives you a sample of the optic array. 00:29:25.060 |
It turns the optic array, the incoming light, 00:29:30.580 |
And on your retina, you have high resolution in the middle 00:29:46.020 |
and processing where you're fixating at high resolution 00:29:49.140 |
and everything else at much lower resolution, 00:29:57.940 |
and all the complexity of how you put together 00:30:00.740 |
the information you get from different fixations 00:30:03.340 |
by saying, let's just talk about the very first fixation 00:30:16.500 |
but let's just think about the first fixation. 00:30:20.740 |
So finally, here's a picture of the architecture 00:30:23.220 |
and this is the architecture for a single location. 00:30:32.500 |
and it shows you what's happening for multiple frames. 00:30:40.820 |
but I only talk about applying it to static images. 00:30:47.500 |
in which the frames are all the same as each other. 00:30:50.900 |
So I'm showing you three adjacent levels in the hierarchy 00:31:20.500 |
So inside the box, we're gonna get an embedding 00:31:22.860 |
and the embedding is gonna be the representation 00:31:37.900 |
all of these embeddings will always be devoted 00:31:47.580 |
The level L embedding on the right-hand side, 00:31:51.860 |
you can see there's three things determining it there. 00:31:59.940 |
It's just saying you should sort of be similar 00:32:19.420 |
to do the coordinate transforms that are required. 00:32:22.380 |
And the blue arrow is basically taking information 00:32:26.340 |
at the level below of the previous time step. 00:32:32.660 |
might be representing that I think I might be a nostril. 00:32:38.260 |
what you predict at the next level up is a nose. 00:32:44.300 |
for the nostril, you can predict the coordinate frame 00:32:47.780 |
Maybe not perfectly, but you have a pretty good idea 00:32:50.100 |
of the orientation position scale of the nose. 00:32:58.100 |
that can take any kind of part of level L minus one. 00:33:06.820 |
and predict what you've got at the next level up. 00:33:16.860 |
So the red arrow is predicting the nose from the whole face. 00:33:29.220 |
'Cause if you know the coordinate frame of the face 00:33:32.260 |
and you know the relationship between a face and a nose, 00:33:54.020 |
That's all about a specific patch of the image. 00:34:04.340 |
It's a bit confusing exactly what the relation of this 00:34:14.420 |
that has a whole section on how this relates to BERT. 00:34:17.500 |
But it's confusing 'cause this has time steps. 00:34:39.540 |
And that's a very simplified form of a transformer. 00:35:02.100 |
be the same as the level L embedding in nearby columns. 00:35:08.300 |
You're only gonna try and make it be the same 00:35:17.300 |
You take the level L embedding in location X, that's LX. 00:35:59.540 |
It's trying to make you agree with nearby things. 00:36:25.340 |
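Putting those pieces together, a column's level-L vector at the next time step combines its own previous state, a bottom-up prediction computed from its level L-1 vector, a top-down prediction computed from its level L+1 vector, and an attention-weighted average of the level-L vectors in nearby columns, where the attention weights are just softmaxed similarities. A rough sketch; the equal weighting of the four contributions and the functional forms are assumptions, not numbers from the talk:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def glom_level_update(prev, below, above, neighbours,
                      bottom_up, top_down, beta=2.0):
    """One update of a level-L embedding in one column.

    prev       : (dim,) this column's level-L vector at the previous step
    below      : (dim,) this column's level-(L-1) vector at the previous step
    above      : (dim,) this column's level-(L+1) vector at the previous step
    neighbours : (n, dim) level-L vectors in nearby columns
    bottom_up, top_down : functions producing level-L predictions
    """
    # Attention-weighted average: agree with nearby vectors that are
    # already similar to you, which is what forms the islands.
    weights = softmax(beta * neighbours @ prev)
    attention_term = weights @ neighbours
    return (prev + bottom_up(below) + top_down(above) + attention_term) / 4.0

# Toy usage with identity functions standing in for the learned networks.
rng = np.random.default_rng(0)
new_vec = glom_level_update(prev=rng.standard_normal(8),
                            below=rng.standard_normal(8),
                            above=rng.standard_normal(8),
                            neighbours=rng.standard_normal((5, 8)),
                            bottom_up=lambda v: v, top_down=lambda v: v)
```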
that big island of agreement at the object level 00:36:28.460 |
is 'cause we're trying to get agreement there. 00:36:30.820 |
We're trying to learn the coordinate transform 00:36:58.700 |
If I'm looking at a line drawing, for example, 00:37:02.300 |
Well, a circle could be the right eye of a face 00:37:08.820 |
There's all sorts of things that circle could be. 00:37:19.940 |
Here we need a variational Markov random field 00:37:24.540 |
because the interaction between, for example, 00:37:40.980 |
does anybody out there support the idea I'm a nose? 00:37:43.580 |
Well, what you'd like to do is send to everything nearby 00:38:04.260 |
You'd send out a message to all nearby locations saying, 00:38:07.300 |
does anybody have a mouth with the pose that I predict 00:38:22.980 |
you're gonna have to send out a lot of different messages. 00:38:25.620 |
For each kind of other thing that might support you, 00:38:28.540 |
you're gonna need to send a different message. 00:38:29.860 |
So you're gonna need a multi-headed transformer 00:38:34.860 |
and it's gonna be doing these coordinate transforms 00:38:54.620 |
There's another way of doing it that's much simpler. 00:39:12.500 |
is you're gonna make each of the parts predict the whole. 00:39:24.580 |
Now these will be in different columns of GLOM, 00:39:24.580 |
So when you do this attention weighted averaging 00:39:42.460 |
what you're doing is you're getting confirmation 00:39:45.260 |
that the support for the hypothesis you've got, 00:39:49.900 |
I mean, suppose in one column you make the hypothesis 00:40:11.140 |
to what's going on in the same small patch of the image. 00:40:20.100 |
but it's just the standard transformer kind of attention. 00:40:22.940 |
You're just trying to agree with things that are similar. 00:40:26.460 |
And, okay, so that's how GLOM is meant to work. 00:40:31.180 |
And the big problem is that if I see a circle, 00:40:36.820 |
it might be a left eye, it might be a right eye, 00:40:44.500 |
at a particular level has to be able to represent anything, 00:40:54.220 |
So instead of trying to resolve ambiguity at the part level, 00:41:05.660 |
But the cost of that is I have to be able to represent 00:41:09.380 |
all the ambiguity I get at the next level up. 00:41:17.180 |
where you can actually preserve this ambiguity, 00:41:23.340 |
It's the kind of thing neural nets are good at. 00:41:25.700 |
So if you think about the embedding at the next level up, 00:41:35.460 |
and you wanna represent a highly multimodal distribution, 00:41:41.700 |
or a car with that pose, or a face with this pose, 00:41:45.420 |
All of these are possible predictions for finding a circle. 00:41:52.620 |
And the question is, can neural nets do that? 00:42:00.060 |
stands for an unnormalized log probability distribution 00:42:09.540 |
the sort of cross product of identities and poses. 00:42:16.940 |
log probability distribution over that space. 00:42:32.820 |
you can get a much more peaky log probability distribution. 00:42:37.780 |
And when you exponentiate to get a probability distribution, 00:42:50.260 |
and basis functions in the log probability in that space 00:42:53.660 |
can be combined to produce sharp conclusions. 00:42:56.180 |
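A small numerical illustration of that claim: if each active neuron contributes a broad, unnormalized log-probability bump over the joint identity-and-pose space, then adding a few of those vague bumps and exponentiating gives a much sharper distribution, effectively the intersection of what the individual neurons were vaguely asserting. The one-dimensional 'pose' axis and Gaussian bumps below are purely illustrative.

```python
import numpy as np

# A 1-D stand-in for the joint identity/pose space.
pose = np.linspace(-5, 5, 1001)

def vague_log_bump(centre, width=2.0):
    # One neuron's contribution: a broad, unnormalized log-probability bump.
    return -0.5 * ((pose - centre) / width) ** 2

# Several active neurons, each individually vague, all referring to one thing.
log_p = sum(vague_log_bump(c) for c in (-0.5, 0.3, 0.1, -0.2))

# Exponentiate and normalise: the combined distribution is far sharper
# than any single neuron's distribution.
p = np.exp(log_p - log_p.max())
p /= p.sum()
print(pose[p.argmax()])   # the sharp consensus estimate
```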
So I think that's how neurons are representing things. 00:43:04.940 |
they think about the thing that they're representing. 00:43:12.980 |
And so neurons have to be good at representing 00:43:28.820 |
because he couldn't think of how it was learned. 00:43:30.820 |
My view is neurons must be using this representation 00:43:34.500 |
'cause I can't think of any other way of doing it. 00:43:41.540 |
I just said all that 'cause I got ahead of myself 00:43:50.540 |
the reason you have these very vague distributions 00:44:00.860 |
and they're all trying to represent the thing 00:44:05.460 |
So you're only trying to represent one thing. 00:44:15.060 |
and you couldn't use these very vague distributions. 00:44:17.340 |
But so long as you know that all of these neurons, 00:44:20.940 |
all of the active neurons refer to the same thing, 00:44:26.780 |
You can add the log probability distribution together 00:44:28.980 |
and intersect the sets of things they represent. 00:44:39.100 |
Well, obviously you could train it the way you train BERT. 00:44:58.500 |
you then let GLOM settle down for about 10 iterations, 00:45:07.900 |
the lowest level representation of what's in the image, 00:45:19.580 |
and you're back propagating it through time in this network. 00:45:22.700 |
So it'll also back propagate up and down through the levels. 00:45:25.700 |
So you're basically just doing back propagation through time 00:45:30.180 |
of the error due to filling in things incorrectly. 00:45:39.580 |
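A toy sketch of that BERT-like procedure: hide some of the lowest-level patch vectors, let the levels settle for a number of iterations, and backpropagate the reconstruction error of the filled-in patches through those iterations. The two-level setup, the shared linear networks, and the plain averaging update below are drastic simplifications of what is being described (attention between columns, and the contrastive term mentioned next, are omitted).

```python
import torch
import torch.nn as nn

cols, dim, steps = 16, 32, 10
bottom_up = nn.Linear(dim, dim)            # shared across all columns
top_down = nn.Linear(dim, dim)
opt = torch.optim.Adam(list(bottom_up.parameters()) +
                       list(top_down.parameters()), lr=1e-3)

x = torch.randn(cols, dim)                 # "true" lowest-level patch vectors
mask = torch.zeros(cols, dtype=torch.bool)
mask[::4] = True                           # hide every fourth patch, BERT-style

low = torch.where(mask[:, None], torch.zeros_like(x), x)   # corrupted input
high = torch.zeros(cols, dim)                               # higher-level state

opt.zero_grad()
for _ in range(steps):                     # let the two levels settle
    new_high = 0.5 * (high + bottom_up(low))
    new_low = 0.5 * (low + top_down(high)) # top-down fill-in of the lowest level
    low, high = new_low, new_high

loss = ((low[mask] - x[mask]) ** 2).mean() # error on the filled-in patches only
loss.backward()                            # backprop through time and the levels
opt.step()
```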
But I also want to include an extra bit in the training 00:45:54.580 |
And you can do that by using contrastive learning. 00:46:15.380 |
at this level of representation in this location, 00:46:26.100 |
and also what was going on at the previous time step 00:46:37.620 |
And that's what you use for the next embedding. 00:46:46.900 |
if we try and make the predictions agree with the consensus, 00:46:54.180 |
from nearby locations that already roughly agree 00:47:02.700 |
and bottom up neural networks agree with the consensus, 00:47:07.380 |
with what's going on in nearby locations that are similar. 00:47:10.460 |
And so you'll be training it to form islands. 00:47:22.940 |
You might think it's wasteful to be replicating 00:47:35.020 |
that all have exactly the same vector representation. 00:47:39.860 |
but actually biology is full of things like that. 00:47:47.060 |
have pretty much the same vector of protein expressions. 00:47:47.060 |
you don't know which things should be the same 00:48:08.700 |
to represent what's going on there at the object level 00:48:11.340 |
gives you the flexibility to gradually segment things 00:48:19.740 |
And what you're doing is not quite like clustering. 00:48:22.340 |
You're creating clusters of identical vectors 00:48:25.580 |
rather than discovering clusters in fixed data. 00:48:28.260 |
So clustering, you're given the data and it's fixed 00:48:31.820 |
Here, the embeddings at every level, they vary over time. 00:48:36.820 |
They're determined by the top-down and bottom-up inputs 00:48:58.380 |
is what you don't want is to have much more work 00:49:02.580 |
in your transformer as you go to higher levels. 00:49:06.220 |
But you do need longer range interactions at higher levels. 00:49:12.140 |
in your transformer, and they could be dense. 00:49:20.380 |
and people have done things like that for BERT-like systems. 00:49:30.460 |
So all you need to do is see one patch of a big island 00:49:34.980 |
to know what the vector representation of that island is. 00:49:37.980 |
And so sparse, longer-range attention will work much better 00:49:37.980 |
if you have these big islands of agreement as you go up. 00:49:46.260 |
So the amount of computation is the same in every level. 00:49:51.780 |
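One way to keep the work per level constant, in the spirit of what was just said: at higher levels the attention window grows, but each column still attends to only a fixed-size sample of columns inside that window, which is enough because one sampled patch of a big island already tells you that island's vector. The window-growth rule and sample size below are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_partners(col, level, num_cols, k=8):
    """Pick which columns a given column attends to at a given level.

    Higher levels look over a longer range, but only at a fixed-size
    random sample of columns, so the amount of attention work stays
    the same at every level.
    """
    radius = 2 ** (level + 1)                      # range grows with level
    lo, hi = max(0, col - radius), min(num_cols, col + radius + 1)
    candidates = np.arange(lo, hi)
    candidates = candidates[candidates != col]
    k = min(k, len(candidates))
    return rng.choice(candidates, size=k, replace=False)

print(attention_partners(col=50, level=1, num_cols=200))  # short range
print(attention_partners(col=50, level=5, num_cols=200))  # long range, same k
```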
I showed how to combine three important advances 00:50:01.260 |
and that's important for the top-down network. 00:50:05.820 |
I'm gonna go back and mention neural fields very briefly. 00:50:08.660 |
Yeah, when I train that top-down neural network, 00:50:20.180 |
if you look at those red arrows and those green arrows, 00:50:28.140 |
But if you look at the level above, the object level, 00:50:36.980 |
I want to replicate the neural nets in every location. 00:50:40.660 |
top-down and bottom-up neural nets everywhere. 00:50:44.660 |
how can the same neural net be given a black arrow 00:50:59.220 |
even though the face vector is the same everywhere? 00:51:15.020 |
So the three patches that should get the red vector 00:51:20.980 |
from the three patches that should get the green vector. 00:51:28.940 |
It can take the pose that's encoded in that black vector, 00:51:38.060 |
for which it's predicting the vector of the level below. 00:52:04.700 |
So you can get the same vector at the level above 00:52:10.100 |
by giving it the place that it's predicting for. 00:52:27.300 |
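That is the neural-field trick: the top-down network is given the image location it is predicting for, alongside the object-level vector, so the same face vector can yield a nose-part prediction at one place and a mouth-part prediction at another. A hedged sketch, with the architecture and dimensions as placeholders:

```python
import torch
import torch.nn as nn

class TopDownField(nn.Module):
    """Top-down predictor used as a neural field."""

    def __init__(self, dim=32, loc_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + loc_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, object_vec, location):
        # object_vec: the vector shared by every patch in the island
        # location  : coordinates of the patch being predicted
        return self.net(torch.cat([object_vec, location]))

top_down = TopDownField()
face = torch.randn(32)                                   # same face vector everywhere
nose_part = top_down(face, torch.tensor([0.5, 0.4]))     # one part vector here...
mouth_part = top_down(face, torch.tensor([0.5, 0.7]))    # ...a different one there
```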
And you could view this talk as just an encouragement