Deep Learning for Computer Vision (Andrej Karpathy, OpenAI)
Chapters
0:00 Deep Learning for Computer Vision
4:23 Computer Vision 2011
9:59 Transfer Learning
11:16 The power is easily accessible.
11:58 ConvNets are everywhere...
16:58 Convolution Layer
19:25 For example, if we had 6 5x5 filters, we'll get 6 separate activation maps
24:56 MAX POOLING
32:13 Case Study: AlexNet
52:34 Addressing other tasks...
53:24 Image Classification: thing = a vector of probabilities for different classes
54:09 Localization
54:48 Reinforcement Learning
55:38 Segmentation
57:26 Variational Autoencoders
57:41 Detection
59:26 Dense Image Captioning
00:00:03.160 |
So today I'll speak about deep learning, especially in the context of computer vision. 00:00:07.880 |
So what you saw in the previous talk is neural networks. 00:00:10.760 |
So you saw that neural networks are organized into these layers, fully connected layers, 00:00:14.720 |
where neurons within one layer are not connected to each other, but they're connected fully to all the neurons in the next layer. 00:00:19.920 |
And we saw that basically we have this layer-wise structure from input until output. 00:00:25.280 |
And there are neurons and nonlinearities, et cetera. 00:00:27.720 |
Now so far we have not made too many assumptions about the inputs. 00:00:31.140 |
So in particular, here we just assume that an input is some kind of a vector of numbers 00:00:37.340 |
So that's both a bug and a feature to some extent, because in most real world applications 00:00:44.700 |
we actually can make some assumptions about the input that makes learning much more efficient. 00:00:52.140 |
So in particular, usually we don't just want to plug into neural networks vectors of numbers, 00:00:58.460 |
but they actually have some kind of a structure. 00:01:00.180 |
So we don't have vectors of numbers, but these numbers are arranged in some kind of a layout, 00:01:07.040 |
So for example, spectrograms are two-dimensional arrays of numbers. 00:01:09.780 |
Images are three-dimensional arrays of numbers. 00:01:12.060 |
Videos would be four-dimensional arrays of numbers. 00:01:14.260 |
Text you could treat as one-dimensional array of numbers. 00:01:17.060 |
And so whenever you have this kind of local connectivity structure in your data, then 00:01:21.500 |
you'd like to take advantage of it, and convolutional neural networks allow you to do that. 00:01:26.140 |
So before I dive into convolutional neural networks and all the details of the architectures, 00:01:30.020 |
I'd like to briefly talk about a bit of the history of how this field evolved over time. 00:01:34.800 |
So I like to start off usually with talking about Hubel and Wiesel and the experiments they did on the cat visual cortex in the 1960s. 00:01:40.660 |
So what they were doing is trying to study the computations that happened in the early visual cortex of the cat. 00:01:48.220 |
And so they had cats, and they plugged in electrodes that could record from the different neurons in the visual cortex. 00:01:53.700 |
And then they showed the cat different patterns of light. 00:01:56.180 |
And they were trying to debug neurons effectively and try to show them different patterns and see what those neurons respond to. 00:02:01.780 |
And a lot of these experiments inspired some of the modeling that came in afterwards. 00:02:07.100 |
So in particular, one of the early models that tried to take advantage of some of the 00:02:10.020 |
results of these experiments was the model called the Neocognitron from Fukushima in the 1980s. 00:02:18.260 |
And so what you saw here was this architecture that, again, is layer-wise, similar to what 00:02:22.020 |
you see in the cortex, where you have these simple and complex cells, where the simple 00:02:26.020 |
cells detect small things in the visual field. 00:02:29.720 |
And then you have this local connectivity pattern, and the simple and complex cells 00:02:32.860 |
alternate in this layered architecture throughout. 00:02:36.300 |
And so this looks a bit like a ConvNet, because you have some of its features, like, say, 00:02:41.860 |
But at the time, this was not trained with backpropagation. 00:02:44.140 |
These were specific, heuristically chosen updates. 00:02:49.800 |
And this was unsupervised learning back then. 00:02:52.340 |
So the first time that we've actually used backpropagation to train some of these networks was in Yann LeCun's work in the 1990s. 00:02:59.000 |
And so this is an example of one of the networks that was developed back then, in the 1990s, 00:03:06.220 |
And this is what you would recognize today as a convolutional neural network. 00:03:09.280 |
So it has a lot of these convolutional layers. 00:03:13.820 |
And it's a similar kind of design to what you would see in Fukushima's Neocognitron. 00:03:18.180 |
But this was actually trained with backpropagation end-to-end using supervised learning. 00:03:26.260 |
And we're here in 2016, basically about 20 years later. 00:03:31.240 |
Now computer vision has, for a long time, kind of worked on larger images. 00:03:38.740 |
And a lot of these models back then were applied to very small kind of settings, like, say, 00:03:43.420 |
recognizing digits in zip codes and things like that. 00:03:46.780 |
And they were very successful in those domains. 00:03:48.700 |
But back at least when I entered computer vision, roughly 2011, it was thought that 00:03:54.540 |
these models would not scale up naively into large, complex images, that 00:03:59.900 |
they would be constrained to these toy tasks for a long time. 00:04:02.620 |
Or I shouldn't say toy, because these were very important tasks, but certainly like smaller-scale tasks. 00:04:08.180 |
And so in computer vision in roughly 2011, it was much more common to use a kind of these hand-designed, feature-based pipelines. 00:04:17.060 |
So when I entered my PhD in 2011 working on computer vision, you would run a state-of-the-art 00:04:21.020 |
object detector on this image, and you might get something like this, where cars were detected 00:04:28.020 |
And you would kind of just shrug your shoulders and say, well, that just happens sometimes. 00:04:31.240 |
You kind of just accept it as something that would just happen. 00:04:37.220 |
Things actually worked relatively decently, I should say. 00:04:39.580 |
But definitely there were many mistakes that you would not see today, about four years on. 00:04:47.100 |
And so a lot of computer vision kind of looked much more like this. 00:04:49.660 |
When you look into a paper that tried to do image classification, you would find this 00:04:53.900 |
section in the paper on the features that they used. 00:04:59.420 |
And so they would use GIST, HOG, et cetera, and then a second page of features, and so on. 00:05:07.940 |
And you would extract this kitchen sink of features and a third page here. 00:05:12.500 |
And so you end up with this very large, complex code base, because some of these feature types 00:05:16.660 |
are implemented in MATLAB, some of them in Python, some of them in C++. 00:05:20.180 |
And you end up with this large code base of extracting all these features, caching them, 00:05:23.520 |
and then eventually plugging them into linear classifiers to do some kind of visual recognition task. 00:05:32.700 |
But there was definitely room for improvement. 00:05:34.920 |
And so a lot of this changed in computer vision in 2012 with this paper from Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. 00:05:41.900 |
So this is the first time that someone took a convolutional neural network that is very 00:05:46.180 |
similar to the one that you saw from 1998 from Yann LeCun. 00:05:50.040 |
And I'll go into details of how they differ exactly. 00:05:56.340 |
And they trained it on a much bigger data set on GPUs. 00:05:58.860 |
And things basically ended up working extremely well. 00:06:01.060 |
And this is the first time that computer vision community has really noticed these models 00:06:08.540 |
So we saw that the performance of these models has improved drastically. 00:06:13.540 |
Here we are looking at the ImageNet ILSVRC visual recognition challenge over the years. 00:06:21.860 |
And you can see that from 2010 in the beginning, these were feature-based methods. 00:06:26.700 |
And then in 2012, we had this huge jump in performance. 00:06:29.740 |
And that was due to the first kind of convolutional neural network in 2012. 00:06:33.980 |
And then we've managed to push that over time. 00:06:38.500 |
I think the results for ImageNet Challenge 2016 are actually due to come out today. 00:06:44.560 |
But I don't think that actually they've come out yet. 00:06:59.100 |
Well, we'll get to find out very soon what happens right here. 00:07:02.900 |
Just to put this in context, by the way, because you're just looking at numbers, like 3.57, 00:07:08.900 |
So something that I did about two years ago now is that I tried to measure the human accuracy on this task. 00:07:15.500 |
And so what I did for that is I developed this web interface where I would show myself ImageNet images. 00:07:22.820 |
And then I had this interface here where I would have all the different classes of ImageNet. 00:07:30.260 |
And then basically, you go down this list and you scroll for a long time and you find the class you think is correct. 00:07:36.060 |
And then I competed against the ConvNet at the time. 00:07:50.260 |
Well, some of the things, like HotDog seems very easy. 00:07:54.580 |
Well, it turns out that some of the images in a test set of ImageNet are actually mislabeled. 00:07:58.940 |
But also, some of the images are just very difficult to guess. 00:08:02.340 |
So in particular, if you have this terrier, there's 50 different types of terriers. 00:08:05.700 |
And it turns out to be a very difficult task to find exactly which type of terrier that is. 00:08:12.140 |
It turns out that convolutional neural networks are actually extremely good at this. 00:08:16.100 |
And so this is where I would lose points compared to the ConvNet. 00:08:20.080 |
So I estimate that the human error rate based on this is roughly in the 2% to 5% range, depending 00:08:23.940 |
on how much time you have and how much expertise you have and how many people you involve and 00:08:28.320 |
how much they really want to do this, which is not too much. 00:08:36.260 |
And I think the error rate, if I remember correctly, was about 1.5%. 00:08:40.700 |
So if we get below 1.5%, I would be extremely suspicious on ImageNet. 00:08:46.680 |
So to summarize, basically, what we've done is, before 2012, computer vision looked somewhat 00:08:53.060 |
like this, where we had these feature extractors. 00:08:54.940 |
And then we trained a small portion at the end of the feature extraction step. 00:08:59.800 |
And so we only trained this last piece on top of these features that were fixed. 00:09:03.640 |
And we've basically replaced the feature extraction step with a single convolutional neural network. 00:09:07.820 |
And now we train everything completely end to end. 00:09:12.080 |
So I'm going to go into details of how this works in a bit. 00:09:15.220 |
Also in terms of code complexity, we kind of went from a setup that looks-- whoops. 00:09:23.000 |
We went from a setup that looks something like that in papers to something like, instead 00:09:27.200 |
of extracting all these things, we just say, apply 20 layers with 3 by 3 conv or something 00:09:35.100 |
But I think it's a correct first order statement to make, is that we've definitely seen that 00:09:39.560 |
we've reduced code complexity quite a lot, because these architectures are so homogeneous 00:09:46.180 |
So it's also remarkable that-- so we had this reduction in complexity. 00:09:51.900 |
One other thing that was quite amazing about the results in 2012 that is also a separate 00:09:56.040 |
thing that did not have to be the case, is that the features that you learn by training on ImageNet turn out to be generic. 00:10:02.380 |
And you can apply them in different settings. 00:10:04.380 |
So in other words, this transfer learning works extremely well. 00:10:08.280 |
And of course, I didn't go into details of convolutional networks yet. 00:10:12.040 |
And we have a sequence of layers, just like in a normal neural network. 00:10:16.540 |
And when you pre-train this network on ImageNet, then it turns out that the features that you 00:10:20.540 |
learn in the middle are actually transferable. 00:10:23.180 |
And you can use them on different data sets, and that this works extremely well. 00:10:28.240 |
You might imagine that you could have a convolutional network that works extremely well on ImageNet. 00:10:32.220 |
But when you try to run it on something else, like a birds data set or something, that it might not work well; but that's not what happens. 00:10:38.020 |
And that's a very interesting finding, in my opinion. 00:10:40.940 |
So people noticed this back in roughly 2013, after the first convolutional networks. 00:10:45.880 |
They noticed that you can actually take many computer vision data sets. 00:10:49.060 |
And it used to be that you would compete on all of these separately and design features for each of them separately. 00:10:53.980 |
And you can just shortcut all those steps that we had designed. 00:10:58.060 |
And you can just take these pre-trained features that you get from ImageNet. 00:11:01.740 |
And you can just train a linear classifier on every single data set on top of those features. 00:11:05.260 |
And you obtain many state-of-the-art results across many different data sets. 00:11:08.820 |
And so this was quite a remarkable finding back then, I believe. 00:11:16.300 |
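As a minimal sketch of that recipe (an illustration, not code from the talk; the feature files and names are hypothetical, and how you extract the ConvNet features depends on your library):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical cached ConvNet features (N x D) and labels (N,), extracted from a
    # network pretrained on ImageNet and run over your own data set's images.
    X_train, y_train = np.load('train_features.npy'), np.load('train_labels.npy')
    X_test, y_test = np.load('test_features.npy'), np.load('test_labels.npy')

    # The "linear classifier on top of pretrained features" step.
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    print('test accuracy:', clf.score(X_test, y_test))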
And the code complexity, of course, got much more manageable. 00:11:20.100 |
So now all this power is actually available to you with very few lines of code. 00:11:23.780 |
If you want to just use a convolutional network on images, it turns out to be only a few lines of code. 00:11:29.180 |
If you use, for example, Keras, which is one of the deep learning libraries that I'm going 00:11:32.020 |
to go into and I'll mention again later in the talk. 00:11:35.260 |
But basically, you just load a state-of-the-art convolutional neural network. 00:11:41.740 |
And it tells you that this is an African elephant inside that image. 00:11:45.380 |
And this took a couple hundred, or a couple of tens of milliseconds if you have a GPU. 00:11:49.700 |
And so everything got much faster, much simpler, works really well, transfers really well. 00:11:53.500 |
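To give a hedged idea of what "a few lines" looks like, here is roughly how this goes with the Keras applications API (the image path is a placeholder, and exact import paths vary a bit between Keras versions):

    import numpy as np
    from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
    from keras.preprocessing import image

    model = ResNet50(weights='imagenet')                            # load a pretrained ConvNet
    img = image.load_img('elephant.jpg', target_size=(224, 224))    # placeholder image path
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = model.predict(x)
    print(decode_predictions(preds, top=1))                         # e.g. 'African_elephant' with some probability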
So this was really a huge advance in computer vision. 00:11:55.980 |
And so as a result of all these nice properties, ConvNets today are everywhere. 00:11:59.940 |
So here is a collection of some of the things that I try to find across different applications. 00:12:07.020 |
So for example, you can search Google Photos for different types of categories, like in 00:12:16.420 |
You can-- of course, this is very relevant in self-driving cars. 00:12:21.100 |
Convolutional networks are very relevant there. 00:12:22.760 |
Medical image diagnosis, recognizing Chinese characters, doing all kinds of medical segmentation 00:12:29.280 |
Quite random tasks, like whale recognition and more generally many Kaggle challenges. 00:12:34.500 |
Satellite image analysis, recognizing different types of galaxies. 00:12:37.620 |
You may have seen recently that WaveNet from DeepMind, also a very interesting paper, 00:12:42.600 |
where they generate music and they generate speech. 00:12:47.580 |
And that's also just a ConvNet doing most of the heavy lifting here. 00:12:50.700 |
So it's a convolutional network on top of sound. 00:12:56.420 |
In the context of reinforcement learning and agent environment interactions, we've also 00:13:00.680 |
seen a lot of advances from using ConvNets as the core computational building block. 00:13:04.620 |
So when you want to play Atari games, or you want to play AlphaGo, or Doom, or StarCraft, 00:13:08.920 |
or if you want to get robots to perform interesting manipulation tasks, all of this uses ConvNets 00:13:13.940 |
as a core computational block to do very impressive things. 00:13:19.740 |
Not only are we using it for a lot of different applications, we're also finding uses in art. 00:13:28.100 |
So you can basically simulate what it looks like, what it feels like maybe to be on some 00:13:33.660 |
So you can take images and you can just hallucinate features using ConvNets. 00:13:37.020 |
Or you might be familiar with neural style, which allows you to take arbitrary images 00:13:40.340 |
and transfer arbitrary styles of different paintings like Van Gogh on top of them. 00:13:44.480 |
And this is all using convolutional networks. 00:13:46.620 |
The last thing I'd like to note that I find also interesting is that in the process of 00:13:50.620 |
trying to develop better computer vision architectures and trying to basically optimize for performance 00:13:56.000 |
on the ImageNet challenge, we've actually ended up converging to something that potentially 00:14:00.100 |
might function something like your visual cortex in some ways. 00:14:03.520 |
And so these are some of the experiments that I find interesting where they've studied macaque 00:14:07.200 |
monkeys and they record from a subpopulation of the IT cortex. 00:14:13.660 |
This is the part that does a lot of object recognition. 00:14:16.940 |
So basically, they take a monkey and they take a ConvNet and they show them images. 00:14:20.540 |
And then you look at how those images are represented at the end of this network. 00:14:24.280 |
So inside the monkey's brain or on top of your convolutional network. 00:14:27.460 |
And so you look at representations of different images. 00:14:29.460 |
And then it turns out that there's a mapping between those two spaces that actually seems 00:14:33.460 |
to indicate to some extent that some of the things we're doing somehow ended up converging 00:14:37.680 |
to something that the brain could be doing as well in the visual cortex. 00:14:43.380 |
I'm now going to dive into convolutional networks and try to explain briefly how these networks 00:14:50.500 |
Of course, there's an entire class on this that I taught, which is a convolutional networks class at Stanford. 00:14:54.460 |
And so I'm going to distill some of those 13 lectures into one lecture. 00:15:02.900 |
So convolutional neural network is really just a single function. 00:15:06.380 |
It's a function from the raw pixels of some kind of an image. 00:15:15.180 |
You take the raw pixels, you put them through this function, and you get 1,000 numbers at the end, 00:15:19.540 |
in the case of image classification, if you're trying to categorize images into 1,000 different classes. 00:15:24.520 |
And really, functionally, all that's happening in a convolutional network is just dot products 00:15:31.400 |
They're wired up together in interesting ways so that you are basically doing visual recognition. 00:15:36.460 |
And in particular, this function f has a lot of knobs in it. 00:15:40.620 |
So these W's here that participate in these dot products and in these convolutions and 00:15:44.140 |
fully connected layers and so on, these W's are all parameters of this network. 00:15:48.220 |
So normally, you might have about on the order of 10 million parameters. 00:15:51.600 |
And those are basically knobs that change this function. 00:15:55.580 |
And so we'd like to change those knobs, of course, so that when you put images through 00:15:59.900 |
that function, you get probabilities that are consistent with your training data. 00:16:06.180 |
And it turns out that we can do that tuning automatically with backpropagation. 00:16:11.300 |
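To make that picture concrete, here is a toy sketch (my own illustration, not the actual network) of "a function from pixels to class probabilities, with knobs W"; a real ConvNet replaces the plain matrix multiplies with convolutions:

    import numpy as np

    def f(x, W1, W2):
        """Toy 'network': x is a flattened image, W1 and W2 are the knobs (parameters)."""
        h = np.maximum(0, W1 @ x)             # dot products followed by a ReLU nonlinearity
        scores = W2 @ h                       # more dot products, one score per class
        e = np.exp(scores - scores.max())
        return e / e.sum()                    # softmax: a vector of class probabilities

    x = np.random.randn(32 * 32 * 3)          # raw pixels, flattened
    W1 = 0.01 * np.random.randn(100, 32 * 32 * 3)
    W2 = 0.01 * np.random.randn(1000, 100)    # 1,000 classes, as in ImageNet
    probs = f(x, W1, W2)                      # backpropagation would tune W1, W2 to fit the training data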
Now, more concretely, a convolutional neural network is made up of a sequence of layers, 00:16:15.620 |
just as in the case of normal neural networks. 00:16:17.580 |
But we have different types of layers that we play with. 00:16:21.860 |
Here I'm using the rectified linear unit, ReLU for short, as the non-linearity, 00:16:26.460 |
and I'm making that explicitly its own layer; and then there are pooling layers and fully connected layers. 00:16:31.940 |
The core computational building block of a convolutional network, though, is this convolutional layer. 00:16:38.680 |
We are probably getting rid of things like pooling layers. 00:16:40.980 |
So you might see them slightly going away over time. 00:16:43.340 |
And fully connected layers can actually be represented-- they're basically equivalent to convolutional layers. 00:16:48.020 |
And so really, it's just a sequence of conv layers in the simplest case. 00:16:52.300 |
So let me explain convolutional layer, because that's the core computational building block 00:16:58.520 |
So the entire ConvNet is this collection of layers. 00:17:03.080 |
And these layers don't function over vectors. 00:17:05.420 |
So they don't transform vectors as a normal neural network. 00:17:09.300 |
So a layer will take a volume, a three-dimensional volume of numbers, an array. 00:17:13.420 |
In this case, for example, we have a 32 by 32 by 3 image. 00:17:17.180 |
So those three dimensions are the width, height, and I'll refer to the third dimension as depth. 00:17:22.900 |
That's not to be confused with the depth of a network, which is the number of layers in 00:17:28.700 |
So this convolutional layer accepts a three-dimensional volume. 00:17:31.180 |
And it produces a three-dimensional volume using some weights. 00:17:34.580 |
So the way it actually produces this output volume is as follows. 00:17:37.700 |
We're going to have these filters in a convolutional layer. 00:17:40.260 |
So these filters are always small spatially, like, say, for example, 5 by 5 filter. 00:17:45.540 |
But their depth extends always through the input depth of the input volume. 00:17:51.220 |
So since the input volume has three channels, the depth is three, then our filters will also have a depth of three. 00:17:57.900 |
So we have depth of three in our filters as well. 00:18:01.000 |
And then we can take those filters, and we can basically convolve them with the input volume. 00:18:04.940 |
So what that amounts to is we take this filter. 00:18:09.060 |
So that's just the point that the channels here must match. 00:18:12.260 |
We take that filter, and we slide it through all spatial positions of the input volume. 00:18:16.620 |
And along the way, as we're sliding this filter, we're computing dot products. 00:18:20.020 |
So W transpose X plus B, where W are the filters, X is a small piece of the input volume, and B is a bias. 00:18:27.020 |
And so this is basically the convolutional operation. 00:18:28.780 |
You're taking this filter, and you're sliding it through at all spatial positions, and you're computing these dot products along the way. 00:18:33.720 |
So when you do this, you end up with this activation map. 00:18:37.040 |
So in this case, we get a 28 by 28 activation map. 00:18:41.300 |
28 comes from the fact that there are 28 unique positions to place this 5 by 5 filter into a 32-pixel-wide input. 00:18:49.420 |
So there are 28 by 28 unique positions you can place that filter in. 00:18:52.260 |
In every one of those, you're going to get a single number of how well that filter likes that part of the input. 00:19:03.420 |
And now in a convolutional layer, we don't just have a single filter, but we're going to have an entire set of filters. 00:19:09.740 |
We're going to slide it through the input volume. 00:19:13.500 |
So there are 75 numbers here that basically make up a filter. 00:19:19.080 |
We convolve them through, get a new activation map, and we continue doing this for all the filters in the convolutional layer. 00:19:25.140 |
So for example, if we had six filters in this convolutional layer, then we might end up with six separate activation maps. 00:19:32.440 |
And we stack them along the depth dimension to arrive at the output volume of 28 by 28 00:19:37.700 |
And so really what we've done is we've re-represented the original image, which is 32 by 32 by 00:19:42.380 |
3, into a kind of a new image that is 28 by 28 by 6, where this image basically has these 00:19:48.840 |
six channels that tell you how well every filter matches or likes every part of the input image. 00:19:57.020 |
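Here is a deliberately slow, loop-based numpy sketch of that whole operation under the same assumptions (a 32x32x3 input, six 5x5x3 filters, stride 1, no padding); it only illustrates the sliding dot products and is not how real libraries implement convolution:

    import numpy as np

    x = np.random.randn(32, 32, 3)             # input volume: width x height x depth
    filters = np.random.randn(6, 5, 5, 3)      # six 5x5 filters whose depth matches the input
    b = np.zeros(6)

    out = np.zeros((28, 28, 6))                # 28 = 32 - 5 + 1 unique positions along each axis
    for k in range(6):                         # one activation map per filter
        for i in range(28):
            for j in range(28):
                patch = x[i:i+5, j:j+5, :]                          # a small piece of the input volume
                out[i, j, k] = np.sum(patch * filters[k]) + b[k]    # the dot product W^T x + b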
So let's compare this operation to, say, using a fully connected layer as you would in a normal neural network. 00:20:03.060 |
So in particular, we saw that we processed a 32 by 32 by 3 volume into a 28 by 28 by 6 volume. 00:20:10.000 |
But one question you might want to ask is, how many parameters would this require if 00:20:13.560 |
we wanted a fully connected layer of the same number of output neurons here? 00:20:17.160 |
So we wanted 28 times 28 times 6 neurons, fully connected. 00:20:26.560 |
Turns out that that would be quite a few parameters, right? 00:20:28.760 |
Because every single neuron in the output volume would be fully connected to all of the input volume. 00:20:34.960 |
So basically, every one of those 28 by 28 by 6 neurons is connected to 32 by 32 by 3. 00:20:41.360 |
Turns out to be about 15 million parameters, and also on that order of number of multiplies. 00:20:45.920 |
So you're doing a lot of compute, and you're introducing a huge amount of parameters into 00:20:49.960 |
Now, since we're doing convolution instead, you'll notice that-- think about the number 00:20:55.480 |
of parameters that we've introduced with this example convolutional layer. 00:20:59.120 |
So we've used-- we had six filters, and every one of them was a 5 by 5 by 3 filter. 00:21:06.280 |
So basically, we just have 5 by 5 by 3 filters. 00:21:09.540 |
If you just multiply that out, we have 450 parameters. 00:21:15.640 |
So compared to 15 million, we've only introduced very few parameters. 00:21:21.840 |
So computationally, how many flops are we doing? 00:21:24.960 |
Well, we have 28 by 28 by 6 outputs to produce. 00:21:27.800 |
And every one of these numbers is a function of a 5 by 5 by 3 region in the original image. 00:21:35.840 |
And then every one of them is computed by doing 5 times 5 times 3 multiplies. 00:21:39.800 |
So you end up with only on the order of 350,000 multiplies. 00:21:43.880 |
So we've reduced from 15 million to a few hundred thousand. 00:21:46.600 |
So we're doing less flops, and we're using fewer parameters. 00:21:50.200 |
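The back-of-the-envelope numbers quoted above can be checked in a few lines:

    fc_params = (28 * 28 * 6) * (32 * 32 * 3)       # fully connected: ~14.5 million parameters
    conv_params = 6 * (5 * 5 * 3)                   # six 5x5x3 filters: 450 parameters (plus 6 biases)
    conv_multiplies = (28 * 28 * 6) * (5 * 5 * 3)   # ~353,000 multiplies to fill the output volume
    print(fc_params, conv_params, conv_multiplies)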
And really, what we've done here is we've made assumptions. 00:21:52.920 |
Note that a fully connected layer, if this was a fully 00:21:58.080 |
connected layer, could compute the exact same thing. 00:22:02.400 |
So a specific setting of those 15 million parameters would actually produce the exact same output. 00:22:13.600 |
We've assumed, for example, that since we have these fixed filters that we're sliding 00:22:16.640 |
across space, we've assumed that if there's some interesting feature that you'd like to 00:22:20.320 |
detect in one part of the image, like, say, top left, then that feature will also be useful 00:22:24.200 |
somewhere else, like on the bottom right, because we fix these filters and apply them across all spatial positions. 00:22:30.240 |
You might notice that this is not always something that you might want. 00:22:33.080 |
For example, if you're getting inputs that are centered face images, and you're doing 00:22:36.640 |
some kind of a face recognition or something like that, then you might expect that you 00:22:39.680 |
might want different filters at different spatial positions. 00:22:42.600 |
Like say, for eye regions, you might want to have some eye-like filters. 00:22:45.800 |
And for mouth region, you might want to have mouth-specific features and so on. 00:22:49.280 |
And so in that case, you might not want to use convolutional layer, because those features 00:22:52.240 |
have to be shared across all spatial positions. 00:22:55.240 |
And the second assumption that we made is that these filters are small locally. 00:23:03.040 |
But that's OK, because we end up stacking up these convolutional layers in sequence. 00:23:06.840 |
And so the neurons at the end of the ConvNet will grow their receptive field as you stack 00:23:12.160 |
these convolutional layers on top of each other. 00:23:14.160 |
So at the end of the ConvNet, those neurons end up being a function of the entire image 00:23:18.000 |
So just to give you an idea about what these activation maps look like concretely, here's 00:23:26.640 |
And we have these different filters at-- we have 32 different small filters here. 00:23:30.720 |
And so if we were to convolve these filters with this image, we end up with these activation 00:23:35.240 |
So this filter, if you convolve it, you get this activation map and so on. 00:23:38.920 |
So this one, for example, has some orange stuff in it. 00:23:41.080 |
So when we convolve with this image, you see that this white here is denoting the fact 00:23:44.900 |
that that filter matches that part of the image quite well. 00:23:50.520 |
And then that goes into the next convolutional layer. 00:23:53.720 |
So the way this looks like then is that we've processed this with some kind of a convolutional layer. 00:24:01.680 |
We apply a rectified linear unit, some kind of a non-linearity as normal. 00:24:05.040 |
And then we would just repeat that operation. 00:24:06.880 |
So we keep plugging these conv volumes into the next convolutional layer. 00:24:11.480 |
And so they plug into each other in sequence. 00:24:14.080 |
And so we end up processing the image over time. 00:24:19.160 |
You'll notice that there are a few more layers. 00:24:20.640 |
So in particular, the pooling layer I'll explain very briefly. 00:24:27.060 |
If you've used Photoshop or something like that, you've taken a large image and you've 00:24:30.520 |
resized it, you've down sampled the image, well, pooling layers do basically something similar. 00:24:35.920 |
But they're doing it on every single channel independently. 00:24:38.640 |
So for every one of these channels independently in an input volume, we'll take that activation map and down sample it. 00:24:46.400 |
And that becomes a channel in the output volume. 00:24:48.720 |
So it's really just a down sampling operation on these volumes. 00:24:52.720 |
So for example, one of the common ways of doing this in the context of neural networks is max pooling. 00:24:57.860 |
So in this case, it would be common to say, for example, use 2 by 2 filters at stride 2 and take the max. 00:25:05.840 |
So if this is an input channel in a volume, then we're basically-- what that amounts to 00:25:10.080 |
is we're truncating it into these 2 by 2 regions. 00:25:13.400 |
And we're taking a max over 4 numbers to produce one piece of the output. 00:25:18.880 |
So this is a very cheap operation that down samples your volumes. 00:25:21.860 |
It's really a way to control the capacity of the network. 00:25:25.160 |
You don't want things to be too computationally expensive. 00:25:27.220 |
It turns out that a pooling layer allows you to down sample your volumes. 00:25:30.760 |
You're going to end up doing less computation. 00:25:32.760 |
And it turns out to not hurt the performance too much. 00:25:35.080 |
So we use them basically as a way of controlling the capacity of these networks. 00:25:39.960 |
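As a small sketch of that 2 by 2, stride 2 max pooling applied to a single channel (assuming even height and width):

    import numpy as np

    def max_pool_2x2(channel):
        """Down sample one channel by taking the max over each non-overlapping 2x2 region."""
        H, W = channel.shape
        return channel.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

    a = np.arange(16).reshape(4, 4)
    print(max_pool_2x2(a))    # a 2x2 output; each entry is the max of a 2x2 block of the input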
And the last layer that I want to briefly mention, of course, is the fully connected 00:25:43.120 |
layer, which is exactly what you're familiar with. 00:25:46.100 |
So we have these volumes throughout as we've processed the image. 00:25:55.560 |
And then we apply a fully connected layer, which really amounts to just a matrix multiplication. 00:26:00.000 |
And then that gives us probabilities after applying a softmax or something like that. 00:26:06.180 |
So let me now show you briefly a demo of what a convolutional network looks like. 00:26:12.920 |
This is a deep learning library for training convolutional neural networks that is implemented in JavaScript, called ConvNetJS. 00:26:18.340 |
I wrote this maybe two years ago at this point. 00:26:21.380 |
So here what we're doing is we're training a convolutional network on the CIFAR-10 dataset. 00:26:32.800 |
So here we are training this network in the browser. 00:26:35.280 |
And you can see that the loss is decreasing, which means that we're getting better at classifying these images. 00:26:40.840 |
And so here's the network specification, which you can play with because this is all done in the browser. 00:26:45.700 |
So you can just change this and play with this. 00:26:49.380 |
And this convolutional network I'm showing here, all the intermediate activations and 00:26:53.080 |
all the intermediate, basically, activation maps that we're producing. 00:26:59.680 |
We're convolving them with the image and getting all these activation maps. 00:27:02.880 |
I'm also showing the gradients, but I don't want to dwell on that too much. 00:27:07.640 |
So the ReLU is a thresholding operation: anything below 0 gets clamped at 0. 00:27:14.980 |
And then another convolution, ReLU, pool, conv, ReLU, pool, et cetera, until at the end we have the class scores. 00:27:21.240 |
And then we have our softmax so that we get probabilities out. 00:27:24.600 |
And then we apply a loss to those probabilities and backpropagate. 00:27:28.200 |
And so here we see that I've been training in this tab for the last maybe 30 seconds 00:27:33.300 |
And we're already getting about 30% accuracy on CIFAR-10. 00:27:38.400 |
And these are the outputs of this convolutional network. 00:27:40.680 |
And you can see that it has already learned that this is a car, or something like that. 00:27:47.360 |
And you can change the architecture and so on. 00:27:50.280 |
Another thing I'd like to show you is this video, because it gives you, again, this very 00:27:53.880 |
intuitive visceral feeling of exactly what this is computing. 00:27:57.080 |
There is a very good video by Jason Yosinski from-- 00:28:07.600 |
It's this interactive convolutional network demo. 00:28:10.880 |
- --neural networks have enabled computers to better see and understand the world. 00:28:20.160 |
So what we're seeing here is these are activation maps in some particular-- shown in real time 00:28:27.760 |
So these are for the conv1 layer of an AlexNet, which we're going to go into in much more 00:28:32.400 |
But these are the different activation maps that are being produced at this point. 00:28:36.360 |
- --neural network called AlexNet running in Caffe. 00:28:39.800 |
By interacting with the network, we can see what some of the neurons are doing. 00:28:44.600 |
For example, on this first layer, a unit in the center responds strongly to light-to-dark edges. 00:28:51.620 |
This neighbor, one neuron over, responds to edges in the opposite direction, dark to light. 00:28:58.520 |
Using optimization, we can synthetically produce images that light up each neuron on this layer 00:29:05.320 |
We can scroll through every layer in the network to see what it does, including convolution, 00:29:12.900 |
We can switch back and forth between showing the actual activations and showing images 00:29:22.500 |
By the time we get to the fifth convolutional layer, the features being computed represent 00:29:29.500 |
For example, this neuron seems to respond to faces. 00:29:32.700 |
We can further investigate this neuron by showing a few different types of information. 00:29:36.940 |
First we can artificially create optimized images using new regularization techniques 00:29:42.620 |
These synthetic images show that this neuron fires in response to a face and shoulders. 00:29:47.060 |
We can also plot the images from the training set that activate this neuron the most, as 00:29:50.740 |
well as pixels from those images most responsible for the high activations, computed via the deconv. 00:29:57.120 |
This feature responds to multiple faces in different locations. 00:30:00.740 |
And by looking at the deconv, we can see that it would respond more strongly if we had even 00:30:08.300 |
We can also confirm that it cares about the head and shoulders, but ignores the arms and 00:30:14.060 |
We can even see that it fires to some extent for cat faces. 00:30:18.540 |
Using backprop or deconv, we can see that this unit depends most strongly on a couple 00:30:22.860 |
units in the previous layer, conv4, and on about a dozen or so in conv3. 00:30:28.580 |
Now let's look at another neuron on this layer. 00:30:33.020 |
From the top nine images, we might conclude that it fires for different types of clothing. 00:30:37.620 |
But examining the synthetic images shows that it may be detecting not clothing per se, but wrinkles. 00:30:43.300 |
In the live plot, we can see that it's activated by my shirt. 00:30:46.820 |
And smoothing out half of my shirt causes that half of the activations to decrease. 00:30:56.120 |
This one has learned to look for printed text in a variety of sizes, colors, and fonts. 00:31:02.080 |
This is pretty cool, because we never ask the network to look for wrinkles or text or 00:31:06.700 |
But the only labels we provided were at the very last layer. 00:31:09.460 |
So the only reason the network learned features like text and faces in the middle was to support the final classifications. 00:31:16.260 |
For example, the text detector may provide good evidence that a rectangle is in fact a book. 00:31:22.740 |
And detecting many books next to each other might be a good way of detecting a bookcase, 00:31:26.820 |
which was one of the categories we trained the net to recognize. 00:31:31.420 |
In this video, we've shown some of the features of the DeepViz toolbox. 00:31:37.700 |
So I hope that gives you an idea about exactly what's going on. 00:31:43.020 |
There's usually some fully connected layers at the end. 00:31:45.380 |
But mostly it's just these convolutional operations stacked on top of each other. 00:31:49.220 |
So what I'd like to do now is I'll dive into some details of how these architectures are designed. 00:31:54.300 |
The way I'll do this is I'll go over all the winners of the ImageNet challenges, and I'll 00:31:57.940 |
tell you about the architectures, how they came about, how they differ. 00:32:00.780 |
And so you'll get a concrete idea about what these architectures look like in practice. 00:32:08.360 |
So the AlexNet, just to give you an idea about the sizes of these networks and the images 00:32:13.300 |
that they process, it took 227 by 227 by 3 images. 00:32:17.860 |
And the first layer of an AlexNet, for example, was a convolutional layer that had 11 by 11 filters applied at a stride of 4. 00:32:27.440 |
Stride of 4 I didn't fully explain because I wanted to save some time. 00:32:30.660 |
But intuitively, it just means that as you're sliding this filter across the input, you 00:32:34.460 |
don't have to slide it one pixel at a time, but you can actually jump a few pixels at 00:32:38.620 |
So we have 11 by 11 filters with a stride, a skip of 4. 00:32:43.600 |
You can try to compute, for example, what is the output volume if you apply this sort 00:32:49.540 |
of convolutional layer on top of this volume. 00:32:51.420 |
And I didn't go into details of how you compute that. 00:32:53.560 |
But basically, there are formulas for this, and you can look into details in the class. 00:32:58.260 |
But you arrive at 55 by 55 by 96 volume as output. 00:33:03.260 |
The total number of parameters in this layer, we have 96 filters. 00:33:07.500 |
Every one of them is 11 by 11 by 3 because that's the input depth of these images. 00:33:14.460 |
So basically, it just amounts to 11 times 11 times 3. 00:33:17.280 |
And then you have 96 filters, so about 35,000 parameters in this very first layer. 00:33:22.820 |
Then the second layer of an AlexNet is a pooling layer. 00:33:25.500 |
So we apply 3 by 3 filters at stride of 2, and they do max pooling. 00:33:30.040 |
So you can, again, compute the output volume size of that after applying this to that volume. 00:33:35.220 |
And you arrive, if you do some very simple arithmetic there, you arrive at 27 by 27 by 96. 00:33:42.580 |
You can think about what is the number of parameters in this pooling layer. 00:33:48.760 |
So pooling layers compute a fixed function, a fixed downsampling operation. 00:33:52.460 |
There are no parameters involved in a pooling layer. 00:33:54.580 |
All the parameters are in convolutional layers and the fully connected layers, which are, 00:33:57.780 |
to some extent, equivalent to convolutional layers. 00:34:01.220 |
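The "formulas" alluded to above are simple: with no zero padding, the output width of a convolution or pooling layer is (W - F) / S + 1 for input width W, filter size F, and stride S. A quick check of the AlexNet numbers quoted here:

    def out_size(W, F, S):
        return (W - F) // S + 1     # output width with no zero padding

    print(out_size(227, 11, 4))     # 55 -> CONV1 output is 55 x 55 x 96
    print(96 * 11 * 11 * 3)         # 34,848 -> the roughly 35,000 CONV1 parameters (plus 96 biases)
    print(out_size(55, 3, 2))       # 27 -> the 3x3, stride-2 max pool gives 27 x 27 x 96, with zero parameters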
So we can go ahead and just basically, based on the description in the paper-- although 00:34:05.220 |
it's non-trivial, I think, based on the description of this particular paper-- but you can go 00:34:08.560 |
ahead and decipher what the volumes are throughout. 00:34:11.940 |
You can look at the kind of patterns that emerge in terms of how you actually increase 00:34:16.460 |
the number of filters in higher convolutional layers. 00:34:24.060 |
And eventually, 4,096 units of fully connected layers. 00:34:27.540 |
You'll see also normalization layers here, which have since become slightly deprecated. 00:34:31.660 |
It's not very common to use the normalization layers that were used at the time for the 00:34:37.620 |
What's interesting to note is how this differs from the 1998 Yann LeCun network. 00:34:41.780 |
So in particular, I usually like to think about four things that hold back progress: data, compute, algorithms, and infrastructure. 00:34:52.900 |
And then I like to differentiate between algorithms and infrastructure, algorithms being something 00:34:57.060 |
that feels like research and infrastructure being something that feels like a lot of engineering 00:35:01.500 |
And so in particular, we've had progress in all those four fronts. 00:35:04.540 |
So we see that in 1998, the data you could get a hold of maybe would be on the order 00:35:08.860 |
of a few thousand, whereas now we have a few million. 00:35:11.280 |
So we have three orders of magnitude of increase in the amount of data. 00:35:14.660 |
Compute, GPUs have become available, and we use them to train these networks. 00:35:18.900 |
They are about, say, roughly 20 times faster than CPUs. 00:35:23.380 |
And then, of course, the CPUs we have today are much, much faster than the CPUs that they had back then. 00:35:28.060 |
So I don't know exactly to what that works out to, but I wouldn't be surprised if it's, 00:35:30.580 |
again, on the order of three orders of magnitude of improvement again. 00:35:34.540 |
I'd like to actually skip over the algorithm and talk about infrastructure. 00:35:37.020 |
So in this case, we're talking about NVIDIA releasing the CUDA library that allows you 00:35:41.820 |
to efficiently run all these matrix and vector operations and apply them on arrays of numbers. 00:35:46.900 |
So that's a piece of software that we rely on and that we take advantage of that wasn't available back then. 00:35:52.940 |
And finally, algorithms is kind of an interesting one, because in those 20 years, there's been 00:35:57.260 |
much less improvement in algorithms than all these other three pieces. 00:36:02.380 |
So in particular, what we've done with the 1998 network is we've made it bigger. 00:36:09.340 |
And the two really new things algorithmically are dropout and rectified linear units. 00:36:16.660 |
So dropout is a regularization technique developed by Geoff Hinton and colleagues. 00:36:21.700 |
And rectified linear units are these nonlinearities that train much faster than sigmoids and tanhs. 00:36:27.420 |
And this paper actually had a plot that showed that the rectified linear units trained a lot faster. 00:36:33.780 |
And that's intuitively because of the vanishing gradient problems. 00:36:36.460 |
And when you have very deep networks with sigmoids, those gradients vanish, as Hugo was describing earlier. 00:36:43.420 |
So what's interesting also to note, by the way, is that both dropout and ReLU are basically 00:36:47.380 |
like one line or two lines of code to change. 00:36:50.500 |
So it's about a two line diff total in those 20 years. 00:36:53.880 |
And both of them consist of setting things to zero. 00:36:56.460 |
So with the ReLU, you set things to zero when they're lower than zero. 00:36:59.820 |
And with dropout, you set things to zero at random. 00:37:06.060 |
So if you try to find a new cool algorithm, look for one-line diffs that set something to zero. 00:37:16.020 |
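Those one-or-two-line diffs look roughly like this (a sketch of the standard formulations, using inverted dropout for the random zeroing):

    import numpy as np

    def relu(x):
        return np.maximum(0, x)     # set things to zero when they're below zero

    def dropout(x, p=0.5, train=True):
        if not train:
            return x                # at test time, keep all units
        mask = (np.random.rand(*x.shape) > p) / (1.0 - p)   # set things to zero at random, rescale the rest
        return x * mask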
Now, a few more notes on this architecture, again comparing it and giving 00:37:20.700 |
you an idea about the hyperparameters that were used in it. 00:37:25.100 |
It was the first use of rectified linear units. 00:37:29.380 |
This network used the normalization layers, which are not used anymore, at least in the 00:37:32.820 |
specific way that they use them in this paper. 00:37:37.700 |
It also used heavy data augmentation: so you don't only pipe these images into the networks exactly as they come from the data 00:37:42.940 |
set, but you jitter them spatially around a bit. 00:37:45.420 |
And you warp them, and you change the colors a bit, and you just do this randomly because 00:37:49.060 |
you're trying to build in some invariances to these small perturbations. 00:37:52.340 |
And you're basically hallucinating additional data. 00:37:59.860 |
And roughly, you see standard hyperparameters, like say batch sizes of roughly 128, using 00:38:05.060 |
stochastic gradient descent with momentum, usually 0.9. 00:38:09.460 |
Learning rates of 1e-2, which you reduce in the normal ways. 00:38:13.660 |
So you reduce roughly by a factor of 10 whenever validation stops improving. 00:38:17.860 |
And a little bit of weight decay, 5e-4. 00:38:23.860 |
And then ensembling: you train seven independent convolutional networks separately, and then you just average their predictions. 00:38:37.700 |
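A hedged sketch of that training recipe written against Keras' SGD optimizer (the model definition and the weight-decay regularizers are omitted, and parameter names differ slightly across Keras versions):

    from keras.optimizers import SGD

    opt = SGD(lr=1e-2, momentum=0.9)    # SGD with momentum 0.9 and a learning rate of 1e-2
    # model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    # Train with batches of ~128, apply L2 weight decay of ~5e-4 to the weights,
    # divide the learning rate by 10 whenever validation accuracy stops improving,
    # and optionally average the predictions of ~7 independently trained networks.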
The next network, the ZF Net, was developed by Matthew Zeiler and Rob Fergus in 2013. 00:38:43.240 |
And this was an improvement on top of AlexNet architecture. 00:38:45.820 |
In particular, one of the bigger differences here were that the first convolutional layer, 00:38:50.740 |
they went from 11 by 11 stride 4 to 7 by 7 stride 2. 00:38:53.900 |
So you have slightly smaller filters, and you apply them more densely. 00:38:57.420 |
And then also, they noticed that these convolutional layers in the middle, if you make them larger, 00:39:02.220 |
if you scale them up, then you actually gain performance. 00:39:06.380 |
Matthew Zeiler then went-- he became the founder of Clarifai. 00:39:11.620 |
And he worked on this a bit more inside Clarifai, and he managed to push the error down to 00:39:15.060 |
11%, which was the winning entry at the time. 00:39:17.880 |
But we don't actually know what gets you from 14% to 11%, because Matthew never disclosed 00:39:24.300 |
But he did say that it was more tweaking of these hyperparameters and optimizing that got the improvement. 00:39:31.020 |
In 2014, we saw a slightly bigger diff to this. 00:39:34.580 |
So one of the networks that was introduced then was the VGG net from Karen Simonyan and Andrew Zisserman. 00:39:39.220 |
What's beautiful about VGG net-- and they explored a few architectures here, and the 00:39:42.020 |
one that ended up working best was this D column, which is why I'm highlighting it. 00:39:45.460 |
What's beautiful about the VGG net is that it's so simple. 00:39:48.500 |
So you might have noticed in these previous networks, you have these different filter 00:39:52.980 |
sizes, different layers, and you do different amount of strides, and everything kind of 00:39:56.620 |
looks a bit hairy, and you're not sure where these hyperparameters are coming from. 00:40:02.020 |
All you do is 3 by 3 convolutions with stride 1, pad 1, and you do 2 by 2 max poolings with stride 2. 00:40:08.260 |
And you do this throughout completely homogeneous architecture, and you just alternate a few 00:40:12.620 |
conv and a few pool layers, and you get top performance. 00:40:16.660 |
So they managed to reduce the error down to 7.3% in the VGG net, just with a very simple architecture. 00:40:24.060 |
So I've also here written out this D architecture. 00:40:27.980 |
So you can see-- I'm not sure how instructive this is, because it's kind of dense. 00:40:32.100 |
But you can definitely see, and you can look at this offline perhaps, but you can see how 00:40:35.580 |
these volumes develop, and you can see the kinds of sizes of these filters. 00:40:41.100 |
So they're always 3 by 3, but the number of filters, again, grows. 00:40:43.860 |
So we started off with 64, and then we go to 128, 256, 512. 00:40:51.140 |
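As a sketch of that homogeneous pattern (written with Keras layer names as an illustration of the recipe, not the authors' code): stacks of 3 by 3 convolutions with stride 1 and pad 1, with 2 by 2, stride 2 max pools in between, doubling the number of filters as you go deeper.

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D

    model = Sequential()
    model.add(Conv2D(64, (3, 3), padding='same', activation='relu', input_shape=(224, 224, 3)))
    model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=2))
    model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
    model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
    model.add(MaxPooling2D((2, 2), strides=2))
    # ... continue with 256 and then 512 filters, followed by the fully connected layers and a softmax.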
I also have a few numbers here, just to give you an idea of the scale at which these networks 00:40:56.260 |
So we have on the order of 140 million parameters. 00:40:59.860 |
I'll show you in a bit that this can be about 5 or 10 million parameters, and it works just as well. 00:41:05.020 |
And it's about 100 megabytes per image, in terms of memory, in the forward pass. 00:41:09.820 |
And then the backward pass also needs roughly on that order. 00:41:12.280 |
So that's roughly the numbers that we're working with here. 00:41:16.380 |
Also you can note that most of the-- and this is true mostly in convolutional networks-- 00:41:20.140 |
is that most of the memory is in the early convolutional layers. 00:41:23.180 |
Most of the parameters, at least in the case where you use these giant fully connected layers, are at the end of the network. 00:41:29.340 |
So the winner, actually, in 2014 was not the VGGnet. 00:41:31.780 |
I only present it because it's such a simple architecture. 00:41:34.500 |
But the winner was actually GoogLeNet, with a slightly hairier architecture, we should say. 00:41:41.440 |
But in this case, they've put inception modules in sequence. 00:41:46.380 |
I don't have too much time to go into the details, but you can see that it consists 00:41:49.700 |
basically of convolutions and different kinds of strides and so on. 00:41:54.500 |
So the GoogLeNet looks slightly hairier, but it turns out to be more efficient in several ways. 00:42:02.260 |
So for example, it works a bit better than VGGnet, at least at the time. 00:42:06.780 |
It only has 5 million parameters, compared to VGGnet's 140 million parameters, so a huge reduction. 00:42:12.220 |
And you do that, by the way, by just throwing away fully connected layers. 00:42:15.180 |
So you'll notice in this breakdown I did, these fully connected layers here have 100 00:42:18.980 |
million parameters and 16 million parameters. 00:42:22.460 |
So if you take them away, that actually doesn't hurt performance too much. 00:42:26.220 |
So you can get a huge reduction of parameters. 00:42:30.000 |
And it was slightly -- we can also compare to the original AlexNet. 00:42:35.180 |
So compared to the original AlexNet, we have fewer parameters, a bit more compute, and much better performance. 00:42:40.380 |
So GoogLeNet was really optimized to have a low footprint, both memory-wise and computation-wise. 00:42:47.980 |
And VGGnet is a very beautiful, homogeneous architecture, but there are some inefficiencies in it. 00:42:55.220 |
Now, in 2015, we had a slightly bigger delta on top of the architectures. 00:43:00.460 |
So right now, these architectures, if Yann LeCun looked at them maybe in 1998, he would still mostly recognize them. 00:43:08.660 |
So one of the first kind of bigger departures, I would argue, was in 2015, with the introduction 00:43:14.260 |
And so this is work from Kaiming He and colleagues at Microsoft Research Asia. 00:43:18.840 |
And so they did not only win the ImageNet Challenge in 2015, but they won a whole bunch of other challenges as well. 00:43:24.360 |
And this was all just by applying these residual networks that were trained on ImageNet and 00:43:28.640 |
then fine-tuned on all these different tasks. 00:43:30.600 |
And you basically can crush lots of different tasks whenever you get a new awesome ConvNet. 00:43:36.860 |
So at this time, the performance was basically 3.57% from these residual networks. 00:43:44.100 |
So this paper tried to argue that if you look at the number of layers over the years, it keeps going up. 00:43:48.340 |
And then they made the point that with residual networks, as we'll see in a bit, you can introduce 00:43:53.100 |
many more layers and that that correlates strongly with performance. 00:43:57.580 |
We've since found that, in fact, you can make these residual networks quite a lot shallower, 00:44:01.980 |
like say on the order of 20 or 30 layers, and they work just as well. 00:44:05.580 |
So it's not necessarily the depth here, but I'll go into that in a bit. 00:44:10.900 |
What's interesting about this paper is this plot here, where they compare these residual 00:44:15.900 |
networks-- and I'll go into details of how they work in a bit-- and these what they call 00:44:19.060 |
plain networks, which is everything I've explained until now. 00:44:22.420 |
And the problem with plain networks is that when you try to scale them up and introduce 00:44:25.820 |
additional layers, they don't get monotonically better. 00:44:29.080 |
So if you take a 20-layer model-- and these are CIFAR-10 experiments-- if you take a 00:44:34.740 |
20-layer model and you run it, and then you take a 56-layer model, you'll see that the 56-layer model actually does worse. 00:44:41.500 |
And this is not just on the test data, so it's not just an overfitting issue. 00:44:46.020 |
The 56-layer model performs worse on the training data than the 20-layer model, even though 00:44:50.580 |
the 56-layer model can imitate the 20-layer model by setting 36 of its layers to compute identities. 00:44:56.420 |
So basically, it's an optimization problem: you can't find the good solution once your 00:45:01.340 |
problem size grows that much bigger in this plain net architecture. 00:45:05.980 |
So in the residual networks that they proposed, they found that when you wire them up in a 00:45:09.300 |
slightly different way, you monotonically get a better performance as you add more layers. 00:45:14.360 |
So more layers, always strictly better, and you don't run into these optimization issues. 00:45:19.460 |
So comparing residual networks to plain networks: in plain networks, as I've explained already, 00:45:24.020 |
you have this sequence of convolutional layers, where every convolutional layer operates over the output of the previous layer. 00:45:30.840 |
In residual networks, we have this first convolutional layer on top of the raw image. 00:45:36.880 |
So at this point, we've reduced to 56 by 56 by 64, the original image. 00:45:41.660 |
And then from here on, they have these residual blocks with these funny skip connections. 00:45:52.180 |
So the original Kaiming He paper had this architecture here, shown under "original". 00:45:57.040 |
So on the left, you see original residual networks design. 00:46:00.140 |
Since then, they had an additional paper that played with the architecture and found that 00:46:03.660 |
there's a better arrangement of layers inside this block that works better empirically. 00:46:08.900 |
And so the way this works-- so concentrate on the proposed one in the middle, since that 00:46:12.180 |
works so well-- is you have this pathway where you have this representation x of the image flowing through. 00:46:18.540 |
And then instead of transforming that representation x to get a new x to plug in later, we end up computing a delta to it. 00:46:25.100 |
We go off, and we do some compute on the side. 00:46:27.840 |
So that's that residual block doing some computation. 00:46:33.500 |
So you have this addition operation here going to the next residual block. 00:46:37.280 |
So you have this x, and you always compute deltas to it. 00:46:40.860 |
And I think it's not intuitive that this should work much better or why that works much better. 00:46:44.860 |
I think it becomes a bit more intuitively clear if you actually understand the backpropagation 00:46:50.780 |
And this is why I always urge people also to implement backprop themselves to get an 00:46:54.260 |
intuition for how it works, what it's computing, and so on. 00:46:57.380 |
Because if you understand backprop, you'll see that the addition operation is a gradient distributor. 00:47:01.940 |
So you get a gradient from the top, and this gradient will flow equally to all the children of that addition. 00:47:08.320 |
So you have gradient flowing here from the supervision. 00:47:10.560 |
So you have supervision at the very bottom here in this diagram. 00:47:14.780 |
And it flows through these residual blocks and then gets added to the stream. 00:47:19.080 |
But this addition distributes that gradient always identically through. 00:47:23.760 |
So what you end up with is this kind of a gradient superhighway, as I like to call it, 00:47:27.260 |
where these gradients from your supervision go directly to the original convolutional 00:47:31.420 |
And on top of that, you get these deltas from all the residual blocks. 00:47:34.100 |
So these blocks can come online and can help out that original stream of information. 00:47:40.380 |
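A tiny numpy sketch of that idea (a simplified residual block, leaving out the convolutions and batch normalization of the real architecture): the block computes a delta on the side and adds it back, and in backprop the addition passes the gradient straight through to x.

    import numpy as np

    def residual_block(x, W1, W2):
        """Compute a delta 'on the side' and add it back onto the stream x."""
        delta = W2 @ np.maximum(0, W1 @ x)   # the block's computation (here just two matrix multiplies and a ReLU)
        return x + delta                     # the addition: the gradient flows equally to x and to the block

    x = np.random.randn(64)                  # the representation flowing along the "gradient superhighway"
    W1 = 0.01 * np.random.randn(64, 64)
    W2 = 0.01 * np.random.randn(64, 64)
    out = residual_block(x, W1, W2)          # d(out)/dx = I + d(delta)/dx: the identity term is the highway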
This is also related to, I think, why LSTMs, long short-term memory networks, work better 00:47:45.780 |
than plain recurrent neural networks, because they also have these kinds of addition operations in them. 00:47:51.660 |
And it just makes the gradients flow significantly better. 00:47:55.260 |
Then there were some results on top of residual networks that I thought were quite amusing. 00:47:58.600 |
So recently, for example, we had this result on deep networks with stochastic depth. 00:48:03.380 |
The idea here was that the authors of this paper noticed that you have these residual 00:48:07.820 |
blocks that compute deltas on top of your stream. 00:48:11.020 |
And you can basically randomly throw out layers during training. 00:48:14.020 |
So you have these, say, 100 blocks, 100 residual blocks. 00:48:18.460 |
And at test time, similar to dropout, you introduce all of them. 00:48:23.300 |
But you have to scale things a bit, just like with dropout. 00:48:26.500 |
But basically, it's kind of an unintuitive result, because you can throw out layers at random and it still works. 00:48:30.780 |
And I think it breaks the original notion of what we had of ConvNets as these feature 00:48:35.820 |
transformers that compute more and more complex features over time or something like that. 00:48:40.780 |
And I think it seems much more intuitive to think about these residual networks, at least 00:48:44.740 |
to me, as some kinds of dynamical systems, where you have this original representation x of the image. 00:48:50.940 |
And then every single residual block is kind of like a vector field, because it computes a delta on top of your representation. 00:48:57.420 |
And so these vector fields nudge your original representation x towards a space where you 00:49:02.020 |
can decode the answer y of the class of that x. 00:49:06.180 |
And so if you drop some of these residual blocks at random, then if you haven't applied 00:49:09.980 |
one of those vector fields, the other vector fields that come later can kind of make up for it. 00:49:14.020 |
They basically pick up the slack and keep nudging the representation in the right direction. 00:49:19.740 |
And so that's roughly the image I currently have in mind of how these things work. 00:49:27.660 |
In fact, another experiment that people are playing with, and that I also find interesting, is to share the weights across the residual blocks. 00:49:33.700 |
So it starts to look more like a recurrent neural network. 00:49:36.340 |
So these residual blocks would have shared connectivity. 00:49:39.020 |
And then you have this dynamical system, really, where you're just running a single RNN, a 00:49:42.940 |
single vector field that you keep iterating over and over. 00:49:45.380 |
And then your fixed point gives you the answer. 00:49:47.380 |
So it's kind of interesting what's happening. 00:49:54.780 |
So people are playing a lot with these residual networks and improving on them in various 00:50:00.260 |
So as I mentioned already, it turns out that you can make these residual networks much shallower but wider. 00:50:07.020 |
And that can work just as well, if not better. 00:50:08.960 |
So it's not necessarily the depth that is giving you a lot of the performance. 00:50:15.100 |
And if you increase the width, that can actually work better. 00:50:18.180 |
And they're also more efficient if you do it that way. 00:50:21.140 |
There's more funny regularization techniques. 00:50:23.700 |
Here, Swapout is a funny regularization technique that actually interpolates between plain nets, residual nets, dropout, and stochastic depth. 00:50:33.620 |
We actually have many more different types of nets. 00:50:35.660 |
And so people have really experimented with this a lot. 00:50:37.460 |
I'm really eager to see what the winning architecture will be in 2016 as a result of a lot of this. 00:50:42.260 |
One of the things that has really enabled this rapid experimentation in the community 00:50:45.700 |
is that somehow we've developed, luckily, this culture of sharing a lot of code among 00:50:51.220 |
So just as an example, Facebook has released residual 00:50:55.740 |
networks code in Torch that is really good, and a lot of these papers, I believe, have 00:50:59.140 |
adopted it and built on top of it, and that allowed them to really scale up their experiments. 00:51:09.580 |
Unfortunately, a lot of these papers are coming out on arXiv. 00:51:12.340 |
And it's kind of chaos as these are being uploaded. 00:51:14.260 |
So I think this is a natural point to very briefly plug my arxiv-sanity.com. 00:51:26.500 |
And it analyzes all the papers, the full text of the papers, and creates TF-IDF bag-of-words features for each. 00:51:32.580 |
And then you can do things like search for a particular paper, like the residual networks paper. 00:51:36.580 |
And you can look for similar papers on arXiv. 00:51:38.580 |
And so this gives you a sorted list of basically all the residual networks papers that are out there. 00:51:45.260 |
And you can create a library of papers that you like. 00:51:47.340 |
And then Arxiv Sanity will train a support vector machine for you. 00:51:50.400 |
And basically, you can look at which arXiv papers over the last month I would most enjoy reading. 00:51:57.460 |
And so it's like a curated feed specifically for you. 00:52:11.100 |
OK, so that was convolutional networks: I've given you an idea of what they look like in practice. 00:52:13.500 |
And we went through case studies of the winning architectures over time. 00:52:16.900 |
But so far, we've only looked at image classification specifically. 00:52:19.540 |
So we're categorizing images into some number of bins. 00:52:22.380 |
So I'd like to briefly talk about addressing other tasks in computer vision and how you would approach them. 00:52:28.160 |
So the way to think about doing other tasks in computer vision is that really what we 00:52:32.300 |
have is you can think of this convolutional neural network as this block of compute that 00:52:39.540 |
And it can implement basically arbitrary, very nice functions over images. 00:52:43.740 |
And so it takes an image, gives you some kind of features. 00:52:47.260 |
And now different tasks will basically look as follows. 00:52:50.540 |
You want to predict some kind of thing, and in different tasks that will be a different thing. 00:52:56.300 |
And then you want to make the predicted thing closer to the desired thing. 00:53:01.100 |
So this is the only part usually that changes from task to task. 00:53:03.940 |
You'll see that these ConvNets don't change too much. 00:53:06.020 |
What changes is your loss function at the very end. 00:53:08.120 |
And that's what actually helps you really transfer a lot of these winning architectures. 00:53:13.900 |
And you don't worry too much about the details of that architecture. 00:53:16.460 |
Because you're only worried about adding a small piece at the top or changing the loss 00:53:19.660 |
function or substituting a new data set and so on. 00:53:22.520 |
So just to make this slightly more concrete: in image classification, we apply this block of compute to the image. 00:53:28.780 |
And then if I want to do classification, I would basically predict 1,000 numbers that 00:53:32.260 |
give me the log probabilities of different classes. 00:53:34.700 |
And then I have a predicted thing, a desired thing, particular class. 00:53:39.740 |
If I'm doing image captioning, it also looks very similar. 00:53:42.940 |
Instead of predicting just a vector of 1,000 numbers, I now have, for example, 10,000 words in my vocabulary. 00:53:50.580 |
And I'd be predicting 10,000 numbers, and a sequence of them. 00:53:53.480 |
And so I can use a recurrent neural network, which I think you will hear much more about in a later lecture. 00:54:00.340 |
And so I produce a sequence of 10,000 dimensional vectors. 00:54:03.580 |
And they indicate the probabilities of different words to be emitted at different time steps. 00:54:08.020 |
Or for example, if you want to do localization, again, most of the block stays unchanged. 00:54:12.460 |
But now we also want some kind of an extent in the image. 00:54:16.660 |
So suppose we want to classify-- we don't only just want to classify this as an airplane, 00:54:20.300 |
but we want to localize it with x, y, width, height, bounding box coordinates. 00:54:24.280 |
And if we make the specific assumption as well that there's always a single one thing 00:54:28.000 |
in the image, like a single airplane in every image, then you can afford to just predict those coordinates directly. 00:54:33.020 |
So we predict these softmax scores, just like before, and apply the cross-entropy loss. 00:54:37.380 |
And then we can predict x, y, width, height on top of that. 00:54:39.780 |
And we use an L2 loss or a Huber loss or something like that. 00:54:43.500 |
So you just have a predicted thing, a desired thing, and you just backprop. 00:54:47.940 |
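A hedged Keras-style sketch of that localization setup, with a shared trunk and two heads trained with cross-entropy and L2 losses; the trunk and the layer sizes here are placeholders, not the architecture from the slides:

    from tensorflow.keras import layers, models

    inputs = layers.Input(shape=(224, 224, 3))
    # Shared "block of compute" producing image features (placeholder trunk).
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D()(x)
    x = layers.GlobalAveragePooling2D()(x)

    # Head 1: class scores, trained with cross-entropy.
    class_scores = layers.Dense(1000, activation="softmax", name="cls")(x)
    # Head 2: bounding box (x, y, width, height), trained with an L2 loss.
    box = layers.Dense(4, name="box")(x)

    model = models.Model(inputs, [class_scores, box])
    model.compile(optimizer="adam",
                  loss={"cls": "categorical_crossentropy", "box": "mse"})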
If you want to do reinforcement learning because you want to play different games, then again, 00:54:51.500 |
the setup is you just predict some different thing. 00:54:55.420 |
So in this case, we would be, for example, predicting eight numbers that give us scores for the different actions. 00:55:01.060 |
For example, there are eight discrete actions in Atari. 00:55:03.420 |
And we just predict eight numbers, and then we train this in a slightly different manner. 00:55:07.820 |
Because in the case of reinforcement learning, you don't actually know what the correct action is at any point in time. 00:55:14.300 |
But you can still get a desired thing eventually, because you just run these rollouts over time and observe the rewards you end up with. 00:55:22.000 |
And then that helps inform exactly what the correct answer should have been or what the 00:55:26.820 |
desired thing should have been in any one of those rollouts in any point in time. 00:55:30.580 |
I don't want to dwell on this too much in this lecture, though. 00:55:33.700 |
You'll hear much more about reinforcement learning in a later lecture. 00:55:38.460 |
If you wanted to do segmentation, for example, then you don't want to predict a single vector of class probabilities for the whole image. 00:55:45.980 |
But every single pixel has its own category that you'd like to predict. 00:55:48.980 |
So a data set will actually be colored like this, with a different color for each class. 00:55:53.200 |
And then instead of predicting a single vector of classes, you predict an entire array of scores, 00:55:58.500 |
224 by 224, since that's the extent of the original image, for example, times 20 if you have 20 classes. 00:56:04.340 |
And then you basically have 224 by 224 independent softmaxes here. 00:56:11.720 |
This here would be slightly more difficult, because you see here I have deconv layers 00:56:20.740 |
They do a very similar operation, but kind of backwards in some way. 00:56:24.860 |
So a convolutional layer kind of does these downsampling operations as it computes. 00:56:28.300 |
A deconv layer does these kind of upsampling operations as it computes these convolutions. 00:56:32.660 |
But in fact, you can implement a deconv layer using a conv layer. 00:56:35.780 |
So the deconv forward pass is the conv layer's backward pass. 00:56:40.060 |
And the deconv backward pass is the conv layer forward pass, basically. 00:56:43.580 |
So they're basically the identical operation; the only difference is whether you're upsampling or downsampling. 00:56:48.880 |
So you can use deconv layers, or you can use hypercolumns. 00:56:51.820 |
And there are different things that people do in segmentation literature. 00:56:55.140 |
But that's just a rough idea, as you're just changing the loss function at the end. 00:56:58.500 |
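For illustration only, here is a minimal Keras-style sketch of a segmentation head that downsamples with a strided convolution and upsamples back with a transposed ("deconv") convolution, ending in per-pixel softmaxes over an assumed 20 classes; the layer sizes are placeholders:

    from tensorflow.keras import layers, models

    inputs = layers.Input(shape=(224, 224, 3))
    # Downsampling path: a stride-2 convolution halves the spatial resolution.
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(inputs)
    # Transposed ("deconv") convolution upsamples back to 224 x 224.
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    # Per-pixel class scores: 224 x 224 independent softmaxes over 20 classes.
    pixel_classes = layers.Conv2D(20, 1, activation="softmax")(x)

    model = models.Model(inputs, pixel_classes)
    model.compile(optimizer="adam", loss="categorical_crossentropy")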
If you wanted to do autoencoders, so you want to do some unsupervised learning or something 00:57:01.660 |
like that, well, you're just trying to predict the original image. 00:57:04.620 |
So you're trying to get the convolutional network to implement the identity transformation. 00:57:09.140 |
And the trick, of course, that makes it non-trivial is that you're forcing the representation 00:57:12.700 |
to go through this representational bottleneck of 7 by 7 by 512. 00:57:16.580 |
So the network must find an efficient representation of the original image so that it can decode it back out. 00:57:22.740 |
You again have an L2 loss at the end, and you backprop. 00:57:25.620 |
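A minimal sketch of a convolutional autoencoder with that kind of bottleneck, trained with an L2 reconstruction loss; the exact channel counts and the bottleneck size are placeholders:

    from tensorflow.keras import layers, models

    inputs = layers.Input(shape=(224, 224, 3))
    # Encoder: squeeze the image down to a small spatial bottleneck.
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)  # 112 x 112
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)       # 56 x 56 bottleneck
    # Decoder: upsample back to the original resolution.
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    recon = layers.Conv2DTranspose(3, 3, strides=2, padding="same")(x)

    autoencoder = models.Model(inputs, recon)
    # L2 reconstruction loss: try to reproduce the input image.
    autoencoder.compile(optimizer="adam", loss="mse")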
Or if you want to do variational autoencoders, you have to introduce a reparameterization 00:57:28.900 |
layer, and you have to append an additional small loss that keeps your posterior close to your prior. 00:57:35.020 |
And then you have an entire generative model. 00:57:39.740 |
If you wanted to do detection, things get a little more hairy, perhaps, compared to the tasks so far. 00:57:45.700 |
So one of my favorite detectors to explain is the YOLO detector. 00:57:50.500 |
It doesn't work the best, but it's the simplest one to explain, and it has the core idea of how most of these detectors work. 00:57:57.340 |
And so the way this works is we reduced the original image to a 7 by 7 by 512 feature. 00:58:03.780 |
So really, there are these 49 discrete locations that we have. 00:58:08.500 |
And at every single one of these 49 locations, in YOLO, we're going to predict a class. 00:58:15.940 |
So every single one of these 49 will be some kind of a softmax. 00:58:20.140 |
And then additionally, at every single position, we're going to predict some number of bounding boxes. 00:58:25.060 |
So there are going to be B bounding boxes, each described by 5 numbers, so B times 5 predictions per location. 00:58:31.780 |
And the 5 comes from the fact that every bounding box will have five numbers associated with it. 00:58:36.300 |
So you have to describe the x, y, the width, and the height. 00:58:38.620 |
And you have to also indicate some kind of a confidence of that bounding box. 00:58:43.620 |
So that's the fifth number, some kind of a confidence measure. 00:58:46.180 |
So you basically end up predicting these bounding boxes. 00:58:52.280 |
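Just to make the bookkeeping concrete, here is a small Python sketch of the output tensor a YOLO-style head would predict; the grid size, class count, and number of boxes per cell are illustrative parameters, not the exact YOLO configuration:

    # YOLO-style output bookkeeping (illustrative numbers).
    S = 7          # grid is S x S, i.e. the 49 locations
    C = 20         # number of classes; one softmax per grid cell
    B = 2          # bounding boxes predicted per cell
    # Each box carries 5 numbers: x, y, width, height, confidence.
    per_cell = C + B * 5
    output_shape = (S, S, per_cell)   # e.g. (7, 7, 30)
    print(output_shape)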
And then you have some true bounding boxes in the image. 00:58:54.660 |
So you know that there are certain true boxes. 00:58:59.060 |
And what you do then is you match up the desired thing with the predicted thing. 00:59:03.780 |
And whatever-- so say, for example, you had one bounding box of a cat. 00:59:08.420 |
Then you would find the closest predicted bounding box. 00:59:12.760 |
And you would try to make that associated grid cell predict cat. 00:59:16.100 |
And you would nudge the prediction to be slightly more towards the cat box. 00:59:20.900 |
And so all of this can be done with simple losses. 00:59:25.540 |
Or if you want to get much more fancy, you could do dense image captioning. 00:59:28.940 |
So in this case, this is a combination of detection and image captioning. 00:59:32.380 |
This is a paper with my equal co-author Justin Johnson and Fei-Fei Li from last year. 00:59:41.580 |
But to a first-order approximation, it's basically detection. 00:59:45.400 |
But instead of predicting fixed classes, we predict a sequence of words. 00:59:53.260 |
So you can both detect and describe everything in a complex visual scene. 00:59:58.960 |
So that's just some overview of different tasks that people care about. 01:00:01.700 |
Most of them consist of just changing this top part. 01:00:04.560 |
You put a different loss function, a different data set. 01:00:07.240 |
But you'll see that this computational block stays relatively unchanged from task to task. 01:00:11.340 |
And that's why, as I mentioned, when you do transfer learning, you just want to take these pretrained networks. 01:00:15.980 |
And you mostly want to use whatever works well on ImageNet, because a lot of that transfers to other tasks. 01:00:22.420 |
So in the last part of the talk, I'd like to-- let me just make sure we're good on time. 01:00:27.280 |
So in the last part of the talk, I just wanted to give some hints or some practical considerations 01:00:31.740 |
when you want to apply convolutional networks in practice. 01:00:34.820 |
So first consideration you might have if you want to run these networks is, what hardware 01:00:40.460 |
So some of the options that I think are available to you-- well, first of all, you can just buy your own machine. 01:00:45.640 |
So for example, NVIDIA has these Digits dev boxes that you can buy. 01:00:50.260 |
They have Titan X GPUs, which are strong GPUs. 01:00:53.460 |
If you're much more ambitious, you can buy a DGX-1, which has the newest Pascal GPUs. 01:01:02.400 |
So this is kind of an expensive supercomputer. 01:01:05.600 |
But the Digits dev box, I think, is more accessible. 01:01:10.140 |
Alternatively, you can look at the specs of a dev box. 01:01:15.980 |
And then you can buy all the components yourself and assemble it like LEGO. 01:01:19.660 |
Unfortunately, that's prone to mistakes, of course. 01:01:22.500 |
But you can definitely reduce the price, maybe by a factor of two, compared to the NVIDIA boxes. 01:01:28.820 |
But of course, the NVIDIA machine would just come with all the software installed and all the hardware assembled. 01:01:35.620 |
You might also consider the cloud, but unfortunately, the cloud is actually not at a good place right now. 01:01:38.900 |
It's actually quite difficult to get GPUs in the cloud-- good GPUs, at least. 01:01:51.640 |
Microsoft Azure is coming up with its own offering soon. 01:01:57.080 |
And it's in some kind of a beta stage, if I remember correctly. 01:02:00.140 |
And so those are powerful GPUs, K80s, that would be available to you. 01:02:09.680 |
But they allow you to rent a box in the cloud. 01:02:11.900 |
So what that amounts to is that we have these boxes somewhere in the cloud. 01:02:24.920 |
So these options are available to you hardware-wise. 01:02:27.680 |
In terms of software, there are many different frameworks, of course, that you could use 01:02:32.280 |
So these are some of the more common ones that you might see in practice. 01:02:36.880 |
So different people have different recommendations on this. 01:02:40.000 |
My personal recommendation right now to most people, if you just want to apply this in 01:02:43.620 |
practical settings, 90% of the use cases are probably addressable with things like Keras. 01:02:49.300 |
So Keras would be my go-to number one thing to look at. 01:02:58.520 |
Keras sits on top of either TensorFlow or Theano, and basically, it's just a higher-level API over either of those. 01:03:01.080 |
So for example, I usually use Keras on top of TensorFlow. 01:03:03.840 |
And it's a much higher-level language than raw TensorFlow. 01:03:08.300 |
So you can also work in raw TensorFlow, but you'll have to do a lot of low-level stuff. 01:03:11.740 |
If you need all that freedom, then that's great, because raw TensorFlow allows you much 01:03:14.980 |
more freedom in terms of how you design everything. 01:03:19.820 |
For example, you have to declare every single weight yourself. 01:03:24.260 |
And so it's just much more wordy, but you can work at that level. 01:03:27.180 |
Or for most applications, I think Keras would be sufficient. 01:03:35.640 |
So those are the options that I would currently consider, at least. 01:03:41.600 |
Another practical consideration-- you might be wondering, what architecture do I use in 01:03:46.660 |
So my answer here-- and I've already hinted at this-- is don't be a hero. 01:03:52.500 |
Don't design your own neural networks and convolutional layers. 01:04:00.560 |
Look at whatever is currently the latest released thing that works really well in ILSVRC. 01:04:07.920 |
And then you potentially add or delete some layers on top, because you want to do some different task. 01:04:12.280 |
So that usually requires some tinkering at the top or something like that. 01:04:15.400 |
And then you fine tune it on your application. 01:04:19.880 |
The first-order advice, I think, for most applications would be: don't tinker with it too much. 01:04:25.340 |
But of course, you can also take CS231n, and then you might become much better at tinkering with these architectures. 01:04:35.680 |
The next question is what hyperparameters to use, and my answer here, again, would be don't be a hero. 01:04:41.520 |
For the most part, you'll see that all papers use the same hyperparameters. 01:04:45.500 |
So Adam-- when you use Adam for optimization, it's always a learning rate of 1e-3 or thereabouts. 01:04:54.840 |
It's always the similar kinds of learning rates. 01:04:58.780 |
One of the things you probably want to play with the most is the regularization. 01:05:02.680 |
And in particular, not the L2 regularization, but the dropout rates are what I would play with most. 01:05:10.000 |
Because you might have a much smaller or a much larger data set than ImageNet. 01:05:14.520 |
So you want to make sure that you regularize properly with dropout. 01:05:17.660 |
And then, as a second-degree consideration, you might want to tune the learning rate a bit. 01:05:22.880 |
But that usually doesn't have as much of an effect. 01:05:29.000 |
And this is 90% of the use cases, I would say. 01:05:33.840 |
Compare that to computer vision in 2011, where you might have had hundreds of hyperparameters. 01:05:42.000 |
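As a hedged illustration of those "don't be a hero" defaults (the model here is a placeholder; the point is Adam at roughly 1e-3 plus a dropout rate you tune to your dataset size):

    from tensorflow.keras import layers, models, optimizers

    model = models.Sequential([
        layers.Conv2D(64, 3, activation="relu", padding="same",
                      input_shape=(224, 224, 3)),
        layers.GlobalAveragePooling2D(),
        # The dropout rate is the knob to tune first, especially on small datasets.
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ])
    # Adam with a learning rate around 1e-3 is the common starting point.
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy")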
And in terms of distributed training: if you want to work at scale, because you 01:05:46.920 |
want to train on ImageNet or some large-scale data set, you might want to train across multiple GPUs. 01:05:51.120 |
So just to give you an idea, most of these state-of-the-art networks are trained on the 01:05:54.100 |
order of a few weeks across multiple GPUs, usually four or eight GPUs. 01:05:58.920 |
And these GPUs are roughly on the order of $1,000 each. 01:06:04.960 |
But you almost always want to train on multiple GPUs if possible. 01:06:08.460 |
Usually you don't end up training across machines. 01:06:10.380 |
That's much more rare, I think, to train across machines. 01:06:12.820 |
What's much more common is you have a single machine. 01:06:14.460 |
And it has eight Titan Xs or something like that. 01:06:16.840 |
And you do distributed training on those eight Titan Xs. 01:06:19.740 |
There are different ways to do distributed training. 01:06:21.520 |
So if you're feeling fancy, you can try to do some model parallelism, where you split the model itself across GPUs. 01:06:29.440 |
I would instead advise some kind of a data parallelism architecture. 01:06:32.060 |
So usually what you see in practice is you have eight GPUs. 01:06:35.420 |
So I take my batch of 256 images or something like that, split it across the eight GPUs, and each GPU does a forward and backward pass on its portion. 01:06:43.420 |
And then I basically just add up all the gradients. 01:06:49.420 |
And mathematically, you're doing the exact same thing as if you had a giant GPU. 01:06:53.940 |
But you're just splitting up that batch across different GPUs. 01:06:57.180 |
But you're still doing synchronous training with SGD as normal. 01:06:59.940 |
So that's what you'll see most in practice, which I think is the best thing to do right 01:07:07.140 |
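A framework-agnostic sketch of that synchronous data-parallel step; compute_gradients and apply_gradients are stand-in names for whatever your framework provides, not a real API:

    import numpy as np

    def synchronous_step(weights, batch, num_gpus, compute_gradients, apply_gradients):
        # Split the batch (e.g. 256 images) evenly across the GPUs.
        chunks = np.array_split(batch, num_gpus)
        # Each GPU does forward/backward on its chunk (written sequentially here for clarity).
        grads = [compute_gradients(weights, chunk) for chunk in chunks]
        # Sum the gradients: mathematically the same as one big batch on one giant GPU.
        total_grad = sum(grads)
        return apply_gradients(weights, total_grad)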
And other kind of considerations that sometimes enter that you could maybe worry about is 01:07:11.700 |
that there are these bottlenecks to be aware of. One is reading the data off disk. 01:07:18.940 |
You probably want that disk to be an SSD, because you want this loading to be quick. 01:07:23.220 |
Because these GPUs process data very quickly. 01:07:28.020 |
So in many applications, you might want to pre-process your data, make sure that it's 01:07:31.420 |
read out contiguously in very raw form from something like an HDF5 file or some similar format. 01:07:38.180 |
And another bottleneck to be aware of is the CPU-GPU bottleneck. 01:07:41.880 |
So the GPU is doing a lot of heavy lifting of the neural network. 01:07:46.180 |
And you might want to use things like prefetching threads, where the CPU, while the networks 01:07:50.180 |
are doing forward-backward on the GPU, your CPU is busy loading the data from the disk 01:07:54.380 |
and maybe doing some pre-processing, making sure it can ship the data off to the GPU as soon as it's needed. 01:08:00.900 |
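A hedged sketch of such an input pipeline using tf.data, where decoding happens on the CPU and batches are prefetched while the GPU trains; the file pattern and the parse function are placeholders:

    import tensorflow as tf

    def parse_example(path):
        # Placeholder: read and decode one image file on the CPU.
        image = tf.io.decode_jpeg(tf.io.read_file(path))
        return tf.image.resize(image, (224, 224))

    paths = tf.data.Dataset.list_files("data/*.jpg")   # hypothetical file pattern
    dataset = (paths
               .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
               .batch(256)
               # Prefetch: the CPU prepares the next batches while the GPU computes.
               .prefetch(tf.data.AUTOTUNE))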
So those are some of the practical considerations I could come up with for this lecture. 01:08:04.740 |
If you wanted to learn much more about convolutional neural networks and a lot of what I've been 01:08:07.520 |
talking about, then I encourage you to check out CS231n. 01:08:58.860 |
I'm using a lot of convolutional nets for genomics. 01:09:01.100 |
One of the problems that we see is that our genomic sequence tends to be arbitrary length. 01:09:06.860 |
So right now we're padding with a lot of zeros, but we're curious as to what your thoughts 01:09:10.340 |
are on using CNNs for things of arbitrary size. 01:09:18.380 |
So is this like a genomic sequence of like ATCG? 01:09:23.660 |
So some of the options would be-- recurrent neural networks might be a good fit, because they handle variable-length sequences naturally. 01:09:28.980 |
Another option I would say is if you look at the WaveNet paper from DeepMind, they have 01:09:32.900 |
audio and they're using convolutional networks for processing it. 01:09:35.740 |
And I would basically adopt that kind of an architecture. 01:09:37.700 |
They have this clever way of doing what are called à trous, or dilated, convolutions. 01:09:42.060 |
And so that allows you to capture a lot of context with few layers. 01:09:49.160 |
And there's an efficient implementation of it that you should be aware of on GitHub. 01:09:51.460 |
And so you might be able to just drag and drop the fast WaveNet code into that application. 01:09:56.100 |
And so you get much larger context, though of course not infinite context, as you might get with a recurrent network. 01:10:05.780 |
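A minimal Keras-style sketch of a WaveNet-like stack of dilated 1-D convolutions over a one-hot-encoded sequence; the channel counts, the depth, and the global pooling at the end are illustrative choices for handling arbitrary length, not the WaveNet architecture itself:

    from tensorflow.keras import layers, models

    inputs = layers.Input(shape=(None, 4))   # variable-length sequence, e.g. one-hot A/T/C/G
    x = inputs
    # Doubling the dilation rate each layer grows the receptive field exponentially.
    for dilation in (1, 2, 4, 8, 16):
        x = layers.Conv1D(64, 3, padding="causal", dilation_rate=dilation,
                          activation="relu")(x)
    # Pool over the length dimension so the output works for any sequence length.
    outputs = layers.GlobalMaxPooling1D()(x)
    model = models.Model(inputs, outputs)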
Our main problem is that the genes can be very short or very long, but the whole sequence 01:10:11.420 |
So I think that's one of the challenges that we're looking at with this type of problem. 01:10:16.780 |
Yeah, so those would be the two options that I would play with, basically. 01:10:29.540 |
So my question is that, is there a clear mathematical or conceptual understanding when people decide 01:10:34.540 |
how many hidden layers have to be part of their architecture? 01:10:38.540 |
So the answer to "is there a mathematical understanding" for a lot of this will likely be 01:10:43.300 |
no, because we are in very early phases of just doing a lot of empirical guess-and-check work. 01:10:49.380 |
And so theory is in some ways lagging behind a bit. 01:10:53.300 |
I would say that with residual networks, having more layers usually works better. 01:10:58.660 |
And so you can take these layers out or put them in; it's mostly a computational constraint. 01:11:04.760 |
So the consideration is usually: you have a GPU. 01:11:07.380 |
It has maybe 16 gigs of RAM or 12 gigs of RAM or something. 01:11:10.700 |
I want a certain batch size, and I have these considerations, and that upper-bounds the size of the network. 01:11:17.340 |
And so I use the biggest thing that fits in my GPU, and that's mostly how you choose the depth. 01:11:24.460 |
So if you have a very small data set, then you might end up with a pretty big network for it. 01:11:28.500 |
So you might want to make sure that you are tuning those dropout rates properly, and so on. 01:11:33.980 |
My understanding is that the recent convolutional nets don't use pooling layers, right? 01:11:44.140 |
So the question is, why don't they use pooling layers? 01:11:53.940 |
So certainly, so if you saw, for example, the residual network at the end, there was 01:11:57.540 |
a single pooling layer at the very beginning, but mostly they went away. 01:12:02.100 |
So it took-- I wonder if I can find the slide. 01:12:04.020 |
I wonder if this is a good idea to try to find the slide. 01:12:13.820 |
So this was the residual network architecture. 01:12:15.700 |
So you see that they do a first conv, and then there's a single pool right there. 01:12:19.820 |
But certainly, the trend has been to throw them away over time. 01:12:24.020 |
There's a paper on this called Striving for Simplicity: The All Convolutional Net. 01:12:27.620 |
And the point in that paper is, look, you can actually do strided convolutions. 01:12:31.020 |
You can throw away pooling layers altogether, and it works just as well. 01:12:34.300 |
So pooling layers are, I would say, a bit of a historical vestige 01:12:37.820 |
from when people needed things to be efficient and wanted to control capacity and downsample aggressively. 01:12:43.060 |
And so we're kind of throwing them away over time. 01:12:44.780 |
And yeah, they're not doing anything super useful; they throw away spatial information. 01:12:52.420 |
So maybe you don't actually want to get rid of that information. 01:12:55.820 |
So it's probably more appealing, I would say, to throw them away. 01:13:00.220 |
But you mentioned there is a sort of cognitive or brain analogy that the brain is doing pooling. 01:13:06.620 |
Yeah, so I think that analogy is stretched by a lot. 01:13:09.220 |
So the brain-- I'm not sure if the brain is doing pooling. 01:13:15.220 |
Not for just classification, but the usage of neural networks for image compression. 01:13:18.220 |
Instead of classification for images, can we use the neural networks for image compression? 01:13:32.660 |
Yeah, I think there's actually really exciting work in this area. 01:13:35.600 |
So one that I'm aware of, for example, is recent work from Google, where they're using 01:13:39.840 |
convolutional networks and recurrent networks to come up with variably sized codes for images. 01:13:44.760 |
So certainly, a lot of these generative models, I mean, they are very related to compression. 01:13:49.300 |
So definitely a lot of work in the area that I'm excited about. 01:13:52.740 |
Also, for example, super resolution networks. 01:13:54.700 |
So you saw the recent acquisition of Magic Pony by Twitter. 01:13:58.940 |
So they were also doing something that basically allows you to compress. 01:14:02.260 |
You can send low resolution streams, because you can upsample it on the client. 01:14:11.860 |
Can you please comment on scalability regarding number of classes? 01:14:18.380 |
So what does it take if we go up to 10,000 or 100,000 classes? 01:14:22.420 |
Yeah, so if you have a lot of classes, then of course, you can grow your softmax, but 01:14:26.940 |
that becomes inefficient at some point, because you're doing a giant matrix multiply. 01:14:31.100 |
So one of the ways that people are addressing this in practice, I believe, is the use of a hierarchical softmax. 01:14:37.460 |
So you decompose your classes into groups, and then you kind of predict one group at a time. 01:14:47.140 |
So I see these papers, but I'm not an expert on exactly how this works. 01:14:51.700 |
But I do know that hierarchical softmax is something that people use in this setting. 01:14:54.740 |
Especially, for example, in language models, this is often used, because you have a huge 01:14:57.900 |
amount of words, and you still need to predict them somehow. 01:15:00.380 |
And so I believe Tomas Mikolov, for example, has some papers on using hierarchical softmax there. 01:15:06.980 |
Could you talk a little bit about the convolutional functions? 01:15:11.220 |
Like, what considerations should you make in selecting the functions that are used in the convolutional filters? 01:15:18.180 |
Selecting the functions used in the convolutional filters? 01:15:25.100 |
The filters are just numbers that we train with backpropagation. 01:15:29.100 |
Are you talking about the nonlinearities, perhaps? 01:15:30.860 |
Yeah, I'm just wondering: when you're selecting the features, or when you're 01:15:36.340 |
trying to train to understand different features within an image, what are the considerations? 01:15:44.580 |
You're talking about understanding exactly what those filters are looking for in the 01:15:48.020 |
So there's a lot of interesting work here; for example, Jason Yosinski has this visualization toolbox. 01:15:53.100 |
And I've shown you that you can kind of debug it that way a bit. 01:15:55.740 |
There's an entire lecture in CS231n, which I encourage you to watch, on visualizing and understanding convolutional networks. 01:16:02.260 |
So people use things like a deconv or guided backpropagation. 01:16:06.420 |
Or you backpropagate to the image, and you try to find a stimulus that maximally activates a particular neuron. 01:16:11.740 |
So different ways of probing it, and different ways have been developed. 01:16:20.140 |
I had a question regarding the size of fine-tuning data set. 01:16:24.940 |
For example, is there a ballpark number if you are trying to do classification? 01:16:30.940 |
How many do you need for fine-tuning it to your sample set? 01:16:36.020 |
So how many data points do you need to get good performance? 01:16:43.100 |
So this is like the most boring answer, I think. 01:16:47.860 |
And it's really hard to say, actually, how many you need. 01:16:52.540 |
So usually one way to look at it is-- one heuristic that people sometimes follow is 01:16:57.100 |
you look at the number of parameters, and you want the number of examples to be on the same order. 01:17:01.860 |
That's one way people sometimes break it down. 01:17:07.300 |
So I was hoping that most of the things would be taken care of there, and then you're just fine-tuning on top. 01:17:14.900 |
So when you're saying fine-tuning, are you fine-tuning the whole network, or are you freezing part of it? 01:17:21.020 |
So another way to look at it is you have some number of parameters, and you can estimate 01:17:23.660 |
the number of bits that you think every parameter has. 01:17:27.280 |
And then you count the number of bits in your data. 01:17:29.280 |
So that's the kind of comparisons you would do. 01:17:35.460 |
And you have to try, and you have to regularize, and you have to cross-validate that, and you 01:17:38.020 |
have to see what performance you get over time. 01:17:40.940 |
Because it's too task-dependent for me to say something stronger. 01:17:44.580 |
I would like to know how you think the ConvNet will work in the 3D case. 01:17:49.940 |
Like, is it just a simple extension of the 2D case, or do we need some extra tweaks to the architecture? 01:17:56.420 |
So in the 3D case, so you're talking specifically about, say, videos or some 3D-- 01:17:59.680 |
Actually, I'm talking about the image that has the depth information. 01:18:05.200 |
So say you have like RGBD input and things like that. 01:18:10.200 |
But I do know, for example, that one thing you can do is just treat the depth as an extra input channel. 01:18:17.520 |
Or maybe you want a separate ConvNet on top of the depth channel and do some fusion later. 01:18:21.440 |
So I don't know exactly what the state of the art in treating that depth channel is right now. 01:18:27.480 |
So I don't know exactly how they do it right now. 01:18:32.000 |
Just how do you think the 3D object recognition-- 01:18:43.120 |
Where we are not treating the 2D image, but a 3D representation of the object. 01:18:54.040 |
But the problem with these meshes and so on is that there's this rotational degree of 01:18:58.440 |
freedom that I'm not sure what people do about, honestly. 01:19:06.440 |
There are some obvious things you might want to try. 01:19:07.720 |
You might want to plug in all the possible ways you could orient this, and then at test time average over them. 01:19:13.120 |
So that would be some of the obvious things to play with. 01:19:14.780 |
But I'm not actually sure what the state of the art is. 01:19:21.600 |
So coming back to distributed training, is it possible to do even the classification in a distributed way? 01:19:27.520 |
Or my question is, in the future, can I imagine our cell phones doing these things together? 01:19:38.640 |
You're trying to get cell phones distributed training. 01:19:41.440 |
To train and also classify on a single cell phone. 01:19:47.080 |
So a related thought I had recently: I have ConvNetJS in the browser. 01:19:50.520 |
And I was thinking, basically, this trains networks right in the browser. 01:19:55.400 |
Because you could imagine shipping this off as an ad equivalent. 01:19:58.600 |
Like people just include this in the JavaScript. 01:20:00.400 |
And then everyone's browsers are kind of like training a small network. 01:20:06.160 |
But do you think there's too much communication overhead? 01:20:08.600 |
Or it could be actually really distributed in an efficient way? 01:20:12.160 |
Yes, so the problem with distributing it a lot is actually the stale gradients problem. 01:20:16.960 |
So when you look at some of the papers that Google has put out about distributed training, 01:20:21.320 |
as you look at the number of workers when you do asynchronous SGD 01:20:25.580 |
versus the performance improvement you get, it kind of plateaus quite quickly after eight or so workers. 01:20:31.560 |
So I'm not sure if there are ways of dealing with thousands of workers. 01:20:35.120 |
The issue is that every worker has a specific snapshot of 01:20:39.840 |
the weights that it pulled from the master. 01:20:45.820 |
And now you have a set of weights that you're using. 01:20:50.240 |
But by the time you've done your forward and backward pass and send an update, the parameter 01:20:53.320 |
server has already applied lots of updates from thousands of other workers. 01:20:59.480 |
So your gradient is stale: you've evaluated it at an old, wrong location. 01:21:07.480 |
And I'm not sure what people are doing about this. 01:21:11.440 |
I was wondering about applications of convolutional nets to two inputs at a time. 01:21:17.200 |
So let's say you have two pictures of jigsaw puzzle pieces. 01:21:21.240 |
And you're trying to figure out if they fit together, or how one object compares to another. 01:21:27.480 |
Have you heard of any implementation of this kind? 01:21:32.320 |
So the common way of dealing with that is you put a ConvNet on each. 01:21:35.120 |
And then you do some kind of a fusion eventually to merge the information. 01:21:40.480 |
And what about for recurrent neural networks, if you had variable input? 01:21:45.200 |
So for example, in the context of videos where you have frames coming in, then yes, some 01:21:48.600 |
of the approaches are you have a convolutional network on a frame. 01:21:51.240 |
And then at the top, you tie it in with a recurrent neural network. 01:21:54.760 |
So you have these-- you reduce the image to some kind of a lower dimensional representation. 01:21:58.920 |
And then that's an input to a recurrent neural network at the top. 01:22:04.200 |
For example, you can actually make every single neuron in the ConvNet recurrent. 01:22:11.420 |
So right now, when a neuron computes its output, it's only a function of a local neighborhood in the layer below. 01:22:17.880 |
But you can also make it, in addition, a function of that same local neighborhood or its own 01:22:22.740 |
activation perhaps at the previous time step, if that makes sense. 01:22:27.800 |
So this neuron is not just computing a dot product with the current patch, but it's also 01:22:31.920 |
incorporating a dot product of its own and maybe its neighborhood's activations at the previous time step. 01:22:38.640 |
So that's kind of like a small RNN update hidden inside every single neuron. 01:22:41.640 |
So those are the things that I think people play with, though I'm not familiar with what the current state of the art is there. 01:22:50.920 |
I have a question regarding the latency for the models that are trained using multiple 01:22:56.280 |
So especially at prediction time, as we add more layers, the forward pass will take longer. 01:23:02.200 |
It will increase the latency of the prediction. 01:23:05.060 |
So what are the prediction-time numbers that we have seen presently, if you can share them? 01:23:17.520 |
So you're worried, for example, you want to run a prediction very quickly. 01:23:21.280 |
Would it be on an embedded device, or is this in the cloud? 01:23:26.360 |
You're identifying the objects, or you're doing some image analysis or something. 01:23:35.640 |
So one way you would approach this, actually, is you have this network that you've trained 01:23:38.920 |
using floating point arithmetic, 32 bits, say. 01:23:42.320 |
And so there's a lot of work on taking that network and discretizing all the weights into 01:23:47.720 |
like ints and making it much smaller and pruning connections. 01:23:51.200 |
So one of the works related to this, for example, is Song Han here at Stanford has a few papers 01:23:56.200 |
on getting rid of spurious connections and reducing the network as much as possible, 01:24:00.200 |
and then making everything very efficient with integer arithmetic. 01:24:03.480 |
So basically, you achieve this by discretizing all the weights and all the activations and so on. 01:24:12.480 |
So there are some tricks like that that people play. 01:24:15.480 |
That's mostly what you would do on an embedded device. 01:24:18.280 |
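A minimal sketch of the simplest form of this idea, post-training symmetric quantization of weights to int8; the real pruning and quantization work mentioned here is considerably more involved:

    import numpy as np

    def quantize_int8(weights):
        # Map float weights into the int8 range with a single scale factor.
        scale = np.abs(weights).max() / 127.0
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        # Approximate reconstruction used at inference time.
        return q.astype(np.float32) * scale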
And then the challenge, of course, is you've changed the network, and now you just kind 01:24:21.480 |
of are crossing your fingers that it works well. 01:24:23.320 |
And so I think what's interesting from a research standpoint is that you'd like your test-time setup to match how you trained. 01:24:32.960 |
And so the question is, how do we train with low precision arithmetic? 01:24:35.960 |
And there's a lot of work on this as well, so say from Yoshua Bengio's lab as well. 01:24:41.040 |
So that's exciting directions of how you train in a low precision regime. 01:24:45.080 |
Do you have any numbers that you can share for the state of the art, how much time it takes? 01:24:50.800 |
Yes, I see the papers, but I'm not sure if I remember the exact reductions. 01:24:54.600 |
It's on the order of-- OK, I don't want to say, because basically I don't know.