
Deep Learning for Computer Vision (Andrej Karpathy, OpenAI)


Chapters

0:00 Deep Learning for Computer Vision
4:23 Computer Vision 2011
9:59 Transfer Learning
11:16 The power is easily accessible.
11:58 ConvNets are everywhere...
16:58 Convolution Layer
19:25 For example, if we had 6 5x5 filters, we'll get 6 separate activation maps
24:56 MAX POOLING
32:13 Case Study: AlexNet
52:34 Addressing other tasks...
53:24 Image Classification thing = a vector of probabilities for different classes
54:09 Localization
54:48 Reinforcement Learning
55:38 Segmentation
57:26 Variational Autoencoders
57:41 Detection
59:26 Dense Image Captioning

Whisper Transcript

00:00:00.000 | So thank you very much for the introduction.
00:00:03.160 | So today I'll speak about deep learning, especially in the context of computer vision.
00:00:07.880 | So what you saw in the previous talk is neural networks.
00:00:10.760 | So you saw that neural networks are organized into these layers, fully connected layers,
00:00:14.720 | where neurons within one layer are not connected to each other, but they're connected fully to all the neurons
00:00:18.680 | in the previous layer.
00:00:19.920 | And we saw that basically we have this layer-wise structure from input until output.
00:00:25.280 | And there are neurons and nonlinearities, et cetera.
00:00:27.720 | Now so far we have not made too many assumptions about the inputs.
00:00:31.140 | So in particular, here we just assume that an input is some kind of a vector of numbers
00:00:34.700 | that we plug into this neural network.
00:00:37.340 | So that's both a bug and a feature to some extent, because in most real world applications
00:00:44.700 | we actually can make some assumptions about the input that makes learning much more efficient.
00:00:52.140 | So in particular, usually we don't just want to plug into neural networks vectors of numbers,
00:00:58.460 | but they actually have some kind of a structure.
00:01:00.180 | So we don't have vectors of numbers, but these numbers are arranged in some kind of a layout,
00:01:04.820 | like an n-dimensional array of numbers.
00:01:07.040 | So for example, spectrograms are two-dimensional arrays of numbers.
00:01:09.780 | Images are three-dimensional arrays of numbers.
00:01:12.060 | Videos would be four-dimensional arrays of numbers.
00:01:14.260 | Text you could treat as one-dimensional array of numbers.
00:01:17.060 | And so whenever you have this kind of local connectivity structure in your data, then
00:01:21.500 | you'd like to take advantage of it, and convolutional neural networks allow you to do that.
00:01:26.140 | So before I dive into convolutional neural networks and all the details of the architectures,
00:01:30.020 | I'd like to briefly talk about a bit of the history of how this field evolved over time.
00:01:34.800 | So I like to start off usually with talking about Hubel and Wiesel and the experiments
00:01:38.860 | that they performed in the 1960s.
00:01:40.660 | So what they were doing is trying to study the computations that happened in the early
00:01:45.240 | visual cortex areas of a cat.
00:01:48.220 | And so they had cats, and they plugged in electrodes that could record from the different
00:01:52.540 | neurons.
00:01:53.700 | And then they showed the cat different patterns of light.
00:01:56.180 | And they were trying to debug neurons effectively and try to show them different patterns and
00:01:59.820 | see what they responded to.
00:02:01.780 | And a lot of these experiments inspired some of the modeling that came in afterwards.
00:02:07.100 | So in particular, one of the early models that tried to take advantage of some of the
00:02:10.020 | results of these experiments was the model called the Neocognitron from Fukushima in the
00:02:17.260 | 1980s.
00:02:18.260 | And so what you saw here was this architecture that, again, is layer-wise, similar to what
00:02:22.020 | you see in the cortex, where you have these simple and complex cells, where the simple
00:02:26.020 | cells detect small things in the visual field.
00:02:29.720 | And then you have this local connectivity pattern, and the simple and complex cells
00:02:32.860 | alternate in this layered architecture throughout.
00:02:36.300 | And so this looks a bit like a ConvNet, because you have some of its features, like, say,
00:02:40.540 | the local connectivity.
00:02:41.860 | But at the time, this was not trained with backpropagation.
00:02:44.140 | These were specific, heuristically chosen updates.
00:02:49.800 | And this was unsupervised learning back then.
00:02:52.340 | So the first time that we've actually used backpropagation to train some of these networks
00:02:55.300 | was in the work of Yann LeCun in the 1990s.
00:02:59.000 | And so this is an example of one of the networks that was developed back then, in the 1990s,
00:03:04.380 | by Yann LeCun: LeNet-5.
00:03:06.220 | And this is what you would recognize today as a convolutional neural network.
00:03:09.280 | So it has a lot of these convolutional layers.
00:03:12.380 | And it's alternating.
00:03:13.820 | And it's a similar kind of design to what you would see in Fukushima's Neocognitron.
00:03:18.180 | But this was actually trained with backpropagation end-to-end using supervised learning.
00:03:24.320 | So this happened in roughly the 1990s.
00:03:26.260 | And we're here in 2016, basically about 20 years later.
00:03:31.240 | Now computer vision has, for a long time, kind of worked on larger images.
00:03:38.740 | And a lot of these models back then were applied to very small kind of settings, like, say,
00:03:43.420 | recognizing digits in zip codes and things like that.
00:03:46.780 | And they were very successful in those domains.
00:03:48.700 | But back at least when I entered computer vision, roughly 2011, a lot of people
00:03:52.820 | were aware of these models.
00:03:54.540 | But it was thought that they would not scale up naively into large, complex images, that
00:03:59.900 | they would be constrained to these toy tasks for a long time.
00:04:02.620 | Or I shouldn't say toy, because these were very important tasks, but certainly like smaller
00:04:06.180 | visual recognition problems.
00:04:08.180 | And so in computer vision in roughly 2011, it was much more common to use a kind of these
00:04:12.980 | feature-based approaches at the time.
00:04:15.220 | And they didn't work actually that well.
00:04:17.060 | So when I entered my PhD in 2011 working on computer vision, you would run a state-of-the-art
00:04:21.020 | object detector on this image, and you might get something like this, where cars were detected
00:04:27.020 | in trees.
00:04:28.020 | And you would kind of just shrug your shoulders and say, well, that just happens sometimes.
00:04:31.240 | You kind of just accept it as something that would just happen.
00:04:36.080 | And of course, this is a caricature.
00:04:37.220 | Things actually worked relatively decent, I should say.
00:04:39.580 | But definitely there were many mistakes that you would not see today, here in
00:04:44.380 | 2016, five years later.
00:04:47.100 | And so a lot of computer vision kind of looked much more like this.
00:04:49.660 | When you look into a paper that tried to do image classification, you would find this
00:04:53.900 | section in the paper on the features that they used.
00:04:56.600 | So this is one page of features.
00:04:59.420 | And so they would use GIST, HOG, et cetera, and then a second page of features and all
00:05:05.660 | their hyperparameters.
00:05:06.940 | So all kinds of different histograms.
00:05:07.940 | And you would extract this kitchen sink of features and a third page here.
00:05:12.500 | And so you end up with this very large, complex code base, because some of these feature types
00:05:16.660 | are implemented in MATLAB, some of them in Python, some of them in C++.
00:05:20.180 | And you end up with this large code base of extracting all these features, caching them,
00:05:23.520 | and then eventually plugging them into linear classifiers to do some kind of visual recognition
00:05:26.940 | task.
00:05:27.940 | So it was quite unwieldy.
00:05:31.100 | But it worked to some extent.
00:05:32.700 | But there was definitely room for improvement.
00:05:34.920 | And so a lot of this changed in computer vision in 2012 with this paper from Alex Krizhevsky,
00:05:39.860 | Ilya Sutskever, and Geoff Hinton.
00:05:41.900 | So this is the first time that someone took a convolutional neural network that is very
00:05:46.180 | similar to the one that you saw from 1998 from Yann LeCun.
00:05:50.040 | And I'll go into details of how they differ exactly.
00:05:52.780 | But they took that kind of network.
00:05:54.340 | They scaled it up.
00:05:55.340 | They made it much bigger.
00:05:56.340 | And they trained it on a much bigger data set on GPUs.
00:05:58.860 | And things basically ended up working extremely well.
00:06:01.060 | And this is the first time that computer vision community has really noticed these models
00:06:04.260 | and adopted them to work on larger images.
00:06:08.540 | So we saw that the performance of these models has improved drastically.
00:06:13.540 | Here we are looking at the ImageNet ILSVRC visual recognition challenge over the years.
00:06:19.440 | And we're looking at the top five errors.
00:06:20.860 | So low is good.
00:06:21.860 | And you can see that from 2010 in the beginning, these were feature-based methods.
00:06:26.700 | And then in 2012, we had this huge jump in performance.
00:06:29.740 | And that was due to the first kind of convolutional neural network in 2012.
00:06:33.980 | And then we've managed to push that over time.
00:06:35.620 | And now we're down to about 3.57%.
00:06:38.500 | I think the results for ImageNet Challenge 2016 are actually due to come out today.
00:06:44.560 | But I don't think that actually they've come out yet.
00:06:46.740 | I have this second tab here opened.
00:06:51.220 | I was waiting for the result.
00:06:52.380 | But I don't think this is up yet.
00:06:54.100 | Yeah.
00:06:57.100 | Nothing.
00:06:58.100 | All right.
00:06:59.100 | Well, we'll get to find out very soon what happens right here.
00:07:00.540 | So I'm very excited to see that.
00:07:02.900 | Just to put this in context, by the way, because you're just looking at numbers, like 3.57,
00:07:06.260 | how good is that?
00:07:07.380 | That's actually really, really good.
00:07:08.900 | So something that I did about two years ago now is that I tried to measure the human accuracy
00:07:14.500 | on this data set.
00:07:15.500 | And so what I did for that is I developed this web interface where I would show myself
00:07:20.700 | ImageNet images from the test set.
00:07:22.820 | And then I had this interface here where I would have all the different classes of ImageNet.
00:07:27.340 | There's 1,000 of them.
00:07:28.620 | And some example images.
00:07:30.260 | And then basically, you go down this list and you scroll for a long time and you find
00:07:33.780 | what class you think that image might be.
00:07:36.060 | And then I competed against the ConvNet at the time.
00:07:39.700 | And this was GoogLeNet in 2014.
00:07:44.820 | And so HotDog is a very simple class.
00:07:46.620 | You can do that quite easily.
00:07:48.440 | But why is the error not 0%?
00:07:50.260 | Well, some of the things, like HotDog seems very easy.
00:07:53.060 | Why isn't it trivial for humans to see?
00:07:54.580 | Well, it turns out that some of the images in a test set of ImageNet are actually mislabeled.
00:07:58.940 | But also, some of the images are just very difficult to guess.
00:08:02.340 | So in particular, if you have this terrier, there's 50 different types of terriers.
00:08:05.700 | And it turns out to be a very difficult task to find exactly which type of terrier it is.
00:08:10.660 | You can spend minutes trying to find it.
00:08:12.140 | It turns out that convolutional neural networks are actually extremely good at this.
00:08:16.100 | And so this is where I would lose points compared to the ConvNet.
00:08:20.080 | So I estimate that human error based on this is roughly in the 2% to 5% range, depending
00:08:23.940 | on how much time you have and how much expertise you have and how many people you involve and
00:08:28.320 | how much they really want to do this, which is not too much.
00:08:31.940 | And so really, we're doing extremely well.
00:08:34.460 | And so we're down to 3%.
00:08:36.260 | And I think the error rate, if I remember correctly, was about 1.5%.
00:08:40.700 | So if we get below 1.5%, I would be extremely suspicious on ImageNet.
00:08:45.500 | That seems wrong.
00:08:46.680 | So to summarize, basically, what we've done is, before 2012, computer vision looked somewhat
00:08:53.060 | like this, where we had these feature extractors.
00:08:54.940 | And then we trained a small portion at the end of the feature extraction step.
00:08:59.800 | And so we only trained this last piece on top of these features that were fixed.
00:09:03.640 | And we've basically replaced the feature extraction step with a single convolutional neural network.
00:09:07.820 | And now we train everything completely end to end.
00:09:09.880 | And this turns out to work quite nicely.
00:09:12.080 | So I'm going to go into details of how this works in a bit.
00:09:15.220 | Also in terms of code complexity, we kind of went from a setup that looks-- whoops.
00:09:20.320 | I'm way ahead.
00:09:23.000 | We went from a setup that looks something like that in papers to something like, instead
00:09:27.200 | of extracting all these things, we just say, apply 20 layers with 3 by 3 conv or something
00:09:31.160 | like that.
00:09:32.160 | And things work quite well.
00:09:33.640 | This is, of course, an over-exaggeration.
00:09:35.100 | But I think it's a correct first-order statement to make: we've definitely seen that
00:09:39.560 | we've reduced code complexity quite a lot, because these architectures are so homogeneous
00:09:43.860 | compared to what we've done before.
00:09:46.180 | So it's also remarkable that-- so we had this reduction in complexity.
00:09:49.660 | We had this amazing performance on ImageNet.
00:09:51.900 | One other thing that was quite amazing about the results in 2012 that is also a separate
00:09:56.040 | thing that did not have to be the case is that the features that you learn by training
00:10:00.540 | on ImageNet turn out to be quite generic.
00:10:02.380 | And you can apply them in different settings.
00:10:04.380 | So in other words, this transfer learning works extremely well.
00:10:08.280 | And of course, I didn't go into details of convolutional networks yet.
00:10:10.520 | But we start with an image.
00:10:12.040 | And we have a sequence of layers, just like in a normal neural network.
00:10:14.700 | And at the end, we have a classifier.
00:10:16.540 | And when you pre-train this network on ImageNet, then it turns out that the features that you
00:10:20.540 | learn in the middle are actually transferable.
00:10:23.180 | And you can use them on different data sets, and that this works extremely well.
00:10:26.500 | And so that didn't have to be the case.
00:10:28.240 | You might imagine that you could have a convolutional network that works extremely well on ImageNet.
00:10:32.220 | But when you try to run it on something else, like BIRDS data set or something, that it
00:10:35.540 | might just not work well.
00:10:36.940 | But that is not the case.
00:10:38.020 | And that's a very interesting finding, in my opinion.
00:10:40.940 | So people noticed this back in roughly 2013, after the first convolutional networks.
00:10:45.880 | They noticed that you can actually take many computer vision data sets.
00:10:49.060 | And it used to be that you would compete on all of these separately and design features
00:10:52.000 | maybe for some of these separately.
00:10:53.980 | And you can just shortcut all those steps that we had designed.
00:10:58.060 | And you can just take these pre-trained features that you get from ImageNet.
00:11:01.740 | And you can just train a linear classifier on every single data set on top of those features.
00:11:05.260 | And you obtain many state-of-the-art results across many different data sets.
00:11:08.820 | And so this was quite a remarkable finding back then, I believe.
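[A minimal sketch of that recipe, with random arrays standing in for the real ConvNet features. In practice the features would be something like the 4096-dimensional fc7 activations of an ImageNet-pretrained network; the scikit-learn classifier here is just one convenient choice.]

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for real data: in practice these would be features extracted from
# an ImageNet-pretrained ConvNet (e.g. 4096-d fc7 activations per image).
features_train = np.random.randn(500, 4096)
labels_train = np.random.randint(0, 10, size=500)
features_test = np.random.randn(100, 4096)
labels_test = np.random.randint(0, 10, size=100)

# Keep the pretrained features fixed and train only a linear classifier on top.
clf = LinearSVC(C=1.0)
clf.fit(features_train, labels_train)
print("test accuracy:", clf.score(features_test, labels_test))
```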
00:11:13.020 | So things worked very well on ImageNet.
00:11:14.580 | Things transferred very well.
00:11:16.300 | And the code complexity, of course, got much more manageable.
00:11:20.100 | So now all this power is actually available to you with very few lines of code.
00:11:23.780 | If you want to just use a convolutional network on images, it turns out to be only a few lines
00:11:28.180 | of code.
00:11:29.180 | If you use, for example, Keras, it's one of the deep learning libraries that I'm going
00:11:32.020 | to go into and I'll mention again later in the talk.
00:11:35.260 | But basically, you just load a state-of-the-art convolutional neural network.
00:11:38.400 | You take an image.
00:11:39.400 | You load it.
00:11:40.400 | And you compute your predictions.
00:11:41.740 | And it tells you that this is an African elephant inside that image.
00:11:45.380 | And this takes a couple hundred milliseconds, or a couple of tens of milliseconds if you have a GPU.
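[For reference, the few lines being described look roughly like this; a sketch assuming the keras.applications API and a local image file named elephant.jpg.]

```python
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image

model = VGG16(weights='imagenet')              # load a pretrained ConvNet

img = image.load_img('elephant.jpg', target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)                       # probabilities over 1000 classes
print(decode_predictions(preds, top=3))        # e.g. 'African_elephant', ...
```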
00:11:49.700 | And so everything got much faster, much simpler, works really well, transfers really well.
00:11:53.500 | So this was really a huge advance in computer vision.
00:11:55.980 | And so as a result of all these nice properties, ConvNets today are everywhere.
00:11:59.940 | So here is a collection of some of the things that I try to find across different applications.
00:12:07.020 | So for example, you can search Google Photos for different types of categories, like in
00:12:11.500 | this case Rubik's Cube.
00:12:13.620 | You can find house numbers very efficiently.
00:12:16.420 | You can-- of course, this is very relevant in self-driving cars.
00:12:18.940 | And we're doing perception in the cars.
00:12:21.100 | Convolutional networks are very relevant there.
00:12:22.760 | Medical image diagnosis, recognizing Chinese characters, doing all kinds of medical segmentation
00:12:27.660 | tasks.
00:12:29.280 | Quite random tasks, like whale recognition and more generally many Kaggle challenges.
00:12:34.500 | Satellite image analysis, recognizing different types of galaxies.
00:12:37.620 | You may have seen recently that a WaveNet from DeepMind, also a very interesting paper
00:12:42.600 | that they generate music and they generate speech.
00:12:46.300 | And so this is a generative model.
00:12:47.580 | And that's also just a ConvNet doing most of the heavy lifting here.
00:12:50.700 | So it's a convolutional network on top of sound.
00:12:53.780 | And other tasks, like image captioning.
00:12:56.420 | In the context of reinforcement learning and agent environment interactions, we've also
00:13:00.680 | seen a lot of advances from using ConvNets as the core computational building block.
00:13:04.620 | So when you want to play Atari games, or you want to play AlphaGo, or Doom, or StarCraft,
00:13:08.920 | or if you want to get robots to perform interesting manipulation tasks, all of this uses ConvNets
00:13:13.940 | as a core computational block to do very impressive things.
00:13:19.740 | Not only are we using it for a lot of different applications, we're also finding uses in art.
00:13:26.300 | So here are some examples from DeepDream.
00:13:28.100 | So you can basically simulate what it looks like, what it feels like maybe to be on some
00:13:32.660 | drugs.
00:13:33.660 | So you can take images and you can just hallucinate features using ConvNets.
00:13:37.020 | Or you might be familiar with neural style, which allows you to take arbitrary images
00:13:40.340 | and transfer arbitrary styles of different paintings like Van Gogh on top of them.
00:13:44.480 | And this is all using convolutional networks.
00:13:46.620 | The last thing I'd like to note that I find also interesting is that in the process of
00:13:50.620 | trying to develop better computer vision architectures and trying to basically optimize for performance
00:13:56.000 | on the ImageNet challenge, we've actually ended up converging to something that potentially
00:14:00.100 | might function something like your visual cortex in some ways.
00:14:03.520 | And so these are some of the experiments that I find interesting where they've studied macaque
00:14:07.200 | monkeys and they record from a subpopulation of the IT cortex.
00:14:13.660 | This is the part that does a lot of object recognition.
00:14:15.940 | And so they record.
00:14:16.940 | So basically, they take a monkey and they take a ConvNet and they show them images.
00:14:20.540 | And then you look at how those images are represented at the end of this network.
00:14:24.280 | So inside the monkey's brain or on top of your convolutional network.
00:14:27.460 | And so you look at representations of different images.
00:14:29.460 | And then it turns out that there's a mapping between those two spaces that actually seems
00:14:33.460 | to indicate to some extent that some of the things we're doing somehow ended up converging
00:14:37.680 | to something that the brain could be doing as well in the visual cortex.
00:14:42.060 | So that's just some intro.
00:14:43.380 | I'm now going to dive into convolutional networks and try to explain briefly how these networks
00:14:49.500 | work.
00:14:50.500 | Of course, there's an entire class on this that I taught, which is a convolutional networks
00:14:53.460 | class.
00:14:54.460 | And so I'm going to distill some of those 13 lectures into one lecture.
00:14:57.940 | So we'll see how that goes.
00:14:59.540 | I won't cover everything, of course.
00:15:02.900 | So convolutional neural network is really just a single function.
00:15:06.380 | It's a function from the raw pixels of some kind of an image.
00:15:09.940 | So we take 224 by 224 by 3 image.
00:15:12.700 | So 3 here is for the color channels, RGB.
00:15:15.180 | You take the raw pixels, you put it through this function, and you get 1,000 numbers at
00:15:18.540 | the end.
00:15:19.540 | In the case of image classification, if you're trying to categorize images into 1,000 different
00:15:23.060 | classes.
00:15:24.520 | And really, functionally, all that's happening in a convolutional network is just dot products
00:15:28.860 | and max operations.
00:15:30.400 | That's everything.
00:15:31.400 | They're wired up together in interesting ways so that you are basically doing visual recognition.
00:15:36.460 | And in particular, this function f has a lot of knobs in it.
00:15:40.620 | So these W's here that participate in these dot products and in these convolutions and
00:15:44.140 | fully connected layers and so on, these W's are all parameters of this network.
00:15:48.220 | So normally, you might have about on the order of 10 million parameters.
00:15:51.600 | And those are basically knobs that change this function.
00:15:55.580 | And so we'd like to change those knobs, of course, so that when you put images through
00:15:59.900 | that function, you get probabilities that are consistent with your training data.
00:16:04.060 | And so that gives us a lot to tune.
00:16:06.180 | And it turns out that we can do that tuning automatically with back propagation through
00:16:09.980 | that search process.
00:16:11.300 | Now, more concretely, a convolutional neural network is made up of a sequence of layers,
00:16:15.620 | just as in the case of normal neural networks.
00:16:17.580 | But we have different types of layers that we play with.
00:16:20.280 | So we have convolutional layers.
00:16:21.860 | Here I'm using rectified linear unit, ReLU, for short, as a non-linearity.
00:16:26.460 | So I'm making that explicitly its own layer. Then pooling layers, and fully connected layers.
00:16:31.940 | The core computational building block of a convolutional network, though, is this convolutional
00:16:35.580 | layer.
00:16:36.580 | And we have non-linearities interspersed.
00:16:38.680 | We are probably getting rid of things like pooling layers.
00:16:40.980 | So you might see them slowly going away over time.
00:16:43.340 | And fully connected layers can actually be represented-- they're basically equivalent
00:16:46.300 | to convolutional layers as well.
00:16:48.020 | And so really, it's just a sequence of conv layers in the simplest case.
00:16:52.300 | So let me explain convolutional layer, because that's the core computational building block
00:16:55.380 | here that does all the heavy lifting.
00:16:58.520 | So the entire ConvNet is this collection of layers.
00:17:03.080 | And these layers don't function over vectors.
00:17:05.420 | So they don't transform vectors as a normal neural network.
00:17:07.540 | But they function over volumes.
00:17:09.300 | So a layer will take a volume, a three-dimensional volume of numbers, an array.
00:17:13.420 | In this case, for example, we have a 32 by 32 by 3 image.
00:17:17.180 | So those three dimensions are the width, height, and I'll refer to the third dimension as depth.
00:17:21.140 | We have three channels.
00:17:22.900 | That's not to be confused with the depth of a network, which is the number of layers in
00:17:25.820 | that network.
00:17:26.820 | So this is just the depth of a volume.
00:17:28.700 | So this convolutional layer accepts a three-dimensional volume.
00:17:31.180 | And it produces a three-dimensional volume using some weights.
00:17:34.580 | So the way it actually produces this output volume is as follows.
00:17:37.700 | We're going to have these filters in a convolutional layer.
00:17:40.260 | So these filters are always small spatially, like, say, for example, 5 by 5 filter.
00:17:45.540 | But their depth extends always through the input depth of the input volume.
00:17:51.220 | So since the input volume has three channels, the depth is three, then our filters will
00:17:55.700 | always match that number.
00:17:57.900 | So we have depth of three in our filters as well.
00:18:01.000 | And then we can take those filters, and we can basically convolve them with the input
00:18:03.940 | volume.
00:18:04.940 | So what that amounts to is we take this filter.
00:18:08.060 | Oh, yeah.
00:18:09.060 | So that's just the point that the channels here must match.
00:18:12.260 | We take that filter, and we slide it through all spatial positions of the input volume.
00:18:16.620 | And along the way, as we're sliding this filter, we're computing dot products.
00:18:20.020 | So W transpose X plus B, where W are the filters, and X is a small piece of the input volume,
00:18:25.500 | and B is the offset.
00:18:27.020 | And so this is basically the convolutional operation.
00:18:28.780 | You're taking this filter, and you're sliding it through at all spatial positions, and you're
00:18:32.140 | computing dot products.
00:18:33.720 | So when you do this, you end up with this activation map.
00:18:37.040 | So in this case, we get a 28 by 28 activation map.
00:18:41.300 | 28 comes from the fact that there are 28 unique positions to place this 5 by 5 filter into
00:18:46.180 | this 32 by 32 space.
00:18:49.420 | So there are 28 by 28 unique positions you can place that filter in.
00:18:52.260 | In every one of those, you're going to get a single number of how well that filter likes
00:18:57.700 | that part of the input.
00:19:00.580 | So that carves out a single activation map.
00:19:03.420 | And now in a convolutional layer, we don't just have a single filter, but we're going
00:19:05.860 | to have an entire set of filters.
00:19:07.740 | So here's another filter, a green filter.
00:19:09.740 | We're going to slide it through the input volume.
00:19:12.040 | It has its own parameters.
00:19:13.500 | So there are 75 numbers here that basically make up a filter.
00:18:17.500 | It's a different set of 75 numbers.
00:19:19.080 | We convolve them through, get a new activation map, and we continue doing this for all the
00:19:22.780 | filters in that convolutional layer.
00:19:25.140 | So for example, if we had six filters in this convolutional layer, then we might end up
00:19:29.140 | with 28 by 28 activation maps six times.
00:19:32.440 | And we stack them along the depth dimension to arrive at the output volume of 28 by 28
00:19:36.700 | by 6.
00:19:37.700 | And so really what we've done is we've re-represented the original image, which is 32 by 32 by
00:19:42.380 | 3, into a kind of a new image that is 28 by 28 by 6, where this image basically has these
00:19:48.840 | six channels that tell you how well every filter matches or likes every part of the
00:19:55.080 | input image.
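[As a sanity check on those shapes, here is that convolution written out as an explicit NumPy loop; random numbers stand in for the real image and filters.]

```python
# Six 5x5x3 filters slid over a 32x32x3 input (stride 1, no padding) -> 28x28x6.
import numpy as np

x = np.random.randn(32, 32, 3)          # input volume (width, height, depth)
W = np.random.randn(6, 5, 5, 3)         # 6 filters, each 5x5x3
b = np.zeros(6)                         # one bias per filter

out = np.zeros((28, 28, 6))             # 28 = 32 - 5 + 1 unique positions
for f in range(6):                      # for every filter...
    for i in range(28):                 # ...slide over all spatial positions
        for j in range(28):
            patch = x[i:i+5, j:j+5, :]  # 5x5x3 piece of the input
            out[i, j, f] = np.sum(W[f] * patch) + b[f]   # w^T x + b

print(out.shape)                        # (28, 28, 6)
```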
00:19:57.020 | So let's compare this operation to, say, using a fully connected layer as you would in a
00:20:00.920 | normal neural network.
00:20:03.060 | So in particular, we saw that we processed a 32 by 32 by 3 volume into 28 by 28 by 6
00:20:09.000 | volume.
00:20:10.000 | But one question you might want to ask is, how many parameters would this require if
00:20:13.560 | we wanted a fully connected layer of the same number of output neurons here?
00:20:17.160 | So we wanted 28 times 28 times 6 neurons, fully connected.
00:20:24.840 | How many parameters would that be?
00:20:26.560 | Turns out that that would be quite a few parameters, right?
00:20:28.760 | Because every single neuron in the output volume would be fully connected to all of
00:20:32.080 | the 32 by 32 by 3 numbers here.
00:20:34.960 | So basically, every one of those 28 by 28 by 6 neurons is connected to 32 by 32 by 3.
00:20:41.360 | Turns out to be about 15 million parameters, and also on that order of number of multiplies.
00:20:45.920 | So you're doing a lot of compute, and you're introducing a huge amount of parameters into
00:20:48.960 | your network.
00:20:49.960 | Now, since we're doing convolution instead, you'll notice that-- think about the number
00:20:55.480 | of parameters that we've introduced with this example convolutional layer.
00:20:59.120 | So we've used-- we had six filters, and every one of them was a 5 by 5 by 3 filter.
00:21:06.280 | So basically, we just have 5 by 5 by 3 filters.
00:21:08.480 | We have six of them.
00:21:09.540 | If you just multiply that out, we have 450 parameters.
00:21:12.400 | And in this, I'm not counting the biases.
00:21:13.800 | I'm just counting the raw weights.
00:21:15.640 | So compared to 15 million, we've only introduced very few parameters.
00:21:19.240 | Also, how many multiplies have we done?
00:21:21.840 | So computationally, how many flops are we doing?
00:21:24.960 | Well, we have 28 by 28 by 6 outputs to produce.
00:21:27.800 | And every one of these numbers is a function of a 5 by 5 by 3 region in the original image.
00:21:33.120 | So basically, we have 28 by 28 by 6.
00:21:35.840 | And then every one of them is computed by doing 5 times 5 times 3 multiplies.
00:21:39.800 | So you end up with only on the order of 350,000 multiplies.
00:21:43.880 | So we've gone from roughly 15 million multiplies down to a few hundred thousand.
00:21:46.600 | So we're doing fewer flops, and we're using fewer parameters.
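[The counts in that comparison, spelled out:]

```python
# Fully connected alternative: every output neuron sees every input number.
fc_params = (28 * 28 * 6) * (32 * 32 * 3)        # 14,450,688  (~15 million)

# Convolutional layer: six 5x5x3 filters, shared across all spatial positions.
conv_params = 6 * (5 * 5 * 3)                    # 450 (biases not counted)

# Each of the 28x28x6 outputs is one 5x5x3 dot product.
conv_multiplies = (28 * 28 * 6) * (5 * 5 * 3)    # 352,800  (~350 thousand)

print(fc_params, conv_params, conv_multiplies)
```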
00:21:50.200 | And really, what we've done here is we've made assumptions.
00:21:52.920 | We've made the assumption-- note that if this was a fully
00:21:58.080 | connected layer, it could compute the exact same thing.
00:22:02.400 | So a specific setting of those 15 million parameters would actually produce the exact
00:22:05.640 | output of this convolutional layer.
00:22:07.240 | But we've done it much more efficiently.
00:22:08.480 | We've done that by introducing these biases.
00:22:11.800 | So in particular, we've made assumptions.
00:22:13.600 | We've assumed, for example, that since we have these fixed filters that we're sliding
00:22:16.640 | across space, we've assumed that if there's some interesting feature that you'd like to
00:22:20.320 | detect in one part of the image, like, say, top left, then that feature will also be useful
00:22:24.200 | somewhere else, like on the bottom right, because we fixed these filters and applied
00:22:27.760 | them at all the spatial positions equally.
00:22:30.240 | You might notice that this is not always something that you might want.
00:22:33.080 | For example, if you're getting inputs that are centered face images, and you're doing
00:22:36.640 | some kind of a face recognition or something like that, then you might expect that you
00:22:39.680 | might want different filters at different spatial positions.
00:22:42.600 | Like say, for eye regions, you might want to have some eye-like filters.
00:22:45.800 | And for mouth region, you might want to have mouth-specific features and so on.
00:22:49.280 | And so in that case, you might not want to use convolutional layer, because those features
00:22:52.240 | have to be shared across all spatial positions.
00:22:55.240 | And the second assumption that we made is that these filters are small locally.
00:22:59.800 | And so we don't have global connectivity.
00:23:01.580 | We have this local connectivity.
00:23:03.040 | But that's OK, because we end up stacking up these convolutional layers in sequence.
00:23:06.840 | And so the neurons at the end of the ConvNet will grow their receptive field as you stack
00:23:12.160 | these convolutional layers on top of each other.
00:23:14.160 | So at the end of the ConvNet, those neurons end up being a function of the entire image
00:23:17.000 | eventually.
00:23:18.000 | So just to give you an idea about what these activation maps look like concretely, here's
00:23:22.600 | an example of an image on the top left.
00:23:24.880 | This is a part of a car, I believe.
00:23:26.640 | And we have these different filters at-- we have 32 different small filters here.
00:23:30.720 | And so if we were to convolve these filters with this image, we end up with these activation
00:23:34.240 | maps.
00:23:35.240 | So this filter, if you convolve it, you get this activation map and so on.
00:23:38.920 | So this one, for example, has some orange stuff in it.
00:23:41.080 | So when we convolve with this image, you see that this white here is denoting the fact
00:23:44.900 | that that filter matches that part of the image quite well.
00:23:47.840 | And so we get these activation maps.
00:23:49.480 | You stack them up.
00:23:50.520 | And then that goes into the next convolutional layer.
00:23:53.720 | So the way this looks like then is that we've processed this with some kind of a convolutional
00:23:59.680 | layer.
00:24:00.680 | We get some output.
00:24:01.680 | We apply a rectified linear unit, some kind of a non-linearity as normal.
00:24:05.040 | And then we would just repeat that operation.
00:24:06.880 | So we keep plugging these conv volumes into the next convolutional layer.
00:24:11.480 | And so they plug into each other in sequence.
00:24:14.080 | And so we end up processing the image over time.
00:24:17.040 | So that's the convolutional layer.
00:24:19.160 | You'll notice that there are a few more layers.
00:24:20.640 | So in particular, the pooling layer I'll explain very briefly.
00:24:24.800 | Pooling layer is quite simple.
00:24:27.060 | If you've used Photoshop or something like that, you've taken a large image and you've
00:24:30.520 | resized it, you've down sampled the image, well, pooling layers do basically something
00:24:34.920 | exactly like that.
00:24:35.920 | But they're doing it on every single channel independently.
00:24:38.640 | So for every one of these channels independently in an input volume, we'll pluck out that activation map.
00:24:45.400 | We'll down sample it.
00:24:46.400 | And that becomes a channel in the output volume.
00:24:48.720 | So it's really just a down sampling operation on these volumes.
00:24:52.720 | So for example, one of the common ways of doing this in the context of neural networks
00:24:55.400 | especially is to use max pooling operation.
00:24:57.860 | So in this case, it would be common to say, for example, use 2 by 2 filters stride 2 and
00:25:04.000 | do max operation.
00:25:05.840 | So if this is an input channel in a volume, then we're basically-- what that amounts to
00:25:13.400 | is we're carving it up into these 2 by 2 regions.
00:25:13.400 | And we're taking a max over 4 numbers to produce one piece of the output.
00:25:18.880 | So this is a very cheap operation that down samples your volumes.
00:25:21.860 | It's really a way to control the capacity of the network.
00:25:24.040 | So you don't want too many numbers.
00:25:25.160 | You don't want things to be too computationally expensive.
00:25:27.220 | It turns out that a pooling layer allows you to down sample your volumes.
00:25:30.760 | You're going to end up doing less computation.
00:25:32.760 | And it turns out to not hurt the performance too much.
00:25:35.080 | So we use them basically as a way of controlling the capacity of these networks.
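[A minimal sketch of that 2 by 2, stride 2 max pooling on a single channel:]

```python
import numpy as np

channel = np.random.randn(4, 4)        # one activation map from the input volume
pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # take the max over each non-overlapping 2x2 region
        pooled[i, j] = channel[2*i:2*i+2, 2*j:2*j+2].max()

print(pooled.shape)                    # (2, 2): downsampled by a factor of 2
```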
00:25:39.960 | And the last layer that I want to briefly mention, of course, is the fully connected
00:25:43.120 | layer, which is exactly what you're familiar with.
00:25:46.100 | So we have these volumes throughout as we've processed the image.
00:25:48.720 | At the end, you're left with this volume.
00:25:50.220 | And now you'd like to predict some classes.
00:25:51.920 | So what we do is we just take that volume.
00:25:53.520 | We stretch it out into a single column.
00:25:55.560 | And then we apply a fully connected layer, which really amounts to just a matrix multiplication.
00:26:00.000 | And then that gives us probabilities after applying a softmax or something like that.
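[The "stretch out and multiply" step, as a sketch; the 7x7x512 shape is just an illustrative example of a final volume.]

```python
import numpy as np

volume = np.random.randn(7, 7, 512)      # an example final volume (shape assumed)
x = volume.reshape(-1)                   # stretch it out into a single column

W = np.random.randn(1000, x.size)        # one row of weights per class
b = np.zeros(1000)
scores = W.dot(x) + b                    # the fully connected layer: a matrix multiply

probs = np.exp(scores - scores.max())    # softmax turns scores into probabilities
probs /= probs.sum()
print(probs.shape, probs.sum())          # (1000,) 1.0
```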
00:26:06.180 | So let me now show you briefly a demo of what a convolutional network looks like.
00:26:11.000 | So this is ConvNetJS.
00:26:12.920 | This is a deep learning library for training convolutional neural networks that is implemented
00:26:17.340 | in JavaScript.
00:26:18.340 | I wrote this maybe two years ago at this point.
00:26:21.380 | So here what we're doing is we're training a convolutional network on the CIFAR-10 dataset.
00:26:25.040 | CIFAR-10 is a dataset of 50,000 images.
00:26:27.880 | Each image is 32 by 32 by 3.
00:26:30.200 | And there are 10 different classes.
00:26:32.800 | So here we are training this network in the browser.
00:26:35.280 | And you can see that the loss is decreasing, which means that we're better classifying
00:26:39.320 | these inputs.
00:26:40.840 | And so here's the network specification, which you can play with because this is all done
00:26:44.700 | in the browser.
00:26:45.700 | So you can just change this and play with this.
00:26:48.080 | So this is an input image.
00:26:49.380 | And this convolutional network I'm showing here, all the intermediate activations and
00:26:53.080 | all the intermediate, basically, activation maps that we're producing.
00:26:57.460 | So here we have a set of filters.
00:26:59.680 | We're convolving them with the image and getting all these activation maps.
00:27:02.880 | I'm also showing the gradients, but I don't want to dwell on that too much.
00:27:06.480 | Then you threshold.
00:27:07.640 | So ReLU thresholding anything below 0 gets clamped at 0.
00:27:11.600 | And then you pool.
00:27:12.800 | So this is just a downsampling operation.
00:27:14.980 | And then another convolution, ReLU pool, conv, ReLU pool, et cetera, until at the end we
00:27:20.160 | have a fully connected layer.
00:27:21.240 | And then we have our softmax so that we get probabilities out.
00:27:24.600 | And then we apply a loss to those probabilities and backpropagate.
00:27:28.200 | And so here we see that I've been training in this tab for the last maybe 30 seconds
00:27:32.300 | or one minute.
00:27:33.300 | And we're already getting about 30% accuracy on CIFAR-10.
00:27:36.420 | So these are test images from CIFAR-10.
00:27:38.400 | And these are the outputs of this convolutional network.
00:27:40.680 | And you can see that it learned that this is already a car or something like that.
00:27:43.440 | So this trains pretty quickly in JavaScript.
00:27:46.360 | So you can play with this.
00:27:47.360 | And you can change the architecture and so on.
00:27:50.280 | Another thing I'd like to show you is this video, because it gives you, again, this very
00:27:53.880 | intuitive visceral feeling of exactly what this is computing.
00:27:57.080 | There is a very good video by Jason Yosinski from--
00:27:59.920 | Recent advance--
00:28:01.240 | I'm going to play this in a bit.
00:28:02.600 | This is from the deep visualization toolbox.
00:28:05.420 | So you can download this code.
00:28:06.480 | And you can play with this.
00:28:07.600 | It's this interactive convolutional network demo.
00:28:09.880 | [VIDEO PLAYBACK]
00:28:10.880 | - --neural networks have enabled computers to better see and understand the world.
00:28:14.600 | They can recognize school buses and--
00:28:16.160 | [END PLAYBACK]
00:28:17.160 | --top left corner, we show the--
00:28:18.160 | [END PLAYBACK]
00:28:19.160 | I'm going to skip a bit.
00:28:20.160 | So what we're seeing here is these are activation maps in some particular-- shown in real time
00:28:25.660 | as this demo is running.
00:28:27.760 | So these are for the conv1 layer of an AlexNet, which we're going to go into in much more
00:28:31.400 | detail.
00:28:32.400 | But these are the different activation maps that are being produced at this point.
00:28:35.360 | [VIDEO PLAYBACK]
00:28:36.360 | - --neural network called AlexNet running in Caffe.
00:28:39.800 | By interacting with the network, we can see what some of the neurons are doing.
00:28:44.600 | For example, on this first layer, a unit in the center responds strongly to light to dark
00:28:48.800 | edges.
00:28:51.620 | This neighbor, one neuron over, responds to edges in the opposite direction, dark to light.
00:28:58.520 | Using optimization, we can synthetically produce images that light up each neuron on this layer
00:29:02.720 | to see what each neuron is looking for.
00:29:05.320 | We can scroll through every layer in the network to see what it does, including convolution,
00:29:09.680 | pooling, and normalization layers.
00:29:12.900 | We can switch back and forth between showing the actual activations and showing images
00:29:16.860 | synthesized to produce high activation.
00:29:22.500 | By the time we get to the fifth convolutional layer, the features being computed represent
00:29:26.220 | abstract concepts.
00:29:29.500 | For example, this neuron seems to respond to faces.
00:29:32.700 | We can further investigate this neuron by showing a few different types of information.
00:29:36.940 | First we can artificially create optimized images using new regularization techniques
00:29:40.740 | that are described in our paper.
00:29:42.620 | These synthetic images show that this neuron fires in response to a face and shoulders.
00:29:47.060 | We can also plot the images from the training set that activate this neuron the most, as
00:29:50.740 | well as pixels from those images most responsible for the high activations, computed via the
00:29:55.060 | deconvolution technique.
00:29:57.120 | This feature responds to multiple faces in different locations.
00:30:00.740 | And by looking at the deconv, we can see that it would respond more strongly if we had even
00:30:05.920 | darker eyes and rosier lips.
00:30:08.300 | We can also confirm that it cares about the head and shoulders, but ignores the arms and
00:30:12.180 | torso.
00:30:14.060 | We can even see that it fires to some extent for cat faces.
00:30:18.540 | Using backprop or deconv, we can see that this unit depends most strongly on a couple
00:30:22.860 | units in the previous layer, conv4, and on about a dozen or so in conv3.
00:30:28.580 | Now let's look at another neuron on this layer.
00:30:31.180 | So what's this unit doing?
00:30:33.020 | From the top nine images, we might conclude that it fires for different types of clothing.
00:30:37.620 | But examining the synthetic images shows that it may be detecting not clothing per se, but
00:30:42.020 | wrinkles.
00:30:43.300 | In the live plot, we can see that it's activated by my shirt.
00:30:46.820 | And smoothing out half of my shirt causes that half of the activations to decrease.
00:30:52.060 | Finally, here's another interesting neuron.
00:30:56.120 | This one has learned to look for printed text in a variety of sizes, colors, and fonts.
00:31:02.080 | This is pretty cool, because we never ask the network to look for wrinkles or text or
00:31:05.700 | faces.
00:31:06.700 | But the only labels we provided were at the very last layer.
00:31:09.460 | So the only reason the network learned features like text and faces in the middle was to support
00:31:13.580 | final decisions at that last layer.
00:31:16.260 | For example, the text detector may provide good evidence that a rectangle is in fact
00:31:21.140 | a book seen on edge.
00:31:22.740 | And detecting many books next to each other might be a good way of detecting a bookcase,
00:31:26.820 | which was one of the categories we trained the net to recognize.
00:31:31.420 | In this video, we've shown some of the features of the DeepViz toolbox.
00:31:35.100 | So I encourage you to play with that.
00:31:36.500 | It's really fun.
00:31:37.700 | So I hope that gives you an idea about exactly what's going on.
00:31:39.660 | There are these convolutional layers.
00:31:40.780 | We down sample them from time to time.
00:31:43.020 | There's usually some fully connected layers at the end.
00:31:45.380 | But mostly it's just these convolutional operations stacked on top of each other.
00:31:49.220 | So what I'd like to do now is I'll dive into some details of how these architectures are
00:31:52.700 | actually put together.
00:31:54.300 | The way I'll do this is I'll go over all the winners of the ImageNet challenges, and I'll
00:31:57.940 | tell you about the architectures, how they came about, how they differ.
00:32:00.780 | And so you'll get a concrete idea about what these architectures look like in practice.
00:32:04.300 | So we'll start off with the AlexNet in 2012.
00:32:08.360 | So the AlexNet, just to give you an idea about the sizes of these networks and the images
00:32:13.300 | that they process, it took 227 by 227 by 3 images.
00:32:17.860 | And the first layer of an AlexNet, for example, was a convolutional layer that had 11 by 11
00:32:22.500 | filters applied with a stride of 4.
00:32:25.860 | And there are 96 of them.
00:32:27.440 | Stride of 4 I didn't fully explain because I wanted to save some time.
00:32:30.660 | But intuitively, it just means that as you're sliding this filter across the input, you
00:32:34.460 | don't have to slide it one pixel at a time, but you can actually jump a few pixels at
00:32:37.620 | a time.
00:32:38.620 | So we have 11 by 11 filters with a stride, a skip of 4.
00:32:42.320 | And we have 96 of them.
00:32:43.600 | You can try to compute, for example, what is the output volume if you apply this sort
00:32:49.540 | of convolutional layer on top of this volume.
00:32:51.420 | And I didn't go into details of how you compute that.
00:32:53.560 | But basically, there are formulas for this, and you can look into details in the class.
00:32:58.260 | But you arrive at 55 by 55 by 96 volume as output.
00:33:03.260 | The total number of parameters in this layer, we have 96 filters.
00:33:07.500 | Every one of them is 11 by 11 by 3 because that's the input depth of these images.
00:33:14.460 | So basically, it just amounts to 11 times 11 times 3.
00:33:17.280 | And then you have 96 filters, so about 35,000 parameters in this very first layer.
00:33:22.820 | Then the second layer of an AlexNet is a pooling layer.
00:33:25.500 | So we apply 3 by 3 filters at stride of 2, and they do max pooling.
00:33:30.040 | So you can, again, compute the output volume size of that after applying this to that volume.
00:33:35.220 | And you arrive, if you do some very simple arithmetic there, you arrive at 27 by 27 by
00:33:40.860 | 96. So this is the downsampling operation.
00:33:42.580 | You can think about what is the number of parameters in this pooling layer.
00:33:47.000 | And of course, it's 0.
00:33:48.760 | So pooling layers compute a fixed function, a fixed downsampling operation.
00:33:52.460 | There are no parameters involved in a pooling layer.
00:33:54.580 | All the parameters are in convolutional layers and the fully connected layers, which are,
00:33:57.780 | to some extent, equivalent to convolutional layers.
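[The arithmetic behind those volume sizes is the standard output-size formula, (W - F + 2P) / S + 1; a sketch:]

```python
def output_size(input_size, filter_size, stride, pad=0):
    # how many positions a filter of the given size fits into the input
    return (input_size - filter_size + 2 * pad) // stride + 1

print(output_size(227, 11, 4))   # 55 -> conv1 produces a 55x55x96 volume
print(96 * 11 * 11 * 3)          # 34,848 -> roughly 35K parameters in conv1
print(output_size(55, 3, 2))     # 27 -> 3x3 stride-2 max pooling gives 27x27x96
```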
00:34:01.220 | So we can go ahead and just basically, based on the description in the paper-- although
00:34:05.220 | it's non-trivial, I think, based on the description of this particular paper-- but you can go
00:34:08.560 | ahead and decipher what the volumes are throughout.
00:34:11.940 | You can look at the kind of patterns that emerge in terms of how you actually increase
00:34:16.460 | the number of filters in higher convolutional layers.
00:34:19.060 | So we started off with 96.
00:34:20.060 | Then we go to 256 filters.
00:34:22.300 | Then to 384.
00:34:24.060 | And eventually, 4,096 units of fully connected layers.
00:34:27.540 | You'll see also normalization layers here, which have since become slightly deprecated.
00:34:31.660 | It's not very common to use the normalization layers that were used at the time for the
00:34:36.020 | AlexNet architecture.
00:34:37.620 | What's interesting to note is how this differs from the 1998 Yann LeCun network.
00:34:41.780 | So in particular, I usually like to think about four things that hold back progress,
00:34:46.340 | so at least in deep learning.
00:34:48.700 | So data is one constraint; compute is another.
00:34:52.900 | And then I like to differentiate between algorithms and infrastructure, algorithms being something
00:34:57.060 | that feels like research and infrastructure being something that feels like a lot of engineering
00:35:00.500 | has to happen.
00:35:01.500 | And so in particular, we've had progress in all those four fronts.
00:35:04.540 | So we see that in 1998, the data you could get a hold of maybe would be on the order
00:35:08.860 | of a few thousand, whereas now we have a few million.
00:35:11.280 | So we have three orders of magnitude of increase in the amount of data.
00:35:14.660 | Compute, GPUs have become available, and we use them to train these networks.
00:35:18.900 | They are about, say, roughly 20 times faster than CPUs.
00:35:23.380 | And then, of course, CPUs we have today are much, much faster than CPUs that they had
00:35:26.700 | back in 1998.
00:35:28.060 | So I don't know exactly what that works out to, but I wouldn't be surprised if it's,
00:35:30.580 | again, on the order of three orders of magnitude of improvement again.
00:35:34.540 | I'd like to actually skip over the algorithm and talk about infrastructure.
00:35:37.020 | So in this case, we're talking about NVIDIA releasing the CUDA library that allows you
00:35:41.820 | to efficiently create all these matrix vector operations and apply them on arrays of numbers.
00:35:46.900 | So that's a piece of software that we rely on and that we take advantage of that wasn't
00:35:51.620 | available before.
00:35:52.940 | And finally, algorithms is kind of an interesting one, because in those 20 years, there's been
00:35:57.260 | much less improvement in algorithms than all these other three pieces.
00:36:02.380 | So in particular, what we've done with the 1998 network is we've made it bigger.
00:36:05.880 | So you have more channels.
00:36:07.180 | You have more layers by a bit.
00:36:09.340 | And the two really new things algorithmically are dropout and rectified linear units.
00:36:16.660 | So dropout is a regularization technique developed by Geoff Hinton and colleagues.
00:36:21.700 | And rectified linear units are these nonlinearities that train much faster than sigmoids and tanhs.
00:36:27.420 | And this paper actually had a plot that showed that the rectified linear units trained a
00:36:32.300 | bit faster than sigmoids.
00:36:33.780 | And that's intuitively because of the vanishing gradient problems.
00:36:36.460 | And when you have very deep networks with sigmoids, those gradients vanish, as Hugo was
00:36:40.380 | talking about in the last lecture.
00:36:43.420 | So what's interesting also to note, by the way, is that both dropout and ReLU are basically
00:36:47.380 | like one line or two lines of code to change.
00:36:50.500 | So it's about a two line diff total in those 20 years.
00:36:53.880 | And both of them consist of setting things to zero.
00:36:56.460 | So with the ReLU, you set things to zero when they're lower than zero.
00:36:59.820 | And with dropout, you set things to zero at random.
00:37:02.100 | So it's a good idea to set things to zero.
00:37:04.820 | Apparently, that's what we've learned.
00:37:06.060 | So if you try to find a new cool algorithm, look for one line diffs that set something
00:37:10.120 | to zero.
00:37:11.120 | It probably will work better.
00:37:12.660 | And we could add you here to this list.
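[Both of those "one-line diffs" really are one-liners; a sketch in NumPy:]

```python
import numpy as np

x = np.random.randn(10)

relu = np.maximum(0, x)                    # ReLU: set anything below zero to zero

p = 0.5                                    # dropout keep probability
mask = (np.random.rand(*x.shape) < p) / p  # set units to zero at random ("inverted"
dropped = x * mask                         #  dropout, scaled so the expectation is unchanged)

print(relu)
print(dropped)
```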
00:37:16.020 | Now, comparing it again with the older networks and giving
00:37:20.700 | you an idea about the hyperparameters that were used in this architecture:
00:37:25.100 | It was the first use of rectified linear units.
00:37:26.780 | We haven't seen that as much before.
00:37:29.380 | This network used the normalization layers, which are not used anymore, at least in the
00:37:32.820 | specific way that they use them in this paper.
00:37:35.980 | They used heavy data augmentation.
00:37:37.700 | So you don't only pipe these images into the networks exactly as they come from the data
00:37:42.940 | set, but you jitter them spatially around a bit.
00:37:45.420 | And you warp them, and you change the colors a bit, and you just do this randomly because
00:37:49.060 | you're trying to build in some invariances to these small perturbations.
00:37:52.340 | And you're basically hallucinating additional data.
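[A sketch of that kind of augmentation, with random crops, flips, and mild color jitter; a random array stands in for a training image.]

```python
import numpy as np

def augment(img, crop=224):
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop + 1)       # random spatial jitter
    left = np.random.randint(0, w - crop + 1)
    out = img[top:top + crop, left:left + crop, :]
    if np.random.rand() < 0.5:                     # random horizontal flip
        out = out[:, ::-1, :]
    out = out * np.random.uniform(0.9, 1.1, size=(1, 1, 3))   # mild per-channel color jitter
    return out

img = np.random.rand(256, 256, 3)                  # stand-in for a training image
print(augment(img).shape)                          # (224, 224, 3)
```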
00:37:55.220 | It was the first real use of dropout.
00:37:59.860 | And roughly, you see standard hyperparameters: batch sizes of roughly 128, using
00:38:05.060 | stochastic gradient descent with momentum, usually 0.9,
00:38:09.460 | learning rates of 1e-2, and you reduce them in the normal ways.
00:38:13.660 | So you reduce roughly by a factor of 10 whenever the validation accuracy stops improving.
00:38:17.860 | And a little bit of weight decay, 5e-4.
00:38:21.480 | And ensembling always helps.
00:38:23.860 | So you train seven independent convolutional networks separately, and then you just average
00:38:28.340 | their predictions.
00:38:29.540 | Always gives you additional 2% improvement.
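[Those training hyperparameters, collected into one place; the names are illustrative and not tied to any particular framework.]

```python
train_config = {
    "batch_size": 128,
    "optimizer": "SGD with momentum",
    "momentum": 0.9,
    "initial_learning_rate": 1e-2,   # divided by ~10 when validation accuracy plateaus
    "weight_decay": 5e-4,            # L2 regularization strength
    "ensemble_size": 7,              # average 7 independently trained nets for ~2% extra
}
print(train_config)
```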
00:38:32.400 | So this is AlexNet, the winner of 2012.
00:38:34.580 | In 2013, the winner was the ZFNet.
00:38:37.700 | This was developed by Matthew Zeiler and Rob Fergus in 2013.
00:38:43.240 | And this was an improvement on top of AlexNet architecture.
00:38:45.820 | In particular, one of the bigger differences here was that in the first convolutional layer,
00:38:50.740 | they went from 11 by 11 stride 4 to 7 by 7 stride 2.
00:38:53.900 | So you have slightly smaller filters, and you apply them more densely.
00:38:57.420 | And then also, they noticed that these convolutional layers in the middle, if you make them larger,
00:39:02.220 | if you scale them up, then you actually gain performance.
00:39:04.420 | So they managed to improve a tiny bit.
00:39:06.380 | Matthew Zeiler then went on to become the founder of Clarifai.
00:39:11.620 | And he worked on this a bit more inside Clarifai, and he managed to push the performance to
00:39:15.060 | 11%, which was the winning entry at the time.
00:39:17.880 | But we don't actually know what gets you from 14% to 11%, because Matthew never disclosed
00:39:22.900 | the full details of what happened there.
00:39:24.300 | But he did say that it was more tweaking of these hyperparameters and optimizing that
00:39:28.220 | a bit.
00:39:29.620 | So that was 2013 winner.
00:39:31.020 | In 2014, we saw a slightly bigger diff to this.
00:39:34.580 | So one of the networks that was introduced then was the VGGNet from Karen Simonyan and
00:39:37.740 | Andrew Zisserman.
00:39:39.220 | What's beautiful about VGG net-- and they explored a few architectures here, and the
00:39:42.020 | one that ended up working best was this D column, which is why I'm highlighting it.
00:39:45.460 | What's beautiful about the VGG net is that it's so simple.
00:39:48.500 | So you might have noticed in these previous networks, you have these different filter
00:39:52.980 | sizes, different layers, and you do different amount of strides, and everything kind of
00:39:56.620 | looks a bit hairy, and you're not sure where these hyperparameters are coming from.
00:39:59.700 | VGG net is extremely uniform.
00:40:02.020 | All you do is 3 by 3 convolutions with stride 1, pad 1, and you do 2 by 2 max poolings with
00:40:06.980 | stride 2.
00:40:08.260 | And you do this throughout completely homogeneous architecture, and you just alternate a few
00:40:12.620 | conv and a few pool layers, and you get top performance.
00:40:16.660 | So they managed to reduce the error down to 7.3% in the VGG net, just with a very simple
00:40:22.780 | and homogeneous architecture.
00:40:24.060 | So I've also here written out this D architecture.
00:40:27.980 | So you can see-- I'm not sure how instructive this is, because it's kind of dense.
00:40:32.100 | But you can definitely see, and you can look at this offline perhaps, but you can see how
00:40:35.580 | these volumes develop, and you can see the kinds of sizes of these filters.
00:40:41.100 | So they're always 3 by 3, but the number of filters, again, grows.
00:40:43.860 | So we started off with 64, and then we go to 128, 256, 512.
00:40:47.620 | So we're just doubling it over time.
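As a rough illustration of the pattern just described, here is a sketch of a VGG-style network in Keras (Keras 2 API assumed): stacks of 3x3 convolutions with stride 1 and pad 1, 2x2 max pooling with stride 2, and filter counts doubling from 64 up to 512. This follows the spirit of the D configuration, but it is an illustrative reconstruction, not the authors' code.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def conv_block(model, filters, n_convs):
    # a few 3x3 stride-1 pad-1 convolutions followed by a 2x2 stride-2 max pool
    for _ in range(n_convs):
        model.add(Conv2D(filters, (3, 3), strides=(1, 1), padding='same',
                         activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

model = Sequential()
model.add(Conv2D(64, (3, 3), padding='same', activation='relu',
                 input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
for filters, n in [(128, 2), (256, 3), (512, 3), (512, 3)]:
    conv_block(model, filters, n)          # filter count doubles, resolution halves
model.add(Flatten())                       # 7x7x512 at this point
model.add(Dense(4096, activation='relu'))
model.add(Dense(4096, activation='relu'))
model.add(Dense(1000, activation='softmax'))
```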
00:40:51.140 | I also have a few numbers here, just to give you an idea of the scale at which these networks
00:40:54.960 | normally operate.
00:40:56.260 | So we have on the order of 140 million parameters.
00:40:58.700 | This is actually quite a lot.
00:40:59.860 | I'll show you in a bit that this can be about 5 or 10 million parameters, and it works just
00:41:03.020 | as well.
00:41:05.020 | And it's about 100 megabytes per image, in terms of memory, in the forward pass.
00:41:09.820 | And then the backward pass also needs roughly on that order.
00:41:12.280 | So that's roughly the numbers that we're working with here.
00:41:16.380 | Also you can note that most of the-- and this is true mostly in convolutional networks--
00:41:20.140 | is that most of the memory is in the early convolutional layers.
00:41:23.180 | Most of the parameters, at least in the case where you use these giant fully connected
00:41:26.380 | layers at the top, would be here.
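A quick back-of-the-envelope check of those two claims, in plain Python: the first fully connected layer on top of a 7x7x512 volume holds roughly 100 million parameters, while an early 3x3 conv layer holds only tens of thousands of parameters but produces large activation maps.

```python
# parameters of the first fully connected layer: 7*7*512 features -> 4096 units
fc1_params = 7 * 7 * 512 * 4096           # ~102.8 million weights
print(fc1_params)                          # 102760448

# parameters of an early conv layer: 64 filters of size 3x3x64
conv_params = 3 * 3 * 64 * 64              # ~37 thousand weights
print(conv_params)                          # 36864

# activation memory of that early conv layer for one image (float32)
act_bytes = 224 * 224 * 64 * 4
print(act_bytes / 1e6)                      # ~12.8 MB for a single layer's activations
```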
00:41:29.340 | So the winner, actually, in 2014 was not the VGGnet.
00:41:31.780 | I only present it because it's such a simple architecture.
00:41:34.500 | But the winner was actually GoogLeNet, with a slightly hairier architecture, as we'll see.
00:41:39.500 | So it's still a sequence of things.
00:41:41.440 | But in this case, they've put inception modules in sequence.
00:41:44.700 | And this is an example inception module.
00:41:46.380 | I don't have too much time to go into the details, but you can see that it consists
00:41:49.700 | basically of convolutions and different kinds of strides and so on.
00:41:54.500 | So the GoogLeNet looks slightly hairier, but it turns out to be more efficient in several
00:42:01.260 | respects.
00:42:02.260 | So for example, it works a bit better than VGGnet, at least at the time.
00:42:06.780 | It only has 5 million parameters, compared to VGGnet's 140 million parameters, so a huge
00:42:11.220 | reduction.
00:42:12.220 | And you do that, by the way, by just throwing away fully connected layers.
00:42:15.180 | So you'll notice in this breakdown I did, these fully connected layers here have 100
00:42:18.980 | million parameters and 16 million parameters.
00:42:21.020 | Turns out you don't actually need that.
00:42:22.460 | So if you take them away, that actually doesn't hurt performance too much.
00:42:26.220 | So you can get a huge reduction of parameters.
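A small sketch of that idea in Keras (Keras 2 API assumed): if you replace Flatten plus giant Dense layers with global average pooling and a small classifier, the head shrinks from roughly 100 million parameters to around half a million. The 7x7x512 input shape is just a stand-in for whatever the convolutional trunk produces; this is the spirit of the design, not the actual Inception code.

```python
from keras.models import Model
from keras.layers import Input, GlobalAveragePooling2D, Dense

feat = Input(shape=(7, 7, 512))                  # output of some convolutional trunk
x = GlobalAveragePooling2D()(feat)               # 7x7x512 -> 512, zero parameters
out = Dense(1000, activation='softmax')(x)       # 512*1000 + 1000 ~ 0.5M parameters
head = Model(inputs=feat, outputs=out)

# compare: Flatten() followed by Dense(4096) would cost 7*7*512*4096 ~ 100M parameters
print(head.count_params())
```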
00:42:30.000 | And it was slightly -- we can also compare to the original AlexNet.
00:42:35.180 | So compared to the original AlexNet, we have fewer parameters, a bit more compute, and
00:42:38.820 | a much better performance.
00:42:40.380 | So GoogLeNet was really optimized to have a low footprint: memory-wise, computation-wise,
00:42:45.220 | and parameter-wise.
00:42:46.980 | But it looks a bit uglier.
00:42:47.980 | And VGGnet is a very beautiful, homogeneous architecture, but there are some inefficiencies
00:42:52.220 | in it.
00:42:53.220 | Okay.
00:42:54.220 | So that's 2014.
00:42:55.220 | Now, in 2015, we had a slightly bigger delta on top of the architectures.
00:43:00.460 | So right now, these architectures, if Yann LeCun looked at them maybe in 1998, he
00:43:03.660 | would still recognize everything.
00:43:04.860 | So everything looks very simple.
00:43:06.740 | You just played with hyperparameters.
00:43:08.660 | So one of the first kind of bigger departures, I would argue, was in 2015, with the introduction
00:43:12.180 | of residual networks.
00:43:14.260 | And so this is work from Kaiming He and colleagues in Microsoft Research Asia.
00:43:18.840 | And so they did not only win the ImageNet Challenge in 2015, but they won a whole bunch
00:43:23.360 | of challenges.
00:43:24.360 | And this was all just by applying these residual networks that were trained on ImageNet and
00:43:28.640 | then fine-tuned on all these different tasks.
00:43:30.600 | And you basically can crush lots of different tasks whenever you get a new awesome ConvNet.
00:43:36.860 | So at this time, the performance was basically 3.57% from these residual networks.
00:43:42.260 | So this is 2015.
00:43:44.100 | So this paper tried to argue that if you look at the number of layers, it goes up.
00:43:48.340 | And then they made the point that with residual networks, as we'll see in a bit, you can introduce
00:43:53.100 | many more layers and that that correlates strongly with performance.
00:43:57.580 | We've since found that, in fact, you can make these residual networks quite a lot shallower,
00:44:01.980 | like say on the order of 20 or 30 layers, and they work just as fine, just as well.
00:44:05.580 | So it's not necessarily the depth here, but I'll go into that in a bit.
00:44:09.200 | But you get a much better performance.
00:44:10.900 | What's interesting about this paper is this plot here, where they compare these residual
00:44:15.900 | networks-- and I'll go into details of how they work in a bit-- and these what they call
00:44:19.060 | plain networks, which is everything I've explained until now.
00:44:22.420 | And the problem with plain networks is that when you try to scale them up and introduce
00:44:25.820 | additional layers, they don't get monotonically better.
00:44:29.080 | So if you take a 20-layer model-- and these are CIFAR-10 experiments-- if you take a
00:44:34.740 | 20-layer model and you run it, and then you take a 56-layer model, you'll see that the
00:44:39.020 | 56-layer model performs worse.
00:44:41.500 | And this is not just on the test data, so it's not just an overfitting issue.
00:44:44.940 | This is on the training data.
00:44:46.020 | The 56-layer model performs worse on the training data than the 20-layer model, even though
00:44:50.580 | the 56-layer model can imitate 20-layer model by setting 36 layers to compute identities.
00:44:56.420 | So basically, it's an optimization problem that you can't find the solution once your
00:45:01.340 | problem size grows that much bigger in this plain net architecture.
00:45:05.980 | So in the residual networks that they proposed, they found that when you wire them up in a
00:45:09.300 | slightly different way, you monotonically get a better performance as you add more layers.
00:45:14.360 | So more layers, always strictly better, and you don't run into these optimization issues.
00:45:19.460 | So comparing residual networks to plain networks: in plain networks, as I've explained already,
00:45:24.020 | you have this sequence of convolutional layers, where every convolutional layer operates over
00:45:28.420 | volume before and produces volume.
00:45:30.840 | In residual networks, we have this first convolutional layer on top of the raw image.
00:45:34.840 | And there's a pooling layer.
00:45:36.880 | So at this point, we've reduced the original image to 56 by 56 by 64.
00:45:41.660 | And then from here on, they have these residual blocks with these funny skip connections.
00:45:45.740 | And this turns out to be quite important.
00:45:49.200 | So let me show you what these look like.
00:45:52.180 | So the original Kaiming He paper had this architecture here, shown under original.
00:45:57.040 | So on the left, you see original residual networks design.
00:46:00.140 | Since then, they had an additional paper that played with the architecture and found that
00:46:03.660 | there's a better arrangement of layers inside this block that works better empirically.
00:46:08.900 | And so the way this works-- so concentrate on the proposed one in the middle, since that
00:46:12.180 | works so well-- is you have this pathway where you have this representation of the image x.
00:46:18.540 | And then instead of transforming that representation x to get a new x to plug in later, we end
00:46:23.260 | up having this x.
00:46:25.100 | We go off, and we do some compute on the side.
00:46:27.840 | So that's that residual block doing some computation.
00:46:30.240 | And then you add your result on top of x.
00:46:33.500 | So you have this addition operation here going to the next residual block.
00:46:37.280 | So you have this x, and you always compute deltas to it.
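Here is a minimal sketch of such a residual block in Keras (Keras 2 functional API assumed), roughly following the pre-activation arrangement. The exact ordering of batch norm, ReLU, and convolution varies between the two papers mentioned, so treat this as illustrative.

```python
from keras.layers import Conv2D, BatchNormalization, Activation, Add

def residual_block(x, filters):
    # assumes x already has `filters` channels so the addition shapes match
    shortcut = x
    y = BatchNormalization()(x)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    # the addition acts as a gradient distributor: gradients flow unchanged
    # both to the shortcut and to the residual branch
    return Add()([shortcut, y])
```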
00:46:40.860 | And I think it's not intuitive that this should work much better or why that works much better.
00:46:44.860 | I think it becomes a bit more intuitively clear if you actually understand the backpropagation
00:46:48.540 | dynamics and how backprop works.
00:46:50.780 | And this is why I always urge people also to implement backprop themselves to get an
00:46:54.260 | intuition for how it works, what it's computing, and so on.
00:46:57.380 | Because if you understand backprop, you'll see that addition operation is a gradient
00:47:00.820 | distributor.
00:47:01.940 | So you get a gradient from the top, and this gradient will flow equally to all the children
00:47:06.380 | that participated in that addition.
00:47:08.320 | So you have gradient flowing here from the supervision.
00:47:10.560 | So you have supervision at the very bottom here in this diagram.
00:47:13.180 | And it kind of flows upwards.
00:47:14.780 | And it flows through these residual blocks and then gets added to the stream.
00:47:19.080 | But this addition distributes that gradient always identically through.
00:47:23.760 | So what you end up with is this kind of a gradient superhighway, as I like to call it,
00:47:27.260 | where these gradients from your supervision go directly to the original convolutional
00:47:30.420 | layer.
00:47:31.420 | And on top of that, you get these deltas from all the residual blocks.
00:47:34.100 | So these blocks can come on online and can help out that original stream of information.
00:47:40.380 | This is also related to, I think, why LSTMs, long short-term memory networks, work better
00:47:45.780 | than recurrent neural networks, because they also have these kind of addition operations
00:47:50.460 | in the LSTM.
00:47:51.660 | And it just makes the gradients flow significantly better.
00:47:55.260 | Then there were some results on top of residual networks that I thought were quite amusing.
00:47:58.600 | So recently, for example, we had this result on deep networks with stochastic depth.
00:48:03.380 | The idea here was that the authors of this paper noticed that you have these residual
00:48:07.820 | blocks that compute deltas on top of your stream.
00:48:11.020 | And you can basically randomly throw out layers.
00:48:14.020 | So you have these, say, 100 blocks, 100 residual blocks.
00:48:16.220 | And you can randomly drop them out.
00:48:18.460 | And at test time, similar to dropout, you introduce all of them.
00:48:21.980 | And they all work at the same time.
00:48:23.300 | But you have to scale things a bit, just like with dropout.
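A conceptual sketch of that procedure in plain Python/NumPy: during training each residual block survives with some probability, and at test time every block is kept but scaled by its survival probability, analogous to dropout. The constant survival probability and the `blocks` list of functions are simplifications for illustration; the paper actually decays the survival probability with depth.

```python
import numpy as np

def stochastic_depth_forward(x, blocks, p_survive=0.8, train=True):
    # blocks: list of functions, each computing a residual delta F(x)
    for block in blocks:
        if train:
            if np.random.rand() < p_survive:
                x = x + block(x)          # block is active for this pass
            # else: skip the block entirely and keep x unchanged
        else:
            x = x + p_survive * block(x)  # test time: all blocks, contributions scaled
    return x
```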
00:48:26.500 | But basically, it's kind of an unintuitive result, because you can throw out layers at
00:48:29.780 | random.
00:48:30.780 | And I think it breaks the original notion of what we had of ConvNets as these feature
00:48:35.820 | transformers that compute more and more complex features over time or something like that.
00:48:40.780 | And I think it seems much more intuitive to think about these residual networks, at least
00:48:44.740 | to me, as some kinds of dynamical systems, where you have this original representation
00:48:49.940 | of the image x.
00:48:50.940 | And then every single residual block is kind of like a vector field, because it computes
00:48:54.860 | in a delta on top of your signal.
00:48:57.420 | And so these vector fields nudge your original representation x towards a space where you
00:49:02.020 | can decode the answer y of the class of that x.
00:49:06.180 | And so if you drop off some of these residual blocks at random, then if you haven't applied
00:49:09.980 | one of these vector fields, then the other vector fields that come later can kind of
00:49:13.020 | make up for it.
00:49:14.020 | And they basically nudge the-- they pick up the slack.
00:49:17.980 | And they nudge it along anyways.
00:49:19.740 | And so that's roughly the image I currently have in mind of how these things work.
00:49:24.820 | So much more like dynamical systems.
00:49:27.660 | In fact, another experiment that people are playing with that I also find interesting
00:49:30.780 | is you can share these residual blocks.
00:49:33.700 | So it starts to look more like a recurrent neural network.
00:49:36.340 | So these residual blocks would have shared connectivity.
00:49:39.020 | And then you have this dynamical system, really, where you're just running a single RNN, a
00:49:42.940 | single vector field that you keep iterating over and over.
00:49:45.380 | And then your fixed point gives you the answer.
00:49:47.380 | So it's kind of interesting what's happening.
00:49:49.980 | It looks very funny.
00:49:52.860 | We've had many more interesting results.
00:49:54.780 | So people are playing a lot with these residual networks and improving on them in various
00:49:59.260 | ways.
00:50:00.260 | So as I mentioned already, it turns out that you can make these residual networks much
00:50:03.180 | shallower and make them wider.
00:50:05.600 | So you introduce more channels.
00:50:07.020 | And that can work just as well, if not better.
00:50:08.960 | So it's not necessarily the depth that is giving you a lot of the performance.
00:50:13.680 | You can scale down the depth.
00:50:15.100 | And if you increase the width, that can actually work better.
00:50:18.180 | And they're also more efficient if you do it that way.
00:50:21.140 | There's more funny regularization techniques.
00:50:23.700 | Here Swapout is a funny regularization technique that actually interpolates between plain nets,
00:50:28.660 | ResNets, and dropout.
00:50:30.260 | So that's also a funny paper.
00:50:32.180 | We have fractal nets.
00:50:33.620 | We actually have many more different types of nets.
00:50:35.660 | And so people have really experimented with this a lot.
00:50:37.460 | I'm really eager to see what the winning architecture will be in 2016 as a result of a lot of this.
00:50:42.260 | One of the things that has really enabled this rapid experimentation in the community
00:50:45.700 | is that somehow we've developed, luckily, this culture of sharing a lot of code among
00:50:50.220 | ourselves.
00:50:51.220 | So for example, Facebook has released-- just as an example-- Facebook has released residual
00:50:55.740 | networks code in Torch that is really good that a lot of these papers, I believe, have
00:50:59.140 | adopted and worked on top of and that allowed them to actually really scale up their experiments
00:51:03.740 | and explore different architectures.
00:51:07.920 | So it's great that this has happened.
00:51:09.580 | Unfortunately, a lot of these papers are coming out on arXiv.
00:51:12.340 | And it's kind of a chaos as these are being uploaded.
00:51:14.260 | So at this point, I think this is a natural point to plug very briefly my arxiv-sanity.com.
00:51:19.820 | So this is the best website ever.
00:51:21.640 | And what it does is it crawls arXiv.
00:51:24.500 | And it takes all the papers.
00:51:26.500 | And it analyzes all the papers, the full text of the papers, and creates TF-IDF bag of words
00:51:30.540 | features for all the papers.
00:51:32.580 | And then you can do things like you can search a particular paper, like residual networks
00:51:35.580 | paper here.
00:51:36.580 | And you can look for similar papers on arXiv.
00:51:38.580 | And so this is a sorted list of basically all the residual networks papers that are
00:51:41.540 | most related to that paper.
00:51:43.700 | Or you can also create user accounts.
00:51:45.260 | And you can create a library of papers that you like.
00:51:47.340 | And then Arxiv Sanity will train a support vector machine for you.
00:51:50.400 | And basically, you can look at what are the arXiv papers over the last month that I would enjoy
00:51:54.660 | the most.
00:51:55.660 | And that's just computed by Arxiv Sanity.
00:51:57.460 | And so it's like a curated feed specifically for you.
00:52:00.080 | So I use this quite a bit.
00:52:01.080 | And I find it useful.
00:52:02.360 | So I hope that other people do as well.
00:52:06.100 | So we saw convolutional neural networks.
00:52:08.300 | I explained how they work.
00:52:09.500 | I explained some of the background context.
00:52:11.100 | I've given you an idea of what they look like in practice.
00:52:13.500 | And we went through case studies of the winning architectures over time.
00:52:16.900 | But so far, we've only looked at image classification specifically.
00:52:19.540 | So we're categorizing images into some number of bins.
00:52:22.380 | So I'd like to briefly talk about addressing other tasks in computer vision and how you
00:52:26.220 | might go about doing that.
00:52:28.160 | So the way to think about doing other tasks in computer vision is that really what we
00:52:32.300 | have is you can think of this convolutional neural network as this block of compute that
00:52:37.500 | has a few million parameters in it.
00:52:39.540 | And it can do basically arbitrary functions that are very nice over images.
00:52:43.740 | And so it takes an image, gives you some kind of features.
00:52:47.260 | And now different tasks will basically look as follows.
00:52:50.540 | You want to predict some kind of a thing in different tasks that will be different things.
00:52:54.620 | And you always have a desired thing.
00:52:56.300 | And then you want to make the predicted thing much more closer to the desired thing.
00:52:59.660 | And you back propagate.
00:53:01.100 | So this is the only part usually that changes from task to task.
00:53:03.940 | You'll see that these ConvNets don't change too much.
00:53:06.020 | What changes is your loss function at the very end.
00:53:08.120 | And that's what actually helps you really transfer a lot of these winning architectures.
00:53:12.260 | You usually use these pre-trained networks.
00:53:13.900 | And you don't worry too much about the details of that architecture.
00:53:16.460 | Because you're only worried about adding a small piece at the top or changing the loss
00:53:19.660 | function or substituting a new data set and so on.
00:53:22.520 | So just to make this slightly more concrete, in image classification, we apply this compute
00:53:26.780 | block.
00:53:27.780 | We get these features.
00:53:28.780 | And then if I want to do classification, I would basically predict 1,000 numbers that
00:53:32.260 | give me the log probabilities of different classes.
00:53:34.700 | And then I have a predicted thing, a desired thing, particular class.
00:53:38.260 | And I can back prop.
00:53:39.740 | If I'm doing image captioning, it also looks very similar.
00:53:42.940 | Instead of predicting just a vector of 1,000 numbers, I now have, for example, 10,000 words
00:53:48.900 | in some kind of vocabulary.
00:53:50.580 | And I'd be predicting 10,000 numbers and a sequence of them.
00:53:53.480 | And so I can use a recurrent neural network, which you will hear much more about, I think,
00:53:57.460 | in Richard's lecture just after this.
00:54:00.340 | And so I produce a sequence of 10,000 dimensional vectors.
00:54:02.540 | And that's just a description.
00:54:03.580 | And they indicate the probabilities of different words to be emitted at different time steps.
00:54:08.020 | Or for example, if you want to do localization, again, most of the block stays unchanged.
00:54:12.460 | But now we also want some kind of an extent in the image.
00:54:16.660 | So suppose we want to classify-- we don't only just want to classify this as an airplane,
00:54:20.300 | but we want to localize it with x, y, width, height, bounding box coordinates.
00:54:24.280 | And if we also make the specific assumption that there's always a single thing
00:54:28.000 | in the image, like a single airplane in every image, then you can just afford to predict
00:54:32.020 | that.
00:54:33.020 | So we predict these softmax scores, just like before, and apply the cross-entropy loss.
00:54:37.380 | And then we can predict x, y, width, height on top of that.
00:54:39.780 | And we use an L2 loss or a Huber loss or something like that.
00:54:43.500 | So you just have a predicted thing, a desired thing, and you just backprop.
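A sketch of that two-headed setup in Keras (Keras 2 API assumed): shared features feed a softmax classification head trained with cross-entropy and a 4-number box head trained with an L2-style regression loss. The feature shape, class count, and loss weights are placeholders.

```python
from keras.models import Model
from keras.layers import Input, GlobalAveragePooling2D, Dense

feat = Input(shape=(7, 7, 512))                       # output of a pre-trained trunk
x = GlobalAveragePooling2D()(feat)
class_scores = Dense(1000, activation='softmax', name='cls')(x)   # class probabilities
box = Dense(4, name='box')(x)                         # x, y, width, height

model = Model(inputs=feat, outputs=[class_scores, box])
model.compile(optimizer='adam',
              loss={'cls': 'categorical_crossentropy',   # cross-entropy on the class
                    'box': 'mse'},                       # L2-style loss on the box
              loss_weights={'cls': 1.0, 'box': 1.0})
```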
00:54:47.940 | If you want to do reinforcement learning because you want to play different games, then again,
00:54:51.500 | the setup is you just predict some different thing.
00:54:53.740 | And it has some different semantics.
00:54:55.420 | So in this case, we would be, for example, predicting eight numbers that give us the
00:54:58.400 | probabilities of taking different actions.
00:55:01.060 | For example, there are eight discrete actions in Atari.
00:55:03.420 | And we just predict eight numbers, and then we train this with a slightly different manner.
00:55:07.820 | Because in the case of reinforcement learning, you don't actually know what the correct action
00:55:12.500 | is to take at any point in time.
00:55:14.300 | But you can still get a desired thing eventually, because you just run these rollouts over time,
00:55:18.940 | and you just see what happens.
00:55:22.000 | And then that helps inform exactly what the correct answer should have been or what the
00:55:26.820 | desired thing should have been in any one of those rollouts in any point in time.
00:55:30.580 | I don't want to dwell on this too much in this lecture, though.
00:55:32.340 | It's outside of the scope.
00:55:33.700 | You'll hear much more about reinforcement learning in a later lecture.
00:55:38.460 | If you wanted to do segmentation, for example, then you don't want to predict a single vector
00:55:43.320 | of numbers for a single image.
00:55:45.980 | But every single pixel has its own category that you'd like to predict.
00:55:48.980 | So a data set will actually be colored like this, and you have different classes, different
00:55:52.020 | areas.
00:55:53.200 | And then instead of predicting a single vector of classes, you predict an entire array of
00:55:58.500 | 224 by 224, since that's the extent of the original image, for example, times 20 if you
00:56:02.740 | have 20 different classes.
00:56:04.340 | And then you basically have 224 by 224 independent softmaxes here.
00:56:08.940 | That's one way you could pose this.
00:56:10.180 | And then you back propagate.
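To make the loss concrete, here is a NumPy sketch of per-pixel cross-entropy: take an independent softmax over the 20 class scores at each of the 224 by 224 positions and average the negative log probability of the true class. Shapes and the random inputs are purely illustrative.

```python
import numpy as np

def per_pixel_cross_entropy(scores, labels):
    # scores: (224, 224, 20) raw class scores; labels: (224, 224) integer class map
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)              # softmax at every pixel
    h, w = labels.shape
    picked = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.log(picked + 1e-12).mean()                   # average per-pixel loss

scores = np.random.randn(224, 224, 20)
labels = np.random.randint(0, 20, size=(224, 224))
print(per_pixel_cross_entropy(scores, labels))
```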
00:56:11.720 | This here would be slightly more difficult, because you see here I have deconv layers
00:56:16.240 | mentioned here.
00:56:17.240 | And I didn't explain deconvolutional layers.
00:56:19.380 | They're related to convolutional layers.
00:56:20.740 | They do a very similar operation, but kind of backwards in some way.
00:56:24.860 | So a convolutional layer kind of does these downsampling operations as it computes.
00:56:28.300 | A deconv layer does these kind of upsampling operations as it computes these convolutions.
00:56:32.660 | But in fact, you can implement a deconv layer using a conv layer.
00:56:35.780 | So the deconv forward pass is the conv layer backward pass.
00:56:40.060 | And the deconv backward pass is the conv layer forward pass, basically.
00:56:43.580 | So they're basically an identical operation; it's just a question of whether you're upsampling
00:56:47.380 | or downsampling.
00:56:48.880 | So you can use deconv layers, or you can use hypercolumns.
00:56:51.820 | And there are different things that people do in segmentation literature.
00:56:55.140 | But that's just a rough idea, as you're just changing the loss function at the end.
00:56:58.500 | If you wanted to do autoencoders, so you want to do some unsupervised learning or something
00:57:01.660 | like that, well, you're just trying to predict the original image.
00:57:04.620 | So you're trying to get the convolutional network to implement the identity transformation.
00:57:09.140 | And the trick, of course, that makes it non-trivial is that you're forcing the representation
00:57:12.700 | to go through this representational bottleneck of 7 by 7 by 512.
00:57:16.580 | So the network must find an efficient representation of the original image so that it can decode
00:57:20.100 | it later.
00:57:21.100 | So that would be an autoencoder.
00:57:22.740 | You again have an L2 loss at the end, and you backprop.
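A minimal convolutional autoencoder sketch in Keras (Keras 2 API assumed): convolutions and pooling squeeze the image through a bottleneck, upsampling and convolutions decode it back, and a mean squared error loss compares the reconstruction against the input itself. The layer sizes here are illustrative and smaller than the 7x7x512 bottleneck mentioned above.

```python
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D

inp = Input(shape=(224, 224, 3))
x = Conv2D(64, (3, 3), padding='same', activation='relu')(inp)
x = MaxPooling2D((2, 2))(x)                              # 112x112
x = Conv2D(128, (3, 3), padding='same', activation='relu')(x)
x = MaxPooling2D((2, 2))(x)                              # 56x56 bottleneck (illustrative)
x = Conv2D(128, (3, 3), padding='same', activation='relu')(x)
x = UpSampling2D((2, 2))(x)                              # back to 112x112
x = Conv2D(64, (3, 3), padding='same', activation='relu')(x)
x = UpSampling2D((2, 2))(x)                              # back to 224x224
recon = Conv2D(3, (3, 3), padding='same', activation='sigmoid')(x)

autoencoder = Model(inputs=inp, outputs=recon)
autoencoder.compile(optimizer='adam', loss='mse')        # L2 reconstruction loss
# autoencoder.fit(images, images, ...)                   # target is the input itself
```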
00:57:25.620 | Or if you want to do variational autoencoders, you have to introduce a reparameterization
00:57:28.900 | layer, and you have to append an additional small loss that makes your posterior be your
00:57:32.940 | prior.
00:57:33.940 | But it's just like an additional layer.
00:57:35.020 | And then you have an entire generative model.
00:57:36.820 | And you can actually sample images as well.
00:57:39.740 | If you wanted to do detection, things get a little more hairy, perhaps, compared to
00:57:43.940 | localization or something like that.
00:57:45.700 | So one of my favorite detectors, perhaps, to explain is the YOLO detector, because it's
00:57:48.980 | perhaps the simplest one.
00:57:50.500 | It doesn't work the best, but it's the simplest one to explain and has the core idea of how
00:57:53.940 | people do detection in computer vision.
00:57:57.340 | And so the way this works is we reduced the original image to a 7 by 7 by 512 feature.
00:58:03.780 | So really, there are these 49 discrete locations that we have.
00:58:08.500 | And at every single one of these 49 locations, we're going to predict-- in YOLO, we're going
00:58:13.400 | to predict a class.
00:58:14.740 | So that's shown here on the top right.
00:58:15.940 | So every single one of these 49 will be some kind of a softmax.
00:58:20.140 | And then additionally, at every single position, we're going to predict some number of bounding
00:58:23.940 | boxes.
00:58:25.060 | And so there's going to be a b number of bounding boxes.
00:58:27.700 | Say b is 10.
00:58:29.080 | So we're going to be predicting 50 numbers.
00:58:31.780 | And the 5 comes from the fact that every bounding box will have five numbers associated with it.
00:58:36.300 | So you have to describe the x, y, the width, and the height.
00:58:38.620 | And you have to also indicate some kind of a confidence of that bounding box.
00:58:43.620 | So that's the fifth number, some kind of a confidence measure.
00:58:46.180 | So you basically end up predicting these bounding boxes.
00:58:48.580 | They have positions.
00:58:49.620 | They have class.
00:58:50.660 | They have confidence.
00:58:52.280 | And then you have some true bounding boxes in the image.
00:58:54.660 | So you know that there are certain true boxes.
00:58:57.200 | And they have certain class.
00:58:59.060 | And what you do then is you match up the desired thing with the predicted thing.
00:59:03.780 | And whatever-- so say, for example, you had one bounding box of a cat.
00:59:08.420 | Then you would find the closest predicted bounding box.
00:59:10.960 | And you would mark it as a positive.
00:59:12.760 | And you would try to make that associated grid cell predict cat.
00:59:16.100 | And you would nudge the prediction to be slightly more towards the cat box.
00:59:20.900 | And so all of this can be done with simple losses.
00:59:22.580 | And you just back propagate that.
00:59:23.740 | And then you have a detector.
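A rough NumPy sketch of the output layout just described: a 7x7 grid, a class distribution per cell, and B boxes per cell, each carrying (x, y, width, height, confidence). The decode function below simply keeps, per cell, the most confident box above a threshold together with the most likely class; real detectors add box parameterization details and non-max suppression, which are omitted here.

```python
import numpy as np

S, B, C = 7, 10, 20                      # grid size, boxes per cell, number of classes
class_probs = np.random.rand(S, S, C)    # one class distribution per grid cell
boxes = np.random.rand(S, S, B, 5)       # (x, y, w, h, confidence) for each box

def decode(class_probs, boxes, conf_thresh=0.5):
    detections = []
    for i in range(S):
        for j in range(S):
            b = boxes[i, j, boxes[i, j, :, 4].argmax()]   # most confident box in the cell
            if b[4] > conf_thresh:
                cls = class_probs[i, j].argmax()          # most likely class for the cell
                detections.append((cls, b[:4], b[4]))
    return detections

print(len(decode(class_probs, boxes)))
```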
00:59:25.540 | Or if you want to get much more fancy, you could do dense image captioning.
00:59:28.940 | So in this case, this is a combination of detection and image captioning.
00:59:32.380 | This is a paper with my equal co-author Justin Johnson and Fei-Fei Li from last year.
00:59:36.460 | And so what we did here is image comes in.
00:59:38.240 | And it becomes much more complex.
00:59:39.420 | I don't maybe want to go into it as much.
00:59:41.580 | But the first order approximation is that instead-- it's basically a detection.
00:59:45.400 | But instead of predicting fixed classes, we instead predict a sequence of words.
00:59:49.560 | So we use a recurrent neural network there.
00:59:51.960 | But basically, you can take an image then.
00:59:53.260 | And you can predict-- you can both detect and describe everything in a complex visual
00:59:57.420 | scene.
00:59:58.960 | So that's just some overview of different tasks that people care about.
01:00:01.700 | Most of them consist of just changing this top part.
01:00:04.560 | You put a different loss function, a different data set.
01:00:07.240 | But you'll see that this computational block stays relatively unchanged from time to time.
01:00:11.340 | And that's why, as I mentioned, when you do transfer learning, you just want to take these
01:00:14.980 | pre-trained networks.
01:00:15.980 | And you mostly want to use whatever works well on ImageNet, because a lot of that does
01:00:19.500 | not change too much.
01:00:22.420 | So in the last part of the talk, I'd like to-- let me just make sure we're good on time.
01:00:25.940 | OK, we're good.
01:00:27.280 | So in the last part of the talk, I just wanted to give some hints or some practical considerations
01:00:31.740 | when you want to apply convolutional networks in practice.
01:00:34.820 | So first consideration you might have if you want to run these networks is, what hardware
01:00:38.480 | do I use?
01:00:40.460 | So some of the options that I think are available to you-- well, first of all, you can just
01:00:44.540 | buy a machine.
01:00:45.640 | So for example, NVIDIA has these Digits dev boxes that you can buy.
01:00:50.260 | They have Titan X GPUs, which are strong GPUs.
01:00:53.460 | You can also, if you're much more ambitious, you can buy a DGX1, which has the newest Pascal
01:00:57.540 | P100 GPUs.
01:00:58.540 | Unfortunately, the DGX1 is about $130,000.
01:01:02.400 | So this is kind of an expensive supercomputer.
01:01:05.600 | But the Digits dev box, I think, is more accessible.
01:01:07.940 | And so that's one option you can go with.
01:01:10.140 | Alternatively, you can look at the specs of a dev box.
01:01:13.780 | And those specs are-- they're good specs.
01:01:15.980 | And then you can buy all the components yourself and assemble it like LEGO.
01:01:19.660 | Unfortunately, that's prone to mistakes, of course.
01:01:22.500 | But you can definitely reduce the price maybe by a factor of like two compared to the NVIDIA
01:01:27.820 | machine.
01:01:28.820 | But of course, NVIDIA machine would just come with all the software installed, all the hardware
01:01:31.780 | is ready, and you can just do work.
01:01:33.940 | There are a few GPU offerings in the cloud.
01:01:35.620 | But unfortunately, it's actually not at a good place right now.
01:01:38.900 | It's actually quite difficult to get GPUs in the cloud-- good GPUs, at least.
01:01:42.480 | So Amazon AWS has these GRID K520s.
01:01:45.920 | They're not very good GPUs.
01:01:47.640 | They're not fast.
01:01:48.640 | They don't have too much memory.
01:01:49.640 | It's actually kind of a problem.
01:01:51.640 | Microsoft Azure is coming up with its own offering soon.
01:01:55.400 | So I think they've announced it.
01:01:57.080 | And it's in some kind of a beta stage, if I remember correctly.
01:02:00.140 | And so those are powerful GPUs, K80s, that would be available to you.
01:02:03.520 | At OpenAI, for example, we use Cirrascale.
01:02:05.920 | So Cirrascale is a slightly different model.
01:02:07.760 | You can't spin up GPUs on demand.
01:02:09.680 | But they allow you to rent a box in the cloud.
01:02:11.900 | So what that amounts to is that we have these boxes somewhere in the cloud.
01:02:14.880 | I just have the DNS name.
01:02:17.240 | I just have the URL.
01:02:18.240 | I SSH to it.
01:02:19.960 | There are Titan X boxes in the machine.
01:02:21.920 | And so you can just do work that way.
01:02:24.920 | So these options are available to you hardware-wise.
01:02:27.680 | In terms of software, there are many different frameworks, of course, that you could use
01:02:30.720 | for deep learning.
01:02:32.280 | So these are some of the more common ones that you might see in practice.
01:02:36.880 | So different people have different recommendations on this.
01:02:40.000 | My personal recommendation right now to most people, if you just want to apply this in
01:02:43.620 | practical settings, 90% of the use cases are probably addressable with things like Keras.
01:02:49.300 | So Keras would be my go-to number one thing to look at.
01:02:53.060 | Keras is a layer over TensorFlow or Theano.
01:02:58.520 | And basically, it's just a higher-level API over either of those.
01:03:01.080 | So for example, I usually use Keras on top of TensorFlow.
01:03:03.840 | And it's a much more higher-level language than raw TensorFlow.
01:03:08.300 | So you can also work in raw TensorFlow, but you'll have to do a lot of low-level stuff.
01:03:11.740 | If you need all that freedom, then that's great, because that allows you to have much
01:03:14.980 | more freedom in terms of how you design everything.
01:03:17.260 | But it can be slightly more wordy.
01:03:19.820 | For example, you have to assign every single weight.
01:03:21.580 | You have to assign a name, stuff like that.
01:03:24.260 | And so it's just much more wordy, but you can work at that level.
01:03:27.180 | Or for most applications, I think Keras would be sufficient.
01:03:29.980 | And I've used Torch for a long time.
01:03:31.140 | I still really like Torch.
01:03:32.460 | It's very lightweight, interpretable.
01:03:34.020 | It works just fine.
01:03:35.640 | So those are the options that I would currently consider, at least.
01:03:41.600 | Another practical consideration-- you might be wondering, what architecture do I use in
01:03:45.660 | my problem?
01:03:46.660 | So my answer here-- and I've already hinted at this-- is don't be a hero.
01:03:51.500 | Don't go crazy.
01:03:52.500 | Don't design your own neural networks and convolutional layers.
01:03:55.160 | And you don't want to do that, probably.
01:03:58.320 | So the algorithm is actually very simple.
01:04:00.560 | Look at whatever is currently the latest released thing that works really well in ILSVRC.
01:04:05.980 | You download that pre-trained model.
01:04:07.920 | And then you potentially add or delete some layers on top, because you want to do some
01:04:11.280 | other task.
01:04:12.280 | So that usually requires some tinkering at the top or something like that.
01:04:15.400 | And then you fine tune it on your application.
01:04:17.300 | So actually, a very straightforward process.
01:04:19.880 | The first degree, I think, to most applications would be don't tinker with it too much.
01:04:23.420 | You're going to break it.
01:04:25.340 | But of course, you can also take 231n, and then you might become much better at tinkering
01:04:29.480 | with these architectures.
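Here is what that recipe might look like in Keras, as a hedged sketch: load a network pre-trained on ImageNet (VGG16 is used here only as an example), strip the top, add a small head for your own classes, optionally freeze the trunk, and fine-tune on your data. `num_classes` and the training data are placeholders.

```python
from keras.applications import VGG16
from keras.models import Model
from keras.layers import GlobalAveragePooling2D, Dense, Dropout

num_classes = 10   # placeholder for your own task
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False              # optionally freeze the pre-trained trunk

x = GlobalAveragePooling2D()(base.output)
x = Dropout(0.5)(x)                      # the main regularization knob worth tuning
out = Dense(num_classes, activation='softmax')(x)

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=32, epochs=10, validation_split=0.1)
```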
01:04:32.120 | Second is how do I choose the parameters?
01:04:35.680 | And my answer here, again, would be don't be a hero.
01:04:39.520 | Look into papers.
01:04:40.520 | Look at what parameters they use.
01:04:41.520 | For the most part, you'll see that all papers use the same hyperparameters.
01:04:44.360 | They look very similar.
01:04:45.500 | So Adam-- when you use Adam for optimization, it's always learning rate 1e-3 or
01:04:49.480 | 1e-4.
01:04:52.720 | So you can also use SGD momentum.
01:04:54.840 | It's always the similar kinds of learning rates.
01:04:56.720 | So don't go too crazy designing this.
01:04:58.780 | One of the things you probably want to play with the most is the regularization.
01:05:02.680 | And in particular, not the L2 regularization, but the dropout rates is something I would
01:05:06.000 | advise instead.
01:05:10.000 | Because you might have a smaller or a much larger data set, if you have a much smaller
01:05:12.620 | data set, then overfitting is a concern.
01:05:14.520 | So you want to make sure that you regularize properly with dropout.
01:05:17.660 | And then you might want to, as a second degree consideration, maybe learning rate, you want
01:05:21.720 | to tune that a tiny bit.
01:05:22.880 | But that usually doesn't have as much of an effect.
01:05:26.380 | So really, there's like two hyperparameters.
01:05:28.000 | And you take a pre-trained network.
01:05:29.000 | And this is 90% of the use cases, I would say.
01:05:33.840 | So compared to when-- computer vision in 2011, where you might have hundreds of hyperparameters.
01:05:38.360 | So yeah.
01:05:42.000 | And in terms of distributed training, so if you want to work at scale, because if you
01:05:46.920 | want to train ImageNet or some large scale data sets, you might want to train across
01:05:49.660 | multiple GPUs.
01:05:51.120 | So just to give you an idea, most of these state-of-the-art networks are trained on the
01:05:54.100 | order of a few weeks across multiple GPUs, usually four or eight GPUs.
01:05:58.920 | And these GPUs are roughly on the order of $1,000 each.
01:06:01.360 | But then you also have to house them.
01:06:02.860 | So of course, that adds additional price.
01:06:04.960 | But you almost always want to train on multiple GPUs if possible.
01:06:08.460 | Usually you don't end up training across machines.
01:06:10.380 | That's much more rare, I think, to train across machines.
01:06:12.820 | What's much more common is you have a single machine.
01:06:14.460 | And it has eight Titan Xs or something like that.
01:06:16.840 | And you do distributed training on those eight Titan Xs.
01:06:19.740 | There are different ways to do distributed training.
01:06:21.520 | So if you're feeling fancy, you can try to do some model parallelism, where you split
01:06:26.460 | your network across multiple GPUs.
01:06:29.440 | I would instead advise some kind of a data parallelism architecture.
01:06:32.060 | So usually what you see in practice is you have eight GPUs.
01:06:35.420 | So I take my batch of 256 images or something like that.
01:06:38.660 | I split it.
01:06:39.660 | And I split it equally across the GPUs.
01:06:41.500 | I do forward pass on those GPUs.
01:06:43.420 | And then I basically just add up all the gradients.
01:06:46.420 | And I propagate that through.
01:06:47.880 | So you're just distributing this batch.
01:06:49.420 | And mathematically, you're doing the exact same thing as if you had a giant GPU.
01:06:53.940 | But you're just splitting up that batch across different GPUs.
01:06:57.180 | But you're still doing synchronous training with SGD as normal.
01:06:59.940 | So that's what you'll see most in practice, which I think is the best thing to do right
01:07:03.340 | now for most normal applications.
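A framework-agnostic sketch of that synchronous scheme in NumPy: split one batch into shards, compute gradients per shard, average them, and apply a single update, which is mathematically the same as one big-batch step. `compute_gradients` and `apply_update` are hypothetical placeholders for a real framework's backward pass and optimizer.

```python
import numpy as np

def data_parallel_step(params, batch_x, batch_y, compute_gradients,
                       apply_update, num_gpus=8):
    # split the batch into one shard per GPU
    x_shards = np.array_split(batch_x, num_gpus)
    y_shards = np.array_split(batch_y, num_gpus)
    # in practice each shard runs forward/backward on its own GPU; here it is a loop
    grads = [compute_gradients(params, xs, ys)
             for xs, ys in zip(x_shards, y_shards)]
    # average the gradients across shards (grads: list of lists of arrays, assumed)
    avg_grad = [np.mean([g[i] for g in grads], axis=0)
                for i in range(len(grads[0]))]
    # one synchronous SGD update, equivalent to a single big-batch step
    return apply_update(params, avg_grad)
```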
01:07:07.140 | And other kind of considerations that sometimes enter that you could maybe worry about is
01:07:11.700 | that there are these bottlenecks to be aware of.
01:07:13.540 | So in particular, CPU to disk bottleneck.
01:07:16.100 | This means that you have a giant data set.
01:07:17.500 | It's somewhere on some disk.
01:07:18.940 | You want that disk to probably be an SSD because you want this loading to be quick.
01:07:23.220 | Because these GPUs process data very quickly.
01:07:24.900 | And that might actually be a bottleneck.
01:07:26.260 | Like loading the data could be a bottleneck.
01:07:28.020 | So in many applications, you might want to pre-process your data, make sure that it's
01:07:31.420 | read out contiguously in very raw form from something like an HDF5 file or some kind of
01:07:36.300 | other binary format.
01:07:38.180 | And another bottleneck to be aware of is the CPU-GPU bottleneck.
01:07:41.880 | So the GPU is doing a lot of heavy lifting of the neural network.
01:07:44.320 | And the CPU is loading the data.
01:07:46.180 | And you might want to use things like prefetching threads, where the CPU, while the networks
01:07:50.180 | are doing forward-backward on the GPU, your CPU is busy loading the data from the disk
01:07:54.380 | and maybe doing some pre-processing and making sure that it can ship it off to the GPU at
01:07:59.100 | the next time step.
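A small sketch of such a prefetching setup using Python threads and a bounded queue: a background worker keeps loading and pre-processing batches while the main loop consumes them, so data loading overlaps with GPU compute. `load_batch` and `train_step` are hypothetical placeholders.

```python
import threading
import queue

def prefetching_loader(load_batch, num_batches, max_prefetch=4):
    q = queue.Queue(maxsize=max_prefetch)   # bounded buffer of ready batches

    def worker():
        for i in range(num_batches):
            q.put(load_batch(i))            # blocks when the queue is full
        q.put(None)                         # sentinel: no more data

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch

# for batch in prefetching_loader(load_batch, num_batches=1000):
#     train_step(batch)                     # GPU work overlaps with CPU loading
```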
01:08:00.900 | So those are some of the practical considerations I could come up with for this lecture.
01:08:04.740 | If you wanted to learn much more about convolutional neural networks and a lot of what I've been
01:08:07.520 | talking about, then I encourage you to check out CS231n.
01:08:11.100 | We have lecture videos available.
01:08:12.940 | We have notes, slides, and assignments.
01:08:15.020 | Everything is up and available.
01:08:16.980 | So you're welcome to check it out.
01:08:19.540 | And that's it.
01:08:20.540 | Thank you.
01:08:21.540 | [ Applause ]
01:08:30.820 | So I guess I can take some questions.
01:08:31.820 | Yeah.
01:08:32.820 | [ Inaudible ]
01:08:53.620 | Hello?
01:08:54.620 | Hello.
01:08:56.620 | I'm Kyle Farr from Lumna.
01:08:58.860 | I'm using a lot of convolutional nets for genomics.
01:09:01.100 | One of the problems that we see is that our genomic sequence tends to be arbitrary length.
01:09:06.860 | So right now we're padding with a lot of zeros, but we're curious as to what your thoughts
01:09:10.340 | are on using CNNs for things of arbitrary size.
01:09:13.980 | Or we can't just downsample to 227 by 227.
01:09:18.380 | So is this like a genomic sequence of like ATCG?
01:09:20.660 | Like that kind of sequence?
01:09:21.660 | Yeah, exactly.
01:09:22.660 | Yeah.
01:09:23.660 | So some of the options would be -- so recurrent neural networks might be a good fit because
01:09:26.060 | they allow arbitrarily sized contexts.
01:09:28.980 | Another option I would say is if you look at the WaveNet paper from DeepMind, they have
01:09:32.900 | audio and they're using convolutional networks for processing it.
01:09:35.740 | And I would basically adopt that kind of an architecture.
01:09:37.700 | They have this clever way of doing what's called à trous, or dilated, convolutions.
01:09:42.060 | And so that allows you to capture a lot of context with few layers.
01:09:45.600 | And so that's called dilated convolutions.
01:09:47.540 | And the WaveNet paper has some details.
01:09:49.160 | And there's an efficient implementation of it that you should be aware of on GitHub.
01:09:51.460 | And so you might be able to just drag and drop the fast WaveNet code into that application.
01:09:56.100 | And so you have much larger context, but it's, of course, not infinite context as you might
01:09:59.260 | have with a recurrent network.
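To illustrate the idea mentioned here, a NumPy sketch of a 1-D dilated convolution: the filter taps are spaced `dilation` steps apart, so stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially with only a few layers. This shows the general mechanism, not the WaveNet implementation.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    # x: (length,) input sequence; w: (k,) filter taps spaced `dilation` apart
    k = len(w)
    span = (k - 1) * dilation                   # receptive field of this single layer
    out = np.zeros(len(x) - span)
    for t in range(len(out)):
        out[t] = sum(w[i] * x[t + i * dilation] for i in range(k))
    return out

x = np.random.randn(100)
w = np.random.randn(3)
print(dilated_conv1d(x, w, dilation=4).shape)   # (92,): 3 taps spanning 9 time steps
```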
01:10:00.540 | Yeah, we're definitely checking those out.
01:10:02.380 | We also tried RNNs.
01:10:03.580 | They're quite slow for these things.
01:10:05.780 | Our main problem is that the genes can be very short or very long, but the whole sequence
01:10:09.940 | matters.
01:10:11.420 | So I think that's one of the challenges that we're looking at with this type of problem.
01:10:15.780 | Interesting.
01:10:16.780 | Yeah, so those would be the two options that I would play with, basically.
01:10:19.740 | I think those are the two that I'm aware of.
01:10:21.940 | Thank you.
01:10:22.940 | Thanks for a great lecture.
01:10:29.540 | So my question is that, is there a clear mathematical or conceptual understanding when people decide
01:10:34.540 | how many hidden layers have to be part of their architecture?
01:10:38.540 | So the answer with a lot of this is there a mathematical understanding will likely be
01:10:43.300 | no, because we are in very early phases of just doing a lot of empirical guess and check
01:10:48.160 | kind of work.
01:10:49.380 | And so theory is in some ways lagging behind a bit.
01:10:53.300 | I would say that with residual networks, you want to have more layers usually works better.
01:10:58.660 | And so you can take these layers out or you can put them in, and it's just mostly computational
01:11:02.420 | consideration of how much can you fit in.
01:11:04.760 | So our consideration is usually is you have a GPU.
01:11:07.380 | It has maybe 16 gigs of RAM or 12 gigs of RAM or something.
01:11:10.700 | I want certain batch size, and I have these considerations, and that upper bounds the
01:11:14.380 | amount of layers or how big they could be.
01:11:17.340 | And so I use the biggest thing that fits in my GPU, and that's mostly the way you choose
01:11:21.460 | this.
01:11:23.120 | And then you regularize it very strongly.
01:11:24.460 | So if you have a very small data set, then you might end up with a pretty big network
01:11:27.500 | for your data set.
01:11:28.500 | So you might want to make sure that you are tuning those dropout rates properly, and so
01:11:31.980 | you're not overfitting.
01:11:32.980 | I have a question.
01:11:33.980 | My understanding is that the recent convolutional nets don't use pooling layers, right?
01:11:44.140 | So the question is, why don't they use pooling layers?
01:11:48.700 | So is there still a place for pooling?
01:11:52.940 | Yeah.
01:11:53.940 | So certainly, so if you saw, for example, the residual network at the end, there was
01:11:57.540 | a single pooling layer at the very beginning, but mostly they went away.
01:12:01.100 | You're right.
01:12:02.100 | So it took-- I wonder if I can find the slide.
01:12:04.020 | I wonder if this is a good idea to try to find the slide.
01:12:06.900 | That's probably-- OK, let me just find this.
01:12:12.340 | Oh, OK.
01:12:13.820 | So this was the residual network architecture.
01:12:15.700 | So you see that they do a first conv, and then there's a single pool right there.
01:12:19.820 | But certainly, the trend has been to throw them away over time.
01:12:22.500 | And there's a paper also.
01:12:24.020 | It's called Striving for Simplicity: The All Convolutional Net.
01:12:27.620 | And the point in that paper is, look, you can actually do strided convolutions.
01:12:31.020 | You can throw away pooling layers altogether, or it's just as well.
01:12:34.300 | So pooling layers are, I would say, a bit of a historical vestige:
01:12:37.820 | people needed things to be efficient, and they needed to control the capacity and
01:12:40.860 | downsample things quite a lot.
01:12:43.060 | And so we're kind of throwing them away over time.
01:12:44.780 | And yeah, they're not doing anything super useful.
01:12:48.140 | They're doing this fixed operation.
01:12:50.340 | And you want to learn as much as possible.
01:12:52.420 | So maybe you don't actually want to get rid of that information.
01:12:55.820 | So it's always more appealing to-- it's probably more appealing, I would say, to throw them
01:12:59.220 | away.
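A tiny Keras sketch (Keras 2 API assumed) of the point being made: a 2x2 max pool with stride 2 and a 3x3 convolution with stride 2 both halve the spatial resolution, but the strided convolution is learned rather than a fixed operation, which is the all-convolutional idea. The 56x56x64 input is an arbitrary example.

```python
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D

inp = Input(shape=(56, 56, 64))
pooled = MaxPooling2D((2, 2), strides=(2, 2))(inp)                  # fixed downsampling op
strided = Conv2D(64, (3, 3), strides=(2, 2), padding='same')(inp)   # learned downsampling

print(Model(inp, pooled).output_shape)    # (None, 28, 28, 64)
print(Model(inp, strided).output_shape)   # (None, 28, 28, 64)
```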
01:13:00.220 | But you mentioned there is a sort of cognitive or brain analogy that the brain is doing pooling.
01:13:06.620 | Yeah, so I think that analogy is stretched by a lot.
01:13:09.220 | So the brain-- I'm not sure if the brain is doing pooling.
01:13:12.220 | [LAUGHTER]
01:13:13.220 | Yeah.
01:13:14.220 | How about image compression?
01:13:15.220 | Not for just classification, but the usage of neural networks for image compression.
01:13:16.220 | Do we have any examples?
01:13:17.220 | Sorry, I couldn't hear the question.
01:13:18.220 | Instead of classification for images, can we use the neural networks for image compression?
01:13:31.660 | Image compression.
01:13:32.660 | Yeah, I think there's actually really exciting work in this area.
01:13:35.600 | So one that I'm aware of, for example, is recent work from Google, where they're using
01:13:39.840 | convolutional networks and recurrent networks to come up with variably sized codes for images.
01:13:44.760 | So certainly, a lot of these generative models, I mean, they are very related to compression.
01:13:49.300 | So definitely a lot of work in the area that I'm excited about.
01:13:52.740 | Also, for example, super resolution networks.
01:13:54.700 | So you saw the recent acquisition of Magic Pony by Twitter.
01:13:58.940 | So they were also doing something that basically allows you to compress.
01:14:02.260 | You can send low resolution streams, because you can upsample it on the client.
01:14:06.380 | And so a lot of work in that area.
01:14:08.860 | Yeah.
01:14:09.860 | I had one question.
01:14:10.860 | One more, but maybe after you.
01:14:11.860 | Can you please comment on scalability regarding number of classes?
01:14:18.380 | So what does it take if we go up to 10,000 or 100,000 classes?
01:14:22.420 | Yeah, so if you have a lot of classes, then of course, you can grow your softmax, but
01:14:26.940 | that becomes inefficient at some point, because you're doing a giant matrix multiply.
01:14:31.100 | So some of the ways that people are addressing this in practice, I believe, is use of hierarchical
01:14:35.060 | softmax and things like that.
01:14:37.460 | So you decompose your classes into groups, and then you kind of predict one group at
01:14:42.940 | a time, and you kind of converge that way.
01:14:47.140 | So I see these papers, but I'm not an expert on exactly how this works.
01:14:51.700 | But I do know that hierarchical softmax is something that people use in this setting.
01:14:54.740 | Especially, for example, in language models, this is often used, because you have a huge
01:14:57.900 | amount of words, and you still need to predict them somehow.
01:15:00.380 | And so I believe Tomas Mikolov, for example, has some papers on using hierarchical softmax
01:15:04.060 | in this context.
01:15:06.980 | Could you talk a little bit about the convolutional functions?
01:15:11.220 | Like what considerations you should make in selecting the functions that are used in the
01:15:16.100 | convolutional filters?
01:15:18.180 | Selecting the functions that are used in the convolutional filters?
01:15:21.640 | So these filters are just parameters, right?
01:15:23.320 | So we train those filters.
01:15:25.100 | They're just numbers that we train with backpropagation.
01:15:29.100 | Are you talking about the nonlinearities, perhaps?
01:15:30.860 | Yeah, I'm just wondering about when you're selecting the features, or when you're getting
01:15:36.340 | the-- when you're trying to train to understand different features within an image, what are
01:15:41.820 | those filters actually doing?
01:15:43.580 | Oh, I see.
01:15:44.580 | You're talking about understanding exactly what those filters are looking for in the
01:15:47.020 | image and so on.
01:15:48.020 | So a lot of interesting work, especially, for example, so Jason Yosinski, he has this
01:15:52.020 | DeepVis toolbox.
01:15:53.100 | And I've shown you that you can kind of debug it that way a bit.
01:15:55.740 | There's an entire lecture that I encourage you to watch in CS231N on visualizing and
01:16:00.020 | understanding convolutional networks.
01:16:02.260 | So people use things like a deconv or guided backpropagation.
01:16:06.420 | Or you backpropagate to image, and you try to find a stimulus that maximally activates
01:16:10.340 | any arbitrary neuron.
01:16:11.740 | So different ways of probing it, and different ways have been developed.
01:16:15.880 | And there's a lecture about it.
01:16:17.080 | So I would check that out.
01:16:19.140 | Great, thanks.
01:16:20.140 | I had a question regarding the size of fine-tuning data set.
01:16:24.940 | For example, is there a ballpark number if you are trying to do classification?
01:16:30.940 | How many do you need for fine-tuning it to your sample set?
01:16:36.020 | So how many data points do you need to get good performance?
01:16:40.100 | That's the question.
01:16:43.100 | So this is like the most boring answer, I think.
01:16:46.140 | Because the more, the better always.
01:16:47.860 | And it's really hard to say, actually, how many you need.
01:16:52.540 | So usually one way to look at it is-- one heuristic that people sometimes follow is
01:16:57.100 | you look at the number of parameters, and you want the number of examples to be on the
01:17:00.220 | order of number of parameters.
01:17:01.860 | That's one way people sometimes break it down.
01:17:03.580 | Even for fine-tuning?
01:17:05.300 | Because we'll have an ImageNet model.
01:17:07.300 | So I was hoping that most of the things would be taken care of there, and then you're just
01:17:11.260 | fine-tuning.
01:17:12.260 | So you might need a lower order.
01:17:13.900 | I see.
01:17:14.900 | So when you're saying fine-tuning, are you fine-tuning the whole network, or you're freezing
01:17:17.100 | some of it, or just the top classifier?
01:17:19.020 | Just the top classifier.
01:17:20.020 | Yeah.
01:17:21.020 | So another way to look at it is you have some number of parameters, and you can estimate
01:17:23.660 | the number of bits that you think every parameter has.
01:17:27.280 | And then you count the number of bits in your data.
01:17:29.280 | So that's the kind of comparisons you would do.
01:17:31.660 | But really, I have no good answer.
01:17:34.460 | So the more, the better.
01:17:35.460 | And you have to try, and you have to regularize, and you have to cross-validate that, and you
01:17:38.020 | have to see what performance you get over time.
01:17:40.940 | Because it's too task-dependent for me to say something stronger.
01:17:44.580 | I would like to know how you think the ConvNet will work in the 3D case?
01:17:49.940 | Like is it just a simple extension of the 2D case, or do we need some extra tweaks on top of it?
01:17:56.420 | So in the 3D case, so you're talking specifically about, say, videos or some 3D--
01:17:59.680 | Actually, I'm talking about the image that has the depth information.
01:18:03.880 | Oh, I see.
01:18:05.200 | So say you have like RGBD input and things like that.
01:18:07.720 | Yeah.
01:18:08.720 | So I'm not too familiar with what people do.
01:18:10.200 | But I do know, for example, that people try to have-- for example, one thing you can do
01:18:15.240 | is just treat it as a fourth channel.
01:18:17.520 | Or maybe you want a separate ConvNet on top of the depth channel and do some fusion later.
01:18:21.440 | So I don't know exactly what the state of the art in treating that depth channel is
01:18:24.260 | right now.
01:18:27.480 | So I don't know exactly how they do it right now.
01:18:30.080 | So maybe just one more question.
01:18:32.000 | Just how do you think the 3D object recognition--
01:18:34.520 | 3D object?
01:18:35.520 | Yeah.
01:18:36.520 | Recognition?
01:18:37.520 | So what is the output that you'd like?
01:18:39.840 | The output is still the class probability.
01:18:43.120 | But we are not treating the 2D image, but the 3D representation of the object.
01:18:47.600 | I see.
01:18:48.600 | So do you have a mesh or a point cloud?
01:18:49.600 | Yeah, a mesh.
01:18:50.600 | I see.
01:18:51.600 | Yeah.
01:18:52.940 | So that's not exactly my area, unfortunately.
01:18:54.040 | But the problem with these meshes and so on is that there's this rotational degree of
01:18:58.440 | freedom that I'm not sure what people do about, honestly.
01:19:01.680 | So I'm actually not an expert on this.
01:19:05.440 | So I don't want to comment.
01:19:06.440 | There are some obvious things you might want to try.
01:19:07.720 | You might want to plug in all the possible ways you could orient this and then a test
01:19:11.920 | time average over them.
01:19:13.120 | So that would be some of the obvious things to play with.
01:19:14.780 | But I'm not actually sure what the state of the art is.
01:19:17.600 | Thank you.
01:19:18.600 | OK, one more question.
01:19:19.600 | Go ahead.
01:19:21.600 | So coming back to distributed training, is it possible to do even the classification
01:19:26.240 | in a distributed way?
01:19:27.520 | Or my question is, in the future, can I imagine our cell phones do these things together for
01:19:33.560 | one inquiry?
01:19:36.640 | Our cell phones?
01:19:37.640 | Oh, I see.
01:19:38.640 | You're trying to get cell phones to do distributed training.
01:19:40.440 | Yes, yes.
01:19:41.440 | To train and also classify for one cell phone.
01:19:42.440 | That's a radical idea.
01:19:43.440 | Is there any hope in that?
01:19:44.440 | Very radical idea.
01:19:47.080 | So a related thought I had recently was, so I have ConvNetJS in the browser.
01:19:50.520 | And I was thinking of, basically, this trains networks.
01:19:53.920 | And I was thinking about similar questions.
01:19:55.400 | Because you could imagine shipping this off as an ad equivalent.
01:19:58.600 | Like people just include this in the JavaScript.
01:20:00.400 | And then everyone's browsers are kind of like training a small network.
01:20:05.160 | So I think that's a related question.
01:20:06.160 | But do you think there's too much communication overhead?
01:20:08.600 | Or could it actually be distributed in an efficient way?
01:20:12.160 | Yes, so the problem with distributing it a lot is actually the stale gradients problem.
01:20:16.960 | So when you look at some of the papers that Google has put out about distributed training,
01:20:21.320 | and you look at the number of workers when you do asynchronous SGD versus
01:20:25.580 | the performance improvement you get, it kind of plateaus quite quickly after eight
01:20:29.400 | workers or something quite small.
01:20:31.560 | So I'm not sure if there are ways of dealing with thousands of workers.
01:20:35.120 | The issue is that, when you distribute it, every worker has a specific snapshot of
01:20:39.840 | the weights that it pulled from the master.
01:20:45.820 | And now you have a set of weights that you're using.
01:20:47.640 | And you do forward, backward.
01:20:48.640 | And then you send an update.
01:20:50.240 | But by the time you send an update and you've done your forward, backward, the parameter
01:20:53.320 | server has now done lots of updates from thousands of other things.
01:20:57.860 | And so your gradient is stale.
01:20:59.480 | You've evaluated it at the wrong and old location.
01:21:02.440 | And so it's an incorrect direction now.
01:21:05.100 | And everything breaks.
01:21:06.440 | So that's the challenge.
01:21:07.480 | And I'm not sure what people are doing about this.
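The stale-gradient effect can be illustrated with a toy simulation; the quadratic objective, learning rate, and fixed staleness below are assumptions chosen just to make the delayed-update dynamics visible.

```python
import numpy as np

# Toy objective: f(w) = 0.5 * ||w||^2, so the gradient at w is simply w.
rng = np.random.default_rng(0)
w = rng.normal(size=5)            # parameter server state
lr = 0.1
staleness = 8                     # gradients arrive 8 updates late

snapshots = [w.copy()]
for step in range(100):
    # The arriving gradient was computed against an old snapshot of the weights.
    old_w = snapshots[max(0, len(snapshots) - 1 - staleness)]
    stale_grad = old_w            # gradient of the toy objective at the old weights
    w = w - lr * stale_grad       # the server applies the stale update anyway
    snapshots.append(w.copy())

print("final ||w|| with stale gradients:", np.linalg.norm(w))
# With larger staleness (more asynchronous workers) the updates point in
# increasingly outdated directions and progress degrades or diverges, which is
# related to the plateau after a handful of workers mentioned above.
```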
01:21:11.440 | I was wondering about applications of convolutional nets to two inputs at a time.
01:21:17.200 | So let's say you have two pictures of jigsaw puzzle pieces.
01:21:21.240 | And you're trying to figure out if they fit together or whether one object compares to
01:21:26.240 | the other in a specific way.
01:21:27.480 | Have you heard of any implementation of this kind?
01:21:30.720 | So you have two inputs instead of one.
01:21:32.320 | So the common way of dealing with that is you put a ConvNet on each.
01:21:35.120 | And then you do some kind of fusion eventually to merge the information.
01:21:38.480 | Right?
01:21:39.480 | I see.
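One common pattern for two-input problems like this (a sketch, not necessarily what any specific paper does) is to run a shared-weight ConvNet on each image and fuse the two embeddings before a small classification head.

```python
import torch
import torch.nn as nn

class PairCompareNet(nn.Module):
    """Run the same (shared-weight) ConvNet on both images, then fuse the
    two embeddings and predict whether the pair fits together."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        self.head = nn.Sequential(nn.Linear(2 * embed_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1))   # logit: do they match?

    def forward(self, a, b):
        fused = torch.cat([self.encoder(a), self.encoder(b)], dim=1)
        return self.head(fused)

a, b = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
print(PairCompareNet()(a, b).shape)   # torch.Size([4, 1])
```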
01:21:40.480 | And what about recurrent neural networks if you had variable input?
01:21:45.200 | So for example, in the context of videos where you have frames coming in, then yes, some
01:21:48.600 | of the approaches are you have a convolutional network on a frame.
01:21:51.240 | And then at the top, you tie it in with a recurrent neural network.
01:21:54.760 | So you have these-- you reduce the image to some kind of a lower dimensional representation.
01:21:58.920 | And then that's an input to a recurrent neural network at the top.
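A minimal sketch of the frame-ConvNet-plus-RNN pattern just described; the feature size, the choice of an LSTM, and classifying from the last time step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameCNNPlusRNN(nn.Module):
    """Reduce each frame to a feature vector with a ConvNet, then run a
    recurrent network over the sequence of per-frame features."""
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.rnn = nn.LSTM(feat_dim, 64, batch_first=True)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, video):                  # video: (batch, time, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))  # (batch*time, feat_dim)
        feats = feats.view(b, t, -1)
        out, _ = self.rnn(feats)               # (batch, time, 64)
        return self.classifier(out[:, -1])     # predict from the last time step

video = torch.randn(2, 8, 3, 32, 32)           # 2 clips of 8 frames
print(FrameCNNPlusRNN()(video).shape)          # torch.Size([2, 10])
```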
01:22:02.820 | There are other ways to play with this.
01:22:04.200 | For example, you can actually make the recurrent-- you can make every single neuron in the ConvNet
01:22:07.480 | recurrent.
01:22:08.480 | That's also one funny way of doing this.
01:22:11.420 | So right now, when a neuron computes its output, it's only a function of a local neighborhood
01:22:16.040 | in the layer below it.
01:22:17.880 | But you can also make it, in addition, a function of that same local neighborhood or its own
01:22:22.740 | activation perhaps at the previous time step, if that makes sense.
01:22:27.800 | So this neuron is not just computing a dot product with the current patch, but it's also
01:22:31.920 | incorporating a dot product of its own and maybe its neighborhood's activations at the
01:22:37.280 | previous time step or frame.
01:22:38.640 | So that's kind of like a small RNN update hidden inside every single neuron.
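A sketch of that idea, assuming a simple additive recurrent update: each activation map is a function of the current input patch plus a convolution over its own (and its neighbors') activations from the previous frame.

```python
import torch
import torch.nn as nn

class RecurrentConvLayer(nn.Module):
    """Each output activation depends on the current input patch plus a
    convolution over its own and its neighbors' activations at the previous
    time step: a small RNN update hidden inside every spatial unit."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.input_conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.recurrent_conv = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, x, prev_activation=None):
        h = self.input_conv(x)
        if prev_activation is not None:
            h = h + self.recurrent_conv(prev_activation)
        return torch.relu(h)

layer = RecurrentConvLayer(3, 16)
state = None
for frame in torch.randn(8, 1, 3, 32, 32):   # 8 frames, batch of 1
    state = layer(frame, state)
print(state.shape)                            # torch.Size([1, 16, 32, 32])
```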
01:22:41.640 | So those are the things that I think people play with, but I'm not familiar with what
01:22:44.280 | is currently working best in this area.
01:22:45.920 | Pretty awesome.
01:22:46.920 | Thank you.
01:22:47.920 | Yeah.
01:22:48.920 | Yeah, hi.
01:22:49.920 | Thanks for the great talk.
01:22:50.920 | I have a question regarding the latency for the models that are trained using multiple
01:22:55.280 | layers.
01:22:56.280 | So especially at prediction time, as we add more layers, the forward pass will
01:23:01.200 | take more time.
01:23:02.200 | It will increase the latency of the prediction.
01:23:05.060 | So what are the numbers that we have seen presently? Can you share the prediction
01:23:13.440 | time or the latency of the forward pass?
01:23:17.520 | So you're worried, for example, you want to run a prediction very quickly.
01:23:21.280 | Would it be on an embedded device, or is this in the cloud?
01:23:23.920 | Yeah, suppose it's a cell phone.
01:23:26.360 | You're identifying the objects, or you're doing some image analysis or something.
01:23:33.480 | Yeah.
01:23:34.480 | So there's definitely a lot of work on this.
01:23:35.640 | So one way you would approach this, actually, is you have this network that you've trained
01:23:38.920 | using floating point arithmetic, 32 bits, say.
01:23:42.320 | And so there's a lot of work on taking that network and discretizing all the weights into
01:23:47.720 | like ints and making it much smaller and pruning connections.
01:23:51.200 | So one of the works related to this, for example, is Song Han here at Stanford has a few papers
01:23:56.200 | on getting rid of spurious connections and reducing the network as much as possible,
01:24:00.200 | and then making everything very efficient with integer arithmetic.
01:24:03.480 | So basically, you achieve this by discretizing all the weights and all the activations and
01:24:10.600 | pruning away connections in the network.
01:24:12.480 | So there are some tricks like that that people play.
01:24:15.480 | That's mostly what you would do on an embedded device.
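A toy illustration of the discretize-and-prune idea described above (not the specific method from the papers mentioned): quantize trained float32 weights to 8-bit integers with a per-tensor scale, and zero out the smallest-magnitude connections.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(w).max() / 127.0 + 1e-12
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def prune_smallest(w: np.ndarray, fraction: float) -> np.ndarray:
    """Zero out the given fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(w), fraction)
    return np.where(np.abs(w) < threshold, 0.0, w)

w = np.random.randn(256, 256).astype(np.float32)   # stand-in for trained layer weights
w_pruned = prune_smallest(w, fraction=0.9)          # keep only the largest 10% of connections
q, scale = quantize_int8(w_pruned)

reconstructed = q.astype(np.float32) * scale
print("sparsity:", np.mean(q == 0))
print("max reconstruction error:", np.abs(reconstructed - w_pruned).max())
```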
01:24:18.280 | And then the challenge, of course, is you've changed the network, and now you're just kind
01:24:21.480 | of crossing your fingers that it works well.
01:24:23.320 | And so I think what's interesting from a research standpoint is you'd like your test time to
01:24:29.080 | exactly match your training time.
01:24:31.160 | So then you get the best performance.
01:24:32.960 | And so the question is, how do we train with low precision arithmetic?
01:24:35.960 | And there's a lot of work on this as well, so say from Yoshua Bengio's lab as well.
01:24:41.040 | So those are exciting directions for how you train in a low-precision regime.
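One common trick from the low-precision-training literature (a sketch, not necessarily what the referenced work does) is the straight-through estimator: quantize weights in the forward pass but let gradients flow through the rounding as if it were the identity.

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    """Round to 8-bit levels in the forward pass; pass gradients straight
    through in the backward pass, as if rounding were the identity."""
    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max() / 127.0 + 1e-12
        return torch.clamp(torch.round(w / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

w = torch.randn(10, requires_grad=True)
loss = (QuantizeSTE.apply(w) ** 2).sum()
loss.backward()
print(w.grad)   # gradients flow despite the non-differentiable rounding
```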
01:24:45.080 | Do you have any numbers that you can share for the state of the art, how much time does
01:24:49.800 | it take?
01:24:50.800 | Yes, I see the papers, but I'm not sure if I remember the exact reductions.
01:24:54.600 | It's on the order of-- OK, I don't want to say, because basically I don't know.
01:24:58.480 | I don't want to try to guess this.
01:25:00.520 | Thank you.
01:25:01.520 | All right.
01:25:02.520 | So with that, we'll take a break.
01:25:03.520 | Let's thank Andrej.
01:25:04.520 | Lunch is outside, and we'll restart at 12:45.