
Deep Learning for Computer Vision (Andrej Karpathy, OpenAI)


Chapters

0:00 Deep Learning for Computer Vision
4:23 Computer Vision 2011
9:59 Transfer Learning
11:16 The power is easily accessible.
11:58 ConvNets are everywhere...
16:58 Convolution Layer
19:25 For example, if we had 6 5x5 filters, we'll get 6 separate activation maps
24:56 MAX POOLING
32:13 Case Study: AlexNet
52:34 Addressing other tasks...
53:24 Image Classification: thing = a vector of probabilities for different classes
54:09 Localization
54:48 Reinforcement Learning
55:38 Segmentation
57:26 Variational Autoencoders
57:41 Detection
59:26 Dense Image Captioning

Transcript

So thank you very much for the introduction. So today I'll speak about deep learning, especially in the context of computer vision. So what you saw in the previous talk is neural networks. So you saw that neural networks are organized into these layers, fully connected layers, where neurons within a layer are not connected to each other, but they're connected fully to all the neurons in the previous layer.

And we saw that basically we have this layer-wise structure from input until output. And there are neurons and nonlinearities, et cetera. Now so far we have not made too many assumptions about the inputs. So in particular, here we just assume that an input is some kind of a vector of numbers that we plug into this neural network.

So that's both a bug and a feature to some extent, because in most real world applications we actually can make some assumptions about the input that makes learning much more efficient. So in particular, usually we don't just want to plug into neural networks vectors of numbers, but they actually have some kind of a structure.

So we don't have vectors of numbers, but these numbers are arranged in some kind of a layout, like an n-dimensional array of numbers. So for example, spectrograms are two-dimensional arrays of numbers. Images are three-dimensional arrays of numbers. Videos would be four-dimensional arrays of numbers. Text you could treat as one-dimensional array of numbers.

And so whenever you have this kind of local connectivity structure in your data, then you'd like to take advantage of it, and convolutional neural networks allow you to do that. So before I dive into convolutional neural networks and all the details of the architectures, I'd like to briefly talk about a bit of the history of how this field evolved over time.

So I like to start off usually with talking about Hubel and Wiesel and the experiments that they performed in the 1960s. So what they were doing is trying to study the computations that happened in the early visual cortex areas of a cat. And so they had cats, and they plugged in electrodes that could record from the different neurons.

And then they showed the cat different patterns of light. They were trying to debug neurons, effectively: show them different patterns and see what they responded to. And a lot of these experiments inspired some of the modeling that came afterwards. So in particular, one of the early models that tried to take advantage of some of the results of these experiments was the Neocognitron from Fukushima, from around 1980.

And so what you saw here was this architecture that, again, is layer-wise, similar to what you see in the cortex, where you have these simple and complex cells, where the simple cells detect small things in the visual field. And then you have this local connectivity pattern, and the simple and complex cells alternate in this layered architecture throughout.

And so this looks a bit like a ConvNet, because you have some of its features, like, say, the local connectivity. But at the time, this was not trained with backpropagation. These were specific, heuristically chosen updates. And this was unsupervised learning back then. So the first time that backpropagation was actually used to train some of these networks was in the work of Yann LeCun in the 1990s.

And so this is an example of one of the networks that was developed back then, in the 1990s, by Yann LeCun: LeNet-5. And this is what you would recognize today as a convolutional neural network. So it has these convolutional layers, and they alternate with subsampling. And it's a similar kind of design to what you would see in Fukushima's Neocognitron.

But this was actually trained with backpropagation end-to-end using supervised learning. So this happened in roughly the 1990s. And we're here in 2016, basically about 20 years later. Now computer vision has, for a long time, kind of worked on larger images. And a lot of these models back then were applied to very small kind of settings, like, say, recognizing digits in zip codes and things like that.

And they were very successful in those domains. But back when I entered computer vision, roughly in 2011, a lot of people were aware of these models, but it was thought that they would not scale up naively to large, complex images, and that they would be constrained to these toy tasks for a long time.

Or I shouldn't say toy, because these were very important tasks, but certainly smaller visual recognition problems. And so in computer vision in roughly 2011, it was much more common to use these feature-based approaches, and they actually didn't work that well. So when I started my PhD in 2011 working on computer vision, you would run a state-of-the-art object detector on this image, and you might get something like this, where cars were detected in trees.

And you would kind of just shrug your shoulders and say, well, that just happens sometimes. You kind of just accept it as something that would just happen. And of course, this is a caricature. Things actually worked relatively decently, I should say. But there were definitely many mistakes that you would not see today, in 2016, five years later.

And so a lot of computer vision kind of looked much more like this. When you look into a paper that tried to do image classification, you would find this section in the paper on the features that they used. So this is one page of features. And so they would use GIST, HOG, et cetera, and then a second page of features and all their hyperparameters.

So all kinds of different histograms. And you would extract this kitchen sink of features and a third page here. And so you end up with this very large, complex code base, because some of these feature types are implemented in MATLAB, some of them in Python, some of them in C++.

And you end up with this large code base of extracting all these features, caching them, and then eventually plugging them into linear classifiers to do some kind of visual recognition task. So it was quite unwieldy. But it worked to some extent. But there was definitely room for improvement. And so a lot of this changed in computer vision in 2012 with this paper from Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton.

So this is the first time that someone took a convolutional neural network that is very similar to the one that you saw from Yann LeCun in 1998. And I'll go into details of how they differ exactly. But they took that kind of network. They scaled it up. They made it much bigger.

And they trained it on a much bigger data set on GPUs. And things basically ended up working extremely well. And this is the first time that computer vision community has really noticed these models and adopted them to work on larger images. So we saw that the performance of these models has improved drastically.

Here we are looking at the ImageNet ILSVRC visual recognition challenge over the years. And we're looking at the top-5 error. So low is good. And you can see that from 2010 in the beginning, these were feature-based methods. And then in 2012, we had this huge jump in performance.

And that was due to the first kind of convolutional neural network in 2012. And then we've managed to push that over time. And now we're down to about 3.57%. I think the results for ImageNet Challenge 2016 are actually due to come out today. But I don't think that actually they've come out yet.

I have this second tab here opened. I was waiting for the result. But I don't think this is up yet. Yeah. OK. No. Nothing. All right. Well, we'll get to find out very soon what happens right here. So I'm very excited to see that. Just to put this in context, by the way, because you're just looking at numbers, like 3.57, how good is that?

That's actually really, really good. So something that I did about two years ago now is that I tried to measure the human accuracy on this data set. And so what I did for that is I developed this web interface where I would show myself ImageNet images from the test set.

And then I had this interface here where I would have all the different classes of ImageNet. There's 1,000 of them. And some example images. And then basically, you go down this list and you scroll for a long time and you find what class you think that image might be.

And then I competed against the ConvNet at the time. And this was GoogLeNet in 2014. And so a hot dog is a very simple class. You can do that quite easily. But why is the human error not 0%? Well, some of the classes, like hot dog, seem very easy. Why isn't this trivial for humans?

Well, it turns out that some of the images in a test set of ImageNet are actually mislabeled. But also, some of the images are just very difficult to guess. So in particular, if you have this terrier, there's 50 different types of terriers. And it turns out to be a very difficult task to find exactly which type of terrier that is.

You can spend minutes trying to find it. It turns out that convolutional neural networks are actually extremely good at this. And so this is where I would lose points compared to the ConvNet. So I estimate that human error, based on this, is roughly in the 2% to 5% range, depending on how much time you have, how much expertise you have, how many people you involve, and how much they really want to do this, which is not too much.

And so really, we're doing extremely well. And so we're down to about 3%. And I think the label error rate of the dataset itself, if I remember correctly, was about 1.5%. So if we get below 1.5% on ImageNet, I would be extremely suspicious; that would seem wrong. So to summarize, basically, before 2012, computer vision looked somewhat like this, where we had these feature extractors.

And then we trained a small portion at the end of the feature extraction step. And so we only trained this last piece on top of these features that were fixed. And we've basically replaced the feature extraction step with a single convolutional neural network. And now we train everything completely end to end.

And this turns out to work quite nicely. So I'm going to go into details of how this works in a bit. Also, in terms of code complexity, we went from a setup that looks something like that in papers to something where, instead of extracting all these things, we just say: apply 20 layers of 3 by 3 conv, or something like that.

And things work quite well. This is, of course, an over-exaggeration. But I think it's a correct first order statement to make, is that we've definitely seen that we've reduced code complexity quite a lot, because these architectures are so homogeneous compared to what we've done before. So it's also remarkable that-- so we had this reduction in complexity.

We had this amazing performance on ImageNet. One other thing that was quite amazing about the results in 2012 that is also a separate thing that did not have to be the case is that the features that you learn by training on ImageNet turn out to be quite generic. And you can apply them in different settings.

So in other words, this transfer learning works extremely well. And of course, I didn't go into details of convolutional networks yet. But we start with an image. And we have a sequence of layers, just like in a normal neural network. And at the end, we have a classifier. And when you pre-train this network on ImageNet, then it turns out that the features that you learn in the middle are actually transferable.

And you can use them on different data sets, and this works extremely well. And so that didn't have to be the case. You might imagine that you could have a convolutional network that works extremely well on ImageNet, but when you try to run it on something else, like a birds dataset or something, it might just not work well.

But that is not the case. And that's a very interesting finding, in my opinion. So people noticed this back in roughly 2013, after the first convolutional networks. They noticed that you can actually take many computer vision data sets. And it used to be that you would compete on all of these separately and design features maybe for some of these separately.

And you can just shortcut all those steps that we had designed. And you can just take these pre-trained features that you get from ImageNet. And you can just train a linear classifier on every single data set on top of those features. And you obtain many state-of-the-art results across many different data sets.
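
As a rough illustration, here is what that recipe might look like in Python. This is just a sketch I'm adding, not code from the talk; it assumes a recent Keras with the keras.applications API, scikit-learn, and hypothetical image paths and labels for your own target dataset:

    import numpy as np
    from keras.applications.vgg16 import VGG16, preprocess_input
    from keras.preprocessing import image
    from sklearn.linear_model import LogisticRegression

    # Pre-trained ImageNet features, with the classifier head chopped off.
    feature_net = VGG16(weights='imagenet', include_top=False, pooling='avg')

    def extract_features(paths):
        batch = np.stack([image.img_to_array(image.load_img(p, target_size=(224, 224)))
                          for p in paths])
        return feature_net.predict(preprocess_input(batch))  # one fixed feature vector per image

    # Hypothetical target dataset (e.g. a small birds dataset): just train a linear
    # classifier on top of the frozen ImageNet features.
    train_paths, train_labels = ['bird_001.jpg', 'bird_002.jpg'], [0, 1]
    clf = LogisticRegression().fit(extract_features(train_paths), train_labels)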

And so this was quite a remarkable finding back then, I believe. So things worked very well on ImageNet. Things transferred very well. And the code complexity, of course, got much more manageable. So now all this power is actually available to you with very few lines of code. If you want to just use a convolutional network on images, it turns out to be only a few lines of code.

If you use, for example, Keras, which is one of the deep learning libraries that I'll mention again later in the talk, you basically just load a state-of-the-art convolutional neural network, you take an image, you load it, and you compute your predictions. And it tells you that there is an African elephant inside that image.
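
Roughly, those few lines might look like this (a sketch with a recent Keras, using its bundled pre-trained ResNet50 and a hypothetical elephant.jpg; the exact code on the slide may differ):

    import numpy as np
    from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
    from keras.preprocessing import image

    model = ResNet50(weights='imagenet')              # load a pre-trained state-of-the-art ConvNet
    img = image.load_img('elephant.jpg', target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = model.predict(x)                          # forward pass: 1000 class probabilities
    print(decode_predictions(preds, top=3))           # e.g. 'African_elephant' with high probability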

And this took a couple hundred milliseconds, or a couple dozen milliseconds if you have a GPU. And so everything got much faster, much simpler, works really well, transfers really well. So this was really a huge advance in computer vision. And so as a result of all these nice properties, ConvNets today are everywhere.

So here is a collection of some of the things that I tried to find across different applications. So for example, you can search Google Photos for different types of categories, like in this case Rubik's Cubes. You can find house numbers very efficiently. And of course, this is very relevant in self-driving cars.

And we're doing perception in the cars. Convolutional networks are very relevant there. Medical image diagnosis, recognizing Chinese characters, doing all kinds of medical segmentation tasks. Quite random tasks, like whale recognition and more generally many Kaggle challenges. Satellite image analysis, recognizing different types of galaxies. You may have seen recently WaveNet from DeepMind, also a very interesting paper, where they generate music and they generate speech.

And so this is a generative model. And that's also just a ConvNet doing most of the heavy lifting here. So it's a convolutional network on top of sound. And other tasks, like image captioning. In the context of reinforcement learning and agent-environment interactions, we've also seen a lot of advances using ConvNets as the core computational building block.

So when you want to play Atari games, or you want to play AlphaGo, or Doom, or StarCraft, or if you want to get robots to perform interesting manipulation tasks, all of this uses ConvNets as a core computational block to do very impressive things. Not only are we using it for a lot of different applications, we're also finding uses in art.

So here are some examples from DeepDream. So you can basically simulate what it looks like, what it feels like maybe, to be on some drugs. So you can take images and you can just hallucinate features using ConvNets. Or you might be familiar with neural style, which allows you to take arbitrary images and transfer arbitrary styles of different paintings, like Van Gogh, on top of them.

And this is all using convolutional networks. The last thing I'd like to note that I find also interesting is that in the process of trying to develop better computer vision architectures and trying to basically optimize for performance on the ImageNet challenge, we've actually ended up converging to something that potentially might function something like your visual cortex in some ways.

And so these are some of the experiments that I find interesting, where they've studied macaque monkeys and they record from a subpopulation of the IT cortex. This is the part that does a lot of object recognition. And so they record. So basically, they take a monkey and they take a ConvNet, and they show them images.

And then you look at how those images are represented at the end of this network, so inside the monkey's brain or on top of your convolutional network. And so you look at representations of different images. And it turns out that there's a mapping between those two spaces that actually seems to indicate, to some extent, that some of the things we're doing somehow ended up converging to something that the brain could be doing as well in the visual cortex.

So that's just some intro. I'm now going to dive into convolutional networks and try to explain briefly how these networks work. Of course, there's an entire class on this that I taught, which is a convolutional networks class. And so I'm going to distill some of those 13 lectures into one lecture.

So we'll see how that goes. I won't cover everything, of course. So a convolutional neural network is really just a single function. It's a function from the raw pixels of some kind of an image. So we take a 224 by 224 by 3 image, where the 3 here is for the color channels, RGB.

You take the raw pixels, you put it through this function, and you get 1,000 numbers at the end. In the case of image classification, if you're trying to categorize images into 1,000 different classes. And really, functionally, all that's happening in a convolutional network is just dot products and max operations.

That's everything. They're wired up together in interesting ways so that you are basically doing visual recognition. And in particular, this function f has a lot of knobs in it. So these W's here that participate in these dot products and in these convolutions and fully connected layers and so on, these W's are all parameters of this network.

So normally, you might have about on the order of 10 million parameters. And those are basically knobs that change this function. And so we'd like to change those knobs, of course, so that when you put images through that function, you get probabilities that are consistent with your training data.

And so that gives us a lot to tune. And it turns out that we can do that tuning automatically with backpropagation, which drives that search process. Now, more concretely, a convolutional neural network is made up of a sequence of layers, just as in the case of normal neural networks.

But we have different types of layers that we play with. So we have convolutional layers. Here I'm using the rectified linear unit, ReLU for short, as a non-linearity, and I'm making that explicit as its own layer. Then there are pooling layers and fully connected layers. The core computational building block of a convolutional network, though, is this convolutional layer.

And we have non-linearities interspersed. We are probably getting rid of things like pooling layers. So you might see them slightly going away over time. And fully connected layers can actually be represented-- they're basically equivalent to convolutional layers as well. And so really, it's just a sequence of conv layers in the simplest case.

So let me explain the convolutional layer, because that's the core computational building block here that does all the heavy lifting. So the entire ConvNet is this collection of layers. And these layers don't function over vectors. So they don't transform vectors as in a normal neural network. They function over volumes.

So a layer will take a volume, a three-dimensional volume of numbers, an array. In this case, for example, we have a 32 by 32 by 3 image. So those three dimensions are the width, height, and I'll refer to the third dimension as depth. We have three channels. That's not to be confused with the depth of a network, which is the number of layers in that network.

So this is just the depth of a volume. So this convolutional layer accepts a three-dimensional volume. And it produces a three-dimensional volume using some weights. So the way it actually produces this output volume is as follows. We're going to have these filters in a convolutional layer. So these filters are always small spatially, like, say, for example, 5 by 5 filter.

But their depth extends always through the input depth of the input volume. So since the input volume has three channels, the depth is three, then our filters will always match that number. So we have depth of three in our filters as well. And then we can take those filters, and we can basically convolve them with the input volume.

So what that amounts to is we take this filter; and again, the point is that the channels here must match. We take that filter, and we slide it through all spatial positions of the input volume. And along the way, as we're sliding this filter, we're computing dot products.

So W transpose X plus B, where W are the filters, and X is a small piece of the input volume, and B is the offset. And so this is basically the convolutional operation. You're taking this filter, and you're sliding it through at all spatial positions, and you're computing dot products.

So when you do this, you end up with this activation map. So in this case, we get a 28 by 28 activation map. 28 comes from the fact that there are 28 unique positions to place this 5 by 5 filter into this 32 by 32 space. So there are 28 by 28 unique positions you can place that filter in.

In every one of those, you're going to get a single number of how well that filter likes that part of the input. So that carves out a single activation map. And now in a convolutional layer, we don't just have a single filter, but we're going to have an entire set of filters.

So here's another filter, a green filter. We're going to slide it through the input volume. It has its own parameters. There are 75 numbers, 5 times 5 times 3, that basically make up a filter, and this one has its own, different 75 numbers. We convolve it through, get a new activation map, and we continue doing this for all the filters in that convolutional layer.

So for example, if we had six filters in this convolutional layer, then we might end up with 28 by 28 activation maps six times. And we stack them along the depth dimension to arrive at the output volume of 28 by 28 by 6. And so really what we've done is we've re-represented the original image, which is 32 by 32 by 3, into a kind of a new image that is 28 by 28 by 6, where this image basically has these six channels that tell you how well every filter matches or likes every part of the input image.
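
To make that concrete, here is a minimal numpy sketch of this exact operation (my own illustration, with stride 1 and no padding, matching the 32x32x3 input and six 5x5x3 filters from the example):

    import numpy as np

    def conv_layer(X, W, b):
        """X: (32, 32, 3) input volume, W: (5, 5, 3, 6) filters, b: (6,) biases."""
        H, Wd, _ = X.shape
        F, _, _, n_filters = W.shape
        H_out, W_out = H - F + 1, Wd - F + 1              # 32 - 5 + 1 = 28 unique positions
        out = np.zeros((H_out, W_out, n_filters))
        for y in range(H_out):
            for x in range(W_out):
                patch = X[y:y+F, x:x+F, :]                # a 5x5x3 piece of the input
                for k in range(n_filters):
                    out[y, x, k] = np.sum(patch * W[..., k]) + b[k]   # the dot product w^T x + b
        return out

    out = conv_layer(np.random.randn(32, 32, 3), np.random.randn(5, 5, 3, 6), np.zeros(6))
    print(out.shape)                                      # (28, 28, 6): six stacked activation maps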

So let's compare this operation to, say, using a fully connected layer as you would in a normal neural network. So in particular, we saw that we processed a 32 by 32 by 3 volume into 28 by 28 by 6 volume. But one question you might want to ask is, how many parameters would this require if we wanted a fully connected layer of the same number of output neurons here?

So we wanted 28 times 28 times 6 neurons, fully connected. How many parameters would that be? Turns out that that would be quite a few parameters, right? Because every single neuron in the output volume would be fully connected to all of the 32 by 32 by 3 numbers here.

So basically, every one of those 28 by 28 by 6 neurons is connected to 32 by 32 by 3. Turns out to be about 15 million parameters, and also on that order of number of multiplies. So you're doing a lot of compute, and you're introducing a huge amount of parameters into your network.

Now, since we're doing convolution instead, think about the number of parameters that we've introduced with this example convolutional layer. We had six filters, and every one of them was a 5 by 5 by 3 filter. So basically, we just have 5 by 5 by 3 filters.

We have six of them. If you just multiply that out, we have 450 parameters. And in this, I'm not counting the biases. I'm just counting the raw weights. So compared to 15 million, we've only introduced very few parameters. Also, how many multiplies have we done? So computationally, how many flops are we doing?

Well, we have 28 by 28 by 6 outputs to produce. And every one of these numbers is a function of a 5 by 5 by 3 region in the original image. So basically, we have 28 by 28 by 6. And then every one of them is computed by doing 5 times 5 times 3 multiplies.

So you end up with only on the order of 350,000 multiplies. So we've gone from roughly 15 million down to a few hundred thousand. We're doing fewer flops, and we're using far fewer parameters.
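
The arithmetic, spelled out (my numbers, matching the example above):

    fc_params   = (28 * 28 * 6) * (32 * 32 * 3)   # 14,450,688: every output neuron sees every input number
    conv_params = 6 * (5 * 5 * 3)                 # 450 shared weights (ignoring biases)
    conv_mults  = (28 * 28 * 6) * (5 * 5 * 3)     # 352,800 multiplies to fill the 28x28x6 output
    print(fc_params, conv_params, conv_mults)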

And really, what we've done here is we've made assumptions. A fully connected layer could compute the exact same thing: a specific setting of those 15 million parameters would actually produce the exact output of this convolutional layer. But we've done it much more efficiently, by building in these biases. In particular, we've assumed that since we have these fixed filters that we're sliding across space, if there's some interesting feature that you'd like to detect in one part of the image, like, say, the top left, then that feature will also be useful somewhere else, like on the bottom right, because we fix these filters and apply them at all spatial positions equally.

You might notice that this is not always something that you might want. For example, if you're getting inputs that are centered face images, and you're doing some kind of a face recognition or something like that, then you might expect that you might want different filters at different spatial positions.

Like say, for eye regions, you might want to have some eye-like filters. And for mouth region, you might want to have mouth-specific features and so on. And so in that case, you might not want to use convolutional layer, because those features have to be shared across all spatial positions.

And the second assumption that we made is that these filters are small locally. And so we don't have global connectivity. We have this local connectivity. But that's OK, because we end up stacking up these convolutional layers in sequence. And so the neurons at the end of the ConvNet will grow their receptive field as you stack these convolutional layers on top of each other.

So at the end of the ConvNet, those neurons end up being a function of the entire image eventually. So just to give you an idea about what these activation maps look like concretely, here's an example of an image on the top left. This is a part of a car, I believe.

And we have 32 different small filters here. And so if we were to convolve these filters with this image, we end up with these activation maps. So this filter, if you convolve it, you get this activation map, and so on. So this one, for example, has some orange stuff in it.

So when we convolve with this image, you see that this white here is denoting the fact that that filter matches that part of the image quite well. And so we get these activation maps. You stack them up. And then that goes into the next convolutional layer. So the way this looks like then is that we've processed this with some kind of a convolutional layer.

We get some output. We apply a rectified linear unit, some kind of a non-linearity as normal. And then we would just repeat that operation. So we keep plugging these conv volumes into the next convolutional layer. And so they plug into each other in sequence. And so we end up processing the image over time.

So that's the convolutional layer. You'll notice that there are a few more layers. So in particular, the pooling layer I'll explain very briefly. Pooling layer is quite simple. If you've used Photoshop or something like that, you've taken a large image and you've resized it, you've down sampled the image, well, pooling layers do basically something exactly like that.

But they're doing it on every single channel independently. So for every one of these channels in an input volume, independently, we'll pluck out that activation map, we'll downsample it, and that becomes a channel in the output volume. So it's really just a downsampling operation on these volumes.

So for example, one of the common ways of doing this, in the context of neural networks especially, is to use the max pooling operation. So in this case, it would be common to, say, use 2 by 2 filters at stride 2 and do a max operation. So if this is an input channel in a volume, what that amounts to is we're chopping it up into these 2 by 2 regions.

And we're taking a max over 4 numbers to produce one piece of the output. So this is a very cheap operation that down samples your volumes. It's really a way to control the capacity of the network. So you don't want too many numbers. You don't want things to be too computationally expensive.
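
Here is what 2 by 2 max pooling with stride 2 boils down to in numpy (a sketch of my own, assuming an even height and width):

    import numpy as np

    def max_pool_2x2(X):
        """Downsample a (H, W, C) volume by taking the max over each 2x2 region, per channel."""
        H, W, C = X.shape
        return X.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

    print(max_pool_2x2(np.random.randn(28, 28, 6)).shape)   # (14, 14, 6), and it has zero parameters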

It turns out that a pooling layer allows you to down sample your volumes. You're going to end up doing less computation. And it turns out to not hurt the performance too much. So we use them basically as a way of controlling the capacity of these networks. And the last layer that I want to briefly mention, of course, is the fully connected layer, which is exactly what you're familiar with.

So we have these volumes throughout as we've processed the image. At the end, you're left with this volume. And now you'd like to predict some classes. So what we do is we just take that volume, we stretch it out into a single column, and then we apply a fully connected layer, which really amounts to just a matrix multiplication. And then that gives us probabilities, after applying a softmax or something like that.
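
In code, that last step is nothing more than this (again my own sketch, with made-up sizes like a 7x7x512 final volume and 1000 classes):

    import numpy as np

    x = np.random.randn(7, 7, 512).reshape(-1)     # stretch the final volume into one long column
    W = np.random.randn(1000, x.size) * 0.01       # one row of weights per class
    b = np.zeros(1000)

    scores = W.dot(x) + b                          # the fully connected layer: a matrix multiplication
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                           # softmax: 1000 class probabilities that sum to 1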

So let me now show you briefly a demo of what a convolutional network looks like. This is ConvNetJS, a deep learning library for training convolutional neural networks that is implemented in JavaScript. I wrote this maybe two years ago at this point.

So here what we're doing is we're training a convolutional network on the CIFAR-10 dataset. CIFAR-10 is a dataset of 50,000 training images. Each image is 32 by 32 by 3. And there are 10 different classes. So here we are training this network in the browser. And you can see that the loss is decreasing, which means that we're better classifying these inputs.

And so here's the network specification, which you can play with because this is all done in the browser. So you can just change this and play with this. So this is an input image. And this convolutional network I'm showing here, all the intermediate activations and all the intermediate, basically, activation maps that we're producing.

So here we have a set of filters. We're convolving them with the image and getting all these activation maps. I'm also showing the gradients, but I don't want to dwell on that too much. Then you threshold with the ReLU, so anything below 0 gets clamped at 0. And then you pool.

So this is just a downsampling operation. And then another convolution, ReLU pool, conv, ReLU pool, et cetera, until at the end we have a fully connected layer. And then we have our softmax so that we get probabilities out. And then we apply a loss to those probabilities and backpropagate.

And so here we see that I've been training in this tab for the last maybe 30 seconds or one minute. And we're already getting about 30% accuracy on CIFAR-10. So these are test images from CIFAR-10. And these are the outputs of this convolutional network. And you can see that it has already learned that this is a car, or something like that.

So this trains pretty quickly in JavaScript. So you can play with this, and you can change the architecture and so on. Another thing I'd like to show you is this video, because it gives you, again, this very intuitive, visceral feeling of exactly what this is computing. There is a very good video by Jason Yosinski that I'm going to play in a bit.

This is from the Deep Visualization Toolbox. So you can download this code and you can play with this. It's this interactive convolutional network demo. - ...neural networks have enabled computers to better see and understand the world. They can recognize school buses and... - ...top left corner, we show the... - I'm going to skip a bit.

So what we're seeing here is these are activation maps in some particular-- shown in real time as this demo is running. So these are for the conv1 layer of an AlexNet, which we're going to go into in much more detail. But these are the different activation maps that are being produced at this point.

- ...neural network called AlexNet running in Caffe. By interacting with the network, we can see what some of the neurons are doing. For example, on this first layer, a unit in the center responds strongly to light-to-dark edges. This neighbor, one neuron over, responds to edges in the opposite direction, dark to light.

Using optimization, we can synthetically produce images that light up each neuron on this layer to see what each neuron is looking for. We can scroll through every layer in the network to see what it does, including convolution, pooling, and normalization layers. We can switch back and forth between showing the actual activations and showing images synthesized to produce high activation.

By the time we get to the fifth convolutional layer, the features being computed represent abstract concepts. For example, this neuron seems to respond to faces. We can further investigate this neuron by showing a few different types of information. First we can artificially create optimized images using new regularization techniques that are described in our paper.

These synthetic images show that this neuron fires in response to a face and shoulders. We can also plot the images from the training set that activate this neuron the most, as well as pixels from those images most responsible for the high activations, computed via the deconvolution technique. This feature responds to multiple faces in different locations.

And by looking at the deconv, we can see that it would respond more strongly if we had even darker eyes and rosier lips. We can also confirm that it cares about the head and shoulders, but ignores the arms and torso. We can even see that it fires to some extent for cat faces.

Using backprop or deconv, we can see that this unit depends most strongly on a couple units in the previous layer, conv4, and on about a dozen or so in conv3. Now let's look at another neuron on this layer. So what's this unit doing? From the top nine images, we might conclude that it fires for different types of clothing.

But examining the synthetic images shows that it may be detecting not clothing per se, but wrinkles. In the live plot, we can see that it's activated by my shirt. And smoothing out half of my shirt causes that half of the activations to decrease. Finally, here's another interesting neuron. This one has learned to look for printed text in a variety of sizes, colors, and fonts.

This is pretty cool, because we never ask the network to look for wrinkles or text or faces. But the only labels we provided were at the very last layer. So the only reason the network learned features like text and faces in the middle was to support final decisions at that last layer.

For example, the text detector may provide good evidence that a rectangle is in fact a book seen on edge. And detecting many books next to each other might be a good way of detecting a bookcase, which was one of the categories we trained the net to recognize. In this video, we've shown some of the features of the DeepViz toolbox.

So I encourage you to play with that. It's really fun. So I hope that gives you an idea about exactly what's going on. There are these convolutional layers. We down sample them from time to time. There's usually some fully connected layers at the end. But mostly it's just these convolutional operations stacked on top of each other.

So what I'd like to do now is I'll dive into some details of how these architectures are actually put together. The way I'll do this is I'll go over all the winners of the ImageNet challenges, and I'll tell you about the architectures, how they came about, how they differ.

And so you'll get a concrete idea about what these architectures look like in practice. So we'll start off with the AlexNet in 2012. So the AlexNet, just to give you an idea about the sizes of these networks and the images that they process, it took 227 by 227 by 3 images.

And the first layer of an AlexNet, for example, was a convolutional layer that had 11 by 11 filters applied with a stride of 4. And there are 96 of them. Stride of 4 I didn't fully explain because I wanted to save some time. But intuitively, it just means that as you're sliding this filter across the input, you don't have to slide it one pixel at a time, but you can actually jump a few pixels at a time.

So we have 11 by 11 filters with a stride, a skip of 4. And we have 96 of them. You can try to compute, for example, what is the output volume if you apply this sort of convolutional layer on top of this volume. And I didn't go into details of how you compute that.

But basically, there are formulas for this, and you can look into details in the class. But you arrive at 55 by 55 by 96 volume as output. The total number of parameters in this layer, we have 96 filters. Every one of them is 11 by 11 by 3 because that's the input depth of these images.

So basically, it just amounts to 11 times 11 times 3. And then you have 96 filters, so about 35,000 parameters in this very first layer. Then the second layer of an AlexNet is a pooling layer. So we apply 3 by 3 filters at stride of 2, and they do max pooling.

So you can, again, compute the output volume size of that after applying this to that volume. And you arrive, if you do some very simple arithmetic there, you arrive at 27 by 27 by 96. So this is the downsampling operation. You can think about what is the number of parameters in this pooling layer.

And of course, it's 0. So pooling layers compute a fixed function, a fixed downsampling operation. There are no parameters involved in a pooling layer. All the parameters are in the convolutional layers and the fully connected layers, which are, to some extent, equivalent to convolutional layers. So you can go ahead and, based on the description in the paper (although it's non-trivial, I think, for this particular paper), decipher what the volumes are throughout.
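
For reference, the formula for these output sizes is (W - F + 2P) / S + 1, where W is the input width, F the filter size, S the stride, and P the zero padding. A quick check of the numbers above (my own helper, not code from the paper):

    def output_size(W, F, S, P=0):
        return (W - F + 2 * P) // S + 1

    print(output_size(227, 11, 4))   # 55 -> CONV1 output is 55x55x96
    print(96 * 11 * 11 * 3)          # 34848 -> roughly 35K parameters in CONV1
    print(output_size(55, 3, 2))     # 27 -> POOL1 output is 27x27x96, with zero parameters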

You can look at the kind of patterns that emerge in terms of how you actually increase the number of filters in higher convolutional layers. So we started off with 96. Then we go to 256 filters. Then to 384. And eventually, 4,096 units of fully connected layers. You'll see also normalization layers here, which have since become slightly deprecated.

It's not very common to use the normalization layers that were used at the time for the AlexNet architecture. What's interesting to note is how this differs from Yann LeCun's 1998 network. So in particular, I usually like to think about four things that hold back progress, at least in deep learning.

So data is a constraint, and compute is another. And then I like to differentiate between algorithms and infrastructure, algorithms being something that feels like research and infrastructure being something that feels like a lot of engineering has to happen. And so in particular, we've had progress on all four of those fronts. So we see that in 1998, the data you could get a hold of would maybe be on the order of a few thousand examples, whereas now we have a few million.

So we have a three-orders-of-magnitude increase in the amount of data. For compute, GPUs have become available, and we use them to train these networks. They are, say, roughly 20 times faster than CPUs. And then, of course, the CPUs we have today are much, much faster than the CPUs they had back in 1998.

So I don't know exactly what that works out to, but I wouldn't be surprised if it's again on the order of three orders of magnitude of improvement. I'd like to actually skip over the algorithms and talk about infrastructure first. So in this case, we're talking about NVIDIA releasing the CUDA library, which allows you to efficiently run all these matrix-vector operations and apply them to arrays of numbers.

So that's a piece of software that we rely on and that we take advantage of that wasn't available before. And finally, algorithms is kind of an interesting one, because in those 20 years, there's been much less improvement in algorithms than all these other three pieces. So in particular, what we've done with the 1998 network is we've made it bigger.

So you have more channels. You have more layers by a bit. And the two really new things algorithmically are dropout and rectified linear units. So dropout is a regularization technique developed by Geoff Hinton and colleagues. And rectified linear units are these nonlinearities that train much faster than sigmoids and tanhs.

And this paper actually had a plot that showed that the rectified linear units trained a bit faster than sigmoids. And that's intuitively because of the vanishing gradient problems. And when you have very deep networks with sigmoids, those gradients vanish, as Hugo was talking about in the last lecture. So what's interesting also to note, by the way, is that both dropout and ReLU are basically like one line or two lines of code to change.

So it's about a two-line diff total in those 20 years. And both of them consist of setting things to zero. With the ReLU, you set things to zero when they're lower than zero. And with dropout, you set things to zero at random. So it's a good idea to set things to zero; apparently, that's what we've learned.
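
Both diffs, roughly, in numpy (a sketch of my own, using inverted dropout with a keep probability of 0.5):

    import numpy as np

    x = np.random.randn(100)
    W, b = 0.01 * np.random.randn(50, 100), np.zeros(50)

    h = np.maximum(0, W.dot(x) + b)               # ReLU: set things to zero when they are below zero
    p = 0.5                                       # keep probability
    mask = (np.random.rand(*h.shape) < p) / p     # dropout: set things to zero at random (train time only)
    h = h * mask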

So if you try to find a new cool algorithm, look for one-line diffs that set something to zero. It will probably work better, and we could add you to this list. Now, comparing again and giving you an idea of the hyperparameters that were used in this architecture:

It was the first use of rectified linear units. We haven't seen that as much before. This network used the normalization layers, which are not used anymore, at least in the specific way that they use them in this paper. They used heavy data augmentation. So you don't only pipe these images into the networks exactly as they come from the data set, but you jitter them spatially around a bit.

And you warp them, and you change the colors a bit, and you just do this randomly because you're trying to build in some invariances to these small perturbations. And you're basically hallucinating additional data. It was the first real use of dropout. And roughly, you see standard hyperparameters, like say batch sizes of roughly 128, using stochastic gradient descent with momentum, usually 0.9.

Learning rates of 1e-2, which you reduce in the normal way: roughly by a factor of 10 whenever the validation error stops improving. A weight decay of just a bit, 5e-4. And ensembling always helps: you train seven independent convolutional networks separately, and then you just average their predictions, which always gives you an additional 2% or so of improvement.
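
As a sketch of what that update rule looks like (my own toy code, not from the paper):

    import numpy as np

    lr, mu, wd = 1e-2, 0.9, 5e-4                 # learning rate, momentum, weight decay
    w = 0.01 * np.random.randn(1000)
    v = np.zeros_like(w)

    def sgd_momentum_step(w, v, grad):
        g = grad + wd * w                        # L2 weight decay folded into the gradient
        v = mu * v - lr * g                      # momentum update
        return w + v, v

    w, v = sgd_momentum_step(w, v, np.random.randn(1000))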

So this is AlexNet, the winner of 2012. In 2013, the winner was the ZFNet, developed by Matthew Zeiler and Rob Fergus. This was an improvement on top of the AlexNet architecture. In particular, one of the bigger differences was that in the first convolutional layer, they went from 11 by 11 with stride 4 to 7 by 7 with stride 2.

So you have slightly smaller filters, and you apply them more densely. And then also, they noticed that these convolutional layers in the middle, if you make them larger, if you scale them up, then you actually gain performance. So they managed to improve a tiny bit. Matthew Zeiler then went on to become the founder of Clarifai.

And he worked on this a bit more inside Clarifai, and he managed to push the error down to 11%, which was the winning entry at the time. But we don't actually know what gets you from 14% to 11%, because Matthew never disclosed the full details of what happened there. But he did say that it was more tweaking of these hyperparameters and optimizing things a bit.

So that was the 2013 winner. In 2014, we saw a slightly bigger diff to this. One of the networks that was introduced then was the VGG net from Karen Simonyan and Andrew Zisserman. They explored a few architectures here, and the one that ended up working best was this D column, which is why I'm highlighting it.

What's beautiful about the VGG net is that it's so simple. So you might have noticed in these previous networks, you have these different filter sizes, different layers, and you do different amount of strides, and everything kind of looks a bit hairy, and you're not sure where these hyperparameters are coming from.

VGG net is extremely uniform. All you do is 3 by 3 convolutions with stride 1, pad 1, and you do 2 by 2 max poolings with stride 2. And you do this throughout completely homogeneous architecture, and you just alternate a few conv and a few pool layers, and you get top performance.

So they managed to reduce the error down to 7.3% in the VGG net, just with a very simple and homogeneous architecture. So I've also here written out this D architecture. So you can see-- I'm not sure how instructive this is, because it's kind of dense. But you can definitely see, and you can look at this offline perhaps, but you can see how these volumes develop, and you can see the kinds of sizes of these filters.

So they're always 3 by 3, but the number of filters, again, grows. So we started off with 64, and then we go to 128, 256, 512. So we're just doubling it over time. I also have a few numbers here, just to give you an idea of the scale at which these networks normally operate.

So we have on the order of 140 million parameters. This is actually quite a lot. I'll show you in a bit that this can be about 5 or 10 million parameters, and it works just as well. And it's about 100 megabytes per image, in terms of memory, in the forward pass.

And then the backward pass also needs roughly that much. So those are roughly the numbers that we're working with here. Also note that, and this is true generally in convolutional networks, most of the memory is in the early convolutional layers, while most of the parameters, at least when you use these giant fully connected layers at the top, are in those fully connected layers at the end.
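
To show just how uniform this is, here is a sketch of the D column in a recent Keras API (my own transcription of the pattern, not the authors' code):

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential()
    model.add(Conv2D(64, (3, 3), padding='same', activation='relu', input_shape=(224, 224, 3)))
    # Every conv below is also 3x3, stride 1, pad 1; every pool is 2x2 with stride 2.
    for n_convs, n_filters in [(1, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
        for _ in range(n_convs):
            model.add(Conv2D(n_filters, (3, 3), padding='same', activation='relu'))
        model.add(MaxPooling2D((2, 2), strides=(2, 2)))
    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dense(4096, activation='relu'))
    model.add(Dense(1000, activation='softmax'))
    print(model.count_params())   # about 138 million parameters, most of them in the Dense layers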

So the winner, actually, in 2014 was not the VGG net. I only presented it because it's such a simple architecture. The winner was actually GoogLeNet, with a slightly hairier architecture, we should say. So it's still a sequence of things. But in this case, they've put inception modules in sequence.

And this is an example inception module. I don't have too much time to go into the details, but you can see that it consists basically of convolutions and different kinds of strides and so on. So GoogLeNet looks slightly hairier, but it turns out to be more efficient in several respects.

So for example, it works a bit better than the VGG net, at least at the time. It only has 5 million parameters, compared to the VGG net's 140 million parameters, so a huge reduction. And you do that, by the way, by just throwing away the fully connected layers. So you'll notice in this breakdown I did, these fully connected layers here have 100 million parameters and 16 million parameters.

Turns out you don't actually need that. So if you take them away, that actually doesn't hurt performance too much, and you get a huge reduction in parameters. We can also compare to the original AlexNet: compared to the original AlexNet, we have fewer parameters, a bit more compute, and much better performance.

So GoogLeNet was really optimized to have a low footprint, memory-wise, computation-wise, and parameter-wise. But it looks a bit uglier. The VGG net is a very beautiful, homogeneous architecture, but there are some inefficiencies in it. Okay. So that's 2014. Now, in 2015, we had a slightly bigger delta on top of these architectures.

So up to now, these architectures, if Yann LeCun looked at them back in 1998, he would still recognize everything. Everything looks very simple; you've just played with hyperparameters. So one of the first bigger departures, I would argue, was in 2015, with the introduction of residual networks. And this is work from Kaiming He and colleagues at Microsoft Research Asia.

And so they did not only win the ImageNet Challenge in 2015, but they won a whole bunch of challenges. And this was all just by applying these residual networks that were trained on ImageNet and then fine-tuned on all these different tasks. And you basically can crush lots of different tasks whenever you get a new awesome ConvNet.

So at this time, the performance was basically 3.57% from these residual networks. So this is 2015. So this paper tried to argue that if you look at the number of layers, it goes up. And then they made the point that with residual networks, as we'll see in a bit, you can introduce many more layers and that that correlates strongly with performance.

We've since found that, in fact, you can make these residual networks quite a lot shallower, like, say, on the order of 20 or 30 layers, and they work just as well. So it's not necessarily the depth here, but I'll go into that in a bit. But you get much better performance.

What's interesting about this paper is this plot here, where they compare these residual networks-- and I'll go into details of how they work in a bit-- to what they call plain networks, which is everything I've explained until now. And the problem with plain networks is that when you try to scale them up and introduce additional layers, they don't get monotonically better.

So if you take a 20-layer model-- and these are CIFAR-10 experiments-- if you take a 20-layer model and you run it, and then you take a 56-layer model, you'll see that the 56-layer model performs worse. And this is not just on the test data, so it's not just an overfitting issue.

This is on the training data. The 56-layer model performs worse on the training data than the 20-layer model, even though the 56-layer model could imitate the 20-layer model by setting 36 of its layers to compute identities. So basically, it's an optimization problem: you can't find that solution once your problem size grows that much bigger in this plain net architecture.

So in the residual networks that they proposed, they found that when you wire them up in a slightly different way, you monotonically get better performance as you add more layers. So more layers, always strictly better, and you don't run into these optimization issues. So comparing residual networks to plain networks: in plain networks, as I've explained already, you have this sequence of convolutional layers, where every convolutional layer operates over the volume before it and produces a new volume.

In residual networks, we have this first convolutional layer on top of the raw image. And there's a pooling layer. So at this point, we've reduced to 56 by 56 by 64, the original image. And then from here on, they have these residual blocks with these funny skip connections. And this turns out to be quite important.

So let me show you what these look like. The original Kaiming He paper had the architecture shown here under 'original.' So on the left, you see the original residual network design. Since then, they had an additional paper that played with the architecture and found that there's an arrangement of layers inside this block that works better empirically.

And so the way this works-- so concentrate on the proposed one in the middle, since that works so well-- is you have this pathway where you have this representation of the image x. And then instead of transforming that representation x to get a new x to plug in later, we end up having this x.

We go off, and we do some compute on the side. So that's that residual block doing some computation. And then you add your result on top of x. So you have this addition operation here going to the next residual block. So you have this x, and you always compute deltas to it.
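
Here is roughly what one of these residual blocks looks like in code (a sketch in a recent Keras of the pre-activation arrangement, my own approximation rather than the exact code from the paper):

    from keras.layers import Input, Conv2D, BatchNormalization, Activation, Add
    from keras.models import Model

    def residual_block(x, channels):
        # Go off and do some compute on the side...
        delta = Activation('relu')(BatchNormalization()(x))
        delta = Conv2D(channels, (3, 3), padding='same')(delta)
        delta = Activation('relu')(BatchNormalization()(delta))
        delta = Conv2D(channels, (3, 3), padding='same')(delta)
        # ...and then add the result back on top of the stream.
        return Add()([x, delta])

    x = Input((56, 56, 64))        # the stream after the first conv and pool
    y = residual_block(residual_block(x, 64), 64)
    model = Model(x, y)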

And I think it's not intuitive that this should work much better or why that works much better. I think it becomes a bit more intuitively clear if you actually understand the backpropagation dynamics and how backprop works. And this is why I always urge people also to implement backprop themselves to get an intuition for how it works, what it's computing, and so on.

Because if you understand backprop, you'll see that addition operation is a gradient distributor. So you get a gradient from the top, and this gradient will flow equally to all the children that participated in that addition. So you have gradient flowing here from the supervision. So you have supervision at the very bottom here in this diagram.

And it kind of flows upwards. And it flows through these residual blocks and then gets added to the stream. But this addition distributes that gradient always identically through. So what you end up with is this kind of a gradient superhighway, as I like to call it, where these gradients from your supervision go directly to the original convolutional layer.

And on top of that, you get these deltas from all the residual blocks. So these blocks can come on online and can help out that original stream of information. This is also related to, I think, why LSTMs, long short-term memory networks, work better than recurrent neural networks, because they also have these kind of addition operations in the LSTM.

And it just makes the gradients flow significantly better. Then there were some results on top of residual networks that I thought were quite amusing. So recently, for example, we had this result on deep networks with stochastic depth. The idea here was that the authors of this paper noticed that you have these residual blocks that compute deltas on top of your stream.

And you can basically randomly throw out layers. So you have these, say, 100 blocks, 100 residual blocks. And you can randomly drop them out. And at test time, similar to dropout, you introduce all of them. And they all work at the same time. But you have to scale things a bit, just like with dropout.
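
The forward pass with stochastic depth might look something like this (a toy numpy sketch of the idea, with a made-up survival probability and stand-in residual blocks):

    import numpy as np

    def stochastic_depth_forward(x, blocks, survival_p=0.8, train=True):
        """blocks is a list of functions, each computing a residual delta F(x)."""
        for F in blocks:
            if train:
                if np.random.rand() < survival_p:    # keep this block with probability survival_p...
                    x = x + F(x)                     # ...otherwise skip it entirely (identity)
            else:
                x = x + survival_p * F(x)            # at test time use every block, scaled down
        return x

    blocks = [lambda x, s=s: np.tanh(s * x) for s in np.linspace(0.1, 1.0, 10)]  # stand-in blocks
    out = stochastic_depth_forward(np.random.randn(64), blocks)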

But basically, it's kind of an unintuitive result, because you can throw out layers at random. And I think it breaks the original notion of what we had of ConvNets as these feature transformers that compute more and more complex features over time or something like that. And I think it seems much more intuitive to think about these residual networks, at least to me, as some kinds of dynamical systems, where you have this original representation of the image x.

And then every single residual block is kind of like a vector field, because it computes a delta on top of your signal. And so these vector fields nudge your original representation x towards a space where you can decode the answer y, the class of that x. And so if you drop some of these residual blocks at random, then if you haven't applied one of these vector fields, the other vector fields that come later can kind of make up for it.

They basically pick up the slack and nudge it along anyway. And so that's the image I currently have in mind of how these things work: much more like dynamical systems. In fact, another experiment that people are playing with, which I also find interesting, is that you can share these residual blocks.

So it starts to look more like a recurrent neural network. So these residual blocks would have shared connectivity. And then you have this dynamical system, really, where you're just running a single RNN, a single vector field that you keep iterating over and over. And then your fixed point gives you the answer.

So it's kind of interesting what's happening. It looks very funny. We've had many more interesting results. So people are playing a lot with these residual networks and improving on them in various ways. So as I mentioned already, it turns out that you can make these residual networks much shallower and make them wider.

So you introduce more channels. And that can work just as well, if not better. So it's not necessarily the depth that is giving you a lot of the performance. You can scale down the depth. And if you increase the width, that can actually work better. And they're also more efficient if you do it that way.

There are more funny regularization techniques. Swapout, for example, is a funny regularization technique that actually interpolates between plain nets, ResNets, and dropout. So that's also a funny paper. We have FractalNets. We actually have many more different types of nets. And so people have really experimented with this a lot.

I'm really eager to see what the winning architecture will be in 2016 as a result of a lot of this. One of the things that has really enabled this rapid experimentation in the community is that somehow we've developed, luckily, this culture of sharing a lot of code among ourselves.

So for example, Facebook has released-- just as an example-- Facebook has released residual networks code in Torch that is really good that a lot of these papers, I believe, have adopted and worked on top of and that allowed them to actually really scale up their experiments and explore different architectures.

So it's great that this has happened. Unfortunately, a lot of these papers are coming out on arXiv. And it's kind of a chaos as these are being uploaded. So at this point, I think this is a natural point to plug very briefly my arxiv-sanity.com. So this is the best website ever.

And what it does is it crawls arXiv. And it takes all the papers. And it analyzes the full text of all the papers and creates TF-IDF bag-of-words features for them. And then you can do things like search for a particular paper, like the residual networks paper here.

And you can look for similar papers on arXiv. And so this is a sorted list of basically all the residual networks papers that are most related to that paper. Or you can also create a user account. And you can create a library of papers that you like. And then Arxiv Sanity will train a support vector machine for you.

And basically, you can look at which arXiv papers over the last month you would enjoy the most. And that's just computed by Arxiv Sanity. And so it's like a curated feed specifically for you. So I use this quite a bit. And I find it useful. So I hope that other people do as well.
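For what it's worth, the kind of pipeline being described can be sketched in a few lines of scikit-learn (the tiny corpus and the library labels here are invented for illustration; this is not the actual Arxiv Sanity code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

papers = [
    "deep residual learning for image recognition",
    "identity mappings in deep residual networks",
    "a field study of bird migration patterns",
]
saved = [1, 1, 0]  # 1 = this user added the paper to their library

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(papers)   # TF-IDF bag-of-words features per paper
ranker = LinearSVC().fit(X, saved)     # a per-user SVM over those features
scores = ranker.decision_function(X)   # higher score = more likely to be enjoyed
```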

OK. So we saw convolutional neural networks. I explained how they work. I explained some of the background context. I've given you an idea of what they look like in practice. And we went through case studies of the winning architectures over time. But so far, we've only looked at image classification specifically.

So we're categorizing images into some number of bins. So I'd like to briefly talk about addressing other tasks in computer vision and how you might go about doing that. So the way to think about doing other tasks in computer vision is that really what we have is you can think of this convolutional neural network as this block of compute that has a few million parameters in it.

And it can implement basically arbitrary functions that are very nice over images. And so it takes an image and gives you some kind of features. And now different tasks will basically look as follows: you want to predict some kind of a thing, and in different tasks that will be different things.

And you always have a desired thing. And then you want to make the predicted thing much closer to the desired thing. And you backpropagate. So this is the only part, usually, that changes from task to task. You'll see that these ConvNets don't change too much. What changes is your loss function at the very end.

And that's what actually helps you really transfer a lot of these winning architectures. You usually use these pre-trained networks. And you don't worry too much about the details of that architecture. Because you're only worried about adding a small piece at the top or changing the loss function or substituting a new data set and so on.

So just to make this slightly more concrete, in image classification, we apply this compute block. We get these features. And then if I want to do classification, I would basically predict 1,000 numbers that give me the log probabilities of different classes. And then I have a predicted thing, a desired thing, particular class.

And I can back prop. If I'm doing image captioning, it also looks very similar. Instead of predicting just a vector of 1,000 numbers, I now have, for example, 10,000 words in some kind of vocabulary. And I'd be predicting 10,000 numbers and a sequence of them. And so I can use a recurrent neural network, which you will hear much more about, I think, in Richard's lecture just after this.

And so I produce a sequence of 10,000 dimensional vectors. And that's just a description. And they indicate the probabilities of different words to be emitted at different time steps. Or for example, if you want to do localization, again, most of the block stays unchanged. But now we also want some kind of an extent in the image.

So suppose we don't only just want to classify this as an airplane, but we want to localize it with x, y, width, height bounding box coordinates. And if we also make the specific assumption that there's always a single thing in the image, like a single airplane in every image, then you can just afford to predict that directly.

So we predict these softmax scores, just like before, and apply the cross-entropy loss. And then we can predict x, y, width, height on top of that. And we use an L2 loss or a Huber loss or something like that. So you just have a predicted thing, a desired thing, and you just backprop.
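As a concrete sketch of the shared-compute-block-plus-heads idea for localization (the 7x7x512 feature shape, the 1,000 classes, and the choice of a Huber loss for the box are illustrative assumptions, not the exact setup from the talk):

```python
from tensorflow import keras
from tensorflow.keras import layers

features = keras.Input(shape=(7, 7, 512))           # output of the pre-trained ConvNet
x = layers.GlobalAveragePooling2D()(features)
class_scores = layers.Dense(1000, name="cls")(x)     # logits over 1,000 classes
box = layers.Dense(4, name="box")(x)                 # x, y, width, height

model = keras.Model(features, [class_scores, box])
model.compile(
    optimizer="adam",
    loss={"cls": keras.losses.SparseCategoricalCrossentropy(from_logits=True),
          "box": keras.losses.Huber()},
)
```

Both losses backpropagate into the same shared features, which is exactly the predicted-thing-versus-desired-thing recipe.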

If you want to do reinforcement learning because you want to play different games, then again, the setup is you just predict some different thing. And it has some different semantics. So in this case, we would be, for example, predicting eight numbers that give us the probabilities of taking different actions.

For example, there are eight discrete actions in Atari. And we just predict eight numbers, and then we train this in a slightly different manner. Because in the case of reinforcement learning, you don't actually know what the correct action is to take at any point in time. But you can still get a desired thing eventually, because you just run these rollouts over time, and you just see what happens.

And then that helps inform exactly what the correct answer should have been or what the desired thing should have been in any one of those rollouts in any point in time. I don't want to dwell on this too much in this lecture, though. It's outside of the scope. You'll hear much more about reinforcement learning in a later lecture.

If you wanted to do segmentation, for example, then you don't want to predict a single vector of numbers for a single image. But every single pixel has its own category that you'd like to predict. So a data set will actually be colored like this, and you have different classes, different areas.

And then instead of predicting a single vector of classes, you predict an entire array of 224 by 224, since that's the extent of the original image, for example, times 20 if you have 20 different classes. And then you basically have 224 by 224 independent softmaxes here. That's one way you could pose this.

And then you backpropagate. This here would be slightly more difficult, because you see I have deconv layers mentioned here. And I didn't explain deconvolutional layers. They're related to convolutional layers. They do a very similar operation, but kind of backwards in some way. So a convolutional layer kind of does these downsampling operations as it computes.

A deconv layer does these kinds of upsampling operations as it computes these convolutions. But in fact, you can implement a deconv layer using a conv layer. So the deconv forward pass is the conv layer backward pass, and the deconv backward pass is the conv layer forward pass, basically.

So they're basically the same operation; it's just a question of whether you're upsampling or downsampling. So you can use deconv layers, or you can use hypercolumns. And there are different things that people do in the segmentation literature. But that's just a rough idea, as you're just changing the loss function at the end.
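Here is a rough sketch of such a segmentation head (my own simplification: a stack of strided transposed "deconv" layers upsamples a 7x7x512 feature map back to 224x224, and every pixel gets an independent 20-way softmax; real systems add skip connections and other refinements):

```python
from tensorflow import keras
from tensorflow.keras import layers

features = keras.Input(shape=(7, 7, 512))
x = features
for filters in [256, 128, 64, 32, 32]:               # 7 -> 14 -> 28 -> 56 -> 112 -> 224
    x = layers.Conv2DTranspose(filters, 3, strides=2,
                               padding="same", activation="relu")(x)
logits = layers.Conv2D(20, 1)(x)                      # 224 x 224 x 20 class scores

model = keras.Model(features, logits)
model.compile(optimizer="adam",
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```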

If you wanted to do autoencoders, so you want to do some unsupervised learning or something like that, well, you're just trying to predict the original image. So you're trying to get the convolutional network to implement the identity transformation. And the trick, of course, that makes it non-trivial is that you're forcing the representation to go through this representational bottleneck of 7 by 7 by 512.
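A minimal sketch of that autoencoder setup (the exact filter counts are placeholders; the point is just the 7x7x512 bottleneck and the L2 reconstruction loss):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(224, 224, 3))
x = inputs
for filters in [32, 64, 128, 256, 512]:               # 224 is halved five times, down to 7
    x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
bottleneck = x                                         # the 7 x 7 x 512 representation
x = bottleneck
for filters in [256, 128, 64, 32, 3]:                  # upsample back to 224 x 224 x 3
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)

autoencoder = keras.Model(inputs, x)
autoencoder.compile(optimizer="adam", loss="mse")      # L2 loss against the input image
```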

So the network must find an efficient representation of the original image so that it can decode it later. So that would be an autoencoder. You again have an L2 loss at the end, and you backprop. Or if you want to do variational autoencoders, you have to introduce a reparameterization layer, and you have to append an additional small loss that makes your posterior be your prior.

But it's just like an additional layer. And then you have an entire generative model. And you can actually sample images as well. If you wanted to do detection, things get a little more hairy, perhaps, compared to localization or something like that. So one of my favorite detectors, perhaps, to explain is the YOLO detector, because it's perhaps the simplest one.

It doesn't work the best, but it's the simplest one to explain and has the core idea of how people do detection in computer vision. And so the way this works is we reduced the original image to a 7 by 7 by 512 feature. So really, there are these 49 discrete locations that we have.

And at every single one of these 49 locations, we're going to predict-- in YOLO, we're going to predict a class. So that's shown here on the top right. So every single one of these 49 will be some kind of a softmax. And then additionally, at every single position, we're going to predict some number of bounding boxes.

And so there's going to be a b number of bounding boxes. Say b is 10. So we're going to be predicting 50 numbers. And the 5 comes from the fact that every bounding box will have five numbers associated with it. So you have to describe the x, y, the width, and the height.

And you have to also indicate some kind of a confidence of that bounding box. So that's the fifth number, some kind of a confidence measure. So you basically end up predicting these bounding boxes. They have positions. They have class. They have confidence. And then you have some true bounding boxes in the image.

So you know that there are certain true boxes. And they have certain class. And what you do then is you match up the desired thing with the predicted thing. And whatever-- so say, for example, you had one bounding box of a cat. Then you would find the closest predicted bounding box.

And you would mark it as a positive. And you would try to make that associated grid cell predict cat. And you would nudge the prediction to be slightly more towards the cat box. And so all of this can be done with simple losses. And you just back propagate that.
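To make the output layout concrete, here is a sketch of what the prediction tensor could look like (the 20 classes are my assumption for illustration, B = 10 follows the example above, and a real YOLO head differs in details such as using fully connected layers at the end):

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes, B = 20, 10                                  # class count is an assumption
features = keras.Input(shape=(7, 7, 512))                # the 49 grid locations
preds = layers.Conv2D(num_classes + 5 * B, 1)(features)  # 7 x 7 x (20 + 50) predictions
model = keras.Model(features, preds)

# preds[..., :num_classes]  -> per-cell class scores (one softmax per location)
# preds[..., num_classes:]  -> B boxes per cell, 5 numbers each:
#                              x, y, width, height, confidence
```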

And then you have a detector. Or if you want to get much more fancy, you could do dense image captioning. So in this case, this is a combination of detection and image captioning. This is a paper with my equal co-author Justin Johnson and Fei-Fei Li from last year. And so what we did here is image comes in.

And it becomes much more complex. I don't maybe want to go into it as much. But the first order approximation is that instead-- it's basically a detection. But instead of predicting fixed classes, we instead predict a sequence of words. So we use a recurrent neural network there. But basically, you can take an image then.

And you can predict-- you can both detect and describe everything in a complex visual scene. So that's just some overview of different tasks that people care about. Most of them consist of just changing this top part: you put in a different loss function, a different data set. But you'll see that this computational block stays relatively unchanged from task to task.

And that's why, as I mentioned, when you do transfer learning, you just want to take these pre-trained networks. And you mostly want to use whatever works well on ImageNet, because a lot of that does not change too much. So in the last part of the talk, I'd like to-- let me just make sure we're good on time.

OK, we're good. So in the last part of the talk, I just wanted to give some hints or some practical considerations when you want to apply convolutional networks in practice. So first consideration you might have if you want to run these networks is, what hardware do I use? So some of the options that I think are available to you-- well, first of all, you can just buy a machine.

So for example, NVIDIA has these DIGITS DevBoxes that you can buy. They have Titan X GPUs, which are strong GPUs. Or, if you're much more ambitious, you can buy a DGX-1, which has the newest Pascal P100 GPUs. Unfortunately, the DGX-1 is about $130,000. So this is kind of an expensive supercomputer.

But the Digits dev box, I think, is more accessible. And so that's one option you can go with. Alternatively, you can look at the specs of a dev box. And those specs are-- they're good specs. And then you can buy all the components yourself and assemble it like LEGO.

Unfortunately, that's prone to mistakes, of course. But you can definitely reduce the price maybe by a factor of like two compared to the NVIDIA machine. But of course, NVIDIA machine would just come with all the software installed, all the hardware is ready, and you can just do work. There are a few GPU offerings in the cloud.

But unfortunately, it's actually not in a good place right now. It's actually quite difficult to get GPUs in the cloud-- good GPUs, at least. So Amazon AWS has these GRID K520s. They're not very good GPUs. They're not fast. They don't have too much memory. It's actually kind of a problem.

Microsoft Azure is coming up with its own offering soon. So I think they've announced it. And it's in some kind of a beta stage, if I remember correctly. And those are powerful GPUs, K80s, that would be available to you. At OpenAI, for example, we use Cirrascale. So Cirrascale is a slightly different model.

You can't spin up GPUs on demand. But they allow you to rent a box in the cloud. So what that amounts to is that we have these boxes somewhere in the cloud. I just have the DNS name, the URL, and I SSH into it. There are Titan Xs in the machine.

And so you can just do work that way. So these options are available to you hardware-wise. In terms of software, there are many different frameworks, of course, that you could use for deep learning. So these are some of the more common ones that you might see in practice. So different people have different recommendations on this.

My personal recommendation right now to most people, if you just want to apply this in practical settings, 90% of the use cases are probably addressable with things like Keras. So Keras would be my go-to number one thing to look at. Keras is a layer over TensorFlow or Theano. And basically, it's just a higher-level API over either of those.

So for example, I usually use Keras on top of TensorFlow. And it's a much higher-level language than raw TensorFlow. You can also work in raw TensorFlow, but you'll have to do a lot of low-level stuff. If you need all that freedom, then that's great, because it gives you much more control over how you design everything.

But it can be slightly more wordy. For example, you have to declare every single weight, you have to assign it a name, stuff like that. So it's just much more wordy, but you can work at that level. Or, for most applications, I think Keras would be sufficient. And I've used Torch for a long time.

I still really like Torch. It's very lightweight, interpretable. It works just fine. So those are the options that I would currently consider, at least. Another practical consideration-- you might be wondering, what architecture do I use in my problem? So my answer here-- and I've already hinted at this-- is don't be a hero.

Don't go crazy. Don't design your own neural networks and convolutional layers; you probably don't want to do that. So the algorithm is actually very simple. Look at whatever is currently the latest released thing that works really well in ILSVRC. You download that pre-trained model. And then you potentially add or delete some layers on top, because you want to do some other task.

So that usually requires some tinkering at the top or something like that. And then you fine-tune it on your application. So it's actually a very straightforward process. To first order, I think, for most applications: don't tinker with it too much. You're going to break it. But of course, you can also take CS231n, and then you might become much better at tinkering with these architectures.
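In code, the don't-be-a-hero recipe looks roughly like this (a sketch, assuming a Keras setup with ResNet50 as the pre-trained model, 10 target classes, and a 0.5 dropout rate; all of those are placeholders for whatever your task needs):

```python
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                    pooling="avg")
base.trainable = False                               # first train only the new head

inputs = keras.Input(shape=(224, 224, 3))
x = base(inputs)
x = layers.Dropout(0.5)(x)                           # the main regularization knob
outputs = layers.Dense(10, activation="softmax")(x)  # e.g. 10 classes in your task

model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)    # then optionally unfreeze and fine-tune
```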

Second is: how do I choose the hyperparameters? And my answer here, again, would be don't be a hero. Look into papers. Look at what parameters they use. For the most part, you'll see that all papers use the same hyperparameters. They look very similar. So when you use Adam for optimization, it's always a learning rate of 1e-3 or 1e-4.

Or you can use SGD with momentum; it's always similar kinds of learning rates. So don't go too crazy designing this. One of the things you probably want to play with the most is the regularization. And in particular, not the L2 regularization, but the dropout rates are what I would advise playing with instead.

Because you might have a smaller or a much larger data set. If you have a much smaller data set, then overfitting is a concern, so you want to make sure that you regularize properly with dropout. And then, as a second-order consideration, you might want to tune the learning rate a tiny bit.

But that usually doesn't have as much of an effect. So really, there are like two hyperparameters. And you take a pre-trained network. And this is 90% of the use cases, I would say. Compare that to computer vision in 2011, where you might have hundreds of hyperparameters. So yeah. And in terms of distributed training: if you want to work at scale, because you want to train on ImageNet or some large-scale data set, you might want to train across multiple GPUs.

So just to give you an idea, most of these state-of-the-art networks are trained on the order of a few weeks across multiple GPUs, usually four or eight GPUs. And these GPUs are roughly on the order of $1,000 each. But then you also have to house them. So of course, that adds additional price.

But you almost always want to train on multiple GPUs if possible. Usually you don't end up training across machines. That's much more rare, I think, to train across machines. What's much more common is you have a single machine. And it has eight Titan Xs or something like that. And you do distributed training on those eight Titan Xs.

There are different ways to do distributed training. So if you're feeling fancy, you can try to do some model parallelism, where you split your network across multiple GPUs. I would instead advise some kind of a data parallelism architecture. So usually what you see in practice is you have eight GPUs.

So I take my batch of 256 images or something like that. I split it. And I split it equally across the GPUs. I do forward pass on those GPUs. And then I basically just add up all the gradients. And I propagate that through. So you're just distributing this batch.

And mathematically, you're doing the exact same thing as if you had a giant GPU. But you're just splitting up that batch across different GPUs. But you're still doing synchronous training with SGD as normal. So that's what you'll see most in practice, which I think is the best thing to do right now for most normal applications.
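Conceptually, one synchronous data-parallel step looks like this (a framework-free sketch; grad_fn is a hypothetical function that returns the summed gradient over its shard, and in a real multi-GPU system the sum would be an all-reduce across devices):

```python
import numpy as np

def data_parallel_step(params, batch, grad_fn, num_gpus=8, lr=0.1):
    shards = np.array_split(batch, num_gpus)              # split the batch of, say, 256
    grads = [grad_fn(params, shard) for shard in shards]  # one forward/backward per GPU
    total_grad = sum(grads)                               # add up all the gradients
    return params - lr * total_grad                       # one synchronous SGD update
```

Mathematically this gives the same update you would get from one giant GPU processing the whole batch at once.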

Other kinds of considerations that sometimes enter, that you could maybe worry about, are these bottlenecks to be aware of. So in particular, the CPU-to-disk bottleneck. This means that you have a giant data set, and it's somewhere on some disk. You want that disk to probably be an SSD, because you want this loading to be quick.

Because these GPUs process data very quickly, and loading the data could actually be a bottleneck. So in many applications, you might want to pre-process your data, make sure that it's read out contiguously in very raw form from something like an HDF5 file or some other kind of binary format.

And another bottleneck to be aware of is the CPU-GPU bottleneck. So the GPU is doing a lot of heavy lifting of the neural network. And the CPU is loading the data. And you might want to use things like prefetching threads, where the CPU, while the networks are doing forward-backward on the GPU, your CPU is busy loading the data from the disk and maybe doing some pre-processing and making sure that it can ship it off to the GPU at the next time step.
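A minimal sketch of the prefetching idea (my own illustration with plain Python threads and a bounded queue; real pipelines usually rely on a framework's input pipeline, but the principle is the same: the CPU loads and preprocesses the next batches while the GPU is busy):

```python
import queue
import threading

def prefetching_loader(load_batch_fn, num_batches, buffer_size=4):
    # load_batch_fn(i) reads batch i from disk and preprocesses it on the CPU.
    q = queue.Queue(maxsize=buffer_size)

    def worker():
        for i in range(num_batches):
            q.put(load_batch_fn(i))      # blocks if the buffer is already full
        q.put(None)                      # sentinel: no more data

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch                      # the GPU consumes this while the worker loads ahead
```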

So those are some of the practical considerations I could come up with for this lecture. If you wanted to learn much more about convolutional neural networks and a lot of what I've been talking about, then I encourage you to check out CS231n. We have lecture videos available. We have notes, slides, and assignments.

Everything is up and available. So you're welcome to check it out. And that's it. Thank you. So I guess I can take some questions. Yeah. Hello? Hello. Hi. I'm Kyle Farr from Lumna. I'm using a lot of convolutional nets for genomics. One of the problems that we see is that our genomic sequence tends to be arbitrary length.

So right now we're padding with a lot of zeros, but we're curious as to what your thoughts are on using CNNs for things of arbitrary size. We can't just downsample to 227 by 227. So is this like a genomic sequence of like ATCG? Like that kind of sequence?

Yeah, exactly. Yeah. So some of the options would be -- so recurrent neural networks might be a good fit because they allow arbitrarily sized contexts. Another option I would say is if you look at the WaveNet paper from DeepMind, they have audio and they're using convolutional networks for processing it.

And I would basically adopt that kind of an architecture. They have this clever way of doing what are called atrous, or dilated, convolutions. And so that allows you to capture a lot of context with few layers. So that's called dilated convolutions. And the WaveNet paper has some details. And there's an efficient implementation of it on GitHub that you should be aware of.
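To illustrate the dilation idea on a sequence like this, a rough Keras sketch might look as follows (the 4-channel one-hot encoding, the filter counts, and the binary output are my assumptions; the only point taken from the talk is the use of dilated convolutions to grow the receptive field):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(None, 4))            # variable-length one-hot A/T/C/G sequence
x = inputs
for rate in [1, 2, 4, 8, 16]:                    # doubling dilation rates, WaveNet-style
    x = layers.Conv1D(64, kernel_size=2, dilation_rate=rate,
                      padding="causal", activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)               # pool over whatever length came in
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
```

With kernel size 2 and dilation rates 1, 2, 4, 8, 16, the receptive field is already 32 positions after five layers, and it keeps growing exponentially as you stack more.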

And so you might be able to just drag and drop the fast WaveNet code into that application. And so you have much larger context, but it's, of course, not infinite context as you might have with a recurrent network. Yeah, we're definitely checking those out. We also tried RNNs. They're quite slow for these things.

Our main problem is that the genes can be very short or very long, but the whole sequence matters. So I think that's one of the challenges that we're looking at with this type of problem. Interesting. Yeah, so those would be the two options that I would play with, basically.

I think those are the two that I'm aware of. Thank you. Thanks for a great lecture. So my question is: is there a clear mathematical or conceptual understanding of how people decide how many hidden layers should be part of their architecture? So the answer to a lot of these "is there a mathematical understanding" questions will likely be no, because we are in very early phases of just doing a lot of empirical guess-and-check kind of work.

And so theory is in some ways lagging behind a bit. I would say that with residual networks, having more layers usually works better. And so you can take these layers out or you can put them in, and it's mostly just a computational consideration of how much you can fit in.

So the consideration is usually: you have a GPU, and it has maybe 16 gigs of RAM or 12 gigs of RAM or something. I want a certain batch size, and I have these considerations, and that upper-bounds the number of layers or how big they could be. And so I use the biggest thing that fits in my GPU, and that's mostly the way you choose this.

And then you regularize it very strongly. So if you have a very small data set, then you might end up with a pretty big network for your data set. So you might want to make sure that you are tuning those dropout rates properly, and so you're not overfitting. I have a question.

My understanding is that the recent convolutional nets don't use pooling layers, right? So the question is, why don't they use pooling layers? Is there still a place for pooling? Yeah. So certainly, if you saw, for example, the residual network at the end, there was a single pooling layer at the very beginning, but mostly they went away.

You're right. So it took-- I wonder if I can find the slide. I wonder if this is a good idea to try to find the slide. That's probably-- OK, let me just find this. Oh, OK. So this was the residual network architecture. So you see that they do a first conv, and then there's a single pool right there.

But certainly, the trend has been to throw them away over time. And there's a paper also, called Striving for Simplicity: The All Convolutional Net. And the point in that paper is, look, you can actually do strided convolutions. You can throw away pooling layers altogether, and it works just as well.

So pooling layers are, I would say, a bit of a historical vestige: people needed things to be efficient and to control the capacity and downsample things quite a lot. And so we're kind of throwing them away over time. And yeah, they're not doing anything super useful.

They're doing this fixed operation. And you want to learn as much as possible. So maybe you don't actually want to get rid of that information. So it's always more appealing to-- it's probably more appealing, I would say, to throw them away. But you mentioned there is a sort of cognitive or brain analogy that the brain is doing pooling.

Yeah, so I think that analogy is stretched by a lot. So the brain-- I'm not sure if the brain is doing pooling. Yeah. How about image compression? Not for just classification, but the usage of neural networks for image compression. Do we have any examples? Sorry, I couldn't hear the question.

Instead of classification for images, can we use the neural networks for image compression? Image compression. Yeah, I think there's actually really exciting work in this area. So one that I'm aware of, for example, is recent work from Google, where they're using convolutional networks and recurrent networks to come up with variably sized codes for images.

So certainly, a lot of these generative models, I mean, they are very related to compression. So definitely a lot of work in the area that I'm excited about. Also, for example, super resolution networks. So you saw the recent acquisition of Magic Pony by Twitter. So they were also doing something that basically allows you to compress.

You can send low resolution streams, because you can upsample it on the client. And so a lot of work in that area. Yeah. I had one question. One more, but maybe after you. Can you please comment on scalability regarding number of classes? So what does it take if we go up to 10,000 or 100,000 classes?

Yeah, so if you have a lot of classes, then of course, you can grow your softmax, but that becomes inefficient at some point, because you're doing a giant matrix multiply. So some of the ways that people are addressing this in practice, I believe, is use of hierarchical softmax and things like that.

So you decompose your classes into groups, and then you kind of predict one group at a time, and you kind of converge that way. So I see these papers, but I'm not an expert on exactly how this works. But I do know that hierarchical softmax is something that people use in this setting.

Especially, for example, in language models this is often used, because you have a huge number of words and you still need to predict them somehow. And so I believe Tomas Mikolov, for example, has some papers on using hierarchical softmax in this context. Could you talk a little bit about the convolutional functions?

Like what considerations you should make in selecting the functions that are used in the convolutional filters? Selecting the functions that are used in the convolutional filters? So these filters are just parameters, right? So we train those filters. They're just numbers that we train with backpropagation. Are you talking about the nonlinearities, perhaps?

Yeah, I'm just wondering about when you're selecting the features, or when you're getting the-- when you're trying to train to understand different features within an image, what are those filters actually doing? Oh, I see. You're talking about understanding exactly what those filters are looking for in the image and so on.

So there's a lot of interesting work here. For example, Jason Yosinski has this Deep Visualization Toolbox, and I've shown you that you can kind of debug it that way a bit. There's an entire lecture that I encourage you to watch in CS231n on visualizing and understanding convolutional networks. So people use things like deconv or guided backpropagation.

Or you backpropagate to image, and you try to find a stimulus that maximally activates any arbitrary neuron. So different ways of probing it, and different ways have been developed. And there's a lecture about it. So I would check that out. Great, thanks. I had a question regarding the size of fine-tuning data set.

For example, is there a ballpark number if you are trying to do classification? How many do you need for fine-tuning it to your sample set? So how many data points do you need to get good performance? That's the question. So this is like the most boring answer, I think.

Because the more, the better always. And it's really hard to say, actually, how many you need. So usually one way to look at it is-- one heuristic that people sometimes follow is you look at the number of parameters, and you want the number of examples to be on the order of number of parameters.

That's one way people sometimes break it down. Even for fine-tuning? Because we'll have an ImageNet model. So I was hoping that most of the things would be taken care of there, and then you're just fine-tuning. So you might need a lower order. I see. So when you're saying fine-tuning, are you fine-tuning the whole network, or you're freezing some of it, or just the top classifier?

Just the top classifier. Yeah. So another way to look at it is you have some number of parameters, and you can estimate the number of bits that you think every parameter has. And then you count the number of bits in your data. So that's the kind of comparisons you would do.

But really, I have no good answer. So the more, the better. And you have to try, and you have to regularize, and you have to cross-validate that, and you have to see what performance you get over time. Because it's too task-dependent for me to say something stronger. Hi. I would like to know how do you think ConvNets will work in the 3D case?

Like is it just a simple extension of the 2D case, or do we need some extra tweak about it? So in the 3D case, so you're talking specifically about, say, videos or some 3D-- Actually, I'm talking about the image that has the depth information. Oh, I see. So say you have like RGBD input and things like that.

Yeah. So I'm not too familiar with what people do. But I do know, for example, that people try to have-- for example, one thing you can do is just treat it as a fourth channel. Or maybe you want a separate ConvNet on top of the depth channel and do some fusion later.

So I don't know exactly what the state of the art in treating that depth channel is right now. So I don't know exactly how they do it right now. So maybe just one more question. Just how do you think the 3D object recognition-- 3D object? Yeah. Recognition? So what is the output that you'd like?

The output is still the class probability. But we are not treating the 2D image, but the 3D representation of the object. I see. So do you have a mesh or a point cloud? Yeah, a mesh. I see. Yeah. So that's not exactly my area, unfortunately. But the problem with these meshes and so on is that there's this rotational degree of freedom, and I'm not sure what people do about it, honestly.

So I'm actually not an expert on this. So I don't want to comment. There are some obvious things you might want to try. You might want to plug in all the possible ways you could orient this, and then at test time average over them. So those would be some of the obvious things to play with.

But I'm not actually sure what the state of the art is. Thank you. OK, one more question. Go ahead. OK. So coming back to distributed training, is it possible to do even the classification in a distributed way? Or my question is, in the future, can I imagine our cell phones do these things together for one inquiry?

Our cell phones? Oh, I see. You're trying to get cell phones to do distributed training. Yes, yes. To train and also classify, for one cell phone. That's a radical idea. Is there any hope in that? Very radical idea. So a related thought I had recently: I have ConvNetJS in the browser.

And basically, that trains networks in the browser. And I was thinking about similar questions, because you could imagine shipping this off as an ad equivalent. Like, people just include this in the JavaScript, and then everyone's browsers are kind of training a small network. So I think that's a related question.

But do you think there's too much communication overhead? Or it could be actually really distributed in an efficient way? Yes, so the problem with distributing it a lot is actually the stale gradients problem. So when you look at some of the papers that Google has put out about distributed training, as you look at the number of workers when you do asynchronous SGD, number of workers and the performance improvement you get, it kind of plateaus quite quickly after eight workers or something quite small.

So I'm not sure if there are ways of dealing with thousands of workers. The issue is that every worker has a specific snapshot of the weights that it pulled from the master. And now you have a set of weights that you're using. And you do forward, backward.

And then you send an update. But by the time you send an update and you've done your forward, backward, the parameter server has now done lots of updates from thousands of other things. And so your gradient is stale. You've evaluated it at the wrong and old location. And so it's an incorrect direction now.

And everything breaks. So that's the challenge. And I'm not sure what people are doing about this. I was wondering about applications of convolutional nets to two inputs at a time. So let's say you have two pictures of jigsaw puzzle pieces. And you're trying to figure out if they fit together, or how one object compares to the other in a specific way.

Have you heard of any implementation of this kind? Yes. So you have two inputs instead of one. So the common way of dealing with that is you put a ConvNet on each, and then you do some kind of fusion eventually to merge the information. Right? I see. And what about recurrent neural networks, if you had variable input?

So for example, in the context of videos where you have frames coming in, then yes, some of the approaches are: you have a convolutional network on a frame, and then at the top you tie it in with a recurrent neural network. So you reduce the image to some kind of a lower-dimensional representation.

And then that's the input to a recurrent neural network at the top. There are other ways to play with this. For example, you can make every single neuron in the ConvNet recurrent. That's also one funny way of doing this. So right now, when a neuron computes its output, it's only a function of a local neighborhood in the layer below it.

But you can also make it, in addition, a function of that same local neighborhood, or its own activation, at the previous time step, if that makes sense. So this neuron is not just computing a dot product with the current patch, but it's also incorporating a dot product of its own and maybe its neighborhood's activations at the previous time step, the previous frame.

So that's kind of like a small RNN update hidden inside every single neuron. So those are the things that I think people play with, though I'm not familiar with what currently works best in this area. Pretty awesome. Thank you. Yeah. Yeah, hi. Thanks for the great talk. I have a question regarding the latency of models that are trained with many layers.

So especially at prediction time, as we add more layers, the forward pass will take more time. It will increase the latency of the prediction. So what are the numbers we're seeing at present? Can you share the prediction time, the latency of the forward pass?

So you're worried, for example, you want to run a prediction very quickly. Would it be on an embedded device, or is this in the cloud? Yeah, suppose it's a cell phone. You're identifying the objects, or you're doing some image analysis or something. Yeah. So there's definitely a lot of work on this.

So one way you would approach this, actually, is you have this network that you've trained using floating-point arithmetic, 32 bits, say. And there's a lot of work on taking that network, discretizing all the weights into ints, making it much smaller, and pruning connections. So one of the works related to this, for example, is from Song Han here at Stanford, who has a few papers on getting rid of spurious connections and reducing the network as much as possible, and then making everything very efficient with integer arithmetic.

So basically, you achieve this by discretizing all the weights and all the activations and throwing away and pruning the network. So there are some tricks like that that people play. That's mostly what you would do on an embedded device. And then the challenge, of course, is you've changed the network, and now you just kind of are crossing your fingers that it works well.

And so I think what's interesting from a research standpoint is you'd like your test time to exactly match your training time. So then you get the best performance. And so the question is, how do we train with low precision arithmetic? And there's a lot of work on this as well, so say from Yoshua Bengio's lab as well.

So those are exciting directions for how you train in a low-precision regime. Do you have any numbers that you can share for the state of the art, how much time does it take? Yes, I've seen the papers, but I'm not sure if I remember the exact reductions. It's on the order of-- OK, I don't want to say, because basically I don't know.

I don't want to try to guess this. Thank you. All right. So with that, we'll take our time. Let's thank Andre. Lunch is outside, and we'll restart at 1245.