Convolutional neural networks, or CNNs, have been the undisputed champions of computer vision for almost a decade. Their widespread adoption kick-started the deep learning boom, and without them, the field of artificial intelligence would look very different today. Before deep learning with CNNs, computer vision relied on very brittle edge detection algorithms, color profiling, and a plethora of manually scripted processes.
And these could very rarely be applied across different use cases or domains. The result was that every dataset and every use case required its own implementation, demanding manual intervention and domain-specific knowledge, which meant that applying these methods across a broad range of use cases was just not practical.
Deep-layered CNNs changed this. Rather than manual feature extraction, CNNs would learn how to extract features from images. And they could do this for a vast number of datasets and a vast number of use cases. All they needed was training data. Big training datasets and deep-layered convolutional neural networks have remained the de facto standard within the field of computer vision.
Now, there have been some moves to other architectures, like, for example, the vision transformer, which we covered in an earlier video. And things like multi-modality may also help us move beyond CNNs. But for now, CNNs are still the standard when it comes to computer vision.
In this very visual and hands-on video, we're going to learn why that is and what exactly makes a convolutional neural network work and how we can actually use them ourselves. So let's start with what makes a convolutional neural network. By and large, CNNs are neural networks that are known for their performance on image datasets and image tasks.
They're characterized by something called a convolutional layer. These convolutional layers are able to detect abstract features, almost ideas, within an image. We can shift these images, squash them, rotate them; but if a human can still recognize the image, the chances are a well-trained CNN will still be able to identify it as well.
Because of their affinity for image-based applications, we tend to find CNNs being used in image classification, object detection, and many other tasks within the realm of computer vision. Now, we're focusing on deep-layered CNNs, which are any neural network that satisfies two conditions. The first is that it contains many layers, i.e. it is a deep neural network. The second is that it contains convolutional layers. Beyond that, convolutional neural networks can consist of many different architectures and will contain many different features; common ones include pooling layers, normalization layers, and linear layers. We'll see a few examples of the different types of convolutional neural networks later on in the video.
Now, let's just briefly focus on what exactly a convolutional layer is actually doing. We can think of an image as a big array of activation values. This array is followed by further arrays of initially randomly initialized values: the weights, which we call a filter or a kernel.
A convolutional layer is nothing more than an element-wise multiplication between these pixel values and the filter weights, which are then summed together. This element-wise operation followed by the sum of the resulting values is often called the scalar product, because it results in a single scalar value. And we can see how that works here.
So we have our input, which is a five by five pixel image, and we have our filter, which is a three by three pixel array. In this very first iteration of the convolution, we can see that we are multiplying the three by three window within that input by the filter weights.
Multiply both those together in element-wise multiplication to get the array that you can see on the right. In there, we can see that the sum of all the values within that array is equal to three, and that is our scalar product value. Now, in reality, we would not just return a single value because we would actually be sliding this window, this filter, over the full image.
And the output of each one of those window operations is a single value. But of course, we now have nine values from this larger image. Now, the output of this convolution is what we would call a feature map or an activation map; both mean the same thing. We call it that because it is a map of the activations of features detected in the input layer.
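To make that concrete, here's a minimal sketch of the same operation in PyTorch; the pixel and filter values are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

# a made-up 5x5 "image" and a 3x3 filter, just for illustration
image = torch.tensor([[1., 1., 1., 0., 0.],
                      [0., 1., 1., 1., 0.],
                      [0., 0., 1., 1., 1.],
                      [0., 0., 1., 1., 0.],
                      [0., 1., 1., 0., 0.]])
kernel = torch.tensor([[1., 0., 1.],
                       [0., 1., 0.],
                       [1., 0., 1.]])

# one window position: element-wise multiply, then sum -> a single scalar
window = image[:3, :3]
print((window * kernel).sum())           # scalar product for the top-left window

# sliding the filter over the whole image gives the full feature map
feature_map = F.conv2d(image[None, None], kernel[None, None])
print(feature_map.shape)                 # torch.Size([1, 1, 3, 3]) -> nine values
```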
Now, something worth noting here is obviously we are going from a larger dimensional space into a smaller dimensional space. We're compressing that information. So we always need to be mindful of excessive compression and therefore data loss via dimensionality reduction. And because of this, we may want to increase or decrease the amount of compression that our filters create.
To do that, we modify the filter size and also how quickly it moves across the image using a variable called the stride. The stride defines the number of pixels a filter moves after every calculation. By increasing the stride, the filter will move across the entire image in fewer steps, outputting fewer values and producing a more compressed feature map.
Now, there are also some other surprising effects of image compression that we need to be aware of. One of those is the filter's interaction with the border areas of an input image or input array. If we're not careful with this, the border effects on small images can result in a rapid loss of information.
And that's naturally more of a problem for those smaller images because they start with less information in the first place. So to avoid this, we can either reduce the amount of compression using the techniques we just mentioned, being mindful of the filter size and the stride, or we can add padding.
Now we can see the effect of padding here. So we are essentially taking the original image and we're adding padding around the edge of that image. Now, typically this padding is going to be a set of zero value pixels. And we add that around the image to limit or prevent compression between layers.
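As a rough sketch of how filter size, stride, and padding trade off against each other, we can compare output shapes with PyTorch's nn.Conv2d (the channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # one RGB image, 32x32 pixels

# stride 1, no padding: the spatial size shrinks from 32 to 30
print(nn.Conv2d(3, 16, kernel_size=3)(x).shape)              # [1, 16, 30, 30]

# stride 2: the filter takes bigger steps, so the feature map is more compressed
print(nn.Conv2d(3, 16, kernel_size=3, stride=2)(x).shape)    # [1, 16, 15, 15]

# padding 1 with a 3x3 filter: the output keeps the full 32x32 size
print(nn.Conv2d(3, 16, kernel_size=3, padding=1)(x).shape)   # [1, 16, 32, 32]
```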
For larger images, this is going to be less of a problem. But for those smaller images, padding is a very effective remedy to avoid too much information loss. Now, another key feature of these deep convolutional neural networks is obviously depth of those networks. Going way back to 2012, the first very successful convolutional neural network or deep convolutional neural network was called AlexNet.
And the authors of AlexNet found that the depth of their network was a very key component that contributed to its high performance. And the reason for this is every successive layer within a convolutional neural network abstracts the initial features more and more. So we get more and more abstract features as the number of layers within the network increases.
And we can think of this as the network is able to abstract image features more and get them closer to the very abstract concepts that we have in our minds as human beings. So it gets closer to a human-like understanding of an image. So for example, a shallow CNN might recognize that an image contains an animal.
But as we add another layer, it may be able to identify that animal as a dog. It has become more abstract. And adding another layer may allow the network to identify specific breeds, like a Staffordshire bull terrier or a husky. These ideas of different dog breeds are far more abstract than just "a dog" or "an animal".
And it requires far more abstraction for a CNN to be able to identify that. So by adding more layers, we're generally going to be able to identify more abstract and more specific concepts. Now, moving on to some of the very common features of a convolutional neural network, though not restricted to convolutional neural networks alone, we have activation functions.
So activation functions, we will find that in pretty much every neural network. And they are used to add non-linearity to these networks. And essentially what that allows us to do, particularly over many layers, is represent more complex patterns. Now, you may recognize a few of these activation functions. We have the rectified linear unit function, the tanh function and sigmoid function.
These three are some of the most common activation functions that we find in neural networks. In the past, CNNs often used activation functions within the hidden layers of the network, so the middle layers of that network, basically everything that is not the input or the output. And they would typically use tanh or sigmoid activations.
However, in 2012, the rectified linear unit activation function became very popular because it was used by the AlexNet model, which kind of kick-started this era of deep learning, as it was the best performing CNN of its time by a long shot. Nowadays, the rectified linear unit or ReLU activation function is still a very popular choice.
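As a quick sketch, here's how those three activations behave on a handful of example values:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.relu(x))     # negatives clipped to 0: [0.0, 0.0, 0.0, 0.5, 2.0]
print(torch.tanh(x))     # squashed into (-1, 1), saturating at the extremes
print(torch.sigmoid(x))  # squashed into (0, 1), saturating at the extremes
```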
It's a lot simpler than tanh and sigmoid, and it also doesn't require normalization to avoid saturation, which basically means activation outputs piling up towards the minimum and maximum values of the activation function. Another very important feature is the use of pooling layers. Now, we use pooling layers because output feature maps are very sensitive to the location of input features.
So a small change can make a big difference. To some degree, this can be useful as it can tell us the difference between, for example, a cat's face and a dog's face, based on where the eyes are, where the ears are, and so on. However, we don't want that to be too dramatic because if an eye is shifted two pixels to the left or two pixels to the right, that should still allow the model to identify this image as being of a face.
It should not make it go crazy and detect something completely different. And we need pooling layers in order to allow the model to have this form of smoothing or stability. Pooling layers are a downsampling method that compress information from one layer into a smaller space in the next layer.
And the two most common pooling methods are max pooling and average pooling. Max pooling takes the maximum value of all the values within a window, and average pooling takes the average of those values. And as that pooling window moves across our input array, we end up outputting another array of smaller dimensionality.
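Here's a small sketch of both pooling methods on a toy 4x4 input:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [0., 1., 2., 3.],
                    [1., 2., 3., 4.]]]])  # shape [1, 1, 4, 4]

max_pool = nn.MaxPool2d(kernel_size=2)   # take the max of each 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2)   # take the mean of each 2x2 window

print(max_pool(x))  # [[[[4., 8.], [2., 4.]]]]
print(avg_pool(x))  # [[[[2.5, 6.5], [1.0, 3.0]]]]
```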
And with that, we can move on to the final main feature of a convolutional neural network: the use of fully connected layers. A fully connected linear layer is simply a neural network in the very traditional, stripped-down sense. It is the dot product between some inputs x and the layer weights W, with a bias term b added on (y = Wx + b), usually followed by an activation function.
We will typically see a fully connected layer at the end of a convolutional neural network, where it handles the transformation of our convolutional embeddings from 3D tensors into more understandable outputs like class predictions. It's often within these final layers that we find the most information-rich vector embeddings: the most abstract, machine-readable numeric representation of whatever image has been passed through all of those layers.
And it's this that we would typically use in things like image retrieval. But focusing on the classification, what a classifier will usually do is apply a softmax activation function. And that will create a probability distribution across all of the output nodes, where each one of these nodes represents a candidate class or a potential label.
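A minimal sketch of that classification head, assuming a hypothetical 256-dimensional embedding and three candidate classes:

```python
import torch
import torch.nn as nn

embedding = torch.randn(1, 256)          # output of the earlier layers (made up here)
classifier = nn.Linear(256, 3)           # fully connected layer: one output node per class

logits = classifier(embedding)
probs = torch.softmax(logits, dim=-1)    # probability distribution across the classes
print(probs, probs.sum())                # the probabilities sum to 1
```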
For example, here we have cat, dog, and car. Those would be our output classes. And after these fully connected layers and the softmax activation function, we have our predictions. Now, all of those features that we've just worked through are the very common components that make up a convolutional neural network.
But with time, many different convolutional networks have been designed. So there is no single canonical architecture; instead, we can use some of the highest-performing networks as a set of guideposts for designing our own, or simply use those existing networks directly. So let's have a very quick, high-level look at a few of the most popular ones.
So we'll start way back when with the very first successful convolutional neural network, which was LeNet. Now, LeNet is, I think, the earliest good example of a deep convolutional neural network being applied in the real world. It was developed in 1998. And in reality, a lot of us have probably actually interacted with LeNet without ever even realizing.
It was developed at Bell Labs, which licensed it to many different big banks around the globe for reading handwritten digits on checks. And despite its use worldwide, it was surprisingly the only example of a commercially successful CNN, at least on that scale, for another 14 years. And that is when we got AlexNet.
So AlexNet is the deep CNN that kick-started the era of deep learning. And that was back in October, 2012. The catalyst of this was AlexNet winning the ImageNet competition. And in fact, AlexNet can actually be seen as a continuation of LeNet. It uses a very similar architecture, but simply added more layers, training data, and some safeguards against overfitting.
And it was after AlexNet that the broader community of computer vision researchers began focusing their attention on building ever deeper models with really big datasets. The following years basically saw many variations of AlexNet topping the ImageNet competition, until we get to VGGNet. VGGNet came in 2014 and displaced those AlexNet-style models at the top of that year's ImageNet results.
And there were a few different variations of this network developed, but we can already see that it's a much deeper network. At its core, though, it's still using the same process of convolutional layers, pooling layers, and so on. Moving on to 2015, the next year, we saw the introduction of ResNet.
Now ResNet introduced even deeper networks than ever before. The first of those contained 34 layers, and since then, 50-plus-layer ResNet models have been developed that still hold some of the state-of-the-art results on many computer vision benchmarks. Beyond adding more layers, ResNet was very much inspired by VGGNet, but used fewer filters and a generally less complex network architecture.
The other key addition, and the reason it's called a residual network, i.e. ResNet, is the shortcut connections added between layers. These were there to avoid the loss of information over many layers given the greater depth of ResNet. Adding these shortcuts, or residual connections, allows information to be maintained over a longer distance, which was very much required with this deeper network size.
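As a rough sketch of that idea, here's a simplified residual block (real ResNet blocks also include batch normalization, which is left out here) where the input skips over the convolutions and is added back onto their output:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: two 3x3 convolutions plus a shortcut connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # the shortcut: add the input back on

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # [1, 64, 32, 32]
```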
Now, I think that's enough for understanding convolutional neural networks. What I want to do now is actually look at how to implement convolutional neural networks and use them in classification. So we're gonna go through this notebook example here. You can find this notebook in the video description, and you'll be able to open that in Colab, or if you prefer, you can actually download the file as well.
We're first going to just load in the relevant libraries. So we have PyTorch here and TorchVision. I'm gonna be using these transforms a lot, so we'll just also import them as transforms, make it a little bit easier. Now, we're gonna be working with a fair bit of data. So what is usually pretty helpful to do is switch from CPU to GPU if you have it available.
We don't always have it available, but it can be useful if you do have it. And you can just check what you have, like so. So for me, I'm on MacBook right now, so I only have CPU. But if you're working on Colab, this should show up as CUDA.
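The setup described here might look roughly like this (a sketch rather than the exact notebook cell):

```python
import torch
import torchvision
from torchvision import transforms

# use a GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)  # 'cpu' locally, 'cuda' on a GPU-enabled Colab instance
```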
Now, as usual, our first task is going to be data preprocessing. So we first need to download our dataset. We're going to be using the CIFAR-10 dataset, which is a very popular image classification dataset. Download that, and we will see that it contains 50,000 items. Within that, we have images, which are just Python PIL image objects, and their labels, of which there are 10 unique labels within the dataset, hence why it's called CIFAR-10.
And we can confirm that here. So we see we have 10 of these. And then we can also view the images, but they are very small. We can also see that they are Python PIL objects here. So this is, I think, a plane, but yeah, it's very small. So we're gonna be training the model.
And while we're training, we also want to pull in another dataset that is independent to the training dataset that we can use as a validation or test set later on. We're gonna be using this test dataset, actually, as a validation dataset. So we're going to be checking our model performance on this data during the training process.
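A sketch of loading both splits, assuming the Hugging Face datasets library is being used here (torchvision's CIFAR10 class would work just as well):

```python
from datasets import load_dataset  # assumption: Hugging Face datasets is used in the notebook

train_data = load_dataset("cifar10", split="train")  # 50,000 PIL images plus labels
val_data = load_dataset("cifar10", split="test")     # 10,000 images, used here as validation

print(len(train_data), len(val_data))
```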
You can see that we have a smaller number of items here: 10,000 versus 50,000. Now, most convolutional neural networks are designed to only accept a certain size of image. In our case, we're going to use a 32 by 32 pixel input. We can modify that based on the model architecture, but the architecture that we're going to be using later accepts this 32 by 32 image size.
So what we need to do is set the image size here, and then we can use transforms.Resize to resize the image to whatever we put in here, so 32 pixels. And then transforms.ToTensor just converts our PIL image object into a tensor, which we can then feed into our model later on.
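That pipeline might look something like this:

```python
image_size = 32

preprocess = transforms.Compose([
    transforms.Resize(image_size),   # resize to 32 pixels
    transforms.ToTensor(),           # PIL image -> tensor with values scaled to [0, 1]
])
```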
So run this, and basically, we can just run preprocess on our images, and that will run this transformation pipeline across them. Now, there are a few things to consider when we're doing this. The first is, okay, we're gonna be iterating through everything. We need to extract the image and its respective label.
One thing we need to consider with this image data set in particular, but other image data sets as well, is we only want one image format. So I want to have RGB images, so images with red, green, and blue color channels. A few images in this data set are actually just grayscale, so they have a single color channel.
Now, we're not going to colorize it or anything like that. We're actually just going to copy those single color channels into three color channels, and it will still appear as a black and white grayscale image, but we at least have those three color channels, which means we can pass that directly into our model, which expects image arrays with three color channels.
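Continuing the earlier sketch, the preprocessing loop might look like this (the 'img' and 'label' column names are an assumption based on the Hugging Face cifar10 format):

```python
inputs = []
labels = []

for record in train_data:
    image, label = record["img"], record["label"]
    # force a consistent three-channel RGB format
    # (any grayscale image gets its single channel copied across three channels)
    image = image.convert("RGB")
    inputs.append(preprocess(image))  # resize + convert to tensor
    labels.append(label)
```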
So we do that using image convert here, and then we preprocess everything, and then we append all that to our inputs. Okay, now we'll just take a moment. It's pretty quick. Now let's have a look at one of those images. So we run this, and we can see that we have 50,000 of these modified images in our training set, and each one of those is a three by 32 by 32 tensor.
Now, 32 by 32 is the actual pixels in the image, and the three is the number of color channels, red, green, and blue. And we can see the result from this transformation here. So this is just a tensor. One thing to note here is that all these values have been normalized to between zero and one.
That was actually done by the preprocessing pipeline here, with the transforms.ToTensor step. Okay, so moving on. One thing that we should do is calculate the mean and standard deviation of the images so that we can modify the normalization to better fit our set of images. To do that, we would do this, creating a big list of all of our images.
And this is just a small subset of those, pulled from here; we're sampling a smaller portion because otherwise this just takes a bit longer. We're merging all of these into a single three-channel tensor. So you can see here, we're essentially merging all those images into one massive image.
And then we just want to calculate the mean and the standard deviation for each one of those color channels. Okay, so we get these values and these values. I don't need those tensors anymore. They take a bit of space. I'm going to remove them. And then what we do is we just modify that normalize here.
Okay, so now we're just normalizing all of the tensors with this additional step here. We define that and apply it to the existing tensors. Run that; okay, and that's done. But obviously, when we're doing all of this in one go, we would probably want to combine everything into a single step.
So we put all of the steps into a single preprocessing pipeline: the resize, the ToTensor, and then the normalize following that. And we'll actually use that on the validation set. So here is our validation set, and we'll just rerun the same thing as before, but this time with that normalization step in there as well.
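So the combined pipeline, sketched out, might be:

```python
preprocess = transforms.Compose([
    transforms.Resize(image_size),
    transforms.ToTensor(),
    transforms.Normalize(mean=mean.tolist(), std=std.tolist()),  # per-channel normalization
])
```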
Now, when we're training a model, we're probably going to want to do it in batches, so that we pass a batch of input data into our model at any one time. Rather than passing a single image at a time through everything, we're passing everything through in batches of, in this example, 64.
And this is because neural networks can be parallelized, and that allows us to take advantage of parallelization, which means we're performing calculations across all of the images in a single batch in parallel rather than one after the other, so everything runs much faster. Now, to prepare everything, we need to load everything into our data loader.
So this is going to handle the loading of data into our model. We have the batch size here, and we're also shuffling everything, in case the images aren't already shuffled within the dataset, so that we don't end up with the same kind of image covering an entire batch.
We want every single batch to be as representative of the full dataset as possible. We then initialize a data loader for both the training set and the validation set, as sketched below.
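A hedged sketch of that setup, assuming the validation tensors (val_inputs, val_labels) were built the same way as the training ones:

```python
from torch.utils.data import DataLoader, TensorDataset

batch_size = 64

# pair each image tensor with its label, then batch and shuffle
train_set = TensorDataset(torch.stack(inputs), torch.tensor(labels))
val_set = TensorDataset(torch.stack(val_inputs), torch.tensor(val_labels))  # assumed built as above

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_set, batch_size=batch_size)
```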
With the data loaders ready, we can actually build our convolutional neural network. The definition is pretty long, but that's just down to the depth of the network. In reality, if we compare it to a lot of the networks we looked at before, it's not actually that deep: we only have five convolutional layers here and then a few other pieces. But you should note that there are a few things you might recognize from earlier. We have a convolutional layer followed by a ReLU activation layer, followed by a max pooling layer, and we repeat that several times.
And then at the end here, we have our fully connected layers; we have a few of those in order to get our predictions. Now, one thing to note here is the number of input channels: this must match the number of input color channels that we're expecting.
Our input size will be determined by the first set of input data that we throw into our model, so that'll be 32 by 32. And this first convolutional layer is going to move over that 32 by 32 image with a kernel, or window, of four by four pixels.
We also add some padding to reduce the amount of compression, and we do that throughout the network. The other thing to note is that the number of output channels is the depth of the output array. The depth of the input is initially the three color channels, while the depth of the output from that first layer is actually 64.
And that gets deeper and deeper as we go through, before we start decreasing it towards the end of the model. Then at the end here, we're just performing transformations from our 3D convolutional outputs into the fully connected layers. And the final output, so this is our final layer.
The input of that layer is 256 activations, or nodes, and the output is the number of classes, which is 10. So that defines the structure of our convolutional neural network. The next part here is the forward step, which defines the process we move through, i.e. the order in which each of those layers is actually used.
We can actually define all of those layers in any order we want; it's the forward method that defines the order they're applied in. We go through that, layer by layer, until we reach the final output. And that is how we define a convolutional neural network; a condensed sketch of this kind of model is below. After that, we come down and move on to setting everything up for training.
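Here's a condensed sketch of that kind of model; it only has three convolutional blocks rather than the notebook's five, but the structure (convolution, ReLU, max pooling, repeated, then fully connected layers down to 10 classes) is the same idea:

```python
import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, padding=2),    # 3x32x32 -> 64x33x33
            nn.ReLU(),
            nn.MaxPool2d(2),                               # -> 64x16x16
            nn.Conv2d(64, 128, kernel_size=3, padding=1),  # -> 128x16x16
            nn.ReLU(),
            nn.MaxPool2d(2),                               # -> 128x8x8
            nn.Conv2d(128, 128, kernel_size=3, padding=1), # -> 128x8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                               # -> 128x4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                  # -> 2048 values per image
            nn.Linear(128 * 4 * 4, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),                     # logits over the 10 classes
        )

    def forward(self, x):
        # the forward method fixes the order the layers are applied in
        return self.classifier(self.features(x))
```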
So we want to move it to our device if we're using a CUDA-enabled GPU, otherwise it'll just stay on the CPU. We want to set the loss function. We're gonna be using cross-entropy loss. This is used when we have a classification task, like what we do here. Set learning rate, so 0.008.
And we're going to use stochastic gradient descent here as our optimizer. Moving on, we would go down to here. We train it for about 50 epochs; you can use fewer or more depending on what you're seeing on your end during the training. It doesn't take too long to run anyway.
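Continuing the sketches above, that setup would be roughly:

```python
model = CNN().to(device)                  # move the model to the GPU if one is available
criterion = nn.CrossEntropyLoss()         # classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.008)
num_epochs = 50
```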
And then we run through the training. So we go through these 50 epochs. And within each epoch, we run through the entire dataset. And we load that from the training data loader. We move those to the GPU if it is available. We do the forward propagation. From here, we get our output logits.
Those are the final output predictions. From there, we calculate the loss between the predicted values and the true values, and then we optimize the model with a backward propagation step. That's all for the training step itself, which looks roughly like the sketch below.
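A sketch of the training step inside that loop:

```python
for epoch in range(num_epochs):
    model.train()
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)

        logits = model(images)              # forward propagation -> output logits
        loss = criterion(logits, targets)   # loss between predictions and true labels

        optimizer.zero_grad()
        loss.backward()                     # backward propagation
        optimizer.step()                    # update the weights
```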
The last part of each epoch is for our validation, so that we can calculate the validation loss and check that our model is not overfitting over time. And we will get something that looks like this. The important thing here is that if you see the training loss decreasing while the validation loss increases, you're probably training for too many steps or your learning rate is too high, and the model is overfitting to the training data.
So in that case, just be wary: either train for fewer epochs, train with a lower learning rate, or even use fewer layers. Okay, and then the final validation accuracy that we get near the end here is around 80%. And we can then go ahead and save the model to a file like this.
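Saving is a one-liner; the filename here is just an example:

```python
torch.save(model.state_dict(), "cnn.pt")  # save the trained weights to a PyTorch weights file
```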
Okay, so this is just CNN and the PyTorch weights file there. Now, that is the training for the model. Let's have a look at how we can use it for inference. So inference, by inference, I mean making predictions. We would load the model. You know, if we didn't already have it in the notebook, we'd make sure we switch it to evaluation mode and also move to device if we have CUDA-enabled GPU again.
Pretending that we're not in the same notebook, we reinitialize the test set. It's actually our validation set, but we're just using it for both in this example; in real scenarios, of course, you should use separate datasets for validation and testing. Coming down here, we do the preprocessing that we set up before, the preprocessing pipeline.
We can check the number of tensors we have there. So we're just taking the first 10 as an example. All of those are the three by 32 by 32 that we saw before. Stack all those into a single tensor so that we have basically 10 tensors or 10 image arrays, you can think of them like that, with the three color channels and each one of those 32 by 32 pixels.
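Sketched out, the inference step might look like this (assuming the preprocessed validation tensors are held in a list called val_inputs, mirroring the training preprocessing):

```python
model.eval()
X = torch.stack(val_inputs[:10]).to(device)   # 10 preprocessed images, shape [10, 3, 32, 32]

with torch.no_grad():
    logits = model(X)                          # shape [10, 10]: one row of logits per image

preds = torch.argmax(logits, dim=1)            # index of the highest activation = predicted class
print(preds)
```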
We process all of them through our model, and then we can use the torch.argmax function to find the position within the output logits that has the highest activation. That is our prediction. Okay, so in the first example, position number five had the highest activation.
That means the model is predicting that this image is whatever class number five is. And we'll see what that is in a moment. We can just see the number of predictions we have here: it's 10. Then we can look at the class names. So here, if we count through, number five would be zero, one, two, three, four, five.
So the first one is predicting dog. And if we have a look here, the second one is number eight. So number eight here, we'll go a little further. So six, seven, eight, ship. We come down here. And what we can do is find the predictions for each one of these.
So this one is apparently a dog. It's pretty hard to tell, to be honest. This one is a ship, an automobile, which maybe isn't accurate, an airplane, a frog, another frog, an automobile, and so on. Generally speaking, most of those actually do look correct. So we have successfully managed to train our convolutional neural network on this image classification task.
And that is despite these images being very low resolution; even as a person, I struggle to figure out what exactly is in each one of those images. But that's it for this introduction to the long-reigning champions of computer vision, the convolutional neural networks. We've gone through the intuition behind these models and had a look at a few of the most popular versions as well. If we ever want to build a convolutional neural network, those past implementations are always worth looking at first, or we can simply use one of them out of the box.
And then after that, we obviously went through all this code, went through actually training a convolutional neural network and actually using it for inference as well. So as I said, convolutional neural networks are still very popular, but I think in the next, particularly in the next couple of years, they're probably mostly going to be replaced by other architectures like vision transformers and possibly other things as well.
But even so, they're super relevant even now and definitely a good thing to understand if you're working within machine learning and particularly computer vision. So I hope all of that has been helpful and insightful. Thank you very much for watching and I will see you again in the next one.
Bye.