Convolutional Neural Nets Explained and Implemented in Python (PyTorch)
Chapters
0:00 Intro
1:59 What Makes a Convolutional Neural Network
3:24 Image preprocessing for CNNs
9:15 Common components of a CNN
11:01 Components: pooling layers
12:31 Building the CNN with PyTorch
14:14 Notable CNNs
17:52 Implementation of CNNs
18:52 Image Preprocessing for CNNs
22:46 How to normalize images for CNN input
23:53 Image preprocessing pipeline with PyTorch
24:59 PyTorch data loading pipeline for CNNs
25:32 Building the CNN with PyTorch
28:08 CNN training parameters
28:49 CNN training loop
30:27 Using PyTorch CNN for inference
Convolutional neural networks have been the undisputed champions of computer vision 00:00:13.120 |
And without them, the field of artificial intelligence 00:00:28.440 |
and a plethora of manually scripted processes. 00:00:32.720 |
And all of these could very rarely be applied 00:00:35.840 |
across different use cases or different domains. 00:00:38.880 |
The result is that every dataset and every use case 00:00:54.040 |
across a broad domain of use cases or domains 00:01:04.120 |
CNNs would learn how to extract features from images. 00:01:08.540 |
And they could do this for a vast number of datasets 00:01:19.640 |
the de facto standard within the field of computer vision. 00:01:23.180 |
Now, there have been some moves to other architectures, 00:01:32.660 |
And things like multi-modality may also prove 00:01:54.340 |
and what exactly makes a convolutional neural network work 00:01:59.580 |
So let's start with what makes a convolutional neural network. 00:02:15.420 |
These convolutional layers are able to detect 00:02:18.620 |
abstract features and almost ideas within an image. 00:02:23.420 |
And we can shift these images, squash them, rotate them. 00:02:27.660 |
But if a human can still recognize the image, 00:02:32.940 |
will still be able to identify that image as well. 00:02:36.060 |
Because of their affinity to image-based applications, 00:02:40.620 |
we tend to find CNNs being used in image classification, 00:02:52.980 |
which is any neural network that satisfies two conditions. 00:03:02.500 |
And two, that it contains convolutional layers. 00:03:12.020 |
and they will contain many different features, 00:03:14.820 |
common ones include pooling, normalization layers, 00:03:19.540 |
And we'll see a few examples of the different types 00:03:22.500 |
of convolutional neural networks later on in the video. 00:03:25.020 |
Now, let's just briefly focus on what exactly 00:03:37.740 |
These arrays are followed by many more arrays 00:03:45.060 |
the weights that we call a filter or a kernel. 00:03:54.860 |
between these pixel values and the filter weights, 00:04:01.020 |
This element-wise operation followed by the sum 00:04:04.940 |
of the resulting values is often called the scalar product 00:04:14.900 |
So we have our input, which is a five by five pixel image, 00:04:18.620 |
and we have our filter, which is a three by three pixel array. 00:04:22.820 |
In this very first iteration of the convolution, 00:04:27.140 |
we can see that we are multiplying the three by three window 00:04:33.700 |
Multiply both those together in element-wise multiplication 00:04:37.260 |
to get the array that you can see on the right. 00:04:41.100 |
In there, we can see that the sum of all the values 00:04:50.380 |
Now, in reality, we would not just return a single value 00:04:54.140 |
because we would actually be sliding this window, 00:04:59.860 |
And the output of each one of those window operations 00:05:10.860 |
is what we would call a feature map or an activation map. 00:05:15.100 |
And we call it that because it is a mapping of the features found in the input image. 00:05:23.620 |
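To make this concrete, here is a minimal sketch of that windowed multiply-and-sum in PyTorch; the input and filter values are made up for illustration.

```python
import torch
import torch.nn.functional as F

# made-up 5x5 single-channel image and 3x3 filter (shapes: batch, channels, H, W)
image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)
kernel = torch.tensor([[[[0., 1., 0.],
                         [1., -4., 1.],
                         [0., 1., 0.]]]])

# one window by hand: element-wise multiply the top-left 3x3 patch, then sum
patch = image[0, 0, :3, :3]
print((patch * kernel[0, 0]).sum())

# the full convolution slides that window across the image (stride 1, no padding),
# producing a 3x3 feature map
feature_map = F.conv2d(image, kernel)
print(feature_map.shape)  # torch.Size([1, 1, 3, 3])
```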
Now, something worth noting here is obviously 00:05:32.620 |
So we always need to be mindful of excessive compression 00:05:37.300 |
and therefore data loss via dimensionality reduction. 00:05:41.260 |
And because of this, we may want to increase or decrease 00:05:44.940 |
the amount of compression that our filters create. 00:05:50.380 |
and also how quickly it moves across the image 00:05:56.940 |
The stride defines the number of pixels a filter moves with each step. With a larger stride, 00:06:04.100 |
the filter will move across the entire image in fewer steps, 00:06:12.740 |
Now, there are also some other surprising effects 00:06:16.220 |
of image compression that we need to be aware of. 00:06:21.620 |
with the border areas of an input image or input array. 00:06:37.700 |
because they start with less information in the first place. 00:06:42.620 |
we must either reduce the amount of compression 00:06:45.860 |
that we're doing using the previous techniques 00:06:49.940 |
So being mindful of the filter size and the stride, 00:06:56.940 |
So we are essentially taking the original image 00:06:59.860 |
and we're adding padding around the edge of that image. 00:07:10.140 |
to limit or prevent compression between layers. 00:07:36.020 |
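As a rough sketch of how these two levers interact: the output size of a convolution follows floor((size + 2 * padding - kernel) / stride) + 1. The layer sizes below are illustrative, not the notebook's.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # a single 3-channel 32x32 input

no_pad = nn.Conv2d(3, 8, kernel_size=3)                        # 32 -> 30: some compression
padded = nn.Conv2d(3, 8, kernel_size=3, padding=1)             # 32 -> 32: padding prevents it
strided = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)  # 32 -> 16: larger stride, more compression

print(no_pad(x).shape)   # torch.Size([1, 8, 30, 30])
print(padded(x).shape)   # torch.Size([1, 8, 32, 32])
print(strided(x).shape)  # torch.Size([1, 8, 16, 16])
```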
the first very successful convolutional neural network 00:07:39.820 |
or deep convolutional neural network was called AlexNet. 00:07:54.020 |
And the reason for this is every successive layer 00:08:00.100 |
abstracts the initial features more and more. 00:08:07.460 |
as the number of layers within the network increases. 00:08:12.540 |
as the network is able to abstract image features more 00:08:17.540 |
and get them closer to the very abstract concepts 00:08:25.740 |
So it gets closer to a human-like understanding of an image. 00:08:32.460 |
might recognize that an image contains an animal. 00:08:37.740 |
it may be able to identify that animal as a dog. 00:08:44.180 |
may allow that network to identify specific breeds 00:08:53.340 |
is far more abstract than a dog or just an animal. 00:09:16.420 |
Now, moving on to what are some of the very common features 00:09:29.940 |
we will find that in pretty much every neural network. 00:09:32.980 |
And they are used to add non-linearity to these networks. 00:09:46.220 |
Now, you may recognize a few of these activation functions. 00:09:58.260 |
activation functions that we find in neural networks. 00:10:01.220 |
In the past, CNNs often used activation functions 00:10:14.620 |
And they would typically use tanh or sigmoid activations. 00:10:20.100 |
the rectified linear unit activation function 00:10:27.700 |
which kind of kick-started this era of deep learning, 00:10:31.820 |
as it was the best performing CNN of its time 00:10:38.940 |
or ReLU activation function is still a very popular choice. 00:10:52.100 |
which basically means the aggregation of activation outputs 00:11:02.460 |
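To compare the activations just mentioned on the same made-up inputs, a quick sketch:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.relu(x))     # negatives clipped to zero, positives pass through unchanged
print(torch.tanh(x))     # squashed into (-1, 1), saturating at the extremes
print(torch.sigmoid(x))  # squashed into (0, 1), also saturating
```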
Another very important feature is the use of pooling layers. 00:11:07.780 |
because the outputs of feature maps are very sensitive 00:11:19.340 |
as it can tell us the difference between, for example, 00:11:23.900 |
based on where the eyes are, where the ears are, and so on. 00:11:28.060 |
However, we don't want that to be too dramatic 00:11:30.060 |
because if an eye is shifted two pixels to the left 00:11:45.020 |
And we need pooling layers in order to allow the model to tolerate these small shifts. 00:12:25.460 |
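A minimal sketch of that idea with max pooling; the feature maps here are made up, but they show how a one-pixel shift can disappear after pooling.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 windows, keep the max of each

fmap = torch.zeros(1, 1, 4, 4)
fmap[0, 0, 0, 0] = 1.0     # a strong activation in the top-left window
shifted = torch.zeros(1, 1, 4, 4)
shifted[0, 0, 0, 1] = 1.0  # the same activation shifted one pixel right

print(pool(fmap))     # both pooled outputs have the 1 in the top-left cell:
print(pool(shifted))  # pooling has absorbed the small shift
```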
And with that, we can move on to the final main feature 00:12:34.820 |
A fully connected linear layer is simply a neural network 00:12:42.900 |
It is the dot product between some inputs, X, 00:12:52.980 |
We will typically see a fully connected layer 00:12:55.820 |
at the end of a convolutional neural network. 00:13:00.860 |
of our convolutional neural network embeddings 00:13:03.340 |
from 3D tensors to more understandable outputs 00:13:15.620 |
vector embeddings that represent the information 00:13:41.100 |
And that will create a probability distribution 00:13:45.980 |
where each one of these nodes represents a candidate class 00:14:07.140 |
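A minimal sketch of that final step, using the 256-to-10 sizes described below for the final layer; the input embedding is random for illustration.

```python
import torch
import torch.nn as nn

fc = nn.Linear(256, 10)               # 256 activations in, one logit per class out
embedding = torch.randn(1, 256)       # stand-in for a flattened CNN embedding
logits = fc(embedding)
probs = torch.softmax(logits, dim=1)  # probability distribution over the 10 classes
print(probs.sum())                    # sums to 1 - a probability distribution
```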
Now, all of those features that we've just worked through 00:14:15.700 |
But with time, many different convolutional networks 00:14:42.420 |
convolutional neural network, which was LeNet. 00:14:44.820 |
Now, LeNet is, I think, the earliest good example 00:15:02.260 |
and they licensed it to many different big banks 00:15:05.940 |
around the globe for reading handwritten digits on checks. 00:15:22.140 |
at least on that scale, for another 14 years. 00:15:59.180 |
And it was after AlexNet that the broader community 00:16:08.260 |
on building ever deeper models with really big datasets. 00:16:25.820 |
Now VGGNet came in 2014 and dethroned AlexNet 00:16:38.180 |
but we can already see that it's a much deeper network. 00:16:41.980 |
But as core, it's still using the same process 00:16:44.820 |
of convolution layers, pooling layers, and so on. 00:16:52.940 |
Now ResNet introduced even deeper networks than ever before. 00:17:14.140 |
ResNet was actually very much inspired by VGGNet, 00:17:19.660 |
and a generally less complex network architecture. 00:17:29.060 |
is they added these shortcut connections between layers. 00:17:34.060 |
And this was to avoid the loss of information 00:17:37.740 |
over many layers with the greater depth of ResNet. 00:17:42.220 |
So adding these shortcuts or these residual connections 00:17:47.940 |
over a longer distance, which was very much required 00:17:54.780 |
for understanding convolutional neural networks. 00:17:57.300 |
What I want to do now is actually look at how 00:18:04.060 |
So we're gonna go through this notebook example here. 00:18:06.860 |
You can find this notebook in the video description, 00:18:12.460 |
or if you prefer, you can actually download the file as well. 00:18:15.100 |
We're first going to just load in the relevant libraries. 00:18:23.420 |
so we'll just also import them as transforms, 00:18:27.420 |
Now, we're gonna be working with a fair bit of data. 00:18:34.060 |
is switch from CPU to GPU if you have it available. 00:18:42.660 |
And you can just check what you have, like so. 00:18:45.660 |
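The check is just a couple of lines, something like:

```python
import torch

# use a CUDA-enabled GPU when available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
```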
So for me, I'm on a MacBook right now, so I only have CPU. 00:19:03.700 |
which is a very popular image classification dataset. 00:19:17.940 |
and their labels, of which there are 10 unique labels 00:19:21.100 |
within the dataset, hence why it's called CIFAR-10. 00:19:32.740 |
We can also see that they are Python PIL objects here. 00:19:35.460 |
So this is, I think, a plane, but yeah, it's very small. 00:19:47.620 |
that we can use as a validation or test set later on. 00:19:55.180 |
So we're going to be checking our model performance 00:20:00.500 |
You can see that we have a smaller number of items, 00:20:08.220 |
are designed to only accept a certain size of images. 00:20:18.260 |
We can modify that based on the model architecture, 00:20:20.980 |
but the model architecture that we're gonna be using later 00:20:25.940 |
So what we need to do is we can set the image size here, 00:20:32.460 |
to resize the image into whatever we put in here, 00:20:45.420 |
in which we can then feed into our model later on. 00:20:52.740 |
and that will run this transformation pipeline across them. 00:21:04.020 |
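A sketch of that pipeline, assuming a 32-pixel target size (CIFAR-10's native resolution); the exact steps in the notebook may differ slightly.

```python
import torchvision.transforms as transforms

image_size = 32
preprocess = transforms.Compose([
    transforms.Resize((image_size, image_size)),  # resize every image to a fixed size
    transforms.ToTensor(),                        # PIL image -> float tensor in [0, 1]
])
```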
We need to extract the image and its respective label. 00:21:21.500 |
so images with red, green, and blue color channels. 00:21:24.740 |
A few images in this data set are actually just grayscale, 00:21:28.980 |
Now, we're not going to colorize it or anything like that. 00:21:34.140 |
those single color channels into three color channels, 00:21:41.300 |
but we at least have those three color channels, 00:21:43.940 |
which means we can pass that directly into our model, 00:21:47.140 |
which expects image arrays with three color channels. 00:22:00.820 |
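One simple way to do that, assuming PIL inputs as in this dataset, is to convert anything that is not already RGB:

```python
from PIL import Image

def to_rgb(image: Image.Image) -> Image.Image:
    # grayscale images have one channel; convert('RGB') repeats it
    # across three channels so every input matches what the model expects
    if image.mode != 'RGB':
        return image.convert('RGB')
    return image
```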
Now let's have a look at one of those images. 00:22:08.620 |
of these modified images in our training set, 00:22:12.700 |
and each one of those is a three by 32 by 32 tensor. 00:22:17.700 |
Now, 32 by 32 is the actual pixels in the image, 00:22:22.180 |
and the three is the number of color channels, 00:22:25.580 |
And we can see the result from this transformation here. 00:22:32.100 |
One thing to note here is that all these values 00:22:34.780 |
have been normalized to between zero and one. 00:22:37.660 |
That was actually done by the preprocessing pipeline 00:22:50.420 |
is calculate the mean and standard deviation of images 00:23:09.300 |
So this is like we're sampling a smaller portion of those. 00:23:14.540 |
We're merging all these into a single three channel vector. 00:23:29.620 |
Okay, so we get these values and these values. 00:23:36.260 |
And then what we do is we just modify that normalize here. 00:23:41.260 |
Okay, so now we're just normalizing all of the tensors 00:23:56.180 |
But obviously when we're doing all of this in one go, 00:24:00.860 |
So we put all of them into a single preprocessing pipeline. 00:24:09.940 |
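A sketch of that combined pipeline; the sample tensor below is a random stand-in for the stack of preprocessed training images used in the notebook.

```python
import torch
import torchvision.transforms as transforms

# stand-in for a stack of preprocessed training tensors, shape (N, 3, 32, 32)
sample = torch.rand(1000, 3, 32, 32)
mean = sample.mean(dim=(0, 2, 3))  # one mean per color channel
std = sample.std(dim=(0, 2, 3))    # one standard deviation per color channel

preprocess = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),                                       # scale pixels to [0, 1]
    transforms.Normalize(mean=mean.tolist(), std=std.tolist()),  # then normalize per channel
])
```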
And we'll actually use that on the validation set. 00:24:15.180 |
And we'll just rerun the same thing as before, 00:24:17.020 |
but obviously this time we have that normalization step 00:24:23.900 |
we are probably going to want to do it in batches. 00:24:26.780 |
So we want to pass through a batch of input data 00:24:34.300 |
So passing rather than passing a single image at a time 00:24:38.700 |
we're passing everything through in batches of, 00:24:43.220 |
And this is because neural networks can be parallelized. 00:24:46.860 |
And that allows us to take advantage of parallelization, 00:24:50.580 |
which means we're performing many calculations 00:24:53.540 |
across all of these different images in a single batch 00:25:03.500 |
we need to load everything into our data loader. 00:25:06.460 |
So this is going to handle the loading of data 00:25:14.500 |
so that we don't have like the same set of images 00:25:18.100 |
if they're not shuffled already within the data set. 00:25:24.580 |
We want every single batch to be as representative 00:25:31.940 |
for both the validation and the training set. 00:25:35.660 |
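A sketch of that loading step; the batch size of 64 and the random stand-in dataset are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in for the transformed CIFAR-10 training set: image tensors plus integer labels
trainset = TensorDataset(torch.rand(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# shuffle so every batch is a representative mix of the classes
train_loader = DataLoader(trainset, batch_size=64, shuffle=True)

for images, labels in train_loader:
    print(images.shape, labels.shape)  # torch.Size([64, 3, 32, 32]) torch.Size([64])
    break
```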
What we are going to do now is actually build our convolutional neural network. 00:25:35.660 |
In reality, if we compare it to a lot of the networks 00:25:48.620 |
we looked at before, it's not actually that deep. 00:25:56.180 |
But you should note that there are a few things 00:26:11.540 |
We have a few of those in order to get our predictions. 00:26:18.060 |
is that we have a number of input channels here. 00:26:20.300 |
This must align with the number of input color channels 00:26:20.300 |
with a kernel or a window of four by four pixels 00:26:40.740 |
We also add some padding to reduce the amount of compression. 00:26:58.020 |
So the depth initially is three, the color channels, 00:26:58.020 |
the depth from the output of that is actually 64. 00:27:03.380 |
And that gets deeper and deeper as we go through 00:27:17.780 |
into the fully connected layers at the end there. 00:27:21.700 |
And the final output, so this is our final layer. 00:27:25.220 |
The input of that layer is 256 activations or nodes. 00:27:29.620 |
And the output is the number of classes, which is 10. 00:27:37.820 |
The next bit here is the forward step through the network. 00:27:37.820 |
So it's identifying or it's defining the process 00:27:47.540 |
that we move through each one of these layers, 00:27:56.500 |
But it's actually here that the order is defined. 00:28:03.940 |
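Putting that together, here is a sketch reconstruction of the network described; the exact layer sizes in the notebook may differ, but the three input channels, four-by-four kernels with padding, growing depth, and the final 256-to-10 layer follow the description above.

```python
import torch
import torch.nn as nn

class ConvNeuralNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # feature extractor: depth grows 3 -> 64 -> 128 -> 256 through the layers
        self.conv1 = nn.Conv2d(3, 64, kernel_size=4, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=4, padding=1)
        self.conv3 = nn.Conv2d(128, 256, kernel_size=4, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.relu = nn.ReLU()
        # fully connected head: flattened features -> 256 activations -> 10 classes
        self.fc1 = nn.Linear(256 * 3 * 3, 256)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        # the forward step defines the order the layers are applied in
        x = self.pool(self.relu(self.conv1(x)))  # (3, 32, 32) -> (64, 15, 15)
        x = self.pool(self.relu(self.conv2(x)))  # -> (128, 7, 7)
        x = self.pool(self.relu(self.conv3(x)))  # -> (256, 3, 3)
        x = x.flatten(start_dim=1)               # -> (2304,)
        x = self.relu(self.fc1(x))               # -> (256,)
        return self.fc2(x)                       # -> (10,) class logits

model = ConvNeuralNet(num_classes=10)
```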
And that is how a convolutional neural network, 00:28:03.940 |
and we move on to setting everything up for training. 00:28:24.780 |
This is used when we have a classification task, 00:28:33.700 |
And we're going to use stochastic gradient descent here 00:28:44.300 |
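A sketch of that setup, reusing the model from above; the learning rate and momentum values are assumptions, not taken from the notebook.

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()  # standard loss for multi-class classification
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # stochastic gradient descent
```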
You can use fewer or more epochs depending on what you're seeing 00:28:44.300 |
And within each epoch, we run through the entire dataset. 00:29:01.020 |
And we load that from the training data loader. 00:29:16.820 |
And then from there, we calculate the loss 00:29:18.780 |
between the predicted values and the true values. 00:29:38.620 |
We then switch the model to evaluation mode so that we can calculate the validation loss 00:29:40.460 |
and just check that our model is not overfitting. 00:29:44.620 |
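A sketch of the loop described, reusing the model, criterion, optimizer, and train_loader from above; the epoch count is an assumption, and `val_loader` stands in for a loader over the validation set.

```python
import torch

epochs = 10  # assumed; use fewer or more depending on the loss curves
for epoch in range(epochs):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()              # reset gradients from the previous step
        outputs = model(images)            # forward pass: predicted logits
        loss = criterion(outputs, labels)  # loss between predictions and true labels
        loss.backward()                    # backpropagate
        optimizer.step()                   # update the weights

    # switch to evaluation mode and check validation loss for overfitting
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
    print(f"epoch {epoch}: validation loss {val_loss / len(val_loader):.3f}")
```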
And we will get something that looks like this. 00:29:47.780 |
So what is important here is: if you see the training loss decreasing while the validation loss increases, 00:29:54.020 |
that means that you're probably training for too many steps 00:29:58.980 |
and the model is overfitting to the training data. 00:30:02.500 |
So in that case, be wary and stop training earlier. 00:30:18.540 |
that we'll get near the end here is around 80%. 00:30:32.220 |
Let's have a look at how we can use it for inference. 00:30:34.340 |
So by inference, I mean making predictions. 00:30:40.300 |
You know, if we didn't already have it in the notebook, 00:30:43.420 |
we'd make sure we switch it to evaluation mode 00:30:45.820 |
and also move it to the device if we have a CUDA-enabled GPU again. 00:30:49.940 |
Pretending that we're not in the same notebook, 00:30:58.740 |
In this example, of course, in real scenarios, 00:31:08.180 |
that we set up before, the preprocessing pipeline. 00:31:11.100 |
We can check the number of tensors we have there. 00:31:13.380 |
So we're just taking the first 10 as an example. 00:31:16.300 |
All of those are the three by 32 by 32 that we saw before. 00:31:27.700 |
or 10 image arrays, you can think of them like that, 00:31:39.580 |
And then from there, we can use the torch argmax function 00:31:43.140 |
in order to find the value within the output logits 00:31:55.580 |
position number five had the highest activation. 00:31:59.260 |
that this model is whatever class number five is. 00:32:05.900 |
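A sketch of those inference steps end to end, reusing the model from above; the random batch stands in for the first ten preprocessed validation images.

```python
import torch

model.eval()  # switch out of training mode for inference
with torch.no_grad():
    batch = torch.rand(10, 3, 32, 32)    # stand-in for ten (3, 32, 32) image tensors
    logits = model(batch)                # shape (10, 10): ten images, ten class logits each
    preds = torch.argmax(logits, dim=1)  # index of the highest activation per image
print(preds)                             # ten predicted class numbers
```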
We can just see the number of predictions we have here. 00:32:12.900 |
the number five would be zero, one, two, three, four, five. 00:32:20.180 |
And if we have a look here, the second one is number eight. 00:32:23.940 |
So number eight here, we'll go a little further. 00:32:45.540 |
a frog, another frog, automobile, and so on. 00:33:04.700 |
And that is despite these images being very low resolution. 00:33:17.740 |
to the long reigning champions of computer vision, 00:33:24.460 |
We've gone through the intuition behind these models 00:33:29.180 |
and had a look at a few of the most popular versions 00:33:32.140 |
of these as well. This should always inform us: 00:33:36.460 |
if we want to build a convolutional neural network, 00:33:39.700 |
we should have a look at those past implementations 00:33:42.180 |
and what they did, or just use one of them out of the box. 00:33:49.220 |
went through actually training a convolutional neural network 00:33:58.020 |
are still very popular, but I think in the next, 00:34:06.540 |
by other architectures like vision transformers.