Convolutional Neural Nets Explained and Implemented in Python (PyTorch)
Chapters
0:00 Intro
1:59 What Makes a Convolutional Neural Network
3:24 Image preprocessing for CNNs
9:15 Common components of a CNN
11:01 Components: pooling layers
12:31 Building the CNN with PyTorch
14:14 Notable CNNs
17:52 Implementation of CNNs
18:52 Image Preprocessing for CNNs
22:46 How to normalize images for CNN input
23:53 Image preprocessing pipeline with PyTorch
24:59 PyTorch data loading pipeline for CNNs
25:32 Building the CNN with PyTorch
28:08 CNN training parameters
28:49 CNN training loop
30:27 Using PyTorch CNN for inference
Convolutional neural networks have been the undisputed champions of computer vision 00:00:13.120 |
And without them, the field of artificial intelligence 00:00:28.440 |
and a plethora of manually scripted processes. 00:00:32.720 |
And all of these could very rarely be applied 00:00:35.840 |
across different use cases or different domains. 00:00:38.880 |
The result is that every dataset and every use case 00:00:54.040 |
across a broad domain of use cases or domains 00:01:04.120 |
CNNs would learn how to extract features from images. 00:01:08.540 |
And they could do this for a vast number of datasets 00:01:19.640 |
the de facto standard within the field of computer vision. 00:01:23.180 |
Now, there have been some moves to other architectures, 00:01:32.660 |
And things like multi-modality may also prove 00:01:54.340 |
and what exactly makes a convolutional neural network work 00:01:59.580 |
So let's start with what makes a convolutional neural network. 00:02:15.420 |
These convolutional layers are able to detect 00:02:18.620 |
abstract features and almost ideas within an image. 00:02:23.420 |
And we can shift these images, squash them, rotate them. 00:02:27.660 |
But if a human can still recognize the image, 00:02:32.940 |
will still be able to identify that image as well. 00:02:36.060 |
Because of their affinity to image-based applications, 00:02:40.620 |
we tend to find CNNs being used in image classification, 00:02:52.980 |
which is any neural network that satisfies two conditions. 00:03:02.500 |
And two, that it contains convolutional layers. 00:03:12.020 |
and they will contain many different features, 00:03:14.820 |
common ones include pooling, normalization layers, 00:03:19.540 |
And we'll see a few examples of the different types 00:03:22.500 |
of convolutional neural networks later on in the video. 00:03:25.020 |
Now, let's just briefly focus on what exactly 00:03:37.740 |
These arrays are followed by many more arrays 00:03:45.060 |
the weights that we call a filter or a kernel. 00:03:54.860 |
between these pixel values and the filter weights, 00:04:01.020 |
This element-wise operation followed by the sum 00:04:04.940 |
of the resulting values is often called the scalar product 00:04:14.900 |
So we have our input, which is a five by five pixel image, 00:04:18.620 |
and we have our filter, which is a three by three pixel array. 00:04:22.820 |
In this very first iteration of the convolution, 00:04:27.140 |
we can see that we are multiplying the three by three window 00:04:33.700 |
Multiply both those together in element-wise multiplication 00:04:37.260 |
to get the array that you can see on the right. 00:04:41.100 |
In there, we can see that the sum of all the values 00:04:50.380 |
Now, in reality, we would not just return a single value 00:04:54.140 |
because we would actually be sliding this window, 00:04:59.860 |
And the output of each one of those window operations 00:05:10.860 |
is what we would call a feature map or an activation map. 00:05:15.100 |
And we call it that because it is a mapping of the features found in the input image. 00:05:23.620 |
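To make this concrete, here is a minimal sketch of that windowed multiply-and-sum in PyTorch; the input and filter values are made up for illustration.

```python
import torch
import torch.nn.functional as F

# made-up 5x5 single-channel image and 3x3 filter (shapes: batch, channels, H, W)
image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)
kernel = torch.tensor([[[[0., 1., 0.],
                         [1., -4., 1.],
                         [0., 1., 0.]]]])

# one window by hand: element-wise multiply the top-left 3x3 patch, then sum
patch = image[0, 0, :3, :3]
print((patch * kernel[0, 0]).sum())

# the full convolution slides that window across the image (stride 1, no padding),
# producing a 3x3 feature map
feature_map = F.conv2d(image, kernel)
print(feature_map.shape)  # torch.Size([1, 1, 3, 3])
```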
Now, something worth noting here is obviously 00:05:32.620 |
So we always need to be mindful of excessive compression 00:05:37.300 |
and therefore data loss via dimensionality reduction. 00:05:41.260 |
And because of this, we may want to increase or decrease 00:05:44.940 |
the amount of compression that our filters create. 00:05:50.380 |
and also how quickly it moves across the image 00:05:56.940 |
The stride defines the number of pixels a filter moves with each step. With a larger stride, 00:06:04.100 |
the filter will move across the entire image in fewer steps, 00:06:12.740 |
Now, there are also some other surprising effects 00:06:16.220 |
of image compression that we need to be aware of. 00:06:21.620 |
with the border areas of an input image or input array. 00:06:37.700 |
because they start with less information in the first place. 00:06:42.620 |
we must either reduce the amount of compression 00:06:45.860 |
that we're doing using the previous techniques 00:06:49.940 |
So being mindful of the filter size and the stride, 00:06:56.940 |
So we are essentially taking the original image 00:06:59.860 |
and we're adding padding around the edge of that image. 00:07:10.140 |
to limit or prevent compression between layers. 00:07:36.020 |
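As a rough sketch of how these two levers interact: the output size of a convolution follows floor((size + 2 * padding - kernel) / stride) + 1. The layer sizes below are illustrative, not the notebook's.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # a single 3-channel 32x32 input

no_pad = nn.Conv2d(3, 8, kernel_size=3)                        # 32 -> 30: some compression
padded = nn.Conv2d(3, 8, kernel_size=3, padding=1)             # 32 -> 32: padding prevents it
strided = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)  # 32 -> 16: larger stride, more compression

print(no_pad(x).shape)   # torch.Size([1, 8, 30, 30])
print(padded(x).shape)   # torch.Size([1, 8, 32, 32])
print(strided(x).shape)  # torch.Size([1, 8, 16, 16])
```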
the first very successful convolutional neural network 00:07:39.820 |
or deep convolutional neural network was called AlexNet. 00:07:54.020 |
And the reason for this is every successive layer 00:08:00.100 |
abstracts the initial features more and more. 00:08:07.460 |
as the number of layers within the network increases. 00:08:12.540 |
as the network is able to abstract image features more 00:08:17.540 |
and get them closer to the very abstract concepts 00:08:25.740 |
So it gets closer to a human-like understanding of an image. 00:08:32.460 |
might recognize that an image contains an animal. 00:08:37.740 |
it may be able to identify that animal as a dog. 00:08:44.180 |
may allow that network to identify specific breeds 00:08:53.340 |
is far more abstract than a dog or just an animal. 00:09:16.420 |
Now, moving on to what are some of the very common features 00:09:29.940 |
we will find that in pretty much every neural network. 00:09:32.980 |
And they are used to add non-linearity to these networks. 00:09:46.220 |
Now, you may recognize a few of these activation functions. 00:09:58.260 |
activation functions that we find in neural networks. 00:10:01.220 |
In the past, CNNs often used activation functions 00:10:14.620 |
And they would typically use tanh or sigmoid activations. 00:10:20.100 |
the rectified linear unit activation function 00:10:27.700 |
which kind of kick-started this era of deep learning, 00:10:31.820 |
as it was the best performing CNN of its time 00:10:38.940 |
or ReLU activation function is still a very popular choice. 00:10:52.100 |
which basically means the aggregation of activation outputs 00:11:02.460 |
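To compare the activations just mentioned on the same made-up inputs, a quick sketch:

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.relu(x))     # negatives clipped to zero, positives pass through unchanged
print(torch.tanh(x))     # squashed into (-1, 1), saturating at the extremes
print(torch.sigmoid(x))  # squashed into (0, 1), also saturating
```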
Another very important feature is the use of pooling layers. 00:11:07.780 |
because the outputs of feature maps are very sensitive 00:11:19.340 |
as it can tell us the difference between, for example, 00:11:23.900 |
based on where the eyes are, where the ears are, and so on. 00:11:28.060 |
However, we don't want that to be too dramatic 00:11:30.060 |
because if an eye is shifted two pixels to the left 00:11:45.020 |
And we need pooling layers in order to allow the model to tolerate these small shifts. 00:12:25.460 |
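A minimal sketch of that idea with max pooling; the feature maps here are made up, but they show how a one-pixel shift can disappear after pooling.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 windows, keep the max of each

fmap = torch.zeros(1, 1, 4, 4)
fmap[0, 0, 0, 0] = 1.0     # a strong activation in the top-left window
shifted = torch.zeros(1, 1, 4, 4)
shifted[0, 0, 0, 1] = 1.0  # the same activation shifted one pixel right

print(pool(fmap))     # both pooled outputs have the 1 in the top-left cell:
print(pool(shifted))  # pooling has absorbed the small shift
```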
And with that, we can move on to the final main feature 00:12:34.820 |
A fully connected linear layer is simply a neural network 00:12:42.900 |
It is the dot product between some inputs, X, 00:12:52.980 |
We will typically see a fully connected layer 00:12:55.820 |
at the end of a convolutional neural network. 00:13:00.860 |
of our convolutional neural network embeddings 00:13:03.340 |
from 3D tensors to more understandable outputs 00:13:15.620 |
vector embeddings that represent the information 00:13:41.100 |
And that will create a probability distribution 00:13:45.980 |
where each one of these nodes represents a candidate class 00:14:07.140 |
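A minimal sketch of that final step, using the 256-to-10 sizes described below for the final layer; the input embedding is random for illustration.

```python
import torch
import torch.nn as nn

fc = nn.Linear(256, 10)               # 256 activations in, one logit per class out
embedding = torch.randn(1, 256)       # stand-in for a flattened CNN embedding
logits = fc(embedding)
probs = torch.softmax(logits, dim=1)  # probability distribution over the 10 classes
print(probs.sum())                    # sums to 1 - a probability distribution
```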
Now, all of those features that we've just worked through 00:14:15.700 |
But with time, many different convolutional networks 00:14:42.420 |
convolutional neural network, which was LeNet. 00:14:44.820 |
Now, LeNet is, I think, the earliest good example 00:15:02.260 |
and they licensed it to many different big banks 00:15:05.940 |
around the globe for reading handwritten digits on checks. 00:15:22.140 |
at least on that scale, for another 14 years. 00:15:59.180 |
And it was after AlexNet that the broader community 00:16:08.260 |
on building ever deeper models with really big datasets. 00:16:25.820 |
Now VGGNet came in 2014 and dethroned AlexNet 00:16:38.180 |
but we can already see that it's a much deeper network. 00:16:41.980 |
But as core, it's still using the same process 00:16:44.820 |
of convolution layers, pooling layers, and so on. 00:16:52.940 |
Now ResNet introduced even deeper networks than ever before. 00:17:14.140 |
ResNet was actually very much inspired by VGGNet, 00:17:19.660 |
and a generally less complex network architecture. 00:17:29.060 |
is they added these shortcut connections between layers. 00:17:34.060 |
And this was to avoid the loss of information 00:17:37.740 |
over many layers with the greater depth of ResNet. 00:17:42.220 |
So adding these shortcuts or these residual connections 00:17:47.940 |
over a longer distance, which was very much required 00:17:54.780 |
for understanding convolutional neural networks. 00:17:57.300 |
What I want to do now is actually look at how 00:18:04.060 |
So we're gonna go through this notebook example here. 00:18:06.860 |
You can find this notebook in the video description, 00:18:12.460 |
or if you prefer, you can actually download the file as well. 00:18:15.100 |
We're first going to just load in the relevant libraries. 00:18:23.420 |
so we'll just also import them as transforms, 00:18:27.420 |
Now, we're gonna be working with a fair bit of data. 00:18:34.060 |
is switch from CPU to GPU if you have it available. 00:18:42.660 |
And you can just check what you have, like so. 00:18:45.660 |
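The check is just a couple of lines, something like:

```python
import torch

# use a CUDA-enabled GPU when available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
```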
So for me, I'm on a MacBook right now, so I only have CPU. 00:19:03.700 |
which is a very popular image classification dataset. 00:19:17.940 |
and their labels, of which there are 10 unique labels 00:19:21.100 |
within the dataset, hence why it's called CIFAR-10. 00:19:32.740 |
We can also see that they are Python PIL objects here. 00:19:35.460 |
So this is, I think, a plane, but yeah, it's very small. 00:19:47.620 |
that we can use as a validation or test set later on. 00:19:55.180 |
So we're going to be checking our model performance 00:20:00.500 |
You can see that we have a smaller number of items, 00:20:08.220 |
are designed to only accept a certain size of images. 00:20:18.260 |
We can modify that based on the model architecture, 00:20:20.980 |
but the model architecture that we're gonna be using later 00:20:25.940 |
So what we need to do is we can set the image size here, 00:20:32.460 |
to resize the image into whatever we put in here, 00:20:45.420 |
in which we can then feed into our model later on. 00:20:52.740 |
and that will run this transformation pipeline across them. 00:21:04.020 |
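A sketch of that pipeline, assuming a 32-pixel target size (CIFAR-10's native resolution); the exact steps in the notebook may differ slightly.

```python
import torchvision.transforms as transforms

image_size = 32
preprocess = transforms.Compose([
    transforms.Resize((image_size, image_size)),  # resize every image to a fixed size
    transforms.ToTensor(),                        # PIL image -> float tensor in [0, 1]
])
```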
We need to extract the image and its respective label. 00:21:21.500 |
so images with red, green, and blue color channels. 00:21:24.740 |
A few images in this data set are actually just grayscale, 00:21:28.980 |
Now, we're not going to colorize it or anything like that. 00:21:34.140 |
those single color channels into three color channels, 00:21:41.300 |
but we at least have those three color channels, 00:21:43.940 |
which means we can pass that directly into our model, 00:21:47.140 |
which expects image arrays with three color channels. 00:22:00.820 |
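One simple way to do that, assuming PIL inputs as in this dataset, is to convert anything that is not already RGB:

```python
from PIL import Image

def to_rgb(image: Image.Image) -> Image.Image:
    # grayscale images have one channel; convert('RGB') repeats it
    # across three channels so every input matches what the model expects
    if image.mode != 'RGB':
        return image.convert('RGB')
    return image
```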
Now let's have a look at one of those images. 00:22:08.620 |
of these modified images in our training set, 00:22:12.700 |
and each one of those is a three by 32 by 32 tensor. 00:22:17.700 |
Now, 32 by 32 is the actual pixels in the image, 00:22:22.180 |
and the three is the number of color channels, 00:22:25.580 |
And we can see the result from this transformation here. 00:22:32.100 |
One thing to note here is that all these values 00:22:34.780 |
have been normalized to between zero and one. 00:22:37.660 |
That was actually done by the preprocessing pipeline 00:22:50.420 |
is calculate the mean and standard deviation of images 00:23:09.300 |
So this is like we're sampling a smaller portion of those. 00:23:14.540 |
We're merging all these into a single three channel vector. 00:23:29.620 |
Okay, so we get these values and these values. 00:23:36.260 |
And then what we do is we just modify that normalize here. 00:23:41.260 |
Okay, so now we're just normalizing all of the tensors 00:23:56.180 |
But obviously when we're doing all of this in one go, 00:24:00.860 |
So we put all of them into a single preprocessing pipeline. 00:24:09.940 |
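A sketch of that combined pipeline; the sample tensor below is a random stand-in for the stack of preprocessed training images used in the notebook.

```python
import torch
import torchvision.transforms as transforms

# stand-in for a stack of preprocessed training tensors, shape (N, 3, 32, 32)
sample = torch.rand(1000, 3, 32, 32)
mean = sample.mean(dim=(0, 2, 3))  # one mean per color channel
std = sample.std(dim=(0, 2, 3))    # one standard deviation per color channel

preprocess = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),                                       # scale pixels to [0, 1]
    transforms.Normalize(mean=mean.tolist(), std=std.tolist()),  # then normalize per channel
])
```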
And we'll actually use that on the validation set. 00:24:15.180 |
And we'll just rerun the same thing as before, 00:24:17.020 |
but obviously this time we have that normalization step 00:24:23.900 |
we are probably going to want to do it in batches. 00:24:26.780 |
So we want to pass through a batch of input data 00:24:34.300 |
So passing rather than passing a single image at a time 00:24:38.700 |
we're passing everything through in batches of, 00:24:43.220 |
And this is because neural networks can be parallelized. 00:24:46.860 |
And that allows us to take advantage of parallelization, 00:24:50.580 |
which means we're performing many calculations 00:24:53.540 |
across all of these different images in a single batch 00:25:03.500 |
we need to load everything into our data loader. 00:25:06.460 |
So this is going to handle the loading of data 00:25:14.500 |
so that we don't have like the same set of images 00:25:18.100 |
if they're not shuffled already within the data set. 00:25:24.580 |
We want every single batch to be as representative 00:25:31.940 |
for both the validation and the training set. 00:25:35.660 |
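A sketch of that loading step; the batch size of 64 and the random stand-in dataset are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in for the transformed CIFAR-10 training set: image tensors plus integer labels
trainset = TensorDataset(torch.rand(256, 3, 32, 32), torch.randint(0, 10, (256,)))

# shuffle so every batch is a representative mix of the classes
train_loader = DataLoader(trainset, batch_size=64, shuffle=True)

for images, labels in train_loader:
    print(images.shape, labels.shape)  # torch.Size([64, 3, 32, 32]) torch.Size([64])
    break
```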
What we are going to do now is actually build our convolutional neural network. 00:25:35.660 |
In reality, if we compare it to a lot of the networks 00:25:48.620 |
we looked at before, it's not actually that deep. 00:25:56.180 |
But you should note that there are a few things 00:26:11.540 |
We have a few of those in order to get our predictions. 00:26:18.060 |
is that we have a number of input channels here. 00:26:20.300 |
This must align with the number of input color channels 00:26:20.300 |
with a kernel or a window of four by four pixels 00:26:40.740 |
We also add some padding to reduce the amount of compression. 00:26:58.020 |
So the depth initially is three, the color channels, 00:26:58.020 |
the depth from the output of that is actually 64. 00:27:03.380 |
And that gets deeper and deeper as we go through 00:27:17.780 |
into the fully connected layers at the end there. 00:27:21.700 |
And the final output, so this is our final layer. 00:27:25.220 |
The input of that layer is 256 activations or nodes. 00:27:29.620 |
And the output is the number of classes, which is 10. 00:27:37.820 |
The next bit here is the forward step through the network. 00:27:37.820 |
So it's identifying or it's defining the process 00:27:47.540 |
that we move through each one of these layers, 00:27:56.500 |
But it's actually here that the order is defined. 00:28:03.940 |
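Putting that together, here is a sketch reconstruction of the network described; the exact layer sizes in the notebook may differ, but the three input channels, four-by-four kernels with padding, growing depth, and the final 256-to-10 layer follow the description above.

```python
import torch
import torch.nn as nn

class ConvNeuralNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # feature extractor: depth grows 3 -> 64 -> 128 -> 256 through the layers
        self.conv1 = nn.Conv2d(3, 64, kernel_size=4, padding=1)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=4, padding=1)
        self.conv3 = nn.Conv2d(128, 256, kernel_size=4, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.relu = nn.ReLU()
        # fully connected head: flattened features -> 256 activations -> 10 classes
        self.fc1 = nn.Linear(256 * 3 * 3, 256)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        # the forward step defines the order the layers are applied in
        x = self.pool(self.relu(self.conv1(x)))  # (3, 32, 32) -> (64, 15, 15)
        x = self.pool(self.relu(self.conv2(x)))  # -> (128, 7, 7)
        x = self.pool(self.relu(self.conv3(x)))  # -> (256, 3, 3)
        x = x.flatten(start_dim=1)               # -> (2304,)
        x = self.relu(self.fc1(x))               # -> (256,)
        return self.fc2(x)                       # -> (10,) class logits

model = ConvNeuralNet(num_classes=10)
```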
And that is how a convolutional neural network, 00:28:03.940 |
and we move on to setting everything up for training. 00:28:24.780 |
This is used when we have a classification task, 00:28:33.700 |
And we're going to use stochastic gradient descent here 00:28:44.300 |
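A sketch of that setup, reusing the model from above; the learning rate and momentum values are assumptions, not taken from the notebook.

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()  # standard loss for multi-class classification
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # stochastic gradient descent
```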
You can use fewer or more epochs depending on what you're seeing 00:28:44.300 |
And within each epoch, we run through the entire dataset. 00:29:01.020 |
And we load that from the training data loader. 00:29:16.820 |
And then from there, we calculate the loss 00:29:18.780 |
between the predicted values and the true values. 00:29:38.620 |
We then switch the model to evaluation mode so that we can calculate the validation loss 00:29:40.460 |
and just check that our model is not overfitting. 00:29:44.620 |
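A sketch of the loop described, reusing the model, criterion, optimizer, and train_loader from above; the epoch count is an assumption, and `val_loader` stands in for a loader over the validation set.

```python
import torch

epochs = 10  # assumed; use fewer or more depending on the loss curves
for epoch in range(epochs):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()              # reset gradients from the previous step
        outputs = model(images)            # forward pass: predicted logits
        loss = criterion(outputs, labels)  # loss between predictions and true labels
        loss.backward()                    # backpropagate
        optimizer.step()                   # update the weights

    # switch to evaluation mode and check validation loss for overfitting
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
    print(f"epoch {epoch}: validation loss {val_loss / len(val_loader):.3f}")
```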
And we will get something that looks like this. 00:29:47.780 |
So what is important here is: if you see the training loss decreasing while the validation loss increases, 00:29:54.020 |
that means that you're probably training for too many steps 00:29:58.980 |
and the model is overfitting to the training data. 00:30:02.500 |
So in that case, be wary and stop training earlier. 00:30:18.540 |
that we'll get near the end here is around 80%. 00:30:32.220 |
Let's have a look at how we can use it for inference. 00:30:34.340 |
So by inference, I mean making predictions. 00:30:40.300 |
You know, if we didn't already have it in the notebook, 00:30:43.420 |
we'd make sure we switch it to evaluation mode 00:30:45.820 |
and also move it to the device if we have a CUDA-enabled GPU again. 00:30:49.940 |
Pretending that we're not in the same notebook, 00:30:58.740 |
In this example, of course, in real scenarios, 00:31:08.180 |
that we set up before, the preprocessing pipeline. 00:31:11.100 |
We can check the number of tensors we have there. 00:31:13.380 |
So we're just taking the first 10 as an example. 00:31:16.300 |
All of those are the three by 32 by 32 that we saw before. 00:31:27.700 |
or 10 image arrays, you can think of them like that, 00:31:39.580 |
And then from there, we can use the torch argmax function 00:31:43.140 |
in order to find the value within the output logits 00:31:55.580 |
position number five had the highest activation. 00:31:59.260 |
that this model is whatever class number five is. 00:32:05.900 |
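A sketch of those inference steps end to end, reusing the model from above; the random batch stands in for the first ten preprocessed validation images.

```python
import torch

model.eval()  # switch out of training mode for inference
with torch.no_grad():
    batch = torch.rand(10, 3, 32, 32)    # stand-in for ten (3, 32, 32) image tensors
    logits = model(batch)                # shape (10, 10): ten images, ten class logits each
    preds = torch.argmax(logits, dim=1)  # index of the highest activation per image
print(preds)                             # ten predicted class numbers
```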
We can just see the number of predictions we have here. 00:32:12.900 |
the number five would be zero, one, two, three, four, five. 00:32:20.180 |
And if we have a look here, the second one is number eight. 00:32:23.940 |
So number eight here, we'll go a little further. 00:32:45.540 |
a frog, another frog, automobile, and so on. 00:33:04.700 |
And that is despite these images being very low resolution. 00:33:17.740 |
to the long reigning champions of computer vision, 00:33:24.460 |
We've gone through the intuition behind these models 00:33:29.180 |
and had a look at a few of the most popular versions 00:33:32.140 |
of these as well. This should always inform us: 00:33:36.460 |
if we want to build a convolutional neural network, 00:33:39.700 |
we should have a look at those past implementations 00:33:42.180 |
and what they did, or just use one of them out of the box. 00:33:49.220 |
went through actually training a convolutional neural network 00:33:58.020 |
are still very popular, but I think in the next, 00:34:06.540 |
by other architectures like vision transformers.