
Convolutional Neural Nets Explained and Implemented in Python (PyTorch)


Chapters

0:00 Intro
1:59 What Makes a Convolutional Neural Network
3:24 Image preprocessing for CNNs
9:15 Common components of a CNN
11:01 Components: pooling layers
12:31 Building the CNN with PyTorch
14:14 Notable CNNs
17:52 Implementation of CNNs
18:52 Image Preprocessing for CNNs
22:46 How to normalize images for CNN input
23:53 Image preprocessing pipeline with PyTorch
24:59 PyTorch data loading pipeline for CNNs
25:32 Building the CNN with PyTorch
28:08 CNN training parameters
28:49 CNN training loop
30:27 Using PyTorch CNN for inference

Whisper Transcript

00:00:00.000 | Convolutional neural networks, or CNNs,
00:00:02.400 | have been the undisputed champions of computer vision
00:00:06.800 | for almost a decade.
00:00:08.180 | Their widespread adoption kick-started
00:00:11.000 | the rise of deep learning.
00:00:13.120 | And without them, the field of artificial intelligence
00:00:16.680 | would look very different today.
00:00:19.400 | Before deep learning with CNNs,
00:00:21.680 | computer vision relied on very brittle
00:00:25.680 | edge detection algorithms, color profiling,
00:00:28.440 | and a plethora of manually scripted processes.
00:00:32.720 | And all of these could very rarely be applied
00:00:35.840 | across different use cases or different domains.
00:00:38.880 | The result is that every dataset and every use case
00:00:42.760 | required a specific implementation,
00:00:45.980 | which required manual intervention
00:00:48.640 | and domain-specific knowledge,
00:00:50.720 | which meant that applying these techniques
00:00:54.040 | across a broad range of use cases or domains
00:00:57.560 | was just not practical.
00:00:59.480 | Deep-layered CNNs changed this.
00:01:01.920 | Rather than manual feature extraction,
00:01:04.120 | CNNs would learn how to extract features from images.
00:01:08.540 | And they could do this for a vast number of datasets
00:01:10.720 | and a vast number of use cases.
00:01:12.400 | All they needed was training data.
00:01:14.480 | Big training datasets and deep-layered
00:01:17.320 | convolutional neural networks have remained
00:01:19.640 | the de facto standard within the field of computer vision.
00:01:23.180 | Now, there have been some moves to other architectures,
00:01:27.740 | like for example, the vision transformer,
00:01:30.360 | which we covered in an earlier video.
00:01:32.660 | And things like multi-modality may also prove
00:01:36.260 | to be another catalyst that helps us
00:01:39.220 | move on to better architectures than CNNs.
00:01:42.980 | But for now, CNNs are still the standard
00:01:46.500 | when it comes to computer vision.
00:01:48.500 | In this very visual and hands-on video,
00:01:52.140 | we're going to learn why that is
00:01:54.340 | and what exactly makes a convolutional neural network work
00:01:57.660 | and how we can actually use them ourselves.
00:01:59.580 | So let's start with what makes a convolutional neural network.
00:02:03.180 | By and large, CNNs are neural networks
00:02:05.860 | that are known for their performance
00:02:08.940 | on image datasets and image tasks.
00:02:12.300 | They're characterized by something called
00:02:13.940 | a convolutional layer.
00:02:15.420 | These convolutional layers are able to detect
00:02:18.620 | abstract features, almost ideas, within an image.
00:02:23.420 | And we can shift these images, squash them, rotate them.
00:02:27.660 | But if a human can still recognize the image,
00:02:30.700 | the chances are a well-trained CNN
00:02:32.940 | will still be able to identify that image as well.
00:02:36.060 | Because of their affinity to image-based applications,
00:02:40.620 | we tend to find CNNs being used in image classification,
00:02:45.580 | object detection, and many other tasks
00:02:48.060 | within the realm of computer vision.
00:02:50.780 | Now, we're focusing on deep-layered CNNs,
00:02:52.980 | which are any neural network that satisfies two conditions.
00:02:57.140 | The first is that it contains many layers,
00:03:00.420 | i.e. is a deep neural network.
00:03:02.500 | And two, that it contains convolutional layers.
00:03:06.300 | Beyond that, convolutional neural networks
00:03:08.420 | can consist of many different architectures,
00:03:12.020 | and they will contain many different features,
00:03:14.820 | common ones include pooling, normalization layers,
00:03:18.180 | and also linear layers.
00:03:19.540 | And we'll see a few examples of the different types
00:03:22.500 | of convolutional neural networks later on in the video.
00:03:25.020 | Now, let's just briefly focus on what exactly
00:03:28.300 | a convolutional layer is actually doing.
00:03:30.660 | So we can think of an image
00:03:33.940 | as a big array of activation values.
00:03:37.740 | These arrays are followed by many more arrays
00:03:41.780 | of initially randomly initialized values,
00:03:45.060 | the weights that we call a filter or a kernel.
00:03:49.100 | A convolutional layer is nothing more
00:03:52.060 | than an element-wise multiplication
00:03:54.860 | between these pixel values and the filter weights,
00:03:58.900 | which are then summed together.
00:04:01.020 | This element-wise operation followed by the sum
00:04:04.940 | of the resulting values is often called the scalar product
00:04:08.980 | because it results in a single scalar value.
00:04:12.580 | And we can see how that works here.
00:04:14.900 | So we have our input, which is a five by five pixel image,
00:04:18.620 | and we have our filter, which is a three by three pixel array.
00:04:22.820 | In this very first iteration of the convolution,
00:04:27.140 | we can see that we are multiplying the three by three window
00:04:30.820 | within that input by the filter weights.
00:04:33.700 | Multiply both those together in element-wise multiplication
00:04:37.260 | to get the array that you can see on the right.
00:04:41.100 | In there, we can see that the sum of all the values
00:04:44.620 | within that array is equal to three,
00:04:46.940 | and that is our scalar product value.
00:04:50.380 | Now, in reality, we would not just return a single value
00:04:54.140 | because we would actually be sliding this window,
00:04:56.620 | this filter, over the full image.
00:04:59.860 | And the output of each one of those window operations
00:05:03.260 | is a single value.
00:05:04.580 | But of course, we now have nine values
00:05:07.460 | from this larger image.
00:05:08.980 | Now, the output of this convolution
00:05:10.860 | is what we would call a feature map or an activation map.
00:05:13.820 | Both mean the same thing.
00:05:15.100 | And we call it like that because it is a mapping
00:05:18.060 | from the activations of detected features
00:05:21.580 | from the input layer.
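
To make that window arithmetic concrete, here is a minimal PyTorch sketch of the operation just described; the pixel and filter values are illustrative stand-ins, not the exact numbers shown on screen.

```python
import torch
import torch.nn.functional as F

# A toy 5x5 "image" and a 3x3 filter (illustrative values).
image = torch.tensor([[0., 1., 1., 0., 0.],
                      [0., 1., 1., 1., 0.],
                      [0., 0., 1., 1., 1.],
                      [0., 0., 1., 1., 0.],
                      [0., 1., 1., 0., 0.]])
kernel = torch.tensor([[1., 0., 1.],
                       [0., 1., 0.],
                       [1., 0., 1.]])

# One window: element-wise multiply, then sum -> a single scalar.
scalar = (image[:3, :3] * kernel).sum()

# Sliding the filter over the whole image (stride 1, no padding)
# yields nine such scalars: a 3x3 feature map. conv2d expects
# (batch, channels, height, width), hence the [None, None].
feature_map = F.conv2d(image[None, None], kernel[None, None])
print(scalar, feature_map.shape)  # shape: torch.Size([1, 1, 3, 3])
```
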
00:05:23.620 | Now, something worth noting here is obviously
00:05:25.900 | we are going from a larger dimensional space
00:05:28.540 | into a smaller dimensional space.
00:05:30.580 | We're compressing that information.
00:05:32.620 | So we always need to be mindful of excessive compression
00:05:37.300 | and therefore data loss via dimensionality reduction.
00:05:41.260 | And because of this, we may want to increase or decrease
00:05:44.940 | the amount of compression that our filters create.
00:05:47.740 | To do that, we modify the filter size
00:05:50.380 | and also how quickly it moves across the image
00:05:54.340 | using a variable called the stride.
00:05:56.940 | The stride defines the number of pixels a filter moves
00:06:00.980 | after every calculation.
00:06:02.700 | By increasing the stride,
00:06:04.100 | the filter will move across the entire image in fewer steps,
00:06:08.260 | outputting fewer values
00:06:09.900 | and producing a more compressed feature map.
00:06:12.740 | Now, there are also some other surprising effects
00:06:16.220 | of image compression that we need to be aware of.
00:06:18.940 | One of those is the filter's interaction
00:06:21.620 | with the border areas of an input image or input array.
00:06:26.620 | If we're not careful with this,
00:06:28.420 | the border effects on small images
00:06:31.260 | can result in a rapid loss of information.
00:06:34.700 | And that's naturally more of a problem
00:06:36.300 | for those smaller images
00:06:37.700 | because they start with less information in the first place.
00:06:41.420 | So to avoid this,
00:06:42.620 | we must either reduce the amount of compression
00:06:45.860 | that we're doing using the previous techniques
00:06:48.900 | that we mentioned.
00:06:49.940 | So being mindful of the filter size and the stride,
00:06:52.900 | or we can add padding.
00:06:55.020 | Now we can see the effect of padding here.
00:06:56.940 | So we are essentially taking the original image
00:06:59.860 | and we're adding padding around the edge of that image.
00:07:03.180 | Now, typically this padding
00:07:04.460 | is going to be a set of zero value pixels.
00:07:08.380 | And we add that around the image
00:07:10.140 | to limit or prevent compression between layers.
00:07:14.620 | For larger images,
00:07:15.500 | this is going to be less of a problem.
00:07:17.420 | But for those smaller images,
00:07:20.540 | padding is a very effective remedy
00:07:23.260 | to avoid too much information loss.
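
The interaction between filter size, stride, and padding is easy to check empirically. A small sketch; the shapes in the comments are what these layers produce for a 32 by 32 input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)  # (batch, channels, height, width)

# A larger stride means fewer steps, so a more compressed feature map.
print(nn.Conv2d(1, 1, kernel_size=3, stride=1)(x).shape)  # (1, 1, 30, 30)
print(nn.Conv2d(1, 1, kernel_size=3, stride=2)(x).shape)  # (1, 1, 15, 15)

# Zero-padding the border prevents compression between layers entirely.
print(nn.Conv2d(1, 1, kernel_size=3, padding=1)(x).shape)  # (1, 1, 32, 32)
```
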
00:07:25.620 | Now, another key feature
00:07:27.700 | of these deep convolutional neural networks
00:07:30.540 | is obviously depth of those networks.
00:07:33.300 | Going way back to 2012,
00:07:36.020 | the first very successful convolutional neural network
00:07:39.820 | or deep convolutional neural network was called AlexNet.
00:07:42.900 | And the authors of AlexNet
00:07:45.220 | found that the depth of their network
00:07:48.940 | was a very key component
00:07:50.780 | that contributed to its high performance.
00:07:54.020 | And the reason for this is every successive layer
00:07:58.100 | within a convolutional neural network
00:08:00.100 | abstracts the initial features more and more.
00:08:04.580 | So we get more and more abstract features
00:08:07.460 | as the number of layers within the network increases.
00:08:10.940 | And we can think of this
00:08:12.540 | as the network is able to abstract image features more
00:08:17.540 | and get them closer to the very abstract concepts
00:08:22.500 | that we have in our minds as human beings.
00:08:25.740 | So it gets closer to a human-like understanding of an image.
00:08:29.340 | So for example, a shallow CNN
00:08:32.460 | might recognize that an image contains an animal.
00:08:35.860 | But as we add another layer,
00:08:37.740 | it may be able to identify that animal as a dog.
00:08:40.580 | It has become more abstract.
00:08:42.340 | And adding another layer
00:08:44.180 | may allow that network to identify specific breeds
00:08:47.740 | like a Staffordshire bull terrier or a husky.
00:08:50.580 | These ideas of different dog breeds
00:08:53.340 | are far more abstract than a dog or just an animal.
00:08:58.340 | And it requires far more abstraction
00:09:01.900 | in order to actually understand that
00:09:03.780 | and for a CNN to be able to identify it.
00:09:08.260 | So by adding more layers,
00:09:09.900 | we're generally going to be able to identify
00:09:12.500 | more abstract and more specific concepts.
00:09:16.420 | Now, moving on to what are some of the very common features
00:09:20.780 | of a convolutional neural network,
00:09:22.220 | although not necessarily restricted
00:09:24.020 | to convolutional neural networks alone,
00:09:26.660 | we have activation functions.
00:09:28.180 | So activation functions,
00:09:29.940 | we will find that in pretty much every neural network.
00:09:32.980 | And they are used to add non-linearity to these networks.
00:09:37.940 | And essentially what that allows us to do,
00:09:40.340 | particularly over many layers,
00:09:42.180 | is represent more complex patterns.
00:09:46.220 | Now, you may recognize a few of these activation functions.
00:09:49.540 | We have the rectified linear unit function,
00:09:52.060 | the tanh function and sigmoid function.
00:09:55.380 | These three are some of the most common
00:09:58.260 | activation functions that we find in neural networks.
00:10:01.220 | In the past, CNNs often used activation functions
00:10:05.580 | within the hidden layers of the network,
00:10:08.340 | so that the middle layers of that network,
00:10:11.420 | basically not the input or the output,
00:10:13.340 | everything in between.
00:10:14.620 | And they would typically use tanh or sigmoid activations.
00:10:17.940 | However, in 2012,
00:10:20.100 | the rectified linear unit activation function
00:10:23.300 | became very popular because it was used
00:10:26.100 | by the AlexNet model,
00:10:27.700 | which kind of kick-started this era of deep learning,
00:10:31.820 | as it was the best performing CNN of its time
00:10:34.940 | by a long shot.
00:10:36.780 | Nowadays, the rectified linear unit
00:10:38.940 | or ReLU activation function is still a very popular choice.
00:10:43.100 | It's a lot simpler than tanh and sigmoid,
00:10:46.020 | and also doesn't require regularization
00:10:49.980 | in order to avoid saturation,
00:10:52.100 | which basically means the concentration of activation outputs
00:10:57.100 | towards the minimum and maximum values
00:10:59.860 | of that activation function.
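
For reference, a quick way to see the shape of these three functions is to evaluate them on a few points:

```python
import torch

x = torch.linspace(-3, 3, 7)
print(torch.relu(x))     # negatives clipped to zero, positives unchanged
print(torch.tanh(x))     # saturates towards -1 and 1
print(torch.sigmoid(x))  # saturates towards 0 and 1
```
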
00:11:02.460 | Another very important feature is the use of pooling layers.
00:11:06.660 | Now, we use pooling layers
00:11:07.780 | because the output of feature maps are very sensitive
00:11:11.620 | to the location of input features.
00:11:14.180 | So a small change can make a big difference.
00:11:16.420 | To some degree, this can be useful
00:11:19.340 | as it can tell us the difference between, for example,
00:11:22.060 | a cat's face and a dog's face,
00:11:23.900 | based on where the eyes are, where the ears are, and so on.
00:11:28.060 | However, we don't want that to be too dramatic
00:11:30.060 | because if an eye is shifted two pixels to the left
00:11:33.500 | or two pixels to the right,
00:11:35.180 | that should still allow the model
00:11:37.380 | to identify this image as being of a face.
00:11:40.900 | It should not make it go crazy
00:11:42.660 | and detect something completely different.
00:11:45.020 | And we need pooling layers in order to allow the model
00:11:49.140 | to have this form of smoothing or stability.
00:11:53.220 | Pooling layers are a downsampling method
00:11:55.940 | that compress information from one layer
00:11:59.220 | into a smaller space in the next layer.
00:12:02.140 | And the two most common pooling methods
00:12:05.500 | are max pooling and average pooling.
00:12:08.060 | Max pooling takes a maximum value
00:12:09.980 | of all the values within a window,
00:12:12.420 | and average pooling takes the average
00:12:14.700 | of all those values within a window.
00:12:16.660 | And of course, as that pooling window
00:12:18.420 | moves across our input array,
00:12:21.100 | we would end up outputting another array
00:12:23.460 | of smaller dimensionality.
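
A minimal sketch of both pooling methods in PyTorch; the 4 by 4 input is arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)  # (batch, channels, height, width)

# A 2x2 window moving in steps of 2 halves each spatial dimension.
print(nn.MaxPool2d(kernel_size=2)(x).shape)  # (1, 1, 2, 2), max per window
print(nn.AvgPool2d(kernel_size=2)(x).shape)  # (1, 1, 2, 2), mean per window
```
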
00:12:25.460 | And with that, we can move on to the final main feature
00:12:30.020 | of a convolutional neural network,
00:12:32.060 | which is the use of fully connected layers.
00:12:34.820 | A fully connected linear layer is simply a neural network
00:12:38.980 | in the very traditional stripped down sense.
00:12:42.900 | It is the dot product between some inputs, X,
00:12:46.220 | and the layer weights, W,
00:12:47.700 | with a bias term added onto there,
00:12:50.540 | and usually an activation function.
00:12:52.980 | We will typically see a fully connected layer
00:12:55.820 | at the end of a convolutional neural network.
00:12:58.820 | And it handles the transformation
00:13:00.860 | of our convolutional neural network embeddings
00:13:03.340 | from 3D tensors to more understandable outputs
00:13:07.220 | like class predictions.
00:13:08.860 | It's often within these final layers
00:13:12.940 | that we will find the most information rich
00:13:15.620 | vector embeddings that represent the information
00:13:18.660 | that's come through all of those layers
00:13:20.380 | and have the most abstract machine readable
00:13:24.660 | numeric representation of whatever image
00:13:27.300 | is being passed through all of those layers.
00:13:29.620 | And it's this that we would typically use
00:13:31.660 | in things like image retrieval.
00:13:33.580 | But focusing on the classification,
00:13:36.060 | what a classifier will usually do
00:13:37.540 | is apply a softmax activation function.
00:13:41.100 | And that will create a probability distribution
00:13:43.620 | across all of the output nodes,
00:13:45.980 | where each one of these nodes represents a candidate class
00:13:50.700 | or a potential label.
00:13:52.340 | For example, here we have cat, dog, and car.
00:13:56.860 | Those would be our output classes.
00:14:00.420 | And after these fully connected layers
00:14:02.860 | and the softmax activation function,
00:14:05.220 | we have our predictions.
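
As a sketch of that final stage, here is a fully connected layer mapping a hypothetical 256-dimensional embedding to the three candidate classes from the example, followed by a softmax:

```python
import torch
import torch.nn as nn

embedding = torch.randn(1, 256)  # a stand-in for a flattened CNN embedding
fc = nn.Linear(256, 3)           # computes x @ W.T + b for cat, dog, car
logits = fc(embedding)
probs = torch.softmax(logits, dim=-1)
print(probs, probs.sum())        # a probability distribution summing to 1
```
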
00:14:07.140 | Now, all of those features that we've just worked through
00:14:10.540 | are the very common components
00:14:13.060 | that make up a convolutional neural network.
00:14:15.700 | But with time, many different convolutional networks
00:14:18.060 | have been designed.
00:14:19.100 | So there is no specific architecture,
00:14:21.140 | but instead what we can use
00:14:22.980 | is some of the most high-performing networks
00:14:25.580 | as almost a set of guideposts
00:14:28.540 | in how we can design our own networks
00:14:30.780 | or simply use those existing networks.
00:14:33.140 | So let's have a very quick high-level look
00:14:35.060 | at a few of the most popular ones.
00:14:36.980 | So we'll start way back
00:14:38.900 | with the very first successful
00:14:42.420 | convolutional neural network, which was LeNet.
00:14:44.820 | Now, LeNet is, I think, the earliest good example
00:14:47.900 | of a deep convolutional neural network
00:14:50.140 | being applied in the real world.
00:14:51.780 | It was developed in 1998.
00:14:53.820 | And in reality, a lot of us
00:14:55.620 | have probably actually interacted with LeNet
00:14:58.140 | without ever even realizing.
00:15:00.340 | It was developed at Bell Labs
00:15:02.260 | and they licensed it to many different big banks
00:15:05.940 | around the globe for reading handwritten digits on checks.
00:15:10.940 | And despite its use worldwide,
00:15:14.460 | it was surprisingly the only example
00:15:19.020 | of a commercially successful CNN,
00:15:22.140 | at least on that scale, for another 14 years.
00:15:26.020 | And that is where we got AlexNet.
00:15:28.700 | So AlexNet is the deep CNN
00:15:32.460 | that kick-started the era of deep learning.
00:15:37.020 | And that was back in October, 2012.
00:15:39.860 | The catalyst of this was AlexNet
00:15:42.740 | winning the ImageNet competition.
00:15:45.580 | And in fact, AlexNet can actually be seen
00:15:47.940 | as a continuation of LeNet.
00:15:51.260 | It uses a very similar architecture,
00:15:53.460 | but simply added more layers, training data,
00:15:56.380 | and some safeguards against overfitting.
00:15:59.180 | And it was after AlexNet that the broader community
00:16:03.380 | of computer vision researchers
00:16:05.540 | began focusing their attention
00:16:08.260 | on building ever deeper models with really big datasets.
00:16:13.260 | And the following years after this
00:16:17.100 | basically saw many variations of AlexNet
00:16:20.620 | winning the ImageNet competition
00:16:22.860 | until we get to VGGNet.
00:16:25.820 | Now VGGNet came in 2014 and dethroned AlexNet
00:16:30.700 | as the winner of the ImageNet competition.
00:16:33.940 | And there were a few different variations
00:16:36.220 | of this network developed,
00:16:38.180 | but we can already see that it's a much deeper network.
00:16:41.980 | But at its core, it's still using the same process
00:16:44.820 | of convolution layers, pooling layers, and so on.
00:16:47.980 | Moving on to 2015, the next year,
00:16:50.540 | we saw the introduction of ResNet.
00:16:52.940 | Now ResNet introduced even deeper networks than ever before.
00:16:57.940 | The first of those contained 34 layers.
00:17:01.780 | And since then, 50 plus layer ResNet models
00:17:04.900 | have been developed and still hold
00:17:06.860 | some of the state-of-the-art results
00:17:09.500 | on many computer vision benchmarks.
00:17:11.980 | Now, beyond adding more layers,
00:17:14.140 | ResNet was actually very much inspired by VGGNet,
00:17:17.580 | but added smaller filters
00:17:19.660 | and a generally less complex network architecture.
00:17:24.140 | Another thing, which is why it's called
00:17:25.820 | the residual network, i.e. ResNet,
00:17:29.060 | is they added these shortcut connections between layers.
00:17:34.060 | And this was to avoid the loss of information
00:17:37.740 | over many layers with the greater depth of ResNet.
00:17:42.220 | So adding these shortcuts or these residual connections
00:17:45.500 | just allowed information to be maintained
00:17:47.940 | over a longer distance, which was very much required
00:17:50.900 | with this deeper network size.
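
The shortcut idea itself is only one line of code. Here is a minimal sketch of a residual block; ResNet's actual blocks also include batch normalization, which is omitted here for brevity:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        # The shortcut: the input is added back to the block's output,
        # so information can bypass the convolutions entirely.
        return self.relu(out + x)
```
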
00:17:53.860 | Now, I think that's enough
00:17:54.780 | for understanding convolutional neural networks.
00:17:57.300 | What I want to do now is actually look at how
00:17:59.780 | to implement convolutional neural networks
00:18:02.220 | and use them in classification.
00:18:04.060 | So we're gonna go through this notebook example here.
00:18:06.860 | You can find this notebook in the video description,
00:18:10.580 | and you'll be able to open that in Colab,
00:18:12.460 | or if you prefer, you can actually download the file as well.
00:18:15.100 | We're first going to just load in the relevant libraries.
00:18:18.140 | So we have PyTorch here and TorchVision.
00:18:21.820 | I'm gonna be using these transforms a lot,
00:18:23.420 | so we'll just also import them as transforms,
00:18:26.180 | make it a little bit easier.
00:18:27.420 | Now, we're gonna be working with a fair bit of data.
00:18:30.100 | So what is usually pretty helpful to do
00:18:34.060 | is switch from CPU to GPU if you have it available.
00:18:37.780 | We don't always have it available,
00:18:39.660 | but it can be useful if you do have it.
00:18:42.660 | And you can just check what you have, like so.
00:18:45.660 | So for me, I'm on MacBook right now, so I only have CPU.
00:18:49.780 | But if you're working on Colab,
00:18:51.140 | this should show up as CUDA.
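
Those opening cells look roughly like this; the notebook linked in the description is the authoritative version:

```python
import torch
import torchvision.transforms as transforms

# Prefer a CUDA GPU when one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)  # "cpu" on a MacBook, "cuda" on a GPU-enabled Colab runtime
```
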
00:18:53.780 | Now, as usual, our first task
00:18:55.980 | is going to be data preprocessing.
00:18:58.220 | So we first need to download our dataset.
00:19:01.820 | We're gonna be using the CIFAR-10 dataset,
00:19:03.700 | which is a very popular image classification dataset.
00:19:07.460 | Download that, and we will see
00:19:10.540 | that it contains 50,000 items.
00:19:13.300 | And within that, we have images,
00:19:15.580 | which are just Python PIL image objects,
00:19:17.940 | and their labels, of which there are 10 unique labels
00:19:21.100 | within the dataset, hence why it's called CIFAR-10.
00:19:24.060 | And we can confirm that here.
00:19:25.780 | So we see we have 10 of these.
00:19:28.100 | And then we can also view the images,
00:19:30.540 | but they are very small.
00:19:32.740 | We can also see that they are Python PIL objects here.
00:19:35.460 | So this is, I think, a plane, but yeah, it's very small.
00:19:39.020 | So we're gonna be training the model.
00:19:40.700 | And while we're training,
00:19:41.780 | we also want to pull in another dataset
00:19:45.180 | that is independent to the training dataset
00:19:47.620 | that we can use as a validation or test set later on.
00:19:51.100 | We're gonna be using this test dataset,
00:19:53.340 | actually, as a validation dataset.
00:19:55.180 | So we're going to be checking our model performance
00:19:57.660 | on this data during the training process.
00:20:00.500 | You can see that we have a smaller number of items,
00:20:02.580 | and here it's 10,000 versus the 50,000 in the training set.
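
One well-known route to CIFAR-10 is through torchvision; this is a sketch, and the video's notebook may fetch the dataset through a different library:

```python
from torchvision.datasets import CIFAR10

train_ds = CIFAR10(root="./data", train=True, download=True)  # 50,000 images
test_ds = CIFAR10(root="./data", train=False, download=True)  # 10,000 images

img, label = train_ds[0]  # img is a PIL image, label an integer from 0 to 9
print(len(train_ds), len(test_ds))
```
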
00:20:05.500 | Now, most convolutional neural networks
00:20:08.220 | are designed to only accept a certain size of images.
00:20:12.940 | In the case of the model that we are going to use,
00:20:15.140 | we're gonna use a 32 by 32 image input.
00:20:18.260 | We can modify that based on the model architecture,
00:20:20.980 | but the model architecture that we're gonna be using later
00:20:23.300 | accepts this 32 by 32 image size.
00:20:25.940 | So what we need to do is we can set the image size here,
00:20:29.460 | and then we can use transforms.Resize
00:20:32.460 | to resize the image into whatever we put in here,
00:20:35.980 | so the 32 pixels.
00:20:37.460 | And then this transforms.ToTensor step
00:20:39.340 | is just to convert our image,
00:20:41.540 | our PIL image object into a tensor,
00:20:45.420 | in which we can then feed into our model later on.
00:20:47.660 | So run this, and basically,
00:20:50.380 | we can just run preprocess on our images,
00:20:52.740 | and that will run this transformation pipeline across them.
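
At this point the pipeline is just those two steps; a sketch, with transforms imported as above:

```python
image_size = 32
preprocess = transforms.Compose([
    transforms.Resize((image_size, image_size)),  # force every image to 32x32
    transforms.ToTensor(),  # PIL image -> float tensor with values in [0, 1]
])
```
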
00:20:57.260 | Now, there are a few things to consider
00:20:59.900 | when we're doing this.
00:21:00.860 | The first is, okay, we're gonna be iterating
00:21:03.180 | through everything.
00:21:04.020 | We need to extract the image and its respective label.
00:21:07.900 | One thing we need to consider
00:21:08.980 | with this image data set in particular,
00:21:11.540 | but other image data sets as well,
00:21:13.980 | is we only want one image format.
00:21:18.700 | So I want to have RGB images,
00:21:21.500 | so images with red, green, and blue color channels.
00:21:24.740 | A few images in this data set are actually just grayscale,
00:21:27.220 | so they have a single color channel.
00:21:28.980 | Now, we're not going to colorize it or anything like that.
00:21:32.300 | We're actually just going to copy
00:21:34.140 | those single color channels into three color channels,
00:21:37.420 | and it will still appear
00:21:38.540 | as a black and white grayscale image,
00:21:41.300 | but we at least have those three color channels,
00:21:43.940 | which means we can pass that directly into our model,
00:21:47.140 | which expects image arrays with three color channels.
00:21:50.980 | So we do that using image convert here,
00:21:53.580 | and then we preprocess everything,
00:21:55.660 | and then we append all that to our inputs.
00:21:58.180 | Okay, now we'll just take a moment.
00:21:59.500 | It's pretty quick.
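
That loop, sketched out against the train_ds and preprocess objects from above (variable names are illustrative):

```python
inputs, labels = [], []
for img, label in train_ds:
    # A few images may be single-channel grayscale; convert("RGB") copies
    # that channel into three identical channels so every tensor matches.
    inputs.append(preprocess(img.convert("RGB")))
    labels.append(label)

print(len(inputs), inputs[0].shape)  # 50000 torch.Size([3, 32, 32])
```
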
00:22:00.820 | Now let's have a look at one of those images.
00:22:03.380 | So we run this,
00:22:05.740 | and we can see that we have 50,000
00:22:08.620 | of these modified images in our training set,
00:22:12.700 | and each one of those is a three by 32 by 32 tensor.
00:22:17.700 | Now, 32 by 32 is the actual pixels in the image,
00:22:22.180 | and the three is the number of color channels,
00:22:24.340 | red, green, and blue.
00:22:25.580 | And we can see the result from this transformation here.
00:22:29.300 | So this is just a tensor.
00:22:32.100 | One thing to note here is that all these values
00:22:34.780 | have been normalized to between zero and one.
00:22:37.660 | That was actually done by the preprocessing pipeline
00:22:42.020 | here with the transforms.ToTensor step.
00:22:44.900 | Okay, so moving on.
00:22:48.020 | One thing that we should do
00:22:50.420 | is calculate the mean and standard deviation of images
00:22:53.820 | so that we can modify the normalization
00:22:57.700 | to better fit our set of images.
00:23:00.180 | So to do that, we would do this,
00:23:02.300 | creating a big list of all of our images.
00:23:05.380 | And this is just a small subset of those
00:23:08.100 | that we pulled from here.
00:23:09.300 | So this is like we're sampling a smaller portion of those.
00:23:12.540 | Otherwise, this just takes a bit longer.
00:23:14.540 | We're merging all these into a single three channel vector.
00:23:18.220 | So you can see here,
00:23:19.540 | we're just kind of merging all those images
00:23:21.140 | into a massive one single big image.
00:23:23.580 | And then we just want to calculate the mean
00:23:26.140 | and the standard deviation for each one
00:23:28.060 | of those color channels.
00:23:29.620 | Okay, so we get these values and these values.
00:23:32.540 | I don't need those tensors anymore.
00:23:33.780 | They take a bit of space.
00:23:34.900 | I'm going to remove them.
00:23:36.260 | And then what we do is we just modify that normalize here.
00:23:41.260 | Okay, so now we're just normalizing all of the tensors
00:23:47.300 | with this additional step here.
00:23:49.260 | Okay, so we've preprocessed that
00:23:51.100 | and we apply that to the existing tensors.
00:23:53.460 | Run that.
00:23:54.300 | Okay, and that's done.
00:23:56.180 | But obviously when we're doing all of this in one go,
00:23:59.260 | we would probably want to do this.
00:24:00.860 | So we put all of them into a single preprocessing pipeline.
00:24:05.860 | So we have the resize, the ToTensor conversion,
00:24:07.900 | and then the normalize following that.
00:24:09.940 | And we'll actually use that on the validation set.
00:24:12.660 | So here is our validation set.
00:24:15.180 | And we'll just rerun the same thing as before,
00:24:17.020 | but obviously this time we have that normalization step
00:24:19.780 | in there as well.
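
Sketched out, the statistics step and the final pipeline might look like this; the sample size and names are illustrative:

```python
import torch

# Estimate per-channel statistics from a subset of the training tensors.
sample = torch.stack(inputs[:2000])  # (N, 3, 32, 32); a subset keeps it quick
mean = sample.mean(dim=(0, 2, 3))    # one mean per colour channel
std = sample.std(dim=(0, 2, 3))      # one standard deviation per channel

# The full pipeline: resize -> tensor in [0, 1] -> channel-wise normalize.
preprocess = transforms.Compose([
    transforms.Resize((image_size, image_size)),
    transforms.ToTensor(),
    transforms.Normalize(mean=mean.tolist(), std=std.tolist()),
])
```
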
00:24:21.220 | Now, when we're training a model,
00:24:23.900 | we are probably going to want to do it in batches.
00:24:26.780 | So we want to pass through a batch of input data
00:24:31.780 | at any one time into our model.
00:24:34.300 | So rather than passing a single image at a time
00:24:37.820 | through everything,
00:24:38.700 | we're passing everything through in batches of,
00:24:40.860 | in this example, 64.
00:24:43.220 | And this is because neural networks can be parallelized.
00:24:46.860 | And that allows us to take advantage of parallelization,
00:24:50.580 | which means we're performing many calculations
00:24:53.540 | across all of these different images in a single batch
00:24:56.180 | in parallel rather than one after the other,
00:24:59.340 | which means everything will be much faster.
00:25:01.220 | Now to prepare everything,
00:25:03.500 | we need to load everything into our data loader.
00:25:06.460 | So this is going to handle the loading of data
00:25:09.460 | into our model.
00:25:11.500 | We have the batch size here.
00:25:13.220 | We're also shuffling everything
00:25:14.500 | so that we don't have like the same set of images
00:25:18.100 | if they're not shuffled already within the data set.
00:25:21.260 | So we don't have the same set of images
00:25:22.700 | all covering a single batch.
00:25:24.580 | We want every single batch to be as representative
00:25:27.500 | of the full data set as possible.
00:25:30.500 | Now we initialize data loaders
00:25:31.940 | for both the validation and the training set.
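
A sketch of that wrapping, assuming the preprocessed tensors and labels are held in Python lists as above:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 64
train_set = TensorDataset(torch.stack(inputs), torch.tensor(labels))
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
# The validation tensors get their own loader, built the same way.
```
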
00:25:34.340 | And then what we're going to do
00:25:35.660 | is actually build our convolutional neural network.
00:25:39.140 | Now, it's pretty big,
00:25:42.140 | but that's just thanks to
00:25:44.620 | the depth of the network.
00:25:46.620 | In reality, if we compare it to a lot of the networks
00:25:48.620 | we looked at before, it's not actually that deep.
00:25:51.500 | We only have five convolutional layers here
00:25:54.540 | and then a few other things.
00:25:56.180 | But you should note that there are a few things
00:25:58.900 | that you might recognize from before.
00:26:00.620 | We have the convolutional layer
00:26:01.900 | followed by a ReLU activation layer,
00:26:04.860 | followed by a max pooling layer,
00:26:06.500 | and we do that several times.
00:26:08.260 | And then at the end here,
00:26:09.220 | we have our fully connected layers.
00:26:11.540 | We have a few of those in order to get our predictions.
00:26:14.780 | Now, things to note here
00:26:18.060 | is that we have a number of input channels here.
00:26:20.300 | This must align to the number of input color channels
00:26:23.340 | that we're expecting.
00:26:24.540 | Our inputs will be identified
00:26:27.180 | based on the first set of input data
00:26:30.020 | that we throw into our model.
00:26:31.740 | So that'll be 32 by 32.
00:26:33.660 | And this first convolution layer
00:26:35.860 | is going to go over that 32 by 32 image
00:26:40.740 | with a kernel or a window of four by four pixels
00:26:45.140 | and go through that.
00:26:46.060 | We also add some padding to reduce the amount of compression.
00:26:49.420 | And we do that throughout the network.
00:26:51.660 | Now, the other thing to note
00:26:53.420 | is that the number of output channels
00:26:54.820 | is the depth of the array.
00:26:58.020 | So the depth is initially the three color channels,
00:27:00.820 | the depth from the output of that is actually 64.
00:27:03.380 | And that gets deeper and deeper as we go through
00:27:06.140 | before we start decreasing that
00:27:08.540 | as we're going towards the end of the model.
00:27:11.500 | And then the end here,
00:27:12.340 | we're just performing transformations
00:27:14.460 | from our 3D convolution layers
00:27:17.780 | into the fully connected layers at the end there.
00:27:21.700 | And the final output, so this is our final layer.
00:27:25.220 | The input of that layer is 256 activations or nodes.
00:27:29.620 | And the output is the number of classes, which is 10.
00:27:32.460 | Then what we do,
00:27:33.420 | so that's defining the structure
00:27:35.140 | of our convolutional neural network.
00:27:37.820 | The next bit here is the forward step throughout.
00:27:42.500 | So it's identifying or it's defining the process
00:27:47.540 | that we move through each one of these layers,
00:27:50.100 | the order that each one of them is used.
00:27:52.740 | Because we can actually define all of these
00:27:54.500 | in any order we want.
00:27:56.500 | But it's actually here that the order is defined.
00:27:59.740 | So we go through that, we get to here,
00:28:02.140 | which is our final output.
00:28:03.940 | And that is how a convolutional neural network,
00:28:08.300 | that is how we define it.
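
Here is a sketch in the spirit of that description: five convolutional layers with ReLU activations and occasional max pooling, padding throughout, and fully connected layers taking 256 activations down to 10 classes. The exact channel counts and kernel sizes in the video's notebook may differ.

```python
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Five convolutional layers; channel depth grows from the three
        # colour channels to 64 and beyond, then shrinks again.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((2, 2)),  # -> (64, 2, 2) = 256 activations
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 2 * 2, 256), nn.ReLU(),
            nn.Linear(256, num_classes),  # 256 activations -> 10 classes
        )

    def forward(self, x):
        # forward() fixes the order in which the layers are applied.
        return self.classifier(self.features(x))
```
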
00:28:09.820 | And then we come down
00:28:10.740 | and we move on to setting everything up for training.
00:28:13.860 | So we want to move it to our device
00:28:16.900 | if we're using a CUDA-enabled GPU,
00:28:18.940 | otherwise it'll just stay on the CPU.
00:28:21.300 | We want to set the loss function.
00:28:22.620 | We're gonna be using cross-entropy loss.
00:28:24.780 | This is used when we have a classification task,
00:28:28.940 | like what we do here.
00:28:30.780 | Set learning rate, so 0.008.
00:28:33.700 | And we're going to use stochastic gradient descent here
00:28:37.380 | as our optimizer.
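
That setup, sketched out against the CNN class and device from above:

```python
model = CNN().to(device)         # move the model to GPU if available
loss_fn = nn.CrossEntropyLoss()  # standard loss for classification
optimizer = torch.optim.SGD(model.parameters(), lr=0.008)
```
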
00:28:38.980 | Moving on, we would go down to here.
00:28:41.540 | We train it for about 50 epochs here.
00:28:44.300 | You can do fewer or more depending on what you're seeing
00:28:47.500 | on your end during the training.
00:28:49.020 | It doesn't take too long to run anyway.
00:28:51.540 | And then we run through the training.
00:28:54.300 | So we go through these 50 epochs.
00:28:57.100 | And within each epoch, we run through the entire dataset.
00:29:01.020 | And we load that from the training data loader.
00:29:04.220 | We move those to the GPU if it is available.
00:29:08.660 | We do the forward propagation.
00:29:10.940 | From here, we get our output logits.
00:29:13.500 | So that is the final output predictions.
00:29:16.820 | And then from there, we calculate loss function
00:29:18.780 | between the predicted values and the true values.
00:29:23.620 | From there, we optimize the model.
00:29:25.860 | So we do the backpropagation step.
00:29:28.820 | And that is the training step.
00:29:32.380 | That's all for the training.
00:29:33.980 | And then this bit here is another bit.
00:29:36.540 | So this is for our validation
00:29:38.620 | so that we can actually calculate the validation
00:29:40.460 | and just see that our model is not overfitting
00:29:42.580 | or anything over time.
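
Put together, the loop might look like this; it assumes a val_loader built the same way as train_loader:

```python
num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        logits = model(images)           # forward propagation
        loss = loss_fn(logits, targets)  # predicted vs. true labels
        optimizer.zero_grad()
        loss.backward()                  # backward propagation
        optimizer.step()                 # update the weights

    # Validation pass: no gradients, just watch for overfitting.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, targets in val_loader:
            images, targets = images.to(device), targets.to(device)
            preds = model(images).argmax(dim=-1)
            correct += (preds == targets).sum().item()
            total += targets.numel()
    print(f"epoch {epoch}: last train loss {loss.item():.3f}, "
          f"val accuracy {correct / total:.3f}")
```
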
00:29:44.620 | And we will get something that looks like this.
00:29:47.780 | So important here is if you see the loss decreasing
00:29:51.900 | and the validation loss increasing,
00:29:54.020 | that means that you're probably training for too many steps
00:29:56.780 | or your learning rate is too high
00:29:58.980 | and the model is overfitting to the training data.
00:30:02.500 | So in that case, just be wary and kind of stop doing that.
00:30:06.700 | And either train for fewer epochs
00:30:10.700 | or train with a lower learning rate
00:30:14.100 | or with fewer layers even.
00:30:16.260 | Okay, and then the final validation accuracy
00:30:18.540 | that we'll get near the end here is around 80%.
00:30:21.300 | And we can then go ahead and save the model
00:30:23.540 | to a file like this.
00:30:25.340 | Okay, so this is just CNN
00:30:26.860 | and the PyTorch weights file there.
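
Saving is a single call; the filename here is illustrative:

```python
torch.save(model.state_dict(), "cnn.pt")
```
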
00:30:29.500 | Now, that is the training for the model.
00:30:32.220 | Let's have a look at how we can use it for inference.
00:30:34.340 | So inference, by inference, I mean making predictions.
00:30:38.380 | We would load the model.
00:30:40.300 | You know, if we didn't already have it in the notebook,
00:30:43.420 | we'd make sure we switch it to evaluation mode
00:30:45.820 | and also move to device if we have CUDA-enabled GPU again.
00:30:49.940 | Pretending that we're not in the same notebook,
00:30:51.540 | we reinitialize the test set.
00:30:55.020 | So it's actually our validation set,
00:30:56.540 | but we're just using it for both.
00:30:58.740 | In this example, of course, in real scenarios,
00:31:01.380 | you should use a different data set
00:31:02.900 | for your validation and test set.
00:31:04.900 | Come down here, we do the preprocessing
00:31:08.180 | that we set up before, the preprocessing pipeline.
00:31:11.100 | We can check the number of tensors we have there.
00:31:13.380 | So we're just taking the first 10 as an example.
00:31:16.300 | All of those are the three by 32 by 32 that we saw before.
00:31:21.100 | Stack all those into a single tensor
00:31:23.380 | so that we have basically 10 tensors
00:31:27.700 | or 10 image arrays, you can think of them like that,
00:31:31.340 | with the three color channels
00:31:33.740 | and each one of those 32 by 32 pixels.
00:31:37.020 | Process all of them through our model.
00:31:39.580 | And then from there, we can use the torch argmax function
00:31:43.140 | in order to find the value within the output logits
00:31:48.140 | that has the highest activation.
00:31:50.180 | And that is our prediction.
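
The whole inference pass, sketched out; test_inputs stands in for the list of preprocessed validation tensors, and the filename matches the save step above:

```python
model = CNN()
model.load_state_dict(torch.load("cnn.pt", map_location=device))
model.to(device)
model.eval()  # switch off training-only behaviour

batch = torch.stack(test_inputs[:10]).to(device)  # (10, 3, 32, 32)
with torch.no_grad():
    logits = model(batch)
preds = torch.argmax(logits, dim=-1)  # index of the highest activation
print(preds)  # ten class ids, e.g. tensor([5, 8, ...])
```
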
00:31:53.140 | Okay, so in the first example,
00:31:55.580 | position number five had the highest activation.
00:31:57.700 | That means the model is predicting
00:31:59.260 | that this image is whatever class number five is.
00:32:03.260 | And we'll see what that is in a moment.
00:32:05.900 | We can just see the number of predictions we have here.
00:32:08.420 | It's 10.
00:32:09.500 | So we can see the class names.
00:32:11.060 | So here, if we look at this,
00:32:12.900 | the number five would be zero, one, two, three, four, five.
00:32:17.900 | So the first one is predicting dog.
00:32:20.180 | And if we have a look here, the second one is number eight.
00:32:23.940 | So number eight here, we'll go a little further.
00:32:26.740 | So six, seven, eight, ship.
00:32:29.740 | We come down here.
00:32:31.220 | And what we can do is find the predictions
00:32:33.020 | for each one of these.
00:32:33.900 | So this one is apparently a dog.
00:32:36.420 | It's pretty hard to tell, to be honest.
00:32:38.380 | This one is a ship, an automobile,
00:32:42.380 | which maybe isn't accurate, an airplane,
00:32:45.540 | a frog, another frog, automobile, and so on and so on.
00:32:50.540 | So generally speaking,
00:32:52.700 | most of those actually do look correct.
00:32:55.220 | So we have successfully managed to train
00:32:59.940 | our convolutional neural network
00:33:01.780 | on this classification task for images.
00:33:04.700 | And that is despite these images being very low resolution.
00:33:08.740 | Like even myself as a person,
00:33:11.100 | I struggle to figure out what exactly
00:33:13.780 | is in each one of those images.
00:33:15.900 | But that's it for this introduction
00:33:17.740 | to the long reigning champions of computer vision,
00:33:21.740 | i.e. the convolutional neural networks.
00:33:24.460 | We've gone through the intuition behind these models
00:33:29.180 | and had a look at a few of the most popular versions
00:33:32.140 | of these as well, which should always inform us:
00:33:36.460 | if we wanted to build a convolutional neural network,
00:33:39.700 | we should have a look at those past implementations
00:33:42.180 | and what they did, or just use one of them out of the box.
00:33:46.260 | And then after that, we obviously went
00:33:48.060 | through all this code,
00:33:49.220 | went through actually training a convolutional neural network
00:33:52.860 | and actually using it for inference as well.
00:33:55.780 | So as I said, convolutional neural networks
00:33:58.020 | are still very popular, but I think in the next,
00:34:02.700 | particularly in the next couple of years,
00:34:04.340 | they're probably mostly going to be replaced
00:34:06.540 | by other architectures like vision transformers
00:34:10.300 | and possibly other things as well.
00:34:12.580 | But even so, they're super relevant even now
00:34:17.100 | and definitely a good thing to understand
00:34:20.260 | if you're working within machine learning
00:34:22.660 | and particularly computer vision.
00:34:24.540 | So I hope all of that has been helpful
00:34:27.340 | and insightful.
00:34:28.780 | Thank you very much for watching
00:34:31.260 | and I will see you again in the next one.