Lesson 15: Deep Learning Foundations to Stable Diffusion
Chapters
0:00 Introduction
0:51 What are convolutions?
6:52 Visualizing convolutions
8:51 Creating a convolution with MNIST
17:58 Speeding up the matrix multiplication when calculating convolutions
22:27 PyTorch's F.unfold and F.conv2d
27:21 Padding and Stride
31:03 Creating the ConvNet
38:32 Convolution Arithmetic. NCHW and NHWC
39:47 Parameters in MLP vs CNN
42:27 CNNs and image size
43:12 Receptive fields
46:09 Convolutions in Excel: conv-example.xlsx
56:04 Autoencoders
60:00 Speeding up fitting and improving accuracy
65:56 A reminder of what an autoencoder is
75:52 Creating a Learner
82:48 Metric class
88:40 Decorator with callbacks
92:45 Python recap
00:00:00.000 |
Hi all, and welcome to Lesson 15. What we're going to endeavor to do today is to create a convolutional autoencoder. 00:00:13.200 |
And in the process we will see why doing that well is a tricky thing to do and time permitting 00:00:22.060 |
we will begin to work on a framework, a deep learning framework to make life a lot easier. 00:00:29.920 |
Not sure how far we'll get on that today, time wise, so let's see how we go and get started. 00:00:37.000 |
So okay, today, before we can create a convolutional autoencoder, 00:00:45.140 |
we need to talk about convolutions: what are they, and what are they for? 00:00:52.480 |
Generally speaking, convolutions are something that allows us to tell our neural network 00:01:00.560 |
a little bit about the structure of the problem, which is going to make it a lot easier for it to solve. 00:01:06.640 |
And in particular the structure of our problem is we're doing things with images. 00:01:11.280 |
Images are laid out on a grid: a 2D grid for black and white, 3D for color, or 4D for a batch of color images. 00:01:24.720 |
And so we would say, you know, there's a relationship between the pixels going across and the pixels going down. 00:01:31.280 |
They tend to be similar to each other; differences in those pixels across those dimensions tend 00:01:35.980 |
to have meaning; patterns of pixels that appear in different places often represent the same thing. 00:01:44.240 |
So for example, a cat in the top left is still a cat even if it's in the bottom right. 00:01:49.520 |
This kind of prior information is something that is naturally captured by 00:01:56.760 |
a convolutional neural network, something that uses convolutions. 00:02:01.760 |
Generally speaking, this is a good thing because it means that we will be able to use fewer 00:02:06.040 |
parameters and less computation, because more of that information about the problem we're 00:02:11.640 |
solving is kind of encoded directly into our architecture. 00:02:17.320 |
There are other architectures that don't encode that prior information as strongly such as 00:02:24.880 |
a multilayer perceptron, which we've been looking at so far, or a transformer network. 00:02:32.040 |
Those kinds of architectures do give us more flexibility, 00:02:38.520 |
and given enough time, compute and data, they could potentially find things that maybe CNNs can't. 00:02:48.640 |
So we're not always going to use convolutional neural networks, but they're a pretty good 00:02:53.000 |
starting point and certainly something important to understand. 00:02:59.360 |
We can also take advantage of one-dimensional convolutions for language-based tasks, for instance. 00:03:11.580 |
So in this notebook, one thing you'll notice that might be of interest is we are importing from MiniAI. 00:03:20.520 |
Now MiniAI is this little library that we're starting to create and we're creating it using 00:03:27.280 |
So we've now got a MiniAI.training and a MiniAI.datasets. 00:03:31.480 |
And so if we look, for example, at the datasets notebook, it starts with something that says 00:03:36.440 |
that the default export module is called datasets, and some of the cells have an export directive on them. 00:03:47.280 |
And at the very bottom, we had something that called nbdev export. 00:03:52.720 |
Now what that's going to do is it's going to create a file called datasets.py. 00:04:08.660 |
And it contains those cells that we exported. 00:04:21.880 |
That's because everything for nbdev is stored in settings.ini and there's something here 00:04:26.200 |
saying create a library libname called MiniAI. 00:04:32.460 |
You can't use this library until you install it. 00:04:35.420 |
Now, we haven't uploaded it to PyPI; we haven't made it a pip-installable package you can get from there. 00:04:43.620 |
But you can actually install a local directory as if it's a Python module that you've kind of installed from pip. 00:04:53.880 |
And to do that, you say pip install in the usual way, but you add -e, which stands for editable. 00:05:00.100 |
And that means set up the current directory as a Python module. 00:05:03.720 |
Well, current directory, actually any directory you like; I just put dot to mean the current directory. 00:05:08.600 |
And so you'll see that's going to go ahead and actually install my library. 00:05:15.480 |
And so after I've done that, I can now import things from that library, as you see. 00:05:31.200 |
We're going to grab our MNIST dataset and we're going to create a convolutional neural 00:05:36.260 |
So before we do that, we're going to talk about what are convolutions. 00:05:40.600 |
And one of my favorite descriptions of convolutions comes from a student in, I think it 00:05:44.920 |
was, our very first course, Matt Kleinsmith, who wrote this really nice Medium article, 00:05:53.040 |
CNNs from different viewpoints, which I'm going to steal from. 00:05:57.320 |
Say that this is our image: it's a three by three image with nine pixels labeled from A to I. 00:06:08.080 |
Now a convolution uses something called a kernel and a kernel is just another tensor. 00:06:15.920 |
In this case, it's a two by two matrix. 00:06:18.760 |
So this one's going to have alpha, beta, gamma, delta as the four values in this kernel. 00:06:26.960 |
Now, one thing I'll mention, I can't remember if I've said this 00:06:31.760 |
before: the Greek letters are things that you want to be able to read and name. 00:06:39.280 |
So if you don't know how to read these and say what their names are, make sure you head 00:06:44.240 |
over to Wikipedia or wherever and learn the names of all the Greek letters, so that you can follow along. 00:06:51.600 |
So, what happens when we apply a convolution with this two by two kernel to this three 00:06:58.400 |
by three image, I mean, it doesn't have to be an image, it's in this case, it's just 00:07:04.720 |
a rank two tensor, but it might represent an image. 00:07:09.160 |
What happens is we take the kernel and we overlay it over the first little two by two 00:07:15.780 |
sub grid, like so, and specifically what we do is we match color to color. 00:07:22.760 |
So the output of this first two by two overlay would be alpha times A plus beta times B plus 00:07:31.280 |
gamma times D plus delta times E and that would yield some value P and that's going 00:07:37.720 |
to end up in the top left of a two by two output. 00:07:42.420 |
So the top right of the two by two output, we're going to slide, it's like a sliding 00:07:46.760 |
window, we're going to slide our kernel over to here and apply each of our coefficients 00:07:52.720 |
to these respectively colored squares and then ditto for the bottom left and then ditto 00:08:03.520 |
So we end up with this equation: P, as we discussed, is alpha A plus beta B plus gamma D plus delta E. 00:08:19.280 |
As you can see, it's just alpha times A, beta times B, and so on; we're just multiplying them 00:08:24.600 |
together and adding them up: multiply together, add them up; multiply together, add them up. 00:08:29.080 |
So we're basically, you can imagine that we're basically flattening these out into rank one 00:08:34.360 |
tensors into vectors and then doing a dot product would be one way of thinking about 00:08:38.240 |
what's happening as we slide this kernel over these windows. 00:08:52.080 |
So for example, let's grab our training images and take a look at one. 00:09:09.320 |
So remember, the word kernel appears a lot of times in computer science. 00:09:15.840 |
We've already seen the term kernel to mean a piece of code that we run on a GPU across 00:09:24.240 |
lots of parallel kind of virtual devices or potentially in a grid. 00:09:31.880 |
We've got a computation, which is in this case, kind of this dot product or something 00:09:35.520 |
like a dot product, sliding over, occurring lots of times over a grid. 00:09:45.680 |
That's kind of another use of the word kernel. 00:09:47.480 |
So in this case, a kernel is going to be a rank two tensor. 00:09:52.520 |
And so let's create a kernel with these values in the three by three matrix, rank two tensor. 00:10:06.760 |
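Concretely, a kernel like the one being described might look something like this (a sketch; the values are the top-edge detector discussed just below):

    import torch
    from torch import tensor

    # rank two tensor: -1s along the top row, +1s along the bottom
    top_edge = tensor([[-1., -1., -1.],
                       [ 0.,  0.,  0.],
                       [ 1.,  1.,  1.]])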
So what would happen if we slide this three by three kernel over this 28 by 28 image? 00:10:17.240 |
Well, what's going to happen is, if the top left three by 00:10:23.360 |
three section, for example, has these names, then we're going to end up with negative A1, because the top 00:10:30.920 |
row of the kernel is all negative ones: negative A1, minus A2, minus A3; the next row is just zeros. 00:10:50.880 |
What I've done here is I've grabbed just the first 13 rows and first 23 columns of our 00:11:00.440 |
And I'm actually showing the numbers and also using gray kind of conditional formatting, 00:11:08.260 |
if you like, or the equivalent in pandas to show this top bit. 00:11:18.400 |
So what happens if we take rows three, four, and five? 00:11:26.620 |
So it's rows three, four, and five, columns 14, 15, 16, 14, 15, 16. 00:11:35.560 |
What's that going to give us if we multiply it by this kernel? 00:11:42.040 |
It gives us a fairly large positive value, because the three pixels that we have negatives on 00:11:52.480 |
are all close to zero, and the three that we have positives on are all close to one. 00:12:00.720 |
What about the same columns, but for rows 7, 8, and 9? Here, the top of the window is all positive pixels, under the kernel's negative row. 00:12:13.900 |
So that means that we're going to get a lot of negative terms. 00:12:18.200 |
And not surprisingly, that's exactly what we see. 00:12:20.760 |
If we do this kind of dot product equivalent, and all you need in NumPy to do that is just 00:12:29.000 |
an element-wise multiplication followed by a sum, right? 00:12:32.720 |
So that's going to be quite a large negative number. 00:12:35.620 |
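In code, that dot-product equivalent is just the following (x_im is a hypothetical name for the 28 by 28 image tensor):

    # element-wise multiply the 3x3 window by the kernel, then sum
    (x_im[7:10, 14:17] * top_edge).sum()   # large and negative: a bottom edge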
And so perhaps you're seeing what this is doing, and maybe you got a hint from the name 00:12:41.480 |
It's something that is going to find the top edge, right? 00:12:45.780 |
So this one is a top edge, so it's a positive, and this one is a bottom edge, so it's a negative. 00:12:52.880 |
So we would like to apply that, this kernel, to every single 3x3 section in here. 00:13:02.860 |
So we could do that by creating a little apply_kernel function that takes some particular 00:13:08.400 |
row, some particular column, and some particular tensor as a kernel, and does that multiply-and-sum. 00:13:21.820 |
So for example, we could replicate this one by calling apply_kernel. 00:13:28.000 |
And this here is the center of that 3x3 grid area. 00:13:36.940 |
So now we could apply that kernel to every one of the 3x3 windows in this 28x28 image. 00:13:46.520 |
So we're going to be sliding over, like this red bit sliding over here, but we've actually got to do it for every window. 00:13:55.600 |
So to get all of the coordinates-- let's just simplify it to do this 5x5-- we can create 00:14:02.220 |
a list comprehension. We can take i through every value in range 5, and then for each 00:14:08.680 |
of those, we can take j for every value in range 5. 00:14:14.200 |
And so if we just look at that, you can see we get a list of lists containing all the coordinates. 00:14:25.340 |
So this is a list comprehension in a list comprehension, which when you first say it, may be surprising 00:14:35.000 |
or confusing, but it's a really helpful idiom. 00:14:39.520 |
And I certainly recommend getting used to it. 00:14:43.800 |
Now what we're going to do is we're not just going to create this tuple, but we're actually 00:14:50.480 |
going to call apply kernel for each of those. 00:14:54.620 |
So if we go through from 1 to 27-- well, actually, 1 to 26, because 27 is exclusive. 00:15:03.860 |
So we're going to go through everything from 1 to 26, and then for each of those, go through 00:15:12.880 |
everything from 1 to 26 again. And that's going to give us the result of applying that convolutional kernel to every window. 00:15:21.980 |
And you can see what it's done, as we hoped, is it is highlighting the top edges. 00:15:28.360 |
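Putting that together, the code is roughly this (a sketch along the lines of the notebook; x_im is again the hypothetical image tensor):

    def apply_kernel(row, col, kernel):
        # dot-product the kernel with the 3x3 window centred at (row, col)
        return (x_im[row-1:row+2, col-1:col+2] * kernel).sum()

    rng = range(1, 27)  # centres 1..26, so each 3x3 window stays inside 28x28
    top_edge3 = tensor([[apply_kernel(i, j, top_edge) for j in rng] for i in rng])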
So yeah, you might find it kind of surprising that it's that easy to do this kind of image processing. 00:15:36.440 |
We're literally just doing an element-wise multiplication and a sum for each window. 00:15:56.140 |
This time, we could do one with a left edge tensor; as you can see, it looks just like a 00:16:00.940 |
rotated version, or transposed version I guess, of our top edge tensor. 00:16:07.320 |
And so if we apply that kernel-- so this time, we're going to apply the left edge kernel. 00:16:13.240 |
And so notice here what we're actually passing in. 00:16:18.020 |
We're passing in a function... sorry, actually, not a function, is it? It's a tensor. 00:16:27.740 |
So we're going to pass in the left edge tensor for the same list comprehension in a list 00:16:37.060 |
And this time, we're getting back at the left edges. 00:16:40.180 |
Highlighting all of the left edges in the digit. 00:16:45.320 |
So yeah, this is basically what's happening here: a kernel being looped over each window of the image. 00:16:57.420 |
Now you'll see here that in the process of doing so, we are losing the outermost pixels of our image. 00:17:12.060 |
But just for now, notice that as we are putting our 3 by 3 through, for example, in this 5 00:17:18.840 |
by 5, there's only one, two, three places that we can put it going across, not five 00:17:31.300 |
And hopefully, if you remember back to kind of the Zeiler and Fergus pictures from lesson 00:17:36.540 |
1, you might recognize that the kind of first layer of a convolutional network is often 00:17:41.040 |
looking for kind of edges and gradients and things like that. 00:17:46.080 |
And then the convolutions on top of convolutions with nonlinear activations between them can 00:17:51.380 |
combine those into curves or corners or stuff like that, and so on and so forth. 00:18:01.100 |
Now, we need to make this faster, because currently, doing this in Python, it's going to be super, super slow. 00:18:05.180 |
So one of the very earliest, probably the earliest, publicly available general purpose 00:18:13.300 |
GPU-accelerated deep learning library I saw was called Caffe. 00:18:18.500 |
That was created by somebody called Yangqing Jia. 00:18:22.540 |
And he actually described how Caffe went about implementing a fast convolution on a GPU. 00:18:37.860 |
And basically, he said, "Well, I had two months to do it, and I had to finish my thesis." 00:18:44.660 |
And so I ended up doing something where I said, "Well, there was some other code out there." 00:18:53.180 |
Alex Krizhevsky, who you might have come across, he and Hinton set up a little startup, which 00:19:01.040 |
Google bought, and that kind of became the start of Google's deep learning work. 00:19:06.700 |
So Krizhevsky had all this fancy stuff in his library, but Yangqing Jia said, "Oh, I didn't have time for all that. 00:19:14.700 |
So I said, well, I already know how to multiply matrices, so maybe I can convert a convolution into a matrix multiplication." 00:19:29.940 |
im2col is a way of converting a convolution into a matrix multiply. 00:19:38.700 |
And so actually, I suspect Yangqing Jia kind of accidentally reinvented 00:19:45.180 |
it, because it actually had been around for a while, even at the point that he was writing Caffe. 00:19:56.060 |
This paper, I believe, is actually the place where it was created. 00:20:12.820 |
And what they describe is, let's say you are putting this two by two kernel over this three by three image. 00:20:24.700 |
So here you've got this window needs to match to this bit of this window, right? 00:20:29.740 |
What you could do is you could unwrap this to one, one, two, sorry, one, two, one, two 00:20:35.820 |
downwards to here, one, two, one, two. So unroll it like so. And you could unroll the kernel 00:20:44.580 |
Yeah, sorry, this is one, two, one, one. So this bit is here, one, two, one, one. And 00:20:51.860 |
then you could unroll the kernel one, one, two, two to here, one, one, two, two. 00:20:57.780 |
And then once they've been flattened out and moved in that way, and then you'll do exactly 00:21:02.940 |
the same thing for this next patch here, two, oh, one, three. You flatten it out and put 00:21:09.580 |
So if you basically take those kernels and flatten them out in this format, then you 00:21:14.420 |
end up with a matrix multiply. If you multiply this matrix by this matrix, you'll end up 00:21:21.420 |
with the output that you want from the convolution. So this is basically a way of unrolling your 00:21:30.180 |
kernels and your input features into matrices, such that when you do the matrix multiply, you get the result of the convolution. 00:21:36.780 |
So it's kind of a nifty trick. And so that is called im2col. I guess we're kind 00:21:45.180 |
of cheating a little bit. Implementing that is kind of boring. It's just a bunch of copying 00:21:49.100 |
and tensor manipulation. So I actually haven't done it. Instead, I've linked to a numpy implementation, 00:21:58.700 |
which is here. And part of it is this get_indices function, which is here. And as you can 00:22:10.020 |
see, it's a little bit tedious with repeats and tiles and reshapes and whatnot. 00:22:14.980 |
So I'm not going to call it homework. But if you want to practice your tensor indexing 00:22:21.860 |
manipulation skills, try creating a PyTorch version from scratch. I got to admit I didn't 00:22:27.260 |
bother. Instead, I use the one that's built into PyTorch. And in PyTorch it's called unfold. 00:22:35.860 |
So if we take our image: PyTorch expects there to be a batch axis and 00:22:45.500 |
a channel dimension, so we'll add two unit leading dimensions to it. Then we can unfold 00:22:52.900 |
our input for a three by three kernel. And that will give us a 9 by 676 input. And so then 00:23:07.780 |
we can take that, and then we will take our kernel and 00:23:20.680 |
just flatten it out into a vector. So view changes the shape, and minus one just says 00:23:26.460 |
dump everything into this dimension. So that's going to create a length nine vector. 00:23:36.140 |
And so now we can do the matrix multiply, just like they've done here, of 00:23:42.740 |
the kernel matrix, that's our weights, by the unrolled input features. And that gives 00:23:52.660 |
us a 676 long vector. We can then view that as 26 by 26, and we get back, as we hoped, our 00:24:01.700 |
left edge result. 00:24:16.340 |
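As a rough sketch of those steps (shapes assume the 28 by 28 image and the 3x3 left_edge kernel from before):

    import torch.nn.functional as F

    im = x_im[None, None]               # add batch and channel axes: (1, 1, 28, 28)
    unfolded = F.unfold(im, (3, 3))[0]  # (9, 676): one column per 3x3 window
    out = (left_edge.view(-1) @ unfolded).view(26, 26)  # matmul, then reshape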
And so this is how we can, kind of from scratch, create a better implementation of convolutions. The reason I'm allowed to cheat here is because we did actually 00:24:21.180 |
create convolutions from scratch. We're not always creating the GPU optimized versions 00:24:25.500 |
from scratch, which was never something I promised. So I think that's fair. But it's 00:24:29.620 |
cool that we can kind of hack out a GPU optimized version in the same way that the kind of original 00:24:34.820 |
deep learning library did. So if we use apply_kernel, it takes nearly nine milliseconds. 00:24:46.980 |
If we use unfold with a matrix multiply, we get 20 microseconds. So that's about 00:24:56.580 |
400 times faster. So that's pretty cool. Now, of course, we don't have to use unfold and 00:25:03.100 |
matrix multiply, because PyTorch has a conv2d. So we can run that. And that, interestingly, 00:25:11.860 |
is about the same speed, at least on CPU. But this would work on GPU just 00:25:19.100 |
as well. Yeah, I'm not sure this will always be the case. In this case, it's a pretty small 00:25:25.220 |
image. I haven't experimented a whole lot to see whereabouts there's a big difference 00:25:32.260 |
in speeds between these. Obviously, I always just use F.conv2d. But if there's some more 00:25:37.500 |
tricky convolution you need to do with some weird thing around channels or dimensions or 00:25:42.820 |
something, you can always try this unfold trick. It's nice to know it's there, I think. 00:25:48.660 |
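For comparison, the built-in version is a one-liner (same shapes as in the unfold sketch above):

    out = F.conv2d(im, left_edge[None, None])[0, 0]  # (26, 26), same result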
So we could do the same thing for diagonal edges. So here's our diagonal edge kernel 00:25:56.980 |
or the other diagonal. So if we just grab the first 16 images, then we can do a convolution 00:26:16.140 |
on our whole batch with all of our kernels at once. So this is a nice optimized thing 00:26:24.580 |
that we can do. And you end up with your 26 by 26. You've got your four kernels and you've 00:26:36.460 |
got your 16 images. And so that's summarized here. So that's generally what we're doing 00:26:41.980 |
to get good GPU acceleration is we're doing a bunch of kernels and a bunch of images all 00:26:47.180 |
at once across all of their pixels. And so here we go. That's what happens when we take 00:26:55.580 |
a look at our various kernels for a particular image. Left edge, I guess top edge, and then 00:27:06.980 |
diagonal top left and top right. OK, so that is optimized convolutions, and that works 00:27:16.260 |
just as well on CPU or GPU. Obviously, GPU will be faster if you have one. 00:27:22.740 |
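That batched call looks roughly like this (a sketch; diag1_edge and diag2_edge stand for the two diagonal kernels, and x_train for the training images):

    import torch
    import torch.nn.functional as F

    kernels = torch.stack([left_edge, top_edge, diag1_edge, diag2_edge])[:, None]  # (4, 1, 3, 3)
    xb = x_train[:16].view(-1, 1, 28, 28)   # (16, 1, 28, 28)
    batch_features = F.conv2d(xb, kernels)  # (16, 4, 26, 26): 16 images, 4 kernels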
Now, how do we deal with the problem that we're losing one pixel on each side? What we can do is we 00:27:30.380 |
can add something called padding. And for padding, what we basically do is rather than 00:27:36.500 |
starting our window here, we start it right over here. And we actually would be up one 00:27:43.500 |
as well. And so these three on the left here, we just take the input for each of those as 00:27:57.340 |
zero. So we're basically just assuming that they're all zero. I mean, there's other options 00:28:02.420 |
we could choose. We could assume they're the same as the one next to them. There's various 00:28:07.980 |
things we can do, but the simplest, and the one we normally use, is just to assume that they're 00:28:10.740 |
zero. So, for example, this is called one pixel padding. Let's say we 00:28:22.820 |
did two pixel padding. So we had two pixel padding with a five by five input and a four 00:28:32.020 |
by four kernel. So the gray is our kernel. Then we're going to start right up way over 00:28:38.580 |
here on the corner. And then you can see what happens as we slide the kernel over. There's 00:28:46.300 |
all the spots that it's going to take. And so that this dotted line area is the area 00:28:51.580 |
that we're kind of effectively going through. But all of these white bits, we're just going 00:28:56.860 |
to treat as zero. And so, and then this is this green as the output size we end up with, 00:29:01.460 |
which is going to be six by six for a five by five input. I should mention, even-sized 00:29:12.060 |
kernels are not used very often; we normally use odd-sized kernels. If you use, for 00:29:16.420 |
example, a three by three kernel and one pixel of padding, you will get back the same size 00:29:22.260 |
you start with. If you use five by five with two pixels of padding, you'll end up with 00:29:28.580 |
the same size you start with. So generally, odd-sized kernels are easier 00:29:33.820 |
to deal with, to make sure you end up with the same thing you start with. 00:29:37.100 |
OK, so, yeah, as it says here: if you've got an odd-sized ks by ks kernel, 00:29:47.660 |
then ks // 2, that's truncating division, which is what slash slash means, will give you the right amount of padding. 00:29:57.060 |
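The general rule that this ks // 2 trick falls out of is the standard output-size formula:

    def conv_out_size(n, ks, stride=1, padding=0):
        # output grid size along one dimension
        return (n + 2*padding - ks) // stride + 1

    conv_out_size(28, ks=3, stride=1, padding=1)  # 28: same size we started with
    conv_out_size(28, ks=3, stride=2, padding=1)  # 14: halved, as discussed next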
And so another trick you can do is you don't always have to just move your window across 00:30:05.140 |
by one each time. You could move it by a different amount each time. The amount you move it by 00:30:10.960 |
is called the stride. So, for example, here's a case of doing a stride two. So with stride 00:30:16.460 |
two padding one, so we start out here and then we jump across two and then we jump across 00:30:21.380 |
two and then we go to the next row. So that's called a stride two convolution. Stride two 00:30:26.720 |
convolutions are handy because they actually reduce the dimensionality of your input by 00:30:34.620 |
a factor of two. And that's actually what we want to do a lot. For example, with an 00:30:42.860 |
autoencoder, we want to do that. And in fact, for most classification architectures, we 00:30:49.220 |
do exactly that. We keep on reducing the kind of the grid size by a factor of two again 00:30:56.140 |
and again and again, using stride two convolutions with padding of one. So that's strides and 00:31:02.740 |
padding. So let's go ahead and create a convnet using these approaches. So we're going 00:31:10.180 |
to get the size of our training set. This is all the same as before: number of categories, 00:31:16.460 |
number of digits, size of our hidden layer. So previously, with our sequential linear 00:31:37.220 |
models, with our MLPs, we basically went from the number of pixels to the number of hidden, 00:31:50.060 |
and then a ReLU, and then the number of hidden to the number of outputs. So here's the equivalent 00:31:57.300 |
with a convolution. Now the problem is that you can't just do that because the output 00:32:02.460 |
is not now 10 probabilities for each item in our batch, but it's 10 probabilities for 00:32:08.620 |
each item in our batch for each of 28 by 28 pixels because we don't even have a stride 00:32:13.180 |
or anything. So you can't just use the same simple approach that we had for MLP. We have 00:32:19.140 |
to be a bit more careful. So to make life easier, let's create a little conv function 00:32:26.020 |
that does a conv2d with a stride of 2, optionally followed by an activation. So if act is true, 00:32:34.940 |
we will add in a ReLU activation. So this is going to either return a conv2d, or a little 00:32:44.500 |
sequential containing a conv2d followed by a ReLU. 00:32:53.580 |
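That helper might look something like this (a sketch matching the description; the notebook's actual code may differ slightly):

    import torch.nn as nn

    def conv(ni, nf, ks=3, stride=2, act=True):
        # stride-2 conv halves the grid size; padding of ks//2 handles the edges
        res = nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks//2)
        if act: res = nn.Sequential(res, nn.ReLU())
        return res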
And so now we can create a CNN from scratch as a sequential model. And since activation is true by default, this is going 00:33:00.000 |
to take our 28 by 28 image, starting with one channel, and create an output of four channels. 00:33:08.420 |
So this is the number of inputs, and this is the number of filters. Sometimes we'll say filters to 00:33:13.080 |
describe the number of kind of channels that our convolution has. That's the number of 00:33:18.060 |
outputs. And it's very similar to the idea of the number of outputs in a linear layer, 00:33:23.540 |
except this is the number of outputs in your convolution. So what I like to do when I create 00:33:30.180 |
stuff like this is I add a little comment just to remind myself what is my grid size 00:33:34.980 |
after this. So I had a 28 by 28 input, and then I put it through a stride-2 conv. 00:33:41.020 |
So the output of this will be 14 by 14. So then we'll do the same thing again, but this 00:33:46.500 |
time we'll go from a four channel input to an eight channel output and then from eight 00:33:51.740 |
to 16. So by this point, we're now down to a four by four and then down to a two by two. 00:34:01.360 |
And then finally, we're down to a one by one. So on the very last layer, we won't add an 00:34:07.580 |
activation. And the very last layer is going to create 10 outputs. And since 00:34:13.220 |
we're now down to a one by one, we can just call flatten and that's going to remove those 00:34:19.380 |
unnecessary unit axes. 00:34:26.900 |
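So the whole model, sketched with the grid-size comments just described (using the conv helper above):

    simple_cnn = nn.Sequential(
        conv(1, 4),               # 28x28 -> 14x14
        conv(4, 8),               # -> 7x7
        conv(8, 16),              # -> 4x4
        conv(16, 16),             # -> 2x2
        conv(16, 10, act=False),  # -> 1x1, with 10 output channels
        nn.Flatten(),             # (N, 10, 1, 1) -> (N, 10)
    )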
So if we take that and pop a mini batch through it, we end up with exactly what we want: 16 by 10. 00:34:35.060 |
So for each of our 16 images, we've got 10 probabilities, one for each possible digit. So if we take our training set and make it 00:34:43.180 |
into 28 by 28 images, and we do the same thing for a validation set. And then we create two 00:34:50.020 |
data sets, one for each, which we call train dataset and valid dataset. And we're now 00:34:56.900 |
going to train this on the GPU. Now, if you've got a Mac, you can use a device called, well, 00:35:06.220 |
if you've got an Apple Silicon Mac, you've got a device called MPS, which is going to 00:35:11.460 |
use your Mac's GPU. Where if you've got an Nvidia, you can use CUDA, which will use your 00:35:17.780 |
Nvidia GPU. CUDA is 10 times faster or more, possibly much more, than a Mac. So you definitely 00:35:25.340 |
want to use Nvidia if you can. But if you're just running it on a Mac laptop or whatever, 00:35:31.100 |
you can use MPS. So basically you want to know what device to use: do we want to use 00:35:35.300 |
CUDA or MPS? You can check torch.backends.mps.is_available to see if 00:35:41.780 |
you're running on a Mac with MPS, and you can check torch.cuda.is_available to see if you've 00:35:47.500 |
got an Nvidia GPU, in which case you've got CUDA. And if you've got neither, of course, 00:35:51.540 |
you'll have to use the CPU to do computation. 00:35:58.780 |
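That check can be a one-liner along these lines (a sketch; def_device is just a convenient name for it):

    import torch

    def_device = ('mps' if torch.backends.mps.is_available()
                  else 'cuda' if torch.cuda.is_available()
                  else 'cpu')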
So I've created a little function here, to_device, which takes a tensor, or a dictionary or a list of tensors or whatever, and a device 00:36:06.460 |
to move it to. And it just goes through and moves everything onto that device. Or, if it's 00:36:13.220 |
a dictionary, it moves the dictionary's values onto that device. So there's a handy 00:36:18.060 |
little function. And so we can create a custom collate function, which calls the PyTorch 00:36:27.900 |
default collation function and then puts those tensors onto our device. 00:36:34.900 |
And so with that, we've now got enough to train this neural net on the GPU. We created this get_dls 00:36:44.420 |
function in the last lesson. So we're going to use that passing in the datasets that we 00:36:50.420 |
just created and our default collation function. We're going to create our optimizer using 00:36:57.740 |
our CNN's parameters. And then we call fit. Now fit, remember, we also created in our last 00:37:06.900 |
lesson. And it's done. What I did then was reduce the learning rate by a 00:37:13.860 |
factor of four and run it again. And eventually, yeah, I got to a fairly similar accuracy to 00:37:21.660 |
what we did with our MLP. So yeah, we've got a convolutional network working. 00:37:32.300 |
I think that's pretty encouraging. And it's nice that to train it, we didn't have to write 00:37:37.140 |
much code, right? We were able to use code that we already built. We were able to use 00:37:41.940 |
the dataset class that we made, the get_dls function that we made, and the fit function 00:37:48.220 |
that we made. And you know, because those things are written in a fairly general way, 00:37:55.900 |
they work just as well for a ConvNet as they did for an MLP, nothing had to change. So 00:38:00.260 |
that was nice. Notice I had to take the model and put it on the device as well. So 00:38:07.940 |
that will go through and basically put all of the tensors that are in that model onto the device. 00:38:21.980 |
So if we've got a batch size of 64, and, as we do, one channel, 28 by 28, then our axes 00:38:29.860 |
are batch channel height, width. So normally, this is referred to as NCHW. So N, generally 00:38:38.400 |
when you see N in a paper or whatever, in this way, it's referring to the batch size. 00:38:44.540 |
N being the number, that's the mnemonic, the number of items in the batch. C is the number 00:38:50.620 |
of channels, height by width, NCHW. TensorFlow doesn't use that, TensorFlow uses NHWC. So 00:39:02.180 |
we generally call that channels-last, since channels are at the end. And this one 00:39:10.120 |
we normally call channels first. Now, of course, it's not actually channels first. It's actually 00:39:19.300 |
channel second, but we ignore the batch bit. In some models, particularly some more modern 00:39:27.180 |
models, it turns out the channels last is faster. So PyTorch has recently added support for 00:39:33.740 |
channels last. And so you'll see that being used more and more as well. 00:39:39.940 |
All right, so a couple of comments and questions from our chat. The first is Sam Watkins pointing 00:39:50.380 |
out that we've actually had a bit of a win here, which is that the number of parameters 00:39:56.380 |
in our CNN is pretty small by comparison. So the number in the MLP version, the number 00:40:04.740 |
of parameters is equal to basically the size of this matrix. So M times NH. Oh, plus the 00:40:21.540 |
number in this, which will be NH times 10. And, you know, something that at some point 00:40:31.780 |
we probably should do is actually create something that allows us to automatically calculate 00:40:38.420 |
the number of parameters. And I'm ignoring the bias there, of course. Let's see what 00:40:56.820 |
would be a good way to do that. Maybe np.prod. There we go. So what we could do 00:41:15.900 |
is just calculate this automatically by doing a little list comprehension here. 00:41:24.360 |
So there's the number of parameters across all of the different layers, so both bias 00:41:29.180 |
and weights. And then we could, I guess, well, let's just use PyTorch. 00:41:38.320 |
So we could turn that into a tensor and sum it up. Oops. So that's the number in our MLP. 00:41:49.900 |
And then the number in our simple CNN. So that's pretty cool. We've gone down from 40,000 00:41:59.780 |
to 5,000, and got about the same accuracy. Oh, thank you, Jonathan. Jonathan's reminding 00:42:07.420 |
me that there's a better way than np.prod(o.shape), which is just to say o.numel(), the number 00:42:16.260 |
of elements. Same thing. Very nice. 00:42:30.980 |
Now, one person asked a very good question, which is: I thought convolutional neural networks can handle any sized image? And actually, no, 00:42:40.960 |
this convolutional network cannot handle any sized image. This convolutional neural network 00:42:45.780 |
only handles images that, once they go through these stride-2 convs, end up as a one by 00:42:50.180 |
one, because otherwise you can't just flatten it and end up with 16 by 10. 00:42:59.020 |
So we will learn how to create convnets that can handle any sized input. But there's nothing 00:43:06.680 |
particular about a convnet that necessitates that it can handle any sized input. 00:43:10.700 |
Okay, so let's briefly finish this section off by talking about 00:43:22.840 |
the idea of receptive fields. Consider this one input 00:43:29.860 |
channel, four output channel, three by three kernel. Right. So that's just to show you 00:43:37.520 |
what we're doing here. So, simple_cnn: this is the model 00:43:45.860 |
we created. Remember, it was a sequential model containing sequential models, because 00:43:49.260 |
that's how our conv function worked. So simple_cnn[0] is our first layer; it contains both 00:43:55.500 |
the conv and the ReLU. So simple_cnn[0][0] is the actual conv. So if we grab that, call 00:44:02.280 |
it conv1: it's a four by one by three by three. So: number of outputs, number of input 00:44:12.380 |
channels, and height by width of the kernel. And then it's got its bias as well. 00:44:18.940 |
So that's how we can kind of deconstruct what's going on with the weight matrices, or parameters, inside 00:44:26.820 |
a convolution. Now, I'm going to switch over to Excel. So in the lesson notes on the course 00:44:38.140 |
website or on the forum, you'll find we've got an Excel. You'll see we've got an Excel 00:44:44.460 |
workbook. Oh, someone reminded me that there is a nice trick we can do. I do want 00:44:49.500 |
to do that actually because I love this trick. Oh, I just deleted everything though. Let's 00:44:56.740 |
put them all back. Here we go. Which is, you actually don't need square brackets. The square 00:45:00.460 |
brackets make a list comprehension. Without the square brackets, it's called a generator, 00:45:05.700 |
and... oh, no, you can't use it there. Maybe that only 00:45:12.980 |
works with NumPy. Ah, okay. So wait, that's the list. No, that doesn't work either. So 00:45:29.300 |
much for that. I'm kind of curious now. Maybe torch.sum? Nope. Just sum? Oh, okay, I don't 00:45:55.420 |
want to use Python's sum. That's interesting. I feel like all of them should handle generators, 00:46:02.260 |
but there you go. Okay. So open up the conv example spreadsheet and what you'll see on 00:46:17.260 |
the conv-example worksheet page is something that looks a lot like the number seven. And 00:46:24.780 |
this is the number seven that I got straight from MNIST. Okay. So you can see over here 00:46:34.620 |
we have a number seven. This is a number seven from MNIST that I have copied into Excel. 00:46:41.460 |
And then you can see over here we've got like a top edge kernel being applied and over here 00:46:46.340 |
we've got a right edge kernel being applied. This might be surprising you because you might 00:46:51.100 |
be thinking, wait a second, Jeremy, Microsoft Excel doesn't do convolutional neural networks. 00:46:57.500 |
Well actually it does. So if I zoom in in Excel, you'll see actually these numbers are in fact 00:47:10.340 |
conditional formatting applied to a bunch of spreadsheet cells. And so what I did was 00:47:15.660 |
I copied the actual pixel values into Excel and then applied conditional formatting. And 00:47:21.700 |
so now you can see what the digit is actually made of. So you can see here I've created 00:47:32.260 |
our top edge filter and here I've created our left edge filter. And so here I am applying 00:47:44.980 |
that filter to that window. And so here you can see it looks a lot like NumPy. It's just 00:47:55.260 |
a sum product. And you might not be aware of this but in Excel you can actually do broadcasting. 00:48:06.700 |
You have to hit Apple shift enter or control shift enter and it puts these little curly 00:48:12.100 |
brackets around it. It's called an array formula. It basically lets you do broadcasting or simple 00:48:16.660 |
broadcasting in Excel. And so here's how you could say this is how I created this top edge 00:48:22.500 |
filtered version in Excel. And the left edge version is exactly the same just a different 00:48:29.380 |
kernel. And as you can see if I click on it it's applying this filter to this input area 00:48:38.260 |
and so forth. OK. So then I just arbitrarily picked some different values here. And so something 00:48:50.300 |
to notice now in my second layer; so here's conv1, and here's conv2. It's got a bit more work 00:48:57.940 |
to do. We actually need two filters because we need to add together this bit here applied 00:49:09.500 |
to this with this kernel applied and this bit here with this kernel applied. So you 00:49:18.180 |
actually need one set of three by three for each input. And I also want two separate 00:49:26.900 |
outputs. So I actually end up needing a two by two by three by three weights matrix, or 00:49:37.340 |
weights tensor I should say, which you might remember is exactly what we had in PyTorch. 00:49:41.780 |
We had a rank four tensor. So if I have a look at this one, you see exactly the same thing. 00:49:49.220 |
This input is using this kernel applied to here and this kernel applied to here. So that's 00:49:56.180 |
important to remember that you have these rank four tensors. And so then, rather than 00:50:03.360 |
doing a stride-2 conv, I did something else, which is actually a bit out of favor nowadays, but 00:50:09.900 |
it's another option, which is to do something called max pooling to reduce my dimensionality. 00:50:15.500 |
So you can see here I've got 28 by 28. I've reduced it down here to 14 by 14. And the 00:50:21.420 |
way I did it was simply to take the max of each little two by two area. OK. So that's 00:50:30.640 |
all that's been done there. So that's called max pooling. And so max pooling has a similar 00:50:36.940 |
effect to a stride-2 conv: not mathematically identical, but the same effect, in that it does a 00:50:41.640 |
convolution and reduces the grid size by two on each dimension. 00:50:50.420 |
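In PyTorch, that alternative would be spelled something like this (a sketch of one conv-plus-max-pool block):

    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv2d(1, 2, kernel_size=3, stride=1, padding=1),  # keeps 28x28
        nn.ReLU(),
        nn.MaxPool2d(2),  # max of each 2x2 area: 28x28 -> 14x14
    )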
OK. So then how do we create a single output, if we don't keep doing this until we get to one by one, which I'm too lazy 00:50:55.180 |
to do in Excel. Well one approach and again this is a little bit out of favor as well 00:50:59.700 |
but one approach is we can take every one of these, we've now got 14 by 14, 00:51:07.340 |
and apply a dense layer to it. And so what I've done here is, imagine this has 00:51:15.580 |
basically all been flattened out into a vector. And so here we've got the sum product 00:51:24.740 |
of this by this, plus the sum product of this by this. And that gives us a single number. 00:51:34.940 |
And so that is what we would then optimize, in order to learn our weight matrices. 00:51:42.820 |
Now, in the more modern approach, we don't use this kind of dense layer 00:51:50.180 |
much anymore, though it still appears a bit. The main place that you see this used is in a network 00:51:58.940 |
called VGG, which is very old now; I think it might be 2014 or something. But it's actually 00:52:05.180 |
still used. And that's because for certain things like something called style transfer 00:52:11.940 |
or in general perceptual losses people still find VGG seems to work better. So you still 00:52:20.180 |
actually see this approach nowadays sometimes. The more common approach however nowadays 00:52:25.660 |
is we take the penultimate layer and we just simply take the average of all of the activations. 00:52:35.100 |
So nowadays, the Excel way of doing it would be literally to simply 00:52:40.340 |
say average of the penultimate layer, and that is called global average pooling. Everything 00:52:52.420 |
has to have a fancy phrase, but that's all it is: take the average. That's called 00:52:56.700 |
global average pooling. Or you could take the max, and that would 00:53:01.380 |
be global max pooling. 00:53:06.220 |
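The modern PyTorch spelling of that kind of head would be roughly this (the 16 input channels here are illustrative):

    head = nn.Sequential(
        nn.AdaptiveAvgPool2d(1),  # global average pool: (N, C, H, W) -> (N, C, 1, 1)
        nn.Flatten(),             # -> (N, C)
        nn.Linear(16, 10),        # map channels to the 10 classes
    )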
So anyway, the main reason I wanted to show you this was to do something which I think is pretty interesting. Let me zoom out 00:53:13.780 |
a little bit here, and take something in our max pool here, and I'm going to say trace 00:53:27.220 |
precedents, to show you the area that it's coming from. OK. So it's coming 00:53:33.020 |
from these four numbers. Now if I trace precedents again, saying what's actually impacting this, 00:53:40.780 |
obviously the kernel's impacting it, and then you can see that the input area here is a 00:53:47.100 |
bit bigger. And then if I trace precedents again, you can see the input area is bigger 00:53:55.300 |
still. So this number here is calculated from all of these numbers in the input. This area 00:54:06.180 |
in the input is called the receptive field of this unit. And so the receptive field in 00:54:15.380 |
this case is 1 2 3 4 5 6 by 6. Right. And that means that a pixel way up here in the 00:54:24.140 |
top right has literally no ability to impact that activation. It's not part of its receptive 00:54:31.260 |
field. If you have a whole bunch of stride-2 convs, each time you have one, the receptive 00:54:37.300 |
field is going to get twice as big. So the receptive field at the end of a deep network 00:54:42.660 |
is actually very large. But the inputs closest to the middle of the receptive field 00:54:50.220 |
have the biggest say in the output, because they implicitly appear the most 00:54:57.420 |
often in all of these kind of dot products that are inside this convolutional 00:55:04.220 |
window. So the receptive field is not just like a single binary on off thing. Certainly 00:55:10.900 |
all the stuff that's not got precedence here is not part of it at all. But the closer to 00:55:16.980 |
the center of the receptive field the more impact it's going to have the more ability 00:55:21.340 |
it's got to change this number. So the receptive field is a really important concept. And yeah 00:55:29.620 |
playing around with Excel's precedent arrows is, I think, a nice way to see that, 00:55:35.980 |
at least in my opinion. And apart from anything else it's great fun creating a convolutional 00:55:42.520 |
neural network in Excel. I thought so anyway. OK. So let's take a seven minute break. I'll 00:55:53.860 |
see you back after that to talk about a convolutional auto encoder. All right. OK. Welcome back. 00:56:06.500 |
We're going to have a look now at the auto encoder notebook. So we're just going to import 00:56:13.140 |
all of our usual stuff and we've got one more of our own modules to import now as well. 00:56:22.060 |
And this time we are going to switch to a different 00:56:29.540 |
data set, which is the Fashion MNIST data set. We can take advantage of the stuff that we 00:56:38.660 |
did in notebook 05, datasets, and the Hugging Face stuff, to load it. So we've seen this a little 00:56:46.900 |
bit before back in our data sets one here and we never actually built any models with 00:56:57.700 |
it. So let's first of all do that. So this is just going to convert each image 00:57:07.340 |
into a tensor and it's going to be an in place transform. Remember we created this decorator 00:57:13.940 |
and so we can call the dataset dictionary's with_transform. This is all stuff we've done before. 00:57:22.180 |
And so here we have our example of a sneaker. All right. And we will create our collation 00:57:33.540 |
function collating the dictionary for that data set. That's something you should 00:57:39.660 |
remind yourself of: we built that ourselves in the datasets notebook. And let's actually 00:57:46.180 |
make our collate function something that does to_device, which we wrote in our last 00:57:53.260 |
notebook and we'll get a little data loaders function here which is going to go through 00:57:59.820 |
each item in the data set dictionary and get a data loader for it and give us a dictionary 00:58:06.900 |
of data loaders. OK. So now we've got a data loader for training and a data loader 00:58:19.700 |
for validation. So we can grab the X and Y batch by just calling next on that iterator 00:58:28.860 |
as we've done before. Let's look at each of these in turn. Actually, we've 00:58:38.020 |
done all this before, but it was a couple of weeks ago. So just to remind you, we can get 00:58:42.860 |
the names of the features. And so we can then create an itemgetter for our Ys, and 00:58:52.340 |
we can call that the label getter. We can apply that to our labels to get the titles 00:58:58.060 |
of everything in our mini batch and we can then call our show images that we created 00:59:05.620 |
with that mini batch with those titles. And here we have our fashion MNIST mini batch. 00:59:19.380 |
OK. So let's create a classifier and we're just going to use exactly the same code copy 00:59:24.460 |
and pasted from the previous notebook. So here is our sequential model. And we are going 00:59:41.340 |
to grab the parameters of the CNN, and the CNN I've actually moved over to the device. 00:59:55.180 |
The default device was what we created in our last notebook. And as you can see it's 00:59:58.460 |
fitting. Now our first problem is it's training very slowly, which is kind of annoying. So 01:00:10.740 |
why is it running so slowly? Let's have a look at our data set. So 01:00:18.820 |
when it's finally finished let's take a look at an item from the data set. Actually let's 01:00:27.460 |
look at the data set. Let's actually go all the way back to the data set dictionary. So 01:00:35.580 |
before it gets transformed, the dataset dictionary, and let's grab the training part of that. 01:00:45.180 |
And let's grab one item. And actually we can see here the problem: for MNIST, we had all 01:00:56.140 |
of the data loaded into memory into a single big tensor. But this hugging face one is created 01:01:03.380 |
in a much more kind of normal way which is each image is a totally separate PNG image. 01:01:09.320 |
It's not all pre converted into a single thing. Why is that a problem. Well the reason it's 01:01:17.940 |
a problem is that our data loader is spending all of its time decoding these PNGs. So if 01:01:32.580 |
I train here, OK, so while I'm training, I can type htop, and you can see that basically 01:01:42.900 |
my CPU is 100 percent used. Now that's weird because I've actually got 64 CPUs. Why is 01:01:49.260 |
it using just one of them is the first problem. But why does it matter that it's using 100 01:01:54.100 |
percent CPU. Well the reason it matters. Let's run it again so you can see. Why does it matter 01:02:01.940 |
that our CPU is 100 percent. And why is it making it so slow. Well the reason why is 01:02:07.740 |
if we look at nvidia-smi dmon, that will monitor our GPU's utilization. I've got three GPUs; 01:02:17.380 |
I say to choose just the zeroth index one. And you'll see this column here, SM. That stands 01:02:23.400 |
for streaming multiprocessor. It's like the equivalent of CPU usage. And generally 01:02:28.660 |
we're only using up one percent of our one GPU. So no wonder it's so slow. So the first 01:02:38.140 |
thing we want to do, then, is try to make things faster. Now, to make things faster, we want 01:02:44.900 |
to be using more than one CPU to decode our PNGs. And as it turns out, that's actually 01:02:50.500 |
pretty easy to do. You just have to add an extra argument to your data loaders, which 01:03:08.900 |
is here: num_workers. And so I can say, use eight CPUs, for example. 01:03:19.540 |
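Something like this (a sketch; tds stands for the transformed training dataset, and collate_dict for the dictionary-collating helper we built):

    from torch.utils.data import DataLoader

    train_dl = DataLoader(tds, batch_size=256, num_workers=8,
                          collate_fn=collate_dict(tds))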
Now if I recreate the data loaders and then try to get the next batch... oh, now I've got an error. 01:03:25.620 |
And the error is rather quirky. And what it's saying is, oh, you're now trying to use 01:03:34.980 |
multiple processes, and generally in Python and PyTorch, using multiple processes, things 01:03:40.160 |
get complicated. And one of the things that absolutely just doesn't work is you can't 01:03:45.980 |
actually have your data loader put things onto the GPU in your separate processes. 01:03:56.960 |
It just doesn't work. So the reason for this error is actually because of the fact that 01:04:05.140 |
we used a collate function that put things on the device. That's incompatible unfortunately 01:04:12.220 |
with using multiple workers. So that's that's a problem. And the answer to that problem 01:04:25.100 |
sadly is that we would have to actually rewrite our fit function entirely. So there's annoying 01:04:37.300 |
thing number one. And we don't want to be rewriting our fit function again and again. 01:04:41.900 |
We want to have a single fit function. So OK so there's a problem that we're going to 01:04:47.260 |
have to think about. Problem number two is that this is not very accurate. Eighty seven 01:04:57.100 |
percent. Well, I mean, is that accurate? It's easy enough to find out. There's a really 01:05:01.260 |
nice website called Papers with Code, and it has a little leaderboard, and we 01:05:16.860 |
can see whether we're any good. And the answer is we're not very good at all. So these papers 01:05:24.180 |
had ninety six percent, ninety four percent, ninety two percent. So yeah, we're not looking 01:05:35.700 |
great. So how do we improve that. There's a lot of things we could try but pretty much 01:05:45.540 |
all of them are going to involve modifying our fit function, again, in reasonably complicated 01:05:53.540 |
ways. So we still got a bit of an issue there. Let's put that aside because what we actually 01:05:58.660 |
wanted to do is create an auto encoder. So to remind you about what an auto encoder is 01:06:09.500 |
and we're going to be able to go into a bit more detail now we're going to start with 01:06:13.580 |
our input image which is going to be twenty eight by twenty eight. So it's the number 01:06:17.660 |
three right. And it's a twenty eight by twenty eight and we're going to put it through for 01:06:24.100 |
example a Stride 2 Conv Stride 2 and that's going to have an output of a fourteen by fourteen 01:06:37.380 |
and we can have more channels. So say maybe four. So this is twenty eight by twenty eight 01:06:42.020 |
by one. That's two fourteen by fourteen by two. So we've reduced the height and width 01:06:48.100 |
by two but added an extra channel. So overall this is a two X decrease in parameters and 01:06:56.860 |
then we could do another Stride 2 Conv and that would give us a seven by seven. And again 01:07:04.220 |
we can choose however many channels we want but let's say we choose four. So now compared 01:07:09.140 |
to our original we've now got a times four reduction. And so we could do that a few times 01:07:16.300 |
or we could just stay there. And so this is compressing. And so then what we could do 01:07:27.300 |
is then somehow have a convolution layer or group of layers which does a convolution and 01:07:36.220 |
also increases the size. There is actually something called a transposed convolution 01:07:48.140 |
which I'll leave you to look up if you're interested, which can do that. It's also known, 01:07:53.660 |
rather weirdly, as a stride one-half convolution. But there's actually a really simple way to 01:08:00.580 |
do this, which is to say: let's say you've got a bunch of pixels, say a 01:08:06.100 |
three by three of pixels that looks like this, and we want to make that into 01:08:15.540 |
a six by six. We could do that very easily: we could simply copy 01:08:30.340 |
that pixel there into the first four. Copy that pixel there into these four. And so you 01:08:39.140 |
can see and then copy this pixel here into these four. And so we're simply turning each 01:08:45.540 |
pixel into four pixels. And so this is called nearest neighbor up sampling. Now that's not 01:09:01.660 |
a convolution that's just copying. But what we could then do is we could then apply a 01:09:07.700 |
Stride one convolution to that right. And that would allow us to double the grid size with 01:09:17.860 |
a convolution. And that's what we're going to do. So our autoencoder is going to need 01:09:23.260 |
a deconvolutional layer, and that's going to contain two layers: an upsampling nearest neighbor 01:09:31.140 |
with scale factor of two, followed by a conv2d with a stride of one. OK. And you can see, for padding, 01:09:40.100 |
I just put kernel size slash slash two. So that's a truncating division because that 01:09:44.260 |
always works for any odd sized kernel. As before we will have an optional activation 01:09:50.400 |
function and then we will create a sequential using star layers. So that's going to pass 01:09:57.460 |
in each layer as a separate argument which is what sequential expects. OK. 01:10:09.940 |
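A sketch of that layer, following the description (upsample first, then a stride-1 conv):

    def deconv(ni, nf, ks=3, act=True):
        layers = [nn.UpsamplingNearest2d(scale_factor=2),  # copy each pixel into a 2x2 block
                  nn.Conv2d(ni, nf, kernel_size=ks, stride=1, padding=ks//2)]
        if act: layers.append(nn.ReLU())
        return nn.Sequential(*layers)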
So let's write a new fit function. I just basically copied it over from 01:10:17.620 |
our previous one, going through each epoch, but I've pulled out the evaluation into a separate 01:10:23.740 |
function; it's basically doing the same thing. OK. So here is our autoencoder. 01:10:41.340 |
And so, it's a bit tricky, because I wanted to halve the grid size three times to get 01:10:51.300 |
to a four by four by eight. But starting at twenty eight by twenty eight you can't divide 01:10:58.700 |
that three times and get an integer. So what I first do is I zero pad so add padding of 01:11:05.700 |
two on each side to get a 32 by 32 input. So if I then do a conv with two channel output 01:11:11.980 |
that gives us 16 by 16 by 2 and then again to get an 8 by 8 by 4 and then again to get 01:11:17.900 |
a 4 by 4 by 8. So this is doing an 8x compression, and then we can call deconv to do exactly 01:11:24.860 |
the same thing in reverse, the final one with no activation. And then we can truncate off 01:11:30.140 |
those two pixels off the edge; slightly surprisingly, PyTorch lets you pass negative two to zero 01:11:36.100 |
padding to crop off the final two pixels. And then we'll add a sigmoid which will force 01:11:42.780 |
everything to go between zero and one which of course is what we need. And then we will 01:11:48.580 |
use MSE loss to compare those pixels to our input pixels. And so a big difference we've 01:11:56.620 |
got here now is that our loss function is being applied to the output of the model and 01:12:02.820 |
the input itself. Right: we don't have yb here, we have xb. So we're trying to recreate our original. 01:12:17.060 |
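Putting the whole autoencoder together as just described (a sketch built from the conv and deconv helpers; channel counts follow the narration above):

    ae = nn.Sequential(
        nn.ZeroPad2d(2),          # 28x28 -> 32x32
        conv(1, 2),               # -> 16x16x2
        conv(2, 4),               # -> 8x8x4
        conv(4, 8),               # -> 4x4x8: the compressed representation
        deconv(8, 4),             # -> 8x8x4
        deconv(4, 2),             # -> 16x16x2
        deconv(2, 1, act=False),  # -> 32x32x1
        nn.ZeroPad2d(-2),         # crop: 32x32 -> 28x28
        nn.Sigmoid(),             # squash outputs into [0, 1] to match pixels
    )
    loss_func = nn.MSELoss()      # compares the output pixels to the *input* pixels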
And again, this is a bit annoying, that we have to create our own fit function. Anyway, so 01:12:23.900 |
we can now see what the MSE loss is; it's not going to be particularly human readable, 01:12:30.060 |
but it's a number we can watch to see if it goes down. And so then 01:12:43.780 |
we can do our SGD with the parameters of our auto encoder with MSE loss call that fit function 01:12:50.140 |
we just wrote and I won't wait for it to run. As you can see it's really slow for reasons 01:13:02.780 |
we've discussed, so I've run it before. What we want is to see that the original, which 01:13:12.020 |
is here, gets recreated. And the answer is: not really. I mean, they're roughly the same 01:13:30.340 |
things, but there's no point having an autoencoder which can't even recreate the originals. 01:13:39.000 |
The idea would be that if these outputs looked almost identical to these inputs, we would say: wow, 01:13:43.740 |
this is a fantastic network at compressing things by eight times. So I found it 01:13:54.940 |
very fiddly to try and get this to work at all. Something that I discovered can get it 01:13:59.580 |
to start training is to use a really low learning rate for a few epochs and then 01:14:06.120 |
increase the learning rate. At least that gets it to train and show something vaguely sensible. 01:14:14.740 |
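That trick is a crude warmup. As a sketch, assuming the fit function from earlier takes (epochs, model, loss function, optimizer, data loaders), and with made-up epoch counts and learning rates, it might look like this:

```python
import torch
import torch.nn.functional as F

# train_dl and valid_dl are assumed to be the data loaders from earlier.
opt = torch.optim.SGD(ae.parameters(), lr=0.01)
fit(5, ae, F.mse_loss, opt, train_dl, valid_dl)    # a few epochs at a low LR to get it moving

for g in opt.param_groups: g['lr'] = 0.1           # then raise the learning rate
fit(20, ae, F.mse_loss, opt, train_dl, valid_dl)
```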
But let's see... yeah, it still looks pretty crummy. This one 01:14:23.680 |
here I actually got by switching to Adam, and I removed the tricky bit; I removed these two as well. 01:14:31.980 |
But I couldn't get this to recreate anything very reasonable 01:14:39.100 |
in any reasonable amount of time. So why is this not working very well? There are 01:14:47.420 |
so many reasons it could be. Do we need a better optimizer? Do we need 01:14:52.620 |
a better architecture? Do we need to use a variational autoencoder? There are 01:14:58.740 |
a thousand things we could try, but doing it like this is going to drive us crazy. 01:15:06.060 |
We need to be able to rapidly try things, all kinds of different things. And 01:15:12.620 |
what I often see in projects, or on Kaggle, or wherever, is that people's code looks kind 01:15:19.780 |
of like this: it's all manual, and their iteration speed is too slow. We need 01:15:29.620 |
to be able to really rapidly try things. So we're not going to keep doing stuff manually 01:15:34.480 |
anymore. This is where we call a halt and say: OK, let's build up a framework that 01:15:44.220 |
we can use to rapidly try things and understand when things are working and when they aren't 01:15:50.780 |
working. So we're going to start creating a Learner. What is a Learner? Basically, 01:16:01.500 |
the idea is that this Learner is going to be something we build which will allow 01:16:05.900 |
us to try anything we can imagine very quickly. And we will build on top 01:16:12.820 |
of that Learner things that will allow us to introspect what's going on inside a model, 01:16:17.420 |
allow us to use multiple processes and CUDA to go fast, allow us to add things like 01:16:23.140 |
data augmentation, allow us to try a wide variety of architectures quickly, and 01:16:27.980 |
so forth. That's going to be the idea. And of course we're going to create it from 01:16:32.060 |
scratch. So let's start with Fashion MNIST, as before, and let's create a Data 01:16:43.660 |
Loaders class, which is going to look a bit like what we had before. 01:16:48.900 |
This couldn't be simpler: we're just going to pass in two data 01:16:55.900 |
loaders and store them away. And I'm going to create a class method that builds one from a dataset dictionary. 01:17:06.140 |
What that's going to do is call DataLoader on each of the dataset 01:17:11.660 |
dictionary's items with our batch size, and instantiate our class. If you 01:17:18.420 |
haven't seen classmethod before, it's what allows us to say DataLoaders.something 01:17:24.460 |
in order to construct this. We could have put this in __init__ just as well, but we'll be building 01:17:29.780 |
more complex data loading things later, so I thought we might start by getting the basic 01:17:35.180 |
structure right. 01:17:39.100 |
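A sketch of that DataLoaders class, as described (I'm calling the class method from_dd, which is what the miniai version ended up using, but treat the details as approximate):

```python
from torch.utils.data import DataLoader

class DataLoaders:
    def __init__(self, *dls): self.train, self.valid = dls[:2]

    @classmethod
    def from_dd(cls, dd, batch_size, **kwargs):
        # Wrap each dataset in the dataset dictionary (e.g. train and test)
        # in a DataLoader, and pass them to the constructor in order.
        return cls(*[DataLoader(ds, batch_size, **kwargs) for ds in dd.values()])
```

So something like dls = DataLoaders.from_dd(dsd, batch_size) gives us dls.train and dls.valid.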
And this is all pretty much the same as what we've had before. I'm not doing anything on the device here, because as we know that didn't really work. OK. Oh, 01:17:51.500 |
this is an old line; I don't need .cuda() anymore. We're going to use to_device, which I think 01:17:59.180 |
came from an earlier notebook... here we go. So here's an example of a very simple Learner that fits 01:18:15.380 |
on one screen, and this is basically going to replace our fit function. So a Learner 01:18:21.300 |
is going to be something that trains, or learns, a particular model, using a 01:18:27.220 |
particular set of data loaders, a particular loss function, some particular learning rate, 01:18:34.060 |
and some particular optimizer, or optimization function. Now, normally 01:18:41.140 |
most people would store each of these away separately, by writing 01:18:45.700 |
self.model = model and so on. And as I think we've talked about before, that's 01:18:52.260 |
a huge amount of boilerplate: it's more stuff that you can get wrong, 01:18:57.220 |
and more stuff that you have to read to understand the code, and I don't 01:19:02.260 |
like that kind of repetition. So instead we just call fastcore's store_attr to do that 01:19:07.620 |
all in one line. OK. So that's basically the idea with a class: think about what 01:19:12.940 |
information it's going to need, pass that all to the constructor, and store it 01:19:16.980 |
away. Then our fit function has the basic stuff we had before for 01:19:31.340 |
keeping track of accuracy. This will only work for classification, where we 01:19:37.220 |
can use accuracy. We put the model on our device, create the optimizer, store how many epochs 01:19:48.740 |
we're going through, and then for each epoch we call the one_epoch function. The one_epoch 01:19:54.940 |
function either does training or evaluation: we pass in True if we're training 01:20:01.260 |
and False if we're evaluating, and they're basically almost the same. We set 01:20:07.580 |
the model to training mode or not, then decide whether to use the validation set or 01:20:14.220 |
the training set based on whether we're training. And then we go through each batch in the data 01:20:22.380 |
loader and call one_batch, and one_batch is the thing which puts our batch 01:20:30.260 |
onto the device, calls our model, calls our loss function, and then, if we're training, 01:20:39.300 |
does our backward pass, our optimizer step, and our zero_grad, and then finally calculates 01:20:45.220 |
our metrics, or stats. And here's where we calculate our metrics. So that's basically 01:20:51.780 |
what we have there. 01:21:13.980 |
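Here's a minimal sketch of that one-screen Learner, reconstructed from the description above; the device helper and the exact bookkeeping are assumptions:

```python
import torch
import fastcore.all as fc

def_device = 'cuda' if torch.cuda.is_available() else 'cpu'

class Learner:
    def __init__(self, model, dls, loss_func, lr, opt_func=torch.optim.SGD):
        fc.store_attr()  # saves model, dls, loss_func, lr, opt_func on self

    def fit(self, n_epochs):
        self.accs, self.losses = [], []        # classification-only stats
        self.model.to(def_device)
        self.opt = self.opt_func(self.model.parameters(), self.lr)
        for epoch in range(n_epochs):
            self.one_epoch(True)
            self.one_epoch(False)

    def one_epoch(self, train):
        self.model.train(train)
        dl = self.dls.train if train else self.dls.valid
        for xb, yb in dl: self.one_batch(xb, yb, train)

    def one_batch(self, xb, yb, train):
        xb, yb = xb.to(def_device), yb.to(def_device)
        preds = self.model(xb)
        loss = self.loss_func(preds, yb)
        if train:
            loss.backward()
            self.opt.step()
            self.opt.zero_grad()
        with torch.no_grad():                  # track stats
            self.accs.append((preds.argmax(dim=1) == yb).float().mean().item())
            self.losses.append(loss.item())
```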
So let's go back to using an MLP, call fit, and away it goes. (There was an error here, pointed out by Kevin, thank you: it should be self.model.to.) One thing I guess 01:21:21.580 |
we could try now is using more than one process in the data loader. So let's try 01:21:31.540 |
that. Oh, it's so fast I didn't even see it go. There it goes: you can see all four CPUs being used 01:21:44.740 |
at once. Bang, it's done. OK, so that's pretty great. Let's see how fast it looks here. 01:21:52.300 |
All right, lovely. So that's a good sign: we've got a Learner that can fit things, 01:22:02.620 |
but it's not very flexible. It's not going to help us, for example, with our autoencoder, 01:22:09.820 |
because there's no way to change which things are used 01:22:14.740 |
for predicting with, or for calculating the loss with. We can't use it for anything except things 01:22:18.780 |
that involve accuracy with a multi-class classification. 01:22:30.580 |
It's not flexible at all, but it's a start. And I wanted to put this all 01:22:34.140 |
on one screen so you can see what the basic Learner looks like. All right. So how do we 01:22:41.780 |
do things other than multi-class accuracy? I decided to create a Metric class. Basically, 01:22:55.460 |
a Metric class is something we're going to define subclasses of that calculate 01:23:04.100 |
particular metrics. So for example, here I've got a subclass of Metric called Accuracy. 01:23:10.300 |
If you haven't done subclasses before, you can basically think of this as saying: please 01:23:17.460 |
copy and paste all the code from here into here for me, but the bit that says def calc, 01:23:25.500 |
replace it with this version. In fact this would be identical to copying and pasting 01:23:31.100 |
the whole thing, typing Accuracy here, and replacing the definition of calc with that. 01:23:43.140 |
That's what's happening when we do subclassing: it's basically copying and pasting all 01:23:48.420 |
that code in there for us. It's actually more powerful than that; there's more we can do 01:23:53.500 |
with it. But in this case that's all that's happening with this subclassing. 01:23:58.460 |
OK. So the Accuracy metric is here, and then 01:24:07.900 |
this is our really basic Metric, which we're going to use just for loss. 01:24:13.460 |
So what happens is, let's for example create an Accuracy metric object. 01:24:22.220 |
We're basically going to add in mini-batches of data. So for example, here's a mini-batch 01:24:28.060 |
of predictions and targets, and here's another mini-batch of predictions and targets. 01:24:34.060 |
Then we're going to call .value and it will calculate the accuracy. Now, .value 01:24:41.300 |
is a neat little thing. It doesn't require parentheses after it, because it's a 01:24:45.060 |
property. A property is something that just calculates automatically, without 01:24:51.620 |
having to put parentheses. That's all a property is; well, a property getter, anyway. They 01:24:57.140 |
look like this: you give it a name. Each time we call add, we 01:25:05.100 |
store that input and that target, and also, optionally, the number of items in the 01:25:14.420 |
mini-batch. For now that's just always going to be one. And you can see 01:25:22.340 |
that we then call .calc, which calls the Accuracy calc, to see how 01:25:29.860 |
often the predictions equal the targets. Then we append that calculation to the list of values, 01:25:43.180 |
and we also append to the list of ns, in this case just one. And then 01:25:48.060 |
to calculate the value we just take the weighted average. That's all that's happening for Accuracy. 01:25:55.460 |
Then for loss we can just use Metric directly, because Metric directly will 01:26:00.740 |
just calculate the average of whatever it's passed. So we can say: add the number 01:26:05.260 |
0.6 (the target's optional), and say this is a mini-batch of size 32, so 01:26:11.500 |
that's going to be the n. Then add the value 0.9 with a mini-batch size of 2, and 01:26:17.940 |
then get the value. As you can see, that's exactly the same as the weighted average of 01:26:23.860 |
0.6 and 0.9 with weights of 32 and 2. So we've created a Metric class, and that's something 01:26:31.480 |
we can use to create any metric we like just by overriding calc. Or we could create 01:26:39.980 |
totally new things from scratch, as long as they have an add and a value. 01:26:48.180 |
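A sketch of that Metric class, following the description (the real notebook version may differ slightly in details):

```python
import torch

class Metric:
    def __init__(self): self.vals, self.ns = [], []

    def add(self, inp, targ=None, n=1):
        # Store this mini-batch's calculation and its size.
        self.vals.append(self.calc(inp, targ))
        self.ns.append(n)

    @property
    def value(self):
        # Weighted average over everything added so far; no parentheses
        # needed at the call site, because it's a property.
        ns = torch.tensor(self.ns)
        return (torch.tensor(self.vals) * ns).sum() / ns.sum()

    def calc(self, inp, targ): return inp   # base class just averages its inputs

class Accuracy(Metric):
    def calc(self, inp, targ): return (inp == targ).float().mean()

# Matching the numbers from the lecture:
loss = Metric()
loss.add(0.6, n=32)
loss.add(0.9, n=2)
print(loss.value)   # (0.6*32 + 0.9*2) / 34, about 0.6176
```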
OK. So we're now going to change our Learner, and we're going to keep the same 01:26:56.500 |
basic structure. There's going to be fit, which goes through each epoch and 01:27:03.380 |
calls one_epoch, passing in True and False for training and validation. one_epoch 01:27:11.300 |
goes through each batch in the data loader and calls one_batch. one_batch is going 01:27:18.060 |
to do the prediction, get the loss, and, if it's training, do the backward pass, step, 01:27:24.740 |
and zero_grad. But there are a few other things going on, so let's take a look. Actually, let's 01:27:34.380 |
just look at it in use first. When we use it, we create a Learner with 01:27:40.740 |
the model, data loaders, loss function, learning rate, and some callbacks, which we'll learn 01:27:45.300 |
about in a moment. Then we call fit and it does our thing; and look, we're going 01:27:48.940 |
to have charts and stuff. All right, so the basic idea is going to look very similar. 01:27:54.940 |
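That usage, sketched (a hypothetical call; DeviceCB is discussed below, and other callbacks, such as ones that track metrics or draw progress charts, plug in the same way):

```python
import torch.nn.functional as F

# cbs is just a list of callback objects.
learn = Learner(model, dls, F.cross_entropy, lr=0.2, cbs=[DeviceCB()])
learn.fit(3)
```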
So we're going to call fit. When we construct the Learner, we pass in exactly the 01:28:00.540 |
same things as before, but we've got one extra thing: callbacks, which we'll see in a moment. 01:28:06.700 |
We store the attributes as before, and we're going to do some stuff with the callbacks. 01:28:11.820 |
When we call fit for some number of epochs, we store away how many epochs 01:28:18.420 |
we're going to do. We also store away the actual range we're going to loop 01:28:24.340 |
through, as self.epochs. So here's the loop through self.epochs. We create 01:28:30.380 |
the optimizer using the optimizer function and the parameters, and then we 01:28:40.180 |
call _fit. Now what on earth is _fit? Why didn't we just copy and 01:28:44.460 |
paste its body into here? Why do this? It's because we've created this special decorator, with_cbs. 01:28:53.100 |
What does that do? So it's up here: with_cbs is a class. 01:29:03.780 |
It stores just one thing, which is the name; in this case the name is 'fit'. 01:29:12.660 |
Now, this is the decorator, right? So when we call it, remember, decorators 01:29:23.100 |
get passed a function. So it's going to get passed this whole function, and that's going 01:29:29.420 |
to be called f. And __call__, remember, is what happens when an object 01:29:35.540 |
is treated as if it's a function. So it's going to get passed this function, and this function 01:29:40.020 |
is _fit. What we want to do is return a different function. 01:29:46.520 |
That function will of course call the function we were asked to call, using the arguments 01:29:53.060 |
and keyword arguments we were asked to use. But before it calls that function, it's going 01:29:59.180 |
to call a special method called callback, passing in the string 'before_' plus the name, 01:30:06.460 |
in this case 'before_fit'. After it's completed, it's going to call that callback method again, passing the 01:30:13.660 |
string 'after_fit'. And it's going to wrap the whole thing in a try/except block, 01:30:21.220 |
looking for an exception called CancelFitException, and if it gets 01:30:30.260 |
one, it's not going to complain. 01:30:34.660 |
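Here's a sketch of that decorator as described; in the real version the exception class is looked up from the name, but I'll hard-code the fit case to keep it simple:

```python
import functools

class CancelFitException(Exception): pass

class with_cbs:
    def __init__(self, nm): self.nm = nm            # e.g. 'fit'

    def __call__(self, f):                          # f is e.g. Learner._fit
        @functools.wraps(f)
        def _f(o, *args, **kwargs):
            try:
                o.callback(f'before_{self.nm}')     # e.g. 'before_fit'
                f(o, *args, **kwargs)               # run the original method
                o.callback(f'after_{self.nm}')      # e.g. 'after_fit'
            except CancelFitException: pass         # cancelling isn't an error
        return _f
```

So in the Learner, _fit has @with_cbs('fit') on it, and fit does the setup (storing the epochs, creating the optimizer) before calling self._fit().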
So let me explain what's going on with all of those things by looking at an example of a callback. Here is a callback called 01:30:49.420 |
DeviceCB, the device callback. Its before_fit will be called automatically before that 01:30:56.060 |
_fit method is called, and it's going to put the model onto our device: CUDA or MPS if we 01:31:06.380 |
have one, otherwise it will just stay on the CPU. So what's going to happen here? 01:31:13.600 |
We're going to call fit. It's going to go through these lines of code, and it's going 01:31:18.660 |
to call _fit. But _fit is not the function we wrote; _fit is the wrapper function, 01:31:26.900 |
with f being the function we wrote. So it's going to call our Learner's callback method, passing in 01:31:36.740 |
'before_fit', and callback is defined here. What's callback going to do? It's going to 01:31:45.660 |
be passed the string 'before_fit'. It's then going to go through each of our callbacks, 01:31:54.620 |
sorted based on their order; you can see here that our callbacks can have an order. 01:32:01.900 |
It's going to look at each callback and try to get an attribute called 01:32:09.740 |
before_fit, and it will find one, so then it's going to call that method. Now, if that method 01:32:22.740 |
doesn't exist, if it doesn't appear at all, then getattr will return this default instead: identity. 01:32:30.020 |
identity is a function just here. All it does is: whatever arguments 01:32:37.780 |
it gets passed, it returns them; and if it's not passed any arguments, it just returns None. 01:32:47.420 |
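As a sketch of those pieces together; the sorting key and the getattr default are as just described, though the exact attribute handling in the notebook may differ:

```python
def identity(*args):
    # Return whatever we're given: nothing, one thing, or a tuple of things.
    if not args: return
    x, *rest = args
    return (x, *rest) if rest else x

class Learner:
    # ... constructor, fit, _fit, and so on, as discussed ...
    def callback(self, method_nm):
        # Run e.g. before_fit on every callback that defines it, in order;
        # callbacks without that method fall back to identity, a no-op here.
        for cb in sorted(self.cbs, key=lambda c: getattr(c, 'order', 0)):
            getattr(cb, method_nm, identity)(self)

class DeviceCB:
    order = 0
    def before_fit(self, learn):
        learn.model.to(def_device)   # def_device as defined earlier
```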
So there's a lot of Python going on here. And that is why we did that foundations lesson. 01:32:59.220 |
And so for people who haven't done a lot of this Python there's going to be a lot of stuff 01:33:06.860 |
to experiment with and learn about. And so do ask on the forums if any of these bits 01:33:18.180 |
get confusing. But the best way to learn about these things is to open up this Jupyter notebook 01:33:23.740 |
and try and create really simple versions of things. So for example let's try identity. 01:33:34.980 |
How exactly does identity work? I can call it with nothing and get nothing back. I can call it with 01:33:43.660 |
1 and get back 1. I can call it with 'a' and get back 'a'. I can call it with 'a' and 1 01:33:55.300 |
and get back ('a', 1). And how is it doing that, exactly? Remember, we can 01:34:04.180 |
add a breakpoint. And this would be a great time to really test your debugging skills. 01:34:12.220 |
In our debugger, remember, we can hit h to find out what the commands are, but you really 01:34:16.180 |
should do a tutorial on the debugger if you're not familiar with it. Then we can step 01:34:19.980 |
through each line. So I can now print args. And there's actually a trick I like: 01:34:27.820 |
args is, funnily enough, also a debugger command, which will tell you the arguments 01:34:32.540 |
to any function regardless of what they're called, which is kind of nice. Then 01:34:38.660 |
we can step through by pressing n, and after each step we can check: OK, what is x now? And 01:34:48.820 |
what is args now? So remember to really experiment with these things. 01:35:01.660 |
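For instance, you could drop a breakpoint into identity yourself and step through it (a minimal sketch; breakpoint() opens pdb, where h lists commands, args prints the current function's arguments, and n steps to the next line):

```python
def identity(*args):
    breakpoint()              # pause here and try the pdb commands: h, args, n
    if not args: return
    x, *args = args
    return (x, *args) if args else x

identity('a', 1)              # step through and watch x and args change
```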
Anyway, we're going to talk about this a lot more in the next lesson. But before that, if you're not 01:35:13.460 |
familiar with try/except blocks, spend some time practicing them. If you're not familiar 01:35:19.380 |
with decorators, well, we've seen them before, so go back and look at them again really carefully. 01:35:26.340 |
If you're not familiar with the debugger, practice with that. If you haven't spent much time 01:35:31.160 |
with getattr, remind yourself about that. Try to get yourself as familiar and comfortable 01:35:39.620 |
as possible with the pieces, because if you're not comfortable with the pieces, 01:35:44.100 |
then the way we put the pieces together is going to be confusing. There's actually something 01:35:48.700 |
in the theory of education called cognitive load theory. 01:35:54.620 |
Basically, cognitive load theory says that if you're trying to learn something, but 01:36:01.660 |
your cognitive load is really high because of lots of other things going on at the 01:36:05.740 |
same time, you're not going to learn it. So it's going to be hard for you to learn this 01:36:12.900 |
framework that we're building if you have too much cognitive load from wondering what the hell 01:36:17.380 |
a decorator is, or what the hell getattr is, or what sorted does, or what partial is. 01:36:23.580 |
Now, I actually spent quite a bit of time trying to make this as 01:36:28.380 |
simple as possible, but also as flexible as it needs to be for the rest of the course, 01:36:36.940 |
and this is as simple as I could get it. So these are things 01:36:41.940 |
that you actually do have to learn. But in doing so, you're going to be able to write 01:36:47.940 |
some really powerful and general code yourself. So hopefully you'll find this 01:36:56.940 |
a really valuable and mind-expanding exercise in bringing high-level software engineering 01:37:04.180 |
skills to your data science work. OK, so with that, this looks like a good place to leave 01:37:11.380 |
it, and I look forward to seeing you next time. Bye.