Back to Index

Lesson 15: Deep Learning Foundations to Stable Diffusion


Chapters

0:00 Introduction
0:51 What are convolutions?
6:52 Visualizing convolutions
8:51 Creating a convolution with MNIST
17:58 Speeding up the matrix multiplication when calculating convolutions
22:27 PyTorch's F.unfold and F.conv2d
27:21 Padding and Stride
31:03 Creating the ConvNet
38:32 Convolution Arithmetic. NCHW and NHWC
39:47 Parameters in MLP vs CNN
42:27 CNNs and image size
43:12 Receptive fields
46:09 Convolutions in Excel: conv-example.xlsx
56:04 Autoencoders
60:00 Speeding up fitting and improving accuracy
65:56 Reminding what an auto-encoder is
75:52 Creating a Learner
82:48 Metric class
88:40 Decorator with callbacks
92:45 Python recap

Transcript

Hi all and welcome to Lesson 15 and what we're going to endeavor to do today is to create a convolutional autoencoder. And in the process we will see why doing that well is a tricky thing to do and time permitting we will begin to work on a framework, a deep learning framework to make life a lot easier.

Not sure how far we'll get on that today, time wise, so let's see how we go and get straight into it. Okay, so before we can create a convolutional autoencoder, we need to talk about convolutions: what are they, and what are they for?

Generally speaking, convolutions are something that allows us to tell our neural network a little bit about the structure of the problem, which is going to make it a lot easier for it to solve the problem. In particular, the structure of our problem is that we're doing things with images. Images are laid out on a grid: a 2D grid for black and white, 3D for color, 4D for a color video, or whatever.

And so we would say, you know, there's a relationship between the pixels going across and the pixels going down. They tend to be similar to each other, differences in those pixels across those dimensions tend to have meaning, patterns of pixels that appear in different places often represent the same thing.

So for example, a cat in the top left is still a cat even if it's in the bottom right. This kind of prior information is something that is naturally captured by a convolutional neural network, something that uses convolutions. Generally speaking, this is a good thing, because it means that we will be able to use fewer parameters and less computation, because more of that information about the problem we're solving is encoded directly into our architecture.

There are other architectures that don't encode that prior information as strongly, such as a multilayer perceptron, which we've been looking at so far, or a transformer network, which we haven't looked at yet. Those kinds of architectures do give us more flexibility, and given enough time, compute, and data, they could potentially find things that maybe CNNs would struggle to find.

So we're not always going to use convolutional neural networks, but they're a pretty good starting point and certainly something important to understand. They're not just used for images. We can also take advantage of one-dimensional convolutions for language-based tasks, for instance. So convolutions come up a lot. So in this notebook, one thing you'll notice that might be of interest is we are importing stuff from MiniAI now.

Now MiniAI is this little library that we're starting to create, and we're creating it using nbdev. So we've now got a MiniAI.training and a MiniAI.datasets. And if we look, for example, at the datasets notebook, it starts with something that says that the default export module is called datasets, and some of the cells have an export directive on them.

And at the very bottom, we had something that called nbdev export. What that's going to do is create a file called datasets.py, here, and it contains those cells that we exported. And why is it called MiniAI.datasets? That's because everything for nbdev is stored in settings.ini, and there's something here saying create a library whose libname is MiniAI.

You can't use this library until you install it. Now, we haven't uploaded it to PyPI, so it's not a pip-installable package from a public server. But you can actually install a local directory as if it's a Python module that you've installed from the internet. And to do that, you say pip install in the usual way, but you add -e, which stands for editable.

And that means set up the current directory as a Python module. Well, current directory, actually any directory you like, I just put dot to mean the current directory. And so you'll see that's going to go ahead and actually install my library. And so after I've done that, I can now import things from that library, as you see.

Okay, so this is just the same as before. We're going to grab our MNIST dataset and we're going to create a convolutional neural network on it. So before we do that, we're going to talk about what are convolutions. And one of my favorite descriptions of convolutions comes from the student in our, I think it was our very first course, Matt Kleinsmith, who wrote this really nice Medium article, CNNs from different viewpoints, which I'm going to steal from.

And here's the basic idea. Say that this is our image: it's a three by three image with nine pixels labeled from A to J as capital letters. Now, a convolution uses something called a kernel, and a kernel is just another tensor. In this case, it's a two by two matrix.

So in this one we're going to have alpha, beta, gamma, delta as our four values. Now, one thing I'll mention, I can't remember if I've said this before: the Greek letters are things that you want to be able to pronounce.

So if you don't know how to read these and say what their names are, make sure you head over to Wikipedia or whatever and learn the names of all the Greek letters, because they come up all the time. So, what happens when we apply a convolution with this two by two kernel to this three by three image? I mean, it doesn't have to be an image; in this case it's just a rank two tensor, but it might represent an image.

What happens is we take the kernel and we overlay it over the first little two by two sub grid, like so, and specifically what we do is we match color to color. So the output of this first two by two overlay would be alpha times A plus beta times B plus gamma times D plus delta times E and that would yield some value P and that's going to end up in the top left of a two by two output.

So the top right of the two by two output, we're going to slide, it's like a sliding window, we're going to slide our kernel over to here and apply each of our coefficients to these respectively colored squares and then ditto for the bottom left and then ditto for the bottom right.

So we end up with these equations: P, as we discussed, is alpha A plus beta B plus gamma D plus delta E plus some bias term. Q, the top right, as you can see, is the same thing but starting with alpha times B. So we're just multiplying them together and adding them up: multiply together, add them up; multiply together, add them up.

So you can imagine that we're basically flattening these out into rank one tensors, into vectors, and then doing a dot product; that would be one way of thinking about what's happening as we slide this kernel over these windows. And so this is called a convolution. So let's try and create a convolution.
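
To make that concrete, here's a minimal sketch of that sliding-window dot product (the pixel and kernel values are made up; only the mechanics matter):

```python
import torch

img = torch.tensor([[1., 2., 3.],
                    [4., 5., 6.],
                    [7., 8., 9.]])      # the 3x3 "image", pixels A..J
kernel = torch.tensor([[10., 20.],
                       [30., 40.]])     # alpha, beta / gamma, delta
bias = 0.5

out = torch.zeros(2, 2)
for i in range(2):
    for j in range(2):
        window = img[i:i+2, j:j+2]                    # one 2x2 window
        out[i, j] = (window * kernel).sum() + bias    # flatten-and-dot-product
print(out)   # the four outputs P, Q, R, S
```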

So for example, let's grab our training images and take a look at one. And let's create a three by three kernel. So remember, "kernel" appears a lot of times in computer science and math. We've already seen the term kernel to mean a piece of code that we run on a GPU across lots of parallel, kind of virtual, devices, potentially in a grid.

There's a similar idea here. We've got a computation, which in this case is this dot product, or something like a dot product, sliding over, occurring lots of times over a grid. But it's a bit different; that's another use of the word kernel. So in this case, a kernel is going to be a rank two tensor.

So let's create a kernel with these values: a three by three matrix, a rank two tensor. And we can draw what that looks like; not surprisingly, it just looks like a bunch of lines. Okay. So what would happen if we slide this, just these nine pixels, over this 28 by 28?

Well, what's going to happen is, if the top left three by three section, for example, has these names, then we're going to end up with minus A1, minus A2, minus A3, because the top three are all negative. The next three are multiplied by zero, so they won't do anything.

And then plus A7, plus A8, plus A9. Why is that interesting? Well, let's try it here. What I've done here is I've grabbed just the first 13 rows and first 23 columns of our image, and I'm actually showing the numbers and also using gray conditional formatting, if you like, or the equivalent in pandas, to show this top bit.

So we're looking at just this top bit. So what happens if we take rows three, four, and five? Remember, the end point is not inclusive, right? So it's rows three, four, and five, and columns 14, 15, 16. So we're looking at these three here. What's that going to give us if we multiply it by this kernel?

It gives us a fairly large positive value, because the three that we have negatives on, the top row, are all zero, and the three that we have positives on are all close to one. So we end up with quite a large number. What about the same columns, but for rows 7, 8, 9? For 7, 8, 9 here, the top is all positive and the bottom is all zero.

So that means that we're going to get a lot of negative terms. And not surprisingly, that's exactly what we see. To do this kind of dot product equivalent, all you need in NumPy or PyTorch is just an element-wise multiplication followed by a sum, right? So that's going to be quite a large negative number.

And so perhaps you're seeing what this is doing, and maybe you got a hint from the name of the tensor we created. It's something that is going to find the top edge, right? So this one is a top edge, so it's a positive, and this one is a bottom edge, so it's a negative.

So we would like to apply that, this kernel, to every single 3x3 section in here. So we could do that by creating a little apply kernel function that takes some particular row, and some particular column, and some particular tensor as a kernel, and does that multiplication dot sum that we just saw.

So for example, we could replicate this one by calling apply kernel. And this here is the center of that 3x3 grid area. And so there's that same number, 2.97. So now we could apply that kernel to every one of the 3x3 windows in this 28x28 image. So we're going to be sliding over, like this red bit sliding over here, but we've actually got a 28x28 input, not just a 5x5 input.
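
A sketch of what that apply_kernel function might look like (`im` and `top_edge` here are stand-ins for the notebook's image and kernel tensors):

```python
import torch

im = torch.rand(28, 28)                      # stand-in for one MNIST image
top_edge = torch.tensor([[-1., -1., -1.],
                         [ 0.,  0.,  0.],
                         [ 1.,  1.,  1.]])   # the top-edge kernel from above

def apply_kernel(row, col, kernel):
    # multiply the 3x3 window centred at (row, col) by the kernel, then sum
    return (im[row-1:row+2, col-1:col+2] * kernel).sum()

apply_kernel(3, 14, top_edge)   # the value at the centre of that 3x3 area
```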

So to get all of the coordinates-- let's just simplify it to do this 5x5-- we can create a list comprehension. We can take i through every value in range 5, and then for each of those, we can take j for every value in range 5. And so if we just look at that tuple, you can see we get a list of lists containing all of those coordinates.

So this is a list comprehension in a list comprehension, which when you first say it, may be surprising or confusing, but it's a really helpful idiom. And I certainly recommend getting used to it. Now what we're going to do is we're not just going to create this tuple, but we're actually going to call apply kernel for each of those.

So if we go through from 1 to 27-- well, actually, 1 to 26, because 27 is exclusive. So we're going to go through everything from 1 to 26, and then for each of those, go through from 1 to 26 again and call apply kernel. And that's going to give us the result of applying that convolutional kernel to every one of those coordinates.
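
Continuing the sketch above, that list comprehension in a list comprehension might look like this:

```python
# every valid window centre: 1..26, since a 3x3 window must fit inside 28x28
rng = range(1, 27)
top_edge3 = torch.tensor([[apply_kernel(i, j, top_edge) for j in rng]
                          for i in rng])
print(top_edge3.shape)   # torch.Size([26, 26])
```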

And there's the result. And you can see what it's done, as we hoped, is it is highlighting the top edges. So yeah, you might find that kind of surprising that it's that easy to do this kind of image processing. We're literally just doing an element-wise multiplication and a sum for each window.

OK, so that is called a convolution. So we can do another convolution. This time, we could do one with a left edge tensor, which, as you can see, looks like just a rotated version, or transposed version I guess, of our top edge tensor. Here's what it looks like. And so this time, we're going to apply the left edge kernel.

And notice here that we're actually passing in a function. Right? We're passing in a function-- sorry, actually, not a function, is it? It's just a tensor. So we're going to pass in the left edge tensor to the same list comprehension in a list comprehension. And this time, we're getting back all the left edges.

It's highlighting all of the left edges in the digit. So yeah, this is basically what's happening here: a kernel is being looped over an image, creating these outputs. Now you'll see here that in the process of doing so, we are losing the outermost pixels of our image.

We'll learn about how to fix that later. But just for now, notice that as we are putting our 3 by 3 through, for example, in this 5 by 5, there's only one, two, three places that we can put it going across, not five places, because we need some kind of edge.

All right. So that's cool. That's a convolution. And hopefully, if you remember back to kind of the Zeiler and Fergus pictures from lesson 1, you might recognize that the kind of first layer of a convolutional network is often looking for kind of edges and gradients and things like that.

And this is how it does it. And then the convolutions on top of convolutions with nonlinear activations between them can combine those into curves or corners or stuff like that, and so on and so forth. Okay. So how do we do this quickly? Because currently, this is going to be super, super slow doing this in Python.

So one of the very earliest, probably the earliest, publicly available general purpose GPU-accelerated deep learning libraries I saw was called Caffe. That was created by somebody called Yangqing Jia, and he actually described how Caffe went about implementing a fast convolution on a GPU. Basically, he said, "Well, I had two months to do it, and I had to finish my thesis." Now, there was some other code out there: Krizhevsky, who you might have come across, along with Hinton, set up a little startup, which Google bought, and that kind of became the start of Google's deep learning work, Google Brain basically.

So Krizhevsky had all this fancy stuff in his library, but Yangqing Jia said, "Oh, I didn't know how to do all that stuff. But I already knew how to multiply matrices, so maybe I could convert a convolution into a matrix multiplication." And that approach became known as im2col.

im2col is a way of converting a convolution into a matrix multiply. And actually, I suspect Yangqing Jia kind of accidentally reinvented it, because it had been around for a while, even at the point that he was writing his thesis, I believe. This paper, I believe, is the place it was created.

So that was in 2006, which is a while ago. And this is actually from that paper. What they describe is, let's say you are putting this two by two kernel over this three by three bit of an image. So this window here needs to match up with this bit of the image, right?

What you could do is unroll that patch, one, two, one, one, out into a row here: one, two, one, one.

And then you could unroll the kernel, one, one, two, two, to here: one, one, two, two. And once they've been flattened out and moved in that way, you do exactly the same thing for this next patch here, two, zero, one, three: you flatten it out and put it here, two, zero, one, three.

So if you basically take those patches and kernels and flatten them out in this format, then you end up with a matrix multiply: if you multiply this matrix by this matrix, you'll end up with the output that you want from the convolution. So this is basically a way of unrolling your kernels and your input features into matrices, such that when you do the matrix multiply, you get the right answer.

So it's kind of a nifty trick. And that is called im2col. I guess we're kind of cheating a little bit: implementing it is kind of boring, it's just a bunch of copying and tensor manipulation, so I actually haven't done it. Instead, I've linked to a NumPy implementation, which is here.

Part of it is this get indices function, which is here. And as you can see, it's a little bit tedious, with repeats and tiles and reshapes and whatnot. So I'm not going to call it homework, but if you want to practice your tensor indexing and manipulation skills, try creating a PyTorch version from scratch.

I've got to admit I didn't bother. Instead, I use the one that's built into PyTorch, where it's called unfold. So we take our image; PyTorch expects there to be a batch axis and a channel dimension, so we'll add two unit leading dimensions to it.

Then we can unfold our input for a three by three kernel, and that will give us a nine by 676 input. And then we will take our kernel and just flatten it out into a vector.

So view changes the shape, and minus one just says dump everything into this dimension. So that's going to create a length nine vector. And now we can do the matrix multiply, just like they've done here, of the kernel matrix, that's our weights, by the unrolled input features.

And that gives us something 676 long, which we can then view as 26 by 26, and we get back, as we hoped, our left edge result. And so this is how we can, kind of from scratch, create a better implementation of convolutions. The reason I'm allowed to cheat here is because we did actually create convolutions from scratch.
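
Putting that together, here's a sketch of the unfold-plus-matmul convolution, checked against F.conv2d at the end (the variable names are mine, not necessarily the notebook's):

```python
import torch
import torch.nn.functional as F

im = torch.rand(28, 28)                  # stand-in image
left_edge = torch.tensor([[-1., 0., 1.],
                          [-1., 0., 1.],
                          [-1., 0., 1.]])

inp = im[None, None]                     # add batch and channel dims: (1,1,28,28)
inp_unf = F.unfold(inp, (3, 3))          # (1, 9, 676): each column is one 3x3 window
w = left_edge.view(-1)                   # flatten the kernel to a length-9 vector
out_unf = w[None] @ inp_unf              # the matrix multiply: (1, 1, 676)
out = out_unf.view(26, 26)               # reshape back into the output grid

# sanity check against PyTorch's own convolution
expected = F.conv2d(inp, left_edge[None, None])
assert torch.allclose(out, expected[0, 0], atol=1e-5)
```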

We're not always creating the GPU-optimized versions from scratch, which was never something I promised, so I think that's fair. But it's cool that we can kind of hack out an optimized version in the same way that the original deep learning library did. So if we use apply_kernel, it takes nearly nine milliseconds.

If we use unfold with a matrix multiply, we get 20 microseconds, so that's about 400 times faster. That's pretty cool. Now, of course, we don't have to use unfold and a matrix multiply, because PyTorch has a conv2d. So we can run that, and interestingly, that's about the same speed.

This would also work on the GPU just as well. Yeah, I'm not sure this will always be the case; in this case, it's a pretty small image, and I haven't experimented a whole lot to see whereabouts there's a big difference in speeds between these. Obviously, I always just use F.conv2d.

But if there's some more tricky convolution you need to do with some weird thing around channels or dimensions or something, you can always try this unfold trick. It's nice to know it's there, I think. So we could do the same thing for diagonal edges. So here's our diagonal edge kernel or the other diagonal.

So if we just grab the first 16 images, then we can do a convolution on our whole batch with all of our kernels at once. So this is a nice optimized thing that we can do. And you end up with your 26 by 26. You've got your four kernels and you've got your 16 images.
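
A sketch of that batched call, all images by all kernels at once (the kernel values here are illustrative, not exactly the notebook's):

```python
import torch
import torch.nn.functional as F

xb = torch.rand(16, 1, 28, 28)            # a mini-batch of 16 one-channel images
top_edge = torch.tensor([[-1., -1., -1.],
                         [ 0.,  0.,  0.],
                         [ 1.,  1.,  1.]])
diag = torch.tensor([[ 1.,  0., -1.],
                     [ 0.,  1.,  0.],
                     [-1.,  0.,  1.]])
# stack four kernels into the (out_ch, in_ch, kh, kw) shape conv2d expects
kernels = torch.stack([top_edge, top_edge.T, diag, diag.flip(1)])[:, None]
out = F.conv2d(xb, kernels)               # every image x every kernel in one call
print(out.shape)                          # torch.Size([16, 4, 26, 26])
```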

And so that's summarized here. So that's generally what we're doing to get good GPU acceleration is we're doing a bunch of kernels and a bunch of images all at once across all of their pixels. And so here we go. That's what happens when we take a look at our various kernels for a particular image.

Left edge, I guess top edge, and then diagonal top left and top right. OK, so that is optimized convolutions, and that works just as well on CPU or GPU; obviously, GPU will be faster if you have one. Now, how do we deal with the problem that we're losing one pixel on each side?

What we can do is add something called padding. For padding, what we basically do is, rather than starting our window here, we start it right over here, and actually up one as well. And for these three on the left here, we just take the input for each of them as zero.

So we're basically just assuming they're all zero. I mean, there are other options we could choose: we could assume they're the same as the one next to them. There are various things we can do, but the simplest, and the one we normally use, is just to assume that they're zero. So that's called one pixel padding.

Let's say we did two pixel padding, with a five by five input and a four by four kernel; the gray here is our kernel. Then we're going to start way up over here in the corner, and you can see what happens as we slide the kernel over.

There's all the spots that it's going to take. So this dotted line area is the area that we're effectively going through, but all of these white bits we just treat as zero. And this green is the output size we end up with, which is going to be six by six for a five by five input.

I should mention, even-sized kernels are not used very often; we normally use odd-sized kernels. If you use, for example, a three by three kernel and one pixel of padding, you will get back the same size you started with. If you use five by five with two pixels of padding, you'll also end up with the same size you started with.

So generally, odd-sized kernels are easier to deal with, to make sure you end up with the same size you started with. OK, so as it says here, if you've got an odd-sized ks by ks kernel, then ks // 2, that's what the slash slash (truncating division) means, will give you the right amount of padding.
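
For example, a quick check of that rule (a minimal sketch):

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 1, 28, 28)
for ks in (3, 5, 7):                      # odd kernel sizes
    w = torch.rand(1, 1, ks, ks)
    y = F.conv2d(x, w, padding=ks // 2)   # ks // 2 padding preserves the grid
    print(ks, y.shape)                    # (1, 1, 28, 28) every time
```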

And so another trick you can do is you don't always have to just move your window across by one each time. You could move it by a different amount each time. The amount you move it by is called the stride. So, for example, here's a case of doing a stride two.

So with stride two, padding one, we start out here, and then we jump across two, and then we jump across two again, and then we go to the next row. So that's called a stride two convolution. Stride two convolutions are handy because they reduce the height and width of your input by a factor of two.

And that's actually what we want to do a lot. For example, with an autoencoder, we want to do that. And in fact, for most classification architectures, we do exactly that. We keep on reducing the kind of the grid size by a factor of two again and again and again using stride two convolutions with padding of one.

So that's strides and padding. Let's go ahead and create a convnet using these approaches. We're going to get the size of our training set; this is all the same as before: number of categories, number of digits, size of our hidden layer. Previously, with our sequential linear models, our MLPs, we basically went from the number of pixels to the number of hidden, then a ReLU, then from the number of hidden to the number of outputs.

So here's the equivalent with a convolution. Now the problem is that you can't just do that because the output is not now 10 probabilities for each item in our batch, but it's 10 probabilities for each item in our batch for each of 28 by 28 pixels because we don't even have a stride or anything.

So you can't just use the same simple approach that we had for MLP. We have to be a bit more careful. So to make life easier, let's create a little conv function that does a conv2D with a stride of 2, optionally followed by an activation. So if act is true, we will add in a value activation.

So this is going to either return a conv2D or a little sequential containing a conv2D followed by a value. And so now we can create a CNN from scratch as a sequential model. And so since activation is true by default, this is going to take out 28 by 28 image starting with one channel and creating an output of four channels.

So this is the number of in, this is the number of filters. Sometimes we'll say filters to describe the number of kind of channels that our convolution has. That's the number of outputs. And it's very similar to the idea of the number of outputs in a linear layer, except this is the number of outputs in your convolution.

So what I like to do when I create stuff like this is I add a little comment just to remind myself what is my grid size after this. So I had a 28 by 28 input. So then I've then put it through a stride2 conv. So the output of this will be 14 by 14.

So then we'll do the same thing again, but this time we'll go from a four channel input to an eight channel output and then from eight to 16. So by this point, we're now down to a four by four and then down to a two by two. And then finally, we're down to a one by one.

So on the very last layer, we won't add an activation, and it's going to create 10 outputs. And since we're now down to a one by one, we can just call flatten, and that's going to remove those unnecessary unit axes. So if we take that and pop a mini-batch through it, we end up with exactly what we want: 16 by 10.
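
Here's a sketch of that conv helper and the model just described (the structure follows the narration; exact details may differ from the course notebook):

```python
import torch
from torch import nn

def conv(ni, nf, ks=3, stride=2, act=True):
    # a stride-2 conv that halves the grid, optionally followed by a ReLU
    res = nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks // 2)
    if act:
        res = nn.Sequential(res, nn.ReLU())
    return res

simple_cnn = nn.Sequential(
    conv(1, 4),               # 28x28 -> 14x14
    conv(4, 8),               # -> 7x7
    conv(8, 16),              # -> 4x4
    conv(16, 16),             # -> 2x2
    conv(16, 10, act=False),  # -> 1x1, with 10 channels: one per digit
    nn.Flatten(),             # drop the trailing 1x1 axes
)

xb = torch.rand(16, 1, 28, 28)
print(simple_cnn(xb).shape)   # torch.Size([16, 10])
```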

So for each of our 16 images, we've got 10 probabilities, one for each possible digit. So we take our training set and make it into 28 by 28 images, and we do the same thing for the validation set. Then we create two datasets, one for each, which we call the train dataset and valid dataset.

And we're now going to train this on the GPU. Now, if you've got an Apple Silicon Mac, you've got a device called MPS, which is going to use your Mac's GPU. Whereas if you've got an Nvidia GPU, you can use CUDA.

CUDA is 10 times faster or more, possibly much more, than a Mac, so you definitely want to use Nvidia if you can. But if you're just running it on a Mac laptop or whatever, you can use MPS. So basically, you want to know what device to use: do we want to use CUDA or MPS?

You can check torch.backends.mps.is_available() to see if you're running on a Mac with MPS, and torch.cuda.is_available() to see if you've got an Nvidia GPU, in which case you've got CUDA. And if you've got neither, of course, you'll have to use the CPU to do computation.

So I've created a little function here, to_device, which takes a tensor, or a dictionary or a list of tensors or whatever, and a device to move it to, and it just goes through and moves everything onto that device. Or if it's a dictionary, it's the values that get moved onto that device.

So there's a handy little function. And so we can create a custom collate function, which calls the PyTorch default collation function and then puts those tensors onto our device. And with that, we've now got enough to train this neural net on the GPU. We created this get_dls function in the last lesson.
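
A sketch of those two helpers, following the logic just described (this simple version won't handle every container type, e.g. namedtuples):

```python
import torch
from collections.abc import Mapping
from torch.utils.data.dataloader import default_collate

# pick the best available device
def_device = ('mps' if torch.backends.mps.is_available()
              else 'cuda' if torch.cuda.is_available() else 'cpu')

def to_device(x, device=def_device):
    # recursively move tensors (or dicts/lists/tuples of tensors) to the device
    if isinstance(x, torch.Tensor):
        return x.to(device)
    if isinstance(x, Mapping):
        return {k: to_device(v, device) for k, v in x.items()}
    return type(x)(to_device(o, device) for o in x)

def collate_device(b):
    # collate a batch as usual, then move the whole thing to the default device
    return to_device(default_collate(b))
```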

So we're going to use that, passing in the datasets that we just created and our collation function. We're going to create our optimizer using our CNN's parameters, and then we call fit. Fit, remember, we also created in our last lesson. And it's done. What I did then was reduce the learning rate by a factor of four and run it again.

And eventually, yeah, I got to a fairly similar accuracy to what we had with our MLP. So yeah, we've got a convolutional network working. I think that's pretty encouraging. And it's nice that to train it, we didn't have to write much code, right? We were able to use code that we already built.

We were able to use the dataset class that we made, the get_dls function that we made, and the fit function that we made. And you know, because those things are written in a fairly general way, they work just as well for a ConvNet as they did for an MLP; nothing had to change.

So that was nice. Notice I had to take the model and put it on the device as well. That will go through and basically put all of the tensors that are in that model onto the MPS or CUDA device, as appropriate. So we've got a batch size of 64, and, as we know, one channel, 28 by 28.

So then our axes are batch channel height, width. So normally, this is referred to as NCHW. So N, generally when you see N in a paper or whatever, in this way, it's referring to the batch size. N being the number, that's the mnemonic, the number of items in the batch.

C is the number of channels, then height by width: NCHW. TensorFlow doesn't use that; TensorFlow uses NHWC. So we generally call that channels last, since channels are at the end, and this one we normally call channels first. Now, of course, it's not actually channels first, it's actually channels second, but we ignore the batch bit.

In some models, particularly some more modern models, it turns out the channels last is faster. So PyTorch has recently added support for channels last. And so you'll see that being used more and more as well. All right, so a couple of comments and questions from our chat. The first is Sam Watkins pointing out that we've actually had a bit of a win here, which is that the number of parameters in our CNN is pretty small by comparison.

So the number in the MLP version, the number of parameters is equal to basically the size of this matrix. So M times NH. Oh, plus the number in this, which will be NH times 10. And, you know, something that at some point we probably should do is actually create something that allows us to automatically calculate the number of parameters.

And I'm ignoring the bias there, of course. Let's see, what would be a good way to do that? Maybe np.prod. There we go. So what we could do is just calculate this automatically by doing a little list comprehension here. So there's the number of parameters across all of the different layers, both biases and weights.

And then, well, let's use PyTorch: we could turn that into a tensor and sum it up. So that's the number in our MLP, and then the number in our simple CNN. So that's pretty cool: we've gone down from about 40,000 parameters to about 5,000, and got about the same accuracy.

Oh, thank you, Jonathan. Jonathan's reminding me that there's a better way than np.prod(o.shape), which is just to say o.numel(), for number of elements. Same thing. Very nice. Now, one person asked a very good question, which is: I thought convolutional neural networks could handle any sized image.
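
For example, a sketch using that numel() trick (simple_mlp here is a hypothetical model matching the sizes discussed):

```python
from torch import nn

# roughly the MLP shape discussed: 784 pixels -> 50 hidden -> 10 outputs
simple_mlp = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))

n_params = sum(o.numel() for o in simple_mlp.parameters())
print(n_params)   # 784*50 + 50 + 50*10 + 10 = 39,760: about 40,000
```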

And actually, no, this convolutional network cannot handle any sized image. This convolutional neural network only handles images that, once they go through these stride-2 convs, end up as a one by one, because otherwise you can't just flatten it and end up with 16 by 10. So we will learn how to create convnets that can handle any sized input.

But there's nothing about a convnet per se that means it can handle any sized input. Okay, so let's briefly finish this section off by talking about the idea of the receptive field. Consider this one input channel, four output channel, three by three kernel.

Right. So that's just to show you what we're doing here. So simple CNN: this is the model we created. Remember, it was a sequential model containing sequential models, because that's how our conv function worked. So simple CNN zero is our first layer.

It contains both the conv and the ReLU. So simple CNN zero zero is the actual conv. So if we grab that, and call it conv1, it's a four by one by three by three: number of outputs, number of input channels, and height by width of the kernel. And then it's got its bias as well.

So that's how we can kind of deconstruct what's going on with our weight matrices, or parameters, inside a convolution. Now, I'm going to switch over to Excel. In the lesson notes on the course website, or on the forum, you'll see we've got an Excel workbook.

Oh, someone just reminded me that there is a nice trick we can do, and I do want to do that, actually, because I love this trick. Oh, I just deleted everything though; let's put them all back. Here we go. The trick is that you actually don't need the square brackets: with the square brackets it's a list comprehension.

Without the square brackets, it's called a generator, and... oh no, you can't use it there. Maybe that only works with NumPy. Ah, okay. So wait, that's the list... no, that doesn't work either. So much for that. I'm kind of curious now. Maybe torch.sum?

Nope. Just sum? Oh, okay, but I don't want to use Python's sum. That's interesting; I feel like all of them should handle generators, but there you go. Okay. So open up the conv-example spreadsheet, and what you'll see on the conv-example worksheet page is something that looks a lot like the number seven.

And this is the number seven that I got straight from MNIST. Okay. So you can see over here we have a number seven. This is a number seven from MNIST that I have copied into Excel. And then you can see over here we've got like a top edge kernel being applied and over here we've got a right edge kernel being applied.

This might be surprising you, because you might be thinking, wait, Jeremy, Microsoft Excel doesn't do convolutional neural networks. Well, actually, it does. So if I zoom in in Excel, you'll see these numbers are in fact conditional formatting applied to a bunch of spreadsheet cells. And so what I did was I copied the actual pixel values into Excel and then applied conditional formatting.

And so now you can see what the digit is actually made of. So you can see here I've created our top edge filter and here I've created our left edge filter. And so here I am applying that filter to that window. And so here you can see it looks a lot like NumPy.

It's just a sum product. And you might not be aware of this but in Excel you can actually do broadcasting. You have to hit Apple shift enter or control shift enter and it puts these little curly brackets around it. It's called an array formula. It basically lets you do broadcasting or simple broadcasting in Excel.

And so here's how you could say this is how I created this top edge filtered version in Excel. And the left edge version is exactly the same just a different kernel. And as you can see if I click on it it's applying this filter to this input area and so forth.

OK. So then I just arbitrarily picked some different values here. And something to notice now, in my second layer: after conv1, here is conv2, and it's got a bit more work to do. We actually need two filters, because we need to add together this bit here with this kernel applied, and this bit here with this kernel applied.

So you actually need one set of three by three weights for each input, and I also want two separate outputs. So I actually end up needing a two by two by three by three weights matrix, or weights tensor I should say, which you might remember is exactly what we had in PyTorch.

We had a rank-4 tensor. So if I have a look at this one, you see exactly the same thing: this is using this kernel applied to here, and this kernel applied to here. So it's important to remember that you have these rank-4 tensors. And then, rather than doing a stride-2 conv, I did something else, which is actually a bit out of favor nowadays but is another option, which is to do something called max pooling to reduce my dimensionality.

So you can see here I've got 28 by 28. I've reduced it down here to 14 by 14. And the way I did it was simply to take the max of each little two by two area. OK. So that's all that's been done there. So that's called max pooling.
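
In PyTorch, max pooling is just (a minimal sketch):

```python
import torch
from torch import nn

x = torch.rand(1, 1, 28, 28)
pool = nn.MaxPool2d(2)     # take the max of each non-overlapping 2x2 area
print(pool(x).shape)       # torch.Size([1, 1, 14, 14])
```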

And so max pooling, combined with a convolution, has a similar effect to a stride-2 conv: not mathematically identical, but the same general effect, which is to do a convolution and reduce the grid size by two on each dimension. OK. So then how do we create a single output, if we don't keep doing this until we get to one by one, which I'm too lazy to do in Excel?

Well, one approach, and again this is a little bit out of favor as well, is we can take every one of these, we've now got 14 by 14, and apply a dense layer to it. So what I've done here is, imagine this has basically all been flattened out into a vector.

And so here we've got the sum product of this by this, plus the sum product of this by this, and that gives us a single number. And so we could then optimize that in order to optimize our weight matrices. Now, in the more modern approach, we don't use this kind of dense layer much anymore, although it still appears a bit.

The main place that you see this used is in a network called VGG, which is very old now, 2014 or so, but it's actually still used. And that's because for certain things, like something called style transfer, or perceptual losses in general, people still find VGG seems to work better.

So you still actually see this approach sometimes. The more common approach nowadays, however, is to take the penultimate layer and simply take the average of all of the activations. The Excel way of doing it would be literally to say AVERAGE of the penultimate layer, and that is called global average pooling.
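
A sketch of global average pooling in PyTorch, both as a plain mean and as a layer (the activation shapes here are hypothetical):

```python
import torch
from torch import nn

acts = torch.rand(16, 32, 7, 7)        # hypothetical penultimate activations
gap = acts.mean(dim=(-2, -1))          # average over the spatial grid -> (16, 32)

# the same thing expressed as a layer, as many architectures do
gap2 = nn.AdaptiveAvgPool2d(1)(acts).flatten(1)
print(torch.allclose(gap, gap2))       # True
```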

Everything has to have a fancy phrase, but that's all it is: take the average, and it's called global average pooling. Or you could take the max instead, and that would be global max pooling. So anyway, the main reason I wanted to show you this was to do something which I think is pretty interesting. Let's zoom out a little bit here, take something in our max pool here, and I'm going to say trace precedents, to show you the area that it's coming from.

OK, so it's coming from these four numbers. Now, if I trace precedents again, saying what's actually impacting this, obviously the kernel's impacting it, and you can see that the input area here is a bit bigger. And if I trace precedents again, you can see the input area is bigger still.

So this number here is calculated from all of these numbers in the input. This area in the input is called the receptive field of this unit. And so the receptive field in this case is 1 2 3 4 5 6 by 6. Right. And that means that a pixel way up here in the top right has literally no ability to impact that activation.

It's not part of its receptive field. If you have a whole bunch of stride to comms each time you have one the receptive field is going to get twice as big. So the receptive field at the end of a deep network is actually very large. But the the inputs closest to the middle of the receptive field have the biggest kind of say in the output because they they implicitly appear the most often in all of these kind of dot products that are inside this this this convolutional window.

So the receptive field is not just like a single binary on off thing. Certainly all the stuff that's not got precedence here is not part of it at all. But the closer to the center of the receptive field the more impact it's going to have the more ability it's got to change this number.

So the receptive field is a really important concept. And yeah fiddling playing around with Excel's precedent arrows I think is a nice way to to say that at least in my opinion. And apart from anything else it's great fun creating a convolutional neural network in Excel. I thought so anyway.

OK. So let's take a seven minute break. I'll see you back after that to talk about a convolutional auto encoder. All right. OK. Welcome back. We're going to have a look now at the auto encoder notebook. So we're just going to import all of our usual stuff and we've got one more of our own modules to import now as well.

And this time we are going to switch to a different dataset, which is the Fashion MNIST dataset. We can take advantage of the stuff that we did in the datasets notebook and the Hugging Face stuff to load it. We've seen this a little bit before, back in our datasets notebook, and we never actually built any models with it.

So let's first of all do that. This is just going to convert each image into a tensor, and it's going to be an in-place transform. Remember, we created this decorator, and so we can call the dataset dictionary's with-transform method. This is all stuff we've done before.

And so here we have our example of a sneaker. All right. And we will create our collation function, collating the dictionary for that dataset; you should remind yourself that we built that ourselves in the datasets notebook. And let's actually make our collate function something that does to_device, which we wrote in our last notebook. And we'll get a little data loaders function here, which is going to go through each item in the dataset dictionary, get a data loader for it, and give us a dictionary of data loaders.

OK. So now we've got a data loader for training and a data loader for validation, so we can grab the x and y batch by just calling next on that iterator, as we've done before. Let's look at each of these in turn. Actually, we've done all this before, but it was a couple of weeks ago.

So just to remind you: we can get the names of the features, then create an itemgetter for our ys, and call that the label getter. We can apply that to our labels to get the titles of everything in our mini-batch, and then call the show images function that we created, with that mini-batch and those titles.

And here we have our Fashion MNIST mini-batch. OK. So let's create a classifier, and we're just going to use exactly the same code, copy and pasted from the previous notebook. So here is our sequential model. And we are going to grab the parameters of the CNN; the CNN I've actually moved over to the default device, which we created in our last notebook.

And as you can see, it's fitting. Now, our first problem is that it's fitting very slowly, which is kind of annoying. So why is it running so slowly? Let's have a look at our dataset. So when it's finally finished, let's take a look at an item from the dataset.

Actually, let's go all the way back to the dataset dictionary, before it gets transformed, and let's grab the training part of that, and grab one item. And we can see the problem here. For MNIST, we had all of the data loaded into memory in a single big tensor.

But this Hugging Face one is created in a much more normal way, which is that each image is a totally separate PNG image; it's not all pre-converted into a single thing. Why is that a problem? Well, the reason it's a problem is that our data loader is spending all of its time decoding these PNGs.

So if I train here... OK, so while I'm training, I can run htop, and you can see that basically my CPU is 100 percent used. Now, that's weird, because I've actually got 64 CPUs. Why is it using just one of them? That's the first problem. But why does it matter that it's using 100 percent CPU?

Well, let's run it again so you can see. Why does it matter that our CPU is at 100 percent, and why is it making it so slow? Well, if we look at nvidia-smi dmon, that will monitor our GPU's utilization. I've got three GPUs, so I say to choose just the zeroth index one.

And you'll see this column here, SM. That stands for streaming multiprocessor; it's like the equivalent of CPU usage. And generally, we're only using one percent of our one GPU. So no wonder it's so slow. So the first thing we want to do, then, is try to make things faster.

Now, to make things faster, we want to be using more than one CPU to decode our PNGs. And as it turns out, that's actually pretty easy to do: you just have to add an extra argument to your data loaders, which is num_workers. So I can say use eight CPUs, for example.
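
A sketch of what that looks like (with a dummy TensorDataset standing in for the real one):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,)))

# eight worker processes prepare batches in parallel; note we keep the default
# collate function, which stays on the CPU, for the reason explained next
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, num_workers=8)
xb, yb = next(iter(train_dl))
```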

Now, if I recreate our data loaders and then try to get the next batch... oh, now I've got an error. And the error is rather quirky. What it's saying is, oh, you're now trying to use multiple processes, and generally, in Python and PyTorch, using multiple processes is where things get complicated.

And one of the things that absolutely just doesn't work is having your data loader put things onto the GPU in your separate worker processes. It just doesn't work. So the reason for this error is actually that we used a collate function that put things on the device.

That's incompatible, unfortunately, with using multiple workers. So that's a problem. And the answer to that problem, sadly, is that we would have to rewrite our fit function entirely. So there's annoying thing number one. And we don't want to be rewriting our fit function again and again; we want to have a single fit function.

OK, so there's a problem that we're going to have to think about. Problem number two is that this is not very accurate: 87 percent. Well, I mean, is that accurate? It's easy enough to find out: there's a really nice website called Papers with Code, and it will show you a little leaderboard, and we can see whether we're any good.

And the answer is, we're not very good at all. These papers got 96 percent, 94 percent, 92 percent. So yeah, we're not looking great. So how do we improve that? There are a lot of things we could try, but pretty much all of them are going to involve modifying our fit function, again, in reasonably complicated ways.

So we've still got a bit of an issue there. Let's put that aside, because what we actually wanted to do is create an autoencoder. So, to remind you what an autoencoder is, and we can go into a bit more detail now: we're going to start with our input image, which is going to be 28 by 28.

So it's the number three, right, and it's 28 by 28, and we're going to put it through, for example, a stride-2 conv, and that's going to have an output of 14 by 14, and we can have more channels, say two.

So this is 28 by 28 by 1, and that's become 14 by 14 by 2. We've reduced the height and width by two, but added an extra channel, so overall this is a 2x reduction in size. Then we could do another stride-2 conv, and that would give us a 7 by 7.

And again, we can choose however many channels we want, but let's say we choose four. So now, compared to our original, we've got a 4x reduction. We could do that a few times, or we could just stop there. So this is compressing. And then what we could do is somehow have a convolutional layer, or group of layers, which does a convolution and also increases the size.

There is actually something called a transposed convolution, which I'll leave you to look up if you're interested, which can do that. It's also known as, rather weirdly, a stride one-half convolution. But there's actually a really simple way to do this. Let's say you've got three by three pixels that look like this: one, zero, one, one, say. We could make that into a six by six very easily.

We could simply copy that pixel there into the first four, copy that pixel there into these four, and then copy this pixel here into these four. So we're simply turning each pixel into four pixels. And this is called nearest neighbor upsampling.

Now, that's not a convolution, that's just copying. But what we could then do is apply a stride-1 convolution to that, right? And that would allow us to double the grid size with a convolution. That's what we're going to do. So our autoencoder is going to need a deconvolutional layer, and that's going to contain two layers: a nearest neighbor upsampling with a scale factor of two, followed by a Conv2d with a stride of one.

OK. And you can see for padding I just put kernel size // 2, a truncating division, because that always works for any odd-sized kernel. As before, we will have an optional activation function, and then we will create a Sequential using *layers: that's going to pass in each layer as a separate argument, which is what Sequential expects.
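
A sketch of that deconv helper as described (the layer choices follow the narration):

```python
from torch import nn

def deconv(ni, nf, ks=3, act=True):
    # double the grid size: copy each pixel into a 2x2 block, then apply a
    # stride-1 conv; ks // 2 padding keeps the (now doubled) grid size
    layers = [nn.UpsamplingNearest2d(scale_factor=2),
              nn.Conv2d(ni, nf, kernel_size=ks, stride=1, padding=ks // 2)]
    if act:
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)
```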

OK. So let's write a new fit function. I basically copied it over from our previous one, going through each epoch, but I've pulled out eval into a separate function; it's basically doing the same thing. OK. So here is our autoencoder. And it's a bit tricky, because I wanted to go down through three stride-2 convs to get to a 4 by 4 by 8.

But starting at 28 by 28, you can't halve that three times and get an integer. So what I first do is zero pad, adding padding of two on each side, to get a 32 by 32 input. If I then do a conv with a two-channel output, that gives us 16 by 16 by 2, and then again to get an 8 by 8 by 4, and then again to get a 4 by 4 by 8.

So this is doing an 8x compression, and then we can call deconv to do exactly the same thing in reverse, the final one with no activation. And then we can truncate those two pixels off each edge: slightly surprisingly, PyTorch lets you pass negative two to the zero padding layer to crop off the final two pixels.

And then we'll add a sigmoid, which will force everything to go between zero and one, which of course is what we need. And then we will use MSE loss to compare those pixels to our input pixels. And so a big difference we've got here now is that our loss function is comparing the output of the model to the input itself.
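
Assembling the pieces, here's a sketch of the autoencoder as described, reusing the conv and deconv helpers sketched earlier:

```python
from torch import nn

ae = nn.Sequential(
    nn.ZeroPad2d(2),          # 28x28 -> 32x32, so it halves cleanly three times
    conv(1, 2),               # -> 16x16x2
    conv(2, 4),               # -> 8x8x4
    conv(4, 8),               # -> 4x4x8: the 8x-compressed bottleneck
    deconv(8, 4),             # -> 8x8x4
    deconv(4, 2),             # -> 16x16x2
    deconv(2, 1, act=False),  # -> 32x32x1
    nn.ZeroPad2d(-2),         # negative padding crops back to 28x28
    nn.Sigmoid(),             # squash the outputs into [0, 1] to match pixels
)
```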

Right, we don't have yb here, we have xb. So we're trying to recreate our original, and again, it's a bit annoying that we have to create our own fit function. Anyway, we can now see what the MSE loss is. It's not going to be particularly human readable, but it's a number we can watch to see if it goes down.

And so then we can do our SGD with the parameters of our autoencoder, with MSE loss, call that fit function we just wrote, and I won't wait for it to run; as you can see, it's really slow, for reasons we've discussed. I've run it before.

And what we want is to see that the original, which is here, gets recreated. And the answer is... not really. I mean, they're roughly the same things, but there's no point having an autoencoder which can't even recreate the originals. The idea would be that if these looked almost identical to these, we would say, wow, this is a fantastic network at compressing things by eight times.

So I found it very fiddly to try and get this to work at all. Something that I discovered can get it to start training is to start with a really low learning rate for a few epochs, and then increase the learning rate. At least that gets it to train and show something vaguely sensible.

But let's see... yeah, it still looks pretty crummy. This one here I got by switching to Adam, actually, and I removed the tricky bit, I removed these two as well. But yeah, I couldn't get this to recreate anything very reasonable in any reasonable amount of time. And, you know, why is this not working very well?

There are so many reasons it could be. Do we need a better optimizer? Do we need a better architecture? Do we need to use a variational autoencoder? There are a thousand things we could try, but doing it like this is going to drive us crazy.

We need to be able to really rapidly try things, all kinds of different things. And what I often see, you know, in projects or on Kaggle or whatever, is that people's code looks kind of like this: it's all manual, and their iteration speed is too slow.

We need to be able to really rapidly try things. So we're not going to keep doing stuff manually anymore. This is where we call a halt and say, OK, let's build up a framework that we can use to rapidly try things and understand when things are working and when things aren't working.

So we're going to start creating a Learner. So what is a Learner? The basic idea is that this Learner is going to be something we build which will allow us to try anything we can imagine very quickly. And we will build, on top of that Learner, things that will allow us to introspect what's going on inside a model, and that will allow us to do multiprocess CUDA to go fast.

It will allow us to add things like data augmentation, and it will allow us to try a wide variety of architectures quickly, and so forth. So that's going to be the idea. And of course, we're going to create it from scratch. So let's start with Fashion MNIST as before, and let's create a DataLoaders class, which is going to look a bit like what we had before.

This just couldn't be simpler, right? We're just going to pass in two data loaders and store them away. And I'm going to create a class method that builds this from a dataset dictionary. What that's going to do is call DataLoader on each of the dataset dictionary items, with our batch size, and instantiate our class.

So if you haven't seen classmethod before, it's what allows us to say DataLoaders dot something in order to construct this. We could have put this in the constructor just as well, but we'll be building more complex DataLoaders things later, so I thought we might start by getting the basic structure right.
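
A sketch of that DataLoaders class (the class method name from_dd is my guess at the naming, and the real collation details may differ):

```python
from torch.utils.data import DataLoader

class DataLoaders:
    # couldn't be simpler: store a training and a validation data loader
    def __init__(self, train_dl, valid_dl):
        self.train, self.valid = train_dl, valid_dl

    @classmethod
    def from_dd(cls, dd, batch_size):
        # build one DataLoader per item of a dataset dictionary, then
        # instantiate the class with them
        return cls(*[DataLoader(ds, batch_size=batch_size) for ds in dd.values()])
```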

So this is all pretty much the same as what we've had before. I'm not doing anything on the device here, because, as we know, that didn't really work. Oh, this is an old thing; I don't need .cuda() anymore, we're going to use to_device. Here we go.

So here's an example of a very simple Learner that fits on one screen, and this is basically going to replace our fit function. A Learner is going to be something that trains, or learns, a particular model, using a particular set of data loaders, a particular loss function, some particular learning rate, and some particular optimization function.

Now, normally, I know most people would store each of these away separately by writing self.model = model, blah blah blah, right? And as I think we've talked about before, that's a huge amount of boilerplate: it's more stuff that you can get wrong, and it's more stuff that you have to read to understand the code.

And yeah, I don't like that kind of repetition. So instead we just call fastcore's store_attr to do that all in one line. OK. So that's basically the idea with a class: think about what information it's going to need, pass that all to the constructor, and store it away. And then our fit function has the basic stuff that we need for keeping track of accuracy.

This is only work for stuff that's a classification where we can use accuracy put the model on our device create the optimizer store how many epochs we're going through then for each epoch we'll call the one epoch function and the one epoch function we're going to either do train or evaluation.

So we pass in true if we're training and false if we're evaluating and they're basically almost the same. We basically set the model to training mode or not. We then decide whether to use a validation set or the training set based on whether we're training. And then we go through each batch in the data loader and call one batch and one batch is then the thing which is going to put our batch onto the device call our model call our loss function.

Then if we're training, it does our backward step, our optimizer step, and our zero grad, and finally calculates our metrics, or stats; here's where we calculate our metrics. So that's basically what we have there. Let's go back to using an MLP: we call fit, and away it goes.
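
To make that structure concrete, here's a condensed sketch of a learner along those lines. It's my reconstruction from the description above, not the notebook's exact code, and it hard-codes multi-class accuracy just as described:

```python
import torch
from fastcore.basics import store_attr

def_device = 'cuda' if torch.cuda.is_available() else 'cpu'

class Learner:
    def __init__(self, model, dls, loss_func, lr, opt_func=torch.optim.SGD):
        store_attr()

    def one_batch(self, xb, yb):
        xb, yb = xb.to(def_device), yb.to(def_device)
        preds = self.model(xb)
        loss = self.loss_func(preds, yb)
        if self.model.training:
            loss.backward()
            self.opt.step()
            self.opt.zero_grad()
        # keep running totals for loss and multi-class accuracy
        with torch.no_grad():
            n = len(xb)
            self.loss_sum += loss.item() * n
            self.acc_sum += (preds.argmax(dim=1) == yb).float().sum().item()
            self.count += n

    def one_epoch(self, train):
        self.model.train(train)  # training mode or eval mode
        dl = self.dls.train if train else self.dls.valid
        self.loss_sum = self.acc_sum = self.count = 0
        for xb, yb in dl:
            self.one_batch(xb, yb)
        print(f"epoch={self.epoch} train={train} "
              f"loss={self.loss_sum/self.count:.3f} acc={self.acc_sum/self.count:.3f}")

    def fit(self, n_epochs):
        self.model.to(def_device)  # put the model on our device
        self.opt = self.opt_func(self.model.parameters(), self.lr)
        for self.epoch in range(n_epochs):
            self.one_epoch(True)
            with torch.no_grad():
                self.one_epoch(False)
```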

There's an error here pointed out by Kevin, thank you: it should be self.model.to. One thing I guess we could try now: we think that maybe we can use more than one process, so let's try that. Oh, it's so fast I didn't even see it. There it goes. You can see all four CPUs being used at once.

Bang, it's done. So that's pretty great. Let's see how it looks here. All right, lovely. So that's a good sign. We've got a learner that can fit things, but it's not very flexible. It's not going to help us, for example, with our autoencoder, because there's no way to change which things are used for predicting with or for calculating the loss with.

We can't use it for anything except things that involve accuracy with a multi-class classification. It's not flexible at all, but it's a start. And I wanted to put this all on one screen so you can see what the basic learner looks like.

All right, so how do we do things other than multi-class accuracy? I decided to create a Metric class, and basically a Metric class is something where we're going to define subclasses of it that calculate particular metrics. So for example, here I've got a subclass of Metric called Accuracy.

If you haven't done subclasses before, you can basically think of this as saying: please copy and paste all the code from here into here for me, but the bit that says def calc, replace it with this version. So in fact this would be identical to copying and pasting this whole thing, typing Accuracy here, and replacing the definition of calc with that.

That's what's happening here when we do subclassing: it's basically copying and pasting all that code in there for us. It's actually more powerful than that, there's more we can do with it, but in this case that's all that's happening with this subclassing.
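
To see the copy-and-paste intuition in a tiny standalone example (nothing to do with the notebook itself):

```python
class Base:
    def report(self):
        return f"calc gives {self.calc()}"  # inherited unchanged by subclasses
    def calc(self):
        return 0

class Child(Base):
    def calc(self):  # only this definition is replaced
        return 42

Child().report()  # 'calc gives 42' -- report() came along as if pasted in
```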

OK, so the Accuracy metric is here, and then this is our really basic Metric which we're going to use just for loss. And what happens is, let's for example create an Accuracy metric object; we're basically going to add in mini-batches of data.

So for example, here's a mini-batch of inputs and targets, and here's another mini-batch of inputs and targets. Then we call .value and it will calculate the accuracy. Now .value is a neat little thing: it doesn't require parentheses after it, because it's a property.

A property is something that just calculates automatically without you having to put parentheses. That's all a property is; well, a property getter anyway. And they look like this: you give it a name. So each time we call add, we're going to be storing that input and that target.
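
As a quick toy illustration of a property getter:

```python
class Running:
    def __init__(self):
        self.vals = []

    def add(self, x):
        self.vals.append(x)

    @property
    def value(self):
        # computed fresh on every access, with no parentheses at the call site
        return sum(self.vals) / len(self.vals)

r = Running()
r.add(2); r.add(4)
r.value  # 3.0 -- note: r.value, not r.value()
```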

And also, optionally, the number of items in the mini-batch; for now that's just always going to be one. You can see here that we then call .calc, which is going to call the Accuracy calc, to see how often they're equal. Then we append that calculation to the list of values.

And we also append to the list of ns, in this case just one. Then to calculate the value we just do that. So that's all that's happening for Accuracy. And then for loss we can just use Metric directly, because Metric directly will just calculate the average of whatever it's passed.

So we can say: add the number 0.6 (the target's optional), and we're saying this is a mini-batch of size 32, so that's going to be the n. Then add the value 0.9 with a mini-batch size of 2, and then get the value.

As you can see, that's exactly the same as the weighted average of 0.6 and 0.9 with weights of 32 and 2. So we've created a Metric class, and that's something we can use to create any metric we like just by overriding calc. Or we could create things totally from scratch, as long as they have an add and a value.
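
Putting those pieces together, a minimal version consistent with that description might look like this (a sketch, so details may differ from the notebook):

```python
import torch

class Metric:
    def __init__(self):
        self.vals, self.ns = [], []

    def add(self, inp, targ=None, n=1):
        self.vals.append(self.calc(inp, targ))  # store this mini-batch's value...
        self.ns.append(n)                       # ...and its (optional) batch size

    def calc(self, inps, targs):
        return inps  # base class: just average whatever it's passed

    @property
    def value(self):
        ns = torch.tensor(self.ns)
        return (torch.tensor(self.vals) * ns).sum() / ns.sum()  # weighted average

class Accuracy(Metric):
    def calc(self, inps, targs):
        return (inps == targs).float().mean()  # how often they're equal

loss = Metric()
loss.add(0.6, n=32)
loss.add(0.9, n=2)
loss.value  # tensor(0.6176), i.e. (0.6*32 + 0.9*2) / 34
```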

OK, so we're now going to change our learner, and we're going to keep the same basic structure. There's going to be fit; it's going to go through each epoch and call one_epoch, passing in True and False for training and validation.

one_epoch is going to go through each batch in the data loader and call one_batch. one_batch is going to do the prediction and get the loss, and if it's training it's going to do the backward, step, and zero grad. But there are a few other things going on, so let's take a look.

Actually, let's just look at it in use first. When we use it, we're going to be creating a Learner with the model, data loaders, loss function, learning rate, and some callbacks, which we'll learn about in a moment. We call fit and it's going to do our thing.

And look, we're going to have charts and stuff. All right, so the basic idea is going to look very similar. When we construct it, we're going to be passing in exactly the same things as before, but with one extra thing: callbacks, which we'll see in a moment.

We store the attributes as before, and we're going to be doing some stuff with the callbacks. When we call fit for some number of epochs, we store away how many epochs we're going to do. We also store away the actual range that we're going to loop through as self.epochs.

So here's that looping through self.epochs. We create the optimizer using the optimizer function and the parameters, and then we call _fit. Now what on earth is _fit? Why didn't we just copy and paste this into here? Why do this? It's because we've created this special decorator, with_callbacks.

What does that do? It's up here: with_callbacks. with_callbacks is a class. It's going to store just one thing, which is the name; in this case the name is fit. Now this is the decorator, right, so when we call it, remember, decorators get passed a function.

So it's going to get passed this whole function, and that's going to be called f. Dunder call, remember, is what happens when an object is treated as if it's a function. So it's going to get passed this function, and this function is _fit.

What we want to do is return a different function. It's going to, of course, call the function we were asked to call, using the arguments and keyword arguments we were asked to use. But before it calls that function, it's going to call a special method called callback, passing in the string before_, in this case before_fit.

After it's completed, it's going to call that callback method again, passing the string after_fit. And it wraps the whole thing in a try/except block, looking for an exception called CancelFitException; if it gets one, it's not going to complain.
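
In code, the decorator described here might look something like this sketch (reconstructed from the description, so names and details are approximate):

```python
class CancelFitException(Exception): pass
class CancelEpochException(Exception): pass
class CancelBatchException(Exception): pass

class with_callbacks:
    def __init__(self, nm):
        self.nm = nm                              # e.g. 'fit'

    def __call__(self, f):                        # the decorator is passed a function
        def _f(o, *args, **kwargs):
            try:
                o.callback(f'before_{self.nm}')   # e.g. 'before_fit'
                f(o, *args, **kwargs)             # run the wrapped method
                o.callback(f'after_{self.nm}')    # e.g. 'after_fit'
            except globals()[f'Cancel{self.nm.title()}Exception']:
                pass                              # cancelling is not an error
        return _f
```

It would then be applied as @with_callbacks('fit') on a method named _fit.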

So let me explain what's going on with all of those things. Let's look at an example of a callback. Here is a callback called DeviceCB, the device callback. Its before_fit will be called automatically before that _fit method is called, and it's going to put the model onto our device, CUDA or MPS if we have one, otherwise it will just be on the CPU.

So what's going to happen here? We're going to call fit. It's going to go through these lines of code and call _fit, but _fit is not this function; _fit is this wrapper function, with f being this original function. So it's going to call our learner's callback method, passing in before_fit, and callback is defined here.

What's callback going to do? It's going to be passed the string before_fit. It then goes through each of our callbacks, sorted based on their order; you can see here our callbacks can have an order. And it's going to look at each callback and try to get an attribute called before_fit, and it will find one.

And so then it's going to call that method. Now if that method doesn't exist, if it doesn't appear at all, then getattr will return this instead: identity, a function just here. This is an identity function; all it does is return whatever arguments it gets passed, and if it's not passed any arguments it just returns.
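
Here's a sketch of how callback, identity, and DeviceCB might fit together (again reconstructed from the description above, with identity behaving as just described):

```python
import torch

def identity(*args):
    # return whatever it was passed; with no arguments, just return
    if not args: return
    return args[0] if len(args) == 1 else args

class Callback:
    order = 0  # callbacks can have an order; default to 0

class DeviceCB(Callback):
    def before_fit(self, learn):
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        learn.model.to(device)

class Learner:
    # ...constructor, fit, etc. as before; self.cbs is the list of callbacks
    def callback(self, method_nm):
        for cb in sorted(self.cbs, key=lambda cb: cb.order):
            # if the callback doesn't define this method, getattr falls
            # back to identity, which harmlessly swallows the call
            getattr(cb, method_nm, identity)(self)
```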

So there's a lot of Python going on here, and that is why we did that foundations lesson. For people who haven't done a lot of this kind of Python, there's going to be a lot of stuff to experiment with and learn about, so do ask on the forums if any of these bits get confusing.

But the best way to learn about these things is to open up this Jupyter notebook and try to create really simple versions of things. So for example, let's try identity. How exactly does identity work? I can call it with nothing and it returns nothing. I can call it with 1; it gets back 1.

I can call it with 'a'; it gets back 'a'. I can call it with 'a', 1 and get back 'a', 1. And how is it doing that exactly? Remember, we can add a breakpoint, and this would be a great time to really test your debugging skills.

Remember, in our debugger we can hit h to find out what the commands are, but you really should do a tutorial on the debugger if you're not familiar with it. Then we can step through each line. So I can now print args, and there's actually a trick which I like: args is, funnily enough, itself a debugger command which will just tell you the arguments to any function, regardless of what they're called.

Which is kind of nice. Then we can step through by pressing n, and after this we can check, OK, what is x now? And what is args now? So remember to really experiment with these things; we're going to talk about this a lot more in the next lesson.
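
To give a flavour, here's a sketch of that kind of session using Python's built-in debugger (the exact output will depend on your Python version):

```python
import pdb

def identity(*args):
    pdb.set_trace()  # drop into the debugger whenever identity is called
    if not args: return
    return args[0] if len(args) == 1 else args

identity('a', 1)
# At the (Pdb) prompt you can try, for example:
#   h         list the available commands
#   args      print the current function's arguments, whatever they're named
#   n         step to the next line
#   p args    print the value of the local variable `args`
#   c         continue running
```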

But before that, if you're not familiar with try/except blocks, spend some time practicing them. If you're not familiar with decorators, well, we've seen them before, so go back and look at them again really carefully. If you're not familiar with the debugger, practice with that. If you haven't spent much time with getattr, remind yourself about that.

So try to get yourself as familiar and comfortable as possible with the pieces, because if you're not comfortable with the pieces, then the way we put the pieces together is going to be confusing. There's actually something in the theory of education called cognitive load theory, which basically says that if you're trying to learn something but your cognitive load is really high because of lots of other things going on at the same time, you're not going to learn it.

So it's going to be hard for you to learn this framework we're building if you have too much cognitive load of: what the hell is a decorator, what the hell is getattr, what does sorted do, what's partial. You know, all these things. Now, I actually spent quite a bit of time trying to make this as simple as possible.

But also as flexible as it needs to be for the rest of the course, and this is as simple as I could get it. So these are things that you actually do have to learn, but in doing so you're going to be able to write some really powerful and general code yourself.

So hopefully you'll find this a really valuable and mind-expanding exercise in bringing high-level software engineering skills to your data science work. OK, with that, this looks like a good place to leave it. I look forward to seeing you next time. Bye.