
Lesson 15: Deep Learning Foundations to Stable Diffusion


Chapters

0:00 Introduction
0:51 What are convolutions?
6:52 Visualizing convolutions
8:51 Creating a convolution with MNIST
17:58 Speeding up the matrix multiplication when calculating convolutions
22:27 PyTorch's F.unfold and F.conv2d
27:21 Padding and Stride
31:03 Creating the ConvNet
38:32 Convolution Arithmetic. NCHW and NHWC
39:47 Parameters in MLP vs CNN
42:27 CNNs and image size
43:12 Receptive fields
46:09 Convolutions in Excel: conv-example.xlsx
56:04 Autoencoders
60:00 Speeding up fitting and improving accuracy
65:56 Reminding what an auto-encoder is
75:52 Creating a Learner
82:48 Metric class
88:40 Decorator with callbacks
92:45 Python recap

Whisper Transcript

00:00:00.000 | Hi all and welcome to Lesson 15 and what we're going to endeavor to do today is to create
00:00:07.680 | a convolutional autoencoder.
00:00:13.200 | And in the process we will see why doing that well is a tricky thing to do and time permitting
00:00:22.060 | we will begin to work on a framework, a deep learning framework to make life a lot easier.
00:00:29.920 | Not sure how far we'll get on that today, time wise, so let's see how we go and get
00:00:34.360 | straight into it.
00:00:37.000 | So okay, so today, before we can create a convolutional autoencoder,
00:00:45.140 | we need to talk about convolutions: what are they, and what are they for?
00:00:52.480 | Generally speaking, convolutions are something that allows us to tell our neural network
00:01:00.560 | a little bit about the structure of the problem that's going to make it a lot easier for it
00:01:05.160 | to solve the problem.
00:01:06.640 | And in particular the structure of our problem is we're doing things with images.
00:01:11.280 | Images are laid out on a grid, a 2D grid for black and white or a 3D for color or a 4D
00:01:21.680 | for a color video or whatever.
00:01:24.720 | And so we would say, you know, there's a relationship between the pixels going across and the pixels
00:01:30.280 | going down.
00:01:31.280 | They tend to be similar to each other, differences in those pixels across those dimensions tend
00:01:35.980 | to have meaning, patterns of pixels that appear in different places often represent the same
00:01:43.240 | thing.
00:01:44.240 | So for example, a cat in the top left is still a cat even if it's in the bottom right.
00:01:49.520 | This kind of prior information is something that is naturally captured by
00:01:56.760 | a convolutional neural network, something that uses convolutions.
00:02:01.760 | Generally speaking, this is a good thing because it means that we will be able to use fewer
00:02:06.040 | parameters and less computation because more of that information about the problem we're
00:02:11.640 | solving is kind of encoded directly into our architecture.
00:02:17.320 | There are other architectures that don't encode that prior information as strongly such as
00:02:24.880 | a multilayer perceptron, which we've been looking at so far or a transformers network,
00:02:28.920 | which we haven't looked at yet.
00:02:32.040 | Those kinds of architectures do give us more flexibility,
00:02:38.520 | and given enough time, compute and data, they could potentially find things that maybe CNNs
00:02:45.240 | would struggle to find.
00:02:48.640 | So we're not always going to use convolutional neural networks, but they're a pretty good
00:02:53.000 | starting point and certainly something important to understand.
00:02:57.300 | They're not just used for images.
00:02:59.360 | We can also take advantage of one-dimensional convolutions for language-based tasks, for
00:03:05.680 | instance.
00:03:06.680 | So convolutions come up a lot.
00:03:11.580 | So in this notebook, one thing you'll notice that might be of interest is we are importing
00:03:18.160 | stuff from MiniAI now.
00:03:20.520 | Now MiniAI is this little library that we're starting to create and we're creating it using
00:03:25.880 | nbdev.
00:03:27.280 | So we've now got a MiniAI.training and a MiniAI.datasets.
00:03:31.480 | And so if we look, for example, at the datasets notebook, it starts with something that says
00:03:36.440 | that the default export module is called datasets and some of the cells have a export directive
00:03:46.080 | on them.
00:03:47.280 | And at the very bottom, we had something that called nbdev export.
00:03:52.720 | Now what that's going to do is it's going to create a file called datasets.py.
00:04:03.840 | Next here, datasets.py.
00:04:08.660 | And it contains those cells that we exported.
00:04:17.440 | And why is it called MiniAI.datasets?
00:04:21.880 | That's because everything for nbdev is stored in settings.ini and there's something here
00:04:26.200 | saying create a library libname called MiniAI.
00:04:32.460 | You can't use this library until you install it.
00:04:35.420 | Now we haven't uploaded it to PyPI - that is, we haven't made it a pip installable package on a
00:04:41.900 | public server.
00:04:43.620 | But you can actually install a local directory as if it's a Python module that you've kind
00:04:52.160 | of installed from the internet.
00:04:53.880 | And to do that, you say pip install in the usual way, but you say -e, which stands for editable.
00:05:00.100 | And that means set up the current directory as a Python module.
00:05:03.720 | Well, current directory, actually any directory you like, I just put dot to mean the current
00:05:07.600 | directory.
00:05:08.600 | And so you'll see that's going to go ahead and actually install my library.
00:05:15.480 | And so after I've done that, I can now import things from that library, as you see.
00:05:27.720 | Okay, so this is just the same as before.
00:05:31.200 | We're going to grab our MNIST dataset and we're going to create a convolutional neural
00:05:35.260 | network on it.
00:05:36.260 | So before we do that, we're going to talk about what are convolutions.
00:05:40.600 | And one of my favorite descriptions of convolutions comes from the student in our, I think it
00:05:44.920 | was our very first course, Matt Kleinsmith, who wrote this really nice Medium article,
00:05:53.040 | CNNs from different viewpoints, which I'm going to steal from.
00:05:56.200 | And here's the basic idea.
00:05:57.320 | Say that this is our image, it's a three by three image with nine pixels labeled from
00:06:05.320 | A to J as capital letters.
00:06:08.080 | Now a convolution uses something called a kernel and a kernel is just another tensor.
00:06:15.920 | In this case, it's a two by two matrix again.
00:06:18.760 | So this one's we're going to have alpha, beta, gamma, delta as our four values in this convolution.
00:06:26.960 | Now, one thing I'll mention, I can't remember if I've said this
00:06:31.760 | before, is that the Greek letters are things that
00:06:37.480 | you want to be able to pronounce.
00:06:39.280 | So if you don't know how to read these and say what these names are, make sure you head
00:06:44.240 | over to Wikipedia or whatever and learn the names of all the Greek letters so that you
00:06:49.080 | can, because they come up all the time.
00:06:51.600 | So, what happens when we apply a convolution with this two by two kernel to this three
00:06:58.400 | by three image, I mean, it doesn't have to be an image, it's in this case, it's just
00:07:04.720 | a rank two tensor, but it might represent an image.
00:07:09.160 | What happens is we take the kernel and we overlay it over the first little two by two
00:07:15.780 | sub grid, like so, and specifically what we do is we match color to color.
00:07:22.760 | So the output of this first two by two overlay would be alpha times A plus beta times B plus
00:07:31.280 | gamma times D plus delta times E and that would yield some value P and that's going
00:07:37.720 | to end up in the top left of a two by two output.
00:07:42.420 | So the top right of the two by two output, we're going to slide, it's like a sliding
00:07:46.760 | window, we're going to slide our kernel over to here and apply each of our coefficients
00:07:52.720 | to these respectively colored squares and then ditto for the bottom left and then ditto
00:08:01.120 | for the bottom right.
00:08:03.520 | So we end up with this equation, P as we discussed is alpha A plus beta B plus gamma D plus delta
00:08:10.400 | E plus some bias term; Q at the top right,
00:08:19.280 | as you can see, is just alpha, in this case, times B, and so on. We're just multiplying them
00:08:24.600 | together and adding them up, multiply together, add them up, multiply together and add them up.
00:08:29.080 | So we're basically, you can imagine that we're basically flattening these out into rank one
00:08:34.360 | tensors into vectors and then doing a dot product would be one way of thinking about
00:08:38.240 | what's happening as we slide this kernel over these windows.
00:08:41.880 | And so this is called a convolution.
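
To make that concrete, here is a rough sketch of that sliding-window multiply-and-sum in PyTorch (the numbers are made up for illustration):

```python
import torch

img = torch.arange(1., 10.).view(3, 3)       # the 3x3 "image": A..J on a grid
kernel = torch.tensor([[1., 2.],             # alpha, beta
                       [3., 4.]])            # gamma, delta
bias = 0.5

out = torch.zeros(2, 2)
for i in range(2):                           # slide the 2x2 window over the image
    for j in range(2):
        window = img[i:i+2, j:j+2]
        out[i, j] = (window * kernel).sum() + bias   # elementwise multiply, then sum
# out[0, 0] is P = alpha*A + beta*B + gamma*D + delta*E + bias, and so on
```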
00:08:46.260 | So let's try and create a convolution.
00:08:52.080 | So for example, let's grab our training images and take a look at one.
00:09:05.480 | And let's create a three by three kernel.
00:09:09.320 | So remember, the word kernel appears a lot of times in computer
00:09:14.640 | science and math.
00:09:15.840 | We've already seen the term kernel to mean a piece of code that we run on a GPU across
00:09:24.240 | lots of parallel kind of virtual devices or potentially in a grid.
00:09:30.480 | There's a similar idea here.
00:09:31.880 | We've got a computation, which is in this case, kind of this dot product or something
00:09:35.520 | like a dot product, sliding over, occurring lots of times over a grid.
00:09:40.800 | But it's, yeah, it's a bit different.
00:09:45.680 | That's kind of another use of the word kernel.
00:09:47.480 | So in this case, a kernel is going to be a rank two tensor.
00:09:52.520 | And so let's create a kernel with these values in the three by three matrix, rank two tensor.
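
The kernel being described might be defined something like this (a sketch; negative weights on the top row and positive on the bottom, matching the narration that follows):

```python
from torch import tensor

top_edge = tensor([[-1., -1., -1.],
                   [ 0.,  0.,  0.],
                   [ 1.,  1.,  1.]])   # dark above, bright below: a top edge detector
```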
00:09:59.280 | And we could draw what that looks like.
00:10:01.840 | Not surprising.
00:10:02.840 | It just looks like a bunch of lines.
00:10:04.760 | Oops.
00:10:05.760 | Okay.
00:10:06.760 | So what would happen if we slide this over just these nine pixels over this 28 by 28?
00:10:17.240 | Well, what's going to happen is if we've got some, the top left, for example, three by
00:10:23.360 | three section has these names, then we're going to end up with negative A1 because the top
00:10:29.160 | three are all negative, right?
00:10:30.920 | Negative A1, minus A2, minus A3; the next three are just zero.
00:10:36.000 | So that won't do anything.
00:10:37.500 | And then plus A7, plus A8, plus A9.
00:10:43.360 | Why is that interesting?
00:10:47.080 | That's interesting.
00:10:48.080 | Well, let's try here.
00:10:50.880 | What I've done here is I've grabbed just the first 13 rows and first 23 columns of our
00:10:58.300 | image.
00:11:00.440 | And I'm actually showing the numbers and also using gray kind of conditional formatting,
00:11:08.260 | if you like, or the equivalent in pandas to show this top bit.
00:11:12.080 | So we're looking at just this top bit.
00:11:18.400 | So what happens if we take rows three, four, and five?
00:11:23.880 | Remember this is not inclusive, right?
00:11:26.620 | So it's rows three, four, and five, columns 14, 15, and 16.
00:11:31.920 | So we're looking at this, these three here.
00:11:35.560 | What's that going to give us if we multiply it by this kernel?
00:11:42.040 | It gives us a fairly large positive value because the three that we have negatives on
00:11:49.960 | is the top row.
00:11:50.960 | Well, they're all zero.
00:11:52.480 | And the three that we have positives on, they're all close to one.
00:11:57.760 | So we end up with quite a large number.
00:12:00.720 | What about the same columns, but for rows 7, 8, 9? Here, the top is all positive and
00:12:11.840 | the bottom is all zero.
00:12:13.900 | So that means that we're going to get a lot of negative terms.
00:12:18.200 | And not surprisingly, that's exactly what we see.
00:12:20.760 | If we do this kind of dot product equivalent, which all you need to do in NumPy is just
00:12:29.000 | an element-wise multiplication followed by a sum, right?
00:12:32.720 | So that's going to be quite a large negative number.
00:12:35.620 | And so perhaps you're seeing what this is doing, and maybe you got a hint from the name
00:12:39.320 | of the tensor we created.
00:12:41.480 | It's something that is going to find the top edge, right?
00:12:45.780 | So this one is a top edge, so it's a positive, and this one is a bottom edge, so it's a negative.
00:12:52.880 | So we would like to apply that, this kernel, to every single 3x3 section in here.
00:13:02.860 | So we could do that by creating a little apply kernel function that takes some particular
00:13:08.400 | row, and some particular column, and some particular tensor as a kernel, and does that
00:13:16.080 | multiplication dot sum that we just saw.
00:13:21.820 | So for example, we could replicate this one by calling apply kernel.
00:13:28.000 | And this here is the center of that 3x3 grid area.
00:13:33.280 | And so there's that same number, 2.97.
00:13:36.940 | So now we could apply that kernel to every one of the 3x3 windows in this 28x28 image.
00:13:46.520 | So we're going to be sliding over, like this red bit sliding over here, but we've actually
00:13:51.120 | got a 28x28 input, not just a 5x5 input.
00:13:55.600 | So to get all of the coordinates-- let's just simplify it to do this 5x5-- we can create
00:14:02.220 | a list comprehension. We can take i through every value in range 5, and then for each
00:14:08.680 | of those, we can take j for every value in range 5.
00:14:14.200 | And so if we just look at that tuple, you can see we get a list of lists containing
00:14:21.280 | all of those coordinates.
00:14:25.340 | So this is a list comprehension in a list comprehension, which when you first see it, may be surprising
00:14:35.000 | or confusing, but it's a really helpful idiom.
00:14:39.520 | And I certainly recommend getting used to it.
00:14:43.800 | Now what we're going to do is we're not just going to create this tuple, but we're actually
00:14:50.480 | going to call apply kernel for each of those.
00:14:54.620 | So if we go through from 1 to 27-- well, actually, 1 to 26, because 27 is exclusive.
00:15:03.860 | So we're going to go through everything from 1 to 26, and then for each of those, go through
00:15:08.720 | from 1 to 26 again and call apply kernel.
00:15:12.880 | And that's going to give us the result of applying that convolutional kernel to every
00:15:17.320 | one of those coordinates.
00:15:20.380 | And there's the result.
00:15:21.980 | And you can see what it's done, as we hoped, is it is highlighting the top edges.
00:15:28.360 | So yeah, you might find that kind of surprising that it's that easy to do this kind of image
00:15:35.440 | processing.
00:15:36.440 | We're literally just doing an element-wise multiplication and a sum for each window.
00:15:46.020 | OK, so that is called a convolution.
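
Pieced together from the narration, the apply_kernel function and the nested list comprehension might look roughly like this (a sketch, assuming img is the 28 by 28 image and top_edge is the kernel above):

```python
from torch import tensor

def apply_kernel(row, col, kernel):
    # multiply one 3x3 window (centred at row, col) by the kernel, then sum
    return (img[row-1:row+2, col-1:col+2] * kernel).sum()

rng = range(1, 27)   # window centres 1..26: a 3x3 window needs a 1-pixel border
top_edge3 = tensor([[apply_kernel(i, j, top_edge) for j in rng] for i in rng])
top_edge3.shape      # torch.Size([26, 26]): the top-edge-highlighted output
```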
00:15:54.580 | So we can do another convolution.
00:15:56.140 | This time, we could do one with a left edge tensor, so as you can see, it looks just a
00:16:00.940 | rotated version or transposed version, I guess, of our top edge tensor.
00:16:05.940 | Here's what it looks like.
00:16:07.320 | And so if we apply that kernel-- so this time, we're going to apply the left edge kernel.
00:16:13.240 | And so notice here that we're actually passing in a function.
00:16:17.020 | Right?
00:16:18.020 | We're passing in a function-- sorry, actually, not a function, is it?
00:16:22.440 | It's just a tensor, actually.
00:16:27.740 | So we're going to pass in the left edge tensor for the same list comprehension in a list
00:16:35.620 | comprehension.
00:16:37.060 | And this time, we're getting back at the left edges.
00:16:40.180 | Highlighting all of the left edges in the digit.
00:16:45.320 | So yeah, this is basically what's happening here, is that a 2 by 2 can be looped over
00:16:52.800 | an image, creating these outputs.
00:16:57.420 | Now you'll see here that in the process of doing so, we are losing the outermost pixels
00:17:09.480 | of our image.
00:17:10.480 | We'll learn about how to fix that later.
00:17:12.060 | But just for now, notice that as we are putting our 3 by 3 through, for example, in this 5
00:17:18.840 | by 5, there's only one, two, three places that we can put it going across, not five
00:17:24.800 | places, because we need some kind of edge.
00:17:27.340 | All right.
00:17:28.740 | So that's cool.
00:17:30.300 | That's a convolution.
00:17:31.300 | And hopefully, if you remember back to kind of the Zeiler and Fergus pictures from lesson
00:17:36.540 | 1, you might recognize that the kind of first layer of a convolutional network is often
00:17:41.040 | looking for kind of edges and gradients and things like that.
00:17:43.860 | And this is how it does it.
00:17:46.080 | And then the convolutions on top of convolutions with nonlinear activations between them can
00:17:51.380 | combine those into curves or corners or stuff like that, and so on and so forth.
00:17:58.180 | Okay.
00:17:59.340 | So how do we do this quickly?
00:18:01.100 | Because currently, this is going to be super, super slow doing this in Python.
00:18:05.180 | So one of the very earliest or probably the earliest publicly available general purpose
00:18:13.300 | deep learning, GPU accelerated deep learning thing I saw, it was called Caffe.
00:18:18.500 | That was created by somebody called Yangqing Jia.
00:18:22.540 | And he actually described how CAFE went about implementing a fast convolution on a GPU.
00:18:37.860 | And basically, he said, "Well, I had two months to do it, and I had to finish my thesis."
00:18:44.660 | And so I ended up doing something where I said, "Well, there was some other code out
00:18:52.180 | there."
00:18:53.180 | Krizhevsky, who you might have come across, him and Hinton, set up a little startup, which
00:19:01.040 | Google bought, and that kind of became the start of Google's deep learning, the Google
00:19:05.700 | brain basically.
00:19:06.700 | So Krizhevsky had all this fancy stuff in his library, but Yangqing Jia said, "Oh, I didn't
00:19:12.760 | know how to do all that stuff."
00:19:14.700 | So I said, "Well, I already know how to multiply matrices, so maybe I can convert a convolution
00:19:19.900 | into a matrix multiplication."
00:19:23.060 | And so that became known as im2col.
00:19:29.940 | Im2col is a way of converting a convolution into a matrix multiply.
00:19:38.700 | And so actually, I suspect Yangqing Jia kind of accidentally reinvented
00:19:45.180 | it, because it actually had been around for a while, even at the point that he was writing
00:19:51.220 | his thesis, I believe.
00:19:56.060 | So it was actually, this is the place I believe it was created in this paper.
00:20:02.340 | So that was in 2006, which is a while ago.
00:20:09.580 | And so this is actually from that paper.
00:20:12.820 | And what they describe is, let's say you are putting this two by two kernel over this three
00:20:23.100 | by three bit of an image.
00:20:24.700 | So here you've got this window needs to match to this bit of this window, right?
00:20:29.740 | What you could do is you could unwrap this to one, one, two, sorry, one, two, one, two
00:20:35.820 | downwards to here, one, two, one, two. So unroll it like so. And you could unroll the kernel
00:20:42.580 | here.
00:20:44.580 | Yeah, sorry, this is one, two, one, one. So this bit is here, one, two, one, one. And
00:20:51.860 | then you could unroll the kernel one, one, two, two to here, one, one, two, two.
00:20:57.780 | And then once they've been flattened out and moved in that way, and then you'll do exactly
00:21:02.940 | the same thing for this next patch here, two, oh, one, three. You flatten it out and put
00:21:07.700 | it here, two, oh, one, three.
00:21:09.580 | So if you basically take those kernels and flatten them out in this format, then you
00:21:14.420 | end up with a matrix multiply. If you multiply this matrix by this matrix, you'll end up
00:21:21.420 | with the output that you want from the convolution. So this is basically a way of unrolling your
00:21:30.180 | kernels and your input features into matrices, such as when you do the matrix multiply, you
00:21:35.460 | get the right answer.
00:21:36.780 | So it's a kind of a nifty trick. And so that is called im2col. I guess we're kind
00:21:45.180 | of cheating a little bit. Implementing that is kind of boring. It's just a bunch of copying
00:21:49.100 | and tensor manipulation. So I actually haven't done it. Instead, I've linked to a numpy implementation,
00:21:58.700 | which is here. And it also part of it is this get indices, which is here. And as you can
00:22:10.020 | see, it's a little bit tedious with repeats and tiles and reshapes and whatnot.
00:22:14.980 | So I'm not going to call it homework. But if you want to practice your tensor indexing
00:22:21.860 | manipulation skills, try creating a PyTorch version from scratch. I got to admit I didn't
00:22:27.260 | bother. Instead, I use the one that's built into PyTorch. And in PyTorch it's called unfold.
00:22:35.860 | So if we take our image - and PyTorch expects there to be a batch dimension and
00:22:45.500 | a channel dimension, so we'll add two unit leading dimensions to it - then we can unfold
00:22:52.900 | our input for a three by three. And that will give us a nine by 676 input. And so then
00:23:07.780 | we can take that, and then we will take our kernel and
00:23:20.680 | just flatten it out into a vector. So view changes the shape, and minus one just says
00:23:26.460 | dump everything into this dimension. So that's going to create a length-
00:23:36.140 | nine vector. And so now we can do the matrix multiply, just like they've done here, of
00:23:42.740 | the kernel matrix - that's our weights - by the unrolled input features. And so that gives
00:23:52.660 | us a 676-long vector. We can then view that as 26 by 26. And we get back, as we hoped, our
00:24:01.700 | left edge result. And so this is how we can kind of from scratch create a better implementation
00:24:16.340 | of convolutions. The reason I'm cheating, I'm allowed to cheat here, is because we did actually
00:24:21.180 | create convolutions from scratch. We're not always creating the GPU optimized versions
00:24:25.500 | from scratch, which was never something I promised. So I think that's fair. But it's
00:24:29.620 | cool that we can kind of hack out a GPU optimized version in the same way that the kind of original
00:24:34.820 | deep learning library did. So if we use apply_kernel, we get nearly nine milliseconds.
00:24:46.980 | If we use unfold with matrix multiply, we get 20 microseconds. So that's about
00:24:56.580 | 400 times faster. So that's pretty cool. Now, of course, we don't have to use unfold and
00:25:03.100 | matrix multiply because PyTorch has a conv2d. So we can run that. And that interestingly
00:25:11.860 | is about the same speed, at least on CPU. But this would also work on GPU just
00:25:19.100 | as well. Yeah, I'm not sure this will always be the case. In this case, it's a pretty small
00:25:25.220 | image. I haven't experimented a whole lot to see whereabouts there's a big difference
00:25:32.260 | in speeds between these. Obviously, I always just use F.conv2d. But if there's some more
00:25:37.500 | tricky convolution you need to do with some weird thing around channels or dimensions or
00:25:42.820 | something, you can always try this unfold trick. It's nice to know it's there, I think.
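
For reference, the unfold trick sketched in code (shapes as narrated; assuming img and left_edge from above - the exact notebook calls may differ slightly):

```python
import torch
import torch.nn.functional as F

inp = img[None, None]                    # add batch and channel dims: 1x1x28x28
unfolded = F.unfold(inp, (3, 3))         # 1 x 9 x 676: one column per 3x3 window
w = left_edge.view(-1)                   # flatten the kernel into a length-9 vector
out = (w @ unfolded[0]).view(26, 26)     # matrix multiply, then reshape the output

out2 = F.conv2d(inp, left_edge[None, None])[0, 0]   # the built-in convolution
torch.allclose(out, out2)                # True, up to floating point error
```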
00:25:48.660 | So we could do the same thing for diagonal edges. So here's our diagonal edge kernel
00:25:56.980 | or the other diagonal. So if we just grab the first 16 images, then we can do a convolution
00:26:16.140 | on our whole batch with all of our kernels at once. So this is a nice optimized thing
00:26:24.580 | that we can do. And you end up with your 26 by 26. You've got your four kernels and you've
00:26:36.460 | got your 16 images. And so that's summarized here. So that's generally what we're doing
00:26:41.980 | to get good GPU acceleration is we're doing a bunch of kernels and a bunch of images all
00:26:47.180 | at once across all of their pixels. And so here we go. That's what happens when we take
00:26:55.580 | a look at our various kernels for a particular image. Left edge, I guess top edge, and then
00:27:06.980 | diagonal top left and top right. OK, so that is optimized convolutions on and that works
00:27:16.260 | just as well on CPU or GPU. Obviously, GPU will be faster if you have one. Now, how do
00:27:22.740 | we deal with the problem that we're losing one pixel on each side? What we can do is we
00:27:30.380 | can add something called padding. And for padding, what we basically do is rather than
00:27:36.500 | starting our window here, we start it right over here. And we actually would be up one
00:27:43.500 | as well. And so these three on the left here, we just take the input for each of those as
00:27:57.340 | zero. So we're basically just assuming that they're all zero. I mean, there's other options
00:28:02.420 | we could choose. We could assume they're the same as the one next to them. There's various
00:28:07.980 | things we can do, but the simplest and the one we normally do is just assume that there's
00:28:10.740 | zero. So now, so let's say, for example, this is called one pixel padding. Let's say we
00:28:22.820 | did two pixel padding. So we had two pixel padding with a five by five input and a four
00:28:32.020 | by four kernel. So the gray is our kernel. Then we're going to start right up way over
00:28:38.580 | here on the corner. And then you can see what happens as we slide the kernel over. There's
00:28:46.300 | all the spots that it's going to take. And so that this dotted line area is the area
00:28:51.580 | that we're kind of effectively going through. But all of these white bits, we're just going
00:28:56.860 | to treat as zero. And so, and then this is this green as the output size we end up with,
00:29:01.460 | which is going to be six by six for a five by five input. I should mention even-sized
00:29:12.060 | kernels are not used very often. We normally use odd-sized kernels. If you use, for
00:29:16.420 | example, a three by three kernel and one pixel of padding, you will get back the same size
00:29:22.260 | you start with. If you use five by five with two pixels of padding, you'll end up with
00:29:28.580 | the same size you start with. So generally, odd-sized kernels are easier
00:29:33.820 | to deal with, to make sure you end up with the same size you start with.
00:29:37.100 | OK, so, yeah, as it says here, if you've got an odd-sized ks by ks kernel,
00:29:47.660 | then ks truncate-divided by two - that's what slash slash means - will give you the right padding.
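
The general convolution arithmetic here, as a quick sketch (the stride term is introduced next):

```python
def conv_out_size(n, ks, stride=1, padding=0):
    # output grid size for an n x n input and a ks x ks kernel
    return (n + 2*padding - ks) // stride + 1

conv_out_size(28, 3)                  # 26: we lose a pixel on each edge
conv_out_size(28, 3, padding=3//2)    # 28: ks//2 padding preserves the size
conv_out_size(28, 5, padding=5//2)    # 28: and it works for any odd kernel size
```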
00:29:57.060 | And so another trick you can do is you don't always have to just move your window across
00:30:05.140 | by one each time. You could move it by a different amount each time. The amount you move it by
00:30:10.960 | is called the stride. So, for example, here's a case of doing a stride two. So with stride
00:30:16.460 | two padding one, so we start out here and then we jump across two and then we jump across
00:30:21.380 | two and then we go to the next row. So that's called a stride two convolution. Stride two
00:30:26.720 | convolutions are handy because they actually reduce the dimensionality of your input by
00:30:34.620 | a factor of two. And that's actually what we want to do a lot. For example, with an
00:30:42.860 | autoencoder, we want to do that. And in fact, for most classification architectures, we
00:30:49.220 | do exactly that. We keep on reducing the kind of the grid size by a factor of two again
00:30:56.140 | and again and again using stride two convolutions with padding of one. So that's strides in
00:31:02.740 | padding. So let's go ahead and create a convnet using these approaches. So we're going
00:31:10.180 | to get the size of our training set. This is all the same as before, number of categories,
00:31:16.460 | number of digits, size of our hidden layer. So, previously with our sequential linear
00:31:37.220 | models with our MLPs, we basically went from the number of pixels to the number of hidden
00:31:50.060 | and then a ReLU, and then the number of hidden to the number of outputs. So here's the equivalent
00:31:57.300 | with a convolution. Now the problem is that you can't just do that because the output
00:32:02.460 | is not now 10 probabilities for each item in our batch, but it's 10 probabilities for
00:32:08.620 | each item in our batch for each of 28 by 28 pixels because we don't even have a stride
00:32:13.180 | or anything. So you can't just use the same simple approach that we had for MLP. We have
00:32:19.140 | to be a bit more careful. So to make life easier, let's create a little conv function
00:32:26.020 | that does a conv2d with a stride of 2, optionally followed by an activation. So if act is true,
00:32:34.940 | we will add in a ReLU activation. So this is going to either return a conv2d or a little
00:32:44.500 | sequential containing a conv2d followed by a ReLU. And so now we can create a CNN from
00:32:53.580 | scratch as a sequential model. And so since activation is true by default, this is going
00:33:00.000 | to take our 28 by 28 image, starting with one channel and creating an output of four channels.
00:33:08.420 | So this is the number of in, this is the number of filters. Sometimes we'll say filters to
00:33:13.080 | describe the number of kind of channels that our convolution has. That's the number of
00:33:18.060 | outputs. And it's very similar to the idea of the number of outputs in a linear layer,
00:33:23.540 | except this is the number of outputs in your convolution. So what I like to do when I create
00:33:30.180 | stuff like this is I add a little comment just to remind myself what is my grid size
00:33:34.980 | after this. So I had a 28 by 28 input. So then I've then put it through a stride2 conv.
00:33:41.020 | So the output of this will be 14 by 14. So then we'll do the same thing again, but this
00:33:46.500 | time we'll go from a four channel input to an eight channel output and then from eight
00:33:51.740 | to 16. So by this point, we're now down to a four by four and then down to a two by two.
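
Putting the narration together, the conv helper and the model might look roughly like this (a sketch; the final no-activation layer and the flatten are described just below):

```python
import torch.nn as nn

def conv(ni, nf, ks=3, stride=2, act=True):
    # a stride-2 conv, optionally followed by a ReLU
    res = nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks//2)
    if act: res = nn.Sequential(res, nn.ReLU())
    return res

simple_cnn = nn.Sequential(
    conv(1, 4),               # 14x14
    conv(4, 8),               # 7x7
    conv(8, 16),              # 4x4
    conv(16, 16),             # 2x2
    conv(16, 10, act=False),  # 1x1
    nn.Flatten(),             # drop the trailing unit axes: batch x 10
)
```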
00:34:01.360 | And then finally, we're down to a one by one. So on the very last layer, we won't add an
00:34:07.580 | activation. And the very last layer is going to create 10 outputs. And since
00:34:13.220 | we're now down to a one by one, we can just call flatten and that's going to remove those
00:34:19.380 | unnecessary unit axes. So if we take that, pop a mini batch through it, we end up with
00:34:26.900 | exactly what we want, 16 by 10. So for each of our 16 images, we've got 10 probabilities,
00:34:35.060 | one for each possible digit. So if we take our training set and make it
00:34:43.180 | into 28 by 28 images, and we do the same thing for a validation set. And then we create two
00:34:50.020 | datasets, one for each, which we call train dataset and valid dataset. And we're now
00:34:56.900 | going to train this on the GPU. Now, if you've got a Mac, you can use a device called, well,
00:35:06.220 | if you've got an Apple Silicon Mac, you've got a device called MPS, which is going to
00:35:11.460 | use your Mac's GPU. Where if you've got an Nvidia, you can use CUDA, which will use your
00:35:17.780 | Nvidia GPU. CUDA is 10 times or more, possibly much more, faster than a Mac. So you definitely
00:35:25.340 | want to use Nvidia if you can. But if you're just running it on a Mac laptop or whatever,
00:35:31.100 | you can use MPS. So basically you want to know what device to use. Do we want to use
00:35:35.300 | CUDA or MPS? You can check: you can check torch.backends.mps.is_available to see if
00:35:41.780 | you're running on a Mac with MPS, and you can check torch.cuda.is_available to see if you've
00:35:47.500 | got an Nvidia GPU, in which case you've got CUDA. And if you've got neither, of course,
00:35:51.540 | you'll have to use the CPU to do computation. So I've created a little function here to
00:35:58.780 | device which takes a tensor or a dictionary or a list of tensors or whatever, and a device
00:36:06.460 | to move it to. And it just goes through and moves everything onto that device. Or if it's
00:36:13.220 | a dictionary, a dictionary of things, values moved onto that device. So there's a handy
00:36:18.060 | little function. And so we can create a custom collate function, which calls the PyTorch
00:36:27.900 | default collation function and then puts those tensors onto our device. And so with that,
00:36:34.900 | we've now got enough to run, train this neural net on the GPU. We created this get deals
00:36:44.420 | function in the last lesson. So we're going to use that passing in the datasets that we
00:36:50.420 | just created and our default collation function. We're going to create our optimizer using
00:36:57.740 | our CNNs parameters. And then we call fit. Now fit remember, we also created in our last
00:37:06.900 | lesson, and it's done. So what I did then was I reduced the learning rate by a
00:37:13.860 | factor of four and ran it again. And eventually, yeah, I got to a fairly similar accuracy to
00:37:21.660 | what we did on our MLP. So yeah, we've got a convolutional network working.
00:37:32.300 | I think that's pretty encouraging. And it's nice that to train it, we didn't have to write
00:37:37.140 | much code, right? We were able to use code that we already built. We were able to use
00:37:41.940 | the dataset class that we made, the get_dls function that we made and the fit function
00:37:48.220 | that we made. And you know, because those things are written in a fairly general way,
00:37:55.900 | they work just as well for a ConvNet as they did for an MLP, nothing had to change. So
00:38:00.260 | that was nice. Notice I had to take the model and put it on the device as well. So
00:38:07.940 | that will go through and basically put all of the tenses that are in that model onto
00:38:13.020 | the MPS or CUDA device, if appropriate.
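
Gathered up, the training glue being described might look roughly like this (a sketch; get_dls and fit are the helpers from the last lesson, and their exact signatures may differ):

```python
import torch
from torch import optim
from torch.utils.data import default_collate

def_device = ('mps' if torch.backends.mps.is_available()
              else 'cuda' if torch.cuda.is_available() else 'cpu')

def to_device(x, device=def_device):
    # move a tensor, or every tensor inside a dict or list, onto the device
    if isinstance(x, torch.Tensor): return x.to(device)
    if isinstance(x, dict): return {k: to_device(v, device) for k, v in x.items()}
    return type(x)(to_device(o, device) for o in x)

def collate_device(b): return to_device(default_collate(b))

model = simple_cnn.to(def_device)
opt = optim.SGD(model.parameters(), lr=0.4)   # illustrative learning rate
# train_dl, valid_dl = get_dls(train_ds, valid_ds, bs=64, collate_fn=collate_device)
# fit(5, model, F.cross_entropy, opt, train_dl, valid_dl)
```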
00:38:21.980 | So if we've got a batch size of 64, and as we do one channel, 28 by 28. So then our axes
00:38:29.860 | are batch channel height, width. So normally, this is referred to as NCHW. So N, generally
00:38:38.400 | when you see N in a paper or whatever, in this way, it's referring to the batch size.
00:38:44.540 | N being the number, that's the mnemonic, the number of items in the batch. C is the number
00:38:50.620 | of channels, height by width, NCHW. TensorFlow doesn't use that, TensorFlow uses NHWC. So
00:39:02.180 | we generally call that channels last, since channels are at the end. And this one
00:39:10.120 | we normally call channels first. Now, of course, it's not actually channels first. It's actually
00:39:19.300 | channel second, but we ignore the batch bit. In some models, particularly some more modern
00:39:27.180 | models, it turns out the channels last is faster. So PyTorch has recently added support for
00:39:33.740 | channels last. And so you'll see that being used more and more as well.
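
In PyTorch, channels last is a memory format rather than a different logical shape; a quick sketch:

```python
import torch

x = torch.randn(64, 1, 28, 28)                          # logically still NCHW
x_cl = x.to(memory_format=torch.channels_last)          # stored NHWC in memory
x_cl.shape                                              # torch.Size([64, 1, 28, 28])
x_cl.is_contiguous(memory_format=torch.channels_last)   # True
```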
00:39:39.940 | All right, so a couple of comments and questions from our chat. The first is Sam Watkins pointing
00:39:50.380 | out that we've actually had a bit of a win here, which is that the number of parameters
00:39:56.380 | in our CNN is pretty small by comparison. So the number in the MLP version, the number
00:40:04.740 | of parameters is equal to basically the size of this matrix. So M times NH. Oh, plus the
00:40:21.540 | number in this, which will be NH times 10. And, you know, something that at some point
00:40:31.780 | we probably should do is actually create something that allows us to automatically calculate
00:40:38.420 | the number of parameters. And I'm ignoring the bias there, of course. Let's see what
00:40:56.820 | would be a good way to do that. Maybe np.prod. There we go. So what we could do
00:41:15.900 | is just calculate this automatically by doing a little list comprehension here.
00:41:24.360 | So there's the number of parameters across all of the different layers, so both bias
00:41:29.180 | and weights. And then we could, I guess, just, well, we could just use, well, let's use PyTorch.
00:41:38.320 | So we could turn that into a tensor and sum it up. Oops. So that's the number in our MLP.
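
That calculation, sketched in code (mlp_model here is just an illustrative name for the MLP from earlier; numel, mentioned just below, does the same job as np.prod on the shape):

```python
def n_params(model):
    # total count across every weight and bias tensor in the model
    return sum(o.numel() for o in model.parameters())

n_params(mlp_model), n_params(simple_cnn)   # roughly 40,000 vs 5,000
```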
00:41:49.900 | And then the number in our simple CNN. So that's pretty cool. We've gone down from 40,000
00:41:59.780 | to 5,000 and got about the same number there. Oh, thank you, Jonathan. Jonathan's reminding
00:42:07.420 | me that there's a better way than np.prod(o.shape), which is just to say o.numel,
00:42:16.260 | number of elements. Same thing. Very nice. Now, one person asked a very good question,
00:42:30.980 | which is I thought convolutional neural networks can handle any sized image. And actually, no,
00:42:40.960 | this convolutional network cannot handle any sized image. This convolutional neural network
00:42:45.780 | only handles images that, once they go through these stride-2 convs, end up with a one by
00:42:50.180 | one, because otherwise you can't flatten it and end up with 16 by 10.
00:42:59.020 | So we will learn how to create convnets that can handle any sized input. But there's nothing
00:43:06.680 | particularly about a convnet that necessitates that it can handle
00:43:10.700 | any sized input. Okay, so let's briefly finish this section off by talking about
00:43:22.840 | the idea of the receptive field. Consider this one input
00:43:29.860 | channel, four output channel, three by three kernel. Right. So that's just to show you
00:43:37.520 | what we're doing here. Conv one - well, actually, simple_cnn. This is the model
00:43:45.860 | we created. Remember, it was like a sequential model containing sequential models, because
00:43:49.260 | that's how our conv function worked. So simple_cnn zero is our first layer. It contains both
00:43:55.500 | the conv and the ReLU. So simple_cnn zero zero is the actual conv. So if we grab that, call
00:44:02.280 | it conv one. It's a four by one by three by three. So number of outputs, number of input
00:44:12.380 | channels and height by width of their kernel. And then it's got its bias as well. So that's
00:44:18.940 | how we could kind of deconstruct what's going on with our weight matrices or parameters inside
00:44:26.820 | a convolution. Now, I'm going to switch over to Excel. So in the lesson notes on the course
00:44:38.140 | website or on the forum, you'll find we've got an Excel. You'll see we've got an Excel
00:44:44.460 | workbook. Oh, someone reminded me that there is a nice trick we can do. I do want
00:44:49.500 | to do that actually because I love this trick. Oh, I just deleted everything though. Let's
00:44:56.740 | put them all back. Here we go. Which is you actually don't need square brackets. The square
00:45:00.460 | brackets is a list comprehension. Without the square brackets, it's called a generator
00:45:05.700 | and... oh, no, you can't use it there. Maybe that only
00:45:12.980 | works with NumPy. Ah, okay. So wait, that's the list. No, that doesn't work either. So
00:45:29.300 | much for that. I'm kind of curious now. Maybe torch.sum. Nope. Just sum. Oh, okay. I don't
00:45:55.420 | want to use Python sum. That's interesting. I feel like all of them should handle generators,
00:46:02.260 | but there you go. Okay. So open up the conv example spreadsheet and what you'll see on
00:46:17.260 | the conv-example worksheet page is something that looks a lot like the number seven. And
00:46:24.780 | this is the number seven that I got straight from MNIST. Okay. So you can see over here
00:46:34.620 | we have a number seven. This is a number seven from MNIST that I have copied into Excel.
00:46:41.460 | And then you can see over here we've got like a top edge kernel being applied and over here
00:46:46.340 | we've got a right edge kernel being applied. This might be surprising you because you might
00:46:51.100 | be thinking, wait a second, Jeremy, Microsoft Excel doesn't do convolutional neural networks.
00:46:57.500 | Well actually it does. So if I zoom in in Excel, you'll see actually these numbers are in fact
00:47:10.340 | conditional formatting applied to a bunch of spreadsheet cells. And so what I did was
00:47:15.660 | I copied the actual pixel values into Excel and then applied conditional formatting. And
00:47:21.700 | so now you can see what the digit is actually made of. So you can see here I've created
00:47:32.260 | our top edge filter and here I've created our left edge filter. And so here I am applying
00:47:44.980 | that filter to that window. And so here you can see it looks a lot like NumPy. It's just
00:47:55.260 | a sum product. And you might not be aware of this but in Excel you can actually do broadcasting.
00:48:06.700 | You have to hit Apple shift enter or control shift enter and it puts these little curly
00:48:12.100 | brackets around it. It's called an array formula. It basically lets you do broadcasting or simple
00:48:16.660 | broadcasting in Excel. And so here's how you could say this is how I created this top edge
00:48:22.500 | filtered version in Excel. And the left edge version is exactly the same just a different
00:48:29.380 | kernel. And as you can see if I click on it it's applying this filter to this input area
00:48:38.260 | and so forth. OK. So then I just arbitrarily picked some different values here. And so something
00:48:50.300 | to notice now in my second layer. So after conv one, here's conv two. It's got a bit more work
00:48:57.940 | to do. We actually need two filters because we need to add together this bit here applied
00:49:09.500 | to this with this kernel applied and this bit here with this kernel applied. So you
00:49:18.180 | actually need one set of three by three for each input. And also I want two separate
00:49:26.900 | outputs. So I actually end up needing a two by two by three by three weights matrix, or
00:49:37.340 | weights tensor I should say, which you might remember is exactly what we had in PyTorch.
00:49:41.780 | We had a rank four tensor. So if I have a look at this one you see exactly the same thing.
00:49:49.220 | This input is using this kernel applied to here and this kernel applied to here. So that's
00:49:56.180 | important to remember that you have these rank four tensors. And so then rather than
00:50:03.360 | doing a stride-2 conv, I did something else, which is actually a bit out of favor nowadays, but
00:50:09.900 | it's another option which is to do something called max pooling to reduce my dimensionality.
00:50:15.500 | So you can see here I've got 28 by 28. I've reduced it down here to 14 by 14. And the
00:50:21.420 | way I did it was simply to take the max of each little two by two area. OK. So that's
00:50:30.640 | all that's been done there. So that's called max pooling. And so a conv plus max pooling has the same
00:50:36.940 | effect as a stride-2 conv - not mathematically identical, but the same effect - in that it does a
00:50:41.640 | convolution and reduces the grid size by two on each dimension. OK. So then how do we create
00:50:50.420 | a single output if we don't keep doing this until we get to one by one which I'm too lazy
00:50:55.180 | to do in Excel. Well one approach and again this is a little bit out of favor as well
00:50:59.700 | but one approach we can do is we can take every one of these we've now got 14 by 14
00:51:07.340 | and apply a dense layer to it. And so what I've done here is I've got a big imagine this
00:51:15.580 | is basically all been flattened out into a vector. And so here we've got some product
00:51:24.740 | of this by this plus the sum product of this by this. And that gives us a single number.
00:51:34.940 | And so that is how we could then optimize that in order to optimize our weight matrices.
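
A PyTorch analogue of the spreadsheet, simplified to one conv layer: a conv, then max pooling, then a dense layer over the flattened activations. A sketch:

```python
import torch
import torch.nn as nn

excel_style = nn.Sequential(
    nn.Conv2d(1, 2, kernel_size=3, padding=1),  # two filters, as in the spreadsheet
    nn.ReLU(),
    nn.MaxPool2d(2),                            # 28x28 -> 14x14: max of each 2x2
    nn.Flatten(),
    nn.Linear(2 * 14 * 14, 1),                  # dense layer: one output number
)
excel_style(torch.randn(16, 1, 28, 28)).shape   # torch.Size([16, 1])
```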
00:51:42.820 | Now, you know, the more modern approach - we don't use this kind of dense layer
00:51:50.180 | much anymore; it still appears a bit. The main place that you see this used is in a network
00:51:58.940 | called VGG which is very old now. I thought it might be 2013 or something. But it's actually
00:52:05.180 | still used. And that's because for certain things like something called style transfer
00:52:11.940 | or in general perceptual losses people still find VGG seems to work better. So you still
00:52:20.180 | actually see this approach nowadays sometimes. The more common approach however nowadays
00:52:25.660 | is we take the penultimate layer and we just simply take the average of all of the activations.
00:52:35.100 | So nowadays, the Excel way of doing it would be literally simply to
00:52:40.340 | say average of the penultimate layer, and that is called global average pooling. Everything
00:52:52.420 | has to have a fancy phrase, but that's all it is: take the average. That's called
00:52:56.700 | global average pooling. Or you could take the max - that would
00:53:01.380 | be global max pooling. So anyway the main reason I wanted to show you this was to do
00:53:06.220 | something which I think is pretty interesting which is to take something in our zoom out
00:53:13.780 | a little bit here let's take something in our max pool here and I'm going to say trace
00:53:27.220 | precedence to show you here it is the area that it's coming from. OK. So it's coming
00:53:33.020 | from these four numbers. Now if I trace precedence again saying what's actually impacting this
00:53:40.780 | obviously the kernels impacting it and then you can see that the input area here is a
00:53:47.100 | bit bigger and then if I trace precedence again then you can see the input area is bigger
00:53:55.300 | still. So this number here is calculated from all of these numbers in the input. This area
00:54:06.180 | in the input is called the receptive field of this unit. And so the receptive field in
00:54:15.380 | this case is 1 2 3 4 5 6 by 6. Right. And that means that a pixel way up here in the
00:54:24.140 | top right has literally no ability to impact that activation. It's not part of its receptive
00:54:31.260 | field. If you have a whole bunch of stride to comms each time you have one the receptive
00:54:37.300 | field is going to get twice as big. So the receptive field at the end of a deep network
00:54:42.660 | is actually very large. But the the inputs closest to the middle of the receptive field
00:54:50.220 | have the biggest kind of say in the output because they they implicitly appear the most
00:54:57.420 | often in all of these kind of dot products that are inside this this this convolutional
00:55:04.220 | window. So the receptive field is not just like a single binary on off thing. Certainly
00:55:10.900 | all the stuff that's not got precedence here is not part of it at all. But the closer to
00:55:16.980 | the center of the receptive field the more impact it's going to have the more ability
00:55:21.340 | it's got to change this number. So the receptive field is a really important concept. And yeah
00:55:29.620 | fiddling and playing around with Excel's precedent arrows I think is a nice way to see that,
00:55:35.980 | at least in my opinion. And apart from anything else it's great fun creating a convolutional
00:55:42.520 | neural network in Excel. I thought so anyway. OK. So let's take a seven minute break. I'll
00:55:53.860 | see you back after that to talk about a convolutional auto encoder. All right. OK. Welcome back.
00:56:06.500 | We're going to have a look now at the auto encoder notebook. So we're just going to import
00:56:13.140 | all of our usual stuff and we've got one more of our own modules to import now as well.
00:56:22.060 | And this time we are going to switch to a different
00:56:29.540 | dataset, which is the Fashion-MNIST dataset. We can take advantage of the stuff that we
00:56:38.660 | did in 05 datasets and the Hugging Face stuff to load it. So we've seen this a little
00:56:46.900 | bit before back in our data sets one here and we never actually built any models with
00:56:57.700 | it. So let's first of all do that. So this is just going to convert each image
00:57:07.340 | into a tensor and it's going to be an in place transform. Remember we created this decorator
00:57:13.940 | and so we can call data set dictionary with transform. This is all stuff we've done before.
00:57:22.180 | And so here we have our example of a sneaker. All right. And we will create our collation
00:57:33.540 | function, collating the dictionary for that dataset. That's something you should
00:57:39.660 | remind yourself of: we built that ourselves in the datasets notebook. And let's actually
00:57:46.180 | make our collate function something that does to_device, which we wrote in our last
00:57:53.260 | notebook and we'll get a little data loaders function here which is going to go through
00:57:59.820 | each item in the data set dictionary and get a data loader for it and give us a dictionary
00:58:06.900 | of data loaders. OK. So OK. So now we've got a data loader for training and a data loader
00:58:19.700 | for validation. So we can grab the X and Y batch by just calling next on that iterator
00:58:28.860 | as we've done before. Let's look at each of these in turn. Actually we've
00:58:38.020 | done all this before but it's a couple of weeks ago. So just to remind you we can get
00:58:42.860 | the names of the features. And so we can then create an itemgetter for our y's and
00:58:52.340 | we can call that the label getter. We can apply that to our labels to get the titles
00:58:58.060 | of everything in our mini batch and we can then call our show images that we created
00:59:05.620 | with that mini batch with those titles. And here we have our fashion MNIST mini batch.
00:59:19.380 | OK. So let's create a classifier and we're just going to use exactly the same code copy
00:59:24.460 | and pasted from the previous notebook. So here is our sequential model. And we are going
00:59:41.340 | to grab the parameters of the CNN and the CNN I've actually moved it over to the device.
00:59:55.180 | The default device was what we created in our last notebook. And as you can see it's
00:59:58.460 | fitting. Now our first problem is it's going very slowly, which is kind of annoying. So
01:00:10.740 | why is it running pretty slowly? Let's have a look at our data set. So
01:00:18.820 | when it's finally finished let's take a look at an item from the data set. Actually let's
01:00:27.460 | look at the data set. Let's actually go all the way back to the data set dictionary. So
01:00:35.580 | before it gets transformed data set dictionary and let's grab the training part of that.
01:00:45.180 | And let's grab one item. And actually we can see here the problem. For MNIST, we had all
01:00:56.140 | of the data loaded into memory into a single big tensor. But this hugging face one is created
01:01:03.380 | in a much more kind of normal way which is each image is a totally separate PNG image.
01:01:09.320 | It's not all pre converted into a single thing. Why is that a problem. Well the reason it's
01:01:17.940 | a problem is that our data loader is spending all of its time decoding these PNGs. So if
01:01:32.580 | I train here. OK. So while I'm training I can type htop and you can see that basically
01:01:42.900 | my CPU is 100 percent used. Now that's weird because I've actually got 64 CPUs. Why is
01:01:49.260 | it using just one of them is the first problem. But why does it matter that it's using 100
01:01:54.100 | percent CPU. Well the reason it matters. Let's run it again so you can see. Why does it matter
01:02:01.940 | that our CPU is 100 percent. And why is it making it so slow. Well the reason why is
01:02:07.740 | if we look at nvidia-smi dmon, that will monitor our GPU's utilization. I've got three GPUs,
01:02:17.380 | so I set it to choose just the zeroth index one. And you'll see this column here, SM. This stands
01:02:23.400 | for streaming multiprocessor. It's like the equivalent of like CPU usage. And generally
01:02:28.660 | we're only using up one percent of our one GPU. So no wonder it's so slow. So the first
01:02:38.140 | thing we want to do then is try to make things faster. Now to make things faster we want
01:02:44.900 | to be using more than one CPU to decode our PNGs. And as it turns out that's actually
01:02:50.500 | pretty easy to do. You just have to add an extra argument to your data loaders. Which
01:03:08.900 | is here num underscore workers. And so I can say use eight CPUs for example. Now if I create
01:03:19.540 | our data loaders again and then try to get the next one. Oh now I've got an error.
01:03:25.620 | And the error is rather quirky. And what it's saying is, oh, you're now trying to use
01:03:34.980 | multiple processes, and generally in Python and PyTorch, using multiple processes, things
01:03:40.160 | get complicated. And one of the things that absolutely just doesn't work is you can't
01:03:45.980 | actually have your data loader put things onto the GPU in your separate processes.
01:03:56.960 | It just doesn't work. So the reason for this error is actually because of the fact that
01:04:05.140 | we used a collate function that put things on the device. That's incompatible unfortunately
01:04:12.220 | with using multiple workers. So that's a problem. And the answer to that problem
01:04:25.100 | sadly is that we would have to actually rewrite our fit function entirely. So there's annoying
01:04:37.300 | thing number one. And we don't want to be rewriting our fit function again and again.
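
The usual workaround is to collate on the CPU, so the workers can decode in parallel, and move each batch to the device inside the fit loop instead. Roughly this kind of change, as a sketch (assuming get_dls passes num_workers through to DataLoader; not the framework code we'll end up writing):

```python
# collate on the CPU (plain default_collate) so num_workers > 0 can do its job
train_dl, valid_dl = get_dls(train_ds, valid_ds, bs=64, num_workers=8)

for xb, yb in train_dl:
    xb, yb = to_device(xb), to_device(yb)   # move each batch in the loop instead
    loss = F.cross_entropy(model(xb), yb)
    loss.backward()
    opt.step()
    opt.zero_grad()
```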
01:04:41.900 | We want to have a single fit function. So OK so there's a problem that we're going to
01:04:47.260 | have to think about. Problem number two is that this is not very accurate. Eighty seven
01:04:57.100 | percent. Well, I mean, is it accurate? It's easy enough to find out. There's a really
01:05:01.260 | nice website called Papers with Code, and it will show you a little leaderboard, and we
01:05:16.860 | can see whether we're any good. And the answer is we're not very good at all. So these papers
01:05:24.180 | had ninety six percent ninety four percent ninety two percent. So yeah we're not looking
01:05:35.700 | great. So how do we improve that. There's a lot of things we could try but pretty much
01:05:45.540 | all of them are going to involve modifying our fit function again and in reasonably complicated
01:05:53.540 | ways. So we still got a bit of an issue there. Let's put that aside because what we actually
01:05:58.660 | wanted to do is create an auto encoder. So to remind you about what an auto encoder is
01:06:09.500 | and we're going to be able to go into a bit more detail now we're going to start with
01:06:13.580 | our input image which is going to be twenty eight by twenty eight. So it's the number
01:06:17.660 | three right. And it's a twenty eight by twenty eight and we're going to put it through, for
01:06:24.100 | example, a stride-2 conv, and that's going to have an output of a fourteen by fourteen,
01:06:37.380 | and we can have more channels, so say maybe two. So this is twenty eight by twenty eight
01:06:42.020 | by one, to fourteen by fourteen by two. So we've reduced the height and width
01:06:48.100 | by two but added an extra channel. So overall this is a two X decrease in parameters and
01:06:56.860 | then we could do another Stride 2 Conv and that would give us a seven by seven. And again
01:07:04.220 | we can choose however many channels we want but let's say we choose four. So now compared
01:07:09.140 | to our original we've now got a times four reduction. And so we could do that a few times
01:07:16.300 | or we could just stay there. And so this is compressing. And so then what we could do
01:07:27.300 | is then somehow have a convolution layer or group of layers which does a convolution and
01:07:36.220 | also increases the size. There is actually something called a transposed convolution
01:07:48.140 | which I'll leave you to look up if you're interested which can do that. Also known as
01:07:53.660 | (rather weirdly) a stride one half convolution. But there's actually a really simple way to
01:08:00.580 | do this. Let's say we've got a
01:08:06.100 | three by three grid of pixels. We could make that into
01:08:15.540 | a six by six very easily, which is: we could simply copy
01:08:30.340 | this pixel here into the first four, copy that pixel there into the next four, and so on,
01:08:39.140 | so we're simply turning each
01:08:45.540 | pixel into four pixels. And so this is called nearest neighbor upsampling. Now that's not
01:09:01.660 | a convolution, that's just copying. But what we could then do is apply a
01:09:07.700 | stride-1 convolution to that, right? And that would allow us to double the grid size with
01:09:17.860 | a convolution. And that's what we're going to do. So our autoencoder is going to need
01:09:23.260 | a deconvolutional layer, and that's going to contain two layers: an upsampling nearest neighbor
01:09:31.140 | with a scale factor of two, followed by a Conv2d with a stride of one. OK. And you can see for padding
01:09:40.100 | I just put kernel_size // 2. So that's a truncating division, because that
01:09:44.260 | always works for any odd sized kernel. As before we will have an optional activation
01:09:50.400 | function, and then we will create a Sequential using *layers. So that's going to pass
01:09:57.460 | in each layer as a separate argument, which is what Sequential expects. OK.
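A sketch of that deconvolutional layer, following the description just given (the helper name is illustrative):

```python
import torch.nn as nn

def deconv(ni, nf, ks=3, act=True):
    # Double the grid size by copying each pixel into a 2x2 block, then
    # apply a stride-1 convolution; padding of ks//2 preserves the height
    # and width for any odd kernel size.
    layers = [nn.UpsamplingNearest2d(scale_factor=2),
              nn.Conv2d(ni, nf, stride=1, kernel_size=ks, padding=ks//2)]
    if act: layers.append(nn.ReLU())
    return nn.Sequential(*layers)
```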
01:10:09.940 | So let's write a new fit function. I just basically copied it over from
01:10:17.620 | our previous one, going through each epoch, but I've pulled out eval into a separate
01:10:23.740 | function but it's basically doing the same thing. OK. So here is our autoencoder.
01:10:41.340 | And so it's a bit tricky, because I wanted to halve the grid size three times to get
01:10:51.300 | to a four by four by eight. But starting at twenty eight by twenty eight, you can't halve
01:10:58.700 | that three times and get an integer. So what I first do is zero pad, so add padding of
01:11:05.700 | two on each side, to get a 32 by 32 input. So if I then do a conv with a two channel output,
01:11:11.980 | that gives us 16 by 16 by 2, and then again to get an 8 by 8 by 4, and then again to get
01:11:17.900 | a 4 by 4 by 8. So this is doing an 8x compression. And then we can call deconv to do exactly
01:11:24.860 | the same thing in reverse, the final one with no activation. And then we can truncate
01:11:30.140 | those two pixels off the edge; slightly surprisingly, PyTorch lets you pass negative two to zero
01:11:36.100 | padding to crop off the final two pixels. And then we'll add a sigmoid, which will force
01:11:42.780 | everything to go between zero and one, which of course is what we need.
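Putting that together, the architecture just described looks roughly like this (assuming the stride-2 conv helper built earlier in the lesson and the deconv sketch above; channel counts as discussed):

```python
import torch.nn as nn

def autoencoder():                    # input: 1 x 28 x 28
    return nn.Sequential(
        nn.ZeroPad2d(2),              # pad to 1 x 32 x 32
        conv(1, 2),                   # stride-2 conv -> 2 x 16 x 16
        conv(2, 4),                   #               -> 4 x 8 x 8
        conv(4, 8),                   #               -> 8 x 4 x 4 (8x compression)
        deconv(8, 4),                 # upsample+conv -> 4 x 8 x 8
        deconv(4, 2),                 #               -> 2 x 16 x 16
        deconv(2, 1, act=False),      #               -> 1 x 32 x 32, no activation
        nn.ZeroPad2d(-2),             # negative padding crops back to 1 x 28 x 28
        nn.Sigmoid())                 # squash outputs into [0, 1] for MSE vs pixels
```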
01:11:48.580 | And then we will use MSE loss to compare those pixels to our input pixels. And so a big difference we've
01:11:56.620 | got here now is that our loss function is being applied to the output of the model and
01:12:02.820 | the input itself. Right? We don't have yb here; we have xb. So we're trying to recreate our original input,
01:12:17.060 | and again this is a bit annoying that we have to create our own fit function. Anyway so
01:12:23.900 | we can now see what the MSE loss is, and it's not going to be particularly human readable,
01:12:30.060 | but it's a number we can watch to see if it goes down. And so then
01:12:43.780 | we can do our SGD with the parameters of our autoencoder, with MSE loss, call that fit function
01:12:50.140 | we just wrote, and I won't wait for it to run. As you can see it's really slow, for reasons
01:13:02.780 | we've discussed; I've run it before. And what we want is to see that the original, which
01:13:12.020 | is here, gets recreated. And the answer is: oh, not really. I mean, roughly the same
01:13:30.340 | things but there's no point having an auto encoder which can't even recreate the originals.
01:13:39.000 | The idea would be that if these looked almost identical to these, we would say, wow,
01:13:43.740 | this is a fantastic network at compressing things by eight times. So I found this like
01:13:54.940 | very fiddly to try and get this to work at all. Something that I discovered can get it
01:13:59.580 | to start training is to start with a really low learning rate for a few epochs and then
01:14:06.120 | increase the learning rate after a few epochs. I mean at least it gets it to train and show
01:14:14.740 | something vaguely sensible. But let's see. Yeah it still looks pretty crummy. This one
01:14:23.680 | here I got by switching to Adam, and I actually removed the tricky bit; I removed
01:14:31.980 | these two as well. But yeah, I couldn't get this to recreate anything very reasonable
01:14:39.100 | in any reasonable amount of time. And you know, why is this not working very well? There are
01:14:47.420 | so many reasons it could be. You know, do we need a better optimizer? Do we need
01:14:52.620 | a better architecture? Do we need to use a variational autoencoder? You know, there's
01:14:58.740 | a thousand things we could try but you know doing it like this is going to drive us crazy.
01:15:06.060 | We need to be able to really rapidly try things and all kinds of different things. And so
01:15:12.620 | what I often see you know in projects or on Kaggle or whatever people's code looks kind
01:15:19.780 | of like this. It's all very manual, and then their iteration speed is too slow. We need
01:15:29.620 | to be able to really rapidly try things. So we're not going to keep doing stuff manually
01:15:34.480 | anymore. This is where we take a halt and we say OK let's build up a framework that
01:15:44.220 | we can use to rapidly try things and understand when things are working and when things aren't
01:15:50.780 | working. So we're going to start creating a learner. So what is a learner? Basically,
01:16:01.500 | the idea is: this learner is going to be something that we build which will allow
01:16:05.900 | us to try anything that we can imagine very quickly. And we will build, on top
01:16:12.820 | of that learner things that will allow us to introspect what's going on inside a model
01:16:17.420 | will allow us to do multiprocess CUDA to go fast. It will allow us to add things like
01:16:23.140 | data augmentation. It will allow us to try a wide variety of architectures quickly and
01:16:27.980 | so forth. So that's going to be the idea. And of course we're going to create it from
01:16:32.060 | scratch. So let's start with Fashion MNIST as before, and let's create a data
01:16:43.660 | loaders class, which is going to look a bit like what we had before, where we're just going
01:16:48.900 | to pass things in. This just couldn't be simpler, right? We're just going to pass in two data
01:16:55.900 | loaders and store them away. And I'm going to create a class method, "from dataset dictionary".
01:17:06.140 | And what that's going to do is it's going to call DataLoader on each of the dataset
01:17:11.660 | dictionary items with our batch size and instantiate our class. So if you
01:17:18.420 | haven't seen classmethod before, it's what allows us to say DataLoaders dot something
01:17:24.460 | in order to construct this. We could have put this in the constructor just as well, but we'll be building
01:17:29.780 | more complex data loaders things later, so I thought we might start by getting the basic
01:17:35.180 | structure right. So this is all pretty much the same as what we've had before.
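A sketch of that DataLoaders class (assuming a Hugging Face style dataset dictionary with a training and a validation split, as used earlier in the lesson; the method name is illustrative):

```python
from torch.utils.data import DataLoader

class DataLoaders:
    def __init__(self, train_dl, valid_dl):
        # couldn't be simpler: just store the two data loaders away
        self.train, self.valid = train_dl, valid_dl

    @classmethod
    def from_dd(cls, dd, batch_size):
        # Build one DataLoader per item of the dataset dictionary, then
        # instantiate the class, e.g. DataLoaders.from_dd(dsd, 1024).
        return cls(*[DataLoader(ds, batch_size) for ds in dd.values()])
```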
01:17:39.100 | I'm not doing anything on the device here, because as we know that didn't really work. OK. Oh,
01:17:51.500 | this is an old thing; I don't need .cuda() anymore. We're going to use to_device, which I think
01:17:59.180 | came from... here we go. So here's an example of a very simple learner that fits
01:18:15.380 | on one screen and this is basically going to replace our fit function. So a learner
01:18:21.300 | is going to be something that is going to train or learn a particular model using a
01:18:27.220 | particular set of data loaders a particular loss function some particular learning rate
01:18:34.060 | and some particular optimizer or some particular optimization function. Now normally I know
01:18:41.140 | most people would often kind of store each of these away separately by writing like self
01:18:45.700 | dot model equals model blah blah blah. Right. And as I think we've talked about before that's
01:18:52.260 | you know, that's a huge amount of boilerplate. It's more stuff that you can get wrong,
01:18:57.220 | and it's more stuff that you have to read to understand the code. And yeah, I don't
01:19:02.260 | like that kind of repetition. So instead we just call fastcore's store_attr to do that
01:19:07.620 | all in one line. OK. So that's basically the idea with a class is to think about what's
01:19:12.940 | the information it's going to need. So you pass that all to the constructor store it
01:19:16.980 | away. And then in our fit function we've got the basic stuff that we have for
01:19:31.340 | keeping track of accuracy. This will only work for classification, where we
01:19:37.220 | can use accuracy. We put the model on our device, create the optimizer, store how many epochs
01:19:48.740 | we're going through, then for each epoch we'll call the one_epoch function. And the one_epoch
01:19:54.940 | function is going to either do training or evaluation. So we pass in true if we're training
01:20:01.260 | and false if we're evaluating, and they're basically almost the same. We basically set
01:20:07.580 | the model to training mode or not. We then decide whether to use the validation set or
01:20:14.220 | the training set based on whether we're training. And then we go through each batch in the data
01:20:22.380 | loader and call one_batch, and one_batch is then the thing which is going to put our batch
01:20:30.260 | onto the device, call our model, call our loss function, and then, if we're training,
01:20:39.300 | do our backward pass, our optimizer step, and our zero grad, and then finally calculate
01:20:45.220 | our metrics or stats. And so here's where we calculate our metrics. So that's basically
01:20:51.780 | what we have there.
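In outline, that simple learner looks something like this (a sketch: the metric bookkeeping is elided, and the device helpers are reconstructed here rather than quoted from the lesson):

```python
import torch
from torch import optim
import fastcore.all as fc

# Pick the best available device, plus a helper to move (nested) tensors to it.
def_device = ('cuda' if torch.cuda.is_available() else
              'mps' if torch.backends.mps.is_available() else 'cpu')

def to_device(x, device=def_device):
    if isinstance(x, (list, tuple)): return type(x)(to_device(o, device) for o in x)
    return x.to(device)

class Learner:
    def __init__(self, model, dls, loss_func, lr, opt_func=optim.SGD):
        fc.store_attr()          # stores every constructor argument on self

    def one_batch(self, xb, yb):
        xb, yb = to_device(xb), to_device(yb)
        loss = self.loss_func(self.model(xb), yb)
        if self.model.training:  # only step the optimizer in training mode
            loss.backward()
            self.opt.step()
            self.opt.zero_grad()

    def one_epoch(self, train):
        self.model.train(train)  # training mode if True, eval mode if False
        dl = self.dls.train if train else self.dls.valid
        for xb, yb in dl: self.one_batch(xb, yb)

    def fit(self, n_epochs):
        self.model.to(def_device)
        self.opt = self.opt_func(self.model.parameters(), lr=self.lr)
        for _ in range(n_epochs):
            self.one_epoch(True)
            with torch.no_grad(): self.one_epoch(False)
```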
01:21:13.980 | So let's go back to using an MLP, we call fit, and away it goes. There was an error here, pointed out by Kevin (thank you): it should be self.model.to. One thing I guess
01:21:21.580 | we could try now is we think that maybe we can use more than one process. So let's try
01:21:31.540 | that. Oh, it's so fast I didn't even see it. There it goes. You can see all four CPUs being used
01:21:44.740 | at once. Bang. It's done. OK. So that's pretty great. Let's see how fast it looks here. Bump
01:21:52.300 | bump. All right. Lovely. OK. So that's a good sign. We've got a learner that can fit things
01:22:02.620 | but it's not very flexible. It's not going to help us for example with our autoencoder
01:22:09.820 | because there's no way to change which things are used
01:22:14.740 | for predicting or for calculating the loss. We can't use it for anything except things
01:22:18.780 | that involve accuracy with a multi-class classification.
01:22:30.580 | It's not flexible at all but it's a start. And so I wanted to basically put this all
01:22:34.140 | on one screen so you can see what the basic learner looks like. All right. So how do we
01:22:41.780 | do things other than multi-class accuracy? I decided to create a Metric class, and basically
01:22:55.460 | a Metric class is something where we are going to define subclasses of it that calculate
01:23:04.100 | particular metrics. So for example here I've got a subclass of a metric called accuracy.
01:23:10.300 | So if you haven't done subclasses before you can basically think of this as saying please
01:23:17.460 | copy and paste all the code from here into here for me but the bit that says def calc
01:23:25.500 | replace it with this version. So in fact this would be identical to copying and pasting
01:23:31.100 | this whole thing typing accuracy here and replacing the definition of calc with that.
01:23:43.140 | That's what is happening here when we do subclassing. So it's basically copying and pasting all
01:23:48.420 | that code in there for us. It's actually more powerful than that. There's more we can do
01:23:53.500 | with it. But in this case this is all that's happening with this subclassing, so I'll leave it at that. OK.
01:23:58.460 | So the accuracy metric is here, and then
01:24:07.900 | this is kind of our really basic Metric, which we're going to use just for loss. And
01:24:13.460 | so what happens is, let's for example create an accuracy metric object.
01:24:22.220 | We're basically going to add in mini-batches of data. Right? So for example here's a mini-batch
01:24:28.060 | of inputs and targets. Here's another mini-batch of inputs and targets. And
01:24:34.060 | then we're going to call dot value and it will calculate the accuracy. Now dot value
01:24:41.300 | is a neat little thing. It doesn't require parentheses after it because it's called a
01:24:45.060 | property. And so a property is something that just gets calculated automatically, without
01:24:51.620 | having to put parentheses. That's all a property is. Well, a property getter anyway. And so they
01:24:57.140 | look like this: you give it a name. And so each time we call add, we
01:25:05.100 | are going to be storing that input and that target, and also, optionally, the number of items in the
01:25:14.420 | mini-batch. For now that's just always going to be one. And you can see here
01:25:22.340 | that we then call calc, which is going to call the accuracy calc, to just see how
01:25:29.860 | often they're equal. And then we're going to append that calculation to the list of values.
01:25:43.180 | And we're also going to append to the list of ns, in this case just one. And so then
01:25:48.060 | to calculate the value we just do that. So that's all that's happening for accuracy.
01:25:55.460 | And then for loss we can just use Metric directly, because Metric will
01:26:00.740 | just calculate the average of whatever it's passed. So we can say: add the number zero
01:26:05.260 | point six. The target's optional. And we're saying this is a mini-batch of size 32, so
01:26:11.500 | that's going to be the n. And then add the value 0.9 with a mini-batch size of 2, and
01:26:17.940 | then get the value. And as you can see that's exactly the same as the weighted average of
01:26:23.860 | 0.6 and 0.9 with weights of 32 and 2. So we've created a metric class. And so that's something
01:26:31.480 | that we can use to create any metric we like just by overriding calc. Or we could create
01:26:39.980 | totally new things from scratch, as long as they have an add and a value. OK.
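Here's a sketch of the Metric class and its Accuracy subclass as just described (close in spirit to the lesson's code, but treat the details as illustrative):

```python
import torch

class Metric:
    def __init__(self): self.vals, self.ns = [], []
    def add(self, inp, targ=None, n=1):
        # store this mini-batch's calculated value and, optionally, its size
        self.vals.append(self.calc(inp, targ))
        self.ns.append(n)
    @property
    def value(self):
        # weighted average of the stored values, weighted by mini-batch size
        ns = torch.tensor(self.ns)
        return (torch.tensor(self.vals) * ns).sum() / ns.sum()
    def calc(self, inps, targs): return inps   # base Metric just averages inputs

class Accuracy(Metric):
    # subclassing "copies and pastes" Metric's code, replacing only calc
    def calc(self, inps, targs): return (inps == targs).float().mean()
```

And the two usages described above:

```python
acc = Accuracy()
acc.add(torch.tensor([0, 1, 2, 0, 1]), torch.tensor([0, 1, 1, 2, 1]))
acc.value                 # a property, so no parentheses needed

loss = Metric()
loss.add(0.6, n=32)       # targ is optional; n is the mini-batch size
loss.add(0.9, n=2)
loss.value                # weighted average: (0.6*32 + 0.9*2) / 34
```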
01:26:48.180 | So we're now going to change our learner. And what we're going to do is we're going to keep the same
01:26:56.500 | basic structure. So there's going to be fit. It's going to go through each epoch. It's
01:27:03.380 | going to call one_epoch, passing in true and false for training and validation. one_epoch
01:27:11.300 | is going to go through each batch in the data loader and call one_batch. one_batch is going
01:27:18.060 | to do the prediction, get the loss, and if it's training it's going to do the backward pass, step,
01:27:24.740 | and zero grad. But there's a few other things going on. So let's take a look. Actually let's
01:27:34.380 | just look at it in use first. So when we use it we're going to be creating a learner with
01:27:40.740 | the model data loaders loss function learning rate and some callbacks which we'll learn
01:27:45.300 | about in a moment. And we call fit and it's going to do our thing. And look we're going
01:27:48.940 | to have charts and stuff. All right so the basic idea is going to look very similar.
01:27:54.940 | So we're going to call fit. So when we construct it we're going to be passing in exactly the
01:28:00.540 | same things as before. But we've got one extra thing callbacks which we'll see in a moment.
01:28:06.700 | Store the attributes as before and we're going to be doing some stuff with the callbacks.
01:28:11.820 | So when we call fit for this number of epochs we're going to store away how many epochs
01:28:18.420 | we're going to do. We're also going to store away the actual range that we're going to loop
01:28:24.340 | through as self.epochs. So here's that looping through self.epochs. We're going to create
01:28:30.380 | the optimizer using the optimizer function and the parameters. And then we're going to
01:28:40.180 | call _fit.
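Schematically, fit and _fit look something like this (a partial sketch; with_callbacks is the decorator explained next, and the surrounding Learner details are as before):

```python
class Learner:
    # ... constructor, one_epoch, one_batch as before ...

    def fit(self, n_epochs):
        self.n_epochs = n_epochs
        self.epochs = range(n_epochs)    # stored so callbacks can see it
        self.opt = self.opt_func(self.model.parameters(), lr=self.lr)
        self._fit()                      # the decorated inner function

    @with_callbacks('fit')
    def _fit(self):
        for self.epoch in self.epochs:
            self.one_epoch(True)         # train
            self.one_epoch(False)        # validate
```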
01:28:44.460 | Now what on earth is _fit? Why didn't we just copy and paste this code into it directly? Why do this? It's because we've created this special decorator,
01:28:53.100 | with_callbacks. What does that do? So it's up here: with_callbacks is a class.
01:29:03.780 | It's going to just store one thing which is the name. In this case the name is fit. And
01:29:12.660 | what it's going to do is... now this is the decorator, right? So when we call it, remember, decorators
01:29:23.100 | get passed a function. So it's going to get passed this whole function, and that's going
01:29:29.420 | to be called f. So dunder call, remember, is what happens when an object
01:29:35.540 | is treated as if it's a function. So it's going to get passed this function. So this function
01:29:40.020 | is _fit. And so what we want to do is return a different function.
01:29:46.520 | It's going to of course call the function that we were asked to call using the arguments
01:29:53.060 | and keyword arguments we were asked to use. But before it calls that function it's going
01:29:59.180 | to call a special method called callback, passing in the string 'before_' plus the name, in this case
01:30:06.460 | before_fit. After it's completed, it's going to call that callback method again, passing the
01:30:13.660 | string after_fit. And it's going to wrap the whole thing in a try/except block, and it's going
01:30:21.220 | to be looking for an exception called CancelFitException. And if it gets one, it's not going to complain.
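Roughly, that decorator looks like this (a sketch, simplified to the fit case):

```python
class CancelFitException(Exception): pass

class with_callbacks:
    def __init__(self, nm): self.nm = nm          # e.g. 'fit'
    def __call__(self, f):                        # f is the decorated function
        def _f(o, *args, **kwargs):
            try:
                o.callback(f'before_{self.nm}')   # e.g. before_fit
                f(o, *args, **kwargs)             # the original function
                o.callback(f'after_{self.nm}')    # e.g. after_fit
            except CancelFitException: pass       # cancelling isn't an error
        return _f
```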
01:30:30.260 | So let me explain what's going on with all of those things. Let's look at an example of a callback.
01:30:34.660 | So for example here is a callback called
01:30:49.420 | DeviceCB, the device callback. And before_fit will be called automatically before that
01:30:56.060 | _fit method is called. And it's going to put the model onto our device, CUDA or MPS if we
01:31:06.380 | have one; otherwise it will just stay on the CPU.
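For instance, the device callback might be sketched like this (the Callback base class, def_device, and the exact calling convention, here assuming the learner passes itself to each callback method, are all assumptions about the surrounding framework):

```python
class Callback: order = 0    # callbacks can have an order; default it

class DeviceCB(Callback):
    def before_fit(self, learn):
        # runs automatically before _fit: move the model to CUDA/MPS if available
        learn.model.to(def_device)
```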
01:31:13.600 | So what's going to happen here? We're going to call fit. It's going to go through these lines of code. It's going
01:31:18.660 | to call _fit. But _fit is not the original function; _fit is the wrapped function,
01:31:26.900 | with f being the original. So it's going to call our learner's callback method, passing in
01:31:36.740 | 'before_fit'. And callback is defined here. What's callback going to do? It's going to
01:31:45.660 | be passed the string before_fit. It's going to then go through each of our callbacks,
01:31:54.620 | sorted based on their order. And you can see here our callbacks can have an order. And
01:32:01.900 | it's going to look at that callback and try to get an attribute called before_
01:32:09.740 | fit, and it will find one. And so then it's going to call that method. Now if that method
01:32:22.740 | doesn't exist, if it doesn't appear at all, then getattr will return this instead: identity.
01:32:30.020 | Identity is a function just here. All it does is, whatever arguments
01:32:37.780 | it gets passed it returns them. And if it's not passed any arguments it just returns.
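A sketch of identity, and of the callback method that uses it as a fallback (again, passing the learner to each callback method is an assumption):

```python
from operator import attrgetter

def identity(*args):
    # return whatever we were given: nothing -> None, one argument -> that
    # argument, several arguments -> a tuple of them
    if not args: return
    x, *rest = args
    return (x, *rest) if rest else x

class Learner:
    # ... rest of the class as before; self.cbs is the list of callbacks ...
    def callback(self, method_nm):
        # call method_nm on each callback, in order; callbacks that don't
        # define it fall back to identity, which harmlessly does nothing
        for cb in sorted(self.cbs, key=attrgetter('order')):
            getattr(cb, method_nm, identity)(self)
```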
01:32:47.420 | So there's a lot of Python going on here. And that is why we did that foundations lesson.
01:32:59.220 | And so for people who haven't done a lot of this Python there's going to be a lot of stuff
01:33:06.860 | to experiment with and learn about. And so do ask on the forums if any of these bits
01:33:18.180 | get confusing. But the best way to learn about these things is to open up this Jupyter notebook
01:33:23.740 | and try and create really simple versions of things. So for example let's try identity.
01:33:34.980 | How exactly does identity work? I can call it and it gets back nothing. I can call it with
01:33:43.660 | one; it gets back one. I can call it with 'a'; it gets back 'a'. I can call it with 'a' and one,
01:33:55.300 | and get back 'a' and one. And how is it doing that exactly? So remember we can
01:34:04.180 | add a breakpoint. And this would be a great time to really test your debugging skills.
01:34:12.220 | So remember in our debugger we can hit H to find out what the commands are. But you really
01:34:16.180 | should do a tutorial on the debugger if you're not familiar with it. And then we can step
01:34:19.980 | through each one. So I can now print args. And there's actually a trick which I like
01:34:27.820 | is that args is actually a command funnily enough which will just tell you the arguments
01:34:32.540 | to any function regardless of what they're called. Which is kind of nice. And so then
01:34:38.660 | we can step through by pressing N. And after this we can check like OK what is X now. And
01:34:48.820 | what is args now. Right. So remember to really experiment with these things. So anyway we're
01:35:01.660 | going to talk about this a lot more in the next lesson. But before that if you're not
01:35:13.460 | familiar with try/except blocks, you know, spend some time practicing them. If you're not familiar
01:35:19.380 | with decorators well we've seen them before. So go back and look at them again really carefully.
01:35:26.340 | If you're not familiar with the debugger practice with that. If you haven't spent much time
01:35:31.160 | with getattr, remind yourself about that. So try to get yourself really familiar and comfortable
01:35:39.620 | as much as possible with the pieces because if you're not comfortable with the pieces
01:35:44.100 | and the way we put the pieces together is going to be confusing. There's actually something
01:35:48.700 | in education, in kind of the theory of education, called cognitive load theory. And
01:35:54.620 | basically cognitive load theory says if you're trying to learn something but
01:36:01.660 | your cognitive load is really high because of lots of other things going on at the
01:36:05.740 | same time you're not going to learn it. So it's going to be hard for you to learn this
01:36:12.900 | framework that we're building if you have too much cognitive load of like what the hell
01:36:17.380 | is a decorator, or what the hell is getattr, or what does sorted do, or what's partial.
01:36:23.580 | You know all these things now I actually spent quite a bit of time trying to make this as
01:36:28.380 | simple as possible, but also as flexible as it needs to be for the rest of the course,
01:36:41.940 | and this is as simple as I could get it. So these are kind of things
01:36:41.940 | that you actually do have to learn. But in doing so you're going to be able to write
01:36:47.940 | some really you know powerful and general code yourself. So hopefully you'll find this
01:36:56.940 | a really valuable and mind expanding exercise in bringing high level software engineering
01:37:04.180 | skills to your data science work. OK. So with that this looks like a good place to leave
01:37:11.380 | it and look forward to seeing you next time. Bye.