Back to Index

Lesson 12: Deep Learning Foundations to Stable Diffusion


Chapters

0:00 Introduction
0:15 CLIP Interrogator & how it works
10:52 Matrix multiplication refresher
11:59 Einstein summation
18:34 Matrix multiplication put onto the GPU
33:31 Clustering (Meanshift)
37:05 Create Synthetic Centroids
41:47 Mean shift algorithm
47:37 Plotting Gaussian kernels
53:33 Calculating distances between points
57:42 Calculating distances between points (illustrated)
64:25 Getting the weights and weighted average of all the points
71:53 Matplotlib animations
75:34 Accelerating our work by putting it on the GPU
97:33 Calculus refresher

Transcript

Hi everybody, welcome back to lesson 12 of Practical Deep Learning for Coders. We've got a lot of stuff to cover today, so let's dive straight in. I thought I would start by sharing something which I've seen getting a lot of attention recently, which is the CLIP Interrogator.

So the CLIP Interrogator is a Hugging Face Spaces, I guess, Gradio app where I uploaded my image here, and it's output, let's just zoom in a bit, a text prompt for creating a CLIP embedding from, I guess. I've seen a lot of folks on Twitter and elsewhere on the internet saying that this produces the CLIP prompt that would generate this image.

And generally speaking the prompts it creates are rather rude. Mine's less rude than some, although, you know, "extremely long forehead", maybe not, thanks very much, but "your personal data avatar", "funny professional photo", I don't know what "tectonics" is meant to mean here, "without eyebrows". So this doesn't actually return the CLIP prompt that would generate this photo at all, and the fact that some people are saying it does makes me realize that some people have no idea what's going on with Stable Diffusion. So I thought we might take this as an opportunity to explain why we can't do that, and what we can try to do instead.

So let's imagine that my friend took a photo of himself and wanted to send it to me, and he thought he would compress it a whole lot. So what he did was put it through the CLIP image encoder. That's going to take this big image and turn it into an embedding, and the embedding is much, much smaller than the image; it's just a vector of floats.

So then my friend hopes that he can send me this embedding, and so he sends it over in an email and says: there you go Jeremy, there's a CLIP embedding of the photo I wanted to send you, so now you just have to decode it to turn it back into a picture.

So now I've got the embedding and I have to decode it. How would you do that? Well, you can't. Okay, we have a function here, let's call it f, which is the CLIP image encoder; it takes as input an image, which I'll call x, and returns an embedding.

Does that mean that there is some other function, an inverse function (inverse functions we normally write with a superscript minus one), with which I can take that embedding, let's call it y, pass in y, and get back our photo? And y, remember, is f of x, so to put it another way, that would be f inverse of f of x.

So an inverse function is something that undoes a function, and so that would give you back x. Is there an inverse function for the CLIP image encoder? Well, not everything has an inverse function. For example, consider a Python function like def f(x): return 0.

Can you invert that function? If you pass in 3 you get back 0. Is there a function that's going to take the output and give you back the input? No, of course not, because you just threw the whole thing away. So not all functions can be inverted, and indeed in this case we've started with an input which is, whatever, 512 by 512 by 3 say, and we've turned it into something much, much smaller.
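To make that concrete, here's a trivial, hypothetical example of a function that cannot be inverted:

```python
def f(x):
    return 0  # every input maps to the same output

# f(3) == f(7) == 0, so no function g can satisfy g(f(x)) == x for all x:
# the information about x has been thrown away.
```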

I can't remember exactly how big a CLIP image embedding is, but it's much smaller. So clearly we're losing something. But what I could do is put it through a diffusion process. Remember, a diffusion process is something where we have taught an algorithm to take some noise, so we could start with some noise, and we could also start with an image embedding.

We haven't done this before, but we could. We could train something that takes noise and an image embedding, removes a bit of the noise, and we could run that a bunch of times. It wouldn't give us back the original picture, but hopefully, since it's conditional, it would give us something back.

So remember, using the conditional diffusion approach we'd get back something that might be something like our original image. That's what diffusion is, right? Diffusion is something that takes an embedding and approximately inverts an encoder, to give you back something that hopefully might generate that embedding. Now of course, remember we don't actually get image embeddings when we write prompts in Stable Diffusion; instead we have text embeddings.

But if you remember, that actually doesn't matter, because remember how OpenAI trained CLIP: they had various pictures along with their captions, and they trained an algorithm that was explicitly designed so that each image returned an embedding that was similar to the embedding that the text encoder created for its caption.

And remember, all of the pairs that didn't match were trained to be different. So that means that a text embedding which describes this picture and the actual image embedding of this picture should be very similar, if they're CLIP embeddings. That's the definition of CLIP embeddings. So you see, this idea that you could take a text or image embedding and turn it back into an image perfectly makes no sense.

This is the very definition of the thing we're trying to do when we do CLIP. And because what we're basically trying to do is invert the embedding function, these kinds of problems are generally referred to as inverse problems. So Stable Diffusion is something that attempts to approximate the solution to an inverse problem.

So why does that mean that the CLIP Interrogator is not actually inverting the picture to give us back the text? Well, it's just as nonsensical: if we've got an image embedding, trying to undo it to get back to the picture and trying to undo it to get back to a suitable prompt are equally infeasible.

Both of them require inverting an encoder, and that inverse just doesn't exist. The best we know how to do at the moment is to approximate it using a diffusion process. OK, so that's why the texts it spits back are fun and interesting, but they are not something you can put back into Stable Diffusion and have it generate the same photo.

And the nice thing is that the code for this is actually available, and you can take a look at it; it's in the app. You'll see what it does is it has big hard-coded lists. Let's have a look at some examples: a big list of artists, a big list of mediums, a big list of movements, and so forth.

It's got all these hard-coded pieces of text, and what it does is basically mix and match those various things together to see which ones work well, and it combines that with the output of something called the BLIP language model, which is not designed to give you an exactly accurate description of an image, but has been specifically trained to give an OK-ish caption for an image, and it actually works reasonably well.
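The app's code isn't shown here, but the mix-and-match idea is roughly: embed the image once, embed each candidate phrase, and keep the candidates whose CLIP embeddings are most similar to the image's. Here is a rough sketch of that kind of ranking (the embed_text helper and candidate lists are hypothetical stand-ins, not the actual clip-interrogator code):

```python
import torch

def best_match(image_emb: torch.Tensor, candidates: list[str], embed_text) -> str:
    # Embed every candidate phrase, then rank by cosine similarity to the image embedding.
    text_embs = torch.stack([embed_text(c) for c in candidates])
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm()
    sims = text_embs @ image_emb          # one similarity score per candidate
    return candidates[int(sims.argmax())]

# e.g. pick the best artist, medium, movement, ... and append them to a BLIP caption:
# prompt = f"{blip_caption}, by {best_match(img_emb, artists, embed_text)}"
```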

But again, it's not the inverse of the CLIP encoder. So OK, that's how that all works. Now, where we had got to was that we had done matrix multiplication with broadcasting, where we had broadcast against the entire column dimension of the right-hand matrix all at once.

That allowed us to get it down to a point where we only have one for loop written in Python. And generally speaking we do not want to be looping through too many things in Python, because that's the slow bit. So the two inner loops we originally had, which, just to remind us, were here, looping through 10 and then 784 respectively, have been replaced with a single line of code.
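For reference, the broadcast version we ended up with last time looks roughly like this (a sketch, not copied verbatim from the notebook):

```python
import torch

def matmul_broadcast(a, b):
    ar, ac = a.shape
    br, bc = b.shape
    assert ac == br
    res = torch.zeros(ar, bc)
    for i in range(ar):            # the only remaining Python loop
        # Broadcast row i of `a` against all of `b`, then sum over the shared dimension.
        res[i] = (a[i, :, None] * b).sum(dim=0)
    return res
```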

So that was pretty great, and our time has now improved by about 5,000 times; we're 5,000 times faster than when we started out. Another trick that we can use, which I'm a big fan of, is something called Einstein summation. Einstein summation is a compact notation for representing products and sums.

And this is an example of an Einstein summation. And what we're going to do now is we're going to replicate our matrix product with an Einstein summation. And believe it or not the entire thing can be pushed down to just these characters which is pretty amazing. So let me explain what's happening here.

The arrow is separating the left hand side from the right hand side. The left hand side is the inputs. The right hand side is the output. The comma is between each input so there are two inputs. The letters are just names that you're giving to the number of rows and the number of columns.

So the first matrix we're multiplying has i rows and k columns. The second has k rows and j columns. It's going to go through a process which creates a new tensor, although actually this version is not yet doing the matrix multiplication; this is without the sum.

This one's going to create a new tensor that contains, well, how do we say it, i faces and k rows and j columns, so a rank-3 tensor. The number of letters in the output is going to be the rank. And the rule for how this works is that if you repeat letters between input arrays (so here my inputs are ik and kj, and we've got a repeated letter k) it means that values along those axes will be multiplied together.

So it means that each item across a row will be multiplied by each item down each column to create this i by k by j output tensor. To remind you, our first matrix is 5 by 784; that's m1. Our second matrix is 784 by 10.

That's m2. So i is 5, k is 784 and j is 10. So if I do this torch.einsum then I will end up with an i by k by j tensor; it'll be 5 by 784 by 10. And if you have a look, I've run it here on these two tensors m1 and m2, and the shape of the result is 5 by 784 by 10.

And what it contains is the original five rows of m1, the original ten columns of m2, and then for the other dimension, the 784, the values are all multiplied together, because that letter is repeated between the two arguments to the einsum. And so if we now sum that up over this dimension, we get back our matrix product.

So if we go back to the original matrix multiply, we had 10.94, negative 0.68, etc., and now with this Einstein summation version we've got back exactly the same thing. Because what it's done is it's taken each of these rows by columns, multiplied them together to get this 5 by 784 by 10 tensor, and then added up those 784 values for each one, which is exactly what matrix multiplication does.

But we're going to use the second of the two rules of Einstein summation, which says that if we omit a letter from the output, the bit on the right of the arrow, it means those values will be summed. So if we remove this k, ik and kj goes to ij, we've removed the k entirely, and that means the sum happens automatically.

So if we run this, as you see, we get back matrix multiplication again. Einstein summation notation, you know, takes some practice to get used to, but it's very convenient, and once you get used to it it's actually a really nice way of thinking about what's going on.
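To make that concrete, here's a small sketch with random tensors standing in for the notebook's m1 and m2:

```python
import torch

m1 = torch.randn(5, 784)      # i=5,   k=784
m2 = torch.randn(784, 10)     # k=784, j=10

# Keeping all three letters in the output gives the rank-3 tensor of products...
prods = torch.einsum('ik,kj->ikj', m1, m2)    # shape (5, 784, 10)
# ...and summing over k reproduces matrix multiplication.
assert torch.allclose(prods.sum(dim=1), m1 @ m2, atol=1e-3)

# Omitting k from the output does that sum for us: this is matmul.
mm = torch.einsum('ik,kj->ij', m1, m2)
assert torch.allclose(mm, m1 @ m2, atol=1e-3)
```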

And as we'll see in lots of examples, often you can really simplify your code by using just a tiny little Einstein summation. And it doesn't even have to be a sum, right? You don't have to omit any letters if you're just doing products. So maybe it's a bit misnamed.

So we can now define our matmul as simply this torch.einsum. So we now check with test_close that the original result is equal to this new matmul, and yes it is. And let's see how the speed looks: 15 milliseconds. OK, and that was for the whole thing.

So compared to 600 milliseconds, as you can see this is much faster than even the very fast broadcasting approach we used. So torch.einsum is a pretty good trick. OK, but of course we don't have to do any of those things, because PyTorch already knows how to do matmul.

So there are two ways we can run matmul directly in PyTorch. You can use this special @ operator, so x_train @ weights is the same as matmul(x_train, weights), as you see with test_close. Or you can say torch.matmul. And interestingly, as you can see here, the speed is about the same as the einsum.

So there's no particular harm, no particular reason not to use an einsum. When I say einsum, that stands for Einstein summation notation. All right, let's go faster still. Currently we're just using my CPU, but I have a GPU; it would be nice to use it. So how does a GPU work?
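As a tiny check of that equivalence (random tensors standing in for x_train and weights):

```python
import torch

x, w = torch.randn(5, 784), torch.randn(784, 10)
r1 = torch.einsum('ik,kj->ij', x, w)
r2 = x @ w                 # the @ operator
r3 = torch.matmul(x, w)    # the function form
assert torch.allclose(r1, r2, atol=1e-4) and torch.allclose(r2, r3, atol=1e-4)
```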

An NVIDIA GPU, and indeed pretty much all GPUs, work by doing lots and lots of things in parallel, and you have to tell the GPU what all the things are that you want done in parallel, one at a time. So what we're going to do is write in pure Python something that works like a GPU, except it won't actually run in parallel, so it won't be fast at all.

But the first thing we have to do, if we're going to get something working in parallel, is create a function that calculates just one thing, so that even if a thousand other things are happening at the same time it won't interact with anything else. And there's actually a very easy way to think about matrix multiplication in this way, which is: what if we create something which, just as we've done here, fills in a single item of the result?

So how do we create something that just fills in row 0, column 0? Well, what we could do is create a new matmul where we pass in the coordinates of the place that we want to fill in. We're going to start by passing it 0 comma 0.

We'll pass it the matrices we want to multiply, and we'll pass in a tensor that we've pre-filled with zeros to put the result into. So we say: okay, the result is torch.zeros of rows by columns; call matmul for location 0 comma 0, passing in those two matrices and the zeros matrix, ready to put the result in.

And if we call that, we get the answer in cell 0, 0. So here's an implementation of that. First of all, we've been passed the 0 comma 0 coordinates, so let's destructure them into i and j. Hopefully you've been experimenting with destructuring, because it's so important; you see it all the time.

That's the row and the column. Make sure that it's inside the bounds of our output matrix. And then we start at 0 and loop through all of the columns of A and all of the rows of B, for row i and column j.

Just like the very innermost loop of our very first Python attempt. And then at the end, pop that into the output. So here's something that fills in one piece of the grid successfully. We could call this rows by columns times, each time passing in a different grid location, and we could do that in parallel, because none of those locations interact with any other location.

So something which can calculate a little piece of an output on a GPU is called a kernel; we'd call this a kernel. And so now we can create something called launch_kernel. We pass it the kernel, so that's the function; here's an example of launch_kernel passing in the function.

And how many rows and how many columns there are in the output grid, and then any arguments that you need to calculate it. In Python, *args just says that any additional arguments you pass are going to be put into a tuple called args. If you use something like C you might have seen variadic parameters.

It's the same basic idea. So we're going to call launch_kernel. We're going to say: launch the kernel matmul using all the rows of A and all the columns of B, and then the args, which are going to be in *args, are m1 the first matrix, m2 the second matrix, and res, another torch.zeros we just created.

So launch_kernel is going to loop through the rows of A, and then for each row of A it'll loop through the columns of B and call the kernel, which is matmul, on that grid location, passing in m1, m2 and res. So *args here is going to unpack that and pass them as three separate arguments.
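Putting that together, here's a sketch of the pure-Python "kernel" and launcher (names and shapes here are illustrative, but it mirrors what the notebook does):

```python
import torch

def matmul_kernel(grid, a, b, out):
    # Fill in a single cell (i, j) of the output; it touches nothing else,
    # so many of these could in principle run at the same time.
    i, j = grid
    if i < out.shape[0] and j < out.shape[1]:
        tmp = 0.
        for k in range(a.shape[1]):
            tmp += a[i, k] * b[k, j]
        out[i, j] = tmp

def launch_kernel(kernel, grid_x, grid_y, *args):
    # Serial stand-in for a real GPU launch: call the kernel once per grid cell.
    for i in range(grid_x):
        for j in range(grid_y):
            kernel((i, j), *args)

m1, m2 = torch.randn(5, 784), torch.randn(784, 10)
res = torch.zeros(5, 10)
launch_kernel(matmul_kernel, res.shape[0], res.shape[1], m1, m2, res)
assert torch.allclose(res, m1 @ m2, atol=1e-3)
```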

And if I run all of that, you'll see it's done it; it's filled in the exact same matrix. OK, so that's actually not fast at all, it's not doing anything in parallel, but it's the basic idea. So now, to actually do it in parallel, we have to use something called CUDA.

So CUDA is a programming model for NVIDIA GPUs, and to program in CUDA from Python the easiest way currently is with something called Numba. Numba is a compiler; you've actually seen it already for non-GPU work. It's a compiler that takes Python code and spits out, you know, compiled, fast machine code.

If you use its CUDA module it'll actually spit out GPU-accelerated CUDA code. So rather than using @njit like before, we now say @cuda.jit, and it behaves a little bit differently. But you'll see that this matmul, let me copy the other one over so you can compare it to our Python one.

Now, the Python matmul and this @cuda.jit matmul look, I think, identical except for one thing: instead of passing in the grid, there's a special magic thing called cuda.grid where you say how many dimensions your grid has, and you unpack it. That's just a little convenience that Numba does for you.

You don't have to pass the grid over, it passes it for you, so it doesn't need this grid argument. Other than that these two are identical, but the decorator is going to compile this one into GPU code. So now we need to create our output tensor just like before, and we need to do something else, which is we have to take our input matrices and our output tensor and move them to the GPU.

I should say copy them to the GPU: cuda.to_device copies a tensor to the GPU, and so we've got three things getting copied to the GPU here, and therefore we store the three results over here. Another way I could have written this, which I rather like, is to map the function cuda.to_device over each of these arguments, and this would be the same thing.

This is going to call cuda.to_device on x_train and put it here, on weights and put it here, and on r and put it here. That's a slightly more convenient way to do it. Okay, so we've got our 50,000 by 10 output, which is just all zeros of course, that's just how we created it, and now we're going to try to fill it in.

There's a particular detail that you don't have to worry about too much, which is that in CUDA they don't just have a grid, there's also a concept of blocks, and there's something we call here TPB, which is threads per block. This is just a detail of the CUDA programming model; you can basically copy this, and what it's going to do is call each grid item in parallel, with a number of different processes basically.

So this is just the code which turns the grid into blocks, and you don't have to worry too much about the details of that, you just always run it. Okay, so now how do you call the equivalent of launch_kernel? Well, it's a slightly weird way to do it, but it works fine: you call matmul, but because matmul has @cuda.jit it's got a special thing, which is you have to put something in square brackets afterwards, telling it how many blocks per grid, that's just the result from the previous cell, and how many threads per block in each of the two dimensions.

So again, you can just copy and paste this from my version, but then you pass in the three arguments to the function, which will be A, B and C, and this is how you launch a kernel; this will launch the kernel matmul on the GPU.

At the end of it, rg is going to get filled in, but it's on the GPU, which is not much good to us, so we now have to copy it back to the CPU, which is called the host, with copy_to_host. We run that, it's done, and test_close shows us that our result is similar to our original results, so it seems to be working; that's great.
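Put together, the Numba version looks roughly like this (shapes, block size and variable names are illustrative, and it needs an NVIDIA GPU to run):

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def matmul_kernel(a, b, out):
    i, j = cuda.grid(2)                    # this thread's (row, col) in the overall grid
    if i < out.shape[0] and j < out.shape[1]:
        tmp = 0.
        for k in range(a.shape[1]):
            tmp += a[i, k] * b[k, j]
        out[i, j] = tmp

a = np.random.randn(5000, 784).astype(np.float32)
b = np.random.randn(784, 10).astype(np.float32)
out = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)

# Copy inputs and output to the GPU, then work out how many blocks cover the grid.
a_g, b_g, out_g = map(cuda.to_device, (a, b, out))
tpb = (16, 16)                                         # threads per block
bpg = (math.ceil(out.shape[0] / tpb[0]),               # blocks per grid
       math.ceil(out.shape[1] / tpb[1]))
matmul_kernel[bpg, tpb](a_g, b_g, out_g)               # launch the kernel
res = out_g.copy_to_host()                             # copy the result back to the CPU
```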

So I see Siva on the YouTube chat is finding that it's not working on his Mac; that's right, this will only work on an NVIDIA GPU, as nearly all the GPU stuff we look at only works on NVIDIA GPUs. Mac GPUs are gradually starting to get a little bit of support from machine learning libraries, but it's taking quite a while and it's got quite a way to go, at least as I say this towards the end of 2022; if this works for you later on, that's great.

Okay so let's time how fast that is. Okay so that was three point six one milliseconds and so if we compare that to the PyTorch matmul on CPU that was 15 milliseconds so that's great so it's faster still so how much faster. Oh by the way we can actually go faster than that which is we can use the exact same code we had from the PyTorch op but here's a trick if you just take your tensor and write dot CUDA after it it copies it over to the GPU if it's on a if it's on a Nvidia GPU do the same for weights dot CUDA so these are our two CUDA versions and now I can do the whole thing and this will actually run on the GPU and then to copy it back to the host you just say dot CPU so if we look to see how fast that is 458 microseconds so oh that is somebody just pointed out that I wrote the wrong thing here one e neg three okay so how much faster is that well 458 microseconds our original on the whole data set was 663 microseconds so compared to our broadcast version we are another thousand times faster so overall this version here compared to our original version which was here the difference in performance is 5 million X so when you see people say yeah Python can be pretty slow can be better to run stuff on the GPU if possible we're not talking about a 20% change we're talking about a 5 million X change so that's a big deal and so that's why you need to be running stuff on the GPU all right some folks on YouTube are wondering how on earth I'm running CUDA when I'm on a Mac and given it says localhost here that's because I'm using something called SSH tunneling which we might get to sometime I suspect my live coding from the previous course might have covered that already but this is basically you can use a Jupyter notebook that's running anywhere in the world from your own machine using something called SSH tunneling which is a good thing to look up okay one person asks if Einstein summation borrows anything from APL yes it does actually so it's kind of the other way around actually APL borrows it from Einstein notation so I don't know if you remember I mentioned that when Iverson when he developed APL was heavily influenced by tensor analysis and so this Einstein notation is very heavily used there if you'll notice a key thing that happens in Einstein notation is there's no loop you know there isn't this kind of Sigma you know I from here to here and then you put the I inside the function that you're summing up everything's implicit and APL takes that a very long way and and J takes it even further which is what can Iverson developed after APL and this kind of general idea of removing the index is very important in APL and it's become very important in numpy pytorch tensor flow and so forth so finally we know how to multiply matrices congratulations so let's practice that let's practice what we've learned so we're going to go to zero to main shift to practice this and so we're going to try to exercise our kind of tensor manipulation operation muscles in this section and the key actually endpoint for this is the homework and so what you need to be doing is getting yourself to a point that you can implement something like this but for a different algorithm why do we care about this because this is like learning your times table your times tables if you're doing you know mathematics it's this kind of like thing that's going to come up all the time and if you're not good at your times tables everything else a lot more a lot of other things particularly at primary school in high school you know they 
they get difficult you get slower and it's frustrating and you spend time thinking about these mechanical operations rather than getting your work done it is it's important that when you have an idea about something you want to try or debug or profile or whatever that you can quickly translate that into working code and the way that code is written for GPUs or even for fast running on CPUs is using broadcasting Einstein notation matrix modifications and so forth so you've got to you've got to got to got to practice super important so we're going to practice it by running by developing a clustering algorithm and the clustering algorithm we're going to work on is something called mean shift clustering which hopefully you've never heard of before and I say that because I just think it's a really funny algorithm that not many people have come across excuse me and I think you'll find it really useful so what is cluster analysis cluster analysis is very different to anything that we've worked on in this course so far and that there isn't a dependent variable that we're trying to match but instead we're just trying to find are there groups of similar things in this data and those groups we call clusters and as you can see from the wiki page there's all kinds of applications of cluster analysis across many different areas I will say that sometimes cluster analysis can be overused or misused it's really best for when your various columns are the same kind of thing and have the same kind of scale for example pixels are all the same kind of thing they're all pixels so one of the examples they use is market research so I wouldn't use cluster analysis for socio demographic inputs because they're all different kinds of things but the example they give here makes a lot of sense which is looking at data from surveys if you've got a whole bunch of like from one to five answers on surveys alright so let's take a look at this and the way I like to build my algorithms is to create some often to create some synthetic data that I know how I want it to behave and so we're going to create six clusters and each cluster is going to have 750 samples in it so first of all I'm going to randomly create six centroids and so the centroid is going to be like the middle of where my clusters are so I'm going to randomly create them I need to end clusters by two so I need an X and a Y coordinate for each one and so I'm now going to randomly generate data around those six centroids okay so to do that I'm going to call a little function I made here called sample and I'm going to run it on each of those six centroids and so I'll show you what that looks like so here's what that data looks like so the X's are the six centroids and the colored dots is the data so if you were given this data without the X's the idea would be to come back with figuring out where the X's would have been like where are the where are these clustering around and so if you can get clusters that that's that's the goal here is to find out that there's a few discreetly distinctly different types of data in your data set so for example for images I've used this before to discover that there are some images that look completely different to all the other ones for example they were taken at nighttime or they're of a different object or something like that so how does sample work well we're passing in the centroid and so what we want is we're going to get back so each of those centroids contains an X and a Y so multivariate normal is just like normal it's 
going to give you back normally distributed data but more than one item that's why it's multivariate and so we passed in two means a mean for X and a mean for our Y and so that's the mean that we're going to get and our standard deviation is going to be 5 why do we use torch.diag 5,5 that's because we're saying that's because that for multivariate normal distributions there's not just one standard deviation for each column that you get back there could also be a connection between columns the columns might not be independent so you actually need so it's called a covariance matrix not just to make not just a variance we discussed that a little bit more in lesson 9b if you're interested in learning more about that okay so this is something that's going to give us back random columns of data with this mean and this standard deviation and this is the number of samples that we want and this is coming from PyTorch so PyTorch has a whole bunch of different distributions that you can use which can be very handy so there's our data okay so remember for clustering we we don't know the different colors and we don't know where the X's are that's kind of our job is to figure that out we might just briefly also look at how to plot so in this case we want to plot the X's and we want to plot the data so it looks like this so all I do is I loop through each centroid and I grab that centroid samples and they're just all done in order so I grab it from i times n samples up to i plus 1 times n samples and then I create a scatterplot with the samples on them and what I've done is I've created an axis here and you'll see why later that we can also pass one in but I'm not passing one in so you create a plot and an axis and so in that plotlib you can keep plotting things on the same axis so then I plot on the centroid a big X which is black and then I a smaller X which is what is that magenta and so that's how I get these X's so that's how plot data works okay so how do we create something now that starts with all the dots and returns where the X's are we're going to use a particular algorithm particular clustering algorithm called mean shift and mean shift is a nice clustering approach because you don't have to say how many clusters there are so it's not that often that you actually got to know how many clusters there are so we don't have to say quite a few things like the very popular K means require you to say how many instead we just have to pass them in called a bandwidth which we'll learn about which can actually be chosen automatically and it can also handle clusters of any shape so they don't have to be ball shaped like they were they are here they can be kind of like L shaped or lips shaped or whatever and so what here's what's going to happen we're going to pick some point so let's say we pick that point just there okay and so what we now do is we go through each data point so we'll pick the first one and so we then find the distance between that point and every other point okay so we're going to have to say what is the distance between that point and that point and that point and that point and that point and also the ones further away that point and that point and you do it for every single point compared to the one that we're currently looking at okay so we get all of those as a big list and now what we're going to do is we're going to take a weighted average of all of those points now that's not interesting without the weighting if we just take our average of all of the points and how far away they 
are we're going to end up somewhere here right this is the average of all the points but the key is that we're going to take an average and just find the right spot the key is we need to find an average that is weighted by how far away things are so for example this one over here is a very long way away from our point of interest and so it should have a very low weight in the weighted average whereas this point here which is very close should have a very high weight in our weighted average so what we do is we create weights for every point compared to the one that we're currently interested in using a what's called a Gaussian kernel that we'll look at but the key thing to know is that points that are further away from our point of interest which is this one are going to have lower weights that's what we mean there they're penalized the rate at which weights four to zero is determined by this thing that we set at the start called the bandwidth and that's going to be the standard deviation of our Gaussian so we take an average of all the points in the data set a weighted average weighted by how far away they are so for our point of interest right the this point is going to get a big weight this point is going to get a big weight this point is going to get a big weight that point is going to get a tiny weight that points going to get an even tinier weight so it's mainly going to be a weighted average of these points that are nearby and the weighted average of those points I would guess is going to be somewhere around about here right and would have a similar thing for the weighted average of the points near this one that's going to probably be somewhere around about here or maybe over here and so it's going to move all of these points in closer it's almost like a gravity right they're kind of going to be moved like closer and closer in towards this kind of gravitational center and then these ones will go towards their own gravitational center and so forth okay so let's take a look at it all right so what's the Gaussian kernel this is the Gaussian kernel which was a sign in the original March for Science back in the days when the idea of not following scientists was considered socially unacceptable we used to have much for these things if you remember so this is this is not normal so this is the definition of the Gaussian kernel which is also known as the normal distribution this is the shape of it sure you've seen it before and here is that formula copied directly off the science March sign okay here we go see the square root 2 pi etc okay and this here is the standard deviation now what does that look like it's very helpful to have something that we can very quickly plot any function that doesn't come with matplotlib but it's very easy to write one just say oh let's as X let's use all the numbers from 0 to 10 a hundred of them spaced evenly that's what lens base does in linearly spaced 100 numbers in this range that's going to be our X's so plot those X's and plot f of X is the wise so here's a very nice little plot fuck we want and here it is and as you can see here we've now got something where if you are this like very close to the point of interest you're going to get a very high weight and if you're a long way away from the point of interest you'll get a very low weight so that's the key thing that we wanted to remember is something that penalizes further away points more now you'll notice here I managed to plot this function for a bandwidth of 2.5 and the way I did that was using this 
special thing from funk tools called partial now the first thing to point out here is that very often drives me crazy I see people trying to find out what something is in Jupiter and the way they do it is they'll scroll up to the top of the notebook and search through the imports and try to find it that is the dumb way to do it the smart way to do it is just to type it and press shift enter and it'll tell you where it comes from and you can get its help with question mark and you can get its source code with two question marks okay so just type it to find out where it comes from okay so this is as save as mentioned in the chat also known as carrying or partial function application this creates a new function so let's just grab it we create a new function and this function f is is the function Gaussian but it's going to automatically pass BW equals 2.5 this is a partially applied function so I could type f of 4 for example that's going to be a tensor there we go and you can see that's exactly what this is go up to 4 go across yep about 0.44 so we use partial function application all the time it's a very very very important tool without it for example plotting this function would have been more complicated with it it was trivially easy I guess the alternative like one alternative which would be fine but slightly more clunky would be we could create a little function in line so we could have said oh plot a function that I'm going to define right now which is called lamb which is lambda X which is Gaussian of X with a bandwidth of 0.2 0.5 you could do that too you know it's it's fine but but yeah partials I think are a bit neater a bit less to think about they often produce some neater and clearer code okay why did we decide to make the bandwidth 2.5 as a rule of thumb choose a band width which covers about a third of the data so if we kind of found ourselves somewhere over here right a band which which covers about a third of the data would be enough to cover two clusters ish so it would be kind of like this big so somewhere in the middle there so that's the basic idea yeah so but you can play around with bandwidths and get different amounts of clusters I should mention like often when you see something that's kind of on the complicated side like a Gaussian you can often simplify things I think most implementations and write-ups I've seen talk about using Gaussians but if you look at the shape of it it looks a lot like this shape so this is a triangular weighting which is just using clamp min so it's just using a linear with clamp min and yeah it occurred to me that we could probably use this just as well so I did find it I decided to define this triangular weighting and then we can try both anyway so we'll start with we're going to use the Gaussian version all right so we're going to be move literally moving all the points towards their kind of center of gravity so we don't want to mess up our original data so we clone it that's a pie torch thing is dot clone it's very handy and so big X is our matrix of data I mean it's actually a that's right matrix of data yeah and then little X will be our first point and it's pretty common to use big X capital letters for matrices so this is our data this is the first point okay so there it is we've got to start at twenty six point two twenty six point three so twenty six point two twenty six point three so somewhere up here so little X its shape is just it's a rank one tensor of shape two big X is a rank two tensor of 1500 data points by two the X and 
Y and if we call X none that would add a unit access to that and the reason I'm going to show you that is because we want to find the distance from little X to everything in big X and the way we do a distance is with minus but you wouldn't be able to go you wouldn't be able to go X minus big X and get the right actually do you get the right answer let's think about that X dot shape oh you've got that already oh no actually that is going to work isn't it so yes all right so you can see why we've got these two versions here if we do X none we've got something of shape 1 comma 2 now we can subtract that from something of shape 1500 comma 2 because the twos match up because they're the same and the 1500 and the 1 matches up because you remember our numpy rules everything matches up to a unit axis so it's going to copy this matrix across every row of this matrix and it works but do you remember there's a special trick which is if you've got two shapes of different lengths we can use the shorter length and it's going to add unit axes to the front to make it as long as necessary so we actually don't need the X none we can just use little X and it works because it's going to say is this compatible with this well the last axis remember we go right to left the last axis matches the second last axis oh it doesn't exist so we pretend that there's a unit axis and so it's going to do exactly the same thing as this so if you have not studied the broadcasting from last week carefully that might not have made a lot of sense to you and so definitely at this point you might want to pause the video and go back and reread the numpy broadcasting rules and last time and practice them because that's what we just did we use numpy broadcasting rules and we're going to be doing this dozens more times throughout the rest of the course and many more times in fact in this lesson okay so now i think it's a pretty good place to have a pause so i'll see you back here in nine minutes hi everybody welcome back so we had got to the point where we had managed to get the distance between our first point x and all of the other points in the data and so we're just looking at the first eight of them here so the very first distance is of course zero on the x axis and zero on the y axis because it is the first point the other thing is that because we the way we created the clusters is they're all kind of next to each other in the list so these are all in the first clusters so none of them are too far away from each other so now that we've got all the distances it's easy enough to well not that the distances on x and y it's easy enough to get the distance the kind of Euclidean distance so we can just square that that difference and sum and square root and actually maybe this is a good time to talk about norms and to talk about what we just did there we've got all these data points um so here's one of our data points and here's the other one of our data points and there's some um you know distance across the x axis and there's some distance along the y axis so we could call that change in x and change in y and one way to think about this distance then is it's this distance here um so to calculate that we can use Pythagoras so a squared plus b squared equals c squared or in our case so this would be c a and b say so in our case it would be the square root of the change in x squared plus the change in y squared and rather than saying square root we could say to the power of a half another way of saying the same thing but there's a 
different way we could find the distance we could first go along here and then go up here and so that one would be change in x if you like to the one plus change in y to the one to the power of one oneth i'm writing it a slightly odd way for reasons you'll see in a moment it's just this otherwise um in general if we've got a whole list of numbers we can add them up let's say there are some list v we can add them up we can do each one to the power of some number alpha and take that sum to the one over alpha and this thing here is called a norm so you might remember we came across that last week and we come across it again this week they basically come up i don't know they might end up coming up every week they come up all the time particularly because the two norm which we could write like this or we could write like this or we could write like this they're all the two norm this is just saying it's this equation for alpha equals two and stefano's pointing out we should actually have an absolute value i'm not going to worry about that we're just doing real numbers here so we'll keep things simple oh well i guess for a higher than one no you're probably right for something like three yeah i guess we do need an absolute value there that's a good point because okay we could have this one and so the distance actually has to be the absolute value so the change in x is the absolute value of that distance uh yes thank you stefano okay so we'll have the absolute value okay so the two norm is what happens when alpha equals two and we would call this in this case we would call this the euclidean distance but actually where it comes up more often is when you're doing like a loss function so the mean squared error is just uh well the root mean squared error i should say is just the two norm whereas the mean absolute error is the one norm and these are also known as l2 and l1 plus and remember what we saw in that paper last week we saw it in this form there's a two up here which is where they got rid of the square root again so that would have just been change in x squared plus change in y squared and now we don't even need the parentheses oopsie dozy okay okay so all of this is to say that for you know this comes up all the time because we're very very often interested in distances and errors and things like that um i'm trying to think i don't feel like i've ever seen anything other than one or two so although it is a general concept i don't think we're going to see probably things other than one or two in this course i'd be excited if we do that would be kind of cool so here we're taking the euclidean distance which is the two norm so this has got eight things in it because we've summed it over dimension one so here's your first homework is to rewrite using torch.insum you won't be able to get rid of the x minus x you'll still need to have that in there but when you've got a multiply followed by a sum now you won't be able to get rid of the square root either you should be able to get rid of the multiply and the sum by doing it in a single torch.insum so we're summing up over the first dimension which is this dimension so in other words we're summing up the x and the y axes okay so now we can get the uh the weights by passing those distances into our gaussian and so as we would expect the biggest weights it gets up to point one six so the closest one is itself it's going to be a big weight these other ones get reasonable weights and the ones that are in totally different clusters have weights small 
enough that at three significant figures they appear to be zero okay so we've got our weights so there are the weights are a 1500 long vector and of course the original data is 1500 by two the x and the y for each one so we now want a weighted average we want this data we want its average weighted by this so normally an average is the um is the sum of your data divided by account that's a normal average a weighted average each item in your data let's put some eyes around here just to be more clear each item in your data is going to have a different weight and so you multiply each one by the weights and so rather than dividing by n which is just the sum of ones we would divide by the sum of weights so this is an important concept to be familiar with weighted averages so we need to multiply every one of these x's by this okay so can we say weight times x no all right why didn't that work so remember we go right to left so first of all it's going to say let's look at the two and multiply that by the 15 are they compatible things are compatible if they're equal or if at least one of them is one these are not equal and they're not one so they're not compatible that's why it says the size of a tensor a must match now when it says match it doesn't mean they have to be the same one of them can be one okay that's what it means to match they're either equal or one of them is one so that doesn't work on the other hand what if this was 1500 comma one if it was 1500 comma one then they would match because the one and the two match because one of them is a unit axis and the 1500 and the 1500 match because they're the same so that's what we're going to do because that would then copy this to every one of these which is what we want we want weights for each of these x y tuples so to add the trailing unit axis we say every row and a trailing unit axis so that's what that shape looks like so we can now multiply that by x and as you can see it's now weighting each of them and so each of these x's and y's down the bottom they're all zero so we can sum that up and then divide by the sum of weights so let's now write a function that puts all this together so you can see this really important way of like to me the the only way that makes sense to do particularly scientific numerical programming i actually do all my programming this way but particularly scientific and numerical programming is write it all out step by step check every piece have it all there documented for you and for others and then copy the cells merge them together and indent them to indent its control right spread bracket and put a function header on top so here's all those things we just did and now rather than just grabbing the first x we enumerate through all of them so that's the distance we had before that's the weight we had before there's the product we had before and then finally sum across the rows divide by the sum of the weights so that's going to calculate for the ith it's going to move so it's actually changing capital x so it's changing the ith thing and capital x so that it's now the weighted sum oh actually sorry the weighted average of all of the other data weighted by how far it is away so that's going to do a single step so the mean shift update is extremely straightforward which is clone the data iterate a few times and do the update so if we run it take 600 milliseconds and what i've done is i've plotted the centroids moved by two pixels or two well not two pixels two units so that you can see them and so you can see the 
dots is where our data is and they're dots now because every single data point is on top of each other on a cluster and so you can see they are now in the correct spots so it has successfully clustered our data so that's great news and so we could test out our hypothesis could we use triangular just as well as we could have used gaussian so control slash comments and uncomments yeah we got exactly the same results so that's good it's really important to know these keyboard shortcuts hit h to get a list of them some things that are really important don't have keyboard shortcuts so if you click help edit keyboard shortcuts there's a list of all the things jupiter can do and you can add keyboard shortcuts to things that don't have them so for example i always add keyboard shortcuts to run all cells above and run all cells below as you can see i type q and then a for above and q and then b for below all right now that was kind of boring in a way because it did five steps um but we just saw the result what did it look like one step at a time um this isn't just fun it's really important to be able to see things happening one step at a time because there are so many algorithms we do which are like updating weights or updating data you know so if a stable diffusion for example you're very likely to want to show you know you're incrementally denoising and so forth so in my opinion it's important to know how to do animations and i found the documentation for this unnecessarily complicated because it's a lot of it's about how to make them performant but most of the time we probably don't care too much about that so i want to show you a little trick a simple way to create animations without any trouble so matplotlib.animation has something called func animation that's what we're going to use to create an animation you have to create a function and the function you're going to be calling func animation passing in the name of that function and saying how many times to run it and that's what this frames argument this says run this function this many times and then create an animation that that basically contains the result of that with a 500 millisecond interval between each one so what's this do one going to do to create one frame of animation we will call our one update here it is one update right we're going to call this that's going to update our axes and then we're going to have an axis which we've created here so we're going to clear whatever was on the plot before and plot our new data on that axis and then the only other thing you need to do is that the very first time it calls it we want to plot it before running and d is going to be passed automatically the frame number so for the zero frame we're going to not do the update we're just going to plot the data as it is already i guess another way we could have done that would have been just to say if d then do the update i suppose that should work too maybe it's even simpler let's see if i just broke it okay so we're going to clone our data we're going to create our figure in our subplots we're going to call func animation calling do one five times and then we're going to display the animation and so let's see so html takes some html and displays it and two js html creates some html so that's why it's created this html includes javascript and so we'll click run one two three four five there's the five steps so if i click loop you'll see them running again and again fantastic so that's how easy it is to create a matplotlib animation so hopefully 
now you can use that to play around with some fun stable diffusion animations as well you don't just have to use touch a shtml you can also create you can also create movies for example so you can call two html five video would be another option and you can save an animation as a movie file so there's all these different options for that but hopefully that's enough to get you started so for your homework i would like you when you create your k-means or whatever to try to create your own animation or create an animation of some stable diffusion thing that you're playing with so don't forget this important ax.clear you without the ax.clear it prints it on top of the last one which sometimes is what you want to be fair but in this case it's not what i wanted all right so kind of slow half a second for not that much data i'm sure would be nice it was faster well the good news is we can gpu accelerate it the bad news is it's not going to gpu accelerate that well because of this loop this is looping 1500 times if we so looping is not going to run on the gpu so the best we could do with this would be to move all this to the gpu now the problem is that calling something on the gpu 1500 times from python is a really bad idea because there's this kind of huge communication overhead of this kind of flow of control and data switching back between the cpu and the gpu it's the kernel launching overhead it's bad news so you don't want to have a really big fast python loop that inside it calls cuda code it calls gpu code so we need to make all of this run without the loop which we could do with broadcasting so let's roll up our sleeves and try to get the broadcast version of this working so generally speaking the way we tend to do things with broadcasting on a gpu is we create batches or mini batches so to create batches or mini batches we normally just call them batches nowadays we create a batch size so let's say we're going to do a batch size of five so we're going to do five at a time all right so how do we do five at a time this is only doing one at a time how do we do five at a time as before let's clone our data and this time little x for our testing so we're going to do everything ahead of time little tests as we always do this is not now x zero anymore but it's x colon bs so it's the first five this is now the first five items okay so little x is now a five by two matrix this is our mini batch the first five items as before our data itself is 1500 by two all right so we need a distance calculation but previously our distance calculation um previously our distance calculation only worked if little x was a single number and it returned just the distances from that to everything in big x but we need something that's actually going to be um return a matrix right we've got um um let's see we've got five by two in little x and then in big x we've got something much bigger not to scale obviously we've got 1500 by two and what is the distance between these two things well if you think about it there's going to be a distance between item one and item one but there's also going to be a distance between item one and item two and there's going to be a distance between let's use a different color for the next one item two and item one right so the output of this is actually going to be a matrix the distances are actually going to give us a matrix where i mean it doesn't matter which way around we do it we can decide if we do it this way around for each of the five things in the mini batch there will be 1500 
distances the distance between every one so we're going to need to do broadcasting to do this calculation so this is the function that we're going to create and it's going to create this as you can see five by 1500 output but let's see uh how we get it so can we do x minus x no we can't why is that that's because big x is 1500 by two and little x is five by two so it's going to look at remember our rules right to left are these compatible yes they are they're the same are these compatible no they're not okay because they're different so that's not possible to do what if though we wanted to what if we insert in big x an axis at the start here and in little x we add an axis in the middle here then now these are compatible because you've got they're the same because i should use arrows really these are compatible because one of them is a one and these are compatible because one of them is the one as well so they are all compatible and what it's going to do is it's going to do this attraction between these directly and it's going to copy this across all 1500 rows it'll copy it this is going to be copied and then this sorry across five rows and then this will be copied across these 1500 rows because that's what broadcasting does i mean it's not really copying but it's effectively copying and so that gives us we can now subtract them and that gives us what we wanted which is five by 1500 and there's also by two because there's both the x and the y so that's why this works that's what this is doing here it's taking this attraction it's squaring them and then summing over that last shortest axis summing over the x and the y squares and then take square root i don't know why i said torch dot square root we could just put dot square root at the end but same same in fact it's worth mentioning that so most things that you can do on tensors you can either write torch dot as a function or you can write it as a method generally speaking both should be fine not everything but most things work in both spots okay so now we've got this matrix which is five by 1500 and the nice thing is that our Gaussian kernel doesn't actually have to be changed to get the weights believe it or not and the reason for that is now how do we get the source code i could move back up there or i can just type Gaussian question mark question mark and see it and the nice thing is that this is just this is a scalar so it broadcasts over anything and then this is also just a scalar so this is all going to work fine without any fiddling around okay so now we've got a five by 1500 weight so that's the weight for each of the five things our mini battery each of the 1500 things each of them is compared to and then we've got the shape of the data itself x dot shape which is the 1500 points so now we want to apply each one of these weights to each of these columns so we need to add a unit axis to the end so to add a unit axis to the end we could say colon comma colon comma none but dot dot dot means all of the axes up until however many you need so in this case the last one common none so this is going to add an axis to the end so this is going to turn this is going to turn weight dot shape from five comma 1500 to five comma 1500 comma one and this is going to add an axis to the start remember it's the same as x none comma colon comma colon and so let's check our rules left right to left these are compatible because one of them is one these are compatible because they're both the same and these are compatible because one of them is one okay 
So it's going to be copying each weight across to each of the x and y, which is what we want; we want to weight both of those components. And it's going to copy each of the 1500 points five times, because we do in fact want a separate set of weights for every one of the five things in our mini-batch. So that sounds perfect, and that's how I think through these calculations.

Okay, so we can now do that multiplication, which is going to give us something of five by 1500 by two, because we end up with the maximum of our ranks, and then we sum up over those 1500 points, and that's going to give us five new data points. Now, something that you might notice here is that we've got a product and a sum, and when you see a product and a sum, that tells you that maybe we should use einsum. So in this case we've got our weight, which is five by 1500, so let's call those i and j, for the five and the 1500. We've got the X, which is 1500 by two; we want to take the product of that and that, so we need to use the same name for those rows, so we use j again, and then k is the other axis, that's the two, and we want to end up with i by k. And torch.einsum gives exactly the same result. That's great, but you might recognize this: that's exactly the same einsum we had just before when we were doing matrix multiplication. Oh, that is a matrix multiplication; we've just reinvented matrix multiplication using this rather nifty trick. So we could also just use that. And again, this is what I was just playing around with this morning as I started to look at this, and I was thinking, oh, can we simplify this? I don't like this kind of messing around with axes and summing over dimensions and whatnot, so it's nice to get things down to an einsum, or better still down to matrix multiplies. It's just clearer; it's stuff that we all recognize because we use it all the time. They all work, and performance would be pretty similar, I suspect.

Okay, so now that we've got that, we then need to do our sum, and we've got our five points; these are our five denominators. We've got our numerator, which we calculated up here for our weighted average; the denominator is just the sum of the weights, remember; and numerator divided by denominator is our answer. So again, we've gone through every step and checked all the dimensions along the way, so nothing's going to surprise us. Don't try to write a function like this in one go from scratch; you'll drive yourself crazy. Instead, do it step by step.

So here's our mean shift algorithm: clone the data, go through five iterations, and now go from naught to n, a batch size at a time. Python has something called slices, so we can create a slice of X starting at i, up to i plus batch size, unless you've gone past n, in which case use n. And then we're just copying and pasting each of the lines of code that we had before (actually I just copied the cells and merged them; of course I don't actually retype it all by hand, because that's so boring), and there's my final step to create the new x's. Notice here that s is not a single thing, it's a slice of things. You might not have seen slice before, but this is just what Python's doing internally when you use a colon, and it's very convenient when you need to use the same slice multiple times. Okay, so let's do that using CUDA. I would normally run it first without CUDA, but I've done all the steps before, so it should be fine.
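And here's a sketch of the rest: three equivalent ways of writing the numerator, plus the batched mean shift loop. This is my reconstruction rather than the exact notebook cell, so treat the bandwidth, the five iterations and the default batch size as illustrative:

```python
# three equivalent ways to get the (5, 2) numerator
num = (weight[..., None] * X[None]).sum(1)   # broadcast multiply, then sum over the 1500 points
num = torch.einsum('ij,jk->ik', weight, X)   # multiply over the shared j and sum it away...
num = weight @ X                             # ...which is exactly matrix multiplication

div = weight.sum(1, keepdim=True)            # (5, 1): the denominators
new_x = num / div                            # (5, 2): the updated mini-batch

def meanshift(data, bs=500):
    n = len(data)
    X = data.clone()
    for _ in range(5):                       # five passes over the data
        for i in range(0, n, bs):
            s = slice(i, min(i + bs, n))     # the same slice object gets used several times
            w = gaussian(dist(X[s], X), 2.5)
            X[s] = w @ X / w.sum(1, keepdim=True)
    return X
```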
So pop it on the GPU and run mean shift, and let's see how long that takes. It takes one millisecond, and previously, without the GPU, it took 400 milliseconds. And the other thing we should probably think about doing is looking at other batch sizes as well, because now we're looping over batches, right? So if we make the batch size bigger, that for loop is going to do less looping. So what if we make that 16; will that be any faster? I've actually never tried this before. That's interesting, it's actually slower. Huh, there you go, fascinating. What if it was eight? Amazing. So the big batches don't quite seem to be working so well for some reason. So I wonder if I've... hang on, what's going on, why is it changing how it should be? My batch size was five, why is it slower suddenly? I think it just varies a bit is probably the answer; it just varies a lot. Okay, so it doesn't seem like changing the batch size is changing much here, so that's fine, we'll just leave it where it was. And then, looking at the data: oh, that looks lovely.

Oh, I see. Thank you, people on YouTube, for pointing out that I'm meant to be passing the batch size, so I actually need to put it here. All right, so we'd been using a batch size of five; no wonder it was messing up. Oh, look at that, I've totally made it slow now: 157 milliseconds, ha ha. Okay, 64: 13 milliseconds. All right, finally that makes much more sense. 256... 1024... Okay, so bigger is better, and I guess we could actually do all 5,000 at once, probably. Okay, nice. All right, thank you, YouTube friends, for solving that bizarre mystery.

Okay, so that's pretty great. I mean, to see that we can GPU-optimize mean shift: I actually googled for this to see if it had been done before, and it's the kind of thing that people write papers about. So I think it's great that we can do it so easily with PyTorch, and it's the kind of thing that had previously been considered a very challenging academic problem to solve.

So maybe you can do something similar with some of these. Now, I haven't told you what these are, so part of the homework is to go read about them and learn about them. DBSCAN, funnily enough, is actually an algorithm that I accidentally invented and then discovered a year later had already been invented. That was a long time ago: I was playing around with J, which is the successor to APL, on a very old Windows phone, and I had a long plane flight, and I came up with an algorithm and implemented the whole thing on my phone using J, and then discovered a year later that I'd just reinvented DBSCAN. It's actually a really cool algorithm and it's got a lot of similarities to mean shift. LSH comes up all the time, so that's great. And in fact I have a strong feeling, and I've been thinking about this for a while, that something like LSH could be used to speed this whole thing up a lot. Because if you think about it (and again, maybe this already exists, I don't know), when we did that distance calculation, the vast majority of the weights are nearly zero, and so it seems pointless to create that big, eventually 1500 by 1500, matrix. That's slow. It would be much better if we just found the ones that were pretty close by and just took their average. So you want an optimized nearest neighbors, basically, and this is an example of something that can give you a fast nearest neighbors algorithm. Or there are things like k-d trees and octrees and stuff like that.
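Just to make that idea concrete, here's a rough sketch of the "only use the close points" version, reusing the gaussian, dist and X from the sketch above. Note it still builds the full distance matrix, so it only illustrates the idea; the real speed-up would come from something like LSH or a tree structure finding the neighbors without that matrix:

```python
# the homework idea: only weight each point's nearest neighbours, since far-away weights are ~0
d = dist(X, X)                             # (1500, 1500) distances (still quadratic here)
vals, idx = d.topk(50, largest=False)      # the 50 closest points for every point, itself included
w = gaussian(vals, 2.5)                    # (1500, 50) weights for just those neighbours
new_X = (w[..., None] * X[idx]).sum(1) / w.sum(1, keepdim=True)   # (1500, 2) updated points
```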
So if you want a bonus bonus: invent a new mean shift algorithm which picks only the closest points, to avoid that quadratic cost. All right, it's not very often you get an assignment which is to invent a new mean shift algorithm. I guess a super super bonus is: publish a paper that describes it. You definitely get four points if you do that; we'll give you a number of points equal to the impact factor of the journal you get it published in.

Okay, so what I want to do now is move on to calculus, which for some of us may not be our favorite topic. That's funny: Stefano wrote the einsum version here already, I didn't notice. Always ahead of his time, that guy. Let's talk about calculus. If you're not super comfortable with derivatives, what they are and why we care, 3Blue1Brown has a wonderful series called The Essence of Calculus which I strongly recommend watching. It's just a pleasure to watch, as is everything on 3Blue1Brown. We're not going to get into backprop today; instead we're just going to have a quick chat about calculus.

Where do we start? The good news is, just like you don't have to know much linear algebra at all (you basically just need to know about matrix multiplication), you also don't need to know much calculus at all, just derivatives. So let's think about what derivatives are. I'm going to borrow the same starting point that 3Blue1Brown uses in one of their videos, which is to consider a car, and we're going to see how far away from home it is at various points in time. So after a minute... let's say after a second, it's traveled five meters, then after two seconds it's traveled ten meters, and after three seconds you can probably guess it's traveled 15 meters. So there's this concept here of... yep, got it the wrong way around obviously, so time, distance... there's this concept of location: how far have you traveled at a particular point in time. So we can look at one of these points and find out how far the car has gone. We could also take two points and ask: where did it start at the first of those two points, where did it finish at the second, how much time passed between them, and how far did it travel? In two seconds it traveled ten meters. So we could now also say, all right, well, the slope of something is rise over run (oopsie daisy): ten meters in two seconds. And notice we don't just divide the numbers, we also divide the units: we get five meters per second. So this here has now changed dimensions entirely: we're no longer looking at distance, we're looking at speed, or velocity, and it's equal to rise over run; it's equal to the rate of change. And what it really says is: as time, the x-axis, goes up by one second, what happens to the distance in meters? As one second passes, how does the number of meters change?

And so maybe these aren't points at all; maybe there's a function, a continuum of points. And you can do the same thing for the function. The function is a function of time, distance is a function of time, and so we could ask: what's the slope of that function? We can get the slope from point a to point b using rise over run. So from t1 to t2, the amount of time that's passed is t2 minus t1; that's how much time has passed. Let's say this is t1 and this is t2.
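To pin the rise-over-run arithmetic down with the car's actual numbers, take, say, the readings at one second and three seconds:

```latex
\text{slope} = \frac{\text{rise}}{\text{run}}
             = \frac{d(t_2) - d(t_1)}{t_2 - t_1}
             = \frac{15\,\text{m} - 5\,\text{m}}{3\,\text{s} - 1\,\text{s}}
             = 5\ \text{m/s}
```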
And the distance they've traveled is wherever they are at the end minus wherever they were at the start. So that's the change in distance divided by the change in time: change in distance divided by change in time. Okay, let's say that's y.

Now, the thing is, when we talk about calculus we talk about finding a slope, but more often we're finding the slope of something that's trickier than this. We have slopes of things that look more like this, and we say: what's this slope? (Oops, terrible drawing; let's maybe put it over here, because I'm left-handed.) What's this slope now? What does it even mean to have the idea of a velocity at an exact moment in time? It doesn't mean anything; at an exact moment in time everything's frozen, right? What's happening exactly now? But what you can do is say: well, what's the change in time between a bit before our point and a bit after our point, and what's the change in distance between a bit before our point and a bit after our point? And so you can do the same kind of rise-over-run thing, but you can make that distance between t2 and t1 smaller and smaller and smaller.

So let's rewrite this in a slightly different way. Let's call the denominator the distance between t1 plus a little bit, which we'll call d, and t1. So this is t2, right? It's t1 plus a little bit. So we say: here's t1, let's add a little bit. And notice that when we write it this way... well, let's actually do the rest of it: f of t2 becomes f of (t1 plus a little bit), and this stays the same. And now notice that in t1 plus d minus t1 we can delete all of that, because it just comes out to d. So this is another way of calculating the slope of our function, and as d gets smaller and smaller and smaller, we're getting a triangle that's tinier and tinier and tinier, and it still makes sense: some time has still passed and the car has still moved, it's just smaller and smaller amounts of time.

Now, if you did calculus at college or at school, you might have done all this stuff messing around with limits and epsilon-delta and blah blah blah. I've got really good news: it turns out you can actually just think of this d as a really small number, where d is the difference. And so when we calculate the slope we can write it in a slightly different way, as the change in y divided by the change in x: this here is the change in y and this here is the change in x. In other words, this here is a very small number, and this here is the resulting change in the function when its input changes by that very small number. This way of thinking about calculus is known as the calculus of infinitesimals; it's how Leibniz originally developed it, and it's been turned into a whole theory nowadays. The reason I talk about it here is because when we do calculus you'll see me doing stuff all the time where I act like dx is a really small number. When I was at school I was told I wasn't allowed to do that; I've since learned that it's totally fine. So, for example, next lesson we're going to be looking at the chain rule, which looks like this: dy/dx equals dy/du times du/dx, and I'm just going to say, oh, these two small numbers cancel out, so obviously the two sides are the same thing, and that's all going to work out nicely.

So anyhow, what would be very helpful would be if, before the next lesson, if you're not totally up to date with remembering all the stuff you did in high school about calculus, you watch the 3Blue1Brown course.
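Here's the "d is just a really small number" idea in code; the functions and the value of d here are my own picks, purely for illustration:

```python
import math

def slope(f, t, d=1e-6):
    # rise over run across a tiny interval: (f(t + d) - f(t)) / d
    return (f(t + d) - f(t)) / d

# the car's distance as a function of time, f(t) = 5t, has a slope of 5 m/s everywhere
print(slope(lambda t: 5 * t, 2.0))           # ~5.0

# and a peek at next lesson's chain rule: y = sin(u), u = x^2, so dy/dx = cos(x^2) * 2x
x = 1.5
print(slope(lambda t: math.sin(t ** 2), x))  # numerical dy/dx at x = 1.5
print(math.cos(x ** 2) * 2 * x)              # chain-rule answer; the two agree closely
```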
We are not going to be looking at integration at all, I don't think, so you don't have to worry about that. Also, on the whole, we're not going to be doing any derivatives by hand. For example, there are rules such as: dy/dx, if y equals x squared, is 2x. These kinds of rules you're not really going to have to learn, because PyTorch is going to do them all for you. The one that we care about is going to be the chain rule, but we're going to learn about that next time.

Okay, I hope I don't get beaten to a bloody pulp the next time I walk into a mathematicians' conference. I suspect I might, but hopefully I get away with this. I think it's safe; we'll see how we go. So, thanks everybody very much for joining me, and I really look forward to seeing you next time, when we're going to do backpropagation from scratch. We've already learned to multiply matrices, so once we've got backpropagation as well, we'll be ready to train a neural network. All right, thanks all. Bye!
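(A tiny postscript illustrating that last point about PyTorch applying the derivative rules for you; this snippet is my own example rather than anything from the lesson notebook:)

```python
import torch

# y = x^2 at x = 3.0; autograd applies the d(x^2)/dx = 2x rule for us
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
y.backward()
print(x.grad)   # tensor(6.)
```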