
Getting Started With CUDA for Python Programmers


Chapters

0:00 Introduction to CUDA Programming
0:32 Setting Up the Environment
1:43 Recommended Learning Resources
2:39 Starting the Exercise
3:26 Image Processing Exercise
6:08 Converting RGB to Grayscale
7:50 Understanding Image Flattening
11:04 Executing the Grayscale Conversion
12:41 Performance Issues and Introduction to CUDA Cores
14:46 Understanding CUDA and Parallel Processing
16:23 Simulating CUDA with Python
19:04 The Structure of CUDA Kernels and Memory Management
21:42 Optimizing CUDA Performance with Blocks and Threads
24:16 Utilizing CUDA's Advanced Features for Speed
26:15 Setting Up CUDA for Development and Debugging
27:28 Compiling and Using CUDA Code with PyTorch
28:51 Including Necessary Components and Defining Macros
29:45 Ceiling Division Function
30:10 Writing the CUDA Kernel
32:19 Handling Data Types and Arrays in C
33:42 Defining the Kernel and Calling Conventions
35:49 Passing Arguments to the Kernel
36:49 Creating the Output Tensor
38:11 Error Checking and Returning the Tensor
39:01 Compiling and Linking the Code
40:06 Examining the Compiled Module and Running the Kernel
42:57 CUDA Synchronization and Debugging
43:27 Python to CUDA Development Approach
44:54 Introduction to Matrix Multiplication
46:57 Implementing Matrix Multiplication in Python
50:39 Parallelizing Matrix Multiplication with CUDA
51:50 Utilizing Blocks and Threads in CUDA
58:21 Kernel Execution and Output
58:28 Introduction to Matrix Multiplication with CUDA
60:01 Executing the 2D Block Kernel
60:51 Optimizing CPU Matrix Multiplication
62:35 Conversion to CUDA and Performance Comparison
67:50 Advantages of Shared Memory and Further Optimizations
68:42 Flexibility of Block and Thread Dimensions
70:48 Encouragement and Importance of Learning CUDA
72:30 Setting Up CUDA on Local Machines
72:59 Introduction to Conda and its Utility
74:00 Setting Up Conda
74:32 Configuring CUDA and PyTorch with Conda
75:35 Conda's Improvements and Compatibility
76:05 Benefits of Using Conda for Development
76:40 Conclusion and Next Steps

Transcript

Hi there. I'm Jeremy Howard from answer.ai and this is Getting Started with CUDA. CUDA is of course what we use to program NVIDIA GPUs if we want them to go super fast and we want maximum flexibility and it has a reputation of being very hard to get started with.

The truth is it's actually not so bad. You just have to know some tricks and so in this video I'm going to show you some of those tricks. So let's switch to the screen and take a look. So I'm going to be doing all of the work today in notebooks.

This might surprise you. You might be thinking that to do work with CUDA we have to do stuff with compilers and terminals and things like that, but the truth is, it turns out we really don't, thanks to some magic provided by PyTorch. You can follow along with all of these steps, and I strongly suggest you do so on your own computer.

You can go to the CUDA MODE organization on GitHub, find the lecture 2 repo there, and you'll see there is a lecture 3 folder; this is lecture 3 of the CUDA MODE series. You don't need to have seen any of the previous ones to follow along, however. In the README there you'll see there's a lecture 3 section, and at the bottom there is a link to go to the Colab version.

Yes, you can run all of this in Colab for free; you don't even have to have your own GPU available to run the whole thing. We're going to be following along with some of the examples from the book Programming Massively Parallel Processors, which is a really great book to read, and once you've completed today's lesson you should be able to make a great start on it.

It goes into a lot more detail about some of the things that we're going to cover fairly quickly. It's okay if you don't have the book, but if you want to go deeper I strongly suggest you get it. In fact, you'll see in the repo that lecture 2 in this series was a deep dive into chapters 1-3 of that book, so, confusingly enough, you might want to do lecture 2 after this one, lecture 3, to get more details about some of what we're talking about.

Okay so let's dive into the notebook. So what we're going to be doing today is we're going to be doing a whole lot of stuff with plain old PyTorch first to make sure that we get all the ideas and then we will try to convert each of these things into CUDA.

So in order to do this we're going to start by importing a bunch of stuff; in fact, let's do all of this in Colab. So here we are in Colab, and you should make sure that you set your Colab runtime to the T4 GPU. That's one you can use plenty of for free, and it's easily good enough to run everything we're doing today.

And once you've got that running we can import the libraries we're going to need and we can start on our first exercise. So the first exercise actually comes from chapter 2 of the book and chapter 2 of the book teaches how to do this problem which is converting an RGB color picture into a grayscale picture.

And it turns out that the recommended formula for this is to take 0.21 of the red pixel, 0.72 of the green pixel, 0.07 of the blue pixel and add them up together and that creates the luminance value which is what we're seeing here. That's a common way, kind of the standard way to go from RGB to grayscale.
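In PyTorch terms that formula is just a weighted sum over the channel dimension. Here is a one-line sketch, where the variable name img is an assumption of mine, standing in for the image tensor we're about to load:

```python
# Assumes `img` is an RGB tensor of shape (3, height, width).
r, g, b = img[0].float(), img[1].float(), img[2].float()
grey = 0.21*r + 0.72*g + 0.07*b   # luminance, shape (height, width)
```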

So we're going to do this, we're going to make a CUDA kernel to do this. So the first thing we're going to need is a picture and anytime you need a picture I recommend going for a picture of a puppy. So we've got here a URL to a picture of a puppy so we'll just go ahead and download it and then we can use torchvision.io to load that.

So this is already part of Colab. If you're interested in running stuff on your own machine or a server in the cloud I'll show you how to set that up at the end of this lecture. So let's read in the image and if we have a look at the shape of it it says it's 3 by 1066 by 1600.

So I'm going to assume that you know the basics of PyTorch here. If you don't know the basics of PyTorch I am a bit biased but I highly recommend my course which covers exactly that. You can go to course.fast.ai and you get the benefit also of having seen some very cute bunnies and along with the very cute bunnies it basically takes you through all of everything you need to be an effective practitioner of modern deep learning.

So finish part one if you want to go right into those details, but even if you just do the first two or three lessons, that will give you more than enough to understand this kind of code and these kinds of outputs. So I'm assuming you've done all that.

So you'll see here we've got a rank-3 tensor. There are three channels, so they're like the faces of a cube if you like. There are 1066 rows on each face, so that's the height, and then there are 1600 columns in each row, so that's the width. So if we then look at the first couple of channels and the first three rows and the first four columns, you can see here that these are unsigned 8-bit integers, so they're bytes, and here they are.

So that's what an image looks like. Hopefully you know all that already. So let's take a look at our image. To do that I'm just going to create a simple little function, show_img, that will create a matplotlib plot, remove the axes, and, if it's a color image, which this one is, change the order of the axes from channel by height by width, which is what PyTorch uses, to height by width by channel, which is what matplotlib expects.

So we change the order of the axes to be 1, 2, 0 and then we can show the image, putting it on the CPU if necessary. Now we're going to be working with this image in Python, and it's going to be pure Python to start with, before we switch to CUDA, so it's going to be really slow. So we'll resize it so that the smallest dimension is 150, that's the height in this case, and we end up with a 150 by 225 shape, a rectangle, which is 33,750 pixels, each one with our RGB values, and there is our puppy.
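Here's a sketch of the kind of helper being described; the function and variable names, and the placeholder filename, are my assumptions rather than the notebook's exact code:

```python
import matplotlib.pyplot as plt
import torchvision.transforms.functional as tvf
from torchvision import io

def show_img(x, figsize=(4, 3), **kwargs):
    # matplotlib wants height x width x channel; PyTorch stores channel first.
    plt.figure(figsize=figsize)
    plt.axis('off')
    if len(x.shape) == 3:
        x = x.permute(1, 2, 0)
    plt.imshow(x.cpu(), **kwargs)

img = io.read_image('puppy.jpg')              # uint8 tensor, shape (3, 1066, 1600)
print(img.shape)
img2 = tvf.resize(img, 150, antialias=True)   # smallest side becomes 150 pixels
show_img(img2)
```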

So you see, wasn't it a good idea to make this a puppy? Okay, so how do we convert that to grayscale? Well, the book has told us the formula to use: go through every pixel and apply it. So here is the loop: we're going to go through every pixel, apply the formula, and stick the result in the output. That's the basic idea; so what are the details?

Well here we've got channel by row by column, so how do we loop through every pixel? Well, the first thing we need to know is how many pixels there are. We can say channel by height by width is the shape, so now we have those three variables, and the number of pixels is the height times the width. And to loop through all those pixels, an easy way to do it is to flatten them all out into a vector.

Now what happens when you flatten them all out into a vector? Well, as we saw, they're currently stored in this format where we've got one face, then another face, and then, we haven't got it printed here, but there's a third face. Within each face there is one row, we're just showing the first few, then the next row, and then the next row, and within each row you've got column, column, column.

So let's say we had a small image. In fact we can draw it like this: here's our red, so we've got the pixels 0, 1, 2, 3, 4, 5. Let's say this was a height 2, width 3, 3-channel image; then there'll be 6, 7, 8, 9, 10, 11 for green and 12, 13, 14, 15, 16, 17 for blue. So let's say these are the pixels. When these are flattened out, it's going to turn into a single vector, just like so: 0, 1, 2, and so on, then 6, 7, 8, then 12, 13, 14, all laid end to end.

So when we talk about an image, we initially see it as a bunch of pixels, and we can think of it as having 3 channels, but in practice, in our computer, the memory is all laid out linearly. Everything just has an address in memory; you can think of your computer's memory as one giant vector. And so when we say flatten, what that's actually doing is turning our channel by height by width into one big vector like this.

Okay, so now that we've done that, we can set up the place we're going to be putting the result: we start out with just an empty vector of length n. We'll go through all of the values from 0 to n-1, and we're going to put in the output roughly 0.29 times the input value at x[i], so that will be here in the red bit, and then roughly 0.59 times x[i+n]. So n here is this distance, the number of pixels: one, two, three, four, five, six, see, one, two, three, four, five, six. That's why to get to green we have to jump up to i plus n, and then to get to blue we have to jump to i plus 2n. And so that's how this works: we've flattened everything out and we're indexing into this flattened-out thing directly. At the end of that, our grayscale is all done, so we can then just reshape it into height by width, and there it is, there's our grayscale puppy. And you can see here the flattened image is just a single vector with all those channel values flattened out, as we described.

Okay, now, that is incredibly slow: it's nearly two seconds to do something with only about 34,000 pixels in it. So to speed it up we are going to want to use CUDA.

How come CUDA is able to speed things up? The reason is that it is set up in a very different way to how a normal CPU is set up, and we can actually see that if we look at some of the information about what is in an RTX 3090 card, for example. An RTX 3090 is a fantastic GPU, you can get them second-hand at pretty good value, so a really good choice, particularly for hobbyists. What is inside a 3090? It has 82 SMs. What's an SM? An SM is a streaming multiprocessor, so you can think of it as almost like a separate CPU in your computer, and there are 82 of these. That's already a lot more than you have CPUs in your computer, but then each one of these has 128 CUDA cores. These CUDA cores are all able to operate at the same time, and these multiprocessors are all able to operate at the same time, so that gives us 128 times 82, about 10,500 CUDA cores in total, that can all work at the same time. That's a lot more than any CPU we're familiar with can do, and the 3090 isn't even at the very top end; it's really a very good GPU, but there are some with even more CUDA cores.

So how do we use them all? Well, we need to be able to set up our code in such a way that we can say: here is a piece of code that you can run on lots of different pieces of data, lots of different pieces of memory, at the same time, so that you can do 10,000 things at once. And CUDA does this in a really simple and pretty elegant way, which is that it basically says, okay, take out the inner loop. So here's our inner loop, the stuff where you could run 10,000 of these at the same time and they're not going to influence each other at all. You see, these do not influence each other at all: all they do is stick something into some output memory. It doesn't even return something; you can't return something from these CUDA kernels, as they're going to be called. All you can do is modify memory, in such a way that you don't know what order they're going to run in; they could all run at the same time, some could run a little bit before another one, and so forth.

So the way that CUDA does this is it says: okay, write a function, and in your function write a line of code which I'm going to call as many dozens, hundreds, thousands, millions of times as necessary to do all the work that's needed, and I'm going to do this in parallel for you as much as I can, in the case of running on a 3090 up to roughly 10,000 things all at once, and I will get this done as fast as possible. So all you have to do is basically write the line of code you want to be called lots of times, and then the second thing you have to do is say how many times to call that code. And what will happen is that piece of code, called the kernel, will be called for you. It'll be passed whatever arguments you ask to be passed in, which in this case will be the input tensor, the output tensor, and the size, how many pixels are in each channel, and it'll tell you: okay, this is the i-th time I've called it.

Now, we can simulate that in Python very simply with a single for loop. This doesn't happen in parallel, so it's not going to speed anything up, but the semantics are going to be identical to CUDA. So here is a function we've called run_kernel: we pass it a function, we say how many times to run the function, and what arguments to call the function with, and each time it will call the function, passing in the index, which time it is, and the arguments that we've requested. So we can now create something to call that. Just like before, we get the number of channels, the height and the width, and the number of pixels, we flatten the image out, and we create the result tensor that we're going to put things in. This time, rather than calling the loop directly, we call run_kernel: we pass in the name of the function to be called as f, we pass in the number of times, which is the number of pixels, and we pass in the arguments that are going to be required inside our kernel, so we're going to need out, we're going to need x, and we're going to need n. So you can see, we're using no external libraries at all; we have just plain Python and a tiny bit of PyTorch, just enough to create a tensor and to index into tensors. But conceptually it's doing the same thing as a CUDA kernel would do, nearly, and we'll get to the "nearly" in just a moment. Conceptually, you could see that if you knew this was running a bunch of things totally independently of each other, you could now fairly easily parallelize it, and that's what CUDA does.
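Here's a minimal sketch of that pure-Python simulation; the function names and the exact luminance coefficients are my reconstruction from the description, not necessarily the notebook's exact code:

```python
import torch

def run_kernel(f, times, *args):
    # Sequentially "launch" the kernel: call f once per index, like CUDA
    # would, except one at a time instead of in parallel.
    for i in range(times):
        f(i, *args)

def rgb2grey_k(i, x, out, n):
    # One "thread's" worth of work: combine the red, green and blue values
    # for pixel i of the flattened image (green lives at i+n, blue at i+2n).
    out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n]

def rgb2grey(x):
    c, h, w = x.shape
    n = h * w
    x = x.flatten()
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    run_kernel(rgb2grey_k, n, x, res, n)
    return res.view(h, w)
```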
However, it's not quite that simple. CUDA does not simply create a single list of numbers like range(n) does in Python and pass each one in turn into your kernel. Instead, it splits the range of numbers into what are called blocks. So in this case, if there were, say, a thousand pixels we wanted to get through, it's going to group them into blocks of 256 at a time. In Python it looks like this: in practice, a CUDA kernel runner is not a single for loop that loops n times, but a pair of nested for loops. So you don't just pass in a single number and say this is the number of pixels; you pass in two numbers, the number of blocks and the number of threads. We'll get into those in a moment, but they're just numbers, you can put any numbers you like here, and if you choose two numbers that multiply to give the thing we want, which is the n times we want to call it, then this can do exactly the same thing. We're now going to pass in the index of the outer loop we're up to, the index of the inner loop we're up to, and how many things we go through in the inner loop, and therefore, inside the kernel, we can find out what index we're up to by multiplying the block index, the i, by the block dimension, the number of threads, and adding the inner loop index, the j. So that's what we pass in with i, j, and threads, but inside the kernel we call them block index, thread index, and block dimension.

If you look at the CUDA book, you'll see this is exactly what they do: they say the index is equal to the block index times the block dimension plus the thread index. There's a .x thing here that we can ignore for now, we'll look at that in a moment, but in practice this is actually how CUDA works. It has all these blocks, and inside them there are threads, and you can just think of them as numbers: you can see these blocks just have numbers, 0, 1, and so forth.

Now, that does mean something a little bit tricky. The first thing I'll say is: how do we pick these numbers, the number of blocks and the number of threads? For now, in practice, we're just always going to say the number of threads is 256. That's a perfectly fine number to use as a default, and you can't go too far wrong just about always picking 256, so don't worry too much for now about optimizing that number.

So if we say, okay, we want to have 256 threads, remember that's the inner loop, or if we look inside our kernel runner here, that's our inner loop, then this is going to be called 256 times per block. So how many times do you have to call it? Well, you're going to have to call it n, the number of pixels, divided by 256 times. Now, that might not be an integer, so you'll have to round it up, take the ceiling, and that's how we can calculate the number of blocks we need, to make sure that our kernel is called enough times.

We do have a problem though, which is that the number of times we would have liked to call it, which previously was equal to the number of pixels, might not be a multiple of 256, so we might end up going too far. That's why we also need this if statement in our kernel now. It makes sure that the index we're up to does not go past the number of pixels we have, and it appears in basically every CUDA kernel you'll see; it's called the guard, or the guard block. So this is our guard, to make sure we don't go out of bounds. So this is the same line of code we had before, and now we've also added the line that calculates the index, and we've added the guard, and these are pretty much the standard first lines of any CUDA kernel. We can now run those, and they'll do exactly the same thing as before.

So the obvious question is: why do CUDA kernels work in this weird block-and-thread way? Why don't we just tell them the number of times to run? Why do we have to do it by blocks and threads? The reason is some of the detail that we've got here, which is that CUDA sets things up for us so that everything in the same block, or to say it more completely, thread block, will all be given some shared memory, and they'll also all be given the opportunity to synchronize, which is to basically say: everything in this block has to get to this point before anything can move on. All of the threads in a block will be executed on the same streaming multiprocessor, and we'll see in later lectures, which won't be taught by me, that by using blocks smartly you can make your code run more quickly, and the shared memory is particularly important. Shared memory is a little bit of memory on the GPU that all the threads in a block share, and it's fast, super fast. There's not very much of it; on a 3090 it's something like 128KB, so very small. It's basically the same as a cache in a CPU; the difference, though, is that on a CPU you're not going to be manually deciding what goes into your cache, but on the GPU you do, it's all up to you. At the moment, this cache is not going to be used when we create our CUDA code, because we're just getting started, so we're not going to worry about that optimization. But to go fast you want to use that cache, and you also want to use the register file; something a lot of people don't realize is that there's actually quite a lot of register memory, even more register memory than shared memory. Anyway, those are all things to worry about down the track, not needed for getting started.
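And here's a sketch of the block-and-thread version with the guard, again with names and details reconstructed from the description rather than copied from the notebook:

```python
import math
import torch

def blk_kernel(f, blocks, threads, *args):
    # The CUDA-style launch, simulated: an outer loop over blocks and an
    # inner loop over threads within each block.
    for i in range(blocks):
        for j in range(threads):
            f(i, j, threads, *args)

def rgb2grey_bk(blockidx, threadidx, blockdim, x, out, n):
    i = blockidx*blockdim + threadidx
    if i < n:  # the "guard": don't run off the end of the data
        out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n]

def rgb2grey(x):
    c, h, w = x.shape
    n = h * w
    x = x.flatten()
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    threads = 256
    blocks = math.ceil(n/threads)   # ceiling division
    blk_kernel(rgb2grey_bk, blocks, threads, x, res, n)
    return res.view(h, w)
```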

So how do we go about using CUDA? There is a basically standard setup block that I would add, and that we are going to add. In this setup block we set an environment variable; you wouldn't use this in production or when you care about speed, but it basically says: wait after each step to see how it went, so that if there's an error we stop right away and can tell exactly when and where it happened. That slows things down, but it's good for development.

We're also going to install two modules one is a build tool which is required by PyTorch to compile your C++ CUDA code. The second is a very handy little thing called Wurlitzer and the only place you're going to see that used is in this line here where we load this extension called Wurlitzer.

Without this anything you print from your CUDA code in fact from your C++ code won't appear in a notebook so you always want to do this in a notebook where you're doing stuff in CUDA so that you can use print statements to debug things. Okay so if you've got some CUDA code how do you use it from Python?
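For reference, the setup block being described looks roughly like this in a notebook. The exact package names (ninja as the build tool, wurlitzer) and the CUDA_LAUNCH_BLOCKING variable are my best reading of what's being installed and set, so treat this as a sketch rather than the notebook's exact cell:

```python
import os
# Make CUDA launches synchronous so errors surface at the line that caused
# them -- slower, but much easier to debug during development.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# In the notebook you'd install the build tool and wurlitzer, e.g.:
#   %pip install -q ninja wurlitzer
# and then load the extension that lets printf output from C/C++/CUDA code
# show up in notebook cells:
#   %load_ext wurlitzer
```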

The answer is that PyTorch comes with a very handy thing called load_inline, which lives inside torch.utils.cpp_extension. load_inline is a marvelous function: you just pass in a list of any CUDA code strings you want to compile, any plain C++ strings you want to compile, and any functions in that C++ you want to make available to PyTorch, and it will go and compile it all, turn it into a Python module, and make it available right away, which is pretty amazing.
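A minimal sketch of what calling load_inline looks like; the function is real, but the toy C++ source and module name here are made up for illustration:

```python
import torch
from torch.utils.cpp_extension import load_inline

# A toy C++ function; torch/extension.h is added to C++ sources automatically,
# and load_inline generates Python bindings for anything listed in `functions`.
cpp_src = """
torch::Tensor add_one(torch::Tensor x) {
    return x + 1;
}
"""

module = load_inline(
    name="add_one_ext",        # arbitrary module name
    cpp_sources=[cpp_src],
    functions=["add_one"],     # names to expose to Python
    verbose=True,
)
print(module.add_one(torch.zeros(3)))   # tensor([1., 1., 1.])
```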

I've just created a tiny little wrapper for that called load_cuda, just to streamline it a tiny bit, but behind the scenes it's just going to call load_inline. The other thing I've done is create a string that contains some C++ code, I mean this is all C code I think, but it's compiled as C++ code, so we'll call it C++ code.

This is C++ code we want included in all of our CUDA files. We need to include this header file to make sure that we can access the PyTorch tensor stuff, we want to be able to use I/O, and we want to be able to check for exceptions. And then I also define three macros.

The first macro just checks that a tensor is on CUDA. The second one checks that it's contiguous in memory, because sometimes PyTorch can actually split things up over different memory pieces, and then if we try to access it in this flattened-out form it won't work. And the one we're actually going to use, check input, just checks both of those things.

So if something's not on CUDA or it's not contiguous, we aren't going to be able to use it, so we always have this. And then the third thing we do here is define ceiling division. Ceiling division is just this, although you can implement it a different way like this, and this is what we're going to call in order to figure out how many blocks we need.
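Here's roughly what that shared C++ prelude looks like as a Python string. The macro and helper names (CHECK_CUDA, CHECK_CONTIGUOUS, CHECK_INPUT, cdiv) follow the standard PyTorch-extension idiom being described, but treat the details as a sketch rather than the notebook's exact text:

```python
cuda_begin = r'''
#include <torch/extension.h>
#include <stdio.h>
#include <c10/cuda/CUDAException.h>

// Sanity-check macros: is the tensor on the GPU, and laid out contiguously?
#define CHECK_CUDA(x) TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

// Ceiling division: how many blocks of size b are needed to cover a elements.
inline unsigned int cdiv(unsigned int a, unsigned int b) { return (a + b - 1) / b; }
'''
```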

So you don't have to worry about the details of this too much; it's just a standard setup we're going to use. Okay, so now we need to write our CUDA kernel. How do you write the CUDA kernel? Well, all I did, and I recommend you do the same, is take your Python kernel and paste it into ChatGPT and say: convert this to equivalent C code, using the same names, formatting, etc. where possible. Paste it in and ChatGPT will do it for you.

Unless you're very comfortable with C, in which case just writing it yourself is fine, but this way, since you've already got the Python, why not just do this? It was basically pretty much perfect, I found, although it did assume that these were floats; they're actually not floats, so we had to change a couple of data types, but I was able to use it almost as is. So particularly for people who are much more Python programmers nowadays, like me, this is a nice way to write 95 percent of the code you need.

What else do we have to change? Well, as we saw in our picture earlier, it's not just called blockidx; it's blockIdx.x, blockDim.x, threadIdx.x, so we have to add the .x there. Other than that, if we compare, you can see these two pieces of code look nearly identical: we've had to add data types, we've had to add semicolons, we had to get rid of the colon, and we had to add curly brackets; that's about it.

So it's not very different at all, so if you haven't done much C programming, don't worry about it too much, because the truth is it's not that different for this kind of calculation-intensive work. One thing we should talk about is this: what's unsigned char star (unsigned char*)?

This is just how you write uint8 in C. If you're not sure how to translate a data type between the PyTorch spelling and the C spelling, you can ask ChatGPT or you can Google it, but this is how you write byte. The star, in practice, is basically how you say this is an array.

So this says that x is an array of bytes. It actually means it's a pointer, but pointers are treated, as you can see here, as arrays by C. So you don't really have to worry about the fact that it's a pointer; for us it just means it's an array. But in C, the only kind of arrays it knows how to deal with like this are one-dimensional arrays, and that's why we always have to flatten things out.

We can't really use multi-dimensional tensors directly in these CUDA kernels in this way, so we're going to end up with these one-dimensional C arrays. Other than that, in fact, because we wrote our Python like that, it's going to look pretty much identical.

The void here just means it doesn't return anything, and then the __global__ ("dunder global") here is a special thing added by CUDA. There are three of these that can appear, and this simply says how to compile the function: you can put __device__, which means compile it so that you can only call it from the GPU.

You can say __global__, and that says you can call it from the CPU or GPU and it will run on the GPU, or you can write __host__, which you don't have to, and that just means it's a normal C or C++ function that runs on the CPU side.

So anytime we want to call something from the CPU side to run on the GPU, which is basically almost always when we're writing kernels, you write __global__. So here we've got __global__, we've got our kernel, and that's it. So then we need the thing to call that kernel.
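Before we get to the caller, here's roughly what that kernel looks like, appended to the cuda_begin prelude sketched earlier. The function name and the exact luminance coefficients are my reconstruction, not necessarily the notebook's:

```python
cuda_src = cuda_begin + r'''
__global__ void rgb_to_grayscale_kernel(unsigned char* x, unsigned char* out, int n) {
    // Which element am I responsible for?
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    // Guard: the last block may run past the end of the data.
    if (i < n) out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n];
}
'''
```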

So earlier, to call the kernel, we called this block-kernel function, passing in the kernel, the blocks and threads, and the arguments. With CUDA we don't have to use a special function; there's a weird special syntax built into CUDA to do it for us. To use the weird special syntax, you say, okay, here's the kernel, the function that I want to call, and then you use these weird triple angle brackets.

So the triple angle brackets are a special CUDA extension to the C++ language, and they mean: this is a kernel, please call it on the GPU. Between the triple angle brackets there are a number of things you can pass, but you have to pass at least the first two, which are how many blocks and how many threads.

So for how many blocks, it's the ceiling division of the number of pixels by the number of threads, and for how many threads, as we said before, let's just pick 256 all the time and not worry about it. So that says: call this function as a GPU kernel, passing in these arguments. We have to pass in our input tensor, our output tensor, and how many pixels there are.

And you'll see that for each of these tensors we have to use a special method, .data_ptr, which is going to convert it into a C pointer to the tensor's data. That's why, by the time it arrives in our kernel, it's a C pointer. You also have to tell it what data type you want it to be treated as.

This says treat it as uint8s. So this here is a C++ template parameter, and this is a method. The other thing you need to know is that in C++, dot means call a method of an object, whereas colon-colon (::) is basically like calling a method of a class in Python.

So you don't say torch.empty, you say torch::empty to create our output, whereas back when we did it in Python we said torch.empty. Also, in Python we just created a length-n vector and then did a .view.

It doesn't really matter how we do it, but in this case we actually created a two-dimensional tensor by passing in this thing in curly brackets here; this is called a C++ list initializer, and it's basically just a little list containing height comma width. So this tells it to create a two-dimensional matrix, which is why we don't need .view at the end.

We could have done it the .view way as well; it would probably be better to keep it consistent, but this is what I wrote at the time. The other interesting thing when we create the output is that if you pass in input.options(), where input is our input tensor, that just says: use the same data type and the same CUDA device as our input has.

This is a really convenient way, which I don't even think we have in Python, to say: make sure this is the same data type on the same device. If you say auto here, which is quite convenient, you don't have to specify what type this is. We could have written torch::Tensor, but by writing auto it just says figure it out yourself, which is another convenient little C++ thing.

After we call the kernel, if there's an error in it we won't necessarily get told, so to tell it to check for an error you have to write this. This is a macro that's again provided by PyTorch; the details don't matter, you should just always call it after you call a kernel to make sure it worked. And then you can return the tensor that you allocated, passed in as a pointer, and filled in.
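Continuing the same sketch, the host-side function that creates the output, launches the kernel, checks for errors, and returns the tensor might look like this; C10_CUDA_KERNEL_LAUNCH_CHECK is the PyTorch-provided macro being described, and the rest follows the description above rather than being the notebook's exact code:

```python
cuda_src += r'''
torch::Tensor rgb_to_grayscale(torch::Tensor input) {
    CHECK_INPUT(input);
    int h = input.size(1), w = input.size(2);
    // Same dtype and device as the input, shape height x width.
    auto output = torch::empty({h, w}, input.options());
    int threads = 256;
    rgb_to_grayscale_kernel<<<cdiv(w*h, threads), threads>>>(
        input.data_ptr<unsigned char>(), output.data_ptr<unsigned char>(), w*h);
    C10_CUDA_KERNEL_LAUNCH_CHECK();   // surface any error from the launch
    return output;
}
'''
```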

Okay, now as well as the CUDA source you also need the C++ source, and the C++ source is just something that lists the signatures of the functions that you want made available to the outside world, in this case Python. It's basically your header, effectively.

So you can just copy and paste the signature line from your function definition and stick a semicolon on the end; that's something you can always do. And then we call the load_cuda function we looked at earlier, passing in the CUDA source code, the C++ source code, and a list of the names of the functions defined there that you want to make available to Python. We just have one, which is the RGB-to-grayscale function.
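In sketch form, the C++ "header" source and the compile step then look something like this; the lecture uses its load_cuda wrapper, but calling load_inline directly is equivalent, and the function name follows the earlier hedged sketches:

```python
from torch.utils.cpp_extension import load_inline

cpp_src = "torch::Tensor rgb_to_grayscale(torch::Tensor input);"

# Compile both sources and expose rgb_to_grayscale to Python.
module = load_inline(
    name="rgb_to_grayscale_ext",
    cpp_sources=[cpp_src],
    cuda_sources=[cuda_src],
    functions=["rgb_to_grayscale"],
    verbose=True,
)
```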

And believe it or not, that's all you have to do. This will automatically, you can see it running in the background now, compile our files with a hugely long command: it creates a main.cpp for us, it's going to compile it into a main.o for us, compile everything up, link it all together, and create a module. You can see here we then take that module, which has been passed back, and put it into a variable called module, and when it's done it will load that module.

If we look inside the module we just created, you'll see that, apart from the normal auto-generated stuff Python adds, it's got our RGB-to-grayscale function in it. So that's amazing: we now have a CUDA function that's been made available from Python. We can even see, if we want to, where it put it all, so we can have a look, and there it is: you can see it's created a main.cpp, it's compiled it into a main.o, it's created a library that we can load up, it's created a CUDA file, and it's created a build script, which we could have a look at if we wanted to, and there it is. None of this matters too much; it's just nice to know that PyTorch is doing all this stuff for us and we don't have to worry about it. So that's pretty cool.

So in order to pass a tensor to this, we're going to be checking that it's contiguous and on CUDA, so we'd better make sure it is: we create an image-c variable, which is the image made contiguous and put onto the CUDA device. And now we can actually run this on the full-sized image, not on the tiny minimized image we created before. This has got many more pixels, 1.7 million, whereas before we had, I think, around 34,000, and it's gone from one and a half seconds down to about one millisecond. So that is amazing; it's dramatically faster, both because it's now running as compiled code and because it's running on the GPU.

The step of putting the data onto the GPU is not part of what we timed, and that's probably fair enough, because normally you do that once and then run a whole lot of CUDA operations on it. We have, though, included the step of moving the result off the GPU and back onto the CPU as part of what we're timing, and one key reason for that is that otherwise our Python code can keep running while the CUDA code is still executing, so the time shown could be dramatically less because it hasn't finished synchronizing. Adding this forces it to complete the CUDA run and to put the data back onto the CPU. You can also trigger that kind of synchronization by printing a value from the result, or you can synchronize manually. After we've done that, we can have a look, and we get exactly the same grayscale puppy. So we have successfully created our first real working CUDA kernel, developed from Python.
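So the call being timed looks roughly like this (the variable names here are assumptions):

```python
imgc = img.contiguous().cuda()              # the kernel requires a contiguous CUDA tensor
res = module.rgb_to_grayscale(imgc).cpu()   # .cpu() forces the GPU work to finish first
# torch.cuda.synchronize() is the explicit way to wait for the kernel instead.
```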

This approach of writing it in Python and then converting it to CUDA is not particularly common, but I'm not just doing it as an educational exercise: that's how I like to write my CUDA kernels, at least as much of them as I can, because it's much easier to debug in Python, it's much easier to see exactly what's going on, and I don't have to worry about compiling, which takes about 45 or 50 seconds even for our simple example here; I can just run it straight away. And once it's working, to convert it into C, as I mentioned, ChatGPT can do most of it for us. So I think this is actually a fantastically good way of writing CUDA kernels, even as you start to get somewhat familiar with them, because it lets you debug and develop much more quickly. A lot of people avoid writing CUDA just because that process is so painful, and here's a way we can make that process less painful.

So let's do it again, and this time we're going to implement something very important: matrix multiplication. Matrix multiplication, as you probably know, is fundamentally critical for deep learning; it's like the most basic linear algebra operation we have. The way it works is that you have an input matrix M and a second input matrix N, and we go through every row of M, here we are up to this one, and every column of N, here we are up to this one, and then we take the dot product of that row with that column, and this here is the dot product of those two things. That is what matrix multiplication is: a very simple operation conceptually, and one that we do many, many times in deep learning; basically every neural network has this as its most fundamental operation. Of course, we don't actually need to implement matrix multiplication from scratch, because it's done for us in libraries, but we will often do things where we have to fuse in some kind of matrix-multiplication-like piece, and of course it's also just a good exercise.

So let's take a look at how to do matrix multiplication, first of all in pure Python. In the fast.ai course that I mentioned, there's a very complete, in-depth dive into matrix multiplication in part 2, lesson 11, where we spend an hour or two talking about nothing but matrix multiplication. We're not going to go into that much detail here, but what we do there is use the MNIST dataset, so we're going to do the same thing here. We're going to grab the MNIST dataset of handwritten digits; they are 28 by 28 pixel digits, they look like this, and 28 by 28 is 784. So to do a single layer of a neural net, without the activation function, we would do a matrix multiplication of the image, flattened out, by a weight matrix with 784 rows and however many columns we like; and if we're going to go straight to the output, so this would be a linear model, we'd have 10 of them, one for each digit. So this is our weights; we're not actually going to do any learning here, no deep learning or even logistic regression learning, this is just for an example.

Okay, so we've got our weights, and we've got our input data, x train and x valid, and we're going to start off by implementing this in Python. Now again, Python's really slow, so let's make this smaller: matrix one will be just five rows, matrix two will be all the weights, so that's going to be a 5 by 784 matrix multiplied by a 784 by 10 matrix. These two have to match, of course; they have to match because otherwise this product won't work: the columns of the first have to match the rows of the second.

So let's pull those out into a-rows, a-columns, b-rows, b-columns, and obviously a-columns and b-rows are the things that have to match, and then the output will be a-rows by b-columns, so 5 by 10. So let's create an output full of zeros with that many rows and columns in it, and now we can go ahead and go through every row of a and every column of b and do the dot product, which involves going through every item in the innermost dimension, 784 of them, multiplying together the equivalent things from m1 and m2 and summing them up into the output tensor that we created. That's going to give us, as we said, a 5 by 10 output, and here it is.

So this is how I always create things in Python. I basically almost never have to debug, I almost never have unexpected errors in my code, because I've written every single line one step at a time in Python and checked them all as I go; then I copy all the cells, merge them together, and stick a function header on, like so. So here is matmul; this is exactly the code we've already seen, and we can call it, and we'll see that for 39,200 innermost operations it took us about a second, so that's pretty slow.

Okay, so now that we've done that, you might not be surprised to hear that we need to turn the innermost loop into a kernel call, in such a way that it can be run in parallel. Now, in this case, the innermost loop is not this line of code, it's actually this line of code; we can choose it to be whatever we want, but in this case this is how we're going to do it: we're going to say every cell in the output tensor, like this one here, is going to be one CUDA thread. So one CUDA thread is going to do the dot product; this is the bit that does the dot product, so that'll be our kernel, and the matmul block kernel is going to contain that.

So that's exactly the same thing we just copied from above, and now we're going to need something to run this kernel. You might not be surprised to hear that in CUDA we are going to call this using blocks and threads, but something that's rather handy in CUDA is that the blocks and threads don't have to be just a 1D vector; they can be 2D or even 3D. So in this case you can see we've got one, two, it's a little hard to see exactly where they stop, three, four blocks in one dimension, and then there are also one, two, three, four, five blocks in the other dimension. Each of these blocks has an index, so this one here is going to be (0, 0), this one here is going to be (1, 3) and so forth, and this one over here is going to be (3, 4). So rather than having a single integer block index, we're going to have a tuple block index. And then within a block, let's pick, say, this exact spot here, there's going to be a thread index, and again the thread index won't be a single index into a vector; it'll have two elements. In this case it would be 0, 1, 2, 3, 4, 5, 6 rows down and 0, 1, 2, up to maybe 12, I can't count, across. So this spot here is actually going to be defined by two things: the block, which is (3, 4), and the thread, which is (6, 12). That's how CUDA lets us index into two-dimensional grids using blocks and threads. We don't have to, it's just a convenience if we want it, and in fact we can use up to three dimensions.

So to create our kernel runner now, rather than just having two nested loops for blocks and threads, we're going to have two lots of two nested loops, for both our X and Y, or rows and columns, blocks and threads. It ends up looking a bit messy because we now have four nested for loops: we go through our blocks on the Y axis, then our blocks on the X axis, then our threads on the Y axis, then our threads on the X axis. You can think of this Cartesian product as being: for each block, for each thread. To get the .y and .x we use this handy little Python standard library thing called SimpleNamespace; I use it so much I just give it the name ns, because I use namespaces all the time in my quick and dirty code. So we go through all four of those loops, and then we call our kernel, passing in an object containing the Y and X coordinates of our block, and also our thread, which is an object with the Y and X coordinates of the thread, and it's going to eventually do all possible block numbers and all possible thread numbers for each of those blocks. We also need to tell it how big each block is, how high and how wide, and that's what this is: again a SimpleNamespace, an object with an X and a Y, as you can see. We need to know how big they are just like earlier on we had to know the block dimension; that's why we passed in threads. Remember, this is all pure Python; we're not calling out to any CUDA, we're not calling out to any libraries other than a tiny bit of PyTorch for the indexing and tensor creation, so you can run all of this by hand, make sure you understand it, put it in the debugger, and step through it.

So it's going to call our function, and here's our matrix multiplication function. As we said, it's a kernel that contains the dot product we wrote earlier, but now the guard has to check that the row number we're up to is not beyond the number of rows we have, and that the column number we're up to is not beyond the number of columns we have, so we also need to know what row and column number we're up to. The column is exactly the same as we've seen before, and in fact you might remember in the CUDA code we had blockIdx.x; this is why, because CUDA always gives you these three-dimensional dim3 structures, so you have to put the .x. So we can find the column this way, and then we can find the row by seeing how many blocks we've gone through, how big each block is in the y-axis, and how many threads we've gone through in the y-axis. So: which row number are we up to, which column number are we up to, is that inside the bounds of our tensor, and if not then just stop; otherwise do our dot product and put it into our output tensor. That's all pure Python.

And so now we can call it: we get the height and width of our first input, the height and width of our second input, and then k and k2, the inner dimensions, which ought to match, and we create our output. Now threads per block is not just the number 256, it's a pair of numbers, an x and a y, and we've selected two numbers that multiply together to make 256, which again is a reasonable choice for spreading the work out nicely over two-dimensional inputs.
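Here's a sketch of that pure-Python 2D simulation; the names (blk_kernel2d, matmul_bk, ns) are reconstructions from the description rather than the notebook's exact code:

```python
import math
import torch
from types import SimpleNamespace as ns

def blk_kernel2d(f, blocks, threads, *args):
    # Simulate a 2D grid launch: loop over block and thread coordinates
    # in both the y and x dimensions.
    for i0 in range(blocks.y):
        for i1 in range(blocks.x):
            for j0 in range(threads.y):
                for j1 in range(threads.x):
                    f(ns(x=i1, y=i0), ns(x=j1, y=j0), threads, *args)

def matmul_bk(blockidx, threadidx, blockdim, m, n, out, h, w, k):
    r = blockidx.y*blockdim.y + threadidx.y
    c = blockidx.x*blockdim.x + threadidx.x
    if r >= h or c >= w: return          # 2D guard
    o = 0.
    for i in range(k):                   # the dot product for one output cell
        o += m[r*k + i] * n[i*w + c]
    out[r*w + c] = o

def matmul_2d(m, n):
    h, k = m.shape
    k2, w = n.shape
    assert k == k2, "Size mismatch!"
    output = torch.zeros(h*w, dtype=m.dtype)
    tpb = ns(x=16, y=16)                 # 16*16 = 256 threads per block
    blocks = ns(x=math.ceil(w/tpb.x), y=math.ceil(h/tpb.y))
    blk_kernel2d(matmul_bk, blocks, tpb, m.flatten(), n.flatten(),
                 output, h, w, k)
    return output.view(h, w)
```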
One thing to be aware of here is that your threads per block can't be bigger than 1024. We're using 256, which is safe, and notice that you have to multiply these together: 16 times 16 is the number of threads per block, so these are safe numbers to use. You're not going to run out of blocks, though: 2 to the 31 is the maximum number of blocks for dimension 0, and then 2 to the 16 for dimensions 1 and 2 (I think it's actually minus 1, but don't worry about that). So you can't have too many threads, but you can have lots of blocks. Of course, each streaming multiprocessor is going to run all of these on the same device, and they're also going to have access to shared memory, and that's why you use a limited number of threads per block. For our blocks, we get the x with the ceiling division, and the y with the same ceiling division, so if any of this is unfamiliar, go back to our earlier example, because the code's all copied from there. And now we can call our 2D block kernel runner, passing in the kernel, the number of blocks, the number of threads per block, our input matrices flattened out, our output matrix flattened out, and the dimensions it needs, because they all get used here, and return the result. So if we call that matmul with a 2D block, we can check that the results are close to what we got with our original manual loops, and of course they are, because it's running the same code.

So now that we've done that, we can do the CUDA version. The CUDA version is going to be so much faster that we do not need to use this slimmed-down matrix anymore; we can use the whole thing. To check that it's correct, I want a fast CPU-based approach that I can compare to. Previously it took about a second to do 39,000 elements, so, and I'm not going to explain how this works, I'm going to use a broadcasting approach to get a fast CPU-based version. The fast.ai course teaches how to do this broadcasting approach; it's a pure Python approach which manages to do it all in a single loop rather than three nested loops. It gives the same answer for the cut-down tensors, but much faster, only four milliseconds, so it's fast enough that we can now run it on the whole input matrices, and that takes about 1.3 seconds. So this broadcast-optimized version, as you can see, is much faster, and now we've got 392 million additions going on in the middle of our effectively three loops, but we're broadcasting them. The reason I'm really doing this, though, is so that we can store this result to compare to, to make sure that our CUDA version is correct.

Okay, so how do we convert this to CUDA? You might not be surprised to hear that what I did was grab this function and pass it over to ChatGPT and say: please rewrite this in C. And it gave me something I could basically use first time, and here it is. This time I don't have unsigned char*, I have float*, but other than that this looks almost exactly like the Python we had, with exactly the same changes we saw before: we've now got the .y and .x versions, and once again we've got __global__, which says please run this on the GPU when we call it from the CPU. So for the kernel, I don't think there's anything new to talk about there.

Then the thing that calls the kernel is going to be passed two tensors. We check that they're both contiguous and that they are on the CUDA device, we grab the height and width of the first and second tensors, and we grab the inner dimension. We make sure that the inner dimensions of the two matrices match, just like before, and this is how you do an assertion in PyTorch CUDA code: you call torch check, pass in anything to check and the message to pop up if there's a problem. These are a really good thing to spread around all through your CUDA code to make sure that everything is as you thought it was going to be.

Just like before, we create an output. Now when we create the number of threads, we don't say threads is 256; instead we use a special thing provided by CUDA for us called dim3. This is basically a tuple with three elements, so we're going to create a dim3 called tpb, and it's going to be 16 by 16. Now, I said it has three elements, so where's the third one? That's okay, it just treats the third one as being one, so we can ignore it. So that's the number of threads per block, and then how many blocks will there be? Well, in the X dimension it'll be w divided by tpb.x, ceiling division, and in the Y dimension it will be h divided by tpb.y, ceiling division, and that's the number of blocks we have.

So just like before, we call our kernel just by calling it like a normal function, but then we add this weird triple angle bracket thing telling it how many blocks and how many threads. These aren't ints anymore, these are now dim3 structures, and that's how we use these dim3 structures. In fact, even before, what actually happened behind the scenes when we did the grayscale thing is that even though we passed in 256 as ints, we actually ended up with dim3 structures, in which case the .y and .z values were just set to one automatically, so we've actually already used a dim3 structure without quite realizing it. And then, just like before, we pass in all of the tensors we want, casting them, well, not exactly casting, converting them to pointers of some particular data type, and passing in any other information that our kernel will need.

So then we call load_cuda again, and that'll compile this into a module. We make sure the inputs are both contiguous and on the CUDA device, and then we call module.matmul, passing those in, putting the result on the CPU, and checking that it's all close, and it says yes it is. So this is now running not just on the first five rows but on the entire MNIST dataset, and on the entire MNIST dataset, using an optimized CPU approach, it took 1.3 seconds; using CUDA it takes six milliseconds. So that is quite a big improvement.

The other thing I will mention, of course, is that PyTorch can do a matrix multiplication for us just by using @, and obviously it gives the same answer. How long does that take to run? That takes two milliseconds, so three times faster, and in many situations it'll be much more than three times faster. So why are we still pretty slow compared to PyTorch? I mean, it isn't bad to do 392 million of these calculations in six milliseconds, but if PyTorch can do it so much faster, what are they doing? The trick is that they are taking advantage, in particular, of the shared memory. Shared memory is a small memory space that is shared amongst the threads in a block, and it is much faster than global memory. In our matrix multiplication, when we have one of these blocks, it's going to be done one block at a time, all in the same SM, reusing the same 16 by 16 tile, the same 16 rows and columns, again and again, each time with access to the same shared memory. So you can see how you could potentially cache a lot of the information you need and reuse it, rather than going back to the slower memory again and again. This is an example of the kind of thing you could optimize, potentially, once you get to that point.
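For reference, here's roughly what that CUDA matmul kernel and its caller look like, again as a sketch built on the cuda_begin prelude from earlier rather than the notebook's exact code:

```python
cuda_src = cuda_begin + r'''
__global__ void matmul_k(float* m, float* n, float* out, int h, int w, int k) {
    int r = blockIdx.y*blockDim.y + threadIdx.y;
    int c = blockIdx.x*blockDim.x + threadIdx.x;
    if (r >= h || c >= w) return;          // 2D guard
    float o = 0;
    for (int i = 0; i < k; ++i) o += m[r*k + i] * n[i*w + c];
    out[r*w + c] = o;
}

torch::Tensor matmul(torch::Tensor m, torch::Tensor n) {
    CHECK_INPUT(m); CHECK_INPUT(n);
    int h = m.size(0), w = n.size(1), k = m.size(1);
    TORCH_CHECK(k == n.size(0), "Size mismatch!");
    auto output = torch::zeros({h, w}, m.options());
    dim3 tpb(16, 16);                       // 16*16 = 256 threads per block
    dim3 blocks(cdiv(w, tpb.x), cdiv(h, tpb.y));
    matmul_k<<<blocks, tpb>>>(
        m.data_ptr<float>(), n.data_ptr<float>(), output.data_ptr<float>(), h, w, k);
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return output;
}
'''
cpp_src = "torch::Tensor matmul(torch::Tensor m, torch::Tensor n);"
```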

The only other thing that I wanted to mention here is that this 2D block idea is totally optional you can do everything with 1D blocks or with 2D blocks or with 3D blocks and threads and just to show that I've actually got an example at the end here which converts RGB to grayscale using the 2D blocks because remember earlier when we did this it was with 1D blocks.

It gives exactly the same result, and if we compare the code, the version that was done with 1D threads and blocks is actually quite a bit shorter than the version that uses 2D threads and blocks. So in this case, even though we're manipulating pixels, where you might think the 2D approach would be neater and more convenient, in this particular case it wasn't really. I mean, it's still pretty simple code, but we have to deal with the columns and rows, the .x and .y, separately, the guard is a little bit more complex, and we have to work out what index we're actually up to, whereas the 1D kernel was just much more direct, just two lines of code; and calling the kernel is also a little bit more complex, with the threads-per-block structures rather than plain numbers. But the key thing I wanted to point out is that these two pieces of code do exactly the same thing, so don't feel that you have to use a 2D or 3D block and thread structure if you don't want to.

You can just use a 1D one, the 2D stuff is only there if it's convenient for you to use and you want to use it. Don't feel like you have to. So yeah I think that's basically like all the key things that I wanted to show you all today.

The main thing I hope you take from this is that even for Python programmers, for data scientists, this is not way outside our comfort zone. We can write these things in Python, we can convert them pretty much automatically, and we end up with code that looks reasonably familiar, even though it's now in a different language. We can do everything inside notebooks, we can test everything as we go, and we can print things from our kernels. So hopefully it's feeling a little bit less beyond our capabilities than we might have previously imagined.

So I'd say, yeah, go for it. I think it's also increasingly important to be able to write CUDA code nowadays, because for things like flash attention, or for things like quantization, GPTQ, AWQ, bitsandbytes, these are all things you can't write in PyTorch.

Our models are getting more sophisticated, and the assumptions that libraries like PyTorch make about what we want to do are increasingly less accurate, so we're having to do more and more of this stuff ourselves nowadays in CUDA. So I think it's a really valuable capability to have.

Now, the other thing I mentioned is that we did it all in Colab today, but we can also do things on our own machines, if you have a GPU, or on a cloud machine, and getting set up for this is again much less complicated than you might expect. In fact, I can show you: it's basically three or four lines of bash script to get it all set up. It'll run on Windows or under WSL, and it'll also run on Linux, of course. CUDA stuff doesn't really work on Mac, so not on Mac.

Actually, I'll put a link to this in the video notes, but for now I'm just going to jump to a Twitter thread where I wrote all this down, to show you the steps. The way to do it is to use something called Conda. Conda is something that very few people understand; a lot of people think it's like a replacement for pip or Poetry or something. It's not; it's better to think of it as a replacement for Docker.

You can literally have multiple different versions of Python, multiple different versions of CUDA, and multiple different C++ compilation systems all in parallel at the same time on your machine, and switch between them. You can only do this with Conda, and everything just works. So you don't have to worry about all the confusing stuff around .run files or Ubuntu packages or anything like that; you can do everything with just Conda.

You need to install Conda. I've actually got a script for that: you just run the script, it's a tiny script as you'll see, and it'll automatically figure out which Miniconda you need, it'll automatically figure out what shell you're on, and it'll just go ahead and download it and install it for you.

Okay, so run that script, restart your terminal, and now you've got Conda. Step two is to find out what version of CUDA PyTorch wants you to have; so if I click Linux, Conda, CUDA 12.1 is the latest. Step three is to run this shell command, replacing 12.1 with whatever the current version for PyTorch is (it's actually still 12.1 for me at this point), and that'll install everything: all the stuff you need to profile, debug, build, etc., all the NVIDIA tools you need, the full suite will be installed, and it's coming directly from NVIDIA, so you'll have the proper versions. As I said, you can have multiple versions stored at once in different environments, no problem at all. And then finally, install PyTorch, and this command here will do that. For some reason I wrote "nightly" here; you don't need the nightly, so just remove "nightly". This will install the latest version of PyTorch using the NVIDIA CUDA stuff that you just installed.

If you've used Conda before and it was really slow, that's because it used to use a different solver, which was thousands or tens of thousands of times slower than the modern one that has been added and made the default in the last couple of months. So nowadays this should all run very fast, and as I said, it'll run under WSL on Windows, it'll run on Ubuntu, it'll run on Fedora, it'll run on Debian, it'll all just work. So that's how I strongly recommend getting yourself set up for local development. You don't need to worry about using Docker; as I said, you can switch between different CUDA versions, different Python versions, different compilers and so forth without having to worry about any of the Docker stuff. It's also efficient enough that if you've got the same libraries installed in multiple environments, it'll hard-link them, so it won't even use additional hard drive space. Great, so that's how you can get started on your own machine, or on the cloud, or whatever, so hopefully you'll find that helpful as well.

Alright, thanks very much for watching. I hope you found this useful, and I look forward to hearing about what you create with CUDA. In terms of next steps, check out the other CUDA MODE lectures, I will link to them, and I would also recommend trying out some projects of your own. For example, you could try to implement something like 4-bit quantization or flash attention or anything like that. Those are pretty big projects, but you can try to break them up into smaller pieces that you build up one step at a time. And of course, look at other people's code: look at the implementation of flash attention, look at the implementation of bitsandbytes, look at the implementation of GPTQ, and so forth. The more time you spend reading other people's code, the better. Alright, I hope you found this useful, and thank you very much for watching.