Getting Started With CUDA for Python Programmers
Chapters
0:00 Introduction to CUDA Programming
0:32 Setting Up the Environment
1:43 Recommended Learning Resources
2:39 Starting the Exercise
3:26 Image Processing Exercise
6:08 Converting RGB to Grayscale
7:50 Understanding Image Flattening
11:04 Executing the Grayscale Conversion
12:41 Performance Issues and Introduction to CUDA Cores
14:46 Understanding CUDA and Parallel Processing
16:23 Simulating CUDA with Python
19:04 The Structure of CUDA Kernels and Memory Management
21:42 Optimizing CUDA Performance with Blocks and Threads
24:16 Utilizing CUDA's Advanced Features for Speed
26:15 Setting Up CUDA for Development and Debugging
27:28 Compiling and Using CUDA Code with PyTorch
28:51 Including Necessary Components and Defining Macros
29:45 Ceiling Division Function
30:10 Writing the CUDA Kernel
32:19 Handling Data Types and Arrays in C
33:42 Defining the Kernel and Calling Conventions
35:49 Passing Arguments to the Kernel
36:49 Creating the Output Tensor
38:11 Error Checking and Returning the Tensor
39:01 Compiling and Linking the Code
40:06 Examining the Compiled Module and Running the Kernel
42:57 CUDA Synchronization and Debugging
43:27 Python to CUDA Development Approach
44:54 Introduction to Matrix Multiplication
46:57 Implementing Matrix Multiplication in Python
50:39 Parallelizing Matrix Multiplication with CUDA
51:50 Utilizing Blocks and Threads in CUDA
58:21 Kernel Execution and Output
58:28 Introduction to Matrix Multiplication with CUDA
60:01 Executing the 2D Block Kernel
60:51 Optimizing CPU Matrix Multiplication
62:35 Conversion to CUDA and Performance Comparison
67:50 Advantages of Shared Memory and Further Optimizations
68:42 Flexibility of Block and Thread Dimensions
70:48 Encouragement and Importance of Learning CUDA
72:30 Setting Up CUDA on Local Machines
72:59 Introduction to Conda and its Utility
74:00 Setting Up Conda
74:32 Configuring CUDA and PyTorch with Conda
75:35 Conda's Improvements and Compatibility
76:05 Benefits of Using Conda for Development
76:40 Conclusion and Next Steps
00:00:00.000 |
Hi there. I'm Jeremy Howard from answer.ai and this is Getting Started with CUDA. CUDA 00:00:10.160 |
is of course what we use to program NVIDIA GPUs if we want them to go super fast and 00:00:16.080 |
we want maximum flexibility and it has a reputation of being very hard to get started with. The 00:00:23.720 |
truth is it's actually not so bad. You just have to know some tricks and so in this video 00:00:29.720 |
I'm going to show you some of those tricks. So let's switch to the screen and take a look. 00:00:37.020 |
So I'm going to be doing all of the work today in notebooks. This might surprise you. You 00:00:42.180 |
might be thinking that to do work with CUDA we have to do stuff with compilers and terminals 00:00:46.760 |
and things like that and the truth is actually it turns out we really don't thanks to some 00:00:52.320 |
magic that is provided by PyTorch. You can follow along in all of these steps and I strongly 00:00:59.440 |
suggest you do so in your own computer. You can go to the CUDA mode organization in GitHub. 00:01:08.560 |
Find the lecture 2 repo there and you'll see there is a lecture 3 folder. This is lecture 00:01:15.120 |
3 of the CUDA mode series. You don't need to have seen any of the previous ones however 00:01:19.800 |
to follow along. In the read me there you'll see there's a lecture 3 section and at the 00:01:26.040 |
bottom there is a click to go to the colab version. Yep you can run all of this in colab 00:01:32.880 |
for free. You don't even have to have a GPU available to run the whole thing. We're going 00:01:39.680 |
to be following along with some of the examples from this book, Programming Massively Parallel 00:01:46.480 |
Processors. It is a really great book to read, and once you've completed today's lesson you should 00:01:59.960 |
be able to make a great start on this book. It goes into a lot more details about some 00:02:04.400 |
of the things that we're going to cover on fairly quickly. It's okay if you don't have 00:02:09.400 |
the book but if you want to go deeper I strongly suggest you get it and in fact you'll see 00:02:16.920 |
in the repo that lecture 2 in this series actually was a deep dive into chapters 1-3 00:02:22.640 |
of that book and so actually you might want to do lecture 2 confusingly enough after this 00:02:26.880 |
one lecture 3 to get more details about some of what we're talking about. Okay so let's 00:02:34.560 |
dive into the notebook. So what we're going to be doing today is we're going to be doing 00:02:40.560 |
a whole lot of stuff with plain old PyTorch first to make sure that we get all the ideas 00:02:46.320 |
and then we will try to convert each of these things into CUDA. So in order to do this we're 00:02:54.600 |
going to start by importing a bunch of stuff in fact let's do all of this in colab. So 00:03:00.440 |
here we are in colab and you should make sure that you set in colab your runtime to the 00:03:07.200 |
T4 GPU. That's one you can use plenty of for free and it's easily good enough to run everything 00:03:13.400 |
we're doing today. And once you've got that running we can import the libraries we're 00:03:19.320 |
going to need and we can start on our first exercise. So the first exercise actually comes 00:03:25.360 |
from chapter 2 of the book and chapter 2 of the book teaches how to do this problem which 00:03:33.120 |
is converting an RGB color picture into a grayscale picture. And it turns out that the 00:03:38.640 |
recommended formula for this is to take 0.21 of the red pixel, 0.72 of the green pixel, 00:03:44.400 |
0.07 of the blue pixel and add them up together and that creates the luminance value which 00:03:51.000 |
is what we're seeing here. That's a common way, kind of the standard way to go from RGB 00:03:55.960 |
to grayscale. So we're going to do this, we're going to make a CUDA kernel to do this. So 00:04:02.600 |
the first thing we're going to need is a picture and anytime you need a picture I recommend 00:04:06.860 |
going for a picture of a puppy. So we've got here a URL to a picture of a puppy so we'll 00:04:13.520 |
just go ahead and download it and then we can use torchvision.io to load that. So this 00:04:22.560 |
is already part of Colab. If you're interested in running stuff on your own machine or a 00:04:28.280 |
server in the cloud I'll show you how to set that up at the end of this lecture. So let's 00:04:34.000 |
read in the image and if we have a look at the shape of it it says it's 3 by 1066 by 00:04:40.280 |
1600. So I'm going to assume that you know the basics of PyTorch here. If you don't 00:04:46.960 |
know the basics of PyTorch I am a bit biased but I highly recommend my course which covers 00:04:54.240 |
exactly that. You can go to course.fast.ai and you get the benefit also of having seen 00:05:00.480 |
some very cute bunnies and along with the very cute bunnies it basically takes you through 00:05:06.680 |
all of everything you need to be an effective practitioner of modern deep learning. So finish 00:05:14.920 |
part one if you want to go right into those details but even if you just do the first 00:05:19.840 |
two or three lessons that will give you more than enough you need to know to understand 00:05:25.560 |
this kind of code and these kinds of outputs. So I'm assuming you've done all that. So you'll 00:05:31.160 |
see here we've got a rank 3 tensor. There are three channels so they're like the faces of 00:05:38.360 |
a cube if you like. There are 1066 rows on each face so that's the height and then there 00:05:44.240 |
are 1600 columns in each row so that's the width. So if we then look at the first couple of 00:05:50.720 |
channels and the first three rows and the first four columns you can see here that these are 00:05:58.720 |
unsigned 8-bit integers so they're bytes and so here they are. So that's what an image looks 00:06:03.600 |
like. Hopefully you know all that already. So let's take a look at our image. To do that 00:06:10.840 |
I'm just going to create a simple little function, show_img, that will create a matplotlib 00:06:16.720 |
plot, remove the axes, and, if it's color, which this one is, change the 00:06:25.200 |
order of the axes from channel by height by width, which is what PyTorch uses, to height 00:06:31.440 |
by width by channel, which is what matplotlib expects. So we change 00:06:40.360 |
the order of the axes to be 1, 2, 0 and then we can show the image, putting it on the CPU if necessary. 00:06:47.360 |
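For reference, a minimal sketch of what such a show-image helper might look like (the name and signature here are illustrative approximations of the notebook code, not the exact source):

```python
import matplotlib.pyplot as plt

def show_img(x, figsize=(4, 3), **kwargs):
    # display a PyTorch image tensor with matplotlib, hiding the axes
    plt.figure(figsize=figsize)
    plt.axis('off')
    if len(x.shape) == 3:
        x = x.permute(1, 2, 0)   # channel x height x width -> height x width x channel
    plt.imshow(x.cpu(), **kwargs)
```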
Now we're going to be working with this image in Python which is going to 00:06:52.600 |
be just pure Python to start with before we switch to CUDA that's going to be really slow 00:06:56.720 |
so we'll resize it to have the smallest dimension be 150, so that's the height 00:07:05.640 |
in this case, so we end up with a 150 by 225 shape, which is a rectangle of 33,750 00:07:14.920 |
pixels, each one with its RGB values, and there is our puppy. So you see, wasn't it a good 00:07:20.120 |
idea to make this a puppy. Okay so how do we convert that to grayscale? Well the book 00:07:28.800 |
has told us the formula to use go through every pixel and do that to it. Alright so 00:07:36.920 |
here is the loop we're going to go through every pixel and do that to it and stick that 00:07:44.360 |
in the output so that's the basic idea so what are the details of this? Well here we've 00:07:49.040 |
got channel by row by column so how do we loop through every pixel? Well the first thing 00:07:56.520 |
we need to know is how many pixels are there so we can say channel by height by width is 00:08:02.760 |
the shape so now we have to find those three variables so the number of pixels is the height 00:08:08.440 |
times the width and so to loop through all those pixels an easy way to do them is to 00:08:14.880 |
flatten them all out into a vector. Now what happens when you flatten them all out into 00:08:20.920 |
a vector? Well as we saw they're currently stored in this format where we've got one 00:08:29.440 |
face and then another face and then there's a we haven't got it printed here but there's 00:08:33.040 |
a third face within each face then there is one row we're just showing the first few and 00:08:40.480 |
then the next row and then the next row and then with each row you've got column column 00:08:45.360 |
column. So let's say we had a small image. We 00:08:56.840 |
could say here's our red channel, with the pixels 0, 1, 2, 3, 4, 5, so this is 00:09:08.520 |
a height 2, width 3, 3-channel image; then green will be 6, 7, 8, 9, 10, 11 and blue 12, 13, 00:09:28.840 |
14, 15, 16, 17. So let's say these are the pixels. When these are flattened out it's going 00:09:39.040 |
to turn into a single vector laid out in exactly that order: 0 to 5, then 6 to 11, then 12 to 17. So actually when we talk 00:09:58.000 |
about an image we initially see it as a bunch of pixels we can think of it as having 3 channels 00:10:14.320 |
but in practice in our computer the memory is all laid out linearly everything has just 00:10:23.520 |
an address in memory it's just a whole bunch you can think of it as your computer's memory 00:10:28.080 |
is one giant vector, and so when we say flatten, what that's actually doing 00:10:39.800 |
is turning our channel by height by width tensor into one big vector like this. 00:10:52.240 |
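Here is a tiny illustrative example of that layout, using a made-up 3-channel, 2-by-3 "image":

```python
import torch

img = torch.arange(18).reshape(3, 2, 3)   # 3 channels, 2 rows, 3 columns
print(img.flatten())
# tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])
# with n = 2*3 = 6 pixels per channel, "red" pixel i is at index i,
# the matching "green" value is at i + n, and "blue" is at i + 2*n
```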
Okay so now that we've done that, we can say: all right, the place we're going to be putting this into, 00:10:59.520 |
the result, is going to start out as just an empty vector of length n. We'll go through 00:11:06.760 |
all of the n values from 0 to n - 1 and we're going to put in the output value 0.29 ish 00:11:16.400 |
times the input value at xi so this will be here in the red bit and then 0.59 times xi 00:11:29.800 |
plus n so n here is this distance it's the number of pixels one two three four five six 00:11:41.460 |
see one two three four five six so that's why to get to green we have to jump up to 00:11:50.600 |
i plus n and then to get to blue we have to jump to i plus 2n see and so that's how this 00:12:04.520 |
works we've flattened everything out and we're indexing into this flattened out thing directly 00:12:12.360 |
and so at the end of that our grayscale is all done, so we can then just 00:12:16.460 |
reshape that into height by width, and there it is, there's our grayscale puppy. And you 00:12:26.040 |
can see here the flattened image is just a single vector with all those channel values 00:12:34.760 |
flattened out as we described. 00:12:43.600 |
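Putting that together, the pure-Python loop looks roughly like this (a sketch with assumed names; the weights are the approximately 0.3/0.6/0.1 luminance values read out above, which differ slightly from the 0.21/0.72/0.07 quoted from the book):

```python
import torch

def rgb2grey_py(x):
    c, h, w = x.shape
    n = h * w
    x = x.flatten()                         # all of red, then all of green, then all of blue
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    for i in range(n):                      # one output pixel at a time
        res[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n]
    return res.view(h, w)

grey = rgb2grey_py(img)                     # img is the resized 3 x 150 x 225 byte tensor from above
```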
Okay, now that is incredibly slow: it's nearly two seconds to do something with only about 34,000 pixels. So to speed it up we are going to want to 00:12:50.320 |
use CUDA how come CUDA is able to speed things up well the reason CUDA is able to speed things 00:12:59.400 |
up is because it is set up in a very different way to how a normal CPU is set up and we can 00:13:10.720 |
actually see that if we look at some of this information about what is in an RTX 3090 card 00:13:21.200 |
for example an RTX 3090 card is a fantastic GPU you can get them second hand pretty good 00:13:27.600 |
value so a really good choice particularly for hobbyists what is inside a 3090 it has 00:13:35.920 |
82 SM's what's an SM and SM is a streaming multi processor so you can think of this as 00:13:44.600 |
almost like a separate CPU in your computer and so there's 82 of these so that's already 00:13:50.960 |
a lot more than you have CPUs in your computer but then each one of these has 128 CUDA cores 00:14:01.640 |
so these CUDA cores are all able to operate at the same time these multi processors are 00:14:06.720 |
all able to operate at the same time, so that gives us 128 times 82, roughly 10,500, CUDA cores in 00:14:16.200 |
total that can all work at the same time so that's a lot more than any CPU we're familiar 00:14:23.840 |
with can do and the 3090 isn't even at the very top end it's really a very good GPU but 00:14:31.600 |
there are some with even more CUDA cores so how do we use them all well we need to be 00:14:40.440 |
able to set up our code in such a way that we can say here is a piece of code that you 00:14:45.560 |
can run on lots of different pieces of data lots of different pieces of memory at the 00:14:50.440 |
same time so that you can do 10,000 things at the same time and so CUDA does this in 00:14:56.520 |
a really simple and pretty elegant way which is it basically says okay take out the kind 00:15:02.800 |
of the inner loop so here's our inner loop the stuff where you can run 10,000 of these 00:15:10.440 |
at the same time they're not going to influence each other at all so you see these do not 00:15:14.080 |
influence each other at all all they do is they stick something into some output memory 00:15:20.480 |
so it doesn't even return something you can't return something from these CUDA kernels as 00:15:25.120 |
they're going to be called all you can do is you can modify memory in such a way that 00:15:30.440 |
you don't know what order they're going to run in they could all run at the same time 00:15:33.440 |
some could run a little bit before another one and so forth so the way that CUDA does 00:15:39.080 |
this is it says okay write a function right and in your function write a line of code 00:15:46.920 |
which I'm going to call as many dozens hundreds thousands millions of times as necessary to 00:15:53.080 |
do all the work that's needed, and I'm going to do this in parallel 00:15:57.040 |
for you as much as I can, in the case of running on a 3090 up to 10,000 00:16:05.040 |
things all at once and I will get this done as fast as possible so all you have to do 00:16:11.160 |
is basically write the line of code you want to be called lots of times and then the second 00:16:16.400 |
thing you have to do is say how many times to call that code and so what will happen 00:16:20.720 |
is that piece of code called the kernel will be called for you it'll be passed in whatever 00:16:26.360 |
arguments you ask to be passed in which in this case will be the input array tensor the 00:16:31.440 |
output tensor and the size of how many pixels are in each channel and it'll tell you okay 00:16:40.680 |
this is the ith time I've called it now we can simulate that in Python very very simply 00:16:48.640 |
a single for loop now this doesn't happen in parallel so it's not going to speed it up 00:16:53.520 |
but the kind of results the semantics are going to be identical to CUDA so here is a 00:16:59.760 |
function we've called run kernel we're going to pass it in a function we're going to say 00:17:04.000 |
how many times to run the function and what arguments to call the function with and so 00:17:08.840 |
each time it will call the function passing in the index what time and the arguments that 00:17:14.920 |
we've requested okay so we can now create something to call that so let's get the just 00:17:24.480 |
like before get the channel number of channels height and width the number of pixels flatten 00:17:29.760 |
it out create the result tensor that we're going to put things in and this time rather 00:17:35.600 |
than calling the loop directly we will call run kernel we will pass in the name of the 00:17:41.920 |
function to be called as f we will pass in the number of times which is the number of 00:17:48.920 |
pixels for the loop, and we'll pass in the arguments that are going to be required inside 00:17:59.200 |
our kernel: so we're going to need out, we're going to need x, and we're going to need n. 00:18:05.600 |
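In code, the simulated kernel runner and the grayscale version that uses it might look something like this (a sketch; the function names are assumptions, not the exact notebook source):

```python
import torch

def run_kernel(f, times, *args):
    # sequential stand-in for CUDA: call the kernel once per index
    for i in range(times):
        f(i, *args)

def grey_k(i, out, x, n):
    # the "kernel": computes one output pixel per call
    out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n]

def rgb2grey_pyk(x):
    c, h, w = x.shape
    n = h * w
    x = x.flatten()
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    run_kernel(grey_k, n, res, x, n)        # out, x and n are the arguments the kernel needs
    return res.view(h, w)
```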
So you can see here we're using no external libraries at all; we have just plain Python and a tiny 00:18:17.200 |
bit of PyTorch, just enough to create a tensor and to index into tensors, and that's all that's 00:18:22.720 |
being used but conceptually it's doing the same thing as a CUDA kernel would do nearly 00:18:32.040 |
and we'll get to the nearly in just a moment but conceptually you could see that you could 00:18:37.080 |
now potentially write something which if you knew that this was running a bunch of things 00:18:45.280 |
totally independently of each other, conceptually you could now fairly easily parallelize that, and 00:18:50.760 |
that's what CUDA does however it's not quite that simple it does not simply create a single 00:19:06.880 |
list of numbers like range n does in python and pass each one in turn into your kernel 00:19:14.360 |
but instead it actually splits the range of numbers into what's called blocks so in this 00:19:23.200 |
case you know maybe there's like a thousand pixels we wanted to get through it's going 00:19:29.840 |
to group them into blocks of 256 at a time and so in python it looks like this in practice 00:19:41.240 |
a CUDA kernel runner is not a single for loop that loops n times but instead it is a pair 00:19:51.600 |
of nested for loops so you don't just pass in a single number and say this is the number 00:19:58.200 |
of pixels but you pass in two numbers number of blocks and the number of threads we'll 00:20:04.440 |
get into that in a moment but these are just numbers they're just you can put any numbers 00:20:07.640 |
you like here and if you choose two numbers that multiply to get the thing that we want 00:20:15.880 |
which is the n times we want to call it then this can do exactly the same thing because 00:20:21.520 |
we're now going to pass in which of the what's the index of the outer loop we're up to what's 00:20:27.480 |
the index in the inner loop we're up to how many things do we go through in the inner 00:20:32.960 |
loop and therefore inside the kernel we can find out what index we're up to by multiplying 00:20:40.560 |
the block index times the block dimension so that is to say the i by the threads and 00:20:47.120 |
add the inner loop index the j so that's what we pass in with the i j threads but inside 00:20:55.240 |
the kernel we call it block index thread index and block dimension so if you look at the 00:21:00.120 |
CUDA book you'll see here this is exactly what they do they say the index is equal to 00:21:06.320 |
the block index times the block dimension plus the thread index there's a dot x thing 00:21:12.560 |
here that we can ignore for now we'll look at that in a moment but in practice this is 00:21:20.800 |
actually how CUDA works so it has all these blocks and inside there are threads and you 00:21:30.440 |
can just think of them as numbers, so you can see these blocks just have numbers 0, 00:21:33.760 |
1, and so forth. Now that does mean something a little bit tricky though, 00:21:40.440 |
which is, well, the first thing I'll say is: how do we pick these numbers, the number of 00:21:46.040 |
blocks and the number of threads for now in practice we're just always going to say the 00:21:49.960 |
number of threads is 256 and that's a perfectly fine number to use as a default anyway you 00:21:57.760 |
can't go too far wrong just always picking 256 nearly always so don't worry about that 00:22:03.720 |
too much for now optimizing that number so if we say okay we want to have 256 threads 00:22:12.000 |
so remember that's the inner loop or if we look inside our kernel runner here that's 00:22:16.800 |
our inner loop, so for each of these, it is going to be called 256 times. So 00:22:22.160 |
how many times you have to call this well you're going to have to call it n number of 00:22:29.000 |
pixels divided by 256 times now that might not be an integer so you'll have to round 00:22:35.920 |
that up so it's ceiling and so that's how we can calculate the number of blocks we need 00:22:42.040 |
to make sure that our kernel is called enough times now we do have a problem though which 00:22:49.800 |
is that the number of times we would have liked to have called it which previously was 00:22:56.160 |
equal to the number of pixels might not be a multiple of 256 so we might end up going 00:23:02.520 |
too far and so that's why we also need in our kernel now this if statement and so this 00:23:08.640 |
is making sure that the index that we're up to does not go past the number of pixels we 00:23:15.360 |
have and this appears and basically every CUDA kernel you'll see and it's called the 00:23:20.040 |
guard or the guard block so this is our guard to make sure we don't go out of bounds so 00:23:25.880 |
this is the same line of code we had before and now we've also just added this thing to 00:23:31.560 |
calculate the index, and we've added the guard, and these are like the pretty standard first 00:23:31.560 |
lines of any CUDA kernel. 00:23:37.920 |
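As a sketch (with assumed names), the block-and-thread version of the runner and the kernel with its guard might look like this:

```python
import math, torch

def blk_kernel_run(f, blocks, threads, *args):
    # a pair of nested loops instead of one big loop
    for i in range(blocks):
        for j in range(threads):
            f(i, j, threads, *args)

def grey_bk(blockidx, threadidx, blockdim, out, x, n):
    i = blockidx*blockdim + threadidx       # which element this "thread" handles
    if i < n:                               # the guard: don't index past the end
        out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n]

def rgb2grey_bk(x):
    c, h, w = x.shape
    n = h * w
    x = x.flatten()
    res = torch.empty(n, dtype=x.dtype, device=x.device)
    threads = 256
    blk_kernel_run(grey_bk, math.ceil(n/threads), threads, res, x, n)
    return res.view(h, w)
```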
So we can now run those and they'll do exactly the same thing as before. And so the obvious question is: why do CUDA kernels work in this weird block 00:23:47.400 |
and thread way why don't we just tell them the number of times to run it why do we have 00:24:03.720 |
to do it by blocks and threads and the reason why is because of some of this detail that 00:24:09.400 |
we've got here which is that CUDA sets things up for us so that everything in the same block 00:24:17.920 |
or to say it more completely thread block which is the same block they will all be given 00:24:24.840 |
some shared memory and they'll also all be given the opportunity to synchronize which 00:24:29.560 |
is to basically say okay everything in this block has to get to this point before you 00:24:35.640 |
can move on all of the threads in a block will be executed on the same streaming multiprocessor 00:24:44.600 |
and so we'll see in later lectures, that won't be taught by me, that by using blocks 00:24:54.520 |
smartly you can make your code run more quickly and the shared memory is particularly important 00:25:01.320 |
so shared memory is a little bit of memory in the GPU that all the threads in a block 00:25:06.600 |
share and it's fast it's super super super fast now when we say not very much it's like 00:25:13.880 |
on a 3090 it's 128k so very small so this is basically the same as a cache in a CPU 00:25:26.400 |
the difference though is that on a CPU you're not going to be manually deciding what goes 00:25:31.520 |
into your cache but on the GPU you do it's all up to you so at the moment this cache 00:25:36.680 |
is not going to be used when we create our CUDA code because we're just getting started 00:25:42.800 |
and so we're not going to worry about that optimization but to go fast you want to use 00:25:46.480 |
that cache and also you want to use the register file something a lot of people don't realize 00:25:52.520 |
is that there's actually quite a lot of register memory even more register memory than shared 00:25:56.600 |
memory so anyway those are all things to worry about down the track not needed for getting 00:26:00.920 |
started. So how do we go about using CUDA? There is a basically standard setup block 00:26:16.960 |
that I would add and we are going to add and what happens in this setup block is we're 00:26:22.880 |
going to set an environment variable you wouldn't use this in kind of production or for going 00:26:27.960 |
fast, but this says: if you get an error, stop right away. Basically it waits 00:26:35.000 |
to see how each step went, so that it can tell you exactly when an error occurs and 00:26:40.840 |
where it happens so that slows things down but it's good for development. We're also 00:26:48.480 |
going to install two modules one is a build tool which is required by PyTorch to compile 00:26:55.960 |
your C++ CUDA code. The second is a very handy little thing called Wurlitzer and the only 00:27:04.360 |
place you're going to see that used is in this line here where we load this extension 00:27:08.300 |
called Wurlitzer. Without this anything you print from your CUDA code in fact from your 00:27:14.880 |
C++ code won't appear in a notebook so you always want to do this in a notebook where 00:27:20.160 |
you're doing stuff in CUDA so that you can use print statements to debug things. Okay 00:27:27.120 |
so if you've got some CUDA code how do you use it from Python? The answer is that PyTorch 00:27:35.880 |
comes with a very handy thing called load_inline, which is inside torch.utils.cpp_extension. 00:27:51.640 |
load_inline is a marvelous function that you just pass in 00:28:02.320 |
a list of any of the CUDA code strings that you want to compile any of the plain C++ strings 00:28:08.240 |
you want to compile any functions in that C++ you want to make available to PyTorch 00:28:15.360 |
and it will go and compile it all turn it into a Python module and make it available 00:28:20.640 |
right away which is pretty amazing. I've just created a tiny little wrapper for that called 00:28:27.160 |
load CUDA just to streamline it a tiny bit but behind the scenes it's just going to call 00:28:33.160 |
load inline. The other thing I've done is I've created a string that contains some C++ code 00:28:43.720 |
I mean this is all C code I think but it's compiled as C++ code we'll call it C++ code. 00:28:51.160 |
C++ code we want included in all of our CUDA files. We need to include this header file 00:28:59.600 |
to make sure that we can access PyTorch tensor stuff. We want to be able to use I/O and we 00:29:06.800 |
want to be able to check for exceptions. And then I also define three macros. The first 00:29:14.880 |
macro just checks that a tensor is CUDA. The second one checks that it's contiguous in 00:29:23.040 |
memory because sometimes PyTorch can actually split things up over different memory pieces 00:29:27.920 |
and then if we try to access that in this flattened out form it won't work. And then 00:29:33.320 |
the way we're actually going to use it check input or just check both of those things. 00:29:37.480 |
So if something's not on CUDA and it's not contiguous we aren't going to be able to use 00:29:41.360 |
it so we always have this. And then the third thing we do here is we define ceiling division. 00:29:48.840 |
Ceiling division is just this, although you can implement it a different way, like this, 00:29:59.400 |
and so this will do ceiling division, and this is what we're 00:30:03.440 |
going to call in order to figure out how many blocks we need. So you don't 00:30:08.040 |
have to worry about the details of this too much; it's just a standard setup we're going 00:30:10.720 |
to use. 00:30:16.880 |
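Gathered in one place, that setup might look roughly like this (a sketch under the assumptions described above; the exact flags, macro names and wrapper signature in the notebook may differ slightly):

```python
import os, torch
from torch.utils.cpp_extension import load_inline

# in a notebook you would also run: %pip install -q wurlitzer ninja  and then  %load_ext wurlitzer
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'   # stop at the exact point an error occurs; slow, but good for development

cuda_begin = r'''
#include <torch/extension.h>
#include <stdio.h>
#include <c10/cuda/CUDAException.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

// ceiling division: how many blocks of size b are needed to cover a elements
inline unsigned int cdiv(unsigned int a, unsigned int b) { return (a + b - 1) / b; }
'''

def load_cuda(cuda_src, cpp_src, funcs, opt=False, verbose=False):
    # thin wrapper around load_inline: compile the strings and return a Python module
    return load_inline(name="inline_ext", cpp_sources=[cpp_src], cuda_sources=[cuda_src],
                       functions=funcs, extra_cuda_cflags=["-O2"] if opt else [], verbose=verbose)
```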
Now how do you write the CUDA kernel? Well all I did and I recommend you do is take your 00:30:24.480 |
Python kernel and paste it into chat GPT and say convert this to equivalent C code using 00:30:33.240 |
the same names formatting etc where possible paste it in and chat GPT will do it for you. 00:30:40.840 |
Unless you're very comfortable with C which case just write it yourself is fine but this 00:30:44.840 |
way since you've already got the Python why not just do this? It basically was pretty 00:30:52.680 |
much perfect I found although it did assume that these were floats they're actually not 00:31:00.280 |
floats we had to change a couple of data types but basically I was able to use it almost 00:31:04.640 |
as is and so particularly you know for people who are much more Python programmers nowadays 00:31:12.040 |
like me this is a nice way to write 95 percent of the code you need. What else do we have 00:31:21.360 |
to change? Well, as we saw in our picture earlier, it's not called blockIdx, it's called blockIdx.x, 00:31:31.360 |
and likewise blockDim.x and threadIdx.x, so we have to add the .x there. Other than that if we compare 00:31:43.840 |
so as you can see these two pieces of code look nearly identical we've had to add data 00:31:48.000 |
types to them we've had to add semicolons we had to get rid of the colon we had to add 00:31:55.160 |
curly brackets that's about it. So it's not very different at all so if you haven't done 00:32:02.080 |
much C programming yeah don't worry about it too much because you know the truth is 00:32:08.520 |
actually it's not that different for this kind of calculation intensive work. One thing 00:32:17.300 |
we should talk about is this: what's unsigned char star? This is just how you write uint8 00:32:26.760 |
in C. If you're not sure how to translate a data type between the PyTorch 00:32:35.160 |
spelling and the C spelling you could ask chat GPT or you can Google it but this is 00:32:40.200 |
how you write byte. The star in practice it's basically how you say this is an array. So 00:32:47.280 |
this says that X is an array of bytes. It actually means it's a pointer but pointers are treated 00:33:00.000 |
as you can see here as arrays by C. So you don't really have to worry about the fact 00:33:06.040 |
that the pointer it just means for us that it's an array but in C the only kind of arrays 00:33:12.160 |
that it knows how to deal with these one-dimensional arrays and that's why we always have to flatten 00:33:17.160 |
things out okay. We can't use multi-dimensional tensors really directly in these CUDA kernels 00:33:22.560 |
in this way. So we're going to end up with these one-dimensional C arrays. Yeah other 00:33:28.280 |
than that it's going to look exactly in fact I mean even because we did our Python like 00:33:32.240 |
that it's going to look identical. The void here just means it doesn't return anything 00:33:38.400 |
and then the dunder global, __global__, here is a special thing added by CUDA. There are three things 00:33:46.360 |
that can appear and this simply says what should I compile this to do and so you can 00:33:53.700 |
put dunder device and that means compile it so that you can only call it on the GPU. You 00:34:00.080 |
can say dunder global and that says okay you can call it from the CPU or GPU and it will 00:34:06.720 |
run on the GPU or you can write write dunder host which you don't have to and that just 00:34:11.760 |
means it's a normal C or C++ program that runs on the CPU side. So anytime we want to 00:34:18.080 |
call something from the CPU side to run something on the GPU which is basically almost always 00:34:24.520 |
when we're doing kernels you write dunder global. So here we've got dunder global we've 00:34:32.080 |
got our kernel and that's it. So then we need the thing to call that kernel. So earlier 00:34:40.400 |
to call the kernel we called this block kernel function passed in the kernel and passed in 00:34:45.700 |
the blocks and threads and the arguments. With CUDA we don't have to use a special function 00:34:51.520 |
there is a weird special syntax built into CUDA to do it for us. To use the weird special 00:34:57.480 |
syntax you say okay what's the kernel the function that I want to call and then you 00:35:03.600 |
use these weird triple angle brackets. So the triple angle brackets is a special CUDA extension 00:35:11.840 |
to the C++ language and it means this is a kernel please call it on the GPU and between 00:35:20.200 |
the triple angle brackets there's a number of things you can pass but you have to pass 00:35:25.760 |
at least the first two things which is how many blocks how many threads. So how many 00:35:33.160 |
blocks ceiling division number of pixels divided by threads and how many threads as we said 00:35:40.800 |
before let's just pick 256 all the time and not worry about it. So that says call this 00:35:45.600 |
function as a GPU kernel and then passing in these arguments. We have to pass in our 00:35:52.680 |
input tensor, our output tensor and how many pixels. And you'll see that for each of these 00:35:58.480 |
tensors we have to use a special method .data pointer and that's going to convert it into 00:36:04.320 |
a C pointer to the tensor. So that's why by the time it arrives in our kernel it's a C 00:36:11.440 |
pointer. You also have to tell it what data type you want it to be treated as. This says 00:36:17.680 |
treat it as unsigned 8-bit integers. So this is a C++ template parameter here and this is a method. 00:36:30.040 |
The other thing you need to know is in C++ dot means call a method of an object or else 00:36:38.120 |
colon colon is basically like, in Python, calling a method of a class. So you don't say 00:36:45.760 |
torch dot empty you say torch colon colon empty to create our output or else back when we 00:36:51.920 |
did it in Python we said torch dot empty. Also, in Python, 00:37:01.880 |
we just created a length n vector and then did a .view. It doesn't really matter how 00:37:07.080 |
we do it but in this case we actually created a two-dimensional tensor bypassing. We pass 00:37:12.160 |
in this thing in curly brackets here this is called a C++ list initializer and it's 00:37:17.000 |
just basically a little list containing height comma width. So this tells it to create a 00:37:21.840 |
two-dimensional matrix which is why we don't need dot view at the end. We could have done 00:37:25.800 |
it the dot view way as well. Probably be better to keep it consistent but this is what I wrote 00:37:30.600 |
at the time. The other interesting thing when we create the output is if you pass in input 00:37:37.720 |
dot options so this is our input tensor that just says oh use the same data type and the 00:37:44.300 |
same device CUDA device as our input has. This is a nice really convenient way which 00:37:49.400 |
I don't even think we have in Python to say make sure that this is the same data type 00:37:54.880 |
in the same device. If you say auto here this is quite convenient you don't have to specify 00:38:01.240 |
what type this is. We could have written torch colon colon tensor but by writing auto it 00:38:06.320 |
just says figure it out yourself which is another convenient little C++ thing. After 00:38:13.040 |
we call the kernel if there's an error in it we won't necessarily get told so to tell 00:38:18.080 |
it to check for an error you have to write this. This is a macro that's again provided 00:38:23.900 |
by PyTorch. The details don't matter you should just always call it after you call a kernel 00:38:29.440 |
to make sure it works and then you can return the tensor that you allocated and then you 00:38:36.320 |
passed as a pointer and then that you filled in. Okay now as well as the CUDA source you 00:38:48.360 |
also need C++ source and the C++ source is just something that says here is a list of 00:38:55.560 |
all of the details of the functions that I want you to make available to the outside 00:39:02.000 |
world in this case Python and so this is basically your header effectively. So you can just copy 00:39:08.960 |
and paste the full line here from your function definition and stick a semicolon on the end. 00:39:16.640 |
So that's something you can always do and so then we call our load CUDA function that 00:39:21.400 |
we looked at earlier passing in the CUDA source code the C++ source code and then a list of 00:39:27.760 |
the names of the functions that are defined there that you want to make available to Python. 00:39:32.440 |
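Put together, the CUDA source, the C++ declaration and the call to load_cuda look roughly like this (a sketch reconstructed from the description above, building on the cuda_begin and load_cuda helpers sketched earlier; the real notebook may differ in details):

```python
cuda_src = cuda_begin + r'''
__global__ void rgb_to_grayscale_kernel(unsigned char* x, unsigned char* out, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;   // which pixel this thread handles
    if (i < n) out[i] = 0.2989*x[i] + 0.5870*x[i+n] + 0.1140*x[i+2*n];   // guard, then the work
}

torch::Tensor rgb_to_grayscale(torch::Tensor input) {
    CHECK_INPUT(input);
    int h = input.size(1);
    int w = input.size(2);
    auto output = torch::empty({h, w}, input.options());   // same dtype and device as the input
    int threads = 256;
    rgb_to_grayscale_kernel<<<cdiv(w*h, threads), threads>>>(
        input.data_ptr<unsigned char>(), output.data_ptr<unsigned char>(), w*h);
    C10_CUDA_KERNEL_LAUNCH_CHECK();                        // the "did the kernel fail?" macro
    return output;
}
'''

cpp_src = "torch::Tensor rgb_to_grayscale(torch::Tensor input);"

module = load_cuda(cuda_src, cpp_src, ['rgb_to_grayscale'])
```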
So we just have one which is the RGB2 grayscale function and believe it or not that's all you 00:39:38.840 |
have to do this will automatically you can see it running in the background now compiling 00:39:45.440 |
our files with a hugely long command line. So it's created a main.cpp for us and it's going 00:39:56.960 |
to put it into a main.o for us and compile everything up link it all together and create 00:40:04.540 |
a module and you can see here we then take that module it's been passed back and put 00:40:09.880 |
it into a variable called module and then when it's done it will load that module and 00:40:17.880 |
if we look inside the module that we just created you'll see now that apart from the 00:40:21.800 |
normal auto generated stuff Python adds it's got a function in it RGB2 grayscale okay so 00:40:28.840 |
that's amazing we now have a CUDA function that's been made available from Python and 00:40:35.040 |
we can even see if we want to this is where it put it all so we can have a look and there 00:40:45.480 |
it is you can see it's created a main.cpp it's compiled it into a main.o it's created a library 00:40:52.400 |
that we can load up it's created a CUDA file it's created a build script and we could have 00:40:58.840 |
a look at that build script if we wanted to and there it is so none of this matters too 00:41:06.240 |
much, it's just nice to know that PyTorch is doing all this stuff for us and we don't have to do it ourselves. 00:41:16.080 |
So in order to pass a tensor to this we're going to be checking that it's contiguous 00:41:23.420 |
and on CUDA so we'd better make sure it is so we're going to create an image C variable 00:41:29.360 |
which is the image made contiguous and put onto the CUDA device, and now we can 00:41:38.800 |
actually run this on the full sized image not on the tiny little minimized image we 00:41:43.360 |
created before. This has got many more pixels, it's got 1.7 million pixels, whereas before 00:41:50.360 |
we had, I think, around 34,000, and it's gone down from one and a half seconds to one 00:42:00.000 |
millisecond. So that is amazing: it's dramatically faster, both because it's now running as compiled code and because it's running in parallel on the GPU. 00:42:15.600 |
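Calling the compiled function is just ordinary Python; a sketch (in the notebook the call is wrapped in a %%timeit cell to measure it):

```python
imgc = img.contiguous().cuda()                    # full-size image: contiguous, and on the GPU

res = module.rgb_to_grayscale(imgc).cpu()         # .cpu() forces synchronization, so the timing is honest
print(res.shape)                                  # torch.Size([1066, 1600])
show_img(res, cmap='gray')
```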
The step of putting the data onto the GPU is not part of what we timed and that's probably 00:42:21.280 |
fair enough because normally you do that once and then you run a whole lot of CUDA operations on it. 00:42:28.020 |
We have though included the step of moving it off the GPU and putting it onto the CPU 00:42:33.680 |
as part of what we're timing and one key reason for that is that if we didn't do that it can 00:42:39.680 |
actually run our Python code at the same time that the CUDA code is still running and so 00:42:46.160 |
the amount of time shown could be dramatically less because it hasn't finished synchronizing 00:42:51.880 |
so by adding this it forces it to complete the CUDA run and to put the data back onto 00:43:01.240 |
the CPU that kind of synchronization you can also trigger this by printing a value from 00:43:06.920 |
it or you can synchronize it manually so after we've done that and we can have a look and 00:43:13.600 |
we should get exactly the same grayscale puppy. Okay, so we have successfully created our first CUDA kernel. 00:43:36.140 |
This approach of writing it in Python and then converting it to CUDA is not particularly 00:43:46.320 |
common but I'm not just doing it as an educational exercise that's how I like to write my CUDA 00:43:53.400 |
kernels at least as much of it as I can because it's much easier to debug in Python it's much 00:44:04.600 |
easier to see exactly what's going on and so and I don't have to worry about compiling 00:44:10.560 |
it takes about 45 or 50 seconds to compile even our simple example here I can just run 00:44:15.520 |
it straight away and once it's working to convert that into C as I mentioned you know 00:44:20.720 |
chatgpt can do most of it for us so I think this is actually a fantastically good way 00:44:27.320 |
of writing CUDA kernels even as you start to get somewhat familiar with them it's because 00:44:34.280 |
it lets you debug and develop much more quickly a lot of people avoid writing CUDA just because 00:44:43.860 |
that process is so painful and so here's a way that we can make that process less painful 00:44:49.360 |
so let's do it again and this time we're going to do it to implement something very important 00:44:56.000 |
which is matrix multiplication so matrix multiplication as you probably know is fundamentally critical 00:45:05.640 |
for deep learning it's like the most basic linear algebra operation we have and the way 00:45:12.680 |
it works is that you have a input matrix M and a second input matrix N and we go through 00:45:24.240 |
every row of M so we go through every row of M till we get to here we are up to this 00:45:30.520 |
one and every column of N and here we are up to this one and then we take the dot product 00:45:36.880 |
at each point of that row with that column and this here is the dot product of those 00:45:45.200 |
two things and that is what matrix multiplication is so it's a very simple operation conceptually 00:45:58.720 |
and it's one that we do many many many times in deep learning and basically every deep 00:46:04.560 |
learning every neural network has this is its most fundamental operation of course we don't 00:46:11.520 |
actually need to implement matrix multiplication from scratch because it's done for us in libraries 00:46:16.280 |
but we will often do things where we have to kind of fuse in some kind of matrix multiplication 00:46:22.400 |
like pieces, and so, you know, of course it's also just a good exercise. So let's take a 00:46:29.800 |
look at how to do matrix multiplication, first of all in pure Python. So actually, in 00:46:38.680 |
the fast.ai course that I mentioned, there's a very complete in-depth dive into matrix 00:46:45.000 |
multiplication in part two, lesson 11, where we spend like an hour or two talking about 00:46:52.120 |
nothing but matrix multiplication we're not going to go into that much detail here but 00:46:57.040 |
what we do do in that is we use the MNIST data set to do this, and so we're going 00:47:04.560 |
to do the same thing here we're going to grab the MNIST data set of handwritten digits and 00:47:11.280 |
they are 28 by 28 digits they look like this 28 by 28 is 784 so to do a you know to basically 00:47:21.320 |
do a single layer of a neural net or without the activation function we would do a matrix 00:47:29.400 |
multiplication of the image flattened out by a weight matrix with 784 rows and however 00:47:38.040 |
many columns we like, and if we're going to go straight to the output, 00:47:41.320 |
so this would be a linear function, a linear model, we'd have 10 columns, one for each digit. 00:47:46.080 |
So here are our weights. We're not actually going to do any learning here, this is 00:47:50.440 |
not deep learning or even logistic regression, it's just an example. Okay, so we've 00:47:56.680 |
got our weights and we've got our input our input data x train and x valid and so we're 00:48:06.360 |
going to start off by implementing this in Python now again Python's really slow so let's 00:48:11.960 |
make this smaller so matrix one will just be five rows matrix two will be all the weights 00:48:18.920 |
so that's going to be a 5 by 784 matrix multiplied by a 784 00:48:25.480 |
by 10 matrix. Now these two have to match of course, they have to match because otherwise 00:48:32.800 |
this product won't work those two are going to have to match the row by the column okay 00:48:40.560 |
so let's pull that out into a rows a columns b rows b columns and obviously a columns and 00:48:47.640 |
b rows are the things that have to match and then the output will be a rows by b columns 00:48:53.580 |
so five by ten so let's create an output fill of zeros with rows by columns in it and so 00:49:03.600 |
now we can go ahead and go through every row of a every column of b and do the dot product 00:49:11.360 |
which involves going through every item in the innermost dimension or 784 of them multiplying 00:49:17.880 |
together the equivalent things from m1 and m2 and summing them up into the output tensor 00:49:27.920 |
that we created so that's going to give us as we said a five by ten five by ten output 00:49:41.760 |
and here it is okay so this is how I always create things in python I basically almost 00:49:49.400 |
never have to debug I almost never have like errors unexpected errors in my code because 00:49:55.520 |
I've written every single line one step at a time in python I've checked them all as 00:50:00.000 |
they go and then I copy all the cells and merge them together stick a function header 00:50:04.200 |
on, like so, and so here is matmul. So this is exactly the code we've already seen, and we 00:50:10.600 |
can call it, and we'll see that for 39,200 innermost operations it took us about a second, so that's pretty slow. 00:50:26.600 |
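For reference, the assembled function looks roughly like this (a sketch with assumed variable names):

```python
import torch

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br                        # inner dimensions must match
    c = torch.zeros(ar, bc)
    for i in range(ar):                    # every row of a
        for j in range(bc):                # every column of b
            for k in range(ac):            # the dot product over the inner dimension
                c[i, j] += a[i, k] * b[k, j]
    return c

res = matmul(m1, m2)                       # m1: 5 x 784 sample of MNIST, m2: 784 x 10 weights
```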
Okay, so now that we've done that, you might not be surprised to hear that we 00:50:32.840 |
now need to do the innermost loop as a kernel call, in such a way that it can be run in 00:50:41.720 |
parallel now in this case the innermost loop is not this line of code it's actually this 00:50:50.400 |
line of code; I mean, we can choose it to be whatever we want it to be, but in this case this is 00:50:54.400 |
how we're going to do it: we're going to say that 00:50:58.560 |
every cell in the output tensor like this one here is going to be one CUDA thread so 00:51:07.040 |
one CUDA thread is going to do the dot product so this is the bit that does the dot product 00:51:14.200 |
so that'll be our kernel so we can write that matmul block kernel is going to contain that 00:51:25.680 |
okay so that's exactly the same thing that we just copied from above and so now we're 00:51:31.520 |
going to need a something to run this kernel and you might not be surprised to hear that 00:51:41.400 |
in CUDA we are going to call this using blocks and threads but something that's rather handy 00:51:50.240 |
in CUDA is that the blocks and threads don't have to be just a 1d vector they can be a 00:51:58.440 |
2d or even 3d tensor so in this case you can see we've got one two a little hard to see 00:52:09.080 |
exactly where they stop two three four blocks and so then for each block that's kind of 00:52:23.400 |
in one dimension and then there's also one two three four five blocks in the other dimension 00:52:35.600 |
and so each of these blocks has an index so this one here is going to be zero zero a little 00:52:45.760 |
bit hard to see this one here is going to be one three and so forth and this one over 00:52:52.720 |
here is going to be three four so rather than just having a integer block index we're going 00:53:06.500 |
to have a tuple block index and then within a block there's going to be to pick let's 00:53:19.160 |
say this exact spot here didn't do that very well there's going to be a thread index and 00:53:30.840 |
again the thread index won't be a single index into a vector, it'll be two elements. So in 00:53:37.480 |
this case it would be something like 6 rows down and 12 00:53:52.440 |
across. So this here is actually going to be defined by two things: 00:53:59.120 |
one is by the block and so the block is 3 comma 4 and the thread is 6 comma 12 so that's how 00:54:22.080 |
CUDA lets us index into two-dimensional grids using blocks and threads we don't have to 00:54:31.880 |
it's just a convenience if we want to and in fact it can we can use up to three dimensions 00:54:41.640 |
so to create our kernel runner now, rather than just having 00:54:50.640 |
two nested loops for blocks and threads we're going to have to have two lots of two nested 00:54:58.360 |
loops for our both of our X and Y blocks and threads or our rows and columns blocks and 00:55:06.920 |
threads so it ends up looking a bit messy because we now have four nested for loops 00:55:16.720 |
so we'll go through our blocks on the Y axis and then through our blocks on the X axis 00:55:22.320 |
and then through our threads on the Y axis and then through our threads on the X axis 00:55:27.040 |
and so what that means is that for you can think of this Cartesian product as being for 00:55:32.040 |
each block, for each thread. Now to get the .y and the .x we'll use this handy little 00:55:40.320 |
Python standard library thing called SimpleNamespace. I use that so much I just give 00:55:44.400 |
it the name ns, because I use namespaces all the time in my quick and dirty code. So we 00:55:50.240 |
go through all those four we then call our kernel and we pass in an object containing 00:55:58.320 |
the Y and X coordinates and that's going to be our block and we also pass in our thread 00:56:09.600 |
which is an object with the Y and X coordinates of our thread and it's going to eventually 00:56:17.280 |
do all possible blocks and all possible threads numbers for each of those blocks and we also 00:56:24.240 |
need to tell it how big is each block how how high and how wide and so that's what this 00:56:30.240 |
is this is going to be a simple namespace and object with an X and Y as you can see 00:56:36.400 |
so I need to know how big they are just like earlier on we had to know the block dimension 00:56:43.920 |
that's why we passed in threads so remember this is all pure PyTorch we're not actually 00:56:50.480 |
calling any out to any CUDA we're not calling out to any libraries other than just a tiny 00:56:54.800 |
bit of PyTorch for the indexing and tensor creation, so you can run all of this by hand, 00:57:00.720 |
make sure you understand it, put it in the debugger, and step through it. 00:57:06.720 |
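The two-dimensional runner, as a sketch (using SimpleNamespace as ns, as described above; names are approximate):

```python
from types import SimpleNamespace as ns

def blk_kernel2d_run(f, blocks, threads, *args):
    # four nested loops: blocks then threads, in both the y and x dimensions
    for i0 in range(blocks.y):
        for i1 in range(blocks.x):
            for j0 in range(threads.y):
                for j1 in range(threads.x):
                    f(ns(x=i1, y=i0), ns(x=j1, y=j0), threads, *args)
```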
And it's going to call our function. So here's our matrix multiplication function; as we said 00:57:10.760 |
it's a kernel that contains the dot product that we wrote earlier so now the guard is 00:57:16.840 |
going to have to check that the row number we're up to is not taller than we have and 00:57:23.520 |
the column number we're up to is not wider than we have and we also need to know what 00:57:27.400 |
row number we're up to and this is exactly the same actually I should say the column 00:57:32.760 |
is exactly the same as we've seen before and in fact you might remember in the CUDA we 00:57:36.880 |
had blockIdx.x, and this is why, right, because CUDA always gives you these 00:57:44.320 |
three-dimensional dim three structures so you have to put this dot x so we can find 00:57:52.560 |
out the column this way and then we can find out the row by seeing how many blocks have 00:57:58.760 |
we gone through how big is each block in the y-axis and how many threads have we gone through 00:58:03.800 |
in the y-axis. So: which row number are we up to, what column number are we up to, is that 00:58:09.840 |
inside the bounds of our tensor? If not, then just stop; otherwise do our dot product and put it into our output tensor. 00:58:21.360 |
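As a sketch, that kernel might be written like this (names assumed; m, n and out arrive as flattened 1D views):

```python
def matmul_bk(blockidx, threadidx, blockdim, m, n, out, h, w, k):
    r = blockidx.y*blockdim.y + threadidx.y    # which output row this thread handles
    c = blockidx.x*blockdim.x + threadidx.x    # which output column
    if r >= h or c >= w:                       # the 2D guard
        return
    o = 0.
    for i in range(k):                         # the dot product
        o += m[r*k + i] * n[i*w + c]
    out[r*w + c] = o
```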
So that's all pure Python. And so now we can call it 00:58:29.960 |
by getting the height and width of our first input the height and width of our second input 00:58:36.640 |
and so then K and K2 the inner dimensions ought to match we can then create our output 00:58:43.860 |
and so now threads per block is not just the number 256 but it's a pair of numbers it's 00:58:50.360 |
an x and a y and we've selected two numbers that multiply together to create 256 so again 00:58:55.960 |
this is a reasonable choice if you've got two dimensional inputs to spread it out nicely 00:59:04.940 |
one thing to be aware of here is that your threads per block can't be bigger than 1024 00:59:16.320 |
so we're using 256 which is safe right and notice that you have to multiply these together 00:59:20.920 |
16 times 16 is going to be the number of threads per block so this is a these are safe numbers 00:59:26.600 |
to use you're not going to run out of blocks though 2 to the 31 is the number of maximum 00:59:32.520 |
blocks for dimension 0 and then 2 to the 16 for dimensions 1 and 2 I think it's actually 00:59:37.920 |
minus 1, but don't worry about that. So don't have too many threads, but you can have 00:59:43.520 |
lots of blocks. But of course each streaming multiprocessor is going to run all of these 00:59:49.160 |
on the same device and they're also going to have access to shared memory so that's 00:59:53.880 |
why you use a few threads per block so our blocks the x we're going to use the ceiling 01:00:01.520 |
division the y we're going to use the same ceiling division so if any of this is unfamiliar 01:00:06.880 |
go back to our earlier example because the code's all copied from there and now we can 01:00:10.840 |
call our 2D block kernel runner passing in the kernel the number of blocks the number 01:00:16.600 |
of threads per block our input matrices flattened out our output matrix flattened out and the 01:00:23.360 |
dimensions that it needs because they get all used here and return the result and so 01:00:33.080 |
if we call that matmul with a 2D block and we can check that they are close to what we 01:00:39.480 |
got in our original manual loops and of course they are because it's running the same code 01:00:46.680 |
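For completeness, the calling function sketched in Python (building on the blk_kernel2d_run and matmul_bk sketches above; exact names are assumptions):

```python
import math, torch
from types import SimpleNamespace as ns

def matmul_2d(m, n):
    h, k = m.shape
    k2, w = n.shape
    assert k == k2, "Size mismatch!"
    output = torch.zeros(h, w, dtype=m.dtype)
    tpb = ns(x=16, y=16)                                       # 16*16 = 256 threads per block
    blocks = ns(x=math.ceil(w/tpb.x), y=math.ceil(h/tpb.y))    # ceiling division in each dimension
    blk_kernel2d_run(matmul_bk, blocks, tpb,
                     m.flatten(), n.flatten(), output.flatten(), h, w, k)
    return output

torch.isclose(matmul_2d(m1, m2), matmul(m1, m2)).all()         # same answer as the triple loop
```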
so now that we've done that we can do the CUDA version now the CUDA version is going 01:00:54.800 |
to be so much faster we do not need to use this slimmed down matrix anymore we can use 01:01:05.560 |
the whole thing so to check that it's correct I want a fast CPU-based approach that I can 01:01:12.080 |
compare to. So previously it took about a second to do 39,000 elements, so I'm not going to 01:01:21.680 |
explain how this works but I'm going to use a broadcasting approach to get a fast CPU-based 01:01:26.040 |
approach if you check the fast AI course we teach you how to do this broadcasting approach 01:01:31.400 |
but it's a pure Python approach which manages to do it all in a single loop rather than 01:01:36.080 |
three nested loops it gives the same answer for the cut down tensors but much faster only 01:01:50.920 |
four milliseconds so it's fast enough that we can now run it on the whole input matrices 01:02:00.920 |
and it takes about 1.3 seconds and so this broadcast optimized version as you can see 01:02:06.880 |
it's much faster and now we've got 392 million additions going on in the middle of our three 01:02:14.720 |
loops, effectively three loops, but we're broadcasting them, so this is much faster. But the reason 01:02:19.960 |
I'm really doing this is so that we can store this result to compare against, to make sure that our CUDA version is correct. 01:02:27.520 |
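One way to write such a broadcasting version (not necessarily the exact code in the notebook) is:

```python
import torch

def matmul_bc(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    assert ac == br
    c = torch.zeros(ar, bc, dtype=a.dtype)
    for i in range(ar):
        # a[i, :, None] has shape (ac, 1); multiplying by b broadcasts across b's columns,
        # and summing over dim 0 produces the whole output row at once
        c[i] = (a[i, :, None] * b).sum(dim=0)
    return c
```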
Okay, so how do we convert this to CUDA? You might not be 01:02:35.120 |
surprised to hear that what I did was I grabbed this function and I passed it over to chat 01:02:39.920 |
GPT and said please rewrite this in C and it gave me something basically that I could 01:02:45.480 |
use first time and here it is. This time I don't have unsigned char star, I have float star; 01:02:54.000 |
other than that this looks almost exactly like the Python we had with exactly the same 01:03:01.980 |
changes we saw before we've now got the dot Y and dot X versions once again we've got 01:03:09.480 |
dunder global, __global__, which says please run this on the GPU when we call it from the CPU. So 01:03:14.720 |
for the CUDA kernel I don't think there's anything more to talk about there, and then the 01:03:18.680 |
thing that calls the kernel is going to be passed in two tensors. We're going to check 01:03:24.240 |
that they're both contiguous and check that they are on the CUDA device; we'll grab the 01:03:29.880 |
height and width of the first and second tensors; we're going to grab the inner dimension; we'll 01:03:37.120 |
make sure that the inner dimensions of the two matrices match just like before and this 01:03:43.020 |
is how you do an assertion in PyTorch CUDA code: you call TORCH_CHECK, pass in anything to 01:03:50.080 |
check, pass in the message to pop up if there's a problem. These are a really good thing 01:03:53.840 |
to spread around all through your CUDA code to make sure that everything is as you thought it was going to be. 01:04:01.460 |
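For example, the checks being described look roughly like this (the messages here are my own wording, for two input tensors m and n):

```cpp
// Illustrative TORCH_CHECK assertions: condition first, then the message to raise on failure.
TORCH_CHECK(m.is_contiguous() && n.is_contiguous(), "inputs must be contiguous");
TORCH_CHECK(m.device().is_cuda() && n.device().is_cuda(), "inputs must be CUDA tensors");
TORCH_CHECK(m.size(1) == n.size(0), "inner dimensions must match");
```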
Just like before, we create an output. So now, when we create the number of 01:04:07.800 |
threads, we don't say threads = 256; instead we use a special thing provided by CUDA 01:04:15.320 |
for us, dim3. This is basically a tuple with three elements, so we're going to 01:04:21.040 |
create a dim3 called TPB; it's going to be 16 by 16. Now, I said it has three elements, so 01:04:28.640 |
where's the third one? That's okay, it just treats the third one as being one, so we 01:04:33.640 |
can just ignore it. So that's the number of threads per block. And then how many blocks will there 01:04:40.840 |
be well in the X dimension it'll be W divided by X ceiling division in the Y dimension it 01:04:50.120 |
will be H divided by Y, ceiling division, and so that's the number of blocks 01:04:56.080 |
we have so just like before we call our kernel just by calling it like a normal function 01:05:03.120 |
but then we add this weird triple angle bracket thing telling it how many blocks and how many 01:05:08.600 |
threads. So these aren't ints anymore; these are now dim3 structures, and those are what 01:05:08.600 |
we use. And in fact, even before, what actually happened behind 01:05:16.660 |
the scenes when we did the grayscale thing is that even though we passed in 256 as an int, we 01:05:32.960 |
actually ended up with a dim3 structure, in which case the elements at index 01:05:40.920 |
one and two (the .y and .z values) were just set to one automatically. So we've 01:05:47.440 |
actually already used a dim3 structure without quite realizing it. 01:05:55.080 |
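In other words, something like this (purely illustrative):

```cpp
// Passing a plain int where a launch configuration is expected is equivalent to
// constructing a dim3 whose .y and .z default to 1.
dim3 tpb(256);   // tpb.x == 256, tpb.y == 1, tpb.z == 1
```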
And then, just like before, we pass in all of the tensors we want, casting them (or maybe 01:06:01.640 |
not casting exactly, converting them) to pointers of a particular data type, and passing 01:06:07.240 |
in any other information that the kernel will need. 01:06:14.720 |
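Putting the host side together, here is a rough sketch of what such a wrapper might look like; the names are assumptions and it leans on the matmul_k kernel and cdiv helper sketched above, so treat it as an outline rather than the notebook's exact code:

```cpp
#include <torch/extension.h>
#include <c10/cuda/CUDAException.h>

torch::Tensor matmul(torch::Tensor m, torch::Tensor n) {
    TORCH_CHECK(m.is_contiguous() && n.is_contiguous(), "inputs must be contiguous");
    TORCH_CHECK(m.device().is_cuda() && n.device().is_cuda(), "inputs must be CUDA tensors");
    int h = m.size(0), k = m.size(1), w = n.size(1);
    TORCH_CHECK(k == n.size(0), "inner dimensions must match");
    auto out = torch::zeros({h, w}, m.options());     // output on the same device and dtype as m
    dim3 tpb(16, 16);                                 // threads per block; .z defaults to 1
    dim3 blocks(cdiv(w, tpb.x), cdiv(h, tpb.y));      // ceiling division in each dimension
    matmul_k<<<blocks, tpb>>>(
        m.data_ptr<float>(), n.data_ptr<float>(), out.data_ptr<float>(), h, w, k);
    C10_CUDA_KERNEL_LAUNCH_CHECK();                   // surface any kernel-launch error
    return out;
}
```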
Okay, so then we call load CUDA again; that'll compile this into a module. We make sure the inputs are both 01:06:22.520 |
contiguous and on the CUDA device, and then we call module.matmul, passing those 01:06:29.320 |
in, putting the result on the CPU and checking that they're all close, and it says yes, they are. So 01:06:36.320 |
this is now running not on just the first five rows but on the entire MNIST data set 01:06:41.440 |
and on the entire MNIST data set, using an optimized CPU approach, it took 1.3 seconds; using CUDA 01:06:51.840 |
it takes six milliseconds so that is quite a big improvement cool the other thing I will 01:07:04.920 |
mention, of course, is that PyTorch can do a matrix multiplication for us just by using @, 01:07:11.200 |
and obviously it gives the same answer. How long does that take to run? That takes two 01:07:17.720 |
milliseconds so three times faster and in many situations it'll be much more than three 01:07:25.940 |
times faster so why are we still pretty slow compared to PyTorch I mean this isn't bad 01:07:32.080 |
to do 392 million of these calculations in six milliseconds but if PyTorch can do it 01:07:38.480 |
so much faster what are they doing well the trick is that they are taking advantage in 01:07:46.700 |
particular of this shared memory so shared memory is a small memory space that is shared 01:07:53.960 |
amongst the threads in a block and it is much faster than global memory in our matrix multiplication 01:08:02.460 |
when we have one of these blocks and so it's going to do one block at a time all in the 01:08:07.160 |
same SM it's going to be reusing the same 16 by 16 block it's going to be using the 01:08:14.760 |
same 16 rows and columns again and again and again each time with access to the same shared 01:08:20.060 |
memory. So you can see how you could potentially cache a lot of 01:08:25.400 |
the information you need and reuse it, rather than going back to the slower memory again 01:08:30.680 |
and again. So this is an example of the kinds of things that you could potentially optimize once you get to that point. 01:08:37.040 |
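To make that concrete, here is a rough sketch of the classic shared-memory tiling idea. This is my own illustration of the technique, not what PyTorch actually runs (its matrix multiply goes through highly tuned cuBLAS kernels):

```cpp
#define TILE 16

// Tiled matmul sketch: each block cooperatively loads a TILE x TILE tile of m and of n
// into shared memory, then every thread in the block reuses those tiles, so each value
// is read from slow global memory once per tile instead of once per output element.
__global__ void matmul_tiled_k(float* m, float* n, float* out, int h, int w, int k) {
    __shared__ float ms[TILE][TILE];
    __shared__ float ns[TILE][TILE];
    int r = blockIdx.y * TILE + threadIdx.y;
    int c = blockIdx.x * TILE + threadIdx.x;
    float o = 0;
    for (int t = 0; t < k; t += TILE) {
        // each thread loads one element of each tile, zero-padding past the edges
        ms[threadIdx.y][threadIdx.x] =
            (r < h && t + threadIdx.x < k) ? m[r * k + t + threadIdx.x] : 0.0f;
        ns[threadIdx.y][threadIdx.x] =
            (c < w && t + threadIdx.y < k) ? n[(t + threadIdx.y) * w + c] : 0.0f;
        __syncthreads();                          // wait until the whole tile is loaded
        for (int i = 0; i < TILE; ++i)
            o += ms[threadIdx.y][i] * ns[i][threadIdx.x];
        __syncthreads();                          // wait before the tile gets overwritten
    }
    if (r < h && c < w) out[r * w + c] = o;
}
```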
The only other thing that I wanted to mention here is that 01:08:46.720 |
this 2D block idea is totally optional you can do everything with 1D blocks or with 2D 01:08:55.000 |
blocks or with 3D blocks and threads and just to show that I've actually got an example 01:09:00.320 |
at the end here which converts RGB to grayscale using the 2D blocks because remember earlier 01:09:11.920 |
when we did this it was with 1D blocks. It gives exactly the same result and if we compare 01:09:21.160 |
the code, the version that was actually done with 1D threads and 01:09:32.660 |
blocks is quite a bit shorter than the version that uses 2D threads and blocks and so in 01:09:38.320 |
this case, even though we're manipulating pixels, where you might think that using the 01:09:43.440 |
2D approach would be neater and more convenient in this particular case it wasn't really I 01:09:50.160 |
mean, it's still pretty simple code, but we have to deal with the column and row .x and .y values 01:09:56.560 |
separately, the guard is a little bit more complex, and we have to work out what index we're actually 01:10:02.840 |
up to here, whereas this kernel was just much more direct, just two lines of 01:10:09.400 |
code and then calling the kernel you know again it's a little bit more complex with 01:10:13.480 |
the threads-per-block stuff rather than this, but the key thing I wanted to point out is 01:10:18.080 |
that these two pieces of code do exactly the same thing. So if you don't 01:10:25.520 |
want to use a 2D or 3D block and thread structure, you don't have to. You can just use a 1D one; 01:10:32.720 |
the 2D stuff is only there if it's convenient for you to use and you want to use it. 01:10:44.040 |
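For comparison, a 2D-indexed grayscale kernel might look something like this; it's a sketch with assumed names, and it assumes the same channel-first (all red, then all green, then all blue) flattened layout discussed earlier:

```cpp
// RGB-to-grayscale with a 2D grid: one thread per pixel, guarding both dimensions.
// x points to h*w*3 bytes (R plane, then G plane, then B plane); out points to h*w bytes.
__global__ void rgb_to_grayscale_2d_k(unsigned char* x, unsigned char* out, int h, int w) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int r = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (c >= w || r >= h) return;                    // 2D guard
    int i = r * w + c;                               // flat pixel index
    int n = h * w;                                   // stride between colour planes
    out[i] = 0.2989f * x[i] + 0.5870f * x[i + n] + 0.1140f * x[i + 2 * n];
}
```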
So yeah I think that's basically like all the key things that I wanted to show you all 01:10:49.360 |
today. The main thing I hope you take from this is that even for Python programmers for 01:10:56.100 |
data scientists it's not way outside our comfort zone you know we can write these things in 01:11:04.600 |
Python we can convert them pretty much automatically we end up with code that doesn't look you 01:11:11.160 |
know it looks reasonably familiar even though it's now in a different language we can do 01:11:15.880 |
everything inside notebooks we can test everything as we go we can print things from our kernels 01:11:23.760 |
and so you know it's hopefully feeling a little bit less beyond our capabilities than we might 01:11:34.500 |
have previously imagined. So I'd say yeah you know go for it I think it's also like I think 01:11:41.840 |
it's increasingly important to be able to write CUDA code nowadays because for things 01:11:47.640 |
like flash attention, or for things like quantization (GPTQ, AWQ, bitsandbytes); these are all things 01:11:57.320 |
you can't write in PyTorch. You know our models are getting more sophisticated the kind of 01:12:03.560 |
assumptions that libraries like PyTorch make about what we want to do are, you know, increasingly 01:12:10.880 |
less and less accurate so we're having to do more and more of this stuff ourselves nowadays 01:12:15.320 |
in CUDA and so I think it's a really valuable capability to have. Now the other thing I 01:12:23.640 |
mentioned is we did it all in CoLab today but we can also do things on our own machines 01:12:32.200 |
if you have a GPU or on a cloud machine and getting set up for this again it's much less 01:12:39.440 |
complicated than you might expect, and in fact I can show you; it's basically 01:12:45.800 |
three or four lines of bash script to get it all set up. It'll 01:12:50.360 |
run on Windows under WSL, and it'll also run on Linux of course. CUDA stuff doesn't really 01:12:55.720 |
work on Mac, so not on Mac. Actually, I'll put a link 01:13:04.480 |
to this in the video notes, but for now I'm just going to jump to a Twitter thread where 01:13:11.640 |
I wrote this all down to show you all the steps. So the way to do it is to use something 01:13:19.640 |
called Conda. Conda is something that very very very few people understand a lot of people 01:13:26.080 |
think it's like a replacement for like pip or poetry or something it's not it's better 01:13:31.440 |
to think of it as a replacement for docker. You can literally have multiple different 01:13:35.560 |
versions of Python multiple different versions of CUDA multiple different C++ compilation 01:13:41.600 |
systems all in parallel at the same time on your machine and switch between them you can 01:13:48.680 |
only do this with Conda and everything just works right so you don't have to worry about 01:13:55.440 |
all the confusing stuff around .run files or Ubuntu packages or anything like that you 01:14:00.920 |
can do everything with just Conda. You need to install Conda; I've actually got a script 01:14:09.000 |
which you just run, it's a tiny script as you can see, and if you just run the script 01:14:12.600 |
it'll automatically figure out which Miniconda you need, it'll automatically figure 01:14:17.120 |
out what shell you're on and it'll just go ahead and download it and install it for you. 01:14:21.200 |
Okay so run that script restart your terminal now you've got Conda. Step two is find out 01:14:30.720 |
what version of CUDA PyTorch wants you to have so if I click Linux Conda CUDA 12.1 is 01:14:38.920 |
the latest. So then step three is to run this shell command, replacing 12.1 with whatever 01:14:50.000 |
CUDA version the current PyTorch wants (it's actually still 12.1 for me at this point) and that'll 01:14:56.280 |
install everything all the stuff you need to profile debug build etc all the nvidia 01:15:04.600 |
tools you need the full suite will all be installed and it's coming directly from nvidia 01:15:08.720 |
so you'll have like the proper versions. As I said, you can have multiple versions 01:15:13.480 |
stored at once in different environments, no problem at all. And then finally, install PyTorch, 01:15:21.300 |
and this command here will install PyTorch. For some reason I wrote nightly here; you don't 01:15:25.600 |
need the nightly, so just remove 'nightly'. This will install the latest version of 01:15:29.520 |
PyTorch using the nvidia CUDA stuff that you just installed if you've used Conda before 01:15:36.240 |
and it was really slow that's because it used to use a different solver which was thousands 01:15:42.160 |
or tens of thousands of times slower than the modern one, which has been added and made 01:15:46.880 |
default in the last couple of months so nowadays this should all run very fast and as I said 01:15:53.960 |
it'll run under WSL on Windows it'll run on Ubuntu it'll run on Fedora it'll run on Debian 01:16:00.720 |
it'll all just work so that's how I strongly recommend getting yourself set up for local 01:16:11.760 |
development you don't need to worry about using Docker as I said you can switch between different 01:16:17.800 |
CUDA versions different Python versions different compilers and so forth without having to worry 01:16:22.320 |
about any of the Docker stuff and it's also efficient enough that if you've got the same 01:16:27.400 |
libraries and so forth installed in multiple environments it'll hard link them so it won't 01:16:32.760 |
even use additional hard drive space so it's also very efficient great so that's how you 01:16:40.880 |
can get started on your own machine or on the cloud or whatever so hopefully you'll 01:16:45.800 |
find that helpful as well alright thanks very much for watching I hope you found this useful 01:16:54.040 |
and I look forward to hearing about what you create with CUDA in terms of going to the 01:17:01.960 |
next steps check out the other CUDA mode lectures I will link to them and I would also recommend 01:17:09.760 |
trying out some projects of your own so for example you could try to implement something 01:17:17.760 |
like 4-bit quantization or flash attention or anything like that now those are kind of 01:17:26.000 |
pretty big projects but you can try to break them up into smaller things you build up one 01:17:30.200 |
step at a time and of course look at other people's code so look at the implementation 01:17:37.440 |
of flash attention, look at the implementation of bitsandbytes, look at the implementation 01:17:41.880 |
of GPTQ and so forth the more time you spend reading other people's code the better alright 01:17:50.800 |
I hope you found this useful and thank you very much for watching