back to index

Getting Started With CUDA for Python Programmers


Chapters

0:0 Introduction to CUDA Programming
0:32 Setting Up the Environment
1:43 Recommended Learning Resources
2:39 Starting the Exercise
3:26 Image Processing Exercise
6:8 Converting RGB to Grayscale
7:50 Understanding Image Flattening
11:4 Executing the Grayscale Conversion
12:41 Performance Issues and Introduction to CUDA Cores
14:46 Understanding Cuda and Parallel Processing
16:23 Simulating Cuda with Python
19:4 The Structure of Cuda Kernels and Memory Management
21:42 Optimizing Cuda Performance with Blocks and Threads
24:16 Utilizing Cuda's Advanced Features for Speed
26:15 Setting Up Cuda for Development and Debugging
27:28 Compiling and Using Cuda Code with PyTorch
28:51 Including Necessary Components and Defining Macros
29:45 Ceiling Division Function
30:10 Writing the CUDA Kernel
32:19 Handling Data Types and Arrays in C
33:42 Defining the Kernel and Calling Conventions
35:49 Passing Arguments to the Kernel
36:49 Creating the Output Tensor
38:11 Error Checking and Returning the Tensor
39:1 Compiling and Linking the Code
40:6 Examining the Compiled Module and Running the Kernel
42:57 Cuda Synchronization and Debugging
43:27 Python to Cuda Development Approach
44:54 Introduction to Matrix Multiplication
46:57 Implementing Matrix Multiplication in Python
50:39 Parallelizing Matrix Multiplication with Cuda
51:50 Utilizing Blocks and Threads in Cuda
58:21 Kernel Execution and Output
58:28 Introduction to Matrix Multiplication with CUDA
60:1 Executing the 2D Block Kernel
60:51 Optimizing CPU Matrix Multiplication
62:35 Conversion to CUDA and Performance Comparison
67:50 Advantages of Shared Memory and Further Optimizations
68:42 Flexibility of Block and Thread Dimensions
70:48 Encouragement and Importance of Learning CUDA
72:30 Setting Up CUDA on Local Machines
72:59 Introduction to Conda and its Utility
74:0 Setting Up Conda
74:32 Configuring Cuda and PyTorch with Conda
75:35 Conda's Improvements and Compatibility
76:5 Benefits of Using Conda for Development
76:40 Conclusion and Next Steps

Whisper Transcript | Transcript Only Page

00:00:00.000 | Hi there. I'm Jeremy Howard from answer.ai and this is Getting Started with CUDA. CUDA
00:00:10.160 | is of course what we use to program NVIDIA GPUs if we want them to go super fast and
00:00:16.080 | we want maximum flexibility and it has a reputation of being very hard to get started with. The
00:00:23.720 | truth is it's actually not so bad. You just have to know some tricks and so in this video
00:00:29.720 | I'm going to show you some of those tricks. So let's switch to the screen and take a look.
00:00:37.020 | So I'm going to be doing all of the work today in notebooks. This might surprise you. You
00:00:42.180 | might be thinking that to do work with CUDA we have to do stuff with compilers and terminals
00:00:46.760 | and things like that and the truth is actually it turns out we really don't thanks to some
00:00:52.320 | magic that is provided by PyTorch. You can follow along in all of these steps and I strongly
00:00:59.440 | suggest you do so in your own computer. You can go to the CUDA mode organization in GitHub.
00:01:08.560 | Find the lecture 2 repo there and you'll see there is a lecture 3 folder. This is lecture
00:01:15.120 | 3 of the CUDA mode series. You don't need to have seen any of the previous ones however
00:01:19.800 | to follow along. In the read me there you'll see there's a lecture 3 section and at the
00:01:26.040 | bottom there is a click to go to the colab version. Yep you can run all of this in colab
00:01:32.880 | for free. You don't even have to have a GPU available to run the whole thing. We're going
00:01:39.680 | to be following along with some of the examples from this book programming massively parallel
00:01:46.480 | processes is a really great book to read and once you've completed today's lesson you should
00:01:59.960 | be able to make a great start on this book. It goes into a lot more details about some
00:02:04.400 | of the things that we're going to cover on fairly quickly. It's okay if you don't have
00:02:09.400 | the book but if you want to go deeper I strongly suggest you get it and in fact you'll see
00:02:16.920 | in the repo that lecture 2 in this series actually was a deep dive into chapters 1-3
00:02:22.640 | of that book and so actually you might want to do lecture 2 confusingly enough after this
00:02:26.880 | one lecture 3 to get more details about some of what we're talking about. Okay so let's
00:02:34.560 | dive into the notebook. So what we're going to be doing today is we're going to be doing
00:02:40.560 | a whole lot of stuff with plain old PyTorch first to make sure that we get all the ideas
00:02:46.320 | and then we will try to convert each of these things into CUDA. So in order to do this we're
00:02:54.600 | going to start by importing a bunch of stuff in fact let's do all of this in colab. So
00:03:00.440 | here we are in colab and you should make sure that you set in colab your runtime to the
00:03:07.200 | T4 GPU. That's one you can use plenty of for free and it's easily good enough to run everything
00:03:13.400 | we're doing today. And once you've got that running we can import the libraries we're
00:03:19.320 | going to need and we can start on our first exercise. So the first exercise actually comes
00:03:25.360 | from chapter 2 of the book and chapter 2 of the book teaches how to do this problem which
00:03:33.120 | is converting an RGB color picture into a grayscale picture. And it turns out that the
00:03:38.640 | recommended formula for this is to take 0.21 of the red pixel, 0.72 of the green pixel,
00:03:44.400 | 0.07 of the blue pixel and add them up together and that creates the luminance value which
00:03:51.000 | is what we're seeing here. That's a common way, kind of the standard way to go from RGB
00:03:55.960 | to grayscale. So we're going to do this, we're going to make a CUDA kernel to do this. So
00:04:02.600 | the first thing we're going to need is a picture and anytime you need a picture I recommend
00:04:06.860 | going for a picture of a puppy. So we've got here a URL to a picture of a puppy so we'll
00:04:13.520 | just go ahead and download it and then we can use torchvision.io to load that. So this
00:04:22.560 | is already part of Colab. If you're interested in running stuff on your own machine or a
00:04:28.280 | server in the cloud I'll show you how to set that up at the end of this lecture. So let's
00:04:34.000 | read in the image and if we have a look at the shape of it it says it's 3 by 1066 by
00:04:40.280 | 1600. So I'm going to assume that you know the basics of PyTorch here. If you don't
00:04:46.960 | know the basics of PyTorch I am a bit biased but I highly recommend my course which covers
00:04:54.240 | exactly that. You can go to course.fast.ai and you get the benefit also of having seen
00:05:00.480 | some very cute bunnies and along with the very cute bunnies it basically takes you through
00:05:06.680 | all of everything you need to be an effective practitioner of modern deep learning. So finish
00:05:14.920 | part one if you want to go right into those details but even if you just do the first
00:05:19.840 | two or three lessons that will give you more than enough you need to know to understand
00:05:25.560 | this kind of code and these kinds of outputs. So I'm assuming you've done all that. So you'll
00:05:31.160 | see here we've got a rank 3 tensor. There are three channels so they're like the faces of
00:05:38.360 | a cube if you like. There are 1066 rows on each face so that's the height and then there
00:05:44.240 | are 16 columns in each row so that's the width. So if we then look at the first couple of
00:05:50.720 | channels and the first three rows and the first four columns you can see here that these are
00:05:58.720 | unsigned 8-bit integers so they're bytes and so here they are. So that's what an image looks
00:06:03.600 | like. Hopefully you know all that already. So let's take a look at our image. To do that
00:06:10.840 | I'm just going to create a simple little function show image that will create a mat lip plot
00:06:16.720 | mat lip lip plot plot remove the axes if it's color which this one is it will change the
00:06:25.200 | order of the axes from channel by height by width which is what PyTorch uses to height
00:06:31.440 | by width by channel which is what matplotlib I'm having trouble today expects. So we change
00:06:40.360 | the order of the axes to be 1, 2, 0 and then we can show the image putting it on the CPU
00:06:47.360 | if necessary. Now we're going to be working with this image in Python which is going to
00:06:52.600 | be just pure Python to start with before we switch to CUDA that's going to be really slow
00:06:56.720 | so we'll resize it to have the smallest length smallest dimension be 150 so that's the height
00:07:05.640 | in this case so we end up with a 150 by 225 shape which is a rectangle which is 3, 3, 750
00:07:14.920 | pixels each one with our GMB values and there is our puppy. So you see wasn't it a good
00:07:20.120 | idea to make this a puppy. Okay so how do we convert that to grayscale? Well the book
00:07:28.800 | has told us the formula to use go through every pixel and do that to it. Alright so
00:07:36.920 | here is the loop we're going to go through every pixel and do that to it and stick that
00:07:44.360 | in the output so that's the basic idea so what are the details of this? Well here we've
00:07:49.040 | got channel by row by column so how do we loop through every pixel? Well the first thing
00:07:56.520 | we need to know is how many pixels are there so we can say channel by height by width is
00:08:02.760 | the shape so now we have to find those three variables so the number of pixels is the height
00:08:08.440 | times the width and so to loop through all those pixels an easy way to do them is to
00:08:14.880 | flatten them all out into a vector. Now what happens when you flatten them all out into
00:08:20.920 | a vector? Well as we saw they're currently stored in this format where we've got one
00:08:29.440 | face and then another face and then there's a we haven't got it printed here but there's
00:08:33.040 | a third face within each face then there is one row we're just showing the first few and
00:08:40.480 | then the next row and then the next row and then with each row you've got column column
00:08:45.360 | column. So let's say we had a small image in which in fact we can do it like this we
00:08:56.840 | could say here's our red so we've got the pixels 0, 1, 2, 3, 4, 5 so let's say this was
00:09:08.520 | a height to width 3, 3 channel image so then there'll be 6, 7, 8, 9, 10, 11, GB, 12, 13,
00:09:28.840 | 14, 15, 16 so let's say these are the pixels so when these are flattened out it's going
00:09:39.040 | to turn into a single vector just like so 6, 7, 8, 12, 13, 14. So actually when we talk
00:09:58.000 | about an image we initially see it as a bunch of pixels we can think of it as having 3 channels
00:10:14.320 | but in practice in our computer the memory is all laid out linearly everything has just
00:10:23.520 | an address in memory it's just a whole bunch you can think of it as your computer's memory
00:10:28.080 | is one giant vector and so when we say when we say flatten then what that's actually doing
00:10:39.800 | is it's turning our channel by height by width into a big vector like this. Okay so now that
00:10:52.240 | we've done that we can say all right our the place we're going to be putting this into
00:10:59.520 | the result we're going to start out with just an empty vector of length n we'll go through
00:11:06.760 | all of the n values from 0 to n - 1 and we're going to put in the output value 0.29 ish
00:11:16.400 | times the input value at xi so this will be here in the red bit and then 0.59 times xi
00:11:29.800 | plus n so n here is this distance it's the number of pixels one two three four five six
00:11:41.460 | see one two three four five six so that's why to get to green we have to jump up to
00:11:50.600 | i plus n and then to get to blue we have to jump to i plus 2n see and so that's how this
00:12:04.520 | works we've flattened everything out and we're indexing into this flattened out thing directly
00:12:12.360 | and so at the end of that we're going to have our grayscale is all done so we can then just
00:12:16.460 | reshape that into height by width and there it is there's our grayscale puppy and you
00:12:26.040 | can see here the flattened image is just a single vector with all those channel values
00:12:34.760 | flattened out as we described okay now that is incredibly slow it's nearly two seconds
00:12:43.600 | to do something with only 34,000 pixels in so to speed it up we are going to want to
00:12:50.320 | use CUDA how come CUDA is able to speed things up well the reason CUDA is able to speed things
00:12:59.400 | up is because it is set up in a very different way to how a normal CPU is set up and we can
00:13:10.720 | actually see that if we look at some of this information about what is in an RTX 3090 card
00:13:21.200 | for example an RTX 3090 card is a fantastic GPU you can get them second hand pretty good
00:13:27.600 | value so a really good choice particularly for hobbyists what is inside a 3090 it has
00:13:35.920 | 82 SM's what's an SM and SM is a streaming multi processor so you can think of this as
00:13:44.600 | almost like a separate CPU in your computer and so there's 82 of these so that's already
00:13:50.960 | a lot more than you have CPUs in your computer but then each one of these has 128 CUDA cores
00:14:01.640 | so these CUDA cores are all able to operate at the same time these multi processors are
00:14:06.720 | all able to operate at the same time so that gives us 128 times 82 10,500 CUDA cores in
00:14:16.200 | total that can all work at the same time so that's a lot more than any CPU we're familiar
00:14:23.840 | with can do and the 3090 isn't even at the very top end it's really a very good GPU but
00:14:31.600 | there are some with even more CUDA cores so how do we use them all well we need to be
00:14:40.440 | able to set up our code in such a way that we can say here is a piece of code that you
00:14:45.560 | can run on lots of different pieces of data lots of different pieces of memory at the
00:14:50.440 | same time so that you can do 10,000 things at the same time and so CUDA does this in
00:14:56.520 | a really simple and pretty elegant way which is it basically says okay take out the kind
00:15:02.800 | of the inner loop so here's our inner loop the stuff where you can run 10,000 of these
00:15:10.440 | at the same time they're not going to influence each other at all so you see these do not
00:15:14.080 | influence each other at all all they do is they stick something into some output memory
00:15:20.480 | so it doesn't even return something you can't return something from these CUDA kernels as
00:15:25.120 | they're going to be called all you can do is you can modify memory in such a way that
00:15:30.440 | you don't know what order they're going to run in they could all run at the same time
00:15:33.440 | some could run a little bit before another one and so forth so the way that CUDA does
00:15:39.080 | this is it says okay write a function right and in your function write a line of code
00:15:46.920 | which I'm going to call as many dozens hundreds thousands millions of times as necessary to
00:15:53.080 | do all the work that's needed and I'm going to do and I'm going to use do this in parallel
00:15:57.040 | for you as much as I can in the case of running on a 3090 up to 10,000 times up to 10,000
00:16:05.040 | things all at once and I will get this done as fast as possible so all you have to do
00:16:11.160 | is basically write the line of code you want to be called lots of times and then the second
00:16:16.400 | thing you have to do is say how many times to call that code and so what will happen
00:16:20.720 | is that piece of code called the kernel will be called for you it'll be passed in whatever
00:16:26.360 | arguments you ask to be passed in which in this case will be the input array tensor the
00:16:31.440 | output tensor and the size of how many pixels are in each channel and it'll tell you okay
00:16:40.680 | this is the ith time I've called it now we can simulate that in Python very very simply
00:16:48.640 | a single for loop now this doesn't happen in parallel so it's not going to speed it up
00:16:53.520 | but the kind of results the semantics are going to be identical to CUDA so here is a
00:16:59.760 | function we've called run kernel we're going to pass it in a function we're going to say
00:17:04.000 | how many times to run the function and what arguments to call the function with and so
00:17:08.840 | each time it will call the function passing in the index what time and the arguments that
00:17:14.920 | we've requested okay so we can now create something to call that so let's get the just
00:17:24.480 | like before get the channel number of channels height and width the number of pixels flatten
00:17:29.760 | it out create the result tensor that we're going to put things in and this time rather
00:17:35.600 | than calling the loop directly we will call run kernel we will pass in the name of the
00:17:41.920 | function to be called as f we will pass in the number of times which is the number of
00:17:48.920 | pixels for the loop and we'll pass in the arguments that are going to be required inside
00:17:59.200 | our kernel so we're going to need out we're going to need x and we're going to need n so
00:18:05.600 | you can see here we're using no external libraries at all we have just plain python and a tiny
00:18:17.200 | bit of pytorch just enough to create a tensor into index into tensors and that's all that's
00:18:22.720 | being used but conceptually it's doing the same thing as a CUDA kernel would do nearly
00:18:32.040 | and we'll get to the nearly in just a moment but conceptually you could see that you could
00:18:37.080 | now potentially write something which if you knew that this was running a bunch of things
00:18:45.280 | totally independently of each other conceptually you could now truly easily paralyze that and
00:18:50.760 | that's what CUDA does however it's not quite that simple it does not simply create a single
00:19:06.880 | list of numbers like range n does in python and pass each one in turn into your kernel
00:19:14.360 | but instead it actually splits the range of numbers into what's called blocks so in this
00:19:23.200 | case you know maybe there's like a thousand pixels we wanted to get through it's going
00:19:29.840 | to group them into blocks of 256 at a time and so in python it looks like this in practice
00:19:41.240 | a CUDA kernel runner is not a single for loop that loops n times but instead it is a pair
00:19:51.600 | of nested for loops so you don't just pass in a single number and say this is the number
00:19:58.200 | of pixels but you pass in two numbers number of blocks and the number of threads we'll
00:20:04.440 | get into that in a moment but these are just numbers they're just you can put any numbers
00:20:07.640 | you like here and if you choose two numbers that multiply to get the thing that we want
00:20:15.880 | which is the n times we want to call it then this can do exactly the same thing because
00:20:21.520 | we're now going to pass in which of the what's the index of the outer loop we're up to what's
00:20:27.480 | the index in the inner loop we're up to how many things do we go through in the inner
00:20:32.960 | loop and therefore inside the kernel we can find out what index we're up to by multiplying
00:20:40.560 | the block index times the block dimension so that is to say the i by the threads and
00:20:47.120 | add the inner loop index the j so that's what we pass in with the i j threads but inside
00:20:55.240 | the kernel we call it block index thread index and block dimension so if you look at the
00:21:00.120 | CUDA book you'll see here this is exactly what they do they say the index is equal to
00:21:06.320 | the block index times the block dimension plus the thread index there's a dot x thing
00:21:12.560 | here that we can ignore for now we'll look at that in a moment but in practice this is
00:21:20.800 | actually how CUDA works so it has all these blocks and inside there are threads and you
00:21:30.440 | can just think of them as numbers so you can see these blocks they just have numbers o
00:21:33.760 | one dot dot dot dot and so forth now that does mean something a little bit tricky though
00:21:40.440 | which is well the first thing i'll say is how do we pick these numbers the number of
00:21:46.040 | blocks and the number of threads for now in practice we're just always going to say the
00:21:49.960 | number of threads is 256 and that's a perfectly fine number to use as a default anyway you
00:21:57.760 | can't go too far wrong just always picking 256 nearly always so don't worry about that
00:22:03.720 | too much for now optimizing that number so if we say okay we want to have 256 threads
00:22:12.000 | so remember that's the inner loop or if we look inside our kernel runner here that's
00:22:16.800 | our inner loop so we're going to call each of this is going to be called 256 times so
00:22:22.160 | how many times you have to call this well you're going to have to call it n number of
00:22:29.000 | pixels divided by 256 times now that might not be an integer so you'll have to round
00:22:35.920 | that up so it's ceiling and so that's how we can calculate the number of blocks we need
00:22:42.040 | to make sure that our kernel is called enough times now we do have a problem though which
00:22:49.800 | is that the number of times we would have liked to have called it which previously was
00:22:56.160 | equal to the number of pixels might not be a multiple of 256 so we might end up going
00:23:02.520 | too far and so that's why we also need in our kernel now this if statement and so this
00:23:08.640 | is making sure that the index that we're up to does not go past the number of pixels we
00:23:15.360 | have and this appears and basically every CUDA kernel you'll see and it's called the
00:23:20.040 | guard or the guard block so this is our guard to make sure we don't go out of bounds so
00:23:25.880 | this is the same line of code we had before and now we've also just added this thing to
00:23:31.560 | calculate the index and we've added the guard and this is like the pretty standard first
00:23:37.920 | lines from any CUDA kernel so we can now run those and they'll do exactly the same thing
00:23:47.400 | as before and so the obvious question is well why do CUDA kernels work in this weird block
00:23:55.680 | and thread way why don't we just tell them the number of times to run it why do we have
00:24:03.720 | to do it by blocks and threads and the reason why is because of some of this detail that
00:24:09.400 | we've got here which is that CUDA sets things up for us so that everything in the same block
00:24:17.920 | or to say it more completely thread block which is the same block they will all be given
00:24:24.840 | some shared memory and they'll also all be given the opportunity to synchronize which
00:24:29.560 | is to basically say okay everything in this block has to get to this point before you
00:24:35.640 | can move on all of the threads in a block will be executed on the same streaming multiprocessor
00:24:44.600 | and so we'll we'll see later in later lectures that won't be taught by me that by using blocks
00:24:54.520 | smartly you can make your code run more quickly and the shared memory is particularly important
00:25:01.320 | so shared memory is a little bit of memory in the GPU that all the threads in a block
00:25:06.600 | share and it's fast it's super super super fast now when we say not very much it's like
00:25:13.880 | on a 3090 it's 128k so very small so this is basically the same as a cache in a CPU
00:25:26.400 | the difference though is that on a CPU you're not going to be manually deciding what goes
00:25:31.520 | into your cache but on the GPU you do it's all up to you so at the moment this cache
00:25:36.680 | is not going to be used when we create our CUDA code because we're just getting started
00:25:42.800 | and so we're not going to worry about that optimization but to go fast you want to use
00:25:46.480 | that cache and also you want to use the register file something a lot of people don't realize
00:25:52.520 | is that there's actually quite a lot of register memory even more register memory than shared
00:25:56.600 | memory so anyway those are all things to worry about down the track not needed for getting
00:26:00.920 | started. So how do we go about using CUDA? There is a basically standard setup block
00:26:16.960 | that I would add and we are going to add and what happens in this setup block is we're
00:26:22.880 | going to set an environment variable you wouldn't use this in kind of production or for going
00:26:27.960 | fast but this says if you get an error stop right away basically so wait you know wait
00:26:35.000 | to see how things go and then that way you can tell us exactly when an error occurs and
00:26:40.840 | where it happens so that slows things down but it's good for development. We're also
00:26:48.480 | going to install two modules one is a build tool which is required by PyTorch to compile
00:26:55.960 | your C++ CUDA code. The second is a very handy little thing called Wurlitzer and the only
00:27:04.360 | place you're going to see that used is in this line here where we load this extension
00:27:08.300 | called Wurlitzer. Without this anything you print from your CUDA code in fact from your
00:27:14.880 | C++ code won't appear in a notebook so you always want to do this in a notebook where
00:27:20.160 | you're doing stuff in CUDA so that you can use print statements to debug things. Okay
00:27:27.120 | so if you've got some CUDA code how do you use it from Python? The answer is that PyTorch
00:27:35.880 | comes with a very handy thing called load inline which is inside torch.utils.cpp extension.
00:27:51.640 | Load inline is a marvelous load inline is a marvelous function that you just pass in
00:28:02.320 | a list of any of the CUDA code strings that you want to compile any of the plain C++ strings
00:28:08.240 | you want to compile any functions in that C++ you want to make available to PyTorch
00:28:15.360 | and it will go and compile it all turn it into a Python module and make it available
00:28:20.640 | right away which is pretty amazing. I've just created a tiny little wrapper for that called
00:28:27.160 | load CUDA just to streamline it a tiny bit but behind the scenes it's just going to call
00:28:33.160 | load inline. The other thing I've done is I've created a string that contains some C++ code
00:28:43.720 | I mean this is all C code I think but it's compiled as C++ code we'll call it C++ code.
00:28:51.160 | C++ code we want included in all of our CUDA files. We need to include this header file
00:28:59.600 | to make sure that we can access PyTorch tensor stuff. We want to be able to use I/O and we
00:29:06.800 | want to be able to check for exceptions. And then I also define three macros. The first
00:29:14.880 | macro just checks that a tensor is CUDA. The second one checks that it's contiguous in
00:29:23.040 | memory because sometimes PyTorch can actually split things up over different memory pieces
00:29:27.920 | and then if we try to access that in this flattened out form it won't work. And then
00:29:33.320 | the way we're actually going to use it check input or just check both of those things.
00:29:37.480 | So if something's not on CUDA and it's not contiguous we aren't going to be able to use
00:29:41.360 | it so we always have this. And then the third thing we do here is we define ceiling division.
00:29:48.840 | Ceiling division is just this although you can implement it a different way like this
00:29:59.400 | and so this will do ceiling division and so this is how we're going to this is what we're
00:30:03.440 | going to call in order to figure out how many blocks we need. So this is just you don't
00:30:08.040 | have to worry about the details of this too much it's just a standard setup we're going
00:30:10.720 | to use. Okay so now we need to write our CUDA kernel.
00:30:16.880 | Now how do you write the CUDA kernel? Well all I did and I recommend you do is take your
00:30:24.480 | Python kernel and paste it into chat GPT and say convert this to equivalent C code using
00:30:33.240 | the same names formatting etc where possible paste it in and chat GPT will do it for you.
00:30:40.840 | Unless you're very comfortable with C which case just write it yourself is fine but this
00:30:44.840 | way since you've already got the Python why not just do this? It basically was pretty
00:30:52.680 | much perfect I found although it did assume that these were floats they're actually not
00:31:00.280 | floats we had to change a couple of data types but basically I was able to use it almost
00:31:04.640 | as is and so particularly you know for people who are much more Python programmers nowadays
00:31:12.040 | like me this is a nice way to write 95 percent of the code you need. What else do we have
00:31:21.360 | to change? Well as we saw in our picture earlier it's not called block IDX it's called blockIDX.X
00:31:31.360 | blockDIM.X threadIDX.X so we have to add the dot X there. Other than that if we compare
00:31:43.840 | so as you can see these two pieces of code look nearly identical we've had to add data
00:31:48.000 | types to them we've had to add semicolons we had to get rid of the colon we had to add
00:31:55.160 | curly brackets that's about it. So it's not very different at all so if you haven't done
00:32:02.080 | much C programming yeah don't worry about it too much because you know the truth is
00:32:08.520 | actually it's not that different for this kind of calculation intensive work. One thing
00:32:17.300 | we should talk about is this. What's unsigned car star? This is just how you write Uint8
00:32:26.760 | in C. You can just if you're not sure how to change change a data type between the PyTorch
00:32:35.160 | spelling and the C spelling you could ask chat GPT or you can Google it but this is
00:32:40.200 | how you write byte. The star in practice it's basically how you say this is an array. So
00:32:47.280 | this says that X is an array of bytes. It actually means it's a pointer but pointers are treated
00:33:00.000 | as you can see here as arrays by C. So you don't really have to worry about the fact
00:33:06.040 | that the pointer it just means for us that it's an array but in C the only kind of arrays
00:33:12.160 | that it knows how to deal with these one-dimensional arrays and that's why we always have to flatten
00:33:17.160 | things out okay. We can't use multi-dimensional tensors really directly in these CUDA kernels
00:33:22.560 | in this way. So we're going to end up with these one-dimensional C arrays. Yeah other
00:33:28.280 | than that it's going to look exactly in fact I mean even because we did our Python like
00:33:32.240 | that it's going to look identical. The void here just means it doesn't return anything
00:33:38.400 | and then the dunder global here is a special thing added by CUDA. There's three things
00:33:46.360 | that can appear and this simply says what should I compile this to do and so you can
00:33:53.700 | put dunder device and that means compile it so that you can only call it on the GPU. You
00:34:00.080 | can say dunder global and that says okay you can call it from the CPU or GPU and it will
00:34:06.720 | run on the GPU or you can write write dunder host which you don't have to and that just
00:34:11.760 | means it's a normal C or C++ program that runs on the CPU side. So anytime we want to
00:34:18.080 | call something from the CPU side to run something on the GPU which is basically almost always
00:34:24.520 | when we're doing kernels you write dunder global. So here we've got dunder global we've
00:34:32.080 | got our kernel and that's it. So then we need the thing to call that kernel. So earlier
00:34:40.400 | to call the kernel we called this block kernel function passed in the kernel and passed in
00:34:45.700 | the blocks and threads and the arguments. With CUDA we don't have to use a special function
00:34:51.520 | there is a weird special syntax built into kernel to do it for us. To use the weird special
00:34:57.480 | syntax you say okay what's the kernel the function that I want to call and then you
00:35:03.600 | use these weird triple angle brackets. So the triple angle brackets is a special CUDA extension
00:35:11.840 | to the C++ language and it means this is a kernel please call it on the GPU and between
00:35:20.200 | the triple angle brackets there's a number of things you can pass but you have to pass
00:35:25.760 | at least the first two things which is how many blocks how many threads. So how many
00:35:33.160 | blocks ceiling division number of pixels divided by threads and how many threads as we said
00:35:40.800 | before let's just pick 256 all the time and not worry about it. So that says call this
00:35:45.600 | function as a GPU kernel and then passing in these arguments. We have to pass in our
00:35:52.680 | input tensor, our output tensor and how many pixels. And you'll see that for each of these
00:35:58.480 | tensors we have to use a special method .data pointer and that's going to convert it into
00:36:04.320 | a C pointer to the tensor. So that's why by the time it arrives in our kernel it's a C
00:36:11.440 | pointer. You also have to tell it what data type you want it to be treated as. This says
00:36:17.680 | treat it as Uintates. So that's this is a C++ template parameter here and this is a method.
00:36:30.040 | The other thing you need to know is in C++ dot means call a method of an object or else
00:36:38.120 | colon colon is basically like in C in Python calling a method of a class. So you don't say
00:36:45.760 | torch dot empty you say torch colon colon empty to create our output or else back when we
00:36:51.920 | did it in Python we said torch dot empty. Also in Python oh okay so in Python that's right
00:37:01.880 | we just created a length n vector and then did a dot view. It doesn't really matter how
00:37:07.080 | we do it but in this case we actually created a two-dimensional tensor bypassing. We pass
00:37:12.160 | in this thing in curly brackets here this is called a C++ list initializer and it's
00:37:17.000 | just basically a little list containing height comma width. So this tells it to create a
00:37:21.840 | two-dimensional matrix which is why we don't need dot view at the end. We could have done
00:37:25.800 | it the dot view way as well. Probably be better to keep it consistent but this is what I wrote
00:37:30.600 | at the time. The other interesting thing when we create the output is if you pass in input
00:37:37.720 | dot options so this is our input tensor that just says oh use the same data type and the
00:37:44.300 | same device CUDA device as our input has. This is a nice really convenient way which
00:37:49.400 | I don't even think we have in Python to say make sure that this is the same data type
00:37:54.880 | in the same device. If you say auto here this is quite convenient you don't have to specify
00:38:01.240 | what type this is. We could have written torch colon colon tensor but by writing auto it
00:38:06.320 | just says figure it out yourself which is another convenient little C++ thing. After
00:38:13.040 | we call the kernel if there's an error in it we won't necessarily get told so to tell
00:38:18.080 | it to check for an error you have to write this. This is a macro that's again provided
00:38:23.900 | by PyTorch. The details don't matter you should just always call it after you call a kernel
00:38:29.440 | to make sure it works and then you can return the tensor that you allocated and then you
00:38:36.320 | passed as a pointer and then that you filled in. Okay now as well as the CUDA source you
00:38:48.360 | also need C++ source and the C++ source is just something that says here is a list of
00:38:55.560 | all of the details of the functions that I want you to make available to the outside
00:39:02.000 | world in this case Python and so this is basically your header effectively. So you can just copy
00:39:08.960 | and paste the full line here from your function definition and stick a semicolon on the end.
00:39:16.640 | So that's something you can always do and so then we call our load CUDA function that
00:39:21.400 | we looked at earlier passing in the CUDA source code the C++ source code and then a list of
00:39:27.760 | the names of the functions that are defined there that you want to make available to Python.
00:39:32.440 | So we just have one which is the RGB2 grayscale function and believe it or not that's all you
00:39:38.840 | have to do this will automatically you can see it running in the background now compiling
00:39:45.440 | with a hugely long thing our files from so it's created a main.cpp for us and it's going
00:39:56.960 | to put it into a main.o for us and compile everything up link it all together and create
00:40:04.540 | a module and you can see here we then take that module it's been passed back and put
00:40:09.880 | it into a variable called module and then when it's done it will load that module and
00:40:17.880 | if we look inside the module that we just created you'll see now that apart from the
00:40:21.800 | normal auto generated stuff Python adds it's got a function in it RGB2 grayscale okay so
00:40:28.840 | that's amazing we now have a CUDA function that's been made available from Python and
00:40:35.040 | we can even see if we want to this is where it put it all so we can have a look and there
00:40:45.480 | it is you can see it's created a main.cpp it's compiled it into a main.o it's created a library
00:40:52.400 | that we can load up it's created a CUDA file it's created a build script and we could have
00:40:58.840 | a look at that build script if we wanted to and there it is so none of this matters too
00:41:06.240 | much it's just nice to know that PyTorch is doing all this stuff for us and we don't have
00:41:10.680 | to worry about it so that's pretty cool.
00:41:16.080 | So in order to pass a tensor to this we're going to be checking that it's contiguous
00:41:23.420 | and on CUDA so we'd better make sure it is so we're going to create an image C variable
00:41:29.360 | which is the image made contiguous and put on through the CUDA device and now we can
00:41:38.800 | actually run this on the full sized image not on the tiny little minimized image we
00:41:43.360 | created before this has got much more pixels it's got 1.7 million pixels where else before
00:41:50.360 | we had I think it was 35,000 34,000 and it's gone down from one and a half seconds to one
00:42:00.000 | millisecond so that is amazing it's dramatically faster both because it's now running in compiled
00:42:09.840 | code and because it's running on the GPU.
00:42:15.600 | The step of putting the data onto the GPU is not part of what we timed and that's probably
00:42:21.280 | fair enough because normally you do that once and then you run a whole lot of CUDA things
00:42:25.400 | on it.
00:42:28.020 | We have though included the step of moving it off the GPU and putting it onto the CPU
00:42:33.680 | as part of what we're timing and one key reason for that is that if we didn't do that it can
00:42:39.680 | actually run our Python code at the same time that the CUDA code is still running and so
00:42:46.160 | the amount of time shown could be dramatically less because it hasn't finished synchronizing
00:42:51.880 | so by adding this it forces it to complete the CUDA run and to put the data back onto
00:43:01.240 | the CPU that kind of synchronization you can also trigger this by printing a value from
00:43:06.920 | it or you can synchronize it manually so after we've done that and we can have a look and
00:43:13.600 | we should get exactly the same grayscale puppy okay so we have successfully created our first
00:43:27.920 | real working code from Python CUDA kernel.
00:43:36.140 | This approach of writing it in Python and then converting it to CUDA is not particularly
00:43:46.320 | common but I'm not just doing it as an educational exercise that's how I like to write my CUDA
00:43:53.400 | kernels at least as much of it as I can because it's much easier to debug in Python it's much
00:44:04.600 | easier to see exactly what's going on and so and I don't have to worry about compiling
00:44:10.560 | it takes about 45 or 50 seconds to compile even our simple example here I can just run
00:44:15.520 | it straight away and once it's working to convert that into C as I mentioned you know
00:44:20.720 | chatgpt can do most of it for us so I think this is actually a fantastically good way
00:44:27.320 | of writing CUDA kernels even as you start to get somewhat familiar with them it's because
00:44:34.280 | it lets you debug and develop much more quickly a lot of people avoid writing CUDA just because
00:44:43.860 | that process is so painful and so here's a way that we can make that process less painful
00:44:49.360 | so let's do it again and this time we're going to do it to implement something very important
00:44:56.000 | which is matrix multiplication so matrix multiplication as you probably know is fundamentally critical
00:45:05.640 | for deep learning it's like the most basic linear algebra operation we have and the way
00:45:12.680 | it works is that you have a input matrix M and a second input matrix N and we go through
00:45:24.240 | every row of M so we go through every row of M till we get to here we are up to this
00:45:30.520 | one and every column of N and here we are up to this one and then we take the dot product
00:45:36.880 | at each point of that row with that column and this here is the dot product of those
00:45:45.200 | two things and that is what matrix multiplication is so it's a very simple operation conceptually
00:45:58.720 | and it's one that we do many many many times in deep learning and basically every deep
00:46:04.560 | learning every neural network has this is its most fundamental operation of course we don't
00:46:11.520 | actually need to implement matrix multiplication from scratch because it's done for us in libraries
00:46:16.280 | but we will often do things where we have to kind of fuse in some kind of matrix multiplication
00:46:22.400 | like paces and so you know and of course it's also just a good exercise so let's take a
00:46:29.800 | look at how to do matrix multiplication first of all in pure Python so in the actually in
00:46:38.680 | the first AI course that I mentioned there's a very complete in-depth dive into matrix
00:46:45.000 | multiplication in part two less than 11 where we spend like an hour or two talking about
00:46:52.120 | nothing but matrix multiplication we're not going to go into that much detail here but
00:46:57.040 | what we do do in that is we use the MNIST data set to to do this and so we're going
00:47:04.560 | to do the same thing here we're going to grab the MNIST data set of handwritten digits and
00:47:11.280 | they are 28 by 28 digits they look like this 28 by 28 is 784 so to do a you know to basically
00:47:21.320 | do a single layer of a neural net or without the activation function we would do a matrix
00:47:29.400 | multiplication of the image flattened out by a weight matrix with 784 rows and however
00:47:38.040 | many columns we like and I'm going to need if we're going to go straight to the output
00:47:41.320 | so this would be a linear function a linear model we'd have 10 layers one for each digit
00:47:46.080 | so here's this is our weights we're not actually going to do any learning here this is just
00:47:50.440 | not any deep learning or logistic regression learning is just for an example okay so we've
00:47:56.680 | got our weights and we've got our input our input data x train and x valid and so we're
00:48:06.360 | going to start off by implementing this in Python now again Python's really slow so let's
00:48:11.960 | make this smaller so matrix one will just be five rows matrix two will be all the weights
00:48:18.920 | so that's going to be a five by seven eighty four matrix multiplied by a seven eighty four
00:48:25.480 | by ten matrix now these two have to match of course they have to match because otherwise
00:48:32.800 | this product won't work those two are going to have to match the row by the column okay
00:48:40.560 | so let's pull that out into a rows a columns b rows b columns and obviously a columns and
00:48:47.640 | b rows are the things that have to match and then the output will be a rows by b columns
00:48:53.580 | so five by ten so let's create an output fill of zeros with rows by columns in it and so
00:49:03.600 | now we can go ahead and go through every row of a every column of b and do the dot product
00:49:11.360 | which involves going through every item in the innermost dimension or 784 of them multiplying
00:49:17.880 | together the equivalent things from m1 and m2 and summing them up into the output tensor
00:49:27.920 | that we created so that's going to give us as we said a five by ten five by ten output
00:49:41.760 | and here it is okay so this is how I always create things in python I basically almost
00:49:49.400 | never have to debug I almost never have like errors unexpected errors in my code because
00:49:55.520 | I've written every single line one step at a time in python I've checked them all as
00:50:00.000 | they go and then I copy all the cells and merge them together stick a function header
00:50:04.200 | on like so and so here is matmul so this is exactly the code we've already seen and we
00:50:10.600 | can call it and we'll see that for 39,200 innermost operations we took us about a second so that's
00:50:26.600 | pretty slow okay so now that we've done that you might not be surprised to hear that we
00:50:32.840 | now need to do the innermost loop as a kernel call in such a way that it is can be run in
00:50:41.720 | parallel now in this case the innermost loop is not this line of code it's actually this
00:50:50.400 | line of code I mean we can choose to be whatever we want it to be but in this case this is
00:50:54.400 | how we're going to do it we're going to say for every pixel we're not every pixel for
00:50:58.560 | every cell in the output tensor like this one here is going to be one CUDA thread so
00:51:07.040 | one CUDA thread is going to do the dot product so this is the bit that does the dot product
00:51:14.200 | so that'll be our kernel so we can write that matmul block kernel is going to contain that
00:51:25.680 | okay so that's exactly the same thing that we just copied from above and so now we're
00:51:31.520 | going to need a something to run this kernel and you might not be surprised to hear that
00:51:41.400 | in CUDA we are going to call this using blocks and threads but something that's rather handy
00:51:50.240 | in CUDA is that the blocks and threads don't have to be just a 1d vector they can be a
00:51:58.440 | 2d or even 3d tensor so in this case you can see we've got one two a little hard to see
00:52:09.080 | exactly where they stop two three four blocks and so then for each block that's kind of
00:52:23.400 | in one dimension and then there's also one two three four five blocks in the other dimension
00:52:35.600 | and so each of these blocks has an index so this one here is going to be zero zero a little
00:52:45.760 | bit hard to see this one here is going to be one three and so forth and this one over
00:52:52.720 | here is going to be three four so rather than just having a integer block index we're going
00:53:06.500 | to have a tuple block index and then within a block there's going to be to pick let's
00:53:19.160 | say this exact spot here didn't do that very well there's going to be a thread index and
00:53:30.840 | again the thread index won't be a single index into a vector it'll be a two elements so in
00:53:37.480 | this case it would be 0 1 2 3 4 5 6 rows down and 0 1 2 3 4 5 6 7 8 9 10 is that 11 12 I
00:53:52.440 | can't count 12 maybe across so the this here is actually going to be defined by two things
00:53:59.120 | one is by the block and so the block is 3 comma 4 and the thread is 6 comma 12 so that's how
00:54:22.080 | CUDA lets us index into two-dimensional grids using blocks and threads we don't have to
00:54:31.880 | it's just a convenience if we want to and in fact it can we can use up to three dimensions
00:54:41.640 | so to create our kernel runner now rather than just having so rather than just having
00:54:50.640 | two nested loops for blocks and threads we're going to have to have two lots of two nested
00:54:58.360 | loops for our both of our X and Y blocks and threads or our rows and columns blocks and
00:55:06.920 | threads so it ends up looking a bit messy because we now have four nested for loops
00:55:16.720 | so we'll go through our blocks on the Y axis and then through our blocks on the X axis
00:55:22.320 | and then through our threads on the Y axis and then through our threads on the X axis
00:55:27.040 | and so what that means is that for you can think of this Cartesian product as being for
00:55:32.040 | each block for each thread now to get the dot Y and the dot X will use this handy little
00:55:40.320 | Python standard library thing called simple namespace I'd use that so much I just give
00:55:44.400 | it an NS name because I use namespaces all the time and my quick and dirty code so we
00:55:50.240 | go through all those four we then call our kernel and we pass in an object containing
00:55:58.320 | the Y and X coordinates and that's going to be our block and we also pass in our thread
00:56:09.600 | which is an object with the Y and X coordinates of our thread and it's going to eventually
00:56:17.280 | do all possible blocks and all possible threads numbers for each of those blocks and we also
00:56:24.240 | need to tell it how big is each block how how high and how wide and so that's what this
00:56:30.240 | is this is going to be a simple namespace and object with an X and Y as you can see
00:56:36.400 | so I need to know how big they are just like earlier on we had to know the block dimension
00:56:43.920 | that's why we passed in threads so remember this is all pure PyTorch we're not actually
00:56:50.480 | calling any out to any CUDA we're not calling out to any libraries other than just a tiny
00:56:54.800 | bit of PyTorch for the indexing and tensor creation so you can run all of this by hand
00:57:00.720 | make sure you understand you can put it in the debugger you can step through it and so
00:57:06.720 | it's going to call our function so here's our matrix modification function as we said
00:57:10.760 | it's a kernel that contains the dot product that we wrote earlier so now the guard is
00:57:16.840 | going to have to check that the row number we're up to is not taller than we have and
00:57:23.520 | the column number we're up to is not wider than we have and we also need to know what
00:57:27.400 | row number we're up to and this is exactly the same actually I should say the column
00:57:32.760 | is exactly the same as we've seen before and in fact you might remember in the CUDA we
00:57:36.880 | had block idx dot x and this is why right because in CUDA it's always gives you these
00:57:44.320 | three-dimensional dim three structures so you have to put this dot x so we can find
00:57:52.560 | out the column this way and then we can find out the row by seeing how many blocks have
00:57:58.760 | we gone through how big is each block in the y-axis and how many threads have we gone through
00:58:03.800 | in the y-axis so which row number are we up to what column number are we up to is that
00:58:09.840 | inside the bounds of our tensor if not then just stop and then otherwise do our dot product
00:58:21.360 | and put it into our output tensor so that's all pure Python and so now we can call it
00:58:29.960 | by getting the height and width of our first input the height and width of our second input
00:58:36.640 | and so then K and K2 the inner dimensions ought to match we can then create our output
00:58:43.860 | and so now threads per block is not just the number 256 but it's a pair of numbers it's
00:58:50.360 | an x and a y and we've selected two numbers that multiply together to create 256 so again
00:58:55.960 | this is a reasonable choice if you've got two dimensional inputs to spread it out nicely
00:59:04.940 | one thing to be aware of here is that your threads per block can't be bigger than 1024
00:59:16.320 | so we're using 256 which is safe right and notice that you have to multiply these together
00:59:20.920 | 16 times 16 is going to be the number of threads per block so this is a these are safe numbers
00:59:26.600 | to use you're not going to run out of blocks though 2 to the 31 is the number of maximum
00:59:32.520 | blocks for dimension 0 and then 2 to the 16 for dimensions 1 and 2 I think it's actually
00:59:37.920 | minus 1 but don't worry about that so don't have too many 10 threads but you can have
00:59:43.520 | lots of blocks but of course each symmetric model processor is going to run all of these
00:59:49.160 | on the same device and they're also going to have access to shared memory so that's
00:59:53.880 | why you use a few threads per block so our blocks the x we're going to use the ceiling
01:00:01.520 | division the y we're going to use the same ceiling division so if any of this is unfamiliar
01:00:06.880 | go back to our earlier example because the code's all copied from there and now we can
01:00:10.840 | call our 2D block kernel runner passing in the kernel the number of blocks the number
01:00:16.600 | of threads per block our input matrices flattened out our output matrix flattened out and the
01:00:23.360 | dimensions that it needs because they get all used here and return the result and so
01:00:33.080 | if we call that matmul with a 2D block and we can check that they are close to what we
01:00:39.480 | got in our original manual loops and of course they are because it's running the same code
01:00:46.680 | so now that we've done that we can do the CUDA version now the CUDA version is going
01:00:54.800 | to be so much faster we do not need to use this slimmed down matrix anymore we can use
01:01:05.560 | the whole thing so to check that it's correct I want a fast CPU-based approach that I can
01:01:12.080 | compare to so previously I took about a second to do 39,000 elements so I'm not going to
01:01:21.680 | explain how this works but I'm going to use a broadcasting approach to get a fast CPU-based
01:01:26.040 | approach if you check the fast AI course we teach you how to do this broadcasting approach
01:01:31.400 | but it's a pure Python approach which manages to do it all in a single loop rather than
01:01:36.080 | three nested loops it gives the same answer for the cut down tensors but much faster only
01:01:50.920 | four milliseconds so it's fast enough that we can now run it on the whole input matrices
01:02:00.920 | and it takes about 1.3 seconds and so this broadcast optimized version as you can see
01:02:06.880 | it's much faster and now we've got 392 million additions going on in the middle of our three
01:02:14.720 | loops effectively three loops but we're broadcasting them so this is much faster but the reason
01:02:19.960 | I'm really doing this is so that we can store this result to compare to so that makes sure
01:02:27.520 | that our CUDA version is correct okay so how do we convert this to CUDA you might not be
01:02:35.120 | surprised to hear that what I did was I grabbed this function and I passed it over to chat
01:02:39.920 | GPT and said please rewrite this in C and it gave me something basically that I could
01:02:45.480 | use first time and here it is this time I don't have unsigned cast I have float star
01:02:54.000 | other than that this looks almost exactly like the Python we had with exactly the same
01:03:01.980 | changes we saw before we've now got the dot Y and dot X versions once again we've got
01:03:09.480 | done to global which says please run this on the GPU when we call it from the CPU so
01:03:14.720 | the CUDA the kernel I don't think there's anything to talk about there and then the
01:03:18.680 | And then the thing that calls the kernel is going to be passed in two tensors; we're going to check
01:03:24.240 | that they're both contiguous and check that they are on the CUDA device we'll grab the
01:03:29.880 | height and width of the first and second tensors we're going to grab the inner dimension we'll
01:03:37.120 | make sure that the inner dimensions of the two matrices match just like before and this
01:03:43.020 | is how you do an assertion in PyTorch CUDA code: you call TORCH_CHECK, pass in anything to
01:03:50.080 | check, and pass in the message to pop up if there's a problem. So these are a really good thing
01:03:53.840 | to spread around all through your CUDA code to make sure that everything is as you thought it was going to be.
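As a hedged sketch, the checks described here might look something like this (TORCH_CHECK is the real PyTorch macro; the tensor names m and n are just placeholders):

```cpp
// Sketch of TORCH_CHECK-style assertions sprinkled through the host code.
TORCH_CHECK(m.is_contiguous() && n.is_contiguous(), "inputs must be contiguous");
TORCH_CHECK(m.is_cuda() && n.is_cuda(), "inputs must be CUDA tensors");
TORCH_CHECK(m.size(1) == n.size(0), "inner dimensions of the matrices must match");
```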
01:04:01.460 | Just like before, we create an output. So now when we create the number of
01:04:07.800 | threads we don't say threads is 256 we instead say this is a special thing provided by CUDA
01:04:15.320 | for us: dim3. So this is basically a tuple with three elements. So we're going to
01:04:21.040 | create a dim3 called TPB; it's going to be 16 by 16. Now I said it has three elements,
01:04:28.640 | where's the third one? That's okay, it just treats the third one as being 1, so
01:04:33.640 | we can ignore it. So that's the number of threads per block, and then how many blocks will there
01:04:40.840 | be? Well, in the x dimension it'll be W divided by x, ceiling division; in the y dimension it
01:04:50.120 | will be H divided by y, ceiling division. And so that's the number of blocks
01:04:56.080 | we have so just like before we call our kernel just by calling it like a normal function
01:05:03.120 | but then we add this weird triple angle bracket thing telling it how many blocks and how many
01:05:08.600 | threads. So these aren't ints anymore, these are now dim3 structures, and that's what we use.
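Putting those pieces together, the host-side wrapper being described might look roughly like this; `cdiv` is the ceiling-division helper from earlier in the lecture, and the function and variable names are illustrative rather than the exact ones used here:

```cuda
// Sketch of the host-side wrapper: create the output, pick a dim3 launch
// configuration, and launch the kernel with raw float pointers.
#include <torch/extension.h>

inline unsigned int cdiv(unsigned int a, unsigned int b) { return (a + b - 1) / b; }

__global__ void matmul_k(float* m, float* n, float* out, int h, int w, int k);

torch::Tensor matmul(torch::Tensor m, torch::Tensor n) {
    int h = m.size(0), k = m.size(1), w = n.size(1);
    auto output = torch::zeros({h, w}, m.options());   // output lives on the same device as m

    dim3 tpb(16, 16);                                  // 16 x 16 = 256 threads per block
    dim3 blocks(cdiv(w, tpb.x), cdiv(h, tpb.y));       // enough blocks to cover the output
    matmul_k<<<blocks, tpb>>>(
        m.data_ptr<float>(), n.data_ptr<float>(),      // tensors handed over as raw pointers
        output.data_ptr<float>(), h, w, k);            // plus the dimensions the kernel needs
    return output;
}
```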
01:05:16.660 | In fact, even before, what actually happened behind
01:05:22.800 | the scenes when we did the grayscale thing is even though we passed in 256 as an int, we
01:05:32.960 | actually ended up with a dim3 structure, in which case the index
01:05:40.920 | one and two, or the .y and .z values, were just set to one automatically, so we've
01:05:47.440 | actually already used a dim3 structure without quite realizing it.
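A tiny illustration of that point, compiled as CUDA C++: any dim3 dimension you leave out defaults to 1, so a plain integer behaves like dim3(n, 1, 1).

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    dim3 a(256);      // roughly what happened behind the scenes in the grayscale example
    dim3 b(16, 16);   // the 2D version used here
    printf("a = (%u, %u, %u)\n", a.x, a.y, a.z);   // prints: a = (256, 1, 1)
    printf("b = (%u, %u, %u)\n", b.x, b.y, b.z);   // prints: b = (16, 16, 1)
    return 0;
}
```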
01:05:55.080 | And then, just like before, pass in all of the tensors we want, casting them to pointers, or maybe
01:06:01.640 | not exactly casting, but converting them to pointers of a particular data type, and passing
01:06:07.240 | in any other information that the kernel will need.
01:06:14.720 | OK, so then we call load CUDA again. That'll compile this into a module, make sure that they're both
01:06:22.520 | contiguous and on the CUDA device, and then we call module.matmul, passing those
01:06:29.320 | in, putting the result on the CPU, and checking that they're all close, and it says yes they are. So
01:06:36.320 | this is now running not on just the first five rows but on the entire MNIST data set
01:06:41.440 | and on the entire MNIST data set using an optimized CPU approach it took 1.3 seconds using CUDA
01:06:51.840 | it takes six milliseconds so that is quite a big improvement cool the other thing I will
01:07:04.920 | mention of course is PyTorch can do a matrix multiplication for us just by using @, and obviously it
01:07:11.200 | gives the same answer. How long does that take to run? That takes two
01:07:17.720 | milliseconds so three times faster and in many situations it'll be much more than three
01:07:25.940 | times faster so why are we still pretty slow compared to PyTorch I mean this isn't bad
01:07:32.080 | to do 392 million of these calculations in six milliseconds but if PyTorch can do it
01:07:38.480 | so much faster what are they doing well the trick is that they are taking advantage in
01:07:46.700 | particular of this shared memory so shared memory is a small memory space that is shared
01:07:53.960 | amongst the threads in a block and it is much faster than global memory in our matrix multiplication
01:08:02.460 | when we have one of these blocks and so it's going to do one block at a time all in the
01:08:07.160 | same SM it's going to be reusing the same 16 by 16 block it's going to be using the
01:08:14.760 | same 16 rows and columns again and again and again each time with access to the same shared
01:08:20.060 | memory so you can see how you could really potentially cache a lot of
01:08:25.400 | the information you need and reuse it rather than going back to the slower memory again
01:08:30.680 | and again. So this is an example of the kinds of things that you could potentially optimize
01:08:37.040 | once you get to that point.
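To make the idea concrete, here's a rough sketch of shared-memory tiling for matmul; it's not the lecture's code and not what PyTorch actually does, just an illustration of how a block can load a tile once into fast shared memory and then reuse it many times:

```cuda
// Rough sketch of the tiling idea: cache TILE x TILE tiles of m and n in
// shared memory so each element is read from global memory far fewer times.
#define TILE 16

__global__ void matmul_tiled(float* m, float* n, float* out, int h, int w, int k) {
    __shared__ float ms[TILE][TILE];   // tile of m cached in shared memory
    __shared__ float ns[TILE][TILE];   // tile of n cached in shared memory

    int r = blockIdx.y * TILE + threadIdx.y;
    int c = blockIdx.x * TILE + threadIdx.x;
    float o = 0.0f;

    // Walk along the inner dimension one tile at a time.
    for (int t = 0; t < k; t += TILE) {
        // Each thread loads one element of each tile (zero-padding past the edges).
        ms[threadIdx.y][threadIdx.x] = (r < h && t + threadIdx.x < k) ? m[r * k + t + threadIdx.x] : 0.0f;
        ns[threadIdx.y][threadIdx.x] = (c < w && t + threadIdx.y < k) ? n[(t + threadIdx.y) * w + c] : 0.0f;
        __syncthreads();               // make sure the whole tile is loaded

        for (int i = 0; i < TILE; ++i) // every thread reuses the same shared tiles
            o += ms[threadIdx.y][i] * ns[i][threadIdx.x];
        __syncthreads();               // don't overwrite the tiles while others still read them
    }
    if (r < h && c < w) out[r * w + c] = o;
}
```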
01:08:46.720 | The only other thing that I wanted to mention here is that this 2D block idea is totally optional: you can do everything with 1D blocks or with 2D
01:08:55.000 | blocks or with 3D blocks and threads and just to show that I've actually got an example
01:09:00.320 | at the end here which converts RGB to grayscale using the 2D blocks because remember earlier
01:09:11.920 | when we did this it was with 1D blocks. It gives exactly the same result and if we compare
01:09:21.160 | the code, the version that was done with 1D threads and
01:09:32.660 | blocks is quite a bit shorter than the version that uses 2D threads and blocks and so in
01:09:38.320 | this case even though we're manipulating pixels where you might think that using the
01:09:43.440 | 2D approach would be neater and more convenient, in this particular case it wasn't really. I
01:09:50.160 | mean, it's still pretty simple code, but we have to deal with the column and row .x and .y
01:09:56.560 | separately, the guard's a little bit more complex, we have to find out what index we're actually
01:10:02.840 | up to here, whereas this kernel was just much more direct, just two lines of
01:10:09.400 | code and then calling the kernel you know again it's a little bit more complex with
01:10:13.480 | the threads per blocks stuff rather than this but the key thing I wanted to point out is
01:10:18.080 | that these two pieces of code do exactly the same thing. So if you don't
01:10:25.520 | want to use a 2D or 3D block and thread structure, you don't have to. You can just use a 1D one,
01:10:32.720 | the 2D stuff is only there if it's convenient for you to use and you want to use it. Don't
01:10:38.720 | feel like you have to.
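For comparison, this is roughly what the two indexing schemes look like side by side; the kernel names are illustrative, the coefficients are just the standard luminance weights, and x is assumed to point at the flattened R, G, B planes in that order:

```cuda
// 1D version: one flat index and a single guard.
__global__ void rgb_to_gray_1d(unsigned char* x, unsigned char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.2989f * x[i] + 0.5870f * x[i + n] + 0.1140f * x[i + 2 * n];
}

// 2D version: separate row/column indices, a two-part guard, and the flat
// index still has to be computed by hand before doing exactly the same work.
__global__ void rgb_to_gray_2d(unsigned char* x, unsigned char* out, int w, int h) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (c >= w || r >= h) return;
    int i = r * w + c, n = h * w;
    out[i] = 0.2989f * x[i] + 0.5870f * x[i + n] + 0.1140f * x[i + 2 * n];
}
```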
01:10:44.040 | So yeah I think that's basically like all the key things that I wanted to show you all
01:10:49.360 | today. The main thing I hope you take from this is that even for Python programmers for
01:10:56.100 | data scientists it's not way outside our comfort zone you know we can write these things in
01:11:04.600 | Python we can convert them pretty much automatically we end up with code that doesn't look you
01:11:11.160 | know it looks reasonably familiar even though it's now in a different language we can do
01:11:15.880 | everything inside notebooks we can test everything as we go we can print things from our kernels
01:11:23.760 | and so you know it's hopefully feeling a little bit less beyond our capabilities than we might
01:11:34.500 | have previously imagined. So I'd say yeah, you know, go for it. I think
01:11:41.840 | it's increasingly important to be able to write CUDA code nowadays because for things
01:11:47.640 | like flash attention or for things like quantization GPTQ AWQ bits and bytes these are all things
01:11:57.320 | you can't write in PyTorch. You know our models are getting more sophisticated the kind of
01:12:03.560 | assumptions that libraries like PyTorch make about what we want to do are becoming, you know,
01:12:10.880 | less and less accurate so we're having to do more and more of this stuff ourselves nowadays
01:12:15.320 | in CUDA and so I think it's a really valuable capability to have. Now the other thing I
01:12:23.640 | mentioned is we did it all in CoLab today but we can also do things on our own machines
01:12:32.200 | if you have a GPU or on a cloud machine and getting set up for this again it's much less
01:12:39.440 | complicated than you might expect and in fact I can show you it's basically three or four lines
01:12:45.800 | of bash script to get it all set up it'll
01:12:50.360 | run on Windows under WSL it'll also run on Linux of course CUDA stuff doesn't really
01:12:55.720 | work on Mac so not on Mac. Actually I'll put a link
01:13:04.480 | to this into the video notes but for now I'm just going to jump to a Twitter thread where
01:13:11.640 | I wrote this all down to show you all the steps. So the way to do it is to use something
01:13:19.640 | called Conda. Conda is something that very very very few people understand a lot of people
01:13:26.080 | think it's like a replacement for like pip or poetry or something it's not it's better
01:13:31.440 | to think of it as a replacement for docker. You can literally have multiple different
01:13:35.560 | versions of Python multiple different versions of CUDA multiple different C++ compilation
01:13:41.600 | systems all in parallel at the same time on your machine and switch between them you can
01:13:48.680 | only do this with Conda and everything just works right so you don't have to worry about
01:13:55.440 | all the confusing stuff around .run files or Ubuntu packages or anything like that you
01:14:00.920 | can do everything with just Conda. You need to install Conda and I've actually got a script;
01:14:09.000 | it's a tiny script as you see, and if you just run the script
01:14:12.600 | it'll automatically figure out which Miniconda you need, it'll automatically figure
01:14:17.120 | out what shell you're on and it'll just go ahead and download it and install it for you.
01:14:21.200 | Okay so run that script restart your terminal now you've got Conda. Step two is find out
01:14:30.720 | what version of CUDA PyTorch wants you to have so if I click Linux Conda CUDA 12.1 is
01:14:38.920 | the latest so then step three is run this shell command replacing 12.1 with whatever
01:14:50.000 | the current version PyTorch wants. It's actually still 12.1 for me at this point and that'll
01:14:56.280 | install everything all the stuff you need to profile debug build etc all the nvidia
01:15:04.600 | tools you need the full suite will all be installed and it's coming directly from nvidia
01:15:08.720 | so you'll have like the proper versions as I said you can have multiple versions
01:15:13.480 | stored at once in different environments no problem at all and then finally install PyTorch
01:15:21.300 | and this command here will install PyTorch for some reason I wrote nightly here you don't
01:15:25.600 | need the nightly so just remove 'nightly' so this will install the latest version of
01:15:29.520 | PyTorch using the nvidia CUDA stuff that you just installed if you've used Conda before
01:15:36.240 | and it was really slow that's because it used to use a different solver which was thousands
01:15:42.160 | or tens of thousands of times slower than the modern one that has just been added and made
01:15:46.880 | default in the last couple of months so nowadays this should all run very fast and as I said
01:15:53.960 | it'll run under WSL on Windows it'll run on Ubuntu it'll run on Fedora it'll run on Debian
01:16:00.720 | it'll all just work so that's how I strongly recommend getting yourself set up for local
01:16:11.760 | development you don't need to worry about using Docker as I said you can switch between different
01:16:17.800 | CUDA versions different Python versions different compilers and so forth without having to worry
01:16:22.320 | about any of the Docker stuff and it's also efficient enough that if you've got the same
01:16:27.400 | libraries and so forth installed in multiple environments it'll hard link them so it won't
01:16:32.760 | even use additional hard drive space so it's also very efficient great so that's how you
01:16:40.880 | can get started on your own machine or on the cloud or whatever so hopefully you'll
01:16:45.800 | find that helpful as well alright thanks very much for watching I hope you found this useful
01:16:54.040 | and I look forward to hearing about what you create with CUDA in terms of going to the
01:17:01.960 | next steps check out the other CUDA mode lectures I will link to them and I would also recommend
01:17:09.760 | trying out some projects of your own so for example you could try to implement something
01:17:17.760 | like 4-bit quantization or flash attention or anything like that now those are kind of
01:17:26.000 | pretty big projects but you can try to break them up into smaller things you build up one
01:17:30.200 | step at a time and of course look at other people's code so look at the implementation
01:17:37.440 | of flash attention look at the implementation of bits and bytes look at the implementation
01:17:41.880 | of GPTQ and so forth the more time you spend reading other people's code the better alright
01:17:50.800 | I hope you found this useful and thank you very much for watching