llm.c's Origin and the Future of LLM Compilers - Andrej Karpathy at CUDA MODE
from natural language processing to reinforcement learning, 00:00:21.360 |
He's a distinguished machine learning superstar. 00:00:36.560 |
An ex-Google Brain, ex-DeepMind, ex-Tesla, Mr. Autopilot. 00:00:48.080 |
this special person joined the CUDA MODE Discord 00:00:56.960 |
and started one of the largest and most active community projects on our server. 00:00:59.900 |
But I guess it's best if he tells the story himself. 00:01:27.000 |
This is my favorite kind of event to present at. 00:01:30.640 |
and thank you for running CUDA MODE and putting this on. 00:01:38.400 |
We're training transformers in C and a pinch of C++. 00:01:49.360 |
and what this looked like from my perspective. 00:01:52.120 |
I was trying to add a video to my YouTube series 00:01:54.400 |
and I was trying to teach people LLM training, 00:02:02.920 |
And then you've all worked with PyTorch, of course, right? 00:02:07.760 |
okay, you have your model, which you've written. 00:02:11.680 |
of a number of abstractions here at the same time. 00:02:18.300 |
And suddenly things start to be a little bit more complicated 00:02:21.120 |
because I'm not even sure in what order you do these, 00:02:26.920 |
So I don't fully understand how any of this works. 00:02:30.960 |
you want to use your model in different ways. 00:02:40.200 |
but for some reason eval and inference was not working. 00:02:44.600 |
I was getting some kind of a Torch compile error 00:02:46.600 |
when I was trying to run my eval and my inference. 00:03:09.340 |
I was looking for ptrblck to solve my issue. 00:03:13.160 |
And unfortunately, ptrblck did not have any guidance 00:03:18.960 |
So two hours later, I'm fighting with Torch compile 00:03:21.680 |
and I'm trying to figure out what the hell is going on. 00:03:23.240 |
I'm kind of sad, I don't know exactly how to solve this. 00:03:40.360 |
And then eventually I entered into a state of anger. 00:03:45.320 |
You know what, I'm just gonna write the whole thing. 00:03:47.480 |
I understand in my mind that what I'm trying to do, 00:03:49.800 |
like the computation itself, the algorithm itself 00:03:53.080 |
And for some reason, Torch compile doesn't let me 00:03:58.160 |
And I was like, okay, I'm gonna take life into my own hands 00:04:02.320 |
I'm gonna just write this and see how bad could it be. 00:04:05.120 |
So let's think about what is PyTorch offering you, really? 00:04:09.880 |
And there's many things, but maybe some of the things 00:04:12.800 |
I don't know why there's bullet points everywhere. 00:04:20.880 |
Okay, but number one, we're getting an array, right? 00:04:26.880 |
If we're gonna abandon this, then we're gonna have to do 00:04:30.520 |
all the indexing ourselves, making sure that we ravel and unravel indices correctly (sketched below). 00:04:34.960 |
So if we don't have autograd, we need to do the forward and backward passes manually. 00:04:39.760 |
We don't have devices, so we have to worry about memory 00:04:41.880 |
being on the host or on the device and shuttling memory 00:04:44.160 |
around different devices between CPU and GPU and so on. 00:04:49.560 |
so we have to be very mindful of what precisions our tensors are stored in 00:04:52.280 |
and convert explicitly between them. 00:04:54.880 |
We don't have Torch Compile, so we're gonna have to do 00:04:56.880 |
all the kernel fusions that we want manually, 00:05:05.040 |
and we have to manually spin up all of our processes, make sure that they can find each other, 00:05:10.560 |
and this is just some of the things that PyTorch offers. 00:05:12.560 |
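As a rough illustration of the indexing point above (a sketch with made-up sizes, not the actual llm.c code), here is what raveling and unraveling indices looks like once a (B, T, C) tensor is just one flat float array:

```c
#include <stdlib.h>

// Hypothetical example: a (B, T, C) activation tensor stored as one flat buffer.
int main(void) {
    int B = 4, T = 64, C = 768;  // made-up sizes
    float* act = (float*)calloc((size_t)B * T * C, sizeof(float));

    // ravel: (b, t, c) -> flat offset into the buffer
    int b = 1, t = 10, c = 5;
    size_t idx = ((size_t)b * T + t) * C + c;
    act[idx] = 1.0f;

    // unravel: recover (b, t, c) back from the flat offset
    int c2 = (int)(idx % C);
    int t2 = (int)((idx / C) % T);
    int b2 = (int)(idx / ((size_t)T * C));
    (void)c2; (void)t2; (void)b2;

    free(act);
    return 0;
}
```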
So without PyTorch, we're kind of naked in the world, right? 00:05:22.180 |
But we do start from a PyTorch implementation, which now isn't the primary thing we're working with. 00:05:27.920 |
And so we're in PyTorch land, everything is nice and clean. 00:05:31.960 |
all the layers already exist and we're just calling them, so everything is great. 00:05:33.720 |
And that now becomes our reference in PyTorch. 00:05:36.840 |
I'd love to just take you through one example of a layer. 00:05:38.880 |
So for example, layer norm here is like a PyTorch layer, 00:05:41.960 |
and we'd like to basically port this over to C. 00:05:46.760 |
Well, we're gonna iterate through all the layers. 00:05:50.560 |
And actually, I had to write a forward pass of layer norm 00:05:53.360 |
because PyTorch doesn't just have this kind of explicit implementation 00:05:56.960 |
of layer norm available, because it's kind of like a block 00:06:01.520 |
So I had to write a forward pass of layer norm 00:06:03.400 |
and make sure it's equivalent to the layer norm in PyTorch. 00:06:07.140 |
I had to write the backward pass of layer norm. 00:06:09.320 |
This is where you kind of take out your pen and paper, 00:06:12.600 |
This is for batch norm, but layer norm would be similar. 00:06:17.720 |
And again, this is all still in PyTorch, but it's explicit. 00:06:20.680 |
And you're just making sure that the layer norm 00:06:23.560 |
matches this basically manual tensor-based implementation. 00:06:28.560 |
So now we have PyTorch code forward and backward. 00:06:34.000 |
So the next thing we do is we try to port it to C. 00:06:40.280 |
on the right, we basically have the equivalent implementation in C. 00:06:50.880 |
So we have a float star out, float star inputs, 00:06:53.440 |
outputs, means, standard deviations, weights, 00:07:03.400 |
I don't want to create any abstraction, really. 00:07:05.620 |
It's just float arrays and operations on float arrays. 00:07:08.560 |
Like, why should it be a lot more complicated than that? 00:07:18.060 |
This is the layer norm forward on float arrays, 00:07:22.960 |
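For reference, a layer norm forward over plain float arrays looks roughly like this (a sketch in the same spirit, not a verbatim copy of the llm.c function):

```c
#include <math.h>

// inp and out are (B, T, C); mean and rstd are (B, T), cached for the backward
// pass; weight and bias are (C,). Just loops over float arrays, nothing else.
void layernorm_forward(float* out, float* mean, float* rstd,
                       const float* inp, const float* weight, const float* bias,
                       int B, int T, int C) {
    float eps = 1e-5f;
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = inp + (b * T + t) * C;
            float m = 0.0f;                                   // mean over channels
            for (int c = 0; c < C; c++) { m += x[c]; }
            m /= C;
            float v = 0.0f;                                   // variance over channels
            for (int c = 0; c < C; c++) { float d = x[c] - m; v += d * d; }
            v /= C;
            float s = 1.0f / sqrtf(v + eps);                  // reciprocal std
            float* o = out + (b * T + t) * C;
            for (int c = 0; c < C; c++) {                     // normalize, scale, shift
                o[c] = s * (x[c] - m) * weight[c] + bias[c];
            }
            mean[b * T + t] = m;                              // cache for backward
            rstd[b * T + t] = s;
        }
    }
}
```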
And then you also do the backward for all the layers. 00:07:47.800 |
Then it's fixed, and from then on it's just dynamics 00:07:50.320 |
of just feeding data through it and training the model. 00:07:53.160 |
So we have to pre-plan all the tensors, their sizes, 00:08:17.800 |
just kind of like you allocate all these tensors 00:08:21.720 |
and you make sure everything flows correctly through. 00:08:24.020 |
And you just call the forwards and then all the backwards, 00:08:28.000 |
and you're left with gradient and you can do an update. 00:08:29.820 |
So stringing that together is the second piece of work. 00:08:32.940 |
And then once we sort of like strung it together, 00:08:34.720 |
you get something that you can just compile and run. 00:08:36.940 |
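The overall shape of that loop, reduced to a toy example (this is not llm.c, just an illustration of the allocate-once, forward/backward/update structure on a one-parameter-pair model):

```c
#include <stdio.h>
#include <stdlib.h>

// Toy illustration: parameters and gradients each live in one contiguous block,
// allocated once; every step is forward -> backward -> update on those blocks.
// The "model" is just y = w*x + b trained with plain SGD.
int main(void) {
    int num_params = 2;  // w and b
    float* params = (float*)calloc(num_params, sizeof(float));
    float* grads  = (float*)calloc(num_params, sizeof(float));
    float x = 2.0f, target = 7.0f, lr = 0.05f;

    for (int step = 0; step < 100; step++) {
        float pred = params[0] * x + params[1];        // forward
        float diff = pred - target;
        float loss = 0.5f * diff * diff;
        grads[0] = diff * x;                           // backward
        grads[1] = diff;
        for (int i = 0; i < num_params; i++) {         // update (llm.c uses AdamW)
            params[i] -= lr * grads[i];
        }
        if (step % 20 == 0) printf("step %d loss %f\n", step, loss);
    }
    free(params); free(grads);
    return 0;
}
```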
So, on the top left is everything that's required. 00:08:50.720 |
And then we just compile and run this little C code file. 00:08:56.720 |
And I think it's like 2000 lines or something like that, 00:09:00.240 |
And you run that program and it like does a little training 00:09:05.000 |
And then we can verify that it is identical to the PyTorch code 00:09:05.000 |
And at this point, I'm actually feeling quite great 00:09:22.040 |
All the memory is just allocated in a single block. 00:09:30.880 |
It, in principle, can train GPT-2, it's complete. 00:09:33.280 |
It will train GPT-2, you just have to wait a very long time. 00:09:33.280 |
It's just a single file of C with no dependencies. 00:09:42.080 |
this would be a great candidate to run on a Von Neumann probe 00:09:42.080 |
in space, if we just harden it a little bit more, 00:09:45.160 |
because you're not gonna ship PyTorch code on a probe. 00:09:48.240 |
So basically it's perfect because you wake up at 1 a.m. 00:10:12.960 |
and then at sunrise, you go do all the water activities. 00:10:15.400 |
So that is the villa where most of LLM.C was trained. 00:10:22.400 |
And this is a, I think the moon is about to set 00:10:26.680 |
This is a recommended way to do software development. 00:10:32.360 |
Okay, so now we have C code, but it's inefficient. 00:10:37.200 |
So we need to convert all of our C code to GPU. 00:10:39.440 |
So this is where we go to the dev CUDA part of the repo 00:10:43.920 |
So here's the layer norm forward pass, as I mentioned. 00:10:46.120 |
And now we're gonna develop a number of kernels 00:10:49.280 |
but now run on the GPU and they're gonna be faster. 00:10:51.760 |
And so usually we have versions one, two, three, 00:10:55.240 |
And these are all different kernel implementations. 00:11:02.720 |
So we develop all those layers and port them to CUDA. 00:11:10.960 |
Basically, the point here is the first kernel is usually trivial to write, 00:11:14.040 |
because you're parallelizing over batch and time. 00:11:17.760 |
And then you're basically copy pasting the C code into the kernel, 00:11:22.400 |
because you're parallelizing over the batch-time tokens 00:11:24.440 |
and each thread just handles a single output element. 00:11:28.720 |
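A sketch of what such a first kernel can look like for layer norm (illustrative only, with one thread per (b, t) token rather than per element; not the actual llm.c kernel):

```cuda
#include <cuda_runtime.h>

// Naive "version 1" style kernel: parallelize over the N = B*T tokens and let
// each thread run essentially the same loop body as the C code for one token.
__global__ void layernorm_forward_kernel1(float* out, float* mean, float* rstd,
                                          const float* inp, const float* weight,
                                          const float* bias, int N, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per token
    if (idx >= N) return;
    const float* x = inp + idx * C;
    float m = 0.0f;
    for (int c = 0; c < C; c++) { m += x[c]; }
    m /= C;
    float v = 0.0f;
    for (int c = 0; c < C; c++) { float d = x[c] - m; v += d * d; }
    v /= C;
    float s = rsqrtf(v + 1e-5f);
    float* o = out + idx * C;
    for (int c = 0; c < C; c++) { o[c] = s * (x[c] - m) * weight[c] + bias[c]; }
    mean[idx] = m;
    rstd[idx] = s;
}

// launch example: int N = B * T, block = 256;
// layernorm_forward_kernel1<<<(N + block - 1) / block, block>>>(out, mean, rstd, inp, weight, bias, N, C);
```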
but then optimization is gonna be pretty elaborate. 00:11:31.920 |
So by the end, we get to kernel six, for example, 00:11:33.920 |
in layer norm, and we're doing a lot of things 00:11:36.400 |
So we have some, you know, reduce operations. 00:11:40.680 |
We also go through shared memory, 00:11:52.200 |
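One of the standard building blocks in those later kernels is a warp-level reduction with shuffle intrinsics, along these lines (a generic sketch, not the exact llm.c kernel 6):

```cuda
// 32 threads of a warp cooperate to sum their values without touching shared
// memory; after the loop, lane 0 of the warp holds the total.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}
```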
And I'm gonna go into a bit more detail later, 00:11:58.120 |
One thing that I sort of found in this project 00:11:59.840 |
is that it's not exactly trivial to learn CUDA, 00:12:02.840 |
unfortunately, and it was like a little bit harder than I expected, 00:12:07.400 |
but getting better at it, I think, is not trivial. 00:12:19.080 |
because a lot of the CUDA code that we ended up developing, 00:12:23.400 |
you would not find in this book, actually. 00:12:26.000 |
So a lot of the kernels that we ended up adding 00:12:39.440 |
And then you have this amazing blog post from Simon. 00:12:42.880 |
That is like way better than anything we deserve, 00:12:49.280 |
But yeah, so I think I found it a little bit difficult. 00:12:53.520 |
But I mean, I'm hoping that things like CUDA mode 00:12:55.880 |
can definitely improve the accessibility of writing CUDA. 00:13:00.880 |
Okay, so next what happened is I was basically struggling 00:13:09.460 |
and I was implementing all these CUDA kernels. 00:13:14.760 |
And so a team of Avengers assembled from the internet, 00:13:18.280 |
saw what was going on, and that's how they started contributing. 00:13:30.600 |
And this was incredible to watch and learn a lot from. 00:13:33.200 |
And there's many more, Ross Wheeler and Chen Hsiao 00:13:44.600 |
so that we can run and optimize all these kernels. 00:13:47.120 |
So it was amazing for me that people just came 00:13:48.880 |
from the internet and helped out on the project. 00:13:50.360 |
And you know, this is one of my favorite things 00:13:59.960 |
Okay, so we've converted all the layers to CUDA. 00:14:04.760 |
And we can now train on a single GPU in FP32 so far. 00:14:10.000 |
So from then on, we start to make more and more optimizations. 00:14:12.480 |
So number one, we don't want to have matmuls in FP32 written by us, so we switch to cuBLAS. 00:14:20.040 |
Step two, we don't want to write our own flash attention, so we use the one from cuDNN. 00:14:28.800 |
Next, you definitely want to reach for mixed precision. 00:14:36.320 |
So you want to go over all your tensors for parameters and activations, 00:14:46.560 |
and then do all the conversions yourself. 00:14:54.560 |
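As a minimal sketch of that mixed-precision bookkeeping (assuming BF16 for the low-precision copies; illustrative, not the llm.c code), the explicit conversion is just a small kernel:

```cuda
#include <cuda_bf16.h>

// Convert an FP32 buffer (e.g. master weights) into a BF16 copy that the
// matmuls and most of the network can consume; one element per thread.
__global__ void float_to_bf16_kernel(__nv_bfloat16* dst, const float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __float2bfloat16(src[i]);
    }
}
```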
So as an example, we did all the kernel fusions, 00:15:01.280 |
There's been a lot of optimizations from Eric, 00:15:04.520 |
especially on minimizing the amount of memory traffic, 00:15:15.800 |
and on using the widest load and store instructions that are available, 00:15:17.640 |
but that somehow the compiler is unwilling to use in many cases. 00:15:32.320 |
okay, there should be a 128-bit load and store here, 00:15:34.440 |
but it happens to be a 32-bit one or something else, 00:15:41.320 |
so reaching for the packed data types kind of forces the compiler's hand a bit more. 00:15:43.960 |
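The idea, in a simplified form (llm.c wraps this in a small packed struct; this is the bare float4 version, assuming 16-byte aligned pointers and n divisible by 4):

```cuda
// Each thread moves 16 bytes per load and per store by viewing the float
// buffers as float4, which makes the compiler emit 128-bit memory instructions.
__global__ void scale_kernel_vec4(float* out, const float* in, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one float4 per thread
    if (i * 4 < n) {
        float4 v = reinterpret_cast<const float4*>(in)[i];  // one 128-bit load
        v.x *= alpha; v.y *= alpha; v.z *= alpha; v.w *= alpha;
        reinterpret_cast<float4*>(out)[i] = v;              // one 128-bit store
    }
}
```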
We implemented all kinds of CUDA streams to overlap work, 00:15:52.320 |
but that's mostly gone now, because at one point of LLM.c, as Arun would say, 00:15:55.480 |
I basically went in and I nuked it from orbit, 00:16:07.160 |
because of really weird race conditions and errors and so on. 00:16:10.120 |
So LLM.c is not actually as overlapped as it could be, 00:16:18.720 |
But maybe we can slowly reintroduce some of it. 00:16:21.760 |
We have stochastic rounding, we have full determinism. 00:16:21.760 |
Like the encoder backward was especially crazy, 00:16:29.120 |
because the encoder backward is trivial with atomics, 00:16:31.440 |
but much harder if you want full determinism and accuracy, like stochastic rounding and so on. 00:16:42.120 |
Then for multi-GPU training, you start to do all-reduce between all the different workers. 00:16:50.960 |
And we shard the optimizer states, which are in float, and these are really large buffers, 00:17:01.000 |
and it really helps to keep your memory requirements down. 00:17:07.800 |
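The core of the multi-GPU step is an NCCL all-reduce over the gradient buffer, roughly like this (sketch only; it assumes the communicator and stream are already set up, with one process per GPU):

```cuda
#include <nccl.h>
#include <cuda_runtime.h>

// Sum the gradients across all ranks, in place; each rank then scales by
// 1/world_size (or folds that into the optimizer update).
void allreduce_grads(float* grads, size_t count, ncclComm_t comm, cudaStream_t stream) {
    ncclAllReduce(grads, grads, count, ncclFloat, ncclSum, comm, stream);
}
```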
You have to weigh the benefit against the complexity of what you're actually introducing. 00:17:30.280 |
And so I actually rejected a lot of PRs because of that, 00:17:37.380 |
And I think that complexity decreases the amount of people who can understand the code. 00:17:39.860 |
And then after multi-GPU, you go multi-node. 00:17:43.960 |
So now you are running across multiple machines, and you 00:17:46.160 |
have to make sure that you synchronize all of them, 00:17:53.960 |
actually train GPT-2, and we can actually reproduce it. 00:17:57.320 |
So there's a post in the discussions of llm.c on reproducing GPT-2, 00:18:02.080 |
which was a state-of-the-art LLM as of 2019 or so. 00:18:12.020 |
And the way you do that is it's extremely dependency free. 00:18:14.300 |
There's no need for Python, no need for PyTorch. 00:18:16.860 |
You do need cuDNN, which is the heaviest dependency, but it's optional. 00:18:16.860 |
So if you'd like to roll your own manual attention, that works too. 00:18:21.820 |
But cuDNN is kind of like the hairiest dependency. 00:18:25.580 |
And then you compile your code and you run it. 00:18:38.640 |
And it starts stepping and you wait 24 hours. 00:18:41.360 |
And then it's stepping, doing some diagnostics. 00:18:44.920 |
We have almost a 50% MFU here on one node, which is quite good. 00:18:57.400 |
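For reference, MFU here is the usual back-of-the-envelope number: achieved model FLOPs per second divided by the hardware's peak FLOPs, with roughly $6N$ FLOPs per token for a model with $N$ parameters (forward plus backward):

$$\text{MFU} \approx \frac{6N \cdot \text{tokens/sec}}{\text{peak FLOP/s}}$$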
No crazy numerical issues, loss spikes or anything like that. 00:19:01.360 |
And yeah, you get a really good model out of llm.c. 00:19:08.860 |
Because remember, we have PyTorch implementation 00:19:12.980 |
And so you can run the whole training in PyTorch as well. 00:19:16.460 |
And we can compare the two implementations side-by-side. 00:19:19.060 |
And in particular, at the time of writing that post-- 00:19:24.700 |
and I'm sure PyTorch will continue to optimize things over time-- but at the time of that post, 00:19:30.660 |
And we were 20% faster in training, just in terms of throughput. 00:19:33.700 |
And I don't know if I fully super duper optimized the PyTorch version, 00:19:38.360 |
But we were able to, I think, beat PyTorch in this training mode, 00:19:44.880 |
want to train anything else, you're in a lot of trouble. 00:19:49.780 |
And when I'm doing that, I won't come back to it. 00:19:51.820 |
But for GPT-2 training, we're better after all that work. 00:19:58.220 |
Torch compile actually takes quite a bit of time, 00:20:06.500 |
So looping back around, turns out it wasn't all that simple. 00:20:27.540 |
We actually thought maybe we would have it done by today. 00:20:33.880 |
But we will have Llama 3.1 training in llm.c very, very soon. 00:20:42.920 |
And there's a big PR that's coming for FP8 support, 00:20:52.240 |
The AMD fork is very active, as far as I understand it, 00:20:56.480 |
I think also the C++ CUDA fork is quite nice. 00:21:08.420 |
I think it's pretty well understood what's in there. 00:21:10.580 |
It's only maybe like, I think, 3,000 lines of code, 00:21:15.700 |
And one more thought, I think, that I wanted to get across 00:21:18.120 |
is it wasn't all that haphazard to start the project. 00:21:22.220 |
I had another motivation for starting the project. 00:21:26.100 |
And that's that, I think, I mean, what is llm.c, really? 00:21:28.980 |
PyTorch, especially when you push it through torch compile, is kind of like a compiler for your network, 00:21:36.260 |
and llm.c is like the low-level version where you're doing everything manually, right? 00:21:38.820 |
And basically I think we wrote llm.c as multiple people, 00:21:45.860 |
and got something that was faster than PyTorch 00:21:50.500 |
And so this exercise basically proves that this is possible. 00:21:53.860 |
Now the problem is that you need multiple people and a lot of effort to do this. 00:21:57.100 |
But if LLMs are about to become much better at coding 00:21:59.980 |
over time, then I think you can expect that the LLM 00:22:02.900 |
could actually do this for any custom application over time. 00:22:05.540 |
And so the LLMs could act as a kind of compiler 00:22:08.540 |
for any custom application you're interested in, 00:22:13.940 |
emitting custom code that you can compile and run for your specific applications. 00:22:17.820 |
A lot of this infrastructure, like the use of Python and PyTorch and everything else, 00:22:20.500 |
is just a crutch because we humans are finite. 00:22:22.860 |
We have finite knowledge, intelligence, and potential. 00:22:30.660 |
And so the other thing that I think is interesting 00:22:30.660 |
is that given the current capability of LLMs and their intelligence, they might not be able 00:22:37.940 |
to do this from scratch yet. But if you put llm.c in the context of an LLM session, 00:22:46.260 |
then you can expect that the few-shot learning would help it write this kind of code for other applications. 00:23:00.260 |
And so I think this is actually not unlikely to happen. 00:23:22.260 |
- All right, well, that's it for the morning session talks.