
llm.c's Origin and the Future of LLM Compilers - Andrej Karpathy at CUDA MODE


Transcript

to introduce to you a legendary person. Someone who has been hacking and educating at the forefront of AI for over a decade. From neural networks to computer vision, from natural language processing to reinforcement learning, he has pushed the boundaries and inspired millions all over the world, including, I think, all of us here.

He's a distinguished machine learning superstar. A founding member of OpenAI, the reference human for ImageNet. (audience laughing) Ex-Google Brain, ex-DeepMind, ex-Tesla, Mr. Autopilot. He's seen it all. And some months ago, on a memorable day, this special person joined the CUDA MODE Discord to start hacking with others on llm.c, which became one of the greatest and most active community projects on our server.

But I guess it's best if he tells the story himself. So please join me in welcoming the incredible one and only Andrej Karpathy! (audience applauding) - Wow, okay. (audience laughing) Okay, I'm very excited to be here. This is my favorite kind of event to present at. So yeah, thank you for the invitation and thank you for running CUDA MODE and putting this on.

This is a wonderful event. Okay, so I'll tell you a bit about llm.c. So what are we doing? We're training transformers in C, and a pinch of C++. How do I go to the next slide here? Okay, so I'd like to tell the story a little bit of how this project came about and what it looked like from my perspective.

So roughly a year ago, I was trying to add a video to my YouTube series, and I was trying to teach people LLM training, GPT training, and so on. And I was basically hacking on it, trying to get it to work. So that was me. And then, you've all worked with PyTorch, of course, right?

So the trickiness is that, okay, you have your model, which you've written. That makes sense. But now you have to keep track of a number of abstractions at the same time. So you have to put it on a device. You want to compile it. You want to wrap it in DDP.

And suddenly things start to be a little bit more complicated because I'm not even sure in what order you do these, what exactly happens? What are these abstractions? What do they do to your model? So I don't fully understand how any of this works. And then what happens is you want to use your model in different ways.

So you want to use it in evaluation, in training, or in model inference, and so on. And what happened to me is that I was able to train the model, but for some reason eval and inference were not working. And what happened was I was getting some kind of Torch compile error when I was trying to run my eval and my inference.

And this is just an illustrative example of a Torch compile error. It was something else, I don't remember exactly; I couldn't capture it. But both of them, inference and eval, were giving me errors, and a different error each. And I had no idea what was going on. So I did what anyone in my position would do: I went to the discussion forum.

(audience laughing) (audience applauding) I was looking for ptrblck to solve my issue. And unfortunately, ptrblck did not have any guidance that I could see on that specific error. So I was kind of stuck, honestly. So two hours later, I'm fighting with Torch compile and I'm trying to figure out what the hell is going on.

I'm kind of sad, I don't know exactly how to solve this. And so I felt like I was going through the stages of grief. (audience laughing) In the beginning, I was in denial. I was like, this can't be happening to me. I'm not doing anything crazy. I'm just training a little GPT.

Like, why is this not working? This seems really simple. I'm not doing anything crazy. And then eventually I entered into a state of anger. (audience laughing) And I was like, okay. You know what, I'm just gonna write the whole thing. (audience laughing) I understand in my mind that what I'm trying to do, like the computation itself, the algorithm itself is totally clear in my mind.

And for some reason, Torch compile doesn't let me use it, run it, et cetera. So I felt a little bit powerless. And I was like, okay, I'm gonna take life into my own hands and be in control of my destiny. I'm gonna just write this and see how bad could it be.

So let's think about what PyTorch is offering you, really. And there are many things, but maybe some of the things that are relevant here. I don't know why there are bullet points everywhere. (audience laughing) My slides app does that, and I don't know how to turn it off. Okay, but number one, we're getting an array, right?

So a very useful n-dimensional array that we can manipulate with operations. If we're gonna abandon this, then we're gonna have to do a lot of pointer arithmetic, basically, making sure that we ravel and unravel indices correctly. Second, we're getting autograd for free. So if we don't have autograd, we need to write the forward and backward passes of all the layers ourselves.

We don't have devices, so we have to worry about memory being on the host or on the device and shuttling memory around between CPU and GPU and so on. We don't have simple dtype conversions, so we have to be very mindful of what precision tensors are stored in and convert explicitly between them.

We don't have Torch Compile, so we're gonna have to do all the kernel fusions that we want manually, and we're gonna have to optimize for space and time performance manually. And finally, we don't have torch.distributed, so we're gonna have to manually spin up all of our processes, make sure that they can find each other, and communicate via NCCL, et cetera.

So PyTorch is really, really nice, and this is just some of the things that PyTorch offers. So without PyTorch, we're kind of naked in the world, right? But maybe it's okay. So, yeah, how bad can it be? So, step one, we have our PyTorch code, which now isn't the primary thing we're working with.

It's only a reference that we check for correctness with respect to. And so we're in PyTorch land, everything is nice and clean. We have a little transformer, a few modules, and we're just calling them, so everything is great. And that now becomes our reference in PyTorch. I'd love to just take you through one example of a layer.

So for example, layer norm here is like a PyTorch layer, and we'd like to basically port this over to C. So what kind of process do we go through? Well, we're gonna iterate through all the layers. Number one, we need the forward pass. And actually, I had to write a forward pass of layer norm, because PyTorch doesn't just have this kind of implementation of layer norm in PyTorch itself; it's kind of a black box that eventually calls into some CUDA kernels.

So I had to write a forward pass of layer norm and make sure it's equivalent to the layer norm in PyTorch. And then, of course, I had to write the backward pass of layer norm. This is where you kind of take out your pen and paper and do some backprop.

This is for batch norm, but layer norm would be similar. And yeah, we had to write the backward pass. And again, this is all still in PyTorch, but it's explicit. And you're just making sure that the layer norm of PyTorch, forward and backward, matches this basically manual tensor-based implementation.
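For the record, the input gradient you end up deriving for layer norm looks roughly like this (my notation, not from the slides: $\hat{x}$ is the normalized input, $\sigma$ the standard deviation over the $C$ channels, $w$ the weight, and $g_i = \frac{\partial L}{\partial y_i} w_i$):

$$
\frac{\partial L}{\partial x_i} = \frac{1}{\sigma}\left( g_i - \frac{1}{C}\sum_{j} g_j - \hat{x}_i \cdot \frac{1}{C}\sum_{j} g_j \hat{x}_j \right),
\qquad
\frac{\partial L}{\partial w_i} = \sum_{b,t} \frac{\partial L}{\partial y_i}\,\hat{x}_i,
\qquad
\frac{\partial L}{\partial b_i} = \sum_{b,t} \frac{\partial L}{\partial y_i}.
$$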

So now we have PyTorch code forward and backward. So the next thing we do is we try to port it to C. And this is actually a lot simpler in many cases than you might think. So on the left, we have the PyTorch code, on the right, we basically have the equivalent of layer norm forward in C.

And it's not that crazy, right? So unlike in PyTorch, we just have a bunch of float star arrays. So we have a float star out, and float star arrays for the inputs, outputs, means, standard deviations, weights, and biases, plus a few integer sizes. And one thing I really like to do in llm.c is I just want to keep things simple.

I don't want to create a tensor abstraction. I don't want to create any abstraction, really. It's just float arrays and operations on float arrays. Like, why should it be a lot more complicated than that? So everything is just float arrays. Everything is fully self-contained. There's no underlying representations, abstractions to call, import, et cetera.
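To make that concrete, here is a simplified sketch of roughly what a layer norm forward looks like in this style (not the exact llm.c code, but the same shape: plain loops over (B,T), plain float arrays, and the mean and reciprocal std cached for the backward pass):

```c
#include <math.h>

// out is (B,T,C); mean and rstd are (B,T) caches for the backward pass;
// inp is (B,T,C); weight and bias are (C,).
void layernorm_forward(float* out, float* mean, float* rstd,
                       const float* inp, const float* weight, const float* bias,
                       int B, int T, int C) {
    float eps = 1e-5f;
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = inp + b * T * C + t * C;
            // mean over the channel dimension
            float m = 0.0f;
            for (int i = 0; i < C; i++) { m += x[i]; }
            m = m / C;
            // variance over the channel dimension
            float v = 0.0f;
            for (int i = 0; i < C; i++) { float d = x[i] - m; v += d * d; }
            v = v / C;
            // normalize, then scale and shift
            float s = 1.0f / sqrtf(v + eps);
            float* o = out + b * T * C + t * C;
            for (int i = 0; i < C; i++) {
                o[i] = s * (x[i] - m) * weight[i] + bias[i];
            }
            // cache mean and reciprocal std for the backward pass
            mean[b * T + t] = m;
            rstd[b * T + t] = s;
        }
    }
}
```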

This is the layer norm forward on float arrays, and that's it. So that's the forward. And then you also do the backward for all the layers. Once we've done that for all the layers and converted everything to C to make sure that everything matches our reference implementation, we have to start to string it together.

So we go into our C code, into main, and we have to allocate all of the memory that we're going to be using. In llm.c, all of the allocation happens a single time, at the beginning. So we pre-plan all of the memory that we're ever going to use.

Then it's fixed, and from then on it's just a matter of feeding data through it and training the model. So we have to pre-plan all the tensors and their sizes, and we have to do that for the parameters: we need data and grad, and also the m and v buffers for AdamW.

And then for the activations as well, and we need space for both data and grad. And so you just pre-plan all the memory, you allocate all of it, and then we just stitch it all up. So we have all these layers, and they have a forward and a backward pass in backpropagation.

And so on the forward pass, you have all these tensors, and you're very careful to index into them properly and make sure everything flows correctly through. You just call all the forwards, then all the backwards, and then you're kind of done: you're left with gradients and you can do an update.
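In pseudocode-ish C, the overall structure is something like this (a rough sketch with placeholder names like model_forward and adamw_update, not the actual llm.c API):

```c
// All buffers are sized and allocated exactly once, up front.
float* params     = (float*)malloc(num_param_floats * sizeof(float)); // weights
float* grads      = (float*)malloc(num_param_floats * sizeof(float)); // weight gradients
float* m_memory   = (float*)calloc(num_param_floats, sizeof(float));  // AdamW first moment
float* v_memory   = (float*)calloc(num_param_floats, sizeof(float));  // AdamW second moment
float* acts       = (float*)malloc(num_act_floats * sizeof(float));   // activations
float* acts_grads = (float*)malloc(num_act_floats * sizeof(float));   // activation gradients

for (int step = 0; step < num_steps; step++) {
    next_batch(inputs, targets);                            // load the next batch of tokens
    model_forward(params, acts, inputs, targets);           // call every *_forward() in order
    model_zero_grad(grads, acts_grads);
    model_backward(params, grads, acts, acts_grads);        // call every *_backward() in reverse
    adamw_update(params, grads, m_memory, v_memory, step);  // parameter update
}
```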

So stringing that together is the second piece of work. And then once we've strung it together, you get something that you can just compile and run. On the top left is everything that's required. We download a starter pack, which is really just the GPT-2 weights in a single binary file, very simple.

And also we need the data set, in this case TinyShakespeare, and the tokenizer and stuff like that. And then we just compile and run this little C code file. It's a single file of C at this point. And I think it's like 2000 lines or something like that, if I remember correctly.

And you run that program and it does a little bit of training and outputs some Shakespeare at the end. And then we can verify that the PyTorch code and the C code give identical results, and everything is great. We're just running C. And at this point, I'm actually feeling quite great, because this is amazing.

So we have a single file of C, there are no dependencies whatsoever. It compiles instantly, it runs instantly. All the memory is just allocated in a single block. So if you start stepping, there's no way you're gonna OOM later. It's all pre-planned. It's fully deterministic. And it, in principle, can train GPT-2; it's complete.

It will train GPT-2, you just have to wait a long time. And it can run on a potato. It can just run on anything. It's just a single file of C with no dependencies. And in principle, this would be a great candidate to run on a Von Neumann probe in space, if we just harden it a little bit more, because you're not gonna ship PyTorch code on a Von Neumann probe.

But I think llm.c is great. (all laughing) So I was feeling great at this point. Fun side note, by the way, all of this work that I described so far - Whoa. - Whoa. - happened while I was on vacation, and while I was jet-lagged, in the Maldives. So basically it's perfect, because you wake up at 1 a.m.

and there's nothing to do. So you write stuff like llm.c, and then at sunrise, you go do all the water activities. So that is the villa where most of llm.c was trained. So that was perfect. This is a picture of it. (audience laughing) And this is, I think, the moon about to set and the sunrise about to happen.

This is a recommended way to do software development. (audience laughing) Okay, so now we have C code, but it's inefficient. So we'd like to run it faster. For that, we need GPUs. So we need to convert all of our C code to CUDA so it runs on the GPU. So this is where we go to the dev/cuda part of the repo and we start to develop all the kernels.

So here's the layernorm forward pass, as I mentioned. And now we're gonna develop a number of kernels that have identical functionality, but now run on the GPU, and they're gonna be faster. And so usually we have versions one, two, three, four, five, six, et cetera. And these are all different kernel implementations.

They're a bit faster usually over time, but they match the specification exactly and give the exact same numbers. So we develop all those layers and port them to CUDA. And this is, I don't know what this is. I'm gonna skip that. (audience laughing) I need to look at one of the kernels.

Basically, the point here is the first kernel is usually trivial to do, because you're parallelizing over batch and time, and then you're basically copy-pasting the C code into your CUDA kernel. And you're already getting speedups, because you're parallelizing over the batch-time tokens and each thread just handles a single output element.
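For layer norm, for example, a version-1 kernel is basically the C loop body with the outer (b,t) loops replaced by a thread index (a simplified sketch, not the exact llm.c kernel; here each thread handles one token's row of C outputs):

```cuda
// N = B*T token positions; each thread normalizes one row of C channels.
__global__ void layernorm_forward_kernel1(float* out, const float* inp,
                                          const float* weight, const float* bias,
                                          int N, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // which (b,t) position
    if (idx >= N) return;
    const float* x = inp + idx * C;
    float m = 0.0f;
    for (int i = 0; i < C; i++) { m += x[i]; }
    m /= C;
    float v = 0.0f;
    for (int i = 0; i < C; i++) { float d = x[i] - m; v += d * d; }
    v /= C;
    float s = rsqrtf(v + 1e-5f);
    float* o = out + idx * C;
    for (int i = 0; i < C; i++) {
        o[i] = s * (x[i] - m) * weight[i] + bias[i];
    }
}
// launch: layernorm_forward_kernel1<<<(B*T + 255) / 256, 256>>>(out, inp, weight, bias, B*T, C);
```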

So the first kernel is usually trivial, but then the optimization is gonna get pretty elaborate. So by the end, we get to kernel six, for example, in layer norm, and we're doing a lot of things that are a bit more complicated. So we have some warp-level reduce operations.

We're also going through shared memory and global memory, orchestrating it all correctly, cache streaming hints, and a bunch of little tips and tricks for dealing with everything. And I'm gonna go into a bit more detail later, but you can get arbitrarily complicated when you're writing the CUDA code.

One thing that I sort of found in this project is that it's not exactly trivial to learn CUDA, unfortunately, and it was like a little bit harder than I expected. I mean, I knew some CUDA going in, but getting better at it, I think, is not trivial. I think some of these books, unfortunately, are a bit out of date, as you might know.

PMPP (Programming Massively Parallel Processors) is actually quite good. But also, I think it's still mostly at the beginner level, because a lot of the CUDA code that we ended up developing over the lifetime of the llm.c project, you would not find in this book. So a lot of the kernels that we ended up adding would just not be covered.

And then on top of that, you have this CUDA C++ programming guide, which, frankly, is not exactly readable for someone who is a bit new to CUDA. And then you have this amazing blog post from Simon (he's at Anthropic now, I think?), which is way better than anything we deserve, just randomly on the internet.

So that was incredible. And if there was just more of that, that would be so amazing. But yeah, I found it a little bit difficult. But I'm hoping that things like CUDA MODE can definitely improve the accessibility of writing CUDA. Okay, so next, what happened is that I was basically struggling with the CUDA code a little bit.

And I was reading through the book and I was implementing all these CUDA kernels. And they're like okay CUDA kernels, but they're not great. And so a team of Avengers assembled from the internet when they saw the project, and that's how people started contributing. So specifically, Eric, Arun, and Aleksa are, I would say, core devs of llm.c and have contributed a ton of work to it.

And they started to really optimize and write all these kernels. And this was incredible to watch and to learn a lot from. And there are many more, Ross Wheeler and Chen Hsiao and a few others. Over time, we have had 60 contributors to the llm.c project. Shout out to LLM.Dev for sponsoring llm.c.

They contribute the compute so that we can run and optimize all these kernels. So it was amazing for me that people just came from the internet and helped out on the project. And this is one of my favorite things that can happen with an open-source, MIT-licensed repo: people just come from the internet and help contribute. It's amazing.

Okay, so we've converted all the layers to CUDA. We now have all the kernels, and we can now train on a single GPU, in FP32 so far. So that's great. From then on, we start to make more and more optimizations. Number one, we don't want to do the matmuls in FP32 with our own hand-rolled code.

We actually switched to cuBLAS. Step two, we don't want to write our own flash attention; I think that would be pretty complicated. Turns out cuDNN has a very good flash attention implementation, so we switched to that. Next, you definitely want to reach for mixed precision to speed up the code.

So you want to go over all your tensors, for the parameters and also for the activations and so on, and you have to start to think about, okay, which ones are in float32, which ones are in bfloat16, what precision they're in, and then handle all the conversions.
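As a tiny illustration of the kind of conversion involved (illustrative only, not llm.c's actual precision scheme), CUDA's cuda_bf16.h provides the bfloat16 type and conversion intrinsics:

```cuda
#include <cuda_bf16.h>

// Cast a float32 buffer down to bfloat16 (e.g. for storing parameters or
// activations in lower precision while keeping a float32 master copy elsewhere).
__global__ void float_to_bf16_kernel(__nv_bfloat16* out, const float* in, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) { out[i] = __float2bfloat16(in[i]); } // round float32 -> bfloat16
}
```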

So we reached for that and implemented that. There's many, many other optimizations that we've been implementing over time. So as an example, we did all the kernel fusions, different recompute settings to recompute a piece of the forward pass during the backward. There's been a lot of optimizations from Eric, especially on minimizing the amount of memory that you need during the backward pass.

We have this Packed128 data structure, which basically, in our experience, forces the compiler to use the 128-bit load and store instructions that are available, but that the compiler is somehow unwilling to use in many cases. So I think Arun did a lot of work here, where you just look at the SASS, the assembly, and you look at what instructions are being used in your loop, and you figure out that, okay, there should be a 128-bit load and store here, but it happens to be a 32-bit one or something else, because something in the NVCC compiler is not going well.
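The idea looks roughly like this (a simplified sketch; the real llm.c Packed128 is a small template over the element type):

```cuda
// A 16-byte aligned struct holding 128 bits, loaded and stored in one shot,
// which nudges the compiler toward LDG.128 / STG.128 instructions.
struct alignas(16) f128 {
    float payload[4]; // 4 x 32-bit floats = 128 bits
};

__device__ __forceinline__ f128 load128(const float* address) {
    return *reinterpret_cast<const f128*>(address); // one 128-bit load
}

__device__ __forceinline__ void store128(float* address, f128 value) {
    *reinterpret_cast<f128*>(address) = value;      // one 128-bit store
}

// Example: each thread scales 4 consecutive floats (assumes N is a multiple of 4
// and the pointers are 16-byte aligned).
__global__ void scale_kernel(float* out, const float* in, float alpha, int N) {
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i >= N) return;
    f128 x = load128(in + i);
    for (int k = 0; k < 4; k++) { x.payload[k] *= alpha; }
    store128(out + i, x);
}
```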

So we found that this data structure kind of forces the compiler's hand a bit more. We implemented all kinds of CUDA streams to overlap parts of the computation, and this ended up being a total disaster. And so that's why I scratched it out, because at one point in llm.c, as Arun would say, I basically went in and nuked it from orbit.

I just went in, I control-F'd for all mentions of "stream", and I just deleted, deleted, deleted. And basically I deleted all the streams and made everything run on a single stream, because we ended up getting all kinds of really weird race conditions and errors and so on. I just didn't want to deal with it.

So llm.c is not actually as overlapped as it could be, but it's just too much complexity for not enough gain at this point. But maybe we can slowly reintroduce some of it. We have stochastic rounding, we have full determinism. Full determinism turns out to be pretty hard, because some of the kernels complexify a lot when you can't use atomics.

Like the encoder backward was especially crazy, because the encoder backward is trivial with atomics, but non-trivial without them. Anyway, a lot of the optimizations were done with efficiency and determinism in mind, and accuracy too, like stochastic rounding and so on. Next, you want to use multiple GPUs, not just a single GPU.

So this is where you bring in NCCL and you start to do all-reduce between all the different workers. And this is where you also start to reach for a sharded optimizer state, ZeRO-1. Basically, you take your optimizer states, which are in float and are really large buffers for AdamW, and you spread them out across all the GPUs, and it really helps to keep your per-GPU memory requirements down.
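Schematically, the per-step flow with ZeRO-1 looks something like this (a fragment sketch, not the actual llm.c code; adamw_update_shard and the shard buffers are placeholder names):

```cuda
// Each of the world_size ranks owns 1/world_size of the parameters and of the
// AdamW state (m, v); assume num_params divides evenly.
size_t shard = num_params / world_size;
float* my_grads  = grads_shard;            // this rank's slice of the averaged gradients
float* my_params = params + rank * shard;  // this rank's slice of the parameters

// 1) reduce-scatter: every rank ends up with averaged gradients for its shard only
ncclReduceScatter(grads, my_grads, shard, ncclFloat, ncclAvg, comm, stream);

// 2) AdamW update on this shard only, using the locally stored m/v shard
adamw_update_shard(my_params, my_grads, m_shard, v_shard, shard, step);

// 3) all-gather the updated shards so every rank has the full parameters again
ncclAllGather(my_params, params, shard, ncclFloat, comm, stream);
```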

So very helpful to reach for that. So currently, llm.c uses ZeRO-1, which is the sharded optimizer state. There's a PR for ZeRO-2, but I don't believe I've merged that yet, because it gets a little bit messy, but it might be merged eventually. A lot of llm.c is just balancing the improvement in speed against the complexity of what you're actually introducing.

And so I actually rejected a lot of PRs because of that, because the code starts to get crazy, and I think that decreases the number of people that can be onboarded. And then after multi-GPU, you go multi-node. So now you are running across multiple machines, and you have to make sure that you synchronize all of them, that they can find each other, and so on.

So we implemented all that. And where that leads us is that we can actually train GPT-2, and we can actually reproduce it after all of that work. So there's a post in the Discussions of the llm.c repo. We can train the 1.6-billion-parameter GPT-2, which was a state-of-the-art LLM as of 2019 or so.

And you can train it on a single H100 node in about 24 hours, and that costs roughly $600. And the way you do it is extremely dependency-free: there's no need for Python, no need for PyTorch. You do need cuDNN, which is the heaviest dependency, but cuDNN is optional.

So if you'd like to roll your own manual attention, that is possible in llm.c; cuDNN is just kind of the hairiest dependency. But after that, it's just a bunch of C code. You compile it and you run it. There's no need for really anything else: no conda environments, no pip installs, there's just nothing.

It's just amazing. And then you compile your code and you run it, and it starts stepping, printing some diagnostics, and you wait 24 hours. We have almost 50% MFU here on one node, which is quite good. And you get really nice plots, and you beat GPT-2 on HellaSwag.

And basically, it's just the case that the optimization went well: no crazy numerical issues, no loss spikes or anything like that at this size. And yeah, you end up with a really nice model in llm.c. We can still compare to PyTorch, because remember, we have the PyTorch implementation of all this stuff in parallel on the side.

And so you can run almost the whole training in PyTorch as well, and we can compare the two implementations side by side. And in particular, at the time of writing that post (and I don't know if this has changed, because the PyTorch team continues to optimize things over time), we were using 30% less memory in llm.c.

And we were 20% faster in training, just in throughput. And I don't know if I fully, super-duper optimized the PyTorch implementation; I did my personal best. But we were able to, I think, beat PyTorch in training this one model, specifically GPT-2, in llm.c. If you want to train anything else, you're in a lot of trouble.

You have to change the code a lot, and I'll come back to that. But for GPT-2 training, we're better after all that work. And it also compiles and runs much faster, which is beautiful. Torch compile actually takes quite a bit of time, like a minute or something.

You're just waiting. So that's also something that I personally don't like to work with usually. OK. So looping back around, turns out it wasn't all that simple. I mean, there was a lot of stuff involved. And it took a few months for a few people. But it was fun.

We learned a lot. And we were friends along the way. This is the LLN.C core devs. So it was great. Ongoing work. We are adding a lot of free support. We actually thought maybe we would have it done by today. But there's a little bit more work to do.

But we will have Llama 3.1 training in llm.c very, very soon. We will have FP8 support; people have been working on this, and there's a big PR coming for FP8 support, which is also interesting. And there are a lot of notable forks of llm.c. They're all listed on the GitHub repo.

The AMD fork is very active, as far as I understand, and quite good. I think the C++ CUDA fork is also quite nice. And so there are a lot of forks. So I encourage you to also check out llm.c. It's fairly readable, I think. I try to keep it clean and well-documented.

I think it's pretty well understood what's in there. It's only maybe, I think, 3,000 lines of code, mostly C. And one more thought that I wanted to get across is that it wasn't all that haphazard to start the project. I had another motivation for starting the project.

And that's that, I mean, what is llm.c, really? PyTorch, especially when you push it through the compiler, is a bit like GCC for Software 2.0. It's a compiler. But llm.c is a bit like writing assembly, where you're doing everything manually, right? And basically, we wrote llm.c as multiple people over a duration of about three months, and got something that was faster than PyTorch in the specific setting of GPT-2 training.

And so this exercise basically proves that this is possible. Now, the problem is that it takes multiple people several months. But if LLMs are about to become much better at coding over time, then I think you can expect that an LLM could actually do this for any custom application over time.

And so LLMs could act as a kind of compiler for any custom application you're interested in. They're gonna do all the llm.c-style work, and they're gonna output the binary that you can compile and run for your specific application. So I don't actually know if we even need Python and PyTorch and everything else; maybe it's just a crutch, because we humans are finite.

We have finite knowledge, intelligence, and potential. But actually, don't you wanna write all code in custom CUDA kernels and so on? Like, maybe. And so the other thing that I think is interesting is that the llm.c repo might be useful, because in the early stages of these LLMs and their intelligence, they might not be able to write this code from scratch.

If you just prompt them, "Write GPT-2 in C," you probably won't get llm.c. But you're a lot more likely to get it if you put llm.c in the context window of the LLM, and you can expect that the few-shot learning would be very helpful for the LLM, basically giving it example code.

And so I think llm.c could be very useful as this kind of example code to give to the LLMs, 'cause they're about to write all of our custom applications. And so I think this is actually not unlikely to happen. Yeah, this is kind of likely to happen. So I think software development in general will probably change a lot.

And to me, llm.c is an exploration of whether this is even possible, because if it is possible, then maybe this is what's gonna happen. So, yeah, that's it. Thank you. (audience applauding) (audience cheering) - All right, well, the morning session talks.