llm.c's Origin and the Future of LLM Compilers - Andrej Karpathy at CUDA MODE

00:00:00.000 | to introduce to you a legendary person.

00:00:03.000 | Someone who has been packing and educating

00:00:05.880 | at the forefront of AI for over a decade.

00:00:09.640 | From neural networks to computer vision,

00:00:12.120 | from vector learning processing to reinforcement learning,

00:00:15.120 | he has pushed the boundaries

00:00:16.940 | and inspired millions all over the world,

00:00:19.560 | including, I think, all of us here.

00:00:21.360 | He's a distinguished machine learning superstar.

00:00:28.600 | A founding member of OpenAI,

00:00:30.880 | the reference human for ImageNet.

00:00:33.360 | (audience laughing)

00:00:36.560 | An ex-Google brain, ex-DeepMind, ex-Tesla, Mr. Autopilot.

00:00:41.560 | Here's where we see it all.

00:00:44.500 | And some months ago, on a memorable day,

00:00:48.080 | this special person joined the CudaMode Discord

00:00:51.680 | to start hacking with others on LLM.C,

00:00:55.400 | which became one of the greatest

00:00:56.960 | and most active community projects on our server.

00:00:59.900 | But I guess it's best if he tells the story himself.

00:01:04.680 | So please join me in welcoming

00:01:06.760 | the incredible one and only Andrej Karpathy!

00:01:10.740 | (audience applauding)

00:01:13.900 | - Wow, okay.

00:01:20.420 | (audience laughing)

00:01:25.760 | Okay, I'm very excited to be here.

00:01:27.000 | This is my favorite kind of event to present at.

00:01:29.040 | So yeah, thank you for the invitation

00:01:30.640 | and thank you for running CudaMode and putting this on.

00:01:33.040 | This is a wonderful event.

00:01:34.920 | Okay, so I'll tell you a bit about LLM.C.

00:01:37.440 | So what are we doing?

00:01:38.400 | We're training transformers in C in a pinch of C++.

00:01:42.060 | How do I go next here?

00:01:43.700 | Okay, so I'd like to tell the story

00:01:47.600 | a little bit of how this project came about

00:01:49.360 | and what this looked like from my perspective.

00:01:51.180 | So roughly a year ago,

00:01:52.120 | I was trying to add a video to my YouTube series

00:01:54.400 | and I was trying to teach people LLM training,

00:01:57.040 | GPD training, and so on.

00:01:58.320 | And I was basically hacking on that

00:01:59.520 | and GPD trying to get it to work.

00:02:01.120 | So that was me.

00:02:02.920 | And then you've all worked with PyTorch, of course, right?

00:02:05.840 | So the trickiness comes that,

00:02:07.760 | okay, you have your model, which you've written.

00:02:09.440 | That makes sense.

00:02:10.280 | But now you have to keep track

00:02:11.680 | of a number of abstractions here at the same time.

00:02:14.200 | So you have to put it to a device.

00:02:15.680 | You want to compile it.

00:02:16.720 | You want to wrap it in DDP.

00:02:18.300 | And suddenly things start to be a little bit more complicated

00:02:21.120 | because I'm not even sure in what order you do these,

00:02:23.640 | what exactly happens?

00:02:24.640 | What are these abstractions?

00:02:25.560 | What do they do to your model?

00:02:26.920 | So I don't fully understand how any of this works.

00:02:29.180 | And then what happens is

00:02:30.960 | you want to use your model in different ways.

00:02:32.560 | So you want to use it in evaluation,

00:02:34.040 | in training, or model inference, and so on.

00:02:36.600 | And what happens to me is that

00:02:38.600 | I was able to train the model,

00:02:40.200 | but for some reason eval and inference was not working.

00:02:43.640 | And what happened was

00:02:44.600 | I was getting some kind of a Torch compile error

00:02:46.600 | when I was trying to run my eval and my inference.

00:02:49.440 | And this is just an illustrative example

00:02:50.760 | of a Torch compile error.

00:02:51.680 | It was something else I didn't remember.

00:02:52.920 | I couldn't capture it.

00:02:54.080 | But both of them were giving me error,

00:02:55.560 | inference and eval, and a different error.

00:02:57.520 | And I had no idea what was going on.

00:02:59.480 | So I did what anyone would do,

00:03:00.640 | and then in my position, I went to discuss.

00:03:02.820 | (audience laughing)

00:03:05.820 | (audience applauding)

00:03:09.340 | I was looking for a PTR VLCK to solve my issue.

00:03:13.160 | And fortunately, PTR VLCK did not have any guidance

00:03:15.760 | that I could see on that specific error.

00:03:17.720 | So I was kind of stuck, honestly.

00:03:18.960 | So two hours later, I'm finding a Torch compile

00:03:21.680 | and I'm trying to figure out what the hell is going on.

00:03:23.240 | I'm kind of a sad, I don't know exactly how to solve this.

00:03:26.800 | And so I felt like I was going to be

00:03:28.680 | (audience laughing)

00:03:31.000 | In the beginning, I was in denial.

00:03:32.080 | I was like, this can't be happening to me.

00:03:33.400 | I'm not doing anything crazy.

00:03:34.640 | I'm just training a little GPT.

00:03:36.360 | Like, why is this not working?

00:03:37.760 | This seems really simple.

00:03:38.880 | I'm not doing anything crazy.

00:03:40.360 | And then eventually I entered into a state of anger.

00:03:42.600 | (audience laughing)

00:03:43.680 | And I was like, okay.

00:03:45.320 | You know what, I'm just gonna write the whole thing.

00:03:46.640 | (audience laughing)

00:03:47.480 | I understand in my mind that what I'm trying to do,

00:03:49.800 | like the computation itself, the algorithm itself

00:03:51.520 | is totally clear in my mind.

00:03:53.080 | And for some reason, Torch compile doesn't let me

00:03:55.240 | use it, run it, et cetera.

00:03:56.440 | So I felt a little bit powerless.

00:03:58.160 | And I was like, okay, I'm gonna take life into my own hands

00:04:00.640 | and be in control of my destiny.

00:04:02.320 | I'm gonna just write this and see how bad could it be.

00:04:05.120 | So let's think about what is PyTorch offering you, really?

00:04:09.880 | And there's many things, but maybe some of the things

00:04:11.400 | that are relevant here.

00:04:12.800 | I don't know why there's bullet points everywhere.

00:04:14.320 | (audience laughing)

00:04:17.640 | Well, my slides is too many points,

00:04:18.760 | so I don't know how to do that.

00:04:20.880 | Okay, but number one, we're getting an array, right?

00:04:23.400 | So a very useful and dimensional array

00:04:25.240 | that we can manipulate the operations.

00:04:26.880 | If we're gonna abandon this, then we're gonna have to do

00:04:28.560 | a lot of counter-arithmetic, basically,

00:04:30.520 | making sure that we rattle and unravel indices correctly.

00:04:33.520 | Second, we're getting autograph for free.

00:04:34.960 | So if we don't have autograph, we need to do forward

00:04:37.240 | and backward passes of all the layers.

00:04:39.760 | We don't have devices, so we have to worry about memory

00:04:41.880 | being on the host or on the device and shuttling memory

00:04:44.160 | around different devices between CPU and GPU and so on.

00:04:47.760 | We don't have simple DTAP conversions,

00:04:49.560 | so we have to be very mindful what tensors are stored

00:04:52.280 | and what precisions and convert explicitly between them.

00:04:54.880 | We don't have Torch Compile, so we're gonna have to do

00:04:56.880 | all the kernel fusions that we want manually,

00:04:59.200 | and we're gonna have to optimize

00:05:00.360 | for space-time performance manually.

00:05:02.280 | And finally, we don't have distributives,

00:05:03.760 | so we're gonna have to manually spin up

00:05:05.040 | all of our processes, make sure that they can find each other,

00:05:07.120 | communicate with Nicol, et cetera.

00:05:08.960 | So PyTorch is really, really nice,

00:05:10.560 | and this is just some of the things that PyTorch offers.

00:05:12.560 | So without PyTorch, we're kind of naked in the world, right?

00:05:14.800 | But maybe it's okay.

00:05:16.760 | So, yeah, how bad can it be?

00:05:19.320 | So, step one, we have our PyTorch code,

00:05:22.180 | which now isn't the primary thing we're working with.

00:05:24.140 | It's only a reference that we check

00:05:25.920 | for correctness with respect to.

00:05:27.920 | And so we're in PyTorch land, everything is nice and clean.

00:05:30.080 | We have a little transformer, a few modules,

00:05:31.960 | and we're just calling them, so everything is great.

00:05:33.720 | And that now becomes our reference in PyTorch.

00:05:36.840 | I'd love to just take you through one example of a layer.

00:05:38.880 | So for example, layer norm here is like a PyTorch layer,

00:05:41.960 | and we'd like to basically port this over to C.

00:05:45.040 | So what kind of process do we go through?

00:05:46.760 | Well, we're gonna iterate through all the layers.

00:05:48.240 | Number one, we need the forward pass.

00:05:50.560 | And actually, I had to write a forward pass of layer norm

00:05:53.360 | because PyTorch doesn't just have this kind of implementation

00:05:56.960 | in PyTorch of layer norm, because it's kind of like a block

00:05:59.240 | and eventually pulls into some CUDA kernels.

00:06:01.520 | So I had to write a forward pass of layer norm

00:06:03.400 | and make sure it's equivalent to the layer norm in PyTorch.

00:06:06.300 | And then, of course,

00:06:07.140 | I had to write the backward pass of layer norm.

00:06:09.320 | This is where you kind of take out your pen and paper,

00:06:11.080 | do some backdrop.

00:06:12.600 | This is for batch norm, but layer norm would be similar.

00:06:15.720 | And yeah, we had to write the backward pass.

00:06:17.720 | And again, this is all still in PyTorch, but it's explicit.

00:06:20.680 | And you're just making sure that the layer norm

00:06:22.280 | of PyTorch, forward and backward,

00:06:23.560 | matches this basically manual tensor-based implementation.

00:06:28.560 | So now we have PyTorch code forward and backward.

00:06:34.000 | So the next thing we do is we try to port it to C.

00:06:36.360 | And this is actually a lot simpler

00:06:37.720 | in many cases than you might think.

00:06:38.980 | So on the left, we have the PyTorch code,

00:06:40.280 | on the right, we basically have the equivalent

00:06:42.000 | of layer norm forward in C.

00:06:44.000 | And it's not that crazy, right?

00:06:45.600 | So unlike in PyTorch,

00:06:47.680 | we just have a bunch of float star arrays.

00:06:50.880 | So we have a float star out, float star inputs,

00:06:53.440 | outputs, means, standard deviations, weights,

00:06:55.760 | and biases, and some type of headers.

00:06:57.540 | And one thing I really like to do in LL.C

00:07:00.080 | is I just want to keep things simple.

00:07:01.560 | I don't want to create a tensor abstraction.

00:07:03.400 | I don't want to create any abstraction, really.

00:07:05.620 | It's just float arrays and operations on float arrays.

00:07:08.560 | Like, why should it be a lot more complicated than that?

00:07:10.720 | So everything is just float arrays.

00:07:12.560 | Everything is fully self-contained.

00:07:13.840 | There's no underlying representations,

00:07:16.040 | abstractions to call, import, et cetera.

00:07:18.060 | This is the layer norm forward on float arrays,

00:07:20.080 | and that's it.

00:07:21.160 | So that's the forward.

00:07:22.960 | And then you also do the backward for all the layers.

00:07:25.820 | Once we've done that for all the layers

00:07:27.280 | and converted everything to C

00:07:28.120 | to make sure that everything matches

00:07:29.740 | our reference implementation,

00:07:31.520 | we have to start to string it together.

00:07:34.000 | So we go into our C code in the main,

00:07:37.600 | and we have to allocate all of the memory

00:07:39.440 | that we're going to be using.

00:07:40.520 | In LL.C, all of the allocation happens

00:07:43.400 | a single time at the beginning.

00:07:44.800 | So we pre-plan all of the memory

00:07:46.320 | that we're going to ever use.

00:07:47.800 | Then it's fixed, and from then on it's just dynamics

00:07:50.320 | of just feeding data through it and training the model.

00:07:53.160 | So we have to pre-plan all the tensors, their sizes,

00:07:56.080 | and we have to do that for the parameters.

00:07:57.940 | And we have the data_grad and the mnb

00:08:00.040 | for the @mw buffers.

00:08:02.320 | And then for the activations as well,

00:08:03.760 | and we need space for both data and grad.

00:08:05.880 | And so you just pre-plan all the memory,

00:08:07.720 | you allocate all of it,

00:08:08.680 | and then we just stitch it all up.

00:08:10.880 | So we have all these layers,

00:08:12.420 | and they have a forward and a backward pass

00:08:14.340 | in backpropagation.

00:08:15.800 | And so on the forward pass,

00:08:17.800 | just kind of like you allocate all these tensors

00:08:19.680 | and you're very careful,

00:08:20.520 | and you index into them properly,

00:08:21.720 | and you make sure everything flows correctly through.

00:08:24.020 | And you just call the forwards and then all the backwards,

00:08:26.320 | and then you're kind of done,

00:08:28.000 | and you're left with gradient and you can do an update.

00:08:29.820 | So stringing that together is the second piece of work.

00:08:32.940 | And then once we sort of like strung it together,

00:08:34.720 | you get something that you can just compile and run.

00:08:36.940 | So we, on the top left is everything that's required.

00:08:40.480 | We download a starter pack,

00:08:41.880 | which is really just a GPT-2 weights

00:08:43.500 | in a single binary file, very simple.

00:08:45.920 | And also we need the data set,

00:08:47.480 | in this case TinyShakespeare,

00:08:48.720 | and the tokenizer and stuff like that.

00:08:50.720 | And then we just compile and run this little C code file.

00:08:53.960 | It's a single file of C at this point.

00:08:56.720 | And I think it's like 2000 lines or something like that,

00:08:59.120 | if I remember correctly.

00:09:00.240 | And you run that program and it like does a little training

00:09:03.000 | and outputs on Shakespeare again.

00:09:05.000 | And then we can verify that the package code is identical

00:09:08.120 | to the C code and everything is great.

00:09:09.920 | And we're just running on C.

00:09:11.280 | And at this point, I'm actually feeling quite great

00:09:14.080 | because this is amazing.

00:09:16.240 | So we have a single file C,

00:09:17.880 | there's no dependencies whatsoever.

00:09:19.800 | It compiles instantly, it runs instantly.

00:09:22.040 | All the memory is just allocated in a single block.

00:09:25.400 | So if you start stepping,

00:09:26.600 | there's no way you're gonna boom later.

00:09:28.180 | It's all pre-planned.

00:09:29.300 | It's fully deterministic.

00:09:30.880 | It, in principle, can train GPT-2, it's complete.

00:09:33.280 | It will train GPT-2, you just have to wait for one time.

00:09:36.160 | And it can run on a potato.

00:09:37.520 | It can just run on anything.

00:09:38.360 | It's just a single file of C with no dependencies.

00:09:40.840 | And in principle, this could run,

00:09:42.080 | this would be a great candidate to run on a moment probe

00:09:45.160 | because in space, if we just harden it a little bit more,

00:09:48.240 | because you're not gonna ship high-torch code

00:09:50.520 | on a one-moment probe.

00:09:51.680 | But I think LMC is great.

00:09:53.320 | (all laughing)

00:09:56.520 | So I was feeling great at this point.

00:09:58.440 | Fun side note, by the way,

00:09:59.680 | all of this work that I described so far

00:10:00.960 | - Whoa. - Whoa.

00:10:01.800 | - While I was on vacation,

00:10:03.640 | and while I was jet-lagged in Maldives.

00:10:05.680 | So basically it's perfect because you wake up at 1 a.m.

00:10:09.400 | and there's nothing to do.

00:10:11.000 | So you write stuff like LLM.C,

00:10:12.960 | and then in sunrise, you go do all the weather activities.

00:10:15.400 | So that is the villa where most of LLM.C was trained.

00:10:19.240 | So that was perfect.

00:10:20.200 | This is a picture of it.

00:10:21.560 | (audience laughing)

00:10:22.400 | And this is a, I think the moon is about to set

00:10:25.200 | and the sunrise is about to happen.

00:10:26.680 | This is a recommended way to do software development.

00:10:29.160 | (audience laughing)

00:10:32.360 | Okay, so now we have C code, but it's inefficient.

00:10:34.480 | So we'd like to run it faster.

00:10:35.680 | For that, we need four GPUs.

00:10:37.200 | So we need to convert all of our C code to GPU.

00:10:39.440 | So this is where we go to the dev CUDA part of the repo

00:10:42.240 | and we start to develop all the kernels.

00:10:43.920 | So here's the layer of forward pass, as I mentioned.

00:10:46.120 | And now we're gonna develop a number of kernels

00:10:47.760 | that have the identical functionality,

00:10:49.280 | but now run on the GPU and they're gonna be faster.

00:10:51.760 | And so usually we have versions one, two, three,

00:10:53.960 | four, five, six, et cetera.

00:10:55.240 | And these are all different kernel implementations.

00:10:56.640 | They're a bit faster usually over time,

00:10:58.160 | but they match the specification exactly

00:11:00.920 | and give the exact same numbers.

00:11:02.720 | So we develop all those layers and port them to CUDA.

00:11:05.760 | And this is, I don't know what this is.

00:11:08.000 | I'm gonna skip that.

00:11:08.840 | (audience laughing)

00:11:09.680 | I need to look at one of the kernels.

00:11:10.960 | Basically, the point here is the first kernel

00:11:13.000 | is trivial to do usually

00:11:14.040 | because you're paralyzing over batch and time.

00:11:17.760 | And then you're basically copy pasting the C code

00:11:20.120 | into your CUDA kernel.

00:11:21.320 | And you're already getting speed ups

00:11:22.400 | because you're paralyzing over the batch time tokens

00:11:24.440 | and each thread just handles a single output element.

00:11:26.920 | So the first kernel is usually trivial,

00:11:28.720 | but then optimization is gonna be pretty elaborate.

00:11:31.920 | So by the end, we get to kernel six, for example,

00:11:33.920 | in layer norm, and we're doing a lot of things

00:11:35.480 | that are a bit more complicated.

00:11:36.400 | So we have some, you know, more produce operations.

00:11:40.680 | We have some, we also probably get through shared memory,

00:11:42.880 | through Google memory.

00:11:43.720 | We're orchestrating it correctly,

00:11:45.280 | cache screening hints,

00:11:47.560 | and a bunch of little tips and tricks

00:11:49.720 | for dealing with everything.

00:11:52.200 | And I'm gonna go into a bit more detail later,

00:11:53.800 | but you can get arbitrarily complicated

00:11:55.800 | if you're writing the CUDA code.

00:11:58.120 | One thing that I sort of found in this project

00:11:59.840 | is that it's not exactly trivial to learn CUDA,

00:12:02.840 | unfortunately, and it was like a little bit harder

00:12:04.680 | than I expected.

00:12:05.960 | I mean, I knew some CUDA going in,

00:12:07.400 | but getting better at it, I think, is not trivial.

00:12:09.760 | I think some of these books, unfortunately,

00:12:11.320 | are a bit out of date, as you might know.

00:12:13.320 | PMDP is actually quite good.

00:12:15.200 | But also, I think still kind of like,

00:12:17.880 | mostly on the beginner level,

00:12:19.080 | because a lot of the CUDA code that we ended up developing

00:12:21.600 | in the lifetime of the LLMC project,

00:12:23.400 | you would not find those things in this book, actually.

00:12:26.000 | So a lot of the kernels that we ended up adding

00:12:29.480 | would just not be covered.

00:12:31.000 | And then on top of that,

00:12:31.840 | you have this CUDA C++ programming guide,

00:12:33.600 | which, frankly, is not exactly readable

00:12:35.700 | for someone who is a bit new to CUDA.

00:12:39.440 | And then you have this amazing blog post from Simon.

00:12:41.640 | Yeah, is that in Tropic?

00:12:42.880 | That is like way better than anything we deserve,

00:12:44.800 | just like randomly on the internet.

00:12:46.280 | So that was incredible.

00:12:47.280 | And if there was just more of that,

00:12:48.120 | that would be so amazing.

00:12:49.280 | But yeah, so I think I found it a little bit difficult.

00:12:53.520 | But I mean, I'm hoping that things like CUDA mode

00:12:55.880 | can definitely speed up the availability of writing CUDA.

00:13:00.880 | Okay, so next what happened is I was basically struggling

00:13:06.280 | with the CUDA code a little bit.

00:13:08.160 | And I was reading through the book

00:13:09.460 | and I was implementing all these CUDA kernels.

00:13:11.000 | And they're like okay CUDA kernels,

00:13:12.400 | but they're not great.

00:13:14.760 | And so a team of Avengers assembled from the internet

00:13:18.280 | and what they saw, and that's how you start contributing.

00:13:19.960 | So specifically, Eric, Arun, Aleksar,

00:13:22.640 | like I would say core devs of LLM.C

00:13:24.760 | and contributed a ton of work to LLM.C.

00:13:27.280 | And they started to like really optimize

00:13:29.640 | and write all these kernels.

00:13:30.600 | And this was incredible to watch and learn a lot from.

00:13:33.200 | And there's many more, Ross Wheeler and Chen Hsiao

00:13:36.160 | and a few others.

00:13:37.600 | But over time, we have 60 contributors

00:13:39.440 | to the LLM.C project.

00:13:40.840 | Shout out to LLM.Dev for sponsoring LLM.C.

00:13:43.440 | They can contribute to compute

00:13:44.600 | so that we can run and optimize all these kernels.

00:13:47.120 | So it was amazing for me that people just came

00:13:48.880 | from the internet and helped out on the project.

00:13:50.360 | And you know, this is one of the favorite things

00:13:51.920 | that can happen.

00:13:52.880 | My favorite things that can happen

00:13:54.200 | within an open source MIT licensed repo,

00:13:56.440 | people just come from the internet

00:13:58.160 | and help contribute, it's amazing.

00:13:59.960 | Okay, so we've converted all the layers to CUDA.

00:14:03.640 | We have now all the kernels.

00:14:04.760 | And we can now train on a single GPU in FB32 so far.

00:14:08.560 | So that's great.

00:14:10.000 | So from then on, we start to make more and more optimizations.

00:14:12.480 | So number one, we don't want to have mammals in FB32

00:14:15.480 | when you roll your own code.

00:14:17.520 | We actually switched to Kubas.

00:14:20.040 | Step two, we don't want to write our own flash attention.

00:14:23.000 | I think that would be pretty complicated.

00:14:24.360 | Turns out Codium has a very good

00:14:25.960 | flash attention implementation.

00:14:27.320 | So we switched to that.

00:14:28.800 | Next, you want to definitely reach for base precision

00:14:33.560 | so that to speed up the code.

00:14:36.320 | So you want to go over all your testers for parameters

00:14:39.520 | and also for activations and so on.

00:14:41.080 | And you have to start to think about,

00:14:42.120 | okay, which ones are in float 32,

00:14:43.480 | which ones are in float 16,

00:14:45.080 | and what precision are they in?

00:14:46.560 | And then do all the conversions automatically.

00:14:48.840 | So we reached for that and implemented that.

00:14:51.600 | There's many, many other optimizations

00:14:52.960 | that we've been implementing over time.

00:14:54.560 | So as an example, we did all the kernel fusions,

00:14:57.400 | different recompute settings

00:14:58.880 | to recompute a piece of the forward pass

00:15:00.280 | during the backward.

00:15:01.280 | There's been a lot of optimizations from Eric,

00:15:04.520 | especially on minimizing the amount of memory

00:15:06.520 | that you need during the backward pass.

00:15:08.960 | We have this packed 128 data structure,

00:15:11.160 | which basically, in our experience,

00:15:13.000 | forces the compiler to use the 128-bit

00:15:15.800 | load and store instructions that are available,

00:15:17.640 | but somehow the compiler is unwilling to use in many cases.

00:15:21.000 | So I think Arun did a lot of work here

00:15:23.800 | where you just look at the SAS,

00:15:25.560 | and you look at the SAS as the assembly,

00:15:28.760 | and you are looking at what the instructions

00:15:30.520 | are being used for your loop,

00:15:31.480 | and you figure out that,

00:15:32.320 | okay, there should be a 128-bit load and store,

00:15:34.440 | but it happens to be a 32-bit or something else

00:15:36.360 | because something in the MSCC compiler

00:15:38.400 | is not going very well.

00:15:39.280 | So we found that this data structure

00:15:41.320 | kind of forces the compiler's hand a bit more.

00:15:43.960 | We implemented all kinds of CUDA streams to overlap

00:15:46.360 | at the heart of the computation,

00:15:47.680 | and this ended up creating a total disaster.

00:15:51.080 | And so that's why I scratched it out,

00:15:52.320 | because at one point of LLM.c, as Arun would say,

00:15:55.480 | I basically went in and I looped it from orbit.

00:15:58.000 | I just went in, I controlled that

00:16:00.280 | for all dimensions of stream,

00:16:01.520 | and I just delete, delete, delete.

00:16:03.160 | And basically I deleted all the streams,

00:16:04.560 | made everything single-threaded,

00:16:05.920 | because we ended up getting all kinds

00:16:07.160 | of really weird race conditions and errors and so on.

00:16:09.160 | I just didn't want to deal with it.

00:16:10.120 | So LLM.c is not actually as overlapped as it could be,

00:16:13.960 | but it's just like, it's too much complexity

00:16:16.280 | for me not to have gained at this point.

00:16:18.720 | But maybe we can slowly reintroduce some of it.

00:16:21.760 | We have stochastic grounding, we have full determinism.

00:16:23.920 | Full determinism turns out to be pretty hard

00:16:25.760 | because some of the kernels complexify a lot

00:16:27.800 | because you can't use atomics.

00:16:29.120 | Like the encoder backward was especially crazy

00:16:31.440 | because the encoder backward is trivial with atomics,

00:16:34.720 | but non-trivial without it.

00:16:36.440 | Anyway, so a lot of the optimizations

00:16:37.960 | went into with a lot of efficiency

00:16:40.920 | and determinism in mind.

00:16:42.120 | And accuracy, like stochastic grounding and so on.

00:16:46.220 | Next, you want to use multiple GPUs,

00:16:47.640 | not just a single GPU.

00:16:48.800 | So this is where you bring in the nickel,

00:16:50.960 | you start to do overduce between all the different workers.

00:16:55.280 | And this is where you also start to reach

00:16:56.800 | for like sharded optimizer state zero one.

00:16:59.600 | For basically, take your optimizer states,

00:17:01.000 | which are in float, and these are really large buffers

00:17:04.320 | for atom W, and you can actually spread out

00:17:06.320 | a lot of the stuff across all the GPUs

00:17:07.800 | and it really helps to keep your requirements

00:17:09.920 | down per GPU in terms of memory.

00:17:11.600 | So very helpful to reach for that.

00:17:14.000 | So currently, LLNC uses zero one,

00:17:16.280 | which is a sharded optimizer state.

00:17:17.760 | There's a PR for zero two,

00:17:19.000 | but I don't believe I merged that yet

00:17:21.000 | because it gets a little bit messy,

00:17:22.800 | but might be merged eventually.

00:17:25.000 | A lot of LLNC is just kind of like

00:17:26.720 | balancing the improvement and speed

00:17:30.280 | with the complexity of what you're actually introducing.

00:17:32.680 | And so I actually rejected a lot of PRs because of that

00:17:35.200 | because the code starts to get crazy.

00:17:37.380 | And I think that decreases the amount of people

00:17:38.760 | that can be onboarded.

00:17:39.860 | And then after your multi-GP, you hold the node.

00:17:43.960 | So now you are running across multiple machines here

00:17:46.160 | to make sure that you synchronize all of them,

00:17:48.320 | that they can find each other and so on.

00:17:49.600 | So we implemented all that.

00:17:51.560 | And where that leads us to is that we can

00:17:53.960 | actually train GPT-2, and we can actually reproduce it

00:17:56.120 | after all of that work.

00:17:57.320 | So there's a post in the discussions of LLNC.

00:17:59.960 | We can train the 1.6 billion GPT-2,

00:18:02.080 | which was state of the art LLN as of 2019 or so.

00:18:05.320 | And you can train it on a single node,

00:18:06.980 | H100, in about 24 hours.

00:18:09.260 | And that costs roughly $600.

00:18:12.020 | And the way you do that is it's extremely dependency free.

00:18:14.300 | There's no need for Python, no need for PyTorch.

00:18:16.860 | So you do need CUDNN, which is the most heavy dependency.

00:18:20.540 | But CUDNN is optional.

00:18:21.820 | So if you'd like to roll your own manual attention,

00:18:23.900 | that is possible in LLN.C.

00:18:25.580 | But CUDNN is kind of like the hairiest dependency.

00:18:27.540 | But after that, it's just a bunch of C code.

00:18:29.260 | You compile it and you run it.

00:18:30.380 | There's no need for really anything.

00:18:32.460 | So there's no need for conda environments,

00:18:34.060 | as it installs, there's just nothing.

00:18:35.640 | It's just amazing.

00:18:36.720 | And then you compile your code and you run it.

00:18:38.640 | And it starts stepping and you wait 24 hours.

00:18:41.360 | And then it's stepping, doing some diagnostics.

00:18:44.920 | We have almost a 50% MFU here on one node, which is quite good.

00:18:50.880 | And you get really nice plots.

00:18:52.160 | And you beat the GPT-2 on Hellaswag.

00:18:54.400 | And basically, that's just the case

00:18:55.920 | that the optimization went well.

00:18:57.400 | No crazy numerical issues, lost bytes or anything

00:18:59.480 | like that for this size.

00:19:01.360 | And yeah, it should be a nearly good model in LLN.C.

00:19:07.500 | We can still compare to PyTorch.

00:19:08.860 | Because remember, we have PyTorch implementation

00:19:10.900 | for all this stuff in parallel on the side.

00:19:12.980 | And so you can run the whole untrained almost in PyTorch.

00:19:16.460 | And we can compare the two implementations side-by-side.

00:19:19.060 | And in particular, at the time of writing that post--

00:19:21.740 | and I don't know if this has changed,

00:19:23.280 | because the PyTorch team continues

00:19:24.700 | to optimize things over time-- but at the time of that post,

00:19:27.940 | we were using, in LLN.C, 30% less memory.

00:19:30.660 | And we were 20% faster in training just the throughput.

00:19:33.700 | And I don't know if I fully super duper optimized

00:19:35.860 | the PyTorch implementation.

00:19:36.980 | I did my personal best.

00:19:38.360 | But we were able to, I think, beat PyTorch in training mode,

00:19:41.540 | specifically GPT-2 in LLN.C. If you

00:19:44.880 | want to train anything else, you're in a lot of trouble.

00:19:47.220 | [LAUGHTER]

00:19:48.300 | You have to change your code a lot.

00:19:49.780 | And when I'm doing that, I won't come back to it.

00:19:51.820 | But for GPT-2 training, we're better after all that work.

00:19:55.620 | And it also compiles and runs much faster,

00:19:57.380 | which is beautiful.

00:19:58.220 | Torch compile actually takes quite a bit of time,

00:19:59.840 | like a minute or something.

00:20:00.880 | You're just waiting.

00:20:01.780 | So that's also something that I personally

00:20:03.780 | don't like to work with usually.

00:20:05.460 | OK.

00:20:06.500 | So looping back around, turns out it wasn't all that simple.

00:20:09.220 | [LAUGHTER]

00:20:09.720 | I mean, there was a lot of stuff involved.

00:20:11.460 | And it took a few months for a few people.

00:20:13.420 | But it was fun.

00:20:16.460 | We learned a lot.

00:20:17.340 | And we were friends along the way.

00:20:18.780 | This is the LLN.C core devs.

00:20:20.300 | [LAUGHTER]

00:20:22.620 | So it was great.

00:20:24.860 | Ongoing work.

00:20:26.020 | We are adding a lot of free support.

00:20:27.540 | We actually thought maybe we would have it done by today.

00:20:29.960 | But there's a little bit more work to do.

00:20:33.880 | But we will have LLNA 3.1 training in LLN.C very, very

00:20:37.040 | soon.

00:20:38.240 | We will have FP8 support.

00:20:40.280 | So everyone has been working on this.

00:20:42.920 | And there's a big PR that's coming for FP8 support,

00:20:45.840 | which is also interesting.

00:20:47.840 | And there's a lot of notable forks in LLN.C.

00:20:50.160 | They're all listed on the GitHub repo.

00:20:52.240 | The AMP fork is very active, as far as I understand it,

00:20:55.360 | and quite good.

00:20:56.480 | I think also the C++ CUDA fork is quite nice.

00:21:00.340 | And so a lot of folks.

00:21:02.940 | So I encourage you to also work LLN.C.

00:21:05.420 | It's fairly readable, I think.

00:21:06.740 | I try to keep it clean, well-documented.

00:21:08.420 | I think it's pretty well understood what's in there.

00:21:10.580 | It's only maybe like, I think, 3,000 lines of code,

00:21:13.340 | basically C mostly.

00:21:15.700 | And one more thought, I think, that I wanted to get across

00:21:18.120 | is it wasn't all that haphazard to start the project.

00:21:22.220 | I had another motivation for starting the project.

00:21:26.100 | And that's that I think, I mean, what is LLN.C like?

00:21:28.980 | PyTorch is, especially when you push compilers,

00:21:31.380 | a bit like TCC for software 2.0.

00:21:33.180 | It's a compiler.

00:21:34.240 | But LLN.C is a bit like writing assembly,

00:21:36.260 | where you're doing everything manually, right?

00:21:38.820 | And basically I think we wrote LLN.C as multiple people

00:21:43.820 | over a duration of three months,

00:21:45.860 | and got something that was faster than PyTorch

00:21:47.860 | in a specific setting of GPT-3 training.

00:21:50.500 | And so this exercise basically proves that this is possible.

00:21:53.860 | Now the problem is you need to spend multiple people

00:21:56.040 | several months.

00:21:57.100 | But if LLNs are about to become much better at coding

00:21:59.980 | over time, then I think you can expect that the LLN

00:22:02.900 | could actually do this for any custom application over time.

00:22:05.540 | And so the LLNs could act as a kind of compiler

00:22:08.540 | for any custom application you're interested in.

00:22:10.340 | They're gonna do all the LLN.C work,

00:22:12.180 | and they're gonna output the binary

00:22:13.940 | that you can compile and run for your specific applications.

00:22:16.460 | So I don't actually know if we,

00:22:17.820 | like the use of Python and PyTorch and everything else,

00:22:20.500 | it's just a crutch because we humans are finite.

00:22:22.860 | We have finite knowledge, intelligence, and potential.

00:22:25.340 | But actually, don't you wanna write all code

00:22:27.180 | in custom CUDA kernels and so on?

00:22:29.500 | Like, maybe.

00:22:30.660 | And so the other thing that I think is interesting

00:22:33.100 | is the LLN.C repo might be useful

00:22:35.620 | because in the early stages of these LLNs

00:22:37.940 | and their intelligence, they might not be able

00:22:39.820 | to write this code from scratch.

00:22:40.980 | You just prompted them, "Write GPT-2 on C."

00:22:43.060 | You probably won't get LLN.C.

00:22:44.800 | But you're a lot more likely to get it

00:22:46.260 | if you put LLN.C in the context of a session LLN,

00:22:49.740 | and you can expect that the few-shot learning

00:22:51.180 | would be very helpful for the LLN

00:22:52.580 | to basically get an example code.

00:22:54.180 | And so I think LLN.C could be very useful

00:22:56.060 | for this example code to get to the LLNs

00:22:57.820 | 'cause they're about to write

00:22:58.660 | all of our custom applications.

00:23:00.260 | And so I think this is actually not unlikely to happen.

00:23:03.520 | Yeah, this is kind of likely to happen.

00:23:04.820 | So I think software development in general

00:23:06.580 | will probably change a lot.

00:23:07.420 | And to me, LLN.C is an exploration

00:23:09.260 | of whether this is even possible,

00:23:10.740 | because if it is possible,

00:23:11.660 | then maybe this is what's gonna happen.

00:23:13.380 | So, yeah, that's it.

00:23:15.260 | Thank you.

00:23:16.100 | (audience applauding)

00:23:19.260 | (audience cheering)

00:23:22.260 | - All right, well, the morning session talks.