
llm.c's Origin and the Future of LLM Compilers - Andrej Karpathy at CUDA MODE


Whisper Transcript

00:00:00.000 | to introduce to you a legendary person.
00:00:03.000 | Someone who has been hacking and educating
00:00:05.880 | at the forefront of AI for over a decade.
00:00:09.640 | From neural networks to computer vision,
00:00:12.120 | from natural language processing to reinforcement learning,
00:00:15.120 | he has pushed the boundaries
00:00:16.940 | and inspired millions all over the world,
00:00:19.560 | including, I think, all of us here.
00:00:21.360 | He's a distinguished machine learning superstar.
00:00:28.600 | A founding member of OpenAI,
00:00:30.880 | the reference human for ImageNet.
00:00:33.360 | (audience laughing)
00:00:36.560 | Ex-Google Brain, ex-DeepMind, ex-Tesla, Mr. Autopilot.
00:00:41.560 | He's seen it all.
00:00:44.500 | And some months ago, on a memorable day,
00:00:48.080 | this special person joined the CudaMode Discord
00:00:51.680 | to start hacking with others on LLM.C,
00:00:55.400 | which became one of the greatest
00:00:56.960 | and most active community projects on our server.
00:00:59.900 | But I guess it's best if he tells the story himself.
00:01:04.680 | So please join me in welcoming
00:01:06.760 | the incredible one and only Andrej Karpathy!
00:01:10.740 | (audience applauding)
00:01:13.900 | - Wow, okay.
00:01:20.420 | (audience laughing)
00:01:25.760 | Okay, I'm very excited to be here.
00:01:27.000 | This is my favorite kind of event to present at.
00:01:29.040 | So yeah, thank you for the invitation
00:01:30.640 | and thank you for running CudaMode and putting this on.
00:01:33.040 | This is a wonderful event.
00:01:34.920 | Okay, so I'll tell you a bit about LLM.C.
00:01:37.440 | So what are we doing?
00:01:38.400 | We're training transformers in C and a pinch of C++.
00:01:42.060 | How do I go next here?
00:01:43.700 | Okay, so I'd like to tell the story
00:01:47.600 | a little bit of how this project came about
00:01:49.360 | and what this looked like from my perspective.
00:01:51.180 | So roughly a year ago,
00:01:52.120 | I was trying to add a video to my YouTube series
00:01:54.400 | and I was trying to teach people LLM training,
00:01:57.040 | GPT training, and so on.
00:01:58.320 | And I was basically hacking on nanoGPT,
00:01:59.520 | trying to get it to work.
00:02:01.120 | So that was me.
00:02:02.920 | And then you've all worked with PyTorch, of course, right?
00:02:05.840 | So the trickiness comes that,
00:02:07.760 | okay, you have your model, which you've written.
00:02:09.440 | That makes sense.
00:02:10.280 | But now you have to keep track
00:02:11.680 | of a number of abstractions here at the same time.
00:02:14.200 | So you have to put it to a device.
00:02:15.680 | You want to compile it.
00:02:16.720 | You want to wrap it in DDP.
00:02:18.300 | And suddenly things start to be a little bit more complicated
00:02:21.120 | because I'm not even sure in what order you do these,
00:02:23.640 | what exactly happens?
00:02:24.640 | What are these abstractions?
00:02:25.560 | What do they do to your model?
00:02:26.920 | So I don't fully understand how any of this works.
00:02:29.180 | And then what happens is
00:02:30.960 | you want to use your model in different ways.
00:02:32.560 | So you want to use it in evaluation,
00:02:34.040 | in training, or model inference, and so on.
00:02:36.600 | And what happens to me is that
00:02:38.600 | I was able to train the model,
00:02:40.200 | but for some reason eval and inference was not working.
00:02:43.640 | And what happened was
00:02:44.600 | I was getting some kind of a Torch compile error
00:02:46.600 | when I was trying to run my eval and my inference.
00:02:49.440 | And this is just an illustrative example
00:02:50.760 | of a Torch compile error.
00:02:51.680 | It was something else I didn't remember.
00:02:52.920 | I couldn't capture it.
00:02:54.080 | But both of them were giving me an error,
00:02:55.560 | inference and eval, and a different error each.
00:02:57.520 | And I had no idea what was going on.
00:02:59.480 | So I did what anyone would do
00:03:00.640 | in my position: I went to the discussion forums.
00:03:02.820 | (audience laughing)
00:03:05.820 | (audience applauding)
00:03:09.340 | I was looking for ptrblck to solve my issue.
00:03:13.160 | And unfortunately, ptrblck did not have any guidance
00:03:15.760 | that I could see on that specific error.
00:03:17.720 | So I was kind of stuck, honestly.
00:03:18.960 | So two hours later, I'm fighting Torch compile
00:03:21.680 | and I'm trying to figure out what the hell is going on.
00:03:23.240 | I'm kind of sad, I don't know exactly how to solve this.
00:03:26.800 | And so I felt like I was going through the stages of grief.
00:03:28.680 | (audience laughing)
00:03:31.000 | In the beginning, I was in denial.
00:03:32.080 | I was like, this can't be happening to me.
00:03:33.400 | I'm not doing anything crazy.
00:03:34.640 | I'm just training a little GPT.
00:03:36.360 | Like, why is this not working?
00:03:37.760 | This seems really simple.
00:03:38.880 | I'm not doing anything crazy.
00:03:40.360 | And then eventually I entered into a state of anger.
00:03:42.600 | (audience laughing)
00:03:43.680 | And I was like, okay.
00:03:45.320 | You know what, I'm just gonna write the whole thing.
00:03:46.640 | (audience laughing)
00:03:47.480 | I understand what I'm trying to do;
00:03:49.800 | like the computation itself, the algorithm itself
00:03:51.520 | is totally clear in my mind.
00:03:53.080 | And for some reason, Torch compile doesn't let me
00:03:55.240 | use it, run it, et cetera.
00:03:56.440 | So I felt a little bit powerless.
00:03:58.160 | And I was like, okay, I'm gonna take life into my own hands
00:04:00.640 | and be in control of my destiny.
00:04:02.320 | I'm gonna just write this and see how bad could it be.
00:04:05.120 | So let's think about what is PyTorch offering you, really?
00:04:09.880 | And there's many things, but maybe some of the things
00:04:11.400 | that are relevant here.
00:04:12.800 | I don't know why there's bullet points everywhere.
00:04:14.320 | (audience laughing)
00:04:17.640 | Well, my slides have too many bullet points,
00:04:18.760 | so I don't know how to fix that.
00:04:20.880 | Okay, but number one, we're getting an array, right?
00:04:23.400 | So a very useful n-dimensional array
00:04:25.240 | that we can manipulate with operations.
00:04:26.880 | If we're gonna abandon this, then we're gonna have to do
00:04:28.560 | a lot of pointer arithmetic, basically,
00:04:30.520 | making sure that we ravel and unravel indices correctly (there's a small indexing sketch after this list).
00:04:33.520 | Second, we're getting autograd for free.
00:04:34.960 | So if we don't have autograd, we need to write the forward
00:04:37.240 | and backward passes of all the layers ourselves.
00:04:39.760 | We don't have devices, so we have to worry about memory
00:04:41.880 | being on the host or on the device and shuttling memory
00:04:44.160 | around different devices between CPU and GPU and so on.
00:04:47.760 | We don't have simple dtype conversions,
00:04:49.560 | so we have to be very mindful of what precision tensors
00:04:52.280 | are stored in and convert explicitly between them.
00:04:54.880 | We don't have Torch Compile, so we're gonna have to do
00:04:56.880 | all the kernel fusions that we want manually,
00:04:59.200 | and we're gonna have to optimize
00:05:00.360 | for space-time performance manually.
00:05:02.280 | And finally, we don't have distributed,
00:05:03.760 | so we're gonna have to manually spin up
00:05:05.040 | all of our processes, make sure that they can find each other,
00:05:07.120 | communicate with NCCL, et cetera.
00:05:08.960 | So PyTorch is really, really nice,
00:05:10.560 | and this is just some of the things that PyTorch offers.
00:05:12.560 | So without PyTorch, we're kind of naked in the world, right?
00:05:14.800 | But maybe it's okay.
00:05:16.760 | So, yeah, how bad can it be?
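To make the pointer-arithmetic point concrete, here is a minimal sketch of what indexing looks like once the n-dimensional array abstraction is gone. The shapes and names are illustrative only, not taken from llm.c:

```c
// Indexing a logical (B, T, C) activation tensor stored as one flat float array.
// Every access becomes manual index arithmetic: offset = b*T*C + t*C + c.
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int B = 4, T = 64, C = 768;  // batch size, sequence length, channels (illustrative)
    float* acts = (float*)calloc((size_t)B * T * C, sizeof(float));
    int b = 1, t = 2, c = 3;
    // "ravel" the 3D index (b, t, c) into a flat offset by hand
    size_t offset = (size_t)b * T * C + (size_t)t * C + c;
    acts[offset] = 1.0f;
    printf("acts[%d][%d][%d] = %f\n", b, t, c, acts[offset]);
    free(acts);
    return 0;
}
```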
00:05:19.320 | So, step one, we have our PyTorch code,
00:05:22.180 | which now isn't the primary thing we're working with.
00:05:24.140 | It's only a reference that we check
00:05:25.920 | for correctness with respect to.
00:05:27.920 | And so we're in PyTorch land, everything is nice and clean.
00:05:30.080 | We have a little transformer, a few modules,
00:05:31.960 | and we're just calling them, so everything is great.
00:05:33.720 | And that now becomes our reference in PyTorch.
00:05:36.840 | I'd love to just take you through one example of a layer.
00:05:38.880 | So for example, layer norm here is like a PyTorch layer,
00:05:41.960 | and we'd like to basically port this over to C.
00:05:45.040 | So what kind of process do we go through?
00:05:46.760 | Well, we're gonna iterate through all the layers.
00:05:48.240 | Number one, we need the forward pass.
00:05:50.560 | And actually, I had to write a forward pass of layer norm
00:05:53.360 | because PyTorch doesn't just have this kind of explicit implementation
00:05:56.960 | of layer norm lying around; it's kind of like a block
00:05:59.240 | that eventually calls into some CUDA kernels.
00:06:01.520 | So I had to write a forward pass of layer norm
00:06:03.400 | and make sure it's equivalent to the layer norm in PyTorch.
00:06:06.300 | And then, of course,
00:06:07.140 | I had to write the backward pass of layer norm.
00:06:09.320 | This is where you kind of take out your pen and paper,
00:06:11.080 | do some backprop.
00:06:12.600 | This is for batch norm, but layer norm would be similar.
00:06:15.720 | And yeah, we had to write the backward pass.
00:06:17.720 | And again, this is all still in PyTorch, but it's explicit.
00:06:20.680 | And you're just making sure that the layer norm
00:06:22.280 | of PyTorch, forward and backward,
00:06:23.560 | matches this basically manual tensor-based implementation.
00:06:28.560 | So now we have PyTorch code forward and backward.
00:06:34.000 | So the next thing we do is we try to port it to C.
00:06:36.360 | And this is actually a lot simpler
00:06:37.720 | in many cases than you might think.
00:06:38.980 | So on the left, we have the PyTorch code,
00:06:40.280 | on the right, we basically have the equivalent
00:06:42.000 | of layer norm forward in C.
00:06:44.000 | And it's not that crazy, right?
00:06:45.600 | So unlike in PyTorch,
00:06:47.680 | we just have a bunch of float* arrays.
00:06:50.880 | So we have a float* out, float* inputs,
00:06:53.440 | outputs, means, standard deviations, weights,
00:06:55.760 | and biases, and some sizes, et cetera.
00:06:57.540 | And one thing I really like to do in LLM.C
00:07:00.080 | is I just want to keep things simple.
00:07:01.560 | I don't want to create a tensor abstraction.
00:07:03.400 | I don't want to create any abstraction, really.
00:07:05.620 | It's just float arrays and operations on float arrays.
00:07:08.560 | Like, why should it be a lot more complicated than that?
00:07:10.720 | So everything is just float arrays.
00:07:12.560 | Everything is fully self-contained.
00:07:13.840 | There's no underlying representations,
00:07:16.040 | abstractions to call, import, et cetera.
00:07:18.060 | This is the layer norm forward on float arrays,
00:07:20.080 | and that's it.
00:07:21.160 | So that's the forward.
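For concreteness, a layer-norm forward over flat float arrays can look like the sketch below. It is written in the spirit of the C port being described; the exact signature and the caching of mean and rstd here are illustrative rather than a verbatim copy of llm.c:

```c
#include <math.h>

// Layer norm forward on plain float arrays: inp and out are (B, T, C),
// weight and bias are (C,), mean and rstd are (B, T) caches for the backward pass.
void layernorm_forward(float* out, float* mean, float* rstd,
                       const float* inp, const float* weight, const float* bias,
                       int B, int T, int C) {
    float eps = 1e-5f;
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            const float* x = inp + (b * T + t) * C;
            // mean over the channel dimension
            float m = 0.0f;
            for (int c = 0; c < C; c++) { m += x[c]; }
            m /= C;
            // variance over the channel dimension
            float v = 0.0f;
            for (int c = 0; c < C; c++) { float d = x[c] - m; v += d * d; }
            v /= C;
            float s = 1.0f / sqrtf(v + eps);
            // normalize, scale, shift
            float* o = out + (b * T + t) * C;
            for (int c = 0; c < C; c++) {
                o[c] = (x[c] - m) * s * weight[c] + bias[c];
            }
            // cache the statistics so the backward pass can reuse them
            if (mean) mean[b * T + t] = m;
            if (rstd) rstd[b * T + t] = s;
        }
    }
}
```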
00:07:22.960 | And then you also do the backward for all the layers.
00:07:25.820 | Once we've done that for all the layers
00:07:27.280 | and converted everything to C
00:07:28.120 | to make sure that everything matches
00:07:29.740 | our reference implementation,
00:07:31.520 | we have to start to string it together.
00:07:34.000 | So we go into our C code in the main,
00:07:37.600 | and we have to allocate all of the memory
00:07:39.440 | that we're going to be using.
00:07:40.520 | In LLM.C, all of the allocation happens
00:07:43.400 | a single time at the beginning.
00:07:44.800 | So we pre-plan all of the memory
00:07:46.320 | that we're going to ever use.
00:07:47.800 | Then it's fixed, and from then on it's just dynamics
00:07:50.320 | of just feeding data through it and training the model.
00:07:53.160 | So we have to pre-plan all the tensors, their sizes,
00:07:56.080 | and we have to do that for the parameters.
00:07:57.940 | And we have the data and grad, and the m and v
00:08:00.040 | buffers for AdamW.
00:08:02.320 | And then for the activations as well,
00:08:03.760 | and we need space for both data and grad.
00:08:05.880 | And so you just pre-plan all the memory,
00:08:07.720 | you allocate all of it,
00:08:08.680 | and then we just stitch it all up.
00:08:10.880 | So we have all these layers,
00:08:12.420 | and they have a forward and a backward pass
00:08:14.340 | in backpropagation.
00:08:15.800 | And so on the forward pass,
00:08:17.800 | just kind of like you allocate all these tensors
00:08:19.680 | and you're very careful,
00:08:20.520 | and you index into them properly,
00:08:21.720 | and you make sure everything flows correctly through.
00:08:24.020 | And you just call the forwards and then all the backwards,
00:08:26.320 | and then you're kind of done,
00:08:28.000 | and you're left with gradient and you can do an update.
00:08:29.820 | So stringing that together is the second piece of work.
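A toy sketch of that "plan the memory once" idea is below. The tiny two-tensor "model" and all of the names are hypothetical, but the pattern of counting sizes, doing a single allocation, and carving out pointers is the one being described:

```c
#include <stdlib.h>

// Pre-plan tensor sizes, allocate one block, and hand out pointers into it.
typedef struct {
    float* memory;      // the single backing allocation
    float* wte;         // e.g. a token embedding table, V * C floats
    float* lnw;         // e.g. a layernorm weight, C floats
    size_t num_floats;
} ParamTensors;

void params_alloc(ParamTensors* p, int V, int C) {
    size_t sizes[2] = { (size_t)V * C, (size_t)C };        // pre-planned sizes
    p->num_floats = sizes[0] + sizes[1];
    p->memory = (float*)calloc(p->num_floats, sizeof(float)); // one allocation
    float* cursor = p->memory;
    p->wte = cursor; cursor += sizes[0];                   // carve out each tensor
    p->lnw = cursor; cursor += sizes[1];
}
```

The same bookkeeping is then repeated for the gradients, the AdamW m and v buffers, and the activations, and the training loop just calls the forwards, the backwards, and the update against those fixed pointers.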
00:08:32.940 | And then once we sort of like strung it together,
00:08:34.720 | you get something that you can just compile and run.
00:08:36.940 | So we, on the top left is everything that's required.
00:08:40.480 | We download a starter pack,
00:08:41.880 | which is really just the GPT-2 weights
00:08:43.500 | in a single binary file, very simple.
00:08:45.920 | And also we need the data set,
00:08:47.480 | in this case TinyShakespeare,
00:08:48.720 | and the tokenizer and stuff like that.
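As a sketch of what loading such a single-file starter pack can look like, consider the snippet below: a small integer header followed by the raw float32 weights. The exact header layout llm.c uses may differ; everything here is a hypothetical illustration of the idea:

```c
#include <stdio.h>
#include <stdlib.h>

// Hypothetical loader: a fixed-size integer header (magic, version, model config)
// followed by all parameters as raw float32, and nothing else.
float* load_checkpoint(const char* path, size_t num_params) {
    FILE* f = fopen(path, "rb");
    if (!f) { fprintf(stderr, "could not open %s\n", path); exit(1); }
    int header[256];
    fread(header, sizeof(int), 256, f);           // magic, version, config ints
    float* params = (float*)malloc(num_params * sizeof(float));
    fread(params, sizeof(float), num_params, f);  // the raw weights
    fclose(f);
    return params;
}
```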
00:08:50.720 | And then we just compile and run this little C code file.
00:08:53.960 | It's a single file of C at this point.
00:08:56.720 | And I think it's like 2000 lines or something like that,
00:08:59.120 | if I remember correctly.
00:09:00.240 | And you run that program and it like does a little training
00:09:03.000 | and outputs some Shakespeare.
00:09:05.000 | And then we can verify that the PyTorch code is identical
00:09:08.120 | to the C code and everything is great.
00:09:09.920 | And we're just running on C.
00:09:11.280 | And at this point, I'm actually feeling quite great
00:09:14.080 | because this is amazing.
00:09:16.240 | So we have a single file C,
00:09:17.880 | there's no dependencies whatsoever.
00:09:19.800 | It compiles instantly, it runs instantly.
00:09:22.040 | All the memory is just allocated in a single block.
00:09:25.400 | So if you start stepping,
00:09:26.600 | there's no way you're gonna OOM later.
00:09:28.180 | It's all pre-planned.
00:09:29.300 | It's fully deterministic.
00:09:30.880 | It, in principle, can train GPT-2; it's complete.
00:09:33.280 | It will train GPT-2, you just have to wait a long time.
00:09:36.160 | And it can run on a potato.
00:09:37.520 | It can just run on anything.
00:09:38.360 | It's just a single file of C with no dependencies.
00:09:40.840 | And in principle, this could run,
00:09:42.080 | this would be a great candidate to run on a Von Neumann probe
00:09:45.160 | in space, if we just harden it a little bit more,
00:09:48.240 | because you're not gonna ship PyTorch code
00:09:50.520 | on a Von Neumann probe.
00:09:51.680 | But I think LLM.C is great.
00:09:53.320 | (all laughing)
00:09:56.520 | So I was feeling great at this point.
00:09:58.440 | Fun side note, by the way,
00:09:59.680 | all of this work that I described so far
00:10:01.800 | was done while I was on vacation,
00:10:03.640 | and while I was jet-lagged in the Maldives.
00:10:05.680 | So basically it's perfect because you wake up at 1 a.m.
00:10:09.400 | and there's nothing to do.
00:10:11.000 | So you write stuff like LLM.C,
00:10:12.960 | and then at sunrise, you go do all the water activities.
00:10:15.400 | So that is the villa where most of LLM.C was trained.
00:10:19.240 | So that was perfect.
00:10:20.200 | This is a picture of it.
00:10:21.560 | (audience laughing)
00:10:22.400 | And this is a, I think the moon is about to set
00:10:25.200 | and the sunrise is about to happen.
00:10:26.680 | This is a recommended way to do software development.
00:10:29.160 | (audience laughing)
00:10:32.360 | Okay, so now we have C code, but it's inefficient.
00:10:34.480 | So we'd like to run it faster.
00:10:35.680 | For that, we need GPUs.
00:10:37.200 | So we need to convert all of our C code to CUDA.
00:10:39.440 | So this is where we go to the dev CUDA part of the repo
00:10:42.240 | and we start to develop all the kernels.
00:10:43.920 | So here's the layernorm forward pass, as I mentioned.
00:10:46.120 | And now we're gonna develop a number of kernels
00:10:47.760 | that have the identical functionality,
00:10:49.280 | but now run on the GPU and they're gonna be faster.
00:10:51.760 | And so usually we have versions one, two, three,
00:10:53.960 | four, five, six, et cetera.
00:10:55.240 | And these are all different kernel implementations.
00:10:56.640 | They're a bit faster usually over time,
00:10:58.160 | but they match the specification exactly
00:11:00.920 | and give the exact same numbers.
00:11:02.720 | So we develop all those layers and port them to CUDA.
00:11:05.760 | And this is, I don't know what this is.
00:11:08.000 | I'm gonna skip that.
00:11:08.840 | (audience laughing)
00:11:09.680 | I need to look at one of the kernels.
00:11:10.960 | Basically, the point here is the first kernel
00:11:13.000 | is trivial to do usually
00:11:14.040 | because you're parallelizing over batch and time.
00:11:17.760 | And then you're basically copy pasting the C code
00:11:20.120 | into your CUDA kernel.
00:11:21.320 | And you're already getting speed ups
00:11:22.400 | because you're parallelizing over the batch and time tokens
00:11:24.440 | and each thread just handles a single output element.
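A "version 1" kernel in that spirit might look like the sketch below: one thread per (b, t) row, with the C loop body pasted in almost unchanged. This is an illustration, not the actual llm.c kernel:

```cuda
__global__ void layernorm_forward_kernel1(float* out, float* mean, float* rstd,
                                          const float* inp, const float* weight,
                                          const float* bias, int N, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row, N = B*T
    if (idx >= N) return;
    const float* x = inp + idx * C;
    float m = 0.0f;                                   // mean over channels
    for (int c = 0; c < C; c++) { m += x[c]; }
    m /= C;
    float v = 0.0f;                                   // variance over channels
    for (int c = 0; c < C; c++) { float d = x[c] - m; v += d * d; }
    v /= C;
    float s = rsqrtf(v + 1e-5f);
    float* o = out + idx * C;
    for (int c = 0; c < C; c++) { o[c] = (x[c] - m) * s * weight[c] + bias[c]; }
    mean[idx] = m;                                    // cached for the backward pass
    rstd[idx] = s;
}
// launched over all B*T rows, e.g.:
// layernorm_forward_kernel1<<<(B*T + 255) / 256, 256>>>(out, mean, rstd, inp, weight, bias, B*T, C);
```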
00:11:26.920 | So the first kernel is usually trivial,
00:11:28.720 | but then optimization is gonna be pretty elaborate.
00:11:31.920 | So by the end, we get to kernel six, for example,
00:11:33.920 | in layer norm, and we're doing a lot of things
00:11:35.480 | that are a bit more complicated.
00:11:36.400 | So we have some, you know, warp reduce operations.
00:11:40.680 | We also go through shared memory,
00:11:42.880 | through global memory.
00:11:43.720 | We're orchestrating it correctly,
00:11:45.280 | cache streaming hints,
00:11:47.560 | and a bunch of little tips and tricks
00:11:49.720 | for dealing with everything.
00:11:52.200 | And I'm gonna go into a bit more detail later,
00:11:53.800 | but you can get arbitrarily complicated
00:11:55.800 | if you're writing the CUDA code.
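One representative trick from those later kernels, sketched below, is a warp-level sum reduction done entirely in registers via shuffles, which lets per-row means and variances be reduced without extra trips through shared memory. This is an illustrative sketch, not the llm.c code itself:

```cuda
__device__ float warp_reduce_sum(float val) {
    // each step folds in the value held by the lane `offset` positions away
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;  // lane 0 now holds the sum over all 32 lanes of the warp
}
// The cache streaming hints mentioned above are intrinsics like __ldcs()/__stcs(),
// which tell the hardware the data won't be reused soon and shouldn't pollute the cache.
```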
00:11:58.120 | One thing that I sort of found in this project
00:11:59.840 | is that it's not exactly trivial to learn CUDA,
00:12:02.840 | unfortunately, and it was like a little bit harder
00:12:04.680 | than I expected.
00:12:05.960 | I mean, I knew some CUDA going in,
00:12:07.400 | but getting better at it, I think, is not trivial.
00:12:09.760 | I think some of these books, unfortunately,
00:12:11.320 | are a bit out of date, as you might know.
00:12:13.320 | PMPP is actually quite good.
00:12:15.200 | But also, I think still kind of like,
00:12:17.880 | mostly on the beginner level,
00:12:19.080 | because a lot of the CUDA code that we ended up developing
00:12:21.600 | in the lifetime of the LLM.C project,
00:12:23.400 | you would not find those things in this book, actually.
00:12:26.000 | So a lot of the kernels that we ended up adding
00:12:29.480 | would just not be covered.
00:12:31.000 | And then on top of that,
00:12:31.840 | you have this CUDA C++ programming guide,
00:12:33.600 | which, frankly, is not exactly readable
00:12:35.700 | for someone who is a bit new to CUDA.
00:12:39.440 | And then you have this amazing blog post from Simon.
00:12:41.640 | Yeah, he's at Anthropic.
00:12:42.880 | That is like way better than anything we deserve,
00:12:44.800 | just like randomly on the internet.
00:12:46.280 | So that was incredible.
00:12:47.280 | And if there was just more of that,
00:12:48.120 | that would be so amazing.
00:12:49.280 | But yeah, so I think I found it a little bit difficult.
00:12:53.520 | But I mean, I'm hoping that things like CUDA MODE
00:12:55.880 | can definitely speed up the accessibility of writing CUDA.
00:13:00.880 | Okay, so next what happened is I was basically struggling
00:13:06.280 | with the CUDA code a little bit.
00:13:08.160 | And I was reading through the book
00:13:09.460 | and I was implementing all these CUDA kernels.
00:13:11.000 | And they're like okay CUDA kernels,
00:13:12.400 | but they're not great.
00:13:14.760 | And so a team of Avengers assembled from the internet
00:13:18.280 | and saw what I was doing, and started contributing.
00:13:19.960 | So specifically, Eric, Arun, Aleksa,
00:13:22.640 | who I would say are core devs of LLM.C
00:13:24.760 | and contributed a ton of work to LLM.C.
00:13:27.280 | And they started to like really optimize
00:13:29.640 | and write all these kernels.
00:13:30.600 | And this was incredible to watch and learn a lot from.
00:13:33.200 | And there's many more, Ross Wheeler and Chen Hsiao
00:13:36.160 | and a few others.
00:13:37.600 | But over time, we have 60 contributors
00:13:39.440 | to the LLM.C project.
00:13:40.840 | Shout out to LLM.Dev for sponsoring LLM.C.
00:13:43.440 | They contribute compute
00:13:44.600 | so that we can run and optimize all these kernels.
00:13:47.120 | So it was amazing for me that people just came
00:13:48.880 | from the internet and helped out on the project.
00:13:50.360 | And you know, this is one of my favorite things
00:13:52.880 | that can happen
00:13:54.200 | within an open-source MIT-licensed repo:
00:13:56.440 | people just come from the internet
00:13:58.160 | and help contribute, it's amazing.
00:13:59.960 | Okay, so we've converted all the layers to CUDA.
00:14:03.640 | We have now all the kernels.
00:14:04.760 | And we can now train on a single GPU in FP32 so far.
00:14:08.560 | So that's great.
00:14:10.000 | So from then on, we start to make more and more optimizations.
00:14:12.480 | So number one, we don't want to have matmuls in FP32
00:14:15.480 | when you roll your own code.
00:14:17.520 | We actually switched to cuBLAS.
00:14:20.040 | Step two, we don't want to write our own flash attention.
00:14:23.000 | I think that would be pretty complicated.
00:14:25.960 | Turns out cuDNN has a very good
00:14:25.960 | flash attention implementation.
00:14:27.320 | So we switched to that.
00:14:28.800 | Next, you definitely want to reach for mixed precision
00:14:33.560 | to speed up the code.
00:14:36.320 | So you want to go over all your tensors for parameters
00:14:39.520 | and also for activations and so on.
00:14:41.080 | And you have to start to think about,
00:14:42.120 | okay, which ones are in float32,
00:14:43.480 | which ones are in bfloat16,
00:14:45.080 | and what precision are they in?
00:14:46.560 | And then do all the conversions automatically.
00:14:48.840 | So we reached for that and implemented that.
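A minimal sketch of the explicit conversions this implies, using the CUDA bfloat16 intrinsics from <cuda_bf16.h>; the structure is illustrative and not the actual llm.c code:

```cuda
#include <cuda_bf16.h>

// Down-convert a float32 master copy into a bfloat16 working copy.
__global__ void fp32_to_bf16(__nv_bfloat16* dst, const float* src, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = __float2bfloat16(src[i]);
}

// Up-convert bfloat16 values back to float32 where full precision is needed.
__global__ void bf16_to_fp32(float* dst, const __nv_bfloat16* src, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = __bfloat162float(src[i]);
}
```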
00:14:51.600 | There's many, many other optimizations
00:14:52.960 | that we've been implementing over time.
00:14:54.560 | So as an example, we did all the kernel fusions,
00:14:57.400 | different recompute settings
00:14:58.880 | to recompute a piece of the forward pass
00:15:00.280 | during the backward.
00:15:01.280 | There's been a lot of optimizations from Eric,
00:15:04.520 | especially on minimizing the amount of memory
00:15:06.520 | that you need during the backward pass.
00:15:08.960 | We have this packed 128 data structure,
00:15:11.160 | which basically, in our experience,
00:15:13.000 | forces the compiler to use the 128-bit
00:15:15.800 | load and store instructions that are available,
00:15:17.640 | but somehow the compiler is unwilling to use in many cases.
00:15:21.000 | So I think Arun did a lot of work here
00:15:23.800 | where you just look at the SASS,
00:15:25.560 | and you look at the SASS, the assembly,
00:15:28.760 | and you are looking at what instructions
00:15:30.520 | are being used in your loop,
00:15:31.480 | and you figure out that,
00:15:32.320 | okay, there should be a 128-bit load and store,
00:15:34.440 | but it happens to be a 32-bit or something else
00:15:36.360 | because something in the NVCC compiler
00:15:38.400 | is not going very well.
00:15:39.280 | So we found that this data structure
00:15:41.320 | kind of forces the compiler's hand a bit more.
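The idea looks roughly like the sketch below: a 16-byte aligned value type, so a single load or store moves 128 bits (four floats) at once and the compiler is nudged toward LDG.128/STG.128 instructions. The real llm.c version is more general; this is a simplified illustration:

```cuda
typedef struct __align__(16) {
    float payload[4];   // 4 x 32-bit = 128 bits per load/store
} f128;

__device__ __forceinline__ f128 load128(const float* address) {
    return *(const f128*)address;   // one 128-bit load instead of four 32-bit loads
}

__device__ __forceinline__ void store128(float* address, f128 value) {
    *(f128*)address = value;        // one 128-bit store
}
```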
00:15:43.960 | We implemented all kinds of CUDA streams to overlap
00:15:46.360 | parts of the computation,
00:15:47.680 | and this ended up creating a total disaster.
00:15:51.080 | And so that's why I scratched it out,
00:15:52.320 | because at one point of LLM.C, as Arun would say,
00:15:55.480 | I basically went in and I nuked it from orbit.
00:15:58.000 | I just went in, I control-F'd
00:16:00.280 | for all mentions of streams,
00:16:01.520 | and I just delete, delete, delete.
00:16:03.160 | And basically I deleted all the streams,
00:16:04.560 | made everything single-threaded,
00:16:05.920 | because we ended up getting all kinds
00:16:07.160 | of really weird race conditions and errors and so on.
00:16:09.160 | I just didn't want to deal with it.
00:16:10.120 | So LLM.C is not actually as overlapped as it could be,
00:16:13.960 | but it's just like, it's too much complexity
00:16:16.280 | for not that much gain at this point.
00:16:18.720 | But maybe we can slowly reintroduce some of it.
00:16:21.760 | We have stochastic rounding, we have full determinism.
00:16:23.920 | Full determinism turns out to be pretty hard
00:16:25.760 | because some of the kernels complexify a lot
00:16:27.800 | because you can't use atomics.
00:16:29.120 | Like the encoder backward was especially crazy
00:16:31.440 | because the encoder backward is trivial with atomics,
00:16:34.720 | but non-trivial without it.
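A sketch of why atomics make the encoder backward trivial: many (b, t) positions can share a token id, so their gradients all accumulate into the same row of the embedding gradient table, and atomicAdd resolves that race. But the order of float additions then varies between runs, which is what breaks bitwise determinism. This is illustrative, not the llm.c kernel:

```cuda
__global__ void encoder_backward_atomic(float* dwte,        // (V, C) embedding grads
                                        const float* dout,  // (B*T, C) upstream grads
                                        const int* tokens,  // (B*T,) token ids
                                        int N, int C) {     // N = B*T
    int i = blockIdx.x * blockDim.x + threadIdx.x;          // one thread per (position, channel)
    if (i >= N * C) return;
    int pos = i / C, c = i % C;
    // positions with the same token id race on the same row; atomicAdd makes it safe
    atomicAdd(&dwte[tokens[pos] * C + c], dout[pos * C + c]);
}
```

The deterministic variant has to avoid atomicAdd entirely, for example by grouping the work by token id, which is what makes it considerably more involved.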
00:16:36.440 | Anyway, so a lot of the optimizations
00:16:37.960 | went into with a lot of efficiency
00:16:40.920 | and determinism in mind.
00:16:42.120 | And accuracy, like stochastic rounding and so on.
00:16:46.220 | Next, you want to use multiple GPUs,
00:16:47.640 | not just a single GPU.
00:16:48.800 | So this is where you bring in NCCL,
00:16:50.960 | you start to do all-reduce between all the different workers.
00:16:55.280 | And this is where you also start to reach
00:16:56.800 | for like the sharded optimizer state, ZeRO-1.
00:16:59.600 | So basically, you take your optimizer states,
00:17:01.000 | which are in float, and these are really large buffers
00:17:04.320 | for AdamW, and you can actually spread out
00:17:06.320 | a lot of the stuff across all the GPUs
00:17:07.800 | and it really helps to keep your requirements
00:17:09.920 | down per GPU in terms of memory.
00:17:11.600 | So very helpful to reach for that.
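The core multi-GPU step being described is a gradient all-reduce, roughly as sketched below (assuming an ncclComm_t and a cudaStream_t were set up during initialization; error handling omitted):

```c
#include <nccl.h>

// Sum the gradient buffer across all workers, in place; each rank can then
// scale by 1/world_size (or fold that factor into the learning rate).
void allreduce_grads(float* grads_device, size_t count,
                     ncclComm_t comm, cudaStream_t stream) {
    ncclAllReduce(grads_device, grads_device, count,
                  ncclFloat, ncclSum, comm, stream);
}
```

ZeRO-1 then builds on this by giving each rank only its own shard of the AdamW m and v buffers, so each rank updates just its slice of the parameters and the updated slices are gathered afterwards.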
00:17:14.000 | So currently, LLM.C uses ZeRO-1,
00:17:16.280 | which is the sharded optimizer state.
00:17:17.760 | There's a PR for ZeRO-2,
00:17:19.000 | but I don't believe I merged that yet
00:17:21.000 | because it gets a little bit messy,
00:17:22.800 | but might be merged eventually.
00:17:25.000 | A lot of LLM.C is just kind of like
00:17:26.720 | balancing the improvement in speed
00:17:30.280 | with the complexity of what you're actually introducing.
00:17:32.680 | And so I actually rejected a lot of PRs because of that
00:17:35.200 | because the code starts to get crazy.
00:17:37.380 | And I think that decreases the amount of people
00:17:38.760 | that can be onboarded.
00:17:39.860 | And then after you're multi-GPU, you go multi-node.
00:17:43.960 | So now you are running across multiple machines here
00:17:46.160 | to make sure that you synchronize all of them,
00:17:48.320 | that they can find each other and so on.
00:17:49.600 | So we implemented all that.
00:17:51.560 | And where that leads us to is that we can
00:17:53.960 | actually train GPT-2, and we can actually reproduce it
00:17:56.120 | after all of that work.
00:17:57.320 | So there's a post in the discussions of LLM.C.
00:17:59.960 | We can train the 1.6 billion parameter GPT-2,
00:18:02.080 | which was the state-of-the-art LLM as of 2019 or so.
00:18:05.320 | And you can train it on a single node,
00:18:06.980 | of H100s, in about 24 hours.
00:18:09.260 | And that costs roughly $600.
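(As a rough sanity check on that figure, assuming a typical 8-GPU H100 node rented at roughly $3 per GPU-hour: 8 × 24 × $3 ≈ $576, i.e. about $600.)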
00:18:12.020 | And the way you do that is it's extremely dependency free.
00:18:14.300 | There's no need for Python, no need for PyTorch.
00:18:16.860 | So you do need cuDNN, which is the heaviest dependency.
00:18:20.540 | But cuDNN is optional.
00:18:21.820 | So if you'd like to roll your own manual attention,
00:18:23.900 | that is possible in LLM.C.
00:18:25.580 | But cuDNN is kind of like the hairiest dependency.
00:18:27.540 | But after that, it's just a bunch of C code.
00:18:29.260 | You compile it and you run it.
00:18:30.380 | There's no need for really anything.
00:18:32.460 | So there's no need for conda environments,
00:18:34.060 | pip installs, there's just nothing.
00:18:35.640 | It's just amazing.
00:18:36.720 | And then you compile your code and you run it.
00:18:38.640 | And it starts stepping and you wait 24 hours.
00:18:41.360 | And then it's stepping, doing some diagnostics.
00:18:44.920 | We have almost a 50% MFU here on one node, which is quite good.
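(MFU here is the usual model-FLOPs utilization: roughly, the FLOPs per second actually spent on the model, estimated with the standard ≈ 6 × N_parameters FLOPs-per-token approximation, divided by the hardware's peak FLOPs per second. So ~50% means the node sustains about half of its theoretical throughput on useful model math.)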
00:18:50.880 | And you get really nice plots.
00:18:52.160 | And you beat GPT-2 on HellaSwag.
00:18:54.400 | And basically, in this case
00:18:55.920 | the optimization went well.
00:18:57.400 | No crazy numerical issues, loss spikes or anything
00:18:59.480 | like that at this size.
00:19:01.360 | And yeah, it should be a really good model in LLM.C.
00:19:07.500 | We can still compare to PyTorch.
00:19:08.860 | Because remember, we have PyTorch implementation
00:19:10.900 | for all this stuff in parallel on the side.
00:19:12.980 | And so you can run the whole training in PyTorch as well.
00:19:16.460 | And we can compare the two implementations side-by-side.
00:19:19.060 | And in particular, at the time of writing that post--
00:19:21.740 | and I don't know if this has changed,
00:19:23.280 | because the PyTorch team continues
00:19:24.700 | to optimize things over time-- but at the time of that post,
00:19:27.940 | we were using, in LLM.C, 30% less memory.
00:19:30.660 | And we were 20% faster in training just the throughput.
00:19:33.700 | And I don't know if I fully super duper optimized
00:19:35.860 | the PyTorch implementation.
00:19:36.980 | I did my personal best.
00:19:38.360 | But we were able to, I think, beat PyTorch in training mode,
00:19:41.540 | specifically GPT-2 in LLM.C. If you
00:19:44.880 | want to train anything else, you're in a lot of trouble.
00:19:47.220 | [LAUGHTER]
00:19:48.300 | You have to change your code a lot.
00:19:49.780 | And I'm going to come back to that.
00:19:51.820 | But for GPT-2 training, we're better after all that work.
00:19:55.620 | And it also compiles and runs much faster,
00:19:57.380 | which is beautiful.
00:19:58.220 | Torch compile actually takes quite a bit of time,
00:19:59.840 | like a minute or something.
00:20:00.880 | You're just waiting.
00:20:01.780 | So that's also something that I personally
00:20:03.780 | don't like to work with usually.
00:20:06.500 | So looping back around, turns out it wasn't all that simple.
00:20:09.220 | [LAUGHTER]
00:20:09.720 | I mean, there was a lot of stuff involved.
00:20:11.460 | And it took a few months for a few people.
00:20:13.420 | But it was fun.
00:20:16.460 | We learned a lot.
00:20:17.340 | And we were friends along the way.
00:20:18.780 | These are the LLM.C core devs.
00:20:20.300 | [LAUGHTER]
00:20:22.620 | So it was great.
00:20:24.860 | Ongoing work.
00:20:26.020 | We are adding Llama 3 support.
00:20:27.540 | We actually thought maybe we would have it done by today.
00:20:29.960 | But there's a little bit more work to do.
00:20:33.880 | But we will have Llama 3.1 training in LLM.C very, very
00:20:37.040 | soon.
00:20:38.240 | We will have FP8 support.
00:20:40.280 | So everyone has been working on this.
00:20:42.920 | And there's a big PR that's coming for FP8 support,
00:20:45.840 | which is also interesting.
00:20:47.840 | And there's a lot of notable forks of LLM.C.
00:20:50.160 | They're all listed on the GitHub repo.
00:20:52.240 | The AMD fork is very active, as far as I understand it,
00:20:55.360 | and quite good.
00:20:56.480 | I think also the C++ CUDA fork is quite nice.
00:21:00.340 | And so, a lot of forks.
00:21:02.940 | So I encourage you to also work on LLM.C.
00:21:05.420 | It's fairly readable, I think.
00:21:06.740 | I try to keep it clean, well-documented.
00:21:08.420 | I think it's pretty well understood what's in there.
00:21:10.580 | It's only maybe like, I think, 3,000 lines of code,
00:21:13.340 | basically C mostly.
00:21:15.700 | And one more thought, I think, that I wanted to get across
00:21:18.120 | is it wasn't all that haphazard to start the project.
00:21:22.220 | I had another motivation for starting the project.
00:21:26.100 | And that's that I think, I mean, what is LLM.C like?
00:21:28.980 | PyTorch, especially with torch.compile,
00:21:31.380 | is a bit like GCC for Software 2.0.
00:21:33.180 | It's a compiler.
00:21:34.240 | But LLM.C is a bit like writing assembly,
00:21:36.260 | where you're doing everything manually, right?
00:21:38.820 | And basically I think we wrote LLM.C as multiple people
00:21:43.820 | over a duration of three months,
00:21:45.860 | and got something that was faster than PyTorch
00:21:50.500 | in the specific setting of GPT-2 training.
00:21:50.500 | And so this exercise basically proves that this is possible.
00:21:53.860 | Now the problem is you need to spend multiple people
00:21:56.040 | several months.
00:21:57.100 | But if LLMs are about to become much better at coding
00:21:59.980 | over time, then I think you can expect that the LLM
00:22:02.900 | could actually do this for any custom application over time.
00:22:05.540 | And so the LLMs could act as a kind of compiler
00:22:08.540 | for any custom application you're interested in.
00:22:10.340 | They're gonna do all the LLM.C work,
00:22:12.180 | and they're gonna output the binary
00:22:13.940 | that you can compile and run for your specific applications.
00:22:16.460 | So I don't actually know if we,
00:22:17.820 | like the use of Python and PyTorch and everything else,
00:22:20.500 | it's just a crutch because we humans are finite.
00:22:22.860 | We have finite knowledge, intelligence, and potential.
00:22:25.340 | But actually, don't you wanna write all code
00:22:27.180 | in custom CUDA kernels and so on?
00:22:29.500 | Like, maybe.
00:22:30.660 | And so the other thing that I think is interesting
00:22:33.100 | is that the LLM.C repo might be useful,
00:22:35.620 | because in the early stages of these LLMs
00:22:37.940 | and their intelligence, they might not be able
00:22:39.820 | to write this code from scratch.
00:22:40.980 | If you just prompted them, "Write GPT-2 in C,"
00:22:43.060 | you probably won't get LLM.C.
00:22:44.800 | But you're a lot more likely to get it
00:22:46.260 | if you put LLM.C in the context window of an LLM,
00:22:49.740 | and you can expect that the few-shot learning
00:22:51.180 | would be very helpful for the LLM
00:22:52.580 | to basically have example code.
00:22:54.180 | And so I think LLM.C could be very useful
00:22:56.060 | as this example code to give to the LLMs,
00:22:57.820 | 'cause they're about to write
00:22:58.660 | all of our custom applications.
00:23:00.260 | And so I think this is actually not unlikely to happen.
00:23:03.520 | Yeah, this is kind of likely to happen.
00:23:04.820 | So I think software development in general
00:23:06.580 | will probably change a lot.
00:23:07.420 | And to me, LLM.C is an exploration
00:23:09.260 | of whether this is even possible,
00:23:10.740 | because if it is possible,
00:23:11.660 | then maybe this is what's gonna happen.
00:23:13.380 | So, yeah, that's it.
00:23:15.260 | Thank you.
00:23:16.100 | (audience applauding)
00:23:19.260 | (audience cheering)
00:23:22.260 | - All right, well, the morning session talks.