
Running and Finetuning Open Source LLMs — ft. Charles Frye, Modal


Transcript

- Yeah, sure, yeah, thanks for inviting me, Noah and Sean. It's always a pleasure. And yeah, I guess my background is: studied neural networks, sort of how to optimize them, how to prove they converge, in grad school, that was at Berkeley. Then joined Weights & Biases, the experiment management and MLOps startup, series A to series C, did education for them, then did the Full Stack Deep Learning online course about how to deploy models in the pre-foundation-model or liminal-foundation-model era. And now I work for Modal, an infrastructure company that helps people run data-intensive workloads like ML inference.

So, yeah, so then, oh yeah, so FSDL fans in the chat, like Sean, yeah, we'd love, maybe we'll be able to do something under that banner again sometime. But yeah, so wanted to talk today about running and fine-tuning open-source language models. Why would you do it? The answer is not always with both of these things.

And then like some things about how, some like high-level things. This course, my understanding is oriented at software engineers who wanna learn more about like running AI models and building systems around them. So that's kind of the background that I've assumed in a lot of these. And then, yeah, to actually kick us off, before we go through the slides, I'm actually gonna do a quick demo.

This is something that I got set up just yesterday, but like since it is, you know, in the news, like quite literally, let's run a local model, or rather run our own inference on a model. Let's run this DeepSeek R1 model that people keep talking about. So this is coming from the Modal examples repo.

So you can try this code out yourself. All you need is like a single Python file and a Modal account, and you're ready to go. I'm gonna kick it off here. Oops, you need a virtual environment with Python in it. That's one thing you need, I suppose. I forgot to mention that.

Okay, so let's run this guy here in the terminal. It's my VS Code. The code's up there. You know, as is often the case, the code is not supremely interesting. I'm pulling it in, I'm running it with llama.cpp here. llama.cpp has very, very low precision quants. So there's a ternary quant of DeepSeek R1.

That means all the values are either minus one, zero, or one in the weights. And that's enough to squeeze it down to fit on a single machine with multiple GPUs. In this case, four L40S GPUs. So that's why I'm running with llama.cpp here. So let's see, spinning up right now.

Oh man, we're out of it. 4x L40S on Modal, so we might have to wait as many as 15 or 30 seconds for that to spin up. While we're waiting for that, let me show you just a little bit about what's going on here. We're running llama.cpp here.

Running these things is an exercise in configuration. So if you've ever administered a database, you'll be familiar with this sort of thing. Or if you've run compilation for a serious large project, you got your mysterious flags with mysterious arguments that have meaningful impact on the performance. So controlling, in this case, the KV cache and setting the quantization precision of that, along with some other things for llama.cpp.
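For reference, the kind of llama.cpp invocation being described looks roughly like this. It's a sketch, not the actual Modal example code; the model path and prompt are placeholders, and flag names can vary between llama.cpp versions.

```python
import subprocess

# Hypothetical model path: a low-bit GGUF quant of DeepSeek R1 (e.g. from Unsloth).
MODEL_PATH = "/models/deepseek-r1-ternary.gguf"

cmd = [
    "llama-cli",
    "-m", MODEL_PATH,
    "--n-gpu-layers", "9999",     # offload every layer onto the GPUs
    "--ctx-size", "8192",         # context length to allocate KV cache for
    "--cache-type-k", "q4_0",     # quantize the K half of the KV cache
    "--cache-type-v", "q4_0",     # quantize the V half (typically needs flash attention)
    "--flash-attn",
    "--n-predict", "2048",
    "--prompt", "Create a Flappy Bird game in Python.",
]
subprocess.run(cmd, check=True)
```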

Okay, so we had about a minute queue for GPUs. That's actually, that's like a P95 probably for 4x L40S on Modal. So sometimes you roll the dice and you get a natural one. But, so it took us about 60 seconds maybe to spin this up and get a hold of four L40S GPUs.

If this happens to you, DM me and I'll go smack our GPU cluster with a hammer and try and make it go faster. All right, so this is loading up, loading up all the stuff you need for llama.cpp to run DeepSeek R1. This is the model loader. Actually, it turns out it's about 100 something gigabytes once you've quantized it down this far.

These are all different layers here. Nothing too interesting in the model architecture itself. It's really the data and the inference tech that DeepSeek built that's really the interesting part. So skipping past all this extra stuff. So now we're at the point where we're loading the model.

So you wanna run your own model, great, okay. If you want to have it hot on a GPU all the time, then, you know, why do we have four GPUs, why not just one? We gotta have 100 gigabytes of space to hold all the weights for this thing in.

That's 100 something gigabytes of RAM. The problem with RAM is, like, you can't share RAM, and when you unplug it, the data goes out of it. So this is actually one of the major cost sources for running your own model. It's like, you gotta have, if you wanna avoid this latency that we're looking at here of, you know, what is it, about a minute, 90 seconds to spin up.

That's separate from any Modal overhead. This is just the raw moving of bytes around, setting up RAM. If you wanna avoid that, you gotta have stuff hot in RAM, and RAM is not free. You either gotta pay to keep something warm on a serverless provider like Modal, or you gotta have an instance running in the cloud.

But that's all, that's been done. We're now doing prompt processing. This is the prompt. Unsloth, by the way, is the team that did the quantization here for DeepSeek R1 down to three bits, and their demo prompt is what I've just copied directly here. Oh yeah, the prints mess up a little bit sometimes here.

But there should be at the top, the beginning of the prompt is something like, please write the game Flappy Bird in Python. So that's the prompt along with some instructions. That prompt has gone into DeepSeek and is now being processed. Okay, prompt processing is done. And now the beloved think token has been emitted and the model has begun to think about it, what it wants to do.

So this deployment is not super well optimized. There's a substantial amount of host overhead, which means the GPU is not actually working all the time, even as we're generating these tokens. That's probably either llama.cpp needs a PR or I missed some compiler flag or something. The CPU usage is also kind of low.

So I'm suspicious that maybe I messed something up in the compile. So it's 10 tokens per second right now. There's line buffering, so you aren't seeing the tokens live; you see them once a line is emitted. But yeah, it runs about 10 tokens per second on these L40S GPUs and could probably be boosted up to about 50 tokens per second by removing some of this host overhead.

And then from there probably optimize kernels for this architecture. Some other things would maybe double it again. Oh, finished thinking pretty quickly that time. Interesting thing with these models is like, they think for very variable amounts of time controlled by how hard they think the problem is. And so sometimes it finishes thinking pretty quickly like here.

Sometimes it thinks for like 20 minutes. So, you know, go make a coffee. I don't know, go compile something else while it's writing your answer. And yeah, I think the quality of the output here is reasonably good. One thing the Unsloth people call out and I've noticed in a couple of generations is these super low bit quants sometimes throw in a random junk token.

So this case here, I bet that dense there is like, that might not be defined. Yeah, it doesn't look like that's defined. Rough, that's probably supposed to be a zero. There's some inference config stuff that I haven't played with that can reduce that and improve the quality, but that comes with the quantization territory.

Yeah, so there it goes. It thought about it for a while, did some backtracking, and then wrote some code for Flappy Bird. So there's running a model and some of the stuff I talked about along the way, some of the main concerns you're gonna have running your own model.

Yeah, any questions before I dive into the slides? >> Really quickly, so we haven't gotten super into the internals. Could you go over what quantization is? Like why do you need to do that and what that process generally looks like? >> Yeah, sure, I'll talk more about this, but the basic answer is the model is trained as a bunch of floating-point numbers.

For DeepSeek R1, they were eight-bit floating-point numbers. That's crazy small. They worked hard to reduce the overhead during training. More typical is 16-bit or even 32-bit floating-point numbers. Default in Python is often 64-bit floating-point numbers, but that's way too much for neural networks most of the time. They don't need that level of precision.

So you can save a lot on memory, which saves a ton on inference speed, especially in these low user count settings and heavy decode settings, like a reasoning model that produces a ton of tokens in response to a single input. Yeah, that helps a lot to decrease the memory footprint of the weights, even if that's all you do with your quantization.
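As a rough sketch of that memory arithmetic (the parameter count and precisions here are illustrative assumptions, not measurements):

```python
# Napkin math: bytes of weights you have to store and stream through the GPU
# at different precisions, for a hypothetical 8B-parameter model.
PARAMS = 8e9

for name, bytes_per_weight in [("fp32", 4), ("fp16/bf16", 2), ("fp8/int8", 1), ("4-bit", 0.5)]:
    gb = PARAMS * bytes_per_weight / 1e9
    print(f"{name:>10}: ~{gb:g} GB of weights")
# fp32 ~32 GB, fp16 ~16 GB, 8-bit ~8 GB, 4-bit ~4 GB. The same logic is how a very
# large MoE model like DeepSeek R1 gets squeezed to 100-something GB with a ternary quant.
```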

But yeah, so there we go. It actually finished. I noticed a couple more typos in there. That's probably, yeah, should tune the inference config on there. There's like a top-P, min-P sampling thing that you can do that I haven't dialed in yet. But yeah. Yeah, anybody, any other questions before we dive in?

Yeah, feel free to interrupt me as we're going. Let's make this a little bit more on the interactive side, ideally. Got a lot of slides, a lot to cover, but what we cover depends on what people are most interested in. Okay, so I just ran an open-source language model, DeepSeek R1.

Let's talk about just in general, what does that take? Define some of the things that I talked about there, like memory, bandwidth constraints, and quantization, and all this other stuff. We'll also talk a bit about fine-tuning models, customizing them to your use case by doing actual machine learning and training.

Before doing that, I do want to talk about the why of this here, like just 'cause something is, even if something's quick, easy, and free, it doesn't mean it's a good idea. And running and fine-tuning your own models is none of those things. So you want to make sure you have a good idea why you want to do this.

Why not just use a managed service behind an API? One of the primary reasons to do it is if you don't need frontier capabilities. So if you don't need to run DeepSeek R1 to get reasoning traces for your customer support chatbot that just needs to ask them to turn the thing off and on again.

That level of LLM inference, the software is pretty commodity. The hardware to run it is getting easier and cheaper. And so you can frequently run that relatively inexpensively. And so you don't need a proprietary model, and the complexity of serving is lower. Just like a call-out on that DeepSeek R1 demo, there's probably an order of magnitude and a half of improvement that could be done to that.

So a 30x improvement is probably low-hanging fruit, like a week of engineering. But right now, running that on Modal is $300 a megatoken. And just having DeepSeq run it for you is $3 a megatoken. So that's a pretty big difference, even assuming we can get a 30x cost reduction running just by doing more than a day's engineering to get it running.

So that's a reason people sometimes think running their own LLM inference makes sense is to save money. And that intuition, I think, comes from getting fleeced by cloud providers who will charge you an arm and a leg to just stand up commodity Redis on commodity hardware. But right now, that's not the case.

So the main reason I think that people bring up is to manage security and improve data governance. You want to make sure to run this thing yourself. The more control you want, the more complex this problem is going to be, until eventually it ends up with getting your own GPUs and putting them in a cage, which is probably six months or a year of engineering work, and then a lot of ongoing maintenance.

But at the very least, running it with a cloud provider, whether that's Modal or raw dogging AWS, can improve your security and data governance posture. Not everybody wants to send data to untrustworthy nation states like the United States or China. Then gaining control over inference is maybe the one that I would say is most important.

It's like, and most general, it's like API providers, there's only so much they can do. If they're proprietary, they got to hide stuff from you, whether that's reasoning chains, in OpenAI's case, or like log probs, also in OpenAI's case, or just like the increased customization decreases their ability to amortize work, to spread it across multiple customers, which is the way that they get things cheaper than you can run it yourself, sort of economies of scale.

And so the more flexible, the more different your deployment is, the harder it is for them to do that, to run this variety of workloads economically. I think over time, all of these things are going to lean more in the favor of running your own LLM inference. Like frontier capabilities will go off in the direction of artificial super intelligence or whatever, but the baseline capabilities that anybody can just download off of Hugging Face will just keep on getting better.

So we just saw reasoning, o1-level capabilities. Six months ago, I told people, "You got to go to OpenAI for that." Now you can run it yourself. But I think the most important one that's going to tilt in the direction of running your own inference as the field matures is gaining control over inference.

Like things are just going to get way more flexible. People are going to discover all kinds of crazy things you can do with like hacking the internals, with log probabilities. People will rediscover what everybody was doing in 2022 and 2023, when people still had access to the models internals, and discover that it makes their lives better.

And you'll want to run your own inference for that, to control that. See a question, Juliette. - Yeah, Charles. So before we carry on, and I'm not sure if you're going to speak more about this as we go forward, but could you speak a bit about how inference is currently working, just to make it a bit more concrete in my mind?

- When you say how inference is currently working, do you mean like how people normally, the alternative to running your own? - Well, you're saying that, and I'm not familiar, I'm not so familiar with the word inference itself. Like, could you share a bit about how current models are using inference and like how it works today, so that then I understand how to better like tweak it and what it's like?

- Got it. Yeah, sure. Sorry, that's a bit of jargon. Inference just means running the model, right? Like putting something into the model and something coming out of it. Goes back to the like probabilistic backing of these models. Like you do it, you're like predicting what the future tokens are going to be.

And that's like inference, like logical inference. But yeah, that's where the term comes from. But yeah. Cool. So yeah, so it's like, this is like replacing OpenAI's API or Anthropx API or OpenRouter with a service you host yourself, is what we're talking about here. Cool, yeah, definitely if I'm like, especially since, you know, I usually speak to more of an ML engineering audience.

So like, if I just like forget that I haven't defined a term, please do interrupt me and ask me about it. Spent some time on this one already, so I won't go into more detail on this. But I would just say like, it's not that uncommon that proprietary software leads open software and raw capabilities like Oracle SQL and Microsoft SQL Server and like OSX and Windows have a bunch of things that like, beat their open source equivalents and have for a long time, like query optimizers in particular in the case of databases.

So like, it's maybe not so surprising that that's the case in AI. But the like, the places in general, these things have co-existed in other domains. And then open software has been preferred in cases where it's more important to be able to hack on and integrate deeply with a technology.

And so, you know, we're likely to see some mixture stably. And I, you know, I initially said this at one of swyx's events, the AI Engineer Summit, a year and a half ago now, and this has remained true. So that's at least 18 months of prediction stability, which is the best you can maybe hope for these days.

Yeah, so saving money. A lot of people want to run their own models to save money. Right now, inference is priced like a commodity. People find it relatively easy to change models. Little prompt tuning, keep a couple of prompts around, ask a language model to rewrite your prompts for you.

Like, yeah, this among other factors has led to this LM inference being priced like a commodity rather than like a service. And so it's actually like quite difficult to run it more cheaply yourself. And so there's a couple of things that might swing in your favor. If you have idle GPUs, like maybe you have an ML team internally, and they like, when they're not doing training runs, they have GPUs just sitting there.

You might just mine cryptocurrency with them instead, you know, like faster time to ROI. But like, you know, that at least if you have them, that like, you're just paying electricity. So that makes it a little bit easier. But electricity costs are actually quite high for these things, you know, kilowatt per accelerator for the big ones.

The, like, taking a really big generic model, one of these like foundation models, like OpenAI's O1 model or Claude, and distilling it for just the problems that you care about into something like smaller and easier to run, that's a way that you can like save money. And we'll talk a bit about that if we get to fine tuning, if we spend time on that in fine tuning.

But, you know, that can help a lot. If your traffic is super high and dependable, and you can just like allocate some GPUs to it, and like, you know, run it, you know, just get a block of EC2 instances with GPUs on them, hold them there, send traffic to it.

It's flat, you're utilizing all the GPUs all the time. You could probably start to like compete with the model providers there on price. And then finally, it's like, if it's like once a week, you need to like process like every support conversation that you had and add annotations to it and generate a report.

So it's like once a week, you need like 10 mega tokens per minute throughput. And then like rest of the time you don't, then like the proprietary model providers are gonna push you onto their enterprise tier for those big rate limits. But you can actually like, and so that's gonna push up the cost of using a provider.

But then it's also easier to run super like big batches. Like it's actually kind of like easier to run these things economically at scale than it is at small scale. Somewhat counterintuitively maybe for a software engineer who's used to running like databases and web servers. Just like the nature of GPUs is that it's easier to use them the more work you have for them.

And so that makes, you know, these like batch and less latency sensitive workloads, like more amenable to running yourself if you can get ahold of serverless GPUs through a platform like Modal, Replicate, Google Cloud Run, something like that. Okay, so that's everything on like why you would do this, why you would run your own OpenAI API or Anthropic API replacement.

Any questions before we move on? I saw the chat had some activity, maybe check that out. Anybody wanna speak up? - No, I think we're just sort of adding color to different stuff. - Got it, thanks for grabbing the chat. Okay, so let's start. Like I've already mentioned hardware and GPUs a lot, so let's talk about that a little bit more.

Talk a little bit about like picking a model, then deep dive on like serving inference, a little bit on the tooling for it. Then like fine tuning, like how do you customize these models and then close out with thinking about observability and continual improvement. Okay, and yeah, link for the slides there.

Of course, you'll be able to get it after the session. Okay, so picking hardware is pretty easy. Just use NVIDIA GPUs, don't have to go any further. No, let me go into a little bit more detail about why that's the case. So Juliet wanted like a little bit more color and detail on what does LLM inference mean.

So what LLM inference means is you need to take the like parameters of the model, the weights, this like giant pile of floating point numbers. Those are gonna be sitting in some storage. You need to bring them into the place where compute happens. So like even if they're sitting in memory, like compute doesn't happen in memory.

Compute happens like on chip inside of like registers. So you gotta move all of that in. And the fun fact is like you actually need like pretty much every single weight needs to go in. So like for most models, you can just look at how many gigabytes is that model file.

And that tells you how many bytes they're gonna need to move in to get computed on. So like you're running an 8B model in one byte quantization, that's 8 billion weights. One byte per weight, that's eight gigabytes. So you need to move eight gigabytes out of wherever they're stored and into the place where compute happens.

And then like that happens, like you're pushing tokens and activations through those weights to get out the next token. On your first iteration, you're sending in the whole prompt. And so you're sending in a whole prompt and generating an output token. So is guava a fruit? Yes. In the process of pushing something through the weights, a kind of rough estimation is that you wanna do two floating point operations per weight.

So that's like, you want to multiply the weight with some number and then you're gonna add it to some other number. So that's two operations per weight. This is very napkin math. But again, nobody should have to write this themselves; there's a very small number of wizards who write the actual code here.

The core thing is being able to reason as an engineer about what the system's requirements are and how to, kind of like with a database, you don't have to be able to write a B-tree from scratch on a chalkboard unless you're interviewing at Google. But you should know how indices work so that you can like think about queries and structure tables in a smart way.

And so similarly here, I'm just trying to give you the like intuition you need for understanding this workload. So for this, we have four tokens. We've got like one output. Yeah, we got four tokens coming in. We've got 8 billion parameters. So eight times two times four, that's 64 billion floating point operations.

And then that gets us one token. Then we got to repeat this every time we want to generate another token. So we're going to move the weights. Like they have to, they go into where they get multiplied, then back out. Because we're talking about like registers and caches here.

If you think of your like low level hardware stuff, registers and caches. So they can't hold the whole weight. So they got to go in and out the whole time. Again, if you're a database person, you should think of like running a sequential scan on your database over and over and over again on like a billion row database.

So it's wild that we can even run it as fast as we do. But this is the workload. The hard part about it is the scale. The easy part about it is that this is like relatively simple control flow at the core. So that makes it amenable to acceleration with GPUs.
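To put that napkin math into code (the model size, quantization, and bandwidth figure below are assumptions for illustration, not measurements):

```python
# ~2 FLOPs per weight per token, and every weight has to be read from GPU memory
# for every token you decode.
params = 8e9            # hypothetical 8B-parameter model
bytes_per_weight = 1    # 8-bit weights
prompt_tokens = 4       # "Is guava a fruit?"

flops_for_prompt = 2 * params * prompt_tokens
print(f"prefill: ~{flops_for_prompt / 1e9:.0f} GFLOPs")   # ~64 billion operations

# At batch size 1, decode is usually bound by memory bandwidth, not math:
weight_bytes = params * bytes_per_weight
hbm_bytes_per_s = 1e12  # ~1 TB/s, an assumed round number for a data-center GPU
ceiling_tok_per_s = hbm_bytes_per_s / weight_bytes
print(f"bandwidth-bound ceiling: ~{ceiling_tok_per_s:.0f} tokens/s per request")
```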

GPUs have a bunch of, like if you look at the chip itself, this is the chip area. CPUs spend most of their space on like control flow logic. And then caches that hide how smart the CPU is being about like control flow and switching work. And then like relatively less is actually given over to the part that does like calculations, which is here in green.

GPUs, on the other hand, are just all calculation. And they have relatively simple control flow and like relatively less cache memory. And that-- because it doesn't need to hold 100 programs at once or whatever. And so that means you can really rip through a workload like this one that has like relatively simple stuff, where most of what you want to do is just like zoom through doing simple math on a bunch of numbers.

So that's why GPUs are designed for this, because it works well for graphics, which also looks like ripping through a bunch of math. Basically the same math on a bunch of different inputs, this graphics workload. But they've like tilted now even further in the direction of being specialized for running language models and big neural networks.

The TLDR here is like the GPU is 100 duck-sized horses, a bunch of tiny cores doing like very simple stuff. And that wins out over the one horse-sized duck that is the CPU that you're used to programming and working with. There's like one other piece here, which is like if you're looking at a top-tier GPU, one of the things that makes the top-tier ones really good, like an H100, is that they have soldered the RAM onto the chip, which is not something you normally do.

But it gives you much faster communication, lower latency, higher throughput, which is really important. The memory is still slower than the math, which is really important if you start to think about optimizing these things. But we don't have to go that deep. So the TLDR here is that it's like NVIDIA-inferenced GPUs from one or two generations back are what you probably want to run with.

The primary constrained resource is how much space there is in this memory to hold all those weights. Well, it's the weights, and then later you're going to start adding things like past sequences you've run on in a cache. And then there's never enough RAM. And so when you're looking at which GPUs to buy yourself or which ones to rent from the cloud, look for the ones with more VRAM.

And then this is a primary reason to want to make your model weights smaller, to go from high-precision floating-point numbers to low-precision floating-point numbers, or even more exotic things, because they save space in that memory. And they make it easier to move the things in and out of memory and into where the compute happens.
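Rough numbers for why the VRAM fills up beyond just the weights. The config below assumes a Llama-3-8B-like model (32 layers, 8 KV heads, head dim 128) with an fp16 cache; treat it as illustrative, not exact:

```python
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(kv_bytes_per_token // 1024, "KiB of KV cache per token")          # ~128 KiB

seq_len, concurrent_requests = 8192, 16
cache_gb = kv_bytes_per_token * seq_len * concurrent_requests / 1e9
print(f"~{cache_gb:.0f} GB of VRAM for KV cache alone")   # ~17 GB on top of the weights
```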

So the thing you want is a recent but not bleeding-edge GPU unless you enjoy pain. So most recent GPUs from NVIDIA are the Blackwell architecture. That's the 5,000 series of GeForce GPUs, your local neighborhood GPU, and then the Blackwell B200s and similar data center GPUs. Generally, you're going to find that you don't get the full speedup that you'd like because people don't compile for that architecture always and yada, yada.

And then things are randomly broken. And then they're really hard to get a hold of and expensive. So the sweet spot is one generation behind whatever OpenAI and Meta are training on. So now that's Hopper GPUs. H200s were free on Amazon, at least on EC2, for a bit there a couple weeks ago.

And then Lovelace GPUs like the L40S that I ran my demo on, those are pretty nice. Lovelace is the more-- or sorry, the L40S is the more inference-oriented data center GPU. So data center GPU means like ones you're going to find in the public clouds. NVIDIA doesn't really let people put your friendly local GPU, the same one you can buy locally and put in your own machine.

They don't really let them run in the clouds unless NVIDIA is on the cap table. So that doesn't work for AWS and GCP. So that's a data center GPU. And then an inference data center GPU is one that's less focused on connecting a whole shitload of GPUs together, like 10,000 or 100,000, with a super fast custom network InfiniBand.

And instead, they're more focused on just having one reasonably sized effective individual GPU. So the L40S is getting pretty mature. So I might recommend those. For a while, the H100, which is really more of a training GPU, was kind of the better one, I think, yeah, just because the L40S was relatively immature.

If your model's small, if you're running a small model, like a ModernBERT or one of the 3 billion or 1 billion parameter models, you can get away with running it even a generation further back. And that's really nice, very stable. The Ampere A10 is a real workhorse GPU, easy to get ahold of.

You can transparently scale up to thousands of those on Modal when it comes time. So that's pretty nice. Just a quick note, since NVIDIA is in the news these days: why NVIDIA? AMD and Intel GPUs are still catching up on performance. So nominally, you look at the sticker on the side that says flops, and the AMD GPUs look good.

And Intel Gaudi looks pretty good. The software stack is way behind. There's a great post from Dylan Patel and others at SemiAnalysis just ripping on the AMD software stack. George Hotz has done the same thing. It's just pain. That's a bet-the-company move. It's like, we can maybe either write the software ourselves or spend so much money on AMD chips that AMD will fix this for us.

That's not really an "oh, I want to stand up a service" kind of thing; stick with the well-trodden paths. There are non-GPU alternatives. There are other accelerators that are designed, unlike CPUs, for super high throughput and high memory bandwidth. The TPU is the most mature one, the Tensor Processing Unit from Google.

Unfortunately, it's very from Google in that they only run in Google Cloud. And the software stack is pretty decent for them, actually, like Jax, which can be used as a back end for PyTorch. But like many things in Google, the internal software for it is way better than anything you'll ever use.

And you're second in line behind their internal engineers for any bug fixes. So caveat emptor there. The Groq and Cerebras accelerators are still a little bit too bleeding edge. At that point, you're kind of not running your own LLM inference anymore. You're having somebody else run it as a service for you on chips that they run.

It's kind of the way it works. It's unclear if they could do it-- what's the word I'm looking for-- cost-effectively as well. Those chips are very expensive to run. I would say any of the other accelerators you see aren't super worth considering. But in general, long term, I would expect this to change a lot.

NVIDIA has a very thick stack of water-cooled network cards that can do a little bit of math for you. That's crazy shit, and it's going to take a long time for anybody to catch up there. But inference is actually pretty easy to match their performance on. So I expect a lot of innovation in this space, and VCs are spending accordingly.

Last thing I'll say is the startup that I work on, Modal, it makes getting GPUs really easy. So a lot of it-- this is high-performance computing hardware. It's normally a huge pain to get. If you've run a Kubernetes cluster, you know that heterogeneous compute makes you cry. There's a reason they call it taints.

So Modal makes getting GPUs super easy, just like add Python decorators, get stuff to run on GPUs. This is real code that our CEO ran to test our H100 scaling, just like let me just run 100,000 times, time sleep one on an H100. And this is all the code that you need to run that.
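That snippet isn't reproduced here exactly, but a Modal function along those lines looks roughly like this (a reconstruction, not the code from the slide):

```python
import time
import modal

app = modal.App("h100-scaling-smoke-test")

@app.function(gpu="H100")
def sleep_one(i: int) -> int:
    # Each call occupies an H100 for one second; Modal fans calls out across containers.
    time.sleep(1)
    return i

@app.local_entrypoint()
def main():
    # Fan out 100,000 one-second sleeps; actual concurrency depends on your GPU quota.
    results = list(sleep_one.map(range(100_000)))
    print(f"ran {len(results)} calls")
```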

In our enterprise tier, this would scale up to 500 H100s or more, pretty transparently. So when you need it, we've got it. OK, so that's everything I want to say on hardware. Any questions about that stuff before I dive into talking about the zoo of models? No, I think we're pretty good.

I like the commentary on TPUs. Yeah. It'd be cool if they sold them. That would be great. I'd have one in my house. But yeah. So was that-- They're eating all the ones they can make. So it's almost like a competitive advantage. Make more, you know? How hard could it be to build a semiconductor foundry?

I thought, why do you have a money printer if you aren't going to use the money for good stuff? Anyway, I'm sure they have great reasons for this. But yeah. Oh, yes. Anyway, I won't go on any more tangents there. But DM me on Twitter if you want to talk more about this.

Yeah, and also, oh, yeah, I wrote a guide to using GPUs, modal.com/gpu-glossary, GPU hyphen glossary. So if you're interested in this stuff, check it out. It's kind of intended to give you the intuition for this hardware and a little bit of debugging on the software stack, because most people didn't encounter anything like this in their computer science education, their boot camp, or their working experience so far.

So yeah. All right. So I could talk for hours about that. But let's talk about model selection. So what is the actual model we're going to run? My one piece of advice that I'm contractually obligated to give: before you start thinking about, oh, what model am I going to run? How do I-- I want to do a good job on this task.

Make sure you've defined the task well and you have evals, an ability to evaluate: when you swap out a model for another one, is it better or not? You can start with vibe checks. You just run one prompt that you like that helps you get a good smell for a model.

But that's going to-- that works for a very short period of time. 10 inputs, 50 inputs, how long does it take you to write that out with ground truth answers? If it takes you an hour, put on your jams and do it. That's the length of Brat. Just listen to Brat and write out 10 or 50 evals.

Just because it's kind of like test-driven development, where everybody says write the tests and then write the software. But in this case, with test-driven development, one reason people don't do it is because they can mentally run tests really well. I know what a test-- I know all the different ways this code could misbehave.

I don't have to write it out as a test. And if you're good, that's correct. If you're bad at software, like me, then you need the test to help you. But in this case, nobody is good at predicting the behavior of these models. And so evals are really critical, being able to check is this actually improving things or not.
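A minimal version of that eval loop might look like this. It's a sketch: it assumes some OpenAI-compatible endpoint (vLLM, llama.cpp's server, or a hosted API), and the base URL, model name, and checks are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

EVALS = [
    ("Is guava a fruit? Answer yes or no.", lambda out: "yes" in out.lower()),
    ("What is 17 * 23?", lambda out: "391" in out),
    # ...the 10-50 cases you actually care about, with ground-truth checks.
]

score = 0
for prompt, check in EVALS:
    resp = client.chat.completions.create(
        model="my-candidate-model",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    score += check(resp.choices[0].message.content)

print(f"{score}/{len(EVALS)} passed")
```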

So do this, even just 10 things in a notebook. Don't go and buy an eval framework to do this. Just find a way to run models in the terminal in a notebook that helps you make these decisions like an engineer, not like a scientist like me. OK, so model options here are still, I would say, limited but growing.

I might drop the limited sometime soon, because it's starting to feel like we have options. Meta's Llama model series is pretty well-regarded and has the very strong backing of Meta. So if I'm an engineer thinking about which open source software am I going to build on, I actually think about that a lot more so than raw capabilities a lot of the time.

And the key thing here is there's a pretty big community building on Llama, making their software work really well with Llama, doing things with Llama that you would otherwise have to do yourself. So Neural Magic, a major contributor to an inference framework called vLLM, they quantize models for you. So they squish them down so they're a lot smaller.

Now you don't have to do that yourself. That's very nice. Nous Research does a lot of fine-tuning of models to remove their ChatGPT slop behavior. So it's nice to have that. And Arcee AI will mush together five different Llamas to make one Penta Llama that weirdly works better than any of the five inputs.

And then you don't have to do any of that yourself. Very nice. And then because it's backed by Meta, you can expect there will be continued investment in it. Meta's been great about open source in other places, like, I don't know, React. So maybe that's a bad one to pick because of the licensing thing.

But they learned their lesson. So you can build on Meta comfortably. The DeepSeek model series is on the rise. Not the first model series out of China to catch people's attention, the other one being Qwen. There's slightly less tooling and integration than the Llama model series. But an important thing to note is that it is released under the MIT license.

So the model weights are released under a normal open source license that the open source initiative would put their stamp on. But the Llama model is under a proprietary license that says, for example, if you're Amazon or Google, you can't use this. Not literally, but effectively. And a couple other things that make it less open, slightly less open, might make your lawyers nervous.

So maybe DeepSeek will just push Llama to go MIT, inshallah that will happen with Llama 4. There are others to pay attention to. You might see a shitty model come out of a model training team, or sorry, you might see a non-state-of-the-art model come out of a model training team.

But that doesn't mean that the team is bad. It's just that it takes a long time to get really good. So ones to watch are the Allen Institute's been putting out some good models with the Olmo series and the Molmo model. Microsoft's been doing their small language models with Phi.

Mistral has been quiet for a bit, but they keep putting out models. And Qwen. Maybe in the future, the enterprise cloud homies Snowflake and Databricks will put out really compelling models. Mostly, Arctic and DBRX are fun for research reasons rather than raw capabilities. But yeah, that's kind of a small number of options.

A little bit more like databases in the late '90s, early 2000s than databases today, where everybody and their mother has their own data fusion analytic database. But yeah, a little bit about quantization. So I've mentioned this a lot. So by default, floats are 32 or 64 bits, like integers are.

Neural networks do not need this. Digital computers that you're used to programming are very precise. They go back to this-- pardon me-- the Z2 by Konrad Zuse. He made this basically a clock that was a computer. Physical plates were being pushed around. And I think this is an AND gate or an XOR gate.

So it only moves if one of the two plates on one side moves forward. So it's very physical clockwork. That's the lineage of digital computers. At the same time, in the '40s, people were working on analog computers. So on the right is a numerical integrator. That's on the other side of World War II.

I think this is artillery trajectory calculations. You see there's a ball. And that ball rolls around. And you would calculate the speed that the ball is rolling around by changing the gears. Neural networks are way more like that. They're more like-- they're imprecise because they are the raw physical world without the intervention of a clock system to abstract it away and make it all ones and zeros and specific time steps.

Neural networks are way more like these analog computers. And so how precise do you need to be when you're measuring a number that's coming out of an analog system? It's never going to be exactly the same with an analog system anyway. So why not decrease the precision? Whereas you change one bit in a digital computer, and it's like throwing a stick into a clock.

The whole thing explodes and stops running. So this is the reason why you can aggressively quantize neural networks in a way that you can't do with lossily compressing, I don't know, Postgres. If you quantized every byte in Postgres down to 4 bits, you would just get garbage. So this quantization is really key for performance.

The safe choice, you'll see, is 16 bits, FP16 or BF16, Brain Float 16. Weight quantization only, that means just make the model itself smaller; it makes it smaller in memory. And then that whole thing about moving it in and out of compute is easier because it's smaller. That's great. And then that doesn't actually quantize the math.

The actual math that happens still happens at 16-bit, 32-bit. Doing activation quantization requires more recent GPUs, sometimes requires special compilation flags. And not always does the operation that you want to speed up already have a kernel written for you by Tri Dao or some other wizard to make the GPU go at full speed.

So that's harder. It doesn't always work. vLLM has great docs on this. And there's some papers as well. Give me FP16 or give me death, question mark, is a good paper. Because the answer is you don't need death. Don't be dramatic. You can use the quants. Evals help you decide whether the quantization is hurting.

So I was running DeepSeek R1 in ternary, actually. So 1, 0, minus 1 in that demo. That's extreme quantization. There's no way the full model performance or anything close to it is retained. You need evals to determine whether you've lost the thing that made you pick the model in the first place.

So make sure you have a way to check this. And benchmarks, don't trust benchmarks. People's benchmarks are wrong. They're different from your workload. You've got to run this stuff yourself. So curate your own internal benchmarks to help you scale up your own taste in models and intuition. I have more slides on fine tuning in a bit.

But people who want to run their own models often have this DIY hacker spirit. And they're like, why should I just use the weights everybody else is using? I want to fine tune these things. This is really hard. I'll talk more about why it's hard in a bit. But try to get as far as you can just with prompting and really control flow around models.

I don't know, DeepSeek R1 writes Python code. The Python code is wrong. Take the code, run it, take the error message, pipe it back in. So writing things around models, instead of fine tuning it to write better Python code, that's what all the model providers are doing. It's hard to compete with them on a lot of this stuff.

So managing prompts and managing control flow around models is way easier as a software engineer and has way better ROI per unit of effort. So definitely start with just prompting, retrieval, et cetera. Yeah, I want to make sure to talk about the inference frameworks and what serving inference looks like.

Running LLM inference economically requires a ton of thought and effort on optimization. This is not something you can sit down and write yourself, even if you're Codeforces top 1%. There's a lot to write. A fast matrix multiplication is-- yeah, the standards are very high. So the current core of the stack that's most popular is PyTorch and CUDA.

So PyTorch is a combo of a Python steering library and then a C++ internal library and libraries for doing all the hard shit, including CUDA C++, AKA C++ that runs on GPUs. That's where all the work gets done. Python is not usually the bottleneck. Don't get excited and rewrite that part in Rust.

You're going to find out that that didn't help you that much. There's some features that make it easier to write Torch and still get good performance. So Torch added a compiler a couple of years ago now in version 2. But compilers are young until they're 40. But it's very promising and can get you most of the speed up of writing a bunch of custom stuff.

But even besides writing custom GPU code, there's a bunch of things you need to build on top of raw matmuls, like the stuff that showed up in my napkin math diagram to serve inference fast. There's a bunch of caching. You don't want to roll your own cache. Rolling your own cache is a recipe for pain.

There's continuous batching, which is this smart stuff for rearranging requests while they're in flight. Speculative decoding is a way to improve your throughput and has a ton of gotchas. So you don't want to build all this just for yourself. This is a clear case for a framework, just like database management systems.

This is a don't-roll-your-own case rather than a don't-overcomplicate-shit-with-a-tool case, like the classic two genders in engineering. So I would strongly recommend the vLLM inference server on a number of grounds. So like Postgres, vLLM started as a Berkeley academic project. They introduced this thing called paged attention, paged KV caching, and then kind of ran with it from there.
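To make that concrete, vLLM's offline Python API looks roughly like this (the model name is a placeholder; for serving you'd more likely run its OpenAI-compatible server and hit it over HTTP):

```python
from vllm import LLM, SamplingParams

# Any Hugging Face model id or local path; this one is just an example.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Is guava a fruit?"], params)
print(outputs[0].outputs[0].text)
```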

There's performance numbers, and we can talk about them, but they're pretty prominent. People are gunning to beat them on workloads. And also, don't trust anybody's benchmarks. You have to run it to decide whether you agree. Anyway, that doesn't apply just for models. It also applies for performance. They really won mindshare as the inference server, and so they've attracted a ton of external contributions.

So now, Neural Magic was a startup, got acquired by Red Hat, a.k.a. IBM, basically exclusively to support their work on vLLM. And so they got tons of contributions from Anyscale, IBM, a bunch of people contributing stuff. And that's really important for open source success. Open source software succeeds when it creates this locus for cooperation between otherwise competing private organizations, whether they're nonprofit or for profit or whatever.

And vLLM has done that. So it's kind of hard to dislodge a project like that once it's held that crown for a while. It's not undislodgable yet, so it's not quite like Postgres, where you can be like, just use Postgres, and feel pretty safe because that's been around for 30 years; this is more like 30 months or less.

But yeah, also pretty easy to use, like pip installable once you have your GPU drivers. They make an OpenAI-compatible API layer, which NVIDIA has refused to do with TensorRT-LLM and Triton. So it's got a bunch of nice features and good performance. The main alternative, I would suggest, is NVIDIA's offering, the ONNX, TensorRT, TensorRT-LLM, Triton kind of stack.

There's this NVIDIA stack. Legally, it's open source, because you can read the source code. And it's under, I forget, either the Apache or MIT license. But if you look at the source code history, you'll see that it updates in the form of one 10,000-line commit with 5,000 deletions every week or two that says fixes.

So pretty hard to maintain a fork. Pretty hard to-- you don't get input on the roadmap. vLLM, on the other hand, classic. True open governance and open source. You can actually participate. Show up to the biweekly meetings. It's fun. Yeah, good performance, but maybe not top. What's up, swyx?

>>SGLang? >>Yeah, SGLang, there's some cool stuff. They have this nice interface for prompt programming that's kind of cool. And sometimes they beat vLLM on performance. But yeah, with open source projects, you win when you can draw the most contribution. So I feel like even if SGLang is winning over vLLM in certain places currently, I doubt that that will persist.

But we'll see. SGLang is another good one to look at. >>Yeah, OK. My impression was that they're both from Berkeley, and I thought basically SGLang is kind of the new generation of-- it's an anointed successor. >>Yeah, we'll see. We'll see. I don't think they've attracted the same degree of external contribution, which is important.

>>They try to do it. OK, cool. >>Yeah. But yeah, good call out. That part of the slide's a little bit older, so I should maybe bump SGLang up to its own part. If you're going to be running your own inference, this is a high-performance computing workload. It's an expensive workload.

Performance matters. Engineering effort can do 100x speedups and can take you from hundreds of dollars a megatoken to dollars or tens of dollars a megatoken. So you will need to debug performance and optimize it. And the only tool for doing that is profiling. So you're going to want to-- even if you aren't writing your own stuff, like if you're just using vLLM, if you want to figure out what all these flags do and which ones you should use on your workload, you're going to want to profile stuff.

There's built-in profiler support in vLLM to try and make it easy. So PyTorch has a tracer and profiler. That's kind of what vLLM integrates with. There's also NVIDIA Nsight, both for creating and viewing traces. That's their slightly more boomery corporate performance debugger. It's got a lot of nice features, though, can't lie.

But yeah, it's the same basic tracing and profiling stuff, except there's work on the CPU and on the GPU, so that makes it a little bit harder. I would also just generally recommend, if you're thinking about this a lot, running a tracer and just looking at the trace a couple of times for PyTorch, vLLM, whatever, just because you learn a ton from looking at a trace, a trace of an execution, all the function calls, all the stacks that resulted in your program running.
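A bare-bones way to collect such a trace with the PyTorch profiler (the model and input here are stand-ins for your real workload; open the resulting JSON in Perfetto or chrome://tracing):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        y = model(x)
    torch.cuda.synchronize()

prof.export_chrome_trace("trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```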

No better way to learn about a program. I prefer it to reading the source code. That's where I start, and then I go back to the source code to figure out what things are doing. It's way easier than trying to build up a mental model of a programming model and concurrency implications, et cetera, just from reading source code.

It's unnatural. Humans were meant to observe processes in evolution, not as programs. But yeah, so some recommendations for tools there. We also have some demos for how to run this stuff on Modal if you want to try that out. As a first pass for GPU optimization for, OK, is this making good use of the GPU?

Very first pass is this number, GPU utilization. What fraction of time is anything running on the GPU at all? So that catches-- I don't know. If you looked at my DeepSeek R1, you would see that this utilization number is really low, like 20%. That means the host is getting in the way a lot and stuff isn't running on the GPU a ton.

This is not like model maximum flops utilization or model flops utilization. This is not like what fraction of the number NVIDIA quoted you for flops that you're getting. This is way far away from that. This is just like-- this is a smoke check. Is the GPU running what fraction of the time?

You would like for this to be 100%. Like, this is-- yeah, that's an attainable goal, 95% to 99%. Unlike CPU utilization, that's not a problem. That's a goal. So GPU utilization here is like a first check. Problem is, just because work is running on a GPU doesn't mean progress is being made or that that work is efficient.

So the two other things to check are power utilization and temperature. Fundamentally, GPUs are limited by how much power they can draw to run their calculations and how much heat that generates that they need to get out of the system in order to keep running without melting. So you want to see power utilization 80% to 100%.

And you want to see GPU temperatures running high 60s Celsius for the data center GPUs, maybe low 70s, but pretty close to their thermal limit, maybe 5 to 10 degrees off of the temperature at which NVIDIA says, whoa, warranty's off. That means you're most likely making really good use of the GPU, whereas this GPU utilization, 100% that we have here on the left, is actually a deadlocked system.

It's like two GPUs are both expecting the other to send a message, like two polite people trying to go through a door. And so they're both executing something because they're both being like, waiting for that message, dog. But they aren't making any progress. And the system is hung. But it has 100% GPU utilization.

So you won't see that that often if you're running an inference framework. But it is something to watch out for and why, on Modal, I learned Rust in order to be able to add these to our dashboard. I think it's that important to show it to people, the power and the temperature.
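A quick way to eyeball those three numbers yourself, via NVML (the nvidia-ml-py / pynvml package), while your inference server is under load:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu      # % of time a kernel was running
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000      # milliwatts -> watts
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

print(f"util {util}% | power {power_w:.0f}/{limit_w:.0f} W | temp {temp_c} C")
pynvml.nvmlShutdown()
```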

Cool. All right. So I do want to talk about fine tuning since it was in the title, conscious of time. So I'm going to rip through this. And then if we have more time, we can dive deep via questions. Sound good, Sean, Noah? Thumbs up? All right. Yeah, that's great.

All right, yeah, fine tuning. So fine tuning means taking the weights of the model and using data to customize them, not via RAG, but by actually changing those numbers. So when does it make sense to do that and make your own custom model? If you can take the capabilities that an API has and distill them into a smaller model-- so train a smaller model to mimic the behavior of a big model, then you can-- frequently, you don't need all the things like GPT.

The big models know the name of every arrondissement in France and things about 15th century sculpt-- or esotericism that you probably don't need in a support chatbot. So a smaller model with less weights, less room to store knowledge, could probably serve your purposes. I think of this a bit like a Python to Rust rewrite.

You start off when you aren't sure what you need. You write in Python because it's easy to change, just like changing a prompt is easy, and switching between proprietary model providers is easy, upgrades are easy. But then once you really understand what you're doing, you rewrite it in Rust to get better performance.

And then that Rust rewrite is going to be more maintenance work and harder to update, yada, yada, but it's going to be 100x cheaper or something. And so both the good and the bad things about that kind of rewrite-- it's a very similar engineering decision in terms of technical debt, feature velocity, cost of engineers, all this stuff.

There's a nice product called OpenPipe that will help you steal capabilities as a service. So maybe check them out. If you want tighter control of style, like you want it to always respond in the voice of a pirate and never break kayfabe, fine tuning is pretty good at that.

Relatively small amounts of data can do that. It's pretty bad at adding knowledge. That's usually better to do search or retrieval, which is what people call RAG, like get the knowledge from somewhere and stuff it in the prompt. Prompts can get pretty big these days. So your search doesn't have to be as good as it needed to be a year and a half ago.

You can get vaguely the right information and put it in the prompt. The holy grail would be for you to define a reward function of what does it mean for this model to do well. Maybe that's customer retention, NPS, whatever. And then you could do ML directly on those rewards to optimize the model for that.

That's the holy grail. Then you could just sit back and monitor that RL system. And then you would magically make that reward number go up. Could be stock price. That would be nice. The problem is there's a large gap between the things you want to improve, and the things that you can actually measure, and the things that you can provide to a model, measure quickly enough, et cetera.

And also the rewards need to be unhackable. They need to be exactly what you want to maximize. When you do ML, ML is like paperclip maximization. It's like, you told me to make this number go up. I'm going to make this number go up. Imagine the brooms from "The Sorcerer's Apprentice." So if your rewards aren't something that's extremely logically correct, does this code compile?

And does it run faster? They're hackable. So there's this famous example from OpenAI where they trained a model to drive a boat in this boat racing game. And it was trying to maximize points. And what it learned was, actually, you don't want to win the race and do what the game is supposed to do, which is collect these little pips and finish a race.

If you want to score max, what you actually want to do is find this tiny little corner and slam against the wall repeatedly, picking up this bonus item that respawns, and just slamming against the wall over and over again and pick up the bonus item when it spawns. Very inhuman.

More like a speed runner playing a video game than a normal human. So imagine this, but with your customer support. Great way to get customers to give a 10 on an NPS is to hack their machine and say, your machine is locked down until you put a 10 on our NPS.

So be careful when using that approach. But that is the direction we're going. And as RL for things like reasoning models gets better and more mainstream, it's kind of the long-term direction we're going. But that's not where we are today. Where we are today is really more like stealing capabilities from public APIs and distilling them.

So that's the main reason: fine-tuning can save costs, can improve performance. Why shouldn't you do it? Fine-tuning is machine learning. Running inference is mostly normal software engineering with some fun spicy bits-- GPUs, floating point numbers. But machine learning is a whole different beast. Machine learning engineering has a lot in common with hardware and with scientific research.

And it's just fucking hard. You've got non-determinism of the normal variety. On top of that, there's epistemic uncertainty. We don't understand these models. We don't understand the optimization process. There's all the floating point nonsense, which is much worse in machine learning than elsewhere. You've got to maintain a bunch of data pipelines.

No one's favorite form of software engineering. This is a high-performance computing workload. Tera- or exaflop scale, if not more. Like, yeah, high-performance computing sucks. There's a reason why only the Department of Energy does it. And now a few people training models. There's a bunch of bad software out there.

Like, the software in ML is frankly bad. It's written by people like me, with a scientific background. Things are inferential-- you have to deal with statistical inference. Yeah, there's data involved. And now data is getting stored in a form that no one understands. Like, user data went in.

And somebody can maybe pull a "New York Times" article directly out of your model weights. This scares lawyers. And so that is tricky and probably is going to require some Supreme Court rulings and so on to really figure out. Yeah, and when Mercury is in retrograde, your GPUs run slower.

I'm sorry. That's just how it is. The point is, there's a lot of complexity that's very hard to get an engineering grip on. So if you can solve it in literally any other way, try that first. Be creative. Think of ways you can solve this problem without fine-tuning.

What information can you bring in? What program control flow can you put around a model? Like, distillation is the easiest ML problem because you're using an ML model to mimic an ML model. And you can write down the math for that. It's perfect. It's very easy. Like, there's a notion of a data-generating process.

In the real world, that's like the climate of the planet Earth. But in distillation, it's like an API call. Much easier. So if you have never fine-tuned before, definitely start with stealing capabilities from OpenAI, a.k.a. distillation, rather than anything else. To do this, you're going to need even more high-performance hardware.

I focused on running models at the beginning. Fine-tuning blows out your memory budget, even with these parameter-efficient methods that are out there. Like, kind of what happens during training is you run a program forwards, and then you flip it around and run it backwards. So that puts a lot of extra pressure on memory.

Then during training, you also want lots of examples so the model doesn't learn too much from any one specific example, and you want large batches to make better use of all that big compute, all those floating-point units. So that puts pressure on memory.

And then optimization just, in general, requires some extra tensors that are the size of or larger than the model parameters-- sorry, some extra arrays of floating-point numbers that are at least the size of the model parameters themselves. So you've got gradients and optimizer states, which basically means 2 to 10 extra copies of the model weights are going to be floating around.
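
To put rough numbers on that, here's the back-of-envelope arithmetic for full fine-tuning a 7B-parameter model with Adam, assuming one common mixed-precision setup-- the exact multiplier depends on your optimizer and precision choices:

```python
# Weights and gradients in bf16, Adam states and a master weight copy in fp32.
params = 7e9

weights_bf16   = params * 2  # 2 bytes per parameter
grads_bf16     = params * 2
adam_momentum  = params * 4  # fp32 first moment
adam_variance  = params * 4  # fp32 second moment
master_weights = params * 4  # fp32 copy for the optimizer update

total_gb = (weights_bf16 + grads_bf16 + adam_momentum
            + adam_variance + master_weights) / 1e9
print(f"~{total_gb:.0f} GB for weights, gradients, and optimizer state")
# ~112 GB, before counting activations, which grow with batch size and
# sequence length -- versus roughly 14 GB just to serve the same model in bf16.
```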

There's ways to shard it, but you can't get around the fact that a lot of this stuff just needs to be stored. So you're going to need eight 80-gigabyte GPUs, or 32 of them, connected in a network. And yeah, the software for that is pretty hard, or pretty rough.

I already talked about how hard machine learning is. There are software engineering practices that can at least prevent it from being made harder. I worked on experiment tracking software at Weights and Biases-- that said, I worked on it for a reason. When I was training models, the thing I wanted was to be able to store the voluminous quantities of data that come out of a run.

Tons of metrics, gradients, inputs, outputs, loss values. There's just a bunch of stuff that you want to keep track of on top of very fast-changing code and configuration. And so you want a place to store that. The software is hard to debug. You don't know where the bugs are.

So you want to store very raw information from which you can calculate the thing that reveals your bug. This is actually, I would say, like Honeycomb, their approach to observability is very similar. This is like observability for training runs. Observability is like recording enough about your system that you can debug it from your logs without having to SSH in.

Same thing with model training. So yeah: Weights and Biases has a hosted version, Neptune has a hosted version, and MLflow you can run yourself. Yeah. You-- Hm? TensorBoard? Yeah, so TensorBoard, you have to run yourself. There's no real hosted service for it. I think they shut down TensorBoard.dev. So even if you're willing to make it public, you can't even use TensorBoard.dev anymore.

Yeah, that's my saddest kill by Google, because it hits me personally-- or maybe my happiest, because I'm a shareholder in Weights and Biases. But yeah, TensorBoard is really good for a small number of experiments. It's bad at collaboration and bad at large numbers of experiments. Other experiment tracking solutions that have gotten more love, like the venture-backed ones or the open-source ones, are better for that.
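
Whichever tool you pick, the logging code has roughly the same shape. A minimal sketch with the wandb API-- the model, dataloader, train_step, and scheduler here are stand-ins for whatever your training loop already has:

```python
import wandb

run = wandb.init(project="finetune-experiments",      # project name is made up
                 config={"lr": 2e-5, "batch_size": 32})
wandb.watch(model, log="gradients", log_freq=100)      # periodically log gradient histograms

for step, batch in enumerate(dataloader):
    loss = train_step(model, batch)                    # your training step
    wandb.log({"train/loss": loss, "lr": scheduler.get_last_lr()[0]}, step=step)

run.finish()
```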

I would say a lot of software engineers come into the ML engineers' habitat and are pretty disgusted to discover the state of affairs. In general, as a software engineer entering this field, you will be disgusted. And you should push people to up their SWE standards.

But there's actually a lot of benefit to fast-moving code in ML engineering. It is researchy in that way. So you do want fast iteration. A lot of software engineering practices are oriented to a slower, less interactive cycle of iteration. So the detente that I've found works is: build internal libraries in normal code files, but then use them via Jupyter Notebooks so that you can poke and prod, run ad hoc workflows, et cetera.
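
Concretely, the split looks something like this-- the file and function names are just placeholders:

```python
# mylib/prompts.py -- the boring, importable, testable half.
def build_prompt(question: str, contexts: list[str]) -> str:
    joined = "\n\n".join(contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}"
    )

# Then, in a notebook cell, you poke and prod interactively:
#   from mylib.prompts import build_prompt
#   build_prompt("What's a KV cache?", ["The KV cache stores attention keys and values..."])
```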

And then as soon as something in a Jupyter Notebook starts to become regularly useful, pull that out into your utils.py at the very least, if not an internal library. So yeah, Noah mentioned at the beginning-- or I forget, maybe it was just me. Anyway, the Full Stack Deep Learning course I taught in 2022 still has the basics of how to do ML engineering.

The main thing that's changed is that we're talking about fine-tuning here. And back then, we were talking about training from scratch, because the foundation model era was only beginning. But the basic stuff in there, like the YouTube videos, the lecture-level stuff, is all still, I would say, pretty much solid gold.

And then the code's rotted a bit, but it's at least vibes-level helpful. OK, actually, the observability stuff is less interesting and relevant. The main point is the eventual goal with any ML feature is to build a virtuous cycle, a data flywheel, a data engine, something that allows you to capture user data, annotate it, collect it into evals, and improve the underlying system.

This is like-- if you're running your own LLM inference, one of the ways you're going to make this thing truly better than what you could get elsewhere is by building your own custom semi-self-improving system, or at least a continually improving system, based off of user data. There's some specialized tooling for collecting this stuff up, whether it's offline-style with something like Weights and Biases' Weave.

You can see Sean's recent conversation with Sean from Weights and Biases on how he used Weave, among other tools, to win at SWE-bench. >>Then Thomas came on Thursday and went over Weave. >>Oh, nice. OK, yeah, that's pure product on Weave, plus Sean-- oh, wait, in this class or somewhere else?

Oh, in this class, awesome. >>Yeah, Thomas came in on Thursday and did an hour and a half and change on Weave. >>Nice, yeah. So I would say Weave is really good for these offline evals, which is: collect up a data set and run code on it. The code and the data set co-evolve.
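
Concretely, an offline eval loop is not much more than this-- plain Python, no particular framework, with eval_cases.jsonl, model_fn, and grade() as stand-ins for your own dataset, model, and scorer:

```python
import json

def run_evals(model_fn, path: str = "eval_cases.jsonl") -> float:
    results = []
    with open(path) as f:
        for line in f:
            case = json.loads(line)                          # e.g. {"input": ..., "expected": ...}
            output = model_fn(case["input"])
            results.append(grade(output, case["expected"]))  # hypothetical scorer returning True/False
    pass_rate = sum(results) / len(results)
    print(f"pass rate: {pass_rate:.1%}")
    return pass_rate
```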

And this is very much how an ML engineer approaches evaluation, coming originally from academic benchmarking, really. And then there's a different style of evals. I don't know if you're going to have anybody from LangChain or LlamaIndex or one of these other people who are also building this observability tooling.

There's this product engineer style, which is: just collect up information and then let anybody write to it. Anybody can come in and annotate a trace and be like, this one is wrong. LangSmith, the tool from LangChain, is very open-ended, as are a lot of the other observability tools-- or sorry, these more online-eval-oriented things.

It's about raw stuff from production. And it's about a living database of all the information you've learned about your users, your problem, the behavior of models. And so it's this very dynamic, active artifact, which has its place. I think the more you need input from people who are not you to evaluate models-- for example, it's producing medical traces and you are not a doctor, as opposed to producing code and you are a programmer-- the more helpful it is to be able to bring in more people.

And so there's utility to these more online-style things. You can also actually build this stuff yourself. One thing I will say is these people don't know that much more than you do about running these models and getting them to perform well.

And the workflows are not really settled for this yet. Experiment management, by contrast, has been pretty much figured out-- it's an older thing-- and so that tooling has good ideas baked into it and will teach you to be better. These eval tools are in the design partner phase, a.k.a.

the "provide free engineering and design work for somebody you're also paying for a service" phase. So maybe you have a good internal data engineering team that is good at, say, an OpenTelemetry integration and would love to set up a little ClickHouse instance or something, and the prospect of putting something like that together is exciting to you or to somebody on your team.
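
Very roughly, the plumbing can be as simple as emitting a span per model call and pointing the exporter at whatever store your data team likes. A minimal sketch with the OpenTelemetry Python SDK-- the console exporter stands in for an OTLP exporter feeding ClickHouse, and my_inference_client is hypothetical:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer; swap ConsoleSpanExporter for an OTLP exporter in practice.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-app")

def call_model(prompt: str) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.prompt", prompt)
        completion = my_inference_client(prompt)   # your inference call
        span.set_attribute("llm.completion", completion)
        return completion
```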

You can build your own with something like this. And then the front-end people can hack on the experience. Brian Bischoff at Hex is big on this, because Hex has both really incredible internal data engineering and they're a data notebook product. So they can actually dogfood their product to do the evaluation of their product.

So not everybody's in the situation to be able to do that, but it's like a bigger fraction than it is with some of the other stuff that we've talked about. More tilted in the build direction than the buy. OK, so that's everything. I'll do my quick pitch here. I mentioned at the beginning, if you want to run code on GPUs in the cloud, Modal is the infrastructure provider that I'm working on.

That-- I joined this company because I thought their shit was great. I was talking about how much I liked it on social media, and they were like, what if we paid you to do this? And I was like, no, I love this so much, please don't pay me to do it, because then people won't trust me when I tell them it's good.

But eventually I gave in. Now I work at Modal, and they pay me to say the same thing I was saying before, which is: Modal is great. You pay for only the hardware you use-- important when the hardware is so expensive. All the infrastructure is built from the ground up in Rust, by the way, designed for data-intensive workloads.

There's a great podcast with our co-founder and Sean that, completely separate from learning about Modal, will just gain you 10 IQ points-- or 10 levels in computer infrastructure-- from hearing the story, learning about the software that was built and how they sped it up. There's also a great Data Council talk on it.

It's just designed to run stuff fast. And then, unlike other serverless-GPU-in-the-narrow-sense providers, Modal has code sandboxes and web endpoints, and makes it easy to stand up a user interface around your stuff. So that's why I ended up going all in on Modal. It was like, wow, not only does this run my models, but I learned how to properly use FastAPI from Modal's integration with it.

And yeah, that's just the tip of the iceberg on the additional things that it provides. So that can be for running your fine-tuning jobs, if you've decided you want to distill models yourself. It can be just running the inference, to be able to scale up and down, and handle changing inference load, and make sure you're filling up all the GPUs that you're using.

And it can be for doing your evaluations, running these things online or offline, creating data to help you observe your system and make it better. So it's like full service, serverless cloud infrastructure that doesn't require a PhD in Kubernetes. Great. All right, that's all I got. Any questions? That was sick.

Thanks so much. We love Modal in this house. I was in the process of rewriting it, so everyone that got the-- and also everyone, Charles is the person that we talked to to get the Modal credits for the course. So everyone, a big, big thank you to Charles for that.

But the entire course, every single-- this cohort builds three projects, all of which are built off of FastAPI that lives in Modal. So we love Modal here. It's great. Yeah, if you ever run into any bugs, definitely slide into our Slack. There's a decent chance you'll get co-founder support if you slide into the Slack.

And yeah, hopefully you've been pointed to the examples page, modal.com/docs/examples. I slave away to ensure that those things run end-to-end. They're continuously monitored and run at random times during the day to ensure that. So they should run, and they should help you get started. They're designed to be something you can build production-grade services off of as much as possible.

And so yeah, if you want any help with those, slide into the Slack. Tag me. Feel free to tag me on stuff related to the course or otherwise. I love the examples. I should talk to you sometime about how you set all of that up, because I was very impressed.

I ran through the ComfyUI workflow a couple of days ago. And I was able to tweak a few things. I pulled down the code example, I got a few different things running, and I was like, holy shit, I just pulled down an example from the internet and just ran the command that it said to run.

And then it ran. And I was like, that never happens. There's always some other thing I have to do. I was very impressed. Yeah, part of it is that, as an infrastructure product, the thing that kills being able to run shared code is differences in infrastructure-- like, oh, well, that will only run if you set this LDFLAGS thing or have this installed.

It works on my machine. See, the thing about the Modal examples is they all work on my machine-- and my machine is Modal, which you can also run them on. So that does make it a lot easier. I think that's generally true for sharing things that run on Modal within your team: it makes that easier to do.

But then separately, yeah, the trick-- and this is actually an engineering trick that I'm surprised it took me this long to learn-- is that there's tests and there's monitoring. And there are a lot of things that are really hard to write as tests: they slow down your iteration speed.

Like, yeah, they require a bunch of disgusting mocking that breaks as often as the actual code does. Or you could monitor production and fix issues that arise there-- a.k.a. do both. So yeah, that's an important trick for the Modal examples, but also for all the things you would maybe run as part of running your own language model inference or running your own AI-powered app.

It's like, monitor the shit out of this thing. >>Awesome. Cool. Well, before we let Charles go, does anybody have any questions? I'm sure, given everyone's background here, everyone's brain feels very full with all of the hardware architecture and terminology that you just learned.

But just want to open it up for anyone. >>I think-- I'll kill time while people ask questions. But I think that it's always intimidating for people sort of running their own models and fine-tuning them. I'm just like, what's a really good first exercise that you could-- probably you have some tutorials on modal that you would recommend people just go through.

>>Yeah, running your own model. I would actually say, if you don't have a MacBook with an M2 or later and at least 32 gigabytes of RAM, go ahead and buy one of those. Get your company to buy it for you. That turns out to be a really incredible machine for running local inference.

It has to do with the memory bandwidth stuff that we talked about-- moving the bytes in and out really fast. That was actually the first thing I did, back when you had to torrent Llama weights: run it locally. And there's good tools out there for this, like Ollama.

You can also use the same thing you would run on a cloud server, like vLLM. That's probably the easiest way to get started with running some of your own inference. And the cost is amortized more effectively, because you can use the computer for other stuff too.
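
To give a sense of how little code that is, here's a minimal vLLM sketch-- the model name is just an example, and note that vLLM wants a GPU, so on a MacBook you'd more likely reach for Ollama or llama.cpp:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")   # example model; pick what fits your hardware
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain what a KV cache is in one paragraph."], params)
print(outputs[0].outputs[0].text)
```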

So that's actually probably my-- it's bad Modal marketing to say that. But I would say people like to be able to poke and prod. If you don't already know Modal-- I mean, I know Modal well enough that now it's not any harder for me to use Modal to try these things than to run them on my MacBook.

But that takes some time, and everybody already knows how to use a command line as part of becoming a software engineer. So yeah, that's my primary recommendation. For fine-tuning, I would say distilling a model is the easiest thing to do-- besides, I guess, our demo for fine-tuning, which I didn't have time to show, which is fine-tuning something on somebody's Slack messages so that it talks like them.

And that's easy and fun, the stakes are low, and it teaches you some things about the software and about fine-tuning problems. But to really understand what it means to fine-tune in pursuit of a specific objective, distillation of a large model is the thing to do. Yeah, totally. I did insert a little comment about what distillation means.

Because apparently a lot of people view training on the outputs of GPT-4 as distillation, but the purist would be like, you have to train on the logits. Oh, yeah-- the teacher-student methods. Real distillation is different. Yeah, yeah. So I guess maybe that's a reason to run your own models: you can get the raw output of the model, which is not tokens but a probability for every token.

And so that's a much richer signal for fine-tuning off of, and that's what people prefer. But I guess I was thinking of it in the looser sense that most people talk about today, which is just training to mimic the outputs of the model.
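
For reference, the purist, train-on-the-logits version boils down to a loss roughly like this-- a minimal PyTorch sketch of matching the teacher's softened per-token distribution, not anyone's particular recipe. The looser version is just supervised fine-tuning on the (prompt, completion) pairs you've collected:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both per-token distributions, then push the student toward the
    # teacher with KL divergence -- the richer "train on the logits" signal.
    student = F.log_softmax(student_logits / temperature, dim=-1)
    teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2
```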

Yeah, create a synthetic corpus of text. Yeah, yeah. And when you run a model in production, you're creating a synthetic corpus of text, you know? "Synthetic corpus of text" sounds somewhat intimidating-- I say that as somebody who's used a lot of intimidating-sounding jargon. But really, the simplest synthetic corpus of text is all the outputs that the API returned while it was running in prod.

That's a great thing to fine-tune on. I just linked to an example here where someone distilled from R1. And it was pretty effective. And it took 48 hours and a few H100s. And that was it, not that expensive. Nice. Yeah, yeah. So Modal will do fine-tuning jobs up to eight H100s and up to 24 hours.

We're working on features for bigger scale training, and both longer in time and larger in number. But yeah, I would say there's also a pretty strong argument for keeping your fine-tunes as small and fast as possible to be able to iterate more effectively and quickly. Because it's fun to run on 1,000 GPUs or whatever.

There's this frisson of making machines go brr. But then when you need to regularly execute that job to maintain a service that you've promised people you'll keep up, it starts to get painful-- because of reliability, cost, and the fact that it's ungodly slow. 48 hours is a long time to wait for a computer to do something, even if it is exaflops of operations.

So definitely, when starting out with fine-tuning, go for the smallest job you can. Got it. OK. All right, I've hogged the mic enough. Who has questions? Anyone? No? OK, great. Well, awesome, everybody. Thank you so much, Charles, for coming. We really appreciate it.