
Running and Finetuning Open Source LLMs — ft. Charles Frye, Modal



00:00:00.000 | - Yeah, sure, yeah thanks for inviting me, Noah and Sean.
00:00:04.400 | It's always a pleasure.
00:00:06.220 | And yeah, I guess my background is studied neural networks,
00:00:11.560 | sort of how to optimize them,
00:00:13.560 | how to prove they converge in grad school,
00:00:16.080 | then worked at, that was at Berkeley,
00:00:19.120 | joined Weights and Biases,
00:00:21.360 | the experiment management and MLOps startup,
00:00:24.920 | series A to series C, did education for them,
00:00:29.760 | then did a full stack deep learning online course
00:00:32.960 | about how to deploy models in the pre-foundation model
00:00:37.240 | or liminal foundation model era,
00:00:39.640 | and then now work for Modal, infrastructure company
00:00:43.720 | that helps people run data-intensive workloads
00:00:47.520 | like ML inference.
00:00:49.200 | So, yeah, so then, oh yeah,
00:00:54.080 | so FSDL fans in the chat, like Sean,
00:00:58.200 | yeah, we'd love, maybe we'll be able to do something
00:01:00.460 | under that banner again sometime.
00:01:02.560 | But yeah, so wanted to talk today
00:01:06.920 | about running and fine-tuning open-source language models.
00:01:10.400 | Why would you do it?
00:01:12.360 | The answer is: not always, and not always both of these things.
00:01:15.720 | And then like some things about how,
00:01:17.840 | some like high-level things.
00:01:19.240 | This course, my understanding is oriented
00:01:23.000 | at software engineers who wanna learn more
00:01:24.800 | about like running AI models
00:01:27.080 | and building systems around them.
00:01:29.480 | So that's kind of the background
00:01:30.960 | that I've assumed in a lot of these.
00:01:32.800 | And then, yeah, to actually kick us off,
00:01:37.160 | before we go through the slides,
00:01:38.560 | I'm actually gonna do a quick demo.
00:01:40.960 | This is something that I got set up just yesterday,
00:01:44.480 | but like since it is, you know, in the news,
00:01:48.840 | like quite literally,
00:01:53.440 | let's run a local model
00:01:55.080 | or rather run our own inference on a model.
00:01:58.320 | Let's run this DeepSeek R1 model
00:02:02.760 | that people keep talking about.
00:02:05.000 | So this is coming from Modal's examples repo.
00:02:09.400 | So you can try this code out yourself.
00:02:11.280 | All you need is like a single Python file
00:02:13.920 | and a modal account, and you're ready to go.
00:02:17.600 | I'm gonna kick it off here.
00:02:19.360 | Oops, you need a virtual environment with Python in it.
00:02:22.720 | That's one thing you need, I suppose.
00:02:24.360 | I forgot to mention that.
00:02:26.320 | Okay, so let's run this guy here in the terminal.
00:02:29.600 | It's my VS code.
00:02:30.440 | The code's up there.
00:02:31.400 | You know, as is often the case,
00:02:35.040 | the code is not supremely interesting.
00:02:37.000 | I'm pulling in, I'm running it with llama.cpp here.
00:02:42.000 | llama.cpp has very, very low precision quants.
00:02:46.880 | So there's a ternary quant of DeepSeek R1.
00:02:49.760 | That means all the values are either minus one, zero,
00:02:52.440 | or one in the weights.
00:02:54.680 | And that's enough to squeeze it down to fit
00:02:57.320 | on a single machine with multiple GPUs.
00:03:01.000 | In this case, four L40S GPUs.
00:03:03.640 | So that's why I'm running with llama.cpp here.
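(For reference, the shape of that single Python file is roughly the sketch below. The image setup, file path, and GGUF name are hypothetical stand-ins; the real, complete version lives in the modal-examples repo.)

    import subprocess
    import modal

    # Assumption: an image with llama.cpp compiled for CUDA and the ~130 GB
    # ternary GGUF already downloaded; the real example builds this out properly.
    image = modal.Image.debian_slim()  # placeholder image

    app = modal.App("deepseek-r1-llamacpp", image=image)

    @app.function(gpu="L40S:4", timeout=60 * 60)  # four L40S GPUs, generous timeout
    def generate(prompt: str) -> str:
        # Shell out to llama.cpp's CLI; the interesting flags are discussed below.
        result = subprocess.run(
            [
                "llama-cli",
                "-m", "/models/DeepSeek-R1-ternary.gguf",  # hypothetical path
                "--n-gpu-layers", "9999",  # offload every layer onto the GPUs
                "-p", prompt,
            ],
            capture_output=True,
            text=True,
        )
        return result.stdout

    @app.local_entrypoint()
    def main():
        print(generate.remote("Please write the game Flappy Bird in Python."))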
00:03:06.920 | So let's see, spinning up right now.
00:03:11.080 | Oh man, we're out of it.
00:03:12.280 | 4x L40S on Modal, so we might have to wait
00:03:14.800 | as many as 15 or 30 seconds for that to spin up.
00:03:19.800 | While we're waiting for that,
00:03:21.280 | let me show you just a little bit
00:03:23.640 | about what's going on here.
00:03:25.800 | We're running llama.cpp here.
00:03:28.680 | Running these things is an exercise in configuration.
00:03:35.840 | So if you've ever administered a database,
00:03:38.240 | you'll be familiar with this sort of thing.
00:03:40.960 | Or if you've run compilation for a serious large project,
00:03:44.880 | you got your mysterious flags with mysterious arguments
00:03:50.840 | that have meaningful impact on the performance.
00:03:54.320 | So controlling, in this case, the KV cache
00:03:56.600 | and setting the quantization precision of that,
00:03:59.600 | along with some other things for llama.cpp.
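(Concretely, these are the kinds of flags being set. The values below are illustrative, not the exact arguments from the demo; check `llama-cli --help` in your build for the precise names.)

    # A sketch of the kind of llama.cpp flags that matter here.
    LLAMA_CPP_FLAGS = [
        "--n-gpu-layers", "9999",   # offload all layers across the four GPUs
        "--ctx-size", "8192",       # context length to allocate KV cache for
        "--flash-attn",             # needed before the V half of the cache can be quantized
        "--cache-type-k", "q4_0",   # quantize the K half of the KV cache to 4-bit
        "--cache-type-v", "q4_0",   # ...and the V half
        "--threads", "12",          # CPU threads for whatever stays on the host
    ]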
00:04:02.640 | Okay, so we had about a minute queue for GPUs.
00:04:04.800 | That's actually, that's like a P95 probably
00:04:08.880 | for 4x L40S on Modal.
00:04:10.960 | So sometimes you roll the dice and you get a natural one.
00:04:14.520 | But, so it took us about 60 seconds maybe to spin this up
00:04:18.280 | and get a hold of four L40S GPUs.
00:04:21.280 | If this happens to you, DM me
00:04:23.480 | and I'll go smack our GPU cluster with a hammer
00:04:26.720 | and try and make it go faster.
00:04:29.920 | All right, so this is loading up,
00:04:31.960 | loading up all the stuff you need for llama.cpp to run
00:04:37.680 | DeepSeek R1. This is the model loader.
00:04:41.000 | Actually, it turns out it's about 100 something gigabytes
00:04:43.280 | once you've quantized it down this far.
00:04:45.440 | These are all different layers here.
00:04:47.960 | Nothing too interesting in the model
00:04:50.080 | architecture itself.
00:04:51.360 | It's really the data and the inference tech
00:04:53.920 | that DeepSeek built that's the interesting part.
00:04:57.800 | So skipping past all this extra stuff.
00:05:00.120 | So now we're at the point where we're like loading the model.
00:05:02.720 | So this one. Say you wanna run
00:05:04.840 | your own model, great, okay.
00:05:07.360 | If you wanna have a GPU to keep it on all the time,
00:05:10.040 | then you gotta ask, you know,
00:05:12.160 | why do we have four GPUs, why not just one?
00:05:14.920 | We gotta have 100 gigabytes of space
00:05:16.560 | to hold all the weights for this thing in.
00:05:18.960 | That's 100 something gigabytes of RAM.
00:05:21.640 | That, you know, problem with RAM is like,
00:05:23.880 | you can't share RAM and when you unplug it,
00:05:26.960 | the data goes out of it.
00:05:28.480 | So this is actually one of the major cost sources
00:05:30.860 | for running your own model.
00:05:32.960 | It's like, you gotta have, if you want,
00:05:35.360 | like, if you wanna avoid this latency
00:05:37.040 | that we're looking at here of, you know,
00:05:40.040 | what is it, about a minute, 90 seconds to spin up.
00:05:45.040 | That's separate from any Modal overhead.
00:05:49.520 | This is just the raw moving of bytes around, setting up RAM.
00:05:53.120 | If you wanna avoid that, you gotta have stuff hot in RAM
00:05:56.640 | and RAM is not free.
00:05:57.800 | That's, you either, you know,
00:06:01.000 | you gotta pay to keep something warm
00:06:03.860 | on a serverless provider like modal
00:06:05.240 | or you gotta have an instance running in the cloud.
00:06:08.320 | But that's all, that's been done.
00:06:10.200 | We're now doing prompt processing.
00:06:11.760 | This is the prompt. Unsloth, by the way,
00:06:14.560 | is the team that did the quantization here
00:06:16.920 | for DeepSeek R1 down to three bits.
00:06:19.640 | And they, their demo prompt
00:06:22.880 | is what I've just like copied directly here.
00:06:25.840 | It's, oh yeah, the prints mess up a little bit
00:06:29.320 | sometimes here.
00:06:32.520 | But there should be at the top,
00:06:34.520 | the beginning of the prompt is something like,
00:06:37.280 | please write the game Flappy Bird in Python.
00:06:39.980 | So that's the prompt along with some instructions.
00:06:44.800 | That prompt has gone into DeepSeek
00:06:46.840 | and is now being processed.
00:06:48.800 | Okay, prompt processing is done.
00:06:52.240 | And now the beloved think token has been emitted
00:06:56.880 | and the model has begun to think about it,
00:06:59.240 | what it wants to do.
00:07:01.120 | So this deployment is not super well optimized.
00:07:04.960 | There's a substantial amount of host overhead,
00:07:06.960 | which means the GPU is not actually working all the time,
00:07:09.600 | even as we're generating these tokens.
00:07:11.620 | That's probably either that llama.cpp needs like a PR
00:07:16.620 | or I missed some compiler flag or something.
00:07:19.800 | The CPU usage is also kind of low.
00:07:21.780 | So I'm suspicious that maybe I messed something up
00:07:23.800 | in the compile.
00:07:25.240 | So it's 10 tokens per second right now.
00:07:27.760 | There's line buffering.
00:07:28.780 | So you aren't seeing the tokens live.
00:07:30.480 | You see them once a line is emitted.
00:07:33.080 | But yeah, but runs about 10 tokens per second
00:07:37.120 | on these L40s GPUs and could probably be boosted up
00:07:41.960 | to about 50 tokens per second
00:07:44.800 | by removing some of this host overhead.
00:07:47.200 | And then from there probably optimize kernels
00:07:50.320 | for this architecture.
00:07:51.520 | Some other things would maybe double it again.
00:07:53.760 | Oh, finished thinking pretty quickly that time.
00:07:55.920 | Interesting thing with these models is like,
00:07:58.160 | they think for very variable amounts of time
00:08:01.480 | controlled by how hard they think the problem is.
00:08:04.720 | And so sometimes it finishes thinking pretty quickly
00:08:07.360 | like here.
00:08:08.180 | Sometimes it thinks for like 20 minutes.
00:08:10.740 | So, you know, go make a coffee.
00:08:12.920 | I don't know, go compile something else
00:08:15.520 | while it's writing your answer.
00:08:17.500 | And yeah, I think the quality of the output here
00:08:23.000 | is reasonably good.
00:08:24.440 | One thing the Unsloth people call out
00:08:27.920 | and I've noticed in a couple of generations
00:08:29.760 | is these super low bit quants
00:08:31.600 | sometimes throw in a random junk token.
00:08:35.120 | So this case here, I bet that dense there is like,
00:08:38.760 | that might not be defined.
00:08:40.400 | Yeah, it doesn't look like that's defined.
00:08:42.480 | Rough, that's probably supposed to be a zero.
00:08:44.520 | There's some inference config stuff
00:08:46.080 | that I haven't played with that can reduce that
00:08:48.120 | and improve the quality,
00:08:49.600 | but that comes with the quantization territory.
00:08:52.280 | Yeah, so there it goes.
00:08:54.800 | It thought about it for a while,
00:08:56.320 | did some backtracking,
00:08:57.960 | and then wrote some code for Flappy Bird.
00:09:01.000 | So there's running a model
00:09:03.400 | and some of the stuff I talked about along the way,
00:09:05.160 | some of the main concerns you're gonna have
00:09:06.960 | running your own model.
00:09:09.020 | Yeah, any questions before I dive into the slides?
00:09:13.000 | >> Really quickly, so we haven't gotten
00:09:15.880 | super into the internals.
00:09:17.360 | Could you go over what quantization is?
00:09:19.800 | Like why do you need to do that
00:09:21.360 | and what that process generally looks like?
00:09:24.080 | >> Yeah, sure, I'll talk more about this,
00:09:25.720 | but the basic answer is the model is trained
00:09:28.640 | as a bunch of floating-point numbers.
00:09:31.320 | For DeepSeek R1, they were eight-bit floating-point numbers.
00:09:34.320 | That's crazy small.
00:09:35.560 | They worked hard to reduce the overhead during training.
00:09:39.980 | More typical is 16-bit
00:09:41.480 | or even 32-bit floating-point numbers.
00:09:44.260 | Default in Python is often 64-bit floating-point numbers,
00:09:48.800 | but that's way too much for neural networks most of the time.
00:09:55.400 | They don't need that level of precision.
00:09:57.200 | So you can save a lot on memory,
00:09:59.200 | which saves a ton on inference speed,
00:10:01.920 | especially in these low user count settings
00:10:06.920 | and heavy decode settings,
00:10:09.640 | like a reasoning model that produces a ton of tokens
00:10:12.240 | in response to a single input.
00:10:14.080 | Yeah, that helps a lot to decrease
00:10:17.360 | the memory footprint of the weights,
00:10:19.520 | even if that's all you do with your quantization.
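(To put rough numbers on that, here's the napkin math for how the weight footprint shrinks with precision; parameter counts are rounded and the ternary figure is approximate.)

    # Rough weight-memory footprint at different precisions.
    def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
        return n_params * bits_per_weight / 8 / 1e9

    for name, n in [("8B model", 8e9), ("DeepSeek-R1, 671B", 671e9)]:
        for precision, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8/INT8", 8), ("ternary, ~1.6 bits", 1.6)]:
            print(f"{name} @ {precision}: ~{weight_gigabytes(n, bits):.1f} GB")

    # DeepSeek-R1 at ~1.6 bits per weight lands around 130 GB --
    # the "100 something gigabytes" that got loaded in the demo.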
00:10:24.680 | But yeah, so there we go.
00:10:27.000 | It actually finished.
00:10:28.320 | I noticed a couple more typos in there.
00:10:31.640 | That's probably, yeah,
00:10:32.760 | should tune the inference config on there.
00:10:34.400 | There's like a top-P, min-P sampling thing
00:10:37.480 | that you can do that I haven't dialed in yet.
00:10:40.140 | But yeah.
00:10:42.880 | Yeah, anybody, any other questions before we dive in?
00:10:53.640 | Yeah, feel free to interrupt me as we're going.
00:10:56.480 | Let's make this a little bit more
00:10:58.520 | on the interactive side, ideally.
00:11:00.440 | Got a lot of slides, a lot to cover,
00:11:01.960 | but what we cover depends on
00:11:06.400 | what people are most interested in.
00:11:08.160 | Okay, so I just ran an open-source language model,
00:11:14.320 | DeepSeek R1.
00:11:15.480 | Let's talk about just in general,
00:11:17.080 | what does that take?
00:11:18.080 | Define some of the things that I talked about there,
00:11:20.200 | like memory, bandwidth constraints,
00:11:22.760 | and quantization, and all this other stuff.
00:11:26.120 | We'll also talk a bit about fine-tuning models,
00:11:28.520 | customizing them to your use case
00:11:30.120 | by doing actual machine learning and training.
00:11:32.760 | Before doing that,
00:11:35.120 | I do want to talk about the why of this here,
00:11:37.480 | like just 'cause something is,
00:11:39.080 | even if something's quick, easy, and free,
00:11:41.800 | it doesn't mean it's a good idea.
00:11:43.360 | And running and fine-tuning your own models
00:11:45.340 | is none of those things.
00:11:47.640 | So you want to make sure you have a good idea
00:11:50.400 | why you want to do this.
00:11:52.440 | Why not just use a managed service behind an API?
00:11:56.540 | One of the primary reasons to do it
00:11:59.200 | is if you don't need Frontier capabilities,
00:12:01.440 | so if you don't need to run DeepSeek R1
00:12:04.140 | to get reasoning traces for your customer support chatbot,
00:12:07.680 | that just needs to ask them to turn the thing off
00:12:10.640 | and turn it back on again.
00:12:11.940 | That level of LLM inference,
00:12:14.600 | the software is pretty commodity.
00:12:16.760 | The hardware to run it's getting easier and cheaper.
00:12:20.000 | And so you can frequently run that relatively inexpensively.
00:12:25.000 | And so you don't need a proprietary model,
00:12:28.720 | and often the complexity of serving is lower.
00:12:32.220 | Just like a call-out on that DeepSeek R1 demo,
00:12:35.480 | there's probably an order of magnitude and a half
00:12:39.240 | of improvement that could be done to that.
00:12:41.520 | So a 30x improvement is probably low-hanging fruit,
00:12:44.320 | like a week of engineering.
00:12:45.780 | But right now, running that on Modal is $300 a megatoken.
00:12:51.720 | And just having DeepSeq run it for you is $3 a megatoken.
00:12:57.340 | So that's a pretty big difference,
00:13:02.340 | even assuming we can get a 30x cost reduction
00:13:07.600 | just by putting in more than a day's engineering
00:13:12.560 | to get it running well.
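(As a sanity check on that gap, the napkin math goes something like the sketch below; the per-GPU price is an assumed round number, not a quote.)

    # Why an unoptimized self-hosted R1 runs on the order of $300 per megatoken.
    gpus = 4
    usd_per_gpu_hour = 2.00       # assumed round-number price for an L40S
    tokens_per_second = 10        # what the demo was generating

    usd_per_hour = gpus * usd_per_gpu_hour                  # $8/hour
    tokens_per_hour = tokens_per_second * 3600              # 36,000 tokens/hour
    print(usd_per_hour / (tokens_per_hour / 1e6))           # ~$220 per megatoken

    # A 30x speedup at the same hardware cost brings that under $10 per megatoken,
    # which starts to get within range of the hosted API price.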
00:13:14.080 | So that's a reason people sometimes think
00:13:17.320 | running their own LLM inference makes sense
00:13:19.040 | is to save money.
00:13:20.120 | And that intuition, I think,
00:13:21.200 | comes from getting fleeced by cloud providers
00:13:24.680 | who will charge you an arm and a leg
00:13:27.080 | to just stand up commodity Redis on commodity hardware.
00:13:30.920 | But right now, that's not the case.
00:13:33.420 | So the main reason I think that people bring up
00:13:37.520 | is to manage security and improve data governance.
00:13:40.120 | You want to make sure to run this thing yourself.
00:13:44.000 | The more control you want,
00:13:45.340 | the more complex this problem is going to be,
00:13:47.960 | until eventually it ends up with getting your own GPUs
00:13:52.960 | and putting them in a cage,
00:13:54.880 | which is probably six months or a year of engineering work,
00:13:59.020 | and then a lot of ongoing maintenance.
00:14:02.400 | But at the very least, running it with a cloud provider,
00:14:04.700 | whether that's Modal or raw dogging AWS,
00:14:08.360 | can improve your security and data governance posture.
00:14:11.660 | Not everybody wants to send data
00:14:14.580 | to untrustworthy nation states
00:14:17.980 | like the United States or China.
00:14:19.820 | Then gaining control over inference
00:14:23.340 | is maybe the one that I would say is most important.
00:14:25.980 | It's like, and most general, it's like API providers,
00:14:31.060 | there's only so much they can do.
00:14:33.120 | If they're proprietary, they got to hide stuff from you,
00:14:35.420 | whether that's reasoning chains, in OpenAI's case,
00:14:38.180 | or like log probs, also in OpenAI's case,
00:14:41.180 | or just like the increased customization
00:14:43.500 | decreases their ability to amortize work,
00:14:46.140 | to spread it across multiple customers,
00:14:48.480 | which is the way that they get things cheaper
00:14:50.780 | than you can run it yourself, sort of economies of scale.
00:14:53.380 | And so the more flexible,
00:14:55.640 | the more different your deployment is,
00:14:57.460 | the harder it is for them to do that,
00:15:00.400 | to run this variety of workloads economically.
00:15:07.500 | I think over time, all of these things
00:15:10.980 | are going to lean more in the favor
00:15:13.300 | of running your own LLM inference.
00:15:15.060 | Like frontier capabilities will go off
00:15:16.980 | in the direction of artificial super intelligence
00:15:19.100 | or whatever, but the baseline capabilities
00:15:23.140 | that anybody can just download off of Hugging Face
00:15:25.100 | will just keep on getting better.
00:15:27.700 | So we just saw reasoning, o1-level capabilities.
00:15:31.020 | Six months ago, I told people,
00:15:33.380 | "You got to go to OpenAI for that.
00:15:34.760 | "Now you can run it yourself."
00:15:37.140 | But I think the most important one
00:15:38.340 | that's going to tilt in the direction
00:15:41.900 | of running your own inference as the field matures
00:15:44.180 | is gaining control over inference.
00:15:45.700 | Like things are just going to get way more flexible.
00:15:48.180 | People are going to discover all kinds of crazy things
00:15:49.980 | you can do with like hacking the internals,
00:15:52.580 | with log probabilities.
00:15:54.460 | People will rediscover what everybody was doing
00:15:56.340 | in 2022 and 2023, when people still had access
00:16:00.660 | to the models' internals,
00:16:03.220 | and discover that it makes their lives better.
00:16:05.780 | And you'll want to run your own inference
00:16:07.620 | for that, to control that.
00:16:09.980 | I see a question, Juliette.
00:16:12.480 | - Yeah, Charles.
00:16:13.500 | So before we carry on, and I'm not sure
00:16:15.340 | if you're going to speak more about this as we go forward,
00:16:17.420 | but could you speak a bit about
00:16:18.700 | how inference is currently working,
00:16:20.860 | just to make it a bit more concrete in my mind?
00:16:23.720 | - When you say how inference is currently working,
00:16:27.100 | do you mean like how people normally,
00:16:28.900 | the alternative to running your own?
00:16:30.700 | - Well, you're saying that, and I'm not familiar,
00:16:35.540 | I'm not so familiar with the word inference itself.
00:16:37.380 | Like, could you share a bit about how current models
00:16:39.620 | are using inference and like how it works today,
00:16:42.140 | so that then I understand how to better like tweak it
00:16:44.020 | and what it's like?
00:16:45.360 | - Got it.
00:16:46.200 | Yeah, sure.
00:16:47.020 | Sorry, that's a bit of jargon.
00:16:48.780 | Inference just means running the model, right?
00:16:51.140 | Like putting something into the model
00:16:52.420 | and something coming out of it.
00:16:54.500 | Goes back to the like probabilistic backing of these models.
00:16:57.460 | Like you do it, you're like predicting
00:17:00.020 | what the future tokens are going to be.
00:17:02.140 | And that's like inference, like logical inference.
00:17:05.620 | But yeah, that's where the term comes from.
00:17:08.980 | But yeah.
00:17:11.140 | Cool.
00:17:13.420 | So yeah, so it's like, this is like replacing OpenAI's API
00:17:17.460 | or Anthropx API or OpenRouter
00:17:19.900 | with a service you host yourself,
00:17:21.580 | is what we're talking about here.
00:17:24.140 | Cool, yeah, definitely if I'm like,
00:17:28.300 | especially since, you know, I usually speak
00:17:30.380 | to more of an ML engineering audience.
00:17:32.700 | So like, if I just like forget that I haven't defined a term,
00:17:36.180 | please do interrupt me and ask me about it.
00:17:38.520 | Spent some time on this one already,
00:17:42.180 | so I won't go into more detail on this.
00:17:43.980 | But I would just say like, it's not that uncommon
00:17:46.880 | that proprietary software leads open software
00:17:49.380 | and raw capabilities like Oracle SQL
00:17:51.940 | and Microsoft SQL Server and like OSX and Windows
00:17:56.100 | have a bunch of things that like,
00:17:58.020 | beat their open source equivalents
00:18:00.420 | and have for a long time, like query optimizers in particular
00:18:03.580 | in the case of databases.
00:18:05.940 | So like, it's maybe not so surprising
00:18:07.620 | that that's the case in AI.
00:18:10.940 | But the like, the places in general,
00:18:14.100 | these things have co-existed in other domains.
00:18:17.340 | And then open software has been preferred in cases
00:18:20.300 | where it's more important to be able to hack on
00:18:22.260 | and integrate deeply with a technology.
00:18:25.940 | And so, you know, we're likely to see some stable mixture.
00:18:30.820 | And I, you know, I initially said this
00:18:35.140 | at one of SWIX's events, the AI Engineer Summit,
00:18:38.240 | year and a half ago now, and this has remained true.
00:18:43.940 | So that's at least 18 months of prediction stability,
00:18:46.580 | which is best you can maybe hope for these days.
00:18:49.960 | Yeah, so saving money.
00:18:53.300 | A lot of people want to run their own models
00:18:55.180 | to save money.
00:18:56.820 | Right now, inference is priced like a commodity.
00:18:59.300 | People find it relatively easy to change models.
00:19:01.980 | Little prompt tuning, keep a couple of prompts around,
00:19:05.060 | ask a language model to rewrite your prompts for you.
00:19:07.700 | Like, yeah, this among other factors has led
00:19:11.000 | to this LLM inference being priced like a commodity
00:19:14.300 | rather than like a service.
00:19:16.240 | And so it's actually like quite difficult
00:19:20.820 | to run it more cheaply yourself.
00:19:23.580 | And so there's a couple of things
00:19:24.980 | that might swing in your favor.
00:19:25.940 | If you have idle GPUs,
00:19:27.580 | like maybe you have an ML team internally,
00:19:29.820 | and they like, when they're not doing training runs,
00:19:32.100 | they have GPUs just sitting there.
00:19:33.820 | You might just mine cryptocurrency with them instead,
00:19:37.420 | you know, like faster time to ROI.
00:19:41.220 | But like, you know, that at least if you have them,
00:19:44.160 | that like, you're just paying electricity.
00:19:47.060 | So that makes it a little bit easier.
00:19:49.560 | But electricity costs are actually quite high
00:19:51.540 | for these things, you know, kilowatt per accelerator
00:19:55.460 | for the big ones.
00:19:56.480 | The, like, taking a really big generic model,
00:20:01.180 | one of these like foundation models,
00:20:02.860 | like OpenAI's O1 model or Claude,
00:20:06.300 | and distilling it for just the problems that you care about
00:20:10.220 | into something like smaller and easier to run,
00:20:13.300 | that's a way that you can like save money.
00:20:15.720 | And we'll talk a bit about that if we get to fine tuning,
00:20:19.340 | if we spend time on that in fine tuning.
00:20:22.020 | But, you know, that can help a lot.
00:20:24.740 | If your traffic is super high and dependable,
00:20:29.580 | and you can just like allocate some GPUs to it,
00:20:32.020 | and like, you know, run it, you know,
00:20:33.660 | just get a block of EC2 instances with GPUs on them,
00:20:37.660 | hold them there, send traffic to it.
00:20:40.540 | It's flat, you're utilizing all the GPUs all the time.
00:20:43.760 | You could probably start to like compete
00:20:45.580 | with the model providers there on price.
00:20:49.540 | And then finally, it's like, if it's like once a week,
00:20:52.160 | you need to like process like every support conversation
00:20:56.020 | that you had and add annotations to it
00:20:58.500 | and generate a report.
00:21:00.020 | So it's like once a week,
00:21:01.300 | you need like 10 mega tokens per minute throughput.
00:21:06.300 | And then like rest of the time you don't,
00:21:10.060 | then like the proprietary model providers
00:21:12.300 | are gonna push you onto their enterprise tier
00:21:14.220 | for those big rate limits.
00:21:16.700 | So that's gonna push up
00:21:19.620 | the cost of using a provider.
00:21:23.060 | But then it's also easier to run super like big batches.
00:21:28.980 | Like it's actually kind of like easier
00:21:31.820 | to run these things economically at scale
00:21:34.900 | than it is at small scale.
00:21:35.900 | Somewhat counterintuitively maybe for a software engineer
00:21:38.260 | who's used to running like databases and web servers.
00:21:41.540 | Just like the nature of GPUs is that it's easier to use them
00:21:44.500 | the more work you have for them.
00:21:46.380 | And so that makes, you know,
00:21:49.860 | these like batch and less latency sensitive workloads,
00:21:54.160 | like more amenable to running yourself
00:21:56.620 | if you can get ahold of serverless GPUs
00:21:58.700 | through a platform like Modal,
00:22:00.780 | Replicate, Google Cloud Run, something like that.
00:22:04.740 | Okay, so that's everything on like why you would do this,
00:22:10.260 | why you would run your own OpenAI API
00:22:13.420 | or Anthropic API replacement.
00:22:15.500 | Any questions before we move on?
00:22:18.720 | I saw the chat had some activity, maybe check that out.
00:22:22.220 | Anybody wanna speak up?
00:22:23.380 | - No, I think we're just sort of adding color
00:22:30.860 | to different stuff.
00:22:33.420 | - Got it, thanks for grabbing the chat.
00:22:35.900 | Okay, so let's start.
00:22:37.700 | Like I've already mentioned hardware and GPUs a lot,
00:22:40.300 | so let's talk about that a little bit more.
00:22:43.140 | Talk a little bit about like picking a model,
00:22:46.500 | then deep dive on like serving inference,
00:22:49.140 | a little bit on the tooling for it.
00:22:51.660 | Then like fine tuning,
00:22:54.560 | like how do you customize these models
00:22:56.460 | and then close out with thinking about observability
00:23:00.620 | and continual improvement.
00:23:02.060 | Okay, and yeah, link for the slides there.
00:23:07.380 | Of course, you'll be able to get it after the session.
00:23:10.380 | Okay, so picking hardware is pretty easy.
00:23:13.500 | Just use NVIDIA GPUs, don't have to go any further.
00:23:15.900 | No, let me go into a little bit more detail
00:23:18.140 | about why that's the case.
00:23:19.540 | So Juliet wanted like a little bit more color and detail
00:23:22.460 | on what does LLM inference mean.
00:23:24.660 | So what LLM inference means
00:23:26.420 | is you need to take the like parameters of the model,
00:23:28.660 | the weights, this like giant pile of floating point numbers.
00:23:32.700 | Those are gonna be sitting in some storage.
00:23:35.620 | You need to bring them into the place where compute happens.
00:23:39.100 | So like even if they're sitting in memory,
00:23:41.260 | like compute doesn't happen in memory.
00:23:42.580 | Compute happens like on chip inside of like registers.
00:23:46.100 | So you gotta move all of that in.
00:23:48.020 | And the fun fact is like you actually need
00:23:50.580 | like pretty much every single weight needs to go in.
00:23:53.700 | So like for most models,
00:23:55.740 | you can just look at how many gigabytes is that model file.
00:23:58.980 | And that tells you how many bytes
00:24:00.260 | they're gonna need to move in to get computed on.
00:24:04.180 | So like you're running an 8B model
00:24:06.500 | in one byte quantization, that's 8 billion weights.
00:24:10.580 | One byte per weight, that's eight gigabytes.
00:24:13.580 | So you need to move eight gigabytes
00:24:15.580 | out of wherever they're stored
00:24:17.780 | and into the place where compute happens.
00:24:20.220 | And then like that happens,
00:24:23.780 | like you're pushing tokens and activations
00:24:27.100 | through those weights to get out the next token.
00:24:30.340 | On your first iteration, you're sending in the whole prompt.
00:24:34.140 | And so you're sending in a whole prompt
00:24:35.780 | and generating an output token.
00:24:37.180 | So is guava a fruit?
00:24:40.540 | In the process of like pushing something through the weights,
00:24:43.300 | you can kind of rough estimation
00:24:45.540 | is that you want to do one,
00:24:49.060 | you wanna do two floating point operations per weight.
00:24:53.140 | So that's like, you want to multiply the weight
00:24:55.200 | with some number and then you're gonna add it
00:24:57.420 | to some other number.
00:24:59.180 | So that's two operations per weight.
00:25:01.580 | This is very napkin math.
00:25:03.100 | But again, nobody should have to write
00:25:06.420 | this very small number of wizards
00:25:08.140 | to write the actual code here.
00:25:10.260 | The core thing is being able to reason as an engineer
00:25:13.600 | about what the system's requirements are
00:25:16.540 | and how to, kind of like with a database,
00:25:20.260 | you don't have to be able to write a B-tree from scratch
00:25:22.460 | on a chalkboard unless you're interviewing at Google.
00:25:25.300 | But you should know how indices work
00:25:27.780 | so that you can like think about queries
00:25:29.980 | and structure tables in a smart way.
00:25:32.220 | And so similarly here,
00:25:34.420 | I'm just trying to give you the like intuition you need
00:25:36.620 | for understanding this workload.
00:25:39.100 | So for this, we have four tokens.
00:25:42.020 | We've got like one output.
00:25:46.300 | Yeah, we got four tokens coming in.
00:25:47.860 | We've got 8 billion parameters.
00:25:50.540 | So eight times two times four,
00:25:52.160 | that's 64 billion floating point operations.
00:25:55.860 | And then that gets us one token.
00:25:58.660 | Then we got to repeat this every time
00:26:00.380 | we want to generate another token.
00:26:01.860 | So we're going to move the weights.
00:26:03.380 | Like they have to, they go into where they get muted,
00:26:05.780 | then back out.
00:26:06.860 | Because we're talking about like registers and caches here.
00:26:09.620 | If you think of your like low level hardware stuff,
00:26:11.980 | registers and caches.
00:26:13.380 | So they can't hold the whole weight.
00:26:14.500 | So they got to go in and out the whole time.
00:26:16.260 | Again, if you're a database person,
00:26:17.860 | you should think of like running a sequential scan
00:26:20.200 | on your database over and over and over again
00:26:23.180 | on like a billion row database.
00:26:25.660 | So it's wild that we can even run it as fast as we do.
00:26:29.320 | But this is the workload.
00:26:32.020 | The hard part about it is the scale.
00:26:35.140 | The easy part about it is that this
00:26:36.780 | is like relatively simple control flow at the core.
00:26:41.180 | So that makes it amenable to acceleration with GPUs.
00:26:45.340 | GPUs have a bunch of, like if you look at the chip itself,
00:26:49.300 | this is the chip area.
00:26:50.860 | CPUs spend most of their space on like control flow logic.
00:26:55.620 | And then caches that hide how smart the CPU is being
00:26:59.140 | about like control flow and switching work.
00:27:01.700 | And then like relatively less is actually
00:27:04.140 | given over to the part that does like calculations, which
00:27:06.900 | is here in green.
00:27:08.780 | GPUs, on the other hand, are just all calculation.
00:27:12.260 | And they have relatively simple control flow
00:27:15.500 | and like relatively less cache memory.
00:27:20.500 | And that-- because it doesn't need to hold 100 programs
00:27:23.940 | at once or whatever.
00:27:25.780 | And so that means you can really rip through a workload
00:27:31.900 | like this one that has like relatively simple stuff, where
00:27:35.020 | most of what you want to do is just
00:27:36.700 | like zoom through doing simple math on a bunch of numbers.
00:27:41.260 | So that's why GPUs are designed for this,
00:27:44.060 | because it works well for graphics, which also looks
00:27:46.220 | like ripping through a bunch of math.
00:27:49.380 | Basically the same math on a bunch of different inputs,
00:27:51.940 | this graphics workload.
00:27:53.820 | But they've like tilted now even further
00:27:55.620 | in the direction of being specialized
00:27:57.180 | for running language models and big neural networks.
00:28:06.540 | The TLDR here is like the GPU is 100 duck-sized horses,
00:28:10.620 | a bunch of tiny cores doing like very simple stuff.
00:28:14.940 | And that wins out over the one horse-sized duck
00:28:18.940 | that is the CPU that you're used to programming and working
00:28:22.500 | with.
00:28:25.180 | There's like one other piece here,
00:28:26.620 | which is like if you're looking at a top-tier GPU,
00:28:28.780 | one of the things that makes the top-tier ones really good,
00:28:31.740 | like an H100, is that they have soldered the RAM
00:28:35.220 | onto the chip, which is not something you normally do.
00:28:39.980 | But it gives you much faster communication, lower latency,
00:28:43.980 | higher throughput, which is really important.
00:28:46.460 | The memory is still slower than the math,
00:28:48.140 | which is really important if you start
00:28:50.100 | to think about optimizing these things.
00:28:52.420 | But we don't have to go that deep.
00:28:55.740 | So the TLDR here is that it's like NVIDIA inference GPUs
00:29:01.060 | from one or two generations back are what you probably
00:29:03.380 | want to run with.
00:29:04.860 | The primary constrained resource is
00:29:06.740 | how much space there is in this memory
00:29:08.900 | to hold all those weights.
00:29:11.100 | Well, it's the weights.
00:29:12.140 | And then later you're going to start
00:29:13.640 | adding things like past sequences you've run on
00:29:16.380 | in a cache.
00:29:17.220 | And then there's never enough RAM.
00:29:19.900 | And so when you're looking at buying GPUs yourself
00:29:22.380 | for rent or which ones to rent from the cloud,
00:29:25.380 | look for the ones with more VRAM.
00:29:28.340 | And then this is a primary reason
00:29:31.940 | to want to make your model weights smaller,
00:29:33.940 | to go from high-precision floating-point numbers
00:29:36.260 | to low-precision floating-point numbers,
00:29:38.100 | or even more exotic things, because they
00:29:41.740 | save space in that memory.
00:29:43.420 | And they make it easier to move the things in and out
00:29:46.980 | of memory and into where the compute happens.
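(Besides the weights, the KV cache for those past sequences is the other big memory consumer. A rough sizing sketch, assuming a Llama-3-8B-like shape: 32 layers, 8 KV heads, head dimension 128.)

    # Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes per element,
    # per token, per sequence. The shape below is an assumed Llama-3-8B-like config.
    def kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
                    bytes_per_elem=2, seq_len=8192, batch_size=1) -> float:
        per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
        return per_token_bytes * seq_len * batch_size / 1e9

    print(kv_cache_gb())               # ~1.1 GB for one 8k-token sequence at FP16
    print(kv_cache_gb(batch_size=32))  # ~34 GB: at scale the cache, not the weights, eats the VRAM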
00:29:49.620 | So the thing you want is a recent but not bleeding-edge
00:29:52.580 | GPU unless you enjoy pain.
00:29:55.460 | So most recent GPUs from NVIDIA are the Blackwell architecture.
00:29:58.620 | That's the 5,000 series of GeForce GPUs,
00:30:01.780 | your local neighborhood GPU, and then the Blackwell B200s
00:30:08.660 | and similar data center GPUs.
00:30:11.940 | Generally, you're going to find that you
00:30:13.860 | don't get the full speedup that you'd like because people
00:30:16.420 | don't compile for that architecture always
00:30:19.420 | and yada, yada.
00:30:20.780 | And then things are randomly broken.
00:30:23.260 | And then they're really hard to get a hold of and expensive.
00:30:26.100 | So the sweet spot is one generation
00:30:28.300 | behind whatever OpenAI and Meta are training on.
00:30:30.980 | So now that's Hopper GPUs.
00:30:32.980 | H200s were free on Amazon, at least on EC2,
00:30:37.820 | for a bit there a couple weeks ago.
00:30:40.140 | And then loveless GPUs like the L40s that I ran my demo on,
00:30:45.540 | those are pretty nice.
00:30:47.220 | Loveless is the more-- or sorry, L40s
00:30:50.740 | is the more inference-oriented data center GPU.
00:30:53.540 | So data center GPU means like ones
00:30:55.780 | you're going to find in the public clouds.
00:30:58.180 | NVIDIA doesn't really let people put your friendly local GPU,
00:31:02.460 | the same one you can buy locally and put in your own machine.
00:31:05.540 | They don't really let them run in the clouds
00:31:07.380 | unless NVIDIA is on the cap table.
00:31:09.940 | So that doesn't work for AWS and GCP.
00:31:13.940 | So that's a data center GPU.
00:31:16.280 | And then an inference data center GPU
00:31:18.580 | is one that's less focused on connecting a whole shitload
00:31:22.620 | of GPUs together, like 10,000 or 100,000,
00:31:25.500 | with a super fast custom network InfiniBand.
00:31:30.620 | And instead, they're more focused
00:31:38.700 | on just having one reasonably sized effective individual GPU.
00:31:43.860 | So the L40S is getting pretty mature.
00:31:46.420 | So I might recommend those.
00:31:47.860 | For a while, the H100, which is really more of a training GPU,
00:31:52.700 | was kind of the better one.
00:31:58.240 | I think, yeah, just because the L40S was relatively immature.
00:31:58.240 | If your model's small, if you're running a small model,
00:32:01.020 | like a modern BERT or one of the 3 billion or 1 billion models,
00:32:07.540 | you can get away with running it even a generation further back.
00:32:10.780 | And that's really nice, very stable.
00:32:12.740 | The Ampere A10 is a really real workhorse GPU,
00:32:16.900 | easy to get ahold of.
00:32:18.660 | You can transparently scale up to thousands of those on modal
00:32:22.740 | when it comes time.
00:32:24.820 | So that's pretty nice.
00:32:27.500 | Just a quick aside-- since NVIDIA is in the news these days,
00:32:33.860 | like why NVIDIA?
00:32:35.820 | AMD and Intel GPUs are still catching up on performance.
00:32:40.700 | So nominally, you look at the sticker
00:32:42.940 | on the side that says Flops, and the AMD GPUs look good.
00:32:46.540 | And Intel Gaudi looks pretty good.
00:32:49.420 | The software stack is way behind.
00:32:50.780 | There's a great post from Dylan Patel and others
00:32:54.860 | at SemiAnalysis, just ripping on the AMD software
00:32:57.820 | stack.
00:32:58.320 | George Hotz has done the same thing.
00:33:00.020 | It's just pain.
00:33:01.780 | That's a bet the company move.
00:33:02.980 | It's like, we can maybe either write the software ourselves
00:33:06.020 | or spend so much money on AMD chips
00:33:08.180 | that AMD will fix this for us.
00:33:11.100 | That's not really like, oh, I want
00:33:12.780 | to stand up a service kind of thing,
00:33:16.220 | stick with the well-trodden paths.
00:33:18.020 | There's non-GPU alternatives.
00:33:19.500 | There are other accelerators that are designed,
00:33:21.700 | unlike CPUs, for super high throughput and low memory
00:33:25.220 | bandwidth.
00:33:27.260 | TPU is the most mature one, the Tensor Processing
00:33:29.660 | Unit from Google.
00:33:30.740 | Unfortunately, it's very from Google
00:33:32.340 | in that they only run in Google Cloud.
00:33:34.140 | And the software stack is pretty decent for them, actually,
00:33:36.720 | like Jax, which can be used as a back end for PyTorch.
00:33:42.220 | But like many things in Google, the internal software for it
00:33:46.540 | is way better than anything you'll ever use.
00:33:49.300 | And you're second in line behind their internal engineers
00:33:51.900 | for any bug fixes.
00:33:54.020 | So caveat emptor there.
00:33:57.140 | The Groq and Cerebras accelerators
00:33:59.060 | are still a little bit too bleeding edge.
00:34:02.180 | At that point, you're kind of not running your own LM
00:34:04.500 | inference anymore.
00:34:05.220 | You're having somebody else run it as a service for you
00:34:08.000 | on chips that they run.
00:34:09.520 | It's kind of the way it works.
00:34:10.920 | It's unclear if they could do it-- what's
00:34:15.360 | the word I'm looking for-- cost-effectively as well.
00:34:18.720 | Those chips are very expensive to run.
00:34:21.960 | I would say any of the other accelerators you see
00:34:24.080 | aren't super worth considering.
00:34:26.680 | But in general, long term, I would
00:34:28.320 | expect this to change a lot.
00:34:29.720 | NVIDIA has a very thick stack of water-cooled network cards
00:34:34.520 | that can do a little bit of math for you.
00:34:36.820 | That's crazy shit, and it's going
00:34:38.340 | to take a long time for anybody to catch up there.
00:34:40.460 | But inference is actually pretty easy
00:34:44.300 | to match their performance on.
00:34:46.620 | So I expect a lot of innovation in this space,
00:34:48.860 | and VCs are spending accordingly.
00:34:52.460 | Last thing I'll say is the startup that I work on,
00:34:56.140 | Modal, it makes getting GPUs really easy.
00:34:58.420 | So a lot of it-- this is high-performance computing
00:35:01.060 | hardware.
00:35:01.900 | It's normally a huge pain to get.
00:35:04.220 | If you've run a Kubernetes cluster,
00:35:06.180 | you know that heterogeneous compute makes you cry.
00:35:09.860 | There's a reason they call it taints.
00:35:12.240 | So Modal makes getting GPUs super easy,
00:35:14.580 | just like add Python decorators, get stuff to run on GPUs.
00:35:18.020 | This is real code that our CEO ran to test our H100 scaling,
00:35:24.500 | just like let me just run 100,000 times,
00:35:27.660 | time [AUDIO OUT] sleep one on an H100.
00:35:30.860 | And this is all the code that you need to run that.
00:35:34.500 | In our enterprise tier, this would scale up to 500 H100s
00:35:38.660 | or more, pretty transparently.
00:35:43.860 | So when you need it, we've got it.
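(That scaling test is roughly the shape below; this is a sketch from memory, not the exact script.)

    import time
    import modal

    app = modal.App("h100-scaling-test")

    @app.function(gpu="H100")
    def sleep_on_an_h100(i: int) -> int:
        time.sleep(1)   # hold an H100 for one second, 100,000 times over
        return i

    @app.local_entrypoint()
    def main():
        # .map fans the calls out across as many containers as your plan allows
        list(sleep_on_an_h100.map(range(100_000)))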
00:35:46.460 | OK, so that's everything I want to say on hardware.
00:35:50.420 | Any questions about that stuff before I
00:35:52.740 | dive into talking about the zoo of models?
00:36:00.460 | No, I think we're pretty good.
00:36:03.820 | I like the commentary on TPUs.
00:36:07.180 | Yeah.
00:36:08.340 | It'd be cool if they sold them.
00:36:10.340 | That would be great.
00:36:11.140 | I'd have one in my house.
00:36:12.380 | But yeah.
00:36:14.380 | So was that--
00:36:16.180 | They're eating all the ones they can make.
00:36:17.940 | So it's almost like a competitive advantage.
00:36:20.140 | Make more, you know?
00:36:22.180 | How hard could it be to build a semiconductor foundry?
00:36:25.740 | I thought, why do you have a money printer
00:36:28.460 | if you aren't going to use the money for good stuff?
00:36:30.300 | Anyway, I'm sure they have great reasons for this.
00:36:33.260 | But yeah.
00:36:34.740 | Oh, yes.
00:36:35.780 | Anyway, I won't go on any more tangents there.
00:36:38.420 | But DM me on Twitter if you want to talk more about this.
00:36:43.100 | Yeah, and also, oh, yeah, I wrote a guide
00:36:45.020 | to using GPUs, modal.com/gpuglossary,
00:36:49.540 | GPU hyphen glossary.
00:36:51.300 | So if you're interested in this stuff, check it out.
00:36:53.660 | It's kind of intended to give you
00:36:55.940 | the intuition for this hardware and a little bit of debugging
00:37:00.900 | on the software stack because most people didn't encounter
00:37:05.140 | anything like this in their computer science
00:37:07.380 | education, their boot camp, or their working experience so far.
00:37:11.860 | So yeah.
00:37:12.740 | All right.
00:37:13.420 | So I could talk for hours about that.
00:37:15.340 | But let's talk about model selection.
00:37:17.220 | So what is the actual model we're going to run?
00:37:21.420 | My one piece of advice that I'm contractually obligated to give:
00:37:25.740 | before you start thinking about, oh, what model am I going to run?
00:37:28.780 | How do I--
00:37:30.380 | I want to do a good job on this task.
00:37:34.260 | Make sure you've defined the task well
00:37:36.380 | and you have evals, an ability to evaluate whether,
00:37:40.960 | when you swap out a model for another one,
00:37:42.500 | it's better or not.
00:37:43.500 | You can start with vibe checks.
00:37:44.980 | You just run one prompt that you like
00:37:47.780 | that helps you get good smell for a model.
00:37:51.340 | But that's going to--
00:37:53.180 | that works for a very short period of time.
00:37:55.300 | 10 inputs, 50 inputs, how long does it
00:37:58.780 | take you to write that out with ground truth answers?
00:38:01.820 | If it takes you an hour, put on your jams and do it.
00:38:07.500 | That's the length of Brat.
00:38:09.820 | Just listen to Brat and write out 10 or 50 evals.
00:38:14.380 | Just because it's kind of like test-driven development,
00:38:16.900 | where everybody says write the tests and then
00:38:19.820 | write the software.
00:38:20.860 | But in this case, with test-driven development,
00:38:23.620 | one reason people don't do it is because they
00:38:25.220 | can mentally run tests really well.
00:38:27.620 | I know what a test--
00:38:29.540 | I know all the different ways this code could misbehave.
00:38:31.820 | I don't have to write it out as a test.
00:38:33.580 | And if you're good, that's correct.
00:38:35.460 | If you're bad at software, like me,
00:38:38.380 | then you need the test to help you.
00:38:39.940 | But in this case, nobody is good at predicting
00:38:41.780 | the behavior of these models.
00:38:43.660 | And so evals are really critical,
00:38:45.980 | being able to check is this actually improving things
00:38:49.900 | or not.
00:38:50.500 | So do this, even just 10 things in a notebook.
00:38:53.540 | Don't go and buy an eval framework to do this.
00:38:56.020 | Just find a way to run models in the terminal in a notebook
00:39:00.300 | that helps you make these decisions like an engineer,
00:39:05.460 | not like a scientist like me.
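(A sketch of what those 10 things in a notebook can look like; `ask_model` is a hypothetical stand-in for whatever client you're calling.)

    # A minimal eval harness: a handful of prompts with ground-truth checks.
    EVALS = [
        {"prompt": "Is guava a fruit? Answer yes or no.", "check": lambda out: "yes" in out.lower()},
        {"prompt": "What is 17 * 23?",                    "check": lambda out: "391" in out},
        # ... a few dozen more of these, drawn from your real traffic
    ]

    def run_evals(ask_model) -> float:
        passed = sum(1 for case in EVALS if case["check"](ask_model(case["prompt"])))
        return passed / len(EVALS)

    # Usage: compare candidates before swapping models or quantizations, e.g.
    # print(run_evals(ask_llama_8b), run_evals(ask_r1_ternary))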
00:39:08.700 | OK, so model options here are still, I would say,
00:39:11.780 | limited but growing.
00:39:13.180 | I might drop the limited sometime soon,
00:39:15.140 | because it's starting to feel like we have options.
00:39:17.780 | Meta's Llama model series is pretty well-regarded
00:39:20.980 | and has the very strong backing of Meta.
00:39:23.460 | So if I'm an engineer thinking about which open source
00:39:26.200 | software am I going to build on, I actually think about that
00:39:28.700 | a lot more so than raw capabilities a lot of the time.
00:39:34.060 | And the key thing here is there's a pretty big community
00:39:37.260 | building on Llama, making their software work really well
00:39:40.500 | with Llama, doing things with Llama
00:39:42.820 | that you would otherwise have to do yourself.
00:39:44.660 | So Neural Magic, major contributor
00:39:47.620 | to an inference framework called vLLM,
00:39:51.660 | they quantize models for you.
00:39:53.300 | So they squish them down so they're a lot smaller.
00:39:57.100 | Now you don't have to do that yourself.
00:39:58.740 | That's very nice.
00:40:00.020 | Nous Research does a lot of fine-tuning of models
00:40:03.540 | to remove their chat GPT slop behavior.
00:40:07.820 | So it's nice to have that.
00:40:09.020 | And Arcee AI will mush together five different Llamas
00:40:12.860 | to make one Penta Llama that weirdly works better
00:40:17.340 | than any of the five inputs.
00:40:19.380 | And then you don't have to do any of that yourself.
00:40:21.460 | Very nice.
00:40:22.180 | And then because it's backed by Meta,
00:40:23.760 | you can expect there will be continued investment in it.
00:40:26.300 | Meta's been great about open source in other places,
00:40:29.020 | like, I don't know, React.
00:40:31.100 | So maybe that's a bad one to pick
00:40:32.740 | because of the licensing thing.
00:40:34.060 | But they learned their lesson.
00:40:35.340 | So you can build on Meta comfortably.
00:40:39.220 | DeepSeek model series is on the rise.
00:40:41.820 | Not the first model series out of China
00:40:50.660 | to catch people's attention, the other one being Qwen.
00:40:50.660 | There's slightly less tooling and integration
00:40:53.060 | than the Llama model series.
00:40:54.500 | But an important thing to note is
00:40:55.880 | that it is released under the MIT license.
00:40:58.940 | So the model weights are released
00:41:00.380 | under a normal open source license
00:41:03.300 | that the open source initiative would put their stamp on.
00:41:07.220 | But the Llama model is under a proprietary license that
00:41:12.220 | says, for example, if you're Amazon or Google,
00:41:15.340 | you can't use this.
00:41:17.540 | Not literally, but effectively.
00:41:19.860 | And a couple other things that make it less open,
00:41:22.580 | slightly less open, might make your lawyers nervous.
00:41:26.260 | So maybe DeepSeek will just push Llama
00:41:29.340 | to go MIT, inshallah that will happen with Llama 4.
00:41:33.500 | There are others to pay attention to.
00:41:37.220 | You might see a shitty model come out of a model training
00:41:39.940 | team, or sorry, you might see a non-state-of-the-art model come
00:41:43.660 | out of a model training team.
00:41:45.060 | But that doesn't mean that the team is bad.
00:41:47.180 | It's just that it takes a long time to get really good.
00:41:51.220 | So ones to watch are the Allen Institute's
00:41:53.340 | been putting out some good models with the Olmo series
00:41:56.060 | and the Molmo model.
00:41:57.620 | Microsoft's been doing their small language models with Phi.
00:42:00.940 | Mistral has been quiet for a bit,
00:42:04.380 | but they keep putting out models.
00:42:05.820 | And Qwen.
00:42:06.700 | Maybe in the future, the enterprise cloud homies
00:42:10.140 | Snowflake and Databricks will put out
00:42:12.620 | really compelling models.
00:42:13.860 | Mostly, Arctic and DBRX are fun for research reasons
00:42:17.500 | rather than raw capabilities.
00:42:20.620 | But yeah, that's kind of a small number of options.
00:42:23.860 | A little bit more like databases in the late '90s, early 2000s
00:42:27.180 | than databases today, where everybody and their mother
00:42:31.340 | has their own data fusion analytic database.
00:42:39.460 | But yeah, a little bit about quantization.
00:42:43.940 | So I've mentioned this a lot.
00:42:45.900 | So by default, floats are 32 or 64 bits, like integers are.
00:42:50.860 | Neural networks do not need this.
00:42:52.860 | Digital computers that you're used to programming
00:42:54.900 | are very precise.
00:42:55.740 | They go back to this--
00:42:58.260 | pardon me-- the Z2 by Konrad Zuse.
00:43:02.980 | He made this basically a clock that was a computer.
00:43:06.300 | Physical plates were being pushed around.
00:43:08.060 | And I think this is an AND gate or an XOR gate.
00:43:12.540 | So it only moves if one of the two plates on one side
00:43:16.140 | moves forward.
00:43:17.500 | So it's very physical clockwork.
00:43:19.140 | That's the lineage of digital computers.
00:43:22.820 | At the same time, in the '40s, people
00:43:24.460 | were working on analog computers.
00:43:25.860 | So on the right is a numerical integrator.
00:43:28.460 | That's on the other side of World War II.
00:43:30.660 | I think this is artillery trajectory calculations.
00:43:33.460 | You see there's a ball.
00:43:34.460 | And that ball rolls around.
00:43:35.660 | And you would calculate the speed
00:43:37.020 | that the ball is rolling around by changing the gears.
00:43:39.460 | Neural networks are way more like that.
00:43:41.100 | They're more like-- they're imprecise
00:43:44.460 | because they are the raw physical world
00:43:47.220 | without the intervention of a clock system
00:43:50.220 | to abstract it away and make it all ones and zeros
00:43:53.200 | and specific time steps.
00:43:55.300 | Neural networks are way more like these analog computers.
00:43:58.380 | And so how precise do you need to be
00:44:00.860 | when you're measuring a number that's
00:44:03.180 | coming out of an analog system?
00:44:05.660 | It's never going to be exactly the same with an analog system
00:44:08.700 | anyway.
00:44:09.900 | So why not decrease the precision?
00:44:14.780 | Whereas you change one bit in a digital computer,
00:44:18.740 | and it's like throwing a stick into a clock.
00:44:24.220 | The whole thing explodes and stops running.
00:44:28.860 | So this is the reason why you can aggressively
00:44:31.140 | quantize neural networks in a way
00:44:34.020 | that you can't do with lossily compressing, I don't know,
00:44:37.620 | Postgres.
00:44:38.780 | If you quantized every byte in Postgres down to 4 bits,
00:44:43.420 | you would just get garbage.
00:44:46.140 | So this quantization is really key for performance.
00:44:50.700 | The safe choice, you'll see, is 16 bits.
00:44:54.820 | FP16 or BF Brain Float 16.
00:44:59.500 | Weight quantization only, that means just make
00:45:01.900 | the model itself smaller, makes it smaller in memory.
00:45:06.540 | And then that whole thing about moving it in and out of compute
00:45:09.980 | is easier because it's smaller.
00:45:12.100 | That's great.
00:45:12.720 | And then that doesn't actually quantize the math.
00:45:15.020 | The actual math that happens still happens at 16-bit, 32-bit.
00:45:20.700 | To do activation quantization requires more recent GPUs,
00:45:24.820 | sometimes requires special compilation flags.
00:45:27.380 | It's not always the case that the operation you want to speed up
00:45:30.020 | already has a kernel written for you
00:45:32.780 | by Tri Dao or some other wizard to make the GPU go at full speed.
00:45:37.660 | So that's harder.
00:45:38.540 | It doesn't always work.
00:45:40.100 | vLLM has great docs on this.
00:45:42.500 | And there's some papers as well.
00:45:44.540 | Give me FP16 or give me death, question mark, is a good paper.
00:45:49.180 | Because the answer is you don't need death.
00:45:51.820 | Don't be dramatic.
00:45:52.580 | You can use the quants.
00:45:54.620 | Evals help you decide whether the quantization is hurting.
00:45:57.820 | So I was running DeepSeek R1 in ternary, actually.
00:46:03.020 | So 1, 0, minus 1 in that demo.
00:46:05.740 | That's extreme quantization.
00:46:07.540 | There's no way the full model performance or anything
00:46:09.900 | close to it is retained.
00:46:11.900 | You need evals to determine whether you've
00:46:13.660 | lost the thing that made you pick the model in the first place.
00:46:16.660 | So make sure you have a way to check this.
00:46:20.460 | And benchmarks, don't trust benchmarks.
00:46:24.220 | People's benchmarks are wrong.
00:46:25.780 | They're different from your workload.
00:46:29.700 | You've got to run this stuff yourself.
00:46:31.700 | So curate your own internal benchmarks
00:46:34.100 | to help you scale up your own taste in models and intuition.
00:46:38.220 | I have more slides on fine tuning in a bit.
00:46:42.980 | But people who want to run their own models
00:46:46.220 | often have this DIY hacker spirit.
00:46:49.220 | And they're like, why should I just
00:46:50.720 | use the weights everybody else is using?
00:46:52.380 | I want to fine tune these things.
00:46:54.020 | This is really hard.
00:46:54.900 | I'll talk more about why it's hard in a bit.
00:46:56.740 | But try to get as far as you can just
00:46:58.460 | with prompting and really control flow around models.
00:47:02.500 | I don't know, DeepSeek R1 writes Python code.
00:47:05.500 | The Python code is wrong.
00:47:07.260 | Take the code, run it, take the error message, pipe it back in.
00:47:10.900 | So write things around models instead of fine-tuning them
00:47:15.300 | to write better Python code; fine-tuning for that is
00:47:17.060 | what all the model providers are already doing,
00:47:19.300 | and it's hard to compete with them on a lot of this stuff.
00:47:22.540 | So managing prompts and managing control flow around models
00:47:26.420 | is way easier as a software engineer
00:47:29.220 | and has way better ROI per unit of effort.
00:47:35.340 | So definitely start with just prompting, retrieval, et cetera.
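As a rough sketch of that run-it-and-pipe-the-error-back loop, assuming any OpenAI-compatible `client`; the model name, prompt, and retry budget are placeholders:

```python
# Sketch: control flow around a model instead of fine-tuning it to write better code.
# Generate a script, run it, and feed the error message back in on failure.
import subprocess
import tempfile

def solve_with_retries(client, task: str, max_attempts: int = 3) -> str:
    messages = [{"role": "user", "content": f"Write a Python script that {task}. Reply with code only."}]
    for _ in range(max_attempts):
        reply = client.chat.completions.create(model="my-model", messages=messages)
        code = reply.choices[0].message.content  # real code would strip markdown fences here
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            return code  # ran cleanly; good enough for this sketch
        # Pipe the error back in rather than retraining the model.
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"That failed with:\n{result.stderr}\nFix it. Code only."},
        ]
    raise RuntimeError("model did not produce working code")
```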
00:47:39.220 | Yeah, I want to make sure to talk about the inference
00:47:46.140 | frameworks and what serving inference looks like.
00:47:49.700 | Running LLMs inference economically
00:47:52.180 | requires a ton of thought and effort on optimization.
00:47:55.860 | This is not something you can sit down and write yourself,
00:48:00.860 | even if you're in the Codeforces top 1%.
00:48:05.460 | There's a lot to write.
00:48:06.660 | A fast matrix multiplication is-- yeah,
00:48:10.620 | the standards are very high.
00:48:12.900 | So the current core of the stack that's most popular
00:48:16.380 | is PyTorch and CUDA.
00:48:18.700 | So PyTorch is a combo of a Python steering library
00:48:24.180 | and then a C++ internal library and libraries
00:48:29.940 | for doing all the hard shit, including CUDA C++,
00:48:35.220 | AKA C++ that runs on GPUs.
00:48:38.860 | That's where all the work gets done.
00:48:40.460 | Python is not usually the bottleneck.
00:48:42.420 | Don't get excited and rewrite that part in Rust.
00:48:45.980 | You're going to find out that that didn't help you that much.
00:48:48.620 | There's some features that make it easier to write Torch
00:48:52.420 | and still get good performance.
00:48:53.740 | So Torch added a compiler a couple of years
00:48:56.700 | ago now in version 2.
00:48:59.220 | But compilers are young until they're 40.
00:49:02.500 | But it's very promising and can get you
00:49:05.140 | most of the speed up of writing a bunch of custom stuff.
00:49:08.100 | But even besides writing custom GPU code,
00:49:11.060 | there's a bunch of things you need
00:49:12.420 | to build on top of raw matmuls, like the stuff that showed up
00:49:15.740 | in my napkin math diagram to serve inference fast.
00:49:20.500 | There's a bunch of caching.
00:49:21.780 | You don't want to roll your own cache.
00:49:23.400 | Rolling your own cache is a recipe for pain.
00:49:26.620 | There's continuous batching, which is smart stuff
00:49:29.660 | for rearranging requests while they're in flight.
00:49:32.180 | Speculative decoding is a way to improve your throughput
00:49:36.700 | and has a ton of gotchas.
00:49:39.780 | So you don't want to build all this just for yourself.
00:49:42.460 | This is a clear case for a framework,
00:49:45.220 | just like database management systems.
00:49:47.500 | This is a don't-roll-your-own case rather than a
00:49:51.420 | don't-overcomplicate-shit-with-a-tool case,
00:49:54.260 | like the classic two genders of engineering.
00:49:59.380 | So I would strongly recommend the vLLM inference server
00:50:04.300 | on a number of grounds.
00:50:06.660 | So like Postgres, vLLM started as a Berkeley academic project.
00:50:11.380 | They introduced this thing called paged attention,
00:50:13.780 | paged KV caching, and then kind of ran with it from there.
00:50:19.740 | There's performance numbers, and we can talk about them,
00:50:22.140 | but they're pretty prominent.
00:50:25.780 | People are gunning to beat them on workloads.
00:50:27.780 | And also, don't trust anybody's benchmarks.
00:50:29.540 | You have to run it to decide whether you agree.
00:50:32.020 | Anyway, that doesn't apply just for models.
00:50:34.220 | It also applies for performance.
00:50:36.060 | They really won Mindshare as the inference server,
00:50:39.020 | and so they've attracted a ton of external contributions.
00:50:43.020 | So now, Neural Magic was a startup,
00:50:45.460 | got acquired by Red Hat, a.k.a.
00:50:47.260 | IBM, basically exclusively to support their work on vLLM.
00:50:53.140 | And so they got tons of contributions
00:50:56.620 | from Anyscale, IBM, a bunch of people contributing stuff.
00:51:01.060 | And that's really important for open source success.
00:51:03.500 | Open source software succeeds when
00:51:05.100 | it creates this locus for cooperation
00:51:07.740 | between otherwise competing private organizations,
00:51:11.820 | whether they're nonprofit or for profit or whatever.
00:51:14.700 | And vLLM has done that.
00:51:16.940 | So it's kind of hard to dislodge a project
00:51:19.700 | like that once it's held that crown for a while.
00:51:22.700 | It's not undislodgeable yet, so it's not quite like Postgres,
00:51:25.780 | where you can be like, just use Postgres,
00:51:27.820 | and feel pretty safe because that's been around for 30 years;
00:51:30.980 | this is more like 30 months, or less.
00:51:33.780 | But yeah, also pretty easy to use,
00:51:36.140 | like pip-installable once you have your GPU drivers.
00:51:38.980 | They make an OpenAI-compatible API layer,
00:51:41.660 | which NVIDIA has refused to do with TensorRT-LLM and Triton.
00:51:47.100 | So it's got a bunch of nice features and good performance.
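A rough sketch of what that OpenAI-compatible layer buys you; the model name and port are just examples:

```python
# Sketch: talk to a vLLM server with the standard OpenAI client.
# Start the server first, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```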
00:51:52.620 | The main alternative, I would suggest,
00:51:54.380 | is NVIDIA's offering: the ONNX, TensorRT, TensorRT-LLM,
00:52:00.060 | Triton kind of stack.
00:52:01.580 | There's this NVIDIA stack.
00:52:03.820 | Legally, it's open source, because you
00:52:05.380 | can read the source code.
00:52:06.340 | And it's under, I forget, either an Apache or MIT license.
00:52:09.780 | But if you look at the source code history,
00:52:12.540 | you'll see that it updates in the form of one 10,000 line
00:52:16.380 | commit with 5,000 deletions every week or two that
00:52:21.060 | says fixes.
00:52:23.380 | So pretty hard to maintain a fork.
00:52:26.380 | Pretty hard to-- you don't get input on the roadmap.
00:52:29.580 | vLLM, on the other hand, is classic.
00:52:31.460 | True open governance and open source.
00:52:34.540 | You can actually participate.
00:52:37.780 | Show up to the biweekly meetings.
00:52:39.620 | It's fun.
00:52:41.460 | Yeah, good performance, but maybe not top.
00:52:43.780 | What's up, swyx?
00:52:45.340 | >>SGLang?
00:52:47.340 | >>Yeah, SGLang, there's some cool stuff.
00:52:50.020 | They have this nice interface for prompt programming
00:52:53.020 | that's kind of cool.
00:52:55.060 | And sometimes they beat vLLM on performance.
00:52:58.820 | But yeah, with open source projects,
00:53:00.740 | you win when you can draw the most contribution.
00:53:03.340 | So I feel like even if SGLang is winning over vLLM
00:53:07.620 | in certain places currently, I doubt that that will persist.
00:53:10.540 | But we'll see.
00:53:11.140 | SGLang is another good one to look at.
00:53:13.700 | >>Yeah, OK.
00:53:14.740 | My impression was that they're both from Berkeley,
00:53:16.860 | and I thought basically SGLang is kind of the new generation
00:53:20.780 | of-- it's an anointed successor.
00:53:23.700 | >>Yeah, we'll see.
00:53:25.180 | We'll see.
00:53:26.060 | I don't think they've attracted the same degree
00:53:28.180 | of external contribution, which is important.
00:53:30.700 | >>They try to do it.
00:53:32.460 | OK, cool.
00:53:33.020 | >>Yeah.
00:53:33.540 | But yeah, good call out.
00:53:35.580 | That part of the slide's a little bit older,
00:53:37.780 | so I should maybe bump SGLang up to its own part.
00:53:42.700 | If you're going to be running your own inference,
00:53:45.260 | this is a high-performance computing workload.
00:53:47.140 | It's an expensive workload.
00:53:48.540 | Performance matters.
00:53:49.500 | Engineering effort can do 100x speedups
00:53:52.680 | and can take you from hundreds of dollars a megatoken
00:53:56.780 | to dollars or tens of dollars a megatoken.
00:53:59.740 | So you will need to debug performance and optimize it.
00:54:04.340 | And the only tool for doing that is profiling.
00:54:07.380 | So you're going to want to-- even
00:54:10.540 | if you aren't writing your own stuff,
00:54:12.340 | like if you're just using vLLM, if you
00:54:14.420 | want to figure out what all these flags do
00:54:17.100 | and which ones you should use on your workload,
00:54:19.300 | you're going to want to profile stuff.
00:54:20.880 | There's built-in profiler support in vLLM
00:54:23.780 | to try and make it easy.
00:54:25.660 | So PyTorch has a tracer and profiler.
00:54:28.620 | That's kind of what vLLM integrates with.
00:54:30.740 | There's also NVIDIA Nsight, both for creating and viewing
00:54:33.980 | traces.
00:54:35.260 | That's their slightly more boomery corporate performance
00:54:39.820 | debugger.
00:54:40.340 | It's got a lot of nice features, though, can't lie.
00:54:43.140 | But yeah, it's the same basic tracing and profiling stuff,
00:54:49.340 | except there's work on the CPU and on the GPU,
00:54:52.380 | so that makes it a little bit harder.
00:54:54.060 | I would also just generally recommend,
00:54:55.660 | if you're thinking about this a lot, running a tracer
00:55:00.100 | and just looking at the trace a couple of times for PyTorch,
00:55:04.260 | vLLM, whatever, just because you learn a ton from looking
00:55:08.100 | at a trace, a trace of an execution,
00:55:11.100 | all the function calls, all the stacks that
00:55:14.420 | resulted in your program running.
00:55:17.820 | No better way to learn about a program.
00:55:20.460 | I prefer it to reading the source code.
00:55:22.140 | That's where I start, and then I go back to the source code
00:55:23.900 | to figure out what things are doing.
00:55:25.780 | It's way easier than trying to build up
00:55:28.660 | a mental model of a programming model and concurrency
00:55:32.740 | implications, et cetera, just from reading source code.
00:55:35.420 | It's unnatural.
00:55:36.900 | Humans were meant to observe processes in evolution, not
00:55:41.980 | as programs.
00:55:42.860 | But yeah, so some recommendations for tools
00:55:46.820 | there.
00:55:47.320 | We also have some demos for how to run this stuff on Modal
00:55:50.540 | if you want to try that out.
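A minimal sketch of pulling a trace with the PyTorch profiler; the model and input here are stand-ins for whatever you're actually serving:

```python
# Sketch: capture a trace of CPU and GPU activity and dump it for Chrome tracing / Perfetto.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(x)
    torch.cuda.synchronize()

prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto and just look at it
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```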
00:55:53.620 | As a first pass for GPU optimization for, OK,
00:55:58.700 | is this making good use of the GPU?
00:56:01.260 | Very first pass is this number, GPU utilization.
00:56:05.620 | What fraction of time is anything
00:56:08.060 | running on the GPU at all?
00:56:10.460 | So that catches-- I don't know.
00:56:12.100 | If you looked at my DeepSeek R1, you
00:56:13.900 | would see that this utilization number is really low, like 20%.
00:56:17.220 | That means the host is getting in the way a lot
00:56:19.740 | and stuff isn't running on the GPU a ton.
00:56:22.620 | This is not like maximum flops utilization or model
00:56:26.100 | flops utilization, MFU.
00:56:26.920 | This is not like what fraction of the number
00:56:28.980 | NVIDIA quoted you for flops that you're getting.
00:56:31.420 | This is way far away from that.
00:56:32.700 | This is just like-- this is a smoke check.
00:56:35.220 | Is the GPU running what fraction of the time?
00:56:37.620 | You would like for this to be 100%.
00:56:39.980 | Like, this is-- yeah, that's an attainable goal, 95% to 99%.
00:56:47.380 | Unlike with CPU utilization, 100% here is not a problem.
00:56:49.660 | That's the goal.
00:56:51.420 | So GPU utilization here is like a first check.
00:56:54.020 | Problem is, just because work is running on a GPU
00:56:56.700 | doesn't mean progress is being made
00:56:59.300 | or that that work is efficient.
00:57:01.060 | So the two other things to check are power utilization
00:57:04.660 | and temperature.
00:57:05.700 | Fundamentally, GPUs are limited by how much power
00:57:09.720 | they can draw to run their calculations
00:57:12.020 | and how much heat that generates that they
00:57:14.740 | need to get out of the system in order
00:57:17.780 | to keep running without melting.
00:57:20.020 | So you want to see power utilization 80% to 100%.
00:57:26.620 | And you want to see GPU temperatures running high 60s
00:57:29.980 | Celsius for the data center GPUs, maybe low 70s,
00:57:35.340 | but pretty close to their thermal design limit,
00:57:37.500 | maybe 5 to 10 degrees off of the temperature at which NVIDIA says,
00:57:41.700 | whoa, warranty's off.
00:57:46.460 | That means you're most likely making
00:57:50.260 | really good use of the GPU, whereas this GPU utilization,
00:57:54.380 | 100% that we have here on the left,
00:57:56.700 | is actually a deadlocked system.
00:57:58.580 | It's like two GPUs are both expecting the other
00:58:01.140 | to send a message, like two polite people trying
00:58:03.860 | to go through a door.
00:58:05.340 | And so they're both executing something
00:58:07.540 | because they're both being like, waiting for that message, dog.
00:58:10.500 | But they aren't making any progress.
00:58:11.980 | And the system is hung.
00:58:13.420 | But it has 100% GPU utilization.
00:58:15.660 | So you won't see that that often if you're
00:58:19.220 | running an inference framework.
00:58:22.180 | But it is something to watch out for and why, on Modal,
00:58:26.260 | I learned Rust in order to be able to add these
00:58:29.060 | to our dashboard.
00:58:30.780 | I think it's that important to show it
00:58:32.380 | to people, the power and the temperature.
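A rough sketch of that smoke check from Python via NVML (assuming the nvidia-ml-py package is installed); the thresholds in the comments are the rough targets from the talk, not hard limits:

```python
# Sketch: check utilization, power, and temperature together, since utilization alone
# can read 100% on a deadlocked system that is making no progress.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu        # % of time any kernel is running
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000        # milliwatts -> watts
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

print(f"util {util}%  power {power_w:.0f}/{limit_w:.0f} W  temp {temp_c} C")
# Rough targets: utilization near 100%, power around 80-100% of the limit,
# temperature within a handful of degrees of the card's design point.
pynvml.nvmlShutdown()
```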
00:58:36.700 | Cool.
00:58:37.340 | All right.
00:58:37.820 | So I do want to talk about fine tuning
00:58:39.420 | since it was in the title, conscious of time.
00:58:41.860 | So I'm going to rip through this.
00:58:43.500 | And then if we have more time, we
00:58:45.400 | can dive deep via questions.
00:58:47.700 | Sound good, Sean, Noah?
00:58:50.540 | Thumbs up?
00:58:51.060 | All right.
00:58:51.580 | Yeah, that's great.
00:58:52.860 | All right, yeah, fine tuning.
00:58:54.140 | So fine tuning means taking the weights of the model
00:58:56.740 | and using data to customize them, not via rag,
00:59:00.460 | but by actually changing those numbers.
00:59:03.860 | So when does it make sense to do that
00:59:05.580 | and make your own custom model?
00:59:07.300 | If you can take the capabilities that an API has
00:59:10.620 | and distill them into a smaller model--
00:59:12.620 | so train a smaller model to mimic
00:59:14.540 | the behavior of a big model, then you can--
00:59:17.260 | frequently, you don't need all the things like GPT.
00:59:20.420 | The big models know the name of every arrondissement in France
00:59:24.020 | and things about 15th century sculpt--
00:59:27.300 | or esotericism that you probably don't
00:59:29.900 | need in a support chatbot.
00:59:31.780 | So a smaller model with fewer weights,
00:59:34.940 | less room to store knowledge, could probably
00:59:40.100 | serve your purposes.
00:59:42.500 | I think of this a bit like a Python to Rust rewrite.
00:59:45.820 | You start off when you aren't sure what you need.
00:59:48.020 | You write in Python because it's easy to change,
00:59:50.020 | just like changing a prompt is easy,
00:59:52.100 | and switching between proprietary model providers
00:59:55.260 | is easy, upgrades are easy.
00:59:57.460 | But then once you really understand what you're doing,
00:59:59.660 | you rewrite it in Rust to get better performance.
01:00:03.900 | And then that Rust rewrite is going
01:00:05.540 | to be more maintenance work and harder to update, yada, yada,
01:00:08.500 | but it's going to be 100x cheaper or something.
01:00:12.300 | And so both the good and the bad things
01:00:14.020 | about that kind of rewrite-- it's
01:00:15.420 | a very similar engineering decision in terms
01:00:17.780 | of technical debt, feature velocity, cost of engineers,
01:00:24.860 | all this stuff.
01:00:26.300 | There's a nice product called OpenPipe
01:00:28.060 | that will help you steal capabilities as a service.
01:00:32.380 | So maybe check them out.
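A minimal sketch of the data-collection side of this kind of distillation, assuming any OpenAI-compatible `client`; the teacher model name, prompts, and output path are placeholders:

```python
# Sketch: log (prompt, completion) pairs from a big teacher model into JSONL
# that a smaller model can later be fine-tuned on.
import json

def collect_distillation_data(client, prompts, out_path="distill.jsonl"):
    with open(out_path, "w") as f:
        for prompt in prompts:
            reply = client.chat.completions.create(
                model="big-teacher-model",
                messages=[{"role": "user", "content": prompt}],
            )
            record = {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": reply.choices[0].message.content},
                ]
            }
            f.write(json.dumps(record) + "\n")
```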
01:00:35.740 | If you want tighter control of style,
01:00:37.740 | like you want it to always respond
00:59:39.340 | in the voice of a pirate and never break kayfabe,
01:00:42.260 | fine tuning is pretty good at that.
01:00:44.220 | Relatively small amounts of data can do that.
01:00:46.900 | It's pretty bad at adding knowledge.
01:00:49.060 | That's usually better to do search or retrieval, which
01:00:51.700 | is what people call RAG, like get the knowledge from somewhere
01:00:55.340 | and stuff it in the prompt.
01:00:56.780 | Prompts can get pretty big these days.
01:00:59.900 | So your search doesn't have to be
01:01:02.020 | as good as it needed to be a year and a half ago.
01:01:04.500 | You can get vaguely the right information
01:01:06.220 | and put it in the prompt.
01:01:08.180 | The holy grail would be for you to define a reward function
01:01:12.820 | of what does it mean for this model to do well.
01:01:14.780 | Maybe that's customer retention, NPS, whatever.
01:01:19.860 | And then you could do ML directly on those rewards
01:01:23.340 | to optimize the model for that.
01:01:26.020 | That's the holy grail.
01:01:27.580 | Then you could just sit back and monitor that RL system.
01:01:32.380 | And then you would magically make that reward number go up.
01:01:36.820 | Could be stock price.
01:01:37.860 | That would be nice.
01:01:38.860 | The problem is there's a large gap between the things you
01:01:41.700 | want to improve, and the things that you can actually measure,
01:01:44.540 | and the things that you can provide to a model,
01:01:47.060 | measure quickly enough, et cetera.
01:01:48.820 | And also the rewards need to be unhackable.
01:01:51.260 | They need to be exactly what you want to maximize.
01:01:55.860 | When you do ML, ML is like paperclip maximization.
01:01:58.660 | It's like, you told me to make this number go up.
01:02:00.660 | I'm going to make this number go up.
01:02:02.160 | Imagine the brooms from "The Sorcerer's Apprentice."
01:02:05.340 | So if your rewards aren't something
01:02:07.100 | that's extremely logically checkable, like
01:02:10.180 | does this code compile,
01:02:12.020 | and does it run faster,
01:02:14.780 | they're hackable.
01:02:15.700 | So there's this famous example from OpenAI
01:02:18.020 | where they trained a model to drive a boat in this boat
01:02:20.620 | racing game.
01:02:21.620 | And it was trying to maximize points.
01:02:23.140 | And what it learned was, actually, you
01:02:24.720 | don't want to win the race and do
01:02:28.740 | what the game is supposed to do, which
01:02:30.660 | is collect these little pips and finish a race.
01:02:33.900 | If you want to score max, what you actually want to do
01:02:36.180 | is find this tiny little corner and slam against the wall
01:02:38.620 | repeatedly, picking up this bonus item that respawns,
01:02:42.620 | and just slamming against the wall over and over again
01:02:44.820 | and pick up the bonus item when it spawns.
01:02:50.900 | Very inhuman.
01:02:53.740 | More like a speed runner playing a video game
01:02:56.220 | than a normal human.
01:02:58.380 | So imagine this, but with your customer support.
01:03:01.180 | Great way to get customers to give a 10 on an NPS
01:03:04.500 | is to hack their machine and say,
01:03:07.540 | your machine is locked down until you put a 10 on our NPS.
01:03:11.020 | So be careful when using that approach.
01:03:13.660 | But that is the direction we're going.
01:03:15.580 | And as RL for things like reasoning models
01:03:17.820 | gets better and more mainstream,
01:03:20.700 | it's kind of the long-term direction we're going.
01:03:23.700 | But that's not where we are today.
01:03:26.980 | Where we are today is really more like stealing capabilities
01:03:29.460 | from public APIs and distilling them.
01:03:34.580 | So the main reason fine-tuning can save costs,
01:03:38.380 | can improve performance, why shouldn't you do it?
01:03:42.220 | Fine-tuning is machine learning.
01:03:43.980 | Running inference is mostly normal software engineering
01:03:47.220 | with some fun spicy bits-- GPUs, floating point numbers.
01:03:50.900 | But machine learning is a whole different beast.
01:03:53.580 | Machine learning engineering has a lot
01:03:56.180 | in common with hardware and with scientific research.
01:03:59.540 | And it's just fucking hard.
01:04:01.060 | You've got non-determinism of the normal variety.
01:04:04.780 | On top of that, there's epistemic uncertainty.
01:04:06.740 | We don't understand these models.
01:04:08.100 | We don't understand the optimization process.
01:04:10.900 | There's all the floating point nonsense,
01:04:12.580 | which is much worse in machine learning than elsewhere.
01:04:15.060 | You've got to maintain a bunch of data pipelines.
01:04:17.340 | No one's favorite form of software engineering.
01:04:19.460 | This is a high-performance computing workload.
01:04:21.580 | Tera- or exaflop scale, if not more.
01:04:24.860 | Like, yeah, high-performance computing sucks.
01:04:27.180 | There's a reason why only the Department of Energy does it.
01:04:29.680 | And now a few people training models.
01:04:34.740 | There's a bunch of bad software out there.
01:04:36.660 | Like, the software in ML is frankly bad.
01:04:38.500 | It's written by people like me with scientific background.
01:04:42.460 | You have to deal-- things are inferential.
01:04:44.380 | You have to deal with statistical inference.
01:04:46.820 | Yeah, there's data involved.
01:04:48.460 | And now data is getting stored in a form
01:04:50.300 | that no one understands.
01:04:51.420 | Like, user data went in.
01:04:53.020 | And somebody can maybe pull a "New York Times" article
01:04:55.260 | directly out of your model weights.
01:04:56.900 | This scares lawyers.
01:04:59.500 | And so that is tricky and probably
01:05:03.140 | is going to require some Supreme Court rulings and so on
01:05:06.740 | to really figure out.
01:05:08.860 | Yeah, and when Mercury is in retrograde, your GPUs run slower.
01:05:11.420 | I'm sorry.
01:05:11.780 | That's just how it is.
01:05:12.700 | It's just, like, the point is there's
01:05:14.260 | a lot of complexity that's very hard to get an engineering
01:05:17.060 | grip on.
01:05:18.460 | So if you can solve it in literally any other way,
01:05:20.580 | try that first.
01:05:21.240 | Be creative.
01:05:22.180 | Think of ways you can solve this problem without fine tuning.
01:05:24.620 | What information can you bring in?
01:05:26.020 | What program control flow can you put around a model?
01:05:29.540 | Like, distillation is the easiest ML problem
01:05:32.980 | because you're using an ML model to mimic an ML model.
01:05:36.700 | And you can write down the math for that.
01:05:39.140 | It's perfect.
01:05:39.860 | It's very easy.
01:05:41.140 | Like, there's a notion of a data-generating process.
01:05:43.740 | In the real world, that's like the climate of the planet
01:05:46.340 | Earth.
01:05:47.380 | But in distillation, it's like an API call.
01:05:50.340 | Much easier.
01:05:51.220 | So if you have never fine-tuned before,
01:05:53.780 | definitely start with stealing capabilities
01:05:56.540 | from OpenAI, a.k.a.
01:05:58.340 | distillation, rather than anything else.
01:06:01.900 | To do this, you're going to need even more high-performance
01:06:04.500 | hardware.
01:06:04.780 | I focused on running models at the beginning.
01:06:06.860 | Fine-tuning blows out your memory budget,
01:06:09.620 | even with these parameter-efficient methods
01:06:11.580 | that are out there.
01:06:13.700 | Like, kind of what happens during training
01:06:15.580 | is you run a program forwards, and then you flip it around
01:06:18.420 | and run it backwards.
01:06:19.740 | So that puts a lot of extra pressure on memory.
01:06:22.620 | Then you also, during training, you want lots of examples
01:06:25.420 | so the model doesn't learn too much from one specific example.
01:06:29.140 | And you also want large batches to make better use
01:06:32.260 | of the big compute and to make better use of all
01:06:36.260 | those floating-point units.
01:06:38.380 | So that puts pressure on memory.
01:06:40.900 | And then optimization just, in general,
01:06:42.560 | requires some extra tensors that are the size of or larger
01:06:46.300 | than the model parameters.
01:06:47.500 | Sorry, some arrays, some extra arrays of floating-point
01:06:49.820 | numbers that are at least the size of the model parameters
01:06:53.940 | themselves.
01:06:54.860 | So you've got gradients and optimizer states.
01:06:57.300 | These are basically like 2 to 10 extra copies of the model
01:07:01.700 | weights are going to be floating around.
01:07:03.660 | There's ways to shard it, but you
01:07:05.300 | can't get around the fact that a lot of this stuff
01:07:07.300 | just needs to be stored.
01:07:09.540 | So you're going to need eight 80-gigabyte GPUs, or 32
01:07:15.020 | of them, connected in a network.
01:07:17.020 | And yeah, the software for that is pretty hard,
01:07:20.340 | or pretty rough.
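As rough napkin math for why the memory budget blows out, under standard mixed-precision assumptions (bf16 weights and gradients, fp32 AdamW moments and master weights) and a hypothetical 8B-parameter model:

```python
# Sketch: back-of-envelope memory for full fine-tuning, before activations.
params = 8e9  # hypothetical 8B-parameter model

weights_gb = params * 2 / 1e9  # bf16 weights
grads_gb   = params * 2 / 1e9  # bf16 gradients
adam_gb    = params * 8 / 1e9  # fp32 first and second moments
master_gb  = params * 4 / 1e9  # fp32 master copy of the weights

total_gb = weights_gb + grads_gb + adam_gb + master_gb
print(f"~{total_gb:.0f} GB before activations")  # ~128 GB, already more than one 80 GB GPU
# Activations grow with batch size and sequence length on top of this, which is why
# even a "small" full fine-tune ends up sharded across eight 80 GB GPUs or more.
```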
01:07:22.220 | I already talked about how hard machine learning is.
01:07:25.060 | It's like there are software engineering practices that
01:07:27.900 | can prevent it from being made harder.
01:07:30.980 | I worked on experiment tracking software, Weights and Biases.
01:07:35.820 | That said, I worked on it for a reason.
01:07:37.900 | It's like when I was training models, the thing I wanted
01:07:40.300 | was being able to store voluminous quantities of data
01:07:45.140 | that come out of my run.
01:07:46.380 | Tons of metrics, gradients, inputs, outputs, loss values.
01:07:51.740 | There's just a bunch of stuff that you
01:07:53.340 | want to keep track of on top of very fast-changing code
01:07:56.940 | and configuration.
01:07:58.580 | And so you want a place to store that.
01:08:01.380 | The software is hard to debug.
01:08:03.100 | You don't know where the bugs are.
01:08:04.480 | So you want to store very raw information
01:08:07.020 | from which you can calculate the thing that reveals your bug.
01:08:10.140 | This is actually, I would say, like Honeycomb,
01:08:13.020 | their approach to observability is very similar.
01:08:15.140 | This is like observability for training runs.
01:08:17.580 | Observability is like recording enough about your system
01:08:19.940 | that you can debug it from your logs without having to SSH in.
01:08:23.980 | Same thing with model training.
01:08:25.580 | So yeah, Weights and Biases' hosted version,
01:08:27.740 | Neptune's hosted version, MLflow,
01:08:29.980 | which you can run yourself.
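A minimal sketch of that record-everything pattern with Weights and Biases; the project name, config, and fake training step are placeholders, and MLflow or Neptune look broadly similar:

```python
# Sketch: log metrics from a training loop so you can debug runs from the logs later.
import random
import wandb

def train_one_step(step: int) -> float:
    # Stand-in for a real training step; returns a fake, slowly decreasing loss.
    return 2.0 / (1 + 0.01 * step) + 0.05 * random.random()

wandb.init(project="my-finetune", config={"lr": 2e-5, "batch_size": 32})
for step in range(1000):
    wandb.log({"train/loss": train_one_step(step), "train/step": step})
wandb.finish()
```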
01:08:32.580 | Yeah.
01:08:33.820 | You--
01:08:34.320 | [INTERPOSING VOICES]
01:08:36.420 | TensorBoard?
01:08:38.100 | Yeah, so TensorBoard, you have to run TensorBoard yourself.
01:08:41.740 | There's no real hosted service for it.
01:08:43.320 | I think they shut down TensorBoard.dev.
01:08:45.500 | So even if you're willing to make it public,
01:08:47.500 | you can't even use TensorBoard.dev anymore.
01:08:51.140 | Yeah, that's my saddest Killed by Google,
01:08:53.300 | because it hits me personally, or maybe happiest,
01:08:56.260 | because I'm a shareholder in Weights and Biases.
01:08:58.860 | But yeah, so yeah, TensorBoard is really good
01:09:04.300 | at a small number of experiments.
01:09:06.300 | It's bad at collaboration and bad at large numbers
01:09:08.740 | of experiments.
01:09:10.300 | Other experiment tracking workflows that have gotten more--
01:09:13.460 | or experiment tracking solutions that
01:09:15.340 | have gotten more love, like the venture-backed ones
01:09:18.180 | or the open source ones, are better for that.
01:09:23.580 | So you can-- I would say a lot of software engineers
01:09:26.460 | come into the ML engineers' habitat
01:09:28.460 | and are pretty disgusted to discover the state of affairs.
01:09:33.780 | So you definitely do, in general,
01:09:35.740 | as a software engineer entering this field,
01:09:37.540 | you will be disgusted.
01:09:38.740 | And you should push people to up their SWE standards.
01:09:42.180 | But there's actually a lot of benefit
01:09:44.020 | to fast-moving code in ML engineering.
01:09:48.220 | It is researchy in that way.
01:09:50.180 | So you do want fast iteration.
01:09:53.660 | A lot of software engineering practices
01:09:55.380 | are oriented to a slower cycle of iteration
01:09:57.740 | and less interactive iteration.
01:09:59.980 | So the detente that I've found works
01:10:02.700 | is build internal libraries in normal code files,
01:10:08.140 | but then use them via Jupyter Notebooks
01:10:10.820 | so that you can poke prod, run ad hoc workflows, et cetera.
01:10:15.660 | And then as soon as something in a Jupyter Notebook
01:10:17.900 | starts to become regularly useful,
01:10:20.540 | pull that out into your utils.py, at the very least,
01:10:23.780 | if not an internal library.
01:10:26.740 | So yeah, Noah mentioned at the beginning--
01:10:30.220 | or I forget, maybe it was just me.
01:10:33.220 | Anyway, full-stack deep learning course I taught in 2022
01:10:36.740 | still has the basics of how to run ML engineering.
01:10:40.540 | The main thing that's changed is that we're
01:10:42.380 | talking about fine-tuning here.
01:10:43.960 | And back then, we were talking about training from scratch,
01:10:46.420 | because the foundation model era was only beginning.
01:10:49.180 | But the basic stuff in there, like the YouTube videos,
01:10:52.100 | the lecture-level stuff, is all still, I would say,
01:10:55.020 | pretty much solid gold.
01:10:56.860 | And then the code's rotted a bit,
01:10:58.420 | but it's at least vibes-level helpful.
01:11:01.740 | OK, actually, the observability stuff
01:11:06.220 | is less interesting and relevant.
01:11:08.100 | The main point is the eventual goal with any ML feature
01:11:12.140 | is to build a virtuous cycle, a data flywheel, a data engine,
01:11:18.100 | something that allows you to capture user data, annotate it,
01:11:21.140 | collect it into evals, and improve the underlying system.
01:11:23.780 | This is like-- if you're running your own LM inference,
01:11:26.780 | one of the ways you're going to make
01:11:28.320 | this thing truly better than what you could get elsewhere
01:11:31.060 | is building your own custom semi-self-improving system,
01:11:37.340 | or at least continually-improving system,
01:11:40.100 | based off of user data.
01:11:42.900 | There's some specialized tooling for collecting this stuff up,
01:11:45.900 | whether it's offline style with something
01:11:48.260 | like Weights and Biases Weave.
01:11:50.300 | You can see Sean's recent conversation with Sean
01:11:54.460 | from Weights and Biases on how he used Weave,
01:11:56.860 | among other tools, to win at SWE-bench.
01:12:02.020 | >>Then Thomas came on Thursday and went over Weave.
01:12:05.780 | >>Oh, nice.
01:12:06.940 | OK, yeah, that's pure product on Weave, plus Sean--
01:12:12.220 | oh, wait, in this class or somewhere else?
01:12:14.420 | Oh, in this class, awesome.
01:12:15.540 | >>Yeah, Thomas came in on Thursday
01:12:17.020 | and did an hour and a half and change on Weave.
01:12:20.760 | >>Nice, yeah.
01:12:21.780 | So I would say Weave is really good for this offline evals,
01:12:25.300 | which is collect up a data set, kind of run code on it.
01:12:29.180 | The code and the data set co-evolve.
01:12:31.260 | And this is very much how an ML engineer approaches
01:12:34.620 | evaluation, coming from academic benchmarking,
01:12:37.900 | really, originally.
01:12:39.100 | And then there's a different style of evals.
01:12:41.300 | I don't know if you're going to have anybody from LangChain
01:12:43.980 | or LlamaIndex or one of these other people
01:12:45.940 | who are also building this observability tooling.
01:12:50.420 | There's this product engineer style,
01:12:52.300 | which is just collect up information
01:12:54.100 | and then let anybody write to it.
01:12:56.340 | Anybody can come in and annotate a trace
01:12:58.220 | and be like, this one is wrong.
01:13:01.540 | LangSmith is very open-ended, the tool from LangChain,
01:13:04.940 | as are a lot of the other observability tooling--
01:13:08.820 | or sorry, these more online eval-oriented things.
01:13:12.060 | It's about raw stuff from production.
01:13:16.940 | And it's about a living database of all the information
01:13:20.820 | you've learned about your users, your problem,
01:13:23.940 | the behavior of models.
01:13:25.740 | And so it's this very dynamic, active artifact,
01:13:29.540 | which has its place.
01:13:32.740 | I think the more you need input from people who are not
01:13:35.820 | you to evaluate models-- like, for example,
01:13:38.420 | it's producing medical traces, and you are not a doctor.
01:13:41.900 | As opposed to producing code, and you are a programmer,
01:13:44.940 | then being able to bring in more people is more helpful.
01:13:47.900 | And so there's utility to these more online-style things.
01:13:52.420 | You can also actually build this stuff yourself.
01:13:54.460 | One thing I will say is these people
01:13:57.180 | don't know that much more about running these models than you
01:13:59.780 | do and getting them to perform well.
01:14:01.540 | And the workflows are not really set down for this.
01:14:03.700 | So with experiment management, that's
01:14:05.580 | been pretty figured out.
01:14:06.620 | It's an older thing.
01:14:08.540 | And so there's lots of-- the tooling
01:14:11.420 | has good ideas baked into it and will teach you to be better.
01:14:15.020 | These tools are in the design partner phase,
01:14:18.080 | a.k.a. the provide free engineering and design
01:14:20.420 | work for somebody you're also paying for a service phase.
01:14:26.580 | So if you have a good internal data engineering
01:14:30.660 | team that is good at, say, an open telemetry integration,
01:14:34.980 | would love to set up a little ClickHouse instance
01:14:39.060 | or something.
01:14:40.420 | And that's exciting to you, the prospect
01:14:44.940 | of putting something like that together,
01:14:46.600 | you or somebody on your team.
01:14:47.740 | You can build your own with something like this.
01:14:49.740 | And then the front end people can hack on the experience.
01:14:54.700 | So Bryan Bischof at Hex is big on this,
01:14:57.780 | because Hex has both really incredible internal data
01:15:00.820 | engineering and they're a data notebook product.
01:15:03.580 | So they can actually dog food their product
01:15:05.420 | to do their evaluation of their product.
01:15:08.320 | So not everybody's in the situation
01:15:09.740 | to be able to do that, but it's like a bigger fraction
01:15:14.300 | than it is with some of the other stuff
01:15:16.020 | that we've talked about.
01:15:17.820 | More tilted in the build direction than the buy.
01:15:24.020 | OK, so that's everything.
01:15:25.100 | I'll do my quick pitch here.
01:15:27.660 | I mentioned at the beginning, if you
01:15:29.340 | want to run code on GPUs in the cloud,
01:15:31.820 | Modal is the infrastructure provider that I'm working on.
01:15:37.060 | That-- I joined this company because I
01:15:43.300 | thought their shit was great.
01:15:44.460 | I was talking about how much I liked it on social media,
01:15:47.020 | and they're like, what if we paid you to do this?
01:15:49.060 | And I was like, no.
01:15:50.940 | I love this so much.
01:15:52.140 | Please don't pay me to do it, because then people
01:15:54.180 | won't trust me when I tell them it's good.
01:15:56.020 | But eventually I gave in.
01:15:57.580 | Now I work at Modal, and they pay me to say this.
01:16:01.460 | The same thing I was saying before, which is Modal is great.
01:16:05.100 | It's like, you pay for only the hardware you use.
01:16:08.020 | Important when the hardware is so expensive.
01:16:12.660 | They built the whole--
01:16:15.020 | all the infrastructure is built from the ground up in Rust,
01:16:18.740 | BTW, to design for data-intensive workloads.
01:16:22.460 | There's a great podcast with our co-founder,
01:16:24.820 | with Sean, that completely separate from learning
01:16:30.100 | about Modal.
01:16:31.020 | It's just like, gain 10 IQ points, or 10 levels
01:16:36.300 | in computer infrastructure from hearing the story,
01:16:39.580 | learning about the software that was built,
01:16:41.940 | and how they sped it up.
01:16:43.380 | There's also a great Data Council talk on it.
01:16:46.020 | Just designed to run stuff fast.
01:16:48.500 | And then, unlike other serverless-GPU-in-the-narrow-sense
01:16:52.860 | providers, Modal has code sandboxes, web endpoints,
01:16:57.020 | and makes it easy to stand up a user interface around your stuff.
01:17:00.140 | So that's why I ended up going all in on Modal.
01:17:03.340 | It was like, wow, not only does this run my models,
01:17:05.900 | but I learned how to properly use FastAPI from Modal's
01:17:10.660 | integration with it.
01:17:12.660 | And yeah, that's just the tip of the iceberg
01:17:16.740 | on the additional things that it provides.
01:17:19.300 | So that can be for running your fine-tuning jobs,
01:17:23.260 | if you've decided you want to distill models yourself.
01:17:26.240 | It can be just running the inference,
01:17:27.820 | to be able to scale up and down, and handle changing inference
01:17:31.100 | load, and make sure you're filling up
01:17:32.820 | all the GPUs that you're using.
01:17:35.780 | And it can be for doing your evaluations,
01:17:38.500 | running these things online or offline,
01:17:41.660 | creating data to help you observe your system
01:17:46.100 | and make it better.
01:17:46.980 | So it's like full service, serverless cloud
01:17:52.820 | infrastructure that doesn't require a PhD in Kubernetes.
01:17:59.260 | Great.
01:17:59.960 | All right, that's all I got.
01:18:02.780 | Any questions?
01:18:03.940 | That was sick.
01:18:08.820 | Thanks so much.
01:18:09.420 | We love Modal in this house.
01:18:11.980 | I was in the process of rewriting it,
01:18:13.580 | so everyone that got the-- and also everyone,
01:18:16.780 | Charles is the person that we talked to to get the Modal
01:18:19.680 | credits for the course.
01:18:20.900 | So everyone, a big, big thank you to Charles for that.
01:18:23.820 | But the entire course, every single--
01:18:25.980 | this cohort builds three projects, all of which
01:18:29.220 | are built off of FastAPI that lives in Modal.
01:18:31.740 | So we love Modal here.
01:18:34.300 | It's great.
01:18:35.220 | Yeah, if you ever run into any bugs,
01:18:39.900 | definitely slide into our Slack.
01:18:42.460 | There's a decent chance you'll get co-founder support
01:18:46.780 | if you slide into the Slack.
01:18:49.220 | And yeah, hopefully you've been pointed to the examples page,
01:18:53.940 | modal.com/docs/examples.
01:18:57.280 | I slave to ensure that those things run end-to-end.
01:19:04.300 | They're continuously monitored and run stochastically
01:19:08.980 | at times during the day to ensure that.
01:19:11.380 | So if you run into-- they should run.
01:19:14.500 | They should help you get started.
01:19:15.900 | They're designed to be something you
01:19:17.460 | can build production-grade services off
01:19:19.860 | of as much as possible.
01:19:22.100 | And so yeah, if you want any help with those,
01:19:25.260 | slide into the Slack.
01:19:26.860 | Tag me.
01:19:27.580 | Feel free to tag me on stuff related
01:19:31.620 | to the course or otherwise.
01:19:36.060 | I love the examples.
01:19:37.060 | I should talk to you sometime, how you set all of that up.
01:19:39.180 | Because I was very impressed.
01:19:40.380 | I ran through the ComfyUI workflow a couple of days ago.
01:19:43.140 | And I was able to tweak a few things.
01:19:44.940 | I pulled down the code example.
01:19:46.300 | I got a few different things running.
01:19:47.980 | I was like, holy shit.
01:19:49.180 | I just pulled down an example from the internet
01:19:51.220 | and just ran the command that it said to run.
01:19:53.500 | And then it ran.
01:19:54.260 | And I was like, that never happens.
01:19:55.680 | There's always some other thing I have to do.
01:19:58.540 | I was very impressed.
01:20:01.060 | Yeah, part of it is that as an infrastructure product,
01:20:05.180 | the thing that kills being able to run code
01:20:10.100 | is the differences between infrastructures, like,
01:20:16.900 | oh, well, that will only run if you set this LDFLAGS thing
01:20:16.900 | or have this installed.
01:20:19.540 | It works on my machine.
01:20:22.240 | See, the thing about the Modal examples
01:20:23.700 | is they all work on my machine.
01:20:26.940 | And my machine is Modal, which you can also run them on.
01:20:26.940 | So that does make it a lot easier.
01:20:29.580 | I think that's generally true for being
01:20:31.160 | able to share things that run on modal within your team,
01:20:35.260 | making it easier to do that.
01:20:39.180 | But then separately, like, yeah, the trick--
01:20:42.740 | and this is actually like an engineering trick
01:20:44.660 | that is surprised it took me this long to learn.
01:20:46.660 | It's like there's tests and there's monitoring.
01:20:49.220 | And there are a lot of things that
01:20:51.020 | are really hard to write as tests.
01:20:52.580 | Slow down your iteration speed.
01:20:54.620 | Like, yeah, require a bunch of disgusting mocking
01:20:57.820 | that breaks as often as the actual code does.
01:21:00.140 | Or you could monitor production and fix
01:21:02.700 | issues that arise there, a.k.a. do both.
01:21:05.500 | So yeah, that's an important trick for the Modal examples,
01:21:10.420 | but also for all the things you would maybe run using--
01:21:14.020 | as part of running your own language model inference
01:21:16.460 | or running your own AI-powered app.
01:21:18.020 | It's like, monitor the shit out of this thing.
01:21:20.060 | >>Awesome.
01:21:23.960 | Cool.
01:21:24.460 | Well, before we let Charles go, does anybody have any questions?
01:21:27.740 | I know I'm sure given everyone's background here,
01:21:30.180 | there's a lot of-- everyone's brain
01:21:32.180 | feels very full with all of the hardware architecture
01:21:35.140 | that you just learned and terminology.
01:21:37.420 | But just want to open it up for anyone.
01:21:40.060 | >>I think-- I'll kill time while people ask questions.
01:21:47.260 | But I think that it's always intimidating for people
01:21:51.300 | sort of running their own models and fine-tuning them.
01:21:56.540 | I'm just like, what's a really good first exercise
01:22:02.940 | that you could-- probably you have some tutorials on Modal
01:22:02.940 | that you would recommend people just go through.
01:22:05.900 | >>Yeah, running your own model.
01:22:07.140 | I would actually say, if you don't
01:22:10.300 | have a MacBook M2 or later with at least 32 gigabytes of RAM,
01:22:17.900 | go ahead and buy one of those.
01:22:19.220 | Get your company to buy it for you.
01:22:21.860 | So that turns out to be actually a really incredible machine
01:22:25.900 | for running local inference.
01:22:27.100 | Has to do with the memory bandwidth stuff
01:22:28.820 | that we talked about, like moving the bytes in and out
01:22:31.580 | really fast.
01:22:33.300 | And so that-- I would actually say
01:22:35.980 | like that was the first thing I did back when you
01:22:38.460 | had to torrent llama weights.
01:22:41.420 | That running it locally-- and there's good tools out there
01:22:45.780 | for this, Ollama.
01:22:48.140 | You can also use the same thing you would run on a cloud server
01:22:51.420 | like vLLM.
01:22:53.900 | That is-- that's probably the easiest way
01:22:57.020 | to get started with running some of your own inference.
01:22:59.420 | And then the cost is amortized more effectively.
01:23:03.140 | And you can use it for other stuff, the computer
01:23:05.740 | that you're using for this.
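A minimal sketch of what that looks like locally, assuming Ollama is installed and a model has been pulled; the model name is just an example, and the same client code points at a cloud vLLM server by swapping the base URL:

```python
# Sketch: local inference through Ollama's OpenAI-compatible endpoint.
# Beforehand, e.g.:  ollama pull llama3.1
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored locally

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
)
print(response.choices[0].message.content)
```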
01:23:07.620 | So that's actually probably my-- it's bad Modal marketing
01:23:11.460 | to say that.
01:23:12.020 | But I would say people like to be able to poke and prod.
01:23:15.100 | If you don't already know Modal, I know Modal well enough
01:23:17.920 | that now it's not any harder for me
01:23:19.960 | to use Modal to try these things than to run it on my MacBook.
01:23:23.260 | But it takes some time.
01:23:24.660 | And everybody knows how to use a command line as part
01:23:28.820 | of becoming a software engineer.
01:23:31.220 | So yeah.
01:23:33.740 | So that's my primary recommendation.
01:23:35.540 | For fine-tuning, I would say distilling a model
01:23:40.220 | is the easiest thing to do.
01:23:43.420 | Besides, I guess our demo for fine-tuning,
01:23:45.900 | which I didn't have time to show,
01:23:47.300 | it's like fine-tuning something on somebody's Slack messages
01:23:50.140 | so that it talks like them.
01:23:51.660 | And that's easy, fun, the stakes are low,
01:23:56.900 | and it teaches you some things about the software
01:24:01.340 | and about fine-tuning problems.
01:24:03.580 | But then to really understand what
01:24:06.940 | it means to fine-tune in pursuit of a specific objective,
01:24:10.260 | it's like distillation of a large model.
01:24:12.260 | Yeah, totally.
01:24:18.460 | I did insert a little comment about what distillation means.
01:24:21.500 | Because apparently, a lot of people
01:24:23.980 | kind of view training on output of GPT-4 as distillation.
01:24:30.180 | But the purist would be like, you
01:24:33.580 | have to train on the logits.
01:24:35.820 | Oh, yeah.
01:24:37.620 | Yeah, the teacher-student methods.
01:24:40.420 | Real distillation is different.
01:24:42.380 | Real distillation.
01:24:43.340 | Yeah, yeah.
01:24:43.900 | Oh, so I guess maybe that's a reason
01:24:46.420 | to run your own models to be able to get
01:24:51.100 | the raw output of the model is not tokens.
01:24:54.300 | It's probability for every token.
01:24:57.180 | And so that's a much richer signal for fine-tuning off of.
01:25:01.060 | And so that's what people prefer.
01:25:04.580 | But I guess I was thinking of it in the looser sense
01:25:07.980 | that most people talk about today, which is just like
01:25:10.260 | training to mimic the outputs of the model.
01:25:12.460 | Yeah, create a synthetic corpus of text.
01:25:16.180 | Yeah, yeah.
01:25:18.120 | And when you run a model in production,
01:25:19.780 | you're creating a synthetic corpus of text, you know?
01:25:23.780 | Synthetic corpus of text is somewhat intimidating sounding.
01:25:29.540 | I say as somebody who's used a lot of intimidating
01:25:31.620 | sounding jargon.
01:25:34.780 | But really, the simplest synthetic corpus of text
01:25:38.460 | is all the outputs that the API returned
01:25:41.660 | while it was running in prod.
01:25:43.620 | That's a great thing to fine-tune on.
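A rough sketch of the distinction in PyTorch; both losses are generic textbook forms rather than any particular library's API, the tensors are random stand-ins, and the logit-matching version only works if you run the teacher yourself, since APIs generally don't return full logits:

```python
# Sketch: "real" teacher-student distillation matches the teacher's full next-token
# distribution; the looser, more common sense is plain supervised fine-tuning on
# the teacher's sampled outputs.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions and match them with KL divergence.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

def sft_loss(student_logits, teacher_token_ids):
    # Cross-entropy against the teacher's sampled tokens, i.e. the synthetic corpus.
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        teacher_token_ids.view(-1),
    )

# Random tensors standing in for real model outputs.
batch, seq, vocab = 2, 8, 100
student = torch.randn(batch, seq, vocab, requires_grad=True)
teacher = torch.randn(batch, seq, vocab)
print(distillation_loss(student, teacher).item())
print(sft_loss(student, teacher.argmax(dim=-1)).item())
```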
01:25:47.180 | I just linked to an example here where
01:25:51.620 | someone distilled from R1.
01:25:54.900 | And it was pretty effective.
01:25:56.420 | And it took 48 hours and a few H100s.
01:26:00.500 | And that was it, not that expensive.
01:26:02.180 | Nice.
01:26:03.460 | Yeah, yeah.
01:26:04.620 | So Modal will do fine-tuning jobs up to eight H100s
01:26:08.460 | and up to 24 hours.
01:26:10.420 | We're working on features for bigger-scale training,
01:26:13.820 | both longer in time and larger in number.
01:26:19.020 | But yeah, I would say there's also
01:26:20.900 | a pretty strong argument for keeping your fine-tunes as small
01:26:24.500 | and fast as possible to be able to iterate more effectively
01:26:28.700 | and quickly.
01:26:29.540 | Because it's fun to run on 1,000 GPUs or whatever.
01:26:35.660 | There's this frisson of making machines go brr.
01:26:39.660 | But then when you need to regularly execute that job
01:26:42.980 | to maintain a service that you've promised people
01:26:45.900 | that you will keep up, then it starts to get painful.
01:26:50.140 | Because reliability, cost, it's ungodly slow.
01:26:54.700 | It's 48 hours is a long time to wait for a computer
01:26:57.860 | to do something, even if it is an exaflop of operations.
01:27:02.860 | So definitely, when starting out with fine-tuning,
01:27:06.420 | go for the smallest job you can.
01:27:10.540 | Got it.
01:27:12.620 | All right, I've hogged the mic enough.
01:27:14.500 | Who has questions?
01:27:15.220 | Anyone?
01:27:24.260 | OK, great.
01:27:25.100 | Well, awesome, everybody.
01:27:26.820 | Thanks so much, Charles, for coming.
01:27:28.900 | We really appreciate it.