Running and Finetuning Open Source LLMs — ft. Charles Frye, Modal

00:00:00.000 |
- Yeah, sure, yeah thanks for inviting me, Noah and Sean. 00:00:06.220 |
And yeah, I guess my background is studied neural networks, 00:00:24.920 |
series A to series C, did education for them, 00:00:29.760 |
then did a full stack deep learning online course 00:00:32.960 |
about how to deploy models in the pre-foundation model 00:00:39.640 |
and then now work for Modal, infrastructure company 00:00:43.720 |
that helps people run data-intensive workloads 00:00:58.200 |
yeah, we'd love, maybe we'll be able to do something 00:01:06.920 |
about running and fine-tuning open-source language models. 00:01:12.360 |
The answer is not always with both of these things. 00:01:40.960 |
This is something that I got set up just yesterday, 00:02:05.000 |
So this is coming from Modal, our examples repo. 00:02:19.360 |
Oops, you need a virtual environment with Python in it. 00:02:26.320 |
Okay, so let's run this guy here in the terminal. 00:02:37.000 |
I'm pulling in, I'm running it with llama.cpp here. 00:02:42.000 |
llama.cpp has very, very low precision quants. 00:02:49.760 |
That means all the values are either minus one, zero, 00:03:03.640 |
So that's why I'm running with llama.cpp here. 00:03:12.280 |
Four L40S GPUs on Modal, so we might have to wait 00:03:14.800 |
as many as 15 or 30 seconds for that to spin up. 00:03:28.680 |
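As a rough sketch of what running a heavily quantized model through llama.cpp looks like from Python (the GGUF filename and quant level here are placeholders, not the exact setup from the demo):

```python
# Minimal sketch using the llama-cpp-python bindings; the model path and
# quant suffix are hypothetical placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/DeepSeek-R1-IQ1_S.gguf",  # a very low-bit quant file
    n_gpu_layers=-1,   # offload every layer that fits onto the GPUs
    n_ctx=8192,        # context window to allocate KV cache for
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```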
Running these things is an exercise in configuration. 00:03:40.960 |
Or if you've run compilation for a serious large project, 00:03:44.880 |
you got your mysterious flags with mysterious arguments 00:03:50.840 |
that have meaningful impact on the performance. 00:03:56.600 |
and setting the quantization precision of that, 00:04:02.640 |
Okay, so we had about a minute queue for GPUs. 00:04:10.960 |
So sometimes you roll the dice and you get a natural one. 00:04:14.520 |
But, so it took us about 60 seconds maybe to spin this up 00:04:23.480 |
and I'll go smack our GPU cluster with a hammer 00:04:31.960 |
loading up all the stuff you need to get llama.cpp to run. 00:04:41.000 |
Actually, it turns out it's about 100 something gigabytes 00:04:51.360 |
It's really the like data and the inference tech 00:04:53.920 |
that DeepSeek built that's really the interesting part. 00:05:00.120 |
So now we're at the point where we're like loading the model. 00:05:07.360 |
If you have a GPU to have it on all the time, 00:05:28.480 |
So this is actually one of the major cost sources 00:05:37.040 |
that we're looking at here: it's about, you know, 00:05:40.040 |
a minute, 90 seconds to spin up. 00:05:45.040 |
That's like separate from any like modal overhead. 00:05:49.520 |
This is just the raw moving bytes around setting up RAM. 00:05:53.120 |
If you wanna avoid that, you gotta have stuff hot in RAM 00:06:05.240 |
or you gotta have an instance running in the cloud. 00:06:25.840 |
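A minimal sketch of that "keep it hot" pattern on Modal, assuming a small instruct model and the Transformers pipeline API (the model name is illustrative):

```python
# Pay the load cost once per container: weights load in an @modal.enter() hook,
# then every call reuses the warm copy already sitting in GPU/host memory.
import modal

app = modal.App("warm-llm")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.cls(gpu="L40S", image=image)
class Engine:
    @modal.enter()
    def load(self):
        from transformers import pipeline
        # The slow part: moving gigabytes of weights into (GPU) memory.
        # It happens once per container start, not once per request.
        self.pipe = pipeline(
            "text-generation",
            model="meta-llama/Llama-3.1-8B-Instruct",
            device_map="auto",
        )

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=128)[0]["generated_text"]
```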
It's, oh yeah, the prints mess up a little bit 00:06:34.520 |
the beginning of the prompt is something like, 00:06:39.980 |
So that's the prompt along with some instructions. 00:06:52.240 |
And now the beloved think token has been emitted 00:07:01.120 |
So this deployment is not super well optimized. 00:07:04.960 |
There's a substantial amount of host overhead, 00:07:06.960 |
which means the GPU is not actually working all the time, 00:07:11.620 |
That's probably either llama.cpp needs like a PR 00:07:21.780 |
So I'm suspicious that maybe I messed something up 00:07:33.080 |
But yeah, but runs about 10 tokens per second 00:07:37.120 |
on these L40S GPUs and could probably be boosted up 00:07:47.200 |
And then from there probably optimize kernels 00:07:51.520 |
Some other things would maybe double it again. 00:07:53.760 |
Oh, finished thinking pretty quickly that time. 00:08:01.480 |
controlled by how hard they think the problem is. 00:08:04.720 |
And so sometimes it finishes thinking pretty quickly 00:08:17.500 |
And yeah, I think the quality of the output here 00:08:35.120 |
So this case here, I bet that dense there is like, 00:08:42.480 |
Rough, that's probably supposed to be a zero. 00:08:46.080 |
that I haven't played with that can reduce that 00:08:49.600 |
but that comes with the quantization territory. 00:09:03.400 |
and some of the stuff I talked about along the way, 00:09:09.020 |
Yeah, any questions before I dive into the slides? 00:09:31.320 |
For DeepSeek R1, they were eight-bit floating-point numbers. 00:09:35.560 |
They worked hard to reduce the overhead during training. 00:09:44.260 |
Default in Python is often 64-bit floating-point numbers, 00:09:48.800 |
but that's way too much for neural networks most of the time. 00:10:09.640 |
like a reasoning model that produces a ton of tokens 00:10:19.520 |
even if that's all you do with your quantization. 00:10:37.480 |
that you can do that I haven't dialed in yet. 00:10:42.880 |
Yeah, anybody, any other questions before we dive in? 00:10:53.640 |
Yeah, feel free to interrupt me as we're going. 00:11:08.160 |
Okay, so I just ran an open-source language model, 00:11:18.080 |
Define some of the things that I talked about there, 00:11:26.120 |
We'll also talk a bit about fine-tuning models, 00:11:30.120 |
by doing actual machine learning and training. 00:11:35.120 |
I do want to talk about the why of this here, 00:11:47.640 |
So you want to make sure you have a good idea 00:11:52.440 |
Why not just use a managed service behind an API? 00:12:04.140 |
to get reasoning traces for your customer support chatbot, 00:12:07.680 |
that just needs to ask them to turn the thing off 00:12:16.760 |
The hardware to run it's getting easier and cheaper. 00:12:20.000 |
And so you can frequently run that relatively inexpensively. 00:12:28.720 |
and you can often, the complexity of serving is lower. 00:12:32.220 |
Just like a call-out on that DeepSeek R1 demo, 00:12:35.480 |
there's probably an order of magnitude and a half 00:12:41.520 |
So a 30x improvement is probably low-hanging fruit, 00:12:45.780 |
But right now, running that on Modal is $300 a megatoken. 00:12:51.720 |
And just having DeepSeek run it for you is $3 a megatoken. 00:13:02.340 |
even assuming we can get a 30x cost reduction running 00:13:21.200 |
comes from getting fleeced by cloud providers 00:13:27.080 |
to just stand up commodity Redis on commodity hardware. 00:13:33.420 |
So the main reason I think that people bring up 00:13:37.520 |
is to manage security and improve data governance. 00:13:40.120 |
You want to make sure to run this thing yourself. 00:13:45.340 |
the more complex this problem is going to be, 00:13:47.960 |
the more eventually it ends up getting your own GPUs 00:13:54.880 |
which is probably six months or a year of engineering work, 00:14:02.400 |
But at the very least, running it with a cloud provider, 00:14:08.360 |
can improve your security and data governance posture. 00:14:23.340 |
is maybe the one that I would say is most important. 00:14:25.980 |
It's like, and most general, it's like API providers, 00:14:33.120 |
If they're proprietary, they got to hide stuff from you, 00:14:35.420 |
whether that's reasoning chains, in OpenAI's case, 00:14:48.480 |
which is the way that they get things cheaper 00:14:50.780 |
than you can run it yourself, sort of economies of scale. 00:15:00.400 |
to run this variety of workloads economically. 00:15:16.980 |
in the direction of artificial super intelligence 00:15:23.140 |
that anybody can just download off of Hugging Face 00:15:27.700 |
So we just saw reasoning, o1-level capabilities. 00:15:41.900 |
of running your own inference as the field matures 00:15:45.700 |
Like things are just going to get way more flexible. 00:15:48.180 |
People are going to discover all kinds of crazy things 00:15:54.460 |
People will rediscover what everybody was doing 00:15:56.340 |
in 2022 and 2023, when people still had access 00:16:03.220 |
and discover that it makes their lives better. 00:16:15.340 |
if you're going to speak more about this as we go forward, 00:16:20.860 |
just to make it a bit more concrete in my mind? 00:16:23.720 |
- When you say how inference is currently working, 00:16:30.700 |
- Well, you're saying that, and I'm not familiar, 00:16:35.540 |
I'm not so familiar with the word inference itself. 00:16:37.380 |
Like, could you share a bit about how current models 00:16:39.620 |
are using inference and like how it works today, 00:16:42.140 |
so that then I understand how to better like tweak it 00:16:48.780 |
Inference just means running the model, right? 00:16:54.500 |
Goes back to the like probabilistic backing of these models. 00:17:02.140 |
And that's like inference, like logical inference. 00:17:13.420 |
So yeah, so it's like, this is like replacing OpenAI's API 00:17:32.700 |
So like, if I just like forget that I haven't defined a term, 00:17:43.980 |
But I would just say like, it's not that uncommon 00:17:46.880 |
that proprietary software leads open software 00:17:51.940 |
and Microsoft SQL Server and like OSX and Windows 00:18:00.420 |
and have for a long time, like query optimizers in particular 00:18:14.100 |
these things have co-existed in other domains. 00:18:17.340 |
And then open software has been preferred in cases 00:18:20.300 |
where it's more important to be able to hack on 00:18:25.940 |
And so, you know, we're likely to see some mixture stably. 00:18:35.140 |
at one of swyx's events, the AI Engineer Summit, 00:18:38.240 |
year and a half ago now, and this has remained true. 00:18:43.940 |
So that's at least 18 months of prediction stability, 00:18:46.580 |
which is best you can maybe hope for these days. 00:18:56.820 |
Right now, inference is priced like a commodity. 00:18:59.300 |
People find it relatively easy to change models. 00:19:01.980 |
Little prompt tuning, keep a couple of prompts around, 00:19:05.060 |
ask a language model to rewrite your prompts for you. 00:19:11.000 |
to this LM inference being priced like a commodity 00:19:29.820 |
and they like, when they're not doing training runs, 00:19:33.820 |
You might just mine cryptocurrency with them instead, 00:19:41.220 |
But like, you know, that at least if you have them, 00:19:49.560 |
But electricity costs are actually quite high 00:19:51.540 |
for these things, you know, kilowatt per accelerator 00:19:56.480 |
The, like, taking a really big generic model, 00:20:06.300 |
and distilling it for just the problems that you care about 00:20:10.220 |
into something like smaller and easier to run, 00:20:15.720 |
And we'll talk a bit about that if we get to fine tuning, 00:20:24.740 |
If your traffic is super high and dependable, 00:20:29.580 |
and you can just like allocate some GPUs to it, 00:20:33.660 |
just get a block of EC2 instances with GPUs on them, 00:20:40.540 |
It's flat, you're utilizing all the GPUs all the time. 00:20:49.540 |
And then finally, it's like, if it's like once a week, 00:20:52.160 |
you need to like process like every support conversation 00:21:01.300 |
you need like 10 mega tokens per minute throughput. 00:21:12.300 |
are gonna push you onto their enterprise tier 00:21:19.620 |
and so that's gonna push up the cost of using a provider. 00:21:23.060 |
But then it's also easier to run super like big batches. 00:21:35.900 |
Somewhat counterintuitively maybe for a software engineer 00:21:38.260 |
who's used to running like databases and web servers. 00:21:41.540 |
Just like the nature of GPUs is that it's easier to use them 00:21:49.860 |
these like batch and less latency sensitive workloads, 00:22:00.780 |
Replicate, Google Cloud Run, something like that. 00:22:04.740 |
Okay, so that's everything on like why you would do this, 00:22:18.720 |
I saw the chat had some activity, maybe check that out. 00:22:23.380 |
- No, I think we're just sort of adding color 00:22:37.700 |
Like I've already mentioned hardware and GPUs a lot, 00:22:43.140 |
Talk a little bit about like picking a model, 00:22:56.460 |
and then close out with thinking about observability 00:23:07.380 |
Of course, you'll be able to get it after the session. 00:23:13.500 |
Just use NVIDIA GPUs, don't have to go any further. 00:23:19.540 |
So Juliet wanted like a little bit more color and detail 00:23:26.420 |
is you need to take the like parameters of the model, 00:23:28.660 |
the weights, this like giant pile of floating point numbers. 00:23:35.620 |
You need to bring them into the place where compute happens. 00:23:42.580 |
Compute happens like on chip inside of like registers. 00:23:50.580 |
like pretty much every single weight needs to go in. 00:23:55.740 |
you can just look at how many gigabytes is that model file. 00:24:00.260 |
they're gonna need to move in to get computed on. 00:24:06.500 |
in one byte quantization, that's 8 billion weights. 00:24:27.100 |
through those weights to get out the next token. 00:24:30.340 |
On your first iteration, you're sending in the whole prompt. 00:24:40.540 |
In the process of like pushing something through the weights, 00:24:49.060 |
you wanna do two floating point operations per weight. 00:24:53.140 |
So that's like, you want to multiply the weight 00:24:55.200 |
with some number and then you're gonna add it to a running sum. 00:24:55.200 |
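That napkin math, written out (the 8B model and the bandwidth/FLOPS figures are rough illustrative numbers, not exact vendor specs):

```python
# Back-of-the-envelope version of the napkin math above.
params = 8e9            # an 8B-parameter model
bytes_per_weight = 1    # 8-bit (one-byte) quantization
weight_bytes = params * bytes_per_weight          # ~8 GB to move per token

mem_bandwidth = 900e9   # ~900 GB/s, roughly an L40S-class GPU
flops = 360e12          # ~360 TFLOPS of dense low-precision math, same class

# Decoding one token means streaming every weight through compute once...
tokens_per_sec_bandwidth_bound = mem_bandwidth / weight_bytes   # ~112 tok/s

# ...and doing roughly 2 FLOPs per weight (a multiply and an add).
flops_per_token = 2 * params
tokens_per_sec_compute_bound = flops / flops_per_token          # ~22,500 tok/s

# Bandwidth, not FLOPs, is the ceiling for single-stream decoding, which is
# why batching (reusing the same weight load for many requests) pays off.
print(tokens_per_sec_bandwidth_bound, tokens_per_sec_compute_bound)
```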
The core thing is being able to reason as an engineer 00:25:20.260 |
you don't have to be able to write a B-tree from scratch 00:25:22.460 |
on a chalkboard unless you're interviewing at Google. 00:25:34.420 |
I'm just trying to give you the like intuition you need 00:26:03.380 |
Like they have to, they go into where they get multiplied, 00:26:06.860 |
Because we're talking about like registers and caches here. 00:26:09.620 |
If you think of your like low level hardware stuff, 00:26:17.860 |
you should think of like running a sequential scan 00:26:20.200 |
on your database over and over and over again 00:26:25.660 |
So it's wild that we can even run it as fast as we do. 00:26:36.780 |
is like relatively simple control flow at the core. 00:26:41.180 |
So that makes it amenable to acceleration with GPUs. 00:26:45.340 |
GPUs have a bunch of, like if you look at the chip itself, 00:26:50.860 |
CPUs spend most of their space on like control flow logic. 00:26:55.620 |
And then caches that hide how smart the CPU is being 00:27:04.140 |
given over to the part that does like calculations, which 00:27:08.780 |
GPUs, on the other hand, are just all calculation. 00:27:20.500 |
And that-- because it doesn't need to hold 100 programs 00:27:25.780 |
And so that means you can really rip through a workload 00:27:31.900 |
like this one that has like relatively simple stuff, where 00:27:36.700 |
like zoom through doing simple math on a bunch of numbers. 00:27:44.060 |
because it works well for graphics, which also looks 00:27:49.380 |
Basically the same math on a bunch of different inputs, 00:27:57.180 |
for running language models and big neural networks. 00:28:06.540 |
The TLDR here is like the GPU is 100 duck-sized horses, 00:28:10.620 |
a bunch of tiny cores doing like very simple stuff. 00:28:14.940 |
And that wins out over the one horse-sized duck 00:28:18.940 |
that is the CPU that you're used to programming and working 00:28:26.620 |
which is like if you're looking at a top-tier GPU, 00:28:28.780 |
one of the things that makes the top-tier ones really good, 00:28:31.740 |
like an H100, is that they have soldered the RAM 00:28:35.220 |
onto the chip, which is not something you normally do. 00:28:39.980 |
But it gives you much faster communication, lower latency, 00:28:43.980 |
higher throughput, which is really important. 00:28:55.740 |
So the TLDR here is that it's like NVIDIA inference GPUs 00:29:01.060 |
from one or two generations back are what you probably 00:29:13.640 |
adding things like past sequences you've run on 00:29:19.900 |
And so when you're looking at buying GPUs yourself 00:29:22.380 |
for rent or which ones to rent from the cloud, 00:29:33.940 |
to go from high-precision floating-point numbers 00:29:43.420 |
And they make it easier to move the things in and out 00:29:46.980 |
of memory and into where the compute happens. 00:29:49.620 |
So the thing you want is a recent but not bleeding-edge 00:29:55.460 |
So most recent GPUs from NVIDIA are the Blackwell architecture. 00:30:01.780 |
your local neighborhood GPU, and then the Blackwell B200s 00:30:13.860 |
don't get the full speedup that you'd like because people 00:30:23.260 |
And then they're really hard to get a hold of and expensive. 00:30:28.300 |
behind whatever OpenAI and Meta are training on. 00:30:40.140 |
And then Lovelace GPUs like the L40S that I ran my demo on, 00:30:50.740 |
is the more inference-oriented data center GPU. 00:30:58.180 |
NVIDIA doesn't really let people put your friendly local GPU, 00:31:02.460 |
the same one you can buy locally and put in your own machine. 00:31:18.580 |
is one that's less focused on connecting a whole shitload 00:31:38.700 |
on just having one reasonably sized effective individual GPU. 00:31:47.860 |
For a while, the H100, which is really more of a training GPU, 00:31:53.860 |
I think, yeah, just because the L40S was relatively mature. 00:31:58.240 |
If your model's small, if you're running a small model, 00:32:01.020 |
like a modern BERT or one of the 3 billion or 1 billion parameter models, 00:32:07.540 |
you can get away with running it even a generation further back. 00:32:12.740 |
The Ampere A10 is a real workhorse GPU, 00:32:18.660 |
You can transparently scale up to thousands of those on Modal 00:32:27.500 |
Just a quick-- since NVIDIA is in the news these days, 00:32:35.820 |
AMD and Intel GPUs are still catching up on performance. 00:32:42.940 |
on the side that says FLOPS, and the AMD GPUs look good. 00:32:50.780 |
There's a great post from Dylan Patel and others 00:32:54.860 |
that's SemiAnalysis, just ripping on the AMD software 00:33:02.980 |
It's like, we can maybe either write the software ourselves 00:33:19.500 |
There are other accelerators that are designed, 00:33:21.700 |
unlike CPUs, for super high throughput and low memory 00:33:27.260 |
The TPU is the most mature one, the Tensor Processing Unit. 00:33:34.140 |
And the software stack is pretty decent for them, actually, 00:33:36.720 |
like JAX, which can be used as a backend for PyTorch. 00:33:42.220 |
But like many things in Google, the internal software for it 00:33:49.300 |
And you're second in line behind their internal engineers 00:34:02.180 |
At that point, you're kind of not running your own LM 00:34:05.220 |
You're having somebody else run it as a service for you 00:34:15.360 |
the word I'm looking for-- cost-effectively as well. 00:34:21.960 |
I would say any of the other accelerators you see 00:34:29.720 |
NVIDIA has a very thick stack of water-cooled network cards 00:34:38.340 |
to take a long time for anybody to catch up there. 00:34:46.620 |
So I expect a lot of innovation in this space, 00:34:52.460 |
Last thing I'll say is the startup that I work on, 00:34:58.420 |
So a lot of it-- this is high-performance computing 00:35:06.180 |
you know that heterogeneous compute makes you cry. 00:35:14.580 |
just like add Python decorators, get stuff to run on GPUs. 00:35:18.020 |
This is real code that our CEO ran to test our H100 scaling, 00:35:30.860 |
And this is all the code that you need to run that. 00:35:34.500 |
In our enterprise tier, this would scale up to 500 H100s 00:35:46.460 |
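This isn't the actual scaling test, but a sketch of the shape of that kind of Modal code, assuming a hypothetical GPU-checking task:

```python
# Decorate a function, request a GPU, fan it out across containers.
import modal

app = modal.App("h100-sweep")

@app.function(gpu="H100", timeout=600)
def check_gpu(i: int) -> str:
    import subprocess
    name = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip()
    return f"task {i}: {name}"

@app.local_entrypoint()
def main():
    # Each element is scheduled onto its own container/GPU as capacity allows.
    for line in check_gpu.map(range(16)):
        print(line)
```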
OK, so that's everything I want to say on hardware. 00:36:22.180 |
How hard could it be to build a semiconductor foundry? 00:36:28.460 |
if you aren't going to use the money for good stuff? 00:36:30.300 |
Anyway, I'm sure they have great reasons for this. 00:36:35.780 |
Anyway, I won't go on any more tangents there. 00:36:38.420 |
But DM me on Twitter if you want to talk more about this. 00:36:51.300 |
So if you're interested in this stuff, check it out. 00:36:55.940 |
the intuition for this hardware and a little bit of debugging 00:37:00.900 |
on the software stack because most people didn't encounter 00:37:07.380 |
education, their boot camp, or their working experience so far. 00:37:17.220 |
So what is the actual model we're going to run? 00:37:21.420 |
My one piece of advice that I'm contractually obligated to give: 00:37:25.740 |
before you start thinking about, oh, what model am I going to run? 00:37:36.380 |
and you have evals, an ability to evaluate whether the-- 00:37:58.780 |
take you to write that out with ground truth answers? 00:38:01.820 |
If it takes you an hour, put on your jams and do it. 00:38:09.820 |
Just listen to Brat and write out 10 or 50 evals. 00:38:14.380 |
Just because it's kind of like test-driven development, 00:38:16.900 |
where everybody says write the tests and then 00:38:20.860 |
But in this case, with test-driven development, 00:38:23.620 |
one reason people don't do it is because they 00:38:29.540 |
I know all the different ways this code could misbehave. 00:38:39.940 |
But in this case, nobody is good at predicting 00:38:45.980 |
being able to check is this actually improving things 00:38:50.500 |
So do this, even just 10 things in a notebook. 00:38:53.540 |
Don't go and buy an eval framework to do this. 00:38:56.020 |
Just find a way to run models in the terminal in a notebook 00:39:00.300 |
that helps you make these decisions like an engineer, 00:39:08.700 |
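A minimal sketch of what those ten-evals-in-a-notebook can look like, assuming an OpenAI-compatible endpoint and placeholder prompts, model name, and checks:

```python
# Point it at whatever server you're running (vLLM and most inference
# servers speak the OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

EVALS = [
    # (prompt, substring the answer must contain)
    ("What file format does llama.cpp use for quantized weights?", "GGUF"),
    ("What does KV stand for in KV cache?", "key"),
    # ... eight more of these is already a useful eval set
]

def grade(prompt: str, expected: str) -> bool:
    resp = client.chat.completions.create(
        model="my-model",          # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    return expected.lower() in resp.choices[0].message.content.lower()

score = sum(grade(p, e) for p, e in EVALS) / len(EVALS)
print(f"pass rate: {score:.0%}")
```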
OK, so model options here are still, I would say, 00:39:15.140 |
because it's starting to feel like we have options. 00:39:17.780 |
Meta's Llama model series is pretty well-regarded 00:39:23.460 |
So if I'm an engineer thinking about which open source 00:39:26.200 |
software am I going to build on, I actually think about that 00:39:28.700 |
a lot more so than raw capabilities a lot of the time. 00:39:34.060 |
And the key thing here is there's a pretty big community 00:39:37.260 |
building on Llama, making their software work really well 00:39:42.820 |
that you would otherwise have to do yourself. 00:39:53.300 |
So they squish them down so they're a lot smaller. 00:40:00.020 |
Nous Research does a lot of fine-tuning of models 00:40:09.020 |
And Arcee AI will mush together five different Llamas 00:40:12.860 |
to make one Penta Llama that weirdly works better 00:40:19.380 |
And then you don't have to do any of that yourself. 00:40:23.760 |
you can expect there will be continued investment in it. 00:40:26.300 |
Meta's been great about open source in other places, 00:40:46.140 |
to catch people's attention, the other one being Qwen. 00:40:50.660 |
There's slightly less tooling and integration 00:41:03.300 |
that the open source initiative would put their stamp on. 00:41:07.220 |
But the Llama model is under a proprietary license that 00:41:12.220 |
says, for example, if you're Amazon or Google, 00:41:19.860 |
And a couple other things that make it less open, 00:41:22.580 |
slightly less open, might make your lawyers nervous. 00:41:29.340 |
to go MIT, inshallah that will happen with Llama 4. 00:41:37.220 |
You might see a shitty model come out of a model training 00:41:39.940 |
team, or sorry, you might see a non-state-of-the-art model come 00:41:47.180 |
It's just that it takes a long time to get really good. 00:41:53.340 |
been putting out some good models with the OLMo series 00:41:57.620 |
Microsoft's been doing their small language models with Phi. 00:42:06.700 |
Maybe in the future, the enterprise cloud homies 00:42:13.860 |
Mostly, Arctic and DBRX are fun for research reasons 00:42:20.620 |
But yeah, that's kind of a small number of options. 00:42:23.860 |
A little bit more like databases in the late '90s, early 2000s 00:42:27.180 |
than databases today, where everybody and their mother 00:42:45.900 |
So by default, floats are 32 or 64 bits, like integers are. 00:42:52.860 |
Digital computers that you're used to programming 00:43:02.980 |
He made this basically a clock that was a computer. 00:43:08.060 |
And I think this is an AND gate or an XOR gate. 00:43:12.540 |
So it only moves if one of the two plates on one side 00:43:30.660 |
I think this is artillery trajectory calculations. 00:43:37.020 |
that the ball is rolling around by changing the gears. 00:43:50.220 |
to abstract it away and make it all ones and zeros 00:43:55.300 |
Neural networks are way more like these analog computers. 00:44:05.660 |
It's never going to be exactly the same with an analog system 00:44:14.780 |
Whereas you change one bit in a digital computer, 00:44:28.860 |
So this is the reason why you can aggressively 00:44:34.020 |
that you can't do with lossily compressing, I don't know, 00:44:38.780 |
If you quantized every byte in Postgres down to 4 bits, 00:44:46.140 |
So this quantization is really key for performance. 00:44:59.500 |
Weight quantization only, that means just make 00:45:01.900 |
the model itself smaller, makes it smaller in memory. 00:45:06.540 |
And then that whole thing about moving it in and out of compute 00:45:12.720 |
And then that doesn't actually quantize the math. 00:45:15.020 |
The actual math that happens still happens at 16-bit, 32-bit. 00:45:20.700 |
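Rough numbers for what weight-only quantization buys you in memory, using a 70B-parameter model as an illustration:

```python
# Purely illustrative; real formats add some overhead for scales/zero-points.
params = 70e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("ternary", 1.58)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name:>8}: ~{gigabytes:.0f} GB of weights to store and move")

# fp16 ~140 GB won't fit on a single 80 GB card; int4 ~35 GB fits comfortably.
# The math itself may still run in 16/32-bit unless you also quantize activations.
```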
To do activation quantization requires more recent GPUs, 00:45:24.820 |
sometimes requires special compilation flags. 00:45:27.380 |
It's not always the case that the operation you want 00:45:30.020 |
to speed up already has a kernel written for you 00:45:32.780 |
by Tri Dao or some other wizard to make the GPU go at full speed. 00:45:44.540 |
"Give Me BF16 or Give Me Death," question mark, is a good paper. 00:45:54.620 |
Evals help you decide whether the quantization is hurting. 00:45:57.820 |
So I was running DeepSeek R1 in ternary, actually. 00:46:07.540 |
There's no way the full model performance or anything 00:46:13.660 |
lost the thing that made you pick the model in the first place. 00:46:34.100 |
to help you scale up your own taste in models and intuition. 00:46:58.460 |
with prompting and really control flow around models. 00:47:02.500 |
I don't know, DeepSeek R1 writes Python code. 00:47:07.260 |
Take the code, run it, take the error message, pipe it back in. 00:47:10.900 |
So writing things around models, instead of fine tuning it 00:47:19.300 |
You're hard to compete with them on a lot of this stuff. 00:47:22.540 |
So managing prompts and managing control flow around models 00:47:35.340 |
So definitely start with just prompting, retrieval, et cetera. 00:47:39.220 |
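A toy sketch of that write-code-run-it-pipe-errors-back loop, assuming an OpenAI-compatible endpoint and a placeholder model name (a real version needs a sandbox, not a bare exec):

```python
# Ask for Python code, run it, and feed any traceback straight back in.
import traceback
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def solve_with_retries(task: str, attempts: int = 3) -> str:
    messages = [{"role": "user", "content": f"Write Python code that {task}. Reply with code only."}]
    for _ in range(attempts):
        code = client.chat.completions.create(
            model="my-model", messages=messages, max_tokens=500
        ).choices[0].message.content
        try:
            exec(compile(code, "<llm>", "exec"), {})   # unsafe outside a sandbox
            return code                                 # it ran; good enough here
        except Exception:
            # Pipe the error back in and ask for a fix.
            messages += [
                {"role": "assistant", "content": code},
                {"role": "user", "content": f"That raised:\n{traceback.format_exc()}\nFix it."},
            ]
    raise RuntimeError("model never produced runnable code")
```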
Yeah, I want to make sure to talk about the inference 00:47:46.140 |
frameworks and what serving inference looks like. 00:47:52.180 |
requires a ton of thought and effort on optimization. 00:47:55.860 |
This is not something you can sit down and write yourself, 00:48:12.900 |
So the current core of the stack that's most popular 00:48:18.700 |
So PyTorch is a combo of a Python steering library 00:48:24.180 |
and then a C++ internal library and libraries 00:48:29.940 |
for doing all the hard shit, including CUDA C++, 00:48:42.420 |
Don't get excited and rewrite that part in Rust. 00:48:45.980 |
You're going to find out that that didn't help you that much. 00:48:48.620 |
There's some features that make it easier to write Torch 00:49:05.140 |
most of the speed up of writing a bunch of custom stuff. 00:49:12.420 |
to build on top of raw matmuls, like the stuff that showed up 00:49:15.740 |
in my napkin math diagram to serve inference fast. 00:49:26.620 |
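One of the features being alluded to is, plausibly, torch.compile; a minimal sketch of what that looks like:

```python
# Get fused, generated kernels out of plain PyTorch without writing CUDA.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda().half()

compiled = torch.compile(model)            # traces and generates fused kernels
x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
with torch.no_grad():
    y = compiled(x)                        # first call compiles, later calls are fast
```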
There's continuous batching is this smart stuff 00:49:29.660 |
for rearranging requests as they're on the way. 00:49:32.180 |
Speculative decoding is a way to improve your throughput 00:49:39.780 |
So you don't want to build all this just for yourself. 00:49:47.500 |
This is a don't roll your own case rather than a don't 00:49:54.260 |
like the classic the two genders in engineering. 00:49:59.380 |
So I would strongly recommend the vLLM inference server 00:50:06.660 |
So like Postgres, vLLM started as a Berkeley academic project. 00:50:11.380 |
They introduced this thing called paged attention, 00:50:13.780 |
paged KV caching, and then kind of ran with it from there. 00:50:19.740 |
There's performance numbers, and we can talk about them, 00:50:25.780 |
People are gunning to beat them on workloads. 00:50:29.540 |
You have to run it to decide whether you agree. 00:50:36.060 |
They really won mindshare as the inference server, 00:50:39.020 |
and so they've attracted a ton of external contributions. 00:50:47.260 |
IBM, basically exclusively to support their work on vLLM. 00:50:56.620 |
from Anyscale, IBM, a bunch of people contributing stuff. 00:51:01.060 |
And that's really important for open source success. 00:51:07.740 |
between otherwise competing private organizations, 00:51:11.820 |
whether they're nonprofit or for profit or whatever. 00:51:19.700 |
like that once it's held that crown for a while. 00:51:22.700 |
It's not undislodgable yet, so it's not quite like Postgres, 00:51:27.820 |
and feel pretty like that's been around for 30 years, 00:51:36.140 |
like pip-installable once you have your GPU drivers. 00:51:41.660 |
which NVIDIA has refused to do with TensorRT-LLM and Triton. 00:51:47.100 |
So it's got a bunch of nice features and good performance. 00:51:54.380 |
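A minimal sketch of basic vLLM usage, with a placeholder model name; continuous batching and paged KV caching happen inside the engine without extra code from you:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize why batching helps GPU utilization.",
     "Explain paged KV caching in two sentences."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```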
is NVIDIA's offering: ONNX, TensorRT, TensorRT-LLM, 00:52:06.340 |
And it's under, I forget, either Apache or MIT license. 00:52:12.540 |
you'll see that it updates in the form of one 10,000 line 00:52:16.380 |
commit with 5,000 deletions every week or two that 00:52:26.380 |
Pretty hard to-- you don't get input on the roadmap. 00:52:50.020 |
They have this nice interface for prompt programming 00:53:00.740 |
you win when you can draw the most contribution. 00:53:03.340 |
So I feel like even if SGLang is winning over vLLM 00:53:07.620 |
in certain places currently, I doubt that that will persist. 00:53:14.740 |
My impression was that they're both from Berkeley, 00:53:16.860 |
and I thought basically SGLang is kind of the new generation 00:53:26.060 |
I don't think they've attracted the same degree 00:53:28.180 |
of external contribution, which is important. 00:53:37.780 |
so I should maybe bump SGLang up to its own part. 00:53:42.700 |
If you're going to be running your own inference, 00:53:45.260 |
this is a high-performance computing workload. 00:53:52.680 |
and can take you from hundreds of dollars a megatoken 00:53:59.740 |
So you will need to debug performance and optimize it. 00:54:04.340 |
And the only tool for doing that is profiling. 00:54:17.100 |
and which ones you should use on your workload, 00:54:28.620 |
That's kind of like what vLLM integrates with. 00:54:30.740 |
There's also NVIDIA Nsight, both for creating and viewing 00:54:35.260 |
That's their slightly more boomery corporate performance 00:54:40.340 |
It's got a lot of nice features, though, can't lie. 00:54:43.140 |
But yeah, it's the same basic tracing and profiling stuff, 00:54:49.340 |
except there's work on the CPU and on the GPU, 00:54:55.660 |
if you're thinking about this a lot, running a tracer 00:55:00.100 |
and just looking at the trace a couple of times for PyTorch, 00:55:04.260 |
vLLM, whatever, just because you learn a ton from looking 00:55:22.140 |
That's where I start, and then I go back to the source code 00:55:28.660 |
a mental model of a programming model and concurrency 00:55:32.740 |
implications, et cetera, just from reading source code. 00:55:36.900 |
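A first-pass sketch of tracing a PyTorch workload with the built-in profiler (the model here is a stand-in; the point is the trace file you go stare at a couple of times):

```python
# Export the trace and open it in Perfetto / chrome://tracing to see where
# host and GPU time go.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda().half()
x = torch.randn(64, 4096, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
```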
Humans were meant to observe processes in evolution, not 00:55:47.320 |
We also have some demos for how to run this stuff on Modal 00:55:53.620 |
As a first pass for GPU optimization for, OK, 00:56:01.260 |
Very first pass is this number, GPU utilization. 00:56:13.900 |
would see that this utilization number is really low, like 20%. 00:56:17.220 |
That means the host is getting in the way a lot 00:56:22.620 |
This is not like model FLOPS utilization, or what fraction 00:56:28.980 |
of the FLOPS NVIDIA quoted you that you're getting. 00:56:35.220 |
Is the GPU running what fraction of the time? 00:56:39.980 |
Like, this is-- yeah, that's an attainable goal, 95% to 99%. 00:56:47.380 |
Unlike CPU utilization, that's not a problem. 00:56:51.420 |
So GPU utilization here is like a first check. 00:56:54.020 |
Problem is, just because work is running on a GPU 00:57:01.060 |
So the two other things to check are power utilization 00:57:05.700 |
Fundamentally, GPUs are limited by how much power 00:57:20.020 |
So you want to see power utilization 80% to 100%. 00:57:26.620 |
And you want to see GPU temperatures running high 60 00:57:29.980 |
Celsius for the data center GPUs, maybe low 70s, 00:57:35.340 |
but pretty close to their thermal design power, 00:57:37.500 |
maybe 5 to 10 degrees off of the power at which NVIDIA says, 00:57:50.260 |
really good use of the GPU, whereas this GPU utilization, 00:57:58.580 |
It's like two GPUs are both expecting the other 00:58:01.140 |
to send a message, like two polite people trying 00:58:07.540 |
because they're both being like, waiting for that message, dog. 00:58:22.180 |
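A sketch of reading those three quick checks (utilization, power, temperature) with the NVML Python bindings, assuming GPU index 0:

```python
# pip install nvidia-ml-py. This only tells you the GPU is busy, warm, and
# drawing power, not that the kernels themselves are any good.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu          # % of time busy
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000          # current draw, W
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000  # board limit, W
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

print(f"util {util}%  power {power_w:.0f}/{limit_w:.0f} W  temp {temp_c} C")
pynvml.nvmlShutdown()
```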
But it is something to watch out for and why, on Modal, 00:58:26.260 |
I learned Rust in order to be able to add these 00:58:39.420 |
since it was in the title, conscious of time. 00:58:54.140 |
So fine tuning means taking the weights of the model 00:58:56.740 |
and using data to customize them, not via RAG, 00:59:07.300 |
If you can take the capabilities that an API has 00:59:17.260 |
frequently, you don't need all the things like GPT. 00:59:20.420 |
The big models know the name of every arrondissement in France 00:59:42.500 |
I think of this a bit like a Python to Rust rewrite. 00:59:45.820 |
You start off when you aren't sure what you need. 00:59:48.020 |
You write in Python because it's easy to change, 00:59:52.100 |
and switching between proprietary model providers 00:59:57.460 |
But then once you really understand what you're doing, 00:59:59.660 |
you rewrite it in Rust to get better performance. 01:00:05.540 |
to be more maintenance work and harder to update, yada, yada, 01:00:08.500 |
but it's going to be 100x cheaper or something. 01:00:17.780 |
of technical debt, feature velocity, cost of engineers, 01:00:28.060 |
that will help you steal capabilities as a service. 01:00:39.340 |
in the voice of a pirate and never break kayfabe, 01:00:44.220 |
Relatively small amounts of data can do that. 01:00:49.060 |
That's usually better to do search or retrieval, which 01:00:51.700 |
is what people call RAG, like get the knowledge from somewhere 01:01:02.020 |
as good as it needed to be a year and a half ago. 01:01:08.180 |
The holy grail would be for you to define a reward function 01:01:12.820 |
of what does it mean for this model to do well. 01:01:14.780 |
Maybe that's customer retention, NPS, whatever. 01:01:19.860 |
And then you could do ML directly on those rewards 01:01:27.580 |
Then you could just sit back and monitor that RL system. 01:01:32.380 |
And then you would magically make that reward number go up. 01:01:38.860 |
The problem is there's a large gap between the things you 01:01:41.700 |
want to improve, and the things that you can actually measure, 01:01:44.540 |
and the things that you can provide to a model, 01:01:51.260 |
They need to be exactly what you want to maximize. 01:01:55.860 |
When you do ML, ML is like paperclip maximization. 01:01:58.660 |
It's like, you told me to make this number go up. 01:02:02.160 |
Imagine the brooms from "The Sorcerer's Apprentice." 01:02:18.020 |
where they trained a model to drive a boat in this boat 01:02:30.660 |
is collect these little pips and finish a race. 01:02:33.900 |
If you want to score max, what you actually want to do 01:02:36.180 |
is find this tiny little corner and slam against the wall 01:02:38.620 |
repeatedly, picking up this bonus item that respawns, 01:02:42.620 |
and just slamming against the wall over and over again 01:02:53.740 |
More like a speed runner playing a video game 01:02:58.380 |
So imagine this, but with your customer support. 01:03:01.180 |
Great way to get customers to give a 10 on an NPS 01:03:07.540 |
your machine is locked down until you put a 10 on our NPS. 01:03:20.700 |
It's kind of the long-term direction we're going. 01:03:26.980 |
Where we are today is really more like stealing capabilities 01:03:34.580 |
So the main reason fine-tuning can save costs, 01:03:38.380 |
can improve performance, why shouldn't you do it? 01:03:43.980 |
Running inference is mostly normal software engineering 01:03:47.220 |
with some fun spicy bits-- GPUs, floating point numbers. 01:03:50.900 |
But machine learning is a whole different beast. 01:03:56.180 |
in common with hardware and with scientific research. 01:04:01.060 |
You've got non-determinism of the normal variety. 01:04:04.780 |
On top of that, there's epistemic uncertainty. 01:04:08.100 |
We don't understand the optimization process. 01:04:12.580 |
which is much worse in machine learning than elsewhere. 01:04:15.060 |
You've got to maintain a bunch of data pipelines. 01:04:17.340 |
No one's favorite form of software engineering. 01:04:19.460 |
This is a high-performance computing workload. 01:04:24.860 |
Like, yeah, high-performance computing sucks. 01:04:27.180 |
There's a reason why only the Department of Energy does it. 01:04:38.500 |
It's written by people like me with scientific background. 01:04:53.020 |
And somebody can maybe pull a "New York Times" article 01:05:03.140 |
is going to require some Supreme Court rulings and so on 01:05:08.860 |
Yeah, and when Mercury is in retrograde, your GPUs run slower. 01:05:14.260 |
a lot of complexity that's very hard to get an engineering 01:05:18.460 |
So if you can solve it in literally any other way, 01:05:22.180 |
Think of ways you can solve this problem without fine tuning. 01:05:26.020 |
What program control flow can you put around a model? 01:05:32.980 |
because you're using an ML model to mimic an ML model. 01:05:41.140 |
Like, there's a notion of a data-generating process. 01:05:43.740 |
In the real world, that's like the climate of the planet 01:06:01.900 |
To do this, you're going to need even more high-performance 01:06:04.780 |
I focused on running models at the beginning. 01:06:15.580 |
is you run a program forwards, and then you flip it around 01:06:19.740 |
So that puts a lot of extra pressure on memory. 01:06:22.620 |
Then you also, during training, you want lots of examples 01:06:25.420 |
so the model doesn't learn too much from one specific example. 01:06:29.140 |
And you also want large batches to make better use 01:06:32.260 |
of the big compute and to make better use of all 01:06:42.560 |
requires some extra tensors that are the size of or larger 01:06:47.500 |
Sorry, some arrays, some extra arrays of floating-point 01:06:49.820 |
numbers that are at least the size of the model parameters 01:06:54.860 |
So you've got gradients and optimizer states. 01:06:57.300 |
These are basically like 2 to 10 extra copies of the model 01:07:05.300 |
can't get around the fact that a lot of this stuff 01:07:09.540 |
So you're going to need eight 80-gigabyte GPUs, or 32 01:07:17.020 |
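A back-of-the-envelope version of that memory math, assuming full fine-tuning of an 8B-parameter model with Adam in standard mixed precision:

```python
# Activations and framework overhead come on top of this, so treat it as a floor.
params = 8e9

weights_fp16 = params * 2          # the model itself
grads_fp16   = params * 2          # one gradient per weight
master_fp32  = params * 4          # fp32 copy of the weights
adam_m_fp32  = params * 4          # Adam first moment
adam_v_fp32  = params * 4          # Adam second moment

total_gb = (weights_fp16 + grads_fp16 + master_fp32 + adam_m_fp32 + adam_v_fp32) / 1e9
print(f"~{total_gb:.0f} GB before activations")   # ~128 GB, already > one 80 GB GPU
```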
And yeah, the software for that is pretty hard, 01:07:22.220 |
I already talked about how hard machine learning is. 01:07:25.060 |
It's like there are software engineering practices that 01:07:30.980 |
I worked on experiment tracking software, Weights & Biases. 01:07:37.900 |
It's like when I was training models, the thing I wanted 01:07:40.300 |
was being able to store voluminous quantities of data 01:07:46.380 |
Tons of metrics, gradients, inputs, outputs, loss values. 01:07:53.340 |
want to keep track of on top of very fast-changing code 01:08:07.020 |
from which you can calculate the thing that reveals your bug. 01:08:10.140 |
This is actually, I would say, like Honeycomb, 01:08:13.020 |
their approach to observability is very similar. 01:08:15.140 |
This is like observability for training runs. 01:08:17.580 |
Observability is like recording enough about your system 01:08:19.940 |
that you can debug it from your logs without having to SSH in. 01:08:38.100 |
Yeah, so TensorBoard, you have to run TensorBoard yourself. 01:08:53.300 |
because it hits me personally, or maybe happiest, 01:08:56.260 |
because I'm a shareholder in Weights & Biases. 01:08:58.860 |
But yeah, so yeah, TensorBoard is really good 01:09:06.300 |
It's bad at collaboration and bad at large numbers 01:09:10.300 |
Other experiment tracking workflows that have gotten more-- 01:09:15.340 |
have gotten more love, like the venture-backed ones 01:09:18.180 |
or the open source ones, are better for that. 01:09:23.580 |
So you can-- I would say a lot of software engineers 01:09:28.460 |
and are pretty disgusted to discover the state of affairs. 01:09:38.740 |
And you should push people to up their SWE standards. 01:10:02.700 |
is build internal libraries in normal code files, 01:10:10.820 |
so that you can poke prod, run ad hoc workflows, et cetera. 01:10:15.660 |
And then as soon as something in a Jupyter Notebook 01:10:20.540 |
pull that out into your utils.py, at the very least, 01:10:33.220 |
Anyway, full-stack deep learning course I taught in 2022 01:10:36.740 |
still has the basics of how to run ML engineering. 01:10:43.960 |
And back then, we were talking about training from scratch, 01:10:46.420 |
because the foundation model era was only beginning. 01:10:49.180 |
But the basic stuff in there, like the YouTube videos, 01:10:52.100 |
the lecture-level stuff, is all still, I would say, 01:11:08.100 |
The main point is the eventual goal with any ML feature 01:11:12.140 |
is to build a virtuous cycle, a data flywheel, a data engine, 01:11:18.100 |
something that allows you to capture user data, annotate it, 01:11:21.140 |
collect it into evals, and improve the underlying system. 01:11:23.780 |
This is like-- if you're running your own LM inference, 01:11:28.320 |
this thing truly better than what you could get elsewhere 01:11:31.060 |
is building your own custom semi-self-improving system, 01:11:42.900 |
There's some specialized tooling for collecting this stuff up, 01:11:50.300 |
You can see Sean's recent conversation with Sean 01:11:54.460 |
from Weights and Biases on how he used Weave, 01:12:02.020 |
>>Then Thomas came on Thursday and went over Weave. 01:12:06.940 |
OK, yeah, that's pure product on Weave, plus Sean-- 01:12:17.020 |
and did an hour and a half and change on Weave. 01:12:21.780 |
So I would say Weave is really good for this offline evals, 01:12:25.300 |
which is collect up a data set, kind of run code on it. 01:12:31.260 |
And this is very much how an ML engineer approaches 01:12:34.620 |
evaluation, coming from academic benchmarking, 01:12:41.300 |
I don't know if you're going to have anybody from LangChain 01:12:45.940 |
who are also building these observability tooling. 01:13:01.540 |
LangSmith is very open-ended, the tool from LangChain, 01:13:04.940 |
as are a lot of the other observability tooling-- 01:13:08.820 |
or sorry, these more online eval-oriented things. 01:13:16.940 |
And it's about a living database of all the information 01:13:20.820 |
you've learned about your users, your problem, 01:13:25.740 |
And so it's this very dynamic, active artifact, 01:13:32.740 |
I think the more you need input from people who are not 01:13:38.420 |
it's producing medical traces, and you are not a doctor. 01:13:41.900 |
As opposed to producing code, and you are a programmer, 01:13:44.940 |
then being able to bring in more people is more helpful. 01:13:47.900 |
And so there's utility to these more online-style things. 01:13:52.420 |
You can also actually build this stuff yourself. 01:13:57.180 |
don't know that much more about running these models than you 01:14:01.540 |
And the workflows are not really set down for this. 01:14:11.420 |
has good ideas baked into it and will teach you to be better. 01:14:18.080 |
a.k.a. the provide free engineering and design 01:14:20.420 |
work for somebody you're also paying for a service phase. 01:14:26.580 |
So if you have a good internal data engineering 01:14:30.660 |
team that is good at, say, an open telemetry integration, 01:14:34.980 |
would love to set up a little ClickHouse instance 01:14:47.740 |
You can build your own with something like this. 01:14:49.740 |
And then the front end people can hack on the experience. 01:14:57.780 |
because Hex has both really incredible internal data 01:15:00.820 |
engineering and they're a data notebook product. 01:15:09.740 |
to be able to do that, but it's like a bigger fraction 01:15:17.820 |
More tilted in the build direction than the buy. 01:15:31.820 |
Modal is the infrastructure provider that I'm working on. 01:15:44.460 |
I was talking about how much I liked it on social media, 01:15:47.020 |
and they're like, what if we paid you to do this? 01:15:52.140 |
Please don't pay me to do it, because then people 01:15:57.580 |
Now I work at Modal, and they pay me to say this. 01:16:01.460 |
The same thing I was saying before, which is Modal is great. 01:16:05.100 |
It's like, you pay for only the hardware you use. 01:16:15.020 |
all the infrastructure is built from the ground up in Rust, 01:16:24.820 |
with Sean, that completely separate from learning 01:16:31.020 |
It's just like, gain 10 IQ points, or 10 levels 01:16:36.300 |
in computer infrastructure from hearing the story, 01:16:48.500 |
And then, unlike other serverless-GPU-in-the-narrow-sense 01:16:52.860 |
providers, Modal has code sandboxes, web endpoints, 01:16:57.020 |
makes it easy to stand up a user interface around your stuff. 01:17:00.140 |
So that's why I ended up going all in on Modal. 01:17:03.340 |
It was like, wow, not only does this run my models, 01:17:05.900 |
but I learned how to properly use FastAPI from Modal's 01:17:19.300 |
So that can be for running your fine-tuning jobs, 01:17:23.260 |
if you've decided you want to distill models yourself. 01:17:27.820 |
to be able to scale up and down, and handle changing inference 01:17:41.660 |
creating data to help you observe your system 01:17:52.820 |
infrastructure that doesn't require a PhD in Kubernetes. 01:18:13.580 |
so everyone that got the-- and also everyone, 01:18:16.780 |
Charles is the person that we talked to to get the Modal 01:18:20.900 |
So everyone, a big, big thank you to Charles for that. 01:18:25.980 |
this cohort builds three projects, all of which 01:18:29.220 |
are built off of FastAPI that lives in Modal. 01:18:42.460 |
There's a decent chance you'll get co-founder support 01:18:49.220 |
And yeah, hopefully you've been pointed to the examples page, 01:18:57.280 |
I slave to ensure that those things run end-to-end. 01:19:04.300 |
They're continuously monitored and run stochastically 01:19:22.100 |
And so yeah, if you want any help with those, 01:19:37.060 |
I should talk to you sometime, how you set all of that up. 01:19:40.380 |
I ran through the ComfyUI workflow a couple of days ago. 01:19:49.180 |
I just pulled down an example from the internet 01:19:51.220 |
and just ran the command that it said to run. 01:19:55.680 |
There's always some other thing I have to do. 01:20:01.060 |
Yeah, part of it is that as an infrastructure product, 01:20:07.660 |
is the differences between infrastructure and like, 01:20:10.100 |
oh, well, that will only run if you set this LDFLAGS thing 01:20:23.700 |
And my machine is Modal, which you can also run them on. 01:20:31.160 |
able to share things that run on Modal within your team, 01:20:42.740 |
and this is actually like an engineering trick 01:20:44.660 |
that is surprised it took me this long to learn. 01:20:46.660 |
It's like there's tests and there's monitoring. 01:20:54.620 |
Like, yeah, require a bunch of disgusting mocking 01:20:57.820 |
that breaks as often as the actual code does. 01:21:05.500 |
So yeah, that's an important trick for the modal examples, 01:21:10.420 |
but also for all the things you would maybe run using-- 01:21:14.020 |
as part of running your own language model inference 01:21:18.020 |
It's like, monitor the shit out of this thing. 01:21:24.460 |
Well, before we let Charles go, does anybody have any questions? 01:21:27.740 |
I know I'm sure given everyone's background here, 01:21:32.180 |
feels very full with all of the hardware architecture 01:21:40.060 |
>>I think-- I'll kill time while people ask questions. 01:21:47.260 |
But I think that it's always intimidating for people 01:21:51.300 |
sort of running their own models and fine-tuning them. 01:21:56.540 |
I'm just like, what's a really good first exercise 01:21:59.820 |
that you could-- probably you have some tutorials on modal 01:22:02.940 |
that you would recommend people just go through. 01:22:10.300 |
have a MacBook M2 or later with at least 32 gigabytes of RAM, 01:22:21.860 |
So that turns out to be actually a really incredible machine 01:22:28.820 |
that we talked about, like moving the bytes in and out 01:22:35.980 |
like that was the first thing I did back when you 01:22:41.420 |
That running it locally-- and there's good tools out there 01:22:48.140 |
You can also use the same thing you would run on a cloud server 01:22:57.020 |
to get started with running some of your own inference. 01:22:59.420 |
And then the cost is amortized more effectively. 01:23:03.140 |
And you can use it for other stuff, the computer 01:23:07.620 |
So that's actually probably my-- it's bad Modal marketing 01:23:12.020 |
But I would say people like to be able to poke and prod. 01:23:15.100 |
If you don't already know Modal-- I know Modal well enough 01:23:19.960 |
that it's easier to use Modal to try these things than to run it on my MacBook. 01:23:24.660 |
And everybody knows how to use a command line as part 01:23:35.540 |
For fine-tuning, I would say distilling a model 01:23:47.300 |
it's like fine-tuning something on somebody's Slack messages 01:23:56.900 |
and it teaches you some things about the software 01:24:06.940 |
it means to fine-tune in pursuit of a specific objective, 01:24:18.460 |
I did insert a little comment about what distillation means. 01:24:23.980 |
kind of view training on output of GPT-4 as distillation. 01:24:57.180 |
And so that's a much richer signal for fine-tuning off of. 01:25:04.580 |
But I guess I was thinking of it in the looser sense 01:25:07.980 |
that most people talk about today, which is just like 01:25:19.780 |
you're creating a synthetic corpus of text, you know? 01:25:23.780 |
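A sketch of the simplest version of that: run your prompts through a bigger teacher model and save prompt/response pairs as JSONL to fine-tune on (endpoint and model names are placeholders):

```python
import json
from openai import OpenAI

teacher = OpenAI()   # or base_url=... pointed at any OpenAI-compatible server

prompts = [
    "Rewrite this support reply to be friendlier: 'Restart the router.'",
    "Summarize the refund policy for a customer in two sentences.",
    # ... the rest of your real-ish workload prompts
]

with open("distill.jsonl", "w") as f:
    for p in prompts:
        answer = teacher.chat.completions.create(
            model="gpt-4o",   # placeholder teacher model
            messages=[{"role": "user", "content": p}],
        ).choices[0].message.content
        f.write(json.dumps({"prompt": p, "response": answer}) + "\n")
```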
Synthetic corpus of text is somewhat intimidating sounding. 01:25:29.540 |
I say as somebody who's used a lot of intimidating 01:25:34.780 |
But really, the simplest synthetic corpus of text 01:26:04.620 |
So Modal will do fine-tuning jobs up to eight H100s 01:26:10.420 |
We're working on features for bigger scale training, 01:26:13.820 |
and both longer in time and larger in number. 01:26:20.900 |
a pretty strong argument for keeping your fine-tunes as small 01:26:24.500 |
and fast as possible to be able to iterate more effectively 01:26:29.540 |
Because it's fun to run on 1,000 GPUs or whatever. 01:26:35.660 |
There's this frisson of making machines go brr. 01:26:39.660 |
But then when you need to regularly execute that job 01:26:42.980 |
to maintain a service that you've promised people 01:26:45.900 |
that you will keep up, then it starts to get painful. 01:26:50.140 |
Because reliability, cost, it's ungodly slow. 01:26:54.700 |
It's 48 hours is a long time to wait for a computer 01:26:57.860 |
to do something, even if it is an exaflop of operations. 01:27:02.860 |
So definitely, when starting out with fine-tuning,