Running and Finetuning Open Source LLMs — ft. Charles Frye, Modal

00:00:00.000 |
- Yeah, sure, yeah thanks for inviting me, Noah and Sean. 00:00:06.220 |
And yeah, I guess my background is studied neural networks, 00:00:24.920 |
series A to series C, did education for them, 00:00:29.760 |
then did a full stack deep learning online course 00:00:32.960 |
about how to deploy models in the pre-foundation model 00:00:39.640 |
and then now work for Modal, infrastructure company 00:00:43.720 |
that helps people run data-intensive workloads 00:00:58.200 |
yeah, we'd love, maybe we'll be able to do something 00:01:06.920 |
about running and fine-tuning open-source language models. 00:01:12.360 |
The answer is not always with both of these things. 00:01:40.960 |
This is something that I got set up just yesterday, 00:02:05.000 |
So this is coming from Modal, our examples repo. 00:02:19.360 |
Oops, you need a virtual environment with Python in it. 00:02:26.320 |
Okay, so let's run this guy here in the terminal. 00:02:37.000 |
I'm pulling in, I'm running it with llama.cpp here. 00:02:42.000 |
llama.cpp has very, very low precision quants. 00:02:49.760 |
That means all the values are either minus one, zero, 00:03:03.640 |
So that's why I'm running with llama.cpp here. 00:03:12.280 |
Four L40S GPUs on Modal, so we might have to wait 00:03:14.800 |
as many as 15 or 30 seconds for that to spin up. 00:03:28.680 |
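As a rough sketch of what running a heavily quantized model through llama.cpp looks like from Python (the GGUF filename and quant level here are placeholders, not the exact setup from the demo):

```python
# Minimal sketch using the llama-cpp-python bindings; the model path and
# quant suffix are hypothetical placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/DeepSeek-R1-IQ1_S.gguf",  # a very low-bit quant file
    n_gpu_layers=-1,   # offload every layer that fits onto the GPUs
    n_ctx=8192,        # context window to allocate KV cache for
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```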
Running these things is an exercise in configuration. 00:03:40.960 |
Or if you've run compilation for a serious large project, 00:03:44.880 |
you got your mysterious flags with mysterious arguments 00:03:50.840 |
that have meaningful impact on the performance. 00:03:56.600 |
and setting the quantization precision of that, 00:04:02.640 |
Okay, so we had about a minute queue for GPUs. 00:04:10.960 |
So sometimes you roll the dice and you get a natural one. 00:04:14.520 |
But, so it took us about 60 seconds maybe to spin this up 00:04:23.480 |
and I'll go smack our GPU cluster with a hammer 00:04:31.960 |
loading up all the stuff you need to get llama.cpp to run. 00:04:41.000 |
Actually, it turns out it's about 100 something gigabytes 00:04:51.360 |
It's really the like data and the inference tech 00:04:53.920 |
that DeepSeek built that's really the interesting part. 00:05:00.120 |
So now we're at the point where we're like loading the model. 00:05:07.360 |
If you have a GPU to have it on all the time, 00:05:28.480 |
So this is actually one of the major cost sources 00:05:37.040 |
that we're looking at here: it's about, you know, 00:05:40.040 |
a minute, 90 seconds to spin up. 00:05:45.040 |
That's like separate from any like modal overhead. 00:05:49.520 |
This is just the raw moving bytes around setting up RAM. 00:05:53.120 |
If you wanna avoid that, you gotta have stuff hot in RAM 00:06:05.240 |
or you gotta have an instance running in the cloud. 00:06:25.840 |
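A minimal sketch of that "keep it hot" pattern on Modal, assuming a small instruct model and the Transformers pipeline API (the model name is illustrative):

```python
# Pay the load cost once per container: weights load in an @modal.enter() hook,
# then every call reuses the warm copy already sitting in GPU/host memory.
import modal

app = modal.App("warm-llm")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.cls(gpu="L40S", image=image)
class Engine:
    @modal.enter()
    def load(self):
        from transformers import pipeline
        # The slow part: moving gigabytes of weights into (GPU) memory.
        # It happens once per container start, not once per request.
        self.pipe = pipeline(
            "text-generation",
            model="meta-llama/Llama-3.1-8B-Instruct",
            device_map="auto",
        )

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=128)[0]["generated_text"]
```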
It's, oh yeah, the prints mess up a little bit 00:06:34.520 |
the beginning of the prompt is something like, 00:06:39.980 |
So that's the prompt along with some instructions. 00:06:52.240 |
And now the beloved think token has been emitted 00:07:01.120 |
So this deployment is not super well optimized. 00:07:04.960 |
There's a substantial amount of host overhead, 00:07:06.960 |
which means the GPU is not actually working all the time, 00:07:11.620 |
That's probably either llama.cpp needs like a PR 00:07:21.780 |
So I'm suspicious that maybe I messed something up 00:07:33.080 |
But yeah, but runs about 10 tokens per second 00:07:37.120 |
on these L40S GPUs and could probably be boosted up 00:07:47.200 |
And then from there probably optimize kernels 00:07:51.520 |
Some other things would maybe double it again. 00:07:53.760 |
Oh, finished thinking pretty quickly that time. 00:08:01.480 |
controlled by how hard they think the problem is. 00:08:04.720 |
And so sometimes it finishes thinking pretty quickly 00:08:17.500 |
And yeah, I think the quality of the output here 00:08:35.120 |
So this case here, I bet that dense there is like, 00:08:42.480 |
Rough, that's probably supposed to be a zero. 00:08:46.080 |
that I haven't played with that can reduce that 00:08:49.600 |
but that comes with the quantization territory. 00:09:03.400 |
and some of the stuff I talked about along the way, 00:09:09.020 |
Yeah, any questions before I dive into the slides? 00:09:31.320 |
For DeepSeek R1, they were eight-bit floating-point numbers. 00:09:35.560 |
They worked hard to reduce the overhead during training. 00:09:44.260 |
Default in Python is often 64-bit floating-point numbers, 00:09:48.800 |
but that's way too much for neural networks most of the time. 00:10:09.640 |
like a reasoning model that produces a ton of tokens 00:10:19.520 |
even if that's all you do with your quantization. 00:10:37.480 |
that you can do that I haven't dialed in yet. 00:10:42.880 |
Yeah, anybody, any other questions before we dive in? 00:10:53.640 |
Yeah, feel free to interrupt me as we're going. 00:11:08.160 |
Okay, so I just ran an open-source language model, 00:11:18.080 |
Define some of the things that I talked about there, 00:11:26.120 |
We'll also talk a bit about fine-tuning models, 00:11:30.120 |
by doing actual machine learning and training. 00:11:35.120 |
I do want to talk about the why of this here, 00:11:47.640 |
So you want to make sure you have a good idea 00:11:52.440 |
Why not just use a managed service behind an API? 00:12:04.140 |
to get reasoning traces for your customer support chatbot, 00:12:07.680 |
that just needs to ask them to turn the thing off 00:12:16.760 |
The hardware to run it's getting easier and cheaper. 00:12:20.000 |
And so you can frequently run that relatively inexpensively. 00:12:28.720 |
and you can often, the complexity of serving is lower. 00:12:32.220 |
Just like a call-out on that DeepSeek R1 demo, 00:12:35.480 |
there's probably an order of magnitude and a half 00:12:41.520 |
So a 30x improvement is probably low-hanging fruit, 00:12:45.780 |
But right now, running that on Modal is $300 a megatoken. 00:12:51.720 |
And just having DeepSeek run it for you is $3 a megatoken. 00:13:02.340 |
even assuming we can get a 30x cost reduction running 00:13:21.200 |
comes from getting fleeced by cloud providers 00:13:27.080 |
to just stand up commodity Redis on commodity hardware. 00:13:33.420 |
So the main reason I think that people bring up 00:13:37.520 |
is to manage security and improve data governance. 00:13:40.120 |
You want to make sure to run this thing yourself. 00:13:45.340 |
the more complex this problem is going to be, 00:13:47.960 |
the more eventually it ends up getting your own GPUs 00:13:54.880 |
which is probably six months or a year of engineering work, 00:14:02.400 |
But at the very least, running it with a cloud provider, 00:14:08.360 |
can improve your security and data governance posture. 00:14:23.340 |
is maybe the one that I would say is most important. 00:14:25.980 |
It's like, and most general, it's like API providers, 00:14:33.120 |
If they're proprietary, they got to hide stuff from you, 00:14:35.420 |
whether that's reasoning chains, in OpenAI's case, 00:14:48.480 |
which is the way that they get things cheaper 00:14:50.780 |
than you can run it yourself, sort of economies of scale. 00:15:00.400 |
to run this variety of workloads economically. 00:15:16.980 |
in the direction of artificial super intelligence 00:15:23.140 |
that anybody can just download off of Hugging Face 00:15:27.700 |
So we just saw reasoning, o1-level capabilities. 00:15:41.900 |
of running your own inference as the field matures 00:15:45.700 |
Like things are just going to get way more flexible. 00:15:48.180 |
People are going to discover all kinds of crazy things 00:15:54.460 |
People will rediscover what everybody was doing 00:15:56.340 |
in 2022 and 2023, when people still had access 00:16:03.220 |
and discover that it makes their lives better. 00:16:15.340 |
if you're going to speak more about this as we go forward, 00:16:20.860 |
just to make it a bit more concrete in my mind? 00:16:23.720 |
- When you say how inference is currently working, 00:16:30.700 |
- Well, you're saying that, and I'm not familiar, 00:16:35.540 |
I'm not so familiar with the word inference itself. 00:16:37.380 |
Like, could you share a bit about how current models 00:16:39.620 |
are using inference and like how it works today, 00:16:42.140 |
so that then I understand how to better like tweak it 00:16:48.780 |
Inference just means running the model, right? 00:16:54.500 |
Goes back to the like probabilistic backing of these models. 00:17:02.140 |
And that's like inference, like logical inference. 00:17:13.420 |
So yeah, so it's like, this is like replacing OpenAI's API 00:17:32.700 |
So like, if I just like forget that I haven't defined a term, 00:17:43.980 |
But I would just say like, it's not that uncommon 00:17:46.880 |
that proprietary software leads open software 00:17:51.940 |
and Microsoft SQL Server and like OSX and Windows 00:18:00.420 |
and have for a long time, like query optimizers in particular 00:18:14.100 |
these things have co-existed in other domains. 00:18:17.340 |
And then open software has been preferred in cases 00:18:20.300 |
where it's more important to be able to hack on 00:18:25.940 |
And so, you know, we're likely to see some mixture stably. 00:18:35.140 |
at one of swyx's events, the AI Engineer Summit, 00:18:38.240 |
year and a half ago now, and this has remained true. 00:18:43.940 |
So that's at least 18 months of prediction stability, 00:18:46.580 |
which is best you can maybe hope for these days. 00:18:56.820 |
Right now, inference is priced like a commodity. 00:18:59.300 |
People find it relatively easy to change models. 00:19:01.980 |
Little prompt tuning, keep a couple of prompts around, 00:19:05.060 |
ask a language model to rewrite your prompts for you. 00:19:11.000 |
to this LM inference being priced like a commodity 00:19:29.820 |
and they like, when they're not doing training runs, 00:19:33.820 |
You might just mine cryptocurrency with them instead, 00:19:41.220 |
But like, you know, that at least if you have them, 00:19:49.560 |
But electricity costs are actually quite high 00:19:51.540 |
for these things, you know, kilowatt per accelerator 00:19:56.480 |
The, like, taking a really big generic model, 00:20:06.300 |
and distilling it for just the problems that you care about 00:20:10.220 |
into something like smaller and easier to run, 00:20:15.720 |
And we'll talk a bit about that if we get to fine tuning, 00:20:24.740 |
If your traffic is super high and dependable, 00:20:29.580 |
and you can just like allocate some GPUs to it, 00:20:33.660 |
just get a block of EC2 instances with GPUs on them, 00:20:40.540 |
It's flat, you're utilizing all the GPUs all the time. 00:20:49.540 |
And then finally, it's like, if it's like once a week, 00:20:52.160 |
you need to like process like every support conversation 00:21:01.300 |
you need like 10 mega tokens per minute throughput. 00:21:12.300 |
are gonna push you onto their enterprise tier 00:21:19.620 |
and so that's gonna push up the cost of using a provider. 00:21:23.060 |
But then it's also easier to run super like big batches. 00:21:35.900 |
Somewhat counterintuitively maybe for a software engineer 00:21:38.260 |
who's used to running like databases and web servers. 00:21:41.540 |
Just like the nature of GPUs is that it's easier to use them 00:21:49.860 |
these like batch and less latency sensitive workloads, 00:22:00.780 |
Replicate, Google Cloud Run, something like that. 00:22:04.740 |
Okay, so that's everything on like why you would do this, 00:22:18.720 |
I saw the chat had some activity, maybe check that out. 00:22:23.380 |
- No, I think we're just sort of adding color 00:22:37.700 |
Like I've already mentioned hardware and GPUs a lot, 00:22:43.140 |
Talk a little bit about like picking a model, 00:22:56.460 |
and then close out with thinking about observability 00:23:07.380 |
Of course, you'll be able to get it after the session. 00:23:13.500 |
Just use NVIDIA GPUs, don't have to go any further. 00:23:19.540 |
So Juliet wanted like a little bit more color and detail 00:23:26.420 |
is you need to take the like parameters of the model, 00:23:28.660 |
the weights, this like giant pile of floating point numbers. 00:23:35.620 |
You need to bring them into the place where compute happens. 00:23:42.580 |
Compute happens like on chip inside of like registers. 00:23:50.580 |
like pretty much every single weight needs to go in. 00:23:55.740 |
you can just look at how many gigabytes is that model file. 00:24:00.260 |
they're gonna need to move in to get computed on. 00:24:06.500 |
in one byte quantization, that's 8 billion weights. 00:24:27.100 |
through those weights to get out the next token. 00:24:30.340 |
On your first iteration, you're sending in the whole prompt. 00:24:40.540 |
In the process of like pushing something through the weights, 00:24:49.060 |
you wanna do two floating point operations per weight. 00:24:53.140 |
So that's like, you want to multiply the weight 00:24:55.200 |
with some number and then you're gonna add it to a running sum. 00:24:55.200 |
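That napkin math, written out (the 8B model and the bandwidth/FLOPS figures are rough illustrative numbers, not exact vendor specs):

```python
# Back-of-the-envelope version of the napkin math above.
params = 8e9            # an 8B-parameter model
bytes_per_weight = 1    # 8-bit (one-byte) quantization
weight_bytes = params * bytes_per_weight          # ~8 GB to move per token

mem_bandwidth = 900e9   # ~900 GB/s, roughly an L40S-class GPU
flops = 360e12          # ~360 TFLOPS of dense low-precision math, same class

# Decoding one token means streaming every weight through compute once...
tokens_per_sec_bandwidth_bound = mem_bandwidth / weight_bytes   # ~112 tok/s

# ...and doing roughly 2 FLOPs per weight (a multiply and an add).
flops_per_token = 2 * params
tokens_per_sec_compute_bound = flops / flops_per_token          # ~22,500 tok/s

# Bandwidth, not FLOPs, is the ceiling for single-stream decoding, which is
# why batching (reusing the same weight load for many requests) pays off.
print(tokens_per_sec_bandwidth_bound, tokens_per_sec_compute_bound)
```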
The core thing is being able to reason as an engineer 00:25:20.260 |
you don't have to be able to write a B-tree from scratch 00:25:22.460 |
on a chalkboard unless you're interviewing at Google. 00:25:34.420 |
I'm just trying to give you the like intuition you need 00:26:03.380 |
Like they have to, they go into where they get multiplied, 00:26:06.860 |
Because we're talking about like registers and caches here. 00:26:09.620 |
If you think of your like low level hardware stuff, 00:26:17.860 |
you should think of like running a sequential scan 00:26:20.200 |
on your database over and over and over again 00:26:25.660 |
So it's wild that we can even run it as fast as we do. 00:26:36.780 |
is like relatively simple control flow at the core. 00:26:41.180 |
So that makes it amenable to acceleration with GPUs. 00:26:45.340 |
GPUs have a bunch of, like if you look at the chip itself, 00:26:50.860 |
CPUs spend most of their space on like control flow logic. 00:26:55.620 |
And then caches that hide how smart the CPU is being 00:27:04.140 |
given over to the part that does like calculations, which 00:27:08.780 |
GPUs, on the other hand, are just all calculation. 00:27:20.500 |
And that-- because it doesn't need to hold 100 programs 00:27:25.780 |
And so that means you can really rip through a workload 00:27:31.900 |
like this one that has like relatively simple stuff, where 00:27:36.700 |
like zoom through doing simple math on a bunch of numbers. 00:27:44.060 |
because it works well for graphics, which also looks 00:27:49.380 |
Basically the same math on a bunch of different inputs, 00:27:57.180 |
for running language models and big neural networks. 00:28:06.540 |
The TLDR here is like the GPU is 100 duck-sized horses, 00:28:10.620 |
a bunch of tiny cores doing like very simple stuff. 00:28:14.940 |
And that wins out over the one horse-sized duck 00:28:18.940 |
that is the CPU that you're used to programming and working 00:28:26.620 |
which is like if you're looking at a top-tier GPU, 00:28:28.780 |
one of the things that makes the top-tier ones really good, 00:28:31.740 |
like an H100, is that they have soldered the RAM 00:28:35.220 |
onto the chip, which is not something you normally do. 00:28:39.980 |
But it gives you much faster communication, lower latency, 00:28:43.980 |
higher throughput, which is really important. 00:28:55.740 |
So the TLDR here is that it's like NVIDIA inference GPUs 00:29:01.060 |
from one or two generations back are what you probably 00:29:13.640 |
adding things like past sequences you've run on 00:29:19.900 |
And so when you're looking at buying GPUs yourself 00:29:22.380 |
for rent or which ones to rent from the cloud, 00:29:33.940 |
to go from high-precision floating-point numbers 00:29:43.420 |
And they make it easier to move the things in and out 00:29:46.980 |
of memory and into where the compute happens. 00:29:49.620 |
So the thing you want is a recent but not bleeding-edge 00:29:55.460 |
So most recent GPUs from NVIDIA are the Blackwell architecture. 00:30:01.780 |
your local neighborhood GPU, and then the Blackwell B200s 00:30:13.860 |
don't get the full speedup that you'd like because people 00:30:23.260 |
And then they're really hard to get a hold of and expensive. 00:30:28.300 |
behind whatever OpenAI and Meta are training on. 00:30:40.140 |
And then Lovelace GPUs like the L40S that I ran my demo on, 00:30:50.740 |
is the more inference-oriented data center GPU. 00:30:58.180 |
NVIDIA doesn't really let people put your friendly local GPU, 00:31:02.460 |
the same one you can buy locally and put in your own machine. 00:31:18.580 |
is one that's less focused on connecting a whole shitload 00:31:38.700 |
on just having one reasonably sized effective individual GPU. 00:31:47.860 |
For a while, the H100, which is really more of a training GPU, 00:31:53.860 |
I think, yeah, just because the L40S was relatively mature. 00:31:58.240 |
If your model's small, if you're running a small model, 00:32:01.020 |
like a modern BERT or one of the 3 billion or 1 billion parameter models, 00:32:07.540 |
you can get away with running it even a generation further back. 00:32:12.740 |
The Ampere A10 is a real workhorse GPU, 00:32:18.660 |
You can transparently scale up to thousands of those on Modal 00:32:27.500 |
Just a quick-- since NVIDIA is in the news these days, 00:32:35.820 |
AMD and Intel GPUs are still catching up on performance. 00:32:42.940 |
on the side that says FLOPS, and the AMD GPUs look good. 00:32:50.780 |
There's a great post from Dylan Patel and others 00:32:54.860 |
that's SemiAnalysis, just ripping on the AMD software 00:33:02.980 |
It's like, we can maybe either write the software ourselves 00:33:19.500 |
There are other accelerators that are designed, 00:33:21.700 |
unlike CPUs, for super high throughput and low memory 00:33:27.260 |
The TPU is the most mature one, the Tensor Processing Unit. 00:33:34.140 |
And the software stack is pretty decent for them, actually, 00:33:36.720 |
like JAX, which can be used as a backend for PyTorch. 00:33:42.220 |
But like many things in Google, the internal software for it 00:33:49.300 |
And you're second in line behind their internal engineers 00:34:02.180 |
At that point, you're kind of not running your own LM 00:34:05.220 |
You're having somebody else run it as a service for you 00:34:15.360 |
the word I'm looking for-- cost-effectively as well. 00:34:21.960 |
I would say any of the other accelerators you see 00:34:29.720 |
NVIDIA has a very thick stack of water-cooled network cards 00:34:38.340 |
to take a long time for anybody to catch up there. 00:34:46.620 |
So I expect a lot of innovation in this space, 00:34:52.460 |
Last thing I'll say is the startup that I work on, 00:34:58.420 |
So a lot of it-- this is high-performance computing 00:35:06.180 |
you know that heterogeneous compute makes you cry. 00:35:14.580 |
just like add Python decorators, get stuff to run on GPUs. 00:35:18.020 |
This is real code that our CEO ran to test our H100 scaling, 00:35:30.860 |
And this is all the code that you need to run that. 00:35:34.500 |
In our enterprise tier, this would scale up to 500 H100s 00:35:46.460 |
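This isn't the actual scaling test, but a sketch of the shape of that kind of Modal code, assuming a hypothetical GPU-checking task:

```python
# Decorate a function, request a GPU, fan it out across containers.
import modal

app = modal.App("h100-sweep")

@app.function(gpu="H100", timeout=600)
def check_gpu(i: int) -> str:
    import subprocess
    name = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip()
    return f"task {i}: {name}"

@app.local_entrypoint()
def main():
    # Each element is scheduled onto its own container/GPU as capacity allows.
    for line in check_gpu.map(range(16)):
        print(line)
```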
OK, so that's everything I want to say on hardware. 00:36:22.180 |
How hard could it be to build a semiconductor foundry? 00:36:28.460 |
if you aren't going to use the money for good stuff? 00:36:30.300 |
Anyway, I'm sure they have great reasons for this. 00:36:35.780 |
Anyway, I won't go on any more tangents there. 00:36:38.420 |
But DM me on Twitter if you want to talk more about this. 00:36:51.300 |
So if you're interested in this stuff, check it out. 00:36:55.940 |
the intuition for this hardware and a little bit of debugging 00:37:00.900 |
on the software stack because most people didn't encounter 00:37:07.380 |
education, their boot camp, or their working experience so far. 00:37:17.220 |
So what is the actual model we're going to run? 00:37:21.420 |
My one piece of advice that I'm contractually obligated to give: 00:37:25.740 |
before you start thinking about, oh, what model am I going to run? 00:37:36.380 |
and you have evals, an ability to evaluate whether the-- 00:37:58.780 |
take you to write that out with ground truth answers? 00:38:01.820 |
If it takes you an hour, put on your jams and do it. 00:38:09.820 |
Just listen to Brat and write out 10 or 50 evals. 00:38:14.380 |
Just because it's kind of like test-driven development, 00:38:16.900 |
where everybody says write the tests and then 00:38:20.860 |
But in this case, with test-driven development, 00:38:23.620 |
one reason people don't do it is because they 00:38:29.540 |
I know all the different ways this code could misbehave. 00:38:39.940 |
But in this case, nobody is good at predicting 00:38:45.980 |
being able to check is this actually improving things 00:38:50.500 |
So do this, even just 10 things in a notebook. 00:38:53.540 |
Don't go and buy an eval framework to do this. 00:38:56.020 |
Just find a way to run models in the terminal in a notebook 00:39:00.300 |
that helps you make these decisions like an engineer, 00:39:08.700 |
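A minimal sketch of what those ten-evals-in-a-notebook can look like, assuming an OpenAI-compatible endpoint and placeholder prompts, model name, and checks:

```python
# Point it at whatever server you're running (vLLM and most inference
# servers speak the OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

EVALS = [
    # (prompt, substring the answer must contain)
    ("What file format does llama.cpp use for quantized weights?", "GGUF"),
    ("What does KV stand for in KV cache?", "key"),
    # ... eight more of these is already a useful eval set
]

def grade(prompt: str, expected: str) -> bool:
    resp = client.chat.completions.create(
        model="my-model",          # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    return expected.lower() in resp.choices[0].message.content.lower()

score = sum(grade(p, e) for p, e in EVALS) / len(EVALS)
print(f"pass rate: {score:.0%}")
```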
OK, so model options here are still, I would say, 00:39:15.140 |
because it's starting to feel like we have options. 00:39:17.780 |
Meta's Llama model series is pretty well-regarded 00:39:23.460 |
So if I'm an engineer thinking about which open source 00:39:26.200 |
software am I going to build on, I actually think about that 00:39:28.700 |
a lot more so than raw capabilities a lot of the time. 00:39:34.060 |
And the key thing here is there's a pretty big community 00:39:37.260 |
building on Llama, making their software work really well 00:39:42.820 |
that you would otherwise have to do yourself. 00:39:53.300 |
So they squish them down so they're a lot smaller. 00:40:00.020 |
Nous Research does a lot of fine-tuning of models 00:40:09.020 |
And Arcee AI will mush together five different Llamas 00:40:12.860 |
to make one Penta Llama that weirdly works better 00:40:19.380 |
And then you don't have to do any of that yourself. 00:40:23.760 |
you can expect there will be continued investment in it. 00:40:26.300 |
Meta's been great about open source in other places, 00:40:46.140 |
to catch people's attention, the other one being Qwen. 00:40:50.660 |
There's slightly less tooling and integration 00:41:03.300 |
that the open source initiative would put their stamp on. 00:41:07.220 |
But the Llama model is under a proprietary license that 00:41:12.220 |
says, for example, if you're Amazon or Google, 00:41:19.860 |
And a couple other things that make it less open, 00:41:22.580 |
slightly less open, might make your lawyers nervous. 00:41:29.340 |
to go MIT, inshallah that will happen with Llama 4. 00:41:37.220 |
You might see a shitty model come out of a model training 00:41:39.940 |
team, or sorry, you might see a non-state-of-the-art model come 00:41:47.180 |
It's just that it takes a long time to get really good. 00:41:53.340 |
been putting out some good models with the OLMo series 00:41:57.620 |
Microsoft's been doing their small language models with Phi. 00:42:06.700 |
Maybe in the future, the enterprise cloud homies 00:42:13.860 |
Mostly, Arctic and DBRX are fun for research reasons 00:42:20.620 |
But yeah, that's kind of a small number of options. 00:42:23.860 |
A little bit more like databases in the late '90s, early 2000s 00:42:27.180 |
than databases today, where everybody and their mother 00:42:45.900 |
So by default, floats are 32 or 64 bits, like integers are. 00:42:52.860 |
Digital computers that you're used to programming 00:43:02.980 |
He made this basically a clock that was a computer. 00:43:08.060 |
And I think this is an AND gate or an XOR gate. 00:43:12.540 |
So it only moves if one of the two plates on one side 00:43:30.660 |
I think this is artillery trajectory calculations. 00:43:37.020 |
that the ball is rolling around by changing the gears. 00:43:50.220 |
to abstract it away and make it all ones and zeros 00:43:55.300 |
Neural networks are way more like these analog computers. 00:44:05.660 |
It's never going to be exactly the same with an analog system 00:44:14.780 |
Whereas you change one bit in a digital computer, 00:44:28.860 |
So this is the reason why you can aggressively 00:44:34.020 |
that you can't do with lossily compressing, I don't know, 00:44:38.780 |
If you quantized every byte in Postgres down to 4 bits, 00:44:46.140 |
So this quantization is really key for performance. 00:44:59.500 |
Weight quantization only, that means just make 00:45:01.900 |
the model itself smaller, makes it smaller in memory. 00:45:06.540 |
And then that whole thing about moving it in and out of compute 00:45:12.720 |
And then that doesn't actually quantize the math. 00:45:15.020 |
The actual math that happens still happens at 16-bit, 32-bit. 00:45:20.700 |
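Rough numbers for what weight-only quantization buys you in memory, using a 70B-parameter model as an illustration:

```python
# Purely illustrative; real formats add some overhead for scales/zero-points.
params = 70e9

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("ternary", 1.58)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name:>8}: ~{gigabytes:.0f} GB of weights to store and move")

# fp16 ~140 GB won't fit on a single 80 GB card; int4 ~35 GB fits comfortably.
# The math itself may still run in 16/32-bit unless you also quantize activations.
```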
To do activation quantization requires more recent GPUs, 00:45:24.820 |
sometimes requires special compilation flags. 00:45:27.380 |
It's not always the case that the operation you want 00:45:30.020 |
to speed up already has a kernel written for you 00:45:32.780 |
by Tri Dao or some other wizard to make the GPU go at full speed. 00:45:44.540 |
"Give Me BF16 or Give Me Death," question mark, is a good paper. 00:45:54.620 |
Evals help you decide whether the quantization is hurting. 00:45:57.820 |
So I was running DeepSeek R1 in ternary, actually. 00:46:07.540 |
There's no way the full model performance or anything 00:46:13.660 |
lost the thing that made you pick the model in the first place. 00:46:34.100 |
to help you scale up your own taste in models and intuition. 00:46:58.460 |
with prompting and really control flow around models. 00:47:02.500 |
I don't know, DeepSeek R1 writes Python code. 00:47:07.260 |
Take the code, run it, take the error message, pipe it back in. 00:47:10.900 |
So writing things around models, instead of fine tuning it 00:47:19.300 |
You're hard to compete with them on a lot of this stuff. 00:47:22.540 |
So managing prompts and managing control flow around models 00:47:35.340 |
So definitely start with just prompting, retrieval, et cetera. 00:47:39.220 |
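A toy sketch of that write-code-run-it-pipe-errors-back loop, assuming an OpenAI-compatible endpoint and a placeholder model name (a real version needs a sandbox, not a bare exec):

```python
# Ask for Python code, run it, and feed any traceback straight back in.
import traceback
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def solve_with_retries(task: str, attempts: int = 3) -> str:
    messages = [{"role": "user", "content": f"Write Python code that {task}. Reply with code only."}]
    for _ in range(attempts):
        code = client.chat.completions.create(
            model="my-model", messages=messages, max_tokens=500
        ).choices[0].message.content
        try:
            exec(compile(code, "<llm>", "exec"), {})   # unsafe outside a sandbox
            return code                                 # it ran; good enough here
        except Exception:
            # Pipe the error back in and ask for a fix.
            messages += [
                {"role": "assistant", "content": code},
                {"role": "user", "content": f"That raised:\n{traceback.format_exc()}\nFix it."},
            ]
    raise RuntimeError("model never produced runnable code")
```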
Yeah, I want to make sure to talk about the inference 00:47:46.140 |
frameworks and what serving inference looks like. 00:47:52.180 |
requires a ton of thought and effort on optimization. 00:47:55.860 |
This is not something you can sit down and write yourself, 00:48:12.900 |
So the current core of the stack that's most popular 00:48:18.700 |
So PyTorch is a combo of a Python steering library 00:48:24.180 |
and then a C++ internal library and libraries 00:48:29.940 |
for doing all the hard shit, including CUDA C++, 00:48:42.420 |
Don't get excited and rewrite that part in Rust. 00:48:45.980 |
You're going to find out that that didn't help you that much. 00:48:48.620 |
There's some features that make it easier to write Torch 00:49:05.140 |
most of the speed up of writing a bunch of custom stuff. 00:49:12.420 |
to build on top of raw matmuls, like the stuff that showed up 00:49:15.740 |
in my napkin math diagram to serve inference fast. 00:49:26.620 |
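One of the features being alluded to is, plausibly, torch.compile; a minimal sketch of what that looks like:

```python
# Get fused, generated kernels out of plain PyTorch without writing CUDA.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda().half()

compiled = torch.compile(model)            # traces and generates fused kernels
x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
with torch.no_grad():
    y = compiled(x)                        # first call compiles, later calls are fast
```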
There's continuous batching is this smart stuff 00:49:29.660 |
for rearranging requests as they're on the way. 00:49:32.180 |
Speculative decoding is a way to improve your throughput 00:49:39.780 |
So you don't want to build all this just for yourself. 00:49:47.500 |
This is a don't roll your own case rather than a don't 00:49:54.260 |
like the classic the two genders in engineering. 00:49:59.380 |
So I would strongly recommend the vLLM inference server 00:50:06.660 |
So like Postgres, vLLM started as a Berkeley academic project. 00:50:11.380 |
They introduced this thing called paged attention, 00:50:13.780 |
paged KV caching, and then kind of ran with it from there. 00:50:19.740 |
There's performance numbers, and we can talk about them, 00:50:25.780 |
People are gunning to beat them on workloads. 00:50:29.540 |
You have to run it to decide whether you agree. 00:50:36.060 |
They really won mindshare as the inference server, 00:50:39.020 |
and so they've attracted a ton of external contributions. 00:50:47.260 |
IBM, basically exclusively to support their work on vLLM. 00:50:56.620 |
from Anyscale, IBM, a bunch of people contributing stuff. 00:51:01.060 |
And that's really important for open source success. 00:51:07.740 |
between otherwise competing private organizations, 00:51:11.820 |
whether they're nonprofit or for profit or whatever. 00:51:19.700 |
like that once it's held that crown for a while. 00:51:22.700 |
It's not undislodgable yet, so it's not quite like Postgres, 00:51:27.820 |
and feel pretty like that's been around for 30 years, 00:51:36.140 |
like pip-installable once you have your GPU drivers. 00:51:41.660 |
which NVIDIA has refused to do with TensorRT-LLM and Triton. 00:51:47.100 |
So it's got a bunch of nice features and good performance. 00:51:54.380 |
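A minimal sketch of basic vLLM usage, with a placeholder model name; continuous batching and paged KV caching happen inside the engine without extra code from you:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    ["Summarize why batching helps GPU utilization.",
     "Explain paged KV caching in two sentences."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```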
is NVIDIA's offering: ONNX, TensorRT, TensorRT-LLM, 00:52:06.340 |
And it's under, I forget, either Apache or MIT license. 00:52:12.540 |
you'll see that it updates in the form of one 10,000 line 00:52:16.380 |
commit with 5,000 deletions every week or two that 00:52:26.380 |
Pretty hard to-- you don't get input on the roadmap. 00:52:50.020 |
They have this nice interface for prompt programming 00:53:00.740 |
you win when you can draw the most contribution. 00:53:03.340 |
So I feel like even if SGLang is winning over vLLM 00:53:07.620 |
in certain places currently, I doubt that that will persist. 00:53:14.740 |
My impression was that they're both from Berkeley, 00:53:16.860 |
and I thought basically SGLang is kind of the new generation 00:53:26.060 |
I don't think they've attracted the same degree 00:53:28.180 |
of external contribution, which is important. 00:53:37.780 |
so I should maybe bump SGLang up to its own part. 00:53:42.700 |
If you're going to be running your own inference, 00:53:45.260 |
this is a high-performance computing workload. 00:53:52.680 |
and can take you from hundreds of dollars a megatoken 00:53:59.740 |
So you will need to debug performance and optimize it. 00:54:04.340 |
And the only tool for doing that is profiling. 00:54:17.100 |
and which ones you should use on your workload, 00:54:28.620 |
That's kind of like what vLLM integrates with. 00:54:30.740 |
There's also NVIDIA Nsight, both for creating and viewing 00:54:35.260 |
That's their slightly more boomery corporate performance 00:54:40.340 |
It's got a lot of nice features, though, can't lie. 00:54:43.140 |
But yeah, it's the same basic tracing and profiling stuff, 00:54:49.340 |
except there's work on the CPU and on the GPU, 00:54:55.660 |
if you're thinking about this a lot, running a tracer 00:55:00.100 |
and just looking at the trace a couple of times for PyTorch, 00:55:04.260 |
vLLM, whatever, just because you learn a ton from looking 00:55:22.140 |
That's where I start, and then I go back to the source code 00:55:28.660 |
a mental model of a programming model and concurrency 00:55:32.740 |
implications, et cetera, just from reading source code. 00:55:36.900 |
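A first-pass sketch of tracing a PyTorch workload with the built-in profiler (the model here is a stand-in; the point is the trace file you go stare at a couple of times):

```python
# Export the trace and open it in Perfetto / chrome://tracing to see where
# host and GPU time go.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(4096, 4096).cuda().half()
x = torch.randn(64, 4096, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
```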
Humans were meant to observe processes in evolution, not 00:55:47.320 |
We also have some demos for how to run this stuff on Modal 00:55:53.620 |
As a first pass for GPU optimization for, OK, 00:56:01.260 |
Very first pass is this number, GPU utilization. 00:56:13.900 |
would see that this utilization number is really low, like 20%. 00:56:17.220 |
That means the host is getting in the way a lot 00:56:22.620 |
This is not like model FLOPS utilization, or what fraction 00:56:28.980 |
of the FLOPS NVIDIA quoted you that you're getting. 00:56:35.220 |
Is the GPU running what fraction of the time? 00:56:39.980 |
Like, this is-- yeah, that's an attainable goal, 95% to 99%. 00:56:47.380 |
Unlike CPU utilization, that's not a problem. 00:56:51.420 |
So GPU utilization here is like a first check. 00:56:54.020 |
Problem is, just because work is running on a GPU 00:57:01.060 |
So the two other things to check are power utilization 00:57:05.700 |
Fundamentally, GPUs are limited by how much power 00:57:20.020 |
So you want to see power utilization 80% to 100%. 00:57:26.620 |
And you want to see GPU temperatures running high 60 00:57:29.980 |
Celsius for the data center GPUs, maybe low 70s, 00:57:35.340 |
but pretty close to their thermal design power, 00:57:37.500 |
maybe 5 to 10 degrees off of the power at which NVIDIA says, 00:57:50.260 |
really good use of the GPU, whereas this GPU utilization, 00:57:58.580 |
It's like two GPUs are both expecting the other 00:58:01.140 |
to send a message, like two polite people trying 00:58:07.540 |
because they're both being like, waiting for that message, dog. 00:58:22.180 |
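A sketch of reading those three quick checks (utilization, power, temperature) with the NVML Python bindings, assuming GPU index 0:

```python
# pip install nvidia-ml-py. This only tells you the GPU is busy, warm, and
# drawing power, not that the kernels themselves are any good.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu          # % of time busy
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000          # current draw, W
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000  # board limit, W
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

print(f"util {util}%  power {power_w:.0f}/{limit_w:.0f} W  temp {temp_c} C")
pynvml.nvmlShutdown()
```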
But it is something to watch out for and why, on Modal, 00:58:26.260 |
I learned Rust in order to be able to add these 00:58:39.420 |
since it was in the title, conscious of time. 00:58:54.140 |
So fine tuning means taking the weights of the model 00:58:56.740 |
and using data to customize them, not via RAG, 00:59:07.300 |
If you can take the capabilities that an API has 00:59:17.260 |
frequently, you don't need all the things like GPT. 00:59:20.420 |
The big models know the name of every arrondissement in France 00:59:42.500 |
I think of this a bit like a Python to Rust rewrite. 00:59:45.820 |
You start off when you aren't sure what you need. 00:59:48.020 |
You write in Python because it's easy to change, 00:59:52.100 |
and switching between proprietary model providers 00:59:57.460 |
But then once you really understand what you're doing, 00:59:59.660 |
you rewrite it in Rust to get better performance. 01:00:05.540 |
to be more maintenance work and harder to update, yada, yada, 01:00:08.500 |
but it's going to be 100x cheaper or something. 01:00:17.780 |
of technical debt, feature velocity, cost of engineers, 01:00:28.060 |
that will help you steal capabilities as a service. 01:00:39.340 |
in the voice of a pirate and never break kayfabe, 01:00:44.220 |
Relatively small amounts of data can do that. 01:00:49.060 |
That's usually better to do search or retrieval, which 01:00:51.700 |
is what people call RAG, like get the knowledge from somewhere 01:01:02.020 |
as good as it needed to be a year and a half ago. 01:01:08.180 |
The holy grail would be for you to define a reward function 01:01:12.820 |
of what does it mean for this model to do well. 01:01:14.780 |
Maybe that's customer retention, NPS, whatever. 01:01:19.860 |
And then you could do ML directly on those rewards 01:01:27.580 |
Then you could just sit back and monitor that RL system. 01:01:32.380 |
And then you would magically make that reward number go up. 01:01:38.860 |
The problem is there's a large gap between the things you 01:01:41.700 |
want to improve, and the things that you can actually measure, 01:01:44.540 |
and the things that you can provide to a model, 01:01:51.260 |
They need to be exactly what you want to maximize. 01:01:55.860 |
When you do ML, ML is like paperclip maximization. 01:01:58.660 |
It's like, you told me to make this number go up. 01:02:02.160 |
Imagine the brooms from "The Sorcerer's Apprentice." 01:02:18.020 |
where they trained a model to drive a boat in this boat 01:02:30.660 |
is collect these little pips and finish a race. 01:02:33.900 |
If you want to score max, what you actually want to do 01:02:36.180 |
is find this tiny little corner and slam against the wall 01:02:38.620 |
repeatedly, picking up this bonus item that respawns, 01:02:42.620 |
and just slamming against the wall over and over again 01:02:53.740 |
More like a speed runner playing a video game 01:02:58.380 |
So imagine this, but with your customer support. 01:03:01.180 |
Great way to get customers to give a 10 on an NPS 01:03:07.540 |
your machine is locked down until you put a 10 on our NPS. 01:03:20.700 |
It's kind of the long-term direction we're going. 01:03:26.980 |
Where we are today is really more like stealing capabilities 01:03:34.580 |
So the main reason fine-tuning can save costs, 01:03:38.380 |
can improve performance, why shouldn't you do it? 01:03:43.980 |
Running inference is mostly normal software engineering 01:03:47.220 |
with some fun spicy bits-- GPUs, floating point numbers. 01:03:50.900 |
But machine learning is a whole different beast. 01:03:56.180 |
in common with hardware and with scientific research. 01:04:01.060 |
You've got non-determinism of the normal variety. 01:04:04.780 |
On top of that, there's epistemic uncertainty. 01:04:08.100 |
We don't understand the optimization process. 01:04:12.580 |
which is much worse in machine learning than elsewhere. 01:04:15.060 |
You've got to maintain a bunch of data pipelines. 01:04:17.340 |
No one's favorite form of software engineering. 01:04:19.460 |
This is a high-performance computing workload. 01:04:24.860 |
Like, yeah, high-performance computing sucks. 01:04:27.180 |
There's a reason why only the Department of Energy does it. 01:04:38.500 |
It's written by people like me with scientific background. 01:04:53.020 |
And somebody can maybe pull a "New York Times" article 01:05:03.140 |
is going to require some Supreme Court rulings and so on 01:05:08.860 |
Yeah, and when Mercury is in retrograde, your GPUs run slower. 01:05:14.260 |
a lot of complexity that's very hard to get an engineering 01:05:18.460 |
So if you can solve it in literally any other way, 01:05:22.180 |
Think of ways you can solve this problem without fine tuning. 01:05:26.020 |
What program control flow can you put around a model? 01:05:32.980 |
because you're using an ML model to mimic an ML model. 01:05:41.140 |
Like, there's a notion of a data-generating process. 01:05:43.740 |
In the real world, that's like the climate of the planet 01:06:01.900 |
To do this, you're going to need even more high-performance 01:06:04.780 |
I focused on running models at the beginning. 01:06:15.580 |
is you run a program forwards, and then you flip it around 01:06:19.740 |
So that puts a lot of extra pressure on memory. 01:06:22.620 |
Then you also, during training, you want lots of examples 01:06:25.420 |
so the model doesn't learn too much from one specific example. 01:06:29.140 |
And you also want large batches to make better use 01:06:32.260 |
of the big compute and to make better use of all 01:06:42.560 |
requires some extra tensors that are the size of or larger 01:06:47.500 |
Sorry, some arrays, some extra arrays of floating-point 01:06:49.820 |
numbers that are at least the size of the model parameters 01:06:54.860 |
So you've got gradients and optimizer states. 01:06:57.300 |
These are basically like 2 to 10 extra copies of the model 01:07:05.300 |
can't get around the fact that a lot of this stuff 01:07:09.540 |
So you're going to need eight 80-gigabyte GPUs, or 32 01:07:17.020 |
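A back-of-the-envelope version of that memory math, assuming full fine-tuning of an 8B-parameter model with Adam in standard mixed precision:

```python
# Activations and framework overhead come on top of this, so treat it as a floor.
params = 8e9

weights_fp16 = params * 2          # the model itself
grads_fp16   = params * 2          # one gradient per weight
master_fp32  = params * 4          # fp32 copy of the weights
adam_m_fp32  = params * 4          # Adam first moment
adam_v_fp32  = params * 4          # Adam second moment

total_gb = (weights_fp16 + grads_fp16 + master_fp32 + adam_m_fp32 + adam_v_fp32) / 1e9
print(f"~{total_gb:.0f} GB before activations")   # ~128 GB, already > one 80 GB GPU
```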
And yeah, the software for that is pretty hard, 01:07:22.220 |
I already talked about how hard machine learning is. 01:07:25.060 |
It's like there are software engineering practices that 01:07:30.980 |
I worked on experiment tracking software, Weights & Biases. 01:07:37.900 |
It's like when I was training models, the thing I wanted 01:07:40.300 |
was being able to store voluminous quantities of data 01:07:46.380 |
Tons of metrics, gradients, inputs, outputs, loss values. 01:07:53.340 |
want to keep track of on top of very fast-changing code 01:08:07.020 |
from which you can calculate the thing that reveals your bug. 01:08:10.140 |
This is actually, I would say, like Honeycomb, 01:08:13.020 |
their approach to observability is very similar. 01:08:15.140 |
This is like observability for training runs. 01:08:17.580 |
Observability is like recording enough about your system 01:08:19.940 |
that you can debug it from your logs without having to SSH in. 01:08:38.100 |
Yeah, so TensorBoard, you have to run TensorBoard yourself. 01:08:53.300 |
because it hits me personally, or maybe happiest, 01:08:56.260 |
because I'm a shareholder in Weights & Biases. 01:08:58.860 |
But yeah, so yeah, TensorBoard is really good 01:09:06.300 |
It's bad at collaboration and bad at large numbers 01:09:10.300 |
Other experiment tracking workflows that have gotten more-- 01:09:15.340 |
have gotten more love, like the venture-backed ones 01:09:18.180 |
or the open source ones, are better for that. 01:09:23.580 |
So you can-- I would say a lot of software engineers 01:09:28.460 |
and are pretty disgusted to discover the state of affairs. 01:09:38.740 |
And you should push people to up their SWE standards. 01:10:02.700 |
is build internal libraries in normal code files, 01:10:10.820 |
so that you can poke prod, run ad hoc workflows, et cetera. 01:10:15.660 |
And then as soon as something in a Jupyter Notebook 01:10:20.540 |
pull that out into your utils.py, at the very least, 01:10:33.220 |
Anyway, full-stack deep learning course I taught in 2022 01:10:36.740 |
still has the basics of how to run ML engineering. 01:10:43.960 |
And back then, we were talking about training from scratch, 01:10:46.420 |
because the foundation model era was only beginning. 01:10:49.180 |
But the basic stuff in there, like the YouTube videos, 01:10:52.100 |
the lecture-level stuff, is all still, I would say, 01:11:08.100 |
The main point is the eventual goal with any ML feature 01:11:12.140 |
is to build a virtuous cycle, a data flywheel, a data engine, 01:11:18.100 |
something that allows you to capture user data, annotate it, 01:11:21.140 |
collect it into evals, and improve the underlying system. 01:11:23.780 |
This is like-- if you're running your own LM inference, 01:11:28.320 |
this thing truly better than what you could get elsewhere 01:11:31.060 |
is building your own custom semi-self-improving system, 01:11:42.900 |
There's some specialized tooling for collecting this stuff up, 01:11:50.300 |
You can see Sean's recent conversation with Sean 01:11:54.460 |
from Weights and Biases on how he used Weave, 01:12:02.020 |
>>Then Thomas came on Thursday and went over Weave. 01:12:06.940 |
OK, yeah, that's pure product on Weave, plus Sean-- 01:12:17.020 |
and did an hour and a half and change on Weave. 01:12:21.780 |
So I would say Weave is really good for this offline evals, 01:12:25.300 |
which is collect up a data set, kind of run code on it. 01:12:31.260 |
And this is very much how an ML engineer approaches 01:12:34.620 |
evaluation, coming from academic benchmarking, 01:12:41.300 |
I don't know if you're going to have anybody from LangChain 01:12:45.940 |
who are also building these observability tooling. 01:13:01.540 |
LangSmith is very open-ended, the tool from LangChain, 01:13:04.940 |
as are a lot of the other observability tooling-- 01:13:08.820 |
or sorry, these more online eval-oriented things. 01:13:16.940 |
And it's about a living database of all the information 01:13:20.820 |
you've learned about your users, your problem, 01:13:25.740 |
And so it's this very dynamic, active artifact, 01:13:32.740 |
I think the more you need input from people who are not 01:13:38.420 |
it's producing medical traces, and you are not a doctor. 01:13:41.900 |
As opposed to producing code, and you are a programmer, 01:13:44.940 |
then being able to bring in more people is more helpful. 01:13:47.900 |
And so there's utility to these more online-style things. 01:13:52.420 |
You can also actually build this stuff yourself. 01:13:57.180 |
don't know that much more about running these models than you 01:14:01.540 |
And the workflows are not really set down for this. 01:14:11.420 |
has good ideas baked into it and will teach you to be better. 01:14:18.080 |
a.k.a. the provide free engineering and design 01:14:20.420 |
work for somebody you're also paying for a service phase. 01:14:26.580 |
So if you have a good internal data engineering 01:14:30.660 |
team that is good at, say, an open telemetry integration, 01:14:34.980 |
would love to set up a little ClickHouse instance 01:14:47.740 |
You can build your own with something like this. 01:14:49.740 |
And then the front end people can hack on the experience. 01:14:57.780 |
because Hex has both really incredible internal data 01:15:00.820 |
engineering and they're a data notebook product. 01:15:09.740 |
to be able to do that, but it's like a bigger fraction 01:15:17.820 |
More tilted in the build direction than the buy. 01:15:31.820 |
Modal is the infrastructure provider that I'm working on. 01:15:44.460 |
I was talking about how much I liked it on social media, 01:15:47.020 |
and they're like, what if we paid you to do this? 01:15:52.140 |
Please don't pay me to do it, because then people 01:15:57.580 |
Now I work at Modal, and they pay me to say this. 01:16:01.460 |
The same thing I was saying before, which is Modal is great. 01:16:05.100 |
It's like, you pay for only the hardware you use. 01:16:15.020 |
all the infrastructure is built from the ground up in Rust, 01:16:24.820 |
with Sean, that completely separate from learning 01:16:31.020 |
It's just like, gain 10 IQ points, or 10 levels 01:16:36.300 |
in computer infrastructure from hearing the story, 01:16:48.500 |
And then, unlike other serverless-GPU-in-the-narrow-sense 01:16:52.860 |
providers, Modal has code sandboxes, web endpoints, 01:16:57.020 |
makes it easy to stand up a user interface around your stuff. 01:17:00.140 |
So that's why I ended up going all in on Modal. 01:17:03.340 |
It was like, wow, not only does this run my models, 01:17:05.900 |
but I learned how to properly use FastAPI from Modal's 01:17:19.300 |
So that can be for running your fine-tuning jobs, 01:17:23.260 |
if you've decided you want to distill models yourself. 01:17:27.820 |
to be able to scale up and down, and handle changing inference 01:17:41.660 |
creating data to help you observe your system 01:17:52.820 |
infrastructure that doesn't require a PhD in Kubernetes. 01:18:13.580 |
so everyone that got the-- and also everyone, 01:18:16.780 |
Charles is the person that we talked to to get the Modal 01:18:20.900 |
So everyone, a big, big thank you to Charles for that. 01:18:25.980 |
this cohort builds three projects, all of which 01:18:29.220 |
are built off of FastAPI that lives in Modal. 01:18:42.460 |
There's a decent chance you'll get co-founder support 01:18:49.220 |
And yeah, hopefully you've been pointed to the examples page, 01:18:57.280 |
I slave to ensure that those things run end-to-end. 01:19:04.300 |
They're continuously monitored and run stochastically 01:19:22.100 |
And so yeah, if you want any help with those, 01:19:37.060 |
I should talk to you sometime, how you set all of that up. 01:19:40.380 |
I ran through the ComfyUI workflow a couple of days ago. 01:19:49.180 |
I just pulled down an example from the internet 01:19:51.220 |
and just ran the command that it said to run. 01:19:55.680 |
There's always some other thing I have to do. 01:20:01.060 |
Yeah, part of it is that as an infrastructure product, 01:20:07.660 |
is the differences between infrastructure and like, 01:20:10.100 |
oh, well, that will only run if you set this LDFLAGS thing 01:20:23.700 |
And my machine is Modal, which you can also run them on. 01:20:31.160 |
able to share things that run on Modal within your team, 01:20:42.740 |
and this is actually like an engineering trick 01:20:44.660 |
that is surprised it took me this long to learn. 01:20:46.660 |
It's like there's tests and there's monitoring. 01:20:54.620 |
Like, yeah, require a bunch of disgusting mocking 01:20:57.820 |
that breaks as often as the actual code does. 01:21:05.500 |
So yeah, that's an important trick for the modal examples, 01:21:10.420 |
but also for all the things you would maybe run using-- 01:21:14.020 |
as part of running your own language model inference 01:21:18.020 |
It's like, monitor the shit out of this thing. 01:21:24.460 |
Well, before we let Charles go, does anybody have any questions? 01:21:27.740 |
I know I'm sure given everyone's background here, 01:21:32.180 |
feels very full with all of the hardware architecture 01:21:40.060 |
>>I think-- I'll kill time while people ask questions. 01:21:47.260 |
But I think that it's always intimidating for people 01:21:51.300 |
sort of running their own models and fine-tuning them. 01:21:56.540 |
I'm just like, what's a really good first exercise 01:21:59.820 |
that you could-- probably you have some tutorials on modal 01:22:02.940 |
that you would recommend people just go through. 01:22:10.300 |
have a MacBook M2 or later with at least 32 gigabytes of RAM, 01:22:21.860 |
So that turns out to be actually a really incredible machine 01:22:28.820 |
that we talked about, like moving the bytes in and out 01:22:35.980 |
like that was the first thing I did back when you 01:22:41.420 |
That running it locally-- and there's good tools out there 01:22:48.140 |
You can also use the same thing you would run on a cloud server 01:22:57.020 |
to get started with running some of your own inference. 01:22:59.420 |
And then the cost is amortized more effectively. 01:23:03.140 |
And you can use it for other stuff, the computer 01:23:07.620 |
So that's actually probably my-- it's bad Modal marketing 01:23:12.020 |
But I would say people like to be able to poke and prod. 01:23:15.100 |
If you don't already know Modal-- I know Modal well enough 01:23:19.960 |
that it's easier to use Modal to try these things than to run it on my MacBook. 01:23:24.660 |
And everybody knows how to use a command line as part 01:23:35.540 |
For fine-tuning, I would say distilling a model 01:23:47.300 |
it's like fine-tuning something on somebody's Slack messages 01:23:56.900 |
and it teaches you some things about the software 01:24:06.940 |
it means to fine-tune in pursuit of a specific objective, 01:24:18.460 |
I did insert a little comment about what distillation means. 01:24:23.980 |
kind of view training on output of GPT-4 as distillation. 01:24:57.180 |
And so that's a much richer signal for fine-tuning off of. 01:25:04.580 |
But I guess I was thinking of it in the looser sense 01:25:07.980 |
that most people talk about today, which is just like 01:25:19.780 |
you're creating a synthetic corpus of text, you know? 01:25:23.780 |
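A sketch of the simplest version of that: run your prompts through a bigger teacher model and save prompt/response pairs as JSONL to fine-tune on (endpoint and model names are placeholders):

```python
import json
from openai import OpenAI

teacher = OpenAI()   # or base_url=... pointed at any OpenAI-compatible server

prompts = [
    "Rewrite this support reply to be friendlier: 'Restart the router.'",
    "Summarize the refund policy for a customer in two sentences.",
    # ... the rest of your real-ish workload prompts
]

with open("distill.jsonl", "w") as f:
    for p in prompts:
        answer = teacher.chat.completions.create(
            model="gpt-4o",   # placeholder teacher model
            messages=[{"role": "user", "content": p}],
        ).choices[0].message.content
        f.write(json.dumps({"prompt": p, "response": answer}) + "\n")
```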
Synthetic corpus of text is somewhat intimidating sounding. 01:25:29.540 |
I say as somebody who's used a lot of intimidating 01:25:34.780 |
But really, the simplest synthetic corpus of text 01:26:04.620 |
So Modal will do fine-tuning jobs up to eight H100s 01:26:10.420 |
We're working on features for bigger scale training, 01:26:13.820 |
and both longer in time and larger in number. 01:26:20.900 |
a pretty strong argument for keeping your fine-tunes as small 01:26:24.500 |
and fast as possible to be able to iterate more effectively 01:26:29.540 |
Because it's fun to run on 1,000 GPUs or whatever. 01:26:35.660 |
There's this frisson of making machines go brr. 01:26:39.660 |
But then when you need to regularly execute that job 01:26:42.980 |
to maintain a service that you've promised people 01:26:45.900 |
that you will keep up, then it starts to get painful. 01:26:50.140 |
Because reliability, cost, it's ungodly slow. 01:26:54.700 |
It's 48 hours is a long time to wait for a computer 01:26:57.860 |
to do something, even if it is an exaflop of operations. 01:27:02.860 |
So definitely, when starting out with fine-tuning,