How fast are LLM inference engines anyway? — Charles Frye, Modal

It's been two years since that first AI Engineer Summit. It was exciting back then to talk about all these cool new technologies and to see people building stuff like Cursor or Devin on top of them. As someone who had trained my own models a lot, it was like, oh, man. And so now we're catching up, maybe even literally catching up, with the frontier labs, which would be pretty crazy. But we're at the very least at the point where a lot of things people have been talking about at AI Engineer for years are possible with open-weights models, where they weren't before.
And at the same time, there's also been the development of these inference engines. Coming from training and running my own PyTorch models, I was like, how hard could it be to run a language model? You just take a torch.nn.Module and wrap a class around it. And yeah, having done stuff with Transformers before, it's like, oh yeah, it's a little more complicated than that. The field has advanced a lot on how to run a Transformer: now there's multi-token prediction, speculative decoding, all this stuff that's pretty hard to write yourself. And so you probably want software to do that for you.
And so these open-source engines are available now: you've got vLLM and SGLang and TensorRT-LLM. It used to be that you needed a pretty specific reason to run your own models, like you're the US government or something, or you want to run on an air-gapped system. You had to be like a decentralized crypto bro or whatever.
Was anybody here at AI Engineer Summit 2023, the first one? Or at the workshop the day before at 500 Treat Avenue? I talked to one or two people who were there. I gave a talk back then on how to do your own AI stuff, and I just wanted to pull out a couple of slides from it.
We were asking, who's going to win, open models or closed models? And the key statement in the talk was: if capabilities requirements saturate, open models will catch up to proprietary models and then dominate for those cases. That was inspired by what you see with operating systems, databases, and programming languages: as soon as there's a capability level where you don't need the absolute best thing, you just need something that's good enough, open, collaborative projects tend to catch up and then have better properties. And that's happened with open models, so check.
So that was mostly about capabilities, and at the time there was only Llama. The other slide talked a little bit about LLM inference libraries at the time. Most of them are gone now (TGI, RIP, for example), but vLLM was good then. I don't know what the slides at the 2027 AI engineering conference are going to look like, but at the very least three of my slides from two years ago were right. Hopefully it's not those same three slides again in two years.
All right, so what does the LLM engine landscape look like two years later? We were advising a bunch of people on how to run models. People were coming to us saying: I want to run my own code completion model in my editor. I want to run big backfill jobs to enrich data in databases with language models. It's too expensive to run it on OpenAI, or I trained my own model, so it's too expensive to run it on a provider like Fireworks. They want to run it on more generic infrastructure, like what we have at Modal. And so people would come and ask: all right, how fast can you run an 8-billion-parameter LLM with SGLang, with 128 tokens in and 1,024 tokens out, on a Tuesday when Mercury is in retrograde?
At first, that would take a couple of days: figuring out how to make sure all the packages are working, that we've got the fastest versions installed, and that we can give a trustworthy number. We eventually got that down to an hour or two, and then we built some benchmarking software so we can get it done in about 15 or 20 minutes. But people ended up asking a lot of similar questions, so we followed the fifth mantra of performance: do it ahead of time. Compute the thing ahead of time and store it. So we ran a giant benchmark over 10 or so different models on vLLM, SGLang, and TensorRT-LLM, on about 10 different context lengths, and put that all up on the internet.
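To give a sense of the shape of that precomputation, here is a minimal sketch of sweeping a configuration grid. The model names, engines, and token counts are illustrative placeholders, not the exact grid behind the published results, and run_benchmark is a stub standing in for spinning up an engine and measuring it.

```python
import itertools

# Illustrative grid; the published benchmark covers roughly 10 models,
# 3 engines, and about 10 context-length combinations.
MODELS = ["meta-llama/Llama-3.1-8B-Instruct", "Qwen/Qwen3-30B-A3B"]  # placeholder model IDs
ENGINES = ["vllm", "sglang", "tensorrt-llm"]
WORKLOADS = [(128, 1024), (1024, 128), (2048, 2048)]                 # (tokens in, tokens out)

def run_benchmark(model: str, engine: str, tokens_in: int, tokens_out: int) -> dict:
    """Stub: in practice this spins up the engine, drives load, and records
    throughput and latency percentiles."""
    return {"model": model, "engine": engine, "tokens_in": tokens_in,
            "tokens_out": tokens_out, "requests_per_second": None}

# Compute the whole grid ahead of time and store it, so lookups are instant later.
results = [
    run_benchmark(model, engine, tokens_in, tokens_out)
    for model, engine, (tokens_in, tokens_out)
    in itertools.product(MODELS, ENGINES, WORKLOADS)
]
```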
The idea is that this is just one page in your almanac, the little book that has the useful things you need to know to be an LLM engineer. To start, we've got our benchmarking results, the benchmarking methodology in detail, the open-source code for it, and a little executive summary. We hope to put more stuff up there as we accumulate the things people need: things about speculative decoding and multi-token prediction and quantization. So to start off, we've got this little interface here. Anybody, what's a model people want to see results for? Excuse me, that's, oh man, that's my boss. SGLang's Qwen3 support is a little buggy for the 8-bit quant that we ran.
Oh yeah, by the way, if you try this thing out, you'll see there's a giant tensor of configurations. We'd love to have that full tensor, but it's either not always possible or it's not clear how to do it. So there's a place where you can contribute configurations, so we can build up a nice big database of how to run these models. We also haven't carefully optimized any of these things; we started with out-of-the-box performance for all the engines, because optimizing a hundred configurations is going to take some time. So we'd love contributions of optimized implementations, especially for TensorRT-LLM, which has a ton of knobs, and they have names...
Let's say this is a pretty common SLO: you want one second, a nice round number. People feel like that's a good amount of time to wait. If you're a fan of Halt and Catch Fire, you know it's a made-up number, but if somebody repeats a made-up number enough, it's a real number. With that constraint, we can get a throughput of about one request per second on Qwen3, the mixture-of-experts model, on vLLM, for 128 tokens in and 1,024 tokens out.
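To make that concrete, here is a rough sketch of the kind of query you might run against the downloadable raw data: filter by workload and a latency budget, then take the best throughput. The file name and column names are hypothetical, not the schema of the actual published dataset.

```python
import pandas as pd

# Hypothetical schema: the real downloadable dataset may use different column names.
df = pd.read_json("llm_engine_benchmarks.json")

SLO_SECONDS = 1.0  # the "nice round number" latency budget
eligible = df[
    (df["model"] == "qwen3-moe")
    & (df["tokens_in"] == 128)
    & (df["tokens_out"] == 1024)
    & (df["p95_time_to_first_token_s"] <= SLO_SECONDS)
]
# Best configuration per engine that still meets the budget.
best = eligible.sort_values("requests_per_second", ascending=False).groupby("engine").head(1)
print(best[["engine", "requests_per_second", "p95_time_to_first_token_s"]])
```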
Since we're over here, we've got a little code snippet. It should be the case that you can uvx modal run this, and if you have a token, it should just work immediately. And by immediately, I mean after five minutes of loading the model weights and spinning up the model server, but that's as immediate as it gets.
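Once a server like that is up, these engines generally expose an OpenAI-compatible HTTP API, so a request in the spirit of the 128-in, 1,024-out workload looks roughly like the sketch below. The base URL and model name are placeholders for whatever your deployment actually serves.

```python
from openai import OpenAI

# Placeholder endpoint and credentials: point these at your own deployment.
client = OpenAI(base_url="https://your-deployment.example.com/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="served-model-name",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Give me a detailed plan for benchmarking an LLM server."}],
    max_tokens=1024,            # roughly the 1,024-tokens-out half of the benchmark workload
)
print(response.choices[0].message.content)
```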
That quant is the only one we could get working at first; I think we eventually got the 8-bit quants working. I don't think it's a hard blocker, but it was the easiest one to get going with. So you'll see 27-billion-parameter models, roughly 10x smaller in model weights but with about the same number of active parameters as the Qwen3 mixture-of-experts model, and we're getting about the same throughput in requests per second on the same load, 128 in and 1,024 out.
You can see which models have had more optimization work on them. With the Qwen3 models and the Llama model series, you see a lot more of that. This is also one of the models where you see the biggest gap between the engines. It looks like the vLLM team spent a little bit more time, or Google contributed a little bit more to vLLM, on getting good results. The other thing you'll see, and I just switched here: we're getting about one request per second on this one, and the first token comes back in 400 milliseconds.
So this is going from a reasoning workload to a RAG workload, big scare quotes on those, but it's just the difference between whether you're dominated by decode time, that is, whether more of your tokens are decode or more of your tokens are prefill. And what you'll see very consistently in these results is that you get much higher throughput if you have more tokens in the context, as opposed to tokens being generated. If you know your Transformer architecture, this is autoregressive versus parallel, one of the first things you would learn if you looked at the implementation of the architecture. But it's nice to see it come out cleanly in the data. I'm more of an empiricist than a rationalist myself, so I like to see data and not chalkboard stuff.
What I'm getting at here is that the requests per second we're seeing is about four requests per second for vLLM on the same workload, but with the tokens in the context instead of in the generation. I gave a talk on GPUs a little bit earlier today, and one of the big takeaways there is: find workloads that are throughput-oriented and involve a lot of arithmetic, rather than moving memory around or communication. With lots of tokens in the context, you have big matrix-matrix multiplications: you load the weights one time and use them a bunch.
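A back-of-the-envelope way to see why prefill-heavy workloads run faster: for a single d by d weight matrix, every token costs about 2*d^2 floating point operations, but the weights only need to be read from memory once per batch of tokens that reuses them. The sketch below is illustrative only; it ignores KV-cache traffic, attention, batching across requests, and kernel details.

```python
# Rough arithmetic-intensity estimate (FLOPs per byte of weights read) for one
# d x d weight matrix in BF16. Illustrative only.
def arithmetic_intensity(tokens_per_weight_load: int, d: int = 4096) -> float:
    flops = 2 * tokens_per_weight_load * d * d   # matmul: ~2*d^2 FLOPs per token
    weight_bytes = 2 * d * d                     # BF16: 2 bytes per parameter, read once
    return flops / weight_bytes

print(arithmetic_intensity(1))     # decode: 1 token per weight load  -> 1.0 FLOP/byte
print(arithmetic_intensity(128))   # prefill of 128 tokens            -> 128.0 FLOPs/byte
```

Decode reuses each loaded weight for only one token, so it sits near the memory-bound end, while prefill amortizes the same weight traffic over many tokens.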
And it's a 4x improvement, and that's using BF16, which does not have, well, BF16 has tensor core support, but it's the slow tensor core support compared to FP8. The real win there is shorter numbers, faster multiplication. So if we were to run some results with FP4 on Blackwells, you would see an even bigger gap than this 4x improvement. 4x is barely enough to...
Now, not every application can do this. Your users might be bringing the queries to you, so you don't have control. But it's more the sort of thing where a product person says, we should improve the quality, and you ask, how can I improve the quality without killing our latency? Don't immediately reach for reasoning; reach for context instead. It's going to be cheaper and you're going to get better performance; you're going to find it easier to hit your latency SLAs. I forgot to point this out, but the latency, the time to first token, is almost identical even though we're putting in 10 times as many tokens. Basically a free lunch.
So that's sort of how I envision people using this interface and the data. There's a URL where you can just download the raw data if you want to run some of these benchmarks yourself. Our benchmarking methodology is written up there in detail, along with an executive summary on running open models that you can share with your leadership. But I'll take a question or two before we close out.
One thing I'll say is that this is throughput per replica. The way you get your total throughput is by scaling out rather than pushing a single replica harder. So if you want 400 QPS or whatever, eventually you're just going to have to scale out. But to your question of how we know this is the highest throughput you can get: the answer goes to our benchmarking methodology. What we do is, first, we dump something like a thousand requests on the server, wait for them all to come back, and calculate requests divided by seconds. Because we exposed the maximum parallelism to the engine, presumably it was smart enough to handle that; any more load than that and you should expect, from queuing theory, that the latency will blow up. Then the other side is, you send one request at a time and wait for it to come back, and that gives you the fastest the server could possibly respond. And we sweep between those two to get the numbers that are here.
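As a simplified sketch of that two-sided methodology (not the actual open-source benchmark code), assuming an OpenAI-compatible endpoint at a placeholder URL: one side floods the server and measures requests per second, the other sends a single uncontended request to measure best-case latency, and the real harness sweeps concurrency levels in between.

```python
import asyncio
import time

import httpx

URL = "https://your-deployment.example.com/v1/chat/completions"  # placeholder endpoint
PAYLOAD = {  # stand-in request body for a fixed tokens-in / tokens-out workload
    "model": "served-model-name",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 1024,
}

async def send_one(client: httpx.AsyncClient) -> None:
    resp = await client.post(URL, json=PAYLOAD, timeout=600)
    resp.raise_for_status()

async def max_throughput(n: int = 1000) -> float:
    """Dump n requests at once and report requests per second once all return."""
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        await asyncio.gather(*(send_one(client) for _ in range(n)))
        return n / (time.perf_counter() - start)

async def min_latency() -> float:
    """Send a single request with no contention: the fastest a request can complete."""
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        await send_one(client)
        return time.perf_counter() - start

if __name__ == "__main__":
    print("max throughput (req/s):", asyncio.run(max_throughput()))
    print("min latency (s):", asyncio.run(min_latency()))
```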
I'll be outside if you have any questions.