
How fast are LLM inference engines anyway? — Charles Frye, Modal



00:00:00.000 | Thanks, everybody, for coming.
00:00:17.200 | Yeah, I wanted to talk about some work
00:00:19.400 | I've done recently on trying to figure out
00:00:23.120 | just how fast these inference engines are
00:00:26.440 | when you run open models on them.
00:00:30.000 | So I've been talking at AI Engineer
00:00:35.040 | since it was AI Engineer Summit two years ago.
00:00:38.880 | And for a long time, it's basically
00:00:43.020 | been the OpenAI wrapper conference, right?
00:00:46.320 | It's like, just because, yeah, what am I going to do?
00:00:48.640 | Am I going to run an agent with BERT?
00:00:49.900 | Probably not.
00:00:51.680 | And it was exciting to talk about all these cool new technologies,
00:00:55.200 | see people building stuff like Cursor on top of them, or Devin.
00:00:58.960 | But for me, as somebody coming from having
00:01:01.300 | trained my own models a lot, it was like, oh, man,
00:01:04.000 | I want to touch the weights.
00:01:05.060 | I want to play with them.
00:01:06.720 | I want to hack them.
00:01:09.840 | But the quality just wasn't there yet
00:01:11.980 | to do some of the interesting stuff.
00:01:14.900 | And that's changed.
00:01:15.720 | You've got the Llama series.
00:01:17.500 | We've got the Qwen series.
00:01:18.700 | We've got DeepSeek.
00:01:20.300 | And so now we're catching up, maybe even literally catching up
00:01:24.720 | with the frontier labs, which would be pretty crazy.
00:01:27.100 | But we're at the very least at the point where a lot of things
00:01:30.140 | people have been talking about at AI Engineer for years
00:01:32.260 | are possible with open weights models, where they weren't before.
00:01:35.540 | And at the same time, there's also the development
00:01:37.540 | of the software stack on top of that.
00:01:39.300 | Also, as somebody coming from writing
00:01:41.640 | and running my own PyTorch models, I was like,
00:01:43.640 | how hard could it be to run a language model?
00:01:45.380 | Like, you just use torch.nn.Module, wrap a class around it.
00:01:49.680 | And yeah, I mean, having done stuff with Transformers before,
00:01:52.760 | it's like, oh yeah, it's a little more complicated
00:01:54.480 | with Transformers.
00:01:55.260 | Training looks weird.
00:01:56.120 | Inference is different.
00:01:57.560 | But pretty quickly, the state of play
00:01:59.700 | has advanced a lot on how to run a Transformer, right?
00:02:02.180 | KV caching is just the first thing.
00:02:04.900 | And then it's this paged attention,
00:02:08.220 | now multi-token prediction, speculative decoding,
00:02:10.920 | all this stuff that's pretty hard to write yourself.
00:02:13.400 | And so you want software to do that for you, probably.
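
(A minimal sketch of the KV-caching idea mentioned above, assuming a toy single-head attention in plain PyTorch. Real engines layer paged attention, continuous batching, and speculative decoding on top of this, so treat it as an illustration of the concept, not any engine's implementation.)

```python
# Toy KV cache for one attention head: each decode step computes Q/K/V only for
# the newest token and reuses the cached K/V from earlier tokens, so each step
# does one matrix-vector attention instead of recomputing the whole prefix.
import torch

d = 64                                   # head dimension (illustrative)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

k_cache, v_cache = [], []                # grows by one entry per generated token

def decode_step(x_new):
    """x_new: (1, d) hidden state of the newest token only."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)           # only the new token's K/V get computed
    v_cache.append(x_new @ Wv)
    K = torch.cat(k_cache)               # (t, d) -- reused, not recomputed
    V = torch.cat(v_cache)               # (t, d)
    attn = torch.softmax(q @ K.T / d**0.5, dim=-1)
    return attn @ V                      # (1, d) output for the new token

for _ in range(4):                       # pretend to decode four tokens
    out = decode_step(torch.randn(1, d))
```
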
00:02:18.480 | And so these open source engines are available now.
00:02:20.880 | You've got vLLM and SGLang and TensorRT-LLM.
00:02:23.720 | So the combination of those two things
00:02:25.160 | has kind of flipped the playing field around
00:02:28.860 | to where you'd need a really good reason
00:02:31.420 | to run your own models, like you're the US government
00:02:33.880 | or something, or you want to run on an air-gapped system.
00:02:38.020 | Or you had to believe it in your heart.
00:02:42.440 | You had to want open source models,
00:02:44.580 | like Nous or Prime Intellect.
00:02:47.280 | You had to be like a decentralized crypto bro or whatever
00:02:50.760 | to want to run your own open models.
00:02:53.520 | But yeah, now the situation has changed.
00:02:56.520 | It's really exciting.
00:02:57.360 | It sort of finally makes sense to self-host.
00:03:01.160 | So just wanted to do this really quickly.
00:03:05.640 | Was anybody here at this AI Engineer Summit 2023, the first one?
00:03:10.660 | Anybody?
00:03:11.900 | Did anybody come to the AI Engineering 201
00:03:14.400 | workshop the day before at 500 Treat Avenue?
00:03:20.080 | Maybe not.
00:03:20.740 | I talked to one or two people who were there.
00:03:23.340 | So I gave a talk on how to do your own AI stuff back then.
00:03:28.820 | I just wanted to pull out a couple of slides.
00:03:31.820 | So one of the main-- this is 2023.
00:03:35.420 | We're like, who's going to win, open models or closed models?
00:03:38.220 | And the key statement in the talk was like, if capabilities requirements saturate, open models
00:03:43.700 | will catch up to proprietary models and then dominate for those cases.
00:03:47.460 | Inspired by what you see with operating systems, databases, programming languages: as soon
00:03:52.500 | as there's, you know, a capability level where you don't need
00:04:00.500 | the absolute best thing, you just need something that's good enough, open, collaborative projects
00:04:06.420 | tend to catch up and then have better properties.
00:04:09.720 | And that's happened with open models, so check.
00:04:11.820 | Yeah, so that was mostly around capabilities, and, you know, at the time there was only Llama,
00:04:20.520 | but now we've got a lot.
00:04:22.300 | The other one was, yeah, talked a little bit about LLM inference libraries at the time.
00:04:26.100 | And most of them are gone, TGI, RIP, for example, but vLLM was good then.
00:04:32.700 | It's stuck around now.
00:04:34.640 | Yeah, so I don't know what the slides in 2027's AI engineering conference are going to look like.
00:04:41.820 | But at the very least, like three of my slides from two years ago were right.
00:04:46.760 | So, yeah, hopefully it's not those three slides again in two years.
00:04:52.060 | But yeah, all right, so what does the LLM engine landscape look like two years later?
00:04:56.800 | Let's take a look.
00:04:58.100 | So, you know, we were advising a bunch of people on how to run these, like people were coming to us,
00:05:04.140 | like, I want to run my own code completion in editors.
00:05:08.820 | I want to run big backfill jobs to enrich data in databases with language models.
00:05:15.580 | It's too expensive to run it on OpenAI, or, I trained my own model,
00:05:20.560 | so it's too expensive to run it on a provider like Fireworks.
00:05:23.920 | They want to run it on more generic infrastructure like what we have at modal.
00:05:29.100 | And so people would come and they'd be like, all right,
00:05:31.800 | well, how fast can you run an 8-billion-parameter LLM with SGLang?
00:05:36.040 | Like with 128 tokens in, 1,024 tokens out, on a Tuesday when Mercury is in retrograde.
00:05:42.880 | And that would, at first, take a couple of days to, you know,
00:05:48.020 | figure out how to make sure all the packages are working,
00:05:50.020 | that we've got the fastest versions installed, and that we can give a, you know,
00:05:56.060 | trustworthy number. Got that down eventually to, you know,
00:05:59.500 | an hour or two, and then built some benchmarking software.
00:06:03.460 | So we get it done in about 15, 20 minutes.
00:06:06.400 | But then like, you know, the people ended up asking a lot of similar questions.
00:06:11.200 | So we decided, just, you know, the fifth mantra of performance is do it
00:06:17.040 | when they're not looking.
00:06:18.780 | So compute the thing ahead of time and store it.
00:06:21.380 | So we ran a giant benchmark over like 10 or so different models on vLLM, SGLang,
00:06:28.780 | and TensorRT-LLM on about 10 different context lengths and put that all up on the internet.
00:06:35.720 | So let's take a look at that.
00:06:36.820 | Let's see, I'll drop into this one.
00:06:40.760 | This is a live version of it.
00:06:42.060 | So you'll find this at modal.com/llm-almanac.
00:06:46.340 | The idea is that this is just one page in, you know, your almanac,
00:06:50.700 | your little book that has the useful things you need to know to be an LLM engineer.
00:06:55.180 | So to start, we've got our benchmarking results, benchmarking methodology in detail,
00:07:01.020 | the open source code for it, and a little executive summary.
00:07:04.480 | Hope to put more stuff up there as we accumulate the things people need.
00:07:08.320 | things about like speculative decoding and multi-token prediction and quantization.
00:07:13.320 | But yeah, so to start off, we got this little interface here.
00:07:15.560 | So yeah, anybody, what's a model people want to see results for?
00:07:20.940 | Any, if, hopefully that's legible to people.
00:07:24.240 | Anybody got a favorite?
00:07:25.700 | Qwen3?
00:07:27.700 | Qwen3?
00:07:29.700 | Mistral?
00:07:30.460 | Okay.
00:07:31.500 | Excuse me, that, oh man, that's, that's my boss.
00:07:34.080 | I'm not going to do that one.
00:07:34.980 | Okay, we'll do any engine here.
00:07:37.700 | Oh yeah, we didn't do, so this was fun.
00:07:40.040 | SGLang's Qwen3 support is a little buggy for the 8-bit quant that we ran.
00:07:46.180 | So I think we only have results for vLLM.
00:07:48.760 | We'll stick with any engine there.
00:07:50.420 | Oh yeah, by the way, if you try this thing out, you'll see
00:07:55.020 | there's a giant tensor of configurations, right?
00:07:57.460 | And we'd love to have that full tensor, but it's
00:08:00.220 | either not always possible or it's not clear how to do it.
00:08:04.800 | So there's a place where you can contribute configurations.
00:08:07.400 | So we can build up a nice big database of like how to run these models.
00:08:11.040 | We also haven't like carefully optimized any of these things.
00:08:13.500 | We started with out of the box performance for all the engines,
00:08:16.340 | which is like optimizing a hundred configurations is going to take some time.
00:08:19.540 | So we'd love like contributions of optimized implementations,
00:08:22.880 | especially TensorRT-LLM, which has like a ton of knobs, and they have names
00:08:28.620 | like user buffer, like what's that?
00:08:32.020 | Yeah.
00:08:33.720 | Okay. Yeah.
00:08:34.420 | So first token under one second.
00:08:36.220 | Let's say this is, this is a pretty common SLO.
00:08:38.560 | Like you want one second; nice round number.
00:08:41.100 | People feel like that's like a good amount of time to wait.
00:08:43.460 | I'd say 300 milliseconds is a tighter one.
00:08:45.740 | That's more like interactive.
00:08:47.300 | That's your Doherty threshold.
00:08:49.380 | If you're a fan of Halt and Catch Fire. Um, made-up number,
00:08:52.520 | but like if somebody repeats a made-up number enough, it's a real number.
00:08:55.560 | Um, so yeah, 300 milliseconds.
00:08:57.620 | Okay.
00:08:57.860 | So we can get a throughput of about one request per second on Qwen3,
00:09:04.260 | the mixture-of-experts model, on vLLM for 128 tokens in,
00:09:08.360 | 1,024 tokens out.
00:09:09.500 | Um, then over here, we've got a little, uh, code snippet.
00:09:13.780 | So it should be the case that you can uvx modal run this.
00:09:16.580 | And if you have a token, that should just work immediately.
00:09:19.120 | Uh, and by immediately, I mean, after five minutes of loading the model weights
00:09:22.720 | and spinning up the model server, but that's as immediate as it gets.
00:09:25.460 | Um, cool.
00:09:27.860 | All right.
00:09:28.120 | So that's one result.
00:09:29.060 | Uh, who's asking for Qwen?
00:09:30.660 | Are you satisfied?
00:09:31.740 | Yeah.
00:09:33.000 | Okay.
00:09:33.500 | Great.
00:09:34.640 | Gemma 3 27B.
00:09:37.580 | All right.
00:09:37.800 | On this one, let's do any engine here.
00:09:41.300 | All right.
00:09:41.660 | This was all right.
00:09:43.040 | Sorry.
00:09:43.300 | We got a tight filter on this.
00:09:44.460 | I'm going to put the first token filter up.
00:09:47.140 | Oh yeah.
00:09:47.500 | This one we're doing the BF16
00:09:50.700 | quant; it's the only one that we could get working at first.
00:09:53.340 | So I think we eventually got the 8-bit quants working.
00:09:56.280 | I don't think that's a hard blocker, but it was the easiest one to get going
00:09:59.420 | with, the BF16.
00:10:00.680 | So it's definitely slower.
00:10:02.280 | Um, so you'll see, a 27-billion-parameter model,
00:10:05.660 | so like 10x smaller in, uh, model weights, roughly the same number of active
00:10:10.620 | parameters as Qwen3, but we're getting about the same, um,
00:10:15.320 | throughput in requests per second on the same load, 128 in and 1,024 out.
00:10:19.860 | Um, so yeah, so it's interesting.
00:10:22.300 | You can see sort of which ones have had more optimization work on them.
00:10:26.380 | I think the Qwen3 models and the Llama model series, you see a lot
00:10:30.040 | more optimization.
00:10:31.080 | Um, this is also one of the ones where you saw the biggest gap between, um,
00:10:35.880 | SGLang and vLLM, the Gemma one.
00:10:38.220 | So it looks like the vLLM team spent a little bit more time, or Google's
00:10:41.520 | contributed a little bit more to vLLM on, uh, getting good results.
00:10:45.120 | Oh yeah.
00:10:46.200 | Let's go.
00:10:46.800 | Yeah.
00:10:47.100 | So the other thing you'll see is you generally like, so I just switched.
00:10:50.460 | Sorry.
00:10:50.700 | I should say what I'm doing.
00:10:52.160 | So this is 128 tokens in, 1,024 tokens out.
00:10:54.920 | Uh, and you can see we're getting about one request per second on this guy.
00:10:58.800 | Uh, and the first token comes back in 400 milliseconds.
00:11:03.680 | Let's flip it to 1,024 in and 128 out.
00:11:07.240 | Right?
00:11:08.120 | So this is going from like a reasoning workload to like a RAG workload, big scare
00:11:13.180 | quotes on that, but it's just the difference between whether you're dominated by decode
00:11:16.600 | time, like more of your tokens are decode or more of your tokens are, uh, prefill.
00:11:21.360 | Uh, and what you'll see very consistently in these results is that you get much higher
00:11:27.380 | throughput if you have more like tokens in the context, as opposed to tokens being generated.
00:11:32.360 | Very straightforward.
00:11:33.260 | I mean, if you know your transformer architecture, it's like autoregressive versus parallel.
00:11:37.280 | Like, yeah, one of the first things you would learn if you looked at the, you know, kind
00:11:40.820 | of the implementation of the architecture, but it's nice to see it, you know,
00:11:44.360 | very cleanly. More of an empiricist than a rationalist myself,
00:11:47.520 | so I like to see data, um, and not, like, chalkboard stuff.
00:11:51.620 | Um, so yeah, what I'm getting at here is that the requests per second
00:11:57.740 | that we're seeing here is about four requests per second for vLLM,
00:12:01.540 | um, on the same workload, but with context, instead of generation, being where
00:12:08.860 | the majority of the tokens are.
00:12:10.180 | Um, so, yeah, I gave a talk on GPUs a little bit earlier today.
00:12:16.860 | And one of the big takeaways there is find things that are throughput oriented
00:12:21.840 | and involve a lot of arithmetic and not, like, moving memory around or communication.
00:12:26.640 | And that's exactly the difference here.
00:12:29.040 | You have like big matrix matrix multiplications, load the weights one time, use them a bunch.
00:12:34.620 | Um, and that's exactly the difference here.
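
(A rough sketch of the prefill-versus-decode contrast being described: prefill is one big matrix-matrix multiply that loads the weights once and reuses them across all the prompt tokens, while decode is a long series of matrix-vector multiplies. Shapes and timings here are illustrative only, run on CPU, not the benchmark's numbers.)

```python
# Same arithmetic, two shapes: one (n_tokens x d) @ (d x d) "prefill" matmul vs.
# n_tokens separate (1 x d) @ (d x d) "decode" matmuls. The batched version wins
# because the weight matrix is read once and reused, not re-streamed per token.
import time
import torch

d_model, n_tokens = 2048, 512            # illustrative sizes
W = torch.randn(d_model, d_model)
X = torch.randn(n_tokens, d_model)

t0 = time.perf_counter()
_ = X @ W                                # "prefill": all tokens at once
prefill_s = time.perf_counter() - t0

t0 = time.perf_counter()
for i in range(n_tokens):                # "decode": one token at a time
    _ = X[i : i + 1] @ W
decode_s = time.perf_counter() - t0

print(f"prefill {prefill_s:.3f}s vs decode {decode_s:.3f}s for the same FLOPs")
```
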
00:12:36.480 | And it's a 4x improvement, and that's using BF16, which does not have tensor core support.
00:12:41.860 | No, no.
00:12:42.600 | BF16 has tensor core support, but it's the slow tensor core support compared to FP8
00:12:47.840 | or FP4 on Hopper and Blackwell.
00:12:50.060 | And so, like, the real win
00:12:53.240 | there is the shorter numbers, faster multiplication.
00:12:56.820 | It's actually quadratic in the bit width.
00:12:58.600 | So you get a big win as you go down.
00:13:00.640 | So, like, if we were to run some results with FP4 on Blackwells, you would see an
00:13:05.520 | even bigger gap than just this 4x improvement. 4x is like barely enough to
00:13:09.900 | wake up for, you know?
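
(As a rough back-of-the-envelope for that "quadratic in the bit width" point: if you model the cost of a multiply as scaling with the square of the operand width, the ratios below fall out. Actual tensor-core throughput depends on the specific GPU generation, so this is just the speaker's rough model, not a spec sheet.)

```latex
\text{cost}(b) \propto b^{2}
\quad\Rightarrow\quad
\frac{\text{cost}(\mathrm{BF16})}{\text{cost}(\mathrm{FP8})} \approx \frac{16^{2}}{8^{2}} = 4,
\qquad
\frac{\text{cost}(\mathrm{BF16})}{\text{cost}(\mathrm{FP4})} \approx \frac{16^{2}}{4^{2}} = 16.
```
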
00:13:10.740 | Um, yeah, so, like, not every application
00:13:15.720 | can change that.
00:13:16.920 | Like, your users might be bringing queries to you, so you don't have control.
00:13:21.060 | Um, but it's more of the sort of thing where,
00:13:26.800 | like a product person is like, we should improve the quality.
00:13:30.160 | You're like, Oh, how can I improve the quality without like killing our latency?
00:13:35.500 | Don't immediately reach for reasoning; reach for context instead.
00:13:40.320 | Cause it's going to be cheaper and you're going to get better performance.
00:13:42.700 | Like you're, you're going to find it easier to hit your latency SLAs.
00:13:46.660 | I forgot to point this out, but the latency is almost identical in time to
00:13:50.960 | first token, even though we're doing 10 times as many tokens. Basically a free lunch.
00:13:55.600 | Um, yeah.
00:13:57.420 | Okay.
00:13:57.680 | So that's, uh, that's sort of like how I envision people using this interface and the data.
00:14:02.680 | Um, there is a, there's a URL somewhere where you can just download the raw data.
00:14:06.460 | Um, if you're interested in that, hit me up.
00:14:08.200 | The code is also open source.
00:14:09.960 | If you want to run some of these benchmarks yourself.
00:14:12.160 | Um, I think I've, I'll close there.
00:14:13.980 | There are lots of other stuff to talk about.
00:14:15.340 | Like our benchmarking methodology, which is written up here, the, uh, executive summary, which you
00:14:20.680 | can share with your, um, with your leadership, um, uh, on like running open models.
00:14:26.980 | Um, but I'll take a question or two before we close out.
00:14:29.980 | Yeah.
00:14:30.980 | Yeah.
00:14:31.980 | Yeah.
00:14:32.980 | So one thing I'll say is, this is throughput per replica.
00:14:43.800 | Right.
00:14:44.300 | So this is one GPU.
00:14:46.200 | Yeah.
00:14:46.400 | One H100.
00:14:47.680 | So the way you solve your total throughput is by scaling out rather than
00:14:54.620 | scaling up.
00:14:55.500 | Right.
00:14:56.000 | Um, so if you want 400 QPS or whatever, like eventually you're just
00:15:01.660 | going to have to scale out. But to your question of, yeah, how do you know,
00:15:05.480 | like, why are we saying this is the highest throughput you can get?
00:15:10.160 | Yeah.
00:15:10.860 | So the answer goes to our benchmarking methodology.
00:15:14.420 | What we do is first we dump, you know, a thousand requests and wait for them all
00:15:19.120 | to come back, then calculate requests divided by seconds,
00:15:25.060 | a thousand divided by how long it took.
00:15:26.500 | That's like a maximum throughput.
00:15:28.460 | Right.
00:15:29.140 | Cause we gave it max, we exposed the maximum parallelism to the engine.
00:15:32.260 | So presumably they knew how to, they were smart enough to handle that.
00:15:35.440 | Use your maximum RPS.
00:15:36.940 | Any more than that, you should expect from queuing theory that the latency will blow up.
00:15:40.380 | Right.
00:15:40.980 | Then the other side is, um, you send one request at a time, wait for it to come back,
00:15:45.760 | send another.
00:15:46.420 | And that gives us, like, the fastest you could possibly run the server.
00:15:50.360 | And we sweep between to get the numbers that are here.
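
(A minimal sketch of that sweep, assuming a hypothetical OpenAI-compatible completions endpoint and the httpx library; the real, open-source benchmark code is the thing to actually use. This just shows the two ends, concurrency 1 for best-case latency and a very high concurrency for max throughput, plus a few points in between.)

```python
# Sweep client-side concurrency against a (hypothetical) local inference server
# and report requests per second at each level. Endpoint, model name, and sizes
# are placeholders, not values from the benchmark.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1/completions"      # hypothetical server
PAYLOAD = {"model": "my-model", "prompt": "hi", "max_tokens": 128}

async def one_request(client: httpx.AsyncClient) -> None:
    r = await client.post(BASE_URL, json=PAYLOAD, timeout=120)
    r.raise_for_status()

async def measure(concurrency: int, total: int = 1000) -> float:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(client):
        async with sem:
            await one_request(client)

    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        await asyncio.gather(*(bounded(client) for _ in range(total)))
        elapsed = time.perf_counter() - start
    return total / elapsed                              # requests per second

async def sweep() -> None:
    for c in (1, 8, 64, 1000):                          # 1 = latency end, 1000 = throughput end
        print(f"concurrency={c:>4}  ~{await measure(c):.2f} req/s")

# asyncio.run(sweep())   # uncomment once a server is actually running
```
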
00:15:52.880 | But yeah, cool.
00:15:54.180 | All right.
00:15:54.500 | I'll, uh, got to move on to the next talk.
00:15:56.740 | Uh, I'll be outside if you, uh, have any questions.
00:15:59.160 | Thank you very much.