
How fast are LLM inference engines anyway? — Charles Frye, Modal



00:00:00.000 | Thanks, everybody, for coming.
00:00:17.200 | Yeah, I wanted to talk about some work
00:00:19.400 | I've done recently on trying to figure out
00:00:23.120 | just how fast these inference engines are
00:00:26.440 | when you run open models on them.
00:00:30.000 | So I've been talking at AI Engineer
00:00:35.040 | since it was AI Engineer Summit two years ago.
00:00:38.880 | And for a long time, it's basically
00:00:43.020 | been the OpenAI wrapper conference, right?
00:00:46.320 | It's like, just because, yeah, what am I going to do?
00:00:48.640 | Am I going to run an agent with BERT?
00:00:49.900 | Probably not.
00:00:51.680 | And it was exciting to talk about all these cool new technologies,
00:00:55.200 | see people building stuff like Cursor on top of them, or Devin.
00:00:58.960 | But for me, as somebody coming from having
00:01:01.300 | trained my own models a lot, it was like, oh, man,
00:01:04.000 | I want to touch the weights.
00:01:05.060 | I want to play with them.
00:01:06.720 | I want to hack them.
00:01:09.840 | But the quality just wasn't there yet
00:01:11.980 | to do some of the interesting stuff.
00:01:14.900 | And that's changed.
00:01:15.720 | You've got the Llama series.
00:01:17.500 | We've got the Qwen series.
00:01:18.700 | We've got DeepSeek.
00:01:20.300 | And so now we're catching up, maybe even literally catching up
00:01:24.720 | with the frontier labs, which would be pretty crazy.
00:01:27.100 | But we're at the very least at the point where a lot of things
00:01:30.140 | people have been talking about at AI Engineer for years
00:01:32.260 | are possible with open weights models, where they weren't before.
00:01:35.540 | And at the same time, there's also the development
00:01:37.540 | of the software stack on top of that.
00:01:39.300 | Also, as somebody coming from writing
00:01:41.640 | and running my own PyTorch models, I was like,
00:01:43.640 | how hard could it be to run a language model?
00:01:45.380 | Like, you just use torch.nn.Module, wrap a class around it.
00:01:49.680 | And yeah, I mean, having done stuff with Transformers before,
00:01:52.760 | it's like, oh yeah, it's a little more complicated
00:01:54.480 | with Transformers.
00:01:55.260 | Training looks weird.
00:01:56.120 | Inference is different.
00:01:57.560 | But pretty quickly, the state of play
00:01:59.700 | has advanced a lot on how to run a Transformer, right?
00:02:02.180 | KV caching is just the first thing.
00:02:04.900 | And then it's this paged attention,
00:02:08.220 | now multi-token prediction, speculative decoding,
00:02:10.920 | all this stuff that's pretty hard to write yourself.
00:02:13.400 | And so you want software to do that for you, probably.
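
(A minimal sketch of the KV-caching idea mentioned above, assuming a toy single-head attention in plain PyTorch. Real engines layer paged attention, continuous batching, and speculative decoding on top of this, so treat it as an illustration of the concept, not any engine's implementation.)

```python
# Toy KV cache for one attention head: each decode step computes Q/K/V only for
# the newest token and reuses the cached K/V from earlier tokens, so each step
# does one matrix-vector attention instead of recomputing the whole prefix.
import torch

d = 64                                   # head dimension (illustrative)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

k_cache, v_cache = [], []                # grows by one entry per generated token

def decode_step(x_new):
    """x_new: (1, d) hidden state of the newest token only."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)           # only the new token's K/V get computed
    v_cache.append(x_new @ Wv)
    K = torch.cat(k_cache)               # (t, d) -- reused, not recomputed
    V = torch.cat(v_cache)               # (t, d)
    attn = torch.softmax(q @ K.T / d**0.5, dim=-1)
    return attn @ V                      # (1, d) output for the new token

for _ in range(4):                       # pretend to decode four tokens
    out = decode_step(torch.randn(1, d))
```
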
00:02:18.480 | And so these open source engines are available now.
00:02:20.880 | You've got vLLM and SGLang and TensorRT-LLM.
00:02:23.720 | So the combination of those two things
00:02:25.160 | has kind of flipped the playing field around
00:02:28.860 | to where you'd need a really good reason
00:02:31.420 | to run your own models, like you're the US government
00:02:33.880 | or something, or you want to run on an air-gapped system.
00:02:38.020 | Or you had to believe it in your heart.
00:02:42.440 | You had to want open source models,
00:02:44.580 | like Nous or Prime Intellect.
00:02:47.280 | You had to be like a decentralized crypto bro or whatever
00:02:50.760 | to want to run your own open models.
00:02:53.520 | But yeah, now the situation has changed.
00:02:56.520 | It's really exciting.
00:02:57.360 | It sort of finally makes sense to self-host.
00:03:01.160 | So just wanted to do this really quickly.
00:03:05.640 | Was anybody here at this AI Engineer Summit 2023, the first one?
00:03:10.660 | Anybody?
00:03:11.900 | Did anybody come to the AI Engineering 201
00:03:14.400 | workshop the day before at 500 Treat Avenue?
00:03:20.080 | Maybe not.
00:03:20.740 | I talked to one or two people who were there.
00:03:23.340 | So I gave a talk on how to do your own AI stuff back then.
00:03:28.820 | I just wanted to pull out a couple of slides.
00:03:31.820 | So one of the main-- this is 2023.
00:03:35.420 | We're like, who's going to win, open models or closed models?
00:03:38.220 | And the key statement in the talk was like, if capabilities requirements saturate, open models
00:03:43.700 | will catch up to proprietary models and then dominate for those cases.
00:03:47.460 | Inspired by what you see with operating systems, databases, programming languages: as soon
00:03:52.500 | as there's, you know, a capability level where you don't need
00:04:00.500 | the absolute best thing, you just need something that's good enough, open, collaborative projects
00:04:06.420 | tend to catch up and then have better properties.
00:04:09.720 | And that's happened with open models, so check.
00:04:11.820 | Yeah, so that was mostly around capabilities, and, you know, at the time there was only Llama,
00:04:20.520 | but now we've got a lot.
00:04:22.300 | The other one was, yeah, talked a little bit about LLM inference libraries at the time.
00:04:26.100 | And most of them are gone, TGI, RIP, for example, but vLLM was good then.
00:04:32.700 | It's stuck around now.
00:04:34.640 | Yeah, so I don't know what the slides in 2027's AI engineering conference are going to look like.
00:04:41.820 | But at the very least, like three of my slides from two years ago were right.
00:04:46.760 | So, yeah, hopefully it's not those three slides again in two years.
00:04:52.060 | But yeah, all right, so what does the LLM engine landscape look like two years later?
00:04:56.800 | Let's take a look.
00:04:58.100 | So, you know, we were advising a bunch of people on how to run these, like people were coming to us,
00:05:04.140 | like, I want to run my own code completion in editors.
00:05:08.820 | I want to run big backfill jobs to enrich data in databases with language models.
00:05:15.580 | It's too expensive to run it on OpenAI, or, I trained my own model,
00:05:20.560 | so it's too expensive to run it on a provider like Fireworks.
00:05:23.920 | They want to run it on more generic infrastructure like what we have at modal.
00:05:29.100 | And so people would come and they'd be like, all right,
00:05:31.800 | well, how fast can you run an 8-billion-parameter LLM with SGLang?
00:05:36.040 | Like with 128 tokens in, 1,024 tokens out, on a Tuesday when Mercury is in retrograde.
00:05:42.880 | And that would, at first, take a couple of days to, you know,
00:05:48.020 | figure out how to make sure all the packages are working,
00:05:50.020 | that we've got the fastest versions installed, and that we can give a, you know,
00:05:56.060 | trustworthy number. Got that down eventually to, you know,
00:05:59.500 | an hour or two, and then built some benchmarking software.
00:06:03.460 | So we get it done in about 15, 20 minutes.
00:06:06.400 | But then like, you know, the people ended up asking a lot of similar questions.
00:06:11.200 | So we decided, just, you know, the fifth mantra of performance is do it
00:06:17.040 | when they're not looking.
00:06:18.780 | So compute the thing ahead of time and store it.
00:06:21.380 | So we ran a giant benchmark over like 10 or so different models on vLLM, SGLang,
00:06:28.780 | and TensorRT-LLM on about 10 different context lengths and put that all up on the internet.
00:06:35.720 | So let's take a look at that.
00:06:36.820 | Let's see, I'll drop into this one.
00:06:40.760 | This is a live version of it.
00:06:42.060 | So you'll find this at modal.com/llm-almanac.
00:06:46.340 | The idea is that this is just one page in, you know, your almanac,
00:06:50.700 | your little book that has the useful things you need to know to be an LLM engineer.
00:06:55.180 | So to start, we've got our benchmarking results, benchmarking methodology in detail,
00:07:01.020 | the open source code for it, and a little executive summary.
00:07:04.480 | Hope to put more stuff up there as we accumulate the things people need.
00:07:08.320 | things about like speculative decoding and multi-token prediction and quantization.
00:07:13.320 | But yeah, so to start off, we got this little interface here.
00:07:15.560 | So yeah, anybody, what's a model people want to see results for?
00:07:20.940 | Any, if, hopefully that's legible to people.
00:07:24.240 | Anybody got a favorite?
00:07:25.700 | Qwen3?
00:07:27.700 | Qwen3?
00:07:29.700 | Mistral?
00:07:30.460 | Okay.
00:07:31.500 | Excuse me, that, oh man, that's, that's my boss.
00:07:34.080 | I'm not going to do that one.
00:07:34.980 | Okay, we'll do any engine here.
00:07:37.700 | Oh yeah, we didn't do, so this was fun.
00:07:40.040 | SGLang's Qwen3 support is a little buggy for the 8-bit quant that we ran.
00:07:46.180 | So I think we only have results for vLLM.
00:07:48.760 | We'll stick with any engine there.
00:07:50.420 | Oh yeah, by the way, if you try this thing out, you'll see
00:07:55.020 | there's a giant tensor of configurations, right?
00:07:57.460 | And we'd love to have that full tensor, but it's
00:08:00.220 | either not always possible or it's not clear how to do it.
00:08:04.800 | So there's a place where you can contribute configurations.
00:08:07.400 | So we can build up a nice big database of like how to run these models.
00:08:11.040 | We also haven't like carefully optimized any of these things.
00:08:13.500 | We started with out of the box performance for all the engines,
00:08:16.340 | which is like optimizing a hundred configurations is going to take some time.
00:08:19.540 | So we'd love like contributions of optimized implementations,
00:08:22.880 | especially TensorRT-LLM, which has like a ton of knobs, and they have names
00:08:28.620 | like user buffer, like what's that?
00:08:32.020 | Yeah.
00:08:33.720 | Okay. Yeah.
00:08:34.420 | So first token under one second.
00:08:36.220 | Let's say this is, this is a pretty common SLO.
00:08:38.560 | Like you want one second; nice round number.
00:08:41.100 | People feel like that's like a good amount of time to wait.
00:08:43.460 | I'd say 300 milliseconds is a tighter one.
00:08:45.740 | That's more like interactive.
00:08:47.300 | That's your Doherty threshold.
00:08:49.380 | If you're a fan of Halt and Catch Fire. Um, made-up number,
00:08:52.520 | but like if somebody repeats a made-up number enough, it's a real number.
00:08:55.560 | Um, so yeah, 300 milliseconds.
00:08:57.620 | Okay.
00:08:57.860 | So we can get a throughput of about one request per second on Qwen3,
00:09:04.260 | the mixture-of-experts model, on vLLM for 128 tokens in,
00:09:08.360 | 1,024 tokens out.
00:09:09.500 | Um, then over here, we've got a little, uh, code snippet.
00:09:13.780 | So it should be the case that you can uvx modal run this.
00:09:16.580 | And if you have a token, that should just work immediately.
00:09:19.120 | Uh, and by immediately, I mean, after five minutes of loading the model weights
00:09:22.720 | and spinning up the model server, but that's as immediate as it gets.
00:09:25.460 | Um, cool.
00:09:27.860 | All right.
00:09:28.120 | So that's one result.
00:09:29.060 | Uh, who's asking for Qwen?
00:09:30.660 | Are you satisfied?
00:09:31.740 | Yeah.
00:09:33.000 | Okay.
00:09:33.500 | Great.
00:09:34.640 | Gemma 3 27B.
00:09:37.580 | All right.
00:09:37.800 | On this one, let's do any engine here.
00:09:41.300 | All right.
00:09:41.660 | This was all right.
00:09:43.040 | Sorry.
00:09:43.300 | We got a tight filter on this.
00:09:44.460 | I'm going to put the first token filter up.
00:09:47.140 | Oh yeah.
00:09:47.500 | This one we're doing the BF16
00:09:50.700 | quant; it's the only one that we could get working at first.
00:09:53.340 | So I think we eventually got the 8-bit quants working.
00:09:56.280 | I don't think that's a hard blocker, but it was the easiest one to get going
00:09:59.420 | with, the BF16.
00:10:00.680 | So it's definitely slower.
00:10:02.280 | Um, so you'll see, a 27-billion-parameter model,
00:10:05.660 | so like 10x smaller in, uh, model weights, roughly the same number of active
00:10:10.620 | parameters as Qwen3, but we're getting about the same, um,
00:10:15.320 | throughput in requests per second on the same load, 128 in and 1,024 out.
00:10:19.860 | Um, so yeah, so it's interesting.
00:10:22.300 | You can see sort of which ones have had more optimization work on them.
00:10:26.380 | I think the Qwen3 models and the Llama model series, you see a lot
00:10:30.040 | more optimization.
00:10:31.080 | Um, this is also one of the ones where you saw the biggest gap between, um,
00:10:35.880 | SGLang and vLLM, the Gemma one.
00:10:38.220 | So it looks like the vLLM team spent a little bit more time, or Google's
00:10:41.520 | contributed a little bit more to vLLM on, uh, getting good results.
00:10:45.120 | Oh yeah.
00:10:46.200 | Let's go.
00:10:46.800 | Yeah.
00:10:47.100 | So the other thing you'll see is you generally like, so I just switched.
00:10:50.460 | Sorry.
00:10:50.700 | I should say what I'm doing.
00:10:52.160 | So this is 128 tokens in, 1,024 tokens out.
00:10:54.920 | Uh, and you can see we're getting about one request per second on this guy.
00:10:58.800 | Uh, and the first token comes back in 400 milliseconds.
00:11:03.680 | Let's flip it to 1,024 in and 128 out.
00:11:07.240 | Right?
00:11:08.120 | So this is going from like a reasoning workload to like a RAG workload, big scare
00:11:13.180 | quotes on that, but it's just the difference between whether you're dominated by decode
00:11:16.600 | time, like more of your tokens are decode or more of your tokens are, uh, prefill.
00:11:21.360 | Uh, and what you'll see very consistently in these results is that you get much higher
00:11:27.380 | throughput if you have more like tokens in the context, as opposed to tokens being generated.
00:11:32.360 | Very straightforward.
00:11:33.260 | I mean, if you know your transformer architecture, it's like autoregressive versus parallel.
00:11:37.280 | Like, yeah, one of the first things you would learn if you looked at the, you know, kind
00:11:40.820 | of the implementation of the architecture, but it's nice to see it, you know,
00:11:44.360 | very cleanly. More of an empiricist than a rationalist myself,
00:11:47.520 | so I like to see data, um, and not, like, chalkboard stuff.
00:11:51.620 | Um, so yeah, what I'm getting at here is that the requests per second
00:11:57.740 | that we're seeing here is about four requests per second for vLLM,
00:12:01.540 | um, on the same workload, but with context, instead of generation, being where
00:12:08.860 | the majority of the tokens are.
00:12:10.180 | Um, so, yeah, I gave a talk on GPUs a little bit earlier today.
00:12:16.860 | And one of the big takeaways there is find things that are throughput oriented
00:12:21.840 | and involve a lot of arithmetic and not, like, moving memory around or communication.
00:12:26.640 | And that's exactly the difference here.
00:12:29.040 | You have like big matrix matrix multiplications, load the weights one time, use them a bunch.
00:12:34.620 | Um, and that's exactly the difference here.
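
(A rough sketch of the prefill-versus-decode contrast being described: prefill is one big matrix-matrix multiply that loads the weights once and reuses them across all the prompt tokens, while decode is a long series of matrix-vector multiplies. Shapes and timings here are illustrative only, run on CPU, not the benchmark's numbers.)

```python
# Same arithmetic, two shapes: one (n_tokens x d) @ (d x d) "prefill" matmul vs.
# n_tokens separate (1 x d) @ (d x d) "decode" matmuls. The batched version wins
# because the weight matrix is read once and reused, not re-streamed per token.
import time
import torch

d_model, n_tokens = 2048, 512            # illustrative sizes
W = torch.randn(d_model, d_model)
X = torch.randn(n_tokens, d_model)

t0 = time.perf_counter()
_ = X @ W                                # "prefill": all tokens at once
prefill_s = time.perf_counter() - t0

t0 = time.perf_counter()
for i in range(n_tokens):                # "decode": one token at a time
    _ = X[i : i + 1] @ W
decode_s = time.perf_counter() - t0

print(f"prefill {prefill_s:.3f}s vs decode {decode_s:.3f}s for the same FLOPs")
```
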
00:12:36.480 | And it's a 4x improvement, and that's using BF16, which does not have tensor core support.
00:12:41.860 | No, no.
00:12:42.600 | BF16 has tensor core support, but it's the slow tensor core support compared to FP8
00:12:47.840 | or FP4 on Hopper and Blackwell.
00:12:50.060 | And so, like, the real win
00:12:53.240 | there is the shorter numbers, faster multiplication.
00:12:56.820 | It's actually quadratic in the bit width.
00:12:58.600 | So you get a big win as you go down.
00:13:00.640 | So, like, if we were to run some results with FP4 on Blackwells, you would see an
00:13:05.520 | even bigger gap than just this 4x improvement. 4x is like barely enough to
00:13:09.900 | wake up for, you know?
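
(As a rough back-of-the-envelope for that "quadratic in the bit width" point: if you model the cost of a multiply as scaling with the square of the operand width, the ratios below fall out. Actual tensor-core throughput depends on the specific GPU generation, so this is just the speaker's rough model, not a spec sheet.)

```latex
\text{cost}(b) \propto b^{2}
\quad\Rightarrow\quad
\frac{\text{cost}(\mathrm{BF16})}{\text{cost}(\mathrm{FP8})} \approx \frac{16^{2}}{8^{2}} = 4,
\qquad
\frac{\text{cost}(\mathrm{BF16})}{\text{cost}(\mathrm{FP4})} \approx \frac{16^{2}}{4^{2}} = 16.
```
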
00:13:10.740 | Um, yeah, so, like, not every application
00:13:15.720 | can change that.
00:13:16.920 | Like, your users might be bringing queries to you, so you don't have control.
00:13:21.060 | Um, but it's more of the sort of thing where,
00:13:26.800 | like a product person is like, we should improve the quality.
00:13:30.160 | You're like, Oh, how can I improve the quality without like killing our latency?
00:13:35.500 | Don't immediately reach for reasoning; reach for context instead.
00:13:40.320 | Cause it's going to be cheaper and you're going to get better performance.
00:13:42.700 | Like you're, you're going to find it easier to hit your latency SLAs.
00:13:46.660 | I forgot to point this out, but the latency is almost identical in time to
00:13:50.960 | first token, even though we're doing 10 times as many tokens. Basically a free lunch.
00:13:55.600 | Um, yeah.
00:13:57.420 | Okay.
00:13:57.680 | So that's, uh, that's sort of like how I envision people using this interface and the data.
00:14:02.680 | Um, there is a, there's a URL somewhere where you can just download the raw data.
00:14:06.460 | Um, if you're interested in that, hit me up.
00:14:08.200 | The code is also open source.
00:14:09.960 | If you want to run some of these benchmarks yourself.
00:14:12.160 | Um, I think I've, I'll close there.
00:14:13.980 | There are lots of other stuff to talk about.
00:14:15.340 | Like our benchmarking methodology, which is written up here, the, uh, executive summary, which you
00:14:20.680 | can share with your, um, with your leadership, um, uh, on like running open models.
00:14:26.980 | Um, but I'll take a question or two before we close out.
00:14:29.980 | Yeah.
00:14:30.980 | Yeah.
00:14:31.980 | Yeah.
00:14:32.980 | So one thing I'll say is, this is throughput per replica.
00:14:43.800 | Right.
00:14:44.300 | So this is one GPU.
00:14:46.200 | Yeah.
00:14:46.400 | One H100.
00:14:47.680 | So the way you solve your total throughput is by scaling out rather than
00:14:54.620 | scaling up.
00:14:55.500 | Right.
00:14:56.000 | Um, so if you want 400 QPS or whatever, like eventually you're just
00:15:01.660 | going to have to scale out. But to your question of, yeah, how do you know,
00:15:05.480 | like, why are we saying this is the highest throughput you can get?
00:15:10.160 | Yeah.
00:15:10.860 | So the answer goes to our benchmarking methodology.
00:15:14.420 | What we do is first we dump, you know, a thousand requests and wait for them all
00:15:19.120 | to come back, then calculate requests divided by seconds,
00:15:25.060 | a thousand divided by how long it took.
00:15:26.500 | That's like a maximum throughput.
00:15:28.460 | Right.
00:15:29.140 | Cause we gave it max, we exposed the maximum parallelism to the engine.
00:15:32.260 | So presumably they knew how to, they were smart enough to handle that.
00:15:35.440 | Use your maximum RPS.
00:15:36.940 | Any more than that, you should expect from queuing theory that the latency will blow up.
00:15:40.380 | Right.
00:15:40.980 | Then the other side is, um, you send one request at a time, wait for it to come back,
00:15:45.760 | send another.
00:15:46.420 | And that gives us, like, the fastest you could possibly run the server.
00:15:50.360 | And we sweep between to get the numbers that are here.
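
(A minimal sketch of that sweep, assuming a hypothetical OpenAI-compatible completions endpoint and the httpx library; the real, open-source benchmark code is the thing to actually use. This just shows the two ends, concurrency 1 for best-case latency and a very high concurrency for max throughput, plus a few points in between.)

```python
# Sweep client-side concurrency against a (hypothetical) local inference server
# and report requests per second at each level. Endpoint, model name, and sizes
# are placeholders, not values from the benchmark.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1/completions"      # hypothetical server
PAYLOAD = {"model": "my-model", "prompt": "hi", "max_tokens": 128}

async def one_request(client: httpx.AsyncClient) -> None:
    r = await client.post(BASE_URL, json=PAYLOAD, timeout=120)
    r.raise_for_status()

async def measure(concurrency: int, total: int = 1000) -> float:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(client):
        async with sem:
            await one_request(client)

    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        await asyncio.gather(*(bounded(client) for _ in range(total)))
        elapsed = time.perf_counter() - start
    return total / elapsed                              # requests per second

async def sweep() -> None:
    for c in (1, 8, 64, 1000):                          # 1 = latency end, 1000 = throughput end
        print(f"concurrency={c:>4}  ~{await measure(c):.2f} req/s")

# asyncio.run(sweep())   # uncomment once a server is actually running
```
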
00:15:52.880 | But yeah, cool.
00:15:54.180 | All right.
00:15:54.500 | I'll, uh, got to move on to the next talk.
00:15:56.740 | Uh, I'll be outside if you, uh, have any questions.
00:15:59.160 | Thank you very much.