
How fast are LLM inference engines anyway? — Charles Frye, Modal


Transcript

Thanks, everybody, for coming. Yeah, I wanted to talk about some work I've done recently on trying to figure out just how fast these inference engines are when you run open models on them. So I've been talking at AI Engineer since it was AI Engineer Summit two years ago. And for a long time, it's basically been the OpenAI wrapper conference, right?

It's like, what am I going to do? Am I going to run an agent with BERT? Probably not. And it was exciting to talk about all these cool new technologies, to see people building stuff like Cursor or Devin on top of them. But for me, as somebody coming from having trained my own models a lot, it was like, oh man, I want to touch the weights.

I want to play with them. I want to hack them. But the quality just wasn't there yet to do some of the interesting stuff. And that's changed. You've got the Llama series. We've got the Qwen series. We've got DeepSeek. And so now open models are catching up, maybe even literally catching up with the frontier labs, which would be pretty crazy.

But we're at the very least at the point where a lot of things people have been talking about at AI Engineer for years are possible with open weights models, where they weren't before. And at the same time, there's also the development of the software stack on top of that.

Also, as somebody coming from writing and running my own PyTorch models, I was like, how hard could it be to run a language model? You just write a torch.nn.Module, wrap a class around it. And yeah, having done stuff with transformers before, it's like, oh yeah, it's a little more complicated with transformers.

Training looks weird. Inference is different. But pretty quickly, the state of play has advanced a lot on how to run a transformer, right? KV caching is just the first thing. And then there's paged attention, now multi-token prediction, speculative decoding, all this stuff that's pretty hard to write yourself.
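
As a rough illustration of that first trick, here is a minimal sketch of greedy decoding with a KV cache, using Hugging Face Transformers; the model name and generation length are arbitrary choices, and real inference engines do this far more efficiently than a Python loop.

    # Minimal sketch: greedy decoding while reusing the KV cache between steps.
    # Illustrative only; the model ("gpt2") and the 32-token budget are arbitrary.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("The quick brown fox", return_tensors="pt").input_ids
    past = None  # the KV cache: keys/values for every layer, grown one position per step
    with torch.no_grad():
        for _ in range(32):
            out = model(ids if past is None else ids[:, -1:],
                        past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1].argmax(-1, keepdim=True)  # greedy pick
            ids = torch.cat([ids, next_id], dim=-1)
    print(tok.decode(ids[0]))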

And so you want software to do that for you, probably. And these open source engines are available now: you've got vLLM and SGLang and TensorRT-LLM. So the combination of those two things has kind of flipped the playing field. It used to be that you'd need a really good reason to run your own models, like you're the US government or something, or you want to run on an air-gapped system.

Or you had to believe in it in your heart. You had to want open source models, like Nous or Prime Intellect. You had to be a decentralized crypto bro or whatever to want to run your own open models. But yeah, now the situation has changed. It's really exciting.

It sort of finally makes sense to self-host. So, just wanted to do this really quickly: was anybody here at the first AI Engineer Summit in 2023? Anybody? OK. Did anybody come to the AI Engineering 201 workshop the day before, at 500 Treat Avenue? Maybe not. I talked to one or two people who were there.

So I gave a talk on how to do your own AI stuff back then, and I just wanted to pull out a couple of slides. So this is 2023, and we're asking: who's going to win, open models or closed models? And the key statement in the talk was: if capability requirements saturate, open models will catch up to proprietary models and then dominate for those cases.

That's inspired by what you see with operating systems, databases, programming languages: as soon as there's a capability level where you don't need the absolute best thing, just something that's good enough, open, collaborative projects tend to catch up and then have better properties.

And that's happened with open models, so check. Yeah, so that was mostly around capabilities, and at the time there was only Llama, but now we've got a lot. The other slide was about LLM inference libraries at the time. Most of them are gone (TGI, RIP, for example), but vLLM was good then.

It's stuck around now. Yeah, so I don't know what the slides at the 2027 AI Engineer conference are going to look like. But at the very least, three of my slides from two years ago were right. So hopefully it's not those same three slides again in two years. But yeah, all right, so what does the LLM engine landscape look like two years later?

Let's take a look. So, we were advising a bunch of people on how to run models. People were coming to us saying, I want to run my own code completion model in editors, or I want to run big backfill jobs to enrich data in databases with language models.

It's too expensive to run it on OpenAI, or they've trained their own model, so it's too expensive to run it on a provider like Fireworks. They want to run it on more generic infrastructure, like what we have at Modal. And so people would come and be like, all right, how fast can you run an 8-billion-parameter LLM with SGLang?

With 128 tokens in and 1,024 tokens out, on a Tuesday, when Mercury is in retrograde. At first, that would take a couple of days: figuring out how to make sure all the packages were working, that we had the fastest versions installed, and that we could give a trustworthy number. We eventually got that down to an hour or two, and then built some benchmarking software.

So now we get it done in about 15 or 20 minutes. But people ended up asking a lot of similar questions, so we decided, per the fifth mantra of performance, to do it when they're not looking: compute the thing ahead of time and store it.

So we ran a giant benchmark over ten or so different models, on vLLM, SGLang, and TensorRT-LLM, at about ten different context lengths, and put that all up on the internet. So let's take a look at that. Let's see, I'll drop into this one. This is a live version of it.
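
To give a sense of the shape of that sweep, here is a hedged sketch of the kind of configuration grid involved; the model names, engine labels, and shapes below are placeholders, not the exact set in the published benchmark.

    # Sketch of a configuration grid: models x engines x (tokens in, tokens out).
    # All names here are illustrative placeholders, not the benchmark's exact list.
    from itertools import product

    models = ["llama-3.1-8b-instruct", "qwen3-moe", "gemma-3-27b"]  # ~10 in the real sweep
    engines = ["vllm", "sglang", "tensorrt-llm"]
    shapes = [(128, 1024), (1024, 128), (2048, 2048)]  # (tokens in, tokens out)

    configs = [
        {"model": m, "engine": e, "tokens_in": tin, "tokens_out": tout}
        for m, e, (tin, tout) in product(models, engines, shapes)
    ]
    print(len(configs), "configurations to benchmark")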

So you'll find this at modal.com/llm-almanac. The idea is that this is just one page in your almanac, your little book that has the useful things you need to know to be an LLM engineer. To start, we've got our benchmarking results, the benchmarking methodology in detail, the open source code for it, and a little executive summary.

Hope to put more stuff up there as we accumulate the things people need: things like speculative decoding and multi-token prediction and quantization. But yeah, so to start off, we've got this little interface here. So, anybody, what's a model people want to see results for? Hopefully that's legible to people.

Anybody got a favorite? Qwen3? Qwen3? Ministral? Okay. Excuse me, oh man, that's my boss. I'm not going to do that one. Okay, we'll do any engine here. Oh yeah, we didn't do... so this was fun: SGLang's Qwen3 support is a little buggy for the 8-bit quant that we ran.

So I think we only have results for vLLM. We'll stick with any engine there. Oh yeah, by the way, if you try this thing out, you'll see there's a giant tensor of configurations, right? And we'd love to have that full tensor, but it's either not always possible or it's not clear how to do it.

So there's a place where you can contribute configurations, so we can build up a nice big database of how to run these models. We also haven't carefully optimized any of these things. We started with out-of-the-box performance for all the engines, because optimizing a hundred configurations is going to take some time.

So we'd love contributions of optimized implementations, especially for TensorRT-LLM, which has a ton of knobs, with names like "user buffer": like, what's that? Yeah. Okay. So, first token under one second. Let's say this is a pretty common SLO: you want one second, a nice round number.

People feel like that's a good amount of time to wait. I'd say 300 milliseconds is a tighter one. That's more interactive. That's your Doherty threshold, if you're a fan of Halt and Catch Fire. A made-up number, but if somebody repeats a made-up number enough, it's a real number.

So yeah, 300 milliseconds. Okay. So we can get a throughput of about one request per second on Qwen3, the mixture-of-experts model, on vLLM, for 128 tokens in and 1,024 tokens out. And since we're over here, we've got a little code snippet here.

So it should be the case that you can just uvx modal run this, and if you have a token, it should just work immediately. And by immediately, I mean after five minutes of loading the model weights and spinning up the model server, but that's as immediate as it gets.
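
Once a server like that is up, querying it is the easy part. Here is a sketch of hitting it with the OpenAI Python client, assuming the deployment exposes an OpenAI-compatible /v1 endpoint; the URL, token, and model name below are placeholders, not real values from the almanac.

    # Sketch: querying a deployed server, assuming an OpenAI-compatible /v1 endpoint.
    # The base_url, api_key, and model name are placeholders.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://your-workspace--your-app.modal.run/v1",  # placeholder URL
        api_key="your-token-here",
    )
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-30B-A3B",  # whatever model the server is actually running
        messages=[{"role": "user", "content": "Say hello in five words."}],
        max_tokens=32,
    )
    print(resp.choices[0].message.content)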

Cool. All right, so that's one result. Who was asking for Qwen? Are you satisfied? Yeah? Okay, great. Gemma 3 27B. All right, on this one let's do any engine here. All right, sorry, we've got a tight filter on this; I'm going to put the first-token filter up.

Oh yeah, this one we're doing in BF16; that quant is the only one we could get working at first. I think we eventually got the 8-bit quants working, and I don't think it's a hard blocker, but BF16 was the easiest one to get going with.

So it's definitely slower. So you'll see: a 27-billion-parameter model, so roughly 10x smaller in model weights, with about the same number of active parameters as Qwen3, but we're getting about the same throughput and requests per second on the same load, 128 in and 1,024 out.

So yeah, it's interesting. You can see which ones have had more optimization work on them. I think with the Qwen3 models and the Llama model series, you see a lot more optimization. This is also one of the ones where you saw the biggest gap between SGLang and vLLM, the Gemma one.

So it looks like the vLLM team spent a little more time on it, or Google's contributed a little more to vLLM, on getting good results there. Oh yeah, let's go. Yeah, so the other thing you'll see is, generally... sorry, I just switched; I should say what I'm doing.

So this is 128 tokens in, 1,024 tokens out, and you can see we're getting about one request per second on this guy, and the first token comes back in 400 milliseconds. Let's flip it to 1,024 in and 128 out. Right? So this is going from a "reasoning" workload to a "RAG" workload, big scare quotes on that, but it's just the difference in whether you're dominated by decode time: whether more of your tokens are decode or more of your tokens are prefill.

And what you'll see very consistently in these results is that you get much higher throughput if more of your tokens are in the context, as opposed to being generated. Very straightforward if you know your transformer architecture: it's autoregressive decode versus parallel prefill. It's one of the first things you'd learn if you looked at the implementation of the architecture, but it's nice to see it cleanly in the data; I'm more of an empiricist than a rationalist myself.

So I like to see data, not chalkboard stuff. So what I'm getting at here is that the requests per second we're seeing is about four requests per second for vLLM on the same workload, but with the majority of the tokens in the context rather than in the generation.
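
Here is a back-of-the-envelope version of why that happens; the per-token rates are invented for illustration (real numbers depend on the model, GPU, and batch size), but the shape of the result is the point: prefill is one parallel pass, decode is a sequential step per token.

    # Toy model of request time: prefill is parallel and cheap per token,
    # decode is sequential and expensive per token. Rates below are made up.
    PREFILL_TOK_PER_S = 20_000  # prompt tokens, big parallel matmuls
    DECODE_TOK_PER_S = 100      # generated tokens, one step at a time per request

    def request_seconds(tokens_in: int, tokens_out: int) -> float:
        return tokens_in / PREFILL_TOK_PER_S + tokens_out / DECODE_TOK_PER_S

    print(request_seconds(128, 1024))  # "reasoning" shape: ~10.25 s, dominated by decode
    print(request_seconds(1024, 128))  # "RAG" shape: ~1.33 s, mostly prefill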

So yeah, I gave a talk on GPUs a little bit earlier today, and one of the big takeaways there is: find things that are throughput-oriented and involve a lot of arithmetic, rather than moving memory around or communication. And that's exactly the difference here.

You have big matrix-matrix multiplications: load the weights one time, use them a bunch. And it's a 4x improvement, and that's using BF16, which does not have tensor core... no, no, BF16 has tensor core support, but it's the slow tensor core support compared to FP8 or FP4 on Hopper and Blackwell.

And so the real win there is shorter numbers: faster multiplication. It's actually quadratic in the bit width, so you get a big win as you go down. So if we were to run some results with FP4 on Blackwells, you would see an even bigger gap than just this 4x improvement. And 4x is barely enough to wake up for, you know?

Yeah, so, caveat: not every application can change that. Your users might be bringing queries to you, so you don't have control. But it's more the sort of thing where a product person says, we should improve the quality.

And you're like, oh, how can I improve the quality without killing our latency? Don't immediately reach for reasoning; reach for context instead. It's going to be cheaper, you're going to get better performance, and you're going to find it easier to hit your latency SLAs.

I forgot to point this out, but the latency, the time to first token, is almost identical, even though we're doing ten times as many tokens. Basically a free lunch. Yeah. Okay. So that's sort of how I envision people using this interface and the data.

There's a URL somewhere where you can just download the raw data; if you're interested in that, hit me up. The code is also open source, if you want to run some of these benchmarks yourself. I think I'll close there. There's lots of other stuff to talk about,

like our benchmarking methodology, which is written up here, and the executive summary, which you can share with your leadership on running open models. But I'll take a question or two before we close out. Yeah. So one thing I'll say is: this is throughput per replica.

Right, so this is one GPU, one H100. So the way you solve your total throughput problem is by scaling out rather than scaling up. If you want 400 QPS or whatever, eventually you're just going to have to scale out. But to your question of, yeah, how do you know?

Like, why are we saying this is the highest throughput you can get? Yeah, so the answer goes to our benchmarking methodology. What we do is, first, we dump a thousand or so requests on the server, wait for them all to come back, and calculate requests divided by seconds: a thousand divided by how long it took.

That's the maximum throughput, because we exposed the maximum parallelism to the engine, and presumably it's smart enough to handle that. Use that as your maximum RPS; any more than that, and you should expect from queuing theory that the latency will blow up.

Right. Then the other side is: you send one request at a time, wait for it to come back, send another. That gives us the fastest you could possibly run the server. And we sweep between the two to get the numbers that are here. But yeah, cool.
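
For a sense of what those two endpoints of the sweep look like in code, here is a rough sketch against an OpenAI-style HTTP endpoint; this is not the actual benchmark code (which is open source), and the URL and request body are placeholders.

    # Rough sketch of the two endpoints of the sweep: fire everything at once for
    # max throughput, one request at a time for best-case latency. Placeholders throughout.
    import asyncio, time
    import httpx

    URL = "https://your-server.example/v1/chat/completions"  # placeholder
    BODY = {"model": "your-model", "max_tokens": 128,
            "messages": [{"role": "user", "content": "hi"}]}

    async def one_request(client: httpx.AsyncClient) -> None:
        r = await client.post(URL, json=BODY, timeout=300)
        r.raise_for_status()

    async def max_throughput(n: int = 1000) -> float:
        async with httpx.AsyncClient() as client:
            start = time.perf_counter()
            await asyncio.gather(*(one_request(client) for _ in range(n)))
            return n / (time.perf_counter() - start)  # requests per second

    async def min_latency(n: int = 10) -> float:
        async with httpx.AsyncClient() as client:
            start = time.perf_counter()
            for _ in range(n):
                await one_request(client)  # strictly one request in flight
            return (time.perf_counter() - start) / n  # seconds per request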

All right, I've got to move on to the next talk. I'll be outside if you have any questions. Thank you very much.