I am here with Pankaj Gupta. He's the co-founder of Baseten. Actually, so today I was checking Slack, and in the random Slack channel, one of the people in the company was saying like, "Hey, I heard someone call someone cracked. What does cracked mean?" Those of you who are Gen Z like me or know someone like that are laughing right now, because cracked just means an exceptional engineer, and Pankaj is the most cracked software engineer I've ever had the pleasure of working with.
He's from San Francisco. His favorite model is Llama 3 8B. We're going to be working with a smaller version of that today. I'm Philip. I do developer relations here at Baseten. I've been here for about two and a half years, and I am based in Chicago, but I'm very happy to be here in San Francisco with you all today, and my favorite model is Playground 2.
It's a text-to-image model that's kind of like SDXL, but it's trained on Midjourney images. You're going to see a ton of Playground 2 images in the slideshow today. What are we doing here today? What is our agenda? We're going to cover what TensorRT LLM is and why you'd use it.
Model selection and TensorRT LLM support, because it supports a lot of stuff, but not everything. We're going to talk about building a TensorRT engine, configuring a TensorRT engine automatically, benchmarking it so you can know if you actually did something worthwhile, and then deploying it to production. As much as I love the sound of my own voice and I want to just stand here and grasp this microphone for two hours and say things, this is not just going to be Philip reads off a slideshow.
We're going to do tons of coding, debugging, live Q&A. The way this presentation is kind of broken up is we've got some sections. We've got some live coding. It's going to be a very interactive workshop. I'm going to be taking questions all the time, so please don't hesitate to let us know if anything's confusing.
We really want everyone to come away from this with a strong working understanding of how you can actually use this technology in production. So let's get started. If I may interject for a second and ask for a raise of hands, how many of you know about TensorRT? This is so exciting. I'm so glad that we get to teach you all this today.
How about TensorRT LLM? Okay, a few. So we'll cover the basics. I think I'm pretty sure that you'll get a sense of what it is. If you know PyTorch, this shouldn't be too hard. And if you don't know PyTorch, like me, it's still not that hard. So we're going to start with the story of TensorRT LLM.
What, who, why, you know, once upon a time, there was a company called NVIDIA. And they noticed that there are these things called large language models that people love running. But what do you want when you want a large language model? You want a lot of tokens per second, you want a really short time to first token, and you want high throughput.
You know, GPUs are expensive. So you want to get the maximum value out of your GPU. And TensorRT and TensorRT LLM are technologies that are going to help you do that. So if we get into it here, what is TensorRT? Here's one of my Playground 2 images. Very proud of these.
If the words on the slides are dumb, just look at the images, because I worked hard on those. Anyway, so TensorRT is an SDK for high performance deep learning inference on NVIDIA GPUs. Basically what that means is it's just a great set of tools for building high performance models.
It's a, you know, toolkit that supports both C++ and Python. Our interface today is going to be entirely Python. So if, like me, you skipped the class that teaches C++, don't worry, you're covered. I know Pankaj reads C++ textbooks for fun, but I do not. So we're going to do it in Python today.
And so how does this work? You know, do you want to, do you want to kind of jump in here and talk about this a little bit? Because, you know, it's, it's a, it's a really cool process, how you go from a neural network to, to an engine. Yeah.
Yeah, exactly. So ultimately, what are machine learning models? They're the graphs, they're computation graphs. You flow data through them, you transform them. And ultimately, whatever executes a model does that. They execute a graph. Your neural network is a graph. TensorRT works on a graph representation. You take your model and you express that using an API, that graph in TensorRT.
And then TensorRT is able to take that graph, discover patterns, optimize it, and then be able to execute it. That's what TensorRT is ultimately. When you write a PyTorch model, you're ultimately creating a graph. It's data flow, right? There is data flowing through this graph. And that's what it is.
TensorRT additionally provides a plugin mechanism. So it says that, you know what, I know this graph, I can do a lot of stuff, but I can't do very fancy things like flash attention. It's just too complex. I can't infer automatically from this graph that this is even possible. Like I'm not a scientist.
So it gives a plugin mechanism using which you can inspect the graph and say that, okay, I recognize this thing and I can do it better than you, TensorRT. So I'm going to do it through this plugin. And that is what TensorRT LLM does. It has a bunch of plugins for optimizing this graph execution for large language models.
So, for example, for attention, for flash attention, it has its own plugin. But it says that, okay, now we are in TensorRT LLM land, take this graph and let me execute it using my optimized CUDA kernels. And that's what ultimately TensorRT LLM is. A very, very optimized way of executing these graphs using GPU resources, not only to get more efficiency, better costs for your money, but also better latency, better time to first token, all the things that we care about when we are running these models.
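To make the "express your model as a graph in TensorRT" idea concrete, here is a minimal sketch using the TensorRT Python API. This is not the workshop code, and TensorRT LLM normally builds this network for you; exact API details can vary between TensorRT versions.

```python
import tensorrt as trt

# Describe a tiny network directly in TensorRT's graph representation.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

x = network.add_input("x", trt.float32, (1, 16))
relu = network.add_activation(x, trt.ActivationType.RELU)  # stand-in for real layers
network.mark_output(relu.get_output(0))

# TensorRT inspects the graph, fuses ops, picks kernels, and emits an engine.
config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)
with open("toy.engine", "wb") as f:
    f.write(engine_bytes)
```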
In addition to that, it provides a few more things. When you're executing a model, you're not just executing one request at a time, you're executing a bunch of requests at a time. And in-flight batching is a key optimization there. Like, in this day and age, if you're executing a large language model, you have to have in-flight batching.
There's just no way around it. It's like a 10x or 20x improvement, and you have to have that. And TensorRT LLM provides that. TensorRT wouldn't. TensorRT is a graph executor. It doesn't know about that. But TensorRT LLM has an engine that does that. It also has a language to express the graph, just like PyTorch, and it requires that there is a conversion.
But it makes it pretty easy to do that conversion. And there are tons of examples in the repo. Exactly. So, TensorRT is this great sort of engine builder. And then TensorRT LLM is a mechanism on top of that that's going to give us a ton of plugins and a ton of optimization specifically for large language models.
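Pankaj mentioned in-flight (continuous) batching a moment ago. Here is a toy, purely illustrative scheduler, with no real library calls, that shows the core idea: finished sequences free their batch slot immediately and queued requests join on the very next step, instead of the whole batch draining first.

```python
from collections import deque

MAX_BATCH = 4

def step(seq):
    """Pretend to generate one token; return True when the sequence is done."""
    seq["generated"] += 1
    return seq["generated"] >= seq["target_len"]

queue = deque({"id": i, "generated": 0, "target_len": 3 + i % 5} for i in range(10))
active = []

while queue or active:
    # In-flight batching: top up the batch every iteration, not once per batch.
    while queue and len(active) < MAX_BATCH:
        active.append(queue.popleft())

    # One forward pass generates one token for every active sequence.
    finished = [s for s in active if step(s)]
    for s in finished:
        active.remove(s)  # its slot is free on the very next step
        print(f"request {s['id']} done after {s['generated']} tokens")
```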
So TensorRT LLM, like Pankaj said, defines the set of plugins for your LLMs. If you want to, you know, compute attention, do LoRAs, Medusa, other fine-tunes. And it lets you define optimization profiles. So when you're running a large language model, you generally have a batch of requests that you're running at the same time.
You also have an input sequence and an output sequence. And this input sequence could be really long. You know, maybe you're summarizing a book. It could be really short. Maybe you're just doing some LLM chat. Like, hi, how are you? I'm Fred from the bank. Depending on what your input sequence and output sequence lengths are, you're going to want to build a different engine that is going to be optimized for that to process that number of tokens.
So, yeah. So TensorRT LLM is this toolbox for taking TensorRT and building large language model engines in TensorRT. I want to say just one thing at this point. Like, why do I care about input and output sizes? Like, how does TensorRT LLM optimize for that? It actually has specific kernels for different sizes of inputs, different sizes of matrices.
And it's optimized for that level. And sometimes it becomes a pain when I'm compiling TensorRT LLM. It takes hours because it optimizes for so many sizes. But it also means that giving it that size guidance is useful. It can use better kernels to do things faster. And that's why.
A lot of the models you'll run, you don't have to care about it. But there is always a trade-off. Here, it does care about that. And you can benefit using that trade-off. Yeah. And TensorRT LLM is a great tool for a number of reasons. It's got those built-in optimized kernels for different sequence lengths.
And that level of detail is really across the entire tool. And what that means is that with TensorRT LLM, you can get some of the highest performance possible on GPUs for a wide range of models. And it's really a production-ready system. We are using TensorRT LLM today for tons of different client projects.
And it's, you know, running in production, powering things. TensorRT LLM has support for a ton of different GPUs. Basically anything like Volta or newer. The Volta support is kind of experimental. But yeah, like your A10s, your A100s, H100s, all that stuff is supported. And yeah, TensorRT LLM, it's developed by NVIDIA.
So, you know, they know their graphics cards better than anyone. So we just kind of use it to run models quickly on that. That said, everything does come with a trade-off. Is anyone from NVIDIA here in the room? It's okay. You don't have to wait. Okay. So I'm going to be nice.
No, we really are big fans of this technology, but it does come with trade-offs. You know, some of the underlying stuff is not fully open source. So sometimes if you're diving super deep, you need to go get more information without just like looking at the source code. And it does sometimes have a pretty steep learning curve when you're building these optimizations.
So that's what we're here to help flatten out for you guys today. Hopefully we're still friends. What makes it hard? So there's a couple of things that make building with TensorRT LLM really hard. And when we enumerate the things that make it hard, that's how we know what we need to do to make it easy.
So the number one thing in my mind that makes it hard to build a general model or to optimize a model with TRT LLM is you need a ton of specific information about the production environment you're going to run it in. All right. So I do a lot of sales enablement trainings and I love a good metaphor.
So I'm going to walk you guys through a metaphor here. Apologies if metaphors aren't your thing. So imagine you go into a clothing store and it only sells one size of shirt. You know, it's just like a medium. You know, for some people that's going to fit great. For some people it's going to be too small.
For some people it's going to be too big. And on the other hand, you can go to like a tailor, I don't know, in like Italy or something. And you go there and they've got, you know, some super fancy guy with a, you know, cool mustache and stuff. And he, you know, he measures you like every single detail and then builds a suit exactly for you.
That's perfect for your body measurements, like a made to measure suit. So optimizing a model is kind of like making that suit. You know, everything has to be measured for exactly the use case that you're building for. And so when people come in and expect that they can just walk in and grab off the shelf a model that's going to work perfectly for their use case, that's like expecting you're going to go into a store and buy a piece of clothing that fits you just as well as that custom made, made to measure suit from the tailor.
So in, you know, to relate that more concretely to TensorRT LLM, you need information. You need, like we talked about, you need to understand the sequence lengths that you're going to be working at, the batch sizes that you want to run at. You also need to know ahead of time what GPUs you're going to be using in production.
These engines that we're building are not portable. They are built for a specific GPU. So if you build it on an A10, you run it on an A10. If you build it on an H100, you run it on an H100. You want to switch to H100 MIG?
Okay, you build it again for H100 MIG. So you need to know all of this information about your production environment. And then also, as we'll talk about kind of toward the end, there are some infrastructure challenges as well. These engines that we're going to build are quite large. So if you're, for example, doing auto scaling, you have to deal with slow cold starts, you know, work, work around the size of the engines.
Otherwise, your cold starts are going to be slow. And overall, also just model optimization means we're living on the cutting edge of new research. You know, I'm, when I'm, when I'm writing blog posts about this stuff, I'm oftentimes looking at papers that have been published in the last six months.
So, you know, just combining all these new approaches and tools, there can be some rough edges, but the performance gains are worth it. So, yeah. Oh, please go ahead. I want to add one thing, which is that there are modes in TensorRT LLM where you can build on a certain GPU and it will run on other GPUs.
But then it's not optimized for those GPUs. So why would you do that? We never do that. We always build it for the GPU. But there is that option. Exactly. That would be like, if I went to that fancy tailor shop, got a made-to-measure suit, and then was like, "Hey, Pankaj, happy birthday.
I got you a new suit." That's what it would be like. So, you know, what makes TensorRT LLM worth it? Well, it's the performance. So, these numbers are from a Mistral 7B that we ran on Artificial Analysis, which is a third-party benchmarking site. And we were able to get, with TensorRT LLM and a few other optimizations on top of it, 216 perceived tokens per second and 180 milliseconds time to first token.
So, unless any of you are maybe like some super high quality athletes, like a UFC fighter or something, your reaction time is probably about 200 milliseconds. So, you know, 180 millisecond time to first token, counting network latency, by the way, counting the round trip time to the server is great because that, to a user, feels instant once you're under 200 milliseconds.
And actually, most of it is network latency. The time on the GPU is less than 50 milliseconds. Less than 50 milliseconds. So, we've got another one of these green slides here. I like to talk really fast. So, these slides I put in this presentation to give us all a chance to take a breath and ask any questions.
So, you know, we're going to cover a lot more technical detail moving forward, but if there's anything kind of foundational that you're struggling with, like what's TensorRT, what's TensorRT LLM, anything I can explain more clearly, I would love to hear about it. Going once, going twice. It's okay. We're all friends here.
You can raise your hand. All right. Well, it sounds like I'm amazing at my job. I explained everything perfectly, and we get to move on to the next section. So, what models can you use with TensorRT LLM? Lots of them. There's a list of like 50 foundation models in the TensorRT LLM documentation that you can use.
And you can also use, you know, fine-tunes of those models, anything you've built on top of them. It supports open source large vision models, so if you're, you know, building your own GPT-4o, you can do that with TensorRT LLM. And it also supports models like Whisper. And then TensorRT itself, you can do anything with TensorRT.
So, any model, custom, open source, fine-tuned, you can run it with TensorRT. But TensorRT LLM is what we're focusing on today, because it's a much more convenient way of building these models. And, you know, on this list of models that it supports, there's one that maybe stands out. Does anyone know what model kind of doesn't belong in this list of supported models?
Like what, what up here isn't an LLM? Whisper, exactly. Why is Whisper on here? Well, TensorRT LLM, it's called dash-LLM. Um, but it really is a little more flexible than that, because you can run a lot of different autoregressive transformer models with it, like Whisper.
So, if anyone doesn't know what Whisper is, it is an audio transcription model. You give it, uh, you know, an MP3 file with someone talking, it gives you back a transcript of what they said. It's one of our favorite models to work with. We've spent a ton of time optimizing Whisper, building pipelines for it, and all that sort of stuff.
And what's really cool about Whisper is structurally, it's basically an LLM. You know, that's a massively reductive statement for me to make, but it's an autoregressive transformer model. It has the same bottlenecks in terms of inference performance. So even though this is not an LLM, it's an audio transcription model, we're actually still able to optimize it with TensorRT LLM because of its architecture.
Let me say one more thing. Of course. So the, the whole, uh, I think the recent ML revolution started with the transformer paper, "Attention Is All You Need," and that describes an encoder-decoder architecture. And in a way, Whisper is machine translation. That paper was about machine translation. You're translating audio into text, right?
And it's basically that. It's an encoder-decoder model, exactly like the transformer architecture, and TensorRT LLM is about that transformer architecture. So it actually matches pretty well. Exactly. So, um, moving on, um, I want to run through a few things, uh, just some things in terms of what TensorRT LLM supports.
So I assume it's going to support Blackwell when that comes out, like 99.999% certain. Um, but anyway, in terms of what we have today, we've got Hopper, so the H100s, and Ada Lovelace, so the L4s and RTX 4090s. If anyone has a super sweet gaming desktop at home, number one, I'm jealous. Number two, you can run TensorRT LLM on that.
Um, Ampere GPUs, Turing GPUs, uh, V100s are, you know, somewhat supported. Um, and what's cool about, what's cool about TensorRT and hardware support is that, like, it works better with newer GPUs. When you move from an A100 to an H100 and you're using TensorRT or TensorRT LLM, you're not just getting the sort of, like, linear increase in performance that you'd expect from, you know, oh, I've got more flops now.
I've got more gigabytes per second of GPU bandwidth. You're actually getting more of a performance gain going from one GPU to the next, uh, than you would expect off raw stats alone. And that's because, um, you know, H100s, for example, have all these great architectural features, and TensorRT, because it actually optimizes the model by compiling it down to CUDA instructions, is able to take advantage of those architectural features, not just kind of run the model, um, you know, raw.
And so for that, you know, that's why we do a lot with H100 MIGs. This, this, this bullet point here is a whole different 45-minute talk that I tried to pitch to, uh, do here. But basically, you know, H100 MIGs are especially good for TensorRT LLM. Um, if you're trying to run smaller models, like a 7B, you know, Llama 8B, for example, uh, because you don't need the massive amount of VRAM, but you get the, um, increased performance from the architectural features.
Um, and you know, just my own speculation down here, that I'm sure whatever the next generation is, is going to have even more architectural features for TensorRT to take advantage of. And so, you know, adopting it now is a good move, uh, you know, looking to the future. Here we've got a graph showing, you know, with SDXL.
Now this is TensorRT, not TensorRT LLM, but the underlying technology is the same. Um, you know, when you're working on an A10G, we were looking at, you know, maybe like a 25 to 30% increase in throughput for SDXL. And, uh, with an H100, it's a 70%. And that's not, you know, just because the H100 is bigger.
It's 70% more on an H100 with TensorRT versus an H100 without. So, yeah, great, uh, yeah, please go ahead. One thing I want to add here is that, uh, H100 supports FP8 and A100 does not. FP8 is a game changer. I think it's very easy to understate that fact.
FP8 is really, really good. Post-training quantization. You don't need to train anything. Post-training quantization, it takes like five minutes. And the results are so close. We've done perplexity tests on it. Whenever you quantize, you have to check the accuracy. Uh, and we've done that. It's hard to tell. And FP8 is about 40% better in most scenarios.
So use an H100 if you can; it can be way better if you use FP8. And FP8 is also supported, um, by Ada Lovelace. So that's going to be your L4 GPUs, um, which are also a great option for FP8. So yeah, a bunch of different precisions are supported.
Again, FP8 is kind of the highlight. FP4, uh, could be coming. And, um, you know, traditionally though, we're going to run in FP16, um, which is sort of like, uh, full precision. Um, oh, sorry, half precision. FP32 is technically full precision. But nobody does FP32. Yeah, yeah.
So, so for, for inference generally, you start at FP16. By the way, FP16 means a 16-bit floating point number. Um, and from there, you know, you can quantize to INT8, FP8 if, you know, you want your model to run faster, if you want to run on fewer or smaller GPUs.
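As a rough back-of-the-envelope for why precision matters so much (weights only; a real engine also needs memory for activations and the KV cache):

```python
# Approximate weight memory for a 7B-parameter model at different precisions.
params = 7e9
bytes_per_weight = {"FP32": 4, "FP16": 2, "FP8 / INT8": 1, "INT4": 0.5}

for name, nbytes in bytes_per_weight.items():
    print(f"{name:>10}: ~{params * nbytes / 1e9:.1f} GB of weights")

# FP16 comes out around 14 GB, which is tight on a 24 GB card once you add the
# KV cache; FP8 is around 7 GB, which is why quantization lets you run on
# fewer or smaller GPUs.
```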
Um, we'll cover quantization in a bit more detail later on in the actual workshop. I don't want to spend too long on the slides here. I know you guys want to get your laptops out and start coding. Um, so the other thing just to talk about is, like I said, TensorRT LLM is a set of optimizations that you can build into your TensorRT engines.
Um, so some of the features that are supported, again, each one of these could be its own talk here, uh, but we've got quantization, LoRA swapping, speculative decoding, and Medusa heads, which is where you basically, like, fine-tune additional heads onto your model. And then at each forward pass, you're generating, like, four tokens instead of one token.
Great for when you're, you know, when you have memory bandwidth restrictions. Um, yeah, in-flight batching, like you mentioned, paged attention. There's just a ton of different optimizations supported by TensorRT LLM for you to dive into once you have the basic engine built. So, um, we're about to switch into more of, like, a live coding workshop segment.
So, if there's any of this sort of groundwork information that didn't make sense or any more details that you want on anything, let us know. We'll cover it now. Um, otherwise, it's, it's about to be laptop time. Looks like everyone wants laptop time. So, uh, yes, please go ahead.
Can you do, like, some high-level comparison, like, to vLLM? Yeah, do you want to handle that one? Like, a high-level comparison to vLLM? Uh, I can do a very high-level comparison. Uh, first of all, uh, I respect both tools. vLLM is great.
TensorRT LLM is also great. We found in our comparisons that, uh, for most of the scenarios we compared, we found TensorRT LLM to be better. Um, there are a few things there. One thing is that whenever a new GPU lands or a new technique lands, that tends to work better on TensorRT LLM.
With vLLM, it takes a bit of time for it to catch up, for the kernels to be optimized. Uh, TensorRT LLM is generally ahead of that. For example, when H100 landed, TensorRT LLM was very, very fast out of the box, because they'd been working on it for a long time.
Um, second thing is that TensorRT LLM is optimized from bottom to the top. These, uh, CUDA kernels at the very bottom are very, very well optimized. On top of that, there is the in-flight batching engine, all written in C++. And I've, I've seen that code. It's very, very optimized C++ code with your STD moves and whatnot.
And on top of that is Triton, which is a web server, again, written in C++. So the whole thing is very, very optimized. Whereas, uh, in some other frameworks, uh, they also try to optimize, uh, in the sense like, you know, Java versus C++. Java is like, you know, we optimize everything that matters, but there are always cases where it might not be as good.
But TensorRT LLM is, let's optimize every single thing. So it generally tends to perform better in our experience. That said, vLLM is a great, great product. We use it a lot as well. For example, LoRA swapping became available in vLLM first. So we used that, uh, there for a while.
What we found is that when something lands in TensorRT LLM, and it's usually after a delay, it works like, like bonkers. It just works so well that, uh, the performance is just amazing. So when something is working very stable in TensorRT LLM, we tend to use that. But, uh, vLLM and other frameworks, they provide a lot of flexibility, which is great.
Uh, we love all of those, yeah. Yeah. Yeah. I think, I think we should, uh, definitely question that. That is a clear trade-off. If you're working with two GPUs, for example, two A10Gs, it's not worth it probably. But if you're spending hundreds of thousands of dollars a year, uh, it's, it's your call.
But if you're spending, uh, many hundreds of thousands of dollars a year, it can be material. Like your profit margin might be 20%. This 20% or 50% improvement might make all the difference that you need. So it depends upon the use case. It could be, yeah. I think if you're working with one or two GPUs, A10Gs or T4s, uh, I mean, I'm not an expert in that, but, uh, it probably doesn't matter which framework you use, whichever works best for you.
But if you end up using A100s or H100s, you should definitely look into TensorRT LLM. And, uh, regarding the learning curve, stick around a little bit. We're going to, uh, flatten it out a lot for you, because we've built some great tooling on top of TensorRT LLM that's going to make it just as easy to use.
A little marketing spin right there. Um, so yeah. So we're going to, um, be doing an engine building, live coding exercise. Um, and Pankaj is just going to lead us through that. I'm just going to kind of roam around. So if people have, you know, questions, need help kind of on a one-on-one basis, I'll be able to help out with that during this portion of the workshop.
Great. So, uh, yeah, let's just go through a couple of things. There's a little bit of setup material, right? Or, yes. Yeah. Um, you want to run through that? Okay, sure. I'll run through the setup material. Um, and then, uh, yeah.
So, um, anyway, what we're going to do, to be clear, is we're going to build an engine for a model called TinyLlama 1.1B. I want to be really clear about this. TensorRT LLM is a production-ready technology that works great with big models on big GPUs.
Uh, that takes time to run. The dev loop can be a little bit slow and we only have a two hour workshop here. And, uh, you know, I don't want us all just to be sitting there watching a model build. It's basically as fun as watching paint dry or watching grass grow.
Um, so we're going to be using this super tiny 1.1 billion parameter model. We're going to be using 4090s and A10Gs, um, just to kind of keep the dev loop fast, but this stuff does scale. So, um, at this point, we're going to walk you through the manual process of doing it all from scratch.
You're going to procure and configure a GPU. You're going to install dependencies for Tensor OT LM, configure the engine, run the engine build job, and, uh, test the results. And we, we should be able to get through this in, in about half an hour or maybe a little less because these, uh, these models are quite small.
Um, and there's a few important settings that we're going to look at when building the engine. We're going to look at the quantization, again, the post-training quantization, like we talked about. We're going to be on A10s, or sorry, no, first we're going to be on 4090s.
So we will actually have access to FP8 so that you can test that out. Uh, we're going to look at sequence shapes and batch sizes, how to set those. And we're going to look at tensor parallelism. You want to give them a quick preview on tensor parallelism? Oh, yeah, tensor parallelism is, uh, is very important in certain scenarios.
I wish it were more useful, but it is critical in many scenarios. So what is tensor parallelism? Ultimately, machine learning, running these GPUs is about matrix multiplications. We take this model architecture, whatever it is, it ultimately boils down to matrices that we multiply and a lot of the wrangling is around that.
How do we shove all these batches into matrices? So ultimately it is matrix multiplication, right? What you can do is you can split these matrices and you can multiply them separately on different GPUs and then combine the results. And that's what tensor parallelism is. It's one of the model parallelism techniques.
Uh, there are many techniques. Uh, it's one of the most commonly used ones because you need that. Uh, why do you need tensor parallelism versus other parallelisms like pipeline parallelism? Um, is that it saves on latency. You can do things in parallel. You can use two GPUs at the same time for doing something, even though there is some overhead of crosstalk between them.
With pipeline parallelism, you take the model architecture and you can divide these layers into separate things. So your thing goes through one GPU, like half the layers and then half the layers on the second GPU. But you're not saving on latency. It still has to go through each layer and it's going sequentially.
And that's why pipeline parallelism is not very popular for inference. It is still popular for training. There are scenarios. Uh, and there's a lot of theory about that. But for, for inference, I don't think I've ever seen it used and nobody pays much attention to optimizing it because of this thing that tensor parallelism is just better.
There's also expert-level parallelism. If your model has a mixture of experts, then you can parallelize those experts. And that tends to be very advanced, and Llama doesn't have a mixture of experts, so it's an esoteric thing that we haven't covered here. Tensor parallelism is pretty helpful and useful. Uh, one downside is that your throughput is not as great. If you can fit something on a bigger GPU, that's generally better, but there are bigger models, like Llama 70B, that just can't fit on one GPU. So you have to use tensor parallelism.
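Here is a tiny NumPy sketch of the column-split idea behind tensor parallelism. It is illustrative only; a real implementation shards the weights at load time and uses NCCL collectives (all-gather or all-reduce) to combine results across GPUs.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))       # activations: batch of 4, hidden size 8
W = rng.standard_normal((8, 16))      # weight matrix for one layer

# "GPU 0" and "GPU 1" each hold half the columns of W.
W0, W1 = np.split(W, 2, axis=1)

y0 = x @ W0                           # computed on GPU 0
y1 = x @ W1                           # computed on GPU 1 (in parallel)
y = np.concatenate([y0, y1], axis=1)  # gather the partial results

assert np.allclose(y, x @ W)          # same answer as the single-GPU matmul
```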
Awesome. So for everyone to get started, um, we made a GitHub repository for you all to work off of in this, uh, workshop.
So you can scan the QR code. It'll take you right there. Otherwise, uh, you know, this is, this is not too long to type out. Um, so I'm just going to leave this up on screen for 30 seconds. Everyone can pull it up. Um, you're going to want to, you know, fork and clone this, uh, this repository, um, to your, to your local development environment.
Um, we're just using Python; Python 3.9, 3.10, or 3.11. Um, so yeah, just, like, however your normal way of writing code is, um, this should be compatible. Um, the, there isn't a lot of... what? Uh, no. So, uh, yeah, to be clear, in this repository, um, you're going to find instructions and we're going to walk through all this.
Um, we're going to be using entirely remote GPUs. Um, so, you know, I personally have an H100 under my podium right here that I'm going to be using. No, I'm just kidding. I don't. Um, but, uh, yeah, yeah. So we're just, uh, we just all have laptops here.
So we're going to be using cloud GPUs. Yeah. Actually, if you want to follow along, you might need a RunPod account. Yeah. Yeah. Well, we'll talk them through the, uh, the setup steps there. Um, does anyone want me to leave this information on the screen any longer? Going once, going twice.
Okay. If, if you, for whatever reason, lose the repository, just let me know. I'll, I'll get it back for you. Uh, yes. Okay. So this, this slide means we are transitioning to live coding. So yes, let's go, uh, let's go over to, um, the, yeah, the, the, the live coding experience.
So I'm, I'm basically going to follow this repository. All the instructions are here and, uh, I'm going to follow exactly what is here. So you can see, uh, how to follow along. And if you, uh, if you ever get lost or need help, just raise your hand and I'll come over and catch you up like one on one.
Yeah. Yeah. I'm going to go really slow. I'm going to actually do all these steps here. I know it takes time, but, uh, you know, there's a lot of information here. It's easy to lose track of things and get lost. So if you get lost, just ask and we'll pause. I want to make sure everyone stays with it; this is not a long process, maybe a 10-minute process.
We can take it slow for everybody here. So first thing is that, uh, we'll do it really, really from scratch. So we're going to spin up a new, uh, container on RunPod with a GPU to run our setup in. So if you, okay, okay. Yeah. Yeah.
Please, please. Um, if you want to follow along, please go to RunPod and create an account. This should cost like less than $5 overall. Yeah. So, um, so yeah, so if you want to make an account, um, there's instructions in the, um, 01 folder. Uh, yeah, this README, so in the first TensorRT folder, in the README, there's instructions and a video walkthrough.
Um, the minimum... we're not affiliated with RunPod in any way. Uh, they just have 4090s and we wanted you guys to use 4090s today. There is a minimum credit buy of $10. If for whatever reason you can't use a company card or get it reimbursed or whatever, and you want your $10 back, uh, send me an email after the conference and I will reimburse you myself.
So, uh, you know, I, I just, I want to, I, it's really important to me that we are giving you the GPUs, uh, to, to run this on. So yeah. So if, if you, if you, uh, if you need the $10 back, I will, I will get it for you.
Um, so yeah, just, uh, just follow this. It's like a one and a half minute video. Um, and, uh, yeah, that, that, that, that'll get the account set up. Maybe we could even like play the video. Yeah. Um, so yeah. And then this is how you, you know, deploy, um, once we, once we get set up.
So in, in two minutes, we will come back together and you know, we resume the, resume the talk. So yeah, make that account real quick, get checked out, get those $10 of credits. Yeah. And, and also if you, if you do happen to run into any issues, getting the one part account set up, like it's not working or whatever.
Um, I do have one that I can share individually with people. It would just take a really long time for me to go type in my username and password into every single computer in this room. I feel like maybe I should have brought some, some jokes to tell while everyone sets their account up.
Uh, all right. Uh, how about this one? Why did the LLM cross the road? Ignore previous instructions and cross the road. All right. I'll, I'll, I'll workshop that. I'll get some better jokes for you guys before, before the next set up. How are we doing on the, on the account set up?
Um, anyone need a little more time? All right. Great. No problem. No problem. Sorry. I don't want to rush you. Just checking in. Yeah. And then once, once everyone has the account, we'll set up the GPU together because there's a few things you need to configure. Cool. Uh, oh, really?
It's not taking anyone's credit card. Great. Uh, does anyone here know someone who works at RunPod and would have their, uh, their phone number? He's calling them. Okay. Awesome. All right. We're getting in touch with customer support.
Up. Yeah. It could be, it could be that. Yeah. Okay. So as a backup, um, yeah. Okay. So the recommendation here is go off of the conference Wi-Fi, put your computer on your phone hotspots and try it again. Um, because that, that worked, uh, you know, maybe, maybe, maybe coming from a different IP address will, will help.
How about we do this: I run through this, and we can do it again once everybody has their account. Yeah, that sounds good. So what we're going to do in the interest of time here, um, is we're just going to run through the demo end to end, um, as we get the stuff set up and everyone's credit cards get unblocked.
Um, yeah, you know, who would have thought. You know, we were talking this big game about, oh, TensorRT LLM, it's so hard, it's so technical, there's going to be so many bugs. And then there's the payment processing. So, uh, yeah, you know, that's live demos for you.
So anyway, yeah, go ahead and work through it. Um, and then we'll do it kind of again, uh, together once everyone has their account. All right. Yeah. Let me run through this. I'll follow all the steps. Uh, I already have a RunPod account. So let me spin up a new instance here, and, uh, I'm picking the 4090 here, which is this one, and it has high availability.
So that should be fine. And, uh, I'm going to edit this template and get more space here. This doesn't cost anything extra. Yeah. We need more space, uh, because everything that we're installing, um, and the engine we're building take up a lot of gigabytes. This way we'll be safe.
Yeah. Even though these engines are small, engines in general can be very, very big. They can be hundreds of gigs. And I'm going to pick on demand because I'm doing this demo. I don't want the instance to go away, but feel free to use spot for your use case.
I'm going to do that. You want to set the container disk to 200 gigabytes so that you have enough room to install everything. And then I'm going to deploy. It's going to be a bit slow, but you know, feel free to ask any questions.
And, uh, I feel like this way we'll take it slow, but we'll make sure everything is understood by everybody. So what's happening now is that this, uh, pod is spinning up. Uh, one thing to note here is that it has a specific image of, uh, torch with a specific CUDA version.
It's very important that, uh, the node has GPUs. And the first thing we're going to do is that once this pod comes up, you're going to check that it has everything related to GPUs running fine. So this is starting up now. I'm going to connect. There's nothing sensitive here.
It uses your SSH keys, but, uh, the names are not sensitive. So I'm going to just do that. Log into that box. Uh, sorry. Oh, okay. Sorry. Yeah. Yeah. This is much smaller. I think the pod is still spinning up. So it's taking a bit of time. Hmm. Okay.
All right. So, to test that everything is set up properly... just, uh, is it possible to scroll it to the top of the screen? Oh, yeah. Okay, great. So we are on this machine that we spun up. You're going to run nvidia-smi to make sure that the GPU is available.
And this is what you should see. Uh, one thing to note here is this portion, which shows that the GPU has the 24 gigs of memory that the RTX 4090 has. And right now it's using almost none of it. I think it runs some stuff by default, so one meg is already taken.
So one mag is already taken. So now we're going to go back to a workshop and then just follow these instructions. Manual engine build. We are at this point. Uh, and now we're going to install Tencer RTLM. This is going to take a bit of time. Tencer RTLM comes as a Python library that you just pip install.
And that's all we're doing. We're setting up the dependencies. This APD update is setting up the Python environment, uh, open MPI and other things. And then we just install and start the LLM, uh, from, not from PyPy, but from NVIDIA's own PyPy. That's where we find the right versions.
If you focus on this line, uh, let me kick this off. Then I can come back here and show you that we're using a specific version of Tencer RTLM. And, uh, we need to tell it to get it from the NVIDIA PyPy using these instructions. And all these are on the GitHub repo.
If you want to follow from there. I saw a guy with a camera, so I started posing. Uh, Python 3.10, I think. Yes. Uh, this command should have instructions to install. I also want to check in with the room. Has anyone else had success getting RunPod up and going, uh, using your phone wifi?
It's working. Okay. Okay. Awesome. Crisis averted. Thank you so much to, uh, whoever from over there suggested the idea to begin with. Really saved the day. Great. So we're just waiting for it to build. It takes some time. This is, this is the best part of the job.
You know, you wait for it to build. You can go get a snack. You can go like change your laundry. It's a very convenient that it takes this time sometimes. It used to be compilation takes time. Now engine build takes time. Yeah. I think you're very close. And I promise like, then the fun part begins.
Are you saying that pip install isn't the fun part? I think this is pretty fun. You know, look, look, look at all this, look at all this lines. You know, this is, this is, this is real coding right here. And if you want pip to feel like more fun, try poetry.
Oh, that's true. Poetry is really fun. Yeah. NVIDIA does publish these images on their container registry called NGC. And there are Triton images, uh, available for these things. Maybe we should have used those rather than RunPod, but, uh, it's all good. So, uh, now let's check that TensorRT LLM is installed.
And this will just, uh, tell us that, uh, everything is good. So it printed the version. You should see that if everything's working fine. And then we're going to do the real thing. Now we're going to clone the TensorRT LLM repository, where a lot of those examples are.
And I'll show you those examples while this, uh, this cloning happens. Shouldn't take that much longer. Maybe, uh, maybe a minute or so, but TensorRT LLM has a lot of examples. Uh, if you go to the TensorRT LLM repository, there's this examples folder and there are a ton of examples.
Like, uh, Philip mentioned, there are about 50 examples. And we're going to go through the llama example here. So if you search for llama, uh, that's the one we are going to look into. And so the cloning is complete. And we go back to these instructions. And now we're going to actually build the engine.
Actually, one more thing. Uh, how many of you know about hf_transfer? Have you used the Transformers library from Hugging Face? So hf_transfer is a fast way of downloading and uploading your engines. It does sliced downloads: it takes the URL, splits the file up into slices, and downloads them all in parallel.
And it works really, really fast. It goes up to like one gig a second. So we should definitely do that, which is what I did just now. Now we're going to follow this step by step. Uh, first thing we're going to do is download from Hugging Face. And, uh, let's see how fast the wifi here is, uh, how fast this downloads.
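For reference, a download with hf_transfer enabled looks roughly like this; the model ID is illustrative, and the workshop repo has the exact commands used on stage.

```python
import os

# Opt in to the Rust-based parallel downloader before importing huggingface_hub.
# Requires `pip install hf_transfer`.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model ID
)
print("weights downloaded to", local_dir)
```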
So not bad. It's going at one gig a second. So HF... oh, you're right, you're right. See, this is what I call good software. Downloads at one gig a second. Now, the first thing to build with TensorRT LLM is that we have to convert, uh, the Hugging Face checkpoint into a checkpoint format that TensorRT LLM works with.
And, uh, checkpointing also covers tensor parallelism and quantization. Sometimes you need a different kind of checkpoint for doing those things. So I'm going to run this command to convert the checkpoint. And this should be pretty fast. It's just converting weights to weights. That's pretty fast, like three seconds. Uh, and now we do the actual build.
And I'm going to do this basic build here. Uh, there are a ton of options that this command takes, the trtllm-build command. Uh, in here, we are just saying, take this checkpoint and build me an engine with most of the default settings. And that should build the engine.
Now it will print a lot of stuff about what it's doing, what it's finding and how it's optimizing and all that. Uh, it won't make much sense right now, but, uh, later on, this could be very useful. So the engine was built as pretty fast, right? It's a small, uh, model, uh, only a billion parameters.
So that was pretty fast. And now let's, uh, let's try to see how big the engine is. I'm going to do that. And the engine is, uh, two gigs in size. This is about how big that model is on Hugging Face. So the engine itself adds very little, uh, storage or memory.
It's maybe, uh, hundreds of megabytes, but very tiny compared to the overall size, and those weights are bundled into the engine. And what is this engine? This engine is something that the TensorRT LLM runtime can take and execute. Uh, you can think of it like a shared library.
It's, uh, it's kind of like a binary, in the standard format that binaries are in. Uh, but it's something that TensorRT LLM can take and interpret. Ultimately, it's a TensorRT engine, because that's what TensorRT LLM works with. It creates a TensorRT engine, and then TensorRT is the one that loads it, but TensorRT LLM gives it these plugins that TensorRT understands and is then able to make sense of it.
And now let's execute this. So these examples also come with a run script that we can run, and we're going to run that. So what this is going to do is start up the engine and give it a very tiny request, and we should expect a response.
And that's what happened here. Our engine was launched. We gave it an input text of "Born in north-east France, Soyer trained as a," and the model printed out a continuation about training as a painter in Paris before moving to London in 1929. And this is the standard, uh, example that comes with TensorRT LLM. So if you follow along these instructions, you should see the same thing.
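Under the hood, the example run.py is doing something roughly like the sketch below, written against the TensorRT LLM Python runtime from memory. Class names and keyword arguments may differ between versions, so treat the repo's run.py as the source of truth; the engine path here is illustrative.

```python
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner  # approximate; check your TRT-LLM version

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
runner = ModelRunner.from_dir(engine_dir="./tinyllama_engine")  # illustrative path

prompt = "Born in north-east France, Soyer trained as a"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.int()

outputs = runner.generate(
    [input_ids[0]],                  # batch of one request
    max_new_tokens=20,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
)
# Output shape is roughly [batch, beams, tokens]; decode the first beam.
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```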
Any questions at this point? The question is: what happens during convert checkpoint? Uh, I think there are three things that happen, uh, potentially three things. First thing is that TensorRT LLM needs the tensors to be in a specific format to work with.
So think of it as pre-processing. There are many ways of specifying a model. It can be on Hugging Face. It can be exported from PyTorch. It can be ONNX. There are many, many different ways of specifying these models. So the first thing it does is that it converts that into a format that it understands.
So it does some kind of translation into a standard structure. Second thing is quantization. For quantization, it needs to take the weights and quantize them into the quantized versions of them. And that happens at convert checkpoint too. Uh, not necessarily always, though. They also have a quantize script.
Some types of quantization happen in convert checkpoint, but they also have a different way of quantizing. They call it, I think, uh, AMMO. There's a library called AMMO which does that. Uh, and that can also be used for doing it. But, uh, I think AWQ and, uh, SmoothQuant, they happen in convert checkpoint.
And, uh, the third thing is tensor parallelism. For tensor parallelism, you need to divide the weights, uh, into different shards for the different GPUs that they will run on. So it does that during convert checkpoint as well. Thank you. Yes. All right.
Yeah. So there's, there's two places that the max output is set. Um, so the first place is when you're actually building the engine, you give it a argument for the expected output sequence length. And then that's, that's more just sort of like for the optimization side, you know, so that you're selecting the correct CUDA kernels.
And so that you're, you know, batching everything up correctly. And then once the engine is built, it just uses a standard, I think it's, uh, max tokens, right, is the parameter. Um, and yeah, you just, you just pass max tokens and that'll, you know, limit, um, how, how long it runs for.
Yeah, I guess what I'm asking is, does it influence the generation before generating happens? Yeah. Right. Like, are you asking if you make the engine with a shorter, um, output... oh, okay. No, as far as I know, all the max tokens parameter does is cut off inference after a certain number of tokens, right?
Yeah. So the, the way, uh, I would put it is that normally if you give it a large number of max tokens, it would emit, uh, uh, end of sequence token. Most models have a different end of sequence token and it's up to you. You can stop there. You can configure at runtime.
Like I don't want more than that, but you can also tell it ignore end of sequence. I just want the whole thing. And we do need it for performance benchmarking. For example, when we are comparing performance across different GPU types or whatnot, we want all of those tokens to be generated so we can tell it, like, give me all of them.
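A toy view of the two stopping conditions being discussed; nothing here is TensorRT-specific, and the token IDs and the fake model are made up.

```python
EOS_ID = 2          # hypothetical end-of-sequence token ID

def fake_next_token(step):
    # Stand-in for the model: emits EOS on the 6th step.
    return EOS_ID if step == 5 else 100 + step

def generate(max_new_tokens, ignore_eos=False):
    tokens = []
    for step in range(max_new_tokens):       # max tokens: hard cap on generation
        tok = fake_next_token(step)
        if tok == EOS_ID and not ignore_eos:  # EOS: the model decided it is done
            break
        tokens.append(tok)
    return tokens

print(generate(max_new_tokens=20))                   # stops early at EOS: 5 tokens
print(generate(max_new_tokens=20, ignore_eos=True))  # benchmarking: force all 20
```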
Thank you. Welcome. In the back there. Oh, great. Great question, actually. So, uh, during the engine build, a lot of stuff is happening. You're taking these weights and you're, uh, you're generating this thing called a network in TensorRT. TensorRT has this notion of a network, and what you need to do is populate that network with your weights and architecture.
So it actually does that. It creates that network and feeds it these weights. It also does inference while building the engine, uh, for doing optimizations. For every model type, it has a mechanism for generating sample input, and it passes that into the TensorRT engine that it's generating, and then it optimizes it that way.
And as, uh, as a result, this TensorRT engine is generated in memory, which is then serialized. So all of this is happening in that step. And there's a lot of nuance to it. Uh, if you get a chance, you can look at the source code for trtllm-build.
I'll post references in that GitHub repo and you can follow on. There are lots of options, but let me see if I can find the help here. Oh yeah, there's a lot of stuff here that you can go through. Um, uh, yeah, maybe I should go through some of them, which are very important.
Yeah. I think a lot of stuff is important here. Like the max beam width, if you're using beam search for generation. You can generate, uh, logits, not just the output tokens; we can also generate logits if you want to process them. Uh, there are a lot of optimizations that you can use.
Like, there's an optimization called fused MLP. There is chunked context. There is a lot of stuff. I think you should play around with those, uh, in your own time. LoRA is very good to play around with. Yeah. I'll try to leave some, uh, some more examples, uh, in the GitHub repo to try.
Uh, okay. So let me go to the next one. Just, uh, one more thing I want to do is, uh, FP8 quantization. Uh, the RTX 4090 is actually an amazing GPU. It's pretty cheap, but supports FP8. So we're going to do an FP8 engine build now. So in this case, uh, like I said, some of these quantizations are not in convert checkpoint, but in quantize.py, which uses a library from NVIDIA called AMMO.
So I'm going to run that now. And, uh, yeah, let me spend some time here. What we're saying is, we're telling it that the quantization format is FP8, but also note that we are saying the KV cache dtype is FP8. So FP8 quantization actually can happen at two levels.
You can quantize the weights to FP8, but you can also quantize the KV cache to FP8, and doing both is very critical because, uh, these GPUs, they, you might have heard of things called tensor cores, right? Tensor cores are very, very, very important because they can do quantized calculations very fast.
For example, if you look at the spec of the H100 GPU, you can see that the teraflops you can get, the amount of computation you can get, with lower-precision options is much higher. For example, FP16 teraflops will be much lower than FP8, because you can use these special tensor cores to do more FP8 computations in the same time that you would do FP16.
But for that to happen, both sides of the matrix have to be the same quantization type. Mixed precision doesn't exist. At least now it's not very common or popular. So you want both sides to be quantized. And when you quantize both the KV cache and the weights to FP8, you get that extra unlock, that your computation is also faster, which can be critical for scenarios which are compute bound.
And as you would know in LLMs, there's a context phase and generation phase. Generation phase is memory bandwidth bound. But the context phase is compute bound. So that can benefit greatly from both sides being quantized. So in this case, we are saying that quantize both weights and KV cache.
And it's actually not a trivial decision to do that because weights quantize very easily. You hardly lose anything when you quantize weights. The dynamic range of weights is generally much, much smaller. You can use Int8 or when you do FP8, there is hardly any loss. KV cache doesn't quantize as well.
And that's why FP8 is a game changer. Because what we found is that when you quantize the KV cache with INT8, even using SmoothQuant, there is still degradation of quality. And practically, we've never seen anybody use it. Even though there are a lot of papers about it. And it's great, great technology.
But practically, it was not there until FP8. FP8 even KV cache quantization works extremely well. Let me show something with that, actually, if you don't mind. Back on the... If we go to... I'll just show like a little visualization for FP8 that shows off the dynamic range. So... Oh, hey, look, it's us.
Yeah, so when you look at the FP8 data format, it has a sign bit. And then rather than... so there are two different FP8 data formats, but we're using the E4M3 format. So basically, you have four bits dedicated to an exponent. And that's what gives your FP8 data format a lot of dynamic range versus INT8, which is just, you know, 256 evenly spaced values.
Yeah, so you still have the same number of possible values, but they're spread out differently. That's dynamic range. And it's that which allows you to quantize this much more sensitive KV cache. Yeah. Yeah, exactly. Basically, with the mantissa and the exponent, you're able to quantize smaller values better, give more bits to the smaller scale than the larger scale.
You don't have to fit into a linear scale. And that's where FP8 excels.
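If you want to see that dynamic range for yourself, here is a small script that enumerates the positive values the E4M3 format can represent (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7):

```python
# Enumerate positive FP8 E4M3 magnitudes and compare the range against INT8.
vals = set()
for e in range(16):              # exponent field
    for m in range(8):           # mantissa field
        if e == 15 and m == 7:
            continue             # that bit pattern is NaN in E4M3 (no infinities)
        if e == 0:
            vals.add(2.0 ** -6 * (m / 8))           # subnormals
        else:
            vals.add(2.0 ** (e - 7) * (1 + m / 8))  # normal values

vals.discard(0.0)
print(f"{len(vals)} distinct positive magnitudes")
print(f"smallest: {min(vals):.6g}, largest: {max(vals):.6g}")

# Smallest is about 0.00195 and largest is 448: a huge dynamic range compared to
# INT8's 256 evenly spaced levels, which is what lets the sensitive KV cache
# survive FP8 quantization.
```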
So going back to the presentation, the FP8 quantization is done. And I forgot to show you this, but there is calibration involved here. If you look at this stack here, we actually give it some data. We feed it some data and let it calibrate. Because as you would know, in INT8 and FP8, you have a start and end of the range, and they differ in how you divide up that range into data points. But you have to find the min and max. And for that, you need calibration.
So we give it a standard data set, and you can change the data set. But we give it a specific data set and it does multiple runs. And we try to calibrate, like, what are the dynamic ranges of each of the layers of this transformer architecture. And based upon that, we specify that min and max for each layer separately.
There's more detail there, but at the high level, that's what is happening. Yeah, yeah, it's possible that the ranges can vary a lot with the data set. And this used to be more critical with INT8. With FP8, we found that you get to a good state pretty fast. But it's worth thinking about trying different data sets, especially if you know what data set you are going to be calling it with. It could be worth it. It just works very well out of the box. But it's not perfect.
Going back to the workshop. So we were following along here. And yeah, so after you quantize it, the steps are very similar to before. Now we are building an engine with FP8.
And internally, all the CUDA kernels being used are now FP8-specific — different kernels that use the tensor cores in the right fashion. This should be pretty quick as well. There's a lot of depth here as you learn more about it — you don't need it to get started, but there are things like the timing cache,
and there are optimization profiles in TensorRT through which you tell it what sizes to expect so it can optimize for them. But this is a good beginning. So now we have the engine. Let me run du on it to see the size of the engine now. The size is 1.2 gigs, which is about half of the previous engine — which is what we expect, because we quantized.
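That roughly checks out on the back of an envelope, assuming this is TinyLlama at about 1.1B parameters (one byte per weight in FP8, two in FP16):

```python
# Back-of-the-envelope engine size check (assumes TinyLlama ~1.1B parameters;
# ignores KV cache, activations, and engine metadata).
params = 1.1e9
fp16_bytes = params * 2      # ~2.2 GB -- ballpark for the unquantized engine
fp8_bytes = params * 1       # ~1.1 GB -- ballpark for what du reports after FP8
print(f"fp16 ~{fp16_bytes / 1e9:.1f} GB, fp8 ~{fp8_bytes / 1e9:.1f} GB")
```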
And now let's run this engine and see the output and it should be pretty similar to what we saw before. So using this run script, now it's going to load the engine and then we'll do an inference on top. That should be pretty quick. So yeah, here's the output, the same input as before.
And about the same output as before. That's what we generally observe with FP8 — the quality is really, really good; it's very hard to tell the difference. And that's it for this part of the workshop. Awesome, thank you. So I definitely encourage you to keep playing around with this RunPod setup and trying different things.
Try building different engines and so on. But we're going to move on to the next step, which is an automated version of basically exactly what we just did. So we're going to show a few things that make this easier. For this next step we're going to use something called Truss.
Truss is an open-source model serving framework developed by us here at Base10. Pankaj is the one who actually wrote a lot of the code. All I did was name it Truss, because I was riding on a train and I was like, huh, what should I call this framework?
And then we ran over a bridge and I was like, I know, I'll call it Bridge. But that was already taken. So I called it Truss. It lets you deploy models with Python instead of building a Docker image yourself, and it gives you a nice live-reload dev loop.
And what we really wanted to focus on when we were building it — because it's the technology that sits under our entire model serving platform — is flexibility, so that we could work with things like TensorRT and TensorRT LLM. You can run it with vLLM, Triton.
You can run a Transformers model, a Diffusers model, you can put an XGBoost model in there if you're still doing classic ML — you can do basically whatever you want with it. It's just Python code. If I may interject: Truss is actually a very simple system. It's a way of running Python code — specifying an environment for the Python code, plus your Python code.
So it's sort of a very simple packaging mechanism, but built for machine learning models. It takes into account the typical things you need with machine learning models, like getting access to data and passing in secure tokens and such. But it's fundamentally a very, very simple system: just a config file and some Python code.
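For reference — even though we won't write one today — a bare-bones Truss model looks roughly like this. The Model/load/predict shape is Truss's documented interface; the body is a toy example I'm making up, not today's model:

```python
# model.py -- a config.yaml sits next to it describing the environment.
class Model:
    def __init__(self, **kwargs):
        self._pipeline = None

    def load(self):
        # Runs once when the model server starts: load weights, warm up, etc.
        from transformers import pipeline
        self._pipeline = pipeline(
            "text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
        )

    def predict(self, model_input):
        # Runs on every request.
        prompt = model_input["prompt"]
        return self._pipeline(prompt, max_new_tokens=64)
```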
Exactly. And so looking at that config file, we're not even actually going to write any Python code today for the model server. We're just going to write a quick config. So actually this morning I was eating breakfast here and I sat down with a group of engineers and we were talking about stuff and everyone was complaining about YAML and how they're always getting like type errors when they write YAML.
So unfortunately this is going to be a YAML system. So apologies to my new friends from breakfast. But what we're going to do is use this as basically an abstraction on top of trtllm. Pankaj, quick question for you. What's the name of that C++ textbook you were reading before bed every night the other month?
Modern C++. What? Modern C++. Yeah, C++. So you know, before bed every night I was watching Survivor. And so for those of us who are not cracked software engineers and even for those who want to get things done quickly, we want to have a great abstraction. What does that abstraction need to be able to do?
It needs to be able to build an engine. And that engine needs to take into account what model we're going to run, what GPU we're going to run it on, the input and output sequence lengths, the batch size, quantization, and any other optimizations we want on top of that.
And then we also don't want to just grab that and run it on a GPU pod somewhere — we actually want to deploy it behind an API endpoint so that we can integrate it into our product. So I'm going to show how to do that. Let's see here.
This is yours now, I'm stealing it. Oh, this is a good mic. I might not give this back, Pankaj. All right, so we're going to go over — let's see, this is still in the RunPod tab. Yeah, I need to go to the first one.
Great. The second one. Perfect. It's not this. Let me go to the workshop. Okay. Yeah, you do that — this is his computer, not my computer, so I don't know where anything is. It's like walking into someone else's house. There you go. All right. Thank you so much.
Okay. So what we're going to do in this second step is basically exactly the same thing we just did, just automated. For this step we're going to use Base10, and we're going to give you all some GPUs to play with.
So if you want to follow along, I really encourage you to do so. You're going to go sign up at Base10, and your account will automatically get free credits — if our fraud system isn't a little freaked out by everyone signing up at the same time.
Well, fortunately, we have some admin-panel access ourselves over here, so we'll just unplug this, approve you all, and plug it back in. So yeah, everyone go ahead and sign up for Base10. We're also going to want you to make an API key real quick and save that.
And then once that's all done, we're going to jump into this part of the project. Okay, everyone. So I know there are a few errors going on — we have pinged the team about that. Let me let you in on a little secret.
We shipped this on Thursday as an internal beta, and this is the very first time anyone who doesn't have a base10.co email address is using our new TensorRT LLM build system. So — sorry for tricking you all into beta testing our software.
But hey, that's what demos are for, right? So we'll get that sorted out. In the meantime, we have an image cached locally, which means we can keep going with the demo as if nothing ever happened. So let's look at what you would see in the logs as you deploy this.
Well, I can just look through the logs right here. Let me actually just wake it up. Sorry, what was that? Okay, yep, I got you. All right — big logs. Let's see.
All right. So what you're seeing here — sorry, we've got a lot of logs, because we tested a bunch with the scale-up — what you see is the engine getting built and then deployed. And to walk through the YAML code really quick:
Yes, here. So we talked about there being a bunch of different settings you need when you're working with TRT-LLM, and you can set all of those right here in the build. Right now we're doing an input sequence and output sequence of 2,000 tokens each, and a batch size of 64 concurrent requests.
We're using int8 quantization because we're running on an A10, which does not support FP8 — it's an Ampere GPU, one generation before FP8 support. And then, of course, you pull in your model and so on. And then if we want to call the model to test that it's working, we can come over here to the call-model view.
We can just test this out really quick. Yes — so we do not; on Base10 we have T4s, A10s, A100s, H100s and H100 MIGs, and L4s as well. We generally stick with the data-center-type GPUs rather than the consumer GPUs.
Yeah, I want one for... well, I'm going to say that I want it for legitimate business purposes and it should be an approved expense. I definitely don't want it for playing video games. So, yes. No — but I bet he can.
All of that source code is open source. And typically when new models come out, those companies provide checkpoint-conversion scripts. If you can follow those scripts, it's not terribly difficult — if you're familiar with the Transformers library, it's mostly about reading weights from a Hugging Face Transformers model and converting them into something else.
Yeah, it's a simple transformation, so it should be possible to do it yourself if you want. Awesome. So once your model's deployed, again, you can just test it really quick — you can call it at its API endpoint. But we're coming up on 2:30 here.
So I'm not going to spend too long on this example. But you know, we've been talking a big game up here about performance, right? And performance is not just "okay, I'm testing it by myself." Performance is: in production, for my actual users, is this meeting my needs at a cost that's reasonable to me?
And in order to validate your performance before you go to production, you need to do benchmarking — and a lot more rigorous benchmarking than just saying, "Hey, I called it, it seemed pretty fast." So what do you want to measure when you're benchmarking?
Say it with me, everyone: it depends. That's what software engineers are always saying. So, depending on your use case, you might be optimizing for different things. If you're, say, a live chat service, you probably really care about time to first token for your streaming output, because you're trying to give people instantaneous responses.
You might also care a lot about tokens per second — how many tokens are generated. Some good numbers to keep in mind: depending on the tokenizer, the data, and the reader, somewhere between 30 and 50 tokens per second is about as fast as anyone can read.
So, you know, if you're at 50 tokens per second, generally, it's going to feel pretty fast. People aren't going to be waiting for your output. However, if you're doing something like code, you know, code takes more tokens per word than say natural language. So you're going to need even more tokens per second for that, you know, nice, smooth output.
And then from there, getting into 100 or 200 tokens per second, that's when it just feels kind of magically fast. But again, inference is all about trade-offs when we're optimizing. So sometimes you might trade a little speed — maybe you run at 100 instead of 120 tokens per second — because that gets you a bigger batch size, which lowers your cost per million tokens.
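To make that trade-off concrete, here's the arithmetic — the price and throughput numbers below are made-up, illustrative values, not a quote:

```python
# Cost per million tokens from instance price and measured total throughput.
gpu_dollars_per_hour = 1.50          # hypothetical hourly GPU price
total_tokens_per_second = 5000       # total across the whole batch, from a benchmark

tokens_per_hour = total_tokens_per_second * 3600
cost_per_million = gpu_dollars_per_hour / (tokens_per_hour / 1e6)
print(f"${cost_per_million:.3f} per 1M tokens")   # ~$0.083 with these numbers
# A bigger batch that raises total throughput (while per-user speed stays
# acceptable) pushes this number down proportionally.
```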
Another thing you're going to want to look at when you're running your benchmarks is your total tokens per second. There's tokens per second per user — per request, how many tokens is your end user seeing — and then there's tokens per second in terms of how many tokens your GPU is actually producing.
And that's a really important metric for throughput and for cost, especially if you're going to be doing anything that's a little less than real time. And you don't want to look at it just once: look at the 50th, 90th, 95th, and 99th percentiles and make sure you're good with all of those.
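If you're collecting raw per-request numbers yourself, the standard library is enough to get those percentiles (the timing values below are made up):

```python
# Turn raw per-request measurements into p50/p90/p95/p99.
import statistics

ttft_ms = [142, 151, 149, 180, 162, 155, 390, 148, 158, 151]  # hypothetical samples

cuts = statistics.quantiles(ttft_ms, n=100)   # 99 cut points
p50, p90, p95, p99 = cuts[49], cuts[89], cuts[94], cuts[98]
print(f"p50={p50:.0f}ms  p90={p90:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```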
And you want to look at the effects of different batch sizes on this. Benchmarking actually reveals really important information, because it's not linear and it's not obvious — the performance space of your model is not a nice flat surface that scales linearly from batch size to batch size.
So this is a graph of time to first token for a Mistral model that I ran a while ago — I just happened to have a pretty graph of it, so it ended up in the presentation. If you look at the batch sizes as they double, from 32 to 64 the time to first token barely budges.
But as it goes from 64 to 128 — doubling again — the time to first token increases massively. And in this case, the reason is that we're in the compute-bound prefill step when we're computing the first token, and there are these slots that the computation can happen in.
As you increase the batch size, you saturate those slots until eventually you have an increased chance of a slot collision, and that's what rockets your time to first token. I'm glad you're nodding — I'm glad I got that right. All of this to say: the performance you get out of your model once it's actually built and deployed is not necessarily linear.
It's not going to be something super predictable. You have to actually benchmark your deployment before you put it into production; otherwise these sorts of surprises happen quite often. So, Pankaj, do you want to take over the benchmarking script? So, just for this workshop, we wrote a benchmarking script.
It's not the script we use ourselves, but a simpler version so that you can follow along — if you want to modify it, you can play around with it and understand it easily. If you go into that repository, it's a very simple script that sends requests in parallel, just using Python and its async libraries.
All you give it is the URL of the endpoint where your model is deployed, and you can give it different concurrencies, input and output lengths, and a number of runs — you want to run these benchmarks a number of times in case the values from any single run are off.
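The workshop script itself is in the repo; a stripped-down sketch of the same idea — my own illustration, with a guessed request/response shape, auth header, and token counting rather than the exact ones — looks something like this:

```python
# Fire N streaming requests concurrently and report TTFT and total tokens/sec.
import asyncio, os, time
import httpx

URL = os.environ["MODEL_URL"]            # the model's HTTP endpoint
API_KEY = os.environ["BASETEN_API_KEY"]  # exported beforehand, like in the demo

async def one_request(client, prompt, max_tokens):
    start = time.perf_counter()
    ttft, chunks = None, 0
    async with client.stream(
        "POST", URL,
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"prompt": prompt, "max_tokens": max_tokens, "stream": True},
    ) as resp:
        async for _chunk in resp.aiter_text():
            if ttft is None:
                ttft = time.perf_counter() - start
            chunks += 1                  # crude proxy for tokens received
    return ttft, chunks, time.perf_counter() - start

async def run(concurrency=32, max_tokens=1000):
    async with httpx.AsyncClient(timeout=None) as client:
        results = await asyncio.gather(
            *[one_request(client, "Tell me a story.", max_tokens)
              for _ in range(concurrency)]
        )
    total_tokens = sum(n for _, n, _ in results)
    wall = max(elapsed for _, _, elapsed in results)
    worst_ttft = max(t for t, _, _ in results)
    print(f"worst TTFT {worst_ttft:.3f}s, total TPS ~{total_tokens / wall:.0f}")

if __name__ == "__main__":
    asyncio.run(run())
```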
So I'm just going to run that script — it's all in the benchmark repository. It's structured with Makefiles, and there's a Makefile target for the benchmark that we're going to use; the README should also have instructions. We're basically going to run this, and we're going to need the base URL.
You need two things: we export the API key, and then we supply the URL. Once you deploy your model on Base10, you'll see it deployed, and like Philip said, there's a call-model button. There are various ways you can call it —
ultimately it's an HTTP API, and you can just copy this URL for the model into our benchmarking script. If you want to play around, there are examples in all kinds of languages, and you can also click "streaming" and it'll give you streaming code. Streaming is very important with large language models, because you want to see output as soon as possible.
So we're going to take this output — and I don't know if I exported the API key, so give me one second to export it. Come on. This is a good time to mention:
if you lose the API key, you can always revoke it. So that's good.
And it's a good time to mention that Base10 is SOC 2 Type II compliant, which is why we can't show you the API keys. So now we're just going to give it the URL here. Let me go back. All right, I'm going to do this first run with a concurrency of 32 and input and output lengths of 1,000, and let's see it work.
So first it does a warmup run, just to make sure there's some traffic on the GPU — you always want a warmup run before you take the real numbers. Now, as this is running, you can see the TPS here: the total TPS is 5,000, and this is on an A10G.
The A10G is not the most powerful GPU, but that TPS is still very, very high. It's because this is the TinyLlama model — TinyLlama is a tiny model, just about a billion parameters — and that's why we see such high numbers. But on bigger GPUs with Llama 8B you should also see very high values, because H100s are very powerful and TRT-LLM is very optimized.
I think we see up to something like 11,000 tokens per second — you should do the comparison; it's really, really high. In this case we have two runs and you can see these values. Let me try a different run. Now I'm going to do a concurrency of one, which is good for seeing the best time to first token you can get.
So you're sending many requests, but one at a time, with the same input and output lengths. And this should run. So you see a time to first token of about 180 milliseconds here — and this is from this laptop, on this wifi, with my model deployed somewhere in US Central.
And that's 180 milliseconds end to end. So the vast majority of that time to first token is going to be network latency, right? Not the model — the model itself should be pretty fast. Yes? Not that one — this one.
This is from the script that I'm running; I didn't do a thorough job of cleaning everything up. The point is we're just making an RPC call — we don't need PyTorch or anything like that on the local machine. This script, if you look at it, all it's doing is RPC.
Let me go through it real quick. The benchmark script is just simple Python using async, and it's amazing how good Python has gotten — with the async API you're able to load this model at thousands of tokens per second, all of it coming in streaming. Python has actually gotten really, really good.
There was a case where I was load-testing with k6, and the k6 client became the bottleneck because H100s are so fast — but Python could keep up; it was able to drive the load very well. But I'm getting distracted. So: we tried concurrency one, which is the best-case scenario.
Latencies are very good and TTFT should be very low. Now let's go to the other extreme. This model we deployed was created with a maximum batch size of 64, so now we'll do 64, and I'm hoping we see throughput improvements.
Ignore the warmup run. This is going to take a bit longer, because now we're sending 64 requests at the same time — and 64 is not that high; we can go even higher. In this case you see a total TPS of about 7,000, which is even higher than before.
We saw 5,000 before, and this goes up to 7,000 — but maybe it's a fluke, so let's wait for the second run. And yes, it's all around 7,000, so this is much better than what we saw before. Now, if you increase the batch size, you'll find that your latencies get higher.
Latencies in the sense that, for each request a user sends, tokens now arrive more slowly. At some point you have to make a trade-off: is it still good enough — still above, say, 30 or 50 tokens per second, so users won't notice?
And at some point it will become unusable. And as you increase batch size, the pressure on your GPU memory also increases. Because all these extra batches, they require KV cache to be kept in GPU memory. So you might hit that bottleneck. So depending on all these scenarios, you want to experiment with different batch sizes and find the sweet spot.
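As a rough sense of that memory pressure, here's the usual KV cache arithmetic — the model-config numbers below are what I believe TinyLlama 1.1B uses (22 layers, 4 KV heads via GQA, head dim 64), so treat them as assumptions:

```python
# KV cache memory = 2 (K and V) * layers * kv_heads * head_dim * bytes, per token.
layers, kv_heads, head_dim = 22, 4, 64   # assumed TinyLlama-1.1B config
bytes_per_value = 1                      # FP8 KV cache; use 2 for FP16
seq_len, batch = 2000, 64                # the engine settings from this example

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total = per_token * seq_len * batch
print(f"{per_token / 1024:.0f} KiB per token, {total / 2**30:.2f} GiB at batch {batch}")
# Larger models and longer sequences multiply this quickly, which is why the
# batch size you can afford is ultimately bounded by GPU memory.
```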
And that takes a bit of time, but it's not terribly complex. So yeah, this script is there for you to modify and play around with. It's pretty simple — pretty much a single Python file, not much in there. You can modify it, you can run it.
Yes — the question was, did I have a max batch size in mind when I was running? Yes, because I deployed the model with the config in this workshop. Let me show you: I built the model with a max batch size, and you can increase that batch size.
In this TinyLlama model that I deployed, I specified a max batch size of 64. So if I go beyond 64, it's not going to help me, because those extra requests will just wait. And actually, there's one interesting thing I want to show you.
So this is a good question. You can look at the logs here, and in the logs we print metrics for what's going on. And if you wanted to increase your batch size past 64, you'd just change the YAML — say the max batch size should be 128 — and build a new engine by redeploying it.
So that should be like 128 and build a new engine by deploying it. Yeah. So if you look at these logs here, it shows how many requests are running in parallel active request count. Uh, you can actually observe how many requests are being executed in parallel right on the GPU, because there are chances that you haven't, uh, configured something right.
And that, uh, for whatever reason, the, uh, requests are not all getting executed in parallel. For example, a common mistake one could make is that when you deploy on, on base 10 and this is based on specific, but just to take an example, there are scaling settings in base 10.
You can specify the scale and you can specify what is the max concurrency the model will receive. In this case, I've set it to a very, very high value. So it won't become a bottleneck, but there are chances that, that, you know, you forget, you make a mistake there, you, you can check these logs and they will actually tell you what's happening on the GPU.
And I think I lost that again — let me go here. So yeah, these are actual metrics from the TensorRT LLM batch manager, which tell you what's going on. They also tell you about the KV cache blocks being used, and that helps you tune the KV cache size.
For example, in this case it says it's using 832 KV cache blocks and 4,500 are empty, which means there's way more KV cache allocated than is needed for this use case. Just to mention that as an aside. Yeah, I think that's it for that part of the presentation.
I'm going to talk about that on the next slide. Thank you — you're reading my mind. It does: yes, TensorRT LLM does come with a benchmarking tool, and it's very good. The only downside is that you have to build it from source.
It's not bundled with the TensorRT LLM Python library, which is my gripe — I keep asking somebody to fix that. They have two benchmarking tools: one that just sends a single batch and measures the raw throughput you can get without serving through in-flight batching,
and a second tool, gptManagerBenchmark, which actually starts up a server and does in-flight batching on top. So there are two tools, and they're very good quality, but they're not easily available. Building TensorRT LLM — even with 96 CPUs it takes us an hour and a half.
It's not for the faint of heart — or the short of workshop. So, we just have a few minutes left, and I want to run through a few slides and then leave time for last-minute questions. I was asked: how do we actually run this in production?
What does the auto-scaling look like? How does that all work? So, how do you run a TensorRT engine? You use something called Triton — the Triton Inference Server — and that's what helps you take the engine and actually serve requests with it. We're also working on our own server that uses the same spec but supports C++ tokenization and de-tokenization, plus custom features, for even more performance.
But, as we've talked about, the engine is specific to versions, GPUs, batch sizes, sequence lengths — all that sort of stuff. That causes some challenges when you're running it in production. We've talked this whole time about vertical scale, right? How do I get more out of a single GPU?
There's also horizontal scale: how do I just get more GPUs? How do I automatically scale my platform up to meet my traffic demands? Some challenges in scaling out in general: you have to automatically respond to traffic, you have to manage your cold-start times,
you have to manage the availability and reliability of your GPU nodes, you have to route requests, do batching, all that kind of stuff. And then TensorRT LLM adds a few more challenges. You've got these large image sizes, so your cold starts get even slower.
You've got specific batching requirements, so you can't just send whatever traffic however you want. And you have specific GPU requirements, so when you spin up a new node, it has to be exactly the same as your old node or your model's not going to work. And unfortunately our workshop is almost over.
Otherwise I would love to give you an in-depth answer on how to solve all these problems. But the quick answer is: you run your code on Base10, because we've solved it all for you. Base10 is a model inference platform —
it's the company we both work at. With Base10, you can deploy models on GPUs. You can use TensorRT LLM, but you don't have to; you can use anything else — vLLM, TGI, or just a vanilla model deployment. You get auto-scaling, fast cold starts, scale to zero, and tons of other great infrastructure features.
You get access to all of our model optimizations — we have a bunch of pre-packaged and pre-optimized models for you to work with. And the last thing before we go: we're co-hosting a happy hour tomorrow on Shelby's rooftop bar. I wasn't really involved in organizing it, but the sales guys who did tell me it's a super sweet spot and that we're going to have a great time.
So I'm going to be there, and a bunch of other great AI engineers are going to be there, so please feel free to sign up and come through. We'd love to have you. It's going to be a super sweet, cool party for the cool kids.
Well, you've been really great. Thank you for listening to us, and we'll open it up for questions. Actually, I feel there's one question we didn't answer, about auto-scaling — maybe I should take that on now. Yeah, go ahead. So how does auto-scaling work? Yeah, please.
Please feel free to hold — I'll just finish that question, because you asked it and I wanted to answer it. How auto-scaling works is that we use a system called Knative, but we forked it to make it work for machine-learning use cases. Knative, if I understand correctly, was built for microservices, where your requests are very quick — one or two seconds.
That doesn't exactly apply to machine-learning model use cases, where requests are long-lived and streaming and you still need to scale, but some of the considerations are different. So we had to fork it to be able to apply settings dynamically. For example, a lot of the settings you see on Base10 apply immediately, within a second,
whereas in Knative you would need to deploy a new revision for those settings to apply — and that's not even practical, because deploying a new revision creates a lot of hassle and requires extra capacity. So we made changes to Knative to cater to the machine-learning use case.
But fundamentally the idea is very simple. As requests come in, you've specified the capacity each pod can take, and if the in-flight traffic gets near that, we spin up a new pod and the traffic spreads. If your traffic goes down, your pods are scaled back down and the GPUs are freed up, all the way to zero.
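In spirit, the scale-up rule is just per-pod target concurrency versus in-flight requests — here's a toy sketch of that decision (my own illustration, not Baseten's or Knative's actual code):

```python
# Toy replica-count rule: enough pods that in-flight requests fit the
# per-pod concurrency target, clamped to [0 or 1, max_replicas].
import math

def desired_replicas(in_flight: int, target_concurrency_per_pod: int,
                     max_replicas: int, scale_to_zero: bool = True) -> int:
    if in_flight == 0:
        return 0 if scale_to_zero else 1
    need = math.ceil(in_flight / target_concurrency_per_pod)
    return min(need, max_replicas)

print(desired_replicas(150, target_concurrency_per_pod=64, max_replicas=10))  # -> 3
print(desired_replicas(0,   target_concurrency_per_pod=64, max_replicas=10))  # -> 0
```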
And when traffic arrives while you're at zero, it's kept in a queue, the model is spun up, and the requests are sent to it. A lot of the machinery at Base10 is around improving cold starts.
How do we start up these giant models — 50-gig models — in under a minute? A minute sounds like a long time, but 50 gigs is also a lot. For 10-gig models we aim for less than 10 seconds,
and for 50-gig models we aim for less than a minute, because that really matters: unless you can scale up in about a minute, requests are going to time out, and then you really can't scale to zero. It seems like a detail, but it's critical — you can't have scale to zero without very fast cold starts.
So we have the whole machinery built out for that — even before LLMs became popular, we had this machinery in place. Question: what should I expect in terms of performance if I were to, say, optimize it myself — are you asking, deploy it with TensorRT yourself versus deploy it on Base10?
So you would deploy it on Base10 using TensorRT under the hood to run it; Base10 just facilitates that TensorRT deployment for you. So no, no difference — Base10 runs TensorRT LLM; we just make it very easy to run TensorRT LLM.
So it's easier and faster for you to get to the optimal point, but if you did it yourself, it's the same thing under the hood. And then, you know, I'm compelled by the fact that I'm in the marketing department to add that we also provide a lot of infrastructure value on top of that, so you're not managing your own infrastructure.
Yeah, actually, that is true. Because we have a large fleet, we get good costs — so you actually won't pay more on Base10; it's not that it's going to cost you more. Yes, over there. Yeah — I was wondering about, like, the open source... Yeah.
So only the packaging part is open source. The serving part — that's kind of the platform. But I do want to mention that from a Truss you can create a Docker image, and that Docker image can be run anywhere, so you get pretty close.
You don't get the auto-scaling or that nice dev loop, but you do get a Docker image, and you can do a lot with a Docker image. It builds it locally — yes, there's a Truss image-build command: you point it at a Truss and it will build the Docker image locally.
Oh, okay. So I guess serving on a single pod or container is covered, but spreading across multiple containers, the auto-scaling, and the dev loop — that is not open source. But serving — single-model serving — yeah, is in Truss with that image.
Yes — it's FastAPI at the Truss server level. Then internally we have our own server layer that we wrote to interact with TensorRT LLM; that part is not open source — it's very new, so we're still figuring out when and where to open-source it — but there's also a version that uses Triton.
So there's the FastAPI layer, then Triton, then the TensorRT LLM library, and then it runs the engine. Yeah, exactly. We work with multiple cloud providers — we're spread across, I don't want to say the globe, but mostly the US, plus Australia and a few other places.
So we have access to many different kinds of hardware; we find the right hardware to build the engines and then we deploy them. Yes, you can — you can use self-hosted clusters. Our stack is built that way; that's one of our selling points. We're not just giving you an API — you can run the entire stack in a self-hosted way.
You can run the entire stack in a self-hosted way. Awesome. Well, look, the conference organizers were very clear with us. All sessions and workshops are to end on time. So I'm going to wrap it up here, but, um, we're going to be right outside if you have any questions.
If you have any easy questions, come see me. If you have any hard questions, please go talk to Punkage instead. Thank you all so much for being here. It was so much fun doing this workshop with all of you. Again, I'm Phillip. This is Punkage. We're from Base 10.
And thank you so much for being here. Have a great conference, everyone. Thank you. We'll see you next time.