
Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten


Transcript

Hey everyone. So we're gonna go ahead and get started here. We've got a nice close group here today and that's I think to everyone's benefit. This workshop is really for you. You know I love the sound of my own voice. I love talking. That's why I'm a developer advocate.

But the purpose of this workshop is to help you get comfortable with SGLang. So if you have questions, if you have ideas, if you have bugs, ask Yineng or me. And we're definitely going to be able to tailor this workshop to you and your interests and what you're working on.

So the title of this workshop is an introduction to LLM serving with SGLang. We're going to be talking about SGLang, with a quick introduction first. So my co-speaker here, Yineng, is a core maintainer of SGLang, has been involved with LMSYS Org for quite a while now, is the sort of inference lead on the project, previously worked at Baidu and some other places, and is also an author of a few papers, including FlashInfer.

And I'm Philip, and I got a B+ in linear algebra. So whether you're coming in here and you're super cracked, or you're brand new to SGLang, we're going to have something for you. Whatever your skill level, this is the place to be. So what are we going to do today?

We're going to, you know, introduce SGLang and get set up a little bit. We're going to talk about the history of SGLang, talk about deploying your first model, and a bunch of things you can do to optimize performance after that. And then we're also going to talk a little bit about the SGLang community and how you can get involved, and even do a little bit of a tour of the code base in case you want to start making open source contributions.

So by way of introduction, let's see. What is SGLang? So SGLang is an open source, fast serving framework for large language models and vision language models. Generally, you use SGLang in a sentence along with either vLLM or TensorRT-LLM. It's one of the multiple options for serving models in production.

So the question is, why SGLang? Like, why should we, you know, invest in learning and building with this library? And, you know, first off, it's very performant. SGLang offers excellent performance on a wide variety of GPUs. It's production-ready out of the box. It's got day-zero support for new model releases from labs like Qwen and DeepSeek.

And it's got a great community and a strong open source ethos, which means that if something is broken in SGLang, if you don't like something, you can fix it, which is a pretty huge advantage. So who uses SGLang? Well, we do at Baseten. We use it as part of our inference stack for a variety of different models that we run.

We also see SGLang being used very heavily by xAI for their Grok models, as well as a wide variety of inference providers and cloud providers and research labs, universities, and even product companies like Cursor. So, quick history of SGLang. It's honestly really impressive to me how quickly this project has come up and gotten big.

If you look at it, the arXiv paper was released in December 2023. That's 18 months ago. So in just 18 months, this project has gone from a paper to almost 15,000 GitHub stars. You should all go star it so that we can get a little closer. And it's, you know, supporting all of those logos, all those companies we saw on the last slide.

It's got a growing and vibrant community. It's got international adoption. So yeah, incredibly impressive what the team has done in that time. And I'm going to turn it over to Yineng now to talk a little bit more about that history and also how he got involved in the project.

Okay. Hello. I'm Yineng. I'm a core developer of the SGLang project, and I'm also a software engineer at Baseten. Before I joined Baseten, I worked at Meituan. At that time, I worked on internal click-through-rate ranking model optimization and LLM inference optimization. And around then, the creator of SGLang, Lianmin, reached out, and we had a Google Meet.

So when I left Meituan, I joined the SGLang project, and I worked closely with Lianmin and Ying on SGLang. Also, you know, SGLang used FlashInfer heavily, because we use FlashInfer as the attention kernel library and the sampling kernel library. So I also worked with Zihao on the FlashInfer project.

And yeah, currently, I'm a co-maintainer of the SGLang project, and I'm also a team member at LMSYS Org. And that's a little point of trivia: that's the same LMSYS Org that just got $100 million from a16z to build Chatbot Arena. I learned that while I was putting together the slides for this talk.

So if you were here early, you were able to scan this QR code and get everything set up for the workshop. If not, definitely grab that right now. You've got the QR code, you've got the URL that takes you to the same place. Does anyone still need the QR code?

Okay, I've got a couple people still. All right. Anyone still need the QR code? Going once. Going twice. Yep. To folks watching at home, you've got this great button on YouTube. It's called the Fast Forward button so you can just skip this part. All right. We are looking good.

If you need this again, just let me know. I'll throw it back up there. So we're going to talk about how to deploy your first model on SGLang. So if you go over to the GitHub. Yes. So in this step, we're just going to get familiar with the basic mechanics of SGLang.

SGLang is basically just like a server command that you're going to run in your Docker container. There's a little bit of sort of difference with using it the way we're going to use it in the workshop right now versus how you might use it if you're working directly on a GPU.

The difference is you're using something called Truss to package it. Basically, you're putting your SGLang dependencies and your launch command into this YAML file, you're bundling it, and you're shipping it up to a GPU. The reason we are using Truss is because that is the way that you can get on Baseten, and the reason you are using Baseten is because that is the only company on earth that will give me free GPUs, because I work there.

So we're going to be working on all these examples on L4 GPUs because they are cheap and abundant and they also support FP8. But this same approach works on H100, H200, and Blackwell is coming soon. Yeah, coming soon. So yeah, it's going to be basically the same principles.

If you go through the configuration here, you can actually change the hardware type in your Truss config to H100 if you want, in the accelerator line right there. But yeah, so what is the actual SGLang launch server command that we're running here? It's basically just a bunch of flags.

That's the thing to understand about using SGLang. It's all about knowing what flags are available, knowing what configuration options are available, knowing the support matrix that exists for them, and knowing how they interact with each other. If you, you know, turn on an aggressive speculation algorithm and then you also jack your batch size way up,

well, that's probably not going to go so well for you. But if you want to do, say, quantization along with some of these other optimizations, those play nice. So yeah, what we're going to do (this is the fun part of leading a workshop) is the part where we just stand around watching you type.
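If you want a mental model of what that launch command looks like outside of Truss, here's a rough sketch. This is not the exact workshop config: the flag names are taken from recent SGLang documentation, so double-check them against `python -m sglang.launch_server --help` on your version, and treat the model path as a placeholder.

```python
# Rough sketch of an SGLang launch command, wrapped in Python for convenience.
# Not the workshop's exact config; verify flag names with --help on your install.
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # placeholder HF repo id
    "--host", "0.0.0.0",
    "--port", "8000",
    "--quantization", "fp8",          # example optimization flag; composes with most others
    "--mem-fraction-static", "0.85",  # example flag that interacts with batch size
]
subprocess.run(cmd, check=True)  # serves an OpenAI-compatible API on port 8000
```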

What we're going to do is give everyone about five minutes to work through this first example. We're going to circulate the room if you have any questions. And then we're going to come back together after running the first example. Sound good? All right, let's do it. Can you cut the mics for five minutes?

Pause. Skip. It's great. These buttons, they're magical. Uh, issues: is anyone having issues where you're, like, stuck trying to get into Baseten? You're in, like, a waiting room and it won't let you out? If you are, flag me. If anyone is having issues where you're getting, like, an error in your code, please don't show me, show him.

Let me kind of check on progress. Has anyone managed to get the first model deployed and running? It's deploying. Awesome. Let's hope it's deploying really fast. Let me take a look here. All right. Sounds good. Can you take a look at the logs for me real quick? Wow. Wow.

Our wifi is just amazing here. I promise Baseten is usually faster than this. Okay. Oh, okay. Well, it looks like it came up. So you can use the sample code in call.py, sorry, call.ipynb. Or, like, you can just use an ordinary OpenAI client.

What you need to call it: if you go back to your Baseten workspace with the model, what you need is, scroll back up a little bit for me.

You need that model ID. That's what's going to unlock your calling code. Yeah, paste it in right there. You'll need to run an actual Jupyter notebook to run that. All right. We've had our first successful deploy. If you want to call it using the OpenAI SDK, use the call.ipynb notebook. This thing up here, it's going to be different for everyone: within the UI, that's your model ID, which is what you use to set up the URL.
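For folks following along later, here's a minimal sketch of what that OpenAI-client call might look like. The base URL format and model name below are placeholders, not the exact values; copy the real endpoint (built from your model ID) and an API key from your Baseten workspace or the call.ipynb notebook.

```python
# Minimal sketch of calling the deployed SGLang model with the standard OpenAI client.
# base_url and model are placeholders -- use the values from your Baseten workspace
# (built from your model ID) and the call.ipynb notebook.
from openai import OpenAI

client = OpenAI(
    base_url="https://model-<MODEL_ID>.api.baseten.co/v1",  # placeholder endpoint format
    api_key="<YOUR_BASETEN_API_KEY>",
)

response = client.chat.completions.create(
    model="<SERVED_MODEL_NAME>",  # placeholder; see the notebook for the exact value
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```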

Hey everyone, we're going to come back together here. It's about 9:45, so we're going to move on to the next stage of the workshop, where Yineng is going to do some really awesome demos.

If you are still getting everything set up, no worries. All this stuff is going to be live on GitHub; the repository with the workshop information is going to stay up, so you can keep following along. This is also all going to be published.

So it's going to be easy to go back if you have any issues. Anyway, the next thing that we're going to look at, now that we have a sort of basic idea that, okay, SGLang is basically just running a model server, is how are we going to actually make it fast?

So Yineng is going to show one demo, which is the CUDA graph max batch size flag and how to set that to improve performance. And then we're also going to take a look at EAGLE-3, which is a new speculative decoding algorithm, which can also improve performance.

So take it away, Yineng. Yeah. Can you see my screen? Yes. Yeah. Good call. Zoom it in a little bit. We're using RunPod because, on Baseten, you don't get SSH access into your GPUs because, uh, security or something, I guess. I don't know.

Okay. So yeah, I will use the L4 GPU. Yeah. This is the L4 GPU, and I have already installed SGLang. Yeah. We can just use pip install or install from source. And here is the command line, sorry. We launch the server and we use the Llama 3.1 8B Instruct model.

And the attention backend uses FA3 (FlashAttention 3). This is the default. And when we, okay, it just started loading the weights. So, just to give everyone a little bit of context, the top window you're seeing here is the L4 that's actually running the SGLang server.

The bottom window here is lm-eval, a sort of industry-standard benchmarking tool that we're just going to use to throw a bunch of traffic at the running server. Yeah. For sure. And yeah, we can see the log from the server. It shows that we capture the CUDA graph.

I think CUDA graph is turned on by default. But the CUDA graph max batch size on L4 for this model is eight, so it only captures batch sizes one, two, four, and eight. And, okay, the server is ready to run, and we can use lm-eval to send requests.

Yeah, we can see that from the log. Here is the prefill batch, and here is the decode batch. And we can see, under the decode batch, when the running requests count is 10, it means there are 10 running requests, and the CUDA graph flag is false, because 10 running requests is larger than the max CUDA graph batch size of 8.

That's why this flag is false. And when this is false, we get 255 generation tokens per second. And we can take this number divided by 10, so I think per user it's nearly 15 tokens per second. Okay. We can kill the client, and we can also kill the server.

So, yeah. We can use this command as a base and set the CUDA graph max batch size. For example, we can adjust it to 32. You've got a typo in there. Oh, sorry. Okay. The network is not good.
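The relaunched command is essentially the previous one with the CUDA graph max batch size flag raised. A sketch, with flag names per recent SGLang versions (verify with `--help`) and the model path as a placeholder:

```python
# Same launch as before, with the CUDA graph max batch size raised to 32.
# Sketch only: verify flag names against your SGLang version's --help output.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # placeholder HF repo id
    "--attention-backend", "fa3",   # FlashAttention 3, as in the demo
    "--port", "8000",
    "--cuda-graph-max-bs", "32",    # capture CUDA graphs for decode batches up to 32
], check=True)
```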

Here, everyone is learning a very important lesson in the value of latency. Okay. Yeah, it's loading weights. And we can see that after we set the max CUDA graph batch size, the captured CUDA graph batch sizes, I think the max is 32. It's larger than the 8 from before.

And the server is ready to roll. We also use lm-eval to send requests. Okay. So, first is the prefill batch. And then, here is the decode batch. Oh, okay. Wait for a moment.

Yeah, for example, here is the decode batch, and there are 13 running requests, and the CUDA graph flag is true, and here is the generation throughput. And I think per user it should be about 12, and we can compare with before.

It's not easy to compare live. Yeah, I think we are recording this video, and we can also see the CUDA graph here; we will upload this one as the CUDA graph max batch size demo. We want the CUDA graph to be true during decode, because I think this is very important for decoding performance, but the default max batch size is 8 on L4.

And when we used lm-eval to send requests, we found that, oh, the running batch size is larger than 8. That's why we want to set or adjust the parameter. So here, when we set it to 32, we can handle the realistic batch size during the benchmark. Okay. Do we have any questions?

Do you have any questions? What does the lm-eval command do? Yeah, yeah. I think lm-eval is an evaluation tool, and we need to specify the model. And here is the model name. Here is the URL: because I just used RunPod to run this, and it's on the same node.

So that's why the URL is localhost. And we specify the port, this one, 8000. That's why we used 8000, and we used the OpenAI-compatible server. And here, the number of concurrent requests, or the batch size, is 128. We set the max generation tokens. We just used GSM8K; I think it's a classic evaluation dataset.

And because we used the chat completions API interface, that's why we need to apply the chat template. And I just used few-shot 8. And the limit: because, you know, GSM8K has 1,319 prompts, and when we use a limit of 0.15, I think it's nearly 200 prompts.
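Putting those pieces together, the invocation looks roughly like the sketch below. The argument names follow the lm-evaluation-harness CLI as I understand it, so treat them as assumptions and check `lm_eval --help`; the model name is a placeholder.

```python
# Sketch of the lm-eval run described above, pointed at the local SGLang server.
# Argument names are assumptions based on the lm-evaluation-harness CLI; verify locally.
import subprocess

subprocess.run([
    "lm_eval",
    "--model", "local-chat-completions",
    "--model_args", ",".join([
        "model=meta-llama/Llama-3.1-8B-Instruct",              # placeholder model name
        "base_url=http://localhost:8000/v1/chat/completions",  # SGLang's OpenAI-compatible endpoint
        "num_concurrent=128",                                  # the "batch size" of 128 mentioned above
    ]),
    "--tasks", "gsm8k",
    "--num_fewshot", "8",
    "--apply_chat_template",
    "--limit", "0.15",  # roughly 200 of GSM8K's 1,319 prompts
], check=True)
```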

I can also share this command line in the repo. Yeah, maybe I can add it. Oh, sorry. Yeah. So just to be clear, this command is running on the actual GPU itself. Yeah. So this is for when you have SSH access into the GPU you're running on.

On the service we're all using, on the Baseten GPUs, you can't SSH in. But if you do have access to a GPU where you can get SSH access, then you would use this lm-eval tool in order to simulate that traffic. If you're using a more standard HTTP connection to a, you know, remote GPU, then you would use a different benchmarking tool that's, you know, request-based.
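As one example of the request-based route, SGLang ships its own serving benchmark. A sketch is below; the module name and flags are written from memory, so verify them on your install, and point the host at your remote endpoint instead of localhost when benchmarking over HTTP.

```python
# Sketch of a request-based benchmark using SGLang's bundled serving benchmark.
# Module name and flags are assumptions from memory; check them on your version.
import subprocess

subprocess.run([
    "python", "-m", "sglang.bench_serving",
    "--backend", "sglang",
    "--host", "localhost",    # replace with your remote endpoint's host
    "--port", "8000",
    "--num-prompts", "200",   # size of the synthetic traffic run
], check=True)
```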

Yeah. Okay. And do you have any other questions on CUDA graph? Why is the default 8? Yeah, I think the default of 8 is because of the L4 GPU's free memory. And we have some default configuration: when you don't set the CUDA graph max batch size, the default value is None.

And when the default value is None, we will set it internally for the specific hardware and the specific model. For example, this is TP 1 and it's on L4, so the default is just 8. So what if someone, by mistake, sets a higher one for a large model or something?

Yeah, you can just try that. Because when you launch the server, you can see the startup parameters. And then, well, you have a workload, right? And you use lm-eval to benchmark, for example. And then you can analyze the server log, and you find that, oh, during decoding, the CUDA graph is disabled.

And actually, we want to enable the CUDA graph. That's why we increase the max CUDA graph batch size. Yeah. Okay. Awesome. So let's see: do you want to show the EAGLE stuff or do you want to show the codebase stuff? Yeah, yeah. Okay. I think the next very important thing is the EAGLE stuff.

Yeah. So EAGLE-3 is a speculative decoding framework. It came out very recently, right? Yeah, the paper was released a few months ago. And so SGLang supports EAGLE-3. And with it, you can configure a wide variety of different parameters around how many tokens you're speculating, how deep you're speculating, that kind of stuff.

And EAGLE-3 can have a much higher token acceptance rate. So obviously, when you're speculating, the higher your token acceptance rate, the better performance you're going to get. So we can take a quick look at some of those parameters that you showed, and then maybe the benchmark script you were showing me the other day.

Yeah, yeah. For EAGLE-3, we also provide an example. We can just change directory to this directory and then use truss push. It's very easy. I just want to explain some details. For example, we need to specify the speculative decoding algorithm; here is EAGLE, like this one.

Yeah. We need to specify the speculative decoding algorithm, EAGLE. And we also need to specify the draft model path, because this one, the model path, is the target model, and here is the draft model. Sorry, here is the draft model for EAGLE-3. So one thing that's different about EAGLE, all the different EAGLE algorithms, is that instead of a standard draft-target setup where you're, say, maybe using Llama 1B and Llama 8B together, EAGLE works by pulling in features from multiple layers of the target model and using that to build a draft model.

So the draft model is kind of derived directly from the target model, versus being just a small model that you're also running. Yeah. And you also need to specify these parameters: the number of steps, the EAGLE top-k, and the number of draft tokens to verify. For example, here the depth of the draft tree is three and the top-k is one.

I think the maximum number of draft tokens then should not be more than four. That's why we set four here. And yeah, you can see more details about this configuration in the official SGLang documentation. And I also will show something about how to tune these parameters.
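For reference, the speculative-decoding part of such a launch looks roughly like this. It's a sketch: the flag names follow recent SGLang documentation and the model paths are placeholders, so check them against the workshop example and `--help`.

```python
# Sketch of an EAGLE-3 launch. Flag names per recent SGLang docs (verify with --help);
# the target and draft model paths are placeholders for the ones in the workshop example.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "<TARGET_MODEL>",                          # the big model you actually serve
    "--speculative-algorithm", "EAGLE3",
    "--speculative-draft-model-path", "<EAGLE3_DRAFT_MODEL>",  # draft model trained for this target
    "--speculative-num-steps", "3",         # depth of the draft tree
    "--speculative-eagle-topk", "1",        # branching factor per step
    "--speculative-num-draft-tokens", "4",  # matches depth 3 with top-k 1, as in the talk
    "--port", "8000",
], check=True)
```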

We have these parameters; the model path is fixed, but how about these ones: the number of steps, the top-k, and the number of draft tokens? We can tune these parameters, and I will show you how to tune them. So in the SGLang main repo, we have a playground and we have a script.

Yeah, we have a bunch of speculative decoding scripts. Okay. So we can just use this script to tune these three parameters. For example, on a single GPU, this is the target model, Llama 2 7B, and this is the draft model. Here are some default parameters.

The batch sizes go 1, 2, 4, 8, 16. And the steps are here: 0, 1, 3, 5, 7. And the top-k is here, and this is the number of draft tokens. What does that mean? I think it's very easy to understand: we have different combinations of these different parameters.

And this script will run all of these combinations, and you will get the results. And from the results, you will get to know that, oh, this combination, for example, is best: say, at batch size 8, three steps, and the top-k is 1.

And the number of draft tokens is 4. You will get some results about the speed and about the accept rate. Then you can use those parameters for your online serving, for your production serving. And when you're running this benchmark, do be sure to set the prompts to things that are representative of your actual workload.

Yeah, yeah. Because speculation in any form, including EAGLE, is all about guessing future tokens, if you are benchmarking on data that is not representative of the actual inputs and outputs you're seeing live in production, then you're probably going to end up with the wrong parameters. Speculation is a very topic- and content-dependent optimization.
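To make the shape of that sweep concrete, here's an illustrative sketch of the idea. This is not the actual playground script: `run_benchmark` is a hypothetical stand-in for relaunching the server with those flags and measuring throughput and accept rate, and the grids are just example values.

```python
# Illustrative sketch of tuning the speculative-decoding knobs via grid search.
# NOT the actual SGLang playground script; run_benchmark() is a hypothetical placeholder.
from itertools import product

batch_sizes = [1, 2, 4, 8, 16]      # from the talk
num_steps_grid = [1, 3, 5, 7]       # example values
topk_grid = [1, 2, 4]               # example values
num_draft_tokens_grid = [2, 4, 8]   # example values

def run_benchmark(batch_size, num_steps, topk, num_draft_tokens):
    """Hypothetical helper: relaunch the server with these speculative flags,
    replay prompts representative of your workload, and return (tok_per_s, accept_rate)."""
    raise NotImplementedError

results = []
for bs, steps, topk, draft in product(batch_sizes, num_steps_grid, topk_grid, num_draft_tokens_grid):
    tok_per_s, accept_rate = run_benchmark(bs, steps, topk, draft)
    results.append(((bs, steps, topk, draft), tok_per_s, accept_rate))

# Keep the fastest combination at the batch size you actually run in production.
best = max(results, key=lambda r: r[1])
print("best (batch, steps, topk, draft_tokens):", best[0], "tok/s:", best[1])
```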

Yeah, yeah. I think so. So you can also update these prompts. Here in this batch speculative decoding Python script, we have some prompts, and I think you can update these according to your needs. Yep. Okay. So let's take a look at some of the stuff around, you know, the community and getting involved.

Yeah, yeah. Also, I think SGLang currently has become very popular. And if you want to participate in this community and contribute some code, I think, yeah, I'll show the slides real quick. Okay. Yeah, so, you know, SGLang does have a really great community. And, you know, some quick ways to get involved:

You can start on GitHub: file issues and bug reports as you build. They have a great tagging system of first issues to get involved with, which Yineng is going to show in a second. But the number one thing you can do is follow SGLang's LMSYS Org account on Twitter and then join the Slack to keep an eye out for online and in-person meetups.

So this is a link to the community Slack. You can scan that real quick if you want to get involved with SGLang. These slides are all in the repo that you got from the workshop, so you can access this link and stuff later.

It's also just slack.sglang.ai. Pretty simple link. So if you are going to get involved and you do want to, you know, start contributing to the code base, we can kind of show you some of the stuff. So at a high level, the code base has the SGLang runtime. It's got a domain-specific front-end language and it has a set of optimized kernels.

You can actually go to this DeepWiki page and get a really good tour of the code base, as well as a tour from this other repository that we have linked, which is also by one of the SGLang people, with some diagrams about exactly how this stuff works.

And then, yeah, Yineng is just going to show a quick overview of the code base on GitHub in case you're interested in getting involved and contributing. Yeah. I think the best way to get involved in this project is, first, you need to use it, and then you will find some issue or some feature missing in this repo.

And then the first thing is you can raise a new issue here. Oh, it's loading. Yeah, you can just create a new issue, a feature request, or something like this. And also, I think, yeah, we have labels like good first issue or help wanted. Yeah, you can see that there are nearly 26.

So, I think, yeah, if you are interested in one of these issues, for example, if you are interested in supporting or serving the VILA model, you can just start with this. So, yeah. For the good first issue and help wanted issues, yeah, we welcome contributions.

And here is the development roadmap. So, yeah, if some feature is missing, or if there's some feature you care about and you can find it in the roadmap, I think you can join us for that feature's development. Or you can also raise a new issue about it. And the last part is the overall walkthrough of the repo.

Okay. So, yeah. In the SGLang repo, we have some components. This one is the SGLang kernel; it's the SGLang kernel library. We implement attention, normalization, activation, GEMM, all of them, in this kernel library. And if you are familiar with CUDA kernels, and if you are interested in kernel programming, you can just contribute to this part.

And here is the SGL Router. Last year, we shipped the router and we supported cache-aware routing. If you are interested in this part, you can work on the SGL Router. Currently, we use SGLang as an LLM inference runtime, so I think the Python part, SRT (the SGLang runtime), is the core part.

We support PD (prefill-decode) disaggregation. We support constrained decoding. We support function calling. Yeah, we support an OpenAI-compatible server. And we also support a lot of models. Yeah. I think if you want to support a custom model, you can just take these as a reference; for example, you can take Llama as a reference.

I think the popular open source models' architectures are very, very similar. So, if the model you are interested in has not been implemented in SGLang, you can just take this as a reference and make some modifications. And then, we welcome contributions. Yeah. That's all. Awesome. So, if we get the slides back up here.

Yeah. So, to, you know, wrap it up. First off, thank you so much for coming out. Thank you for bearing with us. Thank you for waiting for web pages to load on this wonderful internet connection that we all have. To kind of wrap things up, I do want to issue a couple invitations to everyone in this room today.

Number one, we're having a really cool happy hour with the folks from Oxen AI. Oxen AI is a fine-tuning company. Their CEO just had a really cool demo that he published a couple weeks ago where he took GPT-4.1 and made it, you know, do a SQL generation benchmark.

Took the score, said, okay, I think I can do better than this. Took Qwen 0.6B. Yes, you heard me right, less than a billion parameters. Fine-tuned it on some SQL generation data and actually beat GPT-4.1 with a model that you can run on, like, a three-year-old iPhone.

So, yeah, at this happy hour, we're gonna be talking about fine-tuning and stuff. It's gonna be a great time. The second invitation I want to extend to everyone in this room is: if you think this stuff is cool, if you're seeing all the stuff that Yineng was talking about around contributing to the code base, and you're like, yeah, I love CUDA programming, just come work at Baseten.

If you're bored in your job, you won't be bored here. We've got a lot of open roles for both infrastructure and for model performance. If you're at all interested, just come talk to me. I'm gonna be here all three days. So, yeah, that's pretty much our workshop today. Thank you so much for coming through and happy to take any questions in the remaining time we have.

Yes. What are the main reasons why you use this? Yeah, that's a great question. You know, I think that at Baseten, we use all sorts of different runtimes model to model. Sometimes you just want to use whichever one is best for your use case. But in general, I think that the reason we've been really attracted to it is because of how configurable and extensible it is.

Out of the box with basic parameters, you're gonna get more or less the same performance from any of them. But if you're able to, number one, have a really deep and well-documented code base like SGLang's, where you're able to really understand all the different options that you have, that can get you a long way.

And then, as we were just talking about, it's super easy to contribute. So, we're constantly, like, making fixes and contributing them back. And that means that, you know, if you're using a different library, you might be blocked waiting for the core developers to implement support for a model or something.

SGLang, you can unblock yourself. Yes. When there are multiple vendors and different kinds of applications around the endpoint, or within the subnet you are defining, how would you define your cybersecurity or security protocols? How would you enhance your protocols? Yeah. I mean, that's a great question. I don't really think that your choice of runtime engine affects that too much, because, you know, you're just packaging it up in a container.

You know, within Baseten, we've thought a lot about this in a sort of runtime-agnostic way, where we're thinking about, of course, least privilege. We're thinking about, you know, making sure that there's a good deal of isolation built into the system. But from a runtime perspective, I don't think there's anything special we have to do for security with SGLang, right, compared to, like, vLLM or anything else.

Thank you. So, I'm from the Department of Defense Center. Oh, awesome. So, extensive experience in financial applications. So, to do some product developments in-house, do you think I can do the entire product development in-house within a subnet? Yeah. You don't have to go back and forth. Right now, for example, just throwing an example.

Yeah. One of those CMMC cybersecurity certifications: I have to go through the endpoint controls and define the endpoint control, and then go connect to OpenAI and ChatGPT. Gotcha. Kind of. Yeah, yeah. So, in that case, this would actually help you out a lot. Instead of relying on that remote server, you can just spin up a cluster, like, within the same VPC or within the same physical data center as the workload that's relying on the AI model.

You can clone SGLang, you can take a release and fully inspect the code because it's open source, and then pin that release so that there's nothing changing under the hood. And then, yeah, with that, you'd be able to, you know, run the models just directly on the GPU.

As you saw in Yineng's demo when he was doing the CUDA graph stuff, you're able to, you know, call it even on a localhost basis and run inference. So, yeah, it gives you all the tools you need if you're trying to build even, like, a sort of air-gapped type of system with, you know, all of these open source runtimes.

You can pull that code in, inspect it, lock it, and then build off of it. Very impressive. And also, currently, I'm working on, I'm also a PhD student. Yeah. So, I'm working on blockchain-based and part-time computing and some kind of AI deliverables. So, how do you circumvent within your product?

Can you, so blockchain is completely another community-based code development. Yeah. So, how do you, can we integrate different community-based or a combination of both hybrid community-based protocol? Or, so, what is, because blockchain is kind of a decentralized network, whereas this one is kind of . Yeah. To be perfectly honest, like, I haven't really experienced anything with that.

Pretty much all of the use cases that I've run with SGLang are just traditional client-server applications. Any other questions? Yeah. [inaudible] Yeah. Great. So, yeah. So, at Baseten, what we do is we call it the Baseten inference stack, where we're taking all of these different providers, vLLM, SGLang, and TensorRT-LLM, which we actually probably use the most heavily of the three, and taking them in, customizing them, doing all that stuff I'm supposed to say for marketing purposes, but we are customizing it quite a bit.

Anyway, where we generally pick vLLM (I'm sorry, I'm talking about them during your SGLang talk), where we use vLLM is oftentimes for compatibility. For example, like, I know our Gemma models that we have up in the library are using vLLM because, like, it's what was supported when the model dropped.

So, yeah, that's, in my mind, like, the best use case for vLLM: super broad compatibility. Any other questions? Awesome. Well, like I said, we're going to be around all day, and I'm going to be at the Baseten booth for the next three days. So, if you have any questions about SGLang, model serving, model inference in general, or if you want one of them jobs I was talking about, we are hiring very aggressively.

So, definitely stop by the booth, hang out, grab one of these shirts, and, yeah, thank you so much for coming. Thank you so much for coming.