
Harnessing the Power of LLMs Locally: Mithun Hunsur


Chapters

0:00 Intro
0:41 Overview
2:43 Cost
3:57 Quantization
5:40 Why llm.rs
8:49 Community Projects
11:20 Real World Example
12:44 Benefits
16:19 Outro

Transcript

Good day, everyone. Good to see you all. Today, I'm here to tell you how to harness the power of local LLMs using our Rust library. Quick intro. I'm Mithun, as you just heard, but I go by Philpax online. I hail from Australia, hence the accent, but I live in Sweden.

I do a lot of things with computers, but my day job is at Ambient, where I build a game engine of the future. Today, though, I'm here to talk to you about llm.rs, a Rust library that I maintain. So, llm.rs, or llm between friends (I realised I'd have to disambiguate when I saw Simon's newsletter).

It's an all-in-one solution for local inference of LLMs. But what does that actually mean? Well, most of the models we've discussed in this conference have been cloud models. You have your ChatGPTs, your Claudes, your Bards. Local models offer another way, where you own the model and it runs on your computer.

So let's quickly go over what that actually means. First up, size. Model size can be used as a rough proxy for the intelligence of the model. Most of those models are really, really big. You can see them dominating the right-hand side of the chart there. You have your GPT-3, your GPT-4 (we'll get back to that), your Gopher, your PaLM 2.

These are all insanely big in comparison to the open-source models we have. We're beginning to see some bigger models thanks to LLaMA and Falcon, but even they pale in comparison to what the bigger players can do. This means the local models don't have the same capacity for intelligence. However, a smaller, more focused model may be able to solve problems better than a large general model.

By the way, we don't actually know what size GPT-4 is. That's just rumours; only OpenAI knows. Next, let's talk about speed and capacity. Cloud models run on specialised hardware with special configuration. Local models run on whatever hardware you can scrounge up, including rented hardware. The further up the axis you go, the more speed and/or parallel inference you can do, but the more inaccessible it becomes.

This end? A few hundred dollars. That end? A few hundred million dollars. Next up, latency. Cloud models need the full prompt before they can start inference, and you have to wait for the messages to go back and forth. Local models can give you a response immediately. You can feed the prompt in as you go along.

This is very important for conversations where you want the model to be able to process what you're saying as you say it. And of course, you can't escape talking about cost. The cloud vendors will charge you a per token price. When running locally, it's entirely up to you how much it costs you to run the machine.

If the running cost of your model is less than the cost of running your workload through the cloud, you're going to make a profit. And if you're running on a machine you already own, well, that's basically free, right? With the cloud, you have to use the models they offer you.

Some vendors offer fine-tuning, but they often charge more than just using the regular model, and they often charge you for the process of fine-tuning. This means it's not often cost-effective to actually do that. With local models, the sky's the limit. There are hundreds, potentially thousands, of custom models that can suit any need you have.

Knowledge retrieval, storytelling, conversation, tool use, you name it, someone's already done it. But if they haven't, fine-tuning an existing model for your own use is easy enough. Special shout-out to Axolotl over there, which makes it easy to fine-tune models of any architecture. And of course, privacy. There are some questions you don't want to ask the internet.

Local models let you privately embarrass yourself. Now, you might be wondering how it's actually possible to run these models locally. That, my friends, is possible with the power of quantisation. If each model is billions of parameters, and those parameters are like individual numbers, how could you possibly run them on consumer hardware when there's only so much memory available for a given performance level?

Well, we can use quantisation. Quantisation lets you lossily compress a model while maintaining the majority of its smarts. We can take the original model here in blue and squish it down to something much smaller using one of these green formats. This is the secret sauce that makes it viable to run models locally.
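
To give a feel for the idea, here is a toy sketch of block quantisation in Rust. This is not the actual GGML quantisation format the library uses; it just illustrates the principle of storing a per-block scale plus small integers instead of full 32-bit floats.

```rust
/// A toy block-quantisation scheme (not the real GGML/GGUF formats):
/// each block of 32 f32 weights is stored as one f32 scale plus 32 i8 values,
/// shrinking ~128 bytes of weights down to ~36 bytes.
struct QuantBlock {
    scale: f32,
    values: [i8; 32],
}

fn quantize_block(weights: &[f32; 32]) -> QuantBlock {
    // Scale so the largest-magnitude weight maps to roughly ±127.
    let max = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max == 0.0 { 1.0 } else { max / 127.0 };
    let mut values = [0i8; 32];
    for (v, w) in values.iter_mut().zip(weights) {
        *v = (w / scale).round() as i8;
    }
    QuantBlock { scale, values }
}

fn dequantize_block(block: &QuantBlock) -> [f32; 32] {
    // Lossy: we only recover an approximation of the original weights.
    let mut out = [0.0f32; 32];
    for (o, v) in out.iter_mut().zip(&block.values) {
        *o = *v as f32 * block.scale;
    }
    out
}
```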

Smaller models aren't just easier to store. They can also run faster as a computer can process more of the model at any given moment. But that's enough about local models. You've probably already heard much of that already. Let's talk about the actual library. It all started with this man who built something you may have heard of.

Of course, I'm referring to llama.cpp, and that's what it looked like on day one. Look at that: a mere 98 stars. How pedestrian compared to today, where it has 42,000 stars. But let's go back to March, when I first saw it. When I saw it, I had but one idea.

It's time to rewrite it in Rust, both for the meme and because I wanted to use it for other things. Well, I said I wanted to do it, and I did. But, to the right here, setzer22 was also working on the same problem. And, well, there was just one catch.

He beat me to it. He beat me to it. Completely beat me to it. I'm not afraid to admit it. Luckily, we came together, merged our projects, and I ended up as the maintainer of the resulting project, and that's how llm was born. So you might be wondering why.

If llama.cpp exists, why use llm.rs? Well, with llm.rs, I had six principles in mind. It must be a library. When I first started in March, llama.cpp was not a library. It was an application, and that made it impossible to reuse. It must not be coupled to an application.

You must be able to customise its behaviour, you must be able to go in and change every little bit of it to make it work for your application, and we shouldn't make any assumptions about how it's going to be used. It should support a multitude of model architectures. Of course, llama.cpp supports LLaMA and now Falcon, but clearly there are more out there.

Next up, it should be Rust native. It should feel like using a Rust library. It should not feel like using a library with bindings, and it should work how you expect a Rust library to work. Next up, backends. It should support all possible kinds of backends: you can run on your CPU, your GPU, or, of course, your ML-powered toaster.

I'm sure that's going to be a thing, and we're going to see it coming, I swear. And finally, platforms. It should work the same whether it's on Windows, Linux, macOS, or something else. You shouldn't have to change it significantly to make it work, because deployment has always been an issue.

Today, I'm proud to say we support a myriad of architectures, including the darlings of the movement, LLaMA and Falcon. These architectures all use the same interface, so you don't have to worry about changing your code to use a different model. This is made possible by the concerted efforts of our core contributors, Lucas and Dan, who I couldn't have done this without, as well as many others.

Here's some sample code for the library. I won't go too much into it, because it's quite dense. But the idea is that you load a model right there at the top (the code for that is actually quite small), and with that model, you create sessions, which track an ongoing use of the model.

You can have as many of these as you would like, but they do have a memory cost, so you want to be careful. Once you have a session, you can pass a prompt in and infer with the model to determine what comes next. You can keep reusing the same session, which is very useful for conversation.

You don't need to keep refeeding the context. The last argument of the function is the callback. That's where you actually get the tokens out. It's worth noting that the function itself is actually a helper. All it does is call the model in a loop with some boundary conditions. If you want to change the logic in some significant way, you can. We're not going to stop you from doing that.
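
For reference, here is a sketch of that flow in code. It follows the llm crate's published examples from memory, so the exact type and function names may differ between versions; the model path is a placeholder, and the rand crate is assumed for the randomness source.

```rust
use std::io::Write;

fn main() {
    // Load a model from disk. The `Default::default()` arguments are the
    // customisation points mentioned below (model parameters, etc.).
    let model = llm::load::<llm::models::Llama>(
        std::path::Path::new("/path/to/model.bin"),
        llm::TokenizerSource::Embedded,
        Default::default(),
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("failed to load model: {err}"));

    // A session tracks one ongoing use of the model; you can have several,
    // at a memory cost, and reuse them to avoid refeeding context.
    let mut session = model.start_session(Default::default());

    // Feed a prompt in and stream tokens out through the callback.
    let result = session.infer::<std::convert::Infallible>(
        &model,
        &mut rand::thread_rng(),
        &llm::InferenceRequest {
            prompt: "Rust is a cool programming language because".into(),
            parameters: &llm::InferenceParameters::default(),
            play_back_previous_tokens: false,
            maximum_token_count: None,
        },
        &mut Default::default(), // output request
        |response| match response {
            llm::InferenceResponse::InferredToken(t) => {
                print!("{t}");
                std::io::stdout().flush().unwrap();
                Ok(llm::InferenceFeedback::Continue)
            }
            _ => Ok(llm::InferenceFeedback::Continue),
        },
    );

    if let Err(err) = result {
        eprintln!("inference failed: {err}");
    }
}
```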

One last thing about this, though. You see all the calls to default there? Those are all customisation points. You can change pretty much anything about this. You can change how the model is loaded. You can change how it will do the inference.

You can change how it will sample. The entire point is you have the control you need to make the thing you need work. Here's a quick demo of the library working with LLaMA 7B on my MacBook's CPU. It's reasonably fast, but it could be faster, right? Well, thanks to the power of GPU acceleration, we have something that's much more usable.

And believe me, it's even faster on NVIDIA GPUs. Now let's talk about what you can actually do with the library. Let's start with three community projects to begin with. First, we've got local.ai. local.ai is a simple app that you can install to do inference locally. There's nothing magical about it.

It's just exactly what it says. I think that's really wonderful, because it means anyone can download this app and be able to use local models without having to think about it. Next up, llm-chain. It's LangChain, but for Rust. And of course, it supports inference with the library. And finally, we have Floneum, which is a flowchart-based application where you can build your own workflows.

I think we've seen a few of those at this conference. And you can combine and create nodes to build the workflow you need. And of course, it supports the library as an inference engine. Now I wouldn't be a very good library author if I didn't actually test my own library.

So I'm going to go through three applications. The first two are proofs of concept. The first is llmcord. It's a Discord bot. You can see it's exactly what you'd expect: you give it a prompt, it will give you a response. Any hitches you see come from Discord's limits, not from the actual inference itself.

You can see... Bam! All there. When a user issues a request for generation, it goes through this process here: the request goes to a generation thread through a channel. That channel is then used to create a response task, and that response task is responsible for sending the responses to the user.
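
Here is a minimal sketch of that thread-and-channel pattern, using only the standard library. The real bot is more involved and talks to Discord's API; the names here and the echo "generation" are stand-ins for the actual inference call.

```rust
use std::sync::mpsc;
use std::thread;

/// A request for generation, plus a channel to stream tokens back on.
struct GenerationRequest {
    prompt: String,
    tokens: mpsc::Sender<String>,
}

fn main() {
    // The generation thread owns the model and serialises all inference.
    let (request_tx, request_rx) = mpsc::channel::<GenerationRequest>();
    let generation_thread = thread::spawn(move || {
        for request in request_rx {
            // In the real bot, a fresh llm session would be created here and
            // the model would stream tokens; we just echo the prompt instead.
            for word in request.prompt.split_whitespace() {
                if request.tokens.send(format!("{word} ")).is_err() {
                    break; // the response task went away; stop generating
                }
            }
        }
    });

    // A "response task": submits a request and forwards tokens to the user
    // (in the real bot, by editing a Discord message as tokens arrive).
    let (token_tx, token_rx) = mpsc::channel();
    request_tx
        .send(GenerationRequest {
            prompt: "tell me about local models".into(),
            tokens: token_tx,
        })
        .unwrap();
    for token in token_rx {
        print!("{token}");
    }

    drop(request_tx);
    generation_thread.join().unwrap();
}
```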

Now, the interesting thing is these sessions are created and thrown away immediately with each query. But you don't need to do that. If you keep them around, you can actually use them for conversation. And just to illustrate, this is just like the request response workflow you would use for anything.

If I just take what I had there, drop the Discord bit and add in HTTP, you can see: request, generation, response. Easy. Next up: Alpa. I love using GitHub Copilot, but it's only available in my code editor and it requires an internet connection. Alpa is my attempt to solve this.

It does autocomplete anywhere in your system, just by taking what's to the left of your cursor and passing it to a model to type out a completion. And, of course, you can use any model, including a model fine-tuned on my own writing. Ask me how I know. Alpa is also quite simple.

In fact, it is so simple I don't really need to cover it. Listen for input, copy the input into a prompt, start generating, type out response. Easy. Now, the first two examples are pretty simple. They are proofs of concept. But now I want to talk about an actual use case.

This is a real-world data extraction task. Over the last few years, I have been working on a project to make a timeline from the dates on Wikipedia, because there are millions of pages and they all have dates, and you can build a world history from it. However, these dates are often unstructured and more or less impossible to parse using traditional means.

Like, yes, you can try using regex to extract the dates, but you can't get the context out in any meaningful sense, and there are some dates here that don't make any sense at all. So that's why, as is the theme of this conference, I threw a large language model at it.

However, GPT-3 and 4 aren't perfect, even after rounds of prompt engineering (you can see I tried here), and handling millions of dates with them is just too expensive and slow. So I decided I'd fine-tune my own model. I generated a representative dataset using GPT-3, built a tool to go through the dataset, pick out any data point, fix it up and correct the errors, built a new dataset, and trained a new model.

So I did that using Axolotl, which I mentioned earlier. Again, check out Axolotl for all your fine-tuning needs. Highly recommended. And now I have a small, fast, consistent model that I can pass any data to - sorry, any date to - and get back a structured representation, which I can, of course, immediately parse using Rust. And I can treat that as a black box. So I have a function there, fn parse: pass in some dates, get some dates back. Simple.
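
As a rough sketch of what that boundary could look like in Rust, assuming the fine-tuned model is prompted to emit JSON and serde/serde_json handle the deserialisation: the ParsedDate type and the run_model closure are hypothetical stand-ins, not the project's actual code.

```rust
use serde::Deserialize;

/// A hypothetical structured form for an extracted date; the real
/// project's schema will differ.
#[derive(Debug, Deserialize)]
struct ParsedDate {
    year: i32,
    month: Option<u8>,
    day: Option<u8>,
    approximate: bool,
}

/// Treat the fine-tuned model as a black box: unstructured text in,
/// structured dates out. `run_model` stands in for the actual inference call.
fn parse_dates(
    raw: &str,
    run_model: impl Fn(&str) -> String,
) -> Result<Vec<ParsedDate>, serde_json::Error> {
    let prompt = format!("Extract the dates from: {raw}");
    // Expected model output, e.g.:
    // [{"year": 1969, "month": 7, "day": 20, "approximate": false}]
    let output = run_model(&prompt);
    serde_json::from_str(&output)
}
```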

Now, let's quickly talk about the benefits of using local models and the library. First off, deployment. Show of hands, who's had to deal with Python deployment hell?

I can't even see you all. Yeah. Yeah, I know. It's awful. You spend hours just trying to sort out your conda, your pip, your pipenv. It's awful. With the library, you inherit Rust's excellent cross-platform support and build system, making it easier to ship self-contained binaries for your platform. No more making your users install Torch.

As you might imagine, this unlocks the use of desktop applications with models. Next up, the ecosystem. Rust has one of the strongest ecosystems of any native language. You can combine these libraries with LLMs to build all kinds of things. It's what let me build a Discord bot, a system-wide autocompletion utility, a data ingestion pipeline, and a dataset explorer utility, all in the same language.

And I think if you use llm.rs, you can do the same thing with your task as well. Of course, you also have control over how the model generates. I alluded to this earlier, but you can choose exactly how it samples tokens. Normally, when you use a cloud model, you can get back the logits, the probabilities, but those are limited. You have to keep going back and forth, and that's slow and expensive. With this, you can directly control how you sample.
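
As a toy illustration of that control, here is a generic top-k and temperature sampler written against raw logits. It is not the library's built-in sampler code, just the kind of logic you are free to swap in (assuming the rand crate).

```rust
use rand::distributions::{Distribution, WeightedIndex};

/// A toy top-k + temperature sampler over raw logits. Returns a token id.
fn sample_token(logits: &[f32], top_k: usize, temperature: f32) -> usize {
    // Rank token ids by logit and keep only the top k.
    let mut ranked: Vec<usize> = (0..logits.len()).collect();
    ranked.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    ranked.truncate(top_k.max(1));

    // Softmax over the survivors, with temperature scaling.
    let max_logit = logits[ranked[0]];
    let weights: Vec<f32> = ranked
        .iter()
        .map(|&i| ((logits[i] - max_logit) / temperature).exp())
        .collect();

    // Draw a token id according to those probabilities.
    let dist = WeightedIndex::new(&weights).expect("weights must be positive");
    ranked[dist.sample(&mut rand::thread_rng())]
}
```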

Finally, let's talk about the innovation in the space. If you're here, you probably know there's a paper almost every single day. It's impossible to keep up with.

Trust me. I've tried. But the use of local models means you can try this out before anyone else can. You can go through, you can try out some of these papers, and you can be like, oh, wow, that's actually a worthwhile improvement. And eventually, the cloud providers will provide them, but in the meantime, the control remains with you.

However, it's time to talk about the problems. There ain't no such thing as a free lunch, except at a conference, of course. Let's talk about hardware again. I mentioned earlier that you can pretty much run these things on almost any hardware, but that's kind of a lie. You still need some kind of power.

You can only get so much out of your 10-year-old computer, your smartphone, or your Raspberry Pi. We're finding clever ways to improve this, like smaller models and better inferencing, but it's still something to be aware of. Next, as with all things, the fast-cheap-good trade-off applies. You can make all kinds of trade-offs here, and you see I've listed a couple of them, but fundamentally, you have to choose: what are you willing to sacrifice in order to serve your application?

Are you willing to go for a bigger model to get better-quality results at the cost of speed? These are all decisions you have to make, and they're not always obvious. It's something you have to think about. Next, there's no other way of putting this: the ecosystem churns.

Innovation is a double-edged sword. When those changes come in, they can often break your existing workflows. I've helped alleviate this, to some extent, with the GGUF file format, which helps standardise things, but it's still a problem. Some days, you will just wake up, try your application with a new model, and it just won't work.

There's nothing you can do except deal with it. Finally, a lot of the models in this space aren't really open source. They're free for personal use, but they have very strange clauses and exceptions. For most of us, this doesn't matter; you can just use the model personally. But it's a reminder that even though these models are free, they're not capital-F Free. Luckily, there's been some recent change in the space, with Mistral and StableLM giving you strong performance at a small size while being completely unburdened. But it's still a problem, and they're still much smaller than the big ones, like LLaMA and Falcon.

Unfortunately, I've got to wrap things up here. There's only so much you can talk about in 18 minutes, I'm afraid. Local models are great, and I'd like to think our library is too. They're getting easier to run day by day with smaller, more powerful models. However, the situation isn't perfect, and there isn't always one obvious solution for your problem.

Thanks for listening. You can contact me by email or on Mastodon. The library can be found at, you guessed it, llm.rs, or by scanning the QR code. Finally, we're always looking for contributors, so if you're interested in LLMs or Rust, feel free to reach out. Sponsorships are also very welcome, because they help me try out new hardware, which is always necessary. And if you want to chat in person, I'll be hanging around the conference.

I'll see you later. Thank you.