Harnessing the Power of LLMs Locally: Mithun Hunsur

Chapters
0:00 Intro
0:41 Overview
2:43 Cost
3:57 Quantization
5:40 Why llm.rs
8:49 Community Projects
11:20 Real World Example
12:44 Benefits
16:19 Outro
Today, I'm here to tell you how to harness the power of local LLMs using our Rust library.

Quick intro: I'm Mithun, as you just heard, but I go by Philpax online. I hail from Australia, hence the accent, but I live in Sweden. I do a lot of things with computers, but my day job is at Ambient, where I build a game engine of the future. Today, though, I'm here to talk to you about llm.rs, a Rust library that I maintain.

So, llm.rs, or llm between friends (a name I realised I'd have to disambiguate after Simon's newsletter). It's a one-stop solution for local inference of LLMs, but what does that actually mean?
Well, most of the models we've discussed at this conference have been cloud models. Local models offer another way, where you own the model and it runs on your computer. So let's quickly go over what that actually means.

Model size can be used as a rough proxy for the intelligence of a model, and you can see the cloud models dominating the right-hand side of the chart there.
You have your GPT-3, your GPT-4 (we'll get back to that), your Gopher, your PaLM 2.
These are all insanely big in comparison to the open-source models we have. We're beginning to see some bigger models thanks to Llama and Falcon, but even they pale in comparison to the cloud models. This means local models don't have the same capacity for intelligence. However, a smaller, more focused model may be able to solve your problem better than a large, general-purpose one. By the way, we don't actually know what size GPT-4 is.
Cloud models run on specialised hardware with special configuration. Local models run on whatever hardware you can scrounge up, including rented hardware. The further up the axis you go, the more speed and/or parallel inference you get, but the more the hardware costs.

Cloud models need the full prompt before they can start inference, and you have to wait for the response to come back. Local models can give you a response immediately. This is very important for conversations, where you want the model to start processing what the user says as soon as they say it.
And of course, you can't escape talking about cost. The cloud vendors will charge you a per-token price. When running locally, it's entirely up to you how much it costs to run the machine. If the running cost of your model is less than the cost of running your workload through a cloud API, running locally works out cheaper. And if you're running on a machine you already own, well, that's basically free, right?
With the cloud, you have to use the models they offer you. Some vendors offer fine-tuning, but they often charge more than for the regular model, and they often charge you for the fine-tuning process itself. This means it's not often cost-effective to actually do that.

Locally, there are hundreds, potentially thousands, of custom models that can suit any need you have. Knowledge retrieval, storytelling, conversation, tool use: you name it, someone's already done it. And if they haven't, fine-tuning an existing model for your own use is easy enough. Special shout-out to Axolotl over there, which makes it easy to fine-tune models of any architecture.
There are some questions you don't want to ask the internet. Local models let you privately embarrass yourself.
Now, you might be wondering how it's actually possible to run these models locally. That, my friends, is possible with the power of quantisation. If each model is billions of parameters, and those parameters are like individual numbers, how could you possibly run them on consumer hardware, where there's only so much memory available?

Quantisation lets you lossily compress a model while maintaining most of its smarts. We can take the original model, here in blue, and squish it down to something much smaller. This is the secret sauce that makes it viable to run models locally. Quantised models can also run faster, as the computer can process more of the model at any given moment.
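To make the memory arithmetic concrete: a 7B-parameter model stored as 32-bit floats needs roughly 28 GB, but at around 4 bits per weight it drops to roughly 3.5 GB plus some overhead. Below is a minimal, illustrative sketch of blockwise 4-bit quantisation in Rust. It is not the actual scheme the library or GGML uses (real formats store extra per-block metadata and pack two weights per byte); it only shows the core idea of keeping one scale per block and a small integer per weight.

```rust
/// Illustrative blockwise 4-bit quantisation: each block of 32 weights is
/// stored as one f32 scale plus one 4-bit signed integer per weight.
struct QuantBlock {
    scale: f32,
    /// Quantised weights in the range -8..=7 (stored one per byte here for
    /// clarity; a real format would pack two per byte).
    q: [i8; 32],
}

fn quantise_block(weights: &[f32; 32]) -> QuantBlock {
    // Pick a scale so the largest-magnitude weight maps onto the int range.
    let max_abs = weights.iter().fold(0.0_f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let mut q = [0i8; 32];
    for (i, w) in weights.iter().enumerate() {
        q[i] = (w / scale).round().clamp(-8.0, 7.0) as i8;
    }
    QuantBlock { scale, q }
}

fn dequantise_block(block: &QuantBlock) -> [f32; 32] {
    // Lossy reconstruction: each weight is scale * quantised value.
    let mut out = [0.0f32; 32];
    for (i, q) in block.q.iter().enumerate() {
        out[i] = block.scale * (*q as f32);
    }
    out
}

fn main() {
    let weights = [0.37f32; 32];
    let block = quantise_block(&weights);
    let restored = dequantise_block(&block);
    // The restored weights are close to, but not exactly, the originals.
    println!("original {} ~ restored {}", weights[0], restored[0]);
}
```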
You've probably heard much of this story already. It all started with this man, who built something you may have heard of. Of course, I'm referring to llama.cpp, and that's what it looked like on day one. How pedestrian compared to today, where it has 42,000 stars.

But let's go back to March, when I first saw it. I decided it was time to rewrite it in Rust, both for the meme and because I wanted to use it for other things. But, to the right here, setzer22 was also working on the same problem. Luckily, we came together, merged our projects, and I ended up as the maintainer of the resulting library.
Well, with llm.rs, I had six principles in mind.

When I first started in March, llama.cpp was not a library. It was an application, and that made it impossible to reuse. You must be able to customise this behaviour, you must be able to go in and change every little bit of it to make it work for your application, and we shouldn't make any assumptions about your use case.

It should support a multitude of model architectures. Of course, llama.cpp supports Llama and now Falcon, but clearly there are more out there.

It should not feel like using a library with bindings; it should work the way you expect a Rust library to work.

It should support all possible kinds of backends: you can run on your CPU, your GPU, or, of course, whatever comes next. I'm sure that's going to be a thing, and we're going to see it coming, I swear.

It should work the same whether it's on Windows, Linux, macOS, or something else. And you shouldn't have to change it significantly to make it work, because deployment has always been a pain.
Today, I'm proud to say we support a myriad of architectures, including the darlings of the open-source scene. These architectures all use the same interface, so you don't have to worry about changing your code. This is made possible by the concerted efforts of code contributors Lucas and Dan, who I couldn't have done this without, as well as many others.
I won't go too much into this, because it's quite dense. But the idea is that you load a model, right there at the top, because it's actually quite small, and with that model you create sessions, which track an ongoing use of the model. You can have as many of these as you would like, but they do have a memory cost, so keep that in mind.

Once you have a session, you can pass a prompt in and infer with the model to determine what comes next. You can keep reusing the same session, which is very useful for conversation: you don't need to keep re-feeding the context. The last argument of the function is the callback; that's where you actually get the tokens out.

It's worth noting that the function itself is actually a helper. All it does is call the model in a loop with some boundary conditions. If you want to change the logic in some significant way, you can; you can change pretty much anything about this. The entire point is that you have the control you need to make the thing you need work.
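As a rough sketch of the flow just described, here's the shape of it in Rust: load once, create sessions cheaply, and drive generation with a callback. The types and method names below are hypothetical stand-ins, not the library's actual API (check the llm.rs docs for that); the simplified `infer` helper is just there to show the "call the model in a loop with boundary conditions" idea, and the toy model emits canned tokens.

```rust
// Hypothetical stand-ins for the real types; the actual llm.rs API differs.
struct Model {}
struct Session {
    tokens_fed: usize,
}

enum Feedback {
    Continue,
    Halt,
}

impl Model {
    fn load(_path: &str) -> Model {
        // In the real library this loads the (quantised) weights from disk.
        Model {}
    }

    fn start_session(&self) -> Session {
        // Each session tracks its own context, which is why it has a memory cost.
        Session { tokens_fed: 0 }
    }

    fn next_token_stub(&self, i: usize) -> String {
        // A real implementation would run the model forward and sample a token.
        format!(" token{i}")
    }
}

impl Session {
    /// Helper that feeds the prompt, then produces tokens in a loop until a
    /// boundary condition (token limit, or the callback saying "stop") is hit.
    fn infer(
        &mut self,
        model: &Model,
        prompt: &str,
        max_tokens: usize,
        mut callback: impl FnMut(&str) -> Feedback,
    ) {
        self.tokens_fed += prompt.split_whitespace().count();
        for i in 0..max_tokens {
            let token = model.next_token_stub(i);
            self.tokens_fed += 1;
            if let Feedback::Halt = callback(&token) {
                break;
            }
        }
    }
}

fn main() {
    let model = Model::load("path/to/model.gguf");
    let mut session = model.start_session();
    // Reuse the session across turns to avoid re-feeding the conversation.
    session.infer(&model, "Hello, llm.rs!", 8, |t| {
        print!("{t}");
        Feedback::Continue
    });
    println!();
}
```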
Here's a quick demo of the library working with Llama 7B on my MacBook's CPU. It's reasonably fast, but it could be faster, right? Well, thanks to the power of GPU acceleration, we have something that's much more usable. And believe me, it's even faster on NVIDIA GPUs.
Now let's talk about what you can actually do with the library, starting with three community projects.

local.ai is a simple app that you can install to do inference locally. I think that's really wonderful, because it means anyone can download this app and use local models without having to think about it. And of course, it supports inference with the library.

And finally, we have Floneum, a flowchart-based application where you can build your own LLM workflows; I think we've seen a few of those at this conference. You can combine and create nodes to build the workflow you need, and of course, it supports the library as an inference engine.
Now, I wouldn't be a very good library author if I didn't actually test my own library, so I'm going to go through three applications of my own. The first is a Discord bot; any hitches you see come from Discord's rate limits, not from the actual inference itself.

When a request for generation comes in, it goes through the process here: the request is sent to a generation thread through a channel. That channel is then used to create a response task, and that response task is responsible for sending the responses back to the user. Now, the interesting thing is that these sessions are created and thrown away immediately with each request. If you keep them around, you can actually use them for conversation.

And just to illustrate, this is just like the request-response workflow you would use for a web service: if I take what I had there, drop the Discord bit and add in HTTP, you get the same request and generation flow.
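Here's a small sketch of that channel-based shape using std threads and mpsc channels. It's not the bot's actual code (a real Discord bot would use an async runtime and the model would live behind the generation thread); the generation step is stubbed, but the request channel, generation thread, and response task mirror the flow described above.

```rust
use std::sync::mpsc;
use std::thread;

// A generation request: the prompt plus a channel to stream tokens back on.
struct GenRequest {
    prompt: String,
    respond_to: mpsc::Sender<String>,
}

fn main() {
    // Channel that carries requests into the single generation thread.
    let (req_tx, req_rx) = mpsc::channel::<GenRequest>();

    // Generation thread: owns the model, processes one request at a time.
    let gen_thread = thread::spawn(move || {
        for req in req_rx {
            // A real bot would create a session and run inference here;
            // we fake a few tokens per request.
            for word in ["You", " said:", " ", req.prompt.as_str()] {
                if req.respond_to.send(word.to_string()).is_err() {
                    break; // The response task went away; stop generating.
                }
            }
        }
    });

    // Submit one request, handing over a sender for the tokens.
    let (tok_tx, tok_rx) = mpsc::channel::<String>();
    req_tx
        .send(GenRequest { prompt: "hello".into(), respond_to: tok_tx })
        .unwrap();

    // "Response task": in the real bot this would edit a Discord message
    // (or write an HTTP response) as tokens arrive.
    let responder = thread::spawn(move || {
        let mut message = String::new();
        for token in tok_rx {
            message.push_str(&token);
            println!("message so far: {message}");
        }
    });

    drop(req_tx); // No more requests; lets the generation thread exit.
    gen_thread.join().unwrap();
    responder.join().unwrap();
}
```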
The second application is a system-wide autocomplete. Completion inside my code editor is great, but it's only available there and it requires an internet connection. This is autocomplete anywhere in your system: it takes what's to the left of your cursor and passes it to the model. And, of course, you can use any model, including one fine-tuned on my own writing. In fact, it's so simple I don't really need to cover it: listen for input, copy the input into a prompt, start generating, type out the response.
Now, the first two examples are pretty simple, but now I want to talk about an actual use case. Over the last few years, I have been working on a project to make a timeline from the dates on Wikipedia, because there are millions of pages, they all have dates, and you can build a timeline out of them. However, these dates are often unstructured and more or less impossible to parse using traditional means. Yes, you can try using regexes to extract the dates, but you can't get the context out in any meaningful sense, and there are some dates here that don't make any sense at all.
So that's why, as is the theme of this conference, I threw a large language model at it. But even after rounds of prompt engineering, and you can see I tried here, the results weren't consistent, and handling millions of dates through a cloud model is just too expensive and slow.

So I generated a representative data set using GPT-3, built a tool to go through that data set (pick out a data point, fix it up, correct the errors), built a new data set, and fine-tuned a model on it. I did that using Axolotl, which I mentioned earlier; again, check out Axolotl for all your fine-tuning needs.

And now I have a small, fast, consistent model that I can pass any data to (sorry, any date to) and get back a structured representation, which I can, of course, immediately parse on the Rust side. So I have a function there, fn parse: pass some dates in, get structured dates back. Simple.
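To make that last step concrete, here's a minimal sketch of what such a parse boundary could look like. The `ParsedDate` fields, the prompt wording, and the `run_model` stub are all assumptions for illustration (the project's real schema and prompt aren't shown in the talk); the point is just that a fine-tuned model emitting a structured format can be deserialised directly. It assumes `serde` (with the derive feature) and `serde_json` as dependencies.

```rust
use serde::Deserialize;

/// Hypothetical structured form of a Wikipedia date; the real project's
/// schema will differ.
#[derive(Debug, Deserialize)]
struct ParsedDate {
    year: i32,
    month: Option<u8>,
    day: Option<u8>,
    /// Free-text qualifier such as "circa" or "before", if any.
    qualifier: Option<String>,
}

/// Stand-in for running the fine-tuned local model; in the real pipeline
/// this would feed the prompt through an inference session.
fn run_model(prompt: &str) -> String {
    let _ = prompt;
    // Pretend the model replied with structured JSON for "c. 44 BC".
    r#"{"year": -44, "month": null, "day": null, "qualifier": "circa"}"#.to_string()
}

fn parse(raw_date: &str) -> Result<ParsedDate, serde_json::Error> {
    // The fine-tuned model is prompted to answer with JSON only.
    let prompt = format!("Convert this date to JSON: {raw_date}");
    let reply = run_model(&prompt);
    serde_json::from_str(&reply)
}

fn main() {
    match parse("c. 44 BC") {
        Ok(date) => println!("{date:?}"),
        Err(err) => eprintln!("model output was not valid JSON: {err}"),
    }
}
```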
Now, let's quickly talk about the benefits of using local models and the library.

Show of hands: who's had to deal with Python deployment hell? You spend hours just trying to sort out your conda, your pip, your pipenv; it's awful. With the library, you inherit Rust's excellent cross-platform support and build system, making it easier to ship a self-contained binary for your platform. No more making your users install a Python environment. As you might imagine, this unlocks the use of desktop applications with models.

Rust also has one of the strongest ecosystems of any native language, and you can combine those libraries with LLMs to build all kinds of things. It's what let me build a Discord bot, a system-wide autocompletion utility, and a data ingestion pipeline with a data set explorer, all in the same language. And I think if you use llm.rs, you can do the same thing with your tasks as well.
Of course, you also have control over how the model generates. I alluded to this earlier, but you can choose exactly how it samples tokens. Normally, when you use a cloud model, you have to get the logits, the probabilities, back over the API, and you have to keep going back and forth, and that's slow and expensive. With this, you directly control what you are sampling.
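Here's a small, self-contained sketch of what "controlling the sampling yourself" can mean once you have direct access to the logits. It isn't the library's sampler; it just shows a temperature-plus-top-k scheme over a raw logit vector, with the random draw passed in by the caller so the example stays dependency-free.

```rust
/// Pick a token id from raw logits using temperature scaling and top-k
/// filtering. `uniform` is a random number in [0, 1) supplied by the caller.
fn sample_token(logits: &[f32], temperature: f32, top_k: usize, uniform: f32) -> usize {
    // Sort candidate token ids by logit, descending, and keep the top k.
    let mut ids: Vec<usize> = (0..logits.len()).collect();
    ids.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    ids.truncate(top_k.max(1));

    // Softmax over the kept logits, scaled by temperature.
    let max_logit = logits[ids[0]];
    let weights: Vec<f32> = ids
        .iter()
        .map(|&i| ((logits[i] - max_logit) / temperature).exp())
        .collect();
    let total: f32 = weights.iter().sum();

    // Walk the cumulative distribution until we pass the random draw.
    let mut acc = 0.0;
    for (&id, &w) in ids.iter().zip(&weights) {
        acc += w / total;
        if uniform < acc {
            return id;
        }
    }
    *ids.last().unwrap()
}

fn main() {
    // Toy logits for a six-token vocabulary.
    let logits = [1.0f32, 3.5, 0.2, 2.9, -1.0, 0.5];
    // Lower temperature concentrates probability on the highest logits.
    let id = sample_token(&logits, 0.8, 3, 0.42);
    println!("sampled token id: {id}");
}
```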
Finally, let's talk about the innovation in the space. If you're here, you probably know there's a new paper almost every single day. The use of local models means you can try these things out before anyone else can, and go, oh wow, that's actually a worthwhile improvement. Eventually, the cloud providers will offer them too, but in the meantime, the control remains in your hands.
However, it's time to talk about the problems. There ain't no such thing as a free lunch, except at a conference, of course.

I mentioned earlier that you can run these things on pretty much any hardware, but that's not entirely true. You can only get so much out of your ten-year-old computer, your smartphone, or your Raspberry Pi. We're finding clever ways to improve this, like smaller models and better inference, but there are limits.

Next, as with all things, the fast, cheap, good trade-off applies. You can make all kinds of trade-offs here, and you can see I've listed a couple of them, but fundamentally you have to choose what you are willing to sacrifice in order to serve your use case. Are you willing to go for a bigger model to get better quality? These are all decisions you have to make, and they're not always obvious.
The ecosystem also moves quickly, and when those changes come in, they can often break your existing workflows. I've helped alleviate this, to some extent, with the GGUF file format, which helps standardise how models are stored, but it's not a complete fix. Some days you will just wake up, try your application with a new model, and it just won't work, and there's nothing you can do except deal with it.
Finally, a lot of the models in this space are open source. They're free for personal use, but they have very strange clauses and exceptions. You can use them personally just fine, but it's a reminder that even though these models are free, they're not capital-F Free. Luckily, there's been some recent change in the space, with Mistral and StableLM giving you strong performance at a small size while being completely unencumbered. But it's still a problem, and they're still much smaller than the big cloud models.
Unfortunately, I've got to wrap things up here; there's only so much you can talk about in 18 minutes, I'm afraid. Local models are great, and I'd like to think our library is too. They're getting easier to run day by day, with smaller, more powerful models. However, the situation isn't perfect, and there isn't always one obvious solution for your problem.

The library can be found at, you guessed it, llm.rs, or by scanning the QR code. Finally, we're always looking for contributors, so if you're interested in LLMs or Rust, feel free to get involved. Sponsorships are also very welcome, because they help me try out new hardware, which is always necessary. And if you want to chat in person, I'll be hanging around the conference.