
Harnessing the Power of LLMs Locally: Mithun Hunsur


Chapters

0:00 Intro
0:41 Overview
2:43 Cost
3:57 Quantization
5:40 Why LLM.RS
8:49 Community Projects
11:20 Real World Example
12:44 Benefits
16:19 Outro


00:00:14.040 | Good day, everyone. Good to see you all.
00:00:16.880 | Today, I'm here to tell you how to harness the power of local LLMs using our Rust library.
00:00:23.400 | Quick intro. I'm Mithun, as you just heard, but I go by Philpax online.
00:00:27.680 | I hail from Australia, hence the accent, but I live in Sweden.
00:00:30.960 | I do a lot of things for computers, but my day job is at Ambient where I build a game engine of the future.
00:00:36.040 | Today, though, I'm here to talk to you about LLM.RS, a Rust library that I maintain.
00:00:42.120 | So, LLM.RS, or LLM between friends; I realised I'd have to disambiguate once it started showing up in Simon's newsletter.
00:00:50.120 | It's a one-stop solution for local inference of LLMs, but what does that actually mean?
00:00:55.360 | Well, most of the models we've discussed in this conference have been cloud models.
00:00:59.120 | You have your ChatGPTs, your Claudes, your Bards.
00:01:02.200 | Local models offer another way, where you own the model and it runs on your computer.
00:01:07.640 | So let's quickly go over what that actually means.
00:01:11.640 | First up, size.
00:01:13.640 | Model size can be used as a rough proxy for the intelligence of the model.
00:01:18.040 | Most of the models are really, really big.
00:01:21.440 | You can see that it's dominating the right-hand side of the chart there.
00:01:24.520 | You have your GPT-3, your GPT-4 (we'll get back to that), your Gopher, your PaLM 2.
00:01:30.360 | These are all insanely big in comparison to the open-source models we have.
00:01:33.800 | We're beginning to see some bigger models thanks to Llama and Falcon, but even they pale in comparison
00:01:38.880 | to what the bigger players can do.
00:01:41.280 | This means the local models don't have the same capacity for intelligence.
00:01:45.220 | However, a smaller, more focused model may be able to solve problems better than a large
00:01:51.060 | general model.
00:01:52.400 | By the way, we don't actually know what size GPT-4 is.
00:01:55.320 | That's rumors.
00:01:56.960 | Only OpenAI knows.
00:01:58.960 | Next, let's talk about speed and capacity.
00:02:03.580 | Cloud models run on specialized hardware with special configuration.
00:02:06.820 | Local models run on whatever hardware you can scrounge up, including rented hardware.
00:02:11.160 | The further up the axis you go, the more speed and/or parallel inference you can do, but the
00:02:15.400 | more inaccessible it becomes.
00:02:17.740 | This end?
00:02:18.740 | A few hundred dollars.
00:02:19.740 | That end?
00:02:20.740 | A few hundred million dollars.
00:02:24.160 | Next up, latency.
00:02:26.240 | Cloud models need the full prompt before they can start inference and you have to wait for
00:02:30.120 | the message back and forth.
00:02:32.900 | Local models can give you a response immediately.
00:02:35.140 | You can feed the prompt as you go along.
00:02:37.060 | This is very important for conversations where you want the model to be able to process what
00:02:41.000 | you're saying as you say it.
00:02:44.220 | And of course, you can't escape talking about cost.
00:02:47.060 | The cloud vendors will charge you a per token price.
00:02:50.440 | When running locally, it's entirely up to you how much it costs you to run the machine.
00:02:54.520 | If the running cost of your model is less than the cost of running your workload through the
00:02:58.400 | cloud, you're going to make a profit.
00:03:00.460 | And if you're running on a machine you already own, well, that's basically free, right?
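To make that concrete, here's a back-of-the-envelope break-even comparison. Every number below is hypothetical (the per-token price, power draw, and electricity cost are made up for illustration); plug in your own workload.

```rust
// Break-even sketch: cloud per-token pricing vs. local electricity.
// All figures are hypothetical - substitute your own.
fn main() {
    let cloud_price_per_1k_tokens = 0.002; // $ per 1K tokens
    let tokens_per_month = 500_000_000.0; // your workload
    let cloud_cost = tokens_per_month / 1000.0 * cloud_price_per_1k_tokens;

    let local_power_kw = 0.35; // one GPU box under load
    let price_per_kwh = 0.30; // $ per kWh
    let local_cost = local_power_kw * 24.0 * 30.0 * price_per_kwh;

    println!("cloud: ${cloud_cost:.0}/month, local: ${local_cost:.0}/month");
    // cloud: $1000/month, local: $76/month - local wins, *if* it can keep up.
}
```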
00:03:06.980 | With the cloud, you have to use the models they offer you.
00:03:10.340 | Some vendors offer fine-tuning, but they often charge more than just using the regular model,
00:03:16.020 | and they often charge you for the process of fine-tuning.
00:03:18.780 | This means it's not often cost-effective to actually do that.
00:03:22.700 | With local models, the sky's the limit.
00:03:25.120 | There are hundreds, potentially thousands, of custom models that can suit any need you have.
00:03:30.700 | Knowledge retrieval, storytelling, conversation, tool use: you name it, someone's already done it.
00:03:36.900 | But if they haven't, fine-tuning an existing model for your own use is easy enough.
00:03:41.020 | Special shout-out to Axolotl over there, which makes it easy to fine-tune models of any architecture.
00:03:47.240 | And of course, privacy.
00:03:51.480 | There are some questions you don't want to ask the internet.
00:03:54.080 | Local models let you privately embarrass yourself.
00:03:56.840 | Now, you might be wondering how it's actually possible to run these models locally.
00:04:02.880 | That, my friends, is possible with the power of quantisation.
00:04:06.300 | If each model has billions of parameters, and those parameters are like individual numbers,
00:04:11.120 | how could you possibly run them on consumer hardware when there's only so much memory available
00:04:15.660 | for a given performance level?
00:04:17.840 | Well, we can use quantisation.
00:04:20.620 | Quantisation lets you lossily compress a model while maintaining the majority of its smarts.
00:04:24.120 | We can take the original model here in blue and squish it down to something much smaller
00:04:28.460 | using one of these green formats.
00:04:30.480 | This is the secret sauce that makes it viable to run models locally.
00:04:35.040 | Smaller models aren't just easier to store.
00:04:37.700 | They can also run faster as a computer can process more of the model at any given moment.
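For a rough sense of the numbers, here's some illustrative arithmetic (real quantisation formats add per-block scaling overhead, so actual files are a bit larger):

```rust
// Back-of-the-envelope: parameters x bits-per-parameter = model size.
fn model_size_gb(params_billions: f64, bits_per_param: f64) -> f64 {
    params_billions * 1e9 * bits_per_param / 8.0 / 1e9
}

fn main() {
    // A 7B-parameter model in f16 vs. a 4-bit quantised format:
    println!("f16:   {:.1} GB", model_size_gb(7.0, 16.0)); // ~14.0 GB
    println!("4-bit: {:.1} GB", model_size_gb(7.0, 4.0)); // ~3.5 GB
}
```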
00:04:44.760 | But that's enough about local models.
00:04:46.800 | You've probably heard much of that already.
00:04:49.420 | Let's talk about the actual library.
00:04:52.200 | It all started with this man who built something you may have heard of.
00:04:56.000 | Of course, I'm referring to llama.cpp, and that's what it looked like on day one.
00:04:59.760 | Look at the mere 98 stars.
00:05:01.360 | How pedestrian compared to today, where it's at 42,000 stars.
00:05:06.260 | But let's go back to March when I first saw it.
00:05:08.700 | When I saw it, I had but one idea.
00:05:11.300 | It's time to rewrite it in Rust, both for the meme and because I wanted to use it for other
00:05:16.540 | things.
00:05:17.540 | Well, I said I wanted to do it, and I did.
00:05:21.340 | But to the right here, setzer22 was also working on the same problem.
00:05:25.100 | And, well, there was just one catch.
00:05:27.900 | He beat me to it.
00:05:29.980 | Completely beat me to it.
00:05:30.980 | I'm not afraid to admit it.
00:05:32.340 | Luckily, we came together, merged our projects, and I ended up as the maintainer of the resulting
00:05:37.720 | project and that's how LLM was born.
00:05:41.440 | So you might be wondering why.
00:05:44.140 | If llama.cpp exists, why use LLM.RS?
00:05:47.220 | Well, with LLM.RS, I had six principles in mind.
00:05:51.280 | It must be a library.
00:05:53.360 | When I first started in March, llama.cpp was not a library.
00:05:56.720 | It was an application, and that made it impossible to reuse.
00:06:00.180 | It must not be coupled to an application.
00:06:02.660 | You must be able to customize its behavior, you must be able to go in and change every
00:06:06.280 | little bit of it to make it work for your application, and we shouldn't make any assumptions about
00:06:10.840 | how it's going to be used.
00:06:13.120 | It should support a multitude of model architectures.
00:06:15.700 | Of course, llama.cpp supports Llama and now Falcon, but clearly there are more out there.
00:06:21.760 | Next up, it should be Rust native.
00:06:23.360 | It should feel like using a Rust library.
00:06:25.380 | It should not feel like using a library with bindings, and it should work how you expect
00:06:28.940 | a Rust library to work.
00:06:30.740 | Next up, backends.
00:06:32.540 | It should support all possible kinds of backends: you can run on your CPU, your GPU, or of course,
00:06:37.620 | your ML-powered toaster.
00:06:38.620 | I'm sure that's going to be a thing, and when it is, we'll have seen it coming, I swear.
00:06:42.900 | And finally, platforms.
00:06:44.720 | It should work the same whether it's on Windows, Linux, Mac OS, or something else.
00:06:49.760 | You shouldn't have to change it significantly to make it work, because deployment has always
00:06:53.920 | been an issue.
00:06:55.920 | Today, I'm proud to say we support a myriad of architectures, including the darlings of
00:07:00.720 | the movement, Llama and Falcon.
00:07:02.880 | These architectures all use the same interface, so you don't have to worry about changing your
00:07:06.260 | code to use a different model.
00:07:08.440 | This is made possible by the concerted efforts of code contributors Lucas and Dan, who I couldn't
00:07:14.260 | have done this without, as well as many others.
00:07:17.100 | Here's some sample code for the library.
00:07:20.080 | I won't go too much into it, because it's quite dense.
00:07:22.680 | But the idea is that you load a model right there at the top (that part's actually quite
00:07:26.780 | small), and with that model, you create sessions which track an ongoing use of the model.
00:07:31.340 | You can have as many of these as you would like, but they do have a memory cost, so you
00:07:34.640 | want to be careful.
00:07:35.960 | Once you have a session, you can pass a prompt in and infer with the model to determine what
00:07:40.420 | comes next.
00:07:41.540 | You can keep reusing the same session, which is very useful for conversation.
00:07:44.980 | You don't need to keep refeeding the context.
00:07:47.580 | The last argument of the function is the callback.
00:07:51.040 | That's where you actually get the tokens out.
00:07:54.080 | It's worth noting that the function itself is actually a helper.
00:07:57.040 | All it does is call the model in a loop with some boundary conditions.
00:08:00.640 | If you want to change the logic in some significant way, you can.
00:08:04.700 | We're not going to stop you from doing that.
00:08:06.700 | One last thing about this, though.
00:08:08.280 | You see all the calls to default there?
00:08:09.740 | Those are all customisation points.
00:08:11.740 | You can change pretty much anything about this.
00:08:13.320 | You can change how the model is loaded.
00:08:15.120 | You can change how it will do the inference.
00:08:16.960 | You can change how it will sample.
00:08:18.620 | The entire point is you have the control you need to make the thing you need to work.
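The slide itself isn't reproduced in the transcript, so here's a condensed sketch along the lines of the crate's README at the time. Treat it as approximate: the exact item names and signatures (llm::load, start_session, infer, InferenceRequest, and friends) varied between versions.

```rust
use std::io::Write;

fn main() {
    // Load the model once, right at the top.
    let model = llm::load::<llm::models::Llama>(
        std::path::Path::new("/path/to/model.bin"),
        llm::TokenizerSource::Embedded,
        Default::default(), // model parameters: a customisation point
        llm::load_progress_callback_stdout,
    )
    .unwrap_or_else(|err| panic!("failed to load model: {err}"));

    // A session tracks one ongoing use of the model (each has a memory cost).
    let mut session = model.start_session(Default::default());

    // Feed a prompt; generated tokens stream out through the callback.
    session
        .infer::<std::convert::Infallible>(
            &model,
            &mut rand::thread_rng(),
            &llm::InferenceRequest {
                prompt: "Rust is a cool programming language because".into(),
                parameters: &Default::default(), // sampling: a customisation point
                play_back_previous_tokens: false,
                maximum_token_count: None,
            },
            &mut Default::default(),
            |r| {
                if let llm::InferenceResponse::InferredToken(t) = r {
                    print!("{t}");
                    std::io::stdout().flush().unwrap();
                }
                Ok(llm::InferenceFeedback::Continue)
            },
        )
        .unwrap();
}
```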
00:08:23.740 | Here's a quick demo of the library working with Llama 7 billion on my MacBook CPU.
00:08:31.460 | It's reasonably fast, but it could be faster, right?
00:08:35.500 | Well, thanks to the power of GPU acceleration, we have something that's much more usable.
00:08:44.500 | And believe me, it's even faster on NVIDIA GPUs.
00:08:51.040 | Now let's talk about what you can actually do with the library.
00:08:54.380 | Let's start with three community projects to begin with.
00:08:56.600 | First we've got local.ai.
00:08:58.140 | local.ai is a simple app that you can install to do inference locally.
00:09:03.260 | There's nothing magical about it.
00:09:04.500 | It's just exactly what it says.
00:09:06.140 | I think that's really wonderful because it means anyone can download this app and be able
00:09:12.060 | to use local models without having to think about it.
00:09:14.340 | Next up, llm-chain.
00:09:15.340 | It's LangChain, but for Rust.
00:09:18.160 | And of course, it supports inference with the library.
00:09:20.420 | And finally, we have Floneum, which is a flowchart-based application where you can build your own
00:09:24.380 | workflows.
00:09:25.380 | I think we've seen a few of those at this conference.
00:09:28.180 | And you can combine and create nodes to build the workflow you need.
00:09:32.340 | And of course, it supports the library as an inference engine.
00:09:36.520 | Now I wouldn't be a very good library author if I didn't actually test my own library.
00:09:40.880 | So I'm going to go through three applications.
00:09:43.520 | The first two approves the concept.
00:09:45.340 | The first is LMChain.
00:09:46.520 | It's a Discord bot.
00:09:49.360 | You can see it's exactly what you'd expect.
00:09:51.400 | You give it a prompt.
00:09:52.920 | It will give you a response.
00:09:54.800 | Any hitches you see come from Discord limits, not from the actual inference itself.
00:09:58.940 | You can see it's all there.
00:10:03.940 | When a user issues a request for generation, it goes through this process here, where the request
00:10:09.700 | goes through a generation thread with a channel.
00:10:12.760 | That channel is then used to create a response task.
00:10:16.940 | And then that response task is responsible for sending the responses to the user.
00:10:22.180 | Now, the interesting thing is these sessions are created and thrown away immediately with each
00:10:27.120 | query.
00:10:28.120 | But you don't need to do that.
00:10:29.400 | If you keep them around, you can actually use them for conversation.
00:10:33.740 | And just to illustrate, this is just like the request response workflow you would use for
00:10:37.280 | anything.
00:10:38.080 | If I just take what I had there, drop the Discord bit and add in HTTP, you can see request generation
00:10:44.520 | response.
00:10:45.520 | Easy.
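The bot's internals aren't shown in the transcript, but the shape is the classic request/generation/response pattern described above. Here's a minimal runnable sketch of that shape with std channels; the names are mine, not llmcord's, and the generation step is stubbed out.

```rust
use std::sync::mpsc;
use std::thread;

// A request pairs a prompt with a channel to stream tokens back on.
struct Request {
    prompt: String,
    respond: mpsc::Sender<String>,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Request>();

    // Generation thread: would own the model and serve requests in turn.
    thread::spawn(move || {
        for req in rx {
            // Stand-in for creating a session and inferring on req.prompt;
            // each token goes back over the per-request channel.
            for token in ["You", " said", ": ", req.prompt.as_str()] {
                req.respond.send(token.to_string()).unwrap();
            }
        }
    });

    // Response task: forwards streamed tokens to the user (stdout here,
    // Discord messages in the real bot; HTTP responses in the web variant).
    let (resp_tx, resp_rx) = mpsc::channel();
    tx.send(Request { prompt: "Hi!".into(), respond: resp_tx }).unwrap();
    for token in resp_rx {
        print!("{token}");
    }
    println!();
}
```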
00:10:46.520 | Next up.
00:10:47.520 | Alpa.
00:10:48.520 | I love using GitHub Copilot.
00:10:50.520 | But it's only available in my code editor and it requires an internet connection.
00:10:53.580 | Alpa is my attempt to solve this.
00:10:56.300 | It gives you autocomplete anywhere in your system, just by taking what's to the left of your cursor and passing it
00:11:03.080 | to a model, then typing in the completion.
00:11:04.860 | And, of course, you can use any model, including one fine-tuned on my own writing.
00:11:08.480 | Ask me how I know.
00:11:09.480 | Alpa is also quite simple.
00:11:12.680 | In fact, it is so simple I don't really need to cover it.
00:11:14.880 | Listen for input, copy the input into a prompt, start generating, type out the response.
00:11:20.740 | Easy.
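That loop, with every OS-specific helper stubbed out so it runs standalone. The helper names are mine, not alpa's; the real tool would hook a global hotkey, read the text left of the cursor via clipboard or accessibility APIs, run an llm inference, and emit synthetic key events.

```rust
// 1. Listen for input - e.g. block on a global-hotkey library.
fn wait_for_hotkey() {}

// 2. Copy the input into a prompt - e.g. select-to-start-of-line + clipboard.
fn read_text_left_of_cursor() -> String {
    "The quick brown fox".to_string()
}

// 3. Start generating - e.g. feed the context to an llm session.
fn complete_with_model(context: &str) -> String {
    let _ = context;
    " jumps over the lazy dog.".to_string()
}

// 4. Type out the response - synthetic key events; stdout here.
fn type_text(text: &str) {
    print!("{text}");
}

fn main() {
    wait_for_hotkey();
    let context = read_text_left_of_cursor();
    let completion = complete_with_model(&context);
    type_text(&completion);
    println!();
}
```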
00:11:21.740 | Now, the first two examples are pretty simple.
00:11:24.580 | They are proofs of concept.
00:11:26.020 | But now I want to talk about an actual use case.
00:11:28.540 | This is a real-world data extraction task.
00:11:30.740 | Over the last few years, I have been working on a project to make a timeline from the dates
00:11:34.780 | of Wikipedia, because there are millions of pages and they all have dates, and you can
00:11:38.400 | build a world history from it.
00:11:40.400 | However, these dates are often unstructured and more or less impossible to parse using traditional
00:11:44.480 | means.
00:11:45.480 | Like, yes, you can try using regex to extract the dates, but you can't get the context out
00:11:49.120 | in any meaningful sense, and there are some dates here that don't make any sense at all.
00:11:53.900 | So that's why, as is the theme of this conference, I threw a large language model at it.
00:11:59.520 | However, GPT-3 and 4 aren't perfect.
00:12:01.520 | Even after rounds of prompt engineering, you can see I tried here.
00:12:04.620 | And handling millions of dates is just too expensive and slow.
00:12:08.680 | So I decided I'd fine-tune my own model.
00:12:10.520 | So I generated a representative dataset using GPT-3, built a tool to go through the dataset,
00:12:15.140 | pick out any data point, fix it up, and correct the errors, built a new dataset,
00:12:20.560 | and trained a new model.
00:12:22.840 | So I did that using Axolotl, which I mentioned earlier.
00:12:25.280 | Again, check out Axolotl for all your fine-tuning needs.
00:12:27.140 | Highly recommended.
00:12:28.640 | And now I have a small, fast, consistent model that I can pass any data to - sorry,
00:12:32.140 | any date to, and get back a structured representation, which I can, of course, immediately parse using
00:12:37.400 | Rust.
00:12:38.780 | And I can treat that as a black box.
00:12:40.400 | So I have a function there, fn parse: pass in some dates, get structured dates back. Simple.
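As a sketch of what "treat it as a black box" can look like, here the schema and helper names are illustrative, not the talk's actual code, and run_model stands in for a round-trip through the fine-tuned model:

```rust
use serde::Deserialize;

// An illustrative structured representation of a date.
#[derive(Debug, Deserialize)]
struct ParsedDate {
    year: i32,
    month: Option<u8>,
    day: Option<u8>,
    context: Option<String>, // what the date refers to
}

// Stand-in for prompting the fine-tuned model, which is trained to
// answer with a structured (here, JSON) representation.
fn run_model(raw_date: &str) -> String {
    let _ = raw_date;
    r#"{"year": 1969, "month": 7, "day": 20, "context": "Moon landing"}"#.into()
}

// The black box: unstructured dates in, structured dates out.
fn parse_dates(raw: &[&str]) -> Vec<Option<ParsedDate>> {
    raw.iter()
        .map(|date| serde_json::from_str(&run_model(date)).ok())
        .collect()
}

fn main() {
    println!("{:?}", parse_dates(&["20 July 1969"]));
}
```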
00:12:44.460 | Now, let's quickly talk about the benefits of using local models and the library.
00:12:50.500 | First off, deployments.
00:12:52.000 | Show of hands, who's had to deal with Python deployment hell?
00:12:56.180 | Dependency hell, even.
00:12:58.580 | Yeah.
00:12:59.580 | Yeah, I know.
00:13:00.580 | It's awful.
00:13:01.580 | You spend hours just trying to sort out your conda, your pip, your pipenv. It's awful.
00:13:06.520 | With the library, you inherit Rust's excellent cross-platform support and build system, making
00:13:11.360 | it easier to ship self-contained binaries for your platform; no more making your users install
00:13:16.160 | Torch.
00:13:17.160 | As you might imagine, this unlocks the use of desktop applications with models.
00:13:21.860 | Next up, the ecosystem.
00:13:23.720 | Rust has one of the strongest ecosystems of any native language.
00:13:27.620 | You can combine these libraries with LLMs to build all kinds of things.
00:13:31.700 | It's what let me build a Discord bot, a system-wide autocompletion utility, a data ingestion pipeline,
00:13:36.120 | and a dataset exploration utility, all in the same language.
00:13:40.120 | And I think if you use LLM.RS, you can do the same thing with your task as well.
00:13:45.500 | Of course, you also have control over how the model generates.
00:13:48.100 | I alluded to this earlier, but you can choose exactly how it samples tokens.
00:13:52.020 | Normally, when you use a cloud model, you can get back the logits, the probabilities, but
00:13:56.840 | those probabilities are limited.
00:13:58.900 | You have to keep going back and forth, and that's slow and expensive.
00:14:02.180 | With this, you can directly control what you are sampling.
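To illustrate the kind of control that means, here's a generic top-k sampler over raw logits. This is the underlying idea, not llm's actual sampler API: keep the k most likely tokens, softmax over the survivors, and draw from the renormalised distribution.

```rust
// Top-k sampling over a logit vector; rand01 is a uniform draw in [0, 1).
fn top_k_sample(logits: &[f32], k: usize, rand01: f32) -> usize {
    // Keep the k highest-logit tokens.
    let mut indexed: Vec<(usize, f32)> = logits.iter().copied().enumerate().collect();
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);

    // Softmax over the survivors (shifted by the max for stability).
    let max = indexed[0].1;
    let weights: Vec<f32> = indexed.iter().map(|&(_, l)| (l - max).exp()).collect();
    let total: f32 = weights.iter().sum();

    // Sample from the renormalised distribution.
    let mut acc = 0.0;
    for (&(token, _), w) in indexed.iter().zip(&weights) {
        acc += w / total;
        if rand01 < acc {
            return token;
        }
    }
    indexed.last().unwrap().0 // floating-point slack: fall back to last survivor
}

fn main() {
    let logits = [1.0, 3.0, 0.5, 2.5];
    // Fixed "random" draw for reproducibility: picks token 3 here.
    println!("picked token {}", top_k_sample(&logits, 2, 0.7));
}
```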
00:14:06.220 | Finally, let's talk about the innovation in the space.
00:14:10.040 | If you're here, you probably know there's a paper almost every single day.
00:14:13.800 | It's impossible to keep up with.
00:14:14.800 | Trust me.
00:14:15.800 | I've tried.
00:14:16.800 | But the use of local models means you can try this out before anyone else can.
00:14:20.040 | You can go through.
00:14:21.040 | You can try out some of these papers.
00:14:22.040 | You can be like, oh, wow, that's actually a worthwhile improvement.
00:14:24.260 | And eventually, the cloud providers will provide them, but in the meantime, the control remains
00:14:27.760 | with you.
00:14:28.760 | However, it's time to talk about the problems.
00:14:32.420 | There ain't no such thing as a free lunch, except for a conference, of course.
00:14:38.040 | Let's talk about hardware again.
00:14:40.080 | I mentioned earlier that you can pretty much run these things on almost any hardware, but that's
00:14:45.440 | kind of a lie.
00:14:46.440 | You still need some kind of power.
00:14:48.820 | You can only get so much out of your 10-year-old computer, your smartphone, or your Raspberry Pi.
00:14:54.140 | We're finding clever ways to improve this, like smaller models and better inferencing,
00:14:57.920 | but it's still something to be aware of.
00:14:59.800 | Next, as with all things, the fast, cheap, good triangle applies.
00:15:04.280 | You can make all kinds of trade-offs here, and you see I've listed a couple of them here,
00:15:08.120 | but fundamentally, you have to choose what are you willing to sacrifice in order to serve
00:15:12.360 | your application?
00:15:13.720 | Are you willing to go for a bigger model to get better-quality results, at the cost of speed?
00:15:18.460 | These are all decisions you have to make, and they're not always obvious.
00:15:22.780 | It's something you have to think about.
00:15:25.780 | Next, there's no other way of putting this.
00:15:28.060 | The ecosystem churns.
00:15:30.460 | Innovation is a double-edged sword.
00:15:31.460 | When those changes come in, they can often break your existing workflows.
00:15:35.200 | I've helped alleviate this, to some extent, with the GGUF file format, which helps future-proof things,
00:15:40.100 | but it's still a problem.
00:15:41.160 | Some days, you will just wake up, try your application with a new model, and it just won't work.
00:15:45.840 | There's nothing you can do except deal with it.
00:15:48.840 | Finally, a lot of the models in this space are open source.
00:15:53.460 | They're free for use personally, but they have very strange clauses and exceptions.
00:15:57.540 | For most of us, this doesn't matter.
00:15:59.540 | You can just use the model personally, but it's a reminder that even though these models are
00:16:03.220 | free, they're not capital-F Free. Luckily, there's been some recent change in the space,
00:16:08.160 | with Mistral and StableLM giving you strong performance at a small size while being completely
00:16:14.140 | unencumbered, but it's still a problem, and they're still much smaller than the big ones,
00:16:18.540 | like Llama and Falcon.
00:16:20.540 | Unfortunately, I've got to wrap things up here.
00:16:22.720 | There's only so much you can talk about in 18 minutes, I'm afraid.
00:16:26.360 | Local models are great, and I'd like to think our library is too.
00:16:29.980 | They're getting easier to run day by day with smaller, more powerful models, however, the
00:16:34.120 | situation isn't perfect, and there isn't always one obvious solution for your problem.
00:16:38.860 | Thanks for listening.
00:16:39.860 | You can contact me by email or by Mastodon.
00:16:42.660 | The library can be found at, you guessed it, llm.rs, or by scanning the QR code.
00:16:47.060 | Finally, we're always looking for contributors. If you're interested in LLMs or Rust, feel free
00:16:51.280 | to reach out.
00:16:52.280 | Sponsorships are also very welcome, because they help me try out new hardware, which is always
00:16:55.820 | necessary, and if you want to chat in person, I'll be hanging around the conference.
00:16:59.600 | I'll see you later.
00:17:00.220 | Thank you.