
Llama.cpp for FULL LOCAL Semantic Router



00:00:00.000 | Today we're going to take a look at the new features in Semantic Router that allow us to take everything fully local.
00:00:08.700 | So we'll be using llama.cpp for the dynamic routes and the Hugging Face encoder for our routing decisions.
00:00:17.600 | Now one thing I really like about this is that using a very small model, so we're going to be using Mistral 7B,
00:00:24.700 | we add grammars onto that and using this format we seem to be getting much better results for agentic decision making
00:00:35.000 | than I can get with GPT 3.5.
00:00:38.400 | Now all of this I'm going to be running from my M1 MacBook Pro, so it's not like I have anything crazy here
00:00:45.300 | and it will run pretty quick as we'll see.
00:00:48.400 | So let's jump straight into it.
00:00:50.600 | Now I'm starting off in the Semantic Router library coming over to here and I'm going to download that onto my Mac.
00:01:00.300 | Once that has been downloaded we should be able to open it and we'll see this
00:01:04.900 | and what I'm going to do is just pip install Semantic Router.
00:01:12.300 | So I'm just switching across my terminal here.
00:01:15.700 | If you have the local Git repo you can install the most recent version like this,
00:01:21.600 | but I'm going to go ahead and install it from PyPI.
00:01:28.500 | So Semantic Router and we want 0.0.16.
00:01:35.300 | Now coming down to here, if you are on Mac you want to use this.
00:01:42.300 | So that's just to speed things up.
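For reference, in a notebook the install steps might look something like the following. This is a sketch: the pinned version comes from the video, while the Metal build flag is an assumption based on llama-cpp-python's documented build options.

```python
# install Semantic Router at the version used in the video
!pip install -qU "semantic-router==0.0.16"

# on Apple Silicon, rebuilding llama-cpp-python with Metal enabled
# offloads inference to the GPU and speeds things up considerably
!CMAKE_ARGS="-DLLAMA_METAL=on" pip install -qU --force-reinstall --no-cache-dir llama-cpp-python
```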
00:01:45.700 | I'm going to use the Mistral 7B Instruct model.
00:01:49.700 | It's quantized so we can actually run this pretty easily.
00:01:55.100 | You don't need much to run this and it runs surprisingly quickly.
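The download step probably looks something like this; the exact repo and quantization file are assumptions (TheBloke's GGUF conversion, Q4_0 quantization, roughly 4 GB):

```python
# grab a quantized GGUF build of Mistral 7B Instruct
!curl -L "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_0.gguf" \
    -o mistral-7b-instruct-v0.2.Q4_0.gguf
```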
00:02:01.200 | While we are waiting for that to download I'm going to come over to here
00:02:05.600 | and I'll just point out the PR where we got this implemented.
00:02:11.200 | So this one from Bogdan from Aurelio, super cool.
00:02:16.700 | And there's one thing in particular I wanted to point out
00:02:20.200 | which is that we use these LLM grammars here.
00:02:23.800 | Now the LLM grammars are essentially enforcing a particular structured output from your LLM,
00:02:32.500 | which is a big part of why we can get very good performance from a very small model like Mistral 7B.
00:02:40.700 | And it is surprisingly good.
00:02:44.200 | I'm actually seeing better performance with this and Mistral 7B than I am with GPT 3.5,
00:02:49.900 | which I think is pretty insane.
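To give a flavor of the mechanism, here is a minimal sketch using llama-cpp-python's grammar support directly. This is not Semantic Router's actual grammar, just an illustration: constrained decoding means the sampler can only produce strings the grammar accepts, so even a small model reliably emits parseable JSON.

```python
from llama_cpp import Llama, LlamaGrammar

# toy GBNF grammar: output must be a JSON object with one "timezone" string field
GRAMMAR = r'''
root   ::= "{" ws "\"timezone\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z/_]* "\""
ws     ::= [ \t\n]*
'''

model = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf", n_ctx=2048)
out = model(
    "Give the IANA timezone for this request: what is the time in New York?\n",
    grammar=LlamaGrammar.from_string(GRAMMAR),  # sampler rejects tokens that break the grammar
    max_tokens=64,
)
print(out["choices"][0]["text"])  # e.g. {"timezone": "America/New_York"}
```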
00:02:52.900 | OK, that has now been downloaded, so we have the Mistral model.
00:02:58.200 | Now we'll come down to initializing our dynamic route
00:03:01.200 | and you might recognize this from the previous example where we demoed a dynamic route.
00:03:08.400 | I'm using the exact same one here but we're just going to swap out the OpenAI encoder
00:03:15.600 | and the OpenAI LLM for a Hugging Face encoder and the Mistral 7B LLM.
00:03:22.200 | Exact same definitions here, so this is our dynamic route, the get_time route,
00:03:26.400 | and we also have the static routes here as well so they are also in there.
00:03:32.500 | I'm going to take all of those routes, so the time route and the static ones,
00:03:39.000 | and we just put all of our routes in a list here
00:03:43.300 | and we're going to use them soon to initialize our route layer.
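Under those assumptions, the route definitions look roughly like this. The utterances are illustrative, and the get_schema import path follows the library's examples from this era, so it may differ by version:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

from semantic_router import Route
from semantic_router.utils.function_call import get_schema  # path may vary by version

def get_time(timezone: str) -> str:
    """Find the current time in a specific IANA timezone."""
    return datetime.now(ZoneInfo(timezone)).strftime("%H:%M")

# dynamic route: carries a function schema that the local LLM will fill in
time_route = Route(
    name="get_time",
    utterances=[
        "what is the time in new york city?",
        "tell me the time in london",
    ],
    function_schema=get_schema(get_time),
)

# static routes: no function call, just a routing decision
politics = Route(
    name="politics",
    utterances=["don't you just love the president", "tell me your political opinions"],
)
chitchat = Route(
    name="chitchat",
    utterances=["how's the weather today?", "lovely weather we're having"],
)

routes = [politics, chitchat, time_route]
```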
00:03:47.200 | But to initialize our route layer we do need an encoder so we go ahead and we initialize that.
00:03:52.700 | We're using the Hugging Face encoder here, which by default is the Sentence Transformers
00:03:58.700 | all-MiniLM-L6-v2, which is a tiny, tiny model.
00:04:03.800 | So you can also run this on pretty much anything as well.
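That initialization is likely just a couple of lines:

```python
from semantic_router.encoders import HuggingFaceEncoder

# defaults to sentence-transformers/all-MiniLM-L6-v2 (~22M parameters)
encoder = HuggingFaceEncoder()
```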
00:04:08.100 | Now we want to come over to here to begin initializing our Mistral 7B model.
00:04:16.900 | There's a little bit of explanation on what we're actually using here.
00:04:20.300 | We are going to simplify the way that you initialize a llama.cpp model
00:04:26.900 | but for now this is how you do it and we will still have this option.
00:04:30.500 | So the idea is we'll probably make it so that if you don't pass in this LLM parameter
00:04:35.800 | we will use default parameters when initializing it.
00:04:39.200 | But for those of you that do want to modify your parameters you will be able to.
00:04:44.200 | So let's run this.
00:04:48.900 | I'm going to run it on GPU.
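Based on the library's local-execution example, the initialization is likely along these lines; the LlamaCppLLM import path and parameter names are assumptions for this version:

```python
from llama_cpp import Llama
from semantic_router.llms.llamacpp import LlamaCppLLM  # import path assumed for this version

enable_gpu = True  # offload layers to the Mac's GPU via Metal

_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf",
    n_gpu_layers=-1 if enable_gpu else 0,  # -1 offloads every layer
    n_ctx=2048,                            # context window for the router prompts
)
llm = LlamaCppLLM(name="Mistral-7B-Instruct", llm=_llm, max_tokens=None)
```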
00:04:51.100 | OK so I have that here and then we can initialize our route layer.
00:04:54.700 | OK so we have our encoder so the Hugging Face encoder.
00:04:57.900 | We have our routes that we defined before so 2 static, 1 dynamic.
00:05:02.100 | And we have Mistral 7B.
00:05:06.500 | Cool. Looks good.
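Putting those pieces together, the layer initialization is presumably just:

```python
from semantic_router import RouteLayer

rl = RouteLayer(encoder=encoder, routes=routes, llm=llm)
```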
00:05:09.100 | And now let's ask how's the weather today.
00:05:12.600 | We see that we hit our static route, the chitchat route.
00:05:16.300 | Now let's ask what is the time in New York right now.
00:05:20.300 | OK and you can see the grammars coming through here.
00:05:23.500 | I'm not actually sure how to stop those from being logged
00:05:28.100 | because I'm sure there must be a way but we'll figure that out in a future release.
00:05:32.500 | We have the time, and here in the UK it is 16:03.
00:05:37.500 | So that is correct.
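As a sketch of what those calls look like (the RouteChoice attribute names follow the library's examples from that period, so treat them as assumptions):

```python
# static route: resolved by embedding similarity alone, no LLM call
rl("how's the weather today?").name   # -> 'chitchat'

# dynamic route: the grammar-constrained LLM extracts the function arguments
out = rl("what is the time in new york right now?")
out.name                              # -> 'get_time'
out.function_call                     # e.g. {'timezone': 'America/New_York'}
print(get_time(**out.function_call))  # e.g. 11:03
```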
00:05:39.800 | What's the time in Rome right now.
00:05:41.400 | I think they're an hour ahead.
00:05:45.100 | 17:03 is correct.
00:05:47.600 | Then I want to try something a little further out of the way.
00:05:50.800 | So I think this is the question that GPT 3.5
00:05:56.000 | actually struggled with quite a lot, which is surprising.
00:05:58.800 | I would kind of expect it to be OK with this, but it really struggled.
00:06:02.800 | So what is the time in Bangkok right now.
00:06:05.800 | I'm going to run this.
00:06:08.700 | We get 23:04.
00:06:12.700 | I don't know what the time is in Bangkok right now.
00:06:15.200 | 23:04.
00:06:19.100 | So that is correct. And then time in Phuket as well.
00:06:22.300 | So I wanted somewhere that's not a main city because you look at the time zone here
00:06:27.000 | and it has Bangkok in the time zone name.
00:06:29.700 | So I want to try Phuket.
00:06:34.100 | And then, I'm actually not sure why, but this question here
00:06:39.800 | takes way longer to answer than the others.
00:06:45.300 | And yeah, I'm not 100% sure why that is, which is kind of interesting.
00:06:51.300 | But anyway so we're going to be waiting a little moment for this one.
00:06:55.200 | I will say, on GPT 3.5 answering this question,
00:07:01.000 | I didn't test this exact question, but even if it can answer for Bangkok,
00:07:04.900 | I feel like it would not have been able to answer for Phuket.
00:07:09.900 | Cool. So we come down here, and we see Asia/Bangkok for the time zone.
00:07:15.900 | And yeah we get the same time in there.
00:07:21.200 | Now let me just double check that they are in the same time zone.
00:07:25.500 | I'm pretty sure they are. OK, yeah, cool. So far so good.
00:07:29.300 | And then, we did just download the Mistral model.
00:07:32.300 | So if you do want to remove that from your computer, you can just run this command down here.
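Assuming the filename from earlier, the cleanup is just:

```python
# delete the downloaded GGUF weights to reclaim disk space
!rm mistral-7b-instruct-v0.2.Q4_0.gguf
```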
00:07:37.500 | And yeah that is everything. So we have got a fully local semantic router running now.
00:07:45.200 | It works with llama.cpp and it uses LLM grammars to make the performance of small models pretty good, as you can see.
00:07:54.800 | Then alongside that we also have the new Hugging Face encoders, which means that for any embedding model that is supported by Hugging Face,
00:08:03.000 | we most likely support it, unless it uses some sort of weird pooling mechanism, which most of them don't.
00:08:08.800 | Most of them are pretty straightforward. So yeah, we can now use Semantic Router with a ton more models.
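For example, swapping in a different hub model is presumably just a matter of passing its name; the model shown here is an arbitrary example, and the name parameter is an assumption:

```python
# hypothetical: point the encoder at any sentence-embedding model on the hub
encoder = HuggingFaceEncoder(name="BAAI/bge-small-en-v1.5")
```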
00:08:17.100 | And we can do all that locally which is pretty exciting.
00:08:19.400 | So that's it for this video. I just wanted to show you this very quickly. So I will leave it there.
00:08:24.500 | I hope this has all been interesting and useful.
00:08:26.500 | But for now thank you very much for watching and I'll see you again next time.
00:08:34.200 | [MUSIC]