
Llama.cpp for FULL LOCAL Semantic Router



00:00:00.000 | Today we're going to take a look at the new features in Semantic Router that allow us to take everything fully local.
00:00:08.700 | So we'll be using llama.cpp for the dynamic routes and the Hugging Face encoder for our routing decisions.
00:00:17.600 | Now one thing I really like about this is that using a very small model, so we're going to be using Mistral 7B,
00:00:24.700 | we add grammars onto that and using this format we seem to be getting much better results for agentic decision making
00:00:35.000 | than I can get with GPT 3.5.
00:00:38.400 | Now all of this I'm going to be running from my M1 MacBook Pro, so it's not like I have anything crazy here
00:00:45.300 | and it will run pretty quick as we'll see.
00:00:48.400 | So let's jump straight into it.
00:00:50.600 | Now I'm starting off in the Semantic Router library coming over to here and I'm going to download that onto my Mac.
00:01:00.300 | Once that has been downloaded we should be able to open it and we'll see this
00:01:04.900 | and what I'm going to do is just pip install Semantic Router.
00:01:12.300 | So I'm just switching across my terminal here.
00:01:15.700 | If you have the local Git repo you can install the most recent version like this,
00:01:21.600 | but I'm going to go ahead and install it from PyPI.
00:01:28.500 | So Semantic Router and we want 0.0.16.
00:01:35.300 | Now coming down to here, if you are on Mac you want to use this.
00:01:42.300 | So that's just to speed things up.
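For reference, in a notebook the install steps might look something like the following. This is a sketch: the pinned version comes from the video, while the Metal build flag is an assumption based on llama-cpp-python's documented build options.

```python
# install Semantic Router at the version used in the video
!pip install -qU "semantic-router==0.0.16"

# on Apple Silicon, rebuilding llama-cpp-python with Metal enabled
# offloads inference to the GPU and speeds things up considerably
!CMAKE_ARGS="-DLLAMA_METAL=on" pip install -qU --force-reinstall --no-cache-dir llama-cpp-python
```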
00:01:45.700 | I'm going to use the Mistral 7B Instruct model.
00:01:49.700 | It's quantized so we can actually run this pretty easily.
00:01:55.100 | You don't need much to run this and it runs surprisingly quickly.
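The download step probably looks something like this; the exact repo and quantization file are assumptions (TheBloke's GGUF conversion, Q4_0 quantization, roughly 4 GB):

```python
# grab a quantized GGUF build of Mistral 7B Instruct
!curl -L "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_0.gguf" \
    -o mistral-7b-instruct-v0.2.Q4_0.gguf
```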
00:02:01.200 | While we are waiting for that to download I'm going to come over to here
00:02:05.600 | and I'll just point out the PR where we got this implemented.
00:02:11.200 | So this one from Bogdan from Aurelio, super cool.
00:02:16.700 | And there's one thing in particular I wanted to point out
00:02:20.200 | which is that we use these LLM grammars here.
00:02:23.800 | Now the LLM grammars are essentially enforcing a particular structured output from your LLM,
00:02:32.500 | which is a big part of why we can get very good performance from a very small model like Mistral 7B.
00:02:40.700 | And it is surprisingly good.
00:02:44.200 | I'm actually seeing better performance with this and Mistral 7B than I am with GPT 3.5,
00:02:49.900 | which I think is pretty insane.
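To give a flavor of the mechanism, here is a minimal sketch using llama-cpp-python's grammar support directly. This is not Semantic Router's actual grammar, just an illustration: constrained decoding means the sampler can only produce strings the grammar accepts, so even a small model reliably emits parseable JSON.

```python
from llama_cpp import Llama, LlamaGrammar

# toy GBNF grammar: output must be a JSON object with one "timezone" string field
GRAMMAR = r'''
root   ::= "{" ws "\"timezone\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z/_]* "\""
ws     ::= [ \t\n]*
'''

model = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf", n_ctx=2048)
out = model(
    "Give the IANA timezone for this request: what is the time in New York?\n",
    grammar=LlamaGrammar.from_string(GRAMMAR),  # sampler rejects tokens that break the grammar
    max_tokens=64,
)
print(out["choices"][0]["text"])  # e.g. {"timezone": "America/New_York"}
```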
00:02:52.900 | OK, that has now been downloaded, so we have the Mistral model.
00:02:58.200 | Now we'll come down to initializing our dynamic route
00:03:01.200 | and you might recognize this from the previous example where we demoed a dynamic route.
00:03:08.400 | I'm using the exact same one here but we're just going to swap out the OpenAI encoder
00:03:15.600 | and the OpenAI LLM for a Hugging Face encoder and the Mistral 7B LLM.
00:03:22.200 | Exact same definitions here, so this is our dynamic route, the get_time route,
00:03:26.400 | and we also have the static routes here as well so they are also in there.
00:03:32.500 | I'm going to take all of those routes, so the time route and the static ones,
00:03:39.000 | and we just put all of our routes in a list here
00:03:43.300 | and we're going to use them soon to initialize our route layer.
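Under those assumptions, the route definitions look roughly like this. The utterances are illustrative, and the get_schema import path follows the library's examples from this era, so it may differ by version:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

from semantic_router import Route
from semantic_router.utils.function_call import get_schema  # path may vary by version

def get_time(timezone: str) -> str:
    """Find the current time in a specific IANA timezone."""
    return datetime.now(ZoneInfo(timezone)).strftime("%H:%M")

# dynamic route: carries a function schema that the local LLM will fill in
time_route = Route(
    name="get_time",
    utterances=[
        "what is the time in new york city?",
        "tell me the time in london",
    ],
    function_schema=get_schema(get_time),
)

# static routes: no function call, just a routing decision
politics = Route(
    name="politics",
    utterances=["don't you just love the president", "tell me your political opinions"],
)
chitchat = Route(
    name="chitchat",
    utterances=["how's the weather today?", "lovely weather we're having"],
)

routes = [politics, chitchat, time_route]
```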
00:03:47.200 | But to initialize our route layer we do need an encoder so we go ahead and we initialize that.
00:03:52.700 | We're using the Hugging Face encoder here, which by default is the Sentence Transformers
00:03:58.700 | all-MiniLM-L6-v2, which is a tiny, tiny model.
00:04:03.800 | So you can also run this on pretty much anything as well.
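That initialization is likely just a couple of lines:

```python
from semantic_router.encoders import HuggingFaceEncoder

# defaults to sentence-transformers/all-MiniLM-L6-v2 (~22M parameters)
encoder = HuggingFaceEncoder()
```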
00:04:08.100 | Now we want to come over to here to begin initializing our Mistral 7B model.
00:04:16.900 | There's a little bit of explanation on what we're actually using here.
00:04:20.300 | We are going to simplify the way that you initialize a llama.cpp model
00:04:26.900 | but for now this is how you do it and we will still have this option.
00:04:30.500 | So the idea is we'll probably make it so that if you don't pass in this LLM parameter
00:04:35.800 | we will use default parameters when initializing it.
00:04:39.200 | But for those of you that do want to modify your parameters you will be able to.
00:04:44.200 | So let's run this.
00:04:48.900 | I'm going to run it on GPU.
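Based on the library's local-execution example, the initialization is likely along these lines; the LlamaCppLLM import path and parameter names are assumptions for this version:

```python
from llama_cpp import Llama
from semantic_router.llms.llamacpp import LlamaCppLLM  # import path assumed for this version

enable_gpu = True  # offload layers to the Mac's GPU via Metal

_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf",
    n_gpu_layers=-1 if enable_gpu else 0,  # -1 offloads every layer
    n_ctx=2048,                            # context window for the router prompts
)
llm = LlamaCppLLM(name="Mistral-7B-Instruct", llm=_llm, max_tokens=None)
```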
00:04:51.100 | OK so I have that here and then we can initialize our route layer.
00:04:54.700 | OK so we have our encoder so the Hugging Face encoder.
00:04:57.900 | We have our routes that we defined before so 2 static, 1 dynamic.
00:05:02.100 | And we have Mistral 7B.
00:05:06.500 | Cool. Looks good.
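Putting those pieces together, the layer initialization is presumably just:

```python
from semantic_router import RouteLayer

rl = RouteLayer(encoder=encoder, routes=routes, llm=llm)
```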
00:05:09.100 | And now let's ask how's the weather today.
00:05:12.600 | We see that we hit our static route, the chitchat route.
00:05:16.300 | Now let's ask what is the time in New York right now.
00:05:20.300 | OK and you can see the grammars coming through here.
00:05:23.500 | I'm not actually sure how to stop those from being logged
00:05:28.100 | because I'm sure there must be a way but we'll figure that out in a future release.
00:05:32.500 | We have the time, and here in the UK it is 16:03.
00:05:37.500 | So that is correct.
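As a sketch of what those calls look like (the RouteChoice attribute names follow the library's examples from that period, so treat them as assumptions):

```python
# static route: resolved by embedding similarity alone, no LLM call
rl("how's the weather today?").name   # -> 'chitchat'

# dynamic route: the grammar-constrained LLM extracts the function arguments
out = rl("what is the time in new york right now?")
out.name                              # -> 'get_time'
out.function_call                     # e.g. {'timezone': 'America/New_York'}
print(get_time(**out.function_call))  # e.g. 11:03
```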
00:05:39.800 | What's the time in Rome right now.
00:05:41.400 | I think they're an hour ahead.
00:05:45.100 | 17:03 is correct.
00:05:47.600 | Then I want to try something a little further out of the way.
00:05:50.800 | So I think this is the question that GPT 3.5
00:05:56.000 | actually struggled with quite a lot, which is surprising.
00:05:58.800 | I would kind of expect it to be OK with this, but it really struggled.
00:06:02.800 | So what is the time in Bangkok right now.
00:06:05.800 | I'm going to run this.
00:06:08.700 | We get 23:04.
00:06:12.700 | I don't know what the time is in Bangkok right now.
00:06:15.200 | 23:04.
00:06:19.100 | So that is correct. And then time in Phuket as well.
00:06:22.300 | So I wanted somewhere that's not a main city because you look at the time zone here
00:06:27.000 | and it has Bangkok in the time zone name.
00:06:29.700 | So I want to try Phuket.
00:06:34.100 | And then, I'm actually not sure why, but this question here
00:06:39.800 | takes way longer to answer than the others.
00:06:45.300 | And yeah, I'm not 100% sure why that is, which is kind of interesting.
00:06:51.300 | But anyway so we're going to be waiting a little moment for this one.
00:06:55.200 | I will say, on GPT 3.5 answering this question,
00:07:01.000 | I didn't test this exact question, but even if it can answer for Bangkok,
00:07:04.900 | I feel like it would not have been able to answer for Phuket.
00:07:09.900 | Cool. So we come down here, and we see Asia/Bangkok for the time zone.
00:07:15.900 | And yeah we get the same time in there.
00:07:21.200 | Now let me just double check that they are in the same time zone.
00:07:25.500 | I'm pretty sure they are. OK, yeah, cool. So far so good.
00:07:29.300 | And then, we did just download the Mistral model.
00:07:32.300 | So if you do want to remove that from your computer, you can just run this command down here.
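Assuming the filename from earlier, the cleanup is just:

```python
# delete the downloaded GGUF weights to reclaim disk space
!rm mistral-7b-instruct-v0.2.Q4_0.gguf
```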
00:07:37.500 | And yeah that is everything. So we have got a fully local semantic router running now.
00:07:45.200 | It works with llama.cpp and it uses LLM grammars to make the performance of small models pretty good, as you can see.
00:07:54.800 | Then alongside that we also have the new Hugging Face encoders, which means that for any embedding model that is supported by Hugging Face,
00:08:03.000 | we most likely support it, unless it uses some sort of weird pooling mechanism, which most of them don't.
00:08:08.800 | Most of them are pretty straightforward. So yeah, we can now use Semantic Router with a ton more models.
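For example, swapping in a different hub model is presumably just a matter of passing its name; the model shown here is an arbitrary example, and the name parameter is an assumption:

```python
# hypothetical: point the encoder at any sentence-embedding model on the hub
encoder = HuggingFaceEncoder(name="BAAI/bge-small-en-v1.5")
```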
00:08:17.100 | And we can do all that locally which is pretty exciting.
00:08:19.400 | So that's it for this video. I just wanted to show you this very quickly. So I will leave it there.
00:08:24.500 | I hope this has all been interesting and useful.
00:08:26.500 | But for now thank you very much for watching and I'll see you again next time.
00:08:34.200 | [MUSIC]