Back to Index

Llama.cpp for FULL LOCAL Semantic Router


Transcript

Today we're going to take a look at the new features in Semantic Router that allow us to take everything fully local. We'll be using llama.cpp for the dynamic routes and the Hugging Face encoder for our routing decisions. Now, one thing I really like about this is that by using a very small model, in this case Mistral 7B, and adding grammars on top of it, we seem to be getting much better results for agentic decision making than I can get with GPT 3.5.

Now, all of this I'm going to be running from my M1 MacBook Pro, so it's not like I have anything crazy here, and it will run pretty quickly, as we'll see. So let's jump straight into it. I'm starting off in the Semantic Router library; coming over to here, I'm going to download that onto my Mac.

Once that has been downloaded we should be able to open it, and we'll see this. What I'm going to do is just pip install Semantic Router, so I'm switching across to my terminal here. If you have the local Git repo you can install the most recent version like this, but I'm going to go ahead and install it from PyPI.

So, Semantic Router, and we want 0.0.16. Now coming down to here, if you are on a Mac you want to use this install step as well, just to speed things up. I'm going to use the Mistral 7B Instruct model. It's quantized, so we can actually run it pretty easily; you don't need much to run this and it runs surprisingly quickly.
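Roughly, those setup steps look like this in a notebook. The [local] extra, the Metal CMake flag, and the exact GGUF filename are assumptions on my part, so adjust them to whatever the notebook on screen shows:

```python
# Install semantic-router; the [local] extra (assumed here) pulls in
# llama-cpp-python and the Hugging Face dependencies.
!pip install -qU "semantic-router[local]==0.0.16"

# On Apple Silicon, reinstalling llama-cpp-python with Metal enabled
# speeds up inference considerably (assumed but standard build flag).
!CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python

# Download a quantized Mistral 7B Instruct GGUF (example repo and filename; may differ).
!wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_0.gguf
```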

While we are waiting for that to download I'm going to come over to here and I'll just point out the PR where we got this implemented. So this one from Bogdan from Aurelio, super cool. And there's one thing in particular I wanted to point out which is that we use these LLM grammars here.

Now, the LLM grammars essentially enforce a particular structured output from your LLM, which is a big part of why we can get very good performance from a very small model like Mistral 7B. And it is surprisingly good; I'm actually seeing better performance with this and Mistral 7B than I am with GPT 3.5, which I think is pretty insane.
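To give a rough sense of what a grammar does, here is a minimal standalone llama-cpp-python sketch. This is not the exact grammar Semantic Router builds internally, just an illustration of constrained decoding:

```python
from llama_cpp import Llama, LlamaGrammar

# Tiny GBNF grammar that only allows a JSON object with a "timezone" string field.
# (Illustrative only; Semantic Router generates its own grammar from the function schema.)
grammar = LlamaGrammar.from_string(r'''
root   ::= "{" ws "\"timezone\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z/_]* "\""
ws     ::= [ \t\n]*
''')

llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf", n_ctx=2048)
out = llm(
    "Extract the timezone for: what is the time in New York right now?\n",
    grammar=grammar,   # decoding is constrained to strings the grammar accepts
    max_tokens=64,
)
print(out["choices"][0]["text"])  # e.g. {"timezone": "America/New_York"}
```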

Now the Mistral model has finished downloading. We'll come down to initializing our dynamic route, and you might recognize this from the previous example where we demoed a dynamic route. I'm using the exact same one here, but we're just going to swap out the OpenAI encoder and the OpenAI LLM for a Hugging Face encoder and the Mistral 7B LLM.

We have the exact same definitions here, so this is our dynamic route, the get time route, and we also have the static routes in there as well. I'm going to take all of those routes, put them in a list here, and we'll use them soon to initialize our route layer.
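As a sketch of how those routes are defined, based on the library's example notebook at the time (parameter names such as function_schema may have changed in later versions, so check the docs for your version):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

from semantic_router import Route
from semantic_router.utils.function_call import get_schema

def get_time(timezone: str) -> str:
    """Finds the current time in a specific timezone.

    :param timezone: An IANA timezone name, e.g. "America/New_York" or "Europe/London".
    """
    return datetime.now(ZoneInfo(timezone)).strftime("%H:%M")

# Dynamic route: carries a function schema so the LLM can fill in the arguments.
time_route = Route(
    name="get_time",
    utterances=[
        "what is the time in new york city?",
        "what is the time in london?",
        "I live in Rome, what time is it?",
    ],
    function_schema=get_schema(get_time),
)

# Static routes: trigger on similar utterances only, no function call.
politics = Route(
    name="politics",
    utterances=[
        "don't you just love the president",
        "they're going to destroy this country!",
    ],
)
chitchat = Route(
    name="chitchat",
    utterances=[
        "how's the weather today?",
        "how are things going?",
    ],
)

routes = [politics, chitchat, time_route]
```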

But to initialize our route layer we do need an encoder, so we go ahead and initialize that. We're using the Hugging Face encoder here, which by default uses the Sentence Transformers all-MiniLM-L6-v2 model, a tiny, tiny model. So you can run this part on pretty much anything as well.
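For reference, initializing the encoder is a one-liner; as far as I know the default is the all-MiniLM-L6-v2 sentence-transformers model, but you can point it at any Hugging Face embedding model:

```python
from semantic_router.encoders import HuggingFaceEncoder

# Defaults to a small sentence-transformers model (all-MiniLM-L6-v2),
# so this runs comfortably on CPU.
encoder = HuggingFaceEncoder()
```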

Now we want to come over to here to begin initializing our Mistral 7B model. There's a little bit of explanation on what we're actually using here. We are going to simplify the way that you initialize a llama.cpp model, but for now this is how you do it, and we will still have this option.

So the idea is we'll probably make it so that if you don't pass in this LLM parameter, we will use default parameters when initializing it. But for those of you that do want to modify your parameters, you will be able to. So let's run this; I'm going to run it on the GPU.
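The initialization itself looks roughly like this; the import path and constructor arguments follow the library's example notebook at the time and may have moved in later versions:

```python
from llama_cpp import Llama
from semantic_router.llms.llamacpp import LlamaCppLLM

enable_gpu = True  # offload layers to the GPU via Metal; set False for CPU-only

# Underlying llama.cpp model (path matches the GGUF downloaded earlier).
_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf",
    n_gpu_layers=-1 if enable_gpu else 0,  # -1 offloads all layers
    n_ctx=2048,
)
_llm.verbose = False  # quieten llama.cpp's own logging

# Wrap it for Semantic Router.
llm = LlamaCppLLM(name="Mistral-7B-v0.2-Instruct", llm=_llm, max_tokens=None)
```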

OK, so I have that here, and then we can initialize our route layer. We have our encoder, the Hugging Face encoder; we have our routes that we defined before, so two static and one dynamic; and we have Mistral 7B. OK, cool, looks good. And now let's ask how's the weather today.

We see that we hit our static route, the chitchat route. Now let's ask what is the time in New York right now. OK, and you can see the grammars coming through in the logs here. I'm not actually sure how to stop those from being logged; I'm sure there must be a way, but we'll figure that out in a future release.
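Putting the pieces together, initializing the route layer and running those two queries looks roughly like this; the return values shown in comments are illustrative, and the exact shape of the returned object may differ between versions:

```python
from semantic_router import RouteLayer

# encoder, routes, llm and get_time come from the earlier snippets.
rl = RouteLayer(encoder=encoder, routes=routes, llm=llm)

# Static route: we just get the route name back, no function call.
rl("how's the weather today?")
# -> RouteChoice(name='chitchat', function_call=None)   (illustrative)

# Dynamic route: the grammar-constrained LLM extracts the timezone argument,
# which we can pass straight into get_time().
choice = rl("what is the time in new york right now?")
# -> RouteChoice(name='get_time', function_call={'timezone': 'America/New_York'})
get_time(**choice.function_call)  # e.g. "11:03"
```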

We have the time, and here in the UK it is 16:03, so that is correct. What's the time in Rome right now? I think they're an hour ahead; 17:03 is correct. Then I want to try something a little further out of the way. I think this is the question that GPT 3.5 actually struggled with quite a lot, which is surprising.

I would kind of expect it to be OK with this, but it really struggled. So, what is the time in Bangkok right now? I'm going to run this, and we get 23:04. I don't know the time in Bangkok off the top of my head, but checking it, 23:04 is correct. And then let's try the time in Phuket as well.

I wanted somewhere that's not a major city, because if you look at the time zone here it has Bangkok in the time zone name, so I want to try Phuket. And I'm actually not sure why, but this question takes way longer to answer than the others.

And yeah, I'm not 100% sure why that is, which is kind of interesting. Anyway, we're going to be waiting a little moment for this one. I will say that I didn't test GPT 3.5 on this exact question, but even if it could answer for Bangkok,

I feel like it would not have been able to answer for Phuket. Cool, so we come down here and we see Asia/Bangkok for the time zone, and we get the same time as before. Now let me just double check that Phuket and Bangkok are in the same time zone; I'm pretty sure they are.

OK, yeah, cool. So far so good. Now, we did just download the Mistral model, so if you do want to remove that from your computer you can just run this command down here. And yeah, that is everything; we have got a fully local semantic router running now.
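In a notebook that cleanup is just the following, assuming the same filename as the download step earlier:

```python
# Delete the downloaded GGUF to reclaim the disk space.
!rm ./mistral-7b-instruct-v0.2.Q4_0.gguf
```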

It works with llama.cpp, and it uses LLM grammars to make the performance of small models pretty good, as you can see. Alongside that we also have the new Hugging Face encoders, which means that any embedding model supported by Hugging Face is most likely supported by Semantic Router, unless it uses some sort of weird pooling mechanism, which most of them don't.

Most of them are pretty straightforward. So yeah, we can now use Semantic Router with a ton more models, and we can do all of that locally, which is pretty exciting. That's it for this video; I just wanted to show you this very quickly, so I will leave it there.

I hope this has all been interesting and useful. But for now thank you very much for watching and I'll see you again next time. Bye.