Llama.cpp for FULL LOCAL Semantic Router
00:00:00.000 |
Today we're going to take a look at the new features in Semantic Router that allow us to take everything fully local. 00:00:08.700 |
So we'll be using llama.cpp for the dynamic routes and the Hugging Face encoder for our routing decisions. 00:00:17.600 |
Now, one thing I really like about this is that using a very small model, so we're going to be using Mistral 7B, 00:00:24.700 |
we add grammars onto that, and using this format we seem to be getting much better results for agentic decision-making. 00:00:38.400 |
Now, all of this I'm going to be running from my M1 MacBook Pro, so it's not like I have anything crazy here. 00:00:50.600 |
Now I'm starting off in the Semantic Router library coming over to here and I'm going to download that onto my Mac. 00:01:00.300 |
Once that has been downloaded we should be able to open it and we'll see this 00:01:04.900 |
and what I'm going to do is just pip install semantic-router. 00:01:12.300 |
So I'm just switching across my terminal here. 00:01:15.700 |
If you have the local Git repo you can install the most recent version like this, 00:01:21.600 |
but I'm going to go ahead and install it from PyPI. 00:01:35.300 |
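For reference, the two install options look roughly like this in a notebook cell; the `[local]` extra is my assumption here for pulling in the dependencies that local models need:

```python
# Install from PyPI; the [local] extra (an assumption here) pulls in
# the dependencies needed for local models, e.g. llama-cpp-python.
!pip install -qU "semantic-router[local]"

# Or, from inside a local clone of the Git repo:
!pip install -qU .
```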
Now coming down to here, if you are on Mac you want to use this. 00:01:45.700 |
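The Mac-specific step is most likely building llama-cpp-python with Metal (Apple GPU) support; a sketch, assuming llama.cpp's LLAMA_METAL CMake flag:

```python
# Reinstall llama-cpp-python compiled with Metal acceleration on macOS.
!CMAKE_ARGS="-DLLAMA_METAL=on" pip install -qU llama-cpp-python
```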
I'm going to use the Mistral 7B Instruct model. 00:01:49.700 |
It's quantized so we can actually run this pretty easily. 00:01:55.100 |
You don't need much to run this and it runs surprisingly quickly. 00:02:01.200 |
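As a sketch of the download step, here's one way to pull a quantized GGUF build from the Hugging Face Hub; the repo ID and filename below are assumptions for illustration, and any 4-bit quantization of Mistral 7B Instruct should behave similarly:

```python
from huggingface_hub import hf_hub_download

# Download a ~4 GB 4-bit GGUF quantization of Mistral 7B Instruct.
# repo_id and filename are assumptions, not the video's exact choices.
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_0.gguf",
)
print(model_path)
```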
While we are waiting for that to download I'm going to come over to here 00:02:05.600 |
and I'll just point out the PR where we got this implemented. 00:02:11.200 |
So this one from Bogdan from Aurelio, super cool. 00:02:16.700 |
And there's one thing in particular I wanted to point out 00:02:20.200 |
which is that we use these LLM grammars here. 00:02:23.800 |
Now, the LLM grammars are essentially enforcing a particular structured output from your LLM, 00:02:32.500 |
which is a big part of why we can get very good performance from a very small model like Mistral 7B. 00:02:44.200 |
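To make that concrete, here is a minimal standalone sketch of the mechanism using llama-cpp-python directly. Semantic Router builds its grammar internally, so the toy GBNF grammar below is not its actual grammar; it just forces the model to emit a JSON object with a single timezone field:

```python
from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar: the sampler may only emit tokens that keep the
# output matching {"timezone": "<letters, slashes, underscores>"}.
grammar = LlamaGrammar.from_string(r'''
root ::= "{" ws "\"timezone\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z/_]* "\""
ws ::= [ \t\n]*
''')

llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf", n_ctx=2048)
out = llm(
    "Extract the IANA timezone for: what is the time in New York?\n",
    grammar=grammar,
    max_tokens=64,
)
print(out["choices"][0]["text"])  # e.g. {"timezone": "America/New_York"}
```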
I'm actually seeing better performance with this and Mistral 7B than I am with GPT-3.5. 00:02:52.900 |
Now that has been downloaded, so we have the Mistral model. 00:02:58.200 |
Now we'll come down to initializing our dynamic route 00:03:01.200 |
and you might recognize this from the previous example where we demoed a dynamic route. 00:03:08.400 |
I'm using the exact same one here but we're just going to swap out the OpenAI encoder 00:03:15.600 |
and the OpenAI LLM for a Hugging Face encoder and the Mistral 7B LLM. 00:03:22.200 |
Exact same definitions here so this is our dynamic route so the get time route 00:03:26.400 |
and we also have the static routes here as well so they are also in there. 00:03:32.500 |
I'm going to take all of those routes, and I can just drop the time route in here, 00:03:39.000 |
and we just put all of our routes in a list here 00:03:43.300 |
and we're going to use them soon to initialize our route layer. 00:03:47.200 |
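Roughly, the route definitions from the earlier dynamic-routes example look like this; the utterances are placeholders, and the `get_schema` import path and the `function_schema` keyword are assumptions that may differ between versions:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

from semantic_router import Route
from semantic_router.utils.function_call import get_schema  # path assumed

def get_time(timezone: str) -> str:
    """Finds the current time in a specific timezone.

    :param timezone: A valid IANA Time Zone Database name,
        e.g. "America/New_York" or "Europe/London".
    """
    now = datetime.now(ZoneInfo(timezone))
    return now.strftime("%H:%M")

# Dynamic route: the LLM fills in the timezone argument for get_time.
time_route = Route(
    name="get_time",
    utterances=[
        "what is the time in new york city?",
        "what is the time in london?",
    ],
    function_schema=get_schema(get_time),
)

# Static routes: placeholder utterances, just to show the shape.
politics = Route(name="politics", utterances=["don't you just hate the president"])
chitchat = Route(name="chitchat", utterances=["how's the weather today?"])

routes = [politics, chitchat, time_route]
```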
But to initialize our route layer we do need an encoder so we go ahead and we initialize that. 00:03:52.700 |
We're using the Hugging Face encoder here, which by default is the Sentence Transformers 00:03:58.700 |
all-MiniLM-L6-v2, which is a tiny, tiny model. 00:04:03.800 |
So you can also run this on pretty much anything as well. 00:04:08.100 |
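As a sketch, initializing it is a single line:

```python
from semantic_router.encoders import HuggingFaceEncoder

# Defaults to a small Sentence Transformers model (all-MiniLM-L6-v2),
# so it runs comfortably on CPU.
encoder = HuggingFaceEncoder()
```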
Now we want to come over to here to begin initializing our Mistral 7b model. 00:04:16.900 |
There's a little bit of explanation on what we're actually using here. 00:04:20.300 |
We are going to simplify the way that you initialize a llama.cpp model 00:04:26.900 |
but for now this is how you do it and we will still have this option. 00:04:30.500 |
So the idea is we'll probably make it so that if you don't pass in this LLM parameter 00:04:35.800 |
we will use default parameters when initializing it. 00:04:39.200 |
But for those of you that do want to modify your parameters you will be able to. 00:04:51.100 |
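A sketch of that initialization, assuming the LlamaCppLLM import path and wrapper signature; the Llama keyword arguments are standard llama-cpp-python parameters:

```python
from llama_cpp import Llama
from semantic_router.llms.llamacpp import LlamaCppLLM  # import path assumed

# The inner llama.cpp model, with all layers offloaded to the GPU
# (Metal on an M1) and a 2k context window.
_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
)
llm = LlamaCppLLM(name="Mistral-7B-Instruct", llm=_llm, max_tokens=None)
```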
OK so I have that here and then we can initialize our route layer. 00:04:54.700 |
OK so we have our encoder so the Hugging Face encoder. 00:04:57.900 |
We have our routes that we defined before, so two static, one dynamic. 00:05:12.600 |
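Putting those together, a sketch of the route layer (RouteLayer import path assumed):

```python
from semantic_router.layer import RouteLayer  # import path assumed

rl = RouteLayer(encoder=encoder, routes=routes, llm=llm)

# A chitchat-style query should land on the static chitchat route.
rl("how's the weather today?")
```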
We see that we hit our static route, the chitchat route. 00:05:16.300 |
Now let's ask "what is the time in New York right now?" 00:05:20.300 |
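In code that's just another call to the layer; executing the returned function call is up to us, and the `function_call` attribute name below is an assumption:

```python
out = rl("what is the time in new york right now?")
print(out)  # route "get_time" plus the arguments the LLM extracted

# Run the tool ourselves with the arguments the LLM filled in
# (attribute name and shape assumed; may differ between versions).
if out.function_call:
    print(get_time(**out.function_call))
```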
OK and you can see the grammars coming through here. 00:05:23.500 |
I'm not actually sure how to stop those from being logged 00:05:28.100 |
because I'm sure there must be a way but we'll figure that out in a future release. 00:05:32.500 |
We have the time, and here in the UK it is 16:03. 00:05:47.600 |
Then I want to try something a little further out of the way. 00:05:50.800 |
So I think this is the question that GPT-3.5 00:05:56.000 |
actually struggled with quite a lot, which is surprising. 00:05:58.800 |
I would kind of expect it to be OK with this, but it really struggled. 00:06:12.700 |
I don't know what the time is in Bangkok right now. 00:06:19.100 |
So that is correct. And then time in Phuket as well. 00:06:22.300 |
So I wanted somewhere that's not a main city, because if you look at the time zone here... 00:06:34.100 |
And then, I'm actually not sure why, but this command here, 00:06:39.800 |
or this question, takes way longer to answer than the others. 00:06:45.300 |
And yeah, I'm not 100% sure why that is, which is kind of interesting. 00:06:51.300 |
But anyway so we're going to be waiting a little moment for this one. 00:06:55.200 |
I will say that, for this question, I didn't actually test GPT-3.5 on it, 00:07:01.000 |
but even if it could answer for Bangkok, 00:07:04.900 |
I feel like it would not have been able to answer for Phuket. 00:07:09.900 |
Cool. So we come down here. We see Asia/Bangkok for the time zone. 00:07:21.200 |
Now let me just double check that they are in the same time zone. 00:07:25.500 |
I'm pretty sure they are. OK. Yeah. Cool. So far so good. 00:07:29.300 |
And then, we just downloaded the Mistral model. 00:07:32.300 |
So if you do want to remove that from your computer, you can just run this command down here. 00:07:37.500 |
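That cleanup is just deleting the GGUF file; the filename here is the same assumption used in the sketches above:

```python
import os

# Delete the downloaded quantized model to free up disk space.
os.remove("./mistral-7b-instruct-v0.2.Q4_0.gguf")
```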
And yeah that is everything. So we have got a fully local semantic router running now. 00:07:45.200 |
It works with llama.cpp, and it uses LLM grammars to make the performance of small models pretty good, as you can see. 00:07:54.800 |
Then alongside that, we also have the new Hugging Face encoder, which means that for any embedding model that is supported by Hugging Face, 00:08:03.000 |
we most likely support it, unless it uses some sort of weird pooling mechanism, which most of them don't. 00:08:08.800 |
Most of them are pretty straightforward. So, yeah, we can now use Semantic Router with a ton more models. 00:08:17.100 |
And we can do all that locally which is pretty exciting. 00:08:19.400 |
So that's it for this video. I just wanted to show you this very quickly. So I will leave it there. 00:08:24.500 |
I hope this has all been interesting and useful. 00:08:26.500 |
But for now thank you very much for watching and I'll see you again next time.