Today we're going to be taking a look at how we can improve the accuracy of our routes within Semantic Router. What we've added to the library is the ability to modify your thresholds on a route-by-route basis. Previously there was a single threshold for the whole layer; now you have an individual threshold per route.
And of course, if you have many routes, modifying each of those thresholds one by one would quickly become tedious. So we've also added training methods so that you can optimize your route thresholds with a few lines of code, which is great. So let's jump straight into it.
I'm going to go over to the Semantic Router library, go to the docs, and then go down to this threshold optimization notebook, and we'll open that in Colab. Now, we're going to be using a local model, so just the local encoder in this example, because we're using static routes only.
That does mean we can also speed things up by using a GPU. So in Colab, I'm going to switch across to a T4 GPU. That will speed things up, although it is very fast anyway, literally a couple of seconds. But nonetheless, if you're doing this for many routes and you have a fairly large training set, then you might want to turn the GPU on.
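For reference, the install step looks something like this. I'm assuming the `[local]` extra here for Hugging Face encoder support, so check the notebook for the exact command and version pin:

```python
# in Colab: install semantic-router with local (Hugging Face) encoder support
# (the [local] extra is an assumption; the notebook has the exact install line)
!pip install -qU "semantic-router[local]"
```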
Okay, so that has installed, and we can come down to here where we're just defining our route layer. We have a few routes in here to give us a bit more to optimize with, and they're pretty lightweight. One thing that I would recommend: yes, you can optimize using just the method that we have here, but what you really want to be doing as well is adding utterances whenever you see a query that doesn't trigger the route you would expect it to. You should be adding those to your routes, and you should probably be adding more utterances in general, maybe even breaking routes apart into separate routes if you're seeing that they don't work so well.
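Roughly, the route definitions look like this; the utterances below are illustrative, not the exact ones from the notebook:

```python
from semantic_router import Route

# each route is just a name plus example utterances that should trigger it
politics = Route(
    name="politics",
    utterances=[
        "what do you think of the current government?",
        "don't you just love politics?",
    ],
)
chitchat = Route(
    name="chitchat",
    utterances=[
        "how's the weather today?",
        "lovely day, isn't it?",
    ],
)

routes = [politics, chitchat]  # plus mathematics, biology, etc. in the notebook
```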
The other thing, as you'll see in a moment, is that we have a training data set, and we can modify that to improve performance as well. We'll get to that shortly. So next we're going to initialize our encoder, and I'm going to use a different one from the default.
The default encoder that we use is MiniLM, which is a pretty old model, to be honest, but it's very small and efficient, so we just have it as the default. At some point that will probably change, maybe to this model, which is more recent, still very small, but should generally give you better performance.
So yeah, we're going to switch across to that. It's the E5 base v2 model, and that will download quickly. OK, once that has downloaded, we can come down and initialize our route layer. To initialize a route layer, we just need our routes and our encoder, both of which we have.
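Something along these lines, assuming the `HuggingFaceEncoder` takes the model name via its `name` parameter:

```python
from semantic_router import RouteLayer
from semantic_router.encoders import HuggingFaceEncoder

# swap the default MiniLM encoder for the more recent E5 base v2 model
encoder = HuggingFaceEncoder(name="intfloat/e5-base-v2")

# a route layer only needs the routes and the encoder
rl = RouteLayer(encoder=encoder, routes=routes)
```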
And there we go, we have our route layer. Now let's try it with these queries. This one should go to politics, this one to chit chat, this one to, I think we had a biology question, and this one should just return none. OK, and you can see we actually get three out of four.
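The checks are just direct calls on the route layer, roughly like this (the queries are illustrative, and I'm assuming the returned route choice exposes a `.name` attribute):

```python
for query in [
    "don't you just love politics?",   # expect: politics
    "how's the weather today?",        # expect: chitchat
    "how do cells divide?",            # expect: biology
    "I'm looking for a new phone",     # expect: None
]:
    print(query, "->", rl(query).name)
```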
This one actually goes to chit chat, where it shouldn't. So OK, let's take that and try to improve what we have. First, I'm just going to show you the evaluate method. This is the format that we use for both the evaluate and the fit methods.
You can see that we have these (x, y) pairs. The x values are the utterances, the input data for our fit method, and the y values are the labels, i.e. the intended routes that they should trigger. I just like to keep it in this format when I'm going through it, because it feels easier than creating two separate lists.
So I create this test data set, which is just a list of tuples, and then I unpack it, so we have our utterances here and our labels here, and then we evaluate to see what we get. Now, if I run this, we actually get 75%. So not that great, obviously.
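As a sketch, assuming `evaluate` takes `X` and `y` keyword arguments and returns accuracy as a fraction:

```python
test_data = [
    ("don't you just love politics?", "politics"),
    ("how's the weather today?", "chitchat"),
    ("how do cells divide?", "biology"),
    ("I'm looking for a new phone", None),
]

# unpack the (utterance, label) tuples into utterances X and labels y
X, y = zip(*test_data)

accuracy = rl.evaluate(X=list(X), y=list(y))
print(f"Accuracy: {accuracy * 100:.2f}%")
```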
We can improve that. Actually, I think with MiniLM we do get perfect accuracy on these, but I wouldn't read too much into that. What we need to do now is create a test data set like the one we did here, but bigger. So that's what I do here.
When you're creating this, one thing you can do, obviously, is just use an LLM to generate it for you; so GPT-4, ask it to generate a set of these. We have politics, chit-chat, mathematics, biology, and we have these None-labeled ones as well, and we should probably add some more of those.
These are the utterances that shouldn't be classified as anything, and we add them in because, if we only have utterances for the named routes, the similarity thresholds can just decrease to capture more area. And that means it might work on this test data set, but maybe not when we have new queries coming in.
One thing I would also recommend doing: OK, we have mathematics, politics, chit-chat. Let's add some more test data that is similar to those routes, but that we don't actually want classified as those routes. That basically makes it harder for the training function to get a high accuracy.
And that's a good thing, because we're pushing the model more; it needs to try a bit harder to get something good. So I'm going to write a few very quickly. OK, so I've added these five here. You can see this one is kind of similar to biology, and these two are similar to the mathematics tests that we have.
This one is kind of similar to the chit-chat ones, and this one, obviously, to politics. But at least for me, they don't quite fit into those routes; they're very similar, just not quite there. So that should be good enough, I think, to get some reasonable performance. It won't be anything incredible, I don't think, but we should get something.
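The additions are None-labeled utterances that sit close to an existing route but shouldn't trigger it; again, these examples are my own illustrations rather than the exact ones from the notebook:

```python
# hard negatives: similar to a route's topic, but should still return None
test_data += [
    ("what's your favourite animal?", None),       # near biology
    ("how much does a maths tutor cost?", None),   # near mathematics
    ("write me a poem about the weekend", None),   # near chitchat
    ("who was the first US president?", None),     # near politics
]
```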
Now let's calculate the accuracy using the default thresholds that we have. You can see that with MiniLM this was actually 34.85%, which is pretty low. What you find is that different models just have different similarity ranges, so the point at which something counts as similar versus not similar sits at a different threshold for each model.
So you can get wildly different results here. With this one we're actually probably not in too bad a place; we get 76.06 there. Now let's come down and see what we have for the default route thresholds. You can see they're all just 0.5. That's coming from the Hugging Face encoder; it's just the default score threshold that we have set there.
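Checking the baseline looks roughly like this, assuming the layer exposes a `get_thresholds` helper that returns the per-route values:

```python
# accuracy on the larger set with the untrained, default thresholds
X, y = zip(*test_data)
print(f"Accuracy: {rl.evaluate(X=list(X), y=list(y)) * 100:.2f}%")

# the defaults themselves — 0.5 per route for the Hugging Face encoder
print(rl.get_thresholds())
```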
Now we have our train data, so the X, and we have our labels, which is y. Then we just train. By default, it will go over 500 iterations, or steps, of training. We can try that; I'm not sure we need 500, but let's see.
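The training call itself is just `fit` on the same X/y format (500 steps is the default as far as I can tell, so I leave it unset here):

```python
# optimize the per-route thresholds against our labeled data
rl.fit(X=list(X), y=list(y))
```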
Okay, so it's pretty quick. You can see the accuracy increasing over here, and we've got up to 82. Nothing special, but I tend to find that with these smaller open source models it does tend to be a bit lower. So let's go with that. And then we can see the updated route thresholds are these.
Interestingly, mathematics is incredibly low here, which makes me think maybe we need to add some more None-labeled utterances close to mathematics. I'm not sure, but that's something I would probably consider trying. The rest seem reasonable, though. So probably somewhere between around 0.60 and 0.75 is where the similar-versus-not-similar boundary sits for this model.
And as I mentioned, that will vary depending on which model you're using. So yeah, that is really it. We can just have a look at the evaluation again, and we get 81.69. OK, so that is it for this very quick introduction to the optimization function here. You can also try this with other models.
So for example, Ada-002, which is not quite the latest OpenAI embedding model anymore, and also Cohere, I think, would both go up to about 92% after training on the same dataset. Maybe there are some differences between them, but nothing huge. So the model does matter a lot here, but so does how we're optimizing.
We can obviously evaluate and fit, but realistically we should also be adding new utterances to our routes, and we should be adding more data to our test set as we go along. We really want to be iterating on this and improving over time, rather than just hoping that one fit will be enough.
So yeah, that's it for now. I hope this has been useful and interesting. Thank you very much for watching, and I will see you again in the next one. Bye.