AI Decision Making — Optimizing Routes

00:00:00.000 | Today we're going to be taking a look at how we can improve the accuracy of our routes

00:00:07.140 | within Semantic Router.

00:00:08.920 | So what we've added to the library is the ability to modify your thresholds on a route

00:00:15.760 | by route basis.

00:00:17.640 | So previously it's the whole layer, now you have individual thresholds per route.

00:00:23.020 | And of course, if you have many routes, modifying each one of those thresholds one by one, it

00:00:28.280 | would probably become quite tedious.

00:00:31.220 | So we've also added in training methods so that you can optimize your route values with

00:00:38.240 | a few lines of code, which is great.

00:00:40.400 | So let's jump straight into it.

00:00:42.840 | I'm going to go over to the Semantic Router library, I'm going to go to the docs, and

00:00:49.040 | then I'm going to go down to this threshold optimization notebook.

00:00:54.120 | And we'll open that in Colab.

00:00:56.240 | Now we're going to be using the local model.

00:00:59.160 | So just the local encoder in this example, because we're using static routes only.

00:01:04.040 | But that does mean that we can also speed up by using a GPU.

00:01:09.680 | So in Colab, I'm going to switch across to a T4 GPU, that will speed things up, although

00:01:14.120 | it is very fast anyway, literally a couple of seconds.

00:01:18.800 | But nonetheless, you know, if you're doing this for many routes, and you have a fairly

00:01:22.320 | large training set, then you might want to turn GPU on.

00:01:26.840 | Okay, so that has installed.

00:01:29.200 | And we can come down to here where we're just defining our route layer.

00:01:33.840 | We have a few routes in here to give us, you know, more to optimize with.

00:01:38.880 | So they're pretty lightweight.

00:01:41.200 | One thing that I would recommend, okay, you can optimize using just the method that we

00:01:45.960 | have here.

00:01:46.960 | But really, what you want to be doing, as well, is you want to be adding utterances

00:01:52.240 | that you see, you know, where it doesn't trigger the route you would expect it to, you should

00:01:56.560 | be adding those to your routes.

00:01:59.240 | And you should probably be adding just generally more utterances, maybe breaking apart routes

00:02:05.800 | into different routes, if you're seeing that they don't work so well.

00:02:09.600 | And the other thing, as you'll see in a moment is that we have a training data set.

00:02:13.920 | And we can modify that to improve the performance, as well.

00:02:19.360 | We'll get around to that in a moment.

00:02:21.920 | So with our encoder, we are going to initialize our encoder, and I'm going to use a different

00:02:26.720 | one.

00:02:27.720 | So the default encoder that we use is MiniLM, so it's a pretty old model, to be honest.

00:02:33.720 | But it's very small and efficient, so we just have it as the default.

00:02:37.280 | But at some point, that will probably change maybe to this model.

00:02:40.860 | So this is like a more recent, still very small, but generally, you should be able to

00:02:47.360 | get better performance with this model.

00:02:50.000 | So yeah, we're going to switch across to that.

00:02:53.720 | So it's E5 base V2 model, and that will just download quickly.

00:02:57.560 | OK, now once that has downloaded, we can come down and initialize our route layer.

00:03:02.160 | So to initialize a route layer, we just need our routes and our encoder, both of which

00:03:07.080 | we have.

00:03:08.080 | And there we go.

00:03:09.120 | So we have our route layer.

00:03:10.920 | Now let's try with these queries.

00:03:14.160 | So this one should go to politics, this one to chit chat, this one to, I think we had

00:03:20.600 | a biology question, and this one should just be none.

00:03:24.080 | OK, and you can see, actually, we get three of four.

00:03:31.560 | This one actually goes to chit chat, where it shouldn't.

00:03:34.800 | So OK, let's take that, and let's try and improve what we have.

00:03:42.520 | So first, I'm just going to show you the evaluation or evaluate method.

00:03:47.680 | So this is a format that we use for both the evaluate and the fit methods.

00:03:53.200 | So you see that we have these x, x, y, y.

00:03:56.520 | This refers to the utterances that are the input data for our fit method, whereas these,

00:04:06.200 | so these are the labels, like the intended routes that they should trigger.

00:04:11.940 | And I just like to keep it like this when I'm going through it, so it feels a bit easier

00:04:16.840 | than creating two separate lists.

00:04:19.160 | So I create this test data set, which is just a list of the tuples, and then I unpack that.

00:04:26.040 | So we have our utterances here, our labels here, and then we're going to evaluate to

00:04:30.560 | see what we get.

00:04:31.560 | Now, if I run this, we actually get 75%, OK?

00:04:35.440 | So not that great, obviously.

00:04:40.020 | We can improve that, and you can see with the-- actually, I think it's with MiniLM,

00:04:44.040 | we do actually get perfect accuracy, but I think it's bad show with this.

00:04:50.800 | Now what we need to do is create a test data set.

00:04:54.920 | So what we did here, but bigger.

00:04:57.840 | So that's what I do here.

00:04:59.960 | When you're creating this, one thing that you can do, obviously, is just using an LLM

00:05:04.320 | to generate this for you.

00:05:05.400 | So gpt4, ask it to generate a set of these.

00:05:09.240 | So we have politics, chit-chat, mathematics, biology, and we have these as well.

00:05:16.200 | And we should probably add some more of those.

00:05:18.240 | So these are the routes that shouldn't be classified as anything, and we add those in

00:05:24.640 | there because we-- well, if we just kind of have named routes, it's always going to choose--

00:05:32.000 | like the similarity thresholds can just increase or decrease, sorry, to capture more area.

00:05:39.800 | And that means that it might work on this test data set, but maybe not when we have

00:05:45.240 | new queries coming in.

00:05:47.160 | So one thing that I would also recommend doing is, OK, we have mathematics, politics, chit-chat.

00:05:53.520 | Let's add some more routes here or more test data that is kind of similar to those other

00:05:59.760 | ones, but we don't actually want it to be classified as those other ones.

00:06:03.400 | That will basically just make it harder for the model, the training function, to get a

00:06:09.280 | high accuracy.

00:06:10.480 | And that's a good thing because we're kind of pushing the model more.

00:06:13.800 | It needs to try a bit harder to get something good.

00:06:17.600 | So I'm going to write a few very quickly.

00:06:21.200 | OK, so I've added these five here.

00:06:24.720 | You can see, OK, kind of similar to biology.

00:06:27.260 | These two are similar to the mathematics routes or tests that we have.

00:06:31.920 | This one, kind of similar to the chit-chat one, and this one, obviously, to politics.

00:06:36.520 | But at least for me, they don't quite fit into those.

00:06:40.160 | So they're very similar, just not quite there.

00:06:42.700 | So that should be good enough, I think, to get some reasonable performance.

00:06:47.480 | It won't be anything incredible, I don't think, but we should get something.

00:06:53.260 | Now let's try and calculate the accuracy using the default thresholds that we have.

00:07:02.620 | So you can see with MiniLM, that was actually 34.85, which is pretty low.

00:07:08.200 | And what you find is that different models just have different similarity thresholds

00:07:13.420 | where it's kind of like something is either similar if it's slightly higher or something

00:07:17.400 | is just not if it's slightly lower.

00:07:19.800 | So you get wildly different results here.

00:07:22.620 | So this one, you actually, you know, we're probably not in a too bad a place.

00:07:26.780 | We get 76.06 there.

00:07:30.200 | Now let's come down and let's see what we have for the default routes.

00:07:35.460 | So you can see they're all just 0.5.

00:07:38.720 | That's coming from the Hugging Face encoder, it's just a default score threshold that we

00:07:42.880 | have set there.

00:07:44.360 | Now we have our train data, so the X, and we have our labels, which is Y.

00:07:50.240 | So then we just train.

00:07:53.000 | And by default, it will go over 500 iterations of training or steps.

00:08:01.200 | We can try that.

00:08:02.760 | I'm not sure we need 500, but let's see.

00:08:05.080 | Okay.

00:08:06.080 | So it's pretty quick.

00:08:07.080 | You see the accuracy increasing over here.

00:08:10.440 | And we've got up to 82.

00:08:12.800 | So nothing special, but I tend to find with these smaller open source models, it does

00:08:18.120 | tend to be a bit lower.

00:08:19.520 | So let's try that.

00:08:21.360 | Okay.

00:08:22.360 | And then we can see the update route thresholds are these.

00:08:27.040 | So interestingly, mathematics is incredibly low here, which would make me think maybe

00:08:33.060 | we need to add some non-routes to that.

00:08:38.740 | I'm not sure, but that's something that I would probably consider trying.

00:08:43.240 | But yeah, the rest seem reasonable.

00:08:45.760 | Okay.

00:08:46.760 | So probably around this 75 down to around 60 is where we have that sort of similar to

00:08:57.020 | or not similar up to similar threshold for this model.

00:09:01.440 | And as I mentioned, that will vary depending on which model you're using.

00:09:06.660 | So yeah, I mean, that is it really.

00:09:10.720 | We can just have a look at the valuation again.

00:09:13.680 | We get an 81.69.

00:09:15.680 | Okay.

00:09:17.120 | So that is it for this very quick introduction to the optimization function here.

00:09:23.600 | You can also try this with other models.

00:09:25.680 | So for example, R002, which is not quite the latest OpenAI embedding model anymore.

00:09:34.020 | And also Cohere, I think as well, they would both go up to about 92% after training on

00:09:40.960 | the same dataset.

00:09:42.600 | Maybe there's some differences, but not a huge number.

00:09:46.500 | So the model does matter a lot here, but also how we're optimizing.

00:09:53.040 | So we can obviously evaluate and fit, but realistically, we also should be adding new

00:09:59.320 | utterances to our routes, and we should also be adding more data to our test data as we

00:10:05.480 | go along.

00:10:06.600 | We really want to be iterating on this and sort of improving over time rather than just

00:10:13.280 | hoping that it will fit and then that's it.

00:10:16.560 | So yeah, I mean, that's it for now.

00:10:20.160 | I hope this has been useful and interesting.

00:10:23.960 | So thank you very much for watching, and I will see you again in the next one.

00:10:28.940 | Bye.

00:10:36.540 | Bye.

00:10:37.540 | Bye.

00:10:38.540 | Bye.

00:10:39.540 | Bye.

00:10:40.540 | Bye.

00:10:40.540 | you

00:10:42.600 | you