back to index

AI Decision Making — Optimizing Routes


Whisper Transcript | Transcript Only Page

00:00:00.000 | Today we're going to be taking a look at how we can improve the accuracy of our routes
00:00:07.140 | within Semantic Router.
00:00:08.920 | So what we've added to the library is the ability to modify your thresholds on a route
00:00:15.760 | by route basis.
00:00:17.640 | So previously it's the whole layer, now you have individual thresholds per route.
00:00:23.020 | And of course, if you have many routes, modifying each one of those thresholds one by one, it
00:00:28.280 | would probably become quite tedious.
00:00:31.220 | So we've also added in training methods so that you can optimize your route values with
00:00:38.240 | a few lines of code, which is great.
00:00:40.400 | So let's jump straight into it.
00:00:42.840 | I'm going to go over to the Semantic Router library, I'm going to go to the docs, and
00:00:49.040 | then I'm going to go down to this threshold optimization notebook.
00:00:54.120 | And we'll open that in Colab.
00:00:56.240 | Now we're going to be using the local model.
00:00:59.160 | So just the local encoder in this example, because we're using static routes only.
00:01:04.040 | But that does mean that we can also speed up by using a GPU.
00:01:09.680 | So in Colab, I'm going to switch across to a T4 GPU, that will speed things up, although
00:01:14.120 | it is very fast anyway, literally a couple of seconds.
00:01:18.800 | But nonetheless, you know, if you're doing this for many routes, and you have a fairly
00:01:22.320 | large training set, then you might want to turn GPU on.
00:01:26.840 | Okay, so that has installed.
00:01:29.200 | And we can come down to here where we're just defining our route layer.
00:01:33.840 | We have a few routes in here to give us, you know, more to optimize with.
00:01:38.880 | So they're pretty lightweight.
00:01:41.200 | One thing that I would recommend, okay, you can optimize using just the method that we
00:01:45.960 | have here.
00:01:46.960 | But really, what you want to be doing, as well, is you want to be adding utterances
00:01:52.240 | that you see, you know, where it doesn't trigger the route you would expect it to, you should
00:01:56.560 | be adding those to your routes.
00:01:59.240 | And you should probably be adding just generally more utterances, maybe breaking apart routes
00:02:05.800 | into different routes, if you're seeing that they don't work so well.
00:02:09.600 | And the other thing, as you'll see in a moment is that we have a training data set.
00:02:13.920 | And we can modify that to improve the performance, as well.
00:02:19.360 | We'll get around to that in a moment.
00:02:21.920 | So with our encoder, we are going to initialize our encoder, and I'm going to use a different
00:02:27.720 | So the default encoder that we use is MiniLM, so it's a pretty old model, to be honest.
00:02:33.720 | But it's very small and efficient, so we just have it as the default.
00:02:37.280 | But at some point, that will probably change maybe to this model.
00:02:40.860 | So this is like a more recent, still very small, but generally, you should be able to
00:02:47.360 | get better performance with this model.
00:02:50.000 | So yeah, we're going to switch across to that.
00:02:53.720 | So it's E5 base V2 model, and that will just download quickly.
00:02:57.560 | OK, now once that has downloaded, we can come down and initialize our route layer.
00:03:02.160 | So to initialize a route layer, we just need our routes and our encoder, both of which
00:03:07.080 | we have.
00:03:08.080 | And there we go.
00:03:09.120 | So we have our route layer.
00:03:10.920 | Now let's try with these queries.
00:03:14.160 | So this one should go to politics, this one to chit chat, this one to, I think we had
00:03:20.600 | a biology question, and this one should just be none.
00:03:24.080 | OK, and you can see, actually, we get three of four.
00:03:31.560 | This one actually goes to chit chat, where it shouldn't.
00:03:34.800 | So OK, let's take that, and let's try and improve what we have.
00:03:42.520 | So first, I'm just going to show you the evaluation or evaluate method.
00:03:47.680 | So this is a format that we use for both the evaluate and the fit methods.
00:03:53.200 | So you see that we have these x, x, y, y.
00:03:56.520 | This refers to the utterances that are the input data for our fit method, whereas these,
00:04:06.200 | so these are the labels, like the intended routes that they should trigger.
00:04:11.940 | And I just like to keep it like this when I'm going through it, so it feels a bit easier
00:04:16.840 | than creating two separate lists.
00:04:19.160 | So I create this test data set, which is just a list of the tuples, and then I unpack that.
00:04:26.040 | So we have our utterances here, our labels here, and then we're going to evaluate to
00:04:30.560 | see what we get.
00:04:31.560 | Now, if I run this, we actually get 75%, OK?
00:04:35.440 | So not that great, obviously.
00:04:40.020 | We can improve that, and you can see with the-- actually, I think it's with MiniLM,
00:04:44.040 | we do actually get perfect accuracy, but I think it's bad show with this.
00:04:50.800 | Now what we need to do is create a test data set.
00:04:54.920 | So what we did here, but bigger.
00:04:57.840 | So that's what I do here.
00:04:59.960 | When you're creating this, one thing that you can do, obviously, is just using an LLM
00:05:04.320 | to generate this for you.
00:05:05.400 | So gpt4, ask it to generate a set of these.
00:05:09.240 | So we have politics, chit-chat, mathematics, biology, and we have these as well.
00:05:16.200 | And we should probably add some more of those.
00:05:18.240 | So these are the routes that shouldn't be classified as anything, and we add those in
00:05:24.640 | there because we-- well, if we just kind of have named routes, it's always going to choose--
00:05:32.000 | like the similarity thresholds can just increase or decrease, sorry, to capture more area.
00:05:39.800 | And that means that it might work on this test data set, but maybe not when we have
00:05:45.240 | new queries coming in.
00:05:47.160 | So one thing that I would also recommend doing is, OK, we have mathematics, politics, chit-chat.
00:05:53.520 | Let's add some more routes here or more test data that is kind of similar to those other
00:05:59.760 | ones, but we don't actually want it to be classified as those other ones.
00:06:03.400 | That will basically just make it harder for the model, the training function, to get a
00:06:09.280 | high accuracy.
00:06:10.480 | And that's a good thing because we're kind of pushing the model more.
00:06:13.800 | It needs to try a bit harder to get something good.
00:06:17.600 | So I'm going to write a few very quickly.
00:06:21.200 | OK, so I've added these five here.
00:06:24.720 | You can see, OK, kind of similar to biology.
00:06:27.260 | These two are similar to the mathematics routes or tests that we have.
00:06:31.920 | This one, kind of similar to the chit-chat one, and this one, obviously, to politics.
00:06:36.520 | But at least for me, they don't quite fit into those.
00:06:40.160 | So they're very similar, just not quite there.
00:06:42.700 | So that should be good enough, I think, to get some reasonable performance.
00:06:47.480 | It won't be anything incredible, I don't think, but we should get something.
00:06:53.260 | Now let's try and calculate the accuracy using the default thresholds that we have.
00:07:02.620 | So you can see with MiniLM, that was actually 34.85, which is pretty low.
00:07:08.200 | And what you find is that different models just have different similarity thresholds
00:07:13.420 | where it's kind of like something is either similar if it's slightly higher or something
00:07:17.400 | is just not if it's slightly lower.
00:07:19.800 | So you get wildly different results here.
00:07:22.620 | So this one, you actually, you know, we're probably not in a too bad a place.
00:07:26.780 | We get 76.06 there.
00:07:30.200 | Now let's come down and let's see what we have for the default routes.
00:07:35.460 | So you can see they're all just 0.5.
00:07:38.720 | That's coming from the Hugging Face encoder, it's just a default score threshold that we
00:07:42.880 | have set there.
00:07:44.360 | Now we have our train data, so the X, and we have our labels, which is Y.
00:07:50.240 | So then we just train.
00:07:53.000 | And by default, it will go over 500 iterations of training or steps.
00:08:01.200 | We can try that.
00:08:02.760 | I'm not sure we need 500, but let's see.
00:08:05.080 | Okay.
00:08:06.080 | So it's pretty quick.
00:08:07.080 | You see the accuracy increasing over here.
00:08:10.440 | And we've got up to 82.
00:08:12.800 | So nothing special, but I tend to find with these smaller open source models, it does
00:08:18.120 | tend to be a bit lower.
00:08:19.520 | So let's try that.
00:08:21.360 | Okay.
00:08:22.360 | And then we can see the update route thresholds are these.
00:08:27.040 | So interestingly, mathematics is incredibly low here, which would make me think maybe
00:08:33.060 | we need to add some non-routes to that.
00:08:38.740 | I'm not sure, but that's something that I would probably consider trying.
00:08:43.240 | But yeah, the rest seem reasonable.
00:08:45.760 | Okay.
00:08:46.760 | So probably around this 75 down to around 60 is where we have that sort of similar to
00:08:57.020 | or not similar up to similar threshold for this model.
00:09:01.440 | And as I mentioned, that will vary depending on which model you're using.
00:09:06.660 | So yeah, I mean, that is it really.
00:09:10.720 | We can just have a look at the valuation again.
00:09:13.680 | We get an 81.69.
00:09:15.680 | Okay.
00:09:17.120 | So that is it for this very quick introduction to the optimization function here.
00:09:23.600 | You can also try this with other models.
00:09:25.680 | So for example, R002, which is not quite the latest OpenAI embedding model anymore.
00:09:34.020 | And also Cohere, I think as well, they would both go up to about 92% after training on
00:09:40.960 | the same dataset.
00:09:42.600 | Maybe there's some differences, but not a huge number.
00:09:46.500 | So the model does matter a lot here, but also how we're optimizing.
00:09:53.040 | So we can obviously evaluate and fit, but realistically, we also should be adding new
00:09:59.320 | utterances to our routes, and we should also be adding more data to our test data as we
00:10:05.480 | go along.
00:10:06.600 | We really want to be iterating on this and sort of improving over time rather than just
00:10:13.280 | hoping that it will fit and then that's it.
00:10:16.560 | So yeah, I mean, that's it for now.
00:10:20.160 | I hope this has been useful and interesting.
00:10:23.960 | So thank you very much for watching, and I will see you again in the next one.