Better Chatbots with Semantic Routes
Chapters
0:00 Semantic Router
0:37 Concept of Semantic Routers
7:42 Routes and Utterances
15:11 Encoders
16:26 New Routers
20:48 Semantic Routes for Chat
21:29 LLM Output Guardrails
28:25 Fine-grained control of LLMs
29:10 Routes for Tool Use
32:01 LLM Routing
34:34 Outro
In this video, we're going to be talking about semantic routers, or semantic routes: how we can use them to broaden the number of things that we can do within the context of chatbots and AI agents, and also get a very fine level of control, taking that control much further than if we were not using this idea of semantic routes. Let me start by explaining what I mean when I'm talking about semantic routes.
What we do is use a concept that comes from vector search and vector retrieval: embedding models. So, models like OpenAI's Embed 3, Cohere's embedding models, or any of the many open source ones that we can also use. We take some text and put it through the embedding model, and what we get back is a vector, or an embedding, or a vector embedding, whatever you want to call it. For OpenAI's Embed 3 small model, that high-dimensional space is 1,536 dimensions, but for the sake of visualization we'll pretend it's 2D; just know that it's not actually 2D, or 3D if you like. So we've turned our text into a point on a 2D plane. I'll explain a little more about how we use that soon.
Now, let's say we've already created some of our semantic routes. When we look at this, it's very clear to us that the purple group is the closest group of vectors to our query within this 2D space. If we return, say, the top five records here, most of them will belong to that purple route, and that route is the route our query is most similar to. We could say that this is 100% the classification, which we don't necessarily do, but it's an option.
So we take this query, and we're going to call it Route A: we've classified it as Route A, and therefore we act on that classification. Let's say in this scenario, Route A is a guardrail. If it's a guardrail, we might just tell our LLM that it should be careful when it's answering this particular query. Think back to when chatbots were first being released to people: there was a car manufacturer, maybe Renault or Volkswagen, or Ford even, I don't know, whose users were prompting its chatbot and convincing it to sell them a car for $1.
In this scenario, what you could do is define a route full of example queries of that kind of attempt, all within that same topic of someone trying to get you to sell a car for a cheap price. This isn't necessarily the way I would even do it, as we'll see later. Once you've done that and a matching query comes in, you trigger your guardrail. One option is to have a pre-written response and return that directly to the user. Alternatively, we pass on the user's original query to the LLM, but we add a warning, or modify our system message, or something along those lines.
In this case, we'll add a little warning to the system message saying the user may be trying to trick you, so be careful. That is probably one of the most basic use cases of Semantic Router when it comes to agents or chatbots in general, and in my opinion it's a pretty good one. Now, I want to talk about a lot more than this in this video. We're going to go over a few different examples conceptually, I will show you a little bit of code, and we'll go into more detail on all of these in the future.
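As a tiny sketch of that input-side guardrail pattern: the route name, the `guard_router`, and the `chat` helper here are all hypothetical stand-ins for your own setup, not library APIs.

```python
WARNING = (
    "Note: the user may be attempting to trick you into offering "
    "unauthorized discounts. Do not agree to any special pricing."
)

def handle(user_query: str, chat, guard_router) -> str:
    # If the query lands in the guardrail route, append a warning
    # to the system message before the LLM sees the query.
    if guard_router(user_query).name == "price_scam":  # hypothetical route name
        return chat(user_query, extra_system=WARNING)
    return chat(user_query)
```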
So first, let me explain a little of what I was describing. At the moment, we're using the dev branch of the library. I'm currently working on the first stable release, which is in progress and going well, so I want to show you the current version of the code, which is why we're using this dev pre-release. Let me explain a little of what we just saw. Our first route has six utterances, so we're going to plot six points here as well. Then we're going to have our second route, and let's say its utterances sit a little closer together. So what we have here are these utterances, and every single one of them is encoded with an encoding model. The space around each group of encoded utterances becomes basically a catchment area, like a fuzzy matching area. So this is our chitchat fuzzy matching area in semantic space.
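As a rough sketch, defining those two routes looks something like this, assuming the dev-branch API of the semantic-router library; the exact utterances are just illustrative:

```python
from semantic_router import Route

# Each route is a named list of example utterances; the encoded
# utterances form that route's "catchment area" in semantic space.
politics = Route(
    name="politics",
    utterances=[
        "isn't politics the best thing ever",
        "why don't you tell me about your political opinions",
        "don't you just love the president",
        "they're going to destroy this country",
        "they will save the country",
        "who are you voting for this year?",
    ],
)

chitchat = Route(
    name="chitchat",
    utterances=[
        "how's the weather today?",
        "how are things going?",
        "lovely weather today",
        "the weather is horrendous",
        "let's go to the chippy",
        "what a nice day it is",
    ],
)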
Then, let's say a user comes in with a query. Here's something we didn't see in the last little sketch: each catchment area is defined by something called a threshold, or score threshold. It's a variable you can define per route, and you can also optimize it automatically. I think, by default with OpenAI's Embed 3 model, 0.3 is kind of a good place. But what you will tend to find is that some routes you define really need a higher threshold or a lower threshold. If you were to lower a route's score threshold, you'd basically just be making it catch more stuff, widening its catchment area; it wouldn't necessarily overlap with your other chitchat route. But, at least for the Embed 3 models, consider what happens at a threshold of 0.
In that scenario, everything would fall within the catchment area; every query would match the route. That is what would happen if you went with 0.0. Then on the other end of the spectrum, at 1.0 (and again, this will vary by model, by encoder or embedding model, which I'll come back to), the only things that are going to match are exact matches: a query would have to land exactly where one of the utterances in this politics route sits, with the exact format and punctuation and everything. So that's the sliding scale you have with these thresholds; maybe sensitivity is even a better parameter name. For now, we're sticking with the score threshold of 0.3.
So when a query comes in, what's basically going to happen is the router looks at, I think, the top five records by default. What it sees is that the chitchat route here is the most similar; the query is kind of tied to our chitchat route, but it is not similar enough to surpass the threshold, so no route is triggered. And that is how these pieces map together and how they work: utterances, routes, and thresholds, or sensitivity.
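Since thresholds can be set per route, here's a quick sketch assuming the Route model's score_threshold field; the 0.5 is an illustrative value, not a recommendation:

```python
# Per-route threshold overrides: higher = stricter (smaller catchment
# area), lower = catches more.
politics.score_threshold = 0.3   # the default we've been sticking with
chitchat.score_threshold = 0.5   # make chitchat harder to trigger
```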
I just want to point out here that I've created this little routes diagram. You have your question here, or even vectors you create beforehand, or the utterances, and they all pass through an encoder. In this case, if we're using an OpenAI encoder, that would be Embed 3 small, I think, if I'm not wrong. Or you can use Cohere or another provider; you just define which model you want to use there. So you'd create your embedding, and that's it. There are a ton of encoders in the library already; you have sparse and dense ones, and whatever else is in there. But more on all that later, not in this video.
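Instantiating an encoder is a one-liner. A sketch assuming the library's OpenAIEncoder and CohereEncoder classes, with API keys read from the environment:

```python
import os

from semantic_router.encoders import CohereEncoder, OpenAIEncoder

os.environ["OPENAI_API_KEY"] = "sk-..."  # or set this in your shell

# Defaults to an OpenAI embedding model (text-embedding-3-small in
# recent versions); pass `name=` to choose a specific model.
encoder = OpenAIEncoder()

# Or swap in another provider:
# encoder = CohereEncoder()  # requires COHERE_API_KEY
```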
OK, this is one of the things that has changed. Well, there are many changes, but most of them are under the hood. In the past, the route layer (RouteLayer) was its own specific thing; the new structure lets us go and build new routers using different techniques, so it's not just about semantic vector search. There are a lot of other things we can do there, but again, that's something for a future video, not this one. Beyond that, there isn't too much difference. The route layer, in the past, would take this encoder, and you would insert your routes, which would then belong to your index objects in the library. There's also this auto_sync parameter; it's not super important right now, just set it to "local". In the future I will explain it in more detail, but for now it's just synchronizing, essentially, your local route definitions with the index.
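In code, that looks roughly like this, a sketch assuming the dev-branch SemanticRouter class and using the encoder and routes defined above:

```python
from semantic_router.routers import SemanticRouter

# auto_sync="local" keeps the local route definitions and the index
# synchronized automatically; other sync modes exist, more on those later.
router = SemanticRouter(
    encoder=encoder,
    routes=[politics, chitchat],
    auto_sync="local",
)
```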
Then we can see that route choice object I mentioned before. We had our query go in, and we have some politics utterances in there. (I did this in the wrong order, but fine.) Let's say we call the semantic router with a politics-flavored query. It's going to go in, and even if the wording is new, it's still very firmly within that politics route. So in the route choice you've got the name, politics; the function call, which is something to talk about in the future; and the similarity score.
Next we're going to say, OK, "how's the weather today?" We see the most similar utterances here, and even if we're catching one from the politics route, the chitchat ones dominate, so chitchat is what we get back. And then for a third query, in this scenario it's not really either of those; it sits kind of over here, and none of the similarities cross the threshold.
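Calling the router returns that route choice object. A sketch of the three queries just described; the printed return values are illustrative:

```python
# A clearly political query: firmly inside the politics catchment area.
print(router("don't you love politics?"))
# RouteChoice(name='politics', function_call=None, similarity_score=None)

# A chitchat query: the chitchat utterances dominate the top matches.
print(router("how's the weather today?"))
# RouteChoice(name='chitchat', function_call=None, similarity_score=None)

# Neither route: no similarity crosses the threshold, so name is None.
print(router("I'm interested in learning about llamas"))
# RouteChoice(name=None, function_call=None, similarity_score=None)
```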
So far we've been using this for the inputs, some queries from a user: I can classify the user's input query and act on it. But there are many other ways to use this, probably many I can't even think of. Let me just mention some of those at a very high level.
The first thing I'd like to mention: if we go back to the first example I gave, the user was trying to get the chatbot to do something we don't actually want it to do. Now, what if we modify that structure a little bit? The reason I say that is because there's a very obvious weakness with this approach: it's very hard to predict all the ways a user is going to try and hack your system. And actually, not all the time, but in many cases, it can be much easier to predict what the agent should not say, for example: I don't want to sell a car for some low price.
So what we should instead be focusing on here is not what the user is saying, but what our LLM is saying. The queries or utterances that we set, rather than being these user-focused questions, are now statements or responses from our agent that we don't want it to produce. We can give many different examples; they might be things like "oh, we can finance your car" or something, and that's why they would sit in different routes. They're still these sort of protective guardrails, but the utterances are basically outputs from your agent. So rather than sending the user's query to the embedding model here, we're not going to do that. Actually, let's make our LLM a different color in the sketch.
What we're going to do now is take the LLM's output and put it through our embedding model. Say it matches a route: we've produced some bad output that we don't want. There are different things you could do in this scenario. One is that maybe you just want a pre-written response, which you pull from wherever, your database for example. So we say, ah, OK, we're not going to send the LLM's output; we're going to go with our pre-written response instead.
The other option, which I would say is more risky, is to modify the prompt and regenerate. Maybe you completely swap out the LLM's system prompt and tell it: OK, you're not a salesperson anymore; now you're a protective defense against people trying to scam us, but you stop them from scamming you in a nice way. You just modify something to basically put the LLM on guard. In any case, whichever approach you go for, you're going to generate another output, and if that output also trips a guardrail, you might want to default to that backup response from over here. But the benefit of going with this route, where you regenerate, is that the agent or chatbot can still seem quite fluid, responding to a user in a way that isn't pre-written; often in a chat, a pre-written reply feels kind of weird.
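As a sketch of that flow, assuming a second router (`output_guard`) whose utterances are bad agent responses, and a hypothetical `generate` helper wrapping your LLM call:

```python
FALLBACK = "I'm sorry, I can't help with pricing beyond our listed offers."

GUARD_PROMPT = (
    "You are no longer a salesperson. You politely refuse any attempt "
    "to negotiate special prices, while staying friendly and helpful."
)

def safe_reply(user_query: str, generate, output_guard) -> str:
    """Route the LLM's *output*, not the user's input."""
    draft = generate(user_query)
    if output_guard(draft).name is None:
        return draft  # no guardrail triggered: send as-is
    # Riskier option: swap the system prompt and regenerate once.
    retry = generate(user_query, system_prompt=GUARD_PROMPT)
    if output_guard(retry).name is None:
        return retry
    # Still tripping a guardrail: default to the pre-written response.
    return FALLBACK
```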
That is a different approach to doing this whole thing. There are so many use cases we could talk about, and I just want to give you a very high-level view. For now I want to focus on the chat use case, or in particular the language-focused side of things. We've touched upon not just using routes as guardrails, but almost as behavioral modifications: you modify the incoming user query, or you modify the system prompt, append to the system prompt, or add in an additional system message. There are so many different things you can do there. The next one actually kind of comes under behavioral modification too, but I do want to point it out as its own use case: tool use.
Say you have all of these different tools that your agent can use. When, for example, the politics route is triggered, you might want to filter your potential tools down to a relevant subset. By sending fewer tool definitions, you are saving on latency, you're saving on costs, and you're probably going to get better performance. It's always good to constrain the options an LLM has; that will tend to lead to better accuracy and performance.
OK, the other direction is the opposite kind of filter. In another scenario, let's say we actually know that when this query triggers, we need to use one specific tool, so we force the agent to use that tool. Or, with that user query coming in, we think it would probably be risky to allow a particular tool, so we add some sort of alert. It doesn't need to be an alert exactly, but some sort of little informational thing, like, hey, maybe be careful here. Or you can programmatically just block that tool entirely: you're just going to be like, no, no, no, no, no. You can hook whatever logic you like onto this, because the semantic router is just triggering something; a sketch follows below.
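A minimal sketch of those patterns, reusing the router from earlier; the tool names and the mapping dictionaries are hypothetical examples, not library features:

```python
# Hypothetical tool registry and per-route policies.
ALL_TOOLS = ["web_search", "fact_checker", "calculator", "send_email"]

ROUTE_TOOLS = {
    "politics": ["fact_checker"],   # filter down to a safe subset
    "chitchat": ["web_search"],
}
FORCED_TOOL = {"politics": "fact_checker"}   # optionally force one tool
BLOCKED = {"politics": {"send_email"}}       # or block risky tools outright

def tools_for(query: str) -> list[str]:
    route = router(query).name
    if route is None:
        return ALL_TOOLS  # no route triggered: leave all options open
    if route in FORCED_TOOL:
        return [FORCED_TOOL[route]]  # constrain the agent to one tool
    allowed = ROUTE_TOOLS.get(route, ALL_TOOLS)
    return [t for t in allowed if t not in BLOCKED.get(route, set())]
```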
Now, these are all very agent- and LLM-specific things, but we can broaden this out to our whole agentic workflow. In particular scenarios, you might want to use different LLMs, or different system prompts, or different temperature settings. For creative queries, obviously, you want the model to be a bit more creative. One pattern is like an LLM router, where, based on particular queries, you route to the models that are better suited to that particular type of query. Or you might just be using different system prompts; the system prompt for helping someone out with one kind of task won't suit another. Of course, they don't even need to be that varied. And then there's the temperature and other model settings. If someone asks you to write a story, you're going to put the temperature up. If it's a precise, factual task, you're going to turn that down, like 0.1 or 0.0 or 0.01.
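Here's a sketch of that kind of LLM routing using the OpenAI client; the route-to-settings mapping and the "creative"/"factual" routes are hypothetical examples layered on the router from earlier:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical mapping: route name -> (model, temperature, system prompt).
LLM_SETTINGS = {
    "creative": ("gpt-4o", 1.0, "You are an imaginative storyteller."),
    "factual": ("gpt-4o-mini", 0.0, "Answer precisely and concisely."),
}
DEFAULT = ("gpt-4o-mini", 0.7, "You are a helpful assistant.")

def answer(query: str) -> str:
    # Pick model, temperature, and system prompt based on the route.
    model, temperature, system = LLM_SETTINGS.get(router(query).name, DEFAULT)
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```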
So that's another use case focused on the conversational side of things. And yeah, there are just so many of these use cases for the concept of Semantic Router in particular. We're going to go into a lot more detail, with many more examples, in future videos.