Today, we're going to be talking about how we can use this concept of semantic routing, or semantic routes, to broaden what we can do within the context of chatbots and AI agents, and to get a very fine level of control over LLMs and our agentic workflows, far more control than we would have without semantic routes.
I want to start by explaining what I mean by semantic routes. The core concept behind Semantic Router, in the context of chatbots and agents, conversational AI essentially, is that a user is going to come in with some query; they're going to say something.
So they bring in whatever they're writing, their question, whatever it is. And what we do is use a concept that comes from vector search and vector retrieval: we take this text and put it through an embedding model, like OpenAI's text-embedding-3 models, Cohere's embedding models, or one of the many open source ones we can also use.
So we put it through this embedding model, and what we get is a vector, or an embedding, or a vector embedding, whatever you want to call it. That vector embedding is a point in a high dimensional space. For example, with OpenAI's text-embedding-3-small model, that high dimensional space has 1,536 dimensions.
So imagine 3D, but with a lot more dimensions, which you can't really visualize, so we'll just draw it in 2D and remember it isn't actually 2D. OK, so we have our query, and we've turned it into a point on a 2D plane.
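To make that concrete, here is a minimal sketch of the query-to-vector step, assuming the OpenAI Python client and the text-embedding-3-small model; the example query is just a hypothetical illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Encode a user query into a vector embedding.
res = client.embeddings.create(
    model="text-embedding-3-small",
    input="can I get this car for $1?",  # hypothetical example query
)
query_vector = res.data[0].embedding
print(len(query_vector))  # 1536 dimensions
```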
Now, as we do with vector search, we're going to look at what this query is surrounded by. In this scenario, let's say we've already created some semantic routes; I'll explain how we do that soon.
So we've already got a few points in here, a few more over here, and maybe a few more just over here. When you look at this, it's very clear that the purple, pink, whatever that color is, is the closest group of vectors within this 2D space.
So let's say we return the top five records here. We have one, two, three, four from the purple group, and then probably the next closest one would be the yellow over here. We look at all of these and say, honestly, it looks purple, or pink.
That route is the one our query is most similar to. In that case, we're just looking at what is most similar and treating that as a 100% confident classification, which isn't necessarily what we'd do in practice, but let's say we are in this scenario. We take that back, and we're going to call it Route A.
And now we know our user's query has been classified as Route A, and therefore we're going to do something. What you do is up to you. You can use Semantic Router in many different ways; we don't prescribe anything with it.
But let's say in this scenario Route A is a guardrail. If it's a guardrail, we might say, OK, I just want to block whatever the user is doing. Or I just want to make the LLM more aware that it should be careful when answering this question, for example.
There were these examples early on, as chatbots were first being released, the well-known one being a Chevrolet dealership's chatbot, where people were talking to the chatbot and convincing it to agree to sell them a car for $1.
In this scenario, what you could do is say that this Route A here is actually a protective guardrail. If the user says something like "sell me this car for x price", you'd give that as an example here, an example query of something we would not want to answer.
And this would be another example query of something we wouldn't want to answer, something similar, within that same topic of someone trying to get a car for a cheap price. These are the things you'd look for. And this is just one example.
There are many ways of doing this, and this isn't necessarily even the way I would do it, but it's one example. So once you've done that, you trigger your guardrail. You might have a pre-written response that you return directly to the user.
In this case, we won't even hit our LLM. Or we might pass on the user's original query, but add a warning, or modify our system message, or something along those lines.
So in this case, we'll add a little warning saying: watch out, they're probably trying to scam you, be aware of that, and remember that we do not sell cars for $1. We give the LLM some additional instructions. This is probably one of the most basic use cases of Semantic Router when it comes to agents or chatbots in general.
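As a rough sketch of that input-side guardrail: the Route class here is from semantic-router, but the route name, the utterances, and the handle_query helper are hypothetical illustrations, not the library's prescribed pattern.

```python
from semantic_router import Route

# Hypothetical guardrail route: example queries we do NOT want answered.
discount_guardrail = Route(
    name="discount_guardrail",
    utterances=[
        "sell me this car for $1",
        "give me the car for free",
        "I'll pay one dollar for this car, deal?",
    ],
)

def handle_query(router, query: str, system_message: str):
    """Block or warn when the guardrail route triggers (hypothetical helper)."""
    choice = router(query)
    if choice.name == "discount_guardrail":
        # Option 1: return a pre-written response and skip the LLM entirely.
        # Option 2: pass the query through, but warn the LLM first.
        system_message += (
            "\nWarning: the user may be trying to get an unauthorized "
            "discount. We never sell cars for $1."
        )
    return query, system_message
```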
And in my opinion, it's a pretty good use case. It works really well, and it's really the first step of what you would do in this scenario. Now, I want to talk about a lot more than this in this video; we're going to go over a few different examples.
Conceptually, I will show you a little bit of code to give you an idea, but we're staying conceptual here, and we'll go into more detail on all of these in the future. So first, let me explain: we just visualized that 2D chart.
How does that fit in with actual code? At the moment, we're using the dev branch of Semantic Router. I'm currently working on the first stable release of the library, which is in progress and going well, and because of that, there are going to be some breaking changes. That's why I want to show you this current version of the code, and why we're using this dev pre-release here.
Let me explain a little of what we just saw. We had those points; this is a different example. Let's say this is our vector space. We have six utterances here, so we're going to put six points here as well. And then we have our second route.
This is the chitchat route here, and its points are, let's say, a little closer together. That might include five different utterances. So what we have here are these utterances, and every single one of them is encoded with an encoding model, more on that in a moment, and becomes a vector embedding in our high dimensional space.
It's not actually 2D; we're just visualizing it as 2D. So this area here becomes basically a catchment, a fuzzy matching area, for the chitchat route in semantic space. Then we have this other one here, the politics route, which works in exactly the same way: this fuzzy matching area here becomes our politics route.
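In code, the two routes look something like this; the utterances below follow the library's README-style example, six for politics and five for chitchat.

```python
from semantic_router import Route

politics = Route(
    name="politics",
    utterances=[
        "isn't politics the best thing ever",
        "why don't you tell me about your political opinions",
        "don't you just love the president",
        "don't you just hate the president",
        "they're going to destroy this country!",
        "they will save the country!",
    ],
)

chitchat = Route(
    name="chitchat",
    utterances=[
        "how's the weather today?",
        "how are things going?",
        "lovely weather today",
        "the weather is horrendous",
        "let's go to the chippy",
    ],
)

routes = [politics, chitchat]
```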
So then what happens is, let's say a user comes in, they make a query, and it drops over here, outside of these boundary lines; that's something we didn't see in the last little sketch. These lines are actually defined by something called a score threshold, another variable that you can set (I didn't put it here) and even optimize automatically. I think, by default with OpenAI's text-embedding-3 models, this defaults to 0.3, which seems to be a good cutoff: anything below that is not that similar, whereas anything above it is.
But what you will tend to find is that some routes you define really need a higher or lower threshold. The threshold is basically the sensitivity, the size of the catchment area you're generating. So if you were to lower the score threshold, let's say for the politics route, you'd basically just be making it catch more stuff.
It wouldn't overlap with your other chitchat route; there's always going to be something like a boundary between them. But you'd be increasing or decreasing that catchment based on your threshold, essentially. At one end of the scale, at least for the text-embedding-3 models (some models go down to minus 1, which is why I'm not sure), a threshold of 0 means basically anything will trigger the route. In that scenario, this purple route here would just take up the entire vector space, except for this chitchat area.
So everything here would be the catchment area for your politics route; that's what would happen if you went with 0.0. Then at the other end of the spectrum there's 1.0, and again this will vary by encoder or embedding model, which I'll show you in a moment.
But basically, what that would do is say that the only things that will match are exact matches of whatever utterances you've defined here. These are the utterances, the exact phrases that have been encoded into the vector space.
And if you have a threshold of 1, the user's query needs to land exactly where one of the utterances in this politics route is. So the user would literally have to say "isn't politics the best thing ever", with that exact wording and punctuation and everything.
So that's the sliding scale you have: the sensitivity, essentially. Maybe sensitivity would even be a better parameter name, now that I think about it, and we are rewriting things, so maybe. Anyway, in this case, let's say we're sticking with a score threshold of 0.3.
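Setting that explicitly is just another parameter on the route; a small sketch, assuming the score_threshold argument behaves as described above.

```python
from semantic_router import Route

# Same politics route as before, with the threshold set explicitly.
politics = Route(
    name="politics",
    utterances=["isn't politics the best thing ever"],  # truncated for brevity
    # 0.0 catches almost everything, 1.0 requires a near-exact match;
    # 0.3 is the default mentioned for OpenAI's text-embedding-3 models.
    score_threshold=0.3,
)
```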
Your query comes in, and what basically happens is it looks at, I think, the top five records by default, and calculates the similarities between those and the query.
And what it will see is that the chitchat route here is the most similar. So our user's query is most similar to, kind of tied to, our chitchat route, but it is not similar enough to surpass the threshold we've set.
So in that case, this will come out with a route, we call it a route choice, actually, of none. It's basically no classification. And that is how they map and how they work: utterances, routes, and thresholds, or sensitivity, however you want to think about it.
And before I move on, I just want to point out that I've created this little routes variable, which contains both our politics and chitchat routes. OK, so very quickly on the encoders: the encoder is just that step in the middle. You have your question here, or the utterances you define beforehand.
And you're basically just putting it through that encoding model. In this case, if we're using an OpenAIEncoder, that would default to text-embedding-3-small, I think, if I'm not wrong. Or if you're using Cohere or another provider, it would be the equivalent encoder.
And you can decide, if you don't want to use text-embedding-3-small but text-embedding-3-large instead, you can do so. I think it's a name parameter, and you just define which model you want to use there. So you'd create your embedding, and that's it.
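For example, as a sketch of that name parameter on the OpenAIEncoder:

```python
from semantic_router.encoders import OpenAIEncoder

# Defaults to text-embedding-3-small; override via the name parameter.
encoder = OpenAIEncoder(name="text-embedding-3-large")
```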
That's what the encoders are. There are a ton of them in the library already, and more coming all the time; you have sparse and dense encoders and whatever else in there. So there are a few options, but more on all that later, not in this video. Now, this is one of the things that has changed.
So 0.0.72, if I'm not wrong, was the last old version of the library, and now we're going on to 0.1.0. There are many changes, but most are not as obvious as this one: we now have the semantic_router.routers module, and within it the SemanticRouter class.
This is what used to be the RouteLayer. The RouteLayer was its own specific thing; what we've done with this new refactor is standardize things a lot, which makes it much easier for us to go and build new routers using different techniques. So it's not just about semantic vector search.
There are a lot of other things we can do, a lot of ways we can enhance that, but again, that's for a future video. Beyond that, there isn't too much difference. The RouteLayer, in the past, would take this encoder, and you would insert your routes.
The one thing that is different is this auto_sync option. Before, this option used to belong to the index objects in the library. It's not super important right now; just set auto_sync to "local". In the future, I will explain it in more detail.
For now, it's just synchronizing, essentially, between your router and your index.
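Putting it together, it looks roughly like this; a sketch of the 0.1.x API, reusing the encoder and routes defined above.

```python
from semantic_router.routers import SemanticRouter

# SemanticRouter replaces the old RouteLayer; auto_sync="local" keeps the
# local index in sync with the routes we defined.
sr = SemanticRouter(encoder=encoder, routes=routes, auto_sync="local")
```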
And we can see that route choice thing I mentioned before. Our query goes in, and we have some politics points up here, our politics route, and our chitchat route over here; I drew these in the wrong order, but fine. Now let's say we call the semantic router with "don't you love politics?". It's going to go in, and maybe it seems kind of conversational.
It's a bit chitchatty, so maybe it leans in that direction, but it's still very firmly within the politics route. And that is what you get here: you see the name politics, function calls (something to talk about in the future), and a similarity score (another thing to talk about in the future, and something we still need to work on).
The main thing I want to focus on here: just pretend those other fields don't exist. So, route choice: politics. Now let's do another one: "how's the weather today?". That's very firmly in the chitchat route. We're comparing and seeing the most similar ones here.
Even if we catch one from the politics route, which we probably wouldn't in this scenario, we'd be looking at the overall scoring across everything, and it would say quite firmly: I'm within this chitchat area. And that is what we get, ignoring those fields we don't need to go through.
So we get chitchat. Then we can say something else: "I'm interested in learning about Llama 2". In this scenario, it's not really either of those, so it might land somewhere over here. Its nearest neighbors are these, but none of the similarities cross that threshold.
So we would return none for our categorization, which is actually quite fortunate, not unfortunate; it's what we want. And at a high level, that is what Semantic Router is doing.
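Those three calls look roughly like this; the outputs are the route names the video reports, and exact similarity scores will vary.

```python
print(sr("don't you love politics?").name)                  # "politics"
print(sr("how's the weather today?").name)                  # "chitchat"
print(sr("I'm interested in learning about Llama 2").name)  # None
```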
But then you think: OK, I can classify the user's input query, but what can I use that classification for? Do I have to use it in this way? No, you don't. You can use it in so many different ways, probably many I can't even think of. But let me mention some of those at a very high level.
The first thing I'd like to mention: if we go back to this first example I gave, someone is basically trying to hack the agent into doing something we don't want it to do. So we have this structure. Now, what if we modify that structure a little?
The reason I say let's modify the structure is that there's a very obvious weakness with this: it's very hard to predict all the ways a user is going to try to hack your system into saying something you don't want it to say.
And in many cases, not all, it can be much easier to predict what the agent or the LLM should not say. You probably have a particular domain you'd like to talk about, and particular things you don't want your agent to say.
I don't want to sell a car for some low price. In fact, I probably don't want the agent telling people "you can buy this car for this price" at all; it's just a bad idea. So what we should instead focus on here is not what the user is saying, but what our LLM is saying.
So let's go back and modify this. We have our user's query at the top here. The utterances we're setting, rather than being user-focused questions, let's say they are statements or responses from our agent, ones we don't want it to say.
So for this one, we're going to say: "yes, you can have this car for $1". We don't want that to happen. We can give another example: "yes, we can do a 50% discount". That will be this one here. And we can give many different examples. These other routes we have here might actually be similar.
They might be something like "we can finance your car", similar but slightly different, which is why they'd be in different routes. But they could still be protective guardrails against giving out financial deals you don't want to give. So in this scenario, these routes are basically outputs from your agent, not inputs from your user.
So this user query comes in, and whereas before we passed it to the embedding model here, now we're not going to do that. Instead, let's say this circle is our LLM; actually, let's make our LLM a different color.
Our LLM is going to create some output, and it's not good: it's giving someone a 75% discount, and we just don't want that. So what we're going to do now is take this output, put it through our embedding model, and come over here.
We put it in; the discount utterances are around here, so it lands there. We compare, and we're like: oh no, this is not very good, we've produced some bad output that we don't want to happen. So there are different things you could do in this scenario.
One option is to have a pre-written response; this is the safest thing to do. You just take it from wherever, your database, your code, whatever you have: a pre-written response for when this particular route is triggered. So we're going to say: ah, OK, we're not going to do that.
We come straight over here, go to our pre-written response, and return that to the user. Done. Easy. The other option, which I'd say is more risky but still an option, is you can take what you've generated here.
You come over here with some additional prompting, or maybe you completely swap out the LLM's system prompt. You tell it: you're not a salesperson anymore; now you're a protective defense against people trying to scam the company, but you stop them from scamming you nicely, with polite language.
You modify something to basically put it into safety mode, or you just add a warning here again. You can do that if you want. But in any case, whichever approach you go for, you're going to generate another output, and hopefully this time the output is good.
If it's not, then maybe you just default to that backup pre-written response from over here. It's up to you. But the benefit of going the regeneration route is that the agent or chatbot can still seem quite fluid. So it goes through here and does everything.
And we say, OK, this one passed, and now we have a much more fluid response for the user that isn't pre-written. In a chat, a pre-written response often feels kind of weird and can really throw things off. So that's a different approach to doing this whole thing.
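A rough sketch of that output-side flow; generate_answer and the pre-written fallback are hypothetical stand-ins for your own generation code.

```python
PREWRITTEN = "Sorry, I can't offer discounts or special pricing."

def guarded_reply(router, generate_answer, query: str) -> str:
    """Route the LLM's draft answer, not the user's query (sketch)."""
    draft = generate_answer(query)  # first LLM generation
    choice = router(draft)          # classify the *output*, not the input
    if choice.name == "discount_guardrail":
        # Safest option: return the pre-written response. The riskier option
        # would regenerate with a modified system prompt, falling back to
        # PREWRITTEN if the new draft still triggers the route.
        return PREWRITTEN
    return draft
```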
Then, very quickly, there are so many use cases we could talk about in depth; I just want to give a very high-level view of a couple of others that I think are quite interesting and cool. And beyond chat, we can actually use Semantic Router in different modalities, like with images.
There are a lot of different use cases there, but I just want to focus on the chat use case, in particular the language-modality chat use cases. Of course, we have the guardrails, which we've just spoken about, and we touched on using them not just as guardrails but almost as behavioral modifications, where you modify the incoming user query or prompt.
You modify that, or you modify the system prompt, or you add additional instructions to the system prompt, or you add an additional system prompt in a different place. There are so many things you can do there to modify basic behavior. Another thing, which kind of comes under behavioral modification but I do want to point out as its own use case, is tool use.
If you trigger a particular route, you can, in several different ways, modify the tool use of your agent, assuming you have different tools. So in some scenarios, let's say you have all of these different tools that your agent can use.
Each one of these dots is a tool. In a scenario where, let's say, the politics route is triggered, you might want to filter your potential tool use down to these specific tools, because you know your agent in this scenario should be using one of those.
That is very useful in many ways. One, it makes your prompts more concise, and by making your prompts more concise, you're saving on latency and costs, and you'll probably get better accuracy out of your LLM, because you're giving it less choice. It's generally good to constrain the options an LLM has.
That tends to lead to better accuracy and performance. The other filter would be the chitchat one: in that scenario, we filter to these specific tools instead. Or in another scenario, when this query triggers, we actually know that we need to use one specific tool.
So we actually force it to use that tool in that scenario. Another option: we have that user query coming in, and we think it would probably benefit from using a particular tool, so we just modify the user query.
It doesn't need to be an alert, just some little informational note, like: hey, maybe you can use this tool. Or you can programmatically reduce the number of tools the agent has access to, saying: you can't use any of these.
They don't exist; the only one you can use is this one. Maybe you combine both of those approaches. What you can do is completely up to you, because Semantic Router is just triggering something, and what it triggers is up to you.
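As a sketch, and only a sketch: the tool names and the route-to-tool mapping here are hypothetical; the point is just narrowing the tool set before the LLM sees it.

```python
ALL_TOOLS = {"search_news", "fetch_polls", "get_weather", "tell_joke"}

# Hypothetical mapping from triggered route to the tools worth exposing.
ROUTE_TOOLS = {
    "politics": {"search_news", "fetch_polls"},
    "chitchat": {"get_weather", "tell_joke"},
}

def tools_for_query(router, query: str) -> set[str]:
    choice = router(query)
    # If no route triggers (choice.name is None), fall back to the full set.
    return ROUTE_TOOLS.get(choice.name, ALL_TOOLS)
```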
Now, these are all very agent and LLM-specific things, but we can broaden this out to our whole agentic workflow, or even beyond that. Let's broaden it out: in particular scenarios, you might want to use a different LLM in different places, or different system prompts, or different temperature settings.
So we might get a user query like: I want to write a story. That should be all you know; obviously, they want to be a bit more creative. And I think the vast majority of people prefer not really OpenAI's models, but more like Anthropic's or other providers', for creative writing.
So you could end up having an LLM router here, where based on particular queries, you route to the models better suited to that particular type of query. So you might be going to different LLMs, or you might just be using different system prompts.
The system prompt for helping someone with their code is probably going to be fairly different from one for someone wanting advice on the ingredients for a nice meal, right? Different use cases are probably going to require different prompts. Of course, they don't even need to be that varied.
You can have much more specific system prompts for very specific things. And then, as I mentioned, there are temperature and other model settings. In a creative writing example, where someone wants to write a story, you'd turn the temperature up to 0.9 or so.
Whereas if someone's asking "I really need help figuring out this problem in my code, what do I do?", you'd turn it down to 0.1 or 0.0, or whatever it actually is. So that's another thing we can do with it on the conversational side.
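A sketch of that routing, with model names and temperatures as illustrative assumptions only, not recommendations.

```python
# Hypothetical route-to-settings mapping for an LLM router.
MODEL_CONFIG = {
    "creative_writing": {"model": "claude-3-5-sonnet-latest", "temperature": 0.9},
    "code_help": {"model": "gpt-4o", "temperature": 0.0},
}
DEFAULT_CONFIG = {"model": "gpt-4o-mini", "temperature": 0.7}

def config_for_query(router, query: str) -> dict:
    choice = router(query)
    # Fall back to a default config when no route triggers.
    return MODEL_CONFIG.get(choice.name, DEFAULT_CONFIG)
```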
And there are just so many of these use cases; I could go on forever, but I'm not going to. I just wanted to very quickly go through the concept of Semantic Router, in particular with conversations, and I think I've done that.
We're going to go into a lot more detail on many of these in the near future, but for now I'll leave it there. Thank you very much for watching. I hope this was useful and interesting, and I will see you again in the next one.
Bye.