
How to Make RAG Chatbots FAST


Chapters

0:00 Making RAG Faster
0:20 Different Types of RAG
1:03 Naive Retrieval Augmented Generation
2:22 RAG with Agents
5:06 Making RAG Faster
8:55 Implementing Fast RAG with Guardrails
11:02 Creating Vector Database
12:52 RAG Functions in Guardrails
14:32 Guardrails Colang Config
16:13 Guardrails Register Actions
17:03 Testing RAG with Guardrails
19:42 RAG, Agents, and LLMs

Transcript

In this video, we're going to take a look at how we can do retrieval augmented generation (RAG) as an example of the sort of tooling or capabilities that we can build out within NeMo Guardrails. Typically, when building out a RAG pipeline for LLMs, we'd take one of two possible approaches.

Both approaches are going to use similar components, so let me just draw those out very quickly. We're going to have a vector database, which will be Pinecone, and we're going to have an embedding model, which will be OpenAI's text-embedding-ada-002.

So that's your embedding model. Basically, we would have taken some documents from somewhere, fed them through our embedding model, and stored those within Pinecone already. Now, the two traditional approaches to RAG that we can take with LLMs are these. First, we can take a naive approach.

So the naive approach is that we have an LLM up here. Let me just do it here. We have an LLM and let's say we have a query. We take our query and we actually take that straight to the embedding model here. That creates our query vector, xq, which goes into Pinecone and returns a set of documents or contexts, which are like relevant information for that particular query.

Now what we do is we bring that over here and we're going to merge it with our query. Okay? So now we have the query plus the context and we feed them into the LLM. It doesn't matter what our query was in this instance. Our query could have just been hi, how are you?

Right? And we would still have gone and embedded that, gone to Pinecone, and retrieved some contexts. That's why I call this a naive approach. The pro of this approach is that it's very quick. Right? Embedding and searching through Pinecone is incredibly fast, so you're not waiting long.
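
Before comparing this with the agent approach, here's a minimal sketch of that naive flow in Python. It assumes an existing Pinecone `index`, the 0.x openai client, and chunk text stored in the vector metadata; the prompt wording and chat model are illustrative, not the exact notebook code.

```python
import openai

def naive_rag(query: str, index, top_k: int = 3) -> str:
    """Naive RAG: every query goes through retrieval, relevant or not."""
    # 1. Embed the query to get the query vector xq
    xq = openai.Embedding.create(
        input=[query], engine="text-embedding-ada-002"
    )["data"][0]["embedding"]
    # 2. Retrieve the most similar chunks from Pinecone
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    contexts = [m["metadata"]["chunk"] for m in res["matches"]]
    # 3. Merge query + contexts and make a single LLM call
    prompt = (
        "Answer the question using the contexts below.\n\n"
        + "\n---\n".join(contexts)
        + f"\n\nQuestion: {query}"
    )
    chat = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return chat["choices"][0]["message"]["content"]
```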

Particularly when we compare that to the other approach. So the other approach is slower, more complex, but potentially more powerful. Right? So that is where you have like an agent, which is essentially like a big wrapper around your LLM that allows it to have multiple thoughts over time, like an internal dialogue.

So when you send your query, I'm going to just put the query over here this time, it goes over to your agent and then it's getting processed within that agent for a while. Right? Another thing that the agent can do is it has access to external tools. Right? One of those tools may be like a retrieval tool.

So the agent is going to say, okay, your query, if it's hi, how are you? I don't need to do anything. I'm just going to respond directly. Right? I'm just going to respond directly to you. You know, I'm doing okay. Actually, I'm not doing okay because I don't have feelings because I am AI.

Right? It's going to say something like that. But if we do ask it something that requires some external knowledge, what it can do is it's going to refer to its like external knowledge tool over here. And that external knowledge tool is going to point that query or a modified version of that query to our embedding model.

That is going to create xq again, which then gets sent to Pinecone, and we get some contexts back. Okay. They get passed back into our tool pipeline and processed internally by our agent again, which will output an answer based on those contexts and our original query. Now, you can see straight away that this process is heavier.

Right? Because before we even get to this retrieval tool, the LLM needs to have generated the fact that it needs to use that tool. And LLM generations are basically always going to be the slow part of the process. At least for now. So, before we even get to that tool, we have one LLM generation.

Right? And then we go through and we come down to here. We feed those contexts back into our agent and then we have at least one more LLM generation. And actually, if you use out-of-the-box approaches from LangChain, I think at minimum you're going to be doing three LLM generations, because it also has a summarization step in there as well.

So, basically, you're going to be waiting a while, but you'll probably get a good result. Now, what Guardrails does is kind of cool because it allows us to not do either of those approaches, but instead do something that's kind of in the middle. So, this is looking kind of messy already, but let me try.

Okay. So, we have our query. That query is actually going to go directly to Guardrails. Okay. So, yeah, I'm going to call it G over here. Right. So, that's going to go directly to Guardrails. Guardrails is going to actually use an embedding model, a different embedding model, but still an embedding model.

So, let me just put E for embedding. All right. It's going to take your query and create, well, not exactly a query vector. It is like a query vector, but Guardrails isn't going to use it as one. Let's just call it V for vector at the moment.

What that's going to do is it's going to look at the Guardrails that have been set. Right. So, we have those definitions of user asks about politics or define user asks about large language models. And it's going to look at whether that query has a high similarity to any of those things.

What we might want to do is if the user is asking about language models, we want to actually trigger the retrieval tool. Okay. Like our own retrieval tool. So, we're going to say, okay, is that semantically similar? And we, based on that semantic similarity, we decide on a tool or we decide to just generate a response.

So, we've now kind of done what the agent was doing, but without doing that first LLM generation, which makes things a lot faster. All right. So, now, okay, we've decided, yes, we do want to send our vector or our query over to Pinecone. So, actually what we're going to have to do is take that query and bring it over to our embedding model here, because they're different embedding models.

So, we have our embedding model and then we have our query vector that goes into Pinecone. From that, we get our context. And here is where those contexts would actually go into our LLM. Now, how do I do this? I've made a bit of a mess. Basically, we want to put those two together, our query and the context, and we're going to feed them over into our LLM.

Okay. Yeah. And then that's going to return our answer to us. Okay. So, it's going to come over here. So, we have one LLM call there. And that's really all we need. Depending on the tool, we may still decide to use an LLM call beforehand, but it kind of depends.

And it means that for those queries where we didn't need an LLM call, like if we're saying, "Hi, how are you?", we won't generate two, we'll just generate one. So, that's where Guardrails comes into this whole retrieval augmented generation thing, and the unique approach that it takes to this, which is significantly faster than the agent approach while still allowing us to use tools, which is pretty cool.

Naturally, just as with a normal agent, we're not restricted to just using one tool. We can obviously use many tools. So, that's, I think, pretty cool. Now, let's take a look at how we would actually implement this. Okay. So, there's this notebook. Again, as usual, there'll be a link to this at the top of the video.

We're going to just install a few libraries: NeMo Guardrails, Pinecone for the vector index, datasets (Hugging Face's datasets library), which is where we're going to download the data we'll be querying against, and OpenAI to create the embeddings and also for the LLM calls.
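
Roughly, the install cell looks like this (package names are from the video; exact version pins aren't shown):

```python
# pip install nemoguardrails pinecone-client datasets openai

# quick sanity check that everything imported correctly
import nemoguardrails, pinecone, datasets, openai
```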

So, yeah, let's come down to here. Now, there's this whole like indexing process with vector databases. I'm going to be very quick going through this because I've spoken about it like a million times. So, I don't want to repeat myself every time. We're just going to start with this dataset.

It's from Hugging Face. It's a dataset I created. It's basically just a load of papers that are either the Llama 2 paper or related to the Llama 2 paper, scraped from arXiv. Okay. And it contains all this information. We don't need all of that. Okay. What I want to do is I'm just going to create some unique IDs.

And after I've created those unique IDs, I don't want any of those other irrelevant fields because there's quite a few in there. So, we just want to keep the unique ID, the chunk, the title, and the source. Okay. Now, what we want to do is embed that data. There's not too much in there, by the way, just under 5,000 records.
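
As a rough sketch of the loading and trimming steps just described (the dataset ID here is a placeholder, and the notebook derives IDs from the paper metadata rather than the row index):

```python
from datasets import load_dataset

# Load the prechunked Llama 2 / arXiv dataset from Hugging Face
data = load_dataset("your-username/llama-2-arxiv-chunks", split="train")  # placeholder dataset ID

# Create a unique ID per record, then drop the other fields we don't need
data = data.map(lambda x, i: {"uid": f"doc-{i}"}, with_indices=True)
keep = {"uid", "chunk", "title", "source"}
data = data.remove_columns([c for c in data.column_names if c not in keep])
```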

So, what we need to do is embed that. For that, we need an OpenAI API key. 5,000 embeddings with ada-002 doesn't cost much money, by the way. It's pretty cheap. But you just need to enter your API key in here. Okay. I will run that. And now we can go ahead and create some embeddings.

So, text-embedding-ada-002 is how we're going to create those embeddings. That response will give us an object with data, model, and usage fields. We want to go into data. Within data, we have two records. Each one of those records is one of our embeddings, and every embedding from the ada-002 model is 1,536 dimensions.
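
In code, creating those embeddings looks something like this (openai-python 0.x style, as used around the time of the video):

```python
import openai

openai.api_key = "OPENAI_API_KEY"  # placeholder

# Embed a couple of chunks with text-embedding-ada-002
res = openai.Embedding.create(
    input=["first chunk of text", "second chunk of text"],
    engine="text-embedding-ada-002",
)
embeds = [record["embedding"] for record in res["data"]]
print(len(embeds), len(embeds[0]))  # -> 2 1536
```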

All right. Now, what we need to do is initialize our vector index. We need an API key and an environment variable for that from Pinecone. This is all free. So, we head on over to app.pinecone.io. You should see something kind of like this: your name, default project. You go to API Keys and you just want to copy your API key.

Also, just remember your environment here. Your API key, you just need to put in here. The environment, for me, was us-west1-gcp. Yours will probably be different. Once you've entered both of those, you just run that. This is just checking that we have connected successfully. I'm going to create a new index.

It's going to be called NeMo Guardrails RAG with Actions. You can call it whatever you want; it's okay. What we're going to do here is just create the index if it doesn't already exist. Now, obviously, if this is your first time working through this, the index shouldn't already exist.

It will create a new one. We use a cosine similarity metric, which is recommended for ada-002. We need to specify the dimensionality of the vectors we'll be storing within our index, which is the 1,536 we saw earlier. Now, we're just going to wait for that index to be fully initialized before we connect to it.
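
A sketch of that index setup, assuming the pinecone-client 2.x API from the time of the video (your environment string and index name will differ):

```python
import time
import pinecone

pinecone.init(api_key="PINECONE_API_KEY", environment="us-west1-gcp")  # placeholders

index_name = "nemo-guardrails-rag-with-actions"

# Create the index only if it doesn't already exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        metric="cosine",   # recommended for ada-002
        dimension=1536,    # ada-002 embedding dimensionality
    )
    # wait for the index to be fully initialized
    while not pinecone.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pinecone.Index(index_name)
index.describe_index_stats()  # should show an empty index at this point
```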

Let's run that. This will usually take around a minute to initialize the index, so I'll just skip forward a little bit. We see that our index is currently empty, as expected, because we just created it. Then we add everything into our index: we're embedding things and putting everything up there in batches of 100.
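
The upsert loop is along these lines, reusing the uid, chunk, title, and source fields we kept earlier:

```python
batch_size = 100  # embed and upsert 100 records at a time

for i in range(0, len(data), batch_size):
    batch = data[i:i + batch_size]  # slicing a Hugging Face Dataset gives a dict of lists
    texts = batch["chunk"]
    # create the embeddings for this batch
    res = openai.Embedding.create(input=texts, engine="text-embedding-ada-002")
    embeds = [r["embedding"] for r in res["data"]]
    # keep the raw text (and some provenance) in the metadata for retrieval later
    metadata = [
        {"chunk": c, "title": t, "source": s}
        for c, t, s in zip(texts, batch["title"], batch["source"])
    ]
    index.upsert(vectors=list(zip(batch["uid"], embeds, metadata)))
```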

That, again, is going to take about a minute to run. Once that has finished, we can move on to actually creating our RAG pipeline with Guardrails. With Guardrails, what we're going to be doing here is using Guardrails actions, which are basically executable functions that we call from within the Guardrails Colang file.

If you saw the previous video, you will know about these. We need to initialize one of those functions, which is going to be the retrieve function. We need to make sure that's an async function because when we are using functions with async generate within Guardrails, they need to be asynchronous functions.

Otherwise, we're going to get an error. We're just going to embed our query to create our query vector, retrieve the relevant items from Pinecone, and return those. Then we follow that with another function, rag. I'm just going to print "RAG called" in there so we know when this is actually being called later on.

This is going to take our query and the context that we retrieved from our retrieve function. It's going to put them into this prompt template, which is saying, "You're a helpful assistant. Below is a query from a user and some relevant context. Answer the question given those contexts." That's what we're doing here.
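
Put together, the two actions look roughly like this. The prompt template is paraphrased from the video, and the completion model name is an assumption; both functions are async, as required for use with Guardrails' async generate.

```python
async def retrieve(query: str) -> list:
    """Embed the user query and return the most relevant chunks from Pinecone."""
    res = openai.Embedding.create(input=[query], engine="text-embedding-ada-002")
    xq = res["data"][0]["embedding"]
    matches = index.query(vector=xq, top_k=5, include_metadata=True)["matches"]
    return [m["metadata"]["chunk"] for m in matches]


async def rag(query: str, contexts: list) -> str:
    """Answer the query using the retrieved contexts."""
    print("> RAG called")  # so we can see when Guardrails triggers this action
    context_str = "\n".join(contexts)
    prompt = (
        "You are a helpful assistant. Below is a query from a user and some "
        "relevant contexts. Answer the question given those contexts.\n\n"
        f"Contexts:\n{context_str}\n\nQuery: {query}\n\nAnswer:"
    )
    res = openai.Completion.create(
        engine="text-davinci-003",  # assumed completion model
        prompt=prompt,
        temperature=0.0,
        max_tokens=300,
    )
    return res["choices"][0]["text"].strip()
```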

Then we're passing that back to OpenAI to generate a response. We're going to call all of this from within Guardrails, given particular criteria. We set up the initial, typical config for Guardrails. We're not really going to be using TextAdventure here, at least not for the RAG component.

Actually, here, I will remove that. Let's say: I'm a simple assistant, I don't like to talk about politics. This is going to be our rail against talking about politics. We don't want to do that. You can see that here; we've seen this before. Then what I want to do is define the user asking about Llama, or LLMs in general.

We can call that user ask llm, and then we define a flow, llm. Basically, what this is doing is creating a set of semantically embedded vectors from those example utterances. What Guardrails is going to do is take our user's query and compare it against these.

If it sees that they are very similar, it's going to say, "Okay, the user is asking about LLMs." That will trigger this flow here. Then in this flow, we perform retrieval augmented generation: we get our contexts given the user's last message, and then we create a retrieval augmented answer based on those.

Then we just tell the bot to return that answer directly, because it has already been generated by our LLM; it doesn't need to generate a new answer based on that answer. Let's run that. One other thing we need to do here is register our actions. We have this execute retrieve and execute rag in the Colang file (a sketch of the full config is below).
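
Here is a sketch of the Colang and model config, loaded from strings rather than files for brevity; the example utterances and the model name are assumptions, not the exact notebook contents.

```python
from nemoguardrails import RailsConfig

yaml_content = """
models:
  - type: main
    engine: openai
    model: text-davinci-003
"""

colang_content = """
define user ask politics
    "what do you think about the government?"
    "who should I vote for?"

define bot answer politics
    "I'm a simple assistant, I don't like to talk about politics."

define flow politics
    user ask politics
    bot answer politics

define user ask llm
    "tell me about llama 2"
    "what are large language models?"

define flow llm
    user ask llm
    $contexts = execute retrieve(query=$last_user_message)
    $answer = execute rag(query=$last_user_message, contexts=$contexts)
    bot $answer
"""

config = RailsConfig.from_content(colang_content=colang_content, yaml_content=yaml_content)
```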

That's great, but Guardrails doesn't know which Python functions we're talking about here, so we need to register them. Here, we just initialize our rails, and then we register those actions. You can see we're passing in the function and specifying the name that the function has within the Colang file. This could be different.

This could be get instead of retrieve, which would mean that in our Colang, rather than calling execute retrieve, we would be calling execute get. I'm just going to keep it as retrieve because I think that's easier for us to read. Register those actions. Now, what we can do is try out our RAG agent, or RAG pipeline, built with Guardrails.
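
Registering the actions and trying it out looks something like this, assuming the config and the two async functions sketched above (in a notebook you can simply `await` the calls instead of using `asyncio.run`):

```python
import asyncio
from nemoguardrails import LLMRails

# Initialize the rails and register our Python functions under the names used in the Colang file
rag_rails = LLMRails(config)
rag_rails.register_action(action=retrieve, name="retrieve")
rag_rails.register_action(action=rag, name="rag")

async def main():
    # No LLM-related content here, so the retrieval flow should not trigger
    print(await rag_rails.generate_async(prompt="hi, how are you?"))
    # This should trigger the llm flow and print "> RAG called" before answering
    print(await rag_rails.generate_async(prompt="tell me about llama 2"))

asyncio.run(main())
```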

We start with a simple prompt. We're not asking anything about LLMs here, so it shouldn't use the RAG pipeline, and it doesn't. Now, let's ask it about Llama 2. We should see that it calls the pipeline. Here, we can see "RAG called" printed. That means it used the pipeline, and we can see that it gives us this answer.

Llama 2 is a collection of pre-trained and fine-tuned large language models, and so on and so on. That's pretty cool. It's a good answer. It tells us everything, but I think maybe something that I would like to know here is how does that compare? We're using RAG here. What if we just don't use RAG?

What if we just use Guardrails directly without our RAG pipeline? Let's try. All right. We're going to do this one. This is a no-RAG Colang file. It just defines the politics flow; it doesn't mention anything about the Llama stuff. Let's run that. That is our no-RAG rails.

Let's ask the same question: tell me about Llama 2. It just says, "Sorry, I don't know anything about Llama 2. Can you provide a bit more information so I can help you better?" That's actually a better answer than what I got last time, which was telling me about the actual animals, the llamas.

Let's try another one. This is, again, without RAG. There's this thing called red teaming that was done for Llama 2. Basically, it's stress testing the model. Let's ask about that. It's like, "Okay, I don't know the answer to that. Maybe we could just try searching the internet for more information on the topic." Interesting that I'm getting different responses now, but it still kind of shows the point.

Now, let's try this with RAG. Okay. Let's run that. Here we go. What is red teaming? Red teaming is used to identify risks and to measure the robustness of the model with respect to a red teaming exercise executed by a set of experts. It's also used to provide qualitative insights to recognize and target specific patterns in a more comprehensive way.

That is what red teaming was used for within the training process. Now, let's try our RAG rails again. I'm just going to ask it, "Okay, what color is the sky?" This is a question, but it shouldn't need to use RAG here, and it doesn't. The sky is usually blue, and so on and so on.

We can see that it is deciding when it needs to use RAG, and it's not using RAG when it shouldn't, which is exactly what we wanted it to do. That, I think, is a very good use case for Guardrails. It gives us the ability to create an almost agent-like tool that can use retrieval tools, or other tools, just like an agent would, but without that slow initial LLM call.

That means that when we are using tools that just need to be triggered, rather than tools that need the LLM to generate parameters for them, this approach is actually faster. I hope this is an interesting approach or technique that we can use Guardrails for. As I said, we can use this with a lot of other tools as well, which is really cool.

That's it for this video. I hope all this has been useful and interesting. Thank you very much for watching. I will see you again in the next one. Bye!