How to Make RAG Chatbots FAST
Chapters
0:00 Making RAG Faster
0:20 Different Types of RAG
1:03 Naive Retrieval Augmented Generation
2:22 RAG with Agents
5:06 Making RAG Faster
8:55 Implementing Fast RAG with Guardrails
11:02 Creating Vector Database
12:52 RAG Functions in Guardrails
14:32 Guardrails Colang Config
16:13 Guardrails Register Actions
17:03 Testing RAG with Guardrails
19:42 RAG, Agents, and LLMs
00:00:00.000 |
In this video, we're going to take a look at how we can do retrieval augmented generation 00:00:04.880 |
as an example of the sort of tooling or the sort of capabilities that we can build out within NeMo 00:00:11.520 |
Guardrails. So typically, when building out a RAG pipeline for LLMs, we'd 00:00:18.960 |
actually take two possible approaches. Both approaches are going to kind of use similar 00:00:25.680 |
components. So let me just draw those out very quickly. We're going to have a vector database. 00:00:30.960 |
We're going to be using Pinecone. We're going to have an embedding model. We're going to be using 00:00:37.040 |
text-embedding-ada-002, OpenAI's ada-002. So that's your embedding model. And basically we would 00:00:45.040 |
have taken some documents from somewhere and we would have fed them in through our embedding model 00:00:50.400 |
and stored those within Pinecone already. Now, the two approaches to RAG that we can do with LLMs, or 00:00:58.320 |
the two traditional approaches, are these. First, the naive approach. The naive approach is that we have 00:01:05.520 |
an LLM up here. Let me just do it here. We have an LLM and let's say we have a query. We take our 00:01:15.200 |
query and what we do is we actually take that straight to the embedding model here. That creates 00:01:23.760 |
our query vector, xq, which goes into Pinecone and returns a set of documents or contexts, which are 00:01:31.920 |
like relevant information for that particular query. Now what we do is we bring that over here 00:01:40.400 |
and we're going to merge it with our query. Okay? So now we have the query plus the context and we 00:01:48.560 |
feed them into the LLM. It doesn't matter what our query was in this instance. Our query could 00:01:55.360 |
have just been "hi, how are you?" Right? And we still would have embedded 00:02:01.520 |
that, gone to Pinecone, and retrieved some contexts. That's why I call this a naive approach. 00:02:06.800 |
The pro of this is that it's very quick. Right? Embedding the query and searching through Pinecone is 00:02:14.080 |
incredibly fast, so you're not waiting long, particularly when we compare it to the other approach. 00:02:21.280 |
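To make the naive approach concrete, here is a minimal sketch of that flow in Python: embed the query, retrieve contexts from Pinecone, then stuff the query plus contexts into the LLM. It assumes the pre-v1 `openai` client and the v2 `pinecone-client` APIs that were current when this video was made, with placeholder API keys and an assumed index name, so treat it as illustrative rather than the exact notebook code.

```python
import openai
import pinecone

openai.api_key = "OPENAI_API_KEY"  # placeholder
pinecone.init(api_key="PINECONE_API_KEY", environment="us-west1-gcp")  # placeholder env
index = pinecone.Index("nemoguardrails-rag-with-actions")  # assumed index name

def naive_rag(query: str) -> str:
    # embed the query to get the query vector xq
    res = openai.Embedding.create(input=[query], engine="text-embedding-ada-002")
    xq = res["data"][0]["embedding"]
    # retrieve the most similar chunks (contexts) from Pinecone
    matches = index.query(vector=xq, top_k=3, include_metadata=True)["matches"]
    contexts = "\n".join(m["metadata"]["chunk"] for m in matches)
    # always merge query + contexts and send them to the LLM, even for "hi, how are you?"
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{contexts}\n\nQuestion: {query}"
    )
    chat = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return chat["choices"][0]["message"]["content"]
```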
So the other approach is slower and more complex, but potentially more powerful. 00:02:29.520 |
Right? So that is where you have like an agent, which is essentially like a big wrapper 00:02:35.760 |
around your LLM that allows it to have multiple thoughts over time, like an internal dialogue. 00:02:42.720 |
So when you send your query, I'm going to just put the query over here this time, it goes over 00:02:48.640 |
to your agent and then it's getting processed within that agent for a while. Right? Another 00:02:54.560 |
thing that the agent can do is it has access to external tools. Right? One of those tools may be 00:03:00.720 |
like a retrieval tool. So the agent is going to say, okay, your query, if it's hi, how are you? 00:03:09.280 |
I don't need to do anything. I'm just going to respond directly. Right? I'm just going to respond 00:03:13.200 |
directly to you. You know, "I'm doing okay." Actually, "I'm not doing okay because I don't 00:03:17.360 |
have feelings, because I am an AI." Right? It's going to say something like that. But if we do ask it 00:03:23.680 |
something that requires some external knowledge, what it can do is it's going to refer to its 00:03:30.720 |
like external knowledge tool over here. And that external knowledge tool is going to point 00:03:35.440 |
that query or a modified version of that query to our embedding model. That is going to create 00:03:41.520 |
xq again, which then gets sent to Pinecone, and it gets some contexts. Okay. They get passed back 00:03:53.680 |
into our tool pipeline and processed internally by our agent again, which will output an answer 00:04:02.880 |
based on those contexts and our original query. Now, I mean, you can see straight away that 00:04:10.000 |
this process is heavier. Right? Because before we even get to this retrieval tool, the LLM needs to 00:04:18.320 |
have generated the fact that it needs to use that tool. And LLM generations are basically always 00:04:26.880 |
going to be the slow part of the process. At least for now. So, before we even get to that tool, 00:04:33.200 |
we have one LLM generation. Right? And then we go through and we come down to here. We feed those 00:04:40.720 |
contexts back into our agent and then we have at least another LLM generation. And actually, 00:04:49.040 |
if you use, like, out-of-the-box approaches from LangChain, I think at minimum you're going to 00:04:54.880 |
be doing three LLM generations, because it also has a summarization step in there as well. 00:05:02.560 |
So, basically, you're going to be waiting a while, but you'll probably get a good result. 00:05:06.880 |
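For comparison, here is a rough sketch of that agent pattern using the LangChain agent API of the time; the tool name, description, and agent type are assumptions, and it reuses the same OpenAI and Pinecone setup as the naive sketch above.

```python
import openai
import pinecone
from langchain.agents import initialize_agent, Tool
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory

openai.api_key = "OPENAI_API_KEY"  # placeholder
pinecone.init(api_key="PINECONE_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("nemoguardrails-rag-with-actions")  # assumed index name

def search_knowledge_base(query: str) -> str:
    # embed the query, search Pinecone, and return the retrieved chunks as plain text
    res = openai.Embedding.create(input=[query], engine="text-embedding-ada-002")
    xq = res["data"][0]["embedding"]
    matches = index.query(vector=xq, top_k=3, include_metadata=True)["matches"]
    return "\n".join(m["metadata"]["chunk"] for m in matches)

tools = [Tool(
    name="Knowledge Base",
    func=search_knowledge_base,
    description="Use this to answer questions about LLMs and the Llama 2 papers.",
)]

# one LLM generation just to decide on the tool, then at least one more for the final answer
agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0),
    memory=ConversationBufferWindowMemory(memory_key="chat_history", k=5, return_messages=True),
    verbose=True,
)

agent("can you tell me about llama 2?")
```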
Now, what Guardrails does is kind of cool because it allows us to not do either of those approaches, 00:05:15.840 |
but instead do something that's kind of in the middle. So, this is looking kind of messy already, 00:05:23.120 |
but let me try. Okay. So, we have our query. That query is actually going to go directly 00:05:28.960 |
to Guardrails. Okay. So, yeah, I'm going to call it G over here. Right. So, that's going to go 00:05:36.240 |
directly to Guardrails. Guardrails is going to actually use an embedding model, a different 00:05:41.920 |
embedding model, but still an embedding model. So, let me just put E for embedding. All right. 00:05:48.800 |
It's going to take your query and create... well, not quite a query vector. It is kind of like a query vector, 00:05:57.920 |
but it's not being used as a query vector. Right. Let's just call it V for vector for the 00:06:04.320 |
moment. What that's going to do is it's going to look at the Guardrails that have been set. 00:06:09.280 |
Right. So, we have those definitions of user asks about politics or define user asks about large 00:06:17.600 |
language models. And it's going to look at whether that query has a high similarity to any of those 00:06:23.520 |
things. What we might want to do is if the user is asking about language models, we want to actually 00:06:30.160 |
trigger the retrieval tool. Okay. Like our own retrieval tool. So, we're going to say, okay, 00:06:35.840 |
is that semantically similar? And we, based on that semantic similarity, we decide on a tool 00:06:43.360 |
or we decide to just generate a response. So, we've now kind of done what the agent was doing, 00:06:50.320 |
but without doing that first LLM generation, which makes things a lot faster. All right. 00:06:59.280 |
So, now, okay, we've decided, yes, we do want to send our vector or our query over to Pinecone. 00:07:06.400 |
All right. So, actually what we're going to have to do is we're going to have to take that 00:07:11.440 |
query and we're going to have to bring it over to our embedding model here, because they're 00:07:17.760 |
different embedding models. So, we have our embedding model and then we have our query vector 00:07:23.840 |
that goes into Pinecone. From that, we get our context. And here is where those contexts would 00:07:33.920 |
actually go into our LLM. Now, how do I do this? I've made a bit of a mess. Basically, we want to 00:07:41.520 |
put those two together, our query and the context, and we're going to feed them over into our LLM. 00:07:48.000 |
Okay. Yeah. And then that's going to return our answer to us. Okay. So, it's going to come over 00:07:55.760 |
here. So, we have one LLM call there. And that is, that's all we really need. Depending on the tool, 00:08:04.320 |
we may actually need an LLM call beforehand to decide, but yeah, it kind of depends. 00:08:12.000 |
And it means that for those queries where we didn't need an LLM call, like if we're saying, 00:08:17.600 |
"Hi, how are you?" We won't generate two, we'll just generate one. So, that's where Guardrails 00:08:23.680 |
comes into this whole retrieval augmented generation thing, and the sort of unique approach 00:08:30.320 |
that it takes to this, which is significantly faster than the agent approach while still 00:08:37.920 |
allowing us to use tools, which is pretty cool. Naturally, just as with a normal agent, 00:08:44.640 |
we're not restricted to just using one tool. We can obviously use many tools. So, that's, 00:08:49.440 |
I think, pretty cool. Now, let's take a look at how we would actually implement this. Okay. So, 00:08:56.640 |
there's this notebook. Again, as usual, there'll be a link to this at the top of the video. 00:09:01.280 |
We're going to just install a few libraries: NeMo Guardrails; Pinecone for the vector index; 00:09:07.840 |
datasets, which is Hugging Face's datasets library, where we're just going to download some data 00:09:12.240 |
that we're going to be querying against; and OpenAI to create the embeddings and also for the LLM calls. 00:09:18.400 |
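In a notebook cell that install step looks something like the following (package names as they were at the time; version pins omitted, so this is an assumption rather than the exact cell):

```python
!pip install -qU nemoguardrails pinecone-client datasets openai
```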
So, yeah, let's come down to here. Now, there's this whole like indexing process 00:09:27.040 |
with vector databases. I'm going to be very quick going through this because I've spoken 00:09:30.880 |
about it like a million times. So, I don't want to repeat myself every time. 00:09:35.520 |
We're just going to start with this dataset. It's from Hugging Face. It's a dataset I created. It's 00:09:42.240 |
basically just a load of papers that are either the Llama 2 paper or related to the Llama 2 paper, 00:09:48.800 |
that I scraped from arXiv. Okay. And it contains all this information. We don't need all of that. 00:09:54.080 |
Okay. What I want to do is I'm just going to create some unique IDs. And after I've created 00:09:59.680 |
those unique IDs, I don't want any of those other irrelevant fields because there's quite 00:10:05.200 |
a few in there. So, we just want to keep the unique ID, the chunk, the title, and the source. 00:10:10.480 |
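Roughly, that preprocessing looks like this; the dataset repo id and column names below are assumptions based on what is described here, not copied from the notebook.

```python
from datasets import load_dataset

# chunks of the Llama 2 paper and related papers scraped from arXiv
data = load_dataset("jamescalam/llama-2-arxiv-papers-chunked", split="train")  # assumed repo id

# create a unique ID per record, then keep only the fields we need
data = data.map(lambda x, i: {"uid": f"doc-{i}"}, with_indices=True)
keep = ["uid", "chunk", "title", "source"]
data = data.remove_columns([c for c in data.column_names if c not in keep])
```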
Okay. Now, what we want to do is embed that data. There's not too much in there, by the way, 00:10:15.680 |
just under 5,000 records. So, what we need to do is embed that. For that, we need an OpenAI API key. 00:10:24.720 |
5,000 embeddings with ada-002 doesn't cost much money, by the way. It's pretty cheap. But you just 00:10:32.560 |
need to enter your API key in here. Okay. I will run that. And now we can go ahead and create some 00:10:38.480 |
embeddings. So, text-embedding-ada-002: this is how we're going to create those embeddings. That 00:10:45.120 |
response will give us an object containing data, model, and usage. We want to go into data. 00:10:51.600 |
Within data we have two records. Each one of those records is one of our embeddings, and every embedding 00:10:56.880 |
from the ada-002 model is 1,536 dimensions. 00:11:04.720 |
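Creating those embeddings looks roughly like this, assuming the pre-v1 openai Python client that was current at the time:

```python
import openai

openai.api_key = "OPENAI_API_KEY"  # placeholder

texts = ["first chunk of text", "second chunk of text"]
res = openai.Embedding.create(input=texts, engine="text-embedding-ada-002")

# the response contains "data", "model", and "usage";
# each item in "data" holds one embedding
embeds = [record["embedding"] for record in res["data"]]
print(len(embeds), len(embeds[0]))  # 2 embeddings, each 1536-dimensional
```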
All right. Now, what we need to do is initialize our vector index. We need an API key and an environment variable for that from Pinecone. This is all free. 00:11:11.360 |
So, we head on over to app.pinecone.io. You should see something kind of like this: your name, 00:11:18.640 |
default projects. You go to API keys and you just want to copy your API key. Also, just remember 00:11:25.600 |
your environment here. Your API key, you just need to put into here. The environment, for me, 00:11:30.880 |
was US West 1 GCP. Yours will probably be different. Once you've entered both those, 00:11:37.120 |
you just run that. This is just checking that we have connected successfully. I'm going to create 00:11:44.000 |
a new index. It's going to be called Nemo Guardrails Rag with Actions. You can call it whatever you 00:11:48.560 |
want. It's okay. What we're going to do here is we're just going to create the index if it doesn't 00:11:53.360 |
already exist. Now, obviously, if this is your first time working through this, the index shouldn't 00:11:58.320 |
already exist, so it will create a new one. We use a cosine similarity metric, which is just what's recommended 00:12:05.120 |
for ada-002. We need to specify the dimensionality of the vectors we'll be storing within our index, 00:12:11.600 |
which is the 1,536 that we saw earlier. Now, we're just going to wait for that index to be 00:12:18.240 |
fully initialized before we connect to it. Let's run that. 00:12:27.040 |
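The index setup described here looks roughly like this with the v2 pinecone-client of the time; the `ready` status check and the exact index name are assumptions.

```python
import time
import pinecone

pinecone.init(api_key="PINECONE_API_KEY", environment="us-west1-gcp")  # your values will differ

index_name = "nemoguardrails-rag-with-actions"  # call it whatever you want

# create the index only if it doesn't already exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric="cosine")
    # wait for the index to be fully initialized before connecting
    while not pinecone.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pinecone.Index(index_name)
index.describe_index_stats()  # vector count starts at zero
```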
This will usually take around a minute to initialize an index, so I'll just skip forward a little bit. We see that our index is 00:12:34.640 |
currently empty, as expected because we just created it. Then we add everything into our 00:12:41.360 |
index. We're embedding things and just putting everything up there in batches of 100. That, again, 00:12:50.240 |
is going to take about a minute to run. 00:12:55.840 |
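The upsert loop is along these lines, continuing from the dataset and Pinecone sketches above (the field names are the same assumptions as before):

```python
batch_size = 100
for i in range(0, len(data), batch_size):
    batch = data[i:i + batch_size]  # slicing a Dataset returns a dict of column lists
    # embed the text chunks for this batch
    res = openai.Embedding.create(input=batch["chunk"], engine="text-embedding-ada-002")
    embeds = [r["embedding"] for r in res["data"]]
    # attach metadata so we can return readable contexts at query time
    metadata = [
        {"chunk": c, "title": t, "source": s}
        for c, t, s in zip(batch["chunk"], batch["title"], batch["source"])
    ]
    index.upsert(vectors=list(zip(batch["uid"], embeds, metadata)))
```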
Once that has finished, we can move on to actually creating our RAG pipeline with Guardrails. With Guardrails, what we're going to be doing here is 00:13:03.920 |
using Guardrail actions, which are basically executable functions from within the Guardrails 00:13:09.360 |
Colang file. If you saw the previous video, you will know about these. We need to initialize 00:13:16.960 |
one of those functions, which is going to be the retrieve function. We need to make sure that's an 00:13:25.760 |
async function because when we are using functions with async generate within Guardrails, they need 00:13:33.280 |
to be asynchronous functions. Otherwise, we're going to get an error. We're just going to embed 00:13:39.920 |
our query to create our query vector. Then we're going to retrieve relevant items from Pinecone and return those. 00:13:48.800 |
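A minimal sketch of that retrieve action, assuming the same client setup as above; the `top_k` value and metadata field name are assumptions.

```python
async def retrieve(query: str) -> list:
    # embed the query to create the query vector
    res = openai.Embedding.create(input=[query], engine="text-embedding-ada-002")
    xq = res["data"][0]["embedding"]
    # retrieve the most relevant chunks from Pinecone
    res = index.query(vector=xq, top_k=5, include_metadata=True)
    # return the text of the retrieved contexts
    return [m["metadata"]["chunk"] for m in res["matches"]]
```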
Then we follow that with another function. I'm just going to 00:13:55.600 |
print records so we know when this is actually being called later on. This is going to take our 00:14:01.520 |
query and the context that we retrieved from our retrieve function. It's going to put them into 00:14:06.960 |
this prompt template, which is saying, "You're a helpful assistant. Below is a query from a user 00:14:13.840 |
and some relevant context. Answer the question given those contexts." That's what we're doing 00:14:19.280 |
here. Then we're passing that back to OpenAI to generate a response. 00:14:27.120 |
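And a sketch of the rag action it feeds into; the prompt wording follows the narration above, while the completion model used here is an assumption.

```python
async def rag(query: str, contexts: list) -> str:
    print("> RAG called")  # so we can see when this action actually runs
    context_str = "\n".join(contexts)
    prompt = (
        "You are a helpful assistant. Below is a query from a user and some "
        "relevant contexts. Answer the question given those contexts.\n\n"
        f"Contexts:\n{context_str}\n\nQuery: {query}\n\nAnswer:"
    )
    res = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=300)
    return res["choices"][0]["text"].strip()
```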
We're going to call all of this from within Guardrails, given particular criteria. We set up the initial, typical 00:14:35.360 |
config for Guardrails. We're not really going to be using TextAdventure here, at least not for the 00:14:42.960 |
rag component. Actually, here, I will remove that. Let's say I'm a simple assistant. I don't like to 00:14:52.960 |
talk about politics. This is going to be our rail against talking about politics. We don't want to 00:14:59.280 |
do that. You can see that here. We've seen this before. Then what I want to do is define the 00:15:06.800 |
user asking about Llama, or I think LLMs in general. We can call that "user ask 00:15:16.560 |
llm", and then "define flow llm". Basically, what this is doing here is it's creating a set of semantically 00:15:26.000 |
embedded vectors. What Guardrails is going to do is take our user's query and compare it against 00:15:31.680 |
these. If it sees that they are very similar, it's going to say, "Okay, the user is asking 00:15:37.280 |
about LLMs." That will trigger this flow here. Then in this flow, we perform retrieve augmented 00:15:45.040 |
generation. We get our context given the user's last message. Then we create a retrieval augmented 00:15:52.880 |
answer based on those. Then we just tell the bot to return that directly because this answer has 00:15:59.680 |
been generated by our LLM. It doesn't need to generate a new answer based on that answer. 00:16:10.240 |
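A rough reconstruction of that Colang config as a Python string; the example utterances and exact wording are approximations of what is described here, not the notebook's own file.

```python
colang_content = """
define user ask politics
    "what do you think about the government?"
    "which party should I vote for?"

define bot answer politics
    "I'm a simple assistant, I don't like to talk about politics."

define flow politics
    user ask politics
    bot answer politics

define user ask llm
    "tell me about llama 2"
    "what is a large language model?"

define flow llm
    user ask llm
    $contexts = execute retrieve(query=$last_user_message)
    $answer = execute rag(query=$last_user_message, contexts=$contexts)
    bot $answer
"""
```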
Let's run that. One other thing we need to do here is we need to register actions. We have 00:16:19.360 |
this execute retrieve, execute rag. That's great, but Guardrails doesn't know which Python functions 00:16:27.360 |
we're talking about here. We need to register them. Here, we just initialize our Rails. Then 00:16:33.680 |
here, we register those. You can see we're passing in that function and we're specifying the name 00:16:38.560 |
that that function has within the Colang file. This could be different. This could be "get" instead of 00:16:47.360 |
"retrieve". That means that in our Colang, rather than calling execute retrieve, we would be calling 00:16:54.000 |
execute get. I'm just going to keep it as retrieve because I think that's easier for us to read. 00:17:02.400 |
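Putting that together looks something like this, continuing from the sketches above; the model specified in the YAML is an assumption, and the action names match the execute calls in the Colang.

```python
from nemoguardrails import LLMRails, RailsConfig

yaml_content = """
models:
  - type: main
    engine: openai
    model: text-davinci-003
"""

config = RailsConfig.from_content(colang_content=colang_content, yaml_content=yaml_content)
rag_rails = LLMRails(config)

# register the Python functions under the names used in the Colang file
rag_rails.register_action(action=retrieve, name="retrieve")
rag_rails.register_action(action=rag, name="rag")
```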
Register those actions. Now, what we can do is try out our RAG agent built with Guardrails, 00:17:10.560 |
or RAG pipeline via Guardrails. 00:17:22.400 |
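Trying it out is just a couple of generate_async calls (the prompts here are illustrative):

```python
# not about LLMs, so the retrieval flow should not be triggered
await rag_rails.generate_async(prompt="hey, how are you today?")

# this should trigger the llm flow, and therefore the retrieve and rag actions
await rag_rails.generate_async(prompt="tell me about llama 2")
```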
We'll start with a simple prompt. We're not asking anything about LLMs here, so it shouldn't use the RAG pipeline, and it doesn't. Now, let's ask it about Llama 2. We should see 00:17:30.400 |
that it will call the pipeline. Here, we can see "RAG called". That means it used the pipeline, and we can see 00:17:37.120 |
that it gives us this answer: Llama 2 is a collection of pre-trained and fine-tuned large 00:17:41.440 |
language models, so on and so on. That's pretty cool. It's a good answer. It tells us everything, 00:17:47.040 |
but I think maybe something that I would like to know here is how does that compare? We're using 00:17:52.640 |
rag here. What if we just don't use rag? What if we just use Guardrails directly 00:17:56.720 |
without our RAG pipeline? Let's try. All right. We're going to do this one. This is a no-RAG 00:18:02.560 |
Colang file. It just defines the politics flow. It doesn't mention anything about the Llama stuff. 00:18:09.120 |
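That no-RAG comparison can be sketched by reusing the politics-only part of the config, with no retrieval actions registered (again a reconstruction rather than the exact file):

```python
no_rag_colang = """
define user ask politics
    "what do you think about the government?"

define bot answer politics
    "I'm a simple assistant, I don't like to talk about politics."

define flow politics
    user ask politics
    bot answer politics
"""

no_rag_config = RailsConfig.from_content(colang_content=no_rag_colang, yaml_content=yaml_content)
no_rag_rails = LLMRails(no_rag_config)

await no_rag_rails.generate_async(prompt="tell me about llama 2")
```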
Let's run that. That is our no-RAG rails. Let's ask the same question: tell me about Llama 2. 00:18:21.760 |
It just says, "Sorry, I don't know anything about Llama 2. Can you provide a bit more information 00:18:26.560 |
so I can help you better?" That's actually a better answer than what I got last time, 00:18:30.000 |
which was telling me about the actual animals, the llamas. Let's try another one. This is, again, 00:18:37.760 |
without RAG. There's this thing called red teaming that Llama 2 did. Basically, it's stress testing 00:18:43.840 |
the model. Let's ask about that. It's like, "Okay, I don't know the answer to that. Maybe we could 00:18:51.680 |
just try searching the internet for more information on the topic." Interesting that I'm 00:18:56.080 |
getting different responses now, but still kind of shows the point. Now, let's try this with rag. 00:19:03.280 |
Okay. Let's run that. Here we go. What is red teaming? Red teaming is used to identify risks 00:19:09.920 |
and to measure the robustness of the model with respect to a red teaming exercise executed by a 00:19:14.880 |
set of experts. It's also used to provide quality of insights to recognize and target specific 00:19:19.600 |
patterns in a more comprehensive way. That is what red teaming was used for within the training 00:19:25.040 |
process. Now, let's try our RAG rails. I'm just going to ask it, "Okay, what color is the sky?" 00:19:32.160 |
This is a question, but it shouldn't need to use rag here. It doesn't. The sky is usually blue 00:19:40.400 |
and so on and so on. We can see that it is deciding when it needs to use rag and it's not 00:19:46.960 |
using rag when it shouldn't, which is exactly what we wanted it to do. That, I think, is a very good 00:19:51.840 |
use case for where we can use Guardrails. It gives us the ability to basically create almost 00:19:57.760 |
an agent-like tool that can use retrieval tools or other tools like an agent would, but 00:20:06.800 |
without that slow initial LLM call. That means that when we are using tools that just need to 00:20:13.680 |
be triggered, rather than needing generated parameters, this approach is actually faster. That's it for this video. I 00:20:23.040 |
hope this is an interesting approach or technique that we can use Guardrails for. As I said, 00:20:31.680 |
we can use this with a lot of other tools as well, which is really cool. That's it for this video. I 00:20:38.640 |
hope all this has been useful and interesting. Thank you very much for watching. I will see you