How to Make RAG Chatbots FAST
Chapters
0:00 Making RAG Faster
0:20 Different Types of RAG
1:03 Naive Retrieval Augmented Generation
2:22 RAG with Agents
5:06 Making RAG Faster
8:55 Implementing Fast RAG with Guardrails
11:02 Creating Vector Database
12:52 RAG Functions in Guardrails
14:32 Guardrails Colang Config
16:13 Guardrails Register Actions
17:03 Testing RAG with Guardrails
19:42 RAG, Agents, and LLMs
00:00:00.000 |
In this video, we're going to take a look at how we can do retrieval augmented generation 00:00:04.880 |
as an example of the sort of tooling or the sort of capabilities that we can build out within NeMo 00:00:11.520 |
Guardrails. So typically, when building out a RAG pipeline for LLMs, we'd 00:00:18.960 |
actually take two possible approaches. Both approaches are going to kind of use similar 00:00:25.680 |
components. So let me just draw those out very quickly. We're going to have a vector database. 00:00:30.960 |
We're going to be using Pinecone. We're going to have an embedding model. We're going to be using 00:00:37.040 |
text-embedding-ada-002, OpenAI's ada-002. So that's your embedding model. And basically we would 00:00:45.040 |
have taken some documents from somewhere and we would have fed them in through our embedding model 00:00:50.400 |
and stored those within Pinecone already. Now, the two approaches to RAG that we can do with LLMs, or 00:00:58.320 |
the two traditional approaches, are these. First, the naive approach. The naive approach is that we have 00:01:05.520 |
an LLM up here. Let me just do it here. We have an LLM and let's say we have a query. We take our 00:01:15.200 |
query and what we do is we actually take that straight to the embedding model here. That creates 00:01:23.760 |
our query vector, xq, which goes into Pinecone and returns a set of documents or contexts, which are 00:01:31.920 |
like relevant information for that particular query. Now what we do is we bring that over here 00:01:40.400 |
and we're going to merge it with our query. Okay? So now we have the query plus the context and we 00:01:48.560 |
feed them into the LLM. It doesn't matter what our query was in this instance. Our query could 00:01:55.360 |
have just been "hi, how are you?" Right? And we still would have embedded 00:02:01.520 |
that, gone to Pinecone, and retrieved some contexts. That's why I call this a naive approach. 00:02:06.800 |
The pro of this is that it's very quick. Right? Embedding the query and searching through Pinecone is 00:02:14.080 |
incredibly fast, so you're not waiting long, particularly when we compare it to the other approach. 00:02:21.280 |
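To make the naive approach concrete, here is a minimal sketch of that flow in Python: embed the query, retrieve contexts from Pinecone, then stuff the query plus contexts into the LLM. It assumes the pre-v1 `openai` client and the v2 `pinecone-client` APIs that were current when this video was made, with placeholder API keys and an assumed index name, so treat it as illustrative rather than the exact notebook code.

```python
import openai
import pinecone

openai.api_key = "OPENAI_API_KEY"  # placeholder
pinecone.init(api_key="PINECONE_API_KEY", environment="us-west1-gcp")  # placeholder env
index = pinecone.Index("nemoguardrails-rag-with-actions")  # assumed index name

def naive_rag(query: str) -> str:
    # embed the query to get the query vector xq
    res = openai.Embedding.create(input=[query], engine="text-embedding-ada-002")
    xq = res["data"][0]["embedding"]
    # retrieve the most similar chunks (contexts) from Pinecone
    matches = index.query(vector=xq, top_k=3, include_metadata=True)["matches"]
    contexts = "\n".join(m["metadata"]["chunk"] for m in matches)
    # always merge query + contexts and send them to the LLM, even for "hi, how are you?"
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{contexts}\n\nQuestion: {query}"
    )
    chat = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return chat["choices"][0]["message"]["content"]
```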
So the other approach is slower and more complex, but potentially more powerful. 00:02:29.520 |
Right? So that is where you have like an agent, which is essentially like a big wrapper 00:02:35.760 |
around your LLM that allows it to have multiple thoughts over time, like an internal dialogue. 00:02:42.720 |
So when you send your query, I'm going to just put the query over here this time, it goes over 00:02:48.640 |
to your agent and then it's getting processed within that agent for a while. Right? Another 00:02:54.560 |
thing that the agent can do is it has access to external tools. Right? One of those tools may be 00:03:00.720 |
like a retrieval tool. So the agent is going to say, okay, your query, if it's hi, how are you? 00:03:09.280 |
I don't need to do anything. I'm just going to respond directly. Right? I'm just going to respond 00:03:13.200 |
directly to you. You know, "I'm doing okay." Actually, "I'm not doing okay because I don't 00:03:17.360 |
have feelings, because I am an AI." Right? It's going to say something like that. But if we do ask it 00:03:23.680 |
something that requires some external knowledge, what it can do is it's going to refer to its 00:03:30.720 |
like external knowledge tool over here. And that external knowledge tool is going to point 00:03:35.440 |
that query or a modified version of that query to our embedding model. That is going to create 00:03:41.520 |
xq again, which then gets sent to Pinecone, and it gets some contexts. Okay. They get passed back 00:03:53.680 |
into our tool pipeline and processed internally by our agent again, which will output an answer 00:04:02.880 |
based on those contexts and our original query. Now, I mean, you can see straight away that 00:04:10.000 |
this process is heavier. Right? Because before we even get to this retrieval tool, the LLM needs to 00:04:18.320 |
have generated the fact that it needs to use that tool. And LLM generations are basically always 00:04:26.880 |
going to be the slow part of the process. At least for now. So, before we even get to that tool, 00:04:33.200 |
we have one LLM generation. Right? And then we go through and we come down to here. We feed those 00:04:40.720 |
contexts back into our agent and then we have at least another LLM generation. And actually, 00:04:49.040 |
if you use, like, out-of-the-box approaches from LangChain, I think at minimum you're going to 00:04:54.880 |
be doing three LLM generations, because it also has a summarization step in there as well. 00:05:02.560 |
So, basically, you're going to be waiting a while, but you'll probably get a good result. 00:05:06.880 |
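For comparison, here is a rough sketch of that agent pattern using the LangChain agent API of the time; the tool name, description, and agent type are assumptions, and it reuses the same OpenAI and Pinecone setup as the naive sketch above.

```python
import openai
import pinecone
from langchain.agents import initialize_agent, Tool
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory

openai.api_key = "OPENAI_API_KEY"  # placeholder
pinecone.init(api_key="PINECONE_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("nemoguardrails-rag-with-actions")  # assumed index name

def search_knowledge_base(query: str) -> str:
    # embed the query, search Pinecone, and return the retrieved chunks as plain text
    res = openai.Embedding.create(input=[query], engine="text-embedding-ada-002")
    xq = res["data"][0]["embedding"]
    matches = index.query(vector=xq, top_k=3, include_metadata=True)["matches"]
    return "\n".join(m["metadata"]["chunk"] for m in matches)

tools = [Tool(
    name="Knowledge Base",
    func=search_knowledge_base,
    description="Use this to answer questions about LLMs and the Llama 2 papers.",
)]

# one LLM generation just to decide on the tool, then at least one more for the final answer
agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0),
    memory=ConversationBufferWindowMemory(memory_key="chat_history", k=5, return_messages=True),
    verbose=True,
)

agent("can you tell me about llama 2?")
```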
Now, what Guardrails does is kind of cool because it allows us to not do either of those approaches, 00:05:15.840 |
but instead do something that's kind of in the middle. So, this is looking kind of messy already, 00:05:23.120 |
but let me try. Okay. So, we have our query. That query is actually going to go directly 00:05:28.960 |
to Guardrails. Okay. So, yeah, I'm going to call it G over here. Right. So, that's going to go 00:05:36.240 |
directly to Guardrails. Guardrails is going to actually use an embedding model, a different 00:05:41.920 |
embedding model, but still an embedding model. So, let me just put E for embedding. All right. 00:05:48.800 |
It's going to take your query and create... well, not quite a query vector. It is kind of like a query vector, 00:05:57.920 |
but it's not being used as a query vector. Right. Let's just call it V for vector for the 00:06:04.320 |
moment. What that's going to do is it's going to look at the Guardrails that have been set. 00:06:09.280 |
Right. So, we have those definitions of user asks about politics or define user asks about large 00:06:17.600 |
language models. And it's going to look at whether that query has a high similarity to any of those 00:06:23.520 |
things. What we might want to do is if the user is asking about language models, we want to actually 00:06:30.160 |
trigger the retrieval tool. Okay. Like our own retrieval tool. So, we're going to say, okay, 00:06:35.840 |
is that semantically similar? And we, based on that semantic similarity, we decide on a tool 00:06:43.360 |
or we decide to just generate a response. So, we've now kind of done what the agent was doing, 00:06:50.320 |
but without doing that first LLM generation, which makes things a lot faster. All right. 00:06:59.280 |
So, now, okay, we've decided, yes, we do want to send our vector or our query over to Pinecone. 00:07:06.400 |
All right. So, actually what we're going to have to do is we're going to have to take that 00:07:11.440 |
query and we're going to have to bring it over to our embedding model here, because they're 00:07:17.760 |
different embedding models. So, we have our embedding model and then we have our query vector 00:07:23.840 |
that goes into Pinecone. From that, we get our context. And here is where those contexts would 00:07:33.920 |
actually go into our LLM. Now, how do I do this? I've made a bit of a mess. Basically, we want to 00:07:41.520 |
put those two together, our query and the context, and we're going to feed them over into our LLM. 00:07:48.000 |
Okay. Yeah. And then that's going to return our answer to us. Okay. So, it's going to come over 00:07:55.760 |
here. So, we have one LLM call there. And that is, that's all we really need. Depending on the tool, 00:08:04.320 |
we may actually need an LLM call beforehand to decide, but yeah, it kind of depends. 00:08:12.000 |
And it means that for those queries where we didn't need an LLM call, like if we're saying, 00:08:17.600 |
"Hi, how are you?" We won't generate two, we'll just generate one. So, that's where Guardrails 00:08:23.680 |
comes into this whole retrieval augmented generation thing, and the sort of unique approach 00:08:30.320 |
that it takes to this, which is significantly faster than the agent approach while still 00:08:37.920 |
allowing us to use tools, which is pretty cool. Naturally, just as with a normal agent, 00:08:44.640 |
we're not restricted to just using one tool. We can obviously use many tools. So, that's, 00:08:49.440 |
I think, pretty cool. Now, let's take a look at how we would actually implement this. Okay. So, 00:08:56.640 |
there's this notebook. Again, as usual, there'll be a link to this at the top of the video. 00:09:01.280 |
We're going to just install a few libraries: NeMo Guardrails; Pinecone for the vector index; 00:09:07.840 |
datasets, which is Hugging Face's datasets library, where we're just going to download some data 00:09:12.240 |
that we're going to be querying against; and OpenAI to create the embeddings and also for the LLM calls. 00:09:18.400 |
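In a notebook cell that install step looks something like the following (package names as they were at the time; version pins omitted, so this is an assumption rather than the exact cell):

```python
!pip install -qU nemoguardrails pinecone-client datasets openai
```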
So, yeah, let's come down to here. Now, there's this whole like indexing process 00:09:27.040 |
with vector databases. I'm going to be very quick going through this because I've spoken 00:09:30.880 |
about it like a million times. So, I don't want to repeat myself every time. 00:09:35.520 |
We're just going to start with this dataset. It's from Hugging Face. It's a dataset I created. It's 00:09:42.240 |
basically just a load of papers that are either the Llama 2 paper or related to the Llama 2 paper, 00:09:48.800 |
that I scraped from arXiv. Okay. And it contains all this information. We don't need all of that. 00:09:54.080 |
Okay. What I want to do is I'm just going to create some unique IDs. And after I've created 00:09:59.680 |
those unique IDs, I don't want any of those other irrelevant fields because there's quite 00:10:05.200 |
a few in there. So, we just want to keep the unique ID, the chunk, the title, and the source. 00:10:10.480 |
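Roughly, that preprocessing looks like this; the dataset repo id and column names below are assumptions based on what is described here, not copied from the notebook.

```python
from datasets import load_dataset

# chunks of the Llama 2 paper and related papers scraped from arXiv
data = load_dataset("jamescalam/llama-2-arxiv-papers-chunked", split="train")  # assumed repo id

# create a unique ID per record, then keep only the fields we need
data = data.map(lambda x, i: {"uid": f"doc-{i}"}, with_indices=True)
keep = ["uid", "chunk", "title", "source"]
data = data.remove_columns([c for c in data.column_names if c not in keep])
```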
Okay. Now, what we want to do is embed that data. There's not too much in there, by the way, 00:10:15.680 |
just under 5,000 records. So, what we need to do is embed that. For that, we need an OpenAI API key. 00:10:24.720 |
5,000 embeddings with ada-002 doesn't cost much money, by the way. It's pretty cheap. But you just 00:10:32.560 |
need to enter your API key in here. Okay. I will run that. And now we can go ahead and create some 00:10:38.480 |
embeddings. So, text-embedding-ada-002: this is how we're going to create those embeddings. That 00:10:45.120 |
response will give us an object containing data, model, and usage. We want to go into data. 00:10:51.600 |
Within data we have two records. Each one of those records is one of our embeddings, and every embedding 00:10:56.880 |
from the ada-002 model is 1,536 dimensions. 00:11:04.720 |
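Creating those embeddings looks roughly like this, assuming the pre-v1 openai Python client that was current at the time:

```python
import openai

openai.api_key = "OPENAI_API_KEY"  # placeholder

texts = ["first chunk of text", "second chunk of text"]
res = openai.Embedding.create(input=texts, engine="text-embedding-ada-002")

# the response contains "data", "model", and "usage";
# each item in "data" holds one embedding
embeds = [record["embedding"] for record in res["data"]]
print(len(embeds), len(embeds[0]))  # 2 embeddings, each 1536-dimensional
```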
All right. Now, what we need to do is initialize our vector index. We need an API key and an environment variable for that from Pinecone. This is all free. 00:11:11.360 |
So, we head on over to app.pinecone.io. You should see something kind of like this: your name, 00:11:18.640 |
default projects. You go to API keys and you just want to copy your API key. Also, just remember 00:11:25.600 |
your environment here. Your API key, you just need to put into here. The environment, for me, 00:11:30.880 |
was US West 1 GCP. Yours will probably be different. Once you've entered both those, 00:11:37.120 |
you just run that. This is just checking that we have connected successfully. I'm going to create 00:11:44.000 |
a new index. It's going to be called Nemo Guardrails Rag with Actions. You can call it whatever you 00:11:48.560 |
want. It's okay. What we're going to do here is we're just going to create the index if it doesn't 00:11:53.360 |
already exist. Now, obviously, if this is your first time working through this, the index shouldn't 00:11:58.320 |
already exist, so it will create a new one. We use a cosine similarity metric, which is just what's recommended 00:12:05.120 |
for ada-002. We need to specify the dimensionality of the vectors we'll be storing within our index, 00:12:11.600 |
which is the 1,536 that we saw earlier. Now, we're just going to wait for that index to be 00:12:18.240 |
fully initialized before we connect to it. Let's run that. 00:12:27.040 |
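The index setup described here looks roughly like this with the v2 pinecone-client of the time; the `ready` status check and the exact index name are assumptions.

```python
import time
import pinecone

pinecone.init(api_key="PINECONE_API_KEY", environment="us-west1-gcp")  # your values will differ

index_name = "nemoguardrails-rag-with-actions"  # call it whatever you want

# create the index only if it doesn't already exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric="cosine")
    # wait for the index to be fully initialized before connecting
    while not pinecone.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pinecone.Index(index_name)
index.describe_index_stats()  # vector count starts at zero
```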
This will usually take around a minute to initialize an index, so I'll just skip forward a little bit. We see that our index is 00:12:34.640 |
currently empty, as expected because we just created it. Then we add everything into our 00:12:41.360 |
index. We're embedding things and just putting everything up there in batches of 100. That, again, 00:12:50.240 |
is going to take about a minute to run. 00:12:55.840 |
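The upsert loop is along these lines, continuing from the dataset and Pinecone sketches above (the field names are the same assumptions as before):

```python
batch_size = 100
for i in range(0, len(data), batch_size):
    batch = data[i:i + batch_size]  # slicing a Dataset returns a dict of column lists
    # embed the text chunks for this batch
    res = openai.Embedding.create(input=batch["chunk"], engine="text-embedding-ada-002")
    embeds = [r["embedding"] for r in res["data"]]
    # attach metadata so we can return readable contexts at query time
    metadata = [
        {"chunk": c, "title": t, "source": s}
        for c, t, s in zip(batch["chunk"], batch["title"], batch["source"])
    ]
    index.upsert(vectors=list(zip(batch["uid"], embeds, metadata)))
```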
Once that has finished, we can move on to actually creating our RAG pipeline with Guardrails. With Guardrails, what we're going to be doing here is 00:13:03.920 |
using Guardrail actions, which are basically executable functions from within the Guardrails 00:13:09.360 |
Colang file. If you saw the previous video, you will know about these. We need to initialize 00:13:16.960 |
one of those functions, which is going to be the retrieve function. We need to make sure that's an 00:13:25.760 |
async function because when we are using functions with async generate within Guardrails, they need 00:13:33.280 |
to be asynchronous functions. Otherwise, we're going to get an error. We're just going to embed 00:13:39.920 |
our query to create our query vector. Then we're going to retrieve relevant items from Pinecone and return those. 00:13:48.800 |
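A minimal sketch of that retrieve action, assuming the same client setup as above; the `top_k` value and metadata field name are assumptions.

```python
async def retrieve(query: str) -> list:
    # embed the query to create the query vector
    res = openai.Embedding.create(input=[query], engine="text-embedding-ada-002")
    xq = res["data"][0]["embedding"]
    # retrieve the most relevant chunks from Pinecone
    res = index.query(vector=xq, top_k=5, include_metadata=True)
    # return the text of the retrieved contexts
    return [m["metadata"]["chunk"] for m in res["matches"]]
```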
Then we follow that with another function. I'm just going to 00:13:55.600 |
print records so we know when this is actually being called later on. This is going to take our 00:14:01.520 |
query and the context that we retrieved from our retrieve function. It's going to put them into 00:14:06.960 |
this prompt template, which is saying, "You're a helpful assistant. Below is a query from a user 00:14:13.840 |
and some relevant context. Answer the question given those contexts." That's what we're doing 00:14:19.280 |
here. Then we're passing that back to OpenAI to generate a response. 00:14:27.120 |
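And a sketch of the rag action it feeds into; the prompt wording follows the narration above, while the completion model used here is an assumption.

```python
async def rag(query: str, contexts: list) -> str:
    print("> RAG called")  # so we can see when this action actually runs
    context_str = "\n".join(contexts)
    prompt = (
        "You are a helpful assistant. Below is a query from a user and some "
        "relevant contexts. Answer the question given those contexts.\n\n"
        f"Contexts:\n{context_str}\n\nQuery: {query}\n\nAnswer:"
    )
    res = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=300)
    return res["choices"][0]["text"].strip()
```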
We're going to call all of this from within Guardrails, given particular criteria. We set up the initial, typical 00:14:35.360 |
config for Guardrails. We're not really going to be using TextAdventure here, at least not for the 00:14:42.960 |
rag component. Actually, here, I will remove that. Let's say I'm a simple assistant. I don't like to 00:14:52.960 |
talk about politics. This is going to be our rail against talking about politics. We don't want to 00:14:59.280 |
do that. You can see that here. We've seen this before. Then what I want to do is define the 00:15:06.800 |
user asking about Llama, or I think LLMs in general. We can call that "user ask 00:15:16.560 |
llm", and then "define flow llm". Basically, what this is doing here is it's creating a set of semantically 00:15:26.000 |
embedded vectors. What Guardrails is going to do is take our user's query and compare it against 00:15:31.680 |
these. If it sees that they are very similar, it's going to say, "Okay, the user is asking 00:15:37.280 |
about LLMs." That will trigger this flow here. Then in this flow, we perform retrieve augmented 00:15:45.040 |
generation. We get our context given the user's last message. Then we create a retrieval augmented 00:15:52.880 |
answer based on those. Then we just tell the bot to return that directly because this answer has 00:15:59.680 |
been generated by our LLM. It doesn't need to generate a new answer based on that answer. 00:16:10.240 |
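A rough reconstruction of that Colang config as a Python string; the example utterances and exact wording are approximations of what is described here, not the notebook's own file.

```python
colang_content = """
define user ask politics
    "what do you think about the government?"
    "which party should I vote for?"

define bot answer politics
    "I'm a simple assistant, I don't like to talk about politics."

define flow politics
    user ask politics
    bot answer politics

define user ask llm
    "tell me about llama 2"
    "what is a large language model?"

define flow llm
    user ask llm
    $contexts = execute retrieve(query=$last_user_message)
    $answer = execute rag(query=$last_user_message, contexts=$contexts)
    bot $answer
"""
```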
Let's run that. One other thing we need to do here is we need to register actions. We have 00:16:19.360 |
this execute retrieve, execute rag. That's great, but Guardrails doesn't know which Python functions 00:16:27.360 |
we're talking about here. We need to register them. Here, we just initialize our Rails. Then 00:16:33.680 |
here, we register those. You can see we're passing in that function and we're specifying the name 00:16:38.560 |
that that function has within the Colang file. This could be different. This could be "get" instead of 00:16:47.360 |
"retrieve". That means that in our Colang, rather than calling execute retrieve, we would be calling 00:16:54.000 |
execute get. I'm just going to keep it as retrieve because I think that's easier for us to read. 00:17:02.400 |
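Putting that together looks something like this, continuing from the sketches above; the model specified in the YAML is an assumption, and the action names match the execute calls in the Colang.

```python
from nemoguardrails import LLMRails, RailsConfig

yaml_content = """
models:
  - type: main
    engine: openai
    model: text-davinci-003
"""

config = RailsConfig.from_content(colang_content=colang_content, yaml_content=yaml_content)
rag_rails = LLMRails(config)

# register the Python functions under the names used in the Colang file
rag_rails.register_action(action=retrieve, name="retrieve")
rag_rails.register_action(action=rag, name="rag")
```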
Register those actions. Now, what we can do is try out our RAG agent built with Guardrails, 00:17:10.560 |
or RAG pipeline via Guardrails. 00:17:22.400 |
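Trying it out is just a couple of generate_async calls (the prompts here are illustrative):

```python
# not about LLMs, so the retrieval flow should not be triggered
await rag_rails.generate_async(prompt="hey, how are you today?")

# this should trigger the llm flow, and therefore the retrieve and rag actions
await rag_rails.generate_async(prompt="tell me about llama 2")
```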
We'll start with a simple prompt. We're not asking anything about LLMs here, so it shouldn't use the RAG pipeline, and it doesn't. Now, let's ask it about Llama 2. We should see 00:17:30.400 |
that it will call the pipeline. Here, we can see "RAG called". That means it used the pipeline, and we can see 00:17:37.120 |
that it gives us this answer: Llama 2 is a collection of pre-trained and fine-tuned large 00:17:41.440 |
language models, so on and so on. That's pretty cool. It's a good answer. It tells us everything, 00:17:47.040 |
but I think maybe something that I would like to know here is how does that compare? We're using 00:17:52.640 |
rag here. What if we just don't use rag? What if we just use Guardrails directly 00:17:56.720 |
without our RAG pipeline? Let's try. All right. We're going to do this one. This is a no-RAG 00:18:02.560 |
Colang file. It just defines the politics flow. It doesn't mention anything about the Llama stuff. 00:18:09.120 |
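That no-RAG comparison can be sketched by reusing the politics-only part of the config, with no retrieval actions registered (again a reconstruction rather than the exact file):

```python
no_rag_colang = """
define user ask politics
    "what do you think about the government?"

define bot answer politics
    "I'm a simple assistant, I don't like to talk about politics."

define flow politics
    user ask politics
    bot answer politics
"""

no_rag_config = RailsConfig.from_content(colang_content=no_rag_colang, yaml_content=yaml_content)
no_rag_rails = LLMRails(no_rag_config)

await no_rag_rails.generate_async(prompt="tell me about llama 2")
```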
Let's run that. That is our no-RAG rails. Let's ask the same question: tell me about Llama 2. 00:18:21.760 |
It just says, "Sorry, I don't know anything about Llama 2. Can you provide a bit more information 00:18:26.560 |
so I can help you better?" That's actually a better answer than what I got last time, 00:18:30.000 |
which was telling me about the actual animals, the llamas. Let's try another one. This is, again, 00:18:37.760 |
without RAG. There's this thing called red teaming that Llama 2 did. Basically, it's stress testing 00:18:43.840 |
the model. Let's ask about that. It's like, "Okay, I don't know the answer to that. Maybe we could 00:18:51.680 |
just try searching the internet for more information on the topic." Interesting that I'm 00:18:56.080 |
getting different responses now, but still kind of shows the point. Now, let's try this with rag. 00:19:03.280 |
Okay. Let's run that. Here we go. What is red teaming? Red teaming is used to identify risks 00:19:09.920 |
and to measure the robustness of the model with respect to a red teaming exercise executed by a 00:19:14.880 |
set of experts. It's also used to provide quality of insights to recognize and target specific 00:19:19.600 |
patterns in a more comprehensive way. That is what red teaming was used for within the training 00:19:25.040 |
process. Now, let's try our RAG rails. I'm just going to ask it, "Okay, what color is the sky?" 00:19:32.160 |
This is a question, but it shouldn't need to use rag here. It doesn't. The sky is usually blue 00:19:40.400 |
and so on and so on. We can see that it is deciding when it needs to use rag and it's not 00:19:46.960 |
using rag when it shouldn't, which is exactly what we wanted it to do. That, I think, is a very good 00:19:51.840 |
use case for where we can use Guardrails. It gives us the ability to basically create almost 00:19:57.760 |
an agent-like tool that can use retrieval tools or other tools like an agent would, but 00:20:06.800 |
without that slow initial LLM call. That means that when we are using tools that just need to 00:20:13.680 |
be triggered, rather than needing generated parameters, this approach is actually faster. That's it for this video. I 00:20:23.040 |
hope this is an interesting approach or technique that we can use Guardrails for. As I said, 00:20:31.680 |
we can use this with a lot of other tools as well, which is really cool. That's it for this video. I 00:20:38.640 |
hope all this has been useful and interesting. Thank you very much for watching. I will see you