Chatbots with RAG: LangChain Full Walkthrough
Chapters
0:00 Chatbots with RAG
0:59 RAG Pipeline
2:35 Hallucinations in LLMs
4:08 LangChain ChatOpenAI Chatbot
9:11 Reducing LLM Hallucinations
13:37 Adding Context to Prompts
17:47 Building the Vector Database
25:14 Adding RAG to Chatbot
28:52 Testing the RAG Chatbot
32:56 Important Notes when using RAG
Today we're going to take a look at how we can build a chatbot using retrieval augmented generation.
So we're literally going to start with the assumption that you don't really know anything about RAG.
But by the end of this video, what we're going to have is a chatbot, for those of you that are interested, using OpenAI's GPT-3.5 model and also the LangChain library.
That chatbot will be able to answer questions about more recent events, or about our own internal documentation in an organization, for example, which a model on its own would not be able to do.
And the way that we will enable that is through this retrieval augmented generation.
So to get started, let me just take you through what we're actually going to be building.
OK, so what you can see here is what we'd call a RAG pipeline, or retrieval augmented generation pipeline.
In a typical scenario with an LLM, what we do is take a query like this up here, feed it into our LLM, and then get some output, right?
That is OK in some cases, but in other cases, it's not.
For general knowledge question answering, or for knowledge that the LLM has seen before, that works fine.
But the problem is that a lot of LLMs have not seen a lot of the information that we might want to ask them about.
So for example, in this question, I'm asking what makes Llama 2 special.
And most LLMs at the time of recording this would not be able to answer that, because Llama 2 was only released very recently.
Most LLMs were trained on training data that did not contain any information about Llama 2.
So most LLMs just have no idea what Llama 2 is.
And they'll typically tell you something about actual llamas, the animal, or they'll just make something else up when you ask them this question.
So what we do is we use this retrieval augmented generation pipeline.
And it's this pipeline that I'm going to teach you how to build today.
So here's an example of when LLMs don't know what you're talking about, even though it's something you would expect them to know. Or at least you would expect an LLM that is, in this case, good at programming to know about it.
LangChain is probably the most popular library for generative AI.
It's usually used with Python; there's also a JavaScript version available, and maybe some other languages as well.
But if we ask, and this is GPT-4 when it was first released, I asked GPT-4 in the OpenAI Playground how I use the LLMChain in LangChain.
The LLMChain is the basic building block of LangChain.
And it told me, OK, LangChain is a blockchain-based platform that combines artificial intelligence with blockchain, and the LLMChain is a token system used in LangChain.
All of that is completely false; none of this is true.
And the reason that we get this hallucination is, as I mentioned, that an LLM's knowledge is just what it learned during training, like we can see here.
Now let's just jump straight into it and actually build a chatbot that has this sort of limitation.
And we'll also play around with it and see that limitation in action.
OK, so we're going to be running through this notebook here.
There'll be a link to this at the top of the video, probably now.
And we just start by doing a few pip installs.
So we have LangChain, OpenAI, Hugging Face Datasets, the Pinecone client, and tiktoken.
That's basically all we need to do the whole chatbot plus RAG thing.
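For reference, the install cell would look roughly like this (a minimal sketch; the notebook linked with the video may pin specific versions):

```python
!pip install -qU langchain openai datasets pinecone-client tiktoken
```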
Now we're going to be relying very much on the LangChain library here.
And what we do is we import this ChatOpenAI object.
This is compatible with the GPT-3.5 and GPT-4 models from OpenAI.
Essentially, it's just a chat interface, an abstraction in LangChain, for using GPT-3.5 or GPT-4 in a chat setting.
You can also use those models directly via the OpenAI API.
But when we start building more complex AI systems, LangChain can be very useful, because it has all these additional components that we can just kind of plug in.
So we can add all these other items, such as a RAG pipeline, very easily.
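Initializing the chat model looks something like this (a sketch; the model name and the environment-variable handling are my assumptions, not necessarily what the notebook does):

```python
import os
from langchain.chat_models import ChatOpenAI

# LangChain's chat interface around OpenAI's chat models
chat = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],  # your OpenAI key
    model="gpt-3.5-turbo",                        # GPT-4 works here too
)
```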
Next, we're going to set up our chat messages. We're going to put some message objects in a list, and LangChain is going to format them into what you can see here, this type of structure.
So you have a system prompt at the top, which is basically your instructions to the model.
And then you have your user query, the assistant's reply, the next user message, and so on.
That is what your chat log is going to look like.
Now, via the OpenAI API, that would look like this.
You have a list of dictionaries, each dictionary containing a role, and the content, which is the text of that message.
All I've done here is taken that structure and translated it into what you would pass to the OpenAI chat API.
LangChain uses a slightly different format, but it's based on the same thing.
It's a very thin abstraction layer.
So you have a system message, a human message, and an AI message.
Obviously, the system message ties back to the system role, the human message to the user role, and the AI message to the assistant role.
So this is the LangChain version of what I've just shown you.
And what we're going to do is pass all of those to our ChatOpenAI object and generate a response.
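Building that chat log and generating a reply would look something like this (a sketch; the message contents are illustrative, not copied from the notebook):

```python
from langchain.schema import SystemMessage, HumanMessage, AIMessage

# the chat log: system instructions first, then alternating human/AI turns
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great, thank you. How can I help?"),
    HumanMessage(content="Can you tell me about string theory?"),
]

res = chat(messages)  # returns an AIMessage
print(res.content)
```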
So it's telling me about string theory, which is what I asked up here.
Though with that sort of question, I imagine it probably doesn't know the full answer.
So we can just print the output a little more nicely here.
It actually gives us this 1, 2, 3, 4, 5 list, a nicer format that we can read.
Now, if we take a look at what this response is, it's an AI message.
So when we're building up our chat log, all we need to do is append this AI message to our messages list in order to continue that conversation.
So what I'm going to do here is just append it.
And notice that I'm not saying, "Why do physicists believe string theory can produce a unified theory?"
I'm asking, "Why do physicists believe it can produce a unified theory?", right?
So here, our chat model must rely on the conversational history, those previous messages that we sent.
And that's why we need to add the response to our messages.
Then we add our new prompt to the messages.
And then we send all of those over to the chat model.
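That continuation step, again as a sketch:

```python
# keep the model's reply in the history so "it" resolves to string theory
messages.append(res)

prompt = HumanMessage(
    content="Why do physicists believe it can produce a unified theory?"
)
messages.append(prompt)

res = chat(messages)  # the model answers using the full conversation history
print(res.content)
```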
And you can see straight away that it mentions, OK, physicists believe that string theory has the potential to produce a unified theory, and so on, and so on.
So it definitely has that conversational history in there.
That was, I think, pretty easy to put together.
Now let's talk a little bit more about hallucinations and why they happen.
Returning to this: LLMs hallucinate, or one of the many reasons they hallucinate is because they have to rely solely on knowledge that they learned during their training.
What that means is that an LLM essentially lives in a world made up entirely of its training data.
It doesn't understand the world by going out into the world and seeing it.
It understands the world by looking at its training data set, and that's it.
So if some knowledge is not in that training data set, that knowledge is 100% not in the LLM.
And even if it is in there, it might not have made it into the LLM, or maybe it's not stored accurately.
But the whole point of an LLM, or what it tries to do, is to compress whatever was within that training data into an internal model of the world as it was within that training data.
So obviously that causes issues, because it has no access to anything else.
Now, this little box in the middle here can actually be many things, right?
It may be access to a SQL database, or many other things.
What this little box in the middle represents is some sort of connection to the external world.
It doesn't mean a connection to the entire world, just some subset of the external world.
Without that, this is our LLM: as mentioned, it just understands the world as it was in its training data.
The way that we refer to this knowledge is parametric knowledge, okay?
We call it parametric knowledge because it is knowledge that is stored within the model's parameters.
Those model parameters are only ever changed during training, not at any point afterwards.
So those parameters are frozen after training.
Essentially what we have is that kind of brain on the left, where we just have the parametric knowledge.
But what we can do with RAG is add a more long-term memory, or just memory that we can manage.
So in the case of RAG, that external knowledge base, that external memory, is a vector database.
And the good part of having a database as a form of input into your LLM is that you can actually add, delete, and just manage the memory or the knowledge of your LLM, which in my opinion is kind of cool.
It's almost like plugging into a person's brain and being able to manage the information that they have in there, or update it, or whatever else.
Which sounds a little dystopian, but it's a good parallel to what we're doing with LLMs.
We call this source knowledge, not parametric knowledge, because the knowledge is not stored in the model.
Instead, source knowledge refers to anything that we insert into the model via the prompt.
So any information that goes through the prompt is source knowledge.
Now, when we're adding that source knowledge to our LLM, it's going to look kind of like this.
We typically have some instructions at the top of our prompt.
We have the prompter's input, so basically the user's query, which is the little question down here.
And then the external information, that source knowledge we're inserting, is here, right?
It's what we call a context; we can call them documents, we can call them a lot of different things.
That is what we're going to be adding into our prompts.
So first, before we actually build the whole RAG pipeline to do this, let's just try inserting the context ourselves and see what sort of effect it has on our model's performance.
We're just going to add another message here: what is so special about Llama 2?
Okay: "I apologize, but I'm not familiar with a specific reference to Llama 2.
It's possible that you might be referring to something specific within a certain context.
Please provide more information or clarify your question."
And actually, I think the OpenAI team have added this, because in the past, if you asked about Llama 2, it would tell you about llamas, or it would hallucinate, like full-on hallucinate, where it's giving you an answer, but it's completely wrong.
So I think they have probably seen people asking about Llama 2, or maybe they just saw the Llama 2 release, and they added some sort of guardrail against people asking for that, to essentially tell the model: hey, when someone asks you about that, tell them you don't know.
That's what I think, anyway, unless they've just been training it on incoming data.
Next, I'm going to say: okay, can you tell me about the LLMChain in LangChain?
And we see another example of something they've kind of modified a little bit later on.
"Couldn't find any information specifically about LLMChain in LangChain," okay?
And it asks the same thing; there's that same structure to the response, right?
So I'm relatively sure this is actually a hard-coded guardrail that OpenAI put in there.
They've added it for LangChain, for Llama 2; they've clearly added it for a few things.
So now let's add some context. I actually got this information by, I think, just Googling "LLMChain in LangChain".
I went on their website and just pulled in a few little bits of information.
Basically, I just have some information about LangChain in there, and some information about the LLMChain.
And what I'm going to do is just concatenate all of those together to give us our source knowledge.
Then what I'm going to do is build what you saw before, that sort of structured prompt.
I'm going to pull those together and just see what we get.
So: can you tell me about the LLMChain in LangChain?
We create this prompt; maybe I can just show you quickly.
You have the instructions, we have our context, and then we have the query.
And I'm just going to feed that into our chatbot here.
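Pulling that together manually would look something like this (a sketch; the snippet texts are placeholders, not the exact passages pulled from the LangChain site):

```python
# hypothetical snippets copied from the LangChain documentation
contexts = [
    "LangChain is a framework for developing applications powered by LLMs ...",
    "An LLMChain combines a prompt template with an LLM, and is the most "
    "common type of chain in LangChain ...",
]
source_knowledge = "\n".join(contexts)

query = "Can you tell me about the LLMChain in LangChain?"

# instructions at the top, then the contexts, then the user's query
augmented_prompt = (
    "Using the contexts below, answer the query.\n\n"
    f"Contexts:\n{source_knowledge}\n\n"
    f"Query: {query}"
)

messages.append(HumanMessage(content=augmented_prompt))
res = chat(messages)
print(res.content)
```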
So: the LLMChain, in the context of LangChain, refers to a specific type of chain within the framework.
The LangChain framework is designed to develop applications powered by language models, with a focus on enabling data-aware and agentic applications.
It's almost a copy and paste from the website itself, but obviously formulated in a way that makes it much easier to read and specific to the question that we asked.
It then says the LLMChain is the most common type of chain used in LangChain.
And yeah, I mean, we got a very good answer just by adding some text in there.
But obviously, are we always going to manually add text into our prompt like we did just there?
That kind of defeats the point of what we're trying to do here.
So instead, what we want to do is find a way to do what we just did, but automatically, and at scale, over many, many documents. And that is where RAG comes in.
So looking back at this notebook, what we just did there, putting the context straight into the prompt, is kind of like we just ignored this bit here.
We created a retrieval augmented query by pulling this in there with our own context, manually.
So now what we need to do is figure out this bit here, right?
The first part of setting up that pipeline is going to be actually getting our data.
So we're going to download this data set here.
Or maybe you can just search for it in the Hugging Face search bar.
What it is, is a data set that I scraped a little while ago from the Llama 2 arXiv paper and other arXiv papers related to Llama 2.
It's, I think, pretty useful for this example.
You can kind of see those chunks of text I've pulled from there.
So we're going to use that data set to create our knowledge base.
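Loading it would look something like this (the dataset ID below is a placeholder; use the one linked with the video):

```python
from datasets import load_dataset

# hypothetical dataset ID; swap in the actual one from the notebook
dataset = load_dataset("your-username/llama-2-arxiv-chunks", split="train")
print(dataset)  # chunks of text plus metadata such as source and title
```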
Now, for the knowledge base, we're going to be using a vector database, like I mentioned; here, that will be Pinecone.
If you don't have an account, or you're not logged in, you will need to create an account.
And if you don't have any indexes already, you should see a screen like this.
I'm also going to make a note of my environment here.
And then I'll just paste my API key into here.
What we're doing here is initializing our index.
We'll be embedding with OpenAI's text-embedding-ada-002 model, and when we're using that model, the embedding dimension matters; think of that as the size of the vectors.
The vectors are numerical representations of meaning, like human meaning, that we get from the model.
The dimension is the size of the vectors that ada-002 outputs, and therefore the size of the index that will be storing those vectors.
So we need to make sure we get the dimension aligned with whatever model we're using there.
The metric is also important, but less so.
Typically, most embedding models you can use with cosine.
But there are occasionally some where you should use Euclidean or dot product instead of cosine.
Then what we're going to need to do is just wait for our index to actually initialize.
It takes, I think, 30 to 40 seconds, typically.
But that also depends on the tier of Pinecone and the region that you're using.
I wouldn't expect more than maybe one or two minutes at most.
Once that is finished, we can go ahead and connect to the index.
Then we just want to confirm that we have connected to the index.
We should see that the total vector count, at least for now, is 0, because we haven't added anything yet.
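A sketch of that setup, using the Pinecone client of that era (the index name and environment are my placeholders):

```python
import os
import time
import pinecone

pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],  # from the Pinecone console
    environment="us-west1-gcp",              # hypothetical; use your own
)

index_name = "llama-2-rag"  # hypothetical index name
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,   # must match ada-002's output dimension
        metric="cosine",  # cosine works for most embedding models
    )
    # wait for the index to finish initializing
    while not pinecone.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pinecone.Index(index_name)
index.describe_index_stats()  # total_vector_count should be 0 for now
```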
Next, we initialize our embedding model, which we can do from LangChain like this.
And what I'm going to do is just create some embeddings for some, we're calling them documents.
So "documents" here is equivalent to the contexts I was referring to earlier: basically, a chunk of text that we're going to store and refer to as part of our knowledge base.
Here, we have two of those documents, or contexts.
And if we embed those, what we're going to get is two embeddings.
Each one of those is the 1,536-dimensional embedding output by ada-002.
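Something like this (a sketch using LangChain's OpenAI embeddings wrapper; it reads your OpenAI key from the environment):

```python
from langchain.embeddings.openai import OpenAIEmbeddings

embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

docs = [
    "this is the first document",
    "and another, slightly different document",
]
embeddings = embed_model.embed_documents(docs)
print(len(embeddings), len(embeddings[0]))  # 2 1536
```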
Now we move on to iterating over our entire data set, the Llama 2 arXiv papers, and adding everything to Pinecone.
We do the embedding here, extracting key information about each one of those records: the text, where it's coming from, and the title of the paper that it comes from.
And then we just add all of those into Pinecone.
Well, I suppose one other thing we are doing here is creating some unique IDs for each record.
Now, when we're going through this loop and doing this, actually, let me just run it.
We can't do the whole thing at once, because the data set contains around 4,800 records.
If we tried to get the embeddings for all of those in one go, that would create 4,800-odd 1,536-dimensional embeddings, and we'd be trying to receive them over a single API call, which, most of the time, or at least for most internet providers, is going to cause problems.
And also, even OpenAI's ada-002, if you send too many things to embed at any one time, is probably going to error out, although I'm not sure; OpenAI have probably added some protection against that.
So at least on their side, you probably wouldn't actually run into any issues.
But in terms of getting the information to OpenAI, back from OpenAI, and then to Pinecone, you probably are going to run into some problems if your batch size is too high.
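The batched indexing loop would look roughly like this (a sketch; field names like doi, chunk_id, chunk, source, and title are assumptions about the data set's schema):

```python
from tqdm.auto import tqdm

# flatten the HF dataset into a plain list of row dicts for easy slicing
records = dataset.to_pandas().to_dict(orient="records")

batch_size = 100  # keep each embedding call and upsert comfortably small

for i in tqdm(range(0, len(records), batch_size)):
    batch = records[i:i + batch_size]
    # unique ID per chunk (hypothetical fields: paper DOI + chunk number)
    ids = [f"{r['doi']}-{r['chunk_id']}" for r in batch]
    texts = [r["chunk"] for r in batch]
    embeds = embed_model.embed_documents(texts)
    # store the raw text plus provenance so we can retrieve it later
    metadata = [
        {"text": r["chunk"], "source": r["source"], "title": r["title"]}
        for r in batch
    ]
    index.upsert(vectors=zip(ids, embeds, metadata))
```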
Once it's done, you can come down to here and describe the index stats again.
Actually, let me just rerun that to make sure it's there.
We can see that we now have 4,836 vectors, or records, in our index.
So with that, we have a fully fledged vector database, or knowledge base, that we can refer to.
Now, the final thing we need to do is finish the RAG pipeline and connect that knowledge base to our LLM.
I'm going to initialize this back in LangChain, because, depending on what you're doing, you will often want to use Pinecone via LangChain if you're using it with LLMs, since there are a lot of ways you can connect the two.
I don't think in this example we actually use those other components.
You'll see; basically, I'm just going to put the retrieved information straight into the context.
But a lot of the time, you'll probably want to initialize the vector store object and use it with other components in LangChain.
You just pass in your index and also the embedding model.
And this is the embed query method, which basically means it's going to embed a single chunk of text, rather than the embed documents method, which encodes batches of text.
One important thing here is that we have the text field.
The text field is the metadata field that we set up earlier, you can see it here, that contains the text that we would like to retrieve.
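Initializing the vector store, as a sketch:

```python
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field holding each chunk's text

# wraps our Pinecone index in LangChain's vector store interface;
# embed_query embeds single pieces of text, i.e. our queries
vectorstore = Pinecone(index, embed_model.embed_query, text_field)
```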
Now let's ask the question: what is so special about Llama 2?
If we take that query and pass it into our vector database, or our vector store in this case, and return the top three most semantically similar records, we can see that we're getting these chunks from the Llama 2 paper.
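That retrieval step is a single call:

```python
query = "What is so special about Llama 2?"

# returns the top-3 most semantically similar chunks as Document objects
results = vectorstore.similarity_search(query, k=3)
for doc in results:
    print(doc.page_content[:200], "\n---")
```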
Now, these are pretty hard to read, to be honest.
Even here, "our human evaluations for the helpfulness and safety may be suitable...", it's not the easiest text for us to parse.
But what we'll actually see is that LLMs can parse that information relatively well.
So we have these three documents, these three chunks of information.
So I'm going to set up this augment_prompt function here.
We're going to do what I just did there: retrieve the top three most relevant items from the vector store.
We're going to use those to create our source knowledge.
You might recognize this code from earlier, where we did it manually.
And then we're going to feed all of that into an augmented prompt and return it.
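A sketch of that function, following the same prompt structure as the manual version above:

```python
def augment_prompt(query: str) -> str:
    # retrieve the top-3 most relevant chunks from the knowledge base
    results = vectorstore.similarity_search(query, k=3)
    # concatenate the chunk texts into our source knowledge
    source_knowledge = "\n".join([doc.page_content for doc in results])
    # instructions, then contexts, then the user's query
    return (
        "Using the contexts below, answer the query.\n\n"
        f"Contexts:\n{source_knowledge}\n\n"
        f"Query: {query}"
    )
```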
So: "Using the contexts below, answer the query", and you can see that we have these contexts.
"In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models", and a ton of other stuff in there.
So this is now our augmented query that we can pass into our chatbot.
We're going to create a new human message, like before.
We're going to append that to our chat history and feed the whole thing into the chat model.
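As a sketch, that final step ties everything together:

```python
prompt = HumanMessage(content=augment_prompt("What is so special about Llama 2?"))
messages.append(prompt)

res = chat(messages)  # the model now answers from the retrieved contexts
print(res.content)
```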
So remember, the question here is: what is so special about Llama 2?
And it says: according to the provided context, Llama 2 is a collection of pretrained and fine-tuned large language models (I read that earlier, actually) developed and released by Meta.
These LLMs range in scale from 7 billion to 70 billion parameters.
They are specifically optimized for dialogue use cases and outperform open-source chat models on most benchmarks.
So what makes Llama 2 special is that the fine-tuned LLMs, and then we have this kind of mess of characters here; I'm not sure what it is. I think it's "Llama" something, but it's a bit of a mess.
They are designed to align with human preferences, enhancing their usability and safety.
This alignment with human preferences is often not easily reproducible or transparent in closed-source models, limiting progress in AI alignment research.
Additionally, and so on and so on, Llama 2 models appear to be on par with some of the closed-source models in terms of helpfulness and safety.
So we can see, yeah, I think it's giving us a good answer there.
Also, just consider that we've asked a question and got all this information back.
That has been stored in the conversation history.
So now the LLM will at least know what Llama 2 is.
Let's see if we can use that to answer our next question.
So: what safety measures were used in the development of Llama 2?
I'm not performing RAG on this specific query, but the model does have some information already.
And it says: in the provided context, the safety measures used in the development of Llama 2 include a detailed description of their approach to fine-tuning safety.
However, the specific details of these safety measures are not mentioned in the given text.
So even though we've told it what Llama 2 is, and we've given it a fair amount of text about Llama 2, context about Llama 2, it still can't answer the question.
So now what we're going to do is augment our prompt again.
OK: based on the provided context, the development of Llama 2 involved safety measures to enhance usability and safety.
Some of the safety measures mentioned in the text include... and then it gives us a little list of items here.
So: safety-specific data annotation and tuning, specifically focused on safety during training.
And it's also kind of figuring something out from that.
It suggests that training data and model parameters were carefully selected and adjusted to improve safety.
Red teaming; well, it tells us here: red teaming refers to a process in which external experts or evaluators simulate adversarial attacks on a system to identify vulnerabilities.
So, you know, almost like safety stress-testing the model, I would say.
The mention of iterative evaluations suggests that the models underwent multiple rounds of evaluation.
This iterative process likely involved continuous feedback and improvements to enhance safety.
The impression I'm getting from this answer is that it mentions this iterative process without it being fully spelled out in the context, so the model is kind of figuring out what that likely means.
All right, so we get a much better answer there, and it continues on.
But like I said, you can take a look at this notebook yourself and run it.
I think it's very clear what sort of impact something like RAG has on the system, and also how to put it together.
Now, this is what I would call naive RAG, or almost like the standard RAG.
It assumes that there's a question behind every single query, which is not always the case.
Sometimes your chatbot doesn't need to go and refer to an external knowledge base to respond.
So that is one of the downsides of using this approach.
Obviously, we get much better retrieval performance, right?
We can answer many more questions accurately.
And we can also cite where we're getting that information from.
This approach is also much faster than alternative RAG approaches, like using agents.
And we can limit the number of tokens we're feeding back into the LLM by setting a similarity threshold, so that we're not returning things that are completely irrelevant; see the sketch below.
If we do that, it'll help us mitigate one of the other issues with this approach, which is how much we put into the prompt.
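A threshold filter of that kind might look like this (a sketch; the 0.75 cut-off is an arbitrary example, and I'm assuming cosine similarity scores where higher means more similar):

```python
query = "What is so special about Llama 2?"

# similarity_search_with_score returns (Document, score) pairs
results = vectorstore.similarity_search_with_score(query, k=3)

# keep only chunks above a similarity threshold to save tokens
threshold = 0.75  # arbitrary example value; tune for your data
relevant = [doc for doc, score in results if score >= threshold]
```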
Obviously, we're feeding way more information into our LLM here, which is going to slow things down.
It's also going to cost us more, especially if you're using OpenAI, where you're paying per token.
And if you feed too much information in there, your LLM's performance can actually degrade quite a bit, especially when it's trying to follow instructions.
So there are always those sorts of things to consider as well.
But overall, if done well, by not feeding too much into the context window, this approach works very well.
And when you need other approaches, and you still need that external knowledge base, you can look at RAG with agents, something I've spoken about in the past, or RAG with guardrails, which is something that I've spoken about very recently.
Both of those are alternative approaches that have their own pros and cons, but effectively they achieve a similar outcome.
I hope this has been useful in introducing this idea of RAG and chatbots, and in seeing how all of those components fit together.
For now, what I'm going to do is leave it there.
So thank you very much for watching, and I will see you again in the next one.