Today we're going to take a look at how we can build a chatbot using retrieval augmented generation from start to finish. So we're literally going to start with the assumption that you don't really know anything about chatbots or how to build one. But by the end of this video, for those of you that are interested, we're going to have a chatbot built using OpenAI's GPT-3.5 model and the LangChain library.
One that is able to answer questions about more recent events, or about our own internal documentation in an organization, for example, which a model like GPT-3.5 or GPT-4 on its own cannot do. And the way that we will enable that is through retrieval augmented generation.
So to get started, let me just take you through what we're actually going to be building at a very high level. OK, so this, what you can see here is what we'd call a RAG pipeline or retrieval augmented generation pipeline. So in a typical scenario with an LLM, what we're going to do is we take a query like this up here and we just feed it into our LLM and then we get some output, right?
That is OK in some cases, but in other cases, it's not. For general knowledge, question answering or for knowledge that the LLM has seen before, this does work relatively well. But the problem is that a lot of LLMs have not seen a lot of information that we would like it to have an understanding of.
So for example, in this question, I'm asking what makes LLAMA 2 special? And most LLMs at the time of recording this would not be able to answer that because LLAMA 2 is a recent language model. Most LLMs were trained on training data that did not contain any information about LLAMA 2.
So most LLMs just have no idea what LLAMA 2 is. And they'll typically tell you something about actual llamas, the animal, or they'll just make something else up when you ask them this question. So we obviously don't want that to happen. So what we do is we use this Retrieval Augmented Generation pipeline.
And it's this pipeline that I'm going to teach you how to build today. So here's an example of when LLMs don't know what you're talking about, even though it's general knowledge. Or at least you would expect an LLM that is, for example, in this case, good at programming to be able to tell you the correct answer.
LangChain is probably the most popular library for generative AI, usually used from Python, though there's also a JavaScript version available, and maybe some other languages as well. But if we ask-- this is GPT-4, from when GPT-4 was first released-- I asked GPT-4 in the OpenAI Playground how to use the LLMChain in LangChain.
The LLMChain is the basic building block of LangChain. And it told me, OK, LangChain is a blockchain-based platform that combines artificial intelligence and language processing, and the LLMChain is a token system used in LangChain. All of that is completely false, like none of this is true. So it just completely made everything up.
This is-- it's a hallucination. And the reason that we get this hallucination is, as I mentioned, an LLM, its knowledge is just what it learned during training, like we can see here. It has no access to the outside world. Now let's just jump straight into it and actually build a chatbot that has this sort of limitation.
And we'll just see, how do we build that? It's pretty easy. And also, we'll sort of play around with it and see that limitation in action. OK, so we're going to be running through this notebook here. There'll be a link to this at the top of the video, probably now.
And we just start by doing a few pip installs. OK, so we have LangChain, OpenAI, Hugging Face datasets, the Pinecone client, and tiktoken. That's basically all we need to do the whole chatbot plus RAG thing. If we're not doing RAG, we need even less. But we're going to use RAG.
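As a rough sketch, the install cell looks something like this -- I'm not pinning exact versions here, but note that the code later on assumes the 2023-era langchain and pinecone-client APIs:

```python
# Install the libraries used in this walkthrough (run in a notebook cell).
# Versions are left unpinned here; the notebook from the video pins specific ones.
!pip install -qU langchain openai datasets pinecone-client tiktoken
```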
Now we're going to be relying very much on the LangChain library here. And what we do is we import this ChatOpenAI object. And this is compatible with GPT-3.5 and GPT-4 models from OpenAI. And essentially, it's just a chat interface or an abstraction in LangChain to use GPT-3.5 or GPT-4.
You can also use those models directly via the OpenAI API. But when we start building more complex AI systems, LangChain can be very useful because it has all these additional components that we can just kind of plug in. So we can add all these other items, such as a RAG pipeline, very easily.
So that's why we do it. Now, here, OK, we initialize our chat model. And what it's going to do is we're going to put some objects in there. And it's going to format them into what you can see here, this type of structure. And this is typical of OpenAI chat models.
So you have a system prompt at the top, which is basically your instructions to the model. And then you have your user query, then the chatbot's response (the AI, the assistant), then the user again, and so on and so on. Right? And it just keeps continuing. OK? That is what your chat log is going to look like.
Now, via the OpenAI API, that would look like this. So you have a list of dictionaries, each dictionary containing a role, and the content, which is the text, right? All I've done here is taken that and translated it into what you would put into the OpenAI chat completion endpoint.
And LangChain is a slightly different format, but based on the same thing. It's only-- it's a very thin abstraction layer. So you have system message, human message, AI message. Right? Obviously, system will tie back to the system role. Role user is human. And role assistant is AI. And you have your content here, right?
So this is the LangChain version of what I've just shown you. So let's initialize that. And what we're going to do is we're going to pass all of those to our ChatOpenAI object, OK? In here. We run that. It will take a moment. But we'll see that we get this response.
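Here's a minimal sketch of that, using the 2023-era LangChain import paths (newer releases move these into langchain_openai / langchain_core, so adjust if needed), with illustrative message contents:

```python
import os
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage, AIMessage

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"  # placeholder

# Chat interface to gpt-3.5-turbo (you could swap in gpt-4 here).
chat = ChatOpenAI(model_name="gpt-3.5-turbo")

# LangChain's message objects map onto the OpenAI roles:
#   SystemMessage -> "system", HumanMessage -> "user", AIMessage -> "assistant"
messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="Hi AI, how are you today?"),
    AIMessage(content="I'm great, thank you. How can I help you?"),
    HumanMessage(content="I'd like to understand string theory."),
]

res = chat(messages)  # returns an AIMessage
print(res.content)
```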
All right? So it's telling me about string theory, which is what I asked up here. And I mean, I don't know if it's accurate. Maybe it's hallucinating. I don't know. But I imagine that sort of thing, it probably doesn't know the answer to. Right. So we can just print out a little more nicely here.
So it actually gives us like this 1, 2, 3, 4, 5, gives us a nicer format that we can read here. Now, what we can do with this response, right, if we take a look at what it is, it's an AI message. So when we're building up our chat log, all we need to do is append this AI message to our messages list in order to sort of continue that conversation.
So what I'm going to do here is, yeah, I'm just appending that here. Now, I'm going to create a new prompt. I'm just going to ask another question. And notice here that I'm not saying, "why do physicists believe string theory can produce a unified theory".
I'm asking, why do physicists believe it can produce a unified theory, right? So here, our chat model must rely on the conversational history, those previous messages that we sent. And that's why we need to add the response to our messages. And then we add our new prompt to the messages.
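In code, that continuation looks roughly like this:

```python
# Append the model's reply so the next turn can rely on the conversation history.
messages.append(res)

# Note the follow-up says "it" rather than "string theory" -- the model has to
# resolve that from the earlier messages.
prompt = HumanMessage(
    content="Why do physicists believe it can produce a 'unified theory'?"
)
messages.append(prompt)

res = chat(messages)
print(res.content)
```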
And then we send all of those over to GPT-3.5 -- not ChatGPT; ChatGPT is the product, but it's the same model underneath. OK. And you can see straightaway that it mentions, OK, physicists believe that string theory has the potential to produce a unified theory, so on, and so on, and so on.
OK. So it definitely has that conversational history in there. That's good. Now we have a chat bot. That was, I think, pretty easy to put together. There's nothing complicated going on there. Now let's talk a little bit more about hallucinations and why they happen. Now returning to this, LLMs hallucinate, or one of the many reasons they hallucinate is because they have to rely solely on knowledge that they learn during their training.
And what that means is that an LLM essentially lives in a world that is entirely made up of whatever was in its training data. It doesn't understand the world by going out into the world and seeing the world. It understands the world by looking at its training data set, and that's it.
So if some knowledge is not in that training data set, that knowledge is 100% not in the LLM. And even if it is, it might not have made it into the LLM, or maybe it's not stored very well, or it's been misrepresented. You kind of don't know. But the whole point of an LLM or what it tries to do is compress whatever was within that training data into like an internal model of the world as it was within that training data set.
So obviously that causes issues because it has no access to anything else. And that's what we want to fix with RAG. So this little box in the middle here, that can actually be many things, right? It may be a RAG pipeline. It may also be like a Google search, right?
It can be search. It may be access to a SQL database or many other things, right? This little box in the middle, what that represents is some sort of connection to the external world. It doesn't mean a connection to the entire world, just some subset of the external world.
So that's what we want to enable. Now without that, this is our LLM, as mentioned, it just understands the world as it was in our training data. The way that we would refer to this knowledge is parametric knowledge, okay? So this here. So parametric knowledge, we call it that because it is a knowledge that is stored within the model parameters, okay?
Those model parameters are only ever changed during training, not at any other point, right? So those parameters are frozen after training. So essentially what we have is that kind of brain on the left where we just have the parametric knowledge. But what we can do with RAG is we can add a more long-term memory, or just a memory component, that we can actually modify, okay?
So in the case of RAG, that external knowledge base, that external memory is a vector database. And the good part of having a database as a form of input into your LLM is that you can actually add, delete, and just manage almost like the memory or the knowledge of your LLM, which in my opinion is kind of cool.
It's almost like plugging into a person's brain and just being able to manage the information that they have in there, or update it, or whatever else. Which sounds a little dystopian, but it's a good parallel to what we're doing with LLMs. So yes, we're doing that.
We call it source knowledge, not parametric knowledge because the knowledge is not stored in the parameters of the model. Instead, the source knowledge is referring to anything that we insert into the model, into the LLM via the prompt, okay? So any information that goes through the prompt is source knowledge.
Now when we're adding that source knowledge to our LLM, it's going to look kind of like this. We typically have some instructions at the top of our prompt. We have our prompt input, so basically the user's query, which is the little question at the bottom there. And then that external information, that source knowledge that we're inserting, is here, right?
It's what we call a context; we can call them documents, we can call them a lot of things actually. But let's call it a context in this case. That is what we're going to be adding into our prompts. So first, before we actually build the whole RAG pipeline to do this, let's just try inserting it ourselves and see what sort of effect it has on our model performance.
So we're just going to add another message here, what is so special about LLAMA2? Okay, and let's see what the model tells us. Okay, I apologize, but I'm not familiar with a specific reference to LLAMA2. It's possible that you might be referring to something specific within a certain context or domain.
So please provide more information and clarify your question. So the model cannot answer this question. And actually, I think the OpenAI team have added this because in the past, if you asked about LLAMA2, it would tell you about LLAMAs or give you, you know, it would hallucinate, like full on hallucinate where it's giving you an answer, but it's completely wrong.
So I think they have seen, probably seen people asking about LLAMA2 or maybe they just saw the LLAMA2 release and they added some sort of guardrail against people asking for that to essentially tell the model, hey, when someone asks you about that, tell them you don't know. I think anyway, unless they've just been training it on incoming data.
I don't know, but I don't think they have. So let's try another one. I'm going to say, okay, can you tell me about the LLMChain in LangChain? I asked this earlier, right, you saw. So we see there's another example of something they've kind of modified a little bit later on.
Couldn't find any information specifically about the LLMChain in LangChain, okay? And it asks the same thing -- you know, there's that same structure to this, right? So I'm relatively sure this is actually like a hard-coded guardrail that OpenAI put in there. So they've added it for LangChain, for LLAMA 2 -- they've clearly added it for a few things.
Okay. So let's try the source knowledge approach. So I actually got this information -- I think I just Googled LLMChain in LangChain and I went on their website and just pulled in a few little bits of information. They're actually quite long, right? You can see it goes on a little bit here.
Basically I just have some information about LangChain in there, some information about chains and the LLMChain, right? And what I'm going to do is just concatenate all those together to give us our source knowledge. And then what I'm going to do is what you saw before with that sort of structured prompt.
I'm going to pull those together and just see what we get. So, can you tell me about the LLMChain in LangChain? Let's try. So we create this prompt -- maybe I can just show you quickly. So we just print the augmented prompt. And you get, so you have the instructions, we have our context, and then we have the query.
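Sketched out, the manually augmented prompt is built something like this (the snippets themselves are placeholders standing in for the text copied from the LangChain docs):

```python
# Snippets copied from the LangChain documentation -- placeholder text here.
llmchain_information = [
    "An LLMChain is the most common type of chain in LangChain. ...",
    "Chains allow us to combine multiple components together into a single application. ...",
]

# Concatenate the snippets into our source knowledge.
source_knowledge = "\n".join(llmchain_information)

query = "Can you tell me about the LLMChain in LangChain?"

# Structured prompt: instructions, then the contexts, then the user's query.
augmented_prompt = f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""

print(augmented_prompt)
```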
And I'm just going to feed that into our chatbot here. So let's do that. See what we get out. So the LLMChain, in the context of LangChain, refers to a specific type of chain within the LangChain framework. The LangChain framework is designed to develop applications powered by language models, with a focus on enabling data-aware and agentic applications.
It's almost a copy and paste from the website itself, but obviously formulated in a way that makes it much easier to read and specific to the question that we asked. In this context, it says, the LLMChain is the most common type of chain used in LangChain. Okay, there we go.
We have loads of information here. As far as I know, it's all accurate. And yeah, I mean, we got a very good answer just by adding some text in there. But obviously, are we always going to add text into our prompt like we did just there? Probably I think probably not, right?
It kind of defeats the point of what we're trying to do here. So instead, what we want to do is find a way to do what we just did, but automatically and at scale, over many, many documents -- which is where RAG comes in. So looking back at this notebook here, what we just did there, putting the context straight into the prompt, it's kind of like we just ignored this bit here.
This whole bit. We created a retrieval augmented query by pulling this in there with our own context, feeding it into the LLM and getting our answer. So now what we need to do is figure out this bit here, right? This retrieval component. It's pretty easy. It really isn't that complicated.
And I think you'll see that very soon. So the first part of setting up that pipeline is going to be actually getting our data, right? So we're going to download this data set here. It's from Hugging Face. So you can download it. You can even see it on the website.
You just put like huggingface.co/ followed by the dataset path here. Or maybe you just search this in the Hugging Face search bar. And what it is, is a dataset that I scraped a little while ago from the LLAMA 2 ArXiv paper and other ArXiv papers related to LLAMA 2. It's not very clean.
It's also not a huge dataset. But it's, I think, pretty useful for this example. So you can kind of see those chunks of text I've pulled from there. So we're going to use that dataset to create our knowledge base. Now for the knowledge base, we're going to be using a vector database, like I mentioned, which is Pinecone.
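Loading it looks roughly like this -- the dataset ID below is a placeholder, so use the exact Hugging Face ID shown in the notebook:

```python
from datasets import load_dataset

# Placeholder dataset ID: a pre-chunked scrape of the Llama 2 arXiv paper and
# related arXiv papers -- substitute the ID from the notebook.
dataset = load_dataset("<username>/llama-2-arxiv-papers-chunked", split="train")

print(dataset)  # shows the features and the number of chunks
```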
For that, we do need to get an API key. So we would head on over to app.pinecone.io. If you don't have an account or you're not logged in, you will need to create an account or log in. It's free. So I'm going to go do that. And if you don't have any indexes already, you should see a screen like this.
You can even create an index here. But we're going to do it in the notebook. What we do want is to go to API keys. I'm going to copy my API key. And I'm also going to remember my environment here. So us-west1-gcp. I'm going to bring this over. So in here, we have the environment.
So it would be us-west1-gcp. And then I'll just paste my API key into here. I've already run this a little bit. So I'm going to move on to here. And what we're doing here is initializing our index. We're going to be using text-embedding-ada-002. That is an embedding model from OpenAI.
When we are using that model, the embedding dimension -- so think of that as the size of the vectors, which are like numerical representations of meaning, human meaning, that we get from some text -- is 1,536. That's the size of the vectors that ada-002 outputs, and therefore the dimension of the index that will be storing those vectors.
So we need to make sure the dimension is aligned with whatever model we're using there. The metric is also important, but less so. Typically, you can use most embedding models with cosine similarity, but there are occasionally some where you should use Euclidean distance or dot product instead.
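A sketch of the index setup, assuming the 2023-era pinecone-client API (newer clients replace pinecone.init with a Pinecone(api_key=...) object, so adjust accordingly); the key and index name here are placeholders:

```python
import time
import pinecone

# API key and environment both come from app.pinecone.io -> API Keys.
pinecone.init(
    api_key="YOUR_PINECONE_API_KEY",  # placeholder
    environment="us-west1-gcp",
)

index_name = "llama-2-rag"  # any name you like

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=1536,   # output dimension of text-embedding-ada-002
        metric="cosine",  # ada-002 embeddings work well with cosine similarity
    )
    # Wait for the index to finish initializing before we connect to it.
    while not pinecone.describe_index(index_name).status["ready"]:
        time.sleep(1)
```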
So we'd run that. And then what we're going to need to do is just wait for our index to actually initialize. It takes-- I think it's like 30 to 40 seconds, typically. But that also depends on the tier of Pinecone or also the region that you're using, the environment that you're using.
So it can vary a little bit. But I wouldn't expect more than maybe one or two minutes at most. So I'll just jump ahead to let that finish. OK. And once that is finished, we can go ahead and-- so this will connect to the index. And then we just want to confirm that we have connected the index.
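Connecting and checking the index stats is just:

```python
# Connect to the index and confirm it's currently empty.
index = pinecone.Index(index_name)
index.describe_index_stats()
# expect something like: {'dimension': 1536, ..., 'total_vector_count': 0}
```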
And we should see that the total vector count, at least for now, is 0, because we haven't added anything in there. So it should be empty. OK. So now let's initialize an embedding model. Again, like I said, we're using ada-002, and again, we could use the OpenAI API for that directly.
Or we can initialize it from LangChain like this. So we'll initialize it from LangChain here. And what I'm going to do is just create some embeddings for some -- we're calling them documents. So documents here is equivalent to the contexts that I was referring to earlier. So basically, a chunk of text that we're going to store and refer to as part of our knowledge base.
Here, we have two of those documents or contexts. And if we embed those, what we're going to get is two embeddings. Each one of those is a 1,536-dimensional embedding output by ada-002. OK. So that's how we do the embedding.
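A minimal sketch of that, again with the 2023-era LangChain import path and illustrative document texts:

```python
from langchain.embeddings.openai import OpenAIEmbeddings

# ada-002 produces 1,536-dimensional vectors, matching the index dimension above.
embed_model = OpenAIEmbeddings(model="text-embedding-ada-002")

docs = [
    "this is the first document",
    "and here is another, second chunk of text",
]

# embed_documents encodes a batch of texts; embed_query encodes a single string.
embeddings = embed_model.embed_documents(docs)
print(len(embeddings), len(embeddings[0]))  # -> 2 1536
```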
Now we move on to iterating over our entire dataset, the LLAMA 2 ArXiv papers, and doing that embedding for every chunk. So we do the embedding here, extracting key information about each one of those records -- the text, where it's coming from, and also the title of the paper it comes from. And then we just add all of those into Pinecone. Well, I suppose one other thing we're doing here is creating some unique IDs as well.
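A sketch of that loop -- the column names ('chunk', 'source', 'title', and the fields used to build the IDs) are assumptions about the dataset, so adjust them to whatever the data actually uses:

```python
from tqdm.auto import tqdm  # optional progress bar

data = dataset.to_pandas()
batch_size = 100

for i in tqdm(range(0, len(data), batch_size)):
    batch = data.iloc[i:i + batch_size]
    # Unique ID per chunk (assumed fields -- adjust to your dataset).
    ids = [f"{row['doi']}-{row['chunk-id']}" for _, row in batch.iterrows()]
    # The chunk text that we embed and also store as metadata for retrieval.
    texts = [row["chunk"] for _, row in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # Key information about each record: the text, its source, and paper title.
    metadata = [
        {"text": row["chunk"], "source": row["source"], "title": row["title"]}
        for _, row in batch.iterrows()
    ]
    index.upsert(vectors=zip(ids, embeds, metadata))
```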
Now, when we're going through this loop -- actually, let me just run it -- we do it in batches, right? We can't do the whole thing at once, because we have like 4,800-odd chunks there. And if we tried to get the embeddings for all of those in one go, that would create 4,800-odd 1,536-dimensional embeddings, and we'd be trying to send and receive all of them over a single API call, which, as far as I'm aware, most of the time just won't be allowed.
So yeah, you can't do that. And also, even OpenAI's ada-002 endpoint, if you give it too many things to embed at any one time, is probably going to error out -- although OpenAI has probably added some safeguards around that, so on their side you might not actually run into any issues.
But in terms of getting the information to OpenAI, back from OpenAI, and then to Pinecone, you probably are going to run into some problems if your batch size is too high. So yeah, we keep the batch size small. Now that is almost ready. Once it is, you can come down to here, and you can use describe index stats again.
And what you should see is this. Actually, let me just rerun it to make sure that it's there. So we can see that we now have 4,836 vectors or records in our index. So with that, we have a fully-fledged vector database, or knowledge base, that we can refer to for getting knowledge into our LLM.
So now, the final thing we need to do is just finish the RAG pipeline and connect that knowledge base up to our LLM. And then that would be it. We're done. So let's just jump into that. OK, so we're going to come down to here. I'm going to initialize this back in LangChain, because -- it depends on what you're doing, but often you will want to use Pinecone via LangChain if you're using it with LLMs, because there's a lot of ways you can connect the two.
I think in this example, we don't actually use those. I'm just going to connect them directly. You'll see, but basically, I'm just going to throw information straight into the context, into the prompt. But a lot of the time, you'll probably want to go and initialize the vector store object and use that with other components in LangChain.
So I initialize that there. You just pass in your index and also the embedding model. And this is the embed_query method, which basically means it's going to embed a single chunk of text, rather than the embed_documents method, which encodes a batch of many chunks of text.
One important thing here is that we have the text field. So text field is the metadata field that we set up earlier, you can see it here, that contains the text that we would like to retrieve. So we also specify that. Now let's ask the question, what is so special about LLAMA2?
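Roughly, with the 2023-era LangChain Pinecone wrapper, that looks like this:

```python
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field that contains the chunk text

# Wrap the Pinecone index as a LangChain vector store.
vectorstore = Pinecone(index, embed_model.embed_query, text_field)

query = "What is so special about Llama 2?"

# Return the top 3 most semantically similar chunks to the query.
results = vectorstore.similarity_search(query, k=3)
for doc in results:
    print(doc.metadata["title"])
    print(doc.page_content[:200], "\n")
```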
We saw earlier that it couldn't answer this. But now, if we take that query and we pass it into our vector database -- or our vector store, in this case -- and we return the top three most semantically similar records, we can see that we're getting these chunks from the LLAMA 2 paper.
You can see the title here. Now, these are pretty hard to read, to be honest. Even here -- "our human evaluations for their helpfulness and safety may be suitable substitutes for closed-source models" -- I can just about make that out. But what we'll actually see is that LLMs can parse that information relatively well.
I'm not saying it's perfect, but they can. So we have these three documents, these three chunks of information. Hard to read. So let's just let our LLM deal with that. So I'm going to set up this augment prompt function here. We're going to take the query and do what I just did there: retrieve the top three most relevant items from the vector store.
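A sketch of that function:

```python
def augment_prompt(query: str) -> str:
    # Retrieve the top 3 most relevant chunks from the knowledge base.
    results = vectorstore.similarity_search(query, k=3)
    # Join the retrieved chunk texts into a single block of source knowledge.
    source_knowledge = "\n".join([doc.page_content for doc in results])
    # Same structured prompt as before: instructions, contexts, then the query.
    return f"""Using the contexts below, answer the query.

Contexts:
{source_knowledge}

Query: {query}"""
```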
We're going to use those to create our source knowledge. You might recognize this code from earlier where we did it manually. And then we're going to feed all that into an augmented prompt and return it. So we run that. Let's augment our query. So, "Using the contexts below, answer the query" -- you see that we have these contexts.
In this work, we develop and release LLAMA2 collection, pre-train, and fine tune large language models and a ton of other stuff in there. And then we have a query. What is so special about LLAMA2? So this is now our augmented query that we can pass into our chatbot. So let's try it.
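Roughly:

```python
# Wrap the augmented prompt in a HumanMessage and continue the conversation.
prompt = HumanMessage(content=augment_prompt("What is so special about Llama 2?"))
messages.append(prompt)

res = chat(messages)
print(res.content)
```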
We're going to create a new human message with that augmented prompt, append it to our chat history, and feed that into the model. So remember, the question here is, what is so special about LLAMA 2? And it says: according to the provided context, LLAMA 2 is a collection of pre-trained and fine-tuned large language models -- I read that earlier, actually -- developed and released by the authors of the work.
These LLMs range in scale from 7 billion to 70 billion parameters. They are specifically optimized for dialogue use cases and outperform open-source chat models on most benchmarks tested. That's pretty cool, right? So what makes LLAMA 2 special is that the fine-tuned LLMs -- and then we have this kind of messed-up bit here.
I'm not sure what it is -- I think it's LLAMA-something, but it's a bit of a mess. They are designed to align with human preferences, enhancing their usability and safety. This alignment with human preferences is often not easily reproducible or transparent in closed-source models, limiting progress in AI alignment research. Additionally, it says, LLAMA 2 models appear to be on par with some of the closed-source models in terms of helpfulness and safety.
So we can see-- yeah, I think it's giving us a good answer. I don't want to read the whole thing. I almost did. But yeah, it's giving us a good answer there. Let's continue more LLAMA2 questions. So I'm going to try without RAG first. So also just consider that we've just asked a question and got all this information here.
That has been stored in the conversation history. So now the LLM will at least know what LLAMA 2 is. And let's see if we can use that to answer our question here. So, what safety measures were used in the development of LLAMA 2? I'm not performing RAG on this specific query, but it does have some information already.
And it says: in the provided context, safety measures used in the development of LLAMA 2 are mentioned briefly -- a detailed description of their approach to safety fine-tuning -- however, the specific details of these safety measures are not mentioned in the given text. So it's saying, OK, I don't know. Even though we've told it what LLAMA 2 is and we've given it a fair amount of text about LLAMA 2, context about LLAMA 2, it still can't answer the question.
So let's avoid that. What we're going to do instead is augment our prompt and feed that in. And let's see what we get. OK, based on the provided context, the development of LLAMA 2 involved safety measures to enhance the safety of the models. Some of these safety measures mentioned in the text include the following.
And then it gives us a little list of items here. So safety-specific data annotation and tuning, OK, specifically focused on training, OK, cool. And it's also kind of figuring something out from that: it suggests that training data and model parameters were carefully selected and adjusted to prioritize safety considerations.
Red teaming -- so that's, well, it tells us here: red teaming refers to a process in which external experts or evaluators simulate adversarial attacks on a system to identify vulnerabilities and weaknesses. OK, so, you know, almost like safety stress-testing the model, I would say. And then iterative evaluations. The mention of iterative evaluations suggests that the models underwent multiple rounds of assessment and refinement.
This iterative process likely involved continuous feedback and improvements to enhance safety aspects. So the impression I'm getting from this answer is that it mentions this iterative process, but doesn't really go into details. So the model is kind of figuring out what that likely means. All right, so we get a much better answer there, continues.
But like I said, you can take a look at this notebook yourself and run it. I think it's very clear what sort of impact something like RAG has on the system and also just how we implement that. Now this is what I would call naive RAG, or almost like the standard RAG.
It's the simplest way of implementing RAG. And it's assuming that there's a question with every single query, which is not always going to be the case, right? You might say, hi, how are you? Actually, your chatbot doesn't need to go and refer to an external knowledge base to answer that.
So that is one of the downsides of using this approach. But there are many benefits. Obviously, we get this much better retrieval performance, right? We get a ton of information in there. We can answer many more questions accurately. And we can also cite where we're getting that information from.
This approach is much faster than alternative RAG approaches, like using agents. We can also limit the number of tokens we're feeding back into the LLM by setting something like a similarity threshold, so that we're not returning things that are very obviously irrelevant.
And if we do that, that'll help us mitigate one of the other issues with this approach, which is just token usage and cost. Obviously, we're feeding way more information to our LLM here, which is going to slow it down a little bit. It's also going to cost us more, especially if you're using OpenAI, where you're paying per token.
And if you feed too much information in there, your LLM can actually-- the performance can degrade quite a bit, especially when it's trying to follow instructions. So there are always those sort of things to consider as well. But overall, if done well, by not feeding too much into the context window, this approach is very good.
And when you need other approaches, and you still need that external knowledge base, you can look at RAG with Agents, something I've spoken about in the past, or RAG with Guardrails, which is something that I've spoken about very recently. Both of those are alternative approaches that have their own pros and cons, but effectively, you get the same outcome.
Now, that's it for this video. I hope this has been useful in just introducing this idea of RAG and chatbots and also just seeing how all of those components fit together. But for now, what I'm going to do is just leave it there. So thank you very much for watching, and I will see you again in the next one.
Bye.