
Chatbot Memory for Chat-GPT, Davinci + other LLMs - LangChain #4


Chapters

0:00 Conversational memory for chatbots
0:28 Why we need conversational memory for chatbots
1:45 Implementation of conversational memory
4:05 LangChain's Conversation Chain
12:00 Conversation Summary Memory in LangChain
19:06 Conversation Buffer Window Memory in LangChain
21:35 Conversation Summary Buffer Memory in LangChain
24:33 Other LangChain Memory Types
25:25 Final thoughts on conversational memory

Transcript

conversational memory has become a very important topic recently. It essentially describes how a chatbot or some other type of AI agent can respond to queries in kind of like a conversational manner. So if you think about conversation, the current point of the conversation depends almost entirely on the previous parts of the conversation, the previous interactions.

And conversational memory is how we allow a large language model or other type of language model to remember those previous interactions. It is the type of memory that we would see in chatbots like OpenAI's ChatGPT or Google's LaMDA. Without conversational memory, we would not be able to have a coherent conversation with any of these chatbots.

And that is because by default, these large language models are stateless. That means that every incoming query to the model is treated independently of everything else. It doesn't remember previous queries or anything like that. It just looks at what you are giving it at this current moment in time.

So how can we make a stateless large language model remember things that happened before? Well, that is what we're going to describe in this video. We're going to jump straight into it. We're going to be using the LangChain library to work through these different types of conversational memory, because there are a few different types and we'll introduce some of the essential types that you need to know in order to build chatbots or other conversational agents.

So to get started, we will work through this notebook here. If you'd like to follow along and run this code as well, there will be a link at the top of the video right now that will take you to this notebook. The libraries that we're going to be using are LangChain, OpenAI, and tiktoken.

All right, we will install those. And once you've installed those, we'll move on to our imports here. Now, before we move any further, the vast majority of this notebook was put together by Francisco. So thanks a lot for putting that together. That is the same Francisco that we saw in the previous video, and he'll be joining us for future videos as well.

So let's go ahead and import everything in here. You can see a few different memory types that we're going to be using. So we have this conversational buffer memory, summary memory, and so on. We'll be taking a look at all of these and actually maybe some other ones as well.
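
For reference, the import block looks roughly like this. This is a sketch against the early-2023 LangChain API used in the video; the import paths may have moved in newer releases.

```python
from langchain import OpenAI
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import (
    ConversationBufferMemory,
    ConversationSummaryMemory,
    ConversationBufferWindowMemory,
    ConversationSummaryBufferMemory,
)
```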

First thing we want to do is, well, throughout this notebook, we're going to be using OpenAI's large language models. So we want to go ahead and actually save the OpenAI API key. To get that, you'll need to go over to platform.openai.com. You log into your account in the top right corner.

Once you have logged in, you can click on view API keys and then you'll just create a new secret key and just copy that and then paste it into the little prompt that we get here in there. Okay, once we've done that, we have our OpenAI API key in there and what we can do is run this first step.
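
In code, that step might look something like this minimal sketch, assuming you store the key as an environment variable and use getpass so it never ends up hard-coded in the notebook:

```python
import os
from getpass import getpass

# paste the secret key you created at platform.openai.com when prompted
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
```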

So obviously recently you may have heard that there is a new ChatGPT model that is available to us, and we can actually just put that in here. And what I'll do is, whilst we're running through this first example, I will also show you the examples that we get when we run with gpt-3.5-turbo.
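
As a rough sketch, initializing the model with LangChain's OpenAI wrapper might look like this; the wrapper name and arguments assume the early-2023 LangChain release used here:

```python
from langchain import OpenAI

# text-davinci-003 is a plain completion model, not a chat-tuned one
llm = OpenAI(
    temperature=0,
    model_name="text-davinci-003",
)
```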

But for the majority of this notebook, we're going to be sticking with the text-davinci-003 model just to show that, you know, we can use any large language model with this. It doesn't have to be a large language model like ChatGPT that has been trained specifically as a chatbot.

It can be any large language model. So later on, we're going to be using this function here, count tokens, don't worry about it right now, we'll skip that. First thing we want to have a look at is this, the conversation chain. So everything that we are about to talk about is built on top of the conversation chain.

And the conversation chain, we just pass it a large language model. So this is going to be the text-davinci-003 model we just created. And all it's going to do, or the core part of what this conversation chain does, is what you can see here. So: the following is a friendly conversation between a human and an AI.

AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to the question, it truthfully says it does not know, right? So this is a prompt that essentially primes the model to behave in a particular way. So in this case, it's going to behave like a friendly AI assistant.

And it does this first by having this current conversation; this is the history that we're going to pass. So, you know, I said before, these models are stateless. The reason we can pass a history is because we're actually taking all of the past interactions and passing them into the model at this current point in time.

So in that single prompt, we will have our current input, our current query, and all of our past interactions with the chatbot. All right, so all of those past interactions get passed to history here, and our current query or current input is passed to the input. And then we tell the model, okay, now it's your time to respond with this AI part.
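
As a rough sketch of that setup, with attribute names assumed from the LangChain version in the video, initializing the chain and printing its default prompt looks like this:

```python
from langchain.chains import ConversationChain

conversation = ConversationChain(llm=llm)

# the default template that {history} and {input} get injected into
print(conversation.prompt.template)
```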

And beyond that, there's not really anything special going on with this conversation chain. So we can see here, in the code that Francisco pulled up, the actual source for the conversation chain, right? So for the chain which we initialized here, the code that calls it is exactly the same as the code that calls the standard large language model chain.

All right, there's not actually anything else changing other than this prompt here. Okay, so let's move on to the different memory types. So as I mentioned before, there are a few different memory types that we are able to use. Now, these different types of memory in LangChain are essentially just going to change the history part of that prompt that you saw before.

So these go in and they will essentially take the conversation history, format it in a particular way, and just place it into that history parameter. But as you may have guessed, they format things differently. And because of that, there are pros and cons to each one of these methods.

So the simplest of those is the conversation buffer memory. Now, the conversation buffer memory is very simple. It basically takes all of your past interactions between you and the AI, and it just passes them into that history parameter as the raw text. There's no processing done. It is literally just a raw conversation that you've had up to this point.

Now, to initialize that, we just do this, super simple. So we have the memory here. So we have our conversation chain. This time we're specifying the memory parameter, and we're using conversational buffer memory. We run that, and then we can just pass in some input. So we're going to go with good morning AI.
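
A minimal sketch of that initialization and the first call might look like this:

```python
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferMemory

# llm is the text-davinci-003 wrapper created earlier
conversation_buf = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory(),
)

conversation_buf("Good morning AI!")
```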

And we get this response here. Good morning, it's a beautiful day today, isn't it? How can I help you? Cool. Now, one thing that I do want to do is actually count the number of tokens that we're using, 'cause that's a very big part of which one of these methods we might want to use over the others.

Now, to count those tokens that we're using, we actually need to refer back to the count tokens function up here, right? So we have this get_openai_callback from LangChain here, and within that callback, we are actually going to get the total number of tokens that we just used in our most recent request to OpenAI.
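
The helper is roughly the following sketch, using LangChain's OpenAI callback; the exact wording of the print statement is just illustrative:

```python
from langchain.callbacks import get_openai_callback

def count_tokens(chain, query):
    # the callback tallies token usage for every OpenAI call made inside the block
    with get_openai_callback() as cb:
        result = chain.run(query)
        print(f"Spent a total of {cb.total_tokens} tokens")
    return result
```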

So let's come down here. To use that count tokens function, all we're going to do is pass our conversational chain, and we're going to pass in our input. So our next input is going to be, my interest here is to explore the potential of integrating larger language models with external knowledge, okay?

We run this, and we see, okay, we spent a total of 179 tokens so far. We keep going, I just want to analyze the different possibilities, so on and so on, right? And because we're saving the raw interactions that have happened up to this point, naturally with every new interaction, the number of tokens that we're using with each call increases.

And we just see this keep increasing and increasing. But each one of these queries that we're making is considering all the previous interactions, and we can see that, right? There is a common thread throughout each of these interactions, and the responses that we're getting, like it's clearly a conversation.

And if you come down to here, we ask the final question, which is, what is my aim again? Now, earlier on, we specified my interest, we didn't specifically say aim, we said my interest here is to explore the potential of integrating large language models with external knowledge. And if we come down, that's basically what we're asking here.

What is my aim? And it says your aim is to explore the potential of integrating large language models with external knowledge. Okay, so that's just to confirm that, yes, the model does in fact remember the start of the conversation. So clearly that works, right? And let's wait for those to run.

Okay, cool, and so what you can see here is we have the conversation chain that we initialize, and then we have this memory attribute, and within that, the buffer attribute. This is literally going to show us what it is that we're feeding into that history parameter. And you can just see that it is literally just the conversation.
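
In code, that inspection is just something like:

```python
# the raw transcript that gets pasted into the {history} slot of the prompt
print(conversation_buf.memory.buffer)
```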

Like there is nothing else there. We're not summarizing, we're not modifying it in any way. It's just the conversation that we had from above. Okay, so that is the first one of our memory types. I think the pros and cons to this are relatively clear, but let's go through them anyway.

So the pros are that we're storing everything, right? We're taking the raw interactions. We're not modifying or shortening them in any way. And that means that we're storing the maximum amount of information, which means we're not losing any information from the previous interactions. And to add to that, just storing previous interactions is a very simple approach.

It's intuitive, it's not complicated in any way. So that's also a nice benefit. But they kind of come with a few cons. And naturally, if we're storing all these tokens, it means that the response times of the model are going to be slower, especially as the conversation continues and the queries that we're sending get bigger.

And it also means that our costs are going to increase as well. And beyond that, it even puts a hard limit on us. So right now, the text-davinci-003 model and the gpt-3.5-turbo model both have a max token limit of, I think, 4,096 tokens. That's pretty big, but a conversation might go on for longer than this.

So as soon as we hit that limit with this type of memory, we're actually just going to hit an error and that's it. Like we can't continue the conversation. So that's a pretty big downside. So are there any other types of memory that can help us remedy these issues?

Yes, there are. First, we have the conversation summary memory. This allows us to avoid excessive token usage by summarizing the previous interactions. Rather than storing everything, we just summarize it. So in between us getting our previous interactions and passing them into the history parameter of our prompt, they're summarized and obviously shortened or hopefully shortened.

Now to use this, we run this. Okay, so we have conversation chain and our memory is a conversation summary memory. And to actually do this summarization, we also need a large language model. So we're actually just going to use the same large language model here. And this conversation summary memory, just as a quick reminder, is coming from this.
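
A sketch of that initialization, assuming the same chain and LLM objects from before:

```python
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationSummaryMemory

conversation_sum = ConversationChain(
    llm=llm,
    # the summary memory needs its own LLM to write the running summary
    memory=ConversationSummaryMemory(llm=llm),
)
```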

So langchain.chains.conversation.memory, and then we have all of our memory types imported there. Okay, and let's have a look at what the prompt is for this summarization component, right? Because we're performing two calls here: the call to the summarization large language model, and then the call to the chatbot or conversational AI component.

So the first call is that summarization, and it looks like this, okay? So conversation_sum, that is just this conversation chain that we have here. We're looking at the memory, and within that we're looking at the prompt template. It progressively summarizes the lines of conversation provided, adding onto the previous summary and returning a new summary, okay?
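
You can print that template yourself; assuming the attribute layout of this LangChain version, it hangs off the memory object:

```python
# the prompt used for the incremental summarization call
print(conversation_sum.memory.prompt.template)
```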

So this is an example. Current summary: the human asks what the AI thinks of artificial intelligence, and the AI thinks artificial intelligence is a force for good. And then it's saying new lines of conversation: why do you think AI is a force for good? Because AI will help humans reach their full potential.

And then it creates a new summary. The human asks what AI thinks of this. AI thinks artificial intelligence is a force for good because it will help humans reach their full potential. So it's basically just added on to the end of the previous summary, a little bit more information.

And it will basically keep doing that with each interaction, right? So from there, we have the current summary, we pass in new lines of conversation, and then we will create a new summary. Now let's see how we would actually run through all of this. So we have our conversational summary memory here.

We're going to go through the same conversation again. So good morning, AI. And you'll notice that the responses are going to be slightly different. And we'll just run through these very quickly. And what we really want to see is, although we're summarizing, is the model able to remember that final question again?

What is my aim again? And fortunately, we can see that, yeah, it does have that same correct answer again. So that's pretty cool. Now, the only issue I see here is, okay, we're summarizing 'cause we want to reduce the number of tokens, but just take a look at this.

We're spending a total of almost 750 tokens here. That's the second-to-last input. Let's compare that to up here. Okay, second-to-last input for the one where we're just saving everything, like the raw interactions, is 360, which is actually less than half the number of tokens. So what is going on there?

Well, the summaries are generated text, and they can actually be pretty long. So we have this conversation_sum.memory.buffer. Okay, I'm not gonna read it all, but you can go through it. There's clearly a lot of text there. So is this actually helping us at all? Well, it can do.

It just requires us to get to a certain length of conversation. And we can see that here. So we're using this tiktoken tokenizer, which is essentially OpenAI's tokenizer, and we're using it for text-davinci-003, which is the current large language model that we're using.

Now, if we look at specifically the memory buffer and look at the number of tokens that we have in there versus the actual conversation, so this is a conversational buffer memory where we're storing everything, and this is a summary memory where we're just storing a summary. If we compare both of those, we can actually see that the summary memory is actually a lot shorter, right?
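
A sketch of that comparison with tiktoken; the print formatting is just illustrative:

```python
import tiktoken

# text-davinci-003 maps to the p50k_base encoding in tiktoken
tokenizer = tiktoken.encoding_for_model("text-davinci-003")

buffer_tokens = len(tokenizer.encode(conversation_buf.memory.buffer))
summary_tokens = len(tokenizer.encode(conversation_sum.memory.buffer))
print(f"Buffer memory conversation length: {buffer_tokens} tokens")
print(f"Summary memory conversation length: {summary_tokens} tokens")
```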

The only issue is the reason that we're actually spending more tokens is because first we're doing that summarization in the first place, and you can also see that we have two prompts. This prompt here is already quite a lot of text. So, you know, okay, that's great. It seems like, you know, I understand that the summary itself is shorter, but it doesn't matter because the actual total here is still longer.

And yes, that's true for this conversation, but for longer conversations, this is not usually the case. So we have this visual here, which I calculated using a longer conversation. The link for the code for that, you can see at the top of the video right now. But in this, we can see the comparison between these two methods.

So the line that you see just kind of growing linearly right here, that is our conversation buffer memory, okay? So the first one we looked at. And you see that we get to a certain level, like around 25 interactions. And at that point, we actually hit the token limit of the model.

Whereas the summary memory that we're using, initially, yes, is higher, but then it kind of, it doesn't grow quite as quickly. It grows quite quickly towards the end there, but the overall growth rate is much slower. So for shorter conversations, it's actually better to use the direct buffer memory.

But for longer conversations, summary memory, as you can see here, works better. It reduces the number of tokens that you're using overall. So naturally, which one of those you would use just depends on your use case. So I think we can kind of summarize the pros and cons here.

For summary memory, the pros are that it shortens the number of tokens for long conversations and also enables much longer conversations, and it's a relatively straightforward implementation, super easy to understand. But on the cons side, for shorter conversations it doesn't help at all; it's actually less efficient. And because we're not saving everything like we did with the buffer memory, the memorization of those previous interactions is wholly reliant on the summarization including that information, which it might not always do, particularly if you're not using a particularly advanced large language model for that summarization.

So both those things we also just need to consider with this approach. So moving on to conversational buffer window memory, this one acts in a very similar way to a buffer memory, but there is now a window on the number of interactions that we remember or save. So we're gonna set that equal to one.
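
A sketch of that, with k=1 purely for demonstration:

```python
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory

conversation_bufw = ConversationChain(
    llm=llm,
    # k=1 keeps only the single most recent human/AI exchange in the prompt
    memory=ConversationBufferWindowMemory(k=1),
)
```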

So we're gonna save the one most recent interaction from the AI and also the human, right? So one human interaction, one AI interaction. Usually it would be much larger, but it's just for the sake of this example. So running through these again, you know, it's pretty straightforward, right? I'm not gonna rerun everything here, but we can see we go through the same conversation again, and we get to the end and we say, what is my aim again?

All right, and it says your aim is to use data sources to give context to the model, right? You know, that's wrong. But the reason that the model is saying this is because it actually only remembers the one most recent interaction, because we set k equal to one. And we can actually see that here.

So if we go into our conversation chain, we go to the memory attribute, load memory variables. We set inputs equal to nothing now, 'cause we don't actually wanna pass anything in. And we load the history item. Then we can actually see the history that we have there. So we have just the previous interaction.
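
Roughly like this, assuming the method signature of this LangChain version, which ignores the inputs argument for this memory type:

```python
# peek at exactly what will be injected into {history} on the next call
bufw_history = conversation_bufw.memory.load_memory_variables(inputs={})["history"]
print(bufw_history)
```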

And the previous interaction was actually this, right? So, because we asked this question. So we actually just have that. Now, obviously, naturally with this approach, you're probably going to be using a much larger number than K equals one. And you can actually see the effect of that here. So adding to the previous two visuals that we saw, we now have conversational buffer window memory for that longer conversation with K equals 12.

So remembering the previous 12 interactions, and also K equals six. And you can see, okay, the token count for these, it's kind of on par with the conversational buffer memory up until you get to the number of interactions that you're saving. And then it kind of flattens out a little bit because you're just saving the previous 12 or six interactions in this case.

So naturally, I think the main pro is here, kind of similar to the buffer memory, we're saving the raw input of the most recent interactions. And the con is obviously that we are not remembering distant interactions. So if we do want to remember distant interactions, we just can't do that with this example.

Now, there's one other memory type that I wanted to very quickly cover here, and that is the conversation summary buffer memory. So let's import that quickly: from langchain.chains.conversation.memory, I'm going to import ConversationSummaryBufferMemory, okay? Now to initialize this, we will use this code here.

So we have the conversation chain again, we're passing in the large language model, and then we're also passing in the memory like we did before, conversation summary buffer memory. And in there, we're also passing in a large language model. Now, the reason we're doing this is because, as you can see here, we are summarizing, okay?

And then we also pass in this max token limit. Now, this is kind of equivalent to what you saw before, where we had K equals six. But rather than saying we're going to save the previous six interactions, we're actually saying we're going to save the previous 650 tokens. So we're kind of doing both.
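
Putting that together, a sketch of the initialization might look like this:

```python
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationSummaryBufferMemory

conversation_sum_bufw = ConversationChain(
    llm=llm,
    memory=ConversationSummaryBufferMemory(
        llm=llm,              # summarizes interactions that fall out of the buffer
        max_token_limit=650,  # keep roughly the most recent 650 tokens verbatim
    ),
)
```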

So we're summarizing here, and we are also saving the most recent interactions in their raw form. So we can run that, and we would just use it in the exact same way. Okay, so main pros and cons with this one as well, kind of like a mix of the previous two methods we looked at.

So we have a little bit of the buffer window in there, and also the summaries in there. So the summarizer means that we can remember those distant interactions, and the buffer window means that we are not misrepresenting the most recent interactions, because we're storing them in their raw form.

So we're keeping as much information there as possible, but only from the most recent interactions. And then the cons, kind of similar to the other ones again. For the summarizer, we naturally increase the token count for shorter conversations. And with the summarizer, we're not necessarily going to remember distant interactions that well.

They don't really contain all of the information. And naturally, storing the raw interactions as well as the summarized interactions from a while back, all of this increases the token count. But this method does give us a few parameters that we can tweak in order to get what it is that we want.

And in this little visualization here, we've added the summary buffer memory with a max token limit of 1,300, and also 650 there, which is roughly equivalent to the K equals 12 and K equals six that we had before. And we see that the tokens are not too excessive. Now, there are also other memory types.

Very quickly, you can see this one here is the conversation knowledge graph memory, which essentially keeps a knowledge graph of all of the entities that have been mentioned throughout the conversation. So you can kind of see that here. We say, "My name is human and I like mangoes." And then we see the memory here.
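
A sketch of that memory type; the import path and the .kg.get_triples() accessor for inspecting the graph are assumed from the LangChain version used in the notebook:

```python
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationKGMemory

conversation_kg = ConversationChain(
    llm=llm,
    memory=ConversationKGMemory(llm=llm),
)

conversation_kg("My name is human and I like mangoes!")

# the knowledge triples extracted into the graph so far
print(conversation_kg.memory.kg.get_triples())
```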

And we see that the entity human, referring to the person, and the entity human, I think this is referring to the name. They're both connected because the human is the name of human. And then here we have the entity human and the entity mangoes. And we can see that they are connected because the human likes mangoes, right?

So we have that knowledge graph memory in there. But beyond that, we're not going to dive into that for now any further. We're going to cover that more in a future video. So that's actually it for this introduction to conversational memory types with LangChain. We've covered quite a few there.

And I think the ones that we have covered are more than enough to actually build chatbots or conversational AI agents using what seem to be the same methods as those being used by some of the state-of-the-art conversational AI agents out there today, like OpenAI's ChatGPT and possibly Google's LaMDA, although we really have no idea how that works.

But for now, we'll leave it there. As I mentioned, there's going to be a lot more on this type of stuff in the future. Thank you very much for watching. I hope this has been interesting and useful. And I will see you again in the next one. Bye.