Back to Index

Llama 2 in LangChain — FIRST Open Source Conversational Agent!


Chapters

0:00 Llama 2 Model
2:55 Getting Access to Llama 2
6:12 Initializing Llama 2 70B with Hugging Face
8:17 Quantization and GPU Memory Requirements
11:14 Loading Llama 2
13:05 Stopping Criteria
15:17 Initializing Text Generation Pipeline
16:25 Loading Llama 2 in LangChain
17:08 Creating Llama 2 Conversational Agent
19:46 Prompt Engineering with Llama 2 Chat
22:16 Llama 2 Conversational Agent
24:14 Future of Open Source LLMs

Transcript

A few days ago Meta AI released Llama 2. Now what's exciting about Llama 2 is that it's open source and it is currently the best performing open source model across a wide variety of benchmarks. Now one of the things that I'm personally very excited about is that when I see these new open source models being released, one of the first things I do is try them out as conversational agents.

That is a chatbot that is actually able to use tools and every single time that I have tried this so far with other models I've been pretty disappointed. They either cannot use tools at all or they're just very unreliable. So this "will it work as a conversational agent" benchmark has just become my personal go-to when these new models are released.

It's my way of benchmarking where open source stands compared to OpenAI's models, which generally speaking means GPT-3.5, text-davinci-003 and especially GPT-4. They are pretty capable as conversational agents, and what I find in real-world use cases is that conversational agents are the future of how we interact with large language models.

Having a simple chatbot that just talks to us is great, but it's limited. It doesn't have the flexibility or access to external information that a conversational agent has, and it cannot use tools, like a Python interpreter, that a conversational agent can use. So that, for me, is super important, and finally with Llama 2 we have a model that has actually passed that test.

I fairly quickly managed to prompt engineer my way to getting a Llama 2 model, the fine-tuned chat version of Llama 2, to work as a conversational agent, which I think is pretty insane. So what I want to do in this video is show you how you can do the same.

So we're going to take a look at the biggest Llama 2 model, the 70B parameter model. We're going to quantize it so that we can fit it onto a single A100 GPU. I'm actually going to be running all of this on Colab, so you can go ahead and run the same notebook.

With this approach we're going to be able to fit that 70 billion parameter model into a minimum of about 35 gigabytes of GPU memory, although after multiple interactions it pushes its way up to more like 38 gigabytes, which is still not that much for such a capable model.

Now let's dive into how we can actually do this. The first thing we're going to have to do is sign up and get access to these models. It's pretty straightforward and doesn't take that long. What you can do is head on over to huggingface.co/meta-llama, and from there you want to go over to the Meta website.

So we click on that and request access to the next version of Llama. You fill that out and, for me, I got a response almost instantly using two different emails. Basically they're going to send you something like this: okay, you're all set, start building with Llama 2.

It also lists the model weights that are available. This is not every single Llama 2 model; there is also a 34 billion parameter model which they have not finished testing yet, so that hasn't been released just yet. The one that we are going to be using is Llama-2-70b-chat.

So on Hugging Face we need to go to Llama-2-70b-chat-hf. This is the model that we want to be using. You'll see that there's this "access Llama 2 on Hugging Face" form. One thing you need to be aware of here, and it actually says it right there, is that your Hugging Face account email address must match the email you provided on the Meta website.

So a minute ago, when we entered our details on the Meta website, make sure you used the email that you also use on Hugging Face. Once you've done that you can click this and submit, and as long as those emails line up you will get access fairly quickly.

Now, while we wait for that access to come through, we also need to go to our profile, then settings, and get an access token. This will allow us to download the model within our code.

So you will actually need to create a new token. I'm just going to call this Meta Llama and we just need read permissions. So with that we generate a token and I'm just going to copy that. So this is a notebook that we're going to be working through in this video.

There will be a link to this at the top of the video right now so you can follow along if you like, although I will pre-warn you that parts of this notebook can take a little bit of time, particularly when you're downloading the model. With that in mind, I wouldn't necessarily recommend running this on Colab, because you're going to have to re-download the model every day that you use it, which is not ideal, and it's fairly expensive.

So you should probably run this on your local computer if you have a good GPU or on a cloud service somewhere. So we come down to here you'll need to enter your Hugging Face API key in here and let me just come down and show you what is happening.

So there's a fair bit of code that is just kind of initializing the model here for us and as I mentioned this download of the model, this download and initialization of the model, does take a bit of time. So this has actually been running now for one hour and 10 minutes or a little bit longer and I'm not expecting it to finish too soon although I'm hoping it will not take too much longer.

Essentially we're going to be waiting a while for the model to download, so let's come up here and go through the code that we've used to initialize it. First, we're doing a pip install of all the libraries that we're going to be using.

We do need all of these: Hugging Face Transformers, plus these other libraries, which are there so we can run large language models and optimize how we're running them. We also have LangChain, because later on in the notebook we're going to use LangChain to create that conversational agent.
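For reference, the installs look roughly like this; treat the package list and the lack of version pins as a sketch rather than the exact notebook cell:

```python
# In a Colab / Jupyter cell; a rough sketch of the installs:
!pip install -qU transformers accelerate bitsandbytes xformers einops langchain
```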

So, coming down to here, what we need is the large language model, a tokenizer for the large language model, and also a stopping criteria object, which is more of an optional item, I would say, for this model. But let's talk about those, starting with the LLM. For the LLM we have this model ID, which is coming from Hugging Face, so if we come up here again we can type in llama2 and we see all these different model IDs.

The one that we're using is this one here. Okay, so we have our model ID, and we're just checking that we have a GPU available. Then we have this bits and bytes config object. I've spoken about this in previous videos, so I'm not going to go too in depth, but essentially what we're doing here is minimizing the amount of GPU memory we need to store the model.

Now this is a 70 billion parameter model, so let's just do some very quick maths here. We have 70 billion parameters, and each of those parameters, using the standard data type, float32, is 32 bits of information.

Within each byte there are 8 bits of information, so we can calculate how much memory we need to store the model: it is just the number of parameters multiplied by the bits per parameter, divided by 8. That gives us 280 gigabytes, which is a lot; that's many, many GPUs. A single A100, I think, is 40 gigabytes, so we would need quite a few of those.
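As a quick sanity check, here is that back-of-the-envelope calculation written out:

```python
params = 70_000_000_000    # 70B parameters
bits_per_param = 32        # float32
bytes_total = params * bits_per_param / 8
print(bytes_total / 1e9)   # ~280 GB
```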

Now, by doing this bits and bytes quantization we can minimize that. What we're essentially doing is switching from a float32 data type to an int4 data type, which contains four bits of information. So now each one of those parameters is not 32 bits but four bits, and if we calculate that, 70 billion parameters at four bits divided by 8 gives us 35 gigabytes of information.

Now that's not precise because when we're doing this quantization method if we just converted everything into int 4 basically we would lose a lot of performance. This works in a more intelligent way by quantizing different parts of the model that essentially don't need quite as much precision. Then the bits that do require more precision we convert into 16-bit floats so it will be a little bit more than 35 gigabytes essentially but we're going to be within that ballpark.
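The quantization config is built with Hugging Face's BitsAndBytesConfig. The settings below (nf4, double quantization, 16-bit compute dtype) are the common 4-bit recipe and may differ slightly from the exact notebook cell:

```python
import torch
import transformers

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit
    bnb_4bit_quant_type='nf4',              # normalized-float 4-bit data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16   # run the actual matmuls in 16-bit
)
```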

So that's great and allows us to load this model onto a single A100, which is pretty incredible. Then what we need to do is load the model config from Hugging Face Transformers. Because we're downloading that from Hugging Face, we need to make sure we're using our authorization token, which you will need to set in here, and then we're also going to download the Llama 2 model itself.

Now we need to have trust_remote_code in there because this is a big model and there is custom code that will allow us to load it. You don't need that for all models on Transformers, but you do need it for this one. We have the config object which we just initialized up here, and we also have the quantization config which we initialized up here.

Device map needs to be set to auto, and we again need to pass in our authorization token, which we do here. Then after that we switch the model into evaluation mode, which basically means we're not training the model; we're going to be using it for inference, or prediction. Then after that we just wait.
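Putting that together, the loading code looks roughly like this; the hf_auth placeholder stands for the access token created earlier, and the keyword names are from the Transformers API of the time, so treat it as a sketch rather than the exact notebook cell:

```python
model_id = 'meta-llama/Llama-2-70b-chat-hf'
hf_auth = '<YOUR_HF_READ_TOKEN>'   # the read token created earlier

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,          # needed for this model's custom loading code
    config=model_config,
    quantization_config=bnb_config,  # the 4-bit config from above
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()   # inference only, no training
```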

This is almost done now so I think it's just finished downloading the model and now we're going to need to wait for it to actually initialize the model from all of those downloaded shards that we just created. I will see you in a few minutes when that is finished.

Everything has now loaded and initialized, so we can get on with the rest of the code. We need a tokenizer. The tokenizer just converts plain text into tokens, which is what the model will actually read. I just need to make sure I define this, and I can rerun that. Then we come down to the stopping criteria for the model.
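The tokenizer is loaded the same way as the model, using the same model ID and token:

```python
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)
```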

Now, with smaller models this is pretty important; with this model I would say less so, but we can add it in as a precaution. Basically, if we see that the model has generated either of these two items, we want to stop. The first comes from a chat-log style format: we'd have the assistant type a reply, and if it then moves on to the next line and starts generating the text for the human's response, it's generating too much text and we want to cut it off.

We have that as a stopping criterion, and we also have these three backticks. The reason we use the three backticks is that when we are using Llama 2 as a conversational agent we actually ask it to reply to everything in, essentially, a markdown code block containing JSON. So we'll have it reply to everything in this format.

Then in here we'll have an action, which is something like "use calculator", and also the action input, which would be something like "two plus two". That is why we're including this within the stop list: essentially, once we get to the closing backticks, we want the chatbot to stop generating anything.

As I said, with this model it doesn't seem to be that necessary, so you can add it in there as a precaution, but what I'm actually going to do is skip it for now; I don't necessarily need it to be in there. If you do want to include it, just uncomment that and you'll have it in there.
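If you do want the stop list, a sketch of the stopping criteria looks something like this. The stop strings follow the description above; add_special_tokens=False is there so the tokenizer doesn't prepend a BOS token to each stop sequence, which would break the matching:

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

stop_list = ['\nHuman:', '\n```\n']   # chat-log turn marker and closing code fence

stop_token_ids = [
    torch.tensor(tokenizer(s, add_special_tokens=False)['input_ids']).to(model.device)
    for s in stop_list
]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids, scores, **kwargs) -> bool:
        # stop as soon as the most recent tokens match one of the stop sequences
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0, -len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```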

I'm not going to initialize the pipeline with that; if we do see any issues then we'll go back and rerun with the stopping criteria included. This is just initializing the text generation pipeline with Hugging Face. We can now ask it to generate something. This is a question I've used a few times in the past.
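The pipeline setup is roughly as follows; the generation parameters (temperature, max_new_tokens, repetition_penalty) and the test prompt are illustrative rather than the exact values from the notebook:

```python
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    return_full_text=True,   # LangChain expects the prompt plus completion back
    # stopping_criteria=stopping_criteria,  # optional, as discussed above
    temperature=0.1,
    max_new_tokens=512,
    repetition_penalty=1.1
)

# quick smoke test; any prompt will do here
res = generate_text("Explain the difference between nuclear fission and nuclear fusion.")
print(res[0]['generated_text'])
```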

We just want to make sure that it is actually working on the Hugging Face side of things: can this Hugging Face-initialized model generate some text? It will take a little bit of time. As I said before, this is exciting because it is finally able, at least at a very basic level, to act as a conversational agent.

In terms of speed and hardware requirements it is not the most optimal solution. At least not yet. That's something that can be solved with more optimized hardware or just kind of throwing a load of hardware at it at least on the time side of things. That will take a little while to run and we see that we get this response which I think is relatively accurate.

I haven't read through it but it looks pretty good. Then what we want to do is right now we have everything HuggingFace. We now want to transfer that over into LangChain. We're going to do that by initializing this HuggingFace pipeline object from LangChain and initializing it with our pipeline that we initialized up here.

Then we just treat that as the LLM. We'll run that. We can then run this again and this will produce a pretty similar output to what we got up here. We can see we get kind of similar output. This is just telling us the same sort of stuff but with more text.
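Wrapping the pipeline for LangChain is essentially a one-liner; the import path shown is the one used in LangChain versions from around that time:

```python
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

# same smoke test, now routed through LangChain
print(llm("Explain the difference between nuclear fission and nuclear fusion."))
```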

Now what I want to do is come down to here. We have everything initialized in LangChain, so now we can use all of the tooling that comes with LangChain to initialize our conversational agent. A conversational agent, as I mentioned before, is conversational; that means it has some sort of conversational memory, and it is also able to use tools.

That is the advantage of using a conversational agent versus just a standard chatbot. So we initialize both of those: a conversation buffer window memory, which is going to remember the previous five interactions, and an LLM math tool, which is just a calculator. Then here we have what is an output parser.
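A sketch of those two pieces, the memory and the calculator tool; the memory_key value is whatever the conversational agent's prompt expects, which is chat_history by default:

```python
from langchain.memory import ConversationBufferWindowMemory
from langchain.agents import load_tools

memory = ConversationBufferWindowMemory(
    memory_key='chat_history',   # key the conversational agent prompt reads from
    k=5,                         # remember the previous five interactions
    return_messages=True
)

tools = load_tools(['llm-math'], llm=llm)   # a simple calculator backed by the LLM
```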

We don't need this for this model. You can have it in there as a precaution, again, if you like, but for the most part I've found that it isn't needed with good prompting. Essentially, what this output parser would usually do is handle the case where the agent returns some text without the correct format, so without that JSON format that I mentioned earlier; in that case I would assume that it's trying to respond directly to the user.

All this output parser does is reformat that into the correct JSON-like response, but as I said, we can ignore it. We don't necessarily need it for the tools that we're using here; maybe in a more complex scenario it would be more useful. If you did want to use it you would just uncomment that and run it, but as mentioned, let's skip it and see how the agent performs without it.
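For completeness, the fallback output parser is roughly along these lines: try to parse the expected JSON, and if that fails, treat the raw text as a direct answer to the user. The exact imports and class shape are a sketch against the LangChain version of the time:

```python
from langchain.agents import AgentOutputParser
from langchain.agents.conversational_chat.prompt import FORMAT_INSTRUCTIONS
from langchain.output_parsers.json import parse_json_markdown
from langchain.schema import AgentAction, AgentFinish

class OutputParser(AgentOutputParser):
    def get_format_instructions(self) -> str:
        return FORMAT_INSTRUCTIONS

    def parse(self, text: str):
        try:
            # try to read the expected JSON (usually inside a markdown code block)
            response = parse_json_markdown(text)
            action, action_input = response['action'], response['action_input']
            if action == 'Final Answer':
                return AgentFinish({'output': action_input}, text)
            return AgentAction(action, action_input, text)
        except Exception:
            # no parsable JSON: assume the model answered the user directly
            return AgentFinish({'output': text}, text)

    @property
    def _type(self) -> str:
        return 'conversational_chat'

parser = OutputParser()
```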

Again, it's just a precaution. We initialize the agent here. We're using this chat-conversational-react-description agent, and these are fairly standard agent initialization parameters. What I want to show you here is the prompt that we initially use. Now, this prompt doesn't work very well; for one, the initial system prompt is super long.
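The agent initialization itself is roughly this; verbose and early_stopping_method are the usual settings, and you would only pass the output parser via agent_kwargs if you decided to use it:

```python
from langchain.agents import initialize_agent

agent = initialize_agent(
    agent='chat-conversational-react-description',
    tools=tools,
    llm=llm,
    verbose=True,                     # print the intermediate reasoning steps
    early_stopping_method='generate',
    memory=memory,
    # agent_kwargs={'output_parser': parser},  # optional, see above
)
```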

It's not that useful. Then we have the user prompt template here, which again is super long and doesn't work that well, so I've modified both of those. One thing that is slightly different, or specific to Llama 2, is the use of these special tokens: we have [INST], which indicates the start of some instructions; [/INST], which indicates the end of the instructions; <<SYS>>, which indicates the start of the system message, the initial message that tells the chatbot or LLM how to behave; and <</SYS>>, which indicates the end of the system message.
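Those markers are usually written as small constants so they can be spliced into the prompts:

```python
# Llama 2's chat fine-tune was trained around these special markers
B_INST, E_INST = '[INST]', '[/INST]'            # wrap an instruction / user turn
B_SYS, E_SYS = '<<SYS>>\n', '\n<</SYS>>\n\n'    # wrap the system message
```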

We initialize our system message and we include that start-of-system-message token in there. Then we go through and say that Assistant is an expert JSON builder designed to assist with a wide range of tasks. The intention here is to really drill in the point that the assistant needs to respond with JSON.

We also mention that it needs to respond with the action and action_input parameters, and we can see an example of that in here. In this example I'm showing how to use a calculator: you need to say the action is calculator and give the action input, which is what you would like to calculate.

Then we have some further examples in here: responding directly to the user, where we need to use this JSON format, and using the calculator, again with the JSON format. We just go through and keep giving a few of those examples. At the end of the system message we put that end-of-system-message token.
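Here is a condensed sketch of that system message and how it gets swapped onto the agent. The real prompt in the notebook is longer and includes several full worked examples; agent.agent.create_prompt is the conversational chat agent's helper for rebuilding its prompt:

```python
sys_msg = B_SYS + (
    'Assistant is an expert JSON builder designed to assist with a wide range of tasks. '
    'Assistant always responds with a JSON blob containing "action" and "action_input" values. '
    'To use the calculator, respond with {"action": "Calculator", "action_input": "4**2.1"}. '
    'To talk directly to the user, respond with {"action": "Final Answer", "action_input": "..."}.'
) + E_SYS

# rebuild the agent's prompt around the new system message
new_prompt = agent.agent.create_prompt(system_message=sys_msg, tools=tools)
agent.agent.llm_chain.prompt = new_prompt
```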

We can run that and then we come down to here. Another thing that they found in the paper is that Llama 2, over multiple interactions, seems to forget those initial instructions. So all I'm doing here is adding those instruction tags in there and summarizing, giving a little reminder to Llama 2.

Respond to the following in JSON with "action" and "action_input" values. We're just appending that to every user query, which we can see here. Then we just modify the human message prompt template, and what we end up with is what you can see down here.
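The per-turn reminder is spliced into the human message template, something like this; note the index of the human message inside the prompt (2 here) depends on the LangChain version:

```python
instruction = (
    B_INST
    + ' Respond to the following in JSON with "action" and "action_input" values '
    + E_INST
)
human_msg = instruction + '\nUser: {input}'

# replace the human message template on the agent's prompt
agent.agent.llm_chain.prompt.messages[2].prompt.template = human_msg
```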

We're going to have that with every human message. Now we can actually begin asking questions. I just ran this one: "hey, how are you today?" We see that we get this output: final answer, "I'm good, thanks. How are you?" That's pretty good. Let's try "what is 4 to the power of 2.1?"
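Calling the agent is then just (queries as in the video; the second call exercises the calculator tool):

```python
agent("hey how are you today?")
agent("what is 4 to the power of 2.1?")
```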

We see that it's correctly using the calculator. It has the action input, which is 4 to the power of 2.1 in Python. This interaction takes a little bit longer because there are multiple LLM calls happening here: the first LLM call produces the decision to use a calculator and the input to that calculator.

This is sent back to LangChain and executed in a Python interpreter. We get the answer from that, which is sent back to the assistant, and based on that it knows it can give the final answer back to us. The action is final answer, and it looks like the answer is this.

That is the output that we get there. Now let's use our conversational history and ask it to multiply that previous number by 3. Let's run that. We can see the first item, the calculator, it is being used correctly. We have that 18.379 multiplied by 3. Again, it's going to take a little moment because it needs to get the answer and generate a new LLM response based on that answer.

Then we get our answer, 55.13, and that's what we get. This is pretty good. Now, as you saw, for these answers where it's going through multiple steps, it's taking a minute for each one. A lot of that time seems to be spinning up a Python interpreter.

It's not fully on the LLM in this case, but it does take a little bit of time. Naturally, that is probably one of the biggest issues with using Llama 2 at the moment: it takes a lot of GPU memory to run, and that comes with high costs. Especially if you are running it on a single GPU, like we are, with quantization, which slows the whole thing down, things are going to take a little bit of time.

Nonetheless, I think this looks really cool. What we've done here is a very simple agent. It's just using a calculator. We're not stress testing this. Honestly, if we want to start using other tools, I think we might run into some issues that require a bit more tweaking and prompt engineering than what I have done here.

I'm optimistic that we can actually use this with other tools. Consider that even GPT-3.5 is not that good at just producing the JSON response when you use it as a conversational agent. It can, and it can do it fairly reliably, but it's not perfect.

The fact that Llama 2, an open source model that we're fitting on a single GPU, is at least somewhat comparable to one of the best large language models in the world, I think, is pretty incredible. I'm very excited to see where this goes. Naturally, Llama 2 has only been around for a few days as of me recording this.

We're probably going to see a lot of new models built by the community on top of Llama 2 appear within the next few days, and especially in the coming weeks and months. I'll be very excited to see where that goes. For now, I'm going to leave it there for this video.

I hope this has all been useful and interesting. Thank you very much for watching, and I will see you again in the next one.