Llama 2 in LangChain — FIRST Open Source Conversational Agent!
Chapters
0:00 Llama 2 Model
2:55 Getting Access to Llama 2
6:12 Initializing Llama 2 70B with Hugging Face
8:17 Quantization and GPU Memory Requirements
11:14 Loading Llama 2
13:05 Stopping Criteria
15:17 Initializing Text Generation Pipeline
16:25 Loading Llama 2 in LangChain
17:08 Creating Llama 2 Conversational Agent
19:46 Prompt Engineering with Llama 2 Chat
22:16 Llama 2 Conversational Agent
24:14 Future of Open Source LLMs
00:00:00.000 |
A few days ago Meta AI released Llama 2. Now what's exciting about Llama 2 is that it's open source 00:00:08.560 |
and it is currently the best performing open source model in a big variety of different 00:00:16.000 |
benchmarks. Now one of the things that I'm personally very excited about is when I see 00:00:22.880 |
these new open source models being released, one of the first things I do is try it out as a 00:00:30.880 |
conversational agent, that is, a chatbot that is actually able to use tools. And every single time 00:00:39.200 |
that I have tried this so far with other models I've been pretty disappointed. They either cannot 00:00:46.800 |
use tools at all or they're just very unreliable. So this "will it work as a conversational agent" 00:00:55.040 |
benchmark has just become my personal go-to when these new models are released. It's my way of 00:01:01.760 |
benchmarking where open source stands compared to OpenAI models, which generally speaking means 00:01:08.400 |
GPT-3.5, text-davinci-003 and especially GPT-4. They are pretty capable as conversational agents 00:01:17.040 |
and what I find in real world use cases is that conversational agents are the future of how we 00:01:25.200 |
interact with large language models. Having a simple chatbot that just talks to us is great 00:01:31.600 |
but it's limited. It doesn't have the flexibility in access to external information 00:01:38.000 |
that a conversational agent will have and it cannot use tools like you know a Python interpreter 00:01:44.800 |
that a conversational agent can use. So that for me is super important and finally 00:01:52.800 |
with Llama 2 we have a model that has actually passed that test. I fairly quickly managed to 00:02:00.400 |
sort of prompt engineer my way to getting a Llama 2 model, the fine-tuned chat version of Llama 2, 00:02:08.080 |
to work as a conversational agent which I think is pretty insane. So what I want to do in this 00:02:15.120 |
video is show you how you can do the same. So we're going to take a look at the biggest Llama 2 00:02:21.440 |
model. It's the 70B parameter model. We're going to quantize it so that we can fit it onto a single 00:02:26.320 |
A100 GPU. I'm actually going to be running all this on Colab so you can actually go ahead and 00:02:31.920 |
run the same notebook. With this approach we're going to be able to fit that 70 billion parameter 00:02:37.280 |
model into a minimum of around 35 gigabytes of GPU memory, but after multiple interactions it 00:02:47.440 |
pushes its way up to more like 38 gigabytes, which is still not that much for such a high-performing 00:02:55.600 |
model. Now let's just dive into how we can actually do this. So the first thing we're going to have to 00:03:01.280 |
do is actually sign up and get access to these models. It's pretty straightforward, it doesn't 00:03:07.280 |
take that long. So what you can do for this is head on over to huggingface.co/meta-llama 00:03:14.800 |
and you want to go over to the meta website here. So we click on that and we just want to request 00:03:22.720 |
access to the next version of Llama. So you fill that out, and for me I got a response almost 00:03:30.160 |
instantly (I tried with two different emails), and basically they're going to send you something like 00:03:36.000 |
this. So it's just, okay, you're all set, start building with Llama 2. It also lists the model 00:03:42.080 |
weights that are available. This is not every single Llama 2 model; there is also a 34 billion 00:03:48.400 |
parameter model which they have not finished testing yet so that hasn't been released just yet 00:03:53.920 |
but the one that we are going to be using is Llama-2-70b-chat. So on Hugging Face we need to go to 00:04:02.800 |
Llama-2-70b-chat-hf. This is the model that we want to be using. So you'll see that there's 00:04:15.200 |
this access Llama2 on HuggingFace. One thing you need to be aware of here is that, well actually 00:04:23.520 |
it says it right here, your HuggingFace account and email address must match the email you provide 00:04:28.400 |
on the meta website. So a minute ago when we entered our details on the meta website make 00:04:34.480 |
sure you use the email that you also use on HuggingFace. So once you've done that you can 00:04:40.560 |
click this, you can submit and as long as those emails line up you will get access fairly quickly. 00:04:48.960 |
Now there are a couple of things you will need: one, we have to wait for that access to come through, 00:04:54.400 |
but we also need to go down to our profile, go to settings, and get an access token. 00:05:04.400 |
So this will allow us to download the model within our code. So you will actually need to 00:05:12.880 |
create a new token. I'm just going to call this Meta Llama and we just need read permissions. 00:05:19.840 |
So with that we generate a token and I'm just going to copy that. So this is a notebook that 00:05:27.120 |
we're going to be working through in this video. There will be a link to this at the top of the 00:05:32.000 |
video right now so you can follow along if you like although I will just pre-warn you that 00:05:38.320 |
parts of this notebook can take a little bit of time particularly when you're downloading the 00:05:44.160 |
model. So with that in mind I wouldn't even necessarily recommend running this on Colab 00:05:49.840 |
because you're going to have to re-download the model like every day that you use this which is 00:05:56.640 |
not ideal and it's fairly expensive. So you should probably run this on your local computer if you 00:06:05.680 |
have a good GPU or on a cloud service somewhere. So we come down to here you'll need to enter your 00:06:17.280 |
Hugging Face API key in here and let me just come down and show you what is happening. So 00:06:26.160 |
there's a fair bit of code that is just kind of initializing the model here for us and as I 00:06:31.840 |
mentioned this download of the model, this download and initialization of the model, 00:06:37.600 |
does take a bit of time. So this has actually been running now for one hour and 10 minutes or a 00:06:46.000 |
little bit longer and I'm not expecting it to finish too soon although I'm hoping it will not 00:06:53.200 |
take too much longer. But essentially we're going to be waiting a while for the model to download 00:06:59.200 |
but let's come up here and just kind of go through that code that we've used to initialize it first. 00:07:06.560 |
Right so we're doing a pip install of all the libraries that we're going to be using. 00:07:10.720 |
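For reference, a minimal sketch of what that install cell might look like, assuming the package set is Transformers, Accelerate, bitsandbytes and LangChain (the exact packages and version pins in the notebook may differ):

```python
# Notebook cell: core libraries for loading, quantizing and orchestrating the model
# (assumed package set; the notebook pins specific versions)
!pip install -qU transformers accelerate bitsandbytes langchain
```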
We do need all of these: Hugging Face Transformers, and then we have these other libraries, 00:07:17.360 |
which are basically there so we can run large language models and also optimize how we're 00:07:23.840 |
running those. And we also have LangChain so later on in the notebook we're going to be using LangChain 00:07:29.840 |
to create that conversational agent. So come down to here what we need here is the large language 00:07:38.800 |
model, a tokenizer for the large language model and also a stopping criteria object which is 00:07:44.880 |
more of an optional item, I would say, for this model. But let's talk about those, the LLM 00:07:52.720 |
first. So for the LLM we have this model ID, which is coming from Hugging Face, so if we come up here 00:08:01.280 |
again we can type in llama2 and we see that there's all these different model IDs. The one that we're 00:08:08.480 |
using is this one here. Okay so we have our model ID here we're just checking that we have a GPU 00:08:16.800 |
available. Here we have this bits and bytes config object. I've spoken about this in previous videos 00:08:24.880 |
so I'm not going to go too into depth but essentially what we're doing here is we're 00:08:31.760 |
minimizing the amount of GPU memory we need to store the model. Now this is a 70 billion parameter 00:08:38.640 |
model so let's just do some very quick maths here. So 70 billion parameters. 00:08:45.520 |
Each of those parameters using the standard data type is 32 bits of information. Okay so the 00:08:59.360 |
standard data type is a float 32 so float 32 and that is 32 bits of information. Within each byte 00:09:10.000 |
there is 8 bits of information so we can actually calculate how much memory we need to store that 00:09:19.840 |
model. It is just the number of params multiplied by the bits per parameter, divided by 8, and that gives us this many 00:09:30.240 |
bytes of information, which is 280 gigabytes. That is a lot; that's many GPUs, many A100s. A 00:09:44.080 |
single A100 I think is 40 gigabytes, so yeah, we need a few of those. Now by doing this 00:09:53.440 |
bits and bytes quantization we can minimize that so what we're essentially doing is switching from 00:10:00.000 |
a float 32 data type to an int 4 data type. Okay and that contains four bits of information. Okay 00:10:09.920 |
so now each one of those parameters is not 32 bits, it's 4 bits, so let's calculate that: 00:10:17.920 |
we have 70 billion times 4 bits, divided by 8, which gives us 35 gigabytes of information. Now 00:10:28.720 |
that's not precise because when we're doing this quantization method if we just converted 00:10:36.240 |
everything into int 4 basically we would lose a lot of performance. This works in a more intelligent 00:10:41.600 |
way by quantizing different parts of the model that essentially don't need quite as much precision. 00:10:50.320 |
Then the bits that do require more precision we convert into 16-bit floats so it will be 00:10:59.680 |
a little bit more than 35 gigabytes essentially but we're going to be within that ballpark. 00:11:05.840 |
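Written out, the two calculations above look like this (weights only, ignoring activations, the KV cache, and the layers kept in 16-bit):

```latex
\text{float32: } \frac{70\times10^{9}\ \text{params}\times 32\ \text{bits}}{8\ \text{bits/byte}} = 280\ \text{GB}
\qquad
\text{int4: } \frac{70\times10^{9}\times 4\ \text{bits}}{8\ \text{bits/byte}} = 35\ \text{GB}
```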
So that's great and allows us to load this model onto a single A100 which is pretty incredible. 00:11:12.320 |
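A sketch of that quantization config using the Transformers BitsAndBytesConfig; the specific choices here (NF4 quantization, double quantization, bfloat16 compute dtype) are common defaults and should be treated as assumptions rather than an exact copy of the notebook:

```python
import torch
from transformers import BitsAndBytesConfig

# Store weights in 4-bit, compute in bfloat16 where more precision is needed
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # normalized float 4 quantization
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```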
Then what we need to do is load the model config from Hugging Face Transformers. Because 00:11:20.000 |
we're downloading the model from Hugging Face, we need to make sure we're 00:11:22.800 |
using our authorization token which you will need to set in here and then we're also going to 00:11:29.680 |
download the Llama 2 model itself. Now we need to have trust_remote_code set in there because 00:11:37.360 |
this is a big model and there is custom code that will allow us to load that model. You don't need 00:11:44.720 |
that for all models on Transformers but you do need it for this one. We have the config object 00:11:50.400 |
which we just initialize up here and we also have the quantization config which we initialize up 00:11:56.080 |
here. Device map needs to be set to auto and we again need to pass in our authorization token 00:12:04.640 |
which we do here. Then after that we switch the model into evaluation mode which basically means 00:12:10.720 |
we're not training the model, we're going to be using it for inference or prediction. Then after 00:12:16.640 |
that we just wait. This is almost done now so I think it's just finished downloading the model 00:12:25.360 |
and now we're going to need to wait for it to actually initialize the model from all of 00:12:31.200 |
those shards that we just downloaded. I will see you in a few minutes when that is finished. 00:12:39.200 |
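Putting the loading step together, roughly what that initialization cell does (argument names follow the Transformers API of that period; the Hugging Face token is a placeholder and the exact keyword set is an assumption):

```python
import transformers

model_id = "meta-llama/Llama-2-70b-chat-hf"
hf_auth = "<YOUR_HF_READ_TOKEN>"  # placeholder for the access token created earlier

# Model config from the Hub (gated repo, so the auth token is required)
model_config = transformers.AutoConfig.from_pretrained(model_id, use_auth_token=hf_auth)

# Load the 70B chat model with the 4-bit quantization config from above
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
    use_auth_token=hf_auth,
)
model.eval()  # inference only, we are not training

# Tokenizer: converts plain text into the token IDs the model reads
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
```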
Everything has now loaded and initialized so we can get on with the rest of the code. 00:12:47.120 |
We need a tokenizer. The tokenizer just converts plain text into basically what the model will be 00:12:54.640 |
reading. I just need to make sure I define this and I can rerun that. It converts plain text to tokens, 00:13:04.320 |
which the model will read, and then we come down to the stopping criteria for the model. Now with smaller 00:13:11.440 |
models this is pretty important. With this model I would say less so, but we can add this in anyway 00:13:18.800 |
as a precaution. Basically, we check whether the model has generated these two items, which 00:13:28.400 |
come from a chat log format: we'd have the assistant type a reply, and then 00:13:34.800 |
if it moves on to the next line and starts generating the text for the human response, 00:13:39.040 |
well, it's generating too much text and we want to cut it off. We have that as a stopping criterion 00:13:46.240 |
and we also have these three backticks. The reason we use these three backticks is because 00:13:51.680 |
when we are using Llama 2 as a conversational agent we actually ask it to reply to everything 00:14:02.320 |
essentially as a markdown code block containing JSON. So we'll have it reply to everything in this format. 00:14:12.880 |
Then in here we'll have an action, which is something like use the calculator, 00:14:18.400 |
and also the action input. So it would be like two plus two. 00:14:27.760 |
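As a rough sketch, the stop list and the stopping-criteria object could look like the following; the exact stop strings, and whether the tokenizer needs special handling of its BOS token, are assumptions to check against the notebook:

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

device = "cuda" if torch.cuda.is_available() else "cpu"

# The agent is asked to answer inside a fenced JSON block, roughly:
#   {"action": "Calculator", "action_input": "2 + 2"}
# so three backticks (and the start of a fake "Human:" turn) are natural stop points.
stop_list = ["\nHuman:", "\n```\n"]

# tokenizer comes from the loading step earlier
stop_token_ids = [
    torch.LongTensor(tokenizer(s, add_special_tokens=False)["input_ids"]).to(device)
    for s in stop_list
]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the generated sequence ends with any of the stop sequences
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0, -len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```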
So that is why we're using this, or including it, within the stop list. Essentially once we get to 00:14:38.560 |
here we want the chatbot to stop generating anything. As I said with this model it doesn't 00:14:46.320 |
seem to be that necessary so you can add it in there as a precaution but actually what I'm going 00:14:51.760 |
to do is just skip that for now. I don't necessarily need that to be in there. If you 00:14:58.240 |
do want to include that in there what you'll need to do is just uncomment that and you'll have that 00:15:04.640 |
in there. I'm not going to initialize it with that. If we do see any issues then we'll go back 00:15:13.520 |
and run that with the stopping criteria included. This is just initializing the text generation 00:15:21.200 |
pipeline with HuggingFace. We can now ask it to generate something. This is a question I've used 00:15:28.000 |
a few times in the past. We just want to make sure that it is actually working on the HuggingFace 00:15:33.360 |
side of things. Can this Hugging Face-initialized model generate some text? It will take a little 00:15:40.640 |
bit of time. As I said before this is exciting because it is finally able to at least at a very 00:15:49.760 |
basic level act as a conversational agent. In terms of speed and hardware requirements it is 00:15:59.120 |
not the most optimal solution. At least not yet. That's something that can be solved with more 00:16:07.040 |
optimized hardware or just kind of throwing a load of hardware at it at least on the time side of 00:16:12.080 |
things. That will take a little while to run and we see that we get this response which I think 00:16:20.480 |
is relatively accurate. I haven't read through it but it looks pretty good. Then what we want to do 00:16:27.520 |
is right now we have everything HuggingFace. We now want to transfer that over into LangChain. 00:16:32.320 |
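A sketch of those two steps, the Hugging Face text-generation pipeline and the LangChain wrapper around it; the generation settings and the example prompt here are plausible placeholders rather than the notebook's exact values:

```python
import transformers
from langchain.llms import HuggingFacePipeline

# Text-generation pipeline around the quantized model and tokenizer
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,   # LangChain strips the prompt itself, so return everything
    # stopping_criteria=stopping_criteria,  # optional, as discussed above
    temperature=0.1,
    max_new_tokens=512,
    repetition_penalty=1.1,
)

# Sanity check directly against the Hugging Face pipeline
res = generate_text("Explain the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])

# Hand the same pipeline to LangChain and treat it as an LLM
llm = HuggingFacePipeline(pipeline=generate_text)
print(llm("Explain the difference between nuclear fission and fusion."))
```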
We're going to do that by initializing this HuggingFace pipeline object from LangChain 00:16:39.520 |
and initializing it with our pipeline that we initialized up here. Then we just treat that as 00:16:47.600 |
the LLM. We'll run that. We can then run this again and this will produce a pretty similar 00:16:56.000 |
output to what we got up here. We can see we get kind of similar output. This is just telling us 00:17:04.000 |
the same sort of stuff but with more text. Now what I want to do, come down to here. We have 00:17:11.040 |
everything initialized in LangChain. Now what we can do is use all of the tooling that comes with 00:17:17.680 |
LangChain to initialize our conversational agent. Now conversational agent as I mentioned before 00:17:23.280 |
is conversational. That means it has some sort of conversational memory and it is also able to use 00:17:30.320 |
tools. That is kind of the advantage of using a conversational agent versus just a standard 00:17:36.640 |
chatbot. We initialize both of those. The conversation buffer window memory is going to 00:17:43.440 |
remember the previous five interactions, and we're also just going to load the LLM math tool. It's a 00:17:49.680 |
calculator. We initialize both of those and then here we have what is an output parser. We don't 00:18:00.400 |
need this for this model. You can have it in there as a precaution again if you like but for the most 00:18:08.080 |
part I've found that it doesn't actually need this with good prompting. Essentially what I would do 00:18:14.000 |
usually with this output parser is if the agent returns some text without the correct format, so 00:18:22.560 |
without that JSON format that I mentioned earlier, I would assume that that's trying to respond 00:18:28.800 |
directly to the user. All this output parser does is kind of reformats that into the correct JSON 00:18:36.080 |
like response but as I said we can ignore it. We don't need it necessarily for at least the tools 00:18:43.680 |
that we're using here. Maybe in a more complex scenario it might be of more use. If you did 00:18:51.520 |
want to use that you just uncomment that and run it but as mentioned let's skip that and just see 00:18:58.560 |
how the agent performs without it. Again it's just like a precaution. We initialize the agent here. 00:19:06.720 |
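A sketch of that setup: the windowed conversational memory, the calculator tool, and the agent itself. The keyword arguments shown are typical for the LangChain version of that period and should be treated as assumptions:

```python
from langchain.memory import ConversationBufferWindowMemory
from langchain.agents import load_tools, initialize_agent

# Conversational memory: keep the previous 5 interactions as chat messages
memory = ConversationBufferWindowMemory(
    memory_key="chat_history", k=5, return_messages=True
)

# One tool for now: the LLM-math calculator
tools = load_tools(["llm-math"], llm=llm)

# Conversational ReAct-style agent that emits JSON-formatted tool calls
agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    early_stopping_method="generate",
    memory=memory,
)

# Example interactions, as in the video
agent("hey how are you today?")
agent("what is 4 to the power of 2.1?")
agent("can you multiply that by 3?")
```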
We're using this chat-conversational-react-description agent and these are kind of standard 00:19:14.000 |
agent initialization parameters. What I want to show you here is the prompt that we initially use. 00:19:21.680 |
Now this prompt doesn't work very well. For one, this initial system prompt is super long. It's 00:19:31.600 |
not that useful. Then we have the user prompt template here which again is super long 00:19:37.920 |
and it doesn't work that well. I've modified those. One thing that is slightly different 00:19:48.160 |
or specific to Llama 2 is the use of these special tokens. We have [INST], which indicates the start 00:19:55.600 |
of some instructions, [/INST], which indicates the end of the instructions, <<SYS>>, which indicates the start 00:20:00.640 |
of the system message, so that initial message that tells the chatbot or LLM how to behave, and <</SYS>>, which 00:20:07.040 |
indicates the end of the system message. We initialize our system message and we include 00:20:14.400 |
that sort of initialization of the system message in there. Then we go through we say 00:20:21.200 |
assistant is an expert JSON builder designed to assist with a wide range of tasks. The intention here 00:20:27.600 |
is to really drill in the point that assistant needs to respond with JSON. We also mentioned 00:20:34.880 |
it needs to respond with the action and action input parameters. We can see an example of that 00:20:41.520 |
in here. In this example I'm saying this is how to use a calculator. You need to say action 00:20:47.360 |
calculator and what you would like to calculate with it. Then we have some further examples 00:20:55.680 |
in here. We have just responding directly to the user. We need to use this JSON format. 00:21:03.520 |
Using calculator again use the JSON format. We just go through and keep giving a few of those 00:21:11.440 |
examples. At the end of the system message we put that end-of-system-message token. We can run that 00:21:21.840 |
and then we come down to here. This is another thing that they found in the paper: 00:21:31.840 |
Llama 2, over multiple interactions, seems to forget those initial instructions. All I'm doing here 00:21:41.120 |
is saying we have some instructions. I'm adding those instruction tags in there and I'm 00:21:46.800 |
giving a little reminder to Llama 2: respond to the following in JSON with "action" and "action_input" 00:21:51.760 |
values. We're just adding that to every user query, which we can see here. 00:22:00.080 |
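In code, that prompt work looks roughly like this. The Llama 2 special tokens themselves are standard, but the exact system prompt text, the create_prompt call, and the index of the human message inside the agent's prompt template are assumptions about the notebook and the LangChain version used:

```python
# Llama 2 chat special tokens
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

# System message drilling in the JSON-only format (few-shot examples elided here)
sys_msg = B_SYS + (
    "Assistant is an expert JSON builder designed to assist with a wide range of tasks.\n\n"
    'Assistant responds using JSON strings that contain "action" and "action_input" parameters.\n'
    "... few-shot examples of Calculator use and Final Answer responses go here ..."
) + E_SYS

new_prompt = agent.agent.create_prompt(system_message=sys_msg, tools=tools)
agent.agent.llm_chain.prompt = new_prompt

# Remind the model of the format on every turn by wrapping each user query
instruction = (
    B_INST + " Respond to the following in JSON with 'action' and 'action_input' values " + E_INST
)
human_msg = instruction + "\nUser: {input}"
agent.agent.llm_chain.prompt.messages[2].prompt.template = human_msg
```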
Then we just modify the human message prompt template and what we end up with is this which 00:22:08.800 |
you can see down here. We're going to have that with every human message. Now we can actually 00:22:18.400 |
begin asking questions. I just ran this one. Hey, how are you today? We see that we get this output. 00:22:25.440 |
Final answer. I'm good, thanks. How are you? That's pretty good. Let's try what is 4 to the 00:22:33.360 |
power of 2.1. We see that's correctly using a calculator. It has the action input which is 4 00:22:42.640 |
to the power of 2.1 in Python. This interaction takes a little bit longer because there are 00:22:49.280 |
multiple LLM calls happening here. The first LLM call produces the I need to use a calculator and 00:22:55.760 |
the input to that calculator. This is sent back to LangChain and this is executed in a Python 00:23:03.440 |
interpreter. We get this answer from that. That is sent back to the assistant, and based on that 00:23:12.400 |
it knows that it can give the final answer back to us. The action is "Final Answer". It looks 00:23:19.680 |
like the answer is this. That is the output that we get there. Now let's use our conversational 00:23:28.000 |
history and ask it to multiply that previous number by 3. Let's run that. We can see the 00:23:37.440 |
first item, the calculator, it is being used correctly. We have that 18.379 multiplied by 3. 00:23:45.280 |
Again, it's going to take a little moment because it needs to get the answer and generate a new 00:23:53.680 |
LLM response based on that answer. Then we get our answer and we have this 55.13 and that's what 00:24:03.760 |
we get. This is pretty good. Now, I would say as you saw, these answers where it's going through 00:24:11.920 |
multiple steps, it's taking a minute for each one. A lot of that time seems to be spinning up a 00:24:17.440 |
Python interpreter. It's not fully on the LLM in this case, but it does take a little bit of time. 00:24:25.600 |
Naturally, that is probably one of the biggest issues with using Llama 2 at the moment. It 00:24:31.360 |
takes a lot of GPU memory to run it. That comes with high costs. Especially if you are running 00:24:37.200 |
it on a single GPU like we are with quantization, which slows the whole thing down, things are going 00:24:43.280 |
to take a little bit of time. Nonetheless, I think this looks really cool. What we've done here is 00:24:50.400 |
a very simple agent. It's just using a calculator. We're not stress testing this. Honestly, 00:24:59.680 |
if we want to start using other tools, I think we might run into some issues that require a bit more 00:25:06.560 |
tweaking and prompt engineering than what I have done here. I'm optimistic that we can actually 00:25:13.040 |
use this for other tools. Consider that even GPT-3.5 is not that good 00:25:23.600 |
at just producing the JSON response when you use it as a conversational agent. It can, and it can 00:25:31.760 |
do it fairly reliably, but it's not perfect. The fact that Llama 2, an open source model that we're 00:25:40.320 |
fitting on a single GPU is at least somewhat comparable to one of the best large language 00:25:48.160 |
models in the world, I think that is pretty incredible. I'm very excited to see where this 00:25:54.720 |
goes. Naturally, Llama 2 has only been around for a few days as of me recording this. We're probably 00:26:02.560 |
going to see a lot of new models built by the community on top of Llama 2 appear within 00:26:11.920 |
the next few days from now, and especially in the coming weeks and months. I'll be very excited to 00:26:19.920 |
see where that goes. For now, I'm going to leave it there for this video. I hope this has all been 00:26:27.040 |
useful and interesting. Thank you very much for watching, and I will see you again in the next one.