Llama 2 in LangChain — FIRST Open Source Conversational Agent!
Chapters
0:00 Llama 2 Model
2:55 Getting Access to Llama 2
6:12 Initializing Llama 2 70B with Hugging Face
8:17 Quantization and GPU Memory Requirements
11:14 Loading Llama 2
13:05 Stopping Criteria
15:17 Initializing Text Generation Pipeline
16:25 Loading Llama 2 in LangChain
17:08 Creating Llama 2 Conversational Agent
19:46 Prompt Engineering with Llama 2 Chat
22:16 Llama 2 Conversational Agent
24:14 Future of Open Source LLMs
00:00:00.000 |
A few days ago Meta AI released Llama 2. Now what's exciting about Llama 2 is that it's open source 00:00:08.560 |
and it is currently the best performing open source model in a big variety of different 00:00:16.000 |
benchmarks. Now one of the things that I'm personally very excited about is when I see 00:00:22.880 |
these new open source models being released, one of the first things I do is try it out as a 00:00:30.880 |
conversational agent, that is, a chatbot that is actually able to use tools. And every single time 00:00:39.200 |
that I have tried this so far with other models I've been pretty disappointed. They either cannot 00:00:46.800 |
use tools at all or they're just very unreliable. So this "will it work as a conversational agent" 00:00:55.040 |
benchmark has just become my personal go-to when these new models are released. It's my way of 00:01:01.760 |
benchmarking where open source stands compared to OpenAI models, which generally speaking means 00:01:08.400 |
GPT-3.5, text-davinci-003 and especially GPT-4. They are pretty capable as conversational agents 00:01:17.040 |
and what I find in real world use cases is that conversational agents are the future of how we 00:01:25.200 |
interact with large language models. Having a simple chatbot that just talks to us is great 00:01:31.600 |
but it's limited. It doesn't have the flexibility in access to external information 00:01:38.000 |
that a conversational agent will have and it cannot use tools like you know a Python interpreter 00:01:44.800 |
that a conversational agent can use. So that for me is super important and finally 00:01:52.800 |
with Llama 2 we have a model that has actually passed that test. I fairly quickly managed to 00:02:00.400 |
sort of prompt engineer my way to getting a Llama 2 model, the fine-tuned chat version of Llama 2, 00:02:08.080 |
to work as a conversational agent which I think is pretty insane. So what I want to do in this 00:02:15.120 |
video is show you how you can do the same. So we're going to take a look at the biggest Llama 2 00:02:21.440 |
model. It's the 70B parameter model. We're going to quantize it so that we can fit it onto a single 00:02:26.320 |
A100 GPU. I'm actually going to be running all this on Colab so you can actually go ahead and 00:02:31.920 |
run the same notebook. With this approach we're going to be able to fit that 70 billion parameter 00:02:37.280 |
model into a minimum of around 35 gigabytes of GPU memory, but after multiple interactions it 00:02:47.440 |
pushes its way up to more like 38 gigabytes, which is still not that much for such a high-performing 00:02:55.600 |
model. Now let's just dive into how we can actually do this. So the first thing we're going to have to 00:03:01.280 |
do is actually sign up and get access to these models. It's pretty straightforward, it doesn't 00:03:07.280 |
take that long. So what you can do for this is head on over to huggingface.co/meta-llama 00:03:14.800 |
and you want to go over to the meta website here. So we click on that and we just want to request 00:03:22.720 |
access to the next version of Llama. So you fill that out, and for me I got a response almost 00:03:30.160 |
instantly (I tried with two different emails), and basically they're going to send you something like 00:03:36.000 |
this. So it's just, okay, you're all set, start building with Llama 2. It also lists the model 00:03:42.080 |
weights that are available. This is not every single Llama 2 model; there is also a 34 billion 00:03:48.400 |
parameter model which they have not finished testing yet so that hasn't been released just yet 00:03:53.920 |
but the one that we are going to be using is Llama-2-70b-chat. So on Hugging Face we need to go to 00:04:02.800 |
Llama-2-70b-chat-hf. This is the model that we want to be using. So you'll see that there's 00:04:15.200 |
this access Llama2 on HuggingFace. One thing you need to be aware of here is that, well actually 00:04:23.520 |
it says it right here, your HuggingFace account and email address must match the email you provide 00:04:28.400 |
on the meta website. So a minute ago when we entered our details on the meta website make 00:04:34.480 |
sure you use the email that you also use on HuggingFace. So once you've done that you can 00:04:40.560 |
click this, you can submit and as long as those emails line up you will get access fairly quickly. 00:04:48.960 |
Now there are a couple of things you will need: one, we have to wait for that access to come through, 00:04:54.400 |
but we also need to go down to our profile, go to settings, and get an access token. 00:05:04.400 |
So this will allow us to download the model within our code. So you will actually need to 00:05:12.880 |
create a new token. I'm just going to call this Meta Llama and we just need read permissions. 00:05:19.840 |
So with that we generate a token and I'm just going to copy that. So this is a notebook that 00:05:27.120 |
we're going to be working through in this video. There will be a link to this at the top of the 00:05:32.000 |
video right now so you can follow along if you like although I will just pre-warn you that 00:05:38.320 |
parts of this notebook can take a little bit of time particularly when you're downloading the 00:05:44.160 |
model. So with that in mind I wouldn't even necessarily recommend running this on Colab 00:05:49.840 |
because you're going to have to re-download the model like every day that you use this which is 00:05:56.640 |
not ideal and it's fairly expensive. So you should probably run this on your local computer if you 00:06:05.680 |
have a good GPU or on a cloud service somewhere. So we come down to here you'll need to enter your 00:06:17.280 |
Hugging Face API key in here and let me just come down and show you what is happening. So 00:06:26.160 |
there's a fair bit of code that is just kind of initializing the model here for us and as I 00:06:31.840 |
mentioned this download of the model, this download and initialization of the model, 00:06:37.600 |
does take a bit of time. So this has actually been running now for one hour and 10 minutes or a 00:06:46.000 |
little bit longer and I'm not expecting it to finish too soon although I'm hoping it will not 00:06:53.200 |
take too much longer. But essentially we're going to be waiting a while for the model to download 00:06:59.200 |
but let's come up here and just kind of go through that code that we've used to initialize it first. 00:07:06.560 |
Right so we're doing a pip install of all the libraries that we're going to be using. 00:07:10.720 |
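For reference, a minimal sketch of what that install cell might look like, assuming the package set is Transformers, Accelerate, bitsandbytes and LangChain (the exact packages and version pins in the notebook may differ):

```python
# Notebook cell: core libraries for loading, quantizing and orchestrating the model
# (assumed package set; the notebook pins specific versions)
!pip install -qU transformers accelerate bitsandbytes langchain
```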
We do need all of these: Hugging Face Transformers, and then we have these other libraries, 00:07:17.360 |
which are basically there so we can run large language models and also optimize how we're 00:07:23.840 |
running those. And we also have LangChain so later on in the notebook we're going to be using LangChain 00:07:29.840 |
to create that conversational agent. So come down to here what we need here is the large language 00:07:38.800 |
model, a tokenizer for the large language model and also a stopping criteria object which is 00:07:44.880 |
more of an optional item, I would say, for this model. But let's talk about those, the LLM 00:07:52.720 |
first. So for the LLM we have this model ID, which is coming from Hugging Face, so if we come up here 00:08:01.280 |
again we can type in llama2 and we see that there's all these different model IDs. The one that we're 00:08:08.480 |
using is this one here. Okay so we have our model ID here we're just checking that we have a GPU 00:08:16.800 |
available. Here we have this bits and bytes config object. I've spoken about this in previous videos 00:08:24.880 |
so I'm not going to go too into depth but essentially what we're doing here is we're 00:08:31.760 |
minimizing the amount of GPU memory we need to store the model. Now this is a 70 billion parameter 00:08:38.640 |
model so let's just do some very quick maths here. So 70 billion parameters. 00:08:45.520 |
Each of those parameters using the standard data type is 32 bits of information. Okay so the 00:08:59.360 |
standard data type is a float 32 so float 32 and that is 32 bits of information. Within each byte 00:09:10.000 |
there is 8 bits of information so we can actually calculate how much memory we need to store that 00:09:19.840 |
model. It is just the number of params multiplied by the bits per parameter, divided by 8, and that gives us this many 00:09:30.240 |
bytes of information, which is 280 gigabytes. That is a lot; that's many GPUs, many A100s. A 00:09:44.080 |
single A100 I think is 40 gigabytes, so yeah, we need a few of those. Now by doing this 00:09:53.440 |
bits and bytes quantization we can minimize that so what we're essentially doing is switching from 00:10:00.000 |
a float 32 data type to an int 4 data type. Okay and that contains four bits of information. Okay 00:10:09.920 |
so now each one of those parameters is not 32 bits, it's 4 bits, so let's calculate that: 00:10:17.920 |
we have 70 billion times 4 bits, divided by 8, which gives us 35 gigabytes of information. Now 00:10:28.720 |
that's not precise because when we're doing this quantization method if we just converted 00:10:36.240 |
everything into int 4 basically we would lose a lot of performance. This works in a more intelligent 00:10:41.600 |
way by quantizing different parts of the model that essentially don't need quite as much precision. 00:10:50.320 |
Then the bits that do require more precision we convert into 16-bit floats so it will be 00:10:59.680 |
a little bit more than 35 gigabytes essentially but we're going to be within that ballpark. 00:11:05.840 |
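Written out, the two calculations above look like this (weights only, ignoring activations, the KV cache, and the layers kept in 16-bit):

```latex
\text{float32: } \frac{70\times10^{9}\ \text{params}\times 32\ \text{bits}}{8\ \text{bits/byte}} = 280\ \text{GB}
\qquad
\text{int4: } \frac{70\times10^{9}\times 4\ \text{bits}}{8\ \text{bits/byte}} = 35\ \text{GB}
```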
So that's great and allows us to load this model onto a single A100 which is pretty incredible. 00:11:12.320 |
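A sketch of that quantization config using the Transformers BitsAndBytesConfig; the specific choices here (NF4 quantization, double quantization, bfloat16 compute dtype) are common defaults and should be treated as assumptions rather than an exact copy of the notebook:

```python
import torch
from transformers import BitsAndBytesConfig

# Store weights in 4-bit, compute in bfloat16 where more precision is needed
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # normalized float 4 quantization
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```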
Then what we need to do is load the model config from Hugging Face Transformers. Because 00:11:20.000 |
we're downloading the model from Hugging Face, we need to make sure we're 00:11:22.800 |
using our authorization token which you will need to set in here and then we're also going to 00:11:29.680 |
download the Llama 2 model itself. Now we need to have trust_remote_code set in there because 00:11:37.360 |
this is a big model and there is custom code that will allow us to load that model. You don't need 00:11:44.720 |
that for all models on Transformers but you do need it for this one. We have the config object 00:11:50.400 |
which we just initialize up here and we also have the quantization config which we initialize up 00:11:56.080 |
here. Device map needs to be set to auto and we again need to pass in our authorization token 00:12:04.640 |
which we do here. Then after that we switch the model into evaluation mode which basically means 00:12:10.720 |
we're not training the model, we're going to be using it for inference or prediction. Then after 00:12:16.640 |
that we just wait. This is almost done now so I think it's just finished downloading the model 00:12:25.360 |
and now we're going to need to wait for it to actually initialize the model from all of 00:12:31.200 |
those shards that we just downloaded. I will see you in a few minutes when that is finished. 00:12:39.200 |
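Putting the loading step together, roughly what that initialization cell does (argument names follow the Transformers API of that period; the Hugging Face token is a placeholder and the exact keyword set is an assumption):

```python
import transformers

model_id = "meta-llama/Llama-2-70b-chat-hf"
hf_auth = "<YOUR_HF_READ_TOKEN>"  # placeholder for the access token created earlier

# Model config from the Hub (gated repo, so the auth token is required)
model_config = transformers.AutoConfig.from_pretrained(model_id, use_auth_token=hf_auth)

# Load the 70B chat model with the 4-bit quantization config from above
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
    use_auth_token=hf_auth,
)
model.eval()  # inference only, we are not training

# Tokenizer: converts plain text into the token IDs the model reads
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
```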
Everything has now loaded and initialized so we can get on with the rest of the code. 00:12:47.120 |
We need a tokenizer. The tokenizer just converts plain text into basically what the model will be 00:12:54.640 |
reading. I just need to make sure I define this and I can rerun that. It converts plain text to tokens, 00:13:04.320 |
which the model will read, and then we come down to the stopping criteria for the model. Now with smaller 00:13:11.440 |
models this is pretty important. With this model I would say less so, but we can add this in anyway 00:13:18.800 |
as a precaution. Basically, we check whether the model has generated these two items, which 00:13:28.400 |
come from a chat log format: we'd have the assistant type a reply, and then 00:13:34.800 |
if it moves on to the next line and starts generating the text for the human response, 00:13:39.040 |
well, it's generating too much text and we want to cut it off. We have that as a stopping criterion 00:13:46.240 |
and we also have these three backticks. The reason we use these three backticks is because 00:13:51.680 |
when we are using Llama 2 as a conversational agent we actually ask it to reply to everything 00:14:02.320 |
essentially as a markdown code block containing JSON. So we'll have it reply to everything in this format. 00:14:12.880 |
Then in here we'll have an action, which is something like use the calculator, 00:14:18.400 |
and also the action input. So it would be like two plus two. 00:14:27.760 |
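As a rough sketch, the stop list and the stopping-criteria object could look like the following; the exact stop strings, and whether the tokenizer needs special handling of its BOS token, are assumptions to check against the notebook:

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

device = "cuda" if torch.cuda.is_available() else "cpu"

# The agent is asked to answer inside a fenced JSON block, roughly:
#   {"action": "Calculator", "action_input": "2 + 2"}
# so three backticks (and the start of a fake "Human:" turn) are natural stop points.
stop_list = ["\nHuman:", "\n```\n"]

# tokenizer comes from the loading step earlier
stop_token_ids = [
    torch.LongTensor(tokenizer(s, add_special_tokens=False)["input_ids"]).to(device)
    for s in stop_list
]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the generated sequence ends with any of the stop sequences
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0, -len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```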
So that is why we're using this, or including it, within the stop list. Essentially once we get to 00:14:38.560 |
here we want the chatbot to stop generating anything. As I said with this model it doesn't 00:14:46.320 |
seem to be that necessary so you can add it in there as a precaution but actually what I'm going 00:14:51.760 |
to do is just skip that for now. I don't necessarily need that to be in there. If you 00:14:58.240 |
do want to include that in there what you'll need to do is just uncomment that and you'll have that 00:15:04.640 |
in there. I'm not going to initialize it with that. If we do see any issues then we'll go back 00:15:13.520 |
and run that with the stopping criteria included. This is just initializing the text generation 00:15:21.200 |
pipeline with HuggingFace. We can now ask it to generate something. This is a question I've used 00:15:28.000 |
a few times in the past. We just want to make sure that it is actually working on the HuggingFace 00:15:33.360 |
side of things. Can this Hugging Face-initialized model generate some text? It will take a little 00:15:40.640 |
bit of time. As I said before this is exciting because it is finally able to at least at a very 00:15:49.760 |
basic level act as a conversational agent. In terms of speed and hardware requirements it is 00:15:59.120 |
not the most optimal solution. At least not yet. That's something that can be solved with more 00:16:07.040 |
optimized hardware or just kind of throwing a load of hardware at it at least on the time side of 00:16:12.080 |
things. That will take a little while to run and we see that we get this response which I think 00:16:20.480 |
is relatively accurate. I haven't read through it but it looks pretty good. Then what we want to do 00:16:27.520 |
is right now we have everything HuggingFace. We now want to transfer that over into LangChain. 00:16:32.320 |
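A sketch of those two steps, the Hugging Face text-generation pipeline and the LangChain wrapper around it; the generation settings and the example prompt here are plausible placeholders rather than the notebook's exact values:

```python
import transformers
from langchain.llms import HuggingFacePipeline

# Text-generation pipeline around the quantized model and tokenizer
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,   # LangChain strips the prompt itself, so return everything
    # stopping_criteria=stopping_criteria,  # optional, as discussed above
    temperature=0.1,
    max_new_tokens=512,
    repetition_penalty=1.1,
)

# Sanity check directly against the Hugging Face pipeline
res = generate_text("Explain the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])

# Hand the same pipeline to LangChain and treat it as an LLM
llm = HuggingFacePipeline(pipeline=generate_text)
print(llm("Explain the difference between nuclear fission and fusion."))
```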
We're going to do that by initializing this HuggingFace pipeline object from LangChain 00:16:39.520 |
and initializing it with our pipeline that we initialized up here. Then we just treat that as 00:16:47.600 |
the LLM. We'll run that. We can then run this again and this will produce a pretty similar 00:16:56.000 |
output to what we got up here. We can see we get kind of similar output. This is just telling us 00:17:04.000 |
the same sort of stuff but with more text. Now what I want to do, come down to here. We have 00:17:11.040 |
everything initialized in LangChain. Now what we can do is use all of the tooling that comes with 00:17:17.680 |
LangChain to initialize our conversational agent. Now conversational agent as I mentioned before 00:17:23.280 |
is conversational. That means it has some sort of conversational memory and it is also able to use 00:17:30.320 |
tools. That is kind of the advantage of using a conversational agent versus just a standard 00:17:36.640 |
chatbot. We initialize both of those. The conversation buffer window memory is going to 00:17:43.440 |
remember the previous five interactions, and we're also just going to load the LLM math tool. It's a 00:17:49.680 |
calculator. We initialize both of those and then here we have what is an output parser. We don't 00:18:00.400 |
need this for this model. You can have it in there as a precaution again if you like but for the most 00:18:08.080 |
part I've found that it doesn't actually need this with good prompting. Essentially what I would do 00:18:14.000 |
usually with this output parser is if the agent returns some text without the correct format, so 00:18:22.560 |
without that JSON format that I mentioned earlier, I would assume that that's trying to respond 00:18:28.800 |
directly to the user. All this output parser does is kind of reformats that into the correct JSON 00:18:36.080 |
like response but as I said we can ignore it. We don't need it necessarily for at least the tools 00:18:43.680 |
that we're using here. Maybe in a more complex scenario it might be of more use. If you did 00:18:51.520 |
want to use that you just uncomment that and run it but as mentioned let's skip that and just see 00:18:58.560 |
how the agent performs without it. Again it's just like a precaution. We initialize the agent here. 00:19:06.720 |
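A sketch of that setup: the windowed conversational memory, the calculator tool, and the agent itself. The keyword arguments shown are typical for the LangChain version of that period and should be treated as assumptions:

```python
from langchain.memory import ConversationBufferWindowMemory
from langchain.agents import load_tools, initialize_agent

# Conversational memory: keep the previous 5 interactions as chat messages
memory = ConversationBufferWindowMemory(
    memory_key="chat_history", k=5, return_messages=True
)

# One tool for now: the LLM-math calculator
tools = load_tools(["llm-math"], llm=llm)

# Conversational ReAct-style agent that emits JSON-formatted tool calls
agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    early_stopping_method="generate",
    memory=memory,
)

# Example interactions, as in the video
agent("hey how are you today?")
agent("what is 4 to the power of 2.1?")
agent("can you multiply that by 3?")
```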
We're using this chat-conversational-react-description agent and these are kind of standard 00:19:14.000 |
agent initialization parameters. What I want to show you here is the prompt that we initially use. 00:19:21.680 |
Now this prompt doesn't work very well. For one, this initial system prompt is super long. It's 00:19:31.600 |
not that useful. Then we have the user prompt template here which again is super long 00:19:37.920 |
and it doesn't work that well. I've modified those. One thing that is slightly different 00:19:48.160 |
or specific to Llama 2 is the use of these special tokens. We have [INST], which indicates the start 00:19:55.600 |
of some instructions, [/INST], which indicates the end of the instructions, <<SYS>>, which indicates the start 00:20:00.640 |
of the system message, so that initial message that tells the chatbot or LLM how to behave, and <</SYS>>, which 00:20:07.040 |
indicates the end of the system message. We initialize our system message and we include 00:20:14.400 |
that sort of initialization of the system message in there. Then we go through we say 00:20:21.200 |
assistant is an expert JSON builder designed to assist with a wide range of tasks. The intention here 00:20:27.600 |
is to really drill in the point that assistant needs to respond with JSON. We also mentioned 00:20:34.880 |
it needs to respond with the action and action input parameters. We can see an example of that 00:20:41.520 |
in here. In this example I'm saying this is how to use a calculator. You need to say action 00:20:47.360 |
calculator and what you would like to calculate with it. Then we have some further examples 00:20:55.680 |
in here. We have just responding directly to the user. We need to use this JSON format. 00:21:03.520 |
Using calculator again use the JSON format. We just go through and keep giving a few of those 00:21:11.440 |
examples. At the end of the system message we put that end-of-system-message token. We can run that 00:21:21.840 |
and then we come down to here. This is another thing that they found in the paper: 00:21:31.840 |
Llama 2, over multiple interactions, seems to forget those initial instructions. All I'm doing here 00:21:41.120 |
is saying we have some instructions. I'm adding those instruction tags in there and I'm 00:21:46.800 |
giving a little reminder to Llama 2: respond to the following in JSON with "action" and "action_input" 00:21:51.760 |
values. We're just adding that to every user query, which we can see here. 00:22:00.080 |
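In code, that prompt work looks roughly like this. The Llama 2 special tokens themselves are standard, but the exact system prompt text, the create_prompt call, and the index of the human message inside the agent's prompt template are assumptions about the notebook and the LangChain version used:

```python
# Llama 2 chat special tokens
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

# System message drilling in the JSON-only format (few-shot examples elided here)
sys_msg = B_SYS + (
    "Assistant is an expert JSON builder designed to assist with a wide range of tasks.\n\n"
    'Assistant responds using JSON strings that contain "action" and "action_input" parameters.\n'
    "... few-shot examples of Calculator use and Final Answer responses go here ..."
) + E_SYS

new_prompt = agent.agent.create_prompt(system_message=sys_msg, tools=tools)
agent.agent.llm_chain.prompt = new_prompt

# Remind the model of the format on every turn by wrapping each user query
instruction = (
    B_INST + " Respond to the following in JSON with 'action' and 'action_input' values " + E_INST
)
human_msg = instruction + "\nUser: {input}"
agent.agent.llm_chain.prompt.messages[2].prompt.template = human_msg
```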
Then we just modify the human message prompt template and what we end up with is this which 00:22:08.800 |
you can see down here. We're going to have that with every human message. Now we can actually 00:22:18.400 |
begin asking questions. I just ran this one. Hey, how are you today? We see that we get this output. 00:22:25.440 |
Final answer. I'm good, thanks. How are you? That's pretty good. Let's try what is 4 to the 00:22:33.360 |
power of 2.1. We see that's correctly using a calculator. It has the action input which is 4 00:22:42.640 |
to the power of 2.1 in Python. This interaction takes a little bit longer because there are 00:22:49.280 |
multiple LLM calls happening here. The first LLM call produces the I need to use a calculator and 00:22:55.760 |
the input to that calculator. This is sent back to LangChain and this is executed in a Python 00:23:03.440 |
interpreter. We get this answer from that. That is sent back to the assistant, and based on that 00:23:12.400 |
it knows that it can give the final answer back to us. The action is "Final Answer". It looks 00:23:19.680 |
like the answer is this. That is the output that we get there. Now let's use our conversational 00:23:28.000 |
history and ask it to multiply that previous number by 3. Let's run that. We can see the 00:23:37.440 |
first item, the calculator, it is being used correctly. We have that 18.379 multiplied by 3. 00:23:45.280 |
Again, it's going to take a little moment because it needs to get the answer and generate a new 00:23:53.680 |
LLM response based on that answer. Then we get our answer and we have this 55.13 and that's what 00:24:03.760 |
we get. This is pretty good. Now, I would say as you saw, these answers where it's going through 00:24:11.920 |
multiple steps, it's taking a minute for each one. A lot of that time seems to be spinning up a 00:24:17.440 |
Python interpreter. It's not fully on the LLM in this case, but it does take a little bit of time. 00:24:25.600 |
Naturally, that is probably one of the biggest issues with using Llama 2 at the moment. It 00:24:31.360 |
takes a lot of GPU memory to run it. That comes with high costs. Especially if you are running 00:24:37.200 |
it on a single GPU like we are with quantization, which slows the whole thing down, things are going 00:24:43.280 |
to take a little bit of time. Nonetheless, I think this looks really cool. What we've done here is 00:24:50.400 |
a very simple agent. It's just using a calculator. We're not stress testing this. Honestly, 00:24:59.680 |
if we want to start using other tools, I think we might run into some issues that require a bit more 00:25:06.560 |
tweaking and prompt engineering than what I have done here. I'm optimistic that we can actually 00:25:13.040 |
use this for other tools. Consider that even GPT-3.5 is not that good 00:25:23.600 |
at just producing the JSON response when you use it as a conversational agent. It can, and it can 00:25:31.760 |
do it fairly reliably, but it's not perfect. The fact that Llama 2, an open source model that we're 00:25:40.320 |
fitting on a single GPU is at least somewhat comparable to one of the best large language 00:25:48.160 |
models in the world, I think that is pretty incredible. I'm very excited to see where this 00:25:54.720 |
goes. Naturally, Llama 2 has only been around for a few days as of me recording this. We're probably 00:26:02.560 |
going to see a lot of new models built by the community on top of Llama 2 appear within 00:26:11.920 |
the next few days from now, and especially in the coming weeks and months. I'll be very excited to 00:26:19.920 |
see where that goes. For now, I'm going to leave it there for this video. I hope this has all been 00:26:27.040 |
useful and interesting. Thank you very much for watching, and I will see you again in the next one.