BEST Open Source LLM — Falcon 40B Chatbot in LangChain
Chapters
0:00 Falcon 40B Instruct LLM
1:48 Falcon 40B in Python
7:04 Falcon 40B Chatbot in LangChain
9:31 Handling Output with Output Parsers
13:31 Falcon 40B as Chatbot and Code Generation
15:45 Refactoring Code with Falcon 40B and GPT 3.5
19:14 Falcon 40B as an Agent
21:34 Final Thoughts
00:00:00.000 |
Today, we're going to be taking a look at what is currently the best performing open-source large-language model in the world. 00:00:06.960 |
This model is the Falcon 40B Instruct model. 00:00:11.760 |
We're going to be taking a look at first how to use it 00:00:15.760 |
and we're going to focus on using it on the smallest hardware we can. 00:00:20.400 |
So, we're actually going to manage to fit that on a single GPU through quantization. 00:00:25.320 |
And we're also going to look at how to implement it as a chatbot 00:00:30.080 |
and also how we would go about doing the same thing for the model as a conversational agent. 00:00:35.800 |
Now, according to Hugging Face's OpenLLM leaderboard, 00:00:39.040 |
it is currently the best performing open-source model that there is available today. 00:00:45.760 |
It's surprisingly even better performing than several 65 billion parameter models, 00:00:52.800 |
despite only being a 40 billion parameter model, which is pretty impressive. 00:00:58.440 |
So, in terms of performance, it's like several steps above LLaMA 65B, which we can see down here. 00:01:06.960 |
Now, this is the Instruct version of the model. 00:01:10.840 |
We also have the base model down here, so the Falcon 40B. 00:01:14.040 |
We're going to stick with Instruct. Basically, it's been fine-tuned to follow instructions. 00:01:19.720 |
And we can see from here, the performance of this fine-tuned model is a fair bit higher than the base model. 00:01:27.120 |
Now, the model was trained by TII, the Technology Innovation Institute in the UAE. 00:01:33.080 |
I'm not sure exactly how they've managed to get such a small model to perform as well as it does. 00:01:39.560 |
But in either case, it's open-source and we can use it. 00:01:47.920 |
So, let's take a look at how we would actually use this model. 00:01:52.480 |
Now, we're going to start with using this model as a chatbot 00:01:57.720 |
and essentially an AI pair programmer, and see how it performs. 00:02:04.240 |
It's pretty similar to how we set up other open-source large-language models. 00:02:11.880 |
So, we can just start going through that, but I'm going to go very quickly through these first few items. 00:02:17.800 |
So, first, we have these, again, similar to what we have done before in previous videos. 00:02:23.800 |
We're going to be using the Hugging Face Transformers library. 00:02:26.440 |
We're going to be quantizing the model using bits and bytes. 00:02:30.280 |
And we're going to be converting it into a chatbot, and also a conversational agent later on, using the LangChain library. 00:02:38.600 |
So, the first thing we want to do is initialize the model. 00:02:41.080 |
We do that like so, with Hugging Face Transformers. 00:02:48.840 |
But there is one thing that we do add, and this is so that we can actually fit it onto a single GPU. 00:02:54.600 |
And I am using an A100 here, so it's not the smallest GPU, but it is still just one GPU. 00:03:01.480 |
The way that we do that is we set this quantization config parameter. 00:03:05.640 |
So, the quantization config is essentially telling Transformers how to quantize the model weights. 00:03:14.200 |
So, essentially converting the high memory floating point numbers that are used within the model into quantized integer values, which are much smaller. 00:03:30.600 |
So, by default, I think the data type being used is a 32-bit float. 00:03:40.120 |
But we are actually going to be using a 4-bit integer value for the majority of the weights within the model. 00:03:49.960 |
The way that this works, and manages to maintain the performance it does, is that it selectively decides which parameters should be quantized and which parameters within the model should not be. 00:04:06.360 |
It means we can make the model size much smaller whilst still getting almost the same performance out of it. 00:04:17.080 |
We also in here need to include this device map auto. 00:04:22.120 |
So, what that is going to do is move the model onto GPU if a GPU is available, which it is. 00:04:28.440 |
And I've checked that up here and printed it out down here. 00:04:32.600 |
So, you should see when you run this that the model has been loaded onto a CUDA device. 00:04:37.320 |
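Roughly, that initialization might look like the sketch below, assuming the tiiuae/falcon-40b-instruct checkpoint on Hugging Face and 4-bit quantization via bitsandbytes; the exact argument values in the video may differ.

```python
from torch import cuda, bfloat16
import transformers

model_id = "tiiuae/falcon-40b-instruct"

# 4-bit quantization config: store most weights as 4-bit integers,
# but run the actual compute in bfloat16.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,         # Falcon shipped custom modelling code at the time
    quantization_config=bnb_config,
    device_map="auto",              # place the model on the GPU if one is available
)
model.eval()

device = f"cuda:{cuda.current_device()}" if cuda.is_available() else "cpu"
print(f"Model loaded on {device}")
```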
If you don't see that and you are also running this on Colab, what you can do is go to runtime, change runtime type, and you want to go to hardware accelerator, GPU, and you need to select an A100 there. 00:04:56.200 |
Now, that will actually take a while to initialize the model. 00:05:00.120 |
Okay, so I was running this cell for 10 minutes. 00:05:05.800 |
Once you have downloaded it once, within the same session on Colab, all it needs to do is load the model. 00:05:15.080 |
Although if you are using Colab for this and you use Colab like the next day or something like that, it will not be on the same computing instance. 00:05:27.240 |
And therefore, it will actually need to download the model again. 00:05:29.480 |
So, I would recommend you either do this locally or you do it on a dedicated instance. 00:05:37.640 |
Next, we initialize the tokenizer, which just translates from human-readable text into transformer-readable tokens. 00:05:47.480 |
I've spoken about this a lot recently, so I'm not going to go through it again. 00:05:50.600 |
But there will be a link to a video at the top right now where I do talk about this if you are interested. 00:05:58.840 |
And then all we're going to do here is initialize the model for text generation. 00:06:10.040 |
So, first, we're just going to ask this fairly easy, but kind of niche, question, right? 00:06:15.880 |
It's not common knowledge, but it isn't too difficult for a large language model. 00:06:20.920 |
You can also check the amount of GPU RAM that we're using here. 00:06:29.400 |
So, this model, 40 billion parameter model with quantization is actually not using that much space. 00:06:40.840 |
So, we have here, we're just returning the original question. 00:06:45.640 |
We actually do that here because we're using LangChain later on. 00:06:51.720 |
And we get nuclear fission is so and so on, splitting atoms, releasing energy. 00:06:59.000 |
You can read through that if you run it yourself or follow the notebook. 00:07:03.720 |
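Continuing from the snippet above, the tokenizer and text-generation pipeline might be set up roughly like this; the generation settings are assumptions, and return_full_text=True is what makes the original question come back along with the answer.

```python
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,   # include the prompt in the output (useful for LangChain later)
    max_new_tokens=512,      # assumed generation settings
    repetition_penalty=1.1,
)

# Quick check of how much GPU memory the quantized model is using.
print(f"GPU memory allocated: {cuda.memory_allocated() / 1024**3:.1f} GB")

res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```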
Now, what I want to do is take this model that we've loaded through Hugging Face, 00:07:08.440 |
or actually this text generation pipeline, and I want to load it into LangChain. 00:07:15.320 |
Because LangChain has a ton of utilities for building conversational chains or conversational agents. 00:07:26.760 |
So, we want to be able to use those utilities, those tools. 00:07:38.680 |
So, we start with a simple LLM chain, which is basically a chain where you have a prompt template here, 00:07:42.680 |
your query or your input goes into that prompt template, 00:07:46.760 |
and it gets passed to whatever LLM you specified, in our case, Falcon 40B instruct. 00:07:52.520 |
Now, when we run that, we should see a pretty similar output to what we got here. 00:08:00.680 |
And we can see, yeah, it looks pretty similar again. 00:08:04.680 |
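Loading that pipeline into LangChain and wrapping it in an LLM chain might look roughly like this, continuing from the snippets above; the prompt template text here is my own assumption.

```python
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Wrap the transformers pipeline so LangChain can treat it as an LLM.
llm = HuggingFacePipeline(pipeline=generate_text)

prompt = PromptTemplate(
    input_variables=["question"],
    template="Question: {question}\n\nAnswer:",
)

llm_chain = LLMChain(llm=llm, prompt=prompt)
print(llm_chain.predict(
    question="Explain to me the difference between nuclear fission and fusion."
))
```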
So, how do we go from that very simple LLM chain into a chatbot? 00:08:13.160 |
So, the main thing we need to add is memory: basically, a track or a record of previous interactions between the user and a chatbot. 00:08:22.200 |
And with that memory and our LLM, we can initialize a conversation chain through LangChain. 00:08:35.560 |
We're using a window memory with k set to 5, which basically just means it's going to remember the previous five interactions between the user and the AI. 00:08:44.280 |
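Continuing from the earlier snippets, a sketch of that conversational chain, assuming LangChain's buffer window memory with k set to 5:

```python
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory

# Keep only the five most recent human/AI exchanges in the prompt.
memory = ConversationBufferWindowMemory(k=5)

chat = ConversationChain(llm=llm, memory=memory, verbose=True)
print(chat.predict(input="Hi, how are you?"))
```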
And now what we can say is, say, okay, hi, how are you? 00:08:49.400 |
And we see, because we've set verbose equal to true, we can see the actual prompt that is being fed into the model. 00:08:57.640 |
So, we can see that there's this primer here, which is 'The following is a friendly conversation 00:09:03.480 |
between a human and an AI', and it has all of this here. 00:09:09.000 |
Basically, instructions for the model to follow. 00:09:15.960 |
So, this is, you know, basically telling the model that it's time for it to start acting as the AI and generate a response. 00:09:30.440 |
And then it moves on to the next line and says 'Human:'. 00:09:34.760 |
So, the reason that it stops here, rather than stopping here, is because we are forcing it to stop. 00:09:49.480 |
We've actually specified a stopping criterion, which is 'Human:'. 00:09:55.320 |
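The stopping criterion itself would be attached to the text-generation pipeline when it's created. Here is a rough sketch using transformers' StoppingCriteria, not necessarily the exact implementation from the video:

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# Token ids for the string we want generation to stop on ("Human:").
stop_token_ids = [
    tokenizer(stop_word, return_tensors="pt")["input_ids"].squeeze(0)
    for stop_word in ["Human:"]
]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the most recent tokens match one of the stop sequences.
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids.to(input_ids.device)).all():
                return True
        return False

# This would be passed in when the pipeline is created, e.g.
# transformers.pipeline(..., stopping_criteria=StoppingCriteriaList([StopOnTokens()]))
```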
Now, that's great, but naturally in our output, we don't want to include that. 00:10:06.040 |
Now, we can post process this, and that's exactly what we're going to do. 00:10:10.760 |
Obviously, there are different ways of doing that. 00:10:12.280 |
You can do it kind of more manually, or we can use the more sort of LangChain way of doing things. 00:10:20.280 |
And we can create what is called an output parser. 00:10:23.960 |
Now, the output parser is basically just another step that happens to the output of whatever the LLM generates. 00:10:34.360 |
So, in this case here, we have the parse method here. 00:10:40.440 |
So, this is what will get called when we use the output parser. 00:10:45.720 |
We strip any white space from around that text. 00:10:50.200 |
We then look at the subwords that we'd like to remove. 00:10:54.840 |
So, we only have two subwords, which is human and AI. 00:10:59.560 |
We say, for those words, if they are at the end of the text, we remove them. 00:11:05.240 |
And then finally, we just strip any more white space from the two ends of our text again. 00:11:14.360 |
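Something along these lines for the parser, assuming LangChain's BaseOutputParser base class; the class name is my own:

```python
from langchain.schema import BaseOutputParser

class CleanupOutputParser(BaseOutputParser):
    def parse(self, text: str) -> str:
        # Strip whitespace from both ends first.
        cleaned = text.strip()
        # If the text ends with a dangling "Human" or "AI" left over from the
        # stopping criteria, cut it off.
        for word in ["Human", "AI"]:
            if cleaned.endswith(word):
                cleaned = cleaned[: -len(word)]
        # Strip whitespace again from both ends.
        return cleaned.strip()

    @property
    def _type(self) -> str:
        return "output_parser"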
And then to actually integrate that into our chatbot, we need to use a prompt template. 00:11:24.280 |
So, we just take what we already had before. 00:11:32.520 |
So, this is the existing or the default prompt template. 00:11:41.720 |
So, I just copied this existing default template and I put it here. 00:11:47.160 |
The reason that I've done that is because I don't need to change it. 00:11:50.120 |
The only thing I do need to add here is I want to add that output parser within the prompt template. 00:11:56.600 |
To me, it seems a bit odd that you would put your output parser into the prompt template, 00:12:03.960 |
because the prompt template is kind of like your input pipeline. 00:12:07.560 |
But that is just how you do things in LangChain. 00:12:11.080 |
So, we pass the output parser into our input prompt template. 00:12:18.120 |
And then what we do is just initialize our conversation chain with that new prompt template. 00:12:28.040 |
Literally, all we've changed here is we've set an output parser within the prompt template. 00:12:32.920 |
So, we reinitialize our chatbot with that output parser. 00:12:39.560 |
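Putting that together, continuing from the snippets above, it might look like this; the template text is LangChain's default ConversationChain prompt as I remember it, with the parser being the only addition:

```python
from langchain.prompts import PromptTemplate

# LangChain's default ConversationChain prompt, copied unchanged.
template = """The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:"""

prompt = PromptTemplate(
    input_variables=["history", "input"],
    template=template,
    output_parser=CleanupOutputParser(),   # the only thing we've added
)

chat = ConversationChain(llm=llm, prompt=prompt, memory=memory, verbose=True)
```

If I remember right, the way to actually apply the attached parser is to call something like chat.predict_and_parse(input=...) rather than plain predict; that method existed on LangChain chains at the time.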
And then what we do is -- so, before we were doing chat predict, like this. 00:12:51.080 |
So, that is basically telling the conversational chain to parse any outputs based on whatever we defined in our output parser. 00:13:04.520 |
We're going to see a similar -- well, the same actual prompt being passed into there. 00:13:13.240 |
So, it's basically a cleaned up version of what we saw before. 00:13:20.680 |
And we also don't have the human appended onto the end there. 00:13:25.880 |
We don't have that messy human string at the end of our text anymore. 00:13:31.080 |
But what I'd like to do now is just continue this conversation and see how Falcon 40B actually performs. 00:13:39.640 |
So, what I'm going to do is ask it to write me a very simple Python script. 00:13:43.480 |
So, I want it to create a Python script that's going to calculate the circumference of a circle. 00:13:50.520 |
So, I'm kind of specifying here that I want the radius to be a parameter. 00:14:05.800 |
But let's go ahead and just try this and see if it actually does run. 00:14:19.080 |
So, naturally, the pi variable there wasn't defined. 00:14:29.720 |
And we see that GPT 3.5 does actually give us, like, functional code straight away. 00:14:38.440 |
Whereas, obviously, with Falcon 40B, we get this error. 00:14:41.560 |
But we can continue with Falcon 40B and just give it a little more prompting to fix the error. 00:14:51.880 |
So, what I've done here is I've said, okay, using this code, I get the error. 00:15:06.200 |
Which is actually the same as what we got from ChatGPT. 00:15:10.200 |
So, then we can actually take that, put it into here. 00:15:13.000 |
I should also include the circumference of circle in there as well. 00:15:26.120 |
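Falcon's exact output isn't reproduced here, but the corrected script you end up with is essentially the standard one, something along these lines:

```python
import math

def circumference_of_circle(radius: float) -> float:
    # Circumference = 2 * pi * r; math.pi fixes the undefined `pi`
    # from the first attempt.
    return 2 * math.pi * radius

print(circumference_of_circle(5))  # 31.41592653589793
```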
It took a little bit more work than what it did for GPT 3.5. 00:15:38.680 |
So, the fact that it got to the answer as quickly as it did, I think is pretty impressive. 00:15:44.920 |
Now, another thing that I'd like some help with here is refactoring some pretty bad code. 00:15:51.560 |
So, what I have written here is thanks for giving us this code here. 00:15:57.480 |
But now I have some code I'd like to refactor. 00:16:04.520 |
So, this will run, but it's obviously kind of messy. 00:16:09.080 |
And it is written in a way that, you know, there's ways that we can improve it naturally. 00:16:18.120 |
Let's also try the same with GPT 3.5 whilst we wait. 00:16:36.840 |
Let's just copy this and we'll compare it to the actual answer of this code here. 00:16:55.560 |
Okay. And we can see that GPT 3.5 running this actually didn't refactor the code correctly. 00:17:06.520 |
GPT 3.5 seems to think that we're counting each number twice. 00:17:10.520 |
So, it mentions here, the code currently sums up all the numbers twice. 00:17:22.040 |
Let's go ahead and see what Falcon 40B managed to do. 00:17:39.080 |
So, I don't think it really changed much, to be honest there. 00:17:42.840 |
Let's try this other bit of code that it suggested and make sure it works. 00:17:51.240 |
Okay. And then this one where it has modified the code to actually allow us to 00:17:57.480 |
specify whether we want just even or odd numbers. 00:17:59.960 |
In this case, we get a different number, but it has actually told us that we are looking 00:18:05.720 |
specifically for even or odd, which is not what we asked it to do. 00:18:09.880 |
But at least it has explained and done this correctly. 00:18:16.840 |
Okay. So, it hasn't really managed to refactor our code. 00:18:22.120 |
But at least it did understand what the code was doing, unlike GPT 3.5 Turbo. 00:18:31.480 |
Now, one thing that I will point out is this does seem to vary depending on the query. 00:18:39.960 |
The first time I ran this, I actually got this refactored function, which is obviously much better. 00:18:48.600 |
So, sometimes Falcon 40B does actually manage to outperform GPT 3.5, at least on that one occasion. 00:18:59.160 |
Which I thought was pretty cool, given that this is like an open-source model that we can run ourselves. 00:19:05.000 |
Now, unfortunately, it seems like it doesn't always manage to get that performance. 00:19:14.040 |
Now, one other thing that I want to take a very quick look at is trying, at least, to use Falcon 40B as a conversational agent. 00:19:22.760 |
Now, we'll just very quickly go through this code. 00:19:27.000 |
It's only different once we get down to here. 00:19:35.800 |
This agent only has access to a single tool, which is a calculator tool. 00:19:42.120 |
And I modified the prompts a little bit to try and get this working. 00:19:47.160 |
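A sketch of how that agent might be initialized, assuming LangChain's built-in llm-math calculator tool and the chat-conversational agent type; the prompt modifications mentioned here aren't shown:

```python
from langchain.agents import initialize_agent, load_tools
from langchain.memory import ConversationBufferWindowMemory

# The calculator is LangChain's built-in "llm-math" tool, which itself uses the LLM.
tools = load_tools(["llm-math"], llm=llm)

# The conversational agent expects its memory under the "chat_history" key.
agent_memory = ConversationBufferWindowMemory(
    memory_key="chat_history", k=5, return_messages=True
)

agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    memory=agent_memory,
    verbose=True,
    max_iterations=2,
    early_stopping_method="generate",
)
```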
But unfortunately, it seemed to really struggle with what we are asking it to do. 00:19:53.880 |
Which is, if you take a look around here, we're asking it to always output in a JSON format. 00:20:07.960 |
And actually, maybe I can take this and just print it out. 00:20:19.800 |
So, we're basically asking, because this is an agent that can use tools, we're asking the 00:20:24.440 |
LLM to generate the responses in this JSON format. 00:20:31.160 |
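The format being asked for is something like the following, where the tool name and inputs are just illustrative:

```python
# Use a tool:
{"action": "Calculator", "action_input": "what is 4.5 multiplied by 2.3"}

# Or respond to the user directly:
{"action": "Final Answer", "action_input": "I'm doing great, thanks for asking!"}
```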
Now, I also tried using basically that output parser approach that you saw earlier to allow it to handle those outputs. 00:20:51.160 |
So, if we take a look at that, it does manage to output a response directly to the user. 00:21:00.280 |
But if we then ask it to use a tool, or where we would expect it to use a calculator tool in this case, it doesn't. 00:21:10.040 |
It kind of goes straight ahead and tries to answer the question directly. 00:21:14.280 |
Which is exactly what our conversational chatbot does. 00:21:19.640 |
So, at least during my experiments with different prompts and everything here, even 00:21:25.400 |
trying few-shot learning, I couldn't get it to work in a good way as a conversational agent. 00:21:34.520 |
Now, there might be ways to get that working. 00:21:38.680 |
But nonetheless, although it isn't quite right now at the level of being used as an 00:21:46.760 |
agent, at least in this way, as a conversational chatbot, it works pretty well. 00:21:54.520 |
In this video, we just wanted to take a look at the Falcon 40B Instruct model, 00:21:59.000 |
the most powerful open-source LLM that we have available to us today, 00:22:05.000 |
and how we can actually use it with a library like LangChain for building chatbots. 00:22:12.920 |
As you can see, there's some weaknesses, but also some strengths of the model. 00:22:16.920 |
And I'm very optimistic that, you know, if we can get this with a 40 billion parameter 00:22:23.000 |
model going forwards, there's probably going to be many more models that are on this same 00:22:29.560 |
scale, and possibly slightly larger scale, that we can still fit on more typical consumer hardware. 00:22:37.000 |
So, that's probably the most exciting part of this for me. 00:22:44.600 |
So, I hope all of this has been interesting and insightful.