Today, we're going to be taking a look at what is currently the best performing open-source large language model in the world: the Falcon 40B Instruct model. We're going to look first at how to use it, and we're going to focus on running it on the smallest hardware we can.
So, we're actually going to manage to fit it on a single GPU through quantization. We're also going to look at how to implement it as a chatbot, and how we would go about using the model as a conversational agent. Now, according to Hugging Face's Open LLM Leaderboard, it is currently the best performing open-source model available today.
It's surprisingly even better performing than several 65 billion parameter models, despite only being a 40 billion parameter model, which is pretty impressive. So, in terms of performance, it's several steps above LLaMA 65B, which we can see down here. Now, this is the Instruct version of the model. We also have the base model down here, the Falcon 40B.
We're going to stick with Instruct. Basically, it's been fine-tuned to follow instructions, and we can see from here that the performance of this fine-tuned model is a fair bit higher than the base model. Now, the model was trained by the Technology Innovation Institute (TII) in the UAE. I'm not sure exactly how they've managed to get such a small model to perform as well as it does.
But in either case, it's open-source and we can use it, and that's really what I care about. So, let's take a look at how we would actually use this model. We're going to start by using it as a chatbot, and essentially an AI pair programmer, and see how it performs.
Now, there's a bit of setup here. It's pretty similar to how we set up other open-source large language models, if you've seen my other videos on those. So, we can just start going through that, but I'm going to go very quickly through these first few items, since, again, they're similar to what we have done in previous videos.
We're going to be using the Hugging Face Transformers library, we're going to be quantizing the model using bitsandbytes, and we're going to be turning it into a chatbot, and later a conversational agent, using the LangChain library. So, the first thing we want to do is initialize the model.
We do that with Hugging Face Transformers, and we initialize it in the typical way here. But there is one thing that we do add so that we can actually fit it onto a single GPU. I am using an A100 here, so it's not the smallest GPU, but it is still just one GPU.
The way that we do that is we set this quantization config parameter. The quantization config is essentially telling Transformers how to quantize the model weights, converting the high-memory floating point numbers used within the model into quantized integer values, which are much smaller. By default, I think the data type we'd be using is a 32-bit float, but we are actually going to be using a 4-bit integer value for the majority of the weights within the model. Now, that's not actually all of the weights. The way that this works, and manages to maintain the performance it does, is that it selectively decides which parameters should be quantized and which parameters within the model should not be. So, that is very useful. It means we can make the model size much smaller whilst still getting almost the same performance out of it. We also need to include this device_map="auto" setting in here. What that is going to do is move the model onto the GPU if a GPU is available, which it is.
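For reference, the initialization looks roughly like this. Treat it as a minimal sketch rather than the exact notebook code; the specific bitsandbytes settings (NF4, double quantization, bfloat16 compute) are the usual 4-bit setup and may differ slightly from what I used.

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b-instruct"

# 4-bit quantization via bitsandbytes: weights are stored as 4-bit values,
# compute happens in bfloat16, and outlier parameters are handled separately,
# which is how most of the original performance is preserved.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,         # Falcon shipped custom modelling code at the time
    quantization_config=bnb_config,
    device_map="auto",              # place the quantized weights on the available GPU
)
model.eval()
print(model.device)                 # should report a CUDA device
```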
And I've checked that up here and printed it out down here. So, you should see when you run this that the model has been loaded onto a CUDA device. If you don't see that and you are also running this on Colab, what you can do is go to runtime, change runtime type, and you want to go to hardware accelerator, GPU, and you need to select an A100 there.
That is only included in Colab Pro. Now, initializing the model will actually take a while. I was running this cell for about 10 minutes; it takes a little bit of time, it's a big model. Once you have downloaded it once, within the same session on Colab, all it needs to do is load the model.
So, it doesn't need to download it again. Although if you are using Colab for this and you use Colab like the next day or something like that, it will not be on the same computing instance. And therefore, it will actually need to download the model again. So, I would recommend you either do this locally or you do it on a dedicated instance.
So, we also load the tokenizer. That just translates from human-readable text into transformer-readable tokens. We specify the stopping criteria here. I've spoken about this a lot recently, so I'm not going to go through it again, but there will be a link to a video at the top right now where I do talk about this if you are interested.
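As a quick reminder of the idea, a minimal stopping criteria implementation looks something like this. The exact stop strings and implementation in my notebook differ a little; here I simply stop as soon as the model starts writing the next "Human:" turn, which matches what we'll see later on.

```python
from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList

tokenizer = AutoTokenizer.from_pretrained(model_id)

class StopOnStrings(StoppingCriteria):
    """Stop generation once the sequence ends with any of the given strings."""

    def __init__(self, stops, tokenizer, lookback=4):
        self.stops = stops
        self.tokenizer = tokenizer
        self.lookback = lookback  # only decode the last few tokens for speed

    def __call__(self, input_ids, scores, **kwargs):
        # Decode just the tail of the sequence and check for a stop string.
        tail = self.tokenizer.decode(input_ids[0][-self.lookback:])
        return any(tail.endswith(stop) for stop in self.stops)

stopping_criteria = StoppingCriteriaList([StopOnStrings(["Human:"], tokenizer)])
```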
And then all we're going to do here is initialize the model for text generation. So, we do that. Okay. And we can just confirm this is working. So, first, we're just going to ask a fairly easy but niche question. It's not common knowledge, but it isn't too difficult for a large language model.
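Concretely, the text generation pipeline is set up along these lines. The generation parameters and the exact wording of the question are my reconstruction here rather than the verbatim notebook values.

```python
import transformers

generate_text = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    stopping_criteria=stopping_criteria,
    return_full_text=True,   # LangChain expects the prompt to be echoed back
    max_new_tokens=512,
    repetition_penalty=1.1,  # discourage the model from looping
)

res = generate_text("Explain the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```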
You can also check the amount of GPU RAM that we're using here. It's actually just under 25 gigabytes, so this 40 billion parameter model, with quantization, is actually not using that much space. Okay. You can also see here that we're returning the original question in the output. We actually do that because we're using LangChain later on, and we need to set that for it to work properly. And we get that nuclear fission is such and such, splitting atoms, releasing energy, and so on. You can read through that if you run it yourself or follow the notebook. Now, what I want to do is take this model that we've loaded through Hugging Face, or actually this text generation pipeline, and load it into LangChain.
Because LangChain has a ton of utilities for building conversational chains, conversational chatbots, and also conversational agents. So, we want to be able to use those utilities, those tools, and we load the model into LangChain first. Okay. Very simple. I just want to use an LLMChain initially.
So, that is basically a chain where you have a prompt template, your query or your input goes into that prompt template, and it gets passed to whatever LLM you've specified, in our case Falcon 40B Instruct. Now, when we run that, we should see a pretty similar output to what we got here.
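In code, loading the pipeline into LangChain and wiring up the LLMChain looks roughly like this, continuing from the pipeline above. The bare pass-through prompt template is an assumption on my part.

```python
from langchain.chains import LLMChain
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate

# Wrap the transformers pipeline so LangChain can treat it like any other LLM.
llm = HuggingFacePipeline(pipeline=generate_text)

prompt = PromptTemplate(
    input_variables=["question"],
    template="{question}",   # pass the question straight through, no extra framing
)

llm_chain = LLMChain(llm=llm, prompt=prompt)
print(llm_chain.predict(
    question="Explain the difference between nuclear fission and fusion."
))
```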
Okay. And we can see, yeah, it looks pretty similar again. Okay. So, how do we go from that very simple LLM chain into a chatbot? Well, we need a few things. We need conversational memory. So, basically, a track or a record of previous interactions between the user and a chatbot.
And with that memory and our LLM, we can initialize a conversation chain through LangChain. So, we can run both of those. The k=5 here basically just means it's going to remember the previous five interactions between the user and our chatbot. Okay. And now we can say something like, hi, how are you?
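In code, that looks roughly like this, continuing from the LLM we wrapped above.

```python
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory

# Remember a sliding window of the previous five human/AI exchanges.
memory = ConversationBufferWindowMemory(k=5)

chat = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True,   # print the full prompt on every call so we can inspect it
)

chat.predict(input="Hi, how are you?")
```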
See what comes back. And we see, because we've set verbose equal to true, we can see the actual prompt that is being passed to the conversation chain. So, we can see that there's this primer here, which says the following is a friendly conversation between a human and an AI, and it has all of this here.
Basically, instructions for the model to follow. And then we have the current conversation. We have "Human: Hi, how are you?" and then "AI:". So, this is basically telling the model that it's time to start acting as the AI and generating text. And what we'll get is this. So, I'm doing well.
Thank you for asking. How are you? Okay. And then it moves on to the next line and says "Human:". So, the reason that it stops here, rather than carrying on, is because we are forcing it to stop there. If you remember, we've actually specified a stopping criterion, which is "Human:". Now, that's great, but naturally we don't want to include that in our output. We just want this bit here. Now, we can post-process this, and that's exactly what we're going to do. Obviously, there are different ways of doing that.
You can do it more manually, or we can use the more LangChain way of doing things and create what is called an output parser. Now, the output parser is basically just another step that is applied to the output of whatever chain you're using in LangChain.
So, in this case, we have the parse method here. This is what gets called when we use the output parser. We just take our text and strip any whitespace from around it. We then look at the keywords that we'd like to remove.
So, we only have two keywords, which are "Human" and "AI". We go through those and say, for those words, if they are at the end of the text, remove them. And then finally, we just strip any remaining whitespace from the two ends of our text.
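Written out, the output parser is a small class along these lines; it's a sketch, and the exact keyword handling in my notebook may differ slightly.

```python
from langchain.schema import BaseOutputParser

class OutputParser(BaseOutputParser):
    """Clean up raw generations: trim whitespace and drop a dangling
    'Human' / 'AI' speaker tag left behind by the stopping criteria."""

    def parse(self, text: str) -> str:
        text = text.strip()
        # The stopping criteria halt generation at the start of the next turn,
        # so the tail of the text may still contain the speaker tag itself.
        for word in ["Human:", "AI:", "Human", "AI"]:
            if text.endswith(word):
                text = text[: -len(word)]
        return text.strip()
```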
So, we initialize that. And then to actually integrate that into our chatbot, we need to use a prompt template. Okay? So, we just take what we already took before. So, we have -- let me print it out here. So, this is the existing or the default prompt template. I think I -- did I change anything?
Nope. So, I just copied this existing default template and put it here. The reason I've done that is because I don't need to change it. The only thing I do need to do here is add that output parser within the prompt template. To me, it seems a bit odd that you would put your output parser into the prompt template, because the prompt template is kind of like your input pipeline.
But that is just how you do things in LangChain. So, we pass the output parser into our prompt template. And then what we do is just initialize our conversation chain with that new prompt template here. Everything else is the same. Literally, all we've changed here is that we've set an output parser within the prompt template.
So, we reinitialize our chatbot with that output parser. And then, whereas before we were calling chat.predict, like this, now we are going to call predict_and_parse. That is basically telling the conversational chain to parse any outputs using whatever we have in our output parser.
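Putting that together looks something like this. The template text is the LangChain default conversation prompt copied verbatim, as described above; everything else continues from the earlier snippets.

```python
from langchain.prompts import PromptTemplate

chat_template = """The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:"""

prompt = PromptTemplate(
    input_variables=["history", "input"],
    template=chat_template,
    output_parser=OutputParser(),   # the only change from the default setup
)

chat = ConversationChain(
    llm=llm,
    memory=ConversationBufferWindowMemory(k=5),
    prompt=prompt,
    verbose=True,
)

# predict_and_parse runs the chain, then pushes the raw output through
# the parser attached to the prompt template.
chat.predict_and_parse(input="Hi, how are you?")
```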
So, we're going to ask the same question again: hi, how are you? We're going to see the same actual prompt being passed in there. And then we get this. So, it's basically a cleaned-up version of what we saw before. There's no whitespace on either end.
And we also don't have the "Human" appended onto the end there. Okay. So, that's cool. We don't have that messy "Human" string at the end of our text anymore. But what I'd like to do now is just continue this conversation and see how Falcon 40B actually performs as a chatbot.
So, what I'm going to do is ask it to write me a very simple Python script. So, I want it to create a Python script that's going to calculate the circumference of a circle given a radius r. So, I'm kind of specifying here that I want this to be a parameter.
Okay. We get this response. We can go ahead and print it out. Okay. So, it looks kind of good. But let's go ahead and just try this and see if it actually does run. So, I'm going to just copy this and this. And see what we get. Okay. So, naturally, the pi variable there wasn't defined.
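To make the error concrete, here's a sketch of the problem and the eventual fix. The function name and structure are my own; the code the model actually generated will differ a little.

```python
# Roughly what the first attempt looked like: `pi` is never defined,
# so calling this raises NameError: name 'pi' is not defined.
def circumference_of_circle(r):
    return 2 * pi * r

# The fix is to use the constant from the standard library instead.
import math

def circumference_of_circle(r):
    return 2 * math.pi * r

print(circumference_of_circle(5))  # 31.41592653589793
```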
You can also compare this to GPT 3.5. So, let's ask the same question here. And we see that GPT 3.5 does actually give us, like, functional code straight away. We have that math.pi. Whereas, obviously, with Falcon 40B, we get this error. But we can continue with this with Falcon 40B and just give it a little more prompting to see if it can correct its own error.
So, what I've done here is I've said, okay, using this code, I get this error. And I've just pasted in the NameError: name 'pi' is not defined. How can I fix this? We run that, come down to here, and we get this, which is actually the same as what we got from ChatGPT.
So, then we can actually take that, put it into here. I should also include the circumference of circle in there as well. And let's see if that runs. Okay. So, we do get the correct answer. It took a little bit more work than what it did for GPT 3.5.
But this is an open-source model, and it's also running on a single GPU. So, the fact that it got to the answer as quickly as it did, I think, is pretty impressive. Now, another thing that I'd like some help with here is refactoring some pretty bad code. So, what I have written here is: thanks for giving us this code, but now I have some code I'd like to refactor, can you help? The code that we have is this. Okay. So, this will run, but it's obviously kind of messy, and it's written in a way that we can naturally improve. So, can the model improve it?
Let's try. Let's also try the same with GPT 3.5 whilst we wait, and see what we get. Okay. And we get this. Let's just copy it and we'll compare it to the actual output of the original code here. Let's run this and this. Okay. And we can see that, running this, GPT 3.5 actually didn't refactor the code correctly.
So, the code is just summing the numbers. GPT 3.5 seems to think that we're counting each number twice. So, it mentions here, the code currently sums up all the numbers twice. So, didn't quite get it correct. Let's go ahead and see what Falcon 40B managed to do. So, we have this here.
Let's just try it. I'll bring it to here. Okay. So, we get 55 again. So, I don't think it really changed much, to be honest there. Let's try this other bit of code that it suggested and make sure it works. Okay. And then this one where it has modified the code to actually allow us to specify whether we want just even or odd numbers.
In this case, we get a different number, but it has actually told us that we are looking specifically for even or odd, which is not what we asked it to do. But at least it has explained and done this correctly. It's understood the original code. Okay. So, it hasn't really managed to refactor our code.
In the way that I'd like, anyway. But at least it did understand what the code was doing, unlike GPT 3.5 Turbo. So, that at least is a good thing. Now, one thing that I will point out is that this does seem to vary from run to run.
The first time I ran this, I actually got this refactored function, which is obviously kind of what we were looking for. So, sometimes Falcon 40B does actually manage to outperform GPT 3.5, at least on that one code refactoring question, which I thought was pretty cool, given that this is an open-source model that we can fit on a single GPU.
Now, unfortunately, it seems like it doesn't always manage to get that performance. But it's still pretty impressive. Now, one other thing that I want to take a very quick look at is trying, at least, to use this as a conversational agent. Now, we'll just very quickly go through this code.
The setup, again, is exactly the same. It's only different once we get down to here. Okay. So, we initialize an agent, and we do that in a slightly different way. This agent only has access to a single tool, which is a calculator tool. The agent is initialized here, and I modified the prompts a little bit to try and get this working.
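The basic agent initialization is along these lines; again, a sketch continuing from the earlier snippets, and the prompt modifications I mention aren't reproduced here.

```python
from langchain.agents import initialize_agent, load_tools
from langchain.memory import ConversationBufferWindowMemory

# A single calculator tool, backed by the same LLM.
tools = load_tools(["llm-math"], llm=llm)

agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    memory=ConversationBufferWindowMemory(
        k=5, memory_key="chat_history", return_messages=True
    ),
    verbose=True,
    max_iterations=3,
    early_stopping_method="generate",
)

# The agent prompt asks the model to reply with a JSON blob, e.g.
# {"action": "Calculator", "action_input": "4 ** 0.5"} to use a tool, or
# {"action": "Final Answer", "action_input": "..."} to answer the human.
agent("What is 4 to the power of 0.5?")
```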
But unfortunately, it seemed to really struggle with what we are asking it to do. Which is, if you can take a look around here, we're asking it to always output in a JSON format, which you can just about see here. And actually, maybe I can take this and just print it out.
So, printing that out, we get this. Because this is an agent that can use tools, we're asking the LLM to generate its responses in this JSON format, and it really seems to struggle with that. Now, I also tried using that output parser approach that you saw earlier to allow it to output final answers.
So, answers directly to the human. And using that, it was able to do so. We can see my testing here. If we take a look at that, it does manage to output a response directly to the user, like we did with the chatbot. But if we then ask it to use a tool, where we would expect it to use the calculator tool in this scenario, it just doesn't. It goes straight ahead and tries to answer directly, which is exactly what our conversational chatbot, the ConversationChain from LangChain, does. So, during my experiments with different prompts here, where I even tried few-shot learning, I couldn't get it to work well as a conversational agent.
Now, there might be ways to get that working. I just couldn't seem to find them. But nonetheless, although it isn't quite at the level of being used as an agent right now, at least in this way, it works pretty well as a conversational chatbot. Now, that's it for this video.
I just wanted to take a look at the Falcon 40B Instruct model, as the most powerful open-source LLM that we have available to us today, and at how we can actually use it with a library like LangChain for building chatbots. As you can see, the model has some weaknesses, but also some strengths.
And I'm very optimistic that, you know, if we can get this with a 40 billion parameter model, then going forwards there are probably going to be many more models at this sort of scale, and possibly slightly larger, that we can still fit on more typical consumer GPUs. So, that's probably the most exciting part of this for me.
But for now, that's it for this video. So, I hope all of this has been interesting and insightful. Thank you very much for watching. And I will see you again in the next one. Bye.