MPT-30B Chatbot with LangChain!


Chapters

0:00 MPT-30B Chatbot
0:40 MPT-30B Explained
2:48 Implementing MPT-30B in Python
7:30 Tokenization and Stopping Criteria
9:03 LLM Pipeline in Hugging Face
10:54 Initializing MPT-30B in LangChain
11:25 Building an MPT-30B Chatbot
14:13 Post-Processing Chatbot Responses
17:08 Talking with MPT-30B Chatbot

Transcript

So today we have a pretty cool video: we're going to take the new MPT-30B model, which is an open source model that has better performance than GPT-3 and can be run on a single GPU, and we're going to use it to build a chatbot, using Hugging Face for the model itself and LangChain for the conversational aspect of it.

So I think this is going to be pretty fun. Let's jump straight into it. What I want to do first is take a look at the MPT-30B model. Now, you've probably heard people talking about MPT-30B already, but I want to just do a quick recap on what it is and the sort of performance we can expect to get from it.

So, MPT-30B is from MosaicML. They also did the MPT-7B model that I spoke about in the past, and they have some nice visuals in their announcement that give us a good grasp of the sort of performance we can expect. You see here, we have MPT-7B, which was a decent model.

From testing it, it's definitely limited, but it was good, particularly for its size. And then MPT-30B, okay, yes, it's a bit better in many, many ways. What I think is a bit more interesting is when we compare it to these other models. So on the right over here, we have the current size class, the 30B-plus models.

So the red is MPT-30B. We see that it's not actually quite as performant as Falcon-40B and LLaMA-30B, although where it says LLaMA-30B, I think it's actually 33 billion parameters, so it makes sense that LLaMA would be slightly more performant there. It's not quite as performant as those, but it does actually outperform both in programming ability.

So in terms of programming ability, this could be quite a good model to help us as almost like an AI code assistant. So that would be definitely very interesting to try out at some point. But then we can see on the other benchmarks, it's pretty close to these other models, especially Falcon-40B, which is a fairly large model.

So cool to see. And what they've done is released a couple of models here. So we have MPT-30B, which is the base model, and we also have the Instruct and Chat models, which are fine-tuned to follow instructions or to chat, respectively. We're going to be using the chat model, of course.

So let's jump into how we actually do this. So I have this notebook on Colab. So to run this, at least on Colab, you are actually going to need to use Colab Pro and you're going to need to use the A100 GPU. Now it's fairly expensive to run, just as a pre-warning.

Or if you can get an A100 elsewhere, then go ahead and do that as well. But you do need a fairly beefy GPU here. But we are just running it on a single GPU, which is pretty cool. So we're going to need to install most of these. Actually, I don't think we need the Wikipedia thing.

Do we need the rest? Yes. So the rest of these we do need. So we can pip install those. And what we're going to do is come down to here to initialize our model. So let's explain this. So we're making sure that we're going to be using a GPU.
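For reference, the install cell in a Colab notebook for this would look roughly like the following. The exact package list and versions aren't spelled out in the video, so treat this as an approximation: transformers, accelerate and bitsandbytes for the model, einops for MPT's custom modeling code, and langchain for the chat side.

```python
# Approximate install cell for Colab; the exact packages/versions in the notebook may differ.
!pip install -qU transformers accelerate einops langchain bitsandbytes
```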

So we need a CUDA-enabled GPU here. So we set that with device. And then we initialize the model. We're initializing it from HuggingFace. So if we took this little string here, we go to HuggingFace.co/that, we find the model page for this model. And it just gives us a little bit of a description about what the model actually is.

But yeah, so we're getting this from Hugging Face. To run it, there's basically some custom code that needs to be executed in order to get this model working with the transformers library. So in order to allow that code to run, you have to set trust_remote_code to True.

And then we come to here. So we have these two lines here, which is where we can limit the precision of the numbers we're using in this model in order to reduce the total memory it requires. Actually, what we end up using is just 8-bit, which might be a little confusing, but I'll explain in a moment.

Now, if we use bfloat16, I mean, we could use 32-bit. That would be like the full version of the model. Or we can use 16-bit. But to use 16-bit, we need 80 gigabytes of RAM on our GPU, which we don't have. So actually, we are doing this load in 8-bit.

And this is why we need that bitsandbytes library up at the top here. So with that, we're actually loading the model in 8-bit precision, which means, you know, it's not going to be quite as accurate, but it's still, I mean, at least from testing it, the performance is still very good.

So it's going to be a slightly less accurate, but much smaller, model thanks to this. So we load in 8-bit. We set the max sequence length (ignore that note, that was just me making some notes). So the max sequence length, I'm going to say, is 1024, but I believe we can actually go higher than that, right?

What I'll do is go up to 8192, because if you have a lot of interactions in a conversation, maybe you do want that extra context window just to be safe. And then what we're going to do is initialize the device, right?

So the device is going to be our CUDA-enabled GPU. Okay, so I'm going to run that. Yeah. And this is going to take a fairly long time. It needs to download the model. It needs to load everything. And yeah, just be ready to wait a little while. I'm going to fast forward.
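To make that concrete, here is a minimal sketch of the initialization step as described: picking the CUDA device, loading mosaicml/mpt-30b-chat with trust_remote_code, loading the weights in 8-bit, and bumping up max_seq_len. The exact arguments used in the notebook may differ slightly from this.

```python
import torch
import transformers

# Make sure we're on a CUDA-enabled GPU (an A100 on Colab Pro in the video).
device = f"cuda:{torch.cuda.current_device()}" if torch.cuda.is_available() else "cpu"

model_id = "mosaicml/mpt-30b-chat"

# MPT ships custom modeling code on the Hub, so trust_remote_code=True is required.
config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.max_seq_len = 8192  # extend the context window for longer conversations

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
    load_in_8bit=True,   # 8-bit precision so the model fits on a single A100 (needs bitsandbytes)
    device_map="auto",   # needs accelerate; places the layers onto the GPU
)
model.eval()
```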

So I'll see you in a moment. Okay, so we've just finished. That took a nice 16 minutes, almost 17 minutes for me to download everything and then load the model. Obviously, once you've downloaded it once, you don't need to download the model again. That being said, when you're on Colab, once you download it once, as soon as your instance is reset, you are going to need to download it again.

So it's probably better if you have like a set instance that you can run this on. But even when you're just running this model and you've already loaded it, you're still going to need to actually load the model, right, from disk. And that took just over four minutes. So the fact is, it's just a huge model.

So it takes a little while to load everything. But that's done now, thankfully, so we can move on to the rest. So the tokenizer is the thing that translates human-readable plain text into machine- or transformer-readable tokens. And this is something I've spoken about more in the last MPT-7B video and also the OpenLLaMA video.

So I've linked those towards the top of the video, if you want more information. But essentially, when we're chatting with the model, what I've found is that it generates the next step of the conversation. So it's going to generate its response as the AI, and then it's going to start generating what a human would say after that, and so on, because it's text generation.

It's basically trying to predict what is going to happen next. So what I'm doing here is setting a list of tokens where, if we see them, we tell the model, "Okay, you're done. Stop generating stuff, please." So I'm saying if we see "Human:" or "AI:", that means we're probably onto one of the next turns.

So at that point, we're going to stop. Now, we also need to convert those into tensors, so we do that here. And then we create this StopOnTokens class. Basically, it's just going to check whether the last few generated tokens are equal to one of those combinations.
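A sketch of that tokenizer and stopping-criteria setup might look like this. The tokenizer ID and the exact shape of the class are assumptions based on the description (MPT-30B uses the GPT-NeoX-20B tokenizer), and `device` comes from the earlier initialization step.

```python
import torch
from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList

# MPT-30B uses the EleutherAI GPT-NeoX-20B tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Token sequences that mean the model has started the next turn of the conversation.
stop_token_ids = [tokenizer(x)["input_ids"] for x in ["Human:", "AI:"]]
stop_token_ids = [torch.LongTensor(ids).to(device) for ids in stop_token_ids]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the most recently generated tokens match one of the stop sequences.
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```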

If so, please stop. Okay, and then what we do is initialize a ton of model generation parameters. So we're doing this for text generation. We need to return full text for line chain. We have our stopping criteria and we have a set of things here. So temperature, you can set that to one to be more random, zero to be less random.

And there are other things in there. One other important one is max new tokens, i.e. how many tokens the model can generate before it stops. So you want to set this high enough that you're not cutting off responses, but low enough that you're not generating too much text.

And it's very important to not generate too much text because it actually takes longer. So the higher this number is, the longer the model is going to take to give you a response, which we don't really want. Okay, now what we're going to do is just confirm. Okay, is this working?
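Before that quick check, for reference, here is roughly what the pipeline setup with those parameters could look like. The temperature of 0.1 and the 128-token cap come from the video; the other settings here are just plausible placeholders and not confirmed by it.

```python
generate_text = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,             # LangChain expects the full text back
    stopping_criteria=stopping_criteria,
    do_sample=True,                    # sampling on so temperature actually has an effect
    temperature=0.1,                   # closer to 0.0 = less random
    max_new_tokens=128,                # cap on response length; higher = slower generation
    repetition_penalty=1.1,            # assumed; discourages the model from repeating itself
)
```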

Very quick example here. You'll recognize this if you watched the other MPT-7B video. So right now, this isn't a chatbot, it's just generating text. We'll get onto the chatbot stuff soon. Okay, we can see we return the original text, that's our question at the top there, and then it says fission is a process where it does this and this, and so on.
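For reference, that quick test is just a direct call to the pipeline, something like the following (the question wording is the one used in the video):

```python
# Plain text generation, no chat formatting yet.
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```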

And yeah, I have read this a couple of times and it seems relatively accurate. It does just stop here, but that's because we're not really doing any prompt engineering and we're cutting off at 128 tokens. When it comes to chat, we don't really want long responses anyway.

So that is not really an issue that I've seen, at least when we're actually in that chatbot phase or chatbot mode of the model. Okay, so now what we're going to do is initialize these things with LangChain and we're going to just try the same thing again, make sure everything's working.

Explain to me the difference between nuclear fission and fusion. And we should get pretty much the same response. We have a temperature of 0.1, so there's going to be a little bit of randomness in there, so it probably won't be exactly the same, but it should be similar.
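A minimal sketch of that LangChain wrapper, assuming the langchain.llms import path that was current at the time of the video:

```python
from langchain.llms import HuggingFacePipeline

# Wrap the Hugging Face pipeline so LangChain can use it as an LLM.
llm = HuggingFacePipeline(pipeline=generate_text)

# Same question as before, now routed through LangChain.
print(llm("Explain to me the difference between nuclear fission and fusion."))
```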

So we get that. It's either exactly the same or very similar. That's good. And now what I want to do is move on to how we actually take this model and turn it into a chatbot. So what does a chatbot need? It needs conversational memory, and it needs a different structure for the prompts.

So it's going to be going in a chat format. The conversational memory is where you're going to have something like: Human: how are you? AI: I'm good, thank you. Human asks a question, and so on, right? It's basically a chat log. So that's our memory.

We implement that here, and that's going to modify the initial prompt. Another thing our chatbot is going to have is an initial prompt that explains what this is. So actually we can see that. Let me run this; I'm going to run the ConversationChain from LangChain.
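A minimal version of that might look like the following. The video doesn't show which memory class is used, so the ConversationBufferMemory here is an assumption; any of LangChain's conversational memory classes would slot in the same way.

```python
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

# The memory object keeps the running chat log that gets injected into the prompt.
chat_chain = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory(),
    verbose=True,  # print the full prompt on every call so we can see what's going on
)

# Inspect the default prompt template with its {history} and {input} slots.
print(chat_chain.prompt.template)
```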

And yeah, here we can see that initial prompt. So the initial prompt is here. Okay, we see the template, the structure of that. So: the following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context, and so on.

And then you have where we would put in the conversational memory, then our new input, and then this is kind of like the primer to indicate to the model that it needs to start generating a response. Okay, cool. Now this bit here, this initial part of the prompt, is encouraging the model to talk a lot, which I don't actually want, right?

So "the AI is talkative and provides lots of specific details from its context": I don't really want that. I want it to be more precise. So I'm going to modify it a little bit. Okay, so we just go to the chat prompt template and we're going to modify it to this, which is exactly the same, but I've just changed that middle sentence here.
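In code, that's just overwriting the template string on the chain, roughly like this (the wording below is approximated from what's shown in the video):

```python
# Swap the "talkative" sentence for a "concise" one; {history} and {input} stay as-is.
chat_chain.prompt.template = """The following is a friendly conversation between a human and an AI. The AI is conversational but concise in its responses without rambling. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:"""
```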

So that is now this: the AI is conversational but concise in its responses without rambling. Okay, so that just means we're going to get shorter responses that are to the point and don't go on as much as what I was seeing before. So I'm going to start by just asking the model how it's doing, as any conversation should start, and we can see here.

So we've got verbose mode equal to true. That's just so we can see what's going on. So we have the full prompt and then we see, okay, I'm just a computer program. I don't have feelings like humans do. How can I assist you today? Right? And then at the end of there, we can also see, oh, it's done that human thing, right?

It's continued the conversation. But because we added the stopping criteria in there, it stops as soon as it gets to human, right? Because that was one of the things we were looking for within the stopping criteria. So at that point it stops. But we still have it in the response, which is a little bit annoying because we don't want to be returning that to a user.

So what we can do is actually just trim that last little bit off. So we can go into our chat memory. Okay, so we have chat memory here. Last message is what we just saw. I'm just a computer program. Then we have human at the end. And what we can do is just remove it, right?

This isn't super nice and clean or anything. In fact, it's a bit of a mess, but it works. So there are a few things I'm accommodating for here: if it has "Human:", or possibly "AI:". Another thing I noticed is that sometimes it would also have these square brackets.

Why was that? I remember seeing that. I think that might be when you're using this as a conversational agent, if I remember correctly, although I could be wrong. It doesn't seem to be an issue in this case. So yeah, I basically looked for any of those. I also looked for the double newlines, because that kind of indicated that we're at the end of the conversation as well.

Yeah, so we added a few things in there. I don't know if we necessarily need all of that. We could probably remove this. We could probably remove this as well. And we would, I assume, get the same sort of outputs. But then when it comes to using this as an agent, you might want to keep those things in.

Okay, so we have that, we've just modified that final message. So we fixed that. So then we could just return this to our users. But yeah, we don't want to run this every time. To be honest, we don't want to run this code anyway, but let's just stick with it.
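As a rough sketch of that post-processing step, something like the helper below would do it. The function name and exact marker list are my own reconstruction from the description: cut the last AI message at the first sign that the model started writing the next turn.

```python
def trim_last_message(chain) -> str:
    # Grab the most recent AI message from the chain's conversational memory.
    last = chain.memory.chat_memory.messages[-1]
    text = last.content
    # Cut at the first marker that suggests the model started hallucinating the next turn.
    for marker in ["Human:", "AI:", "[]", "\n\n"]:
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    # Write the cleaned text back into memory so future prompts stay tidy too.
    last.content = text.strip()
    return last.content
```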

And what we're going to do is actually just put all of this into a function, right? So we're going to say, okay, we have this chat function. We're going to pass the conversation chain into it and we're also going to pass our query. And what it's going to do is create a response like we did before.

It was up here; so we're just running the same thing, but inside the function. Then we're going to do all of the trimming afterwards, and then return that final response. So we run this and let's take a look at what we get with this new function that just adds that extra step onto the end.
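Putting it together, the function is essentially "run the chain, then trim the last message". Something along these lines, with my own naming; the usage call mirrors the question asked next in the video:

```python
def chat(chain: ConversationChain, query: str) -> str:
    # Generate the next AI turn, then clean up any trailing turn markers.
    chain.predict(input=query)
    return trim_last_message(chain)

# Usage: follow-up questions just reuse the same function and chain.
print(chat(chat_chain, "Explain to me the difference between nuclear fission and fusion."))
```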

Okay, so I'm going to say: explain to me the difference between nuclear fission and fusion. You can see our previous interactions in the verbose output here. So: hi, how are you? I'm just a computer program, and so on. All right, so we have that conversation log and we're just passing all of it to the model and saying, you know, please continue.

Okay, so it does continue and it explains to us what these are, right? We can see that this is a different response from what we got earlier, because we now have the prompt instructions at the start, so that has modified how the model is actually answering. And then what I'm going to do is continue the conversation, but ask a question that requires the model to have seen the previous interactions within the conversation.

So I'm going to say, could you explain this like I am five, right? So we run that. Okay, we can see those previous parts of the conversation. So I'm just asking for a simpler version of this, something that I can understand more easily. And let's see what we get. And it's like, okay, sure.

In simpler terms, nuclear fission is like breaking apart a toy car to get the pieces inside. It requires forceful intervention to break down the original structure. Meanwhile, nuclear fusion is more like putting together Lego blocks to make something new. You're taking separate elements and joining them together to create something different than before.

Both processes release energy, but they approach it from opposite directions. So I think that's actually a really good way of explaining it, like much easier for me to understand than the previous one. So I think that that's pretty cool. So with that, we've built our open source chatbot using MPT-30B.

Performance, as you can see, is pretty good. We need to make some accommodations, like having those stopping criteria in there. If you're building this out into a more fully fledged solution, obviously you're going to need more stopping criteria, and you're definitely going to need better trimming or post-processing than what I did there.

But at a high level, you don't really need too much, and the performance, as we can see, is actually pretty good. So that's really cool. The context window that we can use with this model is pretty large as well. I read that it was about 8K, but I also saw in some of their documentation that it goes up to 16K.

I'm not quite sure why there's a difference in what I've read. Maybe it's the chat model versus the base model. But in any case, you can go up to at least 8K, which is on par with the original GPT-4, which is pretty good. Anyway, I hope this has all been useful.

I'm going to leave it there for this video. So thank you very much for watching and I will see you again in the next one. Bye.