MPT-30B Chatbot with LangChain!
Chapters
0:00 MPT-30B Chatbot
0:40 MPT-30B Explained
2:48 Implementing MPT-30B in Python
7:30 Tokenization and Stopping Criteria
9:03 LLM Pipeline in Hugging Face
10:54 Initializing MPT-30B in LangChain
11:25 Building an MPT-30B Chatbot
14:13 Post-Processing Chatbot Responses
17:08 Talking with MPT-30B Chatbot
00:00:00.000 |
So today we have a pretty cool video: we're going to take the new MPT-30B model, 00:00:07.040 |
which is an open source model that has better performance than GPT-3 00:00:16.940 |
We're going to take that model and we are going to use it to build a chatbot 00:00:22.080 |
using HuggingFace for the actual model itself 00:00:26.240 |
and LangChain for the conversational aspect of it. 00:00:34.440 |
What I want to do first is take a look at the MPT-30B model. 00:00:40.940 |
Now, the MPT-30B model, you've probably heard people talking about it already, 00:00:46.240 |
but I want to just do a quick recap on what it is 00:00:50.540 |
and the sort of performance that we can expect to get from it. 00:00:57.140 |
They also did the MPT-7B model that I spoke about in the past, 00:01:02.340 |
and they have some nice visuals in their announcement 00:01:06.040 |
that kind of give us a good grasp of what sort of performance we can expect. 00:01:10.640 |
You see here, we have MPT-7B, which was a decent model. 00:01:16.340 |
From testing it, it's definitely limited, but it was good. 00:01:23.440 |
And then MPT-30B, okay, yes, it's a bit better in many, many ways. 00:01:29.440 |
And I suppose what I think is a bit more interesting 00:01:34.640 |
So on the right over here, we have this current size, 00:01:42.240 |
We see that it's not actually quite as performant 00:01:53.640 |
So it makes sense that Llama would be slightly more performant there. 00:01:59.740 |
but it does actually outperform both in programming ability. 00:02:07.940 |
this could be quite a good model to kind of help us 00:02:14.540 |
So that would be definitely very interesting to try out at some point. 00:02:22.540 |
especially Falcon-40B, which is a fairly large model. 00:02:29.940 |
And what they've done is released a couple of models here. 00:02:36.240 |
and we also have the instruct and chat models 00:02:45.240 |
We're going to be using the chat model, of course. 00:02:57.540 |
you are actually going to need to use Colab Pro 00:03:00.940 |
and you're going to need to use the A100 GPU. 00:03:04.440 |
Now it's fairly expensive to run, just as a pre-warning. 00:03:18.940 |
But we are just running it on a single GPU, which is pretty cool. 00:03:21.840 |
So we're going to need to install most of these. 00:03:25.740 |
Actually, I don't think we need the Wikipedia thing. 00:03:32.840 |
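For reference, a rough version of the install cell might look like the following; the exact package list here is an assumption rather than the notebook's own cell:

```python
# Assumed package list for this setup: transformers + accelerate + bitsandbytes for
# the 8-bit model, einops for MPT's custom model code, and langchain for the chatbot layer.
!pip install -qU transformers accelerate einops langchain bitsandbytes
```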
And what we're going to do is come down to here 00:03:39.340 |
So we're making sure that we're going to be using a GPU. 00:04:04.640 |
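As a minimal sketch, the GPU check could look like this (the `device` variable name is just a convention reused in the later snippets):

```python
import torch

# Quick sanity check that a CUDA GPU (e.g. the A100 runtime) is visible to PyTorch
device = f"cuda:{torch.cuda.current_device()}" if torch.cuda.is_available() else "cpu"
print(device)
```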
And it just gives us a little bit of a description 00:04:10.540 |
But yeah, so we're getting this from HuggingFace. 00:04:15.840 |
There's basically some custom code that needs to be run 00:04:18.640 |
in order to get this model working with HuggingFace. 00:04:30.440 |
We can limit the number of bits that we're using here. 00:04:42.240 |
So we can reduce the precision of the floating point numbers 00:04:48.240 |
in order to reduce the total memory required by the model. 00:04:56.840 |
That would be like the full version of the model. 00:05:02.840 |
But to use 16-bit, we need 80 gigabytes of RAM on our GPU, 00:05:09.940 |
So actually, we are doing this load in 8-bit. 00:05:13.740 |
And this is why we need that bits and bytes library 00:05:18.640 |
So with that, we're actually loading the model in 8-bit precision, 00:05:22.740 |
which means, you know, it's not going to be quite as accurate, 00:05:25.740 |
but it's still, I mean, at least from testing it, 00:05:30.640 |
So it's basically going to be slightly less accurate, 00:05:46.340 |
So max sequence length, I'm going to say is 1024, 00:05:51.540 |
but we can actually, I believe we can go up to this, right? 00:06:04.340 |
maybe you do want that extra context window just to be safe. 00:06:09.940 |
And then what we're going to do is initialize the device, right? 00:06:13.140 |
So the device is going to be our CUDA-enabled GPU. 00:06:19.340 |
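Putting those pieces together, loading the chat model in 8-bit might look roughly like this. The model ID and argument choices are assumptions based on the mosaicml/mpt-30b-chat release, so check the model card for the currently recommended settings:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "mosaicml/mpt-30b-chat"  # assumed Hub ID for the chat variant

# trust_remote_code=True pulls in MosaicML's custom model code from the Hub
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.max_seq_len = 1024  # cap the context window here; the model supports more

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
    load_in_8bit=True,   # requires bitsandbytes; roughly halves memory vs fp16
    device_map="auto",   # let accelerate place the weights on the GPU
)
model.eval()
```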
And this is going to take a fairly long time. 00:06:27.340 |
And yeah, just be ready to wait a little while. 00:06:41.140 |
almost 17 minutes for me to download everything 00:07:01.140 |
So it's probably better if you have, like, a dedicated instance 00:07:09.740 |
you're still going to need to actually load the model, 00:07:23.040 |
So it takes a little while to load everything. 00:07:36.140 |
or transformer-readable tokens as part of this. 00:07:49.240 |
So I've linked those towards the top of the video right now, 00:08:01.640 |
or at least what I've found is that it generates 00:08:05.640 |
All right, so it's going to generate its response as the AI 00:08:18.840 |
So what I'm doing here is setting a list of tokens 00:08:30.340 |
So I'm saying if we see "Human:" or "AI:", 00:08:30.340 |
that means we're probably onto one of the next lines. 00:08:38.240 |
Now, we also need to convert those into tensors. 00:09:06.340 |
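Here is a sketch of that tokenizer and stopping-criteria setup, assuming the tokenizer ships in the same model repo; the class below simply stops generation whenever the model starts writing the next "Human:" or "AI:" turn:

```python
import torch
from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Token id sequences that signal the model has started a new conversational turn
stop_token_ids = [tokenizer(text)["input_ids"] for text in ["Human:", "AI:"]]
# Convert them to tensors on the GPU so they can be compared against generated ids
stop_token_ids = [torch.LongTensor(ids).to(device) for ids in stop_token_ids]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the tail of the generated sequence matches any stop sequence
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```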
is initialize a ton of model generation parameters. 00:09:25.340 |
One other thing actually is the max new tokens. 00:09:45.140 |
And it's very important not to generate too much text, 00:09:51.240 |
because the more tokens you generate, the longer the model is going to take to give you a response. 00:09:55.840 |
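A hedged sketch of the generation pipeline with those parameters; the specific values here are illustrative rather than the notebook's exact settings:

```python
from transformers import pipeline

generate_text = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,             # LangChain expects the prompt echoed back
    stopping_criteria=stopping_criteria,
    do_sample=True,
    temperature=0.1,                   # low temperature keeps output mostly deterministic
    top_p=0.15,
    top_k=0,
    max_new_tokens=128,                # keep responses short so latency stays reasonable
    repetition_penalty=1.1,
)
```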
Okay, now what we're going to do is just confirm. 00:10:20.540 |
where it does this and this and so on and so on. 00:10:22.940 |
And yeah, I have read this a couple of times, 00:10:34.640 |
like we're going to be cutting off 128 tokens. 00:10:42.540 |
So that is not really an issue that I've seen, 00:10:47.040 |
at least when we're actually in that chatbot phase 00:10:59.440 |
and we're going to just try the same thing again, 00:11:12.140 |
So there's going to be a little bit of randomness in there. 00:11:20.840 |
It's either exactly the same or very similar. 00:11:24.740 |
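To confirm everything runs, you can call the pipeline directly a couple of times; with the temperature set this low, the two answers should be identical or very close (the prompt below is just an example):

```python
prompt = "Explain to me the difference between nuclear fission and fusion."

res = generate_text(prompt)
print(res[0]["generated_text"])

# Run it again: with a low sampling temperature the output should barely change
res = generate_text(prompt)
print(res[0]["generated_text"])
```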
And now what I want to do is move on to actually building the chatbot in LangChain. 00:11:37.140 |
it needs that different structure of the prompts. 00:11:44.940 |
So the conversational memory is where you're going to have 00:11:50.940 |
Human, ask a question and so on and so on, right? 00:12:02.740 |
So that's going to modify the initial prompt. 00:12:05.140 |
Another thing that our chatbot is going to have is 00:12:08.740 |
the initial prompt is also going to be explaining 00:12:18.040 |
So I'm going to run the conversation chain from 00:12:23.840 |
And yeah, here we can see that initial prompt. 00:12:30.840 |
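Wiring the pipeline into LangChain looks roughly like the following; the class names are as of the LangChain version around the time of the video, so treat this as a sketch:

```python
from langchain.llms import HuggingFacePipeline
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferWindowMemory

# Wrap the Hugging Face pipeline so LangChain can call it like any other LLM
llm = HuggingFacePipeline(pipeline=generate_text)

chat = ConversationChain(
    llm=llm,
    # keep the last few exchanges in the prompt as conversational memory
    memory=ConversationBufferWindowMemory(k=5),
    verbose=True,
)
print(chat.prompt.template)  # the default "friendly conversation" prompt
```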
Okay, we see the template is the structure of that. 00:12:35.140 |
So, "The following is a friendly conversation between a human and an AI." 00:12:38.040 |
The AI is talkative and provides lots of specific details 00:12:44.040 |
And then you have where we would put in the conversational 00:12:49.340 |
And then this is kind of like the primer to indicate to the model. 00:13:09.340 |
AI is talkative and provides lots of specific details from its context. 00:13:20.940 |
and we're going to modify it to this, which is exactly the same, 00:13:20.940 |
but I've just modified that middle sentence here. 00:13:31.540 |
AI is conversational but concise in its responses without rambling. 00:13:37.840 |
So that just means we're going to get shorter responses that are concise 00:13:44.040 |
but don't go on as much as what I was seeing before. 00:13:51.540 |
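Swapping in that modified sentence can be done by rebuilding the chain from above with a custom prompt template, something like:

```python
from langchain.prompts import PromptTemplate

template = """The following is a friendly conversation between a human and an AI. The AI is conversational but concise in its responses without rambling. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:"""

# Rebuild the conversation chain with the adjusted prompt (llm and memory as defined above)
chat = ConversationChain(
    llm=llm,
    prompt=PromptTemplate(input_variables=["history", "input"], template=template),
    memory=ConversationBufferWindowMemory(k=5),
    verbose=True,
)
```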
the model how it's doing as any conversation should start 00:14:05.540 |
and then we see, okay, I'm just a computer program. 00:14:13.440 |
And then at the end of there, we can also see, 00:14:20.540 |
But because we added the stopping criteria in there, 00:14:28.540 |
Because that was one of the things we were looking for 00:14:39.840 |
because we don't want to be returning that to a user. 00:14:42.640 |
So what we can do is actually just trim that last little bit off. 00:15:04.240 |
This isn't like super nice and clean or anything. 00:15:07.740 |
In fact, it's a bit of a mess, but it just works. 00:15:12.440 |
So there are a few things I'm accommodating for here. 00:15:19.940 |
Another thing I noticed is that sometimes it would also have this, 00:15:30.740 |
as a conversational agent, if I remember correctly. 00:15:37.940 |
So yeah, I basically looked for any of those. 00:15:44.840 |
that we're at the end of conversation as well. 00:15:50.240 |
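A rough illustration of that cleanup; the phrases being stripped are the ones mentioned above, but the function itself is a placeholder rather than the notebook's exact code:

```python
def trim_response(text: str) -> str:
    """Strip leftover turn markers and sign-offs from a raw model response."""
    text = text.strip()
    # Remove a trailing "Human:" or "AI:" that the model started before the
    # stopping criteria kicked in
    for marker in ("Human:", "AI:"):
        if text.endswith(marker):
            text = text[: -len(marker)].strip()
    # Drop the occasional trailing note about being a conversational agent
    cut = text.find("as a conversational agent")
    if cut != -1:
        text = text[:cut].strip()
    return text
```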
I don't know if we necessarily need all of that. 00:15:55.440 |
And we would, I assume, get the same sort of outputs. 00:16:00.540 |
But then when it comes to using this as an agent, 00:16:07.040 |
Okay, so we have that, we've just modified that final message. 00:16:13.540 |
So then we could just return this to our users. 00:16:18.140 |
But yeah, we don't want to run this every time. 00:16:22.140 |
To be honest, we don't want to run this code anyway, 00:16:27.640 |
And what we're going to do is actually just put all of this 00:16:33.640 |
So we're going to say, okay, we have this chatbot function. 00:16:37.240 |
We're going to pass the conversation chain into there 00:16:41.840 |
And what it's going to do is create a response 00:16:55.640 |
And then we're going to do all of the trimming that we do after. 00:17:00.140 |
And then we're going to return that final response. 00:17:07.840 |
and take a look at what we get with this new function 00:17:11.440 |
that just adds that extra step onto the end there. 00:17:15.540 |
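Pulling that together into one helper, a sketch might look like this (the function and variable names are placeholders, not necessarily the notebook's):

```python
def chatbot(chain, query: str) -> str:
    # Run the conversation chain; memory and the prompt template are handled by LangChain
    raw = chain.predict(input=query)
    # Trim stop tokens and sign-offs before returning anything to a user
    return trim_response(raw)

print(chatbot(chat, "Explain the difference between nuclear fission and fusion."))
```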
difference between nuclear fission and fusion. 00:17:17.740 |
You can see our previous interactions in the verbose output here. 00:17:28.840 |
and we're just passing all of this to the model 00:17:43.540 |
because we now have like the prompt instructions earlier on. 00:17:47.740 |
So that has modified how the model is actually answering this. 00:17:54.840 |
but I'm going to ask a question that requires the model 00:17:58.440 |
to have seen the previous interactions within the conversation. 00:18:01.540 |
So I'm going to say, could you explain this like I am five, right? 00:18:06.840 |
Okay, we can see those previous parts of the conversation. 00:18:11.540 |
So I'm just asking for a simpler version of this, 00:18:23.140 |
In simpler terms, nuclear fission is like breaking apart a toy car 00:18:29.240 |
It requires forceful intervention to break down the original structure. 00:18:32.940 |
Meanwhile, nuclear fusion is more like putting together Lego blocks 00:18:39.140 |
You're taking separate elements and joining them together 00:18:46.140 |
but they approach it from opposite directions. 00:18:48.640 |
So I think that's actually a really good way of explaining it, 00:18:52.040 |
like much easier for me to understand than the previous one. 00:18:59.140 |
So with that, we've built our open source chatbot 00:19:12.040 |
like having those stopping criteria in there. 00:19:14.040 |
If you're building this out into a more fully-fledged solution, 00:19:17.740 |
obviously, you're going to need more stopping criteria. 00:19:20.740 |
for sure, you're going to need better trimming or post-processing 00:19:26.240 |
But at a high level, you don't really need too much. 00:17:31.140 |
And the performance is, we can see, it's actually pretty good. 00:19:37.640 |
The context windows that we can use with this model 00:19:44.440 |
but I also saw in some of their documentation 00:19:48.840 |
I'm not quite sure why there's a difference in what I've read. 00:19:52.540 |
Maybe it's the chat model versus the base model. 00:19:56.840 |
But in any case, you can go up to at least 8K,