
MPT-30B Chatbot with LangChain!


Chapters

0:00 MPT-30B Chatbot
0:40 MPT-30B Explained
2:48 Implementing MPT-30B in Python
7:30 Tokenization and Stopping Criteria
9:03 LLM Pipeline in Hugging Face
10:54 Initializing MPT-30B in LangChain
11:25 Building an MPT-30B Chatbot
14:13 Post-Processing Chatbot Responses
17:08 Talking with MPT-30B Chatbot

Whisper Transcript

00:00:00.000 | So today we have a pretty cool video: we're going to take the new MPT-30B model,
00:00:07.040 | which is an open source model that has better performance than GPT-3
00:00:13.640 | and can be run on a single GPU.
00:00:16.940 | We're going to take that model and we are going to use it to build a chatbot
00:00:22.080 | using HuggingFace for the actual model itself
00:00:26.240 | and LangChain for the conversational aspect of it.
00:00:30.540 | So I think this is going to be pretty fun.
00:00:32.440 | Let's jump straight into it.
00:00:34.440 | What I want to first do is take a look at the MPT-30B model.
00:00:40.940 | Now, the MPT-30B model, you've probably heard people talking about it already,
00:00:46.240 | but I want to just do a quick recap on what it is
00:00:50.540 | and the sort of performance that we can expect to get from it.
00:00:53.040 | So, MPT-30B is from MosaicML.
00:00:57.140 | They also did the MPT-7B model that I spoke about in the past,
00:01:02.340 | and they have some nice visuals in their announcement
00:01:06.040 | that kind of give us a good grasp of what sort of performance we can expect.
00:01:10.640 | You see here, we have MPT-7B, which was a decent model.
00:01:16.340 | From testing it, it's definitely limited, but it was good,
00:01:19.840 | particularly for its size.
00:01:29.440 | And then MPT-30B, okay, yes, it's a bit better in many, many ways.
00:01:29.440 | And I suppose what I think is a bit more interesting
00:01:32.940 | is when we compare it to these other models.
00:01:34.640 | So on the right over here, we have the current size class,
00:01:37.940 | like the 30B-plus size of models.
00:01:40.340 | So the red is MPT-30B.
00:01:42.240 | We see that it's not actually quite as performant
00:01:45.540 | as Falcon-40B and LLaMA-30B,
00:01:48.740 | even though when it says LLaMA-30B,
00:01:51.640 | I think it's actually 33 billion parameters.
00:01:53.640 | So it makes sense that LLaMA would be slightly more performant there.
00:01:57.340 | It's not quite as performant as those,
00:01:59.740 | but it does actually outperform both in programming ability.
00:02:05.540 | So in terms of programming ability,
00:02:07.940 | this could be quite a good model to kind of help us
00:02:11.440 | as almost like an AI code assistant.
00:02:14.540 | So that would be definitely very interesting to try out at some point.
00:02:19.440 | But then we can see on the other things,
00:02:20.740 | it's pretty close to these other models,
00:02:22.540 | especially Falcon-40B, which is a fairly large model.
00:02:27.940 | So cool to see.
00:02:29.940 | And what they've done is released a couple of models here.
00:02:33.340 | So we have MPT-30B, which is the base model,
00:02:36.240 | and we also have the Instruct and Chat models,
00:02:39.340 | which are fine-tuned to follow instructions
00:02:42.340 | or fine-tuned for chat, respectively.
00:02:45.240 | We're going to be using the chat model, of course.
00:02:48.640 | So let's jump into how we actually do this.
00:02:50.940 | So I have this notebook on Colab.
00:02:53.940 | So to run this, at least on Colab,
00:02:57.540 | you are actually going to need to use Colab Pro
00:03:00.940 | and you're going to need to use the A100 GPU.
00:03:04.440 | Now it's fairly expensive to run, just as a pre-warning.
00:03:08.140 | Or if you can get an A100 elsewhere,
00:03:11.840 | then go ahead and do that as well.
00:03:14.640 | But you do need a fairly beefy GPU here.
00:03:18.940 | But we are just running it on a single GPU, which is pretty cool.
00:03:21.840 | So we're going to need to install most of these.
00:03:25.740 | Actually, I don't think we need the Wikipedia thing.
00:03:27.840 | Do we need the rest? Yes.
00:03:29.840 | So the rest of these we do need.
00:03:31.240 | So we can pip install those.
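
As a rough sketch, the install cell likely looks something like the following. The exact package list and versions come from the notebook; bitsandbytes and langchain are mentioned explicitly in the video, while einops and accelerate are assumptions based on what MPT's custom code typically needs:

```python
# Hypothetical install cell -- exact packages/versions come from the notebook;
# einops and accelerate are assumptions for running MPT's custom model code.
!pip install -qU transformers accelerate einops langchain bitsandbytes
```
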
00:03:32.840 | And what we're going to do is come down to here
00:03:36.240 | to initialize our model.
00:03:37.640 | So let's explain this.
00:03:39.340 | So we're making sure that we're going to be using a GPU.
00:03:44.440 | So we need a CUDA-enabled GPU here.
00:03:47.940 | So we set that with device.
00:03:49.240 | And then we initialize the model.
00:03:50.640 | We're initializing it from HuggingFace.
00:03:52.740 | So if we took this little string here
00:03:55.840 | and went to huggingface.co/ followed by that string,
00:04:01.240 | we'd find the model page for this model.
00:04:04.640 | And it just gives us a little bit of a description
00:04:08.140 | about what the model actually is.
00:04:10.540 | But yeah, so we're getting this from HuggingFace.
00:04:13.240 | To run this, there's basically
00:04:15.840 | some custom code that needs to be run
00:04:18.640 | in order to get this model working with HuggingFace.
00:04:20.740 | So in order to allow that to be run,
00:04:25.140 | you have to use this trust_remote_code flag.
00:04:27.440 | And then we come to here.
00:04:28.640 | So we have these two lines here.
00:04:30.440 | We can limit the precision, i.e. the number of bytes per weight,
00:04:37.340 | that we're using here.
00:04:40.740 | In this case, it's 8-bit.
00:04:42.240 | So we can limit the precision of the floating point numbers
00:04:46.340 | that we're using in this model
00:04:48.240 | in order to reduce the total size required by the model.
00:04:51.640 | Now, we could use 32-bit;
00:04:53.840 | that would be like the full version of the model.
00:04:56.840 | Or we could use bfloat16,
00:05:00.940 | so 16-bit.
00:05:02.840 | But to use 16-bit, we need 80 gigabytes of RAM on our GPU,
00:05:07.940 | which we don't have.
00:05:09.940 | So actually, we are doing this load in 8-bit.
00:05:13.740 | And this is why we need that bits and bytes library
00:05:17.240 | up at the top here.
00:05:18.640 | So with that, we're actually loading the model in 8-bit precision,
00:05:22.740 | which means, you know, it's not going to be quite as accurate,
00:05:25.740 | but it's still, I mean, at least from testing it,
00:05:28.840 | the performance is still very good.
00:05:30.640 | So it's going to be a slightly less accurate,
00:05:35.640 | but much smaller, model thanks to this.
00:05:39.240 | So we load in 8-bit.
00:05:40.640 | We set the max sequence length, ignore that.
00:05:44.040 | That was just me making some notes.
00:05:46.340 | So max sequence length, I'm going to say is 1024,
00:05:51.540 | but we can actually, I believe we can go up to this, right?
00:05:54.940 | What I'll do is maybe go up to 8192.
00:05:58.540 | You know, maybe go up to there
00:06:00.440 | because if you have a lot of conversation,
00:06:02.640 | like interactions in conversation,
00:06:04.340 | maybe you do want that extra context window just to be safe.
00:06:09.940 | And then what we're going to do is initialize the device, right?
00:06:13.140 | So the device is going to be our CUDA-enabled GPU.
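
Pieced together from the walkthrough, the initialization likely resembles the sketch below. The model lives at mosaicml/mpt-30b-chat on HuggingFace; keyword names like max_seq_len and init_device come from MPT's custom config and are assumptions here rather than verbatim notebook code:

```python
from torch import cuda
import transformers

# The chat variant of MPT-30B on HuggingFace
model_id = 'mosaicml/mpt-30b-chat'

# We need a CUDA-enabled GPU
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # allow MPT's custom modeling code to run
    load_in_8bit=True,       # 8-bit weights via bitsandbytes, to fit one GPU
    max_seq_len=8192,        # extended context window (assumed kwarg name)
    init_device=device,      # assumed kwarg from MPT's custom config
)
model.eval()
```
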
00:06:16.340 | Okay, so I'm going to run that.
00:06:18.140 | Yeah.
00:06:19.340 | And this is going to take a fairly long time.
00:06:23.840 | It needs to download the model.
00:06:25.740 | It needs to load everything.
00:06:27.340 | And yeah, just be ready to wait a little while.
00:06:32.440 | I'm going to fast forward.
00:06:33.640 | So I'll see you in a moment.
00:06:35.440 | Okay, so we've just finished.
00:06:37.840 | That took a nice 16 minutes,
00:06:41.140 | almost 17 minutes for me to download everything
00:06:44.340 | and then load the model.
00:06:45.640 | Obviously, once you've downloaded it once,
00:06:48.540 | you don't need to download the model again.
00:06:51.140 | That being said, when you're on Colab,
00:06:53.340 | once you download it once,
00:06:55.140 | as soon as your instance is reset,
00:06:59.240 | you are going to need to download it again.
00:07:01.140 | So it's probably better if you have like a set instance
00:07:05.040 | that you can run this on.
00:07:06.040 | But even when you're just re-running this model
00:07:08.640 | and you've already downloaded it,
00:07:09.740 | you're still going to need to actually load the model
00:07:14.740 | from disk.
00:07:16.440 | And that took just over four minutes.
00:07:19.840 | So the fact is, it's just a huge model.
00:07:23.040 | So it takes a little while to load everything.
00:07:26.740 | But that's done now, thankfully,
00:07:28.740 | so we can move on to the rest.
00:07:30.140 | So the tokenizer is the thing that translates
00:07:33.640 | human-readable plain text into machine-
00:07:36.140 | or transformer-readable tokens.
00:07:41.040 | And this is something I've spoken about
00:07:42.940 | more in the last MPT-7B video
00:07:47.640 | and also the OpenLLaMA video.
00:07:49.240 | So I've linked those towards the top of the video right now,
00:07:52.440 | if you want more information on those.
00:07:54.340 | But essentially,
00:07:56.740 | when we're chatting with the model,
00:08:01.640 | what I've found is that it generates
00:08:03.440 | the next step of the conversation.
00:08:05.640 | All right, so it's going to generate its response as the AI
00:08:09.240 | and then it's going to start generating
00:08:10.940 | what would a human say after this and so on,
00:08:13.540 | because it's text generation.
00:08:15.640 | It's basically trying to predict
00:08:17.340 | what is going to happen next.
00:08:18.840 | So what I'm doing here is setting a list of tokens
00:08:23.740 | where if we see them,
00:08:25.040 | we tell the model, "Okay, you're done.
00:08:26.840 | Stop generating stuff, please."
00:08:28.940 | Okay?
00:08:30.340 | So I'm saying if we see "Human:" or "AI:",
00:08:34.640 | that means we're probably onto one of the next lines.
00:08:36.540 | So at that point, we're going to stop.
00:08:38.240 | Now, we also need to convert those into tensors.
00:08:42.740 | So we do that here.
00:08:44.440 | And then we use this,
00:08:46.840 | we create this stopping function
00:08:49.140 | or this, what is it? StopOnTokens class.
00:08:52.040 | So basically, it's just going to check
00:08:54.340 | whether the last few tokens
00:08:57.340 | are equal to one of these combinations here.
00:09:00.940 | If so, please stop.
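
A minimal sketch of the stopping criteria described here, assuming the tokenizer is loaded from the same model repo and that the stop strings are exactly "Human:" and "AI:":

```python
import torch
import transformers
from transformers import StoppingCriteria, StoppingCriteriaList

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

# Token ID sequences that mark the start of a new turn; converted to
# tensors on the GPU so we can compare against generated IDs directly
stop_token_ids = [
    torch.LongTensor(tokenizer(text)['input_ids']).to(device)
    for text in ('Human:', 'AI:')
]

class StopOnTokens(StoppingCriteria):
    # Return True (i.e. stop) when the newest tokens match a stop sequence
    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```
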
00:09:02.740 | Okay, and then what we do
00:09:06.340 | is initialize a ton of model generation parameters.
00:09:10.340 | So we're doing this for text generation.
00:09:12.740 | We need to return the full text for LangChain.
00:09:15.540 | We have our stopping criteria
00:09:17.140 | and we have a set of things here.
00:09:18.540 | So temperature, you can set that to one
00:09:20.840 | to be more random, zero to be less random.
00:09:23.040 | And there's other things in that.
00:09:25.340 | One other thing actually is the max new tokens.
00:09:28.440 | So how many tokens
00:09:31.440 | can the model predict before it's
00:09:33.540 | forced to stop.
00:09:34.940 | So you want to set this high enough
00:09:37.840 | that you're not cutting out responses,
00:09:40.540 | but low enough that you're not
00:09:42.840 | generating too much text.
00:09:45.140 | And it's very important to not generate too much text
00:09:48.240 | because it actually takes longer.
00:09:49.440 | So the higher this number is,
00:09:51.240 | the longer the model is going to take to give you a response,
00:09:54.340 | which we don't really want.
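
Putting those generation parameters together, the pipeline setup plausibly looks like this; temperature 0.1 and the 128-token cap are the values mentioned in the video, and the rest is standard transformers pipeline usage:

```python
import transformers

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    return_full_text=True,   # LangChain expects the full text back
    stopping_criteria=stopping_criteria,
    temperature=0.1,         # near 0 = less random, 1.0 = more random
    max_new_tokens=128,      # cap length; higher means slower responses
)

# Quick sanity check that generation works
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]['generated_text'])
```
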
00:09:55.840 | Okay, now what we're going to do is just confirm.
00:09:58.840 | Okay, is this working?
00:10:00.240 | Very quick example here.
00:10:02.240 | You'll recognize this if you watched the other
00:10:04.240 | MPT-7B video.
00:10:06.540 | So right now, this isn't a chatbot.
00:10:08.640 | It's just generating text.
00:10:10.340 | We'll get onto this chatbot stuff soon.
00:10:13.240 | Okay, so we can see
00:10:14.940 | we return the original text.
00:10:17.040 | That's our question at top there.
00:10:18.640 | And it's like fission is a process
00:10:20.540 | where it does this and this and so on and so on.
00:10:22.940 | And yeah, I have read this a couple of times,
00:10:25.140 | seems relatively accurate.
00:10:27.040 | It does just stop here.
00:10:29.140 | That's just because we're not really doing any
00:10:31.340 | prompt engineering here.
00:10:32.640 | When it comes to the chat,
00:10:34.640 | we're going to be cutting off at 128 tokens.
00:10:38.240 | So when it comes to chat,
00:10:39.640 | we don't really want long chat responses.
00:10:42.540 | So that is not really an issue that I've seen,
00:10:47.040 | at least when we're actually in that chatbot phase
00:10:50.740 | or chatbot mode of the model.
00:10:53.840 | Okay, so now what we're going to do is
00:10:57.140 | initialize these things with LangChain
00:10:59.440 | and we're going to just try the same thing again,
00:11:01.540 | make sure everything's working.
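
The LangChain initialization is likely just a thin wrapper around the pipeline, something like:

```python
from langchain.llms import HuggingFacePipeline

# Wrap the HuggingFace pipeline so LangChain can treat it as an LLM
llm = HuggingFacePipeline(pipeline=generate_text)

print(llm("Explain to me the difference between nuclear fission and fusion."))
```
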
00:11:03.040 | Explain to me the difference between
00:11:04.840 | nuclear fission and fusion.
00:11:06.440 | And we should get
00:11:07.940 | pretty much the same response.
00:11:09.940 | We have a temperature of 0.1.
00:11:12.140 | So there's going to be a little bit of randomness in there.
00:11:14.440 | So it probably won't be exactly the same.
00:11:17.040 | It should be similar.
00:11:19.740 | So we get that.
00:11:20.840 | It's either exactly the same or very similar.
00:11:23.640 | That's good.
00:11:24.740 | And now what I want to do is move on to actually
00:11:27.540 | how do we take this model
00:11:29.140 | and turn it into a chatbot.
00:11:31.840 | So what does a chatbot need?
00:11:33.140 | It needs conversational memory
00:11:35.040 | and it needs
00:11:37.140 | a different structure for the prompts.
00:11:41.140 | So it's going to be going in a chat format.
00:11:44.940 | So the conversational memory is where you're going to have
00:11:47.240 | Human: "How are you?"
00:11:49.140 | AI: "I'm good, thank you."
00:11:50.940 | Human: asks a question, and so on and so on, right?
00:11:53.540 | So it's basically a chat log.
00:11:56.940 | So that's our memory.
00:12:00.340 | We implement that here.
00:12:02.740 | So that's going to modify the initial prompt.
00:12:05.140 | Another thing that our chatbot is going to have is
00:12:08.740 | the initial prompt is also going to be explaining
00:12:12.340 | what this is.
00:12:13.740 | So actually we can see that.
00:12:16.840 | So let me run this.
00:12:18.040 | So I'm going to run the conversation chain from
00:12:22.140 | LangChain.
00:12:23.840 | And yeah, here we can see that initial prompt.
00:12:28.040 | So the initial prompt is here.
00:12:30.840 | Okay, we see the template is the structure of that.
00:12:35.140 | So, the following is a friendly conversation between a human and an AI.
00:12:38.040 | The AI is talkative and provides lots of specific details
00:12:41.240 | from its context,
00:12:42.340 | so on and so on.
00:12:44.040 | And then you have where we would put in the conversational
00:12:46.840 | memory and then our new input.
00:12:49.340 | And then this is kind of like the primer to indicate to the model.
00:12:52.640 | It needs to start generating a response.
00:12:54.540 | Okay.
00:12:55.940 | Cool.
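
A sketch of that chain, assuming the default ConversationChain setup (which uses a simple buffer memory under the hood):

```python
from langchain.chains import ConversationChain

# verbose=True prints the full prompt (template + memory + new input)
chat = ConversationChain(llm=llm, verbose=True)

# Inspect the default prompt template LangChain uses
print(chat.prompt.template)
```
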
00:12:57.540 | Now this bit here,
00:12:59.640 | this initial part of the prompt is
00:13:03.240 | encouraging the model to talk a lot,
00:13:06.440 | which I don't actually want, right?
00:13:09.340 | AI is talkative and provides lots of specific details from its context.
00:13:13.340 | I don't really want that.
00:13:14.640 | I want it to be more precise.
00:13:16.640 | So I'm going to modify it a little bit.
00:13:18.340 | Okay, so we just go to chat prompt template
00:13:20.940 | and we're going to modify it to this, which is exactly the same,
00:13:23.840 | but I've just modified that middle sentence here.
00:13:27.740 | So that is now this:
00:13:31.540 | "The AI is conversational but concise in its responses, without rambling."
00:13:36.840 | Okay.
00:13:37.840 | So that just means we're going to get shorter responses that are concise
00:13:44.040 | but don't go on as much as what I was seeing before.
00:13:47.340 | Okay.
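
The modified template might be set like this; everything except the swapped middle sentence is LangChain's default conversation template:

```python
chat.prompt.template = """The following is a friendly conversation between a human and an AI. The AI is conversational but concise in its responses, without rambling. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:"""

# Start the conversation
res = chat.predict(input="Hi, how are you?")
print(res)
```
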
00:13:48.240 | So I'm going to start with just asking
00:13:51.540 | the model how it's doing as any conversation should start
00:13:56.640 | and we can see here.
00:13:58.440 | So we've got verbose mode equal to true.
00:14:01.240 | That's just so we can see what's going on.
00:14:03.440 | So we have the full prompt
00:14:05.540 | and then we see, okay, I'm just a computer program.
00:14:09.140 | I don't have feelings like humans do.
00:14:11.140 | How can I assist you today?
00:14:12.240 | Right?
00:14:13.440 | And then at the end of there, we can also see,
00:14:15.040 | oh, it's done that human thing, right?
00:14:18.940 | It's continued the conversation.
00:14:20.540 | But because we added the stopping criteria in there,
00:14:24.440 | it stops as soon as it gets to human, right?
00:14:28.540 | Because that was one of the things we were looking for
00:14:32.540 | within the stopping criteria.
00:14:34.340 | So at that point it stops.
00:14:35.840 | But we still have it in the response,
00:14:38.340 | which is a little bit annoying
00:14:39.840 | because we don't want to be returning that to a user.
00:14:42.640 | So what we can do is actually just trim that last little bit off.
00:14:48.640 | So we can go into our chat memory.
00:14:50.640 | Okay, so we have chat memory here.
00:14:52.640 | Last message is what we just saw.
00:14:57.240 | I'm just a computer program.
00:14:58.640 | Then we have human at the end.
00:14:59.940 | And what we can do is just remove it, right?
00:15:04.240 | This isn't like super nice and clean or anything.
00:15:07.740 | In fact, it's a bit of a mess, but it just works.
00:15:12.440 | So there are a few things I'm accommodating for here.
00:15:16.240 | So: if it has "Human", or possibly "AI".
00:15:19.940 | Another thing I noticed is that sometimes it would also have this,
00:15:24.440 | the square brackets.
00:15:25.940 | Why was that?
00:15:27.140 | I remember seeing that.
00:15:28.640 | I think that might be when you're using this
00:15:30.740 | as a conversational agent, if I remember correctly.
00:15:34.040 | Although I could be wrong.
00:15:35.340 | It doesn't seem to be an issue in this case.
00:15:37.940 | So yeah, I basically looked for any of those.
00:15:41.540 | I also looked for the double new lines
00:15:43.440 | because that kind of indicated
00:15:44.840 | that we're at the end of conversation as well.
00:15:48.140 | Yeah, so we added a few things in there.
00:15:50.240 | I don't know if we necessarily need all of that.
00:15:52.440 | We could probably remove this.
00:15:53.740 | We could probably remove this as well.
00:15:55.440 | And we would, I assume, get the same sort of outputs.
00:16:00.540 | But then when it comes to using this as an agent,
00:16:04.740 | you might want to keep those things in.
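
A rough version of that trimming logic; the exact markers used in the notebook may differ slightly, but the idea is to cut the response at the first artifact:

```python
def trim_response(text: str) -> str:
    # Cut at the first stop artifact: a new "Human:" or "AI:" turn,
    # a stray square bracket (seen in agent use), or a double newline
    for marker in ('\nHuman:', '\nAI:', '\n[', '\n\n'):
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    return text.strip()

# Patch the last message stored in the chain's memory so the dangling
# "Human" marker isn't carried into the next prompt
last = chat.memory.chat_memory.messages[-1]
last.content = trim_response(last.content)
```
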
00:16:07.040 | Okay, so we have that, we've just modified that final message.
00:16:11.640 | So we fixed that.
00:16:13.540 | So then we could just return this to our users.
00:16:18.140 | But yeah, we don't want to run this every time.
00:16:22.140 | To be honest, we don't want to run this code anyway,
00:16:25.240 | but let's just stick with it.
00:16:27.640 | And what we're going to do is actually just put all of this
00:16:30.940 | into a function, right?
00:16:33.640 | So we're going to say, okay, we have this chat function.
00:16:37.240 | We're going to pass the conversation chain into there
00:16:40.140 | and we're also going to pass out query.
00:16:41.840 | And what it's going to do is create a response
00:16:45.540 | like we did before.
00:16:47.140 | It was up here.
00:16:51.240 | So we're just running this, but here.
00:16:55.640 | And then we're going to do all of the trimming that we do after.
00:17:00.140 | And then we're going to return that final response.
00:17:03.740 | So we run this and let's try
00:17:07.840 | and take a look at what we get with this new function
00:17:11.440 | that just adds that extra step onto the end there.
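
Wrapped into a function (the name here is illustrative), that looks roughly like:

```python
def chatbot(chain, query: str) -> str:
    # Run the conversation chain to get the raw response
    response = chain.predict(input=query)
    # Trim any trailing turn markers from the generated text
    response = trim_response(response)
    # Keep the stored conversation history consistent with what we return
    chain.memory.chat_memory.messages[-1].content = response
    return response

print(chatbot(chat, "Explain to me the difference between nuclear fission and fusion."))
```
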
00:17:13.740 | Okay, so I'm going to say explain to me
00:17:15.540 | the difference between nuclear fission and fusion.
00:17:17.740 | You can see our previous interactions in the verbose output here.
00:17:23.340 | So hi, how are you?
00:17:24.940 | I'm just a compute program, so on and so on.
00:17:26.840 | All right, so we have that conversation log
00:17:28.840 | and we're just passing all of this to the model
00:17:31.840 | and saying, you know, please continue.
00:17:33.840 | Okay, so it does continue
00:17:37.340 | and it explains to us what these are, right?
00:17:40.640 | We can see that this is a different response
00:17:42.640 | to what we got earlier
00:17:43.540 | because we now have like the prompt instructions earlier on.
00:17:47.740 | So that has modified how the model is actually answering this.
00:17:51.140 | And then what I'm going to do, okay,
00:17:53.240 | I'm going to continue the conversation,
00:17:54.840 | but I'm going to ask a question that requires the model
00:17:58.440 | to have seen the previous interactions within the conversation.
00:18:01.540 | So I'm going to say, could you explain this like I am five, right?
00:18:04.840 | So we run that.
00:18:06.840 | Okay, we can see those previous parts of the conversation.
00:18:11.540 | So I'm just asking for a simpler version of this,
00:18:14.840 | something that I can understand easier.
00:18:18.040 | And let's see what it gets.
00:18:20.540 | And it's like, okay, sure.
00:18:23.140 | In simpler terms, nuclear fission is like breaking apart a toy car
00:18:27.540 | to get the pieces inside.
00:18:29.240 | It requires forceful intervention to break down the original structure.
00:18:32.940 | Meanwhile, nuclear fusion is more like putting together Lego blocks
00:18:37.640 | to make something new.
00:18:39.140 | You're taking separate elements and joining them together
00:18:41.540 | to create something different than before.
00:18:43.440 | Both processes release energy,
00:18:46.140 | but they approach it from opposite directions.
00:18:48.640 | So I think that's actually a really good way of explaining it,
00:18:52.040 | like much easier for me to understand than the previous one.
00:18:56.040 | So I think that that's pretty cool.
00:18:59.140 | So with that, we've built our open source chatbot
00:19:04.640 | using MPT-30B.
00:19:06.240 | Performance, as you can see, is pretty good.
00:19:09.640 | We need to make some accommodations,
00:19:12.040 | like having those stopping criteria in there.
00:19:14.040 | If you're building this out into a more fully-fledged solution,
00:19:17.740 | obviously, you're going to need more stopping criteria.
00:19:19.740 | For sure, you're going to need
00:19:20.740 | better trimming or post-processing
00:19:24.740 | than what I did there.
00:19:26.240 | But at high level, you don't really need too much.
00:19:31.140 | And the performance is, we can see, it's actually pretty good.
00:19:35.540 | So that's really cool.
00:19:37.640 | The context window that we can use with this model
00:19:40.540 | is pretty large as well.
00:19:41.840 | So I read that it was about 8K,
00:19:44.440 | but I also saw in some of their documentation
00:19:46.940 | that it goes up to 16K.
00:19:48.840 | I'm not quite sure why there's a difference in what I've read.
00:19:52.540 | Maybe it's the chat model versus the base model.
00:19:56.840 | But in any case, you can go up to at least 8K,
00:19:59.740 | which is on par with the original GPT-4,
00:20:04.540 | which is pretty good.
00:20:06.340 | Anyway, I hope this has all been useful.
00:20:09.540 | I'm going to leave it there for this video.
00:20:11.240 | So thank you very much for watching
00:20:13.740 | and I will see you again in the next one.
00:20:16.940 | (upbeat music)