
MPT-30B Chatbot with LangChain!


Chapters

0:00 MPT-30B Chatbot
0:40 MPT-30B Explained
2:48 Implementing MPT-30B in Python
7:30 Tokenization and Stopping Criteria
9:03 LLM Pipeline in Hugging Face
10:54 Initializing MPT-30B in LangChain
11:25 Building an MPT-30B Chatbot
14:13 Post-Processing Chatbot Responses
17:08 Talking with MPT-30B Chatbot

Whisper Transcript

00:00:00.000 | So today we have a pretty cool video: we're going to take the new MPT-30B model,
00:00:07.040 | which is an open source model that has better performance than GPT-3
00:00:13.640 | and can be run on a single GPU.
00:00:16.940 | We're going to take that model and we are going to use it to build a chatbot
00:00:22.080 | using HuggingFace for the actual model itself
00:00:26.240 | and LangChain for the conversational aspect of it.
00:00:30.540 | So I think this is going to be pretty fun.
00:00:32.440 | Let's jump straight into it.
00:00:34.440 | What I want to first do is take a look at the MPT-30B model.
00:00:40.940 | Now, the MPT-30B model, you've probably heard people talking about it already,
00:00:46.240 | but I want to just do a quick recap on what it is
00:00:50.540 | and the sort of performance that we can expect to get from it.
00:00:53.040 | So, MPT-30B is from MosaicML.
00:00:57.140 | They also did the MPT-7B model that I spoke about in the past,
00:01:02.340 | and they have some nice visuals in their announcement
00:01:06.040 | that kind of give us a good grasp of what sort of performance we can expect.
00:01:10.640 | You see here, we have MPT-7B, which was a decent model.
00:01:16.340 | From testing it, it's definitely limited, but it was good,
00:01:19.840 | particularly for its size.
00:01:29.440 | And then MPT-30B, okay, yes, it's a bit better in many, many ways.
00:01:29.440 | And I suppose what I think is a bit more interesting
00:01:32.940 | is when we compare it to these other models.
00:01:34.640 | So on the right over here, we have the current size class,
00:01:37.940 | like the 30B-plus size of models.
00:01:40.340 | So the red is MPT-30B.
00:01:42.240 | We see that it's not actually quite as performant
00:01:45.540 | as Falcon-40B and LLaMA-30B,
00:01:48.740 | even though when it says LLaMA-30B,
00:01:51.640 | I think it's actually 33 billion parameters.
00:01:53.640 | So it makes sense that LLaMA would be slightly more performant there.
00:01:57.340 | It's not quite as performant as those,
00:01:59.740 | but it does actually outperform both in programming ability.
00:02:05.540 | So in terms of programming ability,
00:02:07.940 | this could be quite a good model to kind of help us
00:02:11.440 | as almost like an AI code assistant.
00:02:14.540 | So that would be definitely very interesting to try out at some point.
00:02:19.440 | But then we can see on the other things,
00:02:20.740 | it's pretty close to these other models,
00:02:22.540 | especially Falcon-40B, which is a fairly large model.
00:02:27.940 | So cool to see.
00:02:29.940 | And what they've done is released a couple of models here.
00:02:33.340 | So we have MPT-30B, which is the base model,
00:02:36.240 | and we also have the Instruct and Chat models,
00:02:39.340 | which are fine-tuned to follow instructions
00:02:42.340 | or fine-tuned for chat, respectively.
00:02:45.240 | We're going to be using the chat model, of course.
00:02:48.640 | So let's jump into how we actually do this.
00:02:50.940 | So I have this notebook on Colab.
00:02:53.940 | So to run this, at least on Colab,
00:02:57.540 | you are actually going to need to use Colab Pro
00:03:00.940 | and you're going to need to use the A100 GPU.
00:03:04.440 | Now it's fairly expensive to run, just as a pre-warning.
00:03:08.140 | Or if you can get an A100 elsewhere,
00:03:11.840 | then go ahead and do that as well.
00:03:14.640 | But you do need a fairly beefy GPU here.
00:03:18.940 | But we are just running it on a single GPU, which is pretty cool.
00:03:21.840 | So we're going to need to install most of these.
00:03:25.740 | Actually, I don't think we need the Wikipedia thing.
00:03:27.840 | Do we need the rest? Yes.
00:03:29.840 | So the rest of these we do need.
00:03:31.240 | So we can pip install those.
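
As a rough sketch, the install cell likely looks something like the following. The exact package list and versions come from the notebook; bitsandbytes and langchain are mentioned explicitly in the video, while einops and accelerate are assumptions based on what MPT's custom code typically needs:

```python
# Hypothetical install cell -- exact packages/versions come from the notebook;
# einops and accelerate are assumptions for running MPT's custom model code.
!pip install -qU transformers accelerate einops langchain bitsandbytes
```
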
00:03:32.840 | And what we're going to do is come down to here
00:03:36.240 | to initialize our model.
00:03:37.640 | So let's explain this.
00:03:39.340 | So we're making sure that we're going to be using a GPU.
00:03:44.440 | So we need a CUDA-enabled GPU here.
00:03:47.940 | So we set that with device.
00:03:49.240 | And then we initialize the model.
00:03:50.640 | We're initializing it from HuggingFace.
00:03:52.740 | So if we took this little string here
00:03:55.840 | and went to huggingface.co/ followed by that string,
00:04:01.240 | we'd find the model page for this model.
00:04:04.640 | And it just gives us a little bit of a description
00:04:08.140 | about what the model actually is.
00:04:10.540 | But yeah, so we're getting this from HuggingFace.
00:04:13.240 | To run this, there's basically
00:04:15.840 | some custom code that needs to be run
00:04:18.640 | in order to get this model working with HuggingFace.
00:04:20.740 | So in order to allow that to be run,
00:04:25.140 | you have to use this trust_remote_code flag.
00:04:27.440 | And then we come to here.
00:04:28.640 | So we have these two lines here.
00:04:30.440 | We can limit the precision, i.e. the number of bytes per weight,
00:04:37.340 | that we're using here.
00:04:40.740 | In this case, it's 8-bit.
00:04:42.240 | So we can limit the precision of the floating point numbers
00:04:46.340 | that we're using in this model
00:04:48.240 | in order to reduce the total size required by the model.
00:04:51.640 | Now, we could use 32-bit;
00:04:53.840 | that would be like the full version of the model.
00:04:56.840 | Or we could use bfloat16,
00:05:00.940 | so 16-bit.
00:05:02.840 | But to use 16-bit, we need 80 gigabytes of RAM on our GPU,
00:05:07.940 | which we don't have.
00:05:09.940 | So actually, we are doing this load in 8-bit.
00:05:13.740 | And this is why we need that bits and bytes library
00:05:17.240 | up at the top here.
00:05:18.640 | So with that, we're actually loading the model in 8-bit precision,
00:05:22.740 | which means, you know, it's not going to be quite as accurate,
00:05:25.740 | but it's still, I mean, at least from testing it,
00:05:28.840 | the performance is still very good.
00:05:30.640 | So it's going to be a slightly less accurate,
00:05:35.640 | but much smaller, model thanks to this.
00:05:39.240 | So we load in 8-bit.
00:05:40.640 | We set the max sequence length, ignore that.
00:05:44.040 | That was just me making some notes.
00:05:46.340 | So max sequence length, I'm going to say is 1024,
00:05:51.540 | but we can actually, I believe we can go up to this, right?
00:05:54.940 | What I'll do is maybe go up to 8192.
00:05:58.540 | You know, maybe go up to there
00:06:00.440 | because if you have a lot of conversation,
00:06:02.640 | like interactions in conversation,
00:06:04.340 | maybe you do want that extra context window just to be safe.
00:06:09.940 | And then what we're going to do is initialize the device, right?
00:06:13.140 | So the device is going to be our CUDA-enabled GPU.
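
Pieced together from the walkthrough, the initialization likely resembles the sketch below. The model lives at mosaicml/mpt-30b-chat on HuggingFace; keyword names like max_seq_len and init_device come from MPT's custom config and are assumptions here rather than verbatim notebook code:

```python
from torch import cuda
import transformers

# The chat variant of MPT-30B on HuggingFace
model_id = 'mosaicml/mpt-30b-chat'

# We need a CUDA-enabled GPU
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # allow MPT's custom modeling code to run
    load_in_8bit=True,       # 8-bit weights via bitsandbytes, to fit one GPU
    max_seq_len=8192,        # extended context window (assumed kwarg name)
    init_device=device,      # assumed kwarg from MPT's custom config
)
model.eval()
```
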
00:06:16.340 | Okay, so I'm going to run that.
00:06:18.140 | Yeah.
00:06:19.340 | And this is going to take a fairly long time.
00:06:23.840 | It needs to download the model.
00:06:25.740 | It needs to load everything.
00:06:27.340 | And yeah, just be ready to wait a little while.
00:06:32.440 | I'm going to fast forward.
00:06:33.640 | So I'll see you in a moment.
00:06:35.440 | Okay, so we've just finished.
00:06:37.840 | That took a nice 16 minutes,
00:06:41.140 | almost 17 minutes for me to download everything
00:06:44.340 | and then load the model.
00:06:45.640 | Obviously, once you've downloaded it once,
00:06:48.540 | you don't need to download the model again.
00:06:51.140 | That being said, when you're on Colab,
00:06:53.340 | once you download it once,
00:06:55.140 | as soon as your instance is reset,
00:06:59.240 | you are going to need to download it again.
00:07:01.140 | So it's probably better if you have like a set instance
00:07:05.040 | that you can run this on.
00:07:06.040 | But even when you're just re-running this model
00:07:08.640 | and you've already downloaded it,
00:07:09.740 | you're still going to need to actually load the model
00:07:14.740 | from disk.
00:07:16.440 | And that took just over four minutes.
00:07:19.840 | So the fact is, it's just a huge model.
00:07:23.040 | So it takes a little while to load everything.
00:07:26.740 | But that's done now, thankfully,
00:07:28.740 | so we can move on to the rest.
00:07:30.140 | So the tokenizer is the thing that translates
00:07:33.640 | human-readable plain text into machine-
00:07:36.140 | or transformer-readable tokens.
00:07:41.040 | And this is something I've spoken about
00:07:42.940 | more in the last MPT-7B video
00:07:47.640 | and also the OpenLLaMA video.
00:07:49.240 | So I've linked those towards the top of the video right now,
00:07:52.440 | if you want more information on those.
00:07:54.340 | But essentially,
00:07:56.740 | when we're chatting with the model,
00:08:01.640 | what I've found is that it generates
00:08:03.440 | the next step of the conversation.
00:08:05.640 | All right, so it's going to generate its response as the AI
00:08:09.240 | and then it's going to start generating
00:08:10.940 | what would a human say after this and so on,
00:08:13.540 | because it's text generation.
00:08:15.640 | It's basically trying to predict
00:08:17.340 | what is going to happen next.
00:08:18.840 | So what I'm doing here is setting a list of tokens
00:08:23.740 | where if we see them,
00:08:25.040 | we tell the model, "Okay, you're done.
00:08:26.840 | Stop generating stuff, please."
00:08:28.940 | Okay?
00:08:30.340 | So I'm saying if we see "Human:" or "AI:",
00:08:34.640 | that means we're probably onto one of the next lines.
00:08:36.540 | So at that point, we're going to stop.
00:08:38.240 | Now, we also need to convert those into tensors.
00:08:42.740 | So we do that here.
00:08:44.440 | And then we use this,
00:08:46.840 | we create this stopping function
00:08:49.140 | or this, what is it? StopOnTokens class.
00:08:52.040 | So basically, it's just going to check
00:08:54.340 | whether the last few tokens
00:08:57.340 | are equal to one of these combinations here.
00:09:00.940 | If so, please stop.
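
A minimal sketch of the stopping criteria described here, assuming the tokenizer is loaded from the same model repo and that the stop strings are exactly "Human:" and "AI:":

```python
import torch
import transformers
from transformers import StoppingCriteria, StoppingCriteriaList

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

# Token ID sequences that mark the start of a new turn; converted to
# tensors on the GPU so we can compare against generated IDs directly
stop_token_ids = [
    torch.LongTensor(tokenizer(text)['input_ids']).to(device)
    for text in ('Human:', 'AI:')
]

class StopOnTokens(StoppingCriteria):
    # Return True (i.e. stop) when the newest tokens match a stop sequence
    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```
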
00:09:02.740 | Okay, and then what we do
00:09:06.340 | is initialize a ton of model generation parameters.
00:09:10.340 | So we're doing this for text generation.
00:09:12.740 | We need to return the full text for LangChain.
00:09:15.540 | We have our stopping criteria
00:09:17.140 | and we have a set of things here.
00:09:18.540 | So temperature, you can set that to one
00:09:20.840 | to be more random, zero to be less random.
00:09:23.040 | And there's other things in that.
00:09:25.340 | One other thing actually is the max new tokens.
00:09:28.440 | So how many tokens
00:09:31.440 | can the model predict before it's
00:09:33.540 | forced to stop.
00:09:34.940 | So you want to set this high enough
00:09:37.840 | that you're not cutting out responses,
00:09:40.540 | but low enough that you're not
00:09:42.840 | generating too much text.
00:09:45.140 | And it's very important to not generate too much text
00:09:48.240 | because it actually takes longer.
00:09:49.440 | So the higher this number is,
00:09:51.240 | the longer the model is going to take to give you a response,
00:09:54.340 | which we don't really want.
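
Putting those generation parameters together, the pipeline setup plausibly looks like this; temperature 0.1 and the 128-token cap are the values mentioned in the video, and the rest is standard transformers pipeline usage:

```python
import transformers

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    return_full_text=True,   # LangChain expects the full text back
    stopping_criteria=stopping_criteria,
    temperature=0.1,         # near 0 = less random, 1.0 = more random
    max_new_tokens=128,      # cap length; higher means slower responses
)

# Quick sanity check that generation works
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]['generated_text'])
```
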
00:09:55.840 | Okay, now what we're going to do is just confirm.
00:09:58.840 | Okay, is this working?
00:10:00.240 | Very quick example here.
00:10:02.240 | You'll recognize this if you watched the other
00:10:04.240 | MPT-7B video.
00:10:06.540 | So right now, this isn't a chatbot.
00:10:08.640 | It's just generating text.
00:10:10.340 | We'll get onto this chatbot stuff soon.
00:10:13.240 | Okay, so we can see
00:10:14.940 | we return the original text.
00:10:17.040 | That's our question at top there.
00:10:18.640 | And it's like fission is a process
00:10:20.540 | where it does this and this and so on and so on.
00:10:22.940 | And yeah, I have read this a couple of times,
00:10:25.140 | seems relatively accurate.
00:10:27.040 | It does just stop here.
00:10:29.140 | That's just because we're not really doing any
00:10:31.340 | prompt engineering here.
00:10:32.640 | When it comes to the chat,
00:10:34.640 | we're going to be cutting off at 128 tokens.
00:10:38.240 | So when it comes to chat,
00:10:39.640 | we don't really want long chat responses.
00:10:42.540 | So that is not really an issue that I've seen,
00:10:47.040 | at least when we're actually in that chatbot phase
00:10:50.740 | or chatbot mode of the model.
00:10:53.840 | Okay, so now what we're going to do is
00:10:57.140 | initialize these things with LangChain
00:10:59.440 | and we're going to just try the same thing again,
00:11:01.540 | make sure everything's working.
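
The LangChain initialization is likely just a thin wrapper around the pipeline, something like:

```python
from langchain.llms import HuggingFacePipeline

# Wrap the HuggingFace pipeline so LangChain can treat it as an LLM
llm = HuggingFacePipeline(pipeline=generate_text)

print(llm("Explain to me the difference between nuclear fission and fusion."))
```
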
00:11:03.040 | Explain to me the difference between
00:11:04.840 | nuclear fission and fusion.
00:11:06.440 | And we should get
00:11:07.940 | pretty much the same response.
00:11:09.940 | We have a temperature of 0.1.
00:11:12.140 | So there's going to be a little bit of randomness in there.
00:11:14.440 | So it probably won't be exactly the same.
00:11:17.040 | It should be similar.
00:11:19.740 | So we get that.
00:11:20.840 | It's either exactly the same or very similar.
00:11:23.640 | That's good.
00:11:24.740 | And now what I want to do is move on to actually
00:11:27.540 | how do we take this model
00:11:29.140 | and turn it into a chatbot.
00:11:31.840 | So what does a chatbot need?
00:11:33.140 | It needs conversational memory
00:11:35.040 | and it needs
00:11:37.140 | a different structure for the prompts.
00:11:41.140 | So it's going to be going in a chat format.
00:11:44.940 | So the conversational memory is where you're going to have
00:11:47.240 | Human: "How are you?"
00:11:49.140 | AI: "I'm good, thank you."
00:11:50.940 | Human: asks a question, and so on and so on, right?
00:11:53.540 | So it's basically a chat log.
00:11:56.940 | So that's our memory.
00:12:00.340 | We implement that here.
00:12:02.740 | So that's going to modify the initial prompt.
00:12:05.140 | Another thing that our chatbot is going to have is
00:12:08.740 | the initial prompt is also going to be explaining
00:12:12.340 | what this is.
00:12:13.740 | So actually we can see that.
00:12:16.840 | So let me run this.
00:12:18.040 | So I'm going to run the conversation chain from
00:12:22.140 | LangChain.
00:12:23.840 | And yeah, here we can see that initial prompt.
00:12:28.040 | So the initial prompt is here.
00:12:30.840 | Okay, we see the template is the structure of that.
00:12:35.140 | So, the following is a friendly conversation between a human and an AI.
00:12:38.040 | The AI is talkative and provides lots of specific details
00:12:41.240 | from its context,
00:12:42.340 | so on and so on.
00:12:44.040 | And then you have where we would put in the conversational
00:12:46.840 | memory and then our new input.
00:12:49.340 | And then this is kind of like the primer to indicate to the model.
00:12:52.640 | It needs to start generating a response.
00:12:54.540 | Okay.
00:12:55.940 | Cool.
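
A sketch of that chain, assuming the default ConversationChain setup (which uses a simple buffer memory under the hood):

```python
from langchain.chains import ConversationChain

# verbose=True prints the full prompt (template + memory + new input)
chat = ConversationChain(llm=llm, verbose=True)

# Inspect the default prompt template LangChain uses
print(chat.prompt.template)
```
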
00:12:57.540 | Now this bit here,
00:12:59.640 | this initial part of the prompt is
00:13:03.240 | encouraging the model to talk a lot,
00:13:06.440 | which I don't actually want, right?
00:13:09.340 | AI is talkative and provides lots of specific details from its context.
00:13:13.340 | I don't really want that.
00:13:14.640 | I want it to be more precise.
00:13:16.640 | So I'm going to modify it a little bit.
00:13:18.340 | Okay, so we just go to chat prompt template
00:13:20.940 | and we're going to modify it to this, which is exactly the same,
00:13:23.840 | but I've just modified that middle sentence here.
00:13:27.740 | So that is now this:
00:13:31.540 | "The AI is conversational but concise in its responses, without rambling."
00:13:36.840 | Okay.
00:13:37.840 | So that just means we're going to get shorter responses that are concise
00:13:44.040 | but don't go on as much as what I was seeing before.
00:13:47.340 | Okay.
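
The modified template might be set like this; everything except the swapped middle sentence is LangChain's default conversation template:

```python
chat.prompt.template = """The following is a friendly conversation between a human and an AI. The AI is conversational but concise in its responses, without rambling. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
{history}
Human: {input}
AI:"""

# Start the conversation
res = chat.predict(input="Hi, how are you?")
print(res)
```
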
00:13:48.240 | So I'm going to start with just asking
00:13:51.540 | the model how it's doing as any conversation should start
00:13:56.640 | and we can see here.
00:13:58.440 | So we've got verbose mode equal to true.
00:14:01.240 | That's just so we can see what's going on.
00:14:03.440 | So we have the full prompt
00:14:05.540 | and then we see, okay, I'm just a computer program.
00:14:09.140 | I don't have feelings like humans do.
00:14:11.140 | How can I assist you today?
00:14:12.240 | Right?
00:14:13.440 | And then at the end of there, we can also see,
00:14:15.040 | oh, it's done that human thing, right?
00:14:18.940 | It's continued the conversation.
00:14:20.540 | But because we added the stopping criteria in there,
00:14:24.440 | it stops as soon as it gets to human, right?
00:14:28.540 | Because that was one of the things we were looking for
00:14:32.540 | within the stopping criteria.
00:14:34.340 | So at that point it stops.
00:14:35.840 | But we still have it in the response,
00:14:38.340 | which is a little bit annoying
00:14:39.840 | because we don't want to be returning that to a user.
00:14:42.640 | So what we can do is actually just trim that last little bit off.
00:14:48.640 | So we can go into our chat memory.
00:14:50.640 | Okay, so we have chat memory here.
00:14:52.640 | Last message is what we just saw.
00:14:57.240 | I'm just a computer program.
00:14:58.640 | Then we have human at the end.
00:14:59.940 | And what we can do is just remove it, right?
00:15:04.240 | This isn't like super nice and clean or anything.
00:15:07.740 | In fact, it's a bit of a mess, but it just works.
00:15:12.440 | So there are a few things I'm accommodating for here.
00:15:16.240 | So: if it has "Human", or possibly "AI".
00:15:19.940 | Another thing I noticed is that sometimes it would also have this,
00:15:24.440 | the square brackets.
00:15:25.940 | Why was that?
00:15:27.140 | I remember seeing that.
00:15:28.640 | I think that might be when you're using this
00:15:30.740 | as a conversational agent, if I remember correctly.
00:15:34.040 | Although I could be wrong.
00:15:35.340 | It doesn't seem to be an issue in this case.
00:15:37.940 | So yeah, I basically looked for any of those.
00:15:41.540 | I also looked for the double new lines
00:15:43.440 | because that kind of indicated
00:15:44.840 | that we're at the end of conversation as well.
00:15:48.140 | Yeah, so we added a few things in there.
00:15:50.240 | I don't know if we necessarily need all of that.
00:15:52.440 | We could probably remove this.
00:15:53.740 | We could probably remove this as well.
00:15:55.440 | And we would, I assume, get the same sort of outputs.
00:16:00.540 | But then when it comes to using this as an agent,
00:16:04.740 | you might want to keep those things in.
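
A rough version of that trimming logic; the exact markers used in the notebook may differ slightly, but the idea is to cut the response at the first artifact:

```python
def trim_response(text: str) -> str:
    # Cut at the first stop artifact: a new "Human:" or "AI:" turn,
    # a stray square bracket (seen in agent use), or a double newline
    for marker in ('\nHuman:', '\nAI:', '\n[', '\n\n'):
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    return text.strip()

# Patch the last message stored in the chain's memory so the dangling
# "Human" marker isn't carried into the next prompt
last = chat.memory.chat_memory.messages[-1]
last.content = trim_response(last.content)
```
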
00:16:07.040 | Okay, so we have that, we've just modified that final message.
00:16:11.640 | So we fixed that.
00:16:13.540 | So then we could just return this to our users.
00:16:18.140 | But yeah, we don't want to run this every time.
00:16:22.140 | To be honest, we don't want to run this code anyway,
00:16:25.240 | but let's just stick with it.
00:16:27.640 | And what we're going to do is actually just put all of this
00:16:30.940 | into a function, right?
00:16:33.640 | So we're going to say, okay, we have this chat function.
00:16:37.240 | We're going to pass the conversation chain into there
00:16:40.140 | and we're also going to pass out query.
00:16:41.840 | And what it's going to do is create a response
00:16:45.540 | like we did before.
00:16:47.140 | It was up here.
00:16:51.240 | So we're just running this, but here.
00:16:55.640 | And then we're going to do all of the trimming that we do after.
00:17:00.140 | And then we're going to return that final response.
00:17:03.740 | So we run this and let's try
00:17:07.840 | and take a look at what we get with this new function
00:17:11.440 | that just adds that extra step onto the end there.
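
Wrapped into a function (the name here is illustrative), that looks roughly like:

```python
def chatbot(chain, query: str) -> str:
    # Run the conversation chain to get the raw response
    response = chain.predict(input=query)
    # Trim any trailing turn markers from the generated text
    response = trim_response(response)
    # Keep the stored conversation history consistent with what we return
    chain.memory.chat_memory.messages[-1].content = response
    return response

print(chatbot(chat, "Explain to me the difference between nuclear fission and fusion."))
```
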
00:17:13.740 | Okay, so I'm going to say explain to me
00:17:15.540 | the difference between nuclear fission and fusion.
00:17:17.740 | You can see our previous interactions in the verbose output here.
00:17:23.340 | So hi, how are you?
00:17:24.940 | I'm just a compute program, so on and so on.
00:17:26.840 | All right, so we have that conversation log
00:17:28.840 | and we're just passing all of this to the model
00:17:31.840 | and saying, you know, please continue.
00:17:33.840 | Okay, so it does continue
00:17:37.340 | and it explains to us what these are, right?
00:17:40.640 | We can see that this is a different response
00:17:42.640 | to what we got earlier
00:17:43.540 | because we now have like the prompt instructions earlier on.
00:17:47.740 | So that has modified how the model is actually answering this.
00:17:51.140 | And then what I'm going to do, okay,
00:17:53.240 | I'm going to continue the conversation,
00:17:54.840 | but I'm going to ask a question that requires the model
00:17:58.440 | to have seen the previous interactions within the conversation.
00:18:01.540 | So I'm going to say, could you explain this like I am five, right?
00:18:04.840 | So we run that.
00:18:06.840 | Okay, we can see those previous parts of the conversation.
00:18:11.540 | So I'm just asking for a simpler version of this,
00:18:14.840 | something that I can understand easier.
00:18:18.040 | And let's see what it gets.
00:18:20.540 | And it's like, okay, sure.
00:18:23.140 | In simpler terms, nuclear fission is like breaking apart a toy car
00:18:27.540 | to get the pieces inside.
00:18:29.240 | It requires forceful intervention to break down the original structure.
00:18:32.940 | Meanwhile, nuclear fusion is more like putting together Lego blocks
00:18:37.640 | to make something new.
00:18:39.140 | You're taking separate elements and joining them together
00:18:41.540 | to create something different than before.
00:18:43.440 | Both processes release energy,
00:18:46.140 | but they approach it from opposite directions.
00:18:48.640 | So I think that's actually a really good way of explaining it,
00:18:52.040 | like much easier for me to understand than the previous one.
00:18:56.040 | So I think that that's pretty cool.
00:18:59.140 | So with that, we've built our open source chatbot
00:19:04.640 | using MPT-30B.
00:19:06.240 | Performance, as you can see, is pretty good.
00:19:09.640 | We need to make some accommodations,
00:19:12.040 | like having those stopping criteria in there.
00:19:14.040 | If you're building this out into a more fully-fledged solution,
00:19:17.740 | obviously, you're going to need more stopping criteria.
00:19:19.740 | For sure, you're going to need
00:19:20.740 | better trimming or post-processing
00:19:24.740 | than what I did there.
00:19:26.240 | But at high level, you don't really need too much.
00:19:31.140 | And the performance is, we can see, it's actually pretty good.
00:19:35.540 | So that's really cool.
00:19:37.640 | The context window that we can use with this model
00:19:40.540 | is pretty large as well.
00:19:41.840 | So I read that it was about 8K,
00:19:44.440 | but I also saw in some of their documentation
00:19:46.940 | that it goes up to 16K.
00:19:48.840 | I'm not quite sure why there's a difference in what I've read.
00:19:52.540 | Maybe it's the chat model versus the base model.
00:19:56.840 | But in any case, you can go up to at least 8K,
00:19:59.740 | which is on par with the original GPT-4,
00:20:04.540 | which is pretty good.
00:20:06.340 | Anyway, I hope this has all been useful.
00:20:09.540 | I'm going to leave it there for this video.
00:20:11.240 | So thank you very much for watching
00:20:13.740 | and I will see you again in the next one.
00:20:16.940 | (upbeat music)