
Using NEW MPT-7B in Hugging Face and LangChain


Chapters

0:00 Open Source LLMs like MPT-7B
0:50 MPT-7B Models in Hugging Face
2:29 Python setup
4:16 Initializing MPT-7B-Instruct
6:28 Initializing the MPT-7B tokenizer
7:10 Stopping Criteria and HF Pipeline
9:52 Hugging Face Pipeline
14:18 Generating Text with Hugging Face
16:01 Implementing MPT-7B in LangChain
17:08 Final Thoughts on Open Source LLMs

Whisper Transcript

00:00:00.000 | Today we're gonna talk about using open source models
00:00:03.140 | in Hugging Face and LangChain.
00:00:05.480 | We're going to be focusing specifically
00:00:07.400 | on the MPT-7B model,
00:00:10.360 | which I'm sure some of you have heard of,
00:00:12.600 | as one of the fine-tuned versions of this model
00:00:16.600 | actually has a context window of 65,000 tokens,
00:00:21.600 | which is pretty huge.
00:00:25.000 | At the moment of recording this video,
00:00:28.200 | GPT-4, the one that's generally available to people,
00:00:32.280 | has a context window of 8,000 tokens,
00:00:35.800 | and they have a version that goes up to 32,000,
00:00:40.200 | but I'm actually not aware of anyone
00:00:42.000 | that has access to that at the moment.
00:00:44.280 | So basically we're limited with GPT-4 to 8,000 tokens.
00:00:49.280 | Now, MPT-7B, like I said,
00:00:53.880 | we can have that version with the huge context window,
00:00:55.920 | but there are also a lot of other models
00:00:57.720 | that are available as well.
00:00:59.120 | So let me just go ahead and show you those very quickly.
00:01:02.200 | So just head over to Hugging Face,
00:01:04.480 | which is where we're gonna pull these models from.
00:01:06.920 | And you can see actually straight away,
00:01:08.960 | we have these four models.
00:01:10.840 | So the MPT-7B is the core, that's the pre-trained model,
00:01:15.840 | that's the foundation model.
00:01:17.840 | Then we have StoryWriter, Chat, and Instruct.
00:01:19.640 | These are all fine-tuned models.
00:01:21.280 | So StoryWriter is the one you've probably heard about,
00:01:24.000 | which has a max context window of 65,000 tokens,
00:01:29.000 | which is pretty huge.
00:01:30.880 | And in reality, it actually goes up to higher.
00:01:33.640 | So I believe they say, ah, here, right?
00:01:36.920 | So we demonstrate generations as long as 84,000 tokens,
00:01:41.280 | which is, I would say, pretty impressive.
00:01:44.400 | And then if we, actually, we can come over to here,
00:01:47.240 | scroll down, and we can see the other models as well.
00:01:50.200 | So we have this Chat model, the Instruct model,
00:01:52.640 | and obviously the Foundation model.
00:01:54.160 | We're gonna be using the Instruct model
00:01:55.720 | because, I mean, most of the use cases I see
00:02:00.200 | kind of rely on us providing instructions to these models.
00:02:04.600 | And therefore, I think most people out there
00:02:07.960 | actually are going to want to use this model, okay?
00:02:10.400 | Because, yeah, we can give it instructions,
00:02:12.480 | it's gonna be able to follow them better than the others.
00:02:15.280 | So, yeah, we're gonna see how we can use this in both.
00:02:18.840 | So initially in Hugging Face,
00:02:20.120 | we're gonna see how we can load that into Hugging Face,
00:02:22.000 | and then we're gonna see how we can take that
00:02:24.400 | and actually load it into LangChain,
00:02:25.960 | which obviously has a few more features
00:02:28.000 | on the agent side of things.
00:02:29.520 | Okay, so the first thing we're gonna want to do
00:02:31.600 | is actually do a few pip installs.
00:02:35.240 | So we have Transformers, Accelerate.
00:02:37.640 | So Accelerate, we need that in order to basically optimize
00:02:41.280 | how we're running this on our GPU.
00:02:44.320 | We will want to run this on a GPU,
00:02:46.040 | otherwise you're going to be waiting
00:02:48.240 | an impossibly long time.
00:02:50.000 | So, yeah, if you don't have access to a GPU,
00:02:53.600 | I would recommend you figure that out.
00:02:57.360 | So right now I'm running this on Colab,
00:02:59.280 | and actually there'll be a link to this notebook
00:03:01.160 | as well on the top of the video.
00:03:03.000 | So from Colab, you can run on GPU, okay?
00:03:05.960 | So you just go to Runtime, Change Runtime Type.
00:03:09.920 | You're initially maybe on None, so you click GPU.
00:03:13.560 | GPU Type, so I'm using T4, which is the smallest one on here
00:03:17.120 | and the standard version of T4
00:03:19.080 | you can get on the free version of Colab.
00:03:22.080 | But for me, that wasn't actually big enough
00:03:25.720 | to run the MPT-7B model, unfortunately.
00:03:30.720 | So I'm currently on Colab Pro now, thanks to this model.
00:03:35.760 | And with that, I can switch up to the high-RAM version.
00:03:41.280 | Now, obviously you have to pay for that,
00:03:43.440 | but you don't have to pay that much, okay?
00:03:46.640 | It's not a significant cost.
00:03:49.800 | But of course, I know this will be limiting for some people,
00:03:52.640 | but this is the best and cheapest option
00:03:55.440 | I can find right now.
00:03:57.240 | Okay, so back on the installs, we have einops.
00:03:59.920 | So this is, again, it's used by the MPT model.
00:04:02.880 | Naturally, we're gonna be using LangChain.
00:04:04.760 | And do I use Wikipedia here?
00:04:07.160 | I actually don't think I use this anymore.
00:04:09.440 | And xformers is just for an optimization
00:04:12.120 | in our transformer functions.
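As a rough sketch, the installs mentioned above would look something like this in a Colab cell (package names follow the transcript; pinned versions aren't given in the video):

```python
# Minimal sketch of the installs described above; no versions are pinned here,
# so this just grabs recent releases of each package.
!pip install -qU transformers accelerate einops langchain xformers
```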
00:04:15.360 | Okay, once we have all those installed,
00:04:17.520 | we come down here,
00:04:18.360 | and this is where we initialize the model.
00:04:21.160 | Okay, so like I said, we're gonna be using
00:04:22.760 | the instruct model, okay?
00:04:25.200 | One thing, so if you do want to use StoryWriter
00:04:28.480 | and you want to use that huge context window,
00:04:31.360 | you would go StoryWriter, okay?
00:04:35.200 | And then here, you would write,
00:04:37.400 | I don't know, what is it, 65,000, which is kind of nuts.
00:04:42.160 | But in order to run that,
00:04:43.400 | you're gonna definitely need more than a T4 GPU.
00:04:46.280 | Basically, the higher the max sequence length is,
00:04:49.040 | the bigger your GPU memory is going to need to be.
00:04:52.360 | So yeah, you need something big to run that.
00:04:55.280 | But we're just gonna stick with this, instruct.
00:04:57.800 | This 2048 is the typical or the standard sequence length
00:05:02.640 | for these other models.
00:05:04.200 | So that's for the base model, the foundation model,
00:05:07.360 | instruct, and chat.
00:05:08.640 | And this is also something important.
00:05:10.800 | So the trust remote code, we have to have that
00:05:13.120 | because essentially the MPT models
00:05:16.720 | are not fully supported by HuggingFace yet.
00:05:19.920 | So we have to rely on this remote code
00:05:22.720 | that is basically stored in the model directory for this
00:05:27.480 | to set up all the endpoints and everything for the model.
00:05:30.840 | Okay, then we switch the model to evaluation mode.
00:05:35.200 | So that just switches a few options within the model
00:05:39.560 | that says, okay, we're not training,
00:05:41.280 | we're now performing inference.
00:05:43.240 | Okay, we're now doing predictions.
00:05:45.000 | And then we want to move our model to device.
00:05:48.400 | So the device, we decided here.
00:05:50.880 | Okay, so CUDA, and then we have CUDA current device.
00:05:54.120 | If we scroll down to the end here,
00:05:55.640 | we should see what that moved it to.
00:05:58.080 | Yeah, so model loaded to cuda:0.
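A minimal sketch of that initialization, assuming the model card's AutoConfig route for setting `max_seq_len` and the standard `transformers` loading API (half precision is my assumption here to keep the model within GPU memory; it isn't stated in the video):

```python
from torch import cuda, bfloat16
import transformers

model_name = 'mosaicml/mpt-7b-instruct'  # or 'mosaicml/mpt-7b-storywriter'

# trust_remote_code=True pulls in the MPT modelling code stored in the model repo
config = transformers.AutoConfig.from_pretrained(model_name, trust_remote_code=True)
config.max_seq_len = 2048  # StoryWriter supports far longer, but needs far more GPU memory

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    trust_remote_code=True,
    torch_dtype=bfloat16,  # half precision; an assumption, not stated in the video
)
model.eval()  # inference mode, not training

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
model.to(device)
print(f"Model loaded on {device}")
```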
00:06:01.160 | Now, just one thing,
00:06:02.560 | this takes a little bit of time to run, okay?
00:06:05.160 | Like here, it just took a minute.
00:06:06.840 | I think that's because most of the model
00:06:08.920 | was probably already downloaded for me.
00:06:10.640 | If you're downloading and initializing this,
00:06:12.640 | expect to wait like five, 10 minutes, at least on Colab.
00:06:17.120 | But once that has been downloaded,
00:06:18.920 | you should be good to use it to basically initialize it.
00:06:23.920 | And it will just take like a minute or so
00:06:26.280 | because you only need to download it once.
00:06:28.440 | Okay, and then we initialize our tokenizer.
00:06:31.280 | So the tokenizer is actually using this,
00:06:34.080 | EleutherAI's GPT-NeoX-20B tokenizer.
00:06:38.160 | This is just, this is a tokenizer.
00:06:40.040 | So when I say tokenizer,
00:06:41.960 | it's basically the thing that will translate
00:06:44.600 | from human readable plain text
00:06:47.760 | to transformer- or large language model-
00:06:51.080 | readable token IDs, right?
00:06:53.200 | So it's gonna convert, like, a word
00:06:55.920 | into a token ID, 41 for example, right?
00:07:00.320 | And then they get fed into the large language model.
00:07:03.720 | Now the MPT-7B model was trained using this tokenizer here.
00:07:07.840 | Right, so we have to use that tokenizer.
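The tokenizer initialization itself is a one-liner; a sketch using the standard `AutoTokenizer` API:

```python
from transformers import AutoTokenizer

# MPT-7B was trained with the GPT-NeoX-20B tokenizer, so we load that one
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```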
00:07:10.000 | Then what we need to do
00:07:11.280 | is define a stopping criteria for the model.
00:07:13.600 | So I should, I don't know if I mentioned this,
00:07:15.800 | but right now what we're doing
00:07:17.240 | is actually initializing the Hugging Face pipeline.
00:07:20.920 | So within that pipeline, we have the large language model,
00:07:23.920 | the tokenizer, both of those we've just created
00:07:26.440 | and also stopping criteria object, right?
00:07:28.840 | Stopping criteria object,
00:07:30.960 | let me come down to where we create it, is this here.
00:07:35.440 | Okay, so basically MPT-7B has been trained
00:07:40.440 | to add this particular bit of text
00:07:43.440 | at the end of its generations,
00:07:45.280 | when it's like, okay, I'm finished, right?
00:07:48.600 | But there's nothing within that model
00:07:51.640 | that will stop it from actually generating text
00:07:54.440 | at that point, right?
00:07:55.840 | It will just, it will generate this, right?
00:07:58.320 | And then it will actually just continue generating text.
00:08:03.320 | And the text that it generates after this
00:08:05.240 | is generally just going to be gibberish
00:08:07.200 | because it's been trained to generate this
00:08:09.960 | at the end of a meaningful answer, right?
00:08:12.800 | After generating this,
00:08:14.040 | it's able to just begin generating anything, okay?
00:08:17.760 | It's not going to be useful stuff.
00:08:20.960 | So what we need to do is define this
00:08:24.560 | as a stopping criteria for the model.
00:08:26.560 | We need to go in there and say,
00:08:28.080 | okay, when the model says end of text,
00:08:30.640 | when it gives this token to us, we stop, right?
00:08:33.280 | We need to specify that.
00:08:35.280 | And we do that using this stopping criteria list object.
00:08:39.960 | Okay, so that requires a stopping criteria object,
00:08:44.360 | which we've defined here.
00:08:45.680 | So, I mean, you can see this.
00:08:48.000 | So these parameters are just the default parameters needed
00:08:51.440 | by this stopping criteria object.
00:08:53.840 | And basically what it's going to do is say,
00:08:56.800 | okay, for each stop ID.
00:08:58.600 | So we have these stop token IDs.
00:09:01.200 | Maybe I can just show you these.
00:09:03.240 | Maybe that's easier.
00:09:04.440 | So stop token IDs,
00:09:09.160 | and it's just going to be a few integers, right?
00:09:12.560 | Those integers, actually it's one integer,
00:09:15.440 | which represents this, right?
00:09:17.880 | So I said before, the tokenizer
00:09:19.480 | translates from plain text to the token IDs.
00:09:22.240 | That's what this is.
00:09:23.080 | This is the plain text version.
00:09:24.800 | This is the token ID version, right?
00:09:27.080 | And it's going to say, okay, for the stop ID here,
00:09:30.560 | so actually just for zero,
00:09:32.800 | if the input IDs,
00:09:34.960 | so the last input ID is equal to that,
00:09:38.400 | we're going to say, okay, it's time to stop, right?
00:09:41.480 | Otherwise it's not time to stop, you can keep going.
00:09:44.480 | And that's it, okay?
00:09:46.440 | So that gives us our stopping criteria object.
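A sketch of that stopping criteria, assuming a small custom `StoppingCriteria` subclass; the class and variable names here are mine, not necessarily the notebook's:

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# the end-of-text marker the model emits when it thinks it is finished;
# with the GPT-NeoX tokenizer this maps to a single token ID (0)
stop_token_ids = tokenizer.convert_tokens_to_ids(["<|endoftext|>"])

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor, **kwargs) -> bool:
        # stop as soon as the most recently generated token is a stop ID
        for stop_id in stop_token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```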
00:09:49.240 | And then we just pass that into our pipeline.
00:09:52.920 | So the pipeline is basically the tokenization,
00:09:57.120 | the model and the generation from that model,
00:10:00.520 | and then also this stopping criteria,
00:10:02.680 | all packaged into a nice little function.
00:10:04.960 | So within that pipeline,
00:10:07.120 | we pass in obviously our model, our tokenizer,
00:10:09.280 | and the stopping criteria,
00:10:10.640 | but there's also a few things we need as well.
00:10:13.360 | So return full text.
00:10:15.360 | So if we have this false,
00:10:16.720 | it's just going to return the generated part
00:10:19.920 | or generated portion of the text.
00:10:23.200 | And that's fine, you can do that.
00:10:24.400 | There's actually no problem with that.
00:10:26.440 | But if you want to use this in LangChain,
00:10:28.680 | we need to return the generated text
00:10:31.600 | and also the input text.
00:10:33.840 | We need to return full text
00:10:35.480 | because we're going to be using LangChain later.
00:10:37.320 | That's why we set return full text equal to true.
00:10:40.280 | If you were just wanting to use this in Hugging Face,
00:10:42.400 | you don't need to have this set to true.
00:10:45.360 | Then our task here is text generation.
00:10:49.000 | Okay, so this just says, okay, we want to generate text.
00:10:51.840 | The device here is important.
00:10:53.560 | We obviously want to use our CUDA enabled GPU.
00:10:55.920 | So we set that.
00:10:56.880 | And then we have a few other model
00:10:59.080 | specific parameters down here.
00:11:01.160 | Or we could call them generation
00:11:02.600 | specific parameters as well.
00:11:04.520 | So the temperature is like the randomness of your output.
00:11:08.160 | Zero is the minimum.
00:11:09.200 | It's basically zero randomness
00:11:11.640 | and one is maximum randomness.
00:11:14.080 | Okay, so imagine it's kind of like how random
00:11:18.280 | the predicted tokens or the next words are going to be.
00:11:22.560 | Then we have top P.
00:11:23.600 | So with top P, we're basically going to select
00:11:27.560 | from the top tokens on each prediction,
00:11:30.360 | those whose probabilities add up to 15%.
00:11:33.240 | And I would recommend if you want to read about this,
00:11:36.440 | I'd recommend looking at this page from Cohere.
00:11:39.360 | So there'll be a link at the top of the video right now.
00:11:43.200 | They explain this really nicely.
00:11:45.000 | So yeah, you can kind of see
00:11:47.640 | they use 0.15 here as well, right?
00:11:50.760 | So consider only top tokens whose likelihoods
00:11:54.000 | add up to that 15% and then ignore the others.
00:11:56.520 | So with each step, right?
00:11:58.680 | Each generation step, you're predicting the next token
00:12:02.480 | or the next word.
00:12:03.520 | You can think of it like that.
00:12:05.040 | And by setting top P equal to 0.15,
00:12:09.160 | we're just going to consider the possible next words,
00:12:13.680 | 'cause we're predicting over all of the words
00:12:15.680 | in that tokenizer's vocabulary.
00:12:17.520 | We're going to consider the top words
00:12:21.560 | whose likelihoods together add up to 15%, right?
00:12:27.240 | The total, okay?
00:12:29.040 | So you can see that there, they visualize it very nicely.
00:12:32.240 | I don't think my explanation
00:12:34.880 | can compare to this visualization.
00:12:37.520 | Okay, and then we have top K.
00:12:39.440 | This is another value, kind of similar thing, right?
00:12:42.160 | So top K, if we come up to here,
00:12:45.640 | and this is easy to explain,
00:12:47.720 | we're picking from the top K tokens, right?
00:12:51.160 | So in this case, if you had top K equal to one,
00:12:55.400 | it would only select United
00:12:57.680 | or it could only decide on selecting United.
00:13:00.600 | If you had top K equal to two,
00:13:01.960 | you could do United or Netherlands.
00:13:04.240 | Top K equal to three,
00:13:05.200 | you could choose any of these top three, right?
00:13:08.240 | That is what the top K is actually doing.
00:13:12.200 | And actually you can visualize that here as well.
00:13:15.080 | Okay, and okay, what I've done here is
00:13:17.920 | set top K equal to zero.
00:13:19.320 | That's because I don't want to consider top K
00:13:22.120 | because I'm already defining the limits
00:13:25.800 | on the number of tokens to decide from using top P, okay?
00:13:30.200 | So I don't activate the top K there.
00:13:32.760 | Then we have the max new tokens,
00:13:35.480 | the max number of tokens to generate in the output.
00:13:38.400 | So with each generation,
00:13:40.200 | I'm saying I don't want you to generate
00:13:41.920 | any more than 64 tokens.
00:13:43.480 | You can increase that, right?
00:13:45.280 | So the max context window,
00:13:46.880 | so that's inputs and outputs for this model,
00:13:50.240 | we've already set it to
00:13:52.200 | that max sequence length from earlier, 2048.
00:13:55.480 | So you can go much higher than 64 that I've set here.
00:13:59.080 | And then also we have this repetition penalty.
00:14:02.280 | That's super important because otherwise
00:14:04.800 | this is going to start repeating things
00:14:07.480 | over and over again.
00:14:08.680 | So the default value for that is actually one,
00:14:10.960 | and with that we can see more repetition.
00:14:13.600 | We switch that to 1.1
00:14:14.920 | and we're generally not going to see that anymore.
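Putting that together, the pipeline described above might look roughly like this; the parameter values follow the transcript, and `generate_text` is just my name for the resulting object:

```python
import transformers

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,      # include the prompt in the output, needed for LangChain later
    task='text-generation',
    device=device,              # run on the CUDA-enabled GPU
    stopping_criteria=stopping_criteria,  # stop on <|endoftext|>
    temperature=0.0,            # minimum randomness in the output
    top_p=0.15,                 # only sample from tokens whose probabilities sum to 15%
    top_k=0,                    # 0 disables top-k, so top-p alone limits the candidates
    max_new_tokens=64,          # cap on generated tokens per call
    repetition_penalty=1.1,     # >1.0 discourages the model from repeating itself
)
```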
00:14:18.360 | Okay, so let's run this.
00:14:20.960 | So we say, explain to me the difference
00:14:23.400 | between nuclear fission and fusion.
00:14:26.080 | So this is from an example somewhere.
00:14:28.680 | I think it was Hugging Face,
00:14:30.680 | but I don't actually remember
00:14:33.000 | where I got that from exactly.
00:14:34.600 | If anyone does know, feel free to mention that in the comments.
00:14:39.000 | So we have the input, okay?
00:14:42.200 | So we said return full text.
00:14:46.280 | So we have the input here.
00:14:47.680 | And then we also have the output.
00:14:48.760 | So nuclear fission is a process that splits heavy atoms
00:14:51.440 | into smaller, lighter ones, so on and so on.
00:14:53.960 | Nuclear fusion occurs when two light
00:14:56.720 | atomic nuclei are combined.
00:14:58.840 | As far as I know, that is correct.
00:15:01.680 | So that looks pretty good.
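Calling the pipeline is then a single line, something like:

```python
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])  # full text: the prompt followed by the generated answer
```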
00:15:03.440 | And then I've also added a note here
00:15:05.120 | on if you'd like to use the Triton optimized implementations.
00:15:10.120 | So Triton in this scenario, as far as I understand,
00:15:14.040 | is the way that the attention is implemented.
00:15:17.480 | It can be implemented either in PyTorch,
00:15:19.600 | which is what we're using by default.
00:15:21.800 | It can be implemented with flash attention or using Triton.
00:15:26.800 | And if you use Triton, it's gonna use more memory,
00:15:30.560 | but it will be faster
00:15:31.640 | when you're actually performing inference.
00:15:33.560 | So you can do that.
00:15:34.960 | The reason I haven't used it here
00:15:36.040 | is because the install takes just an insanely long time.
00:15:40.200 | So I just gave up with that.
00:15:43.520 | But as far as I know,
00:15:45.400 | this sort of setup here should work.
00:15:47.600 | So you pip install Triton and you go through
00:15:49.960 | and then this should work, okay.
00:15:52.320 | Just be wary of that added memory usage.
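For reference, the Triton variant is enabled through the model config as described in the MPT model card; a rough, untested sketch (it additionally needs `pip install triton`, which is slow, plus extra GPU memory):

```python
# Sketch of switching the attention implementation to Triton, per the model card
config = transformers.AutoConfig.from_pretrained(model_name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.max_seq_len = 2048

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    trust_remote_code=True,
    torch_dtype=bfloat16,
)
```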
00:15:55.160 | So yeah, we've seen, okay,
00:15:56.800 | this is how we're gonna use this in Hugging Face.
00:15:59.600 | So generating text.
00:16:01.440 | Now let's move on to the LangChain side of things.
00:16:04.240 | So how do we implement this in LangChain?
00:16:06.680 | Okay, so we're gonna use this
00:16:07.960 | with the simplest chain possible.
00:16:10.160 | So the LLM chain.
00:16:11.840 | For the LLM, we're going to initialize it
00:16:13.880 | via the HuggingFace pipeline,
00:16:15.440 | which is basically a local Hugging Face model.
00:16:18.400 | And for that, we need our pipeline,
00:16:20.400 | which we have conveniently already initialized up here.
00:16:23.760 | So we just pass that into there.
00:16:26.840 | We have our prompt template.
00:16:29.560 | Okay, nothing special, right?
00:16:31.600 | It's just the instruction here.
00:16:33.560 | So basically we have some inputs and that's it.
00:16:36.560 | I'm just defining that
00:16:37.720 | so that we can define this LLM chain.
00:16:40.200 | Okay, we initialize that.
00:16:42.200 | And then we come down to here
00:16:43.280 | and we can use the LLM chain to predict.
00:16:46.040 | And for the prediction,
00:16:47.640 | we just pass in those instructions again.
00:16:50.160 | Okay, so same question as before.
00:16:51.800 | So in this case, we should get pretty much the same answer.
00:16:55.360 | So we can run that.
00:16:56.880 | Okay, and the output we get there is this.
00:16:58.680 | So as far as I can tell,
00:17:00.600 | it's pretty much the same as what we got last time.
00:17:04.600 | Okay, so looks good.
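The LangChain side boils down to a few lines; a sketch, assuming the LangChain APIs current at the time of the video (`HuggingFacePipeline`, `PromptTemplate`, `LLMChain`), which may have moved in later versions:

```python
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# wrap the local Hugging Face pipeline so LangChain can treat it as an LLM
llm = HuggingFacePipeline(pipeline=generate_text)

# a bare-bones prompt template that just passes the instruction straight through
prompt = PromptTemplate(input_variables=["instruction"], template="{instruction}")
llm_chain = LLMChain(llm=llm, prompt=prompt)

print(llm_chain.predict(
    instruction="Explain to me the difference between nuclear fission and fusion."
))
```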
00:17:08.120 | And with that, we've now implemented MPT-7B
00:17:10.720 | in both Hugging Face and also LangChain as well.
00:17:15.120 | So naturally, if you just want to generate text,
00:17:18.280 | you can use Hugging Face.
00:17:19.440 | But obviously, if you want to have access
00:17:21.160 | to all of the features that LangChain offers,
00:17:24.200 | all the chains, agents, all this sort of stuff,
00:17:26.840 | then you obviously just tack on this extra step
00:17:30.120 | and you have your original Hugging Face pipeline
00:17:33.840 | now integrated with LangChain,
00:17:36.480 | which I think is pretty cool and super easy to do.
00:17:39.320 | It's not that difficult.
00:17:40.800 | So with that, that's the end of this video.
00:17:43.800 | We've explored how we can actually begin
00:17:46.520 | using open source models in LangChain,
00:17:49.520 | which I think opens up a lot of opportunities for us.
00:17:52.600 | You know, fine tuning models, just using smaller models.
00:17:56.720 | Maybe you don't always need like a big GPT-4
00:18:00.160 | for all of our use cases.
00:18:03.000 | So I think this is the sort of thing
00:18:05.560 | where we'll see a lot more going forwards,
00:18:07.760 | a lot more open source, smaller models being used.
00:18:10.840 | Of course, I still think OpenAI is gonna be used plenty,
00:18:14.240 | because honestly, in terms of performance,
00:18:16.400 | there are no open source models
00:18:18.000 | that are genuinely comparable to GPT-3.5
00:18:22.440 | or GPT-4 at the moment.
00:18:24.280 | You know, maybe going forwards, there will be eventually,
00:18:26.920 | but right now, we're not quite there.
00:18:29.560 | So yeah, that's it for this video.
00:18:31.680 | I hope all this has been interesting and useful.
00:18:34.240 | Thank you very much for watching
00:18:35.960 | and I will see you again in the next one.