
Using NEW MPT-7B in Hugging Face and LangChain


Chapters

0:00 Open Source LLMs like MPT-7B
0:50 MPT-7B Models in Hugging Face
2:29 Python setup
4:16 Initializing MPT-7B-Instruct
6:28 Initializing the MPT-7B tokenizer
7:10 Stopping Criteria and HF Pipeline
9:52 Hugging Face Pipeline
14:18 Generating Text with Hugging Face
16:01 Implementing MPT-7B in LangChain
17:08 Final Thoughts on Open Source LLMs

Whisper Transcript

00:00:00.000 | Today we're gonna talk about using open source models
00:00:03.140 | in Hugging Face and LangChain.
00:00:05.480 | We're going to be focusing specifically
00:00:07.400 | on the MPT-7B model,
00:00:10.360 | which I'm sure some of you have heard of,
00:00:12.600 | as one of the fine-tuned versions of this model
00:00:16.600 | actually has a context window of 65,000 tokens,
00:00:21.600 | which is pretty huge.
00:00:25.000 | At the moment of recording this video,
00:00:28.200 | GPT-4, the one that's generally available to people,
00:00:32.280 | has a context window of 8,000 tokens,
00:00:35.800 | and they have a version that goes up to 32,000,
00:00:40.200 | but I'm actually not aware of anyone
00:00:42.000 | that has access to that at the moment.
00:00:44.280 | So basically we're limited with GPT-4 to 8,000 tokens.
00:00:49.280 | Now, MPT-7B, like I said,
00:00:53.880 | we can have that version with the huge context window,
00:00:55.920 | but there are also a lot of other models
00:00:57.720 | that are available as well.
00:00:59.120 | So let me just go ahead and show you those very quickly.
00:01:02.200 | So just head over to Hugging Face,
00:01:04.480 | which is where we're gonna pull these models from.
00:01:06.920 | And you can see actually straight away,
00:01:08.960 | we have these four models.
00:01:10.840 | So the MPT-7B is the core, that's the pre-trained model,
00:01:15.840 | that's the foundation model.
00:01:17.840 | Then we have StoryWriter, Chat, and Instruct.
00:01:19.640 | These are all fine-tuned models.
00:01:21.280 | So StoryWriter is the one you've probably heard about,
00:01:24.000 | which has a max context window of 65,000 tokens,
00:01:29.000 | which is pretty huge.
00:01:30.880 | And in reality, it actually goes up to higher.
00:01:33.640 | So I believe they say, ah, here, right?
00:01:36.920 | So we demonstrate generations as long as 84,000 tokens,
00:01:41.280 | which is, I would say, pretty impressive.
00:01:44.400 | And then if we, actually, we can come over to here,
00:01:47.240 | scroll down, and we can see the other models as well.
00:01:50.200 | So we have this Chat model, the Instruct model,
00:01:52.640 | and obviously the Foundation model.
00:01:54.160 | We're gonna be using the Instruct model
00:01:55.720 | because, I mean, most of the use cases I see
00:02:00.200 | kind of rely on us providing instructions to these models.
00:02:04.600 | And therefore, I think most people out there
00:02:07.960 | actually are going to want to use this model, okay?
00:02:10.400 | Because, yeah, we can give it instructions,
00:02:12.480 | it's gonna be able to follow them better than the others.
00:02:15.280 | So, yeah, we're gonna see how we can use this in both.
00:02:18.840 | So initially in Hugging Face,
00:02:20.120 | we're gonna see how we can load that into Hugging Face,
00:02:22.000 | and then we're gonna see how we can take that
00:02:24.400 | and actually load it into LangChain,
00:02:25.960 | which obviously has a few more features
00:02:28.000 | on the agent side of things.
00:02:29.520 | Okay, so the first thing we're gonna want to do
00:02:31.600 | is actually do a few pip installs.
00:02:35.240 | So we have Transformers, Accelerate.
00:02:37.640 | So Accelerate, we need that in order to basically optimize
00:02:41.280 | how we're running this on our GPU.
00:02:44.320 | We will want to run this on a GPU,
00:02:46.040 | otherwise you're going to be waiting
00:02:48.240 | an impossibly long time.
00:02:50.000 | So, yeah, if you don't have access to a GPU,
00:02:53.600 | I would recommend you figure that out.
00:02:57.360 | So right now I'm running this on Colab,
00:02:59.280 | and actually there'll be a link to this notebook
00:03:01.160 | as well on the top of the video.
00:03:03.000 | So from Colab, you can run on GPU, okay?
00:03:05.960 | So you just go to Runtime, Change Runtime Type.
00:03:09.920 | You're initially maybe on None, so you click GPU.
00:03:13.560 | GPU Type, so I'm using T4, which is the smallest one on here
00:03:17.120 | and the standard version of T4
00:03:19.080 | you can get on the free version of Colab.
00:03:22.080 | But for me, that wasn't actually big enough
00:03:25.720 | to run the MPT-7B model, unfortunately.
00:03:30.720 | So I'm currently on Colab Pro now, thanks to this model.
00:03:35.760 | And with that, I can switch up to the high-RAM version.
00:03:41.280 | Now, obviously you have to pay for that,
00:03:43.440 | but you don't have to pay that much, okay?
00:03:46.640 | It's not a significant cost.
00:03:49.800 | But of course, I know this will be limiting for some people,
00:03:52.640 | but this is the best and cheapest option
00:03:55.440 | I can find right now.
00:03:57.240 | Okay, so back on the installs, we have einops.
00:03:59.920 | So this is, again, it's used by the MPT model.
00:04:02.880 | Naturally, we're gonna be using LangChain.
00:04:04.760 | And do I use Wikipedia here?
00:04:07.160 | I actually don't think I use this anymore.
00:04:09.440 | And xformers is just for an optimization
00:04:12.120 | in our transformer functions.
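As a rough sketch, the installs mentioned above would look something like this in a Colab cell (package names follow the transcript; pinned versions aren't given in the video):

```python
# Minimal sketch of the installs described above; no versions are pinned here,
# so this just grabs recent releases of each package.
!pip install -qU transformers accelerate einops langchain xformers
```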
00:04:15.360 | Okay, once we have all those installed,
00:04:17.520 | we come down here,
00:04:18.360 | and this is where we initialize the model.
00:04:21.160 | Okay, so like I said, we're gonna be using
00:04:22.760 | the instruct model, okay?
00:04:25.200 | One thing, so if you do want to use StoryWriter
00:04:28.480 | and you want to use that huge context window,
00:04:31.360 | you would go StoryWriter, okay?
00:04:35.200 | And then here, you would write,
00:04:37.400 | I don't know, what is it, 65,000, which is kind of nuts.
00:04:42.160 | But in order to run that,
00:04:43.400 | you're gonna definitely need more than a T4 GPU.
00:04:46.280 | Basically, the higher the max sequence length is,
00:04:49.040 | the bigger your GPU memory is going to need to be.
00:04:52.360 | So yeah, you need something big to run that.
00:04:55.280 | But we're just gonna stick with this, instruct.
00:04:57.800 | This 2048 is the typical or the standard sequence length
00:05:02.640 | for these other models.
00:05:04.200 | So that's for the base model, the foundation model,
00:05:07.360 | instruct, and chat.
00:05:08.640 | And this is also something important.
00:05:10.800 | So the trust remote code, we have to have that
00:05:13.120 | because essentially the MPT models
00:05:16.720 | are not fully supported by HuggingFace yet.
00:05:19.920 | So we have to rely on this remote code
00:05:22.720 | that is basically stored in the model directory for this
00:05:27.480 | to set up all the endpoints and everything for the model.
00:05:30.840 | Okay, then we switch the model to evaluation mode.
00:05:35.200 | So that just switches a few options within the model
00:05:39.560 | that says, okay, we're not training,
00:05:41.280 | we're now performing inference.
00:05:43.240 | Okay, we're now doing predictions.
00:05:45.000 | And then we want to move our model to device.
00:05:48.400 | So the device, we decided here.
00:05:50.880 | Okay, so CUDA, and then we have CUDA current device.
00:05:54.120 | If we scroll down to the end here,
00:05:55.640 | we should see what that moved it to.
00:05:58.080 | Yeah, so model loaded to cuda:0.
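A minimal sketch of that initialization, assuming the model card's AutoConfig route for setting `max_seq_len` and the standard `transformers` loading API (half precision is my assumption here to keep the model within GPU memory; it isn't stated in the video):

```python
from torch import cuda, bfloat16
import transformers

model_name = 'mosaicml/mpt-7b-instruct'  # or 'mosaicml/mpt-7b-storywriter'

# trust_remote_code=True pulls in the MPT modelling code stored in the model repo
config = transformers.AutoConfig.from_pretrained(model_name, trust_remote_code=True)
config.max_seq_len = 2048  # StoryWriter supports far longer, but needs far more GPU memory

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    trust_remote_code=True,
    torch_dtype=bfloat16,  # half precision; an assumption, not stated in the video
)
model.eval()  # inference mode, not training

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
model.to(device)
print(f"Model loaded on {device}")
```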
00:06:01.160 | Now, just one thing,
00:06:02.560 | this takes a little bit of time to run, okay?
00:06:05.160 | Like here, it just took a minute.
00:06:06.840 | I think that's because most of the model
00:06:08.920 | was probably already downloaded for me.
00:06:10.640 | If you're downloading and initializing this,
00:06:12.640 | expect to wait like five, 10 minutes, at least on Colab.
00:06:17.120 | But once that has been downloaded,
00:06:18.920 | you should be good to use it to basically initialize it.
00:06:23.920 | And it will just take like a minute or so
00:06:26.280 | because you only need to download it once.
00:06:28.440 | Okay, and then we initialize our tokenizer.
00:06:31.280 | So the tokenizer is actually using this,
00:06:34.080 | EleutherAI's GPT-NeoX-20B tokenizer.
00:06:38.160 | This is just, this is a tokenizer.
00:06:40.040 | So when I say tokenizer,
00:06:41.960 | it's basically the thing that will translate
00:06:44.600 | from human readable plain text
00:06:47.760 | to transformer- or large language model-
00:06:51.080 | readable token IDs, right?
00:06:53.200 | So it's gonna convert, like, a word
00:06:55.920 | into a token ID, 41 for example, right?
00:07:00.320 | And then they get fed into the large language model.
00:07:03.720 | Now the MPT-7B model was trained using this tokenizer here.
00:07:07.840 | Right, so we have to use that tokenizer.
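The tokenizer initialization itself is a one-liner; a sketch using the standard `AutoTokenizer` API:

```python
from transformers import AutoTokenizer

# MPT-7B was trained with the GPT-NeoX-20B tokenizer, so we load that one
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```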
00:07:10.000 | Then what we need to do
00:07:11.280 | is define a stopping criteria for the model.
00:07:13.600 | So I should, I don't know if I mentioned this,
00:07:15.800 | but right now what we're doing
00:07:17.240 | is actually initializing the Hugging Face pipeline.
00:07:20.920 | So within that pipeline, we have the large language model,
00:07:23.920 | the tokenizer, both of those we've just created
00:07:26.440 | and also stopping criteria object, right?
00:07:28.840 | Stopping criteria object,
00:07:30.960 | let me come down to where we create it, is this here.
00:07:35.440 | Okay, so basically MPT-7B has been trained
00:07:40.440 | to add this particular bit of text
00:07:43.440 | at the end of its generations,
00:07:45.280 | when it's like, okay, I'm finished, right?
00:07:48.600 | But there's nothing within that model
00:07:51.640 | that will stop it from actually generating text
00:07:54.440 | at that point, right?
00:07:55.840 | It will just, it will generate this, right?
00:07:58.320 | And then it will actually just continue generating text.
00:08:03.320 | And the text that it generates after this
00:08:05.240 | is generally just going to be gibberish
00:08:07.200 | because it's been trained to generate this
00:08:09.960 | at the end of a meaningful answer, right?
00:08:12.800 | After generating this,
00:08:14.040 | it's able to just begin generating anything, okay?
00:08:17.760 | It's not going to be useful stuff.
00:08:20.960 | So what we need to do is define this
00:08:24.560 | as a stopping criteria for the model.
00:08:26.560 | We need to go in there and say,
00:08:28.080 | okay, when the model says end of text,
00:08:30.640 | when it gives this token to us, we stop, right?
00:08:33.280 | We need to specify that.
00:08:35.280 | And we do that using this stopping criteria list object.
00:08:39.960 | Okay, so that requires a stopping criteria object,
00:08:44.360 | which we've defined here.
00:08:45.680 | So, I mean, you can see this.
00:08:48.000 | So these parameters are just the default parameters needed
00:08:51.440 | by this stopping criteria object.
00:08:53.840 | And basically what it's going to do is say,
00:08:56.800 | okay, for each stop ID.
00:08:58.600 | So we have these stop token IDs.
00:09:01.200 | Maybe I can just show you these.
00:09:03.240 | Maybe that's easier.
00:09:04.440 | So stop token IDs,
00:09:09.160 | and it's just going to be a few integers, right?
00:09:12.560 | Those integers, actually it's one integer,
00:09:15.440 | which represents this, right?
00:09:17.880 | So I said before, the tokenizer
00:09:19.480 | translates from plain text to the token IDs.
00:09:22.240 | That's what this is.
00:09:23.080 | This is the plain text version.
00:09:24.800 | This is the token ID version, right?
00:09:27.080 | And it's going to say, okay, for the stop ID here,
00:09:30.560 | so actually just for zero,
00:09:32.800 | if the input IDs,
00:09:34.960 | so the last input ID is equal to that,
00:09:38.400 | we're going to say, okay, it's time to stop, right?
00:09:41.480 | Otherwise it's not time to stop, you can keep going.
00:09:44.480 | And that's it, okay?
00:09:46.440 | So that gives us our stopping criteria object.
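A sketch of that stopping criteria, assuming a small custom `StoppingCriteria` subclass; the class and variable names here are mine, not necessarily the notebook's:

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# the end-of-text marker the model emits when it thinks it is finished;
# with the GPT-NeoX tokenizer this maps to a single token ID (0)
stop_token_ids = tokenizer.convert_tokens_to_ids(["<|endoftext|>"])

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor, **kwargs) -> bool:
        # stop as soon as the most recently generated token is a stop ID
        for stop_id in stop_token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```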
00:09:49.240 | And then we just pass that into our pipeline.
00:09:52.920 | So the pipeline is basically the tokenization,
00:09:57.120 | the model and the generation from that model,
00:10:00.520 | and then also this stopping criteria,
00:10:02.680 | all packaged into a nice little function.
00:10:04.960 | So within that pipeline,
00:10:07.120 | we pass in obviously our model, our tokenizer,
00:10:09.280 | and the stopping criteria,
00:10:10.640 | but there's also a few things we need as well.
00:10:13.360 | So return full text.
00:10:15.360 | So if we have this false,
00:10:16.720 | it's just going to return the generated part
00:10:19.920 | or generated portion of the text.
00:10:23.200 | And that's fine, you can do that.
00:10:24.400 | There's actually no problem with that.
00:10:26.440 | But if you want to use this in LangChain,
00:10:28.680 | we need to return the generated text
00:10:31.600 | and also the input text.
00:10:33.840 | We need to return full text
00:10:35.480 | because we're going to be using LangChain later.
00:10:37.320 | That's why we set return full text equal to true.
00:10:40.280 | If you were just wanting to use this in Hugging Face,
00:10:42.400 | you don't need to have this set to true.
00:10:45.360 | Then our task here is text generation.
00:10:49.000 | Okay, so this just says, okay, we want to generate text.
00:10:51.840 | The device here is important.
00:10:53.560 | We obviously want to use our CUDA enabled GPU.
00:10:55.920 | So we set that.
00:10:56.880 | And then we have a few other model
00:10:59.080 | specific parameters down here.
00:11:01.160 | Or we could call them generation
00:11:02.600 | specific parameters as well.
00:11:04.520 | So the temperature is like the randomness of your output.
00:11:08.160 | Zero is the minimum.
00:11:09.200 | It's basically zero randomness
00:11:11.640 | and one is maximum randomness.
00:11:14.080 | Okay, so imagine it's kind of like how random
00:11:18.280 | the predicted tokens or the next words are going to be.
00:11:22.560 | Then we have top P.
00:11:23.600 | So with top P, we're basically going to select
00:11:27.560 | from the top tokens on each prediction,
00:11:30.360 | those whose probabilities add up to 15%.
00:11:33.240 | And I would recommend if you want to read about this,
00:11:36.440 | I'd recommend looking at this page from Cohere.
00:11:39.360 | So there'll be a link at the top of the video right now.
00:11:43.200 | They explain this really nicely.
00:11:45.000 | So yeah, you can kind of see
00:11:47.640 | they use 0.15 here as well, right?
00:11:50.760 | So consider only top tokens whose likelihoods
00:11:54.000 | add up to that 15% and then ignore the others.
00:11:56.520 | So with each step, right?
00:11:58.680 | Each generation step, you're predicting the next token
00:12:02.480 | or the next word.
00:12:03.520 | You can think of it like that.
00:12:05.040 | And by setting top P equal to 0.15,
00:12:09.160 | we're just going to consider the possible next words,
00:12:13.680 | 'cause we're predicting over all of the words
00:12:15.680 | in that tokenizer's vocabulary.
00:12:17.520 | We're going to consider the top words
00:12:21.560 | whose likelihoods together add up to 15%, right?
00:12:27.240 | The total, okay?
00:12:29.040 | So you can see that there, they visualize it very nicely.
00:12:32.240 | I don't think my explanation
00:12:34.880 | can compare to this visualization.
00:12:37.520 | Okay, and then we have top K.
00:12:39.440 | This is another value, kind of similar thing, right?
00:12:42.160 | So top K, if we come up to here,
00:12:45.640 | and this is easy to explain,
00:12:47.720 | we're picking from the top K tokens, right?
00:12:51.160 | So in this case, if you had top K equal to one,
00:12:55.400 | it would only select United
00:12:57.680 | or it could only decide on selecting United.
00:13:00.600 | If you had top K equal to two,
00:13:01.960 | you could do United or Netherlands.
00:13:04.240 | Top K equal to three,
00:13:05.200 | you could choose any of these top three, right?
00:13:08.240 | That is what the top K is actually doing.
00:13:12.200 | And actually you can visualize that here as well.
00:13:15.080 | Okay, and okay, what I've done here is
00:13:17.920 | set top K equal to zero.
00:13:19.320 | That's because I don't want to consider top K
00:13:22.120 | because I'm already defining the limits
00:13:25.800 | on the number of tokens to decide from using top P, okay?
00:13:30.200 | So I don't activate the top K there.
00:13:32.760 | Then we have the max new tokens,
00:13:35.480 | the max number of tokens to generate in the output.
00:13:38.400 | So with each generation,
00:13:40.200 | I'm saying I don't want you to generate
00:13:41.920 | any more than 64 tokens.
00:13:43.480 | You can increase that, right?
00:13:45.280 | So the max context window,
00:13:46.880 | so that's inputs and outputs for this model,
00:13:50.240 | we've already set it to
00:13:52.200 | that max sequence length from earlier, 2048.
00:13:55.480 | So you can go much higher than 64 that I've set here.
00:13:59.080 | And then also we have this repetition penalty.
00:14:02.280 | That's super important because otherwise
00:14:04.800 | this is going to start repeating things
00:14:07.480 | over and over again.
00:14:08.680 | So the default value for that is actually one,
00:14:10.960 | and with that we can see more repetition.
00:14:13.600 | We switch that to 1.1
00:14:14.920 | and we're generally not going to see that anymore.
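Putting that together, the pipeline described above might look roughly like this; the parameter values follow the transcript, and `generate_text` is just my name for the resulting object:

```python
import transformers

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,      # include the prompt in the output, needed for LangChain later
    task='text-generation',
    device=device,              # run on the CUDA-enabled GPU
    stopping_criteria=stopping_criteria,  # stop on <|endoftext|>
    temperature=0.0,            # minimum randomness in the output
    top_p=0.15,                 # only sample from tokens whose probabilities sum to 15%
    top_k=0,                    # 0 disables top-k, so top-p alone limits the candidates
    max_new_tokens=64,          # cap on generated tokens per call
    repetition_penalty=1.1,     # >1.0 discourages the model from repeating itself
)
```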
00:14:18.360 | Okay, so let's run this.
00:14:20.960 | So we say, explain to me the difference
00:14:23.400 | between nuclear fission and fusion.
00:14:26.080 | So this is from an example somewhere.
00:14:28.680 | I think it was Hugging Face,
00:14:30.680 | but I don't actually remember
00:14:33.000 | where I got that from exactly.
00:14:34.600 | If anyone does know, feel free to mention that in the comments.
00:14:39.000 | So we have the input, okay?
00:14:42.200 | So we said return full text.
00:14:46.280 | So we have the input here.
00:14:47.680 | And then we also have the output.
00:14:48.760 | So nuclear fission is a process that splits heavy atoms
00:14:51.440 | into smaller, lighter ones, so on and so on.
00:14:53.960 | Nuclear fusion occurs when two light
00:14:56.720 | atomic nuclei are combined.
00:14:58.840 | As far as I know, that is correct.
00:15:01.680 | So that looks pretty good.
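Calling the pipeline is then a single line, something like:

```python
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])  # full text: the prompt followed by the generated answer
```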
00:15:03.440 | And then I've also added a note here
00:15:05.120 | on if you'd like to use the Triton optimized implementations.
00:15:10.120 | So Triton in this scenario, as far as I understand,
00:15:14.040 | is the way that the attention is implemented.
00:15:17.480 | It can be implemented either in PyTorch,
00:15:19.600 | which is what we're using by default.
00:15:21.800 | It can be implemented with flash attention or using Triton.
00:15:26.800 | And if you use Triton, it's gonna use more memory,
00:15:30.560 | but it will be faster
00:15:31.640 | when you're actually performing inference.
00:15:33.560 | So you can do that.
00:15:34.960 | The reason I haven't used it here
00:15:36.040 | is because the install takes just an insanely long time.
00:15:40.200 | So I just gave up with that.
00:15:43.520 | But as far as I know,
00:15:45.400 | this sort of setup here should work.
00:15:47.600 | So you pip install Triton and you go through
00:15:49.960 | and then this should work, okay.
00:15:52.320 | Just be wary of that added memory usage.
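For reference, the Triton variant is enabled through the model config as described in the MPT model card; a rough, untested sketch (it additionally needs `pip install triton`, which is slow, plus extra GPU memory):

```python
# Sketch of switching the attention implementation to Triton, per the model card
config = transformers.AutoConfig.from_pretrained(model_name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.max_seq_len = 2048

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    trust_remote_code=True,
    torch_dtype=bfloat16,
)
```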
00:15:55.160 | So yeah, we've seen, okay,
00:15:56.800 | this is how we're gonna use this in Hugging Face.
00:15:59.600 | So generating text.
00:16:01.440 | Now let's move on to the LangChain side of things.
00:16:04.240 | So how do we implement this in LangChain?
00:16:06.680 | Okay, so we're gonna use this
00:16:07.960 | with the simplest chain possible.
00:16:10.160 | So the LLM chain.
00:16:11.840 | For the LLM, we're going to initialize it
00:16:13.880 | via the HuggingFace pipeline,
00:16:15.440 | which is basically a local Hugging Face model.
00:16:18.400 | And for that, we need our pipeline,
00:16:20.400 | which we have conveniently already initialized up here.
00:16:23.760 | So we just pass that into there.
00:16:26.840 | We have our prompt template.
00:16:29.560 | Okay, nothing special, right?
00:16:31.600 | It's just the instruction here.
00:16:33.560 | So basically we have some inputs and that's it.
00:16:36.560 | I'm just defining that
00:16:37.720 | so that we can define this LLM chain.
00:16:40.200 | Okay, we initialize that.
00:16:42.200 | And then we come down to here
00:16:43.280 | and we can use the LLM chain to predict.
00:16:46.040 | And for the prediction,
00:16:47.640 | we just pass in those instructions again.
00:16:50.160 | Okay, so same question as before.
00:16:51.800 | So in this case, we should get pretty much the same answer.
00:16:55.360 | So we can run that.
00:16:56.880 | Okay, and the output we get there is this.
00:16:58.680 | So as far as I can tell,
00:17:00.600 | it's pretty much the same as what we got last time.
00:17:04.600 | Okay, so looks good.
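The LangChain side boils down to a few lines; a sketch, assuming the LangChain APIs current at the time of the video (`HuggingFacePipeline`, `PromptTemplate`, `LLMChain`), which may have moved in later versions:

```python
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# wrap the local Hugging Face pipeline so LangChain can treat it as an LLM
llm = HuggingFacePipeline(pipeline=generate_text)

# a bare-bones prompt template that just passes the instruction straight through
prompt = PromptTemplate(input_variables=["instruction"], template="{instruction}")
llm_chain = LLMChain(llm=llm, prompt=prompt)

print(llm_chain.predict(
    instruction="Explain to me the difference between nuclear fission and fusion."
))
```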
00:17:08.120 | And with that, we've now implemented MPT-7B
00:17:10.720 | in both Hugging Face and also LangChain as well.
00:17:15.120 | So naturally, if you just want to generate text,
00:17:18.280 | you can use Hugging Face.
00:17:19.440 | But obviously, if you want to have access
00:17:21.160 | to all of the features that LangChain offers,
00:17:24.200 | all the chains, agents, all this sort of stuff,
00:17:26.840 | then you obviously just tack on this extra step
00:17:30.120 | and you have your original Hugging Face pipeline
00:17:33.840 | now integrated with LangChain,
00:17:36.480 | which I think is pretty cool and super easy to do.
00:17:39.320 | It's not that difficult.
00:17:40.800 | So with that, that's the end of this video.
00:17:43.800 | We've explored how we can actually begin
00:17:46.520 | using open source models in LangChain,
00:17:49.520 | which I think opens up a lot of opportunities for us.
00:17:52.600 | You know, fine tuning models, just using smaller models.
00:17:56.720 | Maybe you don't always need like a big GPT-4
00:18:00.160 | for all of our use cases.
00:18:03.000 | So I think this is the sort of thing
00:18:05.560 | where we'll see a lot more going forwards,
00:18:07.760 | a lot more open source, smaller models being used.
00:18:10.840 | Of course, I still think OpenAI is gonna be used plenty,
00:18:14.240 | because honestly, in terms of performance,
00:18:16.400 | there are no open source models
00:18:18.000 | that are genuinely comparable to GPT-3.5
00:18:22.440 | or GPT-4 at the moment.
00:18:24.280 | You know, maybe going forwards, there will be eventually,
00:18:26.920 | but right now, we're not quite there.
00:18:29.560 | So yeah, that's it for this video.
00:18:31.680 | I hope all this has been interesting and useful.
00:18:34.240 | Thank you very much for watching
00:18:35.960 | and I will see you again in the next one.