Using NEW MPT-7B in Hugging Face and LangChain
Chapters
0:00 Open Source LLMs like MPT-7B
0:50 MPT-7B Models in Hugging Face
2:29 Python setup
4:16 Initializing MPT-7B-Instruct
6:28 Initializing the MPT-7B tokenizer
7:10 Stopping Criteria and HF Pipeline
9:52 Hugging Face Pipeline
14:18 Generating Text with Hugging Face
16:01 Implementing MPT-7B in LangChain
17:08 Final Thoughts on Open Source LLMs
Today we're gonna talk about using open source models, in particular MPT-7B, because one of the fine-tuned versions of this model actually has a context window of 65,000 tokens. Compare that to GPT-4: the version that's generally available to people gives you 8,000 tokens, and they have a version that goes up to 32,000, but most of us don't have access to it. So basically we're limited with GPT-4 to 8,000 tokens.
So let me just go ahead and show you those models very quickly over on Hugging Face, which is where we're gonna pull these models from.
So MPT-7B is the core, that's the pre-trained model. Then we have StoryWriter, Chat, and Instruct. StoryWriter is the one you've probably heard about, which has a max context window of 65,000 tokens. And in reality it actually goes higher than that; the model card demonstrates generations as long as 84,000 tokens.
And then if we come over to here and scroll down, we can see the other models as well. So we have this Chat model and the Instruct model. Because most of us kind of rely on providing instructions to these models, most of the time we're actually going to want to use the Instruct model, okay? It's gonna be able to follow instructions better than the others. So, yeah, we're gonna see how we can use this in both libraries: first we're gonna see how we can load it into Hugging Face, and then we're gonna see how we can take that over into LangChain.
Okay, so the first thing we're gonna want to do is install everything we need. Accelerate we need in order to basically optimize how the model is loaded onto our hardware, and there'll be a link to this notebook so you can follow along. To get a GPU in Colab, you just go to Runtime, Change Runtime Type. You'll initially maybe be on None, so you click GPU. For GPU Type, I'm using a T4, which is the smallest one on here. I'm currently on Colab Pro now, thanks to this model, and with that I can switch up to the high-RAM version. But of course, I know this will be limiting for some people. Okay, so back on the installs, we have einops. This is, again, used by the MPT model.
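As a point of reference, the install cell might look something like the sketch below; the package list is assumed from what's mentioned here (transformers, accelerate, einops, plus LangChain for later), and the video's version pins may differ.

```python
# Minimal Colab install sketch (assumed package list, no version pins).
!pip install -qU transformers accelerate einops langchain
```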
One thing: if you do want to use StoryWriter and you want to use that huge context window, I don't know, what is it, 65,000, which is kind of nuts, you're gonna definitely need more than a T4 GPU. Basically, the higher the max sequence length is, the bigger your GPU memory is going to need to be. But we're just gonna stick with Instruct. This 2048 is the typical or the standard sequence length for Instruct, or the base model, the foundation model.
The trust_remote_code flag we have to have, because MPT relies on custom code that is basically stored in the model directory, and Transformers needs to run it to set up all the endpoints and everything for the model.
Okay, then we switch the model to evaluation mode. That just switches a few options within the model that we want for inference rather than training. And then we want to move our model to device, so CUDA, and then we have the CUDA current device. This takes a little bit of time to run, okay? Expect to wait like five, ten minutes, at least on Colab.
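Pulling those steps together, the loading code might look roughly like this. The model name, trust_remote_code, the 2048 sequence length, eval mode, and the device move come from the walkthrough above; the bfloat16 dtype is an assumption rather than confirmed from the video.

```python
import torch
import transformers

model_name = "mosaicml/mpt-7b-instruct"  # swap in the Chat/StoryWriter variants here

# MPT ships custom modeling code, hence trust_remote_code=True.
config = transformers.AutoConfig.from_pretrained(model_name, trust_remote_code=True)
config.max_seq_len = 2048  # the standard sequence length discussed above

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype=torch.bfloat16,  # assumption: reduced precision to fit GPU memory
    trust_remote_code=True,
)
model.eval()  # switch to evaluation/inference mode

# Move the model onto the current CUDA device (falls back to CPU if unavailable).
device = f"cuda:{torch.cuda.current_device()}" if torch.cuda.is_available() else "cpu"
model.to(device)
```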
Once that's done, you should be good to go, and we can initialize the tokenizer. The tokenizer is what translates our plain text into the tokens that then get fed into the large language model. Now, the MPT-7B model was trained using the EleutherAI GPT-NeoX-20B tokenizer, so that's the one we load here.
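A short sketch of that step, assuming the EleutherAI/gpt-neox-20b tokenizer named in MosaicML's model card:

```python
from transformers import AutoTokenizer

# MPT-7B was trained with the EleutherAI gpt-neox-20b tokenizer.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```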
So, I don't know if I mentioned this, but the next step is actually initializing the Hugging Face pipeline. Within that pipeline we have the large language model and the tokenizer, both of which we've just created, plus a stopping criteria object.
Let me come down to where we create that; it's this here. Without a stopping criteria, there is nothing that will stop the model from generating text when it should, and then it will actually just continue generating text; it's able to just keep generating anything, okay? What we want instead is: when the model gives this stop token to us, we stop, right? And we do that using this StoppingCriteriaList object. That requires a StoppingCriteria object, and the parameters on its call method are just the default parameters needed by the Transformers interface. The stop token itself is just going to be a few integers, right? The tokenizer translates it from plain text to the token IDs. Then the logic is going to say, okay, if the latest generated tokens match the stop IDs here, it's time to stop, right? Otherwise it's not time to stop and generation can keep going. So that gives us our stopping criteria object, and then we just pass that into our pipeline.
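Here's a minimal sketch of that logic. The stop strings are hypothetical placeholders; the actual strings depend on the prompt format used in the video.

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# Hypothetical stop strings, converted to token IDs by the tokenizer.
stop_texts = ["Human:", "<|endoftext|>"]
stop_token_ids = [tokenizer(text)["input_ids"] for text in stop_texts]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the most recent tokens match one of the stop sequences.
        for stop_ids in stop_token_ids:
            if input_ids[0][-len(stop_ids):].tolist() == stop_ids:
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```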
So the pipeline is basically the tokenization, the model, and the generation from that model, wrapped up together. We pass in obviously our model and our tokenizer, but there are also a few other things we need as well. Because we're going to be using LangChain later, we set return_full_text equal to True; if you were just wanting to use this in Hugging Face alone, you don't need to have this as True. The task is text-generation, which just says, okay, we want to generate text. And we obviously want to use our CUDA-enabled GPU, so we pass in the device as well.
Then the generation parameters. The temperature is like the randomness of your output; imagine it's kind of like how random the predicted tokens, the next words, are going to be. If you want to read about this, I'd recommend looking at this page from Cohere; there'll be a link at the top of the video right now. Next is top_p: consider only the top tokens whose likelihoods add up to that 15%, and then ignore the others. At each generation step you're predicting the next token, and we're just going to consider the possible next words whose likelihoods together add up to 15%, right? You can see that there, they visualize it very nicely.
Top_k is another value that does a kind of similar thing, right? In this case, if you had top_k equal to one, you would always take the single most likely token; if it were three, you could choose any of these top three, right? And actually you can visualize that here as well. I'm setting it to zero because I don't want to consider top_k at all; I only want to limit the number of tokens to decide from using top_p, okay?
Then max_new_tokens is the max number of tokens to generate in the output; you can go much higher than the 64 that I've set here. And then we also have this repetition penalty. The default value for that is actually one, which applies no penalty; setting it higher discourages the model from repeating itself, so we're generally not going to see that repeated text anymore. If anyone does know more detail here, feel free to mention that in the comments.
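Putting the pieces together, the pipeline setup might look roughly like the sketch below. The parameter values are placeholders consistent with the discussion above (top_p of 0.15, top_k of 0, 64 new tokens), and do_sample is an added assumption so that temperature and top_p actually take effect.

```python
import transformers

generate_text = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,      # keep the prompt in the output, needed for LangChain later
    device=device,              # the CUDA device from earlier
    stopping_criteria=stopping_criteria,
    do_sample=True,             # assumption: enable sampling so the settings below apply
    temperature=0.1,            # example value: low randomness
    top_p=0.15,                 # only consider tokens whose likelihoods sum to 15%
    top_k=0,                    # 0 disables top-k, so only top-p limits the choices
    max_new_tokens=64,          # max tokens to generate, as set in the video
    repetition_penalty=1.1,     # example value: >1.0 discourages repeated text
)
```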
Running it, we get an output that starts along the lines of "nuclear fission is a process that splits heavy atoms...", which looks reasonable.
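For reference, a call might look like this; the prompt is a hypothetical reconstruction based on the output described above.

```python
# Hypothetical prompt; the generated text is in the "generated_text" field.
res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```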
One more note, on if you'd like to use the Triton optimized implementation. So Triton in this scenario, as far as I understand, relates to the way that the attention is implemented: it can be implemented with flash attention or using Triton. If you use Triton, it's gonna use more memory, but it should run faster. The reason I'm not using it here is because the install takes just an insanely long time.
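If you do want to try it, MosaicML's model card flips the attention implementation on the config before loading; a sketch of that, assuming the attn_config layout from the card:

```python
import transformers

model_name = "mosaicml/mpt-7b-instruct"

# Request the Triton attention implementation (requires the triton package).
config = transformers.AutoConfig.from_pretrained(model_name, trust_remote_code=True)
config.attn_config["attn_impl"] = "triton"

# Then pass this config into AutoModelForCausalLM.from_pretrained as before.
```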
So that's how we're gonna use this model in Hugging Face. Now let's move on to the LangChain side of things. Here we just wrap the pipeline, which we have conveniently already initialized up here, in LangChain's LLM interface. Then basically we have some inputs, a prompt, and that's it. In this case we should get pretty much the same answer, and indeed it's pretty much the same as what we got last time. So that's how we can use MPT-7B-Instruct in both Hugging Face and also LangChain as well.
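A minimal sketch of that wrapping step, assuming the LangChain APIs of the time (HuggingFacePipeline, PromptTemplate, LLMChain) and a hypothetical single-variable prompt template:

```python
from langchain import LLMChain, PromptTemplate
from langchain.llms import HuggingFacePipeline

# Wrap the Hugging Face pipeline so LangChain can treat it as an LLM.
llm = HuggingFacePipeline(pipeline=generate_text)

# Hypothetical pass-through prompt template with a single input variable.
template = PromptTemplate(input_variables=["question"], template="{question}")
chain = LLMChain(llm=llm, prompt=template)

print(chain.run("Explain to me the difference between nuclear fission and fusion."))
```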
So naturally, if you just want to generate text, the Hugging Face pipeline on its own is fine. But if you want access to all of the features that LangChain offers, all the chains, agents, all this sort of stuff, then you obviously just take this extra step and wrap your original Hugging Face pipeline in LangChain, which I think is pretty cool and super easy to do.
This opens up a lot of opportunities for us, you know, fine-tuning models, just using smaller models. I think we're going to see a lot more open source, smaller models being used. Of course, I still think OpenAI is gonna be used plenty, but maybe going forwards there will eventually be open source models that close the gap. I hope all this has been interesting and useful.