
Using NEW MPT-7B in Hugging Face and LangChain


Chapters

0:00 Open Source LLMs like MPT-7B
0:50 MPT-7B Models in Hugging Face
2:29 Python setup
4:16 Initializing MPT-7B-Instruct
6:28 Initializing the MPT-7B tokenizer
7:10 Stopping Criteria and HF Pipeline
9:52 Hugging Face Pipeline
14:18 Generating Text with Hugging Face
16:01 Implementing MPT-7B in LangChain
17:8 Final Thoughts on Open Source LLMs

Transcript

Today we're gonna talk about using open source models in Hugging Face and LangChain. We're going to be focusing specifically on the MPT-7B model, which I'm sure some of you have heard of, as one of the fine-tuned versions of this model actually has a context window of 65,000 tokens, which is pretty huge.

At the moment of recording this video, GPT-4, the one that's generally available to people, has a context window of 8,000 tokens, and they have a version that goes up to 32,000, but I'm actually not aware of anyone that has access to that at the moment. So basically we're limited with GPT-4 to 8,000 tokens.

Now, MPT-7B, like I said, gives us that super huge context window, but there are also a lot of other models available as well. So let me just go ahead and show you those very quickly. So just head over to Hugging Face, which is where we're gonna pull these models from.

And you can see straight away, we have these four models. So MPT-7B is the core, that's the pre-trained model, that's the foundation model. Then we have StoryWriter, Chat, and Instruct. These are all fine-tuned models. So StoryWriter is the one you've probably heard about, which has a max context window of 65,000 tokens, which is pretty huge.

And in reality, it actually goes even higher. So I believe they say, ah, here, right? "We demonstrate generations as long as 84,000 tokens," which is, I would say, pretty impressive. And then we can come over to here, scroll down, and we can see the other models as well.

So we have this Chat model, the Instruct model, and obviously the Foundation model. We're gonna be using the Instruct model because, I mean, most of the use cases I see kind of rely on us providing instructions to these models. And therefore, I think most people out there actually are going to want to use this model, okay?

Because, yeah, we can give it instructions and it's gonna be able to follow them better than the others. So we're gonna see how we can use this in both. First, we're gonna see how we can load the model in Hugging Face, and then we're gonna see how we can take that and actually load it into LangChain, which obviously has a few more features on the agent side of things.

Okay, so the first thing we're gonna want to do is actually do a few pip installs. So we have Transformers, Accelerate. So Accelerate, we need that in order to basically optimize how we're running this on our GPU. We will want to run this on a GPU, otherwise you're going to be waiting an impossibly long time.

So, yeah, if you don't have access to a GPU, I would recommend you figure that out. So right now I'm running this on Colab, and actually there'll be a link to this notebook as well on the top of the video. So from Colab, you can run on GPU, okay?

So you just go to Runtime, Change Runtime Type. You'll initially maybe be on None, so you click GPU. For GPU type, I'm using a T4, which is the smallest one on here and the standard version you can get on the free version of Colab. But for me, that wasn't actually big enough to run the MPT-7B model, unfortunately.

So I'm currently on Colab Pro now, thanks to this model. And with that, I can switch up to the high-RAM version. Now, obviously you have to pay for that, but you don't have to pay that much, okay? It's not a significant cost. But of course, I know this will be limiting for some people, but this is the best and cheapest option I can find right now.

Okay, so back on the installs, we have einops. So this is, again, used by the MPT model. Naturally, we're gonna be using LangChain. And do I use Wikipedia here? I actually don't think I use this anymore. And xformers is just for an optimization in our transformer functions.
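For reference, the whole install cell looks roughly like this (versions unpinned here; pin them if you need reproducibility):

```python
# Install everything used in this walkthrough (run as a Colab cell).
# einops and accelerate are required by the MPT model code, xformers is an
# optional attention optimization, and wikipedia is likely not needed anymore.
!pip install -qU transformers accelerate einops langchain wikipedia xformers
```

Okay, once we have all those installed, we come down here, and this is where we initialize the model.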

Okay, so like I said, we're gonna be using the instruct model, okay? One thing, if you do want to use StoryWriter and you want to use that huge context window, you would swap in StoryWriter here, okay? And then here, you would set the max sequence length to, I don't know, what is it, 65,000, which is kind of nuts.

But in order to run that, you're gonna definitely need more than a T4 GPU. Basically, the higher the max sequence length is, the bigger your GPU memory is going to need to be. So yeah, you need something big to run that. But we're just gonna stick with this, instruct.

This 2048 is the typical or standard max sequence length for the other models, so the base foundation model, Instruct, and Chat. And this is also something important: trust remote code. We have to have that because essentially the MPT models are not fully supported by Hugging Face yet.

So we have to rely on this remote code that is basically stored in the model directory for this to set up all the endpoints and everything for the model. Okay, then we switch the model to evaluation mode. So that just switches a few options within the model that says, okay, we're not training, we're now performing inference.

Okay, we're now doing predictions. And then we want to move our model to the device. So the device, we decide here: CUDA, and then we have the CUDA current device. If we scroll down to the end here, we should see where that moved it to. Yeah, so the model loaded to cuda:0.
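As a rough sketch, assuming the `transformers` Auto classes and the `mosaicml/mpt-7b-instruct` repo on the Hugging Face Hub (the dtype choice is my assumption, just to keep memory down), the initialization looks something like this:

```python
import torch
import transformers

# trust_remote_code=True is required because the MPT model code lives in the
# model repo itself rather than inside the transformers library.
config = transformers.AutoConfig.from_pretrained(
    'mosaicml/mpt-7b-instruct',
    trust_remote_code=True
)
config.max_seq_len = 2048  # StoryWriter could go far higher, e.g. 65536

model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b-instruct',
    config=config,
    torch_dtype=torch.bfloat16,  # assumption: lower precision to fit GPU memory
    trust_remote_code=True
)

model.eval()  # inference mode, not training

# Move the model onto the GPU.
device = f'cuda:{torch.cuda.current_device()}' if torch.cuda.is_available() else 'cpu'
model.to(device)
print(f"Model loaded on {device}")
```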

Now, just one thing, this takes a little bit of time to run, okay? Like here, it just took a minute. I think that's because most of the model was probably already downloaded for me. If you're downloading and initializing this, expect to wait like five, 10 minutes, at least on Colab.

But once that has been downloaded, you should be good to use it, and initializing it will just take a minute or so, because you only need to download it once. Okay, and then we initialize our tokenizer. So the tokenizer is actually EleutherAI's GPT-NeoX-20B tokenizer.

This is just a tokenizer. So when I say tokenizer, it's basically the thing that translates from human-readable plain text to transformer-readable, or large language model-readable, token IDs, right? So it's gonna convert a word into its token ID, for example, right? And then those get fed into the large language model.
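A minimal sketch of that step, just loading the tokenizer from the Hub and round-tripping a bit of text:

```python
from transformers import AutoTokenizer

# MPT-7B was trained with the GPT-NeoX-20B tokenizer, so we load that one.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Quick sanity check: plain text in, token IDs out, and back again.
ids = tokenizer("nuclear fission and fusion")["input_ids"]
print(ids)                    # a short list of integer token IDs
print(tokenizer.decode(ids))  # "nuclear fission and fusion"
```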

Now the MPT-7B model was trained using this tokenizer, right, so we have to use that tokenizer. Then what we need to do is define stopping criteria for the model. So I don't know if I mentioned this, but right now what we're doing is actually initializing the Hugging Face pipeline.

So within that pipeline, we have the large language model and the tokenizer, both of which we've just created, and also a stopping criteria object, right? The stopping criteria object, let me come down to where we create it, is this here. Okay, so basically MPT-7B has been trained to add this particular bit of text at the end of its generations, when it's like, okay, I'm finished, right?

But there's nothing within that model that will stop it from actually generating text at that point, right? It will just, it will generate this, right? And then it will actually just continue generating text. And the text that it generates after this is generally just going to be gibberish because it's been trained to generate this at the end of a meaningful answer, right?

After generating this, it's able to just begin generating anything, okay? It's not going to be useful stuff. So what we need to do is define this as a stopping criteria for the model. We need to go in there and say, okay, when the model says end of text, when it gives this token to us, we stop, right?

We need to specify that. And we do that using this stopping criteria list object. Okay, so that requires a stopping criteria object, which we've defined here. So, I mean, you can see this. These parameters are just the default parameters needed by this stopping criteria object. And basically what it's going to do is say, okay, for each stop ID.

So we have these stop token IDs. Maybe I can just show you these, maybe that's easier. So the stop token IDs are just going to be a few integers, right? Those integers, actually it's one integer, which represents this, right? So like I said before, the tokenizer translates from plain text to token IDs.

That's what this is. This is the plain text version, and this is the token ID version, right? And it's going to say, okay, for the stop ID here, so actually just zero, if the last input ID is equal to that, we're going to say, okay, it's time to stop, right? Otherwise it's not time to stop, and it can keep going.
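In code, that logic looks roughly like this; a sketch assuming the `StoppingCriteria` API from `transformers` and the tokenizer loaded earlier (whose end-of-text token happens to be ID 0):

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# The token IDs that should end generation; here just '<|endoftext|>' (ID 0).
stop_token_ids = [tokenizer.eos_token_id]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the most recently generated token is a stop token.
        for stop_id in stop_token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
```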

And that's it, okay? So that gives us our stopping criteria object. And then we just pass that into our pipeline. So the pipeline is basically the tokenization, the model, the generation from that model, and then also this stopping criteria, all packaged into a nice little function.

So within that pipeline, we pass in obviously our model, our tokenizer, and the stopping criteria, but there are also a few other things we need as well. So, return full text. If we have this set to false, it's only going to return the generated portion of the text. And that's fine, you can do that.

There's actually no problem with that. But if you want to use this in LangChain, we need to return the generated text and also the input text, we need to return the full text, because we're going to be using LangChain later. That's why we set return full text equal to true.

If you were just wanting to use this in Hugging Face, you don't need to have this as true. Then our task here is text generation. Okay, so this just says, okay, we want to generate text. The device here is important. We obviously want to use our CUDA-enabled GPU.

So we set that. And then we have a few other model specific parameters down here. Or we could call them generation specific parameters as well. So the temperature is like the randomness of your output. Zero is the minimum. It's basically zero randomness and one is maximum randomness. Okay, so imagine it's kind of like how random the predicted tokens or the next words are going to be.

Then we have top P. So top P basically means we're going to select from the top tokens on each prediction, the ones whose probabilities add up to 15%. And if you want to read about this, I'd recommend looking at this page from Cohere. There'll be a link at the top of the video right now.

They explain this really nicely. So yeah, you can kind of see they use 0.15 here as well, right? So consider only top tokens whose likelihoods add up to that 15% and then ignore the others. So with each step, right? Each generation step, you're predicting the next token or the next word.

You can think of it like that. And by setting top P equal to 0.15, we're only going to consider some of the possible next words, because we're predicting over all of the tokens in that tokenizer, we're just going to consider the top tokens whose likelihoods together add up to 15% of the total, okay?

So you can see that there, they visualize it very nicely. I don't think my explanation can compare to this visualization. Okay, and then we have top K. This is another value, kind of similar thing, right? So top K, if we come up to here, and this is easy to explain, we're picking from the top K tokens, right?

So in this case, if you had top K equal to one, it would only select United or it could only decide on selecting United. If you had top K equal to two, you could do United or Netherlands. Top K equal to three, you could choose any of these top three, right?

That is what the top K is actually doing. And actually you can visualize that here as well. Okay, and okay, what I've done here is set top K equal to zero. That's because I don't want to consider top K because I'm already defining the limits on the number of tokens to decide from using top P, okay?

So I don't activate top K there. Then we have the max number of new tokens to generate in the output. So with each generation, I'm saying I don't want you to generate any more than 64 tokens. You can increase that, right? The max context window, so that's inputs and outputs for this model, we've already set with the max sequence length from earlier, 2048.

So you can go much higher than the 64 that I've set here. And then also we have this repetition penalty. That's super important because otherwise the model is going to start repeating things over and over again. The default value for that is actually one, and with that we can see more repetition.

We switch that to 1.1 and we're generally not going to see that anymore. Okay, so let's run this. We say, "Explain to me the difference between nuclear fission and fusion." This is from an example somewhere, I think it was Hugging Face, but I don't actually remember where I got it from exactly. If anyone does know, feel free to mention it in the comments.
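Putting those settings together, the pipeline and the generation call look roughly like this; a sketch reusing the `model`, `tokenizer`, `stopping_criteria`, and `device` objects from the earlier steps:

```python
import transformers

generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,                # return prompt + generation (needed for LangChain later)
    task='text-generation',
    device=device,
    stopping_criteria=stopping_criteria,  # stop on '<|endoftext|>'
    temperature=0.0,                      # minimum randomness
    top_p=0.15,                           # sample only from tokens covering 15% of probability mass
    top_k=0,                              # disabled, since top_p already limits the candidate pool
    max_new_tokens=64,                    # cap on generated tokens per call
    repetition_penalty=1.1                # discourage repeating the same text
)

res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```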

So we have the input, because we set return full text to true, and then we also have the output: nuclear fission is a process that splits heavy atoms into smaller, lighter ones, and so on.

Nuclear fusion occurs when two light atomic nuclei are combined. As far as I know, that is correct. So that looks pretty good. And then I've also added a note here on if you'd like to use the Triton optimized implementations. So Triton in this scenario, as far as I understand, is the way that the attention is implemented.

It can be implemented either in PyTorch, which is what we're using by default. It can be implemented with flash attention or using Triton. And if you use Triton, it's gonna use more memory, but it will be faster when you're actually performing inference. So you can do that. The reason I haven't used it here is because the install takes just an insanely long time.

So I just gave up on that. But as far as I know, this sort of setup here should work. So you pip install Triton, you go through it, and then this should work, okay. Just be wary of that added memory usage.
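For reference, my understanding is that the switch happens through the MPT config's attention settings, something along these lines; treat it as an untested sketch, since I didn't complete the Triton install myself:

```python
# Untested sketch: switch MPT's attention implementation to Triton.
# Requires `pip install triton` (which is what takes so long).
config = transformers.AutoConfig.from_pretrained(
    'mosaicml/mpt-7b-instruct',
    trust_remote_code=True
)
config.attn_config['attn_impl'] = 'triton'  # default is 'torch'
config.init_device = 'cuda:0'               # load weights straight onto the GPU

model = transformers.AutoModelForCausalLM.from_pretrained(
    'mosaicml/mpt-7b-instruct',
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
```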

So that's how we use this model in Hugging Face to generate text. Now let's move on to the LangChain side of things. So how do we implement this in LangChain? Okay, so we're gonna use this with the simplest chain possible, the LLM chain. For the LLM, we're going to initialize it via the HuggingFacePipeline class, which is basically for a local Hugging Face model.

And for that, we need our pipeline, which we have conveniently already initialized up here, so we just pass that into there. We have our prompt template. Okay, nothing special, right? It's just the instruction here. So basically we have some input and that's it. I'm just defining that so that we can define this LLM chain.

Okay, we initialize that. And then we come down to here and we can use the LLM chain to predict. And for the prediction, we just pass in those instructions again. Okay, so same question as before. So in this case, we should get pretty much the same answer. So we can run that.
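A sketch of the LangChain side, assuming the `HuggingFacePipeline` wrapper and the `generate_text` pipeline from above (LangChain's import paths have shifted between versions, so adjust to whatever your installed version expects):

```python
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# Wrap the local Hugging Face pipeline as a LangChain LLM.
llm = HuggingFacePipeline(pipeline=generate_text)

# A bare-bones prompt template: the instruction is passed straight through.
prompt = PromptTemplate(input_variables=["instruction"], template="{instruction}")

llm_chain = LLMChain(llm=llm, prompt=prompt)

print(llm_chain.predict(
    instruction="Explain to me the difference between nuclear fission and fusion."
))
```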

Okay, and the output we get there is this. So as far as I can tell, it's pretty much the same as what we got last time. Okay, so it looks good. And with that, we've now implemented MPT-7B in both Hugging Face and LangChain as well. So naturally, if you just want to generate text, you can use Hugging Face.

But obviously, if you want to have access to all of the features that LangChain offers, all the chains, agents, all this sort of stuff, then you just add on this extra step and you have your original Hugging Face pipeline now integrated with LangChain, which I think is pretty cool and super easy to do.

It's not that difficult. So with that, that's the end of this video. We've explored how we can actually begin using open source models in LangChain, which I think opens up a lot of opportunities for us. You know, fine-tuning models, just using smaller models. Maybe we don't always need a big GPT-4 for all of our use cases.

So I think this is the sort of thing where we'll see a lot more going forwards, a lot more open source, smaller models being used. Of course, I still think OpenAI is gonna be used plenty, because honestly, in terms of performance, there are no open source models that are genuinely comparable to GPT-3.5 or GPT-4 at the moment.

You know, maybe going forwards, there will be eventually, but right now, we're not quite there. So yeah, that's it for this video. I hope all this has been interesting and useful. Thank you very much for watching and I will see you again in the next one. Bye.