
Mixtral 8X7B — Deploying an *Open* AI Agent


Chapters

0:00 Mixtral 8x7B is better than GPT-3.5
0:50 Deploying Mixtral 8x7B
3:21 Mixtral Code Setup
8:17 Using Mixtral Instructions
10:04 Mixtral Special Tokens
13:29 Parsing Multiple Agent Tools
14:28 RAG with Mixtral
17:01 Final Thoughts on Mixtral

Transcript

Today we have a very exciting video: we're going to test the performance of the new Mixtral 8x7B model, which honestly, from what I've seen so far, is pretty incredible. It's far better than any other open model I've tested, and at the same time it's very fast.

Most open models I've tested that are reasonably large and reasonably performant are just incredibly slow, and you can't really use them. This one is genuinely fast and genuinely performant, and I can actually use it rather than something like GPT-3.5, and that's the first time I can confidently say that about an open-weights model.

So let's jump straight into how we can use this model; we're going to toy around with using it in an agent-like flow. I'm going to use it through RunPod. RunPod seems to be pretty good: I tried a few different options to see what would be easy to use without being super expensive, and RunPod came out well.

They're not sponsoring this or anything; they just seem like a very good option compared to what else is out there. We have a few H100s on here, which is pretty cool, but we can go with the A100s: they're cheaper and work just as well.

So I'm going to set up two A100s, click deploy, and then you can customize your deployment here. For that custom deployment I think you can probably go a little lower than what I'm using, but this worked well for me without being excessive. I set my container disk to 120 GB and my volume disk to 600 GB; that gets us pretty close to the limit, but we still have some breathing room.

So I'm going to set those overrides. This option here doesn't matter, we're not going to be using it, but we are going to be using Jupyter Notebook, so make sure that is checked, and we'll continue. Okay, so we have the price estimate; it's not cheap, but it could be worse, and we can go ahead and deploy that.

I expect this will get cheaper as quantized versions of the model are released, and I've already seen that some have been released, so we'll have those anyway. Now, in here we have this one running; this is the one I just created, and it's starting up.

I'm going to stop it in a moment because I already have another one running, and you can see that my container utilization is about 78%. If you go too low on the volume, you'll probably exceed it during the build while the model downloads, so you don't want to skimp there. At the moment it's at about zero percent while nothing is being downloaded or running.

So I'm going to stop that and delete it quickly. What I'm going to do is come over to my pod here, go to connect, and click on "Connect to JupyterLab". That opens this window here, and yours should be empty. I'm going to be using this notebook; there will be a link to it at the top of the video right now.

You can download this notebook and then just upload it to your RunPod instance here: click upload and find your file, so I'd be uploading this one. Okay, cool. I have some notes here, and I'm trying to write this out into more of a written guide as well, but for now let's jump straight into the prerequisites and actually test all of this.

For the pip installs we have Hugging Face Transformers, so we're going to be using the Hugging Face Transformers version of the model, and Accelerate, which is a Hugging Face library that lets us use the GPUs the way we'd like to. And because we're building an agent, I wanted to add a couple of tools for it.

One of those is going to be a web search tool. Those will install; I already have them installed, so I don't need to reinstall. Then we come down to here, and we're going to be using the instruct fine-tuned model of Mixtral 8x7B. There's also the base model, which is the pre-trained version without the extra instruct fine-tuning.
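As a rough sketch, the install cell and model ID look something like this. I'm assuming `duckduckgo_search` for the web search tool, since that's a simple option that matches the search tool used later in the video; the notebook's exact packages may differ:

```python
# notebook cell: install dependencies
!pip install -qU transformers accelerate duckduckgo_search

# the instruct fine-tuned model; the base (pre-trained only) model is
# "mistralai/Mixtral-8x7B-v0.1"
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
```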

We run this. There are a few things in here that are important and that you should be aware of. I don't think I need the CUDA check here. We're setting trust remote code: Hugging Face doesn't have a built-in model class for Mixtral yet, so we have to trust remote code so that it can pull in the model's own code and run it within the object itself.

We also set the torch dtype to float16, and the device map to auto. That device map is the bit where we need Accelerate installed. Accelerate actually installed super quickly for me because I already had it downloaded.
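A minimal sketch of that model initialization, under the settings described above:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,     # Mixtral had no built-in Transformers class at the time
    torch_dtype=torch.float16,  # half precision so the weights fit in GPU memory
    device_map="auto",          # let Accelerate spread the layers across both A100s
)
model.eval()  # inference mode
```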

When you first download this model, on my RunPod it took around 20 to 25 minutes; it does take a little while because there are a lot of weights in there. Okay, cool, so let's take a look at the tokenizer next. With LLMs, and transformer models in general, a tokenizer translates plain text into arrays of tokens, which are then read by the first layer of the transformer/LLM.

So we need to initialize the tokenizer that Mixtral uses: we just pass the Mixtral model ID into AutoTokenizer from_pretrained, and that loads the tokenizer for us. With all of that set up, we can go ahead and initialize the text-generation pipeline. The pipeline needs the Mixtral model, and it also needs Mixtral's tokenizer.
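That tokenizer initialization is a one-liner:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

# quick sanity check: plain text in, array of token IDs out
print(tokenizer("hello world")["input_ids"])
```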

return_full_text I've set to False. Basically, if you're using LangChain, or at least the last time I used this with LangChain, you had to set it to True for things to work. We're using text-generation as the task here, and then there are a few parameters you can modify.

Okay, so repetition_penalty, for example: this was important for some of the other models I was toying around with. I'm not sure how important it is with Mixtral, but you can increase this number and it will reduce the likelihood of the model repeating itself, which is something I see even in GPT-3.5 quite often: when you keep repeating the same input, it tends to go into a loop.
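Putting that together, the pipeline setup looks roughly like this; the exact generation parameters in the notebook may differ, and max_new_tokens is my own assumption:

```python
from transformers import pipeline

generate_text = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # set to True if you're feeding this into LangChain
    max_new_tokens=512,      # assumption: a reasonable cap on generation length
    repetition_penalty=1.1,  # values > 1.0 discourage the model from looping
)
```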

So, yeah, we can generate some text. The first time you run this it takes a lot longer, and after that it should be pretty quick. While that's running, we can come over here to the RunPod dashboard and take a look at how much GPU this is actually using.

Okay, so you can see it's started to pick up now. There's memory used on GPU zero, nothing much on GPU one yet, so that's still running. Okay, now we're using number one as well, and we get our output. The output is kind of random because we just put in some random text; we didn't use the special tokens or provide any instructions as to what the model should be doing.

So this sort of random output is pretty normal. Now let's look at how not to get that sort of output and instead get something that's actually useful. As I said, we haven't provided any instructions to the model, so that's the first thing we should do.

So I'm going to do that here. I'm going to say: you're a helpful AI assistant, you can help with a ton of things, and so on. We add descriptions for the tools we're going to be using. You can see that here we're talking about Python code; that's for the calculator tool, which does its calculations using Python.

Then there's the search tool, which is for web search, and the final answer tool, which returns an answer to the user. The prompt gives the model a little example of how to use them, and then we finish with: this is the end, the user's query is as follows, then the user's query, and then the assistant. We're using a JSON format that aligns with the LangChain format for agents, where we have the tool name and the tool input, and we've given the model instructions on how to use that here.
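The notebook's exact wording isn't shown on screen, but based on the description the system prompt has roughly this shape, with the agent expected to reply in LangChain-style JSON; treat it as a sketch, not the author's verbatim prompt:

```python
system_message = """You are a helpful AI assistant. You have access to these tools:

- calculator: runs Python code to perform calculations
- search: searches the web for current information
- final_answer: returns a final answer to the user

Always respond with JSON in this format:
{"tool_name": "calculator", "input": "from math import sqrt\\nsqrt(512) * 7"}
"""
```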

So this is like the primer. We run this, and then we want to generate some more text; it's super quick. The question we asked is: "hi there, I'm stuck on a math problem, can you help? My question is: what is the square root of 512 multiplied by 7?" It decides to go with the calculator, and the input is "from math import sqrt", then the square root of 512 multiplied by 7.
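In code, that step is just a call to the pipeline on the primed prompt; `input_prompt` and the exact output below are illustrative, not taken verbatim from the notebook:

```python
res = generate_text(input_prompt)
print(res[0]["generated_text"])
# something like:
# {"tool_name": "calculator", "input": "from math import sqrt\nsqrt(512) * 7"}
```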

So that looks pretty correct. Now, the second thing I mentioned up here is how we should structure our inputs to get good outputs. So far we haven't used the recommended instruction format, which looks something like this. I'm not 100% sure it's exactly right, but this is what I've seen being used. What we input into the model is the beginning-of-string token, then the beginning-of-instructions and end-of-instructions tokens, and then our primer text, and then we let the LLM generate its output. Originally I moved this over here and got some pretty weird results, so I think it's supposed to be like this.

Let's come down to here and see how we'd actually do that. In the instruction format function we add the beginning-of-string token, beginning-of-instructions, end-of-instructions, and then our user input, our system message, and so on. We don't add a final end-of-string token. So I run this: we pass in our system message, the same one we saw before, and then the user's query. We run that and print the input prompt to see what we have. This is the input we're getting: the end-of-instructions token appears after the instructions are finished, then we have the chat log, "hi there, I'm stuck on a math problem", and so on, and the assistant is starting to reply. Then we need to generate the rest, which we do here, and the calculator tool comes back telling us the correct thing.

Now we can parse this into Python-executable code. This is roughly how an agent works: we parse the formatted output into a Python dictionary, and if the tool name is calculator we use the calculator, which is Python, and we get the answer, roughly 158. Now, if we add all the information we just received back into our prompt, we get something like this: we take our agent template, append all the other steps to it, and include the tool output, which is this number here. Run this, and you can see at the bottom we have the tool output, which the LLM can now use. So, using this full prompt with the additional context, we run generate text again and see what happens. Again we get the correct format: the final answer tool with the input we'd like. It's basically passing all the information we've seen so far into the final LLM call, and we get "the square root of 512 multiplied by 7 is approximately" that number. So that's a good, human-like answer.

Now, if the model would like to return the final answer, meaning it wants to respond directly to the user, we have another function for that, and we put all the tool-parsing logic into a single run function. We decide which tool to use: this dictionary contains the tool name, and with that we can go ahead and use the tool. We create the prompt, and once we have our input prompt we pass it to the run function. So let's run this: pass our input prompt into the run function, and we get this tool output. Based on what we've asked, it decides it needs to use the calculator tool.
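Pulling those pieces together, here's a rough reconstruction of the agent loop. All the function names are my own, `generate_text` is the pipeline from earlier, and I'm assuming `duckduckgo_search` for the web search tool that's used next; the notebook's actual code will differ in detail:

```python
import json
from duckduckgo_search import DDGS  # assumption: the web search package used

def instruction_format(sys_message: str, query: str) -> str:
    # Mixtral instruct format: <s> [INST] instructions [/INST], then the chat
    # log; no closing </s>, so the model keeps generating from the JSON primer
    return (
        f"<s> [INST] {sys_message} [/INST]\n"
        f"User: {query}\n"
        'Assistant: {"tool_name": '
    )

def parse_action(generated: str) -> dict:
    # re-attach the primer we left at the end of the prompt, then parse as JSON
    # (assumes the model stops cleanly after the closing brace)
    return json.loads('{"tool_name": ' + generated.strip())

def calculator(code: str) -> str:
    # run everything except the last line, then eval the last line as the result
    *setup, last = code.strip().split("\n")
    scope = {}
    exec("\n".join(setup), scope)  # demo only: never exec untrusted model output
    return str(eval(last, scope))

def search(query: str) -> str:
    # top web results, separated by "---" as described in the video
    results = DDGS().text(query, max_results=3)
    return "\n---\n".join(r["body"] for r in results)

def use_tool(action: dict) -> str:
    if action["tool_name"] == "calculator":
        return calculator(action["input"])
    if action["tool_name"] == "search":
        return search(action["input"])
    return action["input"]  # final_answer: pass the text straight to the user

def run(input_prompt: str) -> str:
    res = generate_text(input_prompt)
    action = parse_action(res[0]["generated_text"])
    return use_tool(action)
```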
This is the tool output we get. We already have this information in our context, so now we just add that primer again, the assistant JSON tool-name primer, run it, and see what we get. Okay, we get pretty good output here; this is what we'd expect, so that's good.

But I wanted to try something more, because we've built this agent-type thing with multiple tools, and one of those we haven't used yet: the search tool. This is just a web search using DuckDuckGo search. The reason we're using it is that it's super simple to set up; of course, you can do more complicated tooling here as well. What it's going to do is run the DuckDuckGo search, retrieve the results we get back, and return them in this format: basically a set of results separated by new lines and these three dashes. Then we return that as the tool output, kind of like what we did before, but this time we're returning context.

To test that, we need to ask another question, so I'm going to go with the query "who is the current prime minister of the UK?". This is not a question an LLM can answer by itself; on its own it would have no chance. We can try this and see what we get. We might want to print this, I think. Okay, so it turns out Rishi Sunak is still the prime minister, that's surprising, and we get these results. But obviously we don't want this raw tool output; we want our LLM to produce a nicer answer for us. So we then pass this output onward. Actually, I don't know if I showed you that: if I print the output, it shows the whole input prompt plus what we've just generated. You can see we have everything in there, plus the most recent steps: the tool input, the search "who is the current prime minister of the UK", and the tool output. The last thing we want to do is append this final primer onto that. So I run this again, and let's see what we get. Okay: "Rishi Sunak has been Prime Minister of the UK since the 25th of October 2022." That looks pretty good; I think those are pretty good results.

What I'm taking from this: I haven't tested the new Mixtral model enough yet to confidently say it's incredible, but it seems to work incredibly well from what I have seen so far. If I compare it to using GPT-3.5 as an agent, which is probably the most common agent model out there, it seems to be more reliable, just from the limited testing I've done. With GPT-3.5 I would often be adding output parsers to handle malformed JSON and so on; here I haven't needed to do that at all, which seems pretty promising.

Anyway, that's it for this video. I think this is super exciting. I'm definitely going to do more on Mixtral going forward; we're also going to talk about mixture of experts, which I think will be pretty interesting as well. But we'll leave it there for now. Thank you very much for watching, I hope it's been useful and interesting, and I will see you again in the next one.