Mixtral 8X7B — Deploying an *Open* AI Agent
Chapters
0:00 Mixtral 8x7B is better than GPT-3.5
0:50 Deploying Mixtral 8x7B
3:21 Mixtral Code Setup
8:17 Using Mixtral Instructions
10:04 Mixtral Special Tokens
13:29 Parsing Multiple Agent Tools
14:28 RAG with Mixtral
17:01 Final Thoughts on Mixtral
00:00:00.000 |
Today we have a very exciting video, we are going to test and see the performance of the new 00:00:06.960 |
Mixtral 8x7B (8x 7-billion-parameter) model, which honestly from what I've seen so far is pretty 00:00:14.160 |
incredible. It is far better than any other open model that I've tested and at the same time it's 00:00:20.720 |
very fast. Most open models I've tested that are reasonably large and reasonably 00:00:25.600 |
performant are just incredibly slow and you can't really use them. This is actually incredibly fast, 00:00:30.800 |
incredibly performant, and I can actually use this rather than something like GPT-3.5, 00:00:36.080 |
and that's the first time I can actually confidently say that about an open weights model. 00:00:41.280 |
So let's jump straight into how we can use this model and we're going to toy around with using it 00:00:46.880 |
in an agent-like flow. So I'm going to use it through RunPod. RunPod seems to be pretty 00:00:55.840 |
good, I tried a few different options to see what would be kind of easy to use and not super 00:01:01.920 |
expensive and RunPod seemed pretty good. They're not sponsoring this or anything, they just seem 00:01:08.080 |
like a really good option compared to what else is out there and yeah we have a few H100s 00:01:15.760 |
on here which is pretty cool but what we can go with is the A100s, it's cheaper and it works just 00:01:22.080 |
as well. So I'm going to set up two A100s, I'm going to click deploy and then you can customize 00:01:29.680 |
your deployment here. Okay, for that custom deployment I think you can probably go a 00:01:35.680 |
little lower than what I'm going with here, but this worked well for me without being too obsessive. 00:01:42.400 |
So I set my container disk to 120 and my volume disk to 600 and that's going to get us pretty 00:01:49.200 |
close to the limit but you know we're going to still have some breathing room. So I'm going to 00:01:54.240 |
set those overrides, and this one thing here doesn't matter, we're not going to be using it, 00:02:00.160 |
but we are going to be using Jupyter Notebook so make sure that is checked and we'll just continue. 00:02:08.320 |
Okay so we have the pricing cost, it's not cheap but it could be worse, and we can go ahead and 00:02:15.440 |
deploy that. I expect that this is going to get cheaper as we have the quantized models 00:02:22.800 |
being released and I've already seen that some of them have been released so we're going to have 00:02:29.200 |
those anyway. Now in here we have this one running so this is the one I just created, 00:02:35.920 |
it's starting up. I'm going to stop it in a moment because I already have another one running and you 00:02:41.440 |
can see that my container utilization is about 78%. If you go too low on the volume during the 00:02:50.000 |
build time then you'll probably exceed that so you don't want to go too low on volume but at the 00:02:56.400 |
moment it's like zero percent whilst nothing is being downloaded or running. So I'm going to stop 00:03:03.360 |
that and just delete it quickly and what I'm going to do is come over to my pod here, we go to 00:03:14.960 |
connect and we click on this connect to JupyterLab. That opens this window here and yours should be 00:03:24.400 |
empty. I'm going to be using this notebook here, there will be a link to this notebook at the top 00:03:30.480 |
of the video right now. You can download this notebook and then you can just upload it to your 00:03:36.480 |
run pod here. So you do upload and just find your file in here so I'd be uploading this one. 00:03:44.400 |
Okay cool so I have some notes here, I'm trying to write this out into more of like a written 00:03:50.080 |
guide as well, but for now let's just jump straight into the prerequisites we need and actually 00:03:55.440 |
testing all this. So for the pip installs we have Hugging Face Transformers, so we're going to be 00:04:02.160 |
using the Hugging Face Transformers version of the model. Accelerate, which is sort of a Hugging 00:04:09.040 |
Face thing, is there so that we're using the GPU in the way that we would like to use it, and because 00:04:14.160 |
we're using agents I wanted to add a couple of tools to our agent. So one of those is going 00:04:20.000 |
to be a web search tool. Those will install; I already have them installed so I don't need to reinstall. 00:04:25.600 |
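A minimal sketch of those installs (I'm assuming the `duckduckgo-search` package for the web search tool, since the exact library isn't named here):

```python
!pip install -qU \
    transformers \
    accelerate \
    duckduckgo-search
```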
Then we come down to here, and we're going to be using the instruct model, 00:04:30.800 |
the fine-tuned model of Mixtral 8x7B. There's also the normal model, which is just this, so this is 00:04:40.080 |
the pre-trained model without the extra instruct fine-tuning. We run this, and there's a few things 00:04:45.840 |
in here that are important that you should be aware of. I don't think I need the CUDA here. 00:04:50.240 |
So we're trusting remote code: basically Hugging Face hasn't got an object class for Mixtral yet, 00:04:59.040 |
so we have to set trust remote code so that it basically modifies some of the code and runs 00:05:04.480 |
things from within the object itself. We also want the torch data type, we're going to be 00:05:12.800 |
using float16, and the device map is going to be auto. Okay so for this bit here, this is where we 00:05:20.000 |
need accelerate installed. So that actually just installed super quickly for me because I already 00:05:26.880 |
had it downloaded. When you first download this model on RunPod I think it took probably 00:05:34.560 |
around 20 to 25 minutes, so it does take a little while because there's a lot of weights in there. 00:05:42.160 |
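Roughly, the model initialization being described looks like this (using the `mistralai/Mixtral-8x7B-Instruct-v0.1` model ID from Hugging Face):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # the instruct fine-tuned variant

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,     # transformers had no native Mixtral class at the time
    torch_dtype=torch.float16,  # fp16 weights to fit across the two A100s
    device_map="auto",          # needs accelerate; spreads layers across the GPUs
)
model.eval()
```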
Okay cool, so let's take a look at the tokenizer, or what is next. So with LLMs and transformer models in 00:05:53.280 |
general, what they use is something called a tokenizer to translate things from plain text 00:06:00.480 |
to these arrays of tokens which are then read by the first layer of the transformer/LLM model. So 00:06:07.760 |
we need to initialize the tokenizer that Mixtral uses, so we just pass the Mixtral model ID 00:06:14.080 |
into auto tokenizer from pre-trained and that will load that tokenizer for us, and then with all of 00:06:20.560 |
that set up we can go ahead and initialize the text generation pipeline. So the text generation 00:06:27.680 |
pipeline needs a model, so the Mixtral model, and we also need Mixtral's tokenizer. Return full text: 00:06:34.480 |
I've set that to false; basically if you're using LangChain, or at least the last time I used this 00:06:40.000 |
with LangChain, you had to set that to true for things to work. We're going to be using text 00:06:44.800 |
generation as the task that we're doing here, and then there's a few parameters here that you 00:06:49.200 |
can modify. Okay so repetition penalty, for example: this was important for some of the other 00:06:56.320 |
models I was toying around with. I'm not sure how important it is with Mixtral but you can 00:07:02.960 |
essentially increase this number and it will reduce the likelihood of the model repeating 00:07:08.400 |
itself, which is actually something I see even in GPT-3.5 quite often: when you keep repeating the 00:07:15.280 |
same input it tends to kind of go into a loop. 00:07:24.400 |
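Putting the tokenizer and pipeline together, a sketch along the lines described (the exact generation parameter values aren't all shown, so treat them as placeholders):

```python
from transformers import AutoTokenizer, pipeline

# the tokenizer translates plain text into the token IDs the model reads
tokenizer = AutoTokenizer.from_pretrained(model_id)

generate_text = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,  # set to True if LangChain needs the full prompt back
    repetition_penalty=1.1,  # >1.0 discourages the model from looping on itself
    max_new_tokens=512,      # placeholder value
)
```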
So yeah, we can generate some text. The first time you run this it takes a lot longer, and then after that it should be pretty quick. 00:07:30.240 |
So I'll wait a moment now while that's running, we can come over to here. So this is the 00:07:36.160 |
RunPod instance that we have and we can just take a look at how much GPU this is actually using. 00:07:40.880 |
Okay so you can see it started to pick up now. We have memory used on GPU zero, and nothing 00:07:50.720 |
much on GPU one yet, so that's still running. Okay, now we see that we're using number one as well and we 00:07:59.200 |
get our output. Okay, and the output is, you know, kind of random, because we just put in some random 00:08:05.680 |
text without using the special tokens or providing any instructions as to what the model 00:08:10.400 |
should be doing. So this sort of random output here is pretty normal. Now let's take a 00:08:17.760 |
look at how we avoid that sort of output and get something that's actually useful. 00:08:23.760 |
So as I said we haven't provided any instructions to the model so that's the first thing that we 00:08:28.560 |
should do. So I'm going to do that here. I'm going to say okay you're a helpful AI assistant you can 00:08:34.560 |
help with a ton of things so on and so on. We add some descriptions for the tools that we're going 00:08:42.560 |
to be using. So you can see that here we're talking about Python code that is for the calculator tool 00:08:49.120 |
which does its calculations using Python. The search tool which is for the search 00:08:54.560 |
and the final answer, which is used to return an answer to the user. And then it gives 00:09:01.840 |
a little bit of an example of how to use them, and then we finish with: okay, this is the end, the 00:09:10.240 |
user's query is as follows, and then we have the user's query, and then we have the assistant. And 00:09:14.720 |
we are using this JSON format; it aligns with the LangChain format for agents, where we have the 00:09:23.280 |
tool name and we'll also have the input. Okay, and we've given the model instructions on how to use 00:09:29.200 |
that here. So this is like the primer. Okay, we run this, and then we want to generate some more text. 00:09:37.280 |
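The system message being described is along these lines (the exact wording isn't shown in the video, so this is a paraphrase):

```python
system_message = """You are a helpful AI assistant, you can help with a ton of things.
You have access to the following tools:

- Calculator: performs calculations by running the Python code you provide.
- Search: performs a web search for the query you provide.
- Final Answer: returns a final answer to the user.

Always respond with JSON in the format:
{"tool_name": "<name of the tool>", "input": "<input for the tool>"}
"""
```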
Okay, so it's super quick. All right, so looking at the question that we asked: 00:09:43.680 |
"hi there, I'm stuck on a math problem, can you help? My question is what is the square root of 512 00:09:49.840 |
multiplied by 7", it decides to go with the calculator, and the input is `from math import 00:09:57.440 |
sqrt` and then `sqrt(512) * 7`. So that looks pretty correct. Now the second 00:10:05.840 |
thing that I mentioned up here, about how we should structure those inputs to get good outputs, 00:10:12.000 |
is that here we've not used the recommended instruction format, which looks something like this. I'm not 100% 00:10:17.440 |
sure that it is exactly perfect, but this is what I've seen being used. So what 00:10:25.920 |
we are going to be inputting into the model is this beginning-of-string token; we're going to say 00:10:30.720 |
beginning of instructions, end of instructions, and then we have our primer text, right, and then we let 00:10:36.640 |
the LLM generate some outputs. Originally I actually moved this over here and we got some pretty 00:10:43.680 |
weird results, so I think it's supposed to be like this. So let's come down to here and we'll see how 00:10:49.440 |
we'd actually do that. So we come to the instruction format: we're going to do beginning of string, 00:10:55.360 |
beginning of instructions, end of instructions, and then we have our user input, our system message and so on. 00:11:00.080 |
Okay, we don't add that final end-of-string token in there. 00:11:07.440 |
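A sketch of that instruction format, where `<s>` is the beginning-of-string token and `[INST]`/`[/INST]` mark the instructions (the trailing primer is my guess at how the assistant's JSON reply gets started):

```python
query = (
    "Hi there, I'm stuck on a math problem, can you help? "
    "My question is what is the square root of 512 multiplied by 7?"
)

def instruction_format(sys_message: str, query: str) -> str:
    # Note: we deliberately do NOT append the closing </s> token,
    # so the model keeps generating from where the prompt leaves off.
    return (
        f"<s> [INST] {sys_message} [/INST]\n"
        f"User: {query}\n"
        'Assistant: {"tool_name": '  # primer steering the model into the JSON format
    )

input_prompt = instruction_format(system_message, query)
```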
So I'm going to run this. We need to pass in our system message, which is the same as we saw before, 00:11:12.960 |
and then we want to pass in the user's query. So we run that, and then we can have a look at what we have, so print the input 00:11:24.560 |
prompt. Okay, so this is the input we're getting, and if we come down to here we see the end of the 00:11:32.160 |
instructions after the instructions are finished, and then we have the chat log, so it's "hi there, 00:11:37.600 |
I'm stuck on a math problem" and so on, and then the assistant is starting to reply with this, 00:11:42.400 |
but then we need to generate the rest of this. So we do that here, and we get the calculator, and it 00:11:51.600 |
tells us the correct thing there. Now we can parse this into Python-executable code. All right, so this 00:11:58.000 |
is kind of how an agent would work. Then we use that to get our formatted output, so this is a 00:12:06.160 |
Python dictionary now, and then we use the calculator. So we say if the tool name is calculator, then 00:12:13.280 |
we're going to use the calculator, which is Python, and we get this, 158.39..., okay, so that is the answer. 00:12:19.520 |
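A sketch of the parsing and calculator steps (helper names here are my own, not necessarily the notebook's, and the parsing assumes the model stops cleanly at the closing brace):

```python
import json

res = generate_text(input_prompt)  # the model continues from the JSON primer

def format_output(generated: str) -> dict:
    """Re-attach the primer's opening and parse the JSON action into a dict."""
    return json.loads('{"tool_name": ' + generated.strip())

def run_calculator(code: str) -> str:
    # WARNING: executing model-generated code is unsafe outside a sandbox
    lines = code.strip().splitlines()
    scope: dict = {}
    exec("\n".join(lines[:-1]), scope)  # run any imports / setup lines
    return str(eval(lines[-1], scope))  # evaluate the final expression

action = format_output(res[0]["generated_text"])
if action["tool_name"] == "Calculator":
    tool_output = f"Tool output: {run_calculator(action['input'])}"
    print(tool_output)  # e.g. Tool output: 158.3919...
```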
Now, if we add all this information that we just received to our prompt, we're going 00:12:25.040 |
to get something like this. So we have our agent template and then we append all the other stuff 00:12:28.720 |
into there, and we also get our output, which is this number here. So run this, 00:12:37.520 |
and you can see at the bottom we have the tool output, which the LLM can now use. 00:12:46.720 |
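The appending step looks roughly like this (variable names are assumptions; the idea is just prompt + model action + tool result + a fresh primer):

```python
# append the agent's action and the tool result to the prompt, then prime
# the next round of generation with the start of another JSON action
full_prompt = (
    input_prompt
    + '{"tool_name": ' + res[0]["generated_text"]  # the action we just parsed
    + f"\n{tool_output}\n"
    + 'Assistant: {"tool_name": '
)
res = generate_text(full_prompt)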
So now, using this full prompt with the additional stuff in there, we're going to run this generate 00:12:53.680 |
text and see what happens. So again we get the correct format: we have final answer and we have 00:12:59.280 |
the input that we'd like. So we run this, and yeah, it's basically passing in all that 00:13:05.680 |
information that we saw so far into the final LLM call, and we get this: "the square root of 00:13:12.080 |
512 multiplied by 7 is approximately...", you know, that same number. Okay, so that's a good sort 00:13:17.440 |
of human-like answer. Now, if the model says it would like to return the final answer, so it would 00:13:24.720 |
like to actually respond to the user, we have another function for that, and then we kind of 00:13:31.120 |
put all the tool-parsing logic into a single function. Okay, so we decide on which tool: this 00:13:38.720 |
dictionary here contains the tool name, and with that we can just go ahead and use that tool. 00:13:45.520 |
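Folding the tool-parsing logic into one function might look like this (again a sketch; `run_search` is defined below in the search-tool section):

```python
def run(input_prompt: str) -> str:
    """Generate the agent's next action and route it to the chosen tool."""
    res = generate_text(input_prompt)
    action = format_output(res[0]["generated_text"])
    name = action["tool_name"]
    if name == "Calculator":
        return f"Tool output: {run_calculator(action['input'])}"
    if name == "Search":
        return f"Tool output: {run_search(action['input'])}"  # sketched below
    if name == "Final Answer":
        return action["input"]  # this is the reply we show to the user
    raise ValueError(f"Unrecognized tool: {name}")
```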
So we're going to create some prompt, and once we have our input prompt, we're going to be 00:13:53.680 |
passing that to the run function here. So let's run this, pass our input prompt into the run function, 00:13:59.280 |
and then we get this tool output. Right, so it's deciding, based on what we've asked 00:14:03.920 |
it, that it needs to use the calculator tool, and this is what we get. So now we 00:14:11.360 |
have this information in our context already, and now we just add that primer again, so "Assistant:" 00:14:16.320 |
and the JSON tool name, that's the primer. So we run this, and then let's see what we get. Okay, so 00:14:23.120 |
we get pretty good output here, this is what we'd expect, so that's good. But I wanted to try something 00:14:30.400 |
more, because we've built this agent-type thing and it has multiple tools, one of which we 00:14:36.080 |
haven't used yet: the search tool. Okay, so this is just a web search using DuckDuckGo search. 00:14:41.680 |
The reason we're using this is because it's super simple to do; of course we can do more complicated 00:14:48.640 |
tooling here as well. But what it's going to do is use the DuckDuckGo search, it's going to retrieve the 00:14:54.800 |
items, or the results that we get, and it's going to return them in this format here: basically 00:15:00.800 |
a set of results separated by new lines and these three dashes, and then we're going to return that 00:15:09.120 |
as the tool output. Okay, so kind of like what we did here, but we're going to return context. 00:15:16.240 |
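A sketch of that search tool, assuming the `duckduckgo-search` package installed earlier:

```python
from duckduckgo_search import DDGS

def run_search(query: str) -> str:
    """Web search tool: fetch results and join them into one context string."""
    with DDGS() as ddgs:
        results = list(ddgs.text(query, max_results=5))
    # join result snippets with newlines and three dashes, as described above
    contexts = [f"{r['title']}\n{r['body']}" for r in results]
    return "\n---\n".join(contexts)
```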
To do that we need to ask another question, so I'm going to go with a query, and that query will be 00:15:27.200 |
who is the current prime minister of the UK. You know, this is not a question that most models can answer, so 00:15:35.920 |
an LLM by itself would have no chance. We can try this, 00:15:40.400 |
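Tying it together for this query (a sketch reusing the helpers above):

```python
query = "Who is the current prime minister of the UK?"
input_prompt = instruction_format(system_message, query)
out = run(input_prompt)  # the agent should choose the Search tool here
print(out)
```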
and let's see what we get. We might want to print this, I think. 00:15:53.600 |
So it turns out Rishi Sunak is still the prime minister, that's surprising, 00:15:59.200 |
and yeah, we get these results, but obviously we don't want this raw tool output, we want our 00:16:06.640 |
LLM to produce a nicer output for us, so we then pass in this output here. So, actually, I don't know 00:16:14.880 |
if I showed you that, so if I print the output it's basically going to show us the whole 00:16:22.160 |
input prompt plus what we've just generated. Okay, and you can see we have everything in there 00:16:28.480 |
plus the most recent parts: the tool input, the search "who's the current prime minister of the UK", 00:16:33.280 |
the tool outputs, and the last thing we want to do is add onto that, you know, this here. Okay, 00:16:41.680 |
so I want to run this again, and then let's see what we get. Okay, so again, Rishi Sunak has been 00:16:51.040 |
the prime minister of the UK since the 25th of October 2022, so yeah, that looks pretty 00:16:59.920 |
good. I think those are pretty good results, and I think what I'm taking from this is: I haven't tested 00:17:07.120 |
the new Mixtral model enough yet to confidently say it's incredible, but it seems to work incredibly 00:17:13.840 |
well from what I have seen so far. If I compare it to using GPT-3.5 as an agent, which is probably the 00:17:20.640 |
most common agent model out there, it seems to be more reliable, just from the limited testing I have 00:17:28.960 |
done. With GPT-3.5 I would often be adding in output parsers handling things like malformed JSON 00:17:36.960 |
and so on and so on; with this I haven't needed to do that at all, which I think seems pretty promising. 00:17:45.600 |
But anyway, that's it for this video. I think this is super exciting; I'm definitely going to do more 00:17:49.680 |
on Mixtral going forwards. We're also going to talk about mixture of experts, I think that'll 00:17:54.320 |
be pretty interesting as well, but we'll leave it there for now. Thank you very much for watching, 00:17:58.800 |
I hope it's been useful and interesting, and I will see you again in the next one.