
Mixtral 8X7B — Deploying an *Open* AI Agent


Chapters

0:00 Mixtral 8X7B is better than GPT 3.5
0:50 Deploying Mixtral 8x7B
3:21 Mixtral Code Setup
8:17 Using Mixtral Instructions
10:04 Mixtral Special Tokens
13:29 Parsing Multiple Agent Tools
14:28 RAG with Mixtral
17:01 Final Thoughts on Mixtral


00:00:00.000 | Today we have a very exciting video: we are going to test and see the performance of the new
00:00:06.960 | Mixtral 8x7 billion parameter model, which honestly, from what I've seen so far, is pretty
00:00:14.160 | incredible. It is far better than any other open model that I've tested, and at the same time it's
00:00:20.720 | very fast. Most open models I've tested that are reasonably large and reasonably
00:00:25.600 | performant are just incredibly slow, and you can't really use them. This is actually incredibly fast,
00:00:30.800 | incredibly performant, and I can actually use this rather than something like GPT-3.5,
00:00:36.080 | and that's the first time I can confidently say that about an open weights model.
00:00:41.280 | So let's jump straight into how we can use this model, and we're going to toy around with using it
00:00:46.880 | in an agent-like flow. So I'm going to use it through RunPod. RunPod seems to be pretty
00:00:55.840 | good; I tried a few different options to see what would be kind of easy to use and not super
00:01:01.920 | expensive, and RunPod seemed pretty good. They're not sponsoring this or anything; they just seem
00:01:08.080 | like a very good option compared to what else is out there. And yeah, we have a few H100s
00:01:15.760 | on here, which is pretty cool, but what we can go with is the A100s; it's cheaper and it works just
00:01:22.080 | as well. So I'm going to set up two A100s, I'm going to click deploy, and then you can customize
00:01:29.680 | your deployment here. Okay, for that custom deployment I think you can probably go a
00:01:35.680 | little lower than what I'm going with here, but this worked well for me without being excessive.
00:01:42.400 | So I set my container disk to 120 GB and my volume disk to 600 GB, and that's going to get us pretty
00:01:49.200 | close to the limit, but you know, we're still going to have some breathing room. So I'm going to
00:01:54.240 | set those overrides, and the one thing, like this, doesn't matter, we're not going to be using this,
00:02:00.160 | but we are going to be using Jupyter Notebook, so make sure that is checked, and we'll just continue.
00:02:08.320 | Okay, so we have the pricing forecast; it's not cheap, but it could be worse, and we can go ahead and
00:02:15.440 | deploy that. I expect that this is going to get cheaper as we have the quantized models
00:02:22.800 | being released, and I've already seen that some of them have been released, so we're going to have
00:02:29.200 | those anyway. Now in here we have this one running, so this is the one I just created;
00:02:35.920 | it's starting up. I'm going to stop it in a moment because I already have another one running, and you
00:02:41.440 | can see that my container utilization is about 78%. If you go too low on the volume, then during the
00:02:50.000 | build time you'll probably exceed it, so you don't want to go too low on volume, but at the
00:02:56.400 | moment it's at zero percent whilst nothing is being downloaded or running. So I'm going to stop
00:03:03.360 | that and just delete it quickly, and what I'm going to do is come over to my pod here; we go to
00:03:14.960 | connect and we click on this "connect to JupyterLab". That opens this window here, and yours should be
00:03:24.400 | empty. I'm going to be using this notebook here; there will be a link to this notebook at the top
00:03:30.480 | of the video right now. You can download this notebook and then you can just upload it to your
00:03:36.480 | RunPod here. So you do upload and just find your file in here, so I'd be uploading this one.
00:03:44.400 | Okay cool, so I have some notes here; I'm trying to write this out as more of a written
00:03:50.080 | guide as well, but for now let's just jump straight into the prerequisites and actually
00:03:55.440 | testing all this. So for the pip installs we have Hugging Face Transformers, as we're going to be
00:04:02.160 | using the Hugging Face Transformers version of the model, and Accelerate, which is a Hugging
00:04:09.040 | Face library that ensures we're using the GPUs in the way that we would like. And because
00:04:14.160 | we're using agents, I wanted to add a couple of tools for our agent. So one of those is going
00:04:20.000 | to be a web search tool. Those will install; I already have them installed, so I don't need to reinstall.
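
As a rough sketch, that install cell might look like the following; the exact search package isn't named on screen, so duckduckgo_search here is an assumption:

```python
# transformers: the model + tokenizer, accelerate: multi-GPU placement,
# duckduckgo_search (assumed): the web search tool used later
!pip install -qU transformers accelerate duckduckgo_search
```
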
00:04:25.600 | Then we'll come down to here, and we're going to be using the instruct
00:04:30.800 | fine-tuned model of Mixtral 8x7B. There's also, I think, the normal model, which is just this; this is
00:04:40.080 | the pre-trained model without the extra instruct fine-tuning. We run this; there's a few things
00:04:45.840 | in here that are important, that you should be aware of. I don't think I need the CUDA setting here.
00:04:50.240 | So we're trusting remote code: basically, Hugging Face hasn't got a model class for Mixtral yet,
00:04:59.040 | so we have to set trust_remote_code so that it can pull in the model's own code and run
00:05:04.480 | things within the object itself. We also want the torch data type, where we're going to be
00:05:12.800 | using float16, and the device map is going to be auto. Okay, so this bit here is where we
00:05:20.000 | need Accelerate installed. So that actually just installed super quickly for me because I already
00:05:26.880 | had it downloaded. When you first download this model, on my RunPod I think it took probably
00:05:34.560 | around 20 to 25 minutes, so it does take a little while because there's a lot of weights in there.
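
Putting those settings together, the model initialization looks roughly like this; a sketch using the standard transformers API, with the instruct checkpoint's Hugging Face model ID:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,      # run the model's own code; no native Mixtral class yet
    torch_dtype=torch.float16,   # half precision so the weights fit on the GPUs
    device_map="auto",           # Accelerate spreads the layers across both A100s
)
```
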
00:05:42.160 | Okay cool, so let's take a look at the tokenizer, or what is next. So with LLMs, and transformer models in
00:05:53.280 | general, what they use is something called a tokenizer to translate things from plain text
00:06:00.480 | into these arrays of tokens, which are then read by the first layer of the transformer/LLM model. So
00:06:07.760 | we need to initialize the tokenizer that Mixtral uses, so we just pass the Mixtral model ID
00:06:14.080 | into AutoTokenizer.from_pretrained, and that will load that tokenizer for us. And then, with all of
00:06:20.560 | that set up, we can go ahead and initialize this text generation pipeline. So the text generation
00:06:27.680 | pipeline needs a model, so the Mixtral model, and we also need Mixtral's tokenizer. Return full text
00:06:34.480 | I've set to False; basically, if you're using LangChain, or at least the last time I used this
00:06:40.000 | with LangChain, you had to set that to True for things to work. We're going to be using text
00:06:44.800 | generation as the task that we're doing here, and then there's a few parameters here that you
00:06:49.200 | can modify. Okay, so repetition penalty, for example: this was important for some of the other
00:06:56.320 | models I was toying around with. I'm not sure how important it is with Mixtral, but you can
00:07:02.960 | essentially increase this number and it will reduce the likelihood of the model repeating
00:07:08.400 | itself, which is actually something I see even in GPT-3.5 quite often; when you keep repeating the
00:07:15.280 | same input, it tends to kind of go into a loop. So yeah, we can generate some text now. The
00:07:24.400 | first time you run this it takes a lot longer, and then after that it should be pretty quick.
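
A sketch of that tokenizer and pipeline setup; the generation parameter values here are illustrative, not necessarily the notebook's exact numbers:

```python
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(model_id)

generate_text = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=False,    # set True if you feed this into LangChain
    repetition_penalty=1.1,    # >1.0 reduces the chance of the model looping
    max_new_tokens=512,
)

res = generate_text("hello there, how")
print(res[0]["generated_text"])
```
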
00:07:30.240 | So I'll wait a moment now; while that's running, we can come over to here. So this is the
00:07:36.160 | RunPod that we have, and we can just take a look at how much GPU this is actually using.
00:07:40.880 | Okay, so you can see it's started to pick up now. We have memory used on GPU zero, and nothing
00:07:50.720 | much on GPU one yet, so that's still running. Okay, we see that we're using number one as well, and we
00:07:59.200 | get our output. Okay, and the output is, you know, kind of random, because we just put in some random
00:08:05.680 | text without using the special tokens or providing any instructions as to what the model
00:08:10.400 | should be doing. So this sort of random output here is pretty normal. Now let's take a
00:08:17.760 | look at, okay, how do we avoid that sort of output and get something that's actually useful?
00:08:23.760 | So, as I said, we haven't provided any instructions to the model, so that's the first thing that we
00:08:28.560 | should do. So I'm going to do that here. I'm going to say, okay, you're a helpful AI assistant, you can
00:08:34.560 | help with a ton of things, so on and so on. We add some descriptions for the tools that we're going
00:08:42.560 | to be using. So you can see that here: we're talking about Python code, that is for the calculator tool,
00:08:49.120 | which does its calculations using Python; the search tool, which is for web search;
00:08:54.560 | and the final answer, which is, like, return an answer to the user. Then it gives
00:09:01.840 | a little bit of an example of how to use them, and then we finish with, okay, this is the end, right, the
00:09:10.240 | user's query is as follows, and then we have the user's query, and then we have the assistant. And
00:09:14.720 | we are using this JSON format; it aligns with the LangChain format for agents, where we have the
00:09:23.280 | tool name, and we'll also have the input, and we've given the model instructions on how to use
00:09:29.200 | that here. So this is, like, the primer.
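
The shape of that primer is roughly the following; this is a paraphrase of what's described in the video, not the notebook's verbatim prompt:

```python
system_message = """You are a helpful AI assistant, you can help with a wide
range of tasks. You have access to the following tools:

Calculator: performs calculations by running Python code.
Search: performs a web search for up-to-date information.
Final Answer: returns a final answer to the user.

To use a tool, respond with a JSON object containing the tool name and the
tool input, for example:

{"tool_name": "Calculator", "input": "<python code to run>"}
"""
```
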
00:09:37.280 | Okay, we run this, and then we want to generate some more text. Okay, so it's super quick. All right, so if I look at the question that we asked,
00:09:43.680 | "Hi there, I'm stuck on a math problem, can you help? My question is what is the square root of 512
00:09:49.840 | multiplied by 7?", it decides to go with the calculator, and the input is from math import
00:09:57.440 | sqrt, and then sqrt of 512 multiplied by 7. So that looks pretty correct. Now the second
00:10:05.840 | thing that I mentioned up here, about how we should structure those inputs to get good outputs,
00:10:12.000 | is that here we've not used the recommended instruction format, which looks something like this. I'm not 100%
00:10:17.440 | sure exactly if it is perfect, but this is what I've seen being used. So what we have,
00:10:25.920 | what we are going to be inputting into the model, is this beginning-of-string token, then we add the
00:10:30.720 | beginning-of-instructions and end-of-instructions tokens, and then we have our primer text, right, and then we let
00:10:36.640 | the LLM generate some outputs. Originally I actually moved this over here and got some pretty
00:10:43.680 | weird results, so I think it's supposed to be like this. So let's come down to here, and we'll see how
00:10:49.440 | we'd actually do that. So we come to the instruction format: we're going to do the beginning-of-string token,
00:10:55.360 | beginning of instructions, end of instructions, and then we have our user input, our system message, and so on.
00:11:00.080 | Okay, we don't add that final end-of-string token in there.
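
A minimal sketch of building that prompt: `<s>` is the beginning-of-string token and `[INST]`/`[/INST]` are Mistral's instruction markers; the helper name and the exact placement of the user and assistant turns are my assumptions:

```python
def instruction_format(sys_message: str, query: str) -> str:
    # <s> opens the string, [INST]...[/INST] wrap the instructions.
    # Note: no closing </s> at the end; the model generates that itself.
    return (
        f"<s> [INST] {sys_message} [/INST]\n"
        f"User: {query}\n"
        f"Assistant: "
    )

query = (
    "Hi there, I'm stuck on a math problem, can you help? "
    "My question is what is the square root of 512 multiplied by 7?"
)
input_prompt = instruction_format(system_message, query)
```
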
00:11:07.440 | So I'm going to run this; we need to pass in our system message, which is the same as we saw before,
00:11:12.960 | and then we want to pass in the user's query. So we run that, and then we can have a look at what we have, so print input format, sorry, input
00:11:24.560 | prompt. Okay, so this is the input we're getting, and if we come down to here, we see the
00:11:32.160 | end-of-instructions token after the instructions are finished, and then we have the chat log, so it can be "hi there,
00:11:37.600 | I'm stuck on a math problem", so on and so on, and then the assistant is starting to reply with this,
00:11:42.400 | but then we need to generate the rest of it. So we do that here, and we get the calculator action, and it
00:11:51.600 | tells us the correct thing there. Now we can parse this into Python-executable code. All right, so this
00:11:58.000 | is kind of how an agent would work: we use that to get our formatted output, so this is a
00:12:06.160 | Python dictionary now, and then we use the calculator. So we say, if the tool name is calculator, then
00:12:13.280 | we're going to use the calculator, which is Python, and we get this: 158. Okay, so that is the answer.
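
The parsing and calculator steps might look something like this; executing model-generated Python is fine for a demo, but not something you'd want in production:

```python
import json

# the model replies with a JSON action, e.g.
# {"tool_name": "Calculator", "input": "from math import sqrt\nsqrt(512) * 7"}
output = generate_text(input_prompt)[0]["generated_text"]
action = json.loads(output)

if action["tool_name"] == "Calculator":
    # run the setup lines (imports etc.), then evaluate the final expression
    scope: dict = {}
    *setup_lines, last_line = action["input"].strip().split("\n")
    exec("\n".join(setup_lines), scope)
    tool_output = eval(last_line, scope)

print(tool_output)  # -> 158.39...
```
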
00:12:19.520 | Now, if we add all this information that we just received to our prompt, we're going
00:12:25.040 | to get something like this: we have our agent template, and then we append all the other stuff
00:12:28.720 | into there, and we also get our output, which is this number here. So run this, and
00:12:37.520 | you can see at the bottom we have the tool output, which the LLM can now use.
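
In code, extending the prompt with everything received so far might look like this; the exact template strings are assumptions:

```python
# append the model's action and the tool's result to the running prompt,
# then prime the assistant to continue from there
full_prompt = (
    input_prompt
    + output
    + f"\nTool output: {tool_output}\n"
    + "Assistant: "
)
final_output = generate_text(full_prompt)[0]["generated_text"]
print(final_output)
```
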
00:12:46.720 | So now, using this full prompt with the additional stuff in there, we're going to run this generate
00:12:53.680 | text and see what happens. So again we get the correct format: we have final answer, and we have
00:12:59.280 | the input that we'd like. So we run this, and yeah, it's basically passing all that
00:13:05.680 | information that we saw so far into the final LLM call, and we get this: "the square root of
00:13:12.080 | 512 multiplied by 7 is approximately..." you know, a similar number. Okay, so that's a good sort
00:13:17.440 | of human-like answer. Now, if we say, okay, we would like to return the final answer, so we'd
00:13:24.720 | like to actually respond to the user, we have another function for that, and then we kind of
00:13:31.120 | put all the tool-parsing logic into a single function. Okay, so we decide on which tool to use: this
00:13:38.720 | dictionary here contains the tool name, and with that we can just go ahead and use that tool.
00:13:45.520 | So we're going to create some prompt, and once we have our input prompt, we're going to be
00:13:53.680 | passing that to the run function here. So let's run this, passing our input prompt into the run function,
00:13:59.280 | and then we get this tool output. Right, so it's deciding, based on what we've asked
00:14:03.920 | it, that it needs to use the calculator tool, and this is what we get.
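
That combined logic could be sketched along these lines; use_tool and run are my names, and run_web_search is the assumed search helper sketched a little further below:

```python
def use_tool(action: dict) -> str:
    """Route a parsed action to the matching tool and return the tool's output."""
    if action["tool_name"] == "Calculator":
        scope: dict = {}
        *setup_lines, last_line = action["input"].strip().split("\n")
        exec("\n".join(setup_lines), scope)   # run any imports/setup
        return str(eval(last_line, scope))    # value of the final expression
    elif action["tool_name"] == "Search":
        return run_web_search(action["input"])  # assumed helper, sketched below
    elif action["tool_name"] == "Final Answer":
        return action["input"]  # nothing to run, just respond to the user

def run(input_prompt: str) -> str:
    """Generate an action from the prompt, parse the JSON, and use the chosen tool."""
    output = generate_text(input_prompt)[0]["generated_text"]
    action = json.loads(output)
    return use_tool(action)

tool_output = run(input_prompt)
```
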
00:14:11.360 | So now we have this information in our context already, and we just add that primer again, so "Assistant:
00:14:16.320 | JSON, tool name"; that's the primer. So we run this, and then let's see what we get as well. Okay, so
00:14:23.120 | we get pretty good output; this is what we'd expect, so that's good. But I wanted to try something
00:14:30.400 | more, because we've built this, like, agent-type thing, and it has multiple tools, one of which we
00:14:36.080 | haven't used yet: the search tool. Okay, so this is just, say, a web search using DuckDuckGo search.
00:14:41.680 | The reason we're using this is because it's super simple to do; of course, we can do more complicated
00:14:48.640 | tooling here as well. But what it's going to do is use the DuckDuckGo search, retrieve the
00:14:54.800 | items, or the results, that we get, and return them in this format here: basically
00:15:00.800 | a set of results separated by new lines and these three dashes, and then we're going to return that
00:15:09.120 | as the tool output. Okay, so kind of like what we did here, but we're going to return context.
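
A sketch of that search tool, assuming the duckduckgo_search package (the exact library isn't confirmed on screen):

```python
from duckduckgo_search import DDGS  # assumed search library

def run_web_search(query: str, max_results: int = 5) -> str:
    """Run a web search and join the results into one context string."""
    with DDGS() as ddgs:
        results = ddgs.text(query, max_results=max_results)
        contexts = [f"{r['title']}\n{r['body']}" for r in results]
    # results separated by new lines and three dashes, returned as the tool output
    return "\n---\n".join(contexts)
```
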
00:15:16.240 | So, to do that, we need to ask another question. I'm going to go with a query, and that query will be
00:15:23.520 | "who is the current prime minister
00:15:27.200 | of the UK?" You know, this is not a question that most of us can answer, so
00:15:35.920 | an LLM by itself would have no chance. We can try this,
00:15:40.400 | and let's see what we get; we might want to print this, I think.
00:15:53.600 | So it turns out Rishi Sunak is still the prime minister; that's surprising.
00:15:59.200 | And yeah, we get these results, but obviously we don't want this raw tool output; we want our
00:16:06.640 | LLM to produce a nicer output for us. So we then pass this output here; actually, I don't know
00:16:14.880 | if I showed you that, so if I print the out one, it's basically going to show us the whole
00:16:22.160 | input prompt plus what we've just generated. Okay, and you can see we have, like, everything in there,
00:16:28.480 | plus the most recent items: the tool input, the search "who's the current prime minister of the UK",
00:16:33.280 | the tool outputs, and the last thing we want to do is add onto that, you know, this here. Okay,
00:16:41.680 | so I want to run this again, and then let's see what we get. Okay, so again: Rishi Sunak has been
00:16:51.040 | the prime minister of the UK since the 25th of October 2022. So yeah, that looks pretty
00:16:59.920 | good. I think those are pretty good results, and what I'm taking from this: I haven't tested
00:17:07.120 | the new Mixtral model enough yet to confidently say it's incredible, but it seems to work incredibly
00:17:13.840 | well from what I have seen so far. If I compare it to using GPT-3.5 as an agent, which is probably the
00:17:20.640 | most common agent model out there, it seems to be more reliable, just from the limited testing I have
00:17:28.960 | done. With GPT-3.5 I would often be adding in output parsers for handling, like, malformed JSON
00:17:36.960 | and so on and so on; here I haven't needed to do that at all, which I think seems pretty promising.
00:17:45.600 | But anyway, that's it for this video. I think this is super exciting; I'm definitely going to do more
00:17:49.680 | on Mixtral going forwards. We're also going to talk about mixture of experts; I think that'll
00:17:54.320 | be pretty interesting as well. But we'll leave it there for now. Thank you very much for watching;
00:17:58.800 | I hope it's been useful and interesting, and I will see you again in the next one.
00:18:17.360 | (gentle music)