Today we are going to be taking a look at how we can take off-the-shelf LLMs and fine-tune them on our own data sets so that they can become better agents. Now one of the fundamental requirements of LLMs in agentic workflows is that they can reliably use function calls. It is that ability to do function calling that takes us from just having an LLM generating text to having an LLM that is reviewing our code bases, writing code, opening PRs, checking emails, doing web search, all of that stuff. At its core it's still an LLM, but it's an LLM that can do function calling.
Now despite this, until very recently a lot of the LLMs being built and released were good, but they were not very good at function calling. Fortunately for us, even with those LLMs we can very easily fine-tune them to make them actually pretty good at function calling, and therefore we can use these tiny LLMs, like the one billion parameter model we're going to be using, as agents, which is pretty cool.
So in this deep dive that's really what we're focusing on: how we can fine-tune these tiny LLMs, or bigger LLMs if you really want to, to be better agents, and we'll see how far we can push a tiny one billion parameter LLM. Now fine-tuning LLMs is pretty heavy; you need a lot of compute.
Fortunately we have access to NVIDIA LaunchPad, and with LaunchPad I have access to H100 GPUs, so we're going to be running all of this, training on a single H100, though I could also use more if I wanted. As well as LaunchPad, we are also going to be using the NeMo Microservices.
Now NeMo Microservices is from NVIDIA, and the goal of the microservices is to make fine-tuning and hosting your own LLMs, and then also just running them in production, much easier, or even just realistic to do at all. There are all these different components within the microservices that we can use.
One of the big ones is the customizer. The customizer is what we are going to be fine-tuning with. There's also the evaluator; I'm not actually going to use this here, although I may do in the future, but that's of course for evaluating the models that you've fine-tuned and comparing them to where you began. And there's NVIDIA NIM, which is essentially the LLM hosting service, and that comes with the standard OpenAI-compatible endpoints, so you can do chat completions and everything as you would with OpenAI.
So we have all that. There are a few other things here that I'm not going to go into, but you can get an idea of what they are anyway. You have the curator, which is for creating data sets to train your models, the retriever, which is essentially RAG, and then also guardrails, which is of course protective guardrails for LLMs when they're in production.
Now we're going to be deploying this service and we're going to be using a few different components. I've kind of covered most of those here. So the guardrails and evaluator we're not using those so we can just ignore them they're on the side there. The rest of these things we are using.
So we'll be diving into each one of these components as we get to them, but let me just give you a quick overview of what this is here now, and treat this as almost a map of what we're going to be building and going through. So to begin with we are going to be pulling together a data set.
So the data set, of course, is going to come in up here. All right, this is the data set; we're going to bring that in and we're going to be sending it over here to the data store. Okay, so the data store, like every single one of these components, is accessible via an API, and these all get deployed at once within our broader microservices deployment.
So we'll see all that later, but anyway, we have our data set, we do some data preparation (we're going to do that first, so in the middle here), and then we're going to be putting everything in our data store. We leave it in the data store there.
We do register that data set in the entity store. The entity store is essentially where we register all of our data sets and models, so that all of the other components within our microservices can read that and see what is accessible within the service. Then we have the customizer.
So as I mentioned, the customizer is what handles the training of our models. The customizer is going to take a base model that we have defined. The way that we've done this is we're setting the base model, which is going to be Llama 3.2 1B Instruct. We're setting that as the base model and we do that within the deployment setup of our microservices.
So we'll see that soon. It's going to then load our train and validation data sets from the data store component, and then it's going to run a training job based on a set of training parameters that we are going to provide. So that's things like the learning rate, dropout, number of training epochs and so on.
We'll see all that later. And once that training job is complete, our new custom model will be registered with our entity store, which means it will be accessible, or at least viewable, from our other components. Then at that point, we will be using the deployment management component, which you can see over here.
And what the deployment management is doing is deploying what we would call NIMs. Now a NIM is a container from NVIDIA that has been built for GPU-accelerated tasks. So of course, hosting an LLM and running inference on it is the sort of thing that you would deploy within a NIM.
Now, when we deploy one of these LLM or model NIMs, that will then become usable by our NIM proxy. The NIM proxy component is essentially just an API that we are going to send requests to. So a chat completion request, for example, and based on the model that we provided, it's going to route that to a particular NIM.
Now, we actually don't deploy NIMs for each specific custom model that we've built. Okay, the only NIMs here are actually these two things. And these NIMs are pre-made by NVIDIA. We can see them in this big catalog. And the way that we use them to run our custom models is that our custom models over here all have a base model parameter.
That base model parameter is going to define all of our custom models here as having a base model of Llama 3.2 1B Instruct in this case. And that means that our NIM proxy will know: okay, I'm using this NIM, but I'm using the model weights from this custom model.
Okay, so in essence, what this NIM becomes is a container for our custom model. So at a high level, that is what we are going to be building. I know that's a lot. We are, I know for sure this is going to be a fairly long one, but we are going to go through this step by step, starting with the deployment, and then working through each one of those steps I described.
Okay, so we're going to start by deploying our NeMo Microservices. To do that, I have this little notebook that just takes us through everything. Now, there are various ways of deploying this in the docs. You can find those up here. You'll see that they have this beginner tutorial, and this does take you through some alternative ways of setting this up.
So you can also try this and potentially it will be simpler, although for me it was simpler to do what I'm going to show you, which is a little more hands-on. Now, let's run through this. Essentially what we're going to be doing is downloading Helm charts for our NeMo Microservices.
To get those Helm charts, you will need an NGC account. So you would sign up over here; go to NVIDIA NGC. I'm going to log in as James at Aurelio AI. Okay. And this is going to take me through to their NGC catalog. Now in here, you can find a ton of stuff that we are actually going to be using.
And I'll take you through this as we need them. But to begin with, I'm going to type in here, NeMo microservices. And you can see that there is this NeMo microservices Helm chart. This is what we're going to be using. This is a Helm chart which bundles together all the individual Helm charts of the various components of the microservices.
So we basically deploy this and it gives us everything that we need. So that's the customizer, evaluator, guardrails, and then also the data store, entity store, deployment management, the NIM proxy, the operator, and so on. Right, so it's everything that we would need. And okay, right now, the latest version here is 25.4.0.
You can also check up here for the various versions if you want to use a different one. But we are going to go with this one. So I just press that, it gave me the fetch command, and we can come over here. And we should get it here. Okay, so this is what we're going to be doing just here.
Cool. So that is how we navigate and find the various helm charts that we're using for deployment here. But before we actually do that, we can't access this without providing our NGC API key. So let's just go and grab that quickly. We're going to go back into the catalog.
We go over to the right here. Go to setup. And I'm just going to generate an API key. Okay. So you generate that. Then come back over here and just run this cell and enter your API key. Okay. Now we can fetch our helm charts. This is just a placeholder.
You don't need to enter anything for this. This is literally the string that we're sending as our username. And then what we need to do. So this is a helm thing, not specific to the microservices. We need to include the values that we would like to set. Okay. So these are going to override the default values within the helm chart.
The only one that we 100% need here is the NGC API key. Then here I'm adding the Llama 3.2 1B Instruct model so that my customizer later can access that as a model to actually optimize. You could also deploy this through the deployment management component otherwise, but here you can just deploy it up front and it's ready to go when we need it.
And then here I also set a storage class. So this was to handle a bug I was getting where there is essentially no storage for my customizer to save the model to or even I think write logs to. So you may need that, may not, as far as I'm aware, it's either a bug or it's my misuse of the system that required that.
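To give a rough idea, something like this is all the values file amounts to; the key names below are illustrative placeholders only, not the chart's real schema, so take the actual keys from the chart's own documented values:

```python
import yaml  # pip install pyyaml

# Illustrative values override -- replace these placeholder keys with the real ones
# documented in the nemo-microservices Helm chart's default values.yaml.
values = {
    "ngcAPIKey": "<YOUR_NGC_API_KEY>",                                    # placeholder key name
    "customizer": {                                                       # placeholder structure
        "targets": {"meta/llama-3.2-1b-instruct": {"enabled": True}},
    },
    "storageClassName": "nfs-client",                                     # workaround for the storage issue
}

with open("demo-values.yaml", "w") as f:
    yaml.safe_dump(values, f, sort_keys=False)
# The file is then passed to Helm, e.g.: helm install <release> <chart> -n demo -f demo-values.yaml
```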
So we create our YAML file here and then we actually use that YAML file to deploy our service. So before we do that, we do just create a namespace in Kubernetes. I've already created mine. So it's just demo namespace. You can call it whatever you like. Then, oh, the other thing that I also added here is, again, to handle that storage class issue I was seeing.
This here is going to look for all the default storage class values within our deployment and replace them with the new storage class that I have set, which is this, okay, NFS client. And you will see here the rule here is add the storage class if it is none, right?
So if there's already a storage class for the within the Helm chart, it's not going to replace it. It's only going to replace it if that storage class has been set to none, okay? And so, okay, let me just run these. So this is going to apply this logic, this replacement logic within our cluster.
I've already created it, so I'm going to get all these. But when you run this the first time, you shouldn't see this. Okay, and that should be it. So now what I want to do is, if you like, you can just confirm that you've filled out all the required values by creating a Helm template.
So that can be good for, especially when it comes to debugging, if you have issues with your deployment, that can be quite good. Now, we would install the Helm chart into the cluster. So mine is going to tell me installation failed because I've already installed it, okay? So mine is already there, so I don't actually need to do that.
But once you have done that, you can actually check your cluster and just see what is in there. And you should, for the most part, see either initializing or running. Sometimes you might see, especially for, I think, the entity store or the data store, you'll see a crash loop backoff.
That is, if you run this again, you should see, after not that long, that the status for that will turn to running. If you see any errors, a crash loop backoff that is not going away, or a pending status that doesn't seem to resolve, there's probably an issue.
And you should look into that and try to figure things out. The first time you run this, you almost definitely will see that the NeMo operator controller manager, this one here, is stuck in, I think, a crash loop backoff.
And the reason for this is because the scheduler dependency, which is Volcano here, is not included within our cluster. So we just need to run this cell here and this will install Volcano. And then you would get your Nemo operator controller. You would get the name for that pod from here.
So in this case, it would be this one. And you would come in here and just run this, and that would delete the pod and it will automatically restart. Okay. And that should fix the problem. Now, one other thing that we will need to do is set our NVCR image pull secret.
So this is our ability to access NVIDIA's container registry, which is needed later on. When we're training our models, we need to pull the relevant containers from NVCR, and if we don't have the secret set, we will not be able to access them. So we'll get, I think, a forbidden error if I'm not wrong.
So we first just delete the existing secret; I think there is just a placeholder string in there by default. Then we create a new secret for the NVCR image pull secret. For that, we do just need to pass in essentially our login details here.
So we run that and see that we created it. And if you like, first we can just check that it's there, it exists. And then, yeah, I would recommend doing this: you can just confirm that the secret has been read in correctly.
So if you run this cell, you will be able to see what is in there. So that can just be useful if you're, especially if you're wanting to debug things or you're running this for the first time, just so you understand what is going on. And yeah, I mean, that, that actually should be everything for the deployment.
Okay. There is another step later where we're deploying our NIM, but we will run through that when we get to it. So with that, our whole deployment is ready to go, and we're ready to start running through that whole data preparation, data storage, customization, and so on pipeline.
Now for our dataset, we are going to be using a dataset from Salesforce, which they use to train what they call their large action models. These large action models are models that do a ton of function calling. So the dataset that trains those is obviously pretty good for us to fine-tune our own LLM to be a better agent through better function calling.
So we need to go ahead and grab that dataset. It is available on Hugging Face, so we need to navigate to Hugging Face. You will need an account for this, by the way, I will just pre-warn you. So let's first navigate to the dataset. You see Salesforce xLAM function calling 60K.
This is the one we're using. And what you will probably see when you scroll down here is that you don't have access to this dataset. And that's because you need to agree to their terms and conditions for using the dataset. So, you know, go ahead and agree to those terms and conditions.
And then once you've done that, you actually cannot programmatically download this dataset without providing a Hugging Face API key. Because, you know, your account has approved those terms and conditions, but you need to prove that it's your account requesting that data programmatically. So to do that, we need to go into our settings.
We go to access tokens and we're going to create a new token here. For this token you can have fine-grained access if you want, or otherwise I would just recommend read-only, which is fine. Give it a little name, create that token, and you're good to go.
I've already created mine down here, so I don't need to create a new one. So once you have that token, we need to jump into here and just run this cell and enter your Hugging Face API key. Okay, cool. So we have that. Now what we can do is go and download the data.
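Roughly, the download looks something like this with the datasets library, assuming you've accepted the dataset terms and have a read token:

```python
from datasets import load_dataset

HF_TOKEN = "<YOUR_HF_READ_TOKEN>"  # the read token created in your Hugging Face settings

# The dataset is gated, so the token proves your account accepted the terms.
xlam = load_dataset("Salesforce/xlam-function-calling-60k", split="train", token=HF_TOKEN)

print(xlam)              # ~60k records with id, query, answers, tools fields
print(xlam[0]["query"])  # the first user query
```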
So we'll do this. And you can see here the dataset structure. So you have an ID, obviously, for each record, a query, which is the user query, and an answer, which is the assistant answer. But it is not a direct answer, it's a tool call or function call.
And then we also have the tools here. So for those of you that have built tool calling agents before, with OpenAI, LangChain and everything else, there is always a tools parameter, or function schema parameter, and within that you always provide a list of function schemas.
Okay, and those essentially just tell the LLM: what tools do I have access to, and how do I use them? That is what that is for. Now, I can show you what that looks like even. So in here, we have that user query and the answers.
You can see here, there are multiple answers. That is because, okay, where can I find live giveaways for beta access and games, right? What the agent in this training dataset decided to do is it decides to use this live giveaways by type tool with the argument beta. So it's basically looking for, okay, live giveaways by type, and it's pulling in the beta type.
But the user is asking for beta access and games, right? So in parallel, the LLM is also calling the same tool, but with the type game. Now, in our training, we're actually just going to filter for records that have a single function call because not all models can support parallel function calling.
So yeah, we're just sticking with the single one. So we'll actually filter these out, so we'll only have records where there is a single function call. But the same applies if you want to train parallel function calling and your base model is able to; you can, you just process the data a little bit differently than we do later.
And then we have the tools argument here where we see the list of tools that this LLM has access to, which is just this live giveaways by type tool. So that is it for the data. And there is a little bit of a problem with this data set, which is it's not quite in the correct format that we need in order to train with.
The NeMo customizer expects our data to be in the OpenAI-compatible format. That means that both the messages and the tool schemas, the function schemas, need to be in the standard OpenAI format, which the xLAM dataset is not in. So we actually need to modify this a little bit, and I'll talk you through how we are doing that and what is actually going on there.
So let's jump through here. Actually, don't worry, I already explained this basically, but in case it wasn't clear: a function schema is just describing what a function is and how to call it. So for example, we have this multiply function, very simple. We have this natural language description.
That's for the LLM to know what this does. And we have all these parameters. When we run this function through a method to turn it into a function schema, which is what we're doing here, you see that it turns into this structure, okay? This structure is, that is the function schema, is also an OpenAI function schema.
So just so you're aware, that is basically what we're doing, okay? We're defining these schemas. So let's first take a look at this. You can see here that this is the xLAM dataset. There are a few things that it's missing, okay? The first thing is that we need this "type": "function" field, which I've put up here, okay?
So you can see, compare this function schema up here, which is the OpenAI one, to this one, okay? We're missing this "type": "function" field. It's very easy to put in, not a big deal. Then we need to put everything else within a "function" key here. So name, description and parameters are all going to go inside this dictionary.
And then for the parameters: here we have parameters, and that goes straight into type and then description and so on. In the OpenAI format, the parameters have a type of object and then a properties dictionary inside, where we're putting all this other information. And then the final little thing is that the types in the xLAM dataset are using Python types.
The OpenAI types are slightly different. So for example, str, which is the Python type that xLAM uses, would become string, the full word, which is how you would name it here. And then a list of anything would become array, okay? And there's a full table here where I've written down all of the various Python formats, the OpenAI format, and what that looks like here, okay?
So we're going to be handling all of this in some logic, which I have written in here, okay? So we're just normalizing the type here, looking at the type, converting it into an OpenAI type. Then we actually need to restructure things; we looked at the structure difference between those two function schemas before.
We need to handle that as well. So that's exactly what we're doing here: we're converting from the xLAM structure into the OpenAI structure. And as part of this, we also normalize our types, right? So here we're converting with all of these various type mappings, okay?
So I'll run that. And then, okay, this is the xLAM tool schema. If we convert this with our new function, we can see that it successfully turns the xLAM format into the OpenAI format here, okay? Which is exactly what we need. Okay, so that is it for our tool or function schema.
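As a sketch, the conversion logic looks something like this, assuming each xLAM tool is a dict with a name, a description, and a parameters mapping of argument name to a Python-style type string plus description (the exact field layout in your copy may differ slightly):

```python
# Map Python-style type names used by xLAM to OpenAI JSON-schema types.
TYPE_MAP = {
    "str": "string", "int": "integer", "float": "number",
    "bool": "boolean", "list": "array", "dict": "object",
}

def xlam_tool_to_openai(tool: dict) -> dict:
    """Convert one xLAM tool description into an OpenAI-style function schema."""
    properties = {}
    for arg_name, arg in tool.get("parameters", {}).items():
        # xLAM types can look like "str" or "str, optional"; keep only the base type.
        base_type = arg.get("type", "str").split(",")[0].strip()
        properties[arg_name] = {
            "type": TYPE_MAP.get(base_type, "string"),
            "description": arg.get("description", ""),
        }
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": {"type": "object", "properties": properties},
        },
    }
```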
Another thing that we need to do is for the actual assistant where it's calling a particular function or tool, we also need to handle that in a particular way, which is a lot simpler. So we actually just do this. So we are saying, okay, if we just have one tool call, we're going to keep it.
If there are more tool calls in this assistant message, we're going to discard this and just skip it. That, again, it's just to keep things simple for us. We don't necessarily need to do that. But you do need to do that for LLMs that don't support parallel function calling.
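For reference, a rough sketch of that skip-and-restructure step, assuming the answers field is a JSON list of {name, arguments} calls (the field names on the output side mirror the OpenAI chat format):

```python
import json

def xlam_record_to_messages(record: dict) -> list | None:
    """Build an OpenAI-style message pair from one xLAM record, or None to skip it."""
    answers = json.loads(record["answers"])
    if len(answers) != 1:
        return None  # skip parallel tool calls to keep things simple
    call = answers[0]
    return [
        {"role": "user", "content": record["query"]},
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [{
                "id": "call_0",  # placeholder id; the OpenAI format expects one
                "type": "function",
                "function": {
                    "name": call["name"],
                    # arguments must be a JSON string in the OpenAI format
                    "arguments": json.dumps(call.get("arguments", {})),
                },
            }],
        },
    ]
```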
And yeah, I mean, we're just restructuring that. This is pretty simple. I'm not going to go into detail there. But yeah, here you can see, right? So this is the answers. So this is the assistant message. What did it say? And then here we can see this is our formatted version.
So it's not that much different. Okay. But it's also simpler. Like, there's a lot less to process here. So we actually need to go through and process everything, like all of our messages. So that's the user message and also the assistant message, which is here. Run that. Okay. And this is before processing and this is after processing.
So this is our OpenAI format, and this is our xLAM format. Cool. Okay, so now we can go through and do that for the full data set. So we do that here; it might take just a moment. And then we can see, okay, the first record has been cleaned up.
It is in the correct format. And now what we can do is work on separating these out into our train, validation and test splits. So when we're fine-tuning models, we have our training data, which is the segment of the data that we are actually showing to the LLM during fine-tuning and that it is learning from.
Then we have a small slice of the data set, which is left for validation. So at the end of every training epoch, we're going to run the current version of the LLM on the validation data and see how it performs. And we just report that performance back to ourselves.
Then there is also the test split. We're actually not necessarily going to use that here, but if you're evaluating, you would reserve that test split for the evaluation step, which comes after training. You're going to basically test again on this test data set, and that will inform you as to the final performance of your model.
So to do all this, we first are going to shuffle our data, and then split it into that train, validation and test split. We do 70% train data followed by 15% and 15% for the validation and test data, which you can see here as the actual number of records.
So that is it for the test data. If you are going to run that through evaluation, there's a slightly different format for that. So you would format it a little bit differently. And finally, we're just going to save all those. So our training file, validation file and test file.
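The split-and-save step is simple enough; a sketch (the 70/15/15 ratios match what we use here, the file names are just examples):

```python
import json
import random

# records: the list of cleaned-up, OpenAI-format examples produced above
random.seed(42)
random.shuffle(records)

n = len(records)
train_end = int(n * 0.70)
val_end = train_end + int(n * 0.15)
splits = {
    "training.jsonl": records[:train_end],
    "validation.jsonl": records[train_end:val_end],
    "testing.jsonl": records[val_end:],
}

# Write each split as JSON lines, one record per line.
for filename, rows in splits.items():
    with open(filename, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```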
And we're going to be uploading those into our data store and registering them pretty soon. So, I mean, that's already a lot I know, but that is the data preparation stuff out of the way. Okay. So we have so far outlined what we're going to do with all the various components for our fine-tuning pipeline.
We have then deployed our NeMo Microservices and we have prepared our dataset for training. So we can jump into actually using the microservices now. In this pipeline of microservices, what we first need to do is, of course, set up our training data. Okay, not preparing it, we've done that.
But just taking our training data, because we've just saved it locally right now, and giving it to our NeMo Microservices. Basically, we're going to put the files in the data store, register them in the entity store, and then they're ready for our customizer.
But first, we need to go ahead and do that. So let's jump in. So first thing we're going to do, okay, because we're going to need to reference all these IP addresses. We're going to be hitting the various APIs throughout this notebook. We're first just going to take a look at our deployment and we're going to see, okay, what are the IP addresses for each of our services?
We have the customizer, the data store, the entity store and deployment management, so those four, and also our NIM proxy we're going to be using. So we need to make sure that we pull these IPs in, like for the customizer here, and we're going to pull them in down here. Okay. Now, the IP addresses for you when you're running this will be different to what you see here.
So it's important that you do copy those across. The ports at the end here should not change. And also, you do need to have HTTP at the start there. Okay. So the only thing you need to be changing is the IP addresses in the middle there. Don't change anything else.
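In code that just means a handful of base-URL constants that the rest of the notebook reuses; the IPs and ports below are placeholders, so swap in whatever your own deployment shows:

```python
# Placeholder addresses -- copy the real IPs (and keep the notebook's ports) from your cluster.
DATA_STORE_URL   = "http://10.0.0.1:3000"   # NeMo data store
ENTITY_STORE_URL = "http://10.0.0.2:8000"   # NeMo entity store
CUSTOMIZER_URL   = "http://10.0.0.3:8000"   # NeMo customizer
DEPLOYMENT_URL   = "http://10.0.0.4:8000"   # deployment management
NIM_URL          = "http://10.0.0.5:8000"   # NIM proxy (OpenAI-compatible)

NAMESPACE = "demo"                 # the namespace we created earlier
DATASET_NAME = "xlam-ft-dataset"   # placeholder dataset name
```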
Okay. And the other thing, this is less important, but you can modify it if you want: you can change your dataset name. This is what we're going to be using when we put our dataset in the data store and reference it later. So you can modify that if you want to call it something else.
Okay. So you have that. Now, the first thing we need to do is, for both our entity store and the data store, create a namespace. The namespace for both of these is going to be equivalent to the namespace that we've set already, which is demo. So you would need to run this.
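As a sketch, those two namespace calls look roughly like this; the endpoint paths are assumptions based on NVIDIA's tutorials, so verify them against the API reference for your version:

```python
import requests

# ENTITY_STORE_URL, DATA_STORE_URL, NAMESPACE: see the constants sketch above.

# Entity store namespace (path assumed; check the API docs).
resp = requests.post(f"{ENTITY_STORE_URL}/v1/namespaces", json={"id": NAMESPACE})
print(resp.status_code)  # 200/201 on creation, 409 if it already exists

# Data store namespace (path and form field assumed; check the API docs).
resp = requests.post(f"{DATA_STORE_URL}/v1/datastore/namespaces", data={"namespace": NAMESPACE})
print(resp.status_code)
```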
Now, the first time you run this, you should get two 200 responses. The reason I'm getting a 409 here is because I've already run this, so my namespace already exists and I don't need to recreate it. So, okay, that looks good; you should get 200 responses. Great. Now we are going to upload the data to our microservices.
And for that, we're using the Hugging Face API client. Now, the reason that we're using this is that the Hugging Face API client is just very good at data processing and fast transfer of data between various places. So it's really good. But we're not actually using Hugging Face here.
We're actually modifying the endpoint here. So this is not going to Hugging Face Hub; this is going to go directly to our data store. The data store has this Hugging Face endpoint, which is kind of like what we get with OpenAI compatibility; NVIDIA have done this for the data store.
They've made it Hugging Face API compatible as well. So we run that. We have our repo ID here. Okay, so repo ID. You can see that it is just the namespace followed by the dataset name. That is similar to if we go over to Hugging Face. And we look here at the datasets.
There is the namespace, which in this case is Salesforce, followed by the dataset name, which is the xLAM function calling 60K. This is the same thing, but just locally, within our microservice cluster. Okay, now we can go ahead and create our repo. You will find that once you've run this once and you run it again, it will not recreate the repo because it already exists.
Okay, so I've just deleted it and run it again, and now it is recreating the dataset. But then if I run it again, it's not going to show me anything, because it already exists. So we've created our repo. And now what we need to do is upload our training, validation and test data sets.
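Putting that together, a sketch of the repo creation and upload, pointing the Hugging Face client at the data store's HF-compatible endpoint (the /v1/hf path is an assumption taken from NVIDIA's tutorials):

```python
from huggingface_hub import HfApi

# DATA_STORE_URL, NAMESPACE, DATASET_NAME: see the constants sketch above.
repo_id = f"{NAMESPACE}/{DATASET_NAME}"

# Point the Hugging Face client at the NeMo data store instead of huggingface.co.
hf_api = HfApi(endpoint=f"{DATA_STORE_URL}/v1/hf", token="")

hf_api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)

for split in ("training", "validation", "testing"):
    hf_api.upload_file(
        path_or_fileobj=f"{split}.jsonl",
        path_in_repo=f"{split}/{split}.jsonl",  # folder layout is a guess; match your notebook
        repo_id=repo_id,
        repo_type="dataset",
    )
```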
And we do that with the Hugging Face API upload_file. We're just pointing it to each one of our data files, so the training, validation and test data, and that will just upload; it's actually pretty quick. So we've done that. And now what we can do is register the dataset that we just created with our NeMo entity store.
So all we're going to do here is say, okay, this is the URL to the files that we just created. It's at the Hugging Face endpoint, datasets, and then we have the namespace and dataset name again. Okay. So all we're doing is just posting that to the entity store.
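And the registration call itself, again with the path and payload shape as assumptions to check against the entity store API docs:

```python
import requests

# ENTITY_STORE_URL, NAMESPACE, DATASET_NAME: see the constants sketch above.
resp = requests.post(
    f"{ENTITY_STORE_URL}/v1/datasets",  # path assumed; see the entity store API reference
    json={
        "name": DATASET_NAME,
        "namespace": NAMESPACE,
        "description": "xLAM single-call subset in OpenAI format",
        # Point the entity store at the files we just pushed to the data store.
        "files_url": f"hf://datasets/{NAMESPACE}/{DATASET_NAME}",
    },
)
print(resp.status_code, resp.json())
```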
Now the entity store knows that we have this dataset. And we can just confirm that it has been registered correctly; we can see in here, it has been. Okay. Great. That's all good. So now we're on to the training part. Training, although super exciting, actually doesn't involve a lot nowadays.
So yeah, we'll jump into it. We can get some nice charts and everything here. So I'll just explain and go through how we can check in on the progress of our customization, our training, how we can check in on the loss charts and so on. So the first thing we want to do is actually check what models, or base models, we can fine-tune from.
So we run this get customization configs, and I can see, okay, we have this model. Now, the reason I can see this is because earlier on, when we were deploying everything, within the values.yaml for the customizer I specified that I want this model to be available.
Okay. So that is why it is now available; otherwise, this would not be in here. I think by default there is a default model, which is actually this one, so I'm looking at the wrong one. This is the default model that the customizer will have access to. And then if we scroll down a little bit, we can see the model that I've defined as wanting to have access to, which is this one here, the Llama 3.2 1B Instruct.
Okay. So this is the one we have set in our values.yaml. If we come up a little bit, this is the model that is just by default accessible by the customizer, because it's set already in the pre-written values.yaml that we later override. Okay.
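For reference, that check is just a GET against the customizer's configs endpoint; the response shape I'm indexing into here is an assumption, so adjust it to whatever you actually get back:

```python
import requests

# CUSTOMIZER_URL: see the constants sketch above.
resp = requests.get(f"{CUSTOMIZER_URL}/v1/customization/configs")
for config in resp.json().get("data", []):  # "data" key assumed
    print(config)  # look for the llama-3.2-1b-instruct entry enabled in values.yaml
```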
So we have that and we can jump into actually training. Now, the one thing, and I would really recommend you do this, is you can get a Weights & Biases API key here, and it just makes checking in on the progress of your model so much easier.
So I would really recommend doing that. To get this, you need to go to wandb.ai and open this. You have to sign up for an account. I think they come with a free trial period, and they might still do the sort of free personal accounts.
I'm not sure, I haven't used it for a while. You find your API key in the dashboard. So once you have that, come back over here, run this cell, and enter your API key. And of course, enter it in the top here. Okay.
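To give a feel for what that cell is doing, here is a hedged sketch of kicking off a customization job; the exact field names and values follow NVIDIA's examples, so treat them as assumptions and check the customizer API reference:

```python
import requests

# CUSTOMIZER_URL, NAMESPACE, DATASET_NAME: see the constants sketch above.
resp = requests.post(
    f"{CUSTOMIZER_URL}/v1/customization/jobs",
    headers={"wandb-api-key": "<YOUR_WANDB_API_KEY>"},  # optional; enables the W&B dashboard
    json={
        "config": "meta/llama-3.2-1b-instruct",  # use the config name returned by /configs
        "dataset": {"namespace": NAMESPACE, "name": DATASET_NAME},
        "output_model": f"{NAMESPACE}/llama-3.2-1b-xlam-ft",  # placeholder custom model name
        "hyperparameters": {                      # field names assumed; see the API docs
            "training_type": "sft",
            "finetuning_type": "lora",
            "epochs": 2,
            "batch_size": 16,
            "learning_rate": 1e-4,
        },
    },
)
job = resp.json()
job_id = job["id"]
print(job_id, job.get("status"))
```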
And now that has started a job. So you can see all the various parameters that you set in there. You should see all of this. Then, these next cells are just there if you are running into some bugs; I mentioned some references here to deal with them, but I'm skipping ahead.
I haven't seen any issues. So now what we can do is we can go get our customization ID, and then we can send a request to this endpoint to get the status for our job. Okay. And we should see up here that it will probably be running. Okay. That's good.
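That status check is just a GET like this (using the /status path we'll also see in the API docs shortly):

```python
import requests

# CUSTOMIZER_URL and job_id (the customization ID) from the previous steps.
resp = requests.get(f"{CUSTOMIZER_URL}/v1/customization/jobs/{job_id}/status")
print(resp.json().get("status"))  # e.g. pending / running / completed
```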
Now we can also see there's this big list of events that are happening; you can scroll through those. The most recent ones: we created the training job, the training job was pending, it got access to some resources, and it started running. Okay.
That is the progress so far. We can also check in our cluster. So if we look for any pods that begin with this cust prefix, that is the customizer, I believe, and we can see that the first one here is already completed and it only took 74 seconds.
How incredibly fast is that for a training job? That isn't the training job. That is actually just the entity handler. So this is registering the training job. I believe pulling in the details and then it triggers the training job once it's complete. And then here is our actual training job.
You can take this, right? So we have the name of the pod here. You can take that and view the logs for that pod here. So I'm going here, and you'll see this right now. After a little while, you'll start seeing training updates, like what step is it on and so on. But most useful, which is why I said get your Weights & Biases API key, is actually going over to your Weights & Biases dashboard.
What you should find is that NeMo will automatically create this NVIDIA NeMo customizer project within Weights & Biases for you. So you can click through into that. And then we're going to see, well, these are a couple of past jobs I've already run, but this is what you're going to see.
Okay. So you're going to be able to check in on your validation loss and training loss. And yeah, it's, it is pretty useful. Now, I think if I look at this, the job that I've just kicked off hasn't quite started yet. So I won't be able to see in here.
Now it has. Okay, so it's just popped up now. So this top one here; I can remove all the rest. I can't really see anything because it has literally just started. But once we start getting metrics coming through, I will be able to see how things are going, and I'll be able to check back in every now and again to see how my loss is coming out, how far along we are and so on.
So it's really useful. And yeah, we have that. It's also pretty useful if you just look at this, right, I'll remove that one. I can see, okay, my validation loss here is lower. This one here, this yellow run, scored best for validation loss.
So that's interesting. And what I can do is say, okay, I've got the run ID here, let me just expand this. I've got the run ID here, so I can take that and actually use it when I'm deciding which model to use later.
Okay. So this ID here, we can use to run various models, which is pretty helpful. Now, I actually don't want to run another training run because this can take like 45, 50 minutes. So what I'm going to do is cancel this training run. Of course, you probably won't want to do that, but I'm going to cancel mine.
And I don't necessarily know what endpoint I need to hit to cancel. So what I'm going to do is navigate to the NVIDIA NeMo Microservices docs, and we have the API docs. Okay, so this is the NeMo Microservices latest API index, and I can come in and say, okay, I need the customizer API.
And we actually have the docs here. A little bit hard to read with dark mode; let me fix that. Okay, there we go, a little better. So we can come down, and I can see here POST, so this is what I need: v1 customization jobs, job ID, cancel. So I'm going to go and hit that.
Okay. So this is actually for status. I'm going to modify this. We're going to go cancel. And this is going to be a post. Okay. And that will cancel the job. So, you know, we have all the information here. And that's great because I wanted to use that GPU for running inference, which we are going to jump into right now.
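For reference, that cancel call boils down to a single POST:

```python
import requests

# CUSTOMIZER_URL and job_id (the customization ID) from earlier.
resp = requests.post(f"{CUSTOMIZER_URL}/v1/customization/jobs/{job_id}/cancel")
print(resp.status_code)
```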
So once that training job is complete, we will first see that in here. When we do the GET to the status endpoint, we should see that the status is completed. That means we can then go into here, and we should see that the entity store has already registered our model.
Okay. So we should see, this is the latest model here, we should see our model: the name that we gave it, followed by an @, and then the customization ID that it was given. And that means that we have this model, our custom model, in our entity store.
So that's great. But if you remember, we cannot actually run chat inference on our custom models unless we have a NIM for the base model already deployed. So how do we do that? That's the next thing. What we need to do is come down here, and we're hitting the model deployments component.
Now, the model deployments component is what decides, okay, I'm going to go and deploy a NIM, and we need to tell it which one we want to deploy right now. So the name doesn't matter; put whatever name you prefer here. It's not a big deal.
But the thing that you do need is, okay, you need the model. And this needs to be the base model for your custom models. Okay. So my custom models were trained off of this model. That's fine. But the thing that you need to be aware of is, okay, the image name.
Where on earth do I get that from? So you actually need to go to the NGC catalog again, which we can find at catalog.ngc.nvidia.com. You go into here, you go to containers, and then you can filter by NVIDIA NIM. And then what you have is all of the NIMs and the LLMs that you can use.
So I'm going to say that I want the 3.2 1B model. So I'll just type in 3.2 here, and you can see, okay, these are the models, these are the NIMs I can use. I'm using the 1 billion parameter model, so I'm going to jump into that.
Okay, cool. So I have this. And then what I want to do is say, okay, I want to get this container. I'm going to go here, and you can see that this is the latest tag's image path, right? You can take the whole thing, but it will also work with just this.
I think this image tag might not be correct. Yeah. The latest tag is 1. We'll see in a moment that that's not the latest, but it's fine. So I come over here. I put this in here. So this is my image name. Great. Then I want to go and check my tags.
Okay. So go into here and you can see, yeah, I don't know why 1 is showing as the latest, because there are others, right? There are the 1.x versions; the order is kind of messed up here, but it's fine. So 1.8.5 is the latest one as far as I can tell.
So I'm going to use that. So come over here. 1.8.5 perfect. And that is it. We're actually ready to go. So we, yeah, we run this, that is going to deploy. And actually, for me, it's already deployed because I've already done that. If I want to create another deployment, I can.
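A hedged sketch of that deployment request; the payload shape here follows NVIDIA's examples, so double-check the field names against the deployment management API reference:

```python
import requests

# DEPLOYMENT_URL, NAMESPACE: see the constants sketch above.
resp = requests.post(
    f"{DEPLOYMENT_URL}/v1/deployment/model-deployments",  # path assumed; check the API docs
    json={
        "name": "llama-3.2-1b-instruct",   # any name you like
        "namespace": NAMESPACE,
        "config": {
            "model": "meta/llama-3.2-1b-instruct",  # must match the custom models' base model
            "nim_deployment": {                      # field names assumed; see the API docs
                "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
                "image_tag": "1.8.5",
                "gpu": 1,
            },
        },
    },
)
print(resp.status_code, resp.json())
```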
I just change the name here. I can keep the model the same if I really want to. Okay. So we do that. And then we just come into here and we should see that there is a model deployment job going on. So it, I think, would be this. Yes, I think it is this one here.
So you'll look for model deployment. There are a few different jobs that say model deployment or model download or whatever, so just be aware of that. But yeah, this is the one that we're looking for. That, of course, for me, completed forever ago, 16 hours ago.
And it is still running, so I don't need to worry about that. We can then just confirm that the model has now been picked up by our NIM endpoint, which, okay, it looks like it has. We have Llama 3.2 1B Instruct. Okay, that's great. And what this will also do, right?
So this is the base model. But automatically, it's also pulling in our custom models, right? So as soon as our NIM proxy sees that we have the base model NIM for our custom models, it's also going to load in all those custom models as well. So we can see that here.
We can scroll; I'm sure there are probably a few in here, right? So we can see, yeah, we have another one here. And actually, we just have those. And I think those are all the ones that I've trained within this version of the instance anyway.
Cool. So that's good. And now we can actually go ahead and use the model finally. Okay. So using the model, as I mentioned, we use OpenAI, right? We use OpenAI-compatible endpoints. So that means we can use it as we would a normal OpenAI model. So we're actually just using the OpenAI library here.
We set up our OpenAI client pointing to the NIM. So we just change the base URL to point to the NIM proxy URL, the v1 API there. And the API key doesn't matter; just put whatever you want in here. I think you can put anything, but you can put none if you want to be cautious.
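A minimal sketch of that client setup and a tool-calling request; the custom model name and the tool definition here are placeholders, so use whatever shows up in your own model list and test data:

```python
from openai import OpenAI

# NIM_URL, NAMESPACE: see the constants sketch above. The key is ignored by the NIM proxy.
client = OpenAI(base_url=f"{NIM_URL}/v1", api_key="none")

# List the models the proxy serves: the base NIM plus any custom fine-tunes.
for model in client.models.list():
    print(model.id)

# Placeholder tool definition, loosely based on the test record discussed below.
tools = [{
    "type": "function",
    "function": {
        "name": "assess_diabetes_risk",
        "description": "Assess diabetes risk from weight, height and activity level.",
        "parameters": {
            "type": "object",
            "properties": {
                "weight_lbs": {"type": "integer", "description": "Weight in pounds"},
                "height_inches": {"type": "integer", "description": "Height in inches"},
                "activity": {"type": "string", "description": "Activity level"},
            },
        },
    },
}]

response = client.chat.completions.create(
    model=f"{NAMESPACE}/llama-3.2-1b-xlam-ft",  # placeholder; use your custom model's name
    messages=[{"role": "user", "content": (
        "What is the diabetes risk for a lightly active person weighing 165 lbs at 70 inches tall?"
    )}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```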
Now we can test it on our test data that we set up before. If we want, we can test it on whatever we want, to be honest, but we can test on this. So we have our messages here, and this one is: what would the diabetes risk be for a lightly active person with a weight of so-and-so and a height of so-and-so.
All right. So, okay, what tools can we use? Let's have a quick look at those. Okay, we have assess diabetes risk, which is probably the one that we would use. So let's go ahead and try. We'll print out the full message so we can just see everything there.
Chat completion message: content is none. Okay, content is none because when there is content, that is the LLM responding directly to you; it's not a tool call. So that's good, we're getting a tool call. And let's go across. We can see tool calls, chat completion message tool call, ID so-and-so.
All right. We can see that these are the parameters into some tool. The tool was assess diabetes risk. Okay. That is perfect. That's what we need to see. That is good. Then, okay, weight. What was it? So let's come up and just confirm. Yes. 165 pounds. The height is 70 inches.
We said lightly active person, and activity is lightly active. Okay, cool. And I'm curious, so for the activity here, we can actually see the allowed values: we have sedentary, lightly active, and so on. So it's actually filling that in correctly, which is great. Then we can also stream as well, because this is OpenAI API compatible.
So to stream, we just set stream there and we stream like so. Let's run that. Okay, and we can see all that coming through as well. Right, so that is the full pipeline. We have deployed our full microservice suite, we've done data preparation, we uploaded our prepared data, putting it in the right places for the NeMo microservices.
We fine-tuned our model. And then we've just tested it at the end there. Of course, in many cases, you probably want to do evaluation and everything else around that. All that is also available. But we can already see straight away, like this is a 1 billion parameter model. Just from that test, straight away, it's able to do function calling, which is, I think, really, really good.
And it's not something that you would typically get from such a small 1 billion parameter model. And you can test it more, test it with more data, and you will see that it is actually able to very competently use function calling and use it correctly, which for a model of its size is really impressive.
And the reason for that is because we have fine-tuned it on that function calling data set. And of course, we're not limited to that function calling data set: if you have your own specific tools and everything within your specific use case or industry, whatever it is, you can fine-tune on that data as well and make these tiny LLMs highly competent agents.
Which, I think, in terms of cost and performance, is really impressive. Yeah, so I really like this whole process that NVIDIA have built with the microservices. It's really powerful. And just the fact that you can do this so quickly and build these models is, in my opinion, really exciting.
Fine-tuning models, building custom models, is something that has really been lost in maybe the past couple of years with big LLMs coming out. It's something that I hope this type of service makes more accessible and just a more common thing to do, because the results that you can get from it are really impressive.
So, yeah, that is it for this video and this big walkthrough and introduction to the NeMo Microservices and fine-tuning LLMs. I hope all this has been useful and interesting, but for now I'll leave it there. So, thank you very much for watching and I will see you again in the next one.
Bye.