So the talk is actually going to be about how you run things extremely easily, directly from Python. I only have five minutes, but I'm going to try my best to showcase how you can fine-tune many models, 20 here is an arbitrary number, it could just as well be hundreds, right from Python, without needing anything like Kubernetes or Docker on your side.
Before that: you can find the talk and the actual code for what I'm going to show behind this QR code, along with a lot more interesting examples to try out and run. So what do we do? Covalent is an open-source/open-core product.
What we do is help people write Python locally and ship the code to whatever compute backend it needs to run on. What that means is: say you have a Python function that you want to run on a GPU. On your laptop, open up a Jupyter notebook, add a single decorator on top that says, hey, run this on an H100 with 36 gigs of memory and a maximum time limit of two days, and press Shift+Enter.
And that's it. The code gets shipped to a GPU backend, and you get the result back on your side. In the open-source case, it goes to your own compute: you attach your own cluster and it runs there. In the cloud case, it runs in our GPU cluster, and you pay only for the GPU time it uses.
If it runs for five minutes, you pay for five minutes of H100. If it runs for 10 seconds, you pay for 10 seconds of H100. You can also bring your own compute and attach it to us, and we'll help you orchestrate all of it, whether that's your own cloud, on-prem systems, or whatever else you have.
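In code, that single-decorator pattern looks roughly like the sketch below. This is a minimal illustration, assuming the `covalent` and `covalent_cloud` packages with their `CloudExecutor` and `dispatch` APIs; the environment name and the exact resource strings (GPU type, memory, time limit) are placeholders mirroring the numbers from the talk and may need adjusting to what your setup accepts.

```python
import covalent as ct
import covalent_cloud as cc

# Declare the compute this function should run on. The resource values here
# mirror the talk (one H100, 36 GB memory, two-day time limit); accepted
# strings may differ slightly depending on your Covalent Cloud setup.
gpu_executor = cc.CloudExecutor(
    env="my-env",            # an environment created beforehand with cc.create_env
    num_cpus=4,
    num_gpus=1,
    gpu_type="h100",
    memory="36GB",
    time_limit="2 days",
)

@ct.electron(executor=gpu_executor)
def heavy_gpu_task(x):
    # Ordinary Python: this body executes remotely on the GPU machine.
    return x ** 2

@ct.lattice
def workflow(x):
    return heavy_gpu_task(x)

# Shift+Enter in the notebook: ship the code, then pull the Python result back.
dispatch_id = cc.dispatch(workflow)(21)
result = cc.get_result(dispatch_id, wait=True)
```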
OK. So Covalent basically gives you a handful of primitives. You can submit jobs, which are single functions: as I said, all you need to do is add a single decorator on top and say what compute it should be shipped to.
It goes there, it runs, you get the Python object back, and you pay only for the function you ran. We also let you serve inference, and again it's completely Pythonic. You don't Dockerize, you don't run a Kubernetes cluster, you don't do anything. You just say, hey, I have an initializer function, and I need an endpoint called /generate.
You define your Python functions and run a single cc.deploy command in your Jupyter notebook. The service gets shipped to us, and you get back an API endpoint that scales to zero and scales back up as new requests come in.
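A deployment written this way looks roughly like the following sketch, assuming Covalent Cloud's `@cc.service` decorator, per-service `endpoint` registration, and `cc.deploy`; `my_load_model` and the resource values are placeholders, not part of any real API.

```python
import covalent_cloud as cc

service_executor = cc.CloudExecutor(
    env="my-env", num_cpus=4, num_gpus=1, gpu_type="h100", memory="24GB"
)

@cc.service(executor=service_executor, name="text-generator")
def text_service(model_path):
    # Initializer: runs once when the service spins up; whatever it returns
    # is made available to every endpoint function.
    model = my_load_model(model_path)  # placeholder for your own loading code
    return {"model": model}

@text_service.endpoint("/generate")
def generate(model, prompt=None):
    # Each request to /generate hits this plain Python function.
    return {"text": model.generate(prompt)}

# One call from the notebook; you get back a client for the live endpoint.
client = cc.deploy(text_service)(model_path="my-finetuned-model")
```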
You can define your own autoscaling mechanism: say, scale to 10 GPUs at exactly 9 o'clock every day, or scale up whenever GPU utilization hits 80%, or whenever the request count reaches 1,000. Whatever autoscaling policy you want.
You can define authentication and so on, and it all happens in the background for you. The talk I'm giving is a very tiny example of what we do. If you go to this link, there's a whole host of examples you can run, from real-time time series analysis, to inverted transformers, a state-of-the-art transformer architecture for time series, to serving large language models, to building an entire AI model foundry out of pure Pythonic code.
So without further ado, I'll quickly run through the code example of how you fine-tune a whole set of models directly from Python, and I'll also show you what it looks like on the front end. It's rather simple. I have written some normal Pythonic training functions, fine_tune and evaluate, in a local package of mine.
What we're going to do is define a Python task that calls my fine_tune function: it accepts a model and data and returns a fine-tuned model. It's a simple Python function, and I say, hey, I want to run it on a 24-core CPU machine with one H100 GPU and 48 gigs of memory.
I give it a maximum time limit of 18 hours. Then, once a model is done, another task accepts it and evaluates its accuracy. And finally, I sort all the models by accuracy and pick the best one.
I want that step to run on a CPU-only machine; I don't want to waste a GPU on sorting. And finally, I deploy the best-performing model, and this is, again, a simple decorator.
I say, hey, this is my initialization service, and I create an endpoint called /generate that generates text and returns the prediction. And finally, this is where the magic happens. To tie it all together, I create a workflow where I simply loop over the models I want to fine-tune, call the fine-tune task, call the evaluate task to get the accuracy, collect all the models and accuracies into a list, sort them, pick the best model, and then deploy it.
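Put together, the pipeline described above looks roughly like the sketch below. Treat it as an illustration under assumptions: `fine_tune` and `evaluate` stand in for the local package mentioned in the talk, the executor values mirror the numbers quoted (24 CPUs, one H100, 48 GB, 18 hours), and the exact Covalent Cloud signatures may differ slightly.

```python
import covalent as ct
import covalent_cloud as cc

from my_finetuning_package import fine_tune, evaluate  # the local helpers from the talk

# GPU executor for the heavy steps, CPU-only executor for the cheap ones.
gpu_executor = cc.CloudExecutor(
    env="finetune-env", num_cpus=24, num_gpus=1,
    gpu_type="h100", memory="48GB", time_limit="18 hours",
)
cpu_executor = cc.CloudExecutor(env="finetune-env", num_cpus=8, memory="16GB")

@ct.electron(executor=gpu_executor)
def fine_tune_task(model_name, data):
    return fine_tune(model_name, data)      # returns a fine-tuned model

@ct.electron(executor=gpu_executor)
def evaluate_task(model):
    return model, evaluate(model)           # returns (model, accuracy)

@ct.electron(executor=cpu_executor)
def pick_best(results):
    # Plain sorting work: no reason to burn GPU time on it.
    return max(results, key=lambda pair: pair[1])[0]

@cc.service(executor=gpu_executor, name="best-model")
def serve_best(model):
    # Initialization service for the winning model.
    return {"model": model}

@serve_best.endpoint("/generate")
def generate(model, prompt=None):
    # Generate text and hand back the prediction.
    return {"text": model.generate(prompt)}

@ct.lattice
def finetune_fleet(model_names, data):
    # Loop over the models, fine-tune and evaluate each one, then pick the best.
    results = [evaluate_task(fine_tune_task(name, data)) for name in model_names]
    return pick_best(results)
```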
And this is completely Pythonic. Once you dispatch it to our server, which is essentially a single line, you'll see a new job appear in our application, and every function you called runs on the respective device you defined for it.
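That single line is just a dispatch call from the notebook, roughly as sketched below. The model names and dataset are placeholders, the result accessor can vary by Covalent version, and the deployment of the winner is shown here as a follow-up call after the workflow returns, which is a conservative reading of the flow described in the talk.

```python
# Ship the whole workflow; each task runs on the executor it was tagged with.
dispatch_id = cc.dispatch(finetune_fleet)(
    model_names=["model-a", "model-b"],   # placeholders; the talk loops over ~20 models
    data=my_dataset,                      # placeholder for your training data
)

# Fetch the winner back and stand up the /generate endpoint for it.
result = cc.get_result(dispatch_id, wait=True)
best_model = result.result                # exact result accessor may vary by version
client = cc.deploy(serve_best)(model=best_model)
```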
So for instance, here is one of the evaluation steps that ran, on its own machine: it ran on an L4 for six minutes, and it cost just $0.87 to evaluate your model. Another model ran on a V100, again for six minutes.
That one cost $0.11. So in the end, you have trained, fine-tuned, and deployed completely in Python, without needing anything like Docker or Kubernetes on your end. We have a booth over there, do visit us and we can chat more. Thank you, guys.