
Fine-tune 20 Llama Models in 5 Minutes: Santosh Radha



00:00:13.700 | So the talk is actually going to be about how you run things extremely easily, directly
00:00:19.680 | from Python.
00:00:21.200 | And for the example that I'm going to show you here, obviously I just have five minutes
00:00:24.320 | on my end.
00:00:25.320 | But I'm going to try my best to showcase how you can fine-tune many models (20 is an arbitrary
00:00:29.600 | number here; it could be hundreds) right from Python, without needing anything
00:00:34.420 | like Kubernetes or Docker on your side.
00:00:37.220 | So before that, you can find the talk and the actual code for what I'm going to do in
00:00:40.540 | this QR code.
00:00:42.280 | And you will find a lot more interesting examples over there to try out and run as well.
00:00:48.420 | So what do we do?
00:00:50.020 | So Covalent is an open-source/open-core product.
00:00:55.260 | And what we do is we help people write Python locally and ship the code to any kind of compute
00:01:01.080 | backend that you need to send it to.
00:01:02.880 | So what that means is, hey, you have a Python function that you want to run on a GPU.
00:01:07.700 | On your local laptop, open up a Jupyter notebook, add a single decorator on top to say, hey,
00:01:12.520 | do this on an H100 with 36 gigs of memory and a maximum time limit of two days, and press Shift
00:01:18.340 | Enter in your Jupyter notebook.
00:01:20.340 | And that's it.
00:01:21.340 | The code gets shipped to a GPU backend, and you get back the result on your side.
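(For illustration only, not code shown in the talk: a minimal sketch of what such a decorated task might look like with Covalent and Covalent Cloud. The executor argument names and values below are assumptions based on the numbers mentioned; the exact signature comes from the Covalent documentation.)

    import covalent as ct
    import covalent_cloud as cc

    # Ask for an H100 with 36 GB of memory and a two-day time limit,
    # matching the example in the talk (argument names are illustrative).
    gpu_executor = cc.CloudExecutor(
        num_cpus=4,
        num_gpus=1,
        gpu_type="h100",
        memory="36GB",
        time_limit="2 days",
    )

    @ct.electron(executor=gpu_executor)
    def train(dataset):
        # Ordinary Python training code; it runs remotely on the GPU
        # when the workflow is dispatched from the notebook.
        ...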
00:01:25.340 | In the open source case, it sends it to your own compute.
00:01:28.340 | You can attach your own compute cluster, and it runs over there.
00:01:31.340 | In the cloud case, it runs in our GPU cluster, and you just pay for the GPU time that it runs.
00:01:36.160 | So if it runs for five minutes,
00:01:37.160 | you pay for five minutes of H100 time.
00:01:39.260 | If it runs for 10 seconds,
00:01:40.260 | you pay for 10 seconds of H100 time.
00:01:42.780 | You can also bring your own compute and attach it to us, and we'll help you orchestrate the entire
00:01:46.440 | compute that you're handling on your side, be it your own cloud or on-prem systems or whatever
00:01:50.940 | it is on your end.
00:01:54.220 | So Covalent basically has a bunch of primitives that you define.
00:01:57.440 | You can submit jobs, which are just single functions.
00:02:00.780 | So essentially, all you need to do is, as I said, add a single decorator on top and say
00:02:05.240 | what compute you need to ship it to.
00:02:07.240 | It goes there, it runs, you get the Python object back, and you just pay for the
00:02:11.280 | function that you ran.
00:02:13.820 | We also let you run inferences, and again, it's completely Pythonic.
00:02:17.540 | You don't Dockerize.
00:02:18.540 | You don't run a Kubernetes cluster.
00:02:20.080 | You don't do anything.
00:02:21.080 | You just say, hey, I have an initializer function, and I need an endpoint called /generate.
00:02:27.460 | And you define your Python functions.
00:02:29.260 | You run a single cc.deploy command in your Jupyter notebook.
00:02:33.720 | That creates a server, it gets shipped to us, and you get back an API endpoint
00:02:39.180 | that scales to zero, or scales up as and when new requests come in.
00:02:42.180 | You can define your custom autoscaling mechanism.
00:02:44.180 | Like, hey, I want to autoscale it to 10 GPUs exactly at 9 o'clock every day.
00:02:48.180 | Or I want to autoscale whenever my GPU utilization hits 80%.
00:02:52.640 | Or I want to autoscale whenever the number of requests I get hits 1,000.
00:02:56.640 | So you can define whatever autoscaling you want.
00:02:58.640 | You can define authentication and everything.
00:03:00.640 | And everything happens in the background for you.
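(Again purely for illustration, not code from the talk: a rough sketch of the initializer-plus-endpoint pattern being described, assuming Covalent Cloud's service decorators. The decorator names, arguments, and the load_model helper are assumptions, not verbatim API; autoscaling and authentication settings are omitted here.)

    import covalent_cloud as cc

    service_executor = cc.CloudExecutor(num_cpus=4, num_gpus=1, memory="16GB")

    @cc.service(executor=service_executor, name="text-generation")
    def init_service(model_path):
        # Initializer: runs once when the service starts and loads the model.
        model = load_model(model_path)  # hypothetical helper
        return {"model": model}

    @init_service.endpoint("/generate")
    def generate(model, prompt):
        # Each request to /generate reuses the already-loaded model.
        return model.generate(prompt)

    # Deploying from the notebook would return a live, scale-to-zero API endpoint,
    # e.g. (hypothetical usage): client = cc.deploy(init_service)(model_path="my-model")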
00:03:02.640 | And the demo I'm going to give is a very tiny example of what we do on our side.
00:03:13.100 | But if you go to this link, there's a whole host of examples that you can run, right
00:03:19.100 | from real-time time series analysis, to using inverted transformers for time series,
00:03:24.840 | which is a state-of-the-art time-series transformer, to serving large
00:03:31.480 | language models on your own serving systems, and even building an entire AI model foundry
00:03:36.360 | out of just pure Pythonic code on your side.
00:03:40.080 | So without further ado, I'll quickly run through the code example of how you essentially fine-tune
00:03:47.000 | a huge set of models, directly from Python on your end.
00:03:51.620 | And I'll also show you what it looks like from the front-end side as well.
00:03:55.520 | So it's rather simple.
00:03:58.140 | All you do is: I have written a bunch of normal Pythonic training functions, called fine_tune
00:04:03.680 | and evaluate, in a local package on my end.
00:04:06.300 | And what we are going to do is, hey, I'm going to define a Python task, which essentially calls
00:04:10.900 | my fine_tune function, which is going to accept the model and data and return back a fine-tuned
00:04:14.900 | model.
00:04:15.900 | It's a simple Python function, and I'm going to say, hey, I want to run it on a 24-core CPU machine
00:04:20.520 | with one GPU of type H100 and 48 gigs of memory.
00:04:24.140 | I'm going to give it a max time limit of 18 hours.
00:04:27.140 | And then I'm going to say, once the model is done, I'm going to accept the model and then
00:04:30.760 | evaluate its accuracy.
00:04:33.760 | And finally, I'm going to just sort all the models by accuracy and then pick the
00:04:38.380 | best model.
00:04:39.380 | And I want this to run on a CPU-based machine.
00:04:40.380 | I don't want to waste a GPU on my sorting or whatever else I'm going to do on my end.
00:04:45.380 | And finally, I'm going to deploy the best model, the one that I figured has performed well on my end.
00:04:52.000 | And this is, again, a simple decorator, right?
00:04:53.620 | Say, hey, this is my initialization service, and I'm going to create an endpoint called
00:04:57.620 | /generate.
00:04:58.620 | And I'm going to generate the text and give back the prediction.
00:05:04.320 | And finally, this is where the magic happens.
00:05:06.080 | To tie together all of these things, what I do is I'm going to create a workflow where I'm
00:05:10.240 | pretty much just going to simply loop over a bunch of models to fine-tune, call the fine-tune
00:05:15.020 | task, evaluate it and get the accuracy, make a list of all the models and accuracies,
00:05:20.800 | sort them, and then deploy the best model from my end.
00:05:23.940 | And this is completely Pythonic.
00:05:26.300 | And once you dispatch this to our server, which is essentially calling a single line
00:05:30.200 | over here, what you will go back and see is a new job that gets created in our application.
00:05:37.660 | And all of the functions that you called will run on the respective devices that you just defined.
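(One more non-verbatim sketch, assuming the task definitions above and Covalent's lattice/dispatch pattern: a workflow that loops over models, evaluates each one, and picks the best. cc.dispatch and the commented-out usage are assumptions about the API shape, not the speaker's exact code; the deployment step is omitted since it mirrors the service sketch earlier.)

    @ct.lattice
    def finetune_sweep(model_names, data):
        results = []
        for name in model_names:
            model = fine_tune_task(name, data)
            accuracy = evaluate_task(model)
            results.append((model, accuracy))
        # Runs on the CPU-only executor; the best model would then be deployed.
        return pick_best(results)

    # A single call ships the whole workflow off; each task runs on its own executor.
    # Example usage (hypothetical model names and dataset handle):
    # run_id = cc.dispatch(finetune_sweep)(["model-a", "model-b"], my_dataset)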
00:05:42.920 | So for instance, here is one of the evaluation steps that ran, and it has its own machine that
00:05:48.660 | it ran on.
00:05:49.660 | It ran on an L14.
00:05:50.600 | It ran for six minutes, and it cost just $0.87 to evaluate your model.
00:05:56.920 | Another model ran on a V100, and it ran for six minutes again.
00:06:02.360 | It cost $0.11 to do it.
00:06:04.320 | And in total, you have finally fine-tuned, trained, and deployed, completely in Python, without needing
00:06:10.340 | anything like Docker or Kubernetes on your end.
00:06:13.360 | And we have a booth over there.
00:06:14.500 | Do visit us.
00:06:15.500 | And we can chat more over there.
00:06:16.620 | Thank you, guys.