Domain adaptation and fine-tuning for domain-specific LLMs: Abi Aryan

00:00:00.000 |
Hi, everyone. Welcome to my presentation. So, I don't know what's the best way to start 00:00:22.440 |
about it. I would probably say something along the lines of, well, you've heard a lot of 00:00:27.720 |
really good presentations that are focused on one very specific thing. And this session 00:00:34.100 |
itself will more so focus on an overview of domain adaptation and fine-tuning for large 00:00:40.040 |
language models. Because there's so much information out there which is like, oh, take this, use 00:00:46.340 |
that. So, my goal is to sum up all the literature for you to be able to make an informed decision 00:00:52.720 |
on how to be able to do domain adaptation for your particular enterprise use case or for 00:00:58.040 |
your hobby use case, however you're using it. So, about me, this has already been covered, 00:01:07.040 |
so let's skip this. Why do we care? I think the answer to this is pretty obvious, which is, 00:01:15.100 |
I mean, ChatGPT as a model, or even if you're looking at open-source large language models, 00:01:21.480 |
they're not trained for every single use case out there. There are some domains that are underrepresented, 00:01:26.480 |
there are some domains for which there is not enough data because of compliance or for whatever 00:01:31.420 |
reasons. And for that, we need some method to fine-tune the models, or use some alternative strategy to fine-tuning, whether that's knowledge bases, whether that's RAG, 00:01:36.480 |
or whether that's prompting. The second is, basically, you don't want to collect new data for every single domain. One of the best things that has happened with large language models, I would say, is the ability of these models to be able to transition to a new domain. So, there's one paper that I would reference. 00:01:56.480 |
So, one quick example: before transformer models, or even in the early transformer days, to train a model to learn a new language, we needed to collect the data for that particular language, and then 00:02:24.480 |
train it to do whatever task we wanted to do in that language. One of the best things that has happened is that, because the models are learning through embeddings, they're able to learn a new language they have not previously seen as well, because they're essentially learning the structure of the languages rather than the taxonomy of one language, which means there are some languages which are semantically 00:02:52.480 |
similar. So, for example, English is very semantically similar to Latin. I'm not entirely sure which others, but there are a couple of languages that fall into that one group of semantically similar languages, and other sets of languages with their own semantic similarities, so it's very easy to transition between those languages without ever having seen any data or any examples in them. The third is, basically, you want the models to be accessible 00:03:20.480 |
to a wide range of users. And what I mean by that is more along the lines of all the work that has been happening in personalization. So, those are the simple reasons; this is something almost everybody is aware of. What is fine-tuning? Fine-tuning is essentially a way of teaching the model to learn something 00:03:48.480 |
for which it hasn't already been trained before. So, improving the performance of a pre-trained model. One of the ways we do that is by updating the parameters, right? You take some inputs, you have a hidden layer in which you're calculating the weights and the biases, and then you have an output layer. All of that, I think, is obvious to almost everybody. You've seen what a transformer model looks like, but for people 00:04:16.480 |
who don't know the structure of a transformer model: there's an encoder and there's a decoder. The reason I'm referencing this is we'll go a little more into the details of these while we're talking about the different fine-tuning methods themselves. So, there's an encoder, there's a decoder. The encoder has a feed-forward network and an attention network, and the same goes for the decoder. Now, this is how we were looking at transformer models, 00:04:44.480 |
the way they are, and this is where the weights and the biases are stored right now. But now let's talk about making these models better. So, there are a couple of ways that we can fine-tune our models. We can update all the model weights, or we can update some of the weights. If we update all the model weights, that falls into the category of some of the approaches you've seen earlier, which is all the research 00:05:12.480 |
work from around 2016 to 2018, all those years, which is more around transfer learning and knowledge-distillation models, in which you have a teacher model and a student model. The student model learns from the teacher model, and that's the way you're sort of updating all the weights. But it is very expensive to do that: it is computationally expensive, and it takes more storage as well. 00:05:40.480 |
So, the second option that we're now looking at, the reason we're having this discussion today, is how can we update our models? Because the parameter counts have gotten so big, we cannot keep updating all the weights. So, how about we update just some of the weights, while making sure that we're able to get equivalent performance? And I would put an asterisk on, you know, 00:06:08.480 |
"equivalent performance", because we may not be able to get ChatGPT-level performance, and that is something we'll talk about eventually. So, if we update only some of the weights, you can break it down into three categories. To be honest, more like five categories, but there are three main ones, which is adaptive tuning, 00:06:26.480 |
prefix tuning, and parameter-efficient tuning. There's also instruction tuning, which is basically giving a couple of examples. This is something you've seen in so many examples throughout this conference, and in the talk prior to mine as well, where we were doing instruction tuning. 00:06:44.480 |
Instruction tuning with humans in the loop is, relatively obviously, not super relevant to most of us: it's too expensive to have real human beings fine-tune your parameters for you, 00:06:56.480 |
or to provide your examples and say, this is wrong, this is right. So, we are left with three techniques: adaptive tuning, prefix tuning, and parameter-efficient fine-tuning. 00:07:08.480 |
We'll go into a little more detail on what these are, why we're using them, and when they do well. So, the first one: this is basically adapter-based tuning. 00:07:22.480 |
What really happens in adapter-based tuning, and it's really good at this, is that it adds a small number of parameters to the existing model. 00:07:34.480 |
So, those parameters are basically stored in the adapter components that you're seeing over there. This is -- the entire model of the transformer remains the same, but we are adding two new components to it that contains the extra weights. 00:07:48.480 |
So, what this does is this exposes the model to the new information, and according to the original paper that came out, you know, it is able to improve the performance of the model. 00:08:00.480 |
Or you could say it matches the performance of the model with only 0.15% of the parameters. Where is it good? Where -- in which cases would we use something like this? 00:08:14.480 |
So, adapter fine-tuning or adaptive fine-tuning, both are the same things. Ideally, you use it when you're trying to learn a new domain itself, which is -- if you're trying to fine-tune your model for like a very -- 00:08:28.480 |
different domain, let's say biochemical engineering, that's more so where you would use adapter-based fine-tuning. 00:08:36.480 |
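To make the adapter idea concrete, here is a minimal sketch of what an adapter block typically looks like: a small bottleneck layer with a residual connection, inserted after a frozen transformer sub-layer. This is an illustration rather than the exact architecture from the original adapter paper, and the hidden and bottleneck sizes are made-up values.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add.

    Only these few parameters are trained; the surrounding transformer
    weights stay frozen. Sizes here are illustrative, not from the paper.
    """
    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the original representation intact,
        # so the adapter only has to learn a small correction on top of it.
        return x + self.up(self.act(self.down(x)))

# Usage: drop an Adapter after a frozen transformer layer's output.
hidden_states = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
adapter = Adapter()
print(adapter(hidden_states).shape)       # torch.Size([2, 16, 768])
```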
The second is prefix-based fine-tuning. What prefix-based fine-tuning does is introduce some prefixes in which we store the new weights, and what they are able to do is mimic the behavior of the prefix that we give it, which is the couple of weights that we are adding in front of the attention model. 00:08:56.480 |
So, in very simple words, what it does is add an embedding layer at the front of the attention layer to mimic that behavior. 00:09:10.480 |
So, one very simple example to understand this a little bit better is, you know, all of the water that we get comes out of a tank, right? 00:09:20.480 |
But the way we are able to access it is through a tap, and the water takes the shape of the tap: it comes out in that quantity. That's very much like how prefix tuning works, which is, it's not changing the behavior of the model, it's just mimicking, or adding a masking layer on top of the existing weights, on top of the existing model that's there. 00:09:38.480 |
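As a rough sketch of that "prepended embeddings" idea: the base model stays frozen, and you learn only a handful of prefix vectors that get concatenated in front of the token embeddings before they reach the attention layers. This is a simplified illustration (real prefix tuning injects learned prefixes into the keys and values of every attention layer), and all sizes below are made up.

```python
import torch
import torch.nn as nn

class PrefixPrepender(nn.Module):
    """Learns a fixed number of 'virtual token' embeddings and prepends them
    to the real token embeddings. Only the prefix embeddings are trained.
    Simplified: real prefix tuning also conditions every attention layer's
    keys and values, not just the input embeddings.
    """
    def __init__(self, prefix_len: int = 10, hidden_size: int = 768):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_size) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        batch = token_embeddings.shape[0]
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the learned prefix in front of the real tokens.
        return torch.cat([prefix, token_embeddings], dim=1)

embeddings = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
prepender = PrefixPrepender()
print(prepender(embeddings).shape)     # torch.Size([2, 26, 768])
```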
The third and final one is the parameter-efficient fine-tuning method. For this one, the example that you're seeing is basically the LoRA one. There are two commonly known parameter-efficient fine-tuning methods out there, LoRA and QLoRA. The way LoRA really works -- 00:10:06.480 |
it really is basically a low-rank adaptation method. Any parameter-efficient fine-tuning method is used where you want to compress the model size, or you want to run it on low-resource devices. So it's very well suited to large language models. 00:10:24.480 |
The biggest reason is that we have massive numbers of parameters that we are trying to run on very small devices, which could be our laptops, and even smaller devices, basically embedded devices, the Arduinos and all of that stuff. So, that's one reason the entire community has been talking so much about LoRA and QLoRA: again, we are looking for efficiency. The way it works under the hood is -- all of the weights are usually stored as 00:10:52.480 |
what is basically a matrix, right? In most of these weight matrices, there are a lot of rows and columns that aren't unique. And what LoRA effectively does is identify the linearly independent parts of the weight matrix itself. So, in the matrix, you're looking at the 00:11:20.480 |
linearly independent rows or columns, and you're picking and choosing only those. The idea is: if two things are very similar, or you could transform one into the other easily through a mathematical function, like a multiple of the other one, then storing that extra row, which is effectively a copy of the original one, doesn't really make sense, right? So, that's how LoRA works under the hood, which is, we are reducing the size of the matrix, which is 00:11:50.480 |
the size of the weight matrix, essentially. Practical benefits: obviously, you're able to decrease the size of the model, and you're also using less memory. 00:12:18.480 |
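Here is a minimal sketch of that low-rank idea: instead of updating the full weight matrix W, you freeze it and learn two small matrices A and B whose product is a low-rank update added on top. The rank, sizes, and scaling below are illustrative choices, not the talk's exact settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A.

    With feature size d and rank r << d, the update costs 2*d*r parameters
    instead of d*d. Sizes and scaling here are illustrative.
    """
    def __init__(self, in_features: int = 768, out_features: int = 768,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)   # freeze the original weights
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the small low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

x = torch.randn(4, 768)
layer = LoRALinear()
print(layer(x).shape)   # torch.Size([4, 768])
```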
The second method that we're looking at is basically called QLoRA, the quantized LoRA method. The way it works is it changes the model weights to 4-bit precision. The way it usually works is you start with the pre-trained model, you collect a dataset with labeled data, you train an adaptation matrix and multiply it with the main weight matrix, and what you're essentially trying to do is decrease the distance between 00:12:46.480 |
the predicted outputs of the source domain and the target domain. That's what's essentially going on in QLoRA. One quick comparison, for the people who are asking: okay, QLoRA is great, so should we use LoRA or QLoRA? 00:13:06.480 |
One quick thing I'll say on that one is: while QLoRA works really well on the original dataset it was trained on, getting it to perform really well requires a library, the bitsandbytes library, and some other things which are not available on all devices. Not a lot of testing has really happened on QLoRA's efficiency across all of the models. So I would probably say that sticking with LoRA and optimizing performance with the LoRA model is ideally the better way to go, at least at this point in time. 00:13:34.480 |
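For reference, here is roughly what a QLoRA-style setup looks like with the Hugging Face transformers, peft, and bitsandbytes stack mentioned above: the base model is loaded with 4-bit quantized weights and only small LoRA adapters are trained on top. Treat it as a sketch under assumptions; the model name and every hyperparameter are placeholders, and bitsandbytes needs a supported GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM works

# Load the frozen base model with 4-bit (NF4) quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# Attach small trainable LoRA adapters to the attention projections.
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only a small fraction of weights train
```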
So, to very quickly summarize: again, we have three different methods to do domain adaptation. We have prompting, we have RAG, we have fine-tuning. For prompting, you can prompt your models with no examples, with one example, or with a couple of examples. 00:14:02.480 |
When it comes to a couple of examples, I think a good number would be about 10, which is what ChatGPT says, and obviously the performance is better the more examples you're able to give it. Where it works is in domains where you're looking for more generalizable models, but usually that's just demos, not real-world applications. It requires less training data and it's cheaper, obviously, but it does not perform as well as fine-tuning. 00:14:30.480 |
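As a tiny illustration of going from zero-shot to few-shot: a prompt builder that prepends a handful of labeled examples before the actual query. The example texts and labels here are made up.

```python
# Minimal few-shot prompt builder; the examples and labels are invented.
examples = [
    ("The contract auto-renews every 12 months.", "renewal clause"),
    ("Either party may terminate with 30 days notice.", "termination clause"),
    ("Fees are payable within 45 days of invoice.", "payment clause"),
]

def build_prompt(query: str) -> str:
    # Each stored example becomes one "shot" before the real query.
    shots = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in examples)
    return f"{shots}\nText: {query}\nLabel:"

print(build_prompt("The vendor is liable for data breaches."))
```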
On fine-tuning, you're looking at three different methods: adaptive fine-tuning, behavioral fine-tuning, and parameter-efficient fine-tuning. You don't need to pick just one of these three techniques. You can also combine them with prompt engineering and with RAG as well, or you can do both of those things, meaning you can do adaptive as well as behavioral fine-tuning. The key difference between those three methods is that adaptive fine- 00:14:58.480 |
tuning really works well when you have a target domain that you're trying to optimize for. So, for example, if you have multiple tasks within a single domain, let's say you have a legal company and you're trying to build a model that works really well on five or ten different tasks within just the legal domain itself, adaptive fine-tuning works great. Behavioral fine-tuning is basically 00:15:00.480 |
where you're trying to optimize the model performance on a target task only. 00:15:24.480 |
So, you're not really optimizing for the entire domain; you're optimizing for just one particular task. The way it really works is you're optimizing for the label space and the prior probability distribution. So, it's very helpful when you're trying to get the model to show some sort of inference and reasoning capabilities. 00:15:51.480 |
A good analogy for behavioral fine-tuning is that it's very similar to LangChain functions, if you've used LangChain functions. And parameter-efficient fine-tuning is like the standard fine-tuning, where we are freezing some of the parameters and only updating a very small number of parameters using techniques like LoRA, QLoRA, and so on. 00:16:15.480 |
But coming to, you know, are these techniques really going to work? Sure, we have all of this available, but it will only work depending on how good your data is. It depends on how you're collecting your data, how you're tokenizing your data, how you're cleaning and normalizing your data. Are you removing the noise and sanitizing your data? Are you doing deduplication as well, to remove the duplicate entries? 00:16:42.480 |
So, there was another piece of research that was published which basically showed that the memorization that happens in models is mainly because of data duplication. If we remove the duplicate entries, that reduces the probability of the model memorizing certain examples, because, again, it's seeing those data points over and over in some form or another, so it's naturally 00:17:10.480 |
creating a sort of bias towards those things and naturally outputting them very quickly. And the last one is data augmentation. 00:17:20.480 |
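A rough sketch of that deduplication step: normalize each record and drop exact duplicates by hash. Real pipelines often also do near-duplicate detection (for example MinHash), which is beyond this sketch, and the sample data is made up.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash the same.
    return " ".join(text.lower().split())

def deduplicate(records: list[str]) -> list[str]:
    seen, unique = set(), []
    for record in records:
        digest = hashlib.sha256(normalize(record).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

data = ["Hello world", "hello   WORLD", "Something else"]
print(deduplicate(data))   # ['Hello world', 'Something else']
```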
Now, let's say you've done all of this. Let's say you've picked the right model. Let's say you've done your data collection thing perfectly. You've got the best data out there. 00:17:30.480 |
What are the things you still need to think of while optimizing the performance of your model? So, the first thing is: do not try to compare against GPT-4 or GPT-5. 00:17:40.480 |
It's not going to compare well, especially for more complex tasks; your fine-tuned model is not a generalized model. 00:17:47.480 |
While it may be able to capture the nuances of your actual data, it may not be able to capture the nuances of new data it hasn't seen, or newer domains it hasn't seen before. 00:18:01.480 |
So, that's one thing I've seen a lot of companies in a dilemma about, which is: oh, we've fine-tuned our model, but it's not working as well as GPT-4. 00:18:13.480 |
The second one is basically using in-context learning with dynamic examples. And one of the big reasons for that is the big problem that we see with drift in the models, with data drift. 00:18:29.480 |
So, using in-context learning with dynamic example loading allows you to be able to deal with that particular problem while also making sure that you are able to do cost management as well. 00:18:46.480 |
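A minimal sketch of "in-context learning with dynamic example loading": embed your pool of examples once, then at query time retrieve the most similar examples to put in the prompt, so the shots track the incoming data instead of being hard-coded. This assumes the sentence-transformers package; the embedding model name and the example pool are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder example pool: (input text, desired output) pairs.
pool = [
    ("Summarize clause 4.2", "Clause 4.2 covers liability caps..."),
    ("What is the notice period?", "The notice period is 30 days..."),
    ("List the payment terms", "Payment is due within 45 days..."),
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder model name
pool_vecs = encoder.encode([q for q, _ in pool], normalize_embeddings=True)

def dynamic_prompt(query: str, k: int = 2) -> str:
    """Pick the k most similar stored examples and build a few-shot prompt."""
    q_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = pool_vecs @ q_vec                  # cosine similarity (normalized)
    top = np.argsort(scores)[::-1][:k]
    shots = "\n".join(f"Q: {pool[i][0]}\nA: {pool[i][1]}" for i in top)
    return f"{shots}\nQ: {query}\nA:"

print(dynamic_prompt("How long is the notice period?"))
```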
The third thing one also needs to think of is breaking the task down into smaller tasks. 00:18:52.480 |
For example, if we are working with any sort of language, instead of trying to train the model for the entire language, can we break it down into very specific tasks? 00:19:02.480 |
So, that's another thing which people need to think of. 00:19:07.480 |
The final thing, I would say, is implementing some sort of gradient checkpointing. 00:19:13.480 |
So, what gradient checkpointing essentially does, it reduces the memory usage. 00:19:19.480 |
What it essentially does is avoid storing the intermediate activations and instead recompute them during the backward pass. 00:19:29.480 |
While it may look like, you know, it's not the smartest choice to make, because the computation is higher, 00:19:41.480 |
which is, yes, the activations will need to be re-computed, the downside is easily outweighed by the memory savings. 00:19:49.480 |
So, the memory consumption is much lower if we are implementing some sort of gradient checkpointing. 00:19:55.480 |
So, that's another cost-effective, cost-management thing. 00:19:59.480 |
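A small sketch of that trade-off with plain PyTorch: activations inside the wrapped blocks are not stored during the forward pass and get recomputed during backward, trading extra compute for lower peak memory. The toy model is made up; with Hugging Face models, the one-liner model.gradient_checkpointing_enable() does the same job.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

blocks = nn.ModuleList([Block() for _ in range(4)])
x = torch.randn(8, 512, requires_grad=True)

h = x
for block in blocks:
    # Activations inside each block are recomputed on the backward pass
    # instead of being kept in memory the whole time.
    h = checkpoint(block, h, use_reentrant=False)

loss = h.sum()
loss.backward()
print(x.grad.shape)   # gradients flow as usual, with lower peak memory
```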
Now, a few more considerations and limitations. Let's talk about the hyperparameters now, starting with choosing a batch size. 00:20:09.480 |
Ideally, we go with a batch size of 32 or 64. 00:20:15.480 |
Again, one of the questions I often get is, what's the right number of epochs that we should be training with? 00:20:21.480 |
If you're doing a simple test, which is, if you're running something in a Google Colab as a for-fun thing, maybe one epoch is fine. 00:20:31.480 |
But if you are working with a good model, and you're trying to optimize for a particular domain, then choosing to go with 100 epochs as the starting point is probably the ideal choice. 00:20:47.480 |
Choosing an optimizer: there are different optimizers out there. 00:20:49.480 |
The Adam optimizer is the standard choice because it's general purpose, 00:20:55.480 |
and it works really well across different domains as well. 00:20:59.480 |
Implementing some sort of regularization, early stopping. 00:21:03.480 |
So, again, one of the things is that if you're looking at the models that have been trained up to now, there's not a lot of work put into optimizing their performance. 00:21:17.480 |
While we're seeing bigger models every single day with more and more parameters, people aren't essentially squeezing all the performance out of those models. 00:21:28.480 |
One of the easy ways to do that is using some sort of early stopping, which is making sure that you're only training while it's actually helping and only working with the data that is most effective. 00:21:38.480 |
So, if the model performance is declining, then you need to reconsider that batch, look into that batch, and reconsider your embeddings. 00:21:47.480 |
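Pulling those hyperparameter choices together, here is a bare-bones training loop sketch: batch size 32, the Adam optimizer, a generous epoch budget, and early stopping on validation loss. The model and data are random placeholders just to show the control flow.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model, only to show the loop structure.
X, y = torch.randn(1024, 128), torch.randn(1024, 1)
train_loader = DataLoader(TensorDataset(X[:896], y[:896]), batch_size=32, shuffle=True)
val_X, val_y = X[896:], y[896:]

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # general-purpose default
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):               # generous budget; early stopping cuts it short
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(val_X), val_y).item()

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:      # stop once validation stops improving
            print(f"early stop at epoch {epoch}, val loss {best_val:.4f}")
            break
```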
Now, let's say if you fine-tune the model, the next part, which is the hardest part of the process, is, you know, how do we evaluate our models? 00:21:57.480 |
There are so many benchmarks out there and there are so many libraries out there. 00:22:03.480 |
There are libraries by Ray, there are libraries by NVIDIA. 00:22:08.480 |
But what you're essentially looking at mostly is the loss, accuracy, and perplexity, and that doesn't really paint the full picture. 00:22:17.480 |
So, while I say, you know, it is the hardest part, which is there needs to be some sort of adaptation for every single business and every single use case, 00:22:26.480 |
which is we need to be looking at evaluation from four different perspectives or four different components. 00:22:32.480 |
The first is doing some sort of metric-based evaluation, with something like the BLEU score or ROUGE score that we were using before. 00:22:42.480 |
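A quick sketch of that metric-based evaluation, here using the Hugging Face evaluate package as one option among many; it assumes evaluate and its rouge_score dependency are installed, and the predictions and references are made-up strings.

```python
import evaluate

predictions = ["the contract terminates after thirty days notice"]
references  = ["the agreement terminates after thirty days written notice"]

# Load the standard n-gram overlap metrics.
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
```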
So, I think Weights & Biases does have a library for doing that in particular, which is the debugger one. 00:22:49.480 |
And then there's another one, Auto-Evaluator. 00:22:51.480 |
So, that is able to catch the compilation errors very quickly. 00:22:56.480 |
The third one is using some sort of model-based evaluation, which is using a smaller model to be able to evaluate the other model. 00:23:05.480 |
While this is something I have not seen perform particularly well yet, because, again, it's hard to do, 00:23:15.480 |
it has a lot of potential: it does standardize the process eventually, and it automates the process. 00:23:21.480 |
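A rough sketch of that model-based evaluation idea: prompting a smaller instruction-tuned model to grade another model's answer against a reference. The judge model name, the prompt wording, and the 1-to-5 scale are all placeholders, and a model this small will not be a reliable judge; it only shows the mechanics.

```python
from transformers import pipeline

# Placeholder judge; any small instruction-tuned model could stand in here.
judge = pipeline("text2text-generation", model="google/flan-t5-base")

def judge_answer(question: str, answer: str, reference: str) -> str:
    # Ask the judge model to score the answer against the reference.
    prompt = (
        "Rate the answer from 1 (wrong) to 5 (perfect) given the reference.\n"
        f"Question: {question}\nReference: {reference}\nAnswer: {answer}\nRating:"
    )
    return judge(prompt, max_new_tokens=5)[0]["generated_text"].strip()

print(judge_answer(
    question="What is the notice period?",
    answer="30 days.",
    reference="Either party may terminate with 30 days notice.",
))
```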
And the final one is basically human-in-the-loop, which, I feel, is something that everybody is doing, but it's not the most efficient. 00:23:33.480 |
On human-in-the-loop, maybe let's let OpenAI talk about this. 00:23:40.480 |
The final thing that I wanted to say in this presentation is: while fine-tuning is great, yes, you also need to think about the entire pipeline, which is how you're thinking about data collection, how you're thinking about storage management, how you're choosing a base model. 00:23:59.480 |
So, optimizing the performance of the model doesn't really depend on just one piece, even though one piece may work perfectly for, like, a single one-off demo. 00:24:08.480 |
But to be able to put out a robust application that stands the test of time -- and, obviously, I'm not saying what the ideal amount of time you should be testing over would be -- 00:24:21.480 |
in any case, the goal is to get the optimal performance out of the model and to be able to deal with all the data drift and the prompt drift and all of those things, while also making sure that we're catching a few things early and not exposing the enterprise to reputational risk, compliance risk, and all of those things. 00:24:41.480 |
So, it is a big-picture decision, I would say, that needs to be taken. 00:24:53.480 |
If there is something you would like to go through with me in detail, then we can do that after the presentation.