Hi everyone, in this session we're going to talk about everything you need to know about fine-tuning LLMs and model merging. Quick intro: my name is Maxim Labonne, I'm a staff machine learning scientist at Liquid AI, and I'm also a Google Developer Expert. I write blog posts on these topics, I created the LLM course which is super popular on GitHub, I've contributed to the open source community through models and tools, and I'm the author of Hands-On Graph Neural Networks Using Python, published by Packt.
So first of all, let's talk about fine-tuning. We saw a bit of fine-tuning in the previous session, so I'll try to not repeat too much. But basically, here's the LLM training lifecycle. You see three stages. First of all, you have the pre-training stage where you give a lot of raw text to the model, and the idea is that the model learns to do next token prediction.
The result of that is called a base model. This base model is really nice, but if you ask it questions or instructions, it's going to auto-complete your question instead of answering it, which is why we have the supervised fine-tuning stage where this time we give pairs of questions and answers to the model.
And we have a similar training objective, but the idea is that at the end of it, it's going to actually answer your questions and follow instructions. Then we have a third and final stage, the preference alignment stage, where we give human preferences to align the model to how we want it to behave, and the result is commonly referred to as a chat model.
So when to use fine-tuning. Here you can see a little flow chart that I've made. It's very high level. But basically, there's a conversation about when to use prompt engineering, when to use fine-tuning. I think it's good in general to start with prompt engineering if you can. And the idea is to have a really robust evaluation structure where you have a lot of different metrics that you're interested in.
It can be the accuracy of the model: does it answer my question well? You can create a custom benchmark if you have a very niche use case, or you can reuse open source benchmarks. You also want to look at cost and latency, because the question is: is it good enough? If it's good enough with just prompt engineering, then probably you don't need fine-tuning.
The problem is solved. Congrats. Otherwise, the question is, can you make an instruction dataset? So can you create pairs of questions and answers to fine-tune the model? If it's not the case, it can be for multiple reasons, but it's probably a good sign that you need to re-scope the project.
Otherwise, fine-tuning is an option, and you can reuse the evaluation framework that you created to evaluate the model. So that was the technical answer, but you also have a non-technical answer to that. Here is a report from A16Z. And the question is, why do enterprises care about open source?
You can see that the two main items are actually control and customizability. And customizability is mostly about fine-tuning models. So even if there are arguments about the technical side, cost, and latency, there's also a strong argument around customizability and control over these models. In terms of fine-tuning libraries, I think that you know about Unsloth now, but I'm going to talk about the other ones.
So TRL from Hugging Face, a great library built on top of transformers, very easy to use. You have Axolotl, an excellent library, very versatile, with a lot of YAML config files. And then you have LLaMA-Factory, which has a really good built-in graphical user interface.
So to talk a bit more about supervised fine-tuning, here you see an example of a sample that we give to the model. So we have the instruction, which is both the system prompt and the user prompt, and the answer, which is the output. In this case, the system prompt is used to steer the behavior of the model: think like you're answering a five-year-old.
And the user actually gives the task: remove the spaces from the following sentence. We generally train the model on the output only, so we mask the rest; it's just used as context. What we want is to train the model to output the correct answer.
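To make that masking concrete, here is a minimal sketch (the sample text and the tokenizer name are just placeholders): the prompt tokens get a label of -100 so the loss is only computed on the answer tokens.

```python
# Minimal sketch of prompt masking for SFT (hypothetical sample, any causal LM tokenizer).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder model id

system = "Answer like you are talking to a five-year-old."
user = "Remove the spaces from the following sentence: It is a test."
answer = "Itisatest."

prompt = f"{system}\n{user}\n"
full = prompt + answer

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
full_ids = tokenizer(full, add_special_tokens=False)["input_ids"]

# Labels: -100 on the prompt tokens so the cross-entropy loss only counts the answer tokens.
labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]
```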
Most SFT datasets, I would say, use synthetic data, and that's perfectly fine. Usually it's generated with frontier models, and that's a great way of building higher-quality datasets. Then you have preference alignment. I'm just going to mention it here. There are a lot of different methods: PPO, DPO, KTO, IPO. In practice, direct preference optimization (DPO) is probably the most popular one.
So here you see that you have a different format with an instruction, and you have a chosen answer and a rejected answer. The idea is that you show a positive example and a negative example to the model. And with DPO, the goal is to make sure that the model you're currently training outputs higher probabilities for the chosen answers than the untrained version of the same model.
I'm not going to delve too much into the details here, but this is the general idea. It can be used either to censor the model (for "how to make a bomb", the chosen answer would be "As an AI system, I cannot tell you that"), or to boost the performance of the model in general.
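As a rough sketch, here is what a preference pair and a DPO run can look like with TRL (the model name, data, and hyperparameters are placeholders, and the exact trainer arguments depend on your TRL version):

```python
# Hypothetical preference pair and DPO setup with TRL (recent versions).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

preference_data = Dataset.from_list([
    {
        "prompt": "How do I make a bomb?",
        "chosen": "I can't help with that request.",
        "rejected": "Sure, here is how you would do it...",
    },
])

model = AutoModelForCausalLM.from_pretrained("my-sft-model")  # placeholder: your SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained("my-sft-model")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-model", beta=0.1),  # beta controls how far we can drift from the reference model
    train_dataset=preference_data,
    processing_class=tokenizer,  # older TRL versions use tokenizer= instead
)
trainer.train()
```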
How to create SFT datasets. So this is a very fundamental question in the post-training world. And the main question is: okay, what's a good sample? Humans are actually quite bad at reviewing samples by hand. But what I like to do is define three main features. The first one is accuracy.
We want the samples, the outputs to be factually correct. Maybe no typos would be good too. We don't want to compromise the knowledge of the model by giving it fake information. Then you have diversity. And diversity, you want to cover as many topics as you can. Of course, it depends on your use case.
Because if you do summarization, you won't be as general as if you do general purpose fine tuning. But it's always a good idea to include a lot of different topics, different writing styles in this dataset. And finally, you have complexity. I think this one is a bit less trivial.
And it's about giving complex tasks to the model, forcing reasoning. So for example, the output will have chain-of-thought reasoning because you want to train the model to have this kind of reasoning. Or it can be tasks like summarization, or "explain it to me like I'm five years old". These kinds of tasks really force the model not to only answer the question like a QA pair with answers you could find on Wikipedia.
It also forces it to reason over the prompt and give a more complex answer. So as a little recipe, you can see here: I would recommend in general starting with open source datasets; you can combine some of them. Then you can apply different filters. The first one is data deduplication.
It can be either exact, where you remove identical duplicates, or fuzzy, which is the same idea for near-duplicates. And then you have data quality filters. Here you have different techniques. It can be rule-based filtering; for example, you want to remove every single row where you have "As an AI assistant, I cannot..." because people hate it.
But you can also use more clever techniques like reward models or LLM-as-a-judge to evaluate the quality of each sample and filter out the bad ones. And then you can use data exploration with different tools like Lilac, Nomic Atlas, or text clustering to get topic clusters, visualize your dataset, and get ideas on how to improve it.
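As an illustration, here is a minimal sketch of exact deduplication plus one rule-based filter using the datasets library (the file name, column names, and refusal phrase are just examples):

```python
# Sketch of exact deduplication and a simple rule-based quality filter.
from datasets import load_dataset

ds = load_dataset("json", data_files="sft_samples.jsonl", split="train")  # placeholder dataset

# Exact deduplication: keep the first occurrence of each (instruction, output) pair.
# Note: this stateful filter assumes single-process execution.
seen = set()
def is_new(example):
    key = (example["instruction"], example["output"])
    if key in seen:
        return False
    seen.add(key)
    return True

ds = ds.filter(is_new)

# Rule-based filter: drop rows containing refusal boilerplate.
ds = ds.filter(lambda ex: "as an ai assistant, i cannot" not in ex["output"].lower())
```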
With the ideas you get from this exploration, you can go back to data generation and start the process all over again. In terms of SFT techniques, we have three main techniques. Full fine-tuning is the most basic one: you take the base model and you just train it on the instruction dataset. It has the best performance, but it's also very inefficient in general.
A more efficient way of doing it is LoRA. With LoRA, you freeze all the pre-trained weights and you add adapters to each targeted layer. These matrices A and B are the adapters. So you don't train all the parameters of the base model, you only train a small subset of them.
So this is a lot faster. But it can still be costly because you're still loading the entire model in 16-bit precision. So a more efficient way is to quantize the pre-trained model in 4-bit precision. This is QLoRA. You apply the same idea that you had with LoRA, but this time the frozen weights are heavily quantized.
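Here is a minimal QLoRA sketch with bitsandbytes and peft, assuming a Llama-style model (the model name, rank, and target modules are placeholders, not a recommendation):

```python
# Minimal QLoRA setup: 4-bit frozen base weights plus trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                   # rank of the A and B adapter matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: Llama-style layer names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapters are trainable
```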
With QLoRA you get a lower VRAM usage. The problem is that it also degrades performance a bit, so there's a trade-off here. I want to briefly mention some hyperparameters, but Daniel already talked about a lot of them, so I'm going to be brief. I think the most important one is the learning rate.
The learning rate is model dependent. It requires a few experiments to really tweak it and find the best one. Generally, I would recommend going as high as you can until your loss explodes like in this graph; then you can reduce the learning rate.
Another super important hyperparameter is the number of epochs. I would say that depending on the size of the dataset, you can have more or fewer epochs. Sequence length also matters because it's a trade-off with the batch size: the longer the sequence length, so the bigger the context window, the more VRAM you're going to use.
But you don't need to use a sequence length that's as big as the pre-trained model's context window. Then you have the batch size; you want to maximize it to maximize the utilization of your GPUs. And then you have the LoRA rank. This is quite easy to tune, so I don't want to go into the details here.
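To tie these hyperparameters together, here is an illustrative SFT setup with TRL (all values are starting points rather than a universal recipe, and argument names can vary slightly between TRL versions):

```python
# Hypothetical SFT run showing the main hyperparameters discussed above.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

ds = load_dataset("json", data_files="sft_samples.jsonl", split="train")  # placeholder dataset

config = SFTConfig(
    output_dir="sft-model",
    learning_rate=2e-4,             # push it as high as the loss tolerates, then back off
    num_train_epochs=3,             # fewer epochs for big datasets, more for small ones
    per_device_train_batch_size=4,  # maximize this to keep the GPUs busy
    gradient_accumulation_steps=4,  # raises the effective batch size without more VRAM
    max_seq_length=2048,            # may be called max_length in newer TRL versions
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # a model id, or the PEFT model from the previous sketch
    args=config,
    train_dataset=ds,
)
trainer.train()
```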
Let's talk about model merging now. Model merging is the idea that you can take the weights of different fine-tuned models and combine them together, so you can just leverage what the open source community has produced on the Hugging Face Hub, for example. It doesn't require any GPUs, so it's super efficient, and it provides excellent results.
So the Open LLM Leaderboard was updated this morning, so we have a version two now, but this is version one; I haven't had time to update it. But you can see that for 7B-parameter models, the entire top eight or top ten is just merged models. So it really shows that this approach is extremely effective at producing high-quality models.
And you can find similar results on really a lot of different datasets. I would recommend using MergeKit. This is the leading library in this space, with a lot of different techniques implemented. So here you can see the family tree of merged models. You don't really need to see the name of each model, but you see that every node is actually a model.
And we actually merge different merges together until it becomes like a giant family tree. This one is actually quite small. It can get a lot crazier than that, but it didn't fit on one slide. So I'll choose this one instead. About the merge techniques themselves, I want to mention a few of them.
The first one is called SLERP. It stands for Spherical Linear Interpolation. So the idea is really to apply spherical linear interpolation to the weights of different models. You can only merge two models at a time with this technique, but you can really tweak it with different interpolation factors for different layers.
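For intuition, here is a toy SLERP over two weight tensors. This is a sketch of the math, not MergeKit's implementation:

```python
import torch

def slerp(t, w_a, w_b, eps=1e-8):
    """Spherically interpolate between two weight tensors with factor t in [0, 1]."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    a_unit, b_unit = a / (a.norm() + eps), b / (b.norm() + eps)
    dot = torch.clamp(torch.dot(a_unit, b_unit), -1.0, 1.0)
    omega = torch.acos(dot)                  # angle between the two weight vectors
    if omega.abs() < eps:                    # nearly colinear: fall back to linear interpolation
        merged = (1 - t) * a + t * b
    else:
        so = torch.sin(omega)
        merged = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return merged.reshape(w_a.shape).to(w_a.dtype)
```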
Here's a model that I've made, NeuralBeagle14-7B, which was a really efficient way of leveraging the different models that were created by the open source community. And then you have DARE. With DARE, you want to reduce the redundancy of the model parameters. To do that, you're going to use pruning.
You're going to keep only the most significant parameters in your model weights, and you're going to rescale the weights of these source models. The advantage it has is that you can merge more than two models together. And I would recommend this technique not with just two or three models, but with seven or eight models.
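Here is a rough sketch of the drop-and-rescale idea behind DARE for a single tensor (the drop rate and weighting are arbitrary placeholders; MergeKit handles this per parameter across the whole model):

```python
import torch

def dare_merge(base, finetuned_list, drop_rate=0.9, weights=None):
    """Toy DARE merge for one tensor: drop most delta parameters, rescale the rest, add to base."""
    weights = weights or [1.0 / len(finetuned_list)] * len(finetuned_list)
    merged = base.clone().float()
    for w, finetuned in zip(weights, finetuned_list):
        delta = finetuned.float() - base.float()                 # task vector of this fine-tune
        mask = (torch.rand_like(delta) > drop_rate).float()      # randomly drop `drop_rate` of it
        merged += w * mask * delta / (1.0 - drop_rate)           # rescale survivors to keep the magnitude
    return merged.to(base.dtype)
```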
It works really, really well, so I strongly recommend it. Then you have a very funny technique called pass-through. With pass-through, you can concatenate layers from different LLMs. It can also be the same model; we call that a self-merge. And here you have an example that I've made recently.
It's called Meta-Llama-3-120B-Instruct because I took Llama 3 70B Instruct and I just repeated 10 layers six times. So you could say, like, this shouldn't work at all. Like, come on. You haven't even trained the model. This is ridiculous. Actually, yeah, this is ridiculous. People loved it on Twitter and Reddit and online in general.
So it shows that there's a lot of things that we can still discover with these merge techniques, with these models. They do not -- they can be counterintuitive sometimes. And you can see that this model in particular was particularly good at creative writing. It was also quite unhinged in general, but really good at creative writing.
And now it's being used by a lot of people, even though it's super big. But no kind of fine-tuning at all. No, no fine-tuning. Nothing. And then I want to mention the last technique, which is called Mixture of Experts. So in traditional Mixture of Experts, you are going to pre-train a model with a router -- you can see on the bottom here -- and different feed-forward network layers.
And you pre-train it from scratch. But you can do something quite smart with merging, where you extract the feed-forward network layers from different fine-tuned models and combine them together like this. So we call this a frankenMoE. You add a router, you combine the FFN layers from different models.
And this is how you create a mixture of experts. It's actually pretty cool, and it works pretty well in practice. You can see on the left a MergeKit config for the Beyonder model. For this model, I selected four different fine-tuned models: one as a chat model, one as a code model, one as a role-play model, and one as a math model.
You can see that I'm using positive prompts here. It's actually a way to initialize the router, because if you go back to the previous slide, you can see that the router is supposed to select, for each token and each layer, which feed-forward network layer is going to be used.
We use two in general. So how do we initialize it if we do not fine-tune it? Once again, we can fine-tune it, but we don't necessarily want to. In this case, we're just going to take these positive prompts, calculate their embeddings, and use these embeddings to initialize the routers.
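Conceptually, the initialization looks something like this sketch: embed each expert's positive prompts with the shared base model and average them into one router row per expert. The model name and prompts are placeholders, and this mirrors the idea rather than MergeKit's exact implementation:

```python
# Rough sketch: positive-prompt embeddings used to initialize a router gate.
import torch
from transformers import AutoModel, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"          # assumption: the shared base of the experts
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModel.from_pretrained(base)

positive_prompts = {
    "chat": ["Tell me about yourself.", "How are you today?"],
    "code": ["Write a Python function to sort a list."],
}

router_rows = []
with torch.no_grad():
    for expert, prompts in positive_prompts.items():
        embeddings = []
        for p in prompts:
            inputs = tokenizer(p, return_tensors="pt")
            hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
            embeddings.append(hidden.mean(dim=1).squeeze(0))  # average over tokens
        router_rows.append(torch.stack(embeddings).mean(dim=0))

router_weight = torch.stack(router_rows)     # (num_experts, hidden_dim), used to init each gate
```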
This initialization works really, really well. So those are two models that I've made. For Phixtral, I had to modify it to make it compatible with Phi-2. And it outperformed the base model on a lot of tasks. So it's really a good technique to use in general. But I would say that if you compare it to merging, as we saw with SLERP and with DARE, if you want to increase the performance, it's better to use SLERP and DARE instead of a mixture of experts, because this is a bit more experimental.
It will not bring you the same level of performance. And here you can see the results of the Beyonder model. You can see that the other models I'm comparing it to are the source models that I used in this merge. So it's quite remarkable to see that it's actually performing better than the source models on a lot of different benchmarks.
So, yeah, that's it for me. Thank you for your attention. If you are interested in knowing more, or if you want notebooks to run some code, I created the large language model course. All these notebooks are available on GitHub in the llm-course repository. And yeah, thank you.