Everything you need to know about Fine-tuning and Merging LLMs: Maxime Labonne


Chapters

0:00 Introduction
0:54 LLM training life cycle
1:54 When to use fine-tuning
3:02 Why Enterprises care about open source
3:31 Fine-tuning libraries
5:40 How to create SFT datasets
8:15 SFT techniques
9:22 Hyperparameters
9:57 Sequence Length
10:35 Model Merging
12:00 SLERP
13:19 Passthrough
14:26 Mixture of experts


00:00:00.040 | Hi everyone, in this session we're going to talk about everything you need to know about fine-tuning
00:00:19.620 | LLMs and model merging. Quick intro, my name is Maxime Labonne, I'm a staff machine learning
00:00:26.600 | scientist at Liquid AI, I'm also a Google Developer Expert, I write blog posts on these topics,
00:00:32.480 | I created the LLM course which is super popular on GitHub, I also contributed to the open source
00:00:38.360 | community through models, through tools, and I'm the author of Hands-On Graph Neural Networks Using
00:00:45.440 | Python with Packt. So first of all, let's talk about fine-tuning. We saw a bit of fine-tuning in the
00:00:52.520 | previous session, so I'll try to not repeat too much. But basically, here's the LLM training
00:00:59.000 | lifecycle. You see three stages. First of all, you have the pre-training stage where you give a lot of
00:01:05.240 | raw text to the model, and the idea is that the model learns to do next token prediction. The result of
00:01:13.100 | that is called a base model. This base model is really nice, but if you ask it questions or instructions,
00:01:21.020 | it's going to auto-complete your question instead of answering it, which is why we have the supervised
00:01:26.300 | fine-tuning stage where this time we give pairs of questions and answers to the model. And we have a
00:01:33.260 | similar training objective, but the idea is that at the end of it, it's going to actually answer your
00:01:39.740 | questions and follow instructions. Then we have a third and final stage, the preference alignment stage,
00:01:45.340 | where we give human preferences to align the model to how we want it to behave,
00:01:50.620 | and the result is commonly referred to as a chat model. So, when should you use fine-tuning?
00:01:55.900 | Here you can see a little flow chart that I've made. It's very high level. But basically, there's a
00:02:02.860 | conversation about when to use prompt engineering, when to use fine-tuning. I think it's good in general to
00:02:08.940 | start with prompt engineering if you can. And the idea is to have a really robust evaluation
00:02:14.540 | structure where you have a lot of different metrics that you're interested in. It can be the accuracy of
00:02:21.740 | the model. Does it answer my question well? You can create a custom benchmark if you have a very niche
00:02:27.820 | use case, or you can reuse open source benchmarks. Also, in terms of cost, latency, because the question
00:02:34.460 | is, is it good enough? If it's good enough with just prompt engineering, then probably you don't need
00:02:39.180 | fine-tuning. The problem is solved. Congrats. Otherwise, the question is, can you make an instruction
00:02:44.220 | dataset? So can you create pairs of questions and answers to fine-tune the model? If it's not the case,
00:02:51.260 | it can be for multiple reasons, but it's probably a good sign that you need to re-scope the project.
00:02:56.380 | Otherwise, fine-tuning is an option, and you can reuse the evaluation framework that you created
00:03:01.260 | to evaluate the model. So that was the technical answer, but you also have a non-technical answer to
00:03:07.500 | that. Here is a report from A16Z. And the question is, why do enterprises care about open source? You can
00:03:14.780 | see that the two main items are actually control and customizability. And customizability is mostly
00:03:21.020 | about fine-tuning models. So even if there's like arguments about the technical side and cost and
00:03:26.940 | latency, there's also a strong argument for customizability and control over these models.
00:03:30.940 | So in terms of fine-tuning libraries, I think that you know about Unsloth now,
00:03:35.420 | but I'm going to talk about the other ones. So TRL from Hugging Face, a great library built on top of
00:03:42.060 | transformers, very easy to use. You have Axolotl, excellent library, very versatile. You have a lot
00:03:48.940 | of YAML config files. And then you have LLaMA-Factory, which has a really good graphical user interface built in.
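
As a rough illustration of how simple the TRL route can be, here is a minimal SFT sketch. The model and dataset names are just examples, and the exact argument names have moved around between TRL versions:

```python
# Minimal SFT sketch with TRL; names are examples, arguments vary by TRL version.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# An instruction dataset with a pre-formatted "text" column (prompt + answer).
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",   # base model to fine-tune
    train_dataset=dataset,
    dataset_text_field="text",            # column holding the formatted samples
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="sft-output",
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        num_train_epochs=1,
    ),
)
trainer.train()
```
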
00:03:56.780 | So to talk a bit more about supervised fine-tuning, here you see an example of a
00:04:03.260 | sample that we give to the model. So we have the instruction, which is both the system prompt and
00:04:09.820 | the user prompt, and the answer, which is the output. So in this case, the system prompt is used to steer
00:04:17.260 | the behavior of the model, think like you're answering to a five-year-old. And the user actually gives the
00:04:23.260 | task: remove the spaces from the following sentence. We generally train the model on the
00:04:29.980 | output only. So we mask the rest. It's used as context. And what we want to do is train the model
00:04:36.300 | to output the correct answer. Most SFT datasets, I would say, use synthetic data,
00:04:41.660 | and that's perfectly fine. Usually it's generated with frontier models, and that's a great way of
00:04:46.460 | building higher-quality datasets.
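
To make the "train on the output only" part concrete: in TRL this masking can be done with a completion-only collator. A minimal sketch, assuming your samples mark the answer with a "### Answer:" template (both the template and the tokenizer are placeholders):

```python
# Sketch of output-only training: prompt tokens are masked out of the loss.
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Everything before this template is treated as context: its labels are set to -100,
# so the loss is only computed on the answer tokens.
collator = DataCollatorForCompletionOnlyLM(response_template="### Answer:", tokenizer=tokenizer)

# Then pass it to the trainer, e.g. SFTTrainer(..., data_collator=collator, packing=False).
```
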
00:04:52.460 | Then you have preference alignment. I'm just going to mention it here. There are a lot of different methods: PPO, DPO, KTO, IPO. In practice, direct preference optimization
00:04:59.180 | is probably the most popular one. So here you see that you have a different format with an instruction,
00:05:05.500 | and you have a chosen answer and a rejected answer. So the idea here is that you're going to show like a
00:05:11.420 | positive example, negative example to the model. And with DPO, the goal is to make sure that the model
00:05:17.500 | that you're currently training outputs higher probabilities for the chosen answers than the
00:05:22.300 | untrained version of the same model. I'm not going to delve too much into the details here, but this is the
00:05:27.340 | general idea. It can be used to either censor the model: for "how to make a bomb", the chosen answer would
00:05:32.940 | be "As an AI system, I cannot tell you that." Or it can also be used to boost the performance of the model
00:05:38.780 | in general.
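
A minimal DPO sketch with TRL looks something like this; the model name and the preference file are placeholders, and argument names differ a bit between TRL versions:

```python
# Sketch of DPO training; the dataset must have "prompt", "chosen" and "rejected" columns.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "my-sft-model"  # placeholder: the model that came out of the SFT stage
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,                       # TRL keeps a frozen copy as the untrained reference
    args=DPOConfig(output_dir="dpo-output", beta=0.1),  # beta controls divergence from the reference
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```
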
00:05:45.100 | How to create SFT datasets? So this is a very fundamental question in the post-training world. And the main question is, okay, what's a good sample? Human evaluation is
00:05:52.380 | quite bad at actually reviewing the samples. But what I like to define is like three main features.
00:05:59.180 | The first one is the accuracy. We want the samples, the outputs to be factually correct. Maybe no typos
00:06:07.180 | would be good too. We don't want to compromise the knowledge of the model by giving it fake information.
00:06:14.060 | Then you have diversity. And diversity, you want to cover as many topics as you can. Of course,
00:06:20.620 | it depends on your use case. Because if you do summarization, you won't be as general as if you
00:06:25.980 | do general purpose fine tuning. But it's always a good idea to include a lot of different topics,
00:06:33.420 | different writing styles in this dataset. And finally, you have complexity. I think this one is a bit less
00:06:41.020 | trivial. And it's about giving complex tasks to the model, forcing reasoning. So for example, the output will
00:06:48.620 | have chain of thought reasoning because you want to train the model to have this kind of reasoning.
00:06:54.540 | Or it can be tasks like summarization, or "explain it like I'm five years old". This kind of task really
00:07:00.620 | forces the model to not only answer the question like a QA with answers you could find on Wikipedia. It also
00:07:09.100 | forces it to reason over the prompt and give a more complex answer. So as a little recipe you can see here,
00:07:17.820 | I would recommend in general starting with open source datasets if you can combine some of them.
00:07:23.900 | Then you can apply different filters. The first one is data deduplication. It can be either exact,
00:07:29.580 | because you want to remove duplicates. It can be fuzzy. So same idea. And then you have data quality
00:07:37.740 | filters. Here you have different techniques. It can be rule-based filtering. For example, you want to remove
00:07:43.100 | every single row where you have "as an AI assistant, I cannot..." because people hate it. But you can also use more clever
00:07:50.860 | techniques like reward models or LLM as a judge to evaluate the quality of each sample and filter out the bad samples.
00:07:58.860 | And then you can use data exploration with different tools like Lilac, Nomic Atlas, or text clustering, to have
00:08:04.220 | topic clustering, to visualize your dataset, to get ideas on how to improve it. And with these ideas, you can
00:08:11.820 | go back to data generation and start the process all over again.
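
As a toy illustration of the deduplication and rule-based filtering steps just described, here is a hypothetical sketch with the Hugging Face datasets library; the file name and the "output" column are placeholders:

```python
# Hypothetical sketch: exact deduplication plus a simple rule-based quality filter.
import hashlib
from datasets import load_dataset

dataset = load_dataset("json", data_files="sft_samples.jsonl", split="train")

seen = set()
def is_new(example):
    # Exact deduplication: hash the normalized output text and drop repeats.
    key = hashlib.sha256(example["output"].strip().lower().encode()).hexdigest()
    if key in seen:
        return False
    seen.add(key)
    return True

dataset = dataset.filter(is_new)

# Rule-based filtering: drop the refusals that people hate.
dataset = dataset.filter(lambda ex: "as an ai" not in ex["output"].lower())
```

Fuzzy deduplication (e.g. MinHash) and reward-model or LLM-as-a-judge filters follow the same pattern, just with a smarter scoring function.
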
00:08:17.900 | In terms of SFT techniques, we have three main techniques. Full fine-tuning is the most basic one. You take the base model and you
00:08:24.220 | just train it on the instruction dataset. It has the best performance, but it's also very inefficient in general. A
00:08:31.020 | more efficient way of doing it is LoRA. With LoRA, you are going to freeze all the pre-trained weights and you add
00:08:38.780 | adapters to each targeted layer. These matrices A and B are these adapters. So you don't train on all the parameters of the base model.
00:08:50.220 | You only train a subset of them. So this is a lot faster. But it can still be costly because you're still loading
00:08:58.460 | the entire model in 16-bit precision here. So a more efficient way is to quantize the pre-trained model
00:09:07.660 | here in 4-bit precision. This is QLoRA. And you apply the same idea that you had with LoRA, but this time
00:09:15.180 | the weights are heavily quantized. So you have lower VRAM usage. The problem is that it also degrades
00:09:20.540 | performance. So there's a trade-off here.
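
For reference, here is roughly what that QLoRA setup looks like with transformers, bitsandbytes and PEFT; the model name, rank and target modules are illustrative:

```python
# Sketch of QLoRA: frozen 4-bit base weights plus trainable LoRA adapters (matrices A and B).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                   # quantize the frozen pre-trained weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config
)

lora_config = LoraConfig(
    r=16,                                # rank of the adapter matrices
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # which layers get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the adapters are trainable
```
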
00:09:26.860 | I want to briefly mention some hyperparameters, but Daniel already talked about a lot of them, so I'm going to be brief. I think the most important one is the
00:09:32.060 | learning rate. The learning rate is model dependent. It requires a few experiments to be able to
00:09:38.220 | really tweak it and find the best one. Generally, I would recommend to go as high as you can until your loss
00:09:44.780 | explodes like in this graph. Then you can reduce the learning rate. Another super important
00:09:52.460 | hyperparameter is the number of epochs. I would say that depending on the size of the dataset, you can have
00:09:57.740 | more or fewer epochs. Sequence length is also important because it's a trade-off with the batch size:
00:10:05.180 | the longer the sequence length, so the bigger the context window, the more VRAM you're going to use.
00:10:11.100 | But you don't need to use a sequence length that's as big as the pre-trained model. Then you have the
00:10:17.900 | batch size. You want to maximize it to maximize the utilization of your GPUs. And then you have the
00:10:24.540 | LoRA rank. This is quite easy to tune, so I don't want to go into the details here.
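
To put those hyperparameters in one place, a hedged sketch with transformers' TrainingArguments; the values are illustrative starting points, not recommendations:

```python
# Illustrative hyperparameters; good values depend on the model, the dataset and the GPU budget.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sft-output",
    learning_rate=2e-5,                 # push it up until the loss explodes, then back off
    num_train_epochs=3,                 # fewer epochs for big datasets, more for small ones
    per_device_train_batch_size=8,      # as large as your VRAM allows
    gradient_accumulation_steps=4,      # simulates a bigger batch when VRAM is tight
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)
# Sequence length is set on the trainer side (e.g. max_seq_length in TRL's SFTTrainer);
# it trades off against batch size because longer contexts use more VRAM.
```
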
00:10:30.940 | Let's talk about model merging now. Model merging is the idea that you can take the weights of different
00:10:38.940 | fine-tuned models and you can combine them together so you can just leverage what the open source community
00:10:48.620 | has produced on the Hugging Face Hub, for example. It doesn't require any GPUs, so it's super efficient
00:10:55.660 | and it provides excellent results. So the Open LLM Leaderboard was updated this morning. So we have a
00:11:03.020 | version two now, but this is the version one. I haven't had time to update it. But you can see that for
00:11:09.660 | 7B-parameter models, the entire top eight or top ten is just merged models. So it really shows that
00:11:18.300 | this approach is extremely effective at producing high-quality models. And you can find similar results
00:11:24.220 | on really a lot of different datasets. I would recommend using MergeKit. This is the leading
00:11:29.740 | library in this space with a lot of different techniques that are implemented there. So here
00:11:36.220 | you can see the family tree of merged models. So you don't really need to see the name of the model,
00:11:44.460 | but you see that every node is actually a model. And we actually merge different merges together until
00:11:51.020 | it becomes like a giant family tree. This one is actually quite small. It can get a lot crazier than
00:11:56.620 | that, but it didn't fit on one slide. So I chose this one instead. About the merge techniques
00:12:02.940 | themselves, I want to mention a few of them. The first one is called SLERP. It stands for Spherical
00:12:10.300 | Linear Interpolation. The idea is really to apply spherical rather than plain linear interpolation to the weights
00:12:17.660 | of different models. You can only merge two models at the same time with this technique, but you can really
00:12:23.660 | tweak it with different interpolation factors for different layers. Here's a model that I've made,
00:12:29.980 | NeuralBeagle14-7B, which was a really efficient way of leveraging the different models that were created by the open source community.
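
A SLERP merge like this is typically described in a small MergeKit config. Here is a hypothetical sketch written from Python; the model names are placeholders and the interpolation factors are illustrative:

```python
# Hypothetical SLERP config for MergeKit; run the result with: mergekit-yaml slerp.yaml ./merged-model
import yaml

slerp_config = {
    "merge_method": "slerp",
    "base_model": "org/model-a-7b",
    "slices": [{
        "sources": [
            {"model": "org/model-a-7b", "layer_range": [0, 32]},
            {"model": "org/model-b-7b", "layer_range": [0, 32]},
        ],
    }],
    "parameters": {
        "t": [
            {"filter": "self_attn", "value": [0, 0.5, 0.3, 0.7, 1]},  # per-layer interpolation factors
            {"filter": "mlp", "value": [1, 0.5, 0.7, 0.3, 0]},
            {"value": 0.5},                                           # default factor everywhere else
        ],
    },
    "dtype": "bfloat16",
}

with open("slerp.yaml", "w") as f:
    yaml.safe_dump(slerp_config, f)
```
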
00:12:39.740 | And then you have DARE. With DARE, you want to reduce the redundancy of the
00:12:46.300 | model parameters. To do that, you're going to use pruning. You're going to select the most significant parameters in your
00:12:52.300 | model weights, and you're going to rescale the weights of these source models. The advantage that
00:12:59.500 | it has is that you can merge different models, not just two, but even more together. And I would
00:13:06.460 | recommend this technique not with just two or three models, but with like seven or eight models.
00:13:12.140 | It works really, really well. So I strongly recommend that.
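
A DARE merge of several fine-tunes on top of a shared base model can be sketched the same way; everything below is hypothetical (model names, densities and weights are placeholders):

```python
# Hypothetical DARE-TIES config for MergeKit: several fine-tunes merged onto one base model.
import yaml

dare_config = {
    "merge_method": "dare_ties",
    "base_model": "org/base-7b",
    "models": [
        {"model": "org/base-7b"},  # the base model itself takes no parameters
        {"model": "org/finetune-chat-7b", "parameters": {"density": 0.5, "weight": 0.4}},
        {"model": "org/finetune-code-7b", "parameters": {"density": 0.5, "weight": 0.3}},
        {"model": "org/finetune-math-7b", "parameters": {"density": 0.5, "weight": 0.3}},
    ],
    "dtype": "bfloat16",
}

with open("dare.yaml", "w") as f:
    yaml.safe_dump(dare_config, f)
```

Here the density is the fraction of significant parameters kept by the pruning step, and the weight rescales each source model's contribution.
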
00:13:20.300 | Then you have a very funny technique called pass-through. And in pass-through, you can concatenate layers from different LLMs. It can also be the same
00:13:27.660 | one. We call it self-merge. And so here you have an example that I've made recently. It's called
00:13:33.980 | Meta-Llama-3-120B-Instruct, because I took Llama 3 70B Instruct and I just repeated 10 layers six times.
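
A self-merge like that is just a passthrough config that stacks overlapping slices of the same model; a hypothetical sketch (the model name and layer ranges below are illustrative, not the ones used for the 120B model):

```python
# Hypothetical passthrough (self-merge) config: overlapping layer slices of one model are stacked.
import yaml

passthrough_config = {
    "merge_method": "passthrough",
    "slices": [
        {"sources": [{"model": "org/model-70b-instruct", "layer_range": [0, 40]}]},
        {"sources": [{"model": "org/model-70b-instruct", "layer_range": [20, 60]}]},
        {"sources": [{"model": "org/model-70b-instruct", "layer_range": [40, 80]}]},
    ],
    "dtype": "bfloat16",
}

with open("passthrough.yaml", "w") as f:
    yaml.safe_dump(passthrough_config, f)
```
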
00:13:42.940 | So you could say, like, this shouldn't work at all. Like, come on. You haven't even trained the model.
00:13:49.100 | This is ridiculous. Actually, yeah, this is ridiculous. People loved it on Twitter and Reddit and
00:13:54.860 | online in general. So it shows that there's a lot of things that we can still discover with these merge
00:14:03.820 | techniques, with these models. They can be counterintuitive sometimes. And you can see
00:14:10.140 | that this model in particular was particularly good at creative writing. It was also quite unhinged in
00:14:16.860 | general, but really good at creative writing. And now it's being used by a lot of people, even though it's super big.
00:14:23.580 | But no kind of fine-tuning at all? No, no fine-tuning. Nothing. And then I want to mention the last
00:14:32.060 | technique, which is called Mixture of Experts. So in traditional Mixture of Experts, you are going to
00:14:38.460 | pre-train a model with a router -- you can see on the bottom here -- and different feed-forward network
00:14:45.980 | layers. And you pre-train it from scratch. But you can do something quite smart with merging,
00:14:53.420 | where you extract the feed-forward network layers from different fine-tuned models. And you combine
00:14:58.140 | them together like this. So we call this a frankenMoE. You add a router, you combine the FFN layers from
00:15:05.340 | different models. And this is how you create a mixture of experts. It's actually pretty cool. It works
00:15:14.700 | pretty well in practice. You can see on the left a MergeKit config for the Beyonder model. So for this
00:15:23.180 | model, I selected four different fine-tuned models. One as a chat model, one as a code model, one as a
00:15:30.380 | role-play model, and one as a math model. You can see that I'm using positive prompts here. So actually,
00:15:36.220 | it's a way to initialize the router. Because if you go back to the previous slide, we can see that the
00:15:43.340 | router is supposed to select for each token and each layer where -- like which feed-forward network layer
00:15:50.700 | is going to be used. We use two in general. And so how do we initialize it? If we do not fine-tune it,
00:15:57.980 | once again, we don't want to fine-tune it. We can, but we don't necessarily want to. In this case, we're just going to use
00:16:04.140 | these positive prompts, calculate their embeddings, and use these embeddings to initialize the routers.
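
That router initialization is exactly what the positive prompts in a mergekit-moe config are for. A hypothetical sketch in the spirit of the Beyonder config (all model names and prompts are placeholders):

```python
# Hypothetical frankenMoE config for mergekit-moe: four experts whose routers are
# initialized from the embeddings of the positive prompts.
import yaml

moe_config = {
    "base_model": "org/chat-model-7b",
    "gate_mode": "hidden",  # embed the positive prompts with the model's hidden states
    "dtype": "bfloat16",
    "experts": [
        {"source_model": "org/chat-model-7b",
         "positive_prompts": ["chat", "assistant", "tell me", "explain"]},
        {"source_model": "org/code-model-7b",
         "positive_prompts": ["code", "python", "programming", "algorithm"]},
        {"source_model": "org/roleplay-model-7b",
         "positive_prompts": ["storywriting", "character", "roleplay", "narrative"]},
        {"source_model": "org/math-model-7b",
         "positive_prompts": ["reason", "math", "solve", "count"]},
    ],
}

with open("moe.yaml", "w") as f:
    yaml.safe_dump(moe_config, f)
# Built with: mergekit-moe moe.yaml ./merged-moe
```
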
00:16:09.980 | And that works really, really well. So those are two models that I've used. For Phixtral, I had to
00:16:14.460 | modify it to make it compatible with Phi-2. And that outperformed the base model on a lot of tasks. So it's
00:16:22.780 | really a good technique to use in general. But I would say that if you compare it to merging, as we saw with
00:16:32.300 | SLERP and with DARE, I would say that if you want to increase the performance, it's better to use
00:16:40.140 | SLERP and DARE instead of a mixture of experts, because this is a bit more experimental. This
00:16:47.100 | will not bring you the same level of performance. And here you can see the results of the Beyonder model.
00:16:55.260 | You can see that the other models I'm comparing to are the source models that I've used in this
00:17:01.100 | merge. So it's quite remarkable to see that it's actually performing better than the source models
00:17:09.180 | on a lot of different benchmarks. So, yeah, that's it for me. Thank you for your attention. If you are
00:17:18.300 | interested in knowing more, if you want notebooks to run some code, I created the large language model
00:17:24.940 | course. All these notebooks are available on GitHub in the LLM course repository. And yeah, thank you.
00:17:30.700 | Thank you.