
Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han


Chapters

0:00 Introduction
0:32 Overview
2:15 Double BOS Tokens
3:00 Untrained Tokens
8:38 Free Fine Tuning
10:40 Colab Notebook

Transcript

Hello, everyone. So today I'm going to talk about how to fix bugs in open source models. Thanks for coming again; we had a talk yesterday, the three-hour workshop. We have slides at tinyurl.com/unsloth2, and the workshop slides from yesterday are also still accessible at tinyurl.com/unsloth.

You might know me from the Gemma bug fixes that we did. Gemma was an open source model by Google, and there were a few bugs in there, and we fixed a few of them. We did some tweets about this; there were many bugs, for example the activation function had to be approximate GELU, not exact GELU, and there were some other issues that we talked about for Gemma.

We also have a few stickers, which you can get from us when we're outside; we won't be handing them out during the talk, but they're very cool and cute. And there are also tokenization problems in language models, which we also helped to fix.

Today I'm just going to be talking about Llama 3 bugs. Yesterday I talked about Gemma and Phi-3, and today we're sharing all the stuff that we found with Llama 3. For Gemma, you can access all the bug fixes we did in our blog post, and we have a Colab notebook for all of the Gemma bug fixes as well.

For Phi-3, for example, we talked about this yesterday, and I just pasted some slides again if you want to review them in your own time. For example, the sliding window should be 2048, not 2047, and you should unfuse the QKV matrices, otherwise LoRA fine-tuning will not work that well.

But we'll be talking mainly about Llama 3. There are actually eight bugs in Llama 3. Some of them are not announced yet; we will be announcing those later, so this is like a pre-release. We'll be going through each of them separately. The first one is that you must not use double BOS tokens.

This is actually a very common theme in fine-tuning Llama 3. Some people don't realize that they're adding two beginning-of-sequence (BOS) tokens to the fine-tune, and this will actually ruin your fine-tune by lowering your accuracy at inference time. So please check before you fine-tune whether you're using double BOS tokens.

In Unsloth, we check this automatically and remove the extra BOS token for you. This causes your model to lose accuracy because if you trained on two BOS tokens and you do inference with one, your prompt template will be incorrect. So please check this.

It's not just a Llama 3 problem; other models like Mistral and Gemma have issues like this too, so just be careful. A very easy way to check whether you have double BOS tokens is with apply_chat_template from Hugging Face: in the first case, your chat template must have a BOS token.

Otherwise, it won't add one, and Llama 3 does require a BOS token. In the second case, you end up with two BOS tokens. So please do not add an extra BOS token on top of the chat template.
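As a rough sketch of that check (the model name and message are just illustrative, and this is not Unsloth's code):

```python
from transformers import AutoTokenizer

# Illustrative check for double BOS tokens with a Llama 3 style tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# The chat template already inserts <|begin_of_text|>, so tokenizing the result
# with add_special_tokens=True prepends a second BOS token.
good = tokenizer(text, add_special_tokens=False).input_ids
bad = tokenizer(text, add_special_tokens=True).input_ids
print(good[:2])  # starts with a single BOS id
print(bad[:2])   # starts with two BOS ids
```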

The second issue we found is that you must not use the Llama 3 base model with the Llama 3 Instruct chat template. There are some untrained tokens in Llama 3 base; the Instruct version actually has these tokens trained. So please be careful when you want to use the Llama 3 base model for your fine-tuning, because some of these tokens will cause NaNs in your gradients. These tokens include the reserved special tokens from 0 to 250, the end-of-turn token, the start header, and the end header.

The graph I showed compares the mean of these embeddings with the other tokens, and some of them are actually zero. The Llama 3 team purposefully set some of these token embeddings to zero because the tokens are not actually used by the model. So please don't use those tokens when you do fine-tuning either.

If you want to fix them, set them to the mean of the trained token embeddings, and in Unsloth we do this automatically for you as well. We showed some code where you take the mean of the trained tokens and assign it to the untrained tokens. Just be careful not to do this incorrectly.

If you want to take the average, don't just average over all the tokens; you must exclude the untrained tokens from the average. If you don't, your average will be wrong: if there are, say, 10,000 untrained tokens and you divide by 10,000 plus the number of trained tokens, the zero rows will drag your average down.
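A minimal sketch of that masked mean, assuming a Hugging Face model is already loaded as `model` (illustrative only, not Unsloth's actual implementation):

```python
# Treat embedding rows that are all zeros as untrained, and set them to the
# mean computed over the trained rows only.
embed = model.get_input_embeddings().weight.data
lm_head = model.get_output_embeddings().weight.data

trained_mask = embed.abs().sum(dim=1) > 0   # True for rows that carry real values

embed[~trained_mask] = embed[trained_mask].mean(dim=0)
lm_head[~trained_mask] = lm_head[trained_mask].mean(dim=0)
```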

So you have to do the slightly more complicated thing of masking out the untrained tokens and then taking the average. Also, a reminder: because of this issue, the Llama 3 Instruct chat template will not work on the base model. I have seen many people use the Llama 3 Instruct chat template with the base model, and their fine-tunes end up incorrect.

You will get NaNs in your gradients and your whole fine-tune will be broken. So please do not use the Llama 3 Instruct chat template with the Llama 3 base model; only use it for the Instruct model itself. Another way to fix this is to actually train the LM head and the embed tokens, which will learn those embeddings and remove the NaNs from your model.

Another interesting fact, and not just a Llama 3 problem but one for other models too, is that the pad token and the EOS token must not be the same. If they are the same, your model will generate infinitely. The reason is that the pad token gets masked out in the cross-entropy loss, so if you use the same token for padding and EOS, your EOS (end-of-sequence) token gets masked out too and the model never learns when to stop.

So just be very careful when you fine-tune to check what the pad token ID and the EOS token ID are. For example, if you look at Phi-3, they're the same, so technically when you fine-tune Phi-3 you will get infinite generations. Before you fine-tune, check what the EOS token is and what the pad token is.
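A quick, hedged way to do that check (the model name is a placeholder for whatever you plan to fine-tune):

```python
from transformers import AutoTokenizer

model_name = "your-org/your-model"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("PAD:", tokenizer.pad_token, tokenizer.pad_token_id)
print("EOS:", tokenizer.eos_token, tokenizer.eos_token_id)
assert tokenizer.pad_token_id != tokenizer.eos_token_id, "pad and EOS must differ"
```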

They must be different. For Unsloth, we also fix this automatically: we essentially check whether there are any untrained reserved tokens and select one of those as the pad token. If there are no untrained tokens, we add an extra pad token ourselves. Be careful not to add a pad token whose string already exists in your current vocabulary.

So what we do is check the tokens inside the vocabulary and append extra hash characters until we get a new pad token that isn't already in it. Another issue we found for people fine-tuning is that when you finish your fine-tune, you don't actually know how to export it to Ollama.

That is because the chat template for Ollama must be exactly the same as the one used in your fine-tune, and this was actually very complicated to do before. Now we can automatically generate the Modelfile for you during the fine-tune. So we have two Colab notebooks for you to use with Ollama.

One of them uses the Alpaca dataset, and in the other you can upload a CSV file to make Ollama work after you finish fine-tuning. Now, there are some community contributions for Llama 3 bugs; there are three of them. The first one is that someone noticed you can only use CPU conversion, not GPU conversion, when you convert to GGUF for llama.cpp.

So be careful when you convert for llama.cpp that you use the CPU version. I think the main reason is that the precision is different on a GPU versus a CPU: float16 conversion on the CPU gives different results from float16 conversion on the GPU.

So just be careful of that as well. Another issue: remember we talked about double BOS tokens. Through a community contribution, llama.cpp now has a warning to tell you that you're using double BOS tokens. So please take heed of the warning and do not add double BOS tokens to your chat template or when you do inference.

Another point someone found is that adding a system prompt can make fine-tuning much better. Sometimes when you do inference on Llama 3 Instruct, adding an actual system prompt improves the whole fine-tune. I think some people simply never add a system prompt at all, so maybe try your fine-tune with one; you never know, it could work.
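For example, a hedged sketch of what an explicit system message looks like, assuming `tokenizer` is the tokenizer for your Instruct model (the messages are illustrative):

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a BOS token is in one sentence."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # the system block now appears at the top of the prompt
```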

So we have a GitHub package, which is open source, and you can click the "Start Free Fine Tune" button to start your first free fine-tune using Unsloth.

We already pushed all the Llama 3 bug fixes to our GitHub repo, so the Start Free Fine Tune button will redirect you to a Colab notebook with fixes for all of these issues. Feel free to star us as well. We also have a Discord channel, so if you have any questions, you can ask anything you like about how to do fine-tuning, talk about AI, and talk about our bug fixes as well.

We also have blog posts about all our fixes, covering Gemma, Llama 3, Phi-3, and more. For example, we talked about continued pre-training. You can do continued pre-training using Unsloth now, training the LM head and the embed tokens. And we show that instead of training them at the normal rate, you need to reduce the learning rate of the LM head and the embed tokens by a factor of about 5 to 10, which will make your training much better.
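A hedged sketch of that idea using plain PyTorch parameter groups (this is not Unsloth's internal code; it assumes `model` is already loaded):

```python
import torch

base_lr = 2e-4
embed_params, other_params = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    # Put the embedding and LM head weights in their own, slower group.
    if "embed_tokens" in name or "lm_head" in name:
        embed_params.append(param)
    else:
        other_params.append(param)

optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": base_lr},
    {"params": embed_params, "lr": base_lr / 10},  # roughly 5-10x smaller
])
```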

We also support four times longer context using Unsloth, and this barely increases the time to completion: we make it about 1 to 2 percent slower, but you get four times longer context.

This is because we use something called offloaded gradient checkpointing, where we offload the checkpointed activations to system RAM. There are some other systems that offload to disk; please do not do that. If you offload to disk, the time to completion of your fine-tune will be extremely slow.

So try to offload to system RAM first, and only fall back to disk after that. Although if you offload incorrectly, you might actually make things slower as well: your offloading must use non-blocking calls, not blocking copies to system RAM. Yeah. So I will show you -- okay.
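A tiny sketch of what a non-blocking offload to pinned system RAM looks like in PyTorch (illustrative only, not our actual checkpointing code):

```python
import torch

activation = torch.randn(4096, 4096, device="cuda")

# Pinned CPU memory lets the device-to-host copy run asynchronously.
cpu_buffer = torch.empty(activation.shape, dtype=activation.dtype,
                         device="cpu", pin_memory=True)
cpu_buffer.copy_(activation, non_blocking=True)   # does not stall the GPU

# Later, during the backward pass, bring it back.
restored = cpu_buffer.to("cuda", non_blocking=True)
```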

Let's see if I can open up -- let me go to -- okay. I'm going to open up the Colab notebook for the Ollama one. So for the Ollama Colab notebook, you can simply install Unsloth over here; this is free for everyone to use. And essentially, don't forget that in the Colab notebook you have to select a max sequence length.

This determines how long a context your model will be fine-tuned on. You can set it to any number you like, but remember, your dataset must match the max sequence length. For example, if you want to set the max sequence length to 10 million or one million, but your dataset only contains about a million tokens or less, try not to set the max sequence length that large.

Otherwise, your model cannot really be fine-tuned for long sequences anyway. load_in_4bit enables 4-bit training, which reduces memory usage by about four times. If you set it to False, your memory usage will explode, so please do not set it to False, especially on a free Colab Tesla T4.

If you set it to False, your memory usage might skyrocket to 16 GB, so do not do that; only do it if you have stronger GPUs. Unsloth supports fine-tuning of models including Llama, Mistral, Gemma, Phi-3, and more. So for the model name over here, you can actually select any model name that you like.
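Putting those pieces together, the load step looks roughly like this (the model name and numbers are placeholders; check the notebook for the exact call):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",  # any Hugging Face model name should work
    max_seq_length = 2048,   # should cover the longest sequence in your dataset
    load_in_4bit = True,     # 4-bit loading keeps memory usage roughly 4x lower
)
```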

I don't think people know that Unsloth supports models other than the ones we listed, so try putting any Hugging Face model name in there and it should work. Next, get_peft_model is where you add the PEFT LoRA adapters.

The r is the rank. We set it to 16, but you can select any number you like for fine-tuning. We normally suggest powers of two, but you can use any number: one, two, three, whatever you like. The larger the rank, the more the model can learn about your dataset.

But if you pick too large a rank, you might overfit your dataset, and your memory usage might skyrocket again. We normally suggest 16, 32, 64, or 128; try not to select ranks that are too large. The maximum rank you should select is the hidden dimension of the model itself.

So if it's 4096, the cap is 4096. For the target modules, be careful: you must fine-tune all the linear layers, so Q, K, V, O, gate, up, and down. Some people have fine-tuned without including some of these layers; please do not do that.

This will make your fine-tune suboptimal. And for the LoRA alpha, there is actually a trick: normally, select the alpha to be the same as the rank or larger. We found that using the rank times 2, so 16 times 2 here, can make your fine-tuning much better.

You can also set use_rslora to True to have the alpha scaling set automatically for you. For gradient checkpointing, "unsloth" is the method we showed for long context fine-tuning; you can also set this option to True instead, but your memory usage will increase again.
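Continuing from the load step above, a rough sketch of that LoRA setup (values mirror the notebook defaults as I understand them; treat the exact keyword names as assumptions and check the notebook):

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                                   # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],  # all linear layers
    lora_alpha = 32,                          # rank * 2 often works well
    use_rslora = False,                       # or True to scale alpha automatically
    use_gradient_checkpointing = "unsloth",   # offloaded checkpointing for long context
)
```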

We also show you how to do data preparation in the Ollama Colab notebook. In this one we upload a Titanic CSV. The goal of the Titanic dataset is to predict whether a passenger died or survived, given details about the person.

For example, their age, their fare, where they embarked from, and so on. With our new Colab notebooks, you have to be very careful with Ollama chat templates, because when you do fine-tuning you can only have two columns: the instruction and the output. But what happens if your CSV has more than two columns?

What you can do is merge the columns into one column, and with Unsloth you can now actually do that: you can merge all the columns into one.
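A rough, generic sketch of that column merging with the datasets library (the Titanic column names here are assumptions, and this is not the notebook's exact code):

```python
from datasets import load_dataset

dataset = load_dataset("csv", data_files="titanic.csv", split="train")

def merge_columns(row):
    # Collapse several columns into a single instruction string,
    # leaving just two columns: instruction and output.
    instruction = (
        f"Passenger age: {row['Age']}, fare: {row['Fare']}, "
        f"embarked from: {row['Embarked']}. Did this passenger survive?"
    )
    return {"instruction": instruction, "output": str(row["Survived"])}

dataset = dataset.map(merge_columns, remove_columns=dataset.column_names)
```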

We also show that you can do customizable chat templates now. Previously, if you wanted to do an Alpaca-style fine-tune, you had to use instruction, input, and response columns. But remember, the problem is that if you want to export to Ollama or GGUF, you can only have two columns: the instruction and the output. If you use ChatGPT, you type something and the output comes back; you can't have three inputs. So what we do is let you actually customize your chat template.

You must include the input and the output, and you must repeat the template twice. Some people have asked me why you have to do two repetitions of the chat template. It's because of dangling newlines, and the solution we found is that you have to specify two iterations of your chat template.

We also show examples of how to do the Llama 3 chat template using our methodology, so you can see there are two iterations of the chat template. Reminder: if you don't use the two iterations, it will error out. And these are the training settings: we normally suggest a batch size of two and gradient accumulation of four.

Remember, memory usage scales with the batch size, so try not to set the batch size very large, otherwise your memory usage will explode. Instead, set your gradient accumulation steps to be larger. The formula for the effective batch size is the batch size times the gradient accumulation.
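In Hugging Face TrainingArguments terms, a sketch of those settings might look like this (the notebook wires similar values into its own trainer):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,   # effective batch size = 2 * 4 = 8
    learning_rate = 2e-4,              # or smaller, e.g. 2e-5
    output_dir = "outputs",
)
```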

So in this case, it's two times four, which is eight. Set your learning rate to 2e-4 or smaller, maybe 2e-5. After that, you can also do inference on the model, so now you have to use apply_chat_template; remember to be careful of double BOS tokens, but we fixed this in Unsloth.

And finally, you have to save this for Ollama, and you have to install Ollama first. For saving, we now support saving multiple GGUF files, so you don't have to save to just one GGUF file; you can save to multiple.

We actually allow you to do this now; before, if you wanted to save to multiple GGUF files, you had to wait an extra 10 minutes. You can now do this automatically by specifying more than one format. We also show you the Modelfile which we created.
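From memory, the GGUF export helper in Unsloth looks something like the call below; treat the method name, directory, and format string as assumptions and check the notebook for the exact usage:

```python
model.save_pretrained_gguf(
    "llama3_finetune_gguf",           # output directory (hypothetical name)
    tokenizer,
    quantization_method = "q4_k_m",   # choose the quantization format you want
)
```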

So you can copy-paste the Modelfile and use it with a custom Ollama setup as well. The Modelfile was the complicated part to generate automatically, so we have internal code that generates it for you. And finally, when you want to do inference, you can use Ollama.

And, you know, it works in general, so try that out. The Ollama chat template notebook is in the slides at tinyurl.com/unsloth2, and remember, the workshop slides from yesterday are at tinyurl.com/unsloth. And don't forget to join our Discord channel. If you have any questions, I'm outside; you can ask questions and stuff like that.

And, yes, thanks for coming. I much appreciate it. Thanks a lot. I'll see you next time.