Fixing bugs in Gemma, Llama, & Phi 3: Daniel Han

Chapters
0:00 Introduction
0:32 Overview
2:15 Double BOS Tokens
3:00 Untrained Tokens
8:38 Free Fine Tuning
10:40 Colab Notebook
00:00:00.000 |
Hello, everyone. So today I'm going to talk about how to fix bugs in open source models. 00:00:18.800 |
Thanks for coming again. We had a talk yesterday, the three-hour workshop, and thanks for coming 00:00:24.880 |
again. So we have slides. It's at tinyurl.com/unsloth2. For the workshop slides, which we did yesterday, 00:00:33.280 |
you can also access that now, at tinyurl.com/unsloth as well. So you might know me from, like, the 00:00:41.060 |
Gemma bug fixes that we did. So Gemma was an open source model by Google, and there were 00:00:44.920 |
a few bugs in there, and we fixed a few of them. And we just did some tweets about this, 00:00:51.060 |
and, you know, like, there's many bugs in there, like, you know, the activation function 00:00:54.460 |
had to be approximate GELU, not exact GELU, and there are some other issues that we talked about 00:01:00.620 |
for Gemma. We also have, like, a few stickers, which you can, you know, get from us 00:01:05.440 |
when we're outside. But yeah, we won't be handing them out during the 00:01:10.160 |
talk. But, yeah, they're very cool and cute. And also there's, like, tokenization problems 00:01:15.520 |
as well in language models, which we also help to fix. Today I'm just going to be talking 00:01:20.860 |
about LLAMA 3 bugs. So yesterday I talked about Gemma and Phi-3, and today we're just sharing 00:01:25.860 |
all the stuff that we found with LLAMA 3. For Gemma, you can access, like, all the bug fixes 00:01:31.160 |
that we did in our blog post, and we have a Colab notebook for all of the Gemma bug fixes as 00:01:36.520 |
well. For Phi-3, for example, we talked about this yesterday, and I just pasted some slides 00:01:44.200 |
again if you want to, like, review this in your own time. For example, like, the sliding 00:01:48.320 |
window should be 2048, not 2047. You should also, like, unfuse all of the QKV matrices, 00:01:56.820 |
otherwise LoRA fine tuning will not work that well. But we'll be talking mainly about LLAMA 3. 00:02:03.360 |
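As a quick aside, you can inspect the Phi-3 sliding window value mentioned above yourself. A minimal sketch, assuming the public Phi-3 mini checkpoint is the one the slide refers to:

```python
from transformers import AutoConfig

# Assumed model ID; trust_remote_code helps on older transformers versions.
config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
print(config.sliding_window)  # should be 2048, not 2047
```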
So there's actually eight bugs in LLAMA 3. Some of them are not announced yet. We will 00:02:07.640 |
be announcing these later. So this is, like, a pre-release. And we'll be going through each 00:02:12.920 |
of them separately. The first one is you must not use double BOS tokens. So this is actually 00:02:19.880 |
a very common theme in fine tuning LLAMA 3. Some people don't actually know that they're adding 00:02:24.240 |
two beginning of sentence tokens to the fine tune. And this will actually ruin your fine tune 00:02:28.980 |
by lowering your accuracy at inference time. So please, like, check before you fine tune if you're 00:02:34.260 |
using double BOS tokens. In Unsloth, we check this automatically and we'll remove the 00:02:39.740 |
extra BOS token automatically for you. So this will actually cause your model to lose accuracy because 00:02:45.260 |
if you trained on two BOS tokens and you do inference on one, then your model template will be incorrect. 00:02:50.700 |
So please check this. It's not just a LLAMA 3 problem. Other models like Mistral 00:02:55.980 |
and Gemma also have problems like this. So just be careful of this issue. 00:03:01.100 |
So a very easy way to check if you have double BOS tokens, if you use the apply chat template 00:03:06.380 |
from Hugging Face, if you do the first one, your chat template must have a BOS token. Otherwise, 00:03:12.380 |
it won't add it. LLAMA 3 does require a BOS token. If you do the second one, you actually end up with 00:03:18.620 |
two BOS tokens. So please do not add an extra BOS token on top of the chat template. 00:03:24.380 |
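To make the double BOS issue concrete, here is a minimal sketch of the check, assuming a Llama 3 style tokenizer whose chat template already contains the BOS token (the model name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "Hello!"}]

# The chat template already inserts the BOS token (<|begin_of_text|>).
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Tokenizing that text with add_special_tokens=True prepends a SECOND BOS token.
ids_double = tokenizer(text, add_special_tokens=True).input_ids
ids_single = tokenizer(text, add_special_tokens=False).input_ids

bos = tokenizer.bos_token_id
print("double BOS:", ids_double[:2] == [bos, bos])  # True means the fine tune would be broken
print("single BOS:", ids_single.count(bos) == 1)    # this is what you want
```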
The second issue we found was you must not use the LLAMA 3 base model if you're using the LLAMA 3 template. 00:03:31.660 |
There are some untrained tokens in LLAMA 3 base. The Instruct version actually has these tokens trained. 00:03:38.460 |
So please be careful when you want to use the LLAMA 3 base model when you want to do your fine tuning 00:03:43.340 |
because some of these tokens will cause NaNs in your gradients. These tokens include the reserved 00:03:47.580 |
special tokens from 0 to 250, the end of turn token, the start header, and the end of header. 00:03:55.020 |
And the graph I showed shows you the mean of the embeddings versus the other tokens. And some of 00:04:01.100 |
them actually are zero. So the LLAMA 3 team made some of these tokens go to zero purposefully because 00:04:06.940 |
these tokens are not actually used for the model. So just please don't use some of these tokens when 00:04:12.540 |
you do fine tuning as well. If you want to fix them, set them to the mean of all the trained 00:04:17.420 |
tokens. And in Unsloth, we do this automatically as well for you. So we showed some code where you 00:04:23.500 |
can take the mean of the trained tokens and set them for the untrained tokens. Just be careful, 00:04:29.740 |
don't do this incorrectly either. If you want to take the average of all the tokens, don't just 00:04:34.700 |
take the average. You must remove the untrained tokens from the average. If you do not do that, 00:04:39.340 |
you might actually have an incorrect average, right? If there's like 10,000 tokens which are untrained, 00:04:43.980 |
if you divide it by 10,000 plus the number of trained tokens, your average will be incorrect. So you 00:04:50.220 |
have to do this more complicated method of masking out the untrained tokens and then take the average. 00:04:55.260 |
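Here is a rough sketch of that fix in plain PyTorch (an illustration of the idea, not Unsloth's actual implementation): detect the zeroed-out rows, then average only over the trained rows.

```python
import torch

@torch.no_grad()
def fix_untrained_tokens(model, eps: float = 1e-16):
    embeddings = model.get_input_embeddings().weight    # (vocab_size, hidden_dim)
    lm_head    = model.get_output_embeddings().weight   # (vocab_size, hidden_dim)

    # Untrained rows were set to (near) zero on purpose, so detect them by magnitude.
    untrained = embeddings.abs().max(dim=1).values <= eps
    trained   = ~untrained

    # Average over the trained rows ONLY; including the zero rows would
    # drag the mean towards zero and give an incorrect average.
    embeddings[untrained] = embeddings[trained].mean(dim=0)
    lm_head[untrained]    = lm_head[trained].mean(dim=0)
```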
Also, reminder, because of this issue, the LLAMA 3 chat template will not work for the base model. I 00:05:02.940 |
know that many fine tuning people have used the LLAMA 3 Instruct chat template for the base model, and your 00:05:08.940 |
fine tune will actually be incorrect. You will get NaNs in your gradients and your whole fine tune will be 00:05:13.740 |
broken. So please do not use the LLAMA 3 Instruct chat template for the LLAMA 3 base model. Only use this 00:05:19.820 |
for the Instruct model itself. Another way to fix this is to actually train the LM head and the embed tokens, 00:05:25.340 |
which will actually learn and remove the NaNs in your models. Another interesting fact, and not just a LLAMA 3 00:05:33.180 |
problem, but for other models, is the pad token and the EOS token must not be the same. If you do this the 00:05:40.300 |
same, your model will have infinite generations. The reason is that the pad token gets masked out 00:05:46.460 |
during the cross entropy loss, and if you use the same pad token as the EOS token, 00:05:52.540 |
then your EOS token, the end of sentence token, will be masked out. So just be very, very careful when 00:05:57.500 |
you do fine tuning to check what is the pad token ID and the EOS token ID. For example, if you look at 00:06:03.020 |
Phi-3, they're the same. So technically, with Phi-3, when you do fine tuning, you will get infinite generations. So just be 00:06:10.460 |
careful and look, you know, before you do the fine tune, check what is the EOS token and what is the pad 00:06:15.340 |
token. They must be different. For Unsloth, we also do this automatically. We fix this for you. And we 00:06:23.340 |
essentially check if there are any reserved tokens, and we just select one which is untrained. If there 00:06:30.140 |
are no untrained tokens, then we will add an extra pad token ourselves. Be careful, do not add a pad 00:06:35.900 |
token which already exists in your current vocabulary. So what we do is we actually check 00:06:41.820 |
the tokens inside the vocabulary and add, like, extra hashes to make a new pad token. 00:06:47.580 |
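A minimal sketch of that pad-token check (the helper and the token string here are hypothetical, just to illustrate the idea that Unsloth automates):

```python
def ensure_distinct_pad_token(model, tokenizer):
    if tokenizer.pad_token_id is None or tokenizer.pad_token_id == tokenizer.eos_token_id:
        new_pad = "<|pad|>"
        vocab = tokenizer.get_vocab()
        # Keep adding hashes until the pad token does not clash with the existing vocabulary.
        while new_pad in vocab:
            new_pad = new_pad[:-2] + "#|>"
        tokenizer.add_special_tokens({"pad_token": new_pad})
        model.resize_token_embeddings(len(tokenizer))
    assert tokenizer.pad_token_id != tokenizer.eos_token_id
```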
Another issue we found for fine tuning people is, like, when you finish your fine tune, 00:06:54.380 |
you don't actually know how to export it to Ollama. And that is because the chat template for Ollama must 00:06:59.340 |
be exactly the same as your fine tune. And this was actually very complicated to do before. And now we 00:07:04.940 |
can actually automatically generate the model file for you during the fine tune. So we have, like, 00:07:09.740 |
two Colab notebooks for you to use for Ollama. One of them uses the Alpaca dataset, and in the other you 00:07:15.660 |
can upload a CSV file to make Ollama work after you finish fine tuning. Now, there are some community 00:07:23.500 |
contributions for Llama 3 bugs. There are, like, three of them. The first one is someone noticed that you can 00:07:30.300 |
only use CPU conversion and not GPU conversion when you convert to GGUF or llama.cpp. So be -- you know, 00:07:36.780 |
be careful when you convert to llama.cpp that you must use the CPU version. I think the main reason is 00:07:42.300 |
because the precision is different on a GPU than a CPU. Float16 conversion on the CPU is different from 00:07:48.860 |
float16 conversion on the GPU. So just be careful on that as well. 00:07:54.060 |
Another issue is, remember, we talked about the double BOS tokens. Through a community contribution, 00:07:59.020 |
llama.cpp now has a warning to tell you that you're using double BOS tokens. So please, you know, 00:08:06.300 |
take heed of the warning and do not add double BOS tokens to your chat template, nor when you do inference. 00:08:11.420 |
Another point someone found was that adding a system prompt could make fine tuning much better. And so, 00:08:19.580 |
like, sometimes when you do inference on Llama 3 Instruct, if you add an actual system prompt, 00:08:24.540 |
this could make your whole fine tuning better. I think some people actually miss the system 00:08:29.260 |
prompt, like, they don't actually add one. So maybe try your fine 00:08:33.340 |
tune with the system prompt. And you never know, this could work. So we have, like, a GitHub package, 00:08:40.460 |
which is open source. And you can click the button "Start Free Fine Tune" to start your first free 00:08:45.340 |
fine tune using Unsloth. We already pushed all the Llama 3 bug fixes to our GitHub repo. And so, 00:08:50.380 |
the Start Free Fine Tune button will redirect you to a Colab notebook with fixes for all of these issues. 00:08:55.340 |
Feel free to star us as well. We also, like, have a Discord channel. So if you have any questions, 00:09:00.860 |
you can ask, you know, any questions that you like about how to do fine tuning, 00:09:06.140 |
talk about AI, and talk about our bug fixes as well. We also have, like, blog posts about 00:09:12.700 |
all our fixes, about Gemma, Llama 3, Phi-3, and more. For example, we talked about continued pre-training. 00:09:19.740 |
You can do continued pre-training using Unsloth now. You can train on the LM head and the embed tokens. 00:09:25.340 |
And we show that instead of just training like that, you need to reduce the learning rate of the LM head 00:09:30.220 |
and the embed tokens by a factor of 10. Or, you know, maybe 5 to 10. And this will make your training much better. 00:09:35.500 |
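The idea of a smaller learning rate for the embeddings can be sketched with plain PyTorch parameter groups; this is an illustration of the principle, not Unsloth's internal code, and the module names assume the usual Llama naming:

```python
import torch

def embedding_aware_optimizer(model, lr: float = 2e-4, embedding_scale: float = 0.1):
    embed_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # lm_head / embed_tokens get a 5-10x smaller learning rate than the rest.
        (embed_params if ("embed_tokens" in name or "lm_head" in name) else other_params).append(param)
    return torch.optim.AdamW([
        {"params": other_params, "lr": lr},
        {"params": embed_params, "lr": lr * embedding_scale},
    ])
```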
We also support four times longer context using Unsloth. And this barely increases the time to 00:09:42.860 |
completion. We make it 1 to 2 percent slower. But you get four times longer context using Unsloth. 00:09:49.740 |
And this was because, like, we used something called offloading gradient checkpointing, where 00:09:54.540 |
we offload the gradients to system RAM. There are some other systems which offload the gradients to 00:09:59.580 |
the disk. Please do not do that. If you offload to disk, then your time of completion of your fine 00:10:04.300 |
tune will be extremely slow. So try to offload to system RAM first, and then offload to disk. 00:10:09.420 |
Although if you offload incorrectly, you might actually make this slower as well. 00:10:14.460 |
So your offloading must use non-blocking calls. Do not do blocking copies to system RAM. 00:10:19.260 |
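A toy illustration of the difference (not the actual Unsloth kernel): a copy into pinned CPU memory can be issued as a non-blocking call, so the GPU keeps computing while the transfer happens in the background, whereas a blocking copy (or a write to disk) stalls everything.

```python
import torch

def offload_to_cpu(activation: torch.Tensor) -> torch.Tensor:
    # Pinned (page-locked) host memory is required for truly asynchronous copies.
    buffer = torch.empty(activation.shape, dtype=activation.dtype, device="cpu", pin_memory=True)
    buffer.copy_(activation, non_blocking=True)   # overlaps with ongoing GPU work
    return buffer

def bring_back(buffer: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    return buffer.to(device, non_blocking=True)
```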
Yeah. So I will show you -- okay. Let's see if I can open up -- let me go to -- 00:10:31.500 |
Okay. I'm going to open up a Colab Notebook for the Ollama one. 00:10:48.700 |
Okay. So for the Ollama Colab Notebook, you can simply just install Unsloth over here. This is 00:10:55.500 |
already free for everyone to use. And essentially, don't forget when you do the Colab Notebook, 00:11:00.460 |
you have to select a max sequence length. This determines how long a context your model can use for 00:11:05.820 |
long context fine tuning. You can set this to any number that you like. But remember, 00:11:09.980 |
your data set must match the max sequence length. So for example, if you want to set the 00:11:14.380 |
max sequence length to be like 10 million or one million, but your data set is only like one million 00:11:18.940 |
tokens or less, try not to set that max sequence length to be that large. Otherwise, 00:11:23.660 |
your model cannot properly fine tune on long sequences. Load in 4-bit does 4-bit training. So this actually 00:11:29.740 |
reduces memory usage by four times. If you set it to False, your memory usage will explode. 00:11:35.500 |
So please do not try False, especially on a free Colab Tesla T4. If you do False, your memory usage 00:11:42.700 |
might skyrocket to 16 GB. So do not do that. You should only do this if you use stronger GPUs. 00:11:49.260 |
We support -- like Unsloth supports fine tuning of models including Llama, Mistral, Gemma, Phi-3, 00:11:55.980 |
and more. So this area, like the model name over here, you can actually try to select any model name 00:12:01.980 |
that you like. I don't think that people know that Unsloth can support other models other than the ones 00:12:06.460 |
we listed. So please try to put any model, like a Hugging Face model name, in there. And it should work. 00:12:11.900 |
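The loading cell looks roughly like this (parameter names follow the Unsloth notebooks; the model name is just one example you can swap for any Hugging Face model):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = "unsloth/llama-3-8b-bnb-4bit",  # or any other Hugging Face model name
    max_seq_length = 2048,   # pick something that matches your dataset's longest sequences
    load_in_4bit   = True,   # 4-bit loading keeps memory low enough for a free Tesla T4
)
```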
So for the get_peft_model, this is where you add the PEFT LoRA adapters. 00:12:18.620 |
The R is the rank. So we set it to be 16. But you can select any number that you like for fine tuning. 00:12:25.500 |
We normally suggest you use powers of two. But you can use any number, like one, two, three, 00:12:29.580 |
any number that you like. The larger the rank you select, the more the model can learn about 00:12:34.780 |
your data set. But if you use too large a rank, you might actually overfit your data set. 00:12:39.180 |
And also your memory usage might skyrocket again. So we normally suggest people to select 16, 32, 64, 00:12:44.220 |
or 128. Try not to select too large ranks. The maximum rank you should select is the size of the 00:12:50.300 |
dimension of the model itself. So if it's 4096, set this to be 4096. 00:12:55.020 |
For the target modules, be careful. You must do fine tuning on all linear layers, right? So Q, 00:13:03.100 |
K, V, O, down, up, and gate. Some people have done fine tuning without doing some of these layers. 00:13:09.260 |
Please do not do that, because this will cause your fine tune to be suboptimal. 00:13:12.700 |
And the LoRA alpha, there is actually a trick for this. Normally speaking, select the alpha to be the 00:13:18.540 |
same as the rank or larger. We found that if you do 16 times 2, so the rank times 2, this can 00:13:24.700 |
make your fine tuning much better. You can also set use_rslora to true to set the alpha scaling 00:13:30.780 |
automatically for you. For the gradient checkpointing, "unsloth" is the method 00:13:35.580 |
which we showed lets you do long context fine tuning. You can also set this to be true, but your memory usage will increase again. 00:13:41.420 |
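Putting those settings together, the LoRA cell looks roughly like this (parameter names follow the Unsloth notebooks; treat the exact values as starting points rather than requirements):

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                                  # LoRA rank: 16, 32, 64 or 128 are typical choices
    lora_alpha = 32,                         # rank * 2 often fine-tunes better than alpha == rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],  # all linear layers
    use_rslora = False,                      # True lets rank-stabilised LoRA pick the scaling for you
    use_gradient_checkpointing = "unsloth",  # offloaded checkpointing for long context fine tuning
)
```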
We also show you how to do data preparation in the Ollama Colab notebook. In this 00:13:48.060 |
one, we upload a Titanic CSV. So for the Titanic data set, the goal was, can you predict if someone died or 00:13:53.660 |
survived if you're on the Titanic? And you get details about the person. For example, their age, their, like, 00:14:00.860 |
fare, where did they embark from, and so on. With our new Colab notebooks, you have to be very careful 00:14:07.260 |
when you do Ollama chat templates. Because when you do fine tuning, you can only have two columns, 00:14:12.460 |
the instruction and the output. But what happens if your CSV has more than two 00:14:16.940 |
columns, the instruction and output? What you can do is merge the columns into one column. And 00:14:21.980 |
with Unsloth now, you can actually do that. You can merge the columns into one. 00:14:29.260 |
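For illustration, merging the Titanic columns into a single instruction column can be done with plain pandas before handing the data over (this is a sketch, not Unsloth's built-in helper; the file and column names assume the standard Kaggle Titanic CSV):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed file name

# Collapse several feature columns into one instruction column plus one output column.
df["instruction"] = (
    "Predict whether this passenger survived the Titanic. "
    "Age: " + df["Age"].astype(str)
    + ", Fare: " + df["Fare"].astype(str)
    + ", Embarked: " + df["Embarked"].astype(str)
)
df["output"] = df["Survived"].map({0: "Died", 1: "Survived"})

dataset = df[["instruction", "output"]]  # exactly two columns, as the Ollama export expects
```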
And also, we show you that you can do customizable chat templates now. So previously, if you want to 00:14:33.740 |
do an Alpaca-style fine tune, you have to use instruction, input, and response for the Alpaca-style 00:14:38.220 |
fine tuning. But remember, the problem is, if you want to output to Ollama or GGUF, you can only have 00:14:44.140 |
two columns, the instruction and output. Right? If you use ChatGPT, you have to type something and then 00:14:48.940 |
the output comes along. You can't have, like, three inputs, right? So what we do is you can actually 00:14:55.260 |
customize your chat template. And you must include the input and the output. And you must do this 00:15:00.380 |
repetition twice. Some people have asked me, like, why do you have to do two repetitions of this chat 00:15:05.420 |
template? It's because there are dangling newlines. And the solution we found is that 00:15:10.940 |
you have to specify two iterations of your chat template. We also show examples of how to do the 00:15:17.500 |
Llama 3 chat template using our methodology. So you can see there are two iterations of the chat template. 00:15:23.020 |
Reminder, if you don't use the two iterations, it will actually error out. 00:15:27.420 |
And these are the training methodologies. We normally suggest people use a batch size of two, 00:15:33.500 |
gradient accumulation of four. Remember, the memory usage mainly depends on the batch size. So try not to 00:15:39.340 |
set the batch size to be very large, otherwise your memory usage will explode. Instead, set your gradient 00:15:43.740 |
accumulation steps to be larger. So the formula for the effective batch size is batch size times the 00:15:49.340 |
gradient accumulation. So in this case, it's two times four, which is eight. Set your learning rate to 00:15:53.820 |
be 2e-4 or smaller, maybe 2e-5. And after that, you can also do inference on the model. 00:16:03.180 |
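Before moving on to inference, the training hyperparameters described above look roughly like this as a Hugging Face TrainingArguments sketch (effective batch size = 2 x 4 = 8):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size = 2,   # memory mainly scales with this, so keep it small
    gradient_accumulation_steps = 4,   # grow this instead of the batch size
    learning_rate = 2e-4,              # or smaller, e.g. 2e-5
    num_train_epochs = 1,
    output_dir = "outputs",
)
```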
So now you have to use the apply chat template. Remember, be careful of double BOS tokens. But we 00:16:07.660 |
in Unsloth fixed this. And finally, you have to save this to Ollama. And, you know, you have to install 00:16:15.260 |
Ollama first. For saving, we now support saving multiple GGUF files. So you don't actually have 00:16:20.940 |
to save it to one GGUF file. You can save it to multiple. And we actually allow you to do this now. 00:16:24.780 |
Before, if you wanted to save to multiple GGUF files, you had to wait 10 minutes extra. You can now do 00:16:30.140 |
this automatically by, you know, specifying more than one format. 00:16:33.340 |
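The save step at the end of the notebook is along these lines; the method and argument names follow Unsloth's documented GGUF export, but treat the exact signature as an assumption and check the current notebook if it errors:

```python
# Save to more than one GGUF format in a single call.
model.save_pretrained_gguf(
    "model",                                  # output directory
    tokenizer,
    quantization_method = ["q4_k_m", "q8_0"],
)
```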
We also can show you the model file which we created. So you can actually copy-paste the model 00:16:39.180 |
file and use it for a custom Ollama model as well. So the model file was the complicated part 00:16:45.100 |
when we had to automatically generate this. So we have, like, internal code to generate the model file 00:16:49.100 |
automatically. And finally, when you want to do inference, you can use Ollama. And, you 00:16:56.380 |
know, it works in general. So try that out. The Ollama chat template notebook is in the slide. 00:17:02.940 |
So tinyurl.com/unsloth2. And remember, the workshop slides, which we did yesterday, 00:17:07.820 |
are at tinyurl.com/unsloth. And don't forget to join our Discord channel. If you have any 00:17:14.380 |
questions, I'm outside. You can ask questions and stuff like that. And, yes, like, thanks for 00:17:18.940 |
coming. I much appreciate it. Thanks a lot.