[Paper Club] Upcycling Large Language Models into Mixture of Experts
00:00:04.440 |
>> Okay, cool. Today, I'm going to present a mixture of experts. 00:00:18.340 |
I might sound a bit muffled because I have a cold. 00:00:25.000 |
For this topic, I'm going to give a brief introduction to mixture of experts, or MoE, first, 00:00:37.280 |
then how we accelerate these MoEs and do training and inference efficiently. 00:00:44.400 |
Finally, I'm going to talk about upcycling LLMs into MoE. 00:00:54.440 |
So the AI models are growing larger and larger. 00:01:03.720 |
Switch Transformer was the first model to surpass one trillion parameters. 00:01:09.900 |
Before that, models were only hundreds of billions of parameters. 00:01:23.760 |
The question is, we only have so much compute, 00:01:30.480 |
how can we make the model better without increasing the compute? 00:01:38.400 |
"My unsubstantiated theory is that parameters are good for knowledge, 00:01:44.640 |
and compute or the flop is good for intelligence." 00:01:55.840 |
the parameters or growing knowledge without increasing the compute. 00:02:05.640 |
Here is a very simple diagram from Switch Transformer. 00:02:29.800 |
MoEs transform the FFN layer into multiple copies of it, 00:02:41.400 |
and then each token selectively activates a few experts. 00:02:48.680 |
The router is simply a matrix multiplication, 00:02:52.160 |
a learnable matrix that selects one of the experts based on the input. 00:02:59.360 |
The model size increases, enhancing its capability, 00:03:03.400 |
while the compute roughly remains the same as the original model. 00:03:17.920 |
The MoE layer is more complicated than the original FFN layer. 00:03:23.560 |
You can think of it as a three-step computation, where the original FFN becomes the third step. 00:03:58.120 |
The router is simply a matrix that is applied to these tokens, 00:04:05.280 |
and we take the highest probabilities as the router's selection. 00:04:16.400 |
The selected experts are the ones which have the highest probability for these tokens. 00:04:31.560 |
In the second step, you need to align these input features with the experts: 00:04:42.120 |
you need to arrange them into a single matrix so that each expert's tokens are contiguous. 00:05:01.000 |
The third step is the same as the original FFN layer, 00:05:14.080 |
and afterwards you arrange these tokens back to the original shape. 00:05:19.040 |
Then the router probability is applied as a scaling factor to the expert outputs. 00:05:32.520 |
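To make those three steps concrete, here is a minimal sketch of a top-k token-choice MoE layer in plain PyTorch. It is only an illustration of the idea described above, not Megatron Core's implementation; the class name and structure are my own, and it skips load balancing, capacity limits, and the fused permute/unpermute discussed later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Illustrative top-k token-choice MoE layer (not Megatron Core's code)."""
    def __init__(self, d_model, d_ff, num_experts, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Step 1: the router is just a learnable matrix applied to each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # The experts are copies of the original FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                    # x: [num_tokens, d_model]
        # Step 1: router probabilities and top-k expert selection per token.
        probs = F.softmax(self.router(x), dim=-1)            # [num_tokens, num_experts]
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Step 2: gather ("permute") the tokens routed to expert e.
            tok_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if tok_ids.numel() == 0:
                continue
            # Step 3: run the expert FFN, scatter ("unpermute") back to token order,
            # and apply the router probability as a scaling factor.
            expert_out = expert(x[tok_ids])
            out[tok_ids] += topk_probs[tok_ids, slot].unsqueeze(-1) * expert_out
        return out

# Example: 8 experts, top-2 routing, 16 tokens.
moe = SimpleMoELayer(d_model=64, d_ff=256, num_experts=8, top_k=2)
y = moe(torch.randn(16, 64))
```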
Scaling this MoE during training is very challenging. 00:05:49.400 |
You increase the model parameters roughly by 6x or 7x. 00:05:55.880 |
This puts substantial pressure on the memory usage. 00:06:01.040 |
And the router dispatching also has overhead. 00:06:07.560 |
you will notice there's a permute and an unpermute operator. 00:06:11.680 |
That essentially increases the activation memory by two times 00:06:23.840 |
and it would also increase the activation memory by a factor of top-k, 00:06:28.200 |
because the hidden states need to go to each expert. 00:06:39.800 |
There's compute overhead too, because you need to do a loop over all the experts. 00:06:46.480 |
There's also an imbalance issue if all of the tokens go to only a few experts. 00:07:16.520 |
Megatron-LM is an open-source library available on GitHub. 00:07:21.360 |
We accelerate not only MoE but also all kinds of LLMs, 00:07:31.880 |
I'm not sure if anyone is still using those now, 00:07:41.440 |
the attention is accelerated with all kinds of parallelism, 00:07:46.680 |
including pipeline parallel, tensor parallel, 00:07:50.760 |
and other kinds of parallelism; and the MoEs are also accelerated. 00:07:55.920 |
This is what we are primarily talking about today, 00:08:01.800 |
On top, you have two customizable training loops. 00:08:06.640 |
Megatron-LM provides a simple bare-bones training loop, 00:08:17.080 |
where you can just provide a Pythonic configuration. 00:08:21.560 |
In Megatron Core MoE, we provide different approaches for each component. 00:08:32.240 |
For the router, there are aux-loss and Sinkhorn load balancing. 00:08:53.760 |
For permute and unpermute, there are memory-saving optimizations. 00:08:57.760 |
And for the experts, we have a grouped MLP to accelerate them. 00:09:18.240 |
You can-- usually, you would put all of the experts in a list, 00:09:24.480 |
and then you do a for loop over all of the experts to compute them one by one. 00:09:37.840 |
The grouped MLP fuses this loop and accelerates both training and inference. 00:09:54.160 |
So all of the tokens can go to one expert and no tokens are dropped. 00:10:05.960 |
Alternatively, you can set a capacity factor, for example four here, and tokens beyond each expert's capacity are dropped. 00:11:09.280 |
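As a hedged aside, the talk doesn't show the exact formula, so the one below is just the common convention for turning a capacity factor into a per-expert token budget:

```python
def expert_capacity(num_tokens: int, num_experts: int, top_k: int, capacity_factor: float) -> int:
    # Each token produces top_k routing assignments, so on average an expert
    # receives num_tokens * top_k / num_experts of them. The capacity factor is
    # head-room on top of that average; assignments beyond the budget are dropped,
    # whereas a dropless implementation keeps them all.
    return int(capacity_factor * num_tokens * top_k / num_experts)

# e.g. 4096 tokens, 8 experts, top-2 routing, capacity factor 4:
print(expert_capacity(4096, 8, 2, 4.0))   # 4096 token slots per expert
```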
For example, the DeepSeek-V2 MoE has 160 experts. 00:11:27.160 |
And these tokens need to go to different experts. 00:11:30.560 |
And at this step, you would have a scatter and a copy operator. 00:11:38.200 |
Say the top-k is eight here: each of the hidden states 00:11:42.800 |
would be copied eight times, which is a huge overhead. 00:12:03.880 |
This is optimized in Megatron Core for MoE, where all of these operations are handled more efficiently. 00:12:10.320 |
Because the copy operation here has zero compute, 00:12:24.880 |
you can easily recompute these features during the backward pass instead of storing them. 00:12:32.040 |
Let's also look at the implementation of Mixtral 8x7B. 00:12:44.880 |
You'll also notice that in the expert operation there, 00:12:53.240 |
they loop over the experts and compute each of the GEMM operations one by one. 00:13:03.240 |
Instead, we provide an interface to CUTLASS grouped GEMM, 00:13:10.880 |
where the grouped GEMM groups all of the looping 00:13:16.480 |
over experts and calculates the GEMMs in a single operation. 00:13:23.520 |
Given any number of experts and any number of tokens, 00:13:29.920 |
you can efficiently compute the output in one operation. 00:13:36.960 |
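In spirit, the grouped GEMM computes exactly what the per-expert loop below computes; the difference is that the fused kernel is launched once over all of the variable-sized groups instead of once per expert. This sketch is plain PyTorch for illustration only; it is not the CUTLASS kernel or Megatron Core's actual interface.

```python
import torch

def expert_mlp_loop(tokens_per_expert, hidden, w1_stack, w2_stack):
    """Reference per-expert loop: one GEMM pair per expert (what grouped GEMM fuses).

    hidden:            [total_tokens, d_model], already permuted so each expert's tokens are contiguous
    w1_stack:          [num_experts, d_model, d_ff]
    w2_stack:          [num_experts, d_ff, d_model]
    tokens_per_expert: list of token counts, one entry per expert
    """
    outputs, start = [], 0
    for e, n in enumerate(tokens_per_expert):
        chunk = hidden[start:start + n]           # this expert's tokens
        h = torch.relu(chunk @ w1_stack[e])       # GEMM 1 + activation
        outputs.append(h @ w2_stack[e])           # GEMM 2
        start += n
    # A grouped GEMM kernel produces this same concatenated result in a single
    # launch, even when the per-expert token counts are all different.
    return torch.cat(outputs, dim=0)
```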
So that's pretty much all of the optimizations for MoE. 00:14:00.680 |
I'm just curious, are these available for everyone 00:14:13.480 |
to use? 00:14:16.640 |
Or are these specific to the Megatron implementation? 00:14:22.880 |
So you can import them as a standalone module 00:14:40.200 |
Yeah, you can import Megatron Core as a library 00:14:49.160 |
So for example, if you want to use expert parallelism, 00:14:53.720 |
you will need to also use Megatron's parallelism strategy 00:15:04.800 |
But if you're using, for example, the grouped GEMM, 00:15:08.360 |
I think you can just get away with any kind of network. 00:15:12.720 |
You can combine it with Hugging Face Transformers. 00:15:18.080 |
This is just a PyTorch layer with a fused operator. 00:15:31.720 |
Do you have any intuition as far as the knowledge that each expert ends up learning? 00:15:52.000 |
I had assumed that maybe one expert was good at economics, 00:15:56.680 |
another one was good at physics, and so forth. 00:15:59.120 |
But this makes it seem like it's more token by token. 00:16:12.680 |
but I saw a lot of research on interpretability 00:16:24.320 |
looking for significant interpretability inside these experts, 00:16:31.120 |
like one expert focused on math, the other focused on literature. 00:16:36.320 |
I think the problem is that neural network hidden states 00:16:44.560 |
are in this kind of superposition, 00:16:49.440 |
where it can represent multiple different features. 00:16:54.960 |
So it's very hard to tell which expert focuses on which area. 00:17:10.160 |
You'll find some experts focus on multiple tokens, 00:17:14.520 |
some experts focus on single tokens, things like that. 00:17:20.760 |
And there's also one pretty interesting piece of research. 00:17:26.440 |
They do specialized training of dense models first, and then combine them into an MoE. 00:17:40.520 |
In that case, it still preserves some of the specialization. 00:17:52.800 |
We talked about top-k selection to choose the experts. 00:17:57.400 |
Is there any benefit to using maybe a top-p sampling 00:18:00.200 |
approach, similar to how you would use top-p sampling in decoding, 00:18:03.960 |
for selecting an expert as compared to top-k? 00:18:35.400 |
I think that promotes a little more diversity, I've heard. 00:18:44.080 |
That will create some difficulty in optimization. 00:18:49.320 |
But I think another thing that's pretty exciting is expert choice. 00:18:54.520 |
You see, in the expert choice model, the selection is reversed. 00:19:17.320 |
But expert choice is another way where the experts select tokens. 00:19:23.800 |
Each expert always selects exactly K tokens. 00:19:28.720 |
So even though this is fixed from the expert side, from the token 00:19:32.960 |
perspective, each token can have zero experts applied to it, 00:19:39.000 |
or more than zero, or all of the experts applied to it. 00:19:43.440 |
In that case, you're either overloading a token with experts or giving it none. 00:19:49.840 |
Yeah, in fact, expert choice applies pretty well 00:19:53.600 |
to vision models, because vision models do not have a causal mask. 00:20:09.040 |
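For reference, here is a minimal sketch of expert-choice routing in plain PyTorch, my own illustration rather than any particular library's implementation. Each expert takes a fixed number of tokens, so the load is balanced by construction, while an individual token may be picked by zero, one, or many experts.

```python
import torch
import torch.nn.functional as F

def expert_choice_route(router_logits, tokens_per_expert):
    """router_logits: [num_tokens, num_experts]; each expert picks its own top tokens."""
    # Token-to-expert affinity scores.
    scores = F.softmax(router_logits, dim=-1)
    # Each expert (column) selects its `tokens_per_expert` highest-scoring tokens.
    weights, token_idx = scores.topk(tokens_per_expert, dim=0)   # both [tokens_per_expert, num_experts]
    # token_idx[:, e] lists the tokens expert e will process; a given token id may
    # appear in several columns (many experts) or in none at all (zero experts).
    return weights, token_idx

weights, token_idx = expert_choice_route(torch.randn(32, 4), tokens_per_expert=8)
```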
I guess I can go to the next section, upcycling MoEs. 00:20:15.400 |
So if you are going to remember one thing, remember this. 00:20:34.200 |
With upcycling, you can achieve better accuracy than simply training 00:20:37.560 |
the dense model further for the same number of FLOPs. 00:20:42.240 |
The context is we have so many big dense models. 00:20:56.280 |
It's very expensive to retrain an MoE variant of them from scratch. 00:21:12.240 |
In our scaling experiments, we tried upcycling a 15B model 00:21:20.040 |
on 1 trillion tokens and achieved roughly a 5% improvement. 00:21:33.320 |
It's exciting because the original sparse upcycling paper didn't go to this scale. 00:21:45.480 |
We found there are several key factors to go beyond 1 billion parameters. 00:21:59.840 |
Let's say you have the MLP in the original dense model. 00:22:26.640 |
The first step is to copy the MLP layer into a number of expert copies. 00:22:38.200 |
And then you randomly initialize the router weights. 00:22:50.960 |
This model needs to perform the same as the original model at initialization. 00:22:58.440 |
Otherwise, it will lead to catastrophic forgetting. 00:23:12.240 |
The way to achieve this is through swapping the top-k and softmax operators: 00:23:24.960 |
you do the top-k first to select two experts, 00:23:46.160 |
and then the softmax is applied to the top-k output, so it always sums up to 1. 00:24:09.160 |
So the model output is the same as the original dense model. 00:24:13.800 |
This is a very important feature in upcycling, 00:24:22.760 |
that it behaves exactly the same as the dense model without any training. 00:24:38.120 |
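A tiny numeric sketch of why the top-k-then-softmax ordering gives this property (my own illustration of the argument, not code from the paper): the selected gates always sum to 1, and at upcycling time every expert is an identical copy of the dense FFN, so the weighted combination is just the dense output.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8)                  # freshly initialized router logits for 8 experts
topk_vals, topk_idx = logits.topk(2)

# Top-k first, then softmax over only the selected logits: the gates sum to 1.
gates = F.softmax(topk_vals, dim=-1)
print(gates.sum())                       # tensor(1.)

# Since every expert is the same copy of the dense FFN at initialization,
#   sum_i gates[i] * expert_i(x) = (sum_i gates[i]) * dense_ffn(x) = dense_ffn(x),
# so the upcycled MoE reproduces the dense model's output before any training.
```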
But we found the Mixtral-style ordering didn't work as well as 00:24:41.120 |
expected, because the original Switch Transformer from Google applies softmax first and then top-k, and that ordering trains better. 00:24:52.040 |
And for upcycling, we had switched to top-k then softmax; 00:25:09.960 |
I've already explained how Mixtral did top-k and then softmax. 00:25:21.680 |
With softmax applied over all of the experts, the probabilities sum up to 1, 00:25:28.360 |
so if you then apply top-k, the selected weights no longer sum up to 1. 00:25:40.920 |
If you just train this model naively at a large scale, this mismatch becomes a problem. 00:25:49.800 |
But on a smaller-scale model, like 1B or under, it doesn't show up; 00:26:00.560 |
the original sparse upcycling paper didn't go beyond that scale. 00:26:06.720 |
So we found a very simple approach to solve this problem. 00:26:22.640 |
We scale up the MLP output by the number of experts. 00:26:42.480 |
And the model still behaves the same as the original model. 00:26:54.360 |
And it consistently outperforms the Mixtral-style approach. 00:27:00.280 |
So we get the benefit of the original Switch Transformer routing. 00:27:07.000 |
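Here is a small numeric check of the scaling idea, under the simplifying assumption (mine, for illustration) that the router is near-uniform at initialization, so each expert gets roughly probability 1/E. In that case the factor that restores the dense output works out to the number of experts divided by top-k; the exact factor the paper uses, including the fine-grained case, is in the paper itself.

```python
import torch
import torch.nn.functional as F

num_experts, top_k = 8, 2
# Simplifying assumption: a near-uniform router at initialization (e.g. zero logits).
logits = torch.zeros(num_experts)
probs = F.softmax(logits, dim=-1)            # each probability is 1/8

# Softmax-then-top-k: the selected gates no longer sum to 1 ...
gates, idx = probs.topk(top_k)
print(gates.sum())                           # 0.25, not 1.0

# ... so identical expert copies would produce only 0.25 * dense_ffn(x).
# Scaling up the MLP (expert) output compensates; under the uniform-router
# assumption a factor of num_experts / top_k restores the dense output exactly.
scale = num_experts / top_k
print(scale * gates.sum())                   # 1.0
```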
A bit of intuition behind why softmax-then-top-k is better: 00:27:13.680 |
if you apply softmax to all of the experts, the router gets a signal that compares every expert. 00:27:26.080 |
However, in the swapped case, the probability distribution only covers the selected experts. 00:27:43.640 |
In the extreme, with top-1, the softmax on one expert is always 1, so the router gets no useful gradient. 00:28:05.720 |
Fine-grained experts are very popular in the most recent MoEs. 00:28:15.400 |
DeepSeek-V2 uses 160 routed experts. 00:28:23.360 |
Granularity means using more experts, but smaller ones. 00:28:27.000 |
So this gives the flexibility of more combinations 00:28:32.000 |
of different experts, so more representation power. 00:28:44.800 |
Instead, you can expand the number of experts to 4 by sharding the MLP. 00:29:24.720 |
So you can think of it like the expert is first copied and then segmented into shards. 00:30:02.600 |
Now these two segments are not the same anymore, 00:30:10.000 |
because you segment one expert into two shards. 00:30:18.440 |
For the top-2 case, if you select one shard two times, you don't recover the original MLP. 00:30:30.000 |
And we mentioned that it's very important for upcycling that the initial output matches the dense model. 00:30:42.280 |
The solution here is also rather straightforward: 00:30:50.520 |
we initialize half of the router weights, and then duplicate them. 00:30:56.200 |
So this would ensure that the probability distribution would 00:31:01.080 |
be the same in each virtual group, or in each shard group. 00:31:07.760 |
Because if these are the same, the top-k selection picks the right combination of shards. 00:31:20.640 |
The example here is that I shard the MLP into two parts. 00:31:29.280 |
First, I need to select exactly the matching orange ones. 00:31:36.800 |
By duplicating the router weights, this achieves the purpose. 00:31:44.880 |
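The following toy sketch shows one way duplicated router weights can guarantee that, at initialization, top-k picks a complete set of shards rather than the same shard twice. The expert layout and the exact duplication scheme here are my reading of the talk, not necessarily the paper's; the point is only that shards which together form one full copy of the dense MLP share a router column, so they are selected together.

```python
import torch

torch.manual_seed(0)
d_model, d_ff = 16, 32
granularity = 2                     # the dense FFN is split into 2 shards
num_copies  = 2                     # and copied twice -> 4 virtual experts
top_k = granularity

# Dense FFN weights; sharding along the hidden dimension keeps the outputs additive.
w1 = torch.randn(d_model, d_ff)
w2 = torch.randn(d_ff, d_model)
w1_shards = w1.chunk(granularity, dim=1)
w2_shards = w2.chunk(granularity, dim=0)

x = torch.randn(d_model)
dense_out = torch.relu(x @ w1) @ w2
shard_sum = sum(torch.relu(x @ a) @ b for a, b in zip(w1_shards, w2_shards))
print(torch.allclose(dense_out, shard_sum, atol=1e-5))    # True: shards add up to dense

# Virtual experts laid out copy by copy: [A1, B1, A2, B2]. Draw one router column
# per copy and duplicate it across that copy's shards.
base_cols = torch.randn(d_model, num_copies) * 0.02
router = base_cols.repeat_interleave(granularity, dim=1)  # columns [c0, c0, c1, c1]

logits = x @ router
picked = logits.topk(top_k).indices.sort().values
print(picked)   # tensor([0, 1]) or tensor([2, 3]): always one complete copy {A, B}
```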
And the formula for scaling the weights is also the same, 00:31:51.480 |
except there's an additional granularity factor. 00:31:55.480 |
This is a bit complicated, so I'll skip this part. 00:32:10.280 |
Our experiments build on Nemotron, which was trained on 8 trillion tokens. 00:32:14.120 |
The ablation study is on a smaller model, Nemotron 2B. 00:32:18.840 |
And for the bigger model, we do it on an 8x15B MoE trained on 1 trillion tokens. 00:32:29.440 |
The learning rate is the most important hyperparameter, as always. 00:32:37.120 |
Learning rate is the most important parameter. 00:32:46.440 |
the learning rate is taken from the minimum-- 00:32:57.760 |
So we found that if you have a high learning rate, it works much better. 00:33:07.720 |
Here, the orange line is the lowest learning rate. 00:33:11.600 |
Basically, you just continue to fine-tune the model into an MoE. 00:33:34.400 |
We found that the best is to use the original highest peak 00:33:39.800 |
learning rate from pre-training. 00:33:43.120 |
And if you dive into the weights of this upcycled model, 00:33:57.680 |
you'll see that if you apply a constant small learning rate, 00:34:02.560 |
like in fine-tuning or alignment, the cosine similarity 00:34:08.640 |
between the base model and the upcycled model stays very close to 1. 00:34:23.120 |
If you use a high peak learning rate for upcycling, 00:34:27.520 |
the cosine similarity would be much lower, around 0.7. 00:34:34.560 |
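To illustrate the kind of diagnostic being described (this is not the paper's analysis script, just a minimal sketch), you can flatten corresponding weight tensors and compare their directions:

```python
import torch
import torch.nn.functional as F

def weight_cosine_similarity(base_weight: torch.Tensor, upcycled_weight: torch.Tensor) -> float:
    # Flatten the two corresponding weight tensors and compare their directions.
    return F.cosine_similarity(base_weight.flatten(), upcycled_weight.flatten(), dim=0).item()

# Values near 1.0 mean the upcycled experts barely moved from the dense weights
# (what the talk reports for small, fine-tuning-style learning rates), while values
# around 0.7 indicate they have diverged substantially (high peak learning rate).
```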
We also analyzed the Mixtral 8x7B weights, and the cosine similarity is also 00:34:47.560 |
around there, which means you need a higher learning rate for upcycling. 00:34:57.160 |
In an additional experiment on the number of experts, 00:35:02.760 |
we found 64 experts is kind of like the sweet spot. 00:35:08.160 |
If you increase the number of experts beyond 64, the gains start to diminish. 00:35:24.760 |
We upcycled the 8x15B model on 1 trillion tokens. 00:35:39.480 |
The base model, 15B, is trained on 8 trillion tokens. 00:36:01.960 |
The continued training uses a better data blend to obtain higher performance on the evaluations. 00:36:09.680 |
The upcycling is performed on the same data for comparison. 00:36:14.480 |
Continued training is just the dense model trained further on that data. 00:36:18.720 |
You will notice that actually, the data plays the biggest role. 00:36:34.960 |
And the upcycled model is another 4% to 5% improvement on top of that. 00:36:41.760 |
Continued training alone is like a 20% improvement, because of the data. 00:36:47.200 |
This reminds us, again, data is the most important in ML. 00:36:51.880 |
If you put the 5% improvement into the scaling law-- 00:37:10.920 |
actually, the fine-grained MoE, the upcycled one, is about a 00:37:26.000 |
4% improvement if you plug it into the scaling law. 00:37:36.480 |
And the non-fine-grained model, with top-2 -- 00:37:43.680 |
this is the same config as the Mixtral 8x7B -- 00:37:56.560 |
it's roughly two times as powerful as the original model. 00:38:10.080 |
So this is indeed some saving compared to training an MoE from scratch. 00:38:21.440 |
I put the paper link and the Megatron Core MoE GitHub there, 00:38:26.920 |
and the NeMo GitHub, where we provide a high-level training interface. 00:38:32.720 |
You can also follow me on LinkedIn and Twitter. 00:38:48.280 |
But I think everyone is very excited by the presentation. 00:38:55.120 |
So let people know where to get the slides.