[Paper Club] Upcycling Large Language Models into Mixture of Experts

00:00:04.440 | 
>> Okay, cool. Today, I'm going to present a mixture of experts. 00:00:18.340 | 
I might sound a bit muffled because I have a cold. 00:00:25.000 | 
For this topic, I'm going to give a brief introduction to mixture of experts, or MOE, first, 00:00:37.280 | 
then talk about how we accelerate these MOEs and do training and inference efficiently. 00:00:44.400 | 
Finally, I'm going to talk about upcycling LLMs into MOE. 00:00:54.440 | 
So the AI models are growing larger and larger. 00:01:03.720 | 
Switch Transformer was the first model to surpass one trillion parameters. 00:01:09.900 | 
Before that, models had only hundreds of billions of parameters. 00:01:23.760 | 
The question is, we only have so much compute, 00:01:30.480 | 
how can we make the model better without increasing the compute? 00:01:38.400 | 
"My unsubstantiated theory is that parameters are good for knowledge, 00:01:44.640 | 
and compute or the flop is good for intelligence." 00:01:55.840 | 
MOE is a way to grow the parameters, or knowledge, without increasing the compute. 00:02:05.640 | 
Here is a very simple diagram from Switch Transformer. 00:02:29.800 | 
MOEs transform the FFN layer into multiple copies of it, the experts, 00:02:41.400 | 
and then each token selectively activates a few experts. 00:02:48.680 | 
The router is simply a matrix multiplication, 00:02:52.160 | 
a learnable matrix that selects one of the experts based on the input. 00:02:59.360 | 
The model size increases, enhancing its capability, 00:03:03.400 | 
while the compute roughly remains the same as the original model. 00:03:17.920 | 
It's more complicated than the original FFN layer. 00:03:23.560 | 
You can think of the original FFN layer as the third step of this computation. 00:03:58.120 | 
The router is simply a matrix that is applied to these tokens, 00:04:05.280 | 
and we take the highest probabilities as the router's selection. 00:04:16.400 | 
The selected experts are the ones with the highest probability for these tokens. 00:04:31.560 | 
In the second step, you need to align these input features with the experts: 00:04:42.120 | 
you need to arrange them into a single permuted matrix, grouped by expert. 00:05:01.000 | 
The third step is the same as the original FFN layer. 00:05:14.080 | 
After that, you need to arrange these tokens back to the original shape. 00:05:19.040 | 
Then the router probability is applied as a scaling factor to combine the expert outputs. 00:05:32.520 | 
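To make these four steps concrete, here is a minimal, loop-based sketch in PyTorch (not Megatron Core code; all names and shapes are illustrative): the router is a matmul plus softmax, tokens are gathered per expert, each expert runs an ordinary FFN, and the outputs are scattered back and scaled by the router probability.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, expert_w1, expert_w2, top_k=2):
    """x: [tokens, hidden]; router_w: [hidden, num_experts];
    expert_w1: [num_experts, hidden, ffn]; expert_w2: [num_experts, ffn, hidden]."""
    # Step 1: the router is just a matmul plus a softmax over experts.
    probs = F.softmax(x @ router_w, dim=-1)            # [tokens, num_experts]
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)   # [tokens, top_k]

    out = torch.zeros_like(x)
    for e in range(router_w.shape[1]):
        # Step 2: gather the tokens routed to expert e.
        token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        # Step 3: the per-expert computation is an ordinary FFN.
        h = F.gelu(x[token_idx] @ expert_w1[e]) @ expert_w2[e]
        # Step 4: scatter back to the original positions, scaled by the router probability.
        out.index_add_(0, token_idx, h * topk_probs[token_idx, slot].unsqueeze(-1))
    return out
```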
Scaling this MOE during training is very challenging. 00:05:49.400 | 
you increase the model parameters roughly by 6x or 7x. 00:05:55.880 | 
This puts substantial pressure on the memory usage. 00:06:01.040 | 
And the router dispatching also has overhead. 00:06:07.560 | 
you will notice there are permute and unpermute operators. 00:06:11.680 | 
That essentially increases the activation memory by two times. 00:06:23.840 | 
It also increases the activation memory by a factor of top-k, 00:06:28.200 | 
because the hidden states need to be copied to each selected expert, 00:06:39.800 | 
and you need to do a loop over all the experts. 00:06:46.480 | 
There's also an imbalance issue if all of the tokens go to a few experts. 00:07:16.520 | 
Megatron Core is an open-source library available on GitHub. 00:07:21.360 | 
We accelerate not only MOE but also LLMs in general, 00:07:31.880 | 
I'm not sure if anyone is still using those now, 00:07:41.440 | 
the attention is accelerated with all kinds of parallelism, 00:07:46.680 | 
including pipeline parallel, tensor parallel, 00:07:50.760 | 
and other parallelisms, and the MOEs are also accelerated. 00:07:55.920 | 
This is what we are primarily talking about today, 00:08:01.800 | 
On top, you have two customizable training loops. 00:08:06.640 | 
Megatron-LM provides a simple bare-bones training loop, 00:08:17.080 | 
where you can just provide a Pythonic configuration. 00:08:21.560 | 
In Megatron Core MOE, we provide different approaches for each component. 00:08:32.240 | 
For the router, there are the aux-loss and Sinkhorn load-balancing options. 00:08:53.760 | 
There are fused permute and unpermute ops for efficient memory saving. 00:08:57.760 | 
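As a rough illustration of what permute and unpermute mean here (the fused Megatron Core ops do the same thing, plus the top-k replication, with far less activation memory), a plain top-1 version can be written as:

```python
import torch

def permute(tokens, expert_idx):
    """Group tokens so that tokens routed to the same expert are contiguous."""
    order = torch.argsort(expert_idx)
    return tokens[order], order

def unpermute(permuted_out, order):
    """Restore the expert outputs to the original token order."""
    out = torch.empty_like(permuted_out)
    out[order] = permuted_out
    return out
```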
And for the expert, we have grouped MLP to accelerate this. 00:09:18.240 | 
Usually, you would put all of the experts in a module list, 00:09:24.480 | 
and then do a for loop over all of the experts, which is slow. 00:09:37.840 | 
The grouped MLP batches this up and also accelerates both training and inference. 00:09:54.160 | 
In the dropless setting, all of the tokens can go to one expert and no tokens are dropped. 00:10:05.960 | 
With token dropping, meaning given a set capacity factor, for example four here, each expert only processes up to its capacity of tokens. 00:11:09.280 | 
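A small sketch of the capacity idea, assuming a top-1 assignment; `capacity` would typically be `capacity_factor * num_tokens / num_experts`, and any token beyond an expert's capacity is dropped. This is only an illustration, not the library's implementation.

```python
import torch

def apply_capacity(expert_idx, num_experts, capacity):
    """expert_idx: [tokens] top-1 assignments. Returns a boolean keep-mask."""
    keep = torch.zeros_like(expert_idx, dtype=torch.bool)
    for e in range(num_experts):
        slots = (expert_idx == e).nonzero(as_tuple=True)[0]
        keep[slots[:capacity]] = True   # tokens beyond the expert's capacity are dropped
    return keep

# e.g. capacity = int(capacity_factor * num_tokens / num_experts)
```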
For example, the DeepSeek-V2 MOE has 160 experts, 00:11:27.160 | 
And these tokens need to go to different experts. 00:11:30.560 | 
And at this step, you would have a scatter and a copy operator. 00:11:38.200 | 
Say the top-k is eight here: each of the hidden states 00:11:42.800 | 
would be copied eight times, which is a very large overhead. 00:12:03.880 | 
There are fused operators available in Megatron Core for MOE, where all of these operations are combined. 00:12:10.320 | 
Because the copy operation here has zero compute, 00:12:24.880 | 
you can easily recompute these features during the backward pass instead of storing them. 00:12:32.040 | 
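Here is my own toy illustration of that principle: a gather that saves only the token indices for backward and accumulates gradients over the k copies, instead of keeping the replicated activations around. It is not the fused kernel shipped in Megatron Core.

```python
import torch

class GatherForExperts(torch.autograd.Function):
    """Replicate tokens for their selected experts without storing the copies."""

    @staticmethod
    def forward(ctx, x, token_idx):
        ctx.save_for_backward(token_idx)
        ctx.num_tokens = x.shape[0]
        return x[token_idx]                         # the zero-FLOP "copy"

    @staticmethod
    def backward(ctx, grad_out):
        (token_idx,) = ctx.saved_tensors
        grad_x = torch.zeros(ctx.num_tokens, grad_out.shape[-1],
                             dtype=grad_out.dtype, device=grad_out.device)
        grad_x.index_add_(0, token_idx, grad_out)   # sum gradients over the k copies
        return grad_x, None

# usage (hypothetical names): expanded = GatherForExperts.apply(hidden_states, flat_topk_idx)
```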
Let's also look at the implementation of Mixtral 8x7B. 00:12:44.880 | 
You will notice that in the expert operation there, they loop over the experts 00:12:53.240 | 
and compute each of the GEMM operations one by one. 00:13:03.240 | 
Instead, we provide an interface to CUTLASS grouped GEMM, 00:13:10.880 | 
where the grouped GEMM folds all of the looping 00:13:16.480 | 
over experts and the GEMM calculations into a single operation. 00:13:23.520 | 
Given any number of experts and any number of tokens, 00:13:29.920 | 
you can efficiently compute the output in one operation. 00:13:36.960 | 
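The reference semantics of that per-expert loop look like the sketch below (names are illustrative); a CUTLASS grouped GEMM computes all of these variable-sized matmuls in a single kernel launch instead of a Python loop.

```python
import torch
import torch.nn.functional as F

def looped_expert_mlp(permuted_tokens, tokens_per_expert, w1, w2):
    """permuted_tokens: [total_tokens, hidden], already grouped by expert;
    tokens_per_expert: [num_experts]; w1: [num_experts, hidden, ffn]; w2: [num_experts, ffn, hidden]."""
    outputs, start = [], 0
    for e, n in enumerate(tokens_per_expert.tolist()):
        chunk = permuted_tokens[start:start + n]          # this expert's tokens
        outputs.append(F.gelu(chunk @ w1[e]) @ w2[e])     # one GEMM pair per expert
        start += n
    return torch.cat(outputs, dim=0)
```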
So that's pretty much all of the optimizations for MOE 00:14:00.680 | 
I'm just curious, are these available for everyone 00:14:13.480 | 
to use in the way you're mentioning? 00:14:16.640 | 
Or are these specific to the Megatron implementation? 00:14:22.880 | 
So you can import them as a standalone module 00:14:40.200 | 
Yeah, you can import Megatron Core as a library 00:14:49.160 | 
So for example, if you want to use expert parallelism, 00:14:53.720 | 
you will need to also use Megatron's parallelism strategy 00:15:04.800 | 
But if you're using, for example, the grouped GEMM, 00:15:08.360 | 
I think you can just get away with any kind of network. 00:15:12.720 | 
You can combine it with Hugging Face Transformers. 00:15:18.080 | 
This is just a PyTorch layer with a fused operator. 00:15:31.720 | 
Do you have any intuition as far as the knowledge each expert ends up with? 00:15:52.000 | 
I had assumed that maybe one expert was good at economics, 00:15:56.680 | 
another one was good at physics, and so forth. 00:15:59.120 | 
But this makes it seem like it's more token by token. 00:16:12.680 | 
but I saw a lot of research on interpretability, 00:16:24.320 | 
and they did not find significant interpretability inside these experts, 00:16:31.120 | 
like one expert focused on math and the other focused on literature. 00:16:36.320 | 
I think the problem is that neural network hidden states 00:16:44.560 | 
are in this kind of superposition, 00:16:49.440 | 
where one hidden state can represent multiple different features. 00:16:54.960 | 
So it's very hard to tell which expert focuses on which area. 00:17:10.160 | 
You'll find some experts focus on multiple tokens, 00:17:14.520 | 
some experts focus on single tokens, things like that. 00:17:20.760 | 
And there's also one pretty interesting piece of research 00:17:26.440 | 
where they do specialized training of dense models first and then combine them into an MOE. 00:17:40.520 | 
In that case, it still preserves some of the specialization. 00:17:52.800 | 
We talked about top-k sampling to select the experts. 00:17:57.400 | 
Is there any benefit to using maybe a top-p sampling 00:18:00.200 | 
approach, similar to how you would use top-p sampling for decoding, 00:18:03.960 | 
for selecting an expert as compared to top-k? 00:18:35.400 | 
I think that promotes a little more diversity, I've heard. 00:18:44.080 | 
That will create some difficulty in optimization. 00:18:49.320 | 
But I think another pretty exciting thing is expert choice. 00:18:54.520 | 
You see, in the expert choice model, the selection is reversed. 00:19:17.320 | 
Expert choice is a different approach, where the experts select tokens. 00:19:23.800 | 
Each expert always selects a fixed number of K tokens. 00:19:28.720 | 
So even though this is fixed from the expert side, from the token 00:19:32.960 | 
perspective, each token can have zero experts applied to it, 00:19:39.000 | 
or more than zero, or all of the experts applied to it. 00:19:43.440 | 
With token choice, in contrast, you're either overloading the experts or dropping tokens. 00:19:49.840 | 
Yeah, in fact, expert choice applies pretty well 00:19:53.600 | 
to vision models, because vision models do not have a causal mask. 00:20:09.040 | 
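A rough sketch of expert-choice routing for comparison (my own simplification, not code from the paper): each expert column picks its highest-scoring tokens, so per-expert load is constant by construction, while a token can be picked by zero, one, or many experts.

```python
import torch
import torch.nn.functional as F

def expert_choice_route(x, router_w, k_tokens):
    """x: [tokens, hidden]; router_w: [hidden, num_experts]; k_tokens: tokens per expert."""
    scores = F.softmax(x @ router_w, dim=-1)          # token-to-expert affinities
    # Each expert (column) selects its highest-scoring tokens: constant load per expert.
    gates, token_idx = scores.topk(k_tokens, dim=0)   # both [k_tokens, num_experts]
    return gates, token_idx   # a given token may appear in 0, 1, or many columns
```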
I guess I can go to the next section, upcycling MOEs. 00:20:15.400 | 
So if you are going to remember one thing, remember this: 00:20:34.200 | 
by upcycling a dense model into an MOE, you can achieve better accuracy than simply training 00:20:37.560 | 
the dense model further for the same number of FLOPs. 00:20:42.240 | 
The context is we have so many big dense models. 00:20:56.280 | 
It's very expensive to retrain an MOE variant of it from scratch. 00:21:12.240 | 
In our scaling experiments, we tried upcycling a 15B model 00:21:20.040 | 
on 1 trillion tokens and achieved roughly a 5% improvement. 00:21:33.320 | 
It's exciting because the original sparse upcycling work stayed at a much smaller scale. 00:21:45.480 | 
We found there are several key factors to go beyond 1 billion parameters. 00:21:59.840 | 
Let's say you have the MLP in the original pre-trained dense model. 00:22:26.640 | 
The first step is to copy the MLP layer into a number of expert copies. 00:22:38.200 | 
And then you randomly initialize the router weights. 00:22:50.960 | 
This model needs to perform the same as the original model; 00:22:58.440 | 
otherwise, it will lead to catastrophic forgetting. 00:23:12.240 | 
The way to achieve this is by swapping the top-k and softmax operators: 00:23:24.960 | 
you do the top-k first to select two experts, and then apply softmax to the selected logits. 00:23:46.160 | 
Because the softmax is applied to the top-k output, the weights always sum up to 1. 00:24:09.160 | 
So the model output is the same as the original dense model. 00:24:13.800 | 
This is a very important feature in upcycling: 00:24:22.760 | 
the MOE behaves exactly the same as the dense model without any training. 00:24:38.120 | 
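A minimal numeric check of this initialization recipe, with illustrative shapes: copy the dense FFN into every expert, randomly initialize the router, and apply top-k before softmax. Because the renormalized top-k weights sum to 1 and all experts start as identical copies, the upcycled layer reproduces the dense output exactly.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden, ffn, num_experts, top_k = 16, 64, 8, 2
w1, w2 = torch.randn(hidden, ffn), torch.randn(ffn, hidden)
dense = lambda t: F.gelu(t @ w1) @ w2          # the original dense FFN

x = torch.randn(5, hidden)
router_w = torch.randn(hidden, num_experts)    # randomly initialized router

topk_logits, _ = (x @ router_w).topk(top_k, dim=-1)
gates = F.softmax(topk_logits, dim=-1)         # top-k first, softmax second: rows sum to 1

# Every expert is an identical copy of the dense FFN at initialization, so:
moe_out = sum(gates[:, i:i + 1] * dense(x) for i in range(top_k))
print(torch.allclose(moe_out, dense(x), atol=1e-5))   # True: matches the dense model
```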
However, we found the Mixtral approach didn't work as well as 00:24:41.120 | 
expected, because the original Switch Transformer from Google applies softmax first and then top-k, which trains better. 00:24:52.040 | 
And because of upcycling, if you switch to top-k then softmax, the dense behavior is preserved. 00:25:09.960 | 
I've already explained how Mixtral did top-k then softmax. 00:25:21.680 | 
With softmax applied first, the probability over all of the experts sums up to 1, 00:25:28.360 | 
so if you apply top-k after that, the selected weights no longer sum up to 1. 00:25:40.920 | 
If you just train this model naively at a large scale, you run into problems. 00:25:49.800 | 
But it matters less on a smaller-scale model, like 1B or under, 00:26:00.560 | 
and the original sparse upcycling paper didn't go beyond that scale. 00:26:06.720 | 
So we found a very simple approach to solve this problem. 00:26:22.640 | 
We scale up the MLP output by the number of experts, 00:26:42.480 | 
and the model still behaves the same as the original model. 00:26:54.360 | 
It consistently outperforms the Mixtral-style top-k-then-softmax approach. 00:27:00.280 | 
So we can get the benefit of the original Switch Transformer routing order. 00:27:07.000 | 
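A sketch of that fix under my own reading: I assume the scale factor is `num_experts / top_k` (the talk says "the number of experts", which is the same thing for top-1), and the check is exact only when the router logits start out uniform; with a randomly initialized router it holds approximately.

```python
import torch
import torch.nn.functional as F

hidden, ffn, num_experts, top_k = 16, 64, 8, 2
w1, w2 = torch.randn(hidden, ffn), torch.randn(ffn, hidden)
dense = lambda t: F.gelu(t @ w1) @ w2

x = torch.randn(5, hidden)
router_w = torch.zeros(hidden, num_experts)    # uniform router logits at initialization
probs = F.softmax(x @ router_w, dim=-1)        # softmax over ALL experts first
gates, _ = probs.topk(top_k, dim=-1)           # selected gates only sum to top_k / num_experts

scale = num_experts / top_k                    # assumed scale factor; equals num_experts for top-1
moe_out = sum(gates[:, i:i + 1] * scale * dense(x) for i in range(top_k))
print(torch.allclose(moe_out, dense(x), atol=1e-5))   # True with uniform logits
```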
A bit of intuition on why softmax-then-top-k is better: 00:27:13.680 | 
if you apply softmax to all of the experts, the routing weights depend on every expert's logit. 00:27:26.080 | 
However, in the swapped case, the probability distribution covers only the selected experts. 00:27:43.640 | 
In the top-1 case, the softmax over a single selected expert is always 1, so the router gets no useful gradient signal. 00:28:05.720 | 
Fine-grained experts, or granularity, are very popular in the most recent MOEs. 00:28:15.400 | 
And DeepSeek-V2 uses 160 experts. 00:28:23.360 | 
Granularity means using more experts, but smaller ones. 00:28:27.000 | 
So this gives the flexibility of more combinations 00:28:32.000 | 
of different experts, so more representation power. 00:28:44.800 | 
Instead, you can expand the number of experts to four. 00:29:24.720 | 
So you can think of this as each expert first being split into segments. 00:30:02.600 | 
These two segments are not the same anymore, 00:30:10.000 | 
because you segment one expert into two shards. 00:30:18.440 | 
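A quick check of why sharding works: because the nonlinearity is applied elementwise on the intermediate activations, splitting the FFN's intermediate dimension gives two smaller FFNs whose outputs sum exactly to the original (sizes here are arbitrary).

```python
import torch
import torch.nn.functional as F

hidden, ffn = 16, 64
w1, w2 = torch.randn(hidden, ffn), torch.randn(ffn, hidden)
x = torch.randn(5, hidden)

full = F.gelu(x @ w1) @ w2
shard_a = F.gelu(x @ w1[:, :ffn // 2]) @ w2[:ffn // 2]   # first half of the intermediate dim
shard_b = F.gelu(x @ w1[:, ffn // 2:]) @ w2[ffn // 2:]   # second half
print(torch.allclose(full, shard_a + shard_b, atol=1e-4))   # True: the shards sum to the original
```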
For the top-2 case, if the two selected shards do not come from the same original expert, the output no longer matches the dense model. 00:30:30.000 | 
And we mentioned that it's very important for upcycling that the model behaves the same as the original dense model. 00:30:42.280 | 
The solution here is also rather straightforward. 00:30:50.520 | 
We initialize the router weights for half of the experts, and then duplicate them. 00:30:56.200 | 
So this ensures that the probability distribution is 00:31:01.080 | 
the same in each virtual group, or in each shard group. 00:31:07.760 | 
Because if these are the same, the top-k selection picks the matching shards together. 00:31:20.640 | 
The example here is that I shard the MLP into two parts. 00:31:29.280 | 
To match the dense model, I need to select exactly the matching shards, the orange ones. 00:31:36.800 | 
Duplicating the router weights achieves this. 00:31:44.880 | 
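A sketch of that virtual-group initialization for granularity 2, using one possible layout (names and ordering are my assumptions): router columns for the original experts are duplicated, so the two shards of the same expert always receive identical logits and are selected, or skipped, together.

```python
import torch

hidden, num_experts, top_k, granularity = 16, 8, 2, 2
base_router = torch.randn(hidden, num_experts)
router_w = base_router.repeat(1, granularity)       # [hidden, 16]: shard columns are duplicates

x = torch.randn(4, hidden)
logits = x @ router_w
_, idx = logits.topk(top_k * granularity, dim=-1)   # select shards, not whole experts
# Each selected shard's sibling (same column modulo num_experts) has an identical logit,
# so the top-k always picks complete original experts at initialization.
print(sorted(i % num_experts for i in idx[0].tolist()))   # each original expert id appears twice
```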
And the formula for scaling the weights is also the same, 00:31:51.480 | 
except there's an additional granularity factor. 00:31:55.480 | 
This is a bit complicated, so I'll skip this part. 00:32:10.280 | 
The experiments build on the 8 trillion tokens Nemotron was trained on. 00:32:14.120 | 
The ablation studies are on a smaller model, Nemotron 2B. 00:32:18.840 | 
And for the bigger model, we do an 8x15B upcycling on 1 trillion tokens. 00:32:29.440 | 
Learning rate is the most important hyperparameter, as always. 00:32:37.120 | 
Typically, the learning rate would be taken from the minimum learning rate at the end of pre-training. 00:32:57.760 | 
So we found that if you have a high learning rate, the upcycled model trains better. 00:33:07.720 | 
Here, the orange line is the lowest learning rate. 00:33:11.600 | 
With that, you are basically just continuing to fine-tune the model into an MOE. 00:33:34.400 | 
We found that using the original peak 00:33:39.800 | 
learning rate from pre-training works the best. 00:33:43.120 | 
And if you compare the weights of this upcycled model against the base model: 00:33:57.680 | 
if you apply a constant small learning rate, 00:34:02.560 | 
like in fine-tuning or alignment, the cosine similarity 00:34:08.640 | 
between the base model and the upcycled model stays very close to 1. 00:34:23.120 | 
If you use a high peak learning rate for upcycling, 00:34:27.520 | 
the cosine similarity would be much lower, around 0.7. 00:34:34.560 | 
We also analyzed the Mixtral 8x7B base model, and its cosine similarity is 00:34:47.560 | 
around there, which again suggests you need a higher learning rate for upcycling. 00:34:57.160 | 
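The similarity measurement itself is simple: flatten corresponding weight tensors of the base model and the upcycled model and take their cosine similarity. The state-dict keys below are hypothetical.

```python
import torch
import torch.nn.functional as F

def weight_cosine(base_w: torch.Tensor, upcycled_w: torch.Tensor) -> float:
    return F.cosine_similarity(base_w.flatten(), upcycled_w.flatten(), dim=0).item()

# hypothetical keys: compare the dense MLP with one of the expert copies it was upcycled into
# weight_cosine(base_sd["mlp.w1"], upcycled_sd["experts.0.w1"])   # ~0.7 after high-LR upcycling, per the talk
```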
In an additional experiment on the number of experts, 00:35:02.760 | 
we found 64 experts is kind of like the sweet spot. 00:35:08.160 | 
If you increase the number of experts beyond 64, you get diminishing returns. 00:35:24.760 | 
We upcycled the 8x15B model on 1 trillion tokens. 00:35:39.480 | 
The base model, 15B, is trained on 8 trillion tokens. 00:36:01.960 | 
We then do continued training on better data to obtain higher performance on the evaluations. 00:36:09.680 | 
The upcycling is performed on the same data for comparison. 00:36:14.480 | 
Continued training is just the dense model continuing to train on that data. 00:36:18.720 | 
You will notice that actually, the data plays the biggest role. 00:36:34.960 | 
The upcycled model is another 4% to 5% improvement on top of that. 00:36:41.760 | 
Continued training alone is like a 20% improvement because of the data. 00:36:47.200 | 
This reminds us, again, that data is the most important thing in ML. 00:36:51.880 | 
If you put the 5% improvement into the scaling law: 00:37:10.920 | 
the fine-grained upcycled MOE actually shows 00:37:26.000 | 
about a 4% improvement when you plug it into the scaling law, 00:37:36.480 | 
and the non-fine-grained model with top-2, 00:37:43.680 | 
which is the same config as Mixtral 8x7B, 00:37:56.560 | 
is roughly two times as powerful as the original model. 00:38:10.080 | 
This is indeed some saving compared to training an MOE from scratch. 00:38:21.440 | 
I put the paper link and the Megatron Core MOE GitHub there, 00:38:26.920 | 
and the NeMo GitHub, where we provide a high-level training interface. 00:38:32.720 | 
You can also follow me on LinkedIn and Twitter. 00:38:48.280 | 
But I think everyone is very excited by the presentations. 00:38:55.120 | 
So let people know where to get their slides.