
[Paper Club] Upcycling Large Language Models into Mixture of Experts


Transcript

>> My internet works too. >> Okay, cool. Today, I'm going to present on mixture of experts. I'm Ethan from NVIDIA. I work on scaling LLMs and transformers. I might sound a bit muffled because I have a cold; apologies for that. For this topic, I'm going to give a brief introduction to mixture of experts, or MOE, first.

Then I'm going to talk about Megatron Core MOE, that is, how we accelerate these MOEs to train and do inference efficiently. Finally, I'm going to talk about upcycling LLMs into MOE, which is our recent paper. So, AI models are growing larger and larger. This is a rather old picture from 2021.

Switch Transformer was the first model to surpass one trillion parameters. Before that, models were only hundreds of billions of parameters. I think at that time, model size was growing 10 times each year; it looks like it's slowing down recently. The question is: we only have so much compute, so how can we make the model better without increasing the compute?

Noam Shazeer said, "My unsubstantiated theory is that parameters are good for knowledge, and compute, or FLOPs, is good for intelligence," whatever those terms mean. So MOE is a good way of growing the parameters, or growing knowledge, without increasing the compute. So what is MOE? Here is a very simple diagram from the Switch Transformer paper.

In a traditional transformer LLM, you have self-attention, the residual and layer norm, then an FFN layer, which is just a two-layer MLP, and then another residual. MOE transforms the FFN layer into multiple copies of it. You can see here there are four FFN layers, and each token selectively activates a few experts, which are selected by a router.

The router is simply a matrix multiplication: a learnable matrix that selects one of the experts based on the input. The model size increases, enhancing its capability, while the compute remains roughly the same as the original model. If we look into the MOE layer, it actually consists of several steps.
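
To make the router concrete, here is a minimal sketch of it as a single learnable matrix followed by top-1 selection. The names and shapes are illustrative only, not Megatron's API:

```python
import torch

# Minimal router sketch: one learnable matrix maps each token's hidden state
# to one logit per expert, and the highest-probability expert is selected.
hidden_size, num_experts = 8, 4
tokens = torch.randn(6, hidden_size)                    # 6 tokens, e.g. "the quick brown fox jumped over"
router_weight = torch.randn(hidden_size, num_experts)   # the learnable routing matrix

logits = tokens @ router_weight                         # [6, num_experts]
probs = torch.softmax(logits, dim=-1)                   # per-token probability over experts
expert_indices = probs.argmax(dim=-1)                   # top-1: which expert each token goes to
print(expert_indices)                                   # e.g. tensor([2, 0, 3, 0, 1, 2])
```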

It's more complicated than the original FFN layer. You can think of the original FFN as only the third step here, the expert computation, where the FFN is applied to the input tokens. The first step is routing. Given the input tokens, for example, here you have six tokens: "the quick brown fox jumped over".

They go through the router. The router is simply a matrix applied to these tokens. It generates probabilities, and we take the highest probability as the router's selection. The expert indices here are the experts with the highest probability for these tokens. The second step is permutation.

Given the experts each token selected, you need to align the input features with the experts: all of the tokens that selected expert zero are arranged into a single matrix just for expert zero, and likewise for expert one, expert two, and so on. Depending on the capacity factor, some of the tokens might be dropped.

The third step is the same as the original FFN layer, where you do the computation. After the computation, there is an unpermute step where you arrange these tokens back into their original order. Then the router probability is applied as a scaling factor on all of these expert outputs.
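
Here is a rough, dropless top-1 sketch of the four steps just described: routing, permutation, expert computation, and unpermutation plus scaling. Shapes and names are illustrative, not how Megatron implements it:

```python
import torch
import torch.nn.functional as F

num_experts, d_model, d_ff, n_tokens = 4, 8, 16, 6
x = torch.randn(n_tokens, d_model)
w_router = torch.randn(d_model, num_experts)
experts = [torch.nn.Sequential(torch.nn.Linear(d_model, d_ff), torch.nn.GELU(),
                               torch.nn.Linear(d_ff, d_model)) for _ in range(num_experts)]

# 1) Routing: probabilities and the selected expert for each token.
probs = F.softmax(x @ w_router, dim=-1)
top_prob, top_idx = probs.max(dim=-1)

# 2) Permutation: group tokens so same-expert tokens are contiguous.
order = torch.argsort(top_idx)
x_perm = x[order]

# 3) Expert computation: each expert processes only its own slice of tokens.
out_perm = torch.empty_like(x_perm)
for e in range(num_experts):
    mask = top_idx[order] == e
    if mask.any():
        out_perm[mask] = experts[e](x_perm[mask])

# 4) Unpermute back to the original token order, then scale by the router probability.
out = torch.empty_like(out_perm)
out[order] = out_perm
out = out * top_prob.unsqueeze(-1)
```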

Scaling these MOEs during training is very challenging. First, the models are at massive scale. For example, with Mixtral 8x7B, you increase the model parameters roughly 6x or 7x. This puts substantial pressure on memory usage. The router dispatching also has overhead. In the previous slide, you'll notice there are permute and unpermute operators.

Those essentially increase the activation memory by two times, because all of the intermediate tensors need to be stored. If you have top-k routing, it also increases activation memory by k, because the hidden states need to go to each selected expert and have to be duplicated. It also reduces GEMM efficiency, because you need to loop over all the experts and do each GEMM separately.

There's also an imbalance issue: if all of the tokens go to one expert, the other experts on other GPUs will be idle. Now, let me introduce Megatron Core MOE, which is how we accelerate these MOE models given these challenges. Megatron-LM and Megatron Core are open-source libraries available on GitHub.

We accelerate not only MOE but also all kinds of LLMs, including GPT, BERT, and T5. I'm not sure if anyone is still using those now, but primarily GPT-style models. Inside the transformer layers, the attention is accelerated with all kinds of parallelism, including pipeline parallel and tensor parallel, and the MOE layers are also accelerated.

This is what we are primarily talking about today, Megatron Core MOE. On top, you have two customizable training loops: Megatron-LM provides a simple bare-bones training loop that you can easily hack, and NeMo provides a high-level interface where you just provide a Pythonic configuration to train these models. In Megatron Core MOE, we provide different approaches to accelerate these models.

For the router, there are the aux-loss and Sinkhorn approaches. Basically, aux loss is for token-choice MOE, and Sinkhorn, which works without token dropping, is usually used with expert choice. For the token dispatcher, there are permute and unpermute operations for efficient memory saving. And for the experts, we have a grouped MLP to accelerate them. So first, expert model parallelism.

This is available in Megatron Core MOE now. Usually, you would put all of the experts on a single GPU and then do a for loop over all of the experts to compute the result. Instead, we can put one expert on each GPU. This frees a lot of memory and also accelerates both training and inference.

Next, token dropping. The default we use is dropless, meaning all of the tokens can go to one expert and no tokens are dropped. We also support token dropping with padding, meaning given a set capacity factor, for example four here, each expert can accept at most four tokens. Tokens beyond that are dropped.
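
As a rough illustration of drop-and-pad dispatch (not Megatron's actual implementation), here is a sketch where each expert has a fixed number of slots and any extra tokens routed to a full expert are simply dropped:

```python
import torch

num_experts, capacity, d_model = 2, 4, 8          # e.g. the capacity factor gives 4 slots per expert
top_idx = torch.tensor([0, 0, 0, 0, 0, 1])        # per-token expert assignment from the router
x = torch.randn(top_idx.numel(), d_model)

expert_buffers = torch.zeros(num_experts, capacity, d_model)  # padded per-expert batches
slots_used = [0] * num_experts
kept = []
for t, e in enumerate(top_idx.tolist()):
    if slots_used[e] < capacity:                  # token fits within the expert's capacity
        expert_buffers[e, slots_used[e]] = x[t]
        slots_used[e] += 1
        kept.append(t)                            # tokens beyond capacity are dropped
print(slots_used, kept)                           # [4, 1] [0, 1, 2, 3, 5]: the fifth token for expert 0 was dropped
```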

Accuracy-wise and efficiency-wise, there is a lot of discussion around this. A lot of pre-training experiments show that token dropping is very efficient and doesn't impact performance. But in some downstream fine-tuning, people have found dropless is better. Maybe it's because of domain shift: the tokens are no longer balanced.

So we provide both options. Recently, there are a lot of new MOEs with an increasing number of experts. This causes a huge overhead. For example, the DeepSeek-V2 MOE has 160 experts and eight of them are active. If you think about the memory overhead, first you have the tokens.

These tokens need to go to different experts, and at this step you have a scatter and a copy operation. Depending on the top-k, say top-k is eight here, each of the hidden states gets copied eight times, which is a huge overhead.

After the expert operation is done, there's another eight-way copy here. So we have a fused permutation operation available in Megatron Core MOE, where all of these operations are fused. The copy operation has zero compute, but it duplicates memory. With the fused operations, you can recompute these features during the backward pass while saving a lot of memory.

Let's also look at the implementation of Mixtral 8x7B in Hugging Face Transformers. You'll notice that in the expert operation there, you iterate over all of the experts and compute each GEMM one by one. We found that this is very inefficient. Instead, we provide an interface to CUTLASS grouped GEMM, which groups the loop over experts into a single GEMM operation.

Given any number of experts and any number of tokens, you can efficiently compute the output in one operation, which is very efficient. That's pretty much all of the optimizations in Megatron Core MOE. If any of you have questions, I can answer those first and then go to MOE upcycling.
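
For intuition, here is an illustrative contrast only, not the actual CUTLASS grouped-GEMM API: the naive path launches one small GEMM per expert, while a grouped GEMM takes the concatenated tokens plus per-expert split sizes and stacked weights and computes everything in a single kernel call:

```python
import torch

d_model, d_ff = 8, 16
tokens_per_expert = [5, 0, 3, 2]                          # uneven token counts per expert
expert_weights = [torch.randn(d_model, d_ff) for _ in tokens_per_expert]
expert_inputs = [torch.randn(n, d_model) for n in tokens_per_expert]

# Naive: one GEMM per expert, poor GPU utilization when experts are small or uneven.
naive_out = [x @ w for x, w in zip(expert_inputs, expert_weights)]

# Grouped idea: hand the kernel the concatenated tokens, the per-expert split sizes,
# and the stacked weights, and let it do all the expert GEMMs in one launch.
all_tokens = torch.cat(expert_inputs, dim=0)              # [sum(tokens), d_model]
splits = torch.tensor(tokens_per_expert)
stacked_w = torch.stack(expert_weights)                   # [num_experts, d_model, d_ff]
# A grouped-GEMM kernel would consume (all_tokens, stacked_w, splits) directly;
# its result should match the concatenation of the per-expert outputs:
reference = torch.cat(naive_out, dim=0)
```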

>> I'm just curious, are these available for everyone to use in a general way, or are they specific to the Megatron implementation? >> You can import them as standalone modules and apply them to any of your-- >> As a core library, right? >> Yeah, you can import Megatron Core as a library and just use it in your network.

But I think there are some caveats. For example, if you want to use expert parallelism, you will also need to use Megatron's parallelism strategy and initialize it first. But if you're using, for example, the grouped GEMM, I think you can get away with any kind of network.

You can combine it with Hugging Face Transformers without any problem. It's just a PyTorch layer with a fused operator. You can just import it as a library; it's very standalone. >> Cool. Do you have any intuition as far as the knowledge contained in each expert? I guess before seeing this paper, I had assumed that maybe one expert was good at economics, another one was good at physics, and so forth.

But this makes it seem like it's more token by token. So do you have any intuition around that? >> Yeah, we haven't studied this in our own research, but I've seen a lot of research on interpretability of MOE models. Unfortunately, people haven't found significant interpretability inside these experts.

Like one expert focused on math and another focused on literature. I think the problem is that neural network hidden states are already very entangled. The hidden states are in a kind of superposition where they can represent multiple different features. So it's very hard to tell which expert focuses on which area.

There is some evidence of specialization in the early layers, for example the first and second layers. You'll find some experts focus on multiple tokens and some focus on single tokens, things like that. There's also one pretty interesting piece of research from Facebook: they train specialized dense models first, then combine those dense experts, each specialized in a different domain, into an MOE.

In that case, it still preserves some of the specialization. >> Thank you. Just a quick question. We talked about top-k selection for choosing experts in the router. Is there any benefit to using a top-p sampling approach, similar to how you would use top-p sampling in decoding, for selecting experts as compared to top-k?

>> Top-p? What do you mean by top-p? >> Considering a list of experts until they exceed a given cumulative probability. Does that make sense? >> I see, yeah. So the number of selected experts would be dynamic, let's say. Sometimes it selects more, sometimes less. >> I think that promotes a little more diversity, I've heard.

>> Yeah, I think that makes sense. It would create some difficulty in optimization. But another thing that's pretty exciting is expert choice. In an expert choice model, the selection is reversed. Everything we've talked about so far is token choice, which means the token selects k experts.

So each token always has, for example, two experts applied to it. Expert choice is the other way around: the experts select tokens, and each expert always selects exactly k tokens. So even though that is fixed, from the token perspective, each token can have zero experts applied to it, or a few, or all of the experts applied to it.
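
A hedged sketch of this expert-choice idea (illustrative shapes and names, not any specific paper's exact gating): each expert picks its top-c tokens, so the per-expert load is always balanced, while any given token may be picked by zero, some, or all experts:

```python
import torch

n_tokens, num_experts, capacity_c = 6, 4, 2
scores = torch.softmax(torch.randn(n_tokens, num_experts), dim=0)  # column e = token affinities for expert e
top_scores, chosen_tokens = scores.topk(capacity_c, dim=0)         # each expert chooses its c tokens

# Count how many experts chose each token: some entries can be 0, others greater than 1.
counts = torch.zeros(n_tokens, dtype=torch.long)
counts.scatter_add_(0, chosen_tokens.flatten(),
                    torch.ones(num_experts * capacity_c, dtype=torch.long))
print(counts)
```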

>> Yeah. In that case, you're either overloading an expert or not considering all the tokens. It's a trade-off. >> Yeah. In fact, expert choice works pretty well for vision models, because vision models don't have a causal mask. >> OK, thank you so much. >> OK. I guess I can go to the next section, upcycling MOEs.

If you're going to remember one thing, remember this: you can upcycle your dense models into a mixture of experts. By training these upcycled models, you can achieve better accuracy than simply training the dense model further for the same number of FLOPs. The context is that we have so many big dense models.

For example, there's Llama 405B, and from NVIDIA we have Nemotron 340B. These models are huge, and it's very expensive to retrain an MOE variant of them from scratch. If we want to further improve these models, we can upcycle them into MOE to achieve better accuracy. In our scaling experiments, we upcycled a 15B model on 1 trillion tokens and achieved roughly a 5% improvement in validation loss and a 4% improvement on MMLU.

It's exciting because the original sparse upcycling paper found it difficult to scale beyond 1 billion parameters. We found there are several key factors to going beyond 1 billion. So this is the concept of upcycling into MOE. Let's say you have the MLP in the original plain vanilla transformer, and here is a mixture of two experts.

One of the experts is activated, so the FLOPs are the same as the original model, but the parameter count has increased. To upcycle such a model, you do two things. First, copy the MLP layer into a number of expert copies; here, just two copies. Then you randomly initialize the router weights.

Then you just train this model. Pretty straightforward, right? There is one caveat here: this model needs to behave the same as the original model on the first forward pass. Otherwise, it will lead to catastrophic forgetting. The trick to maintain this property is to swap the order of the top-k and softmax operators.

This was introduced in Mixtral 8x7B. Let's say you have the MLP copied, and then you do the top-k first to select two experts, and then you do the softmax on top of the logits from the top-k. In this way, because the softmax is applied to the top-k output, the weights always sum up to 1.

Here, for example, 0.7 and 0.3. Then you add the two weighted outputs together. Because the expert copies are exactly the same as the dense model's MLP, the weighted sum equals the dense output, so the model output is the same as the original dense model. This is a very important property in upcycling, because the upcycled MOE model behaves exactly the same as the dense model without any training.
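
A minimal sketch of that equivalence argument, with illustrative shapes: with top-k followed by softmax, the two routing weights sum to 1, and since both experts are exact copies of the dense MLP, the weighted combination reproduces the dense output on the first forward pass:

```python
import torch
import torch.nn.functional as F

d_model, d_ff = 8, 16
dense_mlp = torch.nn.Sequential(torch.nn.Linear(d_model, d_ff), torch.nn.GELU(),
                                torch.nn.Linear(d_ff, d_model))
x = torch.randn(3, d_model)

logits = torch.randn(3, 2)                           # router logits for 2 expert copies
top_vals, top_idx = logits.topk(2, dim=-1)           # top-k first...
weights = F.softmax(top_vals, dim=-1)                # ...then softmax: rows sum to 1 (e.g. 0.7 / 0.3)

dense_out = dense_mlp(x)
# Both "experts" are the same copied MLP, so the weighted combination equals dense_out.
moe_out = weights[:, :1] * dense_mlp(x) + weights[:, 1:] * dense_mlp(x)
print(torch.allclose(moe_out, dense_out, atol=1e-6)) # True
```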

This helps stabilize the model and avoid catastrophic forgetting. But the problem is that we found the Mixtral approach didn't work as well as expected, because the original Switch Transformer from Google uses softmax then top-k for a reason. If you switch to top-k then softmax for upcycling, it actually hurts performance somewhat.

The difference is simply the order of the top-k and softmax. I've already explained how Mixtral does top-k then softmax. In the original Switch Transformer paper, you apply softmax first, so the probabilities over all of the experts sum up to 1. If you then apply top-k, the selected weights no longer sum up to 1.

Here, for example, 0.4 and 0.2. So the output is smaller than the original dense output. If you just train this model naively at large scale, the model catastrophically forgets. But with a smaller-scale model like 1B or under, you can kind of get away with this, because small models adapt very fast.

This is probably one of the reasons the original sparse upcycling paper didn't go beyond 1 billion parameters. We found a very simple way to solve this problem: because the output scale is smaller than the original model's, we can simply scale up the MLP output by the number of experts divided by the top-k.

For example, for a Mixtral 8x7B-style config, we simply scale the MLP layer output by 4x. This solves the scale problem, and the model still behaves the same as the original model, so you can train the upcycled model normally. And we found that this approach consistently outperforms the Mixtral-style approach.
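
A quick worked check of that factor, assuming a near-uniform router at initialization (illustrative only): with softmax over all experts followed by top-k, the selected weights sum to roughly top_k / num_experts, so multiplying the MLP output by num_experts / top_k brings the combined output back to roughly the dense scale.

```python
import torch
import torch.nn.functional as F

num_experts, top_k = 8, 2
logits = 0.01 * torch.randn(1, num_experts)           # near-uniform router at init
probs = F.softmax(logits, dim=-1)                     # sums to 1 over all 8 experts
top_p, top_idx = probs.topk(top_k, dim=-1)            # selected weights sum to ~0.25

scale = num_experts / top_k                           # 8 / 2 = 4, as in the 8x7B top-2 example
print(top_p.sum().item())                             # ~0.25: too small without the fix
print((scale * top_p).sum().item())                   # ~1.0 after rescaling the MLP output
```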

So we get the benefit of the original Switch Transformer routing. A bit of intuition for why softmax then top-k is better: if you apply softmax over all of the experts, the probability distribution is always measured over all of the experts. In the swapped case, the probability distribution is only over two experts.

And which two is dynamic, so it's harder for the model to learn. Additionally, if you only use top-1, the swapped method will not work at all, because the softmax over a single expert is always 1, so there is no gradient for the router to learn from. Next, let's go to fine-grained MOE. This is very popular in the most recent MOEs.

For example, Qwen2 uses 64 experts, and DeepSeek-V2 uses around 160 experts. Granularity means using more experts, but smaller ones. This gives the flexibility of more combinations of different experts, so more representational power. For example, here you originally have two big experts, and one is selected.

Instead, you can expand the number of experts to four, with each expert smaller than before, so the compute is the same and the parameter count is also the same. Some notation here: E2 means the expansion factor is 2, which is how many times the expert is copied. G2 is the granularity, or how many times each expert is segmented.

You can think of it as the expert first being copied into two copies, and then each copy being segmented into two. T2 means how many experts we route to; here, you route to two experts, so the FLOPs are the same as the original. You would soon notice a problem if you upcycle such a model.
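
As a back-of-envelope check of this E/G/T notation (illustrative arithmetic only, using an assumed dense FFN size): total parameters grow by the expansion factor, while the active parameters, and hence FLOPs, stay equal to the dense FFN.

```python
# E = expansion factor, G = granularity (shards per copy), T = experts routed to.
dense_ffn_params = 100e6        # parameters of the original dense FFN, say 100M (assumed)

E, G, T = 2, 2, 2
num_experts = E * G                                   # 4 experts in total
params_per_expert = dense_ffn_params / G              # each expert is a 1/G-sized shard
total_ffn_params = num_experts * params_per_expert    # = E * dense params (2x here)
active_params = T * params_per_expert                 # = dense params, so FLOPs match the dense model

print(num_experts, total_ffn_params / 1e6, active_params / 1e6)  # 4, 200.0, 100.0
```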

These segments are no longer interchangeable, because you've split each expert into two shards. In the top-2 case, if you select one shard twice (once from each copy) and the other shard zero times, the output is no longer the same as the original model. And we mentioned that it's very important for upcycling to maintain the same forward pass as the original dense model.

The solution here is also rather straightforward. Instead of randomly initializing the whole router, we initialize half of it and then duplicate it. This ensures that the probability distribution is the same in each virtual group, or shard group. Because the scores are the same, the top-k selection is consistent: the highest score is the same for the two groups.

The example here is that I shard the MLP into two parts, and the router needs to select the orange shards together and the blue shards together. Duplicating the router weights achieves this. The formula for scaling the weights is also the same as before, except there's an additional granularity factor.
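
Here is one possible way to realize the duplicated-router idea described above. This is only a sketch under my own layout assumptions (experts ordered copy by copy, shard by shard), not necessarily the paper's exact scheme: because sibling shards share identical router rows, they get identical scores, and the top-2 selection always keeps whole shard groups together at initialization.

```python
import torch

d_model, E, G = 8, 2, 2                              # 2 expert copies, each split into 2 shards
base_router = torch.randn(E, d_model)                # one row per original expert copy
router = base_router.repeat_interleave(G, dim=0)     # rows: [copy0_shard0, copy0_shard1, copy1_shard0, copy1_shard1]

x = torch.randn(d_model)
logits = router @ x                                  # sibling shards get identical scores
top_idx = logits.topk(2).indices                     # both shards of the higher-scoring copy are selected together
print(logits, top_idx)                               # indices {0, 1} or {2, 3}, never a mixed pair
```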

This is a bit complicated, so I'll skip this part. For our experiments, we use the 8 trillion tokens that Nemotron was trained on. The ablation study is on a smaller model, Nemotron 2B, and the bigger run is an 8x15B model on 1 trillion tokens. We found that the learning rate is the most important hyperparameter, as always in machine learning.

This is true for upcycling as well: the learning rate is the most important hyperparameter. In the original sparse upcycling paper, the learning rate is taken from the minimum, the ending learning rate from pre-training, so sometimes it can be rather small. We found that a high learning rate helps upcycling a lot.

Here, the orange line is the lowest learning rate; basically, you continue fine-tuning the model into an MOE. This learning rate is typical, for example, for other tasks like alignment. But for upcycling, the model needs to adapt to a new local minimum, and we need a larger learning rate for that.

We found that using the original peak learning rate from pre-training works best. And if you dive into the weights of the upcycled model, you'll find something really interesting. If you apply a constant small learning rate, as in fine-tuning or alignment, the cosine similarity between the base model and the upcycled model is almost 1.

This is true for most aligned models, for example Llama Chat versus Llama base. If you use a high peak learning rate for upcycling, the cosine similarity is much lower, around 0.7. We also analyzed Mixtral 8x7B versus Mistral 7B and found the similarity is also around there, which suggests you need higher learning rates for upcycling.

In an additional experiment on the number of experts, we found 64 experts is kind of the sweet spot; increasing beyond 64 provides diminishing returns. Finally, the large-scale upcycling: we upcycled the 8x15B model on 1 trillion tokens. There are three models here.

Let me explain. The base 15B model is trained on 8 trillion tokens of pre-training data, with a validation loss of 1.6 and MMLU of 59. The continued-training data targets the academic benchmarks more, to obtain higher performance on the evaluations; this is roughly 1 trillion tokens. The upcycling is performed on the same data for comparison.

Continued training is just the dense model trained further. You'll notice that the data actually plays the biggest role: even continued training of the base dense model gives a huge boost on MMLU, around a 20% improvement because of the data. The upcycled model is another 4% to 5% improvement on top of that.

This reminds us, again, that data is the most important thing in ML. If you put the improvement into a scaling-law perspective, we can roughly gauge how much it means. The fine-grained upcycled MOE has the same FLOPs as the dense model, and it has about a 4% improvement in terms of the loss.

If you plug a 4% loss improvement into the scaling law from OpenAI, it roughly represents a 1.7x bigger model. The non-fine-grained model with top-2, which is the same config as Mixtral 8x7B, has increased FLOPs because of top-2, roughly 1.7x more FLOPs, and it's roughly two times as powerful as the original model.
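
As a rough check of that 1.7x figure, here is the arithmetic under the assumption of a Kaplan-et-al.-style power law in parameter count with exponent about 0.076; the talk doesn't state the exact law used, so treat this as illustrative only.

```python
# L ~ N^(-alpha): a 4% lower loss corresponds to the model size N that would
# achieve the same loss under the assumed power law.
alpha = 0.076                                      # assumed parameter-scaling exponent
loss_ratio = 1.04                                  # ~4% lower validation loss
equivalent_model_scale = loss_ratio ** (1 / alpha)
print(round(equivalent_model_scale, 1))            # ~1.7, i.e. like a 1.7x larger dense model
```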

Given that we only spent about 1/8 of the original pre-training compute, this is indeed a saving compared to training these MOEs from scratch. OK, thank you for listening. I put the links there: the paper, the Megatron Core MOE GitHub, and the NeMo GitHub, where we provide the high-level training interface. You can also follow me on LinkedIn and Twitter.

I can take questions now. >> Hey, nice to see you again. There are a whole bunch of questions. I think we'll stop the recording so we can open up for questions. Everyone is very excited by the presentation, and there are a lot of questions. Also, people want your slides.

So let people know where to get the slides. >> Yeah, sure. I can share the slides.